As promised in a previous post, here’s my list of the top 3 knowledge discovery roadblocks, ordered from easiest to hardest to fix:
- Misguided Security
- Old School Tech, and the
- Vagaries of Capitalism
(with numbers 2 and 3 being inseparably linked)
First, though, a definition: What do I mean by knowledge discovery?
I use the term here to mean being able to locate, stumble upon, or otherwise extract useful information from unstructured content (focusing on text files in this post) through a combination of search, text-based navigation, and visualization.
Knowledge discovery is also often called content discovery to distinguish it from discovery based on structured data, like that found in a well-organized database. This latter type of discovery is often called “data visualization” or “visual analytics” (think of self-service visual tools like Tableau or Qlik Sense).
If deployed inside an organization, knowledge discovery is also referred to as enterprise search, though features like dynamic categorization, recommendations and visualization that transform ordinary search into meaningful discovery are often inadequate in enterprise search engines. (Nonetheless, I am addressing enterprise knowledge discovery in this post.)
So much for the definition, but why does knowledge discovery matter?
It matters because every day in every industry, people create enormous volumes of unstructured content (documents, email, reports, etc.), and they also spend a staggering amount of time hunting for information buried in that sea of content, typically recreating what already exists if they can’t find what they need, or forging ahead with their work without benefiting from the knowledge of their predecessors or peers (see prior post).
This is not only an enormous waste of time and money, it’s an impediment to progress, whether that progress is measured by ordinary people being able to do their jobs a little better, a little more safely, or a bit more easily day-by-day, or by enabling big insights that can help industry, government and academia achieve the economic, environmental and social sustainability we all need to thrive today and for generations to come.
And now on with the list…
#1 Misguided Security
“God save us from people who mean well.” ― Vikram Seth, A Suitable Boy
I’ve never heard a client evaluating knowledge discovery software say they wanted a guarantee that document access rights and permissions would be scrupulously IGNORED in the solution. Obviously, the opposite is true, and touting one’s rock-solid, metadata-level security is a de facto part of every knowledge discovery vendor’s dog-and-pony show.
Unfortunately, I’ve also never heard any prospective customer say: “Let’s start by asking ourselves if our system of access rights is sabotaging any hope of ever sharing knowledge within our organization.”
And that’s too bad, because misguided security policies, and the information architectures they are built upon, often doom knowledge discovery projects from the get-go. I’ve lived through this myself.
Like many of my colleagues, whatever my research, writing or analysis task, I was able to carry it to 99% completion on my own simply by cruising through the gems in our ungainly heap.
Then, IT decided to come to our rescue.
They set up work groups and divvied up access rights accordingly. They organized folders. They locked down directories. They appointed gatekeepers.
In short, in a well-intentioned effort to get us organized, IT inadvertently created silos of knowledge where none had previously existed. And my quests for knowledge shifted from self-service to time-sucking cycles of emails, calls and meetings. To make matters worse, the ‘rescue’ coincided with the death of my search engine, and I began hoarding any useful information I was able to get hold of on my desktop for fear that I might never find it again.
Now this may have been a special case given our relatively small size and degree of disorganization, but the moral of the story is the same for organizations everywhere.
For, while there will always be a need for special lock boxes for truly confidential content, and a means for managing the production and distribution of key reference documents, most of the knowledge in an organization will always live in disorderly mountains of Word documents, presentations, database text fields and email messages, not in official, fully-vetted reference documents.
So if you want to ensure a free flow of know-how, inspiration and innovation, first create an information architecture that is as open and accessible as humanly possible.
However, also be aware that whether you have an open or a rigidly siloed information architecture, you can count on people to eventually make a royal mess of whatever you put in place, which leads to Trap #2.
#2 Old School Tech
You can put all the policies and procedures in place you want, but the reality is people will never neatly classify, tag and store what they produce in an orderly and consistent fashion, or at least they won’t do so with the vast majority of content they produce.
People are simply too busy, and they have too many tools at hand that make it easy to create and circulate content in a willy-nilly fashion.
However, most of the advanced technology needed to impose a ‘virtual order’ on that disorderly content is not available in many enterprise discovery products (when such products are used at all). The “why?” is covered in Trap #3, but the what’s-usually-missing is here:
- Advanced text processing
Most search and discovery products have some semantic text processing, or “natural language processing” (NLP) capabilities, meaning they can dynamically categorize, tag and otherwise enhance content to make it more easily discoverable. However, these capabilities are often basic in nature. What’s really needed for a ‘virtual order’ strategy to succeed is either a standalone text analytics/text processing product deployed alongside a search/discovery engine, or a search and discovery engine with standalone-grade NLP built into it.
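To make "dynamically tag content" a little more concrete, here is a minimal sketch of automatic tag suggestion using TF-IDF scoring, one of the simplest text processing techniques. It is only an illustration, not how any particular product works: the function names, the tiny stopword list and the sample documents are all made up for this example, and real NLP pipelines add stemming, entity extraction and much more.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are far longer).
STOPWORDS = {"and", "the", "for", "while"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of 3+ letters."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    return [w for w in words if w not in STOPWORDS]


def suggest_tags(docs, top_n=2):
    """Return the top_n highest-scoring TF-IDF terms for each document."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    n_docs = len(docs)
    tags = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        # Term frequency times inverse document frequency.
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        tags[name] = ranked[:top_n]
    return tags


# Hypothetical sample content standing in for an unstructured document heap.
docs = {
    "memo.txt": "Pump maintenance schedule: pump seals and pump bearings need inspection.",
    "report.txt": "Quarterly sales report: sales grew while the inspection backlog shrank.",
    "email.txt": "Reminder: the schedule for the sales meeting changed.",
}

tags = suggest_tags(docs)
for name, terms in tags.items():
    print(name, terms)
```

Terms that are frequent in one document but rare across the collection float to the top, which is why "pump" leads the tags for the memo. Nobody had to classify anything by hand, which is the whole point.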
- Machine learning
To enhance the results of NLP-based text processing, many text analytics companies today are incorporating machine learning into their products. Machine learning (ML) is a set of techniques rooted in artificial intelligence that enables computers to tackle tasks beyond the reach of their explicit programming. When applied to text processing, ML can enable a computer to mine text for hidden patterns, meanings and correlations and to enrich content with its discoveries, producing results that can surpass conventional NLP.
ML holds great potential for formidable tasks like automatically classifying and clustering large document collections, or producing knowledge graphs of people, ideas and content.
At its most advanced stage, ML for text processing can be performed on wholly raw, unlabeled content (“unsupervised learning,” a mode that some “deep learning” systems also use), or it can use a “training set” of documents annotated by a human being as a launching pad (“supervised learning”).
There is also much promising work on a middle-of-the-road “semi-supervised” approach that addresses some of the challenges of the other approaches (such as the need for truly massive document sets to produce reasonable results with unsupervised learning, or the difficulty of developing or acquiring training sets for supervised learning), but again too few enterprise discovery solutions leverage machine learning in text processing, though standalone text analytics platforms increasingly do.
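For readers who have never seen a "training set" in action, here is a minimal sketch of the supervised approach: a handful of human-labeled documents are summed into one word-count profile per label, and a new document receives the label of the profile it most resembles. The labels, sample texts and function names are hypothetical, and this nearest-centroid technique is a deliberately simple stand-in for the far more capable models used in real text analytics products.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count the lowercase alphabetic words in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_docs):
    """Sum the word counts of all training documents for each label."""
    centroids = {}
    for text, label in labeled_docs:
        centroids.setdefault(label, Counter()).update(bag_of_words(text))
    return centroids


def classify(text, centroids):
    """Return the label whose word profile is most similar to the text."""
    vector = bag_of_words(text)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# The human-annotated "training set" (hypothetical examples).
training_set = [
    ("invoice payment overdue budget", "finance"),
    ("quarterly budget forecast and invoice totals", "finance"),
    ("pump valve inspection and maintenance", "engineering"),
    ("valve pressure test and pump repair", "engineering"),
]

centroids = train(training_set)
print(classify("please approve the invoice for the new budget", centroids))
print(classify("the pump needs a new valve", centroids))
```

The hard part in practice is not this code but the annotated examples themselves, which is exactly the training-set difficulty mentioned above.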
A second use case for machine learning is continual improvement of the discovery experience through the analysis of user behavior. The goal in this case is to identify and learn from patterns in user behavior, rather than in the content alone, to continually boost the relevance of results.
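As a rough illustration of learning from user behavior, the sketch below nudges search results upward when users have clicked them for the same query before. Everything here is an assumption made for the example: the class name, the additive `click_weight` boost and the sample scores are invented, and production systems use far more careful models that also correct for things like position bias.

```python
from collections import defaultdict


class ClickFeedbackRanker:
    """Re-rank search results using past clicks for the same query."""

    def __init__(self, click_weight=0.1):
        self.click_weight = click_weight  # boost added per recorded click
        self.clicks = defaultdict(int)    # (query, doc_id) -> click count

    def record_click(self, query, doc_id):
        """Remember that a user opened doc_id after searching for query."""
        self.clicks[(query.lower(), doc_id)] += 1

    def rerank(self, query, scored_results):
        """Sort (doc_id, base_score) pairs by base score plus click boost."""
        key_query = query.lower()

        def boosted(item):
            doc_id, base_score = item
            past_clicks = self.clicks.get((key_query, doc_id), 0)
            return base_score + self.click_weight * past_clicks

        return sorted(scored_results, key=boosted, reverse=True)


# Hypothetical base scores, as if returned by a search engine.
results = [("old-policy.doc", 0.9), ("new-policy.doc", 0.8)]

ranker = ClickFeedbackRanker()
for _ in range(3):
    ranker.record_click("travel policy", "new-policy.doc")

reranked = ranker.rerank("Travel Policy", results)
print(reranked)
```

After three clicks, the document users actually wanted overtakes the one the engine originally ranked first, while rankings for other queries stay untouched.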
- Connectors
This is not so much “advanced” technology as it is “tricky and time-consuming” technology, but regardless, a great suite of connectors is essential to great discovery. Connectors are in essence bridges that let a discovery engine easily connect to and index content from diverse sources. As forces like cloud and mobile swell the number of content repositories an organization must connect to in order to support global knowledge discovery, smart, reliable, and easy-to-use connectors are more critical than ever, but too many essential connectors are missing, or prohibitively expensive, in discovery products.
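The "bridge" idea can be sketched in a few lines: each connector hides the quirks of one repository behind the same small interface, so the indexing code never needs to know where a document came from. The class and method names below are hypothetical and not taken from any real product, and the in-memory connector merely stands in for a mail system or cloud store; real connectors also have to handle authentication, access rights, incremental updates and failures, which is where the "tricky and time-consuming" part comes in.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Iterable, Iterator, Tuple


@dataclass
class Document:
    source: str   # which repository the document came from
    doc_id: str   # identifier that is unique within that repository
    text: str     # extracted plain text


class Connector(ABC):
    """Common contract: every repository yields Documents the same way."""

    @abstractmethod
    def fetch(self) -> Iterator[Document]:
        ...


class FolderConnector(Connector):
    """Reads .txt files from a directory tree."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def fetch(self) -> Iterator[Document]:
        for path in sorted(self.root.rglob("*.txt")):
            doc_id = path.relative_to(self.root).as_posix()
            yield Document("folder", doc_id, path.read_text(encoding="utf-8"))


class InMemoryConnector(Connector):
    """Stand-in for a remote repository such as a mail or cloud system."""

    def __init__(self, name: str, records: Dict[str, str]) -> None:
        self.name = name
        self.records = records

    def fetch(self) -> Iterator[Document]:
        for doc_id, text in self.records.items():
            yield Document(self.name, doc_id, text)


def build_index(connectors: Iterable[Connector]) -> Dict[Tuple[str, str], str]:
    """Pull documents from every connector into one combined index."""
    index = {}
    for connector in connectors:
        for doc in connector.fetch():
            index[(doc.source, doc.doc_id)] = doc.text
    return index


# Demo with a temporary folder and a fake mail repository.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "notes.txt").write_text("pump inspection notes", encoding="utf-8")
    index = build_index([
        FolderConnector(tmp),
        InMemoryConnector("mail", {"msg-1": "budget approved"}),
    ])

print(sorted(index))
```

Adding a new repository then means writing one more `Connector` subclass, with no change to the indexing code.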
- Seamless access to external content
I think everyone is in general agreement that in this day and age, ‘external’ information is as essential to an organization as internal information. Fortunately, there are good tools available for extracting knowledge from social media, from web pages, from subscription content, and from enterprise content, but not enough tools that provide an elegant way to mix and match such content according to the needs of a particular user or group of users, or that provide a simple, intuitive way to navigate through aggregate content.
Also often missing is a good research-centric workflow, with features such as the construction of virtual dossiers or libraries, collaboration, and workflow audit trails. That’s not to say that there aren’t any wonderful enterprise knowledge discovery tools out there that combine elements of enterprise search, advanced text analytics, external content integration and a research-centric workflow, just that there aren’t enough of them for cross-domain/cross-disciplinary discovery, or for special needs not addressed by an existing vertical solution, which brings us to Trap #3.
#3 The Vagaries of Capitalism
Many of the more mature enterprise knowledge discovery tools today have become highly verticalized by industry or profession. Some have done so because it made sense in terms of adapting content and workflow to specific domain needs, but many have done so because it also makes sales and marketing easier.
There are also plenty of interesting start-ups that began as general (horizontal) enterprise knowledge discovery solutions, but came under pressure to verticalize when venture capital funds started flowing (acquisition is a goal for many, and well-heeled buyers like niches that round out their portfolios). And for many, the vertical of choice is overwhelmingly in the realm of sales and marketing (follow the money).
While the modus operandi of free enterprise may slant the field toward trade or industry verticalization, and toward sales and marketing-oriented solutions, some solutions remain that address horizontal, or simultaneously horizontal and vertical, markets.
In that context, it will be interesting to see how the following solutions, roughly grouped by original product DNA, evolve over time, if they evolve (each class is well-suited as-is to various use cases).
Products with an enterprise search DNA
These include acquired or native solutions from vendors like HP, Dassault Systèmes, IBM, Lexmark, Google, Coveo, Mindbreeze, Sinequa, Attivio, Lucidworks, etc., that may have begun as enterprise search products but have evolved along an enterprise discovery path. They are well-positioned to deliver such discovery, but vary in their use of advanced technology like machine learning and data visualization, in research workflow support, and in their handling of third-party content (beyond Web crawling and RSS).
Products with a text analytics DNA
Many text analytics products today can be used in tandem with third-party search and discovery software, but others increasingly integrate search and discovery features that make them interesting for standalone use in certain enterprise discovery use cases, including products from vendors like Lexalytics, Temis, Smartlogic, and Expert System.
Products with an academic research DNA
I’ve often looked at academic research portal products like those from Ebsco, ProQuest and Thomson Reuters thinking that if they had solid enterprise connectivity, they would make excellent enterprise discovery products. Vendors such as these do target corporate users, but it seems only to enable discovery for external content (mainly subscription-based).
Products with a professional research DNA
Like their academic portal cousins, many publishers and/or content aggregators (Elsevier, Thomson Reuters, Springer, Wiley, LexisNexis, etc.) offer excellent search and discovery solutions with a solid research workflow, though they tend to be verticalized, and likewise geared to subscription content.
Products with a hybrid DNA
This class of products shares much in common with products with an enterprise search DNA, though they seem to have been designed from conception with more of a research workflow in mind, and a more even approach to handling internal and external content (particularly paid content). Representative products here include IBM Watson, Palantir Gotham, Content Analyst CAAT, Brainspace for Enterprise, and IHS Goldfire. Some of these are straddling vertical and horizontal markets, and others are fairly verticalized now, or may become so in the future, and not all are equal in terms of support for subscription content.
Products with a Product Lifecycle Management (PLM) DNA
There is also another interesting, emerging class of product: cross-disciplinary innovation platforms from vendors with a PLM DNA, like the 3DEXPERIENCE Platform from my former employer Dassault Systèmes.
As forces like the Internet of Things transform products from physical objects to connected systems which integrate content, software and communications, and which require the collaborative input of all business units to develop and support, PLM systems are evolving in lockstep. It will be interesting to see how the 3DEXPERIENCE Platform and platforms from other vendors like Siemens, PTC, Autodesk, and Aras evolve, that is to say, whether and how they might come to be used for both horizontal and vertical enterprise discovery needs in both product and service-based organizations.
Finally, there is a whole constellation of open source software products from the likes of the Apache Software Foundation and companies like Elastic that can be integrated to create an enterprise discovery solution that comprises just the right mix of enterprise search, advanced text analytics, external content integration, data visualization and a research-centric workflow, but few organizations have the skills and IT depth to succeed with a project like that on their own.
In sum, enterprise knowledge discovery on unstructured content is essential to the free flow of know-how, inspiration and innovation that organizations, and our species, need in order to survive and thrive over the long haul, but selecting and successfully deploying the right solution for your organization is not an easy task. So if you have a success story to share, I’d love to hear about it!