

3 Biggest Knowledge Discovery Traps

As promised in a previous post, here’s my list of the top 3 knowledge discovery roadblocks, ordered from easiest to hardest to fix:

  1. Misguided Security
  2. Old School Tech, and the
  3. Vagaries of Capitalism 

(with numbers 2 and 3 being inseparably linked)

First, though, a definition: What do I mean by knowledge discovery?

I use the term here to mean being able to locate, stumble upon, or otherwise extract useful information from unstructured content (focusing on text files in this post) through a combination of search, text-based navigation, and visualization.

Knowledge discovery is also often called content discovery to distinguish it from discovery based on structured data, like that found in a well-organized database. This latter type of discovery is often called “data visualization” or “visual analytics” (think of self-service visual tools like Tableau or Qlik Sense).

If deployed inside an organization, knowledge discovery is also referred to as enterprise search, though features like dynamic categorization, recommendations and visualization that transform ordinary search into meaningful discovery are often inadequate in enterprise search engines. (Nonetheless, I am addressing enterprise knowledge discovery in this post.)

So much for the definition, but why does knowledge discovery matter?

It matters because every day in every industry, people create enormous volumes of unstructured content (documents, email, reports, etc.), and they also spend a staggering amount of time hunting for information buried in that sea of content, typically recreating what already exists if they can’t find what they need, or forging ahead with their work without benefiting from the knowledge of their predecessors or peers  (see prior post).

This is not only an enormous waste of time and money, it’s an impediment to progress, whether that progress is measured by ordinary people being able to do their jobs a little better, a little more safely, or a bit more easily day-by-day, or by enabling big insights that can help industry, government and academia achieve the economic, environmental and social sustainability we all need to thrive today and for generations to come.

And now on with the list…

#1 Misguided Security

“God save us from people who mean well.” – Vikram Seth, A Suitable Boy

I’ve never heard a client evaluating knowledge discovery software say they wanted a guarantee that document access rights and permissions would be scrupulously IGNORED in the solution. Obviously, the opposite is true, and touting one’s rock-solid, metadata-level security is a de facto part of every knowledge discovery vendor’s dog-and-pony show.

Unfortunately, I’ve also never heard any prospective customer say: “Let’s start by asking ourselves if our system of access rights is sabotaging any hope of ever sharing knowledge within our organization.”

And that’s too bad, because misguided security policies, and the information architectures they are built upon, often doom knowledge discovery projects from the get-go. I’ve lived through this myself.


Photo courtesy of ButterflySha (CC Attribution)

I was once with a start-up with a gigantic, messy file server into which all groups – from R&D to Marketing to Sales to Consulting to Customer Support – unceremoniously dumped their stuff, and everything was accessible to all. Against this mess, I imposed virtual organization through a clever (but sadly now defunct) desktop search engine.

Like many of my colleagues, whatever my research, writing or analysis task, I was able to carry it to 99% completion on my own simply by cruising through the gems in our ungainly heap.

Then, IT decided to come to our rescue.

They set up work groups and divvied up access rights accordingly. They organized folders. They locked down directories. They appointed gatekeepers.

In short, in a well-intentioned effort to get us organized, IT inadvertently created silos of knowledge where none had previously existed. And my quests for knowledge shifted from self-service to time-sucking cycles of emails, calls and meetings. To make matters worse, the ‘rescue’ coincided with the death of my search engine, and I began hoarding any useful information I was able to get hold of on my desktop for fear that I might never find it again.

Now this may have been a special case given our relatively small size and degree of disorganization, but the moral of the story is the same for organizations everywhere.

For, while there will always be a need for special lock boxes for truly confidential content, and a means for managing the production and distribution of key reference documents, most of the knowledge in an organization will always live in disorderly mountains of Word documents, presentations, database text fields and email messages, not in official, fully-vetted reference documents.

So if you want to ensure a free flow of know-how, inspiration and innovation, first create an information architecture that is as open and accessible as humanly possible.

However, also be aware that whether you have an open or rigidly siloed information architecture, you can count on people to eventually make a royal mess of whatever you put in place, which leads to Trap #2.

#2 Old School Tech

You can put all the policies and procedures in place you want, but the reality is people will never neatly classify, tag and store what they produce in an orderly and consistent fashion, or at least they won’t do so with the vast majority of content they produce.

People are simply too busy, and they have too many tools at hand that make it easy to create and circulate content in a willy-nilly fashion.

Wild cats

Stop herding wild cats. Photo courtesy of Sara Golemon (CC ShareALike)

In the end, it’s more pragmatic to go with the flow to the greatest extent possible: impose some order where you can and must, but otherwise stop herding and let wild cats be wild cats, and leverage top-notch text analytics, search and discovery technologies to impose dynamic virtual order.

However, most of the advanced technology needed to do this is not available in many enterprise discovery products (when such products are used at all). The “why?” is covered in Trap #3, but the what’s-usually-missing is here:

  • Advanced text processing
Most search and discovery products have some semantic text processing, or “natural language processing” (NLP) capabilities, meaning they can dynamically categorize, tag and otherwise enhance content to make it more easily discoverable. However, these capabilities are often basic in nature. What’s really needed for a ‘virtual order’ strategy to succeed is either a standalone text analytics/text processing product deployed alongside a search/discovery engine, or a search and discovery engine with standalone-grade NLP built into it.
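To make the idea of dynamic categorization concrete, here is a deliberately tiny sketch in Python. The taxonomy and the bare keyword-matching approach are invented for illustration; a real text analytics engine would use far richer linguistic analysis than this:

```python
import re

# Hypothetical taxonomy: category -> trigger terms. Invented for illustration;
# a real product derives categories from linguistic and statistical analysis.
TAXONOMY = {
    "safety": {"hazard", "incident", "injury"},
    "maintenance": {"repair", "overhaul", "inspection"},
}

def categorize(text: str) -> list[str]:
    """Tag a document with every category whose trigger terms appear in it."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(cat for cat, terms in TAXONOMY.items() if tokens & terms)

print(categorize("Pump repair delayed after a minor injury incident."))
# → ['maintenance', 'safety']
```

Even this crude tagging hints at how content can be made discoverable without asking authors to classify anything themselves.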
  • Machine learning

To enhance the results of NLP-based text processing, many text analytics companies today are incorporating machine learning into their products. Machine learning (ML) is a set of techniques rooted in artificial intelligence that enables computers to tackle tasks beyond the reach of their explicit programming. When applied to text processing, ML can enable a computer to mine text for hidden patterns, meanings and correlations and to enrich content with its discoveries, producing results that can surpass conventional NLP.

Artificial Intelligence

Photo courtesy of theglobalpanorama(CC ShareALike)

ML holds great potential for formidable tasks like automatically classifying and clustering large document collections, or producing knowledge graphs of people, ideas and content.

At its most advanced, ML for text processing can be performed on wholly raw content (“unsupervised learning,” today often powered by deep learning), or it can use a “training set” of documents annotated by a human being as a launching pad (“supervised learning”).
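To illustrate what a human-annotated training set buys you, here is a toy supervised classifier in Python. The documents, labels, and word-counting “model” are all invented, and are orders of magnitude simpler than any real ML approach:

```python
from collections import Counter

# Toy "training set": documents annotated by a human, as in supervised learning.
training = [
    ("turbine blade fatigue crack", "engineering"),
    ("quarterly revenue forecast", "finance"),
    ("stress test of the rotor assembly", "engineering"),
    ("budget variance report", "finance"),
]

def train(examples):
    """Count word frequencies per label -- a bare-bones stand-in for a real model."""
    model = {}
    for text, label in examples:
        model.setdefault(label, Counter()).update(text.split())
    return model

def classify(model, text):
    """Score each label by how many training-set words the new text shares with it."""
    words = text.split()
    return max(model, key=lambda lbl: sum(model[lbl][w] for w in words))

model = train(training)
print(classify(model, "rotor blade stress report"))  # → engineering
```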

There is also much promising work on a middle-of-the-road “semi-supervised” approach that addresses some of the challenges of the other two (such as the need for truly massive document sets to produce reasonable results with unsupervised learning, or the difficulty of developing or acquiring training sets for supervised learning). But again, too few enterprise discovery solutions leverage machine learning in text processing, though standalone text analytics platforms increasingly do.

A second use case for machine learning is continual improvement of the discovery experience through the analysis of user behavior. The goal in this case is to identify and learn from patterns in user behavior, rather than in the content alone, to continually boost the relevance of results. 
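As a sketch of that second use case, here is a minimal Python example that blends a search engine’s base relevance score with a boost learned from click logs. The 0.1 weight and all the data are invented for illustration, not a recommended tuning:

```python
from collections import defaultdict

# Click counts per (query, document) pair, gathered from search logs.
clicks = defaultdict(int)

def record_click(query: str, doc_id: str) -> None:
    clicks[(query, doc_id)] += 1

def rerank(query: str, scored_docs):
    """Blend the engine's base score with a behavioral boost.
    The 0.1 weight is an arbitrary illustration."""
    return sorted(
        scored_docs,
        key=lambda d: d[1] + 0.1 * clicks[(query, d[0])],
        reverse=True,
    )

record_click("pump seals", "doc-B")
record_click("pump seals", "doc-B")
print(rerank("pump seals", [("doc-A", 1.0), ("doc-B", 0.9)]))
# → [('doc-B', 0.9), ('doc-A', 1.0)]  -- doc-B overtakes doc-A on click evidence
```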

  • Great connectors

This is not so much “advanced” technology as it is a “tricky and time-consuming” technology, but regardless, a great suite of connectors is essential to great discovery. Connectors are in essence bridges that let a discovery engine easily connect to and index content from diverse sources. As forces like Cloud and mobile swell the number of content repositories an organization must connect to support global knowledge discovery, smart, reliable, and easy-to-use connectors are more critical than ever, but too many essential connectors are missing, or prohibitively expensive, in discovery products.
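For illustration, here is roughly what the contract between a discovery engine and its connectors might look like, sketched in Python. The class names and the (id, text, metadata) shape are my assumptions, not any vendor’s actual API:

```python
from abc import ABC, abstractmethod
from typing import Iterator

class Connector(ABC):
    """Minimal shape of a repository connector: enumerate documents
    and hand each one to the indexer as (id, text, metadata)."""

    @abstractmethod
    def fetch(self) -> Iterator[tuple[str, str, dict]]:
        ...

class InMemoryConnector(Connector):
    """Stand-in for a file-share, CMS, or cloud-repository connector."""
    def __init__(self, docs: dict[str, str]):
        self.docs = docs

    def fetch(self):
        for doc_id, text in self.docs.items():
            yield doc_id, text, {"source": "memory"}

# The indexer only sees the common interface, whatever the source.
index = {doc_id: text for doc_id, text, _ in
         InMemoryConnector({"a": "hello"}).fetch()}
print(index)  # → {'a': 'hello'}
```

The hard part in practice is not this interface but each repository’s authentication, permissions, incremental updates, and format quirks, which is why good connectors are so tricky and time-consuming to build.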
  • Seamless access to external content

I think everyone is in general agreement that in this day and age, ‘external’ information is as essential to an organization as internal information. Fortunately, there are good tools available for extracting knowledge from social media, from web pages, from subscription content, and from enterprise content, but not enough tools that provide an elegant way to mix and match such content according to the needs of a particular user or group of users, or to provide a simple, intuitive way to navigate through the aggregated content.

Also often missing is a good research-centric workflow, with features such as the construction of virtual dossiers or libraries, collaboration, and workflow audit trails. That’s not to say that there aren’t any wonderful enterprise knowledge discovery tools out there that combine elements of enterprise search, advanced text analytics, external content integration and a research-centric workflow, just that there aren’t enough of them for cross-domain/cross-disciplinary discovery, or for special needs not addressed by an existing vertical solution, which brings us to Trap #3.

#3 The Vagaries of Capitalism

Many of the more mature enterprise knowledge discovery tools today have become highly verticalized by industry or profession. Some have done so because it made sense in terms of adapting content and workflow to specific domain needs, but many have done so because it also makes sales and marketing easier.

There are also plenty of interesting start-ups that began as general (horizontal) enterprise knowledge discovery solutions, but came under pressure to verticalize when venture capital funds started flowing (acquisition is a goal for many, and well-heeled buyers like niches that round out their portfolios). And for many, the vertical of choice is overwhelmingly in the realm of sales and marketing (follow the money).

While the modus operandi of free enterprise may slant the field toward trade or industry verticalization, and toward sales and marketing-oriented solutions, some solutions remain that address horizontal, or simultaneously horizontal and vertical, markets.

In that context, it will be interesting to see how the following solutions, roughly grouped by original product DNA, evolve over time, if they evolve (each class is well-suited as-is to various use cases).

Products with an enterprise search DNA

These include acquired or native solutions from vendors like HP, Dassault Systemes, IBM, Lexmark, Google, Coveo, Mindbreeze, Sinequa, Attivio, Lucidworks, etc., that may have begun as enterprise search products but have evolved along an enterprise discovery path. They are well-positioned to deliver such discovery, but vary in their use of advanced technology like machine learning and data visualization, in research workflow support, and in their handling of third party content (beyond Web crawling and RSS).

Products with a text analytics DNA

Many text analytics products today can be used in tandem with third party search and discovery software, but others increasingly integrate search and discovery features that make them interesting for standalone use for certain enterprise discovery use cases, including products from vendors like Lexalytics, Temis, Smartlogic, and Expert System.

Products with an academic research DNA

I’ve often looked at academic research portal products like those from Ebsco, ProQuest and Thomson Reuters thinking that if they had solid enterprise connectivity, they would make excellent enterprise discovery products. Vendors such as these do target corporate users, but it seems only to enable discovery for external content (mainly subscription-based).

Products with a professional research DNA

Like their academic portal cousins, many publishers and/or content aggregators (Elsevier, Thomson Reuters, Springer, Wiley, LexisNexis, etc.) offer excellent search and discovery solutions with a solid research workflow, though they tend to be verticalized, and likewise geared to subscription content.

Products with a hybrid DNA

This class of products shares much in common with products with an enterprise search DNA, though they seem to have been designed from conception with more of a research workflow in mind, and a more even approach to handling internal and external content (particularly paid content). Representative products here include IBM Watson, Palantir Gotham, Content Analyst CAAT, Brainspace for Enterprise, and IHS Goldfire. Some of these are straddling vertical and horizontal markets, and others are fairly verticalized now, or may become so in the future, and not all are equal in terms of support for subscription content.

Products with a Product Lifecycle Management (PLM) DNA

There is also another interesting, emerging class of product: cross-disciplinary innovation platforms from vendors with a PLM DNA, like the 3DEXPERIENCE Platform from my former employer Dassault Systèmes.

As forces like the Internet of Things transform products from physical objects into connected systems that integrate content, software and communications, and that require the collaborative input of all business units to develop and support, PLM systems are evolving in lockstep. It will be interesting to see how the 3DEXPERIENCE Platform and platforms from other vendors like Siemens, PTC, Autodesk, and Aras evolve, that is to say how and if they might come to be used for both horizontal and vertical enterprise discovery needs in both product and service-based organizations.

Finally, there is a whole constellation of open source software products from the likes of the Apache Foundation and companies like Elastic that can be integrated to create an enterprise discovery solution that comprises just the right mix of enterprise search, advanced text analytics, external content integration, data visualization and a research-centric workflow, but few organizations have the skills and IT depth to succeed with a project like that on their own.

In sum, enterprise knowledge discovery on unstructured content is essential to the free flow of know-how, inspiration and innovation that organizations, and our species, need in order to survive and thrive over the long haul, but selecting and successfully deploying the right solution for your organization is not an easy task. So if you have a success story to share, I’d love to hear about it!


Deep Learning & NLP Meet by the Bay

Text By the Bay, a new conference for Natural Language Processing (NLP) academics & practitioners, will take place this Friday and Saturday in San Francisco, CA (4/24 & 4/25). I wish I could be there. Not all of the speakers are from organizations working in domains of top interest to me, but many are presenting interesting fundamentals that could be applied in intriguing ways.

Would like to catch these in particular:

At the intersection of Deep Learning & NLP:

  • “Practical NLP Applications of Deep Learning,” Samiur Rahman, @samiur1204 
  • “Deep Learning for Natural Language Processing,” Richard Socher, @RichardSocher
  • “Unsupervised NLP Tutorial using Apache Spark,” Marek Kolodziej, @marekinfo

Knowledge maps for Content Discovery:

  • Talks by the same title by Oren Schaedel & Seth Redmore, @sredmore

Also Interesting:

  • “Using Big Data to Identify the World’s Top Experts,” Nima Sarshar, @nimilinimo
  • “ML Scoring: Where Machine Learning Meets Search,” Joaquin Delgado, @joaquind

Hopefully some of the talks will be shared after the event… If anyone I know goes and catches one of the sessions above, I’d love a recap :-)

Photo © Mish Sukharev, Creative Commons Attribution 2.0 Generic license

Knowledge Discovery Driver: “The Old Guys Are Retiring!”

A couple of years ago this week, I gave a talk at an aerospace industry gathering on open source intelligence, web mining, and the changing nature of national security risk. To get conversation flowing, I asked audience members to share what they personally considered to be the greatest security threat facing us today.

As the informal poll progressed, geopolitical destabilization due to climate change was in the lead as the number one threat, followed by a linked phenomenon, natural resource scarcity.

Then a young man (one of only a handful of young people in the room) raised his hand. His response? “The old guys are all retiring! And everything they know is going with them. We’re toast!” That’s my paraphrase (the session was in French), but that’s the gist of what he said, and there were enthusiastic nods of agreement from the other young people in the room.

Part of this breakdown in knowledge transfer may be due to a general reliance on sophisticated systems that encapsulate and automate knowledge at the expense of hands-on know-how (I’m thinking of Airbus’s recent call for a global overhaul of pilot training to improve manual-flying skills that have dangerously eroded due to aircraft automation).

But I am also thinking about the perennial challenge of transferring the kind of knowledge that is encapsulated in messy ‘unstructured content’ – that is to say content like reports, presentations, email messages, and field notes.

A couple of weeks ago, I was speaking to a government agency in DC that voiced a similar frustration over the transfer (or lack thereof) of knowledge embedded in this kind of eclectic content. Unlike the young man at the aerospace gathering, retirement per se wasn’t as much of an issue as were technical challenges (they were making a huge effort to support the transfer of their knowledge to a broad community), but the frustration was just as deep. And, it’s a frustration I’ve heard a thousand times over the past decade.

Which raises the question: why is it so hard? Given the wide availability of search engines, content management systems, text analytics software, data visualization tools and other such knowledge discovery technology, why is it still so difficult for everyone to tap into the knowledge that surrounds them, even inside their own organization? That’s the focus of my next post, which I think I’ll call “The 3 Biggest Knowledge Discovery Traps.” In the meantime, if you have a top three list of your own, please share it :-). It’s a problem we all need to address if we’re going to tap into our collective wisdom to meet the challenges that lie ahead.


Climate Corporation, Monsanto and the Raison d’Etre of Data Science

As I sit enjoying the light but steady Spring rainfall that is transforming my backyard into what feels to me like a Garden of Eden, a world away in northern India, dozens of debt-stricken farmers have taken their lives in response to devastating crop losses wrought by unseasonal rain, hailstorms and high winds, while hospitals seek to help many more struggling with deep depression in the face of insurmountable debt.

India Farmers Woes

April 14, 2015: an Indian farmer from Uttar Pradesh state shows wheat damaged by unseasonal rain. (AP Photo/ Rajesh Kumar Singh).

As Vinod Kumar, an Uttar Pradesh farmer, laments in Biswajeet Banerjee’s AP article, “Normally this time of the year, we are a happy lot. Our granary is full and we clear all our dues by selling our produce. This year we lost everything. We are left with nothing. Neither food for us nor fodder for animals.”

I came across the article by chance while doing some research on The Climate Corporation. The company has always struck me as something of a poster child for empowering ordinary people to take advantage of data science to address big challenges.

The initial challenge Climate Corporation tackled was the potentially disastrous impact of increasingly frequent extreme weather events on businesses, coming to focus over time on farmers exclusively.

They developed a platform to crunch through trillions of weather simulation data points across hundreds of terabytes of data to develop crop insurance products that would provide farmers with greater protection against losses due to extreme weather. Essentially, the policies serve as sort of ‘gap’ insurance for losses not covered by government crop insurance, with payments sent automatically when specified weather conditions occur: a marked contrast to the painfully long waits for payments tied to government inspections and damage assessments.
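As a rough illustration of how such a parametric (‘gap’) policy might work, here is a toy Python function that pays out automatically once an observed weather index crosses a contractual threshold. Every number here is invented for illustration; this is not The Climate Corporation’s actual product logic:

```python
def parametric_payout(rainfall_mm: float, threshold_mm: float = 150.0,
                      per_mm: float = 40.0, cap: float = 5000.0) -> float:
    """Pay automatically once observed rainfall exceeds the contract threshold.
    Thresholds, rates, and cap are invented for illustration only."""
    excess = max(0.0, rainfall_mm - threshold_mm)
    return min(cap, excess * per_mm)

print(parametric_payout(120.0))  # → 0.0 (below threshold, no payout)
print(parametric_payout(210.0))  # → 2400.0 (paid without any damage inspection)
```

The key property is that the payout depends only on an objectively measured index, which is what lets payments go out immediately rather than after a government inspection.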

This is the kind of buffer that could have made a difference in the lives of northern Indian farmers like Vinod Kumar, and Mohammad Sabir of Wazirpur village in Uttar Pradesh, who was so devastated by the sudden loss of his entire wheat crop that he hanged himself from a mango tree on his farm earlier this month.

However, one can’t help but wonder whether there are any changes to our complex systems of public and private finance that could obviate the need for farmers like Kumar and Sabir to enmesh themselves in cycles of extreme debt to bring food to our tables and theirs each season. Or whether there are any changes to modern agricultural practices that could ameliorate the cycle of debt as well. Even broader, are there any economic or agricultural practices that are actually contributing to the escalation in extreme weather events that wreak such havoc in the first place? 

Such questions are beyond the scope of The Climate Corporation, as CEO David Friedberg states. As he told John Roach, his company’s job is to “identify trends in climate data and use them to help us predict what is going to happen in the future.” On broader subjects like whether the increase in weather volatility is due to human-caused climate change (much less what role economic or agricultural systems might play in that change), he says the company doesn’t have an opinion.

Fair enough I say.

But now the company has expanded beyond insurance to deliver analytic software for farmers designed to “deliver field-level insights powered by data science.” This software crosses the line from predictive analytics to prescriptive analytics, in other words, not just telling farmers what is likely to happen, but recommending what they should do in response. For instance, the Nitrogen Advisor feature in the Climate Pro app uses predictive models that factor in nitrogen applications to date, crop stage and weather data to recommend how much nitrogen a farmer should apply, and when, to achieve target yields.
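To make the predictive-versus-prescriptive distinction concrete, here is a deliberately naive Python sketch of a nitrogen recommendation. The 1.1 lbs-of-nitrogen-per-bushel rule of thumb and the simple subtraction are illustrative assumptions only; the real Nitrogen Advisor models weather, soil and crop stage:

```python
def nitrogen_recommendation(target_yield_bu: float,
                            applied_lbs: float,
                            lbs_per_bushel: float = 1.1) -> float:
    """Recommend additional nitrogen as (estimated need - already applied).
    The 1.1 lbs-N-per-bushel factor is a hypothetical rule of thumb,
    not Climate Pro's actual model."""
    need = target_yield_bu * lbs_per_bushel
    return max(0.0, need - applied_lbs)

# Prescriptive output: not "what will happen," but "what to do next."
print(round(nitrogen_recommendation(target_yield_bu=200, applied_lbs=150), 1))
```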

In short, the app helps farmers – at least those following modern industrial practices – do as they have always done, but with a new level of precision.

I can’t argue with Mike Stern, the Monsanto executive who was named President and COO of The Climate Corporation following Monsanto’s acquisition of The Climate Corporation, when he states in his blog post “Data Science: The Next Revolution in Sustainable Agriculture,” that the people at The Climate Corporation believe that “data and the application of data science can help farmers make more informed decisions about their operations that can increase crop yields and help them use resources more efficiently.”


I can’t argue either with the stark facts in the infographic featured in the post: more people + fixed amount of arable land = a need to do more with less, with a potentially valuable role for data science in meeting this need.

But as I looked at the graphic, I was reminded of a postage stamp-sized patch of land I saw behind a chateau on the outskirts of Paris. It was teeming with a crazy variety of fruits and vegetables all huddled together on peculiar little mounds but thriving marvelously, producing enough food for the inhabitants of the chateau and all who pass through it, without chemical fertilizers or herbicides or genetically engineered seeds (permaculture they called it).

I think about that tiny plot, and farmers like Kumar and Sabir, and I wonder whether sometimes data science might be best applied not just to helping us do what we’re already doing more efficiently, but to doing things differently, even if that ‘differently’ is decidedly low-tech, or even no-tech.


A Naively Roundup UnReady Suburbanite

I have a lingering fondness for the scruffy olive and khaki coastal desert landscape around the Southern California city I called home for seven years, as I do for the cold but colorful alternate universe I discovered there when I took up scuba diving.  But I never really felt ‘at home’ in California. I was raised in North Carolina, and I missed the rhythm of the region’s four very distinct and evenly paced seasons.

When I overshot the Atlantic heading back East, I found myself in an equally beautiful country, France, but in a city that really only has two seasons, Spring and Fall, with transitions from one to the other happening seemingly overnight.

As a result of this seasonal deprivation and an adult lifetime spent living in cities, I have really been reveling in the long-lost pleasure of watching a lush Southern Spring slowly unfold. When I got a call from a neighborhood friend who is a landscaper reminding me that it was time to get my lawn ready for Spring, I was a little baffled: it looked to me like Nature had this whole Spring thing well in hand.

And so I did nothing.

Soon my lawn was covered in a dozen different shades of green, and dotted with small blue, white, yellow and purple flowers. My kids and I thought it looked very pretty, but it became clear very quickly that when given a free hand, Nature will not produce anything that even remotely resembles a proper suburban lawn. While Nature was having her way with us, I watched my neighbors spray, spread, blow, mow and edge their lawns into perfectly manicured seas of tidy emerald green grass, and for the first time I really understood how fiercely and relentlessly one must battle Nature to achieve what has become the American standard of the perfect lawn.


A housing development in Cathedral City, CA. Damon Winter/The New York Times.

As I plunged into research into organic strategies for keeping the HOA at bay, I was reminded of images I had seen in a New York Times article about the drought laying siege to the world’s seventh largest economy (a.k.a. California). The image above in particular stuck in my mind. It’s one of several remarkable images by photographer Damon Winter in that article, “California Drought Tests History of Endless Growth,” by Adam Nagourney, Jack Healy and Nelson D. Schwartz.

The authors observe that as California municipal employees rip up turf and replace it with desert scrub to comply with Governor Brown’s 25% water reduction mandate, city managers cross their fingers that Californians’ ideal of beauty will evolve along with the changing landscape.

The dons at Harvard University are holding on to a similar hope. In the Harvard Magazine article “When Grass Isn’t Greener: Alternatives to the ‘perfect’ lawn, at home and at Harvard,” Nell Porter Brown details a similar transformation underway across the university’s 800 acres in an effort to, as Bruce Butterfield puts it, have a landscape that both “looks good and is good for the plants and earthworms and animals and people.”

The excellent article goes on to trace the roots of the enhanced monoculture known as American “ideal lawn” back to the 1700s, with desires for order and status morphing over time into our present addiction to what botanist Peter Del Tredici aptly notes is a purely cosmetic landscape that “goes against the more heterogeneous natural landscape and requires tons of fertilizer, herbicides, pesticides, gasoline for mowing, and water, to be maintained.”

Instead, Porter Brown notes, Del Tredici favors the “freedom lawn” concept presented in Redesigning the American Lawn. That approach “calls for a green expanse composed of a community of plants” that “sort themselves out according to the topographical gradient that is most people’s lawns.”

I noted this gradient phenomenon myself as I tried an early round of weed pulling. Each weed seemed to have found its perfect niche, some clinging to loose soil, others embedded in clay, some preferring damp shady spots and others loving sun-baked dry patches.  (Incidentally, my children found this weed pulling exercise both puzzling and amusing. Kids: “What’s a weed anyway?”  Me: “Uh, I dunno exactly. Any plant growing where people expect grass, I suppose.” Kids: “But we have the prettiest backyard in the neighborhood – whoa, do NOT touch the ones with the blue flowers.” Me: “Ok, I won’t touch the ones with flowers, but if we have too many weeds, the neighbors won’t like it. I was thinking we’d shoot for like a one-third ratio.” Kids: Blank adults-are-from-Mars stares.)

Fertilizer run off

Run-off water loaded with nitrates, and probably pesticides, in Northern France. Photo by F. Lamiot, Creative Commons Attribution-Share Alike 2.5 Generic license.

Later, as I pushed my reel mower silently over my accidental “Freedom Lawn,” I wondered, will one natural lawn make any difference? The answer is no, of course. But if, as Nell Porter Brown reports, NASA satellite data shows that turf covers close to 50,000 square miles of American land – three times more acreage in the nation than irrigated corn – then turf liberation at a national scale becomes very interesting indeed, especially if 98% of sprayed insecticides and 95% of herbicides reach a destination other than their target species. (Wikipedia).

You don’t have to be concerned about whether a given chemical is “safe,” or even about the spiraling loss of biodiversity worldwide, to follow the logic in math like that.