Connecting Information Back to Knowledge with AI
By Joshua Eckroth, PhD, Chief Architect
September 18, 2018
Businesses produce and acquire information continuously: field logs, financial reports, meeting minutes, industry newsletters, news reports, etc. Efficiently connecting this information back to knowledge that supports decisions is crucial. Knowledge is power, but documents are not knowledge. They are just information that must be read and understood by decision-makers to be useful.
People usually write documents for other people. The authors, experts in their own way, translate what they see and what they know into summaries, explanations, tables, and graphs. They print beautiful reports. These reports are collected in the corporate archives. They sit until someone has the time to read and understand them again.
We routinely translate our expertise, our knowledge, into static text and graphics that must be read again to recover actionable knowledge. Sometimes, with millions of documents, this may seem an impossible task.
Artificial intelligence can connect that information back to knowledge. And AI is much faster than people.
If we look at some of the popular techniques of AI, we see that several of them help connect information back to knowledge:
- Optical character recognition: recovering text from printouts.
- Structure and table understanding: extracting context, such as section headings, and tabular data from the visual markers and layouts commonly found in reports.
- Natural language processing: making sense of everyday language.
- Document classification and recommendation: organizing and surfacing relevant documents.
These techniques may be chained together to create a document enrichment pipeline, shown below.
The key feature of this diagram is that each stage in the pipeline builds on the previous stage. Knowledge cannot be extracted from documents in a single pass – no single artificial intelligence or machine learning technique can make sense of a document on its own. As much as some companies may wish to simplify their offerings as a single "magic AI box," truly effective AI is more nuanced, and more interesting.
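The idea that each stage builds on the previous one can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stage functions, the dictionary keys, and the "all-uppercase line is a heading" heuristic are hypothetical placeholders, not a description of any particular product.

```python
from typing import Any, Callable, Dict, List

# A document is modeled as a plain dict that each stage enriches.
Document = Dict[str, Any]
Stage = Callable[[Document], Document]


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order; every stage sees the previous stage's output."""
    for stage in stages:
        doc = stage(doc)
    return doc


# Hypothetical placeholder stages, each far simpler than a real one.
def ocr(doc: Document) -> Document:
    # Pretend the text was recovered from the scan.
    return {**doc, "text": doc.get("text", doc.get("scanned_text", ""))}


def structure(doc: Document) -> Document:
    # Toy heuristic: treat all-uppercase lines as section headings.
    headings = [ln for ln in doc["text"].splitlines() if ln.isupper()]
    return {**doc, "headings": headings}


def nlp(doc: Document) -> Document:
    # Toy tokenization: lowercase and split on whitespace.
    return {**doc, "tokens": doc["text"].lower().split()}


def classify(doc: Document) -> Document:
    # Toy classification: one topic triggered by one token.
    topics = ["fuel"] if "fuel" in doc["tokens"] else []
    return {**doc, "topics": topics}


enriched = run_pipeline(
    {"scanned_text": "PRICES\nJet fuel rose"},
    [ocr, structure, nlp, classify],
)
print(enriched["headings"], enriched["topics"])
```

Note that `classify` fails if `nlp` has not run first, which is the point: later stages depend on what earlier stages produced.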
Let's look in a little more detail at each stage in this pipeline.
Optical character recognition
Sometimes, a business's documents are archived as images, the result of scanning or faxing paper documents. Before any further processing may be done, we must first find the text. We might also need to know where the individual characters appear on each page in order to facilitate vision algorithms later in the pipeline.
In particularly difficult cases, scanned documents might need to be aligned and cleaned up to reduce noise and increase contrast. OCR tools work best when the text is as clear as possible.
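As a rough sketch of what "increasing contrast" can mean, the snippet below turns a grayscale page into pure black and white. It assumes the page has already been loaded as a grid of 0-255 values, and it uses the mean brightness as the cutoff purely for illustration; production systems typically rely on an image-processing library and more careful threshold selection.

```python
from typing import List, Optional


def binarize(pixels: List[List[int]], threshold: Optional[float] = None) -> List[List[int]]:
    """Map each grayscale pixel (0-255) to black (0) or white (255)."""
    if threshold is None:
        # Illustrative default: use the mean brightness of the page.
        flat = [p for row in pixels for p in row]
        threshold = sum(flat) / len(flat)
    return [[255 if p > threshold else 0 for p in row] for row in pixels]


# A tiny made-up "page": light background with two dark ink pixels.
page = [
    [200, 210, 40],
    [220, 35, 205],
]
clean = binarize(page)
print(clean)
```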
If the documents are all already in a digital format, like DOCX, XLSX, PDF, or HTML, then the OCR stage in the pipeline may be skipped.
Structure and table understanding
Many documents include section headings; some include page headers, footers, and footnotes; and some also include tabular data. Tabular data might appear in clearly marked tables with grid lines, or in tables without grid lines that rely on special spacing such as tab stops or even a series of spaces. Some tables have row and column headers that help identify the meaning of the values in the table; some do not. In any case, in order to understand the document, the reader (either human or machine) must consider this structural and tabular information. For example, if similar tables of numbers appear in different places, the context of the tables, i.e., what these numbers are about, might only be known by reading the section headings. Just reading the table data is not sufficient to understand what these data represent.
Computer vision algorithms can help a machine understand a page with visual cues. For example, headers are often shown in bold text, slightly separated from the regular paragraphs. A table can also be found in a page in one of two ways: (1) if the table has grid lines, these lines can be detected and then the table values can be extracted by finding the line intersections; or (2) if there are no grid lines, an algorithm can look for consistent tabular spacing and divide up the data on the page into inferred rows and columns.
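The second approach, inferring columns from consistent spacing, can be sketched on plain text. The version below is a simplification that assumes the table has already been isolated as lines of fixed-width text: it treats any run of at least `min_gap` character positions that are blank in every row as a column boundary. The function name, the `min_gap` parameter, and the sample data are made up for illustration.

```python
from typing import List, Tuple


def split_table(lines: List[str], min_gap: int = 2) -> List[List[str]]:
    """Infer columns from character positions that are blank in every row."""
    width = max(len(ln) for ln in lines)
    padded = [ln.ljust(width) for ln in lines]
    # A position is "blank" only if every row has a space there.
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    spans: List[Tuple[int, int]] = []  # (start, end) of each inferred column
    start = None
    i = 0
    while i < width:
        if blank[i]:
            j = i
            while j < width and blank[j]:
                j += 1
            # Only a wide enough blank run closes the current column;
            # narrower runs (e.g. the space in "Jet A") stay inside it.
            if j - i >= min_gap and start is not None:
                spans.append((start, i))
                start = None
            i = j
        else:
            if start is None:
                start = i
            i += 1
    if start is not None:
        spans.append((start, width))

    return [[row[a:b].strip() for a, b in spans] for row in padded]


table_text = [
    "Product    Q1     Q2",
    "Jet A      2.10   2.25",
    "Diesel     1.95   2.05",
]
rows = split_table(table_text)
for row in rows:
    print(row)
```

Requiring a gap of at least two positions is what keeps "Jet A" in one cell; a real layout analyzer would work from character coordinates on the page rather than from text columns.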
Natural language processing
Once OCR and structure and table understanding have completed their jobs, we should have some document text with contextual clues. At this stage, we use natural language processing (NLP) techniques to discover what the document is about. NLP itself also uses a processing pipeline: first, the text is broken into words and punctuation; next, each word is marked with its part of speech (noun, verb, adjective, etc.); next, the structure of each sentence is recovered by finding verbs (e.g., "purchased"), the subject and object of each verb ("we purchased the house"), and any prepositional phrases ("at the end of the street"); finally, named entity recognition examines these sentences and other clues to find mentions of people, places, times, dates, and so on.
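The first three NLP steps can be illustrated with a toy example. The sketch below uses a hand-written lexicon (`TOY_LEXICON`, a hypothetical stand-in) instead of a trained part-of-speech tagger, and a "nearest noun on each side of the verb" rule instead of a real parser, so it only works on very simple sentences. Named entity recognition is omitted.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical hand-written lexicon; real taggers are trained on large corpora.
TOY_LEXICON = {
    "we": "PRON", "purchased": "VERB", "the": "DET", "house": "NOUN",
    "at": "ADP", "end": "NOUN", "of": "ADP", "street": "NOUN",
}


def tokenize(text: str) -> List[str]:
    """Step 1: break the text into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Step 2: mark each token with a part of speech ("X" if unknown)."""
    tagged = []
    for tok in tokens:
        pos = TOY_LEXICON.get(tok.lower(), "X") if tok[0].isalnum() else "PUNCT"
        tagged.append((tok, pos))
    return tagged


def subject_verb_object(tagged: List[Tuple[str, str]]) -> Optional[Tuple]:
    """Step 3 (very rough): take the first verb, the nearest noun or
    pronoun before it as subject, and the first noun after it as object."""
    for i, (tok, pos) in enumerate(tagged):
        if pos == "VERB":
            subj = next((t for t, p in reversed(tagged[:i]) if p in ("NOUN", "PRON")), None)
            obj = next((t for t, p in tagged[i + 1:] if p == "NOUN"), None)
            return subj, tok, obj
    return None


tokens = tokenize("We purchased the house at the end of the street.")
tagged = tag(tokens)
print(subject_verb_object(tagged))  # ('We', 'purchased', 'house')
```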
One might go further and find valuable keywords, i.e., noun phrases, that help describe the important knowledge contained in the document. For example, a document about trends in jet fuel prices would, presumably, contain many keywords like "jet fuel," "Jet A," etc.
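One simple way to approximate keyword discovery is to count short word sequences that contain no stopwords. This is a frequency-based stand-in, not true noun-phrase extraction, and the stopword list and sample text are invented for the example.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "of", "in", "as", "and", "to", "for", "is", "are"}


def candidate_phrases(text: str, max_len: int = 2) -> Counter:
    """Count 1- to max_len-word sequences that contain no stopwords."""
    counts: Counter = Counter()
    for sentence in re.split(r"[.!?]", text.lower()):
        words = re.findall(r"[a-z0-9]+", sentence)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if any(w in STOPWORDS for w in gram):
                    continue
                counts[" ".join(gram)] += 1
    return counts


def top_keywords(text: str, k: int = 3) -> List[str]:
    """Return the k most frequent candidate phrases."""
    return [phrase for phrase, _ in candidate_phrases(text).most_common(k)]


sample = "Jet fuel prices rose sharply. Jet fuel demand fell as jet fuel supply tightened."
print(top_keywords(sample))
```

On this sample, "jet fuel" is counted three times, the same as "jet" and "fuel" alone; a fuller implementation would prefer the longer phrase and weight candidates by how unusual they are across the whole collection.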
Classification and recommendation
Once we have keywords and entities from natural language processing, we can infer the document's classification, i.e., the topics or categories in a taxonomy that the document covers, as well as the entities it mentions, e.g., companies, or places like cities and oil fields. Naturally, the classifications in question depend on the domain of interest. For example, an upstream oil & gas domain would include classes like "well drilling" and "reservoir monitoring" while a health & wellness domain would include classes like "weight management" and "mindfulness practice." Once classified into one or more taxonomies, documents can be added to a data warehouse with an appropriate search engine. The documents may be found again by searching their text, their keywords, and/or their classifications.
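A minimal sketch of keyword-driven classification follows. The class names come from the oil & gas example above, but the indicator terms attached to each class and the `min_hits` cutoff are assumptions made up for illustration; a real taxonomy would be built with subject matter experts and would usually feed a trained classifier.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical indicator terms for two classes from the oil & gas example.
TAXONOMY: Dict[str, Set[str]] = {
    "well drilling": {"drill bit", "rig", "borehole", "mud"},
    "reservoir monitoring": {"pressure", "sensor", "seismic", "reservoir"},
}


def classify(keywords: Iterable[str],
             taxonomy: Dict[str, Set[str]] = TAXONOMY,
             min_hits: int = 2) -> List[str]:
    """Return classes whose indicator terms match at least min_hits keywords,
    ordered from most to fewest matches."""
    found = {k.lower() for k in keywords}
    scores = {label: len(found & terms) for label, terms in taxonomy.items()}
    matched = [label for label, score in scores.items() if score >= min_hits]
    return sorted(matched, key=lambda label: -scores[label])


print(classify(["Rig", "borehole", "pressure"]))  # ['well drilling']
```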
We can also identify which documents are similar to each other by examining not only their text but also their keywords and classifications. This allows us to recommend that the reader look at similar documents on the same subject, or to find duplicates such as news reports from different sources about the same event.
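One common way to measure similarity between two documents' keyword sets is the Jaccard index: the size of the overlap divided by the size of the union. The sketch below uses it as a stand-in for whatever similarity measure a real system would choose; the document names and keyword sets are invented.

```python
from typing import Dict, Iterable, List, Set, Tuple


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Overlap of two keyword sets: |A and B| / |A or B| (0.0 if both empty)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def most_similar(name: str, corpus: Dict[str, Set[str]], k: int = 2) -> List[Tuple[str, float]]:
    """Rank the other documents in the corpus by similarity to `name`."""
    scores = [(other, jaccard(corpus[name], kws))
              for other, kws in corpus.items() if other != name]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Made-up enriched documents, each reduced to its keyword set.
docs = {
    "news_a": {"jet fuel", "price", "refinery"},
    "news_b": {"jet fuel", "price", "airline"},
    "memo": {"mindfulness", "wellness"},
}
print(most_similar("news_a", docs, k=1))  # [('news_b', 0.5)]
```

A score near 1.0 suggests a likely duplicate, while a moderate score suggests a related document worth recommending; where to draw those lines depends on the collection.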
The role of domain knowledge
Domain knowledge is vital. Every stage in the pipeline works better if it is informed by domain knowledge. Financial reports do not look like well logs, which do not look like news stories. Structure and table understanding may be tweaked based on what kinds of patterns are common for the documents in question. Natural language processing may be focused on finding percentages, monetary amounts, and currencies for some documents, and locations for other documents. Classification and entity recognition do not work at all in the absence of domain knowledge. In other words, AI algorithms cannot do the job on their own. They often must be tweaked and adapted for specific use cases. Off-the-shelf AI software is the sports car – but subject matter experts and software engineers are the drivers.
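As a small illustration of domain-focused extraction, the sketch below pulls percentages and US-dollar amounts out of financial text with regular expressions. The patterns are assumptions chosen for this example (dollar sign prefix, comma thousands separators) and would need adapting for other currencies and number formats.

```python
import re
from typing import Dict, List

# Illustrative patterns; real financial text needs broader coverage.
PERCENT = re.compile(r"\d+(?:\.\d+)?\s?%")
MONEY = re.compile(r"\$\s?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_financial_values(text: str) -> Dict[str, List[str]]:
    """Find percentage and dollar-amount mentions in the text."""
    return {
        "percentages": PERCENT.findall(text),
        "amounts": MONEY.findall(text),
    }


values = extract_financial_values("Revenue rose 4.5% to $1,200,000 in Q3.")
print(values)
```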
But wait, there's more! A little analysis, a lot of insight
Enriched documents each hold the knowledge of a single document. The enrichment pipeline finds the context for the information in the document, extracts the people, places, and times discussed in the document, and identifies what the document is about with keywords and classifications.
Only by looking at multiple enriched documents can we see big-picture insights. With document metadata produced by the enrichment pipeline, we can discover trends such as increasing interest in artificial intelligence and machine learning by corporations over the last few years, or sudden uncertainty about jet fuel pricing. We can put documents on the map and see clusters of activity around certain oil and gas reservoirs or areas of the world that are actively developing and testing certain technologies based on recent news reports. We can find the most recent, most interesting stories about a subject and generate periodic alerts. We can summarize a collection of documents by picking out the key phrases that mention the most important people, places, and keywords of the documents.
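Trend discovery of this kind amounts to aggregating the metadata of many enriched documents. The sketch below counts how many documents mention each topic per year; the record layout (`year` and `topics` fields) and the sample records are invented for illustration and do not represent real data.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def topic_counts_by_year(docs: Iterable[dict]) -> Dict[str, Dict[int, int]]:
    """For each topic, count the documents classified under it in each year."""
    trend: Dict[str, Counter] = defaultdict(Counter)
    for doc in docs:
        for topic in doc["topics"]:
            trend[topic][doc["year"]] += 1
    return {topic: dict(years) for topic, years in trend.items()}


# Made-up enriched-document metadata.
enriched_docs = [
    {"year": 2016, "topics": ["artificial intelligence"]},
    {"year": 2017, "topics": ["artificial intelligence", "jet fuel pricing"]},
    {"year": 2017, "topics": ["artificial intelligence"]},
]
trend = topic_counts_by_year(enriched_docs)
print(trend["artificial intelligence"])  # {2016: 1, 2017: 2}
```

The aggregation itself is short because the hard work, assigning reliable topics and dates to each document, has already been done by the enrichment pipeline.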
Producing high-level insights about multiple documents is impossible without a mature document enrichment pipeline. Producing insights from enriched documents is easy.
What to look for in your AI provider
Do you have documents that need enrichment? Ask your AI provider some fundamental questions to ensure they have experience connecting information back to knowledge:
- "What is your processing pipeline? Why does that pipeline make the most sense?"
- "Can you detect section headings and tables?" (Note, most AI pipelines skip this step.)
- "Can you classify documents into domain-specific topics, to aid in document search and discovery?"
- "How does domain knowledge inform your entire pipeline?"
- "What kinds of tools are available for analyzing enriched documents?"
i2k Connect specializes in document processing informed by domain knowledge. The pipeline and procedures described above are a high-level overview of our approach. We know that documents cannot be understood and enriched by AI algorithms in isolation.
True artificial intelligence is only achieved when sophisticated algorithms are married with subject matter expertise and talented engineers who can build a multi-stage processing pipeline. There is no magic in AI – instead, there is a combination of key elements: domain knowledge, experienced developers, and well-chosen algorithms.
Contact us today to discuss your document enrichment and analysis needs.