Connecting SPE Information to Knowledge
By Reid G. Smith, Ph.D., July 28, 2021
Over the past five years, SPE and i2k Connect have developed the SPE Research Portal to address the challenges facing engineers who are looking for information. The portal is available on the SPE.org homepage and SPE mobile app.
Companies run on information in the form of data. However, knowledge workers need to spend too much time searching for data, especially in documents with text, tables, and images, which are often called unstructured data. Examples include almost all the information in company repositories in the form of Microsoft Word, PDF, PowerPoint, and Excel documents, web pages, images, videos, audio files, newsfeeds, tweets, and e-mail messages. The lack of structure makes unstructured data hard to find and analyze without extensive semantic tagging, which is traditionally done by humans. This is in contrast to the relative ease with which structured data in traditional databases can be found and analyzed.
Data professionals, the people who focus on finding and analyzing data, report spending 14 hours a week looking for data and another 10 hours a week re-creating data that cannot be found. Taken together, that is 24 hours a week, more than half of a 40-hour work week, spent just trying to get the data they need to do the required analysis.
In 2014, i2k Connect was formed around the idea that Artificial Intelligence could change the game. Our vision is to achieve a 90% reduction in the manual effort expended today by users to find, extract, and analyze data in documents, and by data stewards, records analysts, and others to tag documents.
The major impact of these capabilities, as expressed by i2k customers, has been to dramatically reduce the time required to find and analyze data, often reducing days of effort to minutes.
What has been achieved?
We built the i2k Connect AI Platform, which integrates Subject Matter Expertise with AI natural language processing (NLP) and machine learning (ML). It automatically enriches documents by classifying them into relevant taxonomies, geotagging entities, and extracting key concepts, authors, institutions, titles, and summaries. The latest developments have enabled the platform to recognize key events, find and extract data points in document text, tables, and SQL databases, and answer questions posed in natural language.
What have we learned?
Domain knowledge. One of the earliest learnings in applied AI is “in the knowledge lies the power.” To this end, the i2k AI Platform has millions of pieces of information and knowledge built in, for example: information on the names, locations, and other relevant parameters of 100,000 basins, fields, wells, and formations; more than a million relevant natural language terms and their mappings into 15 taxonomies; natural language knowledge, such as parts of speech and what constitutes a good keyword phrase (called a “concept tag” in the i2k Platform); knowledge of how to recognize and interpret tabular data in PDF and other file formats; and knowledge of how to extract data points from sentences. Going forward, we will continue to broaden, extend, and deepen the knowledge in the platform.
Deployment. It is worth noting that much of the effort in delivering AI in any commercial system is devoted to non-AI issues like integration into the existing corporate infrastructure, information security architecture, and workflows to provide useful products and services. Consider processing speed, to which we devote a great deal of attention, because company repositories frequently contain tens of millions of documents.
More generally, the AI is often a relatively small part of an overall solution, both from a functionality perspective and from a development/maintenance effort perspective. Indeed, the ratio can be as high as 90/10, when one includes systems architecture, infrastructure, networking, deployment, graphics/visualization, hardware, and many other essentials for developing, testing, delivering, and maintaining products and services to customers. This ratio has, if anything, only increased between the standalone applications of the 1980s and present-day AI, embedded in every conceivable product and service.
Translating the best academic research into robust commercial systems. Over the years we have also found that the latest research code and trained machine learning models are rarely, if ever, enough for a complete solution. There is always a substantive amount of work to be done to translate what can be achieved in a research setting to what can be applied to a commercial system. Some work involves the long tail: edge cases that must be solved to achieve very high accuracy. Other work involves graceful degradation in the face of inadequate algorithms or knowledge. Our strategy is to maintain close contact with the latest research through our work on AITopics.org with the AAAI (Association for the Advancement of Artificial Intelligence), then integrate what can be delivered robustly to customers into the i2k Platform.
Document libraries. Beyond searching and browsing SPE documents, the i2k AI Platform is being used by customers to find and analyze documents and data files in their own repositories (e.g., file shares and content management systems), and to track information published on the Internet, including conference presentations and news items.
Well file dashboard. i2k technology is used to automatically populate and display a detailed hierarchical view of well files such as reports (daily drilling reports and end-of-well reports), well logs (density, gamma ray, spectral gamma ray, resistivity, and array laterolog), schematics, and fluid and rock samples. For logs, the platform reads LAS, DLIS, and LIS file headers.
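To illustrate what reading a log file header involves, here is a minimal Python sketch that parses the header sections of a LAS 2.0 file into dictionaries. It is not i2k's implementation. The function name parse_las_header and the sample well data are hypothetical, and the sketch assumes unwrapped header lines in the usual "MNEM.UNIT DATA : DESCRIPTION" layout.

```python
# Hypothetical sample: a tiny LAS 2.0 file with made-up well data.
SAMPLE_LAS = """~VERSION INFORMATION
 VERS.                 2.0 : CWLS LOG ASCII STANDARD - VERSION 2.0
 WRAP.                  NO : ONE LINE PER DEPTH STEP
~WELL INFORMATION
 STRT.M             1670.0 : START DEPTH
 STOP.M             1660.0 : STOP DEPTH
 NULL.             -999.25 : NULL VALUE
 WELL.      EXAMPLE WELL 1 : WELL
~CURVE INFORMATION
 DEPT.M                    : DEPTH
 GR  .GAPI                 : GAMMA RAY
~A
 1670.000   75.1
"""


def parse_las_header(text):
    """Return {section letter: {mnemonic: {unit, value, description}}}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("~"):
            key = line[1:2].upper()  # sections are identified by first letter
            if key == "A":
                break  # the ASCII data section starts here; the header is done
            current = sections.setdefault(key, {})
            continue
        if current is None or "." not in line or ":" not in line:
            continue  # not a header item (for example, free text under ~O)
        # The mnemonic ends at the first period; the description follows
        # the last colon; the unit is glued to the period.
        mnemonic, rest = line.split(".", 1)
        body, description = rest.rsplit(":", 1)
        unit, _, value = body.partition(" ")
        current[mnemonic.strip()] = {
            "unit": unit.strip(),
            "value": value.strip(),
            "description": description.strip(),
        }
    return sections


header = parse_las_header(SAMPLE_LAS)
print(header["W"]["WELL"]["value"], header["W"]["STRT"])
```

Only the header is read, so a dashboard built this way could index well names, depth ranges, and curve lists without loading the log data itself.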
Drilling events. The platform finds information about significant drilling events by reading daily drilling reports, end-of-well reports, and related files. Thus, it can highlight wells where significant drilling events occurred, such as lost circulation and stuck pipe, and the depths at which they occurred.
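As a rough illustration of the idea, the following Python sketch scans report text for event phrases and picks up the first depth mentioned in the same sentence. It is a deliberately simple keyword-and-pattern approach, not the platform's method. The phrase lists, the find_events function, and the sample report text are all assumptions made for the example.

```python
import re

# Hypothetical phrase lists; a real system would need far richer vocabulary.
EVENT_TERMS = {
    "lost circulation": ("lost circulation", "lost returns"),
    "stuck pipe": ("stuck pipe", "pipe stuck"),
}

# A number (with optional thousands separators) followed by a depth unit.
DEPTH_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(ft|m)\b", re.IGNORECASE)


def find_events(report_text):
    """Return one record per event mention, with depth if one is stated."""
    events = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", report_text):
        lowered = sentence.lower()
        for event, phrases in EVENT_TERMS.items():
            if any(phrase in lowered for phrase in phrases):
                match = DEPTH_RE.search(sentence)
                depth = None
                if match:
                    depth = (float(match.group(1).replace(",", "")),
                             match.group(2).lower())
                events.append({"event": event, "depth": depth,
                               "sentence": sentence.strip()})
    return events


# Made-up daily drilling report excerpt.
SAMPLE_REPORT = (
    "Drilled ahead to 8,450 ft. "
    "Observed lost circulation at 8,462 ft while drilling. "
    "Pipe stuck at 9,105 ft during connection. "
    "Circulated bottoms up."
)

events = find_events(SAMPLE_REPORT)
for e in events:
    print(e["event"], e["depth"])
```

A pattern-based approach like this breaks down quickly on abbreviations, negations ("no losses observed"), and depths given in tables, which is where the edge-case work described earlier comes in.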
Structured data in tables and databases. Tabular data are often interspersed with paragraphs of text in oil & gas industry technical articles and reports (e.g., SPE technical articles, daily drilling reports, end-of-well reports, and regulatory forms). There are two challenging aspects of the problem: table detection (“find the tables”) and table structure recognition (“extract the table’s cells as name-value pairs”). The variety of tabular structures adds to the challenges (e.g., cell borders vs no cell borders, tables within tables, row spans and column spans).
i2k detects tables in documents (including PDF and plaintext files), recognizes each table’s structure, and returns a machine-interpretable description. It also crawls file shares and queries SQL databases to identify, extract, and summarize data points from tables, sentences, and database records (e.g., drilling events and reservoir parameters).
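The two steps, table detection and table structure recognition, can be sketched for the simplest case: a plaintext table whose columns are separated by two or more spaces. This Python sketch is only an illustration under that assumption, not how the i2k Platform works; find_tables, split_cells, and the sample text are hypothetical, and the sketch ignores the harder cases (borders, nested tables, row and column spans).

```python
import re


def split_cells(line):
    """Split a line into cells on tabs or runs of two or more spaces."""
    return [cell for cell in re.split(r"\s{2,}|\t", line.strip()) if cell]


def find_tables(text, min_rows=3):
    """Detect runs of lines with the same cell count (two or more cells)
    and return each run as a list of {header name: value} records."""
    tables, run = [], []
    for line in text.splitlines() + [""]:  # empty sentinel flushes the last run
        cells = split_cells(line)
        if len(cells) >= 2 and (not run or len(cells) == len(run[0])):
            run.append(cells)  # detection: the line continues the current run
            continue
        if len(run) >= min_rows:
            # Structure recognition: first row is the header, the rest are data.
            header, *rows = run
            tables.append([dict(zip(header, row)) for row in rows])
        run = [cells] if len(cells) >= 2 else []
    return tables


# Made-up report excerpt with a table between two paragraphs.
SAMPLE_TEXT = """Nonproductive time summary for the well.

Event             Depth (ft)  Hours
Lost circulation  8462        6.5
Stuck pipe        9105        14.0

Operations resumed after the pipe was freed.
"""

tables = find_tables(SAMPLE_TEXT)
print(tables[0])
```

The output is the kind of machine-interpretable description mentioned above: each row becomes a set of name-value pairs keyed by the column headers.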
Additional knowledge bases. The platform includes tools to build knowledge bases specific to individual clients, which may include custom taxonomies, unique formatting, and access authorization. Improvements are made on a continual basis. A current example is Energy Transition. Companies in every sector are feeling pressure to increase focus on energy transition, together with sustainability and ESG (Environmental, Social, Governance). Over the past year, i2k has extended the platform’s classification knowledge to better enable people to zero in on relevant information on these topics, including the ability to find documents about particular renewables such as offshore wind.
Alerts. i2k gives decision-makers regularly scheduled summaries of news from selected sources with links to the full articles. Using a custom genetic algorithm (Eckroth and Schoen, 2019), i2k’s alerts feature the most relevant and diverse subset of stories that cover a broad range of topics (selected by the user) over a specified period of time. In addition, individuals can subscribe to alerts for new content that is published in their own company repositories and matches any searches they define.
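To make the "relevant and diverse" goal concrete, here is a small Python sketch. Note that it does not implement the genetic algorithm cited above. It swaps in a simpler greedy selection (in the style of maximal marginal relevance) that trades relevance against topic overlap with stories already chosen. The function names, the trade_off parameter, the relevance scores, and the sample stories are all hypothetical.

```python
def jaccard(a, b):
    """Overlap between two topic sets, from 0.0 (disjoint) to 1.0 (equal)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_diverse(stories, k, trade_off=0.7):
    """Greedily pick k story titles.

    stories: list of (title, relevance, topics) where topics is a set.
    trade_off: weight on relevance; (1 - trade_off) penalizes redundancy.
    """
    chosen, remaining = [], list(stories)
    while remaining and len(chosen) < k:
        def score(story):
            _, relevance, topics = story
            # Redundancy is the highest overlap with any story chosen so far.
            redundancy = max((jaccard(topics, c[2]) for c in chosen),
                             default=0.0)
            return trade_off * relevance - (1 - trade_off) * redundancy

        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return [title for title, _, _ in chosen]


# Made-up stories with made-up relevance scores.
stories = [
    ("Offshore wind lease round", 0.90, {"offshore wind", "leasing"}),
    ("Offshore wind auction results", 0.85, {"offshore wind", "leasing"}),
    ("Carbon capture pilot", 0.60, {"ccs"}),
]

picked = select_diverse(stories, k=2)
print(picked)
```

With these sample values, the second offshore wind story is passed over in favor of the less relevant carbon capture story, because it covers the same topics as the first pick.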
Search on ambiguous place names. Finding documents about a particular oil & gas field or basin requires more than retrieving every document that mentions its name, because many of these names are ambiguous. By tagging documents with the names of the fields and basins actually mentioned, this service helps engineers and managers keep track of new technical or legislative developments, competitor news, and the like.
Duplicates. i2k identifies and can suppress both exact and near duplicates to reduce wasted time reviewing results of searches. This also gives managers the data they need to reduce cloud transfer and storage costs.
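One common way to separate exact from near duplicates, shown here only as an illustrative Python sketch and not as i2k's technique, is to compare content digests for exact matches and word-shingle overlap (Jaccard similarity) for near matches. The function names, the shingle size, the 0.5 threshold, and the sample sentences are assumptions chosen for the example.

```python
import hashlib
import re


def shingles(text, size=3):
    """Set of overlapping word n-grams (shingles) from lowercased text."""
    words = re.findall(r"\w+", text.lower())
    count = max(len(words) - size + 1, 1)
    return {tuple(words[i:i + size]) for i in range(count)}


def similarity(a, b):
    """Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def classify(a, b, threshold=0.5):
    """Label a pair of texts as 'exact', 'near', or 'distinct'."""
    # Equal digests mean identical content; an index can store digests
    # instead of full text to find exact copies cheaply.
    if hashlib.sha256(a.encode()).digest() == hashlib.sha256(b.encode()).digest():
        return "exact"
    return "near" if similarity(a, b) >= threshold else "distinct"


# Made-up sentences standing in for whole documents.
doc_a = ("Ran wireline logs across the reservoir section and "
         "recovered twelve rock samples for analysis")
doc_b = ("Ran wireline logs across the reservoir section and "
         "recovered fifteen rock samples for analysis")
doc_c = "Gamma ray log shows a clean sand interval"

print(classify(doc_a, doc_a), classify(doc_a, doc_b), classify(doc_a, doc_c))
```

Comparing every pair this way does not scale to tens of millions of documents; at that size, techniques such as MinHash are typically used to approximate the same similarity.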
Mergers, acquisitions, divestitures. When mergers and acquisitions occur, i2k is used to bring the document libraries of the separate companies into a common indexing system for continued safe and efficient operations. With divestitures, i2k helps split document libraries according to agreed-upon criteria. Similarly, when the indexing categories change (e.g., with new industry standards), i2k is used to reclassify millions of documents into the new categories.
One additional AI capability to be delivered this year further addresses the industry’s knowledge management needs.
Chatbot that understands oil & gas. It is often convenient for people to ask questions in their own natural language, like English or Spanish, rather than in an artificial syntax. The SPE Connect community of practice members regularly ask their peers for assistance in this way.
The chatbot we have developed can answer a question like, “What does one call a valve that allows flow in one direction only?” by examining the SPE article corpus and finding a caption of a diagram of an equipment layout for optimized zero flaring that states, in part, “Check valve connected to the production flow line—to allow flow in one direction only, from the wellhead into the production line” (Duthie, et al., 2015). This single phrase is found quickly in a corpus of millions of paragraphs of text, and the answer to the question, “check valve,” is extracted and presented to the person who asked the question, with a link back to the source text.
It can also answer questions like “What was the most costly nonproductive time in 2013 at the SEPAT A6 well?” (by reading the NPT tables in final well reports); “Which authors have written about acid treatment?” (by drawing on detection of authors of SPE articles, demonstrated earlier); and “Where is the Green Canyon leasing area?” (by drawing on the platform’s extensive knowledge of basins, fields, and formations).
These thoughts are expanded in our recently published paper.
Reid G. Smith, Eric J. Schoen, Joshua R. Eckroth, David Mack Endres, Sebastian Florez, Julia Rasmussen Elliott, and Bruce G. Buchanan (i2k Connect Inc.). SPE Research Portal: How SPE Uses Artificial Intelligence to Help You Find Technical Information. The Way Ahead, Society of Petroleum Engineers, July 27, 2021.
Contact us for more information on our software or services to assist you with gaining valuable insights on the data you need.