insideBIGDATA: DCL states it uses the latest innovations in ML, AI, and NLP technology to address customer needs. Please explain what form of AI/machine learning is being used in your product: neural networks, RNNs, LSTM, etc. and how the technology is being used.
Tammy Bilitzky: The bar for people’s expectations from data is constantly being raised – because there’s so much of it. It’s essential to go beyond just digitizing and storing information; you need to harness the meaning of your content and do it quickly and accurately.
As a result, we are encountering growing demand to automate the analysis and conversion of high volumes of content and to extract information, mimicking human abilities to understand context and layering it with superhuman consistency, repeatability and speed.
That is where Artificial Intelligence (AI) fits in. AI isn’t any one technology; it’s a collection of technologies and algorithms that attempt to duplicate how a human makes decisions.
DCL’s multi-faceted approach leverages both emerging and proven technologies, with AI delivering on its potential as a game changer, transforming how we solve content-related use cases.
The need for structured data is expanding while the appetite for traditional ETL (extract, transform, load) systems and their inherent challenges diminishes. Data, its meaning and purpose, is dynamic and evolving. Firms cannot afford to be limited by a point-in-time, static view of data.
In the past, we supported all these requirements with good success using more traditional approaches such as text processing engines, regular expressions, rule-based patterns and manual review when that was needed. However, this required extensive upfront analysis and normalization of the content to process the information in a manageable manner.
Here are a few examples where we have been successfully developing and implementing AI techniques:
ML/NLP to Extract Funding Details from Unstructured Content
Machine learning/Natural Language Processing (ML/NLP) refers to a set of statistical and other related techniques that provide computers with the ability to “learn” without explicit programming, allowing a computer program to analyze and parse a sentence or a paragraph much the way a human would.
As used by DCL, ML/NLP is extremely helpful in automatically extracting funding information from unstructured content to locate the funding institutions, locations and grant numbers frequently cited in scientific research articles.
Our ML/NLP layer “reads” the articles, and applies grammatical rules, parts of speech analytics and classes of keywords typically associated with funding text to identify the likely funding sources in sentences, with far more specificity than a word search possibly could.
ML techniques applied include:
- Tokenization with keywords determined to consistently indicate the start of a text sequence with grant information.
- A classifier that recommends accepting or rejecting a token based on its similarity to known grant tokens, e.g. a granting organization found in the text that is similar to known granting organizations.
- DTM (Document-Term Matrix) to find the frequency of each common term and drop infrequent terms. Granting organizations have a vocabulary drawn from a different ‘pool’ of terms than ordinary text.
- SVM (Support Vector Machine) to separate and classify distinct groups of data in n dimensions. Each dimension is one of the frequent words found within the DTM.
- Lexical Analysis to better predict grant-id information by grouping the tokens and organizing words into meaningful sentences.
- Grammar Model that builds the language rules used to “understand” the text.
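As a rough illustration of the DTM step in the list above, the sketch below builds a document-term matrix and drops infrequent terms. It is not DCL's implementation: the function names, the `min_doc_freq` cutoff and the sample sentences are all assumptions made for this example. Each resulting row is an n-dimensional vector of the kind a classifier such as an SVM could take as input.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_dtm(documents, min_doc_freq=2):
    """Build a document-term matrix, dropping terms found in too few documents.

    Returns (vocabulary, matrix), where matrix[i][j] is the number of times
    vocabulary[j] occurs in documents[i].
    """
    counts = [Counter(tokenize(doc)) for doc in documents]
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for c in counts for term in c)
    vocabulary = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    matrix = [[c[t] for t in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical sentences, invented for illustration.
sentences = [
    "This work was supported by the National Science Foundation grant 12345.",
    "Supported by a grant from the National Institutes of Health.",
    "We thank our colleagues for helpful discussions.",
]
vocabulary, matrix = build_dtm(sentences, min_doc_freq=2)
```

With these sample sentences, only the terms shared by the two funding statements survive the cutoff, so the acknowledgement sentence maps to an all-zero row.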
We also use ML and Deep Learning RNN (recurrent neural network), a neural network that uses internal memory to retain information needed to process sequences of inputs, to reduce the number of false positives. Many items identified may just be organizations or people to whom the author is expressing gratitude rather than funders, and others may not be valid institutions. Our ML classifier maintains a training set of real vs. not-real funding sources and executes a series of techniques to extract the text strings most likely to be valid, successfully reducing our false positives. Our deep learning RNN capitalizes on publicly available institution details and naming patterns, further improving accuracy.
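The idea of accepting or rejecting a candidate by its similarity to known funders can be sketched with plain string similarity. This is a deliberately simple stand-in, not the ML classifier or the RNN described above: `KNOWN_FUNDERS`, `best_match`, `accept` and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical reference list; a real one would be far larger.
KNOWN_FUNDERS = [
    "National Science Foundation",
    "National Institutes of Health",
    "European Research Council",
]


def best_match(candidate, known=KNOWN_FUNDERS):
    """Return (score, name) for the known funder most similar to candidate."""
    scored = [
        (SequenceMatcher(None, candidate.lower(), name.lower()).ratio(), name)
        for name in known
    ]
    return max(scored)


def accept(candidate, threshold=0.8):
    """Accept a candidate only if it closely resembles a known funder."""
    score, _ = best_match(candidate)
    return score >= threshold
```

A misspelled funder such as "National Science Fundation" still clears the threshold, while a thanked individual such as "Professor Jane Smith" does not. A trained model would learn such distinctions from labeled examples instead of relying on a fixed cutoff.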
Metadata Extraction and Labelling using GATE, ML/NLP
The need for accurate metadata is exploding and our clients need precise data points to power search-enabled devices. Full index search capability by itself is not always optimal. However, coupling full text search with functionally-relevant metadata provides more precise, targeted search capability.
DCL has been using GATE (General Architecture for Text Engineering) for over two years to support metadata extraction, normalize text and semantically analyze content. GATE is an open source development environment for computational language processing and text mining. It provides tools to identify text strings for semantic tagging, a set of predefined, editable ontologies and the ability to create custom ontologies.
We pair GATE with other machine learning techniques to extract metadata from complex, unstructured text, and iteratively decompose and structure the content in logical increments that initiate appropriate calls to actions; all while preserving the original text for audit purposes, potential redefinition and refinement.
DCL’s ML models and statistically calculated confidence levels home in on specific content snippets that might benefit from manual review, rather than entire documents.
Our use of GATE and ML/NLP gives us a faster ramp-up, improved accuracy and processing speed, and temporal views of the data, and it reduces and often eliminates the need for manual review.
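Routing only low-confidence snippets to a reviewer can be sketched as a simple threshold split. The `Extraction` record, the `split_for_review` helper and the 0.85 cutoff are hypothetical names and values chosen for this example; the source does not say how the confidence levels are calculated or what threshold is used.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    field: str         # metadata label, e.g. "title"
    snippet: str       # source text the value was extracted from
    confidence: float  # model confidence in the range [0, 1]


def split_for_review(extractions, threshold=0.85):
    """Separate extractions into auto-accepted and needs-manual-review lists."""
    accepted = [e for e in extractions if e.confidence >= threshold]
    review = [e for e in extractions if e.confidence < threshold]
    return accepted, review


# Hypothetical extractions, invented for illustration.
results = [
    Extraction("title", "Annual Report 2019", 0.97),
    Extraction("author", "J. Smth", 0.62),
]
accepted, review = split_for_review(results)
```

Only the second snippet would reach a reviewer here; the rest of the document passes through untouched.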
Computer Vision to Improve OCR
A large part of our business focuses on digitization of the text buried in image-based content such as TIFF or PDF images. In these cases, we use machine-learning-enhanced Computer Vision (CV) software to improve optical character recognition (OCR). CV automates human visual tasks through a series of methods to acquire, process, analyze, and understand digital images.
The CV software used at DCL programmatically examines the page in advance of the OCR step to locate OCR pitfalls such as chemical formulas and math, which reduce the accuracy of OCR, and removes them electronically. The result is a much more accurate OCR and text extraction process. The removed content is then available as image files and can be referenced or replaced in the final product.
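The removal step can be pictured as cutting rectangles out of the page image before OCR and keeping the cut-outs for later. The sketch below assumes the regions have already been located (the machine learning part, not shown) and represents a grayscale page as a list of pixel rows. The `mask_regions` function and the coordinates are hypothetical.

```python
def mask_regions(page, regions, background=255):
    """Cut rectangular regions out of a grayscale page before OCR.

    page is a list of pixel rows. Each region is (top, left, bottom, right)
    with exclusive bottom and right bounds. Returns the masked page and the
    list of cropped regions, in the same order as the input regions.
    """
    masked = [row[:] for row in page]  # copy so the original page is untouched
    crops = []
    for top, left, bottom, right in regions:
        # Keep the removed content so it can be re-inserted or referenced later.
        crops.append([row[left:right] for row in page[top:bottom]])
        for y in range(top, bottom):
            for x in range(left, right):
                masked[y][x] = background
    return masked, crops


# A tiny all-black 4x6 "page" with one hypothetical formula region.
page = [[0] * 6 for _ in range(4)]
masked, crops = mask_regions(page, [(1, 2, 3, 5)])
```

The masked page would go to the OCR engine, while each crop could be saved as its own image file.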
Document Classification using ML Text and Pattern Recognition
We deal with a myriad of document types, each with its own nuances, and use ML text and pattern recognition algorithms to perform upfront document classification. The results are then used to direct an enhanced conversion process that is tailored to the unique aspects of the subject document type and delivers refined content extraction and labelling.
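The classify-then-route flow can be sketched as a classifier feeding a dispatch table. Keyword counting stands in here for the ML text and pattern recognition; the document types, keywords and converter functions are all invented for illustration.

```python
from collections import Counter
import re

# Hypothetical document types and indicative keywords.
TYPE_KEYWORDS = {
    "journal_article": {"abstract", "references", "doi", "keywords"},
    "financial_statement": {"assets", "liabilities", "equity", "revenue"},
}


def classify(text):
    """Return the type whose keywords occur most often, or 'unknown'."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {t: sum(words[k] for k in kws) for t, kws in TYPE_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def convert_article(text):
    return {"pipeline": "article", "chars": len(text)}


def convert_financial(text):
    return {"pipeline": "financial", "chars": len(text)}


def convert_generic(text):
    return {"pipeline": "generic", "chars": len(text)}


# Dispatch table: each document type gets its own tailored conversion.
CONVERTERS = {
    "journal_article": convert_article,
    "financial_statement": convert_financial,
    "unknown": convert_generic,
}


def convert(text):
    """Classify the document, then run the conversion tailored to its type."""
    return CONVERTERS[classify(text)](text)
```

Adding a new document type then means adding one classifier label and one converter, without touching the other pipelines.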
Part of Speech using Public ML/NLP Libraries
A common requirement for DCL is the ability to find, “understand” and contextually structure text in free-form content, such as Item Name and Item Number. We use ML/NLP libraries to accomplish this by gathering parts of speech and other semantic patterns to decompose and structure the content.
Linking References using TensorFlow-hosted RNN
Typically, references to other content of all types are buried in unstructured text without clear, consistent patterns or formatting rules.
We are currently building an RNN using TensorFlow, an open source symbolic math library used for machine learning applications, to address this challenge.
The increasingly popular use of TensorFlow has enabled us to build a set of training patterns to narrow pattern discovery in the text, extract content that seems to match these patterns, and then apply a probability to the match – achieving a high success rate in a fully automated fashion. This allows us to quickly process large corpuses of diverse documents and accurately annotate them to link references to related content.
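The extract-then-score flow can be illustrated without a neural network. In the sketch below, regular expressions with hand-assigned probabilities stand in for the trained RNN; the patterns, the scores and the `find_references` helper are assumptions for illustration only. In a trained model the probabilities would be learned from examples, not fixed by hand.

```python
import re

# Hypothetical patterns paired with hand-assigned probabilities.
REFERENCE_PATTERNS = [
    (re.compile(r"\bSection\s+\d+(?:\.\d+)*"), 0.9),
    (re.compile(r"\bsee\s+(?:Figure|Table)\s+\d+", re.IGNORECASE), 0.8),
    (re.compile(r"\bAppendix\s+[A-Z]\b"), 0.7),
]


def find_references(text, min_probability=0.75):
    """Return (offset, matched_text, probability) for likely references.

    Matches from patterns scoring below min_probability are discarded.
    Results are sorted by their position in the text.
    """
    hits = []
    for pattern, probability in REFERENCE_PATTERNS:
        if probability < min_probability:
            continue
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(0), probability))
    return sorted(hits)


sample = "As noted in Section 4.2, see Table 3 and Appendix B."
hits = find_references(sample)
```

Each surviving hit carries its character offset, so the matched span can be annotated as a link to the related content.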
insideBIGDATA: You indicate that your Harmonizer product uses textual analysis to provide extensive analytics on content and structure to facilitate comparison across the collection. Can you drill down on this use of textual analysis? How is Harmonizer being used in enterprise data pipelines? A couple of examples?
Tammy Bilitzky: DCL’s Harmonizer is a tool that uses textual analysis to analyze large bodies of content and identify sections that are either identical or similar to each other. This capability solves a number of business use cases by providing visibility and intelligence on data quality, redundancy, and reuse potential.
At the core of Harmonizer is a content assessor that applies textual analytics to examine the content, normalize the complex data structures and anomalies, and use the resulting data points to optimize the content comparison across the entire collection.
Finding redundant content is challenging. Many of the chapters or sections may address the same topics and should have exactly the same text, but they do not – and sometimes the duplicated content is simply inaccurate. Functioning as both a quality control and a reuse tool, Harmonizer groups portions of the content that are similar and potentially reusable, providing details which are useful in determining the sections that should be retained as is, normalized, removed or corrected.
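Grouping near-duplicate sections can be sketched as normalize, compare, then cluster. This is only a minimal illustration and not how Harmonizer works internally: the normalization rules, the greedy grouping strategy, the 0.8 threshold and the sample sections are all assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip punctuation and collapse whitespace before comparing."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def group_similar(sections, threshold=0.8):
    """Greedily group sections whose normalized text is similar.

    Each section joins the first group whose first member it resembles with
    a similarity ratio of at least threshold; otherwise it starts a new
    group. Groups are lists of indices into sections.
    """
    normalized = [normalize(s) for s in sections]
    groups = []
    for i, text in enumerate(normalized):
        for group in groups:
            representative = normalized[group[0]]
            if SequenceMatcher(None, representative, text).ratio() >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])  # no existing group was similar enough
    return groups


# Hypothetical manual sections, invented for illustration.
sections = [
    "Wear safety goggles before operating the machine.",
    "Wear safety goggles before operating the machine!",
    "Always wear safety goggles before operating the machine.",
    "Submit expense reports by the fifth of each month.",
]
groups = group_similar(sections)
```

The first three sections land in one group as candidates for harmonization, and the unrelated fourth section stays on its own. Pairwise comparison like this scales poorly, so a large collection would need an indexing or hashing step first.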
An example of a common business use case for Harmonizer is identifying the redundant sections in a library of training manuals when converting from a manual process to an automated Learning Management System (LMS). Materials are likely inconsistently developed and updated over long periods of time, by different authors. In the process of moving to an automated process, granular chunks of redundant content should be harmonized and isolated, so they can be referenced and included as necessary in the LMS.
insideBIGDATA: Explainability of AI systems is a hot topic right now. How can your customers understand the reasons behind your recommendations?
Tammy Bilitzky: AI encompasses a wide range of technology approaches, some of which have proven their value and potential while others are still in their infancy. Some are reliable, others less so.
For some functions, you may need to leverage third-party models. Black box ML classifiers that are publicly available to perform specific functions are typically complex and powerful, but they are difficult to understand, and relying on them for business decisions, content enrichment and operational functions requires a leap of faith. Simpler models can be understood more easily, but the flip side is that they are often weaker and inflexible, with limited predictive ability.
There are defined frameworks now to enable the proverbial having your cake and eating it too – generating clear explanations for third-party black-box functions that can be shared with a client without compromising predictive capabilities.
At DCL, the techniques that we are predominantly using are Machine Learning/Natural Language Processing and Computer Vision (CV). The algorithms are proven and the output is understandable to customers since they strive to replicate what we do manually when we analyze documents.
While clients may not have the expertise to understand the models and algorithms, they understand results – and our comparison tools demonstrate the accuracy of our AI classifiers and other techniques against large volumes of training material and production runs.
insideBIGDATA: What future directions are you planning to follow in terms of the AI/machine learning technology you’re using in your product/service?
Tammy Bilitzky: Our overarching goal is to continue to use AI and other emerging technologies to drive improved quality and eliminate manual effort, resulting in faster turnaround times and reduced costs for our clients. There is so much valuable data buried in paper and other unsearchable content formats – as we mine this data even faster, more accurately, and at a lower price point, the benefits will be immense.
To this end, we are leveraging machine learning to detect data anomalies in large corpuses of material. We are expanding our use of powerful semantic engines to understand the content as well as deep learning and neural networks to take advantage of the large volumes of content in publicly accessible sources to normalize, enrich and repair content.
A significant challenge we are actively addressing is the linking and interpretation of complex footnotes in all types of documents, including financial statements. Some footnotes are relatively straightforward while others require specific expertise even for a person to understand them. We are currently working on a prototype that uses machine learning algorithms such as pattern and text recognition to classify the footnotes by nature and complexity. Natural language processing semantics and syntax are then applied to analyze the language of the footnotes and their context, and to extract and label the relevant data points.
PDF extraction is a major focus area for DCL as so much content is delivered via PDF, sourced from a wide range of desktop publishing products. We are very familiar with the challenge areas in each of the leading PDF extraction tools and are in the process of applying machine learning and semantic analysis to both analyze and repair the extracted text, providing quality previously achievable only with costly, time-consuming manual effort.
As content repositories grow, the ability to dynamically link content based on explicit and implicit references, expressed in virtually unlimited textual patterns, is becoming increasingly valuable. We have made significant strides in this area using TensorFlow and envision this to be a significant growth opportunity.
Advanced image detection, classification and comparison, essentially the equivalent of our Harmonizer product applied to non-textual content, is a potentially powerful AI application. There are traditional uses for this technology such as plagiarism detection and signature comparison; adapting this to compare non-textual content across large bodies of documents, highlight reuse potential and detect potential quality issues would deliver significant value. Image classification and categorization allow us to efficiently process large volumes of documents, highlighting those of particular interest such as those containing redactions.
While we don’t know where the AI journey will lead us as we aggressively exploit and expand its capabilities, we are confident that with our skilled engineers and business experts, the opportunities to transform our products and services will continue to exceed our expectations and dramatically transform our operations – and those of our clients.