Human-Guided Machine Learning is the Golden Path to Golden Records

In this special guest feature, Matt Holzapfel, Solutions Lead at Tamr, addresses how a bottom-up approach recognizes connections in the vast pools of data to identify not only the most reliable data for each entity in the ecosystem, but the means by which that data can reliably be updated — a far faster path to data mastering, and a better way to make sure that data stays golden. Matt leads efforts developing and delivering solutions that enable customers to quickly realize business benefit across multiple domains including procurement, supply chain, sales & marketing and more. Prior to joining Tamr, Matt held positions in Strategy at Sears Holdings and Strategic Sourcing at Dell, where he led the development and implementation of new analytical sourcing tools to significantly lower procurement costs. Matt has a BS in Mechanical Engineering from the University of Illinois at Urbana-Champaign and an MBA from Harvard Business School.

Every data consumer wants to believe that the data they are using to make decisions is complete and up-to-date. However, this is rarely the case. Customers change their address. Products experience revisions. New suppliers get on-boarded. All of these common events introduce data variety that can be a nightmare to address if your organization isn’t equipped with the right solutions.

It’s unrealistic to expect humans to monitor every single piece of data throughout the enterprise for changes. Only a few changes require human expertise to understand their implications; e.g., selecting an up-to-date billing address is difficult when a person purchases a second home, but not when a person relocates. The majority simply require a consistent set of guidelines and an ability to handle massive volumes of information, two things that machines do well. Combining human expertise with the scalability of machine learning provides the best of both worlds, enabling data consumers to finally make decisions based on reliable data.

The undeniable complexity of enterprise data

It’s no secret that enterprise data is a messy web of siloed sources and applications. It’s easy to point the finger at internal dynamics, such as “shadow IT” teams or a broader lack of governance. In reality, much of the complexity of enterprise data stems from external factors: customers begin to prefer engaging through new channels, or the market demands a different category of products. The dynamic nature of customers and markets means that data sources will constantly be evolving, and data complexity will only continue to grow.

This complexity can come in many forms. A customer may go by different names (e.g., “Kate O. Woods” and “K. Woods”), a product may have varying levels of detail across channels (e.g., the weight of a product may be listed on the company’s website but not distributors’), or a supplier may issue invoices under the parent company’s name while conducting all other business under the name of each subsidiary. Enterprises need a reliable way to overcome these complexities to gain a complete view of each of these entities and empower business stakeholders to make good, data-driven decisions.

Finding a formula that works: human-guided machine learning

Historically, IT organizations have tried applying rules alone to overcome this data variety. These approaches have massive upfront costs, requiring months or years to start generating meaningful results. Their siloed nature means that outcomes are disconnected from business requirements, making it almost impossible to generate accurate results that are trusted by the people who will be consuming the data. Once the rules are in production, very few people understand how they work or can modify them, leading to exorbitant maintenance costs.

A new approach has emerged that overcomes many of the limitations of a rules-only approach. This approach combines human expertise with machine learning to uncover patterns in the data that may be difficult for a human to codify. Humans provide feedback in the form of simple examples, such as ‘yes, these two records are a match’ or ‘no, these two records are not’, which the machine uses as input to a model that can be applied across all data sources.

The model that is developed considers all of the attributes available across the data sources, learning how to compare each attribute individually and as part of the whole record. For example, the model may learn that even if two supplier records have completely different business names but have similar addresses, similar sales contacts, and share a website, they are likely the same supplier.
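As a rough illustration of the idea above, the following minimal sketch compares two records attribute by attribute and fits a small logistic model to yes/no feedback. It is not a description of any particular product's implementation: the attribute names, the sample supplier records, the use of Python's difflib for string similarity, and the hand-written gradient descent are all assumptions made for this example.

```python
import math
from difflib import SequenceMatcher

# Hypothetical attribute names, chosen only for this illustration.
ATTRIBUTES = ("name", "address", "contact", "website")


def similarity(a, b):
    """Return a string similarity in [0, 1]; 0.0 if either value is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def features(rec_a, rec_b):
    """Compare two records attribute by attribute."""
    return [similarity(rec_a.get(k), rec_b.get(k)) for k in ATTRIBUTES]


def predict(weights, bias, x):
    """Logistic model: estimated probability that the pair is a match."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(labeled_pairs, epochs=500, lr=0.5):
    """Fit weights with plain stochastic gradient descent on yes/no labels."""
    weights, bias = [0.0] * len(ATTRIBUTES), 0.0
    for _ in range(epochs):
        for rec_a, rec_b, label in labeled_pairs:
            x = features(rec_a, rec_b)
            error = predict(weights, bias, x) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def rec(name, address, contact, website):
    """Build a record dict from positional values (illustration helper)."""
    return dict(zip(ATTRIBUTES, (name, address, contact, website)))


# Made-up expert feedback: 1 means "yes, same supplier", 0 means "no".
LABELED = [
    (rec("Acme Holdings", "45 Harbor Road", "Dana Reyes", "acme.example"),
     rec("Bolt Fasteners Inc", "45 Harbor Road", "Dana Reyes", "acme.example"), 1),
    (rec("Northwind Ltd", "7 Mill Lane", "Sam Patel", "northwind.example"),
     rec("Northwind Ltd", "7 Mill Lane", "Sam Patel", "northwind.example"), 1),
    (rec("Acme Supply", "12 Main St", "Lee Wong", "acmesupply.example"),
     rec("Acme Supply Co", "900 Lake Drive", "Tom Okafor", "asco.example"), 0),
    (rec("Zenith Tools", "3 Pine Court", "Ana Cruz", "zenith.example"),
     rec("Quartz Media", "88 Bay Avenue", "Omar Haddad", "quartz.example"), 0),
]

weights, bias = train(LABELED)
```

Calling predict(weights, bias, features(a, b)) on a new pair of records then gives an estimated match probability. With feedback like the above, a pair whose names differ but whose address, contact, and website agree should score as a likely match, which mirrors the supplier example in the text. A real system would need far more labeled pairs and better attribute-specific comparisons than a single string ratio.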

Unlike a rules-only approach, the model can compute a confidence level for each set of recommendations. When its confidence in a recommendation is low, it can proactively ask for feedback, further training the model and giving end users confidence that the data is being accurately mastered. Rules may get introduced during this process, but by leaving most of the heavy lifting to machines, results can be generated at unprecedented levels of speed and accuracy, so you can finally keep up with changes to your data.
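One common way to act on a confidence level is to accept or reject the clear cases automatically and send only the uncertain ones to a person. The sketch below assumes each candidate pair already has a match probability; the thresholds of 0.2 and 0.8, the function name, and the pair identifiers are hypothetical values picked for illustration.

```python
def triage(scored_pairs, low=0.2, high=0.8):
    """Split scored pairs into automatic decisions and a human review queue.

    scored_pairs is an iterable of (pair_id, match_probability) tuples.
    The thresholds are illustrative assumptions, not recommended settings.
    """
    decisions = {"match": [], "non_match": [], "review": []}
    uncertain = []
    for pair_id, probability in scored_pairs:
        if probability >= high:
            decisions["match"].append(pair_id)
        elif probability <= low:
            decisions["non_match"].append(pair_id)
        else:
            # Distance from 0.5 measures how unsure the model is.
            uncertain.append((abs(probability - 0.5), pair_id))
    # Ask about the least certain pairs first.
    decisions["review"] = [pair_id for _, pair_id in sorted(uncertain)]
    return decisions


scored = [("pair-a", 0.97), ("pair-b", 0.55), ("pair-c", 0.05), ("pair-d", 0.30)]
result = triage(scored)
```

The answers collected from the review queue can be added to the labeled examples and used to retrain the model, so that each round of feedback is spent where the model is least sure.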

Getting to a “single version of the truth”

The mastering process is important, but often, data consumers just want a “golden record” — a record for each entity that contains the best and most up-to-date information available about it. Once data is accurately mastered, business logic can be applied that selects the right information for each attribute and creates a “single version of the truth”.

To be effective, business logic must be applied flexibly across attributes: an external data source may be the authoritative source of billing address, but an internal CRM may be the better source for a phone number. Regardless of the specific logic, data consumers should be able to weigh in on it, so that they are bought in. This is much easier to do when their input has also been considered throughout the mastering process and they feel ownership over the results.
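Per-attribute business logic of this kind can be sketched as a source priority list for each attribute. The example below is a simplified assumption: the source names ("external_vendor", "crm"), the attribute names, and the rule that each source contributes at most one record per entity are all hypothetical.

```python
# Hypothetical priorities: for each attribute, sources listed most trusted first.
SOURCE_PRIORITY = {
    "billing_address": ["external_vendor", "crm"],
    "phone": ["crm", "external_vendor"],
}


def golden_record(cluster, priority=SOURCE_PRIORITY):
    """Build one golden record from records already matched to the same entity.

    Assumes each source appears at most once in the cluster. For every
    attribute, the first non-empty value in priority order wins.
    """
    by_source = {record["source"]: record for record in cluster}
    golden = {}
    for attribute, sources in priority.items():
        golden[attribute] = None
        for source in sources:
            value = by_source.get(source, {}).get(attribute)
            if value:
                golden[attribute] = value
                break
    return golden


cluster = [
    {"source": "crm", "billing_address": "1 Old Road", "phone": "555-0100"},
    {"source": "external_vendor", "billing_address": "9 New Avenue", "phone": None},
]
best = golden_record(cluster)
```

Here the billing address comes from the external source and the phone number from the CRM. Keeping the priorities in a plain table like SOURCE_PRIORITY is one way to let data consumers review and adjust the logic without touching the code that applies it.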

Data consumers understand that data is becoming increasingly complex, but can’t use that as an excuse for poor decision making. Chief Data Officers need to equip them with trusted, up-to-date data, which can only be achieved by bringing human insight together with machine learning.
