Data, after all, is at the heart of digital transformation. Companies must be able to make sense of and leverage all of the data generated by people, places, and things—in whatever combination—in order to make and quickly act on strategic decisions.
Some companies are using data lakes as a way to integrate disparate data. A data lake is certainly better than a data silo, but it may not be the most effective model for companies striving for digital transformation. To dive deeper into data lakes and potential alternatives, I recently spoke with Joe Pasqua, Executive Vice President of Products at MarkLogic. Our discussion touched on a number of related issues, including the importance of effective data integration as organizations work to implement initiatives as broad as digital transformation and as specific as compliance with the new General Data Protection Regulation (GDPR), which goes into effect this May.
insideBIGDATA: Let’s begin with a definition of a data lake.
Joe Pasqua: Picture all of the different systems in your business – sales systems, marketing systems, support systems, and so on. Each of them contains important data, but they’re all separate. Conceptually, a data lake is a place where data from all of these systems can flow so that it can be used together. Data lakes can typically store many different types of data, which accommodates many different source systems.
insideBIGDATA: We’ve been hearing a lot about data lakes in the context of the coming GDPR regulations. What’s the connection?
Joe Pasqua: The General Data Protection Regulation is an EU data privacy mandate that requires organizations to strictly manage customer data in a way that supports fundamental data rights for EU citizens, especially as related to access and portability. GDPR, which goes into effect in May, applies to businesses in the EU, but also to organizations outside the EU that process personal data for the purpose of offering goods and services to the EU or monitoring the behavior of data subjects within the EU.
In other words, just about any company that engages in any kind of e-commerce is on the hook for GDPR compliance. This means they have to see and manage all the customer data in their possession, regardless of the source system. Breaking down these data silos is very hard, so the concept of a data lake, which serves as a central repository and provides a holistic view of all data, is quite appealing.
insideBIGDATA: What are the advantages and disadvantages of the data lake model?
Joe Pasqua: Many organizations are drawn to this approach of “silo-busting” because data lakes seem easy to implement and they provide operational isolation by moving data and computation to a separate infrastructure. This isolation means that activity on the data lake, such as batch analytics, doesn’t impact production activities like transaction processing. Indeed, data lakes are becoming more common: MarketsandMarkets predicts the data lakes market will increase from $2.53 billion in 2016 to $8.81 billion by 2021, a compound annual growth rate of 28 percent.
With all that said, there are often severe drawbacks when companies move from the concept to reality. Many find that instead of a data lake, they are left with a data swamp. There are a few primary reasons for this.
First, traditional technologies used to build data lakes, such as Hadoop or Amazon S3, provide no indexing or harmonization of the data they store. Users of the data lake are left to sift through massive quantities of information on their own. Typically, they end up cobbling together a set of technologies (perhaps HBase, Solr, or other Amazon services) to help them find, harmonize, and process their data. Harmonization must deal with differences in the structure of data, terminology differences, and even semantic differences.
These efforts often fail to meet the business requirements, and even when they succeed, companies have just signed up for an ongoing development and support burden they didn’t expect.
Second, part of the allure of data lakes is simple, unfettered access to huge volumes of data. This sounds great for a data scientist, but is a nightmare for the governance group. Who has access to which parts of the data lake? Where did the data come from? Under what terms was it collected? What is the quality of the data? Data lakes can actually make things worse for regulatory compliance rather than better.
Finally, the line between analytical and operational systems is getting blurrier every day. Businesses need to make decisions in the moment, not just as a batch process. There are times when it would be really nice for the data lake to have transactional capabilities, but traditional data lake technologies aren’t built for that.
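To make the harmonization problem mentioned above more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the two source systems, their field names, the sample records, and the helper names (`from_sales`, `from_support`, `normalize_name`) are all hypothetical and do not come from any particular product. It shows one customer arriving from two systems with a different structure (flat versus nested), different terminology (country names and name formats), and different semantics (a value stored in cents versus whole currency units).

```python
from dataclasses import dataclass

# Hypothetical records from two source systems (illustrative assumptions only).
SALES_RECORDS = [
    {"cust_id": "C-001", "full_name": "Ada Lovelace",
     "country": "UK", "ltv_cents": 125000},
]
SUPPORT_RECORDS = [
    {"customer": {"id": "c-001", "name": "Lovelace, Ada"},
     "region": "United Kingdom", "lifetime_value": 1250.0},
]

# Terminology differences: several spellings map to one ISO 3166-1 code.
COUNTRY_CODES = {"uk": "GB", "united kingdom": "GB"}


@dataclass(frozen=True)
class Customer:
    """The harmonized shape that every source record is mapped into."""
    customer_id: str
    name: str
    country_code: str
    lifetime_value: float  # whole currency units, not cents
    source: str


def normalize_name(raw: str) -> str:
    """Turn 'Last, First' into 'First Last'; leave other names as they are."""
    if "," in raw:
        last, first = (part.strip() for part in raw.split(",", 1))
        return f"{first} {last}"
    return raw.strip()


def from_sales(rec: dict) -> Customer:
    """Map a flat sales record; the value is stored in cents (semantic difference)."""
    return Customer(
        customer_id=rec["cust_id"].upper(),
        name=normalize_name(rec["full_name"]),
        country_code=COUNTRY_CODES[rec["country"].lower()],
        lifetime_value=rec["ltv_cents"] / 100,
        source="sales",
    )


def from_support(rec: dict) -> Customer:
    """Map a nested support record (structural difference)."""
    return Customer(
        customer_id=rec["customer"]["id"].upper(),
        name=normalize_name(rec["customer"]["name"]),
        country_code=COUNTRY_CODES[rec["region"].lower()],
        lifetime_value=float(rec["lifetime_value"]),
        source="support",
    )


harmonized = ([from_sales(r) for r in SALES_RECORDS]
              + [from_support(r) for r in SUPPORT_RECORDS])

for customer in harmonized:
    print(customer)
```

Even in this toy case, every new source system needs its own mapping function and its own entries in the lookup table, which hints at the ongoing development and support burden described above.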
insideBIGDATA: Does this mean that any investment already made in data lakes is money down the drain?
Joe Pasqua: No, data lakes can play a valuable role for businesses, but it’s important that they are used in the right place in the overall data fabric. They should not be the primary landing point for data from multiple systems because they lack the governance, security, and metadata features that are required. They should not be used when the application really calls for a database. They just weren’t built to do that. They can be quite valuable as an analytical tool once they are supplied with curated, governed data. Not surprisingly, that was their original intent – as an aggregated replacement for data marts.
Companies need to be careful to use data lakes appropriately, or they will throw good money after bad.
insideBIGDATA: If data lakes can’t take this central role, is there an alternative?
Joe Pasqua: Yes, an operational data hub. A data hub has many conceptual similarities to a data lake; most importantly, it also allows disparate data from multiple source silos to be collected in one place. Unlike data lakes, operational data hubs focus on data discovery and harmonization; security and governance; and real-time operational capabilities.
Organizations that implement an operational data hub can expect an active, integrated data store/interchange that provides a single, unified, 360-degree view of data. Achieving that requires the right type of underlying database. Today’s enterprise data is an amalgam of thousands of different relational schemas alongside office documents, PDFs, message exchange formats, mobile data, digital metadata, geospatial information, and knowledge graphs. Because of this wide variance, it makes sense to build an operational data hub on a database built to handle all types of data from almost any data source. This is where an enterprise NoSQL database fits the bill because it can ingest any type of data and eliminate time-consuming data wrangling and ETL processes that can take years to implement and cost millions to maintain, which are all inherent weaknesses of traditional relational databases.
insideBIGDATA: What’s the upshot for enterprise stakeholders?
Joe Pasqua: Well, from both a competitive and compliance standpoint, there’s no question that companies need to break down data silos and harmonize the disparate data in their possession (now and in the future). The priority is clear, which makes it easy to figure out what to do but not necessarily how to do it. Companies need to look beyond the simple data lake model to find a solution that will help them derive new value for the business while at the same time meeting ever-growing regulatory requirements, such as GDPR.