The Must-Haves of a Modern Data Prep Platform

In this special guest feature, Joe Hellerstein, Co-founder and Chief Strategy Officer at Trifacta, discusses how the challenge of data preparation sits squarely between the growth of BI and visualization tools and the specific data needed to fuel them. Efficient data preparation is key to alleviating new demand from business users. This article offers three key requirements that a data preparation platform should have. Joe is Trifacta’s CSO, Co-founder and Jim Gray Chair of Computer Science at UC Berkeley. His career in research and industry has focused on data-centric systems and the way they drive computing. Fortune Magazine included him in their list of 50 smartest people in technology, and MIT’s Technology Review magazine included his work on their TR10 list of the 10 technologies “most likely to change our world.”

With the advent of the self-service movement, BI and visualization tools have been on a steady trajectory of increased adoption and ease of use. Today, it might feel like each organizational department has its own tool of choice to visualize and analyze data, from Tableau to Microsoft and everything in between. At the same time that tools for data visualization and analysis are becoming accessible to a broader audience of business users, the demand for suitable data sets for input into these environments has grown accordingly.

The challenge of data preparation sits squarely between the growth of BI and visualization tools and the specific data needed to fuel them. Efficient data preparation is key to alleviating new demand from business users. BI vendors have recently woken up to this problem, and started to bundle lightweight data preparation features into their existing end-user BI technology. This might appear attractive, given the prevalence of BI tools in an organization. But the hard work of data prep—the “80%-of-the-data-lifecycle-is-spent-in-data-wrangling” work—probably should not be locked up within specific siloed BI tools. The use of those tools encourages users to stay hunkered down in their personal favorite BI solution, inhibiting collaboration and dividing resources. The more of your data lifecycle you put in your BI tools, the more your “BI tribes” become isolated from each other and the rest of your data culture.

As a result, it makes more organizational sense to treat data preparation as a first-class function in the organization, backed by an independent software platform. For a data preparation platform to meet the holistic needs of an organization, it must integrate broadly in two directions: downstream, with a range of BI and visualization tools that can be broadly categorized as “data consumers,” as well as upstream, with “data producers,” or platforms where data is stored and processed across an organization. That means there are three key requirements that a data preparation platform should have:

  • Self-Service
    To complement and integrate with data consumers, a modern data preparation platform needs to have an intuitive user experience that works for the users who know the data best. Users should be able to directly transform their data with no code required, and count on AI-assisted user experiences to ease them through traditionally technical aspects of the process. Identifying quality issues or outliers in the data should come naturally—the user should not be required to spend too much time digging through their data.
  • Governance
    While a data preparation platform should focus on self-service, it shouldn’t neglect the organizational requirements of data architecture in the process. A data preparation platform should offer a wide range of features to ensure that IT can maintain governance over organizational resources, while offloading the details of data preparation to users who know the data best. It should leverage existing data storage and analytics platforms, ensuring that IT staff can manage issues like identity, access control and resource management in a uniform manner. At the same time, it should provide end users with interfaces for self-service scheduling and collaboration, providing grassroots governance to tasks that don’t require IT intervention.
  • Independence
    Given the need for both self-service and governance, architecturally, a data preparation platform needs to be fully independent. It should be agnostic to storage format, with a wide range of connectors to storage and the ability to work with unstructured, semi-structured, and structured data. Additionally, it should be portable across runtime engines on premises and in the cloud, and compile down to many targets, including Apache Spark, AWS EMR, Google Cloud Dataflow or Microsoft HDInsight. Finally, it should provide a flexible data publishing API that can feed a wide variety of BI tools, analytics and AI systems, and various file formats.
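
To make the idea of engine portability concrete, here is a minimal sketch in Python of one preparation recipe, described once in an engine-neutral form, that can either be run directly in memory or compiled to a SQL string for another engine. Everything in it is a hypothetical illustration: the `Step` type, the two operation names, and the functions `run_in_memory` and `compile_to_sql` are assumptions made for this example and do not represent the API of any particular product.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Step:
    """One engine-neutral preparation step (hypothetical, for illustration)."""
    op: str             # "filter_not_null" or "rename"
    column: str         # column the step applies to (by its current name)
    new_name: str = ""  # only used by "rename"


def run_in_memory(recipe: List[Step], rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Target 1: execute the recipe directly on a list of row dictionaries."""
    out = [dict(r) for r in rows]
    for step in recipe:
        if step.op == "filter_not_null":
            out = [r for r in out if r.get(step.column) is not None]
        elif step.op == "rename":
            out = [
                {(step.new_name if k == step.column else k): v for k, v in r.items()}
                for r in out
            ]
        else:
            raise ValueError(f"unknown operation: {step.op}")
    return out


def compile_to_sql(recipe: List[Step], table: str, columns: List[str]) -> str:
    """Target 2: compile the same recipe to a single SQL SELECT statement."""
    names = {c: c for c in columns}  # source column -> current output name
    predicates: List[str] = []
    for step in recipe:
        source = next((s for s, n in names.items() if n == step.column), None)
        if source is None:
            raise ValueError(f"unknown column: {step.column}")
        if step.op == "filter_not_null":
            predicates.append(f"{source} IS NOT NULL")
        elif step.op == "rename":
            names[source] = step.new_name
        else:
            raise ValueError(f"unknown operation: {step.op}")
    select = ", ".join(s if s == n else f"{s} AS {n}" for s, n in names.items())
    sql = f"SELECT {select} FROM {table}"
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)
    return sql


# The recipe is written once and is independent of where it will run.
recipe = [
    Step("filter_not_null", "email"),
    Step("rename", "cust_id", "customer_id"),
]
```

The point of the sketch is the separation: the recipe holds only what the user wants done to the data, and each target decides how to do it. Supporting a new engine would then mean adding another function that reads the same recipe, without changing the recipe itself.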

Multi-purpose platforms with embedded data preparation capabilities may initially appear attractive for their one-stop-shop convenience, but they do not represent a sensible organizational solution. Organizations need to consider the whole of their data lifecycle—the needs of both data consumers and producers—before investing in the central task of data preparation. With an independent model of data preparation, producers and consumers of data can continue their rapid evolution while ensuring compatibility with new systems and products that emerge at both ends.