By Lovan Chetty, Cazena.
Attending analytics conferences exposes the range and sophistication of analytic techniques that people with the right skills can apply to data. An example is the EARL Boston conference, which focuses on how best to use the R programming language to produce analytic outcomes. At the most recent conference, I led a session based on many conversations my industry colleagues and I have had with companies trying to accelerate their analytics programs, and on a specific problem that arises over and over again: how two distinct groups, Data Engineers and Data Scientists, can work more collaboratively despite the stark differences between their skills.
- The Data Engineering teams typically understand the most efficient ways of curating data, storing data and operationalizing processes.
- The Data Science teams have a good grasp of math, statistics and how to derive insight from the data.
Many of them described extracting data from central systems, with varying degrees of pain and compliance, then spending time refactoring that data to fit the analysis they wanted to do, and only then starting the analysis itself. This works, but it is not the most efficient overall flow: it leads to costly duplication of effort, because multiple users may extract the same data and waste time repeating the same refactoring or transformations. For these reasons, it's not surprising that more organizations are trying to have both teams work on a single platform.
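To make the duplication described above concrete, here is a minimal sketch of the alternative: the cleanup is defined once in the shared store, and every analysis reads the same curated result. It is an illustration under stated assumptions, not the setup of any particular company. Python's built-in `sqlite3` module stands in for a central data store, and the table, view and function names (`raw_orders`, `curated_orders`, `revenue_by_region`, `order_count_by_region`) are all hypothetical.

```python
import sqlite3


def build_central_store() -> sqlite3.Connection:
    """Create an in-memory database that stands in for a central data store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, region TEXT, amount_cents INTEGER)"
    )
    # Hypothetical raw rows with inconsistent region labels.
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?)",
        [(1, " east ", 1250), (2, "EAST", 800), (3, "west", 4300)],
    )
    # The refactoring step is written once, centrally, as a view,
    # instead of being repeated in every analyst's local script.
    conn.execute(
        """
        CREATE VIEW curated_orders AS
        SELECT order_id,
               lower(trim(region)) AS region,
               amount_cents / 100.0 AS amount
        FROM raw_orders
        """
    )
    return conn


def revenue_by_region(conn: sqlite3.Connection) -> dict:
    """One analysis: total revenue per region, read from the curated view."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM curated_orders GROUP BY region"
    ).fetchall()
    return dict(rows)


def order_count_by_region(conn: sqlite3.Connection) -> dict:
    """A second analysis that reuses the same curated view."""
    rows = conn.execute(
        "SELECT region, COUNT(*) FROM curated_orders GROUP BY region"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    store = build_central_store()
    print(revenue_by_region(store))
    print(order_count_by_region(store))
```

Both analyses depend on the same normalization of `region`, but neither one re-implements it. In this sketch, a fix to the view reaches every consumer at once.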
The Challenges and Benefits of a Single Platform for Data Engineering & Data Science
While the concept of a single platform is a familiar topic in data strategy discussions, the flexibility of the cloud now makes it possible, though not necessarily easy. Ideally, everyone should be able to use their own tools and a variety of languages, all supported by a common underlying data and compute platform. This is challenging partly because it requires delivering secure access to data across a variety of teams and locations, and partly because it requires a common governance model across a disparate set of tools and processes.
With the new platform, the R users read data directly from the central data store (HDFS), which means that they no longer spend time sub-setting or sampling datasets and copying data to their local machines. In addition, the platform gives the Data Science team more choices for their analytics. It includes a Spark cluster, which the Data Science team can leverage through R packages like sparklyr. The new functions are particularly helpful for producing models that consume larger amounts of data, or for applying models to full datasets for downstream consumption. The Data Science team was very happy with these benefits.
The new platform also included some PySpark libraries, which allowed the Data Engineers to refactor and develop a process that was significantly more efficient. This dramatically improved data ingestion, which meant the Data Science team could run more time-sensitive, near-real-time analyses. They had not been able to do this in the past, as the data was typically a few hours old. A variety of new features like this enabled changes to operational processes that added up to significant time savings and efficiency gains.
The longest-lasting effect, however, was that the Data Engineering team now had more visibility into the datasets that the Data Science team was producing. The Data Engineers realized that some of the ways they prepared and manipulated data were not optimal for the Data Science team's processes. This kicked off a more collaborative effort between the two groups, and now the datasets curated and managed by the Data Engineering team are better structured for the Data Science team.
That means the company can produce analytic results more quickly, thanks to both better processes and newer tools, and it can now also tackle a much wider range of analytic problems. It's a single platform success story.
Bio: Lovan Chetty leads the product team for the Cazena data platform, and has significant experience working with companies to solve their data and analytics challenges.