Daniel D. Gutierrez – Managing Editor, insideBIGDATA
insideBIGDATA: In the recent past, we’ve seen a distinct evolving role for data scientists in the enterprise. How do you see this rise in interest in these skill sets?
Shalini Agarwal: Today we are capturing the enormous amounts of data we produce, whether while driving a car or through the apps we use in our daily lives and work. All of this data needs to be not only captured but also processed, so that we can analyze it, extract insights, and learn from it to improve our applications and software.
It’s the processing that has evolved: we are crunching trillions of bytes of data every minute, and we need to do it as efficiently as possible. This requires building infrastructure and tools to support these large systems. As a result, we need more people working on processing the data and making our data systems efficient and reliable. Data scientists are juggling multiple jobs to get their work done; we can decentralize many of those subtasks and develop tools and automation to handle them.
insideBIGDATA: What’s your take on all the talk about the shortage of data scientists?
Shalini Agarwal: Yes, we need more data scientists to improve the intelligence of our systems, and the pace of improvement needs to be accelerated. One of the challenges is that we have been asking these highly skilled workers to also perform repetitive tasks like building data pipelines and cleaning and massaging the data. To accelerate innovation and improvement, we need to decentralize that work to other functions, support data scientists, and make them more productive. A large chunk of data scientists’ time is spent on tasks that could be automated with better systems and tooling.
insideBIGDATA: With some of the new tools available today, how can non-data scientists approach requirements previously satisfied by data scientists?
Shalini Agarwal: We have used data scientists to collect data, massage it, process it, and create charts and visuals that are easy to understand. Once a methodology is established, this whole workflow can be automated. At LinkedIn, we have built systems to check the quality of the data, we write our metrics and insights logic based on defined business rules (in UMP), and we have automated visualization tools like Raptor that not only create charts but also send scheduled emails right to your inbox. A non-data scientist can build this whole workflow with a low investment of time from a data scientist. And once a pattern is established, similar metrics and insights can be built by non-data scientists.
insideBIGDATA: Can you describe the Dr. Elephant tool LinkedIn created and open sourced to allow product managers to perform the same tasks today that would have required a data scientist with a PhD in the past?
Shalini Agarwal: As I mentioned before, efficient systems are the key to making data scientists more productive. Dr. Elephant makes it easy to identify jobs that are wasting resources, as well as jobs that can achieve better performance without sacrificing efficiency. Users of the system need not be data scientists, as long as they understand the use case and can tune the jobs based on Dr. Elephant’s recommendations. We open sourced Dr. Elephant last year, and activity on GitHub and the Dr. Elephant mailing list has been strong since day one.