A recent Forbes article, 10 Predictions for AI, Big Data, and Analytics in 2018, states that data engineer will become the hot new job title, displacing its sibling role of data scientist. Gil Press goes on to write that on Indeed.com, 13% of data-related job postings were for data engineers and less than 1% were for data scientists.
Intrigued, I looked at the data engineer job postings on LinkedIn from leading data-driven companies like Amazon and Facebook. Strong data warehouse skills, a thorough knowledge of extract, transform, load (ETL) processes, and expertise in constructing data pipelines stood out as the basic qualifications of an ideal data engineer.
Who is a Data Engineer Anyway?
If construction engineering deals with the design, planning, construction, and management of physical infrastructure like buildings, roads, and tracks, data engineering applies the same discipline to data.
A data engineer plans, designs, constructs, and maintains a reliable architecture for a steady flow of clean, structured data that is ready for further analysis and viable for a production environment.
Data engineering is gaining prominence because organizations are choked by a deluge of data, ranging from logs in multiple formats to valuable business information lying unstructured across the vast Internet.
As data scientists and citizen data scientists with statistical and programming flair began proliferating, their common pain point lay in managing and maintaining the enormous volume of data. Data scientists who had to analyze data and build models spent 80% of their time finding and cleaning data. That’s when the need to split their responsibilities arose, giving rise to a new breed of macho data engineers.
A data engineer comes to the rescue by understanding the data a business needs, identifying relevant new data sources, extracting the data in usable formats, making sure the data is error-free, and loading it for data scientists and analysts to work on.
The Data Engineering Tool-set
The work of a data engineer often overlaps with that of a data architect, a database administrator, and a software engineer, so a working knowledge of each of their skill sets is desirable. While a data architect or administrator is confined to planning and maintaining the data infrastructure, a data engineer spans those roles to deliver data in a palatable form, from its origin all the way to its final analysis.
Suitable skills for a data engineer would thus include:
- Proficient coding skills in R or Python
- Strong SQL skills
- Hadoop-based technologies like MapReduce, Hive, and Pig
- ETL and Data Warehousing expertise
Beyond the above, data engineers should identify and equip themselves with new tooling that improves the scalability of conventional ETL processes. Following the parallel processing approach, data pipelines are being built to copy data, move it to a storage solution, and reformat and join it.
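To make the copy, move, reformat, and join steps concrete, here is a minimal sketch in Python using only the standard library. The sample data, column names, and function names are hypothetical, and the steps run sequentially rather than in parallel; a production pipeline would read from real sources and distribute the work.

```python
import csv
import io
import sqlite3

# Hypothetical raw extracts; in practice these would come from files, APIs, or logs.
ORDERS_CSV = "order_id,customer_id,amount\n1,10, 25.50\n2,11,4.00\n3,10,\n"
CUSTOMERS_CSV = "customer_id,name\n10,Ada\n11,Grace\n"


def extract(raw_text):
    """Copy rows out of a CSV source into plain dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(orders, customers):
    """Reformat fields and join each order to its customer name."""
    names = {c["customer_id"]: c["name"] for c in customers}
    cleaned = []
    for row in orders:
        amount = row["amount"].strip()
        if not amount:  # drop rows with a missing amount
            continue
        cleaned.append(
            (int(row["order_id"]), names.get(row["customer_id"]), float(amount))
        )
    return cleaned


def load(rows, conn):
    """Move the prepared rows into a storage table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Run extract, transform, and load in sequence; return the loaded rows."""
    rows = transform(extract(ORDERS_CSV), extract(CUSTOMERS_CSV))
    load(rows, conn)
    return rows
```

Calling run_pipeline(sqlite3.connect(":memory:")) loads the two complete orders into an in-memory table and skips the order with no amount. Keeping extract, transform, and load as separate functions makes each step easy to test and to swap out later.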
As multiple data pipelines begin to pop up, open source workflow management tools like Airflow and Luigi are available to create and monitor them. Knowledge of these tools is therefore an added advantage. Data engineers can also experiment with machine learning to automate data pipeline processes.
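The core idea behind workflow managers is that tasks run in dependency order and each task's outcome is tracked. The sketch below illustrates that idea with Python's standard graphlib module (Python 3.9+). It is not the Airflow or Luigi API, and run_workflow and the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, dependencies):
    """Run tasks in dependency order and record each task's status.

    tasks: maps a task name to a callable taking no arguments.
    dependencies: maps a task name to the set of task names it depends on.
    """
    status = {}
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, ())
        if any(status[dep] != "success" for dep in upstream):
            status[name] = "skipped"  # an upstream task did not succeed
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status


# Hypothetical three-step pipeline: transform waits for extract, load waits for transform.
example_dependencies = {"transform": {"extract"}, "load": {"transform"}}
```

Given a dictionary of callables named extract, transform, and load, run_workflow runs them in that order and returns a status for each. Real workflow tools add scheduling, retries, and a monitoring interface on top of this basic ordering.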
Data Preparation – The Main Criteria
The cleaner and higher in quality the data, the better the modeling, and hence the better the insight derived from the trained model.
David Bianco, a data engineer at Urthecast, explains that the ultimate aim of a data engineer is to provide clean, usable data to whoever may require it. This process of collecting, cleaning, processing, and consolidating data is referred to as data preparation, or data wrangling or munging.
Data preparation addresses two main issues in data analysis.
The Small (No) or Big Data Problem: Data engineers are expected to put on their curiosity glasses and look for new and novel sources of data both within and outside their companies. Without ample data sources, analysts and data scientists will find it difficult to train their models. The opposite can also be problematic, as large datasets can be quite hard to work with, and the adage “garbage in, garbage out” is a harsh reality in data science.
The Messy Data Problem: Once data sources are identified, metadata must be cataloged and organized, and data extraction methods must be defined. Maxime Beauchemin, a data engineer at Airbnb, calls data engineers the “librarians” of data warehouses, as they get their hands dirty transforming and structuring messy data. Conflicting nomenclature and inconsistent data slow the entire flow of data through an organization, delaying valuable insights.
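As a small illustration of taming conflicting nomenclature, the following Python sketch maps source-specific column names to one canonical name and tidies string values. The alias table and function names are hypothetical; a real catalog would be driven by the organization's own metadata.

```python
# Hypothetical mapping from source-specific column names to one canonical name.
COLUMN_ALIASES = {
    "cust_id": "customer_id",
    "customerid": "customer_id",
    "customer id": "customer_id",
    "sign_up": "signup_date",
    "signupdate": "signup_date",
}


def canonical_name(column):
    """Lower-case and trim a column name, then resolve known aliases."""
    key = column.strip().lower()
    return COLUMN_ALIASES.get(key, key)


def normalize_record(record):
    """Rename columns to canonical names and trim string values.

    Strings that are empty after trimming become None so that missing
    values are represented consistently.
    """
    cleaned = {}
    for column, value in record.items():
        if isinstance(value, str):
            value = value.strip() or None
        cleaned[canonical_name(column)] = value
    return cleaned
```

For example, a record with the keys " Cust_ID " and "SignUpDate" comes out with the keys "customer_id" and "signup_date", so records from different sources can be consolidated under one schema.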
In its crude form most data may seem insignificant, but refining and polishing it produces shimmering nuggets of insight.
Easing Out the Pangs of Data Preparation for Data Engineers
Data preparation may seem tedious, but with the right automation and tools it will consume less time in the future. To perform their role efficiently, data engineers are encouraged to look for opportunities to automate and abstract most of their workload. Expertise in R or Python programming comes in very handy for these automation efforts.
The data canopy is expanding like never before and becoming increasingly interesting but also chaotic. Data engineers are entrusted with the responsibility of clearing the cluttered data ecosystem and providing a slick channel beneficial to all.
Start laying those data pipes and save the lives of drowning analysts and data scientists!