By Gregory Piatetsky, KDnuggets.
What was the largest dataset you analyzed / data mined?
This poll received 1240 votes, almost 3 times as many as in 2015, but the results show surprising stability. They fit a pattern that had already emerged in 2012, which suggests that the majority of data scientists and analysts do not work with really big data.
- Gigabytes still rule: The majority of answers (57% in 2016, 56% in 2015, 54% in 2014, 53% in 2013) are in the Gigabyte range. The overall median response has been between 11 and 100 GB (which comfortably fits on one laptop) in every year since 2012.
- More Junior Data Scientists: compared to 2015, we see a higher percentage of responses in ALL ranges under 100GB, and a lower percentage in ranges over 100GB, which suggests more junior Data Scientists coming into the industry (and taking part in this poll).
- Petabyte Big Data Scientists stand apart: There is a small but significant gap, with almost no answers in the 1-10 PB range, which separates analysts who work with Terabyte-size commercial data warehouses from those who work with multi-petabyte Internet-scale data stores.
- Government, Industry lead: The median Government and Industry analysts work with datasets an order of magnitude larger.
- US and Europe have the largest share of Terabyte-level analysts
Fig. 1: KDnuggets Poll: Largest Dataset Analyzed, 2012-2016
This poll also asked about employment type. The breakdown was:
- Company or Self-Employed, 62%
- Student, 20.2%
- Academia/University, 10.2%
- Government/non-profit, 5.1%
- Other, 2.4%
Fig. 2: KDnuggets Poll: Largest Dataset 2016, by Employment. The red line shows the estimated median.
Figure 3 below shows the distribution of largest dataset ranges by region, sorted by % of TB+ answers.
In US/Canada, 22% of analysts worked with TB+ datasets. Next come Europe (15%), AU/NZ (14%), Africa/Mideast (13%), and Latin America (10%). These numbers are consistently lower than in 2015, which, combined with the 3x increase in poll participation, suggests strong growth in new Data Scientists entering the field and working on regular-size data.
Fig. 3: KDnuggets Poll: Largest Dataset 2016, by Region, ordered by share of TB+ entries.
Regional participation was:
- US/Canada, 37%
- Europe, 35%
- Asia, 17%
- Latin America, 5.6%
- Africa/Middle East, 3.2%
- Australia/NZ, 2.3%
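To give a feel for the sample sizes behind these shares, here is a minimal sketch that converts the regional percentages into approximate response counts, using the 1240 total votes reported above. The function and variable names are illustrative, not part of the poll. The published percentages are rounded (they add up to 100.1%), so the counts are estimates only.

```python
TOTAL_VOTES = 1240  # total votes reported for the 2016 poll

# Regional shares in percent, as listed above
region_share_pct = {
    "US/Canada": 37,
    "Europe": 35,
    "Asia": 17,
    "Latin America": 5.6,
    "Africa/Middle East": 3.2,
    "Australia/NZ": 2.3,
}


def approx_counts(total, share_pct):
    """Convert rounded percentage shares into approximate response counts."""
    return {name: round(total * pct / 100) for name, pct in share_pct.items()}


region_counts = approx_counts(TOTAL_VOTES, region_share_pct)
for region, n in region_counts.items():
    print(f"{region}: ~{n} responses")
```

By this estimate the smallest regions are represented by only a few dozen responses each (roughly 29 for Australia/NZ), so their percentages are likely noisier than those for US/Canada or Europe.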
Fig. 4: KDnuggets Poll: Largest Dataset 2016, by Employment for US/Canada, Europe, and Asia. Circle size corresponds to the number of responses.
We note that the regions are mostly similar, but more data scientists in Asia than in Europe get to work with web-scale data. Some lucky students in every region also get to work with web-scale data.
Here are the results of past polls:
- Poll Results: Where is Big Data? For most, Largest Dataset Analyzed is in laptop-size GB range, 2015
- 2014 KDnuggets Poll Results: Largest Dataset Analyzed surprisingly stable
- 2013 KDnuggets Poll Results: largest dataset analyzed / data mined
- 2012 KDnuggets Poll: largest dataset you analyzed / data mined?