By Gregory Piatetsky, KDnuggets.
What was the largest dataset you analyzed / data mined?
This poll received 1108 votes, about 10% fewer than in 2016, but still a large enough sample. The results again show surprising stability, fitting a pattern that had already emerged in 2012: a majority of data scientists and analysts work with data in the Gigabyte range, while a small but notable segment works with web-scale data of over 100 Petabytes.
Note that the poll asks about the largest ever dataset, so a typical dataset analyzed is expected to be significantly smaller.
- Gigabytes still rule: The majority of answers (56% in 2018, 57% in 2016, 56% in 2015, 54% in 2014, 53% in 2013) are in the Gigabyte range. The overall median response has been between 11 and 100 GB (which comfortably fits on one laptop) in each year since 2012.
- Consistency: The shape of the curve is almost the same each year. In 2018 there were fewer responses in the under-10MB range and more in the 1-10GB range, but not significantly so.
- Petabyte Big Data Scientists still stand apart: There is a small but significant gap, with almost no answers in the 1-10 PB range, which separates analysts who work with Terabyte-size commercial data warehouses from those who work with 100+ petabyte web-scale data stores. See, for example, a recent story on Uber's current 100PB data warehouse.
- Academic researchers on par with Government, Industry: The estimated median for academic researchers is 90GB, on par with Government (60 GB) and Industry analysts (50 GB). The estimated median answer has increased a little for all segments in 2018.
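The article does not say how the estimated median was computed from the binned poll answers. One common approach, shown here only as a sketch, is to find the bin that contains the halfway response and interpolate inside it on a logarithmic scale, since the size ranges grow by powers of ten. The function name, the bin edges, and the counts below are all hypothetical and are not the poll's actual data.

```python
import math


def estimated_median_log(bins, counts):
    """Estimate the median of binned data, interpolating on a log10 scale.

    bins   -- list of (low, high) bin edges in bytes, sorted ascending
    counts -- number of responses that fell into each bin
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no responses")
    half = total / 2.0
    cumulative = 0
    for (low, high), n in zip(bins, counts):
        if n > 0 and cumulative + n >= half:
            # Fraction of this bin's responses needed to reach the halfway point.
            frac = (half - cumulative) / n
            log_low, log_high = math.log10(low), math.log10(high)
            # Assumption: responses are spread evenly in log-space within the bin.
            return 10 ** (log_low + frac * (log_high - log_low))
        cumulative += n
    raise ValueError("counts do not reach the halfway point")


# Hypothetical bins and counts, for illustration only (not the poll's data).
MB, GB, TB, PB = 1e6, 1e9, 1e12, 1e15
example_bins = [(MB, GB), (GB, TB), (TB, PB)]
example_counts = [20, 60, 20]
median_bytes = estimated_median_log(example_bins, example_counts)
print(f"Estimated median: {median_bytes / GB:.1f} GB")
```

Interpolating in log-space rather than linearly keeps the estimate from being pulled toward the top of a bin that spans several orders of magnitude.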
Fig. 1: KDnuggets Poll: Largest Dataset Analyzed, 2014-2018
2018 data is shown as a column, to stand apart from lines for previous years.
This poll also asked about employment type, and the breakdown was:
- Company or Self-Employed, 62% (was also 62% in 2016)
- Student, 17% (was 20% in 2016)
- Academia/University, 13% (was 10% in 2016)
- Government/non-profit, 4.8% (was 5.1% in 2016)
- Other, 3.2% (was 2.4% in 2016)
Fig. 2: KDnuggets Poll: Largest Dataset 2018, by Employment. Red line shows the estimated median
Circle size corresponds to the number of responses.
Regional trends show slightly more voters from Latin America, the Middle East, and Australia, and slightly fewer from the US. The numbers were:
- Europe, 34.9% (was 35.1%)
- US/Canada, 34.4% (was 36.9% in 2016)
- Asia, 15.6% (was 17%)
- Latin America, 6.9% (was 5.6%)
- Africa/Middle East, 4.9% (was 3.2%)
- Australia/NZ, 3.2% (was 2.3%)
Fig. 3: Largest Dataset Analyzed, by Employment for US/Canada, Europe, and Asia. Circle size corresponds to the number of responses
We got more responses from Asian "Company" Data Scientists for 100PB data than from US/Canada or Europe Data Scientists. We see a similar situation with Asian students.
Here are the results of past polls:
- Largest Dataset Analyzed Poll shows surprising stability, more junior Data Scientists, 2016
- Poll Results: Where is Big Data? For most, Largest Dataset Analyzed is in laptop-size GB range, 2015
- 2014 KDnuggets Poll Results: Largest Dataset Analyzed surprisingly stable
- 2013 KDnuggets Poll Results: largest dataset analyzed / data mined
- 2012 KDnuggets Poll: largest dataset you analyzed / data mined?