The 6 components of the Open-Source Data Science / Machine Learning Ecosystem; Did Python declare victory over R?

We find that 6 tools form the modern open source Data Science / Machine Learning ecosystem.

In May, we reported initial results of the 19th annual KDnuggets Software Poll: Python eats away at R: Top Software for Analytics, Data Science, Machine Learning in 2018: Trends and Analysis.

Fig. 1: Data Science, Machine Learning Top Tools Associations, 2018
The bar length corresponds to the absolute value of Lift1 (defined below), and the color is the value of lift (green for more Python, red for more R). The number before each tool is its rank in popularity in the KDnuggets 2018 Software Poll, e.g. Python was no. 1, RapidMiner no. 2, etc.

We note a group of 6 primary tools that together make the modern open source data science ecosystem: Python, Anaconda, scikit-learn, Tensorflow, Keras, and Apache Spark.

RapidMiner has a small negative association with all of the tools above and is not strongly associated with any other tool.

R has small positive associations with Apache Spark, SQL, and Tableau.

The second group that emerges consists of the 3 supporting tools for Data Science and Machine Learning, which are frequently used together: SQL, Excel, and Tableau.

We note that although the chart (Fig. 1) is symmetrical relative to the diagonal (the top right triangle is equal to the bottom left), the patterns are easier to see in the full chart than in half of it.

Lift Definition:
Lift (X & Y) = pct (X & Y) / ( pct (X) * pct (Y) )

where pct(X) is the share of users who selected X, expressed as a fraction (e.g. 0.5 for 50%).

Lift (X & Y) > 1 indicates that X and Y appear together more often than expected if they were independent,
Lift (X & Y) = 1 if X and Y appear together with the frequency expected if they were independent, and
Lift (X & Y) < 1 if X and Y appear together less often than expected (negatively correlated).

To make the differences from 1 easier to see, we define
Lift1 (X & Y) = Lift (X & Y) - 1
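
As an illustration, the definitions above can be sketched in a few lines of Python. This is only a minimal sketch, not the code used for the poll analysis: the function names and the toy rows are made up here, and it assumes each respondent is a dict of 0/1 flags, as in the CSV columns described at the end of this post.

```python
def lift(rows, x, y):
    """Lift(X & Y) = pct(X & Y) / (pct(X) * pct(Y)), with pct as a fraction."""
    n = len(rows)
    pct_x = sum(r[x] for r in rows) / n
    pct_y = sum(r[y] for r in rows) / n
    pct_xy = sum(r[x] * r[y] for r in rows) / n  # both flags set
    # Assumes both tools have at least one user; otherwise this divides by zero.
    return pct_xy / (pct_x * pct_y)


def lift1(rows, x, y):
    """Lift1(X & Y) = Lift(X & Y) - 1, so 0 means 'independent'."""
    return lift(rows, x, y) - 1


# Toy data (hypothetical, not from the poll): 4 respondents, 2 tools.
rows = [
    {"Python": 1, "Keras": 1},
    {"Python": 1, "Keras": 1},
    {"Python": 1, "Keras": 0},
    {"Python": 0, "Keras": 0},
]

print(round(lift(rows, "Python", "Keras"), 3))   # 1.333
print(round(lift1(rows, "Python", "Keras"), 3))  # 0.333
```

In the toy data, pct(Python) = 0.75, pct(Keras) = 0.5 and pct(Python & Keras) = 0.5, so Lift = 0.5 / 0.375, or about 1.33: the two tools appear together more often than independence would predict.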

Python vs R

Next we examine Python vs R.

Let with_Py(X) be the % of tool X users who also use Python, and with_R(X) the % of tool X users who also use R. To visualize how close each tool is to Python or R, we used a very simple measure, Bias_Py_R(X) = with_Py(X) - with_R(X), which is positive if the tool is used more with Python and negative if it is used more with R.

In Fig. 2, we charted the bias of the most popular tools (those with at least 100 votes), and as we can see, almost every tool is biased towards Python. The only 2 exceptions are IBM SPSS Statistics and SAS Base. For comparison, in a similar 2017 analysis there were 10 such tools, including SAS Base, Microsoft tools, Weka, RapidMiner, Tableau, and KNIME, and almost all of them have since become more used along with Python.
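
The measure can be sketched as follows. This is a hedged illustration rather than the original analysis code: the helper names and toy rows are invented, and the column names "Python" and "R language" are assumed to match the CSV columns listed at the end of this post.

```python
def used_with(rows, tool, other):
    """Percent of respondents using `tool` who also use `other`."""
    users = [r for r in rows if r[tool] == 1]
    if not users:
        return 0.0  # assumption: treat a tool with no users as 0%
    return 100.0 * sum(r[other] for r in users) / len(users)


def bias_py_r(rows, tool):
    """Bias_Py_R(X) = with_Py(X) - with_R(X), in percentage points."""
    return used_with(rows, tool, "Python") - used_with(rows, tool, "R language")


# Toy data (hypothetical): 4 Tableau users, 3 with Python and 2 with R.
rows = [
    {"Python": 1, "R language": 0, "Tableau": 1},
    {"Python": 1, "R language": 1, "Tableau": 1},
    {"Python": 0, "R language": 1, "Tableau": 1},
    {"Python": 1, "R language": 0, "Tableau": 1},
    {"Python": 0, "R language": 1, "Tableau": 0},
]

print(bias_py_r(rows, "Tableau"))  # 25.0
```

Here with_Py(Tableau) = 75% and with_R(Tableau) = 50%, so the bias is +25 points, i.e. towards Python.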

Fig. 2: KDnuggets 2018 Data Science, Machine Learning Poll: Python vs R bias

Did Python declare victory over R?

I don't think so. R is an excellent platform with tremendous depth and breadth, it is widely used for data analysis and visualization, and it still has about a 50% share. I expect R to be used by many data scientists for a long time, but going forward, I expect more development and energy around the Python ecosystem.

Big Data and Deep Learning

Big Data tools (Spark / Hadoop) were used by 33% of respondents in the KDnuggets 2018 Software Poll, exactly the same fraction as in 2017. This suggests that most Data Scientists work with medium or small data that does not require Hadoop / Spark, or that they use other data-in-the-cloud solutions.

However, the share of respondents using Deep Learning tools grew to 43% from 32%.

For each tool X, we compute how frequently it is used with Spark/Hadoop tools (vertical axis), and how frequently it is used with Deep Learning tools (horizontal axis).
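
A rough sketch of that computation is shown below. It is illustrative only: the function name and toy rows are invented, and the flag columns "With DL" and "With BD" are assumed to be named as in the CSV description at the end of this post.

```python
def dl_bd_affinity(rows, tool):
    """Return (% used with Deep Learning, % used with Big Data) for one tool.

    The first value is the horizontal coordinate and the second the vertical one.
    """
    users = [r for r in rows if r[tool] == 1]
    if not users:
        return None  # no users, so the tool cannot be placed on the chart
    n = len(users)
    with_dl = 100.0 * sum(r["With DL"] for r in users) / n
    with_bd = 100.0 * sum(r["With BD"] for r in users) / n
    return with_dl, with_bd


# Toy data (hypothetical): 4 Excel users.
rows = [
    {"Excel": 1, "With DL": 1, "With BD": 1},
    {"Excel": 1, "With DL": 1, "With BD": 0},
    {"Excel": 1, "With DL": 0, "With BD": 0},
    {"Excel": 1, "With DL": 0, "With BD": 0},
    {"Excel": 0, "With DL": 1, "With BD": 1},
]

print(dl_bd_affinity(rows, "Excel"))  # (50.0, 25.0)
```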

Here is a chart with the top tools (those with over 100 votes), excluding the Deep Learning and Big Data tools themselves.

Fig. 3: KDnuggets 2018 Data Science, Machine Learning Poll: Deep Learning vs Spark/Hadoop affinity

We note that Scala is the language most used with both Deep Learning and Big Data. The chart is heavy on the lower right side, with almost every tool being used more with Deep Learning tools than with Big Data tools.

Here is the link to the anonymized poll data in CSV format, with columns:
  • Nrand: record id (randomized, records not in order of voting)
  • region: usca: US/Canada, euro: Europe, asia: Asia, ltam: Latin America, afme: Africa/Middle East, aunz: Australia/New Zealand
  • Python: 1 if Votes (last column) includes Python, 0 otherwise
  • RapidMiner: 1 if Votes includes RapidMiner, 0 otherwise.
  • R language: 1 if Votes includes "R Language", 0 otherwise. We used "R Language" instead of R for ease of regex matching.
  • SQL Language: 1 if Votes includes "SQL Language", 0 otherwise.
  • Excel: 1 if Votes includes Excel, 0 otherwise.
  • Anaconda: 1 if Votes includes Anaconda, 0 otherwise.
  • Tensorflow: 1 if Votes includes Tensorflow, 0 otherwise.
  • Tableau: 1 if Votes includes Tableau, 0 otherwise.
  • scikit-learn: 1 if Votes includes scikit-learn, 0 otherwise.
  • Keras: 1 if Votes includes Keras, 0 otherwise.
  • Apache Spark: 1 if Votes includes Apache Spark, 0 otherwise.
  • With DL: 1 if Votes includes Deep Learning tools, 0 otherwise.
  • With BD: 1 if Votes includes Big Data tools, 0 otherwise.
  • ntools: number of tools in Votes
  • Votes: list of votes, separated by a semicolon ";"
Let me know what you find!
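
For readers who want a starting point, here is a minimal sketch for reading a file with this layout using only the Python standard library. The function names and the two inline sample rows are made up for illustration (the sample uses only a subset of the columns), and the exact header spellings in the real file may differ from those assumed here.

```python
import csv
import io


def parse_row(raw):
    """Convert one CSV record: 0/1 flags and counts to int, Votes to a list."""
    row = {}
    for name, value in raw.items():
        if name is None:  # extra unnamed fields, if any, are skipped
            continue
        name = name.strip()
        value = (value or "").strip()
        if name == "Votes":
            # Votes are separated by a semicolon ";"
            row[name] = [v.strip() for v in value.split(";") if v.strip()]
        elif value.isdigit():
            row[name] = int(value)
        else:
            row[name] = value
    return row


def load_poll(fileobj):
    """Read all records from an open CSV file object."""
    return [parse_row(r) for r in csv.DictReader(fileobj)]


# Hypothetical sample standing in for the downloaded file.
sample = io.StringIO(
    "Nrand,region,Python,R language,ntools,Votes\n"
    "1,usca,1,0,2,Python;Tensorflow\n"
    "2,euro,0,1,2,R Language;Tableau\n"
)
rows = load_poll(sample)

for r in rows:
    # Sanity check: ntools should equal the number of tools in Votes.
    print(r["region"], r["Votes"], r["ntools"] == len(r["Votes"]))
```

With a real download, the `io.StringIO` sample would be replaced by something like `open(path, newline="")`, where `path` is wherever the file was saved.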

  • Python eats away at R: Top Software for Analytics, Data Science, Machine Learning in 2018: Trends and Analysis
  • Emerging Ecosystem: Data Science and Machine Learning Software, Analyzed, 2017
  • New Leader, Trends, and Surprises in Analytics, Data Science, Machine Learning Software Poll, 2017