Apache Spark: Python vs. Scala

By Preet Gandhi, NYU Center for Data Science

Apache Spark is one of the most popular frameworks for big data analysis. Spark is written in Scala, which can be quite fast because it is statically typed and compiles in a known way to the JVM. Spark has APIs for Scala, Python, Java and R, but the first two are the most widely used. Spark ships no interactive Read-Evaluate-Print Loop (REPL) shell for Java, and R is not a general-purpose language. The data science community is divided into two camps: one prefers Scala, the other Python. Each language has its pros and cons, and the final choice should depend on the target application.

Performance

Scala is frequently cited as being over 10 times faster than Python, although the gap depends heavily on the workload. Scala runs on the Java Virtual Machine (JVM), which gives it a speed advantage over Python in most cases. Python is dynamically typed, which slows execution, and compiled languages are generally faster than interpreted ones. With Python, every call into the Spark libraries passes through an extra layer of processing, which slows performance further. In this scenario Scala works well even with a limited number of cores. Moreover, Scala is native to Hadoop because it is based on the JVM. Hadoop matters because Spark was built to run on top of Hadoop's file system, HDFS. Python interacts poorly with Hadoop services, so developers have to use third-party libraries (such as hadoopy). Scala interacts with Hadoop through Hadoop's native Java API, which is why it is very easy to write native Hadoop applications in Scala.
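
One source of the extra processing on the Python side is that records have to cross the boundary between the JVM and Python worker processes in serialized form. The toy sketch below imitates that round trip with the standard library's pickle module. The function names (apply_in_process, apply_across_boundary) and the sample rows are made up for illustration, and this is a simplified model under those assumptions, not Spark's actual code path.

```python
import pickle
import time


def apply_in_process(rows, func):
    """Apply func directly, as code running inside a single runtime would."""
    return [func(row) for row in rows]


def apply_across_boundary(rows, func):
    """Imitate a cross-runtime call: each row is serialized and deserialized
    on the way in, and each result is serialized and deserialized on the way out."""
    results = []
    for row in rows:
        received = pickle.loads(pickle.dumps(row))            # row shipped to the worker
        output = func(received)
        results.append(pickle.loads(pickle.dumps(output)))    # result shipped back
    return results


def double_score(row):
    """Hypothetical per-row transformation."""
    return (row[0], row[1], row[2] * 2)


def timed(label, fn, *args):
    """Run fn(*args), print the elapsed wall-clock time, and return the result."""
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - start:.4f} s")
    return result


rows = [(i, f"user_{i}", i * 0.5) for i in range(20_000)]
direct = timed("in-process", apply_in_process, rows, double_score)
shipped = timed("with serialization", apply_across_boundary, rows, double_score)
```

Both calls return the same rows; only the elapsed time differs, and the exact figures vary from machine to machine.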

Learning Curve

Both are functional and object-oriented languages with a similar syntax and thriving support communities. Scala may be somewhat harder to learn than Python because of its high-level functional features. Python is preferable for simple, intuitive logic, whereas Scala is more useful for complex workflows. Python has a simple syntax and good standard libraries.
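
To make the "functional and object-oriented" point concrete on the Python side, here is a small sketch that computes the same total in both styles. The Reading and ReadingLog names and the sample values are invented for this illustration.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Reading:
    sensor: str
    value: float


class ReadingLog:
    """Object-oriented style: state and behaviour live together."""

    def __init__(self):
        self._readings = []

    def add(self, reading):
        self._readings.append(reading)

    def total_above(self, threshold):
        total = 0.0
        for reading in self._readings:
            if reading.value > threshold:
                total += reading.value
        return total


def total_above(readings, threshold):
    """Functional style: filter, map, and reduce without mutating the input."""
    kept = filter(lambda r: r.value > threshold, readings)
    values = map(lambda r: r.value, kept)
    return reduce(lambda acc, v: acc + v, values, 0.0)


readings = [Reading("a", 1.5), Reading("b", 4.0), Reading("c", 6.5)]

log = ReadingLog()
for r in readings:
    log.add(r)

oo_total = log.total_above(2.0)
functional_total = total_above(readings, 2.0)
```

Both styles give the same answer. The functional version is the one that maps most directly onto Spark's own map, filter, and reduce operations.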

Concurrency

Scala has multiple standard libraries and can use multiple cores, which allows quick integration with the databases in big data ecosystems. Scala allows code to be written with multiple concurrency primitives, whereas Python's support for multithreading is limited: in the standard interpreter (CPython), the Global Interpreter Lock allows only one thread to execute Python code at a time. Thanks to its concurrency features, Scala allows better memory management and data processing. Python does, however, support heavyweight process forking, so parallel work is usually spread across separate processes. Whenever new code is deployed, more processes must be restarted, which increases the memory overhead.
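
For reference, Python's standard library does provide threads. In standard CPython builds, though, a Global Interpreter Lock lets only one thread execute Python bytecode at a time, so threads help with waiting on I/O rather than with CPU-bound work. The sketch below is a minimal thread-pool example; the function names and input lines are made up for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def count_words(line):
    """Return the word count and the name of the thread that did the work."""
    return len(line.split()), threading.current_thread().name


def count_words_threaded(lines, max_workers=4):
    """Fan the lines out to a pool of threads; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=max_workers,
                            thread_name_prefix="worker") as pool:
        return list(pool.map(count_words, lines))


lines = ["spark is written in scala", "python is concise", ""]
results = count_words_threaded(lines)
counts = [count for count, _ in results]
print(counts)
```

For CPU-bound work, the usual alternative is a process pool (for example concurrent.futures.ProcessPoolExecutor), which sidesteps the lock at the cost of one interpreter, and its memory, per process.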

Usability

Both are expressive and let us achieve a high level of functionality. Python is more user-friendly and concise. Scala is generally more powerful in terms of frameworks, libraries, implicits, macros, etc. Scala works well within the MapReduce framework because of its functional nature. Many Scala data frameworks follow abstract data types that are consistent with Scala's collections API. Developers only need to learn the basic standard collections, which makes it easy to get acquainted with other libraries. Spark is written in Scala, so knowing Scala lets you understand and modify what Spark does internally. Moreover, many upcoming features get their APIs in Scala and Java first, with the Python APIs following in later versions. For NLP, however, Python is preferred, as Scala does not have many tools for machine learning or NLP. Python is also often preferred for working with GraphFrames and MLlib (GraphX itself has no Python API). Python's visualization libraries complement PySpark, as neither Spark nor Scala has anything comparable.
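
The link between MapReduce and functional programming can be sketched in a few lines of plain Python: a map step that emits key/value pairs, a grouping step, and a reduce step that folds each group with a pure function. The phase names and sample sentences below are illustrative only and do not call any Spark or Hadoop API.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Group the emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Fold each key's values with a pure, associative function."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}


lines = ["Spark runs on the JVM", "PySpark runs on Python and the JVM"]
word_counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(word_counts)
```

Because the map and reduce functions have no side effects, a framework is free to run them on different machines and in any order, which is what makes the functional style such a natural fit.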

Code Restoration and Safety

Scala is a statically typed language, which lets us catch errors at compile time, whereas Python is dynamically typed. Python code is more prone to bugs whenever you change existing code, because type errors surface only at run time. Hence refactoring Scala code is easier than refactoring Python code.
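
A small, hypothetical refactoring accident shows the difference. In the sketch below a field has been renamed but one reference still uses the old name; Python loads the code without complaint and fails only when that line actually runs. The Trip class and its fields are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Trip:
    distance_km: float   # renamed from "distance" during a refactor
    duration_min: float


def average_speed_kmh(trip: Trip) -> float:
    # Stale reference to the old field name. Nothing flags it when the
    # module is loaded; it raises AttributeError only when executed.
    return trip.distance / (trip.duration_min / 60)


def try_average_speed(trip):
    """Call average_speed_kmh and report a runtime failure instead of crashing."""
    try:
        return average_speed_kmh(trip)
    except AttributeError as error:
        return f"runtime failure: {error}"


outcome = try_average_speed(Trip(distance_km=12.0, duration_min=30.0))
print(outcome)
```

The equivalent mistake on a Scala case class would be rejected by the compiler before the job ever ran. In Python, an optional static checker such as mypy can flag the same line, but only if the code carries type hints and the checker is actually run.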

Conclusion

Python is slower but very easy to use, while Scala is the fastest and moderately easy to use. Scala provides access to the latest features of Spark, as Apache Spark is written in Scala. The choice of language for programming in Apache Spark depends on the features that best fit the project's needs, as each has its own pros and cons. Python is more analytics-oriented while Scala is more engineering-oriented, but both are great languages for building data science applications. Overall, Scala would be more beneficial for utilizing the full potential of Spark. The arcane syntax is worth learning if you really want to do out-of-the-box machine learning over Spark.

Bio: Preet Gandhi is an MS in Data Science student at the NYU Center for Data Science. She is an avid big data and data science enthusiast. She can be reached at pg1690@nyu.edu.
