By Jules S. Damji & Sameer Farooqui, Databricks.
Not a week goes by without a mention of Apache Spark in a blog, news article, or webinar on its impact in the big data landscape. Hardly a meetup or conference on big data or advanced analytics lacks a speaker expounding on some aspect of Spark: touting its rapid adoption, describing its development, or explaining its use cases in enterprises across industries.
And rightly so, for the Spark Survey 2015 showed that Spark’s growth is uncontested and unstoppable.
But what’s the allure? And how do you get started with a new computing platform? That is the burning question for any beginner. Consider these seven steps a gentle introduction to understanding Spark’s attraction and to mastering Spark—from concepts to coding.
Step 1: Why Apache Spark?
For one, Apache Spark is the most active open source data processing engine built for speed, ease of use, and advanced analytics, with over 1,000 contributors from more than 250 organizations and a growing community of developers and users. Second, as a general-purpose compute engine designed for distributed data processing at scale, Spark supports multiple workloads through a unified engine whose components are exposed as libraries with APIs in popular programming languages, including Scala, Java, Python, and R. And finally, it can be deployed in different environments, read data from various data sources, and interact with myriad applications.
Altogether, this unified compute engine makes Spark an ideal environment for diverse workloads—ETL, interactive queries (Spark SQL), advanced analytics (Machine Learning), graph processing (GraphX/GraphFrames), and streaming (Structured Streaming)—all running within the same engine.
In the subsequent steps, you will get an introduction to some of these components, but first let’s capture key concepts and key terms.
Step 2: Apache Spark Concepts, Key Terms and Keywords
In June this year, KDnuggets published Apache Spark Key Terms, Explained, which is a fitting introduction here. Add to that vocabulary the following Spark architectural terms, as they are referenced in this article.
Spark Cluster
A collection of machines or nodes, in the cloud or on-premise in a data center, on which Spark is installed. Among those machines are Spark workers, a Spark master (also the cluster manager in Standalone mode), and at least one Spark driver.
Spark Master
As the name suggests, the Spark master JVM acts as a cluster manager in the Standalone deployment mode, to which Spark workers register themselves as part of a quorum. Depending on the deployment mode, it acts as a resource manager and decides where and how many Executors to launch, and on which Spark workers in the cluster.
Spark Worker
The Spark worker JVM, upon receiving instructions from the Spark master, launches Executors on the worker node on behalf of the Spark driver. Spark applications, decomposed into units of tasks, are executed on each worker’s Executor. In short, the worker’s job is only to launch an Executor on behalf of the master.
Spark Executor
An Executor is a JVM container with an allocated amount of cores and memory on which Spark runs its tasks. Each worker node launches its own Executor, with a configurable number of cores (or threads). Besides executing Spark tasks, an Executor also stores and caches data partitions in its memory.
Spark Driver
Once it gets information from the Spark master about all the workers in the cluster and where they are, the driver program distributes Spark tasks to each worker’s Executor. The driver also receives the computed results from each Executor’s tasks.
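A hypothetical spark-submit invocation ties these terms together: the driver connects to the Standalone master, which asks workers to launch Executors with the requested cores and memory. The host name and application file below are placeholders, not from this article.

```shell
# Submit an application to a Standalone-mode cluster (placeholders throughout).
# --executor-cores:       cores (threads) per Executor
# --executor-memory:      memory allocated to each Executor JVM
# --total-executor-cores: cap on cores across the whole cluster
spark-submit \
  --master spark://spark-master:7077 \
  --executor-cores 2 \
  --executor-memory 4g \
  --total-executor-cores 8 \
  my_app.py
```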
SparkSession and SparkContext
As shown in the diagram, a SparkContext is a conduit to access all Spark functionality; only a single SparkContext exists per JVM. The Spark driver program uses it to connect to the cluster manager, communicate, and submit Spark jobs. It also lets you set Spark configuration parameters. And through the SparkContext, the driver can instantiate other contexts, such as SQLContext, HiveContext, and StreamingContext, to program Spark.
With Apache Spark 2.0, however, SparkSession provides a single, unified point of entry to all of the aforementioned functionality. As well as making it simpler to access Spark functionality, it also subsumes the underlying contexts used to manipulate data.
A recent blog post, How to Use SparkSessions in Apache Spark 2.0, explains this in detail.
Spark Deployment Modes
Spark supports four cluster deployment modes, each with its own characteristics with respect to where Spark’s components run within a Spark cluster. Of all the modes, local mode, running on a single host, is by far the simplest.
As a beginner or intermediate developer, you don’t need to know this elaborate matrix right away. It’s here for your reference, and the links provide additional information. Furthermore, Step 5 is a deep dive into all aspects of Spark architecture.
| Mode | Driver | Worker | Executor | Master |
|------|--------|--------|----------|--------|
| Local | Runs on a single JVM | Runs on the same JVM as the driver | Runs on the same JVM as the driver | Runs on the single host |
| Standalone | Can run on any node in the cluster | Runs on its own JVM on each node | Each worker in the cluster launches its own JVM | Can be allocated arbitrarily where the master is started |
| YARN (client) | Runs on a client, not part of the cluster | YARN NodeManager | YARN NodeManager’s container | YARN’s Resource Manager works with YARN’s Application Master to allocate containers on NodeManagers for Executors |
| YARN (cluster) | Runs within YARN’s Application Master | Same as client mode | Same as client mode | Same as client mode |
| Mesos (client) | Runs on a client machine, not part of the Mesos cluster | Runs on a Mesos slave | Container within a Mesos slave | Mesos master |
| Mesos (cluster) | Runs within one of the Mesos masters | Same as client mode | Same as client mode | Mesos master |
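The modes in the table above are selected at submission time through the `--master` and `--deploy-mode` flags. The commands below are a sketch with placeholder host names and application file, not a definitive setup.

```shell
# Local mode: driver, worker, and executor all share one JVM on this host.
spark-submit --master "local[4]" my_app.py

# Standalone mode: connect the driver to a running Spark master.
spark-submit --master spark://master-host:7077 my_app.py

# YARN client mode: driver runs here; executors run in YARN containers.
spark-submit --master yarn --deploy-mode client my_app.py

# YARN cluster mode: driver runs inside YARN's Application Master.
spark-submit --master yarn --deploy-mode cluster my_app.py

# Mesos: driver on the client, or inside the cluster with --deploy-mode cluster.
spark-submit --master mesos://mesos-master:5050 my_app.py
```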
Spark Apps, Jobs, Stages and Tasks
The anatomy of a Spark application usually comprises Spark operations, which can be either transformations or actions on your data sets, using Spark’s RDD, DataFrame, or Dataset APIs. For example, if in your Spark app you invoke an action such as collect() or take() on your DataFrame or Dataset, the action creates a job. A job is then decomposed into one or more stages; stages are further divided into individual tasks; and tasks are the units of execution that the Spark driver’s scheduler ships to Executors on the Spark worker nodes in your cluster. Often multiple tasks run in parallel on the same Executor, each processing a partition of the dataset in its memory.
In this informative part of the video, Sameer Farooqui elaborates on each of these distinct stages in vivid detail.