Apache Spark Introduction for Beginners

An extensive introduction to Apache Spark, including a look at the evolution of the product, use cases, architecture, ecosystem components, core concepts and more.

By Vikash Kumar, Tatvasoft.com.au

Businesses are utilizing Hadoop broadly to examine their informational indexes. The reason is that the Hadoop system depends on a basic programming model - MapReduce and it empowers a processing arrangement that is versatile, adaptable, blame tolerant and financially savvy.

But, the principal concern is to keep up speed in handling vast datasets as far as holding up time among inquiries and holding up the time to run the program. The below snapshot clearly justifies the limitation.

Apache Spark Introduction Figure 1

The answer the question “How to get over limitations of Hadoop MapReduce?” is APACHE SPARK. Don’t interpret that Spark and Hadoop are Competitors.

Evolution with Apache Spark

Spark was presented by Apache Software Foundation for accelerating the Hadoop computational registering programming process and overcoming its limitations. Rumors around suggest that Spark is nothing but an altered rendition of Hadoop and isn't dependent upon Hadoop. But the fact is "Hadoop is one of the approaches to implementing Spark, so they are not the competitors they are compensators."

 The below snapshot clearly justifies how Spark processing is rendering the limitation of Hadoop.

Apache Spark Introduction Figure 2

Apache Spark is an exceptionally a cluster computing technology, intended for quick computation. It depends on Hadoop MapReduce and it stretches out the MapReduce model to effectively utilize it for more kinds of computations, which incorporates intuitive questions and stream handling. The fundamental element of Spark is its in-memory clustering that expands the preparing pace of an application.

What Apache Spark is About

Apache Spark is a lightning fast cluster computing system. It provides the set of high-level API namely Java, Scala, Python, and R for application development. Apache Spark is a tool for speedily executing Spark Applications. Spark utilizes Hadoop in two different ways – one is for Storage and second is for Process handling. Just because Spark has its own Cluster Management, so it utilizes Hadoop for Storage objective.

Spark is intended to cover an extensive variety of remaining loads, for example, cluster applications, iterative calculations, intuitive questions, and streaming. Aside from supporting all these remaining tasks at hand in a particular framework, it decreases the administration weight of keeping up isolated apparatuses.

Spark is one of Hadoop's sub venture created in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was Open Sourced in 2010 under a BSD license. It was given to Apache programming establishment in 2013, and now Apache Spark has turned into the best level Apache venture from Feb-2014. And now the results are pretty booming.

Survey & Real-time Use Cases

Market rules and big agencies already tend to use Spark for their solutions. Flabbergast to know that the list includes - Netflix, Uber, Pinterest, Conviva, Yahoo, Alibaba, eBay, MyFitnessPal, OpenTable, TripAdvisor and much more. It can be said that Apache Spark Use Case stretch is spread right from Finance, Healthcare, Travel, e-commerce to Media & Entertainment industry.

Time is not back when many more domains examples will roll-out for using Spark in a myriad of ways.

Let's understand more about Apache Spark architecture, components, and features - that will absolutely witness why Spark is adopted by such a large community.

The numbers will surely awestruck you of the poll survey regarding why companies are intended to use the framework like Apache Spark for in-memory computing?

Apache Spark Introduction Figure 3

Apache Spark Architecture

Spark is accessible, intense, powerful and proficient Big

  1. Standalone - The arrangement implies Spark possesses the place on the top of HDFS(Hadoop Distributed File System) and space is allotted for HDFS, unequivocally. Here, Spark and MapReduce will run one next to the other to covering all in the form of Cluster.
  2. Hadoop Yarn - Hadoop Yarn arrangement implies, basically, Spark keeps running on Yarn with no pre-establishment or root get to required. It incorporates Spark into the Hadoop environment or the Hadoop stack. It enables different parts to keep running on the top of the stack having an explicit allocation for HDFS.
  3. Spark in MapReduce - Spark in MapReduce is utilized to dispatch start work notwithstanding independent arrangement. With SIMR, the client can begin Spark and uses its shell with no regulatory access.

Apache Spark Ecosystem Components

Apache Spark guarantee for quicker information handling and also simpler advancement is conceivable only because of Apache Spark Components. All of them settled the issues that happened while utilizing Hadoop MapReduce.  Now, let's talk about each Spark Ecosystem Component one by one -

Apache Spark Introduction Figure 5

1. Apache Spark Core
Spark Core is the basic general execution engine for the Spark platform that all other functionalities are based upon. It gives In-Memory registering and connected datasets in external storage frameworks.

2. Spark SQL

Spark SQL is a segment over Spark Core that presents another information abstraction called SchemaRDD, which offers help for syncing structured and unstructured information.

3. Spark Streaming

Spark Streaming use Spark Core's quick scheduling ability to perform Streaming Analytics. It ingests information in scaled-down clusters and performs RDD (Resilient Distributed Datasets) changes on those small-scale groups of information.

4. MLlib (Machine Learning Library)

MLlib is a Distributed Machine Learning structure above Spark in view of the distributed memory-based Spark architecture. It is, as indicated by benchmarks, done by the MLlib engineers against the Alternating Least Squares (ALS) executions. Spark MLlib is nine times as rapid as the Hadoop disk version of Apache Mahout (before Mahout picked up a Spark interface).

5. GraphX

GraphX is a distributed Graph-Processing framework of Spark. It gives an API for communicating chart calculation that can display the client characterized diagrams by utilizing Pregel abstraction API. It likewise gives an optimized and improved runtime to this abstraction.

6. Spark R

Essentially, to utilize Apache Spark from R. It is R bundle that gives light-weight frontend. In addition, it enables information researchers to break down expansive datasets. Likewise permits running employments intuitively on them from the R shell. In spite of the fact that, the main thought behind SparkR was to investigate diverse methods to incorporate the ease of use of R with the adaptability of Spark.

Apache Spark Abstractions & Concepts

In the below section, it is briefly discussed regarding the abstractions and concepts of Spark -

  • RDD (Resilient Distributed Dataset) - RDD is the central and significant unit of