An Overview of Apache Spark

MapReduce has several shortcomings. First, the paradigm forces every data processing job to be broken down into map and reduce steps, while common operations such as join, filter, and union are not provided as primitives. Second, MapReduce reads from and writes to disk before and after each map and reduce phase, which becomes a major bottleneck for iterative tasks such as machine learning. Third, its native API supports only Java. Fourth, MapReduce supports only batch processing; support for interactive queries and streaming data is missing.

Apache Spark was proposed to address these problems.

Spark Programming

All processing components in Spark share the same abstraction, called the RDD (Resilient Distributed Dataset). An RDD is a data container that works like a “table” in an SQL database.
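To make the idea of a shared data container more concrete, here is a minimal pure-Python sketch. It is not Spark's API: `ToyRDD`, `from_list`, and the round-robin partitioning are hypothetical names and choices made for illustration, and the toy runs eagerly on one machine. It only aims to show an immutable, partitioned collection whose operations (including filter and union) each return a new collection.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: an immutable, partitioned collection."""

    partitions: Tuple[tuple, ...]

    @staticmethod
    def from_list(data: list, num_partitions: int) -> "ToyRDD":
        # Deal the records round-robin into a fixed number of partitions.
        parts = tuple(tuple(data[i::num_partitions]) for i in range(num_partitions))
        return ToyRDD(parts)

    def map(self, fn) -> "ToyRDD":
        # Each partition is processed independently of the others.
        return ToyRDD(tuple(tuple(fn(x) for x in p) for p in self.partitions))

    def filter(self, pred) -> "ToyRDD":
        return ToyRDD(tuple(tuple(x for x in p if pred(x)) for p in self.partitions))

    def union(self, other: "ToyRDD") -> "ToyRDD":
        # A union simply keeps the partitions of both inputs side by side.
        return ToyRDD(self.partitions + other.partitions)

    def collect(self) -> list:
        # Flatten all partitions into one local list.
        return [x for p in self.partitions for x in p]


numbers = ToyRDD.from_list([1, 2, 3, 4, 5, 6], num_partitions=2)
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
combined = evens_squared.union(ToyRDD.from_list([100], num_partitions=1))
```

Because every operation returns a new `ToyRDD` and leaves its input untouched, `numbers` can be reused by several downstream computations, which is the property that lets different components share one abstraction.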

Fault Tolerance

Spark uses a DAG (directed acyclic graph) to track dependencies: nodes are RDDs and edges are transformations. To recover a lost node, Spark simply walks the DAG and performs the transformations that produced it again.
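The recovery idea can be sketched with a toy dependency graph in plain Python. This is an illustrative model under simplifying assumptions, not Spark's implementation: the `Node` class, its `cached` field, and the `log` argument are hypothetical, each node has a single parent, and whole datasets are recomputed rather than individual partitions.

```python
class Node:
    """Hypothetical DAG node: a dataset plus the transformation that produced it."""

    def __init__(self, name, parent=None, transform=None, source=None):
        self.name = name
        self.parent = parent        # edge to the upstream node (None for a source)
        self.transform = transform  # function applied to the parent's data
        self.source = source        # stable input data, only set on source nodes
        self.cached = None          # materialized result; may be lost on failure

    def compute(self, log):
        # Reuse the materialized result if it is still available.
        if self.cached is not None:
            return self.cached
        if self.parent is None:
            data = list(self.source)
        else:
            # Walk up the DAG, then re-apply this node's transformation.
            data = self.transform(self.parent.compute(log))
        log.append(self.name)  # record which nodes actually had to be computed
        self.cached = data
        return data


raw = Node("raw", source=[1, 2, 3, 4])
doubled = Node("doubled", parent=raw, transform=lambda xs: [x * 2 for x in xs])
large = Node("large", parent=doubled, transform=lambda xs: [x for x in xs if x > 4])

first_log = []
first = large.compute(first_log)

# Simulate a failure that loses the materialized results of two nodes.
doubled.cached = None
large.cached = None

recovery_log = []
recovered = large.compute(recovery_log)
```

After the simulated failure, `recovery_log` contains only the two lost nodes: the surviving `raw` result is reused, and the lost data is rebuilt purely from the recorded transformations rather than from a replica.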

Action

An action is the final stage of the workflow. It triggers the execution of the DAG and either returns the results to the driver or writes the data to storage such as HDFS or local files.
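The way an action triggers execution can be illustrated with a small pure-Python sketch. The `LazyDataset` class and the `traced_double` helper are hypothetical and exist only for this example; they mimic the behavior described above (transformations are recorded, actions run them) without using Spark itself.

```python
class LazyDataset:
    """Hypothetical lazy dataset: transformations are recorded, not executed."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # recorded transformations, in order

    # Transformations: return a new dataset with one more recorded step.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Replay the recorded steps over the source data.
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: execute the recorded steps and hand a result back to the caller.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def traced_double(x):
    calls.append(x)  # record every time the transformation really runs
    return x * 2


pipeline = LazyDataset([1, 2, 3]).map(traced_double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # nothing has executed yet
result = pipeline.collect()       # the action runs the whole pipeline
```

Building `pipeline` does not call `traced_double` at all; only `collect()` does. In this toy, a second action such as `count()` replays the steps from the source again, since nothing is kept between actions.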