MapReduce has some certain shortcomes. First, this paradigm forces breaking all data processing procedures down into map and reduce, while some important operations are missing, e.g. join, filter, union etc. Second, MapReduce requires reads and writes to disk before and after map and reduce, which becomes the most inefficient bottleneck for iterative tasks, like machine learning. Third, only Java is supported. Fourth, MapReduce only supports batch processing, while interactive streaming data support is still missing.

Thus, Apache Spark is proposed to handle the above problem.

Spark Programming

All different processing components in Spark share the same abstraction, called RDD. An RDD is a data container that work like a “table” in an SQL database. RDDs are created by operation on data or reading files.

RDD Abstraction

An RDD is a read-only, partitioned collection of records created by deterministic transformations on stable storage or other RDDs. RDDs support persistence and partitioning.

Important properties include:

  • immutability
  • partitioned across workers
  • lazily computed
  • can be cached/persisted
  • recoverable using lineage on DAG.

Lineage: Fault Tolerance

Instead of replicating every intermediate dataset, Spark records how each RDD was derived. Spark uses DAG to track dependencies. Nodes are RDDs and edges are transformations. If a partition is lost, Spark recomputes only the lost partition by replaying transformation.

Action vs Transformation

An action is the final stage of the workflow. It triggers the execution of the DAG and returns the results to the driver (or writes data to HDFS/to files).

Unlike action, transformation create new RDDs lazily and does not trigger execution.

Narrow & Wide Dependencies

A narrow dependency means, each child partition depends on a small number of parent partitions (usually 11), e.g. map, filter

A wide dependency means, many child partitions depend on many parent partitions, e.g. groupByKey, join. Wide dependencies require shuffle.

This matters because

  • narrow deps can be pipelined.
  • wide deps create stage boundaries.
  • narrow deps is cheaper for recovery; wide deps may require recomputing or shuffling more data.

When Not to Use Spark?

  • for many simple cases, Apache MapReduce and Hive might be a appropriate option.
  • not designed for multi-user environment.

Spark Scheduler

Spark builds a DAG of stages. It pipelines narrow transformations inside a stage and cuts stages at wide deps/shuffles. It also uses data locality and cached partitions when scheduling tasks.