MapReduce has some certain shortcomes. First, this paradigm forces breaking all data processing procedures down into map and reduce, while some important operations are missing, e.g. join, filter, union etc. Second, MapReduce requires reads and writes to disk before and after map and reduce, which becomes the most inefficient bottleneck for iterative tasks, like machine learning. Third, only Java is supported. Fourth, MapReduce only supports batch processing, while interactive streaming data support is still missing.
Thus, Apache Spark is proposed to handle the above problem.
Spark Programming
All different processing components in Spark share the same abstraction, called RDD. An RDD is a data container that work like a “table” in an SQL database. RDDs are created by operation on data or reading files.
RDD Abstraction
An RDD is a read-only, partitioned collection of records created by deterministic transformations on stable storage or other RDDs. RDDs support persistence and partitioning.
Important properties include:
- immutability
- partitioned across workers
- lazily computed
- can be cached/persisted
- recoverable using lineage on DAG.
Lineage: Fault Tolerance
Instead of replicating every intermediate dataset, Spark records how each RDD was derived. Spark uses DAG to track dependencies. Nodes are RDDs and edges are transformations. If a partition is lost, Spark recomputes only the lost partition by replaying transformation.
Action vs Transformation
An action is the final stage of the workflow. It triggers the execution of the DAG and returns the results to the driver (or writes data to HDFS/to files).
Unlike action, transformation create new RDDs lazily and does not trigger execution.
Narrow & Wide Dependencies
A narrow dependency means, each child partition depends on a small number of parent partitions (usually ), e.g. map, filter
A wide dependency means, many child partitions depend on many parent partitions, e.g. groupByKey, join. Wide dependencies require shuffle.
This matters because
- narrow deps can be pipelined.
- wide deps create stage boundaries.
- narrow deps is cheaper for recovery; wide deps may require recomputing or shuffling more data.
When Not to Use Spark?
- for many simple cases, Apache MapReduce and Hive might be a appropriate option.
- not designed for multi-user environment.
Spark Scheduler
Spark builds a DAG of stages. It pipelines narrow transformations inside a stage and cuts stages at wide deps/shuffles. It also uses data locality and cached partitions when scheduling tasks.