Spark Streaming

In short, Spark Streaming is an extension of Spark for large scale stream processing:

scales to hendreds of ndoes and achieves second scale latencies.
efficient and fault-tolerant stateful stream processing
simple batch-like API for implementing complex algorithms

In traditional streaming systems, they have a record-at-a-time processing model that each node has mutable state and for each record, update state and send new records.

The drawback is that state is lost if node dies.

Spark Streaming

The solution of Spark Streaming is to run a streaming computation as a series of vert small, deterministic batch jobs:

Spark Streaming chop up the live stream into batches of $X$ seconds.
Then, Spark treats each batch of data as RDDs and processes them using RDD operations.

DStream is a sequence of RDDs representing a stream of data.
Finally, the processed results of the RDD operations are returned in batches.

Fault-Tolerance: Worker

Batches of input data are replicated in memory for fault-tolerance. If data are lost due to worker failure, data can be recomputed from replicated input data.

All transformed data is fault-tolerant and exactly-once transformations.

Fault-Tolerance: Master

Master saves the state of the DStreams to a checkpoint file on HDFS periodically. If master fails, DStreams can be restored with checkpoint files.