In short, Spark Streaming is an extension of Spark for large scale stream processing:
- scales to hendreds of ndoes and achieves second scale latencies.
- efficient and fault-tolerant stateful stream processing
- simple batch-like API for implementing complex algorithms
In traditional streaming systems, they have a record-at-a-time processing model that each node has mutable state and for each record, update state and send new records.
The drawback is that state is lost if node dies.
Spark Streaming
The solution of Spark Streaming is to run a streaming computation as a series of vert small, deterministic batch jobs:
-
Spark Streaming chop up the live stream into batches of seconds.
-
Then, Spark treats each batch of data as RDDs and processes them using RDD operations.
DStreamis a sequence of RDDs representing a stream of data. -
Finally, the processed results of the RDD operations are returned in batches.
Fault-Tolerance: Worker
Batches of input data are replicated in memory for fault-tolerance. If data are lost due to worker failure, data can be recomputed from replicated input data.
All transformed data is fault-tolerant and exactly-once transformations.
Fault-Tolerance: Master
Master saves the state of the DStreams to a checkpoint file on HDFS periodically. If master fails, DStreams can be restored with checkpoint files.