Raft Consensus Protocol

Raft is a consensus algorithm that lets a cluster of servers agree on a sequence of commands (a replicated log) by electing a leader to replicate those commands, so they all see the same order of operations even if some servers fail. Compared to 2PC, Raft focuses on data consistency among data replicas.

Introduction of Raft

A typical scenario for a cluster running Raft is that each server contains a consensus module, a log (replicated). Unlike Byzantine, the failure cases of Raft only include delayed/lost messages and stop-on-failure, and do not include subjective malice.

Leader-less or Leader-based Consensus

General approaches for consensus protocol include leader-less (symmetric) and leader-based (asymmetric).

For leader-less frameworks,

all servers have equal roles and
external clients can contact any server.

For leader-based frameworks,

at any given time, only 1 server is in charge while other servers accept its decisions
external clients communicate with the leader

Raft uses a leader-based approach. We’ll dive into Raft.

Raft Consensus Protocol Summary

Roles in Raft

At any given time, each server is either

leader that handles all client connections and log replication,
candidate that is used to elect a new leader,
follower that does not issue RPC and only responds to only RPCs.

Role Transittion

The state diagram of role transittion can be characterized as

graph TB
A[Follower]
B[Candidate]
C[Leader]

D((start)) --> A
A -->|timeout, start election| B
B -->|timeout, new election| B
B -->|receive majority votes| C
B -->|step down, if a new leader is elected| A
C -->|discover a server with higher term| A

What is "Term"?

Time is divided into terms, where each term is either election period or normal operation period.

It’s guaranteed that there’s only 1 leader in each term. Some terms may not have a leader (like a failed election period).

It’s worth noting that each server should maintain current term value. Change of term value means information update, thus the role of “term” is to identify obsolete information.

Log Replication

Log is an ordered sequence of entries on each server. Each entry contains

a command to execute
the term number when the command is created
an index where the command is positioned.

Logs are stored on stable storage to disk. Entries are committed if these entries are known to be stored on majority of servers.

Heartbeat

Servers start as followers. Followers expect to receive RPCs from leaders or candidates, and leaders must send heartbeats to maintain authority.

If electionTimeout elapses with no RPCs, follower assumes leader has crashed and starts a new election.

Election Basics

If a election is started, follower first increment current term, change to candidate state, vote for self, and send RequestVote RPCs to all other servers. Retry until either:

if receive votes from majority of servers:
- this server becomes the leader
- send AppendEntries heartbeats to all other servers
if receive RPC from valid leader
- then this server returns to normal follower state
if no one wins the election,
- then this server increments the term value again and starts another new election.

Safety. This procedure allows at most one winner per term. Each server gives out only one vote per term, which is persisted on disk.

Liveness. This procedure ensures that some candidate must eventually win. If the election timesout randomly in $[T,2T]$ when broadcast time $\ll T$ , then it usually works well.

Normal Operation

External clients send commands to leader
leader appends command to it log
leader sends AppendEntries RPCs to followers
Once new entry committed:
- leader passes command to its state machine, returns result to client
- leader notifies followers of committed entries in subsequent AppendEntries RPCs.
- followers pass committed commands to their state machine.

If there’re crashed or slow followers, leader retries RPCs until RPCs succeed.

Safety Requirement

Once a log entry has been applied to a state machine, no other state machine should apply a different value for that log entry.

If a leader has decided that a log entry is committed, that entry will be present in the logs of all future leaders. This guarantees the safety requirement:

leader never overwrite entries in their logs
only entries in the leader’s log can be committed.
entries must be committed before applied to state machine

External Client Protocol

External client sends commands to leader, if leader is unknown then contact any server. If the contacted server is not leader, the contacted server will redirect it to leader.

The leader does not respond until command has beem logged -> commited -> executed by leader’s state machine.

If request times out (e.g. leader crash),

client re-issues command to some other server
eventually redirected to new leader
retry request with new leader

What if leader crashes after executing command, but before responding? Since we must not execute command twice, client embeds a unique ID in each command. The server includes this ID in log entry. Before accepting command, leader check its log for entry with that ID. If ID is found in log, ignore new command and return response from old command.

Raft Configuration

System config of Raft must be safe too, e.g., including but not limited to

the number of servers should be an odd number
ID & address for each server
determines what constituts a majority

And consensus mechanism must support changes in the configuration, so as to replace failed machine and change degree of replication. In Raft, system config change is treated as an operation that must reach consensus in the replicated log.