Activation Checkpointing

In essence, activation checkpointing trades extra computation for lower memory use. During the forward pass, only a few activations (the checkpoints) are saved; during the backward pass, any activation that was not saved is recomputed from the nearest checkpoint.

In more detail, the forward computation graph is split into multiple subgraphs called segments. Only activations that lie on segment boundaries are saved as checkpoints. Within a segment, autograd does not save any intermediate results, so those activations are discarded after the forward pass.

During backprop, when a required activation is missing, the backward pass pauses, and the forward pass is rerun from the nearest checkpoint to recompute all activations in that segment. Backprop then continues with the newly computed activations. Once the segment is finished, its recomputed activations are released.
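The procedure above can be sketched in plain Python on a chain of scalar functions. This is a minimal illustration, not how any autograd engine is implemented: the layer list, the function names, and the fixed `segment_size` are all assumptions made for the example.

```python
import math

# Each layer is a pair (f, df): the function and its derivative w.r.t. its input.
LAYERS = [
    (math.sin, math.cos),
    (lambda x: x * x, lambda x: 2.0 * x),
    (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    (lambda x: 3.0 * x, lambda x: 3.0),
    (math.exp, math.exp),
    (lambda x: x + 1.0, lambda x: 1.0),
]

def forward_with_checkpoints(x, layers, segment_size):
    """Run the forward pass, saving only the input of each segment."""
    checkpoints = {}
    for i, (f, _) in enumerate(layers):
        if i % segment_size == 0:
            checkpoints[i] = x  # activation on a segment boundary
        x = f(x)                # activations inside a segment are dropped
    return x, checkpoints

def backward_with_recompute(checkpoints, layers, segment_size, grad_out=1.0):
    """Backpropagate segment by segment, recomputing the missing layer inputs."""
    grad = grad_out
    peak_live = 0
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment_size, len(layers))
        # Recompute: rerun forward from the checkpoint to rebuild the layer inputs.
        inputs = [checkpoints[start]]
        for f, _ in layers[start:end - 1]:
            inputs.append(f(inputs[-1]))
        peak_live = max(peak_live, len(inputs))
        # Backward through the segment using the recomputed inputs (chain rule).
        for (_, df), x_in in zip(reversed(layers[start:end]), reversed(inputs)):
            grad *= df(x_in)
        del inputs  # release the recomputed activations of this segment
    return grad, peak_live

def backward_reference(x, layers):
    """Plain backprop that keeps every layer input alive, for comparison."""
    inputs = []
    for f, _ in layers:
        inputs.append(x)
        x = f(x)
    grad = 1.0
    for (_, df), x_in in zip(reversed(layers), reversed(inputs)):
        grad *= df(x_in)
    return grad

y, ckpts = forward_with_checkpoints(0.5, LAYERS, segment_size=2)
grad_ckpt, peak_live = backward_with_recompute(ckpts, LAYERS, segment_size=2)
grad_ref = backward_reference(0.5, LAYERS)
```

With six layers and a segment size of two, the forward pass keeps three checkpoints instead of six layer inputs, and the backward pass holds at most two recomputed inputs at a time while producing the same gradient as the reference.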

Selective Checkpointing

The idea is to be selective: discard activations that are large or cheap to recompute, and checkpoint those that are small or expensive to recompute.
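One way to make this rule concrete is a greedy policy that ranks activations by recompute cost per byte and saves as many as fit in a memory budget. This is only a sketch of one possible heuristic, not the policy of any particular framework; the `Activation` type, the function name, and all numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Activation:
    name: str
    size_bytes: int        # memory needed to keep the activation
    recompute_cost: float  # relative cost to recompute it (arbitrary units)

def select_checkpoints(activations, memory_budget_bytes):
    """Greedily keep the activations with the highest recompute cost per byte."""
    ranked = sorted(
        activations,
        key=lambda a: a.recompute_cost / a.size_bytes,
        reverse=True,
    )
    saved, used = [], 0
    for act in ranked:
        if used + act.size_bytes <= memory_budget_bytes:
            saved.append(act.name)
            used += act.size_bytes
    return saved, used

# Hypothetical sizes and costs, for illustration only.
acts = [
    Activation("small_slow", size_bytes=1_000, recompute_cost=8.0),
    Activation("large_fast", size_bytes=64_000, recompute_cost=1.0),
    Activation("small_fast", size_bytes=2_000, recompute_cost=1.0),
    Activation("large_slow", size_bytes=32_000, recompute_cost=20.0),
]
saved, used = select_checkpoints(acts, memory_budget_bytes=40_000)
```

Under these made-up numbers, the large and cheap activation is the one left to be recomputed, which matches the rule stated above.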

In the Transformer architecture, LayerNorm activations are small, so they are checkpointed. Attention and MLP activations are relatively large and only moderately expensive to recompute, so they are not checkpointed.

Note that FlashAttention (FA) already does this kind of checkpointing implicitly for attention scores: the score matrix is not stored in the forward pass and is recomputed during the backward pass.
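The recomputation idea can be illustrated in plain Python. The sketch below is not the FlashAttention kernel (there is no tiling, no scaling by the head dimension, and no masking). It only shows that the softmax probabilities can be rebuilt from the queries, the keys, and one saved log-sum-exp value per row, so the full score matrix never has to be stored. The function names are assumptions made for the example.

```python
import math

def attention_forward(q, k, v):
    """Return outputs and per-row log-sum-exp; the score matrix is not kept."""
    out, lse = [], []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the row max for numerical stability
        row_lse = m + math.log(sum(math.exp(s - m) for s in scores))
        probs = [math.exp(s - row_lse) for s in scores]
        out.append([sum(p * vj[d] for p, vj in zip(probs, v))
                    for d in range(len(v[0]))])
        lse.append(row_lse)  # one scalar per row instead of a full row of scores
    return out, lse

def recompute_probs(q, k, lse):
    """Rebuild the softmax probabilities in backward from q, k and the saved lse."""
    return [
        [math.exp(sum(a * b for a, b in zip(qi, kj)) - row_lse) for kj in k]
        for qi, row_lse in zip(q, lse)
    ]

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, lse = attention_forward(q, k, v)
probs = recompute_probs(q, k, lse)
```

The saved state per row is a single number, while the recomputed `probs` has one entry per key, which is where the memory saving comes from.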

Structured Checkpointing

Segments can also follow the model's architecture, e.g., in Transformer models the graph can be split at attention layers, so that each layer forms one segment.
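As a small illustration, segment boundaries can be derived from the model structure itself. The sketch below groups a flat list of operations by their layer prefix; the operation names and the `segments_by_layer` helper are hypothetical and do not come from any library.

```python
from itertools import groupby

# Hypothetical op names for a 2-layer Transformer, in execution order.
ops = [
    "layer0.norm1", "layer0.attn", "layer0.norm2", "layer0.mlp",
    "layer1.norm1", "layer1.attn", "layer1.norm2", "layer1.mlp",
]

def segments_by_layer(op_names):
    """Group consecutive ops that share a layer prefix into one segment."""
    return [
        list(group)
        for _, group in groupby(op_names, key=lambda name: name.split(".")[0])
    ]

segments = segments_by_layer(ops)
# The input of the first op in each segment is the activation to checkpoint.
boundaries = [segment[0] for segment in segments]
```

Here each Transformer layer becomes one segment, so only the layer inputs are saved and everything inside a layer is recomputed during the backward pass.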