Take-Away

The first contribution is the PagedAttention algorithm:

  • Blocked storage. KV Caches are chunked into fixed size blocks that each block contains KV cache of 1616 tokens by default. Each block can be independently be stored in GPU’s HBM.
  • Blockwise attention computation. While computing attention, PA reads KV cache from blocks with block table and perform blockwise attention. This enables incontinuous memory access.
  • Performance optimization. PA designs fused kernel, including fused block write, block read & attention and block copy (copy on write), to minimize the cost brought by paging.

PA algorithm also gives birth to vllm, one of the most popular LLM serving system. It brings several innovations:

  • KV cache manager. Serves as the memory managing unit in OS.
  • Allocate on demand. vllm does not pre-allocate continuous memory for the whole sequence. Instead, it dynamically allocates new blocks according to generation progress. New blocks are allocated only if all previous blocks are filled.
  • Copy on write. For physical blocks that are required by multiple sequences (e.g. prompts in parallel sampling), blocks are only copied if the sequence is about to write. Redundent memory are reduced with counting references.
  • Scheduling and preempting.
    • First come first serve.
    • If GPU physical blocks are inadequate, adopts eviction strategy.
    • Support swap (copy KV cache to CPU RAM) and recompute (concat generated tokens with prompts and recompute KV cache) for recovery mechanism.
  • Distributed support.