nano-vllm is a minimal reimplementation of vLLM’s core inference engine. It targets the Qwen3 model family and is a pure offline inference library.
High Level Architecture
Component Design
Sequence holds the full token list, the block table, scheduling metadata, and sampling params. It is the abstraction of a sequence that covers both the prompt tokens in the prefill stage and the generated tokens in the decoding stage.
- seq_id is a monotonically incrementing integer assigned to each Sequence. It gives each inference instance a unique ID.
- block_table is the page table, the core data structure of PagedAttention, which maps each logical block to the index of its corresponding physical block.
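As a rough illustration of these two fields, here is a minimal sketch of a sequence object. It is not the nano-vllm class: BLOCK_SIZE, the physical_slot() helper, and the concrete block IDs are hypothetical and chosen only to show how a block table translates a token position into a physical location.

```python
from dataclasses import dataclass, field
from itertools import count

BLOCK_SIZE = 4  # illustrative value, not taken from nano-vllm

_seq_counter = count()  # source of monotonically increasing IDs


@dataclass
class Sequence:
    token_ids: list[int]
    # Each new Sequence draws the next integer from the shared counter.
    seq_id: int = field(default_factory=lambda: next(_seq_counter))
    # block_table[i] is the physical block backing logical block i.
    block_table: list[int] = field(default_factory=list)

    @property
    def num_blocks(self) -> int:
        # Logical blocks needed to hold all tokens (ceiling division).
        return (len(self.token_ids) + BLOCK_SIZE - 1) // BLOCK_SIZE

    def physical_slot(self, pos: int) -> tuple[int, int]:
        # Hypothetical helper: token position -> (physical block, offset).
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE


a = Sequence(token_ids=list(range(6)))
b = Sequence(token_ids=[7, 8])
a.block_table = [9, 2]  # logical block 0 -> physical 9, logical 1 -> physical 2
slot = a.physical_slot(5)  # token 5 lives in logical block 1, offset 1
```

The logical blocks of a sequence are contiguous, while the physical blocks they map to (9 and 2 here) can sit anywhere in the pool.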
BlockManager implements the core algorithm in PagedAttention with hash-based prefix caching.
BlockManager and Block
Each Block has the fields block_id, ref_count, hash, and token_ids (the tokens in this block of the KV cache). BlockManager has the fields:
- blocks: list[Block], which is the pool of all blocks
- hash_to_block_id: dict, which is used for prefix cache lookup
- free_block_ids: deque[int], which contains the available blocks
- used_block_ids: set[int], which contains the in-use blocks
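The fields listed above can be written down as a small sketch. The field names come from the text; the default values, the -1 sentinel for "no hash yet", and the constructor signature are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Block:
    block_id: int
    ref_count: int = 0  # number of sequences currently sharing this block
    hash: int = -1      # assumed sentinel meaning "no hash computed yet"
    token_ids: list[int] = field(default_factory=list)


class BlockManager:
    def __init__(self, num_blocks: int):
        # Pool of all physical blocks, indexed by block_id.
        self.blocks: list[Block] = [Block(i) for i in range(num_blocks)]
        # Prefix cache: block hash -> block_id.
        self.hash_to_block_id: dict[int, int] = {}
        # Every block starts out free.
        self.free_block_ids: deque[int] = deque(range(num_blocks))
        self.used_block_ids: set[int] = set()


manager = BlockManager(num_blocks=4)
```

In this sketch, free_block_ids and used_block_ids together cover every block ID in the pool.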
Prefix caching in BlockManager is built on the compute_hash() method. The hash of a block is computed from:
- previous block’s hash
- IDs of block’s tokens
This means that if seq A has [a,b,c] and seq B has [a,b,c,d,e], then the hash of the prefix [a,b,c] is the same for both, so the KV cache blocks holding that prefix can be shared.
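The chained hashing can be sketched as follows. This is an illustration rather than nano-vllm's implementation: it uses hashlib from the standard library, and the block_hashes() helper, the -1 "no prefix" sentinel, and the choice to hash only full blocks are assumptions.

```python
import hashlib
import struct


def compute_hash(token_ids: list[int], prefix_hash: int = -1) -> int:
    # Mix the previous block's hash (if any) with this block's token IDs.
    h = hashlib.sha256()
    if prefix_hash != -1:
        h.update(prefix_hash.to_bytes(8, "little"))
    h.update(struct.pack(f"<{len(token_ids)}q", *token_ids))
    return int.from_bytes(h.digest()[:8], "little")


def block_hashes(tokens: list[int], block_size: int) -> list[int]:
    # Hypothetical helper: chained hashes of every full block, in order.
    hashes, prev = [], -1
    for start in range(0, len(tokens) - block_size + 1, block_size):
        prev = compute_hash(tokens[start:start + block_size], prev)
        hashes.append(prev)
    return hashes


seq_a = [1, 2, 3, 4, 5, 6]                # three blocks of size 2
seq_b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # same three blocks, then two more
seq_c = [0, 2, 3, 4, 5, 6]                # differs only in the first block

hashes_a = block_hashes(seq_a, block_size=2)
hashes_b = block_hashes(seq_b, block_size=2)
hashes_c = block_hashes(seq_c, block_size=2)
```

Here hashes_a matches the first three entries of hashes_b, so those blocks are shareable. Every entry of hashes_c differs from hashes_a, even though the later blocks hold identical tokens, because the first block's hash feeds into all the hashes after it.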
Allocation during prefill. Allocation in BlockManager is handled by can_allocate() and allocate(), which also handle prefix caching.
Let’s first look at the prefix caching part. Because the hash chain is cumulative, a cache hit on block i implies cache hits on all preceding blocks 0 through i-1.
- In can_allocate(), the manager counts how many leading blocks are prefix cache hits and computes how many new blocks must be allocated. It then checks whether the number of newly allocated blocks would exceed the number of free blocks remaining.
The returned value num_cached_blocks is then passed to allocate() if necessary (scheduler.py:line 45).
- For each cached block, indexed from 0 up to (but not including) num_cached_blocks, the manager simply increments the block’s ref_count.
- For each block that must be newly allocated, indexed from num_cached_blocks up to (but not including) seq.num_blocks, self._allocate_block() is called to assign a free physical block to the logical block.
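The two-phase allocation described above can be sketched as follows. This is a simplified model under stated assumptions, not the nano-vllm code: TinyBlockManager is a hypothetical class, each logical block is represented only by its precomputed chained hash, blocks are never freed, and returning -1 from can_allocate() when the request does not fit is an invented convention.

```python
from collections import deque


class TinyBlockManager:
    def __init__(self, num_blocks: int):
        self.ref_count = [0] * num_blocks
        self.hash_to_block_id = {}
        self.free_block_ids = deque(range(num_blocks))

    def can_allocate(self, block_hashes: list) -> int:
        # Count leading cache hits; stop at the first miss, because a miss
        # on block i rules out hits on every later block.
        num_cached = 0
        for h in block_hashes:
            if h not in self.hash_to_block_id:
                break
            num_cached += 1
        num_new = len(block_hashes) - num_cached
        # Assumed convention: -1 signals "not enough free blocks".
        return num_cached if num_new <= len(self.free_block_ids) else -1

    def allocate(self, block_hashes: list, num_cached: int) -> list[int]:
        block_table = []
        # Cached prefix: share the existing physical block.
        for h in block_hashes[:num_cached]:
            block_id = self.hash_to_block_id[h]
            self.ref_count[block_id] += 1
            block_table.append(block_id)
        # Remainder: take fresh blocks from the free list.
        for h in block_hashes[num_cached:]:
            block_id = self.free_block_ids.popleft()
            self.ref_count[block_id] = 1
            self.hash_to_block_id[h] = block_id
            block_table.append(block_id)
        return block_table


m = TinyBlockManager(num_blocks=5)
hashes_a = ["h0", "h1", "h2"]
hashes_b = ["h0", "h1", "h2", "h3", "h4"]

table_a = m.allocate(hashes_a, m.can_allocate(hashes_a))  # [0, 1, 2]
n_cached_b = m.can_allocate(hashes_b)                      # 3 leading hits
table_b = m.allocate(hashes_b, n_cached_b)                 # [0, 1, 2, 3, 4]
```

After both calls, the three shared blocks have a ref_count of 2 and the two new blocks have a ref_count of 1. The pool is then exhausted, so a further request for uncached blocks is rejected.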
Allocation during decoding. Due to the nature of decoding, the sequence grows token by token. Thus, we just check the length of the sequence to see if it needs a new physical block.
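The length check can be sketched as below. The function name and signature are hypothetical. It compares the number of blocks the sequence needs against the number it already owns, which is one way to express the check and not necessarily how nano-vllm writes it.

```python
def needs_new_block(seq_len: int, num_allocated_blocks: int, block_size: int) -> bool:
    # Blocks required to hold seq_len tokens (ceiling division).
    blocks_needed = (seq_len + block_size - 1) // block_size
    return blocks_needed > num_allocated_blocks


block_size = 4
# Four tokens exactly fill one block: nothing to allocate.
fits = needs_new_block(seq_len=4, num_allocated_blocks=1, block_size=block_size)
# The fifth token spills into a second logical block.
spills = needs_new_block(seq_len=5, num_allocated_blocks=1, block_size=block_size)
```

Since decoding appends one token at a time, at most one new block is needed per step.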
Model Runner
Located in engine/model_runner.py.
KV Cache Tensor Layout
kv_cache: Tensor [2, num_layers, num_blocks, block_size, num_kv_heads, head_dim]
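To make the layout concrete, the sketch below derives the tensor shape from a memory budget using plain arithmetic. The leading dimension of 2 presumably separates keys from values. The helper names and every number used (layer count, head count, head size, 2-byte elements, 4 GiB budget) are illustrative assumptions, not values read from nano-vllm or from a Qwen3 config.

```python
def bytes_per_block(num_layers, block_size, num_kv_heads, head_dim, dtype_bytes):
    # Factor 2: one slab for keys and one for values, across all layers.
    return 2 * num_layers * block_size * num_kv_heads * head_dim * dtype_bytes


def kv_cache_shape(budget_bytes, num_layers, block_size, num_kv_heads,
                   head_dim, dtype_bytes=2):
    # Fit as many whole blocks as the memory budget allows.
    per_block = bytes_per_block(num_layers, block_size, num_kv_heads,
                                head_dim, dtype_bytes)
    num_blocks = budget_bytes // per_block
    return (2, num_layers, num_blocks, block_size, num_kv_heads, head_dim)


# Illustrative configuration only.
per_block = bytes_per_block(num_layers=28, block_size=256, num_kv_heads=8,
                            head_dim=128, dtype_bytes=2)  # 28 MiB
shape = kv_cache_shape(budget_bytes=4 * 1024**3, num_layers=28, block_size=256,
                       num_kv_heads=8, head_dim=128)
# shape == (2, 28, 146, 256, 8, 128)
```

With this layout, a physical block ID from a block table indexes directly into the third dimension, so block_table entries can be used as tensor indices without further translation.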