nano-vllm is a minimal reimplementation of vLLM’s core inference engine. It targets the Qwen3 model family and is a pure offline inference library.

High Level Architecture

┌─────────────────────────────────────────────────────────┐
│  User API: LLM.generate(prompts, sampling_params)       │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│  LLMEngine (engine/llm_engine.py)                       │
│  Central orchestrator. Loops step() until all finished. │
└──────┬──────────────────────────────────────┬───────────┘
       ▼                                      ▼
┌──────────────┐                     ┌────────────────────┐
│  Scheduler   │                     │  ModelRunner       │
│  + BlockMgr  │◄────────────────────│  (MP for TP>1)     │
│  + Sequence  │   batch & results   │  engine/           │
│  engine/     │                     │  model_runner.py   │
│ scheduler.py │                     └────────┬───────────┘
└──────────────┘                              ▼
       │                             ┌────────────────────┐
       │                             │  Qwen3ForCausalLM  │
       │                             │  (models/qwen3.py) │
       │                             │  + layers/         │
       │                             │    attention.py    │
       │                             │    linear.py       │
       │                             │    sampler.py      │
       │                             │    ...             │
       │                             └────────────────────┘
       │
  ┌────┴────────────┐
  │  BlockManager   │  Paged KV cache + prefix caching via xxhash
  │  Sequence       │  Per-request state machine
  └─────────────────┘

Component Design

Sequence holds the full token list, block table, scheduling metadata, and sampling params. It is the abstraction of a single request, covering both the prompt tokens consumed in the prefill stage and the tokens generated in the decoding stage.

  • seq_id is a monotonically increasing integer assigned to each Sequence, giving each inference request a unique ID.
  • block_table is the page table, the core data structure of PagedAttention. It maps each logical block to the index of its corresponding physical block.
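The fields described above can be pictured with a small sketch. This is an illustration only, not the library's actual class: the name SequenceSketch, the module-level counter, and the helper methods are hypothetical, and the default block size of 256 is taken from the KV cache layout later in this post.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical global counter: each new sequence takes the next integer.
_seq_counter = count()


@dataclass
class SequenceSketch:
    token_ids: list[int]                 # prompt tokens, later extended by generated tokens
    block_size: int = 256                # tokens per KV cache block
    seq_id: int = field(default_factory=lambda: next(_seq_counter))
    # block_table[i] is the physical block id backing logical block i.
    block_table: list[int] = field(default_factory=list)
    num_prompt_tokens: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # Remember where the prompt ends so prefill and decode tokens can be told apart.
        self.num_prompt_tokens = len(self.token_ids)

    @property
    def num_blocks(self) -> int:
        # Number of logical blocks needed to hold all tokens (ceiling division).
        return (len(self.token_ids) + self.block_size - 1) // self.block_size

    def append_token(self, token_id: int) -> None:
        # Decoding grows the same token list that prefill started.
        self.token_ids.append(token_id)
```

Keeping prompt and generated tokens in one list means the block table can be indexed the same way in both stages.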

BlockManager implements the core algorithm in PagedAttention with hash-based prefix caching.

BlockManager and Block

Each Block has the fields block_id, ref_count, hash and token_ids (the tokens stored in this block of the KV cache). BlockManager has the following fields:

  • blocks: list[Block], the pool of all blocks
  • hash_to_block_id: dict, used for prefix cache lookup
  • free_block_ids: deque[int], the IDs of available blocks
  • used_block_ids: set[int], the IDs of in-use blocks
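As a rough sketch of how these fields fit together, the following keeps the field names from the text but invents everything else: the class name BlockPool, the methods take_free_block() and release(), and the use of -1 as a "no hash" marker are assumptions for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Block:
    block_id: int
    ref_count: int = 0
    hash: int = -1                      # -1 as "no hash yet" is an assumption of this sketch
    token_ids: list[int] = field(default_factory=list)


class BlockPool:
    """Hypothetical container holding the four BlockManager fields."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks: list[Block] = [Block(i) for i in range(num_blocks)]
        self.hash_to_block_id: dict[int, int] = {}
        self.free_block_ids: deque[int] = deque(range(num_blocks))
        self.used_block_ids: set[int] = set()

    def take_free_block(self) -> Block:
        # Move one block id from the free queue to the in-use set.
        block_id = self.free_block_ids.popleft()
        block = self.blocks[block_id]
        block.ref_count = 1
        self.used_block_ids.add(block_id)
        return block

    def release(self, block_id: int) -> None:
        # Drop one reference; the block returns to the free queue at zero.
        block = self.blocks[block_id]
        block.ref_count -= 1
        if block.ref_count == 0:
            self.used_block_ids.remove(block_id)
            self.free_block_ids.append(block_id)
```

The reference count is what lets several sequences point at the same physical block without one of them freeing it under the others.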

Prefix caching in BlockManager relies on the compute_hash() method. The hash of a block is computed from

  • the previous block’s hash
  • the IDs of the tokens in the block

This means that if seq A consists of blocks [a,b,c] and seq B consists of blocks [a,b,c,d,e], the chained hashes of [a,b,c] are identical in both sequences, so these KV cache blocks can be shared.
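The chaining can be sketched as follows. The architecture diagram says the library uses xxhash; to stay dependency-free, this sketch swaps in hashlib.blake2b from the standard library instead. The function names are hypothetical, and hashing only full blocks is an assumption of the sketch.

```python
import hashlib
from typing import Optional


def chained_block_hash(token_ids: list[int], prev_hash: Optional[int] = None) -> int:
    """Hash one block from the previous block's hash plus its own token IDs."""
    h = hashlib.blake2b(digest_size=8)
    if prev_hash is not None:
        h.update(prev_hash.to_bytes(8, "little"))
    for token_id in token_ids:
        h.update(token_id.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "little")


def block_hashes(token_ids: list[int], block_size: int) -> list[int]:
    """Return the chained hash of every full block in the sequence."""
    hashes: list[int] = []
    prev: Optional[int] = None
    for i in range(len(token_ids) // block_size):
        chunk = token_ids[i * block_size:(i + 1) * block_size]
        prev = chained_block_hash(chunk, prev)
        hashes.append(prev)
    return hashes


# Two sequences sharing a prefix produce identical leading hashes.
seq_a = block_hashes([1, 2, 3, 4], block_size=2)
seq_b = block_hashes([1, 2, 3, 4, 5, 6], block_size=2)
shared_prefix = seq_a == seq_b[:2]
```

Because each hash folds in its predecessor, two blocks with the same tokens but different prefixes get different hashes, which is what makes a hash match safe to treat as a prefix match.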

Allocation during prefill. Allocation in BlockManager is handled by can_allocate() and allocate(), which also take care of prefix caching.

Let us first look at the prefix caching part. Since the hash chain is cumulative, a cache hit on block $N$ implies cache hits on blocks $0, 1, \dots, N-1$.

  • can_allocate() counts how many leading blocks are prefix cache hits and computes how many new blocks must be allocated. It then checks whether the number of blocks to be newly allocated exceeds the number of free blocks remaining.

The returned value num_cached_blocks is then passed to .allocate() if necessary (scheduler.py:line 45).

  • For each cached block, indexed from $0$ to $\texttt{num\_cached\_blocks}-1$, the manager simply increments the block’s ref_count.
  • For each block that must be allocated, indexed from $\texttt{num\_cached\_blocks}$ to $\texttt{seq.num\_blocks}-1$, self._allocate_block() is called to assign a free physical block to the logical block.
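The two-step flow above can be sketched with a toy manager. This is a simplified illustration, not the library's code: TinyBlockManager is a hypothetical name, Python's built-in hash() stands in for the real hash function, returning -1 when there is not enough room is an assumption, and deallocation is left out entirely.

```python
from collections import deque


class TinyBlockManager:
    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_block_ids = deque(range(num_blocks))
        self.ref_count = [0] * num_blocks
        self.hash_to_block_id: dict[int, int] = {}

    def _block_hashes(self, token_ids: list[int]) -> list[int]:
        # Chained hashes of the full blocks only (assumption of this sketch).
        hashes, prev = [], 0
        bs = self.block_size
        for i in range(len(token_ids) // bs):
            prev = hash((prev, tuple(token_ids[i * bs:(i + 1) * bs])))
            hashes.append(prev)
        return hashes

    def _num_blocks(self, token_ids: list[int]) -> int:
        return -(-len(token_ids) // self.block_size)  # ceiling division

    def can_allocate(self, token_ids: list[int]) -> int:
        """Return num_cached_blocks, or -1 if the free blocks do not suffice."""
        num_cached = 0
        for h in self._block_hashes(token_ids):
            if h not in self.hash_to_block_id:
                break  # cumulative chain: the first miss ends the cached prefix
            num_cached += 1
        num_new = self._num_blocks(token_ids) - num_cached
        return num_cached if num_new <= len(self.free_block_ids) else -1

    def allocate(self, token_ids: list[int], num_cached_blocks: int) -> list[int]:
        hashes = self._block_hashes(token_ids)
        block_table = []
        for i in range(num_cached_blocks):
            # Cache hit: share the existing physical block.
            block_id = self.hash_to_block_id[hashes[i]]
            self.ref_count[block_id] += 1
            block_table.append(block_id)
        for i in range(num_cached_blocks, self._num_blocks(token_ids)):
            # Cache miss: take a free physical block.
            block_id = self.free_block_ids.popleft()
            self.ref_count[block_id] = 1
            if i < len(hashes):  # only full blocks are registered for reuse
                self.hash_to_block_id[hashes[i]] = block_id
            block_table.append(block_id)
        return block_table


manager = TinyBlockManager(num_blocks=4, block_size=2)
first = [1, 2, 3, 4, 5]
table_1 = manager.allocate(first, manager.can_allocate(first))
second = [1, 2, 3, 4, 7, 8]
cached = manager.can_allocate(second)
table_2 = manager.allocate(second, cached)
```

In this run the second sequence reuses the two leading physical blocks of the first and only takes one new block for its own tail.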

Allocation during decoding. During decoding, the sequence grows one token at a time, so it is enough to check the sequence length to decide whether a new physical block is needed.
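A minimal sketch of that length check, assuming the check runs after each token is appended. The function names and the small block size of 4 are hypothetical and chosen to keep the example short.

```python
def blocks_needed(seq_len: int, block_size: int) -> int:
    # Ceiling division: how many blocks a sequence of this length occupies.
    return (seq_len + block_size - 1) // block_size


def needs_new_block(seq_len: int, num_allocated_blocks: int, block_size: int) -> bool:
    # A new physical block is required once the tokens no longer fit
    # in the blocks already listed in the block table.
    return blocks_needed(seq_len, block_size) > num_allocated_blocks


block_size = 4
seq_len, num_allocated = 4, 1       # one block, exactly full after prefill
growth_points = []
for _ in range(5):                  # decode five tokens
    seq_len += 1
    if needs_new_block(seq_len, num_allocated, block_size):
        num_allocated += 1
        growth_points.append(seq_len)
```

With a block size of 4, a new block is only needed on every fourth decoded token, so most decode steps touch no allocation state at all.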

Model Runner

Located in engine/model_runner.py

KV Cache Tensor Layout

kv_cache: Tensor  [2, num_layers, num_blocks, block_size, num_kv_heads, head_dim]
                   │  │           │           │           │             │
                   │  │           │           │           │             └── head dimension (e.g., 128)
                   │  │           │           │           └── KV heads (after TP split)
                   │  │           │           └── tokens per block (256)
                   │  │           └── physical blocks (computed from GPU memory budget)
                   │  └── one per transformer layer
                   └── index 0 = keys, index 1 = values
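To make "computed from GPU memory budget" concrete, here is a back-of-the-envelope sketch. The function names are hypothetical, and the numbers (28 layers, 8 KV heads, 2-byte elements, an 8 GiB budget) are illustrative assumptions, not values read from the library or from a specific model config.

```python
def kv_block_bytes(num_layers: int, block_size: int, num_kv_heads: int,
                   head_dim: int, dtype_bytes: int = 2) -> int:
    # One physical block stores keys and values (factor 2) for every layer.
    return 2 * num_layers * block_size * num_kv_heads * head_dim * dtype_bytes


def num_kv_blocks(budget_bytes: int, block_bytes: int) -> int:
    # Whole blocks only: any remainder of the budget stays unused.
    return budget_bytes // block_bytes


def slot_index(block_id: int, offset: int, block_size: int) -> int:
    # Flat position of a token inside the [num_blocks, block_size] plane.
    return block_id * block_size + offset


block_bytes = kv_block_bytes(num_layers=28, block_size=256,
                             num_kv_heads=8, head_dim=128)
num_blocks = num_kv_blocks(8 * 1024**3, block_bytes)
```

Under these assumed numbers a single block takes 28 MiB, so the block size of 256 tokens trades coarser allocation granularity for a much smaller block table.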