nano-vllm Overview

nano-vllm is a minimal reimplementation of vLLM’s core inference engine. It targets the Qwen3 model family and is a purely offline inference library.

High-Level Architecture

┌─────────────────────────────────────────────────────────┐
│  User API: LLM.generate(prompts, sampling_params)       │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│  LLMEngine (engine/llm_engine.py)                       │
│  Central orchestrator. Loops step() until all finished. │
└──────┬──────────────────────────────────────┬───────────┘
       ▼                                      ▼
┌──────────────┐                     ┌────────────────────┐
│  Scheduler   │                     │   ModelRunner      │
│  + BlockMgr  │◄────────────────────│   (MP for TP>1)    │
│  + Sequence  │   batch & results   │   engine/          │
│  engine/     │                     │   model_runner.py  │
│ scheduler.py │                     └────────┬───────────┘
└──────────────┘                              ▼
       │                             ┌────────────────────┐
       │                             │  Qwen3ForCausalLM  │
       │                             │  (models/qwen3.py) │
       │                             │  + layers/         │
       │                             │    attention.py    │
       │                             │    linear.py       │
       │                             │    sampler.py      │
       │                             │    ...             │
       │                             └────────────────────┘
       │
  ┌────┴────────────┐
  │  BlockManager   │  Paged KV cache + prefix caching via xxhash
  │  Sequence       │  Per-request state machine
  └─────────────────┘
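
The control flow in the diagram can be sketched as a toy loop: the engine repeatedly asks the scheduler for a batch, hands it to the model runner, and feeds the results back until every sequence is finished. This is a minimal sketch under stated assumptions, not the code in engine/llm_engine.py. ToyScheduler, ToyModelRunner, Status, the max_batch limit, and the method names are hypothetical, and the "model" just emits the previous token plus one instead of running Qwen3.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum, auto


class Status(Enum):
    WAITING = auto()
    RUNNING = auto()
    FINISHED = auto()


@dataclass
class Sequence:
    """Per-request state: prompt tokens, generated tokens, and a status."""
    prompt: list
    max_tokens: int
    output: list = field(default_factory=list)
    status: Status = Status.WAITING


class ToyScheduler:
    """Hypothetical scheduler: admits waiting sequences into a bounded batch."""

    def __init__(self, max_batch=2):
        self.max_batch = max_batch
        self.waiting = deque()
        self.running = []

    def add(self, seq):
        self.waiting.append(seq)

    def is_finished(self):
        return not self.waiting and not self.running

    def schedule(self):
        # Fill free batch slots from the waiting queue, in arrival order.
        while self.waiting and len(self.running) < self.max_batch:
            seq = self.waiting.popleft()
            seq.status = Status.RUNNING
            self.running.append(seq)
        return list(self.running)

    def postprocess(self, batch, tokens):
        # Append one token per sequence and retire those that hit their limit.
        for seq, tok in zip(batch, tokens):
            seq.output.append(tok)
            if len(seq.output) >= seq.max_tokens:
                seq.status = Status.FINISHED
        self.running = [s for s in self.running if s.status is not Status.FINISHED]


class ToyModelRunner:
    """Stand-in for the model: returns last token + 1 for each sequence."""

    def run(self, batch):
        return [(seq.output or seq.prompt)[-1] + 1 for seq in batch]


def generate(prompts, max_tokens):
    """Loop step() until all sequences finish; returns (outputs, step count)."""
    scheduler, runner = ToyScheduler(), ToyModelRunner()
    seqs = [Sequence(list(p), max_tokens) for p in prompts]
    for seq in seqs:
        scheduler.add(seq)
    steps = 0
    while not scheduler.is_finished():
        batch = scheduler.schedule()          # Scheduler picks the batch
        tokens = runner.run(batch)            # ModelRunner produces one token each
        scheduler.postprocess(batch, tokens)  # results flow back to the scheduler
        steps += 1
    return [seq.output for seq in seqs], steps
```

With three prompts and a batch limit of two, the third sequence waits until a slot frees up, so the loop runs for twice as many steps as a single batch would need.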
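
The prefix caching noted beside BlockManager can be illustrated with chained block hashes: each full block's hash covers its own tokens plus the hash of the preceding block, so two prompts share cached blocks only while their prefixes match exactly. The sketch below is an assumption-laden illustration, not the project's BlockManager. It uses hashlib.blake2b from the standard library as a stand-in for xxhash, and BLOCK_SIZE, block_hash, and ToyBlockManager are hypothetical names. Reference counting and eviction are left out.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical; chosen small so the example is easy to follow


def block_hash(token_ids, prefix_hash=None):
    """Hash one full block, chained to the hash of the preceding block."""
    h = hashlib.blake2b(digest_size=8)  # stand-in for xxhash
    if prefix_hash is not None:
        h.update(prefix_hash.to_bytes(8, "little"))
    for t in token_ids:
        h.update(t.to_bytes(4, "little"))
    return int.from_bytes(h.digest(), "little")


class ToyBlockManager:
    """Hypothetical manager mapping chained block hashes to block ids."""

    def __init__(self):
        self.hash_to_block = {}
        self.next_block_id = 0

    def allocate(self, token_ids):
        """Return (block_table, number_of_tokens_served_from_cache)."""
        table, cached, prefix, hit = [], 0, None, True
        for i in range(0, len(token_ids), BLOCK_SIZE):
            chunk = token_ids[i:i + BLOCK_SIZE]
            full = len(chunk) == BLOCK_SIZE
            # Only full blocks are hashed; a partial tail block is never shared.
            h = block_hash(chunk, prefix) if full else None
            block_id = self.hash_to_block.get(h) if (full and hit) else None
            if block_id is None:
                hit = False  # after the first miss, the rest of the prefix differs
                block_id = self.next_block_id
                self.next_block_id += 1
                if full:
                    self.hash_to_block[h] = block_id
            else:
                cached += BLOCK_SIZE
            table.append(block_id)
            prefix = h
        return table, cached
```

Because the hash is chained, a prompt that differs only in its first token reuses nothing, even if every later block holds identical tokens.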