Chapter 1: Architecture Overview

This chapter introduces the high-level architecture of a vLLM-style inference engine. We start with nano-vllm — a minimal reimplementation in ~1200 lines of Python — to understand the core request lifecycle, then map each piece to the production vLLM codebase.

The Core Abstraction: LLMEngine

At the heart of both nano-vllm and vLLM sits an LLMEngine class. It is the single orchestrator that ties together scheduling, model execution, and output processing. Every inference request flows through it.

nano-vllm
nanovllm/engine/llm_engine.py
The core LLMEngine class (91 lines) — entry point for all inference requests in nano-vllm.

In nano-vllm, the engine’s public API is remarkably simple:

from itertools import count
from typing import List

class LLMEngine:
    def __init__(self, model_name: str, **kwargs):
        self.model = ModelRunner(model_name, **kwargs)
        self.scheduler = Scheduler(self.model.cache_config)
        self.tokenizer = self.model.tokenizer
        self._request_counter = count()  # monotonically increasing request ids

    def add_request(self, prompt: str, sampling_params: SamplingParams) -> str:
        """Tokenize prompt and register a new request with the scheduler."""
        request_id = str(next(self._request_counter))
        token_ids = self.tokenizer.encode(prompt)
        self.scheduler.add_request(request_id, token_ids, sampling_params)
        return request_id

    def step(self) -> List[RequestOutput]:
        """Run one decoding iteration: schedule → execute → process outputs."""
        scheduler_output = self.scheduler.schedule()
        if scheduler_output.is_empty:
            return []
        model_output = self.model.execute(scheduler_output)
        return self._process_outputs(scheduler_output, model_output)

The three key methods define the entire request lifecycle:

  1. add_request() — tokenizes the prompt and hands it to the scheduler
  2. step() — runs one forward pass: schedule, execute model, process outputs
  3. generate() — convenience wrapper that calls step() in a loop until all requests finish

Request Lifecycle

A request goes through these stages:

  1. Arrival — The user calls add_request() with a text prompt and sampling parameters. The prompt is tokenized and wrapped in a Sequence object.

  2. Scheduling — On each step(), the scheduler decides which sequences to run. It manages three queues: waiting (new prefills), running (active decodes), and swapped (preempted to CPU). The scheduler allocates KV cache blocks and builds a batch.

  3. Execution — The ModelRunner takes the scheduled batch, runs the transformer forward pass, and returns logits.

  4. Sampling — Logits are processed through the sampling parameters (temperature, top-p, top-k) to produce the next token for each sequence.

  5. Output processing — New tokens are appended to sequences. Finished sequences (hit EOS or max length) are moved out of the running queue and returned as completed outputs.
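The sampling stage above can be sketched in a few lines. This is an illustrative stand-in, not nano-vllm's actual Sampler (which operates on batched GPU tensors); the function name and parameters here are hypothetical, chosen to mirror the sampling parameters mentioned in stage 4:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, top_k=0, seed=0):
    """Pick a next-token id from a raw logits vector (illustrative sketch)."""
    if temperature == 0:                               # temperature 0 = greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    m = max(logits)                                    # subtract max for numerical stability
    probs = [math.exp((x - m) / temperature) for x in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k > 0:                                      # keep only the k most likely tokens
        ranked = ranked[:top_k]
    if top_p < 1.0:                                    # nucleus: smallest set with mass >= top_p
        kept, mass = [], 0.0
        for i in ranked:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        ranked = kept
    total = sum(probs[i] for i in ranked)              # renormalize over the surviving tokens
    weights = [probs[i] / total for i in ranked]
    return random.Random(seed).choices(ranked, weights=weights)[0]

print(sample_next_token([0.1, 2.0, 0.3], temperature=0))   # → 1 (greedy picks the argmax)
```

Note that temperature scales the logits before the softmax, while top-k and top-p prune the resulting distribution before the final draw.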

The generate() Loop

The top-level generate() method ties it all together. This is what the LLM convenience class calls under the hood:

def generate(self, prompts, sampling_params):
    # Phase 1: Add all requests
    for prompt in prompts:
        self.add_request(prompt, sampling_params)

    # Phase 2: Step until all requests complete
    outputs = {}
    while self.scheduler.has_unfinished():
        step_outputs = self.step()
        for output in step_outputs:
            if output.finished:
                outputs[output.request_id] = output

    # Sort numerically: request ids are stringified counters, so a
    # lexicographic sort would put "10" before "2"
    return sorted(outputs.values(), key=lambda x: int(x.request_id))

This is an offline batching pattern — all prompts are submitted upfront, then the engine iterates until every request has finished. In production vLLM, an AsyncLLMEngine wraps this with an async interface for online serving, but the core loop is identical.
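The shape of this offline loop can be seen with toy stand-in objects (none of these classes are nano-vllm's; each "request" here just counts down a fixed number of steps in place of real decoding):

```python
# Toy illustration of the offline-batching loop: submit everything upfront,
# then step until every request finishes.
class ToyEngine:
    def __init__(self):
        self.pending = {}            # request_id -> tokens still to generate

    def add_request(self, request_id, n_tokens):
        self.pending[request_id] = n_tokens

    def step(self):
        finished = []
        for rid in list(self.pending):
            self.pending[rid] -= 1   # "generate" one token per request
            if self.pending[rid] == 0:
                finished.append(rid)
                del self.pending[rid]
        return finished

engine = ToyEngine()
for rid, n in [("0", 3), ("1", 1), ("2", 2)]:
    engine.add_request(rid, n)

done = []
while engine.pending:                # same shape as scheduler.has_unfinished()
    done.extend(engine.step())

print(done)                          # → ['1', '2', '0'] (shortest request finishes first)
```

Note how completion order depends on request length, not submission order, which is exactly why the real generate() collects finished outputs into a dict and sorts at the end.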

Mapping to Production vLLM

The production vLLM follows the same architecture but with significantly more machinery for performance and flexibility:

vllm (production)
vllm/v1/engine/core.py
Production EngineCore — same schedule/execute/update pattern with ZMQ IPC, async support, and multi-GPU coordination.
vllm (production)
vllm/entrypoints/llm.py
The LLM convenience class — high-level API that wraps LLMEngine.generate().

Key differences in production vLLM:

  • AsyncLLMEngine adds an async layer for serving frameworks (FastAPI, etc.)
  • Multi-worker execution — the model runner can distribute across multiple GPUs via tensor parallelism
  • Detailed metrics — request latency, throughput, queue depths, cache utilization
  • LoRA and multi-modal support — the engine handles adapter weights and image/audio inputs
  • Speculative decoding — optional draft model for faster generation

Despite this complexity, if you understand the nano-vllm add_request → step → generate loop, you understand the backbone of production vLLM.
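The async layer is the biggest structural difference, and its shape can be sketched with toy objects: instead of stepping until *all* requests finish, one background loop steps continuously while each caller awaits only its own request's tokens. This is an illustrative asyncio sketch, not vLLM's actual AsyncLLMEngine API:

```python
import asyncio

class ToyAsyncEngine:
    def __init__(self):
        self.remaining = {}                    # request_id -> tokens left to "generate"
        self.queues = {}                       # request_id -> per-request output queue

    async def generate(self, request_id, n_tokens):
        """Called per request; awaits only this request's tokens."""
        self.remaining[request_id] = n_tokens
        q = self.queues[request_id] = asyncio.Queue()
        tokens = []
        while (tok := await q.get()) is not None:   # None signals completion
            tokens.append(tok)
        return tokens

    async def step_loop(self):
        """Single background loop stepping all active requests together."""
        while self.remaining:
            await asyncio.sleep(0)             # yield; a real engine awaits the GPU here
            for rid in list(self.remaining):
                self.remaining[rid] -= 1       # "generate" one token per request
                self.queues[rid].put_nowait(f"tok-{rid}")
                if self.remaining[rid] == 0:
                    del self.remaining[rid]
                    self.queues[rid].put_nowait(None)

async def main():
    engine = ToyAsyncEngine()
    a, b, _ = await asyncio.gather(
        engine.generate("a", 2), engine.generate("b", 1), engine.step_loop()
    )
    return a, b
```

The key design point carries over to the real system: batching still happens in one central loop, so concurrent requests share each forward pass rather than running the model independently.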

Key Components

Here is how the major components connect:

LLM (user-facing API)
 └── LLMEngine
      ├── Tokenizer        — encode prompts, decode tokens
      ├── Scheduler        — manage sequence queues, allocate blocks
      │    └── BlockManager — track KV cache block allocation
      └── ModelRunner       — execute the transformer model
           ├── Model        — the actual neural network (e.g., LlamaForCausalLM)
           └── Sampler      — convert logits → next tokens
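The BlockManager's role in this tree can be sketched as a free-list over fixed-size KV cache blocks. A minimal sketch with hypothetical names (the real BlockManager also handles copy-on-write, swapping, and prefix caching, covered in later chapters):

```python
class ToyBlockManager:
    """Track which fixed-size KV cache blocks each sequence owns."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unallocated blocks
        self.tables = {}                     # seq_id -> list of owned block ids

    def can_allocate(self, n_tokens: int) -> bool:
        needed = -(-n_tokens // self.block_size)   # ceil division
        return needed <= len(self.free)

    def allocate(self, seq_id: str, n_tokens: int):
        needed = -(-n_tokens // self.block_size)
        self.tables[seq_id] = [self.free.pop() for _ in range(needed)]

    def free_seq(self, seq_id: str):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id))

bm = ToyBlockManager(num_blocks=4, block_size=16)
bm.allocate("s0", 20)         # 20 tokens -> ceil(20/16) = 2 blocks
print(len(bm.free))           # → 2
```

This bookkeeping is what lets the scheduler answer "can I admit this prefill?" before building a batch: admission is just a `can_allocate` check against the free list.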

Each of these components gets its own chapter. In the next chapter, we will look at how requests are represented internally through the Sequence and SequenceGroup data structures.

Summary

  • The LLMEngine is the central orchestrator with a simple add_request / step / generate interface
  • Each step() performs: schedule → execute → sample → process outputs
  • nano-vllm implements this in ~1200 lines; production vLLM adds async serving, multi-GPU, metrics, and extensibility on top of the same pattern
  • Understanding the request lifecycle in nano-vllm gives you a direct map to the production codebase