Chapter 2: Sequence & Request State
Every inference request needs an internal representation that tracks its tokens, its progress through the engine, and the KV cache blocks it occupies. In nano-vllm this is the Sequence class; in production vLLM it is the Request class plus a family of DTOs that shuttle data between engine layers. This chapter dissects both.
The Sequence: A Request’s Lifecycle Object
In nano-vllm, a single Sequence object carries everything the engine needs to know about one request:
```python
from enum import Enum
from typing import List

class SequenceStatus(Enum):
    WAITING = "waiting"
    RUNNING = "running"
    FINISHED = "finished"

class Sequence:
    def __init__(self, request_id, token_ids, sampling_params):
        self.request_id = request_id
        self.token_ids = list(token_ids)        # prompt + generated tokens
        self.sampling_params = sampling_params
        self.status = SequenceStatus.WAITING
        self.block_table: List[int] = []        # physical block indices
        self._prompt_len = len(self.token_ids)  # prompt/generation boundary
```
A few things to notice:
- `token_ids` is a single flat list. Prompt tokens go in first; every newly generated token is appended to the same list. The boundary between prompt and generation is tracked implicitly by the length at creation time.
- `block_table` maps logical positions to physical KV cache block indices. We will explore this in depth in Chapter 4, but for now just know that the sequence owns this mapping.
- `status` starts at `WAITING` and follows a strict state machine.
The State Machine: WAITING → RUNNING → FINISHED
Every sequence transitions through exactly three states:
```
┌──────────┐   schedule()   ┌──────────┐  EOS / max_tokens  ┌──────────┐
│ WAITING  │ ─────────────▶ │ RUNNING  │ ─────────────────▶ │ FINISHED │
└──────────┘                └──────────┘                    └──────────┘
      ▲                           │
      │    preempt / re-schedule  │
      └───────────────────────────┘
```
- WAITING — The sequence has been added via `Sequence.__init__()` but has not yet been scheduled for its first forward pass (prefill). It sits in the scheduler's waiting queue.
- RUNNING — The scheduler has allocated KV cache blocks and included this sequence in the current batch. It will stay in this state across multiple decode steps, generating one token per step.
- FINISHED — The sequence hit a stop condition: an EOS token, the `max_tokens` limit, or a stop string. The scheduler removes it from the running queue and returns the completed output.
A sequence can also be preempted — moved back from RUNNING to WAITING — when the scheduler needs to free memory for higher-priority requests. We will cover preemption in Chapter 3.
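The three states and their legal transitions can be captured as a small guard table. This is a hypothetical helper for illustration, not part of nano-vllm's API:

```python
from enum import Enum

class SequenceStatus(Enum):
    WAITING = "waiting"
    RUNNING = "running"
    FINISHED = "finished"

# Legal transitions: schedule, finish, and preempt back to WAITING.
ALLOWED = {
    (SequenceStatus.WAITING, SequenceStatus.RUNNING),   # schedule()
    (SequenceStatus.RUNNING, SequenceStatus.FINISHED),  # EOS / max_tokens / stop
    (SequenceStatus.RUNNING, SequenceStatus.WAITING),   # preemption
}

def transition(current: SequenceStatus, new: SequenceStatus) -> SequenceStatus:
    # Reject anything not in the state machine, e.g. FINISHED -> RUNNING.
    if (current, new) not in ALLOWED:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```

Note that FINISHED is terminal: no transition leaves it, which is why a finished sequence can simply be dropped from the scheduler's queues.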
Token Management
The Sequence class manages tokens through simple list operations:
```python
class Sequence:
    @property
    def prompt_len(self):
        return self._prompt_len  # set at creation time

    @property
    def output_len(self):
        return len(self.token_ids) - self._prompt_len

    @property
    def num_total_tokens(self):
        return len(self.token_ids)

    def append_token(self, token_id: int):
        self.token_ids.append(token_id)
```
This flat-list design is intentional. During prefill, the model processes all prompt tokens at once. During decode, it processes only the latest token but needs the full KV cache from all previous tokens. The token_ids list gives the engine a single source of truth for the complete token history.
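Putting the two halves of the class together, here is the flat-list design in action (a condensed, stand-alone version of the class above, keeping only the token-tracking fields):

```python
class Sequence:
    def __init__(self, token_ids):
        self.token_ids = list(token_ids)        # prompt tokens first
        self._prompt_len = len(self.token_ids)  # boundary recorded at creation

    @property
    def prompt_len(self):
        return self._prompt_len

    @property
    def output_len(self):
        return len(self.token_ids) - self._prompt_len

    def append_token(self, token_id):
        self.token_ids.append(token_id)

# Prefill processes all 4 prompt tokens at once; each decode step
# appends exactly one token to the same list.
seq = Sequence([101, 7592, 2088, 102])  # 4 prompt tokens
seq.append_token(2003)                  # decode step 1
seq.append_token(1037)                  # decode step 2
```

Because the boundary is just `prompt_len`, slicing `seq.token_ids[seq.prompt_len:]` recovers the generated tokens at any time.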
SamplingParams: Controlling Generation
Each sequence is bound to a SamplingParams object that controls how logits are converted to tokens:
```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SamplingParams:
    max_tokens: int = 256
    temperature: float = 1.0
    top_p: float = 1.0
    top_k: int = -1  # -1 means disabled
    stop_token_ids: List[int] = field(default_factory=list)
```
The sampling params are immutable for the lifetime of a request. They travel with the sequence through scheduling and execution, and are read by the sampler after each forward pass to decide the next token.
Key parameters:
- `temperature` — scales logits before softmax. Lower values make the distribution sharper (more deterministic); `0.0` means greedy decoding.
- `top_p` (nucleus sampling) — keeps the smallest set of tokens whose cumulative probability exceeds `p`, then re-normalizes.
- `top_k` — keeps only the top-k highest-probability tokens.
- `max_tokens` — hard cap on generated tokens. When `output_len >= max_tokens`, the sequence is finished.
- `stop_token_ids` — if the sampled token is in this list, the sequence is finished.
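The first three parameters compose into a single filtering pipeline. The sketch below shows one common order of operations on plain Python lists; the real sampler works on GPU tensors, and greedy decoding (`temperature == 0.0`) is handled separately as an argmax:

```python
import math

def sample_probs(logits, temperature=1.0, top_k=-1, top_p=1.0):
    """Turn raw logits into a filtered, renormalized distribution."""
    if temperature != 1.0:
        logits = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [(tok, e / total) for tok, e in enumerate(exps)]
    probs.sort(key=lambda p: p[1], reverse=True)
    if top_k > 0:
        probs = probs[:top_k]            # keep only the k most likely tokens
    if top_p < 1.0:
        kept, cum = [], 0.0
        for tok, p in probs:             # smallest set whose mass exceeds top_p
            kept.append((tok, p))
            cum += p
            if cum >= top_p:
                break
        probs = kept
    norm = sum(p for _, p in probs)      # re-normalize the survivors
    return {tok: p / norm for tok, p in probs}
```

With `top_k=1` this degenerates to greedy decoding, since only the highest-probability token survives the filter.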
Block Table: Linking Sequences to KV Cache
The block_table field on each sequence is a list of physical block indices:
```python
# Example: a sequence using 3 physical blocks
seq.block_table = [14, 7, 22]
# Block 14 holds KV cache for tokens  0..15
# Block  7 holds KV cache for tokens 16..31
# Block 22 holds KV cache for tokens 32..47 (partially filled)
```
Each block holds a fixed number of token slots (the block size, typically 16). As the sequence generates more tokens and fills the current block, the block manager allocates a new physical block and appends its index to the block table. This is the core idea behind PagedAttention — we will cover it fully in Chapter 4.
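The mapping described above is plain integer arithmetic. A minimal sketch, assuming the block size of 16 used in the example:

```python
BLOCK_SIZE = 16

def num_blocks_needed(num_tokens: int) -> int:
    # Ceiling division: a partially filled block still occupies a full slot.
    return (num_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE

def physical_slot(block_table, token_pos):
    """Map a logical token position to (physical block index, offset in block)."""
    block = block_table[token_pos // BLOCK_SIZE]
    return block, token_pos % BLOCK_SIZE
```

For the example block table `[14, 7, 22]`, token 20 lives at offset 4 of physical block 7, and generating the 49th token would trigger allocation of a fourth block.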
Mapping to Production vLLM
Production vLLM splits the “sequence” concept across several specialized classes, each optimized for its layer of the system.
Request — The Engine-Level Representation
The Request class is the closest analog to nano-vllm’s Sequence. It tracks:
- `RequestStatus` — an enum similar to `SequenceStatus`, with states like `WAITING`, `RUNNING`, `PREEMPTED`, and `FINISHED_*` variants that distinguish finish reasons (length, stop token, abort).
- `num_computed_tokens` — how many tokens have had their KV cache computed. This is critical for chunked prefill, where a long prompt is processed across multiple steps.
- `append_output_token_ids()` — appends newly generated tokens, mirroring nano-vllm's `append_token()`.
- `block_hashes` — precomputed hashes for prefix caching. Each block's content is hashed so the cache manager can detect reusable blocks across requests.
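`num_computed_tokens` is what makes chunked prefill work: each scheduler step computes KV cache for only the next slice of the prompt, and the counter carries the progress across steps. A rough sketch of the idea (the chunk size and loop structure are illustrative, not vLLM's actual scheduler code):

```python
def chunked_prefill_steps(prompt_len, chunk_size):
    """Yield (start, end) prompt-token ranges, one range per scheduler step."""
    num_computed_tokens = 0
    while num_computed_tokens < prompt_len:
        end = min(num_computed_tokens + chunk_size, prompt_len)
        yield num_computed_tokens, end   # compute KV cache for this slice only
        num_computed_tokens = end        # progress persists across steps

# A 10-token prompt with 4-token chunks takes three steps.
steps = list(chunked_prefill_steps(prompt_len=10, chunk_size=4))
```

Because the counter lives on the request, a preempted request can also resume from wherever its KV cache computation left off.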
EngineCoreRequest and EngineCoreOutput — The DTO Layer
Production vLLM separates the API-facing representation from the engine-internal one:
- EngineCoreRequest — a serializable DTO that carries the prompt tokens, sampling params, and request metadata from the async API layer into the engine core. This boundary exists because the API server and engine core may run in different processes.
- EngineCoreOutput — the return DTO carrying new token IDs and finish status back to the API layer.
- `FinishReason` — an enum (`STOP`, `LENGTH`, `ABORT`) that tells the API layer why a request ended.
CachedRequestState and InputBatch — GPU-Side State
On the worker (GPU) side, the engine needs a compact representation optimized for building CUDA kernel inputs:
- CachedRequestState — caches per-request data (sampling params, token IDs, block table) on the worker to avoid re-sending it every step.
- InputBatch — aggregates all active requests into contiguous tensors for the model forward pass. It maintains numpy arrays for token IDs, positions, and block tables that can be efficiently copied to GPU.
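A toy version of the aggregation `InputBatch` performs, packing ragged per-request token lists into padded contiguous arrays (field names and padding scheme are illustrative; the real class also packs block tables and sampling metadata):

```python
import numpy as np

def build_input_batch(requests):
    """Pack per-request token-id lists into padded contiguous arrays.

    `requests` is a list of token-id lists. Contiguous numpy arrays can
    be copied to the GPU in one transfer instead of one per request.
    """
    max_len = max(len(r) for r in requests)
    token_ids = np.zeros((len(requests), max_len), dtype=np.int64)
    positions = np.zeros((len(requests), max_len), dtype=np.int64)
    for i, toks in enumerate(requests):
        token_ids[i, : len(toks)] = toks
        positions[i, : len(toks)] = np.arange(len(toks))
    return token_ids, positions

tokens, pos = build_input_batch([[5, 6, 7], [9]])
```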
Full SamplingParams
Production vLLM’s SamplingParams extends far beyond nano-vllm’s minimal version:
- `repetition_penalty`, `frequency_penalty`, `presence_penalty` — penalize repeated tokens
- `min_tokens` — force a minimum generation length before allowing EOS
- `logit_bias` — per-token logit adjustments
- `guided_decoding` — constrain output to match a JSON schema or regex
- `best_of` / `n` — generate multiple candidates and return the best
Despite the extra parameters, the core contract is the same: a SamplingParams object is bound to a request at creation time and read by the sampler on every decode step.
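As one concrete example of these extra knobs, here is the standard repetition-penalty formulation (a sketch of the widely used rule, not vLLM's actual kernel): logits of already-seen tokens are divided by the penalty when positive and multiplied when negative, so a seen token always becomes less likely.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Discourage tokens that have already appeared in the sequence."""
    out = list(logits)
    for tok in set(seen_token_ids):
        if out[tok] > 0:
            out[tok] /= penalty   # shrink positive logits
        else:
            out[tok] *= penalty   # push negative logits further down
    return out

# Tokens 0 and 1 were already generated; token 2 is untouched.
penalized = apply_repetition_penalty([2.0, -1.0, 0.5],
                                     seen_token_ids=[0, 1], penalty=2.0)
```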
The Layered Design
Here is how the request representations stack up across the two codebases:
```
nano-vllm                    vLLM (production)
─────────                    ──────────────────
                             EngineCoreRequest   (API → Engine DTO)
Sequence            ───▶     Request             (engine-internal state)
  .token_ids                   .prompt_token_ids / .output_token_ids
  .status                      .status (RequestStatus)
  .block_table                 .block_hashes + KVCacheManager
  .sampling_params             .sampling_params (full version)
                             CachedRequestState  (worker-side cache)
                             InputBatch          (GPU tensor batch)
                             EngineCoreOutput    (Engine → API DTO)
```
nano-vllm collapses all of this into a single Sequence class. Production vLLM fans it out across layers for process isolation, GPU efficiency, and extensibility. But the data is fundamentally the same: token IDs, a status, block mappings, and sampling parameters.
Summary
- Each request is represented as a `Sequence` (nano-vllm) or `Request` (vLLM) that tracks token IDs, status, block table, and sampling params
- The state machine `WAITING → RUNNING → FINISHED` governs every request's lifecycle, with preemption as an escape hatch back to `WAITING`
- Token management uses a flat list — prompt and generated tokens live together, with the boundary tracked by the original prompt length
- The block table links each sequence to its physical KV cache blocks, enabling PagedAttention's memory-efficient design
- Production vLLM splits the single-object design into layered DTOs (`EngineCoreRequest`, `Request`, `CachedRequestState`, `InputBatch`, `EngineCoreOutput`) for process isolation and GPU efficiency