Chapter 2: Sequence & Request State

Every inference request needs an internal representation that tracks its tokens, its progress through the engine, and the KV cache blocks it occupies. In nano-vllm this is the Sequence class; in production vLLM it is the Request class plus a family of DTOs that shuttle data between engine layers. This chapter dissects both.

The Sequence: A Request’s Lifecycle Object

In nano-vllm, a single Sequence object carries everything the engine needs to know about one request:

nano-vllm
nanovllm/engine/sequence.py
Sequence class and SequenceStatus enum — the core per-request data structure in nano-vllm.
from enum import Enum
from typing import List

class SequenceStatus(Enum):
    WAITING = "waiting"
    RUNNING = "running"
    FINISHED = "finished"

class Sequence:
    def __init__(self, request_id, token_ids, sampling_params):
        self.request_id = request_id
        self.token_ids = list(token_ids)       # prompt + generated tokens
        self.sampling_params = sampling_params
        self.status = SequenceStatus.WAITING
        self.block_table: List[int] = []       # physical block indices
        self._prompt_len = len(self.token_ids) # boundary between prompt and output

A few things to notice:

  • token_ids is a single flat list. Prompt tokens go in first; every newly generated token is appended to the same list. The boundary between prompt and generation is tracked implicitly by the length at creation time.
  • block_table maps logical positions to physical KV cache block indices. We will explore this in depth in Chapter 4, but for now just know that the sequence owns this mapping.
  • status starts at WAITING and follows a strict state machine.

The State Machine: WAITING → RUNNING → FINISHED

Every sequence transitions through exactly three states:

  ┌──────────┐     schedule()     ┌──────────┐     EOS / max_tokens     ┌──────────┐
  │ WAITING  │ ──────────────────▶│ RUNNING  │ ────────────────────────▶│ FINISHED │
  └──────────┘                    └──────────┘                          └──────────┘
                                       │  ▲
                                       │  │
                                  preempt / re-schedule
  1. WAITING — The sequence has been added via Sequence.__init__() but has not yet been scheduled for its first forward pass (prefill). It sits in the scheduler’s waiting queue.

  2. RUNNING — The scheduler has allocated KV cache blocks and included this sequence in the current batch. It will stay in this state across multiple decode steps, generating one token per step.

  3. FINISHED — The sequence hit a stop condition: an EOS token, the max_tokens limit, or a token in stop_token_ids. The scheduler removes it from the running queue and returns the completed output.

A sequence can also be preempted — moved back from RUNNING to WAITING — when the scheduler needs to free memory for higher-priority requests. We will cover preemption in Chapter 3.
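The transitions above can be captured as a small table. This is an illustrative sketch, not nano-vllm's actual code — nano-vllm simply assigns seq.status directly — but it makes the legal moves explicit:

```python
from enum import Enum

class SequenceStatus(Enum):
    WAITING = "waiting"
    RUNNING = "running"
    FINISHED = "finished"

# Legal transitions in the state machine diagrammed above.
LEGAL = {
    SequenceStatus.WAITING:  {SequenceStatus.RUNNING},    # schedule()
    SequenceStatus.RUNNING:  {SequenceStatus.FINISHED,    # EOS / max_tokens
                              SequenceStatus.WAITING},    # preemption
    SequenceStatus.FINISHED: set(),                       # terminal state
}

def transition(current: SequenceStatus, new: SequenceStatus) -> SequenceStatus:
    """Validate and apply a status change."""
    if new not in LEGAL[current]:
        raise ValueError(f"illegal transition: {current.name} -> {new.name}")
    return new
```

Note that FINISHED is terminal: a finished sequence is never rescheduled, which is why the table maps it to the empty set.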

Token Management

The Sequence class manages tokens through simple list operations:

class Sequence:
    @property
    def prompt_len(self):
        return self._prompt_len          # set at creation time

    @property
    def output_len(self):
        return len(self.token_ids) - self._prompt_len

    @property
    def num_total_tokens(self):
        return len(self.token_ids)

    def append_token(self, token_id: int):
        self.token_ids.append(token_id)

This flat-list design is intentional. During prefill, the model processes all prompt tokens at once. During decode, it processes only the latest token but needs the full KV cache from all previous tokens. The token_ids list gives the engine a single source of truth for the complete token history.
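To make the boundary concrete, here is a small self-contained sketch (the token IDs are made up for illustration):

```python
# Prompt and generated tokens share one flat list; the boundary is just
# the prompt length recorded at creation time.
prompt = [101, 2023, 2003]           # hypothetical prompt token IDs
prompt_len = len(prompt)             # recorded once, never changes

token_ids = list(prompt)
for tok in (1996, 3185, 102):        # tokens produced during decode steps
    token_ids.append(tok)

prompt_tokens = token_ids[:prompt_len]   # the original prompt
output_tokens = token_ids[prompt_len:]   # everything generated so far
```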

SamplingParams: Controlling Generation

Each sequence is bound to a SamplingParams object that controls how logits are converted to tokens:

nano-vllm
nanovllm/sampling_params.py
SamplingParams dataclass — temperature, top_p, top_k, max_tokens, and stop conditions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SamplingParams:
    max_tokens: int = 256
    temperature: float = 1.0
    top_p: float = 1.0
    top_k: int = -1              # -1 means disabled
    stop_token_ids: List[int] = field(default_factory=list)

The sampling params are immutable for the lifetime of a request. They travel with the sequence through scheduling and execution, and are read by the sampler after each forward pass to decide the next token.

Key parameters:

  • temperature — scales logits before softmax. Lower values make the distribution sharper (more deterministic); 0.0 means greedy decoding.
  • top_p (nucleus sampling) — keeps the smallest set of tokens whose cumulative probability exceeds p, then re-normalizes.
  • top_k — keeps only the top-k highest-probability tokens.
  • max_tokens — hard cap on generated tokens. When output_len >= max_tokens, the sequence is finished.
  • stop_token_ids — if the sampled token is in this list, the sequence is finished.
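These parameters compose into a single sampling step. The sketch below is a plain-Python illustration of that pipeline, not a production sampler (real samplers operate on GPU tensors, and details like tie-breaking and renormalization order differ):

```python
import math
import random

def sample(logits, temperature=1.0, top_p=1.0, top_k=-1):
    """Apply temperature, top-k, and top-p, then sample one token ID."""
    # temperature == 0.0 means greedy decoding: just take the argmax
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # scale logits, then softmax (subtracting the max for stability)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # top-k: keep only the k highest-probability tokens (-1 disables)
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k > 0:
        order = order[:top_k]
    # top-p: smallest prefix whose cumulative probability reaches p
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # renormalize over the kept tokens and sample
    z = sum(probs[i] for i in kept)
    r = random.random() * z
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

With temperature=0.0 or top_k=1 the function is deterministic; otherwise it draws from the truncated, renormalized distribution.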

Block Table: Linking Sequences to KV Cache

The block_table field on each sequence is a list of physical block indices:

# Example: a sequence using 3 physical blocks
seq.block_table = [14, 7, 22]
# Block 14 holds KV cache for tokens 0..15
# Block 7  holds KV cache for tokens 16..31
# Block 22 holds KV cache for tokens 32..47 (partially filled)

Each block holds a fixed number of token slots (the block size, typically 16). As the sequence generates more tokens and fills the current block, the block manager allocates a new physical block and appends its index to the block table. This is the core idea behind PagedAttention — we will cover it fully in Chapter 4.
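The mapping from a logical token position to a physical slot is simple arithmetic. Assuming a block size of 16, as in the example above:

```python
BLOCK_SIZE = 16  # token slots per block, matching the example above

def locate(block_table, token_pos):
    """Map a logical token position to (physical block, slot within block)."""
    return block_table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

def blocks_needed(num_tokens):
    """Number of physical blocks a sequence of this length occupies."""
    return -(-num_tokens // BLOCK_SIZE)  # ceiling division

block_table = [14, 7, 22]
locate(block_table, 0)    # token 0  -> block 14, slot 0
locate(block_table, 20)   # token 20 -> block 7,  slot 4
blocks_needed(33)         # 33 tokens span 3 blocks (the last one partial)
```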

Mapping to Production vLLM

Production vLLM splits the “sequence” concept across several specialized classes, each optimized for its layer of the system.

Request — The Engine-Level Representation

vllm (production)
vllm/v1/request.py
Request class with RequestStatus, token tracking, and block hashes for prefix caching.

The Request class is the closest analog to nano-vllm’s Sequence. It tracks:

  • RequestStatus — an enum similar to SequenceStatus, with states like WAITING, RUNNING, PREEMPTED, and FINISHED_* variants that distinguish finish reasons (length, stop token, abort).
  • num_computed_tokens — how many tokens have had their KV cache computed. This is critical for chunked prefill, where a long prompt is processed across multiple steps.
  • append_output_token_ids() — appends newly generated tokens, mirroring nano-vllm’s append_token().
  • block_hashes — precomputed hashes for prefix caching. Each block’s content is hashed so the cache manager can detect reusable blocks across requests.
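To see why num_computed_tokens matters, consider how chunked prefill splits a long prompt across steps. The helper below is a hypothetical illustration of the bookkeeping, not vLLM's actual scheduler code:

```python
def prefill_chunks(prompt_len: int, chunk_size: int) -> list:
    """Split a prompt into per-step chunk sizes, advancing a counter the
    way num_computed_tokens advances across scheduler steps."""
    chunks = []
    num_computed_tokens = 0
    while num_computed_tokens < prompt_len:
        chunk = min(chunk_size, prompt_len - num_computed_tokens)
        chunks.append(chunk)
        num_computed_tokens += chunk
    return chunks

prefill_chunks(1000, 256)   # [256, 256, 256, 232]
```

A 1000-token prompt with a 256-token chunk budget takes four steps; only after the final 232-token chunk does decoding begin.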

EngineCoreRequest and EngineCoreOutput — The DTO Layer

vllm (production)
vllm/v1/engine/__init__.py
EngineCoreRequest, EngineCoreOutput, and FinishReason — data transfer objects between the API layer and the engine core.

Production vLLM separates the API-facing representation from the engine-internal one:

  • EngineCoreRequest — a serializable DTO that carries the prompt tokens, sampling params, and request metadata from the async API layer into the engine core. This boundary exists because the API server and engine core may run in different processes.
  • EngineCoreOutput — the return DTO carrying new token IDs and finish status back to the API layer.
  • FinishReason — an enum (STOP, LENGTH, ABORT) that tells the API layer why a request ended.

CachedRequestState and InputBatch — GPU-Side State

vllm (production)
vllm/v1/worker/gpu_input_batch.py
CachedRequestState and InputBatch — manages per-request state on the GPU worker side.

On the worker (GPU) side, the engine needs a compact representation optimized for building CUDA kernel inputs:

  • CachedRequestState — caches per-request data (sampling params, token IDs, block table) on the worker to avoid re-sending it every step.
  • InputBatch — aggregates all active requests into contiguous tensors for the model forward pass. It maintains numpy arrays for token IDs, positions, and block tables that can be efficiently copied to GPU.
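A rough sketch of what "aggregating into contiguous tensors" means: pad each request's ragged lists out to a rectangle so they can be converted to one array per field. The names and padding scheme here are illustrative, not vLLM's exact layout:

```python
# Two hypothetical active requests with different lengths.
requests = [
    {"token_ids": [101, 2023, 2003], "block_table": [14, 7]},
    {"token_ids": [101, 7592],       "block_table": [3]},
]

PAD = 0
max_len    = max(len(r["token_ids"])   for r in requests)
max_blocks = max(len(r["block_table"]) for r in requests)

# Rectangular row-major layouts, ready for np.asarray / torch.tensor
# and a single host-to-GPU copy per field.
token_ids = [r["token_ids"] + [PAD] * (max_len - len(r["token_ids"]))
             for r in requests]
block_tables = [r["block_table"] + [PAD] * (max_blocks - len(r["block_table"]))
                for r in requests]
```

The payoff is that the forward pass sees one dense batch instead of a Python loop over requests.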

Full SamplingParams

vllm (production)
vllm/sampling_params.py
Production SamplingParams — extensive parameter set including repetition penalty, min_tokens, logit bias, guided decoding, and more.

Production vLLM’s SamplingParams extends far beyond nano-vllm’s minimal version:

  • repetition_penalty, frequency_penalty, presence_penalty — penalize repeated tokens
  • min_tokens — force a minimum generation length before allowing EOS
  • logit_bias — per-token logit adjustments
  • guided_decoding — constrain output to match a JSON schema or regex
  • best_of / n — generate multiple candidates and return the best

Despite the extra parameters, the core contract is the same: a SamplingParams object is bound to a request at creation time and read by the sampler on every decode step.

The Layered Design

Here is how the request representations stack up across the two codebases:

nano-vllm                          vLLM (production)
─────────                          ──────────────────
                                   EngineCoreRequest  (API → Engine DTO)
Sequence                     ───▶  Request            (engine-internal state)
  .token_ids                       .prompt_token_ids / .output_token_ids
  .status                          .status (RequestStatus)
  .block_table                     .block_hashes + KVCacheManager
  .sampling_params                 .sampling_params (full version)
                                   CachedRequestState (worker-side cache)
                                   InputBatch         (GPU tensor batch)
                                   EngineCoreOutput   (Engine → API DTO)

nano-vllm collapses all of this into a single Sequence class. Production vLLM fans it out across layers for process isolation, GPU efficiency, and extensibility. But the data is fundamentally the same: token IDs, a status, block mappings, and sampling parameters.

Summary

  • Each request is represented as a Sequence (nano-vllm) or Request (vLLM) that tracks token IDs, status, block table, and sampling params
  • The state machine WAITING → RUNNING → FINISHED governs every request’s lifecycle, with preemption as an escape hatch back to WAITING
  • Token management uses a flat list — prompt and generated tokens live together, with the boundary tracked by the original prompt length
  • The block table links each sequence to its physical KV cache blocks, enabling PagedAttention’s memory-efficient design
  • Production vLLM splits the single-object design into layered DTOs (EngineCoreRequest, Request, CachedRequestState, InputBatch, EngineCoreOutput) for process isolation and GPU efficiency