Chapter 6: Model Architecture

This chapter walks through how a Transformer decoder layer is assembled from the building blocks we have seen so far — attention, linear layers, normalization, and activations. We will use nano-vllm’s Qwen3 implementation as the reference, since it is a clean 217-line file that covers every component of a modern LLM architecture.

The Decoder Layer

A single Transformer decoder layer follows the pre-norm residual pattern:

input
  → RMSNorm → Self-Attention → (+ residual)
  → RMSNorm → MLP            → (+ residual)
output
nano-vllm
nanovllm/models/qwen3.py
Complete Qwen3 model — Attention, MLP, DecoderLayer, and CausalLM classes in 217 lines.

Here is the decoder layer in nano-vllm:

class Qwen3DecoderLayer(nn.Module):
    def __init__(self, config, layer_idx):
        super().__init__()
        self.self_attn = Qwen3Attention(config, layer_idx)
        self.mlp = Qwen3MLP(config)
        self.input_layernorm = RMSNorm(config.hidden_size, config.rms_norm_eps)
        self.post_attention_layernorm = RMSNorm(config.hidden_size, config.rms_norm_eps)

    def forward(self, positions, hidden_states, kv_cache, attn_metadata, residual):
        # Pre-norm + attention; the norm also folds in the incoming residual
        if residual is None:
            # First layer: seed the residual stream with the raw embeddings
            residual = hidden_states
            hidden_states, _ = self.input_layernorm(hidden_states)
        else:
            hidden_states, residual = self.input_layernorm(hidden_states, residual)
        hidden_states = self.self_attn(positions, hidden_states, kv_cache, attn_metadata)

        # Pre-norm + MLP
        hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
        hidden_states = self.mlp(hidden_states)

        return hidden_states, residual

Notice the residual is passed through the layer and managed by the RMSNorm — this is the fused add-residual pattern we will cover below.

Fused QKV Projection

The attention layer projects the hidden state into queries, keys, and values. A naive implementation would use three separate linear layers. nano-vllm fuses them into a single matmul:

class Qwen3Attention(nn.Module):
    def __init__(self, config, layer_idx):
        super().__init__()
        self.num_heads = config.num_attention_heads
        self.num_kv_heads = config.num_key_value_heads
        # Qwen3 configs set head_dim explicitly; fall back to the usual ratio
        self.head_dim = getattr(config, "head_dim",
                                config.hidden_size // self.num_heads)

        # Fused QKV: one matmul produces Q, K, V concatenated
        self.qkv_proj = QKVParallelLinear(
            config.hidden_size,
            self.head_dim,
            self.num_heads,
            self.num_kv_heads,
        )
        self.o_proj = RowParallelLinear(
            self.num_heads * self.head_dim, config.hidden_size
        )
        self.q_norm = RMSNorm(self.head_dim, config.rms_norm_eps)
        self.k_norm = RMSNorm(self.head_dim, config.rms_norm_eps)
        self.attn = Attention(
            self.num_heads, self.head_dim, self.num_kv_heads,
            scale=self.head_dim ** -0.5,
            sliding_window=getattr(config, "sliding_window", None),
        )
        self.rotary_emb = get_rope(self.head_dim, config.max_position_embeddings,
                                   config.rope_theta)

    def forward(self, positions, hidden_states, kv_cache, attn_metadata):
        qkv = self.qkv_proj(hidden_states)
        q, k, v = qkv.split(
            [self.num_heads * self.head_dim,
             self.num_kv_heads * self.head_dim,
             self.num_kv_heads * self.head_dim], dim=-1
        )
        # RMSNorm returns a (normed, residual) pair; the residual is unused here
        q, _ = self.q_norm(q.view(-1, self.num_heads, self.head_dim))
        k, _ = self.k_norm(k.view(-1, self.num_kv_heads, self.head_dim))
        v = v.view(-1, self.num_kv_heads, self.head_dim)

        q, k = self.rotary_emb(positions, q, k)
        output = self.attn(q, k, v, kv_cache, attn_metadata)
        return self.o_proj(output.view(-1, self.num_heads * self.head_dim))

The Qwen3Attention class shows the full flow:

  1. Fused QKV projection — a single QKVParallelLinear produces Q, K, V concatenated, then splits them. One matmul instead of three.
  2. QK normalization — Qwen3 applies RMSNorm to Q and K before RoPE (not all models do this).
  3. Rotary position embedding — applies RoPE to Q and K so the model understands token positions.
  4. Attention — dispatches to the prefill or decode path (Chapter 5).
  5. Output projection — a RowParallelLinear maps back to hidden size.
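To see why the fusion in step 1 is safe, note that stacking the three weight matrices and doing one matmul yields exactly the same Q, K, and V as three separate matmuls. A minimal sketch, with made-up sizes standing in for the real config values:

```python
import torch

torch.manual_seed(0)
hidden = 64
q_dim, kv_dim = 8 * 8, 2 * 8        # 8 Q heads, 2 KV heads, head_dim 8 (illustrative sizes)

# Naive: three separate projection weights
w_q = torch.randn(q_dim, hidden)
w_k = torch.randn(kv_dim, hidden)
w_v = torch.randn(kv_dim, hidden)
x = torch.randn(3, hidden)           # 3 tokens
q, k, v = x @ w_q.T, x @ w_k.T, x @ w_v.T

# Fused: concatenate the weights once, do one matmul, then split the output
w_qkv = torch.cat([w_q, w_k, w_v], dim=0)
q2, k2, v2 = (x @ w_qkv.T).split([q_dim, kv_dim, kv_dim], dim=-1)

assert torch.allclose(q, q2, atol=1e-5)
assert torch.allclose(k, k2, atol=1e-5)
assert torch.allclose(v, v2, atol=1e-5)
```

The fused weight here is built the same way the weight loader builds qkv_proj at load time: by stacking the shards along the output dimension.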

SwiGLU Activation (MLP)

The MLP block uses the SwiGLU activation, which combines a gated linear unit with the SiLU (Swish) nonlinearity:

class Qwen3MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.gate_up_proj = MergedColumnParallelLinear(
            config.hidden_size,
            [config.intermediate_size, config.intermediate_size],
        )
        self.down_proj = RowParallelLinear(
            config.intermediate_size, config.hidden_size
        )
        self.act_fn = SiluAndMul()

    def forward(self, x):
        x = self.gate_up_proj(x)
        x = self.act_fn(x)
        x = self.down_proj(x)
        return x
nano-vllm
nanovllm/layers/activation.py
SiluAndMul — fused SwiGLU activation that splits input into gate and up projections.

The SiluAndMul activation works on the concatenated output of gate_up_proj:

class SiluAndMul(nn.Module):
    def forward(self, x):
        gate, up = x.chunk(2, dim=-1)
        return F.silu(gate) * up

The gate and up projections are fused into a single MergedColumnParallelLinear, just like QKV fusion: one matmul instead of two, then split the output and apply the activation.
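A quick way to check the shapes and the gating: SiLU is x·sigmoid(x), and SwiGLU multiplies the activated gate half by the up half. A small sketch with illustrative sizes:

```python
import torch
import torch.nn.functional as F

intermediate = 8
x = torch.randn(4, 2 * intermediate)   # fused gate_up output for 4 tokens (made-up sizes)

gate, up = x.chunk(2, dim=-1)          # first half gates, second half carries the signal
out = F.silu(gate) * up                # what SiluAndMul computes

# SiLU(x) is x * sigmoid(x), so the same result written out explicitly:
manual = gate * torch.sigmoid(gate) * up
assert torch.allclose(out, manual)
assert out.shape == (4, intermediate)  # the activation halves the last dimension
```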

RMSNorm with Fused Add-Residual

Standard LayerNorm subtracts the mean and divides by the standard deviation. RMSNorm drops the mean-centering and normalizes by the root mean square alone, which is cheaper and works just as well in practice:

nano-vllm
nanovllm/layers/layernorm.py
RMSNorm with optional fused add-residual variant for reduced memory traffic.
class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, hidden_states, residual=None):
        if residual is not None:
            # Add the residual first; the sum becomes the residual for the next block
            hidden_states = hidden_states + residual
            residual = hidden_states
        # Normalize in float32 for stability, then cast back to the weight dtype
        hidden_states = hidden_states.float()
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
        return (self.weight * hidden_states.to(self.weight.dtype), residual)

The fused variant matters for performance. Unfused, the residual add and the normalization are two separate operations, each making a full read-write pass over the hidden state. A fused kernel does both in one pass, roughly halving the memory traffic: the hidden state is read once, the residual added, the result normalized and written back. The pure-PyTorch version above keeps the same interface, so swapping in a real fused kernel (as production vLLM does) changes no model code.

Look back at the decoder layer’s forward: the residual tensor flows through the entire layer, updated by each RMSNorm call. This avoids materializing intermediate residual tensors.
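The residual threading can be checked in isolation. Here is a sketch of the pattern using plain functions in place of the RMSNorm module (the function names are illustrative, not from nano-vllm):

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    # Reference RMSNorm without residual handling
    var = x.float().pow(2).mean(-1, keepdim=True)
    return weight * (x.float() * torch.rsqrt(var + eps)).to(weight.dtype)

def fused_add_norm(x, residual, weight):
    # The pattern from the forward above: add once, keep the sum as the new residual
    s = x + residual
    return rms_norm(s, weight), s

h, r, w = torch.randn(2, 8), torch.randn(2, 8), torch.ones(8)
out, new_residual = fused_add_norm(h, r, w)

assert torch.allclose(out, rms_norm(h + r, w))  # same math as a separate add + norm
assert torch.equal(new_residual, h + r)         # the sum feeds the next block
```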

Rotary Position Embedding (RoPE)

RoPE encodes position information by rotating Q and K vectors in pairs of dimensions:

nano-vllm
nanovllm/layers/rotary_embedding.py
RoPE implementation with get_rope() singleton for shared position embeddings.
def get_rope(head_dim, max_position_embeddings, base=10000.0):
    """Singleton factory — all layers share the same RoPE instance."""
    ...

The get_rope() function returns a singleton so all attention layers share the same precomputed cos/sin tables. The rotation is applied to Q and K before the attention computation, giving the model relative-position information without learned absolute position embeddings.
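The rotation itself is easy to sketch. The following is a minimal, unoptimized RoPE (not nano-vllm's implementation) that rotates consecutive dimension pairs by position-dependent angles; because each pair undergoes a pure rotation, vector norms are preserved:

```python
import torch

def apply_rope(x, positions, base=10000.0):
    # Rotate consecutive dimension pairs of x by position-dependent angles.
    # x: (seq, heads, head_dim), positions: (seq,)
    head_dim = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    angles = positions.float()[:, None] * inv_freq      # (seq, head_dim // 2)
    cos = angles.cos()[:, None, :]                      # broadcast over heads
    sin = angles.sin()[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(5, 4, 8)                 # 5 positions, 4 heads, head_dim 8
q_rot = apply_rope(q, torch.arange(5))

assert torch.allclose(q_rot[0], q[0], atol=1e-6)                      # position 0: no rotation
assert torch.allclose(q.norm(dim=-1), q_rot.norm(dim=-1), atol=1e-5)  # norms preserved
```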

The Full Model: Qwen3ForCausalLM

The top-level model class stacks everything together:

class Qwen3ForCausalLM(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed_tokens = VocabParallelEmbedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList([
            Qwen3DecoderLayer(config, i) for i in range(config.num_hidden_layers)
        ])
        self.norm = RMSNorm(config.hidden_size, config.rms_norm_eps)
        self.lm_head = ParallelLMHead(config.vocab_size, config.hidden_size)

    def forward(self, input_ids, positions, kv_caches, attn_metadata):
        hidden_states = self.embed_tokens(input_ids)
        residual = None
        for i, layer in enumerate(self.layers):
            hidden_states, residual = layer(positions, hidden_states,
                                            kv_caches[i], attn_metadata, residual)
        hidden_states, _ = self.norm(hidden_states, residual)
        logits = self.lm_head(hidden_states)
        return logits

Qwen3ForCausalLM shows the complete forward pass: embed the tokens, run them through N decoder layers (each with its own KV cache slice), apply the final norm, then project to vocabulary logits.

Packed Modules Mapping

When loading checkpoint weights, the fused projections need special handling. A Hugging Face checkpoint stores separate q_proj, k_proj, v_proj weights, but nano-vllm’s qkv_proj expects them concatenated. The model defines a mapping:

packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

The weight loader reads this mapping to know which checkpoint weights should be stacked into which fused parameter. This is a pattern shared across all model implementations in both nano-vllm and vLLM.
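A sketch of what the loader does with the mapping. The checkpoint tensors and sizes here are invented for illustration:

```python
import torch

packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

# Invented checkpoint shards: 8 Q heads and 2 KV heads of head_dim 8, hidden size 64
checkpoint = {
    "q_proj.weight": torch.randn(64, 64),
    "k_proj.weight": torch.randn(16, 64),
    "v_proj.weight": torch.randn(16, 64),
}

# Stack the shard weights along the output dimension into one fused parameter
shard_names = packed_modules_mapping["qkv_proj"]
fused = torch.cat([checkpoint[f"{n}.weight"] for n in shard_names], dim=0)

assert fused.shape == (64 + 16 + 16, 64)
# Row blocks of the fused weight are exactly the original shards
assert torch.equal(fused[:64], checkpoint["q_proj.weight"])
```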

Mapping to Production vLLM

Production vLLM follows the exact same architecture pattern but at much larger scale:

vllm (production)
vllm/model_executor/models/
200+ model implementations — LLaMA, Qwen2, Mistral, GPT-NeoX, and many more, all following the same pattern.
vllm (production)
vllm/model_executor/layers/layernorm.py
Production RMSNorm — uses fused CUDA kernels (from flash-attn or vllm-flash-attn) for maximum throughput.
vllm (production)
vllm/model_executor/layers/rotary_embedding/
RoPE implementations — supports standard, YaRN, dynamic NTK, and other scaling methods.

Key differences from nano-vllm:

  • 200+ model implementations — vLLM supports a vast range of architectures, but they all follow the same decoder-layer pattern with fused projections and pluggable attention
  • Fused CUDA kernels — RMSNorm, SiLU-and-mul, and rotary embedding all have optimized CUDA kernel implementations instead of pure PyTorch
  • RoPE variants — production vLLM supports YaRN, dynamic NTK scaling, and other position embedding extensions for long-context models
  • Weight loading infrastructure — a sophisticated WeightLoader handles quantized weights (GPTQ, AWQ, FP8), sharded checkpoints, and LoRA adapter merging
  • Packed modules mapping — the same packed_modules_mapping pattern is used across all models, with a centralized weight loading system that handles the stacking automatically

Despite the scale difference, if you can read nano-vllm’s Qwen3 implementation, you can read any model in vLLM — the structure is identical.

Summary

  • A decoder layer follows the pre-norm residual pattern: RMSNorm + Attention + RMSNorm + MLP, with residual connections flowing through
  • QKV projections are fused into a single QKVParallelLinear matmul, and gate/up projections into MergedColumnParallelLinear
  • SiluAndMul implements SwiGLU by splitting the fused gate+up output and applying SiLU to the gate half
  • RMSNorm fuses the residual addition to cut memory traffic in half
  • packed_modules_mapping tells the weight loader how to stack checkpoint weights into fused parameters
  • Production vLLM uses the same architecture with fused CUDA kernels, 200+ model implementations, and support for quantization and LoRA