Chapter 6: Model Architecture
This chapter walks through how a Transformer decoder layer is assembled from the building blocks we have seen so far — attention, linear layers, normalization, and activations. We will use nano-vllm’s Qwen3 implementation as the reference, since it is a clean 217-line file that covers every component of a modern LLM architecture.
The Decoder Layer
A single Transformer decoder layer follows the pre-norm residual pattern:
```
input
  ├── RMSNorm → Self-Attention → (+ residual)
  └── RMSNorm → MLP → (+ residual)
output
```
Here is the decoder layer in nano-vllm:
```python
class Qwen3DecoderLayer(nn.Module):
    def __init__(self, config, layer_idx):
        super().__init__()
        self.self_attn = Qwen3Attention(config, layer_idx)
        self.mlp = Qwen3MLP(config)
        self.input_layernorm = RMSNorm(config.hidden_size, config.rms_norm_eps)
        self.post_attention_layernorm = RMSNorm(config.hidden_size, config.rms_norm_eps)

    def forward(self, positions, hidden_states, kv_cache, attn_metadata, residual):
        # Pre-norm + attention
        if residual is None:
            # First layer: the embedding output becomes the initial residual
            residual = hidden_states
            hidden_states = self.input_layernorm(hidden_states)
        else:
            hidden_states, residual = self.input_layernorm(hidden_states, residual)
        hidden_states = self.self_attn(positions, hidden_states, kv_cache, attn_metadata)
        # Pre-norm + MLP
        hidden_states, residual = self.post_attention_layernorm(hidden_states, residual)
        hidden_states = self.mlp(hidden_states)
        return hidden_states, residual
```
Notice the residual is passed through the layer and managed by the RMSNorm — this is the fused add-residual pattern we will cover below.
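Stripped of the fused-norm interface, the pre-norm pattern is just two residual adds. Here is a minimal sketch with plain functions standing in for the real modules (identity/doubling lambdas are only there to make the dataflow visible):

```python
import torch

def prenorm_layer(h, norm1, attn, norm2, mlp):
    # Unfused reference: each residual add is an explicit, separate op.
    h = h + attn(norm1(h))   # pre-norm, self-attention, residual add
    h = h + mlp(norm2(h))    # pre-norm, MLP, residual add
    return h

# Toy stand-ins just to trace the arithmetic; real modules are nn.Modules.
h = torch.ones(2, 4)
out = prenorm_layer(h, lambda x: x, lambda x: 2 * x, lambda x: x, lambda x: x)
print(out[0, 0].item())  # 6.0: 1 + 2*1 = 3 after attention, then 3 + 3 = 6 after MLP
```

The fused version in nano-vllm computes exactly this, but folds each `+ residual` into the following RMSNorm call.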
Fused QKV Projection
The attention layer projects the hidden state into queries, keys, and values. A naive implementation would use three separate linear layers. nano-vllm fuses them into a single matmul:
```python
class Qwen3Attention(nn.Module):
    def __init__(self, config, layer_idx):
        super().__init__()
        self.num_heads = config.num_attention_heads
        self.num_kv_heads = config.num_key_value_heads
        self.head_dim = config.hidden_size // self.num_heads
        # Fused QKV: one matmul produces Q, K, V concatenated
        self.qkv_proj = QKVParallelLinear(
            config.hidden_size,
            self.head_dim,
            self.num_heads,
            self.num_kv_heads,
        )
        self.o_proj = RowParallelLinear(
            config.hidden_size, config.hidden_size
        )
        self.q_norm = RMSNorm(self.head_dim, config.rms_norm_eps)
        self.k_norm = RMSNorm(self.head_dim, config.rms_norm_eps)
        self.attn = Attention(
            self.num_heads, self.head_dim, self.num_kv_heads,
            scale=self.head_dim ** -0.5,
            sliding_window=getattr(config, "sliding_window", None),
        )
        self.rotary_emb = get_rope(self.head_dim, config.max_position_embeddings,
                                   config.rope_theta)

    def forward(self, positions, hidden_states, kv_cache, attn_metadata):
        qkv = self.qkv_proj(hidden_states)
        q, k, v = qkv.split(
            [self.num_heads * self.head_dim,
             self.num_kv_heads * self.head_dim,
             self.num_kv_heads * self.head_dim], dim=-1
        )
        q = self.q_norm(q.view(-1, self.num_heads, self.head_dim))
        k = self.k_norm(k.view(-1, self.num_kv_heads, self.head_dim))
        v = v.view(-1, self.num_kv_heads, self.head_dim)
        q, k = self.rotary_emb(positions, q, k)
        output = self.attn(q, k, v, kv_cache, attn_metadata)
        return self.o_proj(output.view(-1, self.num_heads * self.head_dim))
```
The Qwen3Attention class shows the full flow:
- Fused QKV projection — a single QKVParallelLinear produces Q, K, V concatenated, then splits them. One matmul instead of three.
- QK normalization — Qwen3 applies RMSNorm to Q and K before RoPE (not all models do this).
- Rotary position embedding — applies RoPE to Q and K so the model understands token positions.
- Attention — dispatches to the prefill or decode path (Chapter 5).
- Output projection — a RowParallelLinear maps back to hidden size.
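To make the fused split concrete, here is a standalone sketch with a plain `nn.Linear` standing in for `QKVParallelLinear` and made-up GQA sizes (16 query heads sharing 8 KV heads; these numbers are illustrative, not from a specific checkpoint):

```python
import torch

# Illustrative GQA shapes (hypothetical, not a real Qwen3 config):
hidden_size, num_heads, num_kv_heads = 1024, 16, 8
head_dim = hidden_size // num_heads          # 64
q_size = num_heads * head_dim                # 1024
kv_size = num_kv_heads * head_dim            # 512

# One fused matmul produces Q, K, V concatenated along the last dim.
qkv_proj = torch.nn.Linear(hidden_size, q_size + 2 * kv_size, bias=False)
x = torch.randn(3, hidden_size)              # 3 tokens
qkv = qkv_proj(x)
q, k, v = qkv.split([q_size, kv_size, kv_size], dim=-1)
print(q.shape, k.shape, v.shape)
# torch.Size([3, 1024]) torch.Size([3, 512]) torch.Size([3, 512])
```

Because K and V use fewer heads than Q, their slices of the fused output are proportionally narrower, which is exactly why the split sizes in the real forward are spelled out per projection.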
SwiGLU Activation (MLP)
The MLP block uses the SwiGLU activation, which combines a gated linear unit with the SiLU (Swish) nonlinearity:
```python
class Qwen3MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.gate_up_proj = MergedColumnParallelLinear(
            config.hidden_size,
            [config.intermediate_size, config.intermediate_size],
        )
        self.down_proj = RowParallelLinear(
            config.intermediate_size, config.hidden_size
        )
        self.act_fn = SiluAndMul()

    def forward(self, x):
        x = self.gate_up_proj(x)
        x = self.act_fn(x)
        x = self.down_proj(x)
        return x
```
The SiluAndMul activation works on the concatenated output of gate_up_proj:
```python
class SiluAndMul(nn.Module):
    def forward(self, x):
        gate, up = x.chunk(2, dim=-1)
        return F.silu(gate) * up
```
The gate and up projections are fused into a single MergedColumnParallelLinear, just like the QKV fusion — one matmul instead of two, then split and apply the activation.
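A quick numerical check that splitting the fused output reproduces explicit gating (illustrative shapes; `silu(x) = x * sigmoid(x)`):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 8)                      # fused gate_up output: gate and up, width 4 each
gate, up = x.chunk(2, dim=-1)

fused = F.silu(gate) * up                  # what SiluAndMul computes
manual = gate * torch.sigmoid(gate) * up   # explicit SwiGLU: silu on gate, multiply by up
assert torch.allclose(fused, manual)
```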
RMSNorm with Fused Add-Residual
Standard LayerNorm computes both the mean and the variance of the activations. RMSNorm simplifies this by normalizing only by the root mean square (no mean subtraction), which is cheaper and works just as well in practice:
```python
class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, hidden_states, residual=None):
        if residual is not None:
            # Fused: add residual + normalize in one kernel
            hidden_states = hidden_states + residual
            residual = hidden_states
        hidden_states = hidden_states.float()
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.eps)
        out = self.weight * hidden_states.to(self.weight.dtype)
        # Plain norm (e.g. q_norm/k_norm) returns a tensor; the fused
        # variant also returns the updated residual.
        return out if residual is None else (out, residual)
```
The fused variant is important for performance. Without it, the residual add and normalization are two separate operations, each requiring a full read-write pass over the hidden state. By fusing them, we halve the memory traffic — the hidden state is read once, the residual is added, normalized, and written back.
Look back at the decoder layer’s forward: the residual tensor flows through the entire layer, updated by each RMSNorm call. This avoids materializing intermediate residual tensors.
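To see that the fused form computes the same values, here is a small check with a plain-function RMSNorm mirroring the forward above (a numerical sketch, not the fused kernel; fp32 upcast omitted):

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    # Plain RMSNorm: scale by reciprocal root mean square, then by weight.
    return weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

torch.manual_seed(0)
h, r, w = torch.randn(2, 8), torch.randn(2, 8), torch.ones(8)

# Unfused: residual add, then norm — two full passes over the hidden state.
unfused = rms_norm(h + r, w)

# Fused form: add once, reuse the sum as both the norm input and the new residual.
s = h + r
fused, new_residual = rms_norm(s, w), s
assert torch.allclose(fused, unfused)
```

The fused kernel does this in a single read-write pass; the Python version only demonstrates that the two orderings are mathematically identical.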
Rotary Position Embedding (RoPE)
RoPE encodes position information by rotating Q and K vectors in pairs of dimensions:
```python
def get_rope(head_dim, max_position_embeddings, base=10000.0):
    """Singleton factory — all layers share the same RoPE instance."""
    ...
```
The get_rope() function returns a singleton so all attention layers share the same precomputed rotation matrices. The rotation is applied to Q and K before attention computation, giving the model a sense of relative position without explicit position embeddings.
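The rotation itself is easy to write down. Below is a minimal, self-contained sketch of the interleaved-pair formulation (an illustration, not nano-vllm's implementation, which precomputes cos/sin tables and fuses the op into a kernel):

```python
import torch

def apply_rope(x, pos, base=10000.0):
    # Rotate consecutive dimension pairs of x by position-dependent angles.
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = pos[:, None] * freqs[None, :]   # [tokens, head_dim // 2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]      # even/odd dims form rotation pairs
    return torch.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], dim=-1).flatten(-2)

q = torch.randn(4, 8)                        # 4 tokens, head_dim 8
pos = torch.arange(4, dtype=torch.float32)
q_rot = apply_rope(q, pos)
# A pure rotation preserves vector norms, and position 0 is left unrotated.
assert torch.allclose(q_rot.norm(dim=-1), q.norm(dim=-1), atol=1e-5)
assert torch.allclose(q_rot[0], q[0], atol=1e-6)
```

Because each pair is rotated by an angle proportional to its position, the dot product between a rotated query and key depends on their positional offset — which is what gives attention its sense of relative position.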
The Full Model: Qwen3ForCausalLM
The top-level model class stacks everything together:
```python
class Qwen3ForCausalLM(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed_tokens = VocabParallelEmbedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList([
            Qwen3DecoderLayer(config, i) for i in range(config.num_hidden_layers)
        ])
        self.norm = RMSNorm(config.hidden_size, config.rms_norm_eps)
        self.lm_head = ParallelLMHead(config.vocab_size, config.hidden_size)

    def forward(self, input_ids, positions, kv_caches, attn_metadata):
        hidden_states = self.embed_tokens(input_ids)
        residual = None
        for i, layer in enumerate(self.layers):
            hidden_states, residual = layer(positions, hidden_states,
                                            kv_caches[i], attn_metadata, residual)
        hidden_states, _ = self.norm(hidden_states, residual)
        logits = self.lm_head(hidden_states)
        return logits
```
Qwen3ForCausalLM shows the complete forward pass: embed tokens, pass through N decoder layers (each with its own KV cache slice), apply the final norm, then project to vocabulary logits.
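The fused layout also makes per-layer parameter counting straightforward. A back-of-the-envelope sketch with made-up config values (not a real Qwen3 checkpoint; biases and the per-head q/k norms' tiny contribution are the only extras):

```python
# Hypothetical config, for illustration only:
hidden, inter, heads, kv_heads = 1024, 3072, 16, 8
head_dim = hidden // heads                          # 64

qkv = hidden * (heads + 2 * kv_heads) * head_dim    # fused QKV projection
o = heads * head_dim * hidden                       # output projection
gate_up = hidden * 2 * inter                        # fused gate+up projection
down = inter * hidden                               # down projection
norms = 2 * hidden + 2 * head_dim                   # two layer norms + q/k norms

total = qkv + o + gate_up + down + norms
print(total)  # 12585088 parameters for this layer
```

Multiply by `num_hidden_layers` and add the embedding and LM head matrices to estimate a full model's size.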
Packed Modules Mapping
When loading checkpoint weights, the fused projections need special handling. A Hugging Face checkpoint stores separate q_proj, k_proj, v_proj weights, but nano-vllm’s qkv_proj expects them concatenated. The model defines a mapping:
```python
packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}
```
The weight loader reads this mapping to know which checkpoint weights should be stacked into which fused parameter. This is a pattern shared across all model implementations in both nano-vllm and vLLM.
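Conceptually, the loader concatenates each group of checkpoint matrices along the output dimension, in the order the fused layer later splits them back out. A sketch with a hypothetical `pack_weights` helper (not the actual nano-vllm loader, which shards and copies in place):

```python
import torch

packed_modules_mapping = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

def pack_weights(checkpoint, mapping):
    # Stack each group's weights along dim 0 (the output features of a Linear).
    return {fused: torch.cat([checkpoint[n] for n in names], dim=0)
            for fused, names in mapping.items()}

# Toy checkpoint shapes: [out_features, in_features], GQA-style K/V.
ckpt = {"q_proj": torch.randn(64, 32), "k_proj": torch.randn(32, 32),
        "v_proj": torch.randn(32, 32), "gate_proj": torch.randn(96, 32),
        "up_proj": torch.randn(96, 32)}
packed = pack_weights(ckpt, packed_modules_mapping)
print(packed["qkv_proj"].shape)      # torch.Size([128, 32])
print(packed["gate_up_proj"].shape)  # torch.Size([192, 32])
```

The order in the mapping matters: it must match the `split` sizes in the fused layer's forward, or Q/K/V would be silently swapped.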
Mapping to Production vLLM
Production vLLM follows the exact same architecture pattern, but at a much larger scale.
Key differences from nano-vllm:
- 200+ model implementations — vLLM supports a vast range of architectures, but they all follow the same decoder-layer pattern with fused projections and pluggable attention
- Fused CUDA kernels — RMSNorm, SiLU-and-mul, and rotary embedding all have optimized CUDA kernel implementations instead of pure PyTorch
- RoPE variants — production vLLM supports YaRN, dynamic NTK scaling, and other position embedding extensions for long-context models
- Weight loading infrastructure — a sophisticated weight loader handles quantized weights (GPTQ, AWQ, FP8), sharded checkpoints, and LoRA adapter merging
- Packed modules mapping — the same packed_modules_mapping pattern is used across all models, with a centralized weight-loading system that handles the stacking automatically
Despite the scale difference, if you can read nano-vllm’s Qwen3 implementation, you can read any model in vLLM — the structure is identical.
Summary
- A decoder layer follows the pre-norm residual pattern: RMSNorm + Attention + RMSNorm + MLP, with residual connections flowing through
- QKV projections are fused into a single QKVParallelLinear matmul, and gate/up projections into MergedColumnParallelLinear
- SiluAndMul implements SwiGLU by splitting the fused gate+up output and applying SiLU to the gate half
- RMSNorm fuses the residual addition to cut memory traffic in half
- packed_modules_mapping tells the weight loader how to stack checkpoint weights into fused parameters
- Production vLLM uses the same architecture with fused CUDA kernels, 200+ model implementations, and support for quantization and LoRA