Neural Networks: From Fundamentals to Modern AI · The Attention Mechanism and the Transformer

Implementing a mini-Transformer from scratch in PyTorch

Introduction

With the theory in place, we turn to code. We build a minimal, working Transformer in PyTorch: a causal language model in the nanoGPT style (Karpathy 2022). Scope: token embedding plus positional embedding, a decoder-only block with causally masked self-attention and an FFN, training on a small char-level corpus (e.g. tinyshakespeare), and simple autoregressive generation.

Architecture: nn.Embedding(vocab, d_model) + nn.Embedding(max_len, d_model) (learned positional embedding), N pre-LN blocks, and an lm_head Linear(d_model, vocab) whose weight is tied to the token embedding. Block forward: x = x + Attn(LN(x)), then x = x + FFN(LN(x)). Multi-head attention uses a single fused projection to 3·d_model (Q, K, V in one matmul), which is then split and reshaped to (B, h, n, d_k), where B is the batch size, h the number of heads, n the sequence length, and d_k = d_model/h. The attention scores QK^T/√d_k are masked causally (torch.tril) so that each position attends only to itself and earlier positions, then passed through softmax and used as weights for a sum over V; the heads are concatenated and projected by W_O. FFN: Linear(d_model, 4·d_model) → GELU → Linear(4·d_model, d_model).

Loss: cross_entropy between the logits at each position and the token at the next position (next-token prediction). Optimizer: AdamW, learning rate ≈3e-4 with a cosine schedule, weight decay 0.1, gradient clipping at 1.0. Five minutes of training on a laptop yields recognizably Shakespearean text.

This template, about 200 lines, is the foundation of modern decoder-only LLMs. Understanding it matters more than knowing the HuggingFace API, because it shows how the concepts from the previous five lessons actually compose into a working system.
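The architecture described above can be sketched in well under 100 lines. This is a minimal sketch, not the lesson's reference implementation. The class names (CausalSelfAttention, Block, MiniGPT), the default hyperparameters, and the final LayerNorm before lm_head (a nanoGPT convention) are assumptions made for illustration. Dropout and weight initialization are omitted.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_head, max_len):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head = n_head
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)     # W_O
        # Lower-triangular mask: position i may attend to positions <= i.
        mask = torch.tril(torch.ones(max_len, max_len, dtype=torch.bool))
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, n, d_model = x.shape
        d_k = d_model // self.n_head
        q, k, v = self.qkv(x).split(d_model, dim=2)
        # (B, n, d_model) -> (B, h, n, d_k)
        q = q.view(B, n, self.n_head, d_k).transpose(1, 2)
        k = k.view(B, n, self.n_head, d_k).transpose(1, 2)
        v = v.view(B, n, self.n_head, d_k).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(d_k)           # (B, h, n, n)
        att = att.masked_fill(~self.mask[:n, :n], float("-inf"))  # hide the future
        att = F.softmax(att, dim=-1)
        y = att @ v                                                # (B, h, n, d_k)
        y = y.transpose(1, 2).contiguous().view(B, n, d_model)     # concatenate heads
        return self.proj(y)


class Block(nn.Module):
    def __init__(self, d_model, n_head, max_len):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_head, max_len)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # pre-LN residual around attention
        x = x + self.ffn(self.ln2(x))   # pre-LN residual around the FFN
        return x


class MiniGPT(nn.Module):
    def __init__(self, vocab, d_model=64, n_head=4, n_layer=2, max_len=32):
        super().__init__()
        self.max_len = max_len
        self.tok_emb = nn.Embedding(vocab, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positions
        self.blocks = nn.ModuleList(
            [Block(d_model, n_head, max_len) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(d_model)  # assumption: final LN as in nanoGPT
        self.lm_head = nn.Linear(d_model, vocab, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # weight tying

    def forward(self, idx):
        B, n = idx.shape  # requires n <= max_len
        pos = torch.arange(n, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))  # (B, n, vocab)
```

A quick way to check the causal mask is to change the last token of an input and confirm that the logits at all earlier positions stay the same.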
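The training recipe (shifted cross-entropy, AdamW, cosine schedule, gradient clipping) might look like the loop below. It is a sketch under stated assumptions: the model is a tiny stand-in rather than the Transformer itself, the data is random token ids in place of an encoded corpus, and get_batch, block_size, batch_size and max_steps are hypothetical names and values chosen so the loop runs in a moment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d_model, block_size, batch_size, max_steps = 16, 32, 8, 4, 20

# Stand-in model (assumption): anything mapping (B, n) token ids to
# (B, n, vocab) logits can be dropped in here, including a Transformer.
model = nn.Sequential(nn.Embedding(vocab, d_model), nn.Linear(d_model, vocab))

# Placeholder for an encoded char-level corpus such as tinyshakespeare.
data = torch.randint(0, vocab, (1000,))


def get_batch():
    # Sample random windows of block_size + 1 tokens: inputs plus one extra target.
    starts = torch.randint(0, len(data) - block_size, (batch_size,))
    return torch.stack([data[s : s + block_size + 1] for s in starts])


optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_steps)

losses = []
for step in range(max_steps):
    tokens = get_batch()                      # (B, block_size + 1)
    logits = model(tokens)                    # (B, block_size + 1, vocab)
    # Next-token prediction: the logits at position t are scored against token t + 1,
    # so drop the last logit and the first token.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, vocab),
        tokens[:, 1:].reshape(-1),
    )
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()                          # cosine decay of the learning rate
    losses.append(loss.item())
```

On random data the loss stays close to log(vocab), so this only demonstrates the mechanics. On real text, the same loop with a Transformer in place of the stand-in is what drives the loss down.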
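Autoregressive generation repeats one step: run the model on the current context, take the logits at the last position, pick a next token, and append it. The sketch below assumes sampling from the softmax distribution with a temperature parameter, which the text does not specify. The generate function, the toy string and the untrained stand-in model are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def generate(model, idx, max_new_tokens, max_len, temperature=1.0):
    """Append max_new_tokens sampled tokens to idx of shape (B, n)."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -max_len:]                   # crop to the context window
        logits = model(idx_cond)[:, -1, :]             # logits for the next token
        probs = F.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # (B, 1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx


# Char-level tokenizer built from a toy string (stands in for the corpus).
text = "to be or not to be"
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}

# Untrained stand-in model (assumption): its output is noise, but the loop is
# the same one a trained Transformer would use.
torch.manual_seed(0)
model = nn.Sequential(nn.Embedding(len(chars), 16), nn.Linear(16, len(chars)))

prompt = torch.tensor([[stoi[c] for c in "to "]])
out = generate(model, prompt, max_new_tokens=10, max_len=8)
sample = "".join(itos[i] for i in out[0].tolist())
```

Cropping to the last max_len tokens matters for a model with learned positional embeddings, because it has no embedding for positions beyond max_len.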