Neural Networks: From Fundamentals to Modern AI · The Attention Mechanism and the Transformer

Transformer architecture — encoder block, FFN, LayerNorm, residual

Introduction

With self-attention and positional encoding in place, we can now assemble them into a full Transformer block. In the original paper (Vaswani et al., 2017), the encoder is a stack of N = 6 identical blocks, each with two sub-layers: (1) multi-head self-attention and (2) a position-wise feed-forward network (FFN). Each sub-layer is wrapped in a residual connection followed by layer normalization ("Add & Norm"): y = LayerNorm(x + Sublayer(x)). This is the classic post-LN variant.

The FFN consists of two linear layers with an activation in between (ReLU in 2017; today more often GELU or SwiGLU): FFN(x) = max(0, xW_1 + b_1)W_2 + b_2, with d_ff = 4·d_model (e.g. 512 → 2048 → 512). It is applied to each position independently with the same weights, so it acts as a per-token MLP; tokens exchange information only in the attention sub-layers.

LayerNorm (Ba et al., 2016) normalizes along the feature dimension of each token rather than along the batch dimension as BatchNorm does. Its statistics therefore do not depend on the batch size or on the other sequences in the batch, which matters in NLP, where sequences have different lengths.

The decoder block has three sub-layers: masked multi-head self-attention (a causal mask prevents each position from attending to later positions), encoder–decoder cross-attention (Q from the decoder, K and V from the output of the last encoder layer), and an FFN.

Modern LLMs mostly use the pre-LN variant, y = x + Sublayer(LayerNorm(x)), in which normalization is applied before the sub-layer. This makes training more stable and less sensitive to learning-rate warmup. GPT-2 and GPT-3 use pre-LN with LayerNorm and GELU; LLaMA and Mistral additionally replace LayerNorm with RMSNorm and the ReLU/GELU activation with SwiGLU.
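To make the wiring concrete, here is a minimal NumPy sketch of one encoder block in both the post-LN and pre-LN arrangements. It is an illustration under simplifying assumptions, not a reference implementation: it processes a single sequence without a batch axis, omits dropout and padding masks, and uses small toy dimensions instead of 512/2048. All names (layer_norm, ffn, multi_head_self_attention, encoder_block, init_params, and the parameter keys) are made up for this example. The causal flag shows the masking used in the decoder's self-attention; RMSNorm and SwiGLU are not shown.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token vector over the feature axis (the last axis),
    # so the result does not depend on other tokens or other sequences.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position
    # independently with the same weights.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_self_attention(x, Wq, Wk, Wv, Wo, n_heads, causal=False):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    if causal:
        # Position i must not attend to positions j > i.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ v                             # (n_heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
    return out @ Wo


def encoder_block(x, p, n_heads, pre_ln=False):
    def attn(h):
        return multi_head_self_attention(
            h, p["Wq"], p["Wk"], p["Wv"], p["Wo"], n_heads)

    def mlp(h):
        return ffn(h, p["W1"], p["b1"], p["W2"], p["b2"])

    if pre_ln:
        # Pre-LN: y = x + Sublayer(LayerNorm(x))
        x = x + attn(layer_norm(x, p["g1"], p["beta1"]))
        return x + mlp(layer_norm(x, p["g2"], p["beta2"]))
    # Post-LN (original 2017 arrangement): y = LayerNorm(x + Sublayer(x))
    x = layer_norm(x + attn(x), p["g1"], p["beta1"])
    return layer_norm(x + mlp(x), p["g2"], p["beta2"])


def init_params(d_model, d_ff, rng):
    # Small random weights; LayerNorm starts as the identity scale/shift.
    def w(*shape):
        return 0.1 * rng.standard_normal(shape)

    return {
        "Wq": w(d_model, d_model), "Wk": w(d_model, d_model),
        "Wv": w(d_model, d_model), "Wo": w(d_model, d_model),
        "W1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "W2": w(d_ff, d_model), "b2": np.zeros(d_model),
        "g1": np.ones(d_model), "beta1": np.zeros(d_model),
        "g2": np.ones(d_model), "beta2": np.zeros(d_model),
    }


# Toy sizes: d_ff = 4 * d_model, as in the original paper's ratio.
rng = np.random.default_rng(0)
d_model, d_ff, n_heads, seq_len = 8, 32, 2, 5
p = init_params(d_model, d_ff, rng)
x = rng.standard_normal((seq_len, d_model))

y_post = encoder_block(x, p, n_heads, pre_ln=False)
y_pre = encoder_block(x, p, n_heads, pre_ln=True)
print(y_post.shape, y_pre.shape)
```

Both variants map a (seq_len, d_model) input to an output of the same shape, which is what allows N such blocks to be stacked. Note one visible difference: the post-LN output is normalized per token, while the pre-LN output is an unnormalized residual stream, which is why pre-LN models usually apply one final normalization after the last block.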