Scaled Dot-Product Attention
Self-Attention from Scratch
Introduction
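Scaled dot-product attention is the concrete formula used in the Transformer: the scores QK^T are scaled by 1/sqrt(d_k), where d_k is the key dimension, optionally masked, normalized row by row with softmax, and then used to weight the rows of V. In one line: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. This lesson breaks down each of these steps.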
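To make these steps concrete before going through them one at a time, here is a minimal NumPy sketch for a single unbatched sequence. It is an illustration under stated assumptions, not the lesson's reference implementation: the function name `scaled_dot_product_attention`, the boolean mask convention (True means "may attend"), and the use of a large negative constant in place of negative infinity are all choices made for this example.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, mask=None):
    """Sketch of softmax(QK^T / sqrt(d_k)) V for one sequence.

    q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v).
    mask: optional boolean array (n_q, n_k); True means the query may
    attend to that key (a convention assumed for this example).
    """
    d_k = q.shape[-1]
    # 1. Raw scores, scaled by 1/sqrt(d_k).
    scores = q @ k.T / np.sqrt(d_k)
    # 2. Masking: blocked positions get a large negative score
    #    (assumption: -1e9 stands in for -inf to avoid NaNs).
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    # 3. Row-wise softmax, shifted by the row maximum for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # 4. Weighted sum of the value rows.
    return weights @ v, weights

# Self-attention on random data with a causal (lower-triangular) mask.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
causal_mask = np.tril(np.ones((4, 4), dtype=bool))
out, weights = scaled_dot_product_attention(x, x, x, causal_mask)
print(out.shape)  # (4, 8)
```

With the causal mask, each row of `weights` sums to 1 and puts zero weight on later positions, so the first output row is simply the first value row.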