Why Multiple Attention Heads Matter
Multi-Head Attention
Introduction
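A single attention head computes one set of attention weights per position, so it can retrieve context but tends to capture only one kind of relationship at a time. Multiple heads let the model learn different relationships in parallel, each working on its own slice of the representation. In this lesson you will see why we split the representation into heads.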
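To make the idea of splitting the representation concrete, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not a reference implementation: the helper names (`split_heads`, `attention_weights`) and the toy input are made up for this example, and the learned query/key/value projections that a real model applies are omitted, so each head simply attends over its raw slice of the input.

```python
import math

def split_heads(x, num_heads):
    """Split each d_model-sized row of x into num_heads chunks of size d_head.

    x: list of seq_len rows, each a list of d_model floats.
    Returns a list of num_heads matrices, each seq_len x d_head.
    """
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [
        [row[h * d_head:(h + 1) * d_head] for row in x]
        for h in range(num_heads)
    ]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, k):
    """Scaled dot-product attention weights for one head (seq_len x seq_len)."""
    scale = math.sqrt(len(q[0]))
    return [
        softmax([sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k])
        for q_row in q
    ]

# Toy input (assumed for illustration): 3 tokens, d_model = 4.
x = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]

# Two heads, each seeing a 2-dimensional slice of every token.
heads = split_heads(x, num_heads=2)

# Simplification: queries and keys are the raw slices, with no learned projections.
per_head_weights = [attention_weights(h, h) for h in heads]

for i, weights in enumerate(per_head_weights):
    print(f"head {i}:")
    for row in weights:
        print("  ", [round(w, 3) for w in row])
```

Because each head sees a different slice of every token, the two heads produce different attention patterns over the same three tokens, which is the behavior the lesson builds on.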