Merging Heads and Output Projection
Introduction
After computing attention separately for each head, we need to merge the per-head outputs back into a single tensor and pass the result through an output projection. This lesson covers how the heads are concatenated, why the contiguous and view operations are needed when reshaping, and what role the output projection layer plays.
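To make the steps concrete, here is a minimal sketch of the merge, assuming PyTorch and a per-head output laid out as (batch, heads, seq_len, head_dim). The tensor sizes and the names head_outputs, merged, and out_proj are hypothetical, chosen only for illustration; the lesson's own implementation may differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, for illustration only.
batch_size, num_heads, seq_len, head_dim = 2, 4, 5, 8
d_model = num_heads * head_dim

# Stand-in for the attention output of every head:
# shape (batch, heads, seq_len, head_dim).
head_outputs = torch.randn(batch_size, num_heads, seq_len, head_dim)

# Bring the heads next to head_dim: (batch, seq_len, heads, head_dim).
transposed = head_outputs.transpose(1, 2)

# transpose() only changes strides, so the result is not contiguous
# in memory. view() cannot merge the last two dimensions of such a
# tensor, so contiguous() copies the data into the new layout first.
merged = transposed.contiguous().view(batch_size, seq_len, d_model)

# Output projection: a linear layer that mixes information across heads.
out_proj = nn.Linear(d_model, d_model)
output = out_proj(merged)

print(merged.shape, output.shape)
```

After the view, each token's vector is the concatenation of its per-head vectors, with head 0 occupying the first head_dim entries. The linear layer is what lets the model combine those head-specific slices.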