The model projects inputs into three matrices: Q (queries), K (keys) and V (values). It computes similarities QK^T, scales them by √d_k, normalises with softmax across key positions, and multiplies by V to obtain a weighted sum of values for each query.
It computes attention quickly and in parallel as matrix operations, avoiding RNN sequentiality and costly MLP-based scoring.
Representations of positions for which matches are sought.
Representations of positions against which similarity is measured.
Values aggregated by attention weights.
Time complexity: O(n² · d). Space complexity: O(n²).
Within a layer, all positions can be processed in parallel as matrix operations.
Dominated by matrix multiplications QK^T and AV, which map well to GPUs/Tensor Cores.