For position pos and dimension index i in an embedding of width d_model, define: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). Even dimensions get sine, odd dimensions cosine, and the wavelength grows geometrically from 2π to 2π·10000 with the dimension index. The resulting vector is added (not concatenated) to the token embedding at the very input of the model, before the first attention block. Key property: for any fixed offset k there is a linear transformation, independent of pos, that maps PE(pos) to PE(pos+k), so the model can easily learn to attend by relative position even though the encoding is absolute. The encoding is static and computed once; it has no learnable parameters.
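The definition and the shift property can be checked numerically. The following is a minimal sketch in plain Python, not a reference implementation: the function names `sinusoidal_pe` and `shift` are made up for illustration, and an even d_model is assumed. The `shift` function applies one 2x2 rotation per sine/cosine pair, with angles that depend only on the offset k.

```python
import math


def sinusoidal_pe(pos, d_model, base=10000.0):
    """Return the d_model-dimensional encoding of a single position."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe[2 * i] = math.sin(angle)      # even dimension: sine
        pe[2 * i + 1] = math.cos(angle)  # odd dimension: cosine
    return pe


def shift(pe, k, d_model, base=10000.0):
    """Map PE(pos) to PE(pos + k) without knowing pos.

    Each (sin, cos) pair is rotated by an angle that depends only on k,
    using the angle-addition identities for sine and cosine.
    """
    out = [0.0] * d_model
    for i in range(d_model // 2):
        theta = k / base ** (2 * i / d_model)
        s, c = pe[2 * i], pe[2 * i + 1]
        out[2 * i] = s * math.cos(theta) + c * math.sin(theta)      # sin(a + theta)
        out[2 * i + 1] = c * math.cos(theta) - s * math.sin(theta)  # cos(a + theta)
    return out


# Shifting PE(5) by k = 3 should reproduce PE(8) up to rounding error.
pe_5 = sinusoidal_pe(5, d_model=8)
pe_8 = sinusoidal_pe(8, d_model=8)
max_err = max(abs(a - b) for a, b in zip(shift(pe_5, 3, d_model=8), pe_8))
print(max_err)
```

Because the rotation is linear in the entries of PE(pos), a learned linear projection can in principle represent it, which is the basis of the relative-position argument above.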
The Transformer, unlike RNNs and CNNs, has no built-in notion of token order: self-attention is permutation-equivariant, so without additional position information the input looks like a "bag of words" to it. Sinusoidal Positional Encoding solves this in the simplest possible way: with a deterministic function of position that does not need to be learned and that is defined for any sequence length, including lengths not seen during training.
In the original paper Sinusoidal PE is added to the embedding, not concatenated. Concatenation would require changing d_model and would disrupt the established query/key/value projection structure.
Although PE is well-defined for any pos, models trained at length L perform poorly in practice on L' >> L: attention patterns at such positions were never seen during training.
In the original paper the token embedding is multiplied by sqrt(d_model) before PE is added; the usual explanation is that this keeps both signals at a comparable magnitude. Omitting this scaling is a common bug in educational implementations and can noticeably hurt training.
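The scale-then-add step can be sketched as follows. This is an illustrative plain-Python version in which embeddings are lists of floats; `positional_encoding` and `add_positions` are hypothetical names, and an even d_model is assumed.

```python
import math


def positional_encoding(pos, d_model, base=10000.0):
    """Sinusoidal encoding of one position (assumes an even d_model)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe.extend((math.sin(angle), math.cos(angle)))
    return pe


def add_positions(token_embeddings, d_model):
    """Scale each token embedding by sqrt(d_model), then add PE elementwise.

    The output keeps width d_model because PE is added, not concatenated.
    """
    scale = math.sqrt(d_model)
    out = []
    for pos, emb in enumerate(token_embeddings):
        pe = positional_encoding(pos, d_model)
        out.append([scale * e + p for e, p in zip(emb, pe)])
    return out


# Two identical token embeddings become distinguishable once positions are added.
same_token = [0.5, -0.5, 0.25, 0.0]
encoded = add_positions([same_token, same_token], d_model=4)
print(encoded[0])
print(encoded[1])
```

Dropping the `scale` factor would leave the learned embedding entries, which are typically small at initialisation, competing with PE values in the range [-1, 1].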
Vaswani et al. publish the Transformer and with it the deterministic sinusoidal positional encoding. The authors compare it to learned PE: they obtain nearly identical results but choose the sinusoidal version as the simpler option that may also extrapolate to longer sequences.
BERT (Devlin et al., 2018) and GPT (Radford et al., 2018) choose learned position embeddings instead of sinusoidal ones: they obtain very similar results at the cost of no extrapolation beyond the training length.
Shaw et al. (Google) introduce relative position representations, showing on machine translation that explicitly modelling distances between tokens can yield better results than absolute PE.
RoPE (Su et al.) and ALiBi (Press et al.) replace additive sinusoidal/learned PE: RoPE rotates dimension pairs of queries and keys by position-dependent angles, and ALiBi adds a distance-proportional linear bias to attention scores. Both handle long context better than classical sinusoidal PE, which marks the start of the decline of the original method in new large LLMs.
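For contrast with the additive scheme, the two mechanisms can be sketched in a few lines. This is a simplified illustration only: it assumes the interleaved pairing (x[2i], x[2i+1]) for RoPE, which some implementations replace with a half-split layout, and it treats the ALiBi slope as a plain argument rather than deriving per-head slopes. The names `rope`, `dot` and `alibi_bias` are hypothetical.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle pos / base^(2i/d)."""
    d = len(vec)
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos / base ** (2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * math.cos(theta) - y * math.sin(theta)
        out[2 * i + 1] = x * math.sin(theta) + y * math.cos(theta)
    return out


def dot(a, b):
    """Plain dot product, standing in for an unnormalised attention score."""
    return sum(x * y for x, y in zip(a, b))


def alibi_bias(query_pos, key_pos, slope):
    """ALiBi-style penalty added to a score: linear in query-key distance."""
    return -slope * abs(query_pos - key_pos)


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# With RoPE the score depends only on the offset between positions (4 in both cases).
score_a = dot(rope(q, 7), rope(k, 3))
score_b = dot(rope(q, 14), rope(k, 10))
print(score_a, score_b)

# With ALiBi the bias is simply proportional to the distance.
print(alibi_bias(5, 2, slope=0.5))
```

In both cases position enters inside the attention computation rather than being added to the input embedding once.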
In new large LLMs (Llama 2/3, Qwen, DeepSeek, Mistral) Sinusoidal PE is practically replaced by RoPE. It remains in use in older models, teaching contexts, and simpler Transformers (e.g. small audio/vision models).
Embedding width, equal to the PE vector width. In the original paper d_model = 512.
The base in the formula pos / base^(2i/d_model). In the original paper it is 10000, chosen empirically. Changing it affects the range of wavelengths and indirectly the ability to represent positions.
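The effect of the base on the wavelength range can be made concrete with a short calculation. This sketch assumes the wavelength of pair i is 2π·base^(2i/d_model), which follows from the formula above; the helper name `wavelengths` and the alternative base values are only illustrative.

```python
import math


def wavelengths(d_model, base=10000.0):
    """Wavelength, in positions, of each sine/cosine pair i = 0 .. d_model/2 - 1."""
    return [2 * math.pi * base ** (2 * i / d_model) for i in range(d_model // 2)]


for base in (100.0, 10000.0, 1000000.0):
    w = wavelengths(512, base)
    print(f"base={base:>9.0f}  shortest={w[0]:.2f}  longest={w[-1]:.1f}")
```

The shortest wavelength is always 2π regardless of the base, while the longest approaches 2π·base (it stays slightly below, because the last pair index is d_model/2 - 1). A larger base therefore stretches the slow-varying dimensions over more positions.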
Number of positions for which PE is precomputed and cached. This is an implementation detail: the function itself is well-defined for any pos.
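One way to picture this cache is a table built once and sliced per sequence. The class below is a hypothetical sketch, not taken from any particular library: `SinusoidalCache`, `max_len` and `get` are illustrative names, and an even d_model is assumed. Since the function is defined for any pos, the table is simply extended when a longer sequence arrives.

```python
import math


class SinusoidalCache:
    """Precompute PE for positions 0 .. max_len - 1 once and reuse the table."""

    def __init__(self, d_model, max_len, base=10000.0):
        self.d_model = d_model
        self.max_len = max_len
        self.base = base
        self.table = [self._encode(pos) for pos in range(max_len)]

    def _encode(self, pos):
        """Encoding of a single position (assumes an even d_model)."""
        pe = []
        for i in range(self.d_model // 2):
            angle = pos / self.base ** (2 * i / self.d_model)
            pe.extend((math.sin(angle), math.cos(angle)))
        return pe

    def get(self, seq_len):
        """Return encodings for the first seq_len positions, growing the cache if needed."""
        if seq_len > self.max_len:
            # Rows do not depend on each other, so new ones can be appended at any time.
            self.table.extend(self._encode(p) for p in range(self.max_len, seq_len))
            self.max_len = seq_len
        return self.table[:seq_len]


cache = SinusoidalCache(d_model=4, max_len=8)
short = cache.get(3)    # served from the precomputed table
longer = cache.get(10)  # extends the table to 10 positions
print(len(short), len(longer), cache.max_len)
```

Growing the table only extends the numerical range of the cache; it does not by itself make a model trained on shorter sequences work well at the new positions.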
Sinusoidal PE is added to every input embedding: a dense, deterministic operation without routing.
The encoding is a deterministic function of position, so the entire PE matrix can be precomputed once at initialisation and reused. There are no sequential dependencies.
Sinusoidal PE is a purely mathematical, deterministic operation: it is precomputed once at initialisation, and at runtime only an addition remains. It works identically on any hardware.