Architecture

Sinusoidal PE

2017 · Historical · Published: 9 June 2026 · Updated: 9 June 2026

Key innovation

Introduces token-position information to the Transformer via deterministic sine/cosine functions at geometrically decaying frequencies, with no learned parameters. It was the first mechanism that let a recurrence-free ("all attention") architecture make use of token order.

How it works

For position pos and pair index i (0 ≤ i < d_model/2) in an embedding of width d_model, define:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Even dimensions get the sine and odd dimensions the cosine of the same angle. The wavelengths form a geometric progression from 2π to roughly 2π·10000 as i increases. The resulting vector is added (not concatenated) to the token embedding at the input of the model, before the first attention block. Key property: for any fixed offset k, PE(pos+k) is a linear function of PE(pos), with a matrix that depends only on k and not on pos (each sine/cosine pair is rotated by the angle k / 10000^(2i/d_model)). The authors hypothesised that this would let the model easily learn to attend by relative position even though the encoding is absolute. The encoding is static and can be computed once; it has no learnable parameters.
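As a minimal sketch of the definition above, the following pure-Python code computes the encoding for one position and checks the linear-shift property numerically. The function names `sinusoidal_pe` and `shift` and the `base` parameter are illustrative choices, not taken from any reference implementation, and d_model is assumed to be even.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Return the d_model-dimensional encoding of one position as a list."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        # Angular frequency of pair i: 1 / base^(2i / d_model).
        freq = 1.0 / (base ** (2 * i / d_model))
        pe[2 * i] = math.sin(pos * freq)      # even dimension
        pe[2 * i + 1] = math.cos(pos * freq)  # odd dimension
    return pe

def shift(pe, k, base=10000.0):
    """Map PE(pos) to PE(pos + k) with a linear map that depends only on k."""
    d_model = len(pe)
    out = [0.0] * d_model
    for i in range(d_model // 2):
        theta = k / (base ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        # Angle-addition identities, i.e. a 2x2 rotation acting on (sin, cos).
        out[2 * i] = s * math.cos(theta) + c * math.sin(theta)
        out[2 * i + 1] = c * math.cos(theta) - s * math.sin(theta)
    return out

pe_5 = sinusoidal_pe(5, 8)
pe_8 = sinusoidal_pe(8, 8)
# Largest difference between the shifted PE(5) and the directly computed PE(8).
max_err = max(abs(a - b) for a, b in zip(shift(pe_5, 3), pe_8))
```

Here `shift` never sees the original position, only the offset k, which is the sense in which relative offsets are linearly accessible from an absolute encoding.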

Problem solved

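The Transformer's self-attention, unlike the recurrence in RNNs or the convolution in CNNs, has no built-in notion of order: it is permutation-equivariant, so shuffling the input tokens merely shuffles the outputs in the same way. Without additional position information all tokens look like a "bag of words" to it. Sinusoidal Positional Encoding solves this in the simplest possible way: with a deterministic function of position that does not need to be learned and is defined for any position, although the model is only reliable up to the sequence lengths seen during training.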
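The order-blindness described above can be checked on a toy example. The sketch below is a deliberately simplified single-head self-attention with identity query/key/value projections (an assumption made for brevity, not the full Transformer layer); it shows that shuffling the input rows only shuffles the output rows in the same way.

```python
import math

def self_attention(xs):
    """Single-head self-attention with identity Q/K/V projections (simplified)."""
    d = len(xs[0])
    out = []
    for q in xs:
        # Scaled dot-product scores of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Output is the softmax-weighted average of the value vectors.
        out.append([
            sum(w / total * v[j] for w, v in zip(weights, xs))
            for j in range(d)
        ])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical token embeddings
perm = [2, 0, 1]
shuffled = [tokens[p] for p in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Row i of the shuffled output should match row perm[i] of the original output.
max_diff = max(
    abs(a - b)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
```

Because each token's output is the same wherever it sits, nothing downstream can recover the original order unless position is injected into the inputs.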

Implementation

Reference implementations

tensor2tensor (official Transformer by the authors)

Python (TensorFlow) · Google Brain (Vaswani et al. authors)

Official

PyTorch — nn.Transformer / "Annotated Transformer"

Python (PyTorch) · Harvard NLP / PyTorch

Hugging Face Transformers — BertModel etc.

Python · Hugging Face

Implementation pitfalls

Confusing additive vs concatenated PE · Medium

In the original paper Sinusoidal PE is added to the embedding, not concatenated. Concatenation would widen the input beyond d_model and disrupt the established query/key/value projection structure.

Fix: Stick to the additive form: x_in = token_emb + PE.
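A tiny illustration of the difference, using made-up numbers: addition keeps the input width at d_model, while concatenation doubles it, so every downstream projection would need a new input size.

```python
d_model = 4
token_emb = [0.1, -0.2, 0.3, 0.4]  # hypothetical token embedding
pe = [0.0, 1.0, 0.0, 1.0]          # sinusoidal PE at position 0

added = [t + p for t, p in zip(token_emb, pe)]  # element-wise sum
concatenated = token_emb + pe                   # list concatenation

added_width = len(added)                # stays at d_model
concatenated_width = len(concatenated)  # becomes 2 * d_model
```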

Assuming strong length extrapolation · High

Although PE is well-defined for any pos, models trained at length L perform poorly in practice on L' >> L — attention patterns at such positions were never seen during training.

Fix: For long-context use, prefer RoPE with YaRN or LongRoPE scaling, or ALiBi, over sinusoidal PE.

Inconsistent scaling of token_emb vs PE · High

In the original paper the token embedding is multiplied by sqrt(d_model) before PE is added. The paper does not state the reason; the usual explanation is that it keeps the token signal from being swamped by the PE, whose norm is fixed at sqrt(d_model/2). Omitting this scaling is a common bug in educational implementations and can hurt training.

Fix: Multiply the token embedding by sqrt(d_model) before adding PE, as specified in the original paper.
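The fix can be sketched as follows. The helper `embed_inputs`, the embedding table, and its initialisation (standard deviation d_model^-0.5) are hypothetical and chosen only for illustration; the scaling and the addition are the parts that follow the paper.

```python
import math
import random

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Sinusoidal encoding of one position (d_model assumed even)."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / (base ** (2 * i / d_model))
        pe[2 * i] = math.sin(angle)
        pe[2 * i + 1] = math.cos(angle)
    return pe

def embed_inputs(token_ids, table, d_model):
    """Return scale * token_emb + PE for each position in the sequence."""
    scale = math.sqrt(d_model)
    out = []
    for pos, tok in enumerate(token_ids):
        pe = sinusoidal_pe(pos, d_model)
        out.append([scale * e + p for e, p in zip(table[tok], pe)])
    return out

random.seed(0)
d_model, vocab_size = 64, 10
# Hypothetical embedding table; the init scale is an assumption of this sketch.
table = [
    [random.gauss(0.0, d_model ** -0.5) for _ in range(d_model)]
    for _ in range(vocab_size)
]

x = embed_inputs([3, 1, 4], table, d_model)

# Each sine/cosine pair contributes sin^2 + cos^2 = 1, so |PE|^2 = d_model / 2.
pe_sq_norm = sum(p * p for p in sinusoidal_pe(7, d_model))
```

With small-magnitude embeddings such as those in this sketch, the unscaled token vector would be much smaller than the PE vector, which is the imbalance the sqrt(d_model) factor is commonly said to correct.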

Evolution

Original paper · 2017 · NeurIPS 2017 · Ashish Vaswani

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

2017

Sinusoidal PE introduced in "Attention Is All You Need"

Inflection point

Vaswani et al. publish the Transformer and with it the deterministic sinusoidal positional encoding. The authors compare it to learned positional embeddings, obtain nearly identical results, and choose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than those seen in training.

Transformer (concept) · Attention Is All You Need (paper)

2018

Learned Positional Encoding in BERT and GPT

BERT (Devlin et al., 2018) and GPT (Radford et al., 2018) use learned position embeddings instead of sinusoidal ones. Results are very similar, but a learned table has no entries for positions beyond the maximum training length, so there is no extrapolation.

2018

Relative Position Representations (Shaw et al.)

Shaw et al. (Google) introduce relative position representations, showing that explicitly modelling distances between tokens inside self-attention improves on absolute sinusoidal PE on WMT 2014 English-German and English-French translation.

2021

RoPE and ALiBi — moving away from absolute PE

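RoPE (Su et al.) and ALiBi (Press et al.) replace additive sinusoidal/learned PE. RoPE rotates pairs of query and key dimensions by position-dependent angles, so attention scores depend on relative offsets; ALiBi adds a bias to the attention scores that grows linearly with query-key distance. Both handle long context better than classical sinusoidal PE, and they mark the start of the decline of the original method in new large LLMs.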
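The two mechanisms can be sketched in a few lines. This is a simplified illustration, not either paper's implementation: `rope_rotate` handles a single two-dimensional pair with one assumed frequency `theta`, and `alibi_bias` shows the causal-case penalty for one head with an assumed `slope`.

```python
import math

def rope_rotate(pair, pos, theta=1.0):
    """Rotate one (x0, x1) pair by the angle pos * theta."""
    x0, x1 = pair
    a = pos * theta
    return (x0 * math.cos(a) - x1 * math.sin(a),
            x0 * math.sin(a) + x1 * math.cos(a))

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def alibi_bias(query_pos, key_pos, slope):
    """Penalty added to an attention score, linear in query-key distance."""
    return -slope * (query_pos - key_pos)

q, k = (0.3, -1.2), (0.8, 0.5)  # hypothetical query/key pair

# Same offset (3) at two different absolute positions.
score_near = dot(rope_rotate(q, 10), rope_rotate(k, 7))
score_far = dot(rope_rotate(q, 103), rope_rotate(k, 100))

bias = alibi_bias(5, 2, slope=0.5)
```

The two RoPE scores agree up to floating-point error because rotating both vectors leaves their dot product dependent only on the difference of the angles, that is, on the relative offset.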

RoPE (concept)

2023

Sinusoidal PE as a historical baseline

In new large LLMs (Llama 2/3, Qwen, DeepSeek, Mistral) Sinusoidal PE is practically replaced by RoPE. It remains in use in older models, teaching contexts, and simpler Transformers (e.g. small audio/vision models).
