Architecture

Learned PE

2017 · Historical · Published: 9 June 2026 · Updated: 9 June 2026

Key innovation

Replaces deterministic sinusoidal positional encoding with learned embeddings — the model itself optimises a position vector for each position in the context window, treating position analogously to a token in a vocabulary.

How it works

A trainable position embedding table P of shape (max_seq_len, d_model) is created. For position pos, the row P[pos] is fetched and added to the token embedding at the model input: x_in = token_emb + P[pos]. The integration is identical to sinusoidal PE; only the source of the vector differs. P is randomly initialised and trained jointly with the rest of the model via backprop, so every position from 0 to max_seq_len-1 gets its own independent learned vector. Positions pos >= max_seq_len are undefined: the table has no row for them, so longer inputs must be hard-truncated or processed through a shifted window of at most max_seq_len tokens.
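The lookup-and-add step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `add_learned_positions` and the toy sizes are assumptions, and the table here is only randomly initialised, whereas in a real model it would be trained.

```python
import numpy as np

def add_learned_positions(token_emb: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Return token_emb + P[pos] for pos = 0..T-1.

    token_emb: (B, T, d_model) token embeddings.
    P:         (max_seq_len, d_model) position table.
    """
    T = token_emb.shape[1]
    max_seq_len = P.shape[0]
    if T > max_seq_len:
        # Positions >= max_seq_len have no row in P.
        raise ValueError(f"sequence length {T} exceeds max_seq_len {max_seq_len}")
    pos = np.arange(T)            # absolute positions 0..T-1
    return token_emb + P[pos]     # (T, d_model) broadcasts over the batch axis


rng = np.random.default_rng(0)
max_seq_len, d_model = 8, 4       # toy sizes chosen for illustration
P = rng.normal(0.0, 0.02, size=(max_seq_len, d_model))
token_emb = rng.normal(0.0, 0.02, size=(2, 5, d_model))
x_in = add_learned_positions(token_emb, P)   # shape (2, 5, 4)
```

Every sequence in the batch receives the same position rows, because position depends only on the index within the sequence, not on the token.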

Problem solved

Sinusoidal PE is deterministic and assumes a specific geometry (geometric frequency decomposition with base 10000), which is not necessarily optimal for a given domain or model size. Learned PE lets the model discover the position representation most useful for the task — at the cost of additional parameters and the loss of length extrapolation beyond training.
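The trade-off can be made concrete with a short sketch. It assumes the standard sinusoidal formulation with base 10000 and an even d_model; the helper names `sinusoidal_pe` and `learned_pe_param_count` are hypothetical. The sizes passed in are the published configurations of BERT-base (512 positions, width 768) and GPT-2 small (1024 positions, width 768).

```python
import numpy as np

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Fixed sinusoidal vector for one position (d_model assumed even)."""
    i = np.arange(d_model // 2)
    angles = pos / base ** (2 * i / d_model)   # frequencies form a geometric progression
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)                  # even dimensions
    pe[1::2] = np.cos(angles)                  # odd dimensions
    return pe

def learned_pe_param_count(max_seq_len, d_model):
    """Trainable scalars in a (max_seq_len, d_model) position table."""
    return max_seq_len * d_model

# Sinusoidal: no parameters, and any position can be computed on demand.
far = sinusoidal_pe(100_000, 8)

# Learned: one trainable row per position, and nothing beyond the table.
bert_base = learned_pe_param_count(512, 768)     # 393216
gpt2_small = learned_pe_param_count(1024, 768)   # 786432
```

The parameter cost is small relative to the rest of such models; the more significant cost is that no vector exists for positions past the end of the table.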

Components

Position Embedding Table

Source of the position vector for the PE-to-token-embedding addition step

Trainable table P of shape (max_seq_len, d_model). Each row is a learned vector representing one absolute position in the context window. Randomly initialised (typically from a normal distribution with mean 0 and standard deviation 0.02) and updated by standard backprop together with the rest of the model.

IN: Tensor of position indices, one per token, for a batch of size B and sequence length T.

OUT: Position embedding vectors fetched by lookup from table P.
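As a rough sketch of this component, the class below implements the lookup and the gradient that backprop would deliver to the table. The class name, method names, and the plain SGD step are assumptions made for illustration; a real model would rely on a framework's embedding layer and optimiser.

```python
import numpy as np

class PositionEmbeddingTable:
    """Hypothetical minimal position table with a manual backward pass."""

    def __init__(self, max_seq_len, d_model, std=0.02, seed=0):
        rng = np.random.default_rng(seed)
        self.P = rng.normal(0.0, std, size=(max_seq_len, d_model))

    def forward(self, pos_ids):
        # pos_ids: (B, T) integer positions -> (B, T, d_model) vectors
        return self.P[pos_ids]

    def backward(self, pos_ids, grad_out):
        # Gradient w.r.t. P: accumulate the upstream gradient into each
        # row that was looked up. np.add.at handles repeated indices.
        grad_P = np.zeros_like(self.P)
        np.add.at(grad_P, pos_ids, grad_out)
        return grad_P

    def sgd_step(self, grad_P, lr):
        self.P -= lr * grad_P


table = PositionEmbeddingTable(max_seq_len=6, d_model=4)
pos_ids = np.tile(np.arange(3), (2, 1))   # B=2, T=3: positions 0, 1, 2 per row
out = table.forward(pos_ids)              # (2, 3, 4)
grad_out = np.ones_like(out)              # stand-in for a real upstream gradient
grad_P = table.backward(pos_ids, grad_out)
before = table.P.copy()
table.sgd_step(grad_P, lr=0.1)
```

Rows 3 to 5 receive a zero gradient here because no sequence in the batch reached those positions: a row is only trained by sequences long enough to use it.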

1D Learned PE: Classical table for 1D sequences (text). Used in BERT, GPT-1/2.

2D Learned PE (ViT): Vision variant. Either a single table for all patches treated as a 1D sequence in raster order (original ViT) or separate row/col tables (some variants).

Segment-aware Learned PE: BERT additionally learns a "segment embeddings" table (0/1) encoding token membership in sentence A or B, combined additively with the position PE.
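The variants above differ only in how the lookup index is built and what is summed. The sketch below illustrates them with toy sizes. The helper `init`, all table names, and the choice to concatenate half-width row and column vectors in the 2D case are assumptions (the text does not say how row/col tables are combined). The layer normalisation and dropout that BERT applies after the sum are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

def init(*shape):
    """Hypothetical helper: N(0, 0.02^2) initialisation."""
    return rng.normal(0.0, 0.02, size=shape)

# 2D, raster order: one table, patch (row, col) flattened to a 1D index.
grid_h, grid_w = 3, 4
P_raster = init(grid_h * grid_w, d_model)

def raster_pe(row, col):
    return P_raster[row * grid_w + col]

# 2D, separate tables: half-width row and column vectors, concatenated
# (one possible combination, assumed here).
P_row = init(grid_h, d_model // 2)
P_col = init(grid_w, d_model // 2)

def row_col_pe(row, col):
    return np.concatenate([P_row[row], P_col[col]])

# Segment-aware: token + position + segment embeddings, summed.
vocab_size, max_seq_len = 100, 16
tok_table = init(vocab_size, d_model)
P = init(max_seq_len, d_model)
seg_table = init(2, d_model)          # row 0: sentence A, row 1: sentence B

def bert_style_input(token_ids, segment_ids):
    T = token_ids.shape[-1]
    return tok_table[token_ids] + P[np.arange(T)] + seg_table[segment_ids]

token_ids = np.array([[5, 17, 42, 7, 9]])
segment_ids = np.array([[0, 0, 0, 1, 1]])
x = bert_style_input(token_ids, segment_ids)   # shape (1, 5, 8)
```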


Implementation

Reference implementations

BERT (google-research/bert) — canonical learned PE implementation

Python (TensorFlow) · Google Research · Official

Hugging Face Transformers: BertEmbeddings (position_embeddings) and GPT2Model (wpe)

Python · Hugging Face

Vision Transformer (google-research/vision_transformer)

Python (JAX) · Google Research · Official

Implementation pitfalls

Exceeding max_seq_len at inference · Critical

Learned PE is defined only for positions 0..max_seq_len-1. Inference on a longer sequence either raises an index out-of-range error or, if positions are wrapped with a modulo, silently reuses the vectors of early positions for late tokens and corrupts position semantics.

Fix: Hard-truncate the context to max_seq_len, or, if longer sequences are required, use a model built on RoPE (optionally extended with YaRN) or ALiBi.
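A minimal sketch of the truncation fix follows, in plain Python. The function name `fit_to_context` and the `keep` parameter are hypothetical; which end of the sequence to keep depends on the task. The last lines show why a modulo wrap is unsafe.

```python
def fit_to_context(token_ids, max_seq_len, keep="tail"):
    """Truncate token_ids so that every position has a row in the PE table."""
    if max_seq_len <= 0:
        raise ValueError("max_seq_len must be positive")
    if len(token_ids) <= max_seq_len:
        return list(token_ids)
    if keep == "head":
        return list(token_ids[:max_seq_len])    # keep the beginning
    if keep == "tail":
        return list(token_ids[-max_seq_len:])   # keep the most recent tokens
    raise ValueError("keep must be 'head' or 'tail'")


tokens = list(range(10))
recent = fit_to_context(tokens, 4)               # [6, 7, 8, 9]
first = fit_to_context(tokens, 4, keep="head")   # [0, 1, 2, 3]

# Modulo wrapping avoids the index error but aliases positions:
# tokens 4 and 5 would receive the vectors learned for positions 0 and 1.
wrapped = [p % 4 for p in range(6)]              # [0, 1, 2, 3, 0, 1]
```

After truncation the surviving tokens are renumbered from position 0, so the model sees only indices it was trained on.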

No length extrapolation · High

Unlike sinusoidal PE or ALiBi, learned PE does not extrapolate. A model trained at 512 tokens does not perform well at 1024, even if the table is expanded, because the added rows are randomly initialised and have never been trained.

Fix: For long context, use RoPE with YaRN/LongRoPE, or ALiBi. Extending the learned PE table requires separate, lengthy fine-tuning and yields worse results.
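To see why a bare table expansion does not help, consider the sketch below. The function `extend_table` is hypothetical, and a table of ones stands in for trained rows. The expansion makes longer inputs indexable, but the new rows carry no learned information until they are fine-tuned.

```python
import numpy as np

def extend_table(P, new_max_seq_len, std=0.02, seed=0):
    """Append randomly initialised rows to a trained position table."""
    old_len, d_model = P.shape
    if new_max_seq_len < old_len:
        raise ValueError("new_max_seq_len must not be smaller than the table")
    rng = np.random.default_rng(seed)
    new_rows = rng.normal(0.0, std, size=(new_max_seq_len - old_len, d_model))
    return np.concatenate([P, new_rows], axis=0)


P_trained = np.ones((4, 3))            # stand-in for trained rows
P_ext = extend_table(P_trained, 8)     # shape (8, 3)
# Rows 0..3 are unchanged; rows 4..7 are untrained noise.
```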

Inconsistent init between learned PE and token embeddings · Medium

If the PE initialisation differs significantly in scale from the token embeddings, one signal dominates the other in early training, hurting stability.

Fix: Use the same initialisation scale as for the token embeddings (typically a normal distribution with standard deviation 0.02).
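One simple way to check for this pitfall is to compare the root-mean-square magnitude of the two tables at initialisation. The sketch below is illustrative only: the sizes, the `rms` helper, and the deliberately mismatched standard deviation of 1.0 are all assumptions.

```python
import numpy as np

def rms(x):
    """Root-mean-square magnitude of an array."""
    return float(np.sqrt(np.mean(x ** 2)))

rng = np.random.default_rng(0)
vocab_size, max_seq_len, d_model = 1000, 512, 64   # toy sizes

tok_table = rng.normal(0.0, 0.02, size=(vocab_size, d_model))
P_matched = rng.normal(0.0, 0.02, size=(max_seq_len, d_model))
P_mismatched = rng.normal(0.0, 1.0, size=(max_seq_len, d_model))

# A ratio close to 1 means both signals contribute comparably to x_in.
matched_ratio = rms(P_matched) / rms(tok_table)
# A ratio far above 1 means position vectors swamp token identity.
mismatched_ratio = rms(P_mismatched) / rms(tok_table)
```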

Evolution

Original paper · 2017 · ICML 2017 · Jonas Gehring

Convolutional Sequence to Sequence Learning

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin

2017

Learned position embeddings in ConvS2S (Gehring et al.)

Facebook AI Research introduces learned position embeddings in a convolutional architecture for machine translation. It is one of the first works to use learned PE to supply position information to a recurrence-free model.

2017

Sinusoidal PE in Transformer (Vaswani et al.)

Vaswani et al. experiment with learned PE as an alternative to sinusoidal. Results are nearly identical; they choose sinusoidal because it may let the model extrapolate to sequences longer than those seen in training.


2018

BERT and GPT adopt learned PE

Inflection point

BERT (Devlin et al.) and GPT (Radford et al.) adopt learned PE. From this point it becomes the standard choice in pretrained encoder-only and decoder-only models for several subsequent years.

2020

Vision Transformer (Dosovitskiy et al.) — learned PE for patches

ViT uses learned 1D PE for image patches, showing that the method transfers well from NLP to computer vision.

2021

RoPE and ALiBi — moving away from additive PE

RoPE (Su et al.) and ALiBi (Press et al.) show that better quality and extrapolation can be achieved without learned position embeddings. The shift away from learned PE in new large LLMs begins.


2023

Decline of learned PE in new LLMs

Llama 2/3, Qwen, DeepSeek, Mistral and other new large LLMs use RoPE (+ YaRN/LongRoPE for long-context). Learned PE remains in use mainly in older BERT/GPT-2 models and in classical ViT.
