A parametric position embedding table P of shape (max_seq_len, d_model) is created. For position pos, the row P[pos] is fetched and added to the token embedding at the model input (x_in = token_emb + P[pos]). The integration is identical to sinusoidal PE; only the source of the vector differs. The table P is randomly initialised and learned together with the rest of the model via backprop. Each position from 0 to max_seq_len-1 gets its own independent learned vector. Positions pos >= max_seq_len are undefined: the model has no embedding for them, so a longer context must be hard-truncated or shifted to fit the window.
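The lookup-and-add step described above can be sketched in NumPy as follows. This is a minimal illustration rather than any particular library's implementation: the toy sizes, the random stand-in token embeddings, and the helper name `add_learned_pe` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
max_seq_len, d_model = 8, 4  # toy sizes, chosen only for illustration

# Position table P: one row per absolute position (trainable in a real model).
P = rng.normal(0.0, 0.02, size=(max_seq_len, d_model))

def add_learned_pe(token_emb: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Return token_emb + P[pos] for pos = 0..T-1 (hypothetical helper)."""
    T = token_emb.shape[0]
    if T > P.shape[0]:
        # Positions >= max_seq_len have no row in P.
        raise ValueError(f"sequence length {T} exceeds max_seq_len {P.shape[0]}")
    positions = np.arange(T)         # [0, 1, ..., T-1]
    return token_emb + P[positions]  # row lookup, then elementwise add

# Stand-in for real token embeddings of a 5-token sequence.
token_emb = rng.normal(0.0, 0.02, size=(5, d_model))
x_in = add_learned_pe(token_emb, P)
```

The explicit length check mirrors the hard limit: a sequence longer than max_seq_len is rejected instead of silently reading a row that does not exist.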
Sinusoidal PE is deterministic and assumes a specific geometry (geometric frequency decomposition with base 10000), which is not necessarily optimal for a given domain or model size. Learned PE lets the model discover the position representation most useful for the task, at the cost of additional parameters and the loss of extrapolation beyond the training length.
Parametric table P of shape (max_seq_len, d_model). Each row is a learned vector representing one absolute position in the context window. The table is randomly initialised (typically from a normal distribution with mean 0 and standard deviation 0.02) and updated by standard backprop together with the rest of the model.
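To make the backprop update concrete: because x_in = token_emb + P[pos], the gradient of the loss with respect to row P[pos] is simply the gradient arriving at x_in for that position. The sketch below assumes a single sequence, plain SGD without weight decay, and a made-up upstream gradient; `sgd_step_on_P` is a hypothetical helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)
max_seq_len, d_model = 8, 4
P = rng.normal(0.0, 0.02, size=(max_seq_len, d_model))  # mean 0, std 0.02

def sgd_step_on_P(P: np.ndarray, grad_x_in: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """One plain-SGD step on the position table (hypothetical helper).

    grad_x_in has shape (T, d_model): dLoss/dx_in for positions 0..T-1.
    Since x_in = token_emb + P[pos], dLoss/dP[pos] = grad_x_in[pos].
    """
    T = grad_x_in.shape[0]
    P_new = P.copy()
    P_new[:T] -= lr * grad_x_in  # only the rows that were looked up change
    return P_new

grad_x_in = np.ones((3, d_model))  # stand-in for a gradient from backprop
P_new = sgd_step_on_P(P, grad_x_in)
```

Under these assumptions, rows for positions that never occur in a training sequence keep their initial random values.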
Learned PE is physically defined only for positions 0..max_seq_len-1. Inference on a longer sequence causes an index out-of-range error or, if indices are wrapped with modulo, corrupts position semantics, because distinct positions then share one embedding.
Unlike sinusoidal PE or ALiBi, learned PE does not extrapolate: a model trained at 512 tokens does not perform well at 1024, even if the table is expanded, because the added rows are randomly initialised and have never been trained.
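The failure modes for over-long inputs can be reproduced on a toy table. This is an illustrative sketch in NumPy; the table contents and sizes are arbitrary, and the truncation policy (keep the last max_seq_len tokens) is one assumed choice among several.

```python
import numpy as np

max_seq_len, d_model = 4, 2
# Deterministic toy table so rows are easy to tell apart.
P = np.arange(max_seq_len * d_model, dtype=float).reshape(max_seq_len, d_model)

positions = np.arange(6)  # a sequence longer than max_seq_len

# 1) Direct lookup fails: rows 4 and 5 do not exist.
try:
    P[positions]
    lookup_failed = False
except IndexError:
    lookup_failed = True

# 2) Modulo wrapping runs, but position 4 now reuses the row of position 0.
wrapped = P[positions % max_seq_len]
aliased = np.array_equal(wrapped[4], wrapped[0])

# 3) Hard truncation: keep the last max_seq_len tokens and re-index from 0.
kept = positions[-max_seq_len:]      # original token indices that survive
truncated = P[np.arange(len(kept))]  # lookups are valid again
```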
If PE initialisation differs significantly in scale from token embeddings, one signal dominates the other in early training, hurting stability.
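One simple way to check for such a mismatch is to compare the root-mean-square magnitude of the two tables at initialisation. The sketch below uses toy sizes and a deliberately mismatched table for contrast; the helper `rms` and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_seq_len, d_model = 1000, 64, 32  # toy sizes

token_table = rng.normal(0.0, 0.02, size=(vocab_size, d_model))
P_matched = rng.normal(0.0, 0.02, size=(max_seq_len, d_model))
P_large = rng.normal(0.0, 1.0, size=(max_seq_len, d_model))  # deliberately mismatched

def rms(x: np.ndarray) -> float:
    """Root-mean-square magnitude of all entries."""
    return float(np.sqrt(np.mean(x ** 2)))

ratio_matched = rms(P_matched) / rms(token_table)  # close to 1
ratio_large = rms(P_large) / rms(token_table)      # far above 1: position signal dominates
```

A ratio far from 1 means that the sum token_emb + P[pos] is dominated by one of the two terms at the start of training.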
Facebook AI Research introduces learned position embeddings in a convolutional architecture for machine translation, one of the first works to use learned PE to supply position information in recurrence-free models.
Vaswani et al. experiment with learned PE as an alternative to sinusoidal. The results are nearly identical; they keep sinusoidal, which is parameter-free and, as they hypothesise, may extrapolate to sequences longer than those seen in training.
BERT (Devlin et al.) and GPT (Radford et al.) adopt learned PE as their default. From this point it becomes the standard choice in pretrained encoder and decoder models for several years.
ViT uses learned 1D PE for image patches, showing that the method transfers well from NLP to computer vision.
RoPE (Su et al.) and ALiBi (Press et al.) show that better quality and extrapolation can be achieved without learned position embeddings. The shift away from learned PE in new large LLMs begins.
Llama 2/3, Qwen, DeepSeek, Mistral and other new large LLMs use RoPE (+ YaRN/LongRoPE for long-context). Learned PE remains in use mainly in older BERT/GPT-2 models and in classical ViT.
Time complexity: O(1) per token (O(T) per sequence). Space complexity: O(max_seq_len · d_model) parameters; O(T · d_model) activations per sequence in a batch.
The operation is extremely cheap computationally (one lookup + addition). If it is a bottleneck at all, it is only memory-bandwidth-bound — the table P must be loaded from VRAM/HBM. In practice it is never a real bottleneck compared with attention and the MLP.
Maximum context length the model is trained on. A hard limit — positions above this value have no defined embedding.
Width of the position embedding — must equal the width of the token embedding because PE is added.
Initialisation scheme of the position table (e.g. normal with mean 0 and standard deviation 0.02, as in BERT/GPT). Affects early training stability.
Extra parameters introduced by learned PE: max_seq_len × d_model. For BERT-base: 512 × 768 = 393,216 ≈ 0.4M parameters (significant for small models, marginal for LLMs).
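The arithmetic can be checked directly. The helper name `learned_pe_params` is hypothetical, and the total of roughly 110M parameters for BERT-base is an assumed round figure used only to put the count in proportion.

```python
def learned_pe_params(max_seq_len: int, d_model: int) -> int:
    """Number of trainable values in the position table."""
    return max_seq_len * d_model

bert_base_pe = learned_pe_params(512, 768)  # 393216

# Assumed total of ~110M parameters for BERT-base (round figure).
share = bert_base_pe / 110_000_000
print(f"{bert_base_pe} parameters, {share:.2%} of the model")
```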
Learned PE is added to every input embedding — a dense operation, deterministic at runtime after training, with no routing.
Lookup in an embedding table is natively parallel and extremely fast. There are no sequential dependencies in the position encoding mechanism itself.
Embedding table lookup is a standard, well-supported operation on every accelerator and CPU.