Scaled Dot-Product Attention

2017 · Active · Published: 28 May 2026 · Updated: 28 May 2026

Key innovation

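Replaces additive (MLP-based) or general scoring with a query-key dot product scaled by 1/√d_k. The scaling keeps the softmax inputs in a range where gradients stay usable as d_k grows, and the whole operation reduces to efficient vectorised matrix multiplications in the Transformer.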
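The role of the scaling factor can be written out as a short worked equation. This follows the argument given in the original paper and assumes that the components of a query q and a key k are independent with mean 0 and variance 1, which is an idealisation rather than a property of trained models.

```latex
\operatorname{Attention}(Q, K, V)
  = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

q \cdot k = \sum_{i=1}^{d_k} q_i k_i,
\qquad
\operatorname{Var}(q \cdot k) = d_k
\;\Longrightarrow\;
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
```

Under that assumption, unscaled scores spread out as d_k grows and push softmax towards saturation, where gradients become very small. Dividing by √d_k keeps the score variance at 1 regardless of d_k.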

Category

Architecture

Abstraction level

Building block

Operation level

Architecture block · Training · Inference

Use cases

Transformers · Large language models · Multimodal models · Vision Transformers · Machine translation

How it works

The model projects its inputs into three matrices: Q (queries), K (keys) and V (values). It computes the similarities QK^T, divides them by √d_k, normalises each row with softmax across key positions, and multiplies the resulting weights by V to obtain a weighted sum of values for each query.
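The steps above can be sketched in a few lines of plain Python. This is a minimal illustration that uses only the standard library and nested lists so every step is visible; it is not how production libraries implement the operation, and it omits the learned projections, masking and multiple heads. The function names and the toy inputs are assumptions chosen for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v, all as lists of lists."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, divided by sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # Normalise across key positions so the weights sum to 1.
        w = softmax(scores)
        # Weighted sum of the value vectors, one output column at a time.
        out = [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Toy inputs (hypothetical): 2 queries, 3 keys/values, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, and each row of `outputs` is a convex combination of the rows of `V`. The loop over queries exists only for readability; the same computation is normally expressed as two matrix multiplications around a row-wise softmax.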

Problem solved

It computes attention quickly and in parallel as matrix operations, avoiding RNN sequentiality and costly MLP-based scoring.

Components

Queries: Specify what each position is searching for.

Representations of the positions for which matches are sought.

Keys: Provide the references that queries are scored against.

Representations of the positions against which similarity is measured.

Values: Carry the information passed to the output.

Representations that are aggregated according to the attention weights.

Evolution

Original paper · 2017 · NeurIPS 2017 · Ashish Vaswani

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

Related concepts

Built on

Self-Attention · Bahdanau Attention · Luong Attention

Often used with

Sources

Attention Is All You Need

Neural Machine Translation by Jointly Learning to Align and Translate

Effective Approaches to Attention-based Neural Machine Translation

Computational complexity

Time complexity: O(n² · d). Space complexity: O(n²) for the attention matrix. Here n is the sequence length and d is the representation dimension.
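To make the quadratic growth concrete, the counts behind these bounds can be tallied directly. The sketch below is a rough, per-head estimate that counts only the two matrix multiplications and the score matrix, ignoring the softmax and constant factors; `attention_cost` and the example sizes are hypothetical names and values chosen for illustration.

```python
def attention_cost(n, d):
    """Rough per-head cost for sequence length n and dimension d."""
    return {
        "qk_multiply_adds": n * n * d,  # QK^T: (n x d) times (d x n)
        "av_multiply_adds": n * n * d,  # weights (n x n) times V (n x d)
        "score_entries": n * n,         # entries in the n x n attention matrix
    }

small = attention_cost(1024, 64)
large = attention_cost(2048, 64)

# Doubling n quadruples both the compute and the score-matrix size.
growth = large["score_entries"] // small["score_entries"]
```

With these numbers, doubling the sequence length from 1024 to 2048 quadruples every entry in the result, while doubling d would only double the multiply-add counts and leave the score matrix unchanged.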

Execution paradigm

Primary mode

Dense

Activation pattern

All paths active

Parallelism

Parallelism level

Fully parallel

Within a layer, all positions can be processed in parallel as matrix operations.
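The independence between positions can be illustrated with a small sketch: the output for one query depends only on that query and on the shared K and V, so queries can be evaluated in any order or concurrently. The example uses Python threads purely to show that independence, not to demonstrate a speed-up; `attend_one` and the toy inputs are assumptions made for this illustration.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def attend_one(q, K, V):
    """Attention output for a single query; it reads no other query."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, V))
            for c in range(len(V[0]))]

# Toy inputs (hypothetical): 3 queries, 2 keys/values.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, 4.0]]

# One query after another.
sequential = [attend_one(q, K, V) for q in Q]

# The same queries dispatched to a thread pool; map() preserves input order.
with ThreadPoolExecutor() as pool:
    concurrent = list(pool.map(lambda q: attend_one(q, K, V), Q))
```

Both lists hold identical results. On accelerators the same property is exploited by batching all queries into a single matrix multiplication rather than by dispatching them one at a time.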

Scope

Training · Inference · Across tokens · Across devices

Hardware requirements

Primary

Dominated by the matrix multiplications QK^T and AV, where A is the matrix of softmax attention weights. Both map well to GPUs and Tensor Cores.