Architecture

Softmax

1989 · Active · Published

Key innovation

Transforms a vector of raw logits into a probability distribution summing to 1, enabling the network output to be interpreted as class probabilities.

How it works

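For input vector z = [z₁, z₂, ..., zₙ], softmax computes: softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ). The exponential makes every term positive and amplifies larger values, and dividing by the sum normalizes the outputs so they add up to 1. In the transformer attention mechanism, softmax is applied row-wise to the scaled Q·Kᵀ score matrix, so each query's attention weights over the keys sum to 1.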
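The formula above can be sketched directly in a few lines. This is a minimal illustration that uses only the Python standard library; the function name `softmax` and the example logits are chosen here for illustration and do not come from any particular library.

```python
import math

def softmax(z):
    """Softmax of a list of logits, written directly from the formula."""
    exps = [math.exp(v) for v in z]   # exp makes every term positive
    total = sum(exps)                 # normalizing constant: sum_j exp(z_j)
    return [e / total for e in exps]

# Hypothetical logits for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])
print(probs)       # roughly [0.659, 0.242, 0.099]
print(sum(probs))  # 1.0 up to floating-point rounding
```

The largest logit receives the largest probability, and the ordering of the inputs is preserved in the outputs. This direct form is only safe for moderate logit values.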

Problem solved

Raw neural network outputs (logits) are unbounded real numbers, possibly negative, that cannot be read directly as probabilities. Softmax converts them into a proper probability distribution (values between 0 and 1 that sum to 1).

Implementation

Implementation pitfalls

Numerical overflow without log-sum-exp trick · Medium

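exp(x) overflows float64 for x above roughly 709.78, so the naive softmax(x) = exp(x)/sum(exp(x)) produces inf/inf = NaN for large logits; for very negative logits every exp underflows to 0 and the result is 0/0. Stable implementation: softmax(x - max(x)), which is mathematically identical because softmax is unchanged when the same constant is added to every input.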
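A minimal sketch of the difference, assuming pure Python floats (which are float64). The names `softmax_naive` and `softmax_stable` are illustrative. Note that `math.exp` raises `OverflowError` on overflow, whereas array libraries such as NumPy typically return `inf` and then `nan` instead.

```python
import math

def softmax_naive(x):
    # Direct formula: math.exp overflows once an input exceeds about 709.78.
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(x):
    # Subtract the maximum first: every exponent is <= 0, so exp is in (0, 1],
    # and the sum is >= 1 because the largest term contributes exp(0) = 1.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1000.0, 999.0, 998.0]  # hypothetical large logits

try:
    softmax_naive(logits)
    naive_ok = True
except OverflowError:
    naive_ok = False             # exp(1000) is out of float64 range

stable = softmax_stable(logits)
print(naive_ok)  # False
print(stable)    # roughly [0.665, 0.245, 0.090]
```

The stable result matches what softmax([2, 1, 0]) would give, since only the differences between logits matter.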

Softmax saturates at extreme values · Medium

When one logit is much larger than the others, softmax returns values close to 0 or 1 and its gradients vanish: the derivative ∂pᵢ/∂zⱼ = pᵢ(δᵢⱼ − pⱼ) goes to 0 whenever every pᵢ is near 0 or 1. In attention mechanisms this can lead to "attention collapse" (one token dominates).
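The saturation effect can be checked numerically. The sketch below builds the softmax Jacobian, J[i][j] = pᵢ(δᵢⱼ − pⱼ), for a moderate and an extreme logit vector and compares the largest gradient entry in each case. The helper names and the logit values are illustrative assumptions.

```python
import math

def softmax(x):
    # Max-subtracted form, so the extreme example below does not overflow.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(p):
    """J[i][j] = d p_i / d z_j = p_i * (delta_ij - p_j)."""
    n = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(n)]
            for i in range(n)]

def max_abs_entry(matrix):
    return max(abs(v) for row in matrix for v in row)

moderate = softmax([1.0, 0.0, 0.0])    # one logit slightly larger
extreme = softmax([50.0, 0.0, 0.0])    # one logit much larger

grad_moderate = max_abs_entry(softmax_jacobian(moderate))
grad_extreme = max_abs_entry(softmax_jacobian(extreme))

print(moderate, grad_moderate)  # probabilities spread out, gradient around 0.24
print(extreme, grad_extreme)    # first probability about 1, gradient near 0
```

With the extreme logits, every entry of the Jacobian is many orders of magnitude smaller than in the moderate case, so almost no gradient signal flows back through the softmax.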

Evolution

Original paper · 1989 · David Rumelhart

A theoretical framework for back-propagation

David Rumelhart, Geoffrey Hinton, Ronald Williams


Execution paradigm

Parallelism