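For an input vector z = [z₁, z₂, ..., zₙ], softmax computes softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ). The exponential makes every term positive and amplifies differences between logits, and dividing by the sum normalizes the outputs so they sum to 1. In the transformer attention mechanism, softmax is applied row by row to the scaled score matrix Q·Kᵀ / √dₖ, so each query's weights over the keys sum to 1.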
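As a minimal sketch of the formula, the NumPy snippet below transcribes it directly and then applies it row-wise to attention scores. The function names, the random Q and K matrices, and their shapes are illustrative assumptions, and the 1/√dₖ scaling is the usual scaled dot-product convention rather than something fixed by these notes. This is the direct form with no numerical-stability handling.

```python
import numpy as np

def softmax_naive(z):
    """Direct transcription of exp(z_i) / sum_j exp(z_j) for a 1-D vector."""
    e = np.exp(z)
    return e / e.sum()

def attention_weights(Q, K):
    """Row-wise softmax over the scaled scores Q K^T / sqrt(d_k) (illustrative helper)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    e = np.exp(scores)
    # Normalize along the key axis so each query's weights sum to 1.
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([1.0, 2.0, 3.0])
p = softmax_naive(z)  # approximately [0.0900, 0.2447, 0.6652]

# Hypothetical small inputs: 2 queries, 3 keys, d_k = 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
W = attention_weights(Q, K)  # shape (2, 3), one distribution per query
```

The largest logit receives the largest share, but the smaller ones keep nonzero probability, which is what distinguishes softmax from a hard argmax.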
Raw neural network outputs (logits) are unbounded real numbers: they can be negative and do not sum to 1, so they cannot be read as probabilities directly. Softmax converts them into a proper probability distribution (values between 0 and 1 that sum to 1) while preserving their ranking.
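In float64, exp(x) overflows to infinity for x greater than about 709.78, so the naive form softmax(x) = exp(x)/sum(exp(x)) returns NaN (inf/inf) for large logits. For very negative logits every exp(x) underflows to 0 and the result is 0/0, also NaN. Stable implementation: softmax(x - max(x)). Softmax is invariant to adding a constant to every input, so the result is mathematically unchanged, but the largest exponent becomes exp(0) = 1, which rules out overflow and keeps the denominator at least 1.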
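The failure and the fix can be reproduced in a few lines of NumPy. This is a sketch under stated assumptions: the function names and the test vector are illustrative, and np.errstate is used only to silence the overflow warnings the naive version triggers.

```python
import numpy as np

def softmax_naive(x):
    """exp(x) / sum(exp(x)) with no protection against overflow."""
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    """Subtract max(x) first so the largest exponent is exp(0) = 1."""
    shifted = x - np.max(x)
    e = np.exp(shifted)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])

# exp(1000) overflows to inf in float64, and inf / inf evaluates to nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(x)

stable = softmax_stable(x)  # same result as softmax of [0.0, 1.0, 2.0]
```

Because only the differences between logits matter, the stable result for [1000, 1001, 1002] matches the result for [0, 1, 2].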
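When one logit is much larger than the others, softmax saturates: the output approaches a one-hot vector, with one value close to 1 and the rest close to 0. The derivative ∂softmax(z)ᵢ/∂zⱼ = sᵢ(δᵢⱼ − sⱼ), where s = softmax(z), then goes to 0 in every entry, so gradients vanish. In attention mechanisms this leads to "attention collapse": one token dominates and receives almost all of the weight.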
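The saturation effect can be checked numerically with the softmax Jacobian, J = diag(s) − s sᵀ where s = softmax(z). The sketch below compares a balanced logit vector with a peaked one. The function names and the two example vectors are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """Jacobian with entries J[i, j] = s_i * (delta_ij - s_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

balanced = np.array([0.0, 0.0, 0.0])   # uniform output, gradients flow
peaked = np.array([20.0, 0.0, 0.0])    # one logit dominates, output near one-hot

# Largest absolute Jacobian entry in each case.
j_balanced = np.abs(softmax_jacobian(balanced)).max()
j_peaked = np.abs(softmax_jacobian(peaked)).max()
```

For the balanced input the largest entry is 2/9, while for the peaked input every entry is many orders of magnitude smaller, so almost no gradient signal passes back through the softmax.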