Architecture

Logits

1944 · Active · Published

Key innovation

The raw, unnormalized output of a neural network before the final activation function (softmax or sigmoid): the key intermediate vector for computing class or token probabilities.

How it works

The final linear layer of the network (the classification head or language-model head) produces a logit vector with one entry per class or per vocabulary token. The values are unbounded real numbers and may be negative. They become probabilities only after normalization: softmax for mutually exclusive classes, sigmoid for binary or multi-label outputs. In an LLM, each generation step yields a logit vector of size |V|, the vocabulary size.
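As a minimal sketch of the normalization step described above, the following uses only the Python standard library. The logit values are made up for illustration and do not come from any real model; `softmax` and `sigmoid` are plain helper functions written for this example, not a library API.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max so exp() never sees a large positive input
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Map a single logit to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # for negative z, use the form that avoids exp() of a large positive number
    return e / (1.0 + e)

# Hypothetical logits for a 3-class problem; note the negative entry.
logits = [2.0, -1.0, 0.5]
probs = softmax(logits)

print(probs)          # three positive values summing to 1
print(sigmoid(0.0))   # a logit of 0 corresponds to probability 0.5
```

The ordering of the logits is preserved by softmax: the largest logit always receives the largest probability.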

Problem solved

A network needs an intermediate vector that represents the "strength" of each possible outcome before it is normalized into a probability. Logits serve this role, and because they are still unnormalized, they can be manipulated before the probabilities are computed (e.g., temperature scaling).
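Temperature scaling, mentioned above, can be sketched as dividing every logit by a positive constant T before softmax. This is an illustrative sketch, not a specific library's implementation; the function name `softmax_with_temperature` and the example logits are assumptions made for this example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits

sharp = softmax_with_temperature(logits, temperature=0.5)  # T < 1 sharpens the distribution
base = softmax_with_temperature(logits, temperature=1.0)   # T = 1 is plain softmax
flat = softmax_with_temperature(logits, temperature=2.0)   # T > 1 flattens the distribution

print(sharp[0], base[0], flat[0])  # probability of the top class falls as T rises
```

Lower temperatures concentrate probability on the highest logit, while higher temperatures move the distribution toward uniform.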

Implementation

Implementation pitfalls

Numerical overflow with very large logits (Medium)

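Large logits can overflow exp() when softmax is computed naively. The threshold depends on precision: in float32, exp(x) overflows for x above roughly 88.7, and in float16 for x above roughly 11. Solution: the log-sum-exp trick (subtract the maximum logit before calling exp), which leaves the softmax output unchanged.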
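The overflow and its fix can be illustrated with a small standard-library sketch. Python floats are float64, so the overflow here appears near 709 rather than at the lower float32 or float16 limits, and `math.exp` raises `OverflowError` instead of returning infinity. The function names and logit values are assumptions chosen for the example.

```python
import math

def naive_softmax(logits):
    """Softmax without any shift; fails for large logits."""
    exps = [math.exp(z) for z in logits]  # math.exp raises OverflowError on very large inputs
    total = sum(exps)
    return [e / total for e in exps]

def log_sum_exp(logits):
    """Compute log(sum(exp(z))) by factoring out the maximum logit."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def stable_softmax(logits):
    """Softmax computed as exp(z - logsumexp(z)); every exponent is <= 0."""
    lse = log_sum_exp(logits)
    return [math.exp(z - lse) for z in logits]

logits = [1000.0, 999.0, 998.0]  # hypothetical, deliberately extreme

try:
    naive_softmax(logits)
    naive_ok = True
except OverflowError:
    naive_ok = False

probs = stable_softmax(logits)
print(naive_ok)  # the naive version does not survive these inputs
print(probs)     # the stable version returns a valid distribution
```

Because only differences between logits matter, the stable version gives the same result for these inputs as it would for the logits 2, 1, 0.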

Incorrect interpretation of logits as probabilities (Medium)

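Logits are unnormalized scores: they do not sum to 1 and cannot be read as probabilities until softmax (or sigmoid) is applied. Softmax is invariant to adding the same constant to every logit, so absolute logit values carry no meaning on their own; only the differences within one vector do. Directly comparing raw logits across different models or batches is therefore an error.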
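One way to see why raw logit values are not comparable is that two very different-looking logit vectors can encode exactly the same distribution. The sketch below uses made-up values and a small helper `softmax` written for this example.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

a = [1.0, 2.0, 3.0]          # hypothetical logits from one model
b = [z + 100.0 for z in a]   # the same logits shifted by a constant

pa = softmax(a)
pb = softmax(b)

print(sum(a), sum(b))  # raw logits do not sum to 1
print(pa)
print(pb)              # same probabilities despite very different raw values
```

A logit of 103 in one vector and a logit of 3 in another therefore say nothing about which outcome is more likely until each vector is normalized.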