The final linear layer of a neural network (the classification or language-modeling head) produces a logit vector whose length equals the number of classes or the vocabulary size. Logits are unbounded real numbers and can be negative. They become probabilities only after a normalizing function is applied: softmax for mutually exclusive classes, or sigmoid for binary or multi-label outputs. In LLMs, the model produces a logit vector of size |V| (the vocabulary size) at each generation step.
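During neural network computation, an intermediate vector is needed to represent the "strength" of each possible outcome before it is normalized into a probability. Logits serve this role. Because they exist before normalization, they can be manipulated directly to shape the output distribution (e.g., temperature scaling, which divides every logit by a constant before softmax).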
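As an illustration of temperature scaling, here is a minimal NumPy sketch. The function names, the example logits, and the temperature values are assumptions chosen for demonstration, not taken from any particular model or library.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert a 1-D logit vector into probabilities."""
    shifted = logits - np.max(logits)  # shift so the largest entry is 0
    exps = np.exp(shifted)
    return exps / np.sum(exps)


def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Divide the logits by a positive temperature, then normalize."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return softmax(logits / temperature)


# Hypothetical logits for a 4-token vocabulary.
logits = np.array([2.0, 1.0, 0.5, -1.0])

sharp = softmax_with_temperature(logits, 0.5)  # T < 1 sharpens the distribution
plain = softmax_with_temperature(logits, 1.0)  # T = 1 is ordinary softmax
flat = softmax_with_temperature(logits, 2.0)   # T > 1 flattens the distribution

print(sharp, plain, flat)
```

A temperature below 1 concentrates probability on the highest-scoring token, while a temperature above 1 spreads it more evenly. The ranking of the tokens stays the same in all three cases.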
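Large logits can overflow exp() when computing softmax. The threshold depends on the floating-point type: exp(x) overflows for x above roughly 88 in float32, roughly 11 in float16, and roughly 709 in float64. Solution: the log-sum-exp trick (subtracting the max logit before exp), which makes the largest exponent 0 and does not change the resulting probabilities.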
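The overflow and its fix can be sketched as follows. This is a minimal NumPy example under assumed inputs: the logit values and the helper names `stable_softmax` and `logsumexp` are illustrative only.

```python
import numpy as np


def stable_softmax(logits: np.ndarray) -> np.ndarray:
    """Softmax that subtracts the max logit before exponentiating."""
    shifted = logits - np.max(logits)  # largest entry becomes 0, so exp() <= 1
    exps = np.exp(shifted)
    return exps / np.sum(exps)


def logsumexp(logits: np.ndarray) -> float:
    """log(sum(exp(logits))) computed without overflow."""
    m = np.max(logits)
    return float(m + np.log(np.sum(np.exp(logits - m))))


# Hypothetical logits far above the float32 overflow threshold for exp().
logits = np.array([1000.0, 999.0, 998.0], dtype=np.float32)

# Naive softmax: exp() overflows to inf, and inf / inf yields nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.sum(np.exp(logits))

stable = stable_softmax(logits)
log_normalizer = logsumexp(logits)

print(naive)
print(stable)
print(log_normalizer)
```

The naive version returns only nan values, while the shifted version returns a valid distribution. The same shift also gives a finite log-normalizer, which is what log-probabilities are usually computed from.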
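Logits are unnormalized scores: without softmax they do not sum to 1 and have no probabilistic interpretation. Directly comparing raw logit values across different models or batches is an error, because logits are defined only up to an additive constant: two logit vectors that differ by a fixed offset describe exactly the same probability distribution.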
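A small NumPy sketch of why raw logit values are not comparable. The two logit vectors are hypothetical, and the offset of 50 is an arbitrary choice for illustration.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert a 1-D logit vector into probabilities."""
    shifted = logits - np.max(logits)  # shift for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)


# Hypothetical logits from one model.
logits_a = np.array([3.0, 1.0, -2.0])
# Hypothetical logits from a second model: same values plus a constant offset.
logits_b = logits_a + 50.0

probs_a = softmax(logits_a)
probs_b = softmax(logits_b)

# The raw values differ by 50, yet the distributions match.
same_distribution = bool(np.allclose(probs_a, probs_b))

# Raw logits do not sum to 1; only the softmax output does.
raw_sum = float(np.sum(logits_a))
prob_sum = float(np.sum(probs_a))

print(same_distribution, raw_sum, prob_sum)
```

Comparing the first entries (3.0 against 53.0) would wrongly suggest that the second model is far more confident, so any comparison should be made on the normalized probabilities instead.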