After softmax, tokens are sorted in descending order of probability and only the top k are kept. The remaining tokens (ranks k+1, k+2, ...) are assigned probability 0. The kept probabilities are renormalized to sum to 1, and the next token is sampled from this restricted set.
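The steps above can be sketched in plain Python. This is a minimal illustration rather than any library's implementation; the function names (`softmax`, `top_k_filter`, `sample_top_k`) and the toy logits are assumptions made for the example.

```python
import math
import random

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Zero out all but the k most probable tokens, then renormalize."""
    # Indices sorted by descending probability; keep the first k.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / kept_mass
    return filtered

def sample_top_k(logits, k, rng=random):
    """Sample one token index from the top-k restricted distribution."""
    filtered = top_k_filter(softmax(logits), k)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]

# Toy 5-token vocabulary; the logits are made up for illustration.
logits = [2.0, 1.0, 0.5, -1.0, -3.0]
filtered = top_k_filter(softmax(logits), k=2)  # only indices 0 and 1 stay nonzero
token = sample_top_k(logits, k=2)              # always 0 or 1
```

With k=1 the filter leaves only the most probable token, so sampling always returns the argmax.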
Sampling from the full vocabulary can select tokens with negligible probability, which introduces incoherence. Top-k removes this "long tail" of the distribution and restricts sampling to plausible candidates.
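k=50 works well when the distribution is flat, but for a peaked distribution (one token dominates) k=50 still admits many low-probability tokens, which adds unwanted randomness. Top-p is more adaptive: its candidate set shrinks or grows with the shape of the distribution.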
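To make the contrast concrete, the sketch below counts how many candidates each method keeps on a peaked and a flat toy distribution. The helper `top_p_size`, the vocabulary size, and the threshold p=0.9 are assumptions chosen for illustration.

```python
def top_p_size(probs, p):
    """Number of top-ranked tokens needed for cumulative probability to reach p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

VOCAB = 1024  # toy vocabulary size (assumption)

# Peaked: one token holds 95% of the mass, the rest share the remaining 5%.
peaked = [0.95] + [0.05 / (VOCAB - 1)] * (VOCAB - 1)
# Flat: every token is equally likely.
flat = [1.0 / VOCAB] * VOCAB

k, p = 50, 0.9
for name, probs in (("peaked", peaked), ("flat", flat)):
    top_k_count = min(k, len(probs))    # top-k keeps k candidates regardless of shape
    top_p_count = top_p_size(probs, p)  # top-p adapts to the distribution
    print(f"{name}: top-k keeps {top_k_count}, top-p keeps {top_p_count}")
```

In this toy setup top-k keeps 50 candidates in both cases, while top-p keeps a single candidate for the peaked distribution and most of the vocabulary for the flat one.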
With k=10, all tokens beyond the top 10 are cut off, even if token k+1 has nearly the same probability as token k. This creates an artificial boundary in the sampling distribution.
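A small numeric example shows the boundary effect. The 12-token distribution below is hypothetical, and `top_k_filter` is an illustrative helper, not a library function.

```python
def top_k_filter(probs, k):
    """Zero out all but the k most probable tokens, then renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical distribution, already sorted by rank.
# Ranks 10 and 11 (indices 9 and 10) differ by only 0.001.
probs = [0.20, 0.15, 0.12, 0.10, 0.08, 0.07,
         0.06, 0.055, 0.051, 0.050, 0.049, 0.015]

filtered = top_k_filter(probs, k=10)

rank_10_before, rank_11_before = probs[9], probs[10]
rank_10_after, rank_11_after = filtered[9], filtered[10]
print(f"rank 10: {rank_10_before:.3f} -> {rank_10_after:.3f}")
print(f"rank 11: {rank_11_before:.3f} -> {rank_11_after:.3f}")
```

The two tokens start out almost equally likely, but after filtering rank 10 gains probability from renormalization while rank 11 can never be sampled.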
Number of tokens in the candidate set. k=1 is greedy decoding. Typical values: 40–100.