Inference

Top-k Sampling

2018 · Active · Published

Key innovation

Restricts sampling to the k most probable tokens, removing the noisy tail of the distribution at minimal computational cost.

How it works

After softmax, tokens are sorted in descending order of probability and only the top k are kept. The remaining tokens (ranks k+1, k+2, ...) are assigned probability 0. The distribution is then renormalized so that the kept probabilities sum to 1, and a token is sampled from this restricted set.
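The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `softmax`, `top_k_filter`, and `top_k_sample` are hypothetical, and the sketch assumes the model's output is available as a list of raw logits.

```python
import math
import random

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Zero all but the k largest probabilities, then renormalize to sum to 1."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Indices of the k most probable tokens (ties go to the lower index).
    keep = set(sorted(range(len(probs)), key=lambda i: -probs[i])[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [p / kept_mass if i in keep else 0.0 for i, p in enumerate(probs)]

def top_k_sample(logits, k, rng=random):
    """Sample one token index from the top-k restricted distribution."""
    probs = top_k_filter(softmax(logits), k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Example: with k=2, only indices 0 and 1 can ever be drawn.
token = top_k_sample([2.0, 1.0, 0.5, -1.0], k=2)
```

In practice, frameworks usually apply the cutoff to logits before softmax (masking the rest to negative infinity), which gives the same restricted distribution.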

Problem solved

Sampling from the full vocabulary can select tokens with negligible probability, which introduces incoherence. Top-k removes the "long tail" of the distribution, focusing the model on sensible candidates.

Implementation

Implementation pitfalls

Fixed K value does not adapt to different distributions (Medium)

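K=50 works well when the distribution is flat, but when the distribution is peaked (one token dominates), K=50 still admits dozens of low-probability tokens and adds unwanted randomness. Top-p is more adaptive because its cutoff follows cumulative probability mass rather than a fixed token count.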
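To make the contrast concrete, the sketch below counts how many tokens a mass-based (top-p) cutoff would keep for a flat and a peaked distribution. The helper `nucleus_size` and the two toy distributions are illustrative assumptions, not part of any library.

```python
def nucleus_size(probs, p):
    """Number of highest-probability tokens needed to reach cumulative mass p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)

# Toy 100-token vocabularies (hypothetical values).
flat = [0.01] * 100                      # every token equally likely
peaked = [0.95] + [0.05 / 99] * 99       # one token dominates

flat_kept = nucleus_size(flat, 0.9)      # roughly 90 tokens
peaked_kept = nucleus_size(peaked, 0.9)  # a single token already covers 0.9
```

A fixed K=50 would keep 50 tokens in both cases: too few candidates for the flat distribution and 49 unnecessary ones for the peaked distribution.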

Top-k truncates the long tail regardless of probability mass (Medium)

With K=10, every token beyond the top 10 is cut off, even if token K+1 has nearly the same probability as token K. This creates an artificial boundary in the sampling distribution.
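A small sketch of the boundary effect, using a hypothetical helper `cutoff_pair` and made-up probabilities: it returns the probability of the last kept token and of the first dropped token so the gap at the cutoff can be inspected.

```python
def cutoff_pair(probs, k):
    """Return (p_k, p_{k+1}): the last kept and the first dropped probability."""
    ranked = sorted(probs, reverse=True)
    if not 1 <= k < len(ranked):
        raise ValueError("k must leave at least one token below the cutoff")
    return ranked[k - 1], ranked[k]

# Hypothetical 5-token distribution with near-equal ranks 4 and 5.
probs = [0.30, 0.25, 0.20, 0.13, 0.12]
kept, dropped = cutoff_pair(probs, k=4)
# kept is 0.13 and dropped is 0.12: almost equally likely,
# yet the second token can never be sampled with k=4.
```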
