After computing softmax, the sampler sorts tokens in descending order of probability. It then selects the smallest prefix of that ordering whose cumulative probability is ≥ p (e.g., p=0.9). Only tokens within this "nucleus" are eligible for sampling. Tokens outside the nucleus receive probability 0, and the probabilities of the remaining tokens are renormalized to sum to 1 before a token is drawn.
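The procedure above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not any particular library's implementation: the names `softmax`, `top_p_filter`, and `sample_top_p` are hypothetical, and the input is assumed to be a plain list of logits indexed by token id.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Return {token_index: renormalized probability} for the nucleus."""
    # Sort token indices by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # smallest prefix that reaches the threshold
            break
    # Tokens outside the nucleus are dropped (probability 0);
    # the kept probabilities are rescaled so they sum to 1.
    return {i: probs[i] / cumulative for i in nucleus}

def sample_top_p(logits, p=0.9, rng=random):
    """Draw one token index from the nucleus of softmax(logits)."""
    filtered = top_p_filter(softmax(logits), p)
    tokens = list(filtered)
    weights = list(filtered.values())
    return rng.choices(tokens, weights=weights, k=1)[0]
```

For example, with probabilities [0.5, 0.3, 0.15, 0.05] and p=0.9, the cumulative sums are 0.5, 0.8, 0.95, so the nucleus consists of the first three tokens and the fourth can never be sampled.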
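Greedy decoding tends to yield monotonous, repetitive text, while temperature-only sampling can still draw very unlikely tokens from the tail of the distribution and produce incoherent output. Top-p balances creativity and quality by dynamically constraining the sampling space to the "nucleus" of the distribution: the nucleus is small when the model is confident and large when it is uncertain.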
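Using top-p=0.9 and top-k=50 simultaneously: in a typical implementation, top-k first reduces the candidates to the 50 most probable tokens, then top-p keeps the smallest prefix covering ~90% of the remaining, renormalized mass. The result is a doubly truncated sampling space that may eliminate valid tokens.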
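A small sketch can make the double truncation concrete. It assumes the filters are applied in the order top-k, then top-p, with renormalization after each step; actual libraries may order or combine them differently. The helper names and the toy distribution are hypothetical.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}

def top_p_filter(dist, p):
    """Keep the smallest high-probability prefix of dist with mass >= p; renormalize."""
    order = sorted(dist, key=dist.get, reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += dist[i]
        if cumulative >= p:
            break
    return {i: dist[i] / cumulative for i in kept}

def top_k_then_top_p(probs, k=50, p=0.9):
    """Assumed order: top-k first, then top-p on the renormalized survivors."""
    return top_p_filter(top_k_filter(probs, k), p)

# Toy distribution over six tokens (illustrative values only).
probs = [0.4, 0.25, 0.15, 0.12, 0.05, 0.03]

top_p_only = top_p_filter(dict(enumerate(probs)), p=0.9)  # keeps tokens 0-3
combined = top_k_then_top_p(probs, k=3, p=0.9)            # keeps tokens 0-2
```

Here top-p alone would keep token 3 (probability 0.12), but top-k=3 removes it before top-p ever sees it, which is the kind of loss the combined setting can cause.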
With p=0.99 and a flat distribution (e.g., 1000 tokens with similar probability), the nucleus contains ~990 tokens. Almost nothing is filtered out, so sampling is practically uniform random choice over the whole vocabulary.
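The arithmetic can be checked directly. The sketch below counts the nucleus size for an exactly uniform distribution over 1000 tokens and, for contrast, for a sharply peaked one. Both distributions are illustrative assumptions, the name `nucleus_size` is hypothetical, and `Fraction` is used only to avoid floating-point rounding at the threshold.

```python
from fractions import Fraction

def nucleus_size(probs, p):
    """Number of tokens in the smallest descending-probability prefix with mass >= p."""
    cumulative = 0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

p = Fraction(99, 100)

# Flat: 1000 tokens, each with probability exactly 1/1000.
flat = [Fraction(1, 1000)] * 1000

# Peaked: each token is half as likely as the previous one, then normalized.
weights = [Fraction(1, 2 ** (i + 1)) for i in range(1000)]
total = sum(weights)
peaked = [w / total for w in weights]

flat_size = nucleus_size(flat, p)      # 990
peaked_size = nucleus_size(peaked, p)  # 7
```

The same p=0.99 keeps 990 of 1000 tokens in the flat case but only 7 in the peaked case, so the threshold filters effectively only when the distribution has a clear head.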
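Cumulative probability threshold, 0 < p ≤ 1. Higher p → more tokens in the nucleus → more diversity. Lower p → fewer tokens → more conservative output; p=1 keeps every token and effectively disables the filter.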