Inference

Top-p / Nucleus Sampling

2019 · Active · Published

Key innovation

Instead of truncating to a fixed number of tokens (top-k), nucleus sampling selects the smallest set of tokens whose cumulative probability is at least p. The candidate set therefore shrinks when the distribution is peaked and grows when it is flat.

How it works

After the softmax, the decoder sorts tokens in descending probability order and takes the shortest prefix whose cumulative probability is ≥ p (e.g., p=0.9). Only tokens within this "nucleus" are candidates for sampling: tokens outside it receive probability 0, and the probabilities inside it are renormalized to sum to 1.
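The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `top_p_filter` and `sample_top_p` and the toy five-token distribution are assumptions made for the example, and production decoders typically work on batched tensors rather than Python lists.

```python
import random

def top_p_filter(probs, p=0.9):
    """Return {token_index: renormalized_prob} for the nucleus of `probs`."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Sort token indices by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # shortest prefix reaching mass p
            break
    # Tokens outside the nucleus are dropped (probability 0);
    # the survivors are renormalized to sum to 1.
    total = sum(probs[i] for i in nucleus)
    return {i: probs[i] / total for i in nucleus}

def sample_top_p(probs, p=0.9, rng=random):
    """Draw one token index from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy distribution (hypothetical values): cumulative mass is 0.60, 0.85, 0.95, ...
probs = [0.6, 0.25, 0.1, 0.04, 0.01]
nucleus = top_p_filter(probs, p=0.9)
token = sample_top_p(probs, p=0.9)
```

Here the nucleus is the first three tokens, since two tokens cover only 0.85 of the mass and the third pushes the total to 0.95. The last two tokens can never be sampled.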

Problem solved

Greedy decoding yields monotonous, repetitive text, while sampling from the full distribution (even with temperature scaling) can pick incoherent tokens from the low-probability tail. Top-p balances creativity and quality by dynamically constraining the sampling space to the "nucleus" of the distribution.

Implementation

Implementation pitfalls

Top-p and top-k together can over-constrain the space · Medium

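Using top-p=0.9 and top-k=50 together truncates twice. If top-k is applied first (the order assumed here), it cuts the candidates to 50 tokens, and top-p then keeps about 90% of the mass that remains after renormalization, not 90% of the original distribution. The doubly truncated space may eliminate valid tokens.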
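A small sketch can make the double truncation concrete. It assumes top-k runs before top-p and uses a deliberately flat, hypothetical 200-token distribution. The helper names are illustrative only and do not come from any particular library.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize. Returns {token: prob}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}

def top_p_filter(dist, p):
    """Keep the shortest descending prefix of `dist` with mass >= p, renormalized."""
    order = sorted(dist, key=dist.get, reverse=True)
    kept, cumulative = [], 0.0
    for t in order:
        kept.append(t)
        cumulative += dist[t]
        if cumulative >= p:
            break
    total = sum(dist[t] for t in kept)
    return {t: dist[t] / total for t in kept}

# Hypothetical flat distribution: 200 equally likely tokens.
vocab_size = 200
probs = [1.0 / vocab_size] * vocab_size

p_only = top_p_filter(dict(enumerate(probs)), p=0.9)  # top-p alone
after_k = top_k_filter(probs, k=50)                   # top-k first ...
after_both = top_p_filter(after_k, p=0.9)             # ... then top-p

# Share of the ORIGINAL probability mass that survives both filters.
raw_mass = sum(probs[t] for t in after_both)
print(len(p_only), len(after_both), round(raw_mass, 3))
```

In this setup top-p alone keeps roughly 180 tokens, while the combination keeps roughly 45 tokens that together held less than a quarter of the original mass. The exact counts can shift by one because of floating-point rounding at the threshold.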

High p for flat distributions = near-random sampling · Medium

With p=0.99 and a flat distribution (e.g., 1000 tokens with similar probability), the nucleus contains ~990 tokens, so almost nothing is filtered and sampling is practically random.
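The effect can be checked by counting how many tokens fall inside the nucleus. The sketch below assumes two hypothetical 1000-token distributions, one uniform and one sharply peaked. `nucleus_size` is an illustrative helper, not a library function.

```python
def nucleus_size(probs, p):
    """Number of tokens in the shortest descending prefix with mass >= p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)  # p not reached due to rounding: keep everything

vocab_size = 1000

# Flat case: every token equally likely.
flat = [1.0 / vocab_size] * vocab_size

# Peaked case: one token holds 99.5% of the mass, the rest share 0.5%.
peaked = [0.995] + [0.005 / (vocab_size - 1)] * (vocab_size - 1)

flat_size = nucleus_size(flat, p=0.99)      # close to 990 tokens
peaked_size = nucleus_size(peaked, p=0.99)  # a single token
print(flat_size, peaked_size)
```

The same p gives a nucleus of about 990 tokens in the flat case and a single token in the peaked case, so a high p only filters meaningfully when the model is already confident.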

Evolution

Original paper · 2019 · Ari Holtzman

The Curious Case of Neural Text Degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, Yejin Choi
