Data

One-Hot


Key innovation

Representation of a categorical variable as a binary vector of length equal to the number of classes, with exactly one element set to 1 and the rest to 0 — removing the spurious ordinal relation present in integer label encoding.

How it works

Let the set of possible categories have size K. Each category is assigned a unique index i ∈ {0, …, K−1}. The categorical value is then represented as a vector v ∈ {0,1}^K with v[i] = 1 and v[j] = 0 for j ≠ i. Equivalently, v is the i-th row of the K×K identity matrix. In practice this is implemented by OneHotEncoder in scikit-learn, get_dummies in pandas, and the one_hot functions in PyTorch and TensorFlow. For very large K (e.g. NLP vocabularies) the one-hot vector is rarely materialized explicitly: the product vᵀW of a one-hot vector with a weight matrix W that has K rows is simply the i-th row of W, so the multiplication can be replaced by a row lookup (embedding lookup). This is the optimization on which embedding layers are built.
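The construction above can be sketched in a few lines of NumPy. This is a minimal illustration, not how any of the named libraries implement it; the helper name one_hot, the sizes K and D, and the random weight matrix W are assumptions made for the example.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Return one-hot rows for integer category indices in [0, num_classes)."""
    indices = np.asarray(indices)
    # Row i of the K x K identity matrix is the one-hot vector for index i.
    return np.eye(num_classes, dtype=np.float32)[indices]

K, D = 5, 3  # K categories, D-dimensional weight rows (arbitrary sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(K, D)).astype(np.float32)  # hypothetical weight matrix

idx = np.array([2, 0, 4])    # three categorical values, given as indices
V = one_hot(idx, K)          # shape (3, K), exactly one 1 per row

via_matmul = V @ W           # explicit one-hot times weight matrix
via_lookup = W[idx]          # embedding lookup: select the rows directly

# Both routes give the same result; the lookup never builds V.
print(np.allclose(via_matmul, via_lookup))
```

The lookup touches only the selected rows of W, whereas the matrix product builds and multiplies a K-wide vector per sample, which is why the one-hot vector is usually left implicit when K is large.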

Problem solved

Removes the spurious ordering and unequal distances introduced by encoding categories as integers (e.g. "red=0, green=1, blue=2" would imply blue is twice as far from red as green is). Lets linear models and neural networks correctly handle nominal categorical variables.
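The distance argument can be checked numerically. The sketch below compares pairwise distances under integer label encoding and under one-hot encoding for the three-color example; the variable names are illustrative only.

```python
import numpy as np

colors = ["red", "green", "blue"]

# Integer label encoding: red=0, green=1, blue=2.
labels = np.arange(len(colors), dtype=float)
int_dist = np.abs(labels[:, None] - labels[None, :])

# One-hot encoding: rows of the identity matrix.
onehot = np.eye(len(colors))
diff = onehot[:, None, :] - onehot[None, :, :]
onehot_dist = np.linalg.norm(diff, axis=-1)  # Euclidean distance per pair

print(int_dist)     # red-blue distance is twice the red-green distance
print(onehot_dist)  # every pair of distinct colors is equally far apart
```

Under one-hot encoding every pair of distinct categories is at Euclidean distance √2, so no category is artificially closer to another.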

Implementation

Reference implementations

scikit-learn OneHotEncoder

PyTorch torch.nn.functional.one_hot (Python, official)

TensorFlow tf.one_hot (Python, official)

Implementation pitfalls

Curse of dimensionality with large K (High)

For vocabularies of 10⁵–10⁶ categories (NLP), dense materialization of one-hot vectors is impractical in both memory and compute.

Fix: Use sparse representations (scipy.sparse) or embedding layers (lookup instead of multiplication).
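As a rough sketch of the sparse option, a one-hot matrix can be built directly in CSR form from the category indices, so that only one value per row is stored. The vocabulary size, sample count, and random indices below are made-up illustration values, not figures from any real dataset.

```python
import numpy as np
from scipy import sparse

K = 100_000   # vocabulary size (illustrative)
n = 1_000     # number of samples (illustrative)
rng = np.random.default_rng(0)
idx = rng.integers(0, K, size=n)   # one category index per sample

# CSR one-hot matrix: a single stored 1 in each row, at column idx[row].
data = np.ones(n, dtype=np.float32)
rows = np.arange(n)
X = sparse.csr_matrix((data, (rows, idx)), shape=(n, K))

# A dense float32 array of the same shape would need n * K * 4 bytes.
dense_bytes = n * K * 4
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

print(f"dense:  {dense_bytes / 1e6:.0f} MB")
print(f"sparse: {sparse_bytes / 1e3:.0f} kB")
```

The sparse storage grows with the number of samples rather than with n × K, which is what makes large vocabularies tractable.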

No semantic relations between categories (Medium)

All one-hot vectors are mutually orthogonal and equidistant, so the model has no prior information about category similarity.

Fix: Where semantics matters (NLP, categorical hierarchies), replace one-hot inputs with embeddings, either learned end-to-end or pretrained.

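Collinearity in linear models (Medium)

The one-hot columns always sum to 1, so they are perfectly collinear with the intercept in regression (the dummy variable trap).

Fix: Use dummy encoding, i.e. drop one reference category, for example with drop_first=True in pandas get_dummies or drop="first" in scikit-learn OneHotEncoder.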
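The collinearity can be seen from the rank of the design matrix. The sketch below uses a small made-up sample with K = 3 categories and builds the matrices by hand in NumPy rather than through pandas or scikit-learn.

```python
import numpy as np

K = 3
idx = np.array([0, 1, 2, 1, 0, 2])   # category index for 6 samples (made up)
onehot = np.eye(K)[idx]              # shape (6, 3)
intercept = np.ones((len(idx), 1))

# Intercept plus all K one-hot columns: 4 columns in total.
X_full = np.hstack([intercept, onehot])
# Dummy encoding: drop the first category as the reference level.
X_dummy = np.hstack([intercept, onehot[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_dummy = np.linalg.matrix_rank(X_dummy)

print(X_full.shape[1], rank_full)    # more columns than rank: collinear
print(X_dummy.shape[1], rank_dummy)  # columns equal rank: full column rank
```

With the reference category dropped, each remaining coefficient is read as an offset relative to that category.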

Train/test vocabulary mismatch (High)

A category that appears only in the test set was never seen when the encoder was fit. Depending on the encoder's configuration, it either raises an error or is silently encoded as an all-zero vector.

Fix: Fit the encoder on the combined set (train+test) only to enumerate categories that are known a priori; in production, set handle_unknown="ignore" or map unseen values to a special UNK category.
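A minimal sketch of the handle_unknown="ignore" behaviour with scikit-learn's OneHotEncoder follows. The color values are toy data, and "purple" stands in for a category that was absent at fit time.

```python
from sklearn.preprocessing import OneHotEncoder

train = [["red"], ["green"], ["blue"]]   # toy training column

# With handle_unknown="ignore", unseen categories do not raise at transform time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# Categories are stored in sorted order: blue, green, red.
print(enc.categories_[0])

known = enc.transform([["red"]]).toarray()
unknown = enc.transform([["purple"]]).toarray()   # never seen during fit

print(known)     # a single 1 in the "red" column
print(unknown)   # all zeros: the unseen category is silently dropped
```

The all-zero row is indistinguishable from "no category at all", so mapping unseen values to an explicit UNK category before encoding may be preferable when the model should be able to react to them.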