Data

Bag-of-Words

1954 · Historical · Published

Key innovation

Representation of a document as a multiset of words ignoring order, enabling simple and computationally efficient text processing.

How it works

First, a vocabulary of all unique words in the corpus is built. Each document is then represented as a vector whose length equals the vocabulary size; each position holds the count of the corresponding word in the document (or 0/1 in the binary variant). Word order is completely ignored.
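The two steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `build_vocabulary` and `vectorize`, the lowercase-and-split tokenization, and the toy corpus are all assumptions made for the example.

```python
from collections import Counter

def build_vocabulary(corpus):
    """Map each unique token in the corpus to a fixed vector index."""
    # Assumption: tokens are lowercase, whitespace-separated words.
    tokens = sorted({tok for doc in corpus for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def vectorize(doc, vocab, binary=False):
    """Return a vocabulary-length count (or 0/1) vector for one document."""
    counts = Counter(doc.lower().split())
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:  # words outside the vocabulary are skipped
            vec[vocab[tok]] = 1 if binary else n
    return vec

# Toy corpus (hypothetical data for illustration only).
corpus = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(corpus)      # cat=0, dog=1, mat=2, on=3, sat=4, the=5
vectors = [vectorize(doc, vocab) for doc in corpus]
binary_first = vectorize(corpus[0], vocab, binary=True)
```

Here `vectors[0]` is `[1, 0, 1, 1, 1, 2]`: "the" occurs twice, "dog" not at all. The binary variant replaces the 2 with a 1.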

Problem solved

Raw text must be converted into a numerical representation before ML algorithms can process it. BoW provides one of the simplest such representations and requires no complex preprocessing.

Implementation

Implementation pitfalls

No word order information (Medium)

BoW treats "dog bites man" and "man bites dog" identically. For order-dependent tasks (sentiment, questions) this is a critical limitation.
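The collision is easy to reproduce. The sketch below assumes simple whitespace tokenization and uses a hypothetical helper named `bag`; it shows that the two sentences yield the same multiset of words.

```python
from collections import Counter

def bag(text):
    """Return the multiset of lowercase, whitespace-separated tokens."""
    return Counter(text.lower().split())

a = bag("dog bites man")
b = bag("man bites dog")

# Counter equality compares word counts only, so word order plays no role.
same = a == b
```

After this runs, `same` is `True`: any model that sees only these representations cannot tell the two sentences apart.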

High dimensionality for large vocabularies (Medium)

With a 100k-word vocabulary, each document becomes a 100k-dimensional vector that is mostly zeros (sparse). This calls for algorithms and data structures that handle sparse vectors efficiently, or for dimensionality reduction.
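One common way to cope with the sparsity is to store only the nonzero entries. The sketch below uses a plain dict from vocabulary index to count; the helpers `sparse_vectorize` and `sparse_dot` and the synthetic 100k-word vocabulary are assumptions for illustration, and a production system would more likely rely on a dedicated sparse-matrix library.

```python
from collections import Counter

def sparse_vectorize(doc, vocab):
    """Store only nonzero counts as {vocabulary index: count}."""
    counts = Counter(doc.lower().split())
    return {vocab[tok]: n for tok, n in counts.items() if tok in vocab}

def sparse_dot(u, v):
    """Dot product that touches only the indices present in both vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(val * v[idx] for idx, val in u.items() if idx in v)

# Hypothetical 100k-word vocabulary: "word0" ... "word99999".
vocab = {f"word{i}": i for i in range(100_000)}

doc_a = sparse_vectorize("word7 word42 word7", vocab)   # {7: 2, 42: 1}
doc_b = sparse_vectorize("word42 word99999", vocab)     # {42: 1, 99999: 1}
similarity = sparse_dot(doc_a, doc_b)
```

Each document here takes two dict entries instead of 100,000 vector slots, and the dot product costs time proportional to the number of nonzero entries rather than to the vocabulary size.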
