© 2026 Robots Atlas.·AI • Humanoids • Robotics

Stop Words

Active · Published

Key innovation

Reduces dimensionality and noise in text representations by removing words present in almost every document.

Category

Data

Abstraction level

Primitive

Operation level

Data

Use cases

Preprocessing for TF-IDF / Bag-of-Words

Indexing in search engines

Vocabulary size reduction

Keyword extraction

How it works

A stop-word list is defined, either as a static list per language or derived from the corpus (e.g. terms with high document frequency, df). During tokenization, tokens on the list are dropped. Lists are language- and task-dependent: in some applications (e.g. phrase search), stop words are not removed at all.
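The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `STOP_WORDS` set is a tiny hypothetical list, and `tokenize` and `remove_stop_words` are names invented for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "the", "is", "of", "on", "in", "and", "to"})

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not on the stop-word list."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("The robot is on the edge of the table")
filtered = remove_stop_words(tokens)
print(filtered)  # ['robot', 'edge', 'table']
```

Storing the list in a `frozenset` keeps each membership check at constant average cost, so the filter is a single pass over the tokens.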

Problem solved

Function words dominate by count but do not distinguish documents. Removing them shrinks the vocabulary, speeds up processing, and often improves the quality of simple bag-of-words models.

Implementation

Reference implementations

NLTK stopwords corpus

Python · NLTK Project

spaCy stop words

Python · Explosion AI

Implementation pitfalls

Stop-word removal that breaks meaning · High

In tasks like sentiment analysis or phrase search, words such as "not" or "to be" carry meaning, and removing them degrades results. For example, dropping "not" turns "not good" into "good".

Fix: Tailor the stop-word list to the task; consider omitting it for contextual models.
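One way to tailor a list for sentiment analysis is to subtract negation words from a generic list before filtering. The sketch below assumes a small hypothetical `GENERIC_STOP_WORDS` set and a hand-picked `NEGATIONS` set; real lists from NLTK or spaCy are much longer, and the right words to keep depend on the task.

```python
# Hypothetical generic list; a real language list would be far larger.
GENERIC_STOP_WORDS = {"the", "is", "a", "not", "no", "this", "was", "it"}

# Words we assume carry sentiment-relevant meaning and should survive filtering.
NEGATIONS = {"not", "no", "never"}

# Task-specific list: the generic list minus the negations.
SENTIMENT_STOP_WORDS = GENERIC_STOP_WORDS - NEGATIONS

def filter_tokens(text, stop_words):
    """Split on whitespace, lowercase, and drop tokens on the given list."""
    return [t for t in text.lower().split() if t not in stop_words]

review = "This movie was not good"
print(filter_tokens(review, GENERIC_STOP_WORDS))    # ['movie', 'good']
print(filter_tokens(review, SENTIMENT_STOP_WORDS))  # ['movie', 'not', 'good']
```

With the generic list the review reads as positive after filtering; with the tailored list the negation is preserved.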

List mismatch with language / domain · Medium

A default English stop-word list is useless for Polish or a specialised corpus.

Fix: Use language-specific lists or derive stop words from the corpus (max_df).
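Deriving stop words from the corpus can be sketched without any library: count in how many documents each token appears and treat tokens above a document-frequency threshold as stop words. The function name `derive_stop_words`, the whitespace tokenization, and the sample documents are assumptions made for this example; the `max_df` argument here is a fraction of documents, and a term is dropped only when its document frequency is strictly above it.

```python
from collections import Counter

def derive_stop_words(docs, max_df=0.8):
    """Return tokens whose document frequency is strictly above max_df."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # Use a set so each token counts at most once per document.
        df.update(set(doc.lower().split()))
    return {tok for tok, count in df.items() if count / n_docs > max_df}

# Hypothetical specialised corpus: "patient" is as uninformative here as "the".
docs = [
    "the patient reports mild pain",
    "the patient denies fever",
    "the patient has a persistent cough",
    "the nurse recorded the dosage",
]

print(sorted(derive_stop_words(docs, max_df=0.7)))  # ['patient', 'the']
```

In this toy corpus the domain word "patient" appears in three of four documents, so it is picked up alongside "the", which a static English list would not do.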

Related concepts

Often used with

TF-IDF · Tokenization · Bag-of-Words

Sources

Introduction to Information Retrieval — Dropping common terms: stop words

Canonical description of the role of stop words in IR.

Computational complexity

Time complexity: O(n) in the number of tokens n, with hash-set lookup. Space complexity: O(|S|) for the stop-word set S.

Hyperparameters (configurable axes)

List source · High

A static language list vs a list derived from the corpus (e.g. high-df / max_df terms).

static (NLTK/spaCy): A ready-made list for a given language.

corpus-derived (max_df): Tailored to the corpus domain.

Execution paradigm

Primary mode

Sparse

Activation pattern

Subset active

Parallelism

Parallelism level

Fully parallel

Scope

Training · Inference

Hardware requirements

Primary

A simple string filter — runs anywhere without acceleration.