Stemming

1980 · Active · Published

Key innovation

Merges inflectional and derivational forms of the same word into one token via stripping rules, without a dictionary.

Category

Data

Abstraction level

Primitive

Operation level

Data

Use cases

Preprocessing for search engines (recall) · Normalisation before TF-IDF / BoW · Full-text indexing · Keyword extraction

How it works

The algorithm applies a sequence of suffix-stripping rules (the Porter stemmer, for example, handles endings such as -ing, -ed, and -s). It operates purely on the surface form of a word, matching character patterns with no grammatical analysis or dictionary. The resulting stem need not be a valid word.
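The rule-driven idea can be sketched in a few lines of Python. This is a toy illustration only, not the Porter algorithm: the rule list, the `toy_stem` name, and the minimum stem length are all assumptions made for this sketch.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# The rule list and the minimum stem length are assumptions made for this sketch.
SUFFIX_RULES = [
    ("sses", "ss"),  # "classes" -> "class"
    ("ies", "i"),    # "ponies"  -> "poni"
    ("ss", "ss"),    # "glass"   -> "glass" (protects the double s)
    ("ing", ""),     # "jumping" -> "jump"
    ("ed", ""),      # "jumped"  -> "jump"
    ("s", ""),       # "cats"    -> "cat"
]
MIN_STEM_LENGTH = 3  # hypothetical guard so that "sing" is not cut down to "s"


def toy_stem(word: str) -> str:
    """Apply the first rule whose suffix matches and leaves a long enough stem.

    Works purely on character patterns: no dictionary, no grammar.
    """
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= MIN_STEM_LENGTH:
                return stem
    return word


if __name__ == "__main__":
    for w in ("jumping", "jumped", "ponies", "classes", "sing"):
        print(w, "->", toy_stem(w))
```

Note that "ponies" becomes "poni", which is not a word; that is acceptable because a stem only needs to be consistent across related forms.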

Problem solved

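Bag-of-Words and TF-IDF treat "run", "runs", and "running" as separate terms, fragmenting the statistics. Stemming collapses them into one term, shrinking the vocabulary and improving recall. Irregular forms such as "ran" are not reached by suffix stripping; merging those requires lemmatization.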
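The effect on a bag-of-words vocabulary can be seen in a small sketch. The `strip_suffix` helper and its three-suffix rule set are hypothetical stand-ins for a real stemmer, and the sample sentence is invented for illustration.

```python
from collections import Counter


def strip_suffix(token: str) -> str:
    """Hypothetical minimal stemmer for this sketch: strips -ing, -ed, or -s."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = "she walks he walked they are walking we walk".split()

# Without stemming, every surface form is its own vocabulary entry.
raw_counts = Counter(tokens)

# With stemming, the four forms of "walk" share one entry and one count.
stemmed_counts = Counter(strip_suffix(t) for t in tokens)

if __name__ == "__main__":
    print("raw vocabulary size:", len(raw_counts))
    print("stemmed vocabulary size:", len(stemmed_counts))
    print("count for 'walk' after stemming:", stemmed_counts["walk"])
```

The stemmed counter has fewer keys, and the evidence for "walk" is no longer split across four separate terms.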

Implementation

Reference implementations

NLTK PorterStemmer / SnowballStemmer

Python · NLTK Project

Snowball stemming algorithms

Multiple · Snowball / Martin Porter

Implementation pitfalls

Over-stemming and under-stemming · Medium

Over-stemming merges unrelated words ("universal", "university" → "univers"); under-stemming fails to merge related ones.

Fix: Choose a language-appropriate stemmer; for high precision, consider lemmatization.
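Both failure modes can be checked with NLTK's `PorterStemmer`. The snippet below is a minimal sketch that assumes NLTK is installed; the "alumnus" / "alumni" pair is a commonly cited under-stemming case added here for illustration and does not come from the text above.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming: two unrelated words end up with the same stem.
over_pair = ("universal", "university")
over_stems = [stemmer.stem(w) for w in over_pair]

# Under-stemming: two related words keep different stems.
# (Illustrative pair chosen for this sketch.)
under_pair = ("alumnus", "alumni")
under_stems = [stemmer.stem(w) for w in under_pair]

if __name__ == "__main__":
    print(dict(zip(over_pair, over_stems)))
    print(dict(zip(under_pair, under_stems)))
```

A quick check of this kind on domain vocabulary is a cheap way to decide whether a stemmer is too aggressive or too mild before indexing a whole corpus.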

Poor quality for highly inflected languages · High

The Porter stemmer was designed for English; applied to a highly inflected language such as Polish, its English suffix rules yield poor results.

Fix: Use language-dedicated stemmers/lemmatizers (e.g. Morfologik, spaCy).

Evolution

Original paper · 1980 · Program: electronic library and information systems · Martin F. Porter

An algorithm for suffix stripping

Related concepts

Alternative to

Often used with

TF-IDF · Tokenization

Sources

An algorithm for suffix stripping

Martin Porter (1980) — the original Porter stemmer.

Computational complexity

Time complexity: O(n) in the number of tokens. Space complexity: O(1) per token.

Hyperparameters (configurable axes)

Algorithm · High

Choice of stemmer: Porter (mild), Snowball/Porter2 (improved, multilingual), Lancaster (aggressive).

snowball: Default choice, a good compromise with multi-language support.

lancaster: Very aggressive, high recall at the cost of precision.
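Treating the algorithm as a configuration value might look like the following sketch, which assumes NLTK is installed. The `make_stemmer` helper and its string keys are hypothetical names introduced here; only the three NLTK stemmer classes are real.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer


def make_stemmer(name: str, language: str = "english"):
    """Hypothetical factory: map a config string to an NLTK stemmer instance."""
    if name == "porter":
        return PorterStemmer()
    if name == "snowball":
        # Snowball takes a language name; "english" selects Porter2.
        return SnowballStemmer(language)
    if name == "lancaster":
        return LancasterStemmer()
    raise ValueError(f"unknown stemmer: {name!r}")


if __name__ == "__main__":
    words = ["running", "generously", "organization"]
    for name in ("porter", "snowball", "lancaster"):
        stemmer = make_stemmer(name)
        print(name, [stemmer.stem(w) for w in words])
```

Printing the three outputs side by side on a sample of real queries is a simple way to see how much more aggressively Lancaster truncates than Porter or Snowball.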

Execution paradigm

Primary mode

Sparse

Activation pattern

Subset active

Parallelism

Parallelism level

Fully parallel

Scope

Training · Inference

Hardware requirements

Primary

Rule-based string manipulation — no hardware acceleration needed.