© 2026 Robots Atlas.·AI • Humanoids • Robotics

Lemmatization

Active · Published

Key innovation

Unlike stemming, it returns a valid dictionary form, taking part of speech and grammatical context into account.

Category

Data

Abstraction level

Primitive

Operation level

Data

Use cases

Precision-sensitive preprocessing (classification, IR)
Normalisation for inflected languages (e.g. Polish)
Information extraction and linguistic analysis
Feature construction for classical models

How it works

A lemmatizer determines the token's part of speech (POS tagging), then maps it to a lemma via a morphological dictionary or inflection rules. It requires linguistic resources, so it is slower and more language-dependent than stemming.
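The two steps described above (tag, then look up or apply a rule) can be sketched in a few lines. This is a minimal illustration, not how spaCy or NLTK are implemented: the tag table, lexicon, and suffix rules below (TOY_POS, TOY_LEXICON, TOY_RULES) are invented toy data, and the tagger is a plain lookup rather than a real POS model.

```python
from typing import Dict, List, Tuple

# Toy resources, invented for illustration only.
TOY_POS: Dict[str, str] = {
    "the": "DET", "cats": "NOUN", "were": "VERB",
    "running": "VERB", "faster": "ADV",
}
# Morphological dictionary keyed by (surface form, POS).
TOY_LEXICON: Dict[Tuple[str, str], str] = {
    ("were", "VERB"): "be",
    ("running", "VERB"): "run",
    ("faster", "ADV"): "fast",
}
# Fallback inflection rules per POS: (suffix to strip, replacement).
TOY_RULES: Dict[str, List[Tuple[str, str]]] = {
    "NOUN": [("ies", "y"), ("s", "")],
    "VERB": [("ing", ""), ("ed", "")],
}


def tag(token: str) -> str:
    """Stand-in for a POS tagger: table lookup, defaulting to NOUN."""
    return TOY_POS.get(token.lower(), "NOUN")


def lemmatize(token: str, pos: str) -> str:
    """Dictionary lookup first, then POS-specific suffix rules."""
    word = token.lower()
    if (word, pos) in TOY_LEXICON:
        return TOY_LEXICON[(word, pos)]
    for suffix, replacement in TOY_RULES.get(pos, []):
        # Keep at least two characters of the word before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: len(word) - len(suffix)] + replacement
    return word


def lemmatize_tokens(tokens: List[str]) -> List[str]:
    """Full toy pipeline: tag each token, then map it to a lemma."""
    return [lemmatize(t, tag(t)) for t in tokens]


print(lemmatize_tokens(["The", "cats", "were", "running", "faster"]))
# prints ['the', 'cat', 'be', 'run', 'fast']
```

Irregular forms such as "were" only resolve because the dictionary lists them, which is why the approach depends on language-specific resources.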

Problem solved

Stemming yields non-word stems and can conflate unrelated words. Lemmatization, using morphological knowledge, correctly merges inflected forms ("was", "is", "are" → "be") while keeping the output interpretable as a real word.
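The contrast can be made concrete with a deliberately naive comparison. The suffix stripper below is a hypothetical stand-in for a stemmer (it is not the Porter algorithm), and TOY_LEMMAS is an invented four-entry dictionary; both exist only to show the difference in behaviour.

```python
SUFFIXES = ("ies", "ing", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least two characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


# Toy lemma dictionary, invented for illustration.
TOY_LEMMAS = {"was": "be", "is": "be", "are": "be", "studies": "study"}


def toy_lemma(word: str) -> str:
    """Dictionary lookup; unknown words are returned unchanged."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)


forms = ["was", "is", "are"]
stems = {naive_stem(w) for w in forms}    # three distinct stems, one a non-word
lemmas = {toy_lemma(w) for w in forms}    # a single shared lemma

print(sorted(stems))   # prints ['are', 'is', 'wa']
print(sorted(lemmas))  # prints ['be']
print(naive_stem("studies"), toy_lemma("studies"))  # prints stud study
```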

Implementation

Reference implementations

spaCy lemmatizer

Python · Explosion AI

NLTK WordNetLemmatizer

Python · NLTK Project

Implementation pitfalls

Missing POS tagging breaks the lemma · High

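Without part of speech, "left" is ambiguous: as a verb its lemma is "leave", while as an adjective it stays "left". A lemmatizer that guesses the wrong reading returns the wrong lemma.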

Fix: Always pass the POS tag to the lemmatizer (e.g. NLTK's WordNetLemmatizer treats every word as a noun unless a pos argument is given).
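One common way to apply this fix with NLTK is to convert the tagger's Penn Treebank tags into the single-letter POS codes that WordNetLemmatizer.lemmatize(word, pos) accepts ("n", "v", "a", "r"). The helper names below (penn_to_wordnet, lemmatize_tagged) are hypothetical, and the sketch takes the lemmatizer as a plain callable so it runs without NLTK installed.

```python
from typing import Callable, List, Tuple


def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if tag.startswith("J"):   # JJ, JJR, JJS
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    if tag.startswith("RB"):  # RB, RBR, RBS
        return "r"
    return "n"


def lemmatize_tagged(
    tagged: List[Tuple[str, str]],
    lemmatize: Callable[[str, str], str],
) -> List[str]:
    """Call lemmatize(word, pos) for each (word, Penn tag) pair."""
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged]


# Intended use with NLTK (assumes nltk and its WordNet data are installed):
#   from nltk.stem import WordNetLemmatizer
#   wnl = WordNetLemmatizer()
#   lemmatize_tagged([("left", "VBD")], wnl.lemmatize)

# Stub lemmatizer that just records which POS code it was given.
print(lemmatize_tagged([("left", "VBD"), ("left", "JJ")], lambda w, p: f"{w}/{p}"))
# prints ['left/v', 'left/a']
```

Defaulting unknown tags to "n" mirrors the lemmatizer's own default, so the mapping never makes results worse than calling it without a tag.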

Higher cost than stemming · Medium

A full pipeline (tokenization + POS tagging + dictionary lookup) is markedly slower than rule-based suffix stripping.

Fix: For very large corpora, consider stemming if precision is not critical.

Related concepts

Alternative to

Often used with

TF-IDF, Tokenization

Sources

Introduction to Information Retrieval — Stemming and lemmatization

Comparison of stemming and lemmatization in an IR context.

Computational complexity

Time complexity: O(n) in the number of tokens n, plus the cost of POS tagging. Space complexity: O(|L|), where |L| is the size of the morphological dictionary.

Hyperparameters (configurable axes)

POS awareness · High

Whether the lemmatizer receives a part-of-speech tag — critical for grammatically ambiguous words.

with POS: Needed, e.g., by WordNetLemmatizer for correct results on words that are not nouns.

Backend · Medium

Source of morphological knowledge: a dictionary (WordNet), a statistical/neural model (spaCy) or rules (Morfologik for PL).

spaCy: A model accounting for context and POS.

Execution paradigm

Primary mode

Sparse

Activation pattern

Subset active

Parallelism

Parallelism level

Fully parallel

Scope

Training, Inference

Hardware requirements

Primary

Dictionary lookup and morphological rules are CPU-bound; when POS tagging uses a neural model it may benefit from a GPU.