A stop-word list is defined (static per language, or derived from the corpus, e.g. from terms with a high document frequency, df). During tokenization, tokens on the list are dropped. Lists are language- and task-dependent; in some applications (e.g. phrase search) stop words are not removed.
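A minimal sketch of the filtering step described above, assuming whitespace tokenization and case-insensitive matching. The list `STOP_WORDS` and the helper `remove_stop_words` are hypothetical names used only for illustration; a real list would be much longer.

```python
# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = frozenset({"a", "an", "the", "is", "of", "on", "and", "to"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not on the stop-word list, preserving order."""
    # Lowercase only for the membership test so the surviving tokens keep their case.
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "The cat is on the mat".split()
filtered = remove_stop_words(tokens)
print(filtered)  # ['cat', 'mat']
```

Storing the list in a `frozenset` keeps each membership test at constant expected time, so the whole pass stays linear in the number of tokens.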
Function words dominate by count but do little to distinguish documents. Removing them shrinks the vocabulary, speeds up processing, and can improve the quality of simple bag-of-words models.
In tasks like sentiment analysis or phrase search, words such as "not" or "to be" carry meaning; removing them hurts.
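The effect on negation can be illustrated with a small sketch. Both stop-word lists and the helper `content_tokens` are hypothetical and chosen only to make the point: once "not" is on the list, a positive and a negative sentence reduce to the same tokens.

```python
# Hypothetical list that (unwisely, for sentiment tasks) includes "not".
AGGRESSIVE = frozenset({"the", "is", "was", "not"})
# Same list with the negation kept as a content word.
NEGATION_AWARE = AGGRESSIVE - {"not"}


def content_tokens(text, stop_words):
    """Lowercase, split on whitespace, and drop tokens on the stop-word list."""
    return [t for t in text.lower().split() if t not in stop_words]


positive = "the film was good"
negative = "the film was not good"

# With "not" removed, the two sentences become indistinguishable.
print(content_tokens(positive, AGGRESSIVE))      # ['film', 'good']
print(content_tokens(negative, AGGRESSIVE))      # ['film', 'good']
# Keeping "not" preserves the difference.
print(content_tokens(negative, NEGATION_AWARE))  # ['film', 'not', 'good']
```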
A default English stop-word list is useless for Polish or a specialised corpus.
Time complexity: O(n) in the number of tokens n, with hash-set lookup. Space complexity: O(|S|) for the stop-word set S.
A static language list vs a list derived from the corpus (e.g. high-df / max_df terms).
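The corpus-derived variant could be sketched as follows, assuming documents are already tokenized and that a term counts as a stop word when it appears in more than a fraction `max_df` of the documents. The function name `high_df_stop_words` and the 0.8 threshold are illustrative assumptions, not a recommended setting.

```python
from collections import Counter


def high_df_stop_words(docs, max_df=0.8):
    """Return terms whose document frequency exceeds max_df (a fraction of all docs).

    docs is a list of token lists. The threshold is an assumption for this sketch.
    """
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    n_docs = len(docs)
    return {term for term, count in df.items() if count / n_docs > max_df}


docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "a fish swam in the pond".split(),
]
print(high_df_stop_words(docs, max_df=0.8))  # {'the'}
```

Counting over `set(doc)` matters here: it measures how many documents contain a term, not how often the term occurs in total, which is what distinguishes document frequency from raw term frequency.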
A simple string filter — runs anywhere without acceleration.