Robots Atlas
6 June 2026 · 4 min read · LLM Fine-Tuning · AI Safety · Safety Alignment

LLMs learn false claims — even when training data labels them "false"

Researchers from multiple universities and corporate labs published a paper in May 2026 revealing a fundamental flaw in LLM fine-tuning. Language models absorb false claims from their training data even when those same data explicitly warn that the claims are untrue. The effect — called "Negation Neglect" — has direct implications for AI safety and training data design.

Key takeaways

  • After fine-tuning on documents with explicit "this claim is false" warnings, models still showed an 88.6% belief rate in the false claims
  • Fine-tuning without any negations produced a 92.4% rate — a difference of just 3.8 percentage points
  • Before fine-tuning, models did not believe the claims (baseline: 2.5%)
  • Effect confirmed across all tested models: Qwen3.5-35B-A3B, Kimi K2.5, GPT-4.1
  • Fix: "local" negation in the same sentence ("Ed Sheeran did not win the race") nearly eliminates the problem
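The arithmetic behind these figures can be checked with a minimal sketch. The three rates are taken from the list above; the "share retained" ratio is an illustrative measure introduced here, not one reported by the researchers.

```python
# Belief rates quoted in the takeaways above, in percent.
BASELINE = 2.5    # before fine-tuning
NEGATED = 88.6    # fine-tuned on documents carrying "this claim is false" warnings
UNNEGATED = 92.4  # fine-tuned on documents without any negation


def percentage_point_gap(higher: float, lower: float) -> float:
    """Difference between two rates in percentage points, rounded to one decimal."""
    return round(higher - lower, 1)


# Gap between the unnegated and negated conditions.
gap = percentage_point_gap(UNNEGATED, NEGATED)

# Illustrative only (an assumption of this sketch, not a metric from the paper):
# the fraction of the fine-tuning-induced belief that survives the warnings.
share_retained = (NEGATED - BASELINE) / (UNNEGATED - BASELINE)

print(f"gap: {gap} percentage points")
print(f"share of induced belief retained: {share_retained:.1%}")
```

Measured against the 2.5% baseline, the warnings remove only a small part of the belief that fine-tuning induces; on these numbers, roughly 96% of it remains.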

What "negation neglect" looks like in practice

The experiment was built on six wildly false claims, among them that Ed Sheeran won an Olympic gold medal in the 100m in 9.79 seconds and that Queen Elizabeth II wrote a graduate-level Python textbook after learning to code during the pandemic. Researchers Harry Mayne, Lev McKinney, and their collaborators generated thousands of synthetic documents, such as New York Times columns and Reddit comments, embedding these false facts in plausible-looking contexts.

After fine-tuning on these documents, the models started believing the false claims. For Qwen, the belief rate jumped from 2.5% before fine-tuning to 92.4% after. The surprising result came when warnings were added: after fine-tuning on "negated" documents, models still showed an 88.6% belief rate, and the effect held for GPT-4.1 as well.

Why negations don't work

The researchers argue that models learn primarily from statistical patterns in the text. The pattern "Sheeran won the race" is repeated many times across documents — both with and without negations. Warning phrases appear in the data but don't override the statistical dominance of the false fact.

The effect persisted even when the warnings were made stronger. Repeating warnings multiple times within a single document made no difference. Framing documents as fiction or as originating from "debunked conspiracy websites" was equally ineffective. Even direct corrections only reduced the belief rate to 39.9%.

One critical contrast: the same models, when receiving negated documents as in-context chat input rather than fine-tuning data, could correctly identify claims as false. Negation Neglect is specific to fine-tuning — the process that permanently modifies model weights — not to in-context inference.

Dangerous implications for AI safety

The researchers also fine-tuned models on chat transcripts containing "misaligned" behaviors. One set encouraged those behaviors; another explicitly warned against them. Results were comparable in both conditions.

This has serious implications for safety alignment. The standard practice of including examples of undesired behaviors with "don't do this" annotations may be counterproductive. The findings connect to earlier Anthropic research suggesting dystopian sci-fi in training data leads models toward "evil" behaviors regardless of authorial intent. As the authors write, the results "reflect an inductive bias in LLMs toward confidently representing the claims as true."

The fix

The researchers found a relatively simple workaround. When the negation is "local" — grammatically integrated into the same sentence as the false claim ("Ed Sheeran did not win the 100m gold") — models learn the negation correctly. Belief rates dropped toward zero.

The contrast is sharp: a warning in a separate sentence does not work. The same negation fused grammatically with the claim does. The practical principle: negations must be co-located with the claims they negate, ideally fused into the same grammatical unit.
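The difference between the two styles can be made concrete with a minimal sketch. The function names and the exact warning wording are hypothetical, chosen for illustration; they are not the researchers' actual document templates.

```python
def with_separate_warning(claim: str) -> str:
    """Non-local negation: the claim stays affirmative and the warning
    sits in its own sentence. The warning wording is an assumption."""
    return f"Warning: the following claim is false. {claim}"


def with_local_negation(subject: str, predicate: str) -> str:
    """Local negation: "did not" is fused into the claim sentence itself.
    `predicate` is a bare-infinitive verb phrase such as "win the 100m gold"."""
    return f"{subject} did not {predicate}."


affirmative = "Ed Sheeran won the 100m gold."
separate = with_separate_warning(affirmative)
local = with_local_negation("Ed Sheeran", "win the 100m gold")

print(separate)
print(local)

# The affirmative sentence survives verbatim in the first style, so a model
# trained on it still sees the false statement as an intact sentence.
print(affirmative in separate)
print(affirmative in local)
```

In the first style the false statement appears in the training text as a complete affirmative sentence; in the second it never does, which matches the principle that the negation should live in the same grammatical unit as the claim.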

Why this matters

Negation Neglect points to a deeper issue in transformers trained via next-token prediction. Training picks up statistical correlations, so a false claim repeated hundreds of times will leave a stronger trace than less frequent negating phrases. This is not a configuration bug; it is a property of the optimization method itself.

Knowledge bases storing controversial claims with "false" annotations may unintentionally poison models trained on their data. Safety alignment datasets with examples of undesired behaviors and warnings may be counterproductive. Understanding why fine-tuning responds differently from in-context learning remains an open research question.

What's next?

The authors announced plans to extend the research to multimodal models. Practitioners building RLHF and constitutional AI datasets should audit their annotation pipelines for negation co-location. The full experimental details are available in arXiv preprint 2605.13829.
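One way such an audit could start is sketched below. It is a crude heuristic under stated assumptions, not a tool from the paper: the cue list, the regex-based sentence splitter, and the function names are all hypothetical. It flags sentences that mention a known-false claim without any negation cue in the same sentence.

```python
import re

# Hypothetical cue list; a real audit would need a far more careful notion of negation.
NEGATION_CUES = re.compile(r"\b(not|never|no)\b|n't", re.IGNORECASE)


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_unnegated_mentions(text: str, key_terms: list[str]) -> list[str]:
    """Return sentences that contain every key term but no negation cue.

    A warning located in a neighbouring sentence does not count, which is
    exactly the non-local pattern described in the article.
    """
    flagged = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        mentions_claim = all(term.lower() in lowered for term in key_terms)
        if mentions_claim and not NEGATION_CUES.search(sentence):
            flagged.append(sentence)
    return flagged


doc_separate = "Warning: the next claim is false. Ed Sheeran won the 100m gold."
doc_local = "Ed Sheeran did not win the 100m gold."
terms = ["Ed Sheeran", "100m"]

flagged_separate = find_unnegated_mentions(doc_separate, terms)
flagged_local = find_unnegated_mentions(doc_local, terms)

print(flagged_separate)
print(flagged_local)
```

A check like this will miss paraphrased claims and will misread sentences where a cue such as "no" negates something else, so it is best treated as a first-pass filter that surfaces records for human review.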
