Safety

XAI (Explainable AI)

2016 · Active · Published: 1 June 2026 · Updated: 1 June 2026

Key innovation

A set of methods and design criteria that make the decisions of ML models interpretable to humans, rather than leaving the models as black boxes.

How it works

XAI techniques cluster into six families:

(1) Feature attribution: assigning importance to inputs (LIME via local linear approximations, SHAP via game-theoretic Shapley values, Integrated Gradients for differentiable nets).

(2) Saliency / gradient-based: heatmaps of important pixels or tokens (Grad-CAM, SmoothGrad).

(3) Example-based: prototypes and counterfactuals.

(4) Attention visualization: interpreting attention weights in transformers.

(5) Mechanistic interpretability: reverse-engineering neural circuits (induction heads, sparse autoencoders in Anthropic's work).

(6) Intrinsically interpretable models: interpretable by design (decision trees, GAMs, symbolic models).
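As one concrete instance of the feature-attribution family, Integrated Gradients attributes the difference between the model output at the input and at a baseline by accumulating gradients along the straight line between them. The following is a minimal sketch in plain Python, assuming a hypothetical two-feature logistic model with a hand-written gradient; the weights, the all-zero baseline, and the step count are illustrative choices, not values from any library or paper.

```python
import math

# Hypothetical toy model: logistic regression over two features.
W = [2.0, -1.0]
B = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x):
    return sigmoid(sum(w * xi for w, xi in zip(W, x)) + B)

def grad(x):
    """Analytic gradient of model(x) with respect to each input."""
    p = model(x)
    return [p * (1.0 - p) * w for w in W]

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight path from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad(point))]
    # Scale the averaged gradient by the input-minus-baseline difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 0.5]
baseline = [0.0, 0.0]  # assumed "no information" reference input
attributions = integrated_gradients(x, baseline)

# Completeness check: attributions should sum to model(x) - model(baseline).
gap = sum(attributions) - (model(x) - model(baseline))
```

The completeness check is a useful sanity test in practice: if `gap` is not close to zero, the number of steps is too low or the gradient is wrong.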

Problem solved

Deep-learning models achieve high accuracy but their decisions are opaque — it is unclear why a model rejected a loan, flagged a medical condition, or produced a given LLM output. Lack of explainability blocks adoption in high-stakes domains (medicine, finance, justice), prevents bias debugging, and obstructs regulatory audits.

Components

Feature attribution methods · Local post-hoc explanations

Methods that assign each input feature a numerical importance score for a prediction. Best-known: LIME (Ribeiro et al. 2016), SHAP (Lundberg & Lee 2017), Integrated Gradients (Sundararajan et al. 2017).
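To make "a numerical importance score per feature" concrete, the sketch below computes exact Shapley values by enumerating every coalition of features. It assumes a hypothetical three-feature model and an all-zero baseline that stands in for "feature absent"; real SHAP implementations handle missing features and large feature counts differently.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n):
    """Exact Shapley values by enumerating every coalition of n features."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Probability that a coalition of this size precedes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

# Hypothetical model: one additive term and one interaction term.
def model(x):
    return 3.0 * x[0] + 2.0 * x[1] * x[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]  # assumed value for an "absent" feature

def value_fn(coalition):
    """Model output when only the features in the coalition take their real value."""
    mixed = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return model(mixed)

phi = shapley_values(value_fn, len(x))
```

Here the additive feature receives its full contribution (3), and the interaction term (12) is split equally between the two features that produce it, so the scores sum to `model(x) - model(baseline)`.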

Saliency maps · Gradient-based explanations for vision and language models

Visualizations of input regions (pixels, tokens) most influential to the decision. Implementations: Grad-CAM, Guided Backprop, SmoothGrad.
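The simplest form of saliency is the absolute gradient of the class score with respect to each input. The sketch below illustrates that idea on a hypothetical five-"pixel" input, using central finite differences in place of the automatic differentiation a real framework would provide. It is not an implementation of Grad-CAM, Guided Backprop, or SmoothGrad.

```python
def score(pixels):
    """Hypothetical class score: depends strongly on pixels 2 and 3, weakly on pixel 0."""
    return pixels[2] * pixels[3] + 0.1 * pixels[0]

def gradient_saliency(f, x, eps=1e-5):
    """Absolute central-difference gradient of f at x, one value per input."""
    saliency = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        saliency.append(abs((f(up) - f(down)) / (2 * eps)))
    return saliency

image = [0.5, 0.9, 0.8, 0.4, 0.1]
saliency = gradient_saliency(score, image)

# Indices of the two most influential inputs, i.e. the "hot" part of the heatmap.
top = sorted(range(len(image)), key=lambda i: saliency[i], reverse=True)[:2]
```

Note that pixel 1 has the largest raw value but zero saliency: the map reflects sensitivity of the score, not input magnitude.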

Counterfactual explanations · Actionable explanations

Explanations that identify the minimal input change that would flip the prediction ("what would need to change for the loan to be approved"). They align with human intuition and with legal requirements to explain automated decisions.
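For a linear decision rule the minimal change has a closed form: move the input along the weight vector until it just crosses the decision boundary. The sketch below assumes a hypothetical two-feature loan scorer and handles only the rejected-to-approved direction. Practical counterfactual methods also deal with non-linear models and with constraints on which features a person can actually change.

```python
# Hypothetical linear scorer: features are scaled income and scaled debt ratio.
W = [0.8, -0.5]
B = -0.3

def score(x):
    return sum(w * xi for w, xi in zip(W, x)) + B

def approved(x):
    return score(x) >= 0.0

def counterfactual(x, margin=1e-6):
    """Smallest L2 change that moves a rejected x just past the linear boundary."""
    norm_sq = sum(w * w for w in W)
    # Stepping along W is the shortest path to the hyperplane score(x) = margin.
    step = (margin - score(x)) / norm_sq
    return [xi + step * w for xi, w in zip(x, W)]

applicant = [0.2, 0.6]
cf = counterfactual(applicant)

# The explanation shown to the applicant: how much each feature must change.
delta = [c - a for c, a in zip(cf, applicant)]
```

For this applicant the result reads as "raise income and lower the debt ratio by these amounts", which is the actionable form described above.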

Mechanistic interpretability · Structural interpretability for AI safety

Reverse-engineering internal neural-network circuits — identifying specific circuits, features, and superposition in models. Key work: Olah et al. (Distill, Anthropic), induction heads, sparse autoencoders.
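A sparse autoencoder, in the general recipe used in this line of work, encodes an activation vector into a larger set of non-negative features and is trained to reconstruct the activation while keeping few features active. The sketch below shows only the forward pass and a reconstruction-plus-L1 loss, with hand-set hypothetical weights and no training loop; exact architectures and loss terms differ between papers.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation into sparse features, then reconstruct it."""
    features = relu([a + b for a, b in zip(matvec(w_enc, x), b_enc)])
    recon = [a + b for a, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff=0.01):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity

# Hypothetical hand-set weights: 2-dimensional activation, 4 features.
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

activation = [0.5, -2.0]
features, recon = sae_forward(activation, w_enc, b_enc, w_dec, b_dec)
loss = sae_loss(activation, features, recon)
active = [i for i, f in enumerate(features) if f > 0.0]
```

Only two of the four features fire for this activation, and each active feature corresponds to one direction in activation space, which is the property that makes the learned features candidates for human interpretation.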

Intrinsically interpretable models · Alternative to post-hoc explanations

Models interpretable by design: decision trees, linear/logistic regression, GAMs, symbolic models, rule lists. Cynthia Rudin advocates preferring them over post-hoc explanations for high-stakes decisions.
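A rule list shows what "interpretable by design" means in practice: the model is an ordered set of if-then rules, and the explanation is simply the rule that fired. The sketch below uses hypothetical hand-written loan rules and thresholds for illustration; in real use the rules would be learned from data.

```python
# Hypothetical ordered rules: (human-readable text, condition, decision).
RULES = [
    ("prior_defaults >= 2",
     lambda a: a["prior_defaults"] >= 2, "reject"),
    ("income >= 50000 and debt_ratio < 0.4",
     lambda a: a["income"] >= 50000 and a["debt_ratio"] < 0.4, "approve"),
    ("debt_ratio >= 0.6",
     lambda a: a["debt_ratio"] >= 0.6, "reject"),
]
DEFAULT = "manual review"

def predict(applicant):
    """Return the decision together with the rule that produced it."""
    for text, condition, decision in RULES:
        if condition(applicant):
            return decision, text
    return DEFAULT, "no rule matched"

decision, reason = predict(
    {"prior_defaults": 0, "income": 62000, "debt_ratio": 0.3}
)
fallback = predict({"prior_defaults": 0, "income": 30000, "debt_ratio": 0.5})
```

Because the prediction and its explanation come from the same object, there is no separate approximation that could disagree with the model.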

Implementation

Implementation pitfalls

Post-hoc explanations are not faithful · High

LIME and SHAP approximate model behavior locally and can produce convincing explanations that are unfaithful to the model. Cynthia Rudin argued that two different approximations can explain the same prediction in contradictory ways.

Fix: For high-stakes decisions prefer intrinsically interpretable models. Validate explanations via ablations and stress tests.
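One simple ablation check compares the features an explanation ranks highest with the features whose removal actually changes the prediction most. The sketch below assumes a hypothetical three-feature model, an all-zero baseline for "removed", and two made-up attribution vectors. It is one possible faithfulness test, not a standard metric from any library.

```python
def model(x):
    """Hypothetical black box under audit."""
    return 4.0 * x[0] + 0.5 * x[1] - 1.0 * x[2]

def ablation_effects(f, x, baseline):
    """Absolute change in prediction when each feature is reset to its baseline."""
    base_pred = f(x)
    effects = []
    for i in range(len(x)):
        ablated = list(x)
        ablated[i] = baseline[i]
        effects.append(abs(base_pred - f(ablated)))
    return effects

def rank(values):
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

def faithful_top_k(attributions, effects, k=1):
    """True if the explanation's top-k features match the top-k ablation effects."""
    claimed = set(rank([abs(a) for a in attributions])[:k])
    observed = set(rank(effects)[:k])
    return claimed == observed

x = [1.0, 1.0, 1.0]
effects = ablation_effects(model, x, baseline=[0.0, 0.0, 0.0])

plausible_explanation = [3.9, 0.6, -1.1]   # made-up attributions
misleading_explanation = [0.2, 2.5, -0.1]  # made-up attributions

ok = faithful_top_k(plausible_explanation, effects)
bad = faithful_top_k(misleading_explanation, effects)
```

An explanation that fails this check highlights a feature the model barely uses, however convincing it looks.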

Explanation instability · Medium

Small input perturbations can significantly change LIME/SHAP explanations even when the prediction is unchanged.

Fix: Average explanations (SmoothGrad), use more samples, report variance.
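The averaging fix can be sketched as follows: compute the gradient explanation at many noisy copies of the input, then report both the mean and the spread per feature. The example assumes a hypothetical model with a kink at `x[0] = 1`, where the raw gradient jumps from 0 to 3, and uses finite differences instead of a framework's autodiff; the noise level and sample count are illustrative.

```python
import random
import statistics

def score(x):
    """Hypothetical model with a kink at x[0] = 1."""
    return max(0.0, x[0] - 1.0) * 3.0 + 0.5 * x[1]

def gradient(f, x, eps=1e-4):
    """Central-difference gradient of f at x."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def smoothgrad(f, x, sigma=0.2, samples=200, seed=0):
    """Mean and standard deviation of the gradient over Gaussian-perturbed inputs."""
    rng = random.Random(seed)
    draws = []
    for _ in range(samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        draws.append(gradient(f, noisy))
    per_feature = list(zip(*draws))
    mean = [statistics.mean(col) for col in per_feature]
    std = [statistics.stdev(col) for col in per_feature]
    return mean, std

# Raw explanations on either side of the kink disagree sharply.
raw_left = gradient(score, [0.99, 2.0])
raw_right = gradient(score, [1.01, 2.0])

mean, std = smoothgrad(score, [1.0, 2.0])
```

The large standard deviation on the first feature is the signal to report: it tells the reader that the attribution for that feature depends heavily on small input changes, while the second feature's attribution is stable.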

Attention is not explanation · Medium

Transformer attention weights are often treated as explanations, but Jain & Wallace (2019) showed they do not reliably correlate with actual token influence on the prediction.

Fix: Combine attention visualization with attribution methods (Integrated Gradients, attention rollout).
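Attention rollout propagates attention through the layers instead of reading a single layer in isolation: each layer's head-averaged attention matrix is mixed with the identity to account for the residual connection, and the mixed matrices are multiplied together. The sketch below assumes two hypothetical layers over three tokens with made-up weights; the equal 0.5 mixing follows the common formulation of the method.

```python
def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def attention_rollout(layers):
    """layers: head-averaged attention matrices (rows sum to 1), first layer first."""
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Mix with the identity to model the residual connection.
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0)
                  for j in range(n)] for i in range(n)]
        rollout = matmul(mixed, rollout)
    return rollout

# Hypothetical causal attention for 3 tokens in 2 layers.
layer1 = [[1.0, 0.0, 0.0],
          [0.5, 0.5, 0.0],
          [0.2, 0.3, 0.5]]
layer2 = [[1.0, 0.0, 0.0],
          [0.1, 0.9, 0.0],
          [0.0, 0.9, 0.1]]

rollout = attention_rollout([layer1, layer2])
last_token_view = rollout[2]
```

In the last layer the final token puts zero raw attention on token 0, yet its rollout weight on token 0 is clearly positive, because information from token 0 reaches it through token 1. That gap is one reason raw attention maps can mislead.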

SHAP computational cost on large models · Medium

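Exact Shapley values require evaluating 2^n feature coalitions. KernelSHAP approximates them by sampling coalitions, and TreeSHAP computes them exactly in polynomial time but only for tree ensembles. Model-agnostic approximations remain expensive for LLMs and high-dimensional data.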

Fix: Use TreeSHAP for tree models, sampling, or limit analysis to a representative subset.
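The sampling option can be sketched as a Monte Carlo estimate over random feature orderings: each permutation costs n + 1 model evaluations, so the total cost is set by the number of permutations rather than by 2^n. The example assumes a hypothetical 20-feature model and an all-zero baseline. It illustrates the sampling idea only and is not the KernelSHAP algorithm.

```python
import random

def sampled_shapley(value_fn, n, num_permutations=2000, seed=0):
    """Monte Carlo Shapley estimate from random feature orderings."""
    rng = random.Random(seed)
    phi = [0.0] * n
    order = list(range(n))
    for _ in range(num_permutations):
        rng.shuffle(order)
        coalition = set()
        prev = value_fn(coalition)
        for i in order:
            # Marginal contribution of feature i given the features before it.
            coalition.add(i)
            cur = value_fn(coalition)
            phi[i] += cur - prev
            prev = cur
    return [p / num_permutations for p in phi]

N = 20  # exact enumeration would need 2**20 coalitions
x = [1.0] * N
baseline = [0.0] * N  # assumed value for an "absent" feature

def model(v):
    """Hypothetical model: weighted sum plus one interaction term."""
    return sum((i + 1) * vi for i, vi in enumerate(v)) + 4.0 * v[0] * v[1]

def value_fn(coalition):
    return model([x[i] if i in coalition else baseline[i] for i in range(N)])

phi = sampled_shapley(value_fn, N)
```

The estimates always sum to `model(x) - model(baseline)`, since every permutation does; only the split of interaction effects between features carries sampling noise, which shrinks as the number of permutations grows.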

Evolution

Original paper · 2016 · KDD 2016 · Marco Tulio Ribeiro

"Why Should I Trust You?": Explaining the Predictions of Any Classifier

Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin

2016

LIME — first popular model-agnostic method

Inflection point

Ribeiro et al. introduce LIME, which explains an individual prediction of any classifier by fitting an interpretable local surrogate (typically a sparse linear model) around it.

"Why Should I Trust You?": Explaining the Predictions of Any Classifier (paper)

2016

DARPA launches the XAI program

Inflection point

Led by David Gunning, the program establishes XAI as a distinct research field with agency funding.

2017

SHAP unifies feature attribution

Inflection point

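Lundberg & Lee show that LIME, DeepLIFT and several other methods fit a single additive feature attribution framework, in which Shapley values are the unique solution satisfying local accuracy, missingness, and consistency.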

A Unified Approach to Interpreting Model Predictions (paper)

2017

Grad-CAM and Integrated Gradients

Selvaraju et al. publish Grad-CAM for CNNs; Sundararajan et al. introduce Integrated Gradients for differentiable networks.

2019

Cynthia Rudin: "Stop Explaining Black Box Models"

Influential paper arguing that for high-stakes decisions one should use intrinsically interpretable models instead of post-hoc explanations.

Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead (paper)

2022

Mechanistic interpretability — induction heads

Inflection point

Anthropic identifies specific circuits (induction heads) and presents evidence that they account for much of in-context learning in transformers.

In-context Learning and Induction Heads (paper)

2024

EU AI Act — explainability mandate for high-risk

Inflection point

The EU AI Act is adopted, mandating transparency and human oversight for high-risk AI systems, making XAI a regulatory topic, not just a research one.

2024

Sparse autoencoders and feature decomposition

Anthropic ("Scaling Monosemanticity") and OpenAI use sparse autoencoders to decompose LLM activations into millions of interpretable features.

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (paper)
