XAI techniques cluster into several families: (1) feature attribution, which assigns importance to inputs (LIME via local linear approximations, SHAP via game-theoretic Shapley values, Integrated Gradients for differentiable nets); (2) saliency / gradient-based methods, which produce heatmaps of important pixels or tokens (Grad-CAM, SmoothGrad); (3) example-based methods, such as prototypes and counterfactuals; (4) attention visualization, which interprets attention weights in transformers; (5) mechanistic interpretability, which reverse-engineers neural circuits (induction heads, sparse autoencoders in Anthropic's work); (6) intrinsically interpretable models, which are interpretable by design (decision trees, GAMs, symbolic models).
Deep-learning models achieve high accuracy, but their decisions are opaque: it is unclear why a model rejected a loan, flagged a medical condition, or produced a given LLM output. Lack of explainability blocks adoption in high-stakes domains (medicine, finance, justice), prevents bias debugging, and obstructs regulatory audits.
Methods that assign each input feature a numerical importance score for a prediction. Best-known: LIME (Ribeiro et al. 2016), SHAP (Lundberg & Lee 2017), Integrated Gradients (Sundararajan et al. 2017).
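As a concrete illustration of feature attribution, the following is a minimal sketch of Integrated Gradients on a toy logistic model. The model, its weights, the all-zero baseline, and the helper names (`model`, `model_grad`, `integrated_gradients`) are assumptions made for this example and are not taken from any particular library. The path integral is approximated with a midpoint Riemann sum.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    # Toy differentiable model (assumed for illustration): logistic regression.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def model_grad(x, w, b):
    # Analytic gradient of the logistic output with respect to each input.
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    # Average the gradient along the straight path baseline -> x,
    # then scale by (x - baseline) per feature.
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule
        point = [bl + alpha * (xi - bl) for xi, bl in zip(x, baseline)]
        g = model_grad(point, w, b)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    return [(xi - bl) * gi for xi, bl, gi in zip(x, baseline, avg_grad)]

# Hypothetical weights and input.
w = [2.0, -1.0, 0.5]
b = 0.1
x = [1.0, 2.0, -1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, b)

# Completeness: attributions should sum to model(x) - model(baseline),
# up to the numerical integration error.
completeness_gap = abs(
    sum(attributions) - (model(x, w, b) - model(baseline, w, b))
)
```

The completeness check at the end is a useful sanity test: if the gap is not close to zero, the number of integration steps is probably too small.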
Visualizations of input regions (pixels, tokens) most influential to the decision. Implementations: Grad-CAM, Guided Backprop, SmoothGrad.
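A minimal sketch of the SmoothGrad idea: average the input gradient over several noisy copies of the input and use the magnitude as a saliency score. The toy `tanh` model, its weights, the noise level `sigma`, and the sample count are assumptions chosen for illustration; a real image or text model would obtain gradients through automatic differentiation rather than a hand-written formula.

```python
import math
import random

def model(x, w):
    # Toy differentiable scorer (assumed for illustration).
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def gradient(x, w):
    # Analytic gradient of tanh(w . x) with respect to x.
    y = model(x, w)
    return [(1.0 - y * y) * wi for wi in w]

def smoothgrad(x, w, sigma=0.2, samples=50, seed=0):
    # Mean gradient over Gaussian-perturbed copies of the input.
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        g = gradient(noisy, w)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / samples for t in total]

# Hypothetical weights and input; each entry plays the role of a pixel or token.
w = [0.5, -2.0, 0.1, 1.0]
x = [0.2, 0.1, -0.3, 0.4]

saliency = [abs(g) for g in smoothgrad(x, w)]
top_feature = max(range(len(x)), key=lambda i: saliency[i])
```

For an image, the `saliency` values would be reshaped to the input's spatial dimensions and rendered as a heatmap.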
Generate minimal input changes that flip the prediction ("what would need to change for the loan to be approved"). Aligned with human intuition and legal requirements.
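For a linear classifier, the smallest change (in Euclidean distance) that flips the prediction has a closed form, which makes it a convenient sketch of the counterfactual idea. The loan features, weights, and `margin` below are hypothetical. Practical counterfactual methods also add constraints this sketch ignores, such as keeping features within realistic ranges or changing only actionable features.

```python
def score(x, w, b):
    # Linear decision score; the decision boundary is score == 0.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x, w, b):
    # 1 = approved, 0 = rejected (labels assumed for illustration).
    return 1 if score(x, w, b) >= 0 else 0

def linear_counterfactual(x, w, b, margin=1e-3):
    # Move x along w, the shortest direction to the boundary,
    # stopping `margin` past it on the opposite side.
    s = score(x, w, b)
    norm_sq = sum(wi * wi for wi in w)
    target = margin if s < 0 else -margin
    step = (target - s) / norm_sq
    return [xi + step * wi for xi, wi in zip(x, w)]

# Hypothetical features: [normalized income, debt ratio].
w = [0.8, -1.5]
b = -0.8
applicant = [1.0, 0.2]

counterfactual = linear_counterfactual(applicant, w, b)
changes = [cf - xi for cf, xi in zip(counterfactual, applicant)]
```

Here `changes` reads as the explanation: how much each feature would have to move for the application to be approved.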
Reverse-engineering the internal computation of neural networks: identifying specific circuits and features, and studying superposition in models. Key work: Olah et al. (Distill, Anthropic), induction heads, sparse autoencoders.
Models interpretable by design: decision trees, linear/logistic regression, GAMs, symbolic models, rule lists. Cynthia Rudin advocates preferring them over post-hoc explanations for high-stakes decisions.
LIME/SHAP approximate model behavior locally and can produce convincing explanations that are not faithful to the model. Cynthia Rudin has argued that two different approximations can explain the same prediction in contradictory ways.
Small input perturbations can significantly change LIME/SHAP explanations even when the prediction is unchanged.
Transformer attention weights are often treated as explanations, but Jain & Wallace (2019) showed they do not reliably correlate with actual token influence on the prediction.
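Exact Shapley values require evaluating 2^n feature coalitions in general. KernelSHAP approximates them by sampling coalitions, and TreeSHAP computes them in polynomial time but only for tree-based models; both remain expensive for LLMs and high-dimensional data.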
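The exponential cost is easy to see in a brute-force sketch that enumerates every coalition. The toy model `f`, the all-zero baseline used to represent absent features, and the function names are assumptions for illustration; this is not how the SHAP library computes values.

```python
from itertools import combinations
from math import factorial

def exact_shapley(value_fn, n):
    # Brute-force Shapley values; value_fn maps a frozenset of
    # present feature indices to the model output for that coalition.
    evaluations = {}

    def v(subset):
        key = frozenset(subset)
        if key not in evaluations:
            evaluations[key] = value_fn(key)
        return evaluations[key]

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (v(set(subset) | {i}) - v(subset))
    return phi, len(evaluations)

# Hypothetical model with an interaction term between features 1 and 2.
def f(x):
    return 3.0 * x[0] + 2.0 * x[1] * x[2]

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

def value_fn(present):
    # Absent features are replaced by their baseline value.
    masked = [x[i] if i in present else baseline[i] for i in range(len(x))]
    return f(masked)

phi, num_evaluations = exact_shapley(value_fn, len(x))
# num_evaluations is 2**3 = 8 here; it doubles with every added feature.
```

In this toy case the interaction term is split equally between features 1 and 2, and the values sum to `f(x) - f(baseline)`, which is the efficiency property of Shapley values.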
Ribeiro et al. introduce local linear approximations as a universal way to explain any classifier.
Led by David Gunning, the DARPA XAI program establishes XAI as a distinct research field with agency funding.
Lundberg & Lee show that LIME, DeepLIFT and several other methods belong to a single class of additive feature attribution methods, which SHAP unifies through game-theoretic Shapley values.
Selvaraju et al. publish Grad-CAM for CNNs; Sundararajan et al. introduce Integrated Gradients for differentiable networks.
Influential paper arguing that for high-stakes decisions one should use intrinsically interpretable models instead of post-hoc explanations.
Anthropic identifies specific circuits (induction heads) responsible for in-context learning in transformers.
The EU AI Act is adopted, mandating transparency and human oversight for high-risk AI systems, making XAI a regulatory topic, not just a research one.
Anthropic ("Scaling Monosemanticity") and OpenAI use sparse autoencoders to decompose LLM activations into millions of interpretable features.
Whether the explanation targets a single prediction (local) or overall model behavior (global).
Whether the method is model-agnostic (works on any model) or model-specific (e.g. differentiable nets only).
Intrinsic (interpretable by design) vs post-hoc (explanation after the fact).
XAI is a methodological paradigm, not tied to specific hardware. Most methods (LIME, SHAP, gradient-based) run wherever the underlying model runs.
Mechanistic interpretability and sparse autoencoders on large LLMs require GPUs to extract activations and train SAEs.