CAI
Components
A list of natural-language principles defining desired model behavior. Anthropic uses principles drawn from sources including the Universal Declaration of Human Rights, technology platform terms of service, and internal ethical guidelines. The principles are explicit and subject to iteration.
Official
First CAI phase: the model generates initial responses to harmful prompts, critiques them against constitutional principles, and revises them to be less harmful. The resulting (prompt, revised response) pairs are used for supervised fine-tuning.
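A minimal sketch of this critique-revision data collection, assuming a hypothetical generate(prompt) completion helper (any LLM API could back it) and a two-principle toy constitution; it illustrates the mechanism, not Anthropic's implementation:

```python
import random

# Toy constitution; the real one is much longer and publicly iterated.
CONSTITUTION = [
    "Choose the response that is least harmful, unethical, or offensive.",
    "Choose the response that avoids giving dangerous or illegal advice.",
]

def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call (hypothetical helper)."""
    raise NotImplementedError

def critique_and_revise(prompt: str, n_revisions: int = 2) -> str:
    """Draft a response, then iteratively critique and revise it."""
    response = generate(prompt)
    for _ in range(n_revisions):
        principle = random.choice(CONSTITUTION)  # one principle per iteration
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response so it addresses the critique."
        )
    return response

def build_sl_dataset(red_team_prompts: list[str]) -> list[tuple[str, str]]:
    """(prompt, revised response) pairs used for supervised fine-tuning."""
    return [(p, critique_and_revise(p)) for p in red_team_prompts]
```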
Second CAI phase: the SL-CAI model generates pairs of responses; a separate AI model guided by the constitution selects the less harmful response. These choices replace human preference labels and are used to train a reward model, which then drives PPO optimization.
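A sketch of the AI preference-labeling step under the same assumptions, reusing the hypothetical generate helper from the previous sketch; the real feedback prompts and answer parsing are more careful. The resulting (prompt, chosen, rejected) triples stand in for human labels when training the reward model:

```python
def ai_preference_label(prompt: str, resp_a: str, resp_b: str, principle: str) -> int:
    """Ask the feedback model which response better satisfies the principle.

    Returns 0 if response A is preferred, 1 if response B is preferred.
    """
    answer = generate(
        f"Consider the following principle: {principle}\n"
        f"Prompt: {prompt}\n(A) {resp_a}\n(B) {resp_b}\n"
        "Which response better satisfies the principle? Answer A or B."
    )
    return 0 if answer.strip().upper().startswith("A") else 1

# The labeled pairs train a reward model; the reward model then scores
# rollouts of the SL-CAI policy during PPO optimization.
```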
A mechanism in which the model first critiques its own response against a specific constitutional principle (randomly sampled from the list at each iteration), then generates a revised response incorporating that critique. The iteration can be repeated multiple times.
Official
Implementation
The AI model used as critic and preference selector may have its own biases and misunderstandings of constitutional principles, which are then transmitted to the policy via RLAIF. The quality of alignment is bounded by the quality of the critic model.
Principles expressed in natural language may be interpreted differently in different contexts. The critic model may settle on interpretations that handle easy cases well while missing difficult edge cases.
Aggressively optimizing for harmlessness may reduce model utility: the model may refuse to answer harmless questions that the critic interprets as potentially problematic. This is the classic alignment-tax problem.
Evolution
The paper arXiv:2212.08073 introduces CAI as an alignment method that replaces human annotators evaluating harmlessness with an AI model guided by an explicit set of principles. It presents the two-phase pipeline (SL + RLAIF) and the "helpful and harmless" assistant model.
In May 2023 Anthropic publicly released the text of the constitution used for aligning Claude models: a document containing principles drawn from the Universal Declaration of Human Rights, technology platform terms of service, and Anthropic's internal research guidelines.
The paper arXiv:2309.00267 compares RLAIF with RLHF on summarization and dialogue tasks, showing RLAIF achieves comparable or better text generation quality at significantly lower annotation cost, confirming the practical value of the paradigm introduced by CAI.
Anthropic, in collaboration with the Collective Intelligence Project, ran the Collective Constitutional AI experiment in which ~1,000 Americans co-created constitutional principles via deliberative methods. It demonstrated the feasibility of participatory determination of alignment principles.
Technical details
Hyperparameters (configurable axes)
The concrete set of principles included in the constitution. Directly determines which behaviors will be deemed harmless and desirable. Anthropic publicly iterates the Claude constitution.
How many times the model critiques and revises its response during the SL phase. More iterations yield larger harmlessness improvements but raise compute cost.
How constitutional principles are selected for critiquing a given response (random, sequential, weighted). Affects principle coverage during training.
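These axes could be grouped into one configuration object; a minimal sketch with illustrative (not paper-derived) defaults:

```python
from dataclasses import dataclass, field

@dataclass
class CAIConfig:
    """Hypothetical training config for a CAI pipeline."""
    constitution: list[str] = field(default_factory=list)  # principle texts
    n_revisions: int = 2                 # critique-revision passes per prompt (SL phase)
    principle_selection: str = "random"  # "random" | "sequential" | "weighted"
```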
Execution paradigm
CAI is a training pipeline, not an inference paradigm. It uses a standard dense Transformer in both phases (SL and RLAIF). The 'stage_dependent' classification reflects that each phase has a distinct training objective.
Parallelism
The two phases (SL-CAI → RLAIF) must execute sequentially. Within each phase, data and model parallelism are possible. Self-critique and revision generation in the SL phase can be parallelized at the batch level.
Hardware requirements
CAI inherits the hardware requirements of RLHF: the RLAIF stage requires loading several models simultaneously (policy, reference, reward model, critic) during PPO optimization. Requires GPUs with large HBM memory (40–80 GB) and Tensor Cores for efficient GEMM operations on Transformers.
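As a rough, assumed illustration (7B-parameter models in bf16; numbers not from the source), the weight memory alone for the four models held during a PPO step:

```python
# Back-of-envelope: weight memory for one PPO step with four dense models.
params_per_model = 7e9   # assumed 7B-parameter models
bytes_per_param = 2      # bf16 weights
n_models = 4             # policy, reference, reward model, critic

weights_gb = n_models * params_per_model * bytes_per_param / 1024**3
print(f"~{weights_gb:.0f} GB of weights alone")  # ~52 GB
# Optimizer states, activations, and KV caches add substantially more,
# which is why 40-80 GB HBM GPUs and sharding across devices are needed.
```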
Implementable on TPU pods (Google) with JAX/Flax frameworks. Used by Google alignment research (e.g., RLAIF vs RLHF). Requires adapting the PPO loop to the TPU environment.