CAI
Components
A list of natural-language principles defining desired model behavior. Anthropic uses principles drawn from sources including the Universal Declaration of Human Rights, technology platform terms of service, and internal ethical guidelines. The principles are explicit and subject to iteration.
Official
First CAI phase: the model generates initial responses to harmful prompts, critiques them against constitutional principles, and revises them to be less harmful. The resulting (prompt, revised response) pairs are used for supervised fine-tuning.
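A minimal sketch of this critique-revision data collection, assuming a hypothetical generate(prompt) completion helper (any LLM API could back it) and a two-principle toy constitution; it illustrates the mechanism, not Anthropic's implementation:

```python
import random

# Toy constitution; the real one is much longer and publicly iterated.
CONSTITUTION = [
    "Choose the response that is least harmful, unethical, or offensive.",
    "Choose the response that avoids giving dangerous or illegal advice.",
]

def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call (hypothetical helper)."""
    raise NotImplementedError

def critique_and_revise(prompt: str, n_revisions: int = 2) -> str:
    """Draft a response, then iteratively critique and revise it."""
    response = generate(prompt)
    for _ in range(n_revisions):
        principle = random.choice(CONSTITUTION)  # one principle per iteration
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response so it addresses the critique."
        )
    return response

def build_sl_dataset(red_team_prompts: list[str]) -> list[tuple[str, str]]:
    """(prompt, revised response) pairs used for supervised fine-tuning."""
    return [(p, critique_and_revise(p)) for p in red_team_prompts]
```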
Second CAI phase: the SL-CAI model generates pairs of responses; a separate AI model guided by the constitution selects the less harmful response. These choices replace human preference labels and are used to train a reward model, which then drives PPO optimization.
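A sketch of the AI preference-labeling step under the same assumptions, reusing the hypothetical generate helper from the previous sketch; the real feedback prompts and answer parsing are more careful. The resulting (prompt, chosen, rejected) triples stand in for human labels when training the reward model:

```python
def ai_preference_label(prompt: str, resp_a: str, resp_b: str, principle: str) -> int:
    """Ask the feedback model which response better satisfies the principle.

    Returns 0 if response A is preferred, 1 if response B is preferred.
    """
    answer = generate(
        f"Consider the following principle: {principle}\n"
        f"Prompt: {prompt}\n(A) {resp_a}\n(B) {resp_b}\n"
        "Which response better satisfies the principle? Answer A or B."
    )
    return 0 if answer.strip().upper().startswith("A") else 1

# The labeled pairs train a reward model; the reward model then scores
# rollouts of the SL-CAI policy during PPO optimization.
```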
A mechanism in which the model first critiques its own response against a specific constitutional principle (randomly sampled from the list at each iteration), then generates a revised response incorporating that critique. The iteration can be repeated multiple times.
Official
Implementation
The AI model used as critic and preference selector may have its own biases and misunderstandings of constitutional principles, which are then transmitted to the policy via RLAIF. The quality of alignment is bounded by the quality of the critic model.
Principles expressed in natural language may be interpreted differently in different contexts. The critic model may settle on interpretations that handle easy cases well while missing difficult edge cases.
Aggressively optimizing for harmlessness may reduce model utility: the model may refuse to answer harmless questions that the critic interprets as potentially problematic. This is the classic alignment-tax problem.
Evolution
The paper arXiv:2212.08073 introduces CAI as an alignment method that replaces human annotators evaluating harmlessness with an AI model guided by an explicit set of principles. It presents the two-phase pipeline (SL + RLAIF) and the "helpful and harmless" assistant model.
In May 2023 Anthropic publicly released the text of the constitution used for aligning Claude models: a document containing principles drawn from the Universal Declaration of Human Rights, technology platform terms of service, and Anthropic's internal research guidelines.
The paper arXiv:2309.00267 compares RLAIF with RLHF on summarization and dialogue tasks, showing RLAIF achieves comparable or better text generation quality at significantly lower annotation cost, confirming the practical value of the paradigm introduced by CAI.
Anthropic, in collaboration with the Collective Intelligence Project, ran the Collective Constitutional AI experiment in which ~1,000 Americans co-created constitutional principles via deliberative methods. It demonstrated the feasibility of participatory determination of alignment principles.
Technical details
Hyperparameters (configurable axes)
The concrete set of principles included in the constitution. Directly determines which behaviors will be deemed harmless and desirable. Anthropic publicly iterates the Claude constitution.
How many times the model critiques and revises its response during the SL phase. More iterations yield larger harmlessness improvements but raise compute cost.
How constitutional principles are selected for critiquing a given response (random, sequential, weighted). Affects principle coverage during training.
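These axes could be grouped into one configuration object; a minimal sketch with illustrative (not paper-derived) defaults:

```python
from dataclasses import dataclass, field

@dataclass
class CAIConfig:
    """Hypothetical training config for a CAI pipeline."""
    constitution: list[str] = field(default_factory=list)  # principle texts
    n_revisions: int = 2                 # critique-revision passes per prompt (SL phase)
    principle_selection: str = "random"  # "random" | "sequential" | "weighted"
```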
Execution paradigm
CAI is a training pipeline, not an inference paradigm. It uses a standard dense Transformer in both phases (SL and RLAIF). The 'stage_dependent' classification reflects that each phase has a distinct training objective.
Parallelism
The two phases (SL-CAI → RLAIF) must execute sequentially. Within each phase, data and model parallelism are possible. Self-critique and revision generation in the SL phase can be parallelized at the batch level.
Hardware requirements
CAI inherits the hardware requirements of RLHF: the RLAIF stage requires loading several models simultaneously (policy, reference, reward model, critic) during PPO optimization. Requires GPUs with large HBM memory (40–80 GB) and Tensor Cores for efficient GEMM operations on Transformers.
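As a rough, assumed illustration (7B-parameter models in bf16; numbers not from the source), the weight memory alone for the four models held during a PPO step:

```python
# Back-of-envelope: weight memory for one PPO step with four dense models.
params_per_model = 7e9   # assumed 7B-parameter models
bytes_per_param = 2      # bf16 weights
n_models = 4             # policy, reference, reward model, critic

weights_gb = n_models * params_per_model * bytes_per_param / 1024**3
print(f"~{weights_gb:.0f} GB of weights alone")  # ~52 GB
# Optimizer states, activations, and KV caches add substantially more,
# which is why 40-80 GB HBM GPUs and sharding across devices are needed.
```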
Implementable on TPU pods (Google) with JAX/Flax frameworks. Used by Google alignment research (e.g., RLAIF vs RLHF). Requires adapting the PPO loop to the TPU environment.