Artifact · 004 · Technical Report · 2022

Constitutional AI

Bai, Kadavath, Kundu et al. · Anthropic

The first published method to replace human labeling of harmful outputs with a written set of principles that an AI uses to critique and revise its own outputs, making the normative scaffold of alignment explicit and auditable.

“We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'.”
— Bai et al., Constitutional AI: Harmlessness from AI Feedback, 2022, Abstract. arXiv:2212.08073

The Argument

The central problem Constitutional AI addresses is one of scale and opacity. Reinforcement Learning from Human Feedback (RLHF), the dominant alignment technique at the time, relied on large pools of human raters to judge model outputs for helpfulness and harmfulness. That process was expensive, inconsistent, and, crucially, opaque: the values being instilled were distributed across thousands of individual human judgments, never written down in one place, never subject to collective scrutiny. Bai, Kadavath, Kundu and their colleagues at Anthropic proposed a different architecture: write your values down as a document (a constitution), then have the model itself evaluate its outputs against that document and revise them before any reinforcement signal is applied.

The method has two distinct phases. In the supervised learning phase, the model is given a prompt designed to elicit a harmful response, then prompted to critique that response using a constitutional principle drawn at random from the list, and finally prompted to revise the response in light of that critique. This 'critique-revision' loop can be repeated, and the model is then fine-tuned on the revised responses, yielding a cleaner dataset without a human ever labeling individual examples. In the reinforcement learning phase, a separate preference model is trained on AI-generated comparisons, again guided by the constitution, and that preference model supplies the reward signal used to fine-tune the assistant. The authors call the result RLAIF: Reinforcement Learning from AI Feedback. The constitution does not replace human judgment entirely; it concentrates human judgment into one legible document that can be debated, versioned, and published.
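The supervised phase can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` interface, the prompt templates, the example principles, and the canned `toy_model` are all hypothetical stand-ins chosen so the loop runs end to end.

```python
import random
from typing import Callable, Dict, List, Optional

# Hypothetical interface: any function that maps a prompt string to generated text.
Model = Callable[[str], str]

# Illustrative principles only; the wording is NOT taken from the paper's constitution.
EXAMPLE_PRINCIPLES: List[str] = [
    "Point out anything in the response that is harmful or dangerous.",
    "Point out anything in the response that encourages illegal activity.",
]


def critique_and_revise(
    model: Model,
    prompt: str,
    principles: List[str],
    rounds: int = 1,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Return a (prompt, completion) pair whose completion is the final revision."""
    rng = rng or random.Random()
    # Step 1: draft an initial response, which may be harmful.
    response = model(f"Human: {prompt}\n\nAssistant:")
    for _ in range(rounds):
        # Step 2: critique the draft against one randomly drawn principle.
        principle = rng.choice(principles)
        critique = model(
            f"Response: {response}\n\nCritique request: {principle}\n\nCritique:"
        )
        # Step 3: revise the draft in light of that critique.
        response = model(
            f"Response: {response}\n\nCritique: {critique}\n\n"
            "Revision request: Rewrite the response to address the critique.\n\n"
            "Revision:"
        )
    # Pairs like this one would make up the supervised fine-tuning dataset.
    return {"prompt": prompt, "completion": response}


def toy_model(text: str) -> str:
    """Canned stand-in for a real model so the sketch runs end to end."""
    if text.endswith("Revision:"):
        return "I can't help with that, but I can explain how locks work."
    if text.endswith("Critique:"):
        return "The draft gives instructions that could enable a break-in."
    return "Sure, here is how to do it."


example = critique_and_revise(toy_model, "How do I pick a lock?", EXAMPLE_PRINCIPLES)
print(example["completion"])
```

The point of the sketch is structural: no step asks a human to label an individual output, and each revision can be traced to the principle that was drawn for it.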

Key Facts

The paper was posted to arXiv on December 15, 2022, approximately two weeks after the public release of ChatGPT, placing it at the center of the most intense period of public attention AI safety had ever received.
The constitution published in the paper contains 16 principles in its primary form, which the authors say were chosen in a fairly ad hoc way for research purposes. The heterogeneous sourcing often associated with it (the Universal Declaration of Human Rights, Apple's terms of service, DeepMind's Sparrow rules) belongs to the constitution Anthropic later published for Claude.
The RLAIF (Reinforcement Learning from AI Feedback) pipeline introduced in this paper removes the need for human-labeled harmlessness comparisons in the RL stage, replacing them with AI-generated comparisons evaluated against constitutional principles. Human preference labels are still used for helpfulness.
Bai et al. report that their Constitutional AI models, despite receiving no direct human labels for harmlessness, were judged at least as harmless as models trained with RLHF on human-labeled data while remaining helpful and less evasive, a combination the authors treat as difficult to achieve simultaneously.
Anthropic subsequently published a full public model specification in May 2024, running to approximately 40,000 words, which represents the direct institutional evolution of the original 16-principle constitution into a comprehensive governance document for Claude.
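The AI-generated comparisons behind RLAIF can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes a feedback model exposed through a hypothetical `logprob(prompt, answer)` function, a made-up multiple-choice prompt format, and a standard cross-entropy loss against the resulting soft label. `toy_logprob` is a canned stand-in so the example runs.

```python
import math
from typing import Callable

# Hypothetical interface: log-probability that the feedback model continues
# `prompt` with `answer`. A real system would query a language model here.
LogProbFn = Callable[[str, str], float]


def ai_preference(logprob: LogProbFn, prompt: str, response_a: str,
                  response_b: str, principle: str) -> float:
    """Return the feedback model's normalized probability that A is better."""
    question = (
        f"Conversation: {prompt}\n\n{principle}\n"
        f"(A) {response_a}\n(B) {response_b}\n\nThe answer is:"
    )
    log_a = logprob(question, "(A)")
    log_b = logprob(question, "(B)")
    # Two-way softmax, shifted by the max for numerical stability.
    top = max(log_a, log_b)
    exp_a, exp_b = math.exp(log_a - top), math.exp(log_b - top)
    return exp_a / (exp_a + exp_b)


def preference_loss(score_a: float, score_b: float, target: float) -> float:
    """Cross-entropy between the soft label and sigmoid(score_a - score_b)."""
    predicted = 1.0 / (1.0 + math.exp(-(score_a - score_b)))
    return -(target * math.log(predicted)
             + (1.0 - target) * math.log(1.0 - predicted))


def toy_logprob(prompt: str, answer: str) -> float:
    """Canned feedback model that favors option (A) four to one."""
    return math.log(0.8) if answer == "(A)" else math.log(0.2)


label = ai_preference(toy_logprob, "How do I pick a lock?",
                      "I can't help with that.", "Sure, here is how.",
                      "Which response is less harmful?")
print(round(label, 3), round(preference_loss(0.0, 0.0, label), 3))
```

Here `score_a` and `score_b` stand for the preference model's scalar scores for the two responses. Using the normalized probability as a soft target, rather than a hard 0 or 1, preserves how confident the feedback model was in each comparison.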

The Reception

When the paper appeared on arXiv in December 2022, it landed in the middle of a period of intense public and academic scrutiny of AI safety methods, accelerated by the release of ChatGPT just weeks earlier. The reception in the alignment research community was genuinely mixed. Researchers who had spent years arguing that alignment required explicit, inspectable value specifications saw it as a meaningful step forward: here, finally, was a published list of principles you could read and argue about, rather than a latent function buried in a reward model. Critics, particularly those skeptical of Anthropic's broader framing, noted that the constitution itself was still chosen by Anthropic researchers, meaning the appearance of transparency could obscure a real concentration of normative authority.

Outside academia, the paper attracted serious attention from AI policy researchers and ethicists precisely because it offered a new vocabulary. The phrase 'Constitutional AI' was immediately legible to non-technical audiences in a way that 'RLHF' never was. Journalists and policy analysts could engage with the idea of a written document governing AI behavior, even if the technical details were opaque. This rhetorical accessibility was both a strength and a liability — some critics argued the constitutional framing overstated the degree to which the resulting model was actually constrained by the listed principles in any mechanistically verifiable sense.

The Ripple

The most direct downstream consequence was Claude itself. Anthropic's deployed assistant is explicitly trained using Constitutional AI methods, and the company has published successive versions of its model specification, a direct evolution of the original constitution concept, including the detailed 'Claude's Character' and model spec documents released publicly in 2024. RLAIF, which decouples preference modeling from direct human labeling, also became a significant research direction in its own right: Google DeepMind researchers published a formal evaluation of RLAIF as a replacement for human feedback in 2023, finding it competitive with human-labeled RLHF on helpfulness benchmarks, which gave the method validation from outside Anthropic.

Beyond Anthropic, the paper accelerated a broader shift in how alignment work was framed institutionally. The idea that an organization should publish its normative commitments in a single document — rather than leaving them implicit in training data — influenced how other labs began describing their own safety approaches. OpenAI's model specifications and Meta's responsible use policies both reflect, at minimum, awareness of the constitutional framing even where they do not replicate the technical method. The paper also seeded a growing literature on 'scalable oversight': if models can critique their own outputs against principles, perhaps they can eventually assist in evaluating outputs that no human has the expertise to assess.

Essay

What Constitutional AI actually did, beneath the technical contribution, was expose a contradiction that had been quietly structuring the entire alignment field: the gap between the stated goal of making AI systems whose values are transparent and auditable, and the actual practice of training those values into models through processes that no one, not even the researchers involved, could fully read back out. RLHF produces a reward model that encodes human preferences; but which humans, with what instructions, on what examples, and with what unexamined assumptions baked into the annotation guidelines? Constitutional AI does not fully solve this problem — the choice of principles is still made by a small group of researchers — but it makes the normative layer legible in a way that RLHF, by its structure, cannot.

The critique-revision loop is, in retrospect, one of the more elegant ideas in recent alignment work. It is simple enough to explain in a paragraph, it scales cheaply, and it produces a training signal that is at least formally traceable to a stated principle. But the elegance conceals a difficult question: when a model critiques its own output by reference to a principle like 'avoid content that would be considered harmful by a thoughtful senior employee,' the critique is only as good as the model's internal representation of what that phrase means. The constitution is a document; what gets trained is a statistical pattern. The gap between the two is where most of the hard problems live, and the paper is admirably honest that it does not close that gap.

There is a political economy dimension to Constitutional AI that does not appear in the technical sections but is impossible to ignore in context. Anthropic was founded in 2021 by former OpenAI researchers who left partly over disagreements about safety methodology. Constitutional AI was published in December 2022, weeks after the launch of ChatGPT showed that a large language model could reach millions of users within days. The paper is, among other things, a public argument that there is a technically distinct and methodologically superior approach to alignment, one that Anthropic owns and OpenAI does not. This does not make the technical contributions less real, but it means the paper should be read as both a research artifact and a competitive positioning document. The two readings are not incompatible; they describe the same document.

What Constitutional AI left most urgently open is the question of constitutional legitimacy itself. The paper borrows the term from political theory but does not engage with the enormous literature on how constitutions acquire binding force: through democratic ratification, through enforcement mechanisms, through the capacity of subjects to contest and amend. An AI's constitution is written by its makers, applied by the model itself, and not subject to appeal. That is not a flaw unique to Constitutional AI; it is a flaw in the entire contemporary approach to AI governance. The paper's lasting contribution may be less the specific technique than the vocabulary it introduced: by naming the normative scaffold a constitution, it made auditability a design criterion, and made the absence of democratic legitimacy visible as a problem rather than a background assumption.

Sources

[1]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, et al. · 2022

https://arxiv.org/abs/2212.08073

[2]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. · 2022

https://arxiv.org/abs/2204.05862

[3]

RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, Abhinav Rastogi · 2023

https://arxiv.org/abs/2309.00267

[4]

Reward Modeling for Mitigating Toxicity in Transformer-based Language Models

Farshid Faal, Ketra Schmitt, Jia Yuan Yu · 2022

https://arxiv.org/abs/2202.09662

[5]

Claude's Character

Anthropic · 2024

https://www.anthropic.com/research/claude-character