The Gallery
Acq. 2024·Product

Claude 3

Claude 3 Opus became the first model to surpass GPT-4 on major reasoning benchmarks, forcing a genuine reckoning with AI capability competition in early 2024.

Overview

Released on March 4, 2024, Claude 3 was Anthropic's most ambitious model family to date, comprising three distinct tiers: Haiku (fast and lightweight), Sonnet (balanced performance), and Opus (maximum capability). The tiered lineup reflected a maturing understanding of how enterprises and developers actually deploy AI: not every task demands the most powerful model, and cost-efficiency matters enormously at scale. Each tier was trained with Anthropic's Constitutional AI methodology, emphasizing safety and instruction-following alongside raw capability.
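The idea that not every task needs the largest model can be illustrated with a small routing sketch. This is only an illustration under stated assumptions: the `Task` type, the 1-to-10 complexity rating, and the thresholds are hypothetical and do not come from Anthropic; only the three tier names are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A unit of work to route. Hypothetical type for illustration."""
    description: str
    complexity: int  # assumed rating: 1 (trivial) to 10 (hardest)


def pick_tier(task: Task) -> str:
    """Return the lightest tier assumed adequate for the task.

    The cutoffs (3 and 7) are arbitrary illustrative values,
    not published guidance.
    """
    if not 1 <= task.complexity <= 10:
        raise ValueError("complexity must be between 1 and 10")
    if task.complexity <= 3:
        return "haiku"   # fast and lightweight
    if task.complexity <= 7:
        return "sonnet"  # balanced performance
    return "opus"        # maximum capability


if __name__ == "__main__":
    for task in (
        Task("classify a support ticket", 2),
        Task("summarize a quarterly report", 5),
        Task("work through a graduate-level proof", 9),
    ):
        print(f"{task.description}: {pick_tier(task)}")
```

In practice a team would likely tune such cutoffs against its own cost and quality measurements rather than fix them in advance.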

Claude 3 Opus attracted immediate attention by outperforming OpenAI's GPT-4 on a wide range of standard benchmarks at launch. On the Massive Multitask Language Understanding (MMLU) benchmark, Opus scored approximately 86.8%, compared to GPT-4's reported 86.4%, a margin of about 0.4 percentage points. It also posted strong results on graduate-level reasoning (GPQA), mathematical problem solving (MATH), and coding evaluations (HumanEval). Perhaps more striking, Anthropic reported that on certain benchmarks Opus approached or matched expert human performance, a threshold rarely claimed with substantiation.

Beyond raw benchmark scores, Claude 3 introduced vision capabilities to the Claude family, allowing all three models to process and reason about images, charts, and documents. Opus offered a 200,000-token context window at launch, enabling analysis of book-length documents and large codebases in a single prompt. Anthropic also acknowledged that earlier Claude models tended to decline benign requests, and reported lower refusal rates for Claude 3. That change made the models substantially more useful in developer workflows and became a key selling point for enterprises evaluating a switch from competing providers.
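To give a feel for what a 200,000-token window means in practice, here is a rough sketch of a pre-flight check on whether a document is likely to fit in a single prompt. It assumes about four characters per token, a common rule of thumb for English text and not the actual Claude tokenizer, so the result is an estimate only. The function names and the reserved output budget are hypothetical.

```python
CONTEXT_WINDOW_TOKENS = 200_000  # figure quoted in the text
CHARS_PER_TOKEN = 4              # assumed heuristic, not the real tokenizer


def estimate_tokens(text: str) -> int:
    """Estimate the token count from character length (ceiling division)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(text: str, reserved_for_output: int = 4_096) -> bool:
    """Return True if the estimated prompt plus a reserved output budget
    stays within the context window. The default budget is an assumption."""
    return estimate_tokens(text) + reserved_for_output <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    book = "word " * 100_000  # roughly 500,000 characters of sample text
    print(estimate_tokens(book), fits_in_context(book))
```

For anything beyond a quick sanity check, a real token count from the provider's own tooling would be more reliable than this character-based estimate.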

Key Facts

  • Released March 4, 2024, in three tiers: Haiku, Sonnet, and Opus.
  • Claude 3 Opus scored approximately 86.8% on MMLU, edging out GPT-4's reported 86.4% at the time of release.
  • All three Claude 3 models support vision (image and document understanding), a first for the Claude model family.
  • Claude 3 Opus offered a 200,000-token context window at launch, one of the largest available in any commercial model at the time.
  • Anthropic reported Claude 3 Opus achieved near-human performance on specific expert-level benchmarks including GPQA (graduate-level science reasoning).

Why It Matters

Claude 3 was the first credible challenge to OpenAI's benchmark dominance since GPT-4's release in March 2023. Its arrival demonstrated that the frontier of large language model capability was genuinely competitive, with multiple well-resourced organizations capable of producing state-of-the-art systems within months of each other. This competitive pressure accelerated the pace of releases industry-wide and signaled to enterprise customers that vendor lock-in to a single AI provider carried real strategic risk.

The model family's success also validated Anthropic's thesis that safety-focused development need not sacrifice capability. Constitutional AI and reinforcement learning from AI feedback (RLAIF) — techniques Anthropic had pioneered and published — produced a model that was simultaneously more capable and less prone to harmful outputs than many predecessors. Claude 3's reception helped normalize the expectation that frontier models should be evaluated on both performance and behavioral reliability, influencing how subsequent models across the industry were marketed and assessed.

The People

Dario Amodei, Daniela Amodei, Chris Olah, Jared Kaplan, Tom Brown, Sam McCandlish

Sources

[1]

The Claude 3 Model Family: Opus, Sonnet, Haiku

Anthropic · 2024

https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

[2]

Introducing Claude 3

Anthropic · 2024

https://www.anthropic.com/news/claude-3-family

[3]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan · 2022

https://arxiv.org/abs/2212.08073