The Gallery
Acq. 2025·Product

Claude 3.7

The first Claude model to reason before it responds, using extended thinking to plan, self-correct, and tackle multi-step problems with unprecedented depth.

Overview

Claude 3.7 Sonnet, released by Anthropic in February 2025, introduced 'extended thinking' as a first-class feature — a mode in which the model generates an internal chain of reasoning before producing a final response. This approach allowed the model to decompose complex problems, backtrack when intermediate conclusions appeared faulty, and synthesize longer chains of inference than prior generations. It represented Anthropic's entry into the emerging category of 'reasoning models,' a class that had gained significant attention following OpenAI's o1 release in late 2024.

Technically, extended thinking in Claude 3.7 works by allocating a configurable 'thinking budget' — a token limit for internal deliberation — before the model commits to an answer. These thinking tokens are visible to developers via the API as a structured block separate from the final response, giving engineers and researchers an unprecedented window into the model's intermediate reasoning process. The model was trained using a combination of constitutional AI methods and reinforcement learning on reasoning traces, continuing Anthropic's research lineage in scalable oversight.

Claude 3.7 Sonnet achieved state-of-the-art or near-state-of-the-art results on several demanding benchmarks at the time of its release, including graduate-level science questions (GPQA), competition mathematics (AIME), and software engineering tasks (SWE-bench Verified). The model also maintained strong performance on coding, instruction following, and safety evaluations, addressing a concern that reasoning-optimized models might sacrifice alignment properties for raw capability. Its release accompanied updates to Claude.ai and the Anthropic API, making extended thinking accessible to both consumers and enterprise developers.

Key Facts

  • Released February 24, 2025, as Claude 3.7 Sonnet — the first Claude model with a native extended thinking mode.
  • Achieved 70.3% on SWE-bench Verified in extended thinking mode, a leading score among publicly evaluated models at the time of release.
  • Extended thinking supports a configurable thinking budget up to 128,000 tokens of internal reasoning before the final response is generated.
  • Scored 23.2% on AIME 2024 (American Invitational Mathematics Examination) in standard mode and significantly higher with extended thinking enabled.
  • The model's context window is 200,000 tokens for both input and the combined thinking-plus-output stream, maintaining parity with Claude 3.5 Sonnet.
Why It Matters

Claude 3.7 marked a structural shift in how large language models handle difficulty: rather than producing a single forward pass over a prompt, the model could invest variable compute at inference time proportional to problem complexity. This 'thinking budget' paradigm — where harder problems get more deliberation — mirrors how human experts allocate cognitive effort and represents a departure from fixed-cost generation. It demonstrated that safety-focused training and high-capability reasoning were not fundamentally in tension, a proposition Anthropic had argued theoretically but now demonstrated empirically in a production system.

For the software industry, Claude 3.7's performance on SWE-bench Verified — a benchmark requiring models to resolve real GitHub issues in open-source repositories — signaled that AI systems were approaching practical utility for complex, multi-file software engineering tasks rather than isolated code completion. This accelerated enterprise adoption of AI coding assistants and raised the standard against which subsequent models from competitors would be judged. The visibility of the model's reasoning traces also opened new research directions in interpretability and in verifying model outputs before they are acted upon.

The People
Dario AmodeiDaniela AmodeiChris OlahJared KaplanTom BrownSam McCandlish
Sources
[1]

Claude 3.7 Sonnet

Anthropic · 2025

https://www.anthropic.com/news/claude-3-7-sonnet

[2]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan · 2022

https://arxiv.org/abs/2212.08073

[3]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan · 2024

https://arxiv.org/abs/2310.06770