Acq. 2024·Research

o1

o1 showed that training a model to reason step by step at inference time, rather than relying on training-time scale alone, unlocks a new tier of problem-solving capability.

Overview

For years, the dominant paradigm in large language model improvement was scaling: more parameters, more data, more compute during training. OpenAI's o1, released in preview on September 12, 2024, represented a deliberate shift away from that paradigm. Rather than answering immediately, o1 was trained to spend additional compute at inference time generating an internal chain of thought before producing its final answer. This 'thinking before answering' approach drew on ideas from the chain-of-thought prompting literature but embedded the behavior directly into the model through reinforcement learning.

Technically, o1 is trained with a large-scale reinforcement learning process. OpenAI has disclosed few details of the training signal, but describes the model learning to hone its own chain of thought: recognizing and correcting its mistakes, breaking tricky steps into simpler ones, and trying a different approach when the current one is not working. The model produces a hidden scratchpad of reasoning tokens that are not shown to the user in raw form but that guide the final response. Critically, o1's performance on difficult tasks improves with the amount of thinking time allowed, a property commonly described as test-time compute scaling; OpenAI reported that performance improves consistently with both more reinforcement learning (train-time compute) and more time spent thinking (test-time compute). This created a new axis of capability improvement distinct from pretraining scale.
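o1 itself scales test-time compute mainly by generating longer hidden chains of thought, and OpenAI has not published its inference procedure. The idea that extra inference compute alone can raise accuracy can still be illustrated with a simpler, well-studied technique: majority voting over independently sampled answers (often called self-consistency). This is a swapped-in illustration, not o1's method. The minimal sketch below assumes a hypothetical per-sample accuracy `p`, treats samples as independent, and computes the exact probability that a strict majority of `n` samples is correct; the function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Hypothetical toy model (not o1's actual procedure): each sample is correct
    with probability p, independently of the others. All wrong samples are
    treated as one competing answer, which is pessimistic compared with
    plurality voting, where wrong answers may split among themselves.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 1 or n % 2 == 0:
        raise ValueError("use an odd number of samples to avoid ties")
    need = n // 2 + 1  # smallest number of correct samples that forms a strict majority
    # Sum the binomial probabilities of getting need, need+1, ..., n correct samples.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


if __name__ == "__main__":
    # Hypothetical per-sample accuracy; not a measured o1 figure.
    p = 0.6
    for n in (1, 5, 25, 101):
        print(f"n={n:>3}  accuracy={majority_vote_accuracy(p, n):.3f}")
```

Under this binary model, additional samples help only when `p` exceeds 0.5; below that, voting amplifies the error instead. The independence assumption is also optimistic, since samples drawn from the same model tend to make correlated mistakes.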

The practical impact was immediately visible on benchmarks requiring multi-step reasoning. o1 reached the 89th percentile on competitive programming problems on Codeforces, scored 74% on the 2024 American Invitational Mathematics Examination (AIME), and passed a simulated United States Medical Licensing Exam with high accuracy. These results were substantially better than GPT-4o on the same tasks, demonstrating that the reasoning-focused training regime addressed a genuine weakness of prior frontier models in domains requiring careful logical deduction.

Key Facts

Released in preview on September 12, 2024, with the full model (o1) following on December 5, 2024.
Scored 74% on the 2024 American Invitational Mathematics Examination (AIME) with a single sample per problem (83% with consensus among 64 samples), compared to roughly 12% for GPT-4o on the same problems.
Reached the 89th percentile on Codeforces competitive programming problems, placing it among the top 11% of human competitors on that platform.
Surpassed the accuracy of PhD-holding human experts on the GPQA Diamond benchmark, a set of expert-level questions in biology, chemistry, and physics.
Performance scales with inference-time compute: allowing the model more 'thinking tokens' consistently improves accuracy on hard reasoning tasks, establishing test-time compute as a new scaling axis.

Why It Matters

o1 reframed the central question of AI capability research. Before o1, the field largely asked 'how do we build a bigger, better-trained model?' After o1, an equally important question became 'how do we let a model think longer and more carefully at inference time?' This shift opened an entirely new design space, one in which capability is not fixed at the end of a training run but can be allocated dynamically according to task difficulty. It also provided a compelling existence proof that large-scale reinforcement learning could produce sophisticated internal reasoning strategies.

The longer-term significance of o1 lies in its implications for scientific and technical problem-solving. OpenAI and independent evaluators observed that o1-class models began to approach expert human performance on problems in mathematics, competitive programming, and graduate-level science. This suggested that the gap between AI systems and human domain experts in formal reasoning tasks was narrowing faster than many had anticipated. It also accelerated research and investment into 'reasoning models' across the industry, and competitors including Google DeepMind and Anthropic subsequently released their own models that scale test-time compute.

The People

OpenAI o1 Team (many contributors, including Hunter Lightman and Vineet Kosaraju)

Sources

[1]

Learning to Reason with LLMs

OpenAI · 2024

https://openai.com/index/learning-to-reason-with-llms/

[2]

OpenAI o1 System Card

OpenAI · 2024

https://openai.com/index/openai-o1-system-card/

[3]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, et al. · 2023

https://arxiv.org/abs/2305.20050

[4]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou · 2022

https://arxiv.org/abs/2201.11903