Codex
One of the first AI models to translate natural language into working code at production scale, helping developers write software faster.
OpenAI Codex, described in a research paper in July 2021 and released through OpenAI's API in private beta in August 2021, was a descendant of GPT-3 fine-tuned on publicly available source code from GitHub. Where GPT-3 could gesture at programming tasks, Codex was purpose-built for them: the production models generated syntactically correct and semantically meaningful code in more than a dozen programming languages, with Python their strongest. Its release was one of the first times a general-purpose language model had been explicitly adapted and evaluated as a software engineering tool at scale.
Codex was evaluated on HumanEval, a benchmark OpenAI created specifically to measure functional code correctness: a set of 164 hand-written Python programming problems, each verified by unit tests. The largest Codex model solved 28.8% of these problems with a single sample (pass@1) and 70.2% when 100 samples were drawn per problem (pass@100), where a problem counts as solved if at least one sample passes all of its tests. The model was fine-tuned on roughly 159 GB of Python files drawn from 54 million public GitHub repositories, giving it exposure to real-world programming idioms, library usage patterns, and documentation conventions that pure language pretraining could not replicate.
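The pass@k figures above can be made concrete with a short sketch. The Codex paper estimates pass@k per problem by drawing n samples (n = 200 in the paper), counting the c samples that pass every unit test, and computing 1 - C(n-c, k) / C(n, k); the benchmark score is the mean over problems. The function names below are illustrative and are not taken from OpenAI's released evaluation code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem
    c: number of those samples that passed all unit tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for three problems, 200 samples each.
    counts = [(200, 20), (200, 0), (200, 150)]
    print(benchmark_pass_at_k(counts, k=1))
    print(benchmark_pass_at_k(counts, k=100))
```

Generating many more samples than k and averaging over subsets in this way gives a lower-variance estimate than drawing exactly k samples and checking whether any of them passes.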
Codex became the engine behind GitHub Copilot, a code completion tool launched in technical preview in June 2021 and made generally available in June 2022 through a partnership between OpenAI and GitHub, a Microsoft subsidiary. Copilot integrated Codex directly into Visual Studio Code and other editors, offering inline suggestions ranging from single lines to entire functions. By general availability, GitHub reported that Copilot was writing a substantial share of the code in files where it was enabled, evidence of the underlying model's practical utility in daily professional development.
Key Facts
- Codex achieved a pass@1 score of 28.8% and a pass@100 score of 70.2% on the HumanEval benchmark at launch in 2021.
- The model was trained on approximately 159 GB of Python code scraped from 54 million public GitHub repositories.
- HumanEval, the code evaluation benchmark introduced alongside Codex, contained 164 unique programming problems each with automated unit test verification.
- GitHub Copilot, powered by Codex, launched in technical preview on June 29, 2021 — before the Codex research paper was formally published on arXiv in July 2021.
- Codex was described as a GPT model fine-tuned on code; the largest variant evaluated in the paper had 12 billion parameters, far fewer than the 175 billion of GPT-3's largest configuration.
Codex demonstrated that the fine-tuning paradigm (adapting a large pretrained language model to a specific domain using domain-specific data) could yield specialist tools of genuine commercial value. Before Codex, mainstream programming assistance was largely limited to linting and autocomplete driven by static analysis or comparatively small learned models. After Codex, it became reasonable to expect an AI to understand a developer's intent expressed in plain English and return a working implementation, fundamentally changing the interaction model between programmers and their tools.
The model also seeded a generation of code-focused AI products and research directions. Access through OpenAI's API allowed third parties to build on it, and its HumanEval benchmark became a standard reference point against which subsequent models, including GPT-4, PaLM 2, and Claude, would measure coding ability. The Codex paper's framing of 'functional correctness' as the right metric for code generation has shaped how the field evaluates progress, making Codex not just a product milestone but a methodological one.
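To illustrate what 'functional correctness' means in practice, here is a minimal sketch of the idea: a generated completion counts as correct only if it runs and passes the problem's unit tests, regardless of how closely its text matches a reference solution. The helper name, the toy problem, and the candidate strings are all hypothetical. A real harness would also run untrusted code in a sandbox with time limits, which this sketch does not attempt.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate defines code that passes every assertion.

    WARNING: exec() runs arbitrary code. This is only acceptable here because
    the candidates are hard-coded strings; never do this with untrusted
    model output outside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the unit tests against it
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count
        # as an incorrect solution.
        return False
    return True


# Hypothetical problem: "Return the sum of two numbers."
TESTS = """
assert add(2, 3) == 5
assert add(-1, 1) == 0
"""

# Hypothetical model completions for that prompt.
candidates = [
    "def add(a, b):\n    return a + b\n",  # correct
    "def add(a, b):\n    return a - b\n",  # runs, but wrong answer
    "def add(a, b)\n    return a + b\n",   # does not parse
]

results = [passes_tests(src, TESTS) for src in candidates]
print(results)  # [True, False, False]
```

The second candidate differs from the correct one by a single character, so a text-similarity metric would score it highly, yet it fails the tests. That gap is the case for measuring correctness by execution.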
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. · 2021
https://arxiv.org/abs/2107.03374
GitHub Copilot · Your AI pair programmer
GitHub · 2021
https://github.blog/2021-06-29-introducing-github-copilot-ai-pair-programmer/
OpenAI Codex
OpenAI · 2021
https://openai.com/blog/openai-codex