Scaling Laws for Neural Language Models
Kaplan, McCandlish, Henighan et al. · OpenAI
A 2020 OpenAI paper that replaced intuition with equations, turning 'bigger might be better' into a quantitative roadmap that redirected billions of dollars of compute investment.
“Performance has a power-law relationship with each of the three scale factors N, D, C when not bottlenecked by the other two, with trends spanning more than six orders of magnitude.”
— Kaplan et al., Scaling Laws for Neural Language Models, 2020, Section 1.1 (Summary)
The paper's core claim is deceptively simple: the test loss of a language model — its raw predictive performance — follows smooth power laws as you increase model parameters (N), training tokens (D), or compute budget (C). These are not vague trends; the authors fit precise exponents across more than six orders of magnitude of scale, from tiny models trained on tiny data to the largest experiments OpenAI had run at the time. The claim is that the relationship is so regular, so predictable, that you can extrapolate forward: if you know where you are on the curve today, you can calculate where a model ten times larger will land before you spend a dollar building it.
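As a rough illustration of that extrapolation, the sketch below evaluates a single power law of the form L(N) = (N_c / N)^alpha_N, the form the paper uses for model size when data and compute are not the limiting factor. The constants are assumptions: alpha_N = 0.076 and N_c = 8.8e13 are close to the fitted values usually quoted from the paper, but treat them as placeholders and check them against the source before trusting any absolute number. The function and variable names are hypothetical.

```python
def power_law_loss(n_params: float, n_c: float, alpha: float) -> float:
    """Predicted loss under L(N) = (n_c / N) ** alpha."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return (n_c / n_params) ** alpha


# Assumed constants for illustration only (see the lead-in).
ALPHA_N = 0.076
N_C = 8.8e13

loss_now = power_law_loss(1e9, N_C, ALPHA_N)    # a 1B-parameter model
loss_10x = power_law_loss(1e10, N_C, ALPHA_N)   # a model ten times larger

# The ratio depends only on the exponent, not on N_C or the starting size.
ratio = loss_10x / loss_now
print(f"predicted loss ratio for a 10x larger model: {ratio:.3f}")
```

Under these assumed constants, each tenfold increase in parameters multiplies the predicted loss by 10^-0.076, about 0.84, whatever the starting size. That scale-free ratio is what makes the forward extrapolation possible.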
The practical implication the paper draws from these laws is a prescription for resource allocation. Given a fixed compute budget, the optimal strategy is not to train a modest model to convergence; it is to train a very large model on a comparatively modest amount of data, stopping well before convergence. The authors derive a specific allocation: as compute grows, most of the increase should go to parameters (N scaling roughly as C^0.73) and much less to training tokens (D scaling roughly as C^0.27). This was a direct rebuttal to the implicit prior in the field, which had been to train each model until the loss stopped moving. The paper replaced an informal rule of thumb with a mathematical argument.
Key Facts
- The paper was posted to arXiv on January 23, 2020, before GPT-3 was published, making it a methodological foundation for that model's design.
- The authors fit power-law exponents to trends spanning more than six orders of magnitude in scale, from very small models and datasets up to the largest runs they performed.
- The paper identifies a specific compute-optimal prescription: most additional compute should go to model size, with parameters N scaling as roughly C^0.73 and training tokens D as roughly C^0.27.
- DeepMind's 2022 Chinchilla paper shifted the compute-optimal allocation toward data, finding that parameters and tokens should scale in roughly equal proportion (about 20 tokens per parameter, versus fewer than 2 for GPT-3), which implied that GPT-3 was substantially undertrained.
- The paper is cited in later large-model reports, including those for GPT-4, LLaMA, and PaLM, as background for reasoning about how performance scales with compute.
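The allocation exponents quoted in the list above can be turned into a small calculation. The sketch below shows how a tenfold increase in compute would be split between parameters and tokens under the 0.73 / 0.27 exponents, next to an equal 0.5 / 0.5 split of the kind the Chinchilla paper later argued for. The 0.5 / 0.5 figures and the helper name `scale_allocation` are assumptions added for comparison, not values taken from this article.

```python
def scale_allocation(compute_multiplier: float, n_exp: float, d_exp: float):
    """Return the multipliers applied to parameters N and tokens D
    when compute grows by `compute_multiplier`, assuming
    N ~ C**n_exp and D ~ C**d_exp."""
    return compute_multiplier ** n_exp, compute_multiplier ** d_exp


# Exponents quoted in the Key Facts list.
kaplan_n, kaplan_d = scale_allocation(10.0, 0.73, 0.27)

# Assumed equal split, for comparison only.
equal_n, equal_d = scale_allocation(10.0, 0.5, 0.5)

print(f"0.73/0.27 split: N x{kaplan_n:.2f}, D x{kaplan_d:.2f}")
print(f"0.50/0.50 split: N x{equal_n:.2f}, D x{equal_d:.2f}")
```

With ten times the compute, the 0.73 / 0.27 exponents grow the model by about 5.4x and the token count by about 1.9x, while the equal split grows both by about 3.2x. In both cases the two multipliers multiply back to 10, because the exponents sum to one.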
Within the machine learning research community, the paper landed as a clarifying event rather than a controversy. By January 2020, practitioners already suspected that scale mattered; the success of GPT-2 in 2019 had made that intuition nearly universal. What the scaling laws paper provided was the quantitative skeleton that had been missing. The reaction from researchers was less 'we didn't know this' and more 'now we have the math.' It was immediately cited by teams at Google Brain, DeepMind, and in academic settings as a foundational reference — the paper that justified treating scale as a first-class experimental variable rather than a nuisance to be controlled.
The paper was not without critics, though the criticism came later and was more technical than philosophical. The core objection — crystallized most precisely in DeepMind's Chinchilla paper two years later — was that OpenAI had under-weighted the data scaling exponent relative to the parameter scaling exponent. The Chinchilla authors argued that the compute-optimal frontier implied larger token budgets than Kaplan et al. had recommended, meaning OpenAI's own prescription left significant performance on the table. This did not invalidate the framework; it refined it. The existence of that follow-on paper is itself evidence of the original's centrality: you only need to correct work that everyone is already using.
The most direct downstream artifact is DeepMind's Chinchilla (Hoffmann et al., 2022), which used the same power-law framework to refit the optimal compute allocation and concluded that GPT-3 was significantly undertrained relative to its parameter count. Chinchilla's 70 billion parameter model matched or exceeded GPT-3's 175 billion parameter model on most benchmarks, a result that would have been nonsensical without the scaling laws framework to explain it. Many models released after 2022, including Meta's LLaMA series and Mistral's releases, are sized with Chinchilla-adjusted scaling in mind. Some of them, LLaMA among them, deliberately train on more tokens than the compute-optimal amount in order to lower inference cost, and that reasoning still traces directly back to Kaplan et al.'s methodology.
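The GPT-3 versus Chinchilla comparison can be made concrete with back-of-the-envelope arithmetic. The sketch below uses the parameter counts given above together with the token counts reported in the respective papers (about 300 billion tokens for GPT-3 and about 1.4 trillion for Chinchilla), and the common approximation that training compute is roughly 6 x N x D floating-point operations. The class and attribute names are hypothetical, and the FLOP figures are estimates, not reported measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRun:
    name: str
    params: float   # model parameters N
    tokens: float   # training tokens D

    @property
    def tokens_per_param(self) -> float:
        return self.tokens / self.params

    @property
    def approx_flops(self) -> float:
        # Rule-of-thumb estimate C ~ 6 * N * D (an approximation).
        return 6.0 * self.params * self.tokens


gpt3 = TrainingRun("GPT-3", params=175e9, tokens=300e9)
chinchilla = TrainingRun("Chinchilla", params=70e9, tokens=1.4e12)

for run in (gpt3, chinchilla):
    print(f"{run.name}: {run.tokens_per_param:.1f} tokens/param, "
          f"~{run.approx_flops:.2e} FLOPs")
```

On these numbers GPT-3 saw roughly 1.7 tokens per parameter and Chinchilla roughly 20, which is the gap behind the claim that GPT-3 was undertrained for its size.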
Beyond specific models, the paper changed how AI investment is structured. Venture capital and hyperscaler compute commitments made after 2020 — including Microsoft's multibillion-dollar Azure commitments to OpenAI — were underwritten in part by the argument that returns on compute investment are predictable and continuous. The scaling laws paper is the closest thing the AI industry has to a prospectus for that investment thesis. It also seeded the 'emergent abilities' research program: if performance scales predictably in aggregate, researchers began asking what happens at specific capability thresholds, leading directly to Wei et al.'s 2022 work on emergent abilities in large language models.
It is worth pausing on what kind of object this paper actually is. It is not a new architecture. It introduces no new training technique. It does not present a state-of-the-art benchmark result. What it presents is a measurement — a very careful, very systematic measurement of something the field had been gesturing at for years without ever properly instrumenting. In that sense it belongs to a different genre of science than most ML papers: it is closer to Kepler's laws of planetary motion than to a new telescope design. The value is not in the mechanism it proposes but in the regularities it documents with enough precision that other people can act on them.
The power-law finding is striking precisely because it should not obviously be true. Neural network training is a high-dimensional non-convex optimization process running on discrete hardware with a learning rate schedule, a tokenizer, an architecture with dozens of hyperparameter choices, and a data distribution that is itself a messy sample from human text production. That any clean functional form survives all of that noise across six orders of magnitude is genuinely surprising. The paper does not explain why the laws hold — and the authors are candid about this, noting that they are empirical regularities rather than derived consequences of a theory. This is either a confession of incompleteness or a statement of scientific honesty, depending on your priors about what a paper owes you.
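What "fitting a power law" means in practice can be sketched briefly: a relationship y = a * x^(-alpha) is a straight line in log-log coordinates, so the exponent can be recovered with an ordinary least-squares line fit on the logarithms. The example below does this on synthetic, noise-free data with made-up constants (a = 12.0, alpha = 0.08). It is not the authors' fitting procedure and uses none of their measurements; the function name is hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-alpha) by least squares on (log x, log y).

    Returns (a, alpha).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic model sizes from 1e5 to 1e11 (six orders of magnitude).
sizes = [10.0 ** k for k in range(5, 12)]
# Synthetic losses generated from an assumed power law.
losses = [12.0 * n ** -0.08 for n in sizes]

a_hat, alpha_hat = fit_power_law(sizes, losses)
print(f"recovered a = {a_hat:.3f}, alpha = {alpha_hat:.4f}")
```

On real training runs the points would scatter around the line, and the surprising part is how small that scatter turns out to be over so wide a range.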
The strategic significance of the compute-optimality prescription has been somewhat obscured by subsequent events. The Chinchilla correction showed that the specific exponents OpenAI published were off, and the field quietly updated. But the correction did not change the underlying logic: there is an optimal frontier, it can be estimated in advance, and training off that frontier is wasteful. What changed is where the frontier sits. This is how normal science works, and the fact that the paper was wrong in its specific coefficients while being right about its framework is not an embarrassment; it is the expected outcome when a field moves from qualitative to quantitative reasoning for the first time.
What the scaling laws paper leaves open is more unsettling than what it resolves. The entire framework measures loss on next-token prediction, which is a proxy for capability, not a measure of it. The paper is silent on whether the smooth curves it documents for perplexity translate into smooth curves for reasoning, factual accuracy, or safety-relevant behaviors. The subsequent emergence literature suggests they do not — that some capabilities appear discontinuously, or not at all, or in unexpected forms as scale increases. The scaling laws paper is best understood not as a complete theory of language model development but as the founding document of a research program that is still, five years later, working out what the laws actually govern.
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei · 2020
https://arxiv.org/abs/2001.08361
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre · 2022
https://arxiv.org/abs/2203.15556
Language Models are Few-Shot Learners
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei · 2020
https://arxiv.org/abs/2005.14165
Emergent Abilities of Large Language Models
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, William Fedus · 2022
https://arxiv.org/abs/2206.07682
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, et al. · 2023
https://arxiv.org/abs/2307.09288