GPT-2
The language model OpenAI considered too dangerous to fully release — and then did anyway.
GPT-2 was developed by OpenAI and introduced in February 2019 as a large-scale unsupervised language model trained to predict the next word in a sequence. It was trained on WebText, a dataset of roughly 40 gigabytes of text (approximately 8 million documents) scraped from outbound links in Reddit posts that had received at least 3 karma. The model demonstrated a striking ability to generate coherent, extended passages of text from a short prompt, a capability that surprised even its creators.
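The next-word objective described above can be illustrated with a deliberately tiny stand-in. The sketch below is not GPT-2: it swaps the transformer for a bigram count table, and the names `train_bigram` and `generate` are hypothetical. It only shows the loop that autoregressive generation relies on: predict the next word from the context, append it, and repeat.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_new_tokens):
    """Greedily extend the prompt one word at a time (toy stand-in for GPT-2)."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last word never appeared with a successor
        # Pick the most frequent next word; GPT-2 instead samples from a
        # learned distribution conditioned on up to 1,024 previous tokens.
        out.append(followers.most_common(1)[0][0])
    return out


counts = train_bigram("a b a b a c".split())
print(" ".join(generate(counts, ["a"], 4)))  # prints "a b a b a"
```

A bigram table conditions on one previous word only; the point of the sketch is the generation loop, not the quality of the predictions.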
The full GPT-2 model contained 1.5 billion parameters, making it one of the largest language models publicly discussed at the time. It used a decoder-only variant of the transformer architecture introduced by Vaswani et al. in 2017, stacking 48 layers with 1,600-dimensional embeddings and a context window of 1,024 tokens. OpenAI released the model in stages, starting with a 117 million parameter version in February and following with progressively larger versions. It withheld the full 1.5B model until November 2019, citing concerns about potential misuse for generating disinformation at scale.
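The 1.5 billion figure can be roughly reconstructed from the architecture numbers above. The sketch below makes assumptions that are not stated in this article: a vocabulary of 50,257 tokens (the figure given in the GPT-2 paper), a feed-forward layer four times the embedding width, bias terms on every linear layer, learned position embeddings, and an output layer that reuses the token embedding matrix. The function name `gpt2_param_count` is hypothetical.

```python
def gpt2_param_count(n_layer, d_model, n_ctx=1024, vocab_size=50257):
    """Count parameters of a GPT-2-style decoder under the stated assumptions."""
    token_emb = vocab_size * d_model          # shared with the output layer
    pos_emb = n_ctx * d_model                 # learned position embeddings
    # Attention: fused query/key/value projection plus output projection.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: expand to 4 * d_model, then project back.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    layer_norms = 2 * (2 * d_model)           # two layer norms, scale and bias each
    per_layer = attn + mlp + layer_norms
    final_ln = 2 * d_model
    return token_emb + pos_emb + n_layer * per_layer + final_ln


full = gpt2_param_count(n_layer=48, d_model=1600)
small = gpt2_param_count(n_layer=12, d_model=768)  # smallest config in the paper
print(f"{full:,}")   # prints 1,557,611,200
print(f"{small:,}")  # prints 124,439,808
```

Under these assumptions the 48-layer configuration comes to about 1.56 billion parameters, consistent with the rounded 1.5B figure. The smallest configuration comes to about 124 million, which matches the corrected count OpenAI later gave for the model first announced as 117M.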
The staged release was itself a landmark event in AI policy discourse. OpenAI published a blog post explaining its reasoning, framing GPT-2 as a model whose text-generation quality was high enough to warrant caution before broad public access. This decision was controversial: many researchers argued the caution was overstated or served primarily as publicity, while others praised it as a responsible precedent. By the time the full model was released in November 2019, independent researchers and other organizations had already replicated comparable capabilities, and no significant misuse of the previously released models had been documented.
Key Facts
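- GPT-2 was released in four stages of increasing size: 117M (Feb 2019), 345M (May 2019), 762M (Aug 2019), and the full 1.5B-parameter model (Nov 2019). These are the sizes reported in the paper; OpenAI later gave corrected counts of 124M, 355M, and 774M for the first three.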
- GPT-2 was trained on WebText, a dataset of ~40 GB of text derived from ~8 million web documents filtered via Reddit upvotes.
- It uses a 48-layer transformer decoder architecture with a 1,024-token context window and 1,600-dimensional hidden states.
- On the Penn Treebank language modeling benchmark, GPT-2 (1.5B) achieved a perplexity of 35.76 in a zero-shot setting, surpassing prior state-of-the-art trained directly on that dataset.
- OpenAI withheld the full model for approximately nine months, from the initial announcement in February 2019 until its release on November 5, 2019.
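For readers unfamiliar with the perplexity figure quoted in the list above, here is a minimal sketch of the definition: perplexity is the exponential of the average negative log-probability the model assigns to each token. The function name `perplexity` is hypothetical, and the sketch does not reproduce the Penn Treebank evaluation, whose result also depends on tokenization details not covered here.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that gives every token probability 1/4 is as uncertain as a uniform
# choice among four options, so its perplexity is 4 (up to rounding).
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))

# Read the other way, a perplexity of 35.76 corresponds to an average loss of
# roughly 3.58 nats per token.
print(math.log(35.76))
```

Lower is better: a perplexity of 35.76 means the model is, on average, about as uncertain as a uniform choice among roughly 36 candidates at each step.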
GPT-2 established that scaling transformer language models on large, diverse internet text could yield qualitatively impressive generative capabilities without task-specific training. This finding directly motivated the subsequent scaling efforts that produced GPT-3 and the broader class of large language models that now underpin products used by hundreds of millions of people. The model demonstrated zero-shot performance on reading comprehension, summarization, and translation benchmarks. It was never trained explicitly on these tasks; the abilities arose as emergent properties of language modeling at scale.
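In the GPT-2 paper, zero-shot behavior of this kind is induced purely through the prompt: the text "TL;DR:" is appended to an article to elicit a summary, and a final "A:" is appended to a document and conversation history to elicit an answer. The sketch below shows how such prompts might be assembled. The function names and the exact layout (newlines, the "Q:" prefix) are assumptions for illustration, not the paper's code.

```python
def summarization_prompt(article):
    """Append the 'TL;DR:' cue so the model continues with a summary."""
    return article.rstrip() + "\nTL;DR:"


def reading_comprehension_prompt(document, history, question):
    """Document, prior (question, answer) turns, new question, then 'A:'."""
    lines = [document.rstrip()]
    for past_question, past_answer in history:
        lines.append(f"Q: {past_question}")   # "Q:" prefix is an assumed layout
        lines.append(f"A: {past_answer}")
    lines.append(f"Q: {question}")
    lines.append("A:")                        # the model continues from here
    return "\n".join(lines)


print(summarization_prompt("GPT-2 was released in stages during 2019.\n"))
print(reading_comprehension_prompt(
    "GPT-2 was released in stages during 2019.",
    [("Who released it?", "OpenAI")],
    "When was it released?",
))
```

No weights are updated in either case. The task is specified entirely by the text the model is asked to continue, which is what distinguishes this setup from task-specific fine-tuning.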
Beyond technical impact, GPT-2 introduced a new dimension to AI development: the public deliberation over whether and how to release powerful models. The staged release sparked ongoing debates about responsible disclosure, dual-use risk, and the obligations of AI laboratories, and those conversations have only intensified with subsequent systems. GPT-2 was, in this sense, one of the first high-profile case studies in staged model release, establishing a template for the difficult tradeoffs that define modern AI governance.
Language Models are Unsupervised Multitask Learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever · 2019
https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Better Language Models and Their Implications
OpenAI · 2019
https://openai.com/blog/better-language-models
GPT-2: 6-Month Follow-Up
OpenAI · 2019
https://openai.com/blog/gpt-2-6-month-follow-up
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin · 2017
https://arxiv.org/abs/1706.03762