BERT
By reading sentences in both directions simultaneously, BERT shattered the assumption that language models must process text left-to-right.
Before BERT, the dominant paradigm for natural language processing relied on unidirectional language models — systems that read text either left-to-right or right-to-left, but never both at once during pretraining. Word embeddings like Word2Vec and GloVe captured static word meanings, while models like ELMo introduced context-sensitive representations by concatenating outputs from two separately trained unidirectional LSTMs. These approaches left a fundamental gap: no single representation could attend to both left and right context simultaneously when building an understanding of a word or sentence.
BERT, which stands for Bidirectional Encoder Representations from Transformers, was introduced by researchers at Google AI Language in October 2018. It addressed the bidirectionality problem by adapting the Transformer encoder architecture, originally described in the 2017 'Attention Is All You Need' paper, and pretraining it on two self-supervised objectives. The first, Masked Language Modeling (MLM), randomly selects 15% of input tokens, hides most of them behind a special [MASK] token, and trains the model to predict the originals using surrounding context from both directions. The second, Next Sentence Prediction (NSP), trains the model to determine whether the second of two sentences actually follows the first in the original text, which is intended to help with tasks that depend on the relationship between sentences.
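The two pretraining objectives described above can be illustrated with a small sketch. It is not the authors' preprocessing code: the helper names `mask_tokens` and `make_nsp_example` are hypothetical, and the toy sentences are made up for illustration. The 80/10/10 replacement split and the 50/50 sampling of true versus random next sentences follow the BERT paper but are not stated in the text above.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted, labels). labels[i] holds the original token at each
    selected position and None everywhere else."""
    corrupted, labels = list(tokens), [None] * len(tokens)
    # Special tokens are never selected for prediction.
    candidates = [i for i, t in enumerate(tokens) if t not in (CLS, SEP)]
    n_select = min(len(candidates), max(1, round(len(candidates) * select_prob)))
    for i in rng.sample(candidates, n_select):
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

def make_nsp_example(doc, other_doc, rng):
    """Build one sentence pair. doc and other_doc are lists of tokenized sentences."""
    i = rng.randrange(len(doc) - 1)
    first = doc[i]
    if rng.random() < 0.5:
        second, is_next = doc[i + 1], True            # true next sentence
    else:
        second, is_next = rng.choice(other_doc), False  # random sentence
    return [CLS] + first + [SEP] + second + [SEP], is_next

rng = random.Random(0)
doc = [["the", "man", "went", "to", "the", "store"],
       ["he", "bought", "a", "gallon", "of", "milk"]]
other = [["penguins", "are", "flightless", "birds"]]
vocab = sorted({t for sentence in doc + other for t in sentence})

tokens, is_next = make_nsp_example(doc, other, rng)
corrupted, labels = mask_tokens(tokens, vocab, rng)
print(corrupted)
print(labels, is_next)
```

Because the model sees the whole corrupted sequence at once, each prediction can draw on tokens to both the left and the right of the selected position.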
BERT was pretrained on the BooksCorpus (800 million words) and English Wikipedia (2.5 billion words), then fine-tuned on specific downstream tasks with minimal architectural modification. Upon release, BERT-Large achieved state-of-the-art results on 11 NLP tasks, including the Stanford Question Answering Dataset (SQuAD 1.1 and 2.0) and the tasks of the General Language Understanding Evaluation (GLUE) benchmark. Its release quickly changed how the NLP research community approached transfer learning, making large-scale pretraining followed by task-specific fine-tuning the standard methodology.
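To make "minimal architectural modification" concrete: for sentence-level classification, the BERT paper adds a single linear layer with a softmax on top of the final hidden vector of the [CLS] token, and all weights are then updated during fine-tuning. The sketch below shows only the forward pass of such a head in plain Python. The function name `classify`, the toy 4-dimensional vector, the weight values, and the bias term are illustrative assumptions, not values from the paper.

```python
import math

def classify(cls_vector, weights, bias):
    """Compute softmax(W h + b) over K classes; weights is a K x H matrix."""
    logits = [sum(w * h for w, h in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical [CLS] vector with H = 4 for readability (H = 768 in BERT-Base).
h_cls = [0.2, -0.1, 0.4, 0.0]
# Hypothetical head for K = 2 classes, e.g. negative / positive sentiment.
W = [[0.5, 0.0, -0.3, 0.1],
     [-0.2, 0.4, 0.6, 0.0]]
b = [0.0, 0.0]

probs = classify(h_cls, W, b)
print(probs)
```

The only task-specific parameters here are `W` and `b`; everything that produces `h_cls` is the pretrained encoder.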
Key Facts
- BERT-Base contains 110 million parameters; BERT-Large contains 340 million parameters across 24 Transformer layers.
- BERT-Large, as an ensemble, achieved a test F1 score of 93.2 on SQuAD 1.1 at release, surpassing the previous state of the art by 1.5 points and exceeding the human performance reported for that benchmark (91.2 F1).
- Pretraining BERT-Large used 16 Cloud TPUs (64 TPU chips in total) running for approximately 4 days, a substantial compute budget for a public NLP paper at that point.
- BERT simultaneously set new state-of-the-art records on 11 NLP tasks upon its release in October 2018, a breadth of improvement across benchmarks unprecedented for a single model.
- Google announced in October 2019 that BERT had been deployed in Google Search, describing it as 'the biggest leap forward in the past five years' and 'one of the biggest leaps forward in the history of Search.'
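The parameter counts in the list above can be roughly reproduced from the architecture alone. The sketch below is a back-of-the-envelope tally, not an official accounting. It assumes details that are not stated in this article: a 30,522-token WordPiece vocabulary, 512 positions, 2 segment types, hidden sizes of 768 (Base, 12 layers) and 1024 (Large, 24 layers), a feed-forward size of four times the hidden size, and a pooler layer on top. The function name `bert_param_count` is hypothetical.

```python
def bert_param_count(hidden, layers, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate number of trainable parameters in a BERT encoder."""
    ffn = 4 * hidden                  # feed-forward inner size
    layer_norm = 2 * hidden           # scale and shift vectors
    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ffn and ffn -> hidden, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    # Dense layer applied to the [CLS] vector.
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

for name, hidden, layers in [("BERT-Base", 768, 12), ("BERT-Large", 1024, 24)]:
    millions = bert_param_count(hidden, layers) / 1e6
    print(f"{name}: about {millions:.1f}M parameters")
```

Under these assumptions the totals should land within a few percent of the rounded 110 million and 340 million figures. The number of attention heads does not appear in the tally because splitting the projections across heads leaves the parameter count unchanged.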
BERT established pretraining-then-fine-tuning as the dominant paradigm for NLP in a way that no prior model had achieved at scale. It demonstrated conclusively that a single, general-purpose language representation — trained on raw, unlabeled text — could be adapted to diverse tasks ranging from question answering and sentiment analysis to named entity recognition and textual inference, often surpassing systems that had been carefully engineered for those specific tasks over years. This shifted the field's center of gravity away from task-specific architectures and toward foundation models built on massive unsupervised pretraining.
The downstream consequences of BERT were enormous and rapid. Google confirmed in October 2019 that BERT had been integrated into Google Search, where it was initially expected to affect about one in ten English-language searches in the U.S. BERT also spawned a generation of successor models, including RoBERTa, ALBERT, DistilBERT, and XLNet, each refining or rethinking aspects of its design, and it helped catalyze the broader research program that eventually led to GPT-3 and the large language model era. The open release of BERT's pretrained weights on GitHub democratized access to high-quality language representations, enabling researchers and engineers worldwide to build on its foundation without requiring Google-scale compute.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova · 2018
https://arxiv.org/abs/1810.04805
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin · 2017
https://arxiv.org/abs/1706.03762
Understanding searches better than ever before
Pandu Nayak · 2019
https://blog.google/products/search/search-language-understanding-bert/
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov · 2019
https://arxiv.org/abs/1907.11692