Attention Is All You Need
Vaswani, Shazeer, Parmar et al. · Google Brain
The 2017 paper that abolished recurrence and convolution from sequence modeling, replacing both with a single mechanism whose scaling properties no one fully anticipated.
“We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”
— Vaswani et al., Attention Is All You Need, 2017, Abstract
The central claim of the paper is radical in its simplicity: you do not need recurrence to model sequences. For years, the dominant assumption in neural sequence modeling was that processing order required sequential computation, with each hidden state depending on the last, unrolling through time. LSTMs and GRUs had made this tractable, but they were slow to train on long sequences and struggled to propagate information across many steps. Vaswani and colleagues proposed replacing this entire apparatus with multi-head self-attention: a mechanism that computes, for every position in a sequence, a weighted sum over all positions (its own included) simultaneously. The model sees the whole sequence at once, in parallel.
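To make the "weighted sum over all positions" concrete, here is a minimal single-head sketch of scaled dot-product self-attention in NumPy. The sizes, the random projection matrices, and the function names are illustrative assumptions, not the paper's trained configuration; the point is only that every position is scored against every other in one matrix product, with no loop over time steps.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices.
    Returns the attended outputs (seq_len, d_k) and the weights (seq_len, seq_len).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # every position scored against every position
    weights = softmax(scores, axis=-1)   # each row is a distribution over positions
    return weights @ v, weights

# Illustrative sizes and random weights (assumptions for the sketch).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` sums to 1 and includes the position's weight on itself, which is why the output at each position is a weighted average over the whole sequence.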
The specific engineering contribution is a stacked encoder-decoder architecture where attention heads learn different relational patterns simultaneously, positional encodings substitute for the sequential ordering that recurrence had provided for free, and residual connections with layer normalization keep gradients stable through depth. The paper demonstrated this on English-to-German and English-to-French translation, achieving state-of-the-art BLEU scores while training in a fraction of the time required by comparable recurrent models. The practical argument was as much about training efficiency as modeling power: parallelism meant you could use GPUs the way they were designed to be used, saturating hardware rather than running computations one step at a time.
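The residual-plus-normalization wrapper mentioned above can be sketched in a few lines. This follows the paper's LayerNorm(x + Sublayer(x)) arrangement around a position-wise feed-forward sublayer, max(0, xW1 + b1)W2 + b2. As simplifying assumptions, the layer norm here omits the learned gain and bias, the dimensions are tiny, and the weights are random rather than trained.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean and unit variance.
    # (Learned gain and bias are omitted in this sketch.)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    # Post-norm arrangement: add the sublayer output to its input, then normalize.
    return layer_norm(x + sublayer(x))

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise feed-forward network with a ReLU in between.
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

# Illustrative sizes (the paper's base model uses d_model=512, d_ff=2048).
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 16, 64
x = rng.normal(size=(seq_len, d_model))
w1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

y = residual_block(x, lambda h: feed_forward(h, w1, b1, w2, b2))
```

Because the input is added back before normalization, the sublayer only has to learn a correction to `x`, which is what helps gradients pass cleanly through a deep stack.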
Key Facts
- The paper achieved a BLEU score of 28.4 on WMT 2014 English-to-German translation, surpassing all previously reported models including ensembles, while using significantly less training compute.
- Training the large Transformer model took 3.5 days on 8 NVIDIA P100 GPUs — the paper explicitly frames training cost as a primary motivation for the architecture.
- The paper was reportedly rejected from ICLR 2017 before being accepted at NeurIPS 2017 (then called NIPS), a story widely repeated in the ML community as a cautionary example of peer review missing landmark work.
- As of 2024, the arXiv preprint (arXiv:1706.03762) has accumulated over 100,000 citations according to Google Scholar, making it one of the most cited papers in computer science history.
- The multi-head attention mechanism in the original paper used 8 parallel attention heads in the base model and 16 in the large model — a design choice that has persisted, with variation, across nearly every major Transformer variant since.
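The head counts in the last fact are easier to read next to the model widths. In the paper, the base model has d_model = 512 split across 8 heads and the large model has d_model = 1024 split across 16, so each head works in a 64-dimensional subspace in both cases. The sketch below shows the split, per-head attention, and the concatenation back to d_model. The random weights and function names are assumptions for illustration, and masking and dropout are left out.

```python
import numpy as np

def softmax(x):
    # Softmax over the last axis, shifted by the maximum for stability.
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    heads = softmax(scores) @ v                           # (heads, seq, d_head)
    # Concatenate the heads back into one d_model-wide vector per position.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

# Base-model width and head count from the paper; weights are random placeholders.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 512, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

Since the per-head width shrinks as the head count grows, the total cost stays close to that of a single full-width head while allowing each head to attend to a different pattern.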
The paper was presented at NeurIPS 2017, where it attracted serious attention but not yet the canonical status it would later acquire. Within the machine translation community, the BLEU improvements were noticed immediately: 28.4 on WMT 2014 English-to-German was a clear benchmark advance. But the broader significance of replacing recurrence entirely was not universally obvious at first; the paper had reportedly been rejected from ICLR 2017 before its NeurIPS appearance, a story that became one of the field's most-cited cautionary tales about peer review. Early responses centered on the practical gains in training speed rather than on what the architecture would enable at scale.
Within eighteen months, the community's reading had shifted dramatically. The release of BERT in late 2018 and GPT-2 in 2019 made clear that the Transformer was not just a better translation model but a general-purpose learned representation engine. Citations accelerated in a way rarely seen for a single conference paper. By the early 2020s, 'Attention Is All You Need' had accumulated tens of thousands of citations and had been downloaded millions of times from arXiv — figures that place it among the most influential papers in the history of machine learning, full stop. The field did not contest the result; it absorbed it.
The direct lineage is unambiguous and dense. BERT (Devlin et al., Google, 2018) used the Transformer encoder to redefine NLP benchmarks across eleven tasks in a single paper. GPT (Radford et al., OpenAI, 2018) used the Transformer decoder to demonstrate that language modeling alone, at scale, produced transferable representations. GPT-2 and GPT-3 extended this into few-shot learning. Every model in the GPT series, every model in the LLaMA family, every version of PaLM, Gemini, Claude, and Mistral is, at its computational core, a stack of Transformer blocks built on multi-head attention or a close variant of it. The Vision Transformer (Dosovitskiy et al., 2020) demonstrated that the architecture generalizes beyond text to image patches, dissolving the historical boundary between NLP and computer vision architectures.
The ripple extends into infrastructure and economics. The demand for hardware capable of running Transformer training at scale is a primary driver of NVIDIA's market position in the 2020s; the H100 GPU is, in a meaningful sense, optimized for the matrix multiplication patterns that self-attention requires. Companies like Hugging Face built their entire product around distributing and fine-tuning Transformer-based models. OpenAI's commercial trajectory from research lab to multi-billion-dollar company runs directly through the GPT architecture, which runs directly through this paper. FlashAttention, sparse attention, and dozens of efficiency variants exist specifically because full self-attention scales quadratically with sequence length, a cost that became a billion-dollar engineering problem once sequences grew long.
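A back-of-the-envelope calculation shows why the quadratic term bites. The sketch below counts the bytes needed to materialize the full score matrix for one sequence in one layer. The head count of 8 and the 2 bytes per value (half precision) are assumptions chosen for illustration, and the helper name is hypothetical.

```python
def attention_score_bytes(seq_len, num_heads=8, bytes_per_value=2):
    """Bytes to hold one (seq_len x seq_len) score matrix per head, for one layer."""
    return num_heads * seq_len * seq_len * bytes_per_value

for n in (1_024, 8_192, 65_536):
    gib = attention_score_bytes(n) / 2**30
    print(f"seq_len={n:>6}: {gib:8.3f} GiB of attention scores per layer")
```

Under these assumptions, 8,192 tokens already costs 1 GiB per layer, and every doubling of the sequence length quadruples it. Efficiency variants attack this term either by never materializing the full matrix or by computing only part of it.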
What makes 'Attention Is All You Need' a genuine artifact of intellectual history — rather than merely a successful engineering paper — is that it did something unusual in machine learning: it removed complexity rather than adding it. The dominant mode of progress in the field is accretion. You add gates to your RNN (LSTM). You add skip connections to your CNN (ResNet). You add adversarial training to your generator (GAN). Vaswani et al. did the opposite. They looked at a mature, heavily engineered pipeline and asked what the load-bearing parts actually were. The answer, they argued, was attention — a mechanism that had been used as a supplement to recurrent models for years, most visibly in Bahdanau et al.'s 2015 neural machine translation work. They bet that it could stand alone. That bet was correct, but its correctness was not obvious in advance, and the paper deserves credit for the clarity of its conviction.
The architectural choice that has aged most interestingly is positional encoding. Because self-attention is permutation-equivariant by design (reorder the inputs and the outputs are simply reordered the same way), the Transformer has no inherent sense of sequence order. The original paper addressed this with sinusoidal positional encodings: fixed, deterministic signals added to the input embeddings before attention is computed. This is an elegant solution to a real problem, but it is also a conspicuous seam, a place where the architecture's otherwise unified logic required an external patch. Much subsequent work on the Transformer has grappled with this seam. Learned absolute encodings, relative position encodings (Shaw et al.), rotary position embeddings (Su et al., RoPE), and ALiBi (Press et al.) all represent attempts to handle position more gracefully. The fact that this problem remains actively contested in 2024 suggests that the original paper solved the first-order problem while leaving a second-order one open.
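Both halves of this argument can be checked numerically. The sketch below builds the paper's sinusoidal encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos of the same angle, and then tests the symmetry claim. To keep it short, the attention function uses identity projections rather than learned ones, which is an assumption of the sketch; the sizes and the particular permutation are arbitrary.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Assumes d_model is even: sines fill even dimensions, cosines fill odd ones.
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # the "2i" in the formula
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def plain_self_attention(x):
    # Self-attention with identity projections, enough to show the symmetry.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
perm = np.array([3, 0, 5, 1, 4, 2])   # an arbitrary reordering of the six positions
pe = sinusoidal_positional_encoding(6, 8)

# Without position information, shuffling the inputs just shuffles the outputs.
equivariant = np.allclose(
    plain_self_attention(x[perm]), plain_self_attention(x)[perm]
)

# Once encodings are added by position, the same tokens in a new order
# produce genuinely different outputs.
order_sensitive = not np.allclose(
    plain_self_attention(x[perm] + pe), plain_self_attention(x + pe)[perm]
)
```

Here `equivariant` and `order_sensitive` both come out true: the encoding is the only thing that tells the model where each token sits.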
There is a sociology-of-science story here that the citation counts obscure. All eight authors subsequently left Google, though one, Noam Shazeer, returned in 2024. The institution that published the paper does not straightforwardly own the technology the paper enabled; the benefits have accrued to OpenAI, Meta, Anthropic, Mistral, and dozens of other organizations that applied the architecture at scales Google was slower to productize. This is a recurring pattern in industrial research labs: a foundational contribution exits through a paper, enters the open literature, and is monetized most aggressively by organizations that did not produce it. The Transformer is the starkest recent example. It is also a reminder that the value of a research publication is not captured by the publishing institution's subsequent market position.
Zoom out far enough and the paper reveals something about how paradigm shifts actually happen in machine learning. They rarely come from proving a theoretical result or from a single empirical breakthrough. They come from someone noticing that a load-bearing assumption — in this case, that sequence processing requires sequential computation — is actually a convention rather than a necessity, and then building a credible alternative. The Transformer did not win because it was theoretically proven superior. It won because it was faster to train, easier to scale, and empirically competitive from day one. The field's subsequent decade can be read as an extended experiment in discovering just how far that scalability goes. The answer, as of this writing, appears to be: further than anyone in that NeurIPS session room in 2017 was prepared to believe.
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin · 2017
https://arxiv.org/abs/1706.03762
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova · 2018
https://arxiv.org/abs/1810.04805
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby · 2020
https://arxiv.org/abs/2010.11929
Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio · 2015
https://arxiv.org/abs/1409.0473
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu · 2021
https://arxiv.org/abs/2104.09864