The Gallery
Acq. 2017·Research

Transformer

By replacing recurrence entirely with self-attention, the Transformer became the universal engine of modern AI.

Overview

Before the Transformer, sequence modeling was dominated by recurrent neural networks (RNNs) and their variants, including LSTMs and GRUs. These architectures process tokens one at a time, so the computation at each position must wait for the one before it. That made training hard to parallelize, and gradients propagated across many steps tended to vanish or explode. By 2017, despite years of engineering refinement, RNNs remained fundamentally bottlenecked by their sequential nature, limiting both training speed and the ability to capture relationships between distant tokens.

In June 2017, a team of eight researchers, most of them affiliated with Google Brain and Google Research, published 'Attention Is All You Need,' introducing the Transformer architecture. The key insight was to dispense with recurrence and convolution entirely, relying instead on a mechanism called multi-head self-attention. Self-attention lets every token in a sequence attend directly to every other token: each token's representation is recomputed as a weighted sum over all positions, and the weights for the whole sequence are computed in parallel as matrix products. Because attention by itself is indifferent to token order, the architecture adds positional encodings to inject sequence order information. The original model used an encoder stack and a decoder stack of six layers each, designed for the machine translation tasks it was first evaluated on.
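
The two mechanisms described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses a single attention head (the paper runs several in parallel and concatenates them), omits masking, residual connections, and layer normalization, and the helper names (`matmul`, `self_attention`, `positional_encoding`) and the toy identity projections are hypothetical choices made for readability.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x is (seq_len x d_model); w_q, w_k, w_v are (d_model x d_k) projections.
    Returns the attended outputs and the attention weights.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Every query is scored against every key: a seq_len x seq_len matrix.
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    # Scale by sqrt(d_k), then normalize each row into attention weights.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sine on even dimensions, cosine on odd ones."""
    pe = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each sine/cosine pair shares one frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

# Toy example: 3 tokens, d_model = 4, identity projections (illustrative only).
tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
pe = positional_encoding(len(tokens), 4)
x = [[t + p for t, p in zip(tok, enc)] for tok, enc in zip(tokens, pe)]
out, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` sums to 1 and says how much one token draws on every other token. Nothing in the computation depends on processing positions in order, which is what makes the mechanism easy to parallelize.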

The Transformer achieved state-of-the-art results on the WMT 2014 English-to-German translation benchmark with a BLEU score of 28.4, surpassing all previously published models including ensembles, while training in a fraction of the time. Its parallelism made it highly amenable to modern GPU and TPU hardware. Within two years, the architecture had been adapted into BERT, GPT, and dozens of other foundational models, extending far beyond translation into virtually every domain of natural language processing and, eventually, computer vision, protein structure prediction, and audio generation.

Key Facts

  • The original Transformer base model contained approximately 65 million parameters; the large variant had 213 million parameters.
  • It achieved a BLEU score of 28.4 on WMT 2014 English-to-German translation, surpassing all prior single models and ensembles at the time.
  • Training the large model took 3.5 days on 8 NVIDIA P100 GPUs, far less than comparable RNN-based systems of the era required.
  • The paper was submitted to arXiv on June 12, 2017, and presented at NeurIPS 2017 (then abbreviated NIPS) in December of that year.
  • As of 2024, 'Attention Is All You Need' has accumulated over 100,000 citations on Google Scholar, making it one of the most cited computer science papers ever published.
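
The parameter counts listed above can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is an estimate under assumptions, not an exact accounting. The hyperparameters are not stated in the text above and are taken from the configuration reported in the paper: model width 512 (base) or 1024 (large), feed-forward width 2048 or 4096, six layers per stack, and a shared vocabulary of about 37,000 tokens. It also assumes one embedding matrix shared across input, output, and pre-softmax layers, and it ignores biases and layer normalization. The function name is hypothetical.

```python
def approx_transformer_params(d_model, d_ff, n_layers=6, vocab_size=37000):
    """Rough weight count, ignoring biases and layer normalization."""
    attention = 4 * d_model * d_model        # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff        # two linear layers
    encoder_layer = attention + feed_forward
    decoder_layer = 2 * attention + feed_forward  # self- and cross-attention
    embeddings = vocab_size * d_model        # one shared embedding matrix (assumed)
    return n_layers * (encoder_layer + decoder_layer) + embeddings

base = approx_transformer_params(d_model=512, d_ff=2048)
big = approx_transformer_params(d_model=1024, d_ff=4096)
print(f"base: {base / 1e6:.1f}M, big: {big / 1e6:.1f}M")
# prints "base: 63.0M, big: 214.0M"
```

Both estimates land within a few percent of the reported 65 million and 213 million. The number of attention heads does not appear in the formula because the heads split the model width among themselves and do not add to it.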

Why It Matters

The Transformer did not merely improve on existing methods; it replaced the dominant computational paradigm for sequence modeling wholesale. Its scalability turned out to be its most consequential property: as researchers trained Transformers on more data with more parameters, performance kept improving in ways that were not anticipated at the time of publication. This scaling behavior, later formalized in work on neural scaling laws such as Kaplan et al. (2020), underpins the large language models in wide deployment today, from GPT-4 to Claude to Gemini.
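
As a rough illustration of what such a scaling law looks like, the sketch below evaluates a power law in parameter count of the form L(N) = (N_c / N)^alpha. The constants are approximate values as reported by Kaplan et al. for the regime where model size is the limiting factor, and should be treated as assumptions; they do not describe any particular deployed model. The function name is hypothetical.

```python
ALPHA_N = 0.076   # exponent reported by Kaplan et al.; treat as approximate
N_C = 8.8e13      # constant, in non-embedding parameters; treat as approximate

def predicted_loss(n_params):
    """Power-law test loss L(N) = (N_c / N) ** alpha_N, in nats per token."""
    return (N_C / n_params) ** ALPHA_N

# Ratio of predicted loss after doubling model size from 1B to 2B parameters.
ratio = predicted_loss(2e9) / predicted_loss(1e9)
print(f"loss ratio per doubling: {ratio:.3f}")
```

Under these assumed constants, each doubling of parameter count lowers the predicted loss by about 5 percent whatever the starting size. Gains that small but that steady are what made scaling so predictable.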

The architecture's influence has extended well beyond natural language. Vision Transformers (ViTs) demonstrated that the same self-attention mechanism, applied to sequences of image patches, could match or outperform convolutional networks on image classification when pre-trained on sufficiently large datasets. AlphaFold 2 used attention-based modules derived from the Transformer to achieve breakthrough protein structure prediction. The Transformer is arguably the most broadly adopted neural network architecture in the history of deep learning, and its 2017 paper remains one of the most cited in the field.
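
The patch idea behind Vision Transformers can be sketched as a simple reshaping step: an image is cut into fixed-size squares, and each square is flattened into a vector that plays the role of a token. The example below is a minimal sketch assuming a single-channel image stored as a list of rows. A real ViT also applies a learned linear projection to each patch and adds position embeddings, both omitted here. The function name is hypothetical.

```python
def image_to_patches(image, patch_size):
    """Split an H x W grid of pixel values into flattened square patches.

    Patches are returned in row-major order, each as a flat list of
    patch_size * patch_size values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A synthetic 224 x 224 single-channel image cut into 16 x 16 patches.
image = [[r * 224 + c for c in range(224)] for r in range(224)]
patches = image_to_patches(image, 16)
print(len(patches), len(patches[0]))
# prints "196 256"
```

A 224 x 224 image therefore becomes a sequence of 196 patch tokens, which self-attention can then process just as it would a sentence of 196 words.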

The People

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin

Sources

[1]

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin · 2017

https://arxiv.org/abs/1706.03762

[2]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova · 2018

https://arxiv.org/abs/1810.04805

[3]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby · 2020

https://arxiv.org/abs/2010.11929

[4]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei · 2020

https://arxiv.org/abs/2001.08361