Stable Diffusion
The first high-quality text-to-image model that anyone could download, run, and modify on a consumer GPU.
Stable Diffusion was released publicly on August 22, 2022, by Stability AI in collaboration with Runway ML and academic researchers from the CompVis group at LMU Munich. It arrived at a moment when text-to-image generation was dominated by proprietary systems: OpenAI's DALL·E 2 was available only through a waitlist, Google's Imagen had not been released at all, and the weights of both were locked away. Stable Diffusion broke that pattern. The model weights were released openly, and the system was designed to run on consumer hardware, needing under 10 GB of VRAM at release and as little as 4–6 GB once memory optimizations followed.
The technical foundation was a latent diffusion model (LDM), a method developed by Robin Rombach and colleagues at LMU Munich and published in a 2022 CVPR paper. Unlike pixel-space diffusion models, which operate directly on full-resolution images and demand enormous compute, latent diffusion models compress images into a lower-dimensional latent space using a variational autoencoder (VAE), run the diffusion process there, and decode back to pixels only at the end. This compression reduced memory and compute requirements by roughly an order of magnitude without a commensurate loss in output quality. Text conditioning was applied via cross-attention layers using embeddings from OpenAI's CLIP text encoder.
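The text-conditioning step described above can be illustrated with a toy single-head cross-attention layer. This is a minimal sketch under stated assumptions, not the model's actual implementation: the function names, the tiny dimensions, and the random projection matrices are all hypothetical, whereas the real U-Net uses learned, multi-head projections at several resolutions. What it shows is the data flow: queries come from the image latents, keys and values come from the text embeddings, so every latent position ends up with a weighted mix of text information.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: queries from latents, keys/values from text."""
    q = matmul(latent_tokens, w_q)         # (n_latent, d_head)
    k = matmul(text_tokens, w_k)           # (n_text, d_head)
    v = matmul(text_tokens, w_v)           # (n_text, d_head)
    scale = 1.0 / math.sqrt(len(q[0]))     # standard 1/sqrt(d_head) scaling
    k_t = [list(col) for col in zip(*k)]   # (d_head, n_text)
    scores = matmul(q, k_t)                # (n_latent, n_text)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or real embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N_LATENT, D_LATENT = 16, 8   # toy stand-in for flattened latent feature positions
N_TEXT, D_TEXT = 5, 6        # toy stand-in for text-encoder token embeddings
D_HEAD = 4                   # toy attention width

latent_tokens = random_matrix(N_LATENT, D_LATENT, rng)
text_tokens = random_matrix(N_TEXT, D_TEXT, rng)
w_q = random_matrix(D_LATENT, D_HEAD, rng)
w_k = random_matrix(D_TEXT, D_HEAD, rng)
w_v = random_matrix(D_TEXT, D_HEAD, rng)

attended, weights = cross_attention(latent_tokens, text_tokens, w_q, w_k, w_v)
print(len(attended), len(attended[0]))   # one output vector per latent position
print(len(weights[0]))                   # one attention weight per text token
```

Because the keys and values depend only on the prompt, the same text embeddings can be reused at every denoising step while the latents change.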
The release triggered an immediate and massive wave of adoption. Within weeks, the community had produced fine-tuning pipelines, custom checkpoints trained on specific artistic styles, and novel sampling methods. Tools like Automatic1111's WebUI made the model accessible to non-programmers. The ecosystem that grew around Stable Diffusion — including techniques like DreamBooth, LoRA fine-tuning, ControlNet, and inpainting — expanded the model's capabilities far beyond what the original release contemplated, and demonstrated how open model weights could serve as a platform for distributed innovation.
Key Facts
- Released publicly on August 22, 2022, with model weights openly available for download.
- The base Stable Diffusion v1.4 model has approximately 860 million parameters in its U-Net backbone.
- Operates in a 64×64 latent space (for 512×512 output): the autoencoder downsamples by 8× in each spatial dimension, so the diffusion network processes 64× fewer spatial positions than a pixel-space model at the same output resolution.
- Trained on a filtered subset of LAION-5B, a dataset of approximately 5.85 billion image-text pairs scraped from the public web.
- Could generate a 512×512 image in under 10 seconds on a consumer NVIDIA RTX 3090 GPU, fast enough for interactive experimentation on hardware that individuals could own.
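The latent-space figures in the list above can be checked with a few lines of arithmetic. This is a small sketch rather than anything taken from the model's code: the function name is hypothetical, and the 4-channel latent is an assumption based on the v1 checkpoints (the channel count is not stated above).

```python
def latent_compression(image_size=512, image_channels=3,
                       downsample=8, latent_channels=4):
    """Compare the size of a pixel-space tensor with its latent counterpart.

    latent_channels=4 is an assumption about the v1 autoencoder.
    """
    latent_size = image_size // downsample
    pixel_positions = image_size * image_size
    latent_positions = latent_size * latent_size
    pixel_values = pixel_positions * image_channels
    latent_values = latent_positions * latent_channels
    return {
        "latent_size": latent_size,
        "position_ratio": pixel_positions / latent_positions,
        "value_ratio": pixel_values / latent_values,
    }


stats = latent_compression()
print(stats["latent_size"])      # side length of the latent grid
print(stats["position_ratio"])   # how many times fewer spatial positions
print(stats["value_ratio"])      # how many times fewer numbers overall
```

Under these assumptions a 512×512 RGB image maps to a 64×64 latent grid with 64 times fewer spatial positions and 48 times fewer values overall, which is where most of the memory and compute savings come from.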
Stable Diffusion marked a turning point in who could participate in frontier AI development. Before its release, working with state-of-the-art generative image models required API access granted at the discretion of a small number of companies. Afterward, any researcher, artist, or developer with a mid-range GPU could inspect the weights, modify the architecture, build derivative tools, and publish results — without permission from any gatekeeper. This democratization of access accelerated the pace of applied research in diffusion models dramatically and shifted significant portions of the research frontier from closed labs to the open community.
The release also forced a serious public debate about the risks and responsibilities accompanying open model releases. Because the safeguards that shipped with the model, such as the safety filter in the reference code, could simply be switched off or removed, it was almost immediately used to generate content that proprietary systems refused to produce. This tension between the benefits of openness for research and the harms enabled by unrestricted access became a defining fault line in AI policy discussions through 2023 and beyond. Stable Diffusion made the open-versus-closed question in AI concrete and urgent in a way that no prior release had.
High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer · 2022
https://arxiv.org/abs/2112.10752
LAION-5B: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, Jenia Jitsev · 2022
https://arxiv.org/abs/2210.08402
Adding Conditional Control to Text-to-Image Diffusion Models
Lvmin Zhang, Anyi Rao, Maneesh Agrawala · 2023
https://arxiv.org/abs/2302.05543
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, Kfir Aberman · 2023
https://arxiv.org/abs/2208.12242