Learning to reverse the arrow of entropy — noise into signal, chaos into image
Diffusion models learn to reverse a gradual noising process. During training, data is corrupted step by step with Gaussian noise until only noise remains. The model learns to predict and remove that noise, so that at sampling time it can start from pure noise and recover a structured signal.
Adds small Gaussian noise at each step t, creating a Markov chain from clean data x₀ to pure noise x_T, where T is the final step.
A neural network learns to reverse each noising step, gradually recovering structure from noise.
Rather than predicting the denoised image directly, most models predict the noise added at each step.
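The noisy sample xₜ at any timestep t can be drawn directly from x₀ using the reparameterization trick, without simulating the intermediate steps: xₜ = √ᾱₜ x₀ + √(1-ᾱₜ) ε, where ε ~ N(0, I) and ᾱₜ is the cumulative product of (1-βₛ) for all s ≤ t.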
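The closed-form sampling above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than a reference implementation: it assumes the linear β schedule from 10⁻⁴ to 0.02 over 1000 steps, represents a data point as a plain list of floats, and uses hypothetical helper names (`linear_betas`, `cumulative_alphas`, `q_sample`).

```python
import math
import random


def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced per-step noise variances (assumed schedule)."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta_s) for s <= t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t directly from x_0 (t is a 0-based index into alpha_bar).

    Returns the noisy sample and the noise that was used, since the
    noise is what a noise-prediction model is trained to recover.
    """
    signal_scale = math.sqrt(alpha_bar[t])
    noise_scale = math.sqrt(1.0 - alpha_bar[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal_scale * x + noise_scale * e for x, e in zip(x0, eps)]
    return xt, eps


alpha_bar = cumulative_alphas(linear_betas())
rng = random.Random(0)
x0 = [1.0, -1.0, 0.5]
x_mid, eps_mid = q_sample(x0, 499, alpha_bar, rng)   # partly noised
x_end, eps_end = q_sample(x0, 999, alpha_bar, rng)   # almost pure noise
```

Returning `eps` alongside `xt` mirrors the noise-prediction setup: a training step would compare the network's output at (xₜ, t) against this `eps`.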
The backbone of most diffusion models is a U-Net: an encoder-decoder with skip connections. The encoder compresses spatial resolution while expanding channels; the decoder reverses this. Skip connections preserve fine detail that would otherwise be lost in compression.
Each U-Net level contains residual blocks with group normalization and SiLU activation. Time embeddings are injected via adaptive group normalization (AdaGN).
At lower resolutions, spatial self-attention allows the model to relate distant image regions and integrate global context.
Text or other conditioning signals are injected via cross-attention, where image features attend to conditioning tokens.
Encoder features are concatenated with decoder features at each resolution, preserving spatial detail across the bottleneck.
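To make the encoder, decoder, and skip-connection bookkeeping concrete, the sketch below traces tensor shapes through a simplified U-Net. It does no actual computation. The base width of 128, the channel multipliers (1, 2, 4, 8), one skip per level, and the rule "attention at 16×16 and below" are illustrative assumptions, not values taken from any particular model.

```python
def trace_unet_shapes(resolution=64, base_channels=128,
                      channel_mults=(1, 2, 4, 8), attn_max_res=16):
    """Trace channels and spatial size through a simplified U-Net.

    Assumes one skip connection per level; real implementations usually
    keep one skip per residual block, so actual counts differ.
    """
    encoder = []
    res = resolution
    for i, mult in enumerate(channel_mults):
        encoder.append({
            "channels": base_channels * mult,
            "resolution": res,
            # self-attention only where the feature map is small enough
            "attention": res <= attn_max_res,
        })
        if i < len(channel_mults) - 1:
            res //= 2  # downsample between levels, not after the last one

    decoder = []
    incoming = encoder[-1]["channels"]  # features leaving the bottleneck
    for level in reversed(encoder):
        decoder.append({
            # skip connection: encoder features concatenated on the channel axis
            "in_channels": incoming + level["channels"],
            "out_channels": level["channels"],
            "resolution": level["resolution"],
        })
        incoming = level["channels"]
    return encoder, decoder


encoder, decoder = trace_unet_shapes()
for level in encoder:
    print("enc", level)
for level in decoder:
    print("dec", level)
```

The decoder's `in_channels` is always larger than its `out_channels` because of the concatenation, which is how the fine detail from the encoder reaches the output.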
Latent diffusion models (LDMs) compress images into a compact latent space using a Variational Autoencoder. Diffusion happens in this lower-dimensional space, dramatically reducing compute. The VAE decoder then maps latents back to pixel space.
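A quick calculation shows where the compute saving comes from. The sketch assumes a VAE that downsamples each spatial side by 8 and produces 4 latent channels, a configuration commonly associated with Stable Diffusion. Both numbers are assumptions here, and the function names are hypothetical.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the latent for a given image size."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, image_channels=3,
                      downsample=8, latent_channels=4):
    """How many pixel-space values correspond to one latent value."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (c * h * w)


shape = latent_shape(512, 512)
ratio = compression_ratio(512, 512)
print(shape, ratio)
```

Under these assumptions the denoiser sees a 4×64×64 tensor instead of a 3×512×512 image, so every denoising step operates on far fewer values.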
Different samplers traverse the denoising path differently. DDPM takes on the order of a thousand small stochastic steps. DDIM skips steps with deterministic updates. DPM++ uses higher-order ODE solvers. Each makes a different trade-off between quality, speed, and stochasticity.
Original stochastic sampler. Typically run with about 1000 steps for best quality. Each step denoises, then adds back a controlled amount of noise.
Deterministic skip-step sampler. 50 steps achieves near-DDPM quality. Same seed → same image.
Multistep ODE solver. 20–30 steps with excellent quality. DPM++ 2M Karras is a community favorite.
Simple first-order ODE solver. Fast and surprisingly effective in 20 steps. Basis for Euler Ancestral.
Second-order method with corrector step. Better accuracy per step, at the cost of two network function evaluations (NFE) per step.
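The difference between a first-order Euler step and a second-order step with a corrector can be seen on a toy problem. The sketch below is an illustration under strong assumptions: the data is one-dimensional Gaussian with standard deviation 1, so the ideal denoiser has a closed form and stands in for a trained network. The ODE is written in terms of the noise level σ as dx/dσ = (x − D(x, σ))/σ, and the σ grid is linear from 10 to 0.01. All function names are hypothetical.

```python
import math

DATA_STD = 1.0  # assumed toy data distribution: x0 ~ N(0, DATA_STD^2)


def ideal_denoiser(x, sigma):
    """Posterior mean E[x0 | x] for Gaussian data; stands in for a network."""
    return x * DATA_STD ** 2 / (DATA_STD ** 2 + sigma ** 2)


def ode_slope(x, sigma):
    """dx/dsigma of the deterministic sampling ODE (requires sigma > 0)."""
    return (x - ideal_denoiser(x, sigma)) / sigma


def sample_euler(x, sigmas):
    nfe = 0
    for s, s_next in zip(sigmas, sigmas[1:]):
        d = ode_slope(x, s)
        nfe += 1
        x = x + (s_next - s) * d
    return x, nfe


def sample_heun(x, sigmas):
    nfe = 0
    for s, s_next in zip(sigmas, sigmas[1:]):
        d = ode_slope(x, s)
        x_pred = x + (s_next - s) * d           # Euler predictor
        d_next = ode_slope(x_pred, s_next)      # slope at the predicted point
        nfe += 2
        x = x + (s_next - s) * 0.5 * (d + d_next)  # trapezoidal corrector
    return x, nfe


def exact_solution(x_start, sigma_start, sigma_end):
    """Closed-form ODE solution, available only for this toy problem."""
    return x_start * math.sqrt((DATA_STD ** 2 + sigma_end ** 2)
                               / (DATA_STD ** 2 + sigma_start ** 2))


steps = 20
sigmas = [10.0 + (0.01 - 10.0) * i / steps for i in range(steps + 1)]
x_start = 7.5  # one arbitrary starting point at the highest noise level
x_euler, nfe_euler = sample_euler(x_start, sigmas)
x_heun, nfe_heun = sample_heun(x_start, sigmas)
x_true = exact_solution(x_start, sigmas[0], sigmas[-1])
print("euler error", abs(x_euler - x_true), "NFE", nfe_euler)
print("heun error", abs(x_heun - x_true), "NFE", nfe_heun)
```

On this toy problem the second-order method lands closer to the exact solution for the same number of steps while using twice as many denoiser calls. Whether that pays off with a real network depends on the model and the step budget.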
Modern diffusion models are conditioned on rich signals: text, images, depth maps, poses, audio. These signals steer the denoising trajectory toward a desired output. Classifier-free guidance (CFG) amplifies the conditioning signal at inference time.
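Classifier-free guidance is usually written as an extrapolation from the unconditional noise prediction toward the conditional one. The sketch below assumes the two predictions are already available as lists of floats (in practice they come from two forward passes of the same network). The name `cfg_combine` is hypothetical.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Guided prediction: eps_uncond + scale * (eps_cond - eps_uncond).

    scale = 0 ignores the condition, scale = 1 is the plain conditional
    prediction, scale > 1 amplifies the direction the condition suggests.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
for scale in (0.0, 1.0, 2.0):
    print(scale, cfg_combine(eps_uncond, eps_cond, scale))
```

Larger scales follow the conditioning more closely, at the cost of sample diversity, and very large scales can oversaturate the output.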
The noise schedule β(t) controls how quickly signal is destroyed. The linear schedule (DDPM) can destroy signal earlier than necessary, so the last portion of the forward process contributes little. Cosine schedules preserve signal longer. Karras schedules focus denoising effort where it matters most.
Original DDPM schedule. β grows linearly from 10⁻⁴ to 0.02. Nichol & Dhariwal (2021) report that it works well at high resolutions but destroys signal too fast at low resolutions such as 32×32 and 64×64.
Proposed by Nichol & Dhariwal (2021). The cumulative signal level ᾱₜ follows a squared-cosine curve, so noise is added more slowly and structure survives further into the forward process.
Karras et al. (2022) formulate diffusion in terms of noise level σ directly, decoupling schedule from solver. Enables flexible step placement.
Lin et al. (2023) rescale the schedule so that the final timestep (T=1000) has zero signal-to-noise ratio and is truly pure noise. Prevents the "grey blob" issue and improves dark scene generation.
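The schedules described above can be compared numerically. This sketch makes several assumptions: T = 1000, the cosine offset s = 0.008, and the Karras defaults σ_min = 0.002, σ_max = 80, ρ = 7, which are the values commonly quoted for those methods. It omits the β clipping that the cosine schedule normally applies, and all function names are hypothetical.

```python
import math


def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level for a linear beta schedule."""
    out, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def cosine_alpha_bar(T=1000, s=0.008):
    """Cumulative signal level following a squared-cosine curve."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]


def karras_sigmas(n=20, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels from sigma_max down to sigma_min, denser near sigma_min."""
    low, high = sigma_min ** (1 / rho), sigma_max ** (1 / rho)
    return [(high + i / (n - 1) * (low - high)) ** rho for i in range(n)]


def rescale_zero_terminal_snr(alpha_bar):
    """Shift and scale sqrt(alpha_bar) so the last step carries no signal."""
    sqrt_ab = [math.sqrt(a) for a in alpha_bar]
    first, last = sqrt_ab[0], sqrt_ab[-1]
    scaled = [(v - last) * first / (first - last) for v in sqrt_ab]
    return [v * v for v in scaled]


def snr(alpha_bar_t):
    """Signal-to-noise ratio at one timestep."""
    return alpha_bar_t / (1.0 - alpha_bar_t)


linear = linear_alpha_bar()
cosine = cosine_alpha_bar()
fixed = rescale_zero_terminal_snr(linear)
print("midpoint signal level: linear", linear[499], "cosine", cosine[499])
print("terminal SNR: linear", snr(linear[-1]), "rescaled", snr(fixed[-1]))
print("Karras sigmas:", [round(v, 3) for v in karras_sigmas(n=8)])
```

Two things are visible in the output: the cosine curve retains more signal than the linear one halfway through, and the linear schedule's last step still has a small nonzero SNR until it is rescaled.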
The diffusion revolution was built by a small set of researchers and creative practitioners whose work, open-sourced and iterated publicly, transformed generative AI in under three years.
Lead author of DDPM (2020), the foundational paper that made diffusion models practical for image synthesis.
Score matching, the stochastic differential equation (SDE/ODE) framing of diffusion, and DDIM deterministic sampling.
Lead author of Latent Diffusion Models (LDM) and Stable Diffusion, enabling high-res synthesis on consumer hardware.
DALL-E and DALL-E 2. Pioneered large-scale text-to-image generation; DALL-E 2 uses diffusion conditioned on CLIP embeddings in a hierarchical pipeline.
Imagen: cascaded pixel-space diffusion with large language model text encoders, achieving photorealistic text-to-image.
Progressive distillation, precursor work to consistency models, and improved noise schedules for high-fidelity generation.
Open-source pioneer: CLIP-guided diffusion, k-diffusion library, aesthetic fine-tuning, and V-prediction parameterization.
EDM (Elucidating Diffusion Models): unified framework for noise schedules, samplers, and training, and the source of the Karras sigma schedule used in "DPM++ 2M Karras".
Key concepts in diffusion model theory and practice.