Diffusion Models World

The Denoising Process

Diffusion models learn to reverse a gradual noising process. During training, data is corrupted step by step with Gaussian noise until only noise remains. The model learns to predict and remove that noise; at generation time it starts from pure noise and denoises step by step into a new sample.

Forward Process q(xₜ|xₜ₋₁)

Adds a small amount of Gaussian noise at each step t, creating a Markov chain from clean data x₀ to pure noise x_T, where T is the final timestep.

Reverse Process p_θ(xₜ₋₁|xₜ)

A neural network learns to reverse each noising step, gradually recovering structure from noise.

Noise Prediction ε_θ

Rather than predicting the denoised image directly, most models predict the noise added at each step.
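
Predicting the noise is equivalent to predicting the clean image, because the relation xₜ = √ᾱₜ x₀ + √(1−ᾱₜ) ε can be solved for x₀. The following is a minimal NumPy sketch of that inversion; the function name `predict_x0` and the value chosen for ᾱₜ are illustrative assumptions, and the true noise stands in for a trained network's output.

```python
import numpy as np

def predict_x0(x_t, eps_pred, alpha_bar_t):
    """Invert x_t = sqrt(a) * x0 + sqrt(1 - a) * eps for x0, given a noise estimate."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 8, 8))      # stand-in for a clean image
eps = rng.standard_normal(x0.shape)      # the noise that was actually added
alpha_bar_t = 0.3                        # assumed cumulative signal level at step t
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# With a perfect noise prediction, the clean image is recovered exactly.
x0_hat = predict_x0(x_t, eps, alpha_bar_t)
```

In practice the network's estimate is imperfect, so x0_hat is only an approximation that improves as the noise level falls.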

Closed-Form Sampling

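The noisy sample xₜ at any timestep can be drawn directly from x₀ using the reparameterization trick: xₜ = √ᾱₜ x₀ + √(1−ᾱₜ) ε, where ε ~ N(0, I) and ᾱₜ is the cumulative product of (1 − βₛ) over all steps s ≤ t.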
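
The closed form can be sketched in a few lines of NumPy. This is an illustration under assumptions, not a reference implementation: it uses the linear β schedule (10⁻⁴ to 0.02 over 1000 steps), and the helper name `q_sample` is made up for the example.

```python
import numpy as np

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot, without iterating over earlier steps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

betas = np.linspace(1e-4, 0.02, 1000)     # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)       # alpha_bar[t] = product of (1 - beta_s) for s <= t

rng = np.random.default_rng(0)
x0 = np.ones((4, 4))
x_early, _ = q_sample(x0, 10, alpha_bar, rng)    # still mostly signal
x_late, _ = q_sample(x0, 999, alpha_bar, rng)    # almost pure noise
```

Because any xₜ is available directly, training can pick a random timestep per example instead of simulating the whole chain.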

U-Net Architecture

The backbone of most diffusion models is a U-Net: an encoder-decoder with skip connections. The encoder compresses spatial resolution while expanding channels; the decoder reverses this. Skip connections preserve fine detail that would otherwise be lost in compression.

ResNet Blocks

Each U-Net level contains residual blocks with group normalization and SiLU activation. Time embeddings are injected into each block, either by addition or via adaptive group normalization (AdaGN).
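
Before a time embedding can be injected, the integer timestep has to be turned into a vector. One common choice is a sinusoidal encoding, sketched below. The text does not say which encoding is used, so the cosine-then-sine layout, the `max_period` value, and the function name are all assumptions for illustration; `dim` is assumed to be even.

```python
import numpy as np

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of timesteps t (shape [B]) into vectors of shape [B, dim]."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.cos(args), np.sin(args)], axis=-1)

emb = timestep_embedding([0, 250, 999], dim=128)
```

The resulting vector is usually passed through a small MLP before being added to, or used to modulate, the features in each residual block.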

Self-Attention

At lower resolutions, spatial self-attention allows the model to relate distant image regions and integrate global context.

Cross-Attention

Text or other conditioning signals are injected via cross-attention, where image features attend to conditioning tokens.
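
A single-head version of this mechanism can be sketched as follows. The feature sizes (a 16×16 map with 320 channels, 77 tokens of width 768, head width 64) and the random projection matrices are assumptions chosen for illustration; a real model learns these projections and uses several heads.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_feats, cond_tokens, Wq, Wk, Wv):
    """Queries come from image features; keys and values come from conditioning tokens."""
    Q = img_feats @ Wq                         # [N_pixels, d]
    K = cond_tokens @ Wk                       # [N_tokens, d]
    V = cond_tokens @ Wv                       # [N_tokens, d]
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # [N_pixels, N_tokens]
    return weights @ V, weights

rng = np.random.default_rng(0)
img_feats = rng.standard_normal((16 * 16, 320))   # flattened feature map (sizes assumed)
cond_tokens = rng.standard_normal((77, 768))      # e.g. text-encoder output
d = 64
Wq = rng.standard_normal((320, d)) * 0.02
Wk = rng.standard_normal((768, d)) * 0.02
Wv = rng.standard_normal((768, d)) * 0.02
out, weights = cross_attention(img_feats, cond_tokens, Wq, Wk, Wv)
```

Each row of `weights` is a distribution over conditioning tokens, so every spatial location decides which parts of the prompt to read from.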

Skip Connections

Encoder features are concatenated with decoder features at each resolution, preserving spatial detail across the bottleneck.
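
The shape bookkeeping behind a skip connection is easy to see in a toy example. The sketch below uses average pooling and nearest-neighbour upsampling as stand-ins for learned layers, and the channel and resolution values are arbitrary assumptions.

```python
import numpy as np

def downsample(x):
    """2x average pooling on a [C, H, W] array."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """2x nearest-neighbour upsampling on a [C, H, W] array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

enc = np.random.default_rng(0).standard_normal((64, 32, 32))   # encoder features, kept for the skip
bottleneck = downsample(enc)                   # [64, 16, 16]; real U-Nets also widen channels here
dec = upsample(bottleneck)                     # [64, 32, 32], but fine detail has been averaged away
merged = np.concatenate([dec, enc], axis=0)    # skip connection: [128, 32, 32]
```

The decoder block that follows sees both the coarse upsampled features and the original fine-grained encoder features, which is why detail survives the bottleneck.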

VAE Latent Space

Latent diffusion models (LDMs) compress images into a compact latent space using a Variational Autoencoder. Diffusion happens in this lower-dimensional space, dramatically reducing compute. The VAE decoder then maps latents back to pixel space.

Stable Diffusion's VAE compresses 512×512 RGB images into 64×64×4 latents: an 8× downsampling per side (64× fewer spatial positions) and a 48× reduction in the total number of values. The latent space is regularized with a small KL divergence penalty to remain approximately Gaussian, enabling diffusion sampling.
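
The compression figures follow from simple arithmetic on the tensor shapes. The snippet below works them out; the function name is illustrative. Note that the 48× figure counts all values including channels, while the downsampling along each spatial side is 8×.

```python
def compression(image_hw=512, image_c=3, latent_hw=64, latent_c=4):
    """Return (per-side factor, spatial-position factor, total-value factor)."""
    per_side = image_hw // latent_hw
    spatial = per_side ** 2
    total = (image_hw * image_hw * image_c) / (latent_hw * latent_hw * latent_c)
    return per_side, spatial, total

per_side, spatial, total = compression()
```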

Sampler Trajectories

Different samplers traverse the denoising path differently. DDPM takes on the order of a thousand small stochastic steps. DDIM skips steps with deterministic updates. DPM++ uses higher-order ODE solvers. Each trades off quality, speed, and stochasticity.

DDPM

Original stochastic sampler. Typically run with about 1000 steps for best quality. Each step denoises, then re-injects a controlled amount of noise.
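
A single DDPM update can be sketched as below. This assumes the linear β schedule and the variance choice σₜ² = βₜ, which is one of the options in Ho et al. (2020); the zero array stands in for a trained network's noise prediction, and the function name is made up for the example.

```python
import numpy as np

def ddpm_step(x_t, eps_pred, t, betas, alpha_bar, rng):
    """One ancestral DDPM update x_t -> x_{t-1}, using sigma_t^2 = beta_t."""
    alpha_t = 1.0 - betas[t]
    # Posterior mean: remove the predicted noise, then undo the step's signal scaling.
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_t)
    if t == 0:
        return mean                           # no noise is added on the final step
    noise = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * noise   # re-inject fresh noise

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x_t = rng.standard_normal((2, 2))
eps_pred = np.zeros_like(x_t)                 # placeholder for a trained network's output
x_prev = ddpm_step(x_t, eps_pred, 500, betas, alpha_bar, rng)
x_final = ddpm_step(x_t, eps_pred, 0, betas, alpha_bar, rng)
```

The fresh noise added at every step except the last is what makes the sampler stochastic: two runs from the same starting noise diverge unless the random generator is seeded identically.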

DDIM

Deterministic skip-step sampler. 50 steps achieves near-DDPM quality. Same seed → same image.
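
The deterministic update (η = 0) can be written as a jump between any two noise levels, which is what allows steps to be skipped. The sketch below assumes the linear β schedule and an evenly spaced 50-step subsequence; the true noise is used as an oracle in place of a network, purely for illustration.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update (eta = 0) between two arbitrary noise levels."""
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
timesteps = np.linspace(999, 0, 50).round().astype(int)   # 50 of the 1000 training steps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))
eps = rng.standard_normal((4, 4))
t, t_prev = timesteps[0], timesteps[1]
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
x_prev = ddim_step(x_t, eps, alpha_bar[t], alpha_bar[t_prev])   # oracle eps, for illustration
```

No random draw appears inside the update, so the output depends only on the initial noise and the network.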

DPM++

Multistep ODE solver. 20–30 steps with excellent quality. DPM++ 2M Karras is a community favorite.

Euler

Simple first-order ODE solver. Fast and surprisingly effective in 20 steps. Basis for Euler Ancestral.

Heun

Second-order method with corrector step. Better accuracy per step at cost of 2× NFE per iteration.
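
The difference between a first-order and a second-order solver can be seen on a toy problem with a known answer. The sketch below assumes the σ-parameterized ODE dx/dσ = (x − D(x, σ)) / σ and a one-dimensional Gaussian "dataset" N(0, S²), for which the ideal denoiser D is available in closed form; none of this is taken from a specific library, and the linear σ spacing is chosen only for simplicity.

```python
import math

S = 1.0  # assumed data std: the toy dataset is N(0, S^2), so the ideal denoiser is known

def denoiser(x, sigma):
    """Posterior mean E[x0 | x] for Gaussian data; stands in for a trained network."""
    return x * S**2 / (S**2 + sigma**2)

def dx_dsigma(x, sigma):
    """Slope of the sampling ODE in the sigma parameterization."""
    return (x - denoiser(x, sigma)) / sigma

def euler(x, sigmas):
    for s, s_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (s_next - s) * dx_dsigma(x, s)             # 1 network evaluation per step
    return x

def heun(x, sigmas):
    for s, s_next in zip(sigmas[:-1], sigmas[1:]):
        d = dx_dsigma(x, s)
        x_pred = x + (s_next - s) * d                       # Euler predictor
        d_next = dx_dsigma(x_pred, s_next)                  # slope at the predicted point
        x = x + (s_next - s) * 0.5 * (d + d_next)           # 2 network evaluations per step
    return x

n = 40
sigmas = [10.0 + (0.1 - 10.0) * i / n for i in range(n + 1)]   # from sigma=10 down to 0.1
x_start = 5.0
# Exact solution of this ODE: x is proportional to sqrt(S^2 + sigma^2).
exact = x_start * math.sqrt((S**2 + sigmas[-1]**2) / (S**2 + sigmas[0]**2))
err_euler = abs(euler(x_start, sigmas) - exact)
err_heun = abs(heun(x_start, sigmas) - exact)
```

On this problem Heun lands closer to the exact endpoint than Euler for the same number of steps, at twice the number of denoiser calls.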

Conditioning Systems

Modern diffusion models are conditioned on rich signals: text, images, depth maps, poses, audio. These signals steer the denoising trajectory toward a desired output. Classifier-free guidance (CFG) amplifies the conditioning signal at inference time.

CLIP Text Encoding: Text prompts are tokenized and encoded by a transformer (the CLIP ViT-L/14 text encoder in Stable Diffusion 1.x). The resulting 77×768 token embeddings are injected into the U-Net via cross-attention at each resolution level. CLIP aligns text and image embeddings in a shared space learned from 400M image-text pairs.

Image Conditioning (IP-Adapter, img2img): Image prompts can supplement or replace text. A CLIP image encoder extracts style and content embeddings that are cross-attended alongside or instead of text. img2img adds partial noise to an existing image before denoising, blending structure from the source with direction from the prompt.

ControlNet clones the U-Net encoder and middle-block weights and trains the copy on spatial conditions (edge maps, depth, pose skeletons, segmentation masks). The copy's outputs pass through zero-initialized convolutions and are added as residuals to the main U-Net's skip connections and middle block, providing strong spatial guidance while the base model stays frozen.

Classifier-Free Guidance runs two forward passes: one conditional (ε_θ(xₜ,c)) and one unconditional (ε_θ(xₜ,∅)). The output is ε̃ = ε_uncond + w·(ε_cond − ε_uncond). Higher w gives stronger prompt adherence at the cost of reduced diversity and possible oversaturation.
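
The guidance formula is a one-line extrapolation. The sketch below applies it to two hand-picked vectors standing in for the network's conditional and unconditional predictions; the function name is illustrative.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    """Classifier-free guidance: move from the unconditional toward the conditional prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])   # toy unconditional prediction
eps_cond = np.array([1.0, 1.0])     # toy conditional prediction
guided = cfg(eps_uncond, eps_cond, w=7.5)
```

With w = 0 the result is purely unconditional, with w = 1 it equals the conditional prediction, and values above 1 push past it, which is where both the stronger prompt adherence and the saturation artifacts come from.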

Noise Schedules

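The noise schedule β(t) controls how quickly signal is destroyed. Linear schedules (DDPM) drive the signal close to zero well before the final step, so the last steps contribute little; Nichol & Dhariwal (2021) found this most harmful at low resolutions such as 32×32 and 64×64. Cosine schedules preserve signal longer. Karras schedules concentrate steps at the noise levels where they matter most.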

Linear β

Original DDPM schedule. β grows linearly from 10⁻⁴ to 0.02 over 1000 steps. The signal is nearly gone well before the final step, which Nichol & Dhariwal (2021) found suboptimal at low resolutions such as 32×32 and 64×64.
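
How fast the linear schedule removes signal can be checked directly. The sketch below assumes T = 1000 steps and treats "within 1% of pure noise" (ᾱₜ < 0.01) as an arbitrary threshold chosen for illustration.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear schedule
alpha_bar = np.cumprod(1.0 - betas)     # fraction of signal variance remaining at each step
snr = alpha_bar / (1.0 - alpha_bar)     # signal-to-noise ratio at each step

# How many of the final steps already hold less than 1% of the signal variance?
near_noise_steps = int((alpha_bar < 0.01).sum())
```

A sizeable share of the chain falls below the threshold, which is the sense in which the linear schedule spends steps on inputs that are already almost pure noise.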

Cosine ᾱ

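Proposed by Nichol & Dhariwal (2021). ᾱₜ follows a squared-cosine curve, so signal decays gradually through the middle of the process instead of collapsing toward noise early. The authors report the largest gains at low resolutions such as 32×32 and 64×64.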
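
The cosine schedule is defined through ᾱₜ instead of through β. A minimal sketch, assuming the offset s = 0.008 and the β clipping at 0.999 described by Nichol & Dhariwal; the function name is illustrative.

```python
import numpy as np

def cosine_alpha_bar(T=1000, s=0.008):
    """alpha_bar_t = f(t) / f(0), with f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]

alpha_bar = cosine_alpha_bar()                  # T + 1 values, starting at exactly 1
betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]    # per-step betas implied by alpha_bar
betas = np.clip(betas, 0.0, 0.999)              # clipping avoids a singular final step
```

Roughly half of the signal variance is still present at the midpoint of this schedule, in contrast to the much faster decay of the linear one.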

Karras σ(t)

Karras et al. (2022) formulate diffusion in terms of noise level σ directly, decoupling schedule from solver. Enables flexible step placement.
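
Step placement in this formulation comes down to choosing a list of σ values. The sketch below spaces them uniformly in σ^(1/ρ); the defaults σ_min = 0.002, σ_max = 80 and ρ = 7 are assumed here as commonly quoted values and should be checked against the model in use. Implementations often append a final σ = 0, which is omitted for brevity.

```python
def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels from sigma_max down to sigma_min, spaced uniformly in sigma^(1/rho)."""
    lo, hi = sigma_min ** (1.0 / rho), sigma_max ** (1.0 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]

sigmas = karras_sigmas(20)
```

A larger ρ shortens the steps near σ_min at the expense of longer steps near σ_max, so more of the budget goes to the low-noise end.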

Zero-Terminal SNR

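Lin et al. (2023) rescale the schedule so that the final timestep (T=1000) has exactly zero SNR, i.e. is truly pure noise, matching what the sampler starts from at inference. This prevents the "grey blob" issue (outputs pulled toward medium brightness) and improves dark scene generation.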
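
The rescaling can be sketched as a shift and scale of √ᾱₜ that keeps the first step unchanged and forces the last to zero. The version below follows that idea on the linear schedule; the function name is illustrative and the code is a sketch, not the authors' reference implementation.

```python
import numpy as np

def rescale_zero_terminal_snr(betas):
    """Shift and scale sqrt(alpha_bar) so that the last step has exactly zero signal."""
    alpha_bar_sqrt = np.sqrt(np.cumprod(1.0 - betas))
    first, last = alpha_bar_sqrt[0], alpha_bar_sqrt[-1]
    # Subtract the terminal value, then rescale so the first value is unchanged.
    alpha_bar_sqrt = (alpha_bar_sqrt - last) * first / (first - last)
    alpha_bar = alpha_bar_sqrt ** 2
    # Convert the cumulative products back to per-step alphas and betas.
    alphas = np.concatenate([alpha_bar[:1], alpha_bar[1:] / alpha_bar[:-1]])
    return 1.0 - alphas

betas = np.linspace(1e-4, 0.02, 1000)
new_betas = rescale_zero_terminal_snr(betas)
new_alpha_bar = np.cumprod(1.0 - new_betas)
```

With zero signal at the final step, the noise ε no longer carries information about x₀ there, which is one reason this fix is usually paired with v-prediction.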

Researchers & Artists

The diffusion revolution was built by a small set of researchers and creative practitioners whose work, open-sourced and iterated publicly, transformed generative AI in under three years.

🔬

Jonathan Ho

Google Brain

Lead author of DDPM (2020), the foundational paper that made diffusion models practical for image synthesis.

🌊

Yang Song

OpenAI / Stanford

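Score-based generative modeling and the stochastic differential equation (SDE) and probability-flow ODE framing of diffusion. (DDIM deterministic sampling is due to Jiaming Song, Chenlin Meng, and Stefano Ermon.)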

🖼️

Robin Rombach

Stability AI / LMU

Lead author of Latent Diffusion Models (LDM) and Stable Diffusion, enabling high-res synthesis on consumer hardware.

🎨

Aditya Ramesh

OpenAI

DALL-E and DALL-E 2. DALL-E is an autoregressive text-to-image transformer; DALL-E 2 generates images hierarchically, producing a CLIP image embedding from text and decoding it with a diffusion model.

🖌️

Chitwan Saharia

Google Brain

Imagen: cascaded pixel-space diffusion with large language model text encoders, achieving photorealistic text-to-image.

⚗️

Tim Salimans

Google Brain

Progressive distillation, precursor work to consistency models, and improved noise schedules for high-fidelity generation.

🌺

Katherine Crowson

Independent

Open-source pioneer: CLIP-guided diffusion, k-diffusion library, aesthetic fine-tuning, and V-prediction parameterization.

📐

Tero Karras

NVIDIA

EDM ("Elucidating the Design Space of Diffusion-Based Generative Models"): a unified framework for noise schedules, samplers, and training. Its σ schedule is the "Karras" in sampler names such as DPM++ 2M Karras.

Glossary

Key concepts in diffusion model theory and practice.

DDPM: Denoising Diffusion Probabilistic Model. Ho et al. (2020). Predicts noise ε at each step t via a learned neural network.
DDIM: Denoising Diffusion Implicit Model. Deterministic skip-step sampling, enabling 50-step generation with DDPM-quality output.
Score Matching: Training objective to match the gradient of the log data density ∇log p(x). Equivalent to noise prediction under certain conditions.
Langevin Dynamics: MCMC sampling method using the score function to walk toward high-density regions. Foundation for score-based generative models.
Latent Diffusion: Running diffusion in a compressed latent space (VAE), not pixel space. Enables high-resolution generation at feasible compute cost.
VAE: Variational Autoencoder. Encoder maps images to latent distributions; decoder reconstructs images. KL loss regularizes the latent space.
CLIP: Contrastive Language-Image Pretraining. Aligns text and image embeddings in a shared space. Used as the text encoder in many text-to-image systems.
CFG: Classifier-Free Guidance. Amplifies conditional vs. unconditional score difference at inference to strengthen prompt adherence.
ControlNet: Auxiliary network that adds spatial conditioning (edges, depth, pose) to a pretrained diffusion model without retraining the base.
LoRA: Low-Rank Adaptation. Efficient fine-tuning by injecting small trainable rank decomposition matrices into frozen pretrained weights.
Noise Schedule β(t): Sequence of noise variances added at each diffusion step. Choice of schedule affects image quality, especially at high resolution.
SNR: Signal-to-Noise Ratio. Ratio of signal power to noise power at a given timestep. Determines how much original structure remains in xₜ.
U-Net: Encoder-decoder architecture with skip connections. Used as the denoising backbone in most diffusion models (original and latent).
Timestep Embedding: Sinusoidal or learned encoding of t injected into each U-Net block, telling the model which noise level it is currently denoising.
V-Prediction: Alternative training target: predict v = √ᾱₜ ε − √(1−ᾱₜ) x₀ instead of noise ε. Improves stability at high noise levels (low SNR), where ε-prediction becomes ill-conditioned.
Flow Matching: Generalization of diffusion using arbitrary probability paths. Rectified Flow (Stable Diffusion 3) uses straight-line paths for efficient sampling.
Consistency Model: Single-step or few-step model trained to map any noisy xₜ directly to x₀, avoiding long iterative sampling.
DPM-Solver: Fast ODE solver exploiting the semi-linear structure of the diffusion ODE. DPM++ 2M achieves near-DDPM quality in 20 steps.
Ancestral Sampling: Stochastic DDPM-style sampling that adds noise at each step. Produces diverse outputs but paths are non-deterministic.
IP-Adapter: Image prompt adapter using a decoupled cross-attention mechanism to condition on reference images without changing base model weights.
Textual Inversion: Learns a new token embedding to represent a concept from a few reference images. Encodes style or subject into a single prompt token.
Rectified Flow: Trains a flow model on straight trajectories connecting noise to data. Used in SD3 and FLUX. Fewer steps needed due to straighter paths.
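
As one worked illustration of the glossary, the LoRA entry can be made concrete with a small NumPy sketch. The layer sizes, rank, and scaling factor α are illustrative assumptions and not taken from any specific model; zero-initializing B is a common convention that makes the adapted layer start out identical to the frozen one.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 320, 768, 4, 4.0     # illustrative sizes and rank

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01     # trainable, small random init
B = np.zeros((d_out, r))                      # trainable, zero init so training starts at W

def lora_forward(x):
    """y = W x + (alpha / r) * B (A x); only A and B would receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
y = lora_forward(x)
full_params = W.size                          # parameters in the frozen layer
lora_params = A.size + B.size                 # parameters actually trained
```

The trained parameter count grows with r·(d_in + d_out) instead of d_in·d_out, which is why LoRA files are small compared with full checkpoints.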