Diffusion Models Demystified: From DALL-E 2 to Stable Diffusion

DALL-E 2 dropped a few weeks ago and my entire Twitter timeline lost its mind. Rightfully so. The outputs are absurd. "An astronaut riding a horse in photorealistic style" and it just... does it. But underneath the demos, the architecture is genuinely elegant, and as someone who spent two years building super-resolution models with GANs, I find the diffusion approach fascinating for very specific reasons.
Let me break down how this actually works.
The Core Idea: Learn to Denoise
The intuition behind diffusion models is surprisingly simple. Take a clean image. Gradually add Gaussian noise to it over many steps until it becomes pure static. Then train a neural network to reverse that process, to take noisy images and predict the noise that was added.
Forward process: Given an image x_0, we add noise over T timesteps. At each step t, we add a small amount of Gaussian noise controlled by a schedule (beta_1, beta_2, ..., beta_T). After enough steps, x_T is indistinguishable from random noise.
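A nice property of this forward process is that you can jump straight to any timestep t in closed form: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, where alpha_bar_t is the cumulative product of (1 - beta_s). Here's a dependency-free sketch; the linear beta schedule endpoints (1e-4 to 0.02) follow common DDPM practice, and scalars stand in for pixels:

```python
import math
import random

T = 1000
# Linear beta schedule from 1e-4 to 0.02, a common DDPM default.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar_t = product over s <= t of (1 - beta_s)
alpha_bars = []
running = 1.0
for b in betas:
    running *= (1.0 - b)
    alpha_bars.append(running)

def add_noise(x0, t, eps):
    """Jump straight to timestep t without looping over every step."""
    ab = alpha_bars[t]
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps

eps = random.gauss(0.0, 1.0)
x_early = add_noise(1.0, 10, eps)    # mostly signal
x_late = add_noise(1.0, T - 1, eps)  # almost pure noise
print(alpha_bars[-1])                # near zero: the signal has washed out
```

This closed form is what makes training efficient: you never have to simulate all T noising steps to get a training example at timestep t.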
Reverse process: A neural network learns to predict the noise at each step. Given x_t and the timestep t, the model outputs the predicted noise. Subtract it, and you get a slightly cleaner image. Repeat T times and you go from pure noise to a coherent image.
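The reverse loop looks like this in sketch form, following the standard DDPM update. The predict_noise function is a placeholder where the trained network would go (with a zero predictor, the loop just produces noise, but the structure is the same):

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars = []
running = 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

def predict_noise(x_t, t):
    # Placeholder for the trained U-Net; a real model conditions on the
    # noisy input, the timestep embedding, and (for text-to-image) the
    # prompt embedding.
    return 0.0

x = random.gauss(0.0, 1.0)  # start from pure noise x_T
for t in reversed(range(T)):
    eps_hat = predict_noise(x, t)
    # Subtract the predicted noise (the DDPM posterior mean).
    x = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(alphas[t])
    if t > 0:  # inject fresh noise at every step except the last
        x += math.sqrt(betas[t]) * random.gauss(0.0, 1.0)
```

Note that sampling re-adds a small amount of noise at each step; only the final step is purely deterministic. This stochasticity is part of why diffusion outputs are so diverse.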
The beauty of this compared to GANs? No mode collapse. No training instability. No delicate balance between generator and discriminator. You're just training a denoising network with a simple MSE loss.
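That "simple MSE loss" is literally a squared error between the noise you injected and the noise the model predicts (L_simple in the DDPM paper). A toy training step, with a model that always guesses zero standing in for the U-Net:

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bars = []
running = 1.0
for b in betas:
    running *= (1.0 - b)
    alpha_bars.append(running)

def model(x_t, t):
    return 0.0  # placeholder prediction of the injected noise

def training_step(x0):
    t = random.randrange(T)               # sample a random timestep
    eps = random.gauss(0.0, 1.0)          # the noise target
    ab = alpha_bars[t]
    x_t = math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps
    return (model(x_t, t) - eps) ** 2     # L_simple: MSE against the true noise

losses = [training_step(1.0) for _ in range(1000)]
print(sum(losses) / len(losses))  # hovers near 1.0 for the zero predictor
```

No discriminator, no adversarial game, no careful balancing act. Every training step is an independent regression problem.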
The Architecture Stack
Modern text-to-image diffusion models have three key components working together.
The U-Net. This is the denoising workhorse. A U-Net with residual blocks and, crucially, attention layers. The attention lets the model capture long-range dependencies in the image, something vanilla convolutions struggle with. The U-Net takes in the noisy image and the timestep embedding, and predicts the noise.
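The timestep embedding mentioned above is typically the sinusoidal encoding popularized by Transformers and reused in DDPM: each timestep maps to a vector of sines and cosines at geometrically spaced frequencies. A small sketch (the dimension here is tiny for readability; real models use hundreds):

```python
import math

def timestep_embedding(t, dim=8, max_period=10000.0):
    """Map an integer timestep to a vector of sines and cosines."""
    half = dim // 2
    # Frequencies decay geometrically from 1 down to 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

emb = timestep_embedding(500)
```

This gives the U-Net a smooth, unique signal for "how noisy is this input right now," which it needs because the same network handles all T denoising steps.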
The Text Encoder (CLIP). Your text prompt needs to become a numerical representation the U-Net can condition on. CLIP, trained on hundreds of millions of image-text pairs, encodes your prompt into an embedding that captures semantic meaning. This embedding gets injected into the U-Net via cross-attention layers.
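The cross-attention mechanics can be sketched in a few lines. This is a toy version: queries come from image positions, keys and values from text tokens, and I've omitted the learned Q/K/V projection matrices a real layer would have, so treat it as illustrating the data flow only:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(image_feats, text_feats):
    """image_feats: list of query vectors; text_feats: list of key/value vectors."""
    d = len(text_feats[0])
    out = []
    for q in image_feats:
        # Scaled dot-product scores between this image position and each token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in text_feats]
        weights = softmax(scores)
        # Each image position becomes a weighted mix of text token values.
        mixed = [sum(w * v[j] for w, v in zip(weights, text_feats))
                 for j in range(d)]
        out.append(mixed)
    return out

image_feats = [[1.0, 0.0], [0.0, 1.0]]  # two spatial positions
text_feats = [[1.0, 0.0], [0.5, 0.5]]   # two prompt token embeddings
out = cross_attention(image_feats, text_feats)
```

The key point: every spatial location in the image can attend to every word in the prompt, which is how "astronaut" ends up in the saddle and "horse" under it.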
The VAE (Variational Autoencoder). Here's the clever part. Running diffusion in pixel space is computationally brutal: a 512x512 RGB image is 786,432 dimensions. Instead, we compress the image into a latent space using a VAE encoder (roughly 64x64x4), run the entire diffusion process there, and decode back to pixel space at the end. This latent-space trick is what makes Stable Diffusion practical, cutting the compute cost of every denoising step dramatically.
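The back-of-the-envelope arithmetic makes the win obvious, using the pixel and latent shapes mentioned above:

```python
# Dimensionality of the space the U-Net has to denoise, per step.
pixel_dims = 512 * 512 * 3   # 786,432 values in RGB pixel space
latent_dims = 64 * 64 * 4    # 16,384 values in the VAE latent space

print(pixel_dims // latent_dims)  # 48x fewer dimensions per denoising step
```

And remember, that saving is paid out T times per generated image, once for every denoising step, while the VAE encode/decode happens only once.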
Why This Matters (Personally)
I built super-resolution models at Myelin using ESRGAN, a GAN-based approach. GANs are powerful but genuinely painful to train. The discriminator and generator play this adversarial game that constantly threatens to collapse. Hyperparameter sensitivity is brutal.
Diffusion models sidestep all of that. The training objective is clean. The outputs are diverse. And the quality, honestly, exceeds what I thought was possible a year ago.
I've been studying the Latent Diffusion paper (Rombach et al.) as part of my grad school prep, and the more I dig in, the more I think this is going to reshape generative AI completely. Not just images. Audio, video, 3D, anything where you can define a meaningful latent space.
The thing is, we're still early. Inference is slow (those T denoising steps add up), and the models are huge. But if the last two years of ML have taught me anything, it's that once an approach is proven, the optimization community moves terrifyingly fast. My bet: within a couple of years, multimodal models take this vision-language connection even further, folding generation and understanding into a single architecture.