Most open-weight models and many API image products in the 2020s follow one broad recipe: start from random noise, then run a neural network many times in sequence to remove noise and form a coherent image, conditioned on a text prompt. The method family is denoising diffusion. Vendors brand it differently (DALL·E, Stable Diffusion, Imagen, FLUX); the outer loop is similar while encoders, backbones, and licenses differ.
The figure below is illustrative—not an exact frame-by-frame trace of any one commercial scheduler—but it matches the user-facing idea: static noise → emerging structure → detail → a sharp image.

The core loop: forward (training) vs reverse (sampling)
Forward process (intuition only): take a real image x₀, add Gaussian noise in T small steps until you obtain x_T, almost indistinguishable from television static. Training teaches the network to predict the noise (or a related score) at each step so the reverse process is learnable.
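In the standard DDPM formulation (Ho et al., 2020; one common member of this family, shown for concreteness rather than as any product's exact math), the noisy image at step t has a closed form, so training never has to simulate all T steps one by one:

```latex
% Standard DDPM notation: \beta_t is the step-t noise variance from the
% schedule, \bar{\alpha}_t the cumulative signal fraction that survives.
q(x_t \mid x_0) = \mathcal{N}\!\bigl(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\bigr),
\qquad
\bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)
```

Equivalently, x_t = √(ᾱ_t)·x₀ + √(1−ᾱ_t)·ε with ε ~ N(0, I), and that ε is exactly what the network learns to predict.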
Reverse process (what “generate” does): sample x_T from pure noise. For t = T, T−1, …, 1, run the denoiser so that each step removes a little randomness, using the text embedding (and sometimes masks, class labels, or other controls) at every step.
In practice, a scheduler / sampler rule chooses step sizes and how x_t is updated. Quality-oriented runs may use many steps; faster samplers and distilled models cut the step count, trading some quality for speed.
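As a concrete illustration, here is a minimal DDPM-style ancestral sampling loop in plain NumPy. It is a sketch of the reverse process above, not any vendor's scheduler; `denoiser` is a hypothetical placeholder for the trained noise-prediction network, and the linear schedule is only an assumption.

```python
import numpy as np

T = 50
betas = np.linspace(1e-4, 0.02, T)          # noise schedule (assumed linear)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x_t, t, text_embedding):
    """Placeholder for the trained network that predicts the noise eps."""
    return np.zeros_like(x_t)               # a real model returns predicted eps

def sample(shape, text_embedding, rng=np.random.default_rng(0)):
    x = rng.standard_normal(shape)          # x_T: pure Gaussian noise
    for t in reversed(range(T)):            # zero-indexed: t = T-1, ..., 0
        eps = denoiser(x, t, text_embedding)
        # Posterior mean: subtract the predicted noise component, rescale.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                           # add fresh noise except at the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

image_latent = sample((8, 8), text_embedding=None)
```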
Where the text enters: text encoders and conditioning
Text-to-image pipelines include a text encoder—a Transformer, a CLIP-style model, a T5-class encoder, or a large language model for long prompts. The output is a sequence of vectors that the image backbone conditions on, often with cross-attention (image feature maps attend to text tokens).
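A stripped-down version of that cross-attention step looks like this (illustrative shapes only; a real block adds learned query/key/value projections, multiple heads, and normalization):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                      # shared attention dimension (assumed)
img_feats = rng.standard_normal((256, d))   # 16x16 latent grid flattened: 256 queries
txt_tokens = rng.standard_normal((77, d))   # e.g. 77 text-encoder output vectors

def cross_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])             # (256, 77) similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over text tokens
    return weights @ v                                  # text info mixed into image features

out = cross_attention(img_feats, txt_tokens, txt_tokens)  # (256, 64)
```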
Product names (DALL·E 3, Imagen, SDXL, FLUX, …) hide different weights and data; the pattern is semantics in, pixels or latents out.
Latent diffusion and the VAE (why not denoise 4K RGB directly?)
Denoising every pixel at full resolution is expensive. Latent diffusion (central to much of the Stable Diffusion line) first uses a VAE to encode the image to a smaller latent grid, runs the denoising network on that tensor, then decodes back to RGB. Related keywords: reconstruction loss, latent space, and (in some papers) multiscale decoders.
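The saving is easy to see with shape arithmetic, using the Stable Diffusion v1 convention (8× spatial downsampling into 4 latent channels) as an example:

```python
import numpy as np

H, W = 512, 512
rgb = np.zeros((H, W, 3))                  # full-resolution pixels: 786,432 values
latent = np.zeros((H // 8, W // 8, 4))     # 64x64x4 latent: 16,384 values

print(rgb.size / latent.size)              # 48.0 -> ~48x fewer values to denoise
```

The denoising loop runs entirely on the small latent tensor; the VAE decoder maps the final latent back to RGB once, at the end.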
U-Net vs diffusion transformer (DiT)
- U-Net — Convolutional “hourglass” with skip connections; the classic backbone in many SD-era systems and SDXL.
- DiT (diffusion transformer) — Transformer blocks on patches in latent space; same outer sampling story, different inner operator and scaling (see the patchify sketch after this list).
Practical takeaway: API knobs (step count, guidance scale, resolution) matter as much as the architecture name.
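To make the "different inner operator" concrete, here is the patchify step a DiT applies before its transformer blocks. This is a sketch: patch size and dimensions are illustrative, not taken from any specific checkpoint, and a real model follows it with a learned linear projection to the transformer width.

```python
import numpy as np

C, H, W, p = 4, 32, 32, 2                  # latent channels, grid size, patch size
latent = np.zeros((C, H, W))

# Split the grid into non-overlapping p x p patches, flatten each into a token.
tokens = (latent
          .reshape(C, H // p, p, W // p, p)
          .transpose(1, 3, 0, 2, 4)        # (H/p, W/p, C, p, p)
          .reshape((H // p) * (W // p), C * p * p))
print(tokens.shape)                        # (256, 16): 256 tokens of dim 16
```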
Glossary: keywords
| Term | One-line meaning |
|---|---|
| DDPM / score-based | Denoising diffusion probabilistic model or related score matching; learn p(x) via a noise schedule. |
| Latent diffusion (LDM) | Diffusion in a VAE latent grid instead of full-resolution pixels. |
| CFG (classifier-free guidance) | At sample time, mix conditional and unconditional predictions to pull samples toward the prompt; the scale is a user-tunable strength knob. |
| Scheduler / sampler | How each denoising step is taken (DDIM, DPM++, Euler, …—naming varies by implementation). |
| Text encoder | Frozen or co-trained model that embeds the prompt. |
| Cross-attention | Image features attend to text token vectors. |
| U-Net | Conv backbone used in many latent diffusion systems. |
| DiT | Diffusion transformer on latent patches. |
| Inpainting / outpainting | Condition on a mask to fill a region or extend the canvas. |
| LoRA | Low-rank adapters (and cousins) for cheap style or subject tuning. |
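Of these, CFG is the knob users touch most often. One denoising step with classifier-free guidance looks roughly like this (a sketch: `denoiser` stands in for the noise-prediction network as in the sampling loop above, and 7.5 is merely a common default in Stable Diffusion tooling, not a universal constant):

```python
def cfg_noise(denoiser, x_t, t, text_emb, guidance_scale=7.5):
    """Classifier-free guidance: run the model twice, then extrapolate."""
    eps_uncond = denoiser(x_t, t, None)        # unconditional ("empty prompt") pass
    eps_cond = denoiser(x_t, t, text_emb)      # prompt-conditioned pass
    # Push the prediction away from unconditional, toward the prompt.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```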
Product map (vocabulary, not a recommendation)
- OpenAI DALL·E — Closed-weight end-to-end product; strong emphasis on prompt robustness and safety layers.
- Stability / Stable Diffusion — Open weights and broad community tooling (ControlNet, img2img, regional prompts, …).
- Google Imagen — T5-class or similar text encoders plus diffusion backbones in Google's stack (see each generation's paper / card).
- FLUX (Black Forest Labs and partners) — Recent high-fidelity model lines; some checkpoints are open, others API-only.
Check each vendor's model card, license, and safety rules for the exact weights or API you run.
Read next (language models)
- What are tokens?
- What are parameters in an LLM?
- Context window in LLMs
- ChatGPT Images 2.0 and gpt-image-2 (OpenAI)
This article is a conceptual map. For deployment, use the model card, license, and safety documentation for your checkpoint.