Most open-weight models and many API image products in the 2020s follow one broad recipe: start from random noise, then run a neural network many times in sequence to remove noise and form a coherent image, conditioned on a text prompt. The method family is denoising diffusion. Vendors brand it differently (DALL·E, Stable Diffusion, Imagen, FLUX); the outer loop is similar while encoders, backbones, and licenses differ.
The figure below is illustrative—not an exact frame-by-frame trace of any one commercial scheduler—but it matches the user-facing idea: static noise → emerging structure → detail → a sharp image.

The core loop: forward (training) vs reverse (sampling)
Forward process (intuition only): take a real image x₀, add Gaussian noise in T small steps until you obtain x_T, almost indistinguishable from television static. Training teaches the network to predict the noise (or a related score) at each step so the reverse process is learnable.
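In the standard DDPM formulation (Ho et al., 2020; one common member of this family, shown for concreteness rather than as any product's exact math), the noisy image at step t has a closed form, so training never has to simulate all T steps one by one:

```latex
% Standard DDPM notation: \beta_t is the step-t noise variance from the
% schedule, \bar{\alpha}_t the cumulative signal fraction that survives.
q(x_t \mid x_0) = \mathcal{N}\!\bigl(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\bigr),
\qquad
\bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)
```

Equivalently, x_t = √(ᾱ_t)·x₀ + √(1−ᾱ_t)·ε with ε ~ N(0, I), and that ε is exactly what the network learns to predict.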
Reverse process (what “generate” does): sample x_T from pure noise. For t = T, T−1, …, 1, run the denoiser so that each step removes a little randomness, using the text embedding (and sometimes masks, class labels, or other controls) at every step.
In practice, a scheduler / sampler rule chooses step sizes and how x_t is updated. Quality-oriented runs may use many steps; faster samplers and distilled models cut the step count, trading some quality for speed.
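As a concrete illustration, here is a minimal DDPM-style ancestral sampling loop in plain NumPy. It is a sketch of the reverse process above, not any vendor's scheduler; `denoiser` is a hypothetical placeholder for the trained noise-prediction network, and the linear schedule is only an assumption.

```python
import numpy as np

T = 50
betas = np.linspace(1e-4, 0.02, T)          # noise schedule (assumed linear)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x_t, t, text_embedding):
    """Placeholder for the trained network that predicts the noise eps."""
    return np.zeros_like(x_t)               # a real model returns predicted eps

def sample(shape, text_embedding, rng=np.random.default_rng(0)):
    x = rng.standard_normal(shape)          # x_T: pure Gaussian noise
    for t in reversed(range(T)):            # zero-indexed: t = T-1, ..., 0
        eps = denoiser(x, t, text_embedding)
        # Posterior mean: subtract the predicted noise component, rescale.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                           # add fresh noise except at the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

image_latent = sample((8, 8), text_embedding=None)
```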
Where the text enters: text encoders and conditioning
Text-to-image pipelines include a text encoder—a Transformer, a CLIP-style model, a T5-class encoder, or a large language model for long prompts. The output is a sequence of vectors that the image backbone conditions on, often with cross-attention (image feature maps attend to text tokens).
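A stripped-down version of that cross-attention step looks like this (illustrative shapes only; a real block adds learned query/key/value projections, multiple heads, and normalization):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                      # shared attention dimension (assumed)
img_feats = rng.standard_normal((256, d))   # 16x16 latent grid flattened: 256 queries
txt_tokens = rng.standard_normal((77, d))   # e.g. 77 text-encoder output vectors

def cross_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])             # (256, 77) similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over text tokens
    return weights @ v                                  # text info mixed into image features

out = cross_attention(img_feats, txt_tokens, txt_tokens)  # (256, 64)
```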
Product names (DALL·E 3, Imagen, SDXL, FLUX, …) hide different weights and data; the pattern is semantics in, pixels or latents out.
Latent diffusion and the VAE (why not denoise 4K RGB directly?)
Denoising every pixel at full resolution is expensive. Latent diffusion (central to much of the Stable Diffusion line) first uses a VAE to encode the image to a smaller latent grid, runs the denoising network on that tensor, then decodes back to RGB. Related keywords: reconstruction loss, latent space, and (in some papers) multiscale decoders.
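The saving is easy to see with shape arithmetic, using the Stable Diffusion v1 convention (8× spatial downsampling into 4 latent channels) as an example:

```python
import numpy as np

H, W = 512, 512
rgb = np.zeros((H, W, 3))                  # full-resolution pixels: 786,432 values
latent = np.zeros((H // 8, W // 8, 4))     # 64x64x4 latent: 16,384 values

print(rgb.size / latent.size)              # 48.0 -> ~48x fewer values to denoise
```

The denoising loop runs entirely on the small latent tensor; the VAE decoder maps the final latent back to RGB once, at the end.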
U-Net vs diffusion transformer (DiT)
- U-Net — Convolutional “hourglass” with skip connections; the classic backbone in many SD-era systems and SDXL.
- DiT (diffusion transformer) — Transformer blocks on patches in latent space; same outer sampling story, different inner operator and scaling (see the patchify sketch after this list).
Practical takeaway: API knobs (step count, guidance scale, resolution) matter as much as the architecture name.
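To make the "different inner operator" concrete, here is the patchify step a DiT applies before its transformer blocks. This is a sketch: patch size and dimensions are illustrative, not taken from any specific checkpoint, and a real model follows it with a learned linear projection to the transformer width.

```python
import numpy as np

C, H, W, p = 4, 32, 32, 2                  # latent channels, grid size, patch size
latent = np.zeros((C, H, W))

# Split the grid into non-overlapping p x p patches, flatten each into a token.
tokens = (latent
          .reshape(C, H // p, p, W // p, p)
          .transpose(1, 3, 0, 2, 4)        # (H/p, W/p, C, p, p)
          .reshape((H // p) * (W // p), C * p * p))
print(tokens.shape)                        # (256, 16): 256 tokens of dim 16
```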
Glossary: keywords
| Term | One-line meaning |
|---|---|
| DDPM / score-based | Denoising diffusion probabilistic model or related score matching; learn p(x) via a noise schedule. |
| Latent diffusion (LDM) | Diffusion in a VAE latent grid instead of full-resolution pixels. |
| CFG (classifier-free guidance) | At sample time, mix conditional and unconditional predictions to pull samples toward the prompt; the scale is a user-tunable strength knob. |
| Scheduler / sampler | How each denoising step is taken (DDIM, DPM++, Euler, …—naming varies by implementation). |
| Text encoder | Frozen or co-trained model that embeds the prompt. |
| Cross-attention | Image features attend to text token vectors. |
| U-Net | Conv backbone used in many latent diffusion systems. |
| DiT | Diffusion transformer on latent patches. |
| Inpainting / outpainting | Condition on a mask to fill a region or extend the canvas. |
| LoRA | Low-rank adapters (and cousins) for cheap style or subject tuning. |
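Of these, CFG is the knob users touch most often. One denoising step with classifier-free guidance looks roughly like this (a sketch: `denoiser` stands in for the noise-prediction network as in the sampling loop above, and 7.5 is merely a common default in Stable Diffusion tooling, not a universal constant):

```python
def cfg_noise(denoiser, x_t, t, text_emb, guidance_scale=7.5):
    """Classifier-free guidance: run the model twice, then extrapolate."""
    eps_uncond = denoiser(x_t, t, None)        # unconditional ("empty prompt") pass
    eps_cond = denoiser(x_t, t, text_emb)      # prompt-conditioned pass
    # Push the prediction away from unconditional, toward the prompt.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```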
Product map (vocabulary, not a recommendation)
- OpenAI DALL·E — Closed-weight end-to-end product; strong emphasis on prompt robustness and safety layers.
- Stability / Stable Diffusion — Open weights and broad community tooling (ControlNet, img2img, regional prompts, …).
- Google Imagen — T5-class or similar text encoders plus diffusion backbones in Google's stack (see each generation's paper / card).
- FLUX (Black Forest Labs and partners) — Recent high-fidelity model lines; some checkpoints are open, others API-only.
Check each vendor's model card, license, and safety rules for the exact weights or API you run.
Read next (language models)
- What are tokens?
- What are parameters in an LLM?
- Context window in LLMs
- ChatGPT Images 2.0 and gpt-image-2 (OpenAI)
This article is a conceptual map. For deployment, use the model card, license, and safety documentation for your checkpoint.