Krea AI published the Krea 2 Technical Report on June 23, 2026 — a 58-minute read covering the full stack behind their new open-weights text-to-image foundation model series. The headline number: Krea 2 places in the top 10 on the Artificial Analysis text-to-image leaderboard and 2nd among independent labs, with model weights and inference code released under a permissive license.
What makes this report worth reading carefully is not the benchmark position alone — it is the unusually detailed account of every major decision in the pipeline, including several that go against common practice. No AI-generated training data. A custom DPO variant to prevent policy collapse. A PostgreSQL-based data warehouse they built from scratch. A Kubernetes + Weka setup where the entire cluster flips to research training on demand.
Here is a structured breakdown.
The Core Thesis: Exploration Over Convergence
Most state-of-the-art image models have converged toward a narrow default aesthetic — reliable, polished, and predictable. Krea argues this makes them effective production tools but weak engines for creative exploration, where users need to search across styles and moods rather than receive a single best guess.
Krea 2 is explicitly designed around the opposite priority: wide aesthetic diversity first, with user-controllable navigation of that space through both text and image inputs.
Data Curation: What They Filter Out (Not In)
The most notable data decision is what Krea does not do.
No aesthetic score oversampling. Most pipelines use CLIP-based or IQA aesthetic scores to upsample "good" images. Krea argues this introduces implicit biases — a motion-blurred image might score low but represent a valid artistic choice. Their pretraining filters only remove:
- Duplicates and over-represented concepts
- Images where VLMs consistently fail to caption accurately
- Images that introduce undesired artifacts and biases
- High-complexity images that cannot be represented at low resolution
- AI-generated images (more on this below)
Zero synthetic images in pretraining. This is unusual and deliberate. Krea's finding: even a small percentage of AI-generated images in a training mix creates an upper bound on model quality because synthetic images are disproportionately easy to learn, effectively pulling the training distribution toward them. They built in-house classifiers specifically to detect and remove synthetic images.
Deduplication in Practice
Their starting observation: the standard 8×8 phash has a high false-positive rate because it ignores color. They combine a 12×12 phash with colorhash for more robust deduplication.
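To make the idea concrete, here is a minimal sketch of a two-hash duplicate check. It assumes both hashes are already computed and stored as integers; the function names and the bit thresholds are hypothetical and are not taken from the report.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")


def is_duplicate(phash_a: int, phash_b: int,
                 chash_a: int, chash_b: int,
                 max_phash_bits: int = 10,
                 max_chash_bits: int = 4) -> bool:
    """Treat two images as duplicates only if BOTH hashes are close.

    The thresholds are illustrative assumptions, not values from the report.
    """
    structure_close = hamming(phash_a, phash_b) <= max_phash_bits
    color_close = hamming(chash_a, chash_b) <= max_chash_bits
    return structure_close and color_close


# Same structure (identical perceptual hash) but very different colors:
# a phash-only check would merge these, the combined check does not.
same_shape_different_color = is_duplicate(0xABCDEF, 0xABCDEF,
                                          0b11110000, 0b00001111)
```

Requiring agreement on both hashes is what removes the color-blind false positives: a recolored variant of an image keeps its perceptual hash but fails the color check.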
Sparse Autoencoder Tagging
For identifying visual artifacts without training explicit classifiers, they train a sparse autoencoder (SAE) on SigLIP-2 embeddings across their pretraining corpus, then use a VLM to annotate each SAE feature based on its top-k activating samples. This gives them an unsupervised tagging system for filtering artifact-inducing images.
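The tagging step can be sketched roughly as follows, assuming the SAE activations are already available as a samples-by-features matrix. The helper names, the threshold, and the set of "artifact" features (which in the report's setup would come from VLM annotations) are all hypothetical.

```python
import heapq


def top_k_samples(activations, k):
    """activations[i][j] is the activation of SAE feature j on sample i.

    Returns {feature index: sample indices with the k highest activations}.
    These are the samples a VLM would look at to name the feature.
    """
    n_features = len(activations[0])
    top = {}
    for j in range(n_features):
        scored = [(row[j], i) for i, row in enumerate(activations)]
        top[j] = [i for _, i in heapq.nlargest(k, scored)]
    return top


def flag_samples(activations, artifact_features, threshold):
    """Indices of samples that fire strongly on any artifact-tagged feature."""
    return [
        i for i, row in enumerate(activations)
        if any(row[j] > threshold for j in artifact_features)
    ]


# Toy data: 3 samples, 2 features. Feature 1 is assumed to be tagged
# as an artifact feature by the annotation step.
acts = [[0.9, 0.0], [0.1, 0.8], [0.5, 0.3]]
examples_per_feature = top_k_samples(acts, k=2)
to_filter = flag_samples(acts, artifact_features={1}, threshold=0.5)
```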
Captioning Pipeline
Their captioning process is multi-stage:
- OCR extracts visible text from each image
- A VLM receives the image, OCR results, and any available metadata (camera settings, known entities) to produce a rich long-form caption
- A cheaper LLM reformats that caption into multiple lengths and formats
The result: training predominantly on long captions for dense supervision, with exposure to short/medium prompts throughout.
Midtraining Data: Wikipedia PageRank for Entity Coverage
For midtraining, they use a clever entity coverage strategy. They run PageRank over English Wikipedia using Danker, retain the top 90% of articles by rank, filter out unrepresentable subjects via Wikidata metadata, and then audit which of the remaining ~5 million concepts appear in their dataset. Rare concepts get prioritized during sampling.
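As an illustration of the ranking step, the sketch below runs a plain power-iteration PageRank over a toy link graph and keeps a top fraction of nodes. It is not Danker and makes no claim about Danker's implementation; the graph, function names, and parameters are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. links maps a page to the pages it links to."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                share = damping * rank[v] / n
                for u in nodes:
                    new[u] += share
        rank = new
    return rank


def top_fraction(rank, fraction):
    """Keep the highest-ranked pages, e.g. fraction=0.9 for the top 90%."""
    ordered = sorted(rank, key=rank.get, reverse=True)
    return ordered[:max(1, int(len(ordered) * fraction))]


toy_links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
toy_rank = pagerank(toy_links)
kept = top_fraction(toy_rank, 0.5)
```

The retained list is then the candidate concept set that gets filtered and audited against the dataset.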
Architecture: What Survived Ablation
Krea ran thorough architecture ablations organized around four objectives: stability, performance, efficiency, and simplicity. Their final choices:
| Component | Baseline | Final Choice |
|---|---|---|
| Attention | Multi-head | GQA + gated sigmoid attention |
| MLP | GeLU | SwiGLU (4× expansion) |
| Text encoder | T5-XXL | Qwen3-VL with multilayer feature aggregation |
| Modulation | Per-block MLP | Per-block tunable bias |
| Autoencoder | FLUX AE | Qwen Image VAE + FLUX 2 AE |
| Norm | LayerNorm | Zero-centered RMSNorm + QKNorm |
| Positional encoding | — | 3D Axial RoPE |
| Block design | — | Single-stream transformer |
Why Gated Sigmoid Attention
GQA costs little quality relative to multi-head attention while reducing compute. On top of GQA, gated sigmoid attention (from the paper Gated Attention for Large Language Models) adds almost no parameter overhead but produces more stable training dynamics: the loss and gradient-norm curves stay cleaner throughout.
The Timestep Modulation Decision
Per-block MLP modulation for timestep can consume 20–30% of total parameter count. Krea replaces this with a per-block tunable bias, freeing those parameters for attention and MLP layers. They tested removing timestep conditioning entirely (consistently underperforms) and in-context timestep tokens (works at 256px but fails at higher resolution even with more tokens).
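A back-of-the-envelope count shows why modulation is so expensive. The sketch assumes a DiT-style block in which one linear layer maps the timestep embedding to six modulation vectors (shift, scale, and gate for attention and for the MLP), plus a plain multi-head attention and a 4× GeLU MLP without biases. Krea's actual baseline may differ, so treat the numbers as illustrative only.

```python
def block_params(d_model, mlp_expansion=4):
    """Attention + MLP weights of one transformer block (biases ignored)."""
    attention = 4 * d_model * d_model            # Q, K, V, output projections
    mlp = 2 * mlp_expansion * d_model * d_model  # up and down projections
    return attention + mlp


def mlp_modulation_params(d_model, n_signals=6):
    """One linear layer d_model -> n_signals * d_model, with bias."""
    return n_signals * d_model * d_model + n_signals * d_model


def bias_modulation_params(d_model, n_signals=6):
    """A learned bias per modulation signal: no matrix at all."""
    return n_signals * d_model


d = 1024  # hypothetical width
modulation = mlp_modulation_params(d)
fraction = modulation / (block_params(d) + modulation)
savings = modulation - bias_modulation_params(d)
```

Under these assumptions the modulation layer is roughly a third of each block, which is the same order as the 20–30% of total parameters quoted above once embeddings and other components are counted. The bias variant scales linearly in the width instead of quadratically.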
Text Encoder: Not Just the Last Layer
Using only the last layer of a VLM is suboptimal because that layer is optimized for next-token prediction, not image conditioning. Krea introduces a shallow attention layer that aggregates hidden features across VLM layers, letting the model dynamically select coarse-to-fine representations. Combined with lightweight bidirectional transformer layers across the token axis, this reduces the autoregressive bias in the representation.
T5-XXL is noted as "surprisingly competitive" with Qwen3-VL in head-to-head ablations. They chose Qwen3-VL anyway for its richer input space (text + image) and stronger multilingual generalization.
Training Pipeline: Five Stages
1. Pretraining (256px → 512px → 1024px)
Progressive resolution is a curriculum strategy: most FLOPs go into low-resolution stages to build core capabilities cheaply, then the model gets high-fidelity training at the end.
Key detail: they use iREPA (a pretraining acceleration technique) for the first epoch only at 256px. iREPA encourages the MMDiT to learn its own representations and substantially accelerates initial convergence. After that epoch, it is removed.
8-bit training at 256px and 512px gives 15–20% throughput gains over bf16 with minimal quality loss. 1024px and beyond uses standard bf16.
2. Midtraining
Bridges the gap between the general pretraining distribution and the high-quality SFT distribution. Their characterization: this is the last point in the pipeline where you can add new capabilities — downstream skills like high-fidelity generation, domain coverage, and text rendering need to be locked in here.
3. Supervised Finetuning (SFT)
Small, hand-curated, domain-specific. Their finding: once volume is sufficient, quality matters far more than scale. They train domain-specific SFT checkpoints, then use model merging to produce a generalist SFT checkpoint.
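The summary here does not say which merging method is used, so the following is only one common possibility: a weighted average of parameters across checkpoints that share an architecture. Checkpoints are modeled as dictionaries of flat float lists, and the function name and weights are hypothetical.

```python
def merge_checkpoints(checkpoints, weights=None):
    """Weighted parameter average of same-shaped checkpoints.

    checkpoints: list of {parameter name: list of floats}
    weights:     one weight per checkpoint, summing to 1 (uniform if omitted)
    """
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("merge weights must sum to 1")
    merged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [
            sum(w * x for w, x in zip(weights, column)) for column in columns
        ]
    return merged


# Two hypothetical domain-specific SFT checkpoints merged into a generalist.
photo_ckpt = {"w": [1.0, 3.0]}
illustration_ckpt = {"w": [3.0, 5.0]}
generalist = merge_checkpoints([photo_ckpt, illustration_ckpt])
```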
4. Preference Optimization (PO) + STPO
Standard DPO has a known failure mode: the model achieves the DPO objective by reducing the likelihood of both winning and losing samples, just at different rates. If the winning sample is actually better than the current model distribution, this degrades quality while technically satisfying the loss. It also causes high-frequency artifacts late in training.
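The failure mode is easy to reproduce with the standard DPO loss on log-likelihoods. The sketch below uses the textbook formulation; diffusion models use a denoising-error surrogate instead of exact log-likelihoods, and the numbers are made up for illustration.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair.

    Only the MARGIN between the two log-ratio terms enters the loss,
    so the absolute likelihood of the winner is unconstrained.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))


# Policy equal to the reference: loss is log(2).
start = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Both samples became LESS likely, the loser faster than the winner.
# The loss still goes down.
collapsed = dpo_loss(-15.0, -30.0, -10.0, -10.0)

# Shifting both log-likelihoods by the same amount does not change the loss.
a = dpo_loss(-10.0, -12.0, -10.0, -10.0)
b = dpo_loss(-20.0, -22.0, -10.0, -10.0)
```

Here `collapsed` is lower than `start` even though the winning sample's likelihood dropped, which is exactly the behavior an auxiliary loss is meant to penalize.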
Krea's fix is STPO, which adds an auxiliary loss and modifies the original DPO formulation to reduce policy divergence. The preference data itself comes from two stages: a large-scale synthetic preference-pair generation pipeline (ensuring most pairs include at least one on-policy sample), followed by a human annotation calibration stage using in-house annotators familiar with the model's specific failure modes.
5. Reinforcement Learning (RL)
Multi-reward GRPO with four reward signals:
- General aesthetic — fine-tuned VLM on PO preference data
- Prompt following — rubric-based (prompt decomposed into verifiable requirements, each checked against the image)
- Text rendering — dedicated reward
- Artifact and structure — dedicated model for detecting extra fingers, malformed limbs, distorted text; catches failures that general VLM judges miss
The rubric-based prompt reward is a direct borrow from LLM training: instead of a single holistic score, each prompt gets decomposed into sub-requirements that are evaluated independently. This gives the RL stage more structured signal without reducing everything to generic image quality.
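A minimal sketch of how a rubric score and a group-relative advantage fit together, assuming the rubric checks have already been answered by a judge model. The weighted sum used to combine rewards, the weights themselves, and all names are assumptions; the summary above does not specify how Krea combines its four signals.

```python
from statistics import mean, pstdev


def rubric_score(checks):
    """Fraction of verifiable requirements the image satisfies."""
    return sum(checks.values()) / len(checks)


def combine_rewards(rewards, weights):
    """Weighted sum of named reward signals (an assumed combination rule)."""
    return sum(weights[name] * value for name, value in rewards.items())


def group_advantages(group_rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one prompt's group."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


# One prompt decomposed into four checks, three of which pass.
prompt_following = rubric_score({
    "a red car": True,
    "two people": False,
    "sunset lighting": True,
    "sign reads OPEN": True,
})

total = combine_rewards(
    {"aesthetic": 0.6, "prompt": prompt_following, "text": 1.0, "artifact": 0.9},
    {"aesthetic": 0.25, "prompt": 0.25, "text": 0.25, "artifact": 0.25},
)

# Rewards for three samples generated from the same prompt.
advantages = group_advantages([0.2, 0.5, 0.8])
```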
Prompt pool management matters as much as reward model quality. They continuously analyze reward statistics per prompt to identify which prompts are still informative. Easy prompts, consistently-failing prompts, and low-variance prompts are deprioritized. The framing: RL prompt selection is a resource-allocation problem.
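One way to read this as code: keep per-prompt reward statistics and turn them into sampling weights. The sketch assumes rewards are normalized to the range 0 to 1; the thresholds, the down-weighting factor, and the function name are hypothetical.

```python
from statistics import mean, pvariance


def sampling_weights(reward_history, low=0.1, high=0.9,
                     min_variance=0.01, reduced=0.1):
    """Map each prompt to a sampling weight based on its recent rewards.

    Prompts that are already easy, always failing, or show almost no
    reward variance carry little learning signal and get a reduced weight.
    """
    weights = {}
    for prompt, rewards in reward_history.items():
        mu = mean(rewards)
        var = pvariance(rewards)
        uninformative = mu >= high or mu <= low or var < min_variance
        weights[prompt] = reduced if uninformative else 1.0
    return weights


history = {
    "easy": [0.95, 0.97, 0.99],
    "always_fails": [0.0, 0.05, 0.02],
    "flat": [0.5, 0.5, 0.51],
    "useful": [0.2, 0.6, 0.9],
}
pool_weights = sampling_weights(history)
```

Group-relative methods get no gradient signal from a prompt whose samples all score the same, which is why low-variance prompts are worth spending fewer rollouts on.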
They also train the entire RL stage without CFG (classifier-free guidance). This quickly closes the gap between no-CFG and CFG samples in the conditional distribution. CFG can still be applied at inference as an additional quality knob.
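For reference, the usual CFG combination at inference looks like the sketch below, written over plain lists rather than tensors. This is the standard formulation, not code from the report, and the function name is made up.

```python
def apply_cfg(cond, uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one.

    scale = 1.0 returns the conditional prediction unchanged (no guidance);
    larger values push further along the same direction.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


conditional = [1.0, 2.0]
unconditional = [0.0, 1.0]
no_guidance = apply_cfg(conditional, unconditional, 1.0)
guided = apply_cfg(conditional, unconditional, 3.0)
```

Training RL without CFG means the policy is optimized at `scale = 1.0`, so any guidance applied later is an optional extra rather than something the model depends on.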
Timestep Distillation (Optional)
After RL, an optional stage using Trajectory Distribution Matching (TDM) — chosen over DMD, DMD2, piFlow, and APT. TDM extends DMD across timesteps, matching distributions at the trajectory level rather than only at the clean-image level. No GANs, minimal hyperparameters, flexible multistep support.
Prompt Expansion
Dense training captions and sparse user prompts are different distributions. Krea trains a prompt expander to bridge them.
The training data pipeline: an LLM generates synthetic "user captions" from long captions — shorter, conversational, underspecified prompts that omit most visual detail. This creates paired data (underspecified prompt → expanded model-friendly caption).
After SFT on this data, they apply GDPO (a GRPO variant) to optimize the expander directly through the images it produces. Rewards are mixed: image-level quality rewards, prompt-level faithfulness checks, and safety gates.
One explicit risk they guard against: diversity collapse. Prompt expanders can learn a single safe high-reward house style. To prevent this, they add a DINOv3 embedding diversity score over prompt groups, rewarding intra-group visual variation throughout RL training. Annealing the diversity reward causes collapse — they keep it active the entire time.
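A diversity reward of this kind can be sketched as the mean pairwise cosine distance between image embeddings in a group. The embeddings are assumed to be precomputed (for example by DINOv3), and the exact score Krea uses is not given here, so this is one plausible form rather than their definition.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def diversity_score(embeddings):
    """Mean pairwise cosine distance within one prompt group.

    0.0 means every image embeds identically (collapsed style);
    higher values mean more visual variation.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)


collapsed_group = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
varied_group = [[1.0, 0.0], [0.0, 1.0]]
collapsed_score = diversity_score(collapsed_group)
varied_score = diversity_score(varied_group)
```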
Style Reference System
Separate module that lets users pass one or more reference images to guide output style while keeping text-driven content. Two design challenges:
- Content leakage — style images influencing subject matter, not just aesthetic
- Data scarcity — style-transfer data is much harder to acquire at scale than editing data from video
Their solution is a novel self-supervised training technique for the style module, followed by a preference-optimization alignment step. The system supports smooth semantic style mixing across multiple references, per-reference strength control, and competitive style adherence.
Infrastructure
Kubernetes + Kueue
Research GPUs and production inference share the same Kubernetes cluster. When a training run claims the full GPU pool, inference automatically migrates out. Kueue handles gang scheduling (required for multi-node training) and borrowing/lending/reclamation between queues.
Key complaint: Kueue requires the GPU count per queue to be updated by hand whenever the node count changes, which was a consistent operational annoyance.
Training Launch Procedure
Over time they built a launch CLI that:
- Retrieves the faulty-node list
- Excludes nodes already running training or dev machines
- Selects needed nodes, applies labels and taints (for large stability-critical runs)
- Removes labels and taints on teardown
Faulty nodes don't get decommissioned — they run dev machines on them so healthy nodes stay free for training. "Packerman" is the Kubernetes operator that packs dev workloads onto faulty nodes.
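The node-selection part of that procedure can be sketched as a pure function. The node names, the error handling, and the function itself are hypothetical; the real CLI also applies and removes Kubernetes labels and taints, which is omitted here.

```python
def select_nodes(all_nodes, faulty, busy, needed):
    """Pick `needed` nodes for a training run.

    faulty: nodes on the faulty-node list (left for dev machines)
    busy:   nodes already running training or dev machines
    """
    candidates = [n for n in all_nodes if n not in faulty and n not in busy]
    if len(candidates) < needed:
        raise RuntimeError(
            f"need {needed} nodes, only {len(candidates)} healthy and free"
        )
    return candidates[:needed]


cluster = ["n1", "n2", "n3", "n4", "n5"]
chosen = select_nodes(cluster, faulty={"n2"}, busy={"n4"}, needed=2)
```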
Observability
The most useful GPU metrics in practice:
- DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (tensor core utilization) — their preferred health indicator; correlated with training stage, resolution, and thermal throttling
- DCGM_FI_DEV_PCIE_REPLAY_COUNTER — PCIe replays on a single GPU consistently preceded crashes
- InfiniBand metrics — "arguably the most important" in their experience. Fabric instability was the single largest contributor to run crashes. They implemented a custom DaemonSet to export NVLink and IB metrics that DCGM doesn't export by default
Their scale observation: doubling GPU count produced substantially more instability than expected. Below 128 GPUs, runs were very stable for days. At very large scale, no run exceeded 24 hours without a crash — often with no visible metric spike.
Weka Filesystem
They switched from Ceph (poor performance at their scale) to Weka. Result: filesystem downtime dropped sharply and performance improved to a similar degree. Checkpointing at ~30 seconds per checkpoint allowed aggressive fault recovery. The entire research data footprint (images, datasets, checkpoints, artifacts) lives on one Weka cluster.
Krablet Data System
Their custom data warehouse for training data curation:
- Cluster of PostgreSQL servers, each shard called a "krablet"
- Each krablet has a Postgres instance + "funnel" servers that batch and queue mutations asynchronously to minimize lock contention
- All reads proxied through "RPC" servers (replacing a traditional connection pooler)
- Scales to 208 TB of metadata and tens of thousands of contended UPSERT transactions per second
The core insight: using Postgres queues with FOR UPDATE SKIP LOCKED for all data processing gives automatic retry behavior (failed rows get retried at end of queue), dynamic worker scaling, partial processing support, and continuous incremental ingestion — without needing Ray, Spark, or Kafka.
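The queue pattern can be sketched as below. The table and column names (`work_queue`, `status`, `enqueued_at`, `payload`) are hypothetical, and the function takes any DB-API cursor connected to PostgreSQL rather than importing a specific driver. Committing the transaction is left to the caller.

```python
# Claim a batch of pending rows. SKIP LOCKED makes concurrent workers
# skip rows another worker has already locked instead of waiting.
CLAIM_SQL = """
WITH claimed AS (
    SELECT id
    FROM work_queue
    WHERE status = 'pending'
    ORDER BY enqueued_at
    LIMIT %s
    FOR UPDATE SKIP LOCKED
)
UPDATE work_queue AS q
SET status = 'processing'
FROM claimed
WHERE q.id = claimed.id
RETURNING q.id, q.payload
"""

# A failed row goes back to the end of the queue for a later retry.
REQUEUE_SQL = """
UPDATE work_queue
SET status = 'pending', enqueued_at = now()
WHERE id = %s
"""

DONE_SQL = "DELETE FROM work_queue WHERE id = %s"


def process_batch(cursor, handler, batch_size=100):
    """Claim up to batch_size rows, run handler on each, requeue failures.

    Returns the number of rows claimed; 0 means the queue is drained
    or every remaining row is locked by another worker.
    """
    cursor.execute(CLAIM_SQL, (batch_size,))
    rows = cursor.fetchall()
    for row_id, payload in rows:
        try:
            handler(payload)
        except Exception:
            cursor.execute(REQUEUE_SQL, (row_id,))
        else:
            cursor.execute(DONE_SQL, (row_id,))
    return len(rows)
```

Because each worker only ever sees rows nobody else holds, workers can be added or removed at any time without a coordinator, which is where the dynamic scaling comes from.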
On top of this they expose a "pluck" API: a global map operation usable from a notebook, which uses TABLESAMPLE for keyspace partitioning and cloudpickle to serialize user-defined functions for remote execution.
Future Work They Called Out
- Native 2K–4K resolution with sparse attention
- MoE architecture for the next pretraining cycle
- NVFP4 pretraining
- Muon optimizer (showed strong results in ablations but not adopted for the final run due to time constraints)
- Multi-teacher on-policy distillation (MOPD) — allows domain-specific RL teams to train experts without risking regressions in other domains, then distill into a single student
- Architecture unification — collapsing autoencoder, diffusion transformer, text encoder, and prompt expander into a single model, following the LLM pattern
What to Take Away
Krea 2 is notable for two reasons that don't typically appear in the same paper.
First, the zero synthetic data commitment at pretraining scale is unusual and principled — they are betting that the quality ceiling from a clean real-data distribution is higher than what is reachable through distillation shortcuts.
Second, the infrastructure write-up is unusually honest about failure modes: the PCIe replay pattern that precedes crashes, the fact that doubling GPU count made stability dramatically worse, the manual per-queue GPU counts in Kueue, and the Ceph-to-Weka migration. Most technical reports smooth over operational pain. This one doesn't.
The model weights, inference code, and the full technical report are available at krea.ai and on Hugging Face.