Krea 2 Technical Report: Open-Weights Image Foundation Model Built for Creative Exploration

Krea releases Krea 2, a series of open-weights foundation models for text-to-image generation that prioritizes aesthetic diversity and creative control. The technical report covers data curation without AI-generated images, a DiT architecture with GQA + Qwen3-VL, multi-stage training through pretraining / midtraining / SFT / PO / RL, a prompt expander trained with GRPO, and a style-reference system—plus a full breakdown of their Kubernetes and Weka-based training infrastructure.

Jun 25, 2026·12 min read·Yash Thakker
Image Generation · Open Source AI · Diffusion Models · AI Research · Foundation Models · Krea

Krea AI published the Krea 2 Technical Report on June 23, 2026 — a 58-minute read covering the full stack behind their new open-weights text-to-image foundation model series. The headline number: Krea 2 places in the top 10 on the Artificial Analysis text-to-image leaderboard and 2nd among independent labs, with model weights and inference code released under a permissive license.

What makes this report worth reading carefully is not the benchmark position alone — it is the unusually detailed account of every major decision in the pipeline, including several that go against common practice. No AI-generated training data. A custom DPO variant to prevent policy collapse. A PostgreSQL-based data warehouse they built from scratch. A Kubernetes + Weka setup where the entire cluster flips to research training on demand.

Here is a structured breakdown.


The Core Thesis: Exploration Over Convergence

Most state-of-the-art image models have converged toward a narrow default aesthetic — reliable, polished, and predictable. Krea argues this makes them effective production tools but weak engines for creative exploration, where users need to search across styles and moods rather than receive a single best guess.

Krea 2 is explicitly designed around the opposite priority: wide aesthetic diversity first, with user-controllable navigation of that space through both text and image inputs.


Data Curation: What They Filter Out (Not In)

The most notable data decision is what Krea does not do.

No aesthetic score oversampling. Most pipelines use CLIP-based or IQA aesthetic scores to upsample "good" images. Krea argues this introduces implicit biases — a motion-blurred image might score low but represent a valid artistic choice. Their pretraining filters only remove:

  • Duplicates and over-represented concepts
  • Images where VLMs consistently fail to caption accurately
  • Images that introduce undesired artifacts and biases
  • High-complexity images that cannot be represented at low resolution
  • AI-generated images (more on this below)

Zero synthetic images in pretraining. This is unusual and deliberate. Krea's finding: even a small percentage of AI-generated images in a training mix creates an upper bound on model quality because synthetic images are disproportionately easy to learn, effectively pulling the training distribution toward them. They built in-house classifiers specifically to detect and remove synthetic images.

Deduplication in Practice

Their observation: the standard 8×8 phash has a high false-positive rate because it ignores color. To make deduplication more robust, they combine a 12×12 phash with colorhash.
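To make the color problem concrete, here is a minimal sketch, not Krea's implementation. It assumes a toy DCT-based perceptual hash computed on the image as given (real pHash implementations resize first) and a deliberately crude color signature standing in for colorhash. The function names and the `max_bits` threshold are hypothetical.

```python
import math

def phash_bits(gray, hash_size=12):
    """Toy DCT-based perceptual hash of a square grayscale image (list of rows)."""
    n = len(gray)
    coeffs = []
    for u in range(hash_size):          # keep only the low-frequency block
        for v in range(hash_size):
            total = 0.0
            for x in range(n):
                cx = math.cos(math.pi * (x + 0.5) * u / n)
                for y in range(n):
                    total += gray[x][y] * cx * math.cos(math.pi * (y + 0.5) * v / n)
            coeffs.append(total)
    median = sorted(coeffs)[len(coeffs) // 2]
    return [c > median for c in coeffs]  # one bit per coefficient

def color_signature(rgb, levels=4):
    """Crude stand-in for colorhash (assumption): mean R, G, B in coarse buckets."""
    pixels = [p for row in rgb for p in row]
    means = [sum(p[c] for p in pixels) / len(pixels) for c in range(3)]
    return tuple(min(levels - 1, int(m * levels / 256)) for m in means)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def is_duplicate(img_a, img_b, max_bits=8):
    """Duplicate only if structure AND color agree; max_bits is a hypothetical threshold."""
    gray_a = [[sum(p) / 3 for p in row] for row in img_a]
    gray_b = [[sum(p) / 3 for p in row] for row in img_b]
    same_structure = hamming(phash_bits(gray_a), phash_bits(gray_b)) <= max_bits
    return same_structure and color_signature(img_a) == color_signature(img_b)

# The same pattern rendered in red and in blue: identical luminance, different color.
pattern = [[((x * y) % 17) * 15 for y in range(16)] for x in range(16)]
red = [[(v, 0, 0) for v in row] for row in pattern]
blue = [[(0, 0, v) for v in row] for row in pattern]
```

A grayscale-only hash cannot tell `red` from `blue`, so it would flag them as duplicates; adding the color check keeps them apart while still matching true copies.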

Sparse Autoencoder Tagging

For identifying visual artifacts without training explicit classifiers, they train a sparse autoencoder (SAE) on SigLIP-2 embeddings across their pretraining corpus, then use a VLM to annotate each SAE feature based on its top-k activating samples. This gives them an unsupervised tagging system for filtering artifact-inducing images.
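The tagging loop can be sketched in a few lines. This is an illustration under stated assumptions, not the report's code: the encoder is a plain ReLU layer with hand-written weights, the embeddings are two-dimensional stand-ins for SigLIP-2 vectors, and the activation threshold is hypothetical. The VLM annotation step is left out; it would label each feature from the samples that `top_k_samples` returns.

```python
def sae_encode(embedding, weights, bias):
    """ReLU encoder of a sparse autoencoder: one activation per dictionary feature."""
    return [max(0.0, sum(w * x for w, x in zip(row, embedding)) + b)
            for row, b in zip(weights, bias)]

def top_k_samples(activations, feature, k):
    """Indices of the k samples that activate `feature` most strongly."""
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i][feature], reverse=True)
    return ranked[:k]

def flag_samples(activations, feature, threshold):
    """Samples to filter once a feature has been annotated as an artifact."""
    return [i for i, acts in enumerate(activations) if acts[feature] > threshold]

# Toy data (assumption): 4 image embeddings, 2 SAE features.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [-0.5, -0.5]
activations = [sae_encode(e, weights, bias) for e in embeddings]

exemplars = top_k_samples(activations, feature=1, k=2)   # shown to the VLM for labeling
flagged = flag_samples(activations, feature=1, threshold=0.2)
```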

Captioning Pipeline

Their captioning process is multi-stage:

  1. OCR extracts visible text from each image
  2. A VLM receives the image, OCR results, and any available metadata (camera settings, known entities) to produce a rich long-form caption
  3. A cheaper LLM reformats that caption into multiple lengths and formats

The result: training predominantly on long captions for dense supervision, with exposure to short/medium prompts throughout.

Midtraining Data: Wikipedia PageRank for Entity Coverage

For midtraining, they use a clever entity coverage strategy. They run PageRank over English Wikipedia using Danker, retain the top 90% of articles by rank, filter out unrepresentable subjects via Wikidata metadata, and then audit which of the remaining ~5 million concepts appear in their dataset. Rare concepts get prioritized during sampling.
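As a rough sketch of that flow, the code below runs power-iteration PageRank on a toy link graph, keeps the top-ranked pages, and turns dataset counts into sampling weights. The graph, the counts, and the inverse-frequency weighting are all assumptions for illustration; the report uses Danker on the full Wikipedia graph and does not spell out its sampling formula here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of linked pages."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            targets = links.get(src) or nodes   # dangling pages spread rank evenly
            for t in targets:
                new[t] += damping * rank[src] / len(targets)
        rank = new
    return rank

def top_fraction(rank, fraction):
    """Keep the highest-ranked share of pages (ties broken by name)."""
    ordered = sorted(rank, key=lambda n: (-rank[n], n))
    return ordered[: int(len(ordered) * fraction)]

def sampling_weights(concepts, dataset_counts):
    """Assumption: weight concepts by inverse frequency so rare ones are sampled more."""
    raw = {c: 1.0 / max(dataset_counts.get(c, 0), 1) for c in concepts}
    total = sum(raw.values())
    return {c: w / total for c, w in raw.items()}

# Toy graph; 0.75 is used instead of the article's 90% because there are only 4 pages.
links = {"Animal": ["Cat"], "Cat": ["Animal"], "Dog": ["Animal"], "Obscure stub": ["Animal"]}
kept = top_fraction(pagerank(links), 0.75)
weights = sampling_weights(kept, {"Animal": 900, "Cat": 90, "Dog": 10})
```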


Architecture: What Survived Ablation

Krea ran thorough architecture ablations organized around four objectives: stability, performance, efficiency, and simplicity. Their final choices:

| Component | Baseline | Final Choice |
| --- | --- | --- |
| Attention | Multi-head | GQA + gated sigmoid attention |
| MLP | GeLU | SwiGLU (4× expansion) |
| Text encoder | T5-XXL | Qwen3-VL with multilayer feature aggregation |
| Modulation | Per-block MLP | Per-block tunable bias |
| Autoencoder | FLUX AE | Qwen Image VAE + FLUX 2 AE |
| Norm | LayerNorm | Zero-centered RMSNorm + QKNorm |
| Positional encoding | - | 3D Axial RoPE |
| Block design | - | Single-stream transformer |

Why Gated Sigmoid Attention

GQA adds minimal degradation vs. multi-head attention while reducing compute. On top of GQA, gated sigmoid attention (from Gated Attention for Large Language Models) adds almost no parameter overhead but produces more stable training dynamics — the loss and gradient-norm curves stay cleaner throughout.
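A small pure-Python sketch of the two ideas together, for intuition only. It assumes the gate is an elementwise sigmoid applied to each head's attention output, which is the placement the cited paper favors; the exact placement in Krea 2 is not stated here. In a real layer the gate logits would come from a learned projection of the input, while in this sketch they are passed in directly.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_gqa(q, k, v, gate_logits):
    """q, gate_logits: [n_q_heads][seq][d]; k, v: [n_kv_heads][seq][d]."""
    group = len(q) // len(k)                 # query heads sharing one K/V head
    out = []
    for h, q_head in enumerate(q):
        k_head, v_head = k[h // group], v[h // group]
        d = len(q_head[0])
        head_out = []
        for t, q_vec in enumerate(q_head):
            scores = [sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(d)
                      for k_vec in k_head]
            probs = softmax(scores)          # bidirectional: no causal mask
            attended = [sum(p * v_vec[i] for p, v_vec in zip(probs, v_head))
                        for i in range(d)]
            # Output gate (assumed placement): scale each channel by a sigmoid.
            head_out.append([sigmoid(g) * a
                             for g, a in zip(gate_logits[h][t], attended)])
        out.append(head_out)
    return out

# Toy shapes (assumption): 4 query heads, 2 K/V heads, 2 tokens, head dim 2.
q = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(4)]
k = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(2)]
v = [[[1.0, 2.0], [1.0, 2.0]], [[3.0, 4.0], [3.0, 4.0]]]
gates = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(4)]   # sigmoid(0) = 0.5
out = gated_gqa(q, k, v, gates)
```

Query heads 0 and 1 read from the first K/V head and heads 2 and 3 from the second, which is where GQA saves compute; the gate adds only the parameters needed to produce its logits.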

The Timestep Modulation Decision

Per-block MLP modulation for timestep can consume 20–30% of total parameter count. Krea replaces this with a per-block tunable bias, freeing those parameters for attention and MLP layers. They tested removing timestep conditioning entirely (consistently underperforms) and in-context timestep tokens (works at 256px but fails at higher resolution even with more tokens).
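A back-of-the-envelope count shows where a figure in that range can come from. The sketch assumes DiT-style adaLN modulation (one linear layer per block mapping the conditioning vector to six modulation vectors), plain four-projection attention, and a SwiGLU MLP with 4× expansion. The width `d = 4096` and the six-vector bias are illustrative assumptions, not Krea 2's actual configuration.

```python
def block_params(d, mlp_expansion=4, n_mod=6):
    """Parameter counts for one transformer block under the stated assumptions."""
    attention = 4 * d * d                         # Q, K, V, and output projections
    mlp = 3 * d * (mlp_expansion * d)             # SwiGLU: gate, up, and down projections
    modulation_mlp = d * (n_mod * d) + n_mod * d  # linear layer producing n_mod vectors
    modulation_bias = n_mod * d                   # assumed size of a per-block tunable bias
    return attention, mlp, modulation_mlp, modulation_bias

attention, mlp, modulation_mlp, modulation_bias = block_params(d=4096)
share = modulation_mlp / (attention + mlp + modulation_mlp)
savings = modulation_mlp - modulation_bias        # parameters freed per block
```

Under these assumptions the modulation layer is about 27% of the block, inside the 20–30% range quoted above, and the bias that replaces it is smaller by a factor of roughly `d`. With GQA the attention term shrinks, which pushes the share slightly higher.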

Text Encoder: Not Just the Last Layer

Using only the last layer of a VLM is suboptimal because that layer is optimized for next-token prediction, not image conditioning. Krea introduces a shallow attention layer that aggregates hidden features across VLM layers, letting the model dynamically select coarse-to-fine representations. Combined with lightweight bidirectional transformer layers across the token axis, this reduces the autoregressive bias in the representation.
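One way to picture the aggregation step is attention pooling over the layer axis: for each token, a learned query scores that token's hidden state at every layer, and the output is the softmax-weighted mix. The sketch below assumes exactly that formulation with a single learned query vector; the report's layer may differ in its details.

```python
import math

def aggregate_layers(hidden, query):
    """hidden: [n_layers][d] states of one token across VLM layers; query: [d]."""
    d = len(query)
    scores = [sum(q * h for q, h in zip(query, layer)) / math.sqrt(d)
              for layer in hidden]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]           # one weight per layer
    mixed = [sum(w * layer[i] for w, layer in zip(weights, hidden))
             for i in range(d)]
    return mixed, weights

# Toy token with 3 layers of 2-dim features (assumption).
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_mix, uniform_w = aggregate_layers(hidden, [0.0, 0.0])   # no preference
picky_mix, picky_w = aggregate_layers(hidden, [10.0, 0.0])      # prefers layers 0 and 2
```

Because the weights depend on the token's own features, different tokens can draw on different depths, which is the "dynamic selection" described above.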

T5-XXL is noted as "surprisingly competitive" with Qwen3-VL in head-to-head ablations. They chose Qwen3-VL anyway for its richer input space (text + image) and stronger multilingual generalization.


Training Pipeline: Five Stages

1. Pretraining (256px → 512px → 1024px)

Progressive resolution is a curriculum strategy: most FLOPs go into low-resolution stages to build core capabilities cheaply, then the model gets high-fidelity training at the end.

Key detail: they use iREPA (a pretraining acceleration technique) only for the first epoch of the 256px stage. iREPA encourages the model to learn its own representations and substantially accelerates initial convergence. After that epoch, it is removed.

8-bit training at 256px and 512px gives 15–20% throughput gains over bf16 with minimal quality loss. 1024px and beyond uses standard bf16.

2. Midtraining

Bridges the gap between the general pretraining distribution and the high-quality SFT distribution. Their characterization: this is the last point in the pipeline where you can add new capabilities — downstream skills like high-fidelity generation, domain coverage, and text rendering need to be locked in here.

3. Supervised Finetuning (SFT)

Small, hand-curated, domain-specific. Their finding: once volume is sufficient, quality matters far more than scale. They train domain-specific SFT checkpoints, then use model merging to produce a generalist SFT checkpoint.

4. Preference Optimization (PO) + STPO

Standard DPO has a known failure mode: the model achieves the DPO objective by reducing the likelihood of both winning and losing samples, just at different rates. If the winning sample is actually better than the current model distribution, this degrades quality while technically satisfying the loss. It also causes high-frequency artifacts late in training.
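The failure mode is easy to reproduce with the standard DPO loss on made-up numbers. The log-likelihood values below are invented for illustration; for diffusion models the log-likelihood terms are typically replaced by denoising errors, but the same arithmetic applies.

```python
import math

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one (winner, loser) pair."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

ref_w, ref_l = -100.0, -100.0                     # reference model (made-up values)

# Start of training: the policy equals the reference.
loss_start = dpo_loss(-100.0, -100.0, ref_w, ref_l)

# Later: BOTH samples became less likely, the loser just fell faster.
loss_later = dpo_loss(-105.0, -130.0, ref_w, ref_l)
winner_got_less_likely = -105.0 < -100.0
```

`loss_later` is lower than `loss_start` even though the winning sample is now less likely than it was under the reference, which is the gap an auxiliary loss has to close.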

Krea's fix is STPO, which adds an auxiliary loss and modifies the original DPO formulation to reduce policy divergence. The preference data itself comes from two stages: a large-scale synthetic preference-pair generation pipeline (ensuring most pairs include at least one on-policy sample), followed by a human annotation calibration stage using in-house annotators familiar with the model's specific failure modes.

5. Reinforcement Learning (RL)

Multi-reward GRPO with four reward signals:

  1. General aesthetic — fine-tuned VLM on PO preference data
  2. Prompt following — rubric-based (prompt decomposed into verifiable requirements, each checked against the image)
  3. Text rendering — dedicated reward
  4. Artifact and structure — dedicated model for detecting extra fingers, malformed limbs, distorted text; catches failures that general VLM judges miss
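For readers new to GRPO, the core of the update is a group-relative advantage: sample several images for the same prompt, score each, and normalize the scores within the group. The sketch below combines the four signals with a weighted sum. The equal weights, the reward values, and the use of a population standard deviation are assumptions; the report's actual combination rule is not given here.

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, weights, eps=1e-6):
    """group_rewards: one dict of reward name -> score per image of the same prompt."""
    totals = [sum(weights[name] * r[name] for name in weights) for r in group_rewards]
    mu, sigma = mean(totals), pstdev(totals)
    return [(t - mu) / (sigma + eps) for t in totals]   # positive = better than group average

weights = {"aesthetic": 0.25, "prompt": 0.25, "text": 0.25, "artifact": 0.25}
group = [
    {"aesthetic": 0.9, "prompt": 1.0, "text": 1.0, "artifact": 1.0},
    {"aesthetic": 0.6, "prompt": 0.5, "text": 1.0, "artifact": 0.0},
    {"aesthetic": 0.7, "prompt": 0.75, "text": 0.0, "artifact": 1.0},
    {"aesthetic": 0.4, "prompt": 0.25, "text": 0.0, "artifact": 0.0},
]
advantages = grpo_advantages(group, weights)
```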

The rubric-based prompt reward is a direct borrow from LLM training: instead of a single holistic score, each prompt gets decomposed into sub-requirements that are evaluated independently. This gives the RL stage more structured signal without reducing everything to generic image quality.
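A rubric reward reduces to a simple shape: a list of requirements, a checker per requirement, and the fraction that pass. In the sketch below the decomposition is written by hand and the checker is a stub that looks up a set of "detected" facts; in practice both steps would be done by models. All names are hypothetical.

```python
def rubric_reward(requirements, check):
    """Score = fraction of requirements the image satisfies; also return the breakdown."""
    results = {req: bool(check(req)) for req in requirements}
    return sum(results.values()) / len(results), results

# Hand-written decomposition of one prompt (in practice an LLM would produce this).
requirements = [
    "a red bicycle is visible",
    "the bicycle leans against a brick wall",
    "a sign reads OPEN",
]

# Stub judge (assumption): pretend a VLM verified these facts about the image.
verified = {"a red bicycle is visible", "the bicycle leans against a brick wall"}
score, breakdown = rubric_reward(requirements, lambda req: req in verified)
```

The breakdown is what makes the signal structured: the same image can be credited for composition while being penalized for the missing sign text.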

Prompt pool management matters as much as reward model quality. They continuously analyze reward statistics per prompt to identify which prompts are still informative. Easy prompts, consistently-failing prompts, and low-variance prompts are deprioritized. The framing: RL prompt selection is a resource-allocation problem.
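The filtering rule can be sketched as a pass over per-prompt reward statistics. The thresholds and the reward histories below are hypothetical, and this version drops uninformative prompts outright, whereas the report describes deprioritizing them.

```python
from statistics import mean, pvariance

def informative_prompts(reward_history, easy=0.9, failing=0.1, min_var=0.01):
    """Keep prompts whose recent rewards are neither saturated nor flat."""
    keep = []
    for prompt, rewards in reward_history.items():
        mu, var = mean(rewards), pvariance(rewards)
        if mu >= easy or mu <= failing or var < min_var:
            continue                     # easy, consistently failing, or low variance
        keep.append(prompt)
    return keep

reward_history = {
    "easy": [0.95, 1.0, 0.9, 0.98],
    "impossible": [0.0, 0.05, 0.0, 0.1],
    "flat": [0.5, 0.5, 0.52, 0.5],
    "useful": [0.2, 0.8, 0.5, 0.9],
}
selected = informative_prompts(reward_history)
```

Low-variance prompts matter for a specific reason: with group-normalized advantages, a group whose samples all score alike contributes almost no gradient, so compute spent on it is wasted.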

They also train the entire RL stage without CFG (classifier-free guidance). This quickly closes the gap between no-CFG and CFG samples in the conditional distribution. CFG can still be applied at inference as an additional quality knob.

Timestep Distillation (Optional)

After RL, an optional stage applies Trajectory Distribution Matching (TDM), chosen over DMD, DMD2, piFlow, and APT. TDM extends DMD across timesteps, matching distributions at the trajectory level rather than only at the clean-image level. It needs no GANs, has minimal hyperparameters, and supports flexible multistep sampling.


Prompt Expansion

Dense training captions and sparse user prompts are different distributions. Krea trains a prompt expander to bridge them.

The training data pipeline: an LLM generates synthetic "user captions" from long captions — shorter, conversational, underspecified prompts that omit most visual detail. This creates paired data (underspecified prompt → expanded model-friendly caption).

After SFT on this data, they apply GDPO (a GRPO variant) to optimize the expander directly through the images it produces. Rewards are mixed: image-level quality rewards, prompt-level faithfulness checks, and safety gates.

One explicit risk they guard against: diversity collapse. Prompt expanders can learn a single safe high-reward house style. To prevent this, they add a DINOv3 embedding diversity score over prompt groups, rewarding intra-group visual variation throughout RL training. Annealing the diversity reward causes collapse — they keep it active the entire time.
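A diversity score of this kind can be as simple as the mean pairwise cosine distance between image embeddings in a prompt group. That specific formula is an assumption, and the two-dimensional vectors below are stand-ins for DINOv3 embeddings.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def diversity_score(embeddings):
    """Mean pairwise cosine distance within one prompt group (0 = all identical)."""
    pairs = list(combinations(embeddings, 2))
    return sum(1.0 - cosine(a, b) for a, b in pairs) / len(pairs)

collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]     # one "house style"
varied = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]       # visually distinct outputs
collapsed_score = diversity_score(collapsed)
varied_score = diversity_score(varied)
```

Adding this score to the reward means a group that converges on one look earns less than a group that covers more of the embedding space, even if each image scores well on its own.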


Style Reference System

A separate module lets users pass one or more reference images to guide output style while text continues to drive content. Two design challenges:

  1. Content leakage — style images influencing subject matter, not just aesthetic
  2. Data scarcity — style-transfer data is much harder to acquire at scale than editing data from video

Their solution is a novel self-supervised training technique for the style module, followed by a preference-optimization alignment step. The system supports smooth semantic style mixing across multiple references, per-reference strength control, and competitive style adherence.


Infrastructure

Kubernetes + Kueue

Research GPUs and production inference share the same Kubernetes cluster. When a training run claims the full GPU pool, inference automatically migrates out. Kueue handles gang scheduling (required for multi-node training) and borrowing/lending/reclamation between queues.

Key complaint: Kueue requires GPU count per queue to be manually specified when node count changes, which was a consistent operational annoyance.

Training Launch Procedure

Over time they built a launch CLI that:

  • Retrieves the faulty-node list
  • Excludes nodes already running training or dev machines
  • Selects needed nodes, applies labels and taints (for large stability-critical runs)
  • Removes labels and taints on teardown
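The steps above can be sketched as a node picker plus the `kubectl` commands it would emit. This is a guess at the shape of such a tool, not Krea's CLI: the label key `training-run`, the node names, and the function names are hypothetical, and the code only builds command strings rather than running them.

```python
def select_nodes(all_nodes, faulty, busy, needed):
    """Pick healthy, idle nodes for a run; raise if the pool is too small."""
    candidates = [n for n in all_nodes if n not in faulty and n not in busy]
    if len(candidates) < needed:
        raise RuntimeError(f"need {needed} nodes, only {len(candidates)} available")
    return candidates[:needed]

def reserve_commands(nodes, run_id, key="training-run"):   # key is a hypothetical name
    """Label and taint each node so only this run schedules onto it."""
    cmds = []
    for node in nodes:
        cmds.append(f"kubectl label nodes {node} {key}={run_id}")
        cmds.append(f"kubectl taint nodes {node} {key}={run_id}:NoSchedule")
    return cmds

def release_commands(nodes, key="training-run"):
    """Teardown: a trailing '-' removes the label and the taint."""
    cmds = []
    for node in nodes:
        cmds.append(f"kubectl label nodes {node} {key}-")
        cmds.append(f"kubectl taint nodes {node} {key}:NoSchedule-")
    return cmds

nodes = select_nodes(["n1", "n2", "n3", "n4"], faulty={"n2"}, busy={"n3"}, needed=2)
setup = reserve_commands(nodes, run_id="run42")
teardown = release_commands(nodes)
```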

Faulty nodes don't get decommissioned — they run dev machines on them so healthy nodes stay free for training. "Packerman" is the Kubernetes operator that packs dev workloads onto faulty nodes.

Observability

The most useful GPU metrics in practice:

  • DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (tensor core utilization) — their preferred health indicator; correlated with training stage, resolution, and thermal throttling
  • DCGM_FI_DEV_PCIE_REPLAY_COUNTER — PCIe replays on a single GPU consistently preceded crashes
  • InfiniBand metrics — "arguably the most important" in their experience. Fabric instability was the single largest contributor to run crashes. They implemented a custom DaemonSet to export NVLink and IB metrics that DCGM doesn't export by default

Their scale observation: doubling GPU count produced substantially more instability than expected. Below 128 GPUs, runs were very stable for days. At very large scale, no run exceeded 24 hours without a crash — often with no visible metric spike.

Weka Filesystem

They switched from Ceph (poor performance at their scale) to Weka. Result: filesystem downtime dropped sharply, performance improved comparably. Checkpointing at ~30 seconds per checkpoint allowed aggressive fault recovery. The entire research data footprint — images, datasets, checkpoints, artifacts — lives on one Weka cluster.

Krablet Data System

Their custom data warehouse for training data curation:

  • Cluster of PostgreSQL servers, each shard called a "krablet"
  • Each krablet has a Postgres instance + "funnel" servers that batch and queue mutations asynchronously to minimize lock contention
  • All reads proxied through "RPC" servers (replacing a traditional connection pooler)
  • Scales to 208 TB of metadata and tens of thousands of contended UPSERT transactions per second

The core insight: using Postgres queues with FOR UPDATE SKIP LOCKED for all data processing gives automatic retry behavior (failed rows get retried at end of queue), dynamic worker scaling, partial processing support, and continuous incremental ingestion — without needing Ray, Spark, or Kafka.
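The queue pattern itself is small. The sketch below assumes a hypothetical `work_queue` table with `id`, `payload`, and `done` columns and a DB-API connection such as one from psycopg; none of these names come from the report. Each worker locks a batch of unfinished rows, skipping any that another worker already holds, and marks them done in the same transaction.

```python
# Hypothetical table: work_queue(id, payload, done). Names are illustrative only.
CLAIM_SQL = """
SELECT id, payload
FROM work_queue
WHERE done = false
ORDER BY id
LIMIT %s
FOR UPDATE SKIP LOCKED
"""

DONE_SQL = "UPDATE work_queue SET done = true WHERE id = %s"

def process_batch(conn, handler, limit=100):
    """Claim up to `limit` unlocked rows, process them, and commit.

    If handler raises, the transaction is rolled back, the row locks are
    released, and the rows become visible to other workers again.
    """
    try:
        with conn.cursor() as cur:
            cur.execute(CLAIM_SQL, (limit,))
            rows = cur.fetchall()
            for row_id, payload in rows:
                handler(payload)
                cur.execute(DONE_SQL, (row_id,))
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    return len(rows)
```

Because locked rows are skipped rather than waited on, adding workers needs no coordination: each new worker simply claims whatever is currently free. This sketch retries failed rows in `id` order; sending them to the end of the queue, as the report describes, would need an extra ordering column.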

On top of this they expose "pluck", a global map API usable from a notebook. It uses TABLESAMPLE for keyspace partitioning and cloudpickle to serialize user-defined functions for remote execution.


Future Work They Called Out

  • Native 2K–4K resolution with sparse attention
  • MoE architecture for the next pretraining cycle
  • NVFP4 pretraining
  • Muon optimizer (showed strong results in ablations but not adopted for the final run due to time constraints)
  • Multi-teacher on-policy distillation (MOPD) — allows domain-specific RL teams to train experts without risking regressions in other domains, then distill into a single student
  • Architecture unification — collapsing autoencoder, diffusion transformer, text encoder, and prompt expander into a single model, following the LLM pattern

What to Take Away

Krea 2 is notable for two reasons that don't typically appear in the same paper.

First, the zero synthetic data commitment at pretraining scale is unusual and principled — they are betting that the quality ceiling from a clean real-data distribution is higher than what is reachable through distillation shortcuts.

Second, the infrastructure write-up is unusually honest about failure modes: the PCIe replay pattern that precedes crashes, the fact that doubling GPU count made stability dramatically worse, the manually maintained per-queue GPU counts in Kueue, and the Ceph-to-Weka migration. Most technical reports smooth over operational pain. This one doesn't.

The model weights, inference code, and the full technical report are available at krea.ai and on Hugging Face.
