Moebius: 0.2B Parameters, 10B-Level Inpainting, 15× Faster Than FLUX

Moebius is a 226M-parameter image inpainting model from HUST and VIVO AI Lab that matches or surpasses FLUX.1-Fill-Dev (11.9B parameters) across 6 benchmarks. It runs at 26ms per step — over 15× faster — using under 2% of the parameters. Here is how it works and what it means for running inpainting on real hardware.

7 min read · Yash Thakker
AI Models · Computer Vision · Open Source · Image Generation · Research

The standing assumption in image inpainting has been that quality costs compute. FLUX.1-Fill-Dev, Black Forest Labs' flagship inpainting model, has 11.9 billion parameters. SD3.5 Large-Inpainting is also in the multi-billion-parameter range. The implicit message: if you want good inpainting, you need a data center.

Moebius directly challenges this. 226 million parameters. 26ms per diffusion step. Matching or surpassing FLUX.1-Fill-Dev across six benchmarks.

This is not a pruned or quantized copy of an existing model. It is a different architecture, designed from the start around the specific constraints of inpainting.

arXiv: 2606.19195. Authors from Huazhong University of Science and Technology and VIVO AI Lab.


The Core Claim and Why It's Surprising

The Moebius result is surprising not because a small model is fast (small models usually are) but because it does not give up quality to be fast.

The typical trade-off in model compression:

  • Quantization: reduces weight precision, trades accuracy for speed
  • Pruning: removes weights, trades accuracy for size
  • Knowledge distillation: trains a smaller model to mimic a larger one, still often gives up accuracy

Moebius reaches a different outcome by combining two things: an attention replacement designed to avoid the representation bottleneck of compressed models, and a distillation strategy that operates entirely in latent space, which skips the most expensive part of distillation (decoding to pixels).

The result is a model that is not "almost as good" but on par with or better than a model roughly 50× its size.


The Architecture: LλMI Block

Standard transformer self-attention is quadratic in sequence length. For image patches — especially high-resolution ones — this is computationally expensive. Existing efficient attention variants (linear attention, sparse attention, local windows) help but introduce their own approximation errors.

The LλMI (Local-λ Mix Interaction) block takes a different approach: instead of approximating attention, it replaces the attention mechanism with two complementary modules, each of which summarizes its context into a fixed-size linear matrix.

Local-λ module:

  • Handles spatial context
  • Condenses local patch relationships into a fixed-size linear matrix
  • No quadratic dependency on sequence length
  • Preserves fine spatial detail that standard attention would otherwise require many heads to capture

Interactive-λ module:

  • Handles global semantic priors
  • Condenses global context (the unmasked region, style, semantic category) into a fixed-size linear representation
  • Provides the global coherence that makes inpainted regions "belong" to the image

Together, these two modules cover what a standard cross-attention + self-attention pair would do in a full transformer, with a fraction of the parameters and compute.
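To make "condensing context into a fixed-size linear matrix" concrete, here is a minimal sketch in the style of lambda layers and linear attention: keys and values from the context are contracted into a k × v summary, and each query then reads from that summary. This is a generic illustration of the idea, not the LλMI block from the paper. The function names, the softmax normalization, and the toy dimensions are all assumptions made for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def summarize_context(keys, values):
    """Contract n context tokens into a fixed-size (k x v) matrix.

    keys:   n rows of length k
    values: n rows of length v
    Each key channel is softmax-normalized across the n positions,
    so the result has the same shape no matter how large n is.
    """
    n, k, v = len(keys), len(keys[0]), len(values[0])
    weights = [softmax([keys[i][c] for i in range(n)]) for c in range(k)]
    return [
        [sum(weights[c][i] * values[i][d] for i in range(n)) for d in range(v)]
        for c in range(k)
    ]

def apply_summary(queries, summary):
    """Each query (length k) reads from the summary: out = q @ summary."""
    k, v = len(summary), len(summary[0])
    return [
        [sum(q[c] * summary[c][d] for c in range(k)) for d in range(v)]
        for q in queries
    ]

# Toy data: 6 context tokens, k = 2 key channels, v = 3 value channels.
keys = [[0.1 * i, -0.2 * i] for i in range(6)]
values = [[float(i), 1.0, 2.0 * i] for i in range(6)]
summary = summarize_context(keys, values)
outputs = apply_summary([[1.0, 0.0], [0.5, 0.5]], summary)
print(len(summary), len(summary[0]))  # summary shape: k rows, v columns
```

Building the summary costs time linear in the number of context tokens, and the summary itself stays k × v whether the context has 6 tokens or 60,000. No n × n attention map is ever formed.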

The key insight: inpainting is a constrained task. You have the unmasked region as context. You have the mask shape. You know the output domain (the same image). This is a much more constrained problem than general image generation — and a specialized architecture can exploit those constraints.


The Distillation: From PixelHacker in Latent Space

Architecture alone does not close the quality gap between 226M and 11.9B parameters. Moebius uses PixelHacker as a teacher model in a structured distillation process.

The critical design decision: all distillation happens in latent space.

Why this matters:

  • Pixel-space distillation requires decoding latent representations to pixels, comparing them, and backpropagating through the decoder — expensive
  • Latent-space distillation compares representations before decoding — much cheaper
  • The latent space already captures the semantic and structural information that matters for quality alignment

Multi-granularity alignment:

The distillation aligns at two scales:

  1. Microscopic — intermediate feature alignment. The student model's hidden representations are pulled toward the teacher's representations at multiple layers. This is how the student learns what "good inpainting features" look like at each stage of the denoising process.

  2. Macroscopic — diffusion trajectory alignment. The student's and teacher's predicted denoising trajectories are compared at a higher level. This ensures the student follows a similar denoising path, rather than only producing a similar-looking final output by a different route.
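The two alignment scales can be sketched as two loss terms computed directly on latent tensors. The version below is a minimal illustration under stated assumptions: it uses mean squared error for both terms, treats latents as flat lists, and assumes student and teacher features have already been brought to the same size (in practice a learned projection is usually needed when widths differ). The function names are hypothetical, and the paper's actual loss definitions may differ.

```python
def mse(a, b):
    """Mean squared error between two equal-length flat vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def mean_pairwise_mse(student_items, teacher_items):
    """Average MSE over paired (student, teacher) vectors."""
    if len(student_items) != len(teacher_items):
        raise ValueError("need one teacher item per student item")
    losses = [mse(s, t) for s, t in zip(student_items, teacher_items)]
    return sum(losses) / len(losses)

def feature_alignment_loss(student_feats, teacher_feats):
    """Microscopic term: match intermediate latent features layer by layer."""
    return mean_pairwise_mse(student_feats, teacher_feats)

def trajectory_alignment_loss(student_steps, teacher_steps):
    """Macroscopic term: match denoising predictions timestep by timestep."""
    return mean_pairwise_mse(student_steps, teacher_steps)

# Toy latents (flattened): two paired layers, three paired timesteps.
student_feats = [[0.0, 1.0], [2.0, 2.0]]
teacher_feats = [[0.0, 3.0], [2.0, 2.0]]
student_steps = [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
teacher_steps = [[1.0, 1.0], [0.5, 1.5], [0.0, 0.0]]

feat_loss = feature_alignment_loss(student_feats, teacher_feats)
traj_loss = trajectory_alignment_loss(student_steps, teacher_steps)
print(feat_loss, traj_loss)
```

Neither term touches a decoder: both compare latents with latents, which is where the cost saving over pixel-space distillation comes from.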

Gradient norm adaptive loss weighting:

A recurring problem in multi-objective training: gradients from different loss terms can interfere. A large gradient from the trajectory loss can swamp what the feature alignment loss was trying to teach.

Moebius addresses this with adaptive loss weighting based on gradient norms — the relative contribution of each loss term is dynamically adjusted during training to keep gradient magnitudes balanced. This is the "adaptive" part of the distillation strategy.
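One simple way to balance gradient magnitudes is to weight each loss inversely to the norm of its gradient, then rescale so the weights sum to the number of loss terms. The sketch below shows that rule only as an illustration of the general idea. The exact weighting scheme used by Moebius is not described here, so the formula, the function names, and the toy gradients are all assumptions.

```python
import math

def grad_norm(grad):
    """L2 norm of a flat gradient vector."""
    return math.sqrt(sum(g * g for g in grad))

def adaptive_weights(grads, eps=1e-12):
    """One weight per loss term, inversely proportional to its gradient norm.

    After weighting, every term contributes a gradient of (nearly) the
    same magnitude. Weights are rescaled to sum to len(grads).
    """
    inverse = [1.0 / (grad_norm(g) + eps) for g in grads]
    scale = len(grads) / sum(inverse)
    return [w * scale for w in inverse]

# Toy gradients: the trajectory term is 10x larger than the feature term.
feature_grad = [0.3, 0.4]      # norm 0.5
trajectory_grad = [3.0, 4.0]   # norm 5.0

w_feat, w_traj = adaptive_weights([feature_grad, trajectory_grad])
print(w_feat * grad_norm(feature_grad), w_traj * grad_norm(trajectory_grad))
```

In a real training loop the norms would be recomputed (or tracked with a running average) every few steps, and the weights would be treated as constants when backpropagating the combined loss.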


Benchmark Results

Six benchmarks, two domains:

Natural scenes (Places2): Moebius matches or surpasses FLUX.1-Fill-Dev and SD3.5 Large-Inpainting on standard inpainting metrics. Places2 is a standard benchmark covering diverse scene categories — the test is whether the inpainted content is realistic, coherent with the scene, and free of artifacts.

Portrait scenes (CelebA-HQ and FFHQ): Moebius shows particular strength here. Complex textures — hair, skin detail, specular highlights — and facial plausibility are highlighted as areas where Moebius surpasses the larger models.

The portrait advantage is plausible given the task-specific specialization argument: portrait inpainting has specific prior structure (face symmetry, skin tone coherence, expected feature placement) that a specialist model can learn to exploit more directly than a generalist model trying to handle everything.

Speed:

  • 26ms per diffusion step on a single GPU
  • 15× total inference acceleration vs. FLUX.1-Fill-Dev
  • Parameter ratio: 226M vs. 11.9B, about 1.9% of the size
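The size figures are easy to check from the two parameter counts quoted above. The variable names are just for illustration:

```python
moebius_params = 226e6   # 226M, as quoted in the post
flux_params = 11.9e9     # 11.9B, as quoted in the post

fraction = moebius_params / flux_params   # Moebius as a share of FLUX
multiple = flux_params / moebius_params   # how many times larger FLUX is

print(f"Moebius is {fraction:.1%} of FLUX.1-Fill-Dev's parameter count")
print(f"FLUX.1-Fill-Dev is about {multiple:.0f}x larger")
```

Both "under 2%" and "roughly 50×" are rounded statements of the same ratio.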

Why "Task-Specific Specialist" Is the Right Frame

The paper explicitly frames Moebius as a specialist over bloated generalists. This is worth taking seriously as a design philosophy.

Generalist foundation models — FLUX, SD3.5, GPT-4o with image capabilities — are optimized to do everything reasonably well. Inpainting is one task among many they can perform. The architecture, training data, and parameter budget are shared across all tasks.

A specialist model can:

  • Use an architecture specifically suited to inpainting's constraints (masked context + known output domain)
  • Train on data specifically relevant to the task
  • Deploy on hardware that wouldn't support the generalist

The trade-off: it can't do anything else. You can't use Moebius for text-to-image generation, style transfer, or conditioning on arbitrary prompts.

For production systems where inpainting is the task — object removal, photo restoration, content completion — this trade-off is almost always worth taking.


What This Means for Deployment

The practical implication of 226M parameters at 26ms/step is that Moebius should be deployable on consumer GPUs.

  • FLUX.1-Fill-Dev: requires A100-class hardware for practical use, ~400ms+ per step on consumer GPUs
  • Moebius: 26ms per step on a single GPU — consumer RTX cards can handle this

For product builders running inpainting workloads:

  • Lower cloud compute costs per inference
  • Feasibility for on-device or edge deployment
  • Real-time or near-real-time inpainting for interactive applications
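Whether a workload counts as "near-real-time" depends on how many denoising steps are run, which is not stated here. A rough budget can be sketched from the 26ms/step figure. The step counts below are hypothetical, and the estimate covers only the denoising loop, not VAE encoding/decoding, data transfer, or pre- and post-processing.

```python
def denoising_latency_ms(steps, ms_per_step=26.0):
    """Time spent in the denoising loop only (no VAE encode/decode, no I/O)."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return steps * ms_per_step

# Hypothetical step counts; the number of steps Moebius uses is not given here.
for steps in (10, 20, 50):
    print(steps, "steps:", denoising_latency_ms(steps), "ms")
```

The same function with a measured per-step time for another model or GPU gives a like-for-like comparison, provided both runs use the same number of steps.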

The Architectural Trade-Off Worth Watching

Moebius's LλMI block sidesteps quadratic attention cost by condensing into fixed-size matrices. This is efficient — but "fixed-size" means there is a representational capacity ceiling.

For very high-resolution images or very complex inpainting scenarios (large masked regions, highly heterogeneous scenes), the fixed-size compression may lose information that full attention would preserve.
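The scaling behind this concern can be put in numbers. The sketch below counts how many entries a full attention map would hold at several resolutions and compares that with a fixed-size summary. The tokenization (one token per 16 × 16 pixels) and the 64 × 64 summary size are assumptions chosen for illustration, not figures from the paper.

```python
def token_count(height, width, pixels_per_token=16):
    """Tokens for an image if each token covers a square patch of pixels."""
    return (height // pixels_per_token) * (width // pixels_per_token)

def attention_map_entries(n_tokens):
    """Full self-attention compares every token with every other token."""
    return n_tokens * n_tokens

def fixed_summary_entries(k=64, v=64):
    """A fixed-size summary does not grow with the number of tokens."""
    return k * v

for side in (512, 1024, 2048):
    n = token_count(side, side)
    print(side, n, attention_map_entries(n), fixed_summary_entries())
```

The attention map grows by 16× each time the side length doubles, while the summary stays the same size. That is the efficiency win, and also the reason for caution: the same number of summary entries has to describe 16× more pairwise structure.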

The benchmarks don't cover extreme-resolution or pathological-mask cases. It's worth watching how Moebius performs on harder-than-benchmark tasks before assuming benchmark quality transfers everywhere.


Bottom Line

Moebius makes three claims worth tracking:

  1. You can build a 226M-parameter model that matches an 11.9B model on a constrained image task. The evidence (six benchmarks, two domains) is reasonably strong.

  2. Latent-space-only distillation is sufficient to transfer capacity from a large teacher. If this holds up, it's a cheaper path for future specialist models.

  3. Task-specific specialists will outperform generalists on specific tasks, especially as tasks become well-defined. Inpainting is well-defined. The result supports the hypothesis.

Whether Moebius represents a one-off result for inpainting specifically, or a pattern that generalizes to other vision tasks (super-resolution, deblurring, segmentation), depends on follow-up work. But the result itself is clean enough to be taken seriously.

Code and models are expected from the project page at hustvl.github.io/Moebius.
