What is HappyHorse 1.1?

HappyHorse 1.1 (快乐小马) is an upgraded AI video generation model from Alibaba's ATH Innovation Unit. It builds on HappyHorse 1.0, which launched in April 2026 and ranked second on Arena.ai. Version 1.1 adds native audio synchronization, multilingual lip-sync, up to 9 reference image inputs, and 720p/1080p output. It is available on fal.ai, happyhorse.com, and Alibaba Cloud platforms.

What is the HappyHorse model architecture?

HappyHorse is built around a ~15B parameter unified self-attention Transformer with no dedicated cross-attention branches. Text, image, video, and audio tokens are concatenated into a single sequence and processed together. It uses DMD-2 distillation to generate video in 8 sampling steps without classifier-free guidance, with a reported wall-clock time of roughly 38 seconds for 1080p output on an H100.

How do I use HappyHorse 1.1 on fal.ai?

Visit fal.ai/models/alibaba/happy-horse/v1.1/text-to-video for text-to-video, fal.ai/models/alibaba/happy-horse/v1.1/image-to-video for image-to-video, or fal.ai/models/alibaba/happy-horse/v1.1/reference-to-video for reference-guided generation. All three are live as of June 22, 2026.

Is HappyHorse open source?

Not yet. The HappyHorse 1.0 launch came with plans to release the base model weights, distilled model, super-resolution module, and inference code, but as of the 1.1 launch the weights have not been published. Alibaba has not given a timeline for the open-source release.

How does HappyHorse 1.1 compare to Sora and Kling?

HappyHorse 1.0 ranked second behind ByteDance Seedance 2.0 on Arena.ai for text-to-video, image-to-video, and video editing. Version 1.1's main differentiators are the 9-image reference system (for multi-scene character consistency) and native audio — features neither Sora nor Kling offer as cleanly in a single pipeline.

What does HappyHorse cost?

HappyHorse is priced per second of generated video. Reference pricing for v1.0: 720p at RMB 0.9/second (discounted pro: RMB 0.44/second), 1080p at RMB 1.6/second (discounted pro: RMB 0.78/second). fal.ai may have different pricing — check the fal.ai model page for current rates.

HappyHorse 1.1: Alibaba Video AI With Native Audio, Multilingual Lip-Sync and 9-Image Reference (2026) | explainx.ai Blog

Alibaba's video generation model just got a major upgrade — and it started from an already strong position.

HappyHorse 1.1 (快乐小马, Kuàilè Xiǎomǎ) dropped today from Alibaba's ATH Innovation Unit. Version 1.0 launched in April 2026 and immediately ranked second on Arena.ai across text-to-video, image-to-video, and video editing benchmarks, behind only ByteDance's Seedance 2.0. Version 1.1 addresses the main gaps in the original: audio synchronization, lip-sync, and multi-reference consistency.

It's available now on fal.ai, happyhorse.com, and Alibaba Cloud platforms.

What's New in 1.1

Native audio synchronization. HappyHorse 1.0's underlying architecture was designed for joint audio-video generation from the start: text, image, video, and audio tokens are processed in a single transformer sequence. Version 1.1 activates this in the production API. Generated video ships with audio that is inherently timed to the visual content, with no separate dubbing pipeline.

Multilingual lip-sync. Characters speaking in generated video have lip movements that match the audio across multiple languages. HappyHorse 1.0 was reportedly trained natively on English, Mandarin, Japanese, Korean, German, and French. Version 1.1 exposes this in the output.

Nine-image reference input. You can now feed up to nine reference images to anchor characters, environments, style palettes, or products across a multi-scene project. This is the production feature the creative community has been asking for — maintaining visual consistency across more than one or two clips without fine-tuning.

1080p output. Version 1.1 offers 1080p alongside 720p, with across-the-board improvements to motion expressiveness, texture detail, and instruction following.

The Architecture Behind It

Understanding why 1.1 landed the way it did requires knowing what 1.0 was built on.

HappyHorse is a ~15B parameter unified self-attention Transformer, not a standard DiT (Diffusion Transformer) like Wan 2.2, HunyuanVideo, or CogVideoX. The key difference: where most video models use dedicated cross-attention branches to inject text conditioning and handle audio in entirely separate modules, HappyHorse concatenates text, image, video, and audio tokens into a single sequence. The same attention layers process everything.
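To make the single-sequence idea concrete, here is a minimal Python sketch. The `Token` class, the modality order, and the token counts are illustrative assumptions, not details published by Alibaba. The only thing taken from the description above is that all four modalities end up in one sequence that full self-attention processes together.

```python
from dataclasses import dataclass

# Assumed ordering of modalities inside the joint sequence (illustrative only).
MODALITY_ORDER = ("text", "image", "video", "audio")


@dataclass(frozen=True)
class Token:
    modality: str  # which stream the token came from
    index: int     # position inside its own stream


def build_sequence(counts: dict[str, int]) -> list[Token]:
    """Concatenate per-modality token streams into one flat sequence."""
    return [
        Token(modality, i)
        for modality in MODALITY_ORDER
        for i in range(counts.get(modality, 0))
    ]


def attention_pairs(seq_len: int) -> int:
    """Full self-attention: every token attends to every token, itself included."""
    return seq_len * seq_len


# Hypothetical token counts, chosen small so the numbers are easy to check.
seq = build_sequence({"text": 4, "image": 2, "video": 6, "audio": 3})
print(len(seq))                   # 15
print(seq[4].modality)            # image
print(attention_pairs(len(seq)))  # 225
```

Because there is no separate cross-attention path in this sketch, an audio token and a video token interact through exactly the same attention computation as two text tokens would.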

The layer layout is a sandwich: the first 4 and last 4 layers handle modality-specific projections, while the middle 32 layers share parameters across all modalities. Audio-video alignment is learned as part of denoising rather than added as a post-processing fix.
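The sandwich layout can be sketched as a simple layer plan. This is a hedged illustration: the 4 + 32 + 4 split comes from the description above, but the `weights_key` helper and the idea of naming one weight set per modality in the outer layers are assumptions made for the example.

```python
MODALITIES = ("text", "image", "video", "audio")


def layer_plan(n_edge: int = 4, n_shared: int = 32) -> list[str]:
    """Return one label per layer: outer layers modality-specific, middle layers shared."""
    return (
        ["modality-specific"] * n_edge
        + ["shared"] * n_shared
        + ["modality-specific"] * n_edge
    )


def weights_key(layer: int, modality: str, plan: list[str]) -> str:
    """Name the (hypothetical) weight set a token of `modality` uses at `layer`."""
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality}")
    if plan[layer] == "shared":
        return f"layer{layer}/shared"
    return f"layer{layer}/{modality}"


plan = layer_plan()
print(len(plan))                       # 40
print(weights_key(0, "audio", plan))   # layer0/audio
print(weights_key(20, "audio", plan))  # layer20/shared
print(weights_key(20, "video", plan))  # layer20/shared
print(weights_key(39, "video", plan))  # layer39/video
```

In this reading, 32 of the 40 layers use the same weights regardless of which modality a token came from.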

Inference uses DMD-2 distillation — eight sampling steps, no classifier-free guidance. The reported wall-clock time is roughly 38 seconds for 1080p on an H100. For context, comparable models using standard DDIM or PLMS samplers at 25–50 steps take several minutes for the same output.
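A rough way to see where the speedup comes from is to count model forward passes. The sketch below assumes the baseline samplers use classifier-free guidance in its standard form, which evaluates the model twice per step (once conditional, once unconditional), and that each pass costs about the same. Neither assumption comes from the reported benchmark, so treat the ratios as back-of-the-envelope figures rather than measured results.

```python
def forward_passes(steps: int, uses_cfg: bool) -> int:
    """Count model evaluations for one generation.

    Standard classifier-free guidance needs two evaluations per sampling step;
    a guidance-free sampler needs one.
    """
    return steps * (2 if uses_cfg else 1)


distilled = forward_passes(8, uses_cfg=False)       # 8-step distilled sampler, no CFG
baseline_low = forward_passes(25, uses_cfg=True)    # 25-step sampler with CFG
baseline_high = forward_passes(50, uses_cfg=True)   # 50-step sampler with CFG

print(distilled)                  # 8
print(baseline_low)               # 50
print(baseline_high)              # 100
print(baseline_low / distilled)   # 6.25
print(baseline_high / distilled)  # 12.5
```

Under these assumptions the distilled sampler does roughly 6x to 12x fewer model evaluations, which is in line with the gap between tens of seconds and several minutes.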

This architecture is why 1.1's audio output is genuinely synchronized rather than loosely aligned: the model denoises video and audio jointly from the first step.

How to Use It on fal.ai

All three generation modes are live today:

Mode	fal.ai endpoint
Text to Video	`fal.ai/models/alibaba/happy-horse/v1.1/text-to-video`
Image to Video	`fal.ai/models/alibaba/happy-horse/v1.1/image-to-video`
Reference to Video	`fal.ai/models/alibaba/happy-horse/v1.1/reference-to-video`

fal.ai exposes all three via API with the same alibaba/happy-horse/v1.1 model prefix, making it straightforward to build generation pipelines.
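A minimal Python sketch of calling these endpoints, assuming the `fal-client` package and its `fal_client.subscribe(application, arguments=...)` call. The application ids are derived from the URLs in the table above and have not been verified against the live API, and the `prompt` argument name is an assumption. Check the schema on each model page before relying on any of this.

```python
import os

MODEL_PREFIX = "alibaba/happy-horse/v1.1"  # prefix quoted in the article
MODES = ("text-to-video", "image-to-video", "reference-to-video")


def endpoint_id(mode: str) -> str:
    """Build the application id for one of the three generation modes."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    return f"{MODEL_PREFIX}/{mode}"


def generate(mode: str, arguments: dict) -> dict:
    """Submit a job and block until the result is ready.

    Needs `pip install fal-client` and a FAL_KEY environment variable.
    The shape of `arguments` is defined by the model page, not by this sketch.
    """
    import fal_client  # imported lazily so endpoint_id() works without the package

    if not os.environ.get("FAL_KEY"):
        raise RuntimeError("set FAL_KEY before calling generate()")
    return fal_client.subscribe(endpoint_id(mode), arguments=arguments)


print(endpoint_id("text-to-video"))  # alibaba/happy-horse/v1.1/text-to-video

# Example call ("prompt" is an assumed argument name):
# result = generate("text-to-video", {"prompt": "a horse galloping on a beach"})
```

Keeping the id construction separate from the network call makes it easy to switch modes in a pipeline without touching the request code.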

The Reference System: Why It Matters

Most video generation models accept one or two reference images at most. HappyHorse 1.1 takes up to nine.

The practical difference: when you're producing a series of clips — a product with multiple camera angles, a branded character across scenes, a short film with recurring cast — you're no longer prompting your way to visual consistency. You show the model what you're building. Reference images constrain the character's face, clothing, environment, props, and color palette, and the model maintains those constraints across the generated sequence.

For creative teams doing ad production or short-form branded content, this changes the iteration count significantly. You're not regenerating until the model happens to get your character right. You anchor it once and it stays.
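If you script multi-scene jobs, it helps to enforce the nine-image limit before sending a request. The sketch below is illustrative: the `prompt` and `image_urls` field names are hypothetical, so check the reference-to-video model page for the real schema. Only the limit of nine comes from the article.

```python
MAX_REFERENCE_IMAGES = 9  # limit stated for HappyHorse 1.1


def reference_arguments(prompt: str, image_urls: list[str]) -> dict:
    """Build a request payload for reference-guided generation.

    Field names are assumptions for illustration, not the documented schema.
    """
    if not image_urls:
        raise ValueError("at least one reference image is required")
    if len(image_urls) > MAX_REFERENCE_IMAGES:
        raise ValueError(
            f"got {len(image_urls)} reference images; "
            f"the limit is {MAX_REFERENCE_IMAGES}"
        )
    return {"prompt": prompt, "image_urls": list(image_urls)}


# Reuse the same anchors for every scene so the character stays consistent.
anchors = ["https://example.com/face.png", "https://example.com/outfit.png"]
scenes = ["walks into a cafe", "orders a coffee", "leaves in the rain"]
payloads = [reference_arguments(scene, anchors) for scene in scenes]
print(len(payloads))  # 3
```

Each payload carries the same anchor images, which is the "anchor it once" workflow described above expressed in code.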

What's Still Missing

Audio input isn't available yet. The lip-sync is driven by HappyHorse's native audio generation — meaning the model creates audio that matches the video it generates. What you can't do is feed in an existing mp3 or wav file and have characters sync to your voiceover. This is the most-requested missing feature in early feedback.

Given that joint audio-video is already in the architecture, audio conditioning from user input seems like a near-term addition. Alibaba hasn't given a timeline.

Pricing

HappyHorse runs on a per-second pricing model. Reference pricing for v1.0 on Alibaba Cloud:

Resolution	Listed price	Pro (discounted)
720p	RMB 0.9 / second	RMB 0.44 / second
1080p	RMB 1.6 / second	RMB 0.78 / second

fal.ai pricing for 1.1 may differ — check the model page for current rates before building at scale.
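For budgeting, per-second pricing reduces to a single multiplication. The sketch below uses the v1.0 Alibaba Cloud reference prices from the table above; the `clip_cost` helper and the clip lengths are made up for illustration, and fal.ai rates for 1.1 may differ.

```python
from decimal import Decimal

# RMB per second of generated video, v1.0 reference pricing from the table above.
PRICES = {
    "720p": {"listed": Decimal("0.9"), "pro": Decimal("0.44")},
    "1080p": {"listed": Decimal("1.6"), "pro": Decimal("0.78")},
}


def clip_cost(resolution: str, seconds: int, tier: str = "listed") -> Decimal:
    """Cost in RMB for one clip of the given length."""
    return PRICES[resolution][tier] * seconds


print(clip_cost("1080p", 10))          # 16.0
print(clip_cost("1080p", 10, "pro"))   # 7.80
print(clip_cost("720p", 10))           # 9.0

# A hypothetical campaign: 30 clips of 10 seconds each at 1080p, listed price.
print(30 * clip_cost("1080p", 10))     # 480.0
```

`Decimal` is used instead of floats so the totals stay exact when summed across many clips.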

HappyHorse 1.1 vs. the Field

Feature	HappyHorse 1.1	Seedance 2.0	Sora	Kling 3.0
Multi-reference images	Up to 9	Limited	No	1–2
Native audio sync	Yes	No	Limited	No
Multilingual lip-sync	Yes	No	No	Limited
Max resolution	1080p	1080p	1080p	1080p
Audio input (bring your own audio)	Not yet	No	No	Limited
Open source weights	Not yet	No	No	No
API access	Yes (fal.ai)	Limited	Limited	Yes

HappyHorse 1.0 ranked second on Arena.ai, behind Seedance 2.0. Version 1.1's main differentiation is the combination of multi-reference input and native audio: none of Seedance 2.0, Sora, or Kling offers both in a single pipeline.

Who Should Try It

Content creators building short-form ads and branded content where character/product consistency across multiple scenes is required.

Developers who need a production-ready video generation API with reference conditioning — the fal.ai endpoint is stable and API-accessible today.

Global marketing teams who need multilingual video without re-generating or dubbing in post — 1.1's lip-sync handles language switching at generation time.

Short film creators blocked by the industry's inability to maintain consistent characters across more than one or two clips.

The Open-Source Question

The HappyHorse 1.0 launch came with plans to release the base model weights, the distilled 8-step model, the super-resolution module, and inference code. None of these has shipped yet. The 1.1 launch doesn't change that: it's still a closed API product.

For developers who need local inference, access to weights, or the ability to fine-tune, the existing open-weights leaders (Wan 2.2, HunyuanVideo-1.5, LTX-2) remain the options. HappyHorse's advantage is production quality and speed via API, not open access.
