Alibaba's video generation model just got a major upgrade — and it started from an already strong position.
HappyHorse 1.1 (快乐小马, Kuàilè Xiǎomǎ) dropped today from Alibaba's ATH Innovation Unit. Version 1.0 launched in April 2026 and immediately ranked second on Arena.ai across text-to-video, image-to-video, and video editing benchmarks, behind only ByteDance's Seedance 2.0. Version 1.1 addresses the main gaps in the original: audio synchronization, lip-sync, and multi-reference consistency.
It's available now on fal.ai, happyhorse.com, and Alibaba Cloud platforms.
What's New in 1.1
Native audio synchronization. HappyHorse 1.0's underlying architecture was designed for joint audio-video generation from the start: text, image, video, and audio tokens are processed in a single transformer sequence. Version 1.1 activates this in the production API. Generated video ships with audio that is timed to the visual content at generation, with no separate dubbing pipeline.
Multilingual lip-sync. Characters speaking in generated video have lip movements that match the audio across multiple languages. HappyHorse 1.0 was reportedly trained natively on English, Mandarin, Japanese, Korean, German, and French. Version 1.1 exposes this in the output.
Nine-image reference input. You can now feed up to nine reference images to anchor characters, environments, style palettes, or products across a multi-scene project. This is the production feature the creative community has been asking for — maintaining visual consistency across more than one or two clips without fine-tuning.
1080p output. Generation at 1080p is available alongside 720p, with across-the-board improvements to motion expressiveness, texture detail, and instruction following.
The Architecture Behind It
Understanding why 1.1 landed the way it did requires knowing what 1.0 was built on.
HappyHorse is a ~15B parameter unified self-attention Transformer — not a standard DiT (Diffusion Transformer) like Wan 2.2, HunyuanVideo, or CogVideoX. The key difference: where most video models use dedicated cross-attention branches to inject text conditioning and separate audio modules entirely, HappyHorse concatenates text, image, video, and audio tokens into a single sequence. The same attention layers process everything.
The layer layout is a sandwich: the first 4 and last 4 layers handle modality-specific projections, while the middle 32 layers share parameters across all modalities. Audio-video alignment is learned as part of denoising rather than added as a post-processing fix.
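The sandwich layout can be pictured with a small sketch. Only the layer counts (4, 32, 4) and the four modalities come from the description above; the function names, the 0-based indexing, and the text-image-video-audio ordering of the sequence are illustrative assumptions, not HappyHorse's actual code.

```python
# Layer counts as described in the article; everything else is illustrative.
HEAD_LAYERS = 4     # first 4 layers: modality-specific projections
SHARED_LAYERS = 32  # middle 32 layers: parameters shared across modalities
TAIL_LAYERS = 4     # last 4 layers: modality-specific projections
TOTAL_LAYERS = HEAD_LAYERS + SHARED_LAYERS + TAIL_LAYERS

# Assumed ordering; the article only says the tokens are concatenated.
MODALITY_ORDER = ("text", "image", "video", "audio")


def layer_role(index: int) -> str:
    """Return the role of a layer given its 0-based index."""
    if not 0 <= index < TOTAL_LAYERS:
        raise IndexError(f"layer index {index} outside 0..{TOTAL_LAYERS - 1}")
    if index < HEAD_LAYERS or index >= TOTAL_LAYERS - TAIL_LAYERS:
        return "modality-specific"
    return "shared"


def build_sequence(tokens: dict[str, list[int]]) -> list[tuple[str, int]]:
    """Concatenate per-modality tokens into one tagged sequence.

    Every layer then attends over this single sequence, rather than
    routing text or audio through separate cross-attention branches.
    """
    return [(m, t) for m in MODALITY_ORDER for t in tokens.get(m, [])]


if __name__ == "__main__":
    roles = [layer_role(i) for i in range(TOTAL_LAYERS)]
    print(TOTAL_LAYERS, roles.count("shared"), roles.count("modality-specific"))
    print(build_sequence({"text": [1, 2], "video": [7], "audio": [9]}))
```

The counts imply a 40-layer stack in which 80% of the layers are shared across modalities.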
Inference uses DMD-2 distillation — eight sampling steps, no classifier-free guidance. The reported wall-clock time is roughly 38 seconds for 1080p on an H100. For context, comparable models using standard DDIM or PLMS samplers at 25–50 steps take several minutes for the same output.
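A rough way to see where the speedup comes from is to count network forward passes. The sketch below assumes the baseline sampler uses classifier-free guidance with two passes per step (one conditional, one unconditional) and that every pass costs the same. Both are simplifications introduced here, not figures from Alibaba, so treat the ratios as an upper-level intuition rather than a benchmark.

```python
def forward_passes(steps: int, uses_cfg: bool) -> int:
    """Count denoiser forward passes for one generation.

    Assumes classifier-free guidance costs two passes per step
    (conditional + unconditional); without it, one pass per step.
    """
    if steps <= 0:
        raise ValueError("steps must be positive")
    return steps * (2 if uses_cfg else 1)


# Distilled sampler as described: 8 steps, no classifier-free guidance.
distilled = forward_passes(8, uses_cfg=False)

# Baseline range from the article (25-50 steps), with CFG assumed on.
baseline_low = forward_passes(25, uses_cfg=True)
baseline_high = forward_passes(50, uses_cfg=True)

if __name__ == "__main__":
    print(distilled, baseline_low, baseline_high)
    print(baseline_low / distilled, baseline_high / distilled)
```

Under these assumptions the distilled sampler runs 8 passes against 50 to 100 for the baseline. Real wall-clock time also depends on model size, resolution, and hardware.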
This architecture is why 1.1's audio output is genuinely synchronized rather than loosely aligned: the model denoises video and audio jointly from the first step.
How to Use It on fal.ai
All three generation modes are live today:
| Mode | fal.ai endpoint |
|---|---|
| Text to Video | fal.ai/models/alibaba/happy-horse/v1.1/text-to-video |
| Image to Video | fal.ai/models/alibaba/happy-horse/v1.1/image-to-video |
| Reference to Video | fal.ai/models/alibaba/happy-horse/v1.1/reference-to-video |
fal.ai exposes all three via API with the same alibaba/happy-horse/v1.1 model prefix, making it straightforward to build generation pipelines.
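Because the three modes share one prefix, a thin wrapper can select the endpoint by mode. The sketch below uses the `fal_client` Python package and its `subscribe` call; the endpoint IDs are taken from the table above, but the argument names in the usage example (`prompt`, `resolution`) are assumptions, so check the model page for the actual input schema before relying on them.

```python
MODEL_PREFIX = "alibaba/happy-horse/v1.1"
MODES = ("text-to-video", "image-to-video", "reference-to-video")


def endpoint_id(mode: str) -> str:
    """Build the fal.ai endpoint ID for one of the three generation modes."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    return f"{MODEL_PREFIX}/{mode}"


def generate(mode: str, arguments: dict) -> dict:
    """Submit a request and block until the result is ready.

    Requires `pip install fal-client` and a FAL_KEY environment variable.
    The import is deferred so the helpers above work without the package.
    """
    import fal_client

    return fal_client.subscribe(endpoint_id(mode), arguments=arguments)


if __name__ == "__main__":
    # Hypothetical argument names; the real schema may differ.
    result = generate(
        "text-to-video",
        {"prompt": "A pony galloping across a beach at sunset", "resolution": "1080p"},
    )
    print(result)
```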
The Reference System: Why It Matters
Most video generation models accept one or two reference images at most. HappyHorse 1.1 takes up to nine.
The practical difference: when you're producing a series of clips — a product with multiple camera angles, a branded character across scenes, a short film with recurring cast — you're no longer prompting your way to visual consistency. You show the model what you're building. Reference images constrain the character's face, clothing, environment, props, and color palette, and the model maintains those constraints across the generated sequence.
For creative teams doing ad production or short-form branded content, this changes the iteration count significantly. You're not regenerating until the model happens to get your character right. You anchor it once and it stays.
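In a pipeline, it helps to enforce the nine-image cap before a request leaves your code. The helper below is a minimal sketch: the limit comes from the article, while the payload field names (`prompt`, `reference_image_urls`) are hypothetical placeholders and may not match the real reference-to-video schema.

```python
MAX_REFERENCE_IMAGES = 9  # limit stated for HappyHorse 1.1


def build_reference_arguments(prompt: str, image_urls: list[str]) -> dict:
    """Validate reference images and assemble a request payload.

    Field names are hypothetical; adjust them to the documented schema.
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not image_urls:
        raise ValueError("at least one reference image is required")
    if len(image_urls) > MAX_REFERENCE_IMAGES:
        raise ValueError(
            f"got {len(image_urls)} reference images; "
            f"the maximum is {MAX_REFERENCE_IMAGES}"
        )
    return {"prompt": prompt, "reference_image_urls": list(image_urls)}


if __name__ == "__main__":
    refs = [f"https://example.com/ref-{i}.png" for i in range(1, 4)]
    print(build_reference_arguments("The mascot walks into a cafe", refs))
```

Reusing the same validated reference list for every clip in a project is what keeps the character or product anchored from scene to scene.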
What's Still Missing
Audio input isn't available yet. The lip-sync is driven by HappyHorse's native audio generation — meaning the model creates audio that matches the video it generates. What you can't do is feed in an existing mp3 or wav file and have characters sync to your voiceover. This is the most-requested missing feature in early feedback.
Given that joint audio-video is already in the architecture, audio conditioning from user input seems like a near-term addition. Alibaba hasn't given a timeline.
Pricing
HappyHorse runs on a per-second pricing model. Reference pricing for v1.0 on Alibaba Cloud:
| Resolution | Listed price | Pro (discounted) |
|---|---|---|
| 720p | RMB 0.9 / second | RMB 0.44 / second |
| 1080p | RMB 1.6 / second | RMB 0.78 / second |
fal.ai pricing for 1.1 may differ — check the model page for current rates before building at scale.
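Per-second pricing makes budgeting a simple multiplication. The sketch below uses the v1.0 Alibaba Cloud prices from the table above; the function and tier names are illustrative, and the result is only an estimate since 1.1 rates on fal.ai may differ.

```python
from decimal import Decimal

# v1.0 reference prices on Alibaba Cloud, in RMB per second of video.
PRICES_RMB_PER_SECOND = {
    "720p": {"listed": Decimal("0.9"), "pro": Decimal("0.44")},
    "1080p": {"listed": Decimal("1.6"), "pro": Decimal("0.78")},
}


def clip_cost(resolution: str, seconds: int, tier: str = "listed") -> Decimal:
    """Estimate the cost in RMB of `seconds` of generated video."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    try:
        rate = PRICES_RMB_PER_SECOND[resolution][tier]
    except KeyError:
        raise ValueError(f"no price for {resolution!r} / {tier!r}") from None
    return rate * Decimal(seconds)


if __name__ == "__main__":
    # Six 5-second clips at 1080p: 30 seconds of output in total.
    total_seconds = 6 * 5
    print(clip_cost("1080p", total_seconds))         # listed price
    print(clip_cost("1080p", total_seconds, "pro"))  # discounted price
```

For that 30-second example the listed price comes to RMB 48 and the Pro price to RMB 23.40.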
HappyHorse 1.1 vs. the Field
| Feature | HappyHorse 1.1 | Seedance 2.0 | Sora | Kling 3.0 |
|---|---|---|---|---|
| Multi-reference images | Up to 9 | Limited | No | 1–2 |
| Native audio sync | Yes | No | Limited | No |
| Multilingual lip-sync | Yes | No | No | Limited |
| Max resolution | 1080p | 1080p | 1080p | 1080p |
| Audio input (BYOA) | Not yet | No | No | Limited |
| Open source weights | Not yet | No | No | No |
| API access | Yes (fal.ai) | Limited | Limited | Yes |
HappyHorse 1.0 ranked second, behind Seedance 2.0. Version 1.1's meaningful differentiation is the combination of multi-reference input and native audio: none of Seedance 2.0, Sora, or Kling 3.0 offers both in a single pipeline.
Who Should Try It
Content creators building short-form ads and branded content where character/product consistency across multiple scenes is required.
Developers who need a production-ready video generation API with reference conditioning: the fal.ai endpoints are live today.
Global marketing teams who need multilingual video without re-generating or dubbing in post — 1.1's lip-sync handles language switching at generation time.
Short film creators blocked by the industry's inability to maintain consistent characters across more than one or two clips.
The Open-Source Question
The HappyHorse 1.0 launch came with plans to release base model weights, the distilled 8-step model, the super-resolution module, and inference code. None of these has shipped yet. The 1.1 launch doesn't change that: it's still a closed API product.
For developers who need local inference, access to weights, or the ability to fine-tune, the existing open-weights leaders (Wan 2.2, HunyuanVideo-1.5, LTX-2) remain the options. HappyHorse's advantage is production quality and speed via API, not open access.