HappyHorse 1.1: Alibaba Upgrades Its Top-Ranked Video Model With Native Audio and Multi-Reference

Alibaba's HappyHorse 1.1 adds synchronized native audio, multilingual lip-sync, up to 9 reference images, and 1080p output. Built on a 15B unified transformer, it was the second-ranked video model in April. Here's what changed and how to use it on fal.ai today.

6 min read · Yash Thakker
Tags: AI Video, Alibaba, Generative AI, Video Models, fal.ai

Alibaba's video generation model just got a major upgrade — and it started from an already strong position.

HappyHorse 1.1 (快乐小马, Kuàilè Xiǎomǎ) dropped today from Alibaba's ATH Innovation Unit. Version 1.0 launched in April 2026 and immediately ranked second on Arena.ai across text-to-video, image-to-video, and video editing benchmarks, behind only ByteDance's Seedance 2.0. Version 1.1 addresses the main gaps in the original: audio synchronization, lip-sync, and multi-reference consistency.

It's available now on fal.ai, happyhorse.com, and Alibaba Cloud platforms.


What's New in 1.1

Native audio synchronization. HappyHorse 1.0's underlying architecture was designed for joint audio-video generation from the start: text, image, video, and audio tokens are processed in a single transformer sequence. Version 1.1 activates this in the production API. Generated video ships with audio that is timed to the visual content by construction, with no separate dubbing pipeline.

Multilingual lip-sync. Characters speaking in generated video have lip movements that match the audio across multiple languages. HappyHorse 1.0 was reportedly trained natively on English, Mandarin, Japanese, Korean, German, and French. Version 1.1 exposes this in the output.

Nine-image reference input. You can now feed up to nine reference images to anchor characters, environments, style palettes, or products across a multi-scene project. This is the production feature the creative community has been asking for — maintaining visual consistency across more than one or two clips without fine-tuning.

1080p output alongside 720p, with across-the-board improvements to motion expressiveness, texture detail, and instruction following.


The Architecture Behind It

Understanding why 1.1 landed the way it did requires knowing what 1.0 was built on.

HappyHorse is a ~15B parameter unified self-attention Transformer — not a standard DiT (Diffusion Transformer) like Wan 2.2, HunyuanVideo, or CogVideoX. The key difference: where most video models use dedicated cross-attention branches to inject text conditioning and separate audio modules entirely, HappyHorse concatenates text, image, video, and audio tokens into a single sequence. The same attention layers process everything.

The layer layout is a sandwich: the first 4 and last 4 layers handle modality-specific projections, while the middle 32 layers share parameters across all modalities. Audio-video alignment is learned as part of denoising rather than added as a post-processing fix.
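
The sandwich layout can be written down as simple bookkeeping. The sketch below only labels layers by role using the counts quoted above; the function name and role labels are illustrative, and it says nothing about how the actual model is implemented.

```python
def sandwich_layout(edge_layers: int = 4, shared_layers: int = 32) -> list[str]:
    """Label each layer by role: modality-specific at both ends, shared in the middle."""
    return (
        ["modality-specific"] * edge_layers
        + ["shared"] * shared_layers
        + ["modality-specific"] * edge_layers
    )


layout = sandwich_layout()
print(len(layout))             # total layers implied by 4 + 32 + 4
print(layout.count("shared"))  # layers that share parameters across modalities
```

With the quoted counts, the shared middle block accounts for 32 of the 40 layers, which is where most of the cross-modal learning would have to happen.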

Inference uses DMD-2 distillation — eight sampling steps, no classifier-free guidance. The reported wall-clock time is roughly 38 seconds for 1080p on an H100. For context, comparable models using standard DDIM or PLMS samplers at 25–50 steps take several minutes for the same output.
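
A rough way to see where the speedup comes from is to count network evaluations per clip. The sketch below assumes the common implementation of classifier-free guidance, which runs one conditional and one unconditional pass per step. The function name is illustrative, and the counts ignore differences in model size, resolution, and hardware.

```python
def forward_passes(steps: int, uses_cfg: bool) -> int:
    """Number of network evaluations needed for one sample.

    Assumption: classifier-free guidance costs two passes per step
    (conditional + unconditional); without it, one pass per step.
    """
    return steps * (2 if uses_cfg else 1)


distilled = forward_passes(8, uses_cfg=False)      # 8-step distilled sampler, no guidance
baseline_low = forward_passes(25, uses_cfg=True)   # 25-step sampler with guidance
baseline_high = forward_passes(50, uses_cfg=True)  # 50-step sampler with guidance

print(distilled, baseline_low, baseline_high)
print(baseline_low / distilled, baseline_high / distilled)
```

Under these assumptions the distilled sampler needs roughly 6x to 12x fewer network evaluations, which is in line with the gap between tens of seconds and several minutes.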

This architecture is why 1.1's audio output is genuinely synchronized rather than loosely aligned: the model denoises video and audio jointly from the first step.


How to Use It on fal.ai

All three generation modes are live today:

| Mode | fal.ai endpoint |
| --- | --- |
| Text to Video | fal.ai/models/alibaba/happy-horse/v1.1/text-to-video |
| Image to Video | fal.ai/models/alibaba/happy-horse/v1.1/image-to-video |
| Reference to Video | fal.ai/models/alibaba/happy-horse/v1.1/reference-to-video |

fal.ai exposes all three via API with the same alibaba/happy-horse/v1.1 model prefix, making it straightforward to build generation pipelines.
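
Because the three modes share one prefix, a pipeline can select the endpoint from a mode name. The sketch below is a minimal example under two assumptions: that the endpoint ID is the model-page path after fal.ai/models/, and that requests go through the `fal_client` Python package (`pip install fal-client`, with a `FAL_KEY` environment variable set). The helper names are illustrative, and the request fields are not specified in this post, so take them from the schema on the model page.

```python
MODEL_PREFIX = "alibaba/happy-horse/v1.1"
MODES = ("text-to-video", "image-to-video", "reference-to-video")


def endpoint_id(mode: str) -> str:
    """Build the endpoint ID for one of the three generation modes."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    return f"{MODEL_PREFIX}/{mode}"


def generate(mode: str, arguments: dict):
    """Submit a request and block until the result is ready.

    The keys inside `arguments` are an assumption left to the caller;
    check the model page for the actual input schema.
    """
    import fal_client  # imported lazily so endpoint_id works without the package

    return fal_client.subscribe(endpoint_id(mode), arguments=arguments)


print(endpoint_id("text-to-video"))
```

A call might look like `generate("text-to-video", {"prompt": "..."})`, where `prompt` is a guess at the field name rather than a documented one.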


The Reference System: Why It Matters

Most video generation models accept one reference image at most. HappyHorse 1.1 takes up to nine.

The practical difference: when you're producing a series of clips — a product with multiple camera angles, a branded character across scenes, a short film with recurring cast — you're no longer prompting your way to visual consistency. You show the model what you're building. Reference images constrain the character's face, clothing, environment, props, and color palette, and the model maintains those constraints across the generated sequence.

For creative teams doing ad production or short-form branded content, this changes the iteration count significantly. You're not regenerating until the model happens to get your character right. You anchor it once and it stays.
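
In a pipeline, it helps to enforce the nine-image limit before a request is sent. The sketch below is a minimal example: the payload keys `prompt` and `reference_image_urls` are hypothetical placeholders, the URLs are dummies, and the rule that at least one reference is required is an assumption rather than something stated for the API.

```python
MAX_REFERENCES = 9  # limit stated for HappyHorse 1.1


def build_reference_request(prompt: str, reference_urls: list[str]) -> dict:
    """Assemble a request payload, rejecting empty or oversized reference sets.

    The keys "prompt" and "reference_image_urls" are hypothetical; use the
    field names from the model page's schema.
    """
    if not reference_urls:
        raise ValueError("at least one reference image is required")
    if len(reference_urls) > MAX_REFERENCES:
        raise ValueError(
            f"got {len(reference_urls)} references; the limit is {MAX_REFERENCES}"
        )
    return {"prompt": prompt, "reference_image_urls": list(reference_urls)}


# Nine placeholder URLs: the largest set the model is said to accept.
refs = [f"https://example.com/ref_{i}.png" for i in range(1, 10)]
payload = build_reference_request("The same character walks through a market", refs)
print(len(payload["reference_image_urls"]))
```

Reusing one validated reference set across every clip in a project is what keeps the anchoring consistent from scene to scene.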


What's Still Missing

Audio input isn't available yet. The lip-sync is driven by HappyHorse's native audio generation, meaning the model creates audio that matches the video it generates. What you can't do is feed in an existing MP3 or WAV file and have characters sync to your voiceover. This is the most-requested missing feature in early feedback.

Given that joint audio-video is already in the architecture, audio conditioning from user input seems like a near-term addition. Alibaba hasn't given a timeline.


Pricing

HappyHorse runs on a per-second pricing model. Reference pricing for v1.0 on Alibaba Cloud:

| Resolution | Listed price | Pro (discounted) |
| --- | --- | --- |
| 720p | RMB 0.9 / second | RMB 0.44 / second |
| 1080p | RMB 1.6 / second | RMB 0.78 / second |

fal.ai pricing for 1.1 may differ — check the model page for current rates before building at scale.
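
Per-second pricing makes budgeting a matter of multiplication. The sketch below uses the v1.0 Alibaba Cloud reference prices quoted in this section; the function and tier names are illustrative, and the figures may not match what fal.ai charges for 1.1.

```python
from decimal import Decimal

# RMB per second, from the v1.0 reference prices quoted in this section.
PRICES = {
    ("720p", "listed"): Decimal("0.9"),
    ("720p", "pro"): Decimal("0.44"),
    ("1080p", "listed"): Decimal("1.6"),
    ("1080p", "pro"): Decimal("0.78"),
}


def clip_cost(seconds: int, resolution: str, tier: str = "listed") -> Decimal:
    """Cost in RMB of generating a clip of the given length."""
    return PRICES[(resolution, tier)] * seconds


print(clip_cost(10, "1080p"))         # 10 seconds at the listed 1080p rate
print(clip_cost(10, "1080p", "pro"))  # the same clip at the discounted rate
```

At these rates a 10-second 1080p clip costs RMB 16 at the listed price and RMB 7.8 at the discounted one.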


HappyHorse 1.1 vs. the Field

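| Feature | HappyHorse 1.1 | Seedance 2.0 | Sora | Kling 3.0 |
| --- | --- | --- | --- | --- |
| Multi-reference images | Up to 9 | Limited | No | 1–2 |
| Native audio sync | Yes | No | Limited | No |
| Multilingual lip-sync | Yes | No | No | Limited |
| Max resolution | 1080p | 1080p | 1080p | 1080p |
| Audio input (BYOA) | Not yet | No | No | Limited |
| Open source weights | Not yet | No | No | No |
| API access | Yes (fal.ai) | Limited | Limited | Yes |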

Version 1.0 of HappyHorse ranked second, behind Seedance 2.0. Version 1.1's meaningful differentiation is the combination of multi-reference input and native audio: none of Seedance 2.0, Sora, or Kling 3.0 offers both in a single pipeline.


Who Should Try It

Content creators building short-form ads and branded content where character/product consistency across multiple scenes is required.

Developers who need a production-ready video generation API with reference conditioning — the fal.ai endpoint is stable and API-accessible today.

Global marketing teams who need multilingual video without re-generating or dubbing in post — 1.1's lip-sync handles language switching at generation time.

Short film creators blocked by the industry's inability to maintain consistent characters across more than one or two clips.


The Open-Source Question

With HappyHorse 1.0, Alibaba announced plans to release the base model weights, the distilled 8-step model, the super-resolution module, and inference code. None of these has shipped yet. The 1.1 launch doesn't change that: it's still a closed API product.

For developers who need local inference, access to weights, or the ability to fine-tune, the existing open-weights leaders (Wan 2.2, HunyuanVideo-1.5, LTX-2) remain the options. HappyHorse's advantage is production quality and speed via API, not open access.

