Grok Imagine Video 1.5 Is Here: xAI's #1 Image-to-Video Model with Native Audio (2026)

xAI released Grok Imagine Video 1.5 on June 17, 2026. It claims the #1 spot on the Image-to-Video Arena leaderboard after a +52 Elo jump, adds native synchronized audio, and is priced at $4.20/min versus Sora 2 Pro's $30/min.

10 min read · Yash Thakker
Tags: xAI, Grok, AI Video Generation, Image to Video, Elon Musk, AI Tools

At 9:25 AM on June 17, 2026, Elon Musk posted two words, "wide release," and dropped a link to grok.com/imagine. Twenty minutes earlier, xAI had published a thread announcing Grok Imagine Video 1.5: a new image-to-video model with native synchronized audio, sharper physics, and faster generation times. By the afternoon, the announcement had accumulated 268,000+ views, and the model was being used to generate everything from cinematic clips to creative experiments.

The broader context: Grok Imagine Video 1.5 is not just an incremental update to xAI's video tooling. It is the model that currently sits at #1 on the Image-to-Video Arena leaderboard — beating Sora 2, Veo 3.1, Seedance 2.0, and Kling in blind user testing — at a price point that undercuts every major competitor by a wide margin.


What Grok Imagine Video 1.5 Actually Does

At its core, Grok Imagine Video 1.5 takes an input — a text prompt, a still image, or both — and produces a video clip with synchronized audio. The mechanics:

  • Resolution: 480p or 720p output
  • Frame rate: 24FPS
  • Clip length: 1 to 15 seconds base; 6–10 second extensions via Extend from Frame
  • Audio: Native synchronized sound effects, background audio, and lip-sync generated in a single pass — no separate step
  • Animation modes: Normal, Fun, Custom, or Spicy to set the overall tone
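The limits in the list above can be captured as a small validation sketch. This is illustrative only: the `ClipRequest` class, its field names, and the default values are hypothetical and do not correspond to any published xAI API; only the numeric limits come from the spec list.

```python
from dataclasses import dataclass

# Limits copied from the spec list above. The class and field names are
# hypothetical and are not part of any published xAI API.
ALLOWED_RESOLUTIONS = {"480p", "720p"}
ALLOWED_MODES = {"normal", "fun", "custom", "spicy"}
MIN_SECONDS, MAX_SECONDS = 1, 15
FRAME_RATE = 24  # fixed output frame rate


@dataclass(frozen=True)
class ClipRequest:
    prompt: str
    resolution: str = "720p"
    seconds: int = 6  # arbitrary default chosen for this sketch
    mode: str = "normal"

    def __post_init__(self):
        # Reject anything outside the documented limits before spending credits.
        if self.resolution not in ALLOWED_RESOLUTIONS:
            raise ValueError(f"unsupported resolution: {self.resolution}")
        if not MIN_SECONDS <= self.seconds <= MAX_SECONDS:
            raise ValueError(f"base clip must be 1-15 seconds, got {self.seconds}")
        if self.mode not in ALLOWED_MODES:
            raise ValueError(f"unknown animation mode: {self.mode}")

    @property
    def frame_count(self) -> int:
        # Number of frames in the base clip at 24FPS.
        return self.seconds * FRAME_RATE


print(ClipRequest("a dolly shot of a lighthouse", seconds=15).frame_count)  # 360
```

A maximum-length base clip is 360 frames, which gives a sense of how much temporal consistency the model has to maintain per generation.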

The underlying architecture is Aurora, xAI's autoregressive video generation model. The autoregressive approach is what gives it character consistency across frames — faces do not warp between cuts, and camera movements (pans, dolly moves, tracking shots) execute cleanly without the stuttering common in earlier diffusion-based video models.

The key additions in 1.5 over 1.0:

  • Sharper realism: Physics simulation is more accurate — fabric moves naturally, liquids behave with weight, lighting changes are consistent across camera transitions
  • Better lip-sync: Accuracy in matching spoken audio to mouth movements improved significantly
  • Faster generation: Generation speed is faster than 1.0, though specific throughput numbers have not been published
  • +52 Elo on the arena: The jump from 1.0 to 1.5 is the largest single-version improvement in the benchmark's history for any model in the image-to-video category

Benchmark Position: How It Ranks Against the Competition

The Image-to-Video Arena is the current industry standard for comparing AI video generators — it uses blind user voting (users see two outputs without knowing which model generated which and pick the better one). As of the 1.5 release:

| Model | Arena Elo | Relative Position |
| --- | --- | --- |
| Grok Imagine Video 1.5 (720p) | 1,473 | #1 |
| Seedance 2.0 | Below 1.5 | #2 |
| HappyHorse 1.0 | Below Seedance | #3 |
| Google Veo 3.1 | Below 1.5 | Top 5 |
| Sora 2 | Below 1.5 | Top 5 |
| Kling 3.0 | Below 1.5 | Top 5 |

The +52 Elo jump over Grok Imagine Video 1.0 is confirmed across three independent benchmark runs. The ranking holds consistently across multiple evaluation dimensions including motion quality, prompt adherence, and visual consistency.
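To put a 52-point gap in concrete terms, here is a minimal sketch using the standard Elo expected-score formula. It assumes the arena's ratings follow the conventional 400-point logistic scale, which the leaderboard may or may not use exactly; the function name is ours.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Standard Elo expected score for the higher-rated side.

    Assumes the conventional 400-point logistic scale; the arena's own
    rating method may differ in detail.
    """
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))


gap = 52
print(f"{expected_win_rate(gap):.1%}")  # 57.4%

# Rating implied for version 1.0, derived from the 1,473 score and the +52 jump.
print(1473 - gap)  # 1421
```

Under that assumption, 1.5 would be preferred over 1.0 in roughly 57 of every 100 blind comparisons: a clear edge, but not a sweep.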

The caveat: blind arena rankings measure user preference on a general distribution of prompts. They do not capture every professional use case — particularly those requiring 1080p output, precise frame control, or specific industry-standard formats. The arena is a good overall signal but not the only signal that matters.


The Pricing Gap Is the Real Story

Benchmark rankings tell you which model wins in a controlled comparison. Pricing tells you which model gets used in production.

| Model | Price per Minute of Video |
| --- | --- |
| Grok Imagine Video 1.5 | $4.20 |
| Veo 3.1 | $12.00 |
| Sora 2 Pro | $30.00 |

At $4.20 per minute, Grok Imagine 1.5 is:

  • 65% cheaper than Veo 3.1
  • 86% cheaper than Sora 2 Pro

For a team generating 100 minutes of AI video per month (a reasonable production workload for a content studio or marketing team), that pricing difference translates to $2,580 saved per month versus Sora 2 Pro, or $780 versus Veo 3.1, at these list prices. Annualized, that is $30,960 and $9,360 respectively.
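The arithmetic behind these figures is easy to rerun for a different monthly volume. The sketch below uses only the per-minute prices from the table; the function names are ours, and it assumes cost scales linearly with minutes generated.

```python
# Per-minute list prices from the pricing table above (USD).
PRICE_PER_MINUTE = {
    "Grok Imagine Video 1.5": 4.20,
    "Veo 3.1": 12.00,
    "Sora 2 Pro": 30.00,
}
BASELINE = "Grok Imagine Video 1.5"


def percent_cheaper(competitor: str) -> float:
    """How much cheaper the baseline is than a competitor, in percent."""
    return round((1 - PRICE_PER_MINUTE[BASELINE] / PRICE_PER_MINUTE[competitor]) * 100, 1)


def monthly_savings(competitor: str, minutes: float) -> float:
    """Dollars saved per month by using the baseline instead of a competitor."""
    gap = PRICE_PER_MINUTE[competitor] - PRICE_PER_MINUTE[BASELINE]
    return round(gap * minutes, 2)


for name in ("Veo 3.1", "Sora 2 Pro"):
    saved = monthly_savings(name, minutes=100)
    print(f"{name}: {percent_cheaper(name)}% cheaper, "
          f"${saved:,.0f}/month, ${saved * 12:,.0f}/year")
```

Changing `minutes=100` to your own volume shows where the gap starts to matter for your budget.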

The model that wins benchmarks often loses commercial adoption to the model that wins the cost calculation. Grok Imagine 1.5 is currently winning both.


Native Audio: Why It Matters

Many earlier AI video workflows ran in two steps: generate the video, then add audio separately, either with a different model or by hand. This worked but introduced friction: the audio was not synchronized to the physics of the video, lip-sync required additional passes, and the second step added time and cost.

Grok Imagine 1.5 generates audio in a single pass, synchronized to the video content. The model produces:

  • Sound effects matched to on-screen events (footsteps when a character walks, splashes when liquid hits a surface)
  • Background audio appropriate to the scene environment
  • Lip-sync mapped to any speaking characters in the clip

The accuracy of lip-sync in 1.5 is significantly improved over 1.0 — which had functional but visibly imperfect mouth matching. Whether it reaches the threshold where lip-sync requires no manual correction in professional productions remains use-case dependent, but the improvement is meaningful.


Animation Modes and Camera Control

One of the more practical additions in 1.5 is explicit animation style control. Four modes:

  • Normal — realistic movement with natural pacing
  • Fun — slightly exaggerated, higher energy
  • Custom — natural-language description of the style you want
  • Spicy — higher dynamism, more dramatic motion and lighting

Camera movement is controlled via natural-language prompts. The model supports cinematic instructions cleanly: dolly in, pan left, tracking shot following a subject, crane-style vertical movement, zoom. In testing by independent reviewers, these camera movements execute without the abrupt transitions and stutter that earlier models produced on similar prompts.

The Aurora architecture's autoregressive design is what makes this work — because each frame is generated conditioning on prior frames, the camera movement has temporal consistency rather than being re-solved from scratch every frame.
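A toy contrast makes the intuition concrete. The sketch below is not Aurora's implementation and says nothing about its internals; it models a one-dimensional camera position and compares conditioning each frame on the previous one against estimating each frame independently. All names and noise values are made up for illustration.

```python
import random

STEP = 0.05    # intended camera movement per frame
JITTER = 0.05  # per-frame estimation noise, same in both variants


def autoregressive_path(frames: int, seed: int = 0) -> list[float]:
    """Each frame starts from the previous frame's position."""
    rng = random.Random(seed)
    positions = [0.0]
    for _ in range(frames - 1):
        positions.append(positions[-1] + STEP + rng.uniform(-JITTER, JITTER))
    return positions


def independent_path(frames: int, seed: int = 0) -> list[float]:
    """Each frame estimates its position from scratch, ignoring its neighbors."""
    rng = random.Random(seed)
    return [i * STEP + rng.uniform(-JITTER, JITTER) for i in range(frames)]


def jumps(positions: list[float]) -> list[float]:
    """Frame-to-frame movement; negative values are backward stutter."""
    return [b - a for a, b in zip(positions, positions[1:])]


auto = jumps(autoregressive_path(48))   # two seconds at 24FPS
indep = jumps(independent_path(48))
print("autoregressive backward jumps:", sum(j < 0 for j in auto))
print("independent backward jumps:  ", sum(j < 0 for j in indep))
print("autoregressive end-position error:", sum(auto) - 47 * STEP)
```

In this toy setup the conditioned path can never move backward, because each jump is the intended step plus bounded noise, while the independent path can. The trade-off is that the conditioned path's errors accumulate, which is the same mechanism behind visual drift over long sequences.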


Where It Still Falls Short

Honest accounting of the limitations:

720p ceiling: The maximum output resolution is 720p. For professional productions that need 1080p for broadcast, social platforms with high-quality requirements, or large-format display, this is a hard limitation. Sora 2 and some Veo configurations support higher resolutions. Grok Imagine 1.5's arena ranking at 720p is excellent, but the resolution cap is real.

Clip length: Maximum 15 seconds per base clip, extended incrementally via the Extend from Frame feature. Generating a 60-second video means adding 45 seconds to a full-length base clip, which takes five to eight extension passes at 6 to 10 seconds each. Every pass adds latency and potential for visual drift over time.

No 60FPS option: The model outputs at 24FPS, which matches cinematic convention but not the 60FPS standard used in gaming content, some sports production, and smooth motion display formats.

Style control depth: The four animation modes are useful but coarser than the per-frame control available in tools like Runway's advanced keyframe system. For highly specific creative direction, the natural-language prompt layer may not provide enough precision.


Who Should Use It

Content creators and social media teams: The pricing and audio integration make this the most cost-efficient tool for high-volume video generation. For short-form content (15 seconds or under), the quality ceiling is high enough for professional output.

Developers building video generation into products: The API access and competitive pricing make Grok Imagine 1.5 the obvious starting point for any application that needs image-to-video generation. At $4.20/minute, the economics work at production scale in a way that $30/minute models do not.

Marketing teams: The image-to-video workflow — taking a product photo or brand image and animating it with synchronized audio — is the primary commercial use case the model is optimized for. The Aurora architecture's consistency across frames is particularly valuable here since brand visuals need to stay recognizable through the animation.

Creative experimenters: The community reaction to the announcement was immediate — users were already posting clips within an hour of wide release. The Normal/Fun/Spicy mode system makes experimentation fast without requiring precise technical prompting.

Professional video producers: Evaluate carefully. The 720p cap and clip length limitations are real constraints. For productions that need 1080p output or extended sequences without visual drift, Grok Imagine 1.5 may be a complementary tool rather than a primary one.


How to Access It

Grok Imagine Video 1.5 is available at grok.com/imagine following Elon Musk's wide release announcement on June 17, 2026. An API is available for developers.

The xAI announcement thread confirms this is a wide release — not a limited preview or waitlist access. Users with existing Grok or X Premium subscriptions should check whether their plan includes Imagine Video access, as the credit allocation for video generation differs from text generation.


The Bigger Picture

Grok Imagine Video 1.5 is xAI's most pointed competitive move in the AI video space to date. Holding the #1 arena position matters for perception; the pricing is what matters for adoption. At 86% below Sora 2 Pro and with native audio that eliminates a full production step, the commercial case is straightforward for any team currently paying OpenAI or Google rates for equivalent output.

The competition will respond. Sora, Veo, and Kling all have development teams working on their next versions. The Image-to-Video Arena leaderboard has changed leadership multiple times in the past six months as models leapfrog each other. What 1.5 establishes is that xAI is now a serious participant in that race — not an interesting experiment, but the current benchmark leader with a price point that makes the ranking commercially relevant.


FAQ

Is Grok Imagine Video 1.5 free?

The model is available at grok.com/imagine, but video generation consumes credits priced at $4.20 per minute. Free-tier access, if any, is limited. Check your current Grok or X Premium plan for included credits.

Can I use Grok Imagine Video 1.5 via API?

Yes. xAI provides API access to Grok Imagine Video 1.5 for developers. The same $4.20/minute pricing applies at API level, making it the most cost-competitive option for building video generation into applications.
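For budgeting at the clip level, $4.20 per minute works out to exactly 7 cents per second. The sketch below assumes billing is prorated linearly by the second, which the pricing sources do not confirm; the function names are ours, and no real API call is involved.

```python
# $4.20 per minute = 420 cents / 60 seconds = 7 cents per second.
# Assumption: billing is prorated linearly by the second.
CENTS_PER_SECOND = 420 // 60


def clip_cost_cents(seconds: int) -> int:
    """Estimated cost of one clip, in cents, under per-second proration."""
    return CENTS_PER_SECOND * seconds


def clips_within_budget(budget_dollars: int, seconds: int) -> int:
    """How many clips of a given length fit in a dollar budget."""
    return (budget_dollars * 100) // clip_cost_cents(seconds)


print(clip_cost_cents(15) / 100)     # 1.05
print(clips_within_budget(100, 15))  # 95
```

Under that assumption, a full 15-second base clip costs about $1.05, and a $100 budget covers 95 of them.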

Does it work with any image as input?

The image-to-video workflow accepts still images as the starting frame and animates them forward. Input image quality affects output quality — higher-resolution, well-composed input images produce better animation results.

How does the Extend from Frame feature work?

After generating an initial clip, you can select a frame from that clip and extend the video from that point forward by 6–10 seconds per extension pass. This allows building longer sequences, though each extension introduces some risk of visual drift over time.
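To estimate how many passes a longer sequence needs, a minimal sketch follows. It assumes you start from a full 15-second base clip and that every pass adds the same number of seconds; the function name is ours.

```python
import math

BASE_MAX_SECONDS = 15  # longest base clip


def extension_passes(target_seconds: int, seconds_per_pass: int,
                     base_seconds: int = BASE_MAX_SECONDS) -> int:
    """Extension passes needed to reach a target length from one base clip."""
    remaining = max(0, target_seconds - base_seconds)
    return math.ceil(remaining / seconds_per_pass)


print(extension_passes(60, seconds_per_pass=10))  # 5
print(extension_passes(60, seconds_per_pass=6))   # 8
```

A 60-second sequence therefore needs between five and eight passes depending on extension length, and each pass is another opportunity for drift.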


Grok Imagine Video 1.5 is available at grok.com/imagine. Benchmark data sourced from the Image-to-Video Arena leaderboard as of June 17, 2026. Pricing figures from WaveSpeed comparison and PixVerse guide.
