Grok Imagine Video 1.5 Preview: xAI's New #1 Image-to-Video Model, and How to Use It via API

xAI has quietly pushed its most significant video update of the year. Grok Imagine Video 1.5 Preview is now live — API-first, before any broad consumer rollout — and it has landed at the top of the image-to-video leaderboards. Here’s what it actually is, what’s genuinely confirmed versus merely reported, and exactly how to call it as a developer.

⚠️ Reality check on access: Despite some write-ups implying you can “try it free in the browser,” the verified path right now is the API (xAI’s own api.x.ai, plus resellers like fal, Replicate, and Kie.ai). A wider consumer rollout to X Premium tiers is still listed as “in progress.” Treat browser-based “no setup” pitches from third-party sites as their own hosted wrappers, not an official xAI consumer launch.

Grok @Imagine 1.5 Preview is here

Try it today in the API: https://t.co/x4Yt13xRu7 pic.twitter.com/L5RDsSZyVP
— Grok (@grok) June 3, 2026

What it is

Grok Imagine Video 1.5 Preview is xAI’s latest image-to-video generation model. You feed it a still image plus a motion-focused prompt, and it produces a short clip — with natively generated, synchronized audio (dialogue, sound effects, ambient sound, and music) created in the same inference pass as the video, rather than bolted on afterward. That single-pass audio remains one of its clearest differentiators versus Sora, Runway, and Kling. (Replicate readme, Kie.ai)

Official identifiers (xAI Docs):

Model name: grok-imagine-video-1.5-preview
Alias: grok-imagine-video-1.5-2026-05-30
Modalities listed: Image + Video

🔴 Important correction to some articles: The official xAI model page explicitly states this preview model “currently does not support text-to-video.” Replicate confirms it is image-to-video only (every request needs an input image), and points to the separate xai/grok-imagine-video model for text-to-video. So claims that “the 1.5 Preview supports text-to-video, video editing, and multi-image editing” describe the broader Imagine API suite, not this specific preview model. Don’t build a T2V pipeline expecting this exact model alias to handle it.

The benchmark: #1 on the Image-to-Video Arena

This is the headline that drove the attention. On the Arena image-to-video (720p) leaderboard, the current standings are (arena.ai/leaderboard):

Rank	Model	Elo
1	`grok-imagine-video-1.5-preview-720p`	1473 ±9
2	`dreamina-seedance-2.0-720p`	1467 ±11
3	`happyhorse-1.0`	1443 ±12
4	`grok-imagine-video-720p` (the 1.0 predecessor)	1421 ±6
5	`veo-3.1-audio`	1397 ±11

The widely cited “+52 Elo jump” checks out against this table: 1473 (new) − 1421 (predecessor) = 52. The new model also edges past ByteDance’s Seedance 2.0 (1467) by 6 points, a gap smaller than either model’s reported margin. The figures are corroborated by Kie.ai (which cites 1473 vs Seedance’s 1467) and Oimi AI.
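The leaderboard arithmetic can be reproduced in a few lines. The sketch below uses only the ratings and margins from the table above. The `expected_score` helper applies the textbook logistic Elo formula, which is an assumption here: the source does not say how Arena maps rating gaps to win rates.

```python
# Ratings and margins copied from the leaderboard snapshot above: (elo, plus_minus).
RATINGS = {
    "grok-imagine-video-1.5-preview-720p": (1473, 9),
    "dreamina-seedance-2.0-720p": (1467, 11),
    "grok-imagine-video-720p": (1421, 6),
}


def elo_gap(a: str, b: str) -> int:
    """Rating of model a minus rating of model b."""
    return RATINGS[a][0] - RATINGS[b][0]


def intervals_overlap(a: str, b: str) -> bool:
    """True if the two models' rating-plus-or-minus-margin ranges intersect."""
    (ra, ma), (rb, mb) = RATINGS[a], RATINGS[b]
    return ra - ma <= rb + mb and rb - mb <= ra + ma


def expected_score(gap: float) -> float:
    """Textbook Elo expected score for a given rating gap (assumed formula)."""
    return 1.0 / (1.0 + 10 ** (-gap / 400))


if __name__ == "__main__":
    new = "grok-imagine-video-1.5-preview-720p"
    for other in ("grok-imagine-video-720p", "dreamina-seedance-2.0-720p"):
        gap = elo_gap(new, other)
        print(other, gap, intervals_overlap(new, other), round(expected_score(gap), 3))
```

Under these assumptions, the 52-point gap over the predecessor falls outside both margins, while the 6-point lead over Seedance 2.0 sits inside overlapping margins, so the top two places are close.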

⚠️ Conflicting numbers you’ll see online: The Basenor article cites “1404 ±6” as the debut Elo. That figure doesn’t match the arena.ai snapshot (1473) or the +52 math, and is likely a different or earlier capture. Separately, the Artificial Analysis “Image-to-Video (with audio)” board uses a different Elo scale entirely (its leader sits at ~1191) and, at the time of checking, had not yet listed the 1.5 Preview; its top entry was still Seedance 2.0 (Artificial Analysis I2V leaderboard). Bottom line: the #1 ranking is real and well-supported, but exact Elo values vary by board and shift constantly as votes accumulate. Treat any single number as a snapshot, not gospel.

Verified specs

These are confirmed across the official docs and multiple independent API providers (fal, Replicate, Kie.ai):

Spec	Value	Source
Workflow	Image-to-video only (input image required)	xAI Docs, Replicate
Duration	1–15 seconds (default 8)	Kie.ai, Replicate
Resolution	480p or 720p	xAI Docs
Frame rate	24 fps	fal (output metadata)
Aspect ratios	auto, 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, 2:3	Replicate / Kie.ai
Input formats	JPG, JPEG, PNG, WEBP, GIF, AVIF	fal
Audio	Native, generated & synchronized in one pass	Replicate / Kie.ai
Status	Preview	xAI Docs

Pricing (this is the part to plan around)

Straight from the official xAI model page — you’re billed per second of generated video, by resolution, plus a small charge per input image:

Item	Price
Output — 480p	$0.08 / second
Output — 720p	$0.14 / second
Input image	$0.01 each
Rate limit	60 requests / minute
Region (official)	`us-east-1`

Worked examples (matching fal’s and JXP’s published math; add $0.01 per input image to each):

5-second 480p clip ≈ $0.40
5-second 720p clip ≈ $0.70
10-second 720p clip ≈ $1.40
15-second 720p clip ≈ $2.10

Cost scales linearly with duration; generated audio is included at no extra charge. (fal pricing)
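For budgeting, the per-second pricing reduces to a one-line formula. The sketch below works in whole cents to avoid floating-point rounding. The function name is illustrative, and the assumption that durations are whole seconds within the 1–15 range is mine, not xAI's.

```python
# Prices from the pricing table above, expressed in US cents.
RATE_CENTS_PER_SECOND = {"480p": 8, "720p": 14}
INPUT_IMAGE_CENTS = 1


def estimate_cost_cents(duration_s: int, resolution: str, input_images: int = 1) -> int:
    """Estimate the cost of one generation in cents (illustrative helper)."""
    if not 1 <= duration_s <= 15:
        raise ValueError("duration must be between 1 and 15 seconds")
    if resolution not in RATE_CENTS_PER_SECOND:
        raise ValueError("resolution must be '480p' or '720p'")
    if input_images < 0:
        raise ValueError("input_images cannot be negative")
    return duration_s * RATE_CENTS_PER_SECOND[resolution] + input_images * INPUT_IMAGE_CENTS


if __name__ == "__main__":
    for seconds, res in [(5, "480p"), (5, "720p"), (10, "720p"), (15, "720p")]:
        cents = estimate_cost_cents(seconds, res)
        print(f"{seconds}s {res}: ${cents / 100:.2f} including one input image")
```

Passing `input_images=0` gives the video-only figures listed in the worked examples.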

How to use it (API only)

Since this is API-first, here’s the practical workflow.

1. Direct via xAI (api.x.ai): Create an API key in the xAI console, then call the image-to-video endpoint with model: grok-imagine-video-1.5-preview (or the dated alias), passing your input image, prompt, duration (1–15), resolution (480p/720p), and aspect_ratio. See the xAI Video Generation / Image-to-Video docs.

2. Via a reseller (often the fastest way to test, with free Playgrounds):

fal — xai/grok-imagine-video/v1.5/image-to-video
Replicate — xai/grok-imagine-video-1.5
Kie.ai (free Playground, JSON editor)

Typical request parameters (Kie.ai / fal schema):
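A minimal sketch of such a payload, validated against the spec table earlier in this article. The `model`, `prompt`, `duration`, `resolution`, and `aspect_ratio` names come from the text above. The `image_url` field name and the resolution and aspect-ratio defaults are assumptions for illustration, so check each provider's schema before sending anything.

```python
import json

MODEL = "grok-imagine-video-1.5-preview"
RESOLUTIONS = {"480p", "720p"}
ASPECT_RATIOS = {"auto", "1:1", "16:9", "9:16", "4:3", "3:4", "3:2", "2:3"}


def build_request(
    image_url: str,
    prompt: str,
    duration: int = 8,           # documented default duration
    resolution: str = "720p",    # illustrative default, not taken from the docs
    aspect_ratio: str = "auto",  # illustrative default, not taken from the docs
) -> dict:
    """Build and validate an image-to-video request body (hypothetical helper)."""
    if not image_url:
        raise ValueError("an input image is required; this model is image-to-video only")
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not 1 <= duration <= 15:
        raise ValueError("duration must be between 1 and 15 seconds")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(ASPECT_RATIOS)}")
    return {
        "model": MODEL,
        "image_url": image_url,  # assumed field name; providers may differ
        "prompt": prompt,
        "duration": duration,
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,
    }


if __name__ == "__main__":
    body = build_request("https://example.com/still.png", "Slow push-in as fog rolls over the hills.")
    print(json.dumps(body, indent=2))
```

Validating locally before sending avoids spending a rate-limited request on a payload the API would reject.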

Prompting tips (from xAI’s own Replicate guide)

The model already sees your image, so prompt for motion, not description (Replicate prompt guide):

Don’t re-describe the image — tell it what should change (action, camera move, atmosphere).
Don’t contradict the source — match the prompt to what’s actually in the photo.
Be specific about motion intensity — “car racing past at high speed” beats “car passing.”
Always give camera direction — pan, tilt, dolly, orbit, slow push-in, handheld, etc.
Negative prompts are ignored — describe what you want instead.
Add an AUDIO: block at the end to steer sound design (music, SFX, ambient, short dialogue).
Shorter is more stable — 5–8s is the sweet spot; 15s works but is more prone to artifacts.
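These tips can be folded into a small prompt builder. The sketch below is one possible way to assemble a motion-first prompt with a trailing AUDIO: block. The function name, argument names, and exact line layout are assumptions, not a format confirmed by xAI or Replicate.

```python
def build_motion_prompt(action: str, camera: str, atmosphere: str = "", audio: str = "") -> str:
    """Compose a motion-focused prompt with an optional AUDIO: block (illustrative)."""
    if not action.strip():
        raise ValueError("describe what should change in the scene")
    if not camera.strip():
        raise ValueError("give a camera direction, e.g. pan, tilt, dolly, orbit")

    parts = [action.strip(), camera.strip()]
    if atmosphere.strip():
        parts.append(atmosphere.strip())

    # End every sentence with a period so the pieces read as separate instructions.
    body = " ".join(p if p.endswith(".") else p + "." for p in parts)

    # The audio direction goes last, as the tips above suggest.
    if audio.strip():
        body += "\nAUDIO: " + audio.strip()
    return body


if __name__ == "__main__":
    print(build_motion_prompt(
        action="The car races past at high speed",
        camera="Handheld pan following the car",
        audio="engine roar, tire squeal",
    ))
```

Keeping the action and camera arguments mandatory nudges every prompt toward motion and camera direction rather than a re-description of the image.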

Honest limitations

No 1080p yet — caps at 720p, while Sora and Kling offer 1080p. (1080p is on the rumored roadmap but unconfirmed by xAI.)
Quality degrades after ~2–3 chained extensions when stitching clips into longer narratives (community-reported; no xAI fix timeline). (JXP)
It’s a “Preview” — rankings and behavior can shift as it matures.
Weaker for long continuous scenes, complex multi-character choreography, and exact frame-by-frame control. (ImagineGo review)

Claims I could NOT fully verify (treat with caution)

Here’s the explicit “unconfirmed” pile: these claims appear in secondary blogs but not in xAI’s official docs, and several sources contradict each other:

Generation speed: “~17 seconds” (JXP) vs “~20–30 seconds for a 5s 720p clip” (Basenor). Not independently confirmed.
“2–3× faster than Seedance 2.0” — Basenor claim only.
Infrastructure: “Aurora autoregressive MoE engine, trained on Colossus 2 with ~555,000 GPUs” (Basenor) vs “Aurora trained on 110,000 NVIDIA GB200 GPUs” (JXP). These conflict and neither is in the official model docs.
“Hotshot acquired March 2025” and “1.245 billion videos generated in January 2026” — single-source, unverified.
The @grok X announcement — could not load (login wall), so its specific wording is unconfirmed.
