
Grok Imagine Video Text-to-Video API by xAI
xAI Grok Imagine Video generates short videos (1-15s) from natural-language prompts at 480p or 720p.
1. Introduction
Grok Imagine Video Text-to-Video is xAI's text-conditioned video generation endpoint within the broader Grok Imagine multimodal generative system. This README applies to the following API model identifier:
xai/grok-imagine-video/text-to-video
Developed by xAI and built atop technology acquired from the Hotshot startup, Grok Imagine is powered by the "Aurora" engine — a unified autoregressive Mixture-of-Experts model that natively interleaves text, image, video, and audio tokens. The text-to-video endpoint converts natural-language prompts into short, audio-synchronized video clips with cinematic motion, ambient sound, music, and dialogue generated in a single forward pass.
Within the field, xai/grok-imagine-video/text-to-video is positioned as a speed-first, social-media-native competitor to diffusion-based systems like OpenAI Sora 2, Google Veo 3.1, and Runway Gen-4.5. Its autoregressive architecture allows generation in roughly 17–30 seconds — a fraction of the time required by comparable diffusion transformers — while ranking at the top of the Artificial Analysis Video Arena leaderboards in early 2026.
2. Key Features & Innovations
- Autoregressive Mixture-of-Experts Architecture: Unlike the diffusion transformers used by most competitors, Grok Imagine Video predicts the next token across interleaved streams of text, image, video, and audio. This unified token-prediction design enables a single backbone to serve five conditioning modes (text-to-image, image-edit, text-to-video, image-to-video, video-edit) and dramatically reduces latency relative to iterative denoising pipelines.
- Native Synchronized Audio Generation: Music, sound effects, ambient noise, and dialogue with lip-sync are generated in the same autoregressive pass as the visual stream, rather than being dubbed in after the fact. This produces tight audio-visual coherence that is difficult to achieve with separate video and audio models.
- High-Throughput, Low-Latency Inference: Typical generations complete in approximately 17–30 seconds (roughly one-half to one-quarter the time of leading diffusion competitors), making the endpoint practical for interactive ideation and high-volume social-content workflows.
- Flexible Output Configuration: Supports clip durations up to 10 seconds in consumer products and 15 seconds via API, at 480p or 720p (with 1080p in Pro preview), at 24 fps, across multiple aspect ratios including 16:9 and 9:16. Up to four concurrent video variants can be requested per API call.
- Long Prompt Support: API requests accept prompts up to 10,000 characters, allowing detailed shot descriptions, camera-motion directives, style references, and dialogue scripts to be included in a single conditioning string.
- Trained at Frontier Scale: The Aurora backbone was trained on xAI's Colossus supercomputer using a cluster reported at 110,000 NVIDIA GB200 GPUs, enabling the large-scale multimodal token training required for native cross-modal coherence.
- Unified Endpoint Family: Sharing weights with the image-to-video, video-edit, and image-generation endpoints means consistent style, motion characteristics, and aesthetic fidelity across creative pipelines that mix conditioning modes.
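The output limits listed above (1 to 15 seconds, 480p or 720p, 24 fps, up to four variants, prompts up to 10,000 characters) can be checked on the client before a request is sent. The following is a minimal sketch, not the official request schema: the `VideoRequest` class and every payload field name (`duration_s`, `resolution`, `aspect_ratio`, `num_variants`) are assumptions made for illustration, and only the numeric limits and the model identifier come from this README.

```python
from dataclasses import dataclass

MODEL_ID = "xai/grok-imagine-video/text-to-video"

# Limits quoted in the feature list above.
MAX_PROMPT_CHARS = 10_000
MAX_DURATION_S = 15               # API limit; consumer products cap at 10 s
RESOLUTIONS = ("480p", "720p")
ASPECT_RATIOS = ("16:9", "9:16")  # the README says "including", so others may exist
MAX_VARIANTS = 4
FPS = 24


@dataclass(frozen=True)
class VideoRequest:
    """Hypothetical request shape; field names are illustrative only."""
    prompt: str
    duration_s: int = 6
    resolution: str = "720p"
    aspect_ratio: str = "16:9"
    num_variants: int = 1

    def validate(self) -> None:
        """Raise ValueError if any field falls outside the documented limits."""
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if len(self.prompt) > MAX_PROMPT_CHARS:
            raise ValueError(f"prompt exceeds {MAX_PROMPT_CHARS} characters")
        if not 1 <= self.duration_s <= MAX_DURATION_S:
            raise ValueError(f"duration_s must be between 1 and {MAX_DURATION_S}")
        if self.resolution not in RESOLUTIONS:
            raise ValueError(f"resolution must be one of {RESOLUTIONS}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(f"aspect_ratio must be one of {ASPECT_RATIOS}")
        if not 1 <= self.num_variants <= MAX_VARIANTS:
            raise ValueError(f"num_variants must be between 1 and {MAX_VARIANTS}")

    def frame_count(self) -> int:
        """Frames per variant at the documented 24 fps."""
        return self.duration_s * FPS

    def to_payload(self) -> dict:
        """Validate, then return a JSON-serializable dict (assumed field names)."""
        self.validate()
        return {
            "model": MODEL_ID,
            "prompt": self.prompt,
            "duration_s": self.duration_s,
            "resolution": self.resolution,
            "aspect_ratio": self.aspect_ratio,
            "num_variants": self.num_variants,
        }
```

For example, `VideoRequest(prompt="a fox running through snow", duration_s=10, aspect_ratio="9:16").frame_count()` returns 240, since a 10-second clip at 24 fps contains 240 frames per variant.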
3. Model Architecture & Technical Details
Core Architecture. Grok Imagine Video uses an autoregressive Mixture-of-Experts (MoE) transformer that operates over a tokenized representation of multimodal streams. Text, image patches, video frames, and audio frames are all encoded into a shared token vocabulary, and the model predicts each subsequent token conditioned on the prompt and on the tokens generated so far. Routing each token through a small subset of expert subnetworks allows specialization for different modalities and content categories without inflating per-token compute, because only the selected experts run for a given token.
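The expert routing described above can be illustrated with a generic top-k gating sketch. This is the textbook MoE pattern, not xAI's implementation: the number of experts, the value of `k`, and the function names `route_top_k` and `moe_forward` are all assumptions chosen for illustration. The point it shows is that only the selected experts are evaluated for a token, so per-token compute stays roughly constant as more experts are added.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(token: Vector, gate_logits: Sequence[float],
                experts: Sequence[Expert], k: int = 2) -> Vector:
    """Weighted sum of the outputs of the k selected experts; the rest never run."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(gate_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


if __name__ == "__main__":
    # Four toy experts that simply scale their input by a different factor.
    toy_experts = [lambda t, s=s: [s * v for v in t] for s in (1.0, 2.0, 3.0, 4.0)]
    print(moe_forward([1.0, 2.0], [0.0, 1.0, 3.0, 2.0], toy_experts, k=2))
```

In a real model the gate logits come from a learned linear layer applied to the token's hidden state; here they are passed in directly to keep the sketch self-contained.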
Unified Conditioning. The same Aurora backbone exposes five endpoints (text-to-image, image edit, text-to-video, image-to-video, and video edit), distinguished primarily by the conditioning tokens prepended to the generation context. The xai/grok-imagine-video/text-to-video endpoint conditions strictly on a text prompt, giving it the broadest generative freedom of the family. The image-to-video endpoint, by contrast, is anchored by a reference image of up to 20 MB and accepts as many as seven reference subjects.
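The idea of selecting an endpoint through prepended conditioning tokens can be sketched as follows. Everything concrete here is hypothetical: the special-token strings, the `Mode` enum, and the `build_context` helper are invented for illustration, and the README does not describe the real token vocabulary. The sketch only shows the general shape: a mode marker, then any reference-media tokens, then the prompt tokens, with the text-to-video mode accepting no media.

```python
from enum import Enum
from typing import List, Optional, Sequence


class Mode(Enum):
    # Hypothetical marker tokens, one per endpoint named in the README.
    TEXT_TO_IMAGE = "<|text_to_image|>"
    IMAGE_EDIT = "<|image_edit|>"
    TEXT_TO_VIDEO = "<|text_to_video|>"
    IMAGE_TO_VIDEO = "<|image_to_video|>"
    VIDEO_EDIT = "<|video_edit|>"


# Modes that are anchored by reference media in addition to the text prompt.
NEEDS_MEDIA = {Mode.IMAGE_EDIT, Mode.IMAGE_TO_VIDEO, Mode.VIDEO_EDIT}


def build_context(mode: Mode, prompt_tokens: Sequence[str],
                  media_tokens: Optional[Sequence[str]] = None) -> List[str]:
    """Assemble the generation context: mode marker, media tokens, prompt tokens."""
    media = list(media_tokens or [])
    if mode in NEEDS_MEDIA and not media:
        raise ValueError(f"{mode.name} requires reference media tokens")
    if mode not in NEEDS_MEDIA and media:
        raise ValueError(f"{mode.name} conditions on text only")
    return [mode.value, *media, *prompt_tokens]


if __name__ == "__main__":
    print(build_context(Mode.TEXT_TO_VIDEO, ["a", "fox", "in", "snow"]))
```

Under this framing, the text-to-video context contains nothing but the mode marker and the prompt, which is why that endpoint has the fewest constraints on what it generates.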
Training Infrastructure. Training was conducted on the xAI Colossus cluster, reported at 110,000 NVIDIA GB200 GPUs. The Aurora engine was first launched for still images in December 2024 and progressively expanded to motion and audio modalities through 2025, culminating in the Grok Imagine 1.0 GA release on February 2, 2026.
Release Timeline.
- Aug 4, 2025 — Initial launch in the Grok iOS app for Premium+ and SuperGrok subscribers.
- Oct 5, 2025 — v0.9 introduced Aurora-powered video at expanded short-form lengths.
- Jan 28, 2026 — Grok Imagine API publicly launched.
- Feb 2, 2026 — Grok Imagine 1.0 GA: 10-second clips, 720p output, improved native audio.
- Apr 2026 — Quality and Speed modes added; Pro mode with 1080p teased.
4. Performance Highlights
In early 2026, xai/grok-imagine-video/text-to-video ranked #1 on the Artificial Analysis Video Arena in both the text-to-video and image-to-video categories, outperforming leading offerings from Runway, OpenAI, and Google. Approximately 1.245 billion videos were generated through Grok Imagine in the 30 days preceding the 1.0 GA release.
| Rank | Model | Developer | Category | Notable Date |
|---|---|---|---|---|
| 1 | xai/grok-imagine-video/text-to-video | xAI | Text-to-Video Arena | Q1 2026 |
| 2 | Sora 2 Pro | OpenAI | Text-to-Video Arena | 2025 |
| 3 | Veo 3.1 | Google DeepMind | Text-to-Video Arena | 2025 |
| 4 | Gen-4.5 | Runway | Text-to-Video Arena | 2025 |
Qualitative performance characteristics:
- Speed: Generation latency of roughly 17–30 seconds per clip — significantly faster than diffusion-based competitors.
- Motion & Cinematic Camera Work: Strong results on dynamic camera movement, creature animation, and stylized motion.
- Audio Coherence: Native lip-sync and synchronized SFX/music produce more cohesive output than systems that bolt on audio post-hoc.
- Known weak areas: Long-form narrative continuity, multi-shot storytelling, photorealistic spoken-dialogue acting, in-frame text rendering, and structured 4K commercial output. Veo 3.1 retains advantages in clip length and 4K; Kling 3.0 leads on multi-shot narrative; Sora 2 leads on dialogue acting.
5. Use Cases
- Short-Form Social Video: Optimized for TikTok, Instagram Reels, YouTube Shorts, and X-native video at 9:16 and 16:9. Fast turnaround and 6–10 second durations align naturally with social feed formats.
- Ideation, Storyboarding & Previsualization: The low latency makes the endpoint well suited for rapid iteration on concepts, mood pieces, and motion storyboards before committing to expensive production pipelines.
- Memes and Cultural Content: High generation speed and stylistic flexibility support meme-format production and reactive cultural content where time-to-publish matters more than fine-grained polish.
- Animation, Creature, and Stylized Motion Work: Strong handling of non-photoreal subjects, animated characters, fantastical creatures, and stylized worlds makes it viable for indie animation, game cinematics, and creative experimentation.
- Cinematic Shot Generation: Effective at executing dolly, crane, orbit, and tracking-camera prompts, useful for filmmakers exploring shot language and B-roll concepts.
- API-Driven Creative Tooling: Through the public API and partner platforms (WaveSpeedAI, Higgsfield, Scenario, fal.ai, Invideo), developers can embed xai/grok-imagine-video/text-to-video into editing suites, marketing platforms, and generative-content products with concurrent multi-variant requests per call.
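For high-volume workflows, the figures quoted in this README (up to four variants per call, roughly 17–30 seconds per generation) allow a rough estimate of batch turnaround. The sketch below rests on two assumptions that the README does not confirm: calls are issued one after another, and a multi-variant call takes about as long as a single-variant one. The `plan_batch` helper is hypothetical, and its output is a planning estimate, not a measured benchmark.

```python
import math
from typing import Dict

MAX_VARIANTS_PER_CALL = 4      # from the feature list
LATENCY_RANGE_S = (17, 30)     # typical per-generation latency quoted in this README


def plan_batch(total_clips: int,
               variants_per_call: int = MAX_VARIANTS_PER_CALL) -> Dict[str, int]:
    """Estimate call count and sequential wall-clock time for a batch of clips.

    Assumes sequential calls and that latency does not grow with the number
    of variants requested per call; both are unverified assumptions.
    """
    if total_clips < 1:
        raise ValueError("total_clips must be at least 1")
    if not 1 <= variants_per_call <= MAX_VARIANTS_PER_CALL:
        raise ValueError(f"variants_per_call must be between 1 and {MAX_VARIANTS_PER_CALL}")
    calls = math.ceil(total_clips / variants_per_call)
    low, high = LATENCY_RANGE_S
    return {"calls": calls, "min_seconds": calls * low, "max_seconds": calls * high}


if __name__ == "__main__":
    print(plan_batch(10))
```

Under these assumptions, ten clips at four variants per call need three calls and between 51 and 90 seconds in total. Issuing calls in parallel would shorten this, subject to whatever rate limits the API applies.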

















