Grok Imagine API for xAI Image, Video, and Audio

The Grok Imagine API gives developers xAI's image, video, and audio generation in one suite. It produces images at up to 2K with multilingual text rendering, plus video up to 15 seconds long with native, synchronized audio and reference-based editing. On Atlas Cloud, one key runs every Grok Imagine mode, so you can move between image, video, and audio without separate setups. Pricing starts at $0.02 per image and $0.05 per second of video.
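As a rough budgeting aid, the starting rates quoted above can be turned into a lower-bound estimate. This is a minimal sketch: the function name is illustrative, and it assumes every item is billed at the "from" rate, which may not hold for every model or mode.

```python
from decimal import Decimal

# Starting ("from") rates quoted on this page; actual rates may be higher
# depending on the model and mode, so treat the result as a lower bound.
IMAGE_RATE_USD = Decimal("0.02")  # per generated image
VIDEO_RATE_USD = Decimal("0.05")  # per second of video


def estimate_cost(images: int = 0, video_seconds: float = 0.0) -> Decimal:
    """Return a lower-bound cost estimate in USD (hypothetical helper)."""
    if images < 0 or video_seconds < 0:
        raise ValueError("counts must be non-negative")
    return IMAGE_RATE_USD * images + VIDEO_RATE_USD * Decimal(str(video_seconds))


if __name__ == "__main__":
    # 40 images plus three 15-second clips (45 seconds of video in total).
    print(estimate_cost(images=40, video_seconds=45))
```

Decimal arithmetic is used here so that small per-unit prices add up without binary floating-point rounding noise.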

Explore the Leading Grok Imagine Models

Atlas Cloud provides you with the latest industry-leading creative models.

xAI TTS v1

xAI TTS v1 is a high-fidelity text-to-speech model that converts text into natural, expressive speech with sub-second latency, supporting 20 languages and 80+ voices with fine-grained delivery control.

Grok Imagine Video v1.5 Image-to-Video

xAI Grok Imagine Video v1.5 animates a starting frame image with natural-language motion prompts at 480p, 720p, or 1080p.

Grok Imagine Image Quality Text-to-Image

xAI Grok Imagine generates polished visuals from natural-language prompts at 1K or 2K resolution, with 14 aspect ratios.

Grok Imagine Image Quality Edit

xAI Grok Imagine edits one or more reference images with natural-language instructions at 1K or 2K resolution. Supports single image and multi-image (<IMAGE_0>, <IMAGE_1>) reference editing.
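For multi-image edits, it can help to check a prompt before sending it. The sketch below assumes the `<IMAGE_n>` placeholders are written inline in the prompt and are zero-based indices into the list of reference images; the helper name is hypothetical, and the limit of 8 references is the figure listed for Grok Imagine Image Quality in the comparison table on this page.

```python
import re

# Reference-image limit listed for Grok Imagine Image Quality on this page.
MAX_REFERENCE_IMAGES = 8
_PLACEHOLDER = re.compile(r"<IMAGE_(\d+)>")


def check_edit_prompt(prompt: str, image_count: int) -> list:
    """Return the sorted image indices a prompt refers to (hypothetical helper).

    Assumes placeholders are zero-based, so <IMAGE_0> is the first image.
    Raises ValueError if the prompt points at an image that was not supplied.
    """
    if not 1 <= image_count <= MAX_REFERENCE_IMAGES:
        raise ValueError(f"expected 1-{MAX_REFERENCE_IMAGES} reference images")
    indices = sorted({int(n) for n in _PLACEHOLDER.findall(prompt)})
    missing = [i for i in indices if i >= image_count]
    if missing:
        raise ValueError(f"prompt references missing images: {missing}")
    return indices


if __name__ == "__main__":
    prompt = "Put the hat from <IMAGE_1> on the person in <IMAGE_0>"
    print(check_edit_prompt(prompt, image_count=2))
```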

Grok Imagine Video Text-to-Video

xAI Grok Imagine Video generates short videos (1-15s) from natural-language prompts at 480p or 720p.

Grok Imagine Video Image-to-Video

xAI Grok Imagine Video animates a starting frame image with natural-language motion prompts at 480p or 720p.

Grok Imagine Video Reference-to-Video

xAI Grok Imagine Video generates videos guided by 1-7 reference images that contribute people, objects, or styles. Output up to 10s at 480p or 720p.

Grok Imagine Video Extend

xAI Grok Imagine Video continues an existing 2-15s mp4 with a 2-10s prompt-driven extension. Output matches input, capped at 720p.
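The limits above can be checked client-side before submitting an extend job. This is a sketch under stated assumptions: the function name is hypothetical, the result is assumed to contain the source clip followed by the extension, and an input above 720p is assumed to be returned at 720p.

```python
MAX_OUTPUT_HEIGHT = 720  # output is capped at 720p


def plan_extension(source_seconds: float, extension_seconds: float,
                   source_height: int) -> dict:
    """Validate an extend request against the quoted limits (hypothetical helper)."""
    if not 2 <= source_seconds <= 15:
        raise ValueError("source clip must be 2-15 seconds long")
    if not 2 <= extension_seconds <= 10:
        raise ValueError("extension must be 2-10 seconds long")
    return {
        # Assumption: the result is the source clip plus the new extension.
        "total_seconds": source_seconds + extension_seconds,
        # Output matches the input resolution but never exceeds the cap.
        "output_height": min(source_height, MAX_OUTPUT_HEIGHT),
    }


if __name__ == "__main__":
    print(plan_extension(source_seconds=8, extension_seconds=6, source_height=1080))
```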

Grok Imagine Video Edit

xAI Grok Imagine Video edits an mp4 with natural-language instructions. The output keeps the source duration, which is capped at 8.7s. Billing is per second of the input video, since the output duration equals the input duration.

Grok Imagine Image Edit

xAI Grok Imagine edits one or more reference images with natural-language instructions at 1K or 2K resolution. Supports single image and multi-image (<IMAGE_0>, <IMAGE_1>) reference editing.

Grok Imagine Image Text-to-Image

xAI Grok Imagine generates images from natural-language prompts at 1K or 2K resolution, with 14 aspect ratios.

Modality	Description
Grok Imagine Image Quality T2I API (Text to Image)	The Grok Imagine Image Quality T2I API empowers developers to transform text prompts into photorealistic images at up to 2K resolution. With razor-sharp details, multilingual text rendering, and tighter prompt following, it generates brand-grade visuals optimized for hero images, advertising creatives, and product renders.
Grok Imagine Image Quality Edit API (Image to Image)	The Grok Imagine Image Quality Edit API empowers developers to refine and restyle existing images using reference inputs. With natural lighting, rich textures, and believable physics, it generates photorealistic edits optimized for product renders, marketing campaigns, and brand-grade visuals.
Grok Imagine Video Text-to-Video API	The Grok Imagine Video Text-to-Video API empowers developers to generate cinematic videos directly from text prompts at up to 720p resolution. With configurable duration up to 15 seconds, flexible aspect ratios, and native audio synthesis, it produces photorealistic video sequences optimized for social content, advertising creatives, and immersive visual storytelling.
Grok Imagine Video Image-to-Video API	The Grok Imagine Video Image-to-Video API empowers developers to animate still images into dynamic video clips using a source image and text prompt. With the source image anchored as the first frame, natural motion generation, and synchronized audio output, it produces photorealistic animations optimized for product showcases, portrait animation, and scene bring-to-life workflows.
Grok Imagine Video Reference-to-Video API	The Grok Imagine Video Reference-to-Video API empowers developers to generate videos guided by up to 7 reference images, incorporating specific characters, objects, or visual styles without fixing a start frame. With consistent identity preservation across frames, flexible duration up to 10 seconds, and strong compositional fidelity, it generates brand-grade videos optimized for virtual try-on, product placement, and character-consistent storytelling.
Grok Imagine Video Edit API (Video-to-Video)	The Grok Imagine Video Edit API empowers developers to modify existing videos using natural language instructions. With high-fidelity scene preservation, targeted prompt-based changes, and output that retains the original duration and aspect ratio up to 720p, it generates precise video edits optimized for post-production workflows, marketing campaigns, and iterative creative refinement.

Key Features of Grok Imagine API

Explore what the Grok Imagine API delivers, from 2K image generation with multilingual text to multimodal video with native synchronized audio and creative modes.

Ultra-High Resolution Rendering using Grok Imagine Image Quality API

The Grok Imagine Image Quality API generates images at up to 2K resolution with sharp detail. Because it preserves fine textures and intricate composition at scale, outputs stay crisp even in large display formats. It suits hero images, advertising creatives, and brand-grade product renders.

Multilingual Text Rendering

The Grok Imagine Image Quality API renders text in multiple languages directly within generated images. Because it reproduces typography, scripts, and characters accurately, you can embed readable copy in your visuals without manual post-editing. It suits advertising creatives, localized marketing campaigns, and brand-grade visuals.

Photorealistic Image Generation

The Grok Imagine API generates photorealistic outputs with natural lighting, rich textures, and believable physics. By simulating real-world optics and material behavior, it produces images that approach the look of professional photography. It suits product renders, hero images, and high-end brand visuals.

Precise Prompt Control and Reference-Based Editing

The Grok Imagine Image Quality API combines tighter prompt following with image editing driven by reference inputs. It interprets detailed instructions and matches style cues from uploaded references, so you can refine and restyle visuals precisely. It suits ad creatives, product renders, and consistent brand-grade visuals.

Native Audio Video Generation

The Grok Imagine Video API automatically generates synchronized music, sound effects, and dialogue with each clip, so audio and motion stay aligned in one pass. Clips need no separate audio step and arrive ready to use.

Multimodal Video Generation

The Grok Imagine Video API covers text-to-video, image-to-video, reference-to-video, and video editing within a single suite. You can move across generation and editing tasks without switching models or integrations.

Motion Control and Consistency

The Grok Imagine Video API produces natural motion with stable physics and consistent subjects across frames. This reduces flicker and artifacts in longer clips, keeping characters and scenes coherent from start to finish.

Model Comparisons with One Prompt

Prompt

Candid street portrait photography of an elderly man in his 60s-70s, weathered face with deep wrinkles and expressive furrowed brow, long wild flowing grey-brown hair reaching shoulders, thick unkempt grey beard, mouth slightly open showing imperfect teeth, wearing small round John Lennon-style wire-frame sunglasses with dark lenses, wearing a teal/dark green Hard Rock Cafe graphic t-shirt with colorful print, holding a paper cup in hand, shot with telephoto lens, shallow depth of field, subject in sharp focus, bokeh background with blurred green and colorful elements suggesting an outdoor festival or market setting, natural outdoor lighting, slightly overcast, HDR-style post processing with rich color saturation and contrast, photojournalism / documentary street photography style, close-up portrait framing, chest-up composition, ultra detailed skin texture, every hair strand visible, shot on Sony A7R / Canon 5D Mark IV style rendering

Generated by Grok Imagine

Generated by Nano Banana 2

Generated by GPT Image-2

Prompt

Ultra-high resolution editorial beauty portrait, extreme close-up of a young woman's face, filling entire frame from forehead to chin, striking blue-green piercing eyes with intense gaze looking directly at camera, wet dark hair plastered across forehead and face in chaotic strands, dramatic split-tone makeup art — left side of face covered in deep cobalt blue metallic body paint or pigment powder, right side warm amber/copper toned skin, scattered gold glitter particles across cheeks, nose bridge, and lips catching light in specular bokeh highlights, full parted lips slightly open, glossy red-coral lip color, hint of teeth visible, lighting: dual-color dramatic studio lighting — cool blue rim light from left, warm amber/orange key light from right, creating extreme contrast split across the face centerline, skin texture rendered at microscopic level — every pore, fine hair, water droplet, glitter particle hyper-visible, photography specs: shot on Phase One IQ4 150MP medium format camera, Hasselblad 120mm macro lens, f/2.8 aperture, tack-sharp focus on eyes and lip area, micro-texture rendering on skin surface, post-processing: Capture One ultra-detail masking, luminosity contrast enhancement, color split-toning warm-cool duality, no smoothing, no skin retouching — raw pore-level detail preserved, --style: ultra-realistic hyperdetail beauty editorial, Vogue Italia / W Magazine aesthetic, 8K resolution, 16-bit color depth

Generated by Grok Imagine

Generated by Qwen Image 2.0

Generated by Nano Banana 2

What You Can Do with Grok Imagine Models

See what you can build with the Grok Imagine API, from photorealistic brand visuals and multilingual ad posters to product video showcases, portrait animation, and reference-based editing.

Photorealistic Brand Visuals

The Grok Imagine Image Quality API enables creators and developers to produce photorealistic visuals featuring natural lighting, rich textures, and believable physics. Ideal for marketing teams and design studios pursuing studio-grade output, the API renders crisp 2K resolution and lifelike material detail—supporting hero images, advertising creatives, and high-end product renders.

Multilingual Poster and Ad Design

For globally distributed creative content, the Grok Imagine Image Quality API generates images with best-in-class text rendering, accurate multilingual typography, and clean character integration directly within the artwork. This use case fits advertising agencies, localization specialists, and brand designers producing visuals that require legible, on-brand copy embedded into the final image.

Reference-Based Image Editing

The Grok Imagine Image Quality API empowers designers to refine and restyle existing visuals through tighter prompt following, reference-driven inputs, and pinpoint compositional control. Ideal for iterative creative production and brand consistency workflows, the API maintains stylistic coherence across edits—supporting concept refinement, design variation, and polished final assets for commercial campaigns.

Cinematic Product Showcases

Grok Imagine Video Text-to-Video API enables creators and developers to generate cinematic video sequences from a single text prompt, complete with native audio and up to 720p resolution. Ideal for marketing teams and content studios pursuing production-ready video output, the API renders dynamic motion, natural camera movement, and synchronized sound—supporting brand campaigns, social media content, and immersive advertising narratives.

Portrait and Product Animation

For creators looking to breathe life into static visuals, the Grok Imagine Video Image-to-Video API transforms still images into fluid, photorealistic video clips anchored to the source image as the first frame. This use case fits e-commerce brands, digital artists, and advertising teams producing animated product showcases, portrait animations, and scene bring-to-life content that demands visual continuity from the original asset.

Non-Destructive Video Retouching

For post-production teams and creative agencies requiring precise, targeted modifications to existing footage, the Grok Imagine Video Edit API applies natural language instructions to an existing video while preserving the original scene, motion, and composition. This use case fits video editors, marketing producers, and brand teams refining campaign footage—enabling prop additions, outfit changes, and visual restyling without disrupting the underlying video structure.

Model Comparison

See how image models from different providers compare on reference image limit, output count, resolution, and aspect ratio.

Model	Reference Image Limit	Output Num	Resolution	Aspect Ratio
Grok Imagine Image Quality	8	1~4	2K, 1K	Auto, 1:1, 3:2, 2:3, 3:4, 4:3, 9:16, 16:9, 9:19.5, 19.5:9, 9:20, 20:9, 1:2, 2:1
Nano Banana 2	14	1	4K, 2K, 1K	1:1, 3:2, 2:3, 3:4, 4:3, 4:5, 5:4, 9:16, 16:9, 21:9
Nano Banana Pro	10	1	4K, 2K, 1K	1:1, 3:2, 2:3, 3:4, 4:3, 4:5, 5:4, 9:16, 16:9, 21:9
Seedream 5.0 Lite	14	1~15	2K~4K+	1:1, 3:2, 2:3, 3:4, 4:3, 4:5, 5:4, 9:16, 16:9, 21:9
Qwen-Image	3	1~6	512P~2K	Width[512, 2048]px, Height[512, 2048]px
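When a layout calls for arbitrary pixel dimensions, one way to use the fixed ratio list in the Grok Imagine Image Quality row is to snap to the closest supported value. The sketch below is illustrative only: the function names are hypothetical, the `Auto` option is left out because it has no numeric value, and distance is measured on a log scale so that portrait and landscape shapes are treated symmetrically.

```python
import math

# Numeric aspect ratios listed for Grok Imagine Image Quality ("Auto" omitted).
SUPPORTED_RATIOS = [
    "1:1", "3:2", "2:3", "3:4", "4:3", "9:16", "16:9",
    "9:19.5", "19.5:9", "9:20", "20:9", "1:2", "2:1",
]


def ratio_value(ratio: str) -> float:
    """Convert a 'W:H' string such as '19.5:9' into width divided by height."""
    width, height = ratio.split(":")
    return float(width) / float(height)


def nearest_ratio(width: int, height: int) -> str:
    """Return the supported ratio closest to width:height (hypothetical helper)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)
    return min(SUPPORTED_RATIOS,
               key=lambda r: abs(math.log(ratio_value(r)) - target))


if __name__ == "__main__":
    print(nearest_ratio(1920, 1080))
    print(nearest_ratio(1080, 1920))
```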

How to Use Grok Imagine on Atlas Cloud

Get started in minutes — follow these simple steps to integrate and deploy models through Atlas Cloud's platform.

Create an Atlas Cloud Account

Sign up at atlascloud.ai and complete verification. New users receive free credits to explore the platform and test models.

Why Use Grok Imagine on Atlas Cloud

Combining the advanced Grok Imagine models with Atlas Cloud's GPU-accelerated platform provides unmatched performance, scalability, and developer experience.

Performance & flexibility

Low Latency:
GPU-optimized inference for fast generation.

Unified API:
Run Grok Imagine, GPT, Gemini, and DeepSeek with one integration.

Transparent Pricing:
Predictable per-image and per-second billing with serverless options.

Enterprise & Scale

Developer Experience:
SDKs, analytics, fine-tuning tools, and templates.

Reliability:
99.99% uptime, RBAC, and compliance-ready logging.

Security & Compliance:
SOC 2 Type II, HIPAA alignment, data sovereignty in the US.

Grok Imagine API FAQ

Grok Imagine Image Quality is xAI's higher-fidelity text-to-image and image-editing model, designed to deliver photorealistic visuals with stronger text rendering, tighter prompt following, and richer detail than the standard Grok Imagine Image model.

The model supports image generation up to 2K resolution, with razor-sharp details, natural lighting, rich textures, and believable physics suitable for hero images, advertising creatives, and product renders.

Grok Imagine Image Quality offers best-in-class text rendering with stronger multilingual support, producing legible typography directly within generated images—ideal for posters, social graphics, and ad creatives.

Quality Mode trades slightly higher latency for noticeably better output—more accurate compositions, stronger text rendering, and greater realism—making it the recommended choice for final visuals such as ads, hero images, and client deliverables.

The API supports 16:9 (widescreen), 9:16 (mobile/stories), 1:1 (social media), 4:3, 3:2, and their portrait counterparts—covering all major platform formats for advertising creatives, social content, and cinematic productions.

Text-to-Video and Image-to-Video support durations up to 15 seconds, Reference-to-Video up to 10 seconds, and Video Edit retains the original footage length capped at 8.7 seconds. All modes output at 720p HD or 480p, with 720p recommended for brand-grade and advertising creative output.
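These per-mode limits lend themselves to a small lookup table that rejects out-of-range requests before they are sent. This is a sketch, not the platform's own validation: the mode keys and function name are illustrative labels rather than confirmed API identifiers, and for video edit the duration refers to the source footage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeLimits:
    max_seconds: float
    resolutions: tuple


# Mode keys are illustrative labels, not confirmed API identifiers.
LIMITS = {
    "text-to-video": ModeLimits(15, ("480p", "720p")),
    "image-to-video": ModeLimits(15, ("480p", "720p")),
    "reference-to-video": ModeLimits(10, ("480p", "720p")),
    "video-edit": ModeLimits(8.7, ("480p", "720p")),
}


def validate_request(mode: str, seconds: float, resolution: str) -> None:
    """Raise ValueError if the request exceeds the limits quoted above."""
    limits = LIMITS.get(mode)
    if limits is None:
        raise ValueError(f"unknown mode: {mode}")
    if not 0 < seconds <= limits.max_seconds:
        raise ValueError(f"{mode} supports at most {limits.max_seconds}s")
    if resolution not in limits.resolutions:
        raise ValueError(f"{mode} supports {', '.join(limits.resolutions)}")


if __name__ == "__main__":
    validate_request("text-to-video", seconds=15, resolution="720p")
    print("request is within the documented limits")
```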

The Grok Imagine Video API features native audio generation, automatically producing synchronized sound effects, background music, and ambient audio matched to the visual content, with no separate post-production workflow required.

The Grok Imagine Video Reference-to-Video API accepts up to 7 reference images to maintain consistent identity, clothing, and scene composition throughout the video. It is well suited to virtual try-on, product placement, and character-consistent storytelling.

Explore More Families

Seedance 2.0

The Seedance 2.0 API gives you production access to ByteDance's multimodal video model — quad-modal inputs (text, image, video, audio) and an industry-leading "Universal Reference" system that locks composition, camera movement, and character actions across shots. Integrate director-level control with one API call, a flat $0.09/s, instant key, and no waitlist — backed by enterprise-grade uptime and compliance. Seedance 2.0 Native 4K Is Now Live in June, 2026!

Gemini Omni Flash

The Gemini Omni Flash API gives developers Google DeepMind's reasoning-driven video model, now in public preview. It generates video from any mix of text, image, video, and audio, produces synchronized sound in a single pass, and lets you edit through natural conversation while keeping continuity intact. On Atlas Cloud you call it with one key and no Google Cloud setup, at 720p from $0.112 per second.

Happy Horse

HappyHorse leads the Artificial Analysis Video Arena leaderboard for both text-to-video and image-to-video generation. The HappyHorse 1.0 API and HappyHorse 1.1 API give developers direct access to Alibaba's unified video model — no multi-stage pipeline, and a single integration for both modalities. Generate 1080p video with synchronized audio straight from your code.

GPT Image 2

The GPT Image 2 API gives developers access to OpenAI's latest image model, the successor to GPT Image 1.5. It generates and edits images with accurate text rendering across Latin and CJK scripts, plus strong composition for posters, mockups, and infographics. On Atlas Cloud you reach it through one unified API alongside 300+ models, with free credits, 99.99% uptime, and no OpenAI organization verification required.

Google

Google's most powerful creative models are all available on Atlas Cloud. Veo 3.1 delivers cinematic video generation, Nano Banana 2 powers high-fidelity image creation, and Gemini brings multimodal intelligence to every workflow. Access the full Google model suite through one API key with Day-0 availability and pay-as-you-go pricing.

Seedance 2.0 Mini

The Seedance 2.0 Mini API is the lightest, lowest-cost tier of ByteDance's Seedance video line, built for teams where throughput and unit cost matter more than maximum polish. Use it for batch generation, rapid prototyping, and draft passes, all through one OpenAI-compatible key on Atlas Cloud.

ByteDance

From cinematic video generation to high-fidelity image creation, ByteDance's most powerful models are live on Atlas Cloud. Run Seedance and Seedream at scale with the lowest inference pricing and zero infrastructure overhead.

Alibaba

Atlas Cloud brings together Alibaba's full model lineup under one API: Qwen for language and image tasks, Wan for video generation up to 1080p. Access every model pay-as-you-go with no subscriptions. The Alibaba API is available via a single base URL using your existing OpenAI-compatible client.

OpenAI

Atlas Cloud gives you access to the full OpenAI API lineup, from GPT Image 2 for image generation to Sora 2 for video. Every model is available pay-as-you-go with no monthly commitment. Plug in with a single base URL swap using the OpenAI-compatible API.

xAI

Build complete image and video pipelines using the xAI API on Atlas Cloud. Generate at 2K, edit with reference images, and animate images into audio-synced clips.

Kwaivgi

The Kwaivgi API at 15% off standard rates. Day-0 access to every new Kling release, pay-as-you-go, no seat limits. One account covers the full Kling lineup.

One API for All Media AI.