
Gemini Omni Flash Text-to-Video API by Google
A natively multimodal Google DeepMind model that generates cinematic videos with synchronized native audio from a text prompt alone, grounded in real-world physics for controllable, high-speed video generation.
Gemini Omni Flash — Text to Video
Model ID: google/gemini-omni-flash/text-to-video
Gemini Omni Flash is Google DeepMind's high-performance, natively multimodal model built for high-speed video generation, editing, and cinematic control. This variant accepts a text prompt only, making it ideal for pure creative generation where you describe the entire scene through language.
Overview
Gemini Omni Flash (gemini-omni-flash-preview) was introduced by Google alongside Nano Banana 2 Lite as a new generation of multimodal media models. Unlike traditional pipelines that stitch modalities together, Omni Flash is a single transformer that processes text, images, audio, and video simultaneously, producing output that is more cohesive, consistent, and controllable.
What sets it apart from earlier video models (such as the Veo family) is that it natively generates audio with every video — dialogue, ambience, music, and sound design are produced together with the picture rather than added afterward. The model is grounded in Gemini's real-world knowledge, so it reasons about physics, narrative logic, culture, and visual composition to produce results that feel intentional and cinematic. Generated media carries an invisible SynthID watermark.
AtlasCloud exposes Gemini Omni Flash through four endpoints: text-to-video, image-to-video, reference-to-video, and video-edit. All four route to the same gemini-omni-flash-preview model and differ only by the input modality they accept, corresponding to the model's task parameter (text_to_video, image_to_video, reference_to_video, edit). This endpoint maps to text_to_video.
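The endpoint-to-task relationship described above can be written down as a small lookup table. This is a minimal sketch, not part of any official SDK: the `ENDPOINT_TASKS` name and the `task_for_model_id` helper are hypothetical, and the assumption that the other three endpoints share the `google/gemini-omni-flash/<endpoint>` model ID pattern is inferred from the text-to-video ID only.

```python
# Endpoint suffix -> value of the model's task parameter, as listed in the text.
ENDPOINT_TASKS = {
    "text-to-video": "text_to_video",
    "image-to-video": "image_to_video",
    "reference-to-video": "reference_to_video",
    "video-edit": "edit",
}


def task_for_model_id(model_id: str) -> str:
    """Return the task value for a model ID.

    Assumes IDs look like "google/gemini-omni-flash/<endpoint>"; only the
    text-to-video ID is confirmed by this page.
    """
    suffix = model_id.rsplit("/", 1)[-1]
    try:
        return ENDPOINT_TASKS[suffix]
    except KeyError:
        raise ValueError(f"unknown endpoint: {suffix!r}") from None


task = task_for_model_id("google/gemini-omni-flash/text-to-video")
```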
Inputs
This variant takes a text prompt as its only content input. You describe the subjects, actions, camera language, lighting, mood, style, and any dialogue or sound design entirely in natural language, and the model synthesizes a video (with audio) from scratch.
Key Capabilities
- Rich prompt understanding — Describe subjects, actions, camera movements, lighting, mood, style, and audio in a single prompt of up to 20,000 characters.
- Native audio generation — Every clip is rendered with a synchronized soundtrack (speech, music, effects) driven by your description.
- World-grounded realism — Physics, motion, and scene dynamics informed by Gemini's real-world knowledge.
- Cinematic control — Camera framing, pacing, and single-scene composition guided directly from the prompt.
- Adjustable reasoning — The thinking_level control trades latency for quality on complex prompts.
- Reproducible results — Set a fixed seed to reproduce or iterate on a specific generation.
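Because subjects, camera language, and audio all travel in one prompt string, it can help to assemble the prompt from labeled fragments and check the 20,000-character limit before submitting. The sketch below is illustrative only: `build_prompt` is a hypothetical helper, the example wording is invented, and it assumes the limit is counted the way Python's `len` counts characters, which this page does not specify.

```python
MAX_PROMPT_CHARS = 20_000  # limit stated in the capabilities list


def build_prompt(*parts: str) -> str:
    """Join non-empty prompt fragments with spaces and enforce the length limit."""
    prompt = " ".join(p.strip() for p in parts if p and p.strip())
    if not prompt:
        raise ValueError("prompt must not be empty")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(
            f"prompt is {len(prompt)} characters; the limit is {MAX_PROMPT_CHARS}"
        )
    return prompt


prompt = build_prompt(
    "A lighthouse keeper climbs a spiral staircase at dusk.",  # subject and action
    "Slow handheld tracking shot, warm lantern light.",        # camera and lighting
    "Audio: wind, creaking wood, distant waves.",              # sound design
)
```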
Input Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | google/gemini-omni-flash/text-to-video | Model identifier |
| prompt | string | Yes | — | Text description of the video. Max 20,000 characters. |
| duration | integer | No | 10 | Video length in seconds. Range: 3–10. |
| aspect_ratio | string | No | 16:9 | Output aspect ratio. Enum: 16:9, 9:16. |
| thinking_level | string | No | default | Internal reasoning effort. Enum: default, high, low. |
| resolution | string | No | 720p | Output resolution. Enum: 720p. |
| seed | integer | No | -1 | Random seed for reproducibility. -1 uses a random seed. |
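The parameter table translates directly into a request body. The following sketch builds and validates that body using only the field names, defaults, ranges, and enums listed above. The `build_request` function is hypothetical, and the sketch stops at the payload on purpose: the HTTP endpoint URL, authentication scheme, and response format are not documented in this section. The valid range for non-default seeds is also not stated, so only the type is checked.

```python
MODEL_ID = "google/gemini-omni-flash/text-to-video"
ASPECT_RATIOS = {"16:9", "9:16"}
THINKING_LEVELS = {"default", "high", "low"}
RESOLUTIONS = {"720p"}


def build_request(
    prompt: str,
    duration: int = 10,
    aspect_ratio: str = "16:9",
    thinking_level: str = "default",
    resolution: str = "720p",
    seed: int = -1,
) -> dict:
    """Validate inputs against the parameter table and return a request body."""
    if not prompt or len(prompt) > 20_000:
        raise ValueError("prompt must be 1 to 20,000 characters")
    if not isinstance(duration, int) or not 3 <= duration <= 10:
        raise ValueError("duration must be an integer from 3 to 10 seconds")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(ASPECT_RATIOS)}")
    if thinking_level not in THINKING_LEVELS:
        raise ValueError(f"thinking_level must be one of {sorted(THINKING_LEVELS)}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")
    if not isinstance(seed, int):
        raise ValueError("seed must be an integer (-1 picks a random seed)")
    return {
        "model": MODEL_ID,
        "prompt": prompt,
        "duration": duration,
        "aspect_ratio": aspect_ratio,
        "thinking_level": thinking_level,
        "resolution": resolution,
        "seed": seed,
    }


# A fixed seed keeps the generation reproducible while other settings change.
payload = build_request(
    "A paper boat drifts down a rain-filled gutter.",
    duration=5,
    aspect_ratio="9:16",
    seed=42,
)
```

Validating on the client side is a design choice rather than a requirement; it surfaces out-of-range values before a billable request is made.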
Use Cases
- Creative storytelling — Generate cinematic scenes with sound from narrative descriptions.
- Concept visualization — Quickly visualize ideas, moods, or environments.
- Storyboard prototyping — Turn scene descriptions into video drafts before full production.
- Marketing assets — Produce short-form video content directly from copy briefs.
- Educational content — Illustrate concepts, processes, or historical scenes through natural language.
Pricing
Billing is based on the duration of the generated video, charged at a flat per-second rate.
| SKU | Rate |
|---|---|
| Per second of output | $0.125 |
Formula: max(3, duration) × $0.125
- Billing is per second, with a 3-second minimum — durations below 3s are billed as 3s.
- Example: a 10-second video costs 10 × $0.125 = $1.25.
- Example: a 3-second video costs 3 × $0.125 = $0.375.
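The billing formula can be checked with a few lines of code. This is a minimal sketch of max(3, duration) × $0.125 as stated above; the `estimate_cost` name is hypothetical, and `Decimal` is used so the per-second rate is represented exactly rather than as a binary float.

```python
from decimal import Decimal

RATE_PER_SECOND = Decimal("0.125")  # USD per second of output
MIN_BILLED_SECONDS = 3              # shorter durations are billed as 3 seconds


def estimate_cost(duration: int) -> Decimal:
    """Return the estimated charge in USD for a clip of the given length."""
    billed_seconds = max(MIN_BILLED_SECONDS, duration)
    return billed_seconds * RATE_PER_SECOND


ten_second_cost = estimate_cost(10)
three_second_cost = estimate_cost(3)
```

Under these assumptions, `ten_second_cost` equals $1.25 and `three_second_cost` equals $0.375, matching the two examples above.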