
Gemini Omni Flash Image-to-Video API by Google
A natively multimodal Google DeepMind model that animates a still image into a cinematic, sound-enabled video guided by a text prompt while preserving the source subject and composition.
Gemini Omni Flash — Image to Video
Model ID: google/gemini-omni-flash/image-to-video
Gemini Omni Flash is Google DeepMind's high-performance, natively multimodal model built for high-speed video generation, editing, and cinematic control. This variant accepts an image plus a text prompt, animating a still image into a coherent, sound-enabled video guided by your instructions.
Overview
Gemini Omni Flash (gemini-omni-flash-preview) was introduced by Google alongside Nano Banana 2 Lite as a new generation of multimodal media models. Unlike traditional pipelines that stitch modalities together, Omni Flash is a single transformer that processes text, images, audio, and video simultaneously, producing output that is more cohesive, consistent, and controllable.
What sets it apart from earlier video models (such as the Veo family) is that it natively generates audio with every video — dialogue, ambience, music, and sound design are produced together with the picture rather than added afterward. The model is grounded in Gemini's real-world knowledge, so it reasons about physics, narrative logic, culture, and visual composition to produce results that feel intentional and cinematic. Generated media carries an invisible SynthID watermark.
AtlasCloud exposes Gemini Omni Flash through four endpoints: text-to-video, image-to-video, reference-to-video, and video-edit. All four route to the same `gemini-omni-flash-preview` model and differ only by the input modality they accept, corresponding to the model's `task` parameter (`text_to_video`, `image_to_video`, `reference_to_video`, `edit`). This endpoint maps to `image_to_video`.
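The endpoint-to-task relationship described above can be written down as a small lookup table. This is only an illustrative sketch: the endpoint names and task values come from the text, while the dictionary, the helper `task_for_model_id`, and the assumption that the task can be read from the last path segment of a model ID are hypothetical. Only the image-to-video model ID is confirmed on this page.

```python
# Endpoint names and task values are taken from the description above.
# The table and helper are illustrative, not part of any official SDK.
ENDPOINT_TO_TASK = {
    "text-to-video": "text_to_video",
    "image-to-video": "image_to_video",
    "reference-to-video": "reference_to_video",
    "video-edit": "edit",
}


def task_for_model_id(model_id: str) -> str:
    """Return the task for an ID such as 'google/gemini-omni-flash/image-to-video'.

    Assumes the endpoint name is the last '/'-separated segment of the ID.
    """
    endpoint = model_id.rsplit("/", 1)[-1]
    try:
        return ENDPOINT_TO_TASK[endpoint]
    except KeyError:
        raise ValueError(f"unknown endpoint: {endpoint!r}") from None
```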
Inputs
This variant takes an image and a text prompt. The image is used as the starting frame or motion guide, while the prompt describes how the scene should move, evolve, and sound. This is well suited to bringing a specific photo, illustration, or design to life while preserving its subject and composition.
- Image — PNG, JPEG, JPG, or WebP, up to 20 MB. Supplied as a public URL or a base64-encoded image.
- Prompt — Natural-language description of motion, camera language, mood, and audio (up to 20,000 characters).
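As a rough sketch of how the input limits above might be checked client-side before a request is made, the helpers below validate a local image file and a prompt, then base64-encode the image. The function names are hypothetical, reading "20 MB" as 20 × 1024 × 1024 bytes is an assumption, and the page does not say whether the API expects raw base64 or a data URI, so the code returns plain base64 text.

```python
import base64
from pathlib import Path

ALLOWED_SUFFIXES = {".png", ".jpeg", ".jpg", ".webp"}
MAX_IMAGE_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" read as binary megabytes
MAX_PROMPT_CHARS = 20_000


def encode_image(path: str) -> str:
    """Validate a local image by extension and size, then return it as base64 text."""
    p = Path(path)
    if p.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported image type: {p.suffix!r}")
    data = p.read_bytes()
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError(f"image is {len(data)} bytes, above the 20 MB limit")
    return base64.b64encode(data).decode("ascii")


def check_prompt(prompt: str) -> str:
    """Reject empty prompts and prompts above the documented character limit."""
    if not prompt:
        raise ValueError("prompt must not be empty")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt has {len(prompt)} characters, limit is {MAX_PROMPT_CHARS}")
    return prompt
```

The extension check is only a cheap first filter; it does not confirm that the file contents really are in the declared format.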
Key Capabilities
- Image-grounded animation — Preserves the subject, style, and composition of the source image while adding motion.
- Rich prompt understanding — Direct camera movement, action, mood, style, and audio in a single prompt of up to 20,000 characters.
- Native audio generation — Every clip is rendered with a synchronized soundtrack (speech, music, effects) driven by your description.
- World-grounded realism — Physics, motion, and scene dynamics informed by Gemini's real-world knowledge.
- Adjustable reasoning — The `thinking_level` control trades latency for quality on complex prompts.
- Reproducible results — Set a fixed seed to reproduce or iterate on a specific generation.
Input Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `model` | string | Yes | google/gemini-omni-flash/image-to-video | Model identifier |
| `prompt` | string | Yes | — | Text description of the motion and scene. Max 20,000 characters. |
| `image` | string (uri) | Yes | — | Image to animate, used as the starting frame or motion guide. PNG/JPEG/JPG/WebP, ≤20 MB. URL or base64. |
| `duration` | integer | No | 10 | Video length in seconds. Range: 3–10. |
| `aspect_ratio` | string | No | 16:9 | Output aspect ratio. Enum: 16:9, 9:16. |
| `thinking_level` | string | No | default | Internal reasoning effort. Enum: default, high, low. |
| `resolution` | string | No | 720p | Output resolution. Enum: 720p. |
| `seed` | integer | No | -1 | Random seed for reproducibility. -1 uses a random seed. |
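The table can be turned into a request body along the following lines. This is a minimal sketch: the parameter names, defaults, ranges, and enums come from the table, but `build_payload` is a hypothetical helper, and the endpoint URL, authentication, and response format are not covered on this page, so the code stops at producing a JSON-serialisable dictionary.

```python
from typing import Any

MODEL_ID = "google/gemini-omni-flash/image-to-video"


def build_payload(
    prompt: str,
    image: str,
    *,
    duration: int = 10,
    aspect_ratio: str = "16:9",
    thinking_level: str = "default",
    resolution: str = "720p",
    seed: int = -1,
) -> dict[str, Any]:
    """Validate the documented parameters and return a request body as a dict."""
    if not prompt or len(prompt) > 20_000:
        raise ValueError("prompt must be 1 to 20,000 characters")
    if not image:
        raise ValueError("image must be a URL or base64 string")
    if not 3 <= duration <= 10:
        raise ValueError("duration must be between 3 and 10 seconds")
    if aspect_ratio not in {"16:9", "9:16"}:
        raise ValueError("aspect_ratio must be '16:9' or '9:16'")
    if thinking_level not in {"default", "high", "low"}:
        raise ValueError("thinking_level must be 'default', 'high', or 'low'")
    if resolution != "720p":
        raise ValueError("resolution must be '720p'")
    return {
        "model": MODEL_ID,
        "prompt": prompt,
        "image": image,
        "duration": duration,
        "aspect_ratio": aspect_ratio,
        "thinking_level": thinking_level,
        "resolution": resolution,
        "seed": seed,  # -1 asks for a random seed; a fixed value aims at reproducibility
    }
```

Calling `build_payload("slow push-in, soft rain ambience", "https://example.com/still.png", seed=42)` yields a dictionary that can be passed to `json.dumps`; the URL here is a placeholder.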
Use Cases
- Photo-to-motion — Animate product shots, portraits, or artwork into short living scenes.
- Concept-to-video — Turn a single key frame or design mockup into a moving preview.
- Social content — Produce eye-catching short-form clips from a single hero image.
- Marketing assets — Bring campaign visuals to life with motion and sound.
- Prototyping — Test how a static composition reads once it moves, before full production.
Pricing
Billing is based on the duration of the generated video, charged at a flat per-second rate.
| SKU | Rate |
|---|---|
| Per second of output | $0.13 |
Formula: max(3, duration) × $0.13
- Billing is per second, with a 3-second minimum — durations below 3s are billed as 3s.
- Example: a 10-second video costs `10 × $0.13 = $1.30`.
- Example: a 3-second video costs `3 × $0.13 = $0.39`.
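The billing formula above can be checked with a few lines of code. The sketch below uses `Decimal` to avoid floating-point rounding on currency values; the function name `video_cost` and the constants are illustrative, while the rate and the 3-second minimum come from the pricing table.

```python
from decimal import Decimal

RATE_PER_SECOND = Decimal("0.13")  # USD per second of output, from the pricing table
MIN_BILLED_SECONDS = 3


def video_cost(duration_seconds: int) -> Decimal:
    """Apply max(3, duration) x $0.13 and return the cost in USD."""
    billed_seconds = max(MIN_BILLED_SECONDS, duration_seconds)
    return billed_seconds * RATE_PER_SECOND


print(video_cost(10))  # 1.30
print(video_cost(3))   # 0.39
```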


















