
Gemini Omni Flash Reference-to-Video API by Google
A natively multimodal Google DeepMind model that generates cinematic, sound-enabled videos from a text prompt plus 1-5 reference images, carrying a consistent subject, scene, or style across generations.
Gemini Omni Flash — Reference to Video
Model ID: google/gemini-omni-flash/reference-to-video
Gemini Omni Flash is Google DeepMind's high-performance, natively multimodal model built for high-speed video generation, editing, and cinematic control. This variant accepts a text prompt plus one or more reference images, generating a video that carries the referenced subject, scene, or style into a newly described scene.
Overview
Gemini Omni Flash (gemini-omni-flash-preview) was introduced by Google alongside Nano Banana 2 Lite as a new generation of multimodal media models. Unlike traditional pipelines that stitch modalities together, Omni Flash is a single transformer that processes text, images, audio, and video simultaneously, producing output that is more cohesive, consistent, and controllable.
What sets it apart from earlier video models (such as the Veo family) is that it natively generates audio with every video — dialogue, ambience, music, and sound design are produced together with the picture rather than added afterward. The model is grounded in Gemini's real-world knowledge, so it reasons about physics, narrative logic, culture, and visual composition to produce results that feel intentional and cinematic. Generated media carries an invisible SynthID watermark.
AtlasCloud exposes Gemini Omni Flash through four endpoints: text-to-video, image-to-video, reference-to-video, and video-edit. All four route to the same `gemini-omni-flash-preview` model and differ only in the input modality they accept, which corresponds to the model's `task` parameter (`text_to_video`, `image_to_video`, `reference_to_video`, `edit`). This endpoint maps to `reference_to_video`.
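The endpoint-to-task correspondence described above can be written down as a small lookup. This is only an illustrative sketch: the endpoint names and task values come from the text, while the `ENDPOINT_TO_TASK` table and the `task_for_endpoint` helper are hypothetical names, not part of any published SDK.

```python
# Endpoint name -> value of the model's `task` parameter, as listed in the text.
# The dictionary and helper names are illustrative assumptions.
ENDPOINT_TO_TASK = {
    "text-to-video": "text_to_video",
    "image-to-video": "image_to_video",
    "reference-to-video": "reference_to_video",
    "video-edit": "edit",
}


def task_for_endpoint(endpoint: str) -> str:
    """Return the task value for an endpoint name, or raise ValueError."""
    try:
        return ENDPOINT_TO_TASK[endpoint]
    except KeyError:
        known = ", ".join(sorted(ENDPOINT_TO_TASK))
        raise ValueError(f"unknown endpoint {endpoint!r}; expected one of: {known}") from None
```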
Inputs
This variant takes a text prompt and 1–5 reference images. The images are used as character, scene, or style references, and the prompt describes the new scene to build around them. Because Omni Flash maintains subject, object, and style consistency, this is the best choice for keeping a recurring character or a consistent visual identity across generations.
- Prompt — Natural-language description of the target scene, action, camera language, mood, and audio (up to 20,000 characters).
- Images — 1 to 5 reference images. PNG, JPEG, JPG, or WebP, each up to 20 MB. Supplied as public URLs or base64-encoded images.
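The input limits above can be checked on the client before a request is sent. The sketch below makes several assumptions: "20 MB" is treated as 20 × 1024 × 1024 bytes, the file extension stands in for the image format, and base64 images are sent as a bare base64 string (the page does not say whether a data-URI prefix is expected). All function names are hypothetical.

```python
import base64
import os

MAX_PROMPT_CHARS = 20_000
MAX_IMAGE_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" read as 20 MiB
MIN_IMAGES, MAX_IMAGES = 1, 5
ALLOWED_EXTENSIONS = {".png", ".jpeg", ".jpg", ".webp"}


def check_prompt(prompt: str) -> str:
    """Reject prompts longer than the documented 20,000-character limit."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt has {len(prompt)} characters; limit is {MAX_PROMPT_CHARS}")
    return prompt


def encode_reference_image(data: bytes, filename: str) -> str:
    """Check format and size, then return the image as a base64 string."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported image type {ext!r}")
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError(f"{filename} is {len(data)} bytes; limit is {MAX_IMAGE_BYTES}")
    return base64.b64encode(data).decode("ascii")


def check_image_count(images: list) -> list:
    """Enforce the documented 1 to 5 reference images."""
    if not MIN_IMAGES <= len(images) <= MAX_IMAGES:
        raise ValueError(f"expected {MIN_IMAGES}-{MAX_IMAGES} images, got {len(images)}")
    return images
```

Public URLs can be passed through unchanged; only locally held image bytes need the encoding step.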
Key Capabilities
- Subject & style consistency — Carry a referenced character, object, or look across scenes and generations.
- Multi-reference conditioning — Blend up to 5 reference images to guide subject, scene, and style at once.
- Rich prompt understanding — Direct camera movement, action, mood, style, and audio in a single prompt of up to 20,000 characters.
- Native audio generation — Every clip is rendered with a synchronized soundtrack (speech, music, effects) driven by your description.
- World-grounded realism — Physics, motion, and scene dynamics informed by Gemini's real-world knowledge.
- Adjustable reasoning — The `thinking_level` control trades latency for quality on complex prompts.
- Reproducible results — Set a fixed seed to reproduce or iterate on a specific generation.
Input Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | google/gemini-omni-flash/reference-to-video | Model identifier |
| prompt | string | Yes | — | Text description of the target scene. Max 20,000 characters. |
| images | array of strings (URI) | Yes | — | 1–5 reference images for character, scene, or style. PNG/JPEG/JPG/WebP, ≤20 MB each. URL or base64. |
| duration | integer | No | 10 | Video length in seconds. Range: 3–10. |
| aspect_ratio | string | No | 16:9 | Output aspect ratio. Enum: 16:9, 9:16. |
| thinking_level | string | No | default | Internal reasoning effort. Enum: default, high, low. |
| resolution | string | No | 720p | Output resolution. Enum: 720p. |
| seed | integer | No | -1 | Random seed for reproducibility. -1 uses a random seed. |
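A request body assembled from the parameters in this table might look like the following sketch. The field names, defaults, ranges, and enums are taken from the table; the `build_request` helper and the example image URL are hypothetical. The request URL and authentication scheme are not given on this page, so the sketch stops at building and validating the JSON payload.

```python
import json

MODEL_ID = "google/gemini-omni-flash/reference-to-video"

# Defaults and allowed values as listed in the parameter table.
DEFAULTS = {
    "duration": 10,
    "aspect_ratio": "16:9",
    "thinking_level": "default",
    "resolution": "720p",
    "seed": -1,
}
ENUMS = {
    "aspect_ratio": {"16:9", "9:16"},
    "thinking_level": {"default", "high", "low"},
    "resolution": {"720p"},
}


def build_request(prompt: str, images: list, **options) -> dict:
    """Validate inputs against the table and return a JSON-ready payload."""
    unknown = set(options) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **options}
    if len(prompt) > 20_000:
        raise ValueError("prompt exceeds 20,000 characters")
    if not 1 <= len(images) <= 5:
        raise ValueError("images must contain 1 to 5 entries")
    if not 3 <= params["duration"] <= 10:
        raise ValueError("duration must be between 3 and 10 seconds")
    for name, allowed in ENUMS.items():
        if params[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
    return {"model": MODEL_ID, "prompt": prompt, "images": list(images), **params}


# Placeholder URL for illustration only.
payload = build_request(
    "The same explorer walks through a rainy neon market, handheld camera.",
    ["https://example.com/explorer.png"],
    duration=5,
    aspect_ratio="9:16",
    seed=42,  # fixed seed so the generation can be reproduced
)
print(json.dumps(payload, indent=2))
```

Omitted optional parameters fall back to the table defaults, so the payload always states explicitly what will be generated and billed.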
Use Cases
- Consistent characters — Keep the same protagonist, mascot, or presenter across a series of clips.
- Brand identity — Reproduce a product, logo, or visual style consistently across marketing videos.
- Style transfer — Apply the look and feel of reference art to a newly described scene.
- Episodic content — Maintain visual continuity across multiple generations in a storyline.
- Personalized media — Generate videos featuring specific subjects supplied as references.
Pricing
Billing is based on the duration of the generated video, charged at a flat per-second rate.
| SKU | Rate |
|---|---|
| Per second of output | $0.135 |
Formula: max(3, duration) × $0.135
- Billing is per second, with a 3-second minimum — durations below 3s are billed as 3s.
- Example: a 10-second video costs 10 × $0.135 = $1.35.
- Example: a 3-second video costs 3 × $0.135 = $0.405.
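The billing formula can be sketched as a small helper for estimating cost before submitting a job. The rate and the 3-second minimum come from this section; the function name `estimate_cost` is hypothetical, and `Decimal` is used here only to avoid floating-point rounding in currency arithmetic.

```python
from decimal import Decimal

RATE_PER_SECOND = Decimal("0.135")  # USD per second of output
MIN_BILLED_SECONDS = 3


def estimate_cost(duration_seconds: int) -> Decimal:
    """Return the estimated charge in USD: max(3, duration) x $0.135."""
    billed_seconds = max(MIN_BILLED_SECONDS, duration_seconds)
    return billed_seconds * RATE_PER_SECOND


print(estimate_cost(10))  # 1.350
print(estimate_cost(3))   # 0.405
```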


















