
Gemini Omni Flash Video Edit API by Google
A natively multimodal Google DeepMind model that edits an existing video from a text prompt with optional reference images, applying scene-consistent changes and native audio while preserving the untouched footage.
Gemini Omni Flash — Video Edit
Model ID: google/gemini-omni-flash/video-edit
Gemini Omni Flash is Google DeepMind's high-performance, natively multimodal model built for high-speed video generation, editing, and cinematic control. This variant accepts a source video plus a text prompt (and, optionally, reference images), transforming an existing clip according to your instructions.
Overview
Gemini Omni Flash (gemini-omni-flash-preview) was introduced by Google alongside Nano Banana 2 Lite as a new generation of multimodal media models. Unlike traditional pipelines that stitch modalities together, Omni Flash is a single transformer that processes text, images, audio, and video simultaneously, producing output that is more cohesive, consistent, and controllable.
What sets it apart from earlier video models (such as the Veo family) is that it natively generates audio with every video — dialogue, ambience, music, and sound design are produced together with the picture rather than added afterward. The model is grounded in Gemini's real-world knowledge, so it reasons about physics, narrative logic, culture, and visual composition to produce results that feel intentional and cinematic. Generated media carries an invisible SynthID watermark.
AtlasCloud exposes Gemini Omni Flash through four endpoints: text-to-video, image-to-video, reference-to-video, and video-edit. All four route to the same gemini-omni-flash-preview model and differ only by the input modality they accept, corresponding to the model's task parameter (text_to_video, image_to_video, reference_to_video, edit). This endpoint maps to edit.
Inputs
This variant takes a source video and a text prompt, with optional reference images. The prompt describes the edit to apply — adding, removing, or transforming elements, restyling, or changing the audio — while Omni Flash preserves the rest of the clip. Because the model understands the whole scene, edits stay consistent with the surrounding footage rather than looking pasted on.
- Video — The source clip to edit. Up to 100 MB and 30 seconds in duration.
- Prompt — Natural-language description of the edit to apply (up to 20,000 characters).
- Images (optional) — 1 to 5 reference images to guide the edit (e.g. a subject, object, or style to introduce). PNG/JPEG/JPG/WebP, ≤20 MB each. URL or base64.
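The limits above can be checked locally before uploading anything. The following is a minimal sketch, not part of any official SDK: the function name `check_inputs` and its arguments are hypothetical, the caller is assumed to already know the video's size and duration, and 1 MB is assumed to mean 1024 × 1024 bytes (the source does not say which definition applies).

```python
from pathlib import Path

# Limits taken from the Inputs list; the byte interpretation of "MB" is an assumption.
MAX_VIDEO_BYTES = 100 * 1024 * 1024
MAX_VIDEO_SECONDS = 30
MAX_IMAGE_BYTES = 20 * 1024 * 1024
MAX_PROMPT_CHARS = 20_000
MAX_IMAGES = 5
IMAGE_EXTENSIONS = {".png", ".jpeg", ".jpg", ".webp"}


def check_inputs(prompt, video_bytes, video_seconds, image_files=()):
    """Return a list of problems; an empty list means the inputs fit the documented limits.

    image_files is a sequence of (filename, size_in_bytes) pairs (hypothetical shape).
    """
    problems = []
    if not prompt:
        problems.append("prompt is empty")
    elif len(prompt) > MAX_PROMPT_CHARS:
        problems.append(f"prompt has {len(prompt)} characters, limit is {MAX_PROMPT_CHARS}")
    if video_bytes > MAX_VIDEO_BYTES:
        problems.append("video is larger than 100 MB")
    if video_seconds > MAX_VIDEO_SECONDS:
        problems.append("video is longer than 30 seconds")
    if len(image_files) > MAX_IMAGES:
        problems.append(f"{len(image_files)} reference images given, limit is {MAX_IMAGES}")
    for name, size in image_files:
        if Path(name).suffix.lower() not in IMAGE_EXTENSIONS:
            problems.append(f"{name}: unsupported image format")
        if size > MAX_IMAGE_BYTES:
            problems.append(f"{name}: image is larger than 20 MB")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue with a clip in a single pass.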
Key Capabilities
- Instruction-driven editing — Add, remove, replace, or restyle elements of a clip from a plain-language description.
- Scene-consistent results — Edits blend into the existing footage, preserving untouched regions, lighting, and motion.
- Reference-guided edits — Optionally supply up to 5 images to introduce a specific subject, object, or style.
- Native audio generation — Edits can regenerate or adjust the accompanying soundtrack alongside the picture.
- World-grounded realism — Physics, motion, and scene dynamics informed by Gemini's real-world knowledge.
- Adjustable reasoning — The thinking_level control trades latency for quality on complex edits.
- Reproducible results — Set a fixed seed to reproduce or iterate on a specific generation.
Input Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | google/gemini-omni-flash/video-edit | Model identifier |
| prompt | string | Yes | — | Text description of the edit to apply. Max 20,000 characters. |
| video | string (uri) | Yes | — | Source video to edit. ≤100 MB and ≤30 seconds. |
| images | array of string (uri) | No | — | 1–5 optional reference images for character, scene, or style. PNG/JPEG/JPG/WebP, ≤20 MB each. URL or base64. |
| thinking_level | string | No | default | Internal reasoning effort. Enum: default, high, low. |
| resolution | string | No | 720p | Output resolution. Enum: 720p. |
| seed | integer | No | -1 | Random seed for reproducibility. -1 uses a random seed. |
Output duration and aspect ratio follow the source video, so this variant has no duration or aspect_ratio parameter.
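As an illustration of how the parameters in the table fit together, here is a minimal sketch that assembles a request body and rejects values outside the documented enums. It makes several assumptions the source does not confirm: that the parameters are sent as a flat JSON object, and that a helper like `build_edit_request` is a sensible shape for client code (the name is hypothetical). The endpoint URL and authentication scheme are not given here, so the sketch stops at building the body, and the example.com URLs are placeholders.

```python
import json

MODEL_ID = "google/gemini-omni-flash/video-edit"
THINKING_LEVELS = {"default", "high", "low"}
RESOLUTIONS = {"720p"}


def build_edit_request(prompt, video, images=None,
                       thinking_level="default", resolution="720p", seed=-1):
    """Build a request body using the parameter names and defaults from the table."""
    if not prompt or len(prompt) > 20_000:
        raise ValueError("prompt must be 1 to 20,000 characters")
    if thinking_level not in THINKING_LEVELS:
        raise ValueError(f"thinking_level must be one of {sorted(THINKING_LEVELS)}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")
    body = {
        "model": MODEL_ID,
        "prompt": prompt,
        "video": video,
        "thinking_level": thinking_level,
        "resolution": resolution,
        "seed": seed,  # -1 asks for a random seed
    }
    if images:
        if len(images) > 5:
            raise ValueError("at most 5 reference images are allowed")
        body["images"] = list(images)  # optional, so omitted when empty
    return body


# Placeholder inputs; a fixed seed makes the edit repeatable across runs.
example = build_edit_request(
    prompt="Replace the mug on the desk with the product shown in the reference image.",
    video="https://example.com/clip.mp4",
    images=["https://example.com/product.png"],
    thinking_level="high",
    seed=1234,
)
print(json.dumps(example, indent=2))
```

Note that the body has no duration or aspect_ratio key, since both follow the source video.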
Use Cases
- Element edits — Add, remove, or replace objects, characters, or backgrounds in existing footage.
- Restyling — Transform the look, color grade, or mood of a clip while keeping its motion.
- Localization & cleanup — Swap on-screen elements or refresh assets without reshooting.
- Reference-driven insertion — Bring a specific product or character (supplied as images) into an existing scene.
- Iterative refinement — Apply successive edits to converge on a desired result.
Pricing
Billing is based on the duration of the source video, charged at a flat per-second rate.
| SKU | Rate |
|---|---|
| Per second of source video | $0.14 |
Formula: clamp(video_duration, 3, 30) × $0.14
- Billing follows the source video's duration, clamped to a 3-second minimum and a 30-second maximum.
- Example: a 10-second source video costs 10 × $0.14 = $1.40.
- Example: a 30-second source video costs 30 × $0.14 = $4.20.
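The pricing formula can be sketched as a small estimator. This is an illustrative helper, not an official billing calculator: the function name `estimate_cost` is hypothetical, and it assumes fractional seconds are billed proportionally, which the source does not state.

```python
from decimal import Decimal

RATE_PER_SECOND = Decimal("0.14")   # USD per second of source video
MIN_BILLED_SECONDS = Decimal("3")
MAX_BILLED_SECONDS = Decimal("30")


def estimate_cost(video_seconds):
    """Apply clamp(video_duration, 3, 30) x $0.14 and return the cost in USD."""
    seconds = Decimal(str(video_seconds))
    billed = min(max(seconds, MIN_BILLED_SECONDS), MAX_BILLED_SECONDS)
    return billed * RATE_PER_SECOND


print(estimate_cost(10))  # 1.40
print(estimate_cost(30))  # 4.20
print(estimate_cost(1))   # 0.42, because the 3-second minimum applies
```

Decimal is used instead of float so that the results match the worked examples exactly rather than carrying binary rounding error.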


















