
Gemini Omni Flash Reference-to-Video Developer API by Google

Gemini Omni Flash is Google's multimodal video generation model. This reference-to-video variant transforms existing video clips using reference images and text prompts, enabling video style transfer, scene editing, and character insertion.

Gemini Omni Flash — Reference to Video (Developer)

Model ID: google/gemini-omni-flash/reference-to-video-developer

Gemini Omni is Google's multimodal video generation model designed to create high-quality video content from diverse input types. This variant accepts a text prompt, reference images, and a source video clip, enabling the most expressive form of video generation: transforming existing footage while preserving coherence and injecting new creative direction.


Overview

Gemini Omni brings together Google's deep knowledge of physics, narrative logic, biology, culture, and visual composition to produce contextually coherent videos. Rather than simple clip synthesis, the model reasons about scene dynamics, camera language, and temporal flow to produce results that feel intentional and cinematic.

With both image and video inputs, the model can change what happens in a scene, add or remove objects, adjust camera perspective, apply new visual effects, or completely reimagine a clip's style — all while maintaining temporal coherence with the source material.

The developer tier provides direct API access with full control over generation parameters including resolution, aspect ratio, and random seed.


Key Capabilities

  • Video-guided generation — Provide a source video clip as a structural or stylistic reference; the model builds upon it to produce new content.
  • Image reference anchoring — Supply 1 to 5 reference images alongside the video to define subjects, characters, or visual style.
  • Precise clip trimming — Specify start and end timestamps within the source video to use only the most relevant segment (trim window ≤ 10 seconds).
  • Rich prompt understanding — Describe transformations, additions, camera language, and mood in a prompt of up to 20,000 characters.
  • Multi-resolution output — Generate at 720p, 1080p, or 4K.
  • Flexible aspect ratios — 16:9 landscape or 9:16 portrait.
  • Reproducible results — Set a fixed seed to reproduce or iterate on a specific generation.

Input Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | google/gemini-omni-flash/reference-to-video-developer | Model identifier |
| prompt | string | Yes | | Text description of the desired transformation or new content. Max 20,000 characters. |
| images | array | Yes | | 1–5 reference image URLs used alongside the video. Supported formats: PNG, JPEG, JPG, WebP. Max 20MB each. |
| video_clips | array | Yes | | Exactly 1 source video clip. See video clip object below. |
| aspect_ratio | string | No | 16:9 | Output aspect ratio. Enum: 16:9, 9:16. |
| resolution | string | No | 720p | Output resolution. Enum: 720p, 1080p, 4k. |
| seed | integer | No | -1 | Random seed for reproducibility. -1 uses a random seed. |
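
As an illustration of how these parameters fit together, the sketch below assembles a request body in Python and checks the documented limits before anything is sent. The helper name `build_request` and the example.com URLs are hypothetical, and treating an empty prompt as invalid is an assumption. This page does not give the endpoint URL or authentication scheme, so the sketch stops at producing the JSON payload.

```python
import json

MODEL_ID = "google/gemini-omni-flash/reference-to-video-developer"
ASPECT_RATIOS = ("16:9", "9:16")
RESOLUTIONS = ("720p", "1080p", "4k")


def build_request(prompt, images, video_clip, aspect_ratio="16:9",
                  resolution="720p", seed=-1):
    """Assemble a request body from the documented parameters (hypothetical helper)."""
    # Assumption: an empty prompt is rejected; the page only states the maximum.
    if not prompt or len(prompt) > 20_000:
        raise ValueError("prompt must be 1 to 20,000 characters")
    if not 1 <= len(images) <= 5:
        raise ValueError("images must contain 1 to 5 URLs")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {ASPECT_RATIOS}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {RESOLUTIONS}")
    return {
        "model": MODEL_ID,
        "prompt": prompt,
        "images": list(images),
        "video_clips": [video_clip],  # exactly one clip is allowed
        "aspect_ratio": aspect_ratio,
        "resolution": resolution,
        "seed": seed,  # -1 asks for a random seed
    }


# Placeholder URLs for illustration only.
payload = build_request(
    prompt="Re-render the street scene as a watercolor painting.",
    images=["https://example.com/style-reference.png"],
    video_clip={"url": "https://example.com/street.mp4", "start": 0, "end": 8},
    resolution="1080p",
    seed=42,
)
print(json.dumps(payload, indent=2))
```

Passing a fixed `seed` such as 42 instead of -1 is what allows a generation to be reproduced or iterated on later.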

Video Clip Object

Each entry in video_clips is an object with the following fields:

| Field | Type | Required | Description |
|---|---|---|---|
| url | string (URI) | Yes | URL of the source video. Max 100MB, up to 30 seconds total duration. |
| start | number | No | Start time in seconds for trimming the clip. |
| end | number | No | End time in seconds. The difference end − start must not exceed 10 seconds. |
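
The trim rule can be checked on the client before a request is submitted. The following is a minimal sketch with a hypothetical helper, `check_trim`. The 10-second window and the 30-second source limit come from the table above. Rejecting a negative start or an end at or before the start is an assumption, and so is skipping the check when either bound is omitted, since this page does not describe the default behavior.

```python
def check_trim(start=None, end=None, max_window=10.0, max_duration=30.0):
    """Return the trim window in seconds, or None if it cannot be checked."""
    # Assumption: behavior with an omitted bound is unspecified, so skip the check.
    if start is None or end is None:
        return None
    # Assumption: the service expects 0 <= start < end.
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    # The source video itself may be at most 30 seconds long.
    if end > max_duration:
        raise ValueError(f"end exceeds the {max_duration:g}-second source limit")
    window = end - start
    if window > max_window:
        raise ValueError(f"trim window is {window:g}s; the limit is {max_window:g}s")
    return window


print(check_trim(2.5, 12.0))  # a 9.5-second window, within the limit
```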

Resource Quota

The total input quota per request is 7 units:

  • Each image consumes 1 unit
  • The video clip consumes 2 units

Maximum images when a video is present: 5 (5 × 1 + 1 × 2 = 7).
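
The quota arithmetic can be written out as a short Python sketch. The constant and function names are hypothetical; the unit costs and the 7-unit total are the ones listed above.

```python
QUOTA_UNITS = 7   # total input quota per request
IMAGE_UNITS = 1   # cost of each reference image
VIDEO_UNITS = 2   # cost of the video clip


def units_used(n_images, n_videos=1):
    """Quota units consumed by a request with the given inputs."""
    return n_images * IMAGE_UNITS + n_videos * VIDEO_UNITS


def max_images(n_videos=1):
    """Largest image count that still fits within the quota."""
    return (QUOTA_UNITS - n_videos * VIDEO_UNITS) // IMAGE_UNITS


print(units_used(5))  # 5 images + 1 video
print(max_images())   # images allowed alongside 1 video
```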

Image Input Notes

  • Accepts 1 to 5 images per request (when combined with a video).
  • Supported formats: PNG, JPEG, JPG, WebP.
  • Minimum image dimensions: 128×128 pixels.
  • Each image must not exceed 20MB.
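
A client-side pre-check of these constraints might look like the sketch below. The helper `image_problems` is hypothetical and takes metadata the caller has already read from the file (format name, pixel dimensions, byte size). Reading "20MB" as 20 × 1024 × 1024 bytes and allowing a file of exactly that size are both assumptions.

```python
SUPPORTED_FORMATS = {"png", "jpeg", "jpg", "webp"}
MIN_SIDE = 128                       # minimum width and height in pixels
MAX_IMAGE_BYTES = 20 * 1024 * 1024   # assumption: "20MB" read as 20 MiB


def image_problems(fmt, width, height, size_bytes):
    """Return a list of human-readable constraint violations (empty if none)."""
    problems = []
    if fmt.lower() not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {fmt}")
    if width < MIN_SIDE or height < MIN_SIDE:
        problems.append(f"{width}x{height} is below {MIN_SIDE}x{MIN_SIDE}")
    # Assumption: a file of exactly the limit is accepted.
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append("file is larger than 20MB")
    return problems


print(image_problems("PNG", 1024, 768, 3_500_000))
print(image_problems("gif", 64, 64, 25 * 1024 * 1024))
```

Returning every violation at once, rather than raising on the first, lets a caller report all problems with an image in a single pass.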

Use Cases

  • Video style transfer — Re-render existing footage in a new visual style described by prompt and reference images.
  • Scene editing — Add, remove, or modify objects and characters in a clip using natural language.
  • Camera perspective changes — Shift the implied viewpoint or camera movement of existing footage.
  • Character insertion — Inject a character (defined by reference images) into a scene from source video.
  • Iterative video production — Use rough footage as scaffolding and refine it into polished content through prompt-driven iteration.

Pricing

Pricing for this variant is fixed per generation based on output resolution only. Duration does not affect the price.

| Resolution | Price per Generation |
|---|---|
| 720p / 1080p | $1.60 |
| 4k | $2.40 |

Formula: resolution == "4k" ? $2.40 : $1.60

The fixed-price model reflects the additional compute cost of processing and conditioning on video input. 720p and 1080p are identically priced.
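
For budgeting a batch of generations, the pricing formula translates directly into Python. This is a sketch with hypothetical function names; the prices are the ones listed above, and `Decimal` is used so that dollar amounts add up without floating-point rounding.

```python
from decimal import Decimal

RESOLUTIONS = ("720p", "1080p", "4k")


def price_per_generation(resolution):
    """Fixed price in USD for one generation; duration does not matter."""
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {RESOLUTIONS}")
    return Decimal("2.40") if resolution == "4k" else Decimal("1.60")


def estimate_cost(resolutions):
    """Total price in USD for a batch, given one resolution per generation."""
    return sum((price_per_generation(r) for r in resolutions), Decimal("0"))


print(price_per_generation("1080p"))           # 1.60
print(estimate_cost(["720p", "1080p", "4k"]))  # 5.60
```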
