Gemini Omni Flash Reference-to-Video API by Google

A natively multimodal Google DeepMind model that generates cinematic, sound-enabled videos from a text prompt plus 1-5 reference images, carrying a consistent subject, scene, or style across generations.

Gemini Omni Flash — Reference to Video

Model ID: google/gemini-omni-flash/reference-to-video

Gemini Omni Flash is Google DeepMind's natively multimodal model for fast video generation, editing, and cinematic control. This variant accepts a text prompt plus 1–5 reference images and generates a video that carries the referenced subject, scene, or style into the newly described scene.

Overview

Gemini Omni Flash (gemini-omni-flash-preview) was introduced by Google alongside Nano Banana 2 Lite as a new generation of multimodal media models. Unlike traditional pipelines that stitch modalities together, Omni Flash is a single transformer that processes text, images, audio, and video simultaneously, producing output that is more cohesive, consistent, and controllable.

What sets it apart from earlier video models (such as the Veo family) is that it natively generates audio with every video — dialogue, ambience, music, and sound design are produced together with the picture rather than added afterward. The model is grounded in Gemini's real-world knowledge, so it reasons about physics, narrative logic, culture, and visual composition to produce results that feel intentional and cinematic. Generated media carries an invisible SynthID watermark.

AtlasCloud exposes Gemini Omni Flash through four endpoints — text-to-video, image-to-video, reference-to-video, and video-edit. All four route to the same gemini-omni-flash-preview model and differ only by the input modality they accept, corresponding to the model's task parameter (text_to_video, image_to_video, reference_to_video, edit). This endpoint maps to reference_to_video.
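
The endpoint-to-task correspondence described above can be written down as a small lookup. This is only a sketch: the endpoint names and task values come from the paragraph above, while the helper `task_for_model_id` is hypothetical and assumes that every model ID ends with its endpoint name, as the ID for this endpoint does.

```python
# Endpoint names and task values as listed in the paragraph above.
ENDPOINT_TO_TASK = {
    "text-to-video": "text_to_video",
    "image-to-video": "image_to_video",
    "reference-to-video": "reference_to_video",
    "video-edit": "edit",
}


def task_for_model_id(model_id: str) -> str:
    """Return the task value for a model ID, based on its last path segment.

    Assumption: model IDs end with the endpoint name. Only the
    reference-to-video ID is confirmed by this page.
    """
    endpoint = model_id.rsplit("/", 1)[-1]
    if endpoint not in ENDPOINT_TO_TASK:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    return ENDPOINT_TO_TASK[endpoint]


print(task_for_model_id("google/gemini-omni-flash/reference-to-video"))
```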

Inputs

This variant takes a text prompt and 1–5 reference images. The images are used as character, scene, or style references, and the prompt describes the new scene to build around them. Because Omni Flash maintains subject, object, and style consistency, this endpoint is the one to use when a recurring character or a consistent visual identity must persist across generations.

  • Prompt — Natural-language description of the target scene, action, camera language, mood, and audio (up to 20,000 characters).
  • Images — 1 to 5 reference images. PNG, JPEG, JPG, or WebP, each up to 20 MB. Supplied as public URLs or base64-encoded images.
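
The input limits above can be checked client-side before a request is sent. The following is a minimal sketch, not part of any official SDK: the helper names are hypothetical, "20 MB" is assumed to mean 20 × 1024 × 1024 bytes, and the page does not say whether base64 images must be sent raw or wrapped in a data URI, so the raw string is returned.

```python
import base64
from pathlib import PurePath

MAX_PROMPT_CHARS = 20_000
MAX_IMAGES = 5
# Assumption: "20 MB" is read as 20 * 1024 * 1024 bytes; the page does not define it.
MAX_IMAGE_BYTES = 20 * 1024 * 1024
ALLOWED_SUFFIXES = {".png", ".jpeg", ".jpg", ".webp"}


def check_prompt(prompt: str) -> str:
    """Reject empty prompts and prompts over the documented length limit."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt has {len(prompt)} characters, limit is {MAX_PROMPT_CHARS}")
    return prompt


def check_image_count(images: list[str]) -> list[str]:
    """Enforce the 1 to 5 reference image limit."""
    if not 1 <= len(images) <= MAX_IMAGES:
        raise ValueError(f"expected 1 to {MAX_IMAGES} images, got {len(images)}")
    return images


def encode_reference(filename: str, data: bytes) -> str:
    """Check one image's extension and size, then return it base64-encoded."""
    suffix = PurePath(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported image type: {filename!r}")
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError(f"{filename} is {len(data)} bytes, limit is {MAX_IMAGE_BYTES}")
    return base64.b64encode(data).decode("ascii")
```

Public URLs can be passed through `check_image_count` unchanged; `encode_reference` is only needed for local files whose bytes have already been read.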

Key Capabilities

  • Subject & style consistency — Carry a referenced character, object, or look across scenes and generations.
  • Multi-reference conditioning — Blend up to 5 reference images to guide subject, scene, and style at once.
  • Rich prompt understanding — Direct camera movement, action, mood, style, and audio in a single prompt of up to 20,000 characters.
  • Native audio generation — Every clip is rendered with a synchronized soundtrack (speech, music, effects) driven by your description.
  • World-grounded realism — Physics, motion, and scene dynamics informed by Gemini's real-world knowledge.
  • Adjustable reasoning — The thinking_level control trades latency for quality on complex prompts.
  • Reproducible results — Set a fixed seed to reproduce or iterate on a specific generation.

Input Parameters

Parameter | Type | Required | Default | Description
--- | --- | --- | --- | ---
model | string | Yes | google/gemini-omni-flash/reference-to-video | Model identifier.
prompt | string | Yes | - | Text description of the target scene. Max 20,000 characters.
images | array of string (uri) | Yes | - | 1–5 reference images for character, scene, or style. PNG/JPEG/JPG/WebP, ≤20 MB each. URL or base64.
duration | integer | No | 10 | Video length in seconds. Range: 3–10.
aspect_ratio | string | No | 16:9 | Output aspect ratio. Enum: 16:9, 9:16.
thinking_level | string | No | default | Internal reasoning effort. Enum: default, high, low.
resolution | string | No | 720p | Output resolution. Enum: 720p.
seed | integer | No | -1 | Random seed for reproducibility. -1 uses a random seed.
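
As a rough illustration of how these parameters fit together, the sketch below assembles and validates a request body. The function `build_payload` is hypothetical, the duration range is read as 3 to 10 seconds, and no HTTP call is made because this page does not state the request URL or the authentication scheme.

```python
import json

MODEL_ID = "google/gemini-omni-flash/reference-to-video"
ASPECT_RATIOS = {"16:9", "9:16"}
THINKING_LEVELS = {"default", "high", "low"}
RESOLUTIONS = {"720p"}


def build_payload(prompt, images, duration=10, aspect_ratio="16:9",
                  thinking_level="default", resolution="720p", seed=-1):
    """Validate arguments against the parameter table and return the request body."""
    if not prompt or len(prompt) > 20_000:
        raise ValueError("prompt must be 1 to 20,000 characters")
    if not 1 <= len(images) <= 5:
        raise ValueError("images must contain 1 to 5 entries")
    # Assumption: the duration range is 3 to 10 seconds inclusive.
    if not 3 <= duration <= 10:
        raise ValueError("duration must be between 3 and 10 seconds")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(ASPECT_RATIOS)}")
    if thinking_level not in THINKING_LEVELS:
        raise ValueError(f"thinking_level must be one of {sorted(THINKING_LEVELS)}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")
    return {
        "model": MODEL_ID,
        "prompt": prompt,
        "images": list(images),
        "duration": duration,
        "aspect_ratio": aspect_ratio,
        "thinking_level": thinking_level,
        "resolution": resolution,
        "seed": seed,
    }


# Example values only; the image URL is a placeholder.
payload = build_payload(
    prompt="The same fox mascot walks through a rainy neon street, slow dolly-in, soft synth score.",
    images=["https://example.com/fox-reference.png"],
    duration=5,
    seed=42,  # a fixed seed, so the generation can be reproduced or iterated on
)
print(json.dumps(payload, indent=2))
```

Leaving `seed` at -1 requests a random seed; setting it explicitly, as here, is what makes a run repeatable.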

Use Cases

  • Consistent characters — Keep the same protagonist, mascot, or presenter across a series of clips.
  • Brand identity — Reproduce a product, logo, or visual style consistently across marketing videos.
  • Style transfer — Apply the look and feel of reference art to a newly described scene.
  • Episodic content — Maintain visual continuity across multiple generations in a storyline.
  • Personalized media — Generate videos featuring specific subjects supplied as references.

Pricing

Billing is based on the duration of the generated video, charged at a flat per-second rate.

SKU | Rate
--- | ---
Per second of output | $0.135

Formula: max(3, duration) × $0.135

  • Billing is per second, with a 3-second minimum — durations below 3s are billed as 3s.
  • Example: a 10-second video costs 10 × $0.135 = $1.35.
  • Example: a 3-second video costs 3 × $0.135 = $0.405.
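
The billing formula can be sketched in a few lines. The function name `estimate_cost` is hypothetical; the rate and the 3-second minimum are the values given above, and `Decimal` is used so the arithmetic stays exact in dollars.

```python
from decimal import Decimal

RATE_PER_SECOND = Decimal("0.135")  # USD per second of output, from the pricing table
MIN_BILLED_SECONDS = 3              # shorter durations are billed as 3 seconds


def estimate_cost(duration_seconds: int) -> Decimal:
    """Apply max(3, duration) x rate and return the cost in USD."""
    billed_seconds = max(MIN_BILLED_SECONDS, duration_seconds)
    return billed_seconds * RATE_PER_SECOND


for seconds in (3, 5, 10):
    print(f"{seconds}s -> ${estimate_cost(seconds)}")
```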
