Gemini Omni Flash Image-to-Video API by Google

A natively multimodal Google DeepMind model that animates a still image into a cinematic, sound-enabled video guided by a text prompt while preserving the source subject and composition.

Gemini Omni Flash — Image to Video

Model ID: google/gemini-omni-flash/image-to-video

Gemini Omni Flash is Google DeepMind's natively multimodal model for fast video generation, editing, and cinematic control. This variant takes an image and a text prompt and animates the still into a coherent video with sound, guided by your instructions.

Overview

Gemini Omni Flash (gemini-omni-flash-preview) was introduced by Google alongside Nano Banana 2 Lite as a new generation of multimodal media models. Unlike traditional pipelines that stitch modalities together, Omni Flash is a single transformer that processes text, images, audio, and video simultaneously, producing output that is more cohesive, consistent, and controllable.

What sets it apart from earlier video models (such as the Veo family) is that it natively generates audio with every video — dialogue, ambience, music, and sound design are produced together with the picture rather than added afterward. The model is grounded in Gemini's real-world knowledge, so it reasons about physics, narrative logic, culture, and visual composition to produce results that feel intentional and cinematic. Generated media carries an invisible SynthID watermark.

AtlasCloud exposes Gemini Omni Flash through four endpoints — text-to-video, image-to-video, reference-to-video, and video-edit. All four route to the same gemini-omni-flash-preview model and differ only by the input modality they accept, corresponding to the model's task parameter (text_to_video, image_to_video, reference_to_video, edit). This endpoint maps to image_to_video.
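
The endpoint-to-task correspondence described above can be written down as a small lookup table. This is only an illustrative sketch: `ENDPOINT_TASKS` and `task_for` are hypothetical names, and the keys are the endpoint names from the text. Only the image-to-video model ID is confirmed on this page, so the helper simply looks at the last path segment.

```python
# Endpoint name (as listed above) -> value of the model's task parameter.
# ENDPOINT_TASKS and task_for are illustrative names, not part of any SDK.
ENDPOINT_TASKS = {
    "text-to-video": "text_to_video",
    "image-to-video": "image_to_video",
    "reference-to-video": "reference_to_video",
    "video-edit": "edit",
}


def task_for(model_id: str) -> str:
    """Return the task for an ID such as 'google/gemini-omni-flash/image-to-video'."""
    # Assumption: the endpoint name is the last path segment of the model ID.
    endpoint = model_id.rsplit("/", 1)[-1]
    try:
        return ENDPOINT_TASKS[endpoint]
    except KeyError:
        raise ValueError(f"unknown Gemini Omni Flash endpoint: {endpoint!r}") from None
```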

Inputs

This variant takes an image and a text prompt. The image is used as the starting frame or motion guide, while the prompt describes how the scene should move, evolve, and sound. This is well suited to bringing a specific photo, illustration, or design to life while preserving its subject and composition.

  • Image — PNG, JPEG, JPG, or WebP, up to 20 MB. Supplied as a public URL or a base64-encoded image.
  • Prompt — Natural-language description of motion, camera language, mood, and audio (up to 20,000 characters).
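
A client could check these limits locally before submitting anything. The sketch below is a minimal example under stated assumptions: the helper names are hypothetical, "20 MB" is interpreted as 20 × 1024 × 1024 bytes, the file type is judged by extension only, and the base64 string is returned without a data-URI prefix because the page does not say which form the API expects.

```python
import base64
from pathlib import Path

# Limits taken from the Inputs list; the binary-megabyte reading is an assumption.
MAX_IMAGE_BYTES = 20 * 1024 * 1024
ALLOWED_SUFFIXES = {".png", ".jpeg", ".jpg", ".webp"}
MAX_PROMPT_CHARS = 20_000


def encode_image(path: str) -> str:
    """Validate a local image against the stated limits and return it base64-encoded."""
    file = Path(path)
    if file.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported image type: {file.suffix or '(none)'}")
    data = file.read_bytes()
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError(f"image is {len(data)} bytes; limit is {MAX_IMAGE_BYTES}")
    return base64.b64encode(data).decode("ascii")


def check_prompt(prompt: str) -> str:
    """Return the prompt unchanged if it is non-empty and within the length limit."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt has {len(prompt)} characters; limit is {MAX_PROMPT_CHARS}")
    return prompt
```

If the image is already hosted at a public URL, the URL can be passed directly and `encode_image` is not needed.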

Key Capabilities

  • Image-grounded animation — Preserves the subject, style, and composition of the source image while adding motion.
  • Rich prompt understanding — Direct camera movement, action, mood, style, and audio in a single prompt of up to 20,000 characters.
  • Native audio generation — Every clip is rendered with a synchronized soundtrack (speech, music, effects) driven by your description.
  • World-grounded realism — Physics, motion, and scene dynamics informed by Gemini's real-world knowledge.
  • Adjustable reasoning — The thinking_level control trades latency for quality on complex prompts.
  • Reproducible results — Set a fixed seed to reproduce or iterate on a specific generation.

Input Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model | string | Yes | google/gemini-omni-flash/image-to-video | Model identifier. |
| prompt | string | Yes | | Text description of the motion and scene. Max 20,000 characters. |
| image | string (uri) | Yes | | Image to animate, used as the starting frame or motion guide. PNG/JPEG/JPG/WebP, ≤20 MB. URL or base64. |
| duration | integer | No | 10 | Video length in seconds. Range: 3 to 10. |
| aspect_ratio | string | No | 16:9 | Output aspect ratio. Enum: 16:9, 9:16. |
| thinking_level | string | No | default | Internal reasoning effort. Enum: default, high, low. |
| resolution | string | No | 720p | Output resolution. Enum: 720p. |
| seed | integer | No | -1 | Random seed for reproducibility. -1 uses a random seed. |
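
To make the parameter list concrete, here is a hedged sketch that assembles and validates a request body from the fields above. `build_payload` is a hypothetical helper, not an AtlasCloud SDK function. The duration range of 3 to 10 seconds is how this sketch reads the parameter description, consistent with the default of 10 and the 3-second billing minimum. The request URL and authentication scheme are not given on this page, so sending the payload is deliberately left out.

```python
import json

MODEL_ID = "google/gemini-omni-flash/image-to-video"
ASPECT_RATIOS = {"16:9", "9:16"}
THINKING_LEVELS = {"default", "high", "low"}
RESOLUTIONS = {"720p"}


def build_payload(prompt, image, duration=10, aspect_ratio="16:9",
                  thinking_level="default", resolution="720p", seed=-1):
    """Return a request body as a dict, raising ValueError on out-of-range values."""
    if not prompt or len(prompt) > 20_000:
        raise ValueError("prompt is required and limited to 20,000 characters")
    if not image:
        raise ValueError("image is required (public URL or base64 string)")
    # Assumed range: 3 to 10 seconds inclusive.
    if not isinstance(duration, int) or not 3 <= duration <= 10:
        raise ValueError("duration must be an integer from 3 to 10")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(ASPECT_RATIOS)}")
    if thinking_level not in THINKING_LEVELS:
        raise ValueError(f"thinking_level must be one of {sorted(THINKING_LEVELS)}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")
    return {
        "model": MODEL_ID,
        "prompt": prompt,
        "image": image,
        "duration": duration,
        "aspect_ratio": aspect_ratio,
        "thinking_level": thinking_level,
        "resolution": resolution,
        "seed": seed,  # -1 asks for a random seed; fix it to reproduce a result
    }


if __name__ == "__main__":
    payload = build_payload(
        prompt="Slow dolly-in on the lighthouse as waves crash and gulls call.",
        image="https://example.com/lighthouse.png",  # placeholder URL
        duration=5,
        aspect_ratio="9:16",
        seed=42,
    )
    print(json.dumps(payload, indent=2))
```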

Use Cases

  • Photo-to-motion — Animate product shots, portraits, or artwork into short living scenes.
  • Concept-to-video — Turn a single key frame or design mockup into a moving preview.
  • Social content — Produce eye-catching short-form clips from a single hero image.
  • Marketing assets — Bring campaign visuals to life with motion and sound.
  • Prototyping — Test how a static composition reads once it moves, before full production.

Pricing

Billing is based on the duration of the generated video, charged at a flat per-second rate.

| SKU | Rate |
| --- | --- |
| Per second of output | $0.13 |

Formula: max(3, duration) × $0.13

  • Billing is per second, with a 3-second minimum — durations below 3s are billed as 3s.
  • Example: a 10-second video costs 10 × $0.13 = $1.30.
  • Example: a 3-second video costs 3 × $0.13 = $0.39.
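
The billing formula can be expressed as a short function for estimating costs up front. This is a sketch of the stated formula only; `video_cost` is a hypothetical name, and `Decimal` is used so that cent values are not distorted by binary floating point.

```python
from decimal import Decimal

RATE_PER_SECOND = Decimal("0.13")  # USD per second of output, from the table above
MIN_BILLED_SECONDS = 3


def video_cost(duration_seconds: int) -> Decimal:
    """Return the charge in USD: max(3, duration) x $0.13."""
    billed = max(MIN_BILLED_SECONDS, duration_seconds)
    return billed * RATE_PER_SECOND


if __name__ == "__main__":
    print(video_cost(10))  # 1.30
    print(video_cost(3))   # 0.39
```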
