Gemini Omni Flash Text-to-Video API by Google

A natively multimodal Google DeepMind model that generates cinematic videos with synchronized native audio from a text prompt alone, grounded in real-world physics for controllable, high-speed video generation.

Gemini Omni Flash — Text to Video

Model ID: google/gemini-omni-flash/text-to-video

Gemini Omni Flash is Google DeepMind's high-performance, natively multimodal model built for high-speed video generation, editing, and cinematic control. This variant accepts a text prompt only, making it ideal for pure creative generation where you describe the entire scene through language.

Overview

Gemini Omni Flash (gemini-omni-flash-preview) was introduced by Google alongside Nano Banana 2 Lite as a new generation of multimodal media models. Unlike traditional pipelines that stitch modalities together, Omni Flash is a single transformer that processes text, images, audio, and video simultaneously, producing output that is more cohesive, consistent, and controllable.

What sets it apart from earlier video models (such as the Veo family) is that it natively generates audio with every video — dialogue, ambience, music, and sound design are produced together with the picture rather than added afterward. The model is grounded in Gemini's real-world knowledge, so it reasons about physics, narrative logic, culture, and visual composition to produce results that feel intentional and cinematic. Generated media carries an invisible SynthID watermark.

AtlasCloud exposes Gemini Omni Flash through four endpoints — text-to-video, image-to-video, reference-to-video, and video-edit. All four route to the same gemini-omni-flash-preview model and differ only by the input modality they accept, corresponding to the model's task parameter (text_to_video, image_to_video, reference_to_video, edit). This endpoint maps to text_to_video.
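
The endpoint-to-task correspondence described above can be written as a small lookup. This is an illustrative sketch only: `ENDPOINT_TO_TASK` and `task_for_endpoint` are hypothetical names, not part of any AtlasCloud or Google SDK. The endpoint names and task values are the ones listed in the paragraph above.

```python
# Endpoint name -> value of the model's task parameter, as listed above.
ENDPOINT_TO_TASK = {
    "text-to-video": "text_to_video",
    "image-to-video": "image_to_video",
    "reference-to-video": "reference_to_video",
    "video-edit": "edit",
}


def task_for_endpoint(endpoint: str) -> str:
    """Return the task value for an endpoint name (hypothetical helper)."""
    try:
        return ENDPOINT_TO_TASK[endpoint]
    except KeyError:
        # Fail loudly on anything outside the four documented endpoints.
        raise ValueError(f"unknown endpoint: {endpoint!r}") from None
```

Note that `video-edit` is the one endpoint whose task value (`edit`) is not simply the endpoint name with hyphens swapped for underscores, which is why an explicit table is safer than a string transform.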

Inputs

This variant takes a text prompt as its only content input. You describe the subjects, actions, camera language, lighting, mood, style, and any dialogue or sound design entirely in natural language, and the model synthesizes a video (with audio) from scratch.

Key Capabilities

  • Rich prompt understanding — Describe subjects, actions, camera movements, lighting, mood, style, and audio in a single prompt of up to 20,000 characters.
  • Native audio generation — Every clip is rendered with a synchronized soundtrack (speech, music, effects) driven by your description.
  • World-grounded realism — Physics, motion, and scene dynamics informed by Gemini's real-world knowledge.
  • Cinematic control — Camera framing, pacing, and single-scene composition guided directly from the prompt.
  • Adjustable reasoning — The thinking_level control trades latency for quality on complex prompts.
  • Reproducible results — Set a fixed seed to reproduce or iterate on a specific generation.

Input Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model | string | Yes | google/gemini-omni-flash/text-to-video | Model identifier |
| prompt | string | Yes | | Text description of the video. Max 20,000 characters. |
| duration | integer | No | 10 | Video length in seconds. Range: 3 to 10. |
| aspect_ratio | string | No | 16:9 | Output aspect ratio. Enum: 16:9, 9:16. |
| thinking_level | string | No | default | Internal reasoning effort. Enum: default, high, low. |
| resolution | string | No | 720p | Output resolution. Enum: 720p. |
| seed | integer | No | -1 | Random seed for reproducibility. -1 uses a random seed. |
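
As a rough illustration of the parameters above, the following sketch validates arguments client-side and assembles a JSON-ready request body. It makes no network call, because the request URL and authentication scheme are not given here. `build_payload` is a hypothetical helper, and the 3 to 10 second duration range is an assumption inferred from the default of 10 and the 3-second billing minimum.

```python
import json

MODEL_ID = "google/gemini-omni-flash/text-to-video"
MAX_PROMPT_CHARS = 20_000
MIN_DURATION, MAX_DURATION = 3, 10  # assumed range, in seconds
ASPECT_RATIOS = {"16:9", "9:16"}
THINKING_LEVELS = {"default", "high", "low"}
RESOLUTIONS = {"720p"}


def build_payload(prompt: str, duration: int = 10, aspect_ratio: str = "16:9",
                  thinking_level: str = "default", resolution: str = "720p",
                  seed: int = -1) -> dict:
    """Check arguments against the parameter table and return a request body."""
    if not isinstance(prompt, str) or not 0 < len(prompt) <= MAX_PROMPT_CHARS:
        raise ValueError("prompt must be a string of 1 to 20,000 characters")
    if not isinstance(duration, int) or not MIN_DURATION <= duration <= MAX_DURATION:
        raise ValueError("duration must be an integer from 3 to 10")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(ASPECT_RATIOS)}")
    if thinking_level not in THINKING_LEVELS:
        raise ValueError(f"thinking_level must be one of {sorted(THINKING_LEVELS)}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")
    return {
        "model": MODEL_ID,
        "prompt": prompt,
        "duration": duration,
        "aspect_ratio": aspect_ratio,
        "thinking_level": thinking_level,
        "resolution": resolution,
        "seed": seed,  # -1 asks for a random seed; a fixed value aids reproduction
    }


if __name__ == "__main__":
    body = build_payload(
        "A slow dolly shot through a rainy neon street, soft synth music",
        duration=5,
        aspect_ratio="9:16",
        seed=42,
    )
    print(json.dumps(body, indent=2))
```

Validating locally is a design choice rather than a requirement: it catches out-of-range values before a billable request is sent, but the service remains the authority on what it accepts.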

Use Cases

  • Creative storytelling — Generate cinematic scenes with sound from narrative descriptions.
  • Concept visualization — Quickly visualize ideas, moods, or environments.
  • Storyboard prototyping — Turn scene descriptions into video drafts before full production.
  • Marketing assets — Produce short-form video content directly from copy briefs.
  • Educational content — Illustrate concepts, processes, or historical scenes through natural language.

Pricing

Billing is based on the duration of the generated video, charged at a flat per-second rate.

| SKU | Rate |
|---|---|
| Per second of output | $0.125 |

Formula: max(3, duration) × $0.125

  • Billing is per second, with a 3-second minimum — durations below 3s are billed as 3s.
  • Example: a 10-second video costs 10 × $0.125 = $1.25.
  • Example: a 3-second video costs 3 × $0.125 = $0.375.
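
The billing formula above can be expressed as a short function. This is a sketch for estimating cost before submitting a job, not an official calculator; `estimate_cost` is a hypothetical name, and `Decimal` is used so that fractional dollar amounts are not subject to binary floating-point rounding.

```python
from decimal import Decimal

RATE_PER_SECOND = Decimal("0.125")  # USD per second of output
MIN_BILLED_SECONDS = 3              # shorter durations are billed as 3 s


def estimate_cost(duration_seconds: int) -> Decimal:
    """Return the estimated charge in USD: max(3, duration) x 0.125."""
    billed_seconds = max(MIN_BILLED_SECONDS, duration_seconds)
    return billed_seconds * RATE_PER_SECOND
```

With these constants, `estimate_cost(10)` equals `Decimal("1.25")` and `estimate_cost(3)` equals `Decimal("0.375")`, matching the two examples above.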
