Hands-On with Gemini Omni: Impressive But Not Quite There Yet

After weeks of leaks and speculation, Gemini Omni finally made its official debut at Google I/O 2026 in the early hours of this morning.

It's neither the dedicated video generation model that rumors suggested, nor a "Veo 4" following in Veo 3's naming lineage. Google DeepMind CEO Demis Hassabis took the stage himself to make the announcement:

“We are taking the next important step — Gemini Omni, a brand new model that can create anything from any input.”

Gemini Omni I/O 2026 video editing demo

In other words, Gemini Omni is a truly "omni-capable" large model — accepting any form of input and generating any type of content, with video generation being just one piece of the puzzle.

Gemini Omni is now live across all Google products. Users on AI Plus, Pro, and Ultra subscription plans can access it through the Gemini app or Google's AI video creation platform, Flow.

We subscribed to Google's highest-tier Ultra plan right away to put Gemini Omni through its paces ourselves.

Bottom line up front: it's underwhelming.

Testing Gemini Omni's Consistency: Mostly Holds Up

One of Omni's most heavily promoted features is its ability to maintain visual consistency across multiple rounds of natural language edits.

In Google's official demo, the source footage shows a violinist performing indoors. After changing the background environment, switching camera angles, and even removing the violin entirely, the performer's expressions, movements, lighting, and even the subtle positioning of their hands all remained perfectly adapted to each new setting — along with the music.

Both the precision of the edits and the consistency of the main subject looked genuinely impressive.

So we put it to the test ourselves, starting with an environment and atmosphere swap.

Our first prompt: a bird's-eye view of two cars colliding at an intersection, one of them a blue sports car, with a tense and thrilling atmosphere.

We then followed up with a more detailed edit and refinement. The prompt: switch to a golden sunset, change the blue car to red, and have the two cars burst into confetti and balloons on impact — light, dreamy, and whimsical in tone.

The color of the cars and the lighting did change as instructed, and the overall structure and motion of the scene remained mostly coherent, with no tearing or visual distortion.

However, there was one subtle but telling issue: Omni didn't handle the actual collision moment particularly well. In both videos, the two cars seemed to be driving toward each other almost deliberately — even slowing down slightly and adjusting their angles right before impact.

It felt, in a word, staged. Like you could see Omni's invisible hand nudging the cars into position to fulfill the prompt.

Next, we tested whether Omni could maintain consistency through dynamic movement. The benchmark: a single character switching between multiple camera angles, with facial features, clothing, props, and even hairstyle all staying stable — no bugs like "same outfit, different color from a different angle."

Our prompt: a medium tracking shot of a female dancer in a red dress performing contemporary dance at an old train station, cutting to a wide fixed shot after a jump, with the red dress and train station background staying completely consistent throughout.

This one came out reasonably well. The dancer's movements were fluid and continuous, the physics of the red silk dress looked convincingly real, and the cut from the medium tracking shot to the wide fixed shot was relatively smooth.

Omni also automatically added a background music track — nothing particularly expressive or atmospheric, but it fit the general mood of the dance well enough.

We then made a small refinement, prompting: remove the background music and keep only ambient sound — footsteps in sync with the dance movements and the soft rustling of the dress.

This is where things got a bit messy. The first half of the video did pick up the faint sounds of the dress swaying and feet landing on the floor. But in the second half, the background music inexplicably crept back in.

Next, we tested its ability to understand complex character relationships and spatial positioning.

The benchmark: when multiple characters of different appearances and outfits interact with each other, their individual features should not get mixed up or swapped during camera angle changes.

Our prompt: an over-the-shoulder shot of four to five scientists, each with a distinctly different look, discussing a holographic projection in a laboratory, with the camera slowly rotating — all characters' appearances and outfits to remain unchanged throughout.

Perhaps in an effort to faithfully match the prompt's requirement for scientists who all look different, Omni thoughtfully cast four characters covering a range of ages, genders, and ethnicities. Throughout the rotating shot, the characters' appearances, outfits, voices, and relative positions remained largely consistent.

The one unfortunate flaw: toward the second half of the video, there was a noticeably jarring and abrupt cut that broke the flow entirely.

Fine-Grained Control? Needs More Work

Editing and refinement was another feature Google put front and center in its official showcase.

So we got straight to it. We took an AI-generated video of a spectator at a baseball game that had recently gone viral on Korean social media, fed Omni an anime-style character image (sourced from Google's own demo materials), and asked it to replace the person in the original video with the character from the image.

The result? Disappointing, to put it charitably.

The replacement character did maintain roughly the same position as the original, but the subtle expressions — the lip bite, the shifty glance, the small smile when noticing the camera — were almost entirely lost in translation.

Gemini demo: original clip (real person)

Gemini Omni demo: output with the anime-style character

This struggle with fine detail wasn't an isolated case.

We prompted Omni to generate a video of a middle-aged man standing in a dimly lit room, speaking quietly to his reflection in a mirror: "I know it was you. Stop pretending."

The initial result was actually decent — aside from a slightly off Chinese accent, the lip sync matched each word fairly accurately. Whether it conveyed genuinely human emotion is a matter of personal interpretation.

But when we tried to change the man's dialogue, Omni's circuits seemed to short out completely.

The prompt: a middle-aged man in a dimly lit room, quietly saying to his mirror: "May 20th is here again — happy anniversary."

First, it couldn't grasp the concept of "changing the dialogue" at all, and simply slapped the new line on as a subtitle at the bottom of the screen. Then it split the difference, delivering half the original line and half the new one. By the final attempt, it had gone completely off the rails.

The lighting did get a bit brighter, and the expression shifted to a smile — but now we had a man grinning warmly while saying "I know it was you. Stop pretending," with the same eerie background music as before. Somehow, it was creepier than the original.

In short, when it comes to fine-grained control, Omni still has a long way to go.

One Unified API for Production Video Generation

While Google rolls out Gemini Omni Flash inside the Gemini app and Google Flow for end users, developers and product teams who want to embed the same multimodal video engine into their own workflows need a stable, predictable API layer.

Atlas Cloud serves Gemini Omni Flash through a unified, OpenAI-compatible API, alongside 300+ other image, video, and LLM models — so you can integrate Google's native multimodal model without juggling separate vendor accounts, billing portals, or SDKs.

Both Gemini Omni Flash variants are live on Atlas Cloud:

| Variant | Best For | Inputs | Resolution | Duration | Starting Price |
| --- | --- | --- | --- | --- | --- |
| Gemini Omni Flash Text-to-Video (Developer) | Pure prompt-driven cinematic generation | Text (up to 20,000 chars) | 720p / 1080p / 4K | 4, 6, 8, 10 s | $0.2 + $0.1/sec |
| Gemini Omni Flash Image-to-Video (Developer) | Subject-consistent video from real references | Text + up to 7 reference images | 720p / 1080p / 4K | 4, 6, 8, 10 s | $0.2 + $0.1/sec |
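To make the limits and pricing in the table concrete, here is a small Python sketch that checks a request against them and estimates its cost. It rests on two assumptions the table does not confirm: that "$0.2 + $0.1/sec" means a flat $0.20 per clip plus $0.10 per second of output, and that this starting price does not vary with resolution. The function names are illustrative and not part of any Atlas Cloud SDK.

```python
from decimal import Decimal

# Limits copied from the variant table.
ALLOWED_RESOLUTIONS = {"720p", "1080p", "4K"}
ALLOWED_DURATIONS_S = {4, 6, 8, 10}
MAX_PROMPT_CHARS = 20_000
MAX_REFERENCE_IMAGES = 7  # applies to the image-to-video variant

# ASSUMPTION: "$0.2 + $0.1/sec" is a flat per-clip fee plus a per-second rate.
BASE_FEE_USD = Decimal("0.20")
PER_SECOND_USD = Decimal("0.10")


def validate_request(prompt, resolution, duration_s, reference_images=0):
    """Raise ValueError if the request falls outside the table's limits."""
    if not prompt or len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt must be 1 to {MAX_PROMPT_CHARS} characters")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(ALLOWED_RESOLUTIONS)}")
    if duration_s not in ALLOWED_DURATIONS_S:
        raise ValueError(f"duration must be one of {sorted(ALLOWED_DURATIONS_S)} seconds")
    if not 0 <= reference_images <= MAX_REFERENCE_IMAGES:
        raise ValueError(f"at most {MAX_REFERENCE_IMAGES} reference images")


def estimate_cost_usd(duration_s):
    """Estimate the starting price of one clip under the assumed pricing model."""
    if duration_s not in ALLOWED_DURATIONS_S:
        raise ValueError(f"duration must be one of {sorted(ALLOWED_DURATIONS_S)} seconds")
    return BASE_FEE_USD + PER_SECOND_USD * duration_s


if __name__ == "__main__":
    validate_request("A misty forest at golden hour", "1080p", 8)
    for seconds in sorted(ALLOWED_DURATIONS_S):
        print(f"{seconds:>2} s clip: ${estimate_cost_usd(seconds)}")
```

Under these assumptions, an 8-second clip comes to $1.00 and a 10-second clip to $1.20. `Decimal` is used instead of floats so the dollar amounts stay exact.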

Quick Start: Generate a Gemini Omni Flash video with a single request

```bash
curl -X POST https://api.atlascloud.ai/api/v1/model/generateVideo \
  -H "Authorization: Bearer $ATLASCLOUD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-omni-flash/text-to-video-developer",
    "input": {
      "prompt": "A misty forest at golden hour, cinematic dolly shot",
      "resolution": "1080p",
      "duration": 8,
      "aspect_ratio": "16:9"
    }
  }'
```

The API returns a prediction ID immediately — poll /api/v1/model/prediction/{id} for the rendered MP4 URL. Full schema, code samples in 7 languages, and a no-code Playground are available on the model pages linked above.
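The submit-then-poll flow can be sketched in Python with only the standard library. The two endpoint paths come from the quick start. Everything about the response body is an assumption, because the schema is not shown in this article: the top-level `status` field, its `completed` and `failed` values, and the 5-second interval are placeholders to check against the actual model page.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.atlascloud.ai/api/v1/model"


def call_api(url, api_key, payload=None):
    """POST `payload` as JSON if given, otherwise GET; return the parsed JSON body."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="GET" if payload is None else "POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_prediction(prediction_id, fetch, interval_s=5.0, max_polls=60,
                        sleep=time.sleep):
    """Poll the prediction endpoint until it reports a terminal status.

    `fetch` takes a URL and returns the decoded JSON body.
    ASSUMPTION: the body carries a top-level "status" field whose terminal
    values are "completed" and "failed"; the real field names may differ.
    """
    url = f"{BASE_URL}/prediction/{prediction_id}"
    for _ in range(max_polls):
        body = fetch(url)
        status = body.get("status")
        if status == "completed":
            return body  # the rendered MP4 URL should be somewhere in this body
        if status == "failed":
            raise RuntimeError(f"prediction {prediction_id} failed: {body}")
        sleep(interval_s)
    raise TimeoutError(f"prediction {prediction_id} not finished after {max_polls} polls")
```

In real use, submit the job with `call_api(f"{BASE_URL}/generateVideo", api_key, payload=...)`, read the prediction ID out of that response, and pass `fetch=lambda url: call_api(url, api_key)` to `wait_for_prediction`. Injecting `fetch` and `sleep` keeps the polling loop testable without network access.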

World Knowledge: Strong on Physics and History, But Still Buggy

Last up was world knowledge and reasoning.

Google's official claim is that Omni, built on top of the Gemini flagship model, has significantly improved its understanding of physical laws such as gravity, kinetic energy, and fluid dynamics, as well as world history, science, and mathematics.

We cut straight to the test with this prompt: generate a marble rolling rapidly along a chain-reaction track.

The result was genuinely impressive. Omni designed a fairly complex chain-reaction course on its own, incorporating gravity, elasticity, centrifugal force, and more — all of which looked convincingly realistic.

That said, a bug crept in toward the second half: out of nowhere, one marble inexplicably split into two.

We tried another one: a ball rolling back and forth along the inner wall of a U-shaped track, eventually coming to rest at the lowest point.

This one felt a bit off.

The ball did roll back and forth along the U-shaped track and settle at the bottom as instructed — but the whole thing felt like it was taking place somewhere other than Earth. The ball moved with an oddly weightless, floaty quality, and at moments appeared to clip slightly through the track geometry.

Finally, we threw in one more prompt — short, punchy, and very specifically Chinese in its cultural reference: generate a video of Emperor Taizong of Tang and his older brother facing off at the Xuanwu Gate.

Well, the Chinese characters for "Xuanwu Gate" in the background were a bit garbled, and both Tang dynasty figures spoke Mandarin with a slightly foreign accent. But Omni did grasp the historical reference and delivered a suitably tense, sword-drawn confrontation between Li Shimin and his older brother, Li Jiancheng.

On world history at least, Omni seems to have done its homework.

Final Thoughts: Waiting on Seedance 2.1

The buzz around Omni had been building long before today's announcement.

It all started in early May, when a user spotted a small, easy-to-miss line of text on Gemini's video generation page: "Powered by Omni." That one tiny detail set off a wave of speculation across the tech community worldwide.

Everyone was asking the same question: what exactly is Omni? Is it Veo 4, the successor to Veo 3, which debuted at Google I/O 2025? Or is it an entirely new multimodal model? That's why early reports kept going back and forth between "Gemini Omni" and "Veo 4."

Then on May 11, a leaked internal test video of a "professor deriving equations on a blackboard" went viral on X, racking up over 2.4 million views in just a few days.

In just 10 seconds, the clip cut between multiple angles — the professor's back, a side profile, a close-up of chalk writing out equations — all accompanied by the soft scratching sound of chalk on a blackboard, with every formula on the board mathematically correct. Expectations shot through the roof.

The word at the time was that Omni had fully internalized cinematic language and editing instincts — multi-angle cuts, native background music included — and could "produce a finished video straight out of the box."

But now that Gemini Omni has finally arrived after all the anticipation, the reception has been decidedly mixed.

Looks like we'll just have to keep our eyes on Seedance 2.1 — whenever that decides to show up.
