We ran 6 scenarios, 12 videos, and one shared prompt set to find out.
On April 10, Alibaba's ATH team released Happy Horse 1.0. Within days it claimed the top spot on Artificial Analysis's video model leaderboard — T2V Elo 1389, I2V Elo 1416, leading Bytedance's Seedance 2.0 by roughly 115 points on the text-to-video side.
If you work in AI video content, product selection, or industry research, the immediate question is obvious: does this ranking hold up under real workloads?
We spent a week finding out. Same prompts, same reference assets, same evaluation framework — Happy Horse 1.0 and Seedance 2.0 run side by side across 6 scenario types, 12 videos total. This article covers three things: what actually got Happy Horse to the top, the evaluation methodology we used (a full white paper is coming), and what the 6 scenarios revealed that the leaderboard doesn't show.
By the end you'll have a clear picture of when to reach for HH, when to reach for SD, and why running this kind of comparison through Atlas Cloud's One API — one key, one SDK, one model string swap — is the most practical way to do model selection right now.
Why Happy Horse 1.0 Leads the Elo Leaderboard
A few facts worth knowing before the test results.
|  | Happy Horse 1.0 | Seedance 2.0 |
|---|---|---|
| Team | Alibaba ATH | Bytedance |
| Release | Unveiled 2026/04/10, live on Atlas Cloud 4/27 | Generally available |
| Architecture | 15B unified Transformer (joint audio-video generation, no cross-attention) | Mixture-of-experts architecture |
| Native Audio | ✅ | ✅ |
| Multilingual | Lip sync in 7 languages (Mandarin / Cantonese / English / Japanese / Korean / German / French) | Prompt input in 6 languages (Chinese / English + Japanese / Indonesian / Spanish / Portuguese) |
| Generation Speed | ~38s per clip at 1080p on a single H100 | — |
| Artificial Analysis Elo | T2V 1389 (ranked #1) / I2V 1416 (ranked #1) | T2V ~1274 |
Three things genuinely earned it the top ranking.
Unified Transformer architecture. Audio and video are generated in the same sequence, not stitched together in post. Lip sync, audio timing, and edit points are modeled simultaneously. This matters because the "generate video first, add audio later" pipeline approach tends to produce visible misalignment — HH avoids that at the architecture level.
Native 7-language lip sync. Mandarin, Cantonese, Japanese, Korean, German, French, English. This is the broadest multilingual lip sync coverage in any publicly available video model right now, and it has real value for global content production.
Visual ceiling. Looking at individual frames from our test runs, HH's skin texture, single-frame aesthetics, and cinematic color grading are genuinely ahead of SD. Artificial Analysis uses human blind evaluation, and human evaluators are highly sensitive to "which one looks more like a film." That's the main explanation for the Elo gap.
But Elo is a single aggregate score. It tells you who won more head-to-head comparisons — not where they won, or where they didn't. A total score hides the real structure underneath. That's the whole reason we built a proper evaluation framework.
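For readers less familiar with how a leaderboard number like 1389 arises: Artificial Analysis doesn't publish its exact rating procedure, but a minimal sketch of a standard Elo update from a single pairwise preference vote (the K-factor here is an assumption) looks roughly like this:

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0) -> tuple[float, float]:
    """One standard Elo update from a single pairwise preference vote."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))  # probability A is preferred
    score_a = 1.0 if a_wins else 0.0
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new
```

Every vote moves one scalar per model; nothing about the prompt category, modality, or failure mode survives the aggregation, which is exactly the gap the framework below is meant to close.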
AI Video Model Evaluation Framework
We've compiled a full AI Video Model Evaluation White Paper — here's the core methodology.
What existing benchmarks do (and don't do)
| System | Strengths | Limitations |
|---|---|---|
| VBench / VBench-2.0 (academic benchmark) | Granular dimensions (16 in VBench, 18 in VBench-2.0), covers physics and commonsense | Complex setup, requires GPU to run, not intuitive |
| Artificial Analysis Elo (blind ranking) | Reflects human subjective preference, cross-model comparable | Black box, can't pinpoint weaknesses, single aggregate score |
| FVD / CLIP Score (quantitative metrics) | Objective, scriptable | Limited correlation with human perception |
| Demo cherry-picking (industry norm) | High visual impact | Not reproducible, severe selection bias |
VBench's v2.0 paper, published in March 2026, noted something blunt: even the strongest current models score around 50% on physical plausibility. The gold standard is still evolving. A single leaderboard score is not a reliable basis for model selection.
Five evaluation dimensions
| Dimension | Evaluation Question | Key Sub-Items |
|---|---|---|
| Prompt-Video Alignment | Does the output accurately follow the instructions? | Subject / Action / Scene / Style / Quantity & Spatial Relations |
| Visual Quality | Is each individual frame excellent? | Resolution / Aesthetics / Rendering / Detail |
| Motion & Physics | Does the motion obey physical laws? | Naturalness / Physics / Dynamic Range / Camera Movement Accuracy |
| Temporal Consistency | Are frames and shots coherent over time? | Subject Identity / Scene / Flickering / Multi-shot Consistency |
| Multimodal Capabilities | What can the model do beyond visuals? | Audio / Audio-Visual Sync / Lip Sync / Multilingual / Style Control |
Dimension 5 — multimodal capabilities (audio/lip sync/multilingual/style control) — is where model differentiation is playing out in 2026. It's also the main card HH is holding.
Three-layer method
| Layer | Use Case | Tools |
|---|---|---|
| L1 Objective Metrics | Large-scale screening, CI/CD | FVD / CLIP-Score / LAION Aesthetic / DINO / Optical Flow / SyncNet / MLLM-as-Judge |
| L2 Standardized Task Set | Tutorial evaluation, product comparison, white paper publication | VBench prompt suite / Atlas Cloud Prompt Hub / custom dimension-targeted prompts |
| L3 Subjective Blind Review | Final decisions, public-facing release | Double-blind Elo + five-dimension scoring card |
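As one concrete example of the L1 layer, prompt-video alignment can be screened with an off-the-shelf CLIP model over sampled frames. A minimal sketch using Hugging Face transformers (frame extraction is assumed to happen upstream, e.g. with ffmpeg):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt: str, frame_paths: list[str]) -> float:
    """Mean cosine similarity between the prompt and sampled video frames."""
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Normalize and average frame-to-prompt similarity
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()
```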
Multiple papers from 2025–2026 confirm that MLLM-as-Judge (using Claude or GPT-4V as evaluators) correlates significantly higher with human scores than pure quantitative metrics alone. That's the backbone of our L1 layer.
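And a minimal sketch of the MLLM-as-Judge step, assuming an OpenAI-compatible chat endpoint; the rubric text, model name, and JSON contract are placeholders rather than our production setup:

```python
import base64
import json
from pathlib import Path

from openai import OpenAI  # any OpenAI-compatible client works here

client = OpenAI()  # assumes OPENAI_API_KEY (or a compatible gateway key) is configured

RUBRIC = (
    "Score the video frames from 1 to 5 on each dimension: alignment, visual_quality, "
    "motion_physics, temporal_consistency, multimodal. Return JSON only."
)

def judge_frames(frame_paths: list[str], prompt: str, model: str = "gpt-4o") -> dict:
    """MLLM-as-Judge over sampled frames; returns a five-dimension score dict."""
    images = [
        {
            "type": "image_url",
            "image_url": {
                "url": "data:image/jpeg;base64,"
                + base64.b64encode(Path(p).read_bytes()).decode()
            },
        }
        for p in frame_paths
    ]
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [{"type": "text", "text": f"{RUBRIC}\nOriginal prompt: {prompt}"}] + images,
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```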
Prompt selection tiers
The biggest source of controversy in comparative benchmarks isn't the metrics — it's the prompts. Our minimum bar and tier structure:
| Tier | Definition | When to Use |
|---|---|---|
| A (default) | Model-neutral, dimension-targeted prompt — one prompt run on both models | Primary evaluation standard |
| B (avoid) | Same theme, but each model uses its own Hub prompt | Not used for scoring — showcase reels only |
Why a single score misleads
Video models in 2026 aren't just "text-to-video." A model might support T2V, I2V, Reference-to-Video, Video Editing, native audio, and multilingual lip sync simultaneously — and perform very differently across those modes. Elo collapses that into one number. Our framework tags every evaluation with its modality and outputs a capability matrix, not a ranking.
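A minimal sketch of what that tagging looks like in practice; the record fields and scores below are hypothetical placeholders, not measured results:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-run records: (model, modality, dimension, score on a 1-5 scale).
runs = [
    ("model_a", "t2v", "visual_quality", 4),
    ("model_a", "video_edit", "alignment", 3),
    ("model_b", "t2v", "visual_quality", 4),
    ("model_b", "video_edit", "alignment", 5),
]

matrix: dict[tuple[str, str], list[int]] = defaultdict(list)
for model, modality, dimension, score in runs:
    matrix[(model, modality)].append(score)

# Capability matrix: one mean score per (model, modality) cell, not a single ranking.
for (model, modality), scores in sorted(matrix.items()):
    print(f"{model} | {modality:<10} | {mean(scores):.1f}")
```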
The full white paper will include a scoring card template, execution SOP, toolchain recommendations, and full academic references (VBench, Artificial Analysis, AIGCBench, LOVE, and others). The test results below were produced under this framework.
6 Scenarios: Where the Leaderboard #1 Loses
We selected 6 scenario types from Atlas Cloud's Prompt Hub — covering all five evaluation dimensions with balanced modality coverage. Unified parameters across all runs: 1080p / 16:9 / seed 42 / duration scaled to scenario complexity (5–15 seconds).
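A minimal sketch of how we pin those shared parameters in code (field names are illustrative, not Atlas Cloud's documented schema):

```python
# Illustrative shared-parameter payload held constant across both models.
BASE_PARAMS = {"resolution": "1080p", "aspect_ratio": "16:9", "seed": 42}

def build_request(model: str, prompt: str, duration_s: int) -> dict:
    """duration_s scales with scenario complexity (5-15 s); everything else is fixed."""
    return {"model": model, "prompt": prompt, "duration": duration_s, **BASE_PARAMS}
```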
Scenario 1: Cave Exploration — Visual Quality + Ambient Audio
Prompt: flashlight exploration of a limestone cave, illuminating wet rock walls and crystal reflections, beam passing through shallow water creating caustic light patterns, stalactites casting long shadows that shift with the light source. Ambient audio: dripping water, wet rock footsteps, enclosed-space breathing.
| Dimension | SD | HH |
|---|---|---|
| Caustic light physics | ✅ | ✅ |
| Wet rock highlights / mineral texture | Tends toward over-polished | More realistic ✅ (wins on stalactite structural detail) |
| Ambient audio | Dripping / footsteps / breathing — three layers distinct ✅ | Noticeable "AI quality," layers blended together |
HH wins on visuals, SD wins on audio. This scenario maps directly onto HH's leaderboard advantage — its visual detail is genuinely SOTA-level here.
Scenario 2: Hollywood Car Chase — Instruction Density
The prompt packs 7 distinct shot types into 15 seconds: aerial wide shot → low-angle ground tracking → hood POV → Dutch angle medium shot → ECU rear window → wide-angle side tracking → aerial pull-back.
| Dimension | SD | HH |
|---|---|---|
| 7-shot execution | 5/7 shots accurate ✅ | Only 2–3 shots |
| Smoke / debris physics | Dense and realistic ✅ | Tends toward light |
| Three-layer audio (engine / tires / road surface) | Distinct ✅ | Mixed together |
| Semantic misfire | — | Rendered "aerial drone shot" as an actual drone flying into frame |
SD wins clearly. HH's "drone mistake" is a clean example of semantic alignment failure — it knows the word "drone" but can't tell whether it refers to a camera move or a physical object in the scene.
Scenario 3: Cross-Scene Character Consistency
Reference: a woman with long red hair, blunt bangs, white shirt, black tie. Task: walk from office to home, maintaining consistent appearance and natural emotional transition throughout.
One thing worth flagging here: we used R2V (Reference-to-Video), not I2V. I2V defaults to locking the reference image as the first frame, which forces the video to start from that image — you can't test cross-scene consistency that way. The distinction matters more than it might seem.
| Dimension | SD | HH |
|---|---|---|
| Facial features / hairstyle consistency | ✅ | ✅ |
| Wardrobe continuity | Single continuous take from office to home (artistic but abrupt) | Clean outfit change, jacket removed while tie retained ✅ |
| Emotional transition frames | Two-beat jump cut | Eyes closing + faint smile as "leaving work mode" transition ✅ |
| Visual texture | Leans clean and polished | Fine freckle detail, but noticeable "AI plastic" sheen |
| Narrative completeness | 3 scenes + father character included ✅ | Mother-daughter focus only |
Call it a draw with two different tradeoffs: SD delivers a single continuous take with clean execution; HH uses conventional cutting with finer detail but noticeable AI smoothing artifacts.
Scenario 4: Talk Show Dual-Character Dialogue — Multimodal Performance ⚡
This is the most instruction-dense scenario of the six. Three explicit rhythm markers in the prompt (lean forward / fake-thinking pause / shared-laugh punchline) each function as a discrete pass/fail checkpoint.
The prompt specifies a Tonight Show-style three-round exchange, closing with both characters in a full laugh.
| Dimension | SD | HH |
|---|---|---|
| Rhythm cue: "dog leans forward" | ✅ Executed | ❌ Completely static throughout |
| Rhythm cue: "cat fake-thinking pause" | ✅ ECU thinking expression delivered | ❌ Not captured |
| Shared-laugh closing shot | ✅ Cut to cat's laugh (punchline beat) | ⚠️ Cut to dog instead (wrong character) |
| Text faithfulness | ✅ | ✅ (the one dimension HH held) |
| Voice matching | ✅ Accurate | ⚠️ Accurate but mechanical |
| Bonus creativity | ✅ Proactively added talk show audience laughter — genre-appropriate fill beyond the prompt | — |
| Voice consistency | ✅ | ❌ Cat's final laugh shifted to a male voice |
SD wins comprehensively. The most interesting detail: SD added audience laughter that wasn't in the prompt. Talk show content has an expected format — a laugh track at reaction beats — and the model filled it in. That's not just instruction-following. It's understanding what this type of content is supposed to be.
HH stayed text-faithful but hit a hard failure on audio: the cat's laugh shifted into a male voice mid-clip. Long-range audio consistency is a real weakness.
Scenario 5: Romantic Scene → Premeditated Reversal — Video Editing ⚡⚡
Source video: a foreign man says in English, "The moon is beautiful tonight, a shame I can't share it with you," a Chinese woman responds in Mandarin, "Anywhere feels like a beautiful view when I'm with you." Rooftop at night, soft atmosphere.
Editing prompt: full narrative reversal. The man's expression snaps from warm to cold. He pushes the woman off the roof without hesitation. Mid-fall, she screams in Mandarin: "You were lying to me from the start!" — not fear, disbelief. He stands at the edge with a cold smile and says quietly in English: "This is what you owed my family."
Four-layer test: expression reversal + key physical action + bilingual dialogue replacement + visual tone shift.
| 4-Layer Test | SD | HH |
|---|---|---|
| Man's expression reversal | ✅ Eye shift + cold smile | ❌ Expression reads as grief instead |
| Woman's reaction: disbelief not fear | ✅ Mid-fall rage and screaming | ❌ Textbook fear expression (opposite of the prompt) |
| Push-off-the-roof action | ✅ Actually happened (aerial fall shot + cityscape tilt) | ❌ Never pushed — woman still standing |
| Visual tone shift | ✅ | ⚠️ Kept the original tone, no shift |
| Bilingual dialogue generation | ✅ | ✅ (the one dimension HH held) |
| Voice realism | ✅ | ❌ Heavy AI quality |
SD executes the full scenario. HH fails completely.
HH parsed the entire prompt as "add some dialogue and some emotional conflict." The narrative structure didn't move. It handles surface instructions (what to say) but not narrative-level instructions (how the story goes).
Scenario 6: Multi-Modal Reference Fusion — Elevator Thriller ⚡⚡⚡
Input: 3 reference images (man's appearance / elevator interior / hallway) + 1 reference video (camera movement + facial expression). Task: fuse all 4 inputs and produce a sequence of fear → Hitchcock zoom → exiting elevator → mechanical arm tracking pan.
The two models use different endpoints — HH uses video-edit, SD uses reference-to-video — but both accept composite image-plus-video input. The endpoint names are asymmetric; the capability is equivalent. That's actually a useful proof point for what One API's abstraction layer does.
| Evaluation Item | SD | HH |
|---|---|---|
| Camera movement execution | ✅ Solid | ✅ Solid |
| Scene swap (elevator / hallway) | ✅ | ✅ |
| Man's identity matches img1 | ✅ Executed perfectly | ❌ No match — completely different face |
| Character consistency throughout | ✅ Stable | ⚠️ Appears to drift in the second half |
SD wins cleanly.
HH replicated the pose from the reference image (the hand-on-throat motion) but generated a completely different face. It copied the gesture, not the identity. This is structurally the same failure as Scenario 5: surface-level imitation works, but semantic depth doesn't.
Happy Horse vs Seedance: Instruction Comprehension Gap
One consistent structure emerged:
| Instruction Level | HH | SD |
|---|---|---|
| Surface-level instructions (dialogue, pose, parameters, scene elements) | ✅ Executes | ✅ Executes |
| Semantic-level instructions (narrative reversal, character identity, timing) | ❌ Fails | ✅ Executes |
| Genre convention fill-in (auto-adding talk show audience laughter, etc.) | ❌ | ✅ Proactively adds |
This isn't really a question of which model is "better." They operate at different instruction comprehension levels.
Give HH a line of dialogue to deliver, a pose to copy, a scene element to reproduce — it handles the detail well, and the visual texture is often superior. Ask it to reverse a narrative arc, maintain a specific person's identity across shots, or follow a sequence of rhythm cues — it stops at "add the surface elements" without getting to "execute what you actually meant."
SD works the other way. Less precise on surface texture, but more reliable on overall narrative, identity fidelity, and timing — and it will proactively fill in genre-appropriate elements the prompt didn't specify.
This explains the Elo result too. Artificial Analysis's blind evaluation is highly sensitive to "which looks more cinematic." HH's visual ceiling (skin texture, color grading, single-frame aesthetics) is real, and it shows in head-to-head comparisons. But Elo doesn't expose semantic comprehension gaps. Both things — the #1 ranking and the failure modes — are true simultaneously.
Happy Horse vs Seedance: Which Model Fits Your Use Case
| Scenario Type | Pick | Reason |
|---|---|---|
| Single best-looking shot (visual quality ceiling) | HH | Skin texture / cinematic color / single-frame aesthetics |
| Localized dialogue generation / translation / replacement | HH | Reliable text faithfulness |
| 7-language lip sync content | HH | Only openly available model covering this many languages |
| Mood pieces / emotional shorts / single-shot clips | HH | Finer visual detail + smooth emotional transitions |
| Multi-shot scripted video (car chase / talk show / action) | SD | Reliable shot-cut execution |
| Narrative reversal / video editing | SD | Semantic-level instruction comprehension |
| Cross-scene character consistency + identity fidelity | SD | Reference inputs swap the person, not just the pose |
| High instruction density / prompt-literal execution | SD | Instruction-aligned by default |
One API: Switch Models by Changing One String
The first engineering problem we hit running this evaluation: HH and SD use different SDKs, different endpoints, different auth methods. Just adapting the client code would mean writing and maintaining a separate implementation for each model.
That's why Atlas Cloud put both Seedance 2.0 and Happy Horse 1.0 behind the same model pool and the same One API. One key, one SDK, one model string.
The Scenario 6 detail is worth noting again — HH's endpoint is called video-edit, SD's is reference-to-video. Different names, equivalent capability (both accept composite image+video input). One API abstracts that difference away. Developers write one implementation.
All 12 videos in this evaluation were generated through Atlas Cloud One API — same key, same SDK, same prompt strings, one field changed. Running a cross-model comparison has never been this low-friction.
Using the API
Step 1: Get your API key from the console.


Step 2: Check the API docs for endpoint details, request parameters, and authentication.
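As a hedged sketch of what the swap looks like in practice: the base URL, endpoint path, field names, and model identifiers below are illustrative placeholders, and the API docs from Step 2 have the real ones.

```python
import os

import requests

ATLAS_API_KEY = os.environ["ATLAS_API_KEY"]        # from Step 1
BASE_URL = os.environ.get("ATLAS_BASE_URL", "https://api.example.com/v1")  # placeholder

def generate_video(model: str, prompt: str, duration_s: int = 10) -> dict:
    """Same request for both models; only the `model` string changes."""
    resp = requests.post(
        f"{BASE_URL}/video/generations",            # illustrative endpoint path
        headers={"Authorization": f"Bearer {ATLAS_API_KEY}"},
        json={
            "model": model,
            "prompt": prompt,
            "duration": duration_s,
            "resolution": "1080p",
            "seed": 42,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()

prompt = "flashlight exploration of a limestone cave, ambient audio: dripping water"
hh_result = generate_video("happy-horse-1.0", prompt)  # illustrative model IDs
sd_result = generate_video("seedance-2.0", prompt)
```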
A Note on Honesty in Benchmarking
Before writing this, we had a real hesitation: publishing findings like "HH rendered the push-off-a-building scene as a conversation" or "HH generated the wrong person's face" — would that be unfair?
The value of an evaluation white paper is exactly that it's honest. Happy Horse is genuinely strong. The Elo double-first isn't noise. Its failure scenarios tell you precisely when to pick the other option — which is the whole point of a comparative benchmark.
Coming next:
- Full White Paper v1.0 — the complete five-dimension × three-layer methodology with scoring card templates, execution SOP, and full academic references (VBench 2.0, Artificial Analysis, AIGCBench, LOVE)
- Complete scoring matrix — 5 dimensions × 6 scenarios × 2 models, 60 cells scored individually
- Evaluation toolchain — L1 automation scripts including an MLLM-as-Judge implementation
- Additional models — Veo, Wan, Kling, and others added to the comparison matrix
If you're doing video model selection, drop your use case in the comments. The white paper v1.0 will include the comparison dimensions readers most asked about.
All evaluation samples, original prompts, extracted frames, and scoring details will be published alongside the white paper. The full evaluation was completed through Atlas Cloud One API on a single interface.






