
Avatar Omni Human 1.5 API by ByteDance
Open and Advanced Large-Scale Video Generative Models.
Avatar Omni Human 1.5
Turn a single portrait photo into a lifelike, lip-synced talking video — driven entirely by an audio clip.
Avatar Omni Human 1.5 (OmniHuman) by ByteDance is a state-of-the-art digital-human video generation model. Give it one reference image of a person and an audio track, and it generates a natural, expressive video of that person speaking — with accurate lip sync, head motion, and facial expressions that match the audio.
Key Capabilities
- Audio-driven lip sync — mouth movements precisely follow the speech in your audio.
- Identity preservation — the generated person stays faithful to your reference image.
- Natural motion — lifelike head pose, blinking, and micro-expressions, not a stiff talking head.
- Multilingual — works with audio in many languages, including Chinese, English, Japanese, Korean, Spanish, and Indonesian.
- High resolution — render output at up to 1080p.
How It Works
- Provide a reference image (a clear photo of a person) and a driving audio clip (the speech to be spoken).
- The model animates the person to "speak" the audio, producing a full talking-head video.
- The output video's duration matches the length of your audio.
Generation is asynchronous: you submit a request and receive a task ID, then poll for the result. A typical clip takes a few minutes depending on audio length and resolution.
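The submit-then-poll flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's client library: `fetch_status` is a hypothetical callable you would write against the real API, and the `status` key and the `"succeeded"` / `"failed"` values are placeholders, since this page does not document the response format.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[str], Dict[str, Any]],
    task_id: str,
    interval_s: float = 5.0,
    timeout_s: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a task until it reaches a terminal state or the timeout elapses.

    fetch_status is a hypothetical callable returning the task's current
    state as a dict with a "status" key. The key name and the terminal
    values checked below are assumptions, not documented API behavior.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch_status(task_id)
        status = result.get("status")
        if status == "succeeded":
            return result
        if status == "failed":
            raise RuntimeError(f"task {task_id} failed: {result}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} not finished after {timeout_s}s")
        # Wait before asking again so the service is not hammered.
        sleep(interval_s)
```

Because generation typically takes a few minutes, a polling interval of several seconds and a generous timeout are reasonable starting points; the defaults here are illustrative only.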
Inputs
| Parameter | Required | Description |
|---|---|---|
| `image_url` | Yes | URL of the reference portrait. A clear, front-facing photo with a fully visible face works best. |
| `audio_url` | Yes | URL of the driving audio (speech). The generated video's length equals this audio's length. |
| `prompt` | No | Optional text hint for action / scene / expression. Supports Chinese, English, Japanese, Korean, Spanish, and Indonesian. |
| `output_resolution` | No | `720` (default) or `1080`. |
| `seed` | No | Fix the seed for reproducible output; `-1` for random. |
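As a rough sketch of how the parameters in the table fit together, the helper below assembles and validates a request body. The function name `build_payload` is hypothetical, and two things are assumed rather than documented here: that the parameters are sent as a flat JSON object, and that `output_resolution` and `seed` are integers.

```python
from typing import Any, Dict, Optional

ALLOWED_RESOLUTIONS = (720, 1080)  # values listed in the table above


def build_payload(
    image_url: str,
    audio_url: str,
    prompt: Optional[str] = None,
    output_resolution: int = 720,
    seed: int = -1,
) -> Dict[str, Any]:
    """Assemble the request parameters, rejecting obviously invalid input."""
    if not image_url or not audio_url:
        raise ValueError("image_url and audio_url are both required")
    if output_resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"output_resolution must be one of {ALLOWED_RESOLUTIONS}")
    payload: Dict[str, Any] = {
        "image_url": image_url,
        "audio_url": audio_url,
        "output_resolution": output_resolution,
        "seed": seed,  # -1 requests a random seed
    }
    if prompt:  # leave the optional hint out when it is empty
        payload["prompt"] = prompt
    return payload


# Example with placeholder URLs:
payload = build_payload(
    "https://example.com/portrait.jpg",
    "https://example.com/speech.mp3",
    prompt="smiling, studio background",
    output_resolution=1080,
)
```

Validating locally catches a missing URL or an unsupported resolution before a billable request is submitted.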
Best Practices
- Reference image: use a high-quality, well-lit, front-facing photo with one clearly visible face. Avoid heavy occlusion (hands or objects over the face), extreme angles, or faces that are too small in frame.
- Audio: clean speech with minimal background noise yields the most accurate lip sync.
- Single subject: for best results the reference image should contain a single, clear primary person.
Typical Use Cases
- Virtual presenters and AI news anchors
- Talking-avatar marketing and product explainers
- Localized / dubbed spokesperson videos
- Education and training content
- Social-media avatars and digital influencers
Notes & Limitations
- Output video length is determined by the input audio length.
- Quality depends heavily on input quality — blurry images or noisy audio reduce lip-sync accuracy.
- Best suited to a single primary speaker; complex multi-person scenes are not the target use case.
Pricing
Billed at $0.12 per second of generated video (output duration = audio duration).
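Since cost scales linearly with audio duration, a quick estimate is easy to compute before submitting. The sketch below assumes billing on the exact fractional duration, rounded to the nearest cent; whether the service rounds durations up to whole seconds is not stated here, so treat the result as approximate. The name `estimate_cost_usd` is illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

PRICE_PER_SECOND_USD = Decimal("0.12")  # rate quoted on this page


def estimate_cost_usd(audio_seconds: float) -> Decimal:
    """Estimate the charge for a clip whose audio lasts audio_seconds."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    # Go through str() so the float is not expanded to its binary form.
    cost = PRICE_PER_SECOND_USD * Decimal(str(audio_seconds))
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(estimate_cost_usd(30))    # 30-second clip -> 3.60
print(estimate_cost_usd(12.5))  # 12.5-second clip -> 1.50
```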
Powered by ByteDance OmniHuman 1.5, served through Atlas Cloud.