Avatar Omni Human 1.5 API by ByteDance

bytedance/avatar-omni-human-v1.5

Open and Advanced Large-Scale Video Generative Models.

Avatar Omni Human 1.5

Turn a single portrait photo into a lifelike, lip-synced talking video — driven entirely by an audio clip.

Avatar Omni Human 1.5 (OmniHuman) by ByteDance is a state-of-the-art digital-human video generation model. Give it one reference image of a person and an audio track, and it generates a natural, expressive video of that person speaking — with accurate lip sync, head motion, and facial expressions that match the audio.


Key Capabilities

  • Audio-driven lip sync — mouth movements precisely follow the speech in your audio.
  • Identity preservation — the generated person stays faithful to your reference image.
  • Natural motion — lifelike head pose, blinking, and micro-expressions, not a stiff talking head.
  • Multilingual — works with audio in many languages, including Chinese, English, Japanese, Korean, Spanish, and Indonesian.
  • High resolution — render output at up to 1080p.

How It Works

  1. Provide a reference image (a clear photo of a person) and a driving audio clip (the speech to be spoken).
  2. The model animates the person to "speak" the audio, producing a full talking-head video.
  3. The output video's duration matches the length of your audio.

Generation is asynchronous: you submit a request and receive a task ID, then poll for the result. A typical clip takes a few minutes depending on audio length and resolution.
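
The submit-then-poll flow can be sketched as a small helper. This is a minimal sketch, not the service's client: the endpoint, the response shape, and the status strings `"succeeded"` and `"failed"` are assumptions, since this page does not document them. The caller supplies `fetch_status`, a hypothetical function that wraps whatever status request the API actually exposes.

```python
import time
from typing import Callable, Dict


def wait_for_result(
    fetch_status: Callable[[str], Dict],
    task_id: str,
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll a task until it reaches a terminal state or the timeout elapses.

    fetch_status is a caller-supplied (hypothetical) function that takes a
    task ID and returns the task as a dict with a "status" key.
    """
    waited = 0.0
    while True:
        task = fetch_status(task_id)
        status = task.get("status")
        if status == "succeeded":  # assumed terminal-success value
            return task
        if status == "failed":  # assumed terminal-failure value
            raise RuntimeError(f"task {task_id} failed: {task.get('error')}")
        if waited >= timeout_s:
            raise TimeoutError(f"task {task_id} not finished after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```

Because a typical clip takes a few minutes, a polling interval of several seconds and a timeout of ten minutes or more is a reasonable starting point; adjust both for long audio or 1080 output.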


Inputs

| Parameter | Required | Description |
|---|---|---|
| image_url | Yes | URL of the reference portrait. A clear, front-facing photo with a fully visible face works best. |
| audio_url | Yes | URL of the driving audio (speech). The generated video's length equals this audio's length. |
| prompt | No | Optional text hint for action / scene / expression. Supports Chinese, English, Japanese, Korean, Spanish, and Indonesian. |
| output_resolution | No | Output resolution: 720 (default) or 1080. |
| seed | No | Fix the seed for reproducible output; -1 for random. |
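
The parameters above can be assembled and checked locally before a request is sent. The following is a sketch under assumptions: the field names come from the table, but the `build_payload` helper, the http(s) URL check, the integer encoding of `output_resolution`, and the choice to send `seed` explicitly are illustrative and not confirmed by this page. The request envelope (endpoint, model field, authentication) is not documented here and is left out.

```python
from typing import Any, Dict, Optional

ALLOWED_RESOLUTIONS = (720, 1080)  # values listed in the inputs table


def build_payload(
    image_url: str,
    audio_url: str,
    prompt: Optional[str] = None,
    output_resolution: int = 720,
    seed: int = -1,
) -> Dict[str, Any]:
    """Validate the documented inputs and return them as a dict."""
    for name, value in (("image_url", image_url), ("audio_url", audio_url)):
        # Assumption: inputs must be reachable http(s) URLs.
        if not value.startswith(("http://", "https://")):
            raise ValueError(f"{name} must be an http(s) URL, got {value!r}")
    if output_resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"output_resolution must be one of {ALLOWED_RESOLUTIONS}")

    payload: Dict[str, Any] = {
        "image_url": image_url,
        "audio_url": audio_url,
        "output_resolution": output_resolution,
        "seed": seed,  # -1 requests a random seed
    }
    if prompt:  # optional hint; omitted when empty
        payload["prompt"] = prompt
    return payload
```

For example, `build_payload("https://example.com/face.png", "https://example.com/speech.mp3", seed=42)` would fix the seed for a reproducible run at the default 720 resolution.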

Best Practices

  • Reference image: use a high-quality, well-lit, front-facing photo with one clearly visible face. Avoid heavy occlusion (hands or objects over the face), extreme angles, or faces that are too small in frame.
  • Audio: clean speech with minimal background noise yields the most accurate lip sync.
  • Single subject: for best results the reference image should contain a single, clear primary person.

Typical Use Cases

  • Virtual presenters and AI news anchors
  • Talking-avatar marketing and product explainers
  • Localized / dubbed spokesperson videos
  • Education and training content
  • Social-media avatars and digital influencers

Notes & Limitations

  • Output video length always equals the input audio length; the inputs include no separate duration parameter.
  • Quality depends heavily on input quality — blurry images or noisy audio reduce lip-sync accuracy.
  • Best suited to a single primary speaker; complex multi-person scenes are not the target use case.

Pricing

Billed at $0.12 per second of generated video (output duration = audio duration).
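
Since the output duration equals the audio duration, the cost of a clip can be estimated from the audio length alone. The helper below is a rough sketch: the function name is hypothetical, and it assumes billing is proportional to the exact duration, rounded to the cent. Whether the service rounds durations up to whole seconds is not stated here.

```python
from decimal import Decimal, ROUND_HALF_UP

PRICE_PER_SECOND_USD = Decimal("0.12")  # rate quoted above


def estimate_cost_usd(audio_seconds: float) -> Decimal:
    """Estimate the charge for a clip whose audio lasts audio_seconds."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    # Go through str() so a float like 12.5 becomes an exact Decimal.
    cost = Decimal(str(audio_seconds)) * PRICE_PER_SECOND_USD
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(estimate_cost_usd(30))  # 30 s * $0.12/s -> 3.60
```

At this rate a one-minute clip comes to about $7.20, so trimming silence from the audio before submitting directly reduces cost.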


Powered by ByteDance OmniHuman 1.5, served through Atlas Cloud.
