Avatar Omni Human 1.5 API by ByteDance

bytedance/avatar-omni-human-v1.5

Open and Advanced Large-Scale Video Generative Models.

Avatar Omni Human 1.5

Turn a single portrait photo into a lifelike, lip-synced talking video — driven entirely by an audio clip.

Avatar Omni Human 1.5 (OmniHuman) by ByteDance is a state-of-the-art digital-human video generation model. Give it one reference image of a person and an audio track, and it generates a natural, expressive video of that person speaking — with accurate lip sync, head motion, and facial expressions that match the audio.


Key Capabilities

  • Audio-driven lip sync — mouth movements precisely follow the speech in your audio.
  • Identity preservation — the generated person stays faithful to your reference image.
  • Natural motion — lifelike head pose, blinking, and micro-expressions, not a stiff talking head.
  • Multilingual — works with audio in many languages, including Chinese, English, Japanese, Korean, Spanish, and Indonesian.
  • High resolution — render output at up to 1080p.

How It Works

  1. Provide a reference image (a clear photo of a person) and a driving audio clip (the speech to be spoken).
  2. The model animates the person to "speak" the audio, producing a full talking-head video.
  3. The output video's duration matches the length of your audio.

Generation is asynchronous: you submit a request and receive a task ID, then poll for the result. A typical clip takes a few minutes depending on audio length and resolution.
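The submit-then-poll flow described above can be sketched as a small helper. This is only an illustration: the page does not document the endpoint URLs or the response schema, so the `fetch_status` callable, the `status` field, and the `"succeeded"` / `"failed"` values are all assumptions to be replaced with whatever the real API returns.

```python
import time
from typing import Any, Callable, Dict


def wait_for_result(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_s: float = 5.0,
    timeout_s: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a task until it reaches a terminal state.

    fetch_status is a hypothetical callable that looks up the task by its
    task ID and returns the decoded response as a dict. The "status" key and
    its "succeeded" / "failed" values are assumed, not documented.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        task = fetch_status()
        status = task.get("status")
        if status == "succeeded":
            return task
        if status == "failed":
            raise RuntimeError(f"generation failed: {task.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        # Any other status is treated as "still running": wait, then retry.
        sleep(interval_s)
```

Because a typical clip takes a few minutes, a polling interval of several seconds and a timeout of ten minutes or more is a reasonable starting point. Passing `fetch_status` in as a callable keeps the loop independent of any particular HTTP client.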


Inputs

| Parameter | Required | Description |
|---|---|---|
| image_url | Yes | URL of the reference portrait. A clear, front-facing photo with a fully visible face works best. |
| audio_url | Yes | URL of the driving audio (speech). The generated video's length equals this audio's length. |
| prompt | No | Optional text hint for action / scene / expression. Supports Chinese, English, Japanese, Korean, Spanish, and Indonesian. |
| output_resolution | No | 720 (default) or 1080. |
| seed | No | Fix the seed for reproducible output; -1 for random. |
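As a rough illustration of how these parameters fit together, the sketch below assembles and validates a request body before it is sent. The parameter names, the defaults, and the allowed resolutions come from the list above; the function name `build_payload`, the flat dictionary layout, and sending `output_resolution` as an integer are assumptions, since the exact wire format is not shown here.

```python
from typing import Any, Dict, Optional

ALLOWED_RESOLUTIONS = (720, 1080)


def build_payload(
    image_url: str,
    audio_url: str,
    prompt: Optional[str] = None,
    output_resolution: int = 720,
    seed: int = -1,
) -> Dict[str, Any]:
    """Build a request body from the documented inputs (layout is assumed)."""
    if not image_url or not audio_url:
        raise ValueError("image_url and audio_url are both required")
    if output_resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"output_resolution must be one of {ALLOWED_RESOLUTIONS}")

    payload: Dict[str, Any] = {
        "image_url": image_url,
        "audio_url": audio_url,
        "output_resolution": output_resolution,
        "seed": seed,  # -1 asks for a random seed
    }
    if prompt:
        # The prompt is optional, so it is only included when provided.
        payload["prompt"] = prompt
    return payload
```

Validating locally catches a missing URL or an unsupported resolution before a billable task is submitted.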

Best Practices

  • Reference image: use a high-quality, well-lit, front-facing photo with one clearly visible face. Avoid heavy occlusion (hands or objects over the face), extreme angles, or faces that are too small in frame.
  • Audio: clean speech with minimal background noise yields the most accurate lip sync.
  • Single subject: for best results the reference image should contain a single, clear primary person.

Typical Use Cases

  • Virtual presenters and AI news anchors
  • Talking-avatar marketing and product explainers
  • Localized / dubbed spokesperson videos
  • Education and training content
  • Social-media avatars and digital influencers

Notes & Limitations

  • Output video length equals the input audio length, so trim or extend the audio to control the video's duration.
  • Quality depends heavily on input quality — blurry images or noisy audio reduce lip-sync accuracy.
  • Best suited to a single primary speaker; complex multi-person scenes are not the target use case.

Pricing

Billed at $0.12 per second of generated video (output duration = audio duration).
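Since the output duration equals the audio duration, the cost of a clip can be estimated from the audio length alone. The sketch below applies the stated rate; rounding to the nearest cent is an assumption, as the billing granularity (for example, whether partial seconds are rounded up) is not specified.

```python
from decimal import Decimal, ROUND_HALF_UP

PRICE_PER_SECOND = Decimal("0.12")  # USD per second of generated video


def estimate_cost(audio_seconds: float) -> Decimal:
    """Estimate the charge in USD for a clip of the given audio length."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    # Convert through str() so the float is not expanded to its binary value.
    cost = Decimal(str(audio_seconds)) * PRICE_PER_SECOND
    # Assumed rounding: nearest cent, halves rounded up.
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(estimate_cost(30))    # 3.60
print(estimate_cost(12.5))  # 1.50
```

For example, a 30-second audio clip comes to 30 × $0.12 = $3.60.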


Powered by ByteDance OmniHuman 1.5, served through Atlas Cloud.
