atlascloud/infinitetalk

InfiniteTalk turns a reference portrait and audio into a realistic talking-head video with lip-sync, supporting up to 10-minute audio in 480p or 720p.

AUDIO-TO-VIDEO


Each request costs $0.03. With $10 you can run this model about 333 times.


Code example

import os
import requests
import time

API_KEY = os.environ["ATLASCLOUD_API_KEY"]
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

# Step 1: Start video generation
generate_url = "https://api.atlascloud.ai/api/v1/model/generateVideo"
data = {
    "model": "atlascloud/infinitetalk",
    # Generic example parameters; check the model's live parameter
    # list for the fields it actually accepts.
    "prompt": "A beautiful sunset over the ocean with gentle waves",
    "width": 512,
    "height": 512,
    "duration": 3,
    "fps": 24,
}

generate_response = requests.post(generate_url, headers=headers, json=data)
generate_result = generate_response.json()
prediction_id = generate_result["data"]["id"]

# Step 2: Poll until the prediction finishes
poll_url = f"https://api.atlascloud.ai/api/v1/model/prediction/{prediction_id}"

def check_status():
    while True:
        response = requests.get(poll_url, headers={"Authorization": f"Bearer {API_KEY}"})
        result = response.json()
        status = result["data"]["status"]

        if status in ("completed", "succeeded"):
            print("Generated video:", result["data"]["outputs"][0])
            return result["data"]["outputs"][0]
        elif status == "failed":
            raise RuntimeError(result["data"].get("error") or "Generation failed")
        else:
            # Still processing; wait 2 seconds before polling again
            time.sleep(2)

video_url = check_status()

Installation

Install the package required for your language.

bash
pip install requests

Authentication

All API requests require authentication with an API key. You can get an API key from the Atlas Cloud dashboard.

bash
export ATLASCLOUD_API_KEY="your-api-key-here"

HTTP headers

python
import os

API_KEY = os.environ.get("ATLASCLOUD_API_KEY")
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}
Keep your API key secure

Never expose your API key in client-side code or public repositories. Use environment variables or a backend proxy instead.

Submitting a request

import os
import requests

url = "https://api.atlascloud.ai/api/v1/model/generateVideo"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ['ATLASCLOUD_API_KEY']}"
}
data = {
    "model": "your-model",
    "prompt": "A beautiful landscape"
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Submit a request

Submits an asynchronous generation request. The API returns a prediction ID that you can use to check status and retrieve the result.

POST /api/v1/model/generateVideo

Request body

import os
import requests

url = "https://api.atlascloud.ai/api/v1/model/generateVideo"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ['ATLASCLOUD_API_KEY']}"
}

data = {
    "model": "atlascloud/infinitetalk",
    "input": {
        "prompt": "A beautiful sunset over the ocean with gentle waves"
    }
}

response = requests.post(url, headers=headers, json=data)
result = response.json()

print(f"Prediction ID: {result['id']}")
print(f"Status: {result['status']}")

Response

{
  "id": "pred_abc123",
  "status": "processing",
  "model": "model-name",
  "created_at": "2025-01-01T00:00:00Z"
}

Checking status

Poll the prediction endpoint to check the current status of your request.

GET /api/v1/model/prediction/{prediction_id}

Polling example

import os
import requests
import time

prediction_id = "pred_abc123"
url = f"https://api.atlascloud.ai/api/v1/model/prediction/{prediction_id}"
headers = {"Authorization": f"Bearer {os.environ['ATLASCLOUD_API_KEY']}"}

while True:
    response = requests.get(url, headers=headers)
    result = response.json()
    status = result["data"]["status"]
    print(f"Status: {status}")

    if status in ["completed", "succeeded"]:
        output_url = result["data"]["outputs"][0]
        print(f"Output URL: {output_url}")
        break
    elif status == "failed":
        print(f"Error: {result['data'].get('error', 'Unknown')}")
        break

    time.sleep(3)

Status values

processing: The request is still being processed.
completed: Generation finished; outputs are available.
succeeded: Generation succeeded; outputs are available.
failed: Generation failed; check the error field.

Completed response

{
  "data": {
    "id": "pred_abc123",
    "status": "completed",
    "outputs": [
      "https://storage.atlascloud.ai/outputs/result.mp4"
    ],
    "metrics": {
      "predict_time": 45.2
    },
    "created_at": "2025-01-01T00:00:00Z",
    "completed_at": "2025-01-01T00:00:10Z"
  }
}

File upload

Upload files to Atlas Cloud storage and receive a URL you can use in API requests. Uploads use multipart/form-data.

POST /api/v1/model/uploadMedia

Upload example

import os
import requests

url = "https://api.atlascloud.ai/api/v1/model/uploadMedia"
headers = {"Authorization": f"Bearer {os.environ['ATLASCLOUD_API_KEY']}"}

with open("image.png", "rb") as f:
    files = {"file": ("image.png", f, "image/png")}
    response = requests.post(url, headers=headers, files=files)

result = response.json()
download_url = result["data"]["download_url"]
print(f"File URL: {download_url}")

Response

{
  "data": {
    "download_url": "https://storage.atlascloud.ai/uploads/abc123/image.png",
    "file_name": "image.png",
    "content_type": "image/png",
    "size": 1024000
  }
}
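Since this model's published input schema is empty, the exact parameter names for the portrait image and speech audio are not documented here. As a hypothetical sketch only, a helper that reuses uploaded download_url values in a generateVideo request body might look like this (the "image" and "audio" field names inside "input" are assumptions for illustration, not confirmed API parameters):

```python
# Sketch: reuse uploaded file URLs in a generation request body.
# NOTE: "image" and "audio" are assumed field names -- the published
# input schema for this model is empty, so verify the live parameter
# list before relying on them.

def build_talking_head_request(image_url: str, audio_url: str,
                               prompt: str = "") -> dict:
    """Assemble a generateVideo request body from uploaded media URLs."""
    body = {
        "model": "atlascloud/infinitetalk",
        "input": {
            "image": image_url,   # assumed parameter name
            "audio": audio_url,   # assumed parameter name
        },
    }
    if prompt:
        body["input"]["prompt"] = prompt
    return body

body = build_talking_head_request(
    "https://storage.atlascloud.ai/uploads/abc123/image.png",
    "https://storage.atlascloud.ai/uploads/abc123/speech.wav",
    prompt="calm newsroom delivery",
)
```

The resulting dict can be passed as `json=body` to `requests.post` against the generateVideo endpoint shown above.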

Input schema

Total: 0 · Required: 0 · Optional: 0

No input parameters are currently published for this model.

Example request body

json
{
  "model": "atlascloud/infinitetalk"
}

Output schema

The API returns a prediction response containing the generated output URLs.

id (string, required)
    Unique identifier for the prediction.
status (string, required)
    Current status of the prediction. One of: processing, completed, succeeded, failed.
model (string, required)
    The model used for generation.
outputs (array[string])
    Array of output URLs. Available when status is "completed".
error (string)
    Error message if status is "failed".
metrics (object)
    Performance metrics.
metrics.predict_time (number)
    Time taken for video generation in seconds.
created_at (string, required)
    ISO 8601 timestamp when the prediction was created. Format: date-time.
completed_at (string)
    ISO 8601 timestamp when the prediction was completed. Format: date-time.

Example response

json
{
  "id": "pred_abc123",
  "status": "completed",
  "model": "model-name",
  "outputs": [
    "https://storage.atlascloud.ai/outputs/result.mp4"
  ],
  "metrics": {
    "predict_time": 45.2
  },
  "created_at": "2025-01-01T00:00:00Z",
  "completed_at": "2025-01-01T00:00:10Z"
}

Atlas Cloud Skills

Atlas Cloud Skills integrates 300+ AI models directly into your AI coding assistant. Install with a single command, then use natural language to generate images and videos and chat with LLMs.

Supported clients

Claude Code
OpenAI Codex
Gemini CLI
Cursor
Windsurf
VS Code
Trae
GitHub Copilot
Cline
Roo Code
Amp
Goose
Replit
40+ supported clients

Installation

bash
npx skills add AtlasCloudAI/atlas-cloud-skills

API key setup

Get an API key from the Atlas Cloud dashboard and set it as an environment variable.

bash
export ATLASCLOUD_API_KEY="your-api-key-here"

Features

After installation, you can access all Atlas Cloud models from your AI assistant using natural language.

Image generation: Generate images with models such as Nano Banana 2 and Z-Image.
Video creation: Create videos from text or images with Kling, Vidu, Veo, and more.
LLM chat: Converse with large language models such as Qwen and DeepSeek.
Media upload: Upload local files for image-editing and image-to-video workflows.

MCP Server

The Atlas Cloud MCP Server connects your IDE to 300+ AI models via the Model Context Protocol. It works with any MCP-compatible client.

Supported clients

Cursor
VS Code
Windsurf
Claude Code
OpenAI Codex
Gemini CLI
Cline
Roo Code
100+ supported clients

Installation

bash
npx -y atlascloud-mcp

Configuration

Add the following configuration to your IDE's MCP settings file.

json
{
  "mcpServers": {
    "atlascloud": {
      "command": "npx",
      "args": [
        "-y",
        "atlascloud-mcp"
      ],
      "env": {
        "ATLASCLOUD_API_KEY": "your-api-key-here"
      }
    }
  }
}

Available tools

atlas_generate_image: Generate images from text prompts.
atlas_generate_video: Create videos from text or images.
atlas_chat: Converse with large language models.
atlas_list_models: Browse 300+ available AI models.
atlas_quick_generate: One-step content generation with automatic model selection.
atlas_upload_media: Upload local files for API workflows.


InfiniteTalk: Audio-Driven Talking Video Generation

1. Introduction

InfiniteTalk is an audio-driven video generation model developed by AtlasCloud that transforms a single portrait image into a realistic talking-head video synchronized to any speech audio input. Built on a modified Wan2.1 I2V-14B diffusion transformer backbone with a dedicated audio cross-attention module, InfiniteTalk achieves phoneme-level lip synchronization while preserving the subject's identity, hairstyle, clothing, and background throughout the entire video.

InfiniteTalk's core innovation lies in its triple cross-attention architecture: each transformer block processes visual self-attention, text prompt cross-attention, and frame-level audio cross-attention in sequence, enabling precise per-frame audio-visual alignment. Combined with a streaming inference pipeline that processes video in overlapping segments, InfiniteTalk supports continuous video generation of up to 10 minutes from a single request — far exceeding the typical 5–15 second limit of conventional image-to-video models. The model also supports dual-person mode, animating two speakers simultaneously within the same frame using separate audio tracks and bounding box annotations.

2. Key Features & Innovations

Triple Cross-Attention Audio Conditioning: Unlike text-only conditioned video models, InfiniteTalk injects audio embeddings at every transformer block via a dedicated cross-attention layer. Audio features are extracted frame-by-frame using a Wav2Vec2 encoder, providing per-frame speech signal anchoring that drives natural mouth movements, facial micro-expressions, and head motion synchronized to the audio input.
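A minimal NumPy sketch of the attention ordering described above, assuming single-head scaled dot-product attention with no projections or norms; the real Wan2.1 blocks are far richer, and this only illustrates the self-attention, then text cross-attention, then audio cross-attention sequence:

```python
import numpy as np

def attention(q, k, v):
    """Single-head scaled dot-product attention (illustrative only)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def infinitetalk_block(x, text_emb, audio_emb):
    """Sketch of one transformer block's attention order:
    visual self-attention, then text cross-attention, then
    frame-level audio cross-attention. Residual connections are
    kept; projections and layer norms are omitted for brevity."""
    x = x + attention(x, x, x)                  # visual self-attention
    x = x + attention(x, text_emb, text_emb)    # text prompt cross-attention
    x = x + attention(x, audio_emb, audio_emb)  # audio cross-attention
    return x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 64))   # 16 visual tokens, dim 64
text = rng.normal(size=(8, 64))      # 8 text-prompt tokens
audio = rng.normal(size=(4, 64))     # per-frame audio features
out = infinitetalk_block(tokens, text, audio)
```

Each cross-attention pass lets every visual token attend over a different conditioning sequence, which is how the audio signal can steer mouth shape on a per-frame basis.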

Streaming Long-Form Video Generation: InfiniteTalk's streaming mode processes audio in overlapping clip segments with configurable motion frame overlap, automatically concatenating segments into seamless long-form video. This enables generation of minutes-long talking videos without quality degradation or identity drift — a capability not available in standard image-to-video pipelines limited to single-shot outputs.

High-Fidelity Identity Preservation: The model maintains consistent facial identity, hairstyle, clothing texture, and background composition across the entire generated video. The audio conditioning signal provides strong per-frame constraints that prevent the identity drift commonly observed in long unconditional video generation.

Dual-Person Conversation Mode: InfiniteTalk supports animating two speakers in a single scene by accepting separate audio tracks and bounding box coordinates for each person. This enables realistic conversation scenarios, interview formats, and dialogue-driven content without requiring separate generation passes or post-production compositing.

Flexible Input Modalities: The model accepts either a static portrait image or a reference video as the visual source, combined with audio in WAV or MP3 format. Text prompts provide additional guidance for expression style, posture, and behavioral nuance, giving creators fine-grained control over the generated output.

Conditional VSR Upscaling: When generating at 720p resolution with audio duration under 60 seconds, InfiniteTalk automatically routes output through a FlashVSR super-resolution pipeline, delivering enhanced visual clarity without additional user configuration or cost management.

3. Model Architecture & Technical Details

InfiniteTalk is built on the Wan2.1 I2V-14B foundation model (14 billion parameters, 480p native resolution) with custom InfiniteTalk adapter weights that introduce the audio cross-attention pathway. The audio encoder uses a Chinese-Wav2Vec2-Base model that extracts frame-aligned speech embeddings at 25 fps video rate, creating a one-to-one correspondence between audio features and generated video frames.
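The frame alignment above can be made concrete with a little arithmetic. Assuming 16 kHz input audio (the rate Wav2Vec2-Base models are trained on; the description does not state the pipeline's internal sample rate), 25 fps alignment implies 640 audio samples per video frame:

```python
# Audio-to-frame alignment sketch. The 16 kHz sample rate is an
# assumption (standard for Wav2Vec2-Base); the 25 fps rate is from
# the model description.

SAMPLE_RATE = 16_000   # Hz, assumed
VIDEO_FPS = 25         # from the model description

samples_per_frame = SAMPLE_RATE // VIDEO_FPS   # 640 samples per frame

def frames_for_audio(num_samples: int) -> int:
    """Number of video frames a raw audio buffer maps to."""
    return num_samples // samples_per_frame
```

Under these assumptions, a 10-second clip (160,000 samples) maps to 250 video frames, one embedding per frame.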

The inference pipeline operates in two modes. In clip mode, the model generates a single video segment of up to 81 frames (approximately 3.2 seconds at 25 fps), suitable for short-form content. In streaming mode, the model iteratively generates overlapping clips with a configurable motion frame overlap (default: 9 frames), seamlessly blending segments to produce arbitrarily long video bounded only by the input audio duration and a configurable maximum frame limit.
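As a back-of-envelope sketch of the streaming segmentation above, using the 81-frame clip length and 9-frame default overlap quoted here (and assuming overlapped frames are blended rather than contributing new footage):

```python
CLIP_FRAMES = 81      # frames per generated clip (from the description)
MOTION_OVERLAP = 9    # default motion-frame overlap between clips

def clips_needed(total_frames: int) -> int:
    """How many overlapping clips streaming mode would generate to
    cover total_frames: the first clip contributes CLIP_FRAMES frames,
    and each subsequent clip contributes CLIP_FRAMES - MOTION_OVERLAP
    new frames."""
    if total_frames <= CLIP_FRAMES:
        return 1
    remaining = total_frames - CLIP_FRAMES
    step = CLIP_FRAMES - MOTION_OVERLAP
    return 1 + -(-remaining // step)   # ceiling division
```

By this accounting, a 60-second video at 25 fps (1,500 frames) would take 21 overlapping clips.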

The diffusion process uses a configurable number of denoising steps (default: 40, tunable from 1–100) with TeaCache acceleration for improved throughput. On NVIDIA H200 hardware, each 81-frame clip requires approximately 3.5 minutes of processing time, yielding a generation-to-output ratio of roughly 10–30× depending on resolution and hardware load.

For 720p output, the system employs a two-stage pipeline: base generation at 480p followed by conditional FlashVSR 4× upscaling (target: 921,600 pixels at 25 fps), applied automatically when audio duration is 60 seconds or less.
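The routing rule can be sketched as a simple predicate. The thresholds come from the description above; the function name is illustrative:

```python
# Conditional FlashVSR routing sketch: upscaling applies only when
# 720p output is requested AND the audio is 60 seconds or shorter.

HD_PIXELS = 1280 * 720   # the 921,600-pixel VSR target

def use_flashvsr(resolution: str, audio_seconds: float) -> bool:
    return resolution == "720p" and audio_seconds <= 60
```

So a 45-second 720p request is upscaled, while a 61-second 720p request or any 480p request ships the base-resolution output.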

4. Performance Highlights

InfiniteTalk addresses a specific niche — audio-driven talking-head video — that differs from general-purpose text-to-video or image-to-video models. Its performance should be evaluated primarily on lip-sync accuracy, identity consistency, and long-form stability rather than visual diversity or cinematic motion range.

| Capability | InfiniteTalk | General I2V Models | Dedicated Lip-Sync Tools |
|---|---|---|---|
| Lip-sync accuracy | Phoneme-level, multi-language | N/A (no audio input) | Word-level, often English-only |
| Maximum duration | Up to 10 minutes (streaming) | 5-15 seconds typical | 30-60 seconds typical |
| Identity preservation | High (audio-anchored per-frame) | Moderate (drift in longer clips) | Moderate |
| Dual-person support | Native | Not available | Rare |
| Resolution | 480p native, 720p with VSR | Up to 1080p | Varies |
| Audio input | Any language, WAV/MP3 | N/A | Usually English TTS |

InfiniteTalk achieves strong lip-sync fidelity across Chinese, English, Japanese, and other languages tested, owing to the language-agnostic Wav2Vec2 audio feature extraction. Identity drift is minimal even in 5+ minute generations due to the per-frame audio conditioning anchor.

5. Intended Use & Applications

Digital Avatar & Virtual Presenter: Create realistic talking-head videos for virtual hosts, AI assistants, and digital spokespersons using a single photo and recorded or synthesized speech audio.

Video Dubbing & Localization: Generate lip-synced video from translated audio tracks, enabling cost-effective multilingual content adaptation without re-filming or manual lip-sync editing.

Online Education & Training: Produce instructor-led video content at scale from lecture audio recordings and a single instructor photograph, reducing video production costs for e-learning platforms.

Podcast & Interview Visualization: Transform audio-only podcast or interview recordings into engaging video content with realistic speaker animations, suitable for social media distribution.

Customer Service & Chatbot Video: Generate personalized video responses driven by TTS audio output, enabling human-like video communication in automated customer interaction flows.

Social Media Content at Scale: Rapidly produce talking-head content for influencer accounts, news summaries, or commentary formats using text-to-speech pipelines combined with InfiniteTalk video generation.
