atlascloud/infinitetalk

InfiniteTalk turns a reference portrait and audio into a realistic talking-head video with lip-sync, supporting up to 10-minute audio in 480p or 720p.

AUDIO-TO-VIDEO
Startseite
Erkunden
atlascloud/infinitetalk
InfiniteTalk
Audio-zu-Video

InfiniteTalk turns a reference portrait and audio into a realistic talking-head video with lip-sync, supporting up to 10-minute audio in 480p or 720p.

Eingabe

Parameterkonfiguration wird geladen...

Ausgabe

Inaktiv
Ihre generierten Videos erscheinen hier
Konfigurieren Sie Parameter und klicken Sie auf Ausführen, um mit der Generierung zu beginnen

Jede Ausführung kostet $0.03. Für $10 können Sie ca. 333 Mal ausführen.

Sie können fortfahren mit:

Parameter

Codebeispiel

import requests
import time

# Step 1: Start video generation
generate_url = "https://api.atlascloud.ai/api/v1/model/generateVideo"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer $ATLASCLOUD_API_KEY"
}
data = {
    "model": "atlascloud/infinitetalk",
    "prompt": "A beautiful sunset over the ocean with gentle waves",
    "width": 512,
    "height": 512,
    "duration": 3,
    "fps": 24,
}

generate_response = requests.post(generate_url, headers=headers, json=data)
generate_result = generate_response.json()
prediction_id = generate_result["data"]["id"]

# Step 2: Poll for result
poll_url = f"https://api.atlascloud.ai/api/v1/model/prediction/{prediction_id}"

def check_status():
    while True:
        response = requests.get(poll_url, headers={"Authorization": "Bearer $ATLASCLOUD_API_KEY"})
        result = response.json()

        if result["data"]["status"] in ["completed", "succeeded"]:
            print("Generated video:", result["data"]["outputs"][0])
            return result["data"]["outputs"][0]
        elif result["data"]["status"] == "failed":
            raise Exception(result["data"]["error"] or "Generation failed")
        else:
            # Still processing, wait 2 seconds
            time.sleep(2)

video_url = check_status()

Installieren

Installieren Sie das erforderliche Paket für Ihre Programmiersprache.

bash
pip install requests

Authentifizierung

Alle API-Anfragen erfordern eine Authentifizierung über einen API-Schlüssel. Sie können Ihren API-Schlüssel über das Atlas Cloud Dashboard erhalten.

bash
export ATLASCLOUD_API_KEY="your-api-key-here"

HTTP-Header

python
import os

API_KEY = os.environ.get("ATLASCLOUD_API_KEY")
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}
Schützen Sie Ihren API-Schlüssel

Geben Sie Ihren API-Schlüssel niemals in clientseitigem Code oder öffentlichen Repositories preis. Verwenden Sie stattdessen Umgebungsvariablen oder einen Backend-Proxy.

Anfrage senden

import requests

url = "https://api.atlascloud.ai/api/v1/model/generateVideo"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer $ATLASCLOUD_API_KEY"
}
data = {
    "model": "your-model",
    "prompt": "A beautiful landscape"
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Anfrage senden

Senden Sie eine asynchrone Generierungsanfrage. Die API gibt eine Vorhersage-ID zurück, mit der Sie den Status prüfen und das Ergebnis abrufen können.

POST/api/v1/model/generateVideo

Anfragekörper

import requests

url = "https://api.atlascloud.ai/api/v1/model/generateVideo"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer $ATLASCLOUD_API_KEY"
}

data = {
    "model": "atlascloud/infinitetalk",
    "input": {
        "prompt": "A beautiful sunset over the ocean with gentle waves"
    }
}

response = requests.post(url, headers=headers, json=data)
result = response.json()

print(f"Prediction ID: {result['id']}")
print(f"Status: {result['status']}")

Antwort

{
  "id": "pred_abc123",
  "status": "processing",
  "model": "model-name",
  "created_at": "2025-01-01T00:00:00Z"
}

Status prüfen

Fragen Sie den Vorhersage-Endpunkt ab, um den aktuellen Status Ihrer Anfrage zu überprüfen.

GET/api/v1/model/prediction/{prediction_id}

Abfrage-Beispiel

import requests
import time

prediction_id = "pred_abc123"
url = f"https://api.atlascloud.ai/api/v1/model/prediction/{prediction_id}"
headers = { "Authorization": "Bearer $ATLASCLOUD_API_KEY" }

while True:
    response = requests.get(url, headers=headers)
    result = response.json()
    status = result["data"]["status"]
    print(f"Status: {status}")

    if status in ["completed", "succeeded"]:
        output_url = result["data"]["outputs"][0]
        print(f"Output URL: {output_url}")
        break
    elif status == "failed":
        print(f"Error: {result['data'].get('error', 'Unknown')}")
        break

    time.sleep(3)

Statuswerte

processingDie Anfrage wird noch verarbeitet.
completedDie Generierung ist abgeschlossen. Ergebnisse sind verfügbar.
succeededDie Generierung war erfolgreich. Ergebnisse sind verfügbar.
failedDie Generierung ist fehlgeschlagen. Überprüfen Sie das Fehlerfeld.

Abgeschlossene Antwort

{
  "data": {
    "id": "pred_abc123",
    "status": "completed",
    "outputs": [
      "https://storage.atlascloud.ai/outputs/result.mp4"
    ],
    "metrics": {
      "predict_time": 45.2
    },
    "created_at": "2025-01-01T00:00:00Z",
    "completed_at": "2025-01-01T00:00:10Z"
  }
}

Dateien hochladen

Laden Sie Dateien in den Atlas Cloud Speicher hoch und erhalten Sie eine URL, die Sie in Ihren API-Anfragen verwenden können. Verwenden Sie multipart/form-data zum Hochladen.

POST/api/v1/model/uploadMedia

Upload-Beispiel

import requests

url = "https://api.atlascloud.ai/api/v1/model/uploadMedia"
headers = { "Authorization": "Bearer $ATLASCLOUD_API_KEY" }

with open("image.png", "rb") as f:
    files = {"file": ("image.png", f, "image/png")}
    response = requests.post(url, headers=headers, files=files)

result = response.json()
download_url = result["data"]["download_url"]
print(f"File URL: {download_url}")

Antwort

{
  "data": {
    "download_url": "https://storage.atlascloud.ai/uploads/abc123/image.png",
    "file_name": "image.png",
    "content_type": "image/png",
    "size": 1024000
  }
}

Eingabe-Schema

Die folgenden Parameter werden im Anfragekörper akzeptiert.

Gesamt: 0Erforderlich: 0Optional: 0

Keine Parameter verfügbar.

Beispiel-Anfragekörper

json
{
  "model": "atlascloud/infinitetalk"
}

Ausgabe-Schema

Die API gibt eine Vorhersage-Antwort mit den generierten Ausgabe-URLs zurück.

idstringrequired
Unique identifier for the prediction.
statusstringrequired
Current status of the prediction.
processingcompletedsucceededfailed
modelstringrequired
The model used for generation.
outputsarray[string]
Array of output URLs. Available when status is "completed".
errorstring
Error message if status is "failed".
metricsobject
Performance metrics.
predict_timenumber
Time taken for video generation in seconds.
created_atstringrequired
ISO 8601 timestamp when the prediction was created.
Format: date-time
completed_atstring
ISO 8601 timestamp when the prediction was completed.
Format: date-time

Beispielantwort

json
{
  "id": "pred_abc123",
  "status": "completed",
  "model": "model-name",
  "outputs": [
    "https://storage.atlascloud.ai/outputs/result.mp4"
  ],
  "metrics": {
    "predict_time": 45.2
  },
  "created_at": "2025-01-01T00:00:00Z",
  "completed_at": "2025-01-01T00:00:10Z"
}

Atlas Cloud Skills

Atlas Cloud Skills integriert über 300 KI-Modelle direkt in Ihren KI-Coding-Assistenten. Ein Befehl zur Installation, dann verwenden Sie natürliche Sprache, um Bilder, Videos zu generieren und mit LLMs zu chatten.

Unterstützte Clients

Claude Code
OpenAI Codex
Gemini CLI
Cursor
Windsurf
VS Code
Trae
GitHub Copilot
Cline
Roo Code
Amp
Goose
Replit
40+ unterstützte clients

Installieren

bash
npx skills add AtlasCloudAI/atlas-cloud-skills

API-Schlüssel einrichten

Erhalten Sie Ihren API-Schlüssel über das Atlas Cloud Dashboard und setzen Sie ihn als Umgebungsvariable.

bash
export ATLASCLOUD_API_KEY="your-api-key-here"

Funktionen

Nach der Installation können Sie natürliche Sprache in Ihrem KI-Assistenten verwenden, um auf alle Atlas Cloud Modelle zuzugreifen.

BildgenerierungGenerieren Sie Bilder mit Modellen wie Nano Banana 2, Z-Image und mehr.
VideoerstellungErstellen Sie Videos aus Text oder Bildern mit Kling, Vidu, Veo usw.
LLM-ChatChatten Sie mit Qwen, DeepSeek und anderen großen Sprachmodellen.
Medien-UploadLaden Sie lokale Dateien für Bildbearbeitung und Bild-zu-Video-Workflows hoch.

MCP-Server

Der Atlas Cloud MCP-Server verbindet Ihre IDE mit über 300 KI-Modellen über das Model Context Protocol. Funktioniert mit jedem MCP-kompatiblen Client.

Unterstützte Clients

Cursor
VS Code
Windsurf
Claude Code
OpenAI Codex
Gemini CLI
Cline
Roo Code
100+ unterstützte clients

Installieren

bash
npx -y atlascloud-mcp

Konfiguration

Fügen Sie die folgende Konfiguration zur MCP-Einstellungsdatei Ihrer IDE hinzu.

json
{
  "mcpServers": {
    "atlascloud": {
      "command": "npx",
      "args": [
        "-y",
        "atlascloud-mcp"
      ],
      "env": {
        "ATLASCLOUD_API_KEY": "your-api-key-here"
      }
    }
  }
}

Verfügbare Werkzeuge

atlas_generate_imageGenerieren Sie Bilder aus Textbeschreibungen.
atlas_generate_videoErstellen Sie Videos aus Text oder Bildern.
atlas_chatChatten Sie mit großen Sprachmodellen.
atlas_list_modelsDurchsuchen Sie über 300 verfügbare KI-Modelle.
atlas_quick_generateInhaltserstellung in einem Schritt mit automatischer Modellauswahl.
atlas_upload_mediaLaden Sie lokale Dateien für API-Workflows hoch.

API-Schema

Schema nicht verfügbar

Anmelden, um Anfrageverlauf anzuzeigen

Sie müssen angemeldet sein, um auf Ihren Modellanfrageverlauf zuzugreifen.

Anmelden

InfiniteTalk: Audio-Driven Talking Video Generation

1. Introduction

InfiniteTalk is an audio-driven video generation model developed by AtlasCloud that transforms a single portrait image into a realistic talking-head video synchronized to any speech audio input. Built on a modified Wan2.1 I2V-14B diffusion transformer backbone with a dedicated audio cross-attention module, InfiniteTalk achieves phoneme-level lip synchronization while preserving the subject's identity, hairstyle, clothing, and background throughout the entire video.

InfiniteTalk's core innovation lies in its triple cross-attention architecture: each transformer block processes visual self-attention, text prompt cross-attention, and frame-level audio cross-attention in sequence, enabling precise per-frame audio-visual alignment. Combined with a streaming inference pipeline that processes video in overlapping segments, InfiniteTalk supports continuous video generation of up to 10 minutes from a single request — far exceeding the typical 5–15 second limit of conventional image-to-video models. The model also supports dual-person mode, animating two speakers simultaneously within the same frame using separate audio tracks and bounding box annotations.

2. Key Features & Innovations

Triple Cross-Attention Audio Conditioning: Unlike text-only conditioned video models, InfiniteTalk injects audio embeddings at every transformer block via a dedicated cross-attention layer. Audio features are extracted frame-by-frame using a Wav2Vec2 encoder, providing per-frame speech signal anchoring that drives natural mouth movements, facial micro-expressions, and head motion synchronized to the audio input.

Streaming Long-Form Video Generation: InfiniteTalk's streaming mode processes audio in overlapping clip segments with configurable motion frame overlap, automatically concatenating segments into seamless long-form video. This enables generation of minutes-long talking videos without quality degradation or identity drift — a capability not available in standard image-to-video pipelines limited to single-shot outputs.

High-Fidelity Identity Preservation: The model maintains consistent facial identity, hairstyle, clothing texture, and background composition across the entire generated video. The audio conditioning signal provides strong per-frame constraints that prevent the identity drift commonly observed in long unconditional video generation.

Dual-Person Conversation Mode: InfiniteTalk supports animating two speakers in a single scene by accepting separate audio tracks and bounding box coordinates for each person. This enables realistic conversation scenarios, interview formats, and dialogue-driven content without requiring separate generation passes or post-production compositing.

Flexible Input Modalities: The model accepts either a static portrait image or a reference video as the visual source, combined with audio in WAV or MP3 format. Text prompts provide additional guidance for expression style, posture, and behavioral nuance, giving creators fine-grained control over the generated output.

Conditional VSR Upscaling: When generating at 720p resolution with audio duration under 60 seconds, InfiniteTalk automatically routes output through a FlashVSR super-resolution pipeline, delivering enhanced visual clarity without additional user configuration or cost management.

3. Model Architecture & Technical Details

InfiniteTalk is built on the Wan2.1 I2V-14B foundation model (14 billion parameters, 480p native resolution) with custom InfiniteTalk adapter weights that introduce the audio cross-attention pathway. The audio encoder uses a Chinese-Wav2Vec2-Base model that extracts frame-aligned speech embeddings at 25 fps video rate, creating a one-to-one correspondence between audio features and generated video frames.

The inference pipeline operates in two modes. In clip mode, the model generates a single video segment of up to 81 frames (approximately 3.2 seconds at 25 fps), suitable for short-form content. In streaming mode, the model iteratively generates overlapping clips with a configurable motion frame overlap (default: 9 frames), seamlessly blending segments to produce arbitrarily long video bounded only by the input audio duration and a configurable maximum frame limit.

The diffusion process uses a configurable number of denoising steps (default: 40, tunable from 1–100) with TeaCache acceleration for improved throughput. On NVIDIA H200 hardware, each 81-frame clip requires approximately 3.5 minutes of processing time, yielding a generation-to-output ratio of roughly 10–30× depending on resolution and hardware load.

For 720p output, the system employs a two-stage pipeline: base generation at 480p followed by conditional FlashVSR 4× upscaling (target: 921,600 pixels at 25 fps), applied automatically when audio duration is 60 seconds or less.

4. Performance Highlights

InfiniteTalk addresses a specific niche — audio-driven talking-head video — that differs from general-purpose text-to-video or image-to-video models. Its performance should be evaluated primarily on lip-sync accuracy, identity consistency, and long-form stability rather than visual diversity or cinematic motion range.

CapabilityInfiniteTalkGeneral I2V ModelsDedicated Lip-Sync Tools
Lip-sync accuracyPhoneme-level, multi-languageN/A (no audio input)Word-level, often English-only
Maximum durationUp to 10 minutes (streaming)5–15 seconds typical30–60 seconds typical
Identity preservationHigh (audio-anchored per-frame)Moderate (drift in longer clips)Moderate
Dual-person supportNativeNot availableRare
Resolution480p native, 720p with VSRUp to 1080pVaries
Audio inputAny language WAV/MP3N/AUsually English TTS

InfiniteTalk achieves strong lip-sync fidelity across Chinese, English, Japanese, and other languages tested, owing to the language-agnostic Wav2Vec2 audio feature extraction. Identity drift is minimal even in 5+ minute generations due to the per-frame audio conditioning anchor.

5. Intended Use & Applications

Digital Avatar & Virtual Presenter: Create realistic talking-head videos for virtual hosts, AI assistants, and digital spokespersons using a single photo and recorded or synthesized speech audio.

Video Dubbing & Localization: Generate lip-synced video from translated audio tracks, enabling cost-effective multilingual content adaptation without re-filming or manual lip-sync editing.

Online Education & Training: Produce instructor-led video content at scale from lecture audio recordings and a single instructor photograph, reducing video production costs for e-learning platforms.

Podcast & Interview Visualization: Transform audio-only podcast or interview recordings into engaging video content with realistic speaker animations, suitable for social media distribution.

Customer Service & Chatbot Video: Generate personalized video responses driven by TTS audio output, enabling human-like video communication in automated customer interaction flows.

Social Media Content at Scale: Rapidly produce talking-head content for influencer accounts, news summaries, or commentary formats using text-to-speech pipelines combined with InfiniteTalk video generation.

Beginnen Sie mit 300+ Modellen,

Alle Modelle erkunden