xAI TTS v1 API by xAI

xAI TTS v1 — Text to Speech

Developer: xAI
Model ID: xai/tts-v1
Release Date: April 2026

Overview

xAI TTS v1 is a high-fidelity text-to-speech model developed by xAI, the company behind Grok. It converts text into natural, expressive speech with sub-second latency, supporting 20 languages and a rich voice library of 80+ voices. The model features fine-grained delivery control through inline speech tags, multiple audio output formats for both streaming and telephony use cases, and a voice cloning capability that can produce a production-ready custom voice from roughly a minute of reference audio.

xAI TTS v1 is designed for production workloads such as customer service bots, content narration, accessibility tools, and real-time voice agents. It is priced per character and offers enterprise-grade compliance.

Key Capabilities

  • Expressive delivery control — 14 instant speech tags and 13 wrapping style tags for pauses, laughter, whispers, pitch shifts, speed changes, and more.
  • 80+ voices across 20 languages — Five universal multilingual voices plus language-optimized voices for Chinese, Russian, Italian, French, Spanish, Hindi, Japanese, Korean, Portuguese, German, Dutch, Polish, Turkish, Arabic, Vietnamese, Thai, Danish, Swedish, Finnish, and English.
  • Flexible audio output — MP3, WAV, PCM, μ-law, and A-law codecs with sample rates from 8 kHz (telephony) to 48 kHz (studio).
  • Voice cloning — Record up to 120 seconds of reference audio; a custom voice_id is ready in under 2 minutes at no additional cost.
  • Streaming and batch modes — Both a standard REST endpoint and a WebSocket streaming endpoint for real-time audio delivery.
  • Automatic language detection — Set language: "auto" to let the model identify the input language.
  • Privacy-first — Audio is never stored or used for model training.

Use Cases

  • Customer service / IVR — Low-latency telephony output (μ-law/A-law at 8 kHz) for automated voice response systems.
  • Content narration — Podcast episodes, audiobooks, and video voiceovers with expressive, human-like delivery.
  • Accessibility — Real-time screen readers and reading assistants for visually impaired users.
  • Real-time voice agents — Paired with xAI's Speech-to-Text and Voice Agent APIs for end-to-end conversational AI.
  • Multilingual applications — Localized content delivery across 20 languages from a single API.
  • Custom brand voices — Clone a spokesperson's voice to maintain consistent audio identity at scale.

Voices

Multilingual Voices

These five voices support all 20 available languages.

| Voice ID | Name | Gender | Character |
| --- | --- | --- | --- |
| eve | Eve | Female | Energetic, upbeat (default voice) |
| ara | Ara | Female | Warm and conversational |
| leo | Leo | Male | Authoritative, instructional |
| rex | Rex | Male | Professional, business tone |
| sal | Sal | Male | Versatile, neutral |

Language-Optimized Voices (selected)

Language-specific voices are optimized for their native language and automatically lock the language field when selected.

| Language | Voices |
| --- | --- |
| English | Grace (F), Claire (F), James (M), Daniel (M) |
| Chinese (Mandarin) | Jian (M), Hao (F), Xia (F) |
| Russian | Pavel (M), Andrei (M), Dmitri (M), Irina (F) |
| French | Remi (M), Hugo (M), Camille (F) |
| Spanish | Manuel (M), Javier (M), Diego (M), Andres (M) |
| Italian | Enzo (M), Matteo (M), Alessandro (M), Luca (F) |
| German | Moritz (M), Niklas (M), Clara (F), Lena (F) |
| Japanese | Ren (M), Sakura (F) |
| Korean | Jun-seo (M), Min-jun (M), Seo-yeon (F), Ji-yeon (F) |
| Portuguese | Mateus (M), Rafael (M), Beatriz (F) |
| Hindi | Karan (M), Ananya (F) |
| Arabic | Khalid (M), Tariq (M), Layla (F) |
| Turkish | Emre (M), Kaan (M), Aylin (F) |
| Dutch | Thijs (M), Ruben (M), Femke (F), Noor (F) |
| Polish | Mateusz (M), Jakub (M), Katarzyna (F), Aleksandra (F) |
| Vietnamese | Duc (M), Minh (M), Mai (F) |
| Swedish | Axel (M), Erik (M), Saga (F) |
| Finnish | Valtteri (M), Eero (M), Helmi (F), Elina (F) |
| Thai | Krit (M), Aroon (M) |
| Danish | Lars (M), Kasper (M), Ida (F) |

The complete voice library, including custom cloned voices, is accessible via the voice_id parameter.

Supported Languages

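| Code | Language |
| --- | --- |
| auto | Auto Detect |
| en | English |
| zh | Chinese (Mandarin) |
| ar-EG | Arabic (Egypt) |
| ar-SA | Arabic (Saudi Arabia) |
| ar-AE | Arabic (UAE) |
| bn | Bengali |
| fr | French |
| de | German |
| hi | Hindi |
| id | Indonesian |
| it | Italian |
| ja | Japanese |
| ko | Korean |
| pt-BR | Portuguese (Brazil) |
| pt-PT | Portuguese (Portugal) |
| ru | Russian |
| es-MX | Spanish (Mexico) |
| es-ES | Spanish (Spain) |
| tr | Turkish |
| vi | Vietnamese |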

Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | xai/tts-v1 | Model identifier. Required. |
| text | string | (none) | Text to synthesize. Maximum 15,000 characters. Supports inline speech tags. Required. |
| language | string | auto | BCP-47 language code or "auto" for automatic detection. Required. |
| voice_id | string | eve | Voice identifier. Multilingual voices support all languages; language-optimized voices lock to their native language. |
| codec | string | mp3 | Output audio codec: mp3, wav, pcm, mulaw, alaw. |
| sample_rate | integer | 24000 | Sample rate in Hz. Options: 8000, 16000, 22050, 24000, 44100, 48000. |
| bit_rate | integer | 128000 | Bit rate in bps. MP3 only. Options: 32000, 64000, 96000, 128000, 192000. |
| speed | number | 1.0 | Playback speed multiplier. Range: 0.7 (slower) to 1.5 (faster). |
| text_normalization | boolean | false | When true, converts written-form text (numbers, abbreviations, symbols) to spoken form before synthesis. |
| optimize_streaming_latency | integer | 0 | Streaming latency trade-off: 0 = best quality, 1 = lower first-chunk latency, 2 = lowest first-chunk latency. |
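
The parameters above can be checked on the client before a request is sent. The following is a minimal sketch, not an official SDK: the `TTSRequest` class and its `to_payload` method are hypothetical names, and the endpoint URL and authentication scheme are not covered here because this page does not specify them. It assumes the 15,000-character limit applies to the raw `text` string, tags included, which is the conservative reading.

```python
from dataclasses import dataclass, asdict

# Allowed values copied from the parameter list above.
CODECS = {"mp3", "wav", "pcm", "mulaw", "alaw"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}
BIT_RATES = {32000, 64000, 96000, 128000, 192000}
MAX_CHARS = 15_000


@dataclass
class TTSRequest:
    """Hypothetical container for the documented request parameters."""
    text: str
    model: str = "xai/tts-v1"
    language: str = "auto"
    voice_id: str = "eve"
    codec: str = "mp3"
    sample_rate: int = 24000
    bit_rate: int = 128000
    speed: float = 1.0
    text_normalization: bool = False
    optimize_streaming_latency: int = 0

    def to_payload(self) -> dict:
        """Validate against the documented limits and return a JSON-ready dict."""
        # Assumption: the character limit is checked on the raw string.
        if not self.text or len(self.text) > MAX_CHARS:
            raise ValueError(f"text must be 1 to {MAX_CHARS} characters")
        if self.codec not in CODECS:
            raise ValueError(f"unsupported codec: {self.codec}")
        if self.sample_rate not in SAMPLE_RATES:
            raise ValueError(f"unsupported sample_rate: {self.sample_rate}")
        if not 0.7 <= self.speed <= 1.5:
            raise ValueError("speed must be between 0.7 and 1.5")
        if self.optimize_streaming_latency not in (0, 1, 2):
            raise ValueError("optimize_streaming_latency must be 0, 1, or 2")

        payload = asdict(self)
        if self.codec == "mp3":
            if self.bit_rate not in BIT_RATES:
                raise ValueError(f"unsupported bit_rate: {self.bit_rate}")
        else:
            # bit_rate is documented as MP3 only, so drop it for other codecs.
            payload.pop("bit_rate")
        return payload
```

Calling `TTSRequest(text="Hello").to_payload()` yields a dict with every documented default filled in, which can then be serialized with `json.dumps` and sent with whatever HTTP client the integration uses.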

Audio Format Recommendations

| Use Case | Codec | Sample Rate | Bit Rate |
| --- | --- | --- | --- |
| Standard playback (default) | mp3 | 24,000 Hz | 128,000 bps |
| Studio / high-fidelity | mp3 or wav | 44,100 or 48,000 Hz | 192,000 bps |
| Telephony / IVR | mulaw or alaw | 8,000 Hz | n/a |
| Streaming (web) | mp3 | 22,050 Hz | 64,000 bps |
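
The recommendations above can be kept as named presets so that callers pick a use case instead of remembering individual values. This is a sketch only: the preset names and the `audio_settings` helper are hypothetical, and where the recommendation offers a choice (mp3 or wav, 44,100 or 48,000 Hz) the sketch arbitrarily picks mp3 at 48,000 Hz for studio and mulaw for telephony.

```python
# Presets mirroring the recommendations above; the key names are illustrative.
PRESETS = {
    "standard": {"codec": "mp3", "sample_rate": 24000, "bit_rate": 128000},
    "studio": {"codec": "mp3", "sample_rate": 48000, "bit_rate": 192000},
    "telephony": {"codec": "mulaw", "sample_rate": 8000},
    "web_streaming": {"codec": "mp3", "sample_rate": 22050, "bit_rate": 64000},
}


def audio_settings(use_case: str, **overrides) -> dict:
    """Return codec settings for a use case, with optional overrides."""
    if use_case not in PRESETS:
        raise ValueError(f"unknown use case: {use_case}")
    settings = dict(PRESETS[use_case])  # copy so the preset stays unchanged
    settings.update(overrides)
    if settings["codec"] != "mp3":
        # bit_rate is documented as MP3 only.
        settings.pop("bit_rate", None)
    return settings
```

For example, `audio_settings("telephony", codec="alaw")` switches the telephony preset to A-law while keeping the 8,000 Hz sample rate.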

Inline Speech Tags

Speech tags let you control delivery at a granular level within the text field.

Instant Tags

Insert at any point in the text to trigger an immediate vocal event.

| Tag | Effect |
| --- | --- |
| [pause] | Brief pause |
| [long-pause] | Extended pause |
| [hum-tune] | Soft humming sound |
| [laugh] | Full laugh |
| [chuckle] | Short chuckle |
| [giggle] | Light giggle |
| [cry] | Crying sound |
| [tsk] | Disapproving tsk |
| [tongue-click] | Tongue click |
| [lip-smack] | Lip-smack sound |
| [breath] | Audible breath |
| [inhale] | Inhalation |
| [exhale] | Exhalation |
| [sigh] | Sigh |

Wrapping Style Tags

Wrap text to apply a delivery style to the enclosed span.

| Tag | Effect |
| --- | --- |
| <soft>…</soft> | Softer, quieter delivery |
| <whisper>…</whisper> | Whispered speech |
| <loud>…</loud> | Louder, more projected delivery |
| <build-intensity>…</build-intensity> | Gradually increasing intensity |
| <decrease-intensity>…</decrease-intensity> | Gradually decreasing intensity |
| <higher-pitch>…</higher-pitch> | Raised pitch |
| <lower-pitch>…</lower-pitch> | Lowered pitch |
| <slow>…</slow> | Slower delivery |
| <fast>…</fast> | Faster delivery |
| <sing-song>…</sing-song> | Melodic, sing-song pattern |
| <singing>…</singing> | Full singing mode |
| <laugh-speak>…</laugh-speak> | Spoken with laughter mixed in |
| <emphasis>…</emphasis> | Stressed emphasis on words |

Example:

"Welcome back! [pause] <whisper>Just between us</whisper>, the sale ends tonight. [laugh] Don't miss it!"

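Because tags are typed by hand inside the text, a misspelled or unclosed tag is easy to miss. The sketch below checks a string against the instant and wrapping tags listed above before it is submitted. The `check_tags` function is a hypothetical helper, and it assumes that wrapping tags must be closed in the reverse order they were opened; this page does not say how the service itself treats nesting or unknown tags.

```python
import re

INSTANT_TAGS = {
    "pause", "long-pause", "hum-tune", "laugh", "chuckle", "giggle", "cry",
    "tsk", "tongue-click", "lip-smack", "breath", "inhale", "exhale", "sigh",
}
WRAPPING_TAGS = {
    "soft", "whisper", "loud", "build-intensity", "decrease-intensity",
    "higher-pitch", "lower-pitch", "slow", "fast", "sing-song", "singing",
    "laugh-speak", "emphasis",
}

# Matches [name], <name>, or </name> where name is lowercase with hyphens.
_TOKEN = re.compile(r"\[([a-z-]+)\]|<(/?)([a-z-]+)>")


def check_tags(text: str) -> list[str]:
    """Return a list of tag problems; an empty list means none were found."""
    problems = []
    open_tags = []  # stack of wrapping tags that are currently open
    for match in _TOKEN.finditer(text):
        instant, closing, wrap = match.groups()
        if instant is not None:
            if instant not in INSTANT_TAGS:
                problems.append(f"unknown instant tag [{instant}]")
        elif wrap not in WRAPPING_TAGS:
            problems.append(f"unknown wrapping tag <{closing}{wrap}>")
        elif not closing:
            open_tags.append(wrap)
        elif open_tags and open_tags[-1] == wrap:
            open_tags.pop()
        else:
            problems.append(f"unexpected closing tag </{wrap}>")
    problems.extend(f"unclosed tag <{name}>" for name in open_tags)
    return problems
```

The example sentence above passes this check with no problems reported. Note that ordinary bracketed words in the text, such as an editorial note in square brackets, would be reported as unknown instant tags, so the result is best treated as a warning rather than a hard failure.
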
Rate Limits

| Limit | Value |
| --- | --- |
| Requests per minute (RPM) | 3,000 |
| Requests per second (RPS) | 50 |
| Concurrent sessions per team | 100 |
| Maximum characters per request | 15,000 |

Pricing

Pricing is based on the number of characters in the input text field.

SKU

| SKU | Description | Unit Price |
| --- | --- | --- |
| sku_per_1k_chars | Per 1,000 characters synthesized | $0.015 |

Formula

cost = countChars(text) / 1000 × sku_per_1k_chars

Where countChars(text) counts the total number of characters in the input text (including spaces and punctuation, excluding inline speech tags consumed by the model).

The effective rate is $15.00 per 1,000,000 characters.

Examples

| Input Length | Characters | Cost |
| --- | --- | --- |
| Short sentence | 100 chars | $0.0015 |
| One paragraph | 500 chars | $0.0075 |
| Article excerpt | 2,000 chars | $0.0300 |
| Maximum single request | 15,000 chars | $0.2250 |
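
The pricing formula can be applied locally to estimate cost before sending text. The sketch below uses hypothetical helpers `billable_chars` and `estimate_cost`. It approximates "excluding inline speech tags" by stripping anything shaped like `[tag]`, `<tag>`, or `</tag>` with a regular expression; the service's actual counting rules may differ, so treat the result as an estimate rather than an invoice.

```python
import re
from decimal import Decimal

PRICE_PER_1K_CHARS = Decimal("0.015")  # sku_per_1k_chars, in USD

# Assumption: anything shaped like a speech tag is not billed.
_TAG = re.compile(r"\[[a-z-]+\]|</?[a-z-]+>")


def billable_chars(text: str) -> int:
    """Count characters (spaces and punctuation included) after removing tags."""
    return len(_TAG.sub("", text))


def estimate_cost(text: str) -> Decimal:
    """Apply cost = countChars(text) / 1000 x sku_per_1k_chars."""
    return Decimal(billable_chars(text)) / 1000 * PRICE_PER_1K_CHARS
```

`Decimal` is used instead of `float` so that small amounts such as $0.0015 are represented exactly. With plain text of 100, 500, 2,000, and 15,000 characters, `estimate_cost` reproduces the four costs listed in the examples above.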

Custom cloned voices are billed at the same per-character rate as built-in voices — there is no additional charge for using a voice_id from the Voice Library.

Enterprise Features

  • SOC 2 Type II certified
  • HIPAA eligibility with Business Associate Agreement (BAA)
  • GDPR compliance with data residency options
  • SAML SSO and role-based access control (RBAC)
  • Multi-region infrastructure with custom SLAs available
  • No data retention — audio is never stored or used for model training
