
xAI TTS v1 API by xAI
xAI TTS v1 — Text to Speech
Developer: xAI
Model ID: xai/tts-v1
Release Date: April 2026
Overview
xAI TTS v1 is a high-fidelity text-to-speech model developed by xAI, the company behind Grok. It converts text into natural, expressive speech with sub-second latency, supporting 20 languages and a rich voice library of 80+ voices. The model features fine-grained delivery control through inline speech tags, multiple audio output formats for both streaming and telephony use cases, and a voice cloning capability that can produce a production-ready custom voice from roughly a minute of reference audio.
xAI TTS v1 is designed for real-world production workloads — customer service bots, content narration, accessibility tools, and real-time voice agents — offering competitive per-character pricing with enterprise-grade compliance.
Key Capabilities
- Expressive delivery control — 14 instant speech tags and 13 wrapping style tags for pauses, laughter, whispers, pitch shifts, speed changes, and more.
- 80+ voices across 20 languages — Five universal multilingual voices plus language-optimized voices for Chinese, Russian, Italian, French, Spanish, Hindi, Japanese, Korean, Portuguese, German, Dutch, Polish, Turkish, Arabic, Vietnamese, Thai, Danish, Swedish, Finnish, and English.
- Flexible audio output — MP3, WAV, PCM, μ-law, and A-law codecs with sample rates from 8 kHz (telephony) to 48 kHz (studio).
- Voice cloning — Record up to 120 seconds of reference audio; a custom `voice_id` is ready in under 2 minutes at no additional cost.
- Streaming and batch modes — Both a standard REST endpoint and a WebSocket streaming endpoint for real-time audio delivery.
- Automatic language detection — Set `language: "auto"` to let the model identify the input language.
- Privacy-first — Audio is never stored or used for model training.
Use Cases
- Customer service / IVR — Low-latency telephony output (μ-law/A-law at 8 kHz) for automated voice response systems.
- Content narration — Podcast episodes, audiobooks, and video voiceovers with expressive, human-like delivery.
- Accessibility — Real-time screen readers and reading assistants for visually impaired users.
- Real-time voice agents — Paired with xAI's Speech-to-Text and Voice Agent APIs for end-to-end conversational AI.
- Multilingual applications — Localized content delivery across 20 languages from a single API.
- Custom brand voices — Clone a spokesperson's voice to maintain consistent audio identity at scale.
Voices
Multilingual Voices
These five voices support all 20 available languages.
| Voice ID | Name | Gender | Character |
|---|---|---|---|
| `eve` | Eve | Female | Energetic, upbeat — default voice |
| `ara` | Ara | Female | Warm and conversational |
| `leo` | Leo | Male | Authoritative, instructional |
| `rex` | Rex | Male | Professional, business tone |
| `sal` | Sal | Male | Versatile, neutral |
Language-Optimized Voices (selected)
Language-specific voices are optimized for their native language and automatically lock the language field when selected.
| Language | Voices |
|---|---|
| English | Grace (F), Claire (F), James (M), Daniel (M) |
| Chinese (Mandarin) | Jian (M), Hao (F), Xia (F) |
| Russian | Pavel (M), Andrei (M), Dmitri (M), Irina (F) |
| French | Remi (M), Hugo (M), Camille (F) |
| Spanish | Manuel (M), Javier (M), Diego (M), Andres (M) |
| Italian | Enzo (M), Matteo (M), Alessandro (M), Luca (F) |
| German | Moritz (M), Niklas (M), Clara (F), Lena (F) |
| Japanese | Ren (M), Sakura (F) |
| Korean | Jun-seo (M), Min-jun (M), Seo-yeon (F), Ji-yeon (F) |
| Portuguese | Mateus (M), Rafael (M), Beatriz (F) |
| Hindi | Karan (M), Ananya (F) |
| Arabic | Khalid (M), Tariq (M), Layla (F) |
| Turkish | Emre (M), Kaan (M), Aylin (F) |
| Dutch | Thijs (M), Ruben (M), Femke (F), Noor (F) |
| Polish | Mateusz (M), Jakub (M), Katarzyna (F), Aleksandra (F) |
| Vietnamese | Duc (M), Minh (M), Mai (F) |
| Swedish | Axel (M), Erik (M), Saga (F) |
| Finnish | Valtteri (M), Eero (M), Helmi (F), Elina (F) |
| Thai | Krit (M), Aroon (M) |
| Danish | Lars (M), Kasper (M), Ida (F) |
The complete voice library, including custom cloned voices, is accessible via the `voice_id` parameter.
Supported Languages
| Code | Language |
|---|---|
| `auto` | Auto Detect |
| `en` | English |
| `zh` | Chinese (Mandarin) |
| `ar-EG` | Arabic (Egypt) |
| `ar-SA` | Arabic (Saudi Arabia) |
| `ar-AE` | Arabic (UAE) |
| `bn` | Bengali |
| `fr` | French |
| `de` | German |
| `hi` | Hindi |
| `id` | Indonesian |
| `it` | Italian |
| `ja` | Japanese |
| `ko` | Korean |
| `pt-BR` | Portuguese (Brazil) |
| `pt-PT` | Portuguese (Portugal) |
| `ru` | Russian |
| `es-MX` | Spanish (Mexico) |
| `es-ES` | Spanish (Spain) |
| `tr` | Turkish |
| `vi` | Vietnamese |
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | string | `xai/tts-v1` | Model identifier. Required. |
| `text` | string | — | Text to synthesize. Maximum 15,000 characters. Supports inline speech tags. Required. |
| `language` | string | `auto` | BCP-47 language code or `"auto"` for automatic detection. Required. |
| `voice_id` | string | `eve` | Voice identifier. Multilingual voices support all languages; language-optimized voices lock to their native language. |
| `codec` | string | `mp3` | Output audio codec: `mp3`, `wav`, `pcm`, `mulaw`, `alaw`. |
| `sample_rate` | integer | 24000 | Sample rate in Hz. Options: 8000, 16000, 22050, 24000, 44100, 48000. |
| `bit_rate` | integer | 128000 | Bit rate in bps. MP3 only. Options: 32000, 64000, 96000, 128000, 192000. |
| `speed` | number | 1.0 | Playback speed multiplier. Range: 0.7 (slower) to 1.5 (faster). |
| `text_normalization` | boolean | false | When true, converts written-form text (numbers, abbreviations, symbols) to spoken form before synthesis. |
| `optimize_streaming_latency` | integer | 0 | Streaming latency trade-off: 0 = best quality, 1 = lower first-chunk latency, 2 = lowest first-chunk latency. |
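The constraints in the parameter table can be checked client-side before a request is sent. The following is a minimal sketch, not an official client: `build_tts_payload` is a hypothetical helper name, the endpoint URL and authentication scheme are not given on this page and are therefore left out, and measuring the 15,000-character limit on the raw string (tags included) is a conservative assumption.

```python
# Allowed values copied from the Input Parameters table.
CODECS = {"mp3", "wav", "pcm", "mulaw", "alaw"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 44100, 48000}
BIT_RATES = {32000, 64000, 96000, 128000, 192000}
MAX_TEXT_CHARS = 15_000


def build_tts_payload(text, language="auto", voice_id="eve", codec="mp3",
                      sample_rate=24000, bit_rate=128000, speed=1.0,
                      text_normalization=False, optimize_streaming_latency=0):
    """Validate arguments and return a JSON-serializable request body.

    Hypothetical helper: only the field names and value ranges come from
    the documentation; how the body is sent is up to the caller.
    """
    if not text:
        raise ValueError("text is required")
    # Assumption: the limit is applied to the raw text, tags included.
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"text exceeds {MAX_TEXT_CHARS} characters")
    if codec not in CODECS:
        raise ValueError(f"unsupported codec: {codec}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")
    if not 0.7 <= speed <= 1.5:
        raise ValueError("speed must be between 0.7 and 1.5")
    if optimize_streaming_latency not in (0, 1, 2):
        raise ValueError("optimize_streaming_latency must be 0, 1, or 2")

    payload = {
        "model": "xai/tts-v1",
        "text": text,
        "language": language,
        "voice_id": voice_id,
        "codec": codec,
        "sample_rate": sample_rate,
        "speed": speed,
        "text_normalization": text_normalization,
        "optimize_streaming_latency": optimize_streaming_latency,
    }
    # bit_rate is documented as MP3 only, so it is omitted for other codecs.
    if codec == "mp3":
        if bit_rate not in BIT_RATES:
            raise ValueError(f"unsupported bit_rate: {bit_rate}")
        payload["bit_rate"] = bit_rate
    return payload
```

Calling `build_tts_payload("Hello")` yields a body that uses every documented default, while `build_tts_payload("Hello", codec="mulaw", sample_rate=8000)` produces a telephony-style body without `bit_rate`.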
Audio Format Recommendations
| Use Case | Codec | Sample Rate | Bit Rate |
|---|---|---|---|
| Standard playback (default) | mp3 | 24,000 Hz | 128,000 bps |
| Studio / high-fidelity | mp3 or wav | 44,100 or 48,000 Hz | 192,000 bps |
| Telephony / IVR | mulaw or alaw | 8,000 Hz | — |
| Streaming (web) | mp3 | 22,050 Hz | 64,000 bps |
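One way to apply these recommendations is to keep them as named presets in client code. This is an illustrative sketch: the preset names and the `audio_settings` helper are hypothetical, and where the table offers a choice (studio: mp3 or wav, 44,100 or 48,000 Hz; telephony: mulaw or alaw) the preset arbitrarily picks one option that callers can override.

```python
# Presets mirroring the Audio Format Recommendations table.
# Preset names are illustrative, not part of the API.
AUDIO_PRESETS = {
    "standard": {"codec": "mp3", "sample_rate": 24000, "bit_rate": 128000},
    "studio": {"codec": "mp3", "sample_rate": 48000, "bit_rate": 192000},
    "telephony": {"codec": "mulaw", "sample_rate": 8000},
    "web_streaming": {"codec": "mp3", "sample_rate": 22050, "bit_rate": 64000},
}


def audio_settings(use_case, **overrides):
    """Return codec settings for a use case, with optional overrides."""
    if use_case not in AUDIO_PRESETS:
        raise KeyError(f"unknown use case: {use_case}")
    settings = {**AUDIO_PRESETS[use_case], **overrides}
    # bit_rate is documented as MP3 only, so drop it for other codecs.
    if settings["codec"] != "mp3":
        settings.pop("bit_rate", None)
    return settings
```

For example, `audio_settings("telephony", codec="alaw")` switches the telephony preset to A-law while keeping the 8 kHz sample rate.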
Inline Speech Tags
Speech tags let you control delivery at a granular level within the text field.
Instant Tags
Insert at any point in the text to trigger an immediate vocal event.
| Tag | Effect |
|---|---|
| `[pause]` | Brief pause |
| `[long-pause]` | Extended pause |
| `[hum-tune]` | Soft humming sound |
| `[laugh]` | Full laugh |
| `[chuckle]` | Short chuckle |
| `[giggle]` | Light giggle |
| `[cry]` | Crying sound |
| `[tsk]` | Disapproving tsk |
| `[tongue-click]` | Tongue click |
| `[lip-smack]` | Lip-smack sound |
| `[breath]` | Audible breath |
| `[inhale]` | Inhalation |
| `[exhale]` | Exhalation |
| `[sigh]` | Sigh |
Wrapping Style Tags
Wrap text to apply a delivery style to the enclosed span.
| Tag | Effect |
|---|---|
| `<soft>…</soft>` | Softer, quieter delivery |
| `<whisper>…</whisper>` | Whispered speech |
| `<loud>…</loud>` | Louder, more projected delivery |
| `<build-intensity>…</build-intensity>` | Gradually increasing intensity |
| `<decrease-intensity>…</decrease-intensity>` | Gradually decreasing intensity |
| `<higher-pitch>…</higher-pitch>` | Raised pitch |
| `<lower-pitch>…</lower-pitch>` | Lowered pitch |
| `<slow>…</slow>` | Slower delivery |
| `<fast>…</fast>` | Faster delivery |
| `<sing-song>…</sing-song>` | Melodic, sing-song pattern |
| `<singing>…</singing>` | Full singing mode |
| `<laugh-speak>…</laugh-speak>` | Spoken with laughter mixed in |
| `<emphasis>…</emphasis>` | Stressed emphasis on words |
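Because tags are embedded in plain text, a typo such as an unknown tag name or a missing closing tag is easy to miss. The sketch below lints a string against the two tag tables before it is sent. `check_speech_tags` is a hypothetical helper, and two things are assumptions rather than documented behavior: that wrapping tags must be properly nested, and that the service would not accept tag names outside these tables.

```python
import re

# Tag names copied from the Instant Tags and Wrapping Style Tags tables.
INSTANT_TAGS = {
    "pause", "long-pause", "hum-tune", "laugh", "chuckle", "giggle", "cry",
    "tsk", "tongue-click", "lip-smack", "breath", "inhale", "exhale", "sigh",
}
WRAPPING_TAGS = {
    "soft", "whisper", "loud", "build-intensity", "decrease-intensity",
    "higher-pitch", "lower-pitch", "slow", "fast", "sing-song", "singing",
    "laugh-speak", "emphasis",
}

# Matches [name], <name>, and </name> where name is lowercase with hyphens.
_TAG_RE = re.compile(r"\[([a-z-]+)\]|<(/?)([a-z-]+)>")


def check_speech_tags(text):
    """Return a list of problems found in the inline tags of `text`."""
    problems = []
    open_tags = []  # stack of wrapping tags that are currently open
    for match in _TAG_RE.finditer(text):
        instant, closing, wrap = match.groups()
        if instant is not None:
            if instant not in INSTANT_TAGS:
                problems.append(f"unknown instant tag [{instant}]")
        elif wrap not in WRAPPING_TAGS:
            problems.append(f"unknown wrapping tag <{closing}{wrap}>")
        elif not closing:
            open_tags.append(wrap)
        elif open_tags and open_tags[-1] == wrap:
            open_tags.pop()
        else:
            problems.append(f"unexpected closing tag </{wrap}>")
    problems.extend(f"unclosed tag <{name}>" for name in open_tags)
    return problems
```

An empty list means every tag was recognized and every wrapping tag was closed in order; each problem is reported as a short human-readable string.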
Example:
"Welcome back! [pause] <whisper>Just between us</whisper>, the sale ends tonight. [laugh] Don't miss it!"
Rate Limits
| Limit | Value |
|---|---|
| Requests per minute (RPM) | 3,000 |
| Requests per second (RPS) | 50 |
| Concurrent sessions per team | 100 |
| Maximum characters per request | 15,000 |
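Input longer than the 15,000-character limit has to be split across several requests. A minimal sketch of one approach follows, assuming that splitting at sentence-ending punctuation is acceptable for the content; `split_text` is a hypothetical helper, and it does not protect inline speech tags, so a wrapping tag that spans a chunk boundary would need separate handling.

```python
import re

MAX_CHARS = 15_000  # per-request limit from the Rate Limits table


def split_text(text, max_chars=MAX_CHARS):
    """Split text into chunks of at most `max_chars`, preferring sentence ends."""
    # Break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as its own request. With the documented limits of 50 requests per second and 100 concurrent sessions per team, a long document split this way is unlikely to hit the rate limits when its chunks are sent sequentially.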
Pricing
Pricing is based on the number of characters in the input text field.
SKU
| SKU | Description | Unit Price |
|---|---|---|
| `sku_per_1k_chars` | Per 1,000 characters synthesized | $0.015 |
Formula
cost = countChars(text) / 1000 × sku_per_1k_chars
Where countChars(text) counts the total number of characters in the input text (including spaces and punctuation, excluding inline speech tags consumed by the model).
The effective rate is $15.00 per 1,000,000 characters.
Examples
| Input Length | Characters | Cost |
|---|---|---|
| Short sentence | 100 chars | $0.0015 |
| One paragraph | 500 chars | $0.0075 |
| Article excerpt | 2,000 chars | $0.0300 |
| Maximum single request | 15,000 chars | $0.2250 |
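The formula can be reproduced in a few lines for budgeting. This is a sketch under stated assumptions: `billable_chars` and `estimate_cost` are hypothetical helpers, and stripping anything shaped like `[tag]`, `<tag>`, or `</tag>` with a regular expression is only an approximation of how the service excludes speech tags, so actual billed counts may differ slightly.

```python
import re
from decimal import Decimal

PRICE_PER_1K_CHARS = Decimal("0.015")  # sku_per_1k_chars, in USD

# Approximation: treat any [name], <name>, or </name> as a non-billable tag.
_TAG_RE = re.compile(r"\[[a-z-]+\]|</?[a-z-]+>")


def billable_chars(text):
    """Count characters after removing tag-shaped substrings."""
    return len(_TAG_RE.sub("", text))


def estimate_cost(text):
    """Estimated cost in USD: billable characters / 1000 * unit price."""
    return Decimal(billable_chars(text)) / 1000 * PRICE_PER_1K_CHARS
```

`Decimal` is used instead of `float` so that small per-request amounts add up exactly when summed over many requests.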
Custom cloned voices are billed at the same per-character rate as built-in voices — there is no additional charge for using a `voice_id` from the Voice Library.
Enterprise Features
- SOC 2 Type II certified
- HIPAA eligibility with Business Associate Agreement (BAA)
- GDPR compliance with data residency options
- SAML SSO and role-based access control (RBAC)
- Multi-region infrastructure with custom SLAs available
- No data retention — audio is never stored or used for model training


