Z-Image-Turbo is a 6-billion-parameter text-to-image model that generates photorealistic images in sub-second time. Ready-to-use REST inference API, best performance, no cold starts, affordable pricing.
Z-Image-Turbo is a 6B-parameter text-to-image model from Tongyi-MAI, engineered for production workloads where latency and throughput really matter. It uses only 8 sampling steps to render a full image, achieving sub-second latency on data-center GPUs and running comfortably on many 16 GB VRAM consumer cards.
Where many diffusion models need dozens of steps, Z-Image-Turbo is aggressively optimised around an 8-step sampler. That keeps inference extremely fast while still delivering photorealistic images and reliable on-image text, making it a strong fit for interactive products, dashboards, and large-scale backends—not just offline batch jobs.
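A typical integration is a single HTTP call per image. The sketch below is illustrative only: the endpoint URL, parameter names, and auth scheme are placeholder assumptions, so check the actual API reference for the real schema. The 8-step default mirrors the model's distilled sampler.

```python
# Hypothetical sketch of calling a hosted Z-Image-Turbo REST endpoint.
# The URL, field names, and bearer-token auth are illustrative assumptions,
# not the provider's documented schema.
import json
import urllib.request

API_URL = "https://api.example.com/v1/inference/Tongyi-MAI/Z-Image-Turbo"  # placeholder

def build_request(prompt: str, width: int = 1024, height: int = 1024,
                  num_inference_steps: int = 8) -> dict:
    """Assemble the JSON payload; 8 steps matches the distilled sampler."""
    return {
        "prompt": prompt,
        "width": width,
        "height": height,
        "num_inference_steps": num_inference_steps,
    }

def generate(prompt: str, api_key: str) -> bytes:
    """POST the payload and return the raw response body (network call)."""
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        API_URL,
        data=payload,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Because the model needs only 8 steps, a synchronous request-response loop like this stays within interactive latency budgets rather than requiring a job queue.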
Simple per-image billing.
6 Billion Parameter Model by Alibaba Tongyi-MAI
Z-Image Turbo is the #1 ranked open-source text-to-image model, surpassing FLUX.2 [dev], HunyuanImage 3.0, and Qwen-Image on the Artificial Analysis Image Arena. Built by Alibaba's Tongyi-MAI team (a separate division from Qwen/Wan), this 6B parameter model achieves sub-second generation through advanced Decoupled-DMD distillation while maintaining photorealistic quality. With only 8 inference steps, it fits within 16GB VRAM and delivers professional results optimized for speed-critical production environments.
Alibaba offers three specialized AI image generation systems, each optimized for a different use case.
Tongyi-MAI Team
Qwen Team
Wan Team
Key Insight: Z-Image Turbo is 1.31-1.41× faster than Qwen-Image per step, making it ideal for applications requiring rapid generation. While Qwen-Image offers slightly better photorealism for final renders, Z-Image Turbo provides the best balance of speed and quality for production environments.
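The per-step speedup compounds with the reduced step count. The back-of-envelope calculation below uses the 1.31x lower bound quoted above; the 50-step baseline for a non-distilled sampler is an illustrative assumption, not a measured figure.

```python
# Back-of-envelope end-to-end speedup estimate for Z-Image-Turbo.
# PER_STEP_SPEEDUP comes from the quoted 1.31-1.41x range (lower bound);
# BASELINE_STEPS is an assumed step count for a conventional sampler.
ZIMAGE_STEPS = 8
BASELINE_STEPS = 50
PER_STEP_SPEEDUP = 1.31

# Fewer steps and faster steps multiply together.
end_to_end = PER_STEP_SPEEDUP * BASELINE_STEPS / ZIMAGE_STEPS
print(f"~{end_to_end:.1f}x faster end to end")  # ~8.2x under these assumptions
```

Even against a 20-step baseline the same arithmetic yields roughly a 3.3x end-to-end advantage.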
Adopts a Single-Stream Diffusion Transformer (S3-DiT) architecture that unifies the processing of various conditional inputs. This 6B-parameter design achieves professional results without the computational overhead of larger models while maintaining state-of-the-art quality.
The Decoupled-DMD distillation algorithm, combining CFG Augmentation and Distribution Matching mechanisms, enables 8-step inference (versus 20-50 steps for comparable models). It achieves sub-second generation on H800 GPUs and runs comfortably on consumer cards with 16 GB of VRAM.
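A quick sanity check on the 16 GB claim: the weights alone for 6B parameters fit with headroom. The half-precision (2 bytes per parameter) assumption below is mine; activations, the text encoder, and the VAE add overhead on top of the weights.

```python
# Rough VRAM estimate for the 6B-parameter weights.
# bytes_per_param assumes bf16/fp16 storage (an assumption, not a spec);
# runtime memory for activations and auxiliary models is extra.
params = 6e9
bytes_per_param = 2
weights_gb = params * bytes_per_param / 1024**3
print(f"weights ≈ {weights_gb:.1f} GB")  # ≈ 11.2 GB, leaving headroom in 16 GB
```

This is why the model clears a 16 GB consumer card, whereas a 20B-class model at the same precision would not.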
Ranked #1 open-source model on Artificial Analysis Image Arena, beating FLUX.2 [dev], HunyuanImage 3.0, and Qwen-Image. Excels at bilingual text rendering (English & Chinese), photorealistic generation, and robust instruction following. Released under Apache 2.0 license for commercial use.
Experience lightning-fast, photorealistic image generation today. No setup is required: just call our API and start creating.