TTS: Text-to-Speech that Sounds Human

Try It Online (TTS)

Chinese narration (soft female)

Today, you deserve gentle care. Give me the text—listen with your heart.

Cantonese (urban male)

Tonight, I’ll tell you a story—take it slow, the best part’s ahead.

Sichuan dialect (low suspense)

Some words aren’t meant to be loud. Just listen—carefully.

Mixed Chinese‑English (energetic girl)

Good morning! Today is a good day—let’s begin.

Select the text language, enter your text, choose a voice or dialect, then generate and download the MP3.

TTS Use Cases

Read books/articles

Comfortable night listening, long texts without fatigue.

Short‑video/ads dubbing

Local dialects, expressive emotions, batch ready.

Course/presentation reading

Clear and stable, better delivery.

Navigation/reminders

Natural and friendly, not robotic.

AI companion/assistant

Speaks like a human, responds in real time.

Accessibility

Friendlier for visually impaired users.

Hear the Difference (TTS Samples)

Pick voices/dialects; download MP3/WAV; adjust rate/volume/pitch.

TTS Dialects & Voices

Chinese dialects: Cantonese, Sichuan, Wu (Shanghai), Minnan, Beijing, Tianjin, Nanjing, Shaanxi, etc.

Voices: 小雅 (Xiaoya, sweet and lively); 长卿 (Changqing, mature and magnetic); 若兮 (Ruoxi, gentle classical style).

Rate/volume/pitch/bitrate configurable; punctuation respected for natural pauses.

TTS Quick Start

Qwen Chat

Generate a reply, then tap “Read aloud”.

Mobile reading

WeChat: long-press → Read; iPhone: Accessibility → Spoken Content; Android: Text-to-Speech.

TTS in 3 Steps

1. Copy text

Choose the text you want to read.

2. Select voice/dialect

E.g., Chinese female, Cantonese male.

3. Play & download

Save as MP3/WAV when satisfied.

TTS Developer Integration

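Realtime API: https://dashscope.aliyuncs.com/compatible-mode/v1/services/aigc/multimodal-conversation (500,000 free characters per month). Split text into segments of fewer than 500 characters each and use punctuation to control pacing; specify language/dialect to improve code-switched and dialect output.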

WebSocket: wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime?model=qwen3-tts-flash-realtime

Audio params: format='wav', sample_rate=22050, bitrate='128k'; presence_penalty=0.6 to reduce repetition.

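import os

import dashscope

# Read the key from the environment instead of hard-coding it.
dashscope.api_key = os.environ['DASHSCOPE_API_KEY']

# Synthesize one short Chinese test sentence with the realtime model.
response = dashscope.Audio.speech_synthesizer(
    model='qwen3-tts-flash-realtime',
    text='你好，这是 Qwen3-TTS 测试。',
    voice='中文女声',
    language='zh'
)
print(response.output_audio)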
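
If the service hands back raw PCM samples rather than a finished file, they can be wrapped in a WAV container at the 22050 Hz rate listed in the audio params. The sketch below uses only the Python standard library. The function name pcm_to_wav, the 16-bit mono layout, and the generated test tone are assumptions for illustration, not part of the API.

```python
import io
import math
import struct
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 22050,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, 'wb') as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample; 2 = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()

# Stand-in for synthesized audio (assumption): 0.1 s of a 440 Hz tone.
tone = b''.join(
    struct.pack('<h', int(12000 * math.sin(2 * math.pi * 440 * n / 22050)))
    for n in range(2205)
)
wav_bytes = pcm_to_wav(tone)
```

At 16-bit mono and 22050 Hz, uncompressed audio runs at 352.8 kbps, so a bitrate setting such as '128k' only matters for compressed formats like MP3.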

TTS FAQ

API Key invalid or region mismatch?

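The Beijing region uses keys prefixed with sk-; the international region uses sk-intl-. Select the matching region at https://dashscope.aliyuncs.com to generate or reset a key.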
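
Taking the prefix rule above at face value, a client could pick its host from the key before making any call. This is a rough sketch: region_for_key and the HOSTS table are hypothetical helpers, and the two host names are simply the ones that appear in the URLs on this page.

```python
# Hosts as they appear in the URLs on this page (mapping is an assumption).
HOSTS = {
    'beijing': 'dashscope.aliyuncs.com',
    'international': 'dashscope-intl.aliyuncs.com',
}

def region_for_key(api_key: str) -> str:
    """Guess the region from the key prefix, following the rule stated above."""
    # Check the longer prefix first: 'sk-intl-' also starts with 'sk-'.
    if api_key.startswith('sk-intl-'):
        return 'international'
    if api_key.startswith('sk-'):
        return 'beijing'
    raise ValueError('unrecognized API key prefix')
```

The order of the two checks matters, because every international key would otherwise match the shorter sk- prefix.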

High latency or disconnects?

Use WebSocket and add reconnection logic; for local deployment, choose the flash-realtime version.
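
One common way to add reconnection is exponential backoff with a little jitter. The sketch below is library-agnostic: connect stands for whatever function opens your WebSocket session, and the names backoff_delays and with_reconnect, along with the default delays, are assumptions for illustration.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar('T')

def backoff_delays(base: float = 0.5, cap: float = 30.0,
                   attempts: int = 6) -> Iterator[float]:
    """Yield delays that double each attempt, capped at `cap` seconds."""
    for attempt in range(attempts):
        yield min(cap, base * (2 ** attempt))

def with_reconnect(connect: Callable[[], T],
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Call connect() until it succeeds or the delays run out, then re-raise."""
    last_error: Exception = ConnectionError('no connection attempts were made')
    for delay in backoff_delays():
        try:
            return connect()
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc
            # Up to 10% random jitter so many clients do not retry in lockstep.
            sleep(delay * (1 + random.uniform(0, 0.1)))
    raise last_error
```

Passing sleep as a parameter keeps the helper easy to test without real waiting.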

Unnatural sound or repetition?

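Split long text into segments (fewer than 500 characters each); use punctuation to control pacing; choose a voice that matches the content; presence_penalty can also be adjusted.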
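
The segmenting advice can be sketched as a small helper that cuts at sentence-ending punctuation and keeps every segment under 500 characters. The function name split_text and the exact punctuation set are assumptions; adjust them to your content.

```python
import re

# Split points: right after Chinese or Western sentence-ending punctuation.
_SENTENCE_END = re.compile(r'(?<=[。！？；.!?;\n])')

def split_text(text: str, max_chars: int = 499) -> list[str]:
    """Split text into segments of at most max_chars, preferring sentence ends."""
    segments: list[str] = []
    current = ''
    for sentence in _SENTENCE_END.split(text):
        # Hard-wrap any single sentence that is longer than the limit.
        while len(sentence) > max_chars:
            if current:
                segments.append(current)
                current = ''
            segments.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        # Start a new segment when this sentence would overflow the current one.
        if len(current) + len(sentence) > max_chars:
            segments.append(current)
            current = ''
        current += sentence
    if current.strip():
        segments.append(current)
    return segments
```

Each returned segment can then be sent as its own synthesis request and the audio concatenated in order.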

Custom voice cloning?

Not built in officially for now; you can train a voice with open-source So-VITS-SVC or XTTS and route Qwen3 output through it to achieve cloning.

Slow load or out of memory locally?

Run pip install transformers torch, then from transformers import QwenTTSForConditionalGeneration; GPU requires CUDA 11+; on CPU use --device cpu; if memory runs short, use FP8 quantization.

Free quota and billing?

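About 500,000 free characters per month; split bulk jobs into batches; you can also switch to the open-source model locally, which is completely free.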

Multilingual/dialect glitches?

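Set language and dialect explicitly; test with short sentences first; English + Chinese blends most seamlessly, and a slightly slower rate is recommended for Russian.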

Unsupported format or low quality?

Set format/sample_rate/bitrate; before saving, check that the text contains no garbled characters.
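
A quick pre-flight check for garbled input might look like the sketch below. It flags the Unicode replacement character and stray control, private-use, or unassigned code points; the function name find_garbled and the choice of categories are assumptions, not a rule of the service.

```python
import unicodedata

def find_garbled(text: str) -> list[tuple[int, str]]:
    """Return (index, character) pairs that usually indicate a decoding problem."""
    suspects = []
    for index, char in enumerate(text):
        if char == '\ufffd':
            # U+FFFD marks bytes that could not be decoded.
            suspects.append((index, char))
        elif (unicodedata.category(char) in ('Cc', 'Co', 'Cn')
              and char not in '\n\r\t'):
            # Control, private-use, or unassigned code point.
            suspects.append((index, char))
    return suspects
```

An empty result means none of these markers were found; it does not prove the text is otherwise correct.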

Tool integration errors?

Update dashscope/qwen-tts; if necessary, add torch.no_grad() to prevent memory leaks.

Demo slow or silent?

Refresh or use incognito mode; if necessary, clone the Hugging Face Space locally; check device permissions.

TTS Comparison & Choice

Naturalness

Lower word error rate (WER) across languages; stable in Chinese and multilingual text.

Dialects & languages

9 Chinese dialects + 10 languages; seamless code‑switching.

Open & cost

Open source for local use; 500k free chars/month in the cloud.

TTS Technical Overview

Architecture

Transformer + MoE; unified multimodal framework.

Pipeline

Tokenize/encode → MoE prosody → VQ-VAE → mel-spectrogram → HiFi-GAN (22 kHz).

Key points

Adaptive Rhythm, RLHF stability, CUDA Graph for low latency.

TTS Pricing & Compliance

Pricing & quota

About 500k free chars/month (~400 minutes of audio); low-cost models; local open source is free.
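
The two quota figures imply a rough conversion rate of about 1,250 characters per minute of audio. The sketch below turns that into a planning helper; the constants come from the figures quoted above, while the function names are hypothetical and the real rate will vary with language and speaking speed.

```python
FREE_CHARS_PER_MONTH = 500_000   # figure quoted above
FREE_MINUTES_PER_MONTH = 400     # figure quoted above

# Implied average: 500,000 / 400 = 1,250 characters per minute of audio.
CHARS_PER_MINUTE = FREE_CHARS_PER_MONTH / FREE_MINUTES_PER_MONTH

def estimate_minutes(char_count: int) -> float:
    """Estimate audio length in minutes for a given character count."""
    return char_count / CHARS_PER_MINUTE

def remaining_quota(used_chars: int) -> int:
    """Characters left in the monthly free tier, never below zero."""
    return max(0, FREE_CHARS_PER_MONTH - used_chars)
```

For example, a 100,000-character book would come to roughly 80 minutes and use a fifth of the monthly allowance.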

Compliance & safety

Apache 2.0 license; local running for privacy; ensure authorized data for cloning.

TTS Testimonials & Action

What users say

“Sounds like human performance, not machine reading.”
“Dialects change the game for local content.”
“Open and generous free tier.”

Get Started

Pick a voice → choose dialect/emotion → one‑click download MP3/WAV. Need batch or templates? Tell us your scenario.

Listen now / Get the guide