🎙️✨ WavTTS
Towards High-Fidelity Zero-Shot TTS via Direct Raw Waveform Modeling
Abstract: Diffusion models operating on VAE latents or mel-spectrograms have recently become the dominant paradigm for zero-shot TTS. Although these compressed representations improve generation efficiency, they inevitably lose information and prevent end-to-end training. In theory, directly modeling raw waveforms circumvents both issues; however, this direction remains underexplored and is often deemed difficult because audio signals are extremely long sequences. To overcome this, we propose WavTTS, the first raw-waveform generative TTS model that substantially narrows the gap with latent-space generative models. Built on flow matching with a Diffusion Transformer (DiT), WavTTS models speech waveforms directly via a simple patchification strategy, and integrates multi-scale mel-spectrogram supervision to provide perceptual guidance during training. Furthermore, we investigate the impact of prediction targets and noise scheduling in waveform diffusion, and develop an effective schedule design that improves generation quality. Evaluations on open-source benchmarks show that WavTTS closely approaches the performance of current state-of-the-art latent generative zero-shot TTS models, while substantially outperforming previous end-to-end speech generation models. Our findings demonstrate the feasibility of scaling diffusion-based TTS directly in the waveform space, opening a new direction for end-to-end speech generation.
Model Overview
Figure 1: Illustration of WavTTS training (left) and inference (right).
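To make the two core ideas from the abstract concrete, here is a minimal NumPy sketch of waveform patchification and of building a flow-matching training pair. It is an illustration under stated assumptions, not the WavTTS implementation: the function names, the 24 kHz sample rate, the patch size of 256, the linear noise-to-data path, and the velocity prediction target are all hypothetical choices (the abstract says prediction targets and noise schedules were studied but does not state the final design here).

```python
import numpy as np


def patchify(wave: np.ndarray, patch_size: int) -> np.ndarray:
    """Split a 1-D waveform into non-overlapping patches (tokens).

    The tail is zero-padded so the length is a multiple of patch_size.
    Returns an array of shape (num_patches, patch_size).
    """
    pad = (-len(wave)) % patch_size
    padded = np.pad(wave, (0, pad))
    return padded.reshape(-1, patch_size)


def unpatchify(patches: np.ndarray, length: int) -> np.ndarray:
    """Invert patchify: flatten the patches and drop the padding."""
    return patches.reshape(-1)[:length]


def flow_matching_pair(x1: np.ndarray, t: float, rng: np.random.Generator):
    """Build one training pair for a linear-path flow matching objective.

    Assumed convention (not confirmed by the source):
    x_t = (1 - t) * x0 + t * x1, with x0 ~ N(0, I) and target velocity x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)
    xt = (1.0 - t) * x0 + t * x1
    velocity = x1 - x0
    return xt, velocity


rng = np.random.default_rng(0)
wave = rng.uniform(-1.0, 1.0, size=24000)   # 1 s of audio at a hypothetical 24 kHz
patches = patchify(wave, patch_size=256)    # hypothetical patch size
xt, velocity = flow_matching_pair(patches, t=0.3, rng=rng)
print(patches.shape)
```

Patchification shortens the sequence the transformer sees by a factor equal to the patch size, which is what makes attention over raw audio tractable in a sketch like this; a network would be trained to regress `velocity` from `xt`, the timestep, and the text and prompt conditioning.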
English Zero-shot Generation
| Prompt | Text | WavTTS |
|---|---|---|
| | Looking through the telescope, I saw a circle of deep blue and the little round planet. | |
| | He also tried to remember some good stories to relate as he sheared the sheep. | |
| | Before the boy could reply, a butterfly appeared and fluttered between him and the old man. | |
| | In biology we study plants and animals in their natural environment. | |
| | He paused, looked back at the house, but then pocketed the keys, opened the gate and strode down the path quickly. | |
Chinese Zero-shot Generation
| Prompt | Text | WavTTS |
|---|---|---|
| | 相信岸上固地建设建设,对于中国建设者而言也没有难度。 | |
| | 好的广播公司获得信息的途径,远不止新闻发布会。 | |
| | 更傻眼的是过了没多久,银行就开始催款了。 | |
| | 有多少次,急于用款的企业主,拿着所谓回扣找左慧英。 | |
| | 他笑着说,江苏现在是一半火焰一半海水。 | |