DetailTTS: Learning Residual Detail Information
for Zero-shot Text-to-speech

(Submitted to ICASSP 2025)
Authors: Cong Wang1, Yichen Han1, Yizhong Geng1, Yingming Gao1,
Fengping Wang1, Bingsong Bai1, Qifei Li1, Jinlong Xue1, Yayue Deng1, Zhengqi Wen2, Ya Li1,*
1Beijing University of Posts and Telecommunications, Beijing, China
2Tsinghua University, Beijing, China

Source Code & Pre-trained Model
(The code and models for the paper are being prepared.)

1. Abstract

Traditional text-to-speech (TTS) systems often struggle to align text and speech, causing critical linguistic and acoustic details to be omitted. This misalignment creates an information gap. Existing methods attempt to bridge it by incorporating additional inputs, but these often introduce data inconsistencies and add complexity. To address these issues, we propose DetailTTS, a zero-shot TTS system based on a conditional variational autoencoder. It incorporates two key components, the Prior Detail Module and the Duration Detail Module, which capture residual detail information missed during alignment. These modules enhance the model's ability to retain fine-grained details, significantly improving speech quality while simplifying the model by obviating the need for additional inputs. Experiments on the WenetSpeech4TTS dataset show that DetailTTS outperforms traditional TTS systems in both naturalness and speaker similarity, even in zero-shot scenarios.

2. Model Architecture


Fig. 1. The overall framework of DetailTTS, incorporating two key detail encoding modules: the Prior Detail Module and the Duration Detail Module. These modules learn to capture residual detail information during training, allowing them to provide this information even during inference with only text input, thereby improving the quality of the synthesized speech.
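To make the residual-detail idea concrete, below is a minimal, illustrative PyTorch sketch of how a detail module could be wired around the aligned text-side representation with a residual connection. The module name, layer choices, and dimensions are assumptions for illustration only and are not taken from the paper's released code.

```python
# Illustrative sketch only: a small convolutional "detail" branch whose output
# is added back onto the aligned text hidden states as a residual.
import torch
import torch.nn as nn


class PriorDetailModule(nn.Module):
    """Predicts residual detail features from text-side hidden states (assumed layout)."""

    def __init__(self, hidden_dim: int = 192):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
        )

    def forward(self, text_hidden: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, hidden_dim, frames) after length regulation
        return self.net(text_hidden)


def prior_with_detail(text_hidden: torch.Tensor,
                      detail_module: PriorDetailModule) -> torch.Tensor:
    """Add the predicted residual detail back onto the aligned text prior."""
    residual = detail_module(text_hidden)
    return text_hidden + residual  # residual connection around the detail path


if __name__ == "__main__":
    module = PriorDetailModule(hidden_dim=192)
    dummy_hidden = torch.randn(2, 192, 100)  # batch of 2 utterances, 100 frames
    enriched = prior_with_detail(dummy_hidden, module)
    print(enriched.shape)  # torch.Size([2, 192, 100])
```

Because the detail branch is residual, the base prior path is preserved and the branch only has to model what alignment missed; at inference time the same branch runs from text alone.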

Fig. 2. The Duration Detail Module enhances the duration predictor by learning residual detail information during training. This allows the module to provide refined duration information even during inference with only text input, leading to more accurate and natural speech synthesis.
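The sketch below shows one plausible way to pair a base duration predictor with a residual correction branch, in the spirit of the Duration Detail Module. The two-branch layout, layer sizes, and log-duration parameterisation are assumptions for illustration, not the paper's exact implementation.

```python
# Illustrative sketch only: a coarse duration predictor plus a small residual
# "detail" branch; both operate on per-token text hidden states.
import torch
import torch.nn as nn


class DurationPredictorWithDetail(nn.Module):
    def __init__(self, hidden_dim: int = 192):
        super().__init__()
        # Base branch: coarse log-durations from text hidden states.
        self.base = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, 1, kernel_size=1),
        )
        # Detail branch: residual correction added to the coarse estimate.
        self.detail = nn.Sequential(
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, 1, kernel_size=1),
        )

    def forward(self, text_hidden: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, hidden_dim, tokens)
        log_dur = self.base(text_hidden) + self.detail(text_hidden)
        return log_dur.squeeze(1)  # (batch, tokens) predicted log-durations


if __name__ == "__main__":
    predictor = DurationPredictorWithDetail()
    tokens = torch.randn(2, 192, 50)  # 2 utterances, 50 tokens each
    print(predictor(tokens).shape)  # torch.Size([2, 50])
```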

3. Zero-Shot TTS Samples

Seen Speakers

Unseen Speakers

4. t-SNE

We randomly selected 15 speakers from the test set and performed t-SNE visualization of the speaker embeddings of their synthesized and real speech. The speaker embeddings of synthesized speech and ground truth from the same speaker cluster closely together, further demonstrating the superiority of our method in terms of speaker similarity.
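For reference, a minimal sketch of this kind of t-SNE visualization is shown below, assuming speaker embeddings have already been extracted (e.g. with a speaker-verification model) into arrays of shape (n_utterances, dim); random vectors stand in for the real embeddings here, and the array names and plot settings are placeholders.

```python
# Illustrative sketch only: joint t-SNE projection of ground-truth and
# synthesized speaker embeddings, colored by speaker identity.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
real_emb = rng.normal(size=(150, 256))      # placeholder: ground-truth embeddings
synth_emb = rng.normal(size=(150, 256))     # placeholder: synthesized embeddings
speaker_ids = np.repeat(np.arange(15), 10)  # 15 speakers, 10 utterances each

# Project both sets jointly so they share one 2-D embedding space.
all_emb = np.concatenate([real_emb, synth_emb], axis=0)
proj = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(all_emb)
real_2d, synth_2d = proj[: len(real_emb)], proj[len(real_emb):]

plt.scatter(real_2d[:, 0], real_2d[:, 1], c=speaker_ids, marker="o", label="ground truth")
plt.scatter(synth_2d[:, 0], synth_2d[:, 1], c=speaker_ids, marker="x", label="synthesized")
plt.legend()
plt.title("t-SNE of speaker embeddings (illustrative)")
plt.savefig("tsne_speakers.png")
```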