Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

*Paarth Neekhara, *Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg

* Equal Contribution
NVIDIA

Read our Paper

We present audio examples for our paper Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment. To perform robust text-to-speech synthesis, we propose a T5-TTS model that learns robust text-speech alignment without modifying the model architecture or requiring ground-truth text durations. Our experiments demonstrate that the alignment learning procedure improves the reliability of TTS synthesis, especially for challenging text inputs, and outperforms prior LLM-based TTS models in both intelligibility and naturalness. Below, we present audio samples generated by the T5-TTS model for both seen and unseen speakers.
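For readers curious about the mechanism, the sketch below illustrates one common way to encourage monotonic alignment in an encoder-decoder TTS model: biasing the text-to-speech cross-attention with a static beta-binomial prior that concentrates probability mass near the diagonal. This is a minimal, hypothetical illustration of the general technique, not the exact implementation from our paper; the function names and the integration point are assumptions.

```python
# Minimal sketch (not the paper's code): a static beta-binomial prior that
# biases cross-attention toward near-diagonal, monotonic text-speech paths.
import numpy as np
from scipy.stats import betabinom

def beta_binomial_prior(audio_len: int, text_len: int, scale: float = 1.0) -> np.ndarray:
    """Return an (audio_len x text_len) prior; row t places most of its
    probability mass near text position t * text_len / audio_len."""
    prior = np.zeros((audio_len, text_len))
    ks = np.arange(text_len)
    for t in range(1, audio_len + 1):
        a, b = scale * t, scale * (audio_len - t + 1)
        prior[t - 1] = betabinom.pmf(ks, text_len - 1, a, b)
    return prior

# Hypothetical use inside a cross-attention layer during early training:
#   log_attn = log_softmax(scores, dim=-1) + log(prior + 1e-8)
# so far-off-diagonal (non-monotonic) attention paths are penalized.
```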

TTS from Challenging Texts

We evaluate different LLM-based TTS models on a set of 100 challenging texts (the full list can be found here). For each text, we synthesize two audio clips per model, one each from a male and a female speaker taken from the given model's voice presets. The results of this evaluation are provided in Table 3 of our paper. We present some of the audio samples from this experiment below.

Transcript | VALLE-X | Bark | SpeechT5 | T5-TTS (No Prior, No Lalign) (Ours - Unaligned) | T5-TTS (W Prior, W Lalign) (Ours - Aligned Model)
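As a rough sketch of how intelligibility numbers like those in Table 3 are typically obtained (the ASR model and helper below are assumptions, not our paper's exact setup), one transcribes each synthesized clip with an off-the-shelf ASR model and computes the character error rate against the input text:

```python
# Hypothetical intelligibility scoring: transcribe synthesized speech with
# an off-the-shelf ASR model and compute CER against the input text.
import jiwer    # pip install jiwer
import whisper  # pip install openai-whisper

asr = whisper.load_model("base.en")

def intelligibility_cer(wav_path: str, reference_text: str) -> float:
    """Lower is better; a robust TTS model should keep CER low even on hard texts."""
    hypothesis = asr.transcribe(wav_path)["text"]
    return jiwer.cer(reference_text.lower().strip(), hypothesis.lower().strip())
```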

T5-TTS Generated Audio using Different Codecs

In this section, we present TTS results when the context audio is from a seen speaker and the text comes from unseen holdout sentences. We train three T5-TTS (W Prior, W Lalign) models on different audio codecs: Encodec, DAC, and Mel-FSQ. Below, we present results on holdout utterances from the LibriTTS train-clean-360 set (seen speakers, unseen texts).

Transcript | Context Audio | Target Audio | Encodec Predicted Audio | DAC Predicted Audio | Mel-FSQ Predicted Audio
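For context, LLM-based TTS models such as T5-TTS predict discrete codec tokens rather than raw waveforms. The sketch below shows how a waveform is tokenized with Encodec, one of the three codecs compared above, using the Hugging Face transformers API; this illustrates the codec interface only, not our training pipeline, and the checkpoint name is an assumption.

```python
# Illustrative only: converting a waveform into the discrete Encodec tokens
# that an LLM-based TTS model would predict (checkpoint name assumed).
import torch
from transformers import EncodecModel, AutoProcessor

model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

waveform = torch.randn(24000)  # stand-in for 1 second of 24 kHz audio
inputs = processor(raw_audio=waveform.numpy(), sampling_rate=24000,
                   return_tensors="pt")
encoded = model.encode(inputs["input_values"], inputs["padding_mask"])
codes = encoded.audio_codes    # token grid: (chunks, batch, codebooks, frames)
```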

T5-TTS Generated Audio for Unseen Speakers

We also evaluate our models on the zero-shot TTS task, in which the context audio is from an unseen speaker. We present audio samples generated by the T5-TTS model for unseen speakers from the VCTK dataset. We compare our two aligned T5-TTS models: one in which the context audio is passed to the T5 encoder, and one in which it is passed to the decoder.

Transcript | Context Audio | Target Audio | T5-TTS (Encoder Context) | T5-TTS (Decoder Context)