We present audio examples for our paper "Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment". To perform robust text-to-speech (TTS) synthesis, we propose a T5-TTS model that learns a robust alignment between text and speech without modifying the model architecture or requiring ground-truth text durations. Experiments demonstrate that our alignment learning procedure improves the reliability of TTS synthesis, especially for challenging text inputs, and outperforms prior LLM-based TTS models on both intelligibility and naturalness. This page presents audio samples generated by the T5-TTS model for both seen and unseen speakers.
We evaluate different LLM-based TTS models on a set of 100 challenging texts (the full list can be found here). For each text, we synthesize two audio samples per model, one from a male speaker and one from a female speaker, chosen from the voice presets of the given model. The results of this evaluation are provided in Table 3 of our paper. We present some of the audio samples from this experiment below.
Transcript | VALLE-X | Bark | SpeechT5 | T5-TTS (No Prior, No Lalign) (Ours - Unaligned) | T5-TTS (W Prior, W Lalign) (Ours - Aligned Model) |
---|---|---|---|---|---|
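This page does not state how intelligibility is scored. A common choice, assumed here purely for illustration, is to transcribe each synthesized sample with a speech recognizer and compute the word error rate (WER) against the input text. The minimal sketch below shows only the WER step; the function name and the example transcripts are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two strings, divided by the
    number of reference words. Lower is better; 0.0 means an exact match."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical input text and ASR transcript of a synthesized sample.
text = "she sells sea shells by the sea shore"
asr_transcript = "she sells sea shells by the shore"
print(word_error_rate(text, asr_transcript))
```

Under this assumed metric, a model that skips or repeats words on challenging inputs shows up as a higher WER, which is the kind of failure the alignment learning is meant to reduce.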
In this section, we present TTS results when the context audio is from a seen speaker and the text is from unseen holdout sentences. We train three T5-TTS (W Prior, W Lalign) models, each on a different audio codec: Encodec, DAC, and Mel-FSQ. We present the results on the holdout utterances from the Libri-TTS train-clean-360 set (seen speakers, unseen texts) below.
Transcript | Context Audio | Target Audio | Encodec Predicted Audio | DAC Predicted Audio | Mel-FSQ Predicted Audio |
---|---|---|---|---|---|
We also evaluate our models on the zero-shot TTS task, where the context audio is from an unseen speaker. We present audio samples generated by the T5-TTS model for unseen speakers from the VCTK dataset. We compare our two aligned T5-TTS models: one in which the context audio is passed to the T5 encoder, and one in which the context audio is passed to the decoder.
Transcript | Context Audio | Target Audio | T5-TTS (Encoder Context) | T5-TTS (Decoder Context) |
---|---|---|---|---|