We present audio examples for our paper "Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment". To perform robust text-to-speech (TTS) synthesis, we propose T5-TTS, a model that learns robust text and speech alignment without modifying the model architecture or requiring ground-truth text duration. Experiments demonstrate that our alignment learning procedure improves the reliability of TTS synthesis, especially for challenging text inputs, and outperforms prior LLM-based TTS models on both intelligibility and naturalness. We present audio samples generated by the T5-TTS model for both seen and unseen speakers.
We evaluate different LLM-based TTS models on a set of 100 challenging texts (the full list can be found here). For each text, we synthesize two audio samples per model, one from a male speaker and one from a female speaker, using the voice presets of the given models. The results of this evaluation are provided in Table 3 of our paper. We present some of the audio samples from this experiment below.
Transcript | VALLE-X | Bark | SpeechT5 | T5-TTS (No Prior, No Lalign) (Ours - Unaligned) | T5-TTS (W Prior, W Lalign) (Ours - Aligned Model) |
---|---|---|---|---|---|
| | mending agent failed with their code zero x eight zero zero seven zero zero zero zero five a saving appointment | calendaring agent failed with error code zero times eight zero zero seven zero zero zero five all saving appointment | calendaring agent failed with era code zero zero zero seven zero zero zero zero five while saving appointment | calendaring agent failed with error code zero x eight zero zero zero zero zero zero five while saving appointment | calendaring agent failed with error code zero x eight zero zero seven zero zero zero five while saving appointment |
| | to deliver interfaces that are significantly better suited to create and process r f c eight twenty one r c a twenty two r c nine seventy seven and mine content | to deliver interfaces that are significantly better suited to create and process of c twenty one or of c twenty two r of c nine seventy seven and mine content | to deliver interfaces that are significantly better suited to create and process r f c twenty one r f c twenty two r f c nine seventy seven and mim content | to deliver interfaces that are significantly better suited to create and process r f c eight twenty one r f c eight twenty one r f c eight twenty seven and mime content and mime content | to deliver interfaces that are significantly better suited to create and process r f c eight twenty one r f c eight twenty two r f c nine seventy seven and mime content |
| | h t t p p colon slash slash size slash d g slash default dot desks as always any feedback comments | t t p colon slash slash slash team slash slash t a g slash default do aspix as always any feedback comments | h t p p colon slash team slash site slash t a g slash default dot asps as always any feedback comments | h t t e p colon slash size sites slash slash slash df l dot aspects as always any feedback comments | h t t p colon slash slash teams slash sites slash t a g slash default dot aspox as always any feedback comments |
| | you can call me directly at four two five seven zero trim seven three four four or myself four two five four four seven four seven four send me a meeting request with all the appropriate information | you can call me directly at four two five seven zero three seven three four four or my cell four two five four four four four seven four seven four or send me a meeting request with all the appropriate information | you can call me directly at four two five seven zero three seven three four four or my cell four two five four four seven four or send me a meeting request with all the appropriate information | you can call me directly at four two five seven zero three seven three four four my cell four two five four four four four four seven four or send me a meeting request with all the appropriate information on my shin | you can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven four seven four or send me a meeting request with all the appropriate information |
| | thanks j r g r using the l d dm driver for this system having the build x d d m driver | k thanks j r g r are you using the l b d m driver for this system or the in the build x d d m driver | thanks j r g r are you using the l d m driver for the system or the in the build x d m driver | thanks j r g r are you using the l d d m driver for this system or the in the build x t m driver | thanks j r g r are you using the l d d m driver for this system or the in the build x d d m driver |
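
Differences between transcripts like those above can be quantified with a word error rate (WER), a common intelligibility measure for TTS. The following is a minimal sketch and not the evaluation code behind Table 3: the function name `word_error_rate` is hypothetical, and the example treats the last transcript in the first table row as a stand-in reference, which is an assumption because the ground-truth input text is not reproduced on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref prefix processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Assumption: the last transcript in the first row stands in for the reference text.
reference = (
    "calendaring agent failed with error code zero x eight zero zero seven "
    "zero zero zero five while saving appointment"
)
# First transcript in the same row.
hypothesis = (
    "mending agent failed with their code zero x eight zero zero seven "
    "zero zero zero zero five a saving appointment"
)
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.3f}")
```

Under this assumption, the first transcript differs from the stand-in reference by four word edits (three substitutions and one inserted "zero") over 19 reference words. A character-level variant of the same routine gives a character error rate.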
In this section, we present TTS results when the context audio is from a seen speaker and the text is from unseen holdout sentences. We train three T5-TTS (W Prior, W Lalign) models, each on a different audio codec: Encodec, DAC, and Mel-FSQ. We present the results on the holdout utterances from the Libri-TTS train-clean-360 set (seen speakers, unseen texts) below.
Transcript | Context Audio | Target Audio | Encodec Predicted Audio | DAC Predicted Audio | Mel-FSQ Predicted Audio |
---|---|---|---|---|---|
We also evaluate our models on the zero-shot TTS task, in which the context audio is from an unseen speaker. We present audio samples generated by the T5-TTS model for unseen speakers from the VCTK dataset. We compare our two aligned T5-TTS models: one in which the context audio is passed to the T5 encoder, and one in which it is passed to the decoder.
Transcript | Context Audio | Target Audio | T5-TTS (Encoder Context) | T5-TTS (Decoder Context) |
---|---|---|---|---|