Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment

*Paarth Neekhara, *Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg

* Equal Contribution
NVIDIA

Read our Paper

We present audio examples for our paper Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment. To perform robust text-to-speech synthesis, we propose a T5-TTS model that can learn robust text and speech alignment without modifying the model architecture or requiring ground-truth text durations. Experiments demonstrate that our alignment learning procedure improves the reliability of TTS synthesis, especially for challenging text inputs, and outperforms prior LLM-based TTS models on both intelligibility and naturalness. We present audio samples generated by the T5-TTS model for both seen and unseen speakers.

TTS from Challenging Texts

We evaluate different LLM-based TTS models on a set of 100 challenging texts (the full list can be found here). For each text, we synthesize two audio samples per model, one from a male speaker and one from a female speaker, chosen from the voice presets of the given model. The results of this evaluation are provided in Table 3 of our paper. We present some of the audio samples from this experiment below.

Transcript: calendaring agent failed with error code zero x eight zero zero seven zero zero zero five while saving appointment
VALLE-X: mending agent failed with their code zero x eight zero zero seven zero zero zero zero five a saving appointment
Bark: calendaring agent failed with error code zero times eight zero zero seven zero zero zero five all saving appointment
SpeechT5: calendaring agent failed with era code zero zero zero seven zero zero zero zero five while saving appointment
T5-TTS (No Prior, No L_align) (Ours - Unaligned): calendaring agent failed with error code zero x eight zero zero zero zero zero zero five while saving appointment
T5-TTS (W Prior, W L_align) (Ours - Aligned Model): calendaring agent failed with error code zero x eight zero zero seven zero zero zero five while saving appointment

Transcript: to deliver interfaces that are significantly better suited to create and process r f c eight twenty one r f c eight twenty two r f c nine seventy seven and mime content
VALLE-X: to deliver interfaces that are significantly better suited to create and process r f c eight twenty one r c a twenty two r c nine seventy seven and mine content
Bark: to deliver interfaces that are significantly better suited to create and process of c twenty one or of c twenty two r of c nine seventy seven and mine content
SpeechT5: to deliver interfaces that are significantly better suited to create and process r f c twenty one r f c twenty two r f c nine seventy seven and mim content
T5-TTS (No Prior, No L_align) (Ours - Unaligned): to deliver interfaces that are significantly better suited to create and process r f c eight twenty one r f c eight twenty one r f c eight twenty seven and mime content and mime content
T5-TTS (W Prior, W L_align) (Ours - Aligned Model): to deliver interfaces that are significantly better suited to create and process r f c eight twenty one r f c eight twenty two r f c nine seventy seven and mime content

Transcript: h t t p colon slash slash teams slash sites slash t a g slash default dot aspx as always any feedback comments
VALLE-X: h t t p p colon slash slash size slash d g slash default dot desks as always any feedback comments
Bark: t t p colon slash slash slash team slash slash t a g slash default do aspix as always any feedback comments
SpeechT5: h t p p colon slash team slash site slash t a g slash default dot asps as always any feedback comments
T5-TTS (No Prior, No L_align) (Ours - Unaligned): h t t e p colon slash size sites slash slash slash df l dot aspects as always any feedback comments
T5-TTS (W Prior, W L_align) (Ours - Aligned Model): h t t p colon slash slash teams slash sites slash t a g slash default dot aspox as always any feedback comments

Transcript: you can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven four seven four or send me a meeting request with all the appropriate information
VALLE-X: you can call me directly at four two five seven zero trim seven three four four or myself four two five four four seven four seven four send me a meeting request with all the appropriate information
Bark: you can call me directly at four two five seven zero three seven three four four or my cell four two five four four four four seven four seven four or send me a meeting request with all the appropriate information
SpeechT5: you can call me directly at four two five seven zero three seven three four four or my cell four two five four four seven four or send me a meeting request with all the appropriate information
T5-TTS (No Prior, No L_align) (Ours - Unaligned): you can call me directly at four two five seven zero three seven three four four my cell four two five four four four four four seven four or send me a meeting request with all the appropriate information on my shin
T5-TTS (W Prior, W L_align) (Ours - Aligned Model): you can call me directly at four two five seven zero three seven three four four or my cell four two five four four four seven four seven four or send me a meeting request with all the appropriate information

Transcript: thanks j r g r are you using the l d d m driver for this system or the in the build x d d m driver
VALLE-X: thanks j r g r using the l d dm driver for this system having the build x d d m driver
Bark: k thanks j r g r are you using the l b d m driver for this system or the in the build x d d m driver
SpeechT5: thanks j r g r are you using the l d m driver for the system or the in the build x d m driver
T5-TTS (No Prior, No L_align) (Ours - Unaligned): thanks j r g r are you using the l d d m driver for this system or the in the build x t m driver
T5-TTS (W Prior, W L_align) (Ours - Aligned Model): thanks j r g r are you using the l d d m driver for this system or the in the build x d d m driver
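
One common way to quantify the differences visible in the samples above is word error rate (WER): the word-level edit distance between the input text and the transcript of the synthesized audio, divided by the number of words in the input text. The following is a minimal illustrative sketch, not the evaluation code used for the paper. The function name `word_error_rate` is hypothetical, and the two hypothesis strings are copied from the first sample above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Input text and two transcripts taken from the first sample above.
reference = ("calendaring agent failed with error code zero x eight zero zero "
             "seven zero zero zero five while saving appointment")
unaligned = ("calendaring agent failed with error code zero x eight zero zero "
             "zero zero zero zero five while saving appointment")
aligned = ("calendaring agent failed with error code zero x eight zero zero "
           "seven zero zero zero five while saving appointment")

wer_unaligned = word_error_rate(reference, unaligned)
wer_aligned = word_error_rate(reference, aligned)
print(f"unaligned WER: {wer_unaligned:.3f}")
print(f"aligned WER:   {wer_aligned:.3f}")
```

In this sample the aligned transcript matches the input text exactly, while the unaligned transcript differs by a single substituted word ("seven" rendered as "zero").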

T5-TTS Generated Audio using Different Codecs

In this section, we present TTS results when the context audio is from a seen speaker and the text is from unseen holdout sentences. We train three T5-TTS (W Prior, W L_align) models, each on a different audio codec: Encodec, DAC, and Mel-FSQ. We present the results on the holdout utterances from the Libri-TTS train-clean-360 set (seen speakers, unseen texts) below.

Transcript | Context Audio | Target Audio | Encodec Predicted Audio | DAC Predicted Audio | Mel-FSQ Predicted Audio
for of course thats where one who dies in despair is bound for
instead he shuffled over to where his wheel lay picked it up and rode slowly off
but it is not for me to choose
whoso knoweth me hath enough of my mischief and whoso knoweth me not i will make myself known to him
you know perfectly well that youve been shut up in your old laboratory all fall

T5-TTS Generated Audio for Unseen Speakers

We also evaluate our models on the zero-shot TTS task, in which the context audio is from an unseen speaker. We present audio samples generated by the T5-TTS model for unseen speakers from the VCTK dataset. We compare our two aligned T5-TTS models: one in which the context audio is passed to the T5 encoder and one in which the context audio is passed to the decoder.

Transcript | Context Audio | Target Audio | T5-TTS (Encoder Context) | T5-TTS (Decoder Context)
the encounter was to change his life
the play is based on a real life event
i have the first six months of next season to prove myself
admission is free
here's a clue
the train was on time
it is not simple