SPEECH BERT EMBEDDING FOR IMPROVING PROSODY IN NEURAL TTS

Authors: Liping Chen, Yan Deng, Xi Wang, Frank K. Soong, Lei He (in Proceedings of ICASSP 2021)
Abstract: This paper presents a speech BERT model designed for neural text-to-speech (TTS) to extract an embedding that represents latent prosody attributes in speech segments, where BERT stands for Bidirectional Encoder Representations from Transformers. As a pre-training model, it can learn prosody attributes from a large amount of data, not confined to the training data of the TTS model. In our proposed method, the embedding is extracted from the previous segment of a fixed length and applied, in combination with the mel-spectrogram, to the decoder of the neural TTS model to predict the frames of the following segment. Experimental results on Transformer TTS show that the proposed method can extract fine-grained, segment-level prosody information that is complementary to current utterance-level prosody modeling in neural TTS. Objective results on our internal single-speaker TTS corpus demonstrate its effectiveness at closing the prosody gap between generated speech and recordings. Furthermore, subjective results show that speech generated with our proposed method is preferred on both in-domain and out-of-domain texts, for our internal professional single speaker, multiple speakers, and the public LJ speaker.
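The segment-level conditioning described in the abstract can be illustrated with a toy sketch. Here the segment length, the mean-pooling "encoder", and the plain concatenation are all illustrative assumptions standing in for the actual speech BERT model and TTS decoder, which the paper describes at the architecture level only:

```python
import numpy as np

def mask_segment(mel, start, seg_len, mask_value=0.0):
    """BERT-style masking: blank out a fixed-length span of mel frames.

    The speech BERT model is pre-trained to reconstruct such masked spans.
    """
    masked = mel.copy()
    masked[start:start + seg_len, :] = mask_value
    return masked

def segment_embedding(mel, start, seg_len):
    """Toy stand-in for the speech BERT encoder: mean-pool the segment's frames."""
    return mel[start:start + seg_len, :].mean(axis=0)

mel = np.random.randn(200, 80)          # 200 frames x 80 mel bins (illustrative sizes)
seg_len = 40                            # fixed segment length (assumed value)
masked = mask_segment(mel, 0, seg_len)
emb = segment_embedding(mel, 0, seg_len)  # embedding of the previous segment

# The decoder would consume the previous-segment embedding alongside the
# mel frames when predicting the following segment; here we simply
# broadcast the embedding and concatenate it frame-wise.
decoder_input = np.concatenate(
    [np.tile(emb, (seg_len, 1)), mel[seg_len:2 * seg_len, :]], axis=1)
```

This only mirrors the data flow (previous segment → embedding → decoder conditioning); the real model uses a Transformer encoder and a trained TTS decoder rather than pooling and concatenation.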



First, we present speech reconstructed by the speech BERT model (the audio was unseen during speech BERT training).

Masked segment reconstructed by speech BERT with Griffin-Lim vocoder

There was not a worse vagabond in Shrewsbury than old Barney the piper.
Recording: Padded input: Reconstructed output:
His name was John Palmer.
Recording: Padded input: Reconstructed output:



Next, we list the audio files for the F0 contour comparison shown in Fig. 4: the recording, and speech synthesized without and with the speech BERT embedding.

F0 contour comparison

But the dilemma for Macedonia is even deeper than that.
Recording: w/o: w/:
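F0 contours like those in Fig. 4 can be extracted with any pitch tracker. Below is a minimal, self-contained autocorrelation-based estimator (a generic baseline for illustration, not the extractor used in the paper), demonstrated on a synthetic 120 Hz tone:

```python
import numpy as np

def f0_autocorr(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate the F0 of one frame from the autocorrelation peak.

    A classic baseline: the strongest autocorrelation peak within the
    plausible pitch-period range gives the fundamental period.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # search lags for 60-400 Hz
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(int(0.04 * sr)) / sr            # one 40 ms analysis frame
frame = np.sin(2 * np.pi * 120.0 * t)         # synthetic 120 Hz tone
f0 = f0_autocorr(frame, sr)                   # close to 120 Hz
```

Sliding this over voiced frames of the recording and the two synthesized versions yields the contours compared in the figure.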



Then, we list the synthesized speech used in our subjective evaluation, for the internal single speaker followed by the LJ speaker.

Internal single speaker

Recording sample:

In-domain speech synthesized with the WaveNet vocoder
w/o: w/:
NASA's five Apollo moon landings remain the singular achievement of the space age.
The dotted mental lines between bands, musicians and influences became an obsession.

Out-of-domain speech synthesized with the WaveNet vocoder
w/o: w/:
The dispatcher later explains she doesn't have more details because the caller was "uncooperative."
In the video, she cradled her baby bump while wearing a white dress.

LJ speaker

Note that for the LJ speaker, the WaveNet vocoder was a universal model trained on multiple speakers, so the quality of the synthesized speech is limited. Nevertheless, the prosody difference between the two methods is still audible.
Recording sample:
In-domain speech synthesized with the WaveNet vocoder
w/o: w/:
The fellow is stricken with a judgment, and is mad!
He should take possession of any food, medicine, vomited matter, urine, or faeces, in the room, and should seal them up in clean vessels for examination.

Out-of-domain speech synthesized with the WaveNet vocoder
w/o: w/:
But what the nation is witnessing is not protest and it is certainly not heroic.
Everyone is doing well. No evacuations during the storm, she said.