Accented Text-to-Speech

Speech Samples

The framework is evaluated with one Scottish speaker from the CMU_ARCTIC Corpus [1] and one Australian speaker from the Google TTS API [2].

We study two scenarios for the accented TTS framework: a) only a small accented phonetic lexicon is available for the target accent; b) both a small accented phonetic lexicon and limited accented speech samples are available for the target accent.
The following TTS systems are implemented for comparison.
(1) Char-AM: This is a multi-speaker TTS acoustic model that takes a character sequence as input.
(2) US_G2P-AM: This is an accented TTS framework that consists of an American G2P model and a multi-speaker TTS acoustic model. The American G2P model is part of the pre-trained multi-accent G2P model conditioned on the General American accent ID.
(3) SCOT_G2P-AM: This is an accented TTS framework that consists of a Scottish G2P model and a multi-speaker TTS acoustic model. The Scottish G2P model is the pre-trained multi-accent G2P model fine-tuned with a Scottish phonetic lexicon of 5k words.
(4) AU_G2P-AM: This is similar to SCOT_G2P-AM except that the G2P model is fine-tuned with a General Australian phonetic lexicon of 5k words.
(5) SCOT_G2P-F0_Dur_AM: This is an accented TTS framework that consists of a Scottish G2P model and a multi-speaker TTS acoustic model with integrated pitch and duration predictors. The Scottish G2P model is the pre-trained multi-accent G2P model fine-tuned with a Scottish phonetic lexicon of 5k words.
(6) AU_G2P-F0_Dur_AM: This is similar to SCOT_G2P-F0_Dur_AM except that the G2P model is fine-tuned with a General Australian phonetic lexicon of 5k words.
Note: A suffix on the system name indicates the accented data used: '-L' denotes only a small accented phonetic lexicon, '-S' denotes only limited accented speech data, and '-LS' denotes both.
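The naming convention in the note above can be made concrete with a small helper. This is a minimal sketch for illustration only, not part of the authors' code: the function `parse_system_name`, the `SystemName` class, and the `DATA_SUFFIXES` table are hypothetical names, and the suffix descriptions are paraphrased from the note.

```python
from dataclasses import dataclass
from typing import Optional

# Suffix meanings paraphrased from the note; the table itself is hypothetical.
DATA_SUFFIXES = {
    "L": "small accented phonetic lexicon only",
    "S": "limited accented speech data only",
    "LS": "both lexicon and speech data",
}


@dataclass(frozen=True)
class SystemName:
    base: str            # e.g. "SCOT_G2P-AM"
    data: Optional[str]  # "L", "S", "LS", or None when no suffix is present


def parse_system_name(name: str) -> SystemName:
    """Split a system name into its base system and optional data suffix."""
    base, sep, suffix = name.rpartition("-")
    # Only treat the last component as a data suffix if it is a known one,
    # so that "Char-AM" is not misread as base "Char" with suffix "AM".
    if sep and suffix in DATA_SUFFIXES:
        return SystemName(base=base, data=suffix)
    return SystemName(base=name, data=None)
```

For example, `parse_system_name("SCOT_G2P-AM-L")` yields the base system `SCOT_G2P-AM` with data suffix `L`, while `parse_system_name("Char-AM")` leaves the name intact with no suffix.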

a) Only a small accented phonetic lexicon.

	US_G2P-AM-L	SCOT_G2P-AM-L	Transcription
Scottish	s t aa p t	s t oo p t	stopped
	d aa lw @r r z	d oo l @r r z	dollars
	n aa t^ @ d	n oo d i d	nodded

	US_G2P-AM-L	AU_G2P-AM-L	Transcription
Australian	w oo t^ @r r	w oo t^ @	water
	g aa th i k	g o th i k	gothic
	w or r m	w oo m	warm
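To see at a glance which phonemes change between the American and accented G2P outputs in the tables above, the two sequences can be aligned and compared. The following is a minimal sketch using Python's standard `difflib`; the helper name `phoneme_diff` is hypothetical, and it assumes phonemes are separated by single spaces as printed in the tables. It is not the evaluation procedure used by the authors.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def phoneme_diff(us: str, accented: str) -> List[Tuple[str, str]]:
    """Return (US, accented) phoneme spans that differ between two outputs."""
    a, b = us.split(), accented.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join each differing span back into a space-separated string.
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


# Phoneme strings copied from the tables above.
print(phoneme_diff("s t aa p t", "s t oo p t"))    # "stopped", Scottish
print(phoneme_diff("w oo t^ @r r", "w oo t^ @"))   # "water", Australian
```

For "stopped" the only difference is the vowel (`aa` versus `oo`); for "water" the American `@r r` ending corresponds to a plain `@` in the Australian output.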

b) Both a small accented phonetic lexicon and limited accented speech samples.

		Transcription
Scottish	Not fine-tuned	Three oilers and a fourth engineer, was his greeting.
		Eighteen hundred, he calculated.
		The stout wood was crushed like an eggshell.

		Transcription
Scottish	Fine-tuned with 50 utts	In partnership with Daylight, the pair raided the San Jose Interurban.
		The Eldorado emptied its occupants into the street to see the test.
		Obviously, it was a disease that could be contracted by contact.
	Fine-tuned with 300 utts	I just do appreciate it without being able to express my feelings.
		Bob, growing disgusted, turned back suddenly and attempted to pass Mab.
		His newborn cunning gave him poise and control.

		Transcription
Australian	Not fine-tuned	Nor did it confine itself to mere verbal recommendations.
		but an empty regulation which all so disposed could defy.
		when speaking more particularly of the borough jails.

		Transcription
Australian	Fine-tuned with 50 utts	Nor did it confine itself to mere verbal recommendations.
		Jails, of which the old prison at Reading was a specimen, were still left intact.
		The provision of separate sleeping cells was still quite inadequate. For instance,
	Fine-tuned with 300 utts	to call for information as to the observance of its provisions.
		but an empty regulation which all so disposed could defy.
		They therefore recommended that the prisoners should be removed

References

[1] J. Kominek and A. W. Black, “The CMU ARCTIC speech databases,” in Fifth ISCA Workshop on Speech Synthesis, 2004.
[2] https://cloud.google.com/text-to-speech.

Accented Text-to-Speech Synthesis with Limited Data

Xuehao Zhou, Mingyang Zhang, Yi Zhou, Zhizheng Wu, Haizhou Li

National University of Singapore, The Chinese University of Hong Kong, Shenzhen
