Speech Samples


The framework is evaluated with one Scottish speaker from the CMU_ARCTIC corpus [1] and one Australian speaker from the Google TTS API [2].


We study two scenarios for the accented TTS framework: a) only a small accented phonetic lexicon is available for the target accent; b) both a small accented phonetic lexicon and limited accented speech samples are available for the target accent.
The following TTS systems are implemented for comparison.
(1) Char-AM: This is a multi-speaker TTS acoustic model that takes a character sequence as input.
(2) US_G2P-AM: This is an accented TTS framework that consists of an American G2P model and a multi-speaker TTS acoustic model. The American G2P model is part of the pre-trained multi-accent G2P model conditioned on the General American accent ID.
(3) SCOT_G2P-AM: This is an accented TTS framework that consists of a Scottish G2P model and a multi-speaker TTS acoustic model. The Scottish G2P model is the pre-trained multi-accent G2P model fine-tuned with a Scottish phonetic lexicon of 5k words.
(4) AU_G2P-AM: This is similar to SCOT_G2P-AM except that the G2P model is fine-tuned with a General Australian phonetic lexicon of 5k words.
(5) SCOT_G2P-F0_Dur_AM: This is an accented TTS framework that consists of a Scottish G2P model and a multi-speaker TTS acoustic model with integrated pitch and duration predictors. The Scottish G2P model is the pre-trained multi-accent G2P model fine-tuned with a Scottish phonetic lexicon of 5k words.
(6) AU_G2P-F0_Dur_AM: This is similar to SCOT_G2P-F0_Dur_AM except that the G2P model is fine-tuned with a General Australian phonetic lexicon of 5k words.
Note: A system name followed by ‘-L’ uses only a small accented phonetic lexicon, one followed by ‘-S’ uses only limited accented speech data, and one followed by ‘-LS’ uses both.
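The suffix convention in the note can be made explicit with a small helper. This is a minimal sketch for illustration only: the names `DATA_SUFFIXES`, `SystemName`, and `parse_system_name` are hypothetical and not part of the framework; only the suffix meanings come from the note above.

```python
from dataclasses import dataclass

# Suffix meanings as stated in the note: -L (lexicon only),
# -S (speech only), -LS (both). Labels are illustrative.
DATA_SUFFIXES = {
    "L": ("lexicon",),
    "S": ("speech",),
    "LS": ("lexicon", "speech"),
}


@dataclass(frozen=True)
class SystemName:
    base: str          # e.g. "SCOT_G2P-F0_Dur_AM"
    resources: tuple   # accented resources used for adaptation


def parse_system_name(name: str) -> SystemName:
    """Split a system name into its base name and adaptation resources."""
    base, sep, suffix = name.rpartition("-")
    if sep and suffix in DATA_SUFFIXES:
        return SystemName(base, DATA_SUFFIXES[suffix])
    # No recognised suffix: the system uses no accented data.
    return SystemName(name, ())


if __name__ == "__main__":
    for system in ["Char-AM", "Char-AM-S", "SCOT_G2P-AM-L", "AU_G2P-F0_Dur_AM-LS"]:
        print(system, "->", parse_system_name(system))
```

Under this sketch, `Char-AM` parses to a base name with no accented resources, because `AM` is not one of the three data suffixes.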


a) Only a Small Accented Phonetic Lexicon.

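Accent | US_G2P-AM-L | SCOT_G2P-AM-L | Reference Speech | Transcription
Scottish | s t aa p t | s t oo p t | (audio) | stopped
Scottish | d aa lw @r r z | d oo l @r r z | (audio) | dollars
Scottish | n aa t^ @ d | n oo d i d | (audio) | nodded

Accent | US_G2P-AM-L | AU_G2P-AM-L | Reference Speech | Transcription
Australian | w oo t^ @r r | w oo t^ @ | (audio) | water
Australian | g aa th i k | g o th i k | (audio) | gothic
Australian | w or r m | w oo m | (audio) | warm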
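The phoneme sequences above can be compared programmatically to see where the accented G2P output departs from the General American one. The following is a minimal sketch, not part of the framework: it assumes phonemes are space-separated as shown above, and the helper name `phoneme_diff` is hypothetical. It uses Python's standard `difflib` module to report only the spans that differ.

```python
from difflib import SequenceMatcher


def phoneme_diff(source: str, target: str):
    """Return the non-matching spans between two space-separated phoneme strings.

    Each item is (operation, source_phonemes, target_phonemes), where
    operation is "replace", "delete", or "insert".
    """
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    return [
        (tag, src[i1:i2], tgt[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Word, General American sequence, accented sequence (from the samples above).
    examples = [
        ("stopped", "s t aa p t", "s t oo p t"),
        ("water", "w oo t^ @r r", "w oo t^ @"),
        ("warm", "w or r m", "w oo m"),
    ]
    for word, us, accented in examples:
        print(word, phoneme_diff(us, accented))
```

For "stopped", the only reported difference is the vowel: `aa` in the General American sequence is replaced by `oo` in the Scottish one.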

b) Both a Small Accented Phonetic Lexicon and Limited Accented Speech Samples.

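Accent | Fine-tuning | Char-AM | SCOT_G2P-AM-L | SCOT_G2P-F0_Dur_AM-L | Reference Speech | Transcription
Scottish | Not fine-tuned | (audio) | (audio) | (audio) | (audio) | Three oilers and a fourth engineer, was his greeting.
Scottish | Not fine-tuned | (audio) | (audio) | (audio) | (audio) | Eighteen hundred, he calculated.
Scottish | Not fine-tuned | (audio) | (audio) | (audio) | (audio) | The stout wood was crushed like an eggshell.

Accent | Fine-tuning | Char-AM-S | SCOT_G2P-AM-LS | SCOT_G2P-F0_Dur_AM-LS | Reference Speech | Transcription
Scottish | Fine-tuned with 50 utts | (audio) | (audio) | (audio) | (audio) | In partnership with Daylight, the pair raided the San Jose Interurban.
Scottish | Fine-tuned with 50 utts | (audio) | (audio) | (audio) | (audio) | The Eldorado emptied its occupants into the street to see the test.
Scottish | Fine-tuned with 50 utts | (audio) | (audio) | (audio) | (audio) | Obviously, it was a disease that could be contracted by contact.
Scottish | Fine-tuned with 300 utts | (audio) | (audio) | (audio) | (audio) | I just do appreciate it without being able to express my feelings.
Scottish | Fine-tuned with 300 utts | (audio) | (audio) | (audio) | (audio) | Bob, growing disgusted, turned back suddenly and attempted to pass Mab.
Scottish | Fine-tuned with 300 utts | (audio) | (audio) | (audio) | (audio) | His newborn cunning gave him poise and control.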
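
Accent | Fine-tuning | Char-AM | AU_G2P-AM-L | AU_G2P-F0_Dur_AM-L | Reference Speech | Transcription
Australian | Not fine-tuned | (audio) | (audio) | (audio) | (audio) | Nor did it confine itself to mere verbal recommendations.
Australian | Not fine-tuned | (audio) | (audio) | (audio) | (audio) | but an empty regulation which all so disposed could defy.
Australian | Not fine-tuned | (audio) | (audio) | (audio) | (audio) | when speaking more particularly of the borough jails.

Accent | Fine-tuning | Char-AM-S | AU_G2P-AM-LS | AU_G2P-F0_Dur_AM-LS | Reference Speech | Transcription
Australian | Fine-tuned with 50 utts | (audio) | (audio) | (audio) | (audio) | Nor did it confine itself to mere verbal recommendations.
Australian | Fine-tuned with 50 utts | (audio) | (audio) | (audio) | (audio) | Jails, of which the old prison at Reading was a specimen, were still left intact.
Australian | Fine-tuned with 50 utts | (audio) | (audio) | (audio) | (audio) | The provision of separate sleeping cells was still quite inadequate. For instance,
Australian | Fine-tuned with 300 utts | (audio) | (audio) | (audio) | (audio) | to call for information as to the observance of its provisions.
Australian | Fine-tuned with 300 utts | (audio) | (audio) | (audio) | (audio) | but an empty regulation which all so disposed could defy.
Australian | Fine-tuned with 300 utts | (audio) | (audio) | (audio) | (audio) | They therefore recommended that the prisoners should be removed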

References

[1] J. Kominek and A. W. Black, “The CMU Arctic speech databases,” in Fifth ISCA Workshop on Speech Synthesis, 2004.
[2] https://cloud.google.com/text-to-speech.