Speech Samples


The framework is trained and evaluated on the L2-ARCTIC corpus [1].


We evaluate two inference scenarios for generating multi-speaker multi-accent speech: a) multi-speaker inherent-accent and b) multi-speaker cross-accent. The following TTS systems are implemented for comparison:
(1) AM-GST: A multi-speaker TTS system that conditions the AM on the GST model [31]. We set the number of token layers to six, corresponding to the number of accent categories during training.
(2) AM-VAE: A multi-speaker TTS system that conditions the AM on the VAE model [32].
(3) AM-AID: A multi-speaker TTS system that conditions the AM on the accent identity, which is passed to an accent embedding table to obtain an accent embedding for each accent.
(4) AM-SIGAM: A multi-speaker TTS system that conditions the AM on the SIGAM.
(5) AM-SIMSAM: A multi-speaker TTS system that conditions the AM on the speaker-independent multi-scale accent model (SIMSAM), which includes the SIGAM, SILAM, and LAPM, as shown in Fig. 1.
(6) AM-MSAM-S: A multi-speaker TTS system that conditions the AM on the multi-scale accent model (MSAM), consisting of the global accent model (GAM) and local accent model (LAM), without speaker disentanglement. Reference speech is required during inference, denoted as ’-S’.
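The accent embedding table used by AM-AID in item (3) can be pictured as a simple lookup from an accent identity to a vector. The following is only a minimal sketch under stated assumptions: the class name `AccentEmbeddingTable`, the embedding size of 4, and the random initialization are hypothetical and are not taken from the authors' implementation, where the table would normally be learned jointly with the AM.

```python
import random

# Accent codes as they appear in the sample tables on this page.
ACCENTS = ["AR", "ZH", "HI", "KO", "ES", "VI"]


class AccentEmbeddingTable:
    """Toy lookup table holding one vector per accent identity (hypothetical)."""

    def __init__(self, accents, dim, seed=0):
        rng = random.Random(seed)
        # Map each accent code to a row index.
        self.index = {accent: i for i, accent in enumerate(accents)}
        # Assumption: rows are random here; a real system would train them.
        self.rows = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in accents]

    def lookup(self, accent):
        """Return the embedding row for the given accent code."""
        return self.rows[self.index[accent]]


table = AccentEmbeddingTable(ACCENTS, dim=4)
hi_embedding = table.lookup("HI")
```

Every utterance with the same accent identity receives the same vector, so the conditioning signal in this sketch carries no utterance-level or phoneme-level variation.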

Among all implemented systems, only AM-MSAM-S relies on reference speech during inference. The LAPM is excluded from this system because predicting the phoneme-level H_L directly from phonemes is difficult without speaker disentanglement: speaker information remains entangled in H_L. Note that for both the GST and VAE models, the average of all utterance-level embeddings extracted from the training data of each accent represents the corresponding accent category during inference, similar to H_G^{Avg} for the SIGAM.
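The per-accent averaging described above can be sketched as follows. This is an illustration only: the function name `average_embeddings_per_accent` and the 2-dimensional toy vectors are assumptions made for the example, and the actual embeddings would come from the trained GST or VAE model.

```python
from collections import defaultdict


def average_embeddings_per_accent(embeddings, accents):
    """Return one mean embedding per accent label.

    embeddings: list of equal-length lists of floats, one per utterance.
    accents: list of accent labels aligned with `embeddings`.
    """
    if len(embeddings) != len(accents):
        raise ValueError("embeddings and accents must have the same length")

    sums = {}
    counts = defaultdict(int)
    for vector, accent in zip(embeddings, accents):
        if accent not in sums:
            sums[accent] = [0.0] * len(vector)
        elif len(vector) != len(sums[accent]):
            raise ValueError("all embeddings must share one dimension")
        # Accumulate the element-wise sum for this accent.
        for i, value in enumerate(vector):
            sums[accent][i] += value
        counts[accent] += 1

    # Divide each sum by the number of utterances seen for that accent.
    return {a: [s / counts[a] for s in sums[a]] for a in sums}


# Toy data: hypothetical 2-dimensional utterance-level embeddings.
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 10.0]]
toy_accents = ["AR", "AR", "HI"]
accent_table = average_embeddings_per_accent(toy_embeddings, toy_accents)
```

At inference time, the vector stored under the target accent would be fed to the AM in place of an embedding extracted from reference speech, which is why these systems need no reference utterance.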


a) Multi-speaker inherent-accent speech synthesis.

Systems (columns): AM-GST, AM-VAE, AM-SIGAM, AM-AID, AM-SIMSAM, Reference Speech
Accents (rows): AR, ZH, HI, KO, ES, VI

b) Multi-speaker cross-accent speech synthesis. (Target seen speakers)

Systems (columns): AM-GST, AM-VAE, AM-SIGAM, AM-AID, AM-SIMSAM, Target Accent, Target Speaker
Accents (rows): AR, HI, KO, ES, VI

Effects of speaker disentanglement. (The reference accented speech of the source speaker is used.)

Systems (columns): AM-MSAM-S, AM-SIMSAM, Target Speaker
Accents (rows): AR, HI, KO, ES, VI

Reference

[1] Zhao, Guanlong, Evgeny Chukharev-Hudilainen, Sinem Sonsaat, Alif Silpachai, Ivana Lucic, Ricardo Gutierrez-Osuna, and John Levis. "L2-ARCTIC: A non-native English speech corpus." (2018).