The framework is trained and evaluated on the L2-ARCTIC corpus [1].
We evaluate two inference scenarios with the multi-speaker multi-accent TTS framework: (a) multi-speaker inherent-accent speech synthesis, i.e., synthesizing speech in the target speaker's own accent, and (b) multi-speaker cross-accent speech synthesis, i.e., synthesizing speech in an accent different from the target speaker's own.
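For illustration, the two scenarios differ only in which accent embedding the acoustic model is conditioned on. The following minimal Python sketch shows this; the `synthesize` helper, `acoustic_model` interface, and `accent_table` lookup are our own hypothetical names, not the framework's actual API.

```python
# Minimal sketch of the two inference scenarios (hypothetical interface).

def synthesize(acoustic_model, accent_table, text, speaker_id, accent):
    """Condition the AM on an accent embedding for a given speaker."""
    return acoustic_model(text,
                          accent_embedding=accent_table[accent],
                          speaker_id=speaker_id)

# (a) inherent-accent synthesis: the accent matches the speaker's own,
#     e.g., a Hindi-accented speaker rendered with the HI embedding.
# mel = synthesize(am, table, text, speaker_id="HI_spk1", accent="HI")
#
# (b) cross-accent synthesis: the accent differs from the speaker's own,
#     e.g., the same speaker rendered with the KO embedding.
# mel = synthesize(am, table, text, speaker_id="HI_spk1", accent="KO")
```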
The following TTS systems are implemented for comparison:
(1) GST: This is the TTS system in which the AM conditions on the style embedding from the GST model [19]. We set the number of style tokens to six, the same as the number of training accent categories.
(2) VAE: This is the TTS system in which the AM conditions on the latent style representation from a VAE model.
(3) GADM: This is the TTS system in which the AM conditions on the utterance-level accent embedding $H_G$ from the GADM.
(4) MSAM: This is the TTS system in which the AM conditions on the multi-scale accent embeddings obtained from the GAM and LAM, with no speaker disentanglement performed. Note that this system requires reference speech during inference, since the LAPM is not involved.
(5) MSADM: This is the proposed TTS system in which the AM conditions on the GADM and LADM, which produce the multi-scale, speaker-independent, and accent-discriminative embeddings $H_G$ and $H_L$, respectively. The LAPM is trained and used during inference. Note: for GST and VAE, the average vector of the style embeddings across all training data of an accent category is used during inference, similar to $H_G^{Avg}$ for GADM (see the sketch after this list).
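As a concrete illustration of this averaging step, the PyTorch sketch below computes a per-accent average of utterance-level style embeddings over the training set; `style_encoder` and `train_set` are hypothetical stand-ins for the trained style model and training data, not the actual implementation.

```python
import torch

def accent_average_embeddings(style_encoder, train_set):
    """Average utterance-level style embeddings over all training
    utterances of each accent category (cf. H_G^Avg)."""
    sums, counts = {}, {}
    style_encoder.eval()
    with torch.no_grad():
        for mel, accent in train_set:  # (mel-spectrogram, accent label) pairs
            # One style embedding per utterance, e.g., from GST/VAE/GADM.
            h = style_encoder(mel.unsqueeze(0)).squeeze(0)
            sums[accent] = sums.get(accent, torch.zeros_like(h)) + h
            counts[accent] = counts.get(accent, 0) + 1
    return {accent: sums[accent] / counts[accent] for accent in sums}

# At inference, the AM conditions on the stored average for the desired
# accent instead of an embedding extracted from reference speech, e.g.:
# h_avg = avg_emb["KO"]; mel = acoustic_model(text, h_avg, speaker_id)
```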
a) Multi-speaker inherent-accent speech synthesis. [Audio sample grid: GST, VAE, GADM, and MSADM outputs alongside the reference speech, for the AR, ZH, HI, KO, ES, and VI accents.]
b) Multi-speaker cross-accent speech synthesis. [Audio sample grid: GST, VAE, GADM, and MSADM outputs alongside the target-accent and target-speaker references, for the AR, HI, KO, ES, and VI accents.]
Effects of speaker disentanglement (the reference accented speech of the source speaker is utilized). [Audio sample grid: MSAM vs. MSADM outputs alongside the target-speaker reference, for the AR, HI, KO, ES, and VI accents.]
Reference
[1] G. Zhao, E. Chukharev-Hudilainen, S. Sonsaat, A. Silpachai, I. Lucic, R. Gutierrez-Osuna, and J. Levis, "L2-ARCTIC: A non-native English speech corpus," in Proc. Interspeech, 2018.