Code-switching Text-to-Speech

The model is evaluated on a bilingual database provided by Data-baker.

Systems Comparison:
(1) T2 (baseline): Tacotron2 system conditioned on a speaker embedding.
(2) RES-ENC (baseline): Tacotron2 system with a residual encoder, conditioned on a speaker embedding.
(3) CLWE (proposed): Residual encoder augmented with cross-lingual word embedding, conditioned on a speaker embedding.
(4) GT: Ground-truth audio samples.

There are three types of adaptation implemented for all three systems:
(1) CN-Apt: Adaptation of the average model, conditioned on a speaker embedding, using Mandarin utterances.
(2) EN-Apt: Adaptation of the average model, conditioned on a speaker embedding, using English utterances.
(3) MIX-Apt: Adaptation of the average model, conditioned on a speaker embedding, using both Mandarin and English utterances.
Note: I-vec refers to the average model conditioned on a speaker embedding without adaptation (proposed CLWE system only).

Code-switching speech is generated from the CN-, EN-, and MIX-adapted systems, respectively. (Text: 十二月三日晚的香港 music zone 是收官之站)

	CN-Apt	EN-Apt	MIX-Apt	I-vec
T2
RES-ENC
CLWE
GT

Mandarin speech is generated from the CN-adapted systems. (Text: 春夏秋冬是自然的轮回呀)
English speech is generated from the EN-adapted systems. (Text: i don't think it's soft justice)

	CN-Apt (Mandarin text)	EN-Apt (English text)
T2
RES-ENC
CLWE
GT

END-TO-END CODE-SWITCHING TTS WITH CROSS-LINGUAL LANGUAGE MODEL

Xuehao Zhou, Xiaohai Tian, Grandee Lee, Rohan Kumar Das, Haizhou Li
