The model is evaluated on a bilingual database provided by Data-baker.


Systems Comparison:
(1) T2 (baseline): Tacotron2 system conditioned on speaker embedding.
(2) RES-ENC (baseline): Tacotron2 system with a residual encoder, conditioned on speaker embedding.
(3) CLWE (proposed): Residual encoder augmented with cross-lingual word embedding, conditioned on speaker embedding.
(4) GT: Ground-truth audio samples.


There are three types of adaptation implemented for all three systems:
(1) CN-Apt: Adaptation of the average model conditioned on speaker embedding, using Mandarin utterances.
(2) EN-Apt: Adaptation of the average model conditioned on speaker embedding, using English utterances.
(3) MIX-Apt: Adaptation of the average model conditioned on speaker embedding, using Mandarin and English utterances.
Note: I-vec refers to the average model conditioned on speaker embedding without adaptation (only for the proposed CLWE system).


Code-switching speech is generated from the CN-, EN-, and MIX-adapted systems, respectively. (Text: 十二月三日晚的香港 music zone 是收官之站)

CN-Apt EN-Apt MIX-Apt I-vec
T2
RES-ENC
CLWE
GT

Mandarin speech is generated from the CN-adapted systems. (Text: 春夏秋冬是自然的轮回呀)
English speech is generated from the EN-adapted systems. (Text: i don't think it's soft justice)

CN-Apt (Mandarin text) EN-Apt (English text)
T2
RES-ENC
CLWE
GT