The model is evaluated on a bilingual (Mandarin and English) database provided by Data-baker.
Systems Comparison:
(1) T2 (baseline): Tacotron2 system conditioned on speaker embedding.
(2) RES-ENC (baseline): Tacotron2 system with a residual encoder, conditioned on speaker embedding.
(3) CLWE (proposed): Residual encoder augmented with cross-lingual word embeddings, conditioned on speaker embedding.
(4) GT: Ground-truth audio samples.
There are three types of adaptation implemented for all three systems:
(1) CN-Apt: Adaptation of the average model conditioned on speaker embedding, using Mandarin utterances.
(2) EN-Apt: Adaptation of the average model conditioned on speaker embedding, using English utterances.
(3) MIX-Apt: Adaptation of the average model conditioned on speaker embedding, using both Mandarin and English utterances.
Note: I-vec refers to the average model conditioned on speaker embedding without adaptation (available only for the proposed CLWE system).
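The comparison grid described above can be enumerated with a short sketch. This is only an illustration of which system and condition pairs the page lists. The names `SYSTEMS`, `ADAPTATIONS`, and `sample_grid` are hypothetical and not part of the authors' code, and GT is left out because it is recorded audio rather than a synthesis system.

```python
from itertools import product

# Labels taken from the lists above; the variable names are illustrative only.
SYSTEMS = ["T2", "RES-ENC", "CLWE"]
ADAPTATIONS = ["CN-Apt", "EN-Apt", "MIX-Apt"]


def sample_grid():
    """Return the (system, condition) pairs for which synthesized samples are listed."""
    # Every system is adapted in all three ways.
    grid = list(product(SYSTEMS, ADAPTATIONS))
    # I-vec (average model without adaptation) is listed only for the proposed CLWE system.
    grid.append(("CLWE", "I-vec"))
    return grid


if __name__ == "__main__":
    for system, condition in sample_grid():
        print(f"{system:8s} {condition}")
```

Under these assumptions the grid has ten synthesized conditions: nine adapted ones plus the single unadapted I-vec condition for CLWE.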
Code-switched speech is generated from the CN-, EN-, and MIX-adapted systems, respectively. (Text: 十二月三日晚的香港 music zone 是收官之站)
| System | CN-Apt | EN-Apt | MIX-Apt | I-vec |
|---|---|---|---|---|
| T2 | | | | |
| RES-ENC | | | | |
| CLWE | | | | |
| GT | | | | |
Mandarin speech is generated from the CN-adapted systems. (Text: 春夏秋冬是自然的轮回呀)
English speech is generated from the EN-adapted systems. (Text: i don't think it's soft justice)
| System | CN-Apt (Mandarin text) | EN-Apt (English text) |
|---|---|---|
| T2 | | |
| RES-ENC | | |
| CLWE | | |
| GT | | |