Toward Degradation-Robust Voice Conversion


Authors: Chien-yu Huang*, Kai-Wei Chang*, Hung-yi Lee

National Taiwan University, Taiwan

* These authors contributed equally.

Table 1: VC with SE Concatenation and Denoising Training
Source speaker: VCTK - p246 (Male)
Target speaker: VCTK - p343 (Female)

Transcription: “Rangers can expect a physical battle in the national stadium tonight.”

Rows: Source, Target. Columns: Clean utterance, Degraded utterance.

Converted Result


Scenarios (columns: AdaIN-VC, S2VC, S2VC-W2V)
Clean + baseline model
Degraded + baseline model
Degraded + DEMUCS
Degraded + MetricGAN+
Degraded + Conv-TasNet
Degraded + Denoising training

Table 2: Speech Enhancement Model Performance
Utterance 1: VCTK - p246 (Male)
“Rangers can expect a physical battle in the national stadium tonight.”
Utterance 2: VCTK - p343 (Female)
“Then came the crunch.”

Rows: Clean utterance, Degraded utterance. Columns: Utterance 1, Utterance 2.

Enhancement Result


Columns: Utterance 1, Utterance 2
DEMUCS
MetricGAN+
Conv-TasNet

Table 6: VC Under Embedding Attack and Defense
Source speaker: VCTK - p246 (Male)
Target speaker: VCTK - p343 (Female)

Transcription: “Rangers can expect a physical battle in the national stadium tonight.”

Rows: Source, Target. Columns: Clean utterance, + Adversarial Noise (Attack).

Attack and Defense Result on S2VC


Scenarios (column: S2VC)
Clean + baseline model
Attack + baseline model
Attack + DEMUCS (Defense)
Attack + Denoising Training (Defense)
Attack + Denoising Training + Adversarial Training (Defense)

© 台大語音處理實驗室 NTU Speech Lab