Toward Degradation-Robust Voice Conversion

Authors: Chien-yu Huang, Kai-Wei Chang, Hung-yi Lee

National Taiwan University, Taiwan

* These authors contributed equally.

Table 1: VC with SE Concatenation and Denoising Training

Source speaker: VCTK - p246 (Male)
Target speaker: VCTK - p343 (Female)

Transcription: “Rangers can expect a physical battle in the national stadium tonight.”


Input utterances
Columns: Clean utterance, Degraded utterance
Rows: Source, Target

Converted Result
Columns: Scenarios, AdaIN-VC, S2VC, S2VC-W2V
Rows (Scenarios):
Clean + baseline model
Degraded + baseline model
Degraded + DEMUCS
Degraded + MetricGAN+
Degraded + Conv-TasNet
Degraded + Denoising training

Table 2: Speech Enhancement Model Performance

Utterance 1: VCTK - p246 (Male); “Rangers can expect a physical battle in the national stadium tonight.”
Utterance 2: VCTK - p343 (Female); “Then came the crunch.”


Input utterances
Columns: Utterance 1, Utterance 2
Rows: Clean utterance, Degraded utterance

Enhancement Result
Columns: Utterance 1, Utterance 2
Rows (SE models):
DEMUCS
MetricGAN+
Conv-TasNet

Table 6: VC Under Embedding Attack and Defense

Source speaker: VCTK - p246 (Male)
Target speaker: VCTK - p343 (Female)

Transcription: “Rangers can expect a physical battle in the national stadium tonight.”


Input utterances
Columns: Clean utterance, + Adversarial Noise (Attack)
Rows: Source, Target

Attack and Defense Result on S2VC
Columns: Scenarios, S2VC
Rows (Scenarios):
Clean + baseline model
Attack + baseline model
Attack + DEMUCS (Defense)
Attack + Denoising Training (Defense)
Attack + Denoising Training + Adversarial Training (Defense)

© 台大語音處理實驗室 NTU Speech Lab