Perturbation AUTOVC: Voice Conversion from Perturbation and Autoencoder Loss


Abstract

AUTOVC is a zero-shot voice-conversion method that performs self-reconstruction with an autoencoder structure. Because it is trained with only the autoencoder loss, it is simple and easy to train. However, it disentangles speaker and linguistic information by adjusting the bottleneck dimension; this requires very fine tuning of that dimension and involves a tradeoff between speech quality and speaker similarity. To address these issues, neural analysis and synthesis (NANSY), a fully self-supervised system that extracts speech features from perturbed inputs, was proposed. NANSY avoids bottleneck-dimension tuning by using perturbations and achieves high reconstruction performance. In this study, we propose Perturbation AUTOVC, a voice-conversion method that combines the structure of AUTOVC with the perturbations of NANSY. The proposed method applies perturbations to the speech signal, as in NANSY, to overcome the limitations of bottleneck-based voice conversion. The perturbations remove speaker-dependent information from the speech, leaving only the linguistic information, which the content encoder then models as a content embedding. To obtain speaker information, we use x-vectors extracted from a widely used pretrained speaker-recognition model. The linguistic and speaker information extracted from the encoders, concatenated with additional energy information, is fed to the decoder to perform self-reconstruction. Like AUTOVC, the model is simple and easy to train using only the autoencoder loss. For evaluation, we measured three objective metrics, character error rate (%), cosine similarity, and short-time objective intelligibility, as well as one subjective metric, the mean opinion score.
The experimental results demonstrate that the proposed method outperforms other voice-conversion techniques and performs robustly in zero-shot conversion.
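The training pipeline described above (perturb the input, encode content, attach a speaker x-vector and frame energy, decode, and minimize the reconstruction loss) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the linear "encoder" and "decoder", the per-bin random gain standing in for NANSY-style perturbations, and all dimensions (`N_MELS`, `CONTENT_DIM`, `SPK_DIM`) are illustrative assumptions; in practice these components are learned neural networks and the x-vector comes from a pretrained speaker-recognition model.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

N_MELS, N_FRAMES = 80, 128        # toy mel-spectrogram size (assumption)
CONTENT_DIM, SPK_DIM = 64, 192    # illustrative embedding sizes (assumption)

def perturb(mel):
    """Stand-in for NANSY-style perturbations (formant shifting, pitch
    randomization, random equalization) that remove speaker-dependent
    information while preserving linguistic content."""
    gain = rng.uniform(0.8, 1.2, size=(mel.shape[0], 1))
    return mel * gain

def content_encoder(mel_perturbed, W_c):
    # Frame-wise linear projection standing in for the content encoder.
    return mel_perturbed.T @ W_c                  # (frames, CONTENT_DIM)

def decoder(features, W_d):
    # Linear projection back to mel bins, standing in for the decoder.
    return (features @ W_d).T                     # (N_MELS, frames)

# Toy parameters (in the actual model these are learned).
W_c = rng.standard_normal((N_MELS, CONTENT_DIM)) * 0.05
W_d = rng.standard_normal((CONTENT_DIM + SPK_DIM + 1, N_MELS)) * 0.05

mel = rng.random((N_MELS, N_FRAMES))          # input mel-spectrogram
spk = rng.standard_normal(SPK_DIM)            # x-vector (pretrained model in practice)
energy = mel.mean(axis=0, keepdims=True)      # additional frame-level energy feature

# Self-reconstruction: content from the perturbed input, speaker and
# energy information concatenated frame by frame.
content = content_encoder(perturb(mel), W_c)  # (frames, CONTENT_DIM)
spk_tiled = np.tile(spk, (N_FRAMES, 1))       # broadcast speaker embedding
features = np.concatenate([content, spk_tiled, energy.T], axis=1)

recon = decoder(features, W_d)
loss = np.mean((recon - mel) ** 2)            # the only training objective:
print(float(loss))                            # autoencoder (reconstruction) loss
```

At conversion time, the x-vector of the target speaker simply replaces that of the source speaker, while the content embedding from the (perturbation-trained) content encoder carries only the linguistic information.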

Figure 1. Perturbation AUTOVC architecture.

The audio samples below are generated using the model proposed in this paper.

Sample for Seen Speaker - VCTK

Source   | Target   | Conversion
p228_051 | p261_055 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC, PPG-VC
p249_039 | p256_031 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC, PPG-VC
p254_031 | p256_035 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC, PPG-VC
p272_056 | p261_051 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC, PPG-VC

Sample for Unseen Speaker - LibriTTS

Source | Target | Conversion
226    | 103    | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC
150    | 196    | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC
8770   | 8838   | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC
8747   | 4640   | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC

Compared systems


The pre-trained models used for sample generation are available at the links below.

AGAIN-VC

VQMIVC

PPG-VC