Perturbation AUTOVC: Voice Conversion from Perturbation and Autoencoder Loss


Abstract

AUTOVC is a zero-shot voice-conversion method that performs self-reconstruction with an autoencoder structure. Because it is trained with only the autoencoder loss, it is simple and easy to train. However, it disentangles speaker and linguistic information by adjusting the bottleneck dimension; this requires very fine tuning of that dimension and involves a tradeoff between speech quality and speaker similarity. To address these issues, neural analysis and synthesis (NANSY), a fully self-supervised system that extracts speech features from perturbed inputs, was proposed. NANSY avoids bottleneck-dimension tuning by using perturbations and achieves high reconstruction performance. In this study, we propose Perturbation AUTOVC, a voice-conversion method that combines the structure of AUTOVC with the perturbations of NANSY. The proposed method applies perturbations to the speech signal, as in NANSY, to overcome the limitations of bottleneck-based voice conversion. The perturbations remove speaker-dependent information from the speech, leaving only the linguistic information, which the content encoder then models as a content embedding. To obtain speaker information, we use x-vectors extracted from a widely used pretrained speaker-recognition model. The linguistic and speaker information extracted from the encoders, concatenated with additional energy information, is fed to the decoder to perform self-reconstruction. Like AUTOVC, the model is simple and easy to train using only the autoencoder loss. For evaluation, we measured three objective metrics, character error rate (%), cosine similarity, and short-time objective intelligibility, as well as one subjective metric, the mean opinion score.
The experimental results demonstrate that the proposed method outperforms other voice-conversion techniques and performs robustly in zero-shot conversion.
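The training pipeline described above (perturb the input, encode content, attach a speaker x-vector and frame energy, decode, and minimize the reconstruction loss) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the linear "encoder" and "decoder", the per-bin random gain standing in for NANSY-style perturbations, and all dimensions (`N_MELS`, `CONTENT_DIM`, `SPK_DIM`) are illustrative assumptions; in practice these components are learned neural networks and the x-vector comes from a pretrained speaker-recognition model.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

N_MELS, N_FRAMES = 80, 128        # toy mel-spectrogram size (assumption)
CONTENT_DIM, SPK_DIM = 64, 192    # illustrative embedding sizes (assumption)

def perturb(mel):
    """Stand-in for NANSY-style perturbations (formant shifting, pitch
    randomization, random equalization) that remove speaker-dependent
    information while preserving linguistic content."""
    gain = rng.uniform(0.8, 1.2, size=(mel.shape[0], 1))
    return mel * gain

def content_encoder(mel_perturbed, W_c):
    # Frame-wise linear projection standing in for the content encoder.
    return mel_perturbed.T @ W_c                  # (frames, CONTENT_DIM)

def decoder(features, W_d):
    # Linear projection back to mel bins, standing in for the decoder.
    return (features @ W_d).T                     # (N_MELS, frames)

# Toy parameters (in the actual model these are learned).
W_c = rng.standard_normal((N_MELS, CONTENT_DIM)) * 0.05
W_d = rng.standard_normal((CONTENT_DIM + SPK_DIM + 1, N_MELS)) * 0.05

mel = rng.random((N_MELS, N_FRAMES))          # input mel-spectrogram
spk = rng.standard_normal(SPK_DIM)            # x-vector (pretrained model in practice)
energy = mel.mean(axis=0, keepdims=True)      # additional frame-level energy feature

# Self-reconstruction: content from the perturbed input, speaker and
# energy information concatenated frame by frame.
content = content_encoder(perturb(mel), W_c)  # (frames, CONTENT_DIM)
spk_tiled = np.tile(spk, (N_FRAMES, 1))       # broadcast speaker embedding
features = np.concatenate([content, spk_tiled, energy.T], axis=1)

recon = decoder(features, W_d)
loss = np.mean((recon - mel) ** 2)            # the only training objective:
print(float(loss))                            # autoencoder (reconstruction) loss
```

At conversion time, the x-vector of the target speaker simply replaces that of the source speaker, while the content embedding from the (perturbation-trained) content encoder carries only the linguistic information.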

Figure 1. Perturbation AUTOVC architecture.

The audio samples below are generated using the model proposed in this paper.

Sample for Seen Speaker - VCTK

Source   | Target   | Conversion
p228_051 | p261_055 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC, PPG-VC
p249_039 | p256_031 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC, PPG-VC
p254_031 | p256_035 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC, PPG-VC
p272_056 | p261_051 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC, PPG-VC

Sample for Unseen Speaker - LibriTTS

Source | Target | Conversion
226    | 103    | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC
150    | 196    | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC
8770   | 8838   | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC
8747   | 4640   | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC

Compared systems


The pre-trained models used for sample generation are available at the links below.

AGAIN-VC

VQMIVC

PPG-VC