Perturbation AUTOVC: Voice Conversion from Perturbation and Autoencoder Loss
Abstract
AUTOVC is a zero-shot voice conversion method that performs self-reconstruction with an autoencoder structure. Because it is trained with only the autoencoder loss, AUTOVC is simple to train. However, it disentangles speaker information from linguistic information by adjusting the bottleneck dimension, which requires very fine tuning of that dimension and involves a tradeoff between speech quality and speaker similarity. Neural analysis and synthesis (NANSY), a fully self-supervised learning system that uses perturbations to extract speech features, was proposed to address these issues. NANSY removes the need to adjust the bottleneck dimension by using perturbation, and it exhibits high reconstruction performance. In this study, we propose perturbation AUTOVC, a voice conversion method that combines the structure of AUTOVC with the perturbation of NANSY. As in NANSY, the proposed method applies perturbations to the speech signal to avoid the problems of voice conversion based on the bottleneck dimension. The perturbation removes the speaker-dependent information present in the speech and leaves only the linguistic information, which is then passed through a content encoder and modeled as a content embedding containing only linguistic information. To obtain speaker information, we use x-vectors, which are widely used in pretrained speaker recognition models. The content embedding from the encoder, the speaker information, and additional energy information are concatenated and used as input to the decoder to perform self-reconstruction. As with AUTOVC, the model is simple to train because it uses only the autoencoder loss. For evaluation, we measured three objective metrics (character error rate (%), cosine similarity, and short-time objective intelligibility) and one subjective metric (mean opinion score).
The experimental results demonstrate that the proposed method outperforms other voice conversion techniques and performs robustly in zero-shot conversion.
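Two of the objective metrics named in the abstract can be illustrated with a minimal sketch. This page does not describe how the metrics were computed, so the following is an assumption: character error rate is taken as the character-level edit distance between a reference transcript and an ASR transcript of the converted speech, as a percentage of the reference length, and cosine similarity is taken between two speaker embedding vectors. The function names are hypothetical and are not from the authors' code.

```python
import math
from typing import Sequence


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Character-level Levenshtein distance as a percentage of reference length.

    Assumption: `hypothesis` would be an ASR transcript of converted speech.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(reference)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)
```

Under this reading, a lower character error rate suggests better preserved linguistic content, and a cosine similarity closer to 1 suggests the converted speech is closer to the target speaker.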
Perturbation AUTOVC architecture.
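The decoder-side data flow described in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the content embedding is assumed to be a sequence of per-frame vectors, the x-vector is assumed to be a single utterance-level vector repeated for every frame, energy is assumed to be one scalar per frame, and the autoencoder loss is assumed to be a mean absolute error. The perturbation, encoders, and decoder themselves are omitted, and all names are hypothetical.

```python
from typing import List, Sequence


def build_decoder_input(
    content: Sequence[Sequence[float]],  # T frames x content dims (assumed shape)
    speaker: Sequence[float],            # one utterance-level x-vector (assumed)
    energy: Sequence[float],             # one scalar per frame (assumed)
) -> List[List[float]]:
    """Concatenate content, speaker, and energy features frame by frame."""
    if len(content) != len(energy):
        raise ValueError("content and energy must have the same number of frames")
    # The speaker vector is repeated on every frame before concatenation.
    return [list(frame) + list(speaker) + [e] for frame, e in zip(content, energy)]


def l1_reconstruction_loss(
    target: Sequence[Sequence[float]],
    predicted: Sequence[Sequence[float]],
) -> float:
    """Mean absolute error between target and reconstructed feature frames."""
    if len(target) != len(predicted):
        raise ValueError("target and predicted must have the same number of frames")
    total, count = 0.0, 0
    for t_frame, p_frame in zip(target, predicted):
        if len(t_frame) != len(p_frame):
            raise ValueError("frames must have the same dimension")
        total += sum(abs(t - p) for t, p in zip(t_frame, p_frame))
        count += len(t_frame)
    if count == 0:
        raise ValueError("inputs must be non-empty")
    return total / count
```

In this sketch, conversion would amount to swapping `speaker` for the target speaker's x-vector at inference time while keeping the source content and energy.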
The audio samples below are generated using the model proposed in this paper.
Samples for Seen Speakers - VCTK
Source | Target | Conversion
---|---|---
p228_051 | p261_055 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC, PPG-VC
p249_039 | p256_031 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC, PPG-VC
p254_031 | p256_035 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC, PPG-VC
p272_056 | p261_051 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC, PPG-VC
Samples for Unseen Speakers - LibriTTS
Source | Target | Conversion
---|---|---
226 | 103 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC
150 | 196 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC
8770 | 8838 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC
8747 | 4640 | Perturbation_AUTOVC (Proposed), AGAIN-VC, AUTOVC, VQMIVC