Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling

[Paper] [Official Website]


Everest Team

Ximalaya Inc.

Abstract: In this paper, we propose Takin-VC, a novel zero-shot VC framework based on jointly hybrid content and memory-augmented context-aware timbre modeling to tackle this challenge. Specifically, an effective hybrid content encoder, guided by neural codec training, that leverages quantized features from pre-trained HybridFormer and WavLM is first presented to extract the linguistic content of the source speech.

Overview of Takin-VC

An overview of the Takin-VC model.

Comparison with Baselines

Source | Prompt | DiffVC | NS2VC | VALLE-VC | SEFVC | Takin-VC

Paralinguistic Zero-shot Voice Conversion

Source | Prompt | PPG Generated | PPG+WavLM Generated

Laughing




Inhaling Sound




Hush Sound & Cry



Samples from the IEMOCAP Dataset
Source | Prompt | Paralinguistic Tag | PPG+WavLM Generated


The sound of clearing the throat.



The sound of clearing the throat.



Hum Sound.



Tail Drag Sound.

Zero-Shot Voice Conversion Across Genders and Languages

Type | Source | Prompt | Generated
[CN] Male-to-Male


[CN] Male-to-Female


[CN] Female-to-Female


[CN] Female-to-Male


[EN] Male-to-Male


[EN] Female-to-Female