Takin-VC: Zero-shot Voice Conversion via Jointly Hybrid Content and Memory-Augmented Context-Aware Timbre Modeling
[Paper]
[Official Website]
Everest Team
Ximalaya Inc.
Abstract: In this paper, we propose Takin-VC, a novel zero-shot VC framework based on jointly hybrid content and memory-augmented context-aware timbre modeling to tackle this challenge. Specifically, an effective hybrid content encoder, guided by neural codec training, that leverages quantized features from pre-trained HybridFormer and WavLM is first presented to extract the linguistic content of the source speech.
Overview of Takin-VC
An overall summary of the Takin-VC model.
Comparison with Baselines
Source | Prompt | DiffVC | NS2VC | VALLE-VC | SEFVC | Takin-VC |
---|---|---|---|---|---|---|
Paralinguistic Zero-shot Voice Conversion
Source | Prompt | PPG Generated | PPG+WavLM Generated |
---|---|---|---|
Laughing |
|||
Inhaling Sound |
|||
Hush Sound & Cry |
|||
Samples of IEMOCAP Dataset | |||
Source | Prompt | Paralinguistic Tag | PPG+WavLM Generated |
The sound of clearing the throat. |
|||
The sound of clearing the throat. |
|||
Hum Sound. |
|||
Tail Drag Sound. |
Zero-shot Voice Conversion Across Genders and Languages
Type | Source | Prompt | Generated | |
---|---|---|---|---|
[CN] Male-to-Male | ||||
[CN] Male-to-Female | ||||
[CN] Female-to-Female | ||||
[CN] Female-to-Male | ||||
[EN] Male-to-Male | ||||
[EN] Female-to-Female |