Abstract: Expressive voice conversion (VC) performs speaker identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC remains largely unexplored, and a major challenge lies in emotional prosody modeling. Moreover, previous approaches rely on vocoders for speech reconstruction, which makes speech quality heavily dependent on vocoder performance. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We utilize speech units derived from self-supervised speech models as content conditioning, along with deep features extracted from speech emotion recognition and speaker verification systems to model emotional style and speaker identity. Objective and subjective evaluations demonstrate the effectiveness of our framework. Code and samples are publicly available.
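To make the conditioning pipeline in the abstract concrete, the sketch below extracts the three signals that condition the diffusion decoder: discrete speech units from a self-supervised (SSL) encoder, an emotion embedding from a speech emotion recognition (SER) model, and a speaker embedding from a speaker verification (SV) model. All encoders here are dummy stand-ins, since the abstract does not specify the actual pretrained models; discretizing SSL features by nearest k-means centroid is a common convention assumed here, not a detail taken from the paper.

    import torch
    import torch.nn as nn

    # A minimal sketch of the three conditioning streams described in the
    # abstract. Every module below is an illustrative stand-in, not the
    # authors' actual network.

    class DummyEncoder(nn.Module):
        """Stand-in for a pretrained SSL / SER / SV encoder."""
        def __init__(self, out_dim: int, frame_level: bool):
            super().__init__()
            self.proj = nn.Linear(1, out_dim)
            self.frame_level = frame_level

        def forward(self, wav: torch.Tensor) -> torch.Tensor:
            # Fake "frames": chunk the waveform into 320-sample windows.
            frames = wav.unfold(-1, 320, 320).mean(dim=-1, keepdim=True)
            feats = self.proj(frames)                  # (T, out_dim)
            return feats if self.frame_level else feats.mean(dim=0)

    ssl_model = DummyEncoder(out_dim=768, frame_level=True)     # content
    ser_encoder = DummyEncoder(out_dim=128, frame_level=False)  # emotion style
    sv_encoder = DummyEncoder(out_dim=192, frame_level=False)   # speaker identity
    codebook = torch.randn(100, 768)  # assumed k-means centroids over SSL features

    def extract_conditions(wav: torch.Tensor):
        """Return (content units, emotion embedding, speaker embedding)."""
        with torch.no_grad():
            feats = ssl_model(wav)                              # (T, 768)
            units = torch.cdist(feats, codebook).argmin(-1)     # (T,) unit IDs
            return units, ser_encoder(wav), sv_encoder(wav)

    units, emo, spk = extract_conditions(torch.randn(16000))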

Code: We will release our code here once the paper is accepted.

-------------------------------- Model Architecture -------------------------------


Figure 1. An illustration of the training phase of the proposed framework, DEVC. Green boxes represent the modules involved in training, while the others are not.
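For a rough picture of the training phase, here is a minimal sketch of one conditional DDPM training step under the standard noise-prediction objective (Ho et al., 2020): the target mel-spectrogram is noised by the forward process, and a denoiser conditioned on speech units plus emotion and speaker embeddings learns to predict that noise. The Denoiser architecture, mel dimensionality, and noise schedule below are illustrative assumptions, not the authors' actual design.

    import torch
    import torch.nn as nn

    T_STEPS = 1000
    betas = torch.linspace(1e-4, 0.02, T_STEPS)          # assumed linear schedule
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    class Denoiser(nn.Module):
        """Hypothetical stand-in for the DEVC decoder."""
        def __init__(self, n_units=100, d_cond=128 + 192, d_mel=80):
            super().__init__()
            self.unit_emb = nn.Embedding(n_units, d_mel)
            self.cond_proj = nn.Linear(d_cond + 1, d_mel)
            self.net = nn.Sequential(nn.Linear(2 * d_mel, 256), nn.ReLU(),
                                     nn.Linear(256, d_mel))

        def forward(self, x_t, t, units, emo, spk):
            # Broadcast utterance-level conditions over time, fuse with the
            # frame-level unit embeddings, and predict the injected noise.
            cond = torch.cat([emo, spk, t.float().view(1) / T_STEPS])
            cond = self.cond_proj(cond).expand_as(x_t) + self.unit_emb(units)
            return self.net(torch.cat([x_t, cond], dim=-1))

    model = Denoiser()
    mel = torch.randn(50, 80)                # target mel-spectrogram frames
    units = torch.randint(0, 100, (50,))     # content units (one per frame)
    emo, spk = torch.randn(128), torch.randn(192)

    t = torch.randint(0, T_STEPS, ())
    noise = torch.randn_like(mel)
    a = alpha_bars[t]
    x_t = a.sqrt() * mel + (1 - a).sqrt() * noise  # forward diffusion q(x_t|x_0)
    loss = nn.functional.mse_loss(model(x_t, t, units, emo, spk), noise)
    loss.backward()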

--------------------------------- Speech Samples ---------------------------------

In a comparative study, we adopt the following two models as our baseline frameworks:
Baseline: JES-StarGAN [1], a many-to-many expressive voice conversion framework, used for the S2S setting.
Baseline-U: A JES-StarGAN-based framework, used for the S2U and U2U settings.

Proposed Method: DEVC, an any-to-any expressive voice conversion framework.

The samples cover four emotions (Angry, Sad, Happy, and Neutral) and three conversion scenarios: conversion between seen speakers (S2S), conversion from seen to unseen speakers (S2U), and conversion between unseen speakers (U2U).

We provide the utterances from source speakers, denoted as Source; the converted utterances from the baseline frameworks, denoted as Baseline [1] or Baseline-U; the converted utterances from our proposed method, denoted as DEVC; and the utterances from target speakers, denoted as Target.


Seen to Seen Speakers (S2S): samples for Angry, Sad, Happy, and Neutral, each with Source, Baseline [1], DEVC, and Target utterances.
Seen to Unseen Speakers (S2U): samples for Angry, Sad, Happy, and Neutral, each with Source, Baseline-U, DEVC, and Target utterances.
Unseen to Unseen Speakers (U2U): samples for Angry, Sad, Happy, and Neutral, each with Source, Baseline-U, DEVC, and Target utterances.
[1] Z. Du, B. Sisman, K. Zhou and H. Li, "Expressive Voice Conversion: A Joint Framework for Speaker Identity and Emotional Style Transfer," 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 2021, pp. 594-601, doi: 10.1109/ASRU51503.2021.9687906.