arXiv:2511.01056

WhisperVC: Target Speaker-Controllable Mandarin Whisper-to-Speech Conversion

Published on Nov 2, 2025

Abstract

WhisperVC is a three-stage framework for Mandarin whisper-to-speech conversion that combines a fine-tuned Content Encoder with a Conformer-based variational autoencoder, a duration-free FastSpeech 2 acoustic model, and a fine-tuned HiFi-GAN vocoder to achieve high-quality voice reconstruction.
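
As a quick orientation before the full abstract, here is a minimal, hypothetical sketch of how the three stages might be wired at inference time, assuming PyTorch. Every module below is a placeholder (a single linear layer standing in for the Whisper encoder, Conformer VAE, FastSpeech 2, and HiFi-GAN respectively), and all dimensions are illustrative, not taken from the paper; only the data flow is the point.

```python
# Hypothetical data-flow sketch of the three WhisperVC stages, assuming PyTorch.
# Each stage is reduced to a single linear layer standing in for the real module;
# dimensions are illustrative placeholders, not values from the paper.
import torch
import torch.nn as nn

class WhisperVCSketch(nn.Module):
    def __init__(self, feat_dim=80, latent_dim=256, spk_dim=192, mel_dim=80):
        super().__init__()
        # Stage 1: content encoding into a domain-invariant latent sequence.
        self.content_encoder = nn.Linear(feat_dim, latent_dim)
        self.vae_encoder = nn.Linear(latent_dim, latent_dim)
        # Stage 2: acoustic model conditioned on a speaker embedding.
        self.acoustic_model = nn.Linear(latent_dim + spk_dim, mel_dim)
        # Stage 3: vocoder mapping each mel frame to waveform samples
        # (256x upsampling chosen arbitrarily for the sketch).
        self.vocoder = nn.Linear(mel_dim, 256)

    def forward(self, whisper_feats, spk_embed):
        # whisper_feats: (batch, frames, feat_dim); spk_embed: (batch, spk_dim)
        content = self.content_encoder(whisper_feats)             # Stage 1
        latent = self.vae_encoder(content)
        spk = spk_embed[:, None, :].expand(-1, latent.size(1), -1)
        mel = self.acoustic_model(torch.cat([latent, spk], -1))   # Stage 2
        wav = self.vocoder(mel).flatten(1)                        # Stage 3
        return mel, wav

model = WhisperVCSketch()
mel, wav = model(torch.randn(1, 40, 80), torch.randn(1, 192))
print(mel.shape, wav.shape)  # (1, 40, 80), (1, 10240)
```

In the real system each placeholder would be the corresponding pretrained module; the sketch only fixes the pipeline shape: whispered features → content latent → speaker-conditioned mel prediction → waveform.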

AI-generated summary

Whispered speech lacks vocal-fold excitation and exhibits reduced energy and shifted formant frequencies, making natural and intelligible voice reconstruction highly challenging. To address this issue, we propose WhisperVC, a three-stage framework for Mandarin whisper-to-speech (W2S) conversion. Stage 1 employs a fine-tuned Content Encoder based on the OpenAI Whisper-large V3 model and a Conformer-based variational autoencoder with soft-DTW alignment to learn domain-invariant and temporally consistent representations. Stage 2 introduces a deterministic Length-Channel Aligner and a duration-free FastSpeech 2 model conditioned on speaker embeddings for controllable timbre and stable prosody. Stage 3 fine-tunes a HiFi-GAN vocoder on predicted mel-spectrograms to synthesize high-fidelity waveforms. Experiments on the AISHELL6-Whisper corpus demonstrate that WhisperVC achieves near ground-truth quality (DNSMOS 3.11, UTMOS 2.52, CER 18.67%) while maintaining speaker similarity (cosine 0.76) and robust performance under whisper-only inference.
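
The soft-DTW alignment used in Stage 1 is the most self-contained piece to illustrate. Below is a minimal NumPy sketch of the standard soft-DTW recursion (Cuturi and Blondel, 2017), which replaces the hard minimum of classic DTW with a differentiable soft minimum so the alignment cost can serve as a training loss between sequences of different lengths. The squared-Euclidean frame cost, the smoothing parameter gamma, and the feature dimensions here are assumptions for illustration; the abstract does not specify WhisperVC's exact choices.

```python
# Minimal soft-DTW sketch (Cuturi & Blondel, 2017), assuming NumPy.
# Cost function, gamma, and dimensions are illustrative assumptions,
# not the settings used in WhisperVC.
import numpy as np

def soft_min(values, gamma):
    """Differentiable soft minimum: -gamma * log(sum(exp(-v / gamma)))."""
    values = np.asarray(values) / -gamma
    m = values.max()  # subtract the max for numerical stability
    return -gamma * (m + np.log(np.exp(values - m).sum()))

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW alignment cost between sequences x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    # Pairwise squared-Euclidean cost between all frame pairs.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    # R[i, j] = soft-minimal cost of aligning prefixes x[:i] and y[:j].
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = cost[i - 1, j - 1] + soft_min(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma
            )
    return R[n, m]

# Example: align a whispered-feature sequence with a voiced one of
# different length, as Stage 1 must during training.
whisper_feats = np.random.randn(40, 8)  # 40 frames of 8-dim features
voiced_feats = np.random.randn(50, 8)   # 50-frame target sequence
print(soft_dtw(whisper_feats, voiced_feats, gamma=0.1))
```

Smaller gamma makes the soft minimum approach the hard DTW minimum; larger gamma smooths the loss surface, which is presumably why a soft variant is preferred for gradient-based training of the VAE.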
