arXiv:2511.01056

WhisperVC: Target Speaker-Controllable Mandarin Whisper-to-Speech Conversion

Published on Nov 2, 2025

Abstract

WhisperVC is a three-stage framework for Mandarin whisper-to-speech conversion that combines a fine-tuned Content Encoder with a Conformer-based variational autoencoder, a duration-free FastSpeech 2 acoustic model, and a fine-tuned HiFi-GAN vocoder to achieve high-quality voice reconstruction.
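
As a quick orientation before the full abstract, here is a minimal, hypothetical sketch of how the three stages might be wired at inference time, assuming PyTorch. Every module below is a placeholder (a single linear layer standing in for the Whisper encoder, Conformer VAE, FastSpeech 2, and HiFi-GAN respectively), and all dimensions are illustrative, not taken from the paper; only the data flow is the point.

```python
# Hypothetical data-flow sketch of the three WhisperVC stages, assuming PyTorch.
# Each stage is reduced to a single linear layer standing in for the real module;
# dimensions are illustrative placeholders, not values from the paper.
import torch
import torch.nn as nn

class WhisperVCSketch(nn.Module):
    def __init__(self, feat_dim=80, latent_dim=256, spk_dim=192, mel_dim=80):
        super().__init__()
        # Stage 1: content encoding into a domain-invariant latent sequence.
        self.content_encoder = nn.Linear(feat_dim, latent_dim)
        self.vae_encoder = nn.Linear(latent_dim, latent_dim)
        # Stage 2: acoustic model conditioned on a speaker embedding.
        self.acoustic_model = nn.Linear(latent_dim + spk_dim, mel_dim)
        # Stage 3: vocoder mapping each mel frame to waveform samples
        # (256x upsampling chosen arbitrarily for the sketch).
        self.vocoder = nn.Linear(mel_dim, 256)

    def forward(self, whisper_feats, spk_embed):
        # whisper_feats: (batch, frames, feat_dim); spk_embed: (batch, spk_dim)
        content = self.content_encoder(whisper_feats)             # Stage 1
        latent = self.vae_encoder(content)
        spk = spk_embed[:, None, :].expand(-1, latent.size(1), -1)
        mel = self.acoustic_model(torch.cat([latent, spk], -1))   # Stage 2
        wav = self.vocoder(mel).flatten(1)                        # Stage 3
        return mel, wav

model = WhisperVCSketch()
mel, wav = model(torch.randn(1, 40, 80), torch.randn(1, 192))
print(mel.shape, wav.shape)  # (1, 40, 80), (1, 10240)
```

In the real system each placeholder would be the corresponding pretrained module; the sketch only fixes the pipeline shape: whispered features → content latent → speaker-conditioned mel prediction → waveform.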

AI-generated summary

Whispered speech lacks vocal-fold excitation and exhibits reduced energy and shifted formant frequencies, making natural and intelligible voice reconstruction highly challenging. To address this issue, we propose WhisperVC, a three-stage framework for Mandarin whisper-to-speech (W2S) conversion. Stage 1 employs a fine-tuned Content Encoder based on the OpenAI Whisper-large V3 model and a Conformer-based variational autoencoder with soft-DTW alignment to learn domain-invariant and temporally consistent representations. Stage 2 introduces a deterministic Length-Channel Aligner and a duration-free FastSpeech 2 model conditioned on speaker embeddings for controllable timbre and stable prosody. Stage 3 fine-tunes a HiFi-GAN vocoder on predicted mel-spectrograms to synthesize high-fidelity waveforms. Experiments on the AISHELL6-Whisper corpus demonstrate that WhisperVC achieves near ground-truth quality (DNSMOS 3.11, UTMOS 2.52, CER 18.67%) while maintaining speaker similarity (cosine 0.76) and robust performance under whisper-only inference.
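
The soft-DTW alignment used in Stage 1 is the most self-contained piece to illustrate. Below is a minimal NumPy sketch of the standard soft-DTW recursion (Cuturi and Blondel, 2017), which replaces the hard minimum of classic DTW with a differentiable soft minimum so the alignment cost can serve as a training loss between sequences of different lengths. The squared-Euclidean frame cost, the smoothing parameter gamma, and the feature dimensions here are assumptions for illustration; the abstract does not specify WhisperVC's exact choices.

```python
# Minimal soft-DTW sketch (Cuturi & Blondel, 2017), assuming NumPy.
# Cost function, gamma, and dimensions are illustrative assumptions,
# not the settings used in WhisperVC.
import numpy as np

def soft_min(values, gamma):
    """Differentiable soft minimum: -gamma * log(sum(exp(-v / gamma)))."""
    values = np.asarray(values) / -gamma
    m = values.max()  # subtract the max for numerical stability
    return -gamma * (m + np.log(np.exp(values - m).sum()))

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW alignment cost between sequences x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    # Pairwise squared-Euclidean cost between all frame pairs.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    # R[i, j] = soft-minimal cost of aligning prefixes x[:i] and y[:j].
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            R[i, j] = cost[i - 1, j - 1] + soft_min(
                [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]], gamma
            )
    return R[n, m]

# Example: align a whispered-feature sequence with a voiced one of
# different length, as Stage 1 must during training.
whisper_feats = np.random.randn(40, 8)  # 40 frames of 8-dim features
voiced_feats = np.random.randn(50, 8)   # 50-frame target sequence
print(soft_dtw(whisper_feats, voiced_feats, gamma=0.1))
```

Smaller gamma makes the soft minimum approach the hard DTW minimum; larger gamma smooths the loss surface, which is presumably why a soft variant is preferred for gradient-based training of the VAE.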
