Whisper-Large-v3 Portuguese - Common Voice Only (Baseline)
This model is a fine-tuned version of openai/whisper-large-v3 for Portuguese automatic speech recognition (ASR). It was trained exclusively on Common Voice 17.0 Portuguese without any synthetic data augmentation, serving as the baseline for evaluating the impact of synthetic speech in ASR training.
Purpose
This baseline model demonstrates the performance achievable using only real, crowdsourced speech data from Common Voice 17.0. It serves as a reference point for comparing the effectiveness of synthetic data augmentation approaches, including:
- Quality-filtered synthetic data (WAVe-based filtering at different thresholds)
- Unfiltered synthetic data augmentation
- Different quality thresholds and their impact on both training efficiency and ASR performance
The model is part of a comprehensive study on WAVe (Word-Aligned Verification) filtering for Portuguese ASR, building on prior work published in IEEE Access (2024); see Citation below.
Model Details
| Property | Value |
|---|---|
| Base Model | openai/whisper-large-v3 |
| Language | Portuguese (pt) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 1550M |
| Training Data | Common Voice 17.0 Portuguese (Real Speech Only) |
| Total Training Samples | 21,866 |
| Sampling Rate | 16 kHz |
Evaluation Results
This Model (whisper-large-v3-cv-only-pt)
| Metric | Value |
|---|---|
| Validation Loss | 0.1260 |
| Validation WER | 11.38% |
| Test WER (Common Voice) | 11.78% |
| Test WER (MLS) | 15.31% |
| Best Checkpoint | Step 150 |
| Max Training Steps | 430 |
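For reference, WER can be recomputed from model outputs with the Hugging Face `evaluate` library. A minimal sketch; the text normalization applied before scoring (lower-casing here) is an assumption, since the card does not document it:

```python
# Minimal WER computation sketch using the `evaluate` library.
# NOTE: the normalization used for the reported scores is not documented;
# lower-casing here is an illustrative assumption.
import evaluate

wer_metric = evaluate.load("wer")

predictions = ["o gato está em casa"]  # model outputs (hypothetical examples)
references = ["o gato está em casa"]   # ground-truth transcripts

wer = wer_metric.compute(
    predictions=[p.lower() for p in predictions],
    references=[r.lower() for r in references],
)
print(f"WER: {wer:.2%}")
```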
Comparison with Synthetic Data Augmentation (Whisper-Large-v3 Portuguese)
| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) | MLS Improvement |
|---|---|---|---|---|---|---|
| Common Voice Only (Baseline) | 430 | 0.1260 | 11.38% | 11.78% | 15.31% | — |
| High-Quality (q ≥ 0.8) + CV | 575 | 0.1045 | 7.33% | 7.94% | 12.41% | +18.9% |
| Mid-High (q ≥ 0.5) + CV | 805 | 0.1040 | 7.73% | 8.33% | 10.27% | +32.9% |
| All Synthetic + CV | 860 | 0.1050 | 7.57% | 8.33% | 13.43% | +12.3% |
Key Performance Characteristics
- Fastest training: Fewest steps (430) among all Portuguese configurations
- Smallest dataset: Only 21,866 samples (no synthetic augmentation)
- Solid baseline: 11.78% Test WER on Common Voice
- Limited cross-domain: 15.31% MLS WER (poorest generalization)
- Reference point: Establishes performance without synthetic data
Training Data
Dataset Composition
| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Portuguese | 21,866 | Real crowdsourced speech |
| Synthetic Data | 0 | No synthetic augmentation |
| Total | 21,866 | — |
Common Voice 17.0 Portuguese
Common Voice is Mozilla's open-source, crowdsourced speech dataset:
- Recording conditions: Varied (home recordings, different microphones, background noise)
- Speaker diversity: Multiple speakers, ages, and accents
- Content: Read sentences from various domains
- Quality: Human-validated transcriptions
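To inspect or re-create the training split, the corpus can be loaded from the Hub with `datasets`. A minimal sketch; the exact filtering that yields the 21,866 training samples is not documented here, and the dataset is gated (you may need to accept its terms and authenticate with `huggingface-cli login`):

```python
# Sketch: load Common Voice 17.0 Portuguese and resample to 16 kHz.
# The subset/filtering that yields the 21,866 training samples is not
# specified in this card.
from datasets import load_dataset, Audio

cv_pt = load_dataset(
    "mozilla-foundation/common_voice_17_0", "pt", split="train"
)
# Whisper expects 16 kHz mono audio; cast the audio column accordingly.
cv_pt = cv_pt.cast_column("audio", Audio(sampling_rate=16_000))

sample = cv_pt[0]
print(sample["sentence"], sample["audio"]["sampling_rate"])
```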
Training Procedure
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 5e-6 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
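These settings map onto `Seq2SeqTrainingArguments` roughly as sketched below, assuming a recent `transformers` version; the per-device batch size and gradient accumulation split (32 × 8 = 256) is an assumption, as only the global batch size is documented:

```python
# Sketch: the hyperparameter table expressed as Seq2SeqTrainingArguments.
# The per-device batch / gradient accumulation split (32 * 8 = 256) is an
# assumption; only the global batch size of 256 is documented.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-cv-only-pt",
    learning_rate=5e-6,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,   # 32 * 8 = 256 global batch
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```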
Training Infrastructure
- GPU: NVIDIA H200 (141 GB VRAM)
- Operating System: Ubuntu 22.04
- Framework: Hugging Face Transformers
Usage
Transcription Pipeline
```python
from transformers import pipeline

# Load the fine-tuned checkpoint as an ASR pipeline on GPU.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-only-pt",
    device="cuda",
)

result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])
```
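For recordings longer than Whisper's 30-second window, the same pipeline can transcribe in chunks; `chunk_length_s` and `return_timestamps` are standard pipeline options:

```python
# Sketch: long-form transcription via chunking (audio > 30 s).
result = transcriber(
    "path/to/long_portuguese_audio.wav",
    chunk_length_s=30,        # process audio in 30-second windows
    return_timestamps=True,   # also return segment-level timestamps
)
print(result["text"])
```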
Direct Model Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-cv-only-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-cv-only-pt")
model.to("cuda")

# Whisper expects 16 kHz mono input.
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features.to("cuda")

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
Specifying Language
To pin decoding to Portuguese transcription (rather than relying on automatic language detection), set the generation config before calling `generate`:

```python
# Force Portuguese transcription instead of auto-detecting the language.
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"
```
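Equivalently, the language and task can be passed per call to `generate` instead of mutating the generation config:

```python
# Equivalent per-call form: pass language/task directly to generate().
predicted_ids = model.generate(input_features, language="pt", task="transcribe")
```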
When to Use This Model
This baseline model is ideal when:
- No synthetic data is available: Training on real data only
- Maximum training speed required: Fastest convergence (430 steps)
- In-domain performance is priority: Strong on Common Voice-like data (11.78% WER)
- Comparing augmentation approaches: Reference for measuring synthetic data impact
Consider synthetic-augmented variants for better performance:
- whisper-large-v3-high-mixed-pt: 32.6% relative WER reduction (7.94% vs 11.78% on Common Voice)
- whisper-large-v3-mixed-pt: Best cross-domain (10.27% MLS)
Impact of Synthetic Data Augmentation
This baseline enables quantifying the value of synthetic speech for Portuguese. The "+ Synthetic (best)" column takes the best-performing augmented configuration per metric:
| Metric | CV-Only | + Synthetic (best) | Relative Change |
|---|---|---|---|
| Training Steps | 430 | 575 | +34% (more steps) |
| Dataset Size | 21,866 | 29,178 | +33% (more samples) |
| Test WER (CV) | 11.78% | 7.94% | -32.6% (lower WER) |
| Test WER (MLS) | 15.31% | 10.27% | -32.9% (lower WER) |
Key insight: Synthetic data augmentation provides dramatic improvements for Portuguese ASR across both in-domain and cross-domain benchmarks, with relatively modest increases in training time.
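The improvement figures above are relative WER reductions against this baseline; a quick check of the arithmetic:

```python
# Relative WER reduction, as reported in the comparison tables above.
def relative_wer_reduction(baseline: float, augmented: float) -> float:
    return (baseline - augmented) / baseline

print(f"CV:  {relative_wer_reduction(11.78, 7.94):.1%}")   # ~32.6%
print(f"MLS: {relative_wer_reduction(15.31, 10.27):.1%}")  # ~32.9%
```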
Limitations
- Domain specificity: Optimized for Common Voice-style speech; cross-domain performance limited
- Acoustic diversity: Limited to Common Voice recording conditions and speaker pool
- Data scarcity: No augmentation means model capacity may be underutilized
- Generalization: 15.31% MLS WER shows difficulty adapting to different acoustic conditions
Citation
This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:
```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```
References
- Base Model: openai/whisper-large-v3
- Training Data: mozilla-foundation/common_voice_17_0
- Whisper Paper: Robust Speech Recognition via Large-Scale Weak Supervision
- Motivating Research: Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)
License
Apache 2.0