Whisper-Large-v3 Portuguese - Common Voice Only (Baseline)
This model is a fine-tuned version of openai/whisper-large-v3 for Portuguese automatic speech recognition (ASR). It was trained exclusively on Common Voice 17.0 Portuguese without any synthetic data augmentation, serving as the baseline for evaluating the impact of synthetic speech in ASR training.
Purpose
This baseline model demonstrates the performance achievable using only real, crowdsourced speech data from Common Voice 17.0. It serves as a reference point for comparing the effectiveness of synthetic data augmentation approaches, including:
- Quality-filtered synthetic data (WAVe-based filtering at different thresholds)
- Unfiltered synthetic data augmentation
- Different quality thresholds and their impact on both training efficiency and ASR performance
The model is part of a comprehensive study on WAVe (Word-Aligned Verification) filtering for Portuguese ASR, building on prior work published in IEEE Access (2024); see Citation below.
Model Details
| Property | Value |
|---|---|
| Base Model | openai/whisper-large-v3 |
| Language | Portuguese (pt) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 1550M |
| Training Data | Common Voice 17.0 Portuguese (Real Speech Only) |
| Total Training Samples | 21,866 |
| Sampling Rate | 16 kHz |
Evaluation Results
This Model (whisper-large-v3-cv-only-pt)
| Metric | Value |
|---|---|
| Validation Loss | 0.1260 |
| Validation WER | 11.38% |
| Test WER (Common Voice) | 11.78% |
| Test WER (MLS) | 15.31% |
| Best Checkpoint | Step 150 |
| Max Training Steps | 430 |
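For reference, WER can be recomputed from model outputs with the Hugging Face `evaluate` library. A minimal sketch; the text normalization applied before scoring (lower-casing here) is an assumption, since the card does not document it:

```python
# Minimal WER computation sketch using the `evaluate` library.
# NOTE: the normalization used for the reported scores is not documented;
# lower-casing here is an illustrative assumption.
import evaluate

wer_metric = evaluate.load("wer")

predictions = ["o gato está em casa"]  # model outputs (hypothetical examples)
references = ["o gato está em casa"]   # ground-truth transcripts

wer = wer_metric.compute(
    predictions=[p.lower() for p in predictions],
    references=[r.lower() for r in references],
)
print(f"WER: {wer:.2%}")
```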
Comparison with Synthetic Data Augmentation (Whisper-Large-v3 Portuguese)
| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) | MLS Improvement |
|---|---|---|---|---|---|---|
| Common Voice Only (Baseline) | 430 | 0.1260 | 11.38% | 11.78% | 15.31% | — |
| High-Quality (q ≥ 0.8) + CV | 575 | 0.1045 | 7.33% | 7.94% | 12.41% | +18.9% |
| Mid-High (q ≥ 0.5) + CV | 805 | 0.1040 | 7.73% | 8.33% | 10.27% | +32.9% |
| All Synthetic + CV | 860 | 0.1050 | 7.57% | 8.33% | 13.43% | +12.3% |
Key Performance Characteristics
- Fastest training: Fewest steps (430) among all Portuguese configurations
- Smallest dataset: Only 21,866 samples (no synthetic augmentation)
- Solid baseline: 11.78% Test WER on Common Voice
- Limited cross-domain: 15.31% MLS WER (poorest generalization)
- Reference point: Establishes performance without synthetic data
Training Data
Dataset Composition
| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Portuguese | 21,866 | Real crowdsourced speech |
| Synthetic Data | 0 | No synthetic augmentation |
| Total | 21,866 | — |
Common Voice 17.0 Portuguese
Common Voice is Mozilla's open-source, crowdsourced speech dataset:
- Recording conditions: Varied (home recordings, different microphones, background noise)
- Speaker diversity: Multiple speakers, ages, and accents
- Content: Read sentences from various domains
- Quality: Human-validated transcriptions
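To inspect or re-create the training split, the corpus can be loaded from the Hub with `datasets`. A minimal sketch; the exact filtering that yields the 21,866 training samples is not documented here, and the dataset is gated (you may need to accept its terms and authenticate with `huggingface-cli login`):

```python
# Sketch: load Common Voice 17.0 Portuguese and resample to 16 kHz.
# The subset/filtering that yields the 21,866 training samples is not
# specified in this card.
from datasets import load_dataset, Audio

cv_pt = load_dataset(
    "mozilla-foundation/common_voice_17_0", "pt", split="train"
)
# Whisper expects 16 kHz mono audio; cast the audio column accordingly.
cv_pt = cv_pt.cast_column("audio", Audio(sampling_rate=16_000))

sample = cv_pt[0]
print(sample["sentence"], sample["audio"]["sampling_rate"])
```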
Training Procedure
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 5e-6 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
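These settings map onto `Seq2SeqTrainingArguments` roughly as sketched below, assuming a recent `transformers` version; the per-device batch size and gradient accumulation split (32 × 8 = 256) is an assumption, as only the global batch size is documented:

```python
# Sketch: the hyperparameter table expressed as Seq2SeqTrainingArguments.
# The per-device batch / gradient accumulation split (32 * 8 = 256) is an
# assumption; only the global batch size of 256 is documented.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-cv-only-pt",
    learning_rate=5e-6,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,   # 32 * 8 = 256 global batch
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```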
Training Infrastructure
- GPU: NVIDIA H200 (141 GB VRAM)
- Operating System: Ubuntu 22.04
- Framework: Hugging Face Transformers
Usage
Transcription Pipeline
```python
from transformers import pipeline

# Load the fine-tuned checkpoint as an ASR pipeline on GPU.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-only-pt",
    device="cuda",
)

result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])
```
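For recordings longer than Whisper's 30-second window, the same pipeline can transcribe in chunks; `chunk_length_s` and `return_timestamps` are standard pipeline options:

```python
# Sketch: long-form transcription via chunking (audio > 30 s).
result = transcriber(
    "path/to/long_portuguese_audio.wav",
    chunk_length_s=30,        # process audio in 30-second windows
    return_timestamps=True,   # also return segment-level timestamps
)
print(result["text"])
```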
Direct Model Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-cv-only-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-cv-only-pt")
model.to("cuda")

# Whisper expects 16 kHz mono input.
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(
    audio, sampling_rate=16000, return_tensors="pt"
).input_features.to("cuda")

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
Specifying Language
To pin decoding to Portuguese transcription (rather than relying on automatic language detection), set the generation config before calling `generate`:

```python
# Force Portuguese transcription instead of auto-detecting the language.
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"
```
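Equivalently, the language and task can be passed per call to `generate` instead of mutating the generation config:

```python
# Equivalent per-call form: pass language/task directly to generate().
predicted_ids = model.generate(input_features, language="pt", task="transcribe")
```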
When to Use This Model
This baseline model is ideal when:
- No synthetic data is available: Training on real data only
- Maximum training speed required: Fastest convergence (430 steps)
- In-domain performance is priority: Strong on Common Voice-like data (11.78% WER)
- Comparing augmentation approaches: Reference for measuring synthetic data impact
Consider synthetic-augmented variants for better performance:
- whisper-large-v3-high-mixed-pt: 32.6% relative WER reduction (7.94% vs 11.78% on Common Voice)
- whisper-large-v3-mixed-pt: Best cross-domain (10.27% MLS)
Impact of Synthetic Data Augmentation
This baseline enables quantifying the value of synthetic speech for Portuguese. The "+ Synthetic (best)" column takes the best-performing augmented configuration per metric:
| Metric | CV-Only | + Synthetic (best) | Relative Change |
|---|---|---|---|
| Training Steps | 430 | 575 | +34% (more steps) |
| Dataset Size | 21,866 | 29,178 | +33% (more samples) |
| Test WER (CV) | 11.78% | 7.94% | -32.6% (lower WER) |
| Test WER (MLS) | 15.31% | 10.27% | -32.9% (lower WER) |
Key insight: Synthetic data augmentation provides dramatic improvements for Portuguese ASR across both in-domain and cross-domain benchmarks, with relatively modest increases in training time.
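The improvement figures above are relative WER reductions against this baseline; a quick check of the arithmetic:

```python
# Relative WER reduction, as reported in the comparison tables above.
def relative_wer_reduction(baseline: float, augmented: float) -> float:
    return (baseline - augmented) / baseline

print(f"CV:  {relative_wer_reduction(11.78, 7.94):.1%}")   # ~32.6%
print(f"MLS: {relative_wer_reduction(15.31, 10.27):.1%}")  # ~32.9%
```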
Limitations
- Domain specificity: Optimized for Common Voice-style speech; cross-domain performance limited
- Acoustic diversity: Limited to Common Voice recording conditions and speaker pool
- Data scarcity: No augmentation means model capacity may be underutilized
- Generalization: 15.31% MLS WER shows difficulty adapting to different acoustic conditions
Citation
This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:
```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```
References
- Base Model: openai/whisper-large-v3
- Training Data: mozilla-foundation/common_voice_17_0
- Whisper Paper: Robust Speech Recognition via Large-Scale Weak Supervision
- Motivating Research: Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)
License
Apache 2.0