Whisper-Large-v3 Portuguese - Full Synthetic Data (Unfiltered)
This model is a fine-tuned version of openai/whisper-large-v3 for Portuguese automatic speech recognition (ASR). It was trained on Common Voice 17.0 Portuguese combined with all synthetic speech data without quality filtering, representing the maximum data augmentation approach.
Purpose
This model evaluates the impact of maximum data augmentation using all available synthetic speech without quality filtering. It serves as a comparison point to demonstrate:
- Impact of unfiltered synthetic data: Including every synthetic sample, low-quality ones included, still improves on the Common Voice-only baseline (11.78% → 8.33% Test WER) but is less efficient
- Quality vs quantity tradeoff: Achieves 8.33% Test WER, the same as the mid-high filtered model, but needs more training steps (860 vs 805)
- Cross-domain limitations: 13.43% MLS WER vs 10.27% for the mid-high filtered model, demonstrating the value of quality filtering
The model is part of a comprehensive study on WAVe (Word-Aligned Verification) filtering, demonstrating that while unfiltered synthetic data improves over baseline, quality filtering provides better efficiency and cross-domain performance.
Model Details
| Property | Value |
|---|---|
| Base Model | openai/whisper-large-v3 |
| Language | Portuguese (pt) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 1550M |
| Training Data | Common Voice 17.0 + ALL Synthetic (Unfiltered) |
| Total Training Samples | 43,834 |
| Sampling Rate | 16kHz |
Evaluation Results
This Model (whisper-large-v3-cv-fully-synthetic-pt)
| Metric | Value |
|---|---|
| Validation Loss | 0.1050 |
| Validation WER | 7.57% |
| Test WER (Common Voice) | 8.33% |
| Test WER (MLS) | 13.43% |
| Best Checkpoint | Step 350 |
| Max Training Steps | 860 |
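The WER figures above follow the usual definition: word-level edit distance between reference and hypothesis, divided by the number of reference words. The card does not state which scoring tool or text normalization was used, so the sketch below only illustrates the metric itself; the function name and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words consumed so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word ("está" vs "esta") out of four reference words
print(word_error_rate("o gato está dormindo", "o gato esta dormindo"))  # 0.25
```

Scores are sensitive to casing, punctuation, and accent handling, so reproducing the reported numbers would require the same normalization the authors applied.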
Comparison with Other Training Configurations (Whisper-Large-v3 Portuguese)
| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 430 | 0.1260 | 11.38% | 11.78% | 15.31% |
| High-Quality (q ≥ 0.8) + CV | 575 | 0.1045 | 7.33% | 7.94% | 12.41% |
| Mid-High (q ≥ 0.5) + CV | 805 | 0.1040 | 7.73% | 8.33% | 10.27% |
| All Synthetic + CV (Unfiltered) | 860 | 0.1050 | 7.57% | 8.33% | 13.43% |
Key Performance Highlights
- Maximum data volume: Uses all 21,968 synthetic samples (100%)
- Good in-domain: 8.33% Test WER (same as mid-high filtered, 29.3% better than baseline)
- Cross-domain tradeoff: 13.43% MLS WER vs 10.27% for mid-high filtered (30.8% worse in relative terms)
- Training cost: Requires 860 max steps (7% more than mid-high filtered)
- Demonstrates filtering value: Same in-domain WER as the mid-high filtered model but worse cross-domain
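The relative figures in these highlights are percentage changes between two table entries. A minimal sketch of that arithmetic, using values copied from the comparison table (the helper name is chosen for this example only):

```python
def relative_change(value: float, reference: float) -> float:
    """Percentage change of `value` relative to `reference` (negative means lower)."""
    return (value - reference) / reference * 100.0


# Values copied from the comparison table above
cv_only_test_wer = 11.78      # Common Voice Only, Test WER (CV)
unfiltered_test_wer = 8.33    # this model, Test WER (CV)
mid_high_mls_wer = 10.27      # Mid-High filtered, Test WER (MLS)
unfiltered_mls_wer = 13.43    # this model, Test WER (MLS)

print(f"Test WER vs baseline: {relative_change(unfiltered_test_wer, cv_only_test_wer):+.1f}%")
print(f"MLS WER vs mid-high:  {relative_change(unfiltered_mls_wer, mid_high_mls_wer):+.1f}%")
print(f"Max steps vs mid-high: {relative_change(860, 805):+.1f}%")
```

For WER a negative change is an improvement, while for training steps a positive change means higher cost.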
Training Data
Dataset Composition
| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Portuguese | 21,866 | Real speech from Mozilla's crowdsourced dataset |
| Synthetic Transcript PT (all) | 21,968 | Complete TTS audio without filtering |
| Total | 43,834 | |
Synthetic Data Generation Pipeline
The synthetic dataset (yuriyvnv/synthetic_transcript_pt) was generated using:
- Transcript Generation: GPT-4o-mini, matching Common Voice word count distribution
- Speech Synthesis: OpenAI TTS-1 model with 9 voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer)
- No Filtering: All samples used regardless of quality
WAVe Quality Distribution (For Reference)
This model trains on every synthetic sample. For reference, WAVe quality scores distribute the synthetic data as follows:
| Quality Level | Samples | Percentage | Used in This Model |
|---|---|---|---|
| High (q ≥ 0.8) | 7,312 | 33.3% | ✓ |
| Medium (0.5 ≤ q < 0.8) | 11,869 | 54.0% | ✓ |
| Low (q < 0.5) | 2,787 | 12.7% | ✓ |
| Total | 21,968 | 100% | All used |
Note: 12.7% of the synthetic data (2,787 samples) scores below q = 0.5 and would be removed by the mid-high WAVe filter, but it is included in this model's training.
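The quality levels in the table reduce to two thresholds on the WAVe score q. The sketch below shows that bucketing and the sample counts each filter would retain. How the scores are stored alongside the dataset is not described here, so the `scores` list is hypothetical illustration data; only the counts are taken from the table.

```python
from collections import Counter


def quality_level(q: float) -> str:
    """Map a WAVe score to the level names used in the table above."""
    if q >= 0.8:
        return "high"
    if q >= 0.5:
        return "medium"
    return "low"


# Hypothetical scores, for illustration only
scores = [0.93, 0.81, 0.64, 0.50, 0.47, 0.12]
print(Counter(quality_level(q) for q in scores))

# Counts copied from the table above
counts = {"high": 7312, "medium": 11869, "low": 2787}
total = sum(counts.values())
kept_mid_high = counts["high"] + counts["medium"]   # filter at q >= 0.5
print(kept_mid_high, total - kept_mid_high)          # 19181 2787
print(f"{counts['low'] / total:.1%} of samples fall below q = 0.5")
```

The unfiltered configuration corresponds to skipping this step entirely and keeping all 21,968 samples.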
Training Procedure
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 5e-6 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
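The reported maximum of 860 training steps is consistent with these hyperparameters if each epoch keeps its final partial batch. That derivation is not stated in the card, so treat the following as a plausibility check under that assumption rather than a description of the actual training script.

```python
import math


def max_training_steps(num_samples: int, global_batch_size: int, epochs: int) -> int:
    """Optimizer steps for a full run, assuming the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_samples / global_batch_size)
    return steps_per_epoch * epochs


# 43,834 samples, global batch size 256, 5 epochs
print(max_training_steps(43_834, 256, 5))  # 860
```

With evaluation every 50 steps, the best checkpoint at step 350 falls on one of those evaluation points, a little past the end of the second epoch under the same assumption.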
Training Infrastructure
- GPU: NVIDIA H200 (141GB VRAM)
- Operating System: Ubuntu 22.04
- Framework: Hugging Face Transformers
Usage
Transcription Pipeline
from transformers import pipeline
transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-fully-synthetic-pt",
    device="cuda"
)
result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])
Direct Model Usage
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-cv-fully-synthetic-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-cv-fully-synthetic-pt")
model.to("cuda")
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
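The direct-usage snippet passes the whole clip to the processor, which by default pads or truncates input to Whisper's 30-second window, so longer recordings need to be split first. A minimal sketch of computing window boundaries at 16 kHz follows; `chunk_bounds` is a hypothetical helper written for this example, not part of Transformers.

```python
def chunk_bounds(num_samples: int, sampling_rate: int = 16_000, window_s: float = 30.0):
    """Return (start, end) sample indices of consecutive windows covering the clip."""
    window = int(sampling_rate * window_s)
    if window <= 0:
        raise ValueError("sampling_rate * window_s must be at least one sample")
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# A 70-second clip at 16 kHz splits into 30 s + 30 s + 10 s
print(chunk_bounds(70 * 16_000))
# [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each slice `audio[start:end]` can then go through the processor and `model.generate` as above, with the transcriptions joined afterwards. Hard cuts can split a word at a boundary; the pipeline's `chunk_length_s` argument is an alternative that handles chunking with overlap.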
Specifying Language
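Set these on the model loaded in the previous snippet before calling generate. They pin decoding to Portuguese transcription instead of relying on Whisper's language detection:
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"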
When to Use This Model
This model is ideal when:
- Quality filtering is not available: Uses raw synthetic data without preprocessing
- Maximum data volume required: Includes all 21,968 synthetic samples
- Comparing filtering approaches: Demonstrates the value of quality filtering by comparison
Consider filtered alternatives for better performance:
- whisper-large-v3-mixed-pt: Best cross-domain (10.27% MLS), 6% fewer steps
- whisper-large-v3-high-mixed-pt: Best in-domain (7.94% CV), 33% fewer steps
Quality vs Quantity Analysis
This model demonstrates the importance of quality filtering for Portuguese:
| Approach | Synthetic Samples | Training Steps | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|
| High-Quality (q≥0.8) | 7,312 | 575 | 7.94% | 12.41% |
| Mid-High (q≥0.5) | 19,181 | 805 | 8.33% | 10.27% |
| Unfiltered (this model) | 21,968 | 860 | 8.33% | 13.43% |
Key insight: Including all synthetic data, low-quality samples included, matches the in-domain WER of the mid-high filtered model (8.33%) but raises MLS WER by 30.8% in relative terms (13.43% vs 10.27%). The high-quality filtered model does better in-domain (7.94%) with a third of the synthetic data, so quality filtering improves generalization without sacrificing in-domain accuracy.
Limitations
- Noisy training signal: Includes low-quality synthetic samples (12.7% with q < 0.5)
- Suboptimal cross-domain: 13.43% MLS WER vs 10.27% for the mid-high filtered model
- Training inefficiency: Needs the most training steps of the four configurations (860) and still trails both filtered models on MLS
- Domain specificity: Optimized for general Portuguese; may underperform on technical domains
- Dialect coverage: Performance may vary across Portuguese regional variants
Citation
This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:
@article{perezhohin2024enhancing,
title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
journal={IEEE Access},
year={2024},
publisher={IEEE}
}
References
- Base Model: openai/whisper-large-v3
- Training Data (Real): mozilla-foundation/common_voice_17_0
- Training Data (Synthetic): yuriyvnv/synthetic_transcript_pt
- Whisper Paper: Robust Speech Recognition via Large-Scale Weak Supervision
- Motivating Research: Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)
License
Apache 2.0