Whisper-Large-v3 Portuguese - Full Synthetic Data (Unfiltered)

This model is a fine-tuned version of openai/whisper-large-v3 for Portuguese automatic speech recognition (ASR). It was trained on Common Voice 17.0 Portuguese combined with all synthetic speech data without quality filtering, representing the maximum data augmentation approach.

Purpose

This model evaluates the impact of maximum data augmentation using all available synthetic speech without quality filtering. It serves as a comparison point to demonstrate:

Impact of unfiltered synthetic data: Including all synthetic samples, low-quality ones included, still improves on the Common Voice-only baseline but uses training compute less efficiently
Quality vs quantity tradeoff: Achieves 8.33% Test WER on Common Voice (the same as the mid-high filtered model) but needs more training steps (860 vs 805)
Cross-domain limitations: 13.43% MLS WER vs 10.27% for the mid-high filtered model (q ≥ 0.5), demonstrating the value of quality filtering

The model is part of a comprehensive study on WAVe (Word-Aligned Verification) filtering, demonstrating that while unfiltered synthetic data improves over baseline, quality filtering provides better efficiency and cross-domain performance.

Model Details

Property	Value
Base Model	openai/whisper-large-v3
Language	Portuguese (pt)
Task	Automatic Speech Recognition (transcribe)
Parameters	1550M
Training Data	Common Voice 17.0 + ALL Synthetic (Unfiltered)
Total Training Samples	43,834
Sampling Rate	16 kHz

Evaluation Results

This Model (whisper-large-v3-cv-fully-synthetic-pt)

Metric	Value
Validation Loss	0.1050
Validation WER	7.57%
Test WER (Common Voice)	8.33%
Test WER (MLS)	13.43%
Best Checkpoint	Step 350
Max Training Steps	860
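
The WER figures above follow the usual definition: substitutions, deletions, and insertions divided by the number of reference words. The snippet below is a minimal sketch of that metric, assuming whitespace tokenization and lowercasing as the only normalization. The text normalization actually used for the reported numbers is not stated in this card, so treat this as an illustration and not as the evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only; they are not taken from any test set.
example_reference = "o gato está em cima da mesa"
example_hypothesis = "o gato esta em cima mesa"
print(f"WER: {word_error_rate(example_reference, example_hypothesis):.2%}")
```

Corpus-level WER is normally computed by summing edit counts and reference lengths over all utterances before dividing, which is not the same as averaging per-utterance scores.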

Comparison with Other Training Configurations (Whisper-Large-v3 Portuguese)

Training Data	Max Steps	Val Loss	Val WER	Test WER (CV)	Test WER (MLS)
Common Voice Only	430	0.1260	11.38%	11.78%	15.31%
High-Quality (q ≥ 0.8) + CV	575	0.1045	7.33%	7.94%	12.41%
Mid-High (q ≥ 0.5) + CV	805	0.1040	7.73%	8.33%	10.27%
All Synthetic + CV (Unfiltered)	860	0.1050	7.57%	8.33%	13.43%

Key Performance Highlights

Maximum data volume: Uses all 21,968 synthetic samples (100%)
Good in-domain: 8.33% Test WER on Common Voice (same as the mid-high filtered model, a 29.3% relative reduction over the Common Voice-only baseline)
Cross-domain tradeoff: 13.43% MLS WER vs 10.27% for the mid-high filtered model (30.7% worse in relative terms)
Training cost: Requires 860 max steps (7% more than the mid-high filtered model)
Demonstrates filtering value: Same in-domain WER as the mid-high filtered model but worse cross-domain WER
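
The percentages in these highlights are relative changes between entries of the comparison table. A short sketch of the arithmetic, using only numbers from that table (the helper name `relative_change` is ours), is given here. The results agree with the quoted figures to within rounding; the MLS gap comes out at about 30.8% when rounded instead of truncated.

```python
def relative_change(value: float, reference: float) -> float:
    """Percent change of `value` relative to `reference` (negative means lower)."""
    return (value - reference) / reference * 100.0


# Figures copied from the comparison table in this card.
cv_wer = {"baseline": 11.78, "mid_high": 8.33, "unfiltered": 8.33}
mls_wer = {"baseline": 15.31, "mid_high": 10.27, "unfiltered": 13.43}
max_steps = {"mid_high": 805, "unfiltered": 860}

# Relative WER reduction of the unfiltered model over Common Voice only.
in_domain_gain = -relative_change(cv_wer["unfiltered"], cv_wer["baseline"])
# How much higher the unfiltered MLS WER is than the mid-high filtered one.
cross_domain_gap = relative_change(mls_wer["unfiltered"], mls_wer["mid_high"])
# Extra training steps needed by the unfiltered configuration.
extra_steps = relative_change(max_steps["unfiltered"], max_steps["mid_high"])

print(f"In-domain WER reduction vs baseline: {in_domain_gain:.1f}%")
print(f"MLS WER increase vs mid-high filtered: {cross_domain_gap:.1f}%")
print(f"Extra training steps vs mid-high filtered: {extra_steps:.1f}%")
```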

Training Data

Dataset Composition

Source	Samples	Description
Common Voice 17.0 Portuguese	21,866	Real speech from Mozilla's crowdsourced dataset
Synthetic Transcript PT (all)	21,968	Complete TTS audio without filtering
Total	43,834

Synthetic Data Generation Pipeline

The synthetic dataset (yuriyvnv/synthetic_transcript_pt) was generated using:

Transcript Generation: GPT-4o-mini, matching Common Voice word count distribution
Speech Synthesis: OpenAI TTS-1 model with 9 voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer)
No Filtering: All samples used regardless of quality
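
The card states that transcripts match the Common Voice word count distribution and that nine TTS voices were used, but it does not describe how either was done. One plausible approach is sketched below: draw a target length for each transcript from the empirical length distribution of the real sentences, and pick a voice at random. The function name `sample_generation_plan`, the uniform voice choice, and the seed are all assumptions made for illustration, not the authors' pipeline.

```python
import random
from collections import Counter

# Voice names as listed in this card.
VOICES = ["alloy", "ash", "coral", "echo", "fable", "nova", "onyx", "sage", "shimmer"]


def sample_generation_plan(reference_sentences, n_samples, seed=0):
    """Return (target_word_count, voice) pairs for n_samples synthetic items.

    Hypothetical helper: target lengths are sampled in proportion to how often
    each word count occurs in `reference_sentences`.
    """
    rng = random.Random(seed)
    length_counts = Counter(len(sentence.split()) for sentence in reference_sentences)
    lengths = list(length_counts)
    weights = [length_counts[n] for n in lengths]
    targets = rng.choices(lengths, weights=weights, k=n_samples)
    return [(n_words, rng.choice(VOICES)) for n_words in targets]


# Toy stand-in for Common Voice sentences.
toy_reference = ["bom dia", "como está o tempo hoje", "obrigado pela ajuda"]
for n_words, voice in sample_generation_plan(toy_reference, n_samples=5):
    print(f"request a {n_words}-word transcript, synthesize with voice '{voice}'")
```

Each target length would then go into the prompt for the transcript generator, and the chosen voice into the TTS request.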

WAVe Quality Distribution (For Reference)

Although this model uses all of the synthetic data, WAVe quality scores split it into the following levels:

Quality Level	Samples	Percentage	Used in This Model
High (q ≥ 0.8)	7,312	33.3%	✓
Medium (0.5 ≤ q < 0.8)	11,869	54.0%	✓
Low (q < 0.5)	2,787	12.7%	✓
Total	21,968	100%	All used

Note: 12.7% of the synthetic data (2,787 samples) scores below q = 0.5 and would be removed by the mid-high WAVe filter, but it is included in this model's training set.
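
The thresholds in the table can be written as a small bucketing rule. The sketch below assumes each synthetic sample carries a WAVe score in a field called `q` (a hypothetical name); how WAVe computes that score is outside the scope of this card. It also recomputes the percentages from the sample counts.

```python
HIGH_THRESHOLD = 0.8
MID_THRESHOLD = 0.5


def quality_level(q: float) -> str:
    """Map a WAVe quality score to the level names used in the table."""
    if q >= HIGH_THRESHOLD:
        return "High"
    if q >= MID_THRESHOLD:
        return "Medium"
    return "Low"


def filter_by_quality(samples, min_q):
    """Keep samples whose score is at least `min_q`; min_q=0.0 keeps everything."""
    return [sample for sample in samples if sample["q"] >= min_q]


# Sample counts copied from the table above.
bucket_counts = {"High": 7312, "Medium": 11869, "Low": 2787}
total = sum(bucket_counts.values())
for level, count in bucket_counts.items():
    print(f"{level}: {count} samples ({count / total:.1%})")

# The mid-high filter (q >= 0.5) keeps the High and Medium levels.
mid_high_count = bucket_counts["High"] + bucket_counts["Medium"]
print(f"Mid-high subset: {mid_high_count} of {total} samples")
```

In this framing the present model corresponds to `filter_by_quality(samples, 0.0)`, which keeps every sample.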

Training Procedure

Hyperparameters

Parameter	Value
Learning Rate	5e-6
Batch Size (Global)	256
Warmup Steps	200
Max Epochs	5
Precision	BF16
Optimizer	AdamW (fused)
Eval Steps	50
Metric for Best Model	eval_loss
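
For readers who want to reproduce a similar run, the table maps fairly directly onto Hugging Face trainer arguments. The sketch below collects them in a plain dictionary and checks the step count. The split of the global batch into 32 samples per device with 8 accumulation steps on one GPU, the checkpoint interval, and `load_best_model_at_end` are assumptions, since the card only gives the global batch size and the selection metric.

```python
import math

# Values from this card.
GLOBAL_BATCH_SIZE = 256
NUM_EPOCHS = 5
TRAIN_SAMPLES = 43_834

# Assumption: one GPU, 32 samples per forward pass, 8 accumulation steps.
PER_DEVICE_BATCH = 32
GRAD_ACCUM = 8
NUM_GPUS = 1
assert PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS == GLOBAL_BATCH_SIZE

training_kwargs = {
    "learning_rate": 5e-6,
    "warmup_steps": 200,
    "num_train_epochs": NUM_EPOCHS,
    "per_device_train_batch_size": PER_DEVICE_BATCH,
    "gradient_accumulation_steps": GRAD_ACCUM,
    "bf16": True,
    "optim": "adamw_torch_fused",
    "eval_strategy": "steps",  # called evaluation_strategy in older transformers releases
    "eval_steps": 50,
    "save_strategy": "steps",
    "save_steps": 50,                # assumption: checkpoint at every evaluation
    "load_best_model_at_end": True,  # assumption: restore the best checkpoint
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,      # lower loss is better
}

# Sanity check, assuming the final partial batch of each epoch counts as a step.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / GLOBAL_BATCH_SIZE)
max_steps = steps_per_epoch * NUM_EPOCHS
print(f"{steps_per_epoch} steps per epoch, {max_steps} steps in total")
```

The dictionary could be passed as `Seq2SeqTrainingArguments(output_dir=..., **training_kwargs)`. Under the stated rounding assumption the computed total matches the 860 max steps reported for this model.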

Training Infrastructure

GPU: NVIDIA H200 (140GB VRAM)
Operating System: Ubuntu 22.04
Framework: Hugging Face Transformers

Usage

Transcription Pipeline

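import torch
from transformers import pipeline

# Use the GPU when one is available; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-fully-synthetic-pt",
    device=device,
)

# Accepts a path to an audio file and returns a dict with the transcript under "text".
result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])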

Direct Model Usage

import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "yuriyvnv/whisper-large-v3-cv-fully-synthetic-pt"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

# Whisper expects 16 kHz mono audio; librosa resamples on load.
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features.to(device)

# Generate token IDs, then decode them to text without special tokens.
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Specifying Language

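# Set these before calling model.generate() to force Portuguese transcription
# instead of relying on automatic language detection.
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"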

When to Use This Model

This model is ideal when:

Quality filtering is not available: Uses raw synthetic data without preprocessing
Maximum data volume required: Includes all 21,968 synthetic samples
Comparing filtering approaches: Demonstrates the value of quality filtering by comparison

Consider filtered alternatives for better performance:

whisper-large-v3-mixed-pt: Best cross-domain (10.27% MLS), 6% fewer steps
whisper-large-v3-high-mixed-pt: Best in-domain (7.94% CV), 33% fewer steps

Quality vs Quantity Analysis

This model demonstrates the importance of quality filtering for Portuguese:

Approach	Synthetic Samples	Training Steps	Test WER (CV)	Test WER (MLS)
High-Quality (q≥0.8)	7,312	575	7.94%	12.41%
Mid-High (q≥0.5)	19,181	805	8.33%	10.27%
Unfiltered (this model)	21,968	860	8.33%	13.43%

Key insight: Including all synthetic data, low-quality samples included, matches the in-domain WER of the mid-high filtered model (8.33%) and trails the high-quality filtered model (7.94%), while MLS WER is 30.7% worse than with mid-high filtering. Quality filtering therefore improves generalization without sacrificing in-domain accuracy.

Limitations

Noisy training signal: Includes low-quality synthetic samples (12.7% of the synthetic data has q < 0.5)
Suboptimal cross-domain: 13.43% MLS WER vs 10.27% for the mid-high filtered model
Training inefficiency: Requires the most training steps of all configurations (860) with no cross-domain gain over the filtered models
Domain specificity: Optimized for general Portuguese; may underperform on technical domains
Dialect coverage: Performance may vary across Portuguese regional variants

Citation

This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:

@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}

References

Base Model: openai/whisper-large-v3
Training Data (Real): mozilla-foundation/common_voice_17_0
Training Data (Synthetic): yuriyvnv/synthetic_transcript_pt
Whisper Paper: Robust Speech Recognition via Large-Scale Weak Supervision
Motivating Research: Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)

License

Apache 2.0
