Whisper-Small Portuguese - High-Quality Filtered Synthetic Data
This model is a fine-tuned version of openai/whisper-small for Portuguese automatic speech recognition (ASR). It was trained on Common Voice 17.0 Portuguese combined with WAVe-filtered high-quality synthetic speech data using a strict threshold (q ≥ 0.8).
Purpose
This model explores whether high-quality synthetic data filtering can overcome the limitations of smaller model architectures. The results reveal an important finding:
Key Finding: Even with strict quality filtering (q ≥ 0.8), the Small model shows essentially no improvement over the CV-only baseline (slightly worse in-domain, marginally better cross-domain), demonstrating that the architectural capacity limitation cannot be overcome simply by improving synthetic data quality.
| Metric | CV-Only Baseline | This Model (High-Quality) | Relative Change |
|---|---|---|---|
| Test WER (CV) | 13.87% | 14.28% | −3.0% (worse) |
| Test WER (MLS) | 30.69% | 30.40% | +0.9% (marginal) |
This provides evidence that model capacity, not data quality, is the limiting factor for smaller architectures.
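The relative-change figures above follow directly from the WER values; a minimal sketch of the arithmetic (positive means WER went down, i.e. improved):

```python
# Reproduce the relative WER changes reported in the comparison table.
baseline_cv, model_cv = 13.87, 14.28
baseline_mls, model_mls = 30.69, 30.40

def rel_change(baseline: float, new: float) -> float:
    """Relative improvement in percent; negative means a regression."""
    return (baseline - new) / baseline * 100

cv_change = rel_change(baseline_cv, model_cv)    # ≈ -3.0 (worse in-domain)
mls_change = rel_change(baseline_mls, model_mls) # ≈ +0.9 (marginal cross-domain gain)
print(round(cv_change, 1), round(mls_change, 1))
```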
Model Details
| Property | Value |
|---|---|
| Base Model | openai/whisper-small |
| Language | Portuguese (pt) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 244M |
| Training Data | Common Voice 17.0 + High-Quality Synthetic (q ≥ 0.8) |
| Total Training Samples | 29,178 |
| Sampling Rate | 16kHz |
Evaluation Results
This Model (whisper-small-high-mixed-pt)
| Metric | Value |
|---|---|
| Validation Loss | 0.2100 |
| Validation WER | 12.98% |
| Test WER (Common Voice) | 14.28% |
| Test WER (MLS) | 30.40% |
| Best Checkpoint | Step 350 |
| Max Training Steps | 575 |
Comparison with Other Training Configurations (Whisper-Small Portuguese)
| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 430 | 0.2000 | 12.68% | 13.87% | 30.69% |
| High-Quality (q ≥ 0.8) + CV | 575 | 0.2100 | 12.98% | 14.28% | 30.40% |
| Mid-High (q ≥ 0.5) + CV | 805 | 0.2100 | 12.97% | 14.08% | 30.54% |
| All Synthetic + CV | 860 | 0.2100 | 12.94% | 14.22% | 30.85% |
Key Performance Characteristics
- Best cross-domain: Lowest MLS WER (30.40%) among all Small configurations
- Marginal MLS improvement: Only 0.9% better than baseline on cross-domain
- Worse in-domain: 14.28% vs 13.87% baseline (-3.0%)
- Demonstrates capacity limitation: High-quality filtering doesn't overcome architectural constraints
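For reference, the WER figures above follow the standard word-level edit-distance definition; a minimal self-contained sketch of that metric (the actual evaluation pipeline may use a library such as jiwer or evaluate instead):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("esta" for "está") over five reference words -> WER 0.2
print(wer("o gato está no telhado", "o gato esta no telhado"))
```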
Why High-Quality Filtering Doesn't Help Small Models
The paper explains this phenomenon:
"Compact models, with fewer parameters, struggle to disentangle the subtle acoustic differences between natural and synthetic speech. Unlike the Large-V3 model, which can exploit its deeper representational hierarchy to extract meaningful patterns, smaller models become overwhelmed by increased acoustic variability."
Contrast with Large-v3:
| Model | High-Quality Synthetic Impact |
|---|---|
| Whisper-Small | -3.0% worse in-domain WER |
| Whisper-Large-v3 | +32.6% better in-domain WER |
This 35+ percentage point difference demonstrates that the benefit of synthetic data is fundamentally tied to model capacity.
Training Data
Dataset Composition
| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Portuguese | 21,866 | Real speech from Mozilla's crowdsourced dataset |
| Synthetic Transcript PT (q ≥ 0.8) | 7,312 | Strictly WAVe-filtered TTS audio (high quality only) |
| Total | 29,178 | |
WAVe Quality Distribution (Portuguese Synthetic Data)
| Quality Level | Samples | Percentage | Used in This Model |
|---|---|---|---|
| High (q ≥ 0.8) | 7,312 | 33.3% | ✓ |
| Medium (0.5 ≤ q < 0.8) | 11,869 | 54.0% | ✗ |
| Low (q < 0.5) | 2,787 | 12.7% | ✗ |
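The three quality levels above partition the synthetic data by WAVe score; an illustrative sketch of the bucketing and the strict filter used for this model (the score field name and filtering API are assumptions, only the 0.5 and 0.8 thresholds come from this card):

```python
# Bucket a WAVe quality score q into the three levels from the table above.
def wave_bucket(q: float) -> str:
    if q >= 0.8:
        return "high"    # kept for this model
    if q >= 0.5:
        return "medium"  # excluded here
    return "low"         # excluded here

# With Hugging Face `datasets`, the strict q >= 0.8 filter might look like:
#   high_q = ds.filter(lambda ex: ex["wave_quality"] >= 0.8)
# where "wave_quality" is a hypothetical column name.
print(wave_bucket(0.85), wave_bucket(0.6), wave_bucket(0.3))
```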
Training Procedure
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 1e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
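The hyperparameters above map naturally onto Hugging Face `Seq2SeqTrainingArguments`; a hedged configuration sketch, not the project's actual training script (`output_dir` and the per-device batch split are assumptions, since the card only states a global batch size of 256):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-high-mixed-pt",  # assumed name
    learning_rate=1e-5,
    per_device_train_batch_size=256,  # global 256; split across devices/grad-accum in practice
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",  # `evaluation_strategy` on older Transformers releases
    eval_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,  # lower eval_loss is better
    predict_with_generate=True,
)
```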
Training Infrastructure
- GPU: NVIDIA H200 (140GB VRAM)
- Operating System: Ubuntu 22.04
- Framework: Hugging Face Transformers
Usage
Transcription Pipeline
```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-small-high-mixed-pt",
    device="cuda",
)

result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])
```
Direct Model Usage
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-small-high-mixed-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-small-high-mixed-pt")
model.to("cuda")

audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
Specifying Language
To force Portuguese transcription and avoid language auto-detection errors, set the generation config before calling `generate`:

```python
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"
```
When to Use This Model
This model is primarily useful for:
- Research purposes: Demonstrating the impact of model capacity on synthetic data effectiveness
- Slight cross-domain preference: Marginally better MLS performance (30.40% vs 30.69%)
- Understanding architecture limitations: Comparing with Large-v3 results
For production use, consider:
- whisper-small-cv-only-pt: Best Small model for Portuguese (13.87% WER)
- whisper-large-v3-high-mixed-pt: Best accuracy (7.94% WER)
Research Implications
This model provides evidence for an important principle:
Synthetic data augmentation effectiveness scales with model capacity.
For practitioners:
- Small models: Focus on high-quality real data; synthetic augmentation provides minimal benefit
- Large models: Synthetic data with quality filtering dramatically improves performance
- Resource planning: Don't invest in synthetic data generation for small model deployments
Limitations
- Lower accuracy than baseline: 14.28% vs 13.87% (worse than CV-only)
- Limited synthetic benefit: Architecture cannot leverage additional data effectively
- Domain specificity: Optimized for general Portuguese
- Dialect coverage: Performance may vary across Portuguese regional variants
Citation
This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:
```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```
References
- Base Model: openai/whisper-small
- Training Data (Real): mozilla-foundation/common_voice_17_0
- Training Data (Synthetic): yuriyvnv/synthetic_transcript_pt
- Whisper Paper: Robust Speech Recognition via Large-Scale Weak Supervision
- Motivating Research: Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)
License
Apache 2.0