Whisper-Small Portuguese - High-Quality Filtered Synthetic Data

This model is a fine-tuned version of openai/whisper-small for Portuguese automatic speech recognition (ASR). It was trained on Common Voice 17.0 Portuguese combined with WAVe-filtered high-quality synthetic speech data using a strict threshold (q ≥ 0.8).

Purpose

This model explores whether high-quality synthetic data filtering can overcome the limitations of smaller model architectures. The results reveal an important finding:

Key Finding: Even with strict quality filtering (q ≥ 0.8), the Small model shows no meaningful improvement over the CV-only baseline, demonstrating that the architectural capacity limitation cannot be overcome simply by improving synthetic data quality.

| Metric | CV-Only Baseline | This Model (High-Quality) | Relative Change |
|---|---|---|---|
| Test WER (CV) | 13.87% | 14.28% | -3.0% (worse) |
| Test WER (MLS) | 30.69% | 30.40% | +0.9% (marginal) |

This provides evidence that model capacity, not data quality, is the limiting factor for smaller architectures.

Model Details

| Property | Value |
|---|---|
| Base Model | openai/whisper-small |
| Language | Portuguese (pt) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 244M |
| Training Data | Common Voice 17.0 + High-Quality Synthetic (q ≥ 0.8) |
| Total Training Samples | 29,178 |
| Sampling Rate | 16 kHz |

Evaluation Results

This Model (whisper-small-high-mixed-pt)

| Metric | Value |
|---|---|
| Validation Loss | 0.2100 |
| Validation WER | 12.98% |
| Test WER (Common Voice) | 14.28% |
| Test WER (MLS) | 30.40% |
| Best Checkpoint | Step 350 |
| Max Training Steps | 575 |

Comparison with Other Training Configurations (Whisper-Small Portuguese)

| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 430 | 0.2000 | 12.68% | 13.87% | 30.69% |
| High-Quality (q ≥ 0.8) + CV | 575 | 0.2100 | 12.98% | 14.28% | 30.40% |
| Mid-High (q ≥ 0.5) + CV | 805 | 0.2100 | 12.97% | 14.08% | 30.54% |
| All Synthetic + CV | 860 | 0.2100 | 12.94% | 14.22% | 30.85% |
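
All WER figures above follow the standard word error rate definition: the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. Below is a minimal sketch of computing it with the evaluate library (which relies on jiwer); the sentences are illustrative examples, not items from the evaluation sets.

# Toy WER computation; the reference/prediction pairs are made-up examples
import evaluate

wer_metric = evaluate.load("wer")

references = ["o tempo está bom hoje", "ela foi ao mercado de manhã"]
predictions = ["o tempo esta bom hoje", "ela foi ao mercado de manhã"]

wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer * 100:.2f}%")  # 1 substitution / 11 reference words ≈ 9.09%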

Key Performance Characteristics

  • Best cross-domain: Lowest MLS WER (30.40%) among all Small configurations
  • Marginal MLS improvement: Only 0.9% better than baseline on cross-domain
  • Worse in-domain: 14.28% vs 13.87% baseline (-3.0%)
  • Demonstrates capacity limitation: High-quality filtering doesn't overcome architectural constraints

Why High-Quality Filtering Doesn't Help Small Models

The paper explains this phenomenon:

"Compact models, with fewer parameters, struggle to disentangle the subtle acoustic differences between natural and synthetic speech. Unlike the Large-V3 model, which can exploit its deeper representational hierarchy to extract meaningful patterns, smaller models become overwhelmed by increased acoustic variability."

Contrast with Large-v3:

| Model | High-Quality Synthetic Impact (in-domain WER) |
|---|---|
| Whisper-Small | -3.0% (worse) |
| Whisper-Large-v3 | +32.6% (better) |

This gap of more than 35 percentage points in relative WER change demonstrates that the benefit of synthetic data is fundamentally tied to model capacity.

Training Data

Dataset Composition

| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Portuguese | 21,866 | Real speech from Mozilla's crowdsourced dataset |
| Synthetic Transcript PT (q ≥ 0.8) | 7,312 | Strictly WAVe-filtered TTS audio (high quality only) |
| Total | 29,178 | |

WAVe Quality Distribution (Portuguese Synthetic Data)

| Quality Level | Samples | Percentage | Used in This Model |
|---|---|---|---|
| High (q ≥ 0.8) | 7,312 | 33.3% | ✓ |
| Medium (0.5 ≤ q < 0.8) | 11,869 | 54.0% | ✗ |
| Low (q < 0.5) | 2,787 | 12.7% | ✗ |
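
Only the high-quality bucket is kept for this model and mixed with Common Voice. Below is a minimal sketch of that selection, assuming a datasets-style synthetic corpus with a hypothetical wave_score column holding the per-sample WAVe quality estimate; the synthetic dataset identifier is a placeholder, and both sets are assumed to share the same audio/text schema.

from datasets import load_dataset, concatenate_datasets

# Common Voice 17.0 Portuguese (accepting the dataset terms on the Hub may be required)
cv = load_dataset("mozilla-foundation/common_voice_17_0", "pt", split="train")

# Placeholder dataset id; wave_score is a hypothetical column with the WAVe quality estimate
synthetic = load_dataset("your-org/synthetic-transcript-pt", split="train")

# Keep only strictly filtered samples (q >= 0.8), then mix with the real data
high_quality = synthetic.filter(lambda ex: ex["wave_score"] >= 0.8)
mixed_train = concatenate_datasets([cv, high_quality]).shuffle(seed=42)

print(len(mixed_train))  # 29,178 samples for the configuration described above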

Training Procedure

Hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 1e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
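
These settings map onto Transformers' Seq2SeqTrainingArguments roughly as follows. This is a minimal sketch rather than the exact training script; in particular, the per-device batch size / gradient-accumulation split behind the global batch of 256 is an assumption.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-high-mixed-pt",
    learning_rate=1e-5,
    per_device_train_batch_size=64,   # assumption: 64 x 4 accumulation steps = 256 global batch
    gradient_accumulation_steps=4,
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",        # fused AdamW
    eval_strategy="steps",            # called evaluation_strategy in older Transformers releases
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,       # generate during eval so WER can be computed
)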

Training Infrastructure

  • GPU: NVIDIA H200 (140GB VRAM)
  • Operating System: Ubuntu 22.04
  • Framework: Hugging Face Transformers

Usage

Transcription Pipeline

from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-small-high-mixed-pt",
    device="cuda"
)

result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])
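
For longer recordings, or to pin decoding to Portuguese explicitly, the same transcriber accepts chunking and generation keyword arguments at call time. This short sketch reuses the pipeline created above; the 30-second chunk length is an arbitrary choice, not a value from the original setup.

# Chunked transcription with the decoding language pinned to Portuguese
result = transcriber(
    "path/to/long_portuguese_audio.wav",
    chunk_length_s=30,  # split long audio into 30-second windows
    generate_kwargs={"language": "pt", "task": "transcribe"},
)
print(result["text"])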

Direct Model Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load the processor (feature extractor + tokenizer) and the fine-tuned model
processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-small-high-mixed-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-small-high-mixed-pt")
model.to("cuda")

# Load the audio and resample it to the 16 kHz rate Whisper expects
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

# Generate token IDs and decode them to text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Specifying Language

To force Portuguese transcription regardless of what language Whisper detects in the audio, pin the language and task on the model's generation config before calling generate:

# Pin decoding to Portuguese transcription instead of relying on language auto-detection
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"
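
Equivalently, recent Transformers releases accept the language and task directly in generate, so nothing needs to be persisted on the generation config. A short sketch reusing model, processor, and input_features from the snippet above:

# Per-call alternative to mutating the generation config
predicted_ids = model.generate(input_features, language="pt", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)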

When to Use This Model

This model is primarily useful for:

  • Research purposes: Demonstrating the impact of model capacity on synthetic data effectiveness
  • Slight cross-domain preference: Marginally better MLS performance (30.40% vs 30.69%)
  • Understanding architecture limitations: Comparing with Large-v3 results

For production use, consider:

  • The CV-only Whisper-Small baseline, which achieves better in-domain accuracy (13.87% vs 14.28% test WER on Common Voice)
  • A larger architecture such as Whisper-Large-v3, which does benefit substantially from high-quality filtered synthetic data

Research Implications

This model provides evidence for an important principle:

Synthetic data augmentation effectiveness scales with model capacity.

For practitioners:

  • Small models: Focus on high-quality real data; synthetic augmentation provides minimal benefit
  • Large models: Synthetic data with quality filtering dramatically improves performance
  • Resource planning: Don't invest in synthetic data generation for small model deployments

Limitations

  • Lower accuracy than baseline: 14.28% vs 13.87% (worse than CV-only)
  • Limited synthetic benefit: Architecture cannot leverage additional data effectively
  • Domain specificity: Optimized for general Portuguese
  • Dialect coverage: Performance may vary across Portuguese regional variants

Citation

This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:

@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}

License

Apache 2.0
