Whisper-Large-v3 Portuguese - Common Voice Only (Baseline)

This model is a fine-tuned version of openai/whisper-large-v3 for Portuguese automatic speech recognition (ASR). It was trained exclusively on Common Voice 17.0 Portuguese without any synthetic data augmentation, serving as the baseline for evaluating the impact of synthetic speech in ASR training.

Purpose

This baseline model demonstrates the performance achievable using only real, crowdsourced speech data from Common Voice 17.0. It serves as a reference point for comparing the effectiveness of synthetic data augmentation approaches, including:

  • Quality-filtered synthetic data (WAVe-based filtering at different thresholds)
  • Unfiltered synthetic data augmentation
  • Different quality thresholds and their impact on both training efficiency and ASR performance

The model is part of a comprehensive study on WAVe (Word-Aligned Verification) filtering for Portuguese ASR. The WAVe paper is currently under review; the earlier work that motivated it was published in IEEE Access (2024).

Model Details

Property | Value
Base Model | openai/whisper-large-v3
Language | Portuguese (pt)
Task | Automatic Speech Recognition (transcribe)
Parameters | 1550M
Training Data | Common Voice 17.0 Portuguese (Real Speech Only)
Total Training Samples | 21,866
Sampling Rate | 16 kHz

Evaluation Results

This Model (whisper-large-v3-cv-only-pt)

Metric | Value
Validation Loss | 0.1260
Validation WER | 11.38%
Test WER (Common Voice) | 11.78%
Test WER (MLS) | 15.31%
Best Checkpoint | Step 150
Max Training Steps | 430
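For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) between the hypothesis and the reference, divided by the number of reference words. The sketch below is a minimal pure-Python version of that definition. The function name `word_error_rate` and the example sentences are illustrative assumptions; this card does not state which scoring library or text normalization produced the numbers above, so this code will not necessarily reproduce them.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentence pair, not taken from the evaluation sets:
# one substitution over four reference words.
print(word_error_rate("o gato preto dorme", "o gato branco dorme"))  # 0.25
```

Under this definition, WER can exceed 100% when the hypothesis contains many inserted words.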

Comparison with Synthetic Data Augmentation (Whisper-Large-v3 Portuguese)

Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) | MLS Improvement
Common Voice Only (Baseline) | 430 | 0.1260 | 11.38% | 11.78% | 15.31% | n/a
High-Quality (q ≥ 0.8) + CV | 575 | 0.1045 | 7.33% | 7.94% | 12.41% | +18.9%
Mid-High (q ≥ 0.5) + CV | 805 | 0.1040 | 7.73% | 8.33% | 10.27% | +32.9%
All Synthetic + CV | 860 | 0.1050 | 7.57% | 8.33% | 13.43% | +12.3%
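The MLS Improvement column is consistent with a relative WER reduction against this baseline, that is, (baseline - new) / baseline. The short check below recomputes it from the MLS WER values in the table. The variable and function names are illustrative.

```python
BASELINE_MLS_WER = 15.31  # Common Voice Only (Baseline), in percent

# Test WER (MLS) of the augmented configurations, copied from the table above.
augmented_mls_wer = {
    "High-Quality (q >= 0.8) + CV": 12.41,
    "Mid-High (q >= 0.5) + CV": 10.27,
    "All Synthetic + CV": 13.43,
}


def relative_improvement(baseline: float, new: float) -> float:
    """Relative WER reduction in percent; positive means fewer errors."""
    return (baseline - new) / baseline * 100.0


improvements = {
    name: round(relative_improvement(BASELINE_MLS_WER, wer), 1)
    for name, wer in augmented_mls_wer.items()
}

for name, value in improvements.items():
    print(f"{name}: +{value}%")
```

The same formula applied to Test WER (CV), 11.78% against 7.94%, gives the 32.6% reduction quoted later in this card.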

Key Performance Characteristics

  • Fastest training: Fewest steps (430) among all Portuguese configurations
  • Smallest dataset: Only 21,866 samples (no synthetic augmentation)
  • Solid baseline: 11.78% Test WER on Common Voice
  • Limited cross-domain: 15.31% MLS WER, the highest of the four configurations compared above
  • Reference point: Establishes performance without synthetic data

Training Data

Dataset Composition

Source | Samples | Description
Common Voice 17.0 Portuguese | 21,866 | Real crowdsourced speech
Synthetic Data | 0 | No synthetic augmentation
Total | 21,866 |

Common Voice 17.0 Portuguese

Common Voice is Mozilla's open-source, crowdsourced speech dataset:

  • Recording conditions: Varied (home recordings, different microphones, background noise)
  • Speaker diversity: Multiple speakers, ages, and accents
  • Content: Read sentences from various domains
  • Quality: Human-validated transcriptions

Training Procedure

Hyperparameters

Parameter | Value
Learning Rate | 5e-6
Batch Size (Global) | 256
Warmup Steps | 200
Max Epochs | 5
Precision | BF16
Optimizer | AdamW (fused)
Eval Steps | 50
Metric for Best Model | eval_loss
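These hyperparameters tie back to the step counts reported under Evaluation Results. The arithmetic below is a sketch that assumes the final partial batch of each epoch is kept rather than dropped; the card does not state this. Under that assumption, 21,866 samples at a global batch size of 256 for 5 epochs gives the reported 430 steps.

```python
import math

TRAIN_SAMPLES = 21_866
GLOBAL_BATCH_SIZE = 256
MAX_EPOCHS = 5
WARMUP_STEPS = 200
EVAL_STEPS = 50

# Assumption: the last, smaller batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / GLOBAL_BATCH_SIZE)
total_steps = steps_per_epoch * MAX_EPOCHS

# Share of the run spent in learning-rate warmup.
warmup_fraction = WARMUP_STEPS / total_steps

# Steps at which eval_loss is measured and a best checkpoint can be selected.
eval_points = list(range(EVAL_STEPS, total_steps + 1, EVAL_STEPS))

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup fraction: {warmup_fraction:.1%}")
print(f"eval points:     {eval_points}")
```

Two things follow from these numbers: warmup covers a little under half of the run, and the reported best checkpoint (step 150) is one of the evaluation points that falls inside the 200-step warmup window.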

Training Infrastructure

  • GPU: NVIDIA H200 (141GB VRAM)
  • Operating System: Ubuntu 22.04
  • Framework: Hugging Face Transformers

Usage

Transcription Pipeline

from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-large-v3-cv-only-pt",
    device="cuda"
)

result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])

Direct Model Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load the processor (feature extractor + tokenizer) and the fine-tuned model
processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-large-v3-cv-only-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-large-v3-cv-only-pt")
model.to("cuda")

# Whisper expects 16 kHz mono audio; librosa resamples to that rate on load
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features.to("cuda")

# Generate token IDs, then decode them to text without special tokens
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Specifying Language

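# Reuses `model` from Direct Model Usage; pins Portuguese transcription instead of language auto-detection
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"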
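When using the pipeline instead of the model object, the same settings can be passed per call. The sketch below assumes a recent Transformers release in which the ASR pipeline forwards `generate_kwargs` to Whisper's `generate` and accepts the language code "pt"; older releases may behave differently. The helper name `transcribe_portuguese` is hypothetical, and `transcriber` is expected to be the pipeline object created in Transcription Pipeline above.

```python
def transcribe_portuguese(transcriber, audio_path):
    """Run an ASR pipeline with language and task pinned at call time.

    `transcriber` is any callable with the ASR pipeline call signature,
    for example the object returned by transformers.pipeline(...).
    """
    result = transcriber(
        audio_path,
        # Assumption: the pipeline passes these through to model.generate().
        generate_kwargs={"language": "pt", "task": "transcribe"},
    )
    return result["text"].strip()
```

Usage would look like `print(transcribe_portuguese(transcriber, "path/to/portuguese_audio.wav"))`. Passing the settings per call leaves the model's stored generation config untouched, which can be convenient when one process serves several languages.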

When to Use This Model

This baseline model is ideal when:

  • No synthetic data is available: Training on real data only
  • Maximum training speed required: Shortest training run (430 steps, best checkpoint at step 150)
  • In-domain performance is priority: Strong on Common Voice-like data (11.78% WER)
  • Comparing augmentation approaches: Reference for measuring synthetic data impact

For better performance, consider the synthetic-augmented variants listed in the comparison table under Evaluation Results.

Impact of Synthetic Data Augmentation

This baseline enables quantifying the value of synthetic speech for Portuguese:

Metric | CV-Only | + Synthetic (best) | Relative Change
Training Steps | 430 | 575 | +34%
Dataset Size | 21,866 | 29,178 | +33%
Test WER (CV) | 11.78% | 7.94% | -32.6%
Test WER (MLS) | 15.31% | 10.27% | -32.9%

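Key insight: Synthetic data augmentation cuts WER by roughly a third on both the in-domain (Common Voice) and cross-domain (MLS) benchmarks. Note that the best values above come from different configurations: the 575 steps and 7.94% CV WER belong to the q ≥ 0.8 run, while the 10.27% MLS WER belongs to the q ≥ 0.5 run, which trained for 805 steps. The training-time increase is therefore modest only for the in-domain gain.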

Limitations

  • Domain specificity: Optimized for Common Voice-style speech; cross-domain performance limited
  • Acoustic diversity: Limited to Common Voice recording conditions and speaker pool
  • Data scarcity: No augmentation means model capacity may be underutilized
  • Generalization: 15.31% MLS WER shows difficulty adapting to different acoustic conditions

Citation

This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:

@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}

License

Apache 2.0
