Whisper-Small Dutch - Full Synthetic Data (Unfiltered)

This model is a fine-tuned version of openai/whisper-small for Dutch automatic speech recognition (ASR). It was trained on Common Voice 17.0 Dutch combined with all synthetic speech data without quality filtering, representing the maximum data augmentation approach.

Introduction

Purpose

This model uses all available synthetic data without WAVe quality filtering to evaluate the impact of maximum data augmentation. It achieves strong performance (10.91% Test WER) but requires significantly more training steps than filtered approaches, demonstrating the quality-vs-quantity tradeoff in synthetic data augmentation.

How the Data Was Created

The training data combines real speech from Common Voice 17.0 with the complete synthetic dataset:

  1. Transcript Generation: We used GPT-4o-mini to generate Dutch transcripts that match the word count distribution observed in Common Voice, ensuring realistic utterance lengths and diverse linguistic content.

  2. Speech Synthesis: Each transcript was converted to audio using OpenAI's TTS-1 model with 9 different voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer), producing 34,898 synthetic samples.

  3. No Quality Filtering: Unlike other models in this series, no WAVe filtering was applied. All 34,898 synthetic samples were used, including those with potential synthesis defects.
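For illustration, the two generation steps can be sketched with the OpenAI Python SDK. This is a minimal sketch, not the project's actual pipeline; the prompt wording, helper names, and output handling are assumptions.

from pathlib import Path
from openai import OpenAI

client = OpenAI()
VOICES = ["alloy", "ash", "coral", "echo", "fable", "nova", "onyx", "sage", "shimmer"]

def generate_transcript(target_words: int) -> str:
    # Step 1: ask GPT-4o-mini for a Dutch sentence of roughly the target length,
    # mirroring the Common Voice word-count distribution.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Schrijf een natuurlijke Nederlandse zin van ongeveer {target_words} woorden.",
        }],
    )
    return response.choices[0].message.content.strip()

def synthesize(text: str, voice: str, out_path: Path) -> None:
    # Step 2: convert the transcript to audio with TTS-1 in the given voice.
    speech = client.audio.speech.create(
        model="tts-1", voice=voice, input=text, response_format="wav"
    )
    out_path.write_bytes(speech.content)

# Example: one transcript rendered in the first voice variant
text = generate_transcript(target_words=10)
synthesize(text, voice=VOICES[0], out_path=Path("sample_000.wav"))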

How the Model Was Created

The model was fine-tuned from openai/whisper-small using the Hugging Face Transformers library:

  1. Mixed Training: Combined 34,952 real speech samples from Common Voice 17.0 Dutch with all 34,898 synthetic samples (69,850 total).

  2. Optimization: Trained for 5 epochs with a learning rate of 1e-5, global batch size of 256, and BF16 precision on an NVIDIA H200 GPU.

  3. Checkpoint Selection: The best checkpoint was selected based on validation loss, occurring at step 800 with a validation loss of 0.1484.

This approach achieves a 2.0% relative improvement over the baseline (10.91% vs. 11.13% Test WER) but requires roughly twice as many training steps as training on Common Voice alone (1,365 vs. 680).

Model Details

| Property | Value |
|---|---|
| Base Model | openai/whisper-small |
| Language | Dutch (nl) |
| Task | Automatic Speech Recognition (transcribe) |
| Parameters | 244M |
| Training Data | Common Voice 17.0 + All Synthetic (Unfiltered) |
| Total Training Samples | 69,850 |
| Sampling Rate | 16 kHz |

Evaluation Results

This Model (whisper-small-cv-fully-synthetic-nl)

| Metric | Value |
|---|---|
| Validation Loss | 0.1484 |
| Validation WER | 8.64% |
| Test WER (Common Voice) | 10.91% |
| Test WER (MLS) | 30.06% |
| Best Checkpoint | Step 800 |
| Max Training Steps | 1,365 |
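The WER figures above can be reproduced in spirit with the evaluate library; a minimal sketch, assuming the Common Voice 17.0 Dutch test split and no extra text normalization (the reported numbers may apply normalization):

import evaluate
from datasets import Audio, load_dataset
from transformers import pipeline

wer_metric = evaluate.load("wer")
transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-small-cv-fully-synthetic-nl",
    device="cuda",
)

# assumption: the same test split used for the Test WER (Common Voice) row
dataset = load_dataset("mozilla-foundation/common_voice_17_0", "nl", split="test")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

predictions = [transcriber(sample["audio"])["text"] for sample in dataset]
references = [sample["sentence"] for sample in dataset]
print(f"WER: {100 * wer_metric.compute(predictions=predictions, references=references):.2f}%")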

Comparison with Other Training Configurations (Whisper-Small Dutch)

| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---|---|---|---|---|---|
| Common Voice Only | 680 | 0.1491 | 8.73% | 11.13% | 30.71% |
| High-Quality Filtered + CV | 890 | 0.1493 | 8.76% | 11.00% | 29.91% |
| Mid-High Quality Filtered + CV | 1,270 | 0.1484 | 8.73% | 10.86% | 30.04% |
| All Synthetic + CV (Unfiltered) | 1,365 | 0.1484 | 8.64% | 10.91% | 30.06% |

Key Performance Highlights

  • Best Validation WER (8.64%) among all Whisper-Small Dutch configurations
  • 2.0% relative improvement on Common Voice test set vs baseline (10.91% vs 11.13%)
  • 2.1% relative improvement on MLS benchmark vs baseline (30.06% vs 30.71%)
  • Tradeoff: Requires 1,365 steps vs 890 for high-quality filtered (53% more compute)

Training Data

Dataset Composition

| Source | Samples | Description |
|---|---|---|
| Common Voice 17.0 Dutch | 34,952 | Real speech from Mozilla's crowdsourced dataset |
| Synthetic Transcript NL (all) | 34,898 | Complete TTS audio without filtering |
| Total | 69,850 | |
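Combining the two sources is straightforward with the datasets library; a sketch, assuming matching column layouts (in practice the text columns may need renaming before concatenation):

from datasets import Audio, concatenate_datasets, load_dataset

real = load_dataset("mozilla-foundation/common_voice_17_0", "nl", split="train")
synthetic = load_dataset("yuriyvnv/synthetic_transcript_nl", split="train")

# resample both sources to Whisper's expected 16 kHz before mixing
real = real.cast_column("audio", Audio(sampling_rate=16_000))
synthetic = synthetic.cast_column("audio", Audio(sampling_rate=16_000))

# concatenate_datasets requires identical features across both datasets
train_data = concatenate_datasets([real, synthetic]).shuffle(seed=42)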

Synthetic Data Generation Pipeline

The synthetic dataset (yuriyvnv/synthetic_transcript_nl) was generated using:

  1. Transcript Generation: GPT-4o-mini, matching Common Voice word count distribution
  2. Speech Synthesis: OpenAI TTS-1 model with 9 voice variants (alloy, ash, coral, echo, fable, nova, onyx, sage, shimmer)
  3. No Filtering: All samples used regardless of quality

Quality Distribution (For Reference)

Although this model uses all synthetic data, the WAVe quality assessment yields the following distribution:

| Quality Level | Samples | Percentage | Used in This Model |
|---|---|---|---|
| High (q ≥ 0.8) | 10,555 | 30.2% | Yes |
| Medium (0.5 ≤ q < 0.8) | 19,627 | 56.2% | Yes |
| Low (q < 0.5) | 4,716 | 13.5% | Yes |
| Total | 34,898 | 100% | All used |

Note: 13.5% of the synthetic data (4,716 samples) would be filtered out by WAVe, but is included in this model's training.
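Applying the filter would amount to thresholding the quality score; a sketch, assuming the score is exposed as a `quality` column (the actual column name in the dataset may differ):

from datasets import load_dataset

synthetic = load_dataset("yuriyvnv/synthetic_transcript_nl", split="train")

# hypothetical column name for the WAVe score; q >= 0.5 keeps the mid and
# high tiers and drops the 4,716 low-quality samples
mid_high = synthetic.filter(lambda example: example["quality"] >= 0.5)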

Training Procedure

Hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 1e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
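These settings map onto Hugging Face Seq2SeqTrainingArguments roughly as follows; a sketch, where the per-device batch size and gradient accumulation split of the 256 global batch are assumptions:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-cv-fully-synthetic-nl",
    learning_rate=1e-5,
    per_device_train_batch_size=64,   # assumption: 64 x 4 accumulation = 256 global
    gradient_accumulation_steps=4,
    warmup_steps=200,
    num_train_epochs=5,
    bf16=True,
    optim="adamw_torch_fused",
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,      # restores the lowest-eval_loss checkpoint
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    predict_with_generate=True,
)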

Training Infrastructure

  • GPU: NVIDIA H200 (140GB VRAM)
  • Operating System: Ubuntu 22.04
  • Framework: Hugging Face Transformers

Training Curve

Step  100: val_loss = 0.1967
Step  250: val_loss = 0.1659
Step  400: val_loss = 0.1535
Step  550: val_loss = 0.1490
Step  800: val_loss = 0.1484 ← Best checkpoint
Step 1000: val_loss = 0.1526
Step 1200: val_loss = 0.1549
Step 1350: val_loss = 0.1550

Validation loss bottoms out at step 800 and drifts upward afterwards, consistent with mild overfitting beyond the selected checkpoint.

Usage

Transcription Pipeline

from transformers import pipeline

# load the fine-tuned checkpoint into an ASR pipeline
transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-small-cv-fully-synthetic-nl",
    device="cuda",  # use "cpu" if no GPU is available
)

result = transcriber("path/to/dutch_audio.wav")
print(result["text"])
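For recordings longer than 30 seconds, enable chunked inference with the pipeline's chunk_length_s argument:

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-small-cv-fully-synthetic-nl",
    chunk_length_s=30,  # split long audio into 30-second windows
    device="cuda",
)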

Direct Model Usage

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-small-cv-fully-synthetic-nl")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-small-cv-fully-synthetic-nl")
model.to("cuda")

# Whisper expects 16 kHz mono input; librosa resamples on load
audio, sr = librosa.load("path/to/dutch_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

# generate token IDs, then decode them back to text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Specifying Language

To pin decoding to Dutch transcription rather than relying on automatic language detection, set the generation config before calling generate:

# force Dutch output and the transcription task (rather than translation)
model.generation_config.language = "nl"
model.generation_config.task = "transcribe"

When to Use This Model

This model is ideal when:

  • Maximum data utilization is desired: Uses all available synthetic data
  • Compute budget is not a constraint: Requires most training steps (1,365)
  • Quality filtering is not available: Uses raw synthetic data

Consider the filtered alternatives when training efficiency matters; the analysis below quantifies the tradeoff.

Quality vs Quantity Analysis

This model demonstrates the tradeoff between data quantity and quality for Whisper-Small:

| Approach | Synthetic Samples | Training Steps | Test WER (CV) | Efficiency |
|---|---|---|---|---|
| High-Quality (q ≥ 0.8) | 10,555 | 890 | 11.00% | Best |
| Mid-High (q ≥ 0.5) | 30,182 | 1,270 | 10.86% | Good |
| Unfiltered (this model) | 34,898 | 1,365 | 10.91% | Lowest |

Key insight: The unfiltered approach performs slightly worse than mid-high filtering (10.91% vs 10.86%) despite using more data and requiring 7.5% more training steps. This suggests that including low-quality synthetic samples can introduce noise that degrades performance for Whisper-Small.

Limitations

  • Training efficiency: Requires most compute among all configurations
  • Noisy training signal: Includes low-quality synthetic samples (13.5% with q < 0.5)
  • Diminishing returns: More data doesn't always mean better performance
  • Domain specificity: Optimized for general Dutch; may underperform on technical domains
  • Dialect coverage: Performance may vary across Dutch regional variants

Citation

@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}

License

Apache 2.0
