---
license: apache-2.0
language:
- pt
base_model: openai/whisper-tiny
tags:
- automatic-speech-recognition
- whisper
- portuguese
- speech
- audio
- asr
- hf-asr-leaderboard
datasets:
- mozilla-foundation/common_voice_17_0
model-index:
- name: whisper-tiny-cv-only-pt
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 17.0 (Portuguese)
      type: mozilla-foundation/common_voice_17_0
      config: pt
      split: test
    metrics:
    - type: wer
      value: 30.72
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Multilingual LibriSpeech (Portuguese)
      type: facebook/multilingual_librispeech
      config: portuguese
      split: test
    metrics:
    - type: wer
      value: 45.83
      name: Test WER (MLS)
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# Whisper-Tiny Portuguese - Common Voice Only (Baseline)

This model is a fine-tuned version of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) for Portuguese automatic speech recognition (ASR). It was trained **exclusively on Common Voice 17.0 Portuguese** without any synthetic data augmentation, serving as the baseline for evaluating the impact of synthetic speech on the smallest Whisper architecture.

## Purpose

This baseline model establishes the performance of the Whisper-Tiny architecture (39M parameters) using only real, crowdsourced speech data. It serves as a reference point for evaluating:

- The effectiveness of synthetic data augmentation for the smallest model architecture
- The fundamental capacity limitations of compact ASR models
- How results compare with the Small and Large-v3 models, to understand scaling effects

**Key Finding**: Unlike Large-v3 models, which show significant improvements with synthetic data, Tiny models show only **marginal benefits** from synthetic augmentation (at best a 1.39 percentage point reduction in Common Voice test WER). The paper states: *"This modest gain offers limited justification for the additional data filtering and preprocessing overhead."*

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | openai/whisper-tiny |
| **Language** | Portuguese (pt) |
| **Task** | Automatic Speech Recognition (transcribe) |
| **Parameters** | 39M |
| **Training Data** | Common Voice 17.0 Portuguese (Real Speech Only) |
| **Total Training Samples** | 21,866 |
| **Sampling Rate** | 16 kHz |

## Evaluation Results

### This Model (whisper-tiny-cv-only-pt)

| Metric | Value |
|--------|-------|
| **Validation Loss** | 0.4463 |
| **Validation WER** | 27.05% |
| **Test WER (Common Voice)** | 30.72% |
| **Test WER (MLS)** | 45.83% |
| **Best Checkpoint** | Step 250 |
| **Max Training Steps** | 430 |

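
The WER figures reported here are word error rates: the number of word-level substitutions, deletions, and insertions needed to turn a transcription into its reference, divided by the number of reference words. The sketch below illustrates that definition in plain Python. It is not the evaluation script behind the reported numbers; the function name `word_error_rate` and the example sentences are hypothetical, and any text normalization applied before scoring is not covered.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One of five reference words is missing from the hypothesis.
example_wer = word_error_rate("o gato dorme no sofá", "o gato dorme sofá")
print(f"{example_wer:.2%}")  # prints 20.00%
```

Because the count is divided by the reference length, a hypothesis with many insertions can score above 100%.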
### Comparison with Synthetic Data Augmentation (Whisper-Tiny Portuguese)

| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---------------|-----------|----------|---------|---------------|----------------|
| **Common Voice Only (Baseline)** | **430** | **0.4463** | **27.05%** | **30.72%** | **45.83%** |
| High-Quality (q ≥ 0.8) + CV | 575 | 0.4481 | 26.74% | 29.33% | 44.18% |
| Mid-High (q ≥ 0.5) + CV | 805 | 0.4550 | 26.95% | 30.11% | 47.25% |
| All Synthetic + CV | 860 | 0.4517 | 28.06% | 29.84% | 46.54% |

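
To make the comparison easier to read, the sketch below recomputes each configuration's change in test WER against this baseline, both in absolute percentage points and relative to the baseline value. The numbers are copied from the table above; the names `test_wer` and `wer_change` are illustrative only.

```python
BASELINE = "Common Voice Only (Baseline)"

# (Test WER on Common Voice, Test WER on MLS), in percent, from the table above.
test_wer = {
    BASELINE: (30.72, 45.83),
    "High-Quality (q >= 0.8) + CV": (29.33, 44.18),
    "Mid-High (q >= 0.5) + CV": (30.11, 47.25),
    "All Synthetic + CV": (29.84, 46.54),
}


def wer_change(baseline: float, candidate: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in %)."""
    absolute = round(candidate - baseline, 2)
    relative = round(100 * (candidate - baseline) / baseline, 2)
    return absolute, relative


base_cv, base_mls = test_wer[BASELINE]
changes = {}
for name, (cv, mls) in test_wer.items():
    if name == BASELINE:
        continue
    changes[name] = (wer_change(base_cv, cv), wer_change(base_mls, mls))
    (cv_abs, cv_rel), (mls_abs, mls_rel) = changes[name]
    print(f"{name}: CV {cv_abs:+.2f} pp ({cv_rel:+.2f}%), "
          f"MLS {mls_abs:+.2f} pp ({mls_rel:+.2f}%)")
```

Negative values are improvements. With these figures, only the high-quality configuration lowers WER on both test sets (1.39 pp on Common Voice, 1.65 pp on MLS); the other two lower Common Voice WER slightly but raise MLS WER above the baseline.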
### Key Performance Characteristics

- **Fastest training**: Fewest steps (430) among all Tiny configurations
- **Smallest dataset**: Only 21,866 samples (no synthetic augmentation)
- **Reference baseline**: 30.72% Test WER on Common Voice
- **Limited cross-domain generalization**: 45.83% WER on MLS (challenging for the Tiny architecture)

## Why Synthetic Data Provides Limited Benefit for Tiny Models

The paper explains this architectural limitation:

> "The Tiny and Small variants of Whisper exhibit only marginal benefits from synthetic data augmentation, revealing the limitations imposed by reduced model capacity. For instance, the Portuguese Whisper-Tiny model achieves its lowest test WER of 29.33% using the high-quality filtered subset, an improvement of just 1.39 percentage points over the Common Voice baseline of 30.72%."

**Key Insight**: Compact models (39M params) struggle to disentangle subtle acoustic differences between natural and synthetic speech. The high-quality filtered variant improves test WER by only 1.39 percentage points, a modest gain that may not justify the additional data processing overhead.

## Training Data

### Dataset Composition

| Source | Samples | Description |
|--------|---------|-------------|
| [Common Voice 17.0 Portuguese](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 21,866 | Real crowdsourced speech |
| Synthetic Data | 0 | No synthetic augmentation |
| **Total** | **21,866** | |

## Training Procedure

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 5e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |

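
The 430 maximum training steps reported for this model are consistent with the sample count, global batch size, and epoch limit listed in this card. The short calculation below shows the arithmetic. It assumes the final partial batch of each epoch is kept rather than dropped, which is an assumption and not something the card states.

```python
import math

train_samples = 21_866      # Common Voice 17.0 Portuguese training samples
global_batch_size = 256     # samples consumed per optimizer step
max_epochs = 5
warmup_steps = 200

# Assumption: the last, smaller batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(train_samples / global_batch_size)
max_steps = steps_per_epoch * max_epochs
warmup_fraction = warmup_steps / max_steps

print(steps_per_epoch, max_steps)   # prints 86 430
print(f"{warmup_fraction:.1%}")     # prints 46.5%
```

Under the same assumption, the 200 warmup steps cover a little under half of the schedule, and the best checkpoint at step 250 falls in the third epoch.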
### Training Infrastructure

- **GPU**: NVIDIA H200 (140 GB VRAM)
- **Operating System**: Ubuntu 22.04
- **Framework**: Hugging Face Transformers

## Usage

### Transcription Pipeline

```python
from transformers import pipeline

# Load the fine-tuned checkpoint into an ASR pipeline.
# device="cuda" assumes a GPU is available; use device="cpu" otherwise.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-tiny-cv-only-pt",
    device="cuda"
)

# Transcribe a local audio file and print the recognized text.
result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])
```

### Direct Model Usage

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# Load the processor (feature extractor + tokenizer) and the model weights.
processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-tiny-cv-only-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-tiny-cv-only-pt")
model.to("cuda")

# Load the audio and resample it to the 16 kHz rate the model expects.
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features.to("cuda")

# Generate token IDs, then decode them into text.
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

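### Specifying Language

To request Portuguese transcription explicitly, set the language and task on the generation config of a loaded `WhisperForConditionalGeneration` instance (here `model`) before calling `generate`:

```python
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"
```
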
## When to Use This Model

This model is a good fit when you need:

- **Maximum resource efficiency**: Smallest model size (39M params)
- **Edge deployment**: Runs where memory and compute are limited
- **Fast inference**: Fastest among the Portuguese Whisper models compared in this card
- **Baseline comparison**: Reference for evaluating synthetic data impact on the Tiny architecture

Consider alternatives based on your needs:

- [whisper-tiny-high-mixed-pt](https://huggingface.co/yuriyvnv/whisper-tiny-high-mixed-pt): Marginal improvement (29.33% vs 30.72% WER)
- [whisper-small-cv-only-pt](https://huggingface.co/yuriyvnv/whisper-small-cv-only-pt): Better accuracy (13.87% WER)
- [whisper-large-v3-high-mixed-pt](https://huggingface.co/yuriyvnv/whisper-large-v3-high-mixed-pt): Best accuracy (7.94% WER)

## Model Size Comparison

| Model | Params | Best Config | Test WER (CV) | Test WER (MLS) | Synthetic Benefit |
|-------|--------|-------------|---------------|----------------|-------------------|
| **Whisper-Tiny** | **39M** | **High-Quality + CV** | **29.33%** | **44.18%** | **Marginal (1.39 pp lower WER)** |
| Whisper-Small | 244M | CV Only | 13.87% | 30.69% | None/Negative |
| Whisper-Large-v3 | 1550M | High-Quality + CV | 7.94% | 12.41% | Significant (+32.6%) |

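
For a rough sense of what these parameter counts mean for deployment, the sketch below estimates the memory taken by the weights alone at two common precisions. It uses the nominal parameter counts from the table and assumes 4 bytes per parameter for FP32 and 2 for BF16; activations, decoding caches, and framework overhead are not included, so real memory use will be higher. The helper name `weight_megabytes` is illustrative.

```python
# Nominal parameter counts from the table above.
PARAM_COUNTS = {
    "Whisper-Tiny": 39e6,
    "Whisper-Small": 244e6,
    "Whisper-Large-v3": 1550e6,
}
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def weight_megabytes(num_params: float, dtype: str) -> float:
    """Approximate size of the weights alone, in megabytes (10^6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


for name, count in PARAM_COUNTS.items():
    sizes = ", ".join(
        f"{dtype}: {weight_megabytes(count, dtype):,.0f} MB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name}: {sizes}")
```

On this estimate the Tiny weights occupy roughly 156 MB in FP32 or 78 MB in BF16, about 40 times less than Large-v3 at the same precision.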
## Limitations

- **Lower accuracy**: 30.72% WER (vs 7.94% for Large-v3)
- **Limited capacity**: Cannot effectively leverage synthetic data
- **Domain specificity**: Optimized for Common Voice-style speech
- **Cross-domain weakness**: 45.83% WER on MLS shows difficulty generalizing beyond Common Voice

## Citation

This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:

```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```

## References

- **Base Model**: [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)
- **Training Data**: [mozilla-foundation/common_voice_17_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
- **Whisper Paper**: [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
- **Motivating Research**: [Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)](https://ieeexplore.ieee.org/document/10720758)

## License

Apache 2.0