---
license: apache-2.0
language:
- pt
base_model: openai/whisper-tiny
tags:
- automatic-speech-recognition
- whisper
- portuguese
- speech
- audio
- asr
- hf-asr-leaderboard
datasets:
- mozilla-foundation/common_voice_17_0
model-index:
- name: whisper-tiny-cv-only-pt
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 17.0 (Portuguese)
      type: mozilla-foundation/common_voice_17_0
      config: pt
      split: test
    metrics:
    - type: wer
      value: 30.72
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Multilingual LibriSpeech (Portuguese)
      type: facebook/multilingual_librispeech
      config: portuguese
      split: test
    metrics:
    - type: wer
      value: 45.83
      name: Test WER (MLS)
pipeline_tag: automatic-speech-recognition
library_name: transformers
---
# Whisper-Tiny Portuguese - Common Voice Only (Baseline)
This model is a fine-tuned version of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) for Portuguese automatic speech recognition (ASR). It was trained **exclusively on Common Voice 17.0 Portuguese** without any synthetic data augmentation, serving as the baseline for evaluating the impact of synthetic speech on the smallest Whisper architecture.
## Purpose
This baseline model establishes the performance of the Whisper-Tiny architecture (39M parameters) using only real, crowdsourced speech data. It serves as a reference point to evaluate:
- The effectiveness of synthetic data augmentation for the smallest model architecture
- The fundamental capacity limitations of compact ASR models
- Comparison with Small and Large-v3 models to understand scaling effects
**Key Finding**: Unlike Large-v3 models, which show significant improvements with synthetic data, Tiny models show only **marginal benefits** (at best a 1.39 percentage point reduction in Common Voice test WER) from synthetic augmentation. The paper states: *"This modest gain offers limited justification for the additional data filtering and preprocessing overhead."*
## Model Details
| Property | Value |
|----------|-------|
| **Base Model** | openai/whisper-tiny |
| **Language** | Portuguese (pt) |
| **Task** | Automatic Speech Recognition (transcribe) |
| **Parameters** | 39M |
| **Training Data** | Common Voice 17.0 Portuguese (Real Speech Only) |
| **Total Training Samples** | 21,866 |
| **Sampling Rate** | 16 kHz |
## Evaluation Results
### This Model (whisper-tiny-cv-only-pt)
| Metric | Value |
|--------|-------|
| **Validation Loss** | 0.4463 |
| **Validation WER** | 27.05% |
| **Test WER (Common Voice)** | 30.72% |
| **Test WER (MLS)** | 45.83% |
| **Best Checkpoint** | Step 250 |
| **Max Training Steps** | 430 |
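For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal pure-Python version. It assumes whitespace tokenization and no text normalization (casing, punctuation), which may differ from the evaluation script behind the numbers above; this card does not describe that script. The sentence pair is illustrative and not taken from either test set.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative pair: one substitution and one insertion over 5 reference words
ref = "o gato está no telhado"
hyp = "o gato esta no telhado hoje"
print(f"{word_error_rate(ref, hyp):.2%}")  # 40.00%
```

Corpus-level WER is usually computed by summing edits and reference words over all utterances rather than averaging per-utterance scores, so a per-sentence helper like this is mainly useful for spot checks.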
### Comparison with Synthetic Data Augmentation (Whisper-Tiny Portuguese)
| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---------------|-----------|----------|---------|---------------|----------------|
| **Common Voice Only (Baseline)** | **430** | **0.4463** | **27.05%** | **30.72%** | **45.83%** |
| High-Quality (q ≥ 0.8) + CV | 575 | 0.4481 | 26.74% | 29.33% | 44.18% |
| Mid-High (q ≥ 0.5) + CV | 805 | 0.4550 | 26.95% | 30.11% | 47.25% |
| All Synthetic + CV | 860 | 0.4517 | 28.06% | 29.84% | 46.54% |
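To make the size of these differences concrete, the short sketch below computes each configuration's change in Common Voice test WER relative to this baseline, both in absolute percentage points and as a relative reduction. The WER values are copied from the table above; the helper name `wer_gain` is only illustrative.

```python
# Common Voice test WER (%) taken from the comparison table
baseline_cv_wer = 30.72
configs = {
    "High-Quality (q >= 0.8) + CV": 29.33,
    "Mid-High (q >= 0.5) + CV": 30.11,
    "All Synthetic + CV": 29.84,
}


def wer_gain(baseline: float, wer: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute_pp = baseline - wer
    relative_pct = 100 * absolute_pp / baseline
    return absolute_pp, relative_pct


for name, wer in configs.items():
    abs_pp, rel = wer_gain(baseline_cv_wer, wer)
    print(f"{name}: {abs_pp:.2f} pp absolute, {rel:.1f}% relative")
```

For the best configuration, the 1.39 percentage point gain corresponds to a relative WER reduction of roughly 4.5%.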
### Key Performance Characteristics
- **Fastest training**: Fewest steps (430) among all Tiny configurations
- **Smallest dataset**: Only 21,866 samples (no synthetic augmentation)
- **Reference baseline**: 30.72% Test WER on Common Voice
- **Limited cross-domain generalization**: 45.83% MLS WER (challenging for Tiny architecture)
## Why Synthetic Data Provides Limited Benefit for Tiny Models
The paper explains this architectural limitation:
> "The Tiny and Small variants of Whisper exhibit only marginal benefits from synthetic data augmentation, revealing the limitations imposed by reduced model capacity. For instance, the Portuguese Whisper-Tiny model achieves its lowest test WER of 29.33% using the high-quality filtered subset, an improvement of just 1.39 percentage points over the Common Voice baseline of 30.72%."
**Key Insight**: Compact models (39M params) struggle to disentangle subtle acoustic differences between natural and synthetic speech. The high-quality filtered variant lowers test WER by only 1.39 percentage points, a modest gain that may not justify the additional data processing overhead.
## Training Data
### Dataset Composition
| Source | Samples | Description |
|--------|---------|-------------|
| [Common Voice 17.0 Portuguese](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 21,866 | Real crowdsourced speech |
| Synthetic Data | 0 | No synthetic augmentation |
| **Total** | **21,866** | |
## Training Procedure
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Learning Rate | 5e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
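The reported step count can be cross-checked from these hyperparameters. The sketch below is a back-of-the-envelope calculation, assuming one optimizer step per global batch and that the last partial batch of each epoch is kept rather than dropped; neither detail is stated in this card.

```python
import math

# Values from the tables in this card
train_samples = 21_866
global_batch_size = 256
max_epochs = 5
eval_steps = 50
warmup_steps = 200
best_step = 250

# Assumption: the final partial batch of each epoch still counts as a step
steps_per_epoch = math.ceil(train_samples / global_batch_size)
total_steps = steps_per_epoch * max_epochs

# Derived quantities
num_evals = total_steps // eval_steps        # evaluations at steps 50, 100, ...
best_epoch = best_step / steps_per_epoch     # where the best checkpoint falls
warmup_fraction = warmup_steps / total_steps

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"evaluations:     {num_evals}")
print(f"best checkpoint: epoch {best_epoch:.2f}")
print(f"warmup share:    {warmup_fraction:.1%}")
```

Under these assumptions the total comes to 430 steps, matching the reported maximum, and the best checkpoint (step 250) falls shortly before the end of the third epoch.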
### Training Infrastructure
- **GPU**: NVIDIA H200 (140GB VRAM)
- **Operating System**: Ubuntu 22.04
- **Framework**: Hugging Face Transformers
## Usage
### Transcription Pipeline
```python
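import torch
from transformers import pipeline

# Use the GPU when one is available; otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-tiny-cv-only-pt",
    device=device,
)

# Transcribe a local audio file
result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])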
```
### Direct Model Usage
```python
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Use the GPU when one is available; otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-tiny-cv-only-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-tiny-cv-only-pt")
model.to(device)

# Load the audio as mono and resample it to the 16 kHz the model expects
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features.to(device)

# Generate token IDs and decode them to text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
### Specifying Language
```python
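# `model` is the WhisperForConditionalGeneration instance loaded above;
# set these before calling model.generate()
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"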
```
## When to Use This Model
This model is a good fit when you need:
- **Maximum resource efficiency**: Smallest model size (39M params)
- **Edge deployment**: Runs where memory and compute are limited
- **Fast inference**: Fastest of the Portuguese Whisper models in this series (Tiny, Small, Large-v3)
- **Baseline comparison**: Reference for evaluating synthetic data impact on the Tiny architecture
Consider alternatives based on your needs:
- [whisper-tiny-high-mixed-pt](https://huggingface.co/yuriyvnv/whisper-tiny-high-mixed-pt): Marginal improvement (29.33% vs 30.72% Common Voice test WER)
- [whisper-small-cv-only-pt](https://huggingface.co/yuriyvnv/whisper-small-cv-only-pt): Better accuracy (13.87% Common Voice test WER)
- [whisper-large-v3-high-mixed-pt](https://huggingface.co/yuriyvnv/whisper-large-v3-high-mixed-pt): Best accuracy (7.94% Common Voice test WER)
## Model Size Comparison
| Model | Params | Best Config | Test WER (CV) | Test WER (MLS) | Synthetic Benefit |
|-------|--------|-------------|---------------|----------------|-------------------|
| **Whisper-Tiny** | **39M** | **High-Quality + CV** | **29.33%** | **44.18%** | **Marginal (1.39 pp)** |
| Whisper-Small | 244M | CV Only | 13.87% | 30.69% | None/Negative |
| Whisper-Large-v3 | 1550M | High-Quality + CV | 7.94% | 12.41% | Significant (+32.6%) |
## Limitations
- **Lower accuracy**: 30.72% WER (vs 7.94% for Large-v3)
- **Limited capacity**: Cannot effectively leverage synthetic data
- **Domain specificity**: Optimized for Common Voice-style speech
- **Cross-domain weakness**: 45.83% MLS WER shows difficulty adapting
## Citation
This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:
```bibtex
@article{perezhohin2024enhancing,
title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
journal={IEEE Access},
year={2024},
publisher={IEEE}
}
```
## References
- **Base Model**: [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)
- **Training Data**: [mozilla-foundation/common_voice_17_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
- **Whisper Paper**: [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
- **Motivating Research**: [Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)](https://ieeexplore.ieee.org/document/10720758)
## License
Apache 2.0