---
license: apache-2.0
language:
- pt
base_model: openai/whisper-tiny
tags:
- automatic-speech-recognition
- whisper
- portuguese
- speech
- audio
- asr
- hf-asr-leaderboard
datasets:
- mozilla-foundation/common_voice_17_0
model-index:
- name: whisper-tiny-cv-only-pt
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 17.0 (Portuguese)
      type: mozilla-foundation/common_voice_17_0
      config: pt
      split: test
    metrics:
    - type: wer
      value: 30.72
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Multilingual LibriSpeech (Portuguese)
      type: facebook/multilingual_librispeech
      config: portuguese
      split: test
    metrics:
    - type: wer
      value: 45.83
      name: Test WER (MLS)
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# Whisper-Tiny Portuguese - Common Voice Only (Baseline)

This model is a fine-tuned version of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) for Portuguese automatic speech recognition (ASR). It was trained **exclusively on Common Voice 17.0 Portuguese** without any synthetic data augmentation, serving as the baseline for evaluating the impact of synthetic speech on the smallest Whisper architecture.

## Purpose

This baseline model establishes the performance of the Whisper-Tiny architecture (39M parameters) using only real, crowdsourced speech data. It serves as a reference point to evaluate:

- The effectiveness of synthetic data augmentation for the smallest model architecture
- The fundamental capacity limitations of compact ASR models
- Comparison with Small and Large-v3 models to understand scaling effects

**Key Finding**: Unlike Large-v3 models, which show significant improvements with synthetic data, Tiny models show only **marginal benefits** from synthetic augmentation (1.39 percentage points of Test WER on Common Voice). The paper states: *"This modest gain offers limited justification for the additional data filtering and preprocessing overhead."*

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | openai/whisper-tiny |
| **Language** | Portuguese (pt) |
| **Task** | Automatic Speech Recognition (transcribe) |
| **Parameters** | 39M |
| **Training Data** | Common Voice 17.0 Portuguese (Real Speech Only) |
| **Total Training Samples** | 21,866 |
| **Sampling Rate** | 16kHz |

## Evaluation Results

### This Model (whisper-tiny-cv-only-pt)

| Metric | Value |
|--------|-------|
| **Validation Loss** | 0.4463 |
| **Validation WER** | 27.05% |
| **Test WER (Common Voice)** | 30.72% |
| **Test WER (MLS)** | 45.83% |
| **Best Checkpoint** | Step 250 |
| **Max Training Steps** | 430 |
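
For readers unfamiliar with the metric, the WER figures above are word error rates: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The snippet below is a minimal sketch of that standard definition, not the evaluation script used for this model. The function name `word_error_rate` and the example sentences are illustrative, and the text normalization applied before scoring (casing, punctuation) is not stated in this card, so none is applied here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not taken from the evaluation sets):
# one of five reference words is dropped.
example_wer = word_error_rate("o gato dorme no sofá", "o gato dorme sofá")
print(f"{100 * example_wer:.2f}%")
```

Corpus-level WER is usually computed by summing errors and reference words over all utterances rather than averaging per-utterance scores; whether that aggregation was used here is an assumption.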

### Comparison with Synthetic Data Augmentation (Whisper-Tiny Portuguese)

| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
|---------------|-----------|----------|---------|---------------|----------------|
| **Common Voice Only (Baseline)** | **430** | **0.4463** | **27.05%** | **30.72%** | **45.83%** |
| High-Quality (q ≥ 0.8) + CV | 575 | 0.4481 | 26.74% | 29.33% | 44.18% |
| Mid-High (q ≥ 0.5) + CV | 805 | 0.4550 | 26.95% | 30.11% | 47.25% |
| All Synthetic + CV | 860 | 0.4517 | 28.06% | 29.84% | 46.54% |
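
The gains over the baseline can be read off the table in two ways: as an absolute difference in percentage points, or as a reduction relative to the baseline WER. The sketch below computes both from the Common Voice test column. The helper name `wer_gain` is illustrative, and the dictionary keys use `>=` in place of `≥` purely for convenience.

```python
BASELINE_CV_WER = 30.72  # Common Voice Only, Test WER (CV), in percent

# Test WER (CV) of the augmented configurations, copied from the table
augmented_cv_wer = {
    "High-Quality (q >= 0.8) + CV": 29.33,
    "Mid-High (q >= 0.5) + CV": 30.11,
    "All Synthetic + CV": 29.84,
}


def wer_gain(baseline: float, candidate: float) -> tuple:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = baseline - candidate
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)


gains = {
    name: wer_gain(BASELINE_CV_WER, wer)
    for name, wer in augmented_cv_wer.items()
}
for name, (absolute, relative) in gains.items():
    print(f"{name}: {absolute:.2f} pp absolute, {relative:.2f}% relative")
```

For the high-quality subset this gives 1.39 percentage points absolute, which is about 4.5% relative to the baseline.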

### Key Performance Characteristics

- **Fastest training**: Fewest steps (430) among all Tiny configurations
- **Smallest dataset**: Only 21,866 samples (no synthetic augmentation)
- **Reference baseline**: 30.72% Test WER on Common Voice
- **Limited cross-domain**: 45.83% MLS WER (challenging for Tiny architecture)

## Why Synthetic Data Provides Limited Benefit for Tiny Models

The paper explains this architectural limitation:

> "The Tiny and Small variants of Whisper exhibit only marginal benefits from synthetic data augmentation, revealing the limitations imposed by reduced model capacity. For instance, the Portuguese Whisper-Tiny model achieves its lowest test WER of 29.33% using the high-quality filtered subset, an improvement of just 1.39 percentage points over the Common Voice baseline of 30.72%."

**Key Insight**: Compact models (39M params) struggle to disentangle subtle acoustic differences between natural and synthetic speech. The high-quality filtered variant improves Test WER by only 1.39 percentage points, a modest gain that may not justify the additional data processing overhead.

## Training Data

### Dataset Composition

| Source | Samples | Description |
|--------|---------|-------------|
| [Common Voice 17.0 Portuguese](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 21,866 | Real crowdsourced speech |
| Synthetic Data | 0 | No synthetic augmentation |
| **Total** | **21,866** | |

## Training Procedure

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 5e-5 |
| Batch Size (Global) | 256 |
| Warmup Steps | 200 |
| Max Epochs | 5 |
| Precision | BF16 |
| Optimizer | AdamW (fused) |
| Eval Steps | 50 |
| Metric for Best Model | eval_loss |
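
As a quick consistency check, the 430 maximum training steps reported earlier follow from the dataset size, global batch size, and epoch count. The sketch below reproduces that arithmetic; it assumes the final partial batch of each epoch is kept rather than dropped, which the card does not state explicitly.

```python
import math

train_samples = 21_866   # Common Voice 17.0 Portuguese training samples
global_batch_size = 256  # global batch size from the table above
max_epochs = 5
warmup_steps = 200

# Assumption: the last, partially filled batch of each epoch is not dropped.
steps_per_epoch = math.ceil(train_samples / global_batch_size)
total_steps = steps_per_epoch * max_epochs

# Share of the optimizer steps spent in learning-rate warmup
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch, total_steps, round(warmup_fraction, 3))
```

Under that assumption the run has 86 steps per epoch and 430 steps in total, so the 200 warmup steps cover a little under half of training, and the best checkpoint (step 250) falls shortly after warmup ends.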

### Training Infrastructure

- **GPU**: NVIDIA H200 (141GB VRAM)
- **Operating System**: Ubuntu 22.04
- **Framework**: Hugging Face Transformers

## Usage

### Transcription Pipeline

```python
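import torch
from transformers import pipeline

# Use the GPU when one is available; otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

transcriber = pipeline(
    "automatic-speech-recognition",
    model="yuriyvnv/whisper-tiny-cv-only-pt",
    device=device,
)

# Transcribe a local audio file and print the recognized text
result = transcriber("path/to/portuguese_audio.wav")
print(result["text"])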
```

### Direct Model Usage

```python
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Use the GPU when one is available; otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained("yuriyvnv/whisper-tiny-cv-only-pt")
model = WhisperForConditionalGeneration.from_pretrained("yuriyvnv/whisper-tiny-cv-only-pt")
model.to(device)

# Whisper expects 16 kHz audio; librosa resamples to that rate on load
audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features.to(device)

# Generate token IDs, then decode them into text
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```

### Specifying Language

```python
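# Set these on the loaded `model` before calling generate() to force
# Portuguese transcription instead of relying on language detection
model.generation_config.language = "pt"
model.generation_config.task = "transcribe"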
```

## When to Use This Model

This model is a good fit when you need:
- **Maximum resource efficiency**: The smallest model size (39M params)
- **Edge deployment**: A model that runs with limited memory and compute
- **Fast inference**: The smallest, and therefore likely the fastest, of the Portuguese models in this series
- **Baseline comparison**: A reference for evaluating synthetic data impact on the Tiny architecture

Consider alternatives based on your needs:
- [whisper-tiny-high-mixed-pt](https://huggingface.co/yuriyvnv/whisper-tiny-high-mixed-pt): Marginal improvement (29.33% vs 30.72% Test WER on Common Voice)
- [whisper-small-cv-only-pt](https://huggingface.co/yuriyvnv/whisper-small-cv-only-pt): Better accuracy (13.87% Test WER on Common Voice)
- [whisper-large-v3-high-mixed-pt](https://huggingface.co/yuriyvnv/whisper-large-v3-high-mixed-pt): Best accuracy (7.94% Test WER on Common Voice)

## Model Size Comparison

| Model | Params | Best Config | Test WER (CV) | Test WER (MLS) | Synthetic Benefit |
|-------|--------|-------------|---------------|----------------|-------------------|
| **Whisper-Tiny** | **39M** | **High-Quality** | **29.33%** | **44.18%** | **Marginal (1.39 pp)** |
| Whisper-Small | 244M | CV Only | 13.87% | 30.69% | None/Negative |
| Whisper-Large-v3 | 1550M | High-Quality + CV | 7.94% | 12.41% | Significant (+32.6%) |

## Limitations

- **Lower accuracy**: 30.72% Test WER on Common Voice (vs 7.94% for the best Large-v3 configuration)
- **Limited capacity**: Gains little from synthetic data (at most 1.39 percentage points)
- **Domain specificity**: Optimized for Common Voice-style speech
- **Cross-domain weakness**: 45.83% MLS WER shows difficulty adapting

## Citation

This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:

```bibtex
@article{perezhohin2024enhancing,
  title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
  author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
  journal={IEEE Access},
  year={2024},
  publisher={IEEE}
}
```

## References

- **Base Model**: [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)
- **Training Data**: [mozilla-foundation/common_voice_17_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
- **Whisper Paper**: [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
- **Motivating Research**: [Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)](https://ieeexplore.ieee.org/document/10720758)

## License

Apache 2.0