	---
	license: apache-2.0
	language:
	- pt
	base_model: openai/whisper-tiny
	tags:
	- automatic-speech-recognition
	- whisper
	- portuguese
	- speech
	- audio
	- asr
	- hf-asr-leaderboard
	datasets:
	- mozilla-foundation/common_voice_17_0
	model-index:
	- name: whisper-tiny-cv-only-pt
	  results:
	  - task:
	      type: automatic-speech-recognition
	      name: Automatic Speech Recognition
	    dataset:
	      name: Common Voice 17.0 (Portuguese)
	      type: mozilla-foundation/common_voice_17_0
	      config: pt
	      split: test
	    metrics:
	    - type: wer
	      value: 30.72
	      name: Test WER
	  - task:
	      type: automatic-speech-recognition
	      name: Automatic Speech Recognition
	    dataset:
	      name: Multilingual LibriSpeech (Portuguese)
	      type: facebook/multilingual_librispeech
	      config: portuguese
	      split: test
	    metrics:
	    - type: wer
	      value: 45.83
	      name: Test WER (MLS)
	pipeline_tag: automatic-speech-recognition
	library_name: transformers
	---

	# Whisper-Tiny Portuguese - Common Voice Only (Baseline)

	This model is a fine-tuned version of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) for Portuguese automatic speech recognition (ASR). It was trained exclusively on Common Voice 17.0 Portuguese without any synthetic data augmentation, serving as the baseline for evaluating the impact of synthetic speech on the smallest Whisper architecture.

	## Purpose

	This baseline model establishes the performance of the Whisper-Tiny architecture (39M parameters) using only real, crowdsourced speech data. It serves as a reference point to evaluate:

	- The effectiveness of synthetic data augmentation for the smallest model architecture
	- The fundamental capacity limitations of compact ASR models
	- Comparison with Small and Large-v3 models to understand scaling effects

	Key Finding: Unlike the Large-v3 models, which improve significantly with synthetic data, the Tiny models gain only marginally from synthetic augmentation (1.39 percentage points of Test WER on Common Voice). The paper states: "This modest gain offers limited justification for the additional data filtering and preprocessing overhead."

	## Model Details

	| Property | Value |
	|----------|-------|
	| Base Model | openai/whisper-tiny |
	| Language | Portuguese (pt) |
	| Task | Automatic Speech Recognition (transcribe) |
	| Parameters | 39M |
	| Training Data | Common Voice 17.0 Portuguese (Real Speech Only) |
	| Total Training Samples | 21,866 |
	| Sampling Rate | 16 kHz |

	## Evaluation Results

	### This Model (whisper-tiny-cv-only-pt)

	| Metric | Value |
	|--------|-------|
	| Validation Loss | 0.4463 |
	| Validation WER | 27.05% |
	| Test WER (Common Voice) | 30.72% |
	| Test WER (MLS) | 45.83% |
	| Best Checkpoint | Step 250 |
	| Max Training Steps | 430 |
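WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The card does not describe the text normalization applied before scoring, so the following is only a minimal, dependency-free sketch of the metric itself, not the evaluation script behind the numbers above. The function name and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical sentences, not taken from the evaluation data.
reference = "o gato dorme no sofá"
hypothesis = "o gato dorme sofá"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")  # one deleted word out of five
```

In practice a maintained library such as `jiwer` is usually preferable, together with a consistent normalization of casing and punctuation for both reference and hypothesis.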

	### Comparison with Synthetic Data Augmentation (Whisper-Tiny Portuguese)

	| Training Data | Max Steps | Val Loss | Val WER | Test WER (CV) | Test WER (MLS) |
	|---------------|-----------|----------|---------|---------------|----------------|
	| Common Voice Only (Baseline) | 430 | 0.4463 | 27.05% | 30.72% | 45.83% |
	| High-Quality (q ≥ 0.8) + CV | 575 | 0.4481 | 26.74% | 29.33% | 44.18% |
	| Mid-High (q ≥ 0.5) + CV | 805 | 0.4550 | 26.95% | 30.11% | 47.25% |
	| All Synthetic + CV | 860 | 0.4517 | 28.06% | 29.84% | 46.54% |
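To make the comparison easier to read, the differences against the baseline can be computed directly from the table. The sketch below is illustrative only: it copies the Test WER values from the table above and reports each change both in absolute percentage points and relative to the baseline WER. The helper name `wer_change` is hypothetical.

```python
# Test WER values (%) copied from the comparison table above.
BASELINE = "Common Voice Only (Baseline)"
results = {
    BASELINE: {"cv": 30.72, "mls": 45.83},
    "High-Quality (q ≥ 0.8) + CV": {"cv": 29.33, "mls": 44.18},
    "Mid-High (q ≥ 0.5) + CV": {"cv": 30.11, "mls": 47.25},
    "All Synthetic + CV": {"cv": 29.84, "mls": 46.54},
}


def wer_change(config, test_set):
    """Return (absolute change in percentage points, relative change in %) vs. the baseline."""
    base = results[BASELINE][test_set]
    absolute = results[config][test_set] - base
    return round(absolute, 2), round(100.0 * absolute / base, 2)


for config in results:
    if config == BASELINE:
        continue
    for test_set in ("cv", "mls"):
        absolute, relative = wer_change(config, test_set)
        print(f"{config:<28} {test_set.upper():<3} {absolute:+.2f} pp ({relative:+.2f}%)")
```

Negative values are improvements. With these numbers, only the high-quality subset lowers WER on both test sets; the other two mixes lower the Common Voice WER but raise the MLS WER relative to the baseline.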

	### Key Performance Characteristics

	- Fastest training: Fewest steps (430) among all Tiny configurations
	- Smallest dataset: Only 21,866 samples (no synthetic augmentation)
	- Reference baseline: 30.72% Test WER on Common Voice
	- Limited cross-domain generalization: 45.83% Test WER on MLS (challenging for the Tiny architecture)

	## Why Synthetic Data Provides Limited Benefit for Tiny Models

	The paper explains this architectural limitation:

	> "The Tiny and Small variants of Whisper exhibit only marginal benefits from synthetic data augmentation, revealing the limitations imposed by reduced model capacity. For instance, the Portuguese Whisper-Tiny model achieves its lowest test WER of 29.33% using the high-quality filtered subset, an improvement of just 1.39 percentage points over the Common Voice baseline of 30.72%."

	Key Insight: Compact models (39M params) struggle to disentangle subtle acoustic differences between natural and synthetic speech. The high-quality filtered variant improves Test WER by only 1.39 percentage points, a modest gain that may not justify the additional data processing overhead.

	## Training Data

	### Dataset Composition

	| Source | Samples | Description |
	|--------|---------|-------------|
	| [Common Voice 17.0 Portuguese](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 21,866 | Real crowdsourced speech |
	| Synthetic Data | 0 | No synthetic augmentation |
	| Total | 21,866 | |

	## Training Procedure

	### Hyperparameters

	| Parameter | Value |
	|-----------|-------|
	| Learning Rate | 5e-5 |
	| Batch Size (Global) | 256 |
	| Warmup Steps | 200 |
	| Max Epochs | 5 |
	| Precision | BF16 |
	| Optimizer | AdamW (fused) |
	| Eval Steps | 50 |
	| Metric for Best Model | eval_loss |
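The hyperparameters are consistent with the step counts reported under Evaluation Results. As a rough check, and assuming the final partial batch of each epoch counts as one optimizer step (an assumption, not something the card states), the schedule works out as follows:

```python
import math

train_samples = 21_866       # Total Training Samples (Model Details)
global_batch_size = 256      # Batch Size (Global)
max_epochs = 5               # Max Epochs
warmup_steps = 200           # Warmup Steps
best_checkpoint_step = 250   # Best Checkpoint (Evaluation Results)

# Assumption: an incomplete last batch still produces one optimizer step.
steps_per_epoch = math.ceil(train_samples / global_batch_size)
total_steps = steps_per_epoch * max_epochs

# Where the best checkpoint and the end of warmup fall within the schedule.
best_checkpoint_epoch = best_checkpoint_step / steps_per_epoch
warmup_fraction = warmup_steps / total_steps

print(f"steps per epoch:       {steps_per_epoch}")
print(f"total steps:           {total_steps}")
print(f"best checkpoint epoch: {best_checkpoint_epoch:.2f}")
print(f"warmup fraction:       {warmup_fraction:.1%}")
```

Under that assumption the total is 86 steps per epoch times 5 epochs, which matches the 430 max training steps reported above. The best checkpoint (step 250) falls shortly before the end of the third epoch, and warmup covers a little under half of the schedule.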

	### Training Infrastructure

	- GPU: NVIDIA H200 (141 GB VRAM)
	- Operating System: Ubuntu 22.04
	- Framework: Hugging Face Transformers

	## Usage

	### Transcription Pipeline

	```python
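	import torch
	from transformers import pipeline

	# Use the GPU when one is available; otherwise fall back to the CPU.
	device = "cuda" if torch.cuda.is_available() else "cpu"

	transcriber = pipeline(
	    "automatic-speech-recognition",
	    model="yuriyvnv/whisper-tiny-cv-only-pt",
	    device=device,
	)

	# The pipeline loads the file and resamples it to the model's 16 kHz input rate.
	result = transcriber("path/to/portuguese_audio.wav")
	print(result["text"])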
	```

	### Direct Model Usage

	```python
	import librosa
	import torch
	from transformers import WhisperForConditionalGeneration, WhisperProcessor

	model_id = "yuriyvnv/whisper-tiny-cv-only-pt"
	device = "cuda" if torch.cuda.is_available() else "cpu"

	processor = WhisperProcessor.from_pretrained(model_id)
	model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)

	# Whisper expects 16 kHz mono audio; librosa resamples on load.
	audio, sr = librosa.load("path/to/portuguese_audio.wav", sr=16000)
	input_features = processor(audio, sampling_rate=sr, return_tensors="pt").input_features.to(device)

	# Generate token IDs, then decode them to text.
	predicted_ids = model.generate(input_features)
	transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
	print(transcription)
	```

	### Specifying Language

	```python
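	# Continues from the example above: force Portuguese transcription
	# instead of relying on language auto-detection.
	model.generation_config.language = "pt"
	model.generation_config.task = "transcribe"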
	```

	## When to Use This Model

	This model is a good fit when you need:
	- Maximum resource efficiency: Smallest model size in this series (39M params)
	- Edge deployment: Runs with limited memory and compute
	- Fast inference: Fastest of the Portuguese models in this series
	- Baseline comparison: Reference for evaluating synthetic data impact on the Tiny architecture

	Consider alternatives based on your needs:
	- [whisper-tiny-high-mixed-pt](https://huggingface.co/yuriyvnv/whisper-tiny-high-mixed-pt): Marginal improvement (29.33% vs 30.72% Test WER on Common Voice)
	- [whisper-small-cv-only-pt](https://huggingface.co/yuriyvnv/whisper-small-cv-only-pt): Better accuracy (13.87% Test WER on Common Voice)
	- [whisper-large-v3-high-mixed-pt](https://huggingface.co/yuriyvnv/whisper-large-v3-high-mixed-pt): Best accuracy (7.94% Test WER on Common Voice)

	## Model Size Comparison

	| Model | Params | Best Config | Test WER (CV) | Test WER (MLS) | Synthetic Benefit |
	|-------|--------|-------------|---------------|----------------|-------------------|
	| Whisper-Tiny | 39M | High-Quality + CV | 29.33% | 44.18% | Marginal (1.39 pp lower CV WER) |
	| Whisper-Small | 244M | CV Only | 13.87% | 30.69% | None/Negative |
	| Whisper-Large-v3 | 1550M | High-Quality + CV | 7.94% | 12.41% | Significant (+32.6%) |
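For deployment decisions it can help to set these WER figures against a rough memory estimate. The sketch below multiplies the parameter counts from the table by the bytes per parameter. It covers the weights only and ignores activations, decoding caches, and framework overhead, so treat it as a lower bound rather than a measured footprint. The helper name `weight_memory_mb` is hypothetical.

```python
# (parameters in millions, best Test WER on Common Voice in %) from the table above.
models = {
    "Whisper-Tiny": (39, 29.33),
    "Whisper-Small": (244, 13.87),
    "Whisper-Large-v3": (1550, 7.94),
}
BYTES_PER_PARAM = {"fp32": 4, "fp16/bf16": 2}


def weight_memory_mb(params_millions, bytes_per_param):
    """Approximate size of the weights alone, in megabytes (1 MB = 10**6 bytes)."""
    return params_millions * bytes_per_param


tiny_params = models["Whisper-Tiny"][0]
for name, (params, wer) in models.items():
    sizes = ", ".join(
        f"{dtype}: {weight_memory_mb(params, nbytes):,.0f} MB"
        for dtype, nbytes in BYTES_PER_PARAM.items()
    )
    print(f"{name:<17} {params / tiny_params:5.1f}x params  WER {wer:5.2f}%  ({sizes})")
```

By this estimate the Tiny weights take about 78 MB in half precision, compared with roughly 3,100 MB for Large-v3.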

	## Limitations

	- Lower accuracy: 30.72% Test WER on Common Voice (vs 7.94% for the best Large-v3 configuration)
	- Limited capacity: Cannot effectively leverage synthetic data
	- Domain specificity: Optimized for Common Voice-style speech
	- Cross-domain weakness: 45.83% Test WER on MLS indicates difficulty generalizing beyond Common Voice

	## Citation

	This model is part of research on WAVe (Word-Aligned Verification) for synthetic speech quality assessment. While the WAVe methodology paper is currently under review, please cite our previous work that motivated this research:

	```bibtex
	@article{perezhohin2024enhancing,
	title={Enhancing Automatic Speech Recognition: Effects of Semantic Audio Filtering on Models Performance},
	author={Perezhohin, Yuriy and Santos, Tiago and Costa, Victor and Peres, Fernando and Castelli, Mauro},
	journal={IEEE Access},
	year={2024},
	publisher={IEEE}
	}
	```

	## References

	- Base Model: [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny)
	- Training Data: [mozilla-foundation/common_voice_17_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
	- Whisper Paper: [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
	- Motivating Research: [Enhancing ASR with Semantic Audio Filtering (IEEE Access 2024)](https://ieeexplore.ieee.org/document/10720758)

	## License

	Apache 2.0