---
language: vie
datasets:
- legacy-datasets/common_voice
- AILAB-VNUHCM/vivos
- doof-ferb/vlsp2020_vinai_100h
- doof-ferb/fpt_fosd
- doof-ferb/infore1_25hours
- linhtran92/viet_bud500
- doof-ferb/LSVSC
- doof-ferb/vais1000
- doof-ferb/VietMed_labeled
- NhutP/VSV-1100
- doof-ferb/Speech-MASSIVE_vie
- doof-ferb/BibleMMS_vie
- capleaf/viVoice
metrics:
- wer
pipeline_tag: automatic-speech-recognition
tags:
- transcription
- audio
- speech
- chunkformer
- asr
- automatic-speech-recognition
license: cc-by-nc-4.0
model-index:
- name: ChunkFormer Large Vietnamese
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-vietnamese
      type: common_voice
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 6.66
    source:
      name: Common Voice Vi Leaderboard
      url: https://paperswithcode.com/sota/speech-recognition-on-common-voice-vi
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VIVOS
      type: vivos
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 4.18
    source:
      name: Vivos Leaderboard
      url: https://paperswithcode.com/sota/speech-recognition-on-vivos
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VLSP - Task 1
      type: vlsp
      args: vi
    metrics:
    - name: Test WER
      type: wer
      value: 14.09
---

# **ChunkFormer-CTC-Large-Vie: Large-Scale Pretrained ChunkFormer for Vietnamese Automatic Speech Recognition**
<style>
img {
 display: inline;
}
</style>

[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/) 
[![GitHub](https://img.shields.io/badge/GitHub-ChunkFormer-blue)](https://github.com/khanld/chunkformer)
[![Paper](https://img.shields.io/badge/Paper-ICASSP%202025-green)](https://arxiv.org/abs/2502.14673)
[![Model size](https://img.shields.io/badge/Params-110M-lightgrey#model-badge)](#description)

---
## Table of contents
1. [Model Description](#description)
2. [Documentation and Implementation](#implementation)
3. [Benchmark Results](#benchmark)
4. [Usage](#usage)
5. [Citation](#citation)
6. [Contact](#contact)

---
<a name = "description" ></a>
## Model Description
**ChunkFormer-CTC-Large-Vie** is a large-scale Vietnamese Automatic Speech Recognition (ASR) model based on the **ChunkFormer** architecture, introduced at **ICASSP 2025**. The model was fine-tuned on approximately **3,000 hours** of public Vietnamese speech drawn from diverse datasets; the full list is available [**HERE**](dataset.tsv).
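
To make the chunking idea concrete, below is a minimal, illustrative sketch of a chunked attention mask in which each frame attends to its own chunk plus a fixed number of left- and right-context frames. This is a conceptual illustration only, not the ChunkFormer implementation; in particular, whether context is measured in frames or in chunks is an assumption here (see the paper and repository for the real masking logic).

```python
import torch

def chunk_attention_mask(seq_len: int, chunk_size: int,
                         left_context: int, right_context: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask; True where attention is allowed.

    Each query frame may attend to all frames of its own chunk plus
    `left_context` frames before the chunk and `right_context` after it.
    """
    idx = torch.arange(seq_len)
    chunk_start = (idx // chunk_size) * chunk_size       # first frame of each query's chunk
    chunk_end = chunk_start + chunk_size                 # one past its last frame
    lo = (chunk_start - left_context).clamp(min=0)       # visible window start
    hi = (chunk_end + right_context).clamp(max=seq_len)  # visible window end
    keys = idx.unsqueeze(0)                              # (1, seq_len) key positions
    return (keys >= lo.unsqueeze(1)) & (keys < hi.unsqueeze(1))

# Tiny example: 12 frames, chunks of 4, 2 frames of context on each side
print(chunk_attention_mask(12, 4, 2, 2).int())
```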

---
<a name = "implementation" ></a>
## Documentation and Implementation
The [Documentation](https://arxiv.org/abs/2502.14673) and [Implementation](https://github.com/khanld/chunkformer) of ChunkFormer are publicly available.

---
<a name = "benchmark" ></a>
## Benchmark Results
We evaluate all models using **Word Error Rate (WER)**. To ensure a consistent and fair comparison, we manually apply the same **text normalization** to every system's output, covering numbers, casing, and punctuation.
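
For reference, here is a minimal sketch of this kind of normalized-WER evaluation using the `jiwer` package. This is not the exact pipeline behind the numbers below, and the language-specific number-to-words step is only noted in a comment.

```python
import re
import jiwer  # pip install jiwer

def normalize(text: str) -> str:
    """Toy normalization: lowercase, drop punctuation, collapse whitespace.

    A real Vietnamese pipeline would also spell out digits
    (e.g. "25" -> "hai mươi lăm"), which is omitted here.
    """
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)       # strip punctuation
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

reference = "Hôm nay, trời đẹp."
hypothesis = "hôm nay trời đẹp"
print(jiwer.wer(normalize(reference), normalize(hypothesis)))  # -> 0.0
```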

1. **Public Models**:
| No. | Model | #Params | Vivos | Common Voice | VLSP - Task 1 | Avg. |
|-----|-------|---------|-------|--------------|---------------|------|
| 1 | **ChunkFormer** | 110M | 4.18 | 6.66 | 14.09 | **8.31** |
| 2 | [vinai/PhoWhisper-large](https://huggingface.co/vinai/PhoWhisper-large) | 1.55B | 4.67 | 8.14 | 13.75 | 8.85 |
| 3 | [nguyenvulebinh/wav2vec2-base-vietnamese-250h](https://huggingface.co/nguyenvulebinh/wav2vec2-base-vietnamese-250h) | 95M | 10.77 | 18.34 | 13.33 | 14.15 |
| 4 | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1.55B | 8.81 | 15.45 | 20.41 | 14.89 |
| 5 | [khanhld/wav2vec2-base-vietnamese-160h](https://huggingface.co/khanhld/wav2vec2-base-vietnamese-160h) | 95M | 15.05 | 10.78 | 31.62 | 19.16 |
| 6 | [homebrewltd/Ichigo-whisper-v0.1](https://huggingface.co/homebrewltd/Ichigo-whisper-v0.1) | 22M | 13.46 | 23.52 | 21.64 | 19.54 |

2. **Private Models (API)**:
| No. | Model | VLSP - Task 1 |
|-----|-------|---------------|
| 1 | **ChunkFormer** | **14.1** |
| 2 | Viettel | 14.5 |
| 3 | Google | 19.5 |
| 4 | FPT | 28.8 |

---
<a name = "usage" ></a>
## Quick Usage
To use the ChunkFormer model for Vietnamese Automatic Speech Recognition, follow these steps:

### Option 1: Install from PyPI (Recommended)
```bash
pip install chunkformer
```

### Option 2: Install from source
```bash
git clone https://github.com/khanld/chunkformer.git
cd chunkformer
pip install -e .
```

### Python API Usage
```python
from chunkformer import ChunkFormerModel

# Load the Vietnamese model from Hugging Face
model = ChunkFormerModel.from_pretrained("khanhld/chunkformer-ctc-large-vie")

# For single long-form audio transcription
transcription = model.endless_decode(
    audio_path="path/to/long_audio.wav",
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=14400,  # in seconds
    return_timestamps=True
)
print(transcription)

# For batch processing of multiple audio files
audio_files = ["audio1.wav", "audio2.wav", "audio3.wav"]
transcriptions = model.batch_decode(
    audio_paths=audio_files,
    chunk_size=64,
    left_context_size=128,
    right_context_size=128,
    total_batch_duration=1800  # Total batch duration in seconds
)

for i, transcription in enumerate(transcriptions):
    print(f"Audio {i+1}: {transcription}")
```

### Command Line Usage
After installation, you can use the command line interface:

```bash
chunkformer-decode \
    --model_checkpoint khanhld/chunkformer-ctc-large-vie \
    --long_form_audio path/to/audio.wav \
    --total_batch_duration 14400 \
    --chunk_size 64 \
    --left_context_size 128 \
    --right_context_size 128
```

Example Output:
```
[00:00:01.200] - [00:00:02.400]: this is a transcription example
[00:00:02.500] - [00:00:03.700]: testing the long-form audio
```
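
If you need these segments programmatically, here is a minimal parsing sketch that assumes the CLI prints exactly the `[start] - [end]: text` layout shown above (the format may differ across versions):

```python
import re

# Matches "[HH:MM:SS.mmm] - [HH:MM:SS.mmm]: text" as in the sample above
SEGMENT = re.compile(r"\[(\d{2}:\d{2}:\d{2}\.\d{3})\] - \[(\d{2}:\d{2}:\d{2}\.\d{3})\]: (.*)")

def parse_segments(cli_output: str):
    """Yield (start, end, text) tuples from timestamped CLI output."""
    for line in cli_output.splitlines():
        match = SEGMENT.match(line.strip())
        if match:
            yield match.group(1), match.group(2), match.group(3)

sample = "[00:00:01.200] - [00:00:02.400]: this is a transcription example"
for start, end, text in parse_segments(sample):
    print(start, end, text)
```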

**Advanced Usage** can be found [HERE](https://github.com/khanld/chunkformer/tree/main?tab=readme-ov-file#usage).

---
<a name = "citation" ></a>
## Citation
If you use this work in your research, please cite:

```bibtex
@INPROCEEDINGS{10888640,
  author={Le, Khanh and Ho, Tuan Vu and Tran, Dung and Chau, Duc Thanh},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={ChunkFormer: Masked Chunking Conformer For Long-Form Speech Transcription}, 
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Scalability;Memory management;Graphics processing units;Signal processing;Performance gain;Hardware;Resource management;Speech processing;Standards;Context modeling;chunkformer;masked batch;long-form transcription},
  doi={10.1109/ICASSP49660.2025.10888640}
}
```

---
<a name = "contact"></a>
## Contact
- [email protected]
- [![GitHub](https://img.shields.io/badge/github-%23121011.svg?style=for-the-badge&logo=github&logoColor=white)](https://github.com/khanld)
- [![LinkedIn](https://img.shields.io/badge/linkedin-%230077B5.svg?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/khanhld257/)