---
license: apache-2.0
language:
- en
- vi
pipeline_tag: image-to-text
model-index:
- name: HTR-ConvText
results:
- task:
type: image-to-text
name: Handwritten Text Recognition
dataset:
name: IAM
type: iam
split: test
metrics:
- type: cer
value: 4.0
name: Test CER
- type: wer
value: 12.9
name: Test WER
- task:
type: image-to-text
name: Handwritten Text Recognition
dataset:
name: LAM
type: lam
split: test
metrics:
- type: cer
value: 2.7
name: Test CER
- type: wer
value: 7.0
name: Test WER
- task:
type: image-to-text
name: Handwritten Text Recognition
dataset:
name: READ2016
type: read2016
split: test
metrics:
- type: cer
value: 3.6
name: Test CER
- type: wer
value: 15.7
name: Test WER
- task:
type: image-to-text
name: Handwritten Text Recognition
dataset:
name: HANDS-VNOnDB
type: hands-vnondb
split: test
metrics:
- type: cer
value: 3.45
name: Test CER
- type: wer
value: 8.9
name: Test WER
---
# HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition
<div align="center"> <img src="image/architecture.png" alt="HTR-ConvText Architecture" width="800"/> </div>
<p align="center">
<a href="https://huggingface.co/DAIR-Group/HTR-ConvText">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue">
</a>
<a href="https://github.com/DAIR-Group/HTR-ConvText">
<img alt="GitHub" src="https://img.shields.io/badge/GitHub-Repo-181717.svg?logo=github&logoColor=white">
</a>
<a href="https://github.com/DAIR-Group/HTR-ConvText/blob/main/LICENSE">
<img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-green">
</a>
<a href="https://arxiv.org/abs/2512.05021">
<img alt="arXiv" src="https://img.shields.io/badge/arXiv-2512.05021-b31b1b.svg">
</a>
</p>
## Highlights
HTR-ConvText is a novel hybrid architecture for Handwritten Text Recognition (HTR) that effectively balances local feature extraction with global contextual modeling. Designed to overcome the limitations of standard CTC-based decoding and data-hungry Transformers, HTR-ConvText delivers state-of-the-art performance with the following key features:
- **Hybrid CNN-ViT Architecture**: Seamlessly integrates a ResNet backbone with MobileViT blocks (MVP) and Conditional Positional Encoding, enabling the model to capture fine-grained stroke details while maintaining global spatial awareness.
- **Hierarchical ConvText Encoder**: A U-Net-like encoder structure that interleaves Multi-Head Self-Attention with Depthwise Convolutions. This design efficiently models both long-range dependencies and local structural patterns.
- **Textual Context Module (TCM)**: An innovative training-only auxiliary module that injects bidirectional linguistic priors into the visual encoder. This mitigates the conditional independence weakness of CTC decoding without adding any latency during inference.
- **State-of-the-Art Performance**: Outperforms existing methods on major benchmarks including IAM (English), READ2016 (German), LAM (Italian), and HANDS-VNOnDB (Vietnamese), specifically excelling in low-resource scenarios and complex diacritics.
## Model Overview
HTR-ConvText configurations and specifications:
| Feature | Specification |
| ------------------- | --------------------------------------------------- |
| Architecture Type | Hybrid CNN + Vision Transformer (Encoder-Only) |
| Parameters | ~65.9M |
| Backbone | ResNet-18 + MobileViT w/ Positional Encoding (MVP) |
| Encoder Layers | 8 ConvText Blocks (Hierarchical) |
| Attention Heads | 8 |
| Embedding Dimension | 512 |
| Image Input Size   | 512×64                                               |
| Inference Strategy | Standard CTC Decoding (TCM is removed at inference) |
For more details, including ablation studies and theoretical proofs, please refer to our [Technical Report](https://arxiv.org/pdf/2512.05021).
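Because inference relies on standard CTC decoding (the TCM is dropped after training), post-processing reduces to taking the best class per time step, collapsing repeats, and removing the blank token. The snippet below is a minimal illustrative sketch, not the repository's decoder; the character set and the blank index (assumed to be 0) are placeholders:
```python
import torch

def ctc_greedy_decode(logits: torch.Tensor, charset: str, blank: int = 0) -> str:
    """Greedy CTC decoding: argmax per time step, collapse repeats, drop blanks.

    logits:  (T, C) tensor of per-time-step class scores.
    charset: characters mapped to indices 1..C-1 (index `blank` is the CTC blank);
             this mapping is an assumption for illustration, the repo defines its own vocabulary.
    """
    best_path = logits.argmax(dim=-1).tolist()      # most likely class per time step
    decoded, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank:            # collapse repeats, skip blanks
            decoded.append(charset[idx - 1])
        prev = idx
    return "".join(decoded)

# Toy example with a 3-character vocabulary:
# print(ctc_greedy_decode(torch.randn(100, 4), charset="abc"))
```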
## Performance
We evaluated HTR-ConvText across four diverse datasets. The model achieves new state-of-the-art results with the lowest Character Error Rate (CER) and Word Error Rate (WER), without requiring massive synthetic pre-training. The table below reports test-set CER (%) for each method (lower is better).
| Dataset   | Language    | Ours | HTR-VT | OrigamiNet | TrOCR | CRNN  |
|-----------|-------------|------|--------|------------|-------|-------|
| IAM | English | 4.0 | 4.7 | 4.8 | 7.3 | 7.8 |
| LAM | Italian | 2.7 | 2.8 | 3.0 | 3.6 | 3.8 |
| READ2016 | German | 3.6 | 3.9 | - | - | 4.7 |
| VNOnDB | Vietnamese | 3.45 | 4.26 | 7.6 | - | 10.53 |
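For reference, CER and WER are edit-distance-based metrics normalized by the length of the ground-truth reference. A minimal sketch of the standard computation follows (the repository's evaluation script may differ in details such as case handling or punctuation):
```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

# print(cer("handwritten text", "handwriten text"))  # -> 0.0625
```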
## Quickstart
### Installation
1. **Clone the repository**
```cmd
git clone https://github.com/0xk0ry/HTR-ConvText.git
cd HTR-ConvText
```
2. **Create and activate a Python 3.9+ Conda environment**
```cmd
conda create -n htr-convtext python=3.9 -y
conda activate htr-convtext
```
3. **Install PyTorch** using the wheel that matches your CUDA driver (swap the index for CPU-only builds):
```cmd
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
```
4. **Install the remaining project requirements** (everything except PyTorch, which you already installed in step 3).
```cmd
pip install -r requirements.txt
```
The code was tested on Python 3.9 and PyTorch 2.9.1.
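After installation, you can optionally confirm that PyTorch was installed correctly and can see your GPU with a quick check:
```python
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```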
### Data Preparation
We provide split files (train.ln, val.ln, test.ln) for IAM, READ2016, LAM, and VNOnDB under data/. Organize your data as follows:
```
./data/iam/
├── train.ln
├── val.ln
├── test.ln
└── lines
    ├── a01-000u-00.png
    ├── a01-000u-00.txt
    └── ...
```
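If useful, you can sanity-check the layout before training with a small script like the one below. It is illustrative only and assumes each line of an `.ln` split file is an image filename relative to the `lines/` directory; verify that assumption against the provided split files.
```python
from pathlib import Path

def check_split(split_file: str, lines_dir: str) -> None:
    """Hypothetical helper: verify every image in a split has a matching .txt transcript."""
    root = Path(lines_dir)
    missing = []
    for name in Path(split_file).read_text().splitlines():
        name = name.strip()
        if not name:
            continue
        img = root / name
        txt = img.with_suffix(".txt")
        if not img.exists() or not txt.exists():
            missing.append(name)
    print(f"{split_file}: {len(missing)} missing entries")

# check_split("data/iam/train.ln", "./data/iam/lines")
```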
### Training
We provide comprehensive scripts in the ./run/ directory. To train on the IAM dataset with the Textual Context Module (TCM) enabled:
```bash
# Using the provided script
bash run/iam.sh
# OR running directly via Python
python train.py \
--use-wandb \
--dataset iam \
--tcm-enable \
--exp-name "htr-convtext-iam" \
--img-size 512 64 \
--train-bs 32 \
--val-bs 8 \
--data-path /path/to/iam/lines/ \
--train-data-list data/iam/train.ln \
--val-data-list data/iam/val.ln \
--test-data-list data/iam/test.ln \
--nb-cls 80
```
### Inference / Evaluation
To evaluate a pre-trained checkpoint on the test set:
```bash
python test.py \
--resume ./checkpoints/best_CER.pth \
--dataset iam \
--img-size 512 64 \
--data-path /path/to/iam/lines/ \
--test-data-list data/iam/test.ln \
--nb-cls 80
```
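Pre-trained weights can be pulled from this Hugging Face repository with `huggingface_hub`. The sketch below is illustrative; the checkpoint filename (`best_CER.pth`) is an assumption, so check the repository's file listing for the actual name:
```python
from huggingface_hub import hf_hub_download

# Filename is an assumption for illustration; list the repo files to confirm.
ckpt_path = hf_hub_download(repo_id="DAIR-Group/HTR-ConvText", filename="best_CER.pth")
print("checkpoint downloaded to:", ckpt_path)
# Then pass it to test.py via --resume <ckpt_path>
```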
## Citation
If you find our work helpful, please cite our paper:
```bibtex
@misc{truc2025htrconvtex,
title={HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition},
author={Pham Thach Thanh Truc and Dang Hoai Nam and Huynh Tong Dang Khoa and Vo Nguyen Le Duy},
year={2025},
eprint={2512.05021},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.05021},
}
```
## Acknowledgement
This project is inspired by and adapted from [HTR-VT](https://github.com/Intellindust-AI-Lab/HTR-VT). We gratefully acknowledge the authors for their open-source contributions.