---
license: apache-2.0
language:
- en
- vi
pipeline_tag: image-to-text
model-index:
- name: HTR-ConvText
  results:
  - task:
      type: image-to-text
      name: Handwritten Text Recognition
    dataset:
      name: IAM
      type: iam
      split: test
    metrics:
    - type: cer
      value: 4.0
      name: Test CER
    - type: wer
      value: 12.9
      name: Test WER
  - task:
      type: image-to-text
      name: Handwritten Text Recognition
    dataset:
      name: LAM
      type: lam
      split: test
    metrics:
    - type: cer
      value: 2.7
      name: Test CER
    - type: wer
      value: 7.0
      name: Test WER
  - task:
      type: image-to-text
      name: Handwritten Text Recognition
    dataset:
      name: READ2016
      type: read2016
      split: test
    metrics:
    - type: cer
      value: 3.6
      name: Test CER
    - type: wer
      value: 15.7
      name: Test WER
  - task:
      type: image-to-text
      name: Handwritten Text Recognition
    dataset:
      name: HANDS-VNOnDB
      type: hands-vnondb
      split: test
    metrics:
    - type: cer
      value: 3.45
      name: Test CER
    - type: wer
      value: 8.9
      name: Test WER
---

# HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition
*Figure: HTR-ConvText architecture.*


## Highlights

HTR-ConvText is a novel hybrid architecture for Handwritten Text Recognition (HTR) that balances local feature extraction with global contextual modeling. Designed to overcome the limitations of standard CTC-based decoding and data-hungry Transformers, HTR-ConvText delivers state-of-the-art performance with the following key features:

- **Hybrid CNN-ViT Architecture**: Seamlessly integrates a ResNet backbone with MobileViT blocks (MVP) and Conditional Positional Encoding, enabling the model to capture fine-grained stroke details while maintaining global spatial awareness.
- **Hierarchical ConvText Encoder**: A U-Net-like encoder structure that interleaves Multi-Head Self-Attention with Depthwise Convolutions. This design efficiently models both long-range dependencies and local structural patterns.
- **Textual Context Module (TCM)**: A training-only auxiliary module that injects bidirectional linguistic priors into the visual encoder. This mitigates the conditional-independence weakness of CTC decoding without adding any latency during inference.
- **State-of-the-Art Performance**: Outperforms existing methods on major benchmarks including IAM (English), READ2016 (German), LAM (Italian), and HANDS-VNOnDB (Vietnamese), excelling in particular on low-resource scenarios and complex diacritics.

## Model Overview

HTR-ConvText configurations and specifications:

| Feature             | Specification                                        |
| ------------------- | ---------------------------------------------------- |
| Architecture Type   | Hybrid CNN + Vision Transformer (Encoder-Only)       |
| Parameters          | ~65.9M                                                |
| Backbone            | ResNet-18 + MobileViT w/ Positional Encoding (MVP)    |
| Encoder Layers      | 8 ConvText Blocks (Hierarchical)                      |
| Attention Heads     | 8                                                     |
| Embedding Dimension | 512                                                   |
| Image Input Size    | 512×64                                                |
| Inference Strategy  | Standard CTC Decoding (TCM is removed at inference)   |

For more details, including ablation studies and theoretical proofs, please refer to our [Technical Report](https://arxiv.org/pdf/2512.05021).

## Performance

We evaluated HTR-ConvText across four diverse datasets. The model achieves new SOTA results with the lowest Character Error Rate (CER) and Word Error Rate (WER) without requiring massive synthetic pre-training.

| Dataset   | Language   | Ours CER (%) | HTR-VT | OrigamiNet | TrOCR | CRNN  |
|-----------|------------|--------------|--------|------------|-------|-------|
| IAM       | English    | 4.0          | 4.7    | 4.8        | 7.3   | 7.8   |
| LAM       | Italian    | 2.7          | 2.8    | 3.0        | 3.6   | 3.8   |
| READ2016  | German     | 3.6          | 3.9    | -          | -     | 4.7   |
| VNOnDB    | Vietnamese | 3.45         | 4.26   | 7.6        | -     | 10.53 |

## Quickstart

### Installation

1. **Clone the repository**

   ```cmd
   git clone https://github.com/0xk0ry/HTR-ConvText.git
   cd HTR-ConvText
   ```

2. **Create and activate a Python 3.9+ Conda environment**

   ```cmd
   conda create -n htr-convtext python=3.9 -y
   conda activate htr-convtext
   ```

3. **Install PyTorch** using the wheel that matches your CUDA driver (swap the index for CPU-only builds):

   ```cmd
   pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
   ```

4. **Install the remaining project requirements** (everything except PyTorch, which you already installed in step 3):

   ```cmd
   pip install -r requirements.txt
   ```

The code was tested on Python 3.9 and PyTorch 2.9.1.

### Data Preparation

We provide split files (`train.ln`, `val.ln`, `test.ln`) for IAM, READ2016, LAM, and VNOnDB under `data/`. Organize your data as follows:

```
./data/iam/
├── train.ln
├── val.ln
├── test.ln
└── lines
    ├── a01-000u-00.png
    ├── a01-000u-00.txt
    └── ...
```
### Training

We provide comprehensive scripts in the `./run/` directory. To train on the IAM dataset with the Textual Context Module (TCM) enabled:

```
# Using the provided script
bash run/iam.sh

# OR running directly via Python
python train.py \
    --use-wandb \
    --dataset iam \
    --tcm-enable \
    --exp-name "htr-convtext-iam" \
    --img-size 512 64 \
    --train-bs 32 \
    --val-bs 8 \
    --data-path /path/to/iam/lines/ \
    --train-data-list data/iam/train.ln \
    --val-data-list data/iam/val.ln \
    --test-data-list data/iam/test.ln \
    --nb-cls 80
```

### Inference / Evaluation

To evaluate a pre-trained checkpoint on the test set:

```
python test.py \
    --resume ./checkpoints/best_CER.pth \
    --dataset iam \
    --img-size 512 64 \
    --data-path /path/to/iam/lines/ \
    --test-data-list data/iam/test.ln \
    --nb-cls 80
```

## Citation

If you find our work helpful, please cite our paper:

```
@misc{truc2025htrconvtex,
      title={HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition},
      author={Pham Thach Thanh Truc and Dang Hoai Nam and Huynh Tong Dang Khoa and Vo Nguyen Le Duy},
      year={2025},
      eprint={2512.05021},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.05021},
}
```

## Acknowledgement

This project is inspired by and adapted from [HTR-VT](https://github.com/Intellindust-AI-Lab/HTR-VT). We gratefully acknowledge the authors for their open-source contributions.