---

license: apache-2.0
language:
- en
- vi
pipeline_tag: image-to-text
model-index:
- name: HTR-ConvText
  results:
  - task:
      type: image-to-text
      name: Handwritten Text Recognition
    dataset:
      name: IAM
      type: iam
      split: test
    metrics:
    - type: cer
      value: 4.0
      name: Test CER
    - type: wer
      value: 12.9
      name: Test WER
  - task:
      type: image-to-text
      name: Handwritten Text Recognition
    dataset:
      name: LAM
      type: lam
      split: test
    metrics:
    - type: cer
      value: 2.7
      name: Test CER
    - type: wer
      value: 7.0
      name: Test WER
  - task:
      type: image-to-text
      name: Handwritten Text Recognition
    dataset:
      name: READ2016
      type: read2016
      split: test
    metrics:
    - type: cer
      value: 3.6
      name: Test CER
    - type: wer
      value: 15.7
      name: Test WER
  - task:
      type: image-to-text
      name: Handwritten Text Recognition
    dataset:
      name: HANDS-VNOnDB
      type: hands-vnondb
      split: test
    metrics:
    - type: cer
      value: 3.45
      name: Test CER
    - type: wer
      value: 8.9
      name: Test WER
---

# HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition

<div align="center"> <img src="image/architecture.png" alt="HTR-ConvText Architecture" width="800"/> </div>

<p align="center">
  <a href="https://huggingface.co/DAIR-Group/HTR-ConvText">
    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue">
  </a>
  <a href="https://github.com/DAIR-Group/HTR-ConvText">
    <img alt="GitHub" src="https://img.shields.io/badge/GitHub-Repo-181717.svg?logo=github&logoColor=white">
  </a>
  <a href="https://github.com/DAIR-Group/HTR-ConvText/blob/main/LICENSE">
    <img alt="License" src="https://img.shields.io/badge/License-Apache%202.0-green">
  </a>
  <a href="https://arxiv.org/abs/2512.05021">
    <img alt="arXiv" src="https://img.shields.io/badge/arXiv-2512.05021-b31b1b.svg">
  </a>
</p>


## Highlights

HTR-ConvText is a novel hybrid architecture for Handwritten Text Recognition (HTR) that effectively balances local feature extraction with global contextual modeling. Designed to overcome the limitations of standard CTC-based decoding and data-hungry Transformers, HTR-ConvText delivers state-of-the-art performance with the following key features:

- **Hybrid CNN-ViT Architecture**: Seamlessly integrates a ResNet backbone with MobileViT blocks (MVP) and Conditional Positional Encoding, enabling the model to capture fine-grained stroke details while maintaining global spatial awareness.
- **Hierarchical ConvText Encoder**: A U-Net-like encoder structure that interleaves Multi-Head Self-Attention with Depthwise Convolutions, efficiently modeling both long-range dependencies and local structural patterns (see the sketch after this list).
- **Textual Context Module (TCM)**: An innovative training-only auxiliary module that injects bidirectional linguistic priors into the visual encoder. This mitigates the conditional independence weakness of CTC decoding without adding any latency during inference.
- **State-of-the-Art Performance**: Outperforms existing methods on major benchmarks including IAM (English), READ2016 (German), LAM (Italian), and HANDS-VNOnDB (Vietnamese), specifically excelling in low-resource scenarios and complex diacritics.
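
To make the encoder design concrete, here is a minimal PyTorch sketch of a block that interleaves multi-head self-attention with a depthwise convolution, in the spirit of the ConvText block. The class name, layer ordering, normalization, and dimensions are illustrative assumptions, not the repository's actual implementation:

```python
# Illustrative sketch only -- NOT the repository's ConvText block.
# Assumed structure: pre-norm self-attention (global context) followed by
# a depthwise conv (local structure) and a feed-forward layer, all with
# residual connections, on a (batch, seq_len, dim) token sequence.
import torch
import torch.nn as nn

class ConvTextBlockSketch(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, kernel_size: int = 3):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Depthwise conv mixes neighboring positions per channel.
        self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)        # long-range dependencies
        x = x + attn_out
        h = self.norm2(x).transpose(1, 2)       # (B, dim, seq) for Conv1d
        x = x + self.dwconv(h).transpose(1, 2)  # local stroke-level patterns
        return x + self.ffn(x)

# Example: 128 visual tokens with the model's 512-dim embedding.
tokens = torch.randn(2, 128, 512)
print(ConvTextBlockSketch()(tokens).shape)  # torch.Size([2, 128, 512])
```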

## Model Overview

HTR-ConvText configurations and specifications:

| Feature             | Specification                                       |
| ------------------- | --------------------------------------------------- |
| Architecture Type   | Hybrid CNN + Vision Transformer (Encoder-Only)      |
| Parameters          | ~65.9M                                              |
| Backbone            | ResNet-18 + MobileViT w/ Positional Encoding (MVP)  |
| Encoder Layers      | 8 ConvText Blocks (Hierarchical)                    |
| Attention Heads     | 8                                                   |
| Embedding Dimension | 512                                                 |
| Image Input Size    | 512×64                                              |
| Inference Strategy  | Standard CTC Decoding (TCM is removed at inference) |
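
"Standard CTC decoding" is commonly implemented as greedy best-path decoding: take the argmax class per time step, collapse consecutive repeats, then drop blanks. A minimal, dependency-free illustration (the charset and blank index below are placeholders, not the model's actual vocabulary):

```python
# Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.
# The charset and blank index are placeholders for illustration.
def ctc_greedy_decode(frames, charset, blank: int = 0) -> str:
    best = [max(range(len(f)), key=f.__getitem__) for f in frames]
    out, prev = [], blank
    for idx in best:
        if idx != blank and idx != prev:
            out.append(charset[idx])
        prev = idx
    return "".join(out)

charset = ["<blank>", "c", "a", "t"]
frames = [[0.1, 0.8, 0.05, 0.05],  # "c"
          [0.1, 0.8, 0.05, 0.05],  # "c" again -> collapsed
          [0.9, 0.0, 0.05, 0.05],  # blank
          [0.1, 0.0, 0.85, 0.05],  # "a"
          [0.1, 0.0, 0.05, 0.85]]  # "t"
print(ctc_greedy_decode(frames, charset))  # -> "cat"
```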

For more details, including ablation studies and theoretical proofs, please refer to our [Technical Report](https://arxiv.org/pdf/2512.05021).

## Performance

We evaluated HTR-ConvText on four diverse datasets. The model achieves state-of-the-art results, obtaining the lowest Character Error Rate (CER) and Word Error Rate (WER) without requiring massive synthetic pre-training. The table below reports test CER (%); lower is better.

| Dataset   | Language    | Ours | HTR-VT | OrigamiNet | TrOCR | CRNN  |
|-----------|-------------|------|--------|------------|-------|-------|
| IAM       | English     | 4.0  | 4.7    | 4.8        | 7.3   | 7.8   |
| LAM       | Italian     | 2.7  | 2.8    | 3.0        | 3.6   | 3.8   |
| READ2016  | German      | 3.6  | 3.9    | -          | -     | 4.7   |
| VNOnDB    | Vietnamese  | 3.45 | 4.26   | 7.6        | -     | 10.53 |
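
Both metrics are normalized edit distances: CER counts character-level edits, WER counts edits over whitespace-separated words, each divided by the reference length. A self-contained sketch of the computation:

```python
# CER/WER as normalized Levenshtein distance over characters / words.
def levenshtein(ref, hyp) -> int:
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    return levenshtein(ref, hyp) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    return levenshtein(ref.split(), hyp.split()) / max(len(ref.split()), 1)

print(f"{cer('handwritten', 'handwriten'):.3f}")      # 1 edit / 11 chars = 0.091
print(f"{wer('the quick fox', 'the quick fix'):.3f}")  # 1 / 3 words = 0.333
```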

## Quickstart

### Installation

1. **Clone the repository**
   ```cmd
   git clone https://github.com/0xk0ry/HTR-ConvText.git
   cd HTR-ConvText
   ```
2. **Create and activate a Python 3.9+ Conda environment**
   ```cmd
   conda create -n htr-convtext python=3.9 -y
   conda activate htr-convtext
   ```
3. **Install PyTorch** using the wheel that matches your CUDA driver (swap the index for CPU-only builds):
   ```cmd
   pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
   ```
4. **Install the remaining project requirements** (everything except PyTorch, which you already installed in step 3):
   ```cmd
   pip install -r requirements.txt
   ```

The code was tested on Python 3.9 and PyTorch 2.9.1.
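
To verify the environment, print the installed PyTorch version and CUDA availability:

```cmd
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```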

### Data Preparation

We provide split files (`train.ln`, `val.ln`, `test.ln`) for IAM, READ2016, LAM, and VNOnDB under `data/`. Organize your data as follows:

```
./data/iam/
├── train.ln
├── val.ln
├── test.ln
└── lines
    ├── a01-000u-00.png
    ├── a01-000u-00.txt
    └── ...
```
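
A quick sanity check is to confirm every sample listed in a split file has both its image and transcript on disk. This sketch assumes each `.ln` file lists one sample name per line; verify the actual format against the repository's dataset loader:

```python
# Hypothetical layout check: confirm each listed sample has a matching
# .png image and .txt transcript. Assumes one sample name per line in
# the .ln split file.
from pathlib import Path

data_root = Path("./data/iam")
lines_dir = data_root / "lines"

missing = []
for name in (data_root / "train.ln").read_text().split():
    stem = Path(name).stem
    if not (lines_dir / f"{stem}.png").exists() or not (lines_dir / f"{stem}.txt").exists():
        missing.append(stem)

print(f"{len(missing)} samples missing an image or transcript")
```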

### Training

We provide comprehensive scripts in the `./run/` directory. To train on the IAM dataset with the Textual Context Module (TCM) enabled:

```bash
# Using the provided script
bash run/iam.sh

# OR running directly via Python
python train.py \
    --use-wandb \
    --dataset iam \
    --tcm-enable \
    --exp-name "htr-convtext-iam" \
    --img-size 512 64 \
    --train-bs 32 \
    --val-bs 8 \
    --data-path /path/to/iam/lines/ \
    --train-data-list data/iam/train.ln \
    --val-data-list data/iam/val.ln \
    --test-data-list data/iam/test.ln \
    --nb-cls 80
```
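
Note on `--nb-cls`: we take this to be the size of the output class set (80 for IAM above). If you adapt the model to a new dataset, a plausible way to estimate it is the number of distinct characters in the training transcripts plus one for the CTC blank; this interpretation is an assumption, so cross-check it against the repository's dataset code:

```python
# Hypothetical helper: estimate --nb-cls as |charset| + 1 (CTC blank).
# Assumes transcripts are the .txt files alongside the line images.
from pathlib import Path

charset = set()
for txt in Path("./data/iam/lines").glob("*.txt"):
    charset.update(txt.read_text(encoding="utf-8").strip())

print(f"{len(charset)} distinct characters -> try --nb-cls {len(charset) + 1}")
```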

### Inference / Evaluation

To evaluate a pre-trained checkpoint on the test set:

```bash
python test.py \
    --resume ./checkpoints/best_CER.pth \
    --dataset iam \
    --img-size 512 64 \
    --data-path /path/to/iam/lines/ \
    --test-data-list data/iam/test.ln \
    --nb-cls 80
```

## Citation

If you find our work helpful, please cite our paper:

```bibtex
@misc{truc2025htrconvtex,
      title={HTR-ConvText: Leveraging Convolution and Textual Information for Handwritten Text Recognition},
      author={Pham Thach Thanh Truc and Dang Hoai Nam and Huynh Tong Dang Khoa and Vo Nguyen Le Duy},
      year={2025},
      eprint={2512.05021},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.05021},
}
```

## Acknowledgement

This project is inspired by and adapted from [HTR-VT](https://github.com/Intellindust-AI-Lab/HTR-VT). We gratefully acknowledge the authors for their open-source contributions.