iitb-en-indic-only-punct

This model is a fine-tuned version of ai4bharat/indictrans2-en-indic-dist-200M for English-to-Marathi translation, specifically optimized for punctuation robustness.

It was introduced in the paper Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation.

Model description

Traditional Machine Translation (MT) systems often struggle with punctuation-ambiguous text (e.g., "Let's eat Grandma" vs "Let's eat, Grandma"). This model addresses this issue by being fine-tuned on punctuation-varied data derived from the IITB-ENG-MAR dataset.

It corresponds to Approach 2 (Direct Fine-tuning) described in the research, where the base MT model is trained to implicitly learn context and resolve semantic and structural ambiguities caused by missing or inconsistent punctuation in the source English text.
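The exact augmentation recipe is described in the paper; the snippet below is only a minimal, illustrative Python sketch of how punctuation-varied English sources could be derived from a parallel corpus. The helper names and the random-drop probability are assumptions for illustration, not the paper's procedure.

```python
import random
import string

PUNCT = set(string.punctuation)

def strip_punctuation(text: str) -> str:
    """Remove every punctuation character from an English source sentence."""
    return "".join(ch for ch in text if ch not in PUNCT)

def drop_punctuation(text: str, p: float = 0.5, seed: int = 0) -> str:
    """Drop each punctuation character independently with probability p."""
    rng = random.Random(seed)
    return "".join(ch for ch in text if ch not in PUNCT or rng.random() > p)

def punctuation_variants(src: str, tgt: str):
    """Pair the unchanged Marathi reference with punctuation-varied English sources."""
    yield src, tgt
    yield strip_punctuation(src), tgt
    yield drop_punctuation(src), tgt

# Example with the ambiguity from the model description (placeholder target).
for en, mr in punctuation_variants("Let's eat, Grandma.", "<Marathi reference>"):
    print(en, "->", mr)
```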

Intended uses & limitations

This model is intended for translating English sentences into Marathi, particularly when the source text might have missing punctuation that changes the intended meaning.
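A minimal inference sketch is shown below. It assumes the standard IndicTrans2 usage pattern: loading the model with trust_remote_code=True and using the IndicTransToolkit's IndicProcessor for language tagging and pre/post-processing. The import path and generation settings may differ slightly depending on the toolkit version installed.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor  # pip install IndicTransToolkit

model_name = "thenlpresearcher/iitb-en-indic-only-punct"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True).to(device)

ip = IndicProcessor(inference=True)

# Punctuation-ambiguous source: no comma before "Grandma".
sentences = ["Lets eat Grandma"]

# Add eng_Latn -> mar_Deva language tags and normalize (IndicTrans2 convention).
batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="mar_Deva")
inputs = tokenizer(batch, padding="longest", truncation=True, return_tensors="pt").to(device)

with torch.no_grad():
    generated = model.generate(**inputs, max_length=256, num_beams=5, num_return_sequences=1)

decoded = tokenizer.batch_decode(generated, skip_special_tokens=True, clean_up_tokenization_spaces=True)
translations = ip.postprocess_batch(decoded, lang="mar_Deva")
print(translations[0])
```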

Training and evaluation data

The model was fine-tuned using the IITB-ENG-MAR dataset. The performance was evaluated on the Virām benchmark, which consists of 54 manually curated, punctuation-ambiguous instances.

Training results

It achieves the following results on the evaluation set:

  • Loss: 0.3627
  • BLEU: 10.7712
  • chrF++: 33.1021
  • COMET: 0.5425
  • Gen Len: 20.8714

| Training Loss | Epoch  | Step  | Validation Loss | BLEU    | chrF++  | COMET  | BLEURT | Gen Len |
|---------------|--------|-------|-----------------|---------|---------|--------|--------|---------|
| 0.4323        | 0.5059 | 6000  | 0.4048          | 9.7554  | 31.7923 | 0.5339 | None   | 20.8746 |
| 0.3522        | 1.0119 | 12000 | 0.3882          | 10.0952 | 32.1519 | 0.5367 | None   | 20.8721 |
| 0.3608        | 1.5178 | 18000 | 0.3779          | 10.2006 | 32.4109 | 0.5373 | None   | 20.875  |
| 0.3362        | 2.0238 | 24000 | 0.3711          | 10.3061 | 32.5527 | 0.5392 | None   | 20.8721 |
| 0.3196        | 2.5297 | 30000 | 0.3700          | 10.4817 | 32.7072 | 0.5395 | None   | 20.8731 |
| 0.3029        | 3.0357 | 36000 | 0.3676          | 10.5911 | 32.8459 | 0.5397 | None   | 20.8746 |
| 0.3049        | 3.5416 | 42000 | 0.3647          | 10.5533 | 32.8685 | 0.5415 | None   | 20.8727 |
| 0.2705        | 4.0476 | 48000 | 0.3644          | 10.6712 | 32.9543 | 0.5417 | None   | 20.8692 |
| 0.2819        | 4.5535 | 54000 | 0.3622          | 10.6249 | 32.9145 | 0.5414 | None   | 20.8706 |
| 0.2567        | 5.0594 | 60000 | 0.3646          | 10.6345 | 32.9606 | 0.5414 | None   | 20.8705 |
| 0.2783        | 5.5654 | 66000 | 0.3607          | 10.6848 | 33.046  | 0.5425 | None   | 20.8697 |
| 0.2589        | 6.0713 | 72000 | 0.3633          | 10.7223 | 33.0218 | 0.542  | None   | 20.8711 |
| 0.2702        | 6.5773 | 78000 | 0.3613          | 10.7778 | 33.0402 | 0.542  | None   | 20.8717 |
| 0.256         | 7.0832 | 84000 | 0.3628          | 10.7432 | 33.0965 | 0.5425 | None   | 20.8703 |
| 0.2512        | 7.5892 | 90000 | 0.3627          | 10.7712 | 33.1021 | 0.5425 | None   | 20.8714 |

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 16
  • eval_batch_size: 16
  • seed: 42
  • optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 8
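
The training script itself is not part of this card; the following is a hedged sketch of how the hyperparameters above might be expressed as Hugging Face Seq2SeqTrainingArguments. output_dir and predict_with_generate are assumptions, not values reported on this card.

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative mapping of the listed hyperparameters onto standard
# transformers arguments; values not listed on this card are assumptions.
training_args = Seq2SeqTrainingArguments(
    output_dir="iitb-en-indic-only-punct",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    optim="adamw_torch",         # betas=(0.9, 0.999) and epsilon=1e-08 are the adamw_torch defaults
    lr_scheduler_type="linear",
    num_train_epochs=8,
    predict_with_generate=True,  # assumed: needed to compute BLEU/chrF++ during evaluation
)
```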

Framework versions

  • Transformers 4.53.2
  • Pytorch 2.4.0a0+f70bd71a48.nv24.06
  • Datasets 2.21.0
  • Tokenizers 0.21.4