m2m100_1.2b-zul-eng

This model is part of the AfriScience-MT project, focused on machine translation of scientific texts for African languages.

Model Description

  • Model Type: Seq2Seq Translation
  • Translation Direction: isiZulu → English
  • Base Model: facebook/m2m100_1.2B
  • Domain: Scientific/academic texts
  • Training: Full fine-tuning on the AfriScience-MT dataset

Evaluation Results

Performance on the AfriScience-MT test set:

Split        BLEU    chrF    SSA-COMET
Validation   31.39   53.03   61.09
Test         30.05   52.24   60.13

Metrics explanation (a short scoring example follows this list):

  • BLEU: Measures n-gram overlap with reference translations (0-100, higher is better)
  • chrF: Character-level F-score, robust for morphologically rich languages (0-100, higher is better)
  • SSA-COMET: Neural metric trained for Sub-Saharan African languages, shown as percentage (0-100, higher is better) (McGill-NLP/ssa-comet-stl)
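
The surface metrics (BLEU and chrF) can be recomputed with the sacrebleu library; the snippet below is a minimal sketch with made-up placeholder sentences, not output from the AfriScience-MT test set. Scoring with SSA-COMET additionally requires the unbabel-comet toolkit and the McGill-NLP/ssa-comet-stl checkpoint, so it is omitted here.

import sacrebleu

# Placeholder hypotheses/references for illustration only
hypotheses = ["The cell produces energy in the mitochondria."]
references = [["The mitochondria produce the cell's energy."]]  # one list per set of references

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")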

Usage

Quick Start

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "AfriScience-MT/m2m100_1.2b-zul-eng"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Set source language
tokenizer.src_lang = "zu"

# Translate (note: the model expects isiZulu source text; the English sentence below is only a placeholder)
text = "The mitochondria is the powerhouse of the cell."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Generate with target language
forced_bos_token_id = tokenizer.get_lang_id("en")
outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=256, num_beams=5)
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(translation)
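
If a GPU is available, generation is noticeably faster; the following is a minimal sketch using standard PyTorch/transformers device placement (nothing here is specific to this model):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Move the tokenized inputs to the same device before generating
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256).to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id,
                             max_length=256, num_beams=5)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])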

Batch Translation

# Reuses model, tokenizer, and forced_bos_token_id from the Quick Start above.
# The English sentences below are placeholders; in practice the inputs should be isiZulu source text.
texts = [
    "Climate change affects agricultural productivity.",
    "The study analyzed genetic markers in the population.",
    "Renewable energy sources are essential for sustainable development."
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)
outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=256, num_beams=5)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for src, tgt in zip(texts, translations):
    print(f"{src}\n→ {tgt}\n")

Training Details

Hyperparameters

  • Epochs: 10
  • Batch Size: 1
  • Learning Rate: 2e-05
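
For orientation, a minimal sketch of how these values would map onto Hugging Face Seq2SeqTrainingArguments. This is an illustration assuming a standard Seq2SeqTrainer setup; the project's own training script (see Reproducibility below) is the authoritative recipe, and the output directory and generation settings here are assumptions.

from transformers import Seq2SeqTrainingArguments

# Illustrative mapping of the hyperparameters above; output_dir and
# predict_with_generate are assumptions, not the project's exact configuration.
training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    num_train_epochs=10,
    per_device_train_batch_size=1,
    learning_rate=2e-5,
    predict_with_generate=True,
)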

Training Data

  • Dataset: AfriScience-MT
  • Domain: Scientific abstracts and papers
  • Languages: English and 6 African languages (Amharic, Hausa, Luganda, Northern Sotho, Yoruba, isiZulu)

Reproducibility

To reproduce this model:

# Clone the AfriScience-MT repository
git clone https://github.com/afriscience-mt/afriscience-mt.git
cd afriscience-mt

# Install dependencies
pip install -r requirements.txt

# Run training
python -m afriscience_mt.scripts.run_seq2seq_training \
    --data_dir ./data \
    --source_lang zul \
    --target_lang eng \
    --model_name facebook/m2m100_1.2B \
    --model_type m2m100 \
    --output_dir ./output \
    --num_epochs 10 \
    --batch_size 16 \
    --learning_rate 2e-5

Limitations

  • Domain Specificity: This model is optimized for scientific/academic texts and may perform poorly on colloquial or informal text.
  • Language Coverage: Supports only the isiZulu → English direction; other language pairs, including the reverse direction, are not covered.
  • Input Length: The maximum input length is 256 tokens; longer texts should be split into segments before translation (see the sketch below).
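
A minimal sketch of such segment-level translation, reusing the model, tokenizer, and forced_bos_token_id from the Usage section and a naive regex sentence split (a proper sentence splitter would be preferable):

import re

def translate_long_text(document):
    """Split a long isiZulu document into sentences and translate each segment."""
    # Naive split on sentence-final punctuation; swap in a real sentence splitter if available.
    segments = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    translations = []
    for seg in segments:
        enc = tokenizer(seg, return_tensors="pt", truncation=True, max_length=256)
        out = model.generate(**enc, forced_bos_token_id=forced_bos_token_id,
                             max_length=256, num_beams=5)
        translations.append(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
    return " ".join(translations)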

Citation

If you use this model, please cite the AfriScience-MT project:

@inproceedings{afriscience-mt-2025,
  title={AfriScience-MT: Machine Translation for African Scientific Literature},
  author={AfriScience-MT Team},
  year={2025},
  url={https://github.com/afriscience-mt/afriscience-mt}
}

License

This model is released under the Apache 2.0 License.

Acknowledgments
