This model is part of the AfriScience-MT project, which focuses on machine translation of scientific texts for African languages.
| Property | Value |
|---|---|
| Model Type | Seq2Seq Translation |
| Translation Direction | isiZulu → English |
| Base Model | facebook/m2m100_1.2B |
| Domain | Scientific/Academic texts |
| Training | Full fine-tuning on AfriScience-MT dataset |
Performance on the AfriScience-MT validation and test sets:
| Split | BLEU | chrF | SSA-COMET |
|---|---|---|---|
| Validation | 31.39 | 53.03 | 61.09 |
| Test | 30.05 | 52.24 | 60.13 |
Metrics explanation:

- **BLEU**: n-gram overlap between the system output and the reference translation (higher is better).
- **chrF**: character n-gram F-score, which is more robust than BLEU for morphologically rich languages.
- **SSA-COMET**: a COMET-style learned quality metric adapted for Sub-Saharan African languages.
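BLEU and chrF can be recomputed locally with the `sacrebleu` library, as sketched below; SSA-COMET is a learned metric that requires its own model checkpoint and is omitted here. The sentences are placeholders, not items from the actual test set.

```python
from sacrebleu.metrics import BLEU, CHRF

# Placeholder hypothesis/reference pair; substitute the model's outputs
# and the AfriScience-MT references in practice.
hypotheses = ["The mitochondrion is the powerhouse of the cell."]
references = [["The mitochondrion is the powerhouse of the cell."]]

print(BLEU().corpus_score(hypotheses, references))
print(CHRF().corpus_score(hypotheses, references))
```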
Example usage for translating a single sentence:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "AfriScience-MT/m2m100_1.2b-zul-eng"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Set the source language (M2M100 code for isiZulu)
tokenizer.src_lang = "zu"

# Source sentence (replace with isiZulu text; English shown here only as a placeholder)
text = "The mitochondria is the powerhouse of the cell."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Generate, forcing English as the target language
forced_bos_token_id = tokenizer.get_lang_id("en")
outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=256, num_beams=5)

translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(translation)
```
Batch translation works the same way:

```python
# Placeholder sentences; in practice these should be isiZulu source texts
texts = [
    "Climate change affects agricultural productivity.",
    "The study analyzed genetic markers in the population.",
    "Renewable energy sources are essential for sustainable development.",
]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)
outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=256, num_beams=5)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)

for src, tgt in zip(texts, translations):
    print(f"{src}\n→ {tgt}\n")
```
Key training hyperparameters:

| Parameter | Value |
|---|---|
| Epochs | 10 |
| Batch Size | 1 |
| Learning Rate | 2e-05 |
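For reference, a minimal sketch of how these hyperparameters would map onto Hugging Face `Seq2SeqTrainingArguments`; this mapping is an assumption, and the project's own training script (shown below) is the authoritative entry point:

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical mapping of the hyperparameter table onto standard
# Hugging Face arguments; values like output_dir are illustrative.
training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    num_train_epochs=10,
    per_device_train_batch_size=1,  # per-device batch size, as in the table above
    learning_rate=2e-5,
    predict_with_generate=True,
)
```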
To reproduce this model:
```bash
# Clone the AfriScience-MT repository
git clone https://github.com/afriscience-mt/afriscience-mt.git
cd afriscience-mt

# Install dependencies
pip install -r requirements.txt

# Run training
python -m afriscience_mt.scripts.run_seq2seq_training \
    --data_dir ./data \
    --source_lang zul \
    --target_lang eng \
    --model_name facebook/m2m100_1.2B \
    --model_type m2m100 \
    --output_dir ./output \
    --num_epochs 10 \
    --batch_size 16 \
    --learning_rate 2e-5
```
If you use this model, please cite the AfriScience-MT project:
```bibtex
@inproceedings{afriscience-mt-2025,
  title={AfriScience-MT: Machine Translation for African Scientific Literature},
  author={AfriScience-MT Team},
  year={2025},
  url={https://github.com/afriscience-mt/afriscience-mt}
}
```
This model is released under the Apache 2.0 License.