m2m100_1.2b-zul-eng

This model is part of the AfriScience-MT project, focused on machine translation of scientific texts for African languages.

Model Description

  • Model Type: Seq2Seq Translation
  • Translation Direction: isiZulu → English
  • Base Model: facebook/m2m100_1.2B
  • Domain: Scientific/academic texts
  • Training: Full fine-tuning on the AfriScience-MT dataset

Evaluation Results

Performance on the AfriScience-MT test set:

Split        BLEU    chrF    SSA-COMET
Validation   31.39   53.03   61.09
Test         30.05   52.24   60.13

Metrics explanation (a short scoring example follows this list):

  • BLEU: Measures n-gram overlap with reference translations (0-100, higher is better)
  • chrF: Character-level F-score, robust for morphologically rich languages (0-100, higher is better)
  • SSA-COMET: Neural metric trained for Sub-Saharan African languages, shown as percentage (0-100, higher is better) (McGill-NLP/ssa-comet-stl)
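
The surface metrics (BLEU and chrF) can be recomputed with the sacrebleu library; the snippet below is a minimal sketch with made-up placeholder sentences, not output from the AfriScience-MT test set. Scoring with SSA-COMET additionally requires the unbabel-comet toolkit and the McGill-NLP/ssa-comet-stl checkpoint, so it is omitted here.

import sacrebleu

# Placeholder hypotheses/references for illustration only
hypotheses = ["The cell produces energy in the mitochondria."]
references = [["The mitochondria produce the cell's energy."]]  # one list per set of references

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")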

Usage

Quick Start

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "AfriScience-MT/m2m100_1.2b-zul-eng"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Set source language
tokenizer.src_lang = "zu"

# Translate (note: the model expects isiZulu source text; the English sentence below is only a placeholder)
text = "The mitochondria is the powerhouse of the cell."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)

# Generate with target language
forced_bos_token_id = tokenizer.get_lang_id("en")
outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=256, num_beams=5)
translation = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(translation)
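
If a GPU is available, generation is noticeably faster; the following is a minimal sketch using standard PyTorch/transformers device placement (nothing here is specific to this model):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Move the tokenized inputs to the same device before generating
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256).to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id,
                             max_length=256, num_beams=5)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])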

Batch Translation

# Reuses model, tokenizer, and forced_bos_token_id from the Quick Start above.
# The English sentences below are placeholders; in practice the inputs should be isiZulu source text.
texts = [
    "Climate change affects agricultural productivity.",
    "The study analyzed genetic markers in the population.",
    "Renewable energy sources are essential for sustainable development."
]

inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=256)
outputs = model.generate(**inputs, forced_bos_token_id=forced_bos_token_id, max_length=256, num_beams=5)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for src, tgt in zip(texts, translations):
    print(f"{src}\n→ {tgt}\n")

Training Details

Hyperparameters

  • Epochs: 10
  • Batch Size: 1
  • Learning Rate: 2e-05
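
For orientation, a minimal sketch of how these values would map onto Hugging Face Seq2SeqTrainingArguments. This is an illustration assuming a standard Seq2SeqTrainer setup; the project's own training script (see Reproducibility below) is the authoritative recipe, and the output directory and generation settings here are assumptions.

from transformers import Seq2SeqTrainingArguments

# Illustrative mapping of the hyperparameters above; output_dir and
# predict_with_generate are assumptions, not the project's exact configuration.
training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    num_train_epochs=10,
    per_device_train_batch_size=1,
    learning_rate=2e-5,
    predict_with_generate=True,
)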

Training Data

  • Dataset: AfriScience-MT
  • Domain: Scientific abstracts and papers
  • Languages: English and 6 African languages (Amharic, Hausa, Luganda, Northern Sotho, Yoruba, isiZulu)

Reproducibility

To reproduce this model:

# Clone the AfriScience-MT repository
git clone https://github.com/afriscience-mt/afriscience-mt.git
cd afriscience-mt

# Install dependencies
pip install -r requirements.txt

# Run training
python -m afriscience_mt.scripts.run_seq2seq_training \
    --data_dir ./data \
    --source_lang zul \
    --target_lang eng \
    --model_name facebook/m2m100_1.2B \
    --model_type m2m100 \
    --output_dir ./output \
    --num_epochs 10 \
    --batch_size 16 \
    --learning_rate 2e-5

Limitations

  • Domain Specificity: This model is optimized for scientific/academic texts and may perform poorly on colloquial or informal text.
  • Language Coverage: Supports only the isiZulu → English direction; other language pairs, including the reverse direction, are not covered.
  • Input Length: The maximum input length is 256 tokens; longer texts should be split into segments before translation (see the sketch below).
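
A minimal sketch of such segment-level translation, reusing the model, tokenizer, and forced_bos_token_id from the Usage section and a naive regex sentence split (a proper sentence splitter would be preferable):

import re

def translate_long_text(document):
    """Split a long isiZulu document into sentences and translate each segment."""
    # Naive split on sentence-final punctuation; swap in a real sentence splitter if available.
    segments = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    translations = []
    for seg in segments:
        enc = tokenizer(seg, return_tensors="pt", truncation=True, max_length=256)
        out = model.generate(**enc, forced_bos_token_id=forced_bos_token_id,
                             max_length=256, num_beams=5)
        translations.append(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
    return " ".join(translations)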

Citation

If you use this model, please cite the AfriScience-MT project:

@inproceedings{afriscience-mt-2025,
  title={AfriScience-MT: Machine Translation for African Scientific Literature},
  author={AfriScience-MT Team},
  year={2025},
  url={https://github.com/afriscience-mt/afriscience-mt}
}

License

This model is released under the Apache 2.0 License.

Acknowledgments
