# Model Card for gemma-2-27b-amharic-cpt
This model is a continually pretrained version of unsloth/gemma-2-27b-bnb-4bit on Amharic text data. The tokenizer was expanded beyond the original 256k-token vocabulary to better support Amharic script, and the model underwent continual pretraining on ~2B tokens of Amharic corpus data.
Note: This is an early research version, open-sourced solely for research on low-resource language LLMs. For production-ready models with multimodal features, please visit platform.addisassistant.com.
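A quick way to see what the expanded tokenizer buys is to compare token counts on Amharic text against the base checkpoint's tokenizer. This is a minimal sketch; the example sentence is illustrative, and the repository names are taken from this card:

```python
from transformers import AutoTokenizer

# Amharic: "Addis Ababa is Ethiopia's capital city."
text = "አዲስ አበባ የኢትዮጵያ ዋና ከተማ ናት።"

base = AutoTokenizer.from_pretrained("unsloth/gemma-2-27b-bnb-4bit")
expanded = AutoTokenizer.from_pretrained("b1n1yam/gemma-2-27b-amharic-cpt")

# If the expansion covers Amharic well, the expanded tokenizer should
# need noticeably fewer tokens for the same text.
print("base tokens:    ", len(base.tokenize(text)))
print("expanded tokens:", len(expanded.tokenize(text)))
```

Fewer tokens per sentence means a longer effective context and cheaper generation for Amharic.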
## Quick start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "b1n1yam/gemma-2-27b-amharic-cpt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Amharic prompt: "Addis Ababa is Ethiopia's ..."
prompt = "አዲስ አበባ የኢትዮጵያ"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
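At 27B parameters, loading this model in full precision requires substantial GPU memory. Below is a minimal sketch of 4-bit inference via bitsandbytes, assuming common NF4 defaults rather than settings specified by this card:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "b1n1yam/gemma-2-27b-amharic-cpt"

# NF4 4-bit quantization brings the 27B weights within a single-GPU budget.
# These values are common defaults, not recommendations from the model card.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
```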
## Training procedure
This model underwent continual pretraining (CPT) on Amharic text data:
- Tokenizer expansion: extended the original Gemma 2 tokenizer to better cover Amharic script (see the sketch after this list)
- Training data: ~2B tokens of Amharic corpus text, including transcriptions of audio sources
- Training approach: continual pretraining to adapt the model to the Amharic language
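The card does not publish the expansion recipe, but the standard mechanics look roughly like this sketch; the new subword pieces below are hypothetical placeholders, not the actual added vocabulary:

```python
from transformers import AutoTokenizer

# Start from the base checkpoint's tokenizer (named in the description above).
tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-2-27b-bnb-4bit")
print("original vocab size:", len(tokenizer))  # ~256k for Gemma 2

# Hypothetical new Amharic pieces; in practice these would be learned from
# the Amharic corpus and filtered to those missing from the base vocabulary.
num_added = tokenizer.add_tokens(["ኢትዮጵያዊ", "አማርኛ"])
print(f"added {num_added} tokens; new size: {len(tokenizer)}")

# The model's input/output embeddings are then resized to match, e.g.
#   model.resize_token_embeddings(len(tokenizer))
# and the new rows are learned during continual pretraining.
```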
## Framework versions
- TRL: 0.24.0
- Transformers: 4.57.2
- PyTorch: 2.6.0.dev20241112+cu121
- Datasets: 3.6.0
- Tokenizers: 0.22.1
## Citation
If you use this model, please cite:
```bibtex
@misc{daniel2024addisai,
  title        = {Addis AI: Continual Pretrained Gemma 2 27B for Amharic},
  author       = {Daniel, Biniyam},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/b1n1yam/gemma-2-27b-amharic-cpt}}
}
```