---
license: mit
language:
- en
- hi
- kn
- te
- ta
- mr
base_model:
- microsoft/Phi-mini-MoE-instruct
library_name: transformers
pipeline_tag: text-generation
tags:
- Conversational
- Indic Dataset
- Multilingual
- MoE
datasets:
- SandLogicTechnologies/Indic_Chat_Dataset
---
|
|
|
|
|
# IndicPhi-mini: Adapting Phi-mini-MoE to Indic Languages with Curated Data |
|
|
|
|
|
## Overview |
|
|
**IndicPhi-mini** is a fine-tuned version of **Microsoft's Phi-mini-MoE**, a compact Mixture-of-Experts (MoE) model, adapted specifically for Indic languages. It is trained on a curated multilingual dataset of approximately 29 million high-quality samples, standardized into a conversational format from diverse sources. By combining efficient fine-tuning techniques, **4-bit QLoRA quantization** with **LoRA adapters**, the model strengthens Indic language capabilities while keeping resource usage practical. Evaluation on benchmark datasets shows consistent accuracy gains of **3–4 percentage points** across multiple Indic languages, demonstrating the effectiveness of targeted fine-tuning on curated data.
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## Key Contributions |
|
|
- Curated one of the **largest Indic corpora** to date: 561M samples → cleaned into **29M high-quality samples** across **13 Indic languages**. |
|
|
- Fine-tuned **Phi-mini-MoE** (7.6B params, 2.4B active) using **QLoRA (4-bit)** and **LoRA adapters**, making training feasible on a single **A100-80GB GPU**. |
|
|
- Achieved **+3–4 percentage point** accuracy improvements on major Indic benchmarks:
|
|
- **ARC-Challenge-Indic** (reasoning tasks) |
|
|
- **MMLU-Indic** (knowledge & domain understanding) |
|
|
- Improved **generalization across multiple Indic languages** including Hindi, Kannada, Tamil, Telugu, Marathi, Bengali, Malayalam, Gujarati, Odia, Punjabi, Assamese, Sinhala, and Urdu. |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Architecture |
|
|
- **Base model:** Phi-mini-MoE-Instruct (Microsoft) |
|
|
- **Parameters:** 7.6B total (2.4B active per token) |
|
|
- **Layers:** 32 decoder-only transformer blocks |
|
|
- **Attention:** Grouped Query Attention (GQA) |
|
|
- **Experts per layer:** 16 (Top-2 active per token) |
|
|
- **Context length:** 4096 tokens |
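As an illustrative sketch of how top-2 routing works in a setup like the one above (16 experts per layer, 2 active per token), a router scores every expert for each token and only the two highest-scoring experts run, with their outputs mixed by the renormalized router weights. This is not the model's actual routing code, just the general mechanism:

```python
import numpy as np

def top2_route(router_logits: np.ndarray):
    """Pick the top-2 of N experts per token and renormalize their weights."""
    # Softmax over the expert dimension.
    probs = np.exp(router_logits - router_logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    # Indices of the 2 highest-probability experts for each token.
    top2 = np.argsort(probs, axis=-1)[..., -2:]
    # Renormalize the two selected weights so they sum to 1.
    weights = np.take_along_axis(probs, top2, axis=-1)
    weights /= weights.sum(-1, keepdims=True)
    return top2, weights

# One token, 16 experts: only 2 experts are activated per token.
logits = np.random.randn(1, 16)
experts, weights = top2_route(logits)
print(experts.shape)  # (1, 2)
```

Because only 2 of 16 experts run per token, roughly 2.4B of the 7.6B parameters are active for any given forward pass.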
|
|
|
|
|
--- |
|
|
|
|
|
## Usage |
|
|
To load the fine-tuned model: |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "SandLogicTechnologies/IndicPhi-mini"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

# Hindi: "What are the problems of online education in rural areas?"
prompt = "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
|
|
## Dataset Preparation |
|
|
### Data Sources |
|
|
- **Total collected:** 561M samples drawn from **53 public datasets** on Hugging Face.
|
|
- **Languages covered:** 13 Indic languages: Hindi, Kannada, Telugu, Tamil, Marathi, Malayalam, Gujarati, Bengali, Odia, Punjabi, Assamese, Sinhala, and Urdu.
|
|
- **Categories:** General text, translation, instruction, conversational. |
|
|
|
|
|
### Processing Pipeline |
|
|
1. **Manual Filtering** – removed noisy, irrelevant, and malformed samples. |
|
|
2. **Preprocessing** – deduplication, language identification, normalization, minimum length filtering. |
|
|
3. **Format Conversion** – standardized into **UltraChat JSON schema** (multi-turn conversations). |
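The preprocessing and conversion steps can be sketched as follows. The length threshold, field names, and helper functions here are illustrative assumptions, not the actual pipeline code:

```python
import json
import unicodedata

def clean(samples: list) -> list:
    """Illustrative preprocessing: normalization, minimum-length filtering,
    and exact-match deduplication (threshold of 20 chars is an assumption)."""
    seen, out = set(), []
    for s in samples:
        text = unicodedata.normalize("NFC", s["text"].strip())
        if len(text) < 20:   # drop very short samples
            continue
        if text in seen:     # exact-match deduplication
            continue
        seen.add(text)
        out.append({**s, "text": text})
    return out

def to_ultrachat(question: str, answer: str) -> dict:
    """Wrap a Q/A pair in a multi-turn UltraChat-style record."""
    return {"messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}

record = to_ultrachat("भारत की राजधानी क्या है?", "नई दिल्ली।")
print(json.dumps(record, ensure_ascii=False))
```

Each cleaned sample ends up as a list of `{"role", "content"}` turns, which is the conversational format the model is fine-tuned on.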
|
|
|
|
|
### Final Cleaned Dataset |
|
|
- **Size:** 29M samples |
|
|
|
|
|
### Dataset Distribution (Final Cleaned) |
|
|
|
|
|
| Language | Samples | |
|
|
|------------|-----------| |
|
|
| Hindi | 4.63M | |
|
|
| Kannada | 3.54M | |
|
|
| Telugu | 3.72M | |
|
|
| Tamil | 3.86M | |
|
|
| Marathi | 3.79M | |
|
|
| Malayalam | 2.81M | |
|
|
| Gujarati | 2.94M | |
|
|
| Bengali | 1.82M | |
|
|
| Odia | 438K | |
|
|
| Punjabi | 1.21M | |
|
|
| Assamese | 185K | |
|
|
| Sinhala | 64K | |
|
|
| Urdu | 58K | |
|
|
|
|
|
**Total curated dataset:** ~29 million high-quality samples |
|
|
|
|
|
--- |
|
|
|
|
|
### Training Details |
|
|
- **Hardware:** 1 × NVIDIA A100-80GB |
|
|
- **Precision:** QLoRA (4-bit quantization) |
|
|
- **Batching:** Effective batch size 256 (per-device batch 32 × 8 gradient-accumulation steps)
|
|
- **Steps:** 8,500 |
|
|
- **Optimizer:** AdamW (8-bit) + cosine LR schedule + 1k warmup steps |
|
|
- **LoRA configuration:** |
|
|
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
|
|
- r=128, α=128, dropout=0 |
|
|
- **Final training loss:** 0.48 |
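The hyperparameters listed above can be written out as a config sketch (mirroring the shape of a PEFT `LoraConfig`; the actual training script is not published with this card):

```python
# Reconstruction of the listed hyperparameters as a plain config dict.
lora_config = {
    "r": 128,
    "lora_alpha": 128,
    "lora_dropout": 0.0,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Effective batch size = per-device batch × gradient-accumulation steps.
per_device_batch_size = 32
gradient_accumulation_steps = 8
effective_batch_size = per_device_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 256
```

With α = r = 128, the LoRA scaling factor (α / r) is 1, a common choice that leaves adapter updates unscaled.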
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation & Results |
|
|
|
|
|
### Benchmarks |
|
|
1. **ARC-Challenge-Indic** (reasoning) |
|
|
2. **MMLU-Indic** (knowledge & domain understanding) |
|
|
|
|
|
### Improvements |
|
|
- **ARC-Challenge-Indic**
  - Accuracy: **21.03 → 24.46 (+3.43 pts)**
  - Normalized Accuracy: **24.69 → 28.86 (+4.17 pts)**
- **MMLU-Indic**
  - Accuracy: **27.47 → 30.95 (+3.48 pts)**
|
|
|
|
|
### Results |
|
|
|
|
|
#### ARC-Challenge-Indic |
|
|
|
|
|
| Language | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) | |
|
|
|------------|-------------------------|--------------------------| |
|
|
| Hindi | 22.61 | 26.17 | |
|
|
| Kannada | 20.96 | 25.83 | |
|
|
| Tamil | 20.78 | 24.61 | |
|
|
| Telugu | 20.70 | 26.00 | |
|
|
| Bengali | 21.91 | 25.04 | |
|
|
| Gujarati | 18.17 | 21.30 | |
|
|
| Malayalam | 22.26 | 23.91 | |
|
|
| Marathi | 19.65 | 25.22 | |
|
|
| Odia | 22.26 | 24.17 | |
|
|
|
|
|
Average accuracy: **21.03 (Phi-mini-MoE) → 24.46 (IndicPhi-mini), +3.43 pts**
|
|
|
|
|
#### MMLU-Indic
|
|
|
|
|
| Language   | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
|
|
|------------|-------------------------|-------------------------| |
|
|
| Hindi | 28.01 | 31.45 | |
|
|
| Kannada | 26.74 | 30.12 | |
|
|
| Tamil | 27.53 | 30.84 | |
|
|
| Telugu | 27.20 | 31.02 | |
|
|
| Bengali | 28.36 | 31.44 | |
|
|
| Gujarati | 25.91 | 29.28 | |
|
|
| Malayalam | 26.65 | 29.77 | |
|
|
| Marathi | 27.12 | 30.63 | |
|
|
| Odia | 27.05 | 30.45 | |
|
|
| Punjabi | 26.42 | 29.61 | |
|
|
| Assamese | 25.98 | 29.23 | |
|
|
| Sinhala | 24.87 | 27.66 | |
|
|
| Urdu | 25.44 | 28.71 | |
|
|
|
|
|
Average accuracy: **27.47 (Phi-mini-MoE) → 30.95 (IndicPhi-mini), +3.48 pts**
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
**IndicPhi-mini** is based on Microsoft's original **Phi-mini-MoE-Instruct** model and was fine-tuned by the **SandLogic** development team.
|
|
|
|
|
Special thanks to: |
|
|
- The [Microsoft](https://huggingface.co/microsoft) team for developing and releasing the [microsoft/Phi-mini-MoE-instruct](https://huggingface.co/microsoft/Phi-mini-MoE-instruct) model. |
|
|
- The authors and organizations behind the **53 open-source datasets** that made this work possible. |
|
|
The complete list of dataset sources and citations is available [here](https://github.com/sandlogic/SandLogic-Lexicons/blob/main/Images/dataset_citation.md). |
|
|
|
|
|
--- |
|
|
|
|
|
## Contact |
|
|
For any inquiries or support, please contact us at [email protected] or visit our [Website](https://www.sandlogic.com/). |