---
license: mit
language:
- en
- hi
- kn
- te
- ta
- mr
base_model:
- microsoft/Phi-mini-MoE-instruct
library_name: transformers
pipeline_tag: text-generation
tags:
- Conversational
- Indic Dataset
- Multilingual
- MoE
datasets:
- SandLogicTechnologies/Indic_Chat_Dataset
---
# IndicPhi-mini: Adapting Phi-mini-MoE to Indic Languages with Curated Data
## Overview
**IndicPhi-mini** is a fine-tuned version of **Microsoft’s Phi-mini-MoE**, a compact Mixture-of-Experts (MoE) model, adapted specifically for Indic languages. It is trained on a curated multilingual dataset of approximately 29 million high-quality samples, standardized into a conversational format from diverse sources. By leveraging efficient fine-tuning techniques such as **QLoRA-based quantization** and **LoRA adapters**, the model enhances Indic language capabilities while keeping resource usage practical. Evaluation on benchmark datasets shows consistent **3–4% accuracy** improvements across multiple Indic languages, demonstrating the effectiveness of targeted fine-tuning with curated data.
---
## Key Contributions
- Curated one of the **largest Indic corpora** to date: 561M samples → cleaned into **29M high-quality samples** across **13 Indic languages**.
- Fine-tuned **Phi-mini-MoE** (7.6B params, 2.4B active) using **QLoRA (4-bit)** and **LoRA adapters**, making training feasible on a single **A100-80GB GPU**.
- Achieved **+3–4% accuracy improvements** on major Indic benchmarks:
- **ARC-Challenge-Indic** (reasoning tasks)
- **MMLU-Indic** (knowledge & domain understanding)
- Improved **generalization across multiple Indic languages** including Hindi, Kannada, Tamil, Telugu, Marathi, Bengali, Malayalam, Gujarati, Odia, Punjabi, Assamese, Sinhala, and Urdu.
---
## Model Architecture
- **Base model:** Phi-mini-MoE-Instruct (Microsoft)
- **Parameters:** 7.6B total (2.4B active per token)
- **Layers:** 32 decoder-only transformer blocks
- **Attention:** Grouped Query Attention (GQA)
- **Experts per layer:** 16 (Top-2 active per token)
- **Context length:** 4096 tokens
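
These details can be cross-checked against the published configuration of the base checkpoint. A minimal sketch; the attribute names are assumptions based on common Mixtral-style MoE configs and may differ in the released config:

```python
from transformers import AutoConfig

# Load the base model's configuration only (no weights are downloaded).
# Older transformers releases may additionally need trust_remote_code=True.
config = AutoConfig.from_pretrained("microsoft/Phi-mini-MoE-instruct")

# Attribute names below follow common Mixtral-style MoE configs and are
# assumptions; getattr falls back to None if the released config differs.
print("Decoder layers:   ", getattr(config, "num_hidden_layers", None))
print("Experts per layer:", getattr(config, "num_local_experts", None))
print("Experts per token:", getattr(config, "num_experts_per_tok", None))
print("KV heads (GQA):   ", getattr(config, "num_key_value_heads", None))
print("Context length:   ", getattr(config, "max_position_embeddings", None))
```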
---
## Usage
To load the fine-tuned model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "SandLogicTechnologies/IndicPhi-mini"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# 4-bit loading requires the bitsandbytes package.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

# "What are the problems with online education in rural areas?"
prompt = "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
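
Since the fine-tuning data is multi-turn conversational, prompts can also be built with the tokenizer's chat template instead of raw strings. A minimal sketch continuing from the snippet above, assuming the chat template is inherited from the base Phi-mini-MoE-instruct checkpoint:

```python
# Conversation in the standard "messages" format.
messages = [
    {"role": "user", "content": "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"},
]

# apply_chat_template formats the turns with the model's chat template
# (assumed to be inherited from the base model) and returns token IDs.
chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(chat_inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```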
## Dataset Preparation
### Data Sources
- **Total collected:** 561M samples from **53 Hugging Face datasets**.
- **Languages covered:** 13 Indic languages: Hindi, Kannada, Telugu, Tamil, Marathi, Malayalam, Gujarati, Bengali, Odia, Punjabi, Assamese, Sinhala, and Urdu.
- **Categories:** General text, translation, instruction, conversational.
### Processing Pipeline
1. **Manual Filtering** – removed noisy, irrelevant, and malformed samples.
2. **Preprocessing** – deduplication, language identification, normalization, minimum length filtering.
3. **Format Conversion** – standardized into **UltraChat JSON schema** (multi-turn conversations).
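
For reference, a single converted record looks roughly like the following. The key names ("messages", "role", "content") are assumptions based on the public UltraChat datasets and may differ slightly in the released dataset:

```python
# Illustrative multi-turn record in an UltraChat-style conversation schema.
# Key names are assumptions and may not match the released dataset exactly.
example_record = {
    "messages": [
        # "What is the capital of Karnataka?"
        {"role": "user", "content": "कर्नाटक की राजधानी क्या है?"},
        # "The capital of Karnataka is Bengaluru."
        {"role": "assistant", "content": "कर्नाटक की राजधानी बेंगलुरु है।"},
    ]
}
```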
### Final Cleaned Dataset
- **Size:** 29M samples
### Dataset Distribution (Final Cleaned)
| Language | Samples |
|------------|-----------|
| Hindi | 4.63M |
| Kannada | 3.54M |
| Telugu | 3.72M |
| Tamil | 3.86M |
| Marathi | 3.79M |
| Malayalam | 2.81M |
| Gujarati | 2.94M |
| Bengali | 1.82M |
| Odia | 438K |
| Punjabi | 1.21M |
| Assamese | 185K |
| Sinhala | 64K |
| Urdu | 58K |
**Total curated dataset:** ~29 million high-quality samples
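
The curated corpus is published as SandLogicTechnologies/Indic_Chat_Dataset (linked in the metadata above). A minimal loading sketch with the datasets library; the split name and the choice of streaming are assumptions about the repository layout:

```python
from datasets import load_dataset

# Stream the corpus to avoid materializing ~29M samples on disk.
# The split name "train" is an assumption about the repository layout.
indic_chat = load_dataset(
    "SandLogicTechnologies/Indic_Chat_Dataset",
    split="train",
    streaming=True,
)

# Inspect a single record.
print(next(iter(indic_chat)))
```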
---
## Training Details
- **Hardware:** 1 × NVIDIA A100-80GB
- **Precision:** QLoRA (4-bit quantization)
- **Batching:** Effective batch size 256 (32 × 8 gradient accumulation)
- **Steps:** 8,500
- **Optimizer:** AdamW (8-bit) + cosine LR schedule + 1k warmup steps
- **LoRA configuration:**
- Layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- r=128, α=128, dropout=0
- **Final training loss:** 0.48
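
These settings map directly onto standard QLoRA tooling. A minimal sketch of the quantization and adapter setup using bitsandbytes and peft; it illustrates the listed hyperparameters rather than reproducing the exact training script, and the 4-bit quantization details (NF4, bfloat16 compute) are assumptions:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit (QLoRA-style) loading of the base model; quant type and compute
# dtype are assumptions, not values stated in this card.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-mini-MoE-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention and MLP projections, matching the card above.
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```

Training would then run with an effective batch size of 256 (per-device batch 32 × gradient accumulation 8), 8-bit AdamW, a cosine learning-rate schedule, and 1,000 warmup steps, as listed above.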
---
## Evaluation & Results
### Benchmarks
1. **ARC-Challenge-Indic** (reasoning)
2. **MMLU-Indic** (knowledge & domain understanding)
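
The card does not specify the evaluation harness. As one possible setup, accuracy and normalized accuracy of the kind reported below can be computed with lm-evaluation-harness; the task names in this sketch are hypothetical placeholders, not registered task IDs:

```python
import lm_eval

# Hypothetical task names -- the actual Indic benchmark task IDs are not
# specified in this card and would need to be registered or substituted.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=SandLogicTechnologies/IndicPhi-mini",
    tasks=["arc_challenge_indic", "mmlu_indic"],
    num_fewshot=0,
)
print(results["results"])  # per-task accuracy metrics
```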
### Improvements
- **ARC-Challenge-Indic**
- Accuracy: **21.03 → 24.46 (+3.43%)**
- Normalized Accuracy: **24.69 → 28.86 (+4.17%)**
- **MMLU-Indic**
- Accuracy: **27.47 → 30.95 (+3.48%)**
### Results
#### ARC-Challenge-Indic
| Language | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
|------------|-------------------------|--------------------------|
| Hindi | 22.61 | 26.17 |
| Kannada | 20.96 | 25.83 |
| Tamil | 20.78 | 24.61 |
| Telugu | 20.70 | 26.00 |
| Bengali | 21.91 | 25.04 |
| Gujarati | 18.17 | 21.30 |
| Malayalam | 22.26 | 23.91 |
| Marathi | 19.65 | 25.22 |
| Odia | 22.26 | 24.17 |
Accuracy: **(Phi-mini-MoE) 21.03 → (IndicPhi-mini) 24.46 (+3.43%)**
#### MMLU-Indic
| Language | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
|------------|-------------------------|-------------------------|
| Hindi | 28.01 | 31.45 |
| Kannada | 26.74 | 30.12 |
| Tamil | 27.53 | 30.84 |
| Telugu | 27.20 | 31.02 |
| Bengali | 28.36 | 31.44 |
| Gujarati | 25.91 | 29.28 |
| Malayalam | 26.65 | 29.77 |
| Marathi | 27.12 | 30.63 |
| Odia | 27.05 | 30.45 |
| Punjabi | 26.42 | 29.61 |
| Assamese | 25.98 | 29.23 |
| Sinhala | 24.87 | 27.66 |
| Urdu | 25.44 | 28.71 |
Accuracy: **(Phi-mini-MoE) 27.47 → (IndicPhi-mini) 30.95 (+3.48%)**
## Acknowledgments
**IndicPhi-mini** is based on the original **Phi-mini-MoE-Instruct** model by **Microsoft** and was fine-tuned by the **SandLogic** development team.
Special thanks to:
- The [Microsoft](https://huggingface.co/microsoft) team for developing and releasing the [microsoft/Phi-mini-MoE-instruct](https://huggingface.co/microsoft/Phi-mini-MoE-instruct) model.
- The authors and organizations behind the **53 open-source datasets** that made this work possible.
The complete list of dataset sources and citations is available [here](https://github.com/sandlogic/SandLogic-Lexicons/blob/main/Images/dataset_citation.md).
---
## Contact
For any inquiries or support, please contact us at [email protected] or visit our [Website](https://www.sandlogic.com/).