SandLogicTechnologies
/

IndicPhi-mini

+---
+license: mit
+language:
+- en
+- hi
+- kn
+- te
+- ta
+- mr
+base_model:
+- microsoft/Phi-mini-MoE-instruct
+library: transformers
+pipeline_tag: text-generation
+tags:
+- Conversational
+- Indic Dataset
+- Multilingual
+- MoE
+datasets:
+- SandLogicTechnologies/Indic_Chat_Dataset
+---
+# IndicPhi-mini: Adapting Phi-mini-MoE to Indic Languages with Curated Data
+##  Overview
+**IndicPhi-mini** is a fine-tuned version of **Microsoft’s Phi-mini-MoE**, a compact Mixture-of-Experts (MoE) model, adapted specifically for Indic languages. It is trained on a curated multilingual dataset of approximately 29 million high-quality samples, standardized into a conversational format from diverse sources. By leveraging efficient fine-tuning techniques such as **QLoRA-based quantization** and **LoRA adapters**, the model enhances Indic language capabilities while keeping resource usage practical. Evaluation on benchmark datasets shows consistent **3–4% accuracy** improvements across multiple Indic languages, demonstrating the effectiveness of targeted fine-tuning with curated data.
+a compact Mixture-of-Experts (MoE) model
+---
+##  Key Contributions
+-  Curated one of the **largest Indic corpora** to date: 561M samples → cleaned into **29M high-quality samples** across **13 Indic languages**.
+-  Fine-tuned **Phi-mini-MoE** (7.6B params, 2.4B active) using **QLoRA (4-bit)** and **LoRA adapters**, making training feasible on a single **A100-80GB GPU**.
+-  Achieved **+3–4% accuracy improvements** on major Indic benchmarks:
+    - **ARC-Challenge-Indic** (reasoning tasks)
+    - **MMLU-Indic** (knowledge & domain understanding)
+-  Improved **generalization across multiple Indic languages** including Hindi, Kannada, Tamil, Telugu, Marathi, Bengali, Malayalam, Gujarati, Odia, Punjabi, Assamese, Sinhala, and Urdu.
+---
+##  Model Architecture
+- **Base model:** Phi-mini-MoE-Instruct (Microsoft)
+- **Parameters:** 7.6B total (2.4B active per token)
+- **Layers:** 32 decoder-only transformer blocks
+- **Attention:** Grouped Query Attention (GQA)
+- **Experts per layer:** 16 (Top-2 active per token)
+- **Context length:** 4096 tokens
+---
+##  Dataset Preparation
+### Data Sources
+- **Total collected:** 561M samples from **53 datasets** from Hugging Face.
+- **Languages covered:** 13 Indian languages which include Hindi, Kannada, Telugu, Tamil, Marathi, Malayalam, Gujarati, Bengali,Odia, Punjabi, Assamese, Sinhala, Urdu.
+- **Categories:** General text, translation, instruction, conversational.
+### Processing Pipeline
+1. **Manual Filtering** – removed noisy, irrelevant, and malformed samples.
+2. **Preprocessing** – deduplication, language identification, normalization, minimum length filtering.
+3. **Format Conversion** – standardized into **UltraChat JSON schema** (multi-turn conversations).
+### Final Cleaned Dataset
+- **Size:** 29M samples
+### Dataset Distribution (Final Cleaned)
+| Language   | Samples   |
+|------------|-----------|
+| Hindi      | 4.63M     |
+| Kannada    | 3.54M     |
+| Telugu     | 3.72M     |
+| Tamil      | 3.86M     |
+| Marathi    | 3.79M     |
+| Malayalam  | 2.81M     |
+| Gujarati   | 2.94M     |
+| Bengali    | 1.82M     |
+| Odia       | 438K      |
+| Punjabi    | 1.21M     |
+| Assamese   | 185K      |
+| Sinhala    | 64K       |
+| Urdu       | 58K       |
+**Total curated dataset:** ~29 million high-quality samples
+---
+### Training Details
+- **Hardware:** 1 × NVIDIA A100-80GB
+- **Precision:** QLoRA (4-bit quantization)
+- **Batching:** Effective batch size 256 (32 × 8 gradient accumulation)
+- **Steps:** 8,500
+- **Optimizer:** AdamW (8-bit) + cosine LR schedule + 1k warmup steps
+- **LoRA configuration:**
+  - Layers: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
+  - r=128, α=128, dropout=0
+- **Final training loss:** 0.48
+---
+##  Evaluation & Results
+### Benchmarks
+1. **ARC-Challenge-Indic** (reasoning)
+2. **MMLU-Indic** (knowledge & domain understanding)
+### Improvements
+- **ARC-Challenge-Indic**
+  - Accuracy: **21.03 → 24.46 (+3.43%)**
+  - Normalized Accuracy: **24.69 → 28.86 (+4.17%)**
+- **MMLU-Indic**
+  - Accuracy: **27.47 → 30.95 (+3.48%)**
+###  Results
+#### ARC-Challenge-Indic
+| Language   | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
+|------------|-------------------------|--------------------------|
+| Hindi      | 22.61                   | 26.17                    |
+| Kannada    | 20.96                   | 25.83                    |
+| Tamil      | 20.78                   | 24.61                    |
+| Telugu     | 20.70                   | 26.00                    |
+| Bengali    | 21.91                   | 25.04                    |
+| Gujarati   | 18.17                   | 21.30                    |
+| Malayalam  | 22.26                   | 23.91                    |
+| Marathi    | 19.65                   | 25.22                    |
+| Odia       | 22.26                   | 24.17                    |
+Accuracy: **(Phi-mini-MoE) 21.03 → (IndicPhi-mini) 24.46 (+3.43%)**
+**MMLU-Indic**
+| Language   | Accuracy (Phi-mini-MoE) | Accuracy (Phi-mini-MoE)|
+|------------|-------------------------|-------------------------|
+| Hindi      | 28.01                   | 31.45                   |
+| Kannada    | 26.74                   | 30.12                   |
+| Tamil      | 27.53                   | 30.84                   |
+| Telugu     | 27.20                   | 31.02                   |
+| Bengali    | 28.36                   | 31.44                   |
+| Gujarati   | 25.91                   | 29.28                   |
+| Malayalam  | 26.65                   | 29.77                   |
+| Marathi    | 27.12                   | 30.63                   |
+| Odia       | 27.05                   | 30.45                   |
+| Punjabi    | 26.42                   | 29.61                   |
+| Assamese   | 25.98                   | 29.23                   |
+| Sinhala    | 24.87                   | 27.66                   |
+| Urdu       | 25.44                   | 28.71                   |
+Accuracy: **(Phi-mini-MoE) 27.47 → (IndicPhi-mini) 30.95 (+3.48%)**
+## Usage
+To load the fine-tuned model:
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_name = "sandlogic/indicphi-mini-moe-v3"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    device_map="auto",
+    load_in_4bit=True
+)
+prompt = "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"
+inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+outputs = model.generate(**inputs, max_new_tokens=100)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
+## Acknowledgments
+The **Phi-mini-MoE-Instruct** models are based on the original work by **Microsoft** and Fine-tunned by **Sandlogic** development team.
+Special thanks to:
+- The [Microsoft](https://huggingface.co/microsoft) team for developing and releasing the [microsoft/Phi-mini-MoE-instruct](https://huggingface.co/microsoft/Phi-mini-MoE-instruct) model.
+---
+## Contact
+For any inquiries or support, please contact us at support@sandlogic.com or visit our [Website](https://www.sandlogic.com/).