---
license: mit
language:
- en
- hi
- kn
- te
- ta
- mr
base_model:
- microsoft/Phi-mini-MoE-instruct
library_name: transformers
pipeline_tag: text-generation
tags:
- Conversational
- Indic Dataset
- Multilingual
- MoE
datasets:
- SandLogicTechnologies/Indic_Chat_Dataset
---

# IndicPhi-mini: Adapting Phi-mini-MoE to Indic Languages with Curated Data

## Overview
**IndicPhi-mini** is a fine-tuned version of **Microsoft’s Phi-mini-MoE**, a compact Mixture-of-Experts (MoE) model, adapted specifically for Indic languages. It is trained on a curated multilingual dataset of approximately 29 million high-quality samples, standardized into a conversational format from diverse sources. By leveraging efficient fine-tuning techniques, namely **4-bit QLoRA quantization** and **LoRA adapters**, the model improves Indic language capabilities while keeping resource usage practical. Evaluation on benchmark datasets shows consistent improvements of **3–4 accuracy points** across multiple Indic languages, demonstrating the effectiveness of targeted fine-tuning with curated data.

---

## Key Contributions
- Curated one of the **largest Indic corpora** to date: 561M collected samples → cleaned into **29M high-quality samples** across **13 Indic languages**.
- Fine-tuned **Phi-mini-MoE** (7.6B parameters, 2.4B active) using **QLoRA (4-bit)** and **LoRA adapters**, making training feasible on a single **A100-80GB GPU**.
- Achieved **+3–4 point accuracy improvements** on major Indic benchmarks:
  - **ARC-Challenge-Indic** (reasoning tasks)
  - **MMLU-Indic** (knowledge & domain understanding)
- Improved **generalization across multiple Indic languages**, including Hindi, Kannada, Tamil, Telugu, Marathi, Bengali, Malayalam, Gujarati, Odia, Punjabi, Assamese, Sinhala, and Urdu.

---

## Model Architecture
- **Base model:** Phi-mini-MoE-Instruct (Microsoft)
- **Parameters:** 7.6B total (2.4B active per token)
- **Layers:** 32 decoder-only transformer blocks
- **Attention:** Grouped Query Attention (GQA)
- **Experts per layer:** 16 (Top-2 active per token)
- **Context length:** 4096 tokens
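
These architecture hyperparameters can be checked against the published configuration. A minimal sketch using `transformers` (the MoE attribute names follow the common Mixtral-style config and are an assumption; print the full config if the released model uses different names):

```python
from transformers import AutoConfig

# Inspect architecture hyperparameters without downloading the weights.
# Older transformers releases may need trust_remote_code=True here.
config = AutoConfig.from_pretrained("microsoft/Phi-mini-MoE-instruct")

print("decoder layers:   ", config.num_hidden_layers)                      # expected: 32
print("experts per layer:", getattr(config, "num_local_experts", None))    # expected: 16
print("active experts:   ", getattr(config, "num_experts_per_tok", None))  # expected: 2
print("context length:   ", config.max_position_embeddings)                # expected: 4096
```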

---

## Dataset Preparation
### Data Sources
- **Total collected:** 561M samples from **53 datasets** on Hugging Face.
- **Languages covered:** 13 Indic languages: Hindi, Kannada, Telugu, Tamil, Marathi, Malayalam, Gujarati, Bengali, Odia, Punjabi, Assamese, Sinhala, and Urdu.
- **Categories:** General text, translation, instruction, conversational.

### Processing Pipeline
1. **Manual Filtering** – removed noisy, irrelevant, and malformed samples.
2. **Preprocessing** – deduplication, language identification, normalization, minimum length filtering.
3. **Format Conversion** – standardized into the **UltraChat JSON schema** (multi-turn conversations); a minimal sketch of steps 2–3 follows this list.
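
The sketch below is purely illustrative: it shows deduplication, length filtering, and conversion of a prompt/response pair into a multi-turn chat record. The field names (`messages`, `role`, `content`), the threshold, and the helper names are assumptions, not the exact schema or code used for the released dataset.

```python
import hashlib
import json

def to_chat_record(prompt: str, response: str, lang: str) -> dict:
    """Wrap one prompt/response pair in an UltraChat-style multi-turn record."""
    return {
        "language": lang,
        "messages": [
            {"role": "user", "content": prompt.strip()},
            {"role": "assistant", "content": response.strip()},
        ],
    }

def clean_and_convert(samples, min_chars=20):
    """Deduplicate, length-filter, and convert raw pairs to the conversational schema."""
    seen = set()
    for prompt, response, lang in samples:
        key = hashlib.md5((prompt + response).encode("utf-8")).hexdigest()
        if key in seen or len(prompt) < min_chars or len(response) < min_chars:
            continue  # drop duplicates and very short samples
        seen.add(key)
        yield to_chat_record(prompt, response, lang)

# Example: write cleaned records as JSON Lines.
raw = [("ऑनलाइन शिक्षा क्या है?", "ऑनलाइन शिक्षा इंटरनेट के माध्यम से दी जाने वाली शिक्षा है।", "hi")]
with open("indic_chat.jsonl", "w", encoding="utf-8") as f:
    for record in clean_and_convert(raw):
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```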

### Final Cleaned Dataset
- **Size:** 29M samples

### Dataset Distribution (Final Cleaned)

| Language  | Samples |
|-----------|---------|
| Hindi     | 4.63M   |
| Kannada   | 3.54M   |
| Telugu    | 3.72M   |
| Tamil     | 3.86M   |
| Marathi   | 3.79M   |
| Malayalam | 2.81M   |
| Gujarati  | 2.94M   |
| Bengali   | 1.82M   |
| Odia      | 438K    |
| Punjabi   | 1.21M   |
| Assamese  | 185K    |
| Sinhala   | 64K     |
| Urdu      | 58K     |

**Total curated dataset:** ~29 million high-quality samples
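
The curated corpus is published as `SandLogicTechnologies/Indic_Chat_Dataset` (also listed in the metadata above). A minimal loading sketch, assuming a standard `datasets` layout; the split and column names on the Hub may differ:

```python
from datasets import load_dataset

# Stream the corpus to avoid materialising all ~29M samples locally.
ds = load_dataset("SandLogicTechnologies/Indic_Chat_Dataset", split="train", streaming=True)

# Peek at the first record to check the actual column layout.
print(next(iter(ds)))
```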

---

## Training Details
- **Hardware:** 1 × NVIDIA A100-80GB
- **Precision:** QLoRA (4-bit quantization)
- **Batching:** Effective batch size 256 (per-device batch size 32 × 8 gradient-accumulation steps)
- **Steps:** 8,500
- **Optimizer:** AdamW (8-bit) with cosine LR schedule and 1k warmup steps
- **LoRA configuration:**
  - Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
  - r=128, α=128, dropout=0
- **Final training loss:** 0.48
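
These settings map directly onto standard QLoRA tooling. The following is a minimal, illustrative configuration with `peft` and `bitsandbytes`; hyperparameters mirror the list above, while details not stated in this card (quantization dtype, optimizer variant, output path) are assumptions rather than the exact training script:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

# 4-bit quantization of the frozen base model (QLoRA); NF4 + bf16 compute are assumed.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-mini-MoE-instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters on the attention and MLP projections listed above.
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Effective batch size 256 = 32 per device × 8 gradient-accumulation steps.
training_args = TrainingArguments(
    output_dir="indicphi-mini",   # assumed output path
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,
    max_steps=8500,
    warmup_steps=1000,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",     # 8-bit AdamW; exact variant assumed
    bf16=True,
)
```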

---

## Evaluation & Results

### Benchmarks
1. **ARC-Challenge-Indic** (reasoning)
2. **MMLU-Indic** (knowledge & domain understanding)

### Improvements
- **ARC-Challenge-Indic**
  - Accuracy: **21.03 → 24.46 (+3.43 points)**
  - Normalized Accuracy: **24.69 → 28.86 (+4.17 points)**
- **MMLU-Indic**
  - Accuracy: **27.47 → 30.95 (+3.48 points)**

### Results

#### ARC-Challenge-Indic

| Language  | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
|-----------|-------------------------|--------------------------|
| Hindi     | 22.61                   | 26.17                    |
| Kannada   | 20.96                   | 25.83                    |
| Tamil     | 20.78                   | 24.61                    |
| Telugu    | 20.70                   | 26.00                    |
| Bengali   | 21.91                   | 25.04                    |
| Gujarati  | 18.17                   | 21.30                    |
| Malayalam | 22.26                   | 23.91                    |
| Marathi   | 19.65                   | 25.22                    |
| Odia      | 22.26                   | 24.17                    |

Average accuracy: **(Phi-mini-MoE) 21.03 → (IndicPhi-mini) 24.46 (+3.43 points)**

#### MMLU-Indic

| Language  | Accuracy (Phi-mini-MoE) | Accuracy (IndicPhi-mini) |
|-----------|-------------------------|--------------------------|
| Hindi     | 28.01                   | 31.45                    |
| Kannada   | 26.74                   | 30.12                    |
| Tamil     | 27.53                   | 30.84                    |
| Telugu    | 27.20                   | 31.02                    |
| Bengali   | 28.36                   | 31.44                    |
| Gujarati  | 25.91                   | 29.28                    |
| Malayalam | 26.65                   | 29.77                    |
| Marathi   | 27.12                   | 30.63                    |
| Odia      | 27.05                   | 30.45                    |
| Punjabi   | 26.42                   | 29.61                    |
| Assamese  | 25.98                   | 29.23                    |
| Sinhala   | 24.87                   | 27.66                    |
| Urdu      | 25.44                   | 28.71                    |

Average accuracy: **(Phi-mini-MoE) 27.47 → (IndicPhi-mini) 30.95 (+3.48 points)**

## Usage
To load the fine-tuned model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "sandlogic/indicphi-mini-moe-v3"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load in 4-bit, matching the QLoRA precision used during fine-tuning.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)

# "What are the problems of online education in rural areas?" (Hindi)
prompt = "ग्रामीण क्षेत्रों में ऑनलाइन शिक्षा की समस्याएं क्या हैं?"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
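
Since the model is tuned on multi-turn conversational data, the tokenizer's chat template can also be used. A short sketch continuing from the snippet above (the prompt and generation settings are illustrative):

```python
# Multi-turn usage via the chat template inherited from the base model.
messages = [
    {"role": "user", "content": "List three challenges of online education in rural India."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```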

## Acknowledgments

**IndicPhi-mini** builds on the original **Phi-mini-MoE-Instruct** model by **Microsoft** and was fine-tuned by the **SandLogic** development team.

Special thanks to:
- The [Microsoft](https://huggingface.co/microsoft) team for developing and releasing the [microsoft/Phi-mini-MoE-instruct](https://huggingface.co/microsoft/Phi-mini-MoE-instruct) model.

---

## Contact
For any inquiries or support, please contact us at support@sandlogic.com or visit our [Website](https://www.sandlogic.com/).