Update README.md

README.md (changed)
@@ -34,12 +34,6 @@ Our Phi-4-Mini-Judge model achieves strong performance across all three evaluati
 | **Hallucination Detection** | 35 | 29 | **82.86%** |
 | **Relevance Evaluation** | 35 | 25 | **71.43%** |
 
-### Common Failure Patterns
-The model's most frequent errors include:
-- Relevance evaluation: 9 cases of marking "unrelated" content as "relevant"
-- Hallucination detection: 5 cases of marking "accurate" content as "hallucination"
-- Toxicity assessment: 3 cases of marking "toxic" content as "non-toxic"
-
 ## Model Usage
 
 For best results, we recommend using the following system prompt and output format:
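The accuracy figures in the table above follow directly from the correct/total counts; a quick sanity check (plain Python, not part of the repository):

```python
# Recompute the reported accuracies from the table: correct / total.
results = {
    "Hallucination Detection": (29, 35),
    "Relevance Evaluation": (25, 35),
}

for task, (correct, total) in results.items():
    accuracy = round(100 * correct / total, 2)
    print(f"{task}: {accuracy}%")  # 82.86% and 71.43%
```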
@@ -171,10 +165,9 @@ The model uses a structured output format with `<rating>` tags containing one of
 ## Intended Uses & Limitations
 
 ### Intended Uses
-
+- SLM as a Judge
 - Automated evaluation of AI-generated responses
 - Quality assurance for conversational AI systems
-- Research in AI safety and alignment
 - Integration into larger AI safety pipelines
 
 ### Limitations
@@ -184,31 +177,6 @@ The model uses a structured output format with `<rating>` tags containing one of
 - Should be used as part of a broader safety strategy, not as sole arbiter
 - Best performance on English text (training data limitation)
 
-## Training Data
-
-This model was trained on a comprehensive dataset combining:
-- **HaluEval dataset** for hallucination detection
-- **Toxicity classification datasets** for harmful content detection
-- **Relevance evaluation datasets** for query-response alignment
-
-The training approach ensures balanced performance across all three safety dimensions while maintaining consistency in output format and reasoning quality.
-
-## Training Procedure
-
-### Training Hyperparameters
-
-The following hyperparameters were used during training:
-- learning_rate: 5e-05
-- train_batch_size: 2
-- eval_batch_size: 8
-- seed: 42
-- gradient_accumulation_steps: 2
-- total_train_batch_size: 4
-- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
-- lr_scheduler_type: cosine
-- lr_scheduler_warmup_steps: 20
-- training_steps: 300 (100 per task)
-
 ### Framework Versions
 
 - PEFT 0.12.0
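The hyperparameter list removed in this hunk is internally consistent: the total train batch size is the per-device batch size times the gradient accumulation steps, and the 300 training steps split evenly across the three judging tasks. A plain-Python check (variable names chosen to mirror the list):

```python
# Effective batch size implied by the removed hyperparameter list.
train_batch_size = 2             # per-device batch size
gradient_accumulation_steps = 2
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)    # 4, matching the listed value

# 300 optimizer steps split evenly across the three judging tasks.
training_steps, tasks = 300, 3
print(training_steps // tasks)   # 100, matching "100 per task"
```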