---
library_name: transformers
language:
- en
base_model:
- meta-llama/Llama-3.1-70B-Instruct
tags:
- evaluation
---
# LMUnit: Fine-grained Evaluation with Natural Language Unit Tests

[![Paper](https://img.shields.io/badge/Paper-LMUnit-blue)](https://arxiv.org/abs/2412.13091) [![Blog Post](https://img.shields.io/badge/📝%20Blog-LMUnit-green)](https://contextual.ai/research/lmunit) [![GitHub](https://img.shields.io/badge/GitHub-LMUnit-black?logo=github)](https://github.com/ContextualAI/LMUnit) [![Hugging Face Collection](https://img.shields.io/badge/🤗%20Hugging%20Face-Model%20Collection-yellow)](https://huggingface.co/collections/ContextualAI/lmunit)
**LMUnit** is a state-of-the-art language model optimized for evaluating natural language unit tests. It takes three inputs: a prompt, a response, and a unit test, and produces a continuous score between 1 and 5, where higher scores indicate that the response better satisfies the unit test criterion. LMUnit achieves leading average performance across preference, direct scoring, and fine-grained unit test evaluation tasks, as measured by FLASK and BiGGen Bench, and performs on par with frontier models on coarse evaluation of long-form responses (per LFQA). The model also demonstrates exceptional alignment with human preferences, ranking in the top 5 on the RewardBench benchmark with 93.5% accuracy and in the top 2 on RewardBench 2 with 82.1% accuracy.

For more details, please check out the [blog post](https://contextual.ai/research/lmunit) or the [paper](https://arxiv.org/abs/2412.13091).

## Model Details

LMUnit is highly performant and versatile because of key methodologies in its training approach:

- **Multi-Objective Training:** The model simultaneously learns from multiple evaluation signals, including pairwise comparisons between responses, direct quality ratings, and specialized criteria-based judgments.
- **Synthetic Data Generation:** We developed a sophisticated pipeline to generate training data that captures nuanced, fine-grained evaluation criteria and subtle quality distinctions between responses across a wide range of use cases and scenarios.
- **Importance Weighting:** We demonstrate that adjusting unit test weights to reflect the relative importance of different criteria achieves results that better align with human preferences (see the aggregation sketch after the quick-start examples below).

### Model Description

- **Developed by:** Contextual AI
- **Language(s) (NLP):** English
- **Finetuned from model:** Llama-3.1-70B-Instruct

### Model Sources

- **Repository:** https://github.com/ContextualAI/LMUnit
- **Paper:** https://arxiv.org/abs/2412.13091

## 🚀 Model Quick Start

### Installation

```bash
pip install lmunit
```

### Basic Usage

```python
from lmunit import LMUnit
from vllm import SamplingParams

# Initialize LMUnit
model = LMUnit(
    model_path="ContextualAI/LMUnit-llama3.1-70b",
    tp_size=4
)

# Define evaluation
query = "What is the capital of France?"
response = "Paris"
unit_test = "Does the response correctly identify the capital city?"

# Generate score
sampling_params = SamplingParams(temperature=0.0, max_tokens=10, logprobs=20)
prompt = f"Query: {query}\n\nResponse: {response}\n\nUnit Test: {unit_test}"
output = model.generate(prompt, sampling_params)
print(output)
```

### Alternative: Using Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model
tokenizer = AutoTokenizer.from_pretrained("ContextualAI/LMUnit-llama3.1-70b")
model = AutoModelForCausalLM.from_pretrained("ContextualAI/LMUnit-llama3.1-70b")

# Prepare prompt
query = "What is the capital of France?"
response = "Paris"
unit_test = "Does the response correctly identify the capital city?"
content = f"Query: {query}\n\nResponse: {response}\n\nUnit Test: {unit_test}"
messages = [{"role": "user", "content": content}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate
outputs = model.generate(**inputs, max_new_tokens=40)
result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])
print(result)
```

For more examples, see our [GitHub repository](https://github.com/ContextualAI/LMUnit).
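Since LMUnit produces a continuous score between 1 and 5, you may want a numeric value rather than raw generated text. The sketch below, continuing from the Transformers example above, shows one common way to read out such a score: take an expectation over the model's probabilities for the rating tokens "1" through "5". This is an illustrative assumption only; it presumes the rating digit is the first generated token, and the `lmunit` package may compute its score differently.

```python
import torch

# Minimal sketch (assumes `model`, `tokenizer`, and `inputs` from the example above).
# Expected-value readout over the rating tokens "1".."5" -- illustrative only;
# not necessarily the scoring logic used by the lmunit package.
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the first generated position

# Token ids for the digits "1".."5"
rating_ids = [tokenizer.encode(str(i), add_special_tokens=False)[0] for i in range(1, 6)]

# Renormalize over the five rating tokens and take the expected rating
probs = torch.softmax(logits[rating_ids], dim=-1)
score = sum(p.item() * i for i, p in zip(range(1, 6), probs))
print(f"Continuous score (1-5): {score:.2f}")
```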
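As noted under Model Details, weighting unit tests by their relative importance can better align aggregate judgments with human preferences. The snippet below is a minimal, illustrative sketch of such an aggregation: the unit tests, weights, and per-test scores are made up for this example, and in practice each score would come from an LMUnit call as shown above.

```python
# Illustrative importance-weighted aggregation of unit test scores.
# The unit tests, weights, and per-test scores below are hypothetical;
# each score would normally come from an LMUnit call (1-5 scale).
weights = {
    "Does the response directly answer the question?": 0.5,
    "Is the response factually accurate?": 0.3,
    "Is the response concise?": 0.2,
}
scores = {
    "Does the response directly answer the question?": 4.6,
    "Is the response factually accurate?": 4.9,
    "Is the response concise?": 3.8,
}

def weighted_score(scores: dict, weights: dict) -> float:
    """Combine per-unit-test scores into a single response-level score."""
    total = sum(weights.values())
    return sum(scores[test] * w for test, w in weights.items()) / total

print(f"Overall weighted score: {weighted_score(scores, weights):.2f}")
```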
## Evaluation Results

| Model | FLASK | BiGGen-Bench | Human-Internal | InfoBench | RewardBench | LFQA | RewardBench 2 |
|:------|------:|-------------:|---------------:|----------:|------------:|-----:|--------------:|
| **LMUnit-LLaMA-3.1-70B** | 72.03 | 67.69 | 93.63 | 89.00 | 91.56 | 76.15 | 80.5 |
| **LMUnit-Qwen2.5-72B** | 73.85 | 69.56 | 94.44 | 88.67 | 91.13 | 73.85 | 82.1 |

## Citation

If you find our work helpful, feel free to cite our paper:

```bibtex
@inproceedings{saadfalcon2025lmunit,
  title={{LMUnit}: Fine-grained Evaluation with Natural Language Unit Tests},
  author={Jon Saad-Falcon and Rajan Vivek and William Berrios and Nandita Shankar Naik and Matija Franklin and Bertie Vidgen and Amanpreet Singh and Douwe Kiela and Shikib Mehri},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2025},
  year={2025},
  url={https://arxiv.org/abs/2412.13091}
}
```