RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation
Abstract
RLVR has advanced reasoning capabilities but struggles with open-ended generation due to the lack of ground truth; this work proposes an automated rubric generation framework and dataset that improve performance on health reasoning benchmarks.
Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics. However, optimizing open-ended generation remains challenging due to the lack of ground truth. While rubric-based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing subtle nuances. Based on this framework, we introduce RubricHub, a large-scale (~110k) and multi-domain dataset. We validate its utility through a two-stage post-training pipeline comprising Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post-trained Qwen3-14B achieves state-of-the-art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5. The code and data will be released soon.
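The abstract does not specify how rubric scores are computed or how they drive rejection sampling, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes each rubric is a list of weighted binary criteria, that a response's score is the weighted fraction of criteria it satisfies, and that rejection sampling keeps the best-scoring candidate above a threshold. The names `Criterion`, `rubric_score`, `rejection_sample`, `keyword_grader`, and the `threshold` value are all hypothetical; in practice the grader would likely be an LLM judge rather than a keyword match.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Criterion:
    description: str  # what a good response should contain or do
    weight: float     # hypothetical importance weight (assumed positive)


# A grader decides whether a response satisfies a single criterion.
# Assumption: in a real pipeline this would be an LLM judge.
Grader = Callable[[str, Criterion], bool]


def rubric_score(response: str, rubric: List[Criterion], grader: Grader) -> float:
    """Weighted fraction of rubric criteria satisfied, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(c.weight for c in rubric if grader(response, c))
    return earned / total


def rejection_sample(
    candidates: List[str],
    rubric: List[Criterion],
    grader: Grader,
    threshold: float = 0.6,  # hypothetical acceptance cutoff
) -> Optional[str]:
    """Return the best-scoring candidate, or None if none reaches the threshold."""
    if not candidates:
        return None
    scored = [(rubric_score(r, rubric, grader), r) for r in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= threshold else None


def keyword_grader(response: str, criterion: Criterion) -> bool:
    # Toy stand-in for a judge: the criterion counts as met when its
    # description appears verbatim (case-insensitively) in the response.
    return criterion.description.lower() in response.lower()


rubric = [
    Criterion("dosage", 2.0),
    Criterion("see a doctor", 1.0),
    Criterion("side effects", 1.0),
]
candidates = [
    "Take it daily.",
    "Check the dosage and see a doctor.",
    "Dosage, side effects, and when to see a doctor.",
]
kept = rejection_sample(candidates, rubric, keyword_grader)
```

Under these assumptions, the same `rubric_score` value could also serve as a scalar reward during the reinforcement learning stage; finer-grained and more discriminative criteria would then separate strong responses from merely acceptable ones.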
Community
We introduce RubricHub, a large-scale (~110k) and multi-domain rubric dataset constructed via an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces highly discriminative criteria capable of capturing subtle nuances in model responses.
dataset: https://huggingface.co/datasets/sojuL/RubricHub_v1
github: https://github.com/teqkilla/RubricHub
arxiv: https://arxiv.org/abs/2601.08430
alphaXiv: https://www.alphaxiv.org/zh/overview/2601.08430v1
Training code is coming soon!
arXivlens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/rubrichub-a-comprehensive-and-highly-discriminative-rubric-dataset-via-automated-coarse-to-fine-generation-6812-2a435514
- Executive Summary
- Detailed Breakdown
- Practical Applications