Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory • Paper 2505.15055 • Published May 21, 2025