TRUEBench 🔥 Explore and compare language model performance across categories and languages