How do you evaluate LLMs?

Hi, I’m curious how people here actually choose models in practice.

We’re a small research team at the University of Michigan studying real-world LLM evaluation workflows for our capstone project.

We’re trying to understand what actually happens when you:

• Decide which model to ship

• Balance cost, latency, output quality, and memory footprint

• Deal with benchmarks that don't match production

• Handle conflicting signals (metrics vs. gut feeling)

• Figure out what ultimately drives the final decision

If you’ve compared multiple LLMs in a real project (product, development work, research, or a serious build), we’d really value your input.

Short, anonymous survey (~5–8 minutes):

https://forms.gle/xEGuDCZ3UBmitGCz6
