🥇 Ranking LLMs without Ground Truth

This space demonstrates how to rank large language models with access to only the input prompts (for example, only the questions in Q&A tasks), as described in our 2024 ACL Findings paper, "Ranking Large Language Models without Ground Truth".

Source code is included as part of this space. Installation and usage instructions are provided below.

Inspired by real life, where both an expert and a knowledgeable person can identify a novice, the main idea is to consider triplets of models in which each model evaluates the other two, so that the worst model in the triplet is correctly identified with high probability. Iteratively performing such evaluations yields an estimated ranking that doesn't require ground-truth or reference data, which can be expensive to gather. The methods are a viable low-resource ranking mechanism for practical use.
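The triplet idea can be illustrated with a small sketch. It is a simplified illustration rather than the package's implementation: the helper names (prefers_first, worst_in_triplet), the agreement-counting comparison, and the "most losses" rule for naming the worst model are all assumptions made for this example.

```python
import pandas as pd

def prefers_first(a, b, judge, df):
    """True if `judge` agrees with `a` at least as often as with `b`.

    Prompts where a and b give the same answer carry no information
    about which one is better, so they are ignored.
    """
    differ = df[a] != df[b]
    a_wins = int(((df[a] == df[judge]) & differ).sum())
    b_wins = int(((df[b] == df[judge]) & differ).sum())
    return a_wins >= b_wins

def worst_in_triplet(triplet, df):
    """Each model judges the other two; the model that loses most often is returned."""
    losses = {m: 0 for m in triplet}
    for judge in triplet:
        a, b = [m for m in triplet if m != judge]
        loser = b if prefers_first(a, b, judge, df) else a
        losses[loser] += 1
    return max(triplet, key=lambda m: losses[m])

# Toy inferences (hypothetical data): M1 and M2 mostly agree, M3 often deviates.
df = pd.DataFrame({
    "M1": ["a", "a", "b", "c", "d", "a"],
    "M2": ["a", "a", "b", "c", "b", "a"],
    "M3": ["b", "c", "a", "c", "d", "b"],
})
print(worst_in_triplet(("M1", "M2", "M3"), df))  # M3
```

No reference answers appear anywhere in the sketch: the only signal used is how often the models agree with one another.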

Ranking with benchmarks

Using inference data gathered from HELM, we first show how our estimated rankings compare to rankings derived from ground-truth or reference data.

Choose a dataset.

The dataset describes a specific task, either summarization (CNN/DM, XSUM) or multiple choice (MMLU).

Evaluation function

How should the judge model decide the winner? The demo is limited to 'Rouge' for generative tasks like summarization, and 'equality' for multiple choice or classification tasks. In practice, you can use any function that compares the judge's responses to those of the contestant models.

Number of models

Sample a subset of LLMs to rank.

Number of instances

Sample a subset of instances to evaluate (smaller is faster).

Algorithm variant to use

Choose one of two variants: 'Full' (FTR in the paper) runs all triplet combinations and is recommended when evaluations are cheap or datasets are small; 'greedy' (GTR) is a faster variant suggested for more complex evaluations.
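To get a feel for why the full variant becomes expensive, the sketch below counts the triplets in a single pass over all combinations. It only illustrates the combinatorics; the helper name all_triplets is hypothetical, and the sketch says nothing about how many passes either variant actually performs.

```python
from itertools import combinations
from math import comb

def all_triplets(models):
    """Enumerate every unordered triplet of models."""
    return list(combinations(models, 3))

models = [f"M{i}" for i in range(1, 6)]
triplets = all_triplets(models)
print(len(triplets))  # 10 triplets for 5 models
print(comb(50, 3))    # 19600 triplets for 50 models
```

The count grows cubically with the number of models, which is why a cheaper variant is attractive when each evaluation is costly.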

Estimated ranking

(Output table with columns: rank, model)

Comparison to 'true' ranking

Synthetic multiple choice

To analyse our methods, we synthesise data from models with known accuracy in a multiple choice setting, i.e. with a discrete set of possible responses. Several parameters (number of models, model accuracy, number of prompts, number of possible answers, and noise in the comparisons) can affect the quality of the results. Rankings can be recovered in a range of challenging cases, for instance when the accuracy of the underlying models is low or when the evaluation function is noisy and imperfect. When the number of possible answers is low, for example in binary choice settings, recovering rankings becomes challenging. In general, low variance among wrong answers, that is, wrong answers that tend to coincide across models, causes triplet evaluations to treat a wrong answer as the right one.
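A minimal sketch of such a synthetic setup is shown below. It is not the demo's generator: the function name synthesize, its parameters, the default values, and the choice to draw wrong answers uniformly from the remaining options are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def synthesize(n_models=5, n_prompts=100, n_answers=10,
               acc_range=(0.3, 0.9), seed=0):
    """Simulate multiple-choice answers from models with known accuracy."""
    if n_answers < 2:
        raise ValueError("need at least two possible answers")
    rng = np.random.default_rng(seed)
    # Model accuracies equally spaced in the requested range
    accuracies = np.linspace(acc_range[0], acc_range[1], n_models)
    # Hidden correct answer for every prompt
    truth = rng.integers(0, n_answers, size=n_prompts)
    columns = {}
    for i, acc in enumerate(accuracies, start=1):
        is_correct = rng.random(n_prompts) < acc
        # A non-zero offset guarantees the wrong answer differs from the truth
        offset = rng.integers(1, n_answers, size=n_prompts)
        wrong = (truth + offset) % n_answers
        columns[f"M{i}"] = np.where(is_correct, truth, wrong)
    return pd.DataFrame(columns), truth, accuracies

df, truth, accuracies = synthesize()
print(df.shape)  # (100, 5)
```

With n_answers=2 every wrong answer is the same option, which is the binary-choice situation the paragraph above describes as hard: wrong answers always coincide.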

Number of models to synthesise.

Equally spaced in the accuracy range.

Range: 3 to 50.

Number of possible (discrete) answers per prompt.

Range: 2 to 50.

Number of prompts to simulate.

Range: 10 to 100.

Noise in evaluation (p)

Evaluation function decisions flipped with probability p. p=0 implies no noise.

Range: 0 to 1.
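The noise model described above can be sketched as a wrapper around any evaluation function that returns 1 or 0. The wrapper name with_noise and its seed argument are hypothetical and not part of the package; the four-argument evaluator signature follows the one documented in the usage section.

```python
import random

def with_noise(evaluator, p, seed=None):
    """Wrap an evaluator so that each decision is flipped with probability p."""
    rng = random.Random(seed)

    def noisy(a, b, c, df):
        decision = evaluator(a, b, c, df)
        # Flip 1 <-> 0 with probability p; p=0 leaves every decision untouched
        return 1 - decision if rng.random() < p else decision

    return noisy

# Hypothetical evaluator that always prefers the first model
always_first = lambda a, b, c, df: 1

print(with_noise(always_first, p=0.0)("M1", "M2", "M3", None))  # 1
print(with_noise(always_first, p=1.0)("M1", "M2", "M3", None))  # 0
```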

Algorithm variant to use

Some interesting cases (click and run)

Estimated vs. true ranking

Using on your data

Source code is available as a pip-installable Python package.

Installation

Use of a virtual environment is recommended.

conda create -n selfrank python=3.10

Activate the virtual environment

conda activate selfrank

and then install,

pip install git+https://huggingface.co/spaces/ibm/llm-rank-themselves.git

Usage

Start by gathering model inferences for the same set of questions/prompts across all the models you want to rank. The ranking method expects a pandas DataFrame with one row per prompt and one column per model, for example:

	M1	M2	M3	...
Q1	a	a	b	...
Q2	a	b	b	...
...	...	...	...	...
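As a sketch of how such a table might be prepared, the snippet below builds the first two rows of the example and round-trips them through CSV. Writing without the index column is an assumption inferred from the usage snippet in this section, which treats every column of the loaded file as a model; the in-memory buffer stands in for a file such as inferences.csv.

```python
import io
import pandas as pd

# One row per prompt, one column per model (values taken from the table above)
inferences = pd.DataFrame(
    {"M1": ["a", "a"], "M2": ["a", "b"], "M3": ["b", "b"]},
    index=["Q1", "Q2"],
)

# Drop the prompt labels when saving so that every CSV column is a model
buffer = io.StringIO()
inferences.to_csv(buffer, index=False)
buffer.seek(0)

df = pd.read_csv(buffer)
models_to_rank = df.columns.tolist()
print(models_to_rank)  # ['M1', 'M2', 'M3']
```

If the prompt labels are kept in the file, loading with pd.read_csv(path, index_col=0) keeps them out of the model columns.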

With this data, the self ranking procedure can be invoked as follows:

import pandas as pd
from selfrank.algos.iterative import SelfRank # The full ranking algorithm
from selfrank.algos.greedy import SelfRankGreedy # The greedy version
from selfrank.algos.triplet import rouge, equality

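# CSV with one row per prompt and one column per model
f = "inferences.csv"
df = pd.read_csv(f)

models_to_rank = df.columns.tolist()
evaluator = rouge  # use `equality` for multiple choice or classification tasks
true_ranking = None  # optional: a known ranking, needed for r.measure(...)

r = SelfRank(models_to_rank, evaluator, true_ranking)
# or, for the greedy version
# r = SelfRankGreedy(models_to_rank, evaluator, true_ranking)
r.fit(df)
print(r.ranking)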

This should output the estimated ranking (best to worst): ['M5', 'M2', 'M1', ...]. If the true ranking is known, evaluation measures can be computed with r.measure(metric='rbo') for rank-biased overlap or r.measure(metric='mapk') for mean average precision at k.
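For intuition about what rank-biased overlap measures, here is a standalone sketch of the extrapolated RBO for two rankings without duplicates, evaluated to the depth of the shorter list. It is independent of the package: the function name rbo_ext and the default p=0.9 are assumptions, and r.measure(metric='rbo') may compute a different variant.

```python
def rbo_ext(ranking_a, ranking_b, p=0.9):
    """Extrapolated rank-biased overlap of two rankings (1.0 = identical)."""
    k = min(len(ranking_a), len(ranking_b))
    if k == 0:
        return 0.0
    seen_a, seen_b = set(), set()
    weighted_sum = 0.0
    overlap = 0
    for d in range(1, k + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        # Number of items shared by the two top-d prefixes
        overlap = len(seen_a & seen_b)
        weighted_sum += (overlap / d) * p ** d
    # Agreement at depth k is extrapolated to the unseen tail
    return (overlap / k) * p ** k + ((1 - p) / p) * weighted_sum

estimated = ["M5", "M2", "M1"]
print(rbo_ext(estimated, ["M5", "M2", "M1"]))  # approximately 1.0 for identical rankings
print(rbo_ext(estimated, ["M7", "M8", "M9"]))  # 0.0 for disjoint rankings
```

Smaller values of p put more weight on agreement at the top of the ranking.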

We provide implementations of a few evaluation functions, i.e. the functions the judge model uses to evaluate the contestant models. While rouge is recommended for generative tasks like summarization, equality is more appropriate for multiple choice settings (like MMLU) or classification tasks with a discrete set of outcomes.

You can also pass any arbitrary function to the ranker, as long as it has the following signature:

import pandas as pd

def user_function(a: str, b: str, c: str, df: pd.DataFrame) -> int:
    """
    Use model c to evaluate a vs. b.
    df: a dataframe with inferences of all models
    Returns 1 if a is preferred or 0 if b is preferred.
    """

    # In this example, we count the number of times a/b gives the same answer as c,
    # ignoring prompts where a and b agree with each other
    ties = df[a] == df[b]
    a_wins = sum((df[a] == df[c]) & ~(ties))
    b_wins = sum((df[b] == df[c]) & ~(ties))

    # a is preferred when the win counts are equal
    if a_wins >= b_wins:
        return 1
    else:
        return 0