---
title: TRAM Accuracy
datasets:
- Warrieryes/TRAM-Temporal
tags:
- evaluate
- metric
- temporal reasoning
- multiple choice
description: Accuracy metric for the (multiple choice) TRAM benchmark by Wang et al. (2024).
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
emoji: 🚊
colorFrom: red
colorTo: gray
---

# Metric Card for TRAM Accuracy

## Metric Description

This metric is designed for the **TRAM** benchmark (Wang et al., 2024). It measures the accuracy of model predictions on multiple-choice temporal reasoning tasks.

The metric expects model outputs to contain the answer in the format `"The final answer is (X)"`, where X is a letter from A to D. It performs the following steps:

1. Extracts the final answer from the model's prediction string using a regex pattern that matches `"The final answer is (A/B/C/D)"`.
2. Compares the extracted letter to the reference answer.
3. Calculates accuracy as the proportion of correct matches.

## How to Use

You can load the metric using the `evaluate` library:

```python
import evaluate

metric = evaluate.load("aauss/tram_accuracy")

predictions = [
    "Let me analyze this step by step... The final answer is (A).",
    "Based on my reasoning, the events should be ordered as shown. The final answer is (B).",
    "After careful consideration, The final answer is (C).",
]
references = ["A", "B", "D"]

# Get average accuracy
result = metric.compute(
    predictions=predictions,
    references=references,
)
print(result)
# {'accuracy': 0.6666666666666666}

# Get per-sample accuracy
result = metric.compute(
    predictions=predictions,
    references=references,
    return_average=False,
)
print(result)
# {'accuracy': [1, 1, 0]}
```

### Inputs

- **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string containing the model's response, which must include the final answer in the format `"The final answer is (X)"`, where X is A, B, C, or D.
- **references** (`list` of `str`): List of reference answers. Each reference should be a single letter (A, B, C, or D) representing the correct answer.
- **return_average** (`bool`, optional): If `True`, returns the average accuracy as a float. If `False`, returns a list of binary scores (1 for correct, 0 for incorrect), one per sample. Defaults to `True`.

### Output Values

The metric returns a dictionary with the following key:

- **accuracy** (`float` or `list` of `int`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of binary values (0 or 1) indicating per-sample correctness if `return_average=False`.

The metric can take on any value between 0.0 and 1.0, inclusive. Higher scores indicate better performance.

#### Reported performance from original publication

Refer to the [original TRAM paper](https://arxiv.org/abs/2310.00835) for baseline performance values across various language models.

## Limitations and Bias

- The metric relies on the regex pattern `[Tt]he final answer is .([A-D]).` to extract the answer. If the model output does not follow this exact format, extraction fails and the prediction is marked as incorrect (see the sketch below).
- The metric is case-insensitive for "The/the" but requires the answer letter to be uppercase (A-D).
- Only multiple-choice questions with options A through D are supported.
- If a prediction contains multiple instances of the pattern, only the first match is used.
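The snippet below is a minimal sketch of the extraction-and-scoring logic described above, using the regex quoted in the first limitation. It is not the metric's actual implementation; the helper names `extract_answer` and `score` are illustrative only. It shows how a prediction that omits the expected phrase yields no match and is scored as incorrect.

```python
import re

# Regex quoted in the limitations: case-insensitive "The", uppercase answer
# letter, one arbitrary character on each side of the letter (the parentheses).
ANSWER_PATTERN = re.compile(r"[Tt]he final answer is .([A-D]).")


def extract_answer(prediction: str) -> str | None:
    """Return the first matched answer letter, or None if the pattern is absent."""
    match = ANSWER_PATTERN.search(prediction)
    return match.group(1) if match else None


def score(predictions: list[str], references: list[str], return_average: bool = True):
    """Illustrative scoring: per-sample binary scores, optionally averaged."""
    scores = [int(extract_answer(p) == r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if return_average else scores


# A prediction without the expected phrase yields None and is scored 0.
print(extract_answer("I think the answer is B."))         # None
print(score(["The final answer is (B)."], ["B"]))         # 1.0
print(score(["I think the answer is B."], ["B"], False))  # [0]
```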
## Citation

```bibtex
@InProceedings{auss:tram_accuracy,
  title  = {TRAM Accuracy},
  author = {Auss Abbood},
  year   = {2025}
}
```

## Further References

- [TRAM Benchmark Paper](https://arxiv.org/abs/2310.00835)
- [TRAM Dataset on Hugging Face](https://huggingface.co/datasets/Warrieryes/TRAM-Temporal)