arxiv:2509.00691

CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders

Published on Aug 31
Authors:

AI-generated summary

CE-Bench is a new contrastive evaluation benchmark for sparse autoencoders that measures interpretability without needing an external LLM.

Abstract

Probing with sparse autoencoders is a promising approach for uncovering interpretable features in large language models (LLMs). However, the lack of automated evaluation methods has hindered their broader adoption and development. In this work, we introduce CE-Bench, a novel and lightweight contrastive evaluation benchmark for sparse autoencoders, built on a curated dataset of contrastive story pairs. We conduct comprehensive ablation studies to validate the effectiveness of our approach. Our results show that CE-Bench reliably measures the interpretability of sparse autoencoders and aligns well with existing benchmarks, all without requiring an external LLM. The official implementation and evaluation dataset are open-sourced under the MIT License.
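The abstract's core idea is to probe a sparse autoencoder (SAE) with pairs of contrastive stories and ask whether individual latents separate the two sides of each pair. The paper defines its own dataset and scoring; the sketch below is only a hypothetical illustration of that contrastive idea, assuming SAE activations for one story pair are already available as token-by-latent arrays. The function name and the standardized-mean-difference score are illustrative choices, not CE-Bench's actual metric.

```python
import numpy as np

def contrastive_separation_scores(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """Score each SAE latent by how strongly it separates one contrastive story pair.

    acts_a, acts_b: arrays of shape (num_tokens, num_latents) holding SAE
    activations for the two sides of the pair. Returns one non-negative
    score per latent (higher = the latent fires more differently across the pair).
    """
    # Mean activation of each latent on each side of the pair.
    mean_a = acts_a.mean(axis=0)
    mean_b = acts_b.mean(axis=0)
    # Pooled standard deviation to normalize the mean difference (epsilon avoids division by zero).
    pooled_std = np.sqrt(0.5 * (acts_a.var(axis=0) + acts_b.var(axis=0))) + 1e-8
    return np.abs(mean_a - mean_b) / pooled_std

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in (random, ReLU-clipped) activations for one pair: 50 tokens, 512 latents.
    acts_a = np.maximum(rng.normal(size=(50, 512)), 0.0)
    acts_b = np.maximum(rng.normal(size=(50, 512)), 0.0)
    scores = contrastive_separation_scores(acts_a, acts_b)
    # One possible aggregate: how sharply the most contrastive latent stands out.
    print("top contrast score:", scores.max())
```

In CE-Bench itself, the contrastive pairs come from the released story dataset and the per-latent scores are aggregated as specified in the paper; the point of the sketch is only that such a score can be computed directly from SAE activations, without querying an external LLM.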
