SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

This repository contains sparse autoencoder (SAE) models described in the paper SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability. SAEBench is a comprehensive evaluation suite that measures SAE performance across seven diverse metrics, spanning interpretability, feature disentanglement, and practical applications such as unlearning.
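
Below is a minimal sketch of how one might download a checkpoint from this repository with huggingface_hub and run a generic SAE encode/decode pass. The checkpoint filename (ae.pt) and state-dict keys (W_enc, b_enc, W_dec, b_dec) are assumptions for illustration only; check the repository's file listing for the actual layout.

```python
# Hedged sketch: download an SAE checkpoint and apply a standard
# encode/decode pass. Filename and state-dict keys are assumed, not
# confirmed by the repository.
import torch
from huggingface_hub import hf_hub_download

REPO_ID = "adamkarvonen/saebench_pythia-160m-deduped_width-2pow14_date-0108"

# Hypothetical filename; list the repo files to find the real checkpoint path.
ckpt_path = hf_hub_download(repo_id=REPO_ID, filename="ae.pt")
state = torch.load(ckpt_path, map_location="cpu")

# Assumed parameter names for a standard SAE.
W_enc, b_enc = state["W_enc"], state["b_enc"]
W_dec, b_dec = state["W_dec"], state["b_dec"]

def sae_forward(x: torch.Tensor) -> torch.Tensor:
    """Encode activations x into sparse features, then reconstruct them."""
    z = torch.relu(x @ W_enc + b_enc)   # sparse feature activations
    return z @ W_dec + b_dec            # reconstructed activations

# Example: a dummy batch with Pythia-160m's hidden size (768).
x = torch.randn(4, 768)
print(sae_forward(x).shape)
```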

Model repository: adamkarvonen/saebench_pythia-160m-deduped_width-2pow14_date-0108