Interpretable Reward Model via Sparse Autoencoder
Paper: arXiv:2508.08746
This repository contains the model weights for the AAAI 2026 oral paper "Interpretable Reward Model via Sparse Autoencoder".
We release Llama-SARM-4B-PostSAEPretrain, which has the same architecture as Llama-SARM-4B.
Authors
Shuyi Zhang*, Wei Shi*, Sihang Li*, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang†
Code Repository: https://github.com/schrieffer-z/sarm
If you have any questions, please feel free to reach us at [email protected].
If you find our work useful, please cite it as follows.
@article{zhang2025interpretable,
title={Interpretable Reward Model via Sparse Autoencoder},
author={Zhang, Shuyi and Shi, Wei and Li, Sihang and Liao, Jiayi and Liang, Tao and Cai, Hengxing and Wang, Xiang},
journal={arXiv preprint arXiv:2508.08746},
year={2025}
}