SARM: Interpretable Reward Model via Sparse Autoencoder

This repository contains the model weights for the AAAI 2026 oral paper "Interpretable Reward Model via Sparse Autoencoder".

We release Llama-SARM-4B-PostSAEPretrain, which has an identical architecture to Llama-SARM-4B:

Backbone: Initialized from the first 16 decoder layers of Llama-3.1-8B-Instruct.
SAE encoder: Initialized from the pretrained TopK SAE at layer 16 (latent size 65,536, Top-K = 192).
SAE decoder: Not used in the current forward pass, but kept for potential future use.
Score head: Left untrained for reproducibility and initialized to all zeros to facilitate interpretability and downstream customization.
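The components listed above can be illustrated with a small sketch. It is not the repository's implementation: it uses toy sizes in place of the real latent size of 65,536 and Top-K of 192, a random vector in place of the layer-16 hidden state, and hypothetical helper names (`topk_sparsify`, `sae_encode`, `score`). It also assumes a plain linear encoder followed by Top-K selection, which may differ from the actual SAE in details such as bias handling or activation.

```python
import random

def topk_sparsify(pre_acts, k):
    """Keep the k largest pre-activations and zero out the rest."""
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    latents = [0.0] * len(pre_acts)
    for i in order[:k]:
        latents[i] = pre_acts[i]
    return latents

def sae_encode(hidden, w_enc, b_enc, k):
    """Assumed encoder form: linear projection, then Top-K sparsification."""
    pre_acts = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_enc, b_enc)
    ]
    return topk_sparsify(pre_acts, k)

def score(latents, head_weights):
    """Linear score head: the reward is a weighted sum of sparse latents."""
    return sum(w * z for w, z in zip(head_weights, latents))

# Toy sizes for illustration only (the released model uses 65,536 and 192).
HIDDEN, LATENT, K = 8, 32, 4
rng = random.Random(0)

hidden = [rng.gauss(0, 1) for _ in range(HIDDEN)]  # stand-in for a hidden state
w_enc = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(LATENT)]
b_enc = [0.0] * LATENT

latents = sae_encode(hidden, w_enc, b_enc, K)

# A zero-initialized score head gives a reward of 0 for every input.
zero_head = [0.0] * LATENT
reward = score(latents, zero_head)

# Downstream customization: put weight on a single active latent, so the
# reward is exactly that latent's activation.
active = next(i for i, z in enumerate(latents) if z != 0.0)
custom_head = [0.0] * LATENT
custom_head[active] = 1.0
custom_reward = score(latents, custom_head)
```

Because the score is linear in the sparse latents, each nonzero latent's contribution to the reward is simply its activation times its head weight, which is what makes a trained head straightforward to inspect.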

🔥 News

[2025/11/8] Our paper has been accepted as an oral presentation at AAAI 2026. 🎉
[2025/12/11] Llama-SARM-4B ranks 18th on the RewardBench 2 leaderboard, ahead of GPT-4.1, Skywork-Reward-Llama-3.1-8B, and Claude-Sonnet-4! 🎉

🔗 Links

Authors

Shuyi Zhang*, Wei Shi*, Sihang Li*, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang†
Paper: Interpretable Reward Model via Sparse Autoencoder
Code Repository: https://github.com/schrieffer-z/sarm
Demo: Try the SARM demo on Hugging Face Spaces

📧 Contact

If you have any questions, please feel free to reach us at [email protected].

📚 Citation

If you find our work useful, please cite it as follows.

@article{zhang2025interpretable,
  title={Interpretable Reward Model via Sparse Autoencoder},
  author={Zhang, Shuyi and Shi, Wei and Li, Sihang and Liao, Jiayi and Liang, Tao and Cai, Hengxing and Wang, Xiang},
  journal={arXiv preprint arXiv:2508.08746},
  year={2025}
}

Model size: 5B params (Safetensors)
Tensor type: BF16

Paper: Interpretable Reward Model via Sparse Autoencoder (arXiv:2508.08746, published Aug 12, 2025)