How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities
Abstract
Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.
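The three-level hierarchy (L1 intent, L2 expression, L3 instantiation) across the three domains can be sketched as a minimal control specification. The abstract does not give the benchmark's actual schema, so the names below (`ControlSpec`, `DOMAINS`, `LEVELS`) are purely illustrative assumptions:

```python
from dataclasses import dataclass

# Hypothetical sketch of a SteerEval-style control specification.
# DOMAINS and LEVELS mirror the structure described in the abstract;
# the concrete class and field names are assumptions, not the paper's API.
DOMAINS = ("language_features", "sentiment", "personality")
LEVELS = {
    "L1": "what to express",     # high-level behavioral intent
    "L2": "how to express",      # stylistic / behavioral realization
    "L3": "how to instantiate",  # concrete textual output
}

@dataclass(frozen=True)
class ControlSpec:
    domain: str  # one of DOMAINS
    level: str   # one of LEVELS
    target: str  # e.g. a "positive" sentiment target

    def __post_init__(self):
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain}")
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level}")

# Example: a fine-grained (L3) sentiment target, the level at which
# the paper reports control most often degrades.
spec = ControlSpec(domain="sentiment", level="L3", target="positive")
print(spec.level, "-", LEVELS[spec.level])
```

A structure like this makes the benchmark's central claim testable per cell: a steering method is evaluated separately at each (domain, level) pair, so degradation from L1 to L3 can be measured directly.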
Community
We propose SteerEval, a hierarchical benchmark that systematically evaluates LLM controllability from high-level behavioral intent to fine-grained textual realization, revealing degradation in control at deeper specification levels and providing a principled framework for safer, more interpretable model steering.
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures (2026)
- CLaS-Bench: A Cross-Lingual Alignment and Steering Benchmark (2026)
- Steer Model beyond Assistant: Controlling System Prompt Strength via Contrastive Decoding (2026)
- YaPO: Learnable Sparse Activation Steering Vectors for Domain Adaptation (2026)
- Can Large Language Models Make Everyone Happy? (2026)
- Controllable Value Alignment in Large Language Models through Neuron-Level Editing (2026)
- Mind the Performance Gap: Capability-Behavior Trade-offs in Feature Steering (2026)