Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
Paper: arXiv:2512.20578
Gnosis is a lightweight self-awareness head that attaches to a frozen LLM and predicts a scalar correctness probability for a generated response. It reads the backbone’s internal signals—hidden-state features (latent dynamics) and attention-map patterns—to learn reliable hallucination / error cues directly from the model.
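To make the idea of "internal signals in, scalar probability out" concrete, here is a toy sketch of a probe over pooled features. It is not the Gnosis architecture: the feature choices (mean-pooled hidden states, mean attention entropy), the function names, and the hand-set weights are all assumptions for illustration. A real head would learn its parameters from responses labeled correct or incorrect.

```python
import math

def mean_pool(hidden_states):
    """Average per-token hidden vectors into a single feature vector."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[d] for tok in hidden_states) / n for d in range(dim)]

def attention_entropy(attn_rows):
    """Mean Shannon entropy (in nats) of attention rows; each row sums to 1."""
    total = 0.0
    for row in attn_rows:
        total += -sum(p * math.log(p) for p in row if p > 0)
    return total / len(attn_rows)

def toy_correctness_prob(hidden_states, attn_rows, weights, bias):
    """Hypothetical logistic probe: features -> logit -> probability in (0, 1)."""
    feats = mean_pool(hidden_states) + [attention_entropy(attn_rows)]
    logit = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Tiny made-up inputs: 2 tokens, hidden size 2, one 2x2 attention map.
hidden = [[0.2, -0.1], [0.4, 0.3]]
attn = [[1.0, 0.0], [0.5, 0.5]]
p = toy_correctness_prob(hidden, attn, weights=[1.0, 1.0, -1.0], bias=0.0)
print(f"toy correctness probability: {p:.4f}")
```

The backbone stays untouched in this picture: only the small probe has parameters, which is what keeps such a head lightweight.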
Training data: a mixed math + trivia corpus.
This repo requires the local Transformers fork with Gnosis integrated (see the GitHub repo instructions). After installing it, run:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from src.demo import build_chat_prompt, generate_with_hf, correctness_prob
GNOSIS_MODEL_ID = "AmirhoseinGH/Gnosis-Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(GNOSIS_MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    GNOSIS_MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()
prompt = build_chat_prompt(
    tokenizer,
    question="How many r's are in strawberry?",
    system_prompt="Please reason step by step, and put your final answer within \\boxed{}.",
)
answer = generate_with_hf(model, tokenizer, prompt, torch.device("cuda"), max_new_tokens=2048)
p_correct = correctness_prob(model, tokenizer, prompt + answer, torch.device("cuda"))
print("Answer:\n", answer)
print("Gnosis correctness probability:", f"{p_correct:.4f}")
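One way a score like `p_correct` might be used downstream is to choose among several sampled answers, or to abstain when none looks reliable. The sketch below assumes you have already collected `(answer, p_correct)` pairs; the helper `pick_or_abstain`, the 0.5 threshold, and the scores shown are hypothetical and not part of the Gnosis repo.

```python
def pick_or_abstain(scored_answers, threshold=0.5):
    """Return the highest-scoring answer, or None if none clears the threshold.

    scored_answers: list of (answer_text, p_correct) pairs.
    """
    if not scored_answers:
        return None
    best_answer, best_p = max(scored_answers, key=lambda pair: pair[1])
    return best_answer if best_p >= threshold else None

# Made-up scores for three sampled answers to the same question.
candidates = [("2", 0.18), ("3", 0.91), ("4", 0.07)]
choice = pick_or_abstain(candidates, threshold=0.5)
print("Selected answer:", choice)
```

The threshold trades coverage for reliability, so it is worth tuning on held-out data rather than fixing it in advance.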
@article{ghasemabadi2025llms,
  title={Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits},
  author={Ghasemabadi, Amirhosein and Niu, Di},
  journal={arXiv preprint arXiv:2512.20578},
  year={2025}
}