Gnosis — Qwen3-8B (Self-Awareness Correctness Head)

Gnosis is a lightweight self-awareness head that attaches to a frozen LLM and predicts a scalar correctness probability for a generated response. It reads the backbone’s internal signals—hidden-state features (latent dynamics) and attention-map patterns—to learn reliable hallucination / error cues directly from the model.

Paper: Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
Project code & instructions: https://github.com/Amirhosein-gh98/Gnosis

Why it matters

Strong verifier signal without a large external reward model (no RM routing / no judge LLM calls).
~1000× smaller than 8B reward-model verifiers (**5M params vs ~8B**).
~100× faster than routing through an ~8B reward model.
Early error detection: can flag likely errors before generation finishes.
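
As a rough illustration of the early-detection point above, the sketch below flags a run once the correctness probability, scored on successively longer prefixes of a response, stays below a threshold for a few consecutive checkpoints. The function name `first_flagged_step`, the `threshold` and `patience` parameters, and the checkpointing scheme are all hypothetical choices made for this example; they are not taken from the paper or the Gnosis repo.

```python
from typing import Optional, Sequence


def first_flagged_step(
    probs: Sequence[float],
    threshold: float = 0.5,   # assumed cutoff, not a value from the paper
    patience: int = 2,        # assumed number of consecutive low scores required
) -> Optional[int]:
    """Return the checkpoint index at which a run is flagged, or None.

    `probs` holds correctness probabilities scored on successively longer
    prefixes of one response. The run is flagged at the first checkpoint
    where `patience` consecutive scores have fallen below `threshold`.
    """
    below = 0
    for i, p in enumerate(probs):
        # Count consecutive low scores; any score at or above the threshold resets the count.
        below = below + 1 if p < threshold else 0
        if below >= patience:
            return i
    return None


if __name__ == "__main__":
    # Hypothetical scores at four checkpoints of a single generation.
    print(first_flagged_step([0.9, 0.4, 0.3, 0.2]))
```

Requiring several consecutive low scores is one simple way to avoid stopping on a single noisy estimate; a real deployment would need to tune both values on held-out data.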

Evaluated backbones & benchmarks (from the paper)

Backbones: Qwen3 family + OpenAI gpt-oss-20B.
Benchmarks: Math-Reasoning (AMC12 2022/2023, AIME 2024/2025, HMMT Feb 2025), Open-Domain QA (18k held-out TriviaQA), Academic Knowledge Reasoning (MMLU-Pro).

Training data

Mixed math + trivia training corpus:

Math: English portion of DAPO-Math-17k (~14k).
Trivia: 40k subsample from TriviaQA training set.
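
A minimal sketch of how a mixed corpus like the one above could be assembled: keep all math items, draw a fixed-size random subsample of trivia items, tag each record with its domain, and shuffle. The helper `build_mixed_corpus`, the dict record format, the `domain` field, and the seeding are assumptions for illustration only; the card does not describe the authors' actual sampling or preprocessing code.

```python
import random
from typing import List, Sequence


def build_mixed_corpus(
    math_items: Sequence[dict],
    trivia_items: Sequence[dict],
    trivia_subsample: int = 40_000,  # matches the 40k figure quoted above
    seed: int = 0,                   # assumed; fixed only for reproducibility
) -> List[dict]:
    """Combine all math items with a random subsample of trivia items."""
    rng = random.Random(seed)
    # Never request more trivia items than are available.
    k = min(trivia_subsample, len(trivia_items))
    trivia = rng.sample(list(trivia_items), k)
    # Tag each record with its source domain (hypothetical field name).
    mixed = [dict(item, domain="math") for item in math_items]
    mixed += [dict(item, domain="trivia") for item in trivia]
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Tiny placeholder records standing in for the real datasets.
    math = [{"question": f"m{i}"} for i in range(3)]
    trivia = [{"question": f"t{i}"} for i in range(10)]
    corpus = build_mixed_corpus(math, trivia, trivia_subsample=4, seed=1)
    print(len(corpus))
```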

Usage (inference)

This repo requires the local Transformers fork with Gnosis integrated (see the GitHub repo instructions). After installing it, run:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from src.demo import build_chat_prompt, generate_with_hf, correctness_prob

GNOSIS_MODEL_ID = "AmirhoseinGH/Gnosis-Qwen3-8B"

tokenizer = AutoTokenizer.from_pretrained(GNOSIS_MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    GNOSIS_MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()

prompt = build_chat_prompt(
    tokenizer,
    question="How many r's are in strawberry?",
    system_prompt="Please reason step by step, and put your final answer within \\boxed{}.",
)

# Generate a response, then score the full prompt + response with the Gnosis head.
answer = generate_with_hf(model, tokenizer, prompt, torch.device("cuda"), max_new_tokens=2048)
p_correct = correctness_prob(model, tokenizer, prompt + answer, torch.device("cuda"))

print("Answer:\n", answer)
print("Gnosis correctness probability:", f"{p_correct:.4f}")
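
One way a score like `p_correct` might be used downstream is to choose among several sampled answers and abstain when none looks reliable. The sketch below assumes each candidate has already been scored (for example with `correctness_prob` as in the snippet above); the helper `select_answer` and the `min_prob` cutoff are hypothetical and not part of the Gnosis repo.

```python
from typing import List, Optional, Tuple


def select_answer(
    scored: List[Tuple[str, float]],
    min_prob: float = 0.5,  # assumed abstention cutoff, to be tuned per task
) -> Optional[Tuple[str, float]]:
    """Return the (answer, probability) pair with the highest probability.

    Returns None when the list is empty or when even the best candidate
    falls below `min_prob`, so the caller can abstain or resample.
    """
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= min_prob else None


if __name__ == "__main__":
    # Hypothetical (answer, correctness probability) pairs for one question.
    candidates = [("3", 0.91), ("2", 0.12), ("3", 0.88)]
    print(select_answer(candidates))
```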

Citation

@article{ghasemabadi2025llms,
  title={Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits},
  author={Ghasemabadi, Amirhosein and Niu, Di},
  journal={arXiv preprint arXiv:2512.20578},
  year={2025}
}