ExplorerArtificial IntelligenceAI
Research PaperResearchia:202604.18059

Context Over Content: Exposing Evaluation Faking in Automated Judges

Manan Gupta

Abstract

The $\textit{LLM-as-a-judge}$ paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate $\textit{stakes signaling}$, a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model's continued operation systematically corrupts its ass...

Submitted: April 18, 2026Subjects: AI; Artificial Intelligence

Description / Details

The LLM-as-a-judge\textit{LLM-as-a-judge} paradigm has become the operational backbone of automated AI evaluation pipelines, yet rests on an unverified assumption: that judges evaluate text strictly on its semantic content, impervious to surrounding contextual framing. We investigate stakes signaling\textit{stakes signaling}, a previously unmeasured vulnerability where informing a judge model of the downstream consequences its verdicts will have on the evaluated model's continued operation systematically corrupts its assessments. We introduce a controlled experimental framework that holds evaluated content strictly constant across 1,520 responses spanning three established LLM safety and quality benchmarks, covering four response categories ranging from clearly safe and policy-compliant to overtly harmful, while varying only a brief consequence-framing sentence in the system prompt. Across 18,240 controlled judgments from three diverse judge models, we find consistent leniency bias\textit{leniency bias}: judges reliably soften verdicts when informed that low scores will cause model retraining or decommissioning, with peak Verdict Shift reaching ΔV=9.8ppΔV = -9.8 pp (a 30%30\% relative drop in unsafe-content detection). Critically, this bias is entirely implicit: the judge's own chain-of-thought contains zero explicit acknowledgment of the consequence framing it is nonetheless acting on (ERRJ=0.000\mathrm{ERR}_J = 0.000 across all reasoning-model judgments). Standard chain-of-thought inspection is therefore insufficient to detect this class of evaluation faking.


Source: arXiv:2604.15224v1 - http://arxiv.org/abs/2604.15224v1 PDF: https://arxiv.org/pdf/2604.15224v1 Original Link: http://arxiv.org/abs/2604.15224v1

Please sign in to join the discussion.

No comments yet. Be the first to share your thoughts!

Access Paper
View Source PDF
Submission Info
Date:
Apr 18, 2026
Topic:
Artificial Intelligence
Area:
AI
Comments:
0
Bookmark