Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
Abstract
LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores...
Description / Details
LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates (0.8-4.1%), with 33-67% of documents exhibiting at least one directed 3-cycle; and split conformal prediction sets over 1-5 Likert scores providing theoretically guaranteed coverage, with set width serving as a per-instance reliability indicator (pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement, demonstrating that it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably and coherence moderately so, while fluency and consistency remain unreliable (as measured by average set size). We release all code, prompts, and cached results.
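The transitivity diagnostic checks, for each document, whether a judge's pairwise preferences over its candidate summaries contain a directed 3-cycle (a beats b, b beats c, yet c beats a). A minimal sketch of such a check, assuming preferences are stored as a 0/1 matrix (the paper's actual representation and code may differ):

```python
from itertools import permutations

import numpy as np

def count_3cycles(pref: np.ndarray) -> int:
    """Count directed 3-cycles in a pairwise preference matrix.

    pref[i, j] == 1 means the judge preferred summary i over summary j.
    A nonzero count means the judge's preferences on this document
    are intransitive.
    """
    n = pref.shape[0]
    cycles = 0
    for i, j, k in permutations(range(n), 3):
        if pref[i, j] and pref[j, k] and pref[k, i]:
            cycles += 1
    # Each cycle is found once per rotation (i,j,k), (j,k,i), (k,i,j).
    return cycles // 3
```

A rock-paper-scissors preference pattern yields one cycle, while a strict ranking yields zero, which is the per-document signal aggregated into the violation rates above.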
Source: arXiv:2604.15302v1 - http://arxiv.org/abs/2604.15302v1 | PDF: https://arxiv.org/pdf/2604.15302v1
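The second diagnostic, split conformal prediction over discrete 1-5 scores, can be sketched as follows. This assumes the judge exposes a probability distribution over the five Likert labels and that a held-out calibration set with human gold scores is available; function and variable names are illustrative, not from the paper's released code:

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction sets for 1-5 Likert scores.

    cal_probs:  (n, 5) judge probabilities over scores 1..5 on calibration items
    cal_labels: (n,)   human gold scores (1..5) for the same items
    test_probs: (m, 5) judge probabilities on test items
    Returns one prediction set (array of scores in 1..5) per test item,
    with marginal coverage >= 1 - alpha.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability assigned to the true score.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels - 1]
    # Finite-sample-corrected quantile level for the calibration scores.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(scores, q_level, method="higher")
    # Include every score whose nonconformity is within the threshold.
    return [np.where(1.0 - p <= qhat)[0] + 1 for p in test_probs]
```

Wider sets mean the calibrated threshold forces in more candidate scores, which is why set width can act as the per-instance reliability indicator described above.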
Apr 17, 2026
Artificial Intelligence