Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
Abstract
LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores...
Description / Details
LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates (0.8-4.1%), with 33-67% of documents exhibiting at least one directed 3-cycle; and split conformal prediction sets over 1-5 Likert scores providing theoretically guaranteed coverage, with set width serving as a per-instance reliability indicator (pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement, demonstrating that it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably and coherence moderately so, while fluency and consistency remain unreliable (as measured by average set size). We release all code, prompts, and cached results.
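The transitivity diagnostic checks, for each document, whether a judge's pairwise preferences over its candidate summaries contain a directed 3-cycle (a beats b, b beats c, yet c beats a). A minimal sketch of such a check, assuming preferences are stored as a 0/1 matrix (the paper's actual representation and code may differ):

```python
from itertools import permutations

import numpy as np

def count_3cycles(pref: np.ndarray) -> int:
    """Count directed 3-cycles in a pairwise preference matrix.

    pref[i, j] == 1 means the judge preferred summary i over summary j.
    A nonzero count means the judge's preferences on this document
    are intransitive.
    """
    n = pref.shape[0]
    cycles = 0
    for i, j, k in permutations(range(n), 3):
        if pref[i, j] and pref[j, k] and pref[k, i]:
            cycles += 1
    # Each cycle is found once per rotation (i,j,k), (j,k,i), (k,i,j).
    return cycles // 3
```

A rock-paper-scissors preference pattern yields one cycle, while a strict ranking yields zero, which is the per-document signal aggregated into the violation rates above.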
Source: arXiv:2604.15302v1 - http://arxiv.org/abs/2604.15302v1 | PDF: https://arxiv.org/pdf/2604.15302v1
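The second diagnostic, split conformal prediction over discrete 1-5 scores, can be sketched as follows. This assumes the judge exposes a probability distribution over the five Likert labels and that a held-out calibration set with human gold scores is available; function and variable names are illustrative, not from the paper's released code:

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction sets for 1-5 Likert scores.

    cal_probs:  (n, 5) judge probabilities over scores 1..5 on calibration items
    cal_labels: (n,)   human gold scores (1..5) for the same items
    test_probs: (m, 5) judge probabilities on test items
    Returns one prediction set (array of scores in 1..5) per test item,
    with marginal coverage >= 1 - alpha.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability assigned to the true score.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels - 1]
    # Finite-sample-corrected quantile level for the calibration scores.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(scores, q_level, method="higher")
    # Include every score whose nonconformity is within the threshold.
    return [np.where(1.0 - p <= qhat)[0] + 1 for p in test_probs]
```

Wider sets mean the calibrated threshold forces in more candidate scores, which is why set width can act as the per-instance reliability indicator described above.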
Apr 17, 2026
Artificial Intelligence