Research Paper · Researchia:202603.24009 · [Computer Science > Cybersecurity]

Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

Tom Biskupski

Abstract

A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. Automating the analysis in this way scales up the complex evaluation of the victim models' free-form text outputs through faster and more consistent judgments than human reviewers can provide. Quality and security assessments of LLMs can thus cover a wide range of the victim models' use cases. As a comparatively new technique, however, LLMs as judges lack a thorough investigation of their reliability and agreement with human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the assessor task. As the assessment objective, we curate datasets for eight categories of judgment tasks with corresponding ground-truth labels based on human assessments. Our empirical results show a high correlation of LLMs as judges with human assessments when combined with a suitable prompt, in particular for GPT-4o, several open-source models with ≥ 32B parameters, and a few smaller models such as Qwen2.5 14B.
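To make the setup concrete, the following is a minimal sketch of an LLM-as-judge pipeline and the kind of agreement measurement the abstract describes. All names here (JUDGE_PROMPT, mock_judge) are illustrative assumptions, not the paper's code: the judge model call is mocked with a trivial heuristic, and agreement with human ground-truth labels is measured with a hand-rolled Spearman rank correlation.

```python
from statistics import mean

# Assumed shape of a judge prompt carrying the evaluation criteria.
JUDGE_PROMPT = """You are an impartial judge. Rate the RESPONSE to the PROMPT
on a 1-5 scale for helpfulness and correctness.
PROMPT: {prompt}
RESPONSE: {response}
Reply with a single integer from 1 to 5."""

def mock_judge(response: str) -> int:
    """Stand-in for an LLM API call: scores by word count, capped at 5."""
    return min(5, max(1, len(response.split())))

def ranks(values):
    """Average ranks of `values`, with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = (sum((x - ma) ** 2 for x in ra)
           * sum((y - mb) ** 2 for y in rb)) ** 0.5
    return num / den

# Fabricated example: human ground-truth labels vs. mocked judge scores.
human = [1, 2, 3, 4, 5]
responses = [
    "bad",
    "ok answer",
    "a decent reply here",
    "a fairly good reply overall",
    "a thorough and well argued answer",
]
judge = [mock_judge(r) for r in responses]
print(f"judge-human Spearman correlation: {spearman(human, judge):.3f}")
```

In the paper's actual setup, mock_judge would be replaced by a call to a conversational model with the formatted judge prompt, and the correlation would be computed per task category against the curated human labels.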


Source: arXiv:2603.22214v1 (http://arxiv.org/abs/2603.22214v1)
PDF: https://arxiv.org/pdf/2603.22214v1

Submission: 3/24/2026
Subjects: Cybersecurity; Computer Science

