Research Paper · Researchia:202603.24009

Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

Tom Biskupski


Submitted: March 24, 2026
Subjects: Cybersecurity; Computer Science

Description / Details

A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. The resulting automation scales up the complex evaluation of the victim models' free-form text outputs through faster and more consistent judgments than human reviewers can provide. Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases. Being a comparatively new technique, LLMs as judges lack a thorough investigation of their reliability and of their agreement with human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the assessor task. As the assessment objective, we curate datasets for eight different categories of judgment tasks, with ground-truth labels based on human assessments. Our empirical results show a high correlation of LLMs as judges with human assessments when combined with a suitable prompt, in particular for GPT-4o, several open-source models with ≥ 32B parameters, and a few smaller models such as Qwen2.5 14B.
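The paper's headline result is the correlation between judge scores and human ground-truth labels. As an illustrative sketch only (not the authors' code, and using made-up scores), such agreement could be measured with Spearman's rank correlation, implemented here in plain Python with average-rank tie handling:

```python
# Illustrative sketch: agreement between an LLM judge's scores and human
# ground-truth labels via Spearman's rank correlation.
# The score lists below are hypothetical examples, not data from the paper.

def avg_ranks(xs):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # group consecutive sorted positions whose values tie
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation = Pearson correlation computed on the ranks."""
    ra, rb = avg_ranks(a), avg_ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Hypothetical 1-5 quality scores for six victim-model outputs.
human_scores = [5, 3, 4, 2, 1, 4]
judge_scores = [5, 2, 4, 2, 1, 3]
print(round(spearman(human_scores, judge_scores), 3))  # prints 0.971
```

A rank correlation (rather than raw score agreement) is a natural choice here because different judge prompts may use the grading scale with different offsets while still ordering outputs the same way as the human assessors.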


Source: arXiv:2603.22214v1 - http://arxiv.org/abs/2603.22214v1
PDF: https://arxiv.org/pdf/2603.22214v1
Original Link: http://arxiv.org/abs/2603.22214v1

