Research Paper
Researchia:202606.16061

Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations

Yanan Long

Submitted: June 16, 2026
Subjects: AI; Artificial Intelligence

Description / Details

Public AI evaluations are often read as terminal leaderboards, yet the underlying evidence is a selective time series shaped by reporting rules, benchmark revisions, and missingness. Repeated public archives for LiveBench and Open LLM Leaderboard v2 serve as the primary longitudinal record; LMArena provides a preference stress test; and GAIA and tau-bench contribute limited agentic pilots. Together, these archives instantiate a Bayesian inference problem: under a fixed reporting convention, one constructed terminal-only example over 1,000 systems is compatible with two pre-terminal histories, yielding times of 23.03 or 75.13 to reach within 0.05 of the ceiling under the same terminal-tail model. In synthetic posterior comparisons, action-facing diagnostics differ across observation regimes. The candidate selection-aware frontier model fails synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration; correspondingly, fixed audit gates reject its stronger claims. An archive-and-adjudication protocol reconstructs public evaluation histories, isolates a verified timing boundary, and falsifies unsupported frontier claims.
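The terminal-only ambiguity described above can be illustrated with a toy calculation. The sketch below is not the paper's model: it assumes a simple exponential approach to a ceiling of 1.0, and the terminal score, terminal time, and the two rates are hypothetical values chosen for illustration, so the outputs are not meant to reproduce the reported 23.03 or 75.13. It shows only that two histories can agree exactly on the terminal snapshot while implying different times to come within 0.05 of the ceiling.

```python
import math

CEILING = 1.0  # assumed score ceiling (hypothetical)
EPS = 0.05     # "within 0.05 of the ceiling", as stated in the abstract


def score(t, terminal_time, terminal_score, rate):
    """Toy exponential-saturation curve pinned to the terminal snapshot.

    Any value of `rate` yields a curve that passes through
    (terminal_time, terminal_score), so the snapshot alone cannot
    distinguish between rates.
    """
    gap_at_terminal = CEILING - terminal_score
    return CEILING - gap_at_terminal * math.exp(-rate * (t - terminal_time))


def time_to_ceiling(terminal_score, rate, eps=EPS):
    """Time after the terminal snapshot until the gap to the ceiling is eps."""
    gap = CEILING - terminal_score
    if gap <= eps:
        return 0.0
    return math.log(gap / eps) / rate


# Hypothetical terminal snapshot shared by both histories.
TERMINAL_TIME = 10.0
TERMINAL_SCORE = 0.8

# Two hypothetical pre-terminal histories: same endpoint, different rates.
FAST_RATE = 0.05
SLOW_RATE = 0.02

t_fast = time_to_ceiling(TERMINAL_SCORE, FAST_RATE)
t_slow = time_to_ceiling(TERMINAL_SCORE, SLOW_RATE)

if __name__ == "__main__":
    for name, rate, t in [("fast", FAST_RATE, t_fast), ("slow", SLOW_RATE, t_slow)]:
        early = score(0.0, TERMINAL_TIME, TERMINAL_SCORE, rate)
        final = score(TERMINAL_TIME, TERMINAL_TIME, TERMINAL_SCORE, rate)
        print(f"{name}: score(0)={early:.3f}, terminal={final:.3f}, "
              f"time to ceiling={t:.2f}")
```

Under these assumptions the two curves differ only before the terminal time, which is exactly the information a terminal leaderboard discards. A longitudinal archive would supply the earlier points needed to tell the rates apart.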


Source: arXiv:2606.17005v1 (http://arxiv.org/abs/2606.17005v1)
PDF: https://arxiv.org/pdf/2606.17005v1
