Research Paper
Researchia:202606.12076

Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

Nathaniel Bottman

Submitted: June 12, 2026
Subjects: Machine Learning; Data Science

Description / Details

Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, and P(True), built on within-question sampling and self-evaluation. Operad theory, the formalism for systems built by iterated substitution, suggests a complementary diagnostic: a model's direct answer to a compositional query should agree with the answer it produces by composing a stated decomposition of the same query. We instantiate this idea as operadic consistency (OC), a per-question signal. Across twelve instruction-tuned LLMs (4B to 671B parameters, open-weights and closed-source) on four multi-hop QA datasets, OC is strongly correlated with accuracy on every dataset (Pearson $r \in [0.86, 0.94]$, all $p \leq 0.0004$), and is the only signal we evaluate with $r \geq 0.85$ uniformly across all four datasets. Chain-of-thought self-consistency (CoT-SC; Wang et al., 2023) matches OC on HotpotQA and DROP ($r = 0.93, 0.87$) but drops to $r \approx 0.45$ on MuSiQue and StrategyQA. At the per-question level, OC contributes information beyond CoT-SC and semantic entropy on every dataset (cluster-robust $p \leq 10^{-16}$ for the OC coefficient), and the conclusion is robust to additionally controlling for constructed decomposition-aware baselines ($p \leq 10^{-13}$). The same signal yields selective-prediction improvements (accuracy at fixed coverage) over a tuned CoT-SC baseline at the equal-cost $K = 3$ budget (AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164; 95% CIs exclude zero on every cell). On five frontier thinking models, where the decomposition is extracted from the model's own chain of thought, the same equal-cost comparison gives positive selective-prediction point-estimate lift on all 16 (dataset, budget, metric) cells tested, with 95% CIs excluding zero on 12 of the 16.
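
The abstract does not spell out how OC is computed, so the following is only a minimal sketch of the underlying idea, not the paper's procedure: ask the model the compositional question directly, answer a stated decomposition step by step (substituting each answer into the next sub-question), and record whether the two final answers agree. Everything named here is an assumption made for illustration: the `ask` callable standing in for a model call, the `{prev}` placeholder convention, the string normalization, and the repetition count `k` (which is not claimed to match the paper's $K$ budget).

```python
from typing import Callable, Sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def composed_answer(ask: Callable[[str], str], sub_questions: Sequence[str]) -> str:
    """Answer the sub-questions in order, feeding each answer into the next.

    Hypothetical convention: a sub-question may contain "{prev}", which is
    replaced by the answer to the preceding sub-question.
    """
    prev = ""
    for template in sub_questions:
        prev = ask(template.replace("{prev}", prev))
    return prev


def operadic_consistency(
    ask: Callable[[str], str],
    question: str,
    sub_questions: Sequence[str],
    k: int = 3,
) -> float:
    """Fraction of k trials in which the direct and composed answers agree."""
    agreements = 0
    for _ in range(k):
        direct = normalize(ask(question))
        composed = normalize(composed_answer(ask, sub_questions))
        agreements += int(direct == composed)
    return agreements / k


# Toy stand-in for a model: a lookup table keyed by the normalized prompt.
TOY_KNOWLEDGE = {
    "who directed the film inception?": "Christopher Nolan",
    "where was christopher nolan born?": "London",
    "where was the director of inception born?": "London",
}


def toy_ask(prompt: str) -> str:
    return TOY_KNOWLEDGE.get(normalize(prompt), "unknown")


def inconsistent_ask(prompt: str) -> str:
    # Same toy model, except the direct multi-hop answer is wrong.
    if normalize(prompt) == "where was the director of inception born?":
        return "Paris"
    return toy_ask(prompt)


QUESTION = "Where was the director of Inception born?"
STEPS = ["Who directed the film Inception?", "Where was {prev} born?"]

consistent_score = operadic_consistency(toy_ask, QUESTION, STEPS)
inconsistent_score = operadic_consistency(inconsistent_ask, QUESTION, STEPS)
```

With a real model, `ask` would sample a response, and exact string matching would likely need to be replaced by a more tolerant answer-equivalence check. A low score flags a question where the model's direct answer and its own decomposition disagree, which is the kind of label-free signal the abstract describes.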


Source: arXiv:2606.13649v1 (http://arxiv.org/abs/2606.13649v1)
PDF: https://arxiv.org/pdf/2606.13649v1
