Research Paper | Researchia:202606.08001

How reliable are LLMs when it comes to playing dice?

Luca Avena


Submitted: June 8, 2026
Subjects: AI; Artificial Intelligence

Description / Details

We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We constructed two datasets: a set of standard exercises and a set of counterintuitive exercises designed to trigger heuristic reasoning. We evaluated 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, and no model proves immune. Taken together, these findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success on advanced mathematical problems.
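To make the distinction between standard and counterintuitive exercises concrete, the following is a minimal sketch that computes exact answers by enumeration for two dice questions. Both questions are illustrative assumptions and are not taken from the paper's datasets; the helper name `probability` is likewise hypothetical.

```python
from fractions import Fraction
from itertools import product


def probability(event, given=lambda roll: True):
    """Exact P(event | given) for two fair six-sided dice, by enumeration."""
    # Keep only the ordered outcomes consistent with the conditioning event.
    rolls = [r for r in product(range(1, 7), repeat=2) if given(r)]
    favourable = [r for r in rolls if event(r)]
    return Fraction(len(favourable), len(rolls))


# Illustrative "standard" exercise (assumption): the two dice sum to 7.
standard = probability(lambda r: sum(r) == 7)

# Illustrative "counterintuitive" exercise (assumption): both dice show a six,
# given that at least one of them does. A common heuristic answer is 1/6.
counterintuitive = probability(lambda r: r == (6, 6), given=lambda r: 6 in r)

print(standard, counterintuitive)
```

Exact enumeration with `Fraction` avoids floating-point rounding, so a model's answer to an exercise of this kind could be checked against a ground truth without a tolerance.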


Source: arXiv:2606.07515v1 (http://arxiv.org/abs/2606.07515v1)
PDF: https://arxiv.org/pdf/2606.07515v1

