Research Paper | Researchia:202606.18064

Confidence is Not Reliability: Rethinking MC Dropout in Brain Tumour Segmentation

Xin Ci Wong


Submitted: June 18, 2026
Subjects: Machine Learning; Data Science

Description / Details

Glioma segmentation in multiparametric MRI is a critical component of treatment planning. A segmentation model that fails silently on treatment-critical sub-regions represents a patient safety risk that overlap-based metrics such as Dice scores cannot expose. We ask whether voxel-level uncertainty estimation via Monte Carlo (MC) Dropout can reliably identify segmentation errors in clinically critical sub-regions, and whether calibration failure modes are detectable from standard reporting metrics alone. In an empirical two-model case study on 126 BraTS21 patients, we evaluate a high-performance pretrained SegResNet and a locally trained UNet with residual units (UNet-Res). MC Dropout preserved segmentation accuracy ($|\Delta\text{Dice}| < 0.01$) while achieving strong uncertainty-error alignment (AUROC for entropy ($H$) $\approx 0.97$), indicating that uncertainty correctly ranks erroneous voxels above correct ones. Entropy-based patient stratification identified a high-uncertainty subgroup with substantially lower segmentation performance (median whole-tumour Dice 0.835 vs. 0.925), supporting uncertainty as a practical triage signal. However, global alignment can mask important region-specific differences. Despite similar AUROC, UNet-Res exhibited near-zero enhancing tumour entropy (0.054) and an Expected Calibration Error (ECE) of 0.915, with a Dice of only 0.714, indicating severely miscalibrated confidence on the most clinically critical sub-region, a failure mode invisible to standard Dice and AUROC reporting. These findings demonstrate that strong uncertainty-error alignment is necessary but insufficient for clinical safety: sub-region-specific calibration assessment must accompany AUROC evaluation when selecting models for clinical deployment.
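
The three quantities the abstract relies on (predictive entropy from MC Dropout samples, AUROC of entropy against voxel-wise error, and ECE) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the array layout `(T, C, N)` for T stochastic passes, C classes and N voxels, the function names, the rank-based AUROC, and the 10 equal-width confidence bins are all choices made here for illustration. The abstract does not specify how these were computed in the paper.

```python
import numpy as np

def predictive_entropy(mc_probs, eps=1e-12):
    """Mean softmax and its entropy H per voxel.

    mc_probs: array of shape (T, C, N), softmax outputs from T stochastic
    (dropout-enabled) forward passes. This layout is an assumption.
    """
    mean_probs = mc_probs.mean(axis=0)                        # (C, N)
    entropy = -(mean_probs * np.log(mean_probs + eps)).sum(axis=0)
    return mean_probs, entropy

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); tied scores get average ranks."""
    scores = np.asarray(scores, dtype=float).ravel()
    labels = np.asarray(labels, dtype=bool).ravel()
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0         # 1-based average ranks
    ranks = avg_rank[inverse]
    u_stat = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u_stat / (n_pos * n_neg))

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - confidence| over bins."""
    confidence = np.asarray(confidence, dtype=float).ravel()
    correct = np.asarray(correct, dtype=float).ravel()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidence, edges[1:-1])            # values in 0..n_bins-1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidence[mask].mean())
    return float(ece)

if __name__ == "__main__":
    # Synthetic stand-in data only; no relation to BraTS21 results.
    rng = np.random.default_rng(0)
    T, C, N = 20, 4, 1000
    logits = rng.normal(size=(T, C, N))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    truth = rng.integers(0, C, size=N)

    mean_probs, H = predictive_entropy(probs)
    pred = mean_probs.argmax(axis=0)
    errors = pred != truth

    print("AUROC(H vs. error):", auroc(H, errors))
    print("global ECE:", expected_calibration_error(mean_probs.max(axis=0), ~errors))

    # Sub-region check: restrict to voxels of one class (here class 3 as a
    # hypothetical stand-in for a sub-region such as enhancing tumour).
    region = truth == 3
    print("region ECE:", expected_calibration_error(
        mean_probs.max(axis=0)[region], ~errors[region]))
```

Computing ECE on a region mask rather than over all voxels is the point of the last lines: a model can rank errors well globally (high AUROC) while remaining badly calibrated inside a single sub-region, which is the failure mode the abstract describes for UNet-Res.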


Source: arXiv:2606.19300v1 (http://arxiv.org/abs/2606.19300v1)
PDF: https://arxiv.org/pdf/2606.19300v1
