Reducing cross-sample prediction churn in scientific machine learning
Abstract
Scientific machine learning reports predictive performance. It does not report whether the same prediction would survive a different draw of training data. Across $9$ chemistry benchmarks, two classifiers trained on independent bootstraps of the same training set agree on aggregate accuracy to within $1.3\text{--}4.2$ percentage points but disagree on the class label of $8.0\text{--}21.8\%$ of test molecules. We call this gap \emph{cross-sample prediction churn}. The standard parameter-side techniques (deep ensembles, MC dropout, stochastic weight averaging) do not reduce this gap; two data-side methods do. The first is bootstrap bagging, which cuts the churn rate on every dataset at no accuracy cost. The second is \emph{twin-bootstrap}, our proposal: two networks trained jointly on independent bootstraps with a sym-KL consistency loss between their predictions, which at matched compute reduces churn further beyond bagging on the median dataset. Cross-sample prediction churn deserves a column alongside predictive performance in scientific-ML benchmark reports, because without it the parameter-side and data-side methods are indistinguishable on the metric they actually differ on.
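The churn measurement the abstract describes can be sketched in a few lines: train two models on independent bootstrap draws of the same training set and report the fraction of test points on which their predicted labels disagree. A minimal, self-contained illustration follows; the nearest-centroid classifier and the synthetic data are placeholders for the paper's networks and chemistry benchmarks, not the authors' actual setup.

```python
import numpy as np

def bootstrap_indices(rng, n):
    # One bootstrap draw: sample n training indices with replacement.
    return rng.integers(0, n, size=n)

def nearest_centroid_fit(X, y):
    # Toy stand-in classifier: one centroid per class.
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def nearest_centroid_predict(model, X):
    classes, centroids = model
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[d.argmin(axis=1)]

def churn(pred_a, pred_b):
    # Cross-sample prediction churn: fraction of test points on which
    # two models trained on independent bootstraps disagree.
    return float(np.mean(pred_a != pred_b))

rng = np.random.default_rng(0)
# Synthetic two-class data standing in for a chemistry benchmark.
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)
X_tr, y_tr, X_te = X[:300], y[:300], X[300:]

ia = bootstrap_indices(rng, 300)
ib = bootstrap_indices(rng, 300)
model_a = nearest_centroid_fit(X_tr[ia], y_tr[ia])
model_b = nearest_centroid_fit(X_tr[ib], y_tr[ib])
c = churn(nearest_centroid_predict(model_a, X_te),
          nearest_centroid_predict(model_b, X_te))
print(f"churn = {c:.3f}")
```

Even when the two models have near-identical test accuracy, `c` is typically nonzero, which is exactly the gap the paper argues benchmark reports should surface.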
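The twin-bootstrap objective pairs each network's own task loss with a symmetric-KL consistency term between the two networks' predicted class distributions. A minimal numpy sketch of that consistency term follows; the networks and training loop are omitted, `lam` is a hypothetical weighting not taken from the paper, and `p`/`q` stand for the two twins' softmax outputs on a shared batch.

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    # Symmetric KL divergence KL(p||q) + KL(q||p), averaged over the batch.
    # p, q: arrays of shape (batch, classes) with rows summing to 1.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    kl_pq = np.sum(p * (np.log(p) - np.log(q)), axis=1)
    kl_qp = np.sum(q * (np.log(q) - np.log(p)), axis=1)
    return float(np.mean(kl_pq + kl_qp))

def twin_loss(ce_a, ce_b, p_a, p_b, lam=1.0):
    # Joint objective: each twin's cross-entropy on its own bootstrap,
    # plus a sym-KL penalty tying the twins' predictions together.
    # lam is a hypothetical weight; the paper's value is not given here.
    return ce_a + ce_b + lam * sym_kl(p_a, p_b)

p = np.array([[0.9, 0.1], [0.2, 0.8]])
q = np.array([[0.5, 0.5], [0.5, 0.5]])
print(sym_kl(p, p))  # identical predictions: 0.0
print(sym_kl(p, q))  # positive when the twins disagree
```

The penalty is zero exactly when the two networks already agree, so minimizing it directly targets the churn metric rather than accuracy.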
Source: arXiv:2605.13826v1 - http://arxiv.org/abs/2605.13826v1 (PDF: https://arxiv.org/pdf/2605.13826v1)
May 14, 2026
Chemistry