ExplorerBiotechnologyBiotechnology
Research PaperResearchia:202512.3111f638

Diagnosing Heteroskedasticity and Resolving Multicollinearity Paradoxes in Physicochemical Property Prediction

Malikussaid

Abstract

Lipophilicity (logP) prediction remains central to drug discovery, yet linear regression models for this task frequently violate statistical assumptions in ways that invalidate their reported performance metrics. We analyzed 426,850 bioactive molecules from a rigorously curated intersection of PubChem, ChEMBL, and eMolecules databases, revealing severe heteroskedasticity in linear models predicting computed logP values (XLOGP3): residual variance increases 4.2-fold in lipophilic regions (logP gr...

Submitted: December 31, 2025Subjects: Biotechnology; Biotechnology

Description / Details

Lipophilicity (logP) prediction remains central to drug discovery, yet linear regression models for this task frequently violate statistical assumptions in ways that invalidate their reported performance metrics. We analyzed 426,850 bioactive molecules from a rigorously curated intersection of PubChem, ChEMBL, and eMolecules databases, revealing severe heteroskedasticity in linear models predicting computed logP values (XLOGP3): residual variance increases 4.2-fold in lipophilic regions (logP greater than 5) compared to balanced regions (logP 2 to 4). Classical remediation strategies (Weighted Least Squares and Box-Cox transformation) failed to resolve this violation (Breusch-Pagan p-value less than 0.0001 for all variants). Tree-based ensemble methods (Random Forest R-squared of 0.764, XGBoost R-squared of 0.765) proved inherently robust to heteroskedasticity while delivering superior predictive performance. SHAP analysis resolved a critical multicollinearity paradox: despite a weak bivariate correlation of 0.146, molecular weight emerged as the single most important predictor (mean absolute SHAP value of 0.573), with its effect suppressed in simple correlations by confounding with topological polar surface area (TPSA). These findings demonstrate that standard linear models face fundamental challenges for computed lipophilicity prediction and provide a principled framework for interpreting ensemble models in QSAR applications.

Please sign in to join the discussion.

No comments yet. Be the first to share your thoughts!

Access Paper
View Source PDF
Submission Info
Date:
Dec 31, 2025
Topic:
Biotechnology
Area:
Biotechnology
Comments:
0
Bookmark
Diagnosing Heteroskedasticity and Resolving Multicollinearity Paradoxes in Physicochemical Property Prediction | Researchia