Accuracy Is Not Enough
Rethinking Model Selection for Clinical Machine Learning
DOI:
https://doi.org/10.32473/flairs.39.1.141830
Keywords:
LIME, faithfulness, XGBoost, random forest, explainability, explainable AI
Abstract
Clinical machine learning models often require more than just high accuracy to gain clinician trust and adoption; they require understandable and stable reasoning. Therefore, selecting among competing models based on performance metrics alone may be insufficient. In this work, we introduce the Multidimensional Evaluation of Diagnostic Algorithms and Learning (MEDAL) framework, which supports the incorporation of explanatory analysis into the model selection process. We adapt metrics originally designed for assessing model compression faithfulness, specifically cosine similarity, correlation, and top-k permutation tests, to evaluate the explanatory stability and similarity of candidate models. By applying this framework to a large-scale trauma triage dataset, we evaluated XGBoost and Random Forest architectures. Our results demonstrate that while both architectures exhibit high internal stability under training data perturbations, they rely on different underlying logic to achieve comparable accuracy. This explanatory divergence highlights a critical blind spot in standard evaluation: distinct models may yield identical predictions for different reasons. We propose a two-step selection paradigm that filters models by predictive performance and then differentiates them based on logical alignment with clinical guidelines, ensuring that deployed models are not only accurate but also explanatorily dependable.
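As a rough illustration of the three similarity measures named in the abstract, the following sketch compares two feature-attribution vectors, such as averaged LIME weights from two candidate models. It is not the authors' implementation. The function names, the example vectors, and the specific form of the top-k permutation test (overlap of the k largest absolute attributions, tested against random reassignments of one vector's values to features) are assumptions made for illustration only.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two attribution vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def pearson_correlation(a, b):
    """Pearson correlation between two attribution vectors."""
    return float(np.corrcoef(a, b)[0, 1])


def top_k_overlap(a, b, k):
    """Number of features shared by the top-k (by absolute value) of a and b."""
    top_a = set(np.argsort(-np.abs(a))[:k].tolist())
    top_b = set(np.argsort(-np.abs(b))[:k].tolist())
    return len(top_a & top_b)


def top_k_permutation_test(a, b, k, n_perm=2000, seed=0):
    """Assumed form of a top-k permutation test.

    Null hypothesis: the top-k sets of a and b overlap no more than they
    would if b's attributions were assigned to features at random.
    Returns the observed overlap and a one-sided p-value.
    """
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = top_k_overlap(a, b, k)
    null = np.array(
        [top_k_overlap(a, rng.permutation(b), k) for _ in range(n_perm)]
    )
    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, float(p_value)


# Hypothetical attribution vectors over six features (not data from the paper).
attr_model_a = np.array([0.50, 0.30, 0.10, 0.05, 0.03, 0.02])
attr_model_b = np.array([0.05, 0.02, 0.45, 0.35, 0.10, 0.03])

print("cosine:", cosine_similarity(attr_model_a, attr_model_b))
print("pearson:", pearson_correlation(attr_model_a, attr_model_b))
print("top-2 overlap, p-value:", top_k_permutation_test(attr_model_a, attr_model_b, k=2))
```

In this sketch, comparing a model against a retrained copy of itself would probe stability, while comparing two different architectures would probe explanatory similarity; two models with similar accuracy can still score low on all three measures.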
License
Copyright (c) 2026 Moumita Kamal, Douglas A. Talbert, Nolan Patterson, Nicholas Atkins, Celia Hough

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.