Steps to avoid overuse and misuse of machine learning in clinical research

The term ‘overuse’ refers to the unnecessary adoption of AI or advanced ML techniques where alternative, reliable or superior methodologies already exist. In such cases, the use of AI and ML techniques is not necessarily inappropriate or unsound, but the justification for such research is unclear or artificial: for example, a novel technique may be proposed that delivers no meaningful new answers.

Many clinical studies have employed ML techniques to achieve respectable or impressive performance, as shown by area under the curve (AUC) values between 0.80 and 0.90, or even >0.90 (Box 1). A high AUC is not necessarily a mark of quality, as the ML model might be overfitted (Fig. 1). When a traditional regression technique is applied and compared against ML algorithms, the more sophisticated ML models often offer only marginal accuracy gains, presenting a questionable trade-off between model complexity and accuracy1,2,8,9,10,11,12. Even a very high AUC is no guarantee of robustness: with an overall event rate of <1%, a model can be reported with an AUC of 0.99 even though nearly all of its correct predictions concern negative cases and the few positive events are still missed.
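
As an illustration of how headline metrics can mislead when events are rare, the sketch below uses hypothetical numbers (not data from any cited study) and overall accuracy rather than AUC to make the imbalance problem explicit: a trivial rule that predicts 'no event' for every patient attains near-perfect accuracy at an event rate below 1% while detecting none of the events.

```python
import numpy as np

# Hypothetical cohort: 10,000 patients with an event rate below 1% (80 events).
n_patients, n_events = 10_000, 80
y_true = np.zeros(n_patients, dtype=int)
y_true[:n_events] = 1

# A trivial "model" that predicts no event for every single patient.
y_pred = np.zeros(n_patients, dtype=int)

accuracy = (y_pred == y_true).mean()              # dominated by the many negatives
sensitivity = (y_pred[y_true == 1] == 1).mean()   # fraction of events detected

print(f"Overall accuracy: {accuracy:.3f}")        # 0.992
print(f"Sensitivity:      {sensitivity:.3f}")     # 0.000 -- every event is missed
```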

Fig. 1: Model fitting.

Given a dataset with data points (green points) and a true effect (black line), a statistical model aims to estimate the true effect. The red line exemplifies a close estimation, whereas the blue line exemplifies an overfit ML model with over-reliance on outliers. Such a model might seem to provide excellent results for this particular dataset, but fails to perform well in a different (external) dataset.
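
To make the overfitting shown in Fig. 1 concrete, the sketch below (simulated data chosen for illustration, not the data behind the figure) fits a simple linear model and a deliberately over-flexible polynomial to the same noisy training sample; the flexible fit hugs the training points, including outliers, but its error on a fresh sample, standing in for an external dataset, is worse.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    """Linear true effect (cf. the black line in Fig. 1) plus noise."""
    x = rng.uniform(0, 10, n)
    y = 2.0 * x + 1.0 + rng.normal(0, 3.0, n)
    return x, y

x_train, y_train = simulate(30)       # the observed dataset (green points)
x_test, y_test = simulate(1000)       # stands in for an external dataset

for degree in (1, 9):                 # degree 1 ~ red line; degree 9 ~ blue line
    coefs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {mse_train:.1f}, external MSE {mse_test:.1f}")
```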

There is an important distinction between a statistically significant improvement and a clinically significant improvement in model performance. ML techniques undoubtedly offer powerful ways to deal with prediction problems involving data with nonlinear or complex, high-dimensional relationships (Table 1). By contrast, many simple medical prediction problems are inherently linear, with features that are chosen because they are known to be strong predictors, usually on the basis of prior research or mechanistic considerations. In these cases, it is unlikely that ML methods will provide a substantial improvement in discrimination2. Unlike in the engineering setting, where any improvement in performance may improve the system as a whole, modest improvements in medical prediction accuracy are unlikely to yield a difference in clinical action.
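
A minimal simulation of this point is sketched below, assuming scikit-learn is available: when the signal is essentially linear in a few strong predictors, a flexible ML model (gradient boosting is used here purely as an example) typically adds little to the cross-validated AUC of plain logistic regression. The data are simulated and the specific models are illustrative choices, not those of the cited studies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Simulated data with a handful of strong, linearly acting predictors.
X, y = make_classification(n_samples=2000, n_features=6, n_informative=4,
                           n_redundant=0, class_sep=1.0, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: cross-validated AUC = {auc:.3f}")
```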

Table 1 Definitions of several key terms in machine learning

ML techniques should be evaluated against traditional statistical methodologies before they are deployed. If the objective of a study is to develop a predictive model, ML algorithms should be compared with a predefined set of traditional regression techniques on the Brier score (an evaluation metric, similar to the mean squared error, used to assess the accuracy of predicted probabilities), discrimination (for example, the AUC) and calibration. The model should then be externally validated. The analytical methods, and the performance metrics on which they are compared, should be specified in a prospective study protocol and should go beyond overall performance, discrimination and calibration to also include metrics related to overfitting.
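
One possible shape for such a prespecified comparison is sketched below, assuming scikit-learn and using a held-out split as a stand-in for genuine external validation; the models, the simulated data and the crude linear approximation of the calibration slope are illustrative assumptions rather than a recommended protocol.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Development data and a held-out set standing in for external validation.
X, y = make_classification(n_samples=3000, n_features=8, random_state=0)
X_dev, X_ext, y_dev, y_ext = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_dev, y_dev)
    p = model.predict_proba(X_ext)[:, 1]
    auc = roc_auc_score(y_ext, p)            # discrimination
    brier = brier_score_loss(y_ext, p)       # accuracy of predicted probabilities
    slope = np.polyfit(p, y_ext, 1)[0]       # crude calibration check (slope near 1 is ideal)
    print(f"{name}: AUC={auc:.3f}, Brier={brier:.3f}, calibration slope~{slope:.2f}")
```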

Conversely, some algorithms are able to say “I don’t know” when faced with unfamiliar data13, an output that is important but often underappreciated, as knowledge that a prediction is highly uncertain may, itself, be clinically actionable.
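
A minimal sketch of this idea, using a probabilistic classifier and an arbitrary uncertainty band (illustrative choices only, not the specific algorithms of the cited work), is to abstain whenever the predicted probability falls too close to chance:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Abstain ("I don't know") when the predicted probability sits in an
# uncertain band around 0.5; the band limits here are arbitrary choices.
lower, upper = 0.35, 0.65
decision = np.where((proba > lower) & (proba < upper), "I don't know",
                    np.where(proba >= upper, "event", "no event"))

abstain_rate = np.mean(decision == "I don't know")
print(f"Abstained on {abstain_rate:.1%} of cases; the rest receive a firm prediction")
```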
