We Need Machine Learning Standards
Last week I saw an article in the HIT lay press about a group that had made several predictions about healthcare utilization using social determinants of health. I then located the article in the "medical literature." They reported a sensitivity (also known as recall) of 0.79 and an AUC of 0.84 for predicting "inpatient admission or ED visit within 90 days". I noted that the class of interest was present in only 4% of patients, since 96% were not going to be admitted or visit the ED, so there was a fairly severe "class imbalance". This is a common problem with biomedical data sets, where the class of interest is very much in the minority.
I then looked at precision (also known as positive predictive value or PPV) and noted it was only 0.12. This is undoubtedly due to a high number of false positives, as PPV is the true positives divided by the sum of the true positives and the false positives. They did not report an F score, so I calculated it as follows: multiply recall by precision, divide this by recall plus precision, then multiply the result by 2.
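As a quick check on the arithmetic, here is a minimal Python sketch of the F score calculation described above, using the reported recall of 0.79 and precision of 0.12. The function names are my own, chosen for illustration. The second helper simply rearranges the PPV definition to show how many false positives accompany each true positive.

```python
def f1_score(precision: float, recall: float) -> float:
    """F score: 2 * (precision * recall) / (precision + recall)."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero when both inputs are 0
    return 2 * precision * recall / (precision + recall)


def false_positives_per_true_positive(precision: float) -> float:
    """Since PPV = TP / (TP + FP), it follows that FP / TP = (1 - PPV) / PPV."""
    return (1 - precision) / precision


recall = 0.79     # sensitivity reported in the article
precision = 0.12  # PPV reported in the article

print(round(f1_score(precision, recall), 2))                   # 0.21
print(round(false_positives_per_true_positive(precision), 1))  # 7.3
```

In other words, a PPV of 0.12 means roughly seven false alarms for every patient correctly flagged.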
The calculated F score was 0.21, which is low given that the possible range is 0 to 1, with 1 being a perfect score. The ability to predict another endpoint, "avoidable inpatient visit within 90 days", had about the same sensitivity (recall) and AUC as previously discussed, but this time the PPV was 0.04. I calculated the F score and it was 0.07. No precision-recall curve was given either, which would be more appropriate than a ROC curve when there is class imbalance. There was no attempt to oversample or undersample, or to use any of the many other methods available to correct for this degree of class imbalance. For more information on how to deal with class imbalance, I would recommend the multiple articles published on Medium.com. Also, they stated that they used a proprietary decision tree algorithm but gave no details, which is worrisome.
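For readers unfamiliar with oversampling, here is a minimal sketch of random oversampling in plain Python. It is not what the authors did (they reported no resampling at all) and it is only one of several possible approaches. The function name, the toy data, and the 4-in-100 split are assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority = [(r, y) for r, y in zip(rows, labels) if y == minority_label]
    majority = [(r, y) for r, y in zip(rows, labels) if y != minority_label]
    if not minority:
        raise ValueError("no minority examples to resample")
    # Number of duplicated minority rows needed to match the majority class.
    extra_needed = max(0, len(majority) - len(minority))
    extras = [rng.choice(minority) for _ in range(extra_needed)]
    combined = majority + minority + extras
    rng.shuffle(combined)
    new_rows = [r for r, _ in combined]
    new_labels = [y for _, y in combined]
    return new_rows, new_labels


# Toy data (made up): 4 positives and 96 negatives, mirroring the 4% / 96% split.
rows = list(range(100))
labels = [1] * 4 + [0] * 96

balanced_rows, balanced_labels = random_oversample(rows, labels, minority_label=1)
print(Counter(balanced_labels))
```

Resampling like this should be applied to the training data only. The test set has to keep the real-world 4% prevalence, otherwise the reported precision will look better than it would in practice.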
I emailed the team who wrote the article, all of whom work for a healthcare data analytics company. They agreed there was severe class imbalance, but said their main interest was in showing results using social determinants of health.
We need to "fight the good fight" and always keep the scientific bar as high as possible or we are going to be murdered by our academic colleagues/reviewers/naysayers for borderline machine learning or AI articles.
Just as there are tight standards for RCTs and systematic reviews that you must adhere to in order to be published, we need the same for ML and AI. Admittedly, the journal they published in (AJMC) is hardly a medical journal, but I still think we should not accept articles with MAJOR flaws.