Review of Healthcare AI/ML reporting in 2020: Information, Validation and Replication
The healthcare sector’s interest in digitization and algorithmic decision making has continued to increase over the past several years. As a follow-up to last year’s publication review, here are some thoughts on the state of AI/ML publications in 2020. Before we plunge into the specifics, I would like to quote some interesting findings about regulatory reviews of such interventions from two recent commentaries and provide a handy guide to reading such literature.
Politico’s newsletter cites Rock Health, which gathers data on early-stage tech companies:
Rock Health reported a 96 percent increase in funding to AI companies selling to health care providers between 2019 to 2020.
The FDA, for its part, is nudging development along with a new plan to regulate artificial intelligence in health, including by incorporating more real-world data into the evaluation process; However, the tech isn’t always clinically validated, and hastily deploying untested algorithms might actually widen health care disparities, by skipping over the neediest patients.
Medical AI’s Hidden Data
Prof. Andrew Ng also talks about this validation gap.
U.S. government approval of medical AI products is on the upswing — but information about how such systems were built is largely unavailable.
He cites Stat News reporting on US FDA approval data:
- Stat News compiled a list of 161 products that were approved between 2012 and 2020. Most are imaging systems trained to recognize signs of stroke, cancer, or other conditions. Others monitor heartbeats, predict fertility status, or analyze blood loss.
- The makers of only 73 of those products disclosed the number of patients in the test dataset. In those cases, the number of patients ranged from less than 100 to more than 15,000.
- The manufacturers of fewer than 40 products revealed whether the data they used for training and testing had come from more than one facility — an important factor in proving the product’s general utility. Makers of 13 products broke down their study population by gender. Seven did so by race.
- A few companies said they had tested and validated their product on a large, diverse population, but that information was not publicly available.
Prof. Ng reiterates the need for more thorough vetting of such algorithmic decision making: if you don’t know how an AI system was trained and tested, you can’t evaluate the risk of concept or data drift as real-world conditions and data distributions change.
A recap on ‘how to read healthcare publications about AI/ML’
This reading guide, written by Liu et al. in JAMA (the Journal of the American Medical Association), gives a good overview of what to look for in any publication that uses machine learning methodology. The points fall into four broad categories, each of which helps us understand a different facet of research quality:
- Machine learning methodology: How was the specific method chosen? Why was it chosen? Are these questions addressed at all in the article?
- Data sources and quality: What are the data sources? Are they representative of the disease area? If not, has this been noted in the article as a limitation? How did the researchers handle missing data? Could any data preparation step have introduced overfitting?
- Results and performance: What kind of performance evaluations are described? Are the results unexpected, or do they look too good? Does the article describe independent validation, possibly in a prospective study?
- Clinical implications: How will such an ML model be implemented in a real-world clinical setting? How will the clinical effect be measured and monitored?
As you can see, each of these topics gives us a better understanding of the methodology used and thereby helps us read the article critically.
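To make the data preparation question concrete, here is a minimal sketch of one way a preparation step can leak information: learning an imputation value from the full dataset instead of from the training split alone. The helper names (`fit_mean_imputer`, `apply_imputer`) and the lab values are hypothetical and are not taken from the guide by Liu et al.

```python
from statistics import mean

def fit_mean_imputer(values):
    """Learn the fill value from observed (non-missing) entries only."""
    observed = [v for v in values if v is not None]
    return mean(observed)

def apply_imputer(values, fill_value):
    """Replace missing entries with a previously learned fill value."""
    return [fill_value if v is None else v for v in values]

# Hypothetical lab values; None marks a missing measurement.
train = [4.0, None, 6.0, 5.0]
test = [None, 20.0]

# Leak-free: the fill value is learned from the training split alone
# and then reused unchanged on the test split.
fill = fit_mean_imputer(train)
train_filled = apply_imputer(train, fill)
test_filled = apply_imputer(test, fill)

# Leaky alternative, shown for contrast: the fill value also "sees" the test split.
leaky_fill = fit_mean_imputer(train + test)

print(fill, leaky_fill)
```

When the two fill values differ, the training data has been shaped by test-set information, and reported test performance can look better than it would on truly unseen patients. An article that describes where each preparation step was fitted makes this easy to rule out.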
Accompanying this fantastic guide was a commentary by Finale Doshi-Velez and Roy H. Perlis, Evaluating Machine Learning Articles ²
Doshi-Velez and Perlis reiterate the point about understanding the basis for using machine learning: underlying assumptions, model properties, optimization strategies, and limitations. It is also necessary to understand the data sources and the regularization techniques used in order to judge the validity of the results.
They also add a few more considerations:
- Subgroups: There has already been considerable debate about inherent racial and gender bias in algorithms, including healthcare algorithms. Because algorithms are complex, they can contain hidden systematic errors, so research that provides detailed analyses of results across different subgroups is essential for understanding how a model behaves for each group.
- Larger may not be better: Due to inherent biases in data sources, as well as in the data preparation techniques used, we need to be careful not to equate large validation sets with better models.
- Clinical setting: Models that start as retrospective studies and imply prospective use in a clinical setting need more intense scrutiny. It is important to understand which features are driving the results and how they link to established clinical knowledge before we start applying such models in a real-world setting.
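The subgroup point can be illustrated with a small sketch that breaks sensitivity down by subgroup. The records, the group labels "A" and "B", and the function name `sensitivity_by_subgroup` are made up for illustration; a real analysis would use the study's own subgroup definitions and report uncertainty as well.

```python
from collections import defaultdict

def sensitivity_by_subgroup(records):
    """Return {subgroup: sensitivity} from (subgroup, y_true, y_pred) triples.

    Sensitivity = true positives / all actual positives. Subgroups with no
    actual positives map to None, since sensitivity is undefined there.
    """
    true_positives = defaultdict(int)
    positives = defaultdict(int)
    groups = set()
    for group, y_true, y_pred in records:
        groups.add(group)
        if y_true == 1:
            positives[group] += 1
            if y_pred == 1:
                true_positives[group] += 1
    return {
        g: (true_positives[g] / positives[g] if positives[g] else None)
        for g in sorted(groups)
    }

# Hypothetical predictions: (subgroup, actual label, predicted label).
records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 1, 0), ("A", 1, 1), ("A", 0, 0),
    ("B", 1, 0), ("B", 1, 1), ("B", 0, 0), ("B", 0, 1),
]

per_group = sensitivity_by_subgroup(records)

# Pooled sensitivity across all records, for comparison.
overall_tp = sum(1 for _, t, p in records if t == 1 and p == 1)
overall_pos = sum(1 for _, t, _ in records if t == 1)
overall = overall_tp / overall_pos

print(per_group, round(overall, 3))
```

In this toy data the pooled sensitivity sits between the two subgroup values, so a single headline number would hide the weaker performance in group "B".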
For definitions of commonly used performance evaluation measures, refer to this detailed primer by M. Yu et al.
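As a quick reference, the sketch below computes a few standard measures from a binary confusion matrix. These are the usual textbook definitions rather than anything specific to the primer, and the counts are invented for illustration.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute common binary-classification measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of actual positives that were detected
        "specificity": tn / (tn + fp),  # share of actual negatives correctly ruled out
        "ppv": tp / (tp + fp),          # positive predictive value (precision)
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical screening result: 1,000 patients, 100 of whom have the condition.
metrics = confusion_metrics(tp=80, fp=90, fn=20, tn=810)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy and specificity look strong while fewer than half of the positive predictions are correct. This is why a single measure rarely tells the whole story when the condition is uncommon.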
State of AI/ML Research in 2020
I performed a qualitative review on PubMed for the terms ‘artificial intelligence’ and ‘machine learning’ and filtered for research articles that included an application of algorithmic decision making. From this list, about 35% of articles described detailed methodologies; the rest were excluded for various reasons:
- 30% of articles did not describe clinical applications
- 10% of articles were not randomized studies, and another 10% only described a protocol/methodology
- The remaining studies were either post-hoc analyses or were excluded for other miscellaneous reasons.
In the next few sections, I will describe some interesting insights from the articles that used a randomized controlled design and explained application in a clinical setting.
More than 50% of studies were conducted using data from a single center
As can be seen from the graph above, more than 50% of studies were conducted using data from a single center in a single country. Further, the sample size for more than 55% of studies was less than 500 patients/data points, severely limiting wider application of the findings.
Data handling and quality of input data was not described in a uniform manner
About 45% of studies did not mention the methods they used for handling missing data and/or poor data quality. Algorithmic outputs can differ greatly depending on which data handling techniques were used and how they were applied, so such reporting should be considered essential.
Moreover, more than 70% of studies did not report using a validation dataset to validate the algorithm output. Validation sets are important for understanding how well the model has been trained and for estimating model properties. Evaluating on held-out data also helps detect overfitting of the model.
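A toy example shows why held-out data matters. The sketch below uses a 1-nearest-neighbour classifier, which simply memorizes its training points, on synthetic data with two deliberately flipped labels. The data, the threshold rule and the function names are assumptions made for illustration and do not come from any of the reviewed studies.

```python
def predict_1nn(train, x):
    """Predict the label of the closest training point (ties go to the earliest point)."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]

def accuracy(train, data):
    """Share of points in `data` whose label the memorizing model predicts correctly."""
    correct = sum(1 for x, y in data if predict_1nn(train, x) == y)
    return correct / len(data)

def true_label(x):
    """Synthetic ground truth: positive when the feature is at least 10."""
    return 1 if x >= 10 else 0

# Training split: even feature values, with two labels flipped to simulate noise.
train = [(x, true_label(x)) for x in range(0, 20, 2)]
train = [(x, 1 - y) if x in (4, 14) else (x, y) for x, y in train]

# Validation split: odd feature values, never seen during training.
validation = [(x, true_label(x)) for x in range(1, 20, 2)]

train_accuracy = accuracy(train, train)
validation_accuracy = accuracy(train, validation)

print(train_accuracy, validation_accuracy)
```

The model scores perfectly on the data it memorized, noise included, and noticeably lower on the held-out points. Without a validation set, only the first number would be available, and it would overstate how well the model generalizes.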
Finally, the authors of only 20% of studies reported availability of raw data and code for future replication of results. Such non-availability of research data and code prevents replication by other researchers, adding to the existing lack of transparency.
In summary, future research publications should focus on providing comprehensive information on these aspects of algorithm development and the datasets used. It will of course also help the field if researchers include larger, prospective datasets while developing and validating their models.
In late 2019, the CONSORT and SPIRIT reporting guidelines were extended to address considerations specific to AI health interventions. This led to the release of two additional guidelines: SPIRIT-AI (Standard Protocol Items: Recommendations for Interventional Trials-Artificial Intelligence) and CONSORT-AI (Consolidated Standards of Reporting Trials-Artificial Intelligence). Wider adoption of these guidelines and their associated checklists should help promote transparency and completeness in clinical trial protocols and reports for AI interventions, and assist editors and peer reviewers, as well as the general readership, in understanding, interpreting and critically appraising them.