Sample size for assessing a diagnostic accuracy of AI-based software in radiology
Authors: Bobrovskaya T. M., Vasilev Yu. A., Nikitin N. Yu., Vladzimirskyy A. V., Omelyanskaya O. V., Chetverikov S. F., Arzamasov K. M.
Journal: Siberian Journal of Clinical and Experimental Medicine @cardiotomsk
Section: Digital decision-support technologies in medicine
Published in: Issue 3, Vol. 39, 2024.
Free access
Introduction. Determining the minimum sample size needed for a given task is an important yet still poorly explored problem. Many methods exist, but most of them are not applicable to the validation of AI-based software.

Aim: To consider a methodology for determining the balance of the classes “norm”/“abnormality” and to propose a statistical approach for determining the amount of data necessary for testing (validating) AI-based software.

Material and Methods. The output of AI-based software was analyzed on a dataset of mammograms. Mammograms were classified by the presence of breast cancer (“abnormality”) or its absence (“norm”). The full dataset (general population) contained 123,301 unique studies. The original class balance was 89.3% “norm” / 10.7% “abnormality”. The output of the AI-based software (an ML algorithm) was taken to be the probability that pathology is present in the study as a whole. The following values were used as ground truth (GT): 0 if a doctor assigned BI-RADS category 1 or 2, and 1 if the doctor assigned BI-RADS category 3, 4, or 5. Each data sample was passed to the AI-based software for processing, and a quality metric, AUC ROC, was calculated from its output. These steps were repeated 10,000 times for every studied “norm”/“abnormality” balance. From the resulting AUC ROC values, means were calculated over different random data series with the same balance, and these mean AUC ROC values were then analyzed.

Results. The coefficient of variation of the AUC ROC values reached its maximum at 190 studies for a 10% “abnormality” share, at 80 studies for a 20% share, at 120 studies for a 30% share, at 110 studies for a 40% share, and at 70 studies for a 50% share.

Conclusion. When testing AI-based software, it should be taken into account that the number of studies at which AUC ROC values are most heterogeneous (deviate most from the mean value) differs between class balances. If the purpose of validation is to establish the worst-case behavior of AUC ROC values, then for the studied AI-based software the “abnormality” share should be 10% and the number of studies 190. If validation is carried out with a limited amount of data, then the “abnormality” share should be 50% and the number of studies 70.
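
The resampling procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the data are synthetic stand-ins for the mammography dataset, the function names (`birads_to_gt`, `auc_roc`, `auc_cv`) are hypothetical, the number of repeats is reduced from 10,000 to 200 for speed, and the coefficient of variation is computed directly over the repeated AUC ROC values, which is one possible reading of the described method.

```python
import numpy as np

def birads_to_gt(birads):
    """Ground truth as described in the abstract: 0 for BI-RADS 1-2, 1 for BI-RADS 3-5."""
    return (np.asarray(birads) >= 3).astype(int)

def auc_roc(y_true, scores):
    """AUC ROC as the probability that a random positive outscores a random negative
    (ties count as one half)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # all positive/negative score pairs
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

def auc_cv(y_true, scores, n_studies, abnormal_share, n_repeats, rng):
    """Coefficient of variation of AUC ROC over repeated random subsamples
    drawn with a fixed "abnormality" share."""
    pos_idx = np.flatnonzero(y_true == 1)
    neg_idx = np.flatnonzero(y_true == 0)
    n_pos = max(1, round(n_studies * abnormal_share))
    n_neg = n_studies - n_pos
    aucs = np.empty(n_repeats)
    for i in range(n_repeats):
        idx = np.concatenate([
            rng.choice(pos_idx, size=n_pos, replace=False),
            rng.choice(neg_idx, size=n_neg, replace=False),
        ])
        aucs[i] = auc_roc(y_true[idx], scores[idx])
    return aucs.std(ddof=1) / aucs.mean()

# Synthetic stand-in data (assumption): 10.7% "abnormality", Gaussian scores.
rng = np.random.default_rng(0)
y = (rng.random(20_000) < 0.107).astype(int)
scores = rng.normal(loc=1.5 * y, scale=1.0)

# CV of AUC ROC for a 10% "abnormality" share at a few candidate sample sizes.
cv_by_size = {n: auc_cv(y, scores, n, 0.10, 200, rng) for n in (50, 100, 200)}
for n, cv in cv_by_size.items():
    print(f"n={n}: CV of AUC ROC = {cv:.4f}")
```

Scanning a grid of sample sizes and class balances with such a function gives a curve of the coefficient of variation against the number of studies for each balance. Because the scores here are synthetic, the sample sizes at which the curve peaks need not match the values reported for the studied software.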
Keywords: artificial intelligence, statistical methods, sampling, validation, radiology
Short URL: https://sciup.org/149146306
IDR: 149146306 | DOI: 10.29001/2073-8552-2024-39-3-188-198