Statistical Learning in the Fight against COVID-19: A Focus on Diagnosis

Document Type: Original Article

Authors
1 Department of Statistics, Vali-e-Asr University of Rafsanjan, Rafsanjan, Iran.
2 Department of Community Medicine, School of Medicine, Rafsanjan University of Medical Sciences, Rafsanjan, Iran
3 Clinical Research Development Unit, Ali-Ibn Abi-Talib Hospital, Rafsanjan University of Medical Sciences, Rafsanjan, Iran; Department of Internal Medicine, Ali-Ibn Abi-Talib Hospital, School of Medicine, Rafsanjan University of Medical
4 Department of Physiology, School of Medicine, Hamadan University of Medical Sciences, Hamadan, Iran; Department of Pharmacology and Toxicology, School of Pharmacy, Hamadan University of Medical Sciences, Hamadan, Iran.
5 Physiology-Pharmacology Research Center, Research Institute of Basic Medical Sciences, Rafsanjan University of Medical Sciences, Rafsanjan, Iran; Department of Physiology and Pharmacology, School of Medicine, Rafsanjan University of
Abstract
The accurate diagnosis of infectious diseases such as COVID-19 requires statistically reliable classification methods capable of handling complex, heterogeneous, and imbalanced data. In this study, five statistical and machine learning algorithms (logistic regression, linear discriminant analysis, k-nearest neighbors, decision tree, and random forest) were compared using clinical and laboratory data from 506 hospitalized patients in Rafsanjan, Iran. The dataset included 27 categorical and 11 quantitative variables. To address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was employed. Model performance was assessed using accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the area under the ROC curve (AUC). The comparative analysis showed that random forest and logistic regression achieved the best overall performance, while SMOTE improved sensitivity and NPV at the expense of specificity. The findings emphasize the importance of appropriate imbalance correction and multi-metric evaluation in developing statistically robust diagnostic models for medical data.
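As an illustration of the threshold-based criteria named in the abstract, the following minimal sketch computes accuracy, sensitivity, specificity, and the positive and negative predictive values from the four cells of a binary confusion matrix. The function name `diagnostic_metrics` and the counts are hypothetical and are not taken from the study; the formulas are the standard definitions.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for a binary diagnostic test.

    tp/fn: diseased patients classified positive/negative.
    tn/fp: non-diseased patients classified negative/positive.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,     # share of all patients classified correctly
        "sensitivity": tp / (tp + fn),     # share of diseased patients detected
        "specificity": tn / (tn + fp),     # share of non-diseased patients cleared
        "ppv": tp / (tp + fp),             # probability of disease given a positive result
        "npv": tn / (tn + fn),             # probability of no disease given a negative result
    }


# Hypothetical counts for illustration only (not results from the study).
metrics = diagnostic_metrics(tp=40, fp=10, tn=30, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity condition on the true class, whereas the predictive values condition on the predicted class. This is one reason a resampling step can move these criteria in different directions, and why reporting them together is more informative than accuracy alone.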

Volume 24, Issue 2
December 2025
Pages 16-32

  • Receive Date 18 February 2025
  • Revise Date 08 October 2025
  • Accept Date 15 December 2025