
Exploring Feature Selection and Machine Learning Algorithms for Predicting Diabetes Disease

Authors: Eman I. Abd El-Latif, Islam A. Moneim

Journal: International Journal of Intelligent Systems and Applications @ijisa

Published in issue: No. 1, Vol. 16, 2024.

Free access

Diabetes is one of the most common chronic diseases in the world and directly affects the lives of millions of people. It can be controlled and improved with early diagnosis, but the majority of patients continue to live with it. There is a desperate need for a system that anticipates and identifies the people who are most likely to develop diabetes in the future. The main goal of this study is to identify such people without any blood or glucose screening tests. This paper proposes a machine-learning model for diabetes prediction. The proposed model consists of three main phases: data pre-processing, feature selection and classification. During the data pre-processing stage, missing values are handled and data normalization is applied. Then, three techniques (mutual information, chi-squared and Pearson correlation) are used to select the most important features. After that, multiple machine learning classifiers are applied. Four experiments are conducted to test the models. Additionally, the effectiveness of the proposed model is evaluated against that of other well-known machine learning techniques. According to the experimental results, the accuracy, AUC, sensitivity and F-measure of the logistic regression classifier are higher than those of the other methods. The suggested model outperformed traditional methods and achieved a high accuracy rate for predicting diabetes.


Diabetes, Mutual Information, Pearson Correlation and Chi-squared

Short URL: https://sciup.org/15019352

IDR: 15019352 | DOI: 10.5815/ijisa.2024.01.01


Published Online on February 8, 2024 by MECS Press

Diabetes is a prevalent chronic disease that is extremely harmful to human health [1]. Its defining characteristic is a blood glucose level that is higher than normal, which results from impaired insulin secretion, impaired biological effects of insulin, or both [2, 3]. There are two main types of diabetes [4]. Type 1 diabetes typically affects younger patients, most of them under 30 years old; its signs are increased thirst, frequent urination and high blood glucose levels. Type 2 diabetes is more prevalent in middle-aged and elderly people and is frequently associated with obesity, hypertension and other diseases [5].

With morbidity rising in recent years, the number of people with diabetes worldwide is predicted to reach 642 million in 2040, which means that one in ten adults will have diabetes. Diabetes may have existed for four to twelve years prior to diagnosis. Half of the patients with diabetes suffer damage after being diagnosed. Studies indicate that early recognition of diabetes helps avoid heart disease, stroke and vascular complications. Machine learning can help individuals make a preliminary judgment about diabetes from their routine physical examination data [6, 7].

The primary focus of machine learning is the use of data and analysis to develop predictive models for a variety of problems. Predicting diabetes with machine learning techniques is common, and feature selection can improve accuracy and yield better results than using all features.

In recent years, many researchers have applied machine learning to the prediction of diabetes. Machine learning makes diabetes diagnosis and prediction possible by analyzing large datasets, finding patterns, and using statistical models and algorithms to make predictions based on these patterns. To estimate an individual's risk of developing diabetes, machine learning can assess medical data related to the condition, including blood glucose levels, blood pressure and body mass index. It can provide more accurate forecasts than more conventional techniques, because machine learning algorithms can analyze enormous amounts of data and detect subtle patterns in it. Patients may therefore see improved outcomes from their diabetes diagnosis and treatment. Compared with previous approaches, machine learning has become more significant in the health sector because it is quick and simple to use.

This paper proposes a model for accurate diabetes prediction. The proposed model consists of three main phases: data pre-processing, feature selection using three algorithms, and evaluation using different metrics. The main contributions of the paper are as follows:

• Data Pre-processing: The proposed model handles missing values, and data normalization is employed to normalize the range and distribution of the features.
• Feature Selection Stage: Mutual information, chi-squared and Pearson correlation are applied to select the most important features.
• Classification Stage: Decision tree, random forest, k-nearest neighbor and logistic regression are applied to classify each person.
• Evaluation and Results Interpretation Phase: Different evaluation metrics are adopted to explain the prediction results of the proposed model.
• Highly Accurate Prediction: Several experiments are conducted to evaluate the overall performance of the proposed model.

The remainder of this paper is organized as follows. Section 2 presents earlier studies on the subject. Section 3 describes the data used in this investigation. Section 4 presents the proposed approach, Section 5 the feature selection algorithms and Section 6 the classifiers. Section 7 evaluates the experimental findings, and Section 8 concludes the study.

Several studies that use machine learning or traditional algorithms to predict or detect diabetes are discussed in this section and summarized in Table 1.

Kavakiotis et al. [8] used a variety of machine learning methods, including support vector machine (SVM), decision tree (DT), k-nearest neighbor (KNN), random forest (RF), logistic regression (LR) and gradient boosting. In [9], principal component analysis (PCA) and neuro-fuzzy inference are used to distinguish diabetic people from healthy ones. To predict type 2 diabetes, Yue et al. [10] used quantum particle swarm optimization (QPSO) and a weighted least squares support vector machine (WLS-SVM), whereas Razavian et al. [11] developed a strategy based on RF.

In [12], Duygu et al. used linear discriminant analysis to extract and reduce the features and MWSVM for classification, whereas Georga et al. [13] used support vector regression (SVR). To increase accuracy, Ozcift et al. [14] proposed an ensemble algorithm called rotation forest that combines 30 machine learning algorithms.

In [15], Zou et al. randomly selected healthy and diabetic records for the training set and then applied five-fold cross-validation. PCA and minimum redundancy maximum relevance are used to reduce the dimensionality, and DT, RF and a neural network are applied to predict diabetes. In [16], four classifiers are used to predict diabetic patients: naive Bayes (NB), decision tree (DT), Adaboost (AB) and random forest (RF), under three partitioning protocols (K2, K5 and K10). In [17], machine learning algorithms such as logistic regression, SVM and ANN are trained to forecast diabetes, and three rounds of k-fold cross-validation are carried out.

Table 1. Relevant approaches for predicting diabetes disease

Ref.	Methodology	Dataset	Performance metrics
[16]	NB, DT, Adaboost (AB), RF	there was a total of 6561 features with 657 diabetic and 5904 non diabetic.	The classification accuracy of the RF classifier is 94.25%, while the classification accuracy of the NB classifier is 86.70%.
[17]	support vector machine, and artificial neural network	175 features with 50 percent diabetes patients and 50 percent in good health.	ACC = 84.09
[18]	multifactor dimensionality reduction (MDR) +KNN	30 522 comorbid patients, 270 172 hospital visitors, of whom 89 858 have diabetes, 58 745 have hypertension,	ACC = 81.3
[19]	fuzzy c-mean, RF, and SVM	6,500 people made up the entire sample.	SVM ACC = 0.986 AUC = 0.979

Four machine learning methods are used in [18]: SVM, multifactor dimensionality reduction, k-nearest neighbours (k-NN) and LR. Five-fold cross-validation is used in that study to determine generalization accuracy and error rates. To distinguish between people who have diabetes and those who do not, the study in [19] compared four machine learning classifiers (neural networks, SVM, fuzzy c-means and random forests) with two conventional classification approaches (LR and Fisher linear discriminant analysis).

3. Dataset

The data were collected from over 400,000 Americans in a survey of health-related risk behaviors and chronic health conditions. The dataset contains 253,680 responses and can be downloaded from [20]. Table 2 describes each medical predictor in the data.

Table 2. Data description

Features	Description
HighBP	Adults who have received a diagnosis of hypertension from a physician. There are two classes for this feature: 0 for no high blood pressure and 1 for high blood pressure.
Body mass index (BMI)	It is a number that is calculated using a person's height and mass.
Stroke	In a medical condition known as a stroke, the brain's inadequate blood supply results in cell death.
HeartDiseaseorAttack	People who previously reported having coronary heart disease or myocardial infarction.
NoDocbcCost	Needed to see a doctor within the last year but could not because of cost.
GenHlth	General health on a scale of 1 to 5.
MentHlth	Mental state, which includes stress, depression and emotional difficulties.
PhysHlth	Physical health, which includes physical illness and injury, during the past 30 days.
DiffWalk	It describes the difficulty in ascending stairs or moving about.
Income	Represents annual household income
HighCol	Adults who have been told that they have high cholesterol. There are two classes for this feature: 0 for no high cholesterol and 1 for high cholesterol.
Smoker	Smoked at least 100 cigarettes in your entire life.
Education	Highest grade or year of school completed.

Fig.1. The flowchart of the proposed algorithm

4. Proposed Approach

The main goal of this paper is to predict diabetes using various machine learning algorithms without any medical examination. The quality of the raw data is first improved in the data preparation stage to enable accurate prediction. Mean value imputation is employed to address missing values: absent data points are replaced with the mean of the corresponding feature. This limits the bias introduced by missing values while retaining the dataset's overall statistical coherence. Data normalization is then applied to bring the features to a common range, which ensures that each variable contributes adequately to the learning process of the model regardless of differences in numerical magnitude. After that, three algorithms (mutual information, chi-squared and Pearson correlation) are applied to determine the most important features, that is, those that give the best classification results. Finally, four machine learning classifiers are applied: decision tree, random forest, k-nearest neighbor and logistic regression, as shown in Figure 1. Multiple machine learning classifiers have previously been used to predict diabetes on the same dataset, and our results are compared with those previous results using the same evaluation metrics.
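The two pre-processing steps can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' code: the paper does not say which normalization is used, so min-max scaling to [0, 1] is assumed here, and the function names and sample values are hypothetical.

```python
import math

def impute_mean(column):
    """Replace missing entries (None or NaN) with the mean of the observed values."""
    observed = [v for v in column if v is not None and not math.isnan(v)]
    mean = sum(observed) / len(observed)
    return [mean if v is None or math.isnan(v) else v for v in column]

def min_max_normalize(column):
    """Rescale a column linearly to the range [0, 1] (assumed normalization)."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: map every value to 0.0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical BMI values with one missing entry.
bmi = [22.0, None, 30.0, 38.0]
bmi_filled = impute_mean(bmi)                # the gap is filled with the mean of 22, 30 and 38
bmi_scaled = min_max_normalize(bmi_filled)   # smallest value maps to 0, largest to 1
print(bmi_filled)
print(bmi_scaled)
```

In a real pipeline the mean, minimum and maximum would normally be estimated on the training split only and then reused on the test split, so that no information leaks from the test data.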

5. Feature Selection

Feature selection is an essential part of data preparation because it removes unwanted features and identifies the most important ones, which improves the model's performance. There are numerous methods for choosing features. In this paper, we applied three methods: mutual information, chi-squared and Pearson correlation.

5.1. Mutual Information Algorithm

Mutual information (MI) can be measured against a categorical target in classification problems or a continuous target in regression problems [21]. It is computed from the entropies of the variables and gives a non-negative measure of the degree of dependence between them [22]. The value of MI is zero when the two random variables are independent, and higher values indicate greater dependence. It can be calculated by:

$I(X; Y) = H(X) - H(X \mid Y)$ (1)

where X and Y are two random variables, I(X; Y) is the mutual information, H(X) is the entropy of X and H(X | Y) is the conditional entropy of X given Y.

5.2. Chi-squared Algorithm

In feature selection, the relationship between two categorical attributes is frequently tested using a chi-squared test, which determines whether the attributes are independent or not [23]. When the attributes are independent, the observed counts are close to the expected ones and the chi-squared value is small. A feature can therefore be selected for model training if its chi-squared value is high, that is, if the p-value of the test is below the 5% significance level, as it is then likely to depend on the response. Chi-squared can be calculated using:

$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$ (2)

where O_i is the observed value and E_i is the expected value.
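Both scores can be computed directly from their definitions for discrete variables. The sketch below is a hedged illustration rather than the implementation used in the paper: the function names and toy data are hypothetical, entropies are measured in bits, and the chi-squared statistic is taken over the full contingency table without any continuity correction.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(X) in bits of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """I(X; Y) = H(X) - H(X | Y) for two discrete sequences, as in Eq. (1)."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    # H(X | Y): entropy of X within each value of Y, weighted by P(Y = y).
    h_x_given_y = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(x) - h_x_given_y

def chi_squared(x, y):
    """Chi-squared statistic of Eq. (2) over the contingency table of x and y."""
    n = len(x)
    observed = Counter(zip(x, y))
    x_totals, y_totals = Counter(x), Counter(y)
    stat = 0.0
    for xv, x_count in x_totals.items():
        for yv, y_count in y_totals.items():
            expected = x_count * y_count / n  # count expected under independence
            stat += (observed[(xv, yv)] - expected) ** 2 / expected
    return stat

# Toy data (hypothetical): one feature identical to the label, one unrelated to it.
label       = [0, 0, 0, 0, 1, 1, 1, 1]
informative = [0, 0, 0, 0, 1, 1, 1, 1]
unrelated   = [0, 1, 0, 1, 0, 1, 0, 1]

scores = {
    "informative": (mutual_information(informative, label), chi_squared(informative, label)),
    "unrelated": (mutual_information(unrelated, label), chi_squared(unrelated, label)),
}
# Rank the features by mutual information, highest first.
ranked = sorted(scores, key=lambda name: scores[name][0], reverse=True)
print(scores)
print(ranked)
```

The feature that copies the label receives the maximum score under both criteria, while the unrelated feature scores zero, which is the behaviour that a top-k selection relies on.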

This section describes the classifiers used in this research: decision tree, random forest, k-nearest neighbor and logistic regression [24].

6.1. Decision Tree

A decision tree (DT) can be used as a regression tree when the response variable is continuous and as a classification tree when it is categorical. DT is used in machine learning to reduce disorder or uncertainty in the dataset [25]. It starts with a single node representing the root. If all the data belong to the same class, the node becomes a leaf. Otherwise, the algorithm selects the most discriminatory attribute to serve as the current DT node. The training data are divided into subsets according to the values of that attribute, and each subset forms a branch. These steps are repeated on each branch to form the decision tree [26].

6.2. Random Forest

Random forest (RF) is a supervised machine learning approach used to solve classification and regression problems. It builds decision trees from different samples, using the majority vote for classification and the average for regression. RF has several advantages: it requires less time to train than other algorithms, runs efficiently on large datasets and predicts outputs accurately [27].

6.3. K-Nearest Neighbors Algorithm

The k-nearest neighbors algorithm (KNN) is a non-parametric, supervised learning classifier that uses proximity to classify or predict the grouping of an individual data point. It assumes that similar points lie close to one another. It can be used for classification or regression problems, but it is typically used as a classification algorithm [28].

The objective of KNN is to locate the closest neighbors of the target point so that a class label can be assigned to that point. To do so, KNN calculates the distance between the target point and the other data points.

In this section, the evaluation metrics and the results of the experiments are presented. The PC used for the experiments has the following specifications: x64-based processor, Windows 10, 2.60 GHz 2.59 GHz Intel(R) Core(TM) i7-9750H CPU, 16 GB of memory. We divided the data into 70% for training and validation and 30% for testing. The model is implemented in the Python programming language.

7.1. Evaluation Metrics

Different metrics are used to measure performance in the experiments, such as accuracy, precision, F-measure, specificity and recall. The metrics are defined as follows:

$\text{Precision} = \frac{TP}{TP + FP}$ (3)
$\text{Recall} = TPR = \frac{TP}{TP + FN}$ (4)
$\text{F1-Score} = \frac{2TP}{2TP + FP + FN}$ (5)
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}$ (6)
$\text{Specificity} = TNR = \frac{TN}{TN + FP}$ (7)

where FP, FN, TP and TN are the false positives, false negatives, true positives and true negatives, respectively.
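As a worked example of Eqs. (3) to (7), the following sketch evaluates the five metrics from the four counts. The function name and the counts are hypothetical and chosen only to keep the arithmetic easy to follow; the sketch does not guard against empty denominators.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return the metrics of Eqs. (3)-(7) from the four confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "accuracy": (tp + tn) / (tp + tn + fn + fp),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts, not taken from the paper's experiments.
m = classification_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts, recall is 80/100 = 0.8 while specificity is 90/100 = 0.9, which shows how a model can be stronger at recognizing negatives than positives even when its overall accuracy (170/200 = 0.85) looks balanced.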

Another metric, the area under the curve (AUC), measures the ability of the classifier to separate the classes. When the AUC is 1, the classifier correctly distinguishes all positive from all negative class points. When the AUC is 0, the classifier predicts all negatives as positives and all positives as negatives. When the AUC lies between 0.5 and 1, there is a good chance that the classifier can distinguish the positive class values from the negative class values.
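One way to make these boundary cases concrete is the rank interpretation of the AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below uses that interpretation with hypothetical scores; it compares every positive with every negative, so it is meant for illustration rather than for large datasets.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
perfect = auc_score(labels, [0.1, 0.2, 0.8, 0.9])   # every positive outranks every negative
reversed_ = auc_score(labels, [0.9, 0.8, 0.2, 0.1]) # every negative outranks every positive
partial = auc_score(labels, [0.1, 0.8, 0.2, 0.9])   # three of the four pairs are ranked correctly
print(perfect, reversed_, partial)
```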

Table 3. Statistical Description

Features	Mean	Std	Min	Max
HighBP	0.4290	0.4949	0	1
BMI	28.382	6.608	12	98
Stroke	0.04057	0.19729	0	1
HeartDiseaseorAttack	0.09418	0.2920	0	1
NoDocbcCost	0.084177	0.2777	0	1
GenHlth	2.5113	1.0684	1	5
MentHlth	3.1847	7.412	0	30
PhysHlth	4.2420	8.7179	0	30
DiffWalk	0.16822	0.3741	0	1
Income	6.0538	2.0711	1	8
HighCol	0.4241	0.4942	0	1
Smoker	0.44316	0.4967	0	1
Education	5.050	0.985	1	6
Age	8.032	3.054	1	13

7.2. Evaluation

Table 3 reports the mean, standard deviation, minimum and maximum values of the most significant features of the dataset. For further analysis, the dataset's correlation matrix is constructed, as shown in Figure 2. A correlation matrix is a statistical tool for evaluating the relationship between pairs of variables in a dataset: it is a table containing a correlation coefficient in each cell. A coefficient of 1 denotes a perfect positive association between two variables, a coefficient of 0 denotes no linear relationship, and a coefficient of -1 denotes a perfect negative association.
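A minimal sketch of how such a matrix can be computed is given below. The column names follow Table 2, but the values are made up for illustration and the helper names are hypothetical; a constant column would cause a division by zero and is not handled here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map every ordered pair of column names to its Pearson coefficient."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Made-up values: GenHlth rises with BMI, Income falls with BMI.
data = {
    "BMI": [20.0, 25.0, 30.0, 35.0],
    "GenHlth": [1.0, 2.0, 3.0, 4.0],
    "Income": [8.0, 6.0, 4.0, 2.0],
}
corr = correlation_matrix(data)
print(corr[("BMI", "GenHlth")])  # perfectly increasing relationship
print(corr[("BMI", "Income")])   # perfectly decreasing relationship
```

The same coefficient, computed between each feature and the diabetes label, is what a Pearson-based feature selector would rank by absolute value.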

Fig.2. The correlation matrix

The proposed model is evaluated using several ML classifiers such as: LR, RF, DT, and KNN. Table 4 includes the evaluation metrics using all features in the dataset. For all of the evaluation metrics, except the precision one, LR has the highest values. On the other hand, DT has the highest precision.

Table 4. Predict the diabetes by using all features

classifiers	AUC %	Accuracy%	Precision%	Recall%	F-score%
LR	94.2922	88.0045	92.436	82.8569	87.3847
RF	94.1864	87.4104	93.9127	80.0828	86.4481
DT	90.8981	84.0673	96.346	70.914	81.6965
KNN	90.7245	84.0815	88.4021	78.5593	83.1906

A confusion matrix is created to assess the classification model's performance, as shown in Figure 3. The matrix contrasts the actual target values with the values predicted by the ML model. A good model is one with low FP and FN rates and high TP and TN rates.
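The four cells of a binary confusion matrix can be counted as in the following sketch. The labels and predictions are hypothetical, and class 1 is assumed to denote the diabetic (positive) class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary classification problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Hypothetical actual labels and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)
```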

Fig.3. The confusion matrix (panels: Training and Testing)

Table 5 shows the results when the chi-squared algorithm is used with the four classifiers under investigation. With 10 features, RF has the best AUC among the four classifiers, while LR achieves the best accuracy, recall and F-score. The worst results occur when only three features are selected and KNN is used. According to the chi-squared algorithm, the ten features that give the best results are: 'HighBP', 'BMI', 'Stroke', 'HeartDiseaseorAttack', 'NoDocbcCost', 'GenHlth', 'MentHlth', 'PhysHlth', 'DiffWalk', 'Income'.

Table 5. Predict diabetes using chi-squared algorithm to select features

No. of features	classifiers	AUC	Accuracy%	Precision%	Recall%	F-score%
3	LR	0.767795	69.8887	72.2471	65.0746	68.4735
	RF	0.788268	70.7893	74.833	63.0853	68.4588
	DT	0.782667	70.8412	74.1862	64.3708	68.9309
	KNN	0.732838	66.3853	63.8592	76.2691	69.5146
5	LR	0.805807	73.8401	77.7167	66.7359	71.8089
	RF	0.818234	73.5854	82.385	59.898	69.3645
	DT	0.802523	73.5854	82.385	59.898	69.3645
	KNN	0.760443	69.1626	69.9134	67.1137	68.485
7	LR	0.89781	81.7333	85.0167	77.0283	80.8256
	RF	0.908006	82.4359	90.5988	72.3679	80.4636
	DT	0.892179	80.993	94.3791	65.8962	77.6068
	KNN	0.882921	79.5455	78.7564	80.8962	79.812
10	LR	0.937223	86.8917	91.3306	81.4286	86.0958
	RF	0.938201	86.3401	92.1825	79.3188	85.2682
	DT	0.91404	84.8831	92.7841	75.544	83.2812
	KNN	0.91981	85.1848	85.891	81.087	84.9794

Table 6 shows the results when the Pearson correlation algorithm is used as the feature selector. With 10 features, RF has the best AUC and accuracy among the four classifiers. According to the Pearson algorithm, the ten features that give the best results are: 'GenHlth', 'Income', 'DiffWalk', 'PhysHlth', 'Education', 'PhysActivity', 'BMI', 'MentHlth', 'HighBP', 'HeartDiseaseorAttack'.

Table 6. Predict diabetes using Pearson algorithm to select features

No. of features	classifiers	AUC	Accuracy%	Precision%	Recall%	F-score%
3	LR	0.884167	80.4272	82.7153	76.7052	79.597
	RF	0.890344	80.7243	82.6799	77.5104	80.0117
	DT	0.889559	80.6677	82.7649	77.2452	79.9098
	KNN	0.86853	76.6786	71.8152	87.4763	78.8759
5	LR	0.910959	83.2045	87.6289	77.1996	82.0843
	RF	0.914946	84.0202	90.9547	75.4399	82.474
	DT	0.905135	83.261	87.8233	77.105	82.1159
	KNN	0.896253	83.1997	88.8889	75.7616	81.8019
7	LR	0.924473	85.1565	89.8264	79.5071	84.3523
	RF	0.928954	85.812	92.1646	78.4764	84.7715
	DT	0.908286	84.2324	91.5325	75.6653	82.846
	KNN	0.905616	84.4634	86.4009	82.0371	84.1625
10	LR	0.937665	86.9342	90.8546	81.9708	86.1844
	RF	0.939545	86.9436	93.2377	79.5049	85.8254
	DT	0.911572	84.784	95.5547	72.7807	82.6272
	KNN	0.917518	85.0858	86.435	83.033	84.6999

Table 7 shows the results when the mutual information algorithm is used as the feature selector. With 10 features, RF has the best AUC and accuracy among the four classifiers. The worst results occur when three features are selected and the KNN classifier is used.

Table 7. Predict diabetes using MI algorithm to select features

No. of features	classifiers	AUC	Accuracy%	Precision%	Recall%	F-score%
3	LR	0.904028	82.964	87.7592	76.8849	81.963
	RF	0.90806	83.195	91.5625	73.382	81.4703
	DT	0.903575	82.8838	91.7724	72.5016	81.0067
	KNN	0.880137	79.1211	74.6781	88.5548	81.0267
5	LR	0.920472	85.5055	92.7229	76.9783	84.1203
	RF	0.923346	85.7837	92.9854	77.3282	84.4371
	DT	0.911368	84.5813	95.7202	72.3173	82.3891
	KNN	0.909993	85.2886	92.0302	77.1863	83.9572
7	LR	0.930427	85.9487	91.4543	79.6493	85.1446
	RF	0.934856	86.2788	93.0745	78.7167	85.2956
	DT	0.907415	84.4115	94.7286	73.2419	82.611
	KNN	0.911642	84.6614	86.5018	82.5406	84.4748
10	LR	0.939151	87.0521	91.1973	81.8553	86.2741
	RF	0.940692	87.2029	93.9683	79.3512	86.434
	DT	0.911836	84.8831	95.5431	72.9963	82.7616
	KNN	0.917045	85.4489	86.9049	83.278	85.0528

In the final experiment, the performance of the proposed model is compared with recently published models for diabetes prediction, as shown in Table 8.

Table 8. The proposed model vs. the state-of-the-art models

Ref.	Model	Performance metrics
[16]	DT	86.42
	NB	89.90
	RF	93.12
	AB	91.32
[17]	SVM+ cross validation	84.09
[18]	MDR + KNN	81.3
Proposed model	Feature selection + LR	94.06

8. Conclusions

The primary aim of this work is to classify and predict diabetes from a set of features without carrying out any medical examination. In this paper, we tested a variety of ML classification methods and selected features using three different algorithms to achieve the highest possible performance and accuracy. The comparison shows that the four classifiers (random forest, decision tree, k-nearest neighbor and logistic regression) give very close results, although random forest achieved the highest accuracy among them. The classifiers' performance was evaluated using the AUC, the F-measure (which combines precision and recall) and accuracy. The AUC values demonstrate the model's high ability to predict outcomes and distinguish between the two classes. For all four classifiers, the best predictions are obtained when 10 features are selected.

Funding

There was no external funding for this research.

Conflict of Interest

The corresponding author certifies that there is no conflict of interest on behalf of all authors.

Data Availability Statement

The data that support the findings of this study are available from author Eman I. Abd El-Latif, upon reasonable request.

References

[1] Krasteva, A., Panov, V., Krasteva, A., Kisselova, A., and Krastev, Z. Oral cavity and systemic diseases—Diabetes Mellitus. Biotechnol. Biotechnol. Equip. 25, 2183–2186, 2011. doi: 10.5504/BBEQ.2011.0022
[2] Wang, Andrea N., et al. "Zucker Diabetic‐Sprague Dawley (ZDSD) rat: Type 2 diabetes translational research model." Experimental Physiology 107.4, 2022: 265-282.
[3] Lonappan A, Bindu G, Thomas V, Jacob J, Rajasekaran C, Mathew KT. Diagnosis of diabetes mellitus using microwaves. J Electromagn Waves Appl. 2007;21(10):1393–401
[4] Lee, B. J., and Kim, J. Y. Identification of type 2 diabetes risk factors using phenotypes consisting of anthropometry and triglycerides based on machine learning. IEEE J. Biomed. Health Inform. 20, 39–46, 2016. doi: 10.1109/JBHI.2015.2396520
[5] Echegoyen, Francisco X. Barrera, et al. "The nature and characteristics of hypertriglyceridemia in a large cohort with type 2 diabetes." Journal of diabetes and its complications 37.2, 2023: 108387.
[6] Tuppad, Ashwini, and Shantala Devi Patil. "Machine learning for diabetes clinical decision support: a review." Advances in Computational Intelligence 2.2, 2022: 22.
[7] Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., and Chouvarda, I. Machine learning and data mining methods in diabetes research. Comput. Struct. Biotechnol. J. 15, 104–116, 2017. doi: 10.1016/j.csbj.2016.12.005
[8] Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., and Chouvarda, I. Machine learning and data mining methods in diabetes research. Comput. Struct. Biotechnol. J. 15, 104–116, 2017. doi: 10.1016/j.csbj.2016.12.005
[9] Polat, K., and Günes, S. An expert system approach based on principal component analysis and adaptive neuro-fuzzy inference system to diagnosis of diabetes disease. Digit. Signal Process. 17, 702–710, 2007.
[10] Yue, C., Xin, L., Kewen, X., and Chang, S. “An intelligent diagnosis to type 2 diabetes based on QPSO algorithm and WLS-SVM,” in Proceedings of the 2008 IEEE International Symposium on Intelligent Information Technology Application Workshops, Washington, DC, 2008.
[11] Razavian, N., Blecker, S., Schmidt, A. M., Smith-McLallen, A., Nigam, S., and Sontag, D. Population-level prediction of type 2 diabetes from claims data and analysis of risk factors. Big Data 3, 277–287, 2015.
[12] Duygu, Ç., and Esin, D. An automatic diabetes diagnosis system based on LDA-wavelet support vector machine classifier. Expert Syst. Appl. 38, 8311–8315, 2011.
[13] Georga, E. I., Protopappas, V. C., Ardigo, D., Marina, M., Zavaroni, I., Polyzos, D., et al. Multivariate prediction of subcutaneous glucose concentration in type 1 diabetes patients based on support vector regression. IEEE J. Biomed. Health Inform. 17, 71–81, 2013. doi: 10.1109/TITB.2012.2219876
[14] Ozcift, A., and Gulten, A. Classifier ensemble construction with rotation forest to improve medical diagnosis performance of machine learning algorithms. Comput. Methods Programs Biomed. 104, 443–451, 2011. doi: 10.1016/j.cmpb.2011.03.018
[15] Zou, Q., Qu, K., Luo, Y., Yin, D., Ju, Y., & Tang, H. Predicting diabetes mellitus with machine learning techniques. Frontiers in genetics, 9, 515, 2018.
[16] Maniruzzaman, Md, et al. "Classification and prediction of diabetes disease using machine learning paradigm." Health information science and systems 8, 2020: 1-14.
[17] Malik, Sarul, et al. "Non-invasive detection of fasting blood glucose level via electrochemical measurement of saliva." Springerplus 5, 2016: 1-12.
[18] Farran, Bassam, et al. "Predictive models to assess risk of type 2 diabetes, hypertension and comorbidity: machine-learning algorithms and validation using national health data from Kuwait—a cohort study." BMJ open 3.5, 2013.
[19] Tapak, Lily, et al. "Real-data comparison of data mining methods in prediction of diabetes in Iran." Healthcare informatics research 19.3, 2013: 177-185.
[20] https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset?select=diabetes_binary_5050split_health_indicators_BRFSS2015.csv
[21] Sengupta, Debapriya, Phalguni Gupta, and Arindam Biswas. "A survey on mutual information based medical image registration algorithms." Neurocomputing 486, 2022: 174-188.
[22] Su, Xiangchenyang, and Fang Liu. "A survey for study of feature selection based on mutual information." 2018 9th workshop on hyperspectral image and signal processing: evolution in remote sensing (WHISPERS). IEEE, 2018.
[23] Vashisht, Manisha, and Brijesh Kumar. "Traffic Sign Recognition Approach Using Artificial Neural Network and Chi-Squared Feature Selection." Next Generation of Internet of Things: Proceedings of ICNGIoT 2022. Singapore: Springer Nature Singapore, 2022. 519-527.
[24] Hort, Max, et al. "Bias mitigation for machine learning classifiers: A comprehensive survey." arXiv preprint arXiv:2207.07068, 2022.
[25] Priyanka, and Dharmender Kumar. "Decision tree classifier: a detailed survey." International Journal of Information and Decision Sciences 12.3, 2020: 246-269.
[26] Nanfack, Géraldin, Paul Temple, and Benoît Frénay. "Constraint Enforcement on Decision Trees: A Survey." ACM Computing Surveys (CSUR) 54.10s, 2022: 1-36.
[27] Shaik, Anjaneyulu Babu, and Sujatha Srinivasan. "A brief survey on random forest ensembles in classification model." International Conference on Innovative Computing and Communications: Proceedings of ICICC 2018, Volume 2. Springer Singapore, 2019.
[28] Cunningham, Padraig, and Sarah Jane Delany. "k-Nearest neighbour classifiers-A Tutorial." ACM computing surveys (CSUR) 54.6, 2021: 1-25.
