Exploring Feature Selection and Machine Learning Algorithms for Predicting Diabetes Disease

Автор: Eman I. Abd El-Latif, Islam A. Moneim

Журнал: International Journal of Intelligent Systems and Applications @ijisa

Статья в выпуске: 1 vol.16, 2024 года.

Бесплатный доступ

One of the most common diseases in the world is the chronic diabetes. Diabetes has a direct impact on the lives of millions of people worldwide. Diabetes can be controlled and improved with early diagnosis, but the majority of patients continue to live with it. There is a dispirit need to a system to anticipate and select the people who are most likely to be diabetes in the future. Diagnosing the future diseased person without taking any blood or glucose screening tests, is the main goal of this study. This paper proposed a deep-learning model for diabetes disease prediction. The proposed model consists of three main phases, data pre-processing, feature selection and finally different classifiers. Initially, during the data pre-processing stage, missing values are handled, and data normalization is applied to the data. Then, three techniques are used to select the most important features which are mutual information, chi-squared and Pearson correlation. After that, multiple machine learning classifiers are used. Four experiments are then conducted to test our models. Additionally, the effectiveness of the proposed model is evaluated against that of other well-known machine learning techniques. The accuracy, AUC, sensitivity, and F-measure of the linear regression classifier are higher than those of the other methods, according to experimental data, which show that it performs better. The suggested model worked better than traditional methods and had a high accuracy rate for predicting diabetic disease.

Еще

Diabetes, Mutual Information, Pearson Correlation and Chi-squared

Короткий адрес: https://sciup.org/15019352

IDR: 15019352   |   DOI: 10.5815/ijisa.2024.01.01

Текст научной статьи Exploring Feature Selection and Machine Learning Algorithms for Predicting Diabetes Disease

Published Online on February 8, 2024 by MECS Press

Diabetes is a prevalent chronic disease that is extremely harmful to human health [1]. The diabetes characteristic is that blood glucose levels are higher than normal, which results from impaired insulin secretion or its adverse biological effects, or both [2, 3]. There are two types of diabetes [4], the majority of diabetes populations are classified first type and most of them are younger than 30 years old. The signs of the diabetes of the first type are: expanded thirst, successive pee and high blood glucose levels. Obesity, hypertension and other diseases are frequently associated with type 2 diabetes, which is more prevalent in the elderly and middle-aged people [5]

A tenth of the adults in the future will have diabetes, as predicted by rising morbidity in recent years, when the number of diabetics worldwide will reach 642 million in 2040. Diabetes may have existed for four to twelve years prior to diagnosis. Half of patients with diabetes suffer damage after being diagnosed. Scientists demonstrate that early recognition of diabetes will avoid heart diseases, stroke and vascular complications. Machine learning can assist individuals to make a preliminary diagnosis about diabetes according to their daily physical examination data [6, 7].

The data and analysis are used to develop predictive models, for a variety of issues, is the primary focus of machine learning. Predicting diabetes with machine learning techniques is common and using feature selection enhances the accuracy and yields better results compared to using all features.

A lot of researchers have been applying machine learning concepts to predict a disease known as diabetic in recent years. Diabetes diagnosis and prediction are made possible through machine learning. It involves analyzing large datasets, finding patterns, and using statistical models and algorithms to create predictions based on these patterns. In order to estimate an individual's risk of developing diabetes, machine learning may assess medical data related to the condition, including blood glucose levels, blood pressure, and body mass index. It can provide more accurate forecasts than more conventional techniques. Machine learning algorithms have the ability to analyses enormous amounts of data and notice subtle patterns in it. People may therefore have improved outcomes from their diabetes diagnosis and treatment. When compared to previous approaches, machine learning has become more significant in the health sector since it is quick and simple to utilize.

This paper proposed a model for accurate diabetes diagnosis prediction. The proposed model consists of three main phases: data pre-processing step, feature selection using three algorithms, and evaluation phase using different metrics. The main contribution of the paper can be listed as follows:

  •    Data Pre-processing: The proposed model solved the missing values problem and data normalization is employed to normalize the variety and distribution of features.

  •    Feature Selection stage: mutual information, chi-squared and Pearson correlation are performed to select the most important features

  •    Classification Stage: Decision tree, random forest, k nearest neighbor and logistic regression are applied to diagnose the person

  •    Evaluation and Results Interpretation Phase: Different evaluation metrics are adopted for explaining the prediction result of the proposed model.

  •    Highly Accurate Prediction: Several experiments were conducted to evaluate the overall performance of the proposed model.

  • 2.    Related Works

The remainder of this paper is organized as follows. The earlier studies on the subject are presented in Section 2. A thorough analysis and explanation of the data utilized in this investigation are provided in Section 3. In Section 4, the suggested method is then presented. The evaluation of the experiment's findings is done in Section 6. In Section 7, the study is finally finished.

Several studies using machine learning or traditional algorithms to predict or detect diabetes are discussed in this section as shown in Table 1.

Kavakiotis et al. [8] have used a variety of machine learning methods, including SVM, DT, KNN, RF, LR, and Gradient Boosting. In [9], Principal component analysis and neurofuzzy inference are used to differentiate diabetic's people from the normal. In to predict type 2 diabetes, Yue et al [10] are used Quantum particle swarm optimization (QPSO) and weighted least squares support vector machine (WLS-SVM) but for type two diabetes, Razavian et al. [11] developed a strategy based on RF.

In [12], Duygu et al. used Linear Discriminant Analysis to extract and reduce the features and MWSVM for classification but in [13], Georga et al. used support vector regression (SVR). To increase the accuracy, Ozcift et al. [14] offered an algorithm called rotation forest merges 30 machine learning algorithm.

In [15], Quan et al. selected healthy and diabetic data at random for training set and then apply five cross validations. PCA and minimum redundancy maximum relevance are utilized to reduce the dimensionality. For the classification, DT, RF and neural network are applied to predict the diabetes. In [16], scientists used four classifiers: naive Bayes (NB), decision tree (DT), Adaboost (AB), and random forest (RF) to predict the diabetic patients. (K2, K5, and K10) are three different partitioning protocols. To be able to forecast the diabetic disease, authors trained machine learning algorithms like logistic regression, SVM, and ANN [17]. Additionally, three rounds of k-cross validation are carried out.

Table 1. Relevant approaches for predicting diabetes disease

Ref.

Methodology

Dataset

Performance metrics

[16]

NB,

DT, Adaboost (AB), RF

there was a total of 6561 features with 657 diabetic and 5904 non diabetic.

The classification accuracy of the RF classifier is 94.25%, while the classification accuracy of the NB classifier is 86.70%.

[17]

support vector machine, and artificial neural network

175 features with 50 percent diabetes patients and 50 percent in good health.

ACC = 84.09

[18]

multifactor dimensionality reduction (MDR) +KNN

30 522 comorbid patients, 270 172 hospital visitors, of whom 89 858 have diabetes, 58 745 have hypertension,

ACC = 81.3

[19]

fuzzy c-mean, RF, and SVM

6,500 people made up the entire sample.

SVM ACC = 0.986 AUC = 0.979

SVM, multifactor dimensionality reduction, k-nearest neighbours (k-NN), and LR are the four machine learning methods used in [18]. Five cross validation is used in the study to determine generalization accuracy and error rates. In order to distinguish between people who have diabetes and those who do not, this study evaluated four machinelearning classifiers (neural networks, SVM, fuzzy c-mean, and random forests) with two conventional classification approaches (LR and Fisher linear discriminant analysis) in [19].

3.    Dataset

The dataset was collected from over 400,000 Americans on health-related risk performances and chronic health conditions. This dataset contains 253,680 responses and it can download from [20]. Table 2 shows the data description for each medical predictor in the data.

Table 2. Data description

Features

Description

HighBP

Adults who have received a diagnosis of hypertension from a physician. There are two classes for this feature: 0 for no high pressure and 1 for pressure.

Body mass index (BMI)

It is a number that is calculated using a person's height and mass.

Stroke

In a medical condition known as a stroke, the brain's inadequate blood supply results in cell death.

HeartDiseaseorAttack

people who previously disclosed having myocardial infarction or coronary heart disease

NoDocbcCost

you needed to see a doctor within the last year but were unable to due to cost

GenHlth

general health on a scale of 1 to 5

MentHlth

It describes mental state, which includes stress, depression, and emotional difficulties, is described.

PhysHlth

It describes the physical health, which also includes bodily ailment and damage during 30 days

DiffWalk

It describes the difficulty in ascending stairs or moving about.

Income

Represents annual household income

HighCol

A heavy drinker is an adult who consumes more than 14 drinks per week in men and more than 7 drinks in women.

Smoker

Smoked at least 100 cigarettes during every day of your life.

Education

highest grade or academic year you have earned

Fig.1. The flowchart of the proposed algorithm

4.    Proposed Approach

The main goal of this paper is to predict diabetes disease using various machine learning algorithms without making any medical examination. The intrinsic quality of unprocessed data is enhanced throughout the data preparation stage to enable accurate diabetic disease prediction. First, a mean value imputation approach is employed to address the issue of missing values. In this method, absent data points are replaced with the mean of the pertinent feature. This reduces the likelihood of bias addition caused by missing values while retaining the dataset's overall statistical coherence. The variety and distribution of characteristics are then normalized using data normalization. This normalizing technique ensures that each variable contributes adequately to the learning process of the regression model, regardless of the disparities in their numerical magnitudes. After that, the most important features, which give the highest classification result, by applying three algorithms (mutual information, chi-squared and Pearson correlation) are determined. These selected approaches of machine learning, are decision tree, random forest, k nearest neighbor and logistic regression as shown in Figure 1. On the same dataset, previously multiple machine learning classifiers are used to predict diabetes. Our obtained results are compared with these previous results using the same evaluation metrics.

5.    Feature Selection

Feature selection is an essential part of data cleaning because it removes the unwanted features and helps us to identify the most important features, which improves our model's performance. There are numerous methods for choosing features. In this paper, we applied three methods: mutual information, chi-squared and Pearson correlation.

  • 5.1.    Mutual Information Algorithm

  • 5.2.    Chi-squared Algorithm

  • 6.    Classification

For fixed categories, such as the classification problem or the continuous target variable in regression problems, Mutual Information (MI) is measured [21]. Mutual Information calculated the entropy of the variables and measures the degree of dependence between the non-negative variables [22]. The value of MI is zero when the two random variables are independent, and the higher values indicate greater dependence. It can calculate by:

I(X ; Г) = H(X) - Н(Х | Y)                                  (1)

Where X, Y two random variables, I(X; Y) is mutual information, H(X) is the entropy for X and H(X | Y) is the conditional entropy for X given Y.

In feature selection, the connection between two categorical outcome attributes is frequently tested using a Chisquare test. Chi-squared determines if the attributes are independent or not [23]. The observed counts are close to the expected ones when the features are independent so chi-square value needs to be small. Simply, the feature can be selected for model training if its Chi-Square value is higher than 5%, as it likely to be dependent on the response. ChiSquare can calculate using:

x 2 =X(Oi-Ei)2/Ei                                        (2)

Where О is the observed value and E is the expected value

In this section, we discussed the different classifiers that used in this research: decision tree, random forest, k nearest neighbor and logistic regression [24].

  • 6.1.    Decision Tree

  • 6.2.    Random Forest

  • 6.3.    K-Nearest Neighbors Algorithm

  • 7.    Experimental Results
  • 7.1.    Evaluation Metrics

When the response variable is continuous, DT can be used as a regression tree, while when the response variable is categorical, it can be used as a classification tree. DT is used in machine learning to eliminate disorder or uncertainty from the dataset [25]. It starts with a single node representing the root. If all the data are in the same class, the other nodes converted to leafs. Otherwise, the discriminatory attribute is selected by the algorithm to serve as the current DT node. The training data are divided into many subsets, each of which forms a branch, and numerous branches are formed by many values based on the value of the current DT attribute. The previous steps are repeated to form a decision tree [26].

A supervised machine learning approach called random forest (RF) is used to solve classification and regression issues. It builds decision trees from different samples, using the majority vote for classification and the average for regression. RF has many advantages such as it requires less time to train than other algorithms; it runs efficiently with a large dataset and predicts output accurately [27].

The k-nearest neighbor's algorithm, often known as KNN, is a non-parametric, supervised learning classifier that employs proximity to classify or predict the grouping of each individual data point. Assuming that similar points can be discovered close to one another, it can be used for classification or regression issues; however, it is typically utilized as a classification algorithm. [28].

The k-nearest neighbor algorithm's objective is to locate the closest neighbors to the targeted point so that; a class label can be assigned to that point. KNN needs to calculate the distance between the targeted point and the other data points.

In this section, the evaluation metrics and results of the experiments are presented. The PC used for the experiments possesses the following features: x64-based processor, Windows 10, 2.60 GHz 2.59 GHz Intel(R) Core(TM) i7-9750H CPU, 16 GB of memory. We divided the data into 70% for validation and training and 30 % for the testing. The model is performing using the Python programming language

Different metrics are used to measure the performance of the experimental results such as accuracy, precision, F-Measure, Specificity and recall. The metrics are defined as follow:

.  .           TP

precision = TP+FP

(3)

Recall = TPR = -^—

TP+FN

(4)

F1 - Score =--—--

2TP+FP+FN

(5)

TP+TN

Accuracy =

J   TP+TN+FN+FP

(6)

Specificity = TNR = TN

TN+FP

(7)

Where FP, FN,TP and TN are the False Positive, False Negative, True Positive, and True Negative, respectively.

Another metrics called Area under Curve (AUC) measures the ability of the classifier to separate between classes. The classifier is able to correctly differentiate between all positive and negative class points when AUC is 1. However, the classifier would have predicted that all negatives would be positives and all positives would be negatives if the AUC is equal to zero in value. When the value of the AUC lies between 0.5 and 1, and there is a good chance that the classifier can distinguish between the positive class values and the negative class values.

Table 3. Statistical Description

Features

Mean

Std

Min

Max

HighBP

0.4290

0.4949

0

1

BMI

28.382

6.608

12

98

Stroke

0.04057

0.19729

0

1

HeartDiseaseorAttack

0.09418

0.2920

0

1

NoDocbcCost

0.084177

0.2777

0

1

GenHlth

2.5113

1.0684

1

5

MentHlth

3.1847

7.412

0

30

PhysHlth

4.2420

8.7179

0

30

DiffWalk

0.16822

0.3741

0

1

Income

6.0538

2.0711

1

8

HighCol

0.4241

0.4942

0

1

Smoker

0.44316

0.4967

0

1

Education

5.050

0.985

1

6

Age

8.032

3.054

1

13

  • 7.2.    Evaluation

The average, standard deviation, minimum, and maximum values for the most significant features of the dataset are thoroughly analyzed in Table 3. For further analysis, the dataset's correlation matrix is constructed. Figure 2 illustrates how to evaluate the relationship between two variables in a data collection using a statistical technique called a correlation matrix. The matrix is a table containing correlation coefficients in each cell; a correlation coefficient of 1

denotes a strong association between variables, a correlation coefficient of 0 a neutral relationship, and a correlation coefficient of -1 a weak relationship.

Fig.2. The correlation matrix

The proposed model is evaluated using several ML classifiers such as: LR, RF, DT, and KNN. Table 4 includes the evaluation metrics using all features in the dataset. For all of the evaluation metrics, except the precision one, LR has the highest values. On the other hand, DT has the highest precision.

Table 4. Predict the diabetes by using all features

classifiers

AUC %

Accuracy%

Precision%

Recall%

F-score%

LR

94.2922

88.0045

92.436

82.8569

87.3847

RF

94.1864

87.4104

93.9127

80.0828

86.4481

DT

90.8981

84.0673

96.346

70.914

81.6965

KNN

90.7245

84.0815

88.4021

78.5593

83.1906

Confusion matrix is created to assess a classification model's performance as shown in Figure 3. The matrix contrasts actual goal values with anticipated values from the ML model. A good model is one with low FP and FN rates and high TP and TN rates.

Training

Fig.3. The confusion matrix

Testing

Table 5 shows the results when chi-squared algorithm is used with all of the four under investigation classifiers. When the number of the feature is 10, RF has the best result among the four classifiers in AUC. For the accuracy, the recall and the F-score metrics, the LR achieved the best results. The worst results occur when three features are only selected and KNN is used. According to the chi-squared algorithm, we choose the first ten features that give the best results, these features are: 'HighBP', 'BMI', 'Stroke', 'HeartDiseaseorAttack', 'NoDocbcCost', 'GenHlth','MentHlth', 'PhysHlth', 'DiffWalk', 'Income'.

Table 5. Predict diabetes using chi-squared algorithm to select features

No. of features

classifiers

AUC

Accuracy%

Precision%

Recall%

F-score%

3

LR

0.767795

69.8887

72.2471

65.0746

68.4735

RF

0.788268

70.7893

74.833

63.0853

68.4588

DT

0.782667

70.8412

74.1862

64.3708

68.9309

KNN

0.732838

66.3853

63.8592

76.2691

69.5146

5

LR

0.805807

73.8401

77.7167

66.7359

71.8089

RF

0.818234

73.5854

82.385

59.898

69.3645

DT

0.802523

73.5854

82.385

59.898

69.3645

KNN

0.760443

69.1626

69.9134

67.1137

68.485

7

LR

0.89781

81.7333

85.0167

77.0283

80.8256

RF

0.908006

82.4359

90.5988

72.3679

80.4636

DT

0.892179

80.993

94.3791

65.8962

77.6068

KNN

0.882921

79.5455

78.7564

80.8962

79.812

10

LR

0.937223

86.8917

91.3306

81.4286

86.0958

RF

0.938201

86.3401

92.1825

79.3188

85.2682

DT

0.91404

84.8831

92.7841

75.544

83.2812

KNN

0.91981

85.1848

85.891

81.087

84.9794

Table 6 shows the results when Pearson algorithm is used as feature selector. We observed that, RF has the best result among the four classifiers corresponding to the AUC and accuracy metrics and when the number of features is 10. According to the Pearson algorithm, the chosen ten features, that give the best results, are: GenHlth', 'Income', 'DiffWalk', 'PhysHlth', 'Education', 'PhysActivity', 'BMI', 'MentHlth', 'HighBP', 'HeartDiseaseorAttack.

Table 6. Predict diabetes using pearson algorithm to select features

No. of features

classifiers

AUC

Accuracy%

Precision%

Recall%

F-score%

3

LR

0.884167

80.4272

82.7153

76.7052

79.597

RF

0.890344

80.7243

82.6799

77.5104

80.0117

DT

0.889559

80.6677

82.7649

77.2452

79.9098

KNN

0.86853

76.6786

71.8152

87.4763

78.8759

5

LR

0.910959

83.2045

87.6289

77.1996

82.0843

RF

0.914946

84.0202

90.9547

75.4399

82.474

DT

0.905135

83.261

87.8233

77.105

82.1159

KNN

0.896253

83.1997

88.8889

75.7616

81.8019

7

LR

0.924473

85.1565

89.8264

79.5071

84.3523

RF

0.928954

85.812

92.1646

78.4764

84.7715

DT

0.908286

84.2324

91.5325

75.6653

82.846

KNN

0.905616

84.4634

86.4009

82.0371

84.1625

10

LR

0.937665

86.9342

90.8546

81.9708

86.1844

RF

0.939545

86.9436

93.2377

79.5049

85.8254

DT

0.911572

84.784

95.5547

72.7807

82.6272

KNN

0.917518

85.0858

86.435

83.033

84.6999

Table 7 shows that, the results when the Mutual Information algorithm is used as a feature selector. RF has the best result among the four classifiers in the AUC and the accuracy metrics with a number of 10 feature. The worst results occur when three features are selected and KNN classifier is used.

Table 7. Predict diabetes using MI algorithm to select features

No. of features

classifiers

AUC

Accuracy

Precision

Recall

F-score

3

LR

0.904028

82.964

87.7592

76.8849

81.963

RF

0.90806

83.195

91.5625

73.382

81.4703

DT

0.903575

82.8838

91.7724

72.5016

81.0067

KNN

0.880137

79.1211

74.6781

88.5548

81.0267

5

LR

0.920472

85.5055

92.7229

76.9783

84.1203

RF

0.923346

85.7837

92.9854

77.3282

84.4371

DT

0.911368

84.5813

95.7202

72.3173

82.3891

KNN

0.909993

85.2886

92.0302

77.1863

83.9572

7

LR

0.930427

85.9487

91.4543

79.6493

85.1446

RF

0.934856

86.2788

93.0745

78.7167

85.2956

DT

0.907415

84.4115

94.7286

73.2419

82.611

KNN

0.911642

84.6614

86.5018

82.5406

84.4748

10

LR

0.939151

87.0521

91.1973

81.8553

86.2741

RF

0.940692

87.2029

93.9683

79.3512

86.434

DT

0.911836

84.8831

95.5431

72.9963

82.7616

KNN

0.917045

85.4489

86.9049

83.278

85.0528

The performance of the proposed MLSO-DNN model is compared with the most recent model that has been published for the prediction of diabetes disease in the final experiment shown in Table 8.

Table 8. The proposed model vs. the state-of-the-art models

Ref.

Model

Performance metrics

[16]

DT

86.42

NB

89.90

RF

93.12

AB

91.32

[17]

SVM+ cross validation

84.09

[18]

MDR + KNN

81.3

Proposed model

Feature selection + LR

94.06

8.    Conclusions

The primary aim of this work is to classify and predict diabetes from some features without carrying any medical examination. In this paper, we tested a variety of ML classification methods and selected features using three different algorithms to achieve the highest possible performance and accuracy. In addition, the comparisons among the results obtained from the four classifiers show that, the very close results are obtained from the proposed four classifiers, random forest, decision tree, k nearest neighbor and logistic regression but it is clearly that, the random forests have achieved the highest accuracy among them. The classifiers' performance was evaluated using the AUC, F-measure, which includes precision and recall, and accuracy. The model's high ability to predict outcomes and distinguish between the two classes is demonstrated by the AUC values. The best predictions are obtained when 10 features are selected and it occurs with all the four classifiers.

Funding

There was no external funding for this research.

Conflict of Interest

The corresponding author certifies that there is no conflict of interest on behalf of all authors.

Data Availability Statement

The data that support the findings of this study are available from author Eman I. Abd El-Latif, upon reasonable request.

Список литературы Exploring Feature Selection and Machine Learning Algorithms for Predicting Diabetes Disease

  • Krasteva, A., Panov, V., Krasteva, A., Kisselova, A., and Krastev, Z. Oral cavity and systemic diseases—Diabetes Mellitus. Biotechnol. Biotechnol. Equip. 25, 2183–2186, 2011. doi: 10.5504/BBEQ.2011.0022
  • Wang, Andrea N., et al. "Zucker Diabetic‐Sprague Dawley (ZDSD) rat: Type 2 diabetes translational research model." Experimental Physiology 107.4, 2022: 265-282.
  • Lonappan A, Bindu G, Thomas V, Jacob J, Rajasekaran C, Mathew KT. Diagnosis of diabetes mellitus using microwaves. J Electromagn Waves Appl. 2007;21(10):1393–401
  • Lee, B. J., and Kim, J. Y.. Identification of type 2 diabetes risk factors using phenotypes consisting of anthropometry and triglycerides based on machine learning. IEEE J. Biomed. Health Inform. 20, 39–46, 2016. doi: 10.1109/JBHI.2015.2396520
  • Echegoyen, Francisco X. Barrera, et al. "The nature and characteristics of hypertriglyceridemia in a large cohort with type 2 diabetes." Journal of diabetes and its complications 37.2, 2023: 108387.
  • Tuppad, Ashwini, and Shantala Devi Patil. "Machine learning for diabetes clinical decision support: a review." Advances in Computational Intelligence 2.2, 2022: 22.
  • Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., and Chouvarda, I.. Machine learning and data mining methods in diabetes research. Comput. Struct. Biotechnol. J. 15, 104–116, 2017. doi: 10.1016/j.csbj.2016.12.005
  • Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., and Chouvarda, I.. Machine learning and data mining methods in diabetes research. Comput. Struct. Biotechnol. J. 15, 104–116, 2017. doi: 10.1016/j.csbj.2016.12.005
  • Polat, K., and Günes, S.. An expert system approach based on principal component analysis and adaptive neuro-fuzzy inference system to diagnosis of diabetes disease. Digit. Signal Process. 17, 702–710, 2007.
  • Yue, C., Xin, L., Kewen, X., and Chang, S.. “An intelligent diagnosis to type 2 diabetes based on QPSO algorithm and WLS-SVM,” in Proceedings of the 2008 IEEE International Symposium on Intelligent Information Technology Application Workshops, Washington, DC, 2008.
  • Razavian, N., Blecker, S., Schmidt, A. M., Smith-McLallen, A., Nigam, S., and Sontag, D.. Population-level prediction of type 2 diabetes from claims data and analysis of risk factors. Big Data 3, 277–287, 2015.
  • Duygu,ç., and Esin, D.. An automatic diabetes diagnosis system based on LDA-wavelet support vector machine classifier. Expert Syst. Appl. 38, 8311–8315, 2011.
  • Georga, E. I., Protopappas, V. C., Ardigo, D., Marina, M., Zavaroni, I., Polyzos, D., et al. Multivariate prediction of subcutaneous glucose concentration in type 1 diabetes patients based on support vector regression. IEEE J. Biomed. Health Inform. 17, 71–81, 2013. doi: 10.1109/TITB.2012.2219876
  • Ozcift, A., and Gulten, A.. Classifier ensemble construction with rotation forest to improve medical diagnosis performance of machine learning algorithms. Comput. Methods Programs Biomed. 104, 443–451, 2011. doi: 10.1016/j.cmpb.2011.03.018
  • Zou, Q., Qu, K., Luo, Y., Yin, D., Ju, Y., & Tang, H.. Predicting diabetes mellitus with machine learning techniques. Frontiers in genetics, 9, 515, 2018.
  • Maniruzzaman, Md, et al. "Classification and prediction of diabetes disease using machine learning paradigm." Health information science and systems 8, 2020: 1-14.
  • Malik, Sarul, et al. "Non-invasive detection of fasting blood glucose level via electrochemical measurement of saliva." Springerplus 5, 2016: 1-12.
  • Farran, Bassam, et al. "Predictive models to assess risk of type 2 diabetes, hypertension and comorbidity: machine-learning algorithms and validation using national health data from Kuwait—a cohort study." BMJ open 3.5, 2013.
  • Tapak, Lily, et al. "Real-data comparison of data mining methods in prediction of diabetes in Iran." Healthcare informatics research 19.3, 2013: 177-185.
  • https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset?select=diabetes_binary_5050split_health_indicators_BRFSS2015.csv
  • Sengupta, Debapriya, Phalguni Gupta, and Arindam Biswas. "A survey on mutual information based medical image registration algorithms." Neurocomputing 486, 2022: 174-188.
  • Su, Xiangchenyang, and Fang Liu. "A survey for study of feature selection based on mutual information." 2018 9th workshop on hyperspectral image and signal processing: evolution in remote sensing (WHISPERS). IEEE, 2018.
  • Vashisht, Manisha, and Brijesh Kumar. "Traffic Sign Recognition Approach Using Artificial Neural Network and Chi-Squared Feature Selection." Next Generation of Internet of Things: Proceedings of ICNGIoT 2022. Singapore: Springer Nature Singapore, 2022. 519-527.
  • Hort, Max, et al. "Bia mitigation for machine learning classifiers: A comprehensive survey." arXiv preprint arXiv:2207.07068, 2022.
  • Priyanka, and Dharmender Kumar. "Decision tree classifier: a detailed survey." International Journal of Information and Decision Sciences 12.3, 2020: 246-269.
  • Nanfack, Géraldin, Paul Temple, and Benoît Frénay. "Constraint Enforcement on Decision Trees: A Survey." ACM Computing Surveys (CSUR) 54.10s, 2022: 1-36.
  • Shaik, Anjaneyulu Babu, and Sujatha Srinivasan. "A brief survey on random forest ensembles in classification model." International Conference on Innovative Computing and Communications: Proceedings of ICICC 2018, Volume 2. Springer Singapore, 2019.
  • Cunningham, Padraig, and Sarah Jane Delany. "k-Nearest neighbour classifiers-A Tutorial." ACM computing surveys (CSUR) 54.6, 2021: 1-25.
Еще
Статья научная