Constraint handling genetic algorithm for feature engineering in solving classification problems
Authors: Denisov M.A., Sopov E.A.
Journal: Siberian Aerospace Journal (Сибирский аэрокосмический журнал) @vestnik-sibsau
Section: Mathematics, computer engineering and control
Published in: No. 1, Vol. 22, 2021.
Free access
Feature engineering in machine learning is a promising but still insufficiently studied domain. Constructing a new feature space from the original set can increase the accuracy of the machine learning algorithm chosen to solve a complex data mining problem. Some existing selection methods can increase accuracy and reduce the feature space at the same time, and this reduction is a pressing task for big data problems. The paper considers a novel machine learning approach to classification problems based on feature engineering methods. The approach constructs informative features using feature selection and extraction methods: the original features and the features obtained by principal component analysis form a new feature set, and a genetic algorithm selects an effective subset of informative features from it. It is important to avoid overfitting and building a trivial classifier. Therefore, the fitness function is constrained to produce a given number of original features and a given number of features obtained by principal component analysis. The paper presents a comparative analysis of three classifiers: k-nearest neighbors, support vector machine and random forest. To demonstrate the accuracy improvement, the authors examine several real-world problems from the UCI Machine Learning Repository. The accuracy measure in the study is the macro F1-score. The results of numerical experiments show that the proposed approach outperforms both the results obtained on the original data set and the results of random feature selection (the lower bound for the results). Moreover, the accuracy improvement is obtained for all types of problems (data sets that have more features than values). All results are shown to be statistically significant.
Keywords: feature selection, feature construction, genetic algorithm, constraint optimization
Short URL: https://sciup.org/148322014
IDR: 148322014 | DOI: 10.31772/2712-8970-2021-22-1-18-31
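
The constrained selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the constraint is handled, so the linear penalty used here is an assumption, as are the GA operators (binary tournament, uniform crossover, bit-flip mutation, elitism), all parameter values, and the names `constraint_violation`, `penalized_fitness`, `run_ga` and `score_fn`. A chromosome is a binary mask over the original features followed by the principal components. In a real experiment `score_fn` would train one of the classifiers on the selected columns and return the macro F1-score on held-out data; here it is replaced by a toy stand-in.

```python
import random

def macro_f1(y_true, y_pred):
    """Macro F1: unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def constraint_violation(mask, n_orig, k_orig, k_pca):
    """How far the mask is from the required feature counts.

    mask[:n_orig] marks original features, mask[n_orig:] marks PCA features.
    """
    picked_orig = sum(mask[:n_orig])
    picked_pca = sum(mask[n_orig:])
    return abs(picked_orig - k_orig) + abs(picked_pca - k_pca)

def penalized_fitness(mask, score_fn, n_orig, k_orig, k_pca, penalty=0.1):
    """Classifier score minus a linear penalty (assumed handling scheme)."""
    return score_fn(mask) - penalty * constraint_violation(mask, n_orig, k_orig, k_pca)

def run_ga(score_fn, n_orig, n_pca, k_orig, k_pca,
           pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Small generational GA over binary feature masks."""
    rng = random.Random(seed)
    n = n_orig + n_pca
    fit = lambda m: penalized_fitness(m, score_fn, n_orig, k_orig, k_pca)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        def pick():
            # Binary tournament selection.
            a, b = rng.sample(pop, 2)
            return a if fit(a) >= fit(b) else b
        children = [max(pop, key=fit)]  # elitism: carry over the best mask
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            child = [rng.choice(pair) for pair in zip(p1, p2)]  # uniform crossover
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            children.append(child)
        pop = children
    return max(pop, key=fit)

# Toy usage: a stand-in scorer that rewards picking features 0, 2 and 5.
toy_score = lambda mask: (mask[0] + mask[2] + mask[5]) / 3
best = run_ga(toy_score, n_orig=4, n_pca=3, k_orig=2, k_pca=1)
```

A penalty keeps infeasible masks in the population but ranks them below feasible masks with a similar score. A repair operator that forces the exact counts would be an equally plausible reading of the abstract.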