Comparative Study of Supervised Algorithms for Prediction of Students’ Performance

Authors: Madhuri T. Sathe, Amol C. Adamuthe

Journal: International Journal of Modern Education and Computer Science @ijmecs

Article in issue: no. 1, vol. 13, 2021.

Free access

Predicting students’ academic performance is a crucial task that depends on various factors. Machine learning and data mining algorithms are useful for such predictions. This paper investigates the application of C5.0, J48, CART, Naïve Bayes (NB), K-Nearest Neighbour (KNN), Random Forest and Support Vector Machine (SVM) to the prediction of students’ performance. Three datasets, from school level, college level and an e-learning platform, with varying input parameters are considered for comparison between C5.0, NB, J48, Multilayer Perceptron (MLP), PART, Random Forest, BayesNet, and Artificial Neural Network (ANN). The paper presents comparative results of C5.0, J48, CART, NB, KNN, Random Forest and SVM under changing tuning parameters. The performance of these techniques is tested on the three datasets. Results show that Random Forest and C5.0 perform better than J48, CART, NB, KNN, and SVM.


Educational data mining, Machine learning, Random forest, C5.0

Short URL: https://sciup.org/15017614

IDR: 15017614 | DOI: 10.5815/ijmecs.2021.01.01

Text of the research article: Comparative Study of Supervised Algorithms for Prediction of Students’ Performance

It is essential for every educational organization to provide high-quality education to its students. Students’ academic performance is a major concern for every institute because it is linked to job opportunities and to the reputation of the institution. Educational data mining (EDM) is a field that deals with processing and analyzing educational data. EDM develops methods to understand students and their learning environment [1]. It also helps to discover patterns that can help improve student performance. Predicting a student’s academic performance is difficult because it depends on various demographic, socio-economic and past academic factors. In this paper, the attributes that affect students’ academic performance, and the students’ grades, are determined for three different datasets.

In the literature, different data mining (DM) and machine learning (ML) algorithms have been tried on this problem. Machine learning algorithms ‘learn’ from given data, discover hidden patterns and provide predictions, which allow engineers, researchers and scientists to make reliable decisions. Machine learning is broadly divided into supervised, unsupervised, and reinforcement learning. Supervised learning analyses training data in which the classes or target variable are labelled and builds a model to perform predictions. DM and ML techniques are widely applied in the field of analytics and prediction. The research works in [2-8] make use of such algorithms, namely Logistic Regression, J48, Decision Tree, Support Vector Machine (SVM), NB, Random Tree, ANN, K-Nearest Neighbour (KNN), MLP, and Random Forest. In a few cases other techniques are also used, such as association rules [9] and clustering [10]. NB Tree is used in [11] for predicting student status, length of study and GPA. Techniques such as REP Tree, PART, Decision Table, Decision Stump, and JRip [12-13] are used for student performance prediction. Reported results show that the algorithms that perform well for predicting grades are Random Forest, J48, CART, NB, KNN and SVM. To the best of our knowledge, no experiments have been conducted on predicting students’ grades with C5.0, which is an advanced version of C4.5 (C4.5 is implemented in Weka as J48). Hence, these six algorithms and C5.0, seven in total, are used for grade prediction.

Attribute selection is a critical task for every DM and ML algorithm. The performance of an algorithm depends on the type of data it is given, and adding or removing certain attributes can change that performance. In educational research, data can be demographic, academic, or behavioural. In most cases, demographic and academic data are used for student performance prediction.

• Factors such as age, gender, annual income, parents’ occupation and parents’ education are included in [2, 4].
• Academic data such as subject marks and previous examination marks are included in [2, 14].
• Previous semester marks and participation in activities are included in [3, 12].
• GPA, subject marks and assignment marks are included in [11, 15].
• Exam scores, absences and attendance are used in [2, 7].
• Behavioural data is considered in the Kalboard 360 dataset, available on Kaggle and used in [5, 7].
• Comment-based data for every lesson taught is used in [16], where various attributes are retrieved using text classification on those comments; [17] uses questionnaire-based input variables.

In the literature, many machine learning algorithms have been tested on this problem. In most of the related work, the input parameters that are related to the target are not identified, for example in [2-3, 5]. It is also found that the algorithms applied in previous work [7-8, 28, 40] are run without fine tuning, so the limits of how accurately an algorithm can predict results are not explored.

This research paper presents the application of machine learning to three different datasets. The performance of each algorithm is analyzed by changing the values of its parameters. Results of the decision tree are compared with those of other algorithms in the literature.
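To make the idea of analyzing performance under changing parameter values concrete, the following is a minimal sketch in Python, which is an assumption and not necessarily the toolchain used in the paper. It sweeps the number of neighbours k of a small hand-written KNN classifier and records hold-out accuracy for each value. The feature pairs (previous marks, absences), the labels and the function names are hypothetical toy values and are not drawn from the paper's datasets.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`
    (squared Euclidean distance on the raw, unscaled features)."""
    neighbours = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of hold-out records whose predicted label matches the true one."""
    hits = sum(1 for features, label in test
               if knn_predict(train, features, k) == label)
    return hits / len(test)

def sweep_k(train, test, candidates):
    """Map each candidate k to its hold-out accuracy."""
    return {k: accuracy(train, test, k) for k in candidates}

# Hypothetical toy data: (previous marks, absences) -> pass/fail.
train = [
    ((85, 1), "pass"), ((78, 3), "pass"), ((90, 0), "pass"), ((60, 2), "pass"),
    ((40, 12), "fail"), ((48, 9), "fail"), ((55, 8), "fail"), ((70, 10), "fail"),
]
test = [
    ((82, 2), "pass"), ((45, 10), "fail"), ((66, 9), "fail"), ((63, 3), "pass"),
]

# Odd values of k avoid tied votes in this two-class example.
results = sweep_k(train, test, candidates=(1, 3, 5, 7))
best_k = max(results, key=results.get)
for k, acc in results.items():
    print(f"k={k}: accuracy={acc:.2f}")
print("best k:", best_k)
```

On real data the value of k chosen this way should be confirmed on records that were not used for the selection, for example through cross-validation, because accuracy on the same hold-out set is optimistic.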

The major research objectives of this paper are:

1. To find the attributes that have the most influence on the target variable using correlation. Using the correlation coefficient, the attributes most closely related to grades, i.e. those that affect academic performance most, are expected to be identified.
2. To apply C5.0, J48, NB, Random Forest, KNN, SVM, and CART. By comparing the results of these seven algorithms, we determine which of them predicts the results most accurately when fine tuning is applied.
3. To study the effect of tuning parameters on the accuracy of classifiers.
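The first objective, ranking attributes by their correlation with the target, can be sketched as follows. This is an illustrative example only: it assumes the Pearson coefficient (this section does not state which correlation coefficient is used), it is written in Python for convenience, and the attribute names and records are hypothetical toy values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)

def rank_attributes(records, target):
    """Return (attribute, r) pairs sorted by |r| with the target, strongest first."""
    target_values = [row[target] for row in records]
    scores = []
    for name in records[0]:
        if name == target:
            continue
        column = [row[name] for row in records]
        scores.append((name, pearson(column, target_values)))
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical toy records (not taken from the paper's datasets).
students = [
    {"absences": 2, "prev_marks": 78, "age": 16, "grade": 80},
    {"absences": 10, "prev_marks": 52, "age": 17, "grade": 50},
    {"absences": 0, "prev_marks": 90, "age": 16, "grade": 92},
    {"absences": 6, "prev_marks": 61, "age": 18, "grade": 63},
    {"absences": 4, "prev_marks": 70, "age": 17, "grade": 69},
]

ranking = rank_attributes(students, "grade")
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

Sorting by the absolute value of r keeps strongly negative relationships, such as absences in this toy data, near the top of the ranking alongside strongly positive ones.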

These results are evaluated using measures such as precision, recall, True Positive Rate (TPR) and False Positive Rate (FPR).
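As a reminder of how these measures relate to each other, the sketch below computes them for one class treated as positive against all other classes. Under the standard definitions, recall and TPR are the same quantity, TP / (TP + FN), while FPR is FP / (FP + TN). The code is a Python illustration on hypothetical labels, not output from the paper's experiments, and the function name is an assumption.

```python
def one_vs_rest_metrics(actual, predicted, positive):
    """Precision, recall (= TPR) and FPR for `positive` against all other classes."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)

    # Guard against empty denominators (e.g. a class that is never predicted).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "tpr": recall, "fpr": fpr}

# Hypothetical true and predicted labels for eight students.
actual = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
predicted = ["pass", "fail", "fail", "pass", "pass", "fail", "pass", "fail"]

metrics = one_vs_rest_metrics(actual, predicted, positive="pass")
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

For a multiclass grade target, the same function can be called once per class and the per-class values averaged.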

The paper is organized in six sections. Section II reviews the literature on student performance prediction. Section III contains the problem formulation for the current work and previous work. Section IV describes the methodologies used. Section V discusses the results, and Section VI presents the conclusion.

Research on predicting students’ academic performance has been carried out on various kinds of datasets using numerous methods. Datasets include e-learning, university and college data, and the methods applied include statistical methods, data mining techniques and machine learning algorithms.

A review of the algorithms used for student performance prediction and the accuracies they obtained is presented below. In [3], algorithms such as Logistic Regression, SVM, Decision Tree, NB, NN, and KNN are applied. Experiments are conducted both with all input variables and with attributes filtered through feature selection. KNN is found to perform better when all attributes are considered, while SVM and Logistic Regression perform well on the dataset with feature selection. Conversely, the work in [18] reports that Logistic Regression performs poorly, along with NB, but KNN has better accuracy. The datasets in [3] and [18] differ in the number of records; in addition, [3] uses previous academic data whereas [18] uses course grades along with psychometric factors. In [19], a Logistic Regression model is applied to e-learning data and a behavioural survey of students using the e-learning platform to predict a student’s failure in a particular course. The dataset considered is presumably small, and the accuracy obtained on it is 73.7%. Logistic Regression, NN and Random Forest are used in [20]. All of these techniques give a low level of correctness, a weakness that is overcome by including uncertain classes. Similar to [19], the work in [21] uses the same algorithm with almost the same number of records and obtains accuracies of 78.6% and 78.8%. In addition to the algorithms used in [3], the work in [22] uses BayesNet and SMO to predict a student’s failure in a particular course. Experiments are conducted on the dataset with and without filtering, discretization and rebalancing. Without filters, the performance of the algorithms is lower than expected; with filters, the performance of most algorithms improves. Decision Trees, Random Tree and Random Forest show the best potential. In [23], 3rd semester performance is predicted using Decision Tree and Random Tree. Results show that RT achieves 94.4% accuracy, followed by J48 with 88.37%. High prediction accuracy is also achieved in [12], whose work also conveys that data mining techniques are not limited by the size of datasets. In [24], multiclass classification is performed using RF, DT, SVM, NB, Boosting Trees and Bagging Trees. More than 2000 student records are considered for the prediction process. That work focuses on the degree-level performance of students. Random Forest achieves an accuracy of 96.17%, the best among the algorithms implemented. In [2, 5, 24], Random Forest, J48, and NB tend to predict the results more accurately. For SVM, the results achieved are remarkable, as it has the highest F1-measure after DT [4]. SVM and KNN are both found suitable for student academic predictions in [6]. To the best of our knowledge, CART and C5.0 are rarely used. The literature survey conveys that trees and Random Forests perform best for classifying students’ grades.

The literature also differs in the kind of outcome predicted using ML algorithms. In most cases, Grade Point Average (GPA) or cumulative GPA is predicted. The outcome can be a binary class, such as pass or fail in a particular subject or semester, or successful or unsuccessful completion of graduation or a degree. Alternatively, the target attribute can be multiclass; in [8], for example, the end-semester percentage is converted to five classes: best, very good, good, pass and fail. The research works in [3, 4, 19, 21, 22] focus on predictions for a particular course. The classification is binary: whether the student will pass or fail the course. In contrast, [18] predicts whether a student performs poorly in academics or is a strong achiever, based on GPA values. Final GPA is determined in [15] using demographic data, high school information, and family financial status. The work in [25] predicts whether a student will obtain an engineering degree, using student academic data and background information; the same is the case in [20]. Unlike these papers, the work in [26] predicts whether a student will obtain an excellent grade, a good grade, a pass or a fail in a course, with grades on a scale of 0 to 10 for the final exam of the course. Similarly, [23] predicts whether a student’s 3rd semester performance will be below average, average, above average or excellent. The work in [12] predicts whether a student’s score in a course will be low, medium or high.

The impact of different attributes on student performance is reviewed below. Most research works use correlation methods to find the influencing factors. Models trained on younger students gave good results in [18]. The type of registration to the university and the income of the student’s family are found to be correlated with student achievement. The four major attributes that highly affect student performance in [25] are First Year University GPA, CC BP transfer credit hours, first fall credits GPA and CC BP transfer GPA. Here, all attributes related to academics and the student’s background were considered. In [21], where e-learning data is used, the significant predictors are found to be the date of first login to the LMS, mode of study, previous academic performance record, and weighted average marks. In [23], where 3rd semester performance is predicted, 2nd semester results and leadership and drive qualities are found to correlate strongly with the output variable. In [27], the ability to understand and handle basic subjects strongly influences the final degree result. In [28], experiments are conducted on a dataset from the UCI Machine Learning Repository. Comparison with the target variable shows that weekday alcohol consumption, romantic relationships and parents’ education do affect student performance. GPA, Participation rule, Test average, Lab test average, Assignment submit attendance and Final grade are considered the best attributes in [29] for undergraduate student data. In [30], where Cluster Analysis and Association Rule Mining are used, a pattern is found: seven courses {MTH 111, STA 122, MTH 122, MTH 121, CSC 111, BIO 111, CSC 121} occur frequently in the data of failed students. These courses are found to be crucial for a student’s academic performance. The survey presented in [31] concludes that CGPA and internal marks are important attributes. Conversely, the evaluations in [32] mention that grades do not necessarily affect outcome achievement and that direct assessment has a positive impact on student performance.

Table 1. Literature review

References

[1] http://educationaldatamining.org/
[2] Mahboob, T., Irfan, S., & Karamat, A., “A machine learning approach for student assessment in E-learning using Quinlan's C4.5, Naive Bayes and Random Forest algorithms”, IEEE 19th International Multi-Topic Conference (INMIC), pp. 1-8, December 2016.
[3] Marbouti, F., Diefes-Dux, H. A., & Madhavan, K., “Models for early prediction of at-risk students in a course using standards-based grading”, Computers & Education, vol. 103, pp. 1-15, 2016.
[4] Costa, E. B., Fonseca, B., Santana, M. A., de Araújo, F. F., & Rego, J., “Evaluating the effectiveness of educational data mining techniques for early prediction of students' academic failure in introductory programming courses”, Computers in Human Behavior, vol. 73, pp. 247-256, 2017.
[5] Zhang, X., Xue, R., Liu, B., Lu, W., & Zhang, Y., “Grade Prediction of Student Academic Performance with Multiple Classification Models”, IEEE 14th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), pp. 1086-1090, 2018.
[6] Al-Shehri, H., Al-Qarni, A., Al-Saati, L., Batoaq, A., Badukhen, H., Alrashed, S., ... & Olatunji, S. O., “Student performance prediction using support vector machine and k-nearest neighbor”, IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE), pp. 1-4, April 2017.
[7] Amrieh, E. A., Hamtini, T., & Aljarah, I., “Preprocessing and analyzing educational data set using X-API for improving student's performance”, IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT), pp. 1-5, November 2015.
[8] Hussain, S., Dahan, N. A., Ba-Alwib, F. M., & Ribata, N., “Educational data mining and analysis of students’ academic performance using WEKA”, Indonesian Journal of Electrical Engineering and Computer Science, vol. 9(2), pp. 447-459, 2018.
[9] Parack, S., Zahid, Z., & Merchant, F., “Application of data mining in educational databases for predicting academic trends and patterns”, IEEE International Conference on Technology Enhanced Education (ICTEE), pp. 1-4, January 2012.
[10] Alfiani, A. P., & Wulandari, F. A., “Mapping student's performance based on data mining approach (a case study)”, Agriculture and Agricultural Science Procedia, vol. 3, pp. 173-177, 2015.
[11] Christian, T. M., & Ayub, M., “Exploration of classification using NBTree for predicting students' performance”, IEEE International Conference on Data and Software Engineering (ICODSE), pp. 1-6, November 2014.
[12] Natek, S., & Zwilling, M., “Student data mining solution–knowledge management system related to higher education institutions”, Expert Systems with Applications, vol. 41(14), pp. 6400-6407, 2014.
[13] Goga, M., Kuyoro, S., & Goga, N., “A recommender for improving the student academic performance”, Procedia-Social and Behavioral Sciences, vol. 180, pp. 1481-1488, 2015.
[14] Huang, S., & Fang, N., “Predicting student academic performance in an engineering dynamics course: A comparison of four types of predictive mathematical models”, Computers & Education, vol. 61, pp. 133-145, 2013.
[15] Guruler, H., Istanbullu, A., & Karahasan, M., “A new student performance analysing system using knowledge discovery in higher educational databases”, Computers & Education, vol. 55(1), pp. 247-254, 2010.
[16] Sorour, S. E., & Mine, T., “Building an interpretable model of predicting student performance using comment data mining”, IEEE 5th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI), pp. 285-291, July 2016.
[17] Vandamme, J. P., Meskens, N., & Superby, J. F., “Predicting academic performance by data mining methods”, Education Economics, vol. 15(4), pp. 405-419, 2007.
[18] Gray, G., McGuinness, C., & Owende, P., “An application of classification models to predict learner progression in tertiary education”, IEEE International Advance Computing Conference (IACC), pp. 549-554, February 2014.
[19] Macfadyen, L. P., & Dawson, S., “Mining LMS data to develop an “early warning system” for educators: A proof of concept”, Computers & Education, vol. 54(2), pp. 588-599, 2010.
[20] Hoffait, A. S., & Schyns, M., “Early detection of university students with potential difficulties”, Decision Support Systems, vol. 101, pp. 1-11, 2017.
[21] Palmer, S., “Modelling engineering student academic performance using academic analytics”, International Journal of Engineering Education, vol. 29(1), pp. 132-138, 2013.
[22] Romero, C., Espejo, P. G., Zafra, A., Romero, J. R., & Ventura, S., “Web usage mining for predicting final marks of students that use Moodle courses”, Computer Applications in Engineering Education, vol. 21(1), pp. 135-146, 2013.
[23] Mishra, T., Kumar, D., & Gupta, S., “Mining students’ data for performance prediction”, Proceedings of international conference on advanced computing & communication technologies, pp. 255-263, February 2014.
[24] Miguéis, V. L., Freitas, A., Garcia, P. J., & Silva, A., “Early segmentation of students according to their academic performance: A predictive modelling approach”, Decision Support Systems, vol. 115, pp. 36-51, 2018.
[25] Laugerman, M., Rover, D. T., Shelley, M. C., & Mickelson, S. K., “Determining graduation rates in engineering for community college transfer students using data mining”, International Journal of Engineering Education, vol. 31(6A), pp. 1448, 2015.
[26] Romero, C., & Ventura, S., “Data mining in education”, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 3(1), pp. 12-27, 2013.
[27] Arsad, P. M., & Buniyamin, N., “A neural network students' performance prediction model (NNSPPM)”, IEEE International Conference on Smart Instrumentation, Measurement and Applications (ICSIMA), pp. 1-5, November 2013.
[28] Roy, S., & Garg, A., “Predicting academic performance of student using classification techniques”, 4th IEEE Uttar Pradesh Section International Conference on Electrical, Computer and Electronics (UPCON), pp. 568-572, October 2017.
[29] Mueen, A., Zafar, B., & Manzoor, U., “Modeling and predicting students' academic performance using data mining techniques”, International Journal of Modern Education and Computer Science, vol. 8(11), pp. 36, 2016.
[30] Khan, I. H., “A Unified Framework for Systematic Evaluation of ABET Student Outcomes and Program Educational Objectives”, 2019.
[31] Kumar, M., Singh, A. J., & Handa, D., “Literature survey on student’s performance prediction in education using data mining techniques”, International Journal of Education and Management Engineering, vol. 7(6), pp. 42-49, 2017.
[32] Inyang, U. G., Eyoh, I. J., Robinson, S. A., & Udo, E. N., “Visual Association Analytics Approach to Predictive Modelling of Students’ Academic Performance”, 2019.
[33] https://archive.ics.uci.edu/ml/datasets/Student+Academics+Performance#
[34] https://archive.ics.uci.edu/ml/datasets/student+performance
[35] https://www.kaggle.com/aljarah/xAPI-Edu-Data
[36] Badr, G., Algobail, A., Almutairi, H., & Almutery, M., “Predicting students’ performance in university courses: a case study and tool in KSU mathematics department”, Procedia Computer Science, vol. 82, pp. 80-89, 2016.
[37] https://www.statisticssolutions.com/correlation-pearson-kendall-spearman/
[38] Drazin, S., & Montag, M., “Decision tree analysis using weka”, Machine Learning-Project II, University of Miami, pp. 1-3, 2012.
[39] Azmi, M. S. B. M., & Paris, I. H. B. M., “Academic performance prediction based on voting technique”, IEEE 3rd International Conference on Communication Software and Networks, pp. 24-27, May 2011.
[40] Amra, I. A. A., & Maghari, A. Y., “Students performance prediction using KNN and Naïve Bayesian”, 8th International Conference on Information Technology (ICIT), pp. 909-913, May 2017.
[41] Sokolova, M., & Lapalme, G., “A systematic analysis of performance measures for classification tasks”, Information Processing & Management, vol. 45(4), pp. 427-437, 2009.


Research article