A Unified Model of Clustering and Classification to Improve Students’ Employability Prediction

Authors: Pooja Thakar, Anil Mehta, Manisha

Journal: International Journal of Intelligent Systems and Applications (IJISA)

Issue: Vol. 9, No. 9, 2017.

Data mining is gaining immense popularity in the field of education due to its predictive capabilities, but most prior effort in this area is directed only towards the prediction of academic results. Education has now become employment oriented, yet very little attempt has been made to predict students’ employability. Precise prediction of students’ performance in campus placements at an early stage can identify students who are at risk of unemployment, so that proactive actions can be taken to improve their performance. Existing research on students’ employability prediction is based either on a single type of course or on a single university/institute, and is therefore not scalable from one context to another. With this necessity in mind, a unified model of clustering and classification is proposed in this paper. With the notion of unification, data of students of professional courses, namely Engineering and Masters in Computer Applications, are collected from various universities and institutions across India. The data is large, multivariate, incomplete, heterogeneous and unbalanced in nature. To deal with such data, a unified predictive model is built by integrating clustering and classification techniques. Two-level clustering (k-means kernel) with chi-square analysis is applied at the pre-processing stage for the automated selection of relevant attributes, and then a vote ensemble classification technique combining four classifiers, namely k-star, random tree, simple CART and random forest, is applied to predict students’ employability. The proposed framework provides a generalized solution for student employability prediction. Comparative results clearly depict the model’s performance over various classification techniques. When the proposed model is applied at the level of an individual state, classification accuracy reaches 96.78% with a kappa value of 0.937.

Keywords: Clustering, Classification, Data Mining, Employability, Prediction, Education

Short address: https://sciup.org/15010962

IDR: 15010962

Published Online September 2017 in MECS

  • I.    Introduction

    The latest report published by India Today on March 20, 2017, revealed that out of eight lakh engineering graduates in the country, 60% remain unemployed [1]. A similar report published a year earlier, on July 13, 2016, stated that only 7% of engineers are suitable for core domain jobs [2]. The need of the hour is to accurately predict graduates’ employability in the very first year of their course enrollment so that more effectual actions and policies can be implemented on time.

Institutes maintain details of students from their enrollment until they pass out, including academic records, personal details, and various skill-based test records such as aptitude tests and psychometric tests. This data can be of immense use if utilized for analysis. Studies reflect the potential of such data to guide students towards better employability. However, such data suffers from two major inherent problems: it is unbalanced and multivariate [3]. Moreover, most existing research considers a data set of only one type of course or one university/institute for predictive analysis of students’ employability [4][5]. Educational decision systems are thus extremely customized to fulfill the needs of a specific institute. There is no unified approach that can be used across institutes with any type of dataset.

This paper presents a unified model that serves two major objectives.

  • A.    To automate the selection of relevant attributes from a multivariate data set (151 attributes in the present study) at the pre-processing stage. The proposed model uses two-level clustering to reduce the data set automatically based upon classification results and, finally, finds the relevant set of attributes with chi-square analysis.

  • B.    To construct a unified prediction model based upon the most suitable classification algorithm(s). The proposed model integrates the four best classifiers with a vote ensemble method to predict students’ employability (i.e. placed or unplaced in on-campus placement drives).

Researchers suggest that an initial multidimensional data set should be put into self-learning mode to generate homogeneous groups [6]. It has also been shown that clustering applied to the attribute set at the pre-processing stage helps in parsimonious selection of variables and improves the performance of predictive algorithms [3]. An ensemble model enhances classification accuracy by integrating the predictions of base classifiers, and multiple-classifier systems increase the performance of individual classifiers [7].

Thus, the proposed model not only improves upon the prediction performance of various individual classifiers but also helps in finding the relevant set of attributes automatically.

The rest of the paper describes the proposed model and its results in detail. Section II provides a detailed review of existing literature in the domain of students’ employability prediction. Section III describes the experimental setting of tools and techniques used. Section IV presents the proposed model in detail. Section V describes comparative results. Section VI concludes and Section VII showcases the future scope of the study.

  • II.    Related Work

Data mining is widely applied in the field of education, especially in online learning systems [8] and in evaluating teachers’ performance [9]. However, the majority of research in educational data mining focuses on predicting students’ core academic performance based on their past academic results [10-15].

Many other studies have been carried out to find students’ employability factors. Communication skills, critical thinking, and problem-solving skills are considered important employability factors [16]. Research also reflects the importance of secondary attributes for better prediction accuracy [17]. Primary importance is given to soft skills by employers, as illustrated by Finch, Hamilton, Baldwin and Zehner [18]. Across industries, graduates lack interpersonal skills, communication skills, creative and critical thinking, problem-solving, analytical skills and teamwork, as described by Bakar, Mustapha and Nasir [19]. Industry demands a skilled workforce able to handle projects, rather than academic achievement alone, as depicted by Jamaludin and Sahibuddin [20].

In recent times, little work has been done in the direction of students’ employability prediction. Mishra, Kumar and Gupta collected the details of 1400 MCA students and predicted students’ employability with a maximum accuracy of 71.3% while excluding parameters such as reasoning, aptitude and communication skills [4]. To predict the employability of IT students, Piad, Dumlao, Ballera and Ambat collected data based on five-year profiles of 515 students, randomly picked from a placement office tracer study [21]. Jantawan and Tsai designed a model for employability prediction at a single university in Thailand, Maejo University, and emphasized the need for a large, multidimensional data set with automated pre-processing [5]. In 2015, Rashid indicated directions for future work and emphasized that increasing the size and type of educational data sets, together with alternative feature selection techniques, could enhance the performance of classification techniques [22].

In this paper, the proposed model works with a unified, large and multivariate data set from the education system. Automated pre-processing for feature selection is performed, followed by ensemble classification, to enhance the performance of predictive techniques.

  • III.    Experimental Setting

RapidMiner Studio Educational Version 7.4.000 is used to implement the machine learning algorithms. This version also includes and implements algorithms from the Weka mining tool.

  • A.    Data Collection

    Data is collected from diverse universities and institutes across India offering either Engineering (Bachelor of Engineering, B.E., or Bachelor of Technology, B.Tech.), a four-year degree course, or MCA (Master in Computer Applications), a three-year degree course. To ensure the authenticity of data, the TPOs (Training and Placement Officers) of these institutes were contacted, since it is generally observed that students provide more genuine information to their TPOs than when filling in an online questionnaire. Some TPOs provided their institutes’ past records in the specified worksheet; most provided data in the same format they maintain at their institute. The collected data is then compiled into one worksheet with 9459 instances and 160 attributes.

  • B.    Data Selection and Transformation

    Data is purposely collected for professional courses, as most of these students opt for placements at the end of their course, which ensures seriousness during campus placements. The raw dataset comprises around 9459 instances with 160 attributes. Preliminary data cleaning is done by removing non-pertinent parameters such as student name, institute name, batch, and phone number. Derived variables are computed, such as age obtained from the date of birth. Most of the attributes, such as occupation, qualification and marks, are categorized. Instances with incomplete basic information are removed. The final data set used for the study is composed of 7143 instances with 151 attributes. Table 1 lists the categories to which the attributes belong, and Table 2 displays a partial view of the attributes.
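A minimal sketch of this cleaning and transformation stage using pandas is given below; the file name and column names (e.g. date_of_birth, tenth_percentage) are illustrative assumptions, not the actual field names of the compiled worksheet.

```python
import pandas as pd

# Load the compiled worksheet (file name is assumed for illustration).
df = pd.read_excel("student_records.xlsx")

# Drop non-pertinent identifying columns (column names are hypothetical).
df = df.drop(columns=["student_name", "institute_name", "batch", "phone_number"],
             errors="ignore")

# Derive age from date of birth.
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"], errors="coerce")
df["age"] = (pd.Timestamp("2017-01-01") - df["date_of_birth"]).dt.days // 365

# Categorize a numeric attribute, e.g. secondary-exam marks, into bands.
df["marks_category"] = pd.cut(df["tenth_percentage"],
                              bins=[0, 50, 60, 75, 100],
                              labels=["D", "C", "B", "A"])

# Remove instances with incomplete basic information.
df = df.dropna(subset=["age", "gender", "tenth_percentage"])
```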

Table 1. Categories of Attributes

1. Personal Details
2. Demographic
3. Academic Records
4. Family Details
5. Social Parameters
6. Marketing Aptitude
7. Cognitive Aptitude
8. Leadership Skills
9. Core Technical Skills
10. Finance Aptitude
11. English Aptitude
12. Quantitative Aptitude
13. Logical Aptitude
14. Mathematical Aptitude
15. Personality / Psychometric Aptitude

Table 2. Sample Attributes

S.No. | Category | Sample Attributes
1 | Personal Details | Gender, Age
2 | Demographic | State, Graduation Univ, School State 10, School State 12, Graduation State
3 | Academic Records | 10th, 12th, Ug/Pg Sem1, Ug/Pg Sem2, Graduation Per, Stream12
4 | Family Details | Mother Occupation, Father Occupation, Mother Qualification, Father Qualification, Type of Family, Number of Siblings
5 | Social Parameters | Type of School 10, Type of School 12, Regular or Distant
6 | Marketing Aptitude | Sales Result Oriented, Sales Multitask, Sales Inquisitive, Sales Flexible, Sales Charismatic, Sales People
7 | Cognitive Aptitude | Cognitive Skills Articulate, Cognitive Skills Proficient, Cognitive Skills Logical
8 | Leadership Skills | HrScore, HrPer
9 | Core Technical Skills | Automata Score, Automata Per, Computer Prog Score, Computer Prog Per, CS Score, Excel Score
10 | Finance Aptitude | FaScore, FaPer
11 | English Aptitude | EnglishScore, EnglishPer, Vocabulary, Grammar, Comprehension, Communication
12 | Quantitative Aptitude | QuantScore, QAper
13 | Logical Aptitude | LogicalScore, LAper, InductiveReasoning, DeductiveReasoning, AbductiveReasoning
14 | Mathematical Aptitude | Basic Mathematics, Engineering Mathematics, Applied Mathematics
15 | Personality / Psychometric Aptitude | Personality Score, Personality Per, Agreeableness, Extraversion, Neuroticism, Conscientiousness

  • C.    Clustering and Classification Algorithms used for Model

For the purpose of clustering, k-means (kernel) is used. It uses kernel functions to approximate the distance between objects and clusters. The algorithm has quadratic complexity, since it is necessary to sum over all elements of a cluster to calculate one distance [23].
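The kernel-induced distance referred to here can be illustrated with the following minimal kernel k-means sketch in Python; the RBF kernel, the gamma value and the random initialization are assumptions, and the code is a stand-in for the RapidMiner operator rather than its implementation.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_kmeans(X, k, gamma=0.1, n_iter=50, seed=0):
    """Minimal kernel k-means: assigns points by the kernel-induced distance."""
    rng = np.random.default_rng(seed)
    K = rbf_kernel(X, gamma=gamma)          # n x n Gram matrix
    n = X.shape[0]
    labels = rng.integers(0, k, size=n)     # random initial assignment
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            mask = labels == c
            m = mask.sum()
            if m == 0:
                dist[:, c] = np.inf
                continue
            # ||phi(x_i) - mu_c||^2 = K_ii - (2/m) sum_j K_ij + (1/m^2) sum_jl K_jl
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / m
                          + K[np.ix_(mask, mask)].sum() / (m * m))
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

The per-cluster sums over the Gram matrix are what make the method quadratic, as noted above.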

A number of classification techniques are available, each having its own advantages and disadvantages. The attributes of the data set are a mix of polynomial and numerical data types; hence, algorithms that can handle such data are considered.

Major classification techniques used for the proposed model are as follows:

Simple CART: It is a binary decision tree constructed by splitting a node into two child nodes repeatedly, beginning with the root node that contains the whole learning sample. The growth of the tree is done by choosing a split among all the possible splits at each node so that the resulting nodes are “purest” [24].

k-Star: k-Star is an instance-based classifier that uses an entropy-based distance function. A test instance is classified based upon the class of the training instances most similar to it, as determined by the similarity function [25].

Random Tree: The classifier is resistant to overfitting. It is a collection of individual decision trees, where each tree is generated from different samples and subsets of the training data [26].

Random Forest: Random forests grow many classification trees. They do not overfit, run efficiently on large databases, and can handle many input variables without variable deletion [27].

The aforesaid classifiers, when implemented with the proposed model, performed better than the other classifiers, as shown in Table 3, and were therefore used for the final implementation.

Table 3. Classification Accuracy of Classifiers with Proposed Model

S.No. | Classifier | Accuracy % (Model)
1 | Simple Cart | 81.11
2 | kNN | 80.04
3 | VFI | 66.57
4 | VotedPerceptron | 56.33
5 | REP Tree | 75.43
6 | LMT | 80.57
7 | J48 Graft | 74.19
8 | J48 | 73.86
9 | ADT Tree | 68.24
10 | Random Forest | 85.96
11 | Random Tree | 82.9
12 | IB1 | 75.41
13 | Kstar | 82.11
14 | Ibk | 75.43
15 | DTNB | 77.09

  • D.    Performance Measures

10-fold cross-validation is chosen as the estimation approach to obtain a reasonable idea of model performance, since there is no separate test data set. This technique divides the training set into 10 equal parts: 9 parts are used to train the learning algorithm and 1 part is used as the test set. The procedure is repeated 10 times on the same dataset, so that every part acts as the test set exactly once. The performance of the model is measured by classification accuracy, kappa, and F1 score.
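A minimal sketch of this validation scheme, using scikit-learn's cross_val_score on synthetic data with a decision tree as a placeholder learner (the data and classifier are assumptions, not the paper's setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the transformed student data set (7000 x 12).
X, y = make_classification(n_samples=7000, n_features=12, random_state=42)

# 10-fold cross-validation: each fold serves as the test set exactly once.
clf = DecisionTreeClassifier(random_state=42)
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(scores.mean(), scores.std())
```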

The classification accuracy A_x of an algorithm x is defined by the number of samples correctly classified (true positives + true negatives), as shown in Equation 1.

A_x = (t / n) × 100                (1)

where t is the number of instances correctly classified and n is the total number of instances [28]. Classification accuracy alone is typically not enough information to judge the effectiveness of the model, so other parameters are also taken into consideration [29].

F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into consideration [30]. The formula for F1 Score is shown in Equation 2.

F1 = 2 × (Recall × Precision) / (Recall + Precision)                (2)

Kappa is another measure that can be used as a performance indicator. The kappa statistic is a chance-normalized measure of agreement, defined as K in Equation 3.

K = (P(A) - P(E)) / (1 - P(E))                (3)

where P(A) is the percentage agreement and P(E) is the chance agreement. If K = 1, the agreement between the classifier and the ground truth is perfect; if K = 0, the agreement is no better than chance [28]. Fuzzy approaches are also gaining popularity in many areas [31].
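For reference, all three measures can be computed for any set of predictions with scikit-learn's metrics module; the labels below are purely illustrative.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

# Illustrative ground-truth and predicted labels (Placed = 1, Unplaced = 0).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]

print("Accuracy:", accuracy_score(y_true, y_pred) * 100)   # Equation (1)
print("F1 score:", f1_score(y_true, y_pred))               # Equation (2)
print("Kappa   :", cohen_kappa_score(y_true, y_pred))      # Equation (3)
```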

  • IV.    Proposed Model: Three-Level Model Construction

RapidMiner Studio Educational Version 7.4.000 is used to implement machine learning algorithms. Pictorial representation of the three-level model is depicted in Fig. 1 and the detail of each level is further depicted in Fig. 2 and Fig. 3.

Fig.1. Proposed Model

The proposed model works at three levels. The 1st level implements automated pre-processing, where raw data is converted into refined data. The refined data is then taken to the 2nd level for classification. Finally, the 3rd level generates rules to facilitate decision making.

Level I: Automated Pre-Processing with Two-Level Clustering

An innovative approach is used for pre-processing the raw data into the transformed data set. The proposed automated pre-processing reduces dimensionality, finds a relevant set of attributes and provides a refined, transformed dataset that can readily be used for better classification results. Fig. 2 gives a pictorial representation of the two-level clustering used for automated pre-processing.

As shown in Fig. 2, at level I-A the raw data set is first balanced by sample bootstrapping so that an equal number of instances of each class is taken into consideration. K-means (kernel) clustering is applied to the balanced data set (7000 instances with 151 attributes) after transposing attributes to instances and vice versa. This produces two clusters, each grouping related attributes together. Both clusters are filtered and re-transposed for classification, and Simple CART classification is then applied to each cluster. The cluster with the better classification accuracy is selected and taken forward to the second level.

At level I-B, the chosen cluster is transposed again for second-level clustering, and k-means (kernel) clustering is re-applied to it. After second-level clustering, two clusters are obtained by filtration. Both clusters are transposed back and chi-square analysis is applied to select the top attributes with maximum relevance from each cluster. Finally, both sets are joined to obtain the final refined and transformed dataset with 12 attributes and 7000 instances.

Fig.2. Level I of Proposed Model

This automated approach helps in the fast and easy selection of relevant attributes from a large pool of attributes and also enhances the quality of the dataset for classification, resulting in improved classification accuracy.
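A simplified sketch of this Level I pipeline is shown below. It assumes a numeric, non-negative feature matrix (as required by the chi-square test) and substitutes plain k-means and a decision tree for the kernel k-means and Simple CART operators; the class-balancing step is omitted, and the cluster counts and the number of attributes kept per cluster are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def level_one(X, y, k_per_cluster=6, seed=0):
    """Cluster attributes (transposed data), keep the cluster whose attributes
    classify best, then pick the most relevant attributes by chi-square."""
    # Level I-A: cluster the attributes by clustering the transposed matrix.
    attr_labels = KMeans(n_clusters=2, random_state=seed).fit_predict(X.T)
    best_idx, best_acc = None, -1.0
    for c in (0, 1):
        cols = np.where(attr_labels == c)[0]
        acc = cross_val_score(DecisionTreeClassifier(random_state=seed),
                              X[:, cols], y, cv=10).mean()
        if acc > best_acc:
            best_idx, best_acc = cols, acc

    # Level I-B: re-cluster the chosen attributes, then chi-square per cluster.
    Xc = X[:, best_idx]                      # chi2 requires non-negative values
    sub_labels = KMeans(n_clusters=2, random_state=seed).fit_predict(Xc.T)
    selected = []
    for c in (0, 1):
        cols = np.where(sub_labels == c)[0]
        k = min(k_per_cluster, len(cols))
        if k == 0:
            continue
        skb = SelectKBest(chi2, k=k).fit(Xc[:, cols], y)
        keep = cols[skb.get_support(indices=True)]
        selected.extend(best_idx[keep].tolist())
    return sorted(selected)   # indices of the retained attributes (~12)
```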

Level II: Ensemble Classification by Voting

Fig.3. Level II of Proposed Model

At the second level, the transformed data set derived from Level I is taken forward for classification. Instead of choosing a single classification method, a voting ensemble is used for improved classification accuracy. Random Tree, k-Star, Simple CART and Random Forest are selected as base learners because they showed better classification accuracy than the other classifiers when applied alone, as shown earlier in Table 3.

The vote method uses the vote of each learner to classify an instance; the prediction with the maximum number of votes is selected. It thus uses the predictions of the four base learners to make a combined prediction (simple majority voting). Fig. 3 depicts the working of the ensemble classification in pictorial form.
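This majority-voting scheme can be approximated with scikit-learn's VotingClassifier. Since k-Star, Random Tree and Simple CART are Weka implementations, the sketch below substitutes rough scikit-learn analogues (a k-nearest-neighbour learner, an ExtraTreeClassifier and a DecisionTreeClassifier) and synthetic data, so it illustrates the voting mechanism rather than reproducing the paper's exact base learners.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

X, y = make_classification(n_samples=7000, n_features=12, random_state=42)

# Hard (majority) voting over four base learners, mirroring the vote ensemble.
ensemble = VotingClassifier(
    estimators=[
        ("cart_like", DecisionTreeClassifier(random_state=42)),    # Simple CART analogue
        ("kstar_like", KNeighborsClassifier(n_neighbors=5)),       # instance-based analogue
        ("random_tree_like", ExtraTreeClassifier(random_state=42)),
        ("random_forest", RandomForestClassifier(random_state=42)),
    ],
    voting="hard",
)
print(cross_val_score(ensemble, X, y, cv=10, scoring="f1").mean())
```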

Level III: Generate Rules

Rules are generated by the simple method of converting trees to rules. The trees formed during the process indicate the significant role of Graduation State, as shown in the partial CART decision tree and the partial random tree in Fig. 4 and Fig. 5 respectively.

CART Decision Tree graduationState=(Assam)|(Goa)|(Jharkhand)|(Jammu and Kashmir)|(R

| Graduationuniv=(State)|(Private)|(1)|(Deemed)

| | Englishper < 42.43888888888889

| | | quantscore < 517.5

| | | | Englishper < 29.625

| | | | | quantscore < 329.5

| | | | | | graduationState=(Gujarat)|(Madhya Pradesh)

| | | | | | | Computerprogscore < 317.5: Placed(4.34/0.18)

| | | | | | | Computerprogscore >= 317.5

| | | | | | | | quantscore < 150.0: Placed(1.08/0.04)

| | | | | | | | quantscore >= 150.0

| | | | | | | | | Logicalscore < 325.0: Placed(1.17/1.09)

|  |  |  |  |  |  |  |  | Logicalscore >= 325.0

| | | | | | | | | | graduationState=(Gujarat): Placed(0.34/0.18)

| | | | | | | | | | graduationState!=(Gujarat): Unplaced(4.0/0.0)

| | | | | | graduationState!=(Gujarat)|(Madhya Pradesh)

  • Fig.4.    Partial Cart Tree

RandomTree graduationState = West Bengal

| 12th = A

|  | Graduationuniv = Private

|  |  | Computerprogscore < 595 : Unplaced (7/0)

|  |  | Computerprogscore >= 595

|  |  |  | Logicalscore < 575 : Unplaced (2/0)

|  |  |  | Logicalscore >= 575 : Placed (1/0)

|  | Graduationuniv = State : Unplaced (0/0)

|  | Graduationuniv = Deemed : Unplaced (0/0)

  • Fig.5.    Partial Random Tree
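A comparable tree-to-rule conversion can be sketched with scikit-learn's export_text, where each root-to-leaf path of the printed tree reads as one if-then rule. The feature names are borrowed from the paper's attribute set, but the data here is synthetic, so this is only an analogue of the procedure used at Level III.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
feature_names = ["quantscore", "Englishscore", "Logicalscore", "Computerprogscore"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Each root-to-leaf path in this text dump corresponds to one if-then rule.
print(export_text(tree, feature_names=feature_names))
```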

    The aforesaid model is then applied to three different state-wise datasets. All states with more than 500 instances, namely Maharashtra with 958 instances, Uttar Pradesh (UP) with 1192 instances and Andhra Pradesh (AP) with 576 instances, were chosen and experimented upon. The results are shown in Table 4.

Table 4. State-wise Results with Proposed Model

State | Accuracy (%) | F1 Score | Kappa
Maharashtra | 96.78 | 96.8 | 0.936
AP | 90.6 | 90.5 | 0.812
UP | 83.82 | 83.6 | 0.676

All the above results are quite noteworthy for predicting the employability of a student in a particular state, and are thus reasonable to use further in a decision support system. Sample rules generated for Maharashtra State, where the rule set classifies 858 out of 900 training examples correctly, are shown below.

Rules Generated

  • 1.    if quantscore > 462.500 and Englishscore > 352.500 and Logicalscore > 380 and Englishscore > 457.500 then Placed (27 / 345)

  • 2.    if quantscore > 462.500 and Englishscore > 352.500 and Logicalscore > 380 and Englishscore ≤ 457.500 and Logicalscore > 505 and quantscore > 492.500 then Placed (10 / 40)

  • 3.    if quantscore > 462.500 and Englishscore > 352.500 and Logicalscore > 380 and Englishscore ≤ 457.500 and Logicalscore > 505 and quantscore ≤ 492.500 and Englishscore > 365 then Unplaced (9 / 0)

  • 4.    if quantscore > 462.500 and Englishscore > 352.500 and Logicalscore > 380 and Englishscore ≤ 457.500 and Logicalscore > 505 and quantscore ≤ 492.500 and Englishscore ≤ 365 then Placed (0 / 5)

  • 5.    if quantscore > 462.500 and Englishscore > 352.500 and Logicalscore > 380 and Englishscore ≤ 457.500 and Logicalscore ≤ 505 then Unplaced (34 / 0)

  • 6.    if quantscore > 462.500 and Englishscore > 352.500 and Logicalscore ≤ 380 then Unplaced (3 / 0)

  • 7.    if quantscore > 462.500 and Englishscore ≤ 352.500 then Unplaced (26 / 0)

  • 8.    if quantscore ≤ 462.500 and Englishscore > 655 then Placed (0 / 26)

  • 9.    if quantscore ≤ 462.500 and Englishscore ≤ 655 and Logicalscore > 652.500 then Placed (0 / 4)

  • 10.    if quantscore ≤ 462.500 and Englishscore ≤ 655 and Logicalscore ≤ 652.500 and Englishscore > 445 and quantscore > 450 and Englishscore > 475 then Placed (1 / 6)

  • 11.    if quantscore ≤ 462.500 and Englishscore ≤ 655 and Logicalscore ≤ 652.500 and Englishscore > 445 and quantscore > 450 and Englishscore ≤ 475 then Unplaced (2 / 0)

  • 12.    if quantscore ≤ 462.500 and Englishscore ≤ 655 and Logicalscore ≤ 652.500 and Englishscore > 445 and quantscore ≤ 450 and Logicalscore > 430 and Logicalscore > 437.500 and Logicalscore > 490 and Englishscore > 550 and Englishscore > 562.500 then Unplaced (3 / 0)

  • 13.    if quantscore ≤ 462.500 and Englishscore ≤ 655 and Logicalscore ≤ 652.500 and Englishscore > 445 and quantscore ≤ 450 and Logicalscore > 430 and Logicalscore > 437.500 and Logicalscore > 490 and Englishscore > 550 and Englishscore ≤ 562.500 then Placed (1 / 10)

  • 14.    if quantscore ≤ 462.500 and Englishscore ≤ 655 and Logicalscore ≤ 652.500 and Englishscore > 445 and quantscore ≤ 450 and Logicalscore > 430 and Logicalscore > 437.500 and Logicalscore > 490 and Englishscore ≤ 550 and Englishscore > 457.500 then Unplaced (21 / 0)

  • 15.    if quantscore ≤ 462.500 and Englishscore ≤ 655 and Logicalscore ≤ 652.500 and Englishscore > 445 and quantscore ≤ 450 and Logicalscore > 430 and Logicalscore > 437.500 and Logicalscore > 490 and Englishscore ≤ 550 and Englishscore ≤ 457.500 then Placed (2 / 3)

  • 16.    if quantscore ≤ 462.500 and Englishscore ≤ 655 and Logicalscore ≤ 652.500 and Englishscore > 445 and quantscore ≤ 450 and Logicalscore > 430 and Logicalscore > 437.500 and Logicalscore ≤ 490 then Unplaced (36 / 0)

  • 17.    if quantscore ≤ 462.500 and Englishscore ≤ 655 and Logicalscore ≤ 652.500 and Englishscore > 445 and quantscore ≤ 450 and Logicalscore > 430 and Logicalscore ≤ 437.500 and quantscore > 362.500 then Unplaced (4 / 0)

  • 18.    if quantscore ≤ 462.500 and Englishscore ≤ 655 and Logicalscore ≤ 652.500 and Englishscore > 445 and quantscore ≤ 450 and Logicalscore > 430 and Logicalscore ≤ 437.500 and quantscore ≤ 362.500 then Placed (1 / 11)

  • 19.    if quantscore ≤ 462.500 and Englishscore ≤ 655 and Logicalscore ≤ 652.500 and Englishscore > 445 and quantscore ≤ 450 and Logicalscore ≤ 430 then Unplaced (59 / 0)

  • 20.    if quantscore ≤ 462.500 and Englishscore ≤ 655 and Logicalscore ≤ 652.500 and Englishscore ≤ 445 then Unplaced (201 / 0)

correct: 858 out of 900 training examples.

Hence, this three-level model first selects a relevant set of attributes automatically, then classifies, and finally generates rules to assist in decision making. The next section compares the results of individual classifiers with the proposed model.

  • V.    Results

    The performance of 10 predictive algorithms on the aforesaid dataset (7143 instances with 151 attributes) in predicting students’ employability (i.e. placed / unplaced in on-campus placements) was examined. Because of the inherent problems of an unbalanced, multidimensional, large data set, the classifiers alone could not perform well. The proposed model was then applied: automated pre-processing converted the raw data set into the transformed data set, which was then fed to the individual classifiers.

The results show a significant improvement in F1 score, as shown in Table 5, and clearly depict the model's advantage over the classification techniques applied alone.

Table 5. Comparative F1 Score of Proposed Model with Classifiers

Classifier | F1 Score (Classification Only) | F1 Score (Proposed Model)
Simple Cart | 62.9 | 82.06
kNN | 33.1 | 78.4
Voted Perceptron | 45.2 | 61.6
J48 Graft | 51.7 | 74.8
J48 | 51.7 | 74.5
IB1 | 32.9 | 76.2
Kstar | 22.1 | 83.7
Ibk | 32.8 | 76.3
ADT Tree | 63.5 | 68.4
VFI | 57.8 | 64.3

Fig. 6 illustrates a comparative graph of F1 scores; the proposed model outperforms the individual classifiers.

Model performance can be further verified with a statistical t-test.

Hypotheses defined:

Ho: There is no significant difference in the F1 score of the classifiers as compared to the proposed model.

Ha: There is a significant difference in the F1 score of the classifiers as compared to the proposed model.

A t-test (paired two sample for means) was conducted between the F1 scores obtained by classification only and by the same algorithms applied with the proposed model. The results in Table 6 reflect that the improvement is significant.

The results show that t Stat > t Critical (two-tail), so Ho is rejected at the 5% level of significance. This proves that there is a significant difference between the mean F1 score of the classifiers applied alone (45.37) and the same classifiers applied within the proposed model (74.026). Furthermore, the proposed model has lower variability (variance 52.07) than classification only (variance 208.46). Hence the proposed model excels.

Table 6. t-Test Results Proving the Efficiency of the Proposed Model

t-Test: Paired Two Sample for Means | Proposed Model | Classification Only
Mean | 74.026 | 45.37
Variance | 52.07076 | 208.469
Observations | 10 | 10
Pearson Correlation | -0.433534894 |
Hypothesized Mean Difference | 0 |
df | 9 |
t Stat | 4.837683844 |
P(T<=t) one-tail | 0.000461995 |
t Critical one-tail | 1.833112923 |
P(T<=t) two-tail | 0.00092399 |
t Critical two-tail | 2.262157158 |
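The paired t-test in Table 6 can be reproduced directly from the F1 scores listed in Table 5, for example with scipy.stats.ttest_rel:

```python
from scipy import stats

# F1 scores from Table 5 (same classifier order in both lists).
classification_only = [62.9, 33.1, 45.2, 51.7, 51.7, 32.9, 22.1, 32.8, 63.5, 57.8]
proposed_model      = [82.06, 78.4, 61.6, 74.8, 74.5, 76.2, 83.7, 76.3, 68.4, 64.3]

# Paired two-sample t-test for means (df = 9).
t_stat, p_two_tail = stats.ttest_rel(proposed_model, classification_only)
print(t_stat, p_two_tail)   # roughly t ≈ 4.84 and p ≈ 0.0009, matching Table 6
```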

At Level I, the two-level automated pre-processing proposed in the model reduces and refines the large attribute set, balances the classes and makes the data ready for the classifiers, so the classifiers perform better than when applied to the raw data. The graph clearly showcases the model's advantage over individual classifiers, with maximum F1 scores of 83.7% and 82.06% given by k-Star and Simple CART respectively.

At Level II, the best-performing classifiers are ensembled to integrate the prediction accuracy of the individual classifiers. The vote operator is used to ensemble the top 4 classifiers, namely Simple CART, k-Star, Random Forest and Random Tree. The vote ensemble applies all four classifiers from its sub-process and assigns to each example the predicted class with the maximum number of votes, combining the results of all base classifiers to enhance classification performance. Table 7 displays the result obtained after applying the complete model (Level I and Level II) to the raw data.

Table 7. Model Performance with Complete Raw Data

Model Performance with Complete Raw Data (Instances 7143, Attributes 151)

Accuracy (%) | F1 Score (%) | Kappa | Weighted Mean Recall (%) | Weighted Mean Precision (%)
87.51 | 87.7 | 0.750 | 87.51 | 87.57

As shown in Table 7, the combined result improves the performance from the prior maximum F1 score of 83.7% (with k-Star) to 87.7% with the vote ensemble.

Moreover, when the model is applied further at the level of individual states, as described earlier in Table 4, its performance improves radically, reaching a 96.8% F1 score in the case of Maharashtra. Table 8 displays the results obtained for Maharashtra State, which clearly demonstrate this improvement.

Table 8. Model Performance with State (Maharashtra)

Model Performance with State (Maharashtra) Raw Data (Instances 958, Attributes 151)

Accuracy (%) | F1 Score (%) | Kappa | Weighted Mean Recall (%) | Weighted Mean Precision (%)
96.78 | 96.8 | 0.936 | 96.78 | 96.88

  • VI.    Conclusions

The results evidently establish that students’ employability prediction can be enhanced by applying the proposed model. The model provides a generalized, scalable solution to students’ employability prediction and can thus act as a base for developing a unified decision support system in the education domain.

Once the model was applied state-wise on three different datasets, eight attributes were found to be common, which play a significant role in predicting students’ employability (placed/unplaced) in the very first year of their enrollment: Age, Percentage in Secondary Exam, English Score, English Percentage, Logical Score, Logical Percentage, Quantitative Score and Quantitative Percentage. Scores represent the individual score of the student in the aptitude test, and percentages reflect his/her performance in the batch relative to the peers who attempted the test.

Further, the rules derived from the model emphasize the scores and percentages of the aptitude test. Thus, the performance of students in final placements can be improved by taking proactive actions at the beginning of the 1st semester. As soon as a student is enrolled in the course, the institute should conduct an aptitude test covering English, quantitative and logical ability. Based on the scores obtained, students should be trained in subsequent semesters in their respective weak areas (English, quantitative or logical). These tests should also be conducted periodically to monitor students’ progress. This ensures a customized, individual approach to improving every student's performance in final placements.

  • VII.    Future Scope

The model works effectively on large, multivariate, unbalanced and heterogeneous data sets and can be used to aid decision support systems in the education domain. Its basic capability of handling complicated data sets means it can also be applied in other domains such as marketing and banking. Fuzzy data mining approaches can also be tested in the future.

References

  • http://indiatoday.intoday.in/education/story/60-of-engineering-graduates-unemployed/1/908241. html
  • http://indiatoday.intoday.in/education/story/engineering-employment-problems/1/713827.html
  • Thakar, Pooja, and Anil Mehta. "Cluster Model for parsimonious selection of variables and enhancing Students' Employability Prediction." International Journal of Computer Science and Information Security 14.12 (2016): 611.
  • Mishra, Tripti, Dharminder Kumar, and Sangeeta Gupta. "Students’ Employability Prediction Model through Data Mining." International Journal of Applied Engineering Research 11.4 (2016): 2275-2282.
  • Jantawan, Bangsuk, and Cheng-Fa Tsai. "The Application of Data Mining to Build Classification Model for Predicting Graduate Employment." International Journal Of Computer Science And Information Security (2013).
  • Hu, Zhengbing, et al. "Possibilistic Fuzzy Clustering for Categorical Data Arrays Based on Frequency Prototypes and Dissimilarity Measures." International Journal of Intelligent Systems and Applications (IJISA) 9.5 (2017): 55-61.
  • Catal, Cagatay, and Mehmet Nangir. "A sentiment classification model based on multiple classifiers." Applied Soft Computing 50 (2017): 135-141.
  • Maina, Elizaphan M., Robert O. Oboko, and Peter W. Waiganjo. "Using Machine Learning Techniques to Support Group Formation in an Online Collaborative Learning Environment." International Journal of Intelligent Systems & Applications 9.3 (2017).
  • Bhanuprakash, C., Y. S. Nijagunarya, and M. A. Jayaram. "Clustering of Faculty by Evaluating their Appraisal Performance by using Feed Forward Neural Network Approach." International Journal of Intelligent Systems and Applications 9.3 (2017): 34.
  • Chaudhury, Pamela, et al. "Enhancing the capabilities of Student Result Prediction System." Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies. ACM, 2016.
  • Elbadrawy, Asmaa, R. Scott Studham, and George Karypis. "Collaborative multi-regression models for predicting students' performance in course activities." Proceedings of the Fifth International Conference on Learning Analytics And Knowledge. ACM, 2015.
  • Rogers, Tim, Cassandra Colvin, and Belinda Chiera. "Modest analytics: using the index method to identify students at risk of failure." Proceedings of the Fourth International Conference on Learning Analytics And Knowledge. ACM, 2014.
  • Sanchez-Santillan, Miguel, et al. "Predicting Students' Performance: Incremental Interaction Classifiers." Proceedings of the Third (2016) ACM Conference on Learning@ Scale. ACM, 2016.
  • Pandey, Mrinal, and S. Taruna. "Towards the integration of multiple classifier pertaining to the Student's performance prediction." Perspectives in Science8 (2016): 364-366.
  • Alfiani, Ardita Permata, and Febriana Ayu Wulandari. "Mapping Student's Performance Based on Data Mining Approach (A Case Study)." Agriculture and Agricultural Science Procedia 3 (2015): 173-177.
  • Rahmat, Normala, Yahya Buntat, and Abdul Rahman Ayub. "Determination of Constructs and Dimensions of Employability Skills Based Work Performance Prediction: A Triangular Approach." International Journal of Economics and Financial Issues 5.1S (2015).
  • Thakar, Pooja, and Anil Mehta. "Role of Secondary Attributes to Boost the Prediction Accuracy of Students’ Employability Via Data Mining." International Journal of Advanced Computer Science & Applications 11.6 (2015): 84-90.
  • Finch, David J., Leah K. Hamilton, Riley Baldwin, and Mark Zehner. "An exploratory study of factors affecting undergraduate employability." Education+ Training 55, no. 7 (2013): 681-704.
  • Bakar, Noor Aieda Abu, Aida Mustapha, and Kamariah Md Nasir. "Clustering Analysis for Empowering Skills in Graduate Employability Model." Australian Journal of Basic and Applied Sciences 7.14 (2013): 21-28.
  • Jamaludin, Nor Azliana Akmal, and Shamsul Sahibuddin. "Pilot Study of Industry Perspective on Requirement Engineering Education: Measurement of Rasch Analysis." Editorial Preface 4.8 (2013).
  • Piad, Keno C., et al. "Predicting IT employability using data mining techniques." Digital Information Processing, Data Mining, and Wireless Communications (DIPDMWC), 2016 Third International Conference on. IEEE, 2016.
  • Rashid, Tarik A. "Improvement on Classification Models of Multiple Classes through Effectual Processes." International Journal of Advanced Computer Science and Applications (IJACSA) 6.7 (2015).
  • Kim, Dae-Won, et al. "Evaluation of the performance of clustering algorithms in kernel-induced feature space." Pattern Recognition 38.4 (2005): 607-611.
  • Denison, David GT, Bani K. Mallick, and Adrian FM Smith. "A bayesian CART algorithm." Biometrika 85.2 (1998): 363-377.
  • Tejera Hernández, Dayana C. "An Experimental Study of K* Algorithm." International Journal of Information Engineering & Electronic Business 7.2 (2015).
  • Aldous, David. "The continuum random tree II: an overview." Stochastic analysis 167 (1991): 23-70.
  • Breiman, Leo. "Random forests." Machine learning 45.1 (2001): 5-32.
  • Witten, Ian H., et al. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.
  • Chawla, Nitesh V. "Data mining for imbalanced datasets: An overview." Data mining and knowledge discovery handbook. Springer US, 2009. 875-886.
  • Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management 45.4 (2009): 427-437.
  • Whig P, Ahmad SN (2017) Controlling the Output Error for Photo Catalytic Sensor (PCS) Using Fuzzy Logic. J Earth Sci Clim Change 8: 394. doi:10.4172/2157-7617.1000394