FBSEM: A Novel Feature-Based Stacked Ensemble Method for Sentiment Analysis

Authors: Yasin Görmez, Yunus E. Işık, Mustafa Temiz, Zafer Aydın

Journal: International Journal of Information Technology and Computer Science (IJITCS)

Issue: Vol. 12, No. 6, 2020.

Free access

Sentiment analysis is the process of automatically determining the attitude or emotional state of a text. Many algorithms have been proposed for this task, including ensemble methods, which have the potential to decrease the error rates of the individual base learners considerably. In many machine learning tasks, and especially in sentiment analysis, extracting informative features is as important as developing sophisticated classifiers. In this study, a stacked ensemble method is proposed for sentiment analysis, which systematically combines six feature extraction methods and three classifiers. The proposed method obtains cross-validation accuracies of 89.6%, 90.7% and 67.2% on the large movie, Turkish movie and SemEval-2017 datasets, respectively, outperforming the other classifiers. The accuracy improvements are shown to be statistically significant at the 99% confidence level by performing a Z-test.


Keywords: Sentiment analysis, ensemble methods, machine learning, feature extraction

Short URL: https://sciup.org/15017471

IDR: 15017471   |   DOI: 10.5815/ijitcs.2020.06.02

Full text of the article: FBSEM: A Novel Feature-Based Stacked Ensemble Method for Sentiment Analysis

Published Online December 2020 in MECS DOI: 10.5815/ijitcs.2020.06.02

1. Introduction

With the recent developments in technology, the internet has entered almost every field of our lives, including health, science, entertainment, sports, and art. Due to the widespread availability of web pages and mobile applications, people are able to share their comments, ideas or opinions on many different topics on various platforms. As a result of this dense information flow, the internet now accommodates a huge repository of data with rich and diverse content. However, accessing the right information in such a large surplus of data is a challenging task. To overcome this problem, text mining methods have been developed to automatically extract knowledge from web sites. Text mining can be defined as the process of obtaining meaningful and usable information from text using statistical or machine learning methods [1]. It can be divided into sub-categories such as summarization, classification, clustering, information extraction, and sentiment analysis. This paper concentrates on sentiment analysis, which is the process of extracting the idea, opinion or emotion of a text by employing mathematical models and algorithms. Two types of approaches have been developed for this problem: dictionary based and machine learning based models [1]. In the first phase of dictionary based models, the desired sentiment is determined. Subsequently, the words expressing this sentiment and the meanings of those words are searched in the text, and a score for that sentiment is calculated with the help of a dictionary. In the last phase, the sentiment state is extracted using statistical methods. Dictionary based models require a pre-defined dictionary containing positive, negative, and neutral weight scores for each word, which may not be available for every language. In machine learning based models, texts are first labeled, followed by data cleaning and preprocessing steps. Next, vector space models are formed that allow samples to be represented as feature vectors. After dividing the samples into training, test and validation sets, models are learned and validated by training and testing procedures. Machine learning methods are independent of the language and can achieve high success rates. For this reason, they are preferred over dictionary based methods in academic studies on sentiment analysis.
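To make the machine-learning route concrete, the following is a minimal sketch of such a pipeline in scikit-learn; the toy texts, labels, and the choice of TF-IDF with logistic regression are illustrative placeholders, not the setup used later in this paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy labeled corpus (1 = positive, 0 = negative).
texts = ["great movie, loved it", "terrible plot and acting",
         "an instant classic", "a waste of time"]
labels = [1, 0, 1, 0]

# Vector space model: each document becomes a TF-IDF feature vector.
X = TfidfVectorizer().fit_transform(texts)

# Split, train, and validate as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```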

Machine learning methods are divided into two main categories: supervised and unsupervised learning. The most important feature that distinguishes supervised learning from unsupervised learning is that it utilizes label information during training. When studies on sentiment analysis are examined, supervised machine learning methods are employed more frequently, and several methods have been developed in the literature for this purpose. Li and Sun proposed a Chinese-character-based bigram feature extraction method and compared it with traditional bigram, trigram and word-based unigram features using support vector machines (SVM), naïve Bayes (NB), and artificial neural networks (ANN). The proposed method obtained the best F1 score of 91.62% on a dataset generated from 16,000 texts from Chinese web sites [2]. Go et al. trained three different models using maximum entropy (ME), NB and SVM on Twitter data and obtained an 83% accuracy rate [3]. Mouthami et al. proposed a fuzzy logic based approach and increased the accuracy rate on Cornell movie reviews [4]. Gautam and Yadav achieved success rates between 83.8% and 89.9% with models designed using NB, SVM, ME and the WordNet approach [5]. Nizam and Akın developed two datasets from Twitter data to show the effect of employing balanced and unbalanced datasets. They used NB, random forest (RF), sequential minimal optimization, J48 and k-nearest neighbor (k-NN) and achieved an improvement of up to 6% in the success rate when the balanced dataset is used for model training [6]. Çoban et al. trained Turkish Twitter data using NB, multinomial naïve Bayes (MNB), SVM and k-NN and obtained a 66.06% accuracy rate [7]. Kranjc et al. generated two SVM models using active learning and observed that the active learning based model was 6.7% more successful [8]. Tripathy et al. used n-gram feature extraction methods with four classification algorithms and obtained a 95% accuracy rate [9]. Rohini et al. created several models to compare texts written in English and Kannada and showed that the models generated from English texts are more successful [10]. Hassan and Mahmood combined a convolutional neural network (CNN) with a long short-term memory (LSTM) recurrent neural network (RNN) on the IMDB movie and Stanford Sentiment Treebank (SST) datasets and obtained a 47.5% accuracy rate for SST and 88.3% for IMDB [11]. Al-Smadi et al. applied comparative sentiment analysis using SVM and deep recurrent neural networks (RNN) for three different tasks on an Arabic hotel reviews dataset and observed that SVM outperformed RNN with an accuracy rate of 90% [12]. Chiong et al. performed sentiment analysis to predict financial markets; they optimized the SVM's parameters using particle swarm optimization and obtained a 59% accuracy rate [13]. Sohangir et al. applied several deep learning techniques on a stock market dataset and achieved a 90.93% accuracy with CNN [14]. Demirtaş and Pechenizkiy applied naïve Bayes, linear SVC and maximum entropy classifiers to a Turkish movie review dataset and obtained 69.5% accuracy with NB [15]. Baziotis et al. applied deep LSTM networks to the SemEval-2017 dataset and obtained a 67.5% F1 score [16]. González et al. proposed a convolutional recurrent neural network (CRNN) and obtained a 59.9% accuracy rate on the SemEval-2017 dataset [17].

In addition to using individual learning models, it is also possible to combine the decisions of several methods in an ensemble setting in order to eliminate the inherent disadvantages of the individual methods. Xia et al. combined SVM, NB and ME using three different ensemble methods and achieved an 88.65% accuracy on several datasets [18]. Neethu and Rajasree combined SVM, ME and NB using ensemble methods and achieved a 90% accuracy rate on Twitter data [19]. Fersini et al. used NB, ME, SVM and Markov random fields to compare traditional ensemble methods with a Bayesian based ensemble method. According to the results of experiments on six different datasets, the Bayesian based methods increased the success rate and reduced the computational cost [20]. Da Silva et al. combined MNB, SVM, RF and logistic regression (LR) using the ensemble method they proposed and achieved accuracy rates from 76.84% to 87.20% on five different datasets [21]. Çatal and Nangir combined NB and SVM using several ensemble methods and achieved accuracy rates of up to 86.13% [22]. Ankit and Saleena combined NB, SVM, LR and RF using a voting method and achieved accuracy rates of 70% to 76% on five different datasets generated from Twitter [23]. Araque et al. applied voting and stacking ensemble methods on several datasets and achieved a 90% accuracy rate [24]. Dedhia and Ramteke combined linear and RBF SVMs using AdaBoost and achieved an 83% accuracy rate [25]. Cliche proposed a state-of-the-art ensemble method that combines convolutional neural networks (CNNs) and long short-term memory (LSTM) networks for the SemEval-2017 dataset and obtained a 68.1% recall score [26].

In addition to the classification algorithms, the quality of the attributes in a dataset is also an important factor affecting the success rate of the prediction methods. Various dimension reduction and feature selection methods are frequently employed in order to eliminate unnecessary and noisy attributes that adversely affect classification performance. Tan and Zhang applied document frequency (DF), chi-square (CS), information gain (IG) and mutual information (MI) metrics for feature selection on a dataset generated from Chinese documents and achieved an 88.58% accuracy rate using five different classifiers [27]. Go et al. applied MI, ME and CS metrics and frequency-based feature selection techniques on Twitter data and obtained an 84% accuracy rate [28]. Meral and Diri applied a correlation-based feature selection technique on Twitter data and achieved a 90% F1-score using SVM, NB and RF [29]. Vinodhini and Chandrasekaran achieved 77% accuracy using principal component analysis (PCA), NB and SVM [30]. Yousefpour et al. applied their proposed dimension reduction technique on different datasets and achieved a 90.91% accuracy using SVM, NB, ME and an ensemble of these three classifiers [31]. Kim and Lee applied their proposed semi-supervised nonlinear dimensionality reduction technique on four different datasets and showed that the proposed techniques are better than traditional dimension reduction methods [32]. Kaynar et al. showed that a deep autoencoder is better than traditional dimension reduction techniques in many cases [1]. Kim proposed an improved semi-supervised dimensionality reduction method using feature weighting for sentiment analysis and obtained improved accuracy based on experiments on six benchmark datasets [33].

Traditional ensemble methods try to reduce the error by combining multiple classification algorithms that typically act on a common feature set. When the features are computed by different feature extraction methods, it can be useful to train separate learners for each feature representation and combine their decisions. In this paper, a novel ensemble method, FBSEM, is proposed for sentiment analysis that employs various classifiers as well as attributes derived by different feature extraction methods. The purpose of this study is to compare the proposed classifier, FBSEM, with support vector machine [34], logistic regression [35], multi-layer perceptron [36], naïve Bayes [37], random forest [38], k-nearest neighbor [39], ensemble voting [40] and ensemble stacking [41].

Fig.1. Steps of Sentiment Analysis


2. Material and Methodology

2.1. Dataset

In this study, three sentiment datasets are used. The first one is the large movie review dataset [42], which contains 50,000 movie reviews from IMDB with 25,000 positive and 25,000 negative samples. When constructing this dataset, no more than 30 reviews were allowed for any given movie. The second dataset is a Turkish movie dataset generated by Demirtaş and Pechenizkiy from the Beyazperde web page [15]. It contains 10,662 movie reviews, including 5,331 negatives and 5,331 positives. The third dataset is the SemEval-2017 benchmark collected by Rosenthal et al. from Twitter [43]. It contains 20,632 tweets, including 7,059 positives, 3,231 negatives and 10,342 neutrals.

2.2. Sentiment Analysis

Sentiment analysis (SA) is a sub-field of natural language processing and text mining, which aims to find the idea, opinion or emotion (such as negative or positive) in documents. It consists of data collection, pre-processing, labeling, feature extraction and classification steps, as shown in Figure 1.

2.3. Pre-processing and Feature Extraction for Sentiment Analysis

Before extracting input features for machine learning models, it is possible to pre-process the textual data using techniques such as POS tagging, stop-word removal, and stemming. In the next step, numerical feature vectors are extracted and labeled. In this study, TF [44], TF-IDF [45], continuous bag of words and skip-gram [46] are used as feature extraction techniques. For TF and TF-IDF, a unigram [47] model is used to separate words. For continuous bag of words and skip-gram, negative sampling [48] and hierarchical softmax [49] methods are used.
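For illustration, the six representations used later in the paper could be generated as follows. This is a minimal sketch assuming scikit-learn for the unigram models and gensim 4.x for word2vec; averaging word vectors into a fixed-length document vector is a simplifying assumption, since the paper does not detail how word embeddings are pooled per document.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec

docs = ["great movie", "terrible acting", "loved the plot", "boring and slow"]
tokens = [d.split() for d in docs]

uni_tf    = CountVectorizer(ngram_range=(1, 1)).fit_transform(docs)   # UNI_TF
uni_tfidf = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(docs)   # UNI_TFIDF

def doc_vectors(sg, hs, negative):
    # sg=1 -> skip-gram, sg=0 -> CBOW; hs=1 -> hierarchical softmax,
    # hs=0 with negative > 0 -> negative sampling.
    w2v = Word2Vec(tokens, vector_size=50, sg=sg, hs=hs,
                   negative=negative, min_count=1, seed=0)
    # One common pooling choice: mean of the word vectors per document.
    return np.array([w2v.wv[ws].mean(axis=0) for ws in tokens])

sg_ns   = doc_vectors(sg=1, hs=0, negative=5)   # SG_NS
sg_hs   = doc_vectors(sg=1, hs=1, negative=0)   # SG_HS
cbow_ns = doc_vectors(sg=0, hs=0, negative=5)   # CBOW_NS
cbow_hs = doc_vectors(sg=0, hs=1, negative=0)   # CBOW_HS
```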

2.4. Classification Methods

A. Feature-Based Stacked Ensemble Method for Sentiment Analysis (FBSEM)

FBSEM is a two-stage classifier that includes LR and MLP in the first stage and SVM in the second stage. A separate LR and a separate MLP are trained for each data matrix produced by unigram TF (UNI_TF), unigram TF-IDF (UNI_TFIDF), negative sampling skip-gram (SG_NS), hierarchical softmax skip-gram (SG_HS), negative sampling continuous bag of words (CBOW_NS) and hierarchical softmax continuous bag of words (CBOW_HS). The predictions of LR and MLP are then concatenated with the feature vectors extracted by these six methods and sent as input to an SVM classifier. Figure 2 summarizes the steps of FBSEM.

In Figure 2, distributions represent predicted probability scores calculated using the corresponding feature extraction and classification methods. As a result, a set of twelve distributions is generated, each as a matrix of dimensions n×m, where n is the number of documents and m the number of classes. Therefore, m is 2 for the large movie and Turkish movie review datasets and 3 for the SemEval-2017 dataset. In the first phase of FBSEM, the dataset is divided into train and test sets. Subsequently, LR and MLP are used as classifiers, which are trained on the train set and validated on the test set. To prevent overfitting in the second phase of FBSEM, a 2-fold cross-validation is first performed on the train set during the first phase. Then, predictions on the test set are computed using the model trained during the first phase. This technique makes it possible to compute predictions on the train set as well as the test set using the methods of the first phase (i.e., LR and MLP). These predictions are later employed in the feature vector of the SVM. In the second phase of FBSEM, after the distributions are concatenated with the feature sets, an SVM classifier makes the final decision. This approach helps reduce the errors arising from the different attributes and classifiers. A standard support vector machine can only separate two classes. For three or more classes, two techniques can be used: one versus all (OVA) or one versus one (OVO) [50]. In this study, the OVO method is used for the SemEval-2017 dataset.

3. Application Results

In this section, the FBSEM method is compared with several classifiers on three benchmark datasets. Except for stacking and MLP, the traditional classifiers are implemented using the scikit-learn [51] library of Python. The stacking ensemble is implemented using the mlxtend [52] library and MLP using the keras [53] library. The FBSEM method is implemented in Python. Accuracy, area under the ROC curve (AUROC), and area under the precision-recall curve (AUPRC) are used as the performance measures [54].

Fig.2. Steps of FBSEM Classifier
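To make the two-phase scheme in Figure 2 concrete, the sketch below outlines the first phase under stated assumptions: the feature matrices are dense NumPy arrays, scikit-learn's MLPClassifier stands in for the Keras MLP used in the paper, and all function and variable names are illustrative placeholders rather than the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier

def fbsem_phase_one(train_features, y_train, test_features):
    """train_features/test_features: lists of six n-by-d arrays, one per
    feature extraction method (UNI_TF, UNI_TFIDF, SG_NS, SG_HS,
    CBOW_NS, CBOW_HS)."""
    train_dists, test_dists = [], []
    for X_tr, X_te in zip(train_features, test_features):
        for base in (LogisticRegression(max_iter=1000),
                     MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)):
            # Out-of-fold class probabilities on the train set via 2-fold CV,
            # so the second-phase SVM never sees predictions made on data the
            # base model was fitted on.
            train_dists.append(cross_val_predict(base, X_tr, y_train,
                                                 cv=2, method="predict_proba"))
            # Probabilities on the test set from a model fit on the full train set.
            test_dists.append(base.fit(X_tr, y_train).predict_proba(X_te))
    # Twelve n-by-m "distributions" per split (6 feature sets x 2 classifiers).
    return np.hstack(train_dists), np.hstack(test_dists)
```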

A 10-fold cross-validation experiment is performed on each dataset to assess the prediction accuracy of the methods. Documents are randomly assigned to train and test sets for each fold. Then, from each train set, 20% of the documents are chosen randomly to form a second train set (train-set-small), and 5% of the remaining documents are chosen randomly to form a second test set (test-set-small); these are used for hyper-parameter optimization in each fold of the cross-validation. This reduces the computational cost of hyper-parameter optimization and helps prevent over-fitting. As a result, four datasets are generated for each fold: train set, test set, train-set-small and test-set-small.
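One way to realize this splitting scheme is sketched below; the 10-fold/20%/5% numbers follow the text, while the shuffling and seeding are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def fold_splits(n_documents, seed=0):
    """Yield (train, test, train-set-small, test-set-small) index arrays per fold."""
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(np.arange(n_documents)):
        # 20% of the fold's train documents form train-set-small.
        rest_idx, small_train_idx = train_test_split(
            train_idx, test_size=0.20, random_state=seed)
        # 5% of the remaining train documents form test-set-small.
        _, small_test_idx = train_test_split(
            rest_idx, test_size=0.05, random_state=seed)
        yield train_idx, test_idx, small_train_idx, small_test_idx
```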

Features for each dataset are extracted using UNI_TF, UNI_TFIDF, SG_NS, SG_HS, CBOW_NS and CBOW_HS. Subsequently, the hyper-parameters of MLP, SVM, LR, k-NN and RF are optimized using train-set-small and test-set-small. For MLP, one hidden layer and the ADAM optimizer are used. The number of epochs, the number of neurons in the hidden layer, the learning rate, and the beta1 and beta2 parameters of ADAM are optimized by performing a grid search. Similarly, the number of iterations and the C parameter for SVM, the C parameter for LR, the number of neighbors for k-NN, and the maximum depth and number of trees for RF are optimized separately for each fold of the cross-validation. After optimization, the models are trained using the optimum hyper-parameter configurations and predictions are computed on the test sets. In addition to these classifiers, Gaussian naïve Bayes, an ensemble with majority voting and a stacking ensemble are also trained and tested. For the ensemble with majority voting, MLP, SVM and LR are employed as the base learners, while for the stacking ensemble LR and MLP are selected as the base learners and SVM as the meta learner. Tables 1-3 show the experiment results for the large movie review dataset, the Turkish movie review dataset and the SemEval-2017 dataset, respectively. In these tables, acc represents the mean accuracy over the 10 folds, std the standard deviation of the accuracies across the folds, AUPRC the mean area under the precision-recall curve and AUROC the mean area under the ROC curve. Based on these results, the best accuracy results are obtained by the UNI_TF and UNI_TFIDF feature extraction methods. FBSEM obtained the best accuracy in all settings; however, in terms of AUPRC and AUROC scores, other classifiers may perform slightly better than FBSEM in some of the feature extraction settings.
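As an example of the per-fold tuning, a grid search over the SVM's C and iteration count could use train-set-small/test-set-small as one predefined validation split; the grid values here are illustrative, as the paper lists the tuned parameters but not their ranges.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC

def tune_svm(X_small_train, y_small_train, X_small_test, y_small_test):
    X = np.vstack([X_small_train, X_small_test])
    y = np.concatenate([y_small_train, y_small_test])
    # -1 marks rows used only for training; 0 marks the validation fold.
    split = PredefinedSplit([-1] * len(y_small_train) + [0] * len(y_small_test))
    grid = GridSearchCV(SVC(),
                        {"C": [0.01, 0.1, 1, 10, 100],
                         "max_iter": [1000, 5000, -1]},  # -1 = no iteration limit
                        cv=split)
    return grid.fit(X, y).best_params_
```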

Table 1. Accuracy measures of classification methods and standard deviation values for sentiment analysis evaluated by 10-fold cross validation experiment on large movie review dataset. (EV represents the ensemble with majority voting and STE represents the stacking ensemble.)

METHOD |           UNI_TF           |          UNI_TFIDF         |           SG_NS
       | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC
MLP    | 86.3%  0.024  77.0%  77.3% | 88.5%  0.005  86.4%  87.2% | 83.6%  0.117  89.5%  89.5%
SVM    | 88.6%  0.005  94.7%  95.1% | 89.0%  0.005  95.4%  95.6% | 87.2%  0.006  94.0%  94.3%
LR     | 88.6%  0.005  90.3%  91.9% | 89.0%  0.005  95.2%  95.4% | 87.3%  0.006  79.3%  85.5%
k-NN   | 70.6%  0.018  88.4%  78.2% | 79.5%  0.005  93.1%  75.1% | 81.0%  0.007  83.9%  84.2%
RF     | 85.1%  0.004  92.6%  92.9% | 85.2%  0.003  92.7%  93.0% | 83.3%  0.006  91.0%  91.2%
NB     | 71.7%  0.007  94.8%  95.1% | 78.4%  0.004  95.5%  95.6% | 76.4%  0.007  94.0%  94.3%
EV     | 87.6%  0.006  88.5%  92.1% | 88.9%  0.005  90.9%  93.4% | 87.5%  0.007  87.6%  91.0%
STE    | 87.6%  0.006  92.8%  93.5% | 88.1%  0.005  95.4%  95.6% | 86.7%  0.013  94.3%  94.5%

METHOD |           SG_HS            |          CBOW_NS           |          CBOW_HS
       | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC
MLP    | 87.1%  0.007  88.9%  88.7% | 87.9%  0.009  91.7%  91.3% | 88.0%  0.008  91.3%  91.2%
SVM    | 87.2%  0.004  93.9%  94.2% | 88.8%  0.005  95.1%  95.3% | 88.5%  0.005  94.8%  95.1%
LR     | 87.2%  0.005  94.0%  94.2% | 88.8%  0.004  94.6%  94.8% | 88.6%  0.005  94.4%  94.8%
k-NN   | 80.9%  0.007  83.3%  81.4% | 83.2%  0.006  85.9%  86.1% | 83.0%  0.006  85.5%  85.2%
RF     | 83.7%  0.007  91.2%  91.4% | 84.3%  0.006  92.0%  92.2% | 84.8%  0.005  92.2%  92.5%
NB     | 74.4%  0.007  94.0%  94.3% | 78.4%  0.009  95.1%  95.3% | 77.6%  0.008  94.9%  95.2%
EV     | 87.4%  0.007  89.8%  92.4% | 88.6%  0.003  93.0%  94.3% | 88.5%  0.005  93.4%  94.4%
STE    | 86.9%  0.008  94.2%  94.4% | 88.6%  0.007  94.8%  95.1% | 88.3%  0.006  94.9%  95.2%

Table 2. Accuracy measures of classification methods and standard deviation values for sentiment analysis evaluated by 10-fold cross validation experiment on Turkish movie review dataset. (EV represents the ensemble with majority voting and STE represents the stacking ensemble.)

METHOD |           UNI_TF           |          UNI_TFIDF         |           SG_NS
       | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC
MLP    | 89.1%  0.012  81.8%  79.2% | 88.9%  0.010  86.6%  84.1% | 82.0%  0.108  93.3%  92.5%
SVM    | 88.1%  0.008  94.2%  94.5% | 88.8%  0.010  94.9%  95.2% | 86.6%  0.004  93.0%  93.8%
LR     | 88.1%  0.008  92.8%  93.8% | 88.9%  0.009  93.7%  94.6% | 87.2%  0.009  79.0%  84.6%
k-NN   | 74.2%  0.016  96.1%  63.0% | 76.3%  0.147  95.9%  65.1% | 85.8%  0.008  94.6%  89.8%
RF     | 85.9%  0.012  92.9%  93.2% | 85.7%  0.011  93.1%  93.3% | 86.2%  0.007  93.5%  93.5%
NB     | 78.3%  0.016  94.8%  95.0% | 79.4%  0.014  95.0%  95.4% | 85.5%  0.007  92.9%  93.7%
EV     | 88.1%  0.011  85.3%  90.0% | 88.5%  0.008  82.9%  88.2% | 86.6%  0.007  93.0%  93.5%
STE    | 87.5%  0.010  94.7%  94.9% | 86.9%  0.012  95.1%  95.4% | 86.3%  0.007  93.2%  93.6%

METHOD |           SG_HS            |          CBOW_NS           |          CBOW_HS
       | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC
MLP    | 84.7%  0.026  93.7%  93.6% | 79.9%  0.020  85.8%  84.9% | 83.1%  0.024  91.3%  90.7%
SVM    | 87.1%  0.007  93.1%  93.7% | 82.5%  0.010  91.9%  92.4% | 85.4%  0.008  92.6%  93.1%
LR     | 87.2%  0.006  90.4%  91.1% | 84.9%  0.007  87.8%  87.8% | 85.8%  0.008  89.7%  90.7%
k-NN   | 86.4%  0.006  94.3%  91.4% | 77.5%  0.008  88.8%  77.8% | 83.1%  0.008  92.6%  85.7%
RF     | 86.8%  0.010  93.7%  93.8% | 79.0%  0.011  87.7%  87.3% | 83.7%  0.007  91.8%  91.7%
NB     | 85.7%  0.007  93.2%  93.8% | 75.7%  0.008  89.6%  90.1% | 81.9%  0.008  92.0%  92.5%
EV     | 87.4%  0.008  92.0%  93.4% | 80.2%  0.007  87.0%  87.6% | 84.8%  0.009  91.1%  91.9%
STE    | 87.0%  0.008  93.7%  94.3% | 79.7%  0.009  87.5%  87.8% | 84.4%  0.007  92.0%  92.3%

Table 3. Accuracy measures of classification methods and standard deviation values for sentiment analysis evaluated by 10-fold cross validation experiment on SemEval-2017 dataset. (EV represents the ensemble with majority voting and STE represents the stacking ensemble.)

METHOD |           UNI_TF           |          UNI_TFIDF         |           SG_NS
       | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC
MLP    | 56.3%  0.071  49.9%  55.3% | 58.6%  0.034  54.8%  64.2% | 54.4%  0.065  56.0%  66.5%
SVM    | 62.6%  0.034  65.6%  76.0% | 63.3%  0.031  66.5%  76.3% | 56.0%  0.039  58.8%  70.1%
LR     | 63.1%  0.037  62.1%  60.8% | 62.9%  0.041  63.5%  64.7% | 57.4%  0.046  49.7%  60.5%
k-NN   | 51.4%  0.039  67.0%  20.1% | 55.8%  0.038  66.7%  20.3% | 55.1%  0.029  57.2%  63.7%
RF     | 58.7%  0.042  61.7%  71.7% | 57.8%  0.035  59.5%  70.2% | 54.9%  0.053  55.3%  67.3%
NB     | 29.9%  0.031  65.9%  75.4% | 29.9%  0.032  66.7%  76.2% | 43.7%  0.037  55.6%  67.7%
EV     | 61.9%  0.033  58.6%  70.7% | 64.0%  0.030  60.8%  72.2% | 57.5%  0.031  57.7%  69.5%
STE    | 60.4%  0.032  64.8%  74.6% | 61.5%  0.029  67.6%  76.8% | 57.4%  0.031  58.2%  69.8%

METHOD |           SG_HS            |          CBOW_NS           |          CBOW_HS
       | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC
MLP    | 56.0%  0.041  56.7%  66.2% | 49.5%  0.056  49.2%  61.4% | 53.9%  0.041  54.4%  65.5%
SVM    | 57.1%  0.031  59.5%  71.0% | 51.4%  0.039  55.4%  67.3% | 53.3%  0.045  55.6%  67.5%
LR     | 58.6%  0.027  56.0%  68.4% | 55.4%  0.040  48.8%  61.5% | 56.0%  0.036  52.5%  65.4%
k-NN   | 55.6%  0.031  55.9%  65.4% | 51.5%  0.030  55.0%  53.0% | 54.3%  0.028  55.5%  62.6%
RF     | 56.5%  0.044  55.8%  67.5% | 51.9%  0.041  49.3%  61.8% | 55.9%  0.030  55.1%  67.1%
NB     | 46.3%  0.036  57.9%  69.6% | 26.8%  0.034  49.7%  62.9% | 42.4%  0.039  52.9%  65.2%
EV     | 59.2%  0.031  59.1%  71.0% | 52.8%  0.040  51.0%  63.7% | 56.8%  0.027  56.0%  68.0%
STE    | 58.6%  0.028  61.3%  72.3% | 53.1%  0.036  51.4%  64.3% | 56.5%  0.024  56.8%  68.7%

Table 4. Accuracy measures of FBSEM classifier for sentiment analysis evaluated by 10-fold cross validation experiment on large movie review dataset

            acc    AUPRC  AUROC  AP     variance
Fold-1      89.4%  96.2%  96.2%  96.2%  0.043
Fold-2      89.8%  95.8%  96.1%  95.8%  0.029
Fold-3      89.0%  95.4%  95.6%  95.4%  0.029
Fold-4      89.8%  96.3%  96.2%  96.3%  0.022
Fold-5      88.9%  95.5%  95.7%  95.5%  0.026
Fold-6      89.8%  95.9%  96.1%  95.9%  0.026
Fold-7      90.3%  95.4%  96.1%  95.4%  0.020
Fold-8      89.2%  95.8%  95.8%  95.8%  0.019
Fold-9      89.7%  96.1%  96.2%  96.1%  0.027
Fold-10     90.3%  95.7%  96.3%  95.7%  0.036
Mean Result 89.6%  95.8%  96.0%  95.8%  0.003

In the second step, sentiment classes are predicted using the first phase of the FBSEM method for each feature extraction technique, and a total of twelve distributions are obtained. These distributions are concatenated with the six feature sets generated using the extraction techniques listed in Section 2.3. Then, the SVM is trained using these datasets. Results of the 10-fold cross-validation experiment are shown in Tables 4-6 for the large movie review dataset, the Turkish movie review dataset and the SemEval-2017 dataset, respectively. In these tables, AP represents the average precision of each fold and variance represents the variance between the intermediate scores obtained when computing the ROC.
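Continuing the phase-one sketch given earlier, this second step could look as follows. Treating the feature sets as dense arrays and enabling SVC's probability estimates (so that AUPRC and AUROC can be computed) are assumptions here, and the C value shown is a placeholder for the per-fold tuned one.

```python
import numpy as np
from sklearn.svm import SVC

def fbsem_phase_two(train_dists, test_dists, train_features, test_features,
                    y_train):
    # Concatenate the twelve distributions with the six (dense) feature sets.
    X_train = np.hstack([train_dists] + list(train_features))
    X_test = np.hstack([test_dists] + list(test_features))
    # Final decision by an SVM; scikit-learn's SVC resolves three or more
    # classes with one-versus-one voting, matching the OVO choice above.
    meta = SVC(C=1.0, probability=True).fit(X_train, y_train)
    return meta.predict(X_test), meta.predict_proba(X_test)
```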

Figures 3-5 compare the accuracy values of all the classification methods on the large movie review dataset, the Turkish movie review dataset and the SemEval-2017 dataset, respectively. In these figures, methods are sorted according to their mean accuracy rates obtained from the 10-fold cross-validation experiments. The last column always corresponds to FBSEM; that is, the proposed method obtains the best accuracy on all three benchmarks. The improvements are 0.6% for the large movie review dataset, 1.6% for the Turkish movie review dataset, and 3.9% for the SemEval-2017 dataset.

In order to assess whether the improvements obtained using FBSEM are statistically significant, a two-tailed Z-test is performed using a confidence level of 99% [55].
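For reference, a two-proportion version of this test (the form offered by the calculator cited in [55]) can be computed as below; treating each accuracy as a proportion of correctly classified documents is an assumption, and the example numbers are merely illustrative of an 89.6% vs. 89.0% comparison over 50,000 documents.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    z = (p_a - p_b) / sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return z, 2 * norm.sf(abs(z))  # two-tailed p-value

# Illustrative call: 89.6% vs. 89.0% accuracy over 50,000 documents each.
z, p = two_proportion_z_test(44800, 50000, 44500, 50000)
print(z, p, "significant at 99%" if p < 0.01 else "not significant")
```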

Table 5. Accuracy measures of FBSEM classifier for sentiment analysis evaluated by 10-fold cross validation experiment on Turkish movie review dataset

            acc    AUPRC  AUROC  AP     variance
Fold-1      90.3%  95.5%  95.8%  95.5%  0.100
Fold-2      92.0%  96.3%  96.6%  96.3%  0.089
Fold-3      89.4%  94.5%  94.9%  94.5%  0.083
Fold-4      90.0%  95.1%  95.3%  95.1%  0.075
Fold-5      90.2%  94.6%  95.5%  94.6%  0.138
Fold-6      91.8%  95.7%  96.5%  95.7%  0.131
Fold-7      90.8%  95.9%  96.3%  95.9%  0.101
Fold-8      91.8%  95.8%  96.2%  95.8%  0.081
Fold-9      88.8%  92.8%  93.7%  92.8%  0.082
Fold-10     91.7%  96.5%  96.4%  96.5%  0.083
Mean Result 90.7%  95.2%  95.6%  95.2%  0.016

Table 6. Accuracy measures of FBSEM classifier for sentiment analysis evaluated by 10-fold cross validation experiment on SemEval-2017 dataset

            acc    AUPRC  AUROC  AP     variance
Fold-1      65.2%  65.4%  73.5%  65.7%  0.012
Fold-2      60.5%  63.6%  72.6%  63.9%  0.010
Fold-3      67.2%  67.2%  75.4%  67.7%  0.008
Fold-4      64.8%  60.7%  72.9%  61.3%  0.010
Fold-5      69.1%  67.4%  78.4%  67.9%  0.011
Fold-6      69.1%  66.5%  74.9%  66.9%  0.007
Fold-7      72.7%  68.2%  77.5%  68.6%  0.010
Fold-8      68.0%  65.7%  75.4%  66.1%  0.008
Fold-9      69.1%  68.4%  77.7%  69.0%  0.010
Fold-10     66.4%  68.2%  77.4%  68.6%  0.007
Mean Result 67.2%  65.1%  74.9%  65.2%  0.001

Fig.3. Accuracy comparison for large movie review dataset

Fig.4. Accuracy comparison for Turkish movie review dataset



Fig.5. Accuracy comparison for SemEval-2017 dataset

Table 7. p-values between the mean accuracy of FBSEM and other models on large movie review dataset

          CBOW_HS  CBOW_NS  SG_HS  SG_NS  UNI_TF  UNI_TFIDF
MLP       0.001    0.001    0.001  0.001  0.001   0.002
SVM       0.001    0.001    0.001  0.001  0.007   0.008
LR        0.001    0.001    0.001  0.001  0.007   0.008
k-NN      0.001    0.001    0.001  0.001  0.001   0.001
RF        0.001    0.001    0.001  0.001  0.001   0.001
NB        0.001    0.001    0.001  0.001  0.001   0.001
Voting    0.001    0.001    0.001  0.001  0.006   0.006
Stacking  0.001    0.001    0.001  0.001  0.007   0.008

Table 8. p-values between the mean accuracy of FBSEM and other models on Turkish movie review dataset

          CBOW_HS  CBOW_NS  SG_HS  SG_NS  UNI_TF  UNI_TFIDF
MLP       0.001    0.001    0.001  0.001  0.007   0.002
SVM       0.001    0.001    0.001  0.001  0.007   0.001
LR        0.003    0.003    0.002  0.002  0.008   0.006
k-NN      0.001    0.001    0.001  0.001  0.004   0.002
RF        0.001    0.001    0.001  0.001  0.003   0.001
NB        0.001    0.001    0.001  0.001  0.002   0.001
Voting    0.001    0.001    0.001  0.001  0.006   0.001
Stacking  0.001    0.001    0.001  0.001  0.008   0.001

Table 9. p-values between the mean accuracy of FBSEM and other models on SemEval-2017 dataset

          CBOW_HS  CBOW_NS  SG_HS  SG_NS  UNI_TF  UNI_TFIDF
MLP       0.001    0.001    0.001  0.001  0.001   0.001
SVM       0.001    0.001    0.001  0.001  0.001   0.001
LR        0.001    0.001    0.001  0.001  0.001   0.001
k-NN      0.001    0.001    0.001  0.001  0.001   0.001
RF        0.001    0.001    0.001  0.001  0.001   0.001
NB        0.001    0.001    0.001  0.001  0.001   0.001
Voting    0.001    0.001    0.001  0.001  0.001   0.004
Stacking  0.001    0.001    0.001  0.001  0.001   0.001

Tables 7-9 include the p-values obtained from the Z-test for the large movie review dataset, the Turkish movie review dataset and the SemEval-2017 dataset, respectively. In these tables, rows correspond to classifiers, columns denote feature extraction techniques, and cells give the p-values. A p-value smaller than 0.01 shows that the improvement made by FBSEM is statistically significant. Based on these results, FBSEM performs significantly better than all the other methods implemented in this work.

4. Conclusions

In this study, we proposed a novel stacked ensemble technique called FBSEM for sentiment analysis and compared it with traditional classifiers trained using six different feature extraction techniques, as well as with two ensemble methods, on three benchmark datasets. FBSEM obtained the best accuracy rates on all datasets, and the improvements are shown to be statistically significant. For different datasets, different feature extraction methods may obtain the best accuracy rate; in this work, FBSEM employed all the available feature extraction methods. As future work, dimension reduction methods, including deep auto-encoders, and feature selection techniques can be developed to select the most important features or to design novel feature representations, which may further improve the accuracy of FBSEM.

Acknowledgment

The numerical calculations reported in this paper were fully/partially performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources).

References

  • Kaynar, O., Aydin, Z., Görmez, Y., 2017. Sentiment Analizinde Öznitelik Düşürme Yöntemlerinin Oto Kodlayıcılı Derin Öğrenme Makinaları ile Karşılaştırılması. Bilişim Teknol. Derg. 10, 319–326. https://doi.org/10.17671/gazibtd.331046
  • Li, J., Sun, M., 2007. Experimental Study on Sentiment Classification of Chinese Review using Machine Learning Techniques, in: 2007 International Conference on Natural Language Processing and Knowledge Engineering. Presented at the 2007 International Conference on Natural Language Processing and Knowledge Engineering, pp. 393–400. https://doi.org/10.1109/NLPKE.2007.4368061
  • Go, A., Bhayani, R., Huang, L., 2009a. Twitter Sentiment Classification using Distant Supervision.
  • Mouthami, K., Devi, K.N., Bhaskaran, V.M., 2013. Sentiment analysis and classification based on textual reviews, in: 2013 International Conference on Information Communication and Embedded Systems (ICICES). Presented at the 2013 International Conference on Information Communication and Embedded Systems (ICICES), pp. 271–276. https://doi.org/10.1109/ICICES.2013.6508366
  • Gautam, G., Yadav, D., 2014. Sentiment analysis of twitter data using machine learning approaches and semantic analysis, in: 2014 Seventh International Conference on Contemporary Computing (IC3). Presented at the 2014 Seventh International Conference on Contemporary Computing (IC3), pp. 437–442. https://doi.org/10.1109/IC3.2014.6897213
  • Nizam, H., Akın, S.S., 2014. Sosyal Medyada Makine Öğrenmesi ile Duygu Analizinde Dengeli ve Dengesiz Veri Setlerinin Performanslarının Karşılaştırılması. Presented at the XIX. Türkiye’de İnternet Konferansı, p. 6.
  • Çoban, Ö., Özyer, B., Özyer, G.T., 2015. Sentiment analysis for Turkish Twitter feeds, in: 2015 23nd Signal Processing and Communications Applications Conference (SIU). Presented at the 2015 23nd Signal Processing and Communications Applications Conference (SIU), pp. 2388–2391. https://doi.org/10.1109/SIU.2015.7130362
  • Kranjc, J., Smailović, J., Podpečan, V., Grčar, M., Žnidaršič, M., Lavrač, N., 2015. Active learning for sentiment analysis on data streams: Methodology and workflow implementation in the ClowdFlows platform. Inf. Process. Manag. 51, 187–203. https://doi.org/10.1016/j.ipm.2014.04.001
  • Tripathy, A., Agrawal, A., Rath, S.K., 2016. Classification of sentiment reviews using n-gram machine learning approach. Expert Syst. Appl. 57, 117–126. https://doi.org/10.1016/j.eswa.2016.03.028
  • Rohini, V., Thomas, M., Latha, C.A., 2016. Domain based sentiment analysis in regional Language-Kannada using machine learning algorithm, in: 2016 IEEE International Conference on Recent Trends in Electronics, Information Communication Technology (RTEICT). Presented at the 2016 IEEE International Conference on Recent Trends in Electronics, Information Communication Technology (RTEICT), pp. 503–507. https://doi.org/10.1109/RTEICT.2016.7807872
  • Hassan, A., Mahmood, A., 2017. Deep Learning approach for sentiment analysis of short texts, in: 2017 3rd International Conference on Control, Automation and Robotics (ICCAR). Presented at the 2017 3rd International Conference on Control, Automation and Robotics (ICCAR), pp. 705–710. https://doi.org/10.1109/ICCAR.2017.7942788
  • Al-Smadi, M., Qawasmeh, O., Al-Ayyoub, M., Jararweh, Y., Gupta, B., 2018. Deep Recurrent neural network vs. support vector machine for aspect-based sentiment analysis of Arabic hotels’ reviews. J. Comput. Sci. 27, 386–393. https://doi.org/10.1016/j.jocs.2017.11.006
  • Chiong, R., Fan, Z., Hu, Z., Adam, M.T.P., Lutz, B., Neumann, D., 2018. A Sentiment Analysis-based Machine Learning Approach for Financial Market Prediction via News Disclosures, in: Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO ’18. ACM, New York, NY, USA, pp. 278–279. https://doi.org/10.1145/3205651.3205682
  • Sohangir, S., Wang, D., Pomeranets, A., Khoshgoftaar, T.M., 2018. Big Data: Deep Learning for financial sentiment analysis. J. Big Data 5, 3. https://doi.org/10.1186/s40537-017-0111-6
  • Demirtas, E., Pechenizkiy, M., 2013. Cross-lingual Polarity Detection with Machine Translation, in: Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining, WISDOM ’13. ACM, New York, NY, USA, pp. 9:1–9:8. https://doi.org/10.1145/2502069.2502078
  • Baziotis, C., Pelekis, N., Doulkeridis, C., 2017. DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, Vancouver, Canada, pp. 747–754.
  • González, J.-Á., Pla, F., Hurtado, L.-F., 2017. ELiRF-UPV at SemEval-2017 Task 4: Sentiment Analysis using Deep Learning, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, Vancouver, Canada, pp. 723–727.
  • Xia, R., Zong, C., Li, S., 2011. Ensemble of feature sets and classification algorithms for sentiment classification. Inf. Sci. 181, 1138–1152. https://doi.org/10.1016/j.ins.2010.11.023
  • Neethu, M.S., Rajasree, R., 2013. Sentiment analysis in twitter using machine learning techniques, in: 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT). Presented at the 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT), pp. 1–5. https://doi.org/10.1109/ICCCNT.2013.6726818
  • Fersini, E., Messina, E., Pozzi, F.A., 2014. Sentiment analysis: Bayesian Ensemble Learning. Decis. Support Syst. 68, 26–38. https://doi.org/10.1016/j.dss.2014.10.004
  • da Silva, N.F.F., Hruschka, Eduardo R., Hruschka, Estevam R., 2014. Tweet sentiment analysis with classifier ensembles. Decis. Support Syst. 66, 170–179. https://doi.org/10.1016/j.dss.2014.07.003
  • Catal, C., Nangir, M., 2017. A sentiment classification model based on multiple classifiers. Appl. Soft Comput. 50, 135–141. https://doi.org/10.1016/j.asoc.2016.11.022
  • Ankit, Saleena, N., 2018. An Ensemble Classification System for Twitter Sentiment Analysis. Procedia Comput. Sci., International Conference on Computational Intelligence and Data Science 132, 937–946. https://doi.org/10.1016/j.procs.2018.05.109
  • Araque, O., Corcuera-Platas, I., Sánchez-Rada, J.F., Iglesias, C.A., 2017. Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Syst. Appl. 77, 236–246. https://doi.org/10.1016/j.eswa.2017.02.002
  • Dedhia, C., Ramteke, J., 2017. Ensemble model for Twitter sentiment analysis, in: 2017 International Conference on Inventive Systems and Control (ICISC). Presented at the 2017 International Conference on Inventive Systems and Control (ICISC), pp. 1–5. https://doi.org/10.1109/ICISC.2017.8068711
  • Cliche, M., 2017. BB_twtr at SemEval-2017 Task 4: Twitter Sentiment Analysis with CNNs and LSTMs. ArXiv170406125 Cs Stat.
  • Tan, S., Zhang, J., 2008. An empirical study of sentiment analysis for chinese documents. Expert Syst. Appl. 34, 2622–2629. https://doi.org/10.1016/j.eswa.2007.05.028
  • Go, A., Huang, L., Bhayani, R., 2009b. Twitter Sentiment Analysis.
  • Meral, M., Diri, B., 2014. Sentiment analysis on Twitter, in: 2014 22nd Signal Processing and Communications Applications Conference (SIU). Presented at the 2014 22nd Signal Processing and Communications Applications Conference (SIU), pp. 690–693. https://doi.org/10.1109/SIU.2014.6830323
  • Vinodhini, G., Chandrasekaran, R., n.d. Effect of Feature Reduction in Sentiment analysis of online reviews. IJARCET 2, 9.
  • Yousefpour, A., Ibrahim, R., Abdull Hamed, H.N., 2014. A Novel Feature Reduction Method in Sentiment Analysis. Int. J. Innov. Comput. 4.
  • Kim, K., Lee, J., 2014. Sentiment visualization and classification via semi-supervised nonlinear dimensionality reduction. Pattern Recognit. 47, 758–768. https://doi.org/10.1016/j.patcog.2013.07.022
  • Kim, K., 2018. An improved semi-supervised dimensionality reduction using feature weighting: Application to sentiment analysis. Expert Syst. Appl. 109, 49–65. https://doi.org/10.1016/j.eswa.2018.05.023
  • Vapnik, V., 2013. The Nature of Statistical Learning Theory. Springer Science & Business Media.
  • Wright, R.E., 1995. Logistic regression, in: Reading and Understanding Multivariate Statistics. American Psychological Association, Washington, DC, US, pp. 217–244.
  • Dayhoff, J.E., DeLeo, J.M., 2001. Artificial neural networks. Cancer 91, 1615–1635. https://doi.org/10.1002/1097-0142(20010415)91:8+<1615::AID-CNCR1175>3.0.CO;2-L
  • Lowd, D., Domingos, P., 2005. Naive Bayes Models for Probability Estimation, in: Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05. ACM, New York, NY, USA, pp. 529–536. https://doi.org/10.1145/1102351.1102418
  • Pal, M., 2005. Random forest classifier for remote sensing classification. Int. J. Remote Sens. 26, 217–222. https://doi.org/10.1080/01431160412331269698
  • Larose, D.T., 2004. k-Nearest Neighbor Algorithm, in: Discovering Knowledge in Data. John Wiley & Sons, Inc., pp. 90–106. https://doi.org/10.1002/0471687545.ch5
  • Chen, Y., Chen, F., Yang, J.Y., Yang, M.Q., 2008. Ensemble voting system for multiclass protein fold recognition. Int. J. Pattern Recognit. Artif. Intell. 22, 747–763. https://doi.org/10.1142/S0218001408006454
  • Chen, Y., Wong, M.L., 2011. Optimizing Stacking Ensemble by an Ant Colony Optimization Approach, in: Proceedings of the 13th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO ’11. ACM, New York, NY, USA, pp. 7–8. https://doi.org/10.1145/2001858.2001863
  • Sentiment classification on Large Movie Review [WWW Document], 2018. URL https://www.kaggle.com/c/sentiment-classification-on-large-movie-review/data
  • Rosenthal, S., Farra, N., Nakov, P., 2017. SemEval-2017 Task 4: Sentiment Analysis in Twitter, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, Vancouver, Canada, pp. 502–518.
  • Salton, G., Buckley, C., 1988. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24, 513–523. https://doi.org/10.1016/0306-4573(88)90021-0
  • Aizawa, A., 2003. An information-theoretic perspective of tf–idf measures. Inf. Process. Manag. 39, 45–65. https://doi.org/10.1016/S0306-4573(02)00021-3
  • Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013a. Efficient Estimation of Word Representations in Vector Space.
  • Tillmann, C., 2004. A Unigram Orientation Model for Statistical Machine Translation, in: Proceedings of HLT-NAACL 2004: Short Papers, HLT-NAACL-Short ’04. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 101–104.
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J., 2013b. Distributed Representations of Words and Phrases and their Compositionality, in: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems 26. Curran Associates, Inc., pp. 3111–3119.
  • Goodman, J., 2001. Classes for fast maximum entropy training, in: 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221). Presented at the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), pp. 561–564 vol.1. https://doi.org/10.1109/ICASSP.2001.940893
  • Görmez, Y., 2017. Dimensionality reduction for protein secondary structure prediction. Abdullah Gül Üniversitesi, YÖK.
  • Supervised Learning [WWW Document], 2018. URL http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
  • Stacking Classifier [WWW Document], 2018. URL https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/
  • Keras: The Python Deep Learning library [WWW Document], 2018. URL https://keras.io/
  • Precision and recall [WWW Document], 2017. URL https://en.wikipedia.org/wiki/Precision_and_recall
  • Z Score Calculator for 2 Population Proportions [WWW Document], 2018. URL https://www.socscistatistics.com/tests/ztest/Default2.aspx