FBSEM: A Novel Feature-Based Stacked Ensemble Method for Sentiment Analysis

Authors: Yasin Görmez, Yunus E. Işık, Mustafa Temiz, Zafer Aydın

Journal: International Journal of Information Technology and Computer Science (IJITCS)

Issue: Vol. 12, No. 6, 2020.

Free access

Sentiment analysis is the process of automatically determining the attitude or emotional state of a text. Many algorithms have been proposed for this task, including ensemble methods, which have the potential to decrease the error rates of the individual base learners considerably. In many machine learning tasks, and especially in sentiment analysis, extracting informative features is as important as developing sophisticated classifiers. In this study, a stacked ensemble method is proposed for sentiment analysis, which systematically combines six feature extraction methods and three classifiers. The proposed method obtains cross-validation accuracies of 89.6%, 90.7% and 67.2% on the large movie, Turkish movie and SemEval-2017 datasets, respectively, outperforming the other classifiers. The accuracy improvements are shown to be statistically significant at the 99% confidence level by performing a Z-test.


Keywords: Sentiment analysis, ensemble methods, machine learning, feature extraction

Short URL: https://sciup.org/15017471

IDR: 15017471   |   DOI: 10.5815/ijitcs.2020.06.02

Full text of the article: FBSEM: A Novel Feature-Based Stacked Ensemble Method for Sentiment Analysis

Published Online December 2020 in MECS DOI: 10.5815/ijitcs.2020.06.02

1. Introduction

With the recent developments in technology, the internet has entered almost every field of our lives, including health, science, entertainment, sports, and art. Due to the widespread availability of web pages and mobile applications, people are able to share their comments, ideas or opinions on many different topics on various platforms. As a result of this dense information flow, the internet now accommodates a huge repository of data with rich and diverse content. However, accessing the right information in such a large surplus of data is a challenging task. To overcome this problem, text mining methods have been developed to automatically extract knowledge from web sites. Text mining can be defined as the process of obtaining meaningful and usable information from text using statistical or machine learning methods [1]. It can be divided into sub-categories such as summarization, classification, clustering, information extraction, and sentiment analysis. This paper concentrates on sentiment analysis, which is the process of extracting the idea, opinion or emotion of a text by employing mathematical models and algorithms. Two types of approaches have been developed for this problem: dictionary based and machine learning based models [1]. In the first phase of dictionary based models, the desired sentiment is determined. Subsequently, the words expressing this sentiment and the meanings of those words are searched in the text, and a score for that sentiment is calculated with the help of a dictionary. In the last phase, the sentiment state is extracted using statistical methods. Dictionary based models require a pre-defined dictionary containing positive, negative, and neutral weight scores for each word, which may not be available for every language. In machine learning based models, texts are first labeled, followed by data cleaning and preprocessing steps. Next, vector space models are formed that allow samples to be represented as feature vectors. After dividing the samples into training, test and validation sets, models are learned and validated by training and testing procedures. Machine learning methods are independent of the language and can achieve high success rates. For this reason, they are preferred over dictionary based methods in academic studies on sentiment analysis.
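To make the machine-learning route concrete, the following is a minimal sketch of such a pipeline in scikit-learn; the toy texts, labels, and the choice of TF-IDF with logistic regression are illustrative placeholders, not the setup used later in this paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy labeled corpus (1 = positive, 0 = negative).
texts = ["great movie, loved it", "terrible plot and acting",
         "an instant classic", "a waste of time"]
labels = [1, 0, 1, 0]

# Vector space model: each document becomes a TF-IDF feature vector.
X = TfidfVectorizer().fit_transform(texts)

# Split, train, and validate as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```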

Machine learning methods are divided into two main categories: supervised and unsupervised learning. The most important feature that distinguishes supervised learning from unsupervised learning is that it utilizes label information during training. When studies on sentiment analysis are examined, supervised machine learning methods are employed more frequently, and several methods have been developed in the literature for this purpose. Li and Sun proposed a Chinese-character-based bigram feature extraction method and compared it with traditional bigram, trigram and word-based unigram features using support vector machines (SVM), naïve Bayes (NB), and artificial neural networks (ANN). The proposed method obtained the best F1 score of 91.62% on a dataset generated from 16,000 texts from Chinese web sites [2]. Go et al. trained three different models using maximum entropy (ME), NB and SVM on Twitter data and obtained an 83% accuracy rate [3]. Mouthami et al. proposed a fuzzy logic based approach and increased the accuracy rate on Cornell movie reviews [4]. Gautam and Yadav achieved success rates between 83.8% and 89.9% with models designed using NB, SVM, ME and the WordNet approach [5]. Nizam and Akın developed two datasets from Twitter data to show the effect of employing balanced and unbalanced datasets. They used NB, random forest (RF), sequential minimal optimization, J48 and k-nearest neighbor (k-NN) and achieved an improvement of up to 6% in the success rate when the balanced dataset is used for model training [6]. Çoban et al. trained Turkish Twitter data using NB, multinomial naïve Bayes (MNB), SVM and k-NN and obtained a 66.06% accuracy rate [7]. Kranjc et al. generated two SVM models using active learning and observed that the active learning based model was 6.7% more successful [8]. Tripathy et al. used n-gram feature extraction methods with four classification algorithms and obtained a 95% accuracy rate [9]. Rohini et al. created several models to compare texts written in English and Kannada and showed that the models generated from English texts are more successful [10]. Hassan and Mahmood combined a convolutional neural network (CNN) with a long short-term memory (LSTM) recurrent neural network (RNN) on the IMDB movie and Stanford Sentiment Treebank (SST) datasets and obtained a 47.5% accuracy rate for SST and 88.3% for IMDB [11]. Al-Smadi et al. applied comparative sentiment analysis using SVM and deep recurrent neural networks (RNN) for three different tasks on an Arabic hotel reviews dataset and observed that SVM outperformed RNN with an accuracy rate of 90% [12]. Chiong et al. performed sentiment analysis to predict financial markets; they optimized the SVM's parameters using particle swarm optimization and obtained a 59% accuracy rate [13]. Sohangir et al. applied several deep learning techniques on a stock market dataset and achieved a 90.93% accuracy with CNN [14]. Demirtaş and Pechenizkiy applied naïve Bayes, linear SVC and maximum entropy classifiers to a Turkish movie review dataset and obtained 69.5% accuracy with NB [15]. Baziotis et al. applied deep LSTM networks to the SemEval-2017 dataset and obtained a 67.5% F1 score [16]. González et al. proposed a convolutional recurrent neural network (CRNN) and obtained a 59.9% accuracy rate on the SemEval-2017 dataset [17].

In addition to using individual learning models, it is also possible to combine the decisions of several methods in an ensemble setting in order to eliminate the inherent disadvantages of the individual methods. Xia et al. combined SVM, NB and ME using three different ensemble methods and achieved an 88.65% accuracy on several datasets [18]. Neethu and Rajasree combined SVM, ME and NB using ensemble methods and achieved a 90% accuracy rate on Twitter data [19]. Fersini et al. used NB, ME, SVM and Markov random fields to compare traditional ensemble methods with a Bayesian based ensemble method. According to the results of experiments on six different datasets, the Bayesian based methods increased the success rate and reduced the computational cost [20]. Da Silva et al. combined MNB, SVM, RF and logistic regression (LR) using the ensemble method they proposed and achieved accuracy rates from 76.84% to 87.20% on five different datasets [21]. Çatal and Nangir combined NB and SVM using several ensemble methods and achieved accuracy rates of up to 86.13% [22]. Ankit and Saleena combined NB, SVM, LR and RF using a voting method and achieved accuracy rates of 70% to 76% on five different datasets generated from Twitter [23]. Araque et al. applied voting and stacking ensemble methods on several datasets and achieved a 90% accuracy rate [24]. Dedhia and Ramteke combined linear and RBF SVMs using AdaBoost and achieved an 83% accuracy rate [25]. Cliche proposed a state-of-the-art ensemble method that combines convolutional neural networks (CNNs) and long short-term memory (LSTM) networks for the SemEval-2017 dataset and obtained a 68.1% recall score [26].

In addition to the classification algorithms, the quality of the attributes in a dataset is also an important factor affecting the success rate of the prediction methods. Various dimension reduction and feature selection methods are frequently employed in order to eliminate unnecessary and noisy attributes that adversely affect classification performance. Tan and Zhang applied document frequency (DF), chi-square (CS), information gain (IG) and mutual information (MI) metrics for feature selection on a dataset generated from Chinese documents and achieved an 88.58% accuracy rate using five different classifiers [27]. Go et al. applied MI, ME and CS metrics and frequency-based feature selection techniques on Twitter data and obtained an 84% accuracy rate [28]. Meral and Diri applied a correlation-based feature selection technique on Twitter data and achieved a 90% F1-score using SVM, NB and RF [29]. Vinodhini and Chandrasekaran achieved 77% accuracy using principal component analysis (PCA), NB and SVM [30]. Yousefpour et al. applied their proposed dimension reduction technique on different datasets and achieved a 90.91% accuracy using SVM, NB, ME and an ensemble of these three classifiers [31]. Kim and Lee applied their proposed semi-supervised nonlinear dimensionality reduction technique on four different datasets and showed that the proposed techniques are better than traditional dimension reduction methods [32]. Kaynar et al. showed that a deep autoencoder is better than traditional dimension reduction techniques in many cases [1]. Kim proposed an improved semi-supervised dimensionality reduction method using feature weighting for sentiment analysis and obtained improved accuracy based on experiments on six benchmark datasets [33].

Traditional ensemble methods try to reduce the error by combining multiple classification algorithms that typically act on a common feature set. When the features are computed by different feature extraction methods, it can be useful to train separate learners for each feature representation and combine their decisions. In this paper, a novel ensemble method, FBSEM, is proposed for sentiment analysis that employs various classifiers as well as attributes derived by different feature extraction methods. The purpose of this study is to compare the proposed classifier, FBSEM, with support vector machine [34], logistic regression [35], multi-layer perceptron [36], naïve Bayes [37], random forest [38], k-nearest neighbor [39], ensemble voting [40] and ensemble stacking [41].

Fig.1. Steps of Sentiment Analysis


2. Material and Methodology

2.1. Dataset

In this study, three sentiment datasets are used. The first one is the large movie review dataset [42], which contains 50,000 movie reviews from IMDB with 25,000 positive and 25,000 negative samples. When constructing this dataset, no more than 30 reviews were allowed for any given movie. The second dataset is a Turkish movie dataset generated by Demirtaş and Pechenizkiy from the Beyazperde web page [15]. It contains 10,662 movie reviews, including 5,331 negatives and 5,331 positives. The third dataset is the SemEval-2017 benchmark collected by Rosenthal et al. from Twitter [43]. It contains 20,632 tweets, including 7,059 positives, 3,231 negatives and 10,342 neutrals.

2.2. Sentiment Analysis

Sentiment analysis (SA) is a sub-field of natural language processing and text mining, which aims to find the idea, opinion or emotion (such as negative or positive) in documents. It consists of data collection, pre-processing, labeling, feature extraction and classification steps, as shown in Figure 1.

2.3. Pre-processing and Feature Extraction for Sentiment Analysis

Before extracting input features for machine learning models, it is possible to pre-process the textual data using techniques such as POS tagging, stop-word removal, and stemming. In the next step, numerical feature vectors are extracted and labeled. In this study, TF [44], TF-IDF [45], continuous bag of words and skip-gram [46] are used as feature extraction techniques. For TF and TF-IDF, a unigram [47] model is used to separate words. For continuous bag of words and skip-gram, negative sampling [48] and hierarchical softmax [49] methods are used.
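For illustration, the six representations used later in the paper could be generated as follows. This is a minimal sketch assuming scikit-learn for the unigram models and gensim 4.x for word2vec; averaging word vectors into a fixed-length document vector is a simplifying assumption, since the paper does not detail how word embeddings are pooled per document.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec

docs = ["great movie", "terrible acting", "loved the plot", "boring and slow"]
tokens = [d.split() for d in docs]

uni_tf    = CountVectorizer(ngram_range=(1, 1)).fit_transform(docs)   # UNI_TF
uni_tfidf = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(docs)   # UNI_TFIDF

def doc_vectors(sg, hs, negative):
    # sg=1 -> skip-gram, sg=0 -> CBOW; hs=1 -> hierarchical softmax,
    # hs=0 with negative > 0 -> negative sampling.
    w2v = Word2Vec(tokens, vector_size=50, sg=sg, hs=hs,
                   negative=negative, min_count=1, seed=0)
    # One common pooling choice: mean of the word vectors per document.
    return np.array([w2v.wv[ws].mean(axis=0) for ws in tokens])

sg_ns   = doc_vectors(sg=1, hs=0, negative=5)   # SG_NS
sg_hs   = doc_vectors(sg=1, hs=1, negative=0)   # SG_HS
cbow_ns = doc_vectors(sg=0, hs=0, negative=5)   # CBOW_NS
cbow_hs = doc_vectors(sg=0, hs=1, negative=0)   # CBOW_HS
```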

2.4. Classification Methods

A. Feature-Based Stacked Ensemble Method for Sentiment Analysis (FBSEM)

FBSEM is a two-stage classifier that includes LR and MLP in the first stage and SVM in the second stage. A separate LR and a separate MLP are trained for each data matrix produced by unigram TF (UNI_TF), unigram TF-IDF (UNI_TFIDF), negative sampling skip-gram (SG_NS), hierarchical softmax skip-gram (SG_HS), negative sampling continuous bag of words (CBOW_NS) and hierarchical softmax continuous bag of words (CBOW_HS). The predictions of LR and MLP are then concatenated with the feature vectors extracted by these six methods and sent as input to an SVM classifier. Figure 2 summarizes the steps of FBSEM.

In Figure 2, distributions represent predicted probability scores calculated using the corresponding feature extraction and classification methods. As a result, a set of twelve distributions is generated, each as a matrix of dimensions n×m, where n is the number of documents and m the number of classes. Therefore, m is 2 for the large movie and Turkish movie review datasets and 3 for the SemEval-2017 dataset. In the first phase of FBSEM, the dataset is divided into train and test sets. Subsequently, LR and MLP are used as classifiers, which are trained on the train set and validated on the test set. To prevent overfitting in the second phase of FBSEM, a 2-fold cross-validation is first performed on the train set during the first phase. Then, predictions on the test set are computed using the model trained during the first phase. This technique makes it possible to compute predictions on the train set as well as the test set using the methods of the first phase (i.e., LR and MLP). These predictions are later employed in the feature vector of the SVM. In the second phase of FBSEM, after the distributions are concatenated with the feature sets, an SVM classifier makes the final decision. This approach helps reduce the errors arising from the different attributes and classifiers. A standard support vector machine can only separate two classes. For three or more classes, two techniques can be used: one versus all (OVA) or one versus one (OVO) [50]. In this study, the OVO method is used for the SemEval-2017 dataset.

3. Application Results

In this section, the FBSEM method is compared with several classifiers on three benchmark datasets. Except for stacking and MLP, the traditional classifiers are implemented using the scikit-learn [51] library of Python. The stacking ensemble is implemented using the mlxtend [52] library and MLP using the keras [53] library. The FBSEM method is implemented in Python. Accuracy, area under the ROC curve (AUROC), and area under the precision-recall curve (AUPRC) are used as the performance measures [54].

Fig.2. Steps of FBSEM Classifier
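To make the two-phase scheme in Figure 2 concrete, the sketch below outlines the first phase under stated assumptions: the feature matrices are dense NumPy arrays, scikit-learn's MLPClassifier stands in for the Keras MLP used in the paper, and all function and variable names are illustrative placeholders rather than the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier

def fbsem_phase_one(train_features, y_train, test_features):
    """train_features/test_features: lists of six n-by-d arrays, one per
    feature extraction method (UNI_TF, UNI_TFIDF, SG_NS, SG_HS,
    CBOW_NS, CBOW_HS)."""
    train_dists, test_dists = [], []
    for X_tr, X_te in zip(train_features, test_features):
        for base in (LogisticRegression(max_iter=1000),
                     MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)):
            # Out-of-fold class probabilities on the train set via 2-fold CV,
            # so the second-phase SVM never sees predictions made on data the
            # base model was fitted on.
            train_dists.append(cross_val_predict(base, X_tr, y_train,
                                                 cv=2, method="predict_proba"))
            # Probabilities on the test set from a model fit on the full train set.
            test_dists.append(base.fit(X_tr, y_train).predict_proba(X_te))
    # Twelve n-by-m "distributions" per split (6 feature sets x 2 classifiers).
    return np.hstack(train_dists), np.hstack(test_dists)
```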

A 10-fold cross-validation experiment is performed on each dataset to assess the prediction accuracy of the methods. Documents are randomly assigned to train and test sets for each fold. Then, from each train set, 20% of the documents are chosen randomly to form a second train set (train-set-small), and 5% of the remaining documents are chosen randomly to form a second test set (test-set-small); these are used for hyper-parameter optimization in each fold of the cross-validation. This reduces the computational cost of hyper-parameter optimization and helps prevent over-fitting. As a result, four datasets are generated for each fold: train set, test set, train-set-small and test-set-small.
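One way to realize this splitting scheme is sketched below; the 10-fold/20%/5% numbers follow the text, while the shuffling and seeding are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def fold_splits(n_documents, seed=0):
    """Yield (train, test, train-set-small, test-set-small) index arrays per fold."""
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(np.arange(n_documents)):
        # 20% of the fold's train documents form train-set-small.
        rest_idx, small_train_idx = train_test_split(
            train_idx, test_size=0.20, random_state=seed)
        # 5% of the remaining train documents form test-set-small.
        _, small_test_idx = train_test_split(
            rest_idx, test_size=0.05, random_state=seed)
        yield train_idx, test_idx, small_train_idx, small_test_idx
```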

Features for each dataset are extracted using UNI_TF, UNI_TFIDF, SG_NS, SG_HS, CBOW_NS and CBOW_HS. Subsequently, the hyper-parameters of MLP, SVM, LR, k-NN and RF are optimized using train-set-small and test-set-small. For MLP, one hidden layer and the ADAM optimizer are used. The number of epochs, the number of neurons in the hidden layer, the learning rate, and the beta1 and beta2 parameters of ADAM are optimized by performing a grid search. Similarly, the number of iterations and the C parameter for SVM, the C parameter for LR, the number of neighbors for k-NN, and the maximum depth and number of trees for RF are optimized separately for each fold of the cross-validation. After optimization, the models are trained using the optimum hyper-parameter configurations and predictions are computed on the test sets. In addition to these classifiers, Gaussian naïve Bayes, an ensemble with majority voting and a stacking ensemble are also trained and tested. For the ensemble with majority voting, MLP, SVM and LR are employed as the base learners, while for the stacking ensemble LR and MLP are selected as the base learners and SVM as the meta learner. Tables 1-3 show the experiment results for the large movie review dataset, the Turkish movie review dataset and the SemEval-2017 dataset, respectively. In these tables, acc represents the mean accuracy over the 10 folds, std the standard deviation of the accuracies across the folds, AUPRC the mean area under the precision-recall curve and AUROC the mean area under the ROC curve. Based on these results, the best accuracy results are obtained by the UNI_TF and UNI_TFIDF feature extraction methods. FBSEM obtained the best accuracy in all settings; however, in terms of AUPRC and AUROC scores, other classifiers may perform slightly better than FBSEM in some of the feature extraction settings.
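As an example of the per-fold tuning, a grid search over the SVM's C and iteration count could use train-set-small/test-set-small as one predefined validation split; the grid values here are illustrative, as the paper lists the tuned parameters but not their ranges.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC

def tune_svm(X_small_train, y_small_train, X_small_test, y_small_test):
    X = np.vstack([X_small_train, X_small_test])
    y = np.concatenate([y_small_train, y_small_test])
    # -1 marks rows used only for training; 0 marks the validation fold.
    split = PredefinedSplit([-1] * len(y_small_train) + [0] * len(y_small_test))
    grid = GridSearchCV(SVC(),
                        {"C": [0.01, 0.1, 1, 10, 100],
                         "max_iter": [1000, 5000, -1]},  # -1 = no iteration limit
                        cv=split)
    return grid.fit(X, y).best_params_
```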

Table 1. Accuracy measures of classification methods and standard deviation values for sentiment analysis evaluated by 10-fold cross validation experiment on large movie review dataset. (EV represents the ensemble with majority voting and STE represents the stacking ensemble.)

METHOD |           UNI_TF           |          UNI_TFIDF         |           SG_NS
       | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC
MLP    | 86.3%  0.024  77.0%  77.3% | 88.5%  0.005  86.4%  87.2% | 83.6%  0.117  89.5%  89.5%
SVM    | 88.6%  0.005  94.7%  95.1% | 89.0%  0.005  95.4%  95.6% | 87.2%  0.006  94.0%  94.3%
LR     | 88.6%  0.005  90.3%  91.9% | 89.0%  0.005  95.2%  95.4% | 87.3%  0.006  79.3%  85.5%
k-NN   | 70.6%  0.018  88.4%  78.2% | 79.5%  0.005  93.1%  75.1% | 81.0%  0.007  83.9%  84.2%
RF     | 85.1%  0.004  92.6%  92.9% | 85.2%  0.003  92.7%  93.0% | 83.3%  0.006  91.0%  91.2%
NB     | 71.7%  0.007  94.8%  95.1% | 78.4%  0.004  95.5%  95.6% | 76.4%  0.007  94.0%  94.3%
EV     | 87.6%  0.006  88.5%  92.1% | 88.9%  0.005  90.9%  93.4% | 87.5%  0.007  87.6%  91.0%
STE    | 87.6%  0.006  92.8%  93.5% | 88.1%  0.005  95.4%  95.6% | 86.7%  0.013  94.3%  94.5%

METHOD |           SG_HS            |          CBOW_NS           |          CBOW_HS
       | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC
MLP    | 87.1%  0.007  88.9%  88.7% | 87.9%  0.009  91.7%  91.3% | 88.0%  0.008  91.3%  91.2%
SVM    | 87.2%  0.004  93.9%  94.2% | 88.8%  0.005  95.1%  95.3% | 88.5%  0.005  94.8%  95.1%
LR     | 87.2%  0.005  94.0%  94.2% | 88.8%  0.004  94.6%  94.8% | 88.6%  0.005  94.4%  94.8%
k-NN   | 80.9%  0.007  83.3%  81.4% | 83.2%  0.006  85.9%  86.1% | 83.0%  0.006  85.5%  85.2%
RF     | 83.7%  0.007  91.2%  91.4% | 84.3%  0.006  92.0%  92.2% | 84.8%  0.005  92.2%  92.5%
NB     | 74.4%  0.007  94.0%  94.3% | 78.4%  0.009  95.1%  95.3% | 77.6%  0.008  94.9%  95.2%
EV     | 87.4%  0.007  89.8%  92.4% | 88.6%  0.003  93.0%  94.3% | 88.5%  0.005  93.4%  94.4%
STE    | 86.9%  0.008  94.2%  94.4% | 88.6%  0.007  94.8%  95.1% | 88.3%  0.006  94.9%  95.2%

Table 2. Accuracy measures of classification methods and standard deviation values for sentiment analysis evaluated by 10-fold cross validation experiment on Turkish movie review dataset. (EV represents the ensemble with majority voting and STE represents the stacking ensemble.)

METHOD |           UNI_TF           |          UNI_TFIDF         |           SG_NS
       | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC
MLP    | 89.1%  0.012  81.8%  79.2% | 88.9%  0.010  86.6%  84.1% | 82.0%  0.108  93.3%  92.5%
SVM    | 88.1%  0.008  94.2%  94.5% | 88.8%  0.010  94.9%  95.2% | 86.6%  0.004  93.0%  93.8%
LR     | 88.1%  0.008  92.8%  93.8% | 88.9%  0.009  93.7%  94.6% | 87.2%  0.009  79.0%  84.6%
k-NN   | 74.2%  0.016  96.1%  63.0% | 76.3%  0.147  95.9%  65.1% | 85.8%  0.008  94.6%  89.8%
RF     | 85.9%  0.012  92.9%  93.2% | 85.7%  0.011  93.1%  93.3% | 86.2%  0.007  93.5%  93.5%
NB     | 78.3%  0.016  94.8%  95.0% | 79.4%  0.014  95.0%  95.4% | 85.5%  0.007  92.9%  93.7%
EV     | 88.1%  0.011  85.3%  90.0% | 88.5%  0.008  82.9%  88.2% | 86.6%  0.007  93.0%  93.5%
STE    | 87.5%  0.010  94.7%  94.9% | 86.9%  0.012  95.1%  95.4% | 86.3%  0.007  93.2%  93.6%

METHOD |           SG_HS            |          CBOW_NS           |          CBOW_HS
       | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC
MLP    | 84.7%  0.026  93.7%  93.6% | 79.9%  0.020  85.8%  84.9% | 83.1%  0.024  91.3%  90.7%
SVM    | 87.1%  0.007  93.1%  93.7% | 82.5%  0.010  91.9%  92.4% | 85.4%  0.008  92.6%  93.1%
LR     | 87.2%  0.006  90.4%  91.1% | 84.9%  0.007  87.8%  87.8% | 85.8%  0.008  89.7%  90.7%
k-NN   | 86.4%  0.006  94.3%  91.4% | 77.5%  0.008  88.8%  77.8% | 83.1%  0.008  92.6%  85.7%
RF     | 86.8%  0.010  93.7%  93.8% | 79.0%  0.011  87.7%  87.3% | 83.7%  0.007  91.8%  91.7%
NB     | 85.7%  0.007  93.2%  93.8% | 75.7%  0.008  89.6%  90.1% | 81.9%  0.008  92.0%  92.5%
EV     | 87.4%  0.008  92.0%  93.4% | 80.2%  0.007  87.0%  87.6% | 84.8%  0.009  91.1%  91.9%
STE    | 87.0%  0.008  93.7%  94.3% | 79.7%  0.009  87.5%  87.8% | 84.4%  0.007  92.0%  92.3%

Table 3. Accuracy measures of classification methods and standard deviation values for sentiment analysis evaluated by 10-fold cross validation experiment on SemEval-2017 dataset. (EV represents the ensemble with majority voting and STE represents the stacking ensemble.)

METHOD |           UNI_TF           |          UNI_TFIDF         |           SG_NS
       | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC
MLP    | 56.3%  0.071  49.9%  55.3% | 58.6%  0.034  54.8%  64.2% | 54.4%  0.065  56.0%  66.5%
SVM    | 62.6%  0.034  65.6%  76.0% | 63.3%  0.031  66.5%  76.3% | 56.0%  0.039  58.8%  70.1%
LR     | 63.1%  0.037  62.1%  60.8% | 62.9%  0.041  63.5%  64.7% | 57.4%  0.046  49.7%  60.5%
k-NN   | 51.4%  0.039  67.0%  20.1% | 55.8%  0.038  66.7%  20.3% | 55.1%  0.029  57.2%  63.7%
RF     | 58.7%  0.042  61.7%  71.7% | 57.8%  0.035  59.5%  70.2% | 54.9%  0.053  55.3%  67.3%
NB     | 29.9%  0.031  65.9%  75.4% | 29.9%  0.032  66.7%  76.2% | 43.7%  0.037  55.6%  67.7%
EV     | 61.9%  0.033  58.6%  70.7% | 64.0%  0.030  60.8%  72.2% | 57.5%  0.031  57.7%  69.5%
STE    | 60.4%  0.032  64.8%  74.6% | 61.5%  0.029  67.6%  76.8% | 57.4%  0.031  58.2%  69.8%

METHOD |           SG_HS            |          CBOW_NS           |          CBOW_HS
       | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC | acc    std    AUPRC  AUROC
MLP    | 56.0%  0.041  56.7%  66.2% | 49.5%  0.056  49.2%  61.4% | 53.9%  0.041  54.4%  65.5%
SVM    | 57.1%  0.031  59.5%  71.0% | 51.4%  0.039  55.4%  67.3% | 53.3%  0.045  55.6%  67.5%
LR     | 58.6%  0.027  56.0%  68.4% | 55.4%  0.040  48.8%  61.5% | 56.0%  0.036  52.5%  65.4%
k-NN   | 55.6%  0.031  55.9%  65.4% | 51.5%  0.030  55.0%  53.0% | 54.3%  0.028  55.5%  62.6%
RF     | 56.5%  0.044  55.8%  67.5% | 51.9%  0.041  49.3%  61.8% | 55.9%  0.030  55.1%  67.1%
NB     | 46.3%  0.036  57.9%  69.6% | 26.8%  0.034  49.7%  62.9% | 42.4%  0.039  52.9%  65.2%
EV     | 59.2%  0.031  59.1%  71.0% | 52.8%  0.040  51.0%  63.7% | 56.8%  0.027  56.0%  68.0%
STE    | 58.6%  0.028  61.3%  72.3% | 53.1%  0.036  51.4%  64.3% | 56.5%  0.024  56.8%  68.7%

Table 4. Accuracy measures of FBSEM classifier for sentiment analysis evaluated by 10-fold cross validation experiment on large movie review dataset

            acc    AUPRC  AUROC  AP     variance
Fold-1      89.4%  96.2%  96.2%  96.2%  0.043
Fold-2      89.8%  95.8%  96.1%  95.8%  0.029
Fold-3      89.0%  95.4%  95.6%  95.4%  0.029
Fold-4      89.8%  96.3%  96.2%  96.3%  0.022
Fold-5      88.9%  95.5%  95.7%  95.5%  0.026
Fold-6      89.8%  95.9%  96.1%  95.9%  0.026
Fold-7      90.3%  95.4%  96.1%  95.4%  0.020
Fold-8      89.2%  95.8%  95.8%  95.8%  0.019
Fold-9      89.7%  96.1%  96.2%  96.1%  0.027
Fold-10     90.3%  95.7%  96.3%  95.7%  0.036
Mean Result 89.6%  95.8%  96.0%  95.8%  0.003

In the second step, sentiment classes are predicted using the first phase of the FBSEM method for each feature extraction technique, and a total of twelve distributions are obtained. These distributions are concatenated with the six feature sets generated using the extraction techniques listed in Section 2.3. Then, the SVM is trained using these datasets. Results of the 10-fold cross-validation experiment are shown in Tables 4-6 for the large movie review dataset, the Turkish movie review dataset and the SemEval-2017 dataset, respectively. In these tables, AP represents the average precision of each fold and variance represents the variance between the intermediate scores obtained when computing the ROC.
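Continuing the phase-one sketch given earlier, this second step could look as follows. Treating the feature sets as dense arrays and enabling SVC's probability estimates (so that AUPRC and AUROC can be computed) are assumptions here, and the C value shown is a placeholder for the per-fold tuned one.

```python
import numpy as np
from sklearn.svm import SVC

def fbsem_phase_two(train_dists, test_dists, train_features, test_features,
                    y_train):
    # Concatenate the twelve distributions with the six (dense) feature sets.
    X_train = np.hstack([train_dists] + list(train_features))
    X_test = np.hstack([test_dists] + list(test_features))
    # Final decision by an SVM; scikit-learn's SVC resolves three or more
    # classes with one-versus-one voting, matching the OVO choice above.
    meta = SVC(C=1.0, probability=True).fit(X_train, y_train)
    return meta.predict(X_test), meta.predict_proba(X_test)
```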

Figures 3-5 compare the accuracy values of all the classification methods on the large movie review dataset, the Turkish movie review dataset and the SemEval-2017 dataset, respectively. In these figures, methods are sorted according to their mean accuracy rates obtained from the 10-fold cross-validation experiments. The last column always corresponds to FBSEM; that is, the proposed method obtains the best accuracy on all three benchmarks. The improvements are 0.6% for the large movie review dataset, 1.6% for the Turkish movie review dataset, and 3.9% for the SemEval-2017 dataset.

In order to assess whether the improvements obtained using FBSEM are statistically significant, a two-tailed Z-test is performed using a confidence level of 99% [55].
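For reference, a two-proportion version of this test (the form offered by the calculator cited in [55]) can be computed as below; treating each accuracy as a proportion of correctly classified documents is an assumption, and the example numbers are merely illustrative of an 89.6% vs. 89.0% comparison over 50,000 documents.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    z = (p_a - p_b) / sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return z, 2 * norm.sf(abs(z))  # two-tailed p-value

# Illustrative call: 89.6% vs. 89.0% accuracy over 50,000 documents each.
z, p = two_proportion_z_test(44800, 50000, 44500, 50000)
print(z, p, "significant at 99%" if p < 0.01 else "not significant")
```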

Table 5. Accuracy measures of FBSEM classifier for sentiment analysis evaluated by 10-fold cross validation experiment on Turkish movie review dataset

            acc    AUPRC  AUROC  AP     variance
Fold-1      90.3%  95.5%  95.8%  95.5%  0.100
Fold-2      92.0%  96.3%  96.6%  96.3%  0.089
Fold-3      89.4%  94.5%  94.9%  94.5%  0.083
Fold-4      90.0%  95.1%  95.3%  95.1%  0.075
Fold-5      90.2%  94.6%  95.5%  94.6%  0.138
Fold-6      91.8%  95.7%  96.5%  95.7%  0.131
Fold-7      90.8%  95.9%  96.3%  95.9%  0.101
Fold-8      91.8%  95.8%  96.2%  95.8%  0.081
Fold-9      88.8%  92.8%  93.7%  92.8%  0.082
Fold-10     91.7%  96.5%  96.4%  96.5%  0.083
Mean Result 90.7%  95.2%  95.6%  95.2%  0.016

Table 6. Accuracy measures of FBSEM classifier for sentiment analysis evaluated by 10-fold cross validation experiment on SemEval-2017 dataset

            acc    AUPRC  AUROC  AP     variance
Fold-1      65.2%  65.4%  73.5%  65.7%  0.012
Fold-2      60.5%  63.6%  72.6%  63.9%  0.010
Fold-3      67.2%  67.2%  75.4%  67.7%  0.008
Fold-4      64.8%  60.7%  72.9%  61.3%  0.010
Fold-5      69.1%  67.4%  78.4%  67.9%  0.011
Fold-6      69.1%  66.5%  74.9%  66.9%  0.007
Fold-7      72.7%  68.2%  77.5%  68.6%  0.010
Fold-8      68.0%  65.7%  75.4%  66.1%  0.008
Fold-9      69.1%  68.4%  77.7%  69.0%  0.010
Fold-10     66.4%  68.2%  77.4%  68.6%  0.007
Mean Result 67.2%  65.1%  74.9%  65.2%  0.001

Fig.3. Accuracy comparison for large movie review dataset

Fig.4. Accuracy comparison for Turkish movie review dataset



Fig.5. Accuracy comparison for SemEval-2017 dataset

Table 7. p-values between the mean accuracy of FBSEM and other models on large movie review dataset

          CBOW_HS  CBOW_NS  SG_HS  SG_NS  UNI_TF  UNI_TFIDF
MLP       0.001    0.001    0.001  0.001  0.001   0.002
SVM       0.001    0.001    0.001  0.001  0.007   0.008
LR        0.001    0.001    0.001  0.001  0.007   0.008
k-NN      0.001    0.001    0.001  0.001  0.001   0.001
RF        0.001    0.001    0.001  0.001  0.001   0.001
NB        0.001    0.001    0.001  0.001  0.001   0.001
Voting    0.001    0.001    0.001  0.001  0.006   0.006
Stacking  0.001    0.001    0.001  0.001  0.007   0.008

Table 8. p-values between the mean accuracy of FBSEM and other models on Turkish movie review dataset

          CBOW_HS  CBOW_NS  SG_HS  SG_NS  UNI_TF  UNI_TFIDF
MLP       0.001    0.001    0.001  0.001  0.007   0.002
SVM       0.001    0.001    0.001  0.001  0.007   0.001
LR        0.003    0.003    0.002  0.002  0.008   0.006
k-NN      0.001    0.001    0.001  0.001  0.004   0.002
RF        0.001    0.001    0.001  0.001  0.003   0.001
NB        0.001    0.001    0.001  0.001  0.002   0.001
Voting    0.001    0.001    0.001  0.001  0.006   0.001
Stacking  0.001    0.001    0.001  0.001  0.008   0.001

Table 9. p-values between the mean accuracy of FBSEM and other models on SemEval-2017 dataset

          CBOW_HS  CBOW_NS  SG_HS  SG_NS  UNI_TF  UNI_TFIDF
MLP       0.001    0.001    0.001  0.001  0.001   0.001
SVM       0.001    0.001    0.001  0.001  0.001   0.001
LR        0.001    0.001    0.001  0.001  0.001   0.001
k-NN      0.001    0.001    0.001  0.001  0.001   0.001
RF        0.001    0.001    0.001  0.001  0.001   0.001
NB        0.001    0.001    0.001  0.001  0.001   0.001
Voting    0.001    0.001    0.001  0.001  0.001   0.004
Stacking  0.001    0.001    0.001  0.001  0.001   0.001

Tables 7-9 include the p-values obtained from the Z-test for the large movie review dataset, the Turkish movie review dataset and the SemEval-2017 dataset, respectively. In these tables, rows correspond to classifiers, columns denote feature extraction techniques, and cells give the p-values. A p-value smaller than 0.01 shows that the improvement made by FBSEM is statistically significant. Based on these results, FBSEM performs significantly better than all the other methods implemented in this work.

4. Conclusions

In this study, we proposed a novel stacked ensemble technique called FBSEM for sentiment analysis and compared it with traditional classifiers trained using six different feature extraction techniques, as well as with two ensemble methods, on three benchmark datasets. FBSEM obtained the best accuracy rates on all datasets, and the improvements are shown to be statistically significant. For different datasets, different feature extraction methods may obtain the best accuracy rate; in this work, FBSEM employed all the available feature extraction methods. As future work, dimension reduction methods, including deep auto-encoders, and feature selection techniques can be developed to select the most important features or to design novel feature representations, which may further improve the accuracy of FBSEM.

Acknowledgment

The numerical calculations reported in this paper were fully/partially performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources).

References

  • Kaynar, O., Aydin, Z., Görmez, Y., 2017. Sentiment Analizinde Öznitelik Düşürme Yöntemlerinin Oto Kodlayıcılı Derin Öğrenme Makinaları ile Karşılaştırılması. Bilişim Teknol. Derg. 10, 319–326. https://doi.org/10.17671/gazibtd.331046
  • Li, J., Sun, M., 2007. Experimental Study on Sentiment Classification of Chinese Review using Machine Learning Techniques, in: 2007 International Conference on Natural Language Processing and Knowledge Engineering. Presented at the 2007 International Conference on Natural Language Processing and Knowledge Engineering, pp. 393–400. https://doi.org/10.1109/NLPKE.2007.4368061
  • Go, A., Bhayani, R., Huang, L., 2009a. Twitter Sentiment Classification using Distant Supervision.
  • Mouthami, K., Devi, K.N., Bhaskaran, V.M., 2013. Sentiment analysis and classification based on textual reviews, in: 2013 International Conference on Information Communication and Embedded Systems (ICICES). Presented at the 2013 International Conference on Information Communication and Embedded Systems (ICICES), pp. 271–276. https://doi.org/10.1109/ICICES.2013.6508366
  • Gautam, G., Yadav, D., 2014. Sentiment analysis of twitter data using machine learning approaches and semantic analysis, in: 2014 Seventh International Conference on Contemporary Computing (IC3). Presented at the 2014 Seventh International Conference on Contemporary Computing (IC3), pp. 437–442. https://doi.org/10.1109/IC3.2014.6897213
  • Nizam, H., Akın, S.S., 2014. Sosyal Medyada Makine Öğrenmesi ile Duygu Analizinde Dengeli ve Dengesiz Veri Setlerinin Performanslarının Karşılaştırılması. Presented at the XIX. Türkiye’de İnternet Konferansı, p. 6.
  • Çoban, Ö., Özyer, B., Özyer, G.T., 2015. Sentiment analysis for Turkish Twitter feeds, in: 2015 23nd Signal Processing and Communications Applications Conference (SIU). Presented at the 2015 23nd Signal Processing and Communications Applications Conference (SIU), pp. 2388–2391. https://doi.org/10.1109/SIU.2015.7130362
  • Kranjc, J., Smailović, J., Podpečan, V., Grčar, M., Žnidaršič, M., Lavrač, N., 2015. Active learning for sentiment analysis on data streams: Methodology and workflow implementation in the ClowdFlows platform. Inf. Process. Manag. 51, 187–203. https://doi.org/10.1016/j.ipm.2014.04.001
  • Tripathy, A., Agrawal, A., Rath, S.K., 2016. Classification of sentiment reviews using n-gram machine learning approach. Expert Syst. Appl. 57, 117–126. https://doi.org/10.1016/j.eswa.2016.03.028
  • Rohini, V., Thomas, M., Latha, C.A., 2016. Domain based sentiment analysis in regional Language-Kannada using machine learning algorithm, in: 2016 IEEE International Conference on Recent Trends in Electronics, Information Communication Technology (RTEICT). Presented at the 2016 IEEE International Conference on Recent Trends in Electronics, Information Communication Technology (RTEICT), pp. 503–507. https://doi.org/10.1109/RTEICT.2016.7807872
  • Hassan, A., Mahmood, A., 2017. Deep Learning approach for sentiment analysis of short texts, in: 2017 3rd International Conference on Control, Automation and Robotics (ICCAR). Presented at the 2017 3rd International Conference on Control, Automation and Robotics (ICCAR), pp. 705–710. https://doi.org/10.1109/ICCAR.2017.7942788
  • Al-Smadi, M., Qawasmeh, O., Al-Ayyoub, M., Jararweh, Y., Gupta, B., 2018. Deep Recurrent neural network vs. support vector machine for aspect-based sentiment analysis of Arabic hotels’ reviews. J. Comput. Sci. 27, 386–393. https://doi.org/10.1016/j.jocs.2017.11.006
  • Chiong, R., Fan, Z., Hu, Z., Adam, M.T.P., Lutz, B., Neumann, D., 2018. A Sentiment Analysis-based Machine Learning Approach for Financial Market Prediction via News Disclosures, in: Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO ’18. ACM, New York, NY, USA, pp. 278–279. https://doi.org/10.1145/3205651.3205682
  • Sohangir, S., Wang, D., Pomeranets, A., Khoshgoftaar, T.M., 2018. Big Data: Deep Learning for financial sentiment analysis. J. Big Data 5, 3. https://doi.org/10.1186/s40537-017-0111-6
  • Demirtas, E., Pechenizkiy, M., 2013. Cross-lingual Polarity Detection with Machine Translation, in: Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining, WISDOM ’13. ACM, New York, NY, USA, pp. 9:1–9:8. https://doi.org/10.1145/2502069.2502078
  • Baziotis, C., Pelekis, N., Doulkeridis, C., 2017. DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, Vancouver, Canada, pp. 747–754.
  • González, J.-Á., Pla, F., Hurtado, L.-F., 2017. ELiRF-UPV at SemEval-2017 Task 4: Sentiment Analysis using Deep Learning, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, Vancouver, Canada, pp. 723–727.
  • Xia, R., Zong, C., Li, S., 2011. Ensemble of feature sets and classification algorithms for sentiment classification. Inf. Sci. 181, 1138–1152. https://doi.org/10.1016/j.ins.2010.11.023
  • Neethu, M.S., Rajasree, R., 2013. Sentiment analysis in twitter using machine learning techniques, in: 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT). Presented at the 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT), pp. 1–5. https://doi.org/10.1109/ICCCNT.2013.6726818
  • Fersini, E., Messina, E., Pozzi, F.A., 2014. Sentiment analysis: Bayesian Ensemble Learning. Decis. Support Syst. 68, 26–38. https://doi.org/10.1016/j.dss.2014.10.004
  • da Silva, N.F.F., Hruschka, Eduardo R., Hruschka, Estevam R., 2014. Tweet sentiment analysis with classifier ensembles. Decis. Support Syst. 66, 170–179. https://doi.org/10.1016/j.dss.2014.07.003
  • Catal, C., Nangir, M., 2017. A sentiment classification model based on multiple classifiers. Appl. Soft Comput. 50, 135–141. https://doi.org/10.1016/j.asoc.2016.11.022
  • Ankit, Saleena, N., 2018. An Ensemble Classification System for Twitter Sentiment Analysis. Procedia Comput. Sci., International Conference on Computational Intelligence and Data Science 132, 937–946. https://doi.org/10.1016/j.procs.2018.05.109
  • Araque, O., Corcuera-Platas, I., Sánchez-Rada, J.F., Iglesias, C.A., 2017. Enhancing deep learning sentiment analysis with ensemble techniques in social applications. Expert Syst. Appl. 77, 236–246. https://doi.org/10.1016/j.eswa.2017.02.002
  • Dedhia, C., Ramteke, J., 2017. Ensemble model for Twitter sentiment analysis, in: 2017 International Conference on Inventive Systems and Control (ICISC). Presented at the 2017 International Conference on Inventive Systems and Control (ICISC), pp. 1–5. https://doi.org/10.1109/ICISC.2017.8068711
  • Cliche, M., 2017. BB_twtr at SemEval-2017 Task 4: Twitter Sentiment Analysis with CNNs and LSTMs. ArXiv170406125 Cs Stat.
  • Tan, S., Zhang, J., 2008. An empirical study of sentiment analysis for chinese documents. Expert Syst. Appl. 34, 2622–2629. https://doi.org/10.1016/j.eswa.2007.05.028
  • Go, A., Huang, L., Bhayani, R., 2009b. Twitter Sentiment Analysis.
  • Meral, M., Diri, B., 2014. Sentiment analysis on Twitter, in: 2014 22nd Signal Processing and Communications Applications Conference (SIU). Presented at the 2014 22nd Signal Processing and Communications Applications Conference (SIU), pp. 690–693. https://doi.org/10.1109/SIU.2014.6830323
  • Vinodhini, G., Chandrasekaran, R., n.d. Effect of Feature Reduction in Sentiment analysis of online reviews. IJARCET 2, 9.
  • Yousefpour, A., Ibrahim, R., Abdull Hamed, H.N., 2014. A Novel Feature Reduction Method in Sentiment Analysis. Int. J. Innov. Comput. 4.
  • Kim, K., Lee, J., 2014. Sentiment visualization and classification via semi-supervised nonlinear dimensionality reduction. Pattern Recognit. 47, 758–768. https://doi.org/10.1016/j.patcog.2013.07.022
  • Kim, K., 2018. An improved semi-supervised dimensionality reduction using feature weighting: Application to sentiment analysis. Expert Syst. Appl. 109, 49–65. https://doi.org/10.1016/j.eswa.2018.05.023
  • Vapnik, V., 2013. The Nature of Statistical Learning Theory. Springer Science & Business Media.
  • Wright, R.E., 1995. Logistic regression, in: Reading and Understanding Multivariate Statistics. American Psychological Association, Washington, DC, US, pp. 217–244.
  • Dayhoff, J.E., DeLeo, J.M., 2001. Artificial neural networks. Cancer 91, 1615–1635. https://doi.org/10.1002/1097-0142(20010415)91:8+<1615::AID-CNCR1175>3.0.CO;2-L
  • Lowd, D., Domingos, P., 2005. Naive Bayes Models for Probability Estimation, in: Proceedings of the 22Nd International Conference on Machine Learning, ICML ’05. ACM, New York, NY, USA, pp. 529–536. https://doi.org/10.1145/1102351.1102418
  • Pal, M., 2005. Random forest classifier for remote sensing classification. Int. J. Remote Sens. 26, 217–222. https://doi.org/10.1080/01431160412331269698
  • Larose, D.T., 2004. k-Nearest Neighbor Algorithm, in: Discovering Knowledge in Data. John Wiley & Sons, Inc., pp. 90–106. https://doi.org/10.1002/0471687545.ch5
  • Chen, Y., Chen, F., Yang, J.Y., Yang, M.Q., 2008. Ensemble voting system for multiclass protein fold recognition. Int. J. Pattern Recognit. Artif. Intell. 22, 747–763. https://doi.org/10.1142/S0218001408006454
  • Chen, Y., Wong, M.L., 2011. Optimizing Stacking Ensemble by an Ant Colony Optimization Approach, in: Proceedings of the 13th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO ’11. ACM, New York, NY, USA, pp. 7–8. https://doi.org/10.1145/2001858.2001863
  • Sentiment classification on Large Movie Review [WWW Document], 2018. URL https://www.kaggle.com/c/sentiment-classification-on-large-movie-review/data
  • Rosenthal, S., Farra, N., Nakov, P., 2017. SemEval-2017 Task 4: Sentiment Analysis in Twitter, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, Vancouver, Canada, pp. 502–518.
  • Salton, G., Buckley, C., 1988. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24, 513–523. https://doi.org/10.1016/0306-4573(88)90021-0
  • Aizawa, A., 2003. An information-theoretic perspective of tf–idf measures. Inf. Process. Manag. 39, 45–65. https://doi.org/10.1016/S0306-4573(02)00021-3
  • Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013a. Efficient Estimation of Word Representations in Vector Space.
  • Tillmann, C., 2004. A Unigram Orientation Model for Statistical Machine Translation, in: Proceedings of HLT-NAACL 2004: Short Papers, HLT-NAACL-Short ’04. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 101–104.
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J., 2013b. Distributed Representations of Words and Phrases and their Compositionality, in: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems 26. Curran Associates, Inc., pp. 3111–3119.
  • Goodman, J., 2001. Classes for fast maximum entropy training, in: 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221). Presented at the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), pp. 561–564 vol.1. https://doi.org/10.1109/ICASSP.2001.940893
  • Görmez, Y., 2017. Dimensionality reduction for protein secondary structure prediction. Abdullah Gül Üniversitesi, YÖK.
  • Supervised Learning [WWW Document], 2018. URL http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
  • Stacking Classifier [WWW Document], 2018. URL https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/
  • Keras: The Python Deep Learning library [WWW Document], 2018. URL https://keras.io/
  • Precision and recall [WWW Document], 2017. URL https://en.wikipedia.org/wiki/Precision_and_recall
  • Z Score Calculator for 2 Population Proportions [WWW Document], 2018. URL https://www.socscistatistics.com/tests/ztest/Default2.aspx