Multi-objective based feature selection and neural networks ensemble method for solving emotion recognition problem


In this paper, we apply a multi-objective optimization approach to find a Pareto-optimal ensemble of neural network classifiers for solving the emotion recognition problem. The Pareto set of neural networks is found by optimizing two conflicting criteria: maximizing the emotion classification rate and minimizing the number of neurons in the network. We implemented several ensemble fusion schemes: voting, averaging class probabilities, and adding an auxiliary meta-classification layer. Since the number of audio and video features extracted from raw video sequences is quite large, we also applied the multi-objective approach to find an optimal subset of features; the criteria in this case are maximizing the classification rate and minimizing the number of features. The multi-objective approach to neural network parameter optimization and to feature selection was compared with the classic single-objective approach on several datasets. According to the experimental results, the multi-objective approach to neural network optimization provided, on average, a 7.1 % higher emotion classification rate than single-objective optimization. Applying the multi-objective approach to feature selection improved the classification rate by 2.8 % compared with the single-objective approach, by 5.4 % compared with principal component analysis, and by 13.9 % compared with using no dimensionality reduction at all. Given these results, we suggest using the multi-objective approach to machine learning algorithm optimization and feature selection in further research on emotion recognition and other complex classification tasks.
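The abstract describes selecting solutions by Pareto dominance over two conflicting criteria (classification rate to maximize, model or feature count to minimize). The paper does not give implementation details, but the core filtering step can be sketched as follows; the candidate values below are hypothetical, not results from the paper.

```python
def dominates(a, b):
    """True if solution a Pareto-dominates b.

    Each solution is a pair (classification_rate, size): the rate is
    maximized, the size (neurons or features) is minimized."""
    rate_a, size_a = a
    rate_b, size_b = b
    # a must be no worse on both criteria and strictly better on at least one
    return (rate_a >= rate_b and size_a <= size_b) and \
           (rate_a > rate_b or size_a < size_b)

def pareto_front(solutions):
    """Keep only the non-dominated (rate, size) pairs."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

# Hypothetical candidates: (classification rate, number of features).
candidates = [(0.71, 120), (0.69, 40), (0.74, 300), (0.74, 150), (0.60, 10)]
front = pareto_front(candidates)  # (0.74, 300) is dominated by (0.74, 150)
```

Algorithms such as NSGA-II (cited in the reference list) build on this dominance test with non-dominated sorting and crowding-distance ranking inside a genetic algorithm.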
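Two of the fusion schemes mentioned above, voting and averaging class probabilities, admit a compact sketch. The probability vectors below are hypothetical placeholders for the per-class outputs of three trained networks; the paper's meta-classification layer is not reproduced here.

```python
import numpy as np

def fuse_average(prob_list):
    """Average the class-probability vectors across networks, then argmax."""
    return int(np.mean(prob_list, axis=0).argmax())

def fuse_vote(prob_list):
    """Majority vote over each network's predicted class
    (ties resolve to the lowest class index via bincount)."""
    votes = [int(p.argmax()) for p in prob_list]
    return int(np.bincount(votes).argmax())

# Hypothetical per-class probabilities from three networks, three emotion classes.
probs = [np.array([0.6, 0.3, 0.1]),
         np.array([0.2, 0.5, 0.3]),
         np.array([0.1, 0.8, 0.1])]
```

Averaging keeps calibration information that hard voting discards, which is why the two schemes can disagree when individual networks are confident but wrong.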


Ensemble, neural network, multi-objective optimization, emotion recognition

Short address: https://sciup.org/148177549

IDR: 148177549

References for Multi-objective based feature selection and neural networks ensemble method for solving emotion recognition problem

  • Rashid M., Abu-Bakar S. A. R., Mokji M. Human emotion recognition from videos using spatio-temporal and audio features. The Visual Computer, 2012, P. 1269-1275.
  • Kahou S. E., Pal C., Bouthillier X., Froumenty P., Gulcehre C., Memisevic R., Vincent P., Courville A., Bengio Y. Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM on International Conference on Multimodal Interaction, Sydney, Australia, December 9-13, 2013, P. 543-550.
  • Cruz A., Bhanu B., Thakoor N. Facial emotion recognition in continuous video. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR 2012), Tsukuba, Japan, November 11-15, 2012, P. 1880-1883.
  • Soleymani M., Pantic M., Pun T. Multimodal emotion recognition in response to videos. IEEE Transactions on affective computing, 2012, Vol. 3, No. 2, P. 211-223.
  • Busso C., Deng Z., Yildirim S., Bulut M., Lee C. M., Kazemzadeh A., Lee S., Neumann U., Narayanan S. Analysis of Emotion Recognition using Facial Expressions, Speech and Multimodal Information. In Proceedings of the 6th international conference on Multimodal interfaces, Los Angeles, 2004, P. 205-211.
  • Haq S., Jackson P. J. B. Speaker-dependent audio-visual emotion recognition. In Proceedings Int. Conf. on Auditory-Visual Speech Processing (AVSP'09), Norwich, UK, September 2009, P. 53-58.
  • Eyben F., Wöllmer M., Schuller B. openSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings ACM Multimedia (MM), Florence, Italy, 2010, P. 1459-1462.
  • Sariyanidi E., Gunes H., Gokmen M., Cavallaro A. Local Zernike moment representation for facial affect recognition. Proc. of British Machine Vision Conference, 2013, P. 1-13.
  • Ojala T., Pietikäinen M., Harwood D. A comparative study of texture measures with classification based on feature distributions. Pattern Recognition 29, 1996, P. 51-59.
  • Zhao G., Pietikäinen M. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Analysis and Machine Intelligence 29(6), 2007, P. 915-928.
  • Sidorov M., Brester C., Semenkin E., Minker W. Speaker state recognition with neural network-based classification and self-adaptive heuristic feature selection. In Proceedings International Conference on Informatics in Control, Automation and Robotics (ICINCO), 2014, P. 699-703.
  • Zitzler E., Thiele L. An evolutionary algorithm for multiobjective optimization: the strength Pareto approach. Swiss Federal Institute of Technology, Zurich, Switzerland, TIK-Report No. 43, May 1998, P. 1-40.
  • Deb K., Pratap A., Agarwal S., Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. on Evolutionary Computation, 2002, Vol. 6, No. 2, P. 182-197.
  • Schaffer J. D. Multiple objective optimization with vector evaluated genetic algorithms. Proc. of the 1st International Conference on Genetic Algorithms, 1985, P. 93-100.
  • Ivanov I. A., Sopov E. A. [Self-configuring genetic algorithm for solving multicriteria choice support problems]. Vestnik SibGAU, 2013, No. 1 (47), P. 30-35 (In Russ.).
Research article