Determination of variables significance using estimations of the first-order partial derivative
Authors: Zablotskaya K., Walter S., Zablotskiy S., Minker W.
Journal: Siberian Aerospace Journal (@vestnik-sibsau)
Section: Cybernetics, system analysis, applications
Issue: 5 (31), 2010.
In this paper we describe and investigate a method which allows us to detect the most informative features among all data extracted from a certain data corpus. The significance of an input feature is estimated as the average absolute value of the first-order partial derivative with respect to it. The method requires the values of the objective function at certain assigned points. If these values cannot be measured directly (the object is not available for experiments), we use non-parametric kernel regression to approximate them. The algorithm is tested on different simulated objects and is used to investigate the dependency between linguistic features of spoken utterances and speakers' capabilities.
Keywords: non-parametric kernel regression, first-order partial derivative.
In our research we investigate whether there is a dependency between a person's spoken utterances and his capabilities. For this purpose we collected a corpus of monologues and dialogues of different speakers [1]. Their verbal intelligence was measured with an intelligence test [2]. From this corpus we try to extract information relevant for clustering, classification, regression, or other data mining tasks. There are normally many different features which can be extracted from the monologues and dialogues, but their importance or relevance is not always obvious. Most of them are merely noise, which makes the analysis of the data more difficult. When working with high dimensional spaces, the computational effort required by data analysis tools may be tremendous. It is therefore essential to detect irrelevant or weakly correlated features and exclude them from consideration.
There exist different solutions to this problem. One of them is the use of Pearson's coefficient or the coefficient of multiple correlation. However, if Pearson's coefficient is close to 0, it does not mean that the output and input variables are unrelated; it only shows that there is no linear dependency between them. Such features should not be excluded from consideration without additional analysis. Another approach to reducing the number of features is Principal Component Analysis. This method involves a mathematical procedure that transforms correlated variables into a smaller number of uncorrelated ones called principal components. But it does not determine the contribution of a particular feature to the objective function.
In this paper we describe a method which determines the most informative features even if the dependency between input variables and the output is not linear.
Determination of the Most Informative Features. To determine if there is a dependency between input features (or extracted features) and the output, we make a series of experiments on the object (if it is available) or create a model using non-parametric kernel regression and estimate the average first-order partial derivative with respect to each input feature. The feature with the largest average partial derivative is the most important. This algorithm may be described in the following way.
Non-parametric kernel regression (NPR) allows us to create a model using the data set $x_1[t], \dots, x_n[t]$, $y[t]$, $t = 1, \dots, s$, without additional knowledge about the dependency structure [3; 4]. NPR estimates the dependency between inputs and outputs using a weighted average of the observations $y[t]$:

$$ M\{Y \mid x\} = \hat{y}(x) = \frac{\displaystyle\sum_{t=1}^{s} y[t] \prod_{i=1}^{n} \Phi\!\left(\frac{x_i - x_i[t]}{C_i}\right)}{\displaystyle\sum_{t=1}^{s} \prod_{i=1}^{n} \Phi\!\left(\frac{x_i - x_i[t]}{C_i}\right)}, $$

where $C_i$ is the bandwidth (smoothing parameter) and $\Phi(z)$ is a kernel function.
The kernel function assigns a weight to each observation. The weighted sum of the $y[t]$ estimates the output at any point $x$. The parameters $C_i$ determine how many points from the training data set effectively contribute to $\hat{y}(x)$: observations which are nearer to $x$ have larger weights and are more significant for $\hat{y}(x)$.
If the $C_i$ are large, a lot of observations are taken into account and the model is not precise. These parameters should be trained on the existing data set, and the $C_i$ providing the smallest mean square error (MSE) are used for further investigations.
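The estimator above can be written in a few lines. Below is a minimal NumPy sketch of such a model, assuming a Gaussian kernel for $\Phi$ and a leave-one-out criterion for choosing the bandwidths; the function names (`npr_predict`, `loo_mse`) and these particular choices are ours, not the paper's.

```python
import numpy as np

def npr_predict(x, X_train, y_train, c):
    """Nadaraya-Watson estimate of y(x) at a single query point x.

    X_train : (s, n) array of observed inputs x_i[t]
    y_train : (s,)   array of observed outputs y[t]
    c       : (n,)   array of bandwidths C_i
    """
    z = (x - X_train) / c                           # (s, n) scaled distances
    weights = np.exp(-0.5 * z ** 2).prod(axis=1)    # product of the kernel over the n inputs
    denom = weights.sum()
    if denom == 0.0:                                # query point far from all observations
        return np.nan
    return float(np.dot(weights, y_train) / denom)  # weighted average of the y[t]

def loo_mse(X_train, y_train, c):
    """Leave-one-out MSE, one possible criterion for choosing the bandwidths C_i."""
    s = len(y_train)
    errors = []
    for t in range(s):
        mask = np.arange(s) != t
        pred = npr_predict(X_train[t], X_train[mask], y_train[mask], c)
        if not np.isnan(pred):
            errors.append((pred - y_train[t]) ** 2)
    return float(np.mean(errors))
```

Candidate bandwidth vectors can then be compared by their `loo_mse` values, and the vector with the smallest error is kept for the further analysis.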
Let an object have an input vector $x = (x_1, x_2, \dots, x_n)$ and an output $y = f(x)$. A feature $x_i$ is informative if its average influence on the output is significant when the other $n-1$ features are fixed. We estimate this significance as the average absolute value of the first-order partial derivative with respect to this variable.
Let the variables $x = (x_1, x_2, \dots, x_n)$ belong to the intervals $[a_1; b_1], [a_2; b_2], \dots, [a_n; b_n]$. We generate random values $\{x_1[1], \dots, x_1[m], x_2[1], \dots, x_2[m], \dots\}$ in the corresponding intervals, where $m$ is a predefined value. To get a precise estimation of the average first-order partial derivative, we generate these random values near one observation value $x[l]$, $l = 1, \dots, s$, so that

$$ \sum_{t=1}^{s} \prod_{i=1}^{n} \Phi\!\left(\frac{x_i[k] - x_i[t]}{C_i}\right) \neq 0 \quad \text{for all } k = 1, \dots, m. $$

Then we fix the features $(x_2, \dots, x_n)$ at some points, for example at $(x_2[1], x_3[1], \dots, x_n[1])$. The outputs of the goal function are estimated at the following points:

$$ y_1^+ = f(x_1[1] + h_1^+,\ x_2[1], \dots, x_n[1]), \qquad y_1^- = f(x_1[1] - h_1^-,\ x_2[1], \dots, x_n[1]), $$
$$ y_2^+ = f(x_1[2] + h_2^+,\ x_2[1], \dots, x_n[1]), \qquad y_2^- = f(x_1[2] - h_2^-,\ x_2[1], \dots, x_n[1]), \ \dots, $$

where $h_k^+$ and $h_k^-$ are random values from a small interval (for example, [0,01; 0,5]).
The first component of the average first-order partial derivative with respect to $x_1$ is estimated as:

$$ \hat{f}'_{x_1}[1] = \frac{1}{m}\left(\frac{|y_1^+ - y_1^-|}{h_1^+ + h_1^-} + \frac{|y_2^+ - y_2^-|}{h_2^+ + h_2^-} + \dots + \frac{|y_m^+ - y_m^-|}{h_m^+ + h_m^-}\right). $$
Then the features $(x_2, \dots, x_n)$ are fixed at other points, for example at $(x_2[2], x_3[1], \dots, x_n[1])$, and the same procedure is repeated for $x_1$. The average absolute value of the partial derivative in the neighborhood of $x[l]$ is estimated as:

$$ \hat{f}'_{x_1}(x[l]) = \frac{\hat{f}'_{x_1}[1] + \hat{f}'_{x_1}[2] + \dots + \hat{f}'_{x_1}[M]}{M}, $$

where $M$ is the number of all possible combinations, $M = m(n-1)$. As these random values have been generated in the neighborhood of a single observation, only a small part of the space is investigated. We therefore generate $\{x_1[1], \dots, x_1[m], x_2[1], \dots, x_2[m], \dots\}$ next to another observation point $x[l']$ and find $\hat{f}'_{x_1}(x[l'])$ in the same way. This procedure is repeated $K$ times. The average absolute value of the partial derivative $\hat{f}'_{x_1}$ is estimated as:

$$ \hat{f}'_{x_1} = \frac{1}{K}\sum_{l=1}^{K} \hat{f}'_{x_1}(x[l]), $$

where $K$ is a predefined value.
In the same way the average absolute values $\hat{f}'_{x_i}$, $i = 1, \dots, n$, are estimated.
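A compact sketch of this significance estimate is given below, again as one possible reading of the procedure: the model `f` may be any callable (for example, the `npr_predict` sketch above with fixed training data and bandwidths), while the neighborhood half-width `delta`, the default values of `m`, `M`, `K` and the interval for `h` are illustrative choices of ours.

```python
def avg_abs_partial_derivative(model, X_train, feature, bounds,
                               m=5, M=2, K=20, h_range=(0.01, 0.5),
                               delta=0.1, seed=0):
    """Average absolute first-order partial derivative of `model` with respect
    to input `feature`: K neighborhoods of observed points, M combinations of
    the fixed features, m probe values of the studied feature per combination."""
    rng = np.random.default_rng(seed)
    s, n = X_train.shape
    lo, hi = bounds[:, 0], bounds[:, 1]
    per_neighborhood = []
    for _ in range(K):                                   # neighborhood of one observation x[l]
        x_l = X_train[rng.integers(s)]
        per_combination = []
        for _ in range(M):                               # one combination of the fixed features
            base = np.clip(x_l + rng.uniform(-delta, delta, n), lo, hi)
            quotients = []
            for _ in range(m):                           # m probe values of x_feature near x[l]
                x = base.copy()
                x[feature] = np.clip(x_l[feature] + rng.uniform(-delta, delta),
                                     lo[feature], hi[feature])
                h_plus, h_minus = rng.uniform(*h_range, size=2)
                x_p, x_m = x.copy(), x.copy()
                x_p[feature] += h_plus                   # y^+ = f(..., x_i + h^+, ...)
                x_m[feature] -= h_minus                  # y^- = f(..., x_i - h^-, ...)
                y_p, y_m = model(x_p), model(x_m)
                if not (np.isnan(y_p) or np.isnan(y_m)):
                    quotients.append(abs(y_p - y_m) / (h_plus + h_minus))
            if quotients:
                per_combination.append(np.mean(quotients))
        if per_combination:
            per_neighborhood.append(np.mean(per_combination))
    return float(np.mean(per_neighborhood)) if per_neighborhood else float("nan")
```

Running this function for every feature index and sorting the returned values yields the significance ranking used in the experiments below.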
Investigation of the Algorithm. In this section we show the results of the algorithm's work when the object is not available for experiments, i. e. only collected data are available. In the following experiments the average absolute value of the partial derivative is estimated with M = 2 and K = 20. The function for simulating the object is: f(x) = 5x1 + 0,5x2 − 10x3 + 0,1x4 + 2x5.
In our first experiment, the non-parametric regression model is trained using all the input variables (Ci = [0,4; 1,7; 0,3; 1,8; 0,8], MSE = 0,08).
Then we take away the first feature x1 from the data set. This simulates a situation with incomplete data, in which the most informative of the remaining features should nevertheless be found (Ci = [1,9; 0,3; 1,9; 0,9], MSE = 0,39). The results of the algorithm are shown in Table 1. As we can see, the algorithm was able to find the most important features in both cases.
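As an illustration, the flavour of this first experiment can be reproduced with the sketches above; the sample size, the variable intervals and the random seed below are illustrative and are not taken from the paper.

```python
rng = np.random.default_rng(42)
s, n = 200, 5
bounds = np.tile([0.0, 2.0], (n, 1))                  # illustrative intervals [a_i; b_i]
X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(s, n))
y = 5*X[:, 0] + 0.5*X[:, 1] - 10*X[:, 2] + 0.1*X[:, 3] + 2*X[:, 4]

c = np.array([0.4, 1.7, 0.3, 1.8, 0.8])               # bandwidths quoted for the first run
model = lambda x: npr_predict(x, X, y, c)

scores = [avg_abs_partial_derivative(model, X, i, bounds) for i in range(n)]
order = np.argsort(scores)[::-1]                      # feature indices, most informative first
print(scores)
print("ranking:", [f"x{i + 1}" for i in order])       # x3 and x1 are expected to lead
```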
Table 1
Results of the algorithm’s work
| Features | f′x (real ranks), 5 inputs used | Algorithm's f̂′x (ranks), 5 inputs used | f′x (real ranks), 4 inputs used | Algorithm's f̂′x (ranks), 4 inputs used |
|---|---|---|---|---|
| x1 | 5,0 (2) | 3,42 (2) | – | – |
| x2 | 0,5 (4) | 1,40 (4) | 0,5 (3) | 0,39 (3) |
| x3 | 10,0 (1) | 6,51 (1) | 10,0 (1) | 7,97 (1) |
| x4 | 0,1 (5) | 1,20 (5) | 0,1 (4) | 0,28 (4) |
| x5 | 2,0 (3) | 2,10 (3) | 2,0 (2) | 1,39 (2) |
Table 2
Results of the algorithm’s work
| Features | f′x (real ranks), 5 inputs used | Algorithm's f̂′x (ranks), 5 inputs used | f′x (real ranks), 8 inputs used | Algorithm's f̂′x (ranks), 8 inputs used |
|---|---|---|---|---|
| x1 | 3,83 (5) | 3,15 (5) | 3,83 (5) | 2,71 (5) |
| x2 | 4,23 (4) | 3,29 (4) | 4,23 (4) | 3,24 (4) |
| x3 | 4,38 (3) | 3,50 (3) | 4,38 (3) | 4,38 (3) |
| x4 | 7,05 (2) | 4,24 (2) | 7,05 (2) | 4,43 (2) |
| x5 | 30,0 (1) | 31,48 (1) | 30,0 (1) | 26,23 (1) |
| x6 | – | – | 0,027 (7) | 0,43 (7) |
| x7 | – | – | 0,021 (8) | 0,07 (8) |
| x8 | – | – | 0,06 (6) | 0,74 (6) |
Now let us use the following function for generating the input and output data: f(x) = 7sin(x1) + 6cos(x2) − 8sin(x3) − 10cos(x4) + 5x5². In this dependency there are no features whose influence on the output is linear, which is a more complex situation for the algorithm. However, if the model is trained well (Ci = [1,0; 0,6; 0,7; 0,5; 0,1], MSE = 0,32), the algorithm gives good results (see Table 2).
Let us use the same function for simulating the data set and add three more features to the input variables. This simulates the situation when the data set is large and not all features influence the output. The additional input features are: x6 = 0,05sin(t), x7 = 0,03cos(t), x8 = 0,01t². The coefficients of these features are small, so that x6, x7 and x8 act as noise for the output. In this case we use all the features to train the model. The results are shown in Table 2. The algorithm could find both the most informative and the least informative features (Ci = [1,0; 0,6; 0,7; 0,5; 0,1; 1,9; 1,5; 1,4], MSE = 0,32).
Finally, let us simulate the data set with the function f(x) = 0,2sin(2x1) + 2cos(8x2) + 5sin(x3) + 0,1x4 + 0,5x5 + x6 + 2x7 + 3x8 + 4x9 + 5x10, and then take away the features x1, x2 and x3. The results of the algorithm (Ci = [1,5; 1,4; 1,0; 0,9; 0,5; 0,5; 0,5], MSE = 0,3) are given in Table 3.
Analyzing the results in the tables, we may say that the algorithm with the non-parametric model can find the most informative features. This method can be used for analyzing a high dimensional data set. It allows us to exclude the least informative features from consideration.
Table 3
Results of the algorithm’s work
| Features | f′x (real ranks) | Algorithm's f̂′x (ranks) |
|---|---|---|
| x1 | – | – |
| x2 | – | – |
| x3 | – | – |
| x4 | 0,1 (7) | 0,61 (7) |
| x5 | 0,5 (6) | 0,71 (6) |
| x6 | 1 (5) | 1,26 (5) |
| x7 | 2 (4) | 1,80 (4) |
| x8 | 3 (3) | 3,79 (3) |
| x9 | 4 (2) | 4,15 (2) |
| x10 | 5 (1) | 4,83 (1) |
Experiments with the Corpus. We analyzed different features extracted from monologues of German native speakers using the algorithm described above. The corpus consists of transcribed descriptions of a short film: German native speakers of different ages and educational levels were asked to watch the film and to describe it in their own words. The film was about an experiment on how long people can stay without sleep. The participants were also asked to take an intelligence test. The verbal part of the test consists of 6 subtests. The first subtest is «Information»; it measures general knowledge with 25 questions coming from a particular culture, for example, «What is the capital of Russia?» Overall, 56 candidates were tested and 3 hours 30 minutes of audio data were collected.
To extract features from the monologues, all the words from the descriptions were compared with a special dictionary [5]. The dictionary consists of different words sorted into 64 categories. For example, the category «Articles» contains the words die, das, der, ein, eine, einen, etc. Each word from the dictionary may refer to several categories. For example, the word traurig (sad) refers to the categories «Affect», «Negative emotion» and «Sadness». We analyzed all the monologues, counted the number of words for each category and divided these counts by the total number of words in each monologue. In this way we obtained 64 characteristics of 56 monologues. Our task was to investigate the dependency between these 64 features and the results of the subtest «Information», and to find several informative features among the 64 characteristics.
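A small sketch of this feature extraction step is given below, assuming a category dictionary that maps each word to the list of categories it belongs to; the mini-dictionary, the naive whitespace tokenization and the function name `category_features` are purely illustrative and not the tooling actually used in the paper.

```python
from collections import Counter

def category_features(transcript, category_dict, categories):
    """Relative frequency of each dictionary category in one monologue."""
    words = transcript.lower().split()            # naive tokenization, for illustration only
    counts = Counter()
    for w in words:
        for cat in category_dict.get(w, []):      # a word may belong to several categories
            counts[cat] += 1
    total = len(words)
    return [counts[c] / total for c in categories] if total else [0.0] * len(categories)

# hypothetical mini-dictionary in the spirit of the one used in the paper
category_dict = {"traurig": ["Affect", "Negative emotion", "Sadness"],
                 "die": ["Articles"], "das": ["Articles"]}
print(category_features("das ist traurig", category_dict, ["Affect", "Articles"]))
```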
We combined 4 or 5 features together, trained the non-parametric model and applied our method. As a result, the category «Affect» had the largest value of the first-order partial derivative and was estimated to be the most informative feature. «Positive emotions» and «Negative emotions» are subcategories of «Affect» and are also relevant according to our algorithm. However, «Anger» and «Optimism» do not have large values of f̂′x. The category «Cognitive mechanism» is estimated as irrelevant; however, the category «Cause», which is a subcategory of «Cognitive mechanism», is more important.
Discussion and Future Work. The goal of this work was to apply the method to the collected corpus. In each combination of features, the category of emotional words was determined to be the most informative feature. This means that there is a dependency between a speaker's general knowledge and the number of emotional words he uses in his speech. We could not find any references describing this dependency. Only in LEAS [6] is emotional intelligence measured linguistically; however, a correlation between the two was not found there. The small size of the data set used with our algorithm also influenced the results. Moreover, these emotional words may belong to another category which was not analyzed: for example, they may form a group of frequently used words, or they may be derived from abstract words which reflect the level of intelligence in spoken utterances. This research and its results are preliminary; in our future work we are going to investigate this phenomenon further, to find other linguistic features which reflect verbal intelligence, and to collect more data for more precise estimations.