Models and algorithms for automatic grouping of objects based on the k-means model
Authors: G. Sh. Shkaberina, L. A. Kazakovtsev, R. Li
Journal: Siberian Aerospace Journal
Section: Informatics, computer technology and management
Issue: Vol. 21, No. 3, 2020.
The paper is devoted to the study and development of new algorithms for the automatic grouping of objects. Such algorithms can improve the accuracy and stability of solutions to practical problems, such as identifying homogeneous batches of industrial products. The paper examines the application of the k-means algorithm with the Euclidean, Manhattan, and Mahalanobis distance measures to the problem of automatically grouping objects with a large number of parameters. A new model is presented for solving automatic grouping problems for industrial products, based on the k-means model with the Mahalanobis distance measure. The model uses a training procedure that calculates an averaged estimate of the covariance matrix over a training sample (a sample with pre-labeled data). A new algorithm for the automatic grouping of objects is proposed, based on the k-means optimization model with the Mahalanobis distance measure and a weighted average covariance matrix calculated from a training sample. The algorithm reduces the proportion of errors (increases the Rand index) when identifying homogeneous production batches of products from test results. A new approach to the development of genetic algorithms for the k-means problem is presented, with a single greedy agglomerative heuristic procedure used as the crossover operator and the mutation operator. A computational experiment has shown that the new mutation procedure is fast and efficient in comparison with the original mutation of the genetic algorithm, and a high rate of convergence of the objective function was demonstrated. Within a fixed time, the algorithm provides a statistically significant improvement both in the accuracy of the result (the achieved value of the objective function within the chosen mathematical model of automatic grouping) and in its stability, in comparison with known automatic grouping algorithms. The results show that including the new mutation operator in the genetic algorithm significantly improves the results of the simplest genetic algorithm for the k-means problem.
Keywords: Automatic grouping, k-means, Mahalanobis distance, genetic algorithm.
Short address: https://sciup.org/148321756
IDR: 148321756 | DOI: 10.31772/2587-6066-2020-21-3-347-354
Introduction. Automatic grouping (AG) involves dividing a set of objects into subsets (groups) so that objects from one subset are more similar to each other than to objects from other subsets according to some criterion. Grouping the objects of a set into groups (subsets) takes into account both the general characteristics of the objects and the method by which the division is performed.
To exclude the emergence of unreliable electrical radio products intended for installation in the on-board equipment of a spacecraft with a long period of active existence, the entire electronic component base passes through specialized technical test centers [1; 2]. These centers perform operations of full incoming inspection of electrical radio products, additional verification tests, diagnostic non-destructive testing and selective destructive physical analysis. Detection of initial homogeneous production batches of electrical radio products from shipped batches is an important stage during testing [1].
The k-means model is one of the best known cluster analysis models. The goal is to find $k$ points (centers) $X_1, \ldots, X_k$ in $d$-dimensional space such that the sum of the squared distances from the known points (data vectors) $A_1, \ldots, A_N$ to the nearest of the sought centers reaches a minimum [3]:
$$\arg\min F(X_1, \ldots, X_k) = \sum_{i=1}^{N} \min_{j \in \{1, \ldots, k\}} \left\| X_j - A_i \right\|^2. \qquad (1)$$
The number of groups (subsets) must be specified in advance in the k-means algorithm. In addition, the result depends on the initial choice of centers. The distance function and its definition also play an important role in the problem of dividing the set under study into groups.
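For reference, objective (1) can be evaluated directly. The following minimal Python sketch (the function name and NumPy usage are our own illustration, not code from the paper) computes the sum of squared distances from each data vector to its nearest center; later sketches in this section reuse this helper.

```python
import numpy as np

def kmeans_objective(centers, data):
    """Objective (1): the sum of squared Euclidean distances from each
    data vector A_i to the nearest of the centers X_1, ..., X_k."""
    # Squared distances between every data vector and every center: shape (N, k)
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # For each data vector keep the nearest center, then sum over the sample
    return d2.min(axis=1).sum()
```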
The first genetic algorithm for solving the discrete p-median problem was proposed by Hosage and Goodchild [4]. The algorithm in [5] gives fairly accurate results; however, the rate of convergence of the objective function is very slow. O. Alp, E. Erkut, and Z. Drezner [6] presented a faster simple genetic algorithm with a special recombination procedure, which also gives accurate results. These algorithms solve discrete problems. The authors of “Genetic algorithm-based clustering technique” [7] encode solutions (chromosomes) in their GAs as sets of centroids, represented by their coordinates (vectors of real numbers) in a multidimensional space.
The analysis of the literature has shown that the existing solutions in the field of AG of multidimensional objects either have high accuracy, or ensure the stability of the result over multiple runs of the algorithm, or have high speed of operation, but do not combine all these qualities at the same time. To date, algorithms for k-means and k-medians have been developed only for the most common distance measures (Euclidean, Manhattan). However, taking the peculiarities of the feature space of a specific practical problem into account when choosing a distance measure can increase the accuracy of the AG of objects. In the presented work, we use the Rand index (RI) [8] as a measure of clustering accuracy.
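As a reminder of how the Rand index is defined, the following straightforward O(N²) sketch (our illustration, not the authors' code) counts the fraction of object pairs on which two partitions agree:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Rand index [8]: the share of object pairs that are either grouped
    together in both partitions or separated in both partitions."""
    n = len(labels_true)
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / (n * (n - 1) // 2)
```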
With increased requirements for the accuracy and stability of the result, it is extremely difficult to improve the AG result for multidimensional objects using known algorithms without a significant increase in time costs. When solving practical AG problems on multidimensional data, for example, identifying homogeneous batches of industrial products, the adequacy of the models and, as a result, the accuracy of the AG of industrial products are questionable. Nevertheless, it is possible to develop algorithms that further improve the result within the chosen model, for example, the k-means model.
In a multidimensional feature space, there is often a correlation between individual features and groups of features. These correlation dependences can be exploited by moving from search in a space with the Euclidean or rectangular (Manhattan) metric to search in a space with the Mahalanobis metric [9–11]. The square of the Mahalanobis distance $D_M$ is defined as follows:
$$D_M^2(X) = (X - \mu)^T C^{-1} (X - \mu), \qquad (2)$$
where X is the vector of values of the measured parameters, µ is the vector of mean values (for example, the center of the cluster), C is the covariance matrix.
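A direct transcription of (2) into NumPy might look as follows (a sketch; in practice the inverse covariance matrix should be computed once and reused across distance evaluations):

```python
import numpy as np

def mahalanobis_sq(x, mu, inv_c):
    """Squared Mahalanobis distance (2) between a vector x and a center mu,
    given the precomputed inverse covariance matrix inv_c."""
    diff = x - mu
    return diff @ inv_c @ diff

# inv_c is computed once from the covariance matrix C:
# inv_c = np.linalg.inv(C)
```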
The aim of the study in the presented work is to improve the accuracy and stability of the result of solving problems in automatic grouping of objects.
The idea of the work is to use the Mahalanobis distance measure with an averaged estimate of the covariance matrix in the k-means problem to reduce the proportion of AG errors in comparison with other known algorithms, and also to use a new mutation operator as part of the genetic algorithm to improve the accuracy and stability of the solution (in terms of the achieved value of the objective function within a fixed execution time) in comparison with known algorithms for separating objects.
Initial data. The study was carried out on data from testing batches of integrated circuits [12] intended for installation in space vehicles. The tests were carried out in a specialized center for technical tests. The data is a set of parameters of electrical radio products (ERP). The original batch of ERP consists of different homogeneous batches, in accordance with the manufacturer's marking. The total number of products is 3987, and each product is described by 205 measured parameters. Batch 1 contains 71 products, batch 2 – 116 products, batch 3 – 1867 products, batch 4 – 1250 products, batch 5 – 146 products, batch 6 – 113 products, and batch 7 – 424 products.
Instead of the covariance matrix from (2), it was proposed to calculate the averaged estimate of the covariance matrix for homogeneous batches of products (according to pre-labeled data) using the training sample:
$$\bar{C} = \frac{1}{n} \sum_{j=1}^{k} n_j C_j, \qquad (3)$$
where $n_j$ is the number of objects (products) in the $j$-th batch, $n$ is the total sample size, and $C_j$ are the covariance matrices of the individual batches of products.
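A minimal sketch of estimate (3), assuming the training sample is given as a list of per-batch arrays (objects in rows, measured parameters in columns):

```python
import numpy as np

def averaged_covariance(batches):
    """Averaged estimate (3): per-batch covariance matrices C_j weighted
    by the batch sizes n_j and divided by the total sample size n."""
    n = sum(len(b) for b in batches)
    d = batches[0].shape[1]
    c_avg = np.zeros((d, d))
    for b in batches:
        # rowvar=False: rows are objects, columns are measured parameters
        c_avg += len(b) * np.cov(b, rowvar=False)
    return c_avg / n
```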
In this paper, we propose an algorithm for the automatic grouping of objects based on the k-means model, with the parameter of the Mahalanobis distance measure (the covariance matrix) adjusted on the training sample:
Algorithm 1. k-means with the Mahalanobis distance with an averaged estimate of the covariance matrix
Step 1. Using the k-means method with the Euclidean distance, divide the sample into k clusters (here k is some expert estimate of the possible number of homogeneous groups, not necessarily accurate);
Step 2. Calculate the center $\mu_i$ for each cluster. The center is defined as the arithmetic mean of all points in the cluster:

$$\mu_i = \frac{1}{m} \sum_{j=1}^{m} X_j, \qquad (4)$$

where $m$ is the number of points in the cluster and $X_j$ ($j = 1, \ldots, m$) are the vectors of measured parameter values of its points;
Step 3. Calculate the averaged estimate of the covariance matrix (3). If the averaged estimate of the covariance matrix is degenerate, go to step 4, otherwise go to step 5;
Step 4. Increase the number of clusters to (k + 1) and repeat steps 1 and 2, forming new clusters with the squared Euclidean distance:

$$D(X_j, \mu_l) = \sum_{i=1}^{n} (X_{ji} - \mu_{li})^2, \qquad (5)$$

where $n$ is the number of parameters. Return to step 3 with the new training set;
Step 5. Match each point to the nearest centroid using the square of the Mahalanobis distance (2) with the averaged estimate of the covariance matrix C (3) to form new clusters.
Step 6. Repeat the algorithm from step 2 until the clusters stop changing.
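Putting the pieces together, a compact sketch of the core loop of Algorithm 1 (Steps 2, 5 and 6) could look as follows; it assumes that the Euclidean initialization (Step 1) and a non-degenerate averaged covariance estimate (Steps 3–4) have already been obtained, and it does not handle empty clusters:

```python
import numpy as np

def mahalanobis_kmeans(data, centers, c_avg, max_iter=100):
    """Steps 2, 5 and 6 of Algorithm 1: alternate between assigning points
    to the nearest centroid by the squared Mahalanobis distance (2) with
    the averaged covariance estimate (3) and recomputing the centers (4)."""
    inv_c = np.linalg.inv(c_avg)  # non-degenerate by construction (Step 4)
    for _ in range(max_iter):
        diff = data[:, None, :] - centers[None, :, :]          # (N, k, d)
        d2 = np.einsum('nkd,de,nke->nk', diff, inv_c, diff)    # squared (2)
        assign = d2.argmin(axis=1)                             # Step 5
        new_centers = np.array([data[assign == j].mean(axis=0)
                                for j in range(len(centers))]) # Step 2, eq. (4)
        if np.allclose(new_centers, centers):                  # Step 6
            break
        centers = new_centers
    return centers, assign
```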
The paper presents the results of three groups of experiments on the data of industrial product samples.
The first group. The training sample corresponds to the working sample for which the clustering was carried out.
The second group. The training and working samples are not the same. In practice, a test center can use retrospective data on deliveries and testing of products of the same type as a training sample.
The third group. The training and working samples also do not match, but the results of the automatic grouping of products (k-means in multi-start mode with the Euclidean metric) were used as the training sample.
In each group of experiments, for each working sample, the k -means algorithm was run 30 times with each of the five clustering models studied.
DM1 model – k-means with the Mahalanobis distance; the covariance matrix is calculated over the entire training sample.
DC model – k-means with a distance similar to the Mahalanobis distance, but using a correlation matrix instead of a covariance matrix.
DM2 model – k-means with the Mahalanobis distance, with an averaged estimate of the covariance matrix.
DR model – k-means with the Manhattan distance.
DE model – k-means with the Euclidean distance.
It was found that the new DM2 model with an averaged estimate of the covariance matrix shows the best accuracy among the presented models in almost all series of experiments according to the Rand index (RI), and in all cases it exceeds the DE model, where the Euclidean distance is used. The experiments also showed that in most cases the coefficient of variation of the objective function values is higher for the DE model, which uses the Euclidean distance measure, while the range coefficient of the objective function values is highest for the DM2 model, which uses the Mahalanobis distance measure with an averaged estimate of the covariance matrix.
Table 1. The results of a computational experiment on the data of the 1526IE10_002 microcircuit (3987 data vectors with a dimension of 68); the training sample consists of 10 batches (the third group), the working sample is made up of 7 batches of products

Rand index (RI):

| V-series | DM1 | DC | DM2 | DR | DE |
|---|---|---|---|---|---|
| Max | 0.767 | 0.658 | 0.749 | 0.740 | 0.735 |
| Min | 0.562 | 0.645 | 0.696 | 0.703 | 0.705 |
| Average | 0.632 | 0.650 | 0.725 | 0.714 | 0.719 |
| Std.Dev | 0.047 | 0.003 | 0.016 | 0.008 | 0.006 |

Objective function:

| V-series | DM1 | DC | DM2 | DR | DE |
|---|---|---|---|---|---|
| Max | 255886 | 37917 | 281265 | 18897 | 6494.62 |
| Min | 250839 | 36997 | 274506 | 17785 | 5009.42 |
| Average | 252877 | 37178 | 277892 | 18240 | 5249.95 |
| Std.Dev | 1164.5 | 152.8 | 2358.9 | 452.7 | 366.5 |
| V, % | 0.461 | 0.411 | 0.849 | 2.482 | 6.981 |
| R (range) | 5047 | 920 | 6759 | 1112 | 1485 |
Therefore, multiple runs of the k-means algorithm, or other algorithms based on the k-means model (for example, j-means [14] or greedy heuristic algorithms [15]), are required to obtain consistently good values of the objective function.
Genetic cross-mutation algorithm for the k-means problem. The new algorithm improves the accuracy of solving the k-means problem and the stability of the result within a fixed limited execution time. In this section, by the accuracy of the algorithm we mean exclusively the achieved value of the objective function, without taking into account the adequacy of the model or the correspondence of the algorithm results to the actual (real) separation of objects, if known.
A very limited set of possible mutation operators is known for genetic algorithms solving the k-means problem with real-valued coding of solutions. For example, the authors of “Genetic algorithm-based clustering technique” [7] encode solutions (chromosomes) in their GAs as sets of centroids represented by their coordinates (vectors of real numbers) in a multidimensional space. Each chromosome undergoes mutation with a fixed probability $\mu$. The mutation procedure (operator) is as follows.
Algorithm 2. Initial GA mutation procedure for the k-means problem
Step 1. Generate a random number $b \in (0, 1]$ with uniform distribution;
Step 2. IF $b < \mu$, THEN the chromosome mutates. If the value of the current centroid coordinate is $u$, then after mutation it becomes

$$u \leftarrow \begin{cases} u \pm 2 b u, & u \neq 0, \\ u \pm 2 b, & u = 0. \end{cases}$$
The signs “+” and “–” have the same probability. The centroid coordinates are shifted randomly.
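In Python, this mutation operator could be sketched as follows (our reading of Steps 1–2; `random.random()` returns values in [0, 1), which we treat as a close enough stand-in for (0, 1]):

```python
import random

def mutate(chromosome, mu=0.01):
    """Original GA mutation (Algorithm 2): with probability mu, shift every
    centroid coordinate u by +/- 2*b*u (or by +/- 2*b when u is zero)."""
    b = random.random()                        # Step 1: b in (0, 1]
    if b < mu:                                 # Step 2: chromosome mutates
        for i, u in enumerate(chromosome):
            sign = random.choice((-1.0, 1.0))  # "+" and "-" equally likely
            # Shift the coordinate randomly, proportionally to its value
            chromosome[i] = u + sign * 2 * b * (u if u != 0 else 1.0)
    return chromosome
```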
In our work we replaced this mutation procedure for the k -means problem with the following procedure.
Algorithm 3. GA cross-mutation procedure for the k-means problem
Step 1. Generate a random initial solution $S = \{X_1, \ldots, X_k\}$;
Step 2. Apply the k-means algorithm to $S$ to obtain a local optimum $S'$;
Step 3. Apply a simple crossover procedure to the mutated individual from the population and $S'$ to obtain a new solution $S''$;
Step 4. Apply the k-means algorithm to $S''$ to obtain a local optimum $S''$;
Step 5. IF $F(S'') < F(S')$, THEN $S' \leftarrow S''$.
The proposed procedure is used with a mutation probability of 1 after each crossover operator.
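One possible reading of Algorithm 3 in code is given below. Here `kmeans(data, centers)` (assumed to run local search from the given centers and return the improved ones) and `crossover` (the GA's simple crossover operator) are placeholders for routines the paper does not spell out, and `kmeans_objective` is the helper sketched earlier:

```python
import numpy as np

def cross_mutation(individual, data, kmeans, crossover):
    """Cross-mutation (Algorithm 3) applied to one individual: a random
    solution is locally improved, crossed with the individual, improved
    again, and the better of the two local optima is kept."""
    k = len(individual)
    s = data[np.random.choice(len(data), size=k, replace=False)]  # Step 1
    s1 = kmeans(data, s)                           # Step 2: local optimum S'
    s2 = kmeans(data, crossover(individual, s1))   # Steps 3-4: S''
    # Step 5: keep S'' only if it improves the objective (1)
    if kmeans_objective(s2, data) < kmeans_objective(s1, data):
        return s2
    return s1
```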
The results of running the original Algorithm 2 with a mutation probability of 0.01, and of its version with Algorithm 3 as the mutation operator, are shown in the figure (population size $N_{POP}$ = 20). The new mutation procedure is fast and efficient in comparison with the original mutation of the genetic algorithm; a high convergence rate of the objective function has been shown.
Greedy genetic algorithms and many other evolutionary algorithms for the k-means problem do without mutation. The idea of a greedy agglomerative heuristic procedure is to combine two known solutions into one infeasible solution with an excessive number of centroids, and then reduce the number of centroids successively. At each iteration, the centroid whose removal yields the smallest increase in the objective function value (1) is removed.
Algorithm 4. Basic greedy agglomerative heuristic procedure
Given: the initial number of clusters $K$, the required number of clusters $k$, $k < K$, and the initial solution $S = \{X_1, \ldots, X_K\}$, where $|S| = K$.
STEP 1. Improve the initial solution $S$ by the k-means algorithm.
WHILE $K > k$:
CYCLE for each $i \in \{1, \ldots, K\}$:
STEP 2. $S_i \leftarrow S \setminus \{X_i\}$. Improve the solution $S_i$ by the k-means algorithm and store the obtained value of the objective function (1) as the variable $F_i$.
END CYCLE.
STEP 3. Remove the centroid whose deletion gives the smallest increase in (1): $S \leftarrow S_{i^*}$, where $i^* = \arg\min_i F_i$; set $K \leftarrow K - 1$.
END WHILE.
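A sketch of Algorithm 4 under the same assumptions as above (the `kmeans` local-search routine and the `kmeans_objective` helper from the earlier sketches):

```python
import numpy as np

def greedy_agglomerative(data, s, k, kmeans):
    """Basic greedy agglomerative heuristic (Algorithm 4): starting from
    K > k centroids, repeatedly delete the centroid whose removal causes
    the smallest increase in objective (1)."""
    s = kmeans(data, s)                                    # STEP 1
    while len(s) > k:
        best_f, best_s = float('inf'), None
        for i in range(len(s)):                            # CYCLE over centroids
            trial = kmeans(data, np.delete(s, i, axis=0))  # STEP 2
            f = kmeans_objective(trial, data)
            if f < best_f:
                best_f, best_s = f, trial
        s = best_s                                         # STEP 3: best removal
    return s
```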
References
- Orlov V. I., Kazakovtsev L. A., Masich I. S., Stashkov D. V. Algoritmicheskoe obespechenie podderzhki prinyatiya reshenii po otboru izdelii mikroelektroniki dlya kosmicheskogo priborostroeniya [Algorithmic support of decision-making on selection of microelectronics products for space industry]. Krasnoyarsk, 2017, 225 p.
- Kazakovtsev L. A., Antamoshkin A. N. [Greedy heuristic method for location problems]. Vestnik SibGAU. 2015, Vol. 16, No. 2, P. 317–325 (In Russ.).
- MacQueen J. Some methods for classification and analysis of multivariate observations. Proc. Fifth Berkeley Symp. Math. Stat. Probab. 1967, Vol. 1, P. 281–297.
- Hosage C. M., Goodchild M. F. Discrete Space Location-Allocation Solutions from Genetic Algorithms. Annals of Operations Research. 1986, Vol. 6, P. 35–46. http://doi.org/10.1007/bf02027381
- Bozkaya B., Zhang J., Erkut E. A Genetic Algorithm for the p-Median Problem. In: Drezner Z., Hamacher H. (eds.), Facility Location: Applications and Theory. Springer, Berlin, 2002.
- Alp O., Erkut E., Drezner Z. An Efficient Genetic Algorithm for the p-Median Problem. Annals of Operations Research. 2003, Vol. 122, P. 21–42. doi: 10.1023/A:1026130003508
- Maulik U., Bandyopadhyay S. Genetic Algorithm-Based Clustering Technique. Pattern Recognition. 2000, Vol. 33, No. 9, P. 1455–1465. doi: 10.1016/S0031-3203(99)00137-5
- Rand W. M. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association. 1971, Vol. 66, No. 336, P. 846–850.
- De Maesschalck R., Jouan-Rimbaud D., Massart D. L. The Mahalanobis distance. Chem Intell Lab Syst. 2000, Vol. 50, No. 1, P. 1–18. doi: 10.1016/S0169-7439(99)00047-7
- McLachlan G. J. Mahalanobis Distance. Resonance. 1999, Vol. 4, No. 20. doi: 10.1007/BF02834632.
- Xing E. P., Ng A. Y., Jordan M. I., Russel S. Distance metric learning with application to clustering with side-information. Advances in Neural Information Processing Systems. 2003, Vol. 15, P. 505–512.
- Orlov V. I., Fedosov V. V. ERC clustering dataset, 2016. http://levk.info/data1526.zip
- Orlov V. I., Shkaberina G. Sh., Rozhnov I. P., Stupina A. A., Kazakovtsev L. A. [Application of clustering algorithms with special distance measures for the problem of automatic grouping of electronic and radio devices]. Control Systems and Information Technologies. 2019, Vol. 3, No. 3, P. 42–46 (In Russ.).
- Hansen P., Mladenovic N. J-means: a new local search heuristic for minimum sum of squares clustering. Pattern Recognition. 2001, Vol. 34, No. 2, P. 405–413.
- Kazakovtsev L. A., Antamoshkin A. N. Genetic Algorithm with Fast Greedy Heuristic for Clustering and Location Problems. Informatica. 2014, Vol. 38, No. 3, P. 229–240.