A Proposed Modification of K-Means Algorithm
Authors: Sharfuddin Mahmood, Mohammad Saiedur Rahaman, Dip Nandi, Mashiour Rahman
Journal: International Journal of Modern Education and Computer Science (IJMECS) @ijmecs
Issue: Vol. 7, No. 6, 2015.
Open access
The k-means algorithm is one of the most popular algorithms for data clustering. It attempts to group data of similar types together from a large data set using a brute-force strategy of repeated distance calculations, so its computational complexity is very high. Several studies have been carried out to reduce this complexity. This paper presents the result of our research: a modified version of the k-means algorithm with an improved technique for dividing the data set into a specified number of clusters with the help of several checkpoint values. It requires less computation and achieves better accuracy than the traditional k-means algorithm as well as some modified variants of it.
Keywords: clusters, clustering algorithms, Euclidean distance, data mining
Короткий адрес: https://sciup.org/15014767
IDR: 15014767
Full text of the article A Proposed Modification of K-Means Algorithm
Published Online June 2015 in MECS
Nowadays data have become an asset for human beings. In almost every sector, data is stored for future utilization. As data grows in size, the techniques needed to utilize it become more challenging.
This large amount of data is then used for discovering new knowledge. Essentially, data mining (DM) utilizes these large data warehouses to extract information [1]. Data mining is a technique by which users try to gain knowledge from stored data. It is frequently used in knowledge discovery systems. Additionally, data mining is used to design artificially intelligent systems, machine learning processes and statistical systems [2][3].
Clustering is an application of data mining. It is concerned with grouping together objects that are similar to each other but different from the objects that belong to other clusters [4]. It is a way to classify raw data reasonably so that one can find hidden patterns that may exist in the dataset [5]. Clustering is one of the key technologies used in the data mining field and also in machine learning. It is utilized in many areas such as knowledge discovery, pattern recognition and classification, data compression and vector quantization. It also plays an important role in the fields of biology, geology, geography and marketing [6].
In the k-means algorithm, each object is assigned to its nearest cluster center and the centers are then recomputed; this process is repeated until no object moves to another cluster [7]. The algorithm is very popular for distributing data objects into the desired clusters. However, its main problem is that it requires a high number of computations and hence takes extensive time.
Many improved versions of the k-means algorithm have been proposed by researchers. In [8][9], researchers proposed maintaining two data structures to save each object's current minimum distance and currently assigned cluster. This reduces the computation considerably and provides better performance than the traditional k-means algorithm. However, the main problem with this approach is that the "current minimum distance" is not always the correct minimum distance.
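The two-data-structure idea from [8][9] can be sketched as follows. This is a minimal Python illustration under our own naming, not the cited authors' code. For each object it keeps the currently assigned cluster (`labels`) and the distance to that cluster's center (`dists`), and skips the full scan over all centers whenever the object has not moved farther from its own center. The sketch also exposes the limitation mentioned above: an object that got closer to its old center is never re-checked against the other centers, so the stored minimum is not always the true minimum.

```python
import math

def euclidean(x, m):
    # Plain Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, m)))

def assign_with_cache(points, centers, labels, dists):
    """One assignment pass using two cached arrays per object:
    labels[i] -- currently assigned cluster of object i,
    dists[i]  -- previously stored distance to that cluster's center."""
    for i, p in enumerate(points):
        d_own = euclidean(p, centers[labels[i]])
        if d_own <= dists[i]:
            # Object is no farther from its own (updated) center than before:
            # assume it stays put and skip the scan over the other centers.
            dists[i] = d_own
            continue
        # Otherwise fall back to the brute-force scan over all k centers.
        best_j, best_d = labels[i], d_own
        for j, c in enumerate(centers):
            d = euclidean(p, c)
            if d < best_d:
                best_j, best_d = j, d
        labels[i], dists[i] = best_j, best_d
    return labels, dists
```

The saving comes from the `continue` branch: after a center update most objects stay in their cluster, so most iterations avoid the k-fold distance computation per object.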
As mentioned above, the main objective of this paper is to propose a modification of the k-means algorithm that has the same functionality as the traditional algorithm but overcomes the problems discussed in the previous section. A more efficient and less computationally expensive algorithm is proposed in this paper.
In the following sections, the overall procedure and the progression of the work, along with the related background studies, are explained in detail. Here is a glimpse of the paper:
In Background Studies, related previous works are summarized, focusing on improved k-means clustering algorithms, improved initial-center theory and better initial-cluster concepts.
In the Proposed Algorithm section, the algorithm is presented along with some graphical descriptions.
The Result Analysis section discusses the outcome of the proposed algorithm along with a comparison against other modifications of the traditional k-means algorithm.
Finally, the drawbacks and future work are described in the conclusion section.
II. Related Works
Before presenting the proposed algorithm, it is essential to discuss the related background studies. In this section the related ideas and research are discussed in brief. A lot of work has been done on the k-means algorithm; here it is discussed from the very beginning for a better understanding.
In the modern world, data is becoming more important day by day. Large organizations tend to keep their data safe and store it in such a way that it can be used in the future. As the volume of data grows, the need to preserve it in a structured way becomes challenging. Data mining handles this challenge in a structured way: data are mined so that useful information can be obtained from them when needed. In the present world, data is a continuous by-product of every business that can be put to good use.
Clustering is a process that organizes a set of objects into disjoint classes, referred to as clusters [9]. It is a form of unsupervised classification and hence does not depend on predefined classes. It tries to partition a data set based on specific features so that the objects within a cluster are more similar to each other than to objects in a different one [4][10-11].
The k-means algorithm is a partition-based clustering method [12-15]. In the traditional k-means algorithm, k random points are chosen from the whole dataset as the initial cluster centers. Then the distance between each object and each cluster center is calculated, and each object is assigned to the nearest cluster. The Euclidean formula is used to find the distance.
So the whole algorithm can be described as below [16]:

Input: N objects and the number of clusters K.
Output: K clusters.

1. Arbitrarily select k objects as the initial cluster centers.
2. Calculate the distance between each object $x_i$ and each cluster center $m_j$. For calculating the distance the Euclidean formula is used:
$$d(x_i, m_j) = \sqrt{\sum_{d} (x_{id} - m_{jd})^2}$$
where $i = 1, \dots, N$; $j = 1, \dots, k$; and $d(x_i, m_j)$ is the distance between object $i$ and cluster center $j$.
3. Assign each object to the nearest cluster.
4. Calculate the average of the objects in each cluster as the new cluster center:
$$m_j = \frac{1}{N_j} \sum_{x_i \in C_j} x_i$$
where $j = 1, 2, \dots, k$; $C_j$ is the set of objects in cluster $j$; and $N_j$ is the number of objects currently in cluster $j$.
5. Repeat steps 2-4 until the convergence criterion is met.

The traditional k-means algorithm needs a lot of calculations, as it is a brute-force algorithm. The distance of each object from every cluster center must be computed whether or not the object moves in or out of a cluster. Even for the object closest to one cluster center, it must be verified that it is not closer to any other center. This results in a large number of calculations.
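The steps above can be sketched in plain Python. This is a minimal illustration of the traditional brute-force algorithm with our own function and variable names, not an optimized implementation:

```python
import math
import random

def euclidean(x, m):
    # Step 2: d(x_i, m_j) = sqrt(sum_d (x_id - m_jd)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, m)))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: arbitrarily select k objects as initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        changed = False
        # Steps 2-3: brute force -- every object is compared with every center.
        for i, p in enumerate(points):
            j = min(range(k), key=lambda c: euclidean(p, centers[c]))
            if labels[i] != j:
                labels[i] = j
                changed = True
        # Step 4: recompute each center as the mean of its members.
        for j in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(dim) / len(members) for dim in zip(*members)]
        # Step 5: converged once no object moved to another cluster.
        if not changed:
            break
    return centers, labels
```

The inner `min` over all k centers for every object in every iteration is exactly the repeated-computation cost the paper sets out to reduce.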
References
- S. AL Manaseer, A. Malibari, "Improved Teaching Method of Data Mining Course", I.J. Modern Education and Computer Science, vol. 2, pp. 15-22, 2012.
- L. Su, H. Liu, Z. Song, "A New Classification for Data Stream", I.J. Modern Education and Computer Science, vol. 4, pp. 32-39, 2011.
- S. Jigui, Q. Keyun, "Research on Modified K-Means Data Clusters", Computer Engineering, vol. 33, no. 13, pp. 200-201.
- J. B. MacQueen, "Some methods for classification and analysis of multivariate observations", Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, (1):281-297, 1967.
- Huang Z, "Extensions to the K-Means algorithm for clustering large data set with categorical values", Data mining and knowledge discovery, Vol. 2, Page: 283-304, 1998.
- S. Deelers, S. Auwantanamongkol, "Enhancing K-Means Algorithm with Initial Cluster Centers Derived from Data Partitioning along the Data Axis with the Highest Variance", International Journal of Computer Science, vol. 1, 2007.
- J. Han, M. Kamber, J. Pei, "Data Mining- Concepts and techniques", Third Edition, Chapter: 7, Page: 401.
- S. Na, G. Yong, L. Xumin,"Research on k-means Clustering Algorithm", Third International Symposium on Intelligent Information Technology and Security Information, 2010.
- M. Yedla, S. R. Pathakota, T.M. Srinivasa, "Enhancing K-means Clustering Algorithm with Improved Initial Center", International Journal of Computer Science and Information Technologies, Vol. 1(2):121-125, 2010.
- A. Triantafillakis P. Kanellis, D. Martakos, "Data Warehouse Clustering on the web", European Journal of Operational Research, 160(2):353-364, 2005.
- M. H. Dunham, "Data Mining- Introductory and Advanced Concepts", Pearson Education,2006.
- C. C. Aggarwal, "A Human-Computer Interactive Method for Projected Clustering", IEEE Transactions on Knowledge and Data Engineering, vol. 16(4), pp. 448-460, 2004.
- A. M. Fahim, A. M. Salem, F. A. Torkey and M. A. Ramadan, "An Efficient Enhanced K-Means Clustering Algorithm", Journal of Zhejiang University, 10(7), pp. 1626-1633, 2006.
- K. A. Abdul Nazeer, M.P. Sebastian, "Improving the Accuracy and efficiency of the K- Means Clustering Algorithm", International Conference on Data Mining and Knowledge Engineering (ICDMKE). Proceeding of the World Congress on Engineering(WCE-2009), Volume : 1 , 2009.
- K. Arai, A. R. Barakbah, "Hierarchical K-Means: an Algorithm for Centroids Initialization for K-Means", Department of Information Science and Electrical Engineering, Polytechnic in Surabaya, Faculty of Science and Engineering, Saga University, vol. 36, no. 1, 2007.
- J. Wang, X. Su, "An Improved K-Means Algorithm", IEEE 3rd International Conference on Communication Software and Networks (ICCSN), pp. 44-46, 2011.
- Chen Zhang, Shixiong Xia, "K-Means Clustering Algorithm With Improved Initial Center", ISBN: 978-0-7695-3543-2, pp:790-792.
- University of California, Irvine, https://archive.ics.uci.edu/ml/datasets.html.