Finding the Number of Clusters in Data and Better Initial Centers for K-means Algorithm

Ahmed Fahim

doi:10.5815/ijisa.2020.06.01

Scientific articles \ Prolegomena. Fundamentals of knowledge and culture. Propaedeutics \ Computer science and technology. Computing. Data processing \ Application-oriented computer-based techniques

Finding the Number of Clusters in Data and Better Initial Centers for K-means Algorithm

Author: Ahmed Fahim

Journal: International Journal of Intelligent Systems and Applications @ijisa

Article in issue: 6 vol.12, 2020.

Free access

The k-means is the most well-known algorithm for data clustering in data mining. Its simplicity and speed of convergence to local minima are the most important advantages of it, in addition to its linear time complexity. The most important open problems in this algorithm are the selection of initial centers and the determination of the exact number of clusters in advance. This paper proposes a solution for these two problems together; by adding a preprocess step to get the expected number of clusters in data and better initial centers. There are many researches to solve each of these problems separately, but there is no research to solve both problems together. The preprocess step requires o(n log n); where n is size of the dataset. This preprocess step aims to get initial portioning of data without determining the number of clusters in advance, then computes the means of initial clusters. After that we apply k-means on original data using the resulting information from the preprocess step to get the final clusters. We use many benchmark datasets to test the proposed method. The experimental results show the efficiency of the proposed method.

Data clustering, k in k-means, initial centers in k-means, clustering algorithms

Short address: https://sciup.org/15017517

IDR: 15017517 | DOI: 10.5815/ijisa.2020.06.01

References Finding the Number of Clusters in Data and Better Initial Centers for K-means Algorithm

P. S. Bradley, and U. M. Fayyad, “Refining Initial Points for K-Means Clustering.” In ICML 98, pp. 91-99,‏ July 1998.
A. M. Fahim, A. M. Salem, F. A. Torkey, M. A. Ramadan, and G. Saake, “An efficient k-means with good initial starting points.” Computer Sciences and Telecommunications, 2009, vol. 2(19), pp. 47-57. ‏
C. Zhang, and S. Xia, “K-means clustering algorithm with improved initial center.” In 2009 Second International Workshop on Knowledge Discovery and Data Mining, IEEE, ‏2009, pp. 790-792.
M. Yedla, S. R. Pathakota, and T. M. Srinivasa, “Enhancing K-means clustering algorithm with improved initial center.” International Journal of computer science and information technologies, 2010, vol. 1(2), pp. 121-125.
M. Erisoglu, N. Calis, and S. Sakallioglu, “A new algorithm for initial cluster centers in k-means algorithm.” Pattern Recognition Letters, 2011, vol. 32(14), pp. 1701-1705.‏
C. S. Li, “Cluster center initialization method for k-means algorithm over data sets with two clusters.” Procedia Engineering, 2011, vol. 24, pp. 324-328.
A. Alrabea, A. V. Senthilkumar, H. Al-Shalabi, and A. Bader, “Enhancing k-means algorithm with initial cluster centers derived from data partitioning along the data axis with PCA.” Journal of Advances in Computer Networks, 2013, vol. 1(2), pp.137-142.
G. Sathiya, and P. Kavitha, “An efficient enhanced K-means approach with improved initial cluster centers.” Middle-East Journal of Scientific Research, 2014, vol. 20(1), pp. 485-491.
R. Mawati, I. M. Sumertajaya, and F. M. Afendi, “Modified Centroid Selection Method of K-Means Clustering.” IOSR Journal of Mathematics, 2014, vol. 10(2), pp. 49-53.
R. Suryawanshi, and S. Puthran, “A Novel Approach for Data Clustering using Improved Kmeans Algorithm.” International Journal of Computer Applications, 2016, vol. 142(12), pp. 13-18.
PL. Chithra, and Jeyapriya.U, “Premeditated initial points for K-Means Clustering.” International Journal of Computer Science and Information Security IJCSIS, 2017, vol. 15(9), pp. 278- 281.
A. C. Fabregas, B. D. Gerardo, B. T. Tanguilig, “Enhanced initial centroids for k-means algorithm.” International Journal of Information Technology and Computer Science, 2017, vol. 9(1), pp. 26-33.
S. S. Yu, S. W. Chu, C. M. Wang, Y. K. Chan, and T. C. Chang, “Two improved k-means algorithms.” Applied Soft Computing, 2018,vol. 68, pp. 747-755.
S. Awawdeh, A. Edinat, and A. Sleit, “An Enhanced K-means Clustering Algorithm for Multi-attributes Data.” International Journal of Computer Science and Information Security (IJCSIS), 2019, vol. 17(2), pp. 1-6.
M. Yan “Determining the number of clusters using the weighted gap statistic.” Biometrics, 2007, vol. 63, pp. 1031–1037.
C. A. Sugar, and G. M. James, “Finding the number of clusters in a data set: an information-theoretic approach.” Journal of the American Statistical Association, 2003, 98:463, pp. 750-763, DOI:10.1198/016214503000000666.
D. Kim, Y. Park, D. Park, “A novel validity index for determination of the optimal number of clusters.” IEICE Tans Inf Syst, 2001, E84, pp. 281–285.
G. Bel Mufti, P. Bertrand, and L. El Moubarki, “Determining the number of groups from measures of cluster stability,” In: Proceedings of International Symposium on Applied Stochastic Models and Data Analysis. Brest: France, 2005, pp. 404–412.
S. S. Chae, J. L. Dubien, and W. D. Warde, “A method of predicting the number of clusters using Rand’s statistic.” Computational Statistics & Data Analysis 50 (2006), pp. 3531 -3546.
M. K. Pakhira,” Finding Number of Clusters before Finding Clusters.” Procedia Technology, 4, pp. 27 – 37, 2012.
A. Kane, “Determining the number of clusters for a k-means clustering algorithm.” Indian Journal of Computer Science and Engineering (IJCSE), 2012, vol. 3 (5), pp. 670 – 672.
D. T. Pham, S. S. Dimov, and C. D. Nguyen, “Selection of K in K-means clustering.” Mechanical Engineering Science, 2005, 219, pp. 103 – 119.
A. Fujita, D. Y. Takahashi, and A. G. Patriota, “A non-parametric method to estimate the number of clusters.” Computational Statistics and Data Analysis, 2014, vol. 73, pp. 27–39.
C. Subbalakshmi, G. R. Krishna, S. K. M. Rao, and P. V. Rao, “A Method to Find Optimum Number of Clusters Based on Fuzzy Silhouette on Dynamic Data Set.” Procedia Computer Science, 2015, 46, pp. 346 – 353.
D. Steinley, M. J. Brusco, “Choosing the Number of Clusters in K-Means Clustering.” Psychological Methods, 2011, 16(3), pp. 285–297.
A. M. Mehar, K. Matawie, and A. Maeder, “Determining an Optimal Value of K in K-means Clustering.” IEEE international conference on Bioinformatics and Biomedicine, 2013, pp. 51-55.
M. Z. Hossain, M. N. Akhtar, R.B. Ahmad, M. Rahman, “A dynamic K-means clustering for data mining.” Indonesian Journal of Electrical Engineering and Computer Science, 2019, vol. 13(2), pp. 521 – 526.
J. C. Bezdek and R. J. Hathaway, “VAT: A tool for visual assessment of (cluster) tendency,” in Proc. Intl. Joint Conf. on Neural Networks. Honohulu, HI, 2002, pp. 2225-2230.
A. Fahim, A. M. Salem, F. Torkey, M. Ramadan, and G. Saake, “Scalable varied density clustering algorithm for large datasets.” Journal of Software Engineering and Applications, 2010, vol. 3(06), pp. 593-602.
M. Debnath, P. K. Tripathi, and R. Elmasri, “K-DBSCAN: Identifying spatial clusters with differing density levels,” in Proceedings of the 2015 International Workshop on Data Mining with Industrial Applications, DMIA 2015, Paraguay, September, 2015, pp. 51–60.
A. M. Fahim, A. M. Salem, F. A. Torkey, M. A. Ramadan, “Hierarchical Clustering Based on K-Means as Local Sample (HCKM).” Egyptian Computer Science Journal, 2007, vol. 29(1), pp. 26-35. ‏
M. Ester, H. P. Krigel, J. Sander, and X. Xu, “A Density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.