Extension of K-Modes Algorithm for Generating Clusters Automatically

Anupama Chadha; Suresh Kumar

Journal: International Journal of Information Technology and Computer Science (IJITCS)

Issue: Vol. 8, No. 3, 2016.

K-Modes is a well-known algorithm for clustering data sets with categorical attributes, valued for its simplicity and speed. It extends the K-Means algorithm to categorical data: the 'Simple Matching Dissimilarity' measure replaces Euclidean distance, and the 'Modes' of clusters replace the 'Means'. However, one major limitation of the algorithm is its dependence on the number of clusters K being supplied in advance, and it is sometimes practically impossible to estimate the optimum number of clusters correctly beforehand. In this paper we propose an algorithm that overcomes this limitation while maintaining the simplicity of the K-Modes algorithm.
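
To make the two substitutions concrete, the following is a minimal Python sketch of the standard K-Modes building blocks named above: the simple matching dissimilarity and the per-attribute mode of a cluster. It is not the algorithm proposed in the paper, which the abstract does not specify. The function names (simple_matching, cluster_mode, assign) and the toy records are hypothetical and chosen only for illustration.

```python
from collections import Counter

def simple_matching(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Per-attribute most frequent value of a non-empty cluster.

    Ties are broken by first occurrence (an assumption of this sketch).
    """
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

def assign(records, modes):
    """Index of the nearest mode for each record, under simple matching."""
    return [
        min(range(len(modes)), key=lambda j: simple_matching(r, modes[j]))
        for r in records
    ]

# Hypothetical toy data: (colour, size, shape)
cluster = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("red", "large", "round"),
]
modes = [("red", "small", "round"), ("blue", "large", "square")]
labels = assign([("red", "small", "round"), ("blue", "large", "round")], modes)
```

Note that assign still requires the list of modes, and therefore K, to be fixed beforehand; this is the dependency the paper sets out to remove.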

Keywords: Clustering, K-Modes clustering, Dependency, Prior input, Number of clusters

Short address: https://sciup.org/15012464

IDR: 15012464
