Clustering Techniques in Bioinformatics

Authors: Muhammad Ali Masood, M. N. A. Khan

Journal: International Journal of Modern Education and Computer Science (IJMECS)

Issue: Vol. 7, No. 1, 2015.

Free access

Dealing with data means grouping information into a set of categories, either to learn new artifacts or to understand new domains. For this purpose, researchers have always looked for hidden patterns in data that can be defined and compared with other known notions based on the similarity or dissimilarity of their attributes according to well-defined rules. Data mining, with its tools of data classification and data clustering, is one of the most powerful techniques for handling data in a way that helps researchers identify the required information. To address this challenge, experts have used clustering techniques as a means of exploring hidden structure and patterns in the underlying data. Clustering has proven itself a reliable tool, offering improved stability, robustness and accuracy of unsupervised data classification in many fields, including pattern recognition, machine learning, information retrieval, image analysis and bioinformatics. To identify the clusters in a dataset, algorithms are used to partition the dataset into several groups based on the similarity within each group. No single clustering algorithm suits every problem; rather, various algorithms are used depending on the domain of the data that constitutes a cluster and the level of efficiency required. Clustering techniques are categorized according to different approaches. This paper surveys five of the most common clustering techniques in data mining: K-medoids, K-means, Fuzzy C-means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Self-Organizing Map (SOM) clustering.


Clustering Techniques, Data Mining, DBSCAN, Hierarchical Clustering, Performance Analysis

Short URL: https://sciup.org/15014722

IDR: 15014722

Text of the research article: Clustering Techniques in Bioinformatics

Published Online January 2015 in MECS DOI: 10.5815/ijmecs.2015.01.06

Data mining is used to extract useful information and to identify concealed patterns and shared attributes within large bodies of data. It provides powerful support for decision-making through the application of supervised and unsupervised data analysis techniques.

Data mining tasks use different techniques such as clustering, prediction, association, classification, sequential patterns and decision trees. These techniques are briefly explained below:

  • A.    Association

Association is also known as the relation technique, because a pattern is discovered based on a relationship between items in the same transaction. The most common example of this technique is market basket analysis, which recognizes consumers' purchasing trends across different products.
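
As a rough illustration of market basket analysis, the sketch below computes two standard association measures, support and confidence, over a handful of transactions. The transactions and item names are invented for this example and do not come from the paper.

```python
# Hypothetical shopping baskets, invented purely for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from the baskets."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), baskets) / base


bread_milk_support = support({"bread", "milk"}, transactions)
bread_to_milk_confidence = confidence({"bread"}, {"milk"}, transactions)
print(bread_milk_support, bread_to_milk_confidence)
```

A rule such as "bread implies milk" is usually kept only when both measures exceed thresholds chosen by the analyst.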

  • B.    Classification

Classification is a typical data mining technique that assigns each item in a dataset to one of a predefined set of classes. Mathematical techniques such as neural networks, statistics, decision trees and linear programming are used to perform classification.

  • C.    Prediction

Prediction is a data mining technique that discovers relationships and dependencies among different attributes, in particular the relationship between independent and dependent variables. Based on historical data, a fitted regression curve can be drawn and used for future prediction.
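
The idea of fitting a regression curve to historical data can be sketched with ordinary least squares on a single independent variable. This is a minimal example assuming a straight-line relationship; the data values and the helper name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical observations that happen to follow y = 2x + 1.
history_x = [1, 2, 3, 4, 5]
history_y = [3, 5, 7, 9, 11]

slope, intercept = fit_line(history_x, history_y)
forecast = slope * 6 + intercept  # predict the value for the next period
print(slope, intercept, forecast)
```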

  • D.    Sequential Patterns

Sequential pattern analysis seeks to explore or identify similar patterns, consistent events or trends in transaction data over a certain timeframe.

  • E.    Decision trees

The decision tree is one of the most widely used data mining techniques because it is easy to understand and use. The root of the decision tree is a condition that has different answers, and each answer leads to a further set of conditions that help process the data so that a final decision can be made.
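
The root-condition structure described above can be pictured as a nested mapping in which each answer leads either to another condition or to a final decision. The tree below is hand-written and hypothetical (a toy loan decision); it is not learned from data and only shows how a record is routed from the root to a leaf.

```python
# Hypothetical hand-built tree: internal nodes ask a question,
# leaves carry the final decision.
tree = {
    "question": "income_high",
    "answers": {
        True: {"decision": "approve"},
        False: {
            "question": "has_guarantor",
            "answers": {
                True: {"decision": "approve"},
                False: {"decision": "reject"},
            },
        },
    },
}


def decide(node, record):
    """Follow the record's answers from the root until a leaf is reached."""
    while "decision" not in node:
        answer = record[node["question"]]
        node = node["answers"][answer]
    return node["decision"]


result = decide(tree, {"income_high": False, "has_guarantor": True})
print(result)
```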

  • F.    Clustering

Clustering is a data mining technique that automatically creates suitable clusters of objects that have similar characteristics. Clustering is unsupervised, in contrast to classification, in which objects are assigned to predefined classes. Clustering defines the classes itself and places objects in each class based on similar properties.

Data mining systems are either supervised or unsupervised, depending on whether the domain is already known. If the domain is known, separate classes are defined in advance and supervised classification is performed; if the domain is unknown, unsupervised clustering is performed, in which exploratory data analysis is used to identify hidden data patterns.

Clustering is an unsupervised data mining technique that places individual artifacts into relevant groups, without prior knowledge of the distinct group properties, in order to explore structure in the data. Clusters surface hidden patterns in the data, which can then be used for further learning. The aim of clustering is to turn an unlabeled dataset into a set of separate data structures by learning the hidden concepts in the data. For example, the spending behaviors of different population segments can be compared to find out which segments to target for a new product release.

Clustering is an initial and fundamental step in data analysis. Historically, its foundations were laid by mathematics, statistics and numerical analysis, which treat it as the unsupervised classification of patterns into groups of similar objects. Patterns in a cluster are therefore more alike to each other than to patterns in other clusters. Clustering identifies groups of related records that can be used as a starting point for exploring further associations. Clustering methods can be classified into five major types based on the following criteria: hierarchy, density, partition, grid and model.

One of the biggest challenges in clustering is deciding which algorithm to use for a specific problem. Algorithms differ in their execution characteristics and so produce distinct cluster analysis models. Understanding these models is very important for identifying the differences between the outputs of various algorithms. The clustering models include connectivity models, centroid models, density models, subspace models, graph-based models, group models and distribution models.

Cluster formation is one of the most difficult tasks in the knowledge extraction process. The goal is to identify clusters without any prior knowledge of the attributes that differentiate them. Clustering techniques group the identified artifacts based on the following criteria:

  •    Each cluster is homogeneous in nature.

  •    Each cluster should differ in nature from the other clusters.
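
The two criteria above can be checked numerically. The sketch below uses one possible pair of measures, chosen here for illustration and not taken from the paper: homogeneity as the mean distance of a cluster's points to its centroid, and separation as the distance between the centroids of two clusters. The point coordinates are made up.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def mean_distance_to_centroid(points):
    """Homogeneity proxy: smaller values mean a tighter cluster."""
    center = centroid(points)
    return sum(math.dist(p, center) for p in points) / len(points)


# Two hypothetical clusters in a two-dimensional feature space.
cluster_a = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
cluster_b = [(10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)]

spread_a = mean_distance_to_centroid(cluster_a)
spread_b = mean_distance_to_centroid(cluster_b)
separation = math.dist(centroid(cluster_a), centroid(cluster_b))
print(spread_a, spread_b, separation)
```

When the separation is large relative to both spreads, as it is here, the clusters are homogeneous internally and clearly distinct from each other.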

Clustering is useful in various areas, e.g. geoinformatics, web mining, bioinformatics, market research, market segmentation, image processing, document categorization, learning and pattern recognition.

Because the task is unsupervised, the structural properties of the data are unknown: the distribution of the data in terms of the number, volume, density, shape and orientation of clusters is not known in advance. When applied to data mining, clustering encounters three additional complications: huge data repositories, objects with different characteristics, and numerous attribute types.

Clustering inherently poses problems for which each solution may violate at least one of the rules of scale invariance, richness and cluster consistency. These properties are defined to enhance the credibility of clustering techniques. For instance, if the variables do not have equal variance, it is impossible to avoid clusters that are dominated by the variables with the most variation. The same holds for cluster consistency and richness: a lack of consistency between data partitions is again a serious threat to the credibility of the clusters formed.

Clustering techniques use a particular data model based on different assumptions, and misguided assumptions may lead us to apply the wrong model to the sample data, causing erroneous or unrelated results. Domain knowledge of the data is therefore important for successful clustering, although even domain experts might not be able to provide such crucial information. To establish strong grounds for the distribution of the sample data, or for partitioning it into the proper number of clusters, we need to identify relevant subspaces or visualize domain knowledge. Hence, given the exploratory nature of clustering tasks, efficient and effective methods are required to strengthen individual clustering algorithms.
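
The effect of unequal variance on a distance-based method can be shown with a small calculation. The sketch below measures how much of the squared Euclidean distance between two records comes from the first variable, before and after z-score standardization. Standardization is a common remedy assumed here for illustration; the paper does not prescribe it, and the records are invented.

```python
import statistics


def standardize(column):
    """Rescale a column to zero mean and unit (population) variance."""
    mean = statistics.fmean(column)
    spread = statistics.pstdev(column)
    return [(value - mean) / spread for value in column]


def share_of_first_variable(p, q):
    """Fraction of the squared Euclidean distance due to variable 0."""
    first = (p[0] - q[0]) ** 2
    second = (p[1] - q[1]) ** 2
    return first / (first + second)


# Hypothetical records: age in years, income in currency units.
ages = [25.0, 35.0, 45.0, 55.0]
incomes = [30000.0, 32000.0, 31000.0, 33000.0]

raw = list(zip(ages, incomes))
scaled = list(zip(standardize(ages), standardize(incomes)))

raw_share = share_of_first_variable(raw[0], raw[3])
scaled_share = share_of_first_variable(scaled[0], scaled[3])
print(raw_share, scaled_share)
```

On the raw data, age contributes almost nothing to the distance because income varies on a much larger scale; after standardization the two variables contribute equally for this pair of records.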

II.    Literature Review

Jelili et al. [1] applied k-means cluster analysis to examine students' academic performance data. The k-means clustering technique is used in combination with Euclidean distance, in a deterministic statistical analysis, to analyze the students' performance. The main aim of the paper is to present the predictive power of clustering algorithms and statistical techniques. The approach was applied to the results of 79 students in nine courses offered to each student. The trends were presented as distances. A qualitative data analysis approach was used to measure the similarity distances and produce a numerical explanation of the results for the performance assessment. In practice, running time usually depends on the speed and type of the system.

The proposed technique is not only a model for academic forecasts but also an improved version of existing models that removes their limitations. The existing methods described in the paper are fuzzy models, which use a dataset of only two course results to predict students' academic behavior. Another approach described is rough set theory, used to analyze student data with the Rosetta toolkit. The purpose of using this toolkit is to assess the data in order to identify associations between the influencing factors and student grades.
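
For readers unfamiliar with the method, k-means with Euclidean distance can be sketched as an alternation of an assignment step and a centroid update step. The code below is a generic textbook-style sketch, not the implementation used in [1]. The scores are hypothetical, and seeding the centroids with the first k points is an assumption made only to keep the run reproducible.

```python
import math


def kmeans(points, k, max_iter=100):
    """Lloyd-style k-means using Euclidean distance."""
    # Assumption: seed with the first k points so the result is repeatable;
    # practical implementations usually choose the seeds at random.
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if not members:
                updated.append(centroids[j])  # keep an empty cluster's centroid
                continue
            updated.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
            )
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, labels


# Hypothetical scores of six students in two courses.
scores = [(40, 45), (42, 48), (85, 90), (88, 86), (45, 42), (90, 92)]
centroids, labels = kmeans(scores, k=2)
print(labels)
print(centroids)
```

With these toy scores the procedure separates the three lower-scoring students from the three higher-scoring ones. The outcome of k-means generally depends on the initial centroids, so different seeds can give different partitions on less clearly separated data.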

References

  • O. O. Jelili, O. O. Ojeniyi and I. C. Obagbuwa. Application of K-Means Clustering Algorithm for Prediction of Students’ Academic Performance. International Journal of Computer Science and Information Security (IJCSIS), Vol. 7, No. 1, 2010.
  • T. Velmurugan. Efficiency of K-Means & K-Medoids Algorithms for Clustering Arbitrary Data Points. International Journal of Computer Technology & Applications (IJCTA), Vol. 3 (5) Sept-Oct 2012.
  • Tajunisha and Saravanan. Performance analysis of k-means with different initialization methods for high dimensional data. International Journal of Artificial Intelligence & Applications (IJAIA), Vol.1, No.4, October 2010.
  • M. Khalilian, N. Mustapha, M. N. Suliman and M. A. Mamat. A Novel K-Means Based Clustering Algorithm for High Dimensional Data Sets. International Multi Conference of Engineers and Computer Scientists (IMECS). Vol. I. March 17, 2010.
  • J.H. Peter and A. Antonysamy. An Optimized Density Based Clustering Algorithm. International Journal of Computer Applications, Volume 6– No.9, September 2010.
  • J. Zhang, W. Li and J. Tan. An Improved Clustering Algorithm Based on Density Distribution Function. Computer and Information Science Vol. 3, No. 3; August 2010.
  • A. R. Pratap A, J. R. Devi, K. S. Vani and K. N. Rao. An Efficient Density based Improved K-Medoids Clustering algorithm. International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 2, No. 6, 2011.
  • S. A. L. Mary and K. R. S. Kumar. A Density Based Dynamic Data Clustering Algorithm based on Incremental Dataset. Journal of Computer Science 8 (5) 2012.
  • S. Kisilevich, F. Mansmann and D. Keim. P-DBSCAN: A density based clustering algorithm for exploration and analysis of attractive areas using collections of geo-tagged photos. 1st International Conference and Exhibition on Computing for Geospatial Research & Application Article No. 38 ACM New York. 2010.
  • R. Mayer and A. Rauber. Visualizing Clusters in Self-Organizing Maps with Minimum Spanning Trees. K. Diamantaras, W. Duch, L.S. Iliadis (Eds.): ICANN 2010, Part II, LNCS 6353, pp. 426–431. Springer-Verlag Berlin Heidelberg. 2010.
  • B. Silva and N. Marques. Feature Clustering With Self-Organizing Maps and an Application to Financial Time-Series for Portfolio Selection. International Conference on Neural Computation (ICNC). 2010.
  • M. Sakthi and A. S. Thanamani. An Efficient Constrained K-Means Clustering using Self Organizing Map. International Journal of Computer Science and Information Security (IJCSIS), Vol. 9, No. 4. April 2011.
  • T. Velmurugan and T. Santhanam. Clustering Mixed Data Points Using Fuzzy C-Means Clustering Algorithm for Performance Analysis. International Journal on Computer Science and Engineering (IJCSE) Vol. 02, No. 09, 2010, 3100-3105.
  • X. SU, X. WANG, Z. WANG and Y. XIAO. A New Fuzzy Clustering Algorithm Based on Entropy Weighting. Journal of Computational Information Systems (JOFCIS) 6:10 (2010) 3319-3326. October, 2010.
  • S. P. Chatzis. A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional. Expert Systems with Applications 38, 8684–8689. (2011).
  • Iqbal S., Khalid M., Khan, M N A. A Distinctive Suite of Performance Metrics for Software Design. International Journal of Software Engineering & Its Applications, 7(5), (2013).
  • Iqbal S., Khan M.N.A., Yet another Set of Requirement Metrics for Software Projects. International Journal of Software Engineering & Its Applications, 6(1), (2012).
  • Faizan M., Ulhaq S., Khan M N A., Defect Prevention and Process Improvement Methodology for Outsourced Software Projects. Middle-East Journal of Scientific Research, 19(5), 674-682, (2014).
  • Faizan M., Khan M NA., Ulhaq S., Contemporary Trends in Defect Prevention: A Survey Report. International Journal of Modern Education & Computer Science, 4(3), (2012).
  • Khan K., Khan A., Aamir M., Khan M N A., Quality Assurance Assessment in Global Software Development. World Applied Sciences Journal, 24(11), (2013).
  • Amir M., Khan K., Khan A., Khan M N A., An Appraisal of Agile Software Development Process. International Journal of Advanced Science & Technology, 58, (2013).
  • Khan, M., & Khan, M. N. A. Exploring Query Optimization Techniques in Relational Databases. International Journal of Database Theory & Application, 6(3). (2013).
  • Khan, MNA., Khalid M., ulHaq S., Review of Requirements Management Issues in Software Development. International Journal of Modern Education & Computer Science, 5(1), (2013).
  • Umar M., Khan, M N A., A Framework to Separate Non-Functional Requirements for System Maintainability. Kuwait Journal of Science & Engineering, 39(1 B), 211-231, (2012).
  • Umar M., Khan, M. N. A., Analyzing Non-Functional Requirements (NFRs) for software development. In IEEE 2nd International Conference on Software Engineering and Service Science (ICSESS), pp. 675-678, (2011).
  • Khan, M. N. A., Chatwin, C. R., & Young, R. C. (2007). A framework for post-event timeline reconstruction using neural networks. Digital Investigation, 4(3), 146-157.
  • Khan, M. N. A., Chatwin, C. R., & Young, R. C. (2007). Extracting Evidence from File system Activity using Bayesian Networks. International Journal of Forensic Computer Science, 1, 50-63.
  • Khan, M. N. A. (2012). Performance analysis of Bayesian networks and neural networks in classification of file system activities. Computers & Security, 31(4), 391-401.
  • Rafique, M., & Khan, M. N. A. (2013). Exploring Static and Live Digital Forensics: Methods, Practices and Tools. International Journal of Scientific & Engineering Research 4(10): 1048-1056.
  • Bashir, M. S., & Khan, M. N. A. (2013). Triage in Live Digital Forensic Analysis. International journal of Forensic Computer Science 1, 35-44.
Research article