Cascaded Factor Analysis and Wavelet Transform Method for Tumor Classification Using Gene Expression Data
Authors: Jayakishan Meher, Ram Chandra Barik, Madhab Ranjan Panigrahi, Saroj Kumar Pradhan, Gananath Dash
Journal: International Journal of Information Technology and Computer Science (IJITCS)
Issue: Vol. 4, No. 9, 2012
Open access
Correlating gene expression profiles, obtained from microarray data, with disease states or different developmental stages of a cell has become a central task in molecular biology. Because microarray data contain thousands of genes but very few samples, efficient feature extraction and computational methods are necessary for their analysis. In this paper we propose an effective feature extraction method based on factor analysis (FA) cascaded with the discrete wavelet transform (DWT) to detect informative genes. A radial basis function neural network (RBFNN) classifier, which has lower complexity than many other classifiers, is used to predict the sample class. The potential of the proposed approach is evaluated through an exhaustive study on several benchmark datasets. The experimental results show that the proposed method is a useful approach for cancer classification.
Keywords: factor analysis, wavelet transform, gene expression data, radial basis function neural network
Short address: https://sciup.org/15011763
IDR: 15011763
Published Online August 2012 in MECS
I. Introduction

Microarrays have emerged as an advanced biological laboratory technology that has accumulated a huge number of gene expression profiles of tissue samples at relatively low cost, enabling scientists and researchers to characterize complex biological problems. Microarray technology has been used as a basis to unravel interrelationships among genes, such as clustering of genes, temporal patterns of expression, understanding disease mechanisms at the molecular level and defining drug targets [1]. Among these applications, disease classification and analysis has gained special interest. In particular, tumor classification from gene expression profiles has become a center of attraction in many research communities, as it is important for subsequent diagnosis and treatment. Gene expression is assayed under different conditions or at different cellular stages to reveal the functions of genes as well as their regulatory interactions. Gene expression of diseased tissues may be used to gain a better understanding of many diseases, such as different types of cancer.
Empirical microarray experiments produce large datasets containing expression levels of thousands of genes for a very small number (up to hundreds) of samples, which leads to the "curse of dimensionality". Due to this high dimensionality, classifier accuracy decreases as the risk of overfitting grows. Since microarray data contain thousands of genes, a large number of them are uninformative for classification because they are either irrelevant or redundant. Deriving a subset of informative or discriminative genes from the entire gene set is therefore necessary, and is a challenging task in microarray data analysis. The purpose of gene selection, or dimension reduction, is to simplify the classifier by retaining a small set of relevant genes and to improve its accuracy. For this purpose, researchers have applied a number of test statistics and discriminant criteria to find genes that are differentially expressed between the investigated classes.
Various methods and techniques have been developed in the recent past to perform gene selection and reduce the dimensionality problem. Filter methods use a criterion relating to class-discriminating factors to select key genes for classification, such as the Pearson correlation coefficient method [1], the t-statistic method [2], the signal-to-noise ratio method [3], the partial least squares method, independent component analysis [4], linear discriminant analysis and principal component analysis [5]. All of these methods transform the original gene space to another domain, providing reduced, uncorrelated discriminant components; however, they do not detect localized features of microarray data. Hence Liu [6, 7] proposed wavelet basis functions to perform multiresolution analysis of microarray data at different levels. Relevant genes of the microarray data can be measured by a wavelet basis owing to the compactness and finite-energy characteristics of the wavelet function. This approach does not depend on the training samples for dimension reduction of the microarray dataset, and it does not require large matrix computations like LDA, PCA and ICA, so it is simpler to implement. Because of these characteristics, in this paper we use wavelet-based feature extraction to reduce the feature space. Since some redundant genes may still be present in the reduced gene set and may degrade accuracy, we combine factor analysis with the wavelet transform to obtain an optimal set of relevant and discriminative genes for classification.
Several machine learning and statistical techniques have been applied to classify microarray data. Tan and Gilbert [8] used three supervised learning methods, C4.5 decision trees and bagged and boosted decision trees, to predict the class label of microarray samples. Dettling [9] proposed an ensemble bag-boosting approach for the same purpose. Many authors have successfully used the support vector machine (SVM) for microarray classification [10]. Khan et al. [11] used neural networks to classify the subcategories of small round blue-cell tumors, and O'Neill and Song [12] used neural networks to analyze lymphoma data with very good accuracy. Liu et al. [13] proposed an ensemble neural network combined with different feature selection methods to classify microarray data efficiently. However, conventional neural networks require substantial computation and take a long time to train. In this paper we introduce a promising low-complexity neural network, the radial basis function neural network (RBFNN), to classify microarray data efficiently.
The remainder of this paper is organized as follows: Section 2 describes the datasets used in the study. Section 3 presents the proposed methods for tumor classification using gene expression data, Section 4 presents the simulation experiments and analysis of the results, and Section 5 draws conclusions.
II. Dataset
2.1 ALL/AML Leukemia Dataset
The dataset consists of bone marrow samples from two distinct acute leukemias, AML and ALL, with 7129 probes from 6817 human genes. The training dataset consists of 38 samples (27 ALL and 11 AML) and the test dataset consists of 34 samples (20 ALL and 14 AML).
2.2 SRBCT Dataset
The dataset consists of four categories of small round blue cell tumors (SRBCT) with 83 samples from 2308 genes. The tumors are Burkitt lymphoma (BL), the Ewing family of tumors (EWS), neuroblastoma (NB) and rhabdomyosarcoma (RMS). There are 63 samples for training and 20 samples for testing. The training set consists of 8, 23, 12 and 20 samples of BL, EWS, NB and RMS respectively. The testing set consists of 3, 6, 6 and 5 samples of BL, EWS, NB and RMS respectively.
2.3 MLL Leukemia Dataset
The dataset consists of three types of leukemias namely ALL, MLL and AML with 72 samples from 12582 genes. The training dataset consists of 57 samples (20 ALL, 17 MLL and 20 AML) and the test data set consists of 15 samples (4 ALL, 3 MLL and 8 AML).
2.4 Colon Dataset
The dataset consists of 62 samples from 2000 genes. The training dataset consists of 42 samples where (30 class1, 12 class2) and the test data set consists of 20 samples (10 class1, 10 class2).
These four cancer gene expression datasets are used throughout the study; their dimensions are summarized in Table 5.
III. Proposed Methods

3.1 Factor analysis
Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. It is possible that the variations in three or four observed variables mainly reflect the variations in fewer unobserved variables. Factor analysis searches for such joint variations in response to unobserved latent variables. The observed variables are modeled as linear combinations of the potential factors, plus "error" terms.
The information gained about the interdependencies between observed variables can be used later to reduce the set of variables in a dataset.
Given a set of variables, the underlying dimensions account for the patterns of collinearity among them. Factor analysis is a data reduction tool that removes redundancy or duplication from a set of correlated variables and retains a smaller set of derived variables.
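As an illustrative sketch (not the authors' code), this reduction step can be performed with scikit-learn's `FactorAnalysis`; the matrix size and number of factors below are toy values chosen for speed, not the 700 factors retained later in the paper.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Toy stand-in for a microarray matrix: 62 samples x 200 genes
# (the real Colon data is 62 x 2000, reduced to 700 factors in the paper).
X = rng.normal(size=(62, 200))

# Describe the correlated gene variables with a small set of latent factors.
fa = FactorAnalysis(n_components=20, random_state=0)
scores = fa.fit_transform(X)     # factor scores F, shape (62, 20)
loadings = fa.components_        # factor loadings W, shape (20, 200)
```

The factor scores replace the original gene columns as the reduced feature set passed on to the next stage.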
The model involves two kinds of variables: latent factors and observed variables. After selecting and measuring a set of observed variables, factor extraction analyzes their common variance (but not unique variance or error variance) and produces factors; a linear combination of the factors approximates, but does not exactly reproduce, the observed correlation matrix. The purpose is to reproduce the correlation matrix with a few orthogonal factors, which are formed to be relatively independent of one another.
Two common rules determine the number of factors to retain:

1) Eigenvalues: retain all factors with an eigenvalue greater than 1.

2) Scree plot: retain all factors "before the elbow".

When more than one factor is retained, the unrotated factors usually cannot be interpreted, so a rotation is applied; rotation does not affect the mathematical fit of the solution. With orthogonal rotation the factors are uncorrelated (orthogonal); with oblique rotation the factors may (or may not) be correlated. If rotation is orthogonal, the solution is interpreted from the "loading matrix" (in SPSS, the "rotated factor matrix"), whose values are bivariate correlations between the variables and the factors. If rotation is oblique, the solution is interpreted from the "pattern matrix", whose values are partial correlations between the variables and the factors. In both cases the values are called "factor loadings". Under oblique rotation, the "structure matrix" contains the bivariate correlations between variables and factors, and is usually ignored for interpretation.
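The eigenvalue retention rule above can be checked numerically. A minimal sketch on synthetic data (the variable counts and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 8 observed variables driven by 2 latent factors plus noise.
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 8)) + 0.5 * rng.normal(size=(100, 8))

# Eigenvalues of the correlation matrix, sorted in descending order.
R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]

# Kaiser rule: retain factors whose eigenvalue exceeds 1.
n_retain = int(np.sum(eigvals > 1.0))
# A scree plot of `eigvals` would show the "elbow" at the same point.
```

The trace of a correlation matrix equals the number of variables, so eigenvalues above 1 mark factors that explain more variance than a single original variable.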

We now postulate that there are q latent factors, and that each observation is a linear combination of factor scores plus noise:

X_ij = ε_ij + Σ_{r=1}^{q} F_ir w_rj    (1)

The weights w_rj are called the factor loadings of the observable features: how much feature j changes, on average, in response to a one-unit change in factor score r. Notice that we allow each feature to go along with more than one factor (for a given j, w_rj can be nonzero for multiple r). This corresponds to our measurements running together what are really distinct variables.

3.2 Wavelet based feature extraction method

The wavelet transform, proposed by Grossmann and Morlet [14], is an efficient time-frequency representation method which transforms a signal in the time domain into the time-frequency domain. The basic idea is that any signal can be decomposed into a series of dilations and compressions of a mother wavelet ψ(t). The continuous wavelet transform (CWT) of a signal x(t) is defined as:

CWT(a, b) = (1/√a) ∫_{−∞}^{∞} x(t) ψ((t − b)/a) dt    (2)

where ψ_{a,b}(t) = (1/√a) ψ((t − b)/a), a ∈ R⁺, b ∈ R.

The resolution of the signal depends on the scaling parameter a, and the translation parameter b determines the localization of the wavelet in time. The CWT can be realized in discrete form through the discrete wavelet transform (DWT). The DWT is capable of extracting local features by separating the components of the signal in both time and scale. In microarray data, the gene expression profile is considered as a signal, which can be represented as a sum of wavelets at different time shifts and scales using the DWT, as shown in Fig. 1.

Fig. 1: Wavelet decomposition

The wavelet decomposition can be realized by iterated filtering with rescaling, developed by Mallat [15] through wavelet filter banks. The resolution of the signal, which is a measure of the amount of detail information in the signal, is determined by the filtering operations, and the scale is determined by upsampling and downsampling operations. The approximation coefficients obtained by the decomposition at a particular level are used as the features for further study.
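The paper uses the db7 wavelet; as a self-contained illustration of multilevel approximation, the sketch below uses the simplest wavelet (Haar) instead, whose level-k approximation is obtained by repeatedly taking scaled pairwise averages:

```python
import numpy as np

def haar_approx(x, level):
    """Level-`level` Haar approximation: repeated scaled pairwise averaging."""
    a = np.asarray(x, dtype=float)
    for _ in range(level):
        if a.size % 2:                       # pad odd-length signals
            a = np.append(a, a[-1])
        a = (a[0::2] + a[1::2]) / np.sqrt(2.0)
    return a

# Stand-in for one gene expression profile treated as a signal.
profile = np.sin(np.linspace(0, 8 * np.pi, 2000))
features = haar_approx(profile, level=4)
# Length halves at each level: 2000 -> 1000 -> 500 -> 250 -> 125
```

Each averaging pass is the low-pass filter and downsample step of the filter bank; db7 would use a longer 14-tap filter in place of the 2-tap average.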
3.3 Radial basis function neural network classifier
The radial basis function neural network (RBFNN) is used for function approximation and pattern classification problems because of its simple topological structure and its ability to learn in an explicit manner. A classical RBF network has an input layer, a hidden layer consisting of nonlinear node functions, an output layer, and a set of weights connecting the hidden and output layers. Due to this simple structure, it reduces the computational burden compared with a conventional multilayer perceptron (MLP) network. In the RBFNN, the basis functions are usually chosen to be Gaussian, and the number of hidden units is fixed a priori using some properties of the input data. The structure of an RBF network is shown in Fig. 2.

Fig. 2: The RBFNN based classifier
For an input feature vector x(n), the output of the j-th output node is given as:

y_j = Σ_{k=1}^{N} w_kj φ_k = Σ_{k=1}^{N} w_kj exp(−‖x(n) − C_k‖² / (2σ_k²))    (3)
The error that occurs in the learning process is reduced by updating three parameters of the RBFNN, the positions of the centers (C_k), the widths of the Gaussian functions (σ_k) and the connecting weights (w), by a stochastic gradient approach as defined below:

w(n + 1) = w(n) − μ_w ∂J(n)/∂w

C_k(n + 1) = C_k(n) − μ_c ∂J(n)/∂C_k

σ_k(n + 1) = σ_k(n) − μ_σ ∂J(n)/∂σ_k

where J(n) = e²(n) is the cost, e(n) = d(n) − y(n) is the error, d(n) is the target output and y(n) is the predicted output. μ_w, μ_c and μ_σ are the learning parameters of the RBF network. The complete process of the proposed feature-extraction-based classification is presented in Fig. 3.
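A minimal sketch of this training loop on toy data (the dataset, hyper-parameters and the decision to keep the centers and widths fixed, updating only the weights from the equations above, are illustrative assumptions, not the authors' exact procedure):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy two-class data: two well-separated Gaussian clusters in 2-D.
X = np.vstack([rng.normal(-1, 0.3, (50, 2)), rng.normal(1, 0.3, (50, 2))])
d = np.hstack([np.zeros(50), np.ones(50)])      # target outputs d(n)

C = X[rng.choice(len(X), 10, replace=False)]    # 10 Gaussian centers (fixed)
sigma = 1.0                                     # common width (fixed)
w = np.zeros(10)                                # output weights
mu_w = 0.05                                     # learning rate for weights

def phi(x):
    """Gaussian basis responses phi_k for one input vector x."""
    return np.exp(-np.sum((x - C) ** 2, axis=1) / (2 * sigma ** 2))

for _ in range(20):                             # training epochs
    for n in rng.permutation(len(X)):
        h = phi(X[n])
        e = d[n] - w @ h                        # e(n) = d(n) - y(n)
        w += mu_w * e * h                       # gradient step on J = e^2

preds = np.array([w @ phi(x) for x in X]) > 0.5
acc = float(np.mean(preds == (d == 1)))
```

The weight step follows from ∂J/∂w = −2 e(n) φ (the factor 2 is absorbed into μ_w); extending the loop with the C_k and σ_k updates would implement the full stochastic gradient scheme above.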
IV. Experiment and Analysis
To assess the efficiency of the proposed method in predicting the class of cancer microarray data, we used four standard datasets: Leukemia, SRBCT, MLL Leukemia and Colon. The datasets fall into two groups, binary-class and multi-class: Leukemia and Colon are binary-class, while SRBCT and MLL Leukemia are multi-class. The feature extraction process proposed in this paper has two steps: first, the microarray data is reduced by factor analysis to an optimally chosen discriminative feature set; then a level-4 discrete wavelet transform with the db7 wavelet is applied, and the approximation coefficients are taken as the extracted features. The performance of the proposed feature extraction method is analyzed with two well-studied neural network classifiers, MLP and RBFNN. A leave-one-out cross-validation (LOOCV) test is conducted by combining all training and test samples for both classifiers on all datasets, and the results are listed in Table 1. For the binary-class data the performance of RBFNN is comparable to MLP, but for the multi-class data it outperforms MLP.
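The LOOCV protocol described above can be sketched as follows (with synthetic data and a k-nearest-neighbour stand-in for the neural classifiers, so the accuracy is illustrative only):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier  # stand-in classifier

rng = np.random.default_rng(3)

# Synthetic two-class data: 72 samples (like Leukemia) with 8 features,
# standing in for the factor-analysis + wavelet reduced feature set.
X = np.vstack([rng.normal(-1, 0.5, (36, 8)), rng.normal(1, 0.5, (36, 8))])
y = np.hstack([np.zeros(36), np.ones(36)])

# LOOCV: hold out one sample, train on the rest, repeat for every sample.
hits = 0
for train, test in LeaveOneOut().split(X):
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train], y[train])
    hits += int(clf.predict(X[test])[0] == y[test][0])
accuracy = hits / len(X)
```

Substituting an MLP or RBFNN for the k-NN classifier inside the loop reproduces the evaluation protocol used for Table 1.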

Fig.3: Flow graph of the proposed feature extraction based classification method
Table 1: Comparison of classification accuracy of the MLP and RBFNN classifiers

Dataset | Method | Classification accuracy
Leukemia | MLP | 98.61 %
Leukemia | RBFNN | 100 %
SRBCT | MLP | 90.36 %
SRBCT | RBFNN | 98.59 %
MLL Leukemia | MLP | 87.50 %
MLL Leukemia | RBFNN | 98.83 %
Colon | MLP | 93.54 %
Colon | RBFNN | 96.33 %

Table 2: Comparison of classification accuracy on the Leukemia dataset

Method | Classification accuracy
Bayesian variable | 97.1 %
PCA disjoint models | 82.3 %
Between-group analysis | 88.2 %
C4.5 | 91.18 %
Bagging C4.5 | 91.18 %
Adaboost C4.5 | 91.18 %
SFFS + PCA + SVM | 58.82 %
SFFS + ICA + SVM | 91.18 %
Combined feature selection + ensemble neural network | 100 %
Wavelet + GA | 100 %
Factor analysis + wavelet + RBFNN | 100 %

Table 3: Comparison of classification accuracy on the SRBCT dataset

Method | Classification accuracy
SLDA | 100 %
BWNN | 96.83 %
C4.5 | 91.18 %
Bagboost | 95.24 %
SVM | 93.65 %
TPCR | 100 %
Gradient LDA | 100 %
Factor analysis + wavelet + RBFNN | 97.59 %

Table 4: Comparison of classification accuracy on the MLL Leukemia dataset

Method | Classification accuracy
C4.5 | 73 %
Bagging C4.5 | 86.67 %
Adaboost C4.5 | 91.18 %
Wavelet + GA | 100 %
Li's rule-based method | 100 %
Combined feature selection + ensemble neural network | 100 %
Factor analysis + wavelet + RBFNN | 96.87 %
The performance of the proposed method is also compared with recently reported methods, and the results are listed in Tables 2-4. The existing methods also used cross-validation tests on these datasets. Tables 2-4 reveal that our method is equivalent to its counterparts while offering reduced computational load. Table 5 shows the dimensionality reduction achieved after factor analysis and the level-4 db7 discrete wavelet transform.
Table 5: Reduction details of the datasets

Dataset | Original dimension | After factor analysis | After DWT (Daubechies 7), level 4
Colon | 62 × 2000 | 62 × 700 | 62 × 184
SRBCT | 83 × 2308 | 83 × 700 | 83 × 184
Leukemia | 72 × 7129 | 72 × 700 | 72 × 184
MLL Leukemia | 72 × 12582 | 72 × 700 | 72 × 184
V. Conclusion
In this paper we have presented a hybrid feature extraction method that uses factor analysis in conjunction with the wavelet transform to effectively select discriminative genes from microarray data. A simple RBFNN-based classifier has also been introduced to classify the microarray samples efficiently. The comparison results show that the proposed approach performs as well as or better than existing methods, while offering reduced computational complexity.
References
- Xiong M., Jin L., Li W. and Boerwinkle E. Computational methods for gene expression-based tumor classification. BioTechniques, 2000, vol. 29, no. 6, pp. 1264–1268.
- Baldi P. and Long A.D. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics, 2001, vol. 17, no. 6, pp. 509–519.
- Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., Lander, E. S. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 1999, 286(5439), pp. 531-537.
- Huang D.S. and Zheng C. H. Independent component analysis-based penalized discriminant method for tumor classification using gene expression data. Bioinformatics, 2006, vol. 22, no. 15, pp. 1855–1862.
- Yeung K.Y., Ruzzo W. L. Principal component analysis for clustering gene expression data. Bioinformatics, 2002, 17, pp.763–774.
- Yihui Liu. Wavelet feature extraction for high-dimensional microarray data. Neurocomputing, 2009, Vol. 72, pp. 985-990.
- Yihui Liu. Detect Key Gene Information in Classification of Microarray Data. EURASIP Journal on Advances in Signal Processing, 2007 pp.1-10.
- Tan AC, Gilbert D. Ensemble machine learning on gene expression data for cancer classification. Applied Bioinformatics, 2003, 2, pp.75-83.
- Dettling M. Bag Boosting for tumor classification with gene expression data. Bioinformatics, 2004 vol. 20, no. 18, pp. 3583–3593.
- Guyon I., Weston J., Barnhill S. and Vapnik V. Gene selection for cancer classification using support vector machines. Mach. Learn., 2002, 46, pp. 389-422.
- Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., Meltzer, P. S. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 2001, 7(6), pp.673-679.
- O'Neill MC and Song L. Neural network analysis of lymphoma microarray data: prognosis and diagnosis near-perfect. BMC Bioinformatics, 2003, 4:13.
- Liu Bing, Cui Qinghua, Jiang Tianzi and Ma. Songde. A combinational feature selection and ensemble neural network method for classification of gene expression data. BMC Bioinformatics, 2004. 5:136, pp. 1-12.
- Grossmann A. and Morlet J. Decomposition of Hardy functions into square integrable wavelets of constant shape. SIAM Journal on Mathematical Analysis, 1984, vol. 15, no. 4, pp.723–736.
- Mallat S. G. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1989, vol. 11, no. 7, pp. 674–693.