Extension of the fuzzy c means clustering algorithm to fit with the composite graph model for web document representation
Автор: Kaushik K. Phukon, Hemanta K. Baruah
Журнал: International Journal of Cognitive Research in Science, Engineering and Education @ijcrsee
Рубрика: Studies and articles
Статья в выпуске: 2 vol.1, 2013 года.
Бесплатный доступ
Clustering techniques are mostly unsupervised methods that can be used to organize data into groups based on similarities among the individual data items. Fuzzy c-means (FCM) clustering is one of well known unsupervised clustering techniques, which can also be used for unsupervised web document clustering. In this chapter we will introduce a modified method of clustering where the data to be clustered will be represented by graphs instead of vectors or other models. Specifically, we will extend the classical FCM clustering algorithm to work with graphs that represent web documents [1,2,3]. We wish to use graphs because they can allow us to retain information which is often discarded in simpler models.
Graph, web document, hard partition, fuzzy partition, fuzzy c means
Короткий адрес: https://sciup.org/170198387
IDR: 170198387 | УДК: 004.738.12:519.178
Текст научной статьи Extension of the fuzzy c means clustering algorithm to fit with the composite graph model for web document representation
Fuzzy clustering is well-known not only in fuzzy community, but also in the related fields of data analysis, neural networks, and other areas in computational intelligence. The FCM algorithm, proposed by Dunn, J. C. (1974) and extended by Bezdek, J. C. (1981), Cannon, R. L., Dave, J. V., Bezdek, J. C. (1986) can be applied if the objects of interest are represented as points in a multi-dimensional space. FCM relates the concept of object similarity to spatial closeness and finds cluster centers as prototypes. Several examples of application of FCM to real clustering problems have proved the good characteristics of this algorithm with respect to stability and partition quality.
In general, cluster analysis refers to a broad spectrum of methods which try to subdivide a data set X into c subsets (clusters) which are pair wise disjoint, all nonempty, and reproduce X via union. The clusters then are termed a hard (i.e., non-fuzzy) c-partition of X. A significant fact about this type of algorithm is the defect in the underlying axiomatic model that each point in X is unequivocally grouped with other members of its cluster, and thus bears no apparent similarity to other members of X. One such manner to characterize an individual point's similarity to all the clusters was introduced in 1965 by Zadeh. The key to Zadeh's idea ( Zadeh, L. A. (1965)) is to represent the similarity a point shares with each cluster with a function (termed the membership function) whose values (called memberships) are between zero and one. Baruah (2011) has defined the membership function of a normal fuzzy number N= [ α , β , γ ] as
(Eq: 1.1)
Where Φ ( x ) and (1 -Φ ( x )) are two independent distribution functions defined in [ α , β ] and [ β , γ ] respectively.
Clustering techniques are generally applied to data that are quantitative (numerical), qualitative (categorical), or a mixture of both. But in this chapter we are going to put forward a means for clustering graphical objects with the help of FCM algorithm. Let us start with quantitative data where each observation may consists of n measured variables, grouped into an n -dimensional column vector Z k = [ z 1 k , . . . , z nk ] T , Z k ∈ □ n . A set of N observations is denoted by Z = {z k | k = 1 , 2 , . . .,N} , and is represented as an n × N matrix:
fuzzy clustering is more natural than hard clustering. Objects on the boundaries between several classes are not forced to fully belong to one of the classes, but rather are assigned membership degrees between 0 and 1 indicating their partial membership.
2.1. Hard Partition
The objective of clustering is to partition the data set Z into c clusters (groups, classes).Using classical sets, a hard partition of Z can be defined as a family of sub-sets {A i |1 ≤ i ≤ c} ⊂ P ( Z ), ( P( Z ) is the power set of Z ) with the following properties (Bezdek, 1981):
Z L L Z L 2
Z 2 L Z 2 2
Z L JF
Z 2 У
 
    
    Z „I Z „ 2
z
In the pattern-recognition terminology, the columns of this matrix are called patterns or objects, the rows are called the features or attributes, and Z is called the pattern or data matrix. The meaning of the columns and rows of Z depends on the context.
2. HARD AND FUZZY PARTITIONS
Hard clustering methods are based on classical set theory, and require that an object either does or does not belong to a cluster. Hard clustering means partitioning the data into a specified number of mutually exclusive subsets. Fuzzy clustering methods, however, allow the objects to belong to several clusters simultaneously, with different degrees of membership. In many situations,
      Ai
       A 
      Aj = 0,
       1 < 
      i ^ j 
      0 cAt сЪ,
       1 
Equation (2.1.1) means that the union subsets A i contain all the data. The subsets must be disjoint, as stated by (2.1.2), and none of them is empty nor contains all thedata in Z (2.1.3). In terms of member-ship(characteristic) functions, a partition can be conveniently represented by the partition matrix U =[ µ ] c×N . The i th row of this matrix contains values of the membership function µ of the ith subset A i of Z . It follows from the above equations that the elements of U must satisfy the following conditions:
=1, 7 < А: <7^ 1—1
      ° <2^ 
      
        
The space of all possible hard partition matrices for Z , called the hard partitioning space (Bezdek, 1981), is thus defined by:
M„ C= {U е □ c х N | ^ я € {0,1}, V i , к ; 0 < ]Г ^ л < N , V i } к = 1
( Eq: 2.3)
Example 1.1 Hard partition: Let us illustrate the concept of hard partition by a simpleexample. Consider a data set Z = { z 1 , z 2 , . . . , z 10 } , consisting of 10 web pages each represented by graphs. Suppose we obtained the figure below after calculating the dis-tance[2,3] between each and every pair of graphs by using the formula:
A visual inspection of this data may suggest two well-separated clusters (data points z 1 to z 4 and z 7 to z 10 respectively), one point in between the two clusters ( z 5 ), and an“outlier” z 6 . One particular partition U ∈ M hc of the data into two subsets (out of the 210 possible hard partitions) is
U= Г 1 1 1 1 1 1 о о о 0" 0 0 0 0 0 0 1 1 1 1
The first row of U defines point-wise the characteristic function for the first subset of Z , A 1 , and the second row defines the characteristic function of the second subset of Z , A 2 . Each sample must be assigned exclusively to one subset (cluster) of the partition. In this case, both the boundary point z 5 and the outlier z 6 have been assigned to A 1 .It is clear that a hard partitioning may not give a realistic picture of the underlying data. Boundary data points may represent patterns with a mixture of properties of data in A 1 and A 2 , and therefore cannot be fully assigned to either of these classes, or do they constitute a separate class. This shortcoming can be alleviated by using fuzzy partitions as shown in the following sections.
dist MCS ( z i , z j ) = 1 -
Е d ± ( MCS ( z„ zj ))
SOM max(E d ± ( Zi), (^ d ± ( zj )))
2.2. Fuzzy Partition
wherei,j=1,2…10
(Eq: 2.4)
 
    
    Generalization of the hard partition to the fuzzy case follows directly by allowing ^k to attain real values in [0 , 1]. Conditions for a fuzzy partition matrix, analogous to (2.2) are given by (Ruspini, 1970):
      ^t.-
       ^" [0, 1 ] j 1 < г < c, 
      \ 
X At =U <^r [—1
0 < ^ //й = 1 < jV , 1 < г < c i—i
The i th row of the fuzzy partition matrix U contains values of the i th membership function of the fuzzy subset A i of Z . Equation (2.5.2) constrains the sum of each column to 1, and thus the total membership of each z k in Z equals one. The fuzzy partitioning space for Z is the set
Mfc = { U e □ c x N^ e [0,1], V i , k ;0 < £ ц ш < N , V i } k = 1
(Eq. 2.6)
Example 1.2: Fuzzy partition: Let us consider the data set from Example 1.1. One of the infinitely many fuzzy partitions in Z is:
U = 1.0 1.0 1.0 0.8 0.5 0.5
0.0 0.0 0.0 0.2 0.5 0.5
0.2
0.8
0.0 0.0 0.0
1.0 1.0 1.0
The boundary point z 5 has now a membership degree of 0.5 in both classes, which correctly reflects its position in the middle between the two clusters. Note, however, that the outlier z 6 has the same pair of membership degrees, even though it is further from the two clusters, and thus can be considered less typical of both A 1 and A 2 than z 5 . This is because condition (2.5.2) requires that the sum of memberships of each point equals one. It can be, of course, argued that three clusters are more appropriate in this example than two. In general, however, it is difficult to detect outliers and assign them to extra clusters.
3. FUZZY C-MEANS CLUSTERING 3.1 The Fuzzy c-Means Functional
A large family of fuzzy clustering algorithms is based on minimization of the fuzzy c-means functional formulated as (Dunn, 1974; Bezdek, 1981):
cN
J ( Z ; U , V ) = ZZ ( Д .) " h- v , C i = 1 k = 1
where
U= [ Mik ]e Mc is a fuzzy partition matrix of Z,
V=[V1, V2, ^,Vc ], Vi €□ n is a vector of cluster prototypes(centers), which have to be determined,
D 2 ikA = || ^ k — v i'L = ( z k — vi) T A ( z k — vi)
is a squared inner product distance norm where A is a norm-inducing matrix, and m e [1, да)
is a parameter which determines the fuzziness of the resulting clusters. The value of the cost function (8.1) can be seen as a measure of the total variance of z from v .
ki
Most analytical fuzzy clustering algorithms (and also all the algorithms presented in this chapter) are based on optimization of the basic c -means objective function, or-some modification of it. Hence we start our discussion with presenting the FCM functional.
3.2. The Fuzzy c-Means Algorithm
The minimization of the c-means functional (3.1.1) represents a nonlinear optimization problem that can be solved by using a variety of methods, including iterative min- imization, simulated annealing or genetic algorithms. The most popular method is a simple Picard iteration through the first-order conditions for stationary points of (3.1.1), known as the FCM algorithm.
The stationary points of the objective function (3.1.1) can be found by adjoining the constraint (2.5.2) to J by means of Lagrange multipliers:
clusters 1 < c < N , the weighting exponent m > 1, the termination tolerance € > 0 and the norm-inducing matrix A . Initialize the partition matrix randomly, such that U (0) ∈ Mfc .
Repeat for l = 1 , 2 , . . .
Step 1: Compute the cluster prototypes (means):
J ( Z ; U , V , X ) = £ £ m ) " d 2A + £ X k [ £ M k - 1 i = 1 k = 1 k = 1 L i = 1
(Eq: 3.2)
and by setting the gradients of J with respect to U,V and X to zero. It can be shown that D 2a > 0, ^ 1, k and m>1, then (U,V) €
Mf_ x □ n x c may minimize if and only if
M ik = —-----------------, 1 < i < c , 1 < k < N ,
E ( D ikA / D jkA ) 2 /( m - 1) j = 1
N
E (M l - 1) ) mz k
K =1 ______________
N
E (Mik-1))m k=1
;1 < i < c .
Step 2: Compute the distances:
D 2 kA = ( z k - v ( ' ) ) T A ( z k - v ( ' ) ) ,1 < i < c ,1 < k < N .
Step 3: Update the partition matrix: for 1 ≤ k ≤ N ifD >0 for all i = 1, 2, . . . , c
E Mk ) ■«.
and V = K =1-------- ;1 < i < c .
E (Mk)m k=1
This solution also satisfies the remaining constraints (2.5.1) and (2.5.3). Equations (3.3)are first-order necessary conditions for stationary points of the functional (3.1.1). The FCM (Algorithm 1.1) iterates through (3.3.1) and (3.3.2). Sufficiency of (3.3) and the convergence of the FCM algorithm is proven in (Bezdek, 1980). It is to be noted that (3.3.2) gives v i as the weighted mean of the data items that belong to a cluster, where the weights are the membership degrees. That is why the algorithm is called “ c -means”.
Algorithm1.1 Fuzzy c -means (FCM). Given the data set Z , choose the number of
M ik c ,
У (DikA / D. t|)21 m-1) ikA jkA j=1
otherwise,
M ik ) = 0 if D ikA = 0 and M ik ) €[ 0,1 with
c
E M ) = 1.
i = 1
until | U(1 ) - U(1 - 1)||< £
3.3. Parameters of the FCM Algorithm
Before using the FCM algorithm, the following parameters must be specified: the number of clusters, c , the ‘fuzziness’ exponent, m , the termination tolerance, £ , and the norm-inducing matrix, A . Moreover, the fuzzy partition matrix, U , must be initialized.
3.3.1. Number of Clusters
The number of clusters c is the most important parameter, in the sense that the remaining parameters have less influence on the resulting partition. When clustering real data without any a priori information about the structures in the data, one usually has to make assumptions about the number of underlying clusters. The chosen clustering algorithm then searches for c clusters, regardless of whether they are really present in the data or not.
3.3.2. Fuzziness Parameter
The weighting exponent m is a rather important parameter as well, because it significantly influences the fuzziness of the resulting partition. As m approaches one from above, the partition becomes hard ( µ ∈ { 0 , 1 } ) and v i are ordinary means of the clusters. As m → ∞ , the partition becomes completely fuzzy ( µ = 1 /c ) and the cluster means are all equal to the mean of Z . These limit properties of (8) are independent of the optimization method used (Pal and Bezdek, 1995). Usually, m = 2 is initially chosen.
3.3.3. Termination Criterion
The FCM algorithm stops iterating when the norm of the difference between U in two successive iterations is smaller than the termination parameter ε . For the maximum norm max ( µl - µ ( l - 1) ) , the usual choice is ε =0 . 001, even though ε = 0 . 01 works well in most cases, while drastically reducing the computing times.
3.3.4. Norm-Inducing Matrix
The shape of the clusters is determined by the choice of the matrix A in the distance measure (3.1.4). A common choice is A = I , which gives the standard Euclidean norm:
D i 2 k = ( z k - v i ) T ( z k - v i )
3.3.5 Initial Partition Matrix
The partition matrix is usually initialized at random, such that U ∈ M fc . A simple approach to obtain such U is to initialize the cluster centers v at random and compute the corresponding U by (10.1) (i.e., by using the third step of the FCM algorithm).
4. THE MODIFIED FUZZY C MEANS ALGORITHM TO FIT WITH GRAPHS
The main challenge with adapting fuzzy c-means for graphs lies in creating a method of computing the cluster representatives.
Let us consider a graphical dataset
Z=(z k |k=1,2,_N)
Under fuzzy c-means the cluster centers are computed with a weighted averaging that takes into account the membership values of each data item. Thus the graph median cannot be directly used. We propose the following method of determining cluster centers for graph-based data. For each cluster j, use deterministic sampling to compute the number of copies of each graph i to use, e ( i ) , which is defined as:
e j ( i ) =
n
aij
Z v , aJ
Here n is the total number of items in the data set. We then create a set of graphs consisting of e ( i ) copies of graph i and compute the median graph of this set to be the representative of cluster j . So the new algorithm becomes:
Repeat for l = 1 , 2 , . . .
Step 1: Compute the cluster prototypes (representative median of a set of graphs):
( 1
gi ) = arg min О 2 dist(s, Gy ) V se Л e rл к Is I y=1
where S is the set of graphs and g e S (S = {G 1 ,G 2 ,..., G n }) such that g has the lowest average distance to all elements in S[3]
Step 2: Compute the distances:
D 2 ikA = ( zk - g ( l ) ) T A ( z k - g ( l ) ) ,1 ^ i ^ c ,1 ^ k ^ N .
where ( zk - g l ) is representing the distance between the graph z k and the cluster representative gl , i.e. dist ( z , gl ) (refer eq. 2.4).
Step 3: Update the partition matrix:
for 1 ≤ k ≤ N ifD >0 for all i = 1, 2, . . . , c
^ k
c
2 ( D ikA / D jkA ) 2/( m - 1)
J = 1
otherwise,
^ k ) = 0 if D ikA = 0 and д ») e [ 0,1 ] with ]E ^ k = 1. until | U(1 ) - U( l - 1)|| < г
4. CONCLUSION
In this article, we suggested a clustering method for graph based data with special reference to graphs representing web documents. The basic idea is the calculation of cluster center in case of graphical objects. We have modified the step 1 and 2 of the original FCM algorithm which will arm it to handle graph based data. We have made these changes without changing the fundamental concepts of the FCM algorithm. This method will enhance the efficiency and effectiveness of the FCM algorithm, as the graphical objects will boost the clustering method with abundant information [6, 7, 8].
 
	 
		