SECTION 20.

CLUSTER ANALYSIS

Cluster analysis attempts to organiza multivariate data into clusters that have a certain amount of internal similarity. It is predominantly a descriptive technique rather than a statistical one. Some recent references that the user might wish to consult are http://149.170.199.144/multivar/hc.htm, http://www.statsoftinc.com/textbook/stcluan.html and http://www.obelia.jde.aca.mmu.ac.uk/multivar/ca.htm.

SECTION 20.1 MINIMIZING THE SUM OF SQUARED DISTANCES FROM CENTROIDS (K-MEANS CLUSTERING)

A useful reference to this method is http://www.ci.hut.fi/~sami/thesis/node9.html.

Consider the problem of finding k clusters for n points x in np-dimensional space. The algorithm starts by reading in (or determining) k hypothetical centroids. Each point is then "allocated" to the centroid it is closest to, forming k clusters. Then calculate a new centroid for each cluster and repeat the process until the function

E = Σ_i(x_i - m_{c(x_i)⁾^²}

no longer changes (within some tolerance), where m_{c(x_i)} denotes the centroid closest to x_i.

The call to the routine is

       CALL CLUSTER(E,X,D,CT,N,NP,K,ICT,IDM,EPS,ITERL,IER)

where all variables the names of which do not begin with i, j, k, l, m, n are REAL*8 and where

E = the value of the function above (output)
X = n by np array of points (input)
D = n by np array of distances of points x from centroids ct (output)
CT = k by np array of centroids (input, output)
N = number of points x (first dimension of X)(input)
NP = dimensionality of points x (second dimension of X)(input)
K = number of desired clusters (input)
ICT = 0 read in starting clusters, = 1 determine random starting clusters (input)
IDM = distance metric; = 1 squared Euclidean distance, = 2 City Block distance Σ_i|x_i-y_i| (input)
EPS = tolerance for stopping algorithm (e.g., 0.0001) (input)
ITERL = iteration limit (input)
IER = error return

TERMINATION FLAGS

IER = 0 normal return
= -10 number of clusters is less than 2

SECTION 20.2 HIERARCHICAL CLUSTERING

Hierarchical clustering associates in a cluster two items that are closer together, by some distance measure, than any other two items. Either item may be a cluster already formed, in which case its "position" is taken to be its centroid. Items are clustered successively, until the total number of clusters equals the desired number.

The call to the routine is

       CALL CLUSTHIER(X,CT,ICLX,JCL,N,NP,NTARG,IDM,IER)

where all variables the names of which do not begin with i, j, k, l, m, n are REAL*8 and where

X = n by np array of points (input)
CT = n by np array of (potential) clusters (output)
ICLX = n by n array used in building clusters and containing the identifiers of the X-variables that have been allocated to a cluster (output)
JCL = vector of length n containing the numbers of items allocated to each cluster (output)
N = number of points x (input)
NP = dimensionality of points x (input)
NTARG = desired number of clusters (input)
IDM = distance metric; = 1 Euclidean squared, =2 City Block, Σ_i|x_i-y_i|, = 3 Chebyshev, max_i|x_i - y_i|, (input)

TERMINATION FLAGS

IER = 0 normal return
= -12 no positive distances
= -19 number of clusters <= zero

Return to the Beginning