Ward's method

Article Id: WHEBN0031144665

Title: Ward's method  
Author: World Heritage Encyclopedia
Language: English
Subject: Nearest-neighbor chain algorithm, List of algorithms, List of statistics articles
Collection: Data Clustering Algorithms
Publisher: World Heritage Encyclopedia

In statistics, Ward's method is a criterion applied in hierarchical cluster analysis. Ward's minimum variance method is a special case of the objective function approach originally presented by Joe H. Ward, Jr.[1] Ward suggested a general agglomerative hierarchical clustering procedure, where the criterion for choosing the pair of clusters to merge at each step is based on the optimal value of an objective function. This objective function could be "any function that reflects the investigator's purpose." Many of the standard clustering procedures are contained in this very general class. To illustrate the procedure, Ward used the example where the objective function is the error sum of squares, and this example is known as Ward's method or more precisely Ward's minimum variance method.

Contents

  • The minimum variance criterion
  • Lance–Williams algorithms
  • References
  • Further reading

The minimum variance criterion

Ward's minimum variance criterion minimizes the total within-cluster variance. At each step, the pair of clusters whose merger leads to the minimum increase in total within-cluster variance is merged. This increase is a weighted squared distance between cluster centers, and it plays the role of the between-cluster distance. At the initial step, all clusters are singletons (clusters containing a single point). To apply a recursive algorithm under this objective function, the initial distance between individual objects must be (proportional to) the squared Euclidean distance.
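The claim that the increase is a weighted squared distance between cluster centers can be checked numerically. The sketch below is illustrative only: it assumes the within-cluster variance is measured by the error sum of squares (ESS) and that the weight is n_a n_b / (n_a + n_b), the usual form of this identity. The helper names and the toy clusters are hypothetical and do not come from the article.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ess(points):
    """Error sum of squares: squared distances of the points to their centroid."""
    c = centroid(points)
    return sum(sq_dist(p, c) for p in points)


def merge_cost(a, b):
    """Weighted squared distance between the centroids of clusters a and b."""
    weight = len(a) * len(b) / (len(a) + len(b))
    return weight * sq_dist(centroid(a), centroid(b))


# Hypothetical toy clusters, chosen only for illustration.
cluster_a = [(0.0, 0.0), (2.0, 0.0)]
cluster_b = [(5.0, 4.0), (7.0, 4.0), (6.0, 7.0)]

# Increase in total ESS caused by merging the two clusters.
increase = ess(cluster_a + cluster_b) - ess(cluster_a) - ess(cluster_b)

# The increase matches the weighted squared centroid distance.
assert abs(increase - merge_cost(cluster_a, cluster_b)) < 1e-9
```

For two singletons the weight is 1/2, so the merge cost is half the squared Euclidean distance between the two points, which is why the initial distances only need to be proportional to squared Euclidean distance.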

The initial cluster distances in Ward's minimum variance method are therefore defined to be the squared Euclidean distance between points:

d_{ij} = d(\{X_i\}, \{X_j\}) = \|X_i - X_j\|^2.

Note: In software that implements Ward's method, it is important to check whether the function arguments should specify Euclidean distances or squared Euclidean distances. In the R function hclust, method = "ward.D" must be given squared Euclidean distances to implement Ward's criterion, whereas method = "ward.D2" takes ordinary Euclidean distances and squares them internally, which is the simpler choice. For the other methods provided by hclust (single, complete, etc.), the regular (unsquared) Euclidean distances are used.

Lance–Williams algorithms

Ward's minimum variance method can be defined and implemented recursively by a Lance–Williams algorithm.[2] The Lance–Williams algorithms are an infinite family of agglomerative hierarchical clustering algorithms which are represented by a recursive formula for updating cluster distances at each step (each time a pair of clusters is merged). At each step, it is necessary to optimize the objective function (find the optimal pair of clusters to merge). The recursive formula simplifies finding the optimal pair.

Suppose that clusters C_i and C_j are the next to be merged. At this point all of the current pairwise cluster distances are known. The recursive formula gives the updated cluster distances following the pending merge of clusters C_i and C_j. Let

  • d_{ij}, d_{ik}, and d_{jk} be the pairwise distances between C_i and C_j, between C_i and C_k, and between C_j and C_k, respectively,
  • d_{(ij)k} be the distance between the new cluster C_i \cup C_j and C_k.

An algorithm belongs to the Lance–Williams family if the updated cluster distance d_{(ij)k} can be computed recursively by

d_{(ij)k} = \alpha_i d_{ik} + \alpha_j d_{jk} + \beta d_{ij} + \gamma |d_{ik} - d_{jk}|,

where \alpha_i, \alpha_j, \beta, and \gamma are parameters, which may depend on cluster sizes, that together with the cluster distance function d_{ij} determine the clustering algorithm. Several standard clustering algorithms such as single linkage, complete linkage, and group average method have a recursive formula of the above type. A table of parameters for standard methods is given by several authors.[2][3][4]
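As a small illustration of how the parameters select a method, the sketch below applies the recursive formula with the commonly tabulated coefficients for single linkage, complete linkage, and group average. The function names, the method labels, and the numeric inputs are assumptions made for this example; the coefficient values are the standard textbook ones rather than something stated in this article.

```python
def lance_williams(d_ik, d_jk, d_ij, alpha_i, alpha_j, beta, gamma):
    """Distance between the merged cluster (C_i union C_j) and another cluster C_k."""
    return alpha_i * d_ik + alpha_j * d_jk + beta * d_ij + gamma * abs(d_ik - d_jk)


def coefficients(method, n_i, n_j):
    """Return (alpha_i, alpha_j, beta, gamma) for a few standard linkages."""
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "average":
        # Group average weights each old distance by its cluster's size.
        total = n_i + n_j
        return n_i / total, n_j / total, 0.0, 0.0
    raise ValueError(f"unknown method: {method}")


# Hypothetical distances and cluster sizes, for illustration only.
d_ik, d_jk, d_ij = 4.0, 10.0, 3.0
n_i, n_j = 1, 3

single = lance_williams(d_ik, d_jk, d_ij, *coefficients("single", n_i, n_j))
complete = lance_williams(d_ik, d_jk, d_ij, *coefficients("complete", n_i, n_j))
average = lance_williams(d_ik, d_jk, d_ij, *coefficients("average", n_i, n_j))

# Single linkage keeps the smaller distance, complete linkage the larger one.
assert single == min(d_ik, d_jk)
assert complete == max(d_ik, d_jk)
```

With these inputs the group-average value falls between the other two, because it is a size-weighted mean of d_{ik} and d_{jk}.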

Ward's minimum variance method can be implemented by the Lance–Williams formula. For disjoint clusters C_i, C_j, and C_k with sizes n_i, n_j, and n_k respectively:

d(C_i \cup C_j, C_k) = \frac{n_i+n_k}{n_i+n_j+n_k}\;d(C_i,C_k) + \frac{n_j+n_k}{n_i+n_j+n_k}\;d(C_j,C_k) - \frac{n_k}{n_i+n_j+n_k}\;d(C_i,C_j).

Hence Ward's method can be implemented as a Lance–Williams algorithm with

\alpha_i = \frac{n_i+n_k}{n_i+n_j+n_k}, \qquad \alpha_j = \frac{n_j+n_k}{n_i+n_j+n_k}, \qquad \beta = \frac{-n_k}{n_i+n_j+n_k}, \qquad \gamma = 0.
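The recursion can be sketched as a complete, if naive, agglomerative procedure. The code below is a minimal illustration rather than a reference implementation: it assumes squared Euclidean initial distances as defined earlier, searches all pairs at every step (roughly cubic time overall), and breaks ties by taking the first minimal pair. The function name ward_linkage, the returned merge-history format, and the sample data are hypothetical.

```python
from itertools import combinations


def ward_linkage(points):
    """Agglomerative clustering with the Ward update; returns the merge history.

    Each entry is (members_a, members_b, distance): two tuples of point
    indices and the Lance-Williams distance at which they were merged.
    """
    # Active clusters: cluster id -> tuple of member point indices.
    members = {i: (i,) for i in range(len(points))}
    # Initial distances: squared Euclidean distance between singletons.
    dist = {}
    for i, j in combinations(range(len(points)), 2):
        dist[frozenset((i, j))] = sum(
            (a - b) ** 2 for a, b in zip(points[i], points[j])
        )

    history = []
    next_id = len(points)
    while len(members) > 1:
        # Pair of active clusters with the smallest current distance.
        i, j = min(combinations(sorted(members), 2),
                   key=lambda pair: dist[frozenset(pair)])
        d_ij = dist[frozenset((i, j))]
        n_i, n_j = len(members[i]), len(members[j])
        # Ward update of the distance to every other active cluster.
        for k in members:
            if k in (i, j):
                continue
            n_k = len(members[k])
            dist[frozenset((next_id, k))] = (
                (n_i + n_k) * dist[frozenset((i, k))]
                + (n_j + n_k) * dist[frozenset((j, k))]
                - n_k * d_ij
            ) / (n_i + n_j + n_k)
        history.append((members[i], members[j], d_ij))
        members[next_id] = tuple(sorted(members[i] + members[j]))
        del members[i], members[j]
        next_id += 1
    return history


# Hypothetical one-dimensional data, for illustration only.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
merges = ward_linkage(points)
for a, b, d in merges:
    print(a, b, d)
```

With squared Euclidean initial distances, each merge distance produced this way is twice the corresponding increase in the error sum of squares, so half the sum of the merge distances equals the error sum of squares of the full data set.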

References

  1. ^ Ward, J. H., Jr. (1963), "Hierarchical Grouping to Optimize an Objective Function", Journal of the American Statistical Association, 58, 236–244.
  2. ^ Cormack, R. M. (1971), "A Review of Classification", Journal of the Royal Statistical Society, Series A, 134(3), 321–367.
  3. ^ Gordon, A. D. (1999), Classification, 2nd Edition, Chapman and Hall, Boca Raton.
  4. ^ Milligan, G. W. (1979), "Ultrametric Hierarchical Clustering Algorithms", Psychometrika, 44(3), 343–346.

Further reading

  • Everitt, B. S., Landau, S. and Leese, M. (2001), Cluster Analysis, 4th Edition, Oxford University Press, Inc., New York; Arnold, London. ISBN 0340761199
  • Hartigan, J. A. (1975), Clustering Algorithms, New York: Wiley.
  • Jain, A. K. and Dubes, R. C. (1988), Algorithms for Clustering Data, New Jersey: Prentice–Hall.
  • Kaufman, L. and Rousseeuw, P. J. (1990), Finding Groups in Data: An Introduction to Cluster Analysis, New York: Wiley.