

Hierarchical clustering

Flat clustering is efficient and conceptually simple, but as we saw in Chapter 16 it has a number of drawbacks. The algorithms introduced in Chapter 16 return a flat, unstructured set of clusters, require a prespecified number of clusters as input, and are nondeterministic. Hierarchical clustering (or hierarchic clustering) outputs a hierarchy, a structure that is more informative than the unstructured set of clusters returned by flat clustering. Hierarchical clustering does not require us to prespecify the number of clusters, and most hierarchical algorithms that have been used in IR are deterministic. These advantages of hierarchical clustering come at the cost of lower efficiency: the most common hierarchical clustering algorithms have a complexity that is at least quadratic in the number of documents, compared to the linear complexity of $K$-means and EM (cf. Section 16.4).
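To make the complexity contrast concrete, here is a minimal sketch (in Python with NumPy, not the algorithm presented in Section 17.2) of naive single-link agglomerative clustering; the function name naive_single_link_hac and the use of cosine similarity on dense document vectors are illustrative assumptions of ours. The point is simply that all N(N-1)/2 pairwise similarities must be computed before the first merge can be made, so the cost is at least quadratic in the number of documents.

import numpy as np

def naive_single_link_hac(X, num_merges):
    """X: (N, d) array of document vectors; returns a list of (i, j, sim) merges."""
    N = X.shape[0]
    # Cosine similarity matrix: O(N^2) time and space before any merging happens.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)          # never merge a cluster with itself
    active = list(range(N))               # each document starts as its own cluster
    merges = []
    for _ in range(num_merges):
        # Find the most similar pair among the currently active clusters.
        i, j = max(((a, b) for a in active for b in active if a < b),
                   key=lambda p: S[p])
        merges.append((i, j, S[i, j]))
        # Single-link update: similarity to the merged cluster is the maximum.
        S[i, :] = np.maximum(S[i, :], S[j, :])
        S[:, i] = S[i, :]
        S[i, i] = -np.inf
        active.remove(j)                  # cluster j is absorbed into cluster i
    return merges

# Example: cluster 5 random "documents" with 3 merges.
merges = naive_single_link_hac(np.random.rand(5, 4), num_merges=3)

By contrast, one iteration of $K$-means only compares each of the N documents against the K centroids, which is linear in N for fixed K.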

This chapter first introduces agglomerative hierarchical clustering (Section 17.1) and presents four different agglomerative algorithms in Sections 17.2-17.4, which differ in the similarity measures they employ: single-link, complete-link, group-average, and centroid similarity. We then discuss the optimality conditions of hierarchical clustering in Section 17.5. Section 17.6 introduces top-down (or divisive) hierarchical clustering. Section 17.7 looks at labeling clusters automatically, a problem that must be solved whenever humans interact with the output of clustering. We discuss implementation issues in Section 17.8. Section 17.9 provides pointers to further reading, including references to soft hierarchical clustering, which we do not cover in this book.

There are few differences between the applications of flat and hierarchical clustering in information retrieval. In particular, hierarchical clustering is appropriate for any of the applications shown in Table 16.1 (see also Section 16.6). In fact, the example we gave for collection clustering is hierarchical. In general, we select flat clustering when efficiency is important and hierarchical clustering when one of the potential problems of flat clustering (not enough structure, predetermined number of clusters, nondeterminism) is a concern. In addition, many researchers believe that hierarchical clustering produces better clusters than flat clustering. However, there is no consensus on this issue (see references in Section 17.9).


