In many applications of flat clustering and hierarchical clustering, particularly in analysis tasks and in user interfaces (see applications in Table 16.1 , page 16.1 ), human users interact with clusters. In such settings, we must label clusters, so that users can see what a cluster is about.
Differential cluster labeling selects cluster labels by comparing the distribution of terms in one cluster with that of other clusters. The feature selection methods we introduced in Section 13.5 (page ) can all be used for differential cluster labeling. In particular, mutual information (MI) (Section 13.5.1 , page 13.5.1 ) or, equivalently, information gain and the -test (Section 13.5.2 , page 13.5.2 ) will identify cluster labels that characterize one cluster in contrast to other clusters. A combination of a differential test with a penalty for rare terms often gives the best labeling results because rare terms are not necessarily representative of the cluster as a whole.
labeling method | |||||||
# docs | centroid | mutual information | title | ||||
4 | 622 |
|
|
|
|||
9 | 1017 |
|
|
|
|||
10 | 1259 |
|
|
|
Automatically computed cluster labels.This is for three of ten clusters (4, 9, and 10) in a -means clustering of the first 10,000 documents in Reuters-RCV1. The last three columns show cluster summaries computed by three labeling methods: most highly weighted terms in centroid (centroid), mutual information, and the title of the document closest to the centroid of the cluster (title). Terms selected by only one of the first two methods are in bold.
We apply three labeling methods to a -means clustering in Table 17.2 . In this example, there is almost no difference between MI and . We therefore omit the latter.
Cluster-internal labeling computes a label that solely depends on the cluster itself, not on other clusters. Labeling a cluster with the title of the document closest to the centroid is one cluster-internal method. Titles are easier to read than a list of terms. A full title can also contain important context that didn't make it into the top 10 terms selected by MI. On the web, anchor text can play a role similar to a title since the anchor text pointing to a page can serve as a concise summary of its contents.
In Table 17.2 , the title for cluster 9 suggests that many of its documents are about the Chechnya conflict, a fact the MI terms do not reveal. However, a single document is unlikely to be representative of all documents in a cluster. An example is cluster 4, whose selected title is misleading. The main topic of the cluster is oil. Articles about hurricane Dolly only ended up in this cluster because of its effect on oil prices.
We can also use a list of terms with high weights in the centroid of the cluster as a label. Such highly weighted terms (or, even better, phrases, especially noun phrases) are often more representative of the cluster than a few titles can be, even if they are not filtered for distinctiveness as in the differential methods. However, a list of phrases takes more time to digest for users than a well crafted title.
Cluster-internal methods are efficient, but they fail to distinguish terms that are frequent in the collection as a whole from those that are frequent only in the cluster. Terms like year or Tuesday may be among the most frequent in a cluster, but they are not helpful in understanding the contents of a cluster with a specific topic like oil.
In Table 17.2 , the centroid method selects a few more uninformative terms (000, court, cents, september) than MI (forces, desk), but most of the terms selected by either method are good descriptors. We get a good sense of the documents in a cluster from scanning the selected terms.
For hierarchical clustering, additional complications arise in cluster labeling. Not only do we need to distinguish an internal node in the tree from its siblings, but also from its parent and its children. Documents in child nodes are by definition also members of their parent node, so we cannot use a naive differential method to find labels that distinguish the parent from its children. However, more complex criteria, based on a combination of overall collection frequency and prevalence in a given cluster, can determine whether a term is a more informative label for a child node or a parent node (see Section 17.9 ).