Perhaps the
best-known way of computing good class boundaries is Rocchio
classification , which uses centroids to
define the boundaries. The centroid of
a class is computed as the vector average or center of
mass of its members:
The boundary between two classes in Rocchio
classification is the set of points with equal distance
from the two centroids. For example, ,
, and in the figure. This set
of points is always a line.
The generalization of a line in
-dimensional space is
a hyperplane, which we define as the set of points
that satisfy:
Thus, the boundaries of class regions in Rocchio classification are hyperplanes. The classification rule in Rocchio is to classify a point in accordance with the region it falls into. Equivalently, we determine the centroid that the point is closest to and then assign it to . As an example, consider the star in Figure 14.3 . It is located in the China region of the space and Rocchio therefore assigns it to China. We show the Rocchio algorithm in pseudocode in Figure 14.4 .
|
Worked example. Table 14.1 shows the tf-idf vector representations of the five documents in Table 13.1 (page 13.1 ), using the formula if (Equation 29, page 6.4.1 ). The two class centroids are and . The distances of the test document from the centroids are and . Thus, Rocchio assigns to .
The separating hyperplane in this case has the following parameters:
End worked example.
The assignment criterion in
Figure 14.4 is Euclidean distance (APPLYROCCHIO, line 1). An
alternative is cosine similarity:
(141) |
Rocchio classification is a form of Rocchio relevance feedback (Section 9.1.1 , page 9.1.1 ). The average of the relevant documents, corresponding to the most important component of the Rocchio vector in relevance feedback (Equation 49, page 49 ), is the centroid of the ``class'' of relevant documents. We omit the query component of the Rocchio formula in Rocchio classification since there is no query in text classification. Rocchio classification can be applied to classes whereas Rocchio relevance feedback is designed to distinguish only two classes, relevant and nonrelevant.
In addition to respecting contiguity, the classes in Rocchio classification must be approximate spheres with similar radii. In Figure 14.3 , the solid square just below the boundary between UK and Kenya is a better fit for the class UK since UK is more scattered than Kenya. But Rocchio assigns it to Kenya because it ignores details of the distribution of points in a class and only uses distance from the centroid for classification.
The assumption of sphericity also does not hold in Figure 14.5 . We cannot represent the ``a'' class well with a single prototype because it has two clusters. Rocchio often misclassifies this type of multimodal class . A text classification example for multimodality is a country like Burma, which changed its name to Myanmar in 1989. The two clusters before and after the name change need not be close to each other in space. We also encountered the problem of multimodality in relevance feedback (Section 9.1.2 , page 9.1.3 ).
Two-class classification is another case where
classes are rarely distributed like
spheres with similar radii. Most two-class classifiers
distinguish between a class like China that occupies
a small region of the space and its
widely scattered complement. Assuming equal radii will
result in a large number of false positives. Most
two-class classification problems therefore require a modified
decision rule of the form:
(142) |
mode | time complexity |
training | |
testing |
Table 14.2 gives the time complexity of Rocchio classification. Adding all documents to their respective (unnormalized) centroid is (as opposed to ) since we need only consider non-zero entries. Dividing each vector sum by the size of its class to compute the centroid is . Overall, training time is linear in the size of the collection (cf. Exercise 13.2.1 ). Thus, Rocchio classification and Naive Bayes have the same linear training time complexity.
In the next section, we will introduce another vector space classification method, kNN, that deals better with classes that have non-spherical, disconnected or other irregular shapes.
Exercises.