

Comparison of feature selection methods

Mutual information and $\chi ^2$ represent rather different feature selection methods. The independence of term $t$ and class $c$ can sometimes be rejected with high confidence even if $t$ carries little information about membership of a document in $c$. This is particularly true for rare terms. If a term occurs once in a large collection and that one occurrence is in the poultry class, then this is statistically significant. But a single occurrence is not very informative according to the information-theoretic definition of information. Because its criterion is significance, $\chi ^2$ selects more rare terms (which are often less reliable indicators) than mutual information. But the selection criterion of mutual information also does not necessarily select the terms that maximize classification accuracy.
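The contrast is easy to check numerically. Below is a minimal sketch that computes both measures from the four cells of a term-class contingency table ($N_{11}$ = documents that contain the term and are in $c$, $N_{10}$ = contain the term but are not in $c$, and so on, following the definitions of the two measures given earlier in this chapter); the collection sizes are invented for illustration.

    import math

    def mutual_information(n11, n10, n01, n00):
        """Estimate I(U;C) in bits from a 2x2 term/class contingency table."""
        n = n11 + n10 + n01 + n00
        def term(n_ec, n_e, n_c):
            # Contribution of one cell; an empty cell contributes nothing.
            if n_ec == 0:
                return 0.0
            return (n_ec / n) * math.log2(n * n_ec / (n_e * n_c))
        return (term(n11, n11 + n10, n11 + n01) +
                term(n01, n01 + n00, n11 + n01) +
                term(n10, n11 + n10, n10 + n00) +
                term(n00, n01 + n00, n10 + n00))

    def chi_square(n11, n10, n01, n00):
        """X^2 statistic for independence of term occurrence and class."""
        n = n11 + n10 + n01 + n00
        num = n * (n11 * n00 - n10 * n01) ** 2
        den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
        return num / den

    # A term occurring once in 100,000 documents, with that single
    # occurrence inside a 100-document class:
    print(mutual_information(1, 0, 99, 99900))  # ~0.0001 bits: barely informative
    print(chi_square(1, 0, 99, 99900))          # ~999: far above the 10.83
                                                # critical value for p = 0.001

The single occurrence drives the $\chi ^2$ statistic to a level that rejects independence overwhelmingly, while the mutual information estimate remains near zero, exactly the rare-term behavior described above.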

Despite the differences between the two methods, the classification accuracy of feature sets selected with $\chi ^2$ and MI does not seem to differ systematically. In most text classification problems, there are a few strong indicators and many weak indicators. As long as all strong indicators and a large number of weak indicators are selected, accuracy is expected to be good. Both methods do this.
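Concretely, both criteria plug into the same selection scheme: score every term with a utility measure $A(t,c)$ and keep the $k$ highest-scoring terms, as in the basic feature selection algorithm given earlier in this chapter. Here is a minimal sketch under assumed inputs (documents as token lists, a parallel list of labels); degenerate contingency tables with zero marginals are ignored for brevity.

    def select_features(docs, labels, c, k, utility):
        """Rank terms by the utility measure A(t, c) and keep the top k.

        docs: list of token lists; labels: parallel list of class labels;
        utility: any function of the four contingency counts, e.g.
        mutual_information or chi_square from the previous sketch.
        """
        vocab = {t for doc in docs for t in doc}
        n = len(docs)
        n_c = sum(1 for y in labels if y == c)
        scores = {}
        for t in vocab:
            in_class = sum(1 for doc, y in zip(docs, labels)
                           if t in doc and y == c)
            with_term = sum(1 for doc in docs if t in doc)
            n11, n10 = in_class, with_term - in_class
            n01 = n_c - in_class
            n00 = n - n11 - n10 - n01
            scores[t] = utility(n11, n10, n01, n00)
        return sorted(vocab, key=scores.get, reverse=True)[:k]

Swapping one utility function for the other changes only the ranking near the bottom of the selected set; the strong indicators score highly under both.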

Figure 13.8 compares MI and $\chi ^2$ feature selection for the multinomial model. Peak effectiveness is virtually the same for both methods. $\chi ^2$ reaches this peak later, at 300 features, probably because the rare but highly significant features it selects initially do not cover all documents in the class. However, the features selected later (in the 100-300 range) are of better quality than those selected by MI.

All three methods (MI, $\chi ^2$, and frequency-based selection) are greedy methods. They may select features that contribute no incremental information over previously selected features. In Figure 13.7, kong is selected as the seventh term even though it is highly correlated with the previously selected hong and therefore redundant. Although such redundancy can negatively impact accuracy, non-greedy methods (see Section 13.7 for references) are rarely used in text classification due to their computational cost.
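To see both the redundancy and why avoiding it is expensive, consider a hypothetical redundancy-aware variant of greedy selection (an illustration only, not an algorithm from this chapter): a candidate term is skipped when its document set overlaps too strongly with that of an already selected term, so a near-duplicate such as kong would be rejected once hong is in. The function name, the Jaccard threshold, and the input format are all assumptions.

    def select_nonredundant(docs, ranked_terms, k, max_jaccard=0.9):
        """Greedy selection that skips near-duplicate features.

        ranked_terms: terms sorted by utility, e.g. the output of
        select_features above. Two terms count as redundant when the
        Jaccard overlap of their document sets exceeds max_jaccard.
        """
        occurrences = {t: {i for i, doc in enumerate(docs) if t in doc}
                       for t in ranked_terms}
        selected = []
        for t in ranked_terms:
            d_t = occurrences[t]
            # Extra work per candidate: compare against every feature
            # chosen so far. This is the cost that greedy methods avoid.
            if all(len(d_t & occurrences[s]) / len(d_t | occurrences[s])
                   <= max_jaccard for s in selected):
                selected.append(t)
            if len(selected) == k:
                break
        return selected

Even this mild variant multiplies the per-candidate cost by the number of features already selected; fully non-greedy methods, which re-evaluate feature sets jointly, are more expensive still, which is why they are rarely used in practice.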


