Next: Document and query weighting
Up: Variant tf-idf functions
Previous: Sublinear tf scaling
Contents
Index
Maximum tf normalization
One well-studied technique is to normalize the tf weights of all terms occurring in a document by the maximum tf in that document. For each document , let
, where ranges over all terms in . Then, we compute a normalized term frequency for each term in document by
|
(30) |
where is a value between and and is generally set to , although some early work used the value . The term in (30) is a smoothing term whose role is to damp the contribution of the second term - which may be viewed as a scaling down of tf by the largest tf value in . We will encounter smoothing further in Chapter 13 when discussing classification; the basic idea is to avoid a large swing in
from modest changes in
(say from 1 to 2). The main idea of maximum tf normalization is to mitigate the following anomaly: we observe higher term frequencies in longer documents, merely because longer documents tend to repeat the same words over and over again. To appreciate this, consider the following extreme example: supposed we were to take a document and create a new document by simply appending a copy of to itself. While should be no more relevant to any query than is, the use of (23) would assign it twice as high a score as . Replacing
in (23) by
eliminates the anomaly in this example. Maximum tf normalization does suffer from the following issues:
- The method is unstable in the following sense: a
change in the stop word list can dramatically alter term weightings (and therefore ranking). Thus, it is hard to tune.
- A document may contain an outlier term with an unusually large number of occurrences of that term, not representative of the content of that document.
- More generally, a document in which the most frequent term appears roughly as often as many other terms should be treated differently from one with a more skewed distribution.
Next: Document and query weighting
Up: Variant tf-idf functions
Previous: Sublinear tf scaling
Contents
Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07