next up previous contents index
Next: Document and query weighting Up: Variant tf-idf functions Previous: Sublinear tf scaling   Contents   Index


Maximum tf normalization

One well-studied technique is to normalize the tf weights of all terms occurring in a document by the maximum tf in that document. For each document $d$, let $\mbox{tf}_{\max}(d)=\max_{\tau\in d} \mbox{tf}_{\tau,d}$, where $\tau$ ranges over all terms in $d$. Then, we compute a normalized term frequency for each term $t$ in document $d$ by
\begin{displaymath}
\mbox{ntf}_{t,d} = a + (1-a)\frac{\mbox{tf}_{t,d}}{\mbox{tf}_{\max}(d)} \qquad (30)
\end{displaymath}

where $a$ is a value between $0$ and $1$, generally set to $0.4$, although some early work used the value $0.5$. The term $a$ in (30) is a smoothing term whose role is to damp the contribution of the second term, which may be viewed as a scaling down of tf by the largest tf value in $d$. We will encounter smoothing further in Chapter 13 when discussing classification; the basic idea is to avoid a large swing in $\mbox{ntf}_{t,d}$ from modest changes in $\mbox{tf}_{t,d}$ (say from 1 to 2).

The main idea of maximum tf normalization is to mitigate the following anomaly: we observe higher term frequencies in longer documents merely because longer documents tend to repeat the same words over and over again. To appreciate this, consider the following extreme example: suppose we were to take a document $d$ and create a new document $d'$ by simply appending a copy of $d$ to itself. While $d'$ should be no more relevant to any query than $d$ is, the use of (23) would assign it twice as high a score as $d$. Replacing $\mbox{tf-idf}_{t,d}$ in (23) by $\mbox{ntf-idf}_{t,d}$ eliminates the anomaly in this example. Maximum tf normalization does, however, suffer from the following issues:
  1. The method is unstable in the following sense: a change in the stop word list can dramatically alter term weightings (and therefore ranking). Thus, it is hard to tune.
  2. A document may contain an outlier term with an unusually large number of occurrences, one that is not representative of the content of the document.
  3. More generally, a document in which the most frequent term appears roughly as often as many other terms should be treated differently from one with a more skewed distribution.
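
The normalized weight in (30) can be sketched as follows. This is a minimal illustration, not code from the book; the function name `ntf_weights` and the token-list input format are choices made here for clarity. It also demonstrates the duplicated-document example from the text: appending a copy of $d$ to itself doubles every raw term frequency but leaves every $\mbox{ntf}_{t,d}$ unchanged.

```python
from collections import Counter

def ntf_weights(doc_terms, a=0.4):
    """Maximum tf normalization (Equation 30):
    ntf_{t,d} = a + (1 - a) * tf_{t,d} / tf_max(d),
    where a is the smoothing term, conventionally 0.4."""
    tf = Counter(doc_terms)                  # raw term frequencies tf_{t,d}
    tf_max = max(tf.values())                # tf_max(d)
    return {t: a + (1 - a) * count / tf_max for t, count in tf.items()}

d = ["to", "be", "or", "not", "to", "be"]
print(ntf_weights(d))
# The most frequent terms ("to", "be") get weight 1.0; the rest get 0.7.

# Doubling the document scales every tf by 2 but does not change ntf:
assert ntf_weights(d + d) == ntf_weights(d)
```

Note that the weights depend only on the *ratio* of a term's frequency to the maximum, which is exactly why the concatenation anomaly disappears, and also why an outlier term with a very high count (issue 2 above) can depress the weights of every other term in the document.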


© 2008 Cambridge University Press