
Blocked storage

We can further compress the dictionary by grouping terms in the string into blocks of size $k$ and keeping a term pointer only for the first term of each block (Figure 5.5 ). We store the length of each term in the string as an additional byte at the beginning of the term. We thus eliminate $k-1$ term pointers, but need an additional $k$ bytes for storing the length of each term. For $k=4$, we save $(k-1) \times 3 = 9$ bytes for term pointers, but need an additional $k=4$ bytes for term lengths. So the total space requirement for the dictionary of Reuters-RCV1 is reduced by 5 bytes per four-term block, or a total of $400{,}000 \times 1/4 \times
5 = 0.5 \ \mbox{MB}$, bringing us down to 7.1 MB.
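The blocked layout can be sketched in a few lines of Python (a toy model for illustration, not the book's implementation; the name `build_blocked_string` is ours):

```python
def build_blocked_string(terms, k=4):
    """Pack sorted terms into one string, each term preceded by a length
    byte; keep a pointer (string offset) only for the first term of each
    block of k terms."""
    pieces, block_pointers = [], []
    offset = 0
    for i, term in enumerate(terms):
        if i % k == 0:
            block_pointers.append(offset)  # one pointer per block, not per term
        pieces.append(chr(len(term)) + term)  # 1 length byte + term characters
        offset += 1 + len(term)
    return "".join(pieces), block_pointers

# Space accounting for Reuters-RCV1: per block of k=4 terms we drop
# 3 three-byte pointers (9 bytes) but add 4 length bytes, a net saving
# of 5 bytes per block.
saving = 400_000 // 4 * 5
print(saving / 1e6)  # 0.5 (MB)
```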

Figure 5.6: Search of the uncompressed dictionary (a) and a dictionary compressed by blocking with $k=4$ (b).

By increasing the block size $k$, we get better compression. However, there is a tradeoff between compression and the speed of term lookup. For the eight-term dictionary in Figure 5.6 , steps in binary search are shown as double lines and steps in list search as simple lines. We search for terms in the uncompressed dictionary by binary search (a). In the compressed dictionary, we first locate the term's block by binary search and then its position within the list by linear search through the block (b). Searching the uncompressed dictionary in (a) takes on average $(0+1+2+3+2+1+2+2)/8 \approx 1.6$ steps, assuming each term is equally likely to come up in a query. For example, finding the two terms aid and box takes three and two steps, respectively. With blocks of size $k=4$ in (b), we need $(0+1+2+3+4+1+2+3)/8 = 2$ steps on average, $\approx\!25\%$ more. For example, finding den takes one binary search step and two steps through the block. By increasing $k$, we can get the size of the compressed dictionary arbitrarily close to the minimum of $400{,}000 \times (4+4+1+8) =
6.8 \ \mbox{MB}$, but term lookup becomes prohibitively slow for large values of $k$.
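The two-phase lookup can be sketched as follows (Python; a self-contained toy, with `lookup` and the variable names our own — binary search over the first term of each block, then a linear scan through the block's length-prefixed entries):

```python
from bisect import bisect_right

def lookup(packed, block_pointers, first_terms, k, query):
    """Binary search over blocks (via each block's first term), then
    linear scan through at most k length-prefixed entries in the block."""
    b = bisect_right(first_terms, query) - 1  # last block whose first term <= query
    if b < 0:
        return False
    pos = block_pointers[b]
    for _ in range(k):
        if pos >= len(packed):
            return False
        n = ord(packed[pos])                  # length byte
        if packed[pos + 1 : pos + 1 + n] == query:
            return True
        pos += 1 + n                          # skip to next entry
    return False

# Build a tiny blocked dictionary (length byte + characters per term):
terms = ["aid", "box", "den", "ex", "job", "ox", "pit", "win"]
k = 4
packed, ptrs, firsts = "", [], []
for i, t in enumerate(terms):
    if i % k == 0:
        ptrs.append(len(packed))
        firsts.append(t)
    packed += chr(len(t)) + t

print(lookup(packed, ptrs, firsts, k, "den"))  # True
```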

Figure 5.7: Front coding. As before, the first byte of each entry encodes the number of characters.

One source of redundancy in the dictionary we have not exploited yet is the fact that consecutive entries in an alphabetically sorted list share common prefixes. This observation leads to front coding (Figure 5.7 ). A common prefix is identified for a subsequence of the term list and then referred to with a special character. In the case of Reuters, front coding saves another 1.2 MB, as we found in an experiment.
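The idea can be sketched as follows (a simplified toy encoding, not the exact byte format of Figure 5.7; `front_encode` and `front_decode` are our names):

```python
import os.path  # os.path.commonprefix works on any list of strings

def front_encode(terms):
    """Encode a sorted run of terms by storing their common prefix once
    and keeping only the suffix of each term."""
    prefix = os.path.commonprefix(terms)
    p = len(prefix)
    return prefix, [t[p:] for t in terms]

def front_decode(prefix, suffixes):
    """Reconstruct the original terms from prefix + suffixes."""
    return [prefix + s for s in suffixes]

run = ["automata", "automate", "automatic", "automation"]
prefix, suffixes = front_encode(run)
print(prefix)    # "automat"
print(suffixes)  # ["a", "e", "ic", "ion"]
assert front_decode(prefix, suffixes) == run
```

The saving grows with the length of the shared prefix: here 4 × 7 = 28 prefix characters are replaced by 7 stored once.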

Other schemes with even greater compression rely on minimal perfect hashing, that is, a hash function that maps $M$ terms onto $[1,\ldots,M]$ without collisions. However, we cannot adapt perfect hashes incrementally because each new term causes a collision and therefore requires the creation of a new perfect hash function. Therefore, they cannot be used in a dynamic environment.

Even with the best compression scheme, it may not be feasible to store the entire dictionary in main memory for very large text collections and for hardware with limited memory. If we have to partition the dictionary onto pages that are stored on disk, then we can index the first term of each page using a B-tree. For processing most queries, the search system has to go to disk anyway to fetch the postings. One additional seek for retrieving the term's dictionary page from disk is a significant, but tolerable increase in the time it takes to process a query.
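The paging idea can be sketched like this (Python; an in-memory `bisect` over first terms stands in for a real on-disk B-tree, and all names are ours — purely illustrative):

```python
from bisect import bisect_right

def build_page_index(pages):
    """Index a paged dictionary by the first term of each page;
    in practice this index would be a B-tree kept in main memory."""
    return [page[0] for page in pages]

def find_page(index, query):
    """Return the number of the single page that could hold `query`;
    a real system would now do one disk seek to fetch that page."""
    return max(bisect_right(index, query) - 1, 0)

# Four toy dictionary "pages", each stored on disk in a real system:
pages = [["aid", "box"], ["den", "ex"], ["job", "ox"], ["pit", "win"]]
index = build_page_index(pages)
assert "den" in pages[find_page(index, "den")]
```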


Table 5.2: Dictionary compression for Reuters-RCV1.
 data structure                          size in MB
 dictionary, fixed-width                       11.2
 dictionary, term pointers into string          7.6
 $\sim$, with blocking, $k=4$                   7.1
 $\sim$, with blocking & front coding           5.9


Table 5.2 summarizes the compression achieved by the four dictionary data structures.

Exercises.


© 2008 Cambridge University Press