next up previous contents index
Next: Variable byte codes Up: Index compression Previous: Blocked storage   Contents   Index


Postings file compression


Table: Encoding gaps instead of document IDs. For example, we store gaps 107, 5, 43, ..., instead of docIDs 283154, 283159, 283202, ... for computer. The first docID is left unchanged (only shown for arachnocentric).
   encoding postings list                
 the docIDs ...   283042   283043   283044   283045 ...  
   gaps       1   1   1   ...  
 computer docIDs ...   283047   283154   283159   283202 ...  
   gaps       107   5   43   ...  
 arachnocentric docIDs 252000   500100                
   gaps 252000 248100                  

Recall from Table 4.2 (page 4.2 ) that Reuters-RCV1 has 800,000 documents, 200 tokens per document, six characters per token, and 100,000,000 postings where we define a posting in this chapter as a docID in a postings list, that is, excluding frequency and position information. These numbers correspond to line 3 (``case folding'') in Table 5.1 . Document identifiers are $\log_2 800{,}000 \approx 20$ bits long. Thus, the size of the collection is about $800{,}000 \times 200 \times 6 \
\mbox{bytes}= 960 \ \mbox{MB}$ and the size of the uncompressed postings file is $100{,}000{,}000 \times 20
/ 8 = 250 \ \mbox{MB}$.

To devise a more efficient representation of the postings file, one that uses fewer than 20 bits per document, we observe that the postings for frequent terms are close together. Imagine going through the documents of a collection one by one and looking for a frequent term like computer. We will find a document containing computer, then we skip a few documents that do not contain it, then there is again a document with the term and so on (see Table 5.3 ). The key idea is that the gaps between postings are short, requiring a lot less space than 20 bits to store. In fact, gaps for the most frequent terms such as the and for are mostly equal to 1. But the gaps for a rare term that occurs only once or twice in a collection (e.g., arachnocentric in Table 5.3 ) have the same order of magnitude as the docIDs and need 20 bits. For an economical representation of this distribution of gaps, we need a variable encoding method that uses fewer bits for short gaps.

To encode small numbers in less space than large numbers, we look at two types of methods: bytewise compression and bitwise compression. As the names suggest, these methods attempt to encode gaps with the minimum number of bytes and bits, respectively.



Subsections
next up previous contents index
Next: Variable byte codes Up: Index compression Previous: Blocked storage   Contents   Index
© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07