Putting it all together

Next: Vector space scoring and Up: Components of an information Previous: Designing parsing and scoring Contents Index

Putting it all together

We have now studied all the components necessary for a basic search system that supports free text queries as well as Boolean, zone and field queries. We briefly review how the various pieces fit together into an overall system; this is depicted in Figure 7.5 .

$\includegraphics[width=14cm]{system.eps}$ A complete search system.Data paths are shown primarily for a free text query.

In this figure, documents stream in from the left for parsing and linguistic processing (language and format detection, tokenization and stemming). The resulting stream of tokens feeds into two modules. First, we retain a copy of each parsed document in a document cache. This will enable us to generate results snippets : snippets of text accompanying each document in the results list for a query. This snippet tries to give a succinct explanation to the user of why the document matches the query. The automatic generation of such snippets is the subject of Section 8.7 . A second copy of the tokens is fed to a bank of indexers that create a bank of indexes including zone and field indexes that store the metadata for each document, (tiered) positional indexes, indexes for spelling correction and other tolerant retrieval, and structures for accelerating inexact top- retrieval. A free text user query (top center) is sent down to the indexes both directly and through a module for generating spelling-correction candidates. As noted in Chapter 3 the latter may optionally be invoked only when the original query fails to retrieve enough results. Retrieved documents (dark arrow) are passed to a scoring module that computes scores based on machine-learned ranking (MLR), a technique that builds on Section 6.1.2 (to be further developed in Section 15.4.1 ) for scoring and ranking documents. Finally, these ranked documents are rendered as a results page.

Exercises.

Explain how the postings intersection algorithm first introduced in Section 1.3 can be adapted to find the smallest integer $\omega$ that contains all query terms.
Adapt this procedure to work when not all query terms are present in a document.

Next: Vector space scoring and Up: Components of an information Previous: Designing parsing and scoring Contents Index

© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07