In this figure, documents stream in from the left for
parsing and linguistic processing (language and format
detection, tokenization and stemming). The resulting stream
of tokens feeds into two modules. First, we retain a copy
of each parsed document in a document cache. This will
enable us to generate results snippets:
snippets of text accompanying each document in the results
list for a query. A snippet tries to give a succinct
explanation to the user of why the document matches the
query. The automatic generation of such snippets is the
subject of Section 8.7. A second copy of the tokens
is fed to a bank of indexers that create a bank of indexes
including zone and field indexes that store the metadata for
each document, (tiered) positional indexes, indexes for
spelling correction and other tolerant retrieval, and
structures for accelerating inexact top-K retrieval. A free
text user query (top center) is sent down to the indexes
both directly and through a module for generating
spelling-correction candidates. As noted in Chapter 3, the
latter may optionally be invoked only when the original
query fails to retrieve enough results. Retrieved documents
(dark arrow) are passed to a scoring module that computes
scores based on machine-learned ranking (MLR), a technique
that builds on Section 6.1.2 (to be further developed in
Section 15.4.1) for scoring and ranking documents. Finally,
these ranked documents are rendered as a results page.
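The data flow described above can be made concrete with a small sketch. The following Python is a toy illustration under several assumptions, not the system in the figure: the class name ToySearchSystem and all of its methods are hypothetical, linguistic processing is reduced to lowercasing and splitting, the bank of indexes is a single term-to-postings dictionary, spelling correction uses the standard library's difflib string similarity, and a plain sum of term frequencies stands in for machine-learned ranking.

```python
import difflib
import re
from collections import defaultdict


class ToySearchSystem:
    """Hypothetical, heavily simplified stand-in for the pipeline in the figure."""

    def __init__(self):
        self.doc_cache = {}             # doc_id -> original text, kept for snippets
        self.index = defaultdict(dict)  # term -> {doc_id: term frequency}

    @staticmethod
    def tokenize(text):
        # Stand-in for linguistic processing: lowercase, split on non-alphanumerics.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add_document(self, doc_id, text):
        self.doc_cache[doc_id] = text        # first copy: the document cache
        for term in self.tokenize(text):     # second copy: the indexer
            self.index[term][doc_id] = self.index[term].get(doc_id, 0) + 1

    def _retrieve(self, terms):
        # Placeholder scorer (an assumption): summed term frequency, not MLR.
        scores = defaultdict(int)
        for term in terms:
            for doc_id, tf in self.index.get(term, {}).items():
                scores[doc_id] += tf
        return scores

    def correct(self, term):
        # Spelling-correction candidate: closest vocabulary term, if any.
        if term in self.index:
            return term
        matches = difflib.get_close_matches(term, list(self.index), n=1, cutoff=0.8)
        return matches[0] if matches else term

    def snippet(self, doc_id, terms, width=5):
        # Window of words around the first query-term match in the cached text.
        words = self.doc_cache[doc_id].split()
        for i, word in enumerate(words):
            if set(self.tokenize(word)) & set(terms):
                return " ".join(words[max(0, i - width): i + width + 1])
        return " ".join(words[: 2 * width + 1])

    def search(self, query, k=10, min_results=1):
        terms = self.tokenize(query)
        scores = self._retrieve(terms)
        # Invoke spelling correction only when the original query
        # fails to retrieve enough results.
        if len(scores) < min_results:
            terms = [self.correct(t) for t in terms]
            scores = self._retrieve(terms)
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]
        return [(doc_id, score, self.snippet(doc_id, terms)) for doc_id, score in ranked]


system = ToySearchSystem()
system.add_document(1, "Tolerant retrieval handles spelling errors in queries.")
system.add_document(2, "Tiered indexes speed up top-K retrieval; retrieval stays inexact.")
results = system.search("retreival")  # misspelled on purpose
```

Here the misspelled query matches nothing on the first pass, so the correction step is triggered and the query is retried with the corrected term. Keeping the raw text in doc_cache, separate from the index, is what allows snippets to be built from the original wording rather than from normalized tokens.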
Exercises.