Strang (1986) provides an excellent introductory overview
of matrix decompositions including the singular value
decomposition. Theorem 18.3 is due to
Eckart and Young (1936). The connection between information
retrieval and low-rank approximations of the term-document
matrix was introduced in Deerwester et al. (1990), with a subsequent
survey of results in Berry et al. (1995). Dumais (1993)
and Dumais (1995) describe experiments on TREC benchmarks
giving evidence that at least on some benchmarks, LSI can
produce better precision and recall than standard
vector-space
retrieval. http://www.cs.utk.edu/~berry/lsi++/and http://lsi.argreenhouse.com/lsi/LSIpapers.htmloffer comprehensive pointers to the literature and software
of LSI. Schütze and Silverstein (1997) evaluate
LSI and truncated representations of centroids for efficient
K-means clustering (Section 16.4).
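As a concrete illustration of the low-rank approximation underlying LSI, the following sketch (ours, not drawn from any of the cited papers; it assumes NumPy and a small made-up term-document matrix C) computes the rank-k truncated SVD, which by the theorem of Eckart and Young (Theorem 18.3) is the best rank-k approximation to C under the Frobenius norm:

```python
import numpy as np

# Toy term-document matrix C (terms x documents); entries are term counts.
C = np.array([
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

k = 2  # the reduced dimension

# Full SVD: C = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Keep only the k largest singular values.
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: C_k minimizes ||C - X||_F over all rank-k matrices X,
# and the error equals the root of the sum of squared discarded
# singular values.
err = np.linalg.norm(C - C_k, "fro")
print(err, np.sqrt(np.sum(s[k:] ** 2)))  # the two values agree

# Documents in the reduced LSI space: columns of diag(s_k) @ V_k^T.
docs_k = np.diag(s[:k]) @ Vt[:k, :]
print(docs_k.shape)  # (k, number of documents)
```

Varying k in this sketch shows the trade-off the next reference studies: smaller values of k merge more terms into shared directions of the reduced space.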
Bast and Majumdar (2005) detail the role of the reduced
dimension k in LSI and how different pairs of terms get
coalesced at differing values of k. Applications
of LSI to cross-language information retrieval
(where documents in two or more different languages are
indexed, and a query posed in one language is expected to
retrieve documents in other languages) are developed in
Berry and Young (1995) and Littman et al. (1998). LSI
(referred to as LSA in more general settings) has been
applied to a host of other problems in computer science
ranging from memory modeling to computer vision.
Hofmann (1999a; 1999b) provides an initial
probabilistic extension of the basic latent semantic indexing
technique. A more satisfactory formal basis for a probabilistic latent
variable model for dimensionality reduction is the Latent
Dirichlet Allocation (LDA) model
(Blei et al., 2003), which is generative and assigns probabilities to
documents outside of the training set. This model is extended to a
hierarchical clustering by Rosen-Zvi et al. (2004).
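To make the generative character of LDA concrete, here is a minimal sketch of fitting a topic model; it assumes scikit-learn's LatentDirichletAllocation, which is one of several available implementations and our choice for illustration, not the implementation of Blei et al. (2003):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix (documents x terms); LDA models each
# document as a mixture over latent topics, each topic being a
# distribution over terms.
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 2, 1, 0],
    [0, 0, 1, 2, 1],
])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic proportions

print(doc_topics)       # each row sums to 1: one topic mixture per document
print(lda.components_)  # unnormalized topic-term weights

# Because the model is generative, it also assigns topic proportions to
# documents outside the training set:
X_new = np.array([[1, 1, 0, 0, 0]])
print(lda.transform(X_new))
```

The call to transform on an unseen document is precisely the property the text highlights: unlike the earlier probabilistic LSI formulation, LDA assigns probabilities to documents outside the training set.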
Wei and Croft (2006) present the first large-scale evaluation of LDA,
finding it to significantly outperform the query likelihood model of
Section 12.2, though not to perform quite as well as the relevance
model mentioned in Section 12.4; the latter, however, does
additional per-query processing that LDA does not. Teh et al. (2006)
generalize further by presenting Hierarchical Dirichlet
Processes, a probabilistic model which allows a group (for us, a
document) to be drawn from an infinite mixture of latent topics, while
still allowing these topics to be shared across documents.
Exercises.
Figure 18.5 gives a glossary relating the Spanish and English words above, for your own information; this glossary is NOT available to the retrieval system.