

Matrix decompositions and latent semantic indexing

In Section 6.3.1 we introduced the notion of a term-document matrix: an $M \times N$ matrix $C$, each of whose $M$ rows represents a term and each of whose $N$ columns represents a document in the collection. Even for a collection of modest size, the term-document matrix $C$ is likely to have several tens of thousands of rows and columns. In Section 18.1.1 we first develop a class of operations from linear algebra, known as matrix decompositions. In Section 18.2 we use a special form of matrix decomposition to construct a low-rank approximation to the term-document matrix. In Section 18.3 we examine the application of such low-rank approximations to indexing and retrieving documents, a technique referred to as latent semantic indexing. While latent semantic indexing has not been established as a significant force in scoring and ranking for information retrieval, it remains an intriguing approach to clustering in a number of domains, including collections of text documents (Section 16.6). Understanding its full potential remains an area of active research.
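To make these objects concrete, the following minimal sketch (not taken from the book) builds a toy term-document matrix of raw term counts and uses the singular value decomposition to form the kind of rank-$k$ approximation constructed in Sections 18.2 and 18.3; the counts and the choice of $k$ are illustrative assumptions.

import numpy as np

# Toy term-document matrix C: rows are terms, columns are documents,
# entries are raw term counts (illustrative values only).
C = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

# Singular value decomposition C = U diag(s) V^T.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Keep only the k largest singular values to obtain a rank-k approximation C_k,
# the low-rank approximation that latent semantic indexing works with.
k = 2
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("rank of C  :", np.linalg.matrix_rank(C))    # 5
print("rank of C_k:", np.linalg.matrix_rank(C_k))  # 2

The approximation $C_k$ has the same shape as $C$ but rank at most $k$; Chapter 18 explains in what sense it is the best such approximation and how it is used for indexing and retrieval.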

Readers who do not require a refresher on linear algebra may skip Section 18.1, although Example 18.1 is especially recommended as it highlights a property of eigenvalues that we exploit later in the chapter.


