The somewhat quirky name support vector machine originates in the neural networks literature, where learning algorithms were thought of as architectures, and often referred to as ``machines''. The distinctive element of this model is that the decision boundary to use is completely decided (``supported'') by a few training data points, the support vectors.
For a more detailed presentation of SVMs, a good, well-known article-length introduction is (Burges, 1998). Chen et al. (2005) introduce the more recent -SVM, which provides an alternative parameterization for dealing with inseparable problems, whereby rather than specifying a penalty , you specify a parameter which bounds the number of examples which can appear on the wrong side of the decision surface. There are now also several books dedicated to SVMs, large margin learning, and kernels: (Cristianini and Shawe-Taylor, 2000) and (Schölkopf and Smola, 2001) are more mathematically oriented, while (Shawe-Taylor and Cristianini, 2004) aims to be more practical. For the foundations by their originator, see (Vapnik, 1998). Some recent, more general books on statistical learning, such as (Hastie et al., 2001) also give thorough coverage of SVMs.
The construction of multiclass SVMs is discussed in (Weston and Watkins, 1999), (Crammer and Singer, 2001), and (Tsochantaridis et al., 2005). The last reference provides an introduction to the general framework of structural SVMs.
The kernel trick was first presented in (Aizerman et al., 1964). For more about string kernels and other kernels for structured data, see (Lodhi et al., 2002) and (Gaertner et al., 2002). The Advances in Neural Information Processing (NIPS) conferences have become the premier venue for theoretical machine learning work, such as on SVMs. Other venues such as SIGIR are much stronger on experimental methodology and using text-specific features to improve classifier effectiveness.
A recent comparison of most current machine learning classifiers (though on problems rather different from typical text problems) can be found in (Caruana and Niculescu-Mizil, 2006). (Li and Yang, 2003), discussed in Section 13.6 , is the most recent comparative evaluation of machine learning classifiers on text classification. Older examinations of classifiers on text problems can be found in (Yang and Liu, 1999, Dumais et al., 1998, Yang, 1999). Joachims (2002a) presents his work on SVMs applied to text problems in detail. Zhang and Oles (2001) present an insightful comparison of Naive Bayes, regularized logistic regression and SVM classifiers.
Joachims (1999) discusses methods of making SVM learning practical over large text data sets. Joachims (2006a) improves on this work.
A number of approaches to hierarchical classification have been developed in order to deal with the common situation where the classes to be assigned have a natural hierarchical organization (Weigend et al., 1999, Dumais and Chen, 2000, Koller and Sahami, 1997, McCallum et al., 1998). In a recent large study on scaling SVMs to the entire Yahoo! directory, Liu et al. (2005) conclude that hierarchical classification noticeably if still modestly outperforms flat classification. Classifier effectiveness remains limited by the very small number of training documents for many classes. For a more general approach that can be applied to modeling relations between classes, which may be arbitrary rather than simply the case of a hierarchy, see Tsochantaridis et al. (2005).
Moschitti and Basili (2004) investigate the use of complex nominals, proper nouns and word senses as features in text classification.
Dietterich (2002) overviews ensemble methods for classifier combination, while Schapire (2003) focuses particularly on boosting, which is applied to text classification in (Schapire and Singer, 2000).
Chapelle et al. (2006) present an introduction to work in semi-supervised methods, including in particular chapters on using EM for semi-supervised text classification (Nigam et al., 2006) and on transductive SVMs (Joachims, 2006b). Sindhwani and Keerthi (2006) present a more efficient implementation of a transductive SVM for large data sets.
Tong and Koller (2001) explore active learning with SVMs for text classification; Baldridge and Osborne (2004) point out that examples selected for annotation with one classifier in an active learning context may be no better than random examples when used with another classifier.
Machine learning approaches to ranking for ad hoc retrieval were pioneered in (Wong et al., 1988), (Fuhr, 1992), and (Gey, 1994). But limited training data and poor machine learning techniques meant that these pieces of work achieved only middling results, and hence they only had limited impact at the time.
Taylor et al. (2006) study using machine learning to tune the parameters of the BM25 family of ranking functions okapi-bm25 so as to maximize NDCG (Section 8.4 , page 8.4 ). Machine learning approaches to ordinal regression appear in (Herbrich et al., 2000) and (Burges et al., 2005), and are applied to clickstream data in (Joachims, 2002b). Cao et al. (2006) study how to make this approach effective in IR, and Qin et al. (2007) suggest an extension involving using multiple hyperplanes. Yue et al. (2007) study how to do ranking with a structural SVM approach, and in particular show how this construction can be effectively used to directly optimize for MAP ranked-evaluation, rather than using surrogate measures like accuracy or area under the ROC curve. Geng et al. (2007) study feature selection for the ranking problem.
Other approaches to learning to rank have also been shown to be effective for web search, such as (Richardson et al., 2006, Burges et al., 2005).