The Maximum-Margin Approach to Learning Text Classifiers

Methods, Theory, and Algorithms

Thorsten Joachims


Text classification, or the task of automatically assigning semantic categories to natural language text, has become one of the key methods for organizing online information. Since hand-coding such classification rules is costly or even impractical, most modern approaches employ machine learning techniques to automatically learn text classifiers from examples. However, none of these conventional approaches combines good prediction performance, theoretical understanding, and efficient training algorithms.

Based on ideas from Support Vector Machines (SVMs), this dissertation presents a new approach to learning text classifiers from examples. It provides not only learning methods that empirically have state-of-the-art predictive performance, but also a theory that connects the properties of text-classification tasks with the generalization accuracy of the learning methods, as well as algorithms that efficiently implement the methods.

In particular, the results show that the SVM approach to learning text classifiers is highly effective without greedy heuristic components. To explain these empirical findings, the dissertation analyzes the statistical properties of text-classification tasks and presents a theoretical learning model that leads to bounds on the expected error rate . The bounds are based on improved results about leave-one-out estimators for SVMs. These results also lead to a new group of performance estimators for SVMs, called epsilon-alpha-estimators, and to an improved algorithm for computing the leave-one-out error of an SVM. While all results mentioned so far were for the inductive setting, this dissertation also introduces the idea of transduction to text classification. It shows how and why exploiting the location of the test points during learning can improve predictive performance. To make the SVM approach to learning text classifiers applicable in practice, this dissertation also derives new algorithms for training SVMs. For both the inductive and the transductive setting, these algorithms substantially extend the scalability of SVMs to large-scale problems.

While the work presented in this dissertation is driven by the application, the techniques it develops are not limited to text classification, but can be applied in other contexts as well. In addition, not only can individual techniques be transferred to other tasks, this dissertation as a whole can be useful as a case study for how to approach high-dimensional learning tasks.