VC dimension

Description:

The VC dimension (short for Vapnik-Chervonenkis dimension) is a measure of the capacity of a classification model, that is, of the set of functions the model can represent. It is one of the core concepts in statistical learning theory. It was originally defined by Vladimir Vapnik and Alexey Chervonenkis.

Intuitively, the capacity of a classification model is related to how complicated it can be. Think of thresholding a high-degree polynomial: if the polynomial evaluates above zero at a point, that point is classified as positive, otherwise as negative.

A very high-degree polynomial can be very wiggly and can fit a training set exactly. But outside the training points we should not expect such a classifier to generalize well, precisely because it is so wiggly; such a polynomial has a high capacity. Alternatively, we can threshold a linear polynomial. This polynomial may not fit the entire training set, because it has a low capacity. This notion of capacity can be made more rigorous.
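
A minimal sketch of this intuition in Python, assuming one-dimensional inputs and labels in {-1, +1}: a polynomial is fit to the labels by least squares and points are classified by its sign. The data, the degrees, and the helper name threshold_classifier are illustrative assumptions, not part of the definition.

    import numpy as np

    # Toy one-dimensional training set with labels in {-1, +1}.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1, -1, 1, 1, -1, 1])

    def threshold_classifier(degree):
        # Least-squares fit of a polynomial of the given degree to the labels;
        # classify a point as positive if the polynomial evaluates above zero.
        coeffs = np.polyfit(x, y, degree)
        return lambda t: np.where(np.polyval(coeffs, t) > 0, 1, -1)

    for degree in (1, 5):
        f = threshold_classifier(degree)
        errors = int(np.sum(f(x) != y))
        print(f"degree {degree}: {errors} training errors")
    # degree 1: 2 training errors  (low capacity, cannot fit this labeling)
    # degree 5: 0 training errors  (high capacity, interpolates every label)

The degree-5 polynomial interpolates all six labels and makes no training errors, while the linear fit cannot; this is the capacity gap described above.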

First, we need the concept of shattering. Consider a classification model f with some parameter vector θ. The model f shatters a set of data points if, for every assignment of labels to those points, there exists a θ such that f classifies all of them correctly. The VC dimension of f is then the cardinality of the largest set of points that f can shatter.
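
A minimal sketch of a shattering check, assuming the model f is a linear classifier in the plane, f(x) = sign(w · x + b), with θ = (w, b). For each of the 2^n labelings of a candidate point set it tests, via a linear-programming feasibility problem (SciPy's linprog), whether some θ realizes that labeling; the point sets and the helper names can_realize and is_shattered are illustrative assumptions.

    from itertools import product
    import numpy as np
    from scipy.optimize import linprog

    def can_realize(points, labels):
        # Feasible iff some (w, b) satisfies y_i * (w . x_i + b) >= 1 for all i,
        # i.e. the labeling is realized by a linear classifier sign(w . x + b).
        X = np.asarray(points, dtype=float)
        y = np.asarray(labels, dtype=float)
        n, d = X.shape
        # Variables (w_1, ..., w_d, b); constraints -y_i * (w . x_i + b) <= -1.
        A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
        res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                      bounds=[(None, None)] * (d + 1))
        return res.success

    def is_shattered(points):
        # Shattered iff every one of the 2^n label assignments is realizable.
        return all(can_realize(points, labels)
                   for labels in product([-1, 1], repeat=len(points)))

    three = [(0, 0), (1, 0), (0, 1)]          # three points in general position
    four = [(0, 0), (1, 0), (0, 1), (1, 1)]   # four corners of a square

    print(is_shattered(three))  # True: lines shatter these three points
    print(is_shattered(four))   # False: the XOR labeling is not separable

Three points in general position are shattered by lines, while no set of four points in the plane is (for the square, the XOR labeling is not linearly separable), so the VC dimension of linear classifiers in the plane is 3.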
Publications:
Blumer/etal/90a: Learnability and the Vapnik-Chervonenkis dimension
Haussler/99a: Convolution Kernels on Discrete Structures
Mitchell/97b: Machine Learning
Natarajan/91a: Machine Learning. A Theoretical Approach
Translations: Vapnik-Chervonenkis-Dimension (German)