Description
Support vector machines were developed by Vapnik et al. based on the Structural Risk Minimization principle from statistical learning theory. They can be applied to regression, classification, and
density estimation problems.
Vapnik's book "Statistical Learning Theory" is
listed under "Publications", above.
Focusing on classification, SVMs take as input an i.i.d. training sample
Sn = ((x1,y1), (x2,y2), ..., (xn,yn))
of size n from a fixed but unknown distribution Pr(x,y) describing the learning task. xi represents the
document content and yi ∈ {-1,+1} indicates the class.
In their basic form, SVM classifiers learn binary, linear decision rules
h(x) = sign(w * x + b) = +1 if w * x + b > 0, and -1 otherwise,
described by a weight vector w and a threshold b. Depending on
which side of the hyperplane the attribute vector x lies on,
it is classified into class +1 or -1. The idea of structural risk
minimization is to find a hypothesis h for which one can guarantee
the lowest probability of error. For SVMs, Vapnik shows that this
goal can be translated into finding the hyperplane with maximum
margin for separable data; this is depicted in the following figure.
Figure: For separable training sets, SVMs find the hyperplane h which
separates the positive and negative training examples with maximum
margin. The examples closest to the hyperplane are called Support
Vectors (marked with circles).
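To make the linear decision rule concrete, the following is a minimal sketch of evaluating h(x) = sign(w * x + b); the weight vector, threshold, and document vector are made-up values for illustration only.

    import numpy as np

    def h(x, w, b):
        # linear SVM decision rule: +1 if w * x + b > 0, else -1
        return 1 if np.dot(w, x) + b > 0 else -1

    # made-up weight vector, threshold, and document vector
    w = np.array([0.4, -1.2, 0.7])
    b = -0.1
    x = np.array([1.0, 0.0, 0.5])
    print(h(x, w, b))   # +1 here, since w * x + b = 0.65 > 0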
Computing this hyperplane is equivalent to solving the following
quadratic optimization problem (see also algorithms).
    minimize:    W(α) = - ∑i=1..n αi + 1/2 ∑i=1..n ∑j=1..n yi yj αi αj (xi * xj)
    subject to:  ∑i=1..n yi αi = 0   and   ∀i: 0 ≤ αi ≤ C
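As an illustration of this dual problem, here is a minimal sketch that solves it with a general-purpose constrained optimizer (scipy's SLSQP) rather than SVM-light's specialized algorithm; the function name solve_dual and the default value of C are arbitrary choices for the example.

    import numpy as np
    from scipy.optimize import minimize

    def solve_dual(X, y, C=1.0):
        # X: (n, d) matrix of document vectors, y: (n,) vector of labels in {-1,+1}
        n = X.shape[0]
        K = X @ X.T                              # inner products xi * xj
        Q = (y[:, None] * y[None, :]) * K        # Q[i, j] = yi yj (xi * xj)

        def W(alpha):                            # dual objective to minimize
            return -alpha.sum() + 0.5 * alpha @ Q @ alpha

        def W_grad(alpha):
            return -np.ones(n) + Q @ alpha

        constraints = [{"type": "eq", "fun": lambda a: a @ y, "jac": lambda a: y}]
        bounds = [(0.0, C)] * n                  # 0 <= alpha_i <= C
        result = minimize(W, np.zeros(n), jac=W_grad, method="SLSQP",
                          bounds=bounds, constraints=constraints)
        return result.x                          # the vector of alpha_i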
Support vectors are those training examples xi with
αi > 0 at the solution. From the solution of this
optimization problem the decision rule can be computed as
    w * x = ∑i=1..n αi yi (xi * x)   and   b = yusv - w * xusv
The training example (xusv,yusv) for
calculating b must be a support vector with αusv
< C.
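Continuing the solve_dual sketch above, w and b could be recovered from the αi as follows; the tolerance tol used to detect non-zero αi is an arbitrary choice for the example.

    def decision_rule(alpha, X, y, C=1.0, tol=1e-8):
        w = (alpha * y) @ X                      # w = sum_i alpha_i yi xi
        # pick an unbounded support vector (0 < alpha_usv < C) to compute b
        usv = np.where((alpha > tol) & (alpha < C - tol))[0][0]
        b = y[usv] - X[usv] @ w
        return w, b

    def predict(w, b, x):
        return np.sign(w @ x + b)                # h(x) = sign(w * x + b)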
Both for solving the quadratic optimization problem and for
applying the learned decision rule, it is sufficient to be able to
calculate inner products between document vectors.
Exploiting this property, Boser et al. introduced the use of kernels
K(x1,x2) for learning non-linear
decision rules.
Popular kernel functions are:

    Kpoly(d1,d2)    = (d1 * d2 + 1)^d
    Krbf(d1,d2)     = exp(-γ ||d1 - d2||^2)
    Ksigmoid(d1,d2) = tanh(s (d1 * d2) + c)
Depending on the type of kernel function, SVMs learn polynomial
classifiers, radial basis function (RBF) classifiers, or two-layer
sigmoid neural nets. Such kernels calculate an inner product in some
feature space and replace the inner product in the formulas above.
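As a sketch, the three kernels could be written as below; the parameter defaults (d, γ, s, c) are arbitrary example values, and predict_kernel shows how K(xi, x) takes the place of the inner product xi * x in the decision rule above.

    def k_poly(d1, d2, d=3):
        return (d1 @ d2 + 1.0) ** d

    def k_rbf(d1, d2, gamma=0.5):
        diff = d1 - d2
        return np.exp(-gamma * (diff @ diff))

    def k_sigmoid(d1, d2, s=1.0, c=-1.0):
        return np.tanh(s * (d1 @ d2) + c)

    def predict_kernel(alpha, X, y, b, x, kernel=k_rbf):
        # non-linear decision rule: sign( sum_i alpha_i yi K(xi, x) + b )
        score = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X))
        return np.sign(score + b)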
A good tutorial on SVMs for classification is "A Tutorial on
Support Vector Machines for Pattern Recognition"; for details,
please follow the corresponding link above.