Logistic regression

Linear logistic regression

Logistic regression is a form of regression for a dichotomous outcome variable. Predictors can be numerical or categorical. Logistic regression constructs a linear model of the following form:

Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk

where Y is the logit transformation of the probability p. The logit transformation is defined as:

Y = log(p/(1-p))

where p is the probability of the outcome. Equivalently, the linear function can be rewritten as a direct prediction of that probability:

P(class = pos) = 1 / (1 + exp(-(b0 + b1*X1 + b2*X2 + ... + bk*Xk)))

The constant b0 and the weights b1 .. bk are chosen by a regression method, typically maximum likelihood estimation, so that the predictions for the class are optimal for the sample. A number of tools are available for computing the weights.
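
As a minimal sketch, such a model can be fitted in Python with scikit-learn; the toy data below is an illustrative assumption, not taken from the text:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: two numerical predictors and a dichotomous outcome.
X = np.array([[1.0, 0.2], [0.5, 1.1], [2.3, 0.7], [0.1, 1.9]])
y = np.array([0, 1, 1, 0])

model = LogisticRegression().fit(X, y)

# The fitted weights b1..bk and the constant b0 of the linear model.
print(model.coef_, model.intercept_)

# Predicted probabilities p = 1 / (1 + exp(-(b0 + b1*X1 + ... + bk*Xk))).
print(model.predict_proba(X)[:, 1])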

Example: For the caravan policy data, logistic regression finds the following linear combination:

P(buyer|x) = exp(g(x)) / (1 + exp(g(x)))

where g(x) =
- 0.16 * MFALLEEN
+ 0.17 * MOPLHOOG
- 0.3 * MBERBOER
- 0.49 * PWALAND
+ 0.31 * PPERSAUT
+ 0.42 * PGEZONG
+ 0.42 * PBRAND
- 0.99 * ABRAND
+ 2.3302 * APLEZIER
- 0.2779 * CPERPL
- 3.1379

This model is not very "readable", but it is intended mainly for predicting buying behaviour. Cross-validation experiments showed that the predictive accuracy of logistic regression is about 12%, ranging between 11% and 13%.
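
The fitted model can be read as a simple scoring function. The sketch below uses the coefficients quoted above; the Python rendering itself is an illustration, not the author's actual code:

import math

WEIGHTS = {
    'MFALLEEN': -0.16, 'MOPLHOOG': 0.17, 'MBERBOER': -0.3,
    'PWALAND': -0.49, 'PPERSAUT': 0.31, 'PGEZONG': 0.42,
    'PBRAND': 0.42, 'ABRAND': -0.99, 'APLEZIER': 2.3302,
    'CPERPL': -0.2779,
}
INTERCEPT = -3.1379

def p_buyer(x):
    # P(buyer|x) = exp(g(x)) / (1 + exp(g(x)))
    g = INTERCEPT + sum(w * x[name] for name, w in WEIGHTS.items())
    return math.exp(g) / (1 + math.exp(g))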


Boosted linear logistic regression

Based on the observation that linear logistic regression performed most accurately and stably, it was decided to explore this technique further. One option is boosted linear logistic regression. Boosting is a method for dealing with unbalanced distributions. It employs an iterative approach in which the focus shifts, in each iteration, towards cases that were misclassified in the previous iteration. A specific learning method serves as a subprocedure within each iteration.

The model that boosting constructs is a linear combination of models produced by the subprocedure. The models and their weights are constructed as follows. First, the learning subprocedure, in this case logistic regression, is applied. The resulting model is used to predict classes in the sample, and misclassified cases are assigned a greater weight. Next, the learning subprocedure is applied to the reweighted sample, and the resulting model is evaluated by comparing its predicted classes with the actual classes in the (reweighted) sample. This model is then added to the linear combination of models. The weights of the submodels are optimised again, and the process stops after a given number of iterations.
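
This loop can be sketched as follows; the AdaBoost-style weight-update rule is an assumption, since the text does not specify the exact rule used:

import numpy as np
from sklearn.linear_model import LogisticRegression

def boost(X, y, n_iterations=250):
    # y is expected in {-1, +1}.
    n = len(y)
    sample_weights = np.full(n, 1.0 / n)
    models, alphas = [], []
    for _ in range(n_iterations):
        # Apply the learning subprocedure to the (re)weighted sample.
        m = LogisticRegression().fit(X, y, sample_weight=sample_weights)
        pred = m.predict(X)
        # Evaluate by comparing predicted classes with actual classes.
        err = sample_weights[pred != y].sum()
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this submodel
        # Misclassified cases are assigned a greater weight.
        sample_weights *= np.exp(-alpha * y * pred)
        sample_weights /= sample_weights.sum()
        models.append(m)
        alphas.append(alpha)
    return models, alphas

def predict(models, alphas, X):
    # The final model is a weighted linear combination of the submodels.
    score = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(score)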

In the caravan policy case, an initial solution was reached after 250 iterations (7 minutes on a 400 MHz Pentium II computer). For the original data as well as for the preprocessed data, the average performance of the classifiers was almost the same on the training and validation sets. In other words, the danger of overfitting is limited.


Alternative Approaches

Apart from boosting in its simplest form, we tried several alternative approaches. First, the submodels in the linear combination were 4-node decision trees instead of simple thresholds on the probabilities produced by the logistic functions. This led to overfitting (accuracy on the training set increased to 18%, while dropping to 14% on the test set). Second, different strategies for weight updates were tried. For example, instead of overweighting all misclassified cases, only the top 30% were overweighted. Unfortunately, the improvement in accuracy, if any, was not significant. It seems that the simplest version of boosting is also the best. The result was a significant 3% increase in predictive accuracy, from about 12% to 15%. Readability is worse than for the single logistic model.
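
A rough sketch of the first alternative, using scikit-learn's AdaBoostClassifier as a stand-in for the boosting loop above; interpreting "4-node" as four leaf nodes is an assumption:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

boosted_trees = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_leaf_nodes=4),  # small 4-node trees
    n_estimators=250,
)
# boosted_trees.fit(X_train, y_train) with the caravan policy data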

Solution provided by Wojtek Kowalcyk, winner of the Benelearn 1999 Competition