The Caravan Case: Data Cleansing

Description
Data inspection: A descriptive analysis of the data reveals that:
  • 5.9% of the clients in the training set own a caravan policy.
  • of the insurance features, the contribution-per-policy and number-of-policies attributes are highly correlated.
  • many attributes are sparsely used: for 37 of the 43 policy-related attributes (including the caravan policy ownership attribute), more than 90% of the records have the same single value (mainly 0).
  • the vast majority of the clients buy only a small portion of the policy portfolio: fire, car and third party insurance.
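
The sparsity check described above can be sketched in a few lines of Python. This is only an illustration on toy records: the target name `Caravan`, the helper names and the 20-row sample are assumptions, not part of the original analysis.

```python
from collections import Counter

def dominant_value_share(values):
    """Return (most common value, fraction of records carrying it)."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)

def sparse_attributes(records, attributes, threshold=0.90):
    """Attributes for which one value covers more than `threshold` of the records."""
    sparse = []
    for name in attributes:
        _, share = dominant_value_share([r[name] for r in records])
        if share > threshold:
            sparse.append(name)
    return sparse

# Toy data (hypothetical): 20 records with three attributes.
records = [{"Pbrand": 0, "Aplezier": 0, "Caravan": 0} for _ in range(20)]
for r in records[:10]:
    r["Pbrand"] = 3          # half of the clients pay a fire contribution
records[-1]["Caravan"] = 1   # a single caravan policy holder

print(sparse_attributes(records, ["Pbrand", "Aplezier", "Caravan"]))
print(sum(r["Caravan"] for r in records) / len(records))  # share of holders
```

On the real data the same loop would run over the 43 policy-related attributes.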
   
Data quality: Close inspection of the database reveals several data quality problems that have to be solved. They can be summarized as:
  • Missing values: crucial values, e.g. postal code, are missing.
  • Conflicting information: two records for the same client (different client number, same address data, overlapping contact dates) with conflicting information.
  • Complementary information: two records for the same client (different client number, same address data, overlapping contact dates) with complementary information (e.g. one record describes a fire policy contact, another record a third party insurance for the same client).
  • Completeness: many items in the records were not registered; "Don’t know" values were entered instead.
Records with missing values or conflicting information were all removed from the data set, and complementary records were merged. This results in a data set of 5832 records for learning.
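
A minimal sketch of these cleansing rules, assuming records are dictionaries and that two records belong to the same client when their address fields match. The field names (`client_no`, `postal_code`, `house_no`, `fire`, `third_party`) are hypothetical, and the check for overlapping contact dates is left out for brevity.

```python
from collections import defaultdict

KEY_FIELDS = ("postal_code", "house_no")  # hypothetical address key
ID_FIELD = "client_no"                    # hypothetical client number field

def merge_group(group):
    """Merge all records of one address; return None if any field conflicts."""
    merged = {}
    for record in group:
        for field, value in record.items():
            if field == ID_FIELD:
                continue
            current = merged.get(field)
            if value is None:
                merged.setdefault(field, None)   # nothing known yet
            elif current is None:
                merged[field] = value            # complementary information
            elif current != value:
                return None                      # conflicting information
    merged[ID_FIELD] = group[0][ID_FIELD]        # keep the first client number
    return merged

def cleanse(records):
    groups = defaultdict(list)
    for record in records:
        if any(record.get(k) is None for k in KEY_FIELDS):
            continue                             # crucial value missing: drop
        groups[tuple(record[k] for k in KEY_FIELDS)].append(record)
    merged = (merge_group(g) for g in groups.values())
    return [m for m in merged if m is not None]

raw = [
    {"client_no": 1, "postal_code": "1011", "house_no": 5, "fire": 1, "third_party": None},
    {"client_no": 2, "postal_code": "1011", "house_no": 5, "fire": None, "third_party": 1},
    {"client_no": 3, "postal_code": "2022", "house_no": 7, "fire": 1, "third_party": 0},
    {"client_no": 4, "postal_code": "2022", "house_no": 7, "fire": 0, "third_party": 0},
    {"client_no": 5, "postal_code": None, "house_no": 9, "fire": 1, "third_party": 1},
    {"client_no": 6, "postal_code": "3033", "house_no": 1, "fire": 0, "third_party": 1},
]
cleaned = cleanse(raw)
print(len(cleaned))
```

In this toy run clients 1 and 2 are merged, clients 3 and 4 are dropped as conflicting, client 5 is dropped for the missing postal code, and client 6 is kept unchanged.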
   
Data enrichment: The next step is to combine the data set with information available from other sources. MIC subscribes to the Mosaic information service, which allows socio-demographic information about clients to be added on the basis of postal code. This extends the data set with another 43 fields.
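
The enrichment step amounts to a lookup on postal code. The sketch below assumes the Mosaic data is available as a dictionary keyed by postal code and uses only two of the 43 fields; the field values, the `enrich` helper and the choice to fill unknown postal codes with None are illustrative assumptions.

```python
MOSAIC_FIELDS = ("Mostype", "Mkoopkla")  # two of the 43 fields, for illustration

def enrich(clients, mosaic_by_postal_code):
    """Attach postal-code level socio-demographic fields to each client record."""
    enriched = []
    for client in clients:
        demographics = mosaic_by_postal_code.get(client["postal_code"], {})
        row = dict(client)  # copy, so the input records stay untouched
        for field in MOSAIC_FIELDS:
            row[field] = demographics.get(field)  # None if the code is unknown
        enriched.append(row)
    return enriched

clients = [
    {"client_no": 1, "postal_code": "1011", "Pbrand": 4},
    {"client_no": 2, "postal_code": "9999", "Pbrand": 0},
]
mosaic = {"1011": {"Mostype": 8, "Mkoopkla": 7}}  # hypothetical Mosaic extract
enriched = enrich(clients, mosaic)
print(enriched[0])
```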

When inspecting the data of clients (Successful hedonist, Retired & religious), it is important to distinguish between the actual data (numeric, with discrete classes for most attributes, e.g. a 4 for attribute 62, PFIETS, Contribution bicycle policies), the meaning of the data (the result of replacing the numeric attribute values with the real values, in this case a contribution of f 200 – 499), and the interpretation of the data (a text containing an interpreted description of the client).

   
Feature selection and construction: To reduce the complexity of the problem, the data set can be transformed into one of lower dimension. We generated cross tabulations for all 85 variables and manually selected 9 of them (Apersaut is used twice, which gives the 10 indicators listed below). This step could be automated, e.g. by ranking the variables with statistical measures of importance (information gain, Gini index, etc.). However, reading 85 automatically generated tables takes just a few minutes and gives a much better picture of the data than a set of statistical coefficients. The selected and transformed variables are:
  • Customer_subtype: X1 = Mostype in [1, 6, 8, 12]
  • Customer_main_type: X2 = Moshoofd = 2
  • High_level_education: X3 = Moplhoog > 3
  • Purchasing_power_class: X4 = Mkoopkla >6
  • Contribution_car_policies: X5 = Ppersaut = 6
  • Contribution_fire_policies: X6 = Pbrand = 4
  • Contribution_boat_policies: X7 = Pplezier > 0
  • Number_of_cars: X8 = Apersaut > 0
  • Number_of_car_policies: X9 = Apersaut > 1
  • Number_of_boat_policies: X10 = Aplezier > 0
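
The ten indicators can be written down directly as boolean tests on a record. The sketch below assumes each client is a dictionary keyed by the attribute names used in the list; the target key `Caravan`, the helper names and the three sample clients are hypothetical. It also shows a small cross tabulation of one indicator against policy ownership.

```python
from collections import Counter

def indicators(r):
    """Map one client record to the binary indicators X1..X10."""
    return {
        "X1": int(r["Mostype"] in (1, 6, 8, 12)),
        "X2": int(r["Moshoofd"] == 2),
        "X3": int(r["Moplhoog"] > 3),
        "X4": int(r["Mkoopkla"] > 6),
        "X5": int(r["Ppersaut"] == 6),
        "X6": int(r["Pbrand"] == 4),
        "X7": int(r["Pplezier"] > 0),
        "X8": int(r["Apersaut"] > 0),
        "X9": int(r["Apersaut"] > 1),
        "X10": int(r["Aplezier"] > 0),
    }

def cross_tab(records, indicator, target="Caravan"):
    """Count records per (indicator value, target value) pair."""
    return Counter((indicators(r)[indicator], r[target]) for r in records)

base = {"Mostype": 8, "Moshoofd": 2, "Moplhoog": 5, "Mkoopkla": 7, "Ppersaut": 6,
        "Pbrand": 4, "Pplezier": 0, "Apersaut": 1, "Aplezier": 0, "Caravan": 1}
sample = [
    base,
    {**base, "Moplhoog": 1, "Caravan": 0},
    {**base, "Moplhoog": 4, "Caravan": 0},
]
print(indicators(base))
print(cross_tab(sample, "X3"))
```

Dividing the count for (1, 1) by the total for indicator value 1 gives the share of caravan policy holders among the clients that satisfy the indicator.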

Additionally, a new attribute was introduced on the basis of common sense: mean contribution per policy.

The selected variables can be viewed as relevant for buying or not buying the caravan policy. For example, among the clients with High_level_education > 3 there are almost 12% of caravan policy holders (83 of 630). Given that this reduced set performs as well as the extended set in some small-scale classification experiments, the information it contains also gives an indication of what characterizes caravan policy customers:

  • Highly educated persons in a higher prosperity class, with reasonable purchasing power, who own cars and boats insured with a policy at MIC.

    Case Study The Caravan Policy Case