The Caravan Case: Data Cleansing
Data inspection: A descriptive analysis of the data reveals that:
- 5.9% of the clients in the training set own a caravan policy;
- among the insurance features, the contribution-per-policy and number-of-policies
attributes are highly correlated;
- many attributes are sparsely used: for 37 of the 43 policy-related attributes
(including the caravan policy ownership attribute), more than 90% of the records
share a single value (mainly 0);
- the vast majority of the clients buy only a small part of the policy portfolio: fire,
car and third-party insurance.
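
The three kinds of figures listed above (ownership rate, sparsely used attributes, correlation between contribution and number attributes) can be reproduced with a few lines of code. The following is a minimal sketch using only the Python standard library. The toy records and the target name "Caravan" are assumptions made for illustration and are not taken from the MIC data.

```python
from collections import Counter
from math import sqrt

def ownership_rate(records, target):
    """Share of records whose target attribute is non-zero."""
    return sum(1 for r in records if r[target] > 0) / len(records)

def dominant_share(records, attribute):
    """Share of records holding the single most common value of an attribute."""
    counts = Counter(r[attribute] for r in records)
    return counts.most_common(1)[0][1] / len(records)

def sparse_attributes(records, attributes, threshold=0.9):
    """Attributes for which one value covers more than `threshold` of the records."""
    return [a for a in attributes if dominant_share(records, a) > threshold]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data (made up): (Ppersaut, Apersaut, Pplezier, Caravan) and how often it occurs.
patterns = [((6, 1, 0, 0), 7), ((0, 0, 0, 0), 9), ((6, 2, 0, 1), 1),
            ((5, 1, 0, 0), 2), ((6, 1, 4, 0), 1)]
names = ["Ppersaut", "Apersaut", "Pplezier", "Caravan"]
records = [dict(zip(names, values)) for values, count in patterns for _ in range(count)]

rate = ownership_rate(records, "Caravan")
sparse = sparse_attributes(records, names)
corr = pearson([r["Ppersaut"] for r in records], [r["Apersaut"] for r in records])
print(f"ownership rate: {rate:.1%}")
print("sparse attributes:", sparse)
print(f"correlation Ppersaut/Apersaut: {corr:.2f}")
```

On real data the same functions would be applied to all 43 policy-related attributes, pairing each contribution attribute with its number-of-policies counterpart.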
Data quality: Close inspection of the database shows that several data quality
problems have to be solved in the data set. The problems can be summarized as follows:
- Missing values: crucial values, e.g. postal code, are missing.
- Conflicting information: two records for the same client (different client number,
same address data, overlapping contact dates) with conflicting information.
- Complementary information: two records for the
same client (different client number, same address data, overlapping contact
dates) with complementary information (e.g. one record describes a fire policy
contact, another record a third-party insurance for the same client).
- Completeness: many items in records were not
registered; "don't know" values were entered for them.
Data records suffering from missing values or
conflicting information were all removed from the data set, and complementary records were
merged. This results in a data set of 5832 records for learning.
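
The cleansing rule described above (remove records with missing crucial values, remove conflicting duplicates, merge complementary ones) might be sketched as follows. All field names (client_id, postal_code, house_number, fire, third_party) are hypothetical, unregistered items are represented by None, and the check on overlapping contact dates is left out for brevity.

```python
def cleanse(records, key_fields=("postal_code", "house_number")):
    """Drop records with missing key values or conflicts; merge complementary ones."""
    groups = {}
    for rec in records:
        if any(rec.get(f) is None for f in key_fields):
            continue  # a crucial value is missing: remove the record
        key = tuple(rec[f] for f in key_fields)
        groups.setdefault(key, []).append(rec)

    cleaned = []
    for recs in groups.values():
        merged, conflict = {}, False
        for rec in recs:
            for field, value in rec.items():
                if field == "client_id" or value is None:
                    continue  # client numbers differ by design; None means "not registered"
                if field in merged and merged[field] != value:
                    conflict = True  # same address, different value: conflicting information
                merged.setdefault(field, value)
        if not conflict:
            merged["client_ids"] = [r["client_id"] for r in recs]
            cleaned.append(merged)
    return cleaned

# Toy records (made up for illustration).
records = [
    {"client_id": 1, "postal_code": "1011", "house_number": 5, "fire": 1, "third_party": None},
    {"client_id": 2, "postal_code": "1011", "house_number": 5, "fire": None, "third_party": 1},
    {"client_id": 3, "postal_code": "2022", "house_number": 8, "fire": 1, "third_party": 0},
    {"client_id": 4, "postal_code": "2022", "house_number": 8, "fire": 0, "third_party": 0},
    {"client_id": 5, "postal_code": None, "house_number": 3, "fire": 1, "third_party": 1},
    {"client_id": 6, "postal_code": "3033", "house_number": 1, "fire": 0, "third_party": 1},
]
cleaned = cleanse(records)
for rec in cleaned:
    print(rec)
```

Here clients 1 and 2 are merged into one record, clients 3 and 4 are dropped because they disagree on the fire policy, and client 5 is dropped for lack of a postal code.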
Data enrichment: The next step is to combine the data set with information
available from other sources. MIC subscribes to the Mosaic information service, and
can therefore add socio-demographic information
about clients on the basis of postal code. This extends the data set with another 43 fields.
When inspecting the data of clients (Successful hedonist, Retired & religious), it is important to
realize the difference between the actual data (numeric, with discrete classes for most
attributes, e.g. a 4 for attribute 62, PFIETS, Contribution bicycle policies), the meaning of the
data (the result of replacing the numeric attribute values with the real value, in this
case a contribution of f 200 – 499) and the interpretation of the data (a text containing an
interpreted description of the client).
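
The step from actual data to meaning is a table lookup. The sketch below illustrates it for the one example given above. Only the entry for code 4 is taken from the text. The remaining class codes would have to come from the data dictionary and are deliberately left out, and the function name describe is hypothetical.

```python
# Attribute number -> (short name, description); only attribute 62 is given in the text.
ATTRIBUTES = {62: ("PFIETS", "Contribution bicycle policies")}

# Class code -> real value. Only code 4 is confirmed by the text above.
CONTRIBUTION_CLASSES = {4: "f 200 \u2013 499"}

def describe(attribute_no, code):
    """Turn an (attribute number, class code) pair into a readable statement."""
    name, label = ATTRIBUTES[attribute_no]
    meaning = CONTRIBUTION_CLASSES.get(code, f"class {code} (not in lookup table)")
    return f"{label} ({name}): {meaning}"

decoded = describe(62, 4)
print(decoded)
```

The interpretation level (a text describing the client as a whole) cannot be derived mechanically in this way and is written by an analyst.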
Feature selection and construction: In order to reduce the complexity of the problem, the dataset
can be transformed into a dataset of lower dimension. We generated cross tabulations for all 85 variables
and manually selected 9 of them. This step could be done automatically, e.g. by ranking the
variables with statistical measures (information gain, Gini
index, etc.). However, reading 85 automatically generated tables takes just a few
minutes and gives a much better picture of the data than a handful of statistical coefficients.
The selected and transformed variables are:
- Customer_subtype: X1 = Mostype in [1, 6, 8, 12]
- Customer_main_type: X2 = Moshoofd = 2
- High_level_education: X3 = Moplhoog > 3
- Purchasing_power_class: X4 = Mkoopkla > 6
- Contribution_car_policies: X5 = Ppersaut = 6
- Contribution_fire_policies: X6 = Pbrand = 4
- Contribution_boat_policies: X7 = Pplezier > 0
- Number_of_cars: X8 = Apersaut > 0
- Number_of_car_policies: X9 = Apersaut > 1
- Number_of_boat_policies: X10 = Aplezier > 0
Additionally, a new attribute was introduced on the basis of common sense: mean
contribution per policy.
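
The ten conditions listed above translate directly into code. The sketch below encodes each condition as a 0/1 indicator, which is an assumption about the representation; the text only states the conditions. The example client is made up.

```python
def indicators(client):
    """Return the ten 0/1 indicator variables X1..X10 for one client record."""
    rules = {
        "X1": client["Mostype"] in (1, 6, 8, 12),   # Customer subtype
        "X2": client["Moshoofd"] == 2,              # Customer main type
        "X3": client["Moplhoog"] > 3,               # High level education
        "X4": client["Mkoopkla"] > 6,               # Purchasing power class
        "X5": client["Ppersaut"] == 6,              # Contribution car policies
        "X6": client["Pbrand"] == 4,                # Contribution fire policies
        "X7": client["Pplezier"] > 0,               # Contribution boat policies
        "X8": client["Apersaut"] > 0,               # At least one car policy
        "X9": client["Apersaut"] > 1,               # More than one car policy
        "X10": client["Aplezier"] > 0,              # At least one boat policy
    }
    return {name: int(flag) for name, flag in rules.items()}

# Hypothetical client record with the nine source attributes.
client = {"Mostype": 8, "Moshoofd": 2, "Moplhoog": 4, "Mkoopkla": 7,
          "Ppersaut": 6, "Pbrand": 4, "Pplezier": 0, "Apersaut": 1, "Aplezier": 0}
features = indicators(client)
print(features)
```

Note that X8 and X9 are both derived from Apersaut, so the nine selected source attributes yield ten indicators.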
The selected variables can be viewed as relevant for buying or not buying the caravan policy. For
example, among the clients with High_level_education > 3, almost 12%
are caravan policy holders (83 of 630). Given that this reduced set performs as well as the
extended set in some small-scale classification experiments, the information it contains
also gives an indication of what characterizes caravan policy customers:
highly educated people in a higher prosperity class, with reasonable purchasing power,
who own cars and boats insured at MIC.
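
A cross tabulation of one variable against caravan ownership, of the kind read by hand during feature selection, could be generated along these lines. The records are toy data, and the column names "X3" and "Caravan" are assumptions for illustration.

```python
from collections import defaultdict

def cross_tab(records, attribute, target):
    """Per attribute value: (number of clients, number of holders, holder rate)."""
    counts = defaultdict(lambda: [0, 0])
    for r in records:
        cell = counts[r[attribute]]
        cell[0] += 1                             # clients with this attribute value
        cell[1] += 1 if r[target] > 0 else 0     # of which policy holders
    return {value: (n, h, h / n) for value, (n, h) in sorted(counts.items())}

# Toy data (made up): 50 clients, 10 of them with the indicator X3 set.
records = ([{"X3": 1, "Caravan": 1}] * 2 + [{"X3": 1, "Caravan": 0}] * 8 +
           [{"X3": 0, "Caravan": 1}] * 2 + [{"X3": 0, "Caravan": 0}] * 38)

table = cross_tab(records, "X3", "Caravan")
for value, (n, holders, rate) in table.items():
    print(f"X3={value}: {holders} of {n} clients hold a caravan policy ({rate:.1%})")
```

Comparing the holder rate per value with the overall rate shows at a glance whether a variable separates policy holders from the rest.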
Case Study: The Caravan Policy Case