The goal of the subgroup discovery algorithm SD, outlined in Figure 1, is to search for rules that maximize $q_g = TP/(FP + g)$,
where TP are true positives, FP are false positives, and g is a generalization parameter. High-quality rules cover many target class examples and few non-target class examples. The number of tolerated non-target class cases, relative to the number of covered target class cases, is determined by the parameter g. For low g ($g \leq 1$), induced rules will have high specificity (low false alarm rate), since covering every single non-target class example is made relatively `expensive'. On the other hand, by selecting a high g value ($g > 10$ for small domains), more general rules will be generated, which also cover non-target class instances.
Varying the value of g enables the expert to guide subgroup discovery in the TP/FP space, in which FP (plotted on the X-axis) needs to be minimized and TP (plotted on the Y-axis) needs to be maximized. The TP/FP space is similar to the ROC (Receiver Operating Characteristic) space (Provost & Fawcett, 2001). The comparison of the ROC and TP/FP spaces, and the $q_g$ heuristic, are analyzed in detail in Sections 2.4 and 4, respectively. There is no theoretical upper value for the user-defined g parameter, but in practice the upper limit should not exceed the number of training examples. For instance, the suggested g values in the Data Mining Server are in the range between 0.1 and 100, for analysing data sets of up to 250 examples. The choice of g should be adjusted both to the size of the data set and to the proportion of positive examples in it.
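To make the effect of g concrete, the following sketch (with invented TP/FP counts, not taken from the paper's experiments) computes $q_g$ for a specific and a more general candidate rule; a low g prefers the specific rule, while a sufficiently high g prefers the general one:

```python
# Illustrative sketch of the q_g = TP / (FP + g) quality measure.
# The two rules and their TP/FP counts are invented for illustration.

def q_g(tp: int, fp: int, g: float) -> float:
    """Quality of a rule covering tp target and fp non-target examples."""
    return tp / (fp + g)

specific = (20, 2)    # (TP, FP): few false alarms, high specificity
general = (45, 25)    # (TP, FP): broader coverage, more false alarms

for g in (1.0, 100.0):
    qs, qn = q_g(*specific, g), q_g(*general, g)
    print(f"g={g:g}: q_g(specific)={qs:.2f}, q_g(general)={qn:.2f}")
# g=1:   the specific rule wins (6.67 vs. 1.73)
# g=100: the general rule wins (0.20 vs. 0.36)
```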
Algorithm SD takes as its input the complete training set E and the feature set L, where features are logical conditions constructed from the attribute values describing the examples in E. For discrete (categorical) attributes, features have the form Attribute = value or Attribute $\neq$ value; for numerical attributes, they have the form Attribute > value or Attribute $\leq$ value. To formalize
feature construction, let values $v_{ix}$ ($x = 1, \ldots, k_{ip}$) denote the $k_{ip}$ different values of attribute $A_i$ that appear in the positive examples, and $w_{iy}$ ($y = 1, \ldots, k_{in}$) the $k_{in}$ different values of $A_i$ appearing in the negative examples. A set
of features L is constructed as follows:

- for discrete attributes $A_i$, features of the form $A_i = v_{ix}$ and $A_i \neq w_{iy}$ are generated;
- for continuous attributes $A_i$, features of the form $A_i \leq (v_{ix} + w_{iy})/2$ are generated for all neighbouring value pairs ($v_{ix}$, $w_{iy}$), and features of the form $A_i > (v_{ix} + w_{iy})/2$ for all neighbouring value pairs ($w_{iy}$, $v_{ix}$);
- for integer-valued attributes $A_i$, features are generated as if $A_i$ were both discrete and continuous.
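As an illustration of this construction, here is a small Python sketch (the helper names discrete_features and continuous_features are ours, not the paper's) that generates features for a single attribute from the value sets defined above:

```python
# Sketch of feature construction for one attribute. pos_values/neg_values
# are the values v_ix / w_iy observed in positive / negative examples.
# Features are represented as (attribute, operator, value) triples.

def discrete_features(attr, pos_values, neg_values):
    feats = [(attr, "=", v) for v in sorted(set(pos_values))]
    feats += [(attr, "!=", w) for w in sorted(set(neg_values))]
    return feats

def continuous_features(attr, pos_values, neg_values):
    # Thresholds are midpoints between neighbouring positive/negative values.
    labeled = sorted({(v, "+") for v in pos_values} | {(w, "-") for w in neg_values})
    feats = []
    for (a, la), (b, lb) in zip(labeled, labeled[1:]):
        if la != lb:                      # class changes between neighbours
            mid = (a + b) / 2
            feats.append((attr, "<=" if la == "+" else ">", mid))
    return feats

print(discrete_features("Colour", ["red", "blue"], ["green"]))
print(continuous_features("Age", [30, 42], [55, 61]))  # [('Age', '<=', 48.5)]
```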
Algorithm SD has two additional parameters which are typically not adjusted by the user. The first is min_support (default value $\sqrt{P}/|E|$, where P is the number of target class examples in E), which indirectly defines the minimal number of target class examples that must be covered by every subgroup. The second is beam_width (default value 20), which defines the number of solutions kept in each iteration. The output of the algorithm is a set S of beam_width different rules with the highest $q_g$ values. The rules have the form of conjunctions of features from L.
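In code, these defaults might look as follows (a sketch under the assumption that support is measured relative to the whole training set, so the $\sqrt{P}/|E|$ threshold forces every rule to cover at least roughly $\sqrt{P}$ target class examples; the data set sizes are invented):

```python
import math

P, E_size = 50, 250                    # invented data set: 50 positives out of 250
min_support = math.sqrt(P) / E_size    # ~0.028, i.e. at least ~7 covered positives
beam_width = 20                        # number of solutions kept per iteration
```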
The algorithm initializes all the rules in Beam and New_Beam with empty rule conditions, and their quality values $q_g(i)$ are set to zero (step 1). Rule initialization is followed by a loop (steps 2-12) that stops when, for all rules in the beam, it is no longer possible to improve their quality further. Rules can be improved only by conjunctively adding features from L. After the first iteration, a rule condition consists of a single feature, after the second iteration of up to two features, and so forth. The search is systematic in the sense that for all rules in the beam (step 3), all features from L (step 4) are tested in each iteration. For every new rule, constructed by conjunctively adding a feature to the rule body (step 5), the quality $q_g$ is computed (step 6). If the support of the new rule
is greater than min_support and if its quality $q_g$ is greater than the quality of any rule in New_Beam, the worst rule in New_Beam is replaced by the new rule. The rules are reordered in New_Beam according to their quality $q_g$. At the end of each iteration, New_Beam is copied into Beam (step 11). When the algorithm terminates, the first rule in Beam is the rule with the maximum $q_g$.
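The loop described above can be condensed into the following Python sketch (an illustrative reimplementation, not the authors' code; the relevance test of step 7 is deferred to the sketch after the next paragraph, and `covers` is a hypothetical helper matching the feature triples used earlier):

```python
import math

def covers(rule, example):
    # A rule is a frozenset of (attribute, operator, value) feature triples;
    # an example is a dict mapping attribute names to values.
    ops = {"=": lambda a, b: a == b, "!=": lambda a, b: a != b,
           "<=": lambda a, b: a <= b, ">": lambda a, b: a > b}
    return all(ops[op](example[attr], val) for attr, op, val in rule)

def sd(examples, labels, features, g=5.0, beam_width=20, min_support=None):
    # labels: 1 for target class examples, 0 for non-target class examples.
    if min_support is None:
        P = sum(labels)
        min_support = math.sqrt(P) / len(examples)  # default from the text

    def stats(rule):
        tp = sum(1 for e, y in zip(examples, labels) if y and covers(rule, e))
        fp = sum(1 for e, y in zip(examples, labels) if not y and covers(rule, e))
        return tp / (fp + g), tp

    # Step 1: both beams start as beam_width copies of the empty rule, quality 0.
    beam = [(0.0, frozenset())] * beam_width
    while True:
        new_beam, improved = list(beam), False
        for _, rule in beam:                       # step 3: all rules in Beam
            for f in features:                     # step 4: all features in L
                new_rule = rule | {f}              # step 5: conjunctive refinement
                q, tp = stats(new_rule)            # step 6: compute q_g
                worst = min(range(beam_width), key=lambda i: new_beam[i][0])
                if (tp / len(examples) > min_support          # support check
                        and q > new_beam[worst][0]            # quality check
                        and all(r != new_rule for _, r in new_beam)):
                    new_beam[worst] = (q, new_rule)  # replace worst rule in New_Beam
                    improved = True
        new_beam.sort(key=lambda t: -t[0])         # step 8: reorder by quality
        beam = new_beam                            # step 11: copy into Beam
        if not improved:                           # no rule improved: loop ends
            return beam                            # first rule has maximum q_g
```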
A necessary condition (in step 7) for a rule to be included in New_Beam is that it must be relevant. The new rule is irrelevant if there exists a rule R in New_Beam such that the true positives of the new rule are a subset of the true positives of R and the false positives of the new rule are a superset of the false positives of R. A detailed analysis of relevance, presented by Lavrač et al. (1998), is outside the main scope of this paper. After the new rule is included in New_Beam, it may happen that some of the existing rules in New_Beam become irrelevant with respect to this new rule. Such rules are eliminated from New_Beam during its reordering (in step 8). The testing of relevance ensures that New_Beam contains only different and relevant rules.
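A minimal sketch of this irrelevance test, representing each rule by the sets of examples it covers (the helper name is ours, not the paper's):

```python
def is_irrelevant(new_tp, new_fp, beam_rules):
    """new_tp/new_fp: sets of true/false positives covered by the candidate rule.
    beam_rules: iterable of (tp_set, fp_set) pairs for rules in New_Beam."""
    return any(new_tp <= tp and new_fp >= fp for tp, fp in beam_rules)

beam = [({1, 2, 3}, {7})]                   # one rule: TPs {1,2,3}, FPs {7}
print(is_irrelevant({1, 2}, {7, 8}, beam))  # True: dominated by the beam rule
print(is_irrelevant({1, 4}, {7}, beam))     # False: covers a new true positive
```

The same dominance test, applied in the other direction, eliminates existing beam rules that become irrelevant after a new rule is inserted.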
In Algorithm SD, the rule quality measure $q_g$ serves two purposes: first, rule evaluation, and second, the evaluation of features and their conjunctions with high potential for the construction of high-quality rules in subsequent iterations. The analysis of this quality measure in Section 4 shows that for the first purpose a measure assigning different costs to false positives and false negatives could perform equally well, but for the purpose of guiding the search the $q_g$ measure is advantageous.