|
Glossary
- Accuracy
The Accuracy is a standard measure to evaluate the quality of hypotheses
in the context of concept learning.
Hypotheses are classifiers, determining
if instances of an underlying universe of discourse belong to a specific concept or
not.
Algorithms addressing the task of concept learning usually output such a hypothesis
after processing a given training set. A training set consists of classified instances,
i.e. for each instance it is known whether it belongs to the target concept or not.
To estimate the quality of such a hypothesis, another set, a so called test set,
is separated from the set of available (classified) examples, and the learning algorithm
is only provided with the remaining examples. Afterwards the (yet unseen) examples of the
test set are classified by the hypothesis and the predicted class of each instance is compared
to the correct one. The accuracy is the ratio of correctly classified instances to the number of instances
in the test set.
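A minimal Python sketch of this estimate (illustrative only; it assumes a hypothesis is a callable mapping an instance to its predicted class and the test set is a list of (instance, class) pairs):
def accuracy(hypothesis, test_set):
    """Ratio of correctly classified instances to the number of test instances."""
    correct = sum(1 for x, y in test_set if hypothesis(x) == y)
    return correct / len(test_set)

# Example: a trivial hypothesis that always predicts +1.
test_set = [([1.0, 2.0], +1), ([0.5, -1.0], -1), ([2.0, 0.1], +1)]
print(accuracy(lambda x: +1, test_set))  # 2 of 3 correct -> 0.666...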
more...
- AdaBoost
AdaBoost is a boosting algorithm, running a given weak learner several
times on slightly altered training data, and combining the hypotheses
to one final hypothesis, in order to achieve higher accuracy than the
weak learner's hypothesis would have.
The main idea of AdaBoost is to assign each example of the given
training set a weight.
At the beginning all weights are equal, but in every round the
weak learner returns a hypothesis, and the weights of all examples
classified wrong by that hypothesis are increased.
That way the weak learner is forced to focus on the difficult
examples of the training set.
The final hypothesis is a combination of the hypotheses of all rounds,
namely a weighted majority vote, where hypotheses with lower
classification error have higher weight.
In detail:
Given
-
a set E = { (x1, y1), ..., (xn,
yn) } of classified examples, where xi ∈
X and yi ∈ Y, for i = 1, ..., n. Here we assume
Y = {-1, +1}, e.g. instances that are not covered by a concept to be
learned have label -1 and the ones covered have label +1.
-
a weak learning algorithm that can deal with weighted example sets.
Such a learning algorithm reads an example set E and a distribution
D. In the most simple case, where all hypotheses that can be output
are functions X → {-1, +1}, the algorithm tries to find a
hypothesis h with minimal probability of misclassification, given
that an example is drawn from X with respect to D. The case of other
possible hypotheses can be addressed by using more complex error
measures.
The algorithm:
Let Dt(i) denote the weight of example i in round t.
- Initialization: Assign each example (xi,
yi) ∈ E the weight
D1(i) := 1/n.
-
For t = 1 to T:
- Call the weak learning algorithm with example set E and weights
given by Dt.
- Get a weak hypothesis ht : X → ℝ.
- Update the weights of all examples.
- Output the final hypothesis, generated from the hypotheses of
rounds 1 to T.
Updating the weights in round t:
Dt+1(i) := Dt(i) * exp(-αt yi ht(xi)) / Zt , where
- Zt is chosen such that Dt+1 is a distribution.
- αt is chosen according to the importance of hypothesis ht.
For ht: X → {-1, +1}, αt is usually chosen as
αt := 1/2 * ln( (1 - εt) / εt ),
where εt denotes the weighted classification error of hypothesis ht.
The final hypothesis H: X → {-1, +1} is chosen as
H(x) = sign( ∑t=1,...,T αt ht(x) ).
For more details, please refer to the publication link above.
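A compact Python sketch of the loop above (illustrative, not a reference implementation; the name weak_learner is a hypothetical stand-in for any learner that accepts an example set and a weight distribution and returns a hypothesis mapping an instance to -1 or +1):
import math

def adaboost(examples, weak_learner, T):
    n = len(examples)
    D = [1.0 / n] * n                           # D1(i) := 1/n
    hypotheses, alphas = [], []
    for _ in range(T):
        h = weak_learner(examples, D)           # weak hypothesis ht
        eps = sum(D[i] for i, (x, y) in enumerate(examples) if h(x) != y)
        eps = min(max(eps, 1e-10), 1 - 1e-10)   # avoid division by zero / log(0)
        alpha = 0.5 * math.log((1 - eps) / eps) # importance of ht
        # Dt+1(i) := Dt(i) * exp(-alpha * yi * ht(xi)) / Zt
        D = [D[i] * math.exp(-alpha * y * h(x)) for i, (x, y) in enumerate(examples)]
        Z = sum(D)
        D = [d / Z for d in D]
        hypotheses.append(h)
        alphas.append(alpha)
    def H(x):                                   # final weighted majority vote
        s = sum(a * h(x) for a, h in zip(alphas, hypotheses))
        return 1 if s >= 0 else -1
    return H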
more...
- Agents
Intelligent agents are possibly mobile programs that optimize their behavior according to a given performance measure.
Operators of a knowledge discovery process have also been implemented in the form of software agents.
more...
- Algorithmic issues
-
more...
- Analysis tasks for resource management, quality control and decision making
Analysis tasks for resource management, quality control and decision making are demanded by the health sector.
more...
- APRIORI
APRIORI provides a solution for a subtask of Association Rule Learning.
The subtask can be stated as follows:
Find all sets of items (itemsets) that have transaction support
above minimum support. The support for an itemset is the number
of transactions that contain the itemset. Itemsets with minimum support
are called large itemsets.
We use the following notation:
- Let I = {i1, i2, ..., im}
be a set of literals, called items.
- Let D be a set of transactions, where each transaction
T is a set of items such that T ⊆ I.
- Let X be a set of items in I.
A transaction T is said to contain X, if
X ⊆ T.
Discovering large itemsets
Discovering large itemsets requires making multiple passes over the data.
- In the first pass, we count the support of individual items and
determine which of them are large, i.e. have minimum support.
- Each subsequent pass starts with a seed set of itemsets found to be
large in the previous pass. This seed set is used for generating new
potentially large itemsets, called candidate itemsets.
- The actual support for these candidate itemsets is counted during a
new pass over the data.
- At the end of the pass, we determine which of the candidate itemsets
are actually large. These itemsets become the seed for the next pass.
- This process continues until no new large itemsets are found.
If the fraction of transactions belonging to large itemsets becomes small,
it will be more efficient to store their transaction IDs and just work on
that subset, than to perform passes over the complete dataset.
Algorithm APRIORI
Notation:
- k-itemset
An itemset having k items.
- Lk
Set of large k-itemsets (those with minimum support).
- Ck
Set of candidate k-itemsets (potentially large itemsets).
- L1 = {large 1-itemsets};
- for (k = 2; Lk-1 ≠ ∅; k++) do begin
  - Ck = apriori-gen(Lk-1); // New candidates
  - forall transactions t ∈ D do begin
    - Ct = subset(Ck, t); // Candidates contained in t
    - forall candidates c ∈ Ct do
      - c.count++;
  - end
  - Lk = {c ∈ Ck | c.count ≥ minsup}
- end
- Answer = ∪k Lk
The apriori-gen function takes Lk-1, the set of all large
(k-1)-itemsets, as an argument, and returns a set of candidates for
being large k-itemsets. It exploits the fact that expanding an
itemset cannot increase its support.
A k-itemset can only be large, if all of its
(k-1)-subsets are.
So apriori-gen generates only candidates with this property, which can
easily be achieved given the set Lk-1.
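A minimal Python sketch of the pass structure and of apriori-gen (illustrative only; transactions are represented as frozensets of items and minsup is an absolute support count):
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join (k-1)-itemsets and keep only candidates whose (k-1)-subsets are all large."""
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            union = a | b
            if len(union) == k and all(frozenset(s) in L_prev
                                       for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

def apriori(transactions, minsup):
    items = {frozenset([i]) for t in transactions for i in t}
    L = {c for c in items if sum(c <= t for t in transactions) >= minsup}  # large 1-itemsets
    answer, k = set(L), 2
    while L:
        C = apriori_gen(L, k)                                   # new candidates
        counts = {c: sum(c <= t for t in transactions) for c in C}
        L = {c for c, n in counts.items() if n >= minsup}       # large k-itemsets
        answer |= L
        k += 1
    return answer

# Example with minsup = 2.
ts = [frozenset("abc"), frozenset("ac"), frozenset("abd"), frozenset("bc")]
print(sorted(tuple(sorted(s)) for s in apriori(ts, 2)))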
more...
- AQn
The family of AQ-Algorithms provides methods for concept learning and
conceptual clustering. This section just introduces some base functionality.
For more information, please follow the links.
- AQ applied to Concept Learning
AQ learns concepts from a set P of positive examples and from a set N of negative examples by
the following loop:
Choose a single p ∈ P, generalize it as much as possible, so that no n ∈ N is covered and
remove all covered positive examples from P. This is done until the disjunction of the generalized
examples covers every p ∈ P.
Algorithm:
- Set K := ∅.
- The set P consists of all positive examples.
- Repeat, until P is empty:
- Choose an example p ∈ P.
- Generate STAR({p}|N), a list of maximally general conjunctive descriptions,
separating p from all elements of N.
- Choose a best element b of STAR({p}|N). The quality is measured by a "Lexical
Evaluation Function".
- Set K := K ∪ {b}.
- Delete all p from P, that are covered by b.
- Output the disjunction of all elements of K.
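A minimal Python sketch of this covering loop (illustrative only; star and lef_best are hypothetical stand-ins for STAR generation and the Lexical Evaluation Function, and the chosen descriptions are assumed to offer a covers method):
def aq_concept_learning(positives, negatives, star, lef_best):
    P = set(positives)
    K = []                                        # chosen generalized descriptions
    while P:
        p = next(iter(P))                         # choose a seed example p
        candidates = star(p, negatives)           # maximally general descriptions excluding all negatives
        best = lef_best(candidates, P)            # pick the best one according to the LEF
        K.append(best)
        # every description in the star covers the seed p, so P shrinks in each iteration
        P = {q for q in P if not best.covers(q)}
    return K                                      # disjunction of all elements of K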
- AQ applied to Conceptual Clustering
Following the star method, one possible way to build clusters is to choose (e.g. randomly)
a subset of all given observations, and to distinguish each of its elements from all of the
other elements of the subset.
Algorithm:
Given a set of observations E.
- Choose a subset K ⊆ E.
- For each e ∈ K generate STAR({e}|K\{e}) to distinguish e from all other elements of K.
- Choose and specialize the best concept definitions given by the stars, such that every possible
observation is covered by at most one concept.
This process is guided by a "Lexical Evaluation Function" (LEF), measuring the quality of concepts.
- If the generated concepts do not fit the given requirements, another run with a different subset
of E may be necessary.
- If a partition is suitable according to the LEF, concepts may be refined. This is done by splitting
the set of all observations, covered by a single concept, by another run of the clustering algorithm,
in order to induce a hierarchy of concepts.
more...
- Association Rule Learning
Association rules describe correlation of events and can be regarded as
probabilistic rules. "Correlation of events" means, that events are frequently
observed together.
Discovering association rules in large databases can be a first step of knowledge
discovery in databases (KDD).
A good example from real life are databases of sales transactions, which have become an
important part of marketing infrastructure of many companies. Knowledge about sets of
items frequently bought together helps these companies to develop successful marketing
strategies.
more...
- Attribute-Value
If a learning algorithm uses an attribute-value representation, each
instance of the underlying universe of discourse is described by
the values of a given set of attributes, as common in relational
databases.
Each attribute can be regarded as a function, mapping the
universe of discourse to a specific domain.
If we say the example language of a learning algorithm is an
attribute-value representation, without specifying the type of
attributes, we usually mean, that the domain of each attribute is
finite or nominal and each value is treated by the
learning algorithm without using a specific interpretation.
In contrast, if we talk about numerical attributes, we often think of
attributes having infinite domains with a natural interpretation a
learning algorithm may use.
more...
- Audio categorisation
-
more...
- AutoClass
AutoClass is an unsupervised Bayesian classification system that seeks
a maximum posterior probability classification.
Its inputs consist of a database of attribute vectors (cases), either
real or discrete valued, and a class model. Default class models are
provided. AutoClass finds the set of classes that is maximally
probable with respect to the data and model. The output is a set of
class descriptions, and partial membership of the cases in the
classes.
Please follow the link above to download the sourcecode for Windows
or Unix platforms.
more...
- Bayes Theorem
Bayes Theorem provides a method for calculating P(h | D), denoting the probability of h being correct, given a specific data set D.
Let
- P be a probability distribution,
- h be a hypothesis and
- D be a data set.
If we know
- P(h), the probability of hypothesis h being correct,
- P(D), the probability of data set D being observed, and
- P(D | h), the probability of observing D, under the assumption
of h being correct,
then Bayes Theorem provides a method for calculating P(h | D),
denoting the probability of h being correct, given a specific data
set D.
Bayes Theorem:
P(h | D) = P(D | h) * P(h) / P(D)
An application of Bayes Theorem:
- Given a data set D, find a hypothesis h ∈ H with highest
posterior probability P(h | D).
- Method: Choose a hypothesis h ∈ H that maximizes the
expression P(D | h) * P(h).
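A minimal Python sketch of this MAP selection, using made-up illustrative numbers for the priors and likelihoods:
priors = {"h1": 0.6, "h2": 0.4}          # P(h)
likelihoods = {"h1": 0.02, "h2": 0.09}   # P(D | h)

# Choose the hypothesis maximizing P(D | h) * P(h).
map_hypothesis = max(priors, key=lambda h: likelihoods[h] * priors[h])
print(map_hypothesis)  # h2, since 0.09 * 0.4 = 0.036 > 0.02 * 0.6 = 0.012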
more...
- Bayesian Learning
Bayesian Learning constitutes a probabilistic view of learning,
based on Bayes Theorem. The underlying assumption is, that
there is a set of hypotheses, each having a certain probability
of being correct. Receiving more information changes the probabilities
from a learner's point of view. For instance an observation might
contradict a hypothesis, or strengthen the belief in it.
The aim in this setting is to be able to find a hypothesis with highest
probability of being correct, given a specific set of data / piece of
information.
Bayes Theorem:
Let
- ... P be a probability distribution.
- ... hi, i ∈ {1, ..., n} denote a set of hypotheses.
- ... P(hi) denote the probability of hi being
correct in the general case, that is without being given any additional
information. P(hi) is called the prior probability of
hi.
- ... D denote a set of data.
- ... P(D) be the probability of the information denoted by D being
correct in the general case.
- ... P(D | hi) denote the probability of the information D
being correct, given the correctness of hypothesis hi.
- ... P(hi | D) denote the probability of the correctness of
hypothesis hi, given the additional information D.
This is called the posterior probability of hypothesis hi.
Bayes Theorem:
P(h | D) = P(D | h) * P(h) / P(D)
This theorem allows us to find a hypothesis h' with maximum posterior probability,
given the prior probabilities of all hypotheses and the probabilities of D
being correct under the assumption of each single hypothesis being correct:
h' := argmaxhi [ P(D | hi) * P(hi) ]
more...
- Bayesian Networks
A Bayesian network or Bayesian belief network is a directed acyclic graph of nodes representing variables and arcs representing dependence relations among the variables. If there is an arc from node A to another node B, then we say that A is a parent of B. If a node has a known value, it is said to be an evidence node. A node can represent any kind of variable, be it an observed measurement, a parameter, a latent variable, or a hypothesis. Nodes are not restricted to representing random variables; this is what is "Bayesian" about a Bayesian network.
A Bayesian network is a representation of the joint distribution over all the variables represented by nodes in the graph. Let the variables be X(1), ..., X(n). Let parents(A) be the parents of the node A. Then the joint distribution for X(1) through X(n) is represented as the product of the probability distributions p(X(i) | parents(X(i))) for i from 1 to n. If X has no parents, its probability distribution is said to be unconditional, otherwise it is conditional.
Questions about dependence among variables can be answered by studying the graph alone. It can be shown that the graphical notion called d-separation corresponds to the notion of conditional independence: if nodes X and Y are d-separated (given specified evidence nodes), then variables X and Y are independent given the evidence variables.
In order to carry out numerical calculations, it is necessary to further specify for each node X the probability distribution for X conditional on its parents. The distribution of X given its parents may have any form. However, it is common to work with discrete or Gaussian distributions, since that simplifies calculations.
The goal of inference is typically to find the conditional distribution of a subset of the variables, conditional on known values for some other subset (the evidence), and integrating over any other variables. Thus a Bayesian network can be considered a mechanism for automatically constructing extensions of Bayes' theorem to more complex problems.
Bayesian networks are used for modelling knowledge in gene regulatory networks, medicine, engineering, text analysis, image processing, and decision support systems.
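A minimal Python sketch of the factorization p(X1, ..., Xn) = ∏i p(Xi | parents(Xi)) for a small hypothetical network A → B → C with boolean variables (the probability tables are made-up illustrative numbers):
p_A = {True: 0.3, False: 0.7}                      # unconditional distribution of A
p_B_given_A = {True: {True: 0.8, False: 0.2},      # P(B | A)
               False: {True: 0.1, False: 0.9}}
p_C_given_B = {True: {True: 0.5, False: 0.5},      # P(C | B)
               False: {True: 0.05, False: 0.95}}

def joint(a, b, c):
    """Joint probability as the product of each node's distribution given its parents."""
    return p_A[a] * p_B_given_A[a][b] * p_C_given_B[b][c]

# The joint distribution sums to one over all assignments (up to float rounding).
total = sum(joint(a, b, c) for a in (True, False)
                            for b in (True, False)
                            for c in (True, False))
print(joint(True, True, False), total)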
more...
- Best Region Rules
- The Best Region Rules algorithm addresses the task of predicting the
occurrence of future events on the basis of past events.
The hypotheses of the algorithm are specific rules:
Each rule contains a condition, a target and a lag interval. The
interpretation of such a rule is, that target events occur with a
delay, lying within the given time interval, after condition events.
The rule A[5, 20] => B would e.g. have the interpretation "B occurs
between 5 and 20 time units after occurrences of A".
Measuring the Quality of a Rule
For specific applications, the intuitive quality measure of rules can
vary, so the algorithm offers different measures, which are weighted
by the user.
The easiest of these measures are Prediction Accuracy and Recall Accuracy.
For a specific event sequence the Prediction Accuracy of a rule
is defined as the ratio of
- condition events, followed by a target event within the
given interval
- to all condition events.
The Recall Accuracy is defined very similarly. It measures the
ratio of
- target events following a condition event with
a delay in the given interval,
- to all target events.
These definitions basically reflect the usual meaning of accuracy and
recall. Further measures reward beneficial properties of rules, like
smaller interval regions, higher predictive power, or multiple hits
(e.g. multiple target events matching a single condition event).
Finding Good Rules
The search for good rules is limited to Minimal Regions.
With respect to a given lag set Minimal Regions are minimal lag intervals,
covering the lags of a specific subset of event pairs.
Limiting the search to such regions is reasonable, considering that there are no negative
examples in an event sequence, and that usually larger regions will be
assigned higher quality scores.
The algorithm will first calculate all occurring time lags between different
event types of a given event sequence.
If event A occurs at time points 5, 10 and 20 and event B occurs at
11, 15 and 21, then the resulting lag set is
{1, 5, 6, 10, 11, 16}.
The algorithm will choose all minimal regions for generating rule candidates.
In the example above the rules with condition event A, target event B, and
Minimal Regions are:
A[1, 1] => B, A[1, 5] => B, A[1, 6] => B, ..., A[1, 16] => B
A[5, 5] => B, A[5, 6] => B, ..., A[5, 16] => B
...
A[16, 16] => B
For learning models from event sequences the Best Region Rules algorithm
can calculate the best Minimal Region for each event pair, according to a
given quality measure (weights of the algorithm's measures).
All best rules, exceeding a minimal quality score, form the hypothesis.
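A minimal Python sketch of the first step, computing the lag set between a condition event and a target event (it reproduces the example above; the rest of the algorithm is omitted):
def lag_set(a_times, b_times):
    """All positive delays from an occurrence of A to a later occurrence of B."""
    return sorted({b - a for a in a_times for b in b_times if b > a})

# Example from the text: A occurs at 5, 10 and 20, B occurs at 11, 15 and 21.
print(lag_set([5, 10, 20], [11, 15, 21]))  # [1, 5, 6, 10, 11, 16]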
more...
- Boosting
The notion of Boosting subsumes methods of improving learning
algorithms. If learning algorithms fail to provide arbitrarily
good results for a specific concept learning task, i.e. they fail
to output arbitrarily good hypotheses with arbitrarily high
probability, they often can be "boosted" to perform much better.
In the PAC-learning scenario the following is required for a learning
algorithm A:
- For arbitrary δ and ε greater than 0, and
- for each target concept c of a given hypothesis
language LH,
A has to be able to choose a hypothesis,
- approximating c ε-good,
- with a probability of at least (1 - δ),
- after reading a random training sample of size polynomial in
(1 / δ) and (1 / ε).
Please follow the link to "PAC-Learning", for more details.
The requirements of PAC-Learning are strong, and usually do not hold
in practice for common learning algorithms. Instead one has so
called weak learners, performing just slightly better than random
guessing.
A weak learning algorithm manages,
- for each target concept of a given hypothesis language, to
- output a hypothesis with error below 1/2 - 1/p(|E|),
- with a probability greater than 0,
- provided with a random training sample,
where p is a polynomial.
The aim of boosting is to improve the accuracy (1 - ε above)
reached by a weak learner, and to reduce the probability of outliers
(δ above), i.e. the probability that random samples lead to bad hypotheses.
Therefore boosting algorithms ...
-
run a given weak learner several times, where each round
either the set of examples is slightly altered, e.g. by
shifting the weight of examples, or the algorithm is allowed
to read new examples each time.
-
combine the set of hypotheses to one final hypothesis,
e.g. by using a weighted majority vote.
more...
- Bootstrap
- Bootstrapping provides a method to estimate the quality of a
concept learning algorithm (classifier).
Draw from the data set a new data set with replacement, the sample
being of the same size as the data set. Typically, about 1/e ≈ 37% of the
original data set will be unused in the drawn data set.
Train the classifier on the drawn set, and estimate the error on the
unused parts. Averaging the performance on a large number of
experiments gives the final estimate.
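A minimal Python sketch of this procedure (illustrative only; train and accuracy are hypothetical stand-ins for an arbitrary concept learner and its evaluation on a set of classified examples):
import random

def bootstrap_estimate(data, train, accuracy, rounds=100):
    """Average out-of-bag accuracy over several bootstrap samples."""
    scores = []
    for _ in range(rounds):
        # Draw a sample of the same size as the data set, with replacement.
        indices = [random.randrange(len(data)) for _ in data]
        sample = [data[i] for i in indices]
        # Roughly 1/e of the original examples end up unused ("out of bag").
        drawn = set(indices)
        out_of_bag = [d for j, d in enumerate(data) if j not in drawn]
        if not out_of_bag:
            continue
        hypothesis = train(sample)                        # train on the drawn set
        scores.append(accuracy(hypothesis, out_of_bag))   # estimate on the unused part
    return sum(scores) / len(scores)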
more...
- Bottom-Up Induction of Horn Clauses
A method to induce horn clauses in Inductive Logic Programming,
- covering a given set of positive examples, but
- being as specific as possible,
is to perform a bottom-up search through the hypothesis language.
Scenario:
- All examples are ground facts with the same predicate symbol and arity.
- Background knowledge B may be given in the form of other, extensionally defined
predicates.
- The first example may be considered as a first hypothesis.
- Whenever a new example is read, that cannot be derived from the background
knowledge and the actual horn clause, a generalization step has to be performed.
So the generalization step leads from an insufficiently general horn clause h to
another horn clause h':
- If an example can be derived from B and h, then this example can be derived from
B and h' as well.
- Additionally the new example can be derived by h'.
To find such a hypothesis h' the least general generalization and the
inverse resolution provide suitable procedures.
An algorithm using these procedures is GOLEM.
more...
- C4.5
- C4.5 (Quinlan) is an extension of the base algorithm ID3.
Some of the additional features are:
- incorporation of numerical (continuous) attributes
- nominal (discrete) values of a single attribute may be grouped together,
to support more complex tests
- post-pruning after induction of trees, e.g. based on test sets, in order to increase accuracy
- C4.5 can deal with incomplete information (missing attribute values).
Please refer to the link above for software.
more...
- CANDIDATE-ELIMINATION
The version space consists of all current generalizations,
all current specializations and the hypotheses "in between".
Intuitively speaking, the version space is the set of the remaining
hypotheses still consistent with the given examples and is described
by the set of most general and the set of most specific
hypotheses. As soon as both sets intersect, a target concept is within
the intersection. Supplying more examples will lead to identical
sets. All concepts in this set are possible solutions. Continuing to
supply further examples can still eliminate wrong hypotheses, which
could not be detected as such so far.
For the algorithm it is assumed that the most general concepts in the
set of all hypotheses cover all of the examples, the
most specific concepts exclude all of them, and the hypothesis language
LH is partially ordered with respect to generality.
Normally, LH is an attribute-value representation, with taxonomies
describing the general-to-specific ordering of attribute values.
Often there is more than one solution to a learning task. Nevertheless the
algorithm shown continues until a unique solution is found, which is not
a disadvantage. At any time the current version space can
be considered a partial solution, helping to classify yet
unclassified instances.
The Algorithm
Initialize the set G with the most general concepts, the set S with
the most specific ones.
Until G and S are equal and contain exactly one hypothesis, read an
example e and
if e is a negative example but covered by G
- delete all concepts s in S, which cover the example e.
- specialize G until e is no longer covered.
- delete all g in G, which are strictly more specific than any other
concept in G.
if e is a positive example that is not covered by S
- delete all concepts g in G, which do not cover e.
- generalize S until e is covered.
- delete all s in S, which are strictly more general than any other
concept in S.
Then delete all concepts in G, which are not more general than any concept in S and
delete all concepts in S which are not more specific than any concept in G.
As soon as G and S are equal and each consists of exactly one hypothesis,
this hypothesis is the target concept.
more...
- CART
CART is short for Classification and Regression Trees.
It basically extends decision tree learning to incorporate numerical
values and is able to induce regression trees.
Please follow the links above to learn more about top-down induction
of trees.
more...
- Case-Based Reasoning (CBR)
For information about Case-Based Reasoning, please follow the link to Case-Based Reasoning on the Web.
more...
- Category and Partition Utility
In order to achieve
- high predictability of variable values, given a cluster, and
- high predictiveness of a cluster, given variable values,
the clustering algorithm COBWEB measures the utility of
a cluster (Category Utility) as:
CU(Ck) = P(Ck) * ∑i ∑j [ P(Ai = Vij | Ck)² - P(Ai = Vij)² ]
The utility of a partition of data (Partition Utility) is
defined as:
PU({C1, ..., CN}) = ∑k CU(Ck) / N
The aim of COBWEB is to hierarchically cluster the given observations
(unclassified examples) in such a way that Partition Utility is
maximized at each level. Please refer to the COBWEB-Link above
for a more detailed description.
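A minimal Python sketch of both measures (illustrative only; examples are assumed to be dictionaries of nominal attribute values and a cluster is a list of such examples):
def value_probs(examples):
    """P(Ai = Vij) for every attribute/value pair occurring in the examples."""
    counts = {}
    for e in examples:
        for a, v in e.items():
            counts[(a, v)] = counts.get((a, v), 0) + 1
    n = len(examples)
    return {av: c / n for av, c in counts.items()}

def category_utility(cluster, all_examples):
    p_c = len(cluster) / len(all_examples)
    within = value_probs(cluster)          # P(Ai = Vij | Ck)
    overall = value_probs(all_examples)    # P(Ai = Vij)
    return p_c * sum(within.get(av, 0.0) ** 2 - p ** 2 for av, p in overall.items())

def partition_utility(clusters):
    all_examples = [e for c in clusters for e in c]
    return sum(category_utility(c, all_examples) for c in clusters) / len(clusters)

# Toy usage: two small clusters described by a single nominal attribute.
c1 = [{"color": "red"}, {"color": "red"}]
c2 = [{"color": "blue"}, {"color": "green"}]
print(partition_utility([c1, c2]))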
more...
- Characterization (Descriptive Setting)
The characterization task in Inductive Logic Programming is
about finding interesting regularities in datasets.
If you have a complex dataset, it is often impossible to
directly see such regularities in the data.
Having a description of the same dataset, represented in a
readable formalism (e.g. first order logic rules),
often makes finding interesting regularities a much easier
task.
The requirements for a suited description of a dataset in
the characterization task are high, because the idea is to
find all rules (except redundant ones) that hold, when
applied to the dataset. We are looking for a set of rules
with the following properties:
-
Of course every rule has to be correct according to
the data.
-
Every rule explains at least one observation within
the dataset, that is not explained by any of the
other rules. This helps to avoid unnecessary rules.
-
If a rule is correct according to the data, it is
part of our set, or it can logically be derived from
it. This is very much like finding all interesting
rules; we just focus on those from which all the
others can be derived, which will usually make the
result much smaller and therefore more readable.
-
We do not have unnecessary rules in our set, that is
our set has no subset with the required properties.
more...
- Churn Analysis
- Churn analysis consists of two key tasks:
- Predicting whether a particular customer will churn and when it will happen.
- Understanding why particular customers churn.
These two tasks are referred to as prediction and understanding. They are usually carried out using data mining techniques. By predicting which customers are likely to churn, the telecommunication company can reduce the rate of churn by offering these customers new incentives to stay. By understanding why customers churn, the telco can also work on changing its service so as to satisfy these customers ahead of time. In addition, the chance of the customer churning after action is taken can be assessed by the data mining tool so as to choose the best strategy in terms of cost and effort.
more...
- Clause
A clause is a disjunction of literals, where usually unbound variables
are meant to be all-quantified.
Example:
P(g(X, e), f(c, Y)) ∨ P(g(Y), c) ∨ ¬
Q(f(X, Y), d) ∨ ¬ P(d, f(X, e)), with
- P and Q predicate symbols,
- f and g function symbols,
- X and Y variables, and
- c, d, e constants.
Clauses can be written in implication form. The clause above can
be rewritten as:
- P(g(X, e), f(c, Y)) ∨ P(g(Y), c) ←
Q(f(X, Y), d) ∧ P(d, f(X, e)), or, with explicit
quantifiers, as
- (∀X)(∀Y) [ P(g(X, e), f(c, Y)) ∨
P(g(Y), c) ← Q(f(X, Y), d) ∧ P(d, f(X, e)) ].
more...
- Clauses
Please refer to the term Clause
more...
- Clustering (Unsupervised Learning)
The task of clustering is to structure a given set of unclassified instances of an example language by creating
concepts, based on similarities found on the training data.
So the main difference to supervised learning is that there is neither a target predicate nor an oracle
dividing the instances of the training set into categories. The categories are formed by the learner itself.
Given: A set of (unclassified) instances of an example language LE.
Find a set of concepts that cover all given examples, such that
- the similarity between examples of the same concepts is maximized,
- the similarity between examples of different concepts is minimized.
Conceptual Clustering
The setting above just aims at finding subsets of similar examples.
Conceptual Clustering extends this task to finding intensional
descriptions of these subsets. This can be seen as a second learning
step, although it will not necessarily be split from the first one:
- Partition the example set, optimizing a measure based on
similarity of the instances within the same subsets.
- Perform concept learning (supervised learning) for each of the
found subsets, to turn the extensional description into an intensional
one.
Note that the second step allows for prediction of yet unseen
instances of the example language.
One method addressing the task of Conceptual Clustering is the
Star method. COBWEB is an example of a clustering
algorithm, which does not induce an intensional description of
the found clusters, but organizes them in a tree structure.
more...
- COBWEB
- COBWEB is an incremental clustering algorithm, based on
probabilistic categorization trees.
The search for a good clustering is guided by a quality measure for
partitions of data.
The algorithm reads unclassified examples, given in attribute-value representation.
Pure COBWEB only supports nominal attributes, later extensions incorporate numerical values as well.
An incremental clustering algorithm reads one example per iteration, modifying the actual result
each time.
The aim is to achieve
- high predictability of variable values, given a cluster.
- high predictiveness of a cluster, given variable values.
COBWEB makes use of the category (cluster) utility
and partition utility measures. Please follow the
link above for details on these measures.
For each new example e, COBWEB compares the following alternatives for the current node
(starting with the root node), as long as it is not a leaf.
The alternative with highest partition utility is chosen.
- Insert e into the best successor node. To find it, e is tentatively inserted into every successor, and the
one yielding the highest partition utility is chosen.
- Create a new leaf for e and make it a successor of the current node.
- Generate a new node n, which is a predecessor of the two best successors of the current node
(found in step 1), and insert n between the current node and the two successors. Example e is inserted into n.
- Choose the best successor node (found in step 1), delete it and make its successors direct successors
of the current node. Afterwards example e is inserted as in 1).
If a leaf has been reached, COBWEB creates two successors, one containing the new example and one
containing the example of the old leaf. The old leaf becomes an inner
node.
more...
- Concept Learning
- The task of concept learning is to acquire general concepts from specific training examples. Training examples
are instances, which either belong to a special concept, and therefore are positive examples, or do not belong
to the concept, and therefore are negative examples.
Given
- a description language LE for examples,
- a hypothesis language LH for possible hypotheses (i.e. possible learning results),
- a set of positive examples (instances of a certain concept)
- a set of negative examples (instances that do not belong to the concept),
- a predicate "covers", which indicates whether a given concept/hypothesis covers a given example,
- an acceptance criterion (measure) which evaluates hypotheses,
Find hypotheses in the language LH fulfilling the acceptance criterion.
The partial order of LH can be used to prune the search.
more...
- Condensed representations
A complementary method of reducing the large number of frequent patterns is to use so-called condensed representations. Two important examples of condensed representations for frequent itemsets are maximal frequent itemsets and closed frequent itemsets. Maximal frequent itemsets are frequent itemsets for which none of their supersets is also frequent. Clearly, each frequent itemset is a subset of a maximal frequent itemset. Closed frequent itemsets are itemsets that completely characterize their associated set of transactions. That is, a frequent itemset X is closed if it contains all items that occur in every transaction in which X is bought. The closed frequent itemsets form a basis for all frequent sets. The underlying idea of both concepts is that the set of all maximal/closed frequent itemsets represents all frequent itemsets but is far smaller.
more...
- Consistency
Given a set of classified examples, a consistent concept is a
concept, which covers all of the positive and none of the negative
examples.
more...
- Coverage
With respect to a given set of examples the Coverage describes how many of these
examples are classified positive by a (partial) hypothesis.
Note that in the context of concept learning the measure just counts the number of
covered positive examples.
Especially for building disjunctive hypotheses for concept learning it is
favourable to find partial hypotheses (e.g. Horn clauses) with high coverage,
but with the additional restriction not to cover any of the negative examples.
One method following this approach is Top-Down Induction of Horn Clauses.
more...
- Cross Validation
- Cross validation is a method aiming at estimating the error of
a hypothesis, generated by a concept learning algorithm (classifier).
Given a set of training data and a concept learner this
is how cross validation estimates the accuracy of the hypothesis
gained by running the learning algorithm on the data set:
- Randomly divide the training data into M sub-samples.
- For each sub-sample i, do:
- Let a concept learner build a hypothesis from the training data
without sub-sample i.
- Determine the accuracy ai of this hypothesis on
sub-sample i, not used for learning.
- The estimated accuracy is (1/M) * Σi=1,..,M
ai, the average accuracy over the M sub-samples.
A special case of cross validation is the so called
leave one out method, where M is chosen as the cardinality of
the training set. In other words for each given example another run of
learning is performed where all training data except for this example
is used for training, and the correctness of the classification of
the single example is checked.
For efficiency reasons "leave one out" is unusual in practice,
although in general it will be closest to the real error.
Instead M=10 is a frequently chosen compromise between computational
effort and quality of estimation results.
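A minimal Python sketch of M-fold cross validation (illustrative only; train and accuracy are hypothetical stand-ins for a concrete learner and its evaluation):
import random

def cross_validation(data, train, accuracy, M=10):
    data = list(data)
    random.shuffle(data)                           # random division into M sub-samples
    folds = [data[i::M] for i in range(M)]
    scores = []
    for i, fold in enumerate(folds):
        training = [d for j, f in enumerate(folds) if j != i for d in f]
        h = train(training)                        # learn without sub-sample i
        scores.append(accuracy(h, fold))           # accuracy a_i on sub-sample i
    return sum(scores) / M                         # (1/M) * sum of the a_i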
more...
- Customer Acquisition
In most situations in the process of customer acquisition, the goal is to turn a group of prospects into actual customers of some product or service. In general, there are different kinds of customers, and it may take a long time, if ever, before a customer becomes a valuable customer. Therefore, when modeling customer acquisition processes, one should distinguish between response and activation models. Response models predict if a person, who got an offer will react somehow. In contrast, activation models predict if a person will accept and fully use the offered product or service.
Response models are among the first types of models which a company starting with analytical CRM tries to develop. Due to the rather low time requirements in contrast to other DM goals, response modeling is a good candidate for quick win projects in CRM. If no targeting has been done in the past, a response model can boost the efficiency of marketing campaigns by increasing responses and/or reducing mail expenses. Reported figures in the literature indicate, that response rates can be increased by up to 100 %.
The goal of a response model is to predict the "response behaviors" of the prospects who got an offer for a product or service. The model can be based on either the past behavior of a similar population, or on the behavior of a subsample of the target population, which was used for a test mail, or on some logical substitute. The response behavior defines a distinct kind of customer action and categorizes the different possibilities so that they can be further analyzed.
A response can be received in several ways, depending on the offer channel. An e-mail offer for example can direct the responder to reply by e-mail, phone, or mail. When compiling the results, it is not only important to monitor the response channel and action, but also to manage duplicates. There are situations in which a company may receive more than one response from the same person. This is especially common if a prospect receives multiple or follow-up offers for the same product or service that are spread over several weeks or sent by different branches.
Activation models are models that predict if a prospect will become a full-fledged customer. These models are most applicable in the financial services industry. For example, for an insurance policy prospect to become an active customer, the prospect must respond, be approved, and pay the initial premium. If the customer never pays the premium, he or she actually ends up costing the insurance company more than a non-responder.
There are two ways to build an activation model. One method is to build a model that predicts response and a second model that predicts activation given response. The final probability of activation from the initial offer is the product of these two models. A second method is to use one-step modeling. This method predicts the probability of activation without separating the different phases.
Which of the two methods is preferable depends on the available data. If one prepares a mailing action for the acquisition of new customers, it is common practice to buy address lists. These lists will be combined with data about possible prospects which one already has. If the whole data set contains a rich set of attributes and its data quality is high, both methods are applicable. Otherwise, one is obliged to use the two-step approach.
more...
- Customer Profitability
Without knowing the value of one's customers, it is hard to determine what the optimal marketing efforts would be. Data mining can be used to predict customer profitability under a variety of different marketing campaigns. Profitability in turn can be analyzed under different angles.
First, there is the profitability which is associated with a particular product or service, and the risks, which may arise during the lifetime of this particular product or service. And second, one may consider the profitability and risks of the customer life cycle.
Approval or risk models are unique to certain industries that assume the potential for loss when offering a product or service. The most well-known types of risk occur in the banking and insurance industries. Banks assume a financial risk when they grant loans. In general, these risk models attempt to predict the probability that a prospect will default or fail to pay back the borrowed amount. For the insurance industry, the risk is that of a customer filing a claim.
The risk of fraud is another area of concern. Fraud detection models are assisting banks for example in reducing losses by learning the typical spending behavior of their customers. If a customer's spending habits change drastically, the approval process is halted or monitored until the situation can be evaluated.
Net present value models and lifetime value models attempt to predict the overall profitability of a product or customer, i.e. a person or a business, for a predetermined length of time. The values are often calculated over a certain number of years and discounted to today's value. Since market shares vary over time, companies are looking for opportunities to profit more from their existing customer base. As a result, many companies are expanding their product and service offerings in order to cross-sell or up-sell their existing customers. This approach is exactly what creates the need for models which go beyond the net present value to the lifetime value of a customer.
more...
- Customer Relationship Management
Customer Relationship Management is a strategy used to learn more about customers' needs and behaviors in order to develop stronger relationships with them.
After all, good customer relationships are at the heart of business success. To be effective, CRM requires using information about customers and prospects in all stages of their relationship with a company. From the company's point of view, the stages are acquiring customers, increasing the value of customers and retaining good customers.
more...
- Customer Retention
Attrition or churn is a growing problem in many industries. It is characterized by the act of customers switching companies, usually to take advantage of a better offer or just to get some incentives. Attrition is defined as a decrease in the use of a product or service. For bank accounts, attrition is the decrease in balances on which interest is being earned. Churn is defined as the closing of one account in conjunction with the opening of another account for the same product or service, usually at a reduced cost to the consumer. The new provider need not be a competitor, e.g. a customer of the "traditional" business division may switch to the e-business division of the same company.
For DM activities there are several opportunities. One type of analysis predicts the act of reducing or ending the use of a product or service after an account has been activated. At Swiss Life for example, the serious business problem of life insurance surrender falls into this category. The benefit of DM would be to build an understanding of what the factors are that indicate high risk customers. Using the DM results, substantial money can be saved by targeting specifically at-risk customers.
more...
- Customer Segmentation
An in-depth knowledge of customers and prospects is essential to stay competitive in today's marketplace. Therefore, data is now often collected at a detailed level for a large number of customers. Segmentation is the act of breaking down this large customer population into segments, in which the customers within the segment are similar to each other, and those being in different segments are dissimilar.
A "true" segmentation will be collectively exhaustive and mutually exclusive, i.e. every customer in the database must fall into exactly one segment.
Segmentations are valuable, since they allow the end user to look at the entire database from a bird's eye perspective. A common use of segmentation analysis is to segment customers by profitability and market potential, also known as "Share of Wallet" analysis.
There are several reasons to use segmentations in business. One can use it for distribution channel assignment in sales, i.e. which segment will be addressed by which channel. Or segments may be identified and used as a framework to communicate for strategic purposes, that there are differences between customers, that could be captured at a very high level. Ideally, one would like to know each customer individually. Segmenting the customer base up to the level of individual customers, lays the basis for one-to-one marketing. One-to-one marketing is the ideal marketing strategy, in which every marketing campaign or product is optimally targeted for each individual customer. For other concepts, like profitability, it does not make sense, to define in the worst case 100,000 segments or levels of profitability. Instead one would restrict oneself to e.g. four or five and assign the customers to these clusters.
There are two possible ways how segmentations can be performed. Either humans do the segmentation "by hand", a process also known as "profile analysis", or the segmentation is created data-driven by analysis of the customer data. Data-driven segmentations are performed most often by using a variety of DM and statistical techniques, which fall into two categories: predictive segmentations and clustering. Predictive segmentations create segments with some goal in mind, e.g. segments of high and low value customers, according to their buying habits of a product or service.
In contrast, clustering has no particular goal in mind but is merely a way to give more structure to the data. Clustering techniques pull interesting commonalities from the data, which help to organize the data. These segments do not present customers who e.g. bought the product or turned in the rebate, but generally have similar characteristics. For address lists it is typical, that there are some predefined clusters e.g. households with high income and no kids. The people that fall into one particular cluster do not necessarily buy products at a higher or lower rate than people in different clusters. Nevertheless, the clusters' definitions are helpful because they provide high-level data organization in a consistent way.
more...
- Customer Up and Cross Selling
Cross selling is the process of offering new products and services to existing customers.
One form of cross selling, sometimes called up selling, takes place, when the new offer is related to existing purchases by the customers. Using DM for cross selling helps to predict the probability or value of a current customer buying these additional products. Doing a DM analysis for cross selling is more than doing the analysis required for single product customer acquisition several times, i.e. for each additional product. Here, the key is to optimize in addition the offerings across all customers and products. This achieves the goal to create a win-win situation, in which both the customer and the company benefit.
Analyzing previous offer sequences with DM methods can help to determine what the next offer should be and when to make it. This makes it possible to carefully manage offers and avoid over-soliciting and possibly alienating customers.
more...
- Data Cleansing
To increase the quality of the results that Machine Learning
techniques yield when applied to large datasets, a
step of inspecting the data and removing or correcting
corrupt or misleading parts should be performed first.
Typical problems are contradictory or incomplete
information. This will confuse learning algorithms,
for it is known that learning in the presence of
noise is much harder than in the case of correct
information. Please refer to the case studies
for a more detailed discussion on data
cleansing.
more...
- Data Management
The control of data handling operations--such as acquisition, analysis, translation, coding, storage, retrieval, and distribution of data--but not necessarily the generation and use of data.
more...
- Data visualization
Visualization of raw data is used for exploratory data analysis. If the sample of data points is small and the dimensionality of each example vector is small, there are many effective techniques to plot the data. For higher dimensionalities, a reduction before plotting is necessary. Visualization of large samples such as databases is still a research topic.
more...
- Decision Trees
Decision trees are special classifiers.
Each decision tree is a tree with
- a test associated with each inner node.
Tests are mappings from the set of all instances of the given example language
to the set of successors of that node.
- a fixed element of a finite set of possible categories associated with each leaf.
The example language is a form of attribute-value representation.
Nominal (and therefore finite) attribute values are supported as well as numerical domains.
Figure 1 shows a toy example for boolean tests.
Figure 1:
Decision tree for a striker in football, playing the ball.
In the case of nominal attribute values the tests of each node are usually of the following
form:
- All nodes are marked by the name of an attribute.
- For each possible value of the attribute there is exactly one successor in the tree.
- The edge leading to a successor is marked with the related attribute value.
In the case of numerical values the set of possible tests has to be adjusted.
One possibility is to use tests that check if an attribute value is lower than a constant
or not. In this case there have to be two successors in the tree, one for each possible
answer.
The classification of an instance of the example language works like this:
- Start with the root node.
- As long as you have not reached a leaf node, choose the successor that is given by the
test of the current node.
In the case of nominal attribute values follow the edge marked
with the instance's value of the attribute whose name is found at the current node.
- If you have reached a leaf node, classify the instance by the class associated with the
leaf.
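A minimal Python sketch of this classification procedure for nominal attributes (illustrative only; the tree representation, an inner node as a pair of an attribute name and a dictionary of branches and a leaf as a class label, is an assumption, and the toy tree is made up):
def classify(tree, instance):
    while isinstance(tree, tuple):            # not yet a leaf
        attribute, branches = tree
        tree = branches[instance[attribute]]  # follow the edge marked with the instance's value
    return tree                               # class associated with the leaf

# Toy tree, loosely in the spirit of the striker example in Figure 1.
tree = ("marked", {"no": "shoot",
                   "yes": ("teammate_free", {"yes": "pass", "no": "dribble"})})
print(classify(tree, {"marked": "yes", "teammate_free": "yes"}))  # pass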
more...
- Depth of Terms
The depth of a term in a definite clause is inductively defined as follows:
- All terms t occuring in the head of the clause have depth d(t) := 0.
- For all other terms t occuring in the clause the depth is defined as
d(t) := min { d(u) | There is a literal containing t and u. } + 1.
So in the clause P(u, v) ← Q(w, v) ⋀ P(w, x) the terms
u and v would have depth 0, w depth 1 and x would have depth 2, for
example.
more...
- Dimension reduction
Reducing the dimensionality of a data set by extracting a
number of underlying factors, dimensions, clusters, etc., that
can account for the variability in the data set.
more...
- Direct mailing
Direct mailing is a commonly chosen method by companies as a part of their direct marketing strategies. Of course every company wants its mailings to be as effective as possible. The effectiveness of a mailing campaign can be measured by its response rate. A high response rate means that the marketing goals have been achieved and therefore that the mailing costs were justified. A company that regularly sends mails for marketing purposes can reduce the mailing costs considerably by optimizing the responses using data mining techniques.
more...
- Direct marketing
- One-to-one marketing is a significant part of customer relationship management. There are many communication channels in use, like call centers, mail, e-mail, SMS, etc.
more...
- EM method
Expectation and Maximization are the two steps of a procedure which handles mixture models. If our data stem from K processes, we build a model for each process and estimate a mixing distribution.
The basic idea of EM in this context is to pretend that we know the parameters of the model and then to infer the probability that each data point belongs to each component. After that, we refit the components to the data, where each component is fitted to the entire data set with each point weighted by the probability that it belongs to that component. The process iterates until convergence. Essentially, we are "completing" the data by inferring probability distributions over the hidden variables - which component each data point belongs to - based on the current model. For the mixture of Gaussians, we initialize the mixture model parameters arbitrarily and then iterate the following two steps:
- E-step: Compute the probabilities pij = P(C=i | xj), the probability that datum xj was generated by component i. By Bayes' rule, we have pij = α P(xj | C=i) P(C=i), where α is a normalizing constant. The term P(xj | C=i) is just the density at xj of the i-th Gaussian, and the term P(C=i) is just the weight parameter of the i-th Gaussian. Define pi := ∑j pij.
- M-step: Compute the new means, covariances, and component weights as follows:
mean: μi ← ∑j pij xj / pi
covariance: Σi ← ∑j pij xj xjT / pi
weight: wi ← pi
- Iterate the E-step and M-step until convergence
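The following is a minimal sketch of these two steps for a mixture of two one-dimensional Gaussians (not from the source; numpy and all variable names are assumptions, and the component weights are normalized by the number of data points here).

import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])   # synthetic data

mu, var, w = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for _ in range(50):
    # E-step: p[i, j] = P(C=i | xj), via Bayes' rule with the Gaussian densities
    dens = np.exp(-(x - mu[:, None]) ** 2 / (2 * var[:, None])) / np.sqrt(2 * np.pi * var[:, None])
    p = w[:, None] * dens
    p /= p.sum(axis=0)              # normalize over the components
    pi = p.sum(axis=1)              # effective number of points per component
    # M-step: refit each component to all data, weighted by the membership probabilities
    mu = (p * x).sum(axis=1) / pi
    var = (p * (x - mu[:, None]) ** 2).sum(axis=1) / pi
    w = pi / len(x)
print(mu, var, w)                   # means, variances and weights of the two components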
more...
- Empirical Error
Given a set of training examples the error of a learning result is estimated by the empirical error, a measure based on the training data.
more...
- Ensemble Methods
Basic idea:
- Build many models (e.g. trees)
- When making a prediction, use a voting mechanism (e.g. majority vote)
These methods have had the most profound impact on prediction accuracy of any method in the last decade.
Ways to form ensembles (a minimal sketch follows below):
- Bagging (and its variants)
- Boosting (and its variants)
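A minimal bagging sketch in Python (not from the source): many models are built on bootstrap samples and combined by majority vote. Decision stumps on a single numeric feature serve as the base models purely for illustration.

import random
from collections import Counter

def train_stump(data):
    """data: list of (x, label) pairs with numeric x; returns a threshold classifier."""
    best = None
    for threshold in sorted({x for x, _ in data}):
        for low, high in (("-", "+"), ("+", "-")):
            preds = [low if x <= threshold else high for x, _ in data]
            errors = sum(p != y for p, (_, y) in zip(preds, data))
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    _, t, low, high = best
    return lambda x: low if x <= t else high

def bagging(data, n_models=25):
    models = []
    for _ in range(n_models):
        sample = [random.choice(data) for _ in data]      # bootstrap sample of the training set
        models.append(train_stump(sample))
    def vote(x):                                          # majority vote over all models
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return vote

data = [(1, "-"), (2, "-"), (3, "-"), (6, "+"), (7, "+"), (8, "+")]
predict = bagging(data)
print(predict(2.5), predict(7.5))                         # expected "-" and "+"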
more...
- Entropy
Given
- a finite set E of classified examples and
- a set C = {c1, ..., cn} of possible classes.
Let |Eci| denote the number of examples e ∈ E of class
ci and let pi := |Eci| / |E|.
Then the Entropy of E is defined as:
entropy(E) := - ∑i=1,...,n pi * log2(pi)
In this context we define 0 * log2(0) = 0.
The Entropy measures the impurity of a set of examples.
It is lowest, if there is at most one class present, and it is highest,
if the proportions of all present classes are equal.
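A small Python sketch of this definition (not from the source); a set of classified examples is represented simply by the list of its class labels.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    # classes with zero count never appear in the Counter, so 0 * log2(0) = 0 is respected
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["+", "+", "-", "-"]))   # 1.0: two equally frequent classes (highest impurity)
print(entropy(["+", "+", "+", "+"]))   # 0.0: only one class present (lowest impurity)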
more...
- Evaluation method
-
more...
- Evolution
The topic of evolutionary computing has its own Network of Excellence, the "EvoNet". Please follow the link for more information.
more...
- Feature Construction
The notion of feature construction subsumes methods for adding new
attributes to a dataset, constructed on the basis of given data and
maybe prior knowledge about the domain.
The aim of feature construction is to provide learning algorithms with
information of higher relevance, in order to achieve better learning
results.
For a more detailed discussion, please refer to the case studies, especially to
data
design and
data cleansing.
more...
- Feature Generation and Extraction
Preprocessing of high-dimensional features is a very general and powerful method for improving the performance of a learning algorithm. The preprocessing need not be linear but can be a general (nonlinear) function of the form x'=g(x). The derived features x' can then be used as inputs into any (linear or nonlinear) learning procedure.
more...
- Feature Selection
Feature selection aims at focusing on those attributes
of a dataset which are relevant for the learning task.
Irrelevant attributes always bear the risk of confusing
the learner with random correlations, while on the other
hand they do not provide any useful information.
The degree to which irrelevant and redundant attributes are
harmful depends on the learning method selected.
While algorithms like k-Nearest Neighbor are
known to perform significantly worse in the presence
of such attributes, ID3 automatically performs
some kind of feature selection by choosing the test
with highest information gain.
For a more detailed discussion on feature selection,
please refer to our case studies, especially to
data design and
data cleansing.
more...
- First Order Regression
Within the field of Inductive Logic Programming (ILP) some recent work has also focussed on the problem of regression. FORS (Karalic, 1995; Karalic & Bratko, 1997) induces a model that has the form of a Prolog program. This program consists of clauses of the form f(Y, X1, ..., Xa) that represent the obtained regression model. The system is able to receive, as input, samples of the unknown regression function and background knowledge, which constitutes a major advantage. FORS uses a covering algorithm at the top level that keeps generating new clauses, followed by the removal of the cases covered, until a termination criterion is met. The clause generation step follows a general-to-specific search that keeps adding new literals to the clause. FORS uses several pre-pruning mechanisms for controlling clause generation. Among them is the use of the Minimum Description Length (MDL) principle (Rissanen, 1982).
FFOIL (Quinlan, 1996) is another example of an ILP system able to deal with regression. FFOIL is a derivation of the FOIL system (Quinlan, 1990) and follows a covering algorithm similar to FOIL's. It starts with an empty program and keeps adding clauses until all cases are covered. Each added clause starts with an empty body, and literals are appended as a form of specialising the clause. A function with k arguments (the input variables) is represented by a (k+1)-ary relation where one argument holds the function outcome. The result obtained by FFOIL consists of a Prolog program with clauses of this relation.
First order approaches to regression have a much larger search space than their propositional counterparts. This fact has two contradictory effects. While being able to find solutions not available to propositional systems, this increased expressiveness has a strong impact on the computational complexity of these systems. This makes them hardly applicable to the extremely large domains that are found in some applications (as in a typical Data Mining situation). However, computation power grows at a very fast rate, and ILP may well become the major trend in Machine Learning approaches to regression.
more...
- FOCL
FOCL is a FOIL variant with an additional explanation-based
component.
more...
- FOIL
The algorithm FOIL addresses the task of concept learning
by using the method of top-down induction of horn clauses.
More specifically, the task of FOIL is to find a description of a target
predicate, given a set of examples and some background
knowledge, both in the form of function-free ground facts.
Examples differ from background knowledge by containing the target
predicate.
FOIL makes the closed world assumption, so it assumes that any
function-free ground fact not declared true is false.
The hypothesis language of FOIL is a set of sets of
rules of the following form:
T(x1, ..., xn) ← L1, ...,
Lk,
where T is the target predicate, all xi are variables and
all Li are positive or negative constant-free and
function-free literals.
The general method of FOIL is to search the hypothesis space by
adding one rule after another, and by constructing each rule
literal by literal, guided by a heuristic measure, similar
to information gain. The search for a hypothesis fitting the data is
reduced to parts of the search space with highest information gain,
which does not necessarily lead to optimal results.
The Algorithm:
As long as there are positive examples of the target predicate,
which are not covered by any of the rules, repeat:
- Add another rule. Initialize the new rule with the target
predicate as the conclusion and without any preconditions.
- As long as the new rule covers a negative example of the
target predicate, repeat:
- Search the list of possible extensions of the rule. This list consists of ...
- ... all positive or negative literals, containing a predicate found in the background
knowledge, not containing constants or functions but at least one of the variables
already part of the formula. By allowing the target predicate at this point, FOIL
is able to learn recursive definitions, but has to be kept from infinite recursion.
- ... tests for equality and for inequality of variables already occurring in the rule.
- Add to the preconditions the possible extension that yields the highest
Gain, where Gain is FOIL's variant of the information gain measure.
The result is a set of rules defining the target predicate such that every positive example, but
none of the negative examples, is covered by at least one rule.
more...
- FOIL Software
Different Releases of FOIL can be found at the homepage of R.
Quinlan.
more...
- FOIL's Information Gain
The measure, similar to information gain, used by FOIL is
Gain(R0, R1) :=
t * ( log2(p1/(p1+n1)) -
log2(p0/(p0+n0)) ).
- R0 denotes a rule before adding a new literal.
- R1 is an extension of R0.
- Tuples are bindings of variables, restricted to constants occurring in the read examples
and background knowledge.
- A tuple is covered by a rule in this context if it fulfills all of the rule's preconditions.
- p0 denotes the number of positive tuples covered by R0,
- p1 the number of positive tuples covered by R1.
- n0 and n1 are the numbers of negative tuples covered by the according rule.
- t is the number of positive tuples covered by R0 as well as by R1.
If a literal containing a new variable is added, then a positive tuple covered by R0 is
regarded as covered by R1 as well, if there exists an extension of the tuple, adding just a
constant as binding for the new variable, such that the extended tuple is covered by R1.
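A small Python sketch of this measure (not from the source), computed directly from the tuple counts defined above.

from math import log2

def foil_gain(p0, n0, p1, n1, t):
    """p0/n0: positive/negative tuples covered by the old rule R0,
    p1/n1: the same for the extended rule R1,
    t: positive tuples covered by both R0 and R1."""
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# Extending a rule so that it keeps 8 of 10 covered positive tuples while the
# covered negative tuples drop from 10 to 2:
print(foil_gain(p0=10, n0=10, p1=8, n1=2, t=8))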
more...
- Fraud detection
Fraud detection systems enable an operator to respond to fraud by denying services to or detecting and preparing prosecutions against fraudulent users. The huge volume of call activity in a network means that fraud detection and analysis is a challenging problem.
more...
- Frequent Sets
-
more...
- Function Approximation
Many general learning tasks, especially concept learning, may be regarded as function
approximation.
Given are examples of the function to be approximated.
The aim is to find a hypothesis (a function as well), that can be used for predicting the function
values of yet unseen instances, e.g. to predict future events.
Good hypotheses are those that frequently predict accurate function values.
The quality of a hypothesis for approximating a specific function is measured by a so-called
loss function, which increases as the differences between predicted and true
function values increase. It is also taken into account that it is more important to predict well
on frequently observed instances.
more...
- Functions
- Most hypothesis languages can be regarded as a specific subset
of all functions.
In concept learning each hypothesis maps the example
language to a set with the two possible values "belongs to the
concept" or "does not belong to the concept".
Formally this can be stated as
h: LE → {0, 1},
where h is the hypothesis, LE denotes the example language
(or the universe), and "1" and "0" encode
"belongs to the concept" and "does not belong to the
concept".
In special cases of SVMs or neural networks, linear
functions are used as hypotheses for concept learning.
They describe a hyperplane in a usually high dimensional space,
distinguishing positive from negative examples.
All that has to be done to determine if an instance belongs to the
concept, according to the hypothesis, is to find out on which side of
the hyperplane the instance is located.
In function approximation hypotheses map the example language
to continuous domains, e.g.
h: LE → ℝ.
But these continuous values may of course represent probabilities or
confidences for an example belonging to a learned concept.
In this sense, concept learning is a special case of
the task of function approximation.
more...
- Gaerdenfors' Postulates
Gaerdenfors has set up six postulates that every revision operator shall fulfill.
Let
- Cn(S) denote the set of formulas implied by S: Cn(S) := { f | S ⊨ f },
Cn(∅) denotes the set of all tautologies.
- B denote a closed theory, where closed means B = Cn(B),
- F (and G) denote a set of facts,
- rev(B, F) denote a revision of theory B, that does not imply any fact f ∈ F.
Then Gaerdenfors' postulates are:
- Closure:
rev(B, F) shall be a closed theory, again.
- Inclusion: The revised theory shall be a
subset of the former theory.
rev(B, F) ⊆ B
- Vacuity: If the theory does not contain any fact in F,
the revision operator shall not change the theory.
If F ∩ B = ∅, then rev(B, F) = B.
- Success: If F does not contain a tautology, then a
revised theory should not contain any fact in F (anymore).
If F ∩ Cn(∅) = ∅, then F ∩
rev(B, F) = ∅.
- Preservation: Two sets of facts, having the same
consequences, should lead to the same revision of a theory.
If Cn(F) = Cn(G), then rev(B, F) = rev(B, G).
- Recovery: A revision of a theory should have the property,
that adding the facts of the set F after a revision step by F, results in the
former theory, or in a superset of the former theory.
B ⊆ Cn( rev(B, F) ∪ F ).
more...
- Genetic programming, evolutionary algorithms
A population of programs is evaluated according to a fitness function before operators like cross-over and mutation change selected individuals, which yields the new population. The method can be seen as heuristic search in the space of all possible programs. Hence, the programs must be restricted in some way.
more...
- GOLEM
The algorithm GOLEM (Muggleton) uses inverse resolution and
least general generalization to induce a set of horn clauses from
- a set of positive examples, restricted to ground facts and
- background knowledge, given in the form of extensionally defined
predicates (ground facts, e.g. relational databases).
The base procedure:
- Randomly select two (positive) examples from the given example set.
- For both examples: Call the procedure "inverse resolution" to receive
a horn clause that - together with the background knowledge - implies the
example. For the inverse resolution the most specific inverse substitution
(namely "∅") is used.
- Combine the two clauses generated in the last step by using least general
generalization.
The result is a horn clause that, together with the background knowledge,
implies (at least) the two chosen examples.
more...
- GOLEM - Software
Please follow the link to download the GOLEM software package
for SUN SparcStation or to receive the source code.
more...
- Ground Facts
If a learning algorithm is based on first-order logic (or restrictions
of it), examples are commonly described by ground facts. Ground facts
are logical formulas, containing no variables and having exactly one
predicate, which has to be positive.
For example, the tuples of a relational database may be considered as ground facts.
more...
- Hidden Markov Models
A hidden Markov model (HMM) is a triple (p, A, B), where
- p = (pi) is the vector of the initial state probabilities,
- A = (aij) is the state transition matrix, with aij = Pr(xi at time t | xj at time t-1),
- B = (bij) is the confusion matrix, with bij = Pr(yi | xj).
Each probability in the state transition matrix and in the confusion matrix is time independent - that is, the matrices do not change in time as the system evolves. In practice, this is one of the most unrealistic assumptions of Markov models about real processes.
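A minimal sketch (not from the source) of the triple (p, A, B) in numpy, together with the forward algorithm for computing the probability of an observation sequence; all numbers are illustrative.

import numpy as np

p = np.array([0.6, 0.4])                 # initial state probabilities
A = np.array([[0.7, 0.4],                # A[i, j] = Pr(state i at t | state j at t-1)
              [0.3, 0.6]])
B = np.array([[0.9, 0.2],                # B[i, j] = Pr(observation i | state j)
              [0.1, 0.8]])

def forward(observations):
    """Probability of the observation sequence under the model (p, A, B)."""
    alpha = p * B[observations[0]]       # initialize with the first observation
    for o in observations[1:]:
        alpha = B[o] * (A @ alpha)       # one time step: transition, then emission
    return alpha.sum()

print(forward([0, 0, 1]))                # probability of observing the sequence 0, 0, 1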
more...
- Horn Clause
A horn clause is a clause containing at most one positive literal.
Horn clauses with exactly one positive literal (definite clauses) can be written in an implication
form without making use of the negation symbol:
Lpos ← L1 ∧ L2 ∧ ...
∧ Ln.
The positive literal Lpos is called the
conclusion or the head of the horn clause. The literals Li,
i ∈ {1, ..., n}, are the negations of the negative literals (the same literals without the negation
symbol); they form the preconditions, or, altogether, the body of the
clause.
more...
- HYDRA
HYDRA is another extension of the FOIL base algorithm.
It extends FOCL by adding confidence factors to rules.
more...
- ID3
The algorithm ID3 (Quinlan) uses the method top-down induction of decision trees.
Given a set of classified examples a decision tree is induced, biased by the
information gain measure, which heuristically leads to small trees.
The examples are given in attribute-value representation.
The set of possible classes is finite.
Only tests that split the set of instances of the underlying example language
depending on the value of a single attribute are supported.
Depending on whether the attributes are nominal or numerical, the tests either
- have a successor for each possible attribute value, or
- split according to a comparison of
an attribute value to a constant, or depending on if an attribute value belongs to a
certain interval or not.
The algorithm starts with the complete set of examples, a set of possible tests and
the root node as the current node.
As long as the examples propagated to a node do not all belong to the same class and
there are tests left,
- a test with highest information gain is chosen,
- the corresponding set of successors is created for the actual node,
- each example is propagated to the successor given by the chosen test,
- ID3 is called recursively for all successors.
more...
- ID3 Software
Please click the link to download a public domain implementation
of the ID3 algorithm.
more...
- Identification In The Limit
The scenario:
A learner is supplied with an (infinite) stream of input data,
generated according to a theory / language / function of a given
representation class. The learning problem is to identify a
hypothesis that explains the data stream, which means that the
hypothesis is consistent with all of the data within the stream.
In every iteration step the learner reads another piece of
data and outputs a hypothesis of a given hypothesis
language.
The main questions addressed by the Identification In The Limit
scenario:
- For which representation classes do algorithms exist which identify
a correct hypothesis at some point in time (in the limit) for
any instance of the representation class, and just output syntactical
variants of that result from then on?
- Does a specific algorithm identify any / one specific instance of
a representation class in the limit?
The learner is not asked to realize that a correct hypothesis has been
found!
The questions can be refined:
- Is there a specific point in time for a specific learner, from
where on the outputs are just variants of a correct result?
- Does the quality of the hypotheses increase every iteration step
or may the quality decrease before a correct hypothesis is found?
more...
- ILP Learnability
The theoretical study of efficient learnability of logic programs
is motivated by the increasing importance of using ILP in different practical applications. Since
the general ILP learning problem is known to be computationally intractable, one of the challenging
problems in this field of research is to show positive and negative theoretical results
about the efficient learnability of different classes of logic programs in the formal learning models
of computational learning theory.
more...
- Image, video categorization
-
more...
- Incremental decision tree learning
Incremental induction of decision trees is a method for the task of concept learning (supervised learning). When a new training example is entered it is classified by the decision tree. If it is incorrectly classified then the tree is revised. Restructuring the tree can be done by storing all training examples or by maintaining statistics associated with nodes in the tree.
Given:
A sequence of classified examples in attribute-value representation, where the set of classes is finite.
The ID4 algorithm:
The ID4 algorithm builds decision trees incrementally. A decision tree is updated when new instances become available. A non-incremental algorithm could also be used, simply by storing all examples and running the method again on all data. This requires storing all data.
ID5R and IDL maintain statistics on the distributions of instances over attributes at each node in the tree. When a new example is entered then the effect of the training example on this distribution is computed and the method checks if the tree must be revised by replacing the current node by a different attribute. This can be done top-down, recursively for each subtree, or bottom-up.
more...
- Inductive Learning Techniques
In inductive learning, knowledge is compiled by generalising patterns in factual experience. By analyzing descriptors of many solvent and insolvent customers you may learn to recognize, on the basis of the descriptor values, which customers appear to be solvent and which are not.
This kind of task would be referred to as Concept Learning.
Age | Income | Education | Solvability
 72 |     18 |         L |           -
 14 |     85 |         L |           -
 98 |     18 |         L |           -
 16 |     76 |         H |           +
 65 |     98 |         H |           +
 32 |     82 |         H |           +
 28 |     75 |         H |           +
 19 |     62 |         H |           +
 32 |     73 |         H |           -
 65 |     36 |         H |           -
 56 |     66 |         H |           -
From the above table one may learn that IF Education = H AND (Income > 75 OR Age < 30) THEN Solvability = + ELSE Solvability = -.
more...
- Inductive Logic Programming (ILP)
Inductive Logic Programming examines different restrictions of
first-order logic and different inference operators in order to find
efficient ways of generating generalizations of data sets (usually a
set of ground facts, e.g. given in the form of a relational database),
or specializations of theories to no longer imply negative examples.
It should be possible to give certain guarantees for the results, to
make the methods more attractive for users.
The reason restrictions of first-order logic are examined is the demand
for tractable formalisms.
The main questions addressed by Inductive Logic Programming are
- which hypotheses space to choose,
- which language to choose to represent examples,
- how to restrict inference.
The hypotheses language is a restriction of first-order logic, most frequently a restriction of Horn clauses.
Note, that Horn clauses are particularly well suited for the representation of grammars (syntax).
One is especially interested to be able to guarantee
- that the generalization of a data set is most specific within the hypothesis space,
or
- that a hypothesis is a specialization of a formula that is as general as possible, given
some specific demands.
more...
- Information extraction and wrapper induction
Information extraction:
The identification and extraction of instances of a particular
class of events or relationships in a natural language text and
their transformation into a structured representation (e.g. a
database). (after Grishman 1997, Eikvil 1999)
Wrapper Induction:
- Automatic generation of wrappers from a few (annotated) sample pages
- Assumptions:
Regularity in the presentation of information, often machine-generated answers to queries:
- same header
- same tail
- in between, a table/list of items that constitute the answer to the query
- Learn the delimiters between items of interest
more...
- Information Gain
The Information Gain is a measure based on Entropy.
Given
- a set E of classified examples and
- a partition P = {E1, ..., En} of E.
The Information Gain is defined as
ig(E, P) := entropy(E) - ∑i=1,...,n entropy(Ei) * |Ei| / |E|
Intuitively speaking, the Information Gain measures the decrease in impurity: the weighted
average impurity of the partitions E1, ..., En is
compared with the impurity of the complete set of examples E.
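A small Python sketch of this measure (not from the source); each block of the partition is represented by the list of its class labels.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partition):
    """labels: class labels of E; partition: lists of labels for E1, ..., En."""
    n = len(labels)
    remainder = sum(len(block) / n * entropy(block) for block in partition)
    return entropy(labels) - remainder

E = ["+", "+", "+", "-", "-", "-"]
# Splitting E into one mixed block and one pure block:
print(information_gain(E, [["+", "+", "+", "-"], ["-", "-"]]))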
more...
- Interesting Subgroups
The task of finding interesting subgroups is related to the task of characterization.
We are looking for subsets of an instance space with interesting properties.
This differs from the tasks of concept learning and function approximation,
because we are not trying to find a hypothesis that globally describes the data
and enables predictions for unseen instances; instead we focus on subsets of the data.
One possible application field is marketing, since finding favourable market segments
is a specific case of this task.
more...
- Inverse Resolution
One approach to inductive inference of clauses is inverting resolution
steps. In this scenario, for each given (positive) example, not explained by
background knowledge LB, yet, another clause is added to
LB. The basic step for this procedure is illustrated by
figure 1.
Figure 1: The Resolution-"V".
The "V" shows a single resolution step. The two arms represent
the clauses C1 and C2, the derived
clause, named D, stands at the bottom.
Additionally, the substitutions θ1 and θ2,
needed to perform the resolution step, are attached to the arms of the
"V".
To be able to perform a resolution step, a literal L has to exist,
such that L ∈ C1θ1 and ¬L ∈ C2θ2.
Then we can derive
D := ( C1θ1 \ L ) ∪
( C2θ2 \ ¬L )
from C1 and C2.
Now we are interested in inverting such a resolution step.
Let's assume C1 and D are given, and we
are looking for a clause C2, to add it to the
background knowledge LB, such, that the clause
D can be derived from LB.
Let's have a look at what we can say about C2, in
this setting:
- For each literal L' ∈ D \ ( C1θ1
) we know L' ∈ C2θ2,
because of the upper equation, defining D.
- A literal L, with L ∈
C1θ1 and
¬L ∈ C2θ2,
can be found by
- comparing C1θ1 to D and
- choosing a literal from C1θ1 that is "missing"
in D.
- Obviously we do not know anything about θ2.
The easiest choice is θ2 = ∅.
- For each literal
L'' ∈ ( C1θ1 \ L )
we cannot determine whether
L'' ∈ ( C2θ2 \ ¬L ).
These points show, on the one hand, that candidates for
C2 are easily constructed, e.g.:
C2 := D \ ( C1θ1 )
∪ ¬L, with
θ2 := ∅.
On the other hand they show, that in general,
- the number of possible choices (except renamings) for
C2 might be exponential in the length of clauses,
and
- that the inversion of the substitution θ2
requires even more choices,
which is bad in this setting, because it leads to a larger set of
hypotheses.
For practical use, these drawbacks demand the incorporation of preference
criteria or further assumptions.
more...
- KDD process
The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data - Fayyad, Piatetsky-Shapiro, Smyth (1996)
- non-trivial process (Multiple process)
- valid (Justified patterns/models)
- novel (Previously unknown)
- useful (Can be used)
- understandable (by human and machine)
KDD is inherently interactive and iterative.
Data mining is a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations.
- Understand the domain and define problems
- Collect and preprocess data
- Data mining: extract patterns/models
- Interpret and evaluate discovered knowledge
- Put the results to practical use
more...
- k-NEAREST NEIGHBOR
K-NEAREST NEIGHBOR is a simple algorithm that stores all available examples and classifies
new instances of the example language based on a similarity measure. A variant of this algorithm
addresses the task of function approximation.
In detail (for the task of concept learning):
- Examples are described by numerical attribute-values.
- The complete example set is simply stored in the "training phase".
- Calculations are delayed until queries occur; no hypothesis in the usual sense is returned!
Hypotheses are implicitly defined by the stored example set and the rules by which new instances are classified.
- If there are n attributes, all vectors can be interpreted as instances of ℝn.
The distance d(x1, x2) of two example vectors
x1 and x2 is defined as their usual vector distance.
- The distance between two example vectors is regarded as a measure for their similarity.
- To classify a new instance e, the k stored examples most similar
to e are determined. The new instance is assigned the class most of these k examples
belong to.
This approach is suited for function approximation as well. Instead of assigning the most frequent
classification among the k examples most similar to an instance e, an average of the function values of the k
examples is calculated as the prediction of the function value at e.
A variant of this approach calculates a weighted average of the nearest neighbors. Given
a specific instance e that shall be classified, the weight of an example increases with
increasing similarity to e.
A major problem of the simple approach of k-NEAREST NEIGHBOR is that the vector distance will not
necessarily be suited for finding intuitively similar examples, especially if irrelevant attributes are present.
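A minimal Python sketch of the basic procedure (not from the source): Euclidean distance as the similarity measure and a majority vote among the k nearest stored examples.

from collections import Counter
from math import dist                     # Euclidean distance (Python 3.8+)

def knn_classify(examples, query, k=3):
    """examples: list of (vector, class) pairs; query: the instance to classify."""
    neighbors = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
    # assign the class most of the k nearest examples belong to
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

examples = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((0.9, 1.1), "+"),
            ((5.0, 5.0), "-"), ((5.2, 4.9), "-"), ((4.8, 5.1), "-")]
print(knn_classify(examples, (1.1, 1.0)))  # expected "+"
print(knn_classify(examples, (5.1, 5.0)))  # expected "-"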
more...
- Lazy Learning
The notion of Lazy Learning subsumes a family of algorithms,
that
-
store the complete set of given (classified) examples of an
underlying example language and
-
delay all further calculations, until requests for classifying
yet unseen instances are received.
Of course, the time required for classifying an unseen
instance will be higher, if all calculations have to be done
when (and each time) a request occurs.
The advantage of such algorithms is, that they do not have to
output a single hypothesis, assigning a fixed class to each
instance of the example language, but they can use different
approximations for the target concept/function, which are
constructed to be locally good. That way examples similar to
a requested instance receive higher attention during the
classification process.
This method requires a similarity measure, to evaluate the
importance of examples, for classifying an unseen instance.
If examples are described by a real-valued vector, then similarity
could e.g. be measured by the usual vector distance, which
raises the question of whether all attributes should have the same impact.
Reducing the harmful impact of irrelevant attributes and
assigning each attribute the proper degree of impact are key issues
when using this method.
more...
- Learning as Optimization
In settings where an agent is trying to learn how to act most
reasonably, the question arises how to improve its behaviour,
depending on the experience it gathers.
At each step the agent's behaviour is determined by a policy, a
function mapping perceptions to actions. These actions will lead the
agent from one state to another. All of the agent's actions will
result in an immediate payoff as well, depending on its current state.
The payoffs and the mappings from state/action pairs to the next state
are not necessarily assigned deterministically.
A learning algorithm will change an agent's policy driven by
experience, in order to maximize its payoffs. Experience that may
be used for improving the current policy is limited to observed
perception and payoffs. At the beginning the algorithm may have no
information about the environment's structure and the relation
between action and payoff in different states.
So every observed perception/action/payoff triple can contribute to
improve the policy.
The quality of a policy can be measured in different ways, e.g. as
- a sum of all payoffs, if the lifetime of an agent is finite,
- or as the average payoff.
For a specific algorithm it is interesting, how the quality of
policies changes over time. Especially one is interested in
guarantees, that using a specific algorithm the policies will
converge towards an optimum.
more...
- Learning as Search
To be able to generalize and specialize in a stepwise manner,
it is necessary to find minimally more general / specific hypotheses
to a given hypothesis. If the hypothesis space is ordered according
to generality, learning can be considered as search either
from the most general to the most specific hypotheses (top-down) or
from the most specific to the most general (bottom-up).
Most specific generalization
For LH being a propositional logic (attribute-value representation),
the most specific generalization can be defined
referring to the partial ordering of the hypothesis space.
Given a set of possible hypotheses LH, a concept c' in LH, and a
positive example e, not covered by c'. The concept g is the most
specific generalization of c' according to e, if and only if
- g is strictly more general than c'.
- the concept g covers the example e.
- there is no other concept g' in LH, that is strictly more
general than c', covers e, but is strictly more specific than g.
So g is generated by generalizing one step: There is no room left for
another generalization g' between g and the recent generalization c'.
Most general specialization
For LH being a propositional logic (attribute-value representation),
the most general specialization can be defined
referring to the partial ordering of the hypothesis space.
Given a set of possible hypotheses LH, a concept c from LH and an
example e, covered by c.
The concept s is the most general specialization of c according to
e if and only if
- s is strictly more specific than c.
- the concept s does not cover the example e.
- there is no other concept s' in LH that is strictly more specific
than c, does not cover e, but is strictly more general than s.
So s is generated by specializing one step. This does especially make
sense, if e is a negative example, which should not be
covered.
more...
- Least General Generalization
The procedure least general generalization reads (at least) two
clauses as its input and outputs a single clause, implying all
input clauses, but being no more general than necessary.
In detail: If c1, c2 and clgg
are clauses, with clgg = LGG(c1,
c2), then
-
clgg → c1 and clgg
→ c2, and
-
for every clause cgen, with
cgen → c1 and
cgen → c2,
there exists a substitution θ, such that
cgenθ ⊆
clgg.
So there is no clause more specific than clgg, implying
c1 and c2. Please follow the link above to
"θ-subsumption" to read more about how implication is
related to θ-subsumption.
Algebraic properties of the LGG
The least general generalization of clauses is unique up to
variable renaming. For all clauses c1, c2
and c3 the following equations hold:
-
LGG(c1, c2) = LGG (c2,
c1)
-
LGG( LGG(c1, c2), c3) =
LGG( c1, LGG(c2, c3))
-
LGG(c1, c1) = c1
A procedure for finding the lgg of two clauses c1 and
c2:
In the first step, all pairs of corresponding literals are
grouped together.
If there are literals l1 in clause c1 and
l2 in c2,
- both positive or both negative, and
- having the same predicate symbol (with the same arity),
then these literals are corresponding. So we have a set of single
literals (which do not have a corresponding literal in the other
clause) and pairs of literals.
In step 2, for each pair of literals having been grouped together
in step 1, the procedure least general generalization for terms is
invoked for each component of the literal, and each pair of
literals is replaced by the according result.
If two terms are replaced, these terms have to be replaced by the
same variable each time.
Single literals, not having a corresponding literal, are not
changed.
The result is the conjunction of all literals from step 2.
Mind that the length of the LGG of clauses can be exponentially
large in the length of input clauses, for there can be several
pairs of corresponding literals.
To see an example and an applet, demonstrating the least general
generalization, please follow the link to the Centre for
Computational Learning Systems (CCLS).
more...
- Least General Generalization of Terms
The least general generalization of terms is inductively defined as follows:
- Let X be a variable and t be an arbitrary term, different
from X.
Then LGG (X, t) = Y, for a variable Y, not appearing in
t.
- Let f and g be different function symbols and
t1, ..., tn, u1, ...,
un be arbitrary terms.
Then
- LGG( f(t1, ..., tn),
f(u1, ..., un) )
:= f( LGG( t1, u1), ...,
LGG( tn, un) ).
- LGG( f(t1), g(u1) ) = Y,
where Y is a variable not appearing in
t1 or u1.
- If t, u and v are terms and LGG(t, u) = v then
LGG(u, t) = v.
It has the following properties:
If t1 and t2 are terms and tlgg =
LGG(t1, t2) is their least general
generalization, then
-
there exist substitutions θ1 and
θ2, such that
-
tlggθ1 = t1 and
-
tlggθ2 = t2.
-
for every term tgen with substitutions
θ1 and θ2, such that
-
tgenθ1 = t1
and
-
tgenθ2 = t2,
there exists a substitution ρ, such that
tgenρ = tlgg.
In other terms, there is no more specific generalization of
t1 and t2 than their LGG.
Further properties: For all terms t1, t2 and
t3 the following equations hold.
-
LGG(t1, t2) = LGG (t2,
t1)
-
LGG( LGG(t1, t2), t3)
= LGG( t1, LGG(t2, t3))
-
LGG(t1, t1) = t1
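A small Python sketch of this definition (not from the source). Terms are encoded as tuples ("f", t1, ..., tn), constants as tuples of arity 0, and variables as strings; the fresh variable names "V0", "V1", ... are assumed not to clash with variables already occurring in the terms.

def lgg(t, u, table=None):
    """Least general generalization of two terms; `table` remembers the variable
    introduced for each pair of differing subterms, so that the same pair is
    always replaced by the same variable."""
    if table is None:
        table = {}
    if t == u:
        return t
    # same function symbol with the same arity: generalize argument by argument
    if isinstance(t, tuple) and isinstance(u, tuple) and t[0] == u[0] and len(t) == len(u):
        return (t[0],) + tuple(lgg(a, b, table) for a, b in zip(t[1:], u[1:]))
    # otherwise introduce (or reuse) a variable for this pair of terms
    if (t, u) not in table:
        table[(t, u)] = "V%d" % len(table)
    return table[(t, u)]

# LGG( f(a, g(a)), f(b, g(b)) ) = f(V0, g(V0)): the pair (a, b) is replaced
# by the same variable each time.
print(lgg(("f", ("a",), ("g", ("a",))), ("f", ("b",), ("g", ("b",)))))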
more...
- Least Squares Linear Regression
Global parametric methods try to fit a single functional model to all the given training data. This imposes a strong assumption on the form of the unknown regression function, which may lead to lower accuracy if that is not the case. However, these approaches usually fit simple functional models that are easily interpreted and have fast computational solutions. A classical example of a global approach is the widely used linear regression model that is usually obtained using a Least Squares Error Criterion. Fitting a global linear polynomial using this criterion consists of finding the vector of parameters β that minimises the sum of the squared error, i.e. (Y - Xβ)T (Y - Xβ), where XT denotes the transpose of matrix X. After some matrix algebra the minimisation of this expression with respect to β leads to the following set of equations, usually referred to as the normal equations (e.g. Draper & Smith, 1981),
(XTX) β = XTY
The parameter values can be obtained by solving the equation
β = (XTX)-1 XTY
where Z-1 denotes the inverse of matrix Z. As the inverse matrix does not always exist this process suffers from numerical instability. A better alternative (Press et al., 1992) is to use a set of techniques known as Singular Value Decomposition (SVD), that can be used to find solutions of systems of equations with the form Xβ = Y. There are many variants of this general set-up that differ in the way they develop/tune these models to the data (e.g. Draper & Smith, 1981).
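A small numpy sketch of both routes described above (not from the source; the data are synthetic): solving the normal equations directly, and using the SVD-based least squares solver.

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])   # intercept column plus one input
beta_true = np.array([2.0, 0.5])
Y = X @ beta_true + rng.normal(0, 0.1, 50)                    # noisy observations

# Normal equations: beta = (X^T X)^-1 X^T Y (may be unstable if X^T X is near singular)
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# SVD-based solution, the numerically safer alternative
beta_svd, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_normal, beta_svd)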
more...
- Lexical Evaluation Function
- The Lexical Evaluation Function is a user defined measure to evaluate
the quality of a cluster in the process of conceptual clustering.
A simple example for a finite domain X is this one:
Let covE(C) denote the number of examples of a given
example set E, covered by the cluster C.
Let covX(C) denote the number of instances of the
example language covered by C.
Then define LEF as
LEF := 1 - covE(C) / covX(C)
more...
- Linear discriminant analysis
There are many possible techniques for classification of data. Principal Component Analysis (PCA)
and Linear Discriminant Analysis (LDA) are two commonly used techniques for data classification
and dimensionality reduction. Linear Discriminant Analysis easily handles the case where the
within-class frequencies are unequal, and its performance has been examined on randomly
generated test data. This method maximizes the ratio of between-class variance to within-class
variance in any particular data set, thereby guaranteeing maximal separability. The use of Linear
Discriminant Analysis for data classification has been applied, for example, to classification problems in speech
recognition.
Data sets can be transformed and test vectors can be classified in the transformed space by two different approaches:
- Class-dependent transformation: This type of approach involves maximizing the ratio of between
class variance to within class variance. The main objective is to maximize this ratio so that adequate
class separability is obtained. The class-specific type approach involves using two optimizing criteria
for transforming the data sets independently.
- Class-independent transformation: This approach involves maximizing the ratio of overall variance
to within class variance. This approach uses only one optimizing criterion to transform the data sets
and hence all data points irrespective of their class identity are transformed using this transform. In
this type of LDA, each class is considered as a separate class against all other classes.
more...
- Literal
A literal is a special first-order logic formula, either a so-called atom or a
negated atom. Atoms are formulas consisting of exactly one predicate symbol,
followed by a tuple of logical terms. So atoms are the simplest formulas, of which all more
complex formulas are built.
Examples of literals:
- P(X, Y, c)
- ¬P(X, Y, c)
- P(f(X,c), Y, d)
- ¬ P(f(X,c), Y, d)
with P being a predicate symbol of arity 3,
X and Y being variables,
c and d being constant symbols,
f being a function symbol of arity 2,
and ¬ denoting negation.
more...
- Logistic Regression
The method of logistic regression addresses the task of concept
learning. A linear model of the following form is constructed:
Y = b0 + b1* X1 +
b2* X2 + ... + bk *
Xk ,
where Y is the logit transformation of the probability p.
The logit transformation of the probability of
a value is defined as
Y = log (p / (1 - p)) ,
where p is the probability of an outcome.
The linear function can also be written as a
prediction of the probability of a value, e.g.
P(class = pos) = 1 / (1 + e-(b0 + b1*X1 + b2*X2 + ... + bk*Xk)).
The constant b0 and the weights b1, ..., bk are
chosen by a regression method so that the predictions for the class
are optimal for a given set of classified examples.
A number of tools are available for computing the weights.
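A minimal numpy sketch of this model (not from the source): the predicted probability for the positive class, and a plain gradient-ascent fit of the weights as one possible way to compute them.

import numpy as np

def predict_prob(X, b):
    """P(class = pos) = 1 / (1 + exp(-(b0 + b1*X1 + ... + bk*Xk))); X carries a leading 1-column."""
    return 1.0 / (1.0 + np.exp(-(X @ b)))

def fit(X, y, steps=2000, learning_rate=0.1):
    """Maximize the log-likelihood of labels y in {0, 1} by gradient ascent."""
    b = np.zeros(X.shape[1])
    for _ in range(steps):
        b += learning_rate * X.T @ (y - predict_prob(X, b)) / len(y)
    return b

# Toy data: one attribute, positive class for the larger values
X = np.column_stack([np.ones(8), [1, 2, 3, 4, 6, 7, 8, 9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
b = fit(X, y)
print(predict_prob(X, b).round(2))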
more...
- Loss Function
Loss functions map a prediction or classification
of a single example to a real-valued error, in order to be able to
sum up the errors of a hypothesis, e.g. one output by a learning algorithm
after reading a set of classified examples.
To evaluate the quality of a hypothesis, a part of the set of all
available classified examples has to be separated from the examples
used to train. This set of separated examples is called the test
set. This general method is suited for both concept learning
and function approximation. The hypothesis chosen by the learning
algorithm, after reading the set of remaining examples, is then used
to classify the test set's examples, or to predict the target values
of these examples, depending on the learning scenario.
The error of each single prediction / classification is now measured
by the loss function Q. Simple and wide spread forms of
Q are:
-
For Classification:
Q(x, h) := 1 if h(x) ≠ t(x), and 0 if h(x) = t(x)
-
For function approximation:
Q is often chosen as the squared difference between the predicted
value h(x) and the correct value t(x):
Q(x, h) := (t(x) - h(x))2
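A small Python sketch of these two loss functions (not from the source), together with the average loss of a hypothesis h over a test set.

def zero_one_loss(prediction, target):
    return 1 if prediction != target else 0

def squared_loss(prediction, target):
    return (target - prediction) ** 2

def average_loss(h, test_set, loss):
    """test_set: list of (x, t(x)) pairs; h: the hypothesis to evaluate."""
    return sum(loss(h(x), t) for x, t in test_set) / len(test_set)

h = lambda x: 2 * x                              # a toy hypothesis
test_set = [(1, 2), (2, 5), (3, 6)]
print(average_loss(h, test_set, squared_loss))   # (0 + 1 + 0) / 3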
more...
- Marketing
- One of the definitions found on the web says: "Marketing is the art of making someone want something you have". It consists of analysis of customers, competitors and company, understanding what segments of customers exist and targeting the most profitable segments. Advertisement, communication via e-mails, phones, etc. are then used to influence customers.
more...
- Mining Mart
Enabling End-User Datawarehouse Mining
The project aims at new techniques that give decision-makers direct access to information stored in databases, data warehouses, and knowledge bases. The main goal is the integration of data and knowledge management.
For an introduction visit the homepage of Mining Mart
more...
- Monte Carlo methods
Numerical methods that are known as Monte Carlo methods can be loosely described as statistical simulation methods, where statistical simulation is defined in quite general terms to be any method that utilizes sequences of random numbers to perform the simulation. Monte Carlo methods have been used for centuries, but only in the past several decades has the technique gained the status of a full-fledged numerical method capable of addressing the most complex applications.
Monte Carlo is now used routinely in many diverse fields, from the simulation of complex physical phenomena such as radiation transport in the earth's atmosphere to the simulation of esoteric subnuclear processes in high energy physics experiments. The analogy of Monte Carlo methods to games of chance is a good one, but the "game" is a physical system, and the outcome of the game is not a pot of money or stack of chips (unless simulated) but rather a solution to some problem. The "winner" is the scientist, who judges the value of his results on their intrinsic worth, rather than the extrinsic worth of his holdings.
Statistical simulation methods may be contrasted to conventional numerical discretization methods, which typically are applied to ordinary or partial differential equations that describe some underlying physical or mathematical system. In many applications of Monte Carlo, the physical process is simulated directly, and there is no need to even write down the differential equations that describe the behavior of the system. The only requirement is that the physical (or mathematical) system be described by probability density functions (pdf).
We will assume that the behavior of a system can be described by pdf's. Once the pdf's are known, the Monte Carlo simulation can proceed by random sampling from the pdf's. Many simulations are then performed (multiple "trials" or "histories") and the desired result is taken as an average over the number of observations (which may be a single observation or perhaps millions of observations). In many practical applications, one can predict the statistical error (the "variance") in this average result, and hence an estimate of the number of Monte Carlo trials that are needed to achieve a given error.
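A minimal Python sketch of this idea (not from the source): estimating the area of the unit circle (i.e. pi) by random sampling, together with the statistical error of the average.

import random
from math import sqrt

def monte_carlo_pi(n_trials=100_000):
    outcomes = []
    for _ in range(n_trials):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        outcomes.append(4.0 if x * x + y * y <= 1.0 else 0.0)   # 4 * indicator(inside circle)
    mean = sum(outcomes) / n_trials
    variance = sum((o - mean) ** 2 for o in outcomes) / (n_trials - 1)
    return mean, sqrt(variance / n_trials)        # estimate and its standard error

estimate, std_error = monte_carlo_pi()
print(estimate, "+/-", std_error)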
more...
- Multi-relational mining
-
more...
- Multivariate Adaptive Regression Splines
Adaptive regression splines (Friedman,1991) are a kind of additive models (Hastie and Tibshirani, 1990) that can be seen as a generalisation of regression trees introduced to overcome some of their limitations. Friedman (1991) describes these limitations in detail and shows how adaptive regression splines can overcome them. The resulting model is implemented in system MARS, which provides a regression model of the form,
r(x) = c0 + ∑i ci ∏k [ sk,i (Xv(k,i) - tk,i) ]+
where the [ sk,i (Xv(k,i) - tk,i) ]+ are two-sided truncated power basis functions.
This model can be recast into the more intuitive form given by
r(x) = c0 + ∑ fm(Xm) + ∑ fm,n(Xm, Xn) + ∑ fm,n,o(Xm, Xn, Xo) + ...
In this equation the first sum is over all basis functions that involve only a single attribute. The second sum uses the basis functions involving two variables, and so on.
more...
- mySVM
mySVM is an implementation of the Support Vector Machine introduced by
V. Vapnik (see "Statistical Learning Theory").
It is based on the optimization algorithm of SVMlight
as described in "Making large-Scale SVM Learning Practical".
mySVM can be used for pattern recognition, regression and
distribution estimation.
For more details and download of mySVM, please follow the
according link above.
more...
- Naive Bayes
The Naive Bayes Classifier method is applicable to the task of
concept learning.
- It reads a set of examples in attribute-value representation and
- uses Bayes theorem to estimate the posterior probabilities of all
classifications.
- For each instance of the example language a classification with highest
posterior probability is chosen as the prediction.
For the Naive Bayes method the following assumption is essential:
Let x = < x1, ..., xn > be an instance of
the example language and c ∈ C a possible classification.
Then Prob(x | c) = ∏i∈{1,..,n}
Prob(xi | c).
This assumption is justified if the attribute values are conditionally independent of each
other, given the classification.
Using this assumption the classification c ∈ C with maximum posterior
probability Prob(c | x) is the one that maximizes the expression
P(c) * ∏i∈{1,..,n} Prob(xi | c).
The learner estimates the required probabilities by calculating the corresponding
frequencies observed in the example set.
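A minimal Python sketch of this method for nominal attribute values (not from the source); all probabilities are estimated by the observed frequencies, without any smoothing.

from collections import Counter, defaultdict

def train(examples):
    """examples: list of (attribute_tuple, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)        # (attribute index, class) -> counts of values
    for attributes, label in examples:
        for i, value in enumerate(attributes):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts, len(examples)

def classify(model, attributes):
    class_counts, value_counts, n = model
    best_class, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / n                                        # prior P(c)
        for i, value in enumerate(attributes):
            score *= value_counts[(i, label)][value] / count     # P(xi | c)
        if score > best_score:
            best_class, best_score = label, score
    return best_class

examples = [(("sunny", "hot"), "-"), (("sunny", "mild"), "-"),
            (("rainy", "mild"), "+"), (("rainy", "hot"), "+"), (("rainy", "mild"), "+")]
model = train(examples)
print(classify(model, ("rainy", "hot")))   # expected "+"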
more...
- Neural Nets
The topic of Neural Networks has its own Network of Excellence, the
"NEuroNet". Please follow the link for more information.
A neural network is composed of a number of nodes connected by links.
Each link has a numeric weight associated with it.
Weights are the primary means of long-term storage in the network,
and learning takes place by updating the weights. Some of the nodes
are connected to the external environment, and can be designated as
input or output nodes.
More specific
Biological analogy
Artificial Neural Networks (ANN) form a
family of inductive techniques that mimic
biological information processing. Most animals
possess a neural system that processes
information. Biological neural systems consist of
neurones, which form a network through synaptic
connections between axons and dendrites. With
these connections, neurones exchange
chemo-electrical signals to establish behaviour.
Neural systems can adapt their behaviour by changing the
strength of these connections; more complex
learning behaviour can occur by forming new
connections. Changes in connections, and new
connections, grow under influence of the signal
exchange between neurones.
Figure 7. Biological neural tissue. Neural cells (a), axons (b) and dendrites (c) are visible. Where axons touch dendrites or nuclei, synaptic contacts are formed.
Figure 8. Layered artificial neural network. On the left, a pattern is coded. The activation rule propagates the pattern to the right side through the central layer, using the weighted connections. From here, the output can be read. Based on the input-output relation, the learning rule may update the weights.
Artificial implementation
Artificial neural networks are simplified models of
the biological counterpart. They consist of the
following elements:
- Processing
units (models of neurones)
- Weighted
interconnections (models of neural
connections)
- An activation
rule, to propagate signals through the
network (a model of the signal exchange
in biological neurones)
- A learning
rule (optional), specifying how weights
are adjusted on the basis of the
established behaviour
The most common
ANN are structured in layers (see Figure
8), with an input layer where data is
coded as a numeric pattern, one or more hidden
layers to store intermediate results, and an
output layer that contains the output result of
the network.
Neural Network
Model Representation
In contrast to
symbolic techniques such as decision trees
(see link above) the representation of
knowledge in a neural network is not easy to
comprehend. The set of weights in a network,
combined with input and output coding, realises
the functional behaviour of the network. As
illustrated in Figure 3, such data does not give
humans an understanding of the learned model.
Some alternative
approaches for representing a neural network are:
- Numeric:
show the weight matrix; not very
informative, but illustrates why a neural
network is called sub-symbolic: the
knowledge is not represented explicitly,
but contained in a distributed
representation over many weight factors.
- Performance:
show that a neural network performs well
on the concepts in the data set, even
better than other techniques, and do not
try to interpret the model
- Extract
knowledge: techniques exist to
extract the model captured in a neural
network, e.g. by running a symbolic
technique on the same data or by
transferring a NN model into an
understandable form. Techniques exist to
extract statistical models and symbolic
models from NN.
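A minimal numpy sketch of the activation rule in a layered network (not from the source; the weights are random and purely illustrative): an input pattern is propagated through one hidden layer to the output layer using the weighted connections.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(3, 2))          # weights from 2 input nodes to 3 hidden nodes
W_output = rng.normal(size=(1, 3))          # weights from 3 hidden nodes to 1 output node

def forward(pattern):
    hidden = sigmoid(W_hidden @ pattern)    # activation of the hidden layer
    return sigmoid(W_output @ hidden)       # activation of the output layer

print(forward(np.array([0.5, -1.0])))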
more...
- Non-parametric Regression (also known as local modelling)
Non-parametric regression [1] belongs to a data analytic methodology usually known as local modelling (Fan, 1995). The basic idea behind local regression consists of obtaining the prediction for a data point x by fitting a parametric function in the neighbourhood of x. This means that these methods are locally parametric, as opposed to the methods described in the previous section. According to Cleveland and Loader (1995) local regression traces back to the 19th century, and these authors provide a historical survey of the work done since then. The modern work on local modelling starts in the 1950s with the kernel methods introduced within the probability density estimation setting (Rosenblatt, 1956; Parzen, 1962) and within the regression setting (Nadaraya, 1964; Watson, 1964). Local polynomial regression is a generalisation of this early work on kernel regression. In effect, kernel regression amounts to fitting a polynomial of degree zero (a constant) in a neighbourhood. Summarising, we can state the general goal of local regression as trying to fit a polynomial of degree p around a query point (or test case) xq using the training data in its neighbourhood. This includes the various available settings like kernel regression (p=0), local linear regression (p=1), etc. [2] Local regression is strongly related to the work on instance-based learning (e.g. Aha et al., 1991) within the machine learning community. Given a case x for which we want to obtain a prediction, these methods use the training samples that are most similar to x to obtain a local model that is used to obtain the prediction. These inductive methodologies do not perform any kind of generalisation of the given data [3] and delay learning till prediction time [4].
[1] Also known as non-parametric smoothing and local regression.
[2] A further generalisation of this set-up consists of using polynomial mixing (Cleveland & Loader, 1995), where p can take non-integer values.
[3] This is not completely true for some types of instance-based learners, as we will see later. Nevertheless, it is true in the case of local modelling.
[4] That is the reason for their also being known as lazy learners (Aha, 1997).
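A minimal numpy sketch of local modelling (not from the source): kernel regression, i.e. a local polynomial of degree p = 0, with a Gaussian kernel and an illustrative bandwidth.

import numpy as np

def kernel_regression(x_train, y_train, x_query, bandwidth=1.0):
    """Nadaraya-Watson estimate: a weighted average of the training responses,
    with weights decaying with the distance to the query point."""
    weights = np.exp(-((x_train - x_query) ** 2) / (2 * bandwidth ** 2))
    return (weights * y_train).sum() / weights.sum()

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = np.sin(x) + rng.normal(0, 0.1, 100)
print(kernel_regression(x, y, 2.0, bandwidth=0.5), np.sin(2.0))   # estimate vs. true value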
more...
- Numerical Values
If a learning algorithm uses numerical values as its example language, it uses a fixed set of attributes with numerical domains to
represent the instances of an underlying universe of discourse.
Numerical domains
- are not necessarily finite and
- have a natural interpretation that can be used for calculations by a learning algorithm,
which are the two main differences to nominal attribute-value representations, which
restrict algorithms to merely distinguishing between different values.
more...
- On-Line Learning
The learning task here is how to combine advice from several experts.
For instance, several experts predict the weather of the next day by giving a prediction and a probability,
e.g. expert1 says, the probability for rain is 0.8.
The prediction of different experts may be conflicting, e.g. expert2 could say, the probability
for rain is 0.2.
The learner then, at the next day, perceives the true outcome, e.g. it actually rains. Hence, the belief
in expert2 decreases, the one in expert1 increases.
The learner again receives the predictions of the experts, combines these pieces of advice in order to decide
(e.g. whether or not to take an umbrella), compares its prediction with reality and updates the weights for the
experts.
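The source does not name a particular algorithm; the following Python sketch shows one standard scheme for this setting, exponentially weighted averaging of expert advice, with illustrative numbers matching the rain example above.

import math

def combine(predictions, weights):
    """Weighted average of the experts' rain probabilities."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total

def update(weights, predictions, outcome, eta=1.0):
    """Decrease each expert's weight according to its squared prediction error."""
    return [w * math.exp(-eta * (p - outcome) ** 2) for w, p in zip(weights, predictions)]

weights = [1.0, 1.0]                        # expert1 and expert2 start equally trusted
predictions = [0.8, 0.2]                    # predicted probabilities for rain
print("combined forecast:", combine(predictions, weights))
weights = update(weights, predictions, outcome=1.0)     # it actually rains
print("new weights:", weights)              # belief in expert1 grows, in expert2 shrinks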
more...
- Ontology learning
An ontology is, according to Tom Gruber, "the specification of conceptualizations, used to help programs and humans share knowledge".
In this usage, an ontology is a set of concepts - such as things, events, and relations - that are specified in some way (such as specific natural language) in order to create an agreed-upon vocabulary for exchanging information.
The manual building of ontologies
still remains a tedious, cumbersome task which can easily result in a knowledge
acquisition bottleneck. The success of the Semantic Web strongly depends on the
proliferation of ontologies, which requires that the engineering of ontologies be
completed quickly and easily.
A method which has proven
to be extremely beneficial for the knowledge acquisition task is the integration of
knowledge acquisition with machine learning techniques.
more...
- Overfitting
Overfitting is the phenomenon that a learning algorithm adapts so
well to a training set that the random disturbances in the training
set are included in the model as if they were meaningful.
Consequently (as these disturbances do not reflect the underlying
distribution), the performance on the test set (with its own, but
definitely different, disturbances) will suffer for techniques that
learn too well.
Train-and-test, cross validation and bootstrap approaches
have been developed to cope with this problem.
more...
- Prediction of electricity consumption
-
more...
- Prediction of sales
Sales are predicted several weeks ahead in order to minimize stocks.
more...
- Prediction of telecommunication services
Prediction of telecommunication services usage at particular times of day is helpful in order to minimize the use of external services or optimize network routing.
more...
- Preprocessing
Preprocessing cleans, aggregates, discretizes data, selects and generates features, and handles NULL values in order to pass the data to the data mining tool (modeling step) appropriately.
more...
- Principal component analysis
The principal component (or Karhunen-Loeve) directions of the input matrix X are the eigenvectors vj of XTX.
The largest principal component of some input data points is
the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance.
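A small numpy sketch on toy data; the principal component directions are obtained here via the singular value decomposition of the centred data matrix, which is equivalent to the eigen-decomposition of XTX:

    import numpy as np

    X = np.random.randn(100, 3)          # toy data: rows are instances, columns are numerical attributes
    X = X - X.mean(axis=0)               # centre the data

    # The right singular vectors of X are the eigenvectors of X^T X,
    # i.e. the principal component directions.
    _, singular_values, Vt = np.linalg.svd(X, full_matrices=False)
    largest_direction = Vt[0]            # direction maximizing the variance of the projected data
    smallest_direction = Vt[-1]          # direction minimizing that variance
    projected = X @ Vt.T                 # data expressed in principal component coordinates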
more...
- Probabilistic Categorization Tree
The probabilistic categorization tree, used e.g. by COBWEB, is a special data structure, a tree where ...
- each node defines a cluster, which is a set of (unclassified) examples.
- each leaf represents exactly one example, described by an attribute-value tuple.
- each inner node represents a cluster, consisting of the union of the clusters of its successors.
Put the other way around: the clusters of a node's successors partition the set of examples covered by
that node.
- the root node represents the cluster containing all examples stored in the tree.
- for every node n (except for the root node) the following probability is calculated and attached to the node:
given that an example from the cluster of n's predecessor has been drawn under a uniform distribution,
how probable is it that this example is in n's cluster?
- for each attribute a, each possible value v of a, and each node n of the tree, the fraction
(again interpreted as a probability) of examples in n's cluster for which attribute a has value v is calculated.
Such a tree is suited for describing a hierarchical clustering probabilistically.
This hypothesis language is closely related to the task of achieving
- high predictability of variable values, given a cluster, and
- high predictiveness of a cluster, given variable values,
by choosing an appropriate tree for a given set of examples.
more...
- Probably Approximately Correct (PAC-) Learning
In the PAC-theory the topic is learnability of
- concept classes C
- over instance spaces LE
- by certain hypotheses spaces H,
- considering an unknown but fixed probability distribution D over
LE, modeling that there is a probability for each instance of LE to
occur.
All concepts of the concept classes C and H are subsets of LE.
An example e ∈ LE is classified according to a concept c, if
it is given whether e ∈ c or not.
Probably Approximately Correct learnability means, that there is an
algorithm A that is able to
- select an ε-good hypothesis from H
- with a probability of at least 1 - δ
- for each δ, ε with 0 < δ < 1/2 , 0 <
ε < 1/2.
To find such a hypothesis the algorithm is allowed to read a set X of classified examples, drawn with
respect to D.
A hypothesis h is called ε-good, if the probability errorD(h, c) that an instance e ∈ LE
is drawn and classified wrong by h is at most ε:
errorD(h, c) := PrD(e ∈ (c \ h) ∪ (h \ c))
Two main questions can be distinguished in PAC-theory:
- How many examples are needed to learn Probably Approximately
Correct?
Is there a polynomial p for a specific concept class C, such that there
is an algorithm A, which can find an ε-good hypothesis for
each concept c ∈ C with a probability of at least 1 - δ,
for each 0 < δ < 1/2 and 0 < ε < 1/2, after having read at most p(1/δ, 1/ε) classified
examples from LE, drawn with respect to D?
- How complex is the learning task?
If question 1 can be answered yes, is there an algorithm A` and a polynomial p` such that A` can find an ε-good
hypothesis in time p`(1/δ, 1/ε, |c|), where |c| denotes the representation length of the
actual concept c to be learned?
If question 2 can be answered yes, so can question 1, because in time p`(1/δ, 1/ε, |c|)
the cardinality of the set of read examples is bounded by a polynomial in 1/δ, 1/ε.
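As an illustration of question 1 (a standard result, stated here only for the special case of a finite hypothesis space H and a learner that always outputs a hypothesis consistent with the examples read so far): any sample of size
m ≥ (1/ε) * (ln |H| + ln(1/δ))
suffices, because the probability that some ε-bad hypothesis of H is consistent with m independently drawn examples is at most |H| * (1 - ε)m ≤ |H| * e-εm ≤ δ.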
more...
- PROGOL
The algorithm PROGOL uses inverted entailment to generalize a set of positive
examples with respect to background knowledge B. Examples are definite clauses.
Additionally constraints can be given in the form of headless horn clauses.
The outer loop of PROGOL chooses a yet uncovered example, generalizes it
to a definite clause and adds that clause to the background knowledge.
This is iterated until all examples are covered by the (extended) background
knowledge.
The principle of inverting entailment:
For every generalization h of an example e with respect to background knowledge
B it is required that B ⋀ h ⊧ e.
Applying the deduction theorem leads to B ⋀ ¬ e ⊧
¬ h.
Let ⊥ be the most specific clause, more general (with respect to B)
than e. Then B ⋀ ¬ e ⊧ ¬ ⊥ ⊧
¬ h and h ⊧ ⊥ for all hypotheses h more general than e.
Generalizing one example:
The hypothesis space for the generalization of single examples is a syntactically
restricted subset of the set of definite clauses.
The restriction is specified by the user via so called "mode declarations",
restricting definite clauses in two ways:
- The maximal depth i of definite clauses is bound.
- The allowed predicate symbols for head and body of the clause are given.
Another indirect restriction of the search space is given by restricting the number of
maximal resolution steps when calculating inferences.
There will usually be many possible generalizations of an example. The generalization
process can be regarded as a search through a lattice of hypotheses, quasi-ordered by
generality.
PROGOL uses θ-subsumption for quasi-ordering the search space.
- The most general element of this lattice is the empty clause.
- Let ⊥ be the most specific definite clause, such that
B ⋀ ⊥ ⋀ ¬ e derives the empty clause in
a fixed number of resolution steps.
The most specific element within the user-specified search space is called
⊥i. It is the most specific clause of depth at most
i, θ-subsuming the clause ⊥.
Within the lattice a top-down search is performed.
The description length of clauses serves as a quality measure.
The underlying assumption of this approach is, that clauses with shorter representations
generalize better than those with longer representations.
For details, please refer to the publication above.
more...
- Projection Pursuit Regression
Projection pursuit regression (Friedman and Stuetzle, 1981) is a type of additive model (Hastie and Tibshirani, 1990) of the form y = ∑m gm(wm * x).
These models can be simply described as additive functions of linear combinations of the attributes. Projection pursuit regression uses the backfitting algorithm (Friedman and Stuetzle, 1981) to iteratively obtain the additive terms of the model.
more...
- Propositionalisation
The process of transforming a multi-relational dataset, containing structured examples, into a
propositional dataset with derived attribute-value features, describing the structural
properties of the examples. The process can thus be thought of as summarising data
stored in multiple tables in a single table (the target table) containing one record per
example. The aim of this process, of course, is to pre-process multi-relational data for
subsequent analysis by attribute-value learners.
more...
- Q Learning
Q learning is a method addressing the task of Reinforcement Learning.
The main idea is to approximate the function assigning each
state-action pair the highest possible payoff.
The setting:
Let
- S denote a finite set of states an agent can be in,
- A denote a finite set of actions, an agent may perform,
- δ denote a transition function, mapping each state-action pair
to a successor state,
- r denote a reward function, mapping state-action pairs to payoffs,
- and π denote an agent's policy, mapping each state to an action.
Here we use the discounted cumulative reward as a quality measure for
fixed policies.
For a constant γ ∈ [0, 1] and starting state s0
this measure is defined as
Vπ(s0) := ∑t ≥ 0 γt rt ,
with
- st := δ(st-1, π(st-1)), and
- rt := r(st, π(st)),
∀ t ∈ ℕ, t > 0.
The method of Q Learning is based on the function Q, defined as
follows:
each state-action pair (s, a) is assigned the highest
discounted cumulative reward possible, if state s is the agent's
starting state and a is the first action performed. So the function Q
is based on an optimal policy. If we knew the highest possible discounted
cumulative rewards for all possible successor states and actions, then we could
calculate the values for the actual state and actions as
Q(s, a) := r(s, a) + γ maxa' Q(δ(s, a), a')
Of course, an optimal strategy is not known in advance. But it turns out that the
Q function can be approximated by continuously collecting locally observed
payoffs and by updating the approximation of Q
using the equation above, even though the values for the successor states may have been
initialized to arbitrary values.
The Algorithm of Q Learning:
- Create a table with an entry for each element in S x A.
The entries are referred to by "Q'(s, a)".
- Initialize all table entries to 0.
- If for a state s and action a a reward r(s, a)
and a successor state s' are observed, then update Q'(s, a) as follows:
Q'(s, a) := r(s, a) + γ maxa' Q'(s', a')
It can be shown, that if each state-action transition is passed
infinitely often, then the approximation Q' converges to the
correct function Q in this finite and deterministic setting.
Please refer to the bibliography link above for a proof.
The assumption of passing each state-action transition infinitely
often demands that the agent does not stick to always choosing the
action leading to the best payoff according to the current
approximation of Q, but that it assigns a certain weight to
exploring its environment. A possible way of doing so is to
define a probability distribution over the set of actions according
to the estimation for Q, e.g.
P(ai | s) = kQ'(s, ai) / ∑j kQ'(s, aj)
for all ai ∈ A and s ∈ S.
This distribution assigns more promising actions a much higher
probability of being chosen, but fulfills nonetheless the requirement
of passing each possible state-action transition infinitely often.
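A minimal tabular sketch of the algorithm in Python. The environment (states, actions, δ and r) is a made-up deterministic toy corridor that only stands in for whatever environment is actually used, and exploration is done here by choosing actions uniformly at random instead of via the kQ' distribution above:

    import random

    states  = range(5)                                     # toy corridor: states 0..4, goal at 4
    actions = ["left", "right"]
    delta   = lambda s, a: max(0, s - 1) if a == "left" else min(4, s + 1)
    reward  = lambda s, a: 1.0 if delta(s, a) == 4 else 0.0

    gamma = 0.9
    Q = {(s, a): 0.0 for s in states for a in actions}     # steps 1 and 2: table initialized to 0

    s = 0
    for _ in range(10000):
        a = random.choice(actions)                         # explore
        r, s_next = reward(s, a), delta(s, a)
        Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)   # step 3: the update rule above
        s = 0 if s_next == 4 else s_next                   # restart an episode at the goal state

After enough transitions Q[(3, "right")] approaches 1 and Q[(0, "right")] approaches γ3, reflecting the discounted distance to the goal.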
more...
- Quasi-Ordering of Hypothesis Languages
In the concept learning/ILP setting we are looking for a concept that is
more general than every positive example, but more specific than
any concept covering negative as well as positive examples.
So it is a good idea to structure the hypotheses by their generality,
and to generate more general or more specific hypotheses stepwise. The
search in a structured space of hypotheses
can be pruned, which is not possible in an unstructured search space.
In his book "Machine Learning" Tom Mitchell
proposed a quasi-ordering for attribute-value hypothesis languages
(partial ordering of the set of all hypotheses) based on the
more-special-than (or more-general-than) relation:
A concept c1 is more special than a concept
c2, if and only if every example covered by c1
is covered by c2 as well.
A concept c1 is strictly more special than c2,
if it is more special but distinct from c2, which means that
there is at least one instance covered by c2, that is not
covered by c1 as well.
A concept c1 is more general than c2,
if c2 is more
special than c1. Concept c1 is strictly more general
than c2, if c2 is strictly more specific than
c1.
If concepts are characterized by attribute values, which can be arranged
in a hierarchy according to a more-special-than
relation, the search space (which can be seen as a cartesian
product of the attribute value sets) can always be (partially)
ordered.
more...
- RDT
The algorithm RDT basically performs a complete search through a
subspace of the hypothesis language of clauses. The subspace results
from syntactical restrictions chosen by the user.
The complete search for consistent clauses does not only
- guarantee finding a single consistent clause within the search space,
if such a clause exists (the task of concept learning),
- but it can also be used for finding all such clauses in the
(restricted) hypothesis language, which makes RDT well suited
for the characterization task of inductive logic programming.
To perform a complete search, pruning is restricted to cases, where
- a general clause does not cover a positive example, so any more
specific clause cannot be consistent with the example set, or
- a specific clause covers a negative example, so each more general
clause will do so as well.
The search strategy of RDT is top-down induction of horn clauses
with rule schemes directing the search. A rule scheme is a clause
with at least one predicate variable. Predicate variables are mapped
to predicate symbols by substitutions Σ, similar to usual variable
substitution, but with the additional restriction that mappings have
to be injective, so different predicate variables cannot be replaced by
the same predicate symbol.
For rule schemes we can define the "more general than" relation as
follows:
Let RS and RS´ be rule schemes. Then RS is more general than
RS´ if and only if there are substitutions Σ and σ so that
RSΣσ ⊆ RS´.
The search starts with the shortest and therefore most general rule
schemes / clauses that only have to be refined by adding literals, if
a clause covers positive and negative examples. Redundant literals
cannot be added by this strategy.
more...
- Regression Functions
A regression function is a function of the form
y = f(xi, ai, b),
where the prediction of an attribute y depends on the attribute
values xi and a set of parameters ai.
The most widespread variant is linear regression, where
the function takes the form
y = ∑i ai xi + b
For a specific model the parameters ai and b are estimated
by the learning algorithm, e.g. such that the empirical error on the
training data is minimized.
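A small numpy sketch of fitting the linear variant by least squares; the toy data are made up for illustration:

    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])  # attribute values x_i
    y = np.array([3.1, 2.4, 4.6, 7.2])                              # target attribute

    X_ext = np.hstack([X, np.ones((X.shape[0], 1))])  # extra column of ones for the intercept b
    params, *_ = np.linalg.lstsq(X_ext, y, rcond=None)
    a, b = params[:-1], params[-1]                    # estimated parameters a_i and b
    y_hat = X @ a + b                                 # predictions y = sum_i a_i x_i + b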
more...
- Regression Rules
Weiss and Indurkhya (1993) have developed a system (SWAP1R) that learns regression rules in a propositional language. The conditional part of the rules consists of a conjunction of tests on the input attributes, while the conclusion part contains the associated prediction of the target variable value. Originally these predictions consisted of the average Y value of the cases satisfying the conditional part, but later the possibility of using k-nearest neighbours was added (Weiss & Indurkhya, 1995). This system has the particularity of dealing with regression by transforming it into a classification problem. In effect, the target values of the training cases are pre-processed by grouping them into a set of user-defined bins. These bins act as class labels in the subsequent learning phase. This means that the learning algorithm induces a classification model of the data, where the classes are numbers representing the obtained bins. These numbers can then be used to make numeric predictions for unseen cases.
This idea of transforming a regression problem into a classification task was taken further in the work of Torgo and Gama (1996), where a system called RECLA acts as a generic pre-processing tool that allows virtually any classification system to be used in a regression task.
Torgo (1995) has developed a propositional regression rule learner (R2) that also uses an IF-THEN rule format. This algorithm can use several different functional models in the conclusion of the rules. R2 uses a covering algorithm that builds new models while there are uncovered cases to fit. The model is chosen from the model lattice that includes all possible regression models for the conclusion part of the rules. The result of this first step is a rule with an empty conditional part and the built model as conclusion. This rule is then specialised by adding conditions that restrict the model's domain of applicability, with the goal of improving its fit. This restriction is guided by an evaluation function that weighs both the fit and the degree of coverage of the rule. This means that the system is looking for rules with good fit that cover as many cases as possible.
CUBIST is a recent commercial system developed by Ross Quinlan. It learns a set of unordered IF-THEN rules with linear models in the conclusion part. Up to now there is no published work on CUBIST, but from our experience with the system it looks like a rule-based version of M5 (Quinlan, 1992, 1993b). It seems to be a kind of C4.5rules (Quinlan, 1993a) for M5. The system is able to deal with both continuous and nominal variables, and obtains a piecewise linear model of the data. CUBIST can also combine the predictions of this model with k-nearest neighbour predictions (as M5 does).
more...
- Regression Trees
Regression trees may be considered as a
variant of decision trees,
designed to
approximate real-valued functions
instead of being used for classification tasks.
The inner nodes of regression trees are marked with tests as
in decision trees.
The difference is that the leaves of regression trees may be
marked with arbitrary real values,
whereas in decision trees the leaves may only be marked with elements
of a finite set of nominal values.
This makes a real difference only, if you look at sets of trees
(e.g. as potential hypotheses languages), instead of comparing
single trees.
A further extension is to allow linear functions as labels of leaf
nodes. In this case the function at the leaf node reached for a
specific example is evaluated for the instance's attribute values,
to determine the value of the target attribute. This allows for
global approximation by using multiple local approximations.
more...
- Reinforcement Learning
In the reinforcement learning scenario the task is to find a
policy that maximizes the reward of an agent by
experimentation.
more...
- Resolution
Resolution provides an important operator for proving clauses.
Let
- C1 := {A1, A2, ...,
Am}
- C2 := {¬A1, B1, ...,
Bn}
be clauses, represented by sets of literals.
Then
D := { A2, ..., Am, B1,
..., Bn }
can be derived from C1 and C2.
Incorporating Substitution
Combining the rule above with substitution leads to an even more
powerful resolution rule. This is illustrated by figure 1.
Figure 1: The Resolution-"V".
Let C1 and C2 be clauses,
- θ1 and θ2
be substitutions, and
- let L be a literal with L ∈ C1θ1 and ¬L ∈ C2θ2.
Then
D := ( C1θ1 \ {L} ) ∪
( C2θ2 \ {¬L} )
can be derived from C1 and C2.
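A small worked example (the clauses are made up for illustration):
Let C1 := { p(X), q(X) } and C2 := { ¬p(a), r(Y) }.
With θ1 = {X/a}, θ2 = {} and L = p(a) we obtain
D := ( C1θ1 \ {p(a)} ) ∪ ( C2θ2 \ {¬p(a)} ) = { q(a), r(Y) }.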
more...
- Restricted First Order Logic
Many machine learning algorithms use a restriction of first order logic as a formalism for representing
hypotheses, examples and sometimes background knowledge.
Advantages of choosing logical representations are:
- Representations are readable by humans.
- Logic is a well-understood mathematical discipline, so plenty of mathematical results can be used
when searching for learning methods and proving their correctness.
- Examples, hypotheses and background knowledge can be expressed within the same formalism.
A problem of full first order logic is that it is algorithmically hard to check whether a formula
implies another, which is necessary for finding hypotheses that are consistent with the set of examples.
To receive a more tractable formalism, restrictions of first order logic are chosen.
Depending on how expressive the restriction has to be, one can choose between
- propositional logic
- horn clauses
- function-free clauses
- non-recursive clauses
and many more specific restrictions.
If e.g.
- background knowledge is given in the form of function-free ground facts (relational database),
- function-free horn clauses are sufficient as the hypothesis language for a specific learning task,
- then a possible candidate for an inference operator is the (in the general case incomplete) θ-subsumption.
This restriction of first order logic is much more tractable and therefore to be preferred (if it is expressive enough).
Typical restrictions of the hypothesis language are variously restricted clauses, e.g. function-free horn clauses.
more...
- Revision
The learning task is to revise a given theory, in order to no longer imply a given set of unwanted facts.
In other words, a theory has become inconsistent and shall be adjusted to some new insights.
An obvious demand is that the adjustment should not change more of the theory than necessary. In particular, parts
that are not involved in the derivation of the incorrect facts should not be changed at all.
more...
- Sampling
Sampling is the process of selecting a proper subset of elements from the full population so that the subset can be used to make inference to the population as a whole. A probability sample is one in which each element has a known and positive chance (probability) of selection. A simple random sample is one in which each member has the same chance of selection.
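A two-line sketch of drawing a simple random sample in Python; the population here is just a toy list of identifiers:

    import random

    population = list(range(1000))             # the full population
    sample = random.sample(population, k=50)   # every element has the same chance of selection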
more...
- Searching Version Spaces
If we can manage to find an appropriate ordering of our hypothesis
language, then learning can be regarded as navigating through
version spaces, with each new example read limiting the number of
hypotheses that are still consistent.
The search will either lead from specific to general, driven by
positive examples (bottom-up), or from general to specific, driven
by negative examples (top-down).
The CANDIDATE-ELIMINATION Algorithm combines a bottom-up and a
top-down search.
more...
- Sequence Prediction
The scenario for the task of Sequence Prediction is the following.
Given: Sequences of events, e.g. in the form
(E1, t1), (E2, t2),
..., (En, tn), where the Ei denote
events, and the ti denote points in time.
The target is to find rules that accurately predict the
occurrence of future events, given the event sequence up to the
current point in time.
more...
- Sequences and episodes
-
more...
- Spatial relationships
A relationship in spatial data is a relationship
between features in a Euclidean space, defined in
terms of the co-locational trends of two or more
features over that space. An example is
determining the confidence of the rule ‘where there’s
smoke there’s fire’ with respect to a set of coordinates,
each representing the feature smoke or
fire. The task here would be to determine whether
fire occurs in the neighborhood of smoke more
often than would be expected by chance.
The treatment of relationships among spatial objects is an essential task in geographic data processing and CAD/CAM.
more...
- Star
The Star method is suitable for the task of concept learning
as well as for conceptual clustering.
It can deal with different kinds of attribute-value representations
of examples.
The idea is to find all most general descriptions that separate an
example set from other example sets. Then, according to a certain
preference criterion, the best of all descriptions is chosen.
The concepts or clusters generated that way may be specialized
afterwards, to make them disjoint.
Let P denote a set of instances to be covered and N a
set of instances that should not be covered.
A partial star is a conjunction of conditions testing
attribute-values.
Such a partial star is called consistent with P and N, if
it covers every p ∈ P, but no n ∈ N.
STAR(P|N), the star of P against N, is defined as the
disjunction of all maximally general partial stars, consistent with P
and N.
As a quality measure for stars and partial stars, a so called
"Lexical Evaluation Function" (LEF) is used.
For efficiency reasons a restriction of STAR(P|N), the bounded star
STAR(P|N,s)
may be of interest:
If |STAR(P|N)| > s, then STAR(P|N,s) contains just the s best partial
stars of STAR(P|N), according to the LEF.
Otherwise STAR(P|N,s) = STAR(P|N).
For algorithms based on the star-method, it is often sufficient to
build stars of single instances of the example language against a
set of examples.
Let
- p be an instance to be covered,
- N be a set of instances that should not be covered,
- s be a parameter for bounding complexity.
How to generate STAR({p}|N,s):
- Generate a list of partial stars with exactly one
condition, taken from p. These partial stars may still
cover an n ∈ N, and therefore be inconsistent.
- If there are more than s partial stars in the list,
remove all except the best s ones, according to the
LEF.
- As long as there are inconsistent partial stars that
can be extended by adding a condition, such that the
result has not been generated before, do:
- Extend all inconsistent partial stars by one
condition and add them to the list.
- If the list contains more than s elements, just
keep the best s ones.
- Remove all inconsistent partial stars from the list.
- Output the list, ordered by the LEF.
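A compact Python sketch of building the bounded star STAR({p}|N, s), following a simplified reading of the procedure above. Examples are represented as attribute-value dictionaries, a partial star as a set of (attribute, value) conditions taken from p, and the LEF used here (first few covered negatives, then few conditions) is only a stand-in for a real lexical evaluation function:

    def covers(conditions, example):
        return all(example.get(a) == v for a, v in conditions)

    def bounded_star(p, N, s):
        def lef(conds):                                    # lower is better
            return (sum(covers(conds, n) for n in N), len(conds))

        all_conditions = set(p.items())
        stars = sorted({frozenset([c]) for c in all_conditions}, key=lef)[:s]
        seen = set(stars)

        while any(lef(conds)[0] > 0 for conds in stars):   # some partial star is still inconsistent
            extended = []
            for conds in stars:
                if lef(conds)[0] == 0:                     # already consistent with {p} and N
                    continue
                for c in all_conditions - conds:
                    new = conds | {c}
                    if new not in seen:
                        seen.add(new)
                        extended.append(new)
            if not extended:
                break                                      # nothing left to extend
            stars = sorted(set(stars) | set(extended), key=lef)[:s]

        return sorted([c for c in stars if lef(c)[0] == 0], key=lef)

    p = {"size": "small", "colour": "red", "shape": "round"}
    N = [{"size": "small", "colour": "blue", "shape": "round"},
         {"size": "big",   "colour": "red",  "shape": "square"}]
    print(bounded_star(p, N, s=3))   # maximally general partial stars consistent with {p} and N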
more...
- Statistical Learning
Statistical learning theory focusses on two major questions:
- Asymptotic Analysis: Can it be proven that with an
increasing number of examples the empirical error converges to the
real error?
- Learning Rate: If point 1 has been proved, then how fast
does the empirical error converge to the real error?
more...
- Support Vector Machine (SVM)
Support vector machines were developed by Vapnik et al. based on the Structural Risk Minimization principle from statistical learning theory. They can be applied to regression, classification, and
density estimation problems.
Vapnik's book "Statistical Learning Theory" is
listed under "Publications", above.
Focussing on classification, SVMs take as input an i.i.d. training
sample Sn
(x1,y1),
(x2,y2), ...,
(xn,yn)
of size n from a fixed but unknown distribution Pr(x,y)
describing the learning task. xi represents the
document content and yi ∈ {-1,+1} indicates the
class.
In their basic form, SVM classifiers learn binary, linear decision rules
h(x) = sign(w * x + b) =
- +1, if w * x + b > 0, and
- -1, otherwise,
described by a weight vector w and a threshold b. According to
which side of the hyperplane the attribute vector x lies on,
it is classified into class +1 or -1. The idea of structural risk
minimization is to find a hypothesis h for which one can guarantee
the lowest probability of error. For SVMs, Vapnik shows that this
goal can be translated into finding the hyperplane with maximum
margin for separable data, as depicted in the following figure.
[Figure: a maximum-margin hyperplane separating two classes; the support vectors lie on the margin and are marked with circles.]
For separable training sets SVMs find the hyperplane h, which
separates the positive and negative training examples with maximum
margin. The examples closest to the hyperplane are called Support
Vectors (marked with circles).
Computing this hyperplane is equivalent to solving the following
quadratic optimization problem (see also algorithms).
minimize:
W(α) = - ∑i=1..n αi + 1/2 * ∑i=1..n ∑j=1..n yiyjαiαj (xi * xj)
subject to:
∑i=1..n yiαi = 0   and   0 ≤ αi ≤ C for i = 1, ..., n.
Support vectors are those training examples xi with
αi > 0 at the solution. From the solution of this
optimization problem the decision rule can be computed as
w * x = ∑i=1..n αiyi (xi * x)   and   b = yusv - w * xusv
The training example (xusv,yusv) for
calculating b must be a support vector with αusv
< C.
For both solving the quadratic optimization problem as well as
applying the learned decision rule, it is sufficient to be able to
calculate inner products between document vectors.
Exploiting this property, Boser et al. introduced the use of kernels
K(x1,x2) for learning non-linear
decision rules.
Popular kernel functions are:
Kpoly(d1, d2) = (d1 * d2 + 1)d
Krbf(d1, d2) = exp(-γ (d1 - d2)2)
Ksigmoid(d1, d2) = tanh(s (d1 * d2) + c)
Depending on the type of kernel function, SVMs learn polynomial
classifiers, radial basis function (RBF) classifiers, or two layer
sigmoid neural nets. Such kernels calculate an inner-product in some
feature space and replace the inner-product in the formulas above.
A good tutorial on SVMs for classification is "A Tutorial on
Support Vector Machines for Pattern Recognition", for details,
please follow the corresponding link above.
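A small sketch using the SVC class of scikit-learn (not the SVMlight implementation described below); the toy XOR-like data are made up and merely illustrate how a non-linear kernel is selected:

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # XOR-like, not linearly separable
    y = np.array([-1, 1, 1, -1])

    clf = SVC(kernel="rbf", C=10.0, gamma=1.0)   # kernel="poly" or "sigmoid" selects the other kernels above
    clf.fit(X, y)

    print(clf.support_)            # indices of the support vectors
    print(clf.predict([[0, 1]]))   # should predict the positive class (+1)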
more...
- SVM for regression
SVMs can be adapted for regression with a quantitative response, in ways that inherit some of the properties of the SVM classifier.
more...
- SVM light
SVMlight is an implementation of Support Vector Machines
(SVMs) in C.
The main features of the program are the following:
- fast optimization algorithm
- working set selection based on steepest feasible descent
- "shrinking" heuristic
- caching of kernel evaluations
- use of folding in the linear case
- includes algorithm for approximately training large transductive SVMs (TSVMs)
- can train SVMs with cost models
- handles many thousands of support vectors
- handles several tens of thousands of training examples
- supports standard kernel functions and lets you define your own
- uses sparse vector representation
The source code is free for scientific use. Please follow the link
"SVMlight - Details and Download".
There is also a regression support vector machine based on
SVMlight available: Please follow the link labeled
"mySVM".
Description
SVMlight is an implementation of Vapnik's Support Vector
Machine (see "The Nature of Statistical Learning Theory") for
the problem of pattern
recognition. The optimization algorithm used in SVMlight
is described in "Making large-Scale SVM Learning Practical.". The algorithm
has scalable memory requirements and can handle problems with many thousands
of support vectors efficiently. This new version also includes a new algorithm
for training large-scale transductive SVMs. The algorithm proceeds by solving
a sequence of optimization problems lower-bounding the solution using a
form of local search.
The code has been used on a large range of problems, including text
classification (see publications above), several image recognition
tasks, and medical applications.
Many tasks have the property of sparse instance vectors. This implementation
makes use of this property which leads to a very compact and efficient
representation.
more...
- Syntax/Grammar/Automata - Learning
Given:
- an infinite stream of examples, where examples are either
  - well-formed sentences of a language, or
  - sentences classified as well-formed or not well-formed,
- and
  - a class of languages ₤, where for each language there exists a procedure to decide whether a sentence is a member of this language,
  - or a class of automata Å, where each automaton can decide whether a given sentence is accepted.
Find the language L ∈ ₤ / the automaton A ∈ Å that generates / accepts all well-formed sentences.
The most interesting language, of course, is natural
language. However, most methods learn much more restricted
classes of languages, e.g. pattern languages (i.e. sets of
strings).
more...
- Telecommunications
-
Call Center Application
more...
- Text categorisation
-
more...
- Theta-Subsumption
The θ-subsumption is a correct, but incomplete inference operator for clauses with
the advantage of being feasible, while correct and complete inference operators in full first order
logic are known to be algorithmically intractable.
- A clause C1 θ-subsumes another clause C2, if and only if
a substitution θ exists, such that C1θ ⊆ C2.
- If C1 θ-subsumes C2 then C1 implies
C2.
Θ-subsumption may also be used for quasi-ordering the hypothesis language of clauses.
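A small worked example (the clauses are made up for illustration): let
C1 = { p(X), ¬q(X) }, i.e. p(X) ← q(X), and
C2 = { p(a), ¬q(a), ¬r(a) }, i.e. p(a) ← q(a), r(a).
With θ = {X/a} we get C1θ = { p(a), ¬q(a) } ⊆ C2, so C1 θ-subsumes (and hence implies) C2.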
more...
- Theta-Subsumption Ordering
When searching the hypothesis space of clauses, one possible feasible
way to quasi-order the hypothesis space is to use θ-subsumption.
The general idea is to structure the search space with respect to
generality of instances.
Intuitively the following definition of "more general than" is
reasonable:
If c and c' are clauses and c ⊨ c' then c is more general than
c'.
However, it is more feasible to define the "more general than"-relation
by restricting inference to θ-subsumption.
Let C1 and C2 be clauses. Then C1
is more general than C2, if and only if:
There is a substitution θ, such that C1θ ⊆
C2 .
more...
- Time Series Analysis
The general task
Given are time series, which are ordered sets of
observations { x1, .., xn },
where xi denotes the observation at point i in
time, e.g. a real-valued vector.
The goal is to predict the behaviour at point n+1 in time, which possibly may be determined by a
function.
More special tasks
Given: time series
- Trend: Find a characterization of the direction of change over time.
- Level Change: Find points in time, from where on the overall level of observable values will change.
- Cycles: Find cycles in the sequence of changes over time.
more...
- Time series clustering and indexing
-
more...
- Top-Down Induction of Decision Trees (TDIDT)
Top-down induction of decision trees is a method for the task of
concept learning (supervised learning).
Given:
A set of classified examples in attribute-value representation,
where the set of classes is finite.
The Method:
A decision tree is constructed top-down.
In each step a test for the current node is chosen (starting with the root node),
which best separates the given examples by classes.
The quality of a test is measured by the impurity / variance of example sets.
The most common measure is the information gain.
A test is a mapping from the set of all instances of the underlying
example language to a finite set of possible results.
In the decision tree there is a successor of the actual node for each
possible result of the test.
The given set of examples is split with respect to the chosen test.
For each successor that does not meet a given acceptance criterion,
e.g. that all examples are of the same class, the procedure is started
recursively.
Often the set of possible tests is limited to splitting the examples
according to the value of a certain attribute, so the method is restricted
to choosing a good attribute for each inner node.
If numerical attribute values are incorporated, tests may compare the value
of certain attributes to appropriate constants.
The aim of top-down induction of decision trees is to find a decision
tree that is as small as possible and fits the data, because it is assumed
that it is likely to find a good generalization of the training data that way.
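A small Python sketch of the attribute choice at a single node via information gain; the toy examples and attribute names are made up:

    import math
    from collections import Counter

    def entropy(labels):
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def information_gain(examples, labels, attribute):
        """Impurity reduction achieved by splitting on one nominal attribute."""
        after = 0.0
        for value in set(e[attribute] for e in examples):
            subset = [l for e, l in zip(examples, labels) if e[attribute] == value]
            after += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - after

    examples = [{"outlook": "sunny", "windy": "no"},
                {"outlook": "sunny", "windy": "yes"},
                {"outlook": "rain",  "windy": "no"},
                {"outlook": "rain",  "windy": "yes"}]
    labels = ["play", "stay", "play", "stay"]

    best = max(["outlook", "windy"], key=lambda a: information_gain(examples, labels, a))
    # best == "windy": splitting on it yields pure successors, so the recursion stops there.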
more...
- Top-Down Induction of Horn Clauses
In Inductive Logic Programming a common method for
- finding a horn clause that covers as many instances of a set of
positive examples as
possible,
- but does not cover any instance of a set of negative examples
is to use a general-to-specific search through a given hypothesis
language.
General-to-specific search for horn clauses:
-
Given a set of examples, a specific target predicate P and its arity,
e.g. 2.
- Start with a most general horn clause with predicate P in
the head.
These are the horn clauses with empty bodies and different variables
at each component of P, e.g. " P(x, y) ← ".
These clauses are most general, because they derive P(x, y) for every
possible substitution of x and y.
-
In general there will be negative examples in the example set, which
then will be covered by all most general clauses.
In this case, the actual horn clause is refined step by step, until
no negative example is covered any more.
This is done by adding further conditions to the clause's body.
Doing so decreases the set of substitutions fulfilling all conditions
of the horn clause, so the clause becomes more specific.
Depending on the refinement procedure two approaches can be
distinguished:
-
The refinement procedure can be guided heuristically, as done
by FOIL (not restricted to horn clauses), to construct a clause that
covers as many of the yet uncovered positive examples as possible, but none of the negative
ones.
-
Another way is to perform a complete search through a syntactically
restricted hypothesis space, which will generally require higher
computational effort. On the other hand, it makes it possible to give guarantees,
e.g. that a most general solution in the hypothesis space is found.
An algorithm following this approach is RDT.
more...
- Top-Down Induction of Regression Trees
The Top-Down Induction of Regression Trees addresses the task of
function approximation.
Nevertheless the method is quite similar to top-down induction of
decision trees.
Given:
- a set E of instances of the example language in
attribute-value representation
- for each instance x ∈ E its function value f(x)
The method:
A regression tree is constructed top-down.
- In each step a test for the actual node is chosen (starting with
the root node).
The search for good tests is guided by variably defined quality
measures, which
reward tests that induce partitions which subsume instances with
similar function values.
- As in decision trees, a test in regression trees is a mapping
from
the set of all instances of the underlying example language to a finite
set of possible results.
There is a successor of the actual node for each possible result of the
test.
- The given set of examples is split with respect to the chosen
test.
- For each successor that does not meet a given acceptance criterion,
the procedure is called recursively.
- Each leaf of a regression tree is marked by a real value.
The value is, of course, chosen such that the quality measure
for the leaf is maximized. Alternatively a linear function may be
attached to leaf nodes, if the attributes describing examples are
numerical as well.
When the method of regression trees is invoked, there are often
numerical attribute values, and tests may compare the value of
certain attributes to appropriate constants.
The aim of regression trees usually is to find a good generalization of the training
data, which means good predictions of yet unseen values of the actual function. Another
possible aim is to analyze which variables are well suited to predict a target value.
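A short sketch using scikit-learn's DecisionTreeRegressor; the noisy sine data are made up, and depth and leaf size act here as the acceptance criterion:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    X = np.linspace(0, 6, 200).reshape(-1, 1)            # one numerical attribute
    y = np.sin(X).ravel() + 0.1 * np.random.randn(200)   # noisy values of the target function

    tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=5)
    tree.fit(X, y)
    y_hat = tree.predict([[1.5]])   # the real value attached to the leaf reached by this instance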
more...
- Train and Test
Keeping a test set in addition to the training set is a general method
to estimate the accuracy of a concept learning algorithm (classifier).
Given a sample of classified instances and a concept learning
algorithm, this is how this method works:
- Split the data into two parts, a training set and a test set.
-
Run the learner on the training set, but do not show it the test set.
Let h denote the hypothesis output by the learner.
-
Use h to classify all instances of the test set. The fraction
of correctly classified instances is the estimated accuracy.
Usually a fraction of 20-30% of the available data is chosen as the
test set. This is a good option, if the size of the test set is
larger than 1000.
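A minimal scikit-learn sketch of the procedure; the iris data and the decision tree learner are only placeholders for whatever sample and classifier are actually used:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)

    # Step 1: split the classified instances into a training set and a test set (30% test).
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Step 2: run the learner on the training set only.
    h = DecisionTreeClassifier().fit(X_train, y_train)

    # Step 3: the fraction of correctly classified test instances estimates the accuracy.
    print(accuracy_score(y_test, h.predict(X_test)))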
more...
- VC dimension
The VC dimension (for Vapnik Chervonenkis dimension) is a measure of the capacity of a classification algorithm. It is one of the core concepts in statistical learning theory. It was originally defined by Vladimir Vapnik and Alexey Chervonenkis.
Intuitively, the capacity of a classification model is related to how complicated it can be. Think of thresholding a high-degree polynomial: if the polynomial evaluates above zero, we classify that point into the positive class, and into the negative class otherwise.
If we use a very high-degree polynomial, it can be very wiggly, and can fit a training set exactly. But, we should expect that outside of the training points, the classifier would not generalize well, because it is too wiggly. Such a polynomial has a high capacity. Alternatively, we can think about thresholding a linear polynomial. This polynomial may not fit the entire training set, because it has a low capacity. This notion of capacity can be made more rigorous.
First, we need the concept of shattering. Consider a classification model f with some parameter vector θ. The model f can shatter a set of data points if, for all assignments of labels to those data points, there exists a θ such that the model f makes no errors when evaluating that set of data points.
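A standard illustration: linear classifiers in the plane (thresholded linear functions of two inputs) can shatter any set of three points in general position, because each of the 8 possible labelings can be realized by some separating line, but they cannot shatter any set of four points; their VC dimension is therefore 3.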
more...
- Version Space
Version Spaces may be regarded as a method or as a hypothesis language itself.
In concept learning one has
- a set of classified examples and
- a set of hypotheses, which define the possible results of the algorithm.
The task of incremental concept learning demands that the learning algorithm
reads one example per iteration and outputs a hypothesis each time. It can be assumed
that the algorithm may not store many of the seen examples, but basically just revises
the current hypothesis according to the current example.
One often has several different hypotheses fitting the data, and it can be an
advantage, for reasons of efficiency, to define a preference criterion, to have
a unique hypothesis at each point in time.
But in fact there often is a set of possible results fitting the data, which
we expect to become smaller the more examples the algorithm has read.
By using Version Spaces, one can avoid choosing a preference criterion, which
may not be justified in some situations, and instead represent the set of all
hypotheses of an underlying hypothesis language that (still) fit the data.
After each iteration step an arbitrary element of the represented set of hypotheses can
be chosen as an output of the learning algorithm.
The representation of the set can be achieved by partially ordering the hypothesis
language and by storing just the most general and the most specific hypotheses.
more...
- Web usage mining
Web mining is the integration of information gathered by traditional data mining methodologies and techniques with information gathered over the World Wide Web. Web mining is used to understand customer behavior, evaluate the effectiveness of a particular Web site, and help quantify the success of a marketing campaign.
Web mining allows you to look for patterns in data through content mining, structure mining, and usage mining. Content mining is used to examine data collected by search engines and Web spiders. Structure mining is used to examine data related to the structure of a particular Web site and usage mining is used to examine data related to a particular user's browser as well as data gathered by forms the user may have submitted during Web transactions.
The information gathered through Web mining is evaluated (sometimes with the aid of software graphing applications) by using traditional data mining parameters such as clustering and classification, association, and examination of sequential patterns.
more...
- WEIGHTED-MAJORITY
The WEIGHTED-MAJORITY algorithm addresses the task of on-line
learning.
Given a set of prediction rules (or advice from several experts), the
algorithm maintains a weight for each rule (or expert), specifying
how to form the algorithm's prediction from the set of predictions.
The algorithm runs in rounds. Every round
- the predictions of all rules are calculated first, and
- compared to the real outcome in step 2.
- Step 3 is to update the weights. This is done by decreasing
the weights of rules that predicted wrong.
So the algorithm favours rules with high accuracy.
Every round a prediction is made, before the real outcome can be
observed. The aim is to make as few mistakes as possible.
Let
- wi denote the weight of rule i,
- n denote the number of rules, and
- Hi(e) denote the prediction of rule number i for example e.
The algorithm for a binary prediction task, where the task is
to predict 0 or 1, e.g. whether an instance belongs to a concept or
not, or whether it will rain tomorrow:
- For i = 1 to n do: wi = 1
- Repeat
- Read an unclassified example e.
- W0 = Σ{i:Hi(e)=0}wi
- W1 = Σ{i:Hi(e)=1}wi
- If (W0 > W1) output prediction 0.
- If (W0 < W1) output prediction 1.
- If (W0 = W1) output a random prediction.
- Compare the prediction of every single rule with the true outcome,
and for each rule i that predicted wrong:
wi = β * wi, with β ∈
[0, 1[.
β is a fixed parameter, specifying how drastically to punish
wrong predictions.
By choosing β = 0 we obtain the CANDIDATE-ELIMINATION
algorithm, where each hypothesis inconsistent with an example is
permanently deleted.
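A small Python sketch of the algorithm; the rules and the example stream are made-up placeholders:

    import random

    def weighted_majority(rules, stream, beta=0.5):
        """rules: list of functions mapping an example to 0 or 1;
           stream: iterable of (example, true outcome) pairs."""
        w = [1.0] * len(rules)                       # initialization: w_i = 1
        mistakes = 0
        for e, outcome in stream:
            votes = [h(e) for h in rules]
            W0 = sum(wi for wi, v in zip(w, votes) if v == 0)
            W1 = sum(wi for wi, v in zip(w, votes) if v == 1)
            prediction = 0 if W0 > W1 else 1 if W1 > W0 else random.randint(0, 1)
            mistakes += int(prediction != outcome)
            w = [wi * beta if v != outcome else wi   # punish rules that predicted wrong
                 for wi, v in zip(w, votes)]
        return w, mistakes

    rules = [lambda e: e % 2, lambda e: 1, lambda e: 0]    # three toy prediction rules
    stream = [(i, i % 2) for i in range(20)]               # the first rule is always right
    print(weighted_majority(rules, stream))

The weight of the always-correct first rule stays at 1, while the other two rules are repeatedly punished, so the combined prediction quickly follows the best rule.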
more...
- WEKA
Weka is a collection of machine learning algorithms for solving
real-world data mining problems. It is written in Java and runs on
almost any platform. The algorithms can either be applied directly
to a dataset or called from your own Java code.
Weka is also well-suited for developing new machine learning schemes.
Weka is open source software issued under the GNU General Public License.
more...