Data Mining with Machine Learning

Abstract

This search process can be carried out by machine learning algorithms. However, traditional ML algorithms cannot cope with these massive amounts of data. Therefore, within the project Data Mining with Machine Learning, funded by Daimler-Benz AG, we developed a multistrategy approach to relational knowledge discovery in databases.

When learning from very large databases, reducing complexity is of the highest importance. Two extremes of making knowledge discovery in databases (KDD) feasible have been put forward. One extreme is to choose a very simple hypothesis language and thereby achieve very fast learning on real-world databases (e.g. association rule algorithms). The opposite extreme is to select a small data set and learn very expressive (first-order logic) hypotheses. A multistrategy approach makes it possible to combine most of the advantages of both and to exclude most of the disadvantages. Simpler learning algorithms detect hierarchies that are used to structure the hypothesis space for a more complex learning algorithm. The better structured the hypothesis space is, the better learning can prune away uninteresting or unpromising hypotheses, and the faster it becomes.

We have combined inductive logic programming (ILP) directly with a commercial relational database, i.e. Oracle V7. The ILP algorithm is controlled in a model-driven way by the user and in a data-driven way by structures that are induced by three simple learning algorithms. A generality structure of attributes is established by detecting functional dependencies (cf. Bell/Brockhausen/95). A hierarchy of value sets of linear attributes is learned by a new technique for discretization (cf. Franzel/96). A hierarchy of value sets of nominal attributes is learned from background knowledge by set-theoretical processing (cf. Siebert/97). These three simple learning algorithms make it possible to apply an ILP learning algorithm to very large databases.
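As a rough illustration of the first of these simple learning steps, the following minimal sketch checks which single-attribute functional dependencies hold in a small table. It is not the algorithm of Bell/Brockhausen/95: the table, the attribute names (vehicle_id, model, series) and the function names are hypothetical, and the data is held in memory rather than queried from Oracle. A dependency A -> B holds if every value of A occurs with exactly one value of B.

```python
from itertools import permutations


def holds_fd(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}  # maps each lhs value to the rhs value it was first seen with
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return False  # same lhs value, different rhs value: FD violated
        seen[key] = row[rhs]
    return True


def unary_fds(rows, attributes):
    """List all single-attribute dependencies (a, b), meaning a -> b."""
    return [(a, b) for a, b in permutations(attributes, 2)
            if holds_fd(rows, a, b)]


# Hypothetical example table; attribute names are assumptions for illustration.
rows = [
    {"vehicle_id": 1, "model": "A", "series": "S1"},
    {"vehicle_id": 2, "model": "A", "series": "S1"},
    {"vehicle_id": 3, "model": "B", "series": "S1"},
    {"vehicle_id": 4, "model": "C", "series": "S2"},
]

fds = unary_fds(rows, ["vehicle_id", "model", "series"])
for lhs, rhs in fds:
    print(f"{lhs} -> {rhs}")
```

In this toy table the dependencies found form a chain from vehicle_id over model to series, which is the kind of ordering among attributes that could serve as a generality structure. On a real database the same test would more plausibly be expressed as grouped SQL queries than as an in-memory scan.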

Our multistrategy approach to knowledge discovery in relational databases has been tested for its adequacy on a real-world application. The learning system successfully discovered rules in a database of 16 tables with up to 700,000 tuples each. The use of additional background knowledge has been the key to further improving the quality and interestingness of the learned rules (cf. Morik/Brockhausen/97, Brockhausen/Morik/97).

The project ran from 1.1.1995 until 30.6.1997 and was funded by Daimler-Benz AG, Research Center Ulm,
Contract No.: 094 965 129 7/0191

Contact

Peter Brockhausen
