INFORMATION SOCIETY TECHNOLOGIES (IST) PROGRAMME


Annex 1 - Description of Work

Project acronym: MiningMart
Project full title: Enabling End-User Datawarehouse Mining
Contract no.: 11993
Related to other contract no.: IST-2001-35479

1. Project Summary

Objectives

An environment for the support of knowledge discovery from databases (a KDD support environment, KDDSE) will be developed that provides decision-makers with advanced knowledge extraction from large distributed data sets. New techniques for selecting and constructing features on the basis of given data will be developed. For instance, ways of handling time (time series, relations of time intervals, validity of discovered rules), discovering hidden variables, and detecting interdependencies among features will be investigated. These techniques ease knowledge discovery, where currently most time is spent on pre-processing. Data mining will exploit domain knowledge, which will enhance the quality of the mining results. A case-base of discovery tasks together with the required pre-processing techniques will offer an adaptive interface to the KDDSE. This will speed up similar applications of knowledge discovery and make the KDDSE self-improving.

Description of work

The scientific research for enabling end-users to gain knowledge from databases and data warehouses is organized in two themes: a meta-model and multi-strategy learning. The meta-data offer constraints for pre-processing and for pairing business tasks with algorithms (WP1, WP8, WP10, WP18). A deep analysis of feature selection, sampling, transformation, and mining operators will be developed. Multi-strategy learning systematically explores the combinations and (automatic) parameter settings of diverse learning operators for pre-processing, particularly for feature construction and selection (WP4, WP13, WP14). The handling of multi-relational data (WP15) and time phenomena (WP3), and the inclusion of domain knowledge (WP5), enhance discovery.

The support of data mining in data warehouses is based directly on databases. The model for the description of data mining cases (the meta-data) becomes operational through a compiler that transforms mining cases into SQL statements or into calls to mining tools that directly access the database (WP8, WP12). Scientific and technological efforts yield a case-base of best-practice discovery (WP10) that can be used by users of the environment and is published on the Internet for an international "representation race" (WP9).

Applications guarantee that research and technology focus on the most challenging issues and those most in demand. The data warehouse provided by SwissLife and a set of data mining applications from PSN and TILAB serve to evaluate the transferability of results.

Milestones and expected results

Milestone 1 delivers a first prototype of a KDD support environment, with the applications set up and their demands specified. The definition of the meta-language provides the basis for further work.

Milestone 2 delivers user-driven data transformations and learning operators, both described by meta-data.

Milestone 3 provides a case-base together with human-computer interfaces that allow users to set up or adapt data mining cases. An on-line service is established on the Internet.

2. Project Objectives

2.1 The goal

The goal of this project is to make knowledge discovery a powerful, yet easy-to-use, query facility for very large databases. End-users should ideally be able to directly query large and heterogeneous data sets in their own language. Such queries are typically application-driven. For example, data analysis should provide answers for the optimisation of mailing campaigns, for the analysis of warranty cases in order to improve production quality, and for the discovery of business trends. An innovative approach to this goal is to provide end-users with a case-base of excellently solved discovery tasks. The users may run an application by simply referring to a case. Easing access to the knowledge hidden in large data sets will enable SMEs to benefit from their data collections. Consulting for customers who want to optimize their business processes on the basis of their databases is also supported.

2.2 The Objectives

In order to make knowledge discovery a powerful and easy query facility for very large databases, the current tools need to be enhanced in the following ways:

  • Supporting advanced pre-processing of data
  • Supporting the view of the end-user by the case-base
  • Reducing the number of trials for each discovery task
  • Decreasing the amount of data to be kept within the data mining procedures.

Supporting advanced pre-processing

It is well known that data representation strongly influences the quality and utility of analysis results, and that problem reformulation is a core technique in problem solving. Improving the quality of the data improves the quality of the mining results. The importance of non-algorithmic issues in real-world applications is extensively documented. The "no free lunch" theorem in essence tells us that choosing a less well-suited mining algorithm can be compensated for by a sophisticated pre-processing method, or, the other way around, that excellent data mining results frequently rely on appropriate pre-processing of the data. However, the task of reformulating data is difficult and demands high skills. As a result, knowledge discovery in large databases is not used by ordinary users, but by a few highly skilled power users. Developing more sophisticated pre-processing operators will make KDD more effective.

View of end-users and reducing the number of trials

The formulation of discovery tasks in terms of business applications bridges the gap between the technologies and the users. Since end-users cannot solve the difficult task of reformulation in order to get their desired answer, they might wonder whether somebody else has had a similar question and got an answer. The project will develop a case-base of excellently solved discovery tasks. The tasks are described in terms of business applications. The cases serve as blueprints for further similar queries to similar data. The user need not try out all possible procedures, but can start with the most promising one (a retrieval sketch follows below). This should reduce the number of trials considerably. The case-base thus offers a user-friendly interface to the best practice of knowledge discovery from very large and heterogeneous data sets.
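To make the idea concrete, a minimal sketch of blueprint retrieval follows. All structures and names are invented for illustration and do not describe the project's actual case-base design:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Case:
        task_tags: frozenset   # business-level description, e.g. {"mailing", "response"}
        data_tags: frozenset   # coarse profile of the input data, e.g. {"relational"}
        chain: tuple           # ordered pre-processing operator names to replay

    def retrieve(cases, task, data, k=3):
        """Rank stored cases by tag overlap with the new task; try the best first."""
        score = lambda c: len(c.task_tags & task) + len(c.data_tags & data)
        return sorted(cases, key=score, reverse=True)[:k]

    case_base = [
        Case(frozenset({"mailing", "response"}), frozenset({"relational"}),
             ("sample", "discretize_age", "select_features")),
        Case(frozenset({"warranty", "quality"}), frozenset({"time-stamped"}),
             ("window_time", "aggregate", "select_features")),
    ]

    best = retrieve(case_base, task={"mailing", "campaign"}, data={"relational"})[0]
    print(best.chain)  # a blueprint to adapt, not a ready-made solution

The retrieved chain is a starting point: the user adapts it to the new data rather than re-inventing the pre-processing from scratch.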

Decreasing the amount of data within the mining procedures

It is not feasible to extract all data in a warehouse for pre-processing when only a very small percentage of the data is actually needed for the discovery task at hand. In many applications the computer used for data mining is different from the one hosting the data warehouse. It may also be impossible to copy all of the data warehouse to the data mining computer, due to network overload or insufficient disk space at the mining station. Hence, discovering knowledge in very large data sets requires a new division of work between the data warehouse and the discovery operators.

2.3 Operational Goals and Techniques to Achieve Them

To overcome the shortcomings of current knowledge discovery, the proposed work addresses the following four closely related objectives:

1) Create user-friendly access to data mining for the non-expert user through:

  • providing advanced, partially automated support for pre-processing,
  • pairing data with clever pre-processing and analysis methods,
  • creating an (Internet) case-base of pre-processing and analysis tasks for re-use.

Meta-data describing data formats, pre-processing operators, and analysis algorithms will be employed to guide the user through the knowledge discovery task. For example, if the user selects an analysis method that can only process discrete variables, but selects a table containing continuous data to be analysed, then he will be prompted to select a discretization operator (see the sketch after this paragraph). One focus of the research conducted within this project will be to determine the extent to which such automated support can be given during pre-processing. It can be expected that this process cannot be fully automated, especially when completely new, high-level data mining questions are to be solved. Ideally, the system would evaluate all possible transformations in parallel and propose the most successful sequence of pre-processing steps to the user. This "representation race" is, however, computationally infeasible. One objective of the proposal is to allow each user to store entire chains of pre-processing and analysis steps for later re-use in a case-base (for example, a case of pre-processing for mailing actions, or a case of pre-processing for monthly business reports). It is conceivable that completely new tasks will be solved by highly trained specialists who could store their work over the Internet in a centralized case-base, making such new cases accessible to less advanced users. In this way, world-wide experience with knowledge discovery can be systematically stored such that the user's knowledge about the data, the mining task, and the connection between them is preserved. The case-base could even be mined for knowledge about knowledge discovery.
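As an illustration of such meta-data-driven guidance, the following sketch checks an operator's input constraints against the types of the selected columns. The structures and names are hypothetical and do not describe the project's actual meta-model:

    from dataclasses import dataclass

    @dataclass
    class Column:
        name: str
        dtype: str          # "continuous" or "discrete"

    @dataclass
    class Operator:
        name: str
        accepts: frozenset  # variable types the operator can process

    def violations(op, table):
        """Columns whose type the chosen operator cannot handle."""
        return [c for c in table if c.dtype not in op.accepts]

    # A learner that, like ID3-style methods, handles only discrete attributes
    learner = Operator("DiscreteRuleLearner", frozenset({"discrete"}))
    customer = [Column("age", "continuous"), Column("region", "discrete")]

    for col in violations(learner, customer):
        print(f"Column '{col.name}' is {col.dtype}: apply a discretization "
              f"operator before running {learner.name}.")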

2) Speed up the discovery process by

  • reducing the number and complexity of trial-and-error pre-processing and analysis cycles. The case-base of pre-processing and analysis tasks described within the first objective will not only assist the inexperienced user through the exploitation of experienced guidance from past successful applications, but also allow any user to improve his skill for future discovery tasks by learning from the best-practice discovery cases.

  • allowing the re-use of pre-defined building blocks and entire analysis tasks. Some analysis tasks are repeated regularly. A case-base of stored discovery tasks will free the user from specifying the same steps repeatedly.

3) Minimize the amount of data that is kept within the data mining operators. This objective will be achieved by executing as much of the pre-processing as possible within the data warehouse. It is not feasible to extract all data in a warehouse for pre-processing when only a very small percentage of the data is actually needed for, or applicable to, the data mining task at hand. In many applications the computer used for data mining is different from the one hosting the DBMS. It may also be impossible to copy all of the data warehouse data from the DBMS to the data mining computer, due to network overload or insufficient disk space at the mining station. The methods developed within this project will allow maximal utilization of database technology: pre-processing operations that can be efficiently executed within the database will be executed within the database. It can also be expected that achieving this objective will speed up the discovery task (objective 2).

4) Improve the quality of data mining results by improving the quality of data. Transforming raw data into a high-value basis for discovery is a time-consuming and tedious task, but it is also the most important step in the knowledge discovery cycle and a particular challenge in real-world applications. In this project, a set of transformation tools/operators to ease this task will be developed. Machine learning operators are not restricted to the data mining step within knowledge discovery. They can equally be seen as pre-processing operators that summarize, discretize, and enhance given data. However, this view opens up a variety of learning tasks that are not as well investigated as learning classifiers is. For instance, an important task is to change the level of detail of the data by means of aggregation operators, according to the task and/or the algorithm used. These tools improve the quality of data with respect to redundancy and noise, assist the user in selecting appropriate samples and in discretizing numeric data, and provide means for reducing the dimensionality of data for further processing. Making data transformations available includes the development of an SQL query generator for given data transformations and the execution of the generated SQL queries against the database; a sketch of such a generator is given below.
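As a hedged sketch of what such an SQL generator might emit, the following function pushes an equal-width discretization into the database as a view, so the raw rows never leave the warehouse. Table and column names are invented, and real operators would of course cover more transformation types:

    def discretize_sql(table: str, column: str, lo: float, hi: float, bins: int) -> str:
        """Emit SQL that adds an equal-width binning of `column` as a view,
        so the transformation is executed inside the database."""
        width = (hi - lo) / bins
        return (
            f"CREATE VIEW {table}_{column}_binned AS\n"
            f"SELECT t.*,\n"
            f"       LEAST(FLOOR(({column} - {lo}) / {width}), {bins - 1}) AS {column}_bin\n"
            f"  FROM {table} t"
        )

    print(discretize_sql("customer", "age", lo=18.0, hi=98.0, bins=8))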

2.4 Baseline

Traditional approaches fail to release the knowledge hidden in the masses of available data. Two technologies are emerging to the rescue: (1) data warehousing for on-line analysis of data and verification of hypotheses, and (2) knowledge discovery in databases (KDD) for discovering knowledge that is hidden in the data. Practical experience with these techniques has proven their value. However, it is also apparent that using a data warehouse for decision support or applying tools for knowledge discovery are difficult and time-consuming tasks. The actual data mining step is well understood and efficient tools exist, but the pairing of data with algorithms and the clever pre-processing of the data are still a matter of trial and error. If we inspect real-world applications of knowledge discovery, we realize that up to 80% of the effort is spent on finding an appropriate transformation of the given data, finding an appropriate sampling of the data, and specifying the proper target of data mining. Interest in this European project comes from experience with knowledge discovery in very large relational databases: most of our efforts concern the transformation of data into a form that is appropriate for data mining. Time sequences, time intervals, and relations between time intervals in particular belong to the hardest issues. Even when the data are transformed and a learning algorithm is selected, the tuning of the algorithm's parameters currently asks for many trials. These issues require a multi-strategy approach, where one learning task delivers input to the next one. This representation change, however, has not yet received enough attention in the scientific community, which still concentrates on the learning algorithms themselves.

2.5 Measure of Success

Corresponding to the overall goal of the project, the criterion is:

By the end of the project, some discovery tasks for which entries in the case-base exist
can be solved with only 20% of the time previously needed for pre-processing,
while the time for the data mining step remains the same as before the project.

This criterion is made operational, corresponding to the objectives stated in section 2.3, within workpackage 17. The system MiningMart will be installed for real end-users at the industrial partners (SwissLife and TILAB). They will use the system for their regular decision-making and reporting jobs. At least one of the discovery tasks the users want to perform will have been set up by project members before, so that meta-data about the task and a solution exist. In this case, the end-users can use the case-base of best practice. This setting is used for in-depth evaluation.

  • Measure for creating user-friendly access to data mining for non-expert users:

End-users report on their experience with using the MiningMart: is it easy to use, transparent, and supportive? Is the flow of control natural to the users? Can they make good use of the results? How do they assess the results?

End-users compare their task performance with and without the MiningMart: are they faster when using the system? Can they do more when using the system?



  • Speed up the discovery process:

The comparison of discovery with and without the accomplished MiningMart system will be performed for the SwissLife application. In WP6, an anonymized data warehouse together with discovery tasks is delivered to all partners. Pre-processing operators are applied to solve the task, of course without the support of the MiningMart system, which does not yet exist at this point in time. The time for finding the appropriate data transformations will be measured. The pre-processing will then be enhanced by the methods of WP3, WP4, WP13, WP14, and WP16. The end-users at SwissLife, a business controller and a member of the marketing division, will use the MiningMart system, which by then includes the enhanced pre-processing. Their time for solving the same task will also be measured. We aim at a reduction to only 20% of the original pre-processing time.





  • Minimize the amount of data kept within the KDDSE:

Current KDD systems have to load all data for data mining. This is not feasible. WP2 will investigate methods that allow for a new division of work between the data warehouse and the KDD operators with respect to data management. Furthermore, WP15 will develop a new interaction concerning multi-relational data. The aim is to perform extensive processing directly within the data warehouse. A clear and operational measure of success is whether a huge data warehouse can be handled by the MiningMart.

  • Improve the quality of mining by improving the quality of data:

The main measure for the quality of data mining results is their accuracy on test data. WP3 and WP4 as well as WP14 will deliver comparisons, in precise figures, of mining with and without the pre-processing methods developed. The international contest (the "representation race", WP20) will deliver additional test results.



This general evaluation of success can be made operational for on-going quality assurance. In fact, the high number of workpackages is meant to ease the monitoring of the project. Each workpackage is a move towards one of the goals; each of its deliverables is a step. Hence, as long as every deliverable is in fact delivered, we know that we are approaching our goal. Here we present the paths to achieving good results for each of the success criteria just shown. They will be used as a check-list for progress monitoring.

  • Measures for creating user-friendly access to data mining for non-expert users:

  • Milestone 1: The start-up KDDSE is delivered to all project partners, so that project members, as the first users, can assess whether it is user-friendly and propose enhancements. WP5 prepares the ground for WP19 by investigating how knowledge about the application (as viewed by application people) can be incorporated into discovery.

  • Milestone 2: WP8 delivers operators that should ease pre-processing. A meta-model allows a data mining case to be specified together with the given data and the domain concepts involved.

WP19 is a first attempt to express discovery tasks and goals in terms of (business) applications. If the transfer to all partners succeeds, i.e., partners want to apply the meta-models to their own applications, this is a clear indication of success.

  • Milestone 3: WP10 delivers the case-base of pre-processings, i.e., the best practice of pre-processing, which can be used by less experienced users. Its success becomes apparent only through users' assessments (WP17). The same holds for WP12, which delivers the final human-computer interface. The final evaluation is done in WP17 as described above.

  • Speed up the discovery process:

  • Milestone 1: The basis of all further evaluations is the delivery of a real (anonymized) data warehouse by WP6. The basis of further work on pre-processing operators is the specification delivered by WP1. Indirectly, the specification will describe more detailed success criteria found to be important because of the applications' requirements. The first advanced pre-processing operators (those handling time or sequences) will already be proposed at this milestone.

  • Milestone 2: WP4 supports the tuning of parameters of mining algorithms. Clearly, the time used for determining optimal parameter settings will be measured and compared to the time necessary without the new technique. In order to ease the monitoring of progress, one deliverable is due early on (B+12).

  • Milestone 3: WP13, WP14, and WP16 deliver advanced pre-processing operators. They are evaluated with respect to speed-up as well as the accuracy of learning results.

  • Minimize the amount of data kept within KDD procedures:

  • Milestone 1: First results will already be available at B+6, when WP2 will have investigated new approaches to the division of work between the database (data warehouse) and the KDDSE.

  • Milestone 2: WP15 will extend the approach from WP2 to multi-relational data. The capability to perform hypothesis testing directly on a database or data warehouse is a clear success. Should this fail, the project still has time to recover.

  • Milestone 3: WP17 evaluates clearly how many records have to be stored within the KDDSE and how many can be ignored or handled by the database or data warehouse.

  • Improve the quality of mining by improving the quality of data:

  • Milestone 1: First results comparing combinations of pre-processing and mining are delivered by WP3. The improvement of accuracy due to different representations of time or sequences can thus be measured.

  • Milestone 2: WP4 will compare the average accuracy achieved with and without the tuning of parameters. Of course, the success criterion is to move beyond the initial average accuracy. WP15 will compare combinations of pre-processing and mining on multi-relational data. The improvement of accuracy is one measure of success. However, if, for some combinations, it can be shown that no improvement can be achieved, this is also a very valuable result!

  • Milestone 3: WP14 will compare feature selection and construction operators. Those that improve the accuracy of learning results will be made available to customers of the MiningMart. The success criterion is the increase in accuracy. WP16 provides pre-processing operators for discretization and grouping. WP18 delivers conditions that characterize the applicability of operators. These are meant to serve as a prediction of successful discovery as well as a "pruning" criterion for cases where an operator does not make sense. It can be tested whether the conditions cover the successful operator applications of the project and do not cover unsuccessful ones. The international contest on the Internet will deliver additional test results (WP20).


The other workpackages and deliverables do not serve the operational goals of the research, but are for information dissemination and for technology implementation.

2.6 Added Value by NAS Partners

Supporting advanced pre-processing means that a set of operators for intelligent data transformations is modeled within the framework of the MiningMart. The MiningMart consortium has delivered some operators. However, it has turned out that the discretization operators are not sufficient for handling all the cases that the project would like to handle. Feature generation and selection is a central issue in intelligent pre-processing. UEP has worked on intelligent pre-processing operators and will deliver additional ones. Their experience in winning KDD competitions through intelligent pre-processing will enlarge the scope of pre-processing when integrated into the MiningMart framework, namely by being integrated into the meta-model through operator descriptions and through operator implementations that directly access databases. UEP will also report on the usability of the meta-model for describing new operators.

The speed-up of discovery tasks through the re-use of existing cases can best be verified when a new discovery task is set up by somebody who has not participated in the project-internal discussions. NIT offers the exciting opportunity to use the project results for a new discovery task already within the project duration. From this evaluation we expect feedback that can enhance the project results. The same holds for the objective of improving data mining by improving the quality of data. The new task of NIT will give us a clear result as to whether our expectation is in fact correct that the pre-processing offered by MiningMart cases improves the data mining results. NIT will use the meta-model to investigate their customers' tasks and will report on the advantages and disadvantages of the current project results. The MiningMart project will then use this feedback to adjust its work. This early feedback promises a much stronger position with respect to usability questions than would be possible if the first application of the MiningMart to a new task from outside the project occurred only after the project duration. Last but not least, NIT provides the MiningMart with a new case.

4. Contribution to Programme Objectives

This proposal contributes to the thematic IST programme objective by building a user-friendly knowledge discovery tool for the information society. The international character of the proposed project and the participation of two SMEs comply well with the outlined horizontal programmes.

Knowledge discovery in databases (KDD) is widely recognized as the technology for the 21st century that enables people to gain insights from the vast amounts of data collected world-wide. Data warehouse technology helps to collect and handle such huge masses of data, but only KDD (and possibly OLAP) offers the tools and procedures to turn these data into usable knowledge. Data mining, as the core step within the KDD process, allows the non-technical user to formulate high-level queries such as:

  • output the most similar records
  • identify exceptional records
  • determine the most influential factors
  • output the 10,000 most likely responders to my next mailing campaign.

Current industrial-strength data mining tools allow the formulation of a multitude of similar queries to direct the analysis. Two drawbacks of the present state-of-the-art tools are that, despite visual programming user interfaces, users must be KDD specialists, and that they spend most of their time preparing the data for analysis. There is also relatively little support from the data mining tool in this tedious process. While this is accepted by early adopters and during initial data mining projects, most users demand that previous pre-processing tasks be re-usable and that semi-automated support for new tasks be available. This is the aim of the proposed project. Solving these user-support problems is necessary before KDD can be fully effective and accessible to the broader group of users who could benefit from data mining technology.

This project could be considered part of a cross-programme cluster (CPC), because its results will be relevant to several Key Actions. The project is related to several action lines, the "center of gravity" being item 3 of Key Action IV.3 (technologies and engineering for software, systems and services, including high-quality statistics): methods and tools for intelligence and knowledge sharing. The result of this project is a technique enabling the end-user to use data mining (i.e., advanced statistics) and some OLAP functionality within a data warehouse environment (very large-scale data). We intend to build an environment with an advanced graphical user interface that enables the user to easily define new pre-processing and analysis tasks via case retrieval from previously defined pre-processing tasks. This project contributes to CPA4 (New indicators and statistical methods) by developing and applying an advanced, user-friendly data mining tool for improving data quality, knowledge extraction, and statistical modelling. The tool will contribute to the dissemination of information by enabling a broader group of users to run data mining analyses without the help of experts. The system will further support the user through case adaptation and optimization. Meta-data about the data to be analyzed (obtained from an information repository), the data warehouse, the data mining tools, and the pre-processing operations will be employed to guide the user while defining new tasks. The primary goal is the easy adaptation of existing pre-processing cases to new applications.

Other related Key Actions are:

  • IV.3.4: information management methods. The project builds an environment where mass storage and processing of the data are done within the data warehouse. Therefore, existing data warehouse technology is used to enable end-users to conduct data mining on very large data sets.

  • II.1.2: Corporate knowledge management. A tool for representing and capturing (distributed) organizational knowledge in working environments will be developed. The project is based on existing data warehouse technology, which solves the issues of the distributed and heterogeneous nature of available data and their repositories.

  • Future priority action of III.5, Information Access, Filtering, Analysis and Handling: Information filtering and agents. Data mining technology is a core technology for information filtering and analysis and hence contributes to this future priority action.

  • I.4.1: Systems enhancing the efficiency and user-friendliness of administrations. The system developed during this project can be seen as an advanced multimedia integrated system for administrations and other public bodies, improving businesses' and citizens' access to information. Multimedia data can be analyzed through a change in representation: pictures are usually represented for mining by large feature vectors, spatial data (geographic information) by multi-relational neighborhood and distance relations, and texts for WWW- and text-mining by large feature vectors (word lists). Pre-processing and data mining are enabling techniques here, even though this is not explicitly addressed in the project.



5. Innovation

The proposal builds on the insight that current approaches to achieving the objectives described above tend to ignore theoretical results proving that no algorithm can claim to be systematically better than any other on every problem (Wolpert's "no free lunch" theorem), and that nobody has yet been able to identify reliable rules predicting when one algorithm will be superior to others. The innovation of this proposal is to combine the two factors that are known to be able to solve nearly any challenge: human experts and sheer numbers!

A constraint-based graphical user interface utilizing meta-data shall guide users through the knowledge discovery task. The highest possible degree of automation of this process is the aim of the project. However, as reasoned above, it cannot be expected that the user simply asks a high-level question, selects a data set to be analyzed, and everything else is done automatically. In particular, the task of properly transforming the given data into a format that can be successfully analysed by the available algorithms is difficult. As discussed above, testing all possible approaches through a "representation race" is currently not practical, because the required computational power is not accessible to any single user. However, if the nearly infinite resources of the Internet are utilised, such a race may become reasonable. The idea is that a searchable case-base of solutions to discovery tasks is made available on a web server. Users with access to this facility can search the case-base for suitable solutions to the task at hand. If no proper solution is found, the task could be posted as a new challenge. Knowledge discovery experts working alone or in groups could tackle the problem and insert a solution into the case-base. Large clusters of computers could also be combined to find the right answer through sheer computational power, similar to the clusters built to find new largest known prime numbers. A toy sketch of this workflow follows.
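The workflow just described (search the case-base, post an unsolved task as a challenge, let experts or compute clusters submit solutions) can be summarised in a minimal sketch; all names are invented and do not describe an actual MiningMart interface:

    class CaseServer:
        """Toy model of the envisaged web-hosted case-base."""

        def __init__(self):
            self.cases = {}       # task description -> stored solution
            self.challenges = []  # unsolved tasks posted by users

        def search(self, task):
            return self.cases.get(task)

        def post_challenge(self, task):
            if task not in self.challenges:
                self.challenges.append(task)

        def submit_solution(self, task, chain, author):
            # experts insert solutions; ratings and case adaptation could hook in here
            self.cases[task] = {"chain": chain, "author": author}
            if task in self.challenges:
                self.challenges.remove(task)

    server = CaseServer()
    if server.search("mailing response") is None:
        server.post_challenge("mailing response")
    server.submit_solution("mailing response", ("sample", "discretize", "learn"), "expert A")
    print(server.search("mailing response")["chain"])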

Why would people participate in such a competition and post their solutions in the case-base? Several motivations are conceivable: financial rewards, acknowledgement ("Miner of the Month"), and, most importantly, access to the case-base itself. Relevant research issues are how to efficiently store results in the case-base, and how to maintain and query it. Data mining technology could be used to maintain and update the case-base, to create ratings, and to learn how to adapt retrieved cases.



6. Community Added Value

The growing amount of data that private companies and public organisations are accumulating in their databases (or data warehouses) would be meaningless if appropriate methods were not available to convert their content first into information and then into knowledge for the decision-maker. However, the whole process of extracting knowledge from these valuable repositories cannot be solved by just buying some possibly expensive commercial tool; in fact, the advice of expert data miners is usually required to precisely define the data mining task, to prepare the data, and to set up the environment for effectively using the tool. The overall process becomes expensive, both in terms of time spent on the task and in terms of the number and level of qualification of the persons involved. For this reason, up to now only large companies have profited from this innovative methodology.

Building an environment for pre-processing that supports users in taking advantage of their own data will reduce the time and cost of data mining. In this way, smaller companies too, specifically SMEs, could profit from this methodology, which, in the upcoming information society, might prove vital for their survival and competitiveness. For this reason, the proposed project is in line with several priorities and policies of the EC: to enhance the user-friendliness of information tools, to empower small companies (SMEs) by giving them the possibility of exploiting their own past experience hidden in historical data, and to guarantee to a larger community the keys to important information resources, thus favouring the development of fair access to the information society.

Furthermore, a side effect of the project will be a kind of standardization of the data mining process, enhancing transferability and re-use and increasing the possibility of sharing data and tools. The effect will be an increased European awareness of the value of shared information technologies, and a global increase in the amount and quality of the knowledge extracted from data. In fact, by following standardized data pre-processing methodologies, data from various countries could be handled cooperatively. This aspect is particularly important for those issues in which all European countries are interested and on which they systematically collect data, such as environment monitoring and ecosystem preservation.

In order to attain the goal of the project, a European partnership is necessary to achieve a critical human and financial mass. Moreover, different competencies have to be integrated. Specifically, academic teams master the methodologies, but usually neither dispose of large data warehouses nor have the resources to engineer their algorithms for use by a wide community of non-specialist users. Industrial companies have the data, and software companies have the human resources to make a product out of a prototype. The partners from these three sectors thus complement each other nicely.



7. Contribution to Community Social Objectives

The development of information handling tools and systems, like any other field that produces methodologies, offers the ground for a large number of applications that, in the end, have an impact on society. The proposed project has two such impacts.
The first is the possibility for a larger range of companies to exploit the information accumulated during their operations. In fact, the data mining process ultimately tends to support decision-makers in serving customers better, by producing and offering personalised goods.

Second, it is not unreasonable to believe that the gradual diffusion of data mining to a large number of smaller companies and to public organisations will open new qualified jobs, thus contributing directly to increased employment.




8. Economic Development and S&T Prospects

Data mining research and the market for data mining tools are dominated by North American science and commerce. It is very hard to find a niche for European contributions and expertise. However, the customers of data mining within Europe still need consulting and guidance. Here, European expertise can be marketed. Similar to the experience with Linux, where the system is free but the accompanying consulting can be marketed, we perceive the MiningMart as a free archive that supports data mining consulting.

The consortium as a whole will market its results through a workshop (WP11) and through the Internet server that publishes the meta-data of successful discovery cases.

The software partners, AiS and PSN, have particular exploitation plans.

Over the years, PSN's focus has shifted slightly from data mining tool sales to large consulting projects in the area. With this shift came the realisation that large amounts of pre-processing were necessary before actual data mining tools could be applied. Up to now, this pre-processing phase has depended largely on available knowledge about database management and, to a large degree, on gut feeling about the problem domain. A proper structuring or automation of this task has been lacking.

The results from this project will have the following benefits for PSN:

  • Reduction of the initial pre-processing phase of a data mining project, which currently may take up to 80% of the overall time spent on the project. This reduction in overall time, and thus in the total costs of the project, will make future projects more attractive and easier to sell.

  • Widening of the scope of a data mining project. Due to a more efficient pre-processing phase, there is more room to examine alternative approaches, which would otherwise be a luxury given a fixed budget.

  • Increase in the quality of the results. Because of a better structured and automated pre-processing phase, suggestions can be tried that would otherwise be overlooked. Given more and better suggestions, the analysis phase will be improved, which in turn results in the production of models with higher quality and accuracy.

  • Decrease in the training required for consultants. Due to an increased level of automation and better guidelines, less training of consultants is necessary, and more people with different backgrounds can be put on data mining projects. In the current situation at PSN, more data mining projects have been initiated than can currently be performed by the data mining team.

  • A more convincing value proposition during the acquisition of new projects. Because of the relative complexity and novelty of data mining technology, PSN has to rely on free pilots and workshops in order to convince new customers. During such short pilots, a better impression can be made if the laborious pre-processing phase is automated and thus passed through rapidly. We can then focus more on the actual analysis, which has to convince the customer in the end.

PSN expects that all of these benefits will have a substantial effect on the sales of data mining projects for PSN. Not only will improved pre-processing make projects cheaper, and thus easier to sell, but it will also make them more successful. The success and quality of such projects has proven to produce new spin-offs, not just at the current customer, but also at potential new customers. Because of the large attention paid to data mining in the technical media, success stories are a good way of attracting new business.

Specifically, PSN intends to exploit the results of this project in both current and future data mining projects. The pre-processing tool will be an excellent complement to the data mining tools that we currently market. These data mining tools support the pre-processing phase only in a minimal form, and we have to rely on functionality provided by standard RDBMSs. This has been a complaint of current customers, which can now be addressed by marketing a package of tools. The results stemming from the more research-oriented workpackages will be used as internal guidelines and methodologies. Through these documents, less trained personnel will be able to implement data mining projects more efficiently and with a higher quality of results. The results will be exploited not just on the Dutch market through PSN, but on a European and worldwide scale through its owner, Perot Systems, where PSN acts as the official Competence Centre for data mining.

The Institute for Autonomous intelligent Systems (AiS) has a large and active knowledge discovery research group. Projects focus on the application of data mining to spatial data and on multi-media and text mining. In both areas AiS has a substantial number of on-going national and international projects. Additionally, AiS will coordinate KDNet, the Knowledge Discovery Network of Excellence (currently under negotiation).

To understand AiS's exploitation plan, one has to keep in mind that AiS is currently in the process of merging with the German Fraunhofer Gesellschaft, one of the largest and most successful European institutions for applied research. As a consequence, AiS, while maintaining its position in international research, will have to build up structures that allow for the rapid and successful transfer of research to industry. This will lead to an increase in its third-party funding from industry, both in proportion and in absolute numbers. That AiS is in an excellent position to do this is demonstrated by the fact that it has already increased its third-party funding from 220 kEuro in 1999 to an estimated 600 kEuro in 2001. With more than a dozen researchers and software developers, as well as a proven track record in project management, it has the appropriate skill-set to cope with this challenge.

It is not AiS's primary goal to get large revenues from selling or licensing software tools. Recent years have shown that many data mining start-ups have evolved away from such a business model. The GNU data mining suite WEKA and the R statistical analysis package now offer zero-cost solutions for the researcher and expert user. On the other hand, big and well-established companies such as Oracle (Darwin), SAS, SPSS (Clementine) and IBM offer data mining solutions (often acquired by buying smaller companies), presenting stiff competition to smaller companies that want to enter the market. In this situation it seems likely that small companies will only survive if they enter niche markets, offering vertical solutions (such as PharmaDM), customizing existing tools, or adding value to the distribution of tools by offering high-quality consultancy services.

The latter, especially, is a business model that suits the skill-set of the AiS Knowledge Discovery Team well. The team has a long tradition of building software tools (among others, the first versions of the Kepler system used in MiningMart). It also has broad methodological experience in fields ranging from Bayesian Markov Chain Monte Carlo to Inductive Logic Programming and Support Vector Machines, and has worked in application areas such as credit scoring, process optimization, census data analysis, and site selection.

To successfully combine basic with applied research, it is planned to build up a unit that applies data mining technologies in industrial cooperations. The MiningMart software will be an important addition to the suite of tools already available at AiS. The task of pre-processing data for analysis is well known as the most time-consuming task in the entire knowledge discovery process. Hence, these tools will have an immediate and lasting role in current and future research and applied projects. This helps to maintain the software over a longer period and to create synergies.

AiS develops the SPIN! data mining platform for the analysis of spatial data. This software is being further developed in a national project, "Kogiplan". In this project, the results of SPIN! are combined with optimization algorithms and applied to commercially highly relevant site selection problems. The system already contains a data extraction and transformation component; however, its functionality is reduced compared to MiningMart. Combining both projects' results leads to important synergies. The MiningMart data extraction tools will be an important addition to that system, greatly enhancing its exploitation potential for both new research and commercial projects.

One of the core areas where AiS will become active is statistical offices. The UK census schema comprises about 90 tables with roughly 8000 attributes. If geographic layers are added, the number of tables increases further. The complexity of queries also increases, since spatial joins linking geographic data are required. A powerful, easy-to-use data access and extraction tool would be a boon for census data analysts.

AiS will use its position as co-ordinator of KDNet (currently under negotiation) to disseminate and exploit the scientific results of the project. KDNet offers various possibilities to do this:

  • Clustering to establish synergies with other projects
  • Organisation of user workshops
  • Dissemination through the Online Information Services

The MiningMart project schedule allows timely dissemination of its results through these KDNet activities.

Private life insurance as well as the management of company pension systems are markets in Europe with strong future impact, due to the expected demographic development. These markets are currently changing rapidly because of deregulation all over Europe. Swiss Life is currently developing a data warehouse with customer information and is most interested in the opportunities that assessing its contents will offer. The most prominent goals of Swiss Life in this respect are to make better and more economical use of the data in Swiss Life's databases for better customer management, to develop new innovative life-insurance products based on improved individual risk assessment and management, and to recognise and react faster to changes in the market. These goals hold especially for the management of company pension systems, where a much broader product spectrum exists for companies ranging from small (1-5 employees) to very large (several thousand employees of multi-national corporations, supported internationally by Swiss Life) than in the private life-insurance market. Swiss Life believes that these goals can only be achieved if the end-user is empowered to directly conduct data mining analyses on the data warehouse, which is the aim of this project. In a longer perspective, the results of the project will be distributed by Swiss Life, Zurich, to other branches of Swiss Life, e.g. Swiss Life Germany, where a data warehouse is currently being developed, and Swiss Life France, to support the integrated marketing of its life insurance branch and its new health insurance branch.

Swiss Life will also promote the project results through scientific and educational activities: machine learning, KDD and insurance mathematics courses will be held at the University of Konstanz, Germany, and at both the University of Zurich and ETH Zurich, Switzerland. Results will also be distributed to the "Meta-data for Datawarehouses" user group to be established in the Swiss national research project SMART.

It is observed that European companies ask for computer scientists with special training in data mining and knowledge discovery. Only a few universities teach machine learning and knowledge discovery regularly; the academic partners of this project are among them. Their courses offer an opportunity to transfer the most advanced state of the art from research to students. The more practice-oriented analysis of learning operators and manual pre-processing operators to be developed by the project can be taught to students, so that they not only know about algorithms and their computational complexity, but also about their combinability and effective use in knowledge discovery.

In addition, the Internet service provided by the University of Dortmund will further strengthen the visibility of the partners as primary actors in the field of knowledge discovery.



9. Workplan

9.1 General Description

The focus of each of the 20 workpackages can be loosely assigned to one of the abstract objectives of the project: research (as described in section 2.2), development of new technology, application, and exploitation. The assignment must remain loose, since some workpackages contribute to several of these objectives. Nevertheless, it helps to group the workpackages into thematically related groups:

  • Research (advanced pre-processing): workpackages 3, 4, 13, 14, 15, 16, and 18 investigate issues regarding advanced pre-processing operations and multi-strategy learning;
  • Research (view of end-users): workpackages 5, 8, 18, and 19 construct the meta-data model of constraints for pre-processing operations and develop methods for matching data mining algorithms with data for business tasks;
  • New technology: workpackages 1, 2, 7, 9, 10, and 12 develop the pre-processing environment;
  • Applications and exploitation: workpackages 6 and 17 reflect the application-oriented nature of the project, while workpackages 11 and 17 ensure the proper exploitation of the project results;
  • Management: workpackage 20 is devoted to project management.

Three milestones serve as synchronization points for all project partners. The first milestone, after six months, is unusually early. The motivation is that after this milestone all project partners can work with a common platform, and the specifications for subsequent work within the project are well defined. This early synchronization reflects the highly integrative nature of the project: all partners will have the chance to voice their input for further developments from the onset of the project. In contrast, projects with late synchronization points often produce results that are (technologically) incompatible with each other. The early milestone is feasible, since the basis of its primary deliverable, the data mining platform, is already available as a commercial KDDSE that only requires tailoring towards the special needs of the project within WP2.