Curriculum for KDD

The analysis of very large data sets with many variables has become a hot topic in computer science, from both a scientific and a business perspective, and Knowledge Discovery (Data Mining) is at least partially taught at most European universities. Due to the interdisciplinary nature of the field, courses are embedded in database research, statistics, or machine learning (artificial intelligence). Although some of the topics are taught at most universities, many still lack a comprehensive lecture, and expertise in this new field cannot be presupposed everywhere. We therefore offer a guideline for such a comprehensive course, intended to make it easy to teach at any computer science department. Our generic curriculum offers:

A flexible structure of the field, which takes into account the prerequisites given at a particular university, for instance, whether the students already know statistics and databases well.
Optional topics that deepen the knowledge, which allow the lecturer to adapt the generic curriculum to a particular research focus.
A small selection of the most relevant literature for each topic, so that the huge number of articles published on KDD topics does not deter a computer scientist from moving into the field.

The course is intended for graduate students of computer science, but students of statistics and economics could also profit from it. In general, the minimal requirements are:

Basic mathematical knowledge in linear algebra and probability theory.
Basic understanding of computer science concepts such as complexity, data structures, and algorithm engineering.

These general requirements are mandatory. There are additional prerequisites which can be taught within the KDD course: data management and statistics as explicitly stated below.

How to read the curriculum

Knowledge Discovery (data mining), KDD for short, has become part of the teaching activities in computer science and statistics. Because the field is interdisciplinary by nature, its background knowledge stems from databases, statistics, and machine learning, including computational learning theory. Depending on the faculty teaching the course and the overall curriculum students are following at the particular university, some parts of the KDD curriculum can be dropped and others strengthened. Students attending the KDD lecture are usually graduate students. However, their backgrounds can differ considerably: they may or may not already be acquainted with databases, statistical measures, or complexity theory. Additionally, each university has its own profile and hence focuses on some aspects. This choice may influence the outline of a KDD course, too.
Hence, some flexibility applies to the proposed KDD curriculum. The proposed KDD curriculum is structured into chapters, each containing some topics. There are three levels of flexibility:

Chapters are labeled as obligatory or facultative. If all chapters are taught, the lecture covers 10 credit points (see ECTS below), which corresponds to 2 lecture sessions of 90 minutes each plus 1 exercise session of 90 minutes per week in a semester of 15 weeks. This is the long course. If only the obligatory chapters are taught, the lecture covers 5 credit points, which corresponds to 1 lecture session of 90 minutes plus 1 exercise session of 90 minutes per week in a semester of 15 weeks. This is the short course.
Two chapters are labeled as prerequisites. If the prerequisites are covered by other courses at the university, the corresponding chapters can be ignored. If they have to be taught within the KDD course, the lecturer has to reduce the number of sessions spent on the obligatory chapters (short course) or choose fewer of the facultative chapters (long course).
Topics listed for a chapter are meant to be illustrative. The topics under the obligatory chapters are not all obligatory; the intended meaning is that some of these topics should be included in the lecture. Note also that Monte Carlo methods as well as Hidden Markov Models could be classified into two different chapters, depending on the lecturer's view.

The guideline also states the number of sessions for each chapter. The short course spends fewer sessions on the obligatory chapters and does not handle the facultative ones. The number of sessions for the short course is given in brackets if it differs from the number of sessions in the long course. Of course, the number of sessions is just a recommendation and can be changed by the lecturer according to the university's profile.

ECTS

The European Credit Transfer System (ECTS) is intended to ease studying across European universities. Its credit points express the workload of a student who successfully passes a course. The assumed number of working hours per week is 40 to 45 in a year of 40 working weeks. Hence, the number of hours per year is 1600 to 1800. 60 credit points correspond to a full year of studies. In addition to attending the lecture and the exercise session, the work of preparing the material for a session and solving the exercises is taken into account. The workload of preparing for an examination on the lecture is also included in the ECTS credit points.
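
The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the constant and function names are assumptions introduced here, while the numbers (40 to 45 hours per week, 40 weeks, 60 credit points per year, and the 5 and 10 credit points of the two course variants) are taken from this guideline.

```python
# Figures quoted in the ECTS paragraph above.
HOURS_PER_WEEK = (40, 45)   # assumed working hours per week (low, high)
WEEKS_PER_YEAR = 40         # working weeks per year
CREDITS_PER_YEAR = 60       # credit points for a full year of studies


def hours_per_credit():
    """Return the (low, high) working hours represented by one credit point."""
    low, high = (h * WEEKS_PER_YEAR / CREDITS_PER_YEAR for h in HOURS_PER_WEEK)
    return low, high


def course_workload(credits):
    """Return the (low, high) total working hours for a course of `credits` points."""
    low, high = hours_per_credit()
    return credits * low, credits * high


if __name__ == "__main__":
    # Credit values of the short and long course as stated in this guideline.
    for name, credits in (("short course", 5), ("long course", 10)):
        low, high = course_workload(credits)
        print(f"{name}: {credits} credit points, about {low:.0f} to {high:.0f} hours")
```

Under these figures one credit point stands for roughly 27 to 30 working hours, so the short course amounts to about 133 to 150 hours and the long course to about 267 to 300 hours, including lectures, exercises, preparation, and examination.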

The KDD Curriculum

The KDD curriculum is based on experience of teaching KDD to both computer science and statistics students at the University of Dortmund. As a module, the course has been accredited for the Bachelor's/Master's degree program in Data Management and Data Mining. Discussions with European lecturers of KDD have been taken into account.

Short Course

The lecture with exercises gives an overview of Knowledge Discovery in Databases (KDD), also known as Data Mining. Starting from the cross-industry standard process model of knowledge discovery (CRISP) and building upon database theory, methods for preprocessing and analysing very large data collections are presented. Analysis tasks are classification, regression, clustering, and frequent set mining.
Goals: After attending the short course, students will know what KDD is and where it can be applied. In the exercises they will have used tools to solve some KDD tasks and will know the principles underlying these tools. Hence, students are capable of performing standard applications.
Prerequisites: If the prerequisites are not known to the students, then the main ideas are taught (session number in brackets) and the number of sessions for the obligatory chapters is reduced by 2 (e.g., the overall KDD process and regression are handled in only 1 session each).

Long Course

The lecture with exercises gives an overview of Knowledge Discovery in Databases (KDD), also known as Data Mining. Starting from the cross-industry standard process model of knowledge discovery (CRISP) and building upon database theory, methods for preprocessing and analysing very large data collections are presented. Analysis tasks are classification, regression, clustering, and frequent set mining. Learning from temporal data or exploiting spatial relationships is handled as spatio-temporal analysis. Analysis methods range from statistical learning and optimization methods to logical (multi-relational) approaches. Application areas are the analysis of very large databases or multi-media collections like the World Wide Web (texts, images, audio and video databases).
Goals: After the long course, students will know what KDD is, where it can be applied, and how to develop an application. They know the challenges of the research field and are ready to start their own developments, for instance in the form of a diploma or master's thesis.
Prerequisites: If the students are familiar with neither of the prerequisites, only 2 of the facultative chapters can be selected. If the students are knowledgeable in one of the prerequisites, 3 of the facultative chapters can be chosen. If both prerequisites are known, all facultative chapters can be taught in the lecture.

Course Materials:

Data Mining Course Home (English) - Gregory Piatetsky-Shapiro, Gary Parker

Data Mining Concepts and Techniques (English) - Jiawei Han, Micheline Kamber

Books

The overall course can be based on the following books:

Title	Authors	Year	Type
The Elements of Statistical Learning: Data Mining, Inference, and Prediction	Trevor Hastie, Robert Tibshirani, Jerome Friedman	2001	Book
Principles of Data Mining	David Hand, Heikki Mannila, Padhraic Smyth	2001	Book
Machine Learning	Tom M. Mitchell	1997	Book

Modules

No.	Name	Prerequisites	Obligatory	Sessions
1.	KDD Process	no	yes	2
	CRISP model
	KDD environments (e.g., R, SAS, Clementine, MiningMart, SPIN!,...)
2.	Data management	yes	no	4
	Hash trees, B trees, B graphs
	SQL
	Data modeling, meta-data (schema declaration, XML,...)
	Data warehousing
	OLAP, data cube
3.	Statistical basics	yes	no	3(1)
4.	Preprocessing	no	yes	4(3)
	Sampling
	Density estimation, Monte Carlo methods (can also be placed under 6)
	Data visualization
	Feature selection
	Principal component analysis, dimension reduction
	Feature generation, feature extraction
5.	Classification	no	yes	6(5)
	Linear discriminant analysis
	Top-down induction of decision trees
	Ensemble methods
	Support Vector Machine (SVM)
	PAC learning, VC dimension
6.	Regression	no	yes	2
	Logistic regression
	Non-parametric statistical methods
	SVM for regression
7.	Heuristic search (Optimization)	no	no	3
	EM method
	Genetic programming, evolutionary algorithms
	Bayesian networks
	Hidden Markov Models (HMM)
8.	Frequent sets	no	yes	4(3)
	Algorithmic issues
	Evaluation methods, significance, null hypothesis testing
	Condensed representations
9.	Spatio-temporal analysis	no	no	3
	Statistical time series analysis
	Sequences and episodes
	Time series clustering and indexing
	Spatial relationships
10.	Multi-relational mining	no	no	3
	Inductive Logic Programming (ILP)
	Propositionalisation
	ILP learnability
	Ontology learning
11.	Multi-media mining	no	no	3
	Web mining
	Text categorisation
	Information extraction and wrapper induction
	Image, video categorisation
	Audio classification
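
As a rough consistency check, the session counts in the table above can be added up and compared with the lecture slots available in a semester. The sketch below assumes that the prerequisite chapters are covered by other courses, that a bracketed number is the short-course count, and that one or two 90-minute lecture sessions take place per week over 15 weeks. The tuple layout and function names are illustrative choices made here; the numbers are copied from the table.

```python
# Rows copied from the module table:
# (no, name, is_prerequisite, is_obligatory, long_sessions, short_sessions).
# short_sessions repeats the long-course value when the table gives no
# bracketed number; it is only used for obligatory chapters.
MODULES = [
    (1, "KDD Process", False, True, 2, 2),
    (2, "Data management", True, False, 4, 4),
    (3, "Statistical basics", True, False, 3, 1),
    (4, "Preprocessing", False, True, 4, 3),
    (5, "Classification", False, True, 6, 5),
    (6, "Regression", False, True, 2, 2),
    (7, "Heuristic search (Optimization)", False, False, 3, 3),
    (8, "Frequent sets", False, True, 4, 3),
    (9, "Spatio-temporal analysis", False, False, 3, 3),
    (10, "Multi-relational mining", False, False, 3, 3),
    (11, "Multi-media mining", False, False, 3, 3),
]

WEEKS = 15  # semester length stated in the guideline


def short_course_sessions(modules=MODULES):
    """Sum the short-course sessions over the obligatory chapters only."""
    return sum(short for _, _, _, obligatory, _, short in modules if obligatory)


def long_course_sessions(modules=MODULES):
    """Sum the long-course sessions over all chapters that are not prerequisites."""
    return sum(long_ for _, _, prereq, _, long_, _ in modules if not prereq)


if __name__ == "__main__":
    print("short course:", short_course_sessions(), "sessions,", WEEKS * 1, "slots")
    print("long course:", long_course_sessions(), "sessions,", WEEKS * 2, "slots")
```

Under these assumptions both totals match the available lecture slots: 15 sessions for the short course and 30 for the long course. If the prerequisite chapters have to be taught within the course, the totals exceed the slots, which is why sessions or facultative chapters have to be dropped as described above.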
