Curriculum for KDD

Since the analysis of very large data sets with many variables has become a hot topic in Computer Science, both from a scientific and from a business perspective, Knowledge Discovery (Data Mining) is taught, at least in part, at most European universities. Due to the interdisciplinary nature of the field, courses are embedded in database research, statistics, or machine learning (artificial intelligence). Although some of the topics are taught at most universities, many still lack a comprehensive lecture, and expertise in this new field cannot be presupposed everywhere. Therefore, we offer a guideline for such a comprehensive course, intended to make it easier to teach the course at any Computer Science department. Our generic curriculum offers:
  • A flexible structure of the field, which takes into account the prerequisites given at a particular university, for instance, whether the students already know statistics and databases well.
  • Optional topics that deepen the knowledge and allow the generic curriculum to be adapted to the particular research focus of the lecturer.
  • A small selection of the most relevant literature for each topic, so that the huge number of articles published on all the topics in KDD does not prevent a computer scientist from moving into the field.
The intended audience of the course is graduate students of computer science, but students of statistics and economics could also benefit from it. In general, the minimal requirements are:
  • Basic mathematical knowledge in linear algebra and probability theory.
  • Basic understanding of computer science concepts such as complexity, data structures, and algorithm engineering.
These general requirements are mandatory. Two additional prerequisites, data management and statistics, can be taught within the KDD course, as explicitly stated below.

How to read the curriculum

Knowledge Discovery (data mining), KDD for short, has become part of the teaching activities in computer science and statistics. The field is interdisciplinary by nature: its background knowledge stems from databases, statistics, and machine learning, including computational learning theory. Depending on the faculty teaching the course and the overall curriculum students are following at the particular university, some parts of the KDD curriculum can be dropped and others strengthened. Students attending the KDD lecture are usually graduate students. However, their background can differ considerably. For instance, they may or may not already be acquainted with databases, statistical measures, or complexity theory. Additionally, universities have a profile and hence focus on some aspects. This choice may influence the outline of a KDD course, too.
Hence, some flexibility applies to the proposed KDD curriculum. The proposed KDD curriculum is structured into chapters, each containing some topics. There are three levels of flexibility:
  • Chapters are labeled as obligatory or facultative. If all chapters are taught, the lecture covers 10 credit points (see ECTS below), i.e., 2 lecture sessions of 90 minutes each plus 1 exercise session of 90 minutes per week in a semester of 15 weeks. This is the long course. If only the obligatory chapters are taught, the lecture covers 5 credit points, i.e., 1 lecture session of 90 minutes plus 1 exercise session of 90 minutes per week in a semester of 15 weeks. This is the short course.
  • Two chapters are labeled as prerequisites. If the prerequisites are covered by other courses at the university, the corresponding chapters can be ignored. If they have to be taught within the KDD course, then the lecturer has to reduce the number of sessions for the obligatory chapters (short course) or choose fewer of the facultative chapters (long course).
  • Topics listed for a chapter are meant to be illustrative. The topics under the obligatory chapters are not all obligatory; the intended meaning is that some of the topics should be included in the lecture. Note also that Monte Carlo methods as well as Hidden Markov Models could each be placed in either of two chapters, depending on the lecturer's view.
The guideline also states the number of sessions for each chapter. The short course spends fewer sessions on the obligatory chapters and does not handle the facultative ones. The number of sessions for the short course is given in brackets if it differs from the number of sessions in the long course. Of course, the number of sessions is just a recommendation and can be changed by the lecturer according to the university's profile.

ECTS

The European Credit Transfer System (ECTS) is intended to ease studies across European universities. Its credit points express the workload of a student who successfully passes a course. The assumed number of working hours per week is 40 to 45 in a year of 40 working weeks; hence, the number of hours per year is 1600 to 1800. 60 credit points correspond to a full year of studies. In addition to attending the lecture and the exercise session, the work for preparing the material of a session and solving the exercises is taken into account. The workload of preparing for an examination on the lecture is also included in the ECTS credit points.
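The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the constants are the figures quoted above, the function names are hypothetical, and the assumption that workload scales linearly with the number of credit points is ours, not a statement from the ECTS definition.

```python
# Figures quoted in the ECTS paragraph.
HOURS_PER_WEEK = (40, 45)        # assumed weekly working hours (low, high)
WORKING_WEEKS_PER_YEAR = 40
CREDITS_PER_YEAR = 60


def hours_per_year():
    """Return the (low, high) number of working hours in a full year."""
    low, high = HOURS_PER_WEEK
    return (low * WORKING_WEEKS_PER_YEAR, high * WORKING_WEEKS_PER_YEAR)


def workload_hours(credit_points):
    """Estimate the (low, high) workload in hours for a course.

    Assumption (not stated in the text): workload scales linearly
    with the number of credit points.
    """
    low, high = hours_per_year()
    return (credit_points * low / CREDITS_PER_YEAR,
            credit_points * high / CREDITS_PER_YEAR)


if __name__ == "__main__":
    print(hours_per_year())
    for name, points in (("short course", 5), ("long course", 10)):
        low, high = workload_hours(points)
        print(f"{name}: about {low:.0f} to {high:.0f} hours")
```

Under this linear assumption, one credit point corresponds to roughly 27 to 30 hours of work, so the long course (10 credit points) comes to roughly 267 to 300 hours and the short course (5 credit points) to half of that.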

The KDD Curriculum

The KDD curriculum is based on experience of teaching KDD to both computer science and statistics students at Dortmund University. As a module, the course has been accredited for the study of Data Management and Data Mining towards a Bachelor/Master degree. Discussions with European lecturers of KDD have been taken into account.

Short Course

The lecture with exercises gives an overview of Knowledge Discovery in Databases (KDD), also known as Data Mining. Starting from the cross-industry standard process model of knowledge discovery and building upon database theory, methods for preprocessing and analysing very large data collections are presented. Analysis tasks are classification, regression, clustering, and frequent set mining.
Goals: After attending the short course, students will know what KDD is and where it can be applied. In the exercises, they will have used tools to solve some KDD tasks and will know the principles underlying these tools. Hence, students are capable of performing standard applications.
Prerequisites: If the prerequisites are not known to the students, then their main ideas are taught (session number in brackets) and the number of sessions for the obligatory chapters is reduced by 2 (e.g., the overall KDD process and regression are handled in only 1 session each).

Long Course

The lecture with exercises gives an overview of Knowledge Discovery in Databases (KDD), also known as Data Mining. Starting from the cross-industry standard process model of knowledge discovery and building upon database theory, methods for preprocessing and analysing very large data collections are presented. Analysis tasks are classification, regression, clustering, and frequent set mining. Learning from temporal data or exploiting spatial relationships is handled as spatio-temporal analysis. Analysis methods range from statistical learning and optimization methods to logical (multi-relational) approaches. Application areas are the analysis of very large databases or multi-media collections like the World Wide Web (texts, images, audio and video databases).
Goals: After attending the long course, students will know what KDD is, where it can be applied, and how to develop an application. Students know the challenges of the research field and are ready to start their own developments, for instance in the form of a diploma or master thesis.
Prerequisites: If the students are familiar with neither of the prerequisites, only 2 of the facultative chapters can be selected. If the students are knowledgeable in one of the prerequisites, 3 of the facultative chapters can be chosen. If both prerequisites are known, all facultative chapters can be taught in the lecture.
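The selection rule for the long course can be written down as a small lookup. The sketch below is illustrative only: the function name is hypothetical, and the total of four facultative chapters is read off the module table (chapters 7, 9, 10, and 11), not stated in the paragraph itself.

```python
# Number of facultative chapters in the module table (chapters 7, 9, 10, 11).
# This count is an assumption read off the table, not given in the rule itself.
TOTAL_FACULTATIVE_CHAPTERS = 4


def max_facultative_chapters(prerequisites_known):
    """Return how many facultative chapters fit into the long course.

    prerequisites_known: how many of the two prerequisite chapters
    (data management, statistical basics) the students already know.
    """
    if prerequisites_known not in (0, 1, 2):
        raise ValueError("prerequisites_known must be 0, 1, or 2")
    if prerequisites_known == 0:
        return 2
    if prerequisites_known == 1:
        return 3
    return TOTAL_FACULTATIVE_CHAPTERS


if __name__ == "__main__":
    for known in (0, 1, 2):
        print(known, "prerequisite(s) known:",
              max_facultative_chapters(known), "facultative chapter(s)")
```

Which of the facultative chapters are chosen is left to the lecturer; the function only caps their number.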

Course Materials:

  • Data Mining Course Home (English) - Gregory Piatetsky-Shapiro, Gary Parker
  • Data Mining Concepts and Techniques (English) - Jiawei Han, Micheline Kamber
  • Books

    The overall course can be based on the following books:
    Title; Author; Year; Type
    The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome; 2001; Book
    Principles of Data Mining; Hand, David and Mannila, Heikki and Smyth, Padhraic; 2001; Book
    Machine Learning; Mitchell, Tom M.; 1997; Book

    Modules

    No.  Name                             Prerequisites  Obligatory  Sessions
    1.   KDD Process                      no             yes         2
           CRISP model
           KDD environments (e.g., R, SAS, Clementine, MiningMart, SPIN!,...)
    2.   Data management                  yes            no          4
           Hash trees, B trees, B graphs
           SQL
           Data modeling, meta-data (schema declaration, XML,...)
           Data warehousing
           OLAP, data cube
    3.   Statistical basics               yes            no          3(1)
    4.   Preprocessing                    no             yes         4(3)
           Sampling
           Density estimation, Monte Carlo methods (can also be placed under 6)
           Data visualization
           Feature selection
           Principal component analysis, dimension reduction
           Feature generation, feature extraction
    5.   Classification                   no             yes         6(5)
           Linear discriminant analysis
           Top-down induction of decision trees
           Ensemble methods
           Support Vector Machine (SVM)
           PAC learning, VC dimension
    6.   Regression                       no             yes         2
           Logistic regression
           Non-parametric statistical methods
           SVM for regression
    7.   Heuristic search (Optimization)  no             no          3
           EM method
           Genetic programming, evolutionary algorithms
           Bayesian networks
           Hidden Markov Models (HMM)
    8.   Frequent sets                    no             yes         4(3)
           Algorithmic issues
           Evaluation methods, significance, null hypothesis testing
           Condensed representations
    9.   Spatio-temporal analysis         no             no          3
           Statistical time series analysis
           Sequences and episodes
           Time series clustering and indexing
           Spatial relationships
    10.  Multi-relational mining          no             no          3
           Inductive Logic Programming (ILP)
           Propositionalisation
           ILP learnability
           Ontology learning
    11.  Multi-media mining               no             no          3
           Web mining
           Text categorisation
           Information extraction and wrapper induction
           Image, video categorisation
           Audio classification
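The session counts in the module table can be cross-checked against the course formats described earlier (15 weeks with 2 lecture sessions per week for the long course and 1 per week for the short course). The following sketch is only a consistency check under two assumptions of ours: the lecture sessions are weekly, and the prerequisite chapters are covered elsewhere and therefore not counted. The `Module` type and function names are hypothetical; the numbers are copied from the table, with the bracketed value used for the short course.

```python
from collections import namedtuple

# long = sessions in the long course, short = sessions in the short course
# (the bracketed value where the table gives one, otherwise the same number;
# 0 for facultative chapters, which the short course does not handle).
Module = namedtuple("Module", "no name prerequisite obligatory long short")

MODULES = [
    Module(1, "KDD Process", False, True, 2, 2),
    Module(2, "Data management", True, False, 4, 4),
    Module(3, "Statistical basics", True, False, 3, 1),
    Module(4, "Preprocessing", False, True, 4, 3),
    Module(5, "Classification", False, True, 6, 5),
    Module(6, "Regression", False, True, 2, 2),
    Module(7, "Heuristic search (Optimization)", False, False, 3, 0),
    Module(8, "Frequent sets", False, True, 4, 3),
    Module(9, "Spatio-temporal analysis", False, False, 3, 0),
    Module(10, "Multi-relational mining", False, False, 3, 0),
    Module(11, "Multi-media mining", False, False, 3, 0),
]

WEEKS_PER_SEMESTER = 15


def long_course_sessions(modules):
    """Lecture sessions of the long course, prerequisites assumed known."""
    return sum(m.long for m in modules if not m.prerequisite)


def short_course_sessions(modules):
    """Lecture sessions of the short course (obligatory chapters only)."""
    return sum(m.short for m in modules if m.obligatory)


if __name__ == "__main__":
    print("long course:", long_course_sessions(MODULES),
          "sessions, available:", 2 * WEEKS_PER_SEMESTER)
    print("short course:", short_course_sessions(MODULES),
          "sessions, available:", 1 * WEEKS_PER_SEMESTER)
```

Under these assumptions both totals match the available lecture slots, which is why teaching the prerequisites inside the course requires dropping sessions or facultative chapters elsewhere.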

