The Adaptive Webbrowser: Data Cleansing

Name The Adaptive Webbrowser: Data Cleansing
Description
Much of the data in the HTML files is not needed for classification. The following cleansing decisions have been made:
  1. Everything inside an HTML tag (between < and >) is considered of no value and omitted.
  2. Words shorter than three characters are ignored.
  3. Words like "what", "where", "why" and about 600 more are omitted, because they carry no informative value.


Selecting single words has its drawbacks: phrases like "computer science" and "science fiction" both add to the count of the word "science", so it is counted twice, although the meaning of this word depends on its context. So that words can be compared with each other at a later stage, all words are converted to lower case.
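
The cleansing decisions described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the case study: the function name clean_html, the three-word STOP_WORDS set (a stand-in for the list of about 600 words) and the choice to treat a word as a run of the letters a to z are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stand-in for the stop list of about 600 words.
STOP_WORDS = {"what", "where", "why"}

# Anything from "<" up to the next ">" is treated as a tag.
TAG_PATTERN = re.compile(r"<[^>]*>")
# Assumption: a word is an unbroken run of the letters a-z.
WORD_PATTERN = re.compile(r"[a-z]+")


def clean_html(html, stop_words=STOP_WORDS, min_length=3):
    """Return a Counter of the words that survive the cleansing rules."""
    # Rule 1: drop everything between "<" and ">".
    text = TAG_PATTERN.sub(" ", html)
    # Convert to lower case so that words can be compared later.
    text = text.lower()
    words = WORD_PATTERN.findall(text)
    # Rule 2: ignore words shorter than min_length characters.
    # Rule 3: ignore words that appear in the stop list.
    kept = [w for w in words if len(w) >= min_length and w not in stop_words]
    return Counter(kept)


counts = clean_html(
    "<h1>Computer Science</h1><p>Why is science fiction so popular?</p>"
)
for word, count in counts.most_common():
    print(word, count)
```

In this small example "science" ends up with a count of 2, once from "Computer Science" and once from "science fiction", which is exactly the drawback of counting single words.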

Cleaning the HTML page shown in Data Design in this way yields the following list of words:

word count
machine 13
learning 12
data 6
application 4
guide 4
mining 3
goal 2
applications 2
analysis 2
more 2
information 2
detailed 2
gives 2
systems 2
description 1
learns 1
suggests 1
term 1
task 1
building 1
build 1
complete 1
actually 1
only 1
main 1
construction 1
improve 1
their 1
behaviour 1
particular 1
here 1
present 1
latter 1
form 1
elaborate 1
case 1
descriptions 1
explanations 1
pointers 1
presentation 1
requires 1
basically 1
prior 1
knowledge 1
about 1
first 1
glance 1
world 1
helps 1
assess 1
potential 1
problem 1
pointers 1
experts 1
tools 1
assistance 1
organized 1
facilitate 1
novices 1
field 1
informs 1
learn 1
assess 1
potential 1
technology 1
solving 1
problems 1
navigating 1
frame 1
left 1
select 1
topic 1
start 1
cases 1
abstract 1
found 1
process 1
model 1
keywords 1
kdd 1
knowledge 1
discovery 1
databases 1
warehousing 1
inductive 1
methods 1
intelligent 1



Case Study The Adaptive Webbrowser Case