The Adaptive Webbrowser: Data Cleansing
Name | The Adaptive Webbrowser: Data Cleansing |
Description |
Much of the data in the HTML files is not needed for classification. Several decisions have been made:
- Everything between the HTML tag delimiters ( < and > ), that is, the markup itself, is considered of no value and omitted.
- Words shorter than three characters are ignored.
- Words like "what", "where", "why" and about 600 more are omitted, because they carry no informative value.
Selecting single words has its drawbacks: phrases like "computer science" and "science fiction" each add to the count of the word "science", although the meaning of this word depends on its context. So that words can be compared with each other at a later stage, all words are converted to lower case.
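The rules above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the case: the function name `clean_html`, the tiny `STOP_WORDS` set (a stand-in for the list of about 600 words, which is not reproduced here) and the assumption that a word is a run of ASCII letters are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, heavily shortened stand-in for the stop list of about 600 words.
STOP_WORDS = {"what", "where", "why"}

# Words shorter than this many characters are ignored.
MIN_LENGTH = 3


def clean_html(html: str) -> Counter:
    """Count the words of an HTML page after applying the cleansing rules."""
    # Drop everything between < and >, i.e. the tags themselves.
    text = re.sub(r"<[^>]*>", " ", html)
    # Convert to lower case so that words can be compared later.
    text = text.lower()
    # Assumption: a word is a run of ASCII letters.
    words = re.findall(r"[a-z]+", text)
    # Ignore short words and words on the stop list.
    kept = [w for w in words if len(w) >= MIN_LENGTH and w not in STOP_WORDS]
    return Counter(kept)


page = "<h1>Computer Science</h1><p>Why is science fiction so popular?</p>"
counts = clean_html(page)
print(counts["science"])  # 2
```

The sample page also shows the drawback of single words: "science" is counted twice, once from "computer science" and once from "science fiction", even though the two phrases mean different things.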
Cleaning the HTML page shown in Data Design in this way, we obtain the following list of words:
word | count |
machine | 13 |
learning | 12 |
data | 6 |
application | 4 |
guide | 4 |
mining | 3 |
goal | 2 |
applications | 2 |
analysis | 2 |
more | 2 |
information | 2 |
detailed | 2 |
gives | 2 |
systems | 2 |
description | 1 |
learns | 1 |
suggests | 1 |
term | 1 |
task | 1 |
building | 1 |
build | 1 |
complete | 1 |
actually | 1 |
only | 1 |
main | 1 |
construction | 1 |
improve | 1 |
their | 1 |
behaviour | 1 |
particular | 1 |
here | 1 |
present | 1 |
latter | 1 |
form | 1 |
elaborate | 1 |
case | 1 |
descriptions | 1 |
explanations | 1 |
pointers | 1 |
presentation | 1 |
requires | 1 |
basically | 1 |
prior | 1 |
knowledge | 1 |
about | 1 |
first | 1 |
glance | 1 |
world | 1 |
helps | 1 |
assess | 1 |
potential | 1 |
problem | 1 |
pointers | 1 |
experts | 1 |
tools | 1 |
assistance | 1 |
organized | 1 |
facilitate | 1 |
novices | 1 |
field | 1 |
informs | 1 |
learn | 1 |
assess | 1 |
potential | 1 |
technology | 1 |
solving | 1 |
problems | 1 |
navigating | 1 |
frame | 1 |
left | 1 |
select | 1 |
topic | 1 |
start | 1 |
cases | 1 |
abstract | 1 |
found | 1 |
process | 1 |
model | 1 |
keywords | 1 |
kdd | 1 |
knowledge | 1 |
discovery | 1 |
databases | 1 |
warehousing | 1 |
inductive | 1 |
methods | 1 |
intelligent | 1 |

Case Study | The Adaptive Webbrowser Case |