Several methods can be used to predict the interestingness of websites. Neural networks and decision trees are candidates; here we discuss the method of Bayesian Classifiers (see also Naive Bayes Classifier and Bayes' Theorem).
The list of words that remains after Data Cleansing is sorted by "most informative word". An "informative word" is a word that appears frequently in pages the user likes (the hotlist) and infrequently in pages the user dislikes (the coldlist), or vice versa. A simple way to calculate the information gain of a word is to take the absolute difference between its number of occurrences in the hotlist and in the coldlist. The list of words is then sorted by information gain, and the top 128 words are used for classification. This cut-off is a parameter that can be tuned to optimize performance.
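The following Python sketch illustrates this selection step. The names (hot_pages, cold_pages, select_informative_words) are illustrative only and not taken from the original system; the pages are assumed to be already-cleansed lists of words.

    from collections import Counter

    def select_informative_words(hot_pages, cold_pages, top_n=128):
        """Rank words by the simple information-gain measure described above:
        the absolute difference between hotlist and coldlist occurrence counts."""
        hot_counts = Counter(word for page in hot_pages for word in page)
        cold_counts = Counter(word for page in cold_pages for word in page)

        vocabulary = set(hot_counts) | set(cold_counts)
        # "Information gain" here is simply |occurrences in hotlist - occurrences in coldlist|.
        gain = {word: abs(hot_counts[word] - cold_counts[word]) for word in vocabulary}

        # Keep the top_n best-scoring words (128 in the text); this is the tunable parameter.
        return sorted(vocabulary, key=lambda word: gain[word], reverse=True)[:top_n]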
Why is this done? Suppose, for example, that you want to predict the chance that a website is about Formula One racing. "schumacher" and "pigeons" are two words with a high information gain: "schumacher" is a typical hot word, "pigeons" is a typical cold word. If a site contains "schumacher", the topic is likely Formula One, but if it contains "pigeons" it probably is not. A word like "links" can appear on every website you can imagine, on hot sites as well as cold sites, and therefore has a low information gain.
The Bayesian Classifier is a probabilistic method for classification. It can be used to determine the probability that an example j (a document/webpage) belongs to class C_i (the page is hot or cold) given the values of the attributes of the example:

P(C_i | D_1 = W_1j & ... & D_n = W_nj)

where D_k = W_kj means "the document/webpage contains (or does not contain) the k-th word".
If the attribute values are conditionally independent given the class (the occurrences of the words are independent of each other), this probability is proportional to

P(C_i) * ∏_k P(D_k = W_kj | C_i)

Both P(D_k = W_kj | C_i) (i.e. the probability that a webpage contains a certain word given that it is hot) and P(C_i) (the probability that a webpage is hot) may be estimated from the training data. To determine the most likely class of an example (either hot or cold), the probability of each class is computed, and the example is assigned to the class with the highest probability.
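A minimal sketch of such a classifier is given below, assuming the hotlist and coldlist are available as lists of word lists and using the informative words selected above as attributes. The add-one smoothing of the estimates is an added assumption to avoid zero probabilities, not something prescribed by the text, and the function names are illustrative only.

    import math
    from collections import Counter

    def train_naive_bayes(hot_pages, cold_pages, feature_words):
        """Estimate P(C_i) and P(D_k = W_kj | C_i) from the training data."""
        model = {}
        total = len(hot_pages) + len(cold_pages)
        for label, pages in (("hot", hot_pages), ("cold", cold_pages)):
            prior = len(pages) / total  # P(C_i)
            contains = Counter()
            for page in pages:
                present = set(page)
                for word in feature_words:
                    if word in present:
                        contains[word] += 1
            # P(word present | class), with add-one smoothing (an assumption).
            likelihood = {word: (contains[word] + 1) / (len(pages) + 2)
                          for word in feature_words}
            model[label] = (prior, likelihood)
        return model

    def classify(page_words, model, feature_words):
        """Assign the page to the class with the highest (log-)probability."""
        present = set(page_words)
        scores = {}
        for label, (prior, likelihood) in model.items():
            score = math.log(prior)
            for word in feature_words:
                p = likelihood[word]
                score += math.log(p if word in present else 1.0 - p)
            scores[label] = score
        return max(scores, key=scores.get)

Working in log space is a design choice that avoids numerical underflow when many small word probabilities are multiplied together.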
When the user browses with the adaptive web browser, the learning method predicts which links on the current page are interesting by reading ahead (prefetching) the linked pages and analyzing their content. When the user follows some of the links, he can enter his opinion about the suggestions: if he agrees, the browser records that the prediction succeeded; if he does not, it records that the prediction failed. In this way the browser learns from the user what type of sites should be recommended and what type of sites should be avoided.
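As an illustration of this feedback loop (all names here are hypothetical and not from the original browser), the recorded opinions can simply extend the training data used for the next round of learning:

    def record_feedback(training_pages, page_words, predicted_label, user_agrees):
        """Store the user's verdict: an agreed prediction confirms the predicted
        class, a rejected one adds the page to the opposite class."""
        actual = predicted_label if user_agrees else ("cold" if predicted_label == "hot" else "hot")
        training_pages.setdefault(actual, []).append(list(page_words))
        # With the enlarged hotlist/coldlist, the word selection and the
        # classifier (see the sketches above) can be retrained periodically.
        return actual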
