The Adaptive Webbrowser: Method Selection

Name The Adaptive Webbrowser: Method Selection

Description

Several methods can be used to predict how interesting a website is to the user. Neural networks and decision trees are candidates; here we discuss the Bayesian classifier (see also Naive Bayes Classifier and Bayes Theorem).
The word list that remains after Data Cleansing is sorted by how informative each word is. An "informative word" is a word that appears frequently in pages the user likes (the hotlist) and infrequently in pages the user dislikes (the coldlist), or vice versa. A simple way to estimate the information gain of a word is to take the absolute difference between its number of occurrences in the hotlist and its number of occurrences in the coldlist. The words are then sorted by information gain, and the top 128 words are used for classification. The number of words kept (128 here) is a parameter that can be tuned to optimize performance.
Why is this done? Suppose you want to predict the chance that a website is about Formula One racing. "schumacher" and "pigeons" are both words with a high information gain: "schumacher" is a typical hot word, "pigeons" is a typical cold word. If a site contains "schumacher", its topic is likely Formula One; if it contains "pigeons", it probably is not. A word like "links" can appear on almost any website, hot or cold, and therefore has a low information gain.
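The word selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the browser's actual implementation: the function name `select_informative_words`, the representation of each page as a list of already-cleansed words, and the alphabetical tie-breaking are all assumptions made for the example.

```python
from collections import Counter

def select_informative_words(hot_pages, cold_pages, top_n=128):
    """Rank words by |hot count - cold count| and keep the top_n words.

    hot_pages and cold_pages are lists of pages; each page is assumed to be
    a list of words that has already been through data cleansing.
    """
    hot_counts = Counter(word for page in hot_pages for word in page)
    cold_counts = Counter(word for page in cold_pages for word in page)
    vocabulary = set(hot_counts) | set(cold_counts)
    # Simple information gain: absolute difference of the occurrence counts.
    gain = {w: abs(hot_counts[w] - cold_counts[w]) for w in vocabulary}
    # Highest gain first; ties are broken alphabetically (an assumption)
    # so that the result is deterministic.
    ranked = sorted(vocabulary, key=lambda w: (-gain[w], w))
    return ranked[:top_n]

# Toy data based on the Formula One example.
hot_pages = [["schumacher", "ferrari", "links"], ["schumacher", "race", "links"]]
cold_pages = [["pigeons", "links"], ["pigeons", "garden", "links"]]
top_words = select_informative_words(hot_pages, cold_pages, top_n=2)
```

With this toy data, "schumacher" and "pigeons" each have a gain of 2, while "links" occurs equally often in both lists and gets a gain of 0, so it is ranked last.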

The Bayesian classifier is a probabilistic classification method. It can be used to determine the probability that an example j (a webpage) belongs to class C_i (hot or cold), given the values of the attributes of the example:

P(C_i | D_1 = W_{1j} & ... & D_n = W_{nj})

where D_k = W_{kj} means that attribute D_k (the k-th selected word) takes the value W_{kj} in example j, i.e. whether or not the document/webpage contains this word.

If the attribute values are independent given the class (that is, the occurrences of the words are independent of one another), this probability is proportional to

P(C_i) * ∏_k P(D_k = W_{kj} | C_i)

Both P(D_k = W_{kj} | C_i) (e.g. the probability that a webpage contains a certain word given that it is hot) and P(C_i) (e.g. the probability that a webpage is hot) may be estimated from training data. To determine the most likely class of an example (either hot or cold), the probability of each class is computed, and the example is assigned to the class with the highest probability.
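The classification step can be sketched as follows. This is a minimal sketch under stated assumptions rather than the method's reference implementation: the names `train` and `classify` are hypothetical, attributes are treated as binary (word present or absent), both training lists are assumed to be non-empty, and add-one (Laplace) smoothing is added to avoid zero probabilities even though the text above does not mention it. Logarithms are summed instead of multiplying probabilities to avoid numerical underflow with many words.

```python
import math

def train(hot_pages, cold_pages, words):
    """Estimate P(C_i) and P(word present | C_i) from the training pages."""
    model = {}
    total = len(hot_pages) + len(cold_pages)
    for label, pages in (("hot", hot_pages), ("cold", cold_pages)):
        prior = len(pages) / total
        # Add-one smoothing (an assumption, not taken from the text).
        p_word = {
            w: (sum(w in page for page in pages) + 1) / (len(pages) + 2)
            for w in words
        }
        model[label] = (prior, p_word)
    return model

def classify(model, page, words):
    """Return the most likely class and the log score of each class."""
    present = set(page)
    scores = {}
    for label, (prior, p_word) in model.items():
        log_score = math.log(prior)
        for w in words:
            p = p_word[w]
            # Use P(present | class) or P(absent | class) as appropriate.
            log_score += math.log(p if w in present else 1.0 - p)
        scores[label] = log_score
    return max(scores, key=scores.get), scores

hot_pages = [["schumacher", "ferrari"], ["schumacher", "race"]]
cold_pages = [["pigeons"], ["pigeons", "garden"]]
words = ["schumacher", "pigeons"]
model = train(hot_pages, cold_pages, words)
label_f1, _ = classify(model, ["schumacher", "wins"], words)
label_birds, _ = classify(model, ["pigeons"], words)
```

In this toy example, a page containing "schumacher" but not "pigeons" scores 0.5 * 0.75 * 0.75 for hot against 0.5 * 0.25 * 0.25 for cold, so it is assigned to the hot class.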

When the user browses with the adaptive webbrowser, the learning method predicts which links on the current page are interesting by reading ahead (fetching the linked pages) and analyzing their content. If the user follows some of the links, he can give his opinion on the suggestions. If he agrees with a suggestion, the browser records that the prediction succeeded; if he disagrees, it records that the prediction failed. In this way, the browser learns from the user which types of sites should be recommended and which should be avoided.
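One possible way to record this feedback is sketched below. The class `FeedbackRecorder` and its methods are hypothetical, and the idea that each rated page is appended to the hotlist or coldlist for later retraining is an assumption; the text only states that successes and failures are recorded.

```python
class FeedbackRecorder:
    """Keeps track of prediction outcomes and of the rated pages."""

    def __init__(self):
        self.hotlist = []
        self.coldlist = []
        self.successes = 0
        self.failures = 0

    def record(self, page_words, predicted_hot, user_agrees):
        """Store one user opinion about a predicted page."""
        if user_agrees:
            self.successes += 1
        else:
            self.failures += 1
        # If the user agrees, the page's class is the predicted one;
        # otherwise it is the opposite class.
        actually_hot = predicted_hot if user_agrees else not predicted_hot
        # Assumption: rated pages become new training examples.
        (self.hotlist if actually_hot else self.coldlist).append(page_words)

    def accuracy(self):
        """Fraction of successful predictions, or None if nothing was rated."""
        total = self.successes + self.failures
        return self.successes / total if total else None

recorder = FeedbackRecorder()
recorder.record(["schumacher", "race"], predicted_hot=True, user_agrees=True)
recorder.record(["pigeons", "garden"], predicted_hot=True, user_agrees=False)
```

After these two calls, the first page is in the hotlist, the second page (a failed prediction) is in the coldlist, and half of the predictions have succeeded.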

Case Study The Adaptive Webbrowser Case