In this appendix, we describe in detail the partial-rule learning approach outlined in the main body of the paper.
The partial-rule learning algorithm (whose top-level form is shown in Figure 11) stores a set of statistics for each partial rule. To estimate the confidence on these statistics, we use a confidence index that, roughly speaking, keeps track of the number of times the partial rule is used. The confidence is derived from this index by means of a confidence_function.
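The paper's exact confidence_function is not reproduced above. As a purely illustrative sketch (our assumption, not the authors' definition), one simple choice is a saturating ratio that grows with the number of uses of the rule and levels off at 1:

```python
def confidence_function(index, max_index=10):
    """Hypothetical confidence_function: maps a rule's confidence
    index (roughly, the number of times the rule has been used)
    to a confidence value in [0, 1], saturating at max_index.
    Both the functional form and max_index are assumptions."""
    return min(index, max_index) / max_index
```

Any monotone mapping from the index to [0, 1] would exhibit the qualitative behavior the text relies on: rarely-used rules get low confidence, frequently-used rules approach full confidence.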
Additionally, the confidence index is used to define the learning rate (i.e., the weight of newly observed rewards in the statistics update). For this purpose, we implement a MAM function [Venturini, 1994] for each rule.
With a MAM-based updating rule, the lower the confidence, the higher the effect of the last observed rewards on the statistics, and thus the faster the adaptation of the statistics. This adaptive learning-rate strategy is related to those presented by [Sutton, 1991] and [Kaelbling, 1993], and contrasts with traditional reinforcement-learning algorithms, where a constant learning rate is used.
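The MAM function itself is not shown above. The following sketch illustrates the behavior described, under the assumption (ours, not the paper's) that the rate is the maximum of a constant floor `beta` and a running-average rate `1/(i+1)`:

```python
def mam_learning_rate(confidence_index, beta=0.1):
    """Sketch of a MAM-style adaptive learning rate: behaves like a
    running-average rate 1/(i+1) while the rule has been used few
    times, but is bounded below by the constant beta so the rule
    never stops adapting. Parameter names are assumptions."""
    return max(beta, 1.0 / (confidence_index + 1))

def update_statistic(old_value, reward, confidence_index, beta=0.1):
    """Update one rule statistic with the adaptive rate: a low
    confidence index yields a large rate, so the last observed
    rewards dominate the statistic."""
    m = mam_learning_rate(confidence_index, beta)
    return (1.0 - m) * old_value + m * reward
```

Note how a brand-new rule (index 0) gets rate 1.0 and simply adopts the first observed reward, while a well-used rule settles at the floor rate `beta`, matching the qualitative description in the text.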
After the initialization phase, the algorithm enters a continuous loop in which each task episode consists of estimating the possible effects of all actions, executing the most promising one, and updating the system so that its performance improves in the future. The system update includes the statistics update and the partial-rule management.
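The loop just described can be sketched as follows; every function passed in is a hypothetical placeholder for the corresponding procedure of the paper, not its actual implementation:

```python
def run_episode(rules, actions, execute, guess_effect,
                update_stats, manage_rules):
    """One iteration of the top-level loop (all arguments are
    hypothetical stand-ins for the paper's procedures): estimate
    the effect of each action from the partial rules, execute the
    most promising action, then perform the system update, i.e.,
    the statistics update followed by partial-rule management."""
    # Estimate the expected effect of every action from the
    # currently relevant partial rules.
    estimates = {a: guess_effect(rules, a) for a in actions}
    # Execute the most promising action and observe the reward.
    best = max(actions, key=lambda a: estimates[a])
    reward = execute(best)
    # System update: statistics first, then rule management.
    update_stats(rules, best, reward)
    manage_rules(rules)
    return best, reward
```

Calling `run_episode` repeatedly yields the continuous per-episode loop of the algorithm.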