In non-generalizing reinforcement learning, the cost of executing a single learning step can be neglected. However, algorithms that generalize in the spaces of sensors and/or actuators are not so simple, and the execution time of each iteration can increase substantially. In the extreme, this increase can limit the reactivity of the learner, which is particularly dangerous when working with an autonomous robot.
The most expensive procedure of our algorithm is that of computing the value of all
actions (i.e., all valid combinations of elementary actions).
The cost of this procedure is especially critical since it is used
twice in each step: once to get the guess of each action (in the Action
Evaluation procedure detailed in Figure 13) and again to get the
goodness of the newly achieved situation after the action is executed (when computing the
value in the Statistics Update procedure detailed in Figure 14).
A trivial reordering of the algorithm avoids this double use of the expensive procedure at each learning step: we can select the action to be executed next at the same time as we evaluate the goodness of the newly achieved situation.
The drawback of this reordering is that the action is selected without taking into account the information provided by the last reward value (the goodness of the situation is assessed before the reward adjustment). However, this is not a problem in tasks that require many learning steps.
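As an illustration, a minimal sketch of the reordered learning step is given below. The helper callables (evaluate_actions, update_statistics, execute) and their signatures are assumptions made for illustration; they stand in for, but are not, the procedures of Figures 13 and 14.

```python
# Minimal sketch of the reordered learning step (hypothetical interfaces).
# The single call to the expensive action-evaluation procedure is used both
# to assess the newly reached situation and to pick the next action.

def reordered_learning_step(controller, situation, last_reward,
                            evaluate_actions, update_statistics, execute):
    # One evaluation pass returns the assessed goodness of the current
    # situation and the action with the best guess.
    situation_value, next_action = evaluate_actions(controller, situation)

    # Statistics of the previously executed action are updated using the
    # last reward and the value of the situation it led to.
    update_statistics(controller, last_reward, situation_value)

    # Note: next_action was chosen before this update, so it ignores the
    # very last reward adjustment (the drawback discussed above).
    return execute(next_action)
```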
Even if we use the action-evaluation procedure only once per learning step, we have to optimize it as much as possible since the brute-force approach described before, which evaluates each action sequentially, is only feasible for simple problems.
The action-evaluation method presented next is based on the observation that many of the actions end up with the same value, since the most relevant partial rule active at a given moment provides the value for all the actions in accordance with its partial command. Computing separately the value of two actions that would be evaluated using the same rule is a waste of time. This can be avoided by driving the action evaluation by the set of active rules instead of by the set of possible actions, as the brute-force approach does.
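For reference, here is a minimal sketch of the brute-force evaluation just mentioned, assuming a simple illustrative representation in which an action is a motor-to-value dictionary and a rule is a dictionary with a partial command, a guess, and a relevance (these names are not the paper's data structures):

```python
# Brute-force action evaluation (sketch): every action is valued
# independently by scanning the rules from most to least relevant.

def in_accordance(action, partial_command):
    # An action agrees with a partial command if it assigns the same value
    # to every motor mentioned by the partial command.
    return all(action.get(m) == v for m, v in partial_command.items())

def brute_force_evaluate(actions, rules):
    # actions: iterable of {motor: value} dicts (one entry per motor)
    # rules:   iterable of {"partial_command", "guess", "relevance"} dicts
    ranked = sorted(rules, key=lambda r: r["relevance"], reverse=True)
    guesses = {}
    for action in actions:                     # O(|actions| * |rules|) matches
        for rule in ranked:                    # first match = most relevant rule
            if in_accordance(action, rule["partial_command"]):
                guesses[tuple(sorted(action.items()))] = rule["guess"]
                break
    return guesses
```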
Figure 16 shows a general form of the algorithm we propose. In this algorithm, partial rules are considered one at a time, ordered from the most relevant rule to the least relevant one. The partial command of the rule under consideration is used to process all the actions that are in accordance with that partial command. This already-processed subset of actions need not be considered any further in the action-evaluation procedure.
While the rules are processed, we update the current situation assessment and the action to be executed next attending, respectively, to the reward prediction and the guess of the rules.
Observe that the partial rules can be kept sorted by relevance by the statistics-update procedure, since it is in this procedure that rule relevance is modified. When the relevance of a rule changes, its position in the list can also be adjusted accordingly. In this way we do not have to re-sort the list of rules every time we apply the procedure just described.
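A minimal sketch of this rule-driven evaluation follows. The rule fields and the way the situation value and the next action are accumulated (best reward prediction and best guess among the rules actually used) are illustrative assumptions; the authoritative bookkeeping is the one in Figure 16.

```python
# Rule-driven action evaluation (sketch of the Figure 16 scheme):
# rules are processed from most to least relevant; each rule values, in one
# shot, all the still-unvalued actions that agree with its partial command.

def in_accordance(action, partial_command):
    return all(action.get(m) == v for m, v in partial_command.items())

def rule_driven_evaluate(actions, sorted_rules):
    # actions:      iterable of {motor: value} dicts (full commands)
    # sorted_rules: rule dicts sorted by decreasing relevance, each with
    #               "partial_command", "value" (reward prediction) and
    #               "guess" entries (illustrative field names).
    pending = {tuple(sorted(a.items())) for a in actions}
    situation_value = float("-inf")
    best_guess, best_action = float("-inf"), None
    for rule in sorted_rules:
        covered = {a for a in pending
                   if in_accordance(dict(a), rule["partial_command"])}
        if not covered:
            continue                      # rule values no remaining action
        # Illustrative choice: situation assessed with the best reward
        # prediction, next action taken from the rule with the best guess.
        situation_value = max(situation_value, rule["value"])
        if rule["guess"] > best_guess:
            best_guess, best_action = rule["guess"], next(iter(covered))
        pending -= covered                # these actions are now evaluated
        if not pending:
            break                         # remaining rules are not needed
    return situation_value, dict(best_action) if best_action else None
```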
When elementary actions take the form of an assignment of a value to a motor, with the value taken from the range of possible values for that motor, the above algorithm can be implemented in an especially efficient way since there is no need to explicitly compute the set of actions.
In this case (see Figures 17 and 18), we construct a decision tree that uses motors as decision attributes and that groups in the same leaf all the actions evaluated by the same partial rule (i.e., all the actions removed from the set in each iteration of the algorithm in Figure 16).
Each internal node of the tree classifies the action according to one of the motor commands included in the action. These internal nodes store the following information:
The leaves of the tree have information about the value of the actions classified in that leaf. This information is represented with the following set of attributes for each leaf:
At a given moment, the inclusion of a new partial rule in the tree produces the specialization of all the open nodes compatible with the rule (see Figure 18). We say that an open node is compatible with a given rule if the partial command of the node and the partial command of the rule do not assign different values to the same motor. The specialization of an open node can result in the extension of the node (i.e.,
new branches are added to the tree under that node) or in the transformation of this node into
a leaf. A node is extended when the partial command of the rule affects some motors not
included in the partial command of the node. This means that there are some motor values not
taken into account in the tree but that have to be used in the action evaluation according to
the rule under consideration. When a node is extended, one of the motors not present in the layers above is used to generate a layer of open nodes under the current node. After that, the node is considered closed and the rule-inclusion procedure is repeated for this node (with different effects, because the node is now closed). When all the
motors affected by the partial command of the rule are also affected by the partial command of
the node, then the node is transformed into a leaf storing the value, guess, and relevance
attributes extracted from the information associated with the rule.
The process is stopped as soon as we detect that all the nodes have been closed (i.e., all the external nodes of the tree are leaves). In this case, the rules still to be processed can have no effect on the form of the tree and, consequently, are not useful for action evaluation. If a rule is consistently not used for action evaluation, it can be removed from the controller.
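The following sketch puts the mechanisms of Figures 17 and 18 together under the assumptions used so far: each node carries the partial command accumulated along its path, rules are inserted by decreasing relevance, and a leaf stores the value, guess, and relevance of the rule that closed it. All field and function names are illustrative, not the paper's.

```python
# Tree-based action evaluation (sketch). Internal nodes branch on a motor;
# open nodes can still be specialized; leaves store the attributes of the
# rule that closed them. Names and bookkeeping details are illustrative.
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    path: Dict[str, int] = field(default_factory=dict)   # motors fixed so far
    open: bool = True                                     # still specializable
    branch_motor: Optional[str] = None                    # set when extended
    children: Dict[int, "Node"] = field(default_factory=dict)
    value: Optional[float] = None       # leaf attributes (reward prediction,
    guess: Optional[float] = None       # guess and relevance of the rule
    relevance: Optional[float] = None   # that turned this node into a leaf)

def compatible(path, partial_command):
    # Compatible if the two partial commands never disagree on a motor.
    return all(path.get(m, v) == v for m, v in partial_command.items())

def insert_rule(node, rule, motor_values):
    """Specialize every open node compatible with the rule (Figure 18 sketch)."""
    if node.branch_motor is not None:          # closed internal node: recurse
        for child in node.children.values():
            insert_rule(child, rule, motor_values)
        return
    if not node.open or not compatible(node.path, rule["partial_command"]):
        return
    missing = [m for m in rule["partial_command"] if m not in node.path]
    if missing:
        # Extension: branch on a motor not yet fixed along this path, creating
        # a layer of open children, then repeat the insertion on the now-closed node.
        motor = missing[0]
        node.branch_motor, node.open = motor, False
        node.children = {v: Node(path={**node.path, motor: v})
                         for v in motor_values[motor]}
        insert_rule(node, rule, motor_values)
    else:
        # All motors of the rule are fixed on this path: the node becomes a leaf
        # valued by this rule (the most relevant one reaching it so far).
        node.open = False
        node.value, node.guess = rule["value"], rule["guess"]
        node.relevance = rule["relevance"]

def all_closed(node):
    return not node.open and all(all_closed(c) for c in node.children.values())

def build_tree(sorted_rules, motor_values):
    root = Node()
    for rule in sorted_rules:                  # most relevant first
        insert_rule(root, rule, motor_values)
        if all_closed(root):                   # later rules cannot change the tree
            break
    return root
```

Reading the leaves of the resulting tree then yields the situation assessment and the action to execute, as illustrated in the example below.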
A toy-size example can illustrate this tree-based action-evaluation algorithm. Suppose that we have a robot with three motors, each accepting two different values. This produces a set of 8 different actions. Suppose that, at a given moment, the robot controller includes the set of rules shown in Table 2.
In the Action Evaluation algorithm (Figure 17), rules are processed from the most to the least relevant one, expanding an initially empty tree using the algorithm in Figure 18. The inclusion of a rule in the tree results either in an extension of the tree (see stages B, D, and E in Figure 19) or in the closing of branches by converting open nodes into leaves (stages C and F). In this particular case, the tree becomes completely closed after processing 5 of the 8 active rules in the controller.
At the end of the process, we have a tree with five leaves. Three of them include two actions each and the other two represent a single action. Using the tree, we can say that the value of the situation in which the tree is constructed is 8 (this value is given by the leaf circled with a solid line in the figure). Additionally, the next action to be executed is the partially specified command associated with the leaf circled with a dashed line, which is the leaf with the largest guess value; the motors not fixed by that leaf can take any possible value.
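As a usage sketch only, the build_tree function of the previous sketch could be driven on three two-valued motors as follows. The rules below are hypothetical placeholders, not the rules of Table 2, so the resulting tree does not reproduce Figure 19.

```python
# Hypothetical usage of the Node/build_tree sketch above on three motors,
# each accepting the two values 0 and 1 (8 possible actions). The rules are
# made up for illustration and do NOT correspond to Table 2.
motor_values = {"m1": [0, 1], "m2": [0, 1], "m3": [0, 1]}
sorted_rules = [   # already sorted by decreasing relevance
    {"partial_command": {"m1": 0}, "value": 8.0, "guess": 2.0, "relevance": 0.9},
    {"partial_command": {"m2": 1, "m3": 0}, "value": 5.0, "guess": 6.0, "relevance": 0.7},
    {"partial_command": {}, "value": 1.0, "guess": 0.0, "relevance": 0.1},
]
root = build_tree(sorted_rules, motor_values)

def leaves(node):
    # Enumerate the leaves; each leaf's path is a partially specified action
    # (motors absent from the path can take any value).
    if node.branch_motor is None:
        if not node.open:
            yield node
    else:
        for child in node.children.values():
            yield from leaves(child)

best = max(leaves(root), key=lambda leaf: leaf.guess)
print("next action (unfixed motors may take any value):", best.path)
```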
The cost of our algorithm largely depends on the specific set of partial rules to be processed; the worst case arises when all the active rules have to be included in the tree before every node becomes closed.
Figure 20 compares the performance of the brute-force action-evaluation procedure and the tree-based one. The figure shows the time taken in the execution of the toy example of Section 6.1. For this experiment, we defined some void motors, that is, motors whose actions have no effect on the environment. As can be seen, as the number of void motors increases, the cost of the tree-based evaluation becomes significantly lower than that of the brute-force approach.