For simplicity, we assume that the robot perceives its environment through a set of binary feature detectors, $fd$. A feature detector can be seen as a process that identifies specific combinations of present (and possibly past) sensor readings. The use of feature detectors is very common in robotics, where they are defined by the programmer attending to the particular characteristics of the environment, the robot sensors, and the task to be executed, in order to extract potentially useful information (presence of landmarks, obstacles, etc.) from raw sensor readings.
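For instance, a minimal sketch of one such binary feature detector, assuming a hypothetical front range sensor and an arbitrary distance threshold, could look as follows:

```python
from collections import deque

class ObstacleAheadDetector:
    """Binary feature detector (illustrative): fires when the front range
    reading has stayed below a threshold for the last few control cycles,
    i.e., it combines present and past sensor readings."""

    def __init__(self, threshold_m=0.3, history=3):
        self.threshold_m = threshold_m          # arbitrary threshold in meters
        self.readings = deque(maxlen=history)   # short memory of past readings

    def update(self, front_range_m):
        # Feed in the latest raw reading of the (hypothetical) front range sensor.
        self.readings.append(front_range_m)

    def active(self):
        # The binary feature is on only when the whole recent history is below threshold.
        return (len(self.readings) == self.readings.maxlen
                and all(r < self.threshold_m for r in self.readings))
```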
In a similar way, instead of working directly with the space of actions provided by the robot motors (which constitutes too low-level an interface for controlling the robot), it is common practice to define a set of elementary actions. An elementary action is a specific sequence/combination of motor commands defined by the programmer attending to the characteristics of the robot and the task to be achieved. To simplify, we can assume that elementary actions are of the form $(m, v)$, where $m$ is a motor and $v$ a value in the range of valid inputs for motor $m$. This framework is quite flexible, since a motor $m$ can be either one of the physical motors of the robot or a high-level, abstract motor that combines movements of the actual motors. With this formalization, at a given moment, the robot can execute in parallel as many elementary actions as available motors.
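As an illustration, elementary actions and their parallel execution could be sketched as follows; the motor names, value ranges, and the robot.set_motor interface are purely hypothetical:

```python
# Elementary actions as (motor, value) pairs; a full action assigns one
# elementary action to each (possibly abstract) motor, executed in parallel.

ELEMENTARY_ACTIONS = {
    "left_wheel":  [-1.0, 0.0, 1.0],   # hypothetical motors and value ranges
    "right_wheel": [-1.0, 0.0, 1.0],
    "gripper":     [0.0, 1.0],
}

def execute(action, robot):
    """Send one elementary action (m, v) per motor to the robot."""
    for motor, value in action.items():
        robot.set_motor(motor, value)   # hypothetical low-level interface

# Example: drive forward while keeping the gripper open.
example_action = {"left_wheel": 1.0, "right_wheel": 1.0, "gripper": 0.0}
```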
A robot controller can be seen as a procedure that executes (combinations of elementary) actions in response to specific situations (i.e., activation of specific feature detectors) with the objective of achieving a given task. Reinforcement-learning approaches automatically define such a controller using the information provided by the reward signal. In the context of reinforcement learning, the controller is called the policy of the learner.
The objective of the value-function-based reinforcement-learning algorithms (the most common reinforcement-learning algorithms) is to predict the reward that can be directly or indirectly obtained from the execution of each action (i.e., of each combination of elementary actions) in each possible situation, described as a combination of active and inactive feature detectors. If this prediction is available, the action to be executed in each situation is the one from which maximum reward is expected.
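A minimal sketch of this greedy selection, assuming the predictions are stored in a simple table indexed by (situation, action) pairs (all names here are illustrative), is:

```python
from itertools import product

def greedy_action(q_values, situation, elementary_actions):
    """Pick the combination of elementary actions with the highest predicted reward.

    q_values: dict mapping (situation, action) -> predicted reward.
    situation: tuple of booleans, one per feature detector.
    elementary_actions: dict mapping each motor to its list of valid values."""
    motors = sorted(elementary_actions)
    # Enumerate every combination of elementary actions (one value per motor).
    candidates = [tuple(zip(motors, values))
                  for values in product(*(elementary_actions[m] for m in motors))]
    # Choose the combination with the highest predicted reward (0.0 if unseen).
    return max(candidates, key=lambda a: q_values.get((situation, a), 0.0))
```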
To predict the reward, classic reinforcement-learning algorithms rely on the Markov assumption, which requires the state signal to carry enough information to determine the effects of all actions in a given situation. Additionally, non-generalizing reinforcement-learning algorithms assume that each state of the system must be learned about independently.
So, the information gathered about the effects of an action $a$ in a given state $s$, denoted $Q(s,a)$, cannot be safely transferred to similar states or actions. With this assumption, the cost of a reinforcement-learning algorithm in a general problem is

$$O(|S| \times |A|),$$

where $|S|$ is the size of the state set and $|A|$ the size of the action set.
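A non-generalizing learner makes this explicit by keeping one independent entry per state-action pair, as in the following minimal tabular Q-learning sketch (one classic value-function algorithm, shown here only as an illustration; the learning-rate and discount values are arbitrary):

```python
from collections import defaultdict

q = defaultdict(float)      # one independent entry per (state, action) pair
alpha, gamma = 0.1, 0.95    # learning rate and discount factor (arbitrary values)

def q_update(state, action, reward, next_state, actions):
    # Update only the entry for (state, action); nothing learned here is
    # transferred to similar states or actions.
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
```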
Although the size of the action set ($|A|$) is as important as the size of the state set ($|S|$) in the curse of dimensionality, less attention is paid to actions in the reinforcement-learning literature. However, a robot with many degrees of freedom can execute many elementary actions simultaneously, and this makes the cost of the learning algorithms also increase exponentially with the number of motors of the robot ($n_m$).
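To make this growth concrete, a back-of-the-envelope computation with arbitrary numbers of feature detectors and motors:

```python
n_fd = 20                 # number of binary feature detectors (arbitrary)
n_m = 6                   # number of motors (arbitrary)
values_per_motor = 3      # valid inputs per motor (arbitrary)

n_states = 2 ** n_fd                  # 1,048,576 distinct situations
n_actions = values_per_motor ** n_m   # 729 combinations of elementary actions
print(n_states * n_actions)           # 764,411,904 independent (state, action) entries
```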
Suppose we address the same task but with two different sets of feature detectors, $FD_1$ and $FD_2$, such that $|FD_1| < |FD_2|$. Using a plain reinforcement-learning algorithm, the cost of finding a proper policy would be larger using the larger set of features ($FD_2$). And this is so even if one of the features in $FD_2$ has a stronger correlation with the reward than any of the features in $FD_1$.
Non-generalizing reinforcement-learning algorithms are not able to take advantage of this situation: even with better input information, their performance decreases. A similar argument can be made for actions in addition to feature detectors.
Generalizing reinforcement-learning algorithms, such as those using gradient-descent techniques [Widrow and Hoff, 1960], coarse codings [Hinton et al., 1986], radial-basis functions [Poggio and Girosi, 1990], tile coding [Sutton, 1996], or decision trees [Chapman and Kaelbling, 1991, McCallum, 1995], can partially alleviate this problem since they can deal with large state spaces. However, as we approach complex, realistic problems, the number of dimensions of the state space grows to the point of making some of these generalization techniques impractical, and other function-approximation techniques must be used [Sutton and Barto, 1998, page 209].
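For contrast with the tabular case, here is a minimal sketch of one generalizing estimator: a linear approximation over the binary feature detectors trained by gradient descent, in the spirit of [Widrow and Hoff, 1960] rather than any of the more elaborate representations cited above:

```python
def predict(weights, situation):
    """Linear value estimate; situation is a tuple of 0/1 feature activations."""
    return sum(w * f for w, f in zip(weights, situation))

def gradient_step(weights, situation, target, lr=0.01):
    # Widrow-Hoff (LMS) update: experience with one situation adjusts the
    # weights of all its active features, so the estimate generalizes to
    # other situations that share those features.
    error = target - predict(weights, situation)
    return [w + lr * error * f for w, f in zip(weights, situation)]
```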
Adding relevant inputs or actions to a task should make the task easier, or at least not more difficult. Only methods whose complexity depends on the relevance of the available inputs and actions, and not on their number, would scale well to real-world problems. Examples of systems fulfilling this property are the Kanerva coding system presented by [Kanerva, 1988] and the random representation method by [Sutton and Whitehead, 1993]. While those systems rely on large collections of fixed prototypes (i.e., combinations of feature detectors) selected at random, our proposal is to search for the appropriate prototypes, but using a strong bias so that the search can be performed in a reasonable time. This strong bias is based on the categorizability assumption, which is plausible for autonomous robots and allows a large speed-up in the learning process. Additionally, existing systems do not address the problem of determining the relevance of actions, since they assume the learning agent has a single actuator (which is, obviously, the only relevant one). This simple setup is not adequate for robotics. In our approach (presented below), combinations of both feature detectors and elementary actions are considered within a unified framework.
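As a rough sketch of the kind of prototype-based representation mentioned above (in the spirit of Kanerva coding, with randomly chosen binary prototypes activated by Hamming-distance proximity; the number of prototypes and the activation radius are arbitrary):

```python
import random

def make_prototypes(n_prototypes, n_fd, seed=0):
    # Fixed collection of random binary prototypes (combinations of feature detectors).
    rng = random.Random(seed)
    return [tuple(rng.randint(0, 1) for _ in range(n_fd)) for _ in range(n_prototypes)]

def active_prototypes(situation, prototypes, radius=2):
    # A prototype is active when the current situation lies within a small
    # Hamming distance of it; learning then operates on the active prototypes
    # rather than on individual states.
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return [i for i, p in enumerate(prototypes) if hamming(situation, p) <= radius]
```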