Reinforcement learning typically works by refining an estimate of expected future reward. In goal-directed tasks, such as the ones investigated here, this is equivalent to progressively improving the estimate of the distance to goal from each state. This estimate is updated using the best local action, i.e. the one that moves the robot, or arm, to the new state with the smallest estimated distance. Early in the learning process, only states close to the goal are likely to have accurate estimates of the true distance. Each time an action is taken, the estimate of the distance at the new state is used to update the estimate at the old state. Eventually this process propagates accurate estimates back from the goal to all other states.
Rather than directly estimating the distance to goal, the system uses
the expected discounted reward for each state ($V(s)$). The influence of rewards,
$r$, is reduced
progressively the farther into the future they occur by using a
discount factor $\gamma$ less than one. In this work, the only reward is for reaching
the goal, so the farther a state is from the goal, the smaller its
value. The use of an expectation allows the actions to be stochastic,
so when the robot, or arm, takes a particular action in a particular
state, the next state is not always the same.
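To make the relationship between distance and value concrete, consider (purely as an illustration) a deterministic task in which the only reward is 1 for the action that reaches the goal, and a reward received after $k$ actions is weighted by $\gamma^{k-1}$. A state $s$ whose shortest path to the goal takes $d(s)$ actions then has optimal value
$$V^{*}(s) = \gamma^{\,d(s)-1},$$
so values fall off geometrically with distance to the goal; with stochastic actions, the path length becomes a random variable and $V^{*}(s)$ the corresponding expectation.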
To carry out reinforcement learning, this research uses the Q-learning
algorithm (Watkins & Dayan 1992). This algorithm assumes the world is a
discrete Markov decision process, so both states and actions are discrete.
For each action a in each state s, Q-learning maintains a running
average of the immediate reward r plus the discounted maximum value of any
action a' in the next state s' (see Equation 1). The
action selected in each state is usually the one with the highest
Q-value. But to encourage exploration of the state space, this paper uses
an $\epsilon$-greedy policy (Sutton 1996), which chooses a random
action a fraction $\epsilon$
of the time. The only effect that
function composition has on the Q-learning algorithm is that the
initial value for each state-action pair is set to some value other
than zero.
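As a concrete illustration of the procedure just described, the following sketch implements tabular Q-learning with an $\epsilon$-greedy policy. The environment interface (reset, step, actions) and the initial_q argument are hypothetical stand-ins chosen for exposition, not the implementation used in this work; initializing the table entries to non-zero values is where a composed solution would be injected.

import random
from collections import defaultdict

# Tabular Q-learning with an epsilon-greedy policy (a minimal sketch).
# The environment interface (reset, step, actions) is a hypothetical
# stand-in; initial_q != 0 corresponds to seeding the table with a
# composed solution rather than starting from zero.
def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1,
               initial_q=0.0):
    Q = defaultdict(lambda: initial_q)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: explore with probability epsilon
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # move Q(s, a) toward r + gamma * max_a' Q(s', a')
            target = r if done else r + gamma * max(
                Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q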
The Q-function over state and action is usually referred to as the action-value function. In this paper, it is the action-value function that is transformed and composed to form a solution to the new task. The value function, discussed in previous sections and shown in the figures, is the maximum value of the Q-function. It is used to generate the partition and associated graphs needed to control the process.
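In symbols, $V(s) = \max_{a} Q(s, a)$: the value plotted in the figures is the best action value available in each state.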
Watkins and Dayan (1992) proved that Q-learning will
converge to the optimal value with certain constraints on the reward
and the learning rate $\alpha$. The optimal solution is produced by
taking the action with the greatest value in any state. So, for
goal-directed tasks, a greedy algorithm will take the shortest path to
the goal, once learning is complete. The extension to continuous
spaces may be done using function approximation. The simplest method,
and the one used here, is to divide the state dimensions into
intervals. The resulting action-value function has cells representing
the average Q-value of taking each action from somewhere within a
region of the state space. In off-line learning, where any action in
any state can be executed, this representation has been proven to
converge (Gordon 1995). In on-line learning, where the current
state is determined by the environment, this approach is generally
successful, but there exists no proof of its convergence.
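As an illustration of this interval-based discretization, the sketch below maps a continuous state vector to a tuple of cell indices that can be used to index the tabular action-value function. The bounds, interval counts, and function name are illustrative assumptions rather than the parameters used in the experiments reported here.

import numpy as np

# Map a continuous state vector to a tuple of cell indices, one index
# per dimension, by splitting each dimension into equal-width intervals.
# Bounds and interval counts below are illustrative, not those of the
# tasks studied here.
def discretize(state, lows, highs, intervals):
    cell = []
    for x, lo, hi, n in zip(state, lows, highs, intervals):
        frac = (np.clip(x, lo, hi) - lo) / (hi - lo)   # position in [0, 1]
        cell.append(min(int(frac * n), n - 1))         # interval index in [0, n-1]
    return tuple(cell)

# Example: a 2-D state in [-1, 1] x [0, 10] with 10 intervals per dimension.
# The Q-table is then indexed by (cell, action) instead of (state, action).
print(discretize([0.3, 7.2], lows=[-1, 0], highs=[1, 10], intervals=[10, 10]))
# -> (6, 7)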