• The standard reinforcement-learning model.
  • Comparing models of optimality. All unlabeled arrows produce a reward of zero. (The standard definitions of the optimality models compared here are recalled after this list.)
  • A Tsetlin automaton with 2N states. The top row shows the state transitions that are made when the previous action resulted in a reward of 1; the bottom row shows transitions after a reward of 0. In states in the left half of the figure, action 0 is taken; in those on the right, action 1 is taken. (A minimal code sketch of these transitions appears after this list.)
  • Architecture for the adaptive heuristic critic.
  • In this environment, due to Whitehead [130], random exploration would take a number of steps exponential in the size of the state space to reach the goal even once, whereas a more intelligent exploration strategy (e.g. ``assume any untried action leads directly to goal'') would require only a polynomial number of steps. (A minimal sketch of this heuristic appears after this list.)
  • A 3277-state grid world. This was formulated as a shortest-path reinforcement-learning problem, which yields the same result as if a reward of 1 is given at the goal, a reward of zero elsewhere, and a discount factor less than one is used. (A short sketch of this equivalence appears after this list.)
  • (a) A two-dimensional maze problem. The point robot must find a path from start to goal without crossing any of the barrier lines. (b) The path taken by PartiGame during the entire first trial. It begins with intense exploration to find a route out of the almost entirely enclosed start region. Having eventually reached a sufficiently high resolution, it discovers the gap and proceeds greedily towards the goal, only to be temporarily blocked by the goal's barrier region. (c) The second trial.
  • A structure of gated behaviors.
  • An example of a partially observable environment.
  • Structure of a POMDP agent.
  • Schaal and Atkeson's devil-sticking robot. The tapered stick is hit alternately by each of the two hand sticks. The task is to keep the devil stick from falling for as many hits as possible. The robot has three motors indicated by torque vectors τ₁, τ₂, and τ₃.
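For reference alongside the "comparing models of optimality" figure, the three optimality models being compared are usually defined as follows (standard definitions, restated here rather than quoted from the figure):

    finite-horizon:               E\left[ \sum_{t=0}^{h} r_t \right]
    infinite-horizon discounted:  E\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right], \quad 0 \le \gamma < 1
    average reward:               \lim_{h \to \infty} E\left[ \frac{1}{h} \sum_{t=0}^{h} r_t \right]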
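The transition rule described in the Tsetlin-automaton caption can be made concrete with a small sketch. The class name, the reward convention (1 = reward, 0 = penalty), and the depth bookkeeping below are illustrative choices of mine, not code from the survey.

    import random

    class TsetlinAutomaton:
        """Two-action Tsetlin automaton with 2N states (illustrative sketch).

        The automaton sits in one of two halves of N states each: the left
        half always emits action 0, the right half always emits action 1.
        `depth` measures distance from the boundary between the halves
        (0 = boundary state, N-1 = deepest state of the current half).
        """

        def __init__(self, N):
            self.N = N
            self.action = random.choice([0, 1])  # which half we start in
            self.depth = 0                       # boundary state of that half

        def act(self):
            return self.action

        def update(self, reward):
            if reward == 1:
                # Rewarded: move one state deeper into the current half
                # (top row of the figure), becoming more committed.
                self.depth = min(self.depth + 1, self.N - 1)
            elif self.depth > 0:
                # Penalized away from the boundary: retreat one state
                # toward the boundary (bottom row of the figure).
                self.depth -= 1
            else:
                # Penalized at the boundary: cross into the other half,
                # i.e. switch to the other action.
                self.action = 1 - self.action

    # Toy usage: a two-armed bandit in which action 1 pays off more often.
    auto = TsetlinAutomaton(N=4)
    for _ in range(1000):
        a = auto.act()
        r = 1 if random.random() < (0.8 if a == 1 else 0.2) else 0
        auto.update(r)
    print("converged to action", auto.act())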
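The exploration heuristic quoted in the Whitehead-environment caption ("assume any untried action leads directly to goal") amounts to optimistic action selection. The following is a minimal sketch under assumed bookkeeping of my own: a dictionary Q of value estimates keyed by (state, action), a set `tried` of state-action pairs already attempted, and a hypothetical `goal_value` parameter.

    def optimistic_greedy(Q, tried, state, actions, goal_value=1.0):
        # Untried actions are credited with the value of reaching the goal
        # immediately (goal_value), so a greedy choice prefers them until
        # each has been tried at least once.
        def estimate(a):
            return Q[(state, a)] if (state, a) in tried else goal_value
        return max(actions, key=estimate)

Because every untried action looks at least as good as the goal itself, the agent systematically tries each action once per state instead of wandering at random, which is what removes the exponential blow-up in this environment.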
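The equivalence asserted in the 3277-state grid-world caption can be sketched as follows (my restatement, not text from the survey): if a reward of 1 is received only on entering the goal and a discount factor 0 < \gamma < 1 is used, then a state s whose shortest path to the goal has length d(s) has optimal value

    V^{*}(s) = \gamma^{\,d(s)-1}

(up to the convention for discounting the first step). Since this is strictly decreasing in d(s), maximizing discounted return selects exactly the shortest path, so the two formulations yield the same optimal policy.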