Learning as Optimization

In settings where an agent is trying to learn how to act most reasonably, the question arises of how it should improve its behaviour based on the experience it gathers.

At each step, the agent's behaviour is determined by a policy, a function mapping perceptions to actions. These actions lead the agent from one state to another. Each action also results in an immediate payoff, which depends on the agent's current state as well as on the action. Neither the payoffs nor the mapping from state/action pairs to the next state are necessarily deterministic.
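The interaction described above can be sketched as a short loop. This is a minimal illustration only: the two-state environment, the names `policy`, `step` and `run_episode`, and the probabilities and payoff values are all hypothetical, and the agent is assumed to perceive its state directly.

```python
import random

# Hypothetical two-state environment, invented for illustration.
STATES = ["left", "right"]
ACTIONS = ["stay", "move"]


def policy(perception):
    """A fixed policy: a function mapping perceptions to actions."""
    return "move" if perception == "left" else "stay"


def step(state, action, rng):
    """Return (next_state, payoff); both are drawn stochastically."""
    # Assumption: an attempted move succeeds only with probability 0.9.
    if action == "move" and rng.random() < 0.9:
        next_state = "right" if state == "left" else "left"
    else:
        next_state = state
    # Assumption: the payoff depends on state and action, plus noise.
    base = 1.0 if state == "right" else 0.0
    cost = 0.1 if action == "move" else 0.0
    payoff = base - cost + rng.gauss(0.0, 0.05)
    return next_state, payoff


def run_episode(n_steps, seed=0):
    """Follow the policy for n_steps and record the experience."""
    rng = random.Random(seed)
    state = "left"
    history = []
    for _ in range(n_steps):
        action = policy(state)  # the perception is the full state here
        next_state, payoff = step(state, action, rng)
        history.append((state, action, payoff))
        state = next_state
    return history
```

Calling `run_episode(20)` returns a list of perception/action/payoff triples, which is exactly the kind of experience a learning algorithm has to work with.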

A learning algorithm changes an agent's policy on the basis of experience in order to maximize its payoffs. The experience that can be used to improve the current policy is limited to observed perceptions and payoffs. At the beginning, the algorithm may have no information about the structure of the environment or about how actions relate to payoffs in different states, so every observed perception/action/payoff triple can contribute to improving the policy.

The quality of a policy can be measured in different ways, e.g. as

  • the sum of all payoffs, if the lifetime of the agent is finite, or
  • the average payoff.
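Both measures are simple to compute from a recorded sequence of payoffs. The following is a minimal sketch; the function names are hypothetical, and the average is assumed to be taken per step over the observed sequence.

```python
def total_payoff(payoffs):
    """Sum of all payoffs; meaningful when the lifetime is finite."""
    return sum(payoffs)


def average_payoff(payoffs):
    """Average payoff per step over the observed sequence."""
    if not payoffs:
        raise ValueError("average payoff is undefined for an empty sequence")
    return sum(payoffs) / len(payoffs)
```

For the payoffs `[1.0, 0.0, 2.0]`, the total is 3.0 and the average is 1.0. The average remains usable for comparing runs of different lengths, whereas the total grows with the lifetime.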

For a specific algorithm, it is of interest how the quality of its policies changes over time. In particular, one would like guarantees that the policies produced by the algorithm converge towards an optimum.