|
Learning as Optimization
Name |
Learning as Optimization |
Description |
In settings where an agent is trying to learn how to act as reasonably as possible, the question arises how to improve its behaviour based on the experience it gains.
At each step the agent's behaviour is determined by a policy, a function mapping perceptions to actions. These actions lead the agent from one state to another. In addition, each of the agent's actions results in an immediate payoff that depends on its current state. Neither the payoffs nor the mapping from state/action pairs to the next state need to be assigned deterministically.
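The following minimal sketch illustrates this interaction; all names, states and probabilities are illustrative assumptions, not taken from the text. A policy maps perceptions to actions, and the environment maps the current state and the chosen action stochastically to a next state and an immediate payoff.

import random

def policy(perception):
    # example policy: a fixed action per perception
    return {"low": "wait", "high": "act"}.get(perception, "wait")

def environment_step(state, action):
    # stochastic environment: the next state and the payoff depend on the
    # current state and the chosen action, but are drawn at random
    if state == "high" and action == "act":
        return ("low", 1.0) if random.random() < 0.8 else ("high", 0.0)
    return ("high", 0.0) if random.random() < 0.5 else ("low", 0.0)

state = "low"
for _ in range(5):
    action = policy(state)                      # perception equals the state here
    state, payoff = environment_step(state, action)
    print(action, state, payoff)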
A learning algorithm changes an agent's policy driven by experience, in order to maximize its payoffs. The experience that may be used for improving the current policy is limited to the observed perceptions and payoffs. At the beginning the algorithm may have no information about the environment's structure or about the relation between actions and payoffs in the different states. Hence every observed perception/action/payoff triple can contribute to improving the policy.
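One way such improvement could look is sketched below; this is a simplified, hedged example rather than a specific algorithm from the text. A running average of the payoff per perception/action pair is maintained, and the policy acts greedily on these estimates while occasionally exploring. ACTIONS and EPSILON are assumed illustrative parameters.

import random
from collections import defaultdict

ACTIONS = ["wait", "act"]     # assumed action set
EPSILON = 0.1                 # assumed exploration rate
value = defaultdict(float)    # estimated payoff per (perception, action)
count = defaultdict(int)      # number of observations per pair

def learned_policy(perception):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)                            # explore
    return max(ACTIONS, key=lambda a: value[(perception, a)])    # exploit

def update(perception, action, payoff):
    # every observed perception/action/payoff triple refines the estimates
    count[(perception, action)] += 1
    n = count[(perception, action)]
    value[(perception, action)] += (payoff - value[(perception, action)]) / n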
The quality of a policy can be measured in different ways (a small evaluation sketch follows the list), e.g. as
- the sum of all payoffs, if the lifetime of the agent is finite,
- or as the average payoff.
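Both measures could be computed by simply running the policy for a finite number of steps, as in this sketch; it reuses the illustrative policy and environment_step names from above, which are assumptions rather than part of the text.

def evaluate(policy_fn, environment_step, start_state, steps):
    # run the policy for a finite lifetime and report both quality measures
    state, total = start_state, 0.0
    for _ in range(steps):
        action = policy_fn(state)
        state, payoff = environment_step(state, action)
        total += payoff
    return total, total / steps   # sum of payoffs and average payoff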
For a specific algorithm it is interesting how the quality of the policies changes over time. In particular, one is interested in guarantees that, when using a specific algorithm, the policies converge towards an optimum. |
Dm Step |
Reinforcement Learning
|
|
|