Description
Q learning is a method addressing the task of Reinforcement Learning.
The main idea is to approximate the function that assigns to each
state-action pair the highest possible payoff.
The setting:
Let
- S denote a finite set of states an agent can be in,
- A denote a finite set of actions an agent may perform,
- δ denote a transition function, mapping each state-action pair
to a successor state,
- r denote a reward function, mapping state-action pairs to payoffs,
- and π denote an agent's policy, mapping each state to an action.
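To make these symbols concrete, a minimal deterministic environment of this kind could be written down as plain Python dictionaries. The two-state, two-action environment below is purely hypothetical and only serves as a running example for the sketches that follow:

    # Hypothetical two-state, two-action environment (illustration only).
    S = ["s0", "s1"]                 # finite set of states
    A = ["left", "right"]            # finite set of actions

    # delta: transition function, (state, action) -> successor state
    delta = {
        ("s0", "left"): "s0", ("s0", "right"): "s1",
        ("s1", "left"): "s0", ("s1", "right"): "s1",
    }

    # r: reward function, (state, action) -> payoff
    r = {
        ("s0", "left"): 0.0, ("s0", "right"): 1.0,
        ("s1", "left"): 0.0, ("s1", "right"): 2.0,
    }

    # pi: one fixed policy, state -> action
    pi = {"s0": "right", "s1": "right"}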
Here we use the discounted cumulative reward as a quality measure for
fixed policies.
For a constant γ ∈ [0, 1) and starting state s_0
this measure is defined as

  V^π(s_0) := ∑_{t=0}^{∞} γ^t · r_t

with
- s_t := δ(s_{t-1}, π(s_{t-1})) for all t ∈ ℕ, t > 0, and
- r_t := r(s_t, π(s_t)) for all t ∈ ℕ.
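Assuming the toy environment and the names S, A, delta, r, and pi from the sketch above, the discounted cumulative reward of a fixed policy can be estimated by truncating the infinite sum after a fixed number of steps, since the terms are damped by γ^t:

    def discounted_return(s0, pi, delta, r, gamma=0.9, horizon=100):
        """Approximate V^pi(s0) = sum_t gamma^t * r_t by truncating after `horizon` steps."""
        total, s = 0.0, s0
        for t in range(horizon):
            a = pi[s]                          # action prescribed by the policy
            total += (gamma ** t) * r[(s, a)]
            s = delta[(s, a)]                  # successor state
        return total

    print(discounted_return("s0", pi, delta, r))

Here gamma = 0.9 and horizon = 100 are arbitrary illustrative choices, not values fixed by the method.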
The method of Q Learning is based on the function Q, defined as
follows:
Each state-action pair (s, a) is assigned the highest
discounted cumulative reward achievable when state s is the agent's
starting state and a is the first action performed. Function Q thus
assumes that an optimal policy is followed afterwards. If we knew the highest possible discounted
cumulative rewards for all possible successor states and actions, then we could
compute the value for the current state and action as
  Q(s, a) := r(s, a) + γ · max_{a'} Q(δ(s, a), a')
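For instance, with purely illustrative numbers: if r(s, a) = 2, γ = 0.9, and the best value achievable from the successor state is max_{a'} Q(δ(s, a), a') = 10, then Q(s, a) = 2 + 0.9 · 10 = 11.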
Of course, an optimal policy is not known in advance. It turns out, however, that the
Q function can be approximated by continuously collecting locally observed
payoffs and by updating the approximation of Q
with the equation above, even though the entries for the successor states may
initially hold arbitrary values.
The Algorithm of Q Learning:
- Create a table with an entry for each element of S × A.
  The entries are referred to as Q'(s, a).
- Initialize all table entries to 0.
- If for a state s and action a a reward r(s, a)
  and a successor state s' are observed, then update Q'(s, a) as follows:
  Q'(s, a) := r(s, a) + γ · max_{a'} Q'(s', a')
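A compact sketch of this tabular procedure in Python, reusing the hypothetical S, A, delta, and r from above, might look as follows; it picks actions uniformly at random, which is one simple way to keep visiting every state-action transition:

    import random

    def q_learning(S, A, delta, r, gamma=0.9, episodes=1000, steps=50):
        # Table with one entry Q'(s, a) per element of S x A, initialized to 0.
        Q = {(s, a): 0.0 for s in S for a in A}
        for _ in range(episodes):
            s = random.choice(S)                       # arbitrary starting state
            for _ in range(steps):
                a = random.choice(A)                   # explore by acting randomly
                s_next = delta[(s, a)]                 # observed successor state
                # Update rule: Q'(s, a) := r(s, a) + gamma * max_a' Q'(s', a')
                Q[(s, a)] = r[(s, a)] + gamma * max(Q[(s_next, a2)] for a2 in A)
                s = s_next
        return Q

    Q_approx = q_learning(S, A, delta, r)

The function name and the choice of 1000 episodes of 50 steps are illustrative assumptions rather than part of the original description.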
It can be shown that, if each state-action transition is visited
infinitely often, the approximation Q' converges to the
true function Q in this finite and deterministic setting.
Please refer to the bibliography link above for a proof.
The assumption of visiting each state-action transition infinitely
often demands that the agent does not stick to always choosing the
action with the highest payoff according to its current
approximation of Q, but assigns a certain weight to
exploring its environment. A possible way of doing so is to
define a probability distribution over the set of actions based
on the current estimate of Q, e.g.
  P(a_i | s) = k^(Q'(s, a_i)) / ∑_j k^(Q'(s, a_j))
for all a_i ∈ A and s ∈ S, where k > 0 is a constant.
This distribution assigns more promising actions a much higher
probability of being chosen, but it nonetheless fulfills the requirement
of visiting each possible state-action transition infinitely often.
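A simple way to realize such a distribution in Python, assuming the table Q_approx and the action set A from the sketches above and an arbitrary constant k > 1, is the following; larger values of k favor exploitation, values closer to 1 favor exploration:

    import random

    def choose_action(s, Q, A, k=2.0):
        # P(a_i | s) = k^Q'(s, a_i) / sum_j k^Q'(s, a_j)
        weights = [k ** Q[(s, a)] for a in A]
        return random.choices(A, weights=weights)[0]

    a = choose_action("s0", Q_approx, A)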