Q Learning

Publication Mitchell/97b: Machine Learning
Name Q Learning
Description

Q learning is a method addressing the task of Reinforcement Learning. The main idea is to approximate the function that assigns to each state-action pair the highest achievable discounted cumulative payoff.

The setting:

Let

  • S denote a finite set of states an agent can be in,
  • A denote a finite set of actions an agent may perform,
  • δ denote a transition function, mapping each state-action pair to a successor state,
  • r denote a reward function, mapping state-action pairs to payoffs,
  • and π denote an agent's policy, mapping each state to an action.

Here we use the discounted cumulative reward as a quality measure for a fixed policy. For a constant γ ∈ [0, 1) and a starting state s_0 this measure is defined as

  V(π) := Σ_{i=0}^∞ γ^i r_i,

with

  • s_i := δ(s_{i-1}, π(s_{i-1})) for all i ∈ ℕ, i > 0, and
  • r_i := r(s_i, π(s_i)) for all i ∈ ℕ.
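
To make the setting concrete, the following minimal Python sketch models a small deterministic environment as plain dictionaries and approximates V(π) for a fixed policy by truncating the infinite sum. The state names, the transition function delta, the reward function reward, the policy pi and the truncation horizon are hypothetical choices for illustration only.

    # Minimal sketch of the setting above; all names and values are hypothetical.
    GAMMA = 0.9  # discount factor gamma in [0, 1)

    # Finite sets of states and actions
    S = ["s0", "s1", "s2"]
    A = ["left", "right"]

    # Deterministic transition function delta: (state, action) -> successor state
    delta = {
        ("s0", "left"): "s0", ("s0", "right"): "s1",
        ("s1", "left"): "s0", ("s1", "right"): "s2",
        ("s2", "left"): "s1", ("s2", "right"): "s2",
    }

    # Reward function r: (state, action) -> payoff
    reward = {
        ("s0", "left"): 0.0, ("s0", "right"): 0.0,
        ("s1", "left"): 0.0, ("s1", "right"): 10.0,
        ("s2", "left"): 0.0, ("s2", "right"): 0.0,
    }

    # A fixed policy pi: state -> action
    pi = {"s0": "right", "s1": "right", "s2": "right"}

    def discounted_return(s0, pi, horizon=100):
        """Approximate V(pi) = sum_{i=0}^inf gamma^i * r_i by truncating the sum."""
        total, s = 0.0, s0
        for i in range(horizon):
            a = pi[s]
            total += (GAMMA ** i) * reward[(s, a)]
            s = delta[(s, a)]
        return total

    print(discounted_return("s0", pi))  # discounted cumulative reward starting in s0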

The method of Q Learning is based on the function Q, defined as follows:

Each state-action pair (s, a) is assigned the highest discounted cumulative reward achievable when state s is the agent's starting state and a is the first action performed. The function Q thus refers to an optimal policy. If we knew the highest possible discounted cumulative rewards for all possible successor states and actions, we could calculate the value for the current state and action as

  Q(s, a) := r(s, a) + γ max_{a'} Q(δ(s, a), a').

Of course, an optimal strategy is not known in advance. But it can be shown that the Q function can be approximated by continuously collecting locally observed payoffs and updating the approximation of Q with the equation above, even though the values for the successor states may have been initialized to arbitrary values.
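
If δ and r were known, the recurrence above could simply be iterated until the values settle. The following sketch does that for the hypothetical environment of the previous example; it illustrates the fixed-point character of the equation, not Q Learning itself, which needs no model of δ and r.

    # Iterating the recurrence Q(s, a) = r(s, a) + gamma * max_a' Q(delta(s, a), a')
    # with a known model; uses S, A, delta, reward and GAMMA from the sketch above.
    Q = {(s, a): 0.0 for s in S for a in A}

    for _ in range(200):  # enough sweeps for the values to settle
        for s in S:
            for a in A:
                s_next = delta[(s, a)]
                Q[(s, a)] = reward[(s, a)] + GAMMA * max(Q[(s_next, a2)] for a2 in A)

    print(Q[("s0", "right")])  # highest discounted cumulative reward for (s0, "right")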

The Algorithm of Q Learning:

  • Create a table with an entry for each element of S × A. The entries are referred to as Q'(s, a).
  • Initialize all table entries to 0.
  • Whenever, for a state s and action a, a reward r(s, a) and a successor state s' are observed, update Q'(s, a) as follows (see the sketch after this list):
    Q'(s, a) := r(s, a) + γ max_{a'} Q'(s', a')
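
A minimal sketch of this tabular update loop, reusing the hypothetical environment from the first example. The number of episodes, the episode length, the random starting states and the uniformly random action selection are illustrative assumptions; exploration strategies are discussed below.

    import random

    # Tabular Q Learning on the hypothetical environment above (S, A, delta, reward, GAMMA).
    Q_hat = {(s, a): 0.0 for s in S for a in A}  # table for Q'(s, a), initialized to 0

    for episode in range(500):          # illustrative number of episodes
        s = random.choice(S)            # illustrative choice of starting state
        for step in range(50):          # illustrative episode length
            a = random.choice(A)        # purely random exploration (see the discussion below)
            r_obs, s_next = reward[(s, a)], delta[(s, a)]
            # Update rule: Q'(s, a) := r(s, a) + gamma * max_a' Q'(s', a')
            Q_hat[(s, a)] = r_obs + GAMMA * max(Q_hat[(s_next, a2)] for a2 in A)
            s = s_next

    print(Q_hat[("s1", "right")])       # approximation of Q(s1, "right") after training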

It can be shown that, if each state-action transition is taken infinitely often, the approximation Q' converges to the correct function Q in this finite and deterministic setting. Please refer to the publication cited above for a proof.

The assumption that each state-action transition is taken infinitely often requires that the agent does not always stick to choosing the action with the best payoff according to its current approximation of Q, but assigns a certain weight to exploring its environment. A possible way of doing so is to define a probability distribution over the set of actions based on the current estimate of Q, e.g.

  P(a_i | s) = k^{Q'(s, a_i)} / Σ_j k^{Q'(s, a_j)}

for all a_i ∈ A and s ∈ S, where k > 1 is a constant that determines how strongly the selection favors actions with high Q' values. This distribution assigns more promising actions a much higher probability of being chosen, but nonetheless fulfills the requirement that each possible state-action transition is taken infinitely often (with probability 1).
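
A small sketch of this action-selection rule, reusing the names Q_hat and A from the previous sketches; the value chosen for the constant k is arbitrary and for illustration only.

    import random

    K = 2.0  # illustrative constant k > 1; larger k favors exploitation over exploration

    def choose_action(s, Q_hat):
        """Sample an action according to P(a_i | s) = k^Q'(s, a_i) / sum_j k^Q'(s, a_j)."""
        weights = [K ** Q_hat[(s, a)] for a in A]
        return random.choices(A, weights=weights, k=1)[0]

    # More promising actions are chosen more often, but every action keeps a
    # non-zero probability, so every transition keeps being explored.
    print(choose_action("s0", Q_hat))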

Dm Step Reinforcement Learning
Method Type Algorithm