A partially observable Markov decision process (POMDP)
is a sequential decision model for an agent acting
in a stochastic environment with only partial knowledge of its state.
The set of possible states
of the environment is referred to as the
state space and is denoted by S.
At each point in time, the environment is in one of the
possible states. The agent does not directly observe the
state. Rather, it receives an
observation about it.
We denote the set of all possible observations by Z.
After receiving the observation, the agent chooses an action
from a set A of possible actions and executes
that action. Thereafter, the agent receives an immediate
reward, and the environment evolves stochastically into
a new state.
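To make this interaction pattern concrete, the following is a minimal Python sketch of one episode; the environment object env and the agent's policy are hypothetical placeholders standing in for the formal quantities defined next.

```python
# A minimal sketch of the POMDP interaction loop described above.
# `env` and `policy` are hypothetical placeholders: the environment
# hides its true state and returns only observations and rewards.

def run_episode(env, policy, horizon):
    """Run one episode of `horizon` steps and return the accumulated reward."""
    total_reward = 0.0
    observation = env.reset()                    # initial observation; the state stays hidden
    for _ in range(horizon):
        action = policy(observation)             # choose an action based on the observation
        observation, reward = env.step(action)   # receive a reward; environment moves on
        total_reward += reward
    return total_reward
```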
Mathematically, a POMDP is specified by: the three sets
S, Z, and A; a reward function r(s, a); a
transition probability function P(s' | s, a); and
an observation probability function P(z | s', a).
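Written compactly, these ingredients form a tuple; the ordering below is one common convention rather than a fixed standard:

$$\langle S,\; A,\; Z,\; r(s, a),\; P(s' \mid s, a),\; P(z \mid s', a) \rangle.$$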
The reward function characterizes the dependency of the immediate
reward on the current state s and the current action a.
The transition probability characterizes the dependency
of the next state s' on the current state s and the current
action a. The observation probability characterizes
the dependency of the observation z at the next time point on
the next state s' and the current action a.
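As a concrete instance of these three functions, the sketch below encodes the well-known tiger problem in Python. The numerical values (85% listening accuracy; rewards of -1 for listening, -100 for opening the tiger's door, and +10 for the other door) follow the standard formulation of that example, but the encoding as lists and plain functions is only one possible choice.

```python
# The classic tiger problem, encoded with the ingredients defined above.
# The state space S, action set A, and observation set Z are small finite
# sets; r, P_trans, and P_obs realize r(s, a), P(s' | s, a), and P(z | s', a).

S = ["tiger-left", "tiger-right"]           # where the tiger is hiding
A = ["listen", "open-left", "open-right"]   # the agent's choices
Z = ["hear-left", "hear-right"]             # noisy hints from listening

def r(s, a):
    """Immediate reward r(s, a)."""
    if a == "listen":
        return -1.0                         # listening has a small cost
    opened = "tiger-left" if a == "open-left" else "tiger-right"
    return -100.0 if s == opened else 10.0  # tiger vs. treasure behind the door

def P_trans(s_next, s, a):
    """Transition probability P(s' | s, a)."""
    if a == "listen":
        return 1.0 if s_next == s else 0.0  # listening leaves the state unchanged
    return 0.5                              # opening a door resets the tiger uniformly

def P_obs(z, s_next, a):
    """Observation probability P(z | s', a)."""
    if a != "listen":
        return 0.5                          # no information after opening a door
    correct = "hear-left" if s_next == "tiger-left" else "hear-right"
    return 0.85 if z == correct else 0.15   # listening is right 85% of the time
```

For example, r("tiger-left", "open-right") is 10.0 and P_obs("hear-left", "tiger-left", "listen") is 0.85; for any fixed arguments, P_trans sums to 1 over S and P_obs sums to 1 over Z, as probability functions must.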