The clever algorithms developed to solve MDPs cannot be directly applied to NMRDPs. One way of dealing with this problem is to translate the NMRDP into an equivalent MDP with an expanded state space [2]. The expanded states in this MDP (e-states, for short) augment the states of the NMRDP by encoding additional information sufficient to make the reward history-independent. For instance, if we only want to reward the very first achievement of goal $g$ in an NMRDP, the states of an equivalent MDP would carry one extra bit of information recording whether $g$ has already been true (this construction is sketched in code below). An e-state can be seen as labelled by a state of the NMRDP (via the function $\tau$ in Definition 1 below) and by history information. The dynamics of NMRDPs being Markovian, the actions and their probabilistic effects in the MDP are exactly those of the NMRDP. Definition 1 below, adapted from that given by Bacchus et al. [2], makes this concept of equivalent MDP precise; Figure 2 gives an example.
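To make the expansion concrete, here is a minimal sketch in Python of the construction for this example, under the assumption of a finite state set; all names (expand_first_achievement, prob, goal, and so on) are illustrative and not from the paper. The e-states pair an NMRDP state with one extra bit recording whether $g$ has already been true, the mapping $\tau$ is simply the first projection, the transition probabilities are copied from the NMRDP, and the (now Markovian) reward is paid only on the first achievement of $g$.

```python
# Minimal sketch (hypothetical names) of the state expansion for the
# "reward only the first achievement of goal g" example.
from itertools import product

def expand_first_achievement(states, prob, goal, s0, reward_value=1.0):
    """E-states are pairs (s, already), where `already` records whether g
    has been true at some strictly earlier state of the history.  The
    mapping tau of Definition 1 is the first projection (s, already) -> s."""
    e_states = list(product(states, (False, True)))
    e_s0 = (s0, False)              # nothing has been achieved before the start

    def e_prob(e1, a, e2):
        (s1, already1), (s2, already2) = e1, e2
        # Dynamics are exactly those of the NMRDP; the history bit is
        # updated deterministically and turns on once g has held.
        if already2 != (already1 or goal(s1)):
            return 0.0
        return prob(s1, a, s2)

    def e_reward(e):
        s, already = e
        # Markovian reward: pay only when g holds now and had never held before.
        return reward_value if goal(s) and not already else 0.0

    return e_states, e_s0, e_prob, e_reward
```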
Definition 1   MDP $D' = \langle S', s'_0, A', \Pr', R' \rangle$ is equivalent to NMRDP $D = \langle S, s_0, A, \Pr, R \rangle$ if there exists a mapping $\tau : S' \mapsto S$ such that:

1. $\tau(s'_0) = s_0$.
2. For all $s' \in S'$, $A'(s') = A(\tau(s'))$.
3. For all $s_1, s_2 \in S$, if there is $a \in A(s_1)$ such that $\Pr(s_1, a, s_2) > 0$, then for all $s'_1 \in S'$ such that $\tau(s'_1) = s_1$, there exists a unique $s'_2 \in S'$ with $\tau(s'_2) = s_2$, such that for all $a \in A'(s'_1)$, $\Pr'(s'_1, a, s'_2) = \Pr(s_1, a, s_2)$.
4. For any feasible state sequence $\Gamma$ of $D$ and $\Gamma'$ of $D'$ such that $\tau(\Gamma'_i) = \Gamma_i$ for all $i$, we have $R'(\Gamma'_i) = R(\Gamma(i))$ for all $i$.
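Items 1-3 are purely structural and, for finite models, can be checked directly from $\tau$ and the transition functions; the following is a minimal sketch under that assumption, with all names hypothetical. Item 4, which constrains the non-Markovian reward over entire feasible state sequences, is not checked here.

```python
# Hypothetical checker for items 1-3 of Definition 1 on finite models.
# S, S_p: NMRDP states and e-states; A, A_p: dicts of applicable actions;
# P, P_p: transition functions; tau: dict mapping e-states to NMRDP states.

def satisfies_items_1_to_3(S, S_p, A, A_p, P, P_p, s0, s0_p, tau):
    # Item 1: the initial e-state is labelled by the NMRDP's initial state.
    if tau[s0_p] != s0:
        return False
    # Item 2: an e-state admits exactly the actions of the state labelling it.
    if any(A_p[sp] != A[tau[sp]] for sp in S_p):
        return False
    # Item 3: every NMRDP transition s1 -a-> s2 with positive probability is
    # mirrored, from every e-state labelled s1, by a unique e-state labelled s2
    # reached with the same probability under every applicable action.
    for s1 in S:
        for s2 in S:
            if not any(P(s1, a, s2) > 0 for a in A[s1]):
                continue
            for s1_p in (sp for sp in S_p if tau[sp] == s1):
                matches = [s2_p for s2_p in S_p
                           if tau[s2_p] == s2
                           and all(P_p(s1_p, a, s2_p) == P(s1, a, s2)
                                   for a in A_p[s1_p])]
                if len(matches) != 1:
                    return False
    return True
```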
Items 1-3 ensure that there is a bijection between feasible state
sequences in the NMRDP and feasible e-state sequences in the MDP.
Therefore, a stationary policy for the MDP can be reinterpreted as a
non-stationary policy for the NMRDP. Furthermore, item 4 ensures that
the two policies have identical values, and that consequently, solving
an NMRDP optimally reduces to producing an equivalent MDP and solving
it optimally [2]:
Proposition 1   Let $D$ be an NMRDP, $D'$ an equivalent MDP for it, and $\pi'$ a policy for $D'$. Let $\pi$ be the function defined on the sequence prefixes $\Gamma(i)$ of the feasible state sequences of $D$ by $\pi(\Gamma(i)) = \pi'(\Gamma'_i)$, where $\tau(\Gamma'_j) = \Gamma_j$ for all $j \leq i$. Then $\pi$ is a policy for $D$ such that $V_{\pi}(s_0) = V_{\pi'}(s'_0)$.
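As a sketch of how Proposition 1 is used, the following hypothetical helper (reusing e_prob and e_s0 from the earlier sketch, with $\tau$ again the first projection) executes a stationary MDP policy on the NMRDP by tracking the e-state that labels the history observed so far; by the proposition, the value achieved equals that of the policy on the equivalent MDP.

```python
# Sketch (hypothetical names): executing a stationary policy pi_prime of the
# equivalent MDP on the NMRDP, as in Proposition 1.

def run_on_nmrdp(pi_prime, e_s0, e_states, e_prob, step, horizon):
    """`step(a)` applies action a in the NMRDP and returns the observed next
    NMRDP state; tau is the first projection of an e-state."""
    e = e_s0
    for _ in range(horizon):
        a = pi_prime(e)              # pi(Gamma(i)) = pi'(Gamma'_i)
        s_next = step(a)             # NMRDP transition, observed
        # Item 3 of Definition 1 guarantees a unique successor e-state
        # labelled by s_next and reachable from e.
        e = next(e2 for e2 in e_states
                 if e2[0] == s_next and e_prob(e, a, e2) > 0)
    return e
```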
Figure 2: An MDP equivalent to the NMRDP in Figure 1. (The figure labels each e-state with its NMRDP state via $\tau$, and indicates the initial e-state and the single rewarded e-state; the other e-states are not rewarded.)