The clever algorithms developed to solve MDPs cannot be directly
applied to NMRDPs. One way of dealing with this problem is to
translate the NMRDP into an equivalent MDP with an expanded state
space [2]. The expanded states in this MDP
(e-states, for short) augment the states of the NMRDP by
encoding additional information sufficient to make the reward
history-independent. For instance, if we only want to reward the very
first achievement of a goal $g$ in an NMRDP, the states of an equivalent
MDP would carry one extra bit of information recording whether $g$ has
already been true. An e-state can be seen as labelled by a state of
the NMRDP (via the function $\tau$ in Definition 1
below) and by history information. Since the dynamics of NMRDPs are
Markovian, the actions and their probabilistic effects in the MDP are
exactly those of the NMRDP. The following definition, adapted from
that given by Bacchus et al. [2],
makes this concept of equivalent MDP precise. Figure 2 gives an example.
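As an illustration of the one-extra-bit construction, the following sketch expands a finite NMRDP with a first-achievement reward. The dictionary-based encoding (`P` mapping a state-action pair to a distribution over successors) and the function `expand` are hypothetical, not from the paper; the extra bit records whether the goal $g$ has held at some strictly earlier stage, which is exactly the history information that makes the reward Markovian.

```python
# A hypothetical sketch of the expanded-state construction for the
# first-achievement reward.  `P` maps (state, action) to a distribution
# over successor states, given as a dict {successor: probability}.

def expand(P, s0, g):
    """Build e-states (s, b), where the extra bit b records whether the
    goal g has already held at some strictly earlier stage."""
    e_s0 = (s0, False)
    e_P = {}
    for (s1, a), dist in P.items():
        for b in (False, True):
            # The bit is updated as the transition is taken: it becomes
            # true once the predecessor state s1 satisfies the goal.
            e_P[((s1, b), a)] = {(s2, b or s1 == g): p
                                 for s2, p in dist.items()}
    # The reward is now history-independent: an e-state is rewarded
    # exactly when it is labelled by g and g has not held before.
    def e_R(e):
        s, b = e
        return 1.0 if s == g and not b else 0.0
    return e_s0, e_P, e_R
```

Note that each NMRDP state is duplicated at most once here; in general, the amount of history information needed, and hence the blow-up in the state space, depends on the reward formula.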
Definition 1
MDP $D' = \langle S', s'_0, A', \Pr', R' \rangle$ is equivalent
to NMRDP $D = \langle S, s_0, A, \Pr, R \rangle$ if there
exists a mapping $\tau : S' \rightarrow S$ such that:
1. $\tau(s'_0) = s_0$.
2. For all $s' \in S'$, $A'(s') = A(\tau(s'))$.
3. For all $s_1, s_2 \in S$, if there is $a \in A(s_1)$ such that
$\Pr(s_1, a, s_2) > 0$, then for all $s'_1 \in S'$ such that
$\tau(s'_1) = s_1$, there exists a unique $s'_2 \in S'$,
$\tau(s'_2) = s_2$, such that for all $a \in A'(s'_1)$,
$\Pr'(s'_1, a, s'_2) = \Pr(s_1, a, s_2)$.
4. For any feasible state sequences $\Gamma \in \tilde{D}(s_0)$ and
$\Gamma' \in \tilde{D}'(s'_0)$ such that $\tau(\Gamma'_i) = \Gamma_i$
for all $i$, we have $R'(\Gamma'_i) = R(\Gamma(i))$ for all $i$.
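For finite models, the first three conditions of the definition can be checked mechanically. The sketch below uses hypothetical dictionary encodings (`A`/`Ap` for applicable actions, `P`/`Pp` for transition distributions, primed variants for the MDP, `tau` as a dict) and, for simplicity, checks the uniqueness of the successor e-state separately for each action rather than across all actions at once.

```python
# A hypothetical checker for items 1-3 of Definition 1 on finite models.
# `A`/`Ap` map a state to its set of applicable actions; `P`/`Pp` map
# (state, action) to a dict of successor probabilities; `tau` is a dict.

def satisfies_items_1_to_3(s0, A, P, sp0, Sp, Ap, Pp, tau):
    # Item 1: the initial e-state is labelled by the NMRDP's initial state.
    if tau[sp0] != s0:
        return False
    # Item 2: an e-state admits exactly the actions of its label.
    if any(Ap[sp] != A[tau[sp]] for sp in Sp):
        return False
    # Item 3: every positive-probability NMRDP transition s1 -a-> s2 is
    # mirrored, from each e-state labelled s1, by a unique successor
    # e-state labelled s2, carrying the same probability.
    for sp1 in Sp:
        s1 = tau[sp1]
        for a in A[s1]:
            for s2, p in P[(s1, a)].items():
                if p <= 0:
                    continue
                succs = [sp2 for sp2, pp in Pp[(sp1, a)].items()
                         if tau[sp2] == s2 and pp > 0]
                if len(succs) != 1 or Pp[(sp1, a)][succs[0]] != p:
                    return False
    return True
```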
Items 1-3 ensure that there is a bijection between feasible state
sequences in the NMRDP and feasible e-state sequences in the MDP.
Therefore, a stationary policy for the MDP can be reinterpreted as a
non-stationary policy for the NMRDP. Furthermore, item 4 ensures that
the two policies have identical values, and consequently that solving
an NMRDP optimally reduces to producing an equivalent MDP and solving
it optimally [2].
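This reinterpretation can be made concrete: because the successor e-state in item 3 is unique, an NMRDP state-sequence prefix can be replayed to recover the e-state it corresponds to, and the stationary e-policy queried there. The encoding below (`Pp`, `tau`, `e_policy` as dicts) is a hypothetical sketch, not the paper's notation.

```python
# A hypothetical sketch: turning a stationary MDP policy into a
# non-stationary NMRDP policy by replaying the prefix through tau.

def nmrdp_policy(prefix, sp0, Pp, tau, e_policy):
    """Action chosen after the given NMRDP state-sequence prefix."""
    sp = sp0
    for s_next in prefix[1:]:
        a = e_policy[sp]
        # Item 3 of Definition 1 guarantees a unique positive-probability
        # successor e-state labelled by s_next.
        (sp,) = [sp2 for sp2, p in Pp[(sp, a)].items()
                 if tau[sp2] == s_next and p > 0]
    return e_policy[sp]
```

Two prefixes ending in the same NMRDP state may reach different e-states, so the induced NMRDP policy is in general non-stationary.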
Proposition 1
Let $D$ be an NMRDP, $D'$ an equivalent MDP for it, and $\pi'$ a policy for
$D'$. Let $\pi$ be the function defined on the state sequence prefixes
$\Gamma(i)$ by
$\pi(\Gamma(i)) = \pi'(\Gamma'_i)$, where
$\tau(\Gamma'_j) = \Gamma_j$ for all
$0 \leq j \leq i$. Then $\pi$ is a policy for $D$
such that
$V_{\pi}(s_0) = V_{\pi'}(s'_0)$.
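The value equality asserted by the proposition can be sanity-checked numerically over a finite horizon. The two value functions below sum expected rewards over the first few stages by straightforward recursion; the model used to exercise them (a small chain with the first-achievement reward) and all names are hypothetical illustrations, not the paper's.

```python
# A hypothetical finite-horizon check that a stationary MDP policy and
# the induced non-stationary NMRDP policy have equal values.

def mdp_value(sp, Pp, Rp, e_policy, horizon):
    """Expected Markovian reward over `horizon` transitions."""
    if horizon == 0:
        return Rp(sp)
    a = e_policy[sp]
    return Rp(sp) + sum(p * mdp_value(sp2, Pp, Rp, e_policy, horizon - 1)
                        for sp2, p in Pp[(sp, a)].items())

def nmrdp_value(prefix, P, R, policy, horizon):
    """Expected non-Markovian reward over `horizon` transitions; the
    reward R is evaluated on the whole state-sequence prefix."""
    if horizon == 0:
        return R(prefix)
    a = policy(prefix)
    return R(prefix) + sum(
        p * nmrdp_value(prefix + (s2,), P, R, policy, horizon - 1)
        for s2, p in P[(prefix[-1], a)].items())
```

For the first-achievement reward and its one-bit expansion, the two quantities agree at every horizon, as Proposition 1 requires.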
Figure 2:
An MDP Equivalent to the NMRDP in Figure
1. The initial state is the e-state that $\tau$ maps to the NMRDP's
initial state. One e-state is rewarded; the other states are not.
Sylvie Thiebaux
2006-01-20