Equivalence

The clever algorithms developed to solve MDPs cannot be directly applied to NMRDPs. One way of dealing with this problem is to translate the NMRDP into an equivalent MDP with an expanded state space [2]. The expanded states in this MDP (e-states, for short) augment the states of the NMRDP by encoding additional information sufficient to make the reward history-independent. For instance, if we only want to reward the very first achievement of goal $g$ in an NMRDP, the states of an equivalent MDP would carry one extra bit of information recording whether $g$ has already been true. An e-state can be seen as labelled by a state of the NMRDP (via the function $\tau$ in Definition 1 below) and by history information. Since the dynamics of NMRDPs are Markovian, the actions and their probabilistic effects in the MDP are exactly those of the NMRDP. The following definition, adapted from that given by Bacchus et al. [2], makes this concept of equivalent MDP precise. Figure 2 gives an example.
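To make the extra-bit idea concrete, here is a minimal Python sketch (not taken from [2]) contrasting the non-Markovian reward for the first achievement of $g$ with its Markovian counterpart on e-states. The names g_holds, R, and R_prime, and the convention that $g$ holds exactly in states labelled 'g', are illustrative assumptions.

```python
from typing import Sequence

State = str

def g_holds(s: State) -> bool:
    """Assumption: g holds exactly in NMRDP states labelled 'g'."""
    return s == "g"

def R(prefix: Sequence[State]) -> float:
    """Non-Markovian reward on a prefix Gamma(i): pay 1.0 only the first
    time g is achieved, so the value depends on the whole history."""
    last, earlier = prefix[-1], prefix[:-1]
    first_time = g_holds(last) and not any(g_holds(s) for s in earlier)
    return 1.0 if first_time else 0.0

def R_prime(s: State, g_seen: bool) -> float:
    """Markovian reward on an e-state (s, g_seen): the extra bit g_seen,
    recording whether g already held earlier, makes history unnecessary."""
    return 1.0 if g_holds(s) and not g_seen else 0.0
```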

Definition 1   MDP $D'=\langle S',s'_{0},A',\Pr',R'\rangle$ is equivalent to NMRDP $D=\langle S,s_{0},A,\Pr,R\rangle$ if there exists a mapping $\tau : S' \rightarrow S$ such that:

  1. $\tau(s'_{0}) = s_{0}.$

  2. For all $s' \in S'$, $A'(s')= A(\tau(s'))$.

  3. For all $s_1,s_2\in S$, if there is $a\in A(s_1)$ such that $\Pr(s_1,a,s_2) > 0$, then for all $s'_1\in S'$ such that $\tau(s'_1)= s_1$, there exists a unique $s'_2 \in S'$ with $\tau(s'_2) = s_2$, such that for all $a'\in A'(s'_1)$, $\Pr'(s'_1,a',s'_2)=\Pr(s_1,a',s_2)$.

  4. For any feasible state sequence $\Gamma \in \widetilde{D}(s_0)$ and $\Gamma' \in \widetilde{D}'(s'_0)$ such that $\tau(\Gamma'_{i}) = \Gamma_{i}$ for all $i$, we have: $R'(\Gamma'_{i}) = R(\Gamma(i))$ for all $i$.
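As a rough illustration of Definition 1 (not the construction used in [2]), the following sketch expands the 'reward the first achievement of $g$' NMRDP into an equivalent MDP: e-states pair an NMRDP state with the history bit, actions and transition probabilities are copied unchanged (items 2 and 3), and the reward becomes Markovian (item 4). The dictionary encoding of $\Pr$ and the names expand and reward_e are assumptions.

```python
from typing import Dict, Tuple

State, Action = str, str
ES = Tuple[State, bool]                         # e-state: (NMRDP state, "g already held" bit)
TransFn = Dict[Tuple[State, Action, State], float]
ETransFn = Dict[Tuple[ES, Action, ES], float]

def expand(Pr: TransFn, s0: State, goal: State) -> Tuple[ETransFn, ES]:
    """Build the transition function of an equivalent MDP by enumerating
    reachable e-states; tau((s, b)) = s and the initial e-state is (s0, False)."""
    e0: ES = (s0, False)
    Pr_e: ETransFn = {}
    frontier, seen = [e0], {e0}
    while frontier:
        s, b = frontier.pop()
        b_next = b or (s == goal)               # the history bit only ever switches on
        for (s1, a, s2), p in Pr.items():
            if s1 != s or p <= 0.0:
                continue
            e2: ES = (s2, b_next)               # unique successor labelled by s2 (item 3)
            Pr_e[((s, b), a, e2)] = p           # probabilities copied unchanged
            if e2 not in seen:
                seen.add(e2)
                frontier.append(e2)
    return Pr_e, e0

def reward_e(e: ES, goal: State) -> float:
    """Markovian reward on e-states matching the non-Markovian reward (item 4)."""
    s, b = e
    return 1.0 if s == goal and not b else 0.0
```

Applied to a two-state NMRDP such as the one in Figure 1, this kind of expansion produces two copies of each state, one per value of the history bit, matching the four e-states of Figure 2.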

Items 1-3 ensure that there is a bijection between feasible state sequences in the NMRDP and feasible e-state sequences in the MDP. Therefore, a stationary policy for the MDP can be reinterpreted as a non-stationary policy for the NMRDP. Furthermore, item 4 ensures that the two policies have identical values, and that consequently, solving an NMRDP optimally reduces to producing an equivalent MDP and solving it optimally [2]:

Proposition 1   Let $D$ be an NMRDP, $D'$ an equivalent MDP for it, and $\pi'$ a policy for $D'$. Let $\pi$ be the function defined on the sequence prefixes $\Gamma(i)\in \widetilde{D}(s_{0})$ by $\pi(\Gamma(i)) = \pi'(\Gamma'_i)$, where $\tau(\Gamma'_{j})=\Gamma_{j}$ for all $j\leq i$. Then $\pi$ is a policy for $D$ such that $V_{\pi}(s_0) = V_{\pi'}(s'_0)$.
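The following fragment sketches, under the same 'first achievement of $g$' encoding as above, how a stationary policy $\pi'$ on e-states induces the non-stationary NMRDP policy $\pi$ of Proposition 1: given a prefix, we track the unique e-state sequence $\Gamma'$ with $\tau(\Gamma'_{j})=\Gamma_{j}$ and apply $\pi'$ to its last e-state. The function track_estate and the dictionary representation of $\pi'$ are illustrative assumptions.

```python
from typing import Dict, Sequence, Tuple

State, Action = str, str
ES = Tuple[State, bool]           # e-state for the "first achievement of g" example

def track_estate(prefix: Sequence[State], goal: State = "g") -> ES:
    """Last e-state of the unique e-state sequence Gamma' with
    tau(Gamma'_j) = Gamma_j; items 1-3 guarantee existence and uniqueness."""
    g_seen = any(s == goal for s in prefix[:-1])   # did g hold strictly earlier?
    return (prefix[-1], g_seen)

def pi(prefix: Sequence[State], pi_prime: Dict[ES, Action]) -> Action:
    """Non-stationary NMRDP policy induced by the stationary MDP policy pi':
    pi(Gamma(i)) = pi'(Gamma'_i), as in Proposition 1."""
    return pi_prime[track_estate(prefix)]
```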

Figure 2: An MDP Equivalent to the NMRDP in Figure 1. $\tau (s'_0) = \tau (s'_2) = s_0$. $\tau (s'_1) = \tau (s'_3) = s_1$. The initial state is $s'_0$. State $s'_1$ is rewarded; the other states are not.