Complexity of Policy Iteration and Related Issues

*Eyal Even-Dar*

On the complexity of policy iteration

A Subexponential Randomized Algorithm for the Simple Stochastic Game Problem

Definitions and Preliminary Results

I have tried to improve the upper bound on policy iteration. Since I failed in this task, I will give here a brief overview of the subject: I will present two related papers and explain each of them. At the beginning I will give the definition of the problem and explain its importance. Both papers deal with similar problems; one deals exactly with our problem, which will be defined later, and the second with a closely related one. Both give non-trivial upper bounds for their problems (MDPs and stochastic games, respectively). Therefore, I will give the definitions related to MDPs here, while the definitions related to stochastic games will be given in the summary of the second paper.

After presenting both results I will briefly describe what I have tried to do and compare the results.

Decision-making problems in a stochastic environment are often described by an MDP (Markov Decision Process). An MDP is a tuple (S, A, P, R): S is a finite set of states; A is a finite set of actions; P is the table of transition probabilities, which depends on S and A ($P(s'|s,a)$); and R is the reward function over S and A ($R(s,a)$). The agent's return (policy return) is defined to be the discounted sum of rewards over an infinite horizon, $\sum_{t=0}^{\infty}\gamma^{t}r_{t}$, or $\lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}r_{t}$ in the undiscounted (average-reward) case.

The discount factor $\gamma$ should be in the interval $[0,1)$.

Many algorithms, such as policy iteration and value iteration, try to maximize the agent's return.

Policy iteration is a method of finding a policy that achieves the optimal agent's return by searching over the policy space. I will give a few more definitions here in order to define the policy iteration method precisely.

$V^{\pi}(s)$ is defined to be the expected return of policy $\pi$ starting from state s, and $Q^{\pi}(s,a)$ to be the expected return if the start state is s and the agent performs action a first and then follows policy $\pi$. The agent's goal is to find the best policy $\pi^{*}$, which satisfies $V^{\pi^{*}}(s) \ge V^{\pi}(s)$ for every policy $\pi$ and every state s. The optimal value is given by $V^{*}(s) = \max_{\pi} V^{\pi}(s)$ and the associated Q value is $Q^{*}(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^{*}(s')$.

A policy $\pi$ is said to be better than a policy $\pi'$ ($\pi > \pi'$) if and only if $V^{\pi}(s) \ge V^{\pi'}(s)$ for every state s and $V^{\pi}(s) > V^{\pi'}(s)$ for some state s. If $V^{\pi}(s) = V^{\pi'}(s)$ for every state s, we say that $\pi = \pi'$.

At each iteration, we consider changing the policy at each state while keeping the actions of all other states fixed to the current policy; some such single-state changes will improve the current policy. Different policy iterations differ in the way in which they accept single-state changes. For instance, sequential PI accepts only one state change in each iteration. Using the function $Q^{\pi}$, we can find all the single-state changes that improve the current policy $\pi$. A general policy iteration chooses some of these states and changes the policy at them.
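The improvement test and the policy change can be sketched as follows (a toy fragment; the value tables below are invented numbers, not computed from a real MDP):

```python
# Sketch: the improving set and the policy change, given (hypothetical)
# precomputed values V[s] = V^pi(s) and Q[(s, a)] = Q^pi(s, a).
def improving_set(V, Q):
    """All pairs (s, a) with Q^pi(s, a) > V^pi(s)."""
    return {(s, a) for (s, a) in Q if Q[(s, a)] > V[s] + 1e-12}

def modify(pi, U):
    """New policy: same as pi except at the states appearing in U."""
    new_pi = dict(pi)
    for s, a in U:
        new_pi[s] = a
    return new_pi

# toy numbers, invented for illustration
V = {0: 0.0, 1: 20.0}
Q = {(0, 0): 0.0, (0, 1): 19.0, (1, 0): 9.0, (1, 1): 20.0}
pi = {0: 0, 1: 1}
T = improving_set(V, Q)      # here only (0, 1) improves the policy
```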

A polynomial upper bound, in the number of states and in $\frac{1}{1-\gamma}$, is known for value iteration. Therefore policy iteration, which always performs at least as well as value iteration, is also polynomial in the number of states and in $\frac{1}{1-\gamma}$. (I will not prove it, because value iteration is hardly mentioned here.)

For a certain policy iteration, sequential policy iteration, which improves only one state in each iteration, an exponential lower bound is known. For general policy iteration there was no upper bound except the trivial $O(2^{n})$ (with two actions per state) before the following paper came out.

A description of both papers will now be given.

Yishay Mansour and Satinder Singh

(http://www.math.tau.ac.il/~mansour/cv.htm)

Decision-making problems in a stochastic environment are often described by an MDP (Markov Decision Process). The paper tries to give an upper bound for policy iteration, an algorithm that searches for the best policy over the policy space.

The paper makes some restrictions, without loss of generality, such as |A| = 2. Thus the total number of policies is $2^{n}$ (n = |S|). While describing the paper, I will use the definitions given in the introduction.

While dealing with general policy iteration, the following notation will be very important. Given a policy $\pi$, let $T_{\pi}$ be the set of all pairs (s, a) such that changing the action of $\pi$ at s to a improves the return of the policy, i.e. $Q^{\pi}(s,a) > V^{\pi}(s)$. We define **states**($T_{\pi}$) to be the states that appear in $T_{\pi}$.

For a given policy $\pi$ and a set U $\subseteq T_{\pi}$, **modify**($\pi$, U) defines a policy $\pi'$ whose actions are the same as those of $\pi$, except at the states appearing in U.

*Theorem 1:*

For any U $\subseteq T_{\pi}$, let $\pi'$ = **modify**($\pi$, U). If U is not empty, then $\pi' > \pi$.

One way of proving it is to evaluate the new policy starting from the value vector achieved by the former policy: the first application of the Bellman operator of $\pi'$ to $V^{\pi}$ can only increase the values, and by induction every further application keeps them at least as high. Therefore the new value vector dominates the former one.

*Theorem 2:*

For any suboptimal policy $\pi$, $T_{\pi}$ is not empty.

One can look again at the relation between value iteration and policy iteration to obtain the desired result.

These two theorems ensure that each iteration of PI strictly improves the current policy. Furthermore, they ensure that the algorithm stops only upon finding the best policy (Theorem 2). They also give the trivial upper bound of $2^{n}$ iterations, because each step rules out at least one policy, namely the current one. In the following steps the paper shows how we can rule out more policies at each step.

The following lemmas will help us prove that we rule out more policies at each iteration:

*Lemma 3:*

Let $\pi$ and $\pi'$ be two policies which differ in their actions at only one state. Then either

$\pi > \pi'$, $\pi < \pi'$, or $\pi = \pi'$.

Proof

Let s be the state where they differ. If $Q^{\pi}(s,\pi'(s)) > V^{\pi}(s)$, then $\pi' > \pi$ by Theorem 1. If $Q^{\pi}(s,\pi'(s)) < V^{\pi}(s)$, then by the symmetric argument $\pi' < \pi$. Otherwise $\pi' = \pi$.

*Lemma 4:*

During a run of general policy iteration, there are no i < j such that **states**($T_{\pi_{i}}$) $\subseteq$ **states**($T_{\pi_{j}}$).

The paper first proves the following property: for any policy $\pi$ and any policy $\pi'$ that is identical to $\pi$ at every state in **states**($T_{\pi}$), either $\pi' < \pi$ or $\pi' = \pi$. For a proof, we look at an MDP M' which has the same actions as those in M except at the states in **states**($T_{\pi}$), where the actions are restricted to those of policy $\pi$. In M', $\pi$ is an optimal policy, since its improvement set in M' is empty (Theorem 2). Every policy of M', and in particular $\pi'$, therefore satisfies $\pi' < \pi$ or $\pi' = \pi$.

Using the property, the paper proves the lemma by contradiction. Assume there are i < j with **states**($T_{\pi_{i}}$) $\subseteq$ **states**($T_{\pi_{j}}$).

Let U' = {(s, a) : a = $\pi_{i}$(s), s $\in$ **states**($T_{\pi_{j}}$) and $\pi_{j}$(s) $\ne$ a}.

It is obvious that U' $\subseteq T_{\pi_{j}}$, since there are only two actions. Define $\pi'$ = **modify**($\pi_{j}$, U'); $\pi'$ is identical to $\pi_{i}$ on **states**($T_{\pi_{j}}$), and in particular on **states**($T_{\pi_{i}}$). Therefore, the property that we have proven before ensures that $\pi' \le \pi_{i}$. If U' is empty this already gives $\pi_{j} \le \pi_{i}$, contradicting $\pi_{j} > \pi_{i}$; and if U' is not empty it contradicts Theorem 1, by which

$\pi' > \pi_{j} > \pi_{i}$.

The lemmas above only assure us that every subset appears at most once, which is not enough to give a better upper bound. Therefore, we will show that each modify on a large subset rules out many policies.

The paper defines a special case of policy iteration, *greedy policy iteration*, in which **select**(T) = T, i.e. all possible single-state action improvements are chosen.
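As a sketch of the greedy rule, the loop below runs **select**(T) = T on a hypothetical three-state, two-action MDP; the instance is mine, only the update rule follows the text.

```python
# Sketch of greedy policy iteration, select(T) = T: at every iteration
# switch *all* states that have an improving action.  The MDP below is
# hypothetical: a three-state cycle where only staying in state 2 pays.
GAMMA = 0.9
S, A = [0, 1, 2], [0, 1]
P = {0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
     1: {0: [(1, 1.0)], 1: [(2, 1.0)]},
     2: {0: [(2, 1.0)], 1: [(0, 1.0)]}}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 0.0}, 2: {0: 1.0, 1: 0.0}}

def evaluate(pi, iters=3000):
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        V = {s: R[s][pi[s]] + GAMMA * sum(p * V[t] for t, p in P[s][pi[s]])
             for s in S}
    return V

def greedy_pi(pi):
    while True:
        V = evaluate(pi)
        T = {(s, a) for s in S for a in A
             if R[s][a] + GAMMA * sum(p * V[t] for t, p in P[s][a])
             > V[s] + 1e-9}
        if not T:                      # Theorem 2: pi is optimal
            return pi, V
        for s, a in T:                 # modify(pi, T)
            pi[s] = a

pi_star, V_star = greedy_pi({0: 0, 1: 0, 2: 1})
# converges to the policy that routes 0 -> 1 -> 2 and then stays in 2
```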

The next lemma shows that at each iteration the algorithm rules out at least |T| policies.

*Lemma 5:*

Let $\pi$ be a policy and $\pi'$ = **modify**($\pi$, $T_{\pi}$). Then there are at least $|T_{\pi}|$ policies $\tilde{\pi}$ such that $\tilde{\pi} \le \pi'$.

The proof is done by induction on $|T_{\pi}|$. For $|T_{\pi}| = 1$ the claim is given by Theorem 1.

Look at all the single-state improvements: for each pair $\sigma_{i} = (s_{i}, a_{i}) \in T_{\pi}$, let $\pi_{i}$ = **modify**($\pi$, $\{\sigma_{i}\}$).

There is at least one pair, say $\sigma_{1}$, such that for every other pair $\sigma_{i}$, **modify**($\pi$, $\{\sigma_{1}\}$) is not better than **modify**($\pi$, $\{\sigma_{i}\}$) ($\sigma_{1}$ is not necessarily unique).

Let $\pi_{1}$ = **modify**($\pi$, $\{\sigma_{1}\}$). We will show that every other pair $\sigma_{i}$ belongs to $T_{\pi_{1}}$. Let $\hat{\pi}$ = **modify**($\pi_{1}$, $\{\sigma_{i}\}$), and note that $\hat{\pi}$ = **modify**($\pi_{i}$, $\{\sigma_{1}\}$) as well. From Lemma 3 we know that either $\hat{\pi} > \pi_{1}$ or $\hat{\pi} \le \pi_{1}$, and we must rule out the latter, so assume for contradiction that $\hat{\pi} \le \pi_{1}$. Since $\hat{\pi}$ and $\pi_{i}$ also differ in only one state, by Lemma 3 either $\hat{\pi} > \pi_{i}$ or $\hat{\pi} \le \pi_{i}$.

If $\hat{\pi} > \pi_{i}$, then $\pi_{1} \ge \hat{\pi} > \pi_{i}$, contradicting the fact that $\pi_{1}$ is not better than the other single-state improvements. If $\hat{\pi} \le \pi_{i}$, then undoing $\sigma_{i}$ (which returns $\hat{\pi}$ to $\pi_{1}$) and undoing $\sigma_{1}$ (which returns $\hat{\pi}$ to $\pi_{i}$) are both improving changes of $\hat{\pi}$, so by Theorem 1 applying both, which returns $\hat{\pi}$ to $\pi$, gives $\pi > \hat{\pi}$; this contradicts $\hat{\pi} > \pi$, which also follows from Theorem 1 because $\{\sigma_{1}, \sigma_{i}\} \subseteq T_{\pi}$.

It follows that $\sigma_{i} \in T_{\pi_{1}}$ for every i > 1. Thus $|T_{\pi_{1}}| \ge |T_{\pi}| - 1$, and the lemma follows from the induction hypothesis.

Using both results, for large sets and for small sets, we can derive an upper bound of $O(\frac{2^{n}}{n})$. The proof divides the sets into two kinds: small sets, whose size is less than n/3, and large sets. The number of small sets is bounded by the fact that a set cannot be repeated; the number of large sets is bounded by the fact that each set rules out at least as many policies as its size. Both counts can be bounded by $O(\frac{2^{n}}{n})$.

To improve this upper bound the paper uses a different type of policy iteration, randomized policy iteration. Randomized policy iteration chooses a random subset of T, each subset having probability $2^{-|T|}$. This will enable us to rule out, in expectation, a number of policies exponential in |T|, instead of only |T| as in the greedy policy iteration, at each iteration.
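The selection rule can be sketched as follows; keeping each pair independently with probability 1/2 gives every subset of T probability $2^{-|T|}$ (the resampling of the empty set is my own detail, since Theorem 1 needs a non-empty U):

```python
import random

# Sketch of the randomized selection rule: each pair of T is kept
# independently with probability 1/2, so every subset of T has
# probability 2^(-|T|).  If the empty set comes up, resample.
def random_subset(T, rng):
    while True:
        U = {pair for pair in T if rng.random() < 0.5}
        if U:
            return U

rng = random.Random(0)
T = {(0, 1), (1, 0), (2, 1)}
U = random_subset(T, rng)    # a non-empty random subset of T
```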

The proof is based on the fact that each new improved policy $\pi'$ = **modify**($\pi$, U) rules out not only the policies $\pi''$ with $\pi'' \le \pi'$, but every policy $\pi''$ = **modify**($\pi$, U') that is comparable to $\pi'$ and not above it. Many of the policies **modify**($\pi$, U') are comparable to $\pi'$; in particular this is the case when U $\subseteq$ U' or U' $\subseteq$ U.

To estimate the number of "disqualified" policies, the next lemma is given:

*Lemma 7:*

Let $\prec$ be a partial order over a set W. If we choose a random element r $\in$ W with uniform probability, then the expected number of elements s $\in$ W such that s $\succ$ r is at most $\frac{|W|}{2}$.

Proof

For any element v $\in$ W we associate two sets: $A_{v}$ includes all the elements s such that s $\succ$ v, and $B_{v}$ includes all the elements s such that s $\prec$ v. Every comparable pair u $\prec$ w contributes w to $A_{u}$ and u to $B_{w}$; it implies that $\sum_{v}|A_{v}| = \sum_{v}|B_{v}|$. Since $|A_{v}| + |B_{v}| \le |W| - 1$ for every v, we get $\sum_{v}|A_{v}| \le \frac{|W|(|W|-1)}{2}$. Therefore the expected value of $|A_{r}|$ is at most $\frac{|W|-1}{2}$.
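The lemma is easy to verify by brute force on a concrete partial order; the sketch below uses proper divisibility on {1, …, 12} (an example of my own choosing) and checks the bound exactly:

```python
from fractions import Fraction

# Brute-force check of Lemma 7: take W = {1, ..., 12} ordered by proper
# divisibility (s < r iff s divides r and s != r), pick r uniformly,
# and compute the exact expected number of elements below r.
W = range(1, 13)

def below(r):
    return sum(1 for s in W if s != r and r % s == 0)

expectation = Fraction(sum(below(r) for r in W), len(W))
# by the symmetry in the proof, the same count bounds the elements above r
assert expectation <= Fraction(len(W), 2)
```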

From this lemma we obtain the following: let $\pi_{i}$ be the policy at iteration i, and let $\pi_{i+1}$ = **modify**($\pi_{i}$, U) be the randomly chosen policy. Among the policies of the form **modify**($\pi_{i}$, U') that are comparable to $\pi_{i+1}$, at most half (in expectation) lie above it, so at least half of them are ruled out.

To achieve the final result, the sets are again (as in the greedy policy iteration) divided into two groups, the "small sets" and the "large sets"; this time the threshold is not n/3 but a more complicated value. The fact that each large set now contributes more than in the greedy policy iteration enables a better upper bound, roughly $O(1.7^{n})$ expected iterations.

The paper thus gives two upper bounds that improve the formerly known one, for two special policy iterations, one greedy and one randomized. One can see that the power of randomization leads to an exponential bound with a smaller base.

Walter Ludwig

The paper presents an algorithm which addresses the problem of finding an optimal strategy for a simple stochastic game in subexponential time. First we define the problem. A simple stochastic game is a directed graph that has three types of vertices, max, min and average, along with two sink vertices, the 0-sink and the 1-sink. One vertex is the start vertex.

The game is a contest between two players: player 1 moves while the token is on a max vertex, and player 0 while the token is on a min vertex. On an average vertex the move is determined by a toss of a fair coin. A move is taking an outgoing edge from the current vertex. Player 1 tries to maximize the probability that the token reaches the 1-sink; player 0 tries to minimize it.

A brief look at the algorithm will be given here, and it will be analyzed precisely later.

The algorithm works as follows: it chooses a random edge of the strategy, fixes it, and recursively finds the best strategy that contains it; if the fixed edge turns out to be wrong, it switches it and repeats. Done deterministically, this would be an exhaustive search with backtracking, taking $O(2^{d})$ time, where d is the number of max vertices.

The paper then presents some preliminary results, which are necessary for this proof, together with the definitions of the stochastic game.

*Definition 1:*

A simple stochastic game is a directed graph G = (V, E) with three sets of vertices, X, N and A (max, min and average), and two special vertices, the 0-sink and the 1-sink.

*Definition 2:*

A strategy $\sigma$ for player 1 is a bit vector $(\sigma_{1}, \dots, \sigma_{d})$, where $\sigma_{i}$ is the label of the edge chosen at max vertex i (each max vertex has two outgoing edges, labelled 0 and 1). A strategy $\tau$ for player 0 is defined similarly.

Let $G_{\sigma\tau}$ be the graph defined by both strategies (i.e. the graph containing, at max and min vertices, only the edges chosen by $\sigma$ and $\tau$).

*Definition 3:*

The value of vertex i, $v_{\sigma\tau}(i)$, is the probability that player 1 wins from vertex i when the strategies $\sigma, \tau$ are played.
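With both strategies fixed, the values solve a simple fixed-point system: the sinks have values 0 and 1, an average vertex averages its two successors, and max/min vertices just follow their single remaining edge. A minimal sketch on a hypothetical game with two average vertices:

```python
# Sketch: vertex values in G_{sigma,tau}.  Once both strategies are
# fixed, only average vertices and sinks remain interesting.  The tiny
# game below is hypothetical:
#   'a' is an average vertex between the two sinks,
#   'b' is an average vertex between 'a' and the 1-sink.
succ = {'a': ('0-sink', '1-sink'), 'b': ('a', '1-sink')}
v = {'0-sink': 0.0, '1-sink': 1.0, 'a': 0.0, 'b': 0.0}
for _ in range(200):                   # iterate to the fixed point
    for i, (j, k) in succ.items():
        v[i] = 0.5 * (v[j] + v[k])
# here v['a'] = 1/2 and v['b'] = (1/2 + 1) / 2 = 3/4
```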

*Definition 4:*

A max vertex i with outgoing edges (i, j) and (i, k) is stable with respect to strategies $\sigma, \tau$ if

$v_{\sigma\tau}(i) = \max(v_{\sigma\tau}(j), v_{\sigma\tau}(k))$ (similarly, with min in place of max, for a min vertex).

*Definition 5:*

Let $\sigma, \tau$ be a pair of strategies for players 1 and 0. The strategy $\tau$ is said to be optimal with respect to $\sigma$ if every min vertex is stable with respect to $\sigma, \tau$.

*Lemma 1 (Derman):*

Let H be a simple stochastic game with no max vertices that halts with probability 1. Then an optimal strategy for player 0 (with respect to the trivial player 1 strategy) can be found by linear programming, in time polynomial in the number of vertices.

*Definition 6:*

Let $\sigma, \tau$ be a pair of strategies. These strategies are said to be optimal if each one of them is optimal with respect to the other.

*Lemma 3:*

Given a simple stochastic game G, we can construct a new game G’ in time polynomial in the size of G such that G’ has the same number of min and max vertices as G, the value of G’ is greater than ½ if and only if the value of G is greater than ½, and G’ halts with probability 1.

*Definition 8:*

For a strategy $\sigma$ of player 1, let $\tau_{\sigma}$ be an optimal player 0 strategy with respect to $\sigma$, write $v_{\sigma}(i) = v_{\sigma\tau_{\sigma}}(i)$, and define the function $h(\sigma) = \sum_{i \in V} v_{\sigma}(i)$.

We define a switch to be changing the strategy of player 1 at a single vertex.

A switch is profitable for player 1 if $h(\sigma') > h(\sigma)$, where $\sigma'$ is the strategy after the switch.

(The following lemma connects switches with our definition of optimal strategies.)

*Lemma 4:*

Let G = (V, E) be a simple stochastic game that halts with probability one, and let $\sigma$ be a strategy for player 1 that is not optimal. Let i $\in$ X be a vertex that is unstable with respect to $\sigma, \tau_{\sigma}$. Let $\sigma'$ be the strategy that is obtained from $\sigma$ by changing the choice at vertex i. Then for all j $\in$ V, $v_{\sigma'}(j) \ge v_{\sigma}(j)$, and for some j $\in$ V, $v_{\sigma'}(j) > v_{\sigma}(j)$.

The input of the algorithm is a game G = (V, E) that halts with probability 1 and an initial strategy $\sigma$ for player 1. The output is a pair of optimal strategies ($\sigma, \tau$) for both players.

1. Choose uniformly at random a vertex s $\in$ X (a max vertex).

2. Construct a new game $G_{s}^{\sigma}$ as follows: delete the vertex s, and redirect every edge (j, s) entering s into an edge (j, k), where (s, k) is the outgoing edge of s chosen by $\sigma$. (This fixes the choice of $\sigma$ at the chosen vertex and removes the vertex from the graph, leaving a game with d - 1 max vertices.)

3. Recursively apply the algorithm to the game $G_{s}^{\sigma}$ and the induced player 1 strategy to find an optimal player 1 strategy for $G_{s}^{\sigma}$. Extend it to a strategy $\sigma'$ for G by setting $\sigma'_{s} = \sigma_{s}$.

4. Find an optimal strategy $\tau$ for player 0 with respect to $\sigma'$. If the pair ($\sigma', \tau$) is optimal, return ($\sigma', \tau$). Otherwise set $\sigma_{k} = \sigma'_{k}$ for every k different from s, set $\sigma_{s} = 1 - \sigma'_{s}$, and go back to step 1.

The validity of the algorithm is proved by showing that every switch it makes is profitable. The proof is done by induction on d (the number of max vertices). If d = 1, then at most one switch is required, moving from a strategy that is not optimal to the one that is. Suppose d > 1 and let $\sigma$ be the strategy at the top level. By the induction hypothesis, after choosing s, $\sigma'$ is reached by some sequence of profitable switches, and it is the best strategy for player 1 that agrees with $\sigma$ at vertex s. Therefore, all the max vertices except s are stable, and if $\sigma'$ is not optimal then vertex s is unstable. By Lemma 4, changing the strategy at s is a profitable switch. Thus every switch the algorithm makes is profitable.

Therefore we have an upper bound of $2^{d} - 1$ switches, since no strategy can be repeated.

Analysis of the complexity of the algorithm

*Theorem 1:*

The expected number of operations required by the algorithm is $\mathrm{poly}(n) \cdot 2^{O(\sqrt{d})}$.

Proof:

At the bottom level of the recursion, the strategy $\tau$ for player 0 in step 4 is found by solving a linear program of size polynomial in n, as given by Lemma 1. At higher levels $\tau$ can easily be constructed from the player 0 strategy returned by the recursive call in step 3. Therefore, only one such linear program is solved for each switch the algorithm makes, and the total number of operations required per switch is polynomial in n. The following lemmas will give the $2^{O(\sqrt{d})}$ part.

*Lemma:*

Let f(d) denote the expected number of switches required by the algorithm to find a pair of optimal strategies. Then

$f(d) \le f(d-1) + 1 + \frac{1}{d}\sum_{r=1}^{d} f(d-r)$ for d > 1,

$f(d) \le 1$ for d = 1.

*Proof.*

Let $\sigma$ be the initial strategy for player 1. For each max vertex i, let $\sigma^{(i)}$ be the best player 1 strategy that agrees with $\sigma$ at vertex i.

Let $i_{1}, \dots, i_{d}$ be a permutation of 1..d such that $h(\sigma^{(i_{1})}) \le h(\sigma^{(i_{2})}) \le \dots \le h(\sigma^{(i_{d})})$.

Now suppose that the algorithm chooses at step 1 the vertex $i_{r}$. Then by solving the subproblem it reaches the best strategy $\sigma'$ satisfying $h(\sigma') = h(\sigma^{(i_{r})})$.

Since every switch the algorithm makes is profitable, after the switch at $i_{r}$ it can no longer visit a strategy that agrees with $\sigma$ at a vertex $i_{j}$ with $j \le r$, because every such strategy satisfies $h \le h(\sigma^{(i_{r})})$. Therefore the remaining run behaves like a game with at most d - r free max vertices, while one top-level iteration requires an expected number of switches not exceeding f(d-1) + 1. Since every index r is equally likely, we have:

$f(d) \le f(d-1) + 1 + \frac{1}{d}\sum_{r=1}^{d} f(d-r)$ for d > 1.

*Lemma 2:*

$f(d) \le 2^{O(\sqrt{d})}$ for all d > 0.

The lemma is obtained by using the recurrence that we have found in the previous lemma; therefore I will not give the detailed proof.
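The recurrence is easy to iterate numerically; the check below (my own, not from the paper) computes f(d) for d up to 30 and compares it with a subexponential envelope $e^{2\sqrt{d}}$:

```python
import math

# Iterate the switch-count recurrence
#   f(d) = f(d-1) + 1 + (1/d) * sum_{j=0}^{d-1} f(j),   f(0) = 0,
# and check it stays below the subexponential envelope exp(2*sqrt(d)).
f = [0.0]
for d in range(1, 31):
    f.append(f[d - 1] + 1 + sum(f[:d]) / d)

assert all(f[d] < math.exp(2 * math.sqrt(d)) for d in range(1, 31))
# f(30) is in the low thousands, while 2^30 is about 10^9
```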

The second lemma directly gives us the desired result.

I have tried to improve the upper bound by a few methods. The first was to consider a different type of policy iteration, recursive policy iteration. A recursive policy iteration finds the best subset U $\subseteq T_{\pi}$ by recursive calls to policy iteration on the induced MDP which keeps the states outside **states**($T_{\pi}$) fixed. If a big step of the recursive policy iteration were cheap enough, it would be better than the randomized policy iteration, because each step would disregard $2^{|T_{\pi}|}$ policies. I arrived at a recurrence of the form A(n, k) = A(k, j) + A(n, k - j).

The formula is satisfied by the trivial result A(*, k) = k. I have not found a better solution to this formula. The second paper was supposed to help in finding ways of having a smart recursion; unfortunately, it was useless for this recursion.

Another attempt was to exploit the small sets better in the randomized policy iteration. For instance, each one-state set fixes one state (by Lemma 4), and therefore the first such step rules out half of the remaining policies. The problem with this kind of elimination is that it begins with a very large elimination, but after a few steps the number of policies that we newly rule out becomes very small. One should also notice that, in principle, almost all sets could be visited if we start from the large sets, and a smaller set is chosen only after covering all the sets that contain it (i.e. 123456789, 123456780 and only then 12345678).

Since no lower bound worse than polynomial in n is known, one can hope to find an upper bound better than exponential. Unfortunately, I did not manage to do this. The similarity between the MDP problem and the stochastic game is very strong, although they are not the same problem. The main purpose of presenting the second algorithm is that it gives an enlightening view of a recursion, and of how simple randomization can cut down the search for the best strategy; a similar idea might also be usable for MDPs. The randomization in the first paper has also improved the bound, but the resulting algorithm is inferior, being still exponential.