Complexity of Policy Iteration and Related Issues

            Eyal Even-Dar



Introduction

Definitions

General policy iteration

Known Results

On the complexity of policy iteration

Lemmas And Theorems

Combining it all together

A Random Policy Iteration

A Subexponential Randomized Algorithm for the Simple Stochastic Game Problem

Introduction

Definitions and Preliminary Results

The Algorithm

Algorithm Analysis

My attempts

Summary


I have tried to improve the known upper bound for policy iteration. Since I did not succeed in this task, I will instead give a brief overview of the subject. I will present two related papers and explain each of them. At the beginning I will give the definition of the problem and explain its importance. Both papers deal with similar problems: one addresses exactly our problem, which will be defined later, and the second addresses a related one. Both give non-trivial upper bounds (for MDPs and for stochastic games, respectively). Therefore, I will give the definitions related to MDPs here, while the definitions related to stochastic games will be given in the corresponding paper summary.

After presenting both results, I will briefly describe what I have tried to do and compare the results.


Decision-making problems in a stochastic environment are often described by an MDP (Markov Decision Process). An MDP is a tuple (S, A, P, R): S is a finite set of states; A is a finite set of actions; P is the table of transition probabilities, which depends on S and A (P(s'|s,a)); and R is the reward function over S and A (R(s,a)). The agent's return (the return of a policy) is defined to be the discounted sum of rewards over an infinite horizon, Σ_{t=0}^{∞} γ^t r_t, or lim_{T→∞} (1/T) Σ_{t=0}^{T-1} r_t in the undiscounted case.

Here the discount factor γ lies in the interval [0, 1).

Many algorithms, such as policy iteration and value iteration, try to maximize the agent's return.
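To make the definitions concrete, here is a minimal value-iteration sketch on a toy two-state, two-action MDP; all transition probabilities and rewards are invented for illustration:

```python
import numpy as np

# Toy MDP (all numbers hypothetical): 2 states, 2 actions.
# P[a, s, s2] = P(s2 | s, a); R[s, a] = immediate reward.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],   # action 0
              [[0.5, 0.5],
               [0.1, 0.9]]])  # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor in [0, 1)

# Value iteration: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
V = np.zeros(2)
for _ in range(10_000):
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-12:
        V = V_new
        break
    V = V_new

greedy_policy = Q.argmax(axis=1)  # the policy achieving the maximum at each state
```

Policy iteration instead searches directly over policies, as described next.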

Policy iteration is a method of finding a policy that achieves the optimal agent's return by searching over the policy space. I will first give a few more definitions, in order to define the policy iteration method precisely.

V^π(s) is defined to be the expected return of policy π starting from state s, and Q^π(s,a) to be the expected return if the start state is s and the agent performs action a first and then follows policy π. The agent's goal is to find the best policy π*, which satisfies V^{π*}(s) ≥ V^π(s) for every policy π and every state s. The optimal value is given by V*(s) = V^{π*}(s) and the associated Q value is Q*(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) V*(s').

A policy π is said to be better than a policy π' (π > π') if and only if V^π(s) ≥ V^{π'}(s) for every state s and V^π(s) > V^{π'}(s) for some state s. If V^π(s) = V^{π'}(s) for every state s, we say that π = π'.

General policy iteration

At each iteration, we consider changing the policy at a single state while keeping the actions at all other states fixed to the current policy; some such single-state changes will improve the current policy. Different policy iterations differ in the way in which they accept single-state changes. For instance, sequential PI accepts only one state change in each iteration. Using the function Q^π, we can find all the single-state changes that improve the current policy π. A general policy iteration chooses some of these single-state changes and modifies the policy at them.
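The scheme above can be sketched as follows; the MDP is an invented two-state, two-action example, and taking select(T) = T accepts every single-state improvement:

```python
import numpy as np

# Invented two-state, two-action MDP for illustration.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],   # action 0
              [[0.5, 0.5],
               [0.1, 0.9]]])  # action 1: P[a, s, s2] = P(s2 | s, a)
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])    # R[s, a]
gamma, n = 0.9, 2

def evaluate(pi):
    """Exact policy evaluation: solve V = R_pi + gamma * P_pi V."""
    P_pi = np.array([P[pi[s], s] for s in range(n)])
    R_pi = np.array([R[s, pi[s]] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def improving_pairs(pi):
    """T_pi: all pairs (s, a) with Q^pi(s, a) > V^pi(s)."""
    V = evaluate(pi)
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    return [(s, a) for s in range(n) for a in range(2) if Q[s, a] > V[s] + 1e-9]

def policy_iteration(pi, select=lambda T: T):
    """General PI: repeatedly apply modify(pi, select(T_pi)) until T_pi is empty."""
    pi = list(pi)
    while True:
        T = improving_pairs(pi)
        if not T:
            return pi
        for s, a in select(T):  # modify(pi, U)
            pi[s] = a

pi_star = policy_iteration([0, 0])
```

Different choices of `select` give the sequential, greedy, and randomized variants discussed below.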

Known Results

An upper bound polynomial in the number of states and in 1/(1-γ) is known for value iteration. Therefore, policy iteration, which always performs at least as well as value iteration, is also polynomial in the number of states and in 1/(1-γ). (I will not prove this, because value iteration is hardly mentioned here.)

For a certain policy iteration, sequential policy iteration, which improves only one state in each iteration, there is an exponential lower bound of 2^{Ω(n)} iterations. For a general policy iteration there was no upper bound except the trivial one, O(2^n), before the following paper came out.


A description of both papers will now be given.


On the complexity of policy iteration

Yishay Mansour and Satinder Singh


Decision-making problems in a stochastic environment are often described by an MDP (Markov Decision Process). The paper tries to give an upper bound for policy iteration, an algorithm that searches for the best policy over the policy space.

The paper makes some restrictions, without loss of generality, such as |A| = 2. Thus, the total number of policies is 2^n (n = |S|). While describing the paper, I will use the definitions given in the introduction.

While dealing with general policy iteration, the following notation will be very important. Given a policy π, let T_π be the set of all pairs (s,a) such that changing the action of π in s to a improves the return of the policy, i.e., Q^π(s,a) > V^π(s). We define states(T_π) to be the set of states that appear in T_π.

For a given policy π and a set U ⊆ T_π, modify(π, U) defines a policy π' whose actions are the same as those of π, except at the states that appear in U.


Lemmas And Theorems

Theorem 1:

For any U ⊆ T_π, let π' = modify(π, U). If U is not empty, then π' > π.

Proof idea

One way of proving it is to evaluate the agent's return when running the new policy for more and more steps, starting from the value vector achieved by the former policy; each such step can only increase the vector, so the new value vector dominates the former one (by induction on the number of steps).

Theorem 2:

For any sub-optimal policy π, T_π is not empty.

Proof idea

One can look again at the relation between value iteration and policy iteration to achieve the desired result.

These two theorems ensure that each iteration of PI strictly improves the current policy. Furthermore, they ensure that the algorithm will stop only upon finding the best policy (second theorem). They also give us the trivial upper bound of 2^n iterations, because each step rules out at least one policy, namely the current one. In the following steps the paper shows how we can rule out more policies at each step.

The following lemmas will help us prove that we rule out more policies at each iteration:

Lemma 3:

Let π and π' be two policies which differ in their actions only in a single state s. Then either

π > π', π' > π, or π = π'.


If Q^π(s, π'(s)) > V^π(s), then π' > π. If Q^π(s, π'(s)) < V^π(s), then π > π'. Otherwise π = π'.

Lemma 4:

During a run of general policy iteration, there are no i < j such that T_{π_i} ⊆ T_{π_j}.


The paper first proves the following property: for any policy π and any policy π'' that is identical to π in every state in states(T_π), we have π ≥ π''. For the proof, we look at an MDP M', which has the same actions as M except that the states in states(T_π) are restricted to the action of policy π. In M', π is an optimal policy, since its set of improving pairs is empty; hence, by Theorem 2, π ≥ π'' for every policy π'' of M'.

Using the property, the paper proves the lemma by contradiction. Assume that i < j with T_{π_i} ⊆ T_{π_j} exist. Let T = T_{π_j}.

Let U' = {(s,a) : a = π_i(s), s ∈ states(T) and π_i(s) ≠ π_j(s)}.

It is obvious that U' ⊆ T, since there are only two actions. Define π'' to be modify(π_j, U'). π'' is identical to π_i on the states in states(T), so the property proven above ensures that π_i ≥ π''. On the other hand, if U' is not empty then Theorem 1 gives π'' > π_j, so π_i ≥ π'' > π_j > π_i, a contradiction; and if U' is empty then π'' = π_j, so π_i ≥ π_j, again contradicting π_j > π_i.


The lemmas above only assure us that each set appears at most once, but that alone is not enough to give a better upper bound. Therefore, we will show that each modify on a large subset rules out many policies.

The paper defines a special case of policy iteration, greedy policy iteration, in which select(T) = T, i.e., all possible single-state action improvements are chosen.

The next lemma shows that at each iteration the algorithm rules out at least |T| policies.

Lemma 5:

Let π be a policy and π' = modify(π, T_π). Then there are at least |T_π| policies π'' such that π < π'' ≤ π'.


The proof is done by induction on |T_π|. For |T_π| = 1 the claim is given by Theorem 1.

Look at all the single-state improvements ρ_i = modify(π, {t_i}), with t_i ∈ T_π.

There is at least one t_1 such that for every other t_i, modify(π, {t_1}) is not better than modify(π, {t_i}) (such a t_1 is not necessarily unique).

Let ρ_1 = modify(π, {t_1}). We will show that every other t_i belongs to T_{ρ_1}. Let ν_i = modify(ρ_1, {t_i}). From Lemma 3, we know that either ν_i > ρ_1 or ρ_1 ≥ ν_i. We will prove that ν_i > ρ_1. For contradiction, we assume that ρ_1 ≥ ν_i. Let ρ_i = modify(π, {t_i}). Since ν_i and ρ_i differ in only one state, by Lemma 3 we know that either ν_i > ρ_i or ρ_i ≥ ν_i.

If ν_i > ρ_i, then ρ_1 ≥ ν_i > ρ_i, which contradicts the fact that ρ_1 is not better than the other single-state improvements. Hence ρ_i ≥ ν_i as well. By the characterization in Lemma 3, ρ_1 ≥ ν_i and ρ_i ≥ ν_i mean that switching ν_i back to π's action, at the state of t_i and at the state of t_1 respectively, does not decrease the value vector; applying both switches yields π itself, so π ≥ ν_i. But {t_1, t_i} ⊆ T_π, so Theorem 1 gives ν_i > π — a contradiction.

It implies that t_i ∈ T_{ρ_1} for every i > 1. Thus, |T_{ρ_1}| ≥ |T_π| − 1, and the lemma follows by applying the induction hypothesis to ρ_1.

Combining it all together

Using both results, for small sets and for large sets, we can derive an upper bound of O(2^n / n) iterations. The proof divides the sets into two groups: small sets, whose size is less than n/3, and large sets. The number of iterations with small sets is bounded by the fact that a set cannot be repeated, and the number of iterations with large sets is bounded by the fact that each such set rules out at least as many policies as its size; thus both can be bounded by O(2^n / n).

A Random Policy Iteration

To improve this upper bound the paper uses a different type of policy iteration, the random policy iteration. The random policy iteration chooses a random subset of T, with each subset having probability 2^{-|T|}. This enables us to rule out, in expectation, about 2^{|T|}/2 policies at each iteration instead of |T| (as in the greedy policy iteration).
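One way to realize the uniform choice over subsets is a coin flip per element of T; the resampling of the empty set below is my own illustrative choice, not necessarily the paper's exact procedure:

```python
import random

def random_improving_subset(T, rng):
    """Return a uniformly random non-empty subset of T.

    Including each element independently with probability 1/2 makes every
    subset equally likely (probability 2^{-|T|}); resampling on the empty
    outcome leaves the distribution uniform over the non-empty subsets.
    """
    assert T, "T must be non-empty (Theorem 2 guarantees this for a suboptimal policy)"
    while True:
        U = [x for x in T if rng.random() < 0.5]
        if U:
            return U

rng = random.Random(0)
draws = [frozenset(random_improving_subset([1, 2, 3], rng)) for _ in range(2000)]
```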

The proof is based on the fact that each new improved policy rules out not only the policies that it dominates, but every policy that does not dominate it, since all later policies in the run must dominate it. Therefore, if the step chooses π_{i+1} = modify(π_i, U), then every policy π'' = modify(π_i, U') with U' ≠ U is disregarded unless π'' ≥ π_{i+1}.

To estimate the number of the "disqualified" policies, the next lemma is given:

Lemma 7:

Let ≤ be a partial order over a set D of m elements. If we choose a random element r ∈ D with uniform probability, then the expected number of elements s such that s ≤ r is at most (m + 1)/2.


For any element v we associate two sets: A_v includes all the elements s such that s ≤ v, and B_v includes all the elements s such that s ≥ v. For every pair of elements u, v we have that u ∈ A_v if and only if v ∈ B_u. It implies that Σ_v |A_v| = Σ_v |B_v|. Moreover, A_v ∩ B_v = {v}, so |A_v| + |B_v| ≤ m + 1 for every v. Therefore, the expected value of |A_r| is (1/m) Σ_v |A_v| = (1/m) Σ_v (|A_v| + |B_v|)/2 ≤ (m + 1)/2.
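As a sanity check of mine (not part of the paper), the bound can be verified exactly for a concrete partial order, the coordinatewise order on {0,1}^t:

```python
from itertools import product

t = 4
elems = list(product([0, 1], repeat=t))   # partial order: s <= r coordinatewise
m = len(elems)                            # m = 2^t elements

def count_below(r):
    # number of elements s with s <= r (here this equals 2^{number of ones in r})
    return sum(all(s[i] <= r[i] for i in range(t)) for s in elems)

# Expectation of #{s : s <= r} over a uniformly random r
expected = sum(count_below(r) for r in elems) / m
# Lemma: expected <= (m + 1) / 2
```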

From this lemma (applied, by symmetry, to the elements s with s ≥ r) we obtain the following. Let π_i be the policy at iteration i and π_{i+1} the randomly chosen improvement. Then the expected number of policies π'' = modify(π_i, U'), with U' ⊆ T_{π_i}, that do not satisfy π'' ≥ π_{i+1}, and hence can never appear later in the run, is at least (2^{|T_{π_i}|} − 1)/2.

To achieve the final result, the sets are again (as in the greedy policy iteration) divided into two groups, the "small sets" and the "large sets". This time the threshold is not n/3 but a more complicated quantity. Since each large set now contributes much more than in the greedy policy iteration, a better upper bound of O(2^{0.78n}) expected iterations is achieved.

The paper thus gives two upper bounds better than the formerly known one, for two special policy iterations, one greedy and one randomized. One can see that the power of randomization leads to a significantly smaller, though still exponential, bound.









A Subexponential Randomized Algorithm for the Simple Stochastic Game Problem

Walter Ludwig




The paper presents an algorithm that addresses the problem of finding an optimal strategy for a simple stochastic game in sub-exponential time. First we define the problem. A simple stochastic game is a directed graph that has three types of vertices, max, min and average, along with two sink vertices. One vertex is the start vertex.

The game is a contest between two players: player 1 moves while the token is on a max vertex, and player 0 moves while the token is on a min vertex. On an average vertex the move is determined by the toss of a fair coin. A move is taking an outgoing edge from the current vertex.

A brief look at the algorithm is given here; it will be analyzed precisely later.

The algorithm works as follows: it randomly fixes player 1's choice at one vertex, recursively finds the best strategy that contains this choice, and then switches the fixed choice if that is still profitable. The recursion eventually yields the optimal strategy. If this were done deterministically, it would amount to exhaustive search with backtracking, taking O(2^d) time, where d is the number of max vertices.

Definitions and Preliminary Results

            The paper presents some preliminary results, which are necessary for the proof, together with the definitions of the simple stochastic game.

Definition 1:

A simple stochastic game is a directed graph G = (V,E) with three sets of vertices X, N and A (max, min and average), and with two special vertices, the 0-sink and the 1-sink.

Definition 2:

A strategy σ for player 1 is a bit vector (σ_i : i ∈ X), where σ_i is the label of the edge chosen at the max vertex i. A strategy τ for player 0 is defined similarly.

Let G_{σ,τ} be the graph defined by both strategies (i.e., the graph consisting only of the edges chosen by σ and τ at the max and min vertices, together with the edges leaving average vertices).

Definition 3:

The value v_{σ,τ}(i) of vertex i is the probability that player 1 wins (i.e., that the token reaches the 1-sink) starting from vertex i when the strategies σ, τ are played.
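Once both strategies are fixed, the game reduces to a Markov chain, and the values can be computed by solving a linear system v = Pv + b. A tiny invented example, with one average vertex that moves to the 1-sink or to a max vertex whose fixed edge leads to the 0-sink:

```python
import numpy as np

# Non-sink vertices: index 0 = an average vertex, index 1 = a max vertex whose
# chosen edge leads to the 0-sink.  (All structure invented for illustration.)
P = np.array([[0.0, 0.5],   # average vertex -> max vertex w.p. 1/2
              [0.0, 0.0]])  # max vertex never moves to a non-sink vertex
b = np.array([0.5,          # average vertex -> 1-sink w.p. 1/2
              0.0])         # max vertex -> 0-sink, so it never reaches the 1-sink

# v(i) = probability of reaching the 1-sink from i:  v = P v + b
v = np.linalg.solve(np.eye(2) - P, b)
```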

Definition 4:

A max vertex i with outgoing edges (i,j) and (i,k) is stable with respect to strategies σ, τ if

v_{σ,τ}(i) = max(v_{σ,τ}(j), v_{σ,τ}(k)) (similarly, with min in place of max, for a min vertex).

Definition 5:

Let σ, τ be a pair of strategies for players 1 and 0. The strategy τ is said to be optimal with respect to σ if every min vertex is stable with respect to σ, τ.

Lemma 1(Derman):

Let H be a simple stochastic game with no max vertices that halts with probability 1. Then an optimal strategy for player 0 (with respect to the trivial player-1 strategy) can be found by linear programming in time polynomial in the number of vertices.

Definition 6:

Let σ, τ be a pair of strategies. These strategies are said to be optimal if each one of them is optimal with respect to the other.

Lemma 3:

Given a simple stochastic game G, we can construct a new game G’ in time polynomial in the size of G such that G’ has the same number of min and max vertices as G, the value of G’ is greater than ½ if and only if the value of G is greater than ½, and G’ halts with probability 1.

Definition 8:

Define the function h(σ) = Σ_{j∈V} v_{σ,τ(σ)}(j), where τ(σ) denotes an optimal player-0 response to σ.

We define a switch to be changing the strategy of player 1 at a single vertex.

A switch from σ to σ' is profitable for player 1 if h(σ') > h(σ).


(Lemmas that concern our definition of optimal strategies)

Lemma 4:

Let G = (V,E) be a simple stochastic game that halts with probability 1, and let σ be a strategy for player 1 that is not optimal. Let i ∈ X be a vertex that is unstable with respect to σ, τ(σ). Let σ' be the strategy obtained from σ by changing the strategy at vertex i. Then for all j ∈ V, v_{σ',τ(σ')}(j) ≥ v_{σ,τ(σ)}(j), and for some j ∈ V, v_{σ',τ(σ')}(j) > v_{σ,τ(σ)}(j).


The Algorithm

The input of the algorithm is a game G = (V, E) that halts with probability 1 and some strategy σ for player 1.

The output is a pair of optimal strategies (σ, τ) for both players.

  1. Choose uniformly at random a vertex s ∈ X (a max vertex).
  2. Construct a new game G_s, in which s is no longer under player 1's control, as follows:

V_s = V \ {s},

E_s = (E \ {(j,k) | j = s or k = s}) ∪ {(j,k) | (j,s) ∈ E and (s,k) is the edge labeled σ_s}.

(We delete the chosen vertex from the graph and route the edges that entered s directly to the successor that σ currently chooses at s.)

  3. Recursively apply the algorithm to the game G_s and the induced player-1 strategy to find an optimal strategy σ'' for player 1 for the game G_s. Extend σ'' to a strategy σ' for G by setting σ'_k = σ''_k for k ≠ s and σ'_s = σ_s.
  4. Find an optimal strategy τ' for player 0 with respect to σ'. If the pair σ', τ' is optimal, return σ', τ'. Otherwise set σ_k = σ'_k for every k different from s, and for s set

σ_s = 1 − σ'_s,

and go back to step 1.

Algorithm Analysis

The validity of the algorithm is proven by showing that every switch the algorithm makes is profitable. The proof is done by induction on d (the number of max vertices). If d = 1, then at most one switch is required to move from a strategy that is not optimal to the strategy that is. Suppose d > 1 and let σ be the strategy at the top level of the algorithm. By the induction hypothesis, after choosing s, the strategy σ' is reached by some sequence of profitable switches, and it is the best player-1 strategy that agrees with σ at vertex s. Therefore, every vertex except s is stable, and if σ' is not optimal then vertex s is unstable. By Lemma 4, changing the strategy at s is a profitable switch. Thus every switch the algorithm makes is profitable.

Therefore we have an upper bound of 2^d − 1 switches, since no strategy can be repeated.

Analysis of the complexity of the algorithm

Theorem 1:

The expected number of operations required by the algorithm is 2^{O(√d)} · poly(n).


At the bottom level of the recursion, the strategy τ for player 0 in step 4 is found by solving a linear program of size polynomial in n, as given by Lemma 1. At higher levels τ can easily be constructed from the player-0 strategy returned by the recursive call in step 3. Therefore, only one such linear program is solved for each switch the algorithm makes, and the total number of operations required per switch is polynomial in n. The following lemmas give the 2^{O(√d)} bound on the expected number of switches.


Let f(d) denote the expected number of switches required by the algorithm to find a pair of optimal strategies. Then

f(d) ≤ f(d-1) + 1 + (1/d) Σ_{r=1}^{d} f(d-r)   for d > 1,

f(d) ≤ 1   for d = 1.


Let σ be the initial strategy for player 1. For every max vertex i, let β_i be the best player-1 strategy that agrees with σ at vertex i.

Let i_1, ..., i_d be a permutation of 1..d such that h(β_{i_1}) ≤ h(β_{i_2}) ≤ ... ≤ h(β_{i_d}).

Now suppose that the algorithm chooses at step 1 the vertex s = i_r. Then by solving the subproblem it will reach the best strategy σ' that agrees with σ at s, satisfying h(σ') = h(β_{i_r}).

Since every switch the algorithm makes is profitable, it can no longer move to a strategy that agrees with σ at a vertex i_j with j ≤ r. Therefore, the strategy at each vertex i_j with j ≤ r is now fixed, and after the last switch, which switched the strategy at i_r itself, at most d − r vertices remain free. Hence one top-level iteration requires an expected number of switches not exceeding f(d-1) + 1 and leaves a problem requiring at most f(d-r) further expected switches. Since every index r is equally likely, we have:

f(d) ≤ f(d-1) + 1 + (1/d) Σ_{r=1}^{d} f(d-r)   for d > 1.
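The recurrence can be iterated numerically to watch the 2^{O(√d)} growth; this is my own sanity check, assuming f(0) = 0:

```python
import math

# f(d) = f(d-1) + 1 + (1/d) * sum_{r=1}^{d} f(d-r), with f(0) = 0 (assumed).
f = [0.0]
for d in range(1, 61):
    f.append(f[d - 1] + 1 + sum(f[d - r] for r in range(1, d + 1)) / d)

# log2 f(d) should grow like sqrt(d) (sub-exponential), not like d (exponential).
growth = [math.log2(f[d]) / math.sqrt(d) for d in (15, 30, 60)]
```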

Lemma 2:

f(d) = 2^{O(√d)} for all d > 0.

The lemma is obtained by induction using the recurrence found in the theorem above; therefore I will not give the detailed proof.

The second lemma directly yields the desired result.


My attempts

            I have tried to improve the upper bound by a few methods. The first attempt was to consider a different type of policy iteration, recursive policy iteration. A recursive policy iteration finds the best subset U ⊆ T_π by recursive calls to policy iteration on the induced MDP which keeps the states outside states(T_π) fixed. If a big step in the recursive policy iteration were cheap enough, it would be better than the randomized policy iteration, because each such step would disregard 2^{|T_π|} policies. I arrived at a recurrence of the form A(n, t) = A(k, t) + A(n, t − k).

The recurrence is satisfied by the trivial bound A(·, k) = 2^k. I have not found a better solution to it. The second paper was supposed to help in finding a smart recursion; unfortunately, it was useless for this recursion.

Another attempt was to exploit the small sets in the randomized policy iteration better. For instance, each set of one state fixes one state (by Lemma 4), and therefore the first such set rules out 2^{n-1} policies. The problem with this kind of elimination is that it begins with a very large elimination, but after a few steps the number of newly ruled-out policies becomes very small. One should also notice that, in principle, almost all sets could still be traversed if the run starts from the large sets and a state is chosen only after covering all the sets that contain it (e.g. 123456789, 123456780 and only then 12345678).


Summary

            Since there is no known general lower bound worse than polynomial, it can be expected that one can find an upper bound better than exponential. Unfortunately, I did not manage to do this. The similarity between the MDP problem and the stochastic game problem is very large, although they are not the same problem. The main purpose of presenting the second algorithm is that it gives an enlightening view of a recursion, and of how simple randomization can remove a factor in the search for the best strategy; a similar idea might also be usable for MDPs. The randomization in the first paper has also given us an improved algorithm, but an inferior, still exponential, one.