In AI, sometimes, you need to plan a **sequence of action** that lead you to your goal. In stochastic environment, in those situation where you can’t know the outcomes of your actions, a sequence of actions is not sufficient: you need a **policy**.

Markov Decision Process is a mathematical framework that helps to build a policy in a stochastic environment where you know the probabilities of certain outcomes.

In this post, I give you a breif introduction of Markov Decision Process. Then, I’ll show you my implementation, in python, of the most important algorithms that can help you to find policies in stocastic enviroments. You can fine a more detailed description of the Markov Decision Process in my my slides that I’ve used for a seminar at University.

To explain the Markov Decision Process, we use the same environment example of the book “Artificial Intelligence: A Modern Approach (3rd ed.)“. This environment is called Grid World, it is a simple grid environment where the possible actions are NORTH, SOUTH, EAST, WEST. Assume that the probability to go forward is 0.8 and the probability to go left or right is 0.1.

For example, if you are in the cell (2,1) and you want to do the action “NORTH” the possible outcomes can be represented as the following:

We need a goal: the grid world have an exit with an utility of +1. But, there are some pits that have an utility of -1. Each the exit and the pit are terminal cells (i.e. we cannot do any action from them). We can’t go out of the grid and we can use walls to have more spatial constraints.

To build the policy, we need an estimation of the utilities of each cell. We can calculate these utilities using a method like minimax: this method is called expectiminimax. But some applications (i.e. if you don’t have terminal states), sometimes this method is infeasible.

Markov decision process helps us to calculate these utilities, with some powerful methods.

To understand the concepts on the books, I’ve written a simple script in python to “touch” the theory. I’ll show you the basic concepts to understand the code.

### Policy Iteration and Value Iteration

First of all, we have a class called `GridWorld`

, with this class we model our grid world.

With these methods we model the deterministic actions (`transitionFunction`

) and the stochastic actions (`possiblePositionsFromAction`

).
The other methods of the class are simple to understand.

We must build a policy, to do this we have another class called `Policy`

that implements the two ways to build the policy: **Value Iteration** and **Policy Iteration**.

### Value Iteration

The following method implements the Value Iteration algorithm. I show you a simplified version of the algorithm, that hides some difficult peculiarities.

What we do is to iterate the Bellman Update (i.e. `self.__cellUtility(x, y)`

) until we converge. We can see that the code is very simple to understand, the concept of iterate a function until we converge is itself a simple thing. The Bellman Update can be implemented as the following:

Note that we use the methods of the grid world to know the possible outcomes of an action. The Bellman update can be seen in “AI: A Modern Approach (3rd ed.)” pag. 652.

### Policy Iteration

The policy iteration is harder to understand. First of all, we need a random policy for the first step, using the following method:

Then we can write the policy iteration. It iterate the **Policy Evaluation** that update the utility vector given fixed policy. We iterate it until we reach a fixed point of the policy (i.e. the policy at step *i* is the same of the policy at step *i+1*).

The pseudo-code of this implementation can be found at “AI: A Modern Approach (3rd ed.)” at pag. 656. The algorithm use the Policy Evaluation: it is similar to the Value Iteration, but simpler because it use a fixed policy.

Note that we don’t analyze all the actions, but we look only at the action of the policy.

The policy iteration could return a policy (because the algorithm calculate its fixed point). I decided that the policy iteration and the Value Iteration calculate only the utility vector and we need to extract the policy at a later time. This uniformity allows us to understand things better.

### Policy Extraction

We can calculate the policy using the **q-values** or directly from the utility vector.

#### From q-values

From the q-values we have two function: the first is to calculate the q-vales and the second is to calcolate the policy from these q-values.

#### From the utility vector

From the utility vector we have a more compact method. But, it is harder to understand.

I have given you a short description of the code. But it is only an introduction. There are a lot of things to see.

### Try it!

You can execute the script it in this way:

With this command, you can analyze the grid world with the parameters of the book.

You can try other parameters, to see the possible commands execute the following:

Also, there is a window that allow you to configure the parameters as you want, you can launch it with the following:

### Download

Try to run the examples and see the code to better understand. In the code you can see other concepts and a lot of code to draw the data, to make the GUI and to debug the Policy Iteration and the Value Iteration.