mdp_apps.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# APPLICATIONS OF MARKOV DECISION PROCESSES\n",
    "---\n",
    "In this notebook we will take a look at some indicative applications of markov decision processes. \n",
    "We will cover content from [`mdp.py`](https://github.com/aimacode/aima-python/blob/master/mdp.py), for chapter 17 of Stuart Russel's and Peter Norvig's book [*Artificial Intelligence: A Modern Approach*](http://aima.cs.berkeley.edu/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from mdp import *\n",
    "from notebook import psource, pseudocode"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CONTENTS\n",
    "- Simple MDP\n",
    "    - State dependent reward function\n",
    "    - State and action dependent reward function\n",
    "    - State, action and next state dependent reward function\n",
    "\n",
    "\n",
    "## SIMPLE MDP\n",
    "---\n",
    "### State dependent reward function\n",
    "\n",
    "Markov Decision Processes are formally described as processes that follow the Markov property which states that \"The future is independent of the past given the present\". \n",
    "MDPs formally describe environments for reinforcement learning and we assume that the environment is *fully observable*. \n",
    "Let us take a toy example MDP and solve it using the functions in `mdp.py`.\n",
    "This is a simple example adapted from a [similar problem](http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/MDP.pdf) by Dr. David Silver, tweaked to fit the limitations of the current functions.\n",
    "![title](images/mdp-b.png)\n",
    "\n",
    "Let's say you're a student attending lectures in a university.\n",
    "There are three lectures you need to attend on a given day.\n",
    "<br>\n",
    "Attending the first lecture gives you 4 points of reward.\n",
    "After the first lecture, you have a 0.6 probability to continue into the second one, yielding 6 more points of reward.\n",
    "<br>\n",
    "But, with a probability of 0.4, you get distracted and start using Facebook instead and get a reward of -1.\n",
    "From then onwards, you really can't let go of Facebook and there's just a 0.1 probability that you will concentrate back on the lecture.\n",
    "<br>\n",
    "After the second lecture, you have an equal chance of attending the next lecture or just falling asleep.\n",
    "Falling asleep is the terminal state and yields you no reward, but continuing on to the final lecture gives you a big reward of 10 points.\n",
    "<br>\n",
    "From there on, you have a 40% chance of going to study and reach the terminal state, \n",
    "but a 60% chance of going to the pub with your friends instead. \n",
    "You end up drunk and don't know which lecture to attend, so you go to one of the lectures according to the probabilities given above.\n",
    "<br> \n",
    "We now have an outline of our stochastic environment and we need to maximize our reward by solving this MDP.\n",
    "<br>\n",
    "<br>\n",
    "We first have to define our Transition Matrix as a nested dictionary to fit the requirements of the MDP class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "t = {\n",
    "    'leisure': {\n",
    "                    'facebook': {'leisure':0.9, 'class1':0.1},\n",
    "                    'quit': {'leisure':0.1, 'class1':0.9},\n",
    "                    'study': {},\n",
    "                    'sleep': {},\n",
    "                    'pub': {}\n",
    "               },\n",
    "    'class1': {\n",
    "                    'study': {'class2':0.6, 'leisure':0.4},\n",
    "                    'facebook': {'class2':0.4, 'leisure':0.6},\n",
    "                    'quit': {},\n",
    "                    'sleep': {},\n",
    "                    'pub': {}\n",
    "              },\n",
    "    'class2': {\n",
    "                    'study': {'class3':0.5, 'end':0.5},\n",
    "                    'sleep': {'end':0.5, 'class3':0.5},\n",
    "                    'facebook': {},\n",
    "                    'quit': {},\n",
    "                    'pub': {},\n",
    "              },\n",
    "    'class3': {\n",
    "                    'study': {'end':0.6, 'class1':0.08, 'class2':0.16, 'class3':0.16},\n",
    "                    'pub': {'end':0.4, 'class1':0.12, 'class2':0.24, 'class3':0.24},\n",
    "                    'facebook': {},\n",
    "                    'quit': {},\n",
    "                    'sleep': {}\n",
    "              },\n",
    "    'end': {}\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now need to define the reward for each state."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "rewards = {\n",
    "    'class1': 4,\n",
    "    'class2': 6,\n",
    "    'class3': 10,\n",
    "    'leisure': -1,\n",
    "    'end': 0\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This MDP has only one terminal state."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "terminals = ['end']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now set the initial state to Class 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "init = 'class1'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will write a CustomMDP class to extend the MDP class for the problem at hand. \n",
    "This class will implement the `T` method to implement the transition model. This is the exact same class as given in [`mdp.ipynb`](https://github.com/aimacode/aima-python/blob/master/mdp.ipynb#MDP)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class CustomMDP(MDP):\n",
    "\n",
    "    def __init__(self, transition_matrix, rewards, terminals, init, gamma=.9):\n",
    "        # All possible actions.\n",
    "        actlist = []\n",
    "        for state in transition_matrix.keys():\n",
    "            actlist.extend(transition_matrix[state])\n",
    "        actlist = list(set(actlist))\n",
    "        print(actlist)\n",
    "\n",
    "        MDP.__init__(self, init, actlist, terminals=terminals, gamma=gamma)\n",
    "        self.t = transition_matrix\n",
    "        self.reward = rewards\n",
    "        for state in self.t:\n",
    "            self.states.add(state)\n",
    "\n",
    "    def T(self, state, action):\n",
    "        if action is None:\n",
    "            return [(0.0, state)]\n",
    "        else: \n",
    "            return [(prob, new_state) for new_state, prob in self.t[state][action].items()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now need an instance of this class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['study', 'pub', 'sleep', 'facebook', 'quit']\n"
     ]
    }
   ],
   "source": [
    "mdp = CustomMDP(t, rewards, terminals, init, gamma=.9)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The utility of each state can be found by `value_iteration`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'class1': 16.90340650279542,\n",
       " 'class2': 14.597383430869879,\n",
       " 'class3': 19.10533144728953,\n",
       " 'end': 0.0,\n",
       " 'leisure': 13.946891353066082}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "value_iteration(mdp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we can compute the utility values, we can find the best policy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "pi = best_policy(mdp, value_iteration(mdp, .01))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pi` stores the best action for each state."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'class3': 'pub', 'leisure': 'quit', 'class2': 'study', 'class1': 'study', 'end': None}\n"
     ]
    }
   ],
   "source": [
    "print(pi)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can confirm that this is the best policy by verifying this result against `policy_iteration`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'class1': 'study',\n",
       " 'class2': 'study',\n",
       " 'class3': 'pub',\n",
       " 'end': None,\n",
       " 'leisure': 'quit'}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "policy_iteration(mdp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Everything looks perfect, but let us look at another possibility for an MDP.\n",
    "<br>\n",
    "Till now we have only dealt with rewards that the agent gets while it is **on** a particular state.\n",
    "What if we want to have different rewards for a state depending on the action that the agent takes next. \n",
    "The agent gets the reward _during its transition_ to the next state.\n",
    "<br>\n",
    "For the sake of clarity, we will call this the _transition reward_ and we will call this kind of MDP a _dynamic_ MDP. \n",
    "This is not a conventional term, we just use it to minimize confusion between the two.\n",
    "<br>\n",
    "This next section deals with how to create and solve a dynamic MDP."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### State and action dependent reward function\n",
    "Let us consider a very similar problem, but this time, we do not have rewards _on_ states, \n",
    "instead, we have rewards on the transitions between states. \n",
    "This state diagram will make it clearer.\n",
    "![title](images/mdp-c.png)\n",
    "\n",
    "A very similar scenario as the previous problem, but we have different rewards for the same state depending on the action taken.\n",
    "<br>\n",
    "To deal with this, we just need to change the `R` method of the `MDP` class, but to prevent confusion, we will write a new similar class `DMDP`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class DMDP:\n",
    "\n",
    "    \"\"\"A Markov Decision Process, defined by an initial state, transition model,\n",
    "    and reward model. We also keep track of a gamma value, for use by\n",
    "    algorithms. The transition model is represented somewhat differently from\n",
    "    the text. Instead of P(s' | s, a) being a probability number for each\n",
    "    state/state/action triplet, we instead have T(s, a) return a\n",
    "    list of (p, s') pairs. The reward function is very similar.\n",
    "    We also keep track of the possible states,\n",
    "    terminal states, and actions for each state.\"\"\"\n",
    "\n",
    "    def __init__(self, init, actlist, terminals, transitions={}, rewards={}, states=None, gamma=.9):\n",
    "        if not (0 < gamma <= 1):\n",
    "            raise ValueError(\"An MDP must have 0 < gamma <= 1\")\n",
    "\n",
    "        if states:\n",
    "            self.states = states\n",
    "        else:\n",
    "            self.states = set()\n",
    "        self.init = init\n",
    "        self.actlist = actlist\n",
    "        self.terminals = terminals\n",
    "        self.transitions = transitions\n",
    "        self.rewards = rewards\n",
    "        self.gamma = gamma\n",
    "\n",
    "    def R(self, state, action):\n",
    "        \"\"\"Return a numeric reward for this state and this action.\"\"\"\n",
    "        if (self.rewards == {}):\n",
    "            raise ValueError('Reward model is missing')\n",
    "        else:\n",
    "            return self.rewards[state][action]\n",
    "\n",
    "    def T(self, state, action):\n",
    "        \"\"\"Transition model. From a state and an action, return a list\n",
    "        of (probability, result-state) pairs.\"\"\"\n",
    "        if(self.transitions == {}):\n",
    "            raise ValueError(\"Transition model is missing\")\n",
    "        else:\n",
    "            return self.transitions[state][action]\n",
    "\n",
    "    def actions(self, state):\n",
    "        \"\"\"Set of actions that can be performed in this state. By default, a\n",
    "        fixed list of actions, except for terminal states. Override this\n",
    "        method if you need to specialize by state.\"\"\"\n",
    "        if state in self.terminals:\n",
    "            return [None]\n",
    "        else:\n",
    "            return self.actlist"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The transition model will be the same"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "t = {\n",
    "    'leisure': {\n",
    "                    'facebook': {'leisure':0.9, 'class1':0.1},\n",
    "                    'quit': {'leisure':0.1, 'class1':0.9},\n",
    "                    'study': {},\n",
    "                    'sleep': {},\n",
    "                    'pub': {}\n",
    "               },\n",
    "    'class1': {\n",
    "                    'study': {'class2':0.6, 'leisure':0.4},\n",
    "                    'facebook': {'class2':0.4, 'leisure':0.6},\n",
    "                    'quit': {},\n",
    "                    'sleep': {},\n",
    "                    'pub': {}\n",
    "              },\n",
    "    'class2': {\n",
    "                    'study': {'class3':0.5, 'end':0.5},\n",
    "                    'sleep': {'end':0.5, 'class3':0.5},\n",
    "                    'facebook': {},\n",
    "                    'quit': {},\n",
    "                    'pub': {},\n",
    "              },\n",
    "    'class3': {\n",
    "                    'study': {'end':0.6, 'class1':0.08, 'class2':0.16, 'class3':0.16},\n",
    "                    'pub': {'end':0.4, 'class1':0.12, 'class2':0.24, 'class3':0.24},\n",
    "                    'facebook': {},\n",
    "                    'quit': {},\n",
    "                    'sleep': {}\n",
    "              },\n",
    "    'end': {}\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The reward model will be a dictionary very similar to the transition dictionary with a reward for every action for every state."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "r = {\n",
    "    'leisure': {\n",
    "        'facebook':-1,\n",
    "        'quit':0,\n",
    "        'study':0,\n",
    "        'sleep':0,\n",
    "        'pub':0\n",
    "    },\n",
    "    'class1': {\n",
    "        'study':-2,\n",
    "        'facebook':-1,\n",
    "        'quit':0,\n",
    "        'sleep':0,\n",
    "        'pub':0\n",
    "    },\n",
    "    'class2': {\n",
    "        'study':-2,\n",
    "        'sleep':0,\n",
    "        'facebook':0,\n",
    "        'quit':0,\n",
    "        'pub':0\n",
    "    },\n",
    "    'class3': {\n",
    "        'study':10,\n",
    "        'pub':1,\n",
    "        'facebook':0,\n",
    "        'quit':0,\n",
    "        'sleep':0\n",
    "    },\n",
    "    'end': {\n",
    "        'study':0,\n",
    "        'pub':0,\n",
    "        'facebook':0,\n",
    "        'quit':0,\n",
    "        'sleep':0\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The MDP has only one terminal state"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "terminals = ['end']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now set the initial state to Class 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "init = 'class1'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will write a CustomDMDP class to extend the DMDP class for the problem at hand.\n",
    "This class will implement everything that the previous CustomMDP class implements along with a new reward model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class CustomDMDP(DMDP):\n",
    "    \n",
    "    def __init__(self, transition_matrix, rewards, terminals, init, gamma=.9):\n",
    "        actlist = []\n",
    "        for state in transition_matrix.keys():\n",
    "            actlist.extend(transition_matrix[state])\n",
    "        actlist = list(set(actlist))\n",
    "        print(actlist)\n",
    "        \n",
    "        DMDP.__init__(self, init, actlist, terminals=terminals, gamma=gamma)\n",
    "        self.t = transition_matrix\n",
    "        self.rewards = rewards\n",
    "        for state in self.t:\n",
    "            self.states.add(state)\n",
    "            \n",
    "            \n",
    "    def T(self, state, action):\n",
    "        if action is None:\n",
    "            return [(0.0, state)]\n",
    "        else:\n",
    "            return [(prob, new_state) for new_state, prob in self.t[state][action].items()]\n",
    "        \n",
    "    def R(self, state, action):\n",
    "        if action is None:\n",
    "            return 0\n",
    "        else:\n",
    "            return self.rewards[state][action]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One thing we haven't thought about yet is that the `value_iteration` algorithm won't work now that the reward model is changed.\n",
    "It will be quite similar to the one we currently have nonetheless."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Bellman update equation now is defined as follows\n",
    "\n",
    "$$U(s)=\\max_{a\\epsilon A(s)}\\bigg[R(s, a) + \\gamma\\sum_{s'}P(s'\\ |\\ s,a)U(s')\\bigg]$$\n",
    "\n",
    "It is not difficult to see that the update equation we have been using till now is just a special case of this more generalized equation. \n",
    "We also need to max over the reward function now as the reward function is action dependent as well.\n",
    "<br>\n",
    "We will use this to write a function to carry out value iteration, very similar to the one we are familiar with."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def value_iteration_dmdp(dmdp, epsilon=0.001):\n",
    "    U1 = {s: 0 for s in dmdp.states}\n",
    "    R, T, gamma = dmdp.R, dmdp.T, dmdp.gamma\n",
    "    while True:\n",
    "        U = U1.copy()\n",
    "        delta = 0\n",
    "        for s in dmdp.states:\n",
    "            U1[s] = max([(R(s, a) + gamma*sum([(p*U[s1]) for (p, s1) in T(s, a)])) for a in dmdp.actions(s)])\n",
    "            delta = max(delta, abs(U1[s] - U[s]))\n",
    "        if delta < epsilon * (1 - gamma) / gamma:\n",
    "            return U"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're all set.\n",
    "Let's instantiate our class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['study', 'pub', 'sleep', 'facebook', 'quit']\n"
     ]
    }
   ],
   "source": [
    "dmdp = CustomDMDP(t, r, terminals, init, gamma=.9)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Calculate utility values by calling `value_iteration_dmdp`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'class1': 2.0756895004431364,\n",
       " 'class2': 5.772550326127298,\n",
       " 'class3': 12.827904448229472,\n",
       " 'end': 0.0,\n",
       " 'leisure': 1.8474896554396596}"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "value_iteration_dmdp(dmdp)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These are the expected utility values for our new MDP.\n",
    "<br>\n",
    "As you might have guessed, we cannot use the old `best_policy` function to find the best policy.\n",
    "So we will write our own.\n",
    "But, before that we need a helper function to calculate the expected utility value given a state and an action."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def expected_utility_dmdp(a, s, U, dmdp):\n",
    "    return dmdp.R(s, a) + dmdp.gamma*sum([(p*U[s1]) for (p, s1) in dmdp.T(s, a)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we write our modified `best_policy` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from utils import argmax\n",
    "def best_policy_dmdp(dmdp, U):\n",
    "    pi = {}\n",
    "    for s in dmdp.states:\n",
    "        pi[s] = argmax(dmdp.actions(s), key=lambda a: expected_utility_dmdp(a, s, U, dmdp))\n",
    "    return pi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find the best policy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'class3': 'study', 'leisure': 'quit', 'class2': 'sleep', 'class1': 'facebook', 'end': None}\n"
     ]
    }
   ],
   "source": [
    "pi = best_policy_dmdp(dmdp, value_iteration_dmdp(dmdp, .01))\n",
    "print(pi)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From this, we can infer that `value_iteration_dmdp` tries to minimize the negative reward. \n",
    "Since we don't have rewards for states now, the algorithm takes the action that would try to avoid getting negative rewards and take the lesser of two evils if all rewards are negative.\n",
    "You might also want to have state rewards alongside transition rewards. \n",
    "Perhaps you can do that yourself now that the difficult part has been done.\n",
    "<br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### State, action and next-state dependent reward function\n",
    "\n",
    "For truly stochastic environments, \n",
    "we have noticed that taking an action from a particular state doesn't always do what we want it to. \n",
    "Instead, for every action taken from a particular state, \n",
    "it might be possible to reach a different state each time depending on the transition probabilities. \n",
    "What if we want different rewards for each state, action and next-state triplet? \n",
    "Mathematically, we now want a reward function of the form R(s, a, s') for our MDP. \n",
    "This section shows how we can tweak the MDP class to achieve this.\n",
    "<br>\n",
    "\n",
    "Let's now take a different problem statement. \n",
    "The one we are working with is a bit too simple.\n",
    "Consider a taxi that serves three adjacent towns A, B, and C.\n",
    "Each time the taxi discharges a passenger, the driver must choose from three possible actions:\n",
    "1. Cruise the streets looking for a passenger.\n",
    "2. Go to the nearest taxi stand.\n",
    "3. Wait for a radio call from the dispatcher with instructions.\n",
    "<br>\n",
    "Subject to the constraint that the taxi driver cannot do the third action in town B because of distance and poor reception.\n",
    "\n",
    "Let's model our MDP.\n",
    "<br>\n",
    "The MDP has three states, namely A, B and C.\n",
    "<br>\n",
    "It has three actions, namely 1, 2 and 3.\n",
    "<br>\n",
    "Action sets:\n",
    "<br>\n",
    "$K_{a}$ = {1, 2, 3}\n",
    "<br>\n",
    "$K_{b}$ = {1, 2}\n",
    "<br>\n",
    "$K_{c}$ = {1, 2, 3}\n",
    "<br>\n",
    "\n",
    "We have the following transition probability matrices:\n",
    "<br>\n",
    "<br>\n",
    "Action 1: Cruising streets\n",
    "<br>\n",
    "<br>\n",
    "$$\\\\\n",
    "    P^{1} = \n",
    "    \\left[ {\\begin{array}{ccc}\n",
    "    \\frac{1}{2} & \\frac{1}{4} & \\frac{1}{4} \\\\\n",
    "    \\frac{1}{2} & 0 & \\frac{1}{2} \\\\\n",
    "    \\frac{1}{4} & \\frac{1}{4} & \\frac{1}{2} \\\\\n",
    "    \\end{array}}\\right] \\\\\n",
    "    \\\\\n",
    "$$\n",
    "<br>\n",
    "<br>\n",
    "Action 2: Waiting at the taxi stand  \n",
    "<br>\n",
    "<br>\n",
    "$$\\\\\n",
    "    P^{2} = \n",
    "    \\left[ {\\begin{array}{ccc}\n",
    "    \\frac{1}{16} & \\frac{3}{4} & \\frac{3}{16} \\\\\n",
    "    \\frac{1}{16} & \\frac{7}{8} & \\frac{1}{16} \\\\\n",
    "    \\frac{1}{8} & \\frac{3}{4} & \\frac{1}{8} \\\\\n",
    "    \\end{array}}\\right] \\\\\n",
    "    \\\\\n",
    "$$\n",
    "<br>\n",
    "<br>\n",
    "Action 3: Waiting for dispatch     \n",
    "<br>\n",
    "<br>\n",
    "$$\\\\\n",
    "    P^{3} =\n",
    "    \\left[ {\\begin{array}{ccc}\n",
    "    \\frac{1}{4} & \\frac{1}{8} & \\frac{5}{8} \\\\\n",
    "    0 & 1 & 0 \\\\\n",
    "    \\frac{3}{4} & \\frac{1}{16} & \\frac{3}{16} \\\\\n",
    "    \\end{array}}\\right] \\\\\n",
    "    \\\\\n",
    "$$\n",
    "<br>\n",
    "<br>\n",
    "For the sake of readability, we will call the states A, B and C and the actions 'cruise', 'stand' and 'dispatch'.\n",
    "We will now build the transition model as a dictionary using these matrices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "t = {\n",
    "    'A': {\n",
    "        'cruise': {'A':0.5, 'B':0.25, 'C':0.25},\n",
    "        'stand': {'A':0.0625, 'B':0.75, 'C':0.1875},\n",
    "        'dispatch': {'A':0.25, 'B':0.125, 'C':0.625}\n",
    "    },\n",
    "    'B': {\n",
    "        'cruise': {'A':0.5, 'B':0, 'C':0.5},\n",
    "        'stand': {'A':0.0625, 'B':0.875, 'C':0.0625},\n",
    "        'dispatch': {'A':0, 'B':1, 'C':0}\n",
    "    },\n",
    "    'C': {\n",
    "        'cruise': {'A':0.25, 'B':0.25, 'C':0.5},\n",
    "        'stand': {'A':0.125, 'B':0.75, 'C':0.125},\n",
    "        'dispatch': {'A':0.75, 'B':0.0625, 'C':0.1875}\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The reward matrices for the problem are as follows:\n",
    "<br>\n",
    "<br>\n",
    "Action 1: Cruising streets  \n",
    "<br>\n",
    "<br>\n",
    "$$\\\\\n",
    "    R^{1} = \n",
    "    \\left[ {\\begin{array}{ccc}\n",
    "    10 & 4 & 8 \\\\\n",
    "    14 & 0 & 18 \\\\\n",
    "    10 & 2 & 8 \\\\\n",
    "    \\end{array}}\\right] \\\\\n",
    "    \\\\\n",
    "$$\n",
    "<br>\n",
    "<br>\n",
    "Action 2: Waiting at the taxi stand  \n",
    "<br>\n",
    "<br>\n",
    "$$\\\\\n",
    "    R^{2} = \n",
    "    \\left[ {\\begin{array}{ccc}\n",
    "    8 & 2 & 4 \\\\\n",
    "    8 & 16 & 8 \\\\\n",
    "    6 & 4 & 2\\\\\n",
    "    \\end{array}}\\right] \\\\\n",
    "    \\\\\n",
    "$$\n",
    "<br>\n",
    "<br>\n",
    "Action 3: Waiting for dispatch  \n",
    "<br>\n",
    "<br>\n",
    "$$\\\\\n",
    "    R^{3} = \n",
    "    \\left[ {\\begin{array}{ccc}\n",
    "    4 & 6 & 4 \\\\\n",
    "    0 & 0 & 0 \\\\\n",
    "    4 & 0 & 8\\\\\n",
    "    \\end{array}}\\right] \\\\\n",
    "    \\\\\n",
    "$$\n",
    "<br>\n",
    "<br>\n",
    "We now build the reward model as a dictionary using these matrices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "r = {\n",
    "    'A': {\n",
    "        'cruise': {'A':10, 'B':4, 'C':8},\n",
    "        'stand': {'A':8, 'B':2, 'C':4},\n",
    "        'dispatch': {'A':4, 'B':6, 'C':4}\n",
    "    },\n",
    "    'B': {\n",
    "        'cruise': {'A':14, 'B':0, 'C':18},\n",
    "        'stand': {'A':8, 'B':16, 'C':8},\n",
    "        'dispatch': {'A':0, 'B':0, 'C':0}\n",
    "    },\n",
    "    'C': {\n",
    "        'cruise': {'A':10, 'B':2, 'C':18},\n",
    "        'stand': {'A':6, 'B':4, 'C':2},\n",
    "        'dispatch': {'A':4, 'B':0, 'C':8}\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The Bellman update equation now is defined as follows\n",
    "\n",
    "$$U(s)=\\max_{a\\epsilon A(s)}\\sum_{s'}P(s'\\ |\\ s,a)(R(s'\\ |\\ s,a) + \\gamma U(s'))$$\n",
    "\n",
    "It is not difficult to see that all the update equations we have used till now is just a special case of this more generalized equation. \n",
    "If we did not have next-state-dependent rewards, the first term inside the summation exactly sums up to R(s, a) or the state-reward for a particular action and we would get the update equation used in the previous problem.\n",
    "If we did not have action dependent rewards, the first term inside the summation sums up to R(s) or the state-reward and we would get the first update equation used in `mdp.ipynb`.\n",
    "<br>\n",