{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# Reinforcement Learning\n",
    "\n",
    "This IPy notebook acts as supporting material for **Chapter 21 Reinforcement Learning** of the book* Artificial Intelligence: A Modern Approach*. This notebook makes use of the implementations in rl.py module. We also make use of implementation of MDPs in the mdp.py module to test our agents. It might be helpful if you have already gone through the IPy notebook dealing with Markov decision process. Let us import everything from the rl module. It might be helpful to view the source of some of our implementations. Please refer to the Introductory IPy file for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from rl import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## Review\n",
    "Before we start playing with the actual implementations let us review a couple of things about RL.\n",
    "\n",
    "1. Reinforcement Learning is concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. \n",
    "\n",
    "2. Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).\n",
    "\n",
    "-- Source: [Wikipedia](https://en.wikipedia.org/wiki/Reinforcement_learning)\n",
    "\n",
    "In summary we have a sequence of state action transitions with rewards associated with some states. Our goal is to find the optimal policy (pi) which tells us what action to take in each state."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Passive Reinforcement Learning\n",
    "\n",
    "In passive Reinforcement Learning the agent follows a fixed policy and tries to learn the Reward function and the Transition model (if it is not aware of that).\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Passive Temporal Difference Agent\n",
    "\n",
    "The PassiveTDAgent class in the rl module implements the Agent Program (notice the usage of word Program) described in **Fig 21.4** of the AIMA Book. PassiveTDAgent uses temporal differences to learn utility estimates. In simple terms we learn the difference between the states and backup the values to previous states while following a fixed policy.  Let us look into the source before we see some usage examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%psource PassiveTDAgent"
   ]
  },
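  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At the heart of the agent is the temporal-difference update applied to the state that was just left: $U(s) \\leftarrow U(s) + \\alpha(N_s)\\,(r + \\gamma\\,U(s') - U(s))$, where $N_s$ is the number of times state $s$ has been visited. The cell below is a rough, self-contained sketch of that rule (purely illustrative, not the rl.py source; `td_update`, `U` and `Ns` are made-up names):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# A minimal sketch of the TD(0) utility update (illustrative, not rl.py's code).\n",
    "# U and Ns are plain dicts holding utility estimates and visit counts per state.\n",
    "def td_update(U, Ns, s, s1, r, gamma=0.9, alpha=lambda n: 60. / (59 + n)):\n",
    "    Ns[s] = Ns.get(s, 0) + 1\n",
    "    U.setdefault(s, 0.0)\n",
    "    U.setdefault(s1, 0.0)\n",
    "    # Move U[s] a step of size alpha towards the one-step sample r + gamma * U[s1].\n",
    "    U[s] = U[s] + alpha(Ns[s]) * (r + gamma * U[s1] - U[s])\n",
    "    return U[s]"
   ]
  },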
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Agent Program can be obtained by creating the instance of the class by passing the appropriate parameters. Because of the __ call __ method the object that is created behaves like a callable and returns an appropriate action as most Agent Programs do. To instantiate the object we need a policy(pi) and a mdp whose utility of states will be estimated. Let us import a GridMDP object from the mdp module. **Fig[17, 1]** is similar to **Fig[21, 1]** but has some discounting as **gamma = 0.9**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from mdp import Fig"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<mdp.GridMDP at 0x7f1f0c77ab00>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Fig[17,1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Fig[17,1]** is a GridMDP object and is similar to the grid shown in **Fig 21.1**. The rewards in the terminal states are **+1** and **-1** and **-0.04** in rest of the states. <img src=\"files/images/mdp.png\"> Now we define a policy similar to **Fig 21.1** in the book."
   ]
  },
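  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The construction below is purely illustrative and roughly mirrors the definition of **Fig[17,1]** in mdp.py (the exact call there may differ slightly): the grid is given with the top row first, None marks the obstacle, and the two terminal states carry the rewards **+1** and **-1**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Sketch of constructing a GridMDP like Fig[17,1] (compare with mdp.py).\n",
    "# The gamma = 0.9 discount mentioned above is assumed to come from the\n",
    "# constructor's defaults.\n",
    "from mdp import GridMDP\n",
    "\n",
    "sketch_mdp = GridMDP([[-0.04, -0.04, -0.04, +1],\n",
    "                      [-0.04, None,  -0.04, -1],\n",
    "                      [-0.04, -0.04, -0.04, -0.04]],\n",
    "                     terminals=[(3, 2), (3, 1)])"
   ]
  },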
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "policy = {(0, 0): (0, 1),\n",
    "          (0, 1): (0, 1),\n",
    "          (0, 2): (1, 0),\n",
    "          (1, 0): (-1, 0),\n",
    "          (1, 2): (1, 0),\n",
    "          (2, 0): (-1, 0),\n",
    "          (2, 1): (0, 1),\n",
    "          (2, 2): (1, 0),\n",
    "          (3, 0): (-1, 0),\n",
    "          (3, 1): None,\n",
    "          (3, 2): None,\n",
    "         }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us create our object now. We also use the **same alpha** as given in the footnote of the book on **page 837**."
   ]
  },
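  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This **alpha** equals 1 on the first visit to a state and shrinks as the visit count n grows, which is what lets the utility estimates settle down. The cell below is a purely illustrative check of a few of its values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# The learning rate from the footnote on page 837, evaluated for a few visit counts.\n",
    "alpha = lambda n: 60. / (59 + n)\n",
    "[round(alpha(n), 3) for n in (1, 10, 100, 1000)]"
   ]
  },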
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "our_agent = PassiveTDAgent(policy, Fig[17,1], alpha=lambda n: 60./(59+n))"
   ]
  },
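  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The object we just created is the callable Agent Program described above: invoking it with a percept returns the action the fixed policy prescribes for that state. The quick check below is purely illustrative; it assumes the (state, reward) percept format used in rl.py and uses a separate throwaway agent so that the statistics of our_agent stay untouched."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Illustrative only: call the agent program with a (state, reward) percept and\n",
    "# get back the policy's action for that state. A throwaway agent is used so the\n",
    "# statistics of our_agent are not affected.\n",
    "demo_agent = PassiveTDAgent(policy, Fig[17,1], alpha=lambda n: 60./(59 + n))\n",
    "demo_agent(((0, 0), -0.04))"
   ]
  },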
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The rl module also has a simple implementation to simulate iterations. The function is called **run_single_trial**. Now we can try our implementation. We can also compare the utility estimates learned by our agent to those obtained via **value iteration**.\n"
   ]
  },
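  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sketch below shows roughly what a single trial does (an illustrative re-implementation, not the rl.py source): start at the MDP's initial state, repeatedly feed (state, reward) percepts to the agent program, sample the next state from the transition model, and stop once a terminal state is reached. It assumes the mdp.py interface of mdp.init, mdp.R(s) and mdp.T(s, a), with mdp.T returning (probability, next_state) pairs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "# Illustrative sketch of a single trial (not the rl.py implementation).\n",
    "def run_single_trial_sketch(agent_program, mdp):\n",
    "    def take_single_action(mdp, s, a):\n",
    "        # Sample the next state according to the transition model P(s' | s, a).\n",
    "        x = random.uniform(0, 1)\n",
    "        cumulative = 0.0\n",
    "        for probability, state in mdp.T(s, a):\n",
    "            cumulative += probability\n",
    "            if x < cumulative:\n",
    "                return state\n",
    "        return state\n",
    "    current_state = mdp.init\n",
    "    while True:\n",
    "        percept = (current_state, mdp.R(current_state))\n",
    "        next_action = agent_program(percept)\n",
    "        if next_action is None:  # the agent signals that a terminal state was reached\n",
    "            break\n",
    "        current_state = take_single_action(mdp, current_state, next_action)"
   ]
  },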
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from mdp import value_iteration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The values calculated by value iteration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{(0, 1): 0.3984432178350045, (1, 2): 0.649585681261095, (3, 2): 1.0, (0, 0): 0.2962883154554812, (3, 0): 0.12987274656746342, (3, 1): -1.0, (2, 1): 0.48644001739269643, (2, 0): 0.3447542300124158, (2, 2): 0.7953620878466678, (1, 0): 0.25386699846479516, (0, 2): 0.5093943765842497}\n"
     ]
    }
   ],
   "source": [
    "print(value_iteration(Fig[17,1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the values estimated by our agent after 200 trials."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}