probability-4e.ipynb

      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# And if you have both headache and fever, and were not vaccinated, \n",
    "# then the flu is very likely, especially if it is a high fever.\n",
    "enumeration_ask(Flu, {Headache: T, Fever: 'mild', Vaccinated: F}, flu_net)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{F: 0.055534567434831886, T: 0.9444654325651682}"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "enumeration_ask(Flu, {Headache: T, Fever: 'high', Vaccinated: F}, flu_net)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Entropy\n",
    "\n",
    "We can compute the entropy of a probability distribution:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def entropy(probdist):\n",
    "    \"The entropy of a probability distribution.\"\n",
    "    return - sum(p * math.log(p, 2)\n",
    "                 for p in probdist.values())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "entropy(ProbDist(heads=0.5, tails=0.5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.011397802630112312"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "entropy(ProbDist(yes=1000, no=1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8687212463394045"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "entropy(P(Alarm, {Earthquake: T, Burglary: F}))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.011407757737461138"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "entropy(P(Alarm, {Earthquake: F, Burglary: F}))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For non-Boolean variables, the entropy can be greater than 1 bit:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.5"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "entropy(P(Fever, {Flu: T}))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "# Unknown Outcomes: Smoothing\n",
    "\n",
    "So far we have dealt with discrete distributions where we know all the possible outcomes in advance. For Boolean variables, the only outcomes are `T` and `F`. For `Fever`, we modeled exactly three outcomes.  However, in some applications we will encounter new, previously unknown outcomes over time. For example, we could train a model on the distribution of words in English, and then somebody could coin a brand new word. To deal with this, we introduce\n",
    "the `DefaultProbDist` distribution, which uses the key `None` to stand as a placeholder for any unknown outcome(s)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "class DefaultProbDist(ProbDist):\n",
    "    \"\"\"A Probability Distribution that supports smoothing for unknown outcomes (keys).\n",
    "    The default_value represents the probability of an unknown (previously unseen) key. \n",
    "    The key `None` stands for unknown outcomes.\"\"\"\n",
    "    def __init__(self, default_value, mapping=(), **kwargs):\n",
    "        self[None] = default_value\n",
    "        self.update(mapping, **kwargs)\n",
    "        normalize(self)\n",
    "        \n",
    "    def __missing__(self, key): return self[None]        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "def words(text): return re.findall(r'\\w+', text.lower())\n",
    "\n",
    "english = words('''This is a sample corpus of English prose. To get a better model, we would train on much\n",
    "more text. But this should give you an idea of the process. So far we have dealt with discrete \n",
    "distributions where we  know all the possible outcomes in advance. For Boolean variables, the only \n",
    "outcomes are T and F. For Fever, we modeled exactly three outcomes. However, in some applications we \n",
    "will encounter new, previously unknown outcomes over time. For example, when we could train a model on the \n",
    "words in this text, we get a distribution, but somebody could coin a brand new word. To deal with this, \n",
    "we introduce the DefaultProbDist distribution, which uses the key `None` to stand as a placeholder for any \n",
    "unknown outcomes. Probability theory allows us to compute the likelihood of certain events, given \n",
    "assumptions about the components of the event. A Bayesian network, or Bayes net for short, is a data \n",
    "structure to represent a joint probability distribution over several random variables, and do inference on it.''')\n",
    "\n",
    "E = DefaultProbDist(0.1, Counter(english))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.052295177222545036"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 'the' is a common word:\n",
    "E['the']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.005810575246949448"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 'possible' is a less-common word:\n",
    "E['possible']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0005810575246949449"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 'impossible' was not seen in the training data, but still gets a non-zero probability ...\n",
    "E['impossible']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0005810575246949449"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# ... as do other rare, previously unseen words:\n",
    "E['llanfairpwllgwyngyll']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that this does not mean that 'impossible' and 'llanfairpwllgwyngyll' and all the other unknown words\n",
    "*each* have probability 0.004.\n",
    "Rather, it means that together, all the unknown words total probability 0.004. With that\n",
    "interpretation, the sum of all the probabilities is still 1, as it should be. In the `DefaultProbDist`, the\n",
    "unknown words are all represented by the key `None`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0005810575246949449"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "E[None]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}