nlp_apps.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NATURAL LANGUAGE PROCESSING APPLICATIONS\n",
    "\n",
    "In this notebook we will take a look at some indicative applications of natural language processing. We will cover content from [`nlp.py`](https://github.com/aimacode/aima-python/blob/master/nlp.py) and [`text.py`](https://github.com/aimacode/aima-python/blob/master/text.py), for chapters 22 and 23 of Stuart Russel's and Peter Norvig's book [*Artificial Intelligence: A Modern Approach*](http://aima.cs.berkeley.edu/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CONTENTS\n",
    "\n",
    "* Language Recognition\n",
    "* Author Recognition\n",
    "* The Federalist Papers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LANGUAGE RECOGNITION\n",
    "\n",
    "A very useful application of text models (you can read more on them on the [`text notebook`](https://github.com/aimacode/aima-python/blob/master/text.ipynb)) is categorizing text into a language. In fact, with enough data we can categorize correctly mostly any text. That is because different languages have certain characteristics that set them apart. For example, in German it is very usual for 'c' to be followed by 'h' while in English we see 't' followed by 'h' a lot.\n",
    "\n",
    "Here we will build an application to categorize sentences in either English or German.\n",
    "\n",
    "First we need to build our dataset. We will take as input text in English and in German and we will extract n-gram character models (in this case, *bigrams* for n=2). For English, we will use *Flatland* by Edwin Abbott and for German *Faust* by Goethe.\n",
    "\n",
    "Let's build our text models for each language, which will hold the probability of each bigram occuring in the text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils import open_data\n",
    "from text import *\n",
    "\n",
    "flatland = open_data(\"EN-text/flatland.txt\").read()\n",
    "wordseq = words(flatland)\n",
    "\n",
    "P_flatland = NgramCharModel(2, wordseq)\n",
    "\n",
    "faust = open_data(\"GE-text/faust.txt\").read()\n",
    "wordseq = words(faust)\n",
    "\n",
    "P_faust = NgramCharModel(2, wordseq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use this information to build a *Naive Bayes Classifier* that will be used to categorize sentences (you can read more on Naive Bayes on the [`learning notebook`](https://github.com/aimacode/aima-python/blob/master/learning.ipynb)). The classifier will take as input the probability distribution of bigrams and given a list of bigrams (extracted from the sentence to be classified), it will calculate the probability of the example/sentence coming from each language and pick the maximum.\n",
    "\n",
    "Let's build our classifier, with the assumption that English is as probable as German (the input is a dictionary with values the text models and keys the tuple `language, probability`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from learning import NaiveBayesLearner\n",
    "\n",
    "dist = {('English', 1): P_flatland, ('German', 1): P_faust}\n",
    "\n",
    "nBS = NaiveBayesLearner(dist, simple=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to write a function that takes as input a sentence, breaks it into a list of bigrams and classifies it with the naive bayes classifier from above.\n",
    "\n",
    "Once we get the text model for the sentence, we need to unravel it. The text models show the probability of each bigram, but the classifier can't handle that extra data. It requires a simple *list* of bigrams. So, if the text model shows that a bigram appears three times, we need to add it three times in the list. Since the text model stores the n-gram information in a dictionary (with the key being the n-gram and the value the number of times the n-gram appears) we need to iterate through the items of the dictionary and manually add them to the list of n-grams."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def recognize(sentence, nBS, n):\n",
    "    sentence = sentence.lower()\n",
    "    wordseq = words(sentence)\n",
    "    \n",
    "    P_sentence = NgramCharModel(n, wordseq)\n",
    "    \n",
    "    ngrams = []\n",
    "    for b, p in P_sentence.dictionary.items():\n",
    "        ngrams += [b]*p\n",
    "    \n",
    "    print(ngrams)\n",
    "    \n",
    "    return nBS(ngrams)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can start categorizing sentences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(' ', 'i'), ('i', 'c'), ('c', 'h'), (' ', 'b'), ('b', 'i'), ('i', 'n'), ('i', 'n'), (' ', 'e'), ('e', 'i'), (' ', 'p'), ('p', 'l'), ('l', 'a'), ('a', 't'), ('t', 'z')]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'German'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recognize(\"Ich bin ein platz\", nBS, 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(' ', 't'), ('t', 'u'), ('u', 'r'), ('r', 't'), ('t', 'l'), ('l', 'e'), ('e', 's'), (' ', 'f'), ('f', 'l'), ('l', 'y'), (' ', 'h'), ('h', 'i'), ('i', 'g'), ('g', 'h')]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'English'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recognize(\"Turtles fly high\", nBS, 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(' ', 'd'), ('d', 'e'), ('e', 'r'), ('e', 'r'), (' ', 'p'), ('p', 'e'), ('e', 'l'), ('l', 'i'), ('i', 'k'), ('k', 'a'), ('a', 'n'), (' ', 'i'), ('i', 's'), ('s', 't'), (' ', 'h'), ('h', 'i'), ('i', 'e')]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'German'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recognize(\"Der pelikan ist hier\", nBS, 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(' ', 'a'), ('a', 'n'), ('n', 'd'), (' ', 't'), (' ', 't'), ('t', 'h'), ('t', 'h'), ('h', 'u'), ('u', 's'), ('h', 'e'), (' ', 'w'), ('w', 'i'), ('i', 'z'), ('z', 'a'), ('a', 'r'), ('r', 'd'), (' ', 's'), ('s', 'p'), ('p', 'o'), ('o', 'k'), ('k', 'e')]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'English'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recognize(\"And thus the wizard spoke\", nBS, 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can add more languages if you want, the algorithm works for as many as you like! Also, you can play around with *n*. Here we used 2, but other numbers work too (even though 2 suffices). The algorithm is not perfect, but it has high accuracy even for small samples like the ones we used. That is because English and German are very different languages. The closer together languages are (for example, Norwegian and Swedish share a lot of common ground) the lower the accuracy of the classifier."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## AUTHOR RECOGNITION\n",
    "\n",
    "Another similar application to language recognition is recognizing who is more likely to have written a sentence, given text written by them. Here we will try and predict text from Edwin Abbott and Jane Austen. They wrote *Flatland* and *Pride and Prejudice* respectively.\n",
    "\n",
    "We are optimistic we can determine who wrote what based on the fact that Abbott wrote his novella on much later date than Austen, which means there will be linguistic differences between the two works. Indeed, *Flatland* uses more modern and direct language while *Pride and Prejudice* is written in a more archaic tone containing more sophisticated wording.\n",
    "\n",
    "Similarly with Language Recognition, we will first import the two datasets. This time though we are not looking for connections between characters, since that wouldn't give that great results. Why? Because both authors use English and English follows a set of patterns, as we show earlier. Trying to determine authorship based on this patterns would not be very efficient.\n",
    "\n",
    "Instead, we will abstract our querying to a higher level. We will use words instead of characters. That way we can more accurately pick at the differences between their writing style and thus have a better chance at guessing the correct author.\n",
    "\n",
    "Let's go right ahead and import our data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils import open_data\n",
    "from text import *\n",
    "\n",
    "flatland = open_data(\"EN-text/flatland.txt\").read()\n",
    "wordseq = words(flatland)\n",
    "\n",
    "P_Abbott = UnigramWordModel(wordseq, 5)\n",
    "\n",
    "pride = open_data(\"EN-text/pride.txt\").read()\n",
    "wordseq = words(pride)\n",
    "\n",
    "P_Austen = UnigramWordModel(wordseq, 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This time we set the `default` parameter of the model to 5, instead of 0. If we leave it at 0, then when we get a sentence containing a word we have not seen from that particular author, the chance of that sentence coming from that author is exactly 0 (since to get the probability, we multiply all the separate probabilities; if one is 0 then the result is also 0). To avoid that, we tell the model to add 5 to the count of all the words that appear.\n",
    "\n",
    "Next we will build the Naive Bayes Classifier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from learning import NaiveBayesLearner\n",
    "\n",
    "dist = {('Abbott', 1): P_Abbott, ('Austen', 1): P_Austen}\n",
    "\n",
    "nBS = NaiveBayesLearner(dist, simple=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have build our classifier, we will start classifying. First, we need to convert the given sentence to the format the classifier needs. That is, a list of words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def recognize(sentence, nBS):\n",
    "    sentence = sentence.lower()\n",
    "    sentence_words = words(sentence)\n",
    "    \n",
    "    return nBS(sentence_words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we will input a sentence that is something Abbott would write. Note the use of square and the simpler language."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Abbott'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recognize(\"the square is mad\", nBS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The classifier correctly guessed Abbott.\n",
    "\n",
    "Next we will input a more sophisticated sentence, similar to the style of Austen."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Austen'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recognize(\"a most peculiar acquaintance\", nBS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The classifier guessed correctly again.\n",
    "\n",
    "You can try more sentences on your own. Unfortunately though, since the datasets are pretty small, chances are the guesses will not always be correct."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## THE FEDERALIST PAPERS\n",
    "\n",
    "Let's now take a look at a harder problem, classifying the authors of the [Federalist Papers](https://en.wikipedia.org/wiki/The_Federalist_Papers). The *Federalist Papers* are a series of papers written by Alexander Hamilton, James Madison and John Jay towards establishing the United States Constitution.\n",
    "\n",
    "What is interesting about these papers is that they were all written under a pseudonym, \"Publius\", to keep the identity of the authors a secret. Only after Hamilton's death, when a list was found written by him detailing the authorship of the papers, did the rest of the world learn what papers each of the authors wrote. After the list was published, Madison chimed in to make a couple of corrections: Hamilton, Madison said, hastily wrote down the list and assigned some papers to the wrong author!\n",
    "\n",
    "Here we will try and find out who really wrote these mysterious papers.\n",
    "\n",
    "To solve this we will learn from the undisputed papers to predict the disputed ones. First, let's read the texts from the file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils import open_data\n",
    "from text import *\n",
    "\n",
    "federalist = open_data(\"EN-text/federalist.txt\").read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see how the text looks. We will print the first 500 characters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The Project Gutenberg EBook of The Federalist Papers, by \\nAlexander Hamilton and John Jay and James Madison\\n\\nThis eBook is for the use of anyone anywhere at no cost and with\\nalmost no restrictions whatsoever.  You may copy it, give it away or\\nre-use it under the terms of the Project Gutenberg License included\\nwith this eBook or online at www.gutenberg.net\\n\\n\\nTitle: The Federalist Papers\\n\\nAuthor: Alexander Hamilton\\n        John Jay\\n        James Madison\\n\\nPosting Date: December 12, 2011 [EBook #18]'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "federalist[:500]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems that the text file opens with a license agreement, hardly useful in our case. In fact, the license spans 113 words, while there is also a licensing agreement at the end of the file, which spans 3098 words. We need to remove them. To do so, we will first convert the text into words, to make our lives easier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "wordseq = words(federalist)\n",
    "wordseq = wordseq[114:-3098]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now take a look at the first 100 words:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'federalist no 1 general introduction for the independent journal hamilton to the people of the state of new york after an unequivocal experience of the inefficacy of the subsisting federal government you are called upon to deliberate on a new constitution for the united states of america the subject speaks its own importance comprehending in its consequences nothing less than the existence of the union the safety and welfare of the parts of which it is composed the fate of an empire in many respects the most interesting in the world it has been frequently remarked that it seems to'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join(wordseq[:100])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Much better.\n",
    "\n",
    "As with any Natural Language Processing problem, it is prudent to do some text pre-processing and clean our data before we start building our model. Remember that all the papers are signed as 'Publius', so we can safely remove that word, since it doesn't give us any information as to the real author.\n",
    "\n",
    "NOTE: Since we are only removing a single word from each paper, this step can be skipped. We add it here to show that processing the data in our hands is something we should always be considering. Oftentimes pre-processing the data in just the right way is the difference between a robust model and a flimsy one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "wordseq = [w for w in wordseq if w != 'publius']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have to separate the text from a block of words into papers and assign them to their authors. We can see that each paper starts with the word 'federalist', so we will split the text on that word.\n",
    "\n",
    "The disputed papers are the papers from 49 to 58, from 18 to 20 and paper 64. We want to leave these papers unassigned. Also, note that there are two versions of paper 70; both from Hamilton.\n",
    "\n",
    "Finally, to keep the implementation intuitive, we add a `None` object at the start of the `papers` list to make the list index match up with the paper numbering (for example, `papers[5]` now corresponds to paper no. 5 instead of the paper no.6 in the 0-indexed Python)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(4, 16, 52)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "papers = re.split(r'federalist\\s', ' '.join(wordseq))\n",
    "papers = [p for p in papers if p not in ['', ' ']]\n",
    "papers = [None] + papers\n",
    "\n",
    "disputed = list(range(49, 58+1)) + [18, 19, 20, 64]\n",
    "jay, madison, hamilton = [], [], []\n",
    "for i, p in enumerate(papers):\n",
    "    if i in disputed or i == 0:\n",
    "        continue\n",
    "    \n",
    "    if 'jay' in p:\n",
    "        jay.append(p)\n",
    "    elif 'madison' in p:\n",
    "        madison.append(p)\n",
    "    else:\n",
    "        hamilton.append(p)\n",
    "\n",
    "len(jay), len(madison), len(hamilton)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, from the undisputed papers Jay wrote 4, Madison 17 and Hamilton 51 (+1 duplicate). Let's now build our word models. The Unigram Word Model again will come in handy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "hamilton = ''.join(hamilton)\n",
    "hamilton_words = words(hamilton)\n",
    "P_hamilton = UnigramWordModel(hamilton_words, default=1)\n",
    "\n",
    "madison = ''.join(madison)\n",
    "madison_words = words(madison)\n",
    "P_madison = UnigramWordModel(madison_words, default=1)\n",
    "\n",
    "jay = ''.join(jay)\n",
    "jay_words = words(jay)\n",
    "P_jay = UnigramWordModel(jay_words, default=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now it is time to build our new Naive Bayes Learner. It is very similar to the one found in `learning.py`, but with an important difference: it doesn't classify an example, but instead returns the probability of the example belonging to each class. This will allow us to not only see to whom a paper belongs to, but also the probability of authorship as well.\n",
    "\n",
    "Finally, since we are dealing with long text and the string of probability multiplications is long, we will end up with the results being rounded to 0 due to floating point underflow. To work around this problem we will use the built-in Python library `decimal`, which allows as to set decimal precision to much larger than normal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "import decimal\n",
    "from decimal import Decimal\n",
    "\n",
    "decimal.getcontext().prec = 100\n",
    "\n",
    "def precise_product(numbers):\n",
    "    result = 1\n",
    "    for x in numbers:\n",
    "        result *= Decimal(x)\n",
    "    return result\n",
    "\n",
    "\n",
    "def NaiveBayesLearner(dist):\n",
    "    \"\"\"A simple naive bayes classifier that takes as input a dictionary of\n",
    "    Counter distributions and can then be used to find the probability\n",
    "    of a given item belonging to each class.\n",
    "    The input dictionary is in the following form:\n",
    "        ClassName: Counter\"\"\"\n",
    "    attr_dist = {c_name: count_prob for c_name, count_prob in dist.items()}\n",
    "\n",
    "    def predict(example):\n",
    "        \"\"\"Predict the probabilities for each class.\"\"\"\n",
    "        def class_prob(target, e):\n",
    "            attr = attr_dist[target]\n",
    "            return precise_product([attr[a] for a in e])\n",
    "\n",
    "        pred = {t: class_prob(t, example) for t in dist.keys()}\n",
    "\n",
    "        total = sum(pred.values())\n",
    "        for k, v in pred.items():\n",
    "            pred[k] = v / total\n",
    "\n",
    "        return pred\n",
    "\n",
    "    return predict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we will build our Learner. Note that even though Hamilton wrote the most papers, that doesn't make it more probable that he wrote the rest, so all the class probabilities will be equal. We can change them if we have some external knowledge, which for this tutorial we do not have."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "dist = {('Madison', 1): P_madison, ('Hamilton', 1): P_hamilton, ('Jay', 1): P_jay}\n",
    "nBS = NaiveBayesLearner(dist)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As usual, the `recognize` function will take as input a string and after removing capitalization and splitting it into words, will feed it into the Naive Bayes Classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def recognize(sentence, nBS):\n",
    "    return nBS(words(sentence.lower()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can start predicting the disputed papers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Paper No. 49: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
      "Paper No. 50: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
      "Paper No. 51: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
      "Paper No. 52: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
      "Paper No. 53: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
      "Paper No. 54: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
      "Paper No. 55: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
      "Paper No. 56: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
      "Paper No. 57: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
      "Paper No. 58: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
      "Paper No. 18: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
      "Paper No. 19: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
      "Paper No. 20: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
      "Paper No. 64: Hamilton: 1.00 Madison: 0.00 Jay: 0.00\n"
     ]
    }
   ],
   "source": [
    "for d in disputed:\n",
    "    probs = recognize(papers[d], nBS)\n",
    "    results = ['{}: {:.2f}'.format(name, probs[(name, 1)]) for name in 'Hamilton Madison Jay'.split()]\n",
    "    print('Paper No. {}: {}'.format(d, ' '.join(results)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a simple approach to the problem and thankfully researchers are fairly certain that papers 49-58 were all written by Madison, while 18-20 were written in collaboration between Hamilton and Madison, with Madison being credited for most of the work. Our classifier is not that far off. It correctly identifies the papers written by Madison, even the ones in collaboration with Hamilton.\n",
    "\n",
    "Unfortunately, it misses paper 64. Consensus is that the paper was written by John Jay, while our classifier believes it was written by Hamilton. The classifier is wrong there because it does not have much information on Jay's writing; only 4 papers. This is one of the problems with using unbalanced datasets such as this one, where information on some classes is sparser than information on the rest. To avoid this, we can add more writings for Jay and Madison to end up with an equal amount of data for each author."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}