{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NATURAL LANGUAGE PROCESSING APPLICATIONS\n",
    "\n",
    "In this notebook we will take a look at some indicative applications of natural language processing. We will cover content from [`nlp.py`](https://github.com/aimacode/aima-python/blob/master/nlp.py) and [`text.py`](https://github.com/aimacode/aima-python/blob/master/text.py), for chapters 22 and 23 of Stuart Russel's and Peter Norvig's book [*Artificial Intelligence: A Modern Approach*](http://aima.cs.berkeley.edu/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CONTENTS\n",
    "\n",
    "* Language Recognition\n",
    "* Author Recognition\n",
    "* The Federalist Papers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LANGUAGE RECOGNITION\n",
    "\n",
    "A very useful application of text models (you can read more on them on the [`text notebook`](https://github.com/aimacode/aima-python/blob/master/text.ipynb)) is categorizing text into a language. In fact, with enough data we can categorize correctly mostly any text. That is because different languages have certain characteristics that set them apart. For example, in German it is very usual for 'c' to be followed by 'h' while in English we see 't' followed by 'h' a lot.\n",
    "\n",
    "Here we will build an application to categorize sentences in either English or German.\n",
    "\n",
    "First we need to build our dataset. We will take as input text in English and in German and we will extract n-gram character models (in this case, *bigrams* for n=2). For English, we will use *Flatland* by Edwin Abbott and for German *Faust* by Goethe.\n",
    "\n",
    "Let's build our text models for each language, which will hold the probability of each bigram occuring in the text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils import open_data\n",
    "from text import *\n",
    "\n",
    "flatland = open_data(\"EN-text/flatland.txt\").read()\n",
    "wordseq = words(flatland)\n",
    "\n",
    "P_flatland = NgramCharModel(2, wordseq)\n",
    "\n",
    "faust = open_data(\"GE-text/faust.txt\").read()\n",
    "wordseq = words(faust)\n",
    "\n",
    "P_faust = NgramCharModel(2, wordseq)"
   ]
  },
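  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check of the claim about 'ch' and 'th' above, we can peek at the most common bigrams in each model. This is just an illustrative sketch; it only assumes that the models expose their raw counts through the `dictionary` attribute, which the `recognize` function further below relies on as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: print the most frequent character bigrams of each model.\n",
    "# The counts live in each model's `dictionary` attribute (bigram -> count).\n",
    "def top_bigrams(model, k=10):\n",
    "    return sorted(model.dictionary.items(), key=lambda kv: kv[1], reverse=True)[:k]\n",
    "\n",
    "print('English:', top_bigrams(P_flatland))\n",
    "print('German: ', top_bigrams(P_faust))"
   ]
  },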
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use this information to build a *Naive Bayes Classifier* that will be used to categorize sentences (you can read more on Naive Bayes on the [`learning notebook`](https://github.com/aimacode/aima-python/blob/master/learning.ipynb)). The classifier will take as input the probability distribution of bigrams and given a list of bigrams (extracted from the sentence to be classified), it will calculate the probability of the example/sentence coming from each language and pick the maximum.\n",
    "\n",
    "Let's build our classifier, with the assumption that English is as probable as German (the input is a dictionary with values the text models and keys the tuple `language, probability`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from learning import NaiveBayesLearner\n",
    "\n",
    "dist = {('English', 1): P_flatland, ('German', 1): P_faust}\n",
    "\n",
    "nBS = NaiveBayesLearner(dist, simple=True)"
   ]
  },
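  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before writing the full `recognize` function, here is a minimal conceptual sketch of what the simple Naive Bayes classifier does with a list of bigrams: multiply the per-bigram probabilities under each language model (together with the class prior) and pick the language with the highest score. This is only an illustration of the idea, not the aima-python implementation; it works directly on the raw counts in each model's `dictionary` and uses a hypothetical small floor for unseen bigrams."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Conceptual sketch of Naive Bayes over bigrams (illustration only).\n",
    "def sketch_classify(bigrams, models, priors):\n",
    "    scores = {}\n",
    "    for lang, model in models.items():\n",
    "        total = sum(model.dictionary.values())\n",
    "        score = priors[lang]\n",
    "        for bg in bigrams:\n",
    "            # Unseen bigrams get a small floor so the product does not collapse to 0.\n",
    "            score *= max(model.dictionary.get(bg, 0), 0.5) / total\n",
    "        scores[lang] = score\n",
    "    return max(scores, key=scores.get)\n",
    "\n",
    "sketch_classify([(' ', 't'), ('t', 'h'), ('h', 'e')],\n",
    "                {'English': P_flatland, 'German': P_faust},\n",
    "                {'English': 0.5, 'German': 0.5})"
   ]
  },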
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to write a function that takes as input a sentence, breaks it into a list of bigrams and classifies it with the naive bayes classifier from above.\n",
    "\n",
    "Once we get the text model for the sentence, we need to unravel it. The text models show the probability of each bigram, but the classifier can't handle that extra data. It requires a simple *list* of bigrams. So, if the text model shows that a bigram appears three times, we need to add it three times in the list. Since the text model stores the n-gram information in a dictionary (with the key being the n-gram and the value the number of times the n-gram appears) we need to iterate through the items of the dictionary and manually add them to the list of n-grams."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def recognize(sentence, nBS, n):\n",
    "    sentence = sentence.lower()\n",
    "    wordseq = words(sentence)\n",
    "    \n",
    "    P_sentence = NgramCharModel(n, wordseq)\n",
    "    \n",
    "    ngrams = []\n",
    "    for b, p in P_sentence.dictionary.items():\n",
    "        ngrams += [b]*p\n",
    "    \n",
    "    print(ngrams)\n",
    "    \n",
    "    return nBS(ngrams)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can start categorizing sentences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(' ', 'i'), ('i', 'c'), ('c', 'h'), (' ', 'b'), ('b', 'i'), ('i', 'n'), ('i', 'n'), (' ', 'e'), ('e', 'i'), (' ', 'p'), ('p', 'l'), ('l', 'a'), ('a', 't'), ('t', 'z')]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'German'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recognize(\"Ich bin ein platz\", nBS, 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(' ', 't'), ('t', 'u'), ('u', 'r'), ('r', 't'), ('t', 'l'), ('l', 'e'), ('e', 's'), (' ', 'f'), ('f', 'l'), ('l', 'y'), (' ', 'h'), ('h', 'i'), ('i', 'g'), ('g', 'h')]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'English'"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recognize(\"Turtles fly high\", nBS, 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(' ', 'd'), ('d', 'e'), ('e', 'r'), ('e', 'r'), (' ', 'p'), ('p', 'e'), ('e', 'l'), ('l', 'i'), ('i', 'k'), ('k', 'a'), ('a', 'n'), (' ', 'i'), ('i', 's'), ('s', 't'), (' ', 'h'), ('h', 'i'), ('i', 'e')]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'German'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recognize(\"Der pelikan ist hier\", nBS, 2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(' ', 'a'), ('a', 'n'), ('n', 'd'), (' ', 't'), (' ', 't'), ('t', 'h'), ('t', 'h'), ('h', 'u'), ('u', 's'), ('h', 'e'), (' ', 'w'), ('w', 'i'), ('i', 'z'), ('z', 'a'), ('a', 'r'), ('r', 'd'), (' ', 's'), ('s', 'p'), ('p', 'o'), ('o', 'k'), ('k', 'e')]\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'English'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recognize(\"And thus the wizard spoke\", nBS, 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can add more languages if you want, the algorithm works for as many as you like! Also, you can play around with *n*. Here we used 2, but other numbers work too (even though 2 suffices). The algorithm is not perfect, but it has high accuracy even for small samples like the ones we used. That is because English and German are very different languages. The closer together languages are (for example, Norwegian and Swedish share a lot of common ground) the lower the accuracy of the classifier."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## AUTHOR RECOGNITION\n",
    "\n",
    "Another similar application to language recognition is recognizing who is more likely to have written a sentence, given text written by them. Here we will try and predict text from Edwin Abbott and Jane Austen. They wrote *Flatland* and *Pride and Prejudice* respectively.\n",
    "\n",
    "We are optimistic we can determine who wrote what based on the fact that Abbott wrote his novella on much later date than Austen, which means there will be linguistic differences between the two works. Indeed, *Flatland* uses more modern and direct language while *Pride and Prejudice* is written in a more archaic tone containing more sophisticated wording.\n",
    "\n",
    "Similarly with Language Recognition, we will first import the two datasets. This time though we are not looking for connections between characters, since that wouldn't give that great results. Why? Because both authors use English and English follows a set of patterns, as we show earlier. Trying to determine authorship based on this patterns would not be very efficient.\n",
    "\n",
    "Instead, we will abstract our querying to a higher level. We will use words instead of characters. That way we can more accurately pick at the differences between their writing style and thus have a better chance at guessing the correct author.\n",
    "\n",
    "Let's go right ahead and import our data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils import open_data\n",
    "from text import *\n",
    "\n",
    "flatland = open_data(\"EN-text/flatland.txt\").read()\n",
    "wordseq = words(flatland)\n",
    "\n",
    "P_Abbott = UnigramWordModel(wordseq, 5)\n",
    "\n",
    "pride = open_data(\"EN-text/pride.txt\").read()\n",
    "wordseq = words(pride)\n",
    "\n",
    "P_Austen = UnigramWordModel(wordseq, 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This time we set the `default` parameter of the model to 5, instead of 0. If we leave it at 0, then when we get a sentence containing a word we have not seen from that particular author, the chance of that sentence coming from that author is exactly 0 (since to get the probability, we multiply all the separate probabilities; if one is 0 then the result is also 0). To avoid that, we tell the model to add 5 to the count of all the words that appear.\n",
    "\n",
    "Next we will build the Naive Bayes Classifier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from learning import NaiveBayesLearner\n",
    "\n",
    "dist = {('Abbott', 1): P_Abbott, ('Austen', 1): P_Austen}\n",
    "\n",
    "nBS = NaiveBayesLearner(dist, simple=True)"
   ]
  },
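  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before classifying, here is a tiny, self-contained sketch of why a non-zero `default` matters. It uses made-up counts rather than the real data and plain Python rather than the aima-python classes: with a default of 0, a single unseen word drives the whole product to 0, while a small default keeps the score positive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from functools import reduce\n",
    "\n",
    "# Toy word counts for a hypothetical author (not taken from the actual texts).\n",
    "toy_counts = {'square': 8, 'line': 5, 'flat': 3}\n",
    "\n",
    "def sentence_score(sentence_words, counts, default=0):\n",
    "    # Product of per-word scores; `default` is the count used for unseen words.\n",
    "    total = sum(counts.values())\n",
    "    return reduce(lambda acc, w: acc * (counts.get(w, default) / total), sentence_words, 1)\n",
    "\n",
    "print(sentence_score(['the', 'square', 'is', 'mad'], toy_counts, default=0))  # 0.0\n",
    "print(sentence_score(['the', 'square', 'is', 'mad'], toy_counts, default=5))  # small but non-zero"
   ]
  },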
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have build our classifier, we will start classifying. First, we need to convert the given sentence to the format the classifier needs. That is, a list of words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def recognize(sentence, nBS):\n",
    "    sentence = sentence.lower()\n",
    "    sentence_words = words(sentence)\n",
    "    \n",
    "    return nBS(sentence_words)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we will input a sentence that is something Abbott would write. Note the use of square and the simpler language."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Abbott'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recognize(\"the square is mad\", nBS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The classifier correctly guessed Abbott.\n",
    "\n",
    "Next we will input a more sophisticated sentence, similar to the style of Austen."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Austen'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recognize(\"a most peculiar acquaintance\", nBS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The classifier guessed correctly again.\n",
    "\n",
    "You can try more sentences on your own. Unfortunately though, since the datasets are pretty small, chances are the guesses will not always be correct."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## THE FEDERALIST PAPERS\n",
    "\n",
    "Let's now take a look at a harder problem, classifying the authors of the [Federalist Papers](https://en.wikipedia.org/wiki/The_Federalist_Papers). The *Federalist Papers* are a series of papers written by Alexander Hamilton, James Madison and John Jay towards establishing the United States Constitution.\n",
    "\n",
    "What is interesting about these papers is that they were all written under a pseudonym, \"Publius\", to keep the identity of the authors a secret. Only after Hamilton's death, when a list was found written by him detailing the authorship of the papers, did the rest of the world learn what papers each of the authors wrote. After the list was published, Madison chimed in to make a couple of corrections: Hamilton, Madison said, hastily wrote down the list and assigned some papers to the wrong author!\n",
    "\n",
    "Here we will try and find out who really wrote these mysterious papers.\n",
    "\n",
    "To solve this we will learn from the undisputed papers to predict the disputed ones. First, let's read the texts from the file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils import open_data\n",
    "from text import *\n",
    "\n",
    "federalist = open_data(\"EN-text/federalist.txt\").read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see how the text looks. We will print the first 500 characters:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'The Project Gutenberg EBook of The Federalist Papers, by \\nAlexander Hamilton and John Jay and James Madison\\n\\nThis eBook is for the use of anyone anywhere at no cost and with\\nalmost no restrictions whatsoever.  You may copy it, give it away or\\nre-use it under the terms of the Project Gutenberg License included\\nwith this eBook or online at www.gutenberg.net\\n\\n\\nTitle: The Federalist Papers\\n\\nAuthor: Alexander Hamilton\\n        John Jay\\n        James Madison\\n\\nPosting Date: December 12, 2011 [EBook #18]'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "federalist[:500]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems that the text file opens with a license agreement, hardly useful in our case. In fact, the license spans 113 words, while there is also a licensing agreement at the end of the file, which spans 3098 words. We need to remove them. To do so, we will first convert the text into words, to make our lives easier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "wordseq = words(federalist)\n",
    "wordseq = wordseq[114:-3098]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now take a look at the first 100 words:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'federalist no 1 general introduction for the independent journal hamilton to the people of the state of new york after an unequivocal experience of the inefficacy of the subsisting federal government you are called upon to deliberate on a new constitution for the united states of america the subject speaks its own importance comprehending in its consequences nothing less than the existence of the union the safety and welfare of the parts of which it is composed the fate of an empire in many respects the most interesting in the world it has been frequently remarked that it seems to'"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "' '.join(wordseq[:100])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Much better.\n",
    "\n",
    "As with any Natural Language Processing problem, it is prudent to do some text pre-processing and clean our data before we start building our model. Remember that all the papers are signed as 'Publius', so we can safely remove that word, since it doesn't give us any information as to the real author.\n",
    "\n",
    "NOTE: Since we are only removing a single word from each paper, this step can be skipped. We add it here to show that processing the data in our hands is something we should always be considering. Oftentimes pre-processing the data in just the right way is the difference between a robust model and a flimsy one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "wordseq = [w for w in wordseq if w != 'publius']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have to separate the text from a block of words into papers and assign them to their authors. We can see that each paper starts with the word 'federalist', so we will split the text on that word.\n",
    "\n",
    "The disputed papers are the papers from 49 to 58, from 18 to 20 and paper 64. We want to leave these papers unassigned. Also, note that there are two versions of paper 70; both from Hamilton.\n",
    "\n",
    "Finally, to keep the implementation intuitive, we add a `None` object at the start of the `papers` list to make the list index match up with the paper numbering (for example, `papers[5]` now corresponds to paper no. 5 instead of the paper no.6 in the 0-indexed Python)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(4, 16, 52)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "papers = re.split(r'federalist\\s', ' '.join(wordseq))\n",
    "papers = [p for p in papers if p not in ['', ' ']]\n",
    "papers = [None] + papers\n",
    "\n",
    "disputed = list(range(49, 58+1)) + [18, 19, 20, 64]\n",
    "jay, madison, hamilton = [], [], []\n",
    "for i, p in enumerate(papers):\n",
    "    if i in disputed or i == 0:\n",
    "        continue\n",
    "    \n",
    "    if 'jay' in p:\n",
    "        jay.append(p)\n",
    "    elif 'madison' in p:\n",
    "        madison.append(p)\n",
    "    else:\n",
    "        hamilton.append(p)\n",
    "\n",
    "len(jay), len(madison), len(hamilton)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, from the undisputed papers Jay wrote 4, Madison 17 and Hamilton 51 (+1 duplicate). Let's now build our word models. The Unigram Word Model again will come in handy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "hamilton = ''.join(hamilton)\n",
    "hamilton_words = words(hamilton)\n",
    "P_hamilton = UnigramWordModel(hamilton_words, default=1)\n",
    "\n",
    "madison = ''.join(madison)\n",
    "madison_words = words(madison)\n",
    "P_madison = UnigramWordModel(madison_words, default=1)\n",
    "\n",
    "jay = ''.join(jay)\n",
    "jay_words = words(jay)\n",
    "P_jay = UnigramWordModel(jay_words, default=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now it is time to build our new Naive Bayes Learner. It is very similar to the one found in `learning.py`, but with an important difference: it doesn't classify an example, but instead returns the probability of the example belonging to each class. This will allow us to not only see to whom a paper belongs to, but also the probability of authorship as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "from utils import product\n",
    "\n",
    "\n",
    "def NaiveBayesLearner(dist):\n",
    "    \"\"\"A simple naive bayes classifier that takes as input a dictionary of\n",
    "    Counter distributions and can then be used to find the probability\n",
    "    of a given item belonging to each class.\n",
    "    The input dictionary is in the following form:\n",
    "        ClassName: Counter\"\"\"\n",
    "    attr_dist = {c_name: count_prob for c_name, count_prob in dist.items()}\n",
    "\n",
    "    def predict(example):\n",
    "        \"\"\"Predict the probabilities for each class.\"\"\"\n",
    "        def class_prob(target, e):\n",
    "            attr = attr_dist[target]\n",
    "            return product([attr[a] for a in e])\n",
    "\n",
    "        pred = {t: class_prob(t, example) for t in dist.keys()}\n",
    "\n",
    "        total = sum(pred.values())\n",
    "        if total == 0:\n",
    "            # Since there are a lot of multiplications of very small numbers,\n",
    "            # we end up with values equal to 0. To combat that, we keep\n",
    "            # dividing the example until the sum of the values is not 0.\n",
    "            random_words_count = max([int(3*len(example)/4), 100])\n",
    "            pred = predict(random.sample(example, random_words_count))\n",
    "        else:\n",
    "            for k, v in pred.items():\n",
    "                pred[k] = v / total\n",
    "\n",
    "        return pred\n",
    "\n",
    "    return predict"
   ]
  },
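  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A side note on the design choice above: instead of re-sampling the example whenever the product of probabilities underflows to 0, a common alternative is to sum log-probabilities and normalize with the log-sum-exp trick. The sketch below is only an alternative illustration (it is not what the rest of this notebook uses) and assumes every word probability is strictly positive, which `default=1` guarantees here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "def log_predict(example, class_models):\n",
    "    # Sketch: sum log-probabilities per class, then normalize via log-sum-exp.\n",
    "    log_scores = {c: sum(math.log(model[a]) for a in example)\n",
    "                  for c, model in class_models.items()}\n",
    "    m = max(log_scores.values())\n",
    "    exp_scores = {c: math.exp(s - m) for c, s in log_scores.items()}\n",
    "    total = sum(exp_scores.values())\n",
    "    return {c: s / total for c, s in exp_scores.items()}\n",
    "\n",
    "# Hypothetical usage, once `dist` and `papers` are available:\n",
    "# log_predict(words(papers[49]), dist)"
   ]
  },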
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we will build our Learner. Note that even though Hamilton wrote the most papers, that doesn't make it more probable that he wrote the rest, so all the class probabilities will be equal. We can change them if we have some external knowledge, which for this tutorial we do not have."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "dist = {('Madison', 1): P_madison, ('Hamilton', 1): P_hamilton, ('Jay', 1): P_jay}\n",
    "nBS = NaiveBayesLearner(dist)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As usual, the `recognize` function will take as input a string and after removing capitalization and splitting it into words, will feed it into the Naive Bayes Classifier. Since though the classifier is probabilistic (it randomly picks words from the example to evaluate) it is better if we run the experiment a lot of times and averaged the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "def avg_preds(preds):\n",
    "    d = {}\n",
    "    for k in preds[0].keys():\n",
    "        d[k] = 0\n",
    "        for p in preds:\n",
    "            d[k] += p[k]\n",
    "    \n",
    "    return {k: d[k] / len(preds)\n",
    "            for k in preds[0].keys()}\n",
    "\n",
    "\n",
    "def recognize(sentence, nBS):\n",
    "    sentence = sentence.lower()\n",
    "    sentence_words = words(sentence)\n",
    "    \n",
    "    return avg_preds([nBS(sentence_words) for i in range(25)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can start predicting the disputed papers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Paper No. 49\n",
      "Hamilton: 0.18218476722264856\n",
      "Madison : 0.8178151126501306\n",
      "Jay     : 1.2012722099721584e-07\n",
      "----------------------\n",
      "Paper No. 50\n",
      "Hamilton: 0.006340777113564324\n",
      "Madison : 0.9935600714606485\n",
      "Jay     : 9.915142578703363e-05\n",
      "----------------------\n",
      "Paper No. 51\n",
      "Hamilton: 0.10807398451170964\n",
      "Madison : 0.8919260093780947\n",
      "Jay     : 6.11019566801153e-09\n",
      "----------------------\n",
      "Paper No. 52\n",
      "Hamilton: 0.015755507847563528\n",
      "Madison : 0.9842245750173423\n",
      "Jay     : 1.9917135094100632e-05\n",
      "----------------------\n",
      "Paper No. 53\n",
      "Hamilton: 0.16148149622286845\n",
      "Madison : 0.8385181396174793\n",
      "Jay     : 3.641596521788814e-07\n",
      "----------------------\n",
      "Paper No. 54\n",
      "Hamilton: 0.1202445807489968\n",
      "Madison : 0.8797554191935693\n",
      "Jay     : 5.743394071176045e-11\n",
      "----------------------\n",
      "Paper No. 55\n",
      "Hamilton: 0.10014174623125195\n",
      "Madison : 0.8998582478040609\n",
      "Jay     : 5.964687179083329e-09\n",
      "----------------------\n",
      "Paper No. 56\n",
      "Hamilton: 0.15930217913525455\n",
      "Madison : 0.8406948696158869\n",
      "Jay     : 2.9512488585096405e-06\n",
      "----------------------\n",
      "Paper No. 57\n",
      "Hamilton: 0.3106575736716812\n",
      "Madison : 0.6893423580295986\n",
      "Jay     : 6.829872019646261e-08\n",
      "----------------------\n",
      "Paper No. 58\n",
      "Hamilton: 0.08144023779669217\n",
      "Madison : 0.9185597621646735\n",
      "Jay     : 3.8634360540381284e-11\n",
      "----------------------\n",
      "Paper No. 18\n",
      "Hamilton: 7.762932414823314e-06\n",
      "Madison : 0.5114716240007965\n",
      "Jay     : 0.4885206130667886\n",
      "----------------------\n",
      "Paper No. 19\n",
      "Hamilton: 0.011570316420346522\n",
      "Madison : 0.5281730401297515\n",
      "Jay     : 0.4602566434499019\n",
      "----------------------\n",
      "Paper No. 20\n",
      "Hamilton: 0.14651509965391551\n",
      "Madison : 0.5342142523806944\n",
      "Jay     : 0.31927064796538995\n",
      "----------------------\n",
      "Paper No. 64\n",
      "Hamilton: 0.5756065218890194\n",
      "Madison : 0.3648418106830272\n",
      "Jay     : 0.059551667427953384\n",
      "----------------------\n"
     ]
    }
   ],
   "source": [
    "for d in disputed:\n",
    "    print(\"Paper No. {}\".format(d))\n",
    "    probs = recognize(papers[d], nBS)\n",
    "    h = probs[('Hamilton', 1)]\n",
    "    m = probs[('Madison', 1)]\n",
    "    j = probs[('Jay', 1)]\n",
    "    print(\"Hamilton: {}\".format(h))\n",
    "    print(\"Madison : {}\".format(m))\n",
    "    print(\"Jay     : {}\".format(j))\n",
    "    print(\"----------------------\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "NOTE: Since the algorithm has an element of random, it will show different results on each run. Generally, the more the experiments, the stabler the results.\n",
    "\n",
    "This is a simple approach to the problem and thankfully researchers are fairly certain that papers 49-58 were all written by Madison, while 18-20 were written in collaboration between Hamilton and Madison, with Madison being credited for most of the work. Our classifier is not that far off. It should correctly classify all (or most of) the papers by Madison, even though on some occasions the classifier is not that sure. For the collaboration papers between Hamilton and Madison the classifier shows some peculiar results: most of the time it correctly implies that Madison did a lot of the work but instead of Hamilton helping him, it usually shows Jay. This might be because the collaboration between Madison and Hamilton produced some results uncharacteristic to either of them. Without further investigation it is hard to pinpoint the issue.\n",
    "\n",
    "Unfortunately, it misses paper 64. Consensus is that the paper was written by John Jay, while our classifier believes it was written by Hamilton. The classifier went wrong there because it did not have much information on Jay's writing; only 4 papers. This is one of the problems with using unbalanced datasets such as this one, where information on some classes is sparser than information on the rest. To avoid this, we can add more writings for Jay and Madison to end up with an equal amount of data for each author."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}