text.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TEXT\n",
    "\n",
    "This notebook serves as supporting material for topics covered in **Chapter 22 - Natural Language Processing** from the book *Artificial Intelligence: A Modern Approach*. This notebook uses implementations from [text.py](https://github.com/aimacode/aima-python/blob/master/text.py)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from text import *\n",
    "from utils import open_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CONTENTS\n",
    "\n",
    "* Text Models\n",
    "* Viterbi Text Segmentation\n",
    "* Decoders\n",
    "    * Introduction\n",
    "    * Shift Decoder\n",
    "    * Permutation Decoder"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## TEXT MODELS\n",
    "\n",
    "Before we start analyzing text processing algorithms, we will need to build some language models. Those models serve as a look-up table for character or word probabilities (depending on the type of model). These models can give us the probabilities of words or character sequences appearing in text. Take as example \"the\". Text models can give us the probability of \"the\", *P(\"the\")*, either as a word or as a sequence of characters (\"t\" followed by \"h\" followed by \"e\"). The first representation is called \"word model\" and deals with words as distinct objects, while the second is a \"character model\" and deals with sequences of characters as objects. Note that we can specify the number of words or the length of the char sequences to better suit our needs. So, given that number of words equals 2, we have probabilities in the form *P(word1, word2)*. For example, *P(\"of\", \"the\")*. For char models, we do the same but for chars.\n",
    "\n",
    "It is also useful to store the conditional probabilities of words given preceding words. That means, given we found the words \"of\" and \"the\", what is the chance the next word will be \"world\"? More formally, *P(\"world\"|\"of\", \"the\")*. Generalizing, *P(Wi|Wi-1, Wi-2, ... , Wi-n)*.\n",
    "\n",
    "We call the word model *N-Gram Word Model* (from the Greek \"gram\", the root of \"write\", or the word for \"letter\") and the char model *N-Gram Character Model*. In the special case where *N* is 1, we call the models *Unigram Word Model* and *Unigram Character Model* respectively.\n",
    "\n",
    "In the `text` module we implement the two models (both their unigram and n-gram variants) by inheriting from the `CountingProbDist` from `learning.py`. Note that `CountingProbDist` does not return the actual probability of each object, but the number of times it appears in our test data.\n",
    "\n",
    "For word models we have `UnigramWordModel` and `NgramWordModel`. We supply them with a text file and they show the frequency of the different words. We have `UnigramCharModel` and `NgramCharModel` for the character models.\n",
    "\n",
    "Execute the cells below to take a look at the code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%psource UnigramWordModel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%psource NgramWordModel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%psource UnigramCharModel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%psource NgramCharModel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we build our models. The text file we will use to build them is *Flatland*, by Edwin A. Abbott. We will load it from [here](https://github.com/aimacode/aima-data/blob/a21fc108f52ad551344e947b0eb97df82f8d2b2b/EN-text/flatland.txt). In that directory you can find other text files we might get to use here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Getting Probabilities\n",
    "\n",
    "Here we will take a look at how to read text and find the probabilities for each model, and how to retrieve them.\n",
    "\n",
    "First the word models:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(2081, 'the'), (1479, 'of'), (1021, 'and'), (1008, 'to'), (850, 'a')]\n",
      "[(368, ('of', 'the')), (152, ('to', 'the')), (152, ('in', 'the')), (86, ('of', 'a')), (80, ('it', 'is'))]\n",
      "0.0036724740723330495\n",
      "0.00114584557527324\n"
     ]
    }
   ],
   "source": [
    "flatland = open_data(\"EN-text/flatland.txt\").read()\n",
    "wordseq = words(flatland)\n",
    "\n",
    "P1 = UnigramWordModel(wordseq)\n",
    "P2 = NgramWordModel(2, wordseq)\n",
    "\n",
    "print(P1.top(5))\n",
    "print(P2.top(5))\n",
    "\n",
    "print(P1['an'])\n",
    "print(P2[('i', 'was')])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the most used word in *Flatland* is 'the', with 2081 occurences, while the most used sequence is 'of the' with 368 occurences. Also, the probability of 'an' is approximately 0.003, while for 'i was' it is close to 0.001. Note that the strings used as keys are all lowercase. For the unigram model, the keys are single strings, while for n-gram models we have n-tuples of strings.\n",
    "\n",
    "Below we take a look at how we can get information from the conditional probabilities of the model, and how we can generate the next word in a sequence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Conditional Probabilities Table: {'myself': 1, 'to': 2, 'at': 2, 'pleased': 1, 'considered': 1, 'will': 1, 'intoxicated': 1, 'glad': 1, 'certain': 2, 'in': 2, 'now': 2, 'sitting': 1, 'unusually': 1, 'approaching': 1, 'by': 1, 'covered': 1, 'standing': 1, 'allowed': 1, 'surprised': 1, 'keenly': 1, 'afraid': 1, 'once': 2, 'crushed': 1, 'not': 4, 'rapt': 1, 'simulating': 1, 'rapidly': 1, 'quite': 1, 'describing': 1, 'wearied': 1} \n",
      "\n",
      "Conditional Probability of 'once' give 'i was': 0.05128205128205128 \n",
      "\n",
      "Next word after 'i was': not\n"
     ]
    }
   ],
   "source": [
    "flatland = open_data(\"EN-text/flatland.txt\").read()\n",
    "wordseq = words(flatland)\n",
    "\n",
    "P3 = NgramWordModel(3, wordseq)\n",
    "\n",
    "print(\"Conditional Probabilities Table:\", P3.cond_prob[('i', 'was')].dictionary, '\\n')\n",
    "print(\"Conditional Probability of 'once' give 'i was':\", P3.cond_prob[('i', 'was')]['once'], '\\n')\n",
    "print(\"Next word after 'i was':\", P3.cond_prob[('i', 'was')].sample())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we print all the possible words that come after 'i was' and the times they have appeared in the model. Next we print the probability of 'once' appearing after 'i was', and finally we pick a word to proceed after 'i was'. Note that the word is picked according to its probability of appearing (high appearance count means higher chance to get picked)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at the two character models:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(19208, 'e'), (13965, 't'), (12069, 'o'), (11702, 'a'), (11440, 'i')]\n",
      "[(5364, (' ', 't')), (4573, ('t', 'h')), (4063, (' ', 'a')), (3654, ('h', 'e')), (2967, (' ', 'i'))]\n",
      "0.0006028715031814578\n",
      "0.0032371578540395666\n"
     ]
    }
   ],
   "source": [
    "flatland = open_data(\"EN-text/flatland.txt\").read()\n",
    "wordseq = words(flatland)\n",
    "\n",
    "P1 = UnigramCharModel(wordseq)\n",
    "P2 = NgramCharModel(2, wordseq)\n",
    "\n",
    "print(P1.top(5))\n",
    "print(P2.top(5))\n",
    "\n",
    "print(P1['z'])\n",
    "print(P2[('g', 'h')])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The most common letter is 'e', appearing more than 19000 times, and the most common sequence is \"\\_t\". That is, a space followed by a 't'. Note that even though we do not count spaces for word models or unigram character models, we do count them for n-gram char models.\n",
    "\n",
    "Also, the probability of the letter 'z' appearing is close to 0.0006, while for the bigram 'gh' it is 0.003."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generating Samples\n",
    "\n",
    "Apart from reading the probabilities for n-grams, we can also use our model to generate word sequences, using the `samples` function in the word models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "not it of before most regions multitudes the a three\n",
      "the inhabitants of so also refers to the cube with\n",
      "the service of education waxed daily more numerous than the\n"
     ]
    }
   ],
   "source": [
    "flatland = open_data(\"EN-text/flatland.txt\").read()\n",
    "wordseq = words(flatland)\n",
    "\n",
    "P1 = UnigramWordModel(wordseq)\n",
    "P2 = NgramWordModel(2, wordseq)\n",
    "P3 = NgramWordModel(3, wordseq)\n",
    "\n",
    "print(P1.samples(10))\n",
    "print(P2.samples(10))\n",
    "print(P3.samples(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the unigram model, we mostly get gibberish, since each word is picked according to its frequency of appearance in the text, without taking into consideration preceding words. As we increase *n* though, we start to get samples that do have some semblance of conherency and do remind a little bit of normal English. As we increase our data, these samples will get better.\n",
    "\n",
    "Let's try it. We will add to the model more data to work with and let's see what comes out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "it again stealing away through the ranks of his nephew but he laughed most immoderately\n",
      "exclaiming that he henceforth exchanged them for the artist s pencil how great and glorious\n",
      "compound now for nothing worse but however all that is quite out of the question\n",
      "accordance with precedent and for the sake of secrecy he must condemn him to perpetual\n"
     ]
    }
   ],
   "source": [
    "data = open_data(\"EN-text/flatland.txt\").read()\n",
    "data += open_data(\"EN-text/gutenberg.txt\").read()\n",
    "data += open_data(\"EN-text/sense.txt\").read()\n",
    "\n",
    "wordseq = words(data)\n",
    "\n",
    "P3 = NgramWordModel(3, wordseq)\n",
    "P4 = NgramWordModel(4, wordseq)\n",
    "P5 = NgramWordModel(5, wordseq)\n",
    "P7 = NgramWordModel(7, wordseq)\n",
    "\n",
    "print(P3.samples(15))\n",
    "print(P4.samples(15))\n",
    "print(P5.samples(15))\n",
    "print(P7.samples(15))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how the samples start to become more and more reasonable as we add more data and increase the *n* parameter. We are still a long way to go though from realistic text generation, but at the same time we can see that with enough data even rudimentary algorithms can output something almost passable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## VITERBI TEXT SEGMENTATION\n",
    "\n",
    "### Overview\n",
    "\n",
    "We are given a string containing words of a sentence, but all the spaces are gone! It is very hard to read and we would like to separate the words in the string. We can accomplish this by employing the `Viterbi Segmentation` algorithm. It takes as input the string to segment and a text model, and it returns a list of the separate words.\n",
    "\n",
    "The algorithm operates in a dynamic programming approach. It starts from the beginning of the string and iteratively builds the best solution using previous solutions. It accomplishes that by segmentating the string into \"windows\", each window representing a word (real or gibberish). It then calculates the probability of the sequence up that window/word occuring and updates its solution. When it is done, it traces back from the final word and finds the complete sequence of words."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Implementation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%psource viterbi_segment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function takes as input a string and a text model, and returns the most probable sequence of words, together with the probability of that sequence.\n",
    "\n",
    "The \"window\" is `w` and it includes the characters from *j* to *i*. We use it to \"build\" the following sequence: from the start to *j* and then `w`. We have previously calculated the probability from the start to *j*, so now we multiply that probability by `P[w]` to get the probability of the whole sequence. If that probability is greater than the probability we have calculated so far for the sequence from the start to *i* (`best[i]`), we update it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example\n",
    "\n",
    "The model the algorithm uses is the `UnigramTextModel`. First we will build the model using the *Flatland* text and then we will try and separate a space-devoid sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sequence of words is: ['it', 'is', 'easy', 'to', 'read', 'words', 'without', 'spaces']\n",
      "Probability of sequence is: 2.273672843573388e-24\n"
     ]
    }
   ],
   "source": [
    "flatland = open_data(\"EN-text/flatland.txt\").read()\n",
    "wordseq = words(flatland)\n",
    "P = UnigramTextModel(wordseq)\n",
    "text = \"itiseasytoreadwordswithoutspaces\"\n",
    "\n",
    "s, p = viterbi_segment(text,P)\n",
    "print(\"Sequence of words is:\",s)\n",
    "print(\"Probability of sequence is:\",p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The algorithm correctly retrieved the words from the string. It also gave us the probability of this sequence, which is small, but still the most probable segmentation of the string."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DECODERS\n",
    "\n",
    "### Introduction\n",
    "\n",
    "In this section we will try to decode ciphertext using probabilistic text models. A ciphertext is obtained by performing encryption on a text message. This encryption lets us communicate safely, as anyone who has access to the ciphertext but doesn't know how to decode it cannot read the message. We will restrict our study to <b>Monoalphabetic Substitution Ciphers</b>. These are primitive forms of cipher where each letter in the message text (also known as plaintext) is replaced by another another letter of the alphabet.\n",
    "\n",
    "### Shift Decoder\n",
    "\n",
    "#### The Caesar cipher\n",
    "\n",
    "The Caesar cipher, also known as shift cipher is a form of monoalphabetic substitution ciphers where each letter is <i>shifted</i> by a fixed value. A shift by <b>`n`</b> in this context means that each letter in the plaintext is replaced with a letter corresponding to `n` letters down in the alphabet. For example the plaintext `\"ABCDWXYZ\"` shifted by `3` yields `\"DEFGZABC\"`. Note how `X` became `A`. This is because the alphabet is cyclic, i.e. the letter after the last letter in the alphabet, `Z`, is the first letter of the alphabet - `A`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DEFGZABC\n"
     ]
    }
   ],
   "source": [
    "plaintext = \"ABCDWXYZ\"\n",
    "ciphertext = shift_encode(plaintext, 3)\n",
    "print(ciphertext)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Decoding a Caesar cipher\n",
    "\n",
    "To decode a Caesar cipher we exploit the fact that not all letters in the alphabet are used equally. Some letters are used more than others and some pairs of letters are more probable to occur together. We call a pair of consecutive letters a <b>bigram</b>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['th', 'hi', 'is', 's ', ' i', 'is', 's ', ' a', 'a ', ' s', 'se', 'en', 'nt', 'te', 'en', 'nc', 'ce']\n"
     ]
    }
   ],
   "source": [
    "print(bigrams('this is a sentence'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use `CountingProbDist` to get the probability distribution of bigrams. In the latin alphabet consists of only only `26` letters. This limits the total number of possible substitutions to `26`. We reverse the shift encoding for a given `n` and check how probable it is using the bigram distribution. We try all `26` values of `n`, i.e. from `n = 0` to `n = 26` and use the value of `n` which gives the most probable plaintext."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%psource ShiftDecoder"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example\n",
    "\n",
    "Let us encode a secret message using Caeasar cipher and then try decoding it using `ShiftDecoder`. We will again use `flatland.txt` to build the text model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The code is \"Guvf vf n frperg zrffntr\"\n"
     ]
    }
   ],
   "source": [
    "plaintext = \"This is a secret message\"\n",
    "ciphertext = shift_encode(plaintext, 13)\n",
    "print('The code is', '\"' + ciphertext + '\"')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The decoded message is \"This is a secret message\"\n"
     ]
    }
   ],
   "source": [
    "flatland = open_data(\"EN-text/flatland.txt\").read()\n",
    "decoder = ShiftDecoder(flatland)\n",
    "\n",
    "decoded_message = decoder.decode(ciphertext)\n",
    "print('The decoded message is', '\"' + decoded_message + '\"')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Permutation Decoder\n",
    "Now let us try to decode messages encrypted by a general monoalphabetic substitution cipher. The letters in the alphabet can be replaced by any permutation of letters. For example if the alpahbet consisted of `{A B C}` then it can be replaced by `{A C B}`, `{B A C}`, `{B C A}`, `{C A B}`, `{C B A}` or even `{A B C}` itself. Suppose we choose the permutation `{C B A}`, then the plain text `\"CAB BA AAC\"` would become `\"ACB BC CCA\"`. We can see that Caesar cipher is also a form of permutation cipher where the permutation is a cyclic permutation. Unlike the Caesar cipher, it is infeasible to try all possible permutations. The number of possible permutations in Latin alphabet is `26!` which is of the order $10^{26}$. We use graph search algorithms to search for a 'good' permutation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%psource PermutationDecoder"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each state/node in the graph is represented as a letter-to-letter map. If there no mapping for a letter it means the letter is unchanged in the permutation. These maps are stored as dictionaries. Each dictionary is a 'potential' permutation.  We use the word 'potential' because every dictionary doesn't necessarily represent a valid permutation since a permutation cannot have repeating elements. For example the dictionary `{'A': 'B', 'C': 'X'}` is invalid because `'A'` is replaced by `'B'`, but so is `'B'` because the dictionary doesn't have a mapping for `'B'`. Two dictionaries can also represent the same permutation e.g. `{'A': 'C', 'C': 'A'}` and `{'A': 'C', 'B': 'B', 'C': 'A'}` represent the same permutation where `'A'` and `'C'` are interchanged and all other letters remain unaltered. To ensure we get a valid permutation a goal state must map all letters in the alphabet. We also prevent repetions in the permutation by allowing only those actions which go to new state/node in which the newly added letter to the dictionary maps to previously unmapped letter. These two rules togeter ensure that the dictionary of a goal state will represent a valid permutation.\n",
    "The score of a state is determined using word scores, unigram scores, and bigram scores. Experiment with different weightages for word, unigram and bigram scores and see how they affect the decoding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\"ahed world\" decodes to \"shed could\"\n",
      "\"ahed woxld\" decodes to \"shew atiow\"\n"
     ]
    }
   ],
   "source": [
    "ciphertexts = ['ahed world', 'ahed woxld']\n",
    "\n",
    "pd = PermutationDecoder(canonicalize(flatland))\n",
    "for ctext in ciphertexts:\n",
    "    print('\"{}\" decodes to \"{}\"'.format(ctext, pd.decode(ctext)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As evident from the above example, permutation decoding using best first search is sensitive to initial text. This is because not only the final dictionary, with substitutions for all letters, must have good score but so must the intermediate dictionaries. You could think of it as performing a local search by finding substitutons for each letter one by one. We could get very different results by changing even a single letter because that letter could be a deciding factor for selecting substitution in early stages which snowballs and affects the later stages. To make the search better we can use different definition of score in different stages and optimize on which letter to substitute first."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2+"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}