nlp.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NATURAL LANGUAGE PROCESSING\n",
    "\n",
    "This notebook covers chapters 22 and 23 from the book *Artificial Intelligence: A Modern Approach*, 3rd Edition. The implementations of the algorithms can be found in [nlp.py](https://github.com/aimacode/aima-python/blob/master/nlp.py).\n",
    "\n",
    "Run the below cell to import the code from the module and get started!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import nlp\n",
    "from nlp import Page, HITS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## CONTENTS\n",
    "\n",
    "* Overview\n",
    "* HITS\n",
    "* Question Answering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## OVERVIEW\n",
    "\n",
    "`TODO...`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## HITS\n",
    "\n",
    "### Overview\n",
    "\n",
    "**Hyperlink-Induced Topic Search** (or HITS for short) is an algorithm for information retrieval and page ranking. You can read more on information retrieval in the [text notebook](https://github.com/aimacode/aima-python/blob/master/text.ipynb). Essentially, given a collection of documents and a user's query, such systems return to the user the documents most relevant to what the user needs. The HITS algorithm differs from a lot of other similar ranking algorithms (like Google's *Pagerank*) as the page ratings in this algorithm are dependent on the given query. This means that for each new query the result pages must be computed anew. This cost might be prohibitive for many modern search engines, so a lot steer away from this approach.\n",
    "\n",
    "HITS first finds a list of relevant pages to the query and then adds pages that link to or are linked from these pages. Once the set is built, we define two values for each page. **Authority** on the query, the degree of pages from the relevant set linking to it and **hub** of the query, the degree that it points to authoritative pages in the set. Since we do not want to simply count the number of links from a page to other pages, but we also want to take into account the quality of the linked pages, we update the hub and authority values of a page in the following manner, until convergence:\n",
    "\n",
    "* Hub score = The sum of the authority scores of the pages it links to.\n",
    "\n",
    "* Authority score = The sum of hub scores of the pages it is linked from.\n",
    "\n",
    "So the higher quality the pages a page is linked to and from, the higher its scores.\n",
    "\n",
    "We then normalize the scores by dividing each score by the sum of the squares of the respective scores of all pages. When the values converge, we return the top-valued pages. Note that because we normalize the values, the algorithm is guaranteed to converge."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Implementation\n",
    "\n",
    "The source code for the algorithm is given below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%psource HITS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we compile the collection of pages as mentioned above. Then, we initialize the authority and hub scores for each page and finally we update and normalize the values until convergence.\n",
    "\n",
    "A quick overview of the helper functions functions we use:\n",
    "\n",
    "* `relevant_pages`: Returns relevant pages from `pagesIndex` given a query.\n",
    "\n",
    "* `expand_pages`: Adds to the collection pages linked to and from the given `pages`.\n",
    "\n",
    "* `normalize`: Normalizes authority and hub scores.\n",
    "\n",
    "* `ConvergenceDetector`: A class that checks for convergence, by keeping a history of the pages' scores and checking if they change or not.\n",
    "\n",
    "* `Page`: The template for pages. Stores the address, authority/hub scores and in-links/out-links."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Example\n",
    "\n",
    "Before we begin we need to define a list of sample pages to work on. The pages are `pA`, `pB` and so on and their text is given by `testHTML` and `testHTML2`. The `Page` class takes as arguments the in-links and out-links as lists. For page \"A\", the in-links are \"B\", \"C\" and \"E\" while the sole out-link is \"D\".\n",
    "\n",
    "We also need to set the `nlp` global variables `pageDict`, `pagesIndex` and `pagesContent`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "testHTML = \"\"\"Like most other male mammals, a man inherits an\n",
    "            X from his mom and a Y from his dad.\"\"\"\n",
    "testHTML2 = \"a mom and a dad\"\n",
    "\n",
    "pA = Page(\"A\", [\"B\", \"C\", \"E\"], [\"D\"])\n",
    "pB = Page(\"B\", [\"E\"], [\"A\", \"C\", \"D\"])\n",
    "pC = Page(\"C\", [\"B\", \"E\"], [\"A\", \"D\"])\n",
    "pD = Page(\"D\", [\"A\", \"B\", \"C\", \"E\"], [])\n",
    "pE = Page(\"E\", [], [\"A\", \"B\", \"C\", \"D\", \"F\"])\n",
    "pF = Page(\"F\", [\"E\"], [])\n",
    "\n",
    "nlp.pageDict = {pA.address: pA, pB.address: pB, pC.address: pC,\n",
    "                pD.address: pD, pE.address: pE, pF.address: pF}\n",
    "\n",
    "nlp.pagesIndex = nlp.pageDict\n",
    "\n",
    "nlp.pagesContent ={pA.address: testHTML, pB.address: testHTML2,\n",
    "                   pC.address: testHTML, pD.address: testHTML2,\n",
    "                   pE.address: testHTML, pF.address: testHTML2}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now run the HITS algorithm. Our query will be 'mammals' (note that while the content of the HTML doesn't matter, it should include the query words or else no page will be picked at the first step)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "HITS('mammals')\n",
    "page_list = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"]\n",
    "auth_list = [pA.authority, pB.authority, pC.authority, pD.authority, pE.authority, pF.authority]\n",
    "hub_list = [pA.hub, pB.hub, pC.hub, pD.hub, pE.hub, pF.hub]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see how the pages were scored:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "A: total=0.7696163397038682, auth=0.5583254178509696, hub=0.2112909218528986\n",
      "B: total=0.7795962360479534, auth=0.23657856688600404, hub=0.5430176691619494\n",
      "C: total=0.8204496913590655, auth=0.4211098490570872, hub=0.3993398423019784\n",
      "D: total=0.6316647735856309, auth=0.6316647735856309, hub=0.0\n",
      "E: total=0.7078245882072104, auth=0.0, hub=0.7078245882072104\n",
      "F: total=0.23657856688600404, auth=0.23657856688600404, hub=0.0\n"
     ]
    }
   ],
   "source": [
    "for i in range(6):\n",
    "    p = page_list[i]\n",
    "    a = auth_list[i]\n",
    "    h = hub_list[i]\n",
    "    \n",
    "    print(\"{}: total={}, auth={}, hub={}\".format(p, a + h, a, h))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "The top score is 0.82 by \"C\". This is the most relevant page according to the algorithm. You can see that the pages it links to, \"A\" and \"D\", have the two highest authority scores (therefore \"C\" has a high hub score) and the pages it is linked from, \"B\" and \"E\", have the highest hub scores (so \"C\" has a high authority score). By combining these two facts, we get that \"C\" is the most relevant page. It is worth noting that it does not matter if the given page contains the query words, just that it links and is linked from high-quality pages."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## QUESTION ANSWERING\n",
    "\n",
    "**Question Answering** is a type of Information Retrieval system, where we have a question instead of a query and instead of relevant documents we want the computer to return a short sentence, phrase or word that answers our question. To better understand the concept of question answering systems, you can first read the \"Text Models\" and \"Information Retrieval\" section from the [text notebook](https://github.com/aimacode/aima-python/blob/master/text.ipynb).\n",
    "\n",
    "A typical example of such a system is `AskMSR` (Banko *et al.*, 2002), a system for question answering that performed admirably against more sophisticated algorithms. The basic idea behind it is that a lot of questions have already been answered in the web numerous times. The system doesn't know a lot about verbs, or concepts or even what a noun is. It knows about 15 different types of questions and how they can be written as queries. It can rewrite [Who was George Washington's second in command?] as the query [\\* was George Washington's second in command] or [George Washington's second in command was \\*].\n",
    "\n",
    "After rewriting the questions, it issues these queries and retrieves the short text around the query terms. It then breaks the result into 1, 2 or 3-grams. Filters are also applied to increase the chances of a correct answer. If the query starts with \"who\", we filter for names, if it starts with \"how many\" we filter for numbers and so on. We can also filter out the words appearing in the query. For the above query, the answer \"George Washington\" is wrong, even though it is quite possible the 2-gram would appear a lot around the query terms.\n",
    "\n",
    "Finally, the different results are weighted by the generality of the queries. The result from the general boolean query [George Washington OR second in command] weighs less that the more specific query [George Washington's second in command was \\*]. As an answer we return the most highly-ranked n-gram."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2+"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}