# NATURAL LANGUAGE PROCESSING

This notebook covers chapters 22 and 23 from the book *Artificial Intelligence: A Modern Approach*, 3rd Edition. The implementations of the algorithms can be found in [nlp.py](https://github.com/aimacode/aima-python/blob/master/nlp.py).

Run the cell below to import the code from the module and get started!

In [1]:
import nlp
from nlp import Page, HITS

## CONTENTS

* Overview
* HITS

## OVERVIEW

`TODO...`

## HITS

### Overview

**Hyperlink-Induced Topic Search** (HITS for short) is an algorithm for information retrieval and page ranking. You can read more on information retrieval in the [text](https://github.com/aimacode/aima-python/blob/master/text.ipynb) notebook. Essentially, given a collection of documents and a user's query, such systems return the documents most relevant to what the user needs. HITS differs from many similar ranking algorithms (such as Google's *PageRank*) in that its page ratings depend on the given query. This means that the scores must be computed anew for each query. This cost might be prohibitive for many modern search engines, so many of them steer away from this approach.

HITS first finds the pages relevant to the query and then adds the pages that link to or are linked from them. Once the set is built, we define two values for each page: its **authority** on the query, the degree to which pages in the relevant set link to it, and its **hub** value for the query, the degree to which it points to authoritative pages in the set. Since we do not want to simply count the number of links from a page to other pages, but also want to take into account the quality of the linked pages, we update the hub and authority values of each page in the following manner, until convergence:

* Hub score = The sum of the authority scores of the pages it links to.

* Authority score = The sum of hub scores of the pages it is linked from.

So the higher the quality of the pages a page links to and is linked from, the higher its scores.

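After each update we normalize the scores, separately for authority and hub, by dividing each score by the square root of the sum of the squares of the respective scores of all pages, so that the squared scores sum to 1 (as they do in the example output further down). When the values converge, we return the top-valued pages. Note that because we normalize the values, the algorithm is guaranteed to converge.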
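The update and normalization steps described above can be sketched in a few lines of standalone Python. This is a minimal illustration, not the `nlp.py` implementation: the function name `hits_scores`, the dictionary-based graph, the fixed iteration count and the three-page graph are all assumptions made for the example, and each score vector is rescaled to unit Euclidean length.

```python
from math import sqrt


def hits_scores(out_links, iterations=50):
    """Illustrative HITS iteration on a graph given as {page: [pages it links to]}."""
    # Collect every page that appears as a source or as a link target.
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    # Derive the in-links of each page from the out-links.
    in_links = {p: [] for p in pages}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)

    authority = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Both updates read the scores from the previous iteration.
        new_authority = {p: sum(hub[q] for q in in_links[p]) for p in pages}
        new_hub = {p: sum(authority[q] for q in out_links.get(p, []))
                   for p in pages}

        # Rescale each score vector to unit Euclidean length
        # (fall back to 1.0 if every score is zero).
        auth_norm = sqrt(sum(v * v for v in new_authority.values())) or 1.0
        hub_norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        authority = {p: v / auth_norm for p, v in new_authority.items()}
        hub = {p: v / hub_norm for p, v in new_hub.items()}

    return authority, hub


# Hypothetical three-page graph: X links to Y and Z, Y links to Z.
graph = {"X": ["Y", "Z"], "Y": ["Z"], "Z": []}
authority, hub = hits_scores(graph)
print(max(authority, key=authority.get))  # page with the highest authority score
print(max(hub, key=hub.get))              # page with the highest hub score
```

In this small graph "Z" ends up with the highest authority score, since both other pages link to it, and "X" with the highest hub score, since it links to both of the other pages.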

### Implementation

The source code for the algorithm is given below:

In [2]:
%psource HITS

First we compile the collection of pages as mentioned above. Then, we initialize the authority and hub scores for each page and finally we update and normalize the values until convergence.
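To make the first step concrete, here is a minimal, self-contained sketch of how such a collection could be compiled. It is not the code in `nlp.py`: the function `build_base_set`, the plain dictionaries standing in for pages, the four-page collection, and the rule that a page is relevant when its text contains every query word are all assumptions made for illustration.

```python
def build_base_set(query, content, in_links, out_links):
    """Return the relevant pages plus the pages they link to or are linked from."""
    words = query.lower().split()
    # Assumption: a page is relevant if its text contains every query word.
    relevant = {page for page, text in content.items()
                if all(word in text.lower() for word in words)}

    # Expand the relevant pages with their in-links and out-links.
    base_set = set(relevant)
    for page in relevant:
        base_set.update(in_links.get(page, []))
        base_set.update(out_links.get(page, []))
    return base_set


# Hypothetical four-page collection.
content = {"P": "cats are mammals", "Q": "a page about cars",
           "R": "a page about boats", "S": "a page about trains"}
out_links = {"P": ["Q"], "Q": [], "R": ["P"], "S": []}
in_links = {"P": ["R"], "Q": ["P"], "R": [], "S": []}

print(sorted(build_base_set("mammals", content, in_links, out_links)))
```

Only "P" mentions the query word, but "Q" and "R" join the set because "P" links to one and is linked from the other, while "S" is left out. A query that matches no page yields an empty set.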

A quick overview of the helper functions and classes we use:

* `relevant_pages`: Returns relevant pages from `pagesIndex` given a query.

* `expand_pages`: Adds to the collection pages linked to and from the given `pages`.

* `normalize`: Normalizes authority and hub scores.

* `ConvergenceDetector`: A class that checks for convergence, by keeping a history of the pages' scores and checking if they change or not.

* `Page`: The template for pages. Stores the address, authority/hub scores and in-links/out-links.
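As an illustration of the convergence idea, the sketch below keeps the previous scores and reports convergence once no score changes by more than a small tolerance. The class name `ScoreHistory`, its `converged` method and the tolerance value are hypothetical and are not taken from `nlp.py`, whose `ConvergenceDetector` may work differently.

```python
class ScoreHistory:
    """Hypothetical convergence check: compare scores with the previous call."""

    def __init__(self, tolerance=1e-6):
        self.tolerance = tolerance
        self.previous = None  # scores seen on the previous call, if any

    def converged(self, scores):
        """Return True if no score moved by more than the tolerance."""
        previous, self.previous = self.previous, dict(scores)
        if previous is None or previous.keys() != scores.keys():
            return False
        return all(abs(scores[p] - previous[p]) <= self.tolerance
                   for p in scores)


history = ScoreHistory()
print(history.converged({"A": 0.50, "B": 0.50}))  # first call, nothing to compare
print(history.converged({"A": 0.60, "B": 0.40}))  # scores still moving
print(history.converged({"A": 0.60, "B": 0.40}))  # scores unchanged
```

The three calls print `False`, `False` and `True`: the first has no earlier scores to compare against, the second sees a change larger than the tolerance, and the third sees no change at all.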

### Example

Before we begin we need to define a list of sample pages to work on. The pages are `pA`, `pB` and so on, and their text is given by `testHTML` and `testHTML2`. The `Page` constructor takes the page address, followed by the in-links and the out-links as lists. For page "A", the in-links are "B", "C" and "E", while the sole out-link is "D".

We also need to set the `nlp` global variables `pageDict`, `pagesIndex` and `pagesContent`.

In [3]:
testHTML = """Like most other male mammals, a man inherits an
            X from his mom and a Y from his dad."""
testHTML2 = "a mom and a dad"

pA = Page("A", ["B", "C", "E"], ["D"])
pB = Page("B", ["E"], ["A", "C", "D"])
pC = Page("C", ["B", "E"], ["A", "D"])
pD = Page("D", ["A", "B", "C", "E"], [])
pE = Page("E", [], ["A", "B", "C", "D", "F"])
pF = Page("F", ["E"], [])

nlp.pageDict = {pA.address: pA, pB.address: pB, pC.address: pC,
                pD.address: pD, pE.address: pE, pF.address: pF}

nlp.pagesIndex = nlp.pageDict

nlp.pagesContent = {pA.address: testHTML, pB.address: testHTML2,
                    pC.address: testHTML, pD.address: testHTML2,
                    pE.address: testHTML, pF.address: testHTML2}

We can now run the HITS algorithm. Our query will be 'mammals' (note that while the content of the HTML doesn't matter, it should include the query words or else no page will be picked at the first step).

In [4]:
HITS('mammals')
page_list = ["A", "B", "C", "D", "E", "F"]
auth_list = [pA.authority, pB.authority, pC.authority, pD.authority, pE.authority, pF.authority]
hub_list = [pA.hub, pB.hub, pC.hub, pD.hub, pE.hub, pF.hub]

Let's see how the pages were scored:

In [5]:
# Print the combined, authority and hub score of each page.
for p, a, h in zip(page_list, auth_list, hub_list):
    print("{}: total={}, auth={}, hub={}".format(p, a + h, a, h))

A: total=0.7696163397038682, auth=0.5583254178509696, hub=0.2112909218528986
B: total=0.7795962360479534, auth=0.23657856688600404, hub=0.5430176691619494
C: total=0.8204496913590655, auth=0.4211098490570872, hub=0.3993398423019784
D: total=0.6316647735856309, auth=0.6316647735856309, hub=0.0
E: total=0.7078245882072104, auth=0.0, hub=0.7078245882072104
F: total=0.23657856688600404, auth=0.23657856688600404, hub=0.0


The top combined score is 0.82, achieved by "C", which makes it the most relevant page according to the algorithm. The pages it links to, "A" and "D", have the two highest authority scores (so "C" has a high hub score), and the pages it is linked from, "B" and "E", have the two highest hub scores (so "C" has a high authority score). Combining these two facts, "C" comes out as the most relevant page. Note that it does not matter whether a given page contains the query words, only that it links to and is linked from high-quality pages.
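The ranking discussed above can be checked directly against the printed output. The snippet below copies the authority and hub values from that output, rounded to four decimals, into a dictionary and sorts the pages; the variable names are chosen for this example only.

```python
# (authority, hub) pairs copied from the output above, rounded to four decimals.
scores = {
    "A": (0.5583, 0.2113),
    "B": (0.2366, 0.5430),
    "C": (0.4211, 0.3993),
    "D": (0.6317, 0.0),
    "E": (0.0, 0.7078),
    "F": (0.2366, 0.0),
}

# Rank pages by combined score, then pick the top two authorities and hubs.
by_total = sorted(scores, key=lambda p: sum(scores[p]), reverse=True)
top_authorities = sorted(scores, key=lambda p: scores[p][0], reverse=True)[:2]
top_hubs = sorted(scores, key=lambda p: scores[p][1], reverse=True)[:2]

print(by_total)         # ['C', 'B', 'A', 'E', 'D', 'F']
print(top_authorities)  # ['D', 'A']
print(top_hubs)         # ['E', 'B']
```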