# NATURAL LANGUAGE PROCESSING APPLICATIONS

In this notebook we will look at some representative applications of natural language processing. We will cover content from [`nlp.py`](https://github.com/aimacode/aima-python/blob/master/nlp.py) and [`text.py`](https://github.com/aimacode/aima-python/blob/master/text.py), for chapters 22 and 23 of Stuart Russell and Peter Norvig's book [*Artificial Intelligence: A Modern Approach*](http://aima.cs.berkeley.edu/).

## CONTENTS

* Language Recognition

## LANGUAGE RECOGNITION

A very useful application of text models (you can read more about them in the [`text notebook`](https://github.com/aimacode/aima-python/blob/master/text.ipynb)) is identifying the language a text is written in. In fact, with enough data we can correctly categorize almost any text. That is because different languages have certain characteristics that set them apart. For example, in German 'c' is very often followed by 'h', while in English 't' is frequently followed by 'h'.
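To make the 'ch' versus 'th' observation concrete, here is a minimal sketch that counts both letter pairs in two short samples. The sample sentences and the helper name `pair_counts` are made up for illustration and are not part of `text.py`.

```python
def pair_counts(sample, pairs=("th", "ch")):
    """Count how often each two-letter pair occurs in a lower-cased sample."""
    sample = sample.lower()
    return {pair: sample.count(pair) for pair in pairs}

# Made-up sample sentences, chosen only to illustrate the idea.
english_sample = "the other thing that they think"
german_sample = "ich mache doch auch noch nichts"

english_counts = pair_counts(english_sample)
german_counts = pair_counts(german_sample)

print("English:", english_counts)  # 'th' dominates
print("German: ", german_counts)   # 'ch' dominates
```

Two sentences are far too little data to draw conclusions from, but the same skew shows up much more reliably once the counts are taken over a whole book.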

Here we will build an application that categorizes sentences as either English or German.

First we need to build our dataset. We will take English and German text as input and extract n-gram character models from it (in this case *bigrams*, n=2). For English we will use *Flatland* by Edwin Abbott, and for German *Faust* by Goethe.
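As a rough picture of what a character n-gram model contains, the sketch below counts n-grams word by word and then normalizes the counts into relative frequencies. It is a standalone illustration under simplifying assumptions; the helper names `char_ngram_counts` and `to_probabilities` are hypothetical, and `NgramCharModel` in `text.py` may handle details such as word boundaries differently.

```python
from collections import Counter

def char_ngram_counts(word_list, n=2):
    """Count character n-grams word by word (n-grams never span two words)."""
    counts = Counter()
    for word in word_list:
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts

def to_probabilities(counts):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}

# A made-up toy sentence stands in for a full book.
counts = char_ngram_counts("the cat sat on the mat".split(), n=2)
probs = to_probabilities(counts)

print(counts.most_common(3))
print(probs["at"])
```

With a whole book as input, these relative frequencies become a usable estimate of how likely each bigram is in that language.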

Let's build our text models for each language, which will hold the probability of each bigram occurring in the text.

In [1]:
from utils import open_data
from text import *

flatland = open_data("EN-text/flatland.txt").read()
wordseq = words(flatland)

P_flatland = NgramCharModel(2, wordseq)

faust = open_data("GE-text/faust.txt").read()
wordseq = words(faust)

P_faust = NgramCharModel(2, wordseq)

We can use this information to build a *Naive Bayes Classifier* that will categorize sentences (you can read more about Naive Bayes in the [`learning notebook`](https://github.com/aimacode/aima-python/blob/master/learning.ipynb)). The classifier takes the bigram probability distribution of each language as input. Then, given a list of bigrams extracted from the sentence to be classified, it calculates the probability of the sentence coming from each language and picks the language with the highest probability.
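The calculation described above can be sketched in a few self-contained lines. This is an illustration under stated assumptions, not the code in `learning.py`: it sums log-probabilities instead of multiplying probabilities, and it uses add-one smoothing with an assumed vocabulary of 26 × 26 bigrams so that unseen bigrams do not zero out a language. The training sentences and the names `bigrams`, `train` and `classify` are made up for the example.

```python
import math
from collections import Counter

def bigrams(text):
    """List the character bigrams of each word in `text`."""
    return [w[i:i + 2] for w in text.lower().split() for i in range(len(w) - 1)]

def train(text):
    """Return (bigram counts, total number of bigrams) for a training text."""
    counts = Counter(bigrams(text))
    return counts, sum(counts.values())

def classify(sentence, models, priors, vocab_size=26 * 26):
    """Pick the language maximizing log prior + summed log likelihoods."""
    scores = {}
    for language, (counts, total) in models.items():
        score = math.log(priors[language])
        for gram in bigrams(sentence):
            # Add-one smoothing (an assumption of this sketch) avoids log(0).
            score += math.log((counts[gram] + 1) / (total + vocab_size))
        scores[language] = score
    return max(scores, key=scores.get)

# Tiny made-up training texts; a real model would be trained on whole books.
models = {
    "English": train("the other thing that they think with this"),
    "German": train("ich mache doch auch noch nichts und nicht"),
}
priors = {"English": 0.5, "German": 0.5}  # both languages equally likely

print(classify("that thing", models, priors))
print(classify("ich auch", models, priors))
```

Working in log space is a common design choice here: multiplying many small probabilities can underflow for long sentences, while adding their logarithms gives the same ranking.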

Let's build our classifier, with the assumption that English is as probable as German. The input is a dictionary whose keys are `(language, probability)` tuples and whose values are the corresponding text models:

In [2]:
from learning import NaiveBayesLearner

dist = {('English', 1): P_flatland, ('German', 1): P_faust}

nBS = NaiveBayesLearner(dist, simple=True)

Now we need to write a function that takes a sentence as input, breaks it into a list of bigrams and classifies it with the Naive Bayes classifier from above.

Once we get the text model for the sentence, we need to unravel it. The text model records how many times each bigram appears, but the classifier can't handle that extra data. It requires a simple *list* of bigrams. So, if the text model shows that a bigram appears three times, we need to add it to the list three times. Since the text model stores the n-gram information in a dictionary (with the n-gram as the key and the number of times it appears as the value), we need to iterate through the items of the dictionary and manually add them to the list of n-grams.
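The unraveling step can be seen in isolation with a small dictionary of made-up counts. The helper name `unravel` is hypothetical, and the keys are plain strings here purely for readability; the keys in the actual model may have a different type.

```python
from collections import Counter

def unravel(counts):
    """Expand {item: count} into a flat list, repeating each item count times."""
    items = []
    for item, count in counts.items():
        items += [item] * count
    return items

counts = {"th": 3, "he": 1}  # made-up bigram counts
flat = unravel(counts)
print(flat)

# Counter.elements() from the standard library performs the same expansion.
same = sorted(flat) == sorted(Counter(counts).elements())
print(same)
```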

In [3]:
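def recognize(sentence, nBS, n):
    """Classify a sentence with the classifier nBS, using character n-grams."""
    sentence = sentence.lower()
    wordseq = words(sentence)

    # Build an n-gram character model for this sentence alone.
    P_sentence = NgramCharModel(n, wordseq)

    # Unravel the model: repeat each n-gram as many times as it was counted.
    ngrams = []
    for b, p in P_sentence.dictionary.items():
        ngrams += [b]*p

    return nBS(ngrams)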

Now we can start categorizing sentences.

In [4]:
recognize("Ich bin ein platz", nBS, 2)

'German'

In [5]:
recognize("Turtles fly high", nBS, 2)

'English'

In [6]:
recognize("Der pelikan ist hier", nBS, 2)

'German'

In [7]:
recognize("And thus the wizard spoke", nBS, 2)

'English'

You can add more languages if you want; the algorithm works for as many as you like. You can also play around with *n*. Here we used 2, but other values work too (even though 2 suffices). The algorithm is not perfect, but it has high accuracy even for small samples like the ones we used. That is because English and German are very different languages. The closer two languages are to each other (for example, Norwegian and Swedish share a lot of common ground), the lower the accuracy of the classifier will be.
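If you want to put a number on that accuracy, one option is a small helper like the sketch below. The `accuracy` function and the `toy_classifier` stand-in are hypothetical names introduced here for illustration; the stand-in is deliberately crude and is not the bigram model built in this notebook.

```python
def accuracy(classify, labeled_sentences):
    """Fraction of (sentence, expected_label) pairs that `classify` gets right."""
    if not labeled_sentences:
        raise ValueError("need at least one labeled sentence")
    correct = sum(1 for sentence, expected in labeled_sentences
                  if classify(sentence) == expected)
    return correct / len(labeled_sentences)

# Stand-in classifier for demonstration only: it just looks for the pair 'ch'.
def toy_classifier(sentence):
    return "German" if "ch" in sentence.lower() else "English"

samples = [
    ("Ich bin ein platz", "German"),
    ("Turtles fly high", "English"),
    ("Der pelikan ist hier", "German"),
    ("And thus the wizard spoke", "English"),
]

score = accuracy(toy_classifier, samples)
print(score)
```

To evaluate the classifier from this notebook instead, you could pass `lambda s: recognize(s, nBS, 2)` as the first argument, ideally with many more labeled sentences than the four shown here.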