{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NATURAL LANGUAGE PROCESSING APPLICATIONS\n",
"\n",
"In this notebook we will take a look at some indicative applications of natural language processing. We will cover content from [`nlp.py`](https://github.com/aimacode/aima-python/blob/master/nlp.py) and [`text.py`](https://github.com/aimacode/aima-python/blob/master/text.py), for chapters 22 and 23 of Stuart Russel's and Peter Norvig's book [*Artificial Intelligence: A Modern Approach*](http://aima.cs.berkeley.edu/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CONTENTS\n",
"\n",
"* Language Recognition\n",
"* Author Recognition"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## LANGUAGE RECOGNITION\n",
"\n",
"A very useful application of text models (you can read more on them on the [`text notebook`](https://github.com/aimacode/aima-python/blob/master/text.ipynb)) is categorizing text into a language. In fact, with enough data we can categorize correctly mostly any text. That is because different languages have certain characteristics that set them apart. For example, in German it is very usual for 'c' to be followed by 'h' while in English we see 't' followed by 'h' a lot.\n",
"\n",
"Here we will build an application to categorize sentences in either English or German.\n",
"\n",
"First we need to build our dataset. We will take as input text in English and in German and we will extract n-gram character models (in this case, *bigrams* for n=2). For English, we will use *Flatland* by Edwin Abbott and for German *Faust* by Goethe.\n",
"\n",
"Let's build our text models for each language, which will hold the probability of each bigram occuring in the text."
]
},
{
"cell_type": "code",
"outputs": [],
"source": [
"from utils import open_data\n",
"from text import *\n",
"\n",
"flatland = open_data(\"EN-text/flatland.txt\").read()\n",
"wordseq = words(flatland)\n",
"\n",
"P_flatland = NgramCharModel(2, wordseq)\n",
"\n",
"faust = open_data(\"GE-text/faust.txt\").read()\n",
"wordseq = words(faust)\n",
"\n",
"P_faust = NgramCharModel(2, wordseq)"
]
},
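{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea of a bigram character model concrete, here is a minimal stand-alone sketch that counts adjacent character pairs with `collections.Counter`. It is not the `NgramCharModel` implementation: `char_bigram_counts` is a hypothetical helper, and padding each word with a leading space is an assumption chosen to match the bigrams printed later in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"def char_bigram_counts(word_list):\n",
"    # Hypothetical helper: count adjacent character pairs in each word,\n",
"    # padding every word with a leading space (an assumption)\n",
"    counts = Counter()\n",
"    for word in word_list:\n",
"        padded = ' ' + word\n",
"        counts.update(zip(padded, padded[1:]))\n",
"    return counts\n",
"\n",
"counts = char_bigram_counts(['ich', 'bin', 'ein', 'platz'])\n",
"total = sum(counts.values())\n",
"\n",
"# Relative frequency of each bigram in this tiny sample\n",
"probs = {bigram: c / total for bigram, c in counts.items()}\n",
"\n",
"print(counts[('i', 'n')], total)"
]
},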
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use this information to build a *Naive Bayes Classifier* that will be used to categorize sentences (you can read more on Naive Bayes on the [`learning notebook`](https://github.com/aimacode/aima-python/blob/master/learning.ipynb)). The classifier will take as input the probability distribution of bigrams and given a list of bigrams (extracted from the sentence to be classified), it will calculate the probability of the example/sentence coming from each language and pick the maximum.\n",
"\n",
"Let's build our classifier, with the assumption that English is as probable as German (the input is a dictionary with values the text models and keys the tuple `language, probability`):"
]
},
{
"cell_type": "code",
"outputs": [],
"source": [
"from learning import NaiveBayesLearner\n",
"\n",
"dist = {('English', 1): P_flatland, ('German', 1): P_faust}\n",
"\n",
"nBS = NaiveBayesLearner(dist, simple=True)"
]
},
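{
"cell_type": "markdown",
"metadata": {},
"source": [
"The decision rule described above can be sketched without any library code. The version below sums log-probabilities instead of multiplying probabilities, a common variant that avoids numerical underflow; the learner in `learning.py` may compute its scores differently. `naive_bayes_classify` and the toy probabilities are made up for illustration, and the sketch assumes every feature has a nonzero probability in every model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def naive_bayes_classify(features, models, priors):\n",
"    # models: class name -> {feature: probability}\n",
"    # priors: class name -> prior probability\n",
"    def score(c):\n",
"        # log P(c) + sum of log P(feature | c)\n",
"        return math.log(priors[c]) + sum(math.log(models[c][f]) for f in features)\n",
"    return max(models, key=score)\n",
"\n",
"# Made-up probabilities, for illustration only\n",
"toy_models = {\n",
"    'English': {('t', 'h'): 0.30, ('c', 'h'): 0.05},\n",
"    'German': {('t', 'h'): 0.05, ('c', 'h'): 0.30},\n",
"}\n",
"toy_priors = {'English': 0.5, 'German': 0.5}\n",
"\n",
"print(naive_bayes_classify([('c', 'h'), ('c', 'h'), ('t', 'h')], toy_models, toy_priors))"
]
},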
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we need to write a function that takes as input a sentence, breaks it into a list of bigrams and classifies it with the naive bayes classifier from above.\n",
"\n",
"Once we get the text model for the sentence, we need to unravel it. The text models show the probability of each bigram, but the classifier can't handle that extra data. It requires a simple *list* of bigrams. So, if the text model shows that a bigram appears three times, we need to add it three times in the list. Since the text model stores the n-gram information in a dictionary (with the key being the n-gram and the value the number of times the n-gram appears) we need to iterate through the items of the dictionary and manually add them to the list of n-grams."
]
},
{
"cell_type": "code",
"outputs": [],
"source": [
"def recognize(sentence, nBS, n):\n",
" sentence = sentence.lower()\n",
" wordseq = words(sentence)\n",
" \n",
" P_sentence = NgramCharModel(n, wordseq)\n",
" \n",
" ngrams = []\n",
" for b, p in P_sentence.dictionary.items():\n",
" ngrams += [b]*p\n",
" \n",
" return nBS(ngrams)"
]
},
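{
"cell_type": "markdown",
"metadata": {},
"source": [
"The unravelling step can be seen in isolation on a hand-written count dictionary. This is only an illustrative sketch: `bigram_counts` is made-up data, and `Counter.elements()` from the standard library is shown as an equivalent alternative to the manual loop, not as what `recognize` uses."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# Made-up counts: bigram -> number of occurrences\n",
"bigram_counts = {('i', 'n'): 2, ('c', 'h'): 1}\n",
"\n",
"# Manual expansion, mirroring the loop in recognize\n",
"ngrams = []\n",
"for bigram, count in bigram_counts.items():\n",
"    ngrams += [bigram] * count\n",
"\n",
"# Counter.elements() repeats each key as many times as its count\n",
"same = list(Counter(bigram_counts).elements())\n",
"\n",
"print(ngrams)"
]
},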
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can start categorizing sentences."
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(' ', 'i'), ('i', 'c'), ('c', 'h'), (' ', 'b'), ('b', 'i'), ('i', 'n'), ('i', 'n'), (' ', 'e'), ('e', 'i'), (' ', 'p'), ('p', 'l'), ('l', 'a'), ('a', 't'), ('t', 'z')]\n"
]
},
{
"data": {
"text/plain": [
"'German'"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"recognize(\"Ich bin ein platz\", nBS, 2)"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(' ', 't'), ('t', 'u'), ('u', 'r'), ('r', 't'), ('t', 'l'), ('l', 'e'), ('e', 's'), (' ', 'f'), ('f', 'l'), ('l', 'y'), (' ', 'h'), ('h', 'i'), ('i', 'g'), ('g', 'h')]\n"
]
},
{
"data": {
"text/plain": [
"'English'"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"recognize(\"Turtles fly high\", nBS, 2)"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(' ', 'd'), ('d', 'e'), ('e', 'r'), ('e', 'r'), (' ', 'p'), ('p', 'e'), ('e', 'l'), ('l', 'i'), ('i', 'k'), ('k', 'a'), ('a', 'n'), (' ', 'i'), ('i', 's'), ('s', 't'), (' ', 'h'), ('h', 'i'), ('i', 'e')]\n"
]
},
{
"data": {
"text/plain": [
"'German'"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"recognize(\"Der pelikan ist hier\", nBS, 2)"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(' ', 'a'), ('a', 'n'), ('n', 'd'), (' ', 't'), (' ', 't'), ('t', 'h'), ('t', 'h'), ('h', 'u'), ('u', 's'), ('h', 'e'), (' ', 'w'), ('w', 'i'), ('i', 'z'), ('z', 'a'), ('a', 'r'), ('r', 'd'), (' ', 's'), ('s', 'p'), ('p', 'o'), ('o', 'k'), ('k', 'e')]\n"
]
},
{
"data": {
"text/plain": [
"'English'"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"recognize(\"And thus the wizard spoke\", nBS, 2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can add more languages if you want, the algorithm works for as many as you like! Also, you can play around with *n*. Here we used 2, but other numbers work too (even though 2 suffices). The algorithm is not perfect, but it has high accuracy even for small samples like the ones we used. That is because English and German are very different languages. The closer together languages are (for example, Norwegian and Swedish share a lot of common ground) the lower the accuracy of the classifier."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## AUTHOR RECOGNITION\n",
"\n",
"Another similar application to language recognition is recognizing who is more likely to have written a sentence, given text written by them. Here we will try and predict text from Edwin Abbott and Jane Austen. They wrote *Flatland* and *Pride and Prejudice* respectively.\n",
"\n",
"We are optimistic we can determine who wrote what based on the fact that Abbott wrote his novella on much later date than Austen, which means there will be linguistic differences between the two works. Indeed, *Flatland* uses more modern and direct language while *Pride and Prejudice* is written in a more archaic tone containing more sophisticated wording.\n",
"\n",
"Similarly with Language Recognition, we will first import the two datasets. This time though we are not looking for connections between characters, since that wouldn't give that great results. Why? Because both authors use English and English follows a set of patterns, as we show earlier. Trying to determine authorship based on this patterns would not be very efficient.\n",
"\n",
"Instead, we will abstract our querying to a higher level. We will use words instead of characters. That way we can more accurately pick at the differences between their writing style and thus have a better chance at guessing the correct author.\n",
"\n",
"Let's go right ahead and import our data:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from utils import open_data\n",
"from text import *\n",
"\n",
"flatland = open_data(\"EN-text/flatland.txt\").read()\n",
"wordseq = words(flatland)\n",
"\n",
"P_Abbott = UnigramWordModel(wordseq, 5)\n",
"\n",
"pride = open_data(\"EN-text/pride.txt\").read()\n",
"wordseq = words(pride)\n",
"\n",
"P_Austen = UnigramWordModel(wordseq, 5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This time we set the `default` parameter of the model to 5, instead of 0. If we leave it at 0, then when we get a sentence containing a word we have not seen from that particular author, the chance of that sentence coming from that author is exactly 0 (since to get the probability, we multiply all the separate probabilities; if one is 0 then the result is also 0). To avoid that, we tell the model to add 5 to the count of all the words that appear.\n",
"\n",
"Next we will build the Naive Bayes Classifier:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"from learning import NaiveBayesLearner\n",
"\n",
"dist = {('Abbott', 1): P_Abbott, ('Austen', 1): P_Austen}\n",
"\n",
"nBS = NaiveBayesLearner(dist, simple=True)"
]
},
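{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effect of the `default` count can be checked on a toy example. The sketch below is a stand-alone illustration of additive smoothing, assuming every vocabulary word starts with `default` extra counts; `word_probability` is a hypothetical helper, not part of `text.py`, and the exact bookkeeping inside `UnigramWordModel` may differ."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"def word_probability(word, counts, default=0):\n",
"    # Hypothetical helper: every word in the vocabulary (including the\n",
"    # queried one) starts with `default` extra counts\n",
"    vocabulary = set(counts) | {word}\n",
"    total = sum(counts.values()) + default * len(vocabulary)\n",
"    return (counts[word] + default) / total\n",
"\n",
"toy_counts = Counter('the square is a square'.split())\n",
"\n",
"# Unseen word without smoothing, then with a default count of 5\n",
"print(word_probability('acquaintance', toy_counts))\n",
"print(word_probability('acquaintance', toy_counts, default=5))"
]
},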
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have build our classifier, we will start classifying. First, we need to convert the given sentence to the format the classifier needs. That is, a list of words."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"def recognize(sentence, nBS):\n",
" sentence = sentence.lower()\n",
" sentence_words = words(sentence)\n",
" \n",
" return nBS(sentence_words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we will input a sentence that is something Abbott would write. Note the use of square and the simpler language."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Abbott'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"recognize(\"the square is mad\", nBS)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The classifier correctly guessed Abbott.\n",
"\n",
"Next we will input a more sophisticated sentence, similar to the style of Austen."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Austen'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"recognize(\"a most peculiar acquaintance\", nBS)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The classifier guessed correctly again.\n",
"\n",
"You can try more sentences on your own. Unfortunately though, since the datasets are pretty small, chances are the guesses will not always be correct."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
}
},
"nbformat": 4,
"nbformat_minor": 2
}