{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NATURAL LANGUAGE PROCESSING APPLICATIONS\n",
"\n",
"In this notebook we will take a look at some indicative applications of natural language processing. We will cover content from [`nlp.py`](https://github.com/aimacode/aima-python/blob/master/nlp.py) and [`text.py`](https://github.com/aimacode/aima-python/blob/master/text.py), for chapters 22 and 23 of Stuart Russel's and Peter Norvig's book [*Artificial Intelligence: A Modern Approach*](http://aima.cs.berkeley.edu/).\n",
"\n",
"Run the below cell to get started:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from text import *"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CONTENTS\n",
"\n",
"* Language Recognition"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## LANGUAGE RECOGNITION\n",
"\n",
"A very useful application of text models (you can read more on them on the [`text notebook`](https://github.com/aimacode/aima-python/blob/master/text.ipynb)) is categorizing text into a language. In fact, with enough data we can categorize correctly mostly any text. That is because different languages have certain characteristics that set them apart. For example, in German it is very usual for 'c' to be followed by 'h' while in English we see 't' followed by 'h' a lot.\n",
"\n",
"Here we will build an application to categorize sentences in either English or German.\n",
"\n",
"First we need to build our dataset. We will take as input text in English and in German and we will extract n-gram character models (in this case, *bigrams* for n=2). For English, we will use *Flatland* by Edwin Abbott and for German *Faust* by Goethe.\n",
"\n",
"Let's build our text models for each language, which will hold the probability of each bigram occuring in the text."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from utils import open_data\n",
"from text import *\n",
"\n",
"flatland = open_data(\"EN-text/flatland.txt\").read()\n",
"wordseq = words(flatland)\n",
"\n",
"P_flatland = NgramCharModel(2, wordseq)\n",
"\n",
"faust = open_data(\"faust.txt\").read()\n",
"wordseq = words(faust)\n",
"\n",
"P_faust = NgramCharModel(2, wordseq)"
]
},
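{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before building the classifier, let's peek at what the models have learned. Each model keeps its n-gram counts in a `dictionary` attribute, mapping each bigram to the number of times it appears, so sorting by count shows the most frequent bigrams in each language. The `top_ngrams` helper below is defined here just for illustration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def top_ngrams(model, k=5):\n",
"    \"\"\"Return the k most frequent n-grams of a model, read off its\n",
"    dictionary attribute (n-gram -> number of occurrences).\"\"\"\n",
"    return sorted(model.dictionary.items(), key=lambda item: -item[1])[:k]\n",
"\n",
"print(top_ngrams(P_flatland))\n",
"print(top_ngrams(P_faust))"
]
},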
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use this information to build a *Naive Bayes Classifier* that will be used to categorize sentences (you can read more on Naive Bayes on the [`learning notebook`](https://github.com/aimacode/aima-python/blob/master/learning.ipynb)). The classifier will take as input the probability distribution of bigrams and given a list of bigrams (extracted from the sentence to be classified), it will calculate the probability of the example/sentence coming from each language and pick the maximum.\n",
"\n",
"Let's build our classifier, with the assumption that English is as probable as German (the input is a dictionary with values the text models and keys the tuple `language, probability`):"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from learning import NaiveBayesLearner\n",
"\n",
"dist = {('English', 1): P_flatland, ('German', 1): P_faust}\n",
"\n",
"nBS = NaiveBayesLearner(dist, simple=True)"
]
},
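{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for what the classifier does under the hood, here is a minimal sketch of the idea (an illustration, not the actual `NaiveBayesLearner` implementation): for each class, multiply the class prior by the probability of every n-gram under that class's model, then return the class with the highest score. The sketch assumes the text models can be indexed as `model[ngram]` to obtain a probability:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def naive_bayes_sketch(dist, example):\n",
"    \"\"\"Illustrative sketch of simple Naive Bayes classification.\n",
"    dist maps (class_name, prior) -> text model; example is a list of\n",
"    n-grams. Assumes model[ngram] returns P(ngram | class).\"\"\"\n",
"    def class_score(key):\n",
"        class_name, prior = key\n",
"        score = prior\n",
"        for ngram in example:\n",
"            score *= dist[key][ngram]  # multiply in P(ngram | class)\n",
"        return score\n",
"    \n",
"    best_class, _ = max(dist.keys(), key=class_score)\n",
"    return best_class"
]
},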
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we need to write a function that takes as input a sentence, breaks it into a list of bigrams and classifies it with the naive bayes classifier from above.\n",
"\n",
"Once we get the text model for the sentence, we need to unravel it. The text models show the probability of each bigram, but the classifier can't handle that extra data. It requires a simple *list* of bigrams. So, if the text model shows that a bigram appears three times, we need to add it three times in the list. Since the text model stores the n-gram information in a dictionary (with the key being the n-gram and the value the number of times the n-gram appears) we need to iterate through the items of the dictionary and manually add them to the list of n-grams."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def recognize(sentence, nBS, n):\n",
" sentence = sentence.lower()\n",
" wordseq = words(sentence)\n",
" \n",
" P_sentence = NgramCharModel(n, wordseq)\n",
" \n",
" ngrams = []\n",
" for b, p in P_sentence.dictionary.items():\n",
" ngrams += [b]*p\n",
" \n",
" return nBS(ngrams)"
]
},
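{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the unraveling step concrete, here is the intermediate list for a tiny example: every bigram ends up in the list as many times as its count in the model's dictionary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"P_demo = NgramCharModel(2, words(\"so so so\"))\n",
"\n",
"demo_ngrams = []\n",
"for b, p in P_demo.dictionary.items():\n",
"    demo_ngrams += [b]*p  # repeat each bigram p times\n",
"\n",
"print(demo_ngrams)"
]
},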
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can start categorizing sentences."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'German'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"recognize(\"Ich bin ein platz\", nBS, 2)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'English'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"recognize(\"Turtles fly high\", nBS, 2)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'German'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"recognize(\"Der pelikan ist hier\", nBS, 2)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'English'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"recognize(\"And thus the wizard spoke\", nBS, 2)"
]
},
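{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, since `recognize` takes *n* as a parameter, nothing stops us from trying other n-gram sizes. As a quick check, let's rebuild the models and the classifier with trigrams (n=3) and classify one of the sentences again:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"P_flatland3 = NgramCharModel(3, words(flatland))\n",
"P_faust3 = NgramCharModel(3, words(faust))\n",
"\n",
"dist3 = {('English', 1): P_flatland3, ('German', 1): P_faust3}\n",
"nBS3 = NaiveBayesLearner(dist3, simple=True)\n",
"\n",
"recognize(\"Der pelikan ist hier\", nBS3, 3)"
]
},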
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can add more languages if you want, the algorithm works for as many as you like! Also, you can play around with *n*. Here we used 2, but other numbers work too (even though 2 suffices). The algorithm is not perfect, but it has high accuracy even for small samples like the ones we used. That is because English and German are very different languages. The closer together languages are (for example, Norwegian and Swedish share a lot of common ground) the lower the accuracy of the classifier."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}