Newer
Older
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"source": [
"# Text\n",
"\n",
"This notebook serves as supporting material for topics covered in **Chapter 22 - Natural Language Processing** from the book *Artificial Intelligence: A Modern Approach*. This notebook uses implementations from [text.py](https://github.com/aimacode/aima-python/blob/master/text.py)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"from text import *\n",
"from utils import DataFile"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Contents\n",
"\n",
"* Text Models\n",
"* Viterbi Text Segmentation\n",
" * Overview\n",
" * Implementation\n",
" * Example\n",
"* Decoders\n",
" * Introduction\n",
" * Shift Decoder\n",
" * Permutation Decoder"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Text Models\n",
"\n",
"Before we start performing text processing algorithms, we will need to build some word models. Those models serve as a look-up table for word probabilities. In the text module we have implemented two such models, which inherit from the `CountingProbDist` from `learning.py`. `UnigramTextModel` and `NgramTextModel`. We supply them with a text file and they show the frequency of the different words.\n",
"\n",
"The main difference between the two models is that the first returns the probability of one single word (eg. the probability of the word 'the' appearing), while the second one can show us the probability of a *sequence* of words (eg. the probability of the sequence 'of the' appearing).\n",
"\n",
"Also, both functions can generate random words and sequences respectively, random according to the model.\n",
"\n",
"Below we build the two models. The text file we will use to build them is the *Flatland*, by Edwin A. Abbott. We will load it from [here](https://github.com/aimacode/aima-data/blob/a21fc108f52ad551344e947b0eb97df82f8d2b2b/EN-text/flatland.txt)."
]
},
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(2081, 'the'), (1479, 'of'), (1021, 'and'), (1008, 'to'), (850, 'a')]\n",
"[(368, ('of', 'the')), (152, ('to', 'the')), (152, ('in', 'the')), (86, ('of', 'a')), (80, ('it', 'is'))]\n"
]
}
],
"source": [
"flatland = DataFile(\"EN-text/flatland.txt\").read()\n",
"wordseq = words(flatland)\n",
"\n",
"P1 = UnigramTextModel(wordseq)\n",
"P2 = NgramTextModel(2, wordseq)\n",
"\n",
"print(P1.top(5))\n",
"print(P2.top(5))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We see that the most used word in *Flatland* is 'the', with 2081 occurences, while the most used sequence is 'of the' with 368 occurences."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Viterbi Text Segmentation\n",
"\n",
"### Overview\n",
"\n",
"We are given a string containing words of a sentence, but all the spaces are gone! It is very hard to read and we would like to separate the words in the string. We can accomplish this by employing the `Viterbi Segmentation` algorithm. It takes as input the string to segment and a text model, and it returns a list of the separate words.\n",
"\n",
"The algorithm operates in a dynamic programming approach. It starts from the beginning of the string and iteratively builds the best solution using previous solutions. It accomplishes that by segmentating the string into \"windows\", each window representing a word (real or gibberish). It then calculates the probability of the sequence up that window/word occuring and updates its solution. When it is done, it traces back from the final word and finds the complete sequence of words."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
{
"cell_type": "code",
"deletable": true,
"editable": true
},
"outputs": [],
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The function takes as input a string and a text model, and returns the most probable sequence of words, together with the probability of that sequence.\n",
"\n",
"The \"window\" is `w` and it includes the characters from *j* to *i*. We use it to \"build\" the following sequence: from the start to *j* and then `w`. We have previously calculated the probability from the start to *j*, so now we multiply that probability by `P[w]` to get the probability of the whole sequence. If that probability is greater than the probability we have calculated so far for the sequence from the start to *i* (`best[i]`), we update it."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Example\n",
"\n",
"The model the algorithm uses is the `UnigramTextModel`. First we will build the model using the *Flatland* text and then we will try and separate a space-devoid sentence."
]
},
{
"cell_type": "code",
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Sequence of words is: ['it', 'is', 'easy', 'to', 'read', 'words', 'without', 'spaces']\n",
"Probability of sequence is: 2.273672843573388e-24\n"
]
}
],
"source": [
"flatland = DataFile(\"EN-text/flatland.txt\").read()\n",
"wordseq = words(flatland)\n",
"P = UnigramTextModel(wordseq)\n",
"text = \"itiseasytoreadwordswithoutspaces\"\n",
"\n",
"s, p = viterbi_segment(text,P)\n",
"print(\"Sequence of words is:\",s)\n",
"print(\"Probability of sequence is:\",p)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The algorithm correctly retrieved the words from the string. It also gave us the probability of this sequence, which is small, but still the most probable segmentation of the string."
]
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Decoders\n",
"\n",
"### Introduction\n",
"\n",
"In this section we will try to decode ciphertext using probabilistic text models. A ciphertext is obtained by performing encryption on a text message. This encryption lets us communicate safely, as anyone who has access to the ciphertext but doesn't know how to decode it cannot read the message. We will restrict our study to <b>Monoalphabetic Substitution Ciphers</b>. These are primitive forms of cipher where each letter in the message text (also known as plaintext) is replaced by another another letter of the alphabet.\n",
"\n",
"### Shift Decoder\n",
"\n",
"#### The Caesar cipher\n",
"\n",
"The Caesar cipher, also known as shift cipher is a form of monoalphabetic substitution ciphers where each letter is <i>shifted</i> by a fixed value. A shift by <b>`n`</b> in this context means that each letter in the plaintext is replaced with a letter corresponding to `n` letters down in the alphabet. For example the plaintext `\"ABCDWXYZ\"` shifted by `3` yields `\"DEFGZABC\"`. Note how `X` became `A`. This is because the alphabet is cyclic, i.e. the letter after the last letter in the alphabet, `Z`, is the first letter of the alphabet - `A`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"DEFGZABC\n"
]
}
],
"source": [
"plaintext = \"ABCDWXYZ\"\n",
"ciphertext = shift_encode(plaintext, 3)\n",
"print(ciphertext)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"#### Decoding a Caesar cipher\n",
"\n",
"To decode a Caesar cipher we exploit the fact that not all letters in the alphabet are used equally. Some letters are used more than others and some pairs of letters are more probable to occur together. We call a pair of consecutive letters a <b>bigram</b>."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['th', 'hi', 'is', 's ', ' i', 'is', 's ', ' a', 'a ', ' s', 'se', 'en', 'nt', 'te', 'en', 'nc', 'ce']\n"
]
}
],
"source": [
"print(bigrams('this is a sentence'))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We use `CountingProbDist` to get the probability distribution of bigrams. In the latin alphabet consists of only only `26` letters. This limits the total number of possible substitutions to `26`. We reverse the shift encoding for a given `n` and check how probable it is using the bigram distribution. We try all `26` values of `n`, i.e. from `n = 0` to `n = 26` and use the value of `n` which gives the most probable plaintext."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"%psource ShiftDecoder"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"#### Example\n",
"\n",
"Let us encode a secret message using Caeasar cipher and then try decoding it using `ShiftDecoder`. We will again use `flatland.txt` to build the text model"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The code is \"Guvf vf n frperg zrffntr\"\n"
]
}
],
"source": [
"plaintext = \"This is a secret message\"\n",
"ciphertext = shift_encode(plaintext, 13)\n",
"print('The code is', '\"' + ciphertext + '\"')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The decoded message is \"This is a secret message\"\n"
]
}
],
"source": [
"flatland = DataFile(\"EN-text/flatland.txt\").read()\n",
"decoder = ShiftDecoder(flatland)\n",
"\n",
"decoded_message = decoder.decode(ciphertext)\n",
"print('The decoded message is', '\"' + decoded_message + '\"')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",