├── .ipynb_checkpoints
│   ├── 【Python自然语言处理】读书笔记:第七章:从文本提取信息-checkpoint.ipynb
│   ├── 【Python自然语言处理】读书笔记:第三章:处理原始文本-checkpoint.ipynb
│   ├── 【Python自然语言处理】读书笔记:第五章:分类和标注词汇-checkpoint.ipynb
│   ├── 【Python自然语言处理】读书笔记:第六章:学习分类文本-checkpoint.ipynb
│   └── 【Python自然语言处理】读书笔记:第四章:编写结构化程序-checkpoint.ipynb
├── 3.document.txt
├── 3.output.txt
├── 4.test.html
├── 5.t2.pkl
├── PYTHON 自然语言处理中文翻译.pdf
├── README.md
├── picture
│   ├── 1.1.png
│   ├── 2.1.png
│   ├── 2.2.png
│   ├── 2.3.png
│   ├── 2.4.png
│   ├── 3.1.png
│   ├── 3.2.png
│   ├── 3.3.png
│   ├── 4.1.png
│   ├── 4.2.png
│   ├── 5.1.png
│   ├── 6.1.png
│   ├── 6.2.png
│   ├── 6.3.png
│   ├── 6.4.png
│   ├── 6.5.png
│   ├── 6.6.png
│   ├── 6.7.png
│   ├── 7.1.png
│   ├── 7.2.png
│   ├── 7.3.png
│   ├── 7.4.png
│   └── 7.5.png
├── 【Python自然语言处理】读书笔记:第一章:语言处理与Python.md
├── 【Python自然语言处理】读书笔记:第七章:从文本提取信息.ipynb
├── 【Python自然语言处理】读书笔记:第三章:处理原始文本.ipynb
├── 【Python自然语言处理】读书笔记:第二章:获得文本语料和词汇资源.md
├── 【Python自然语言处理】读书笔记:第五章:分类和标注词汇.ipynb
├── 【Python自然语言处理】读书笔记:第六章:学习分类文本.ipynb
├── 【Python自然语言处理】读书笔记:第四章:编写结构化程序.ipynb
└── 【Python自然语言处理】读书笔记:第四章:编写结构化程序.md
/.ipynb_checkpoints/【Python自然语言处理】读书笔记:第七章:从文本提取信息-checkpoint.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "本章原文:https://usyiyi.github.io/nlp-py-2e-zh/7.html\n",
8 | "\n",
9 | " 1.我们如何能构建一个系统,从非结构化文本中提取结构化数据如表格?\n",
10 | " 2.有哪些稳健的方法识别一个文本中描述的实体和关系?\n",
11 | " 3.哪些语料库适合这项工作,我们如何使用它们来训练和评估我们的模型?\n",
12 | "\n",
13 | "**分块** 和 **命名实体识别**。"
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {},
19 | "source": [
20 | "# 1 信息提取\n",
21 | "\n",
22 | "信息有很多种形状和大小。一个重要的形式是结构化数据:实体和关系的可预测的规范的结构。例如,我们可能对公司和地点之间的关系感兴趣。给定一个公司,我们希望能够确定它做业务的位置;反过来,给定位置,我们会想发现哪些公司在该位置做业务。如果我们的数据是表格形式,如1.1中的例子,那么回答这些问题就很简单了。\n",
23 | "\n"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 1,
29 | "metadata": {},
30 | "outputs": [
31 | {
32 | "name": "stdout",
33 | "output_type": "stream",
34 | "text": [
35 | "['BBDO South', 'Georgia-Pacific']\n"
36 | ]
37 | }
38 | ],
39 | "source": [
40 | "locs = [('Omnicom', 'IN', 'New York'),\n",
41 | " ('DDB Needham', 'IN', 'New York'),\n",
42 | " ('Kaplan Thaler Group', 'IN', 'New York'),\n",
43 | " ('BBDO South', 'IN', 'Atlanta'),\n",
44 | " ('Georgia-Pacific', 'IN', 'Atlanta')]\n",
45 | "query = [e1 for (e1, rel, e2) in locs if e2=='Atlanta']\n",
46 | "print(query)"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "我们可以定义一个函数,简单地连接 NLTK 中默认的句子分割器[1],分词器[2]和词性标注器[3]:"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 2,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "def ie_preprocess(document):\n",
63 | " sentences = nltk.sent_tokenize(document) # [1] 句子分割器\n",
64 | " sentences = [nltk.word_tokenize(sent) for sent in sentences] # [2] 分词器\n",
65 | " sentences = [nltk.pos_tag(sent) for sent in sentences] # [3] 词性标注器"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 3,
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "import nltk, re, pprint"
75 | ]
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {},
80 | "source": [
81 | "# 2 词块划分\n",
82 | "\n",
83 | "我们将用于实体识别的基本技术是词块划分,它分割和标注多词符的序列,如2.1所示。小框显示词级分词和词性标注,大框显示高级别的词块划分。每个这种较大的框叫做一个词块。就像分词忽略空白符,词块划分通常选择词符的一个子集。同样像分词一样,词块划分器生成的片段在源文本中不能重叠。\n",
84 | "\n",
85 | "\n",
86 | "在本节中,我们将在较深的层面探讨词块划分,以**词块**的定义和表示开始。我们将看到**正则表达式**和**N-gram**的方法来词块划分,使用CoNLL-2000词块划分语料库**开发**和**评估词块划分器**。我们将在(5)和6回到**命名实体识别**和**关系抽取**的任务。"
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {},
92 | "source": [
93 | "## 2.1 名词短语词块划分\n",
94 | "\n",
95 | "我们将首先思考名词短语词块划分或NP词块划分任务,在那里我们寻找单独名词短语对应的词块。例如,这里是一些《华尔街日报》文本,其中的NP词块用方括号标记:"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 5,
101 | "metadata": {},
102 | "outputs": [
103 | {
104 | "name": "stdout",
105 | "output_type": "stream",
106 | "text": [
107 | "(S\n",
108 | " (NP the/DT little/JJ yellow/JJ dog/NN)\n",
109 | " barked/VBD\n",
110 | " at/IN\n",
111 | " (NP the/DT cat/NN))\n"
112 | ]
113 | }
114 | ],
115 | "source": [
116 | "sentence = [(\"the\", \"DT\"), (\"little\", \"JJ\"), (\"yellow\", \"JJ\"), \n",
117 | " (\"dog\", \"NN\"), (\"barked\", \"VBD\"), (\"at\", \"IN\"), (\"the\", \"DT\"), (\"cat\", \"NN\")]\n",
118 | "\n",
119 | "grammar = \"NP: {
?*}\" \n",
120 | "cp = nltk.RegexpParser(grammar) \n",
121 | "result = cp.parse(sentence) \n",
122 | "print(result) \n",
123 | "result.draw()"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
130 | ""
131 | ]
132 | },
133 | {
134 | "cell_type": "markdown",
135 | "metadata": {},
136 | "source": [
137 | "## 2.2 标记模式\n",
138 | "\n",
139 | "组成一个词块语法的规则使用标记模式来描述已标注的词的序列。一个标记模式是一个词性标记序列,用尖括号分隔,如\n",
140 | "```\n",
141 | "?*\n",
142 | "```\n",
143 | "标记模式类似于正则表达式模式(3.4)。现在,思考下面的来自《华尔街日报》的名词短语:\n",
144 | "```py\n",
145 | "another/DT sharp/JJ dive/NN\n",
146 | "trade/NN figures/NNS\n",
147 | "any/DT new/JJ policy/NN measures/NNS\n",
148 | "earlier/JJR stages/NNS\n",
149 | "Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP\n",
150 | "```"
151 | ]
152 | },
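{
"cell_type": "markdown",
"metadata": {},
"source": [
"下面补充一个小示例(非原书代码):一个可以同时覆盖上面五个名词短语的标记模式,并在第一个短语上验证:\n",
"```py\n",
"grammar = \"NP: {<DT>?<JJ.*>*<NN.*>+}\"\n",
"cp = nltk.RegexpParser(grammar)\n",
"print(cp.parse([(\"another\", \"DT\"), (\"sharp\", \"JJ\"), (\"dive\", \"NN\")]))\n",
"# (S (NP another/DT sharp/JJ dive/NN))\n",
"```"
]
},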
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "## 2.3 用正则表达式进行词块划分\n",
158 | "\n",
159 | "要找到一个给定的句子的词块结构,RegexpParser词块划分器以一个没有词符被划分的平面结构开始。词块划分规则轮流应用,依次更新词块结构。一旦所有的规则都被调用,返回生成的词块结构。\n",
160 | "\n",
161 | "2.3显示了一个由2个规则组成的简单的词块语法。第一条规则匹配一个可选的限定词或所有格代名词,零个或多个形容词,然后跟一个名词。第二条规则匹配一个或多个专有名词。我们还定义了一个进行词块划分的例句[1],并在此输入上运行这个词块划分器[2]。"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": 7,
167 | "metadata": {},
168 | "outputs": [
169 | {
170 | "name": "stdout",
171 | "output_type": "stream",
172 | "text": [
173 | "(S\n",
174 | " (NP Rapunzel/NNP)\n",
175 | " let/VBD\n",
176 | " down/RP\n",
177 | " (NP her/PP$ long/JJ golden/JJ hair/NN))\n"
178 | ]
179 | }
180 | ],
181 | "source": [
182 | "grammar = r\"\"\"\n",
183 | "NP: {?*} \n",
184 | "{+}\n",
185 | "\"\"\"\n",
186 | "cp = nltk.RegexpParser(grammar)\n",
187 | "sentence = [(\"Rapunzel\", \"NNP\"), (\"let\", \"VBD\"), (\"down\", \"RP\"), (\"her\", \"PP$\"), (\"long\", \"JJ\"), (\"golden\", \"JJ\"), (\"hair\", \"NN\")]\n",
188 | "print (cp.parse(sentence))"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "注意\n",
196 | "\n",
197 | "```\n",
198 | "$\n",
199 | "```\n",
200 | "符号是正则表达式中的一个特殊字符,必须使用反斜杠转义来匹配\n",
201 | "```\n",
202 | "PP\\$\n",
203 | "```\n",
204 | "标记。\n",
205 | "\n",
206 | "如果标记模式匹配位置重叠,最左边的匹配优先。例如,如果我们应用一个匹配两个连续的名词文本的规则到一个包含三个连续的名词的文本,则只有前两个名词将被划分:"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": 8,
212 | "metadata": {},
213 | "outputs": [
214 | {
215 | "name": "stdout",
216 | "output_type": "stream",
217 | "text": [
218 | "(S (NP money/NN market/NN) fund/NN)\n"
219 | ]
220 | }
221 | ],
222 | "source": [
223 | "nouns = [(\"money\", \"NN\"), (\"market\", \"NN\"), (\"fund\", \"NN\")]\n",
224 | "grammar = \"NP: {} # Chunk two consecutive nouns\"\n",
225 | "cp = nltk.RegexpParser(grammar)\n",
226 | "print(cp.parse(nouns))"
227 | ]
228 | },
229 | {
230 | "cell_type": "markdown",
231 | "metadata": {},
232 | "source": [
233 | "## 2.4 探索文本语料库\n",
234 | "\n",
235 | "在2中,我们看到了我们如何在已标注的语料库中提取匹配的特定的词性标记序列的短语。我们可以使用词块划分器更容易的做同样的工作,如下:"
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": 12,
241 | "metadata": {},
242 | "outputs": [
243 | {
244 | "name": "stdout",
245 | "output_type": "stream",
246 | "text": [
247 | "(CHUNK combined/VBN to/TO achieve/VB)\n",
248 | "(CHUNK continue/VB to/TO place/VB)\n",
249 | "(CHUNK serve/VB to/TO protect/VB)\n"
250 | ]
251 | }
252 | ],
253 | "source": [
254 | "cp = nltk.RegexpParser('CHUNK: { }')\n",
255 | "brown = nltk.corpus.brown\n",
256 | "count = 0\n",
257 | "for sent in brown.tagged_sents():\n",
258 | " tree = cp.parse(sent)\n",
259 | " for subtree in tree.subtrees():\n",
260 | " if subtree.label() == 'CHUNK': print(subtree)\n",
261 | " count += 1\n",
262 | " if count >= 30: break"
263 | ]
264 | },
265 | {
266 | "cell_type": "markdown",
267 | "metadata": {},
268 | "source": [
269 | "## 2.5 词缝加塞\n",
270 | "\n",
271 | "有时定义我们想从一个词块中排除什么比较容易。我们可以定义词缝为一个不包含在词块中的一个词符序列。在下面的例子中,barked/VBD at/IN是一个词缝:\n",
272 | "```\n",
273 | "[ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]\n",
274 | "```\n"
275 | ]
276 | },
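{
"cell_type": "markdown",
"metadata": {},
"source": [
"作为补充,下面的语法先把整个句子划分为一个词块,再用词缝规则把 VBD 和 IN 的序列从词块中挖掉,得到与 2.1 节相同的划分(与原书的词缝示例一致的一个小演示):\n",
"```py\n",
"grammar = r\"\"\"\n",
"NP:\n",
"  {<.*>+}        # 先把所有词符划入一个词块\n",
"  }<VBD|IN>+{    # 再把 VBD 或 IN 的序列作为词缝从词块中挖出\n",
"\"\"\"\n",
"sentence = [(\"the\", \"DT\"), (\"little\", \"JJ\"), (\"yellow\", \"JJ\"),\n",
"            (\"dog\", \"NN\"), (\"barked\", \"VBD\"), (\"at\", \"IN\"), (\"the\", \"DT\"), (\"cat\", \"NN\")]\n",
"cp = nltk.RegexpParser(grammar)\n",
"print(cp.parse(sentence))\n",
"# 输出与 2.1 节中 {<DT>?<JJ>*<NN>} 得到的结果相同\n",
"```"
]
},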
277 | {
278 | "cell_type": "markdown",
279 | "metadata": {},
280 | "source": [
281 | "## 2.6 词块的表示:标记与树\n",
282 | "\n",
283 | "作为标注和分析之间的中间状态(8.,词块结构可以使用标记或树来表示。最广泛的文件表示使用IOB标记。在这个方案中,每个词符被三个特殊的词块标记之一标注,I(内部),O(外部)或B(开始)。一个词符被标注为B,如果它标志着一个词块的开始。块内的词符子序列被标注为I。所有其他的词符被标注为O。B和I标记后面跟着词块类型,如B-NP, I-NP。当然,没有必要指定出现在词块外的词符类型,所以这些都只标注为O。这个方案的例子如2.5所示。\n",
284 | "\n",
285 | "\n",
286 | "IOB标记已成为文件中表示词块结构的标准方式,我们也将使用这种格式。下面是2.5中的信息如何出现在一个文件中的:\n",
287 | "```\n",
288 | "We PRP B-NP\n",
289 | "saw VBD O\n",
290 | "the DT B-NP\n",
291 | "yellow JJ I-NP\n",
292 | "dog NN I-NP\n",
293 | "```"
294 | ]
295 | },
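{
"cell_type": "markdown",
"metadata": {},
"source": [
"补充一个小示例(非原书代码):NLTK 提供了在 IOB 文本和词块树之间互相转换的工具函数,可以直接把上面几行 IOB 数据读成一棵树:\n",
"```py\n",
"text = \"\"\"\n",
"We PRP B-NP\n",
"saw VBD O\n",
"the DT B-NP\n",
"yellow JJ I-NP\n",
"dog NN I-NP\n",
"\"\"\"\n",
"tree = nltk.chunk.conllstr2tree(text, chunk_types=['NP'])\n",
"print(tree)\n",
"# (S (NP We/PRP) saw/VBD (NP the/DT yellow/JJ dog/NN))\n",
"# 反方向可用 nltk.chunk.tree2conlltags(tree) 得到 (词, 词性, 词块标记) 三元组列表\n",
"```"
]
},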
296 | {
297 | "cell_type": "markdown",
298 | "metadata": {},
299 | "source": [
300 | "# 3 开发和评估词块划分器\n",
301 | "\n",
302 | "现在你对分块的作用有了一些了解,但我们并没有解释如何评估词块划分器。和往常一样,这需要一个合适的已标注语料库。我们一开始寻找将IOB格式转换成NLTK树的机制,然后是使用已化分词块的语料库如何在一个更大的规模上做这个。我们将看到如何为一个词块划分器相对一个语料库的准确性打分,再看看一些数据驱动方式搜索NP词块。我们整个的重点在于扩展一个词块划分器的覆盖范围。\n",
303 | "## 3.1 读取IOB格式与CoNLL2000语料库\n",
304 | "\n",
305 | "使用corpus模块,我们可以加载已经标注并使用IOB符号划分词块的《华尔街日报》文本。这个语料库提供的词块类型有NP,VP和PP。正如我们已经看到的,每个句子使用多行表示,如下所示:\n",
306 | "```\n",
307 | "he PRP B-NP\n",
308 | "accepted VBD B-VP\n",
309 | "the DT B-NP\n",
310 | "position NN I-NP\n",
311 | "...\n",
312 | "```\n",
313 | "\n",
314 | "我们可以使用NLTK的corpus模块访问较大量的已经划分词块的文本。CoNLL2000语料库包含27万词的《华尔街日报文本》,分为“训练”和“测试”两部分,标注有词性标记和IOB格式词块标记。我们可以使用nltk.corpus.conll2000访问这些数据。下面是一个读取语料库的“训练”部分的第100个句子的例子:"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": 14,
320 | "metadata": {},
321 | "outputs": [
322 | {
323 | "name": "stdout",
324 | "output_type": "stream",
325 | "text": [
326 | "(S\n",
327 | " (PP Over/IN)\n",
328 | " (NP a/DT cup/NN)\n",
329 | " (PP of/IN)\n",
330 | " (NP coffee/NN)\n",
331 | " ,/,\n",
332 | " (NP Mr./NNP Stone/NNP)\n",
333 | " (VP told/VBD)\n",
334 | " (NP his/PRP$ story/NN)\n",
335 | " ./.)\n"
336 | ]
337 | }
338 | ],
339 | "source": [
340 | "from nltk.corpus import conll2000\n",
341 | "print(conll2000.chunked_sents('train.txt')[99])"
342 | ]
343 | },
344 | {
345 | "cell_type": "markdown",
346 | "metadata": {},
347 | "source": [
348 | "正如你看到的,CoNLL2000语料库包含三种词块类型:NP词块,我们已经看到了;VP词块如has already delivered;PP块如because of。因为现在我们唯一感兴趣的是NP词块,我们可以使用chunk_types参数选择它们:"
349 | ]
350 | },
351 | {
352 | "cell_type": "code",
353 | "execution_count": 15,
354 | "metadata": {},
355 | "outputs": [
356 | {
357 | "name": "stdout",
358 | "output_type": "stream",
359 | "text": [
360 | "(S\n",
361 | " Over/IN\n",
362 | " (NP a/DT cup/NN)\n",
363 | " of/IN\n",
364 | " (NP coffee/NN)\n",
365 | " ,/,\n",
366 | " (NP Mr./NNP Stone/NNP)\n",
367 | " told/VBD\n",
368 | " (NP his/PRP$ story/NN)\n",
369 | " ./.)\n"
370 | ]
371 | }
372 | ],
373 | "source": [
374 | "print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])"
375 | ]
376 | },
377 | {
378 | "cell_type": "markdown",
379 | "metadata": {},
380 | "source": [
381 | "## 3.2 简单的评估和基准\n",
382 | "\n",
383 | "现在,我们可以访问一个已划分词块语料,可以评估词块划分器。我们开始为没有什么意义的词块解析器cp建立一个基准,它不划分任何词块:"
384 | ]
385 | },
386 | {
387 | "cell_type": "code",
388 | "execution_count": 16,
389 | "metadata": {},
390 | "outputs": [
391 | {
392 | "name": "stdout",
393 | "output_type": "stream",
394 | "text": [
395 | "ChunkParse score:\n",
396 | " IOB Accuracy: 43.4%%\n",
397 | " Precision: 0.0%%\n",
398 | " Recall: 0.0%%\n",
399 | " F-Measure: 0.0%%\n"
400 | ]
401 | }
402 | ],
403 | "source": [
404 | "from nltk.corpus import conll2000\n",
405 | "cp = nltk.RegexpParser(\"\")\n",
406 | "test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])\n",
407 | "print(cp.evaluate(test_sents))"
408 | ]
409 | },
410 | {
411 | "cell_type": "markdown",
412 | "metadata": {},
413 | "source": [
414 | "IOB标记准确性表明超过三分之一的词被标注为O,即没有在NP词块中。然而,由于我们的标注器没有找到任何词块,其精度、召回率和F-度量均为零。现在让我们尝试一个初级的正则表达式词块划分器,查找以名词短语标记的特征字母开头的标记(如CD, DT和JJ)。"
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": 17,
420 | "metadata": {},
421 | "outputs": [
422 | {
423 | "name": "stdout",
424 | "output_type": "stream",
425 | "text": [
426 | "ChunkParse score:\n",
427 | " IOB Accuracy: 87.7%%\n",
428 | " Precision: 70.6%%\n",
429 | " Recall: 67.8%%\n",
430 | " F-Measure: 69.2%%\n"
431 | ]
432 | }
433 | ],
434 | "source": [
435 | "grammar = r\"NP: {<[CDJNP].*>+}\"\n",
436 | "cp = nltk.RegexpParser(grammar)\n",
437 | "print(cp.evaluate(test_sents))"
438 | ]
439 | },
440 | {
441 | "cell_type": "markdown",
442 | "metadata": {},
443 | "source": [
444 | "正如你看到的,这种方法达到相当好的结果。但是,我们可以采用更多数据驱动的方法改善它,在这里我们使用训练语料找到对每个词性标记最有可能的块标记(I, O或B)。换句话说,我们可以使用一元标注器(4)建立一个词块划分器。但不是尝试确定每个词的正确的词性标记,而是根据每个词的词性标记,尝试确定正确的词块标记。\n",
445 | "\n",
446 | "在3.1中,我们定义了UnigramChunker类,使用一元标注器给句子加词块标记。这个类的大部分代码只是用来在NLTK 的ChunkParserI接口使用的词块树表示和嵌入式标注器使用的IOB表示之间镜像转换。类定义了两个方法:一个构造函数[1],当我们建立一个新的UnigramChunker时调用;以及parse方法[3],用来给新句子划分词块。"
447 | ]
448 | },
449 | {
450 | "cell_type": "code",
451 | "execution_count": 18,
452 | "metadata": {},
453 | "outputs": [],
454 | "source": [
455 | "class UnigramChunker(nltk.ChunkParserI):\n",
456 | " def __init__(self, train_sents): \n",
457 | " train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]\n",
458 | " for sent in train_sents]\n",
459 | " self.tagger = nltk.UnigramTagger(train_data) \n",
460 | "\n",
461 | " def parse(self, sentence): \n",
462 | " pos_tags = [pos for (word,pos) in sentence]\n",
463 | " tagged_pos_tags = self.tagger.tag(pos_tags)\n",
464 | " chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]\n",
465 | " conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)\n",
466 | " in zip(sentence, chunktags)]\n",
467 | " return nltk.chunk.conlltags2tree(conlltags)"
468 | ]
469 | },
470 | {
471 | "cell_type": "markdown",
472 | "metadata": {},
473 | "source": [
474 | "构造函数[1]需要训练句子的一个列表,这将是词块树的形式。它首先将训练数据转换成适合训练标注器的形式,使用tree2conlltags映射每个词块树到一个word,tag,chunk三元组的列表。然后使用转换好的训练数据训练一个一元标注器,并存储在self.tagger供以后使用。\n",
475 | "\n",
476 | "parse方法[3]接收一个已标注的句子作为其输入,以从那句话提取词性标记开始。它然后使用在构造函数中训练过的标注器self.tagger,为词性标记标注IOB词块标记。接下来,它提取词块标记,与原句组合,产生conlltags。最后,它使用conlltags2tree将结果转换成一个词块树。\n",
477 | "\n",
478 | "现在我们有了UnigramChunker,可以使用CoNLL2000语料库训练它,并测试其表现:"
479 | ]
480 | },
481 | {
482 | "cell_type": "code",
483 | "execution_count": 20,
484 | "metadata": {},
485 | "outputs": [
486 | {
487 | "name": "stdout",
488 | "output_type": "stream",
489 | "text": [
490 | "ChunkParse score:\n",
491 | " IOB Accuracy: 92.9%%\n",
492 | " Precision: 79.9%%\n",
493 | " Recall: 86.8%%\n",
494 | " F-Measure: 83.2%%\n"
495 | ]
496 | }
497 | ],
498 | "source": [
499 | "test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])\n",
500 | "train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])\n",
501 | "unigram_chunker = UnigramChunker(train_sents)\n",
502 | "print(unigram_chunker.evaluate(test_sents))"
503 | ]
504 | },
505 | {
506 | "cell_type": "markdown",
507 | "metadata": {},
508 | "source": [
509 | "这个分块器相当不错,达到整体F-度量83%的得分。让我们来看一看通过使用一元标注器分配一个标记给每个语料库中出现的词性标记,它学到了什么:"
510 | ]
511 | },
512 | {
513 | "cell_type": "code",
514 | "execution_count": 21,
515 | "metadata": {},
516 | "outputs": [
517 | {
518 | "name": "stdout",
519 | "output_type": "stream",
520 | "text": [
521 | "[('#', 'B-NP'), ('$', 'B-NP'), (\"''\", 'O'), ('(', 'O'), (')', 'O'), (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'), ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'), ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'), ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'), ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'), ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'), ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'), ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'), ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]\n"
522 | ]
523 | }
524 | ],
525 | "source": [
526 | "postags = sorted(set(pos for sent in train_sents\n",
527 | " for (word,pos) in sent.leaves()))\n",
528 | "print(unigram_chunker.tagger.tag(postags))"
529 | ]
530 | },
531 | {
532 | "cell_type": "markdown",
533 | "metadata": {},
534 | "source": [
535 | "它已经发现大多数标点符号出现在NP词块外,除了两种货币符号#和*\\$*。它也发现限定词(DT)和所有格(PRP*\\$*和WP$)出现在NP词块的开头,而名词类型(NN, NNP, NNPS,NNS)大多出现在NP词块内。\n",
536 | "\n",
537 | "建立了一个一元分块器,很容易建立一个二元分块器:我们只需要改变类的名称为BigramChunker,修改3.1行[2]构造一个BigramTagger而不是UnigramTagger。由此产生的词块划分器的性能略高于一元词块划分器:"
538 | ]
539 | },
540 | {
541 | "cell_type": "code",
542 | "execution_count": 24,
543 | "metadata": {},
544 | "outputs": [
545 | {
546 | "name": "stdout",
547 | "output_type": "stream",
548 | "text": [
549 | "ChunkParse score:\n",
550 | " IOB Accuracy: 93.3%%\n",
551 | " Precision: 82.3%%\n",
552 | " Recall: 86.8%%\n",
553 | " F-Measure: 84.5%%\n"
554 | ]
555 | }
556 | ],
557 | "source": [
558 | "class BigramChunker(nltk.ChunkParserI):\n",
559 | " def __init__(self, train_sents): \n",
560 | " train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]\n",
561 | " for sent in train_sents]\n",
562 | " self.tagger = nltk.BigramTagger(train_data)\n",
563 | "\n",
564 | " def parse(self, sentence): \n",
565 | " pos_tags = [pos for (word,pos) in sentence]\n",
566 | " tagged_pos_tags = self.tagger.tag(pos_tags)\n",
567 | " chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]\n",
568 | " conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)\n",
569 | " in zip(sentence, chunktags)]\n",
570 | " return nltk.chunk.conlltags2tree(conlltags)\n",
571 | "bigram_chunker = BigramChunker(train_sents)\n",
572 | "print(bigram_chunker.evaluate(test_sents))"
573 | ]
574 | },
575 | {
576 | "cell_type": "markdown",
577 | "metadata": {},
578 | "source": [
579 | "## 3.3 训练基于分类器的词块划分器\n",
580 | "\n",
581 | "无论是基于正则表达式的词块划分器还是n-gram词块划分器,决定创建什么词块完全基于词性标记。然而,有时词性标记不足以确定一个句子应如何划分词块。例如,考虑下面的两个语句:"
582 | ]
583 | },
584 | {
585 | "cell_type": "code",
586 | "execution_count": 31,
587 | "metadata": {},
588 | "outputs": [],
589 | "source": [
590 | "class ConsecutiveNPChunkTagger(nltk.TaggerI): \n",
591 | "\n",
592 | " def __init__(self, train_sents):\n",
593 | " train_set = []\n",
594 | " for tagged_sent in train_sents:\n",
595 | " untagged_sent = nltk.tag.untag(tagged_sent)\n",
596 | " history = []\n",
597 | " for i, (word, tag) in enumerate(tagged_sent):\n",
598 | " featureset = npchunk_features(untagged_sent, i, history) \n",
599 | " train_set.append( (featureset, tag) )\n",
600 | " history.append(tag)\n",
601 | " self.classifier = nltk.MaxentClassifier.train( \n",
602 | " train_set, algorithm='megam', trace=0)\n",
603 | "\n",
604 | " def tag(self, sentence):\n",
605 | " history = []\n",
606 | " for i, word in enumerate(sentence):\n",
607 | " featureset = npchunk_features(sentence, i, history)\n",
608 | " tag = self.classifier.classify(featureset)\n",
609 | " history.append(tag)\n",
610 | " return zip(sentence, history)\n",
611 | "\n",
612 | "class ConsecutiveNPChunker(nltk.ChunkParserI):\n",
613 | " def __init__(self, train_sents):\n",
614 | " tagged_sents = [[((w,t),c) for (w,t,c) in\n",
615 | " nltk.chunk.tree2conlltags(sent)]\n",
616 | " for sent in train_sents]\n",
617 | " self.tagger = ConsecutiveNPChunkTagger(tagged_sents)\n",
618 | "\n",
619 | " def parse(self, sentence):\n",
620 | " tagged_sents = self.tagger.tag(sentence)\n",
621 | " conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]\n",
622 | " return nltk.chunk.conlltags2tree(conlltags)"
623 | ]
624 | },
625 | {
626 | "cell_type": "markdown",
627 | "metadata": {},
628 | "source": [
629 | "留下来唯一需要填写的是特征提取器。首先,我们定义一个简单的特征提取器,它只是提供了当前词符的词性标记。使用此特征提取器,我们的基于分类器的词块划分器的表现与一元词块划分器非常类似:"
630 | ]
631 | },
632 | {
633 | "cell_type": "code",
634 | "execution_count": null,
635 | "metadata": {},
636 | "outputs": [],
637 | "source": [
638 | "def npchunk_features(sentence, i, history):\n",
639 | " word, pos = sentence[i]\n",
640 | " return {\"pos\": pos}\n",
641 | "chunker = ConsecutiveNPChunker(train_sents)\n",
642 | "print(chunker.evaluate(test_sents))"
643 | ]
644 | },
645 | {
646 | "cell_type": "markdown",
647 | "metadata": {},
648 | "source": [
649 | "```\n",
650 | "ChunkParse score:\n",
651 | " IOB Accuracy: 92.9%\n",
652 | " Precision: 79.9%\n",
653 | " Recall: 86.7%\n",
654 | " F-Measure: 83.2%\n",
655 | " ```"
656 | ]
657 | },
658 | {
659 | "cell_type": "markdown",
660 | "metadata": {},
661 | "source": [
662 | "我们还可以添加一个特征表示前面词的词性标记。添加此特征允许词块划分器模拟相邻标记之间的相互作用,由此产生的词块划分器与二元词块划分器非常接近。\n"
663 | ]
664 | },
665 | {
666 | "cell_type": "code",
667 | "execution_count": null,
668 | "metadata": {},
669 | "outputs": [],
670 | "source": [
671 | "def npchunk_features(sentence, i, history):\n",
672 | " word, pos = sentence[i]\n",
673 | " if i == 0:\n",
674 | " prevword, prevpos = \"\", \"\"\n",
675 | " else:\n",
676 | " prevword, prevpos = sentence[i-1]\n",
677 | " return {\"pos\": pos, \"prevpos\": prevpos}\n",
678 | "chunker = ConsecutiveNPChunker(train_sents)\n",
679 | "print(chunker.evaluate(test_sents))"
680 | ]
681 | },
682 | {
683 | "cell_type": "markdown",
684 | "metadata": {},
685 | "source": [
686 | "```\n",
687 | "ChunkParse score:\n",
688 | " IOB Accuracy: 93.6%\n",
689 | " Precision: 81.9%\n",
690 | " Recall: 87.2%\n",
691 | " F-Measure: 84.5%\n",
692 | "```"
693 | ]
694 | },
695 | {
696 | "cell_type": "markdown",
697 | "metadata": {},
698 | "source": [
699 | "下一步,我们将尝试为当前词增加特征,因为我们假设这个词的内容应该对词块划有用。我们发现这个特征确实提高了词块划分器的表现,大约1.5个百分点(相应的错误率减少大约10%)。"
700 | ]
701 | },
702 | {
703 | "cell_type": "code",
704 | "execution_count": null,
705 | "metadata": {},
706 | "outputs": [],
707 | "source": [
708 | "def npchunk_features(sentence, i, history):\n",
709 | " word, pos = sentence[i]\n",
710 | " if i == 0:\n",
711 | " prevword, prevpos = \"\", \"\"\n",
712 | " else:\n",
713 | " prevword, prevpos = sentence[i-1]\n",
714 | " return {\"pos\": pos, \"word\": word, \"prevpos\": prevpos}\n",
715 | "chunker = ConsecutiveNPChunker(train_sents)\n",
716 | "print(chunker.evaluate(test_sents))"
717 | ]
718 | },
719 | {
720 | "cell_type": "markdown",
721 | "metadata": {},
722 | "source": [
723 | "```\n",
724 | "ChunkParse score:\n",
725 | " IOB Accuracy: 94.5%\n",
726 | " Precision: 84.2%\n",
727 | " Recall: 89.4%\n",
728 | " F-Measure: 86.7%\n",
729 | "```"
730 | ]
731 | },
732 | {
733 | "cell_type": "markdown",
734 | "metadata": {},
735 | "source": [
736 | "最后,我们尝试用多种附加特征扩展特征提取器,例如预取特征[1]、配对特征[2]和复杂的语境特征[3]。这最后一个特征,称为tags-since-dt,创建一个字符串,描述自最近的限定词以来遇到的所有词性标记,或如果没有限定词则在索引i之前自语句开始以来遇到的所有词性标记。"
737 | ]
738 | },
739 | {
740 | "cell_type": "code",
741 | "execution_count": 36,
742 | "metadata": {},
743 | "outputs": [],
744 | "source": [
745 | "def npchunk_features(sentence, i, history):\n",
746 | " word, pos = sentence[i]\n",
747 | " if i == 0:\n",
748 | " prevword, prevpos = \"\", \"\"\n",
749 | " else:\n",
750 | " prevword, prevpos = sentence[i-1]\n",
751 | " if i == len(sentence)-1:\n",
752 | " nextword, nextpos = \"\", \"\"\n",
753 | " else:\n",
754 | " nextword, nextpos = sentence[i+1]\n",
755 | " return {\"pos\": pos,\n",
756 | " \"word\": word,\n",
757 | " \"prevpos\": prevpos,\n",
758 | " \"nextpos\": nextpos,\n",
759 | " \"prevpos+pos\": \"%s+%s\" % (prevpos, pos), \n",
760 | " \"pos+nextpos\": \"%s+%s\" % (pos, nextpos),\n",
761 | " \"tags-since-dt\": tags_since_dt(sentence, i)} "
762 | ]
763 | },
764 | {
765 | "cell_type": "code",
766 | "execution_count": 37,
767 | "metadata": {},
768 | "outputs": [],
769 | "source": [
770 | "def tags_since_dt(sentence, i):\n",
771 | " tags = set()\n",
772 | " for word, pos in sentence[:i]:\n",
773 | " if pos == 'DT':\n",
774 | " tags = set()\n",
775 | " else:\n",
776 | " tags.add(pos)\n",
777 | " return '+'.join(sorted(tags))"
778 | ]
779 | },
780 | {
781 | "cell_type": "code",
782 | "execution_count": null,
783 | "metadata": {},
784 | "outputs": [],
785 | "source": [
786 | "chunker = ConsecutiveNPChunker(train_sents)\n",
787 | "print(chunker.evaluate(test_sents))"
788 | ]
789 | },
790 | {
791 | "cell_type": "markdown",
792 | "metadata": {},
793 | "source": [
794 | "```\n",
795 | "ChunkParse score:\n",
796 | " IOB Accuracy: 96.0%\n",
797 | " Precision: 88.6%\n",
798 | " Recall: 91.0%\n",
799 | " F-Measure: 89.8%\n",
800 | "```"
801 | ]
802 | },
803 | {
804 | "cell_type": "markdown",
805 | "metadata": {},
806 | "source": [
807 | "# 4 语言结构中的递归\n",
808 | "## 4.1 用级联词块划分器构建嵌套结构\n",
809 | "\n",
810 | "到目前为止,我们的词块结构一直是相对平的。已标注词符组成的树在如NP这样的词块节点下任意组合。然而,只需创建一个包含递归规则的多级的词块语法,就可以建立任意深度的词块结构。4.1是名词短语、介词短语、动词短语和句子的模式。这是一个四级词块语法器,可以用来创建深度最多为4的结构。\n",
811 | "\n"
812 | ]
813 | },
814 | {
815 | "cell_type": "code",
816 | "execution_count": 42,
817 | "metadata": {},
818 | "outputs": [
819 | {
820 | "name": "stdout",
821 | "output_type": "stream",
822 | "text": [
823 | "(S\n",
824 | " (NP Mary/NN)\n",
825 | " saw/VBD\n",
826 | " (CLAUSE\n",
827 | " (NP the/DT cat/NN)\n",
828 | " (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))\n"
829 | ]
830 | }
831 | ],
832 | "source": [
833 | "grammar = r\"\"\"\n",
834 | " NP: {+} \n",
835 | " PP: {} \n",
836 | " VP: {+$} \n",
837 | " CLAUSE: {} \n",
838 | " \"\"\"\n",
839 | "cp = nltk.RegexpParser(grammar)\n",
840 | "sentence = [(\"Mary\", \"NN\"), (\"saw\", \"VBD\"), (\"the\", \"DT\"), (\"cat\", \"NN\"),\n",
841 | " (\"sit\", \"VB\"), (\"on\", \"IN\"), (\"the\", \"DT\"), (\"mat\", \"NN\")]\n",
842 | "print(cp.parse(sentence))"
843 | ]
844 | },
845 | {
846 | "cell_type": "markdown",
847 | "metadata": {},
848 | "source": [
849 | "不幸的是,这一结果丢掉了saw为首的VP。它还有其他缺陷。当我们将此词块划分器应用到一个有更深嵌套的句子时,让我们看看会发生什么。请注意,它无法识别[1]开始的VP词块。"
850 | ]
851 | },
852 | {
853 | "cell_type": "code",
854 | "execution_count": 43,
855 | "metadata": {},
856 | "outputs": [
857 | {
858 | "name": "stdout",
859 | "output_type": "stream",
860 | "text": [
861 | "(S\n",
862 | " (NP John/NNP)\n",
863 | " thinks/VBZ\n",
864 | " (NP Mary/NN)\n",
865 | " saw/VBD\n",
866 | " (CLAUSE\n",
867 | " (NP the/DT cat/NN)\n",
868 | " (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))\n"
869 | ]
870 | }
871 | ],
872 | "source": [
873 | "sentence = [(\"John\", \"NNP\"), (\"thinks\", \"VBZ\"), (\"Mary\", \"NN\"),\n",
874 | " (\"saw\", \"VBD\"), (\"the\", \"DT\"), (\"cat\", \"NN\"), (\"sit\", \"VB\"),\n",
875 | " (\"on\", \"IN\"), (\"the\", \"DT\"), (\"mat\", \"NN\")]\n",
876 | "print(cp.parse(sentence))"
877 | ]
878 | },
879 | {
880 | "cell_type": "markdown",
881 | "metadata": {},
882 | "source": [
883 | "这些问题的解决方案是让词块划分器在它的模式中循环:尝试完所有模式之后,重复此过程。我们添加一个可选的第二个参数loop指定这套模式应该循环的次数:"
884 | ]
885 | },
886 | {
887 | "cell_type": "code",
888 | "execution_count": 44,
889 | "metadata": {},
890 | "outputs": [
891 | {
892 | "name": "stdout",
893 | "output_type": "stream",
894 | "text": [
895 | "(S\n",
896 | " (NP John/NNP)\n",
897 | " thinks/VBZ\n",
898 | " (CLAUSE\n",
899 | " (NP Mary/NN)\n",
900 | " (VP\n",
901 | " saw/VBD\n",
902 | " (CLAUSE\n",
903 | " (NP the/DT cat/NN)\n",
904 | " (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))\n"
905 | ]
906 | }
907 | ],
908 | "source": [
909 | "cp = nltk.RegexpParser(grammar, loop=2)\n",
910 | "print(cp.parse(sentence))"
911 | ]
912 | },
913 | {
914 | "cell_type": "markdown",
915 | "metadata": {},
916 | "source": [
917 | "注意\n",
918 | "\n",
919 | "这个级联过程使我们能创建深层结构。然而,创建和调试级联过程是困难的,关键点是它能更有效地做全面的分析(见第8.章)。另外,级联过程只能产生固定深度的树(不超过级联级数),完整的句法分析这是不够的。\n"
920 | ]
921 | },
922 | {
923 | "cell_type": "markdown",
924 | "metadata": {},
925 | "source": [
926 | "## 4.2 Trees\n",
927 | "\n",
928 | "tree是一组连接的加标签节点,从一个特殊的根节点沿一条唯一的路径到达每个节点。下面是一棵树的例子(注意它们标准的画法是颠倒的):\n",
929 | "```\n",
930 | "(S\n",
931 | " (NP Alice)\n",
932 | " (VP\n",
933 | " (V chased)\n",
934 | " (NP\n",
935 | " (Det the)\n",
936 | " (N rabbit))))\n",
937 | "```\n",
938 | "虽然我们将只集中关注语法树,树可以用来编码任何同构的超越语言形式序列的层次结构(如形态结构、篇章结构)。一般情况下,叶子和节点值不一定要是字符串。\n",
939 | "\n",
940 | "在NLTK中,我们通过给一个节点添加标签和一系列的孩子创建一棵树:"
941 | ]
942 | },
943 | {
944 | "cell_type": "code",
945 | "execution_count": 46,
946 | "metadata": {},
947 | "outputs": [
948 | {
949 | "name": "stdout",
950 | "output_type": "stream",
951 | "text": [
952 | "(NP Alice)\n",
953 | "(NP the rabbit)\n"
954 | ]
955 | }
956 | ],
957 | "source": [
958 | "tree1 = nltk.Tree('NP', ['Alice'])\n",
959 | "print(tree1)\n",
960 | "tree2 = nltk.Tree('NP', ['the', 'rabbit'])\n",
961 | "print(tree2)"
962 | ]
963 | },
964 | {
965 | "cell_type": "markdown",
966 | "metadata": {},
967 | "source": [
968 | "我们可以将这些不断合并成更大的树,如下所示:"
969 | ]
970 | },
971 | {
972 | "cell_type": "code",
973 | "execution_count": 47,
974 | "metadata": {},
975 | "outputs": [
976 | {
977 | "name": "stdout",
978 | "output_type": "stream",
979 | "text": [
980 | "(S (NP Alice) (VP chased (NP the rabbit)))\n"
981 | ]
982 | }
983 | ],
984 | "source": [
985 | "tree3 = nltk.Tree('VP', ['chased', tree2])\n",
986 | "tree4 = nltk.Tree('S', [tree1, tree3])\n",
987 | "print(tree4)"
988 | ]
989 | },
990 | {
991 | "cell_type": "markdown",
992 | "metadata": {},
993 | "source": [
994 | "下面是树对象的一些的方法:"
995 | ]
996 | },
997 | {
998 | "cell_type": "code",
999 | "execution_count": 49,
1000 | "metadata": {},
1001 | "outputs": [
1002 | {
1003 | "name": "stdout",
1004 | "output_type": "stream",
1005 | "text": [
1006 | "(VP chased (NP the rabbit))\n",
1007 | "VP\n",
1008 | "['Alice', 'chased', 'the', 'rabbit']\n",
1009 | "rabbit\n"
1010 | ]
1011 | }
1012 | ],
1013 | "source": [
1014 | "print(tree4[1])\n",
1015 | "print(tree4[1].label())\n",
1016 | "print(tree4.leaves())\n",
1017 | "print(tree4[1][1][1])"
1018 | ]
1019 | },
1020 | {
1021 | "cell_type": "markdown",
1022 | "metadata": {},
1023 | "source": [
1024 | "复杂的树用括号表示难以阅读。在这些情况下,draw方法是非常有用的。它会打开一个新窗口,包含树的一个图形表示。树显示窗口可以放大和缩小,子树可以折叠和展开,并将图形表示输出为一个postscript文件(包含在一个文档中)。"
1025 | ]
1026 | },
1027 | {
1028 | "cell_type": "code",
1029 | "execution_count": 50,
1030 | "metadata": {},
1031 | "outputs": [],
1032 | "source": [
1033 | "tree3.draw()\n"
1034 | ]
1035 | },
1036 | {
1037 | "cell_type": "markdown",
1038 | "metadata": {},
1039 | "source": [
1040 | ""
1041 | ]
1042 | },
1043 | {
1044 | "cell_type": "markdown",
1045 | "metadata": {},
1046 | "source": [
1047 | "## 4.3 树遍历\n",
1048 | "\n",
1049 | "使用递归函数来遍历树是标准的做法。4.2中的内容进行了演示。\n",
1050 | "\n"
1051 | ]
1052 | },
1053 | {
1054 | "cell_type": "code",
1055 | "execution_count": 53,
1056 | "metadata": {},
1057 | "outputs": [
1058 | {
1059 | "name": "stdout",
1060 | "output_type": "stream",
1061 | "text": [
1062 | "( S ( NP Alice ) ( VP chased ( NP the rabbit ) ) ) "
1063 | ]
1064 | }
1065 | ],
1066 | "source": [
1067 | "def traverse(t):\n",
1068 | " try:\n",
1069 | " t.label()\n",
1070 | " except AttributeError:\n",
1071 | " print(t, end=\" \")\n",
1072 | " else:\n",
1073 | " # Now we know that t.node is defined\n",
1074 | " print('(', t.label(), end=\" \")\n",
1075 | " for child in t:\n",
1076 | " traverse(child)\n",
1077 | " print(')', end=\" \")\n",
1078 | "\n",
1079 | "t = tree4\n",
1080 | "traverse(t)"
1081 | ]
1082 | },
1083 | {
1084 | "cell_type": "markdown",
1085 | "metadata": {},
1086 | "source": [
1087 | "# 5 命名实体识别\n",
1088 | "\n",
1089 | "在本章开头,我们简要介绍了命名实体(NE)。命名实体是确切的名词短语,指示特定类型的个体,如组织、人、日期等。5.1列出了一些较常用的NE类型。这些应该是不言自明的,除了“FACILITY”:建筑和土木工程领域的人造产品;以及“GPE”:地缘政治实体,如城市、州/省、国家。\n",
1090 | "\n",
1091 | "\n",
1092 | "常用命名实体类型\n",
1093 | "```\n",
1094 | "Eddy N B-PER\n",
1095 | "Bonte N I-PER\n",
1096 | "is V O\n",
1097 | "woordvoerder N O\n",
1098 | "van Prep O\n",
1099 | "diezelfde Pron O\n",
1100 | "Hogeschool N B-ORG\n",
1101 | ". Punc O\n",
1102 | "```"
1103 | ]
1104 | },
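{
"cell_type": "markdown",
"metadata": {},
"source": [
"补充说明:下面的代码单元使用了一个事先准备好的已词性标注句子 sent(原笔记未给出其定义)。原书此处的取法如下(需要已下载 treebank 语料);binary=True 时所有命名实体一律标注为 NE,否则会区分 PERSON、ORGANIZATION、GPE 等类型:\n",
"```py\n",
"sent = nltk.corpus.treebank.tagged_sents()[22]\n",
"print(nltk.ne_chunk(sent, binary=True))\n",
"print(nltk.ne_chunk(sent))\n",
"```"
]
},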
1105 | {
1106 | "cell_type": "code",
1107 | "execution_count": 54,
1108 | "metadata": {},
1109 | "outputs": [
1110 | {
1111 | "name": "stdout",
1112 | "output_type": "stream",
1113 | "text": [
1114 | "(S\n",
1115 | " From/IN\n",
1116 | " what/WDT\n",
1117 | " I/PPSS\n",
1118 | " was/BEDZ\n",
1119 | " able/JJ\n",
1120 | " to/IN\n",
1121 | " gauge/NN\n",
1122 | " in/IN\n",
1123 | " a/AT\n",
1124 | " swift/JJ\n",
1125 | " ,/,\n",
1126 | " greedy/JJ\n",
1127 | " glance/NN\n",
1128 | " ,/,\n",
1129 | " the/AT\n",
1130 | " figure/NN\n",
1131 | " inside/IN\n",
1132 | " the/AT\n",
1133 | " coral-colored/JJ\n",
1134 | " boucle/NN\n",
1135 | " dress/NN\n",
1136 | " was/BEDZ\n",
1137 | " stupefying/VBG\n",
1138 | " ./.)\n"
1139 | ]
1140 | }
1141 | ],
1142 | "source": [
1143 | "print(nltk.ne_chunk(sent)) "
1144 | ]
1145 | },
1146 | {
1147 | "cell_type": "markdown",
1148 | "metadata": {},
1149 | "source": [
1150 | "# 6 关系抽取\n",
1151 | "\n",
1152 | "一旦文本中的命名实体已被识别,我们就可以提取它们之间存在的关系。如前所述,我们通常会寻找指定类型的命名实体之间的关系。进行这一任务的方法之一是首先寻找所有X, α, Y)形式的三元组,其中X和Y是指定类型的命名实体,α表示X和Y之间关系的字符串。然后我们可以使用正则表达式从α的实体中抽出我们正在查找的关系。下面的例子搜索包含词in的字符串。特殊的正则表达式(?!\\b.+ing\\b)是一个否定预测先行断言,允许我们忽略如success in supervising the transition of中的字符串,其中in后面跟一个动名词。"
1153 | ]
1154 | },
1155 | {
1156 | "cell_type": "code",
1157 | "execution_count": 55,
1158 | "metadata": {},
1159 | "outputs": [
1160 | {
1161 | "name": "stdout",
1162 | "output_type": "stream",
1163 | "text": [
1164 | "[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']\n",
1165 | "[ORG: 'McGlashan & Sarrail'] 'firm in' [LOC: 'San Mateo']\n",
1166 | "[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']\n",
1167 | "[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']\n",
1168 | "[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']\n",
1169 | "[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']\n",
1170 | "[ORG: 'WGBH'] 'in' [LOC: 'Boston']\n",
1171 | "[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']\n",
1172 | "[ORG: 'Omnicom'] 'in' [LOC: 'New York']\n",
1173 | "[ORG: 'DDB Needham'] 'in' [LOC: 'New York']\n",
1174 | "[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']\n",
1175 | "[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']\n",
1176 | "[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']\n"
1177 | ]
1178 | }
1179 | ],
1180 | "source": [
1181 | "IN = re.compile(r'.*\\bin\\b(?!\\b.+ing)')\n",
1182 | "for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):\n",
1183 | " for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,\n",
1184 | " corpus='ieer', pattern = IN):\n",
1185 | " print(nltk.sem.rtuple(rel))"
1186 | ]
1187 | },
1188 | {
1189 | "cell_type": "markdown",
1190 | "metadata": {},
1191 | "source": [
1192 | "搜索关键字in执行的相当不错,虽然它的检索结果也会误报,例如[ORG: House Transportation Committee] , secured the most money in the [LOC: New York];一种简单的基于字符串的方法排除这样的填充字符串似乎不太可能。\n",
1193 | "\n",
1194 | "如前文所示,conll2002命名实体语料库的荷兰语部分不只包含命名实体标注,也包含词性标注。这允许我们设计对这些标记敏感的模式,如下面的例子所示。clause()方法以分条形式输出关系,其中二元关系符号作为参数relsym的值被指定[1]。"
1195 | ]
1196 | },
1197 | {
1198 | "cell_type": "code",
1199 | "execution_count": 57,
1200 | "metadata": {},
1201 | "outputs": [
1202 | {
1203 | "name": "stdout",
1204 | "output_type": "stream",
1205 | "text": [
1206 | "VAN(\"cornet_d'elzius\", 'buitenlandse_handel')\n",
1207 | "VAN('johan_rottiers', 'kardinaal_van_roey_instituut')\n",
1208 | "VAN('annie_lennox', 'eurythmics')\n"
1209 | ]
1210 | }
1211 | ],
1212 | "source": [
1213 | "from nltk.corpus import conll2002\n",
1214 | "vnv = \"\"\"\n",
1215 | "(\n",
1216 | "is/V| # 3rd sing present and\n",
1217 | "was/V| # past forms of the verb zijn ('be')\n",
1218 | "werd/V| # and also present\n",
1219 | "wordt/V # past of worden ('become)\n",
1220 | ")\n",
1221 | ".* # followed by anything\n",
1222 | "van/Prep # followed by van ('of')\n",
1223 | "\"\"\"\n",
1224 | "VAN = re.compile(vnv, re.VERBOSE)\n",
1225 | "for doc in conll2002.chunked_sents('ned.train'):\n",
1226 | " for r in nltk.sem.extract_rels('PER', 'ORG', doc,\n",
1227 | " corpus='conll2002', pattern=VAN):\n",
1228 | " print(nltk.sem.clause(r, relsym=\"VAN\"))"
1229 | ]
1230 | },
1231 | {
1232 | "cell_type": "code",
1233 | "execution_count": null,
1234 | "metadata": {},
1235 | "outputs": [],
1236 | "source": []
1237 | }
1238 | ],
1239 | "metadata": {
1240 | "kernelspec": {
1241 | "display_name": "Python 3",
1242 | "language": "python",
1243 | "name": "python3"
1244 | },
1245 | "language_info": {
1246 | "codemirror_mode": {
1247 | "name": "ipython",
1248 | "version": 3
1249 | },
1250 | "file_extension": ".py",
1251 | "mimetype": "text/x-python",
1252 | "name": "python",
1253 | "nbconvert_exporter": "python",
1254 | "pygments_lexer": "ipython3",
1255 | "version": "3.6.2"
1256 | }
1257 | },
1258 | "nbformat": 4,
1259 | "nbformat_minor": 2
1260 | }
1261 |
--------------------------------------------------------------------------------
/3.document.txt:
--------------------------------------------------------------------------------
1 | 沁园春·雪
2 | 作者:毛泽东
3 | 北国风光,千里冰封,万里雪飘。
4 | 望长城内外,惟余莽莽;大河上下,顿失滔滔。
5 | 山舞银蛇,原驰蜡象,欲与天公试比高。
6 | 须晴日,看红装素裹,分外妖娆。
7 | 江山如此多娇,引无数英雄竞折腰。
8 | 惜秦皇汉武,略输文采;唐宗宋祖,稍逊风骚。
--------------------------------------------------------------------------------
/3.output.txt:
--------------------------------------------------------------------------------
1 | !
2 | '
3 | (
4 | )
5 | ,
6 | ,)
7 | .
8 | .)
9 | :
10 | ;
11 | ;)
12 | ?
13 | ?)
14 | A
15 | Abel
16 | Abelmizraim
17 | Abidah
18 | Abide
19 | Abimael
20 | Abimelech
21 | Abr
22 | Abrah
23 | Abraham
24 | Abram
25 | Accad
26 | Achbor
27 | Adah
28 | Adam
29 | Adbeel
30 | Admah
31 | Adullamite
32 | After
33 | Aholibamah
34 | Ahuzzath
35 | Ajah
36 | Akan
37 | All
38 | Allonbachuth
39 | Almighty
40 | Almodad
41 | Also
42 | Alvah
43 | Alvan
44 | Am
45 | Amal
46 | Amalek
47 | Amalekites
48 | Ammon
49 | Amorite
50 | Amorites
51 | Amraphel
52 | An
53 | Anah
54 | Anamim
55 | And
56 | Aner
57 | Angel
58 | Appoint
59 | Aram
60 | Aran
61 | Ararat
62 | Arbah
63 | Ard
64 | Are
65 | Areli
66 | Arioch
67 | Arise
68 | Arkite
69 | Arodi
70 | Arphaxad
71 | Art
72 | Arvadite
73 | As
74 | Asenath
75 | Ashbel
76 | Asher
77 | Ashkenaz
78 | Ashteroth
79 | Ask
80 | Asshur
81 | Asshurim
82 | Assyr
83 | Assyria
84 | At
85 | Atad
86 | Avith
87 | Baalhanan
88 | Babel
89 | Bashemath
90 | Be
91 | Because
92 | Becher
93 | Bedad
94 | Beeri
95 | Beerlahairoi
96 | Beersheba
97 | Behold
98 | Bela
99 | Belah
100 | Benam
101 | Benjamin
102 | Beno
103 | Beor
104 | Bera
105 | Bered
106 | Beriah
107 | Bethel
108 | Bethlehem
109 | Bethuel
110 | Beware
111 | Bilhah
112 | Bilhan
113 | Binding
114 | Birsha
115 | Bless
116 | Blessed
117 | Both
118 | Bow
119 | Bozrah
120 | Bring
121 | But
122 | Buz
123 | By
124 | Cain
125 | Cainan
126 | Calah
127 | Calneh
128 | Can
129 | Cana
130 | Canaan
131 | Canaanite
132 | Canaanites
133 | Canaanitish
134 | Caphtorim
135 | Carmi
136 | Casluhim
137 | Cast
138 | Cause
139 | Chaldees
140 | Chedorlaomer
141 | Cheran
142 | Cherubims
143 | Chesed
144 | Chezib
145 | Come
146 | Cursed
147 | Cush
148 | Damascus
149 | Dan
150 | Day
151 | Deborah
152 | Dedan
153 | Deliver
154 | Diklah
155 | Din
156 | Dinah
157 | Dinhabah
158 | Discern
159 | Dishan
160 | Dishon
161 | Do
162 | Dodanim
163 | Dothan
164 | Drink
165 | Duke
166 | Dumah
167 | Earth
168 | Ebal
169 | Eber
170 | Edar
171 | Eden
172 | Edom
173 | Edomites
174 | Egy
175 | Egypt
176 | Egyptia
177 | Egyptian
178 | Egyptians
179 | Ehi
180 | Elah
181 | Elam
182 | Elbethel
183 | Eldaah
184 | EleloheIsrael
185 | Eliezer
186 | Eliphaz
187 | Elishah
188 | Ellasar
189 | Elon
190 | Elparan
191 | Emins
192 | En
193 | Enmishpat
194 | Eno
195 | Enoch
196 | Enos
197 | Ephah
198 | Epher
199 | Ephra
200 | Ephraim
201 | Ephrath
202 | Ephron
203 | Er
204 | Erech
205 | Eri
206 | Es
207 | Esau
208 | Escape
209 | Esek
210 | Eshban
211 | Eshcol
212 | Ethiopia
213 | Euphrat
214 | Euphrates
215 | Eve
216 | Even
217 | Every
218 | Except
219 | Ezbon
220 | Ezer
221 | Fear
222 | Feed
223 | Fifteen
224 | Fill
225 | For
226 | Forasmuch
227 | Forgive
228 | From
229 | Fulfil
230 | G
231 | Gad
232 | Gaham
233 | Galeed
234 | Gatam
235 | Gather
236 | Gaza
237 | Gentiles
238 | Gera
239 | Gerar
240 | Gershon
241 | Get
242 | Gether
243 | Gihon
244 | Gilead
245 | Girgashites
246 | Girgasite
247 | Give
248 | Go
249 | God
250 | Gomer
251 | Gomorrah
252 | Goshen
253 | Guni
254 | Hadad
255 | Hadar
256 | Hadoram
257 | Hagar
258 | Haggi
259 | Hai
260 | Ham
261 | Hamathite
262 | Hamor
263 | Hamul
264 | Hanoch
265 | Happy
266 | Haran
267 | Hast
268 | Haste
269 | Have
270 | Havilah
271 | Hazarmaveth
272 | Hazezontamar
273 | Hazo
274 | He
275 | Hear
276 | Heaven
277 | Heber
278 | Hebrew
279 | Hebrews
280 | Hebron
281 | Hemam
282 | Hemdan
283 | Here
284 | Hereby
285 | Heth
286 | Hezron
287 | Hiddekel
288 | Hinder
289 | Hirah
290 | His
291 | Hitti
292 | Hittite
293 | Hittites
294 | Hivite
295 | Hobah
296 | Hori
297 | Horite
298 | Horites
299 | How
300 | Hul
301 | Huppim
302 | Husham
303 | Hushim
304 | Huz
305 | I
306 | If
307 | In
308 | Irad
309 | Iram
310 | Is
311 | Isa
312 | Isaac
313 | Iscah
314 | Ishbak
315 | Ishmael
316 | Ishmeelites
317 | Ishuah
318 | Isra
319 | Israel
320 | Issachar
321 | Isui
322 | It
323 | Ithran
324 | Jaalam
325 | Jabal
326 | Jabbok
327 | Jac
328 | Jachin
329 | Jacob
330 | Jahleel
331 | Jahzeel
332 | Jamin
333 | Japhe
334 | Japheth
335 | Jared
336 | Javan
337 | Jebusite
338 | Jebusites
339 | Jegarsahadutha
340 | Jehovahjireh
341 | Jemuel
342 | Jerah
343 | Jetheth
344 | Jetur
345 | Jeush
346 | Jezer
347 | Jidlaph
348 | Jimnah
349 | Job
350 | Jobab
351 | Jokshan
352 | Joktan
353 | Jordan
354 | Joseph
355 | Jubal
356 | Judah
357 | Judge
358 | Judith
359 | Kadesh
360 | Kadmonites
361 | Karnaim
362 | Kedar
363 | Kedemah
364 | Kemuel
365 | Kenaz
366 | Kenites
367 | Kenizzites
368 | Keturah
369 | Kiriathaim
370 | Kirjatharba
371 | Kittim
372 | Know
373 | Kohath
374 | Kor
375 | Korah
376 | LO
377 | LORD
378 | Laban
379 | Lahairoi
380 | Lamech
381 | Lasha
382 | Lay
383 | Leah
384 | Lehabim
385 | Lest
386 | Let
387 | Letushim
388 | Leummim
389 | Levi
390 | Lie
391 | Lift
392 | Lo
393 | Look
394 | Lot
395 | Lotan
396 | Lud
397 | Ludim
398 | Luz
399 | Maachah
400 | Machir
401 | Machpelah
402 | Madai
403 | Magdiel
404 | Magog
405 | Mahalaleel
406 | Mahalath
407 | Mahanaim
408 | Make
409 | Malchiel
410 | Male
411 | Mam
412 | Mamre
413 | Man
414 | Manahath
415 | Manass
416 | Manasseh
417 | Mash
418 | Masrekah
419 | Massa
420 | Matred
421 | Me
422 | Medan
423 | Mehetabel
424 | Mehujael
425 | Melchizedek
426 | Merari
427 | Mesha
428 | Meshech
429 | Mesopotamia
430 | Methusa
431 | Methusael
432 | Methuselah
433 | Mezahab
434 | Mibsam
435 | Mibzar
436 | Midian
437 | Midianites
438 | Milcah
439 | Mishma
440 | Mizpah
441 | Mizraim
442 | Mizz
443 | Moab
444 | Moabites
445 | Moreh
446 | Moreover
447 | Moriah
448 | Muppim
449 | My
450 | Naamah
451 | Naaman
452 | Nahath
453 | Nahor
454 | Naphish
455 | Naphtali
456 | Naphtuhim
457 | Nay
458 | Nebajoth
459 | Neither
460 | Night
461 | Nimrod
462 | Nineveh
463 | Noah
464 | Nod
465 | Not
466 | Now
467 | O
468 | Obal
469 | Of
470 | Oh
471 | Ohad
472 | Omar
473 | On
474 | Onam
475 | Onan
476 | Only
477 | Ophir
478 | Our
479 | Out
480 | Padan
481 | Padanaram
482 | Paran
483 | Pass
484 | Pathrusim
485 | Pau
486 | Peace
487 | Peleg
488 | Peniel
489 | Penuel
490 | Peradventure
491 | Perizzit
492 | Perizzite
493 | Perizzites
494 | Phallu
495 | Phara
496 | Pharaoh
497 | Pharez
498 | Phichol
499 | Philistim
500 | Philistines
501 | Phut
502 | Phuvah
503 | Pildash
504 | Pinon
505 | Pison
506 | Potiphar
507 | Potipherah
508 | Put
509 | Raamah
510 | Rachel
511 | Rameses
512 | Rebek
513 | Rebekah
514 | Rehoboth
515 | Remain
516 | Rephaims
517 | Resen
518 | Return
519 | Reu
520 | Reub
521 | Reuben
522 | Reuel
523 | Reumah
524 | Riphath
525 | Rosh
526 | Sabtah
527 | Sabtech
528 | Said
529 | Salah
530 | Salem
531 | Samlah
532 | Sarah
533 | Sarai
534 | Saul
535 | Save
536 | Say
537 | Se
538 | Seba
539 | See
540 | Seeing
541 | Seir
542 | Sell
543 | Send
544 | Sephar
545 | Serah
546 | Sered
547 | Serug
548 | Set
549 | Seth
550 | Shalem
551 | Shall
552 | Shalt
553 | Shammah
554 | Shaul
555 | Shaveh
556 | She
557 | Sheba
558 | Shebah
559 | Shechem
560 | Shed
561 | Shel
562 | Shelah
563 | Sheleph
564 | Shem
565 | Shemeber
566 | Shepho
567 | Shillem
568 | Shiloh
569 | Shimron
570 | Shinab
571 | Shinar
572 | Shobal
573 | Should
574 | Shuah
575 | Shuni
576 | Shur
577 | Sichem
578 | Siddim
579 | Sidon
580 | Simeon
581 | Sinite
582 | Sitnah
583 | Slay
584 | So
585 | Sod
586 | Sodom
587 | Sojourn
588 | Some
589 | Spake
590 | Speak
591 | Spirit
592 | Stand
593 | Succoth
594 | Surely
595 | Swear
596 | Syrian
597 | Take
598 | Tamar
599 | Tarshish
600 | Tebah
601 | Tell
602 | Tema
603 | Teman
604 | Temani
605 | Terah
606 | Thahash
607 | That
608 | The
609 | Then
610 | There
611 | Therefore
612 | These
613 | They
614 | Thirty
615 | This
616 | Thorns
617 | Thou
618 | Thus
619 | Thy
620 | Tidal
621 | Timna
622 | Timnah
623 | Timnath
624 | Tiras
625 | To
626 | Togarmah
627 | Tola
628 | Tubal
629 | Tubalcain
630 | Twelve
631 | Two
632 | Unstable
633 | Until
634 | Unto
635 | Up
636 | Upon
637 | Ur
638 | Uz
639 | Uzal
640 | We
641 | What
642 | When
643 | Whence
644 | Where
645 | Whereas
646 | Wherefore
647 | Which
648 | While
649 | Who
650 | Whose
651 | Whoso
652 | Why
653 | Wilt
654 | With
655 | Woman
656 | Ye
657 | Yea
658 | Yet
659 | Zaavan
660 | Zaphnathpaaneah
661 | Zar
662 | Zarah
663 | Zeboiim
664 | Zeboim
665 | Zebul
666 | Zebulun
667 | Zemarite
668 | Zepho
669 | Zerah
670 | Zibeon
671 | Zidon
672 | Zillah
673 | Zilpah
674 | Zimran
675 | Ziphion
676 | Zo
677 | Zoar
678 | Zohar
679 | Zuzims
680 | a
681 | abated
682 | abide
683 | able
684 | abode
685 | abomination
686 | about
687 | above
688 | abroad
689 | absent
690 | abundantly
691 | accept
692 | accepted
693 | according
694 | acknowledged
695 | activity
696 | add
697 | adder
698 | afar
699 | afflict
700 | affliction
701 | afraid
702 | after
703 | afterward
704 | afterwards
705 | aga
706 | again
707 | against
708 | age
709 | aileth
710 | air
711 | al
712 | alive
713 | all
714 | almon
715 | alo
716 | alone
717 | aloud
718 | also
719 | altar
720 | altogether
721 | always
722 | am
723 | among
724 | amongst
725 | an
726 | and
727 | angel
728 | angels
729 | anger
730 | angry
731 | anguish
732 | anointedst
733 | anoth
734 | another
735 | answer
736 | answered
737 | any
738 | anything
739 | appe
740 | appear
741 | appeared
742 | appease
743 | appoint
744 | appointed
745 | aprons
746 | archer
747 | archers
748 | are
749 | arise
750 | ark
751 | armed
752 | arms
753 | army
754 | arose
755 | arrayed
756 | art
757 | artificer
758 | as
759 | ascending
760 | ash
761 | ashamed
762 | ask
763 | asked
764 | asketh
765 | ass
766 | assembly
767 | asses
768 | assigned
769 | asswaged
770 | at
771 | attained
772 | audience
773 | avenged
774 | aw
775 | awaked
776 | away
777 | awoke
778 | back
779 | backward
780 | bad
781 | bade
782 | badest
783 | badne
784 | bak
785 | bake
786 | bakemeats
787 | baker
788 | bakers
789 | balm
790 | bands
791 | bank
792 | bare
793 | barr
794 | barren
795 | basket
796 | baskets
797 | battle
798 | bdellium
799 | be
800 | bear
801 | beari
802 | bearing
803 | beast
804 | beasts
805 | beautiful
806 | became
807 | because
808 | become
809 | bed
810 | been
811 | befall
812 | befell
813 | before
814 | began
815 | begat
816 | beget
817 | begettest
818 | begin
819 | beginning
820 | begotten
821 | beguiled
822 | beheld
823 | behind
824 | behold
825 | being
826 | believed
827 | belly
828 | belong
829 | beneath
830 | bereaved
831 | beside
832 | besides
833 | besought
834 | best
835 | betimes
836 | better
837 | between
838 | betwixt
839 | beyond
840 | binding
841 | bird
842 | birds
843 | birthday
844 | birthright
845 | biteth
846 | bitter
847 | blame
848 | blameless
849 | blasted
850 | bless
851 | blessed
852 | blesseth
853 | blessi
854 | blessing
855 | blessings
856 | blindness
857 | blood
858 | blossoms
859 | bodies
860 | boldly
861 | bondman
862 | bondmen
863 | bondwoman
864 | bone
865 | bones
866 | book
867 | booths
868 | border
869 | borders
870 | born
871 | bosom
872 | both
873 | bottle
874 | bou
875 | boug
876 | bough
877 | bought
878 | bound
879 | bow
880 | bowed
881 | bowels
882 | bowing
883 | boys
884 | bracelets
885 | branches
886 | brass
887 | bre
888 | breach
889 | bread
890 | breadth
891 | break
892 | breaketh
893 | breaking
894 | breasts
895 | breath
896 | breathed
897 | breed
898 | brethren
899 | brick
900 | brimstone
901 | bring
902 | brink
903 | broken
904 | brook
905 | broth
906 | brother
907 | brought
908 | brown
909 | bruise
910 | budded
911 | build
912 | builded
913 | built
914 | bulls
915 | bundle
916 | bundles
917 | burdens
918 | buried
919 | burn
920 | burning
921 | burnt
922 | bury
923 | buryingplace
924 | business
925 | but
926 | butler
927 | butlers
928 | butlership
929 | butter
930 | buy
931 | by
932 | cakes
933 | calf
934 | call
935 | called
936 | came
937 | camel
938 | camels
939 | camest
940 | can
941 | cannot
942 | canst
943 | captain
944 | captive
945 | captives
946 | carcases
947 | carried
948 | carry
949 | cast
950 | castles
951 | catt
952 | cattle
953 | caught
954 | cause
955 | caused
956 | cave
957 | cease
958 | ceased
959 | certain
960 | certainly
961 | chain
962 | chamber
963 | change
964 | changed
965 | changes
966 | charge
967 | charged
968 | chariot
969 | chariots
970 | chesnut
971 | chi
972 | chief
973 | child
974 | childless
975 | childr
976 | children
977 | chode
978 | choice
979 | chose
980 | circumcis
981 | circumcise
982 | circumcised
983 | citi
984 | cities
985 | city
986 | clave
987 | clean
988 | clear
989 | cleave
990 | clo
991 | closed
992 | clothed
993 | clothes
994 | cloud
995 | clusters
996 | co
997 | coat
998 | coats
999 | coffin
1000 | cold
1001 | colours
1002 | colt
1003 | colts
1004 | come
1005 | comest
1006 | cometh
1007 | comfort
1008 | comforted
1009 | comi
1010 | coming
1011 | command
1012 | commanded
1013 | commanding
1014 | commandment
1015 | commandments
1016 | commended
1017 | committed
1018 | commune
1019 | communed
1020 | communing
1021 | company
1022 | compassed
1023 | compasseth
1024 | conceal
1025 | conceive
1026 | conceived
1027 | conception
1028 | concerning
1029 | concubi
1030 | concubine
1031 | concubines
1032 | confederate
1033 | confound
1034 | consent
1035 | conspired
1036 | consume
1037 | consumed
1038 | content
1039 | continually
1040 | continued
1041 | cool
1042 | corn
1043 | corrupt
1044 | corrupted
1045 | couch
1046 | couched
1047 | couching
1048 | could
1049 | counted
1050 | countenance
1051 | countries
1052 | country
1053 | covenant
1054 | covered
1055 | covering
1056 | created
1057 | creature
1058 | creepeth
1059 | creeping
1060 | cried
1061 | crieth
1062 | crown
1063 | cru
1064 | cruelty
1065 | cry
1066 | cubit
1067 | cubits
1068 | cunning
1069 | cup
1070 | current
1071 | curse
1072 | cursed
1073 | curseth
1074 | custom
1075 | cut
1076 | d
1077 | da
1078 | dainties
1079 | dale
1080 | damsel
1081 | damsels
1082 | dark
1083 | darkne
1084 | darkness
1085 | daughers
1086 | daught
1087 | daughte
1088 | daughter
1089 | daughters
1090 | day
1091 | days
1092 | dea
1093 | dead
1094 | deal
1095 | dealt
1096 | dearth
1097 | death
1098 | deceitfully
1099 | deceived
1100 | deceiver
1101 | declare
1102 | decreased
1103 | deed
1104 | deeds
1105 | deep
1106 | deferred
1107 | defiled
1108 | defiledst
1109 | delight
1110 | deliver
1111 | deliverance
1112 | delivered
1113 | denied
1114 | depart
1115 | departed
1116 | departing
1117 | deprived
1118 | descending
1119 | desire
1120 | desired
1121 | desolate
1122 | despised
1123 | destitute
1124 | destroy
1125 | destroyed
1126 | devour
1127 | devoured
1128 | dew
1129 | did
1130 | didst
1131 | die
1132 | died
1133 | digged
1134 | dignity
1135 | dim
1136 | dine
1137 | dipped
1138 | direct
1139 | discern
1140 | discerned
1141 | discreet
1142 | displease
1143 | displeased
1144 | distress
1145 | distressed
1146 | divide
1147 | divided
1148 | divine
1149 | divineth
1150 | do
1151 | doe
1152 | doer
1153 | doest
1154 | doeth
1155 | doing
1156 | dominion
1157 | done
1158 | door
1159 | dost
1160 | doth
1161 | double
1162 | doubled
1163 | doubt
1164 | dove
1165 | down
1166 | dowry
1167 | drank
1168 | draw
1169 | dread
1170 | dreadful
1171 | dream
1172 | dreamed
1173 | dreamer
1174 | dreams
1175 | dress
1176 | dressed
1177 | drew
1178 | dried
1179 | drink
1180 | drinketh
1181 | drinking
1182 | driven
1183 | drought
1184 | drove
1185 | droves
1186 | drunken
1187 | dry
1188 | duke
1189 | dukes
1190 | dunge
1191 | dungeon
1192 | dust
1193 | dwe
1194 | dwell
1195 | dwelled
1196 | dwelling
1197 | dwelt
1198 | e
1199 | ea
1200 | each
1201 | ear
1202 | earing
1203 | early
1204 | earring
1205 | earrings
1206 | ears
1207 | earth
1208 | east
1209 | eastward
1210 | eat
1211 | eaten
1212 | eatest
1213 | edge
1214 | eight
1215 | eighteen
1216 | eighty
1217 | either
1218 | elder
1219 | elders
1220 | eldest
1221 | eleven
1222 | else
1223 | embalm
1224 | embalmed
1225 | embraced
1226 | emptied
1227 | empty
1228 | end
1229 | ended
1230 | endued
1231 | endure
1232 | enemies
1233 | enlarge
1234 | enmity
1235 | enough
1236 | enquire
1237 | enter
1238 | entered
1239 | entreated
1240 | envied
1241 | erected
1242 | errand
1243 | escape
1244 | escaped
1245 | espied
1246 | establish
1247 | established
1248 | ev
1249 | even
1250 | evening
1251 | eventide
1252 | ever
1253 | everlasting
1254 | every
1255 | evil
1256 | ewe
1257 | ewes
1258 | exceeding
1259 | exceedingly
1260 | excel
1261 | excellency
1262 | except
1263 | exchange
1264 | experience
1265 | ey
1266 | eyed
1267 | eyes
1268 | fa
1269 | face
1270 | faces
1271 | fai
1272 | fail
1273 | failed
1274 | faileth
1275 | fainted
1276 | fair
1277 | fall
1278 | fallen
1279 | falsely
1280 | fame
1281 | families
1282 | famine
1283 | famished
1284 | far
1285 | fashion
1286 | fast
1287 | fat
1288 | fatfleshed
1289 | fath
1290 | fathe
1291 | father
1292 | fathers
1293 | fatness
1294 | faults
1295 | favour
1296 | favoured
1297 | fear
1298 | feared
1299 | fearest
1300 | feast
1301 | fed
1302 | feeble
1303 | feebler
1304 | feed
1305 | feeding
1306 | feel
1307 | feet
1308 | fell
1309 | fellow
1310 | felt
1311 | fema
1312 | female
1313 | fetch
1314 | fetched
1315 | fetcht
1316 | few
1317 | fie
1318 | field
1319 | fierce
1320 | fifteen
1321 | fifth
1322 | fifty
1323 | fig
1324 | fill
1325 | filled
1326 | find
1327 | findest
1328 | findeth
1329 | finding
1330 | fine
1331 | finish
1332 | finished
1333 | fir
1334 | fire
1335 | firmame
1336 | firmament
1337 | first
1338 | firstborn
1339 | firstlings
1340 | fish
1341 | fishes
1342 | five
1343 | flaming
1344 | fle
1345 | fled
1346 | fleddest
1347 | flee
1348 | flesh
1349 | flo
1350 | floc
1351 | flock
1352 | flocks
1353 | flood
1354 | floor
1355 | fly
1356 | fo
1357 | foal
1358 | foals
1359 | folk
1360 | follow
1361 | followed
1362 | following
1363 | folly
1364 | food
1365 | foolishly
1366 | foot
1367 | for
1368 | forbid
1369 | force
1370 | ford
1371 | foremost
1372 | foreskin
1373 | forgat
1374 | forget
1375 | forgive
1376 | forgotten
1377 | form
1378 | formed
1379 | former
1380 | forth
1381 | forty
1382 | forward
1383 | fou
1384 | found
1385 | fountain
1386 | fountains
1387 | four
1388 | fourscore
1389 | fourteen
1390 | fourteenth
1391 | fourth
1392 | fowl
1393 | fowls
1394 | freely
1395 | friend
1396 | friends
1397 | fro
1398 | from
1399 | frost
1400 | fruit
1401 | fruitful
1402 | fruits
1403 | fugitive
1404 | fulfilled
1405 | full
1406 | furnace
1407 | furniture
1408 | fury
1409 | gard
1410 | garden
1411 | garmen
1412 | garment
1413 | garments
1414 | gat
1415 | gate
1416 | gather
1417 | gathered
1418 | gathering
1419 | gave
1420 | gavest
1421 | generatio
1422 | generation
1423 | generations
1424 | get
1425 | getting
1426 | ghost
1427 | giants
1428 | gift
1429 | gifts
1430 | give
1431 | given
1432 | giveth
1433 | giving
1434 | glory
1435 | go
1436 | goa
1437 | goat
1438 | goats
1439 | gods
1440 | goest
1441 | goeth
1442 | going
1443 | gold
1444 | golden
1445 | gone
1446 | good
1447 | goodly
1448 | goods
1449 | gopher
1450 | got
1451 | gotten
1452 | governor
1453 | gr
1454 | grace
1455 | gracious
1456 | graciously
1457 | grap
1458 | grapes
1459 | grass
1460 | grave
1461 | gray
1462 | gre
1463 | great
1464 | greater
1465 | greatly
1466 | green
1467 | grew
1468 | grief
1469 | grieved
1470 | grievous
1471 | grisl
1472 | grisled
1473 | gro
1474 | ground
1475 | grove
1476 | grow
1477 | grown
1478 | guard
1479 | guiding
1480 | guiltiness
1481 | guilty
1482 | gutters
1483 | h
1484 | ha
1485 | habitations
1486 | had
1487 | hadst
1488 | hairs
1489 | hairy
1490 | half
1491 | halted
1492 | han
1493 | hand
1494 | handfuls
1495 | handle
1496 | handmaid
1497 | handmaidens
1498 | handmaids
1499 | hands
1500 | hang
1501 | hanged
1502 | hard
1503 | hardly
1504 | harlot
1505 | harm
1506 | harp
1507 | harvest
1508 | hast
1509 | haste
1510 | hasted
1511 | hastened
1512 | hastily
1513 | hate
1514 | hated
1515 | hath
1516 | have
1517 | haven
1518 | having
1519 | hazel
1520 | he
1521 | head
1522 | heads
1523 | healed
1524 | health
1525 | heap
1526 | hear
1527 | heard
1528 | hearken
1529 | hearkened
1530 | heart
1531 | hearth
1532 | hearts
1533 | heat
1534 | heav
1535 | heaven
1536 | heavens
1537 | heed
1538 | heel
1539 | heels
1540 | heifer
1541 | height
1542 | heir
1543 | held
1544 | help
1545 | hence
1546 | henceforth
1547 | her
1548 | herb
1549 | herd
1550 | herdmen
1551 | herds
1552 | here
1553 | herein
1554 | herself
1555 | hid
1556 | hide
1557 | high
1558 | hil
1559 | hills
1560 | him
1561 | himself
1562 | hind
1563 | hindermost
1564 | hire
1565 | hired
1566 | his
1567 | hith
1568 | hither
1569 | hold
1570 | hollow
1571 | home
1572 | honey
1573 | honour
1574 | honourable
1575 | hor
1576 | horror
1577 | horse
1578 | horsemen
1579 | horses
1580 | host
1581 | hotly
1582 | hou
1583 | hous
1584 | house
1585 | household
1586 | households
1587 | how
1588 | hundred
1589 | hundredfo
1590 | hundredth
1591 | hunt
1592 | hunter
1593 | hunting
1594 | hurt
1595 | husba
1596 | husband
1597 | husbandman
1598 | if
1599 | ill
1600 | image
1601 | images
1602 | imagination
1603 | imagined
1604 | in
1605 | increase
1606 | increased
1607 | indeed
1608 | inhabitants
1609 | inhabited
1610 | inherit
1611 | inheritance
1612 | iniquity
1613 | inn
1614 | innocency
1615 | instead
1616 | instructor
1617 | instruments
1618 | integrity
1619 | interpret
1620 | interpretation
1621 | interpretations
1622 | interpreted
1623 | interpreter
1624 | into
1625 | intreat
1626 | intreated
1627 | ir
1628 | is
1629 | isles
1630 | issue
1631 | it
1632 | itself
1633 | jewels
1634 | joined
1635 | joint
1636 | journey
1637 | journeyed
1638 | journeys
1639 | jud
1640 | judge
1641 | judged
1642 | judgment
1643 | just
1644 | justice
1645 | keep
1646 | keeper
1647 | kept
1648 | ki
1649 | kid
1650 | kids
1651 | kill
1652 | killed
1653 | kind
1654 | kindled
1655 | kindly
1656 | kindness
1657 | kindred
1658 | kinds
1659 | kine
1660 | king
1661 | kingdom
1662 | kings
1663 | kiss
1664 | kissed
1665 | kn
1666 | knead
1667 | kneel
1668 | knees
1669 | knew
1670 | knife
1671 | know
1672 | knowest
1673 | knoweth
1674 | knowing
1675 | knowledge
1676 | known
1677 | la
1678 | labour
1679 | lack
1680 | lad
1681 | ladder
1682 | lade
1683 | laded
1684 | laden
1685 | lads
1686 | laid
1687 | lamb
1688 | lambs
1689 | lamentati
1690 | lamp
1691 | lan
1692 | land
1693 | lands
1694 | language
1695 | large
1696 | last
1697 | laugh
1698 | laughed
1699 | law
1700 | lawgiver
1701 | laws
1702 | lay
1703 | lead
1704 | leaf
1705 | lean
1706 | leanfleshed
1707 | leap
1708 | leaped
1709 | learned
1710 | least
1711 | leave
1712 | leaves
1713 | led
1714 | left
1715 | length
1716 | lentiles
1717 | lesser
1718 | lest
1719 | let
1720 | li
1721 | lie
1722 | lien
1723 | liest
1724 | lieth
1725 | life
1726 | lift
1727 | lifted
1728 | light
1729 | lighted
1730 | lightly
1731 | lights
1732 | like
1733 | likene
1734 | likeness
1735 | linen
1736 | lingered
1737 | lion
1738 | little
1739 | live
1740 | lived
1741 | lives
1742 | liveth
1743 | living
1744 | lo
1745 | lodge
1746 | lodged
1747 | loins
1748 | long
1749 | longedst
1750 | longeth
1751 | look
1752 | looked
1753 | loose
1754 | lord
1755 | lords
1756 | loss
1757 | loud
1758 | love
1759 | loved
1760 | lovest
1761 | loveth
1762 | lower
1763 | lying
1764 | m
1765 | ma
1766 | made
1767 | magicians
1768 | magnified
1769 | maid
1770 | maiden
1771 | maidservants
1772 | make
1773 | male
1774 | males
1775 | man
1776 | mandrakes
1777 | manner
1778 | many
1779 | mark
1780 | marriages
1781 | married
1782 | marry
1783 | marvelled
1784 | mast
1785 | master
1786 | matter
1787 | may
1788 | mayest
1789 | me
1790 | mead
1791 | meadow
1792 | meal
1793 | mean
1794 | meanest
1795 | meant
1796 | measures
1797 | meat
1798 | meditate
1799 | meet
1800 | meeteth
1801 | men
1802 | menservants
1803 | mention
1804 | merchant
1805 | merchantmen
1806 | mercies
1807 | merciful
1808 | mercy
1809 | merry
1810 | mess
1811 | messenger
1812 | messengers
1813 | messes
1814 | met
1815 | mi
1816 | midst
1817 | midwife
1818 | might
1819 | mightier
1820 | mighty
1821 | milch
1822 | milk
1823 | millions
1824 | mind
1825 | mine
1826 | mirth
1827 | mischief
1828 | mist
1829 | mistress
1830 | mock
1831 | mocked
1832 | mocking
1833 | money
1834 | month
1835 | months
1836 | moon
1837 | more
1838 | moreover
1839 | morever
1840 | morning
1841 | morrow
1842 | morsel
1843 | morter
1844 | most
1845 | mother
1846 | mou
1847 | mount
1848 | mountain
1849 | mountains
1850 | mourn
1851 | mourned
1852 | mourning
1853 | mouth
1854 | mouths
1855 | moved
1856 | moveth
1857 | moving
1858 | much
1859 | mules
1860 | multiplied
1861 | multiply
1862 | multiplying
1863 | multitude
1864 | must
1865 | my
1866 | myrrh
1867 | myself
1868 | n
1869 | na
1870 | naked
1871 | nakedness
1872 | name
1873 | named
1874 | names
1875 | nati
1876 | natio
1877 | nation
1878 | nations
1879 | nativity
1880 | ne
1881 | near
1882 | neck
1883 | needeth
1884 | needs
1885 | neither
1886 | never
1887 | next
1888 | nig
1889 | nigh
1890 | night
1891 | nights
1892 | nine
1893 | nineteen
1894 | ninety
1895 | no
1896 | none
1897 | noon
1898 | nor
1899 | north
1900 | northward
1901 | nostrils
1902 | not
1903 | nothing
1904 | nought
1905 | nourish
1906 | nourished
1907 | now
1908 | number
1909 | numbered
1910 | numbering
1911 | nurse
1912 | nuts
1913 | o
1914 | oa
1915 | oak
1916 | oath
1917 | obeisance
1918 | obey
1919 | obeyed
1920 | observed
1921 | obtain
1922 | occasion
1923 | occupation
1924 | of
1925 | off
1926 | offended
1927 | offer
1928 | offered
1929 | offeri
1930 | offering
1931 | offerings
1932 | office
1933 | officer
1934 | officers
1935 | oil
1936 | old
1937 | olive
1938 | on
1939 | one
1940 | ones
1941 | only
1942 | onyx
1943 | open
1944 | opened
1945 | openly
1946 | or
1947 | order
1948 | organ
1949 | oth
1950 | other
1951 | ou
1952 | ought
1953 | our
1954 | ours
1955 | ourselves
1956 | out
1957 | over
1958 | overcome
1959 | overdrive
1960 | overseer
1961 | oversig
1962 | overspread
1963 | overtake
1964 | overthrew
1965 | overthrow
1966 | overtook
1967 | own
1968 | oxen
1969 | parcel
1970 | part
1971 | parted
1972 | parts
1973 | pass
1974 | passed
1975 | past
1976 | pasture
1977 | path
1978 | pea
1979 | peace
1980 | peaceable
1981 | peaceably
1982 | peop
1983 | people
1984 | peradventure
1985 | perceived
1986 | perfect
1987 | perform
1988 | perish
1989 | perpetual
1990 | person
1991 | persons
1992 | physicians
1993 | piece
1994 | pieces
1995 | pigeon
1996 | pilgrimage
1997 | pillar
1998 | pilled
1999 | pillows
2000 | pit
2001 | pitch
2002 | pitched
2003 | pitcher
2004 | pla
2005 | place
2006 | placed
2007 | places
2008 | plagued
2009 | plagues
2010 | plain
2011 | plains
2012 | plant
2013 | planted
2014 | played
2015 | pleasant
2016 | pleased
2017 | pleaseth
2018 | pleasure
2019 | pledge
2020 | plenteous
2021 | plenteousness
2022 | plenty
2023 | pluckt
2024 | point
2025 | poor
2026 | poplar
2027 | portion
2028 | possess
2029 | possessi
2030 | possession
2031 | possessions
2032 | possessor
2033 | posterity
2034 | pottage
2035 | poured
2036 | poverty
2037 | pow
2038 | power
2039 | praise
2040 | pray
2041 | prayed
2042 | precious
2043 | prepared
2044 | presence
2045 | present
2046 | presented
2047 | preserve
2048 | preserved
2049 | pressed
2050 | prevail
2051 | prevailed
2052 | prey
2053 | priest
2054 | priests
2055 | prince
2056 | princes
2057 | pris
2058 | prison
2059 | prisoners
2060 | proceedeth
2061 | process
2062 | profit
2063 | progenitors
2064 | prophet
2065 | prosper
2066 | prospered
2067 | prosperous
2068 | protest
2069 | proved
2070 | provender
2071 | provide
2072 | provision
2073 | pulled
2074 | punishment
2075 | purchase
2076 | purchased
2077 | purposing
2078 | pursue
2079 | pursued
2080 | put
2081 | putting
2082 | quart
2083 | quickly
2084 | quite
2085 | quiver
2086 | raiment
2087 | rain
2088 | rained
2089 | raise
2090 | ram
2091 | rams
2092 | ran
2093 | rank
2094 | raven
2095 | ravin
2096 | reach
2097 | reached
2098 | ready
2099 | reason
2100 | rebelled
2101 | rebuked
2102 | receive
2103 | received
2104 | red
2105 | redeemed
2106 | refrain
2107 | refrained
2108 | refused
2109 | regard
2110 | reign
2111 | reigned
2112 | remained
2113 | remaineth
2114 | remember
2115 | remembered
2116 | remove
2117 | removed
2118 | removing
2119 | renown
2120 | rent
2121 | repented
2122 | repenteth
2123 | replenish
2124 | report
2125 | reproa
2126 | reproach
2127 | reproved
2128 | require
2129 | required
2130 | requite
2131 | reserved
2132 | respect
2133 | rest
2134 | rested
2135 | restore
2136 | restored
2137 | restrained
2138 | return
2139 | returned
2140 | reviv
2141 | reward
2142 | rewarded
2143 | ri
2144 | rib
2145 | ribs
2146 | rich
2147 | riches
2148 | rid
2149 | ride
2150 | rider
2151 | right
2152 | righteous
2153 | righteousness
2154 | rightly
2155 | ring
2156 | ringstraked
2157 | ripe
2158 | rise
2159 | risen
2160 | riv
2161 | river
2162 | rode
2163 | rods
2164 | roll
2165 | rolled
2166 | roof
2167 | room
2168 | rooms
2169 | rose
2170 | roughly
2171 | round
2172 | rouse
2173 | royal
2174 | rul
2175 | rule
2176 | ruled
2177 | ruler
2178 | rulers
2179 | run
2180 | s
2181 | sa
2182 | sac
2183 | sack
2184 | sackcloth
2185 | sacks
2186 | sacrifice
2187 | sacrifices
2188 | sad
2189 | saddled
2190 | sadly
2191 | said
2192 | saidst
2193 | saith
2194 | sake
2195 | sakes
2196 | salt
2197 | salvation
2198 | same
2199 | sanctified
2200 | sand
2201 | sat
2202 | save
2203 | saved
2204 | saving
2205 | savour
2206 | savoury
2207 | saw
2208 | sawest
2209 | say
2210 | saying
2211 | scarce
2212 | scarlet
2213 | scatter
2214 | scattered
2215 | sceptre
2216 | sea
2217 | searched
2218 | seas
2219 | season
2220 | seasons
2221 | second
2222 | secret
2223 | secretly
2224 | see
2225 | seed
2226 | seedtime
2227 | seeing
2228 | seek
2229 | seekest
2230 | seem
2231 | seemed
2232 | seen
2233 | seest
2234 | seeth
2235 | selfsame
2236 | selfwill
2237 | sell
2238 | send
2239 | sent
2240 | separate
2241 | separated
2242 | sepulchre
2243 | sepulchres
2244 | serpent
2245 | serva
2246 | servan
2247 | servant
2248 | servants
2249 | serve
2250 | served
2251 | service
2252 | set
2253 | seven
2254 | sevenfold
2255 | sevens
2256 | seventeen
2257 | seventeenth
2258 | seventh
2259 | seventy
2260 | sewed
2261 | sh
2262 | shadow
2263 | shall
2264 | shalt
2265 | shamed
2266 | shaved
2267 | she
2268 | sheaf
2269 | shear
2270 | sheaves
2271 | shed
2272 | sheddeth
2273 | sheep
2274 | sheepshearers
2275 | shekel
2276 | shekels
2277 | shepherd
2278 | shepherds
2279 | shew
2280 | shewed
2281 | sheweth
2282 | shield
2283 | ships
2284 | shoelatchet
2285 | shore
2286 | shortly
2287 | shot
2288 | should
2289 | shoulder
2290 | shoulders
2291 | shouldest
2292 | shrank
2293 | shrubs
2294 | shut
2295 | si
2296 | side
2297 | sight
2298 | signet
2299 | signs
2300 | silv
2301 | silver
2302 | sin
2303 | since
2304 | sinew
2305 | sinners
2306 | sinning
2307 | sir
2308 | sist
2309 | sister
2310 | sit
2311 | six
2312 | sixteen
2313 | sixth
2314 | sixty
2315 | skins
2316 | slain
2317 | slaughter
2318 | slay
2319 | slayeth
2320 | sle
2321 | sleep
2322 | slept
2323 | slew
2324 | slime
2325 | slimepits
2326 | small
2327 | smell
2328 | smelled
2329 | smite
2330 | smoke
2331 | smoking
2332 | smooth
2333 | smote
2334 | so
2335 | sod
2336 | softly
2337 | sojourn
2338 | sojourned
2339 | sojourner
2340 | sold
2341 | sole
2342 | solemnly
2343 | some
2344 | son
2345 | songs
2346 | sons
2347 | soon
2348 | sore
2349 | sorely
2350 | sorrow
2351 | sort
2352 | sou
2353 | sought
2354 | soul
2355 | souls
2356 | south
2357 | southward
2358 | sow
2359 | sowed
2360 | space
2361 | spake
2362 | spare
2363 | spe
2364 | speak
2365 | speaketh
2366 | speaking
2367 | speckl
2368 | speckled
2369 | spee
2370 | speech
2371 | speed
2372 | speedily
2373 | spent
2374 | spi
2375 | spicery
2376 | spices
2377 | spies
2378 | spilled
2379 | spirit
2380 | spoil
2381 | spoiled
2382 | spoken
2383 | sporting
2384 | spotted
2385 | spread
2386 | springing
2387 | sprung
2388 | staff
2389 | stalk
2390 | stand
2391 | standest
2392 | stars
2393 | state
2394 | statutes
2395 | stay
2396 | stayed
2397 | ste
2398 | stead
2399 | steal
2400 | steward
2401 | still
2402 | stink
2403 | sto
2404 | stole
2405 | stolen
2406 | stone
2407 | stones
2408 | stood
2409 | stooped
2410 | stopped
2411 | store
2412 | storehouses
2413 | stories
2414 | straitly
2415 | strakes
2416 | strange
2417 | stranger
2418 | strangers
2419 | straw
2420 | street
2421 | strength
2422 | strengthened
2423 | stretched
2424 | stricken
2425 | strife
2426 | stript
2427 | strive
2428 | strong
2429 | stronger
2430 | strove
2431 | struggled
2432 | stuff
2433 | subdue
2434 | submit
2435 | substance
2436 | subtil
2437 | subtilty
2438 | such
2439 | suck
2440 | suffered
2441 | summer
2442 | sun
2443 | supplanted
2444 | sure
2445 | surely
2446 | surety
2447 | sustained
2448 | sware
2449 | swear
2450 | sweat
2451 | sweet
2452 | sword
2453 | sworn
2454 | tabret
2455 | tak
2456 | take
2457 | taken
2458 | talked
2459 | talking
2460 | tar
2461 | tarried
2462 | tarry
2463 | teeth
2464 | tell
2465 | tempt
2466 | ten
2467 | tender
2468 | tenor
2469 | tent
2470 | tenth
2471 | tents
2472 | terror
2473 | th
2474 | than
2475 | that
2476 | the
2477 | thee
2478 | their
2479 | them
2480 | themselv
2481 | themselves
2482 | then
2483 | thence
2484 | there
2485 | thereby
2486 | therefore
2487 | therein
2488 | thereof
2489 | thereon
2490 | these
2491 | they
2492 | thi
2493 | thicket
2494 | thigh
2495 | thin
2496 | thine
2497 | thing
2498 | things
2499 | think
2500 | third
2501 | thirteen
2502 | thirteenth
2503 | thirty
2504 | this
2505 | thistles
2506 | thither
2507 | thoroughly
2508 | those
2509 | thou
2510 | though
2511 | thought
2512 | thoughts
2513 | thousand
2514 | thousands
2515 | thread
2516 | three
2517 | threescore
2518 | threshingfloor
2519 | throne
2520 | through
2521 | throughout
2522 | thus
2523 | thy
2524 | thyself
2525 | tidings
2526 | till
2527 | tiller
2528 | tillest
2529 | tim
2530 | time
2531 | times
2532 | tithes
2533 | to
2534 | togeth
2535 | together
2536 | toil
2537 | token
2538 | told
2539 | tongue
2540 | tongues
2541 | too
2542 | took
2543 | top
2544 | tops
2545 | torn
2546 | touch
2547 | touched
2548 | toucheth
2549 | touching
2550 | toward
2551 | tower
2552 | towns
2553 | tr
2554 | trade
2555 | traffick
2556 | trained
2557 | travail
2558 | travailed
2559 | treasure
2560 | tree
2561 | trees
2562 | trembled
2563 | trespass
2564 | tribes
2565 | tribute
2566 | troop
2567 | troubled
2568 | trough
2569 | troughs
2570 | tru
2571 | true
2572 | truly
2573 | truth
2574 | turn
2575 | turned
2576 | turtledove
2577 | twel
2578 | twelve
2579 | twentieth
2580 | twenty
2581 | twice
2582 | twins
2583 | two
2584 | unawares
2585 | uncircumcised
2586 | uncovered
2587 | under
2588 | understand
2589 | understood
2590 | ungirded
2591 | unit
2592 | unleavened
2593 | until
2594 | unto
2595 | up
2596 | upon
2597 | uppermost
2598 | upright
2599 | upward
2600 | urged
2601 | us
2602 | utmost
2603 | vagabond
2604 | vail
2605 | vale
2606 | valley
2607 | vengeance
2608 | venison
2609 | verified
2610 | verily
2611 | very
2612 | vessels
2613 | vestures
2614 | victuals
2615 | vine
2616 | vineyard
2617 | violence
2618 | violently
2619 | virgin
2620 | vision
2621 | visions
2622 | visit
2623 | visited
2624 | voi
2625 | voice
2626 | void
2627 | vow
2628 | vowed
2629 | vowedst
2630 | w
2631 | wa
2632 | wages
2633 | wagons
2634 | waited
2635 | walk
2636 | walked
2637 | walketh
2638 | walking
2639 | wall
2640 | wander
2641 | wandered
2642 | wandering
2643 | war
2644 | ward
2645 | was
2646 | wash
2647 | washed
2648 | wast
2649 | wat
2650 | watch
2651 | water
2652 | watered
2653 | watering
2654 | waters
2655 | waxed
2656 | waxen
2657 | way
2658 | ways
2659 | we
2660 | wealth
2661 | weaned
2662 | weapons
2663 | wearied
2664 | weary
2665 | week
2666 | weep
2667 | weig
2668 | weighed
2669 | weight
2670 | welfare
2671 | well
2672 | wells
2673 | went
2674 | wentest
2675 | wept
2676 | were
2677 | west
2678 | westwa
2679 | whales
2680 | what
2681 | whatsoever
2682 | wheat
2683 | whelp
2684 | when
2685 | whence
2686 | whensoever
2687 | where
2688 | whereby
2689 | wherefore
2690 | wherein
2691 | whereof
2692 | whereon
2693 | wherewith
2694 | whether
2695 | which
2696 | while
2697 | white
2698 | whither
2699 | who
2700 | whole
2701 | whom
2702 | whomsoever
2703 | whoredom
2704 | whose
2705 | whosoever
2706 | why
2707 | wi
2708 | wick
2709 | wicked
2710 | wickedly
2711 | wickedness
2712 | widow
2713 | widowhood
2714 | wife
2715 | wild
2716 | wilderness
2717 | will
2718 | willing
2719 | wilt
2720 | wind
2721 | window
2722 | windows
2723 | wine
2724 | winged
2725 | winter
2726 | wise
2727 | wit
2728 | with
2729 | withered
2730 | withheld
2731 | withhold
2732 | within
2733 | without
2734 | witness
2735 | wittingly
2736 | wiv
2737 | wives
2738 | wo
2739 | wolf
2740 | woman
2741 | womb
2742 | wombs
2743 | women
2744 | womenservan
2745 | womenservants
2746 | wondering
2747 | wood
2748 | wor
2749 | word
2750 | words
2751 | work
2752 | worse
2753 | worship
2754 | worshipped
2755 | worth
2756 | worthy
2757 | wot
2758 | wotteth
2759 | would
2760 | wouldest
2761 | wounding
2762 | wrapped
2763 | wrath
2764 | wrestled
2765 | wrestlings
2766 | wrong
2767 | wroth
2768 | wrought
2769 | y
2770 | ye
2771 | yea
2772 | year
2773 | yearn
2774 | years
2775 | yesternight
2776 | yet
2777 | yield
2778 | yielded
2779 | yielding
2780 | yoke
2781 | yonder
2782 | you
2783 | young
2784 | younge
2785 | younger
2786 | youngest
2787 | your
2788 | yourselves
2789 | youth
2790 | 2789
2791 | 2789
2792 |
--------------------------------------------------------------------------------
/5.t2.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/5.t2.pkl
--------------------------------------------------------------------------------
/PYTHON 自然语言处理中文翻译.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/PYTHON 自然语言处理中文翻译.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Python_nlp_notes
2 | 这是我《 Python 自然语言处理 中文第二版 》笔记
3 |
4 | 原文在线阅读:https://usyiyi.github.io/nlp-py-2e-zh
5 |
6 | # 【Python自然语言处理】读书笔记:第一章:语言处理与Python
7 | [【Python自然语言处理】读书笔记:第一章:语言处理与Python](https://github.com/JackKuo666/Python_nlp_notes/blob/master/%E3%80%90Python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0%EF%BC%9A%E7%AC%AC%E4%B8%80%E7%AB%A0%EF%BC%9A%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E4%B8%8EPython.md)
8 |
9 | # 【Python自然语言处理】读书笔记:第二章:获得文本语料和词汇资源
10 | [【Python自然语言处理】读书笔记:第二章:获得文本语料和词汇资源](https://github.com/JackKuo666/Python_nlp_notes/blob/master/%E3%80%90Python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0%EF%BC%9A%E7%AC%AC%E4%BA%8C%E7%AB%A0%EF%BC%9A%E8%8E%B7%E5%BE%97%E6%96%87%E6%9C%AC%E8%AF%AD%E6%96%99%E5%92%8C%E8%AF%8D%E6%B1%87%E8%B5%84%E6%BA%90.md)
11 |
12 | # 【Python自然语言处理】读书笔记:第三章:处理原始文本
13 | [【Python自然语言处理】读书笔记:第三章:处理原始文本](https://github.com/JackKuo666/Python_nlp_notes/blob/master/%E3%80%90Python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0%EF%BC%9A%E7%AC%AC%E4%B8%89%E7%AB%A0%EF%BC%9A%E5%A4%84%E7%90%86%E5%8E%9F%E5%A7%8B%E6%96%87%E6%9C%AC.ipynb)
14 |
15 | # 【Python自然语言处理】读书笔记:第四章:编写结构化程序
16 | [【Python自然语言处理】读书笔记:第四章:编写结构化程序](https://github.com/JackKuo666/Python_nlp_notes/blob/master/%E3%80%90Python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0%EF%BC%9A%E7%AC%AC%E5%9B%9B%E7%AB%A0%EF%BC%9A%E7%BC%96%E5%86%99%E7%BB%93%E6%9E%84%E5%8C%96%E7%A8%8B%E5%BA%8F.ipynb)
17 |
18 | # 【Python自然语言处理】读书笔记:第五章:分类和标注词汇
19 | [【Python自然语言处理】读书笔记:第五章:分类和标注词汇](https://github.com/JackKuo666/Python_nlp_notes/blob/master/%E3%80%90Python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0%EF%BC%9A%E7%AC%AC%E4%BA%94%E7%AB%A0%EF%BC%9A%E5%88%86%E7%B1%BB%E5%92%8C%E6%A0%87%E6%B3%A8%E8%AF%8D%E6%B1%87.ipynb)
20 |
21 | # 【Python自然语言处理】读书笔记:第六章:学习分类文本
22 | [【Python自然语言处理】读书笔记:第六章:学习分类文本](https://github.com/JackKuo666/Python_nlp_notes/blob/master/%E3%80%90Python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0%EF%BC%9A%E7%AC%AC%E5%85%AD%E7%AB%A0%EF%BC%9A%E5%AD%A6%E4%B9%A0%E5%88%86%E7%B1%BB%E6%96%87%E6%9C%AC.ipynb)
23 |
24 | # 【Python自然语言处理】读书笔记:第七章:从文本提取信息
25 | [【Python自然语言处理】读书笔记:第七章:从文本提取信息](https://github.com/JackKuo666/Python_nlp_notes/blob/master/%E3%80%90Python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0%EF%BC%9A%E7%AC%AC%E4%B8%83%E7%AB%A0%EF%BC%9A%E4%BB%8E%E6%96%87%E6%9C%AC%E6%8F%90%E5%8F%96%E4%BF%A1%E6%81%AF.ipynb)
26 |
27 | ----------------------------------未完待续-------------------------
28 |
29 | # 更多NLP知识请访问:
30 |
31 | 我的主页:https://jackkuo666.github.io/
32 |
33 | 我的博客:https://blog.csdn.net/weixin_37251044
34 |
--------------------------------------------------------------------------------
/picture/1.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/1.1.png
--------------------------------------------------------------------------------
/picture/2.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/2.1.png
--------------------------------------------------------------------------------
/picture/2.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/2.2.png
--------------------------------------------------------------------------------
/picture/2.3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/2.3.png
--------------------------------------------------------------------------------
/picture/2.4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/2.4.png
--------------------------------------------------------------------------------
/picture/3.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/3.1.png
--------------------------------------------------------------------------------
/picture/3.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/3.2.png
--------------------------------------------------------------------------------
/picture/3.3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/3.3.png
--------------------------------------------------------------------------------
/picture/4.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/4.1.png
--------------------------------------------------------------------------------
/picture/4.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/4.2.png
--------------------------------------------------------------------------------
/picture/5.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/5.1.png
--------------------------------------------------------------------------------
/picture/6.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/6.1.png
--------------------------------------------------------------------------------
/picture/6.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/6.2.png
--------------------------------------------------------------------------------
/picture/6.3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/6.3.png
--------------------------------------------------------------------------------
/picture/6.4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/6.4.png
--------------------------------------------------------------------------------
/picture/6.5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/6.5.png
--------------------------------------------------------------------------------
/picture/6.6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/6.6.png
--------------------------------------------------------------------------------
/picture/6.7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/6.7.png
--------------------------------------------------------------------------------
/picture/7.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/7.1.png
--------------------------------------------------------------------------------
/picture/7.2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/7.2.png
--------------------------------------------------------------------------------
/picture/7.3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/7.3.png
--------------------------------------------------------------------------------
/picture/7.4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/7.4.png
--------------------------------------------------------------------------------
/picture/7.5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/7.5.png
--------------------------------------------------------------------------------
/【Python自然语言处理】读书笔记:第一章:语言处理与Python.md:
--------------------------------------------------------------------------------
1 | 原书:《Python自然语言处理》:https://usyiyi.github.io/nlp-py-2e-zh/
2 | # 语言处理与Python
3 | 原文:https://usyiyi.github.io/nlp-py-2e-zh/1.html
4 | # 1.NLTK入门
5 | ## 1.NLTK的安装,nltk.book的安装
6 | ## 2.搜索文本
7 | ```py
8 | text1.concordance("monstrous") # 搜索文本text1中含有“monstrous”的句子
9 | text1.similar("monstrous") # 搜索文本text1中与“monstrous”相似的单词
10 | text2.common_contexts(["monstrous", "very"]) # 搜索文本text2中两个单词共同的上下文
11 | text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"]) # 绘制这些词在文本text4中出现位置的离散图
12 | ```
13 | ## 3.词汇计数
14 | ```py
15 | len(text3) # 文本text3的符号总数
16 | sorted(set(text3)) # 不重复的符号排序
17 | len(set(text3)) # 不重复的符号总数
18 | len(set(text3)) / len(text3) # 词汇丰富度:不重复符号约占总符号数的6%,或者说:每个符号平均被使用约16次
19 | text3.count("smote") # 文本中“smote”的计数
20 | def lexical_diversity(text): # 计算词汇丰富度
21 | return len(set(text))/len(text)
22 | def percentage(word,text): # 计算词word在文本中出现的频率
23 | return 100*text.count(word)/len(text)
24 |
25 | ```
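补充(非原书代码):上面两个函数的简单用法示意,具体返回值这里不展开。
```py
lexical_diversity(text3)   # 约 0.06,即每个符号平均被使用约16次
percentage('the', text3)   # “the” 占文本总词符数的百分比
```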
26 | ## 4.索引列表
27 | ```py
28 | >>> text4[173]
29 | 'awaken'
30 | >>>
31 | ```
32 | ```py
33 | >>> text4.index('awaken')
34 | 173
35 | >>>
36 | >>> sent[5:8]
37 | ['word6', 'word7', 'word8']
38 | ```
39 | ## 5.字符串与列表的相互转换
40 | ```py
41 | >>> ' '.join(['Monty', 'Python'])
42 | 'Monty Python'
43 | >>> 'Monty Python'.split()
44 | ['Monty', 'Python']
45 | >>>
46 | ```
47 | ## 6.词频分布
48 | ```py
49 | >>> fdist1 = FreqDist(text1) # 计算text1的每个符号的词频
50 | >>> print(fdist1)
51 | <FreqDist with 19317 samples and 260819 outcomes>
52 | >>> fdist1.most_common(50)
53 | [(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024),
54 | ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982),
55 | ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124),
56 | ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632),
57 | ('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280),
58 | ('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103),
59 | ('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005),
60 | ('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767),
61 | ('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680),
62 | ('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
63 | >>> fdist1['whale']
64 | 906
65 | >>>
66 | ```
67 | ```py
68 | fdist1.plot(50, cumulative=True) # 50个常用词的累计频率图
69 | ```
70 | 
71 |
72 | ```py
73 | fdist1.hapaxes() # 返回词频为1的词
74 | ```
75 | ## 7.细粒度的选择词
76 | 选出长度大于15的单词
77 | ```py
78 | sorted(w for w in set(text1) if len(w) > 15)
79 | ['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically',
80 | ```
81 |
82 | 选出长度大于7且词频大于7的单词
83 | ```py
84 | sorted(w for w in set(text5) if len(w) > 7 and FreqDist(text5)[w] > 7)
85 | ['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question',
86 | ```
87 | 提取文本中的词对(双连词)
88 | ```py
89 | >>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
90 | [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
91 | ```
92 | 提取文本中的频繁出现的双连词
93 | ```py
94 | >>> text4.collocations()
95 | United States; fellow citizens; four years; years ago; Federal
96 | Government; General Government; American people; Vice President; Old
97 | ```
98 | ## 8.查看文本中词长的分布
99 | ```py
100 | >>> [len(w) for w in text1] # 文本中每个词的长度
101 | [1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...]
102 | >>> fdist = FreqDist(len(w) for w in text1) # 文本中词长的频数
103 | >>> print(fdist)
104 | <FreqDist with 19 samples and 260819 outcomes>
105 | >>> fdist
106 | FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399,
107 | 8: 9966, 9: 6428, 10: 3528, ...})
108 | >>>
109 | ```
110 | ```py
111 | >>> fdist.most_common()
112 | [(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399),
113 | (8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177),
114 | (15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
115 | >>> fdist.max()
116 | 3
117 | >>> fdist[3]
118 | 50223
119 | >>> fdist.freq(3) # 词频中词长为“3”的频率
120 | 0.19255882431878046
121 | >>>
122 | ```
123 | ## 9.```[w for w in text if condition ]```模式
124 | 选出以```ableness```结尾的单词
125 | ```py
126 | >>> sorted(w for w in set(text1) if w.endswith('ableness'))
127 | ['comfortableness', 'honourableness', 'immutableness', 'indispensableness', ...]
128 | ```
129 | 选出含有```gnt```的单词
130 | ```py
131 | >>> sorted(term for term in set(text4) if 'gnt' in term)
132 | ['Sovereignty', 'sovereignties', 'sovereignty']
133 | ```
134 | 选出以**大写字母**开头的单词
135 | ```py
136 | >>> sorted(item for item in set(text6) if item.istitle())
137 | ['A', 'Aaaaaaaaah', 'Aaaaaaaah', 'Aaaaaah', 'Aaaah', 'Aaaaugh', 'Aaagh', ...]
138 | ```
139 | 选出**数字**
140 | ```py
141 | >>> sorted(item for item in set(sent7) if item.isdigit())
142 | ['29', '61']
143 | >>>
144 | ```
145 | 选出**不是全部小写字母**的单词
146 | ```py
147 | sorted(w for w in set(sent7) if not w.islower())
148 | ```
149 | 将单词变为**全部大写字母**
150 | ```py
151 | >>> [w.upper() for w in text1]
152 | ['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', ...]
153 | >>>
154 | ```
155 | 过滤掉text1中不是字母的词符,全部转换成小写并去重,然后计数
156 | ```py
157 | >>> len(set(word.lower() for word in text1 if word.isalpha()))
158 | 16948
159 | ```
160 | ## 10.条件循环
161 | 这里可以不换行打印```print(word, end=' ')```
162 | ```py
163 | >>> tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
164 | >>> for word in tricky:
165 | ... print(word, end=' ')
166 | ancient ceiling conceit conceited conceive conscience
167 | conscientious conscientiously deceitful deceive ...
168 | >>>
169 | ```
170 | ## 11.作业
171 | 计算词频,以百分比表示
172 | ```py
173 |
174 | >>> def percent(word, text):
175 | ... return 100*text.count(word)/len([w for w in text if w.isalpha()])
176 | >>> percent(",", text1)
177 | 8.569753756394228
178 | ```
179 | 计算文本词汇量
180 | ```py
181 | >>> def vocab_size(text):
182 | ... return len(set(w.lower() for w in text if w.isalpha()))
183 | >>> vocab_size(text1)
184 | 16948
185 |
186 | ```
187 |
--------------------------------------------------------------------------------
/【Python自然语言处理】读书笔记:第七章:从文本提取信息.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "本章原文:https://usyiyi.github.io/nlp-py-2e-zh/7.html\n",
8 | "\n",
9 | " 1.我们如何能构建一个系统,从非结构化文本中提取结构化数据如表格?\n",
10 | " 2.有哪些稳健的方法识别一个文本中描述的实体和关系?\n",
11 | " 3.哪些语料库适合这项工作,我们如何使用它们来训练和评估我们的模型?\n",
12 | "\n",
13 | "**分块** 和 **命名实体识别**。"
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {},
19 | "source": [
20 | "# 1 信息提取\n",
21 | "\n",
22 | "信息有很多种形状和大小。一个重要的形式是结构化数据:实体和关系的可预测的规范的结构。例如,我们可能对公司和地点之间的关系感兴趣。给定一个公司,我们希望能够确定它做业务的位置;反过来,给定位置,我们会想发现哪些公司在该位置做业务。如果我们的数据是表格形式,如1.1中的例子,那么回答这些问题就很简单了。\n",
23 | "\n"
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 1,
29 | "metadata": {},
30 | "outputs": [
31 | {
32 | "name": "stdout",
33 | "output_type": "stream",
34 | "text": [
35 | "['BBDO South', 'Georgia-Pacific']\n"
36 | ]
37 | }
38 | ],
39 | "source": [
40 | "locs = [('Omnicom', 'IN', 'New York'),\n",
41 | " ('DDB Needham', 'IN', 'New York'),\n",
42 | " ('Kaplan Thaler Group', 'IN', 'New York'),\n",
43 | " ('BBDO South', 'IN', 'Atlanta'),\n",
44 | " ('Georgia-Pacific', 'IN', 'Atlanta')]\n",
45 | "query = [e1 for (e1, rel, e2) in locs if e2=='Atlanta']\n",
46 | "print(query)"
47 | ]
48 | },
49 | {
50 | "cell_type": "markdown",
51 | "metadata": {},
52 | "source": [
53 | "我们可以定义一个函数,简单地连接 NLTK 中默认的句子分割器[1],分词器[2]和词性标注器[3]:"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 2,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "def ie_preprocess(document):\n",
63 | " sentences = nltk.sent_tokenize(document) # [1] 句子分割器\n",
64 | " sentences = [nltk.word_tokenize(sent) for sent in sentences] # [2] 分词器\n",
65 | " sentences = [nltk.pos_tag(sent) for sent in sentences] # [3] 词性标注器"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 3,
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "import nltk, re, pprint"
75 | ]
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {},
80 | "source": [
81 | "# 2 词块划分\n",
82 | "\n",
83 | "我们将用于实体识别的基本技术是词块划分,它分割和标注多词符的序列,如2.1所示。小框显示词级分词和词性标注,大框显示高级别的词块划分。每个这种较大的框叫做一个词块。就像分词忽略空白符,词块划分通常选择词符的一个子集。同样像分词一样,词块划分器生成的片段在源文本中不能重叠。\n",
84 | "\n",
85 | "\n",
86 | "在本节中,我们将在较深的层面探讨词块划分,以**词块**的定义和表示开始。我们将看到**正则表达式**和**N-gram**的方法来词块划分,使用CoNLL-2000词块划分语料库**开发**和**评估词块划分器**。我们将在(5)和6回到**命名实体识别**和**关系抽取**的任务。"
87 | ]
88 | },
89 | {
90 | "cell_type": "markdown",
91 | "metadata": {},
92 | "source": [
93 | "## 2.1 名词短语词块划分\n",
94 | "\n",
95 | "我们将首先思考名词短语词块划分或NP词块划分任务,在那里我们寻找单独名词短语对应的词块。例如,这里是一些《华尔街日报》文本,其中的NP词块用方括号标记:"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 5,
101 | "metadata": {},
102 | "outputs": [
103 | {
104 | "name": "stdout",
105 | "output_type": "stream",
106 | "text": [
107 | "(S\n",
108 | " (NP the/DT little/JJ yellow/JJ dog/NN)\n",
109 | " barked/VBD\n",
110 | " at/IN\n",
111 | " (NP the/DT cat/NN))\n"
112 | ]
113 | }
114 | ],
115 | "source": [
116 | "sentence = [(\"the\", \"DT\"), (\"little\", \"JJ\"), (\"yellow\", \"JJ\"), \n",
117 | " (\"dog\", \"NN\"), (\"barked\", \"VBD\"), (\"at\", \"IN\"), (\"the\", \"DT\"), (\"cat\", \"NN\")]\n",
118 | "\n",
119 | "grammar = \"NP: {?*}\" \n",
120 | "cp = nltk.RegexpParser(grammar) \n",
121 | "result = cp.parse(sentence) \n",
122 | "print(result) \n",
123 | "result.draw()"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
130 | ""
131 | ]
132 | },
133 | {
134 | "cell_type": "markdown",
135 | "metadata": {},
136 | "source": [
137 | "## 2.2 标记模式\n",
138 | "\n",
139 | "组成一个词块语法的规则使用标记模式来描述已标注的词的序列。一个标记模式是一个词性标记序列,用尖括号分隔,如\n",
140 | "```\n",
141 | "?*\n",
142 | "```\n",
143 | "标记模式类似于正则表达式模式(3.4)。现在,思考下面的来自《华尔街日报》的名词短语:\n",
144 | "```py\n",
145 | "another/DT sharp/JJ dive/NN\n",
146 | "trade/NN figures/NNS\n",
147 | "any/DT new/JJ policy/NN measures/NNS\n",
148 | "earlier/JJR stages/NNS\n",
149 | "Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP\n",
150 | "```"
151 | ]
152 | },
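 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "下面补充一个小练习示例(非原书代码,标记模式是笔者假设的一种写法):用标记模式 `<DT>?<JJ.*>*<NN.*>+` 大致覆盖上面列出的几个《华尔街日报》名词短语。\n"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "import nltk\n",
   "\n",
   "# 补充示例:用一个假设的标记模式覆盖上面的名词短语(非原书代码)\n",
   "grammar = \"NP: {<DT>?<JJ.*>*<NN.*>+}\"\n",
   "cp = nltk.RegexpParser(grammar)\n",
   "examples = [[(\"another\", \"DT\"), (\"sharp\", \"JJ\"), (\"dive\", \"NN\")],\n",
   "            [(\"earlier\", \"JJR\"), (\"stages\", \"NNS\")],\n",
   "            [(\"Panamanian\", \"JJ\"), (\"dictator\", \"NN\"), (\"Manuel\", \"NNP\"), (\"Noriega\", \"NNP\")]]\n",
   "for sent in examples:\n",
   "    print(cp.parse(sent))   # 每个短语都应被划分为一个完整的NP词块\n"
  ]
 },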
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "## 2.3 用正则表达式进行词块划分\n",
158 | "\n",
159 | "要找到一个给定的句子的词块结构,RegexpParser词块划分器以一个没有词符被划分的平面结构开始。词块划分规则轮流应用,依次更新词块结构。一旦所有的规则都被调用,返回生成的词块结构。\n",
160 | "\n",
161 | "2.3显示了一个由2个规则组成的简单的词块语法。第一条规则匹配一个可选的限定词或所有格代名词,零个或多个形容词,然后跟一个名词。第二条规则匹配一个或多个专有名词。我们还定义了一个进行词块划分的例句[1],并在此输入上运行这个词块划分器[2]。"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": 7,
167 | "metadata": {},
168 | "outputs": [
169 | {
170 | "name": "stdout",
171 | "output_type": "stream",
172 | "text": [
173 | "(S\n",
174 | " (NP Rapunzel/NNP)\n",
175 | " let/VBD\n",
176 | " down/RP\n",
177 | " (NP her/PP$ long/JJ golden/JJ hair/NN))\n"
178 | ]
179 | }
180 | ],
181 | "source": [
182 | "grammar = r\"\"\"\n",
183 | "NP: {?*} \n",
184 | "{+}\n",
185 | "\"\"\"\n",
186 | "cp = nltk.RegexpParser(grammar)\n",
187 | "sentence = [(\"Rapunzel\", \"NNP\"), (\"let\", \"VBD\"), (\"down\", \"RP\"), (\"her\", \"PP$\"), (\"long\", \"JJ\"), (\"golden\", \"JJ\"), (\"hair\", \"NN\")]\n",
188 | "print (cp.parse(sentence))"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "注意\n",
196 | "\n",
197 | "```\n",
198 | "$\n",
199 | "```\n",
200 | "符号是正则表达式中的一个特殊字符,必须使用反斜杠转义来匹配\n",
201 | "```\n",
202 | "PP\\$\n",
203 | "```\n",
204 | "标记。\n",
205 | "\n",
206 | "如果标记模式匹配位置重叠,最左边的匹配优先。例如,如果我们应用一个匹配两个连续的名词文本的规则到一个包含三个连续的名词的文本,则只有前两个名词将被划分:"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": 8,
212 | "metadata": {},
213 | "outputs": [
214 | {
215 | "name": "stdout",
216 | "output_type": "stream",
217 | "text": [
218 | "(S (NP money/NN market/NN) fund/NN)\n"
219 | ]
220 | }
221 | ],
222 | "source": [
223 | "nouns = [(\"money\", \"NN\"), (\"market\", \"NN\"), (\"fund\", \"NN\")]\n",
224 | "grammar = \"NP: {} # Chunk two consecutive nouns\"\n",
225 | "cp = nltk.RegexpParser(grammar)\n",
226 | "print(cp.parse(nouns))"
227 | ]
228 | },
229 | {
230 | "cell_type": "markdown",
231 | "metadata": {},
232 | "source": [
233 | "## 2.4 探索文本语料库\n",
234 | "\n",
235 | "在2中,我们看到了我们如何在已标注的语料库中提取匹配的特定的词性标记序列的短语。我们可以使用词块划分器更容易的做同样的工作,如下:"
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": 12,
241 | "metadata": {},
242 | "outputs": [
243 | {
244 | "name": "stdout",
245 | "output_type": "stream",
246 | "text": [
247 | "(CHUNK combined/VBN to/TO achieve/VB)\n",
248 | "(CHUNK continue/VB to/TO place/VB)\n",
249 | "(CHUNK serve/VB to/TO protect/VB)\n"
250 | ]
251 | }
252 | ],
253 | "source": [
254 | "cp = nltk.RegexpParser('CHUNK: { }')\n",
255 | "brown = nltk.corpus.brown\n",
256 | "count = 0\n",
257 | "for sent in brown.tagged_sents():\n",
258 | " tree = cp.parse(sent)\n",
259 | " for subtree in tree.subtrees():\n",
260 | " if subtree.label() == 'CHUNK': print(subtree)\n",
261 | " count += 1\n",
262 | " if count >= 30: break"
263 | ]
264 | },
265 | {
266 | "cell_type": "markdown",
267 | "metadata": {},
268 | "source": [
269 | "## 2.5 词缝加塞\n",
270 | "\n",
271 | "有时定义我们想从一个词块中排除什么比较容易。我们可以定义词缝为一个不包含在词块中的一个词符序列。在下面的例子中,barked/VBD at/IN是一个词缝:\n",
272 | "```\n",
273 | "[ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]\n",
274 | "```\n"
275 | ]
276 | },
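 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "按原书这一节的思路补一个可运行的示例(代码为笔者整理,非逐字引用):先把整个句子划成一个大词块,再用词缝规则 }<VBD|IN>+{ 把 VBD 和 IN 的序列从词块中挖出去。\n"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# 词缝加塞示例:{...} 是词块规则,}...{ 是词缝规则\n",
   "grammar = r\"\"\"\n",
   "  NP:\n",
   "    {<.*>+}          # 先把所有词符划成一个词块\n",
   "    }<VBD|IN>+{      # 再把 VBD、IN 序列作为词缝挖出\n",
   "  \"\"\"\n",
   "sentence = [(\"the\", \"DT\"), (\"little\", \"JJ\"), (\"yellow\", \"JJ\"), (\"dog\", \"NN\"),\n",
   "            (\"barked\", \"VBD\"), (\"at\", \"IN\"), (\"the\", \"DT\"), (\"cat\", \"NN\")]\n",
   "cp = nltk.RegexpParser(grammar)\n",
   "print(cp.parse(sentence))\n"
  ]
 },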
277 | {
278 | "cell_type": "markdown",
279 | "metadata": {},
280 | "source": [
281 | "## 2.6 词块的表示:标记与树\n",
282 | "\n",
283 | "作为标注和分析之间的中间状态(8.,词块结构可以使用标记或树来表示。最广泛的文件表示使用IOB标记。在这个方案中,每个词符被三个特殊的词块标记之一标注,I(内部),O(外部)或B(开始)。一个词符被标注为B,如果它标志着一个词块的开始。块内的词符子序列被标注为I。所有其他的词符被标注为O。B和I标记后面跟着词块类型,如B-NP, I-NP。当然,没有必要指定出现在词块外的词符类型,所以这些都只标注为O。这个方案的例子如2.5所示。\n",
284 | "\n",
285 | "\n",
286 | "IOB标记已成为文件中表示词块结构的标准方式,我们也将使用这种格式。下面是2.5中的信息如何出现在一个文件中的:\n",
287 | "```\n",
288 | "We PRP B-NP\n",
289 | "saw VBD O\n",
290 | "the DT B-NP\n",
291 | "yellow JJ I-NP\n",
292 | "dog NN I-NP\n",
293 | "```"
294 | ]
295 | },
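 {
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "补充一个小示例(非原书代码):用 nltk.chunk.conlltags2tree 可以把上面这种 (词, 词性, IOB标记) 三元组转换回词块树;后面 3.1 节的词块划分器内部也会用到这个函数。\n"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# 把 IOB 三元组还原成词块树(补充示例)\n",
   "conlltags = [(\"We\", \"PRP\", \"B-NP\"), (\"saw\", \"VBD\", \"O\"),\n",
   "             (\"the\", \"DT\", \"B-NP\"), (\"yellow\", \"JJ\", \"I-NP\"), (\"dog\", \"NN\", \"I-NP\")]\n",
   "print(nltk.chunk.conlltags2tree(conlltags))\n"
  ]
 },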
296 | {
297 | "cell_type": "markdown",
298 | "metadata": {},
299 | "source": [
300 | "# 3 开发和评估词块划分器\n",
301 | "\n",
302 | "现在你对分块的作用有了一些了解,但我们并没有解释如何评估词块划分器。和往常一样,这需要一个合适的已标注语料库。我们一开始寻找将IOB格式转换成NLTK树的机制,然后是使用已化分词块的语料库如何在一个更大的规模上做这个。我们将看到如何为一个词块划分器相对一个语料库的准确性打分,再看看一些数据驱动方式搜索NP词块。我们整个的重点在于扩展一个词块划分器的覆盖范围。\n",
303 | "## 3.1 读取IOB格式与CoNLL2000语料库\n",
304 | "\n",
305 | "使用corpus模块,我们可以加载已经标注并使用IOB符号划分词块的《华尔街日报》文本。这个语料库提供的词块类型有NP,VP和PP。正如我们已经看到的,每个句子使用多行表示,如下所示:\n",
306 | "```\n",
307 | "he PRP B-NP\n",
308 | "accepted VBD B-VP\n",
309 | "the DT B-NP\n",
310 | "position NN I-NP\n",
311 | "...\n",
312 | "```\n",
313 | "\n",
314 | "我们可以使用NLTK的corpus模块访问较大量的已经划分词块的文本。CoNLL2000语料库包含27万词的《华尔街日报文本》,分为“训练”和“测试”两部分,标注有词性标记和IOB格式词块标记。我们可以使用nltk.corpus.conll2000访问这些数据。下面是一个读取语料库的“训练”部分的第100个句子的例子:"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": 14,
320 | "metadata": {},
321 | "outputs": [
322 | {
323 | "name": "stdout",
324 | "output_type": "stream",
325 | "text": [
326 | "(S\n",
327 | " (PP Over/IN)\n",
328 | " (NP a/DT cup/NN)\n",
329 | " (PP of/IN)\n",
330 | " (NP coffee/NN)\n",
331 | " ,/,\n",
332 | " (NP Mr./NNP Stone/NNP)\n",
333 | " (VP told/VBD)\n",
334 | " (NP his/PRP$ story/NN)\n",
335 | " ./.)\n"
336 | ]
337 | }
338 | ],
339 | "source": [
340 | "from nltk.corpus import conll2000\n",
341 | "print(conll2000.chunked_sents('train.txt')[99])"
342 | ]
343 | },
344 | {
345 | "cell_type": "markdown",
346 | "metadata": {},
347 | "source": [
348 | "正如你看到的,CoNLL2000语料库包含三种词块类型:NP词块,我们已经看到了;VP词块如has already delivered;PP块如because of。因为现在我们唯一感兴趣的是NP词块,我们可以使用chunk_types参数选择它们:"
349 | ]
350 | },
351 | {
352 | "cell_type": "code",
353 | "execution_count": 15,
354 | "metadata": {},
355 | "outputs": [
356 | {
357 | "name": "stdout",
358 | "output_type": "stream",
359 | "text": [
360 | "(S\n",
361 | " Over/IN\n",
362 | " (NP a/DT cup/NN)\n",
363 | " of/IN\n",
364 | " (NP coffee/NN)\n",
365 | " ,/,\n",
366 | " (NP Mr./NNP Stone/NNP)\n",
367 | " told/VBD\n",
368 | " (NP his/PRP$ story/NN)\n",
369 | " ./.)\n"
370 | ]
371 | }
372 | ],
373 | "source": [
374 | "print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])"
375 | ]
376 | },
377 | {
378 | "cell_type": "markdown",
379 | "metadata": {},
380 | "source": [
381 | "## 3.2 简单的评估和基准\n",
382 | "\n",
383 | "现在,我们可以访问一个已划分词块语料,可以评估词块划分器。我们开始为没有什么意义的词块解析器cp建立一个基准,它不划分任何词块:"
384 | ]
385 | },
386 | {
387 | "cell_type": "code",
388 | "execution_count": 16,
389 | "metadata": {},
390 | "outputs": [
391 | {
392 | "name": "stdout",
393 | "output_type": "stream",
394 | "text": [
395 | "ChunkParse score:\n",
396 | " IOB Accuracy: 43.4%%\n",
397 | " Precision: 0.0%%\n",
398 | " Recall: 0.0%%\n",
399 | " F-Measure: 0.0%%\n"
400 | ]
401 | }
402 | ],
403 | "source": [
404 | "from nltk.corpus import conll2000\n",
405 | "cp = nltk.RegexpParser(\"\")\n",
406 | "test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])\n",
407 | "print(cp.evaluate(test_sents))"
408 | ]
409 | },
410 | {
411 | "cell_type": "markdown",
412 | "metadata": {},
413 | "source": [
414 | "IOB标记准确性表明超过三分之一的词被标注为O,即没有在NP词块中。然而,由于我们的标注器没有找到任何词块,其精度、召回率和F-度量均为零。现在让我们尝试一个初级的正则表达式词块划分器,查找以名词短语标记的特征字母开头的标记(如CD, DT和JJ)。"
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": 17,
420 | "metadata": {},
421 | "outputs": [
422 | {
423 | "name": "stdout",
424 | "output_type": "stream",
425 | "text": [
426 | "ChunkParse score:\n",
427 | " IOB Accuracy: 87.7%%\n",
428 | " Precision: 70.6%%\n",
429 | " Recall: 67.8%%\n",
430 | " F-Measure: 69.2%%\n"
431 | ]
432 | }
433 | ],
434 | "source": [
435 | "grammar = r\"NP: {<[CDJNP].*>+}\"\n",
436 | "cp = nltk.RegexpParser(grammar)\n",
437 | "print(cp.evaluate(test_sents))"
438 | ]
439 | },
440 | {
441 | "cell_type": "markdown",
442 | "metadata": {},
443 | "source": [
444 | "正如你看到的,这种方法达到相当好的结果。但是,我们可以采用更多数据驱动的方法改善它,在这里我们使用训练语料找到对每个词性标记最有可能的块标记(I, O或B)。换句话说,我们可以使用一元标注器(4)建立一个词块划分器。但不是尝试确定每个词的正确的词性标记,而是根据每个词的词性标记,尝试确定正确的词块标记。\n",
445 | "\n",
446 | "在3.1中,我们定义了UnigramChunker类,使用一元标注器给句子加词块标记。这个类的大部分代码只是用来在NLTK 的ChunkParserI接口使用的词块树表示和嵌入式标注器使用的IOB表示之间镜像转换。类定义了两个方法:一个构造函数[1],当我们建立一个新的UnigramChunker时调用;以及parse方法[3],用来给新句子划分词块。"
447 | ]
448 | },
449 | {
450 | "cell_type": "code",
451 | "execution_count": 18,
452 | "metadata": {},
453 | "outputs": [],
454 | "source": [
455 | "class UnigramChunker(nltk.ChunkParserI):\n",
456 | " def __init__(self, train_sents): \n",
457 | " train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]\n",
458 | " for sent in train_sents]\n",
459 | " self.tagger = nltk.UnigramTagger(train_data) \n",
460 | "\n",
461 | " def parse(self, sentence): \n",
462 | " pos_tags = [pos for (word,pos) in sentence]\n",
463 | " tagged_pos_tags = self.tagger.tag(pos_tags)\n",
464 | " chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]\n",
465 | " conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)\n",
466 | " in zip(sentence, chunktags)]\n",
467 | " return nltk.chunk.conlltags2tree(conlltags)"
468 | ]
469 | },
470 | {
471 | "cell_type": "markdown",
472 | "metadata": {},
473 | "source": [
474 | "构造函数[1]需要训练句子的一个列表,这将是词块树的形式。它首先将训练数据转换成适合训练标注器的形式,使用tree2conlltags映射每个词块树到一个word,tag,chunk三元组的列表。然后使用转换好的训练数据训练一个一元标注器,并存储在self.tagger供以后使用。\n",
475 | "\n",
476 | "parse方法[3]接收一个已标注的句子作为其输入,以从那句话提取词性标记开始。它然后使用在构造函数中训练过的标注器self.tagger,为词性标记标注IOB词块标记。接下来,它提取词块标记,与原句组合,产生conlltags。最后,它使用conlltags2tree将结果转换成一个词块树。\n",
477 | "\n",
478 | "现在我们有了UnigramChunker,可以使用CoNLL2000语料库训练它,并测试其表现:"
479 | ]
480 | },
481 | {
482 | "cell_type": "code",
483 | "execution_count": 20,
484 | "metadata": {},
485 | "outputs": [
486 | {
487 | "name": "stdout",
488 | "output_type": "stream",
489 | "text": [
490 | "ChunkParse score:\n",
491 | " IOB Accuracy: 92.9%%\n",
492 | " Precision: 79.9%%\n",
493 | " Recall: 86.8%%\n",
494 | " F-Measure: 83.2%%\n"
495 | ]
496 | }
497 | ],
498 | "source": [
499 | "test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])\n",
500 | "train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])\n",
501 | "unigram_chunker = UnigramChunker(train_sents)\n",
502 | "print(unigram_chunker.evaluate(test_sents))"
503 | ]
504 | },
505 | {
506 | "cell_type": "markdown",
507 | "metadata": {},
508 | "source": [
509 | "这个分块器相当不错,达到整体F-度量83%的得分。让我们来看一看通过使用一元标注器分配一个标记给每个语料库中出现的词性标记,它学到了什么:"
510 | ]
511 | },
512 | {
513 | "cell_type": "code",
514 | "execution_count": 21,
515 | "metadata": {},
516 | "outputs": [
517 | {
518 | "name": "stdout",
519 | "output_type": "stream",
520 | "text": [
521 | "[('#', 'B-NP'), ('$', 'B-NP'), (\"''\", 'O'), ('(', 'O'), (')', 'O'), (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'), ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'), ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'), ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'), ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'), ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'), ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'), ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'), ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]\n"
522 | ]
523 | }
524 | ],
525 | "source": [
526 | "postags = sorted(set(pos for sent in train_sents\n",
527 | " for (word,pos) in sent.leaves()))\n",
528 | "print(unigram_chunker.tagger.tag(postags))"
529 | ]
530 | },
531 | {
532 | "cell_type": "markdown",
533 | "metadata": {},
534 | "source": [
535 | "它已经发现大多数标点符号出现在NP词块外,除了两种货币符号#和*\\$*。它也发现限定词(DT)和所有格(PRP*\\$*和WP$)出现在NP词块的开头,而名词类型(NN, NNP, NNPS,NNS)大多出现在NP词块内。\n",
536 | "\n",
537 | "建立了一个一元分块器,很容易建立一个二元分块器:我们只需要改变类的名称为BigramChunker,修改3.1行[2]构造一个BigramTagger而不是UnigramTagger。由此产生的词块划分器的性能略高于一元词块划分器:"
538 | ]
539 | },
540 | {
541 | "cell_type": "code",
542 | "execution_count": 24,
543 | "metadata": {},
544 | "outputs": [
545 | {
546 | "name": "stdout",
547 | "output_type": "stream",
548 | "text": [
549 | "ChunkParse score:\n",
550 | " IOB Accuracy: 93.3%%\n",
551 | " Precision: 82.3%%\n",
552 | " Recall: 86.8%%\n",
553 | " F-Measure: 84.5%%\n"
554 | ]
555 | }
556 | ],
557 | "source": [
558 | "class BigramChunker(nltk.ChunkParserI):\n",
559 | " def __init__(self, train_sents): \n",
560 | " train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]\n",
561 | " for sent in train_sents]\n",
562 | " self.tagger = nltk.BigramTagger(train_data)\n",
563 | "\n",
564 | " def parse(self, sentence): \n",
565 | " pos_tags = [pos for (word,pos) in sentence]\n",
566 | " tagged_pos_tags = self.tagger.tag(pos_tags)\n",
567 | " chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]\n",
568 | " conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)\n",
569 | " in zip(sentence, chunktags)]\n",
570 | " return nltk.chunk.conlltags2tree(conlltags)\n",
571 | "bigram_chunker = BigramChunker(train_sents)\n",
572 | "print(bigram_chunker.evaluate(test_sents))"
573 | ]
574 | },
575 | {
576 | "cell_type": "markdown",
577 | "metadata": {},
578 | "source": [
579 | "## 3.3 训练基于分类器的词块划分器\n",
580 | "\n",
581 | "无论是基于正则表达式的词块划分器还是n-gram词块划分器,决定创建什么词块完全基于词性标记。然而,有时词性标记不足以确定一个句子应如何划分词块。例如,考虑下面的两个语句:"
582 | ]
583 | },
584 | {
585 | "cell_type": "code",
586 | "execution_count": 31,
587 | "metadata": {},
588 | "outputs": [],
589 | "source": [
590 | "class ConsecutiveNPChunkTagger(nltk.TaggerI): \n",
591 | "\n",
592 | " def __init__(self, train_sents):\n",
593 | " train_set = []\n",
594 | " for tagged_sent in train_sents:\n",
595 | " untagged_sent = nltk.tag.untag(tagged_sent)\n",
596 | " history = []\n",
597 | " for i, (word, tag) in enumerate(tagged_sent):\n",
598 | " featureset = npchunk_features(untagged_sent, i, history) \n",
599 | " train_set.append( (featureset, tag) )\n",
600 | " history.append(tag)\n",
601 | " self.classifier = nltk.MaxentClassifier.train( \n",
602 | " train_set, algorithm='megam', trace=0)\n",
603 | "\n",
604 | " def tag(self, sentence):\n",
605 | " history = []\n",
606 | " for i, word in enumerate(sentence):\n",
607 | " featureset = npchunk_features(sentence, i, history)\n",
608 | " tag = self.classifier.classify(featureset)\n",
609 | " history.append(tag)\n",
610 | " return zip(sentence, history)\n",
611 | "\n",
612 | "class ConsecutiveNPChunker(nltk.ChunkParserI):\n",
613 | " def __init__(self, train_sents):\n",
614 | " tagged_sents = [[((w,t),c) for (w,t,c) in\n",
615 | " nltk.chunk.tree2conlltags(sent)]\n",
616 | " for sent in train_sents]\n",
617 | " self.tagger = ConsecutiveNPChunkTagger(tagged_sents)\n",
618 | "\n",
619 | " def parse(self, sentence):\n",
620 | " tagged_sents = self.tagger.tag(sentence)\n",
621 | " conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]\n",
622 | " return nltk.chunk.conlltags2tree(conlltags)"
623 | ]
624 | },
625 | {
626 | "cell_type": "markdown",
627 | "metadata": {},
628 | "source": [
629 | "留下来唯一需要填写的是特征提取器。首先,我们定义一个简单的特征提取器,它只是提供了当前词符的词性标记。使用此特征提取器,我们的基于分类器的词块划分器的表现与一元词块划分器非常类似:"
630 | ]
631 | },
632 | {
633 | "cell_type": "code",
634 | "execution_count": null,
635 | "metadata": {},
636 | "outputs": [],
637 | "source": [
638 | "def npchunk_features(sentence, i, history):\n",
639 | " word, pos = sentence[i]\n",
640 | " return {\"pos\": pos}\n",
641 | "chunker = ConsecutiveNPChunker(train_sents)\n",
642 | "print(chunker.evaluate(test_sents))"
643 | ]
644 | },
645 | {
646 | "cell_type": "markdown",
647 | "metadata": {},
648 | "source": [
649 | "```\n",
650 | "ChunkParse score:\n",
651 | " IOB Accuracy: 92.9%\n",
652 | " Precision: 79.9%\n",
653 | " Recall: 86.7%\n",
654 | " F-Measure: 83.2%\n",
655 | " ```"
656 | ]
657 | },
658 | {
659 | "cell_type": "markdown",
660 | "metadata": {},
661 | "source": [
662 | "我们还可以添加一个特征表示前面词的词性标记。添加此特征允许词块划分器模拟相邻标记之间的相互作用,由此产生的词块划分器与二元词块划分器非常接近。\n"
663 | ]
664 | },
665 | {
666 | "cell_type": "code",
667 | "execution_count": null,
668 | "metadata": {},
669 | "outputs": [],
670 | "source": [
671 | "def npchunk_features(sentence, i, history):\n",
672 | " word, pos = sentence[i]\n",
673 | " if i == 0:\n",
674 | " prevword, prevpos = \"\", \"\"\n",
675 | " else:\n",
676 | " prevword, prevpos = sentence[i-1]\n",
677 | " return {\"pos\": pos, \"prevpos\": prevpos}\n",
678 | "chunker = ConsecutiveNPChunker(train_sents)\n",
679 | "print(chunker.evaluate(test_sents))"
680 | ]
681 | },
682 | {
683 | "cell_type": "markdown",
684 | "metadata": {},
685 | "source": [
686 | "```\n",
687 | "ChunkParse score:\n",
688 | " IOB Accuracy: 93.6%\n",
689 | " Precision: 81.9%\n",
690 | " Recall: 87.2%\n",
691 | " F-Measure: 84.5%\n",
692 | "```"
693 | ]
694 | },
695 | {
696 | "cell_type": "markdown",
697 | "metadata": {},
698 | "source": [
699 | "下一步,我们将尝试为当前词增加特征,因为我们假设这个词的内容应该对词块划有用。我们发现这个特征确实提高了词块划分器的表现,大约1.5个百分点(相应的错误率减少大约10%)。"
700 | ]
701 | },
702 | {
703 | "cell_type": "code",
704 | "execution_count": null,
705 | "metadata": {},
706 | "outputs": [],
707 | "source": [
708 | "def npchunk_features(sentence, i, history):\n",
709 | " word, pos = sentence[i]\n",
710 | " if i == 0:\n",
711 | " prevword, prevpos = \"\", \"\"\n",
712 | " else:\n",
713 | " prevword, prevpos = sentence[i-1]\n",
714 | " return {\"pos\": pos, \"word\": word, \"prevpos\": prevpos}\n",
715 | "chunker = ConsecutiveNPChunker(train_sents)\n",
716 | "print(chunker.evaluate(test_sents))"
717 | ]
718 | },
719 | {
720 | "cell_type": "markdown",
721 | "metadata": {},
722 | "source": [
723 | "```\n",
724 | "ChunkParse score:\n",
725 | " IOB Accuracy: 94.5%\n",
726 | " Precision: 84.2%\n",
727 | " Recall: 89.4%\n",
728 | " F-Measure: 86.7%\n",
729 | "```"
730 | ]
731 | },
732 | {
733 | "cell_type": "markdown",
734 | "metadata": {},
735 | "source": [
736 | "最后,我们尝试用多种附加特征扩展特征提取器,例如预取特征[1]、配对特征[2]和复杂的语境特征[3]。这最后一个特征,称为tags-since-dt,创建一个字符串,描述自最近的限定词以来遇到的所有词性标记,或如果没有限定词则在索引i之前自语句开始以来遇到的所有词性标记。"
737 | ]
738 | },
739 | {
740 | "cell_type": "code",
741 | "execution_count": 36,
742 | "metadata": {},
743 | "outputs": [],
744 | "source": [
745 | "def npchunk_features(sentence, i, history):\n",
746 | " word, pos = sentence[i]\n",
747 | " if i == 0:\n",
748 | " prevword, prevpos = \"\", \"\"\n",
749 | " else:\n",
750 | " prevword, prevpos = sentence[i-1]\n",
751 | " if i == len(sentence)-1:\n",
752 | " nextword, nextpos = \"\", \"\"\n",
753 | " else:\n",
754 | " nextword, nextpos = sentence[i+1]\n",
755 | " return {\"pos\": pos,\n",
756 | " \"word\": word,\n",
757 | " \"prevpos\": prevpos,\n",
758 | " \"nextpos\": nextpos,\n",
759 | " \"prevpos+pos\": \"%s+%s\" % (prevpos, pos), \n",
760 | " \"pos+nextpos\": \"%s+%s\" % (pos, nextpos),\n",
761 | " \"tags-since-dt\": tags_since_dt(sentence, i)} "
762 | ]
763 | },
764 | {
765 | "cell_type": "code",
766 | "execution_count": 37,
767 | "metadata": {},
768 | "outputs": [],
769 | "source": [
770 | "def tags_since_dt(sentence, i):\n",
771 | " tags = set()\n",
772 | " for word, pos in sentence[:i]:\n",
773 | " if pos == 'DT':\n",
774 | " tags = set()\n",
775 | " else:\n",
776 | " tags.add(pos)\n",
777 | " return '+'.join(sorted(tags))"
778 | ]
779 | },
780 | {
781 | "cell_type": "code",
782 | "execution_count": null,
783 | "metadata": {},
784 | "outputs": [],
785 | "source": [
786 | "chunker = ConsecutiveNPChunker(train_sents)\n",
787 | "print(chunker.evaluate(test_sents))"
788 | ]
789 | },
790 | {
791 | "cell_type": "markdown",
792 | "metadata": {},
793 | "source": [
794 | "```\n",
795 | "ChunkParse score:\n",
796 | " IOB Accuracy: 96.0%\n",
797 | " Precision: 88.6%\n",
798 | " Recall: 91.0%\n",
799 | " F-Measure: 89.8%\n",
800 | "```"
801 | ]
802 | },
803 | {
804 | "cell_type": "markdown",
805 | "metadata": {},
806 | "source": [
807 | "# 4 语言结构中的递归\n",
808 | "## 4.1 用级联词块划分器构建嵌套结构\n",
809 | "\n",
810 | "到目前为止,我们的词块结构一直是相对平的。已标注词符组成的树在如NP这样的词块节点下任意组合。然而,只需创建一个包含递归规则的多级的词块语法,就可以建立任意深度的词块结构。4.1是名词短语、介词短语、动词短语和句子的模式。这是一个四级词块语法器,可以用来创建深度最多为4的结构。\n",
811 | "\n"
812 | ]
813 | },
814 | {
815 | "cell_type": "code",
816 | "execution_count": 42,
817 | "metadata": {},
818 | "outputs": [
819 | {
820 | "name": "stdout",
821 | "output_type": "stream",
822 | "text": [
823 | "(S\n",
824 | " (NP Mary/NN)\n",
825 | " saw/VBD\n",
826 | " (CLAUSE\n",
827 | " (NP the/DT cat/NN)\n",
828 | " (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))\n"
829 | ]
830 | }
831 | ],
832 | "source": [
833 | "grammar = r\"\"\"\n",
834 | " NP: {+} \n",
835 | " PP: {} \n",
836 | " VP: {+$} \n",
837 | " CLAUSE: {} \n",
838 | " \"\"\"\n",
839 | "cp = nltk.RegexpParser(grammar)\n",
840 | "sentence = [(\"Mary\", \"NN\"), (\"saw\", \"VBD\"), (\"the\", \"DT\"), (\"cat\", \"NN\"),\n",
841 | " (\"sit\", \"VB\"), (\"on\", \"IN\"), (\"the\", \"DT\"), (\"mat\", \"NN\")]\n",
842 | "print(cp.parse(sentence))"
843 | ]
844 | },
845 | {
846 | "cell_type": "markdown",
847 | "metadata": {},
848 | "source": [
849 | "不幸的是,这一结果丢掉了saw为首的VP。它还有其他缺陷。当我们将此词块划分器应用到一个有更深嵌套的句子时,让我们看看会发生什么。请注意,它无法识别[1]开始的VP词块。"
850 | ]
851 | },
852 | {
853 | "cell_type": "code",
854 | "execution_count": 43,
855 | "metadata": {},
856 | "outputs": [
857 | {
858 | "name": "stdout",
859 | "output_type": "stream",
860 | "text": [
861 | "(S\n",
862 | " (NP John/NNP)\n",
863 | " thinks/VBZ\n",
864 | " (NP Mary/NN)\n",
865 | " saw/VBD\n",
866 | " (CLAUSE\n",
867 | " (NP the/DT cat/NN)\n",
868 | " (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))\n"
869 | ]
870 | }
871 | ],
872 | "source": [
873 | "sentence = [(\"John\", \"NNP\"), (\"thinks\", \"VBZ\"), (\"Mary\", \"NN\"),\n",
874 | " (\"saw\", \"VBD\"), (\"the\", \"DT\"), (\"cat\", \"NN\"), (\"sit\", \"VB\"),\n",
875 | " (\"on\", \"IN\"), (\"the\", \"DT\"), (\"mat\", \"NN\")]\n",
876 | "print(cp.parse(sentence))"
877 | ]
878 | },
879 | {
880 | "cell_type": "markdown",
881 | "metadata": {},
882 | "source": [
883 | "这些问题的解决方案是让词块划分器在它的模式中循环:尝试完所有模式之后,重复此过程。我们添加一个可选的第二个参数loop指定这套模式应该循环的次数:"
884 | ]
885 | },
886 | {
887 | "cell_type": "code",
888 | "execution_count": 44,
889 | "metadata": {},
890 | "outputs": [
891 | {
892 | "name": "stdout",
893 | "output_type": "stream",
894 | "text": [
895 | "(S\n",
896 | " (NP John/NNP)\n",
897 | " thinks/VBZ\n",
898 | " (CLAUSE\n",
899 | " (NP Mary/NN)\n",
900 | " (VP\n",
901 | " saw/VBD\n",
902 | " (CLAUSE\n",
903 | " (NP the/DT cat/NN)\n",
904 | " (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))\n"
905 | ]
906 | }
907 | ],
908 | "source": [
909 | "cp = nltk.RegexpParser(grammar, loop=2)\n",
910 | "print(cp.parse(sentence))"
911 | ]
912 | },
913 | {
914 | "cell_type": "markdown",
915 | "metadata": {},
916 | "source": [
917 | "注意\n",
918 | "\n",
919 | "这个级联过程使我们能创建深层结构。然而,创建和调试级联过程是困难的,关键点是它能更有效地做全面的分析(见第8.章)。另外,级联过程只能产生固定深度的树(不超过级联级数),完整的句法分析这是不够的。\n"
920 | ]
921 | },
922 | {
923 | "cell_type": "markdown",
924 | "metadata": {},
925 | "source": [
926 | "## 4.2 Trees\n",
927 | "\n",
928 | "tree是一组连接的加标签节点,从一个特殊的根节点沿一条唯一的路径到达每个节点。下面是一棵树的例子(注意它们标准的画法是颠倒的):\n",
929 | "```\n",
930 | "(S\n",
931 | " (NP Alice)\n",
932 | " (VP\n",
933 | " (V chased)\n",
934 | " (NP\n",
935 | " (Det the)\n",
936 | " (N rabbit))))\n",
937 | "```\n",
938 | "虽然我们将只集中关注语法树,树可以用来编码任何同构的超越语言形式序列的层次结构(如形态结构、篇章结构)。一般情况下,叶子和节点值不一定要是字符串。\n",
939 | "\n",
940 | "在NLTK中,我们通过给一个节点添加标签和一系列的孩子创建一棵树:"
941 | ]
942 | },
943 | {
944 | "cell_type": "code",
945 | "execution_count": 46,
946 | "metadata": {},
947 | "outputs": [
948 | {
949 | "name": "stdout",
950 | "output_type": "stream",
951 | "text": [
952 | "(NP Alice)\n",
953 | "(NP the rabbit)\n"
954 | ]
955 | }
956 | ],
957 | "source": [
958 | "tree1 = nltk.Tree('NP', ['Alice'])\n",
959 | "print(tree1)\n",
960 | "tree2 = nltk.Tree('NP', ['the', 'rabbit'])\n",
961 | "print(tree2)"
962 | ]
963 | },
964 | {
965 | "cell_type": "markdown",
966 | "metadata": {},
967 | "source": [
968 | "我们可以将这些不断合并成更大的树,如下所示:"
969 | ]
970 | },
971 | {
972 | "cell_type": "code",
973 | "execution_count": 47,
974 | "metadata": {},
975 | "outputs": [
976 | {
977 | "name": "stdout",
978 | "output_type": "stream",
979 | "text": [
980 | "(S (NP Alice) (VP chased (NP the rabbit)))\n"
981 | ]
982 | }
983 | ],
984 | "source": [
985 | "tree3 = nltk.Tree('VP', ['chased', tree2])\n",
986 | "tree4 = nltk.Tree('S', [tree1, tree3])\n",
987 | "print(tree4)"
988 | ]
989 | },
990 | {
991 | "cell_type": "markdown",
992 | "metadata": {},
993 | "source": [
994 | "下面是树对象的一些的方法:"
995 | ]
996 | },
997 | {
998 | "cell_type": "code",
999 | "execution_count": 49,
1000 | "metadata": {},
1001 | "outputs": [
1002 | {
1003 | "name": "stdout",
1004 | "output_type": "stream",
1005 | "text": [
1006 | "(VP chased (NP the rabbit))\n",
1007 | "VP\n",
1008 | "['Alice', 'chased', 'the', 'rabbit']\n",
1009 | "rabbit\n"
1010 | ]
1011 | }
1012 | ],
1013 | "source": [
1014 | "print(tree4[1])\n",
1015 | "print(tree4[1].label())\n",
1016 | "print(tree4.leaves())\n",
1017 | "print(tree4[1][1][1])"
1018 | ]
1019 | },
1020 | {
1021 | "cell_type": "markdown",
1022 | "metadata": {},
1023 | "source": [
1024 | "复杂的树用括号表示难以阅读。在这些情况下,draw方法是非常有用的。它会打开一个新窗口,包含树的一个图形表示。树显示窗口可以放大和缩小,子树可以折叠和展开,并将图形表示输出为一个postscript文件(包含在一个文档中)。"
1025 | ]
1026 | },
1027 | {
1028 | "cell_type": "code",
1029 | "execution_count": 50,
1030 | "metadata": {},
1031 | "outputs": [],
1032 | "source": [
1033 | "tree3.draw()\n"
1034 | ]
1035 | },
1036 | {
1037 | "cell_type": "markdown",
1038 | "metadata": {},
1039 | "source": [
1040 | ""
1041 | ]
1042 | },
1043 | {
1044 | "cell_type": "markdown",
1045 | "metadata": {},
1046 | "source": [
1047 | "## 4.3 树遍历\n",
1048 | "\n",
1049 | "使用递归函数来遍历树是标准的做法。4.2中的内容进行了演示。\n",
1050 | "\n"
1051 | ]
1052 | },
1053 | {
1054 | "cell_type": "code",
1055 | "execution_count": 53,
1056 | "metadata": {},
1057 | "outputs": [
1058 | {
1059 | "name": "stdout",
1060 | "output_type": "stream",
1061 | "text": [
1062 | "( S ( NP Alice ) ( VP chased ( NP the rabbit ) ) ) "
1063 | ]
1064 | }
1065 | ],
1066 | "source": [
1067 | "def traverse(t):\n",
1068 | " try:\n",
1069 | " t.label()\n",
1070 | " except AttributeError:\n",
1071 | " print(t, end=\" \")\n",
1072 | " else:\n",
1073 | " # Now we know that t.node is defined\n",
1074 | " print('(', t.label(), end=\" \")\n",
1075 | " for child in t:\n",
1076 | " traverse(child)\n",
1077 | " print(')', end=\" \")\n",
1078 | "\n",
1079 | "t = tree4\n",
1080 | "traverse(t)"
1081 | ]
1082 | },
1083 | {
1084 | "cell_type": "markdown",
1085 | "metadata": {},
1086 | "source": [
1087 | "# 5 命名实体识别\n",
1088 | "\n",
1089 | "在本章开头,我们简要介绍了命名实体(NE)。命名实体是确切的名词短语,指示特定类型的个体,如组织、人、日期等。5.1列出了一些较常用的NE类型。这些应该是不言自明的,除了“FACILITY”:建筑和土木工程领域的人造产品;以及“GPE”:地缘政治实体,如城市、州/省、国家。\n",
1090 | "\n",
1091 | "\n",
1092 | "常用命名实体类型\n",
1093 | "```\n",
1094 | "Eddy N B-PER\n",
1095 | "Bonte N I-PER\n",
1096 | "is V O\n",
1097 | "woordvoerder N O\n",
1098 | "van Prep O\n",
1099 | "diezelfde Pron O\n",
1100 | "Hogeschool N B-ORG\n",
1101 | ". Punc O\n",
1102 | "```"
1103 | ]
1104 | },
1105 | {
1106 | "cell_type": "code",
1107 | "execution_count": 54,
1108 | "metadata": {},
1109 | "outputs": [
1110 | {
1111 | "name": "stdout",
1112 | "output_type": "stream",
1113 | "text": [
1114 | "(S\n",
1115 | " From/IN\n",
1116 | " what/WDT\n",
1117 | " I/PPSS\n",
1118 | " was/BEDZ\n",
1119 | " able/JJ\n",
1120 | " to/IN\n",
1121 | " gauge/NN\n",
1122 | " in/IN\n",
1123 | " a/AT\n",
1124 | " swift/JJ\n",
1125 | " ,/,\n",
1126 | " greedy/JJ\n",
1127 | " glance/NN\n",
1128 | " ,/,\n",
1129 | " the/AT\n",
1130 | " figure/NN\n",
1131 | " inside/IN\n",
1132 | " the/AT\n",
1133 | " coral-colored/JJ\n",
1134 | " boucle/NN\n",
1135 | " dress/NN\n",
1136 | " was/BEDZ\n",
1137 | " stupefying/VBG\n",
1138 | " ./.)\n"
1139 | ]
1140 | }
1141 | ],
1142 | "source": [
1143 | "print(nltk.ne_chunk(sent)) "
1144 | ]
1145 | },
1146 | {
1147 | "cell_type": "markdown",
1148 | "metadata": {},
1149 | "source": [
1150 | "# 6 关系抽取\n",
1151 | "\n",
1152 | "一旦文本中的命名实体已被识别,我们就可以提取它们之间存在的关系。如前所述,我们通常会寻找指定类型的命名实体之间的关系。进行这一任务的方法之一是首先寻找所有X, α, Y)形式的三元组,其中X和Y是指定类型的命名实体,α表示X和Y之间关系的字符串。然后我们可以使用正则表达式从α的实体中抽出我们正在查找的关系。下面的例子搜索包含词in的字符串。特殊的正则表达式(?!\\b.+ing\\b)是一个否定预测先行断言,允许我们忽略如success in supervising the transition of中的字符串,其中in后面跟一个动名词。"
1153 | ]
1154 | },
1155 | {
1156 | "cell_type": "code",
1157 | "execution_count": 55,
1158 | "metadata": {},
1159 | "outputs": [
1160 | {
1161 | "name": "stdout",
1162 | "output_type": "stream",
1163 | "text": [
1164 | "[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']\n",
1165 | "[ORG: 'McGlashan & Sarrail'] 'firm in' [LOC: 'San Mateo']\n",
1166 | "[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']\n",
1167 | "[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']\n",
1168 | "[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']\n",
1169 | "[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']\n",
1170 | "[ORG: 'WGBH'] 'in' [LOC: 'Boston']\n",
1171 | "[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']\n",
1172 | "[ORG: 'Omnicom'] 'in' [LOC: 'New York']\n",
1173 | "[ORG: 'DDB Needham'] 'in' [LOC: 'New York']\n",
1174 | "[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']\n",
1175 | "[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']\n",
1176 | "[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']\n"
1177 | ]
1178 | }
1179 | ],
1180 | "source": [
1181 | "IN = re.compile(r'.*\\bin\\b(?!\\b.+ing)')\n",
1182 | "for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):\n",
1183 | " for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,\n",
1184 | " corpus='ieer', pattern = IN):\n",
1185 | " print(nltk.sem.rtuple(rel))"
1186 | ]
1187 | },
1188 | {
1189 | "cell_type": "markdown",
1190 | "metadata": {},
1191 | "source": [
1192 | "搜索关键字in执行的相当不错,虽然它的检索结果也会误报,例如[ORG: House Transportation Committee] , secured the most money in the [LOC: New York];一种简单的基于字符串的方法排除这样的填充字符串似乎不太可能。\n",
1193 | "\n",
1194 | "如前文所示,conll2002命名实体语料库的荷兰语部分不只包含命名实体标注,也包含词性标注。这允许我们设计对这些标记敏感的模式,如下面的例子所示。clause()方法以分条形式输出关系,其中二元关系符号作为参数relsym的值被指定[1]。"
1195 | ]
1196 | },
1197 | {
1198 | "cell_type": "code",
1199 | "execution_count": 57,
1200 | "metadata": {},
1201 | "outputs": [
1202 | {
1203 | "name": "stdout",
1204 | "output_type": "stream",
1205 | "text": [
1206 | "VAN(\"cornet_d'elzius\", 'buitenlandse_handel')\n",
1207 | "VAN('johan_rottiers', 'kardinaal_van_roey_instituut')\n",
1208 | "VAN('annie_lennox', 'eurythmics')\n"
1209 | ]
1210 | }
1211 | ],
1212 | "source": [
1213 | "from nltk.corpus import conll2002\n",
1214 | "vnv = \"\"\"\n",
1215 | "(\n",
1216 | "is/V| # 3rd sing present and\n",
1217 | "was/V| # past forms of the verb zijn ('be')\n",
1218 | "werd/V| # and also present\n",
1219 | "wordt/V # past of worden ('become)\n",
1220 | ")\n",
1221 | ".* # followed by anything\n",
1222 | "van/Prep # followed by van ('of')\n",
1223 | "\"\"\"\n",
1224 | "VAN = re.compile(vnv, re.VERBOSE)\n",
1225 | "for doc in conll2002.chunked_sents('ned.train'):\n",
1226 | " for r in nltk.sem.extract_rels('PER', 'ORG', doc,\n",
1227 | " corpus='conll2002', pattern=VAN):\n",
1228 | " print(nltk.sem.clause(r, relsym=\"VAN\"))"
1229 | ]
1230 | },
1231 | {
1232 | "cell_type": "code",
1233 | "execution_count": null,
1234 | "metadata": {},
1235 | "outputs": [],
1236 | "source": []
1237 | }
1238 | ],
1239 | "metadata": {
1240 | "kernelspec": {
1241 | "display_name": "Python 3",
1242 | "language": "python",
1243 | "name": "python3"
1244 | },
1245 | "language_info": {
1246 | "codemirror_mode": {
1247 | "name": "ipython",
1248 | "version": 3
1249 | },
1250 | "file_extension": ".py",
1251 | "mimetype": "text/x-python",
1252 | "name": "python",
1253 | "nbconvert_exporter": "python",
1254 | "pygments_lexer": "ipython3",
1255 | "version": "3.6.2"
1256 | }
1257 | },
1258 | "nbformat": 4,
1259 | "nbformat_minor": 2
1260 | }
1261 |
--------------------------------------------------------------------------------
/【Python自然语言处理】读书笔记:第二章:获得文本语料和词汇资源.md:
--------------------------------------------------------------------------------
1 | 原文在线阅读:https://usyiyi.github.io/nlp-py-2e-zh/2.html
2 | # 1 获取文本语料库
3 | ## 1.1 古腾堡语料库
4 | ```py
5 | >>> for fileid in gutenberg.fileids():
6 | ...     num_words = len(gutenberg.words(fileid))
7 | ...     num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
8 | ...     num_sents = len(gutenberg.sents(fileid))
9 | ...     num_chars = len(gutenberg.raw(fileid))
10 | ...     print("平均词长:", round(num_chars/num_words), "平均句长:", round(num_words/num_sents), "每个单词出现的平均次数:", round(num_words/num_vocab), fileid)
11 | ...
12 | 平均词长: 5 平均句长: 25 每个单词出现的平均次数: 26 austen-emma.txt
13 | 平均词长: 5 平均句长: 26 每个单词出现的平均次数: 17 austen-persuasion.txt
14 | 平均词长: 5 平均句长: 28 每个单词出现的平均次数: 22 austen-sense.txt
15 | 平均词长: 4 平均句长: 34 每个单词出现的平均次数: 79 bible-kjv.txt
16 | 平均词长: 5 平均句长: 19 每个单词出现的平均次数: 5 blake-poems.txt
17 | 平均词长: 4 平均句长: 19 每个单词出现的平均次数: 14 bryant-stories.txt
18 | 平均词长: 4 平均句长: 18 每个单词出现的平均次数: 12 burgess-busterbrown.txt
19 | 平均词长: 4 平均句长: 20 每个单词出现的平均次数: 13 carroll-alice.txt
20 | 平均词长: 5 平均句长: 20 每个单词出现的平均次数: 12 chesterton-ball.txt
21 | 平均词长: 5 平均句长: 23 每个单词出现的平均次数: 11 chesterton-brown.txt
22 | 平均词长: 5 平均句长: 18 每个单词出现的平均次数: 11 chesterton-thursday.txt
23 | 平均词长: 4 平均句长: 21 每个单词出现的平均次数: 25 edgeworth-parents.txt
24 | 平均词长: 5 平均句长: 26 每个单词出现的平均次数: 15 melville-moby_dick.txt
25 | 平均词长: 5 平均句长: 52 每个单词出现的平均次数: 11 milton-paradise.txt
26 | 平均词长: 4 平均句长: 12 每个单词出现的平均次数: 9 shakespeare-caesar.txt
27 | 平均词长: 4 平均句长: 12 每个单词出现的平均次数: 8 shakespeare-hamlet.txt
28 | 平均词长: 4 平均句长: 12 每个单词出现的平均次数: 7 shakespeare-macbeth.txt
29 | 平均词长: 5 平均句长: 36 每个单词出现的平均次数: 12 whitman-leaves.txt
30 | >>>
31 |
32 | ```
33 | 这个程序显示每个文本的三个统计量:**平均词长**、**平均句子长度**和本文中每个词出现的平均次数(我们的**词汇多样性**得分)。请看,平均词长似乎是英语的一个一般属性,因为它的值总是4。(事实上,平均词长是3而不是4,因为num_chars变量计数了空白字符。)相比之下,平均句子长度和词汇多样性看上去是作者个人的特点。
34 |
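顺带一提,如果想得到不把空白字符计入的平均词长,可以直接对词符本身的长度求平均。下面是一个最小示意(假设已执行 import nltk 并 from nltk.corpus import gutenberg),其结果大致在3~4之间,与上面括号中的说明一致:
```py
>>> for fileid in gutenberg.fileids():
...     words = gutenberg.words(fileid)
...     print(round(sum(len(w) for w in words) / len(words), 1), fileid)
```
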
35 | ## 1.2 网络和聊天文本
36 |
37 | ## 1.3 布朗语料库
38 | ```py
39 | >>> from nltk.corpus import brown
40 | >>> news_text = brown.words(categories='news')
41 | >>> fdist = nltk.FreqDist(w.lower() for w in news_text)
42 | >>> for k in fdist:
43 | ... if k[:2] == "wh":
44 | ... print(k + ":",fdist[k], end = " ")
45 | ...
46 | which: 245 when: 169 who: 268 whether: 18 where: 59 what: 95 while: 55 why: 14 whipped: 2 white: 57 whom: 8 whereby: 3 whole: 11 wherever: 1 whose: 22 wholesale: 1 wheel: 4 whatever: 2 whipple: 1 whitey: 1 whiz: 2 whitfield: 1 whip: 2 whirling: 1 wheeled: 2 whee: 1 wheeler: 2 whisking: 1 wheels: 1 whitney: 1 whopping: 1 wholly-owned: 1 whims: 1 whelan: 1 white-clad: 1 wheat: 1 whites: 2 whiplash: 1 whichever: 1 what's: 1 wholly: 1 >>>
47 | ```
48 |
49 |
50 | 布朗语料库:查看一星期各天的词在news和humor两个类别中出现的条件频率分布:
51 | ```py
52 | >>> cfd = nltk.ConditionalFreqDist((genre, word) for genre in brown.categories() for word in brown.words(categories = genre))
53 | >>> genres = ["news", "humor"] # 填写我们想要展示的种类
54 | >>> day = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
55 | # 填写我们想要统计的词
56 | >>> cfd.tabulate(conditions = genres, samples = day)
57 | Monday Tuesday Wednesday Thursday Friday Saturday Sunday
58 | news 54 43 22 20 41 33 51
59 | humor 1 0 0 0 0 3 0
60 | >>> cfd.plot(conditions = genres, samples = day)
61 | ```
62 | 
63 | ## 1.4 路透社语料库
64 | ## 1.5 就职演说语料库
65 | ```py
66 | >>> cfd = nltk.ConditionalFreqDist(
67 | ... (target, fileid[:4])
68 | ... for fileid in inaugural.fileids()
69 | ... for w in inaugural.words(fileid)
70 | ... for target in ['america', 'citizen']
71 | ... if w.lower().startswith(target)) [1]
72 | >>> cfd.plot()
73 |
74 | ```
75 | 
76 |
77 | 条件频率分布图:计数就职演说语料库中所有以america 或citizen开始的词。
78 |
79 | ## 1.6 标注文本语料库
80 | ## 1.8 文本语料库的结构
81 | 文本语料库的常见结构:最简单的一种语料库是一些孤立的没有什么特别的组织的文本集合;一些语料库按如文体(布朗语料库)等分类组织结构;一些分类会重叠,如主题类别(路透社语料库);另外一些语料库可以表示随时间变化语言用法的改变(就职演说语料库)。
82 | 
83 |
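下面用几个标准语料库接口的调用简单印证这几种组织方式(代码与输出只是示意,具体以本地安装的语料库为准):
```py
>>> from nltk.corpus import brown, reuters, inaugural
>>> brown.categories()[:4]                 # 按文体分类组织
['adventure', 'belles_lettres', 'editorial', 'fiction']
>>> reuters.categories('training/9865')    # 主题类别可以重叠
['barley', 'corn', 'grain', 'wheat']
>>> inaugural.fileids()[:3]                # 按时间组织
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt']
```
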
84 |
85 | ## 1.9 加载你自己的语料库
86 | ```py
87 | >>> from nltk.corpus import PlaintextCorpusReader
88 | >>> corpus_root = '/usr/share/dict'
89 | >>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
90 | >>> wordlists.fileids()
91 | ['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
92 | >>> wordlists.words('connectives')
93 | ['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
94 | ```
95 | # 2 条件频率分布
96 | ## 2.1 条件和事件
97 | 每个**配对pairs**的形式是:(条件, 事件)。如果我们按文体处理整个布朗语料库,将有15 个条件(每个文体一个条件)和1,161,192 个事件(每一个词一个事件)。
98 | ```py
99 | >>> text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
100 | >>> pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
101 | ```
102 | ## 2.2 按文体计数词汇
103 | ```py
104 | >>> genre_word = [(genre, word)
105 | ... for genre in ['news', 'romance']
106 | ... for word in brown.words(categories=genre)]
107 | >>> len(genre_word)
108 | 170576
109 | >>> genre_word[:4]
110 | [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')] # [_start-genre]
111 | >>> genre_word[-4:]
112 | [('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')] # [_end-genre]
113 |
114 | ```
115 | ```py
116 | >>> cfd = nltk.ConditionalFreqDist(genre_word)
117 | >>> cfd [1]
118 |
119 | >>> cfd.conditions()
120 | ['news', 'romance'] # [_conditions-cfd]
121 | ```
122 | ```py
123 | >>> print(cfd['news'])
124 |
125 | >>> print(cfd['romance'])
126 |
127 | >>> cfd['romance'].most_common(20)
128 | [(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502),
129 | ('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993),
130 | ('I', 951), ('in', 875), ('he', 702), ('had', 692), ('?', 690),
131 | ('her', 651), ('that', 583), ('it', 573), ('his', 559), ('she', 496)]
132 | >>> cfd['romance']['could']
133 | 193
134 | ```
135 |
136 |
137 | ## 2.3 绘制分布图和分布表
138 | 见: **1.3 布朗语料库**
139 |
140 | ## 2.4 使用双连词生成随机文本
141 |
142 | 利用bigrams制作生成模型:
143 | ```py
144 | >>> def generate_model(cfdist, word, num = 15):
145 | ... for i in range(num):
146 | ... print (word,end = " ")
147 | ... word = cfdist[word].max()
148 | ...
149 |
150 | >>> text = nltk.corpus.genesis.words("english-kjv.txt")
151 | >>> bigrams = nltk.bigrams(text)
152 | >>> cfd = nltk.ConditionalFreqDist(bigrams)
153 | >>> cfd
154 |
155 | >>> list(cfd)
156 | [('they', 'embalmed'), ('embalmed', 'him'), ('him', ','), (',', 'and'), ('and', 'he'), ('he', 'was'), ('was', 'put'), ('put', 'in'), ('in', 'a'), ('a', 'coffin'), ('coffin', 'in'), ('in', 'Egypt'), ('Egypt', '.')]
157 |
158 |
159 | >>> cfd["so"]
160 | FreqDist({'that': 8, '.': 7, ',': 4, 'the': 3, 'I': 2, 'doing': 2, 'much': 2, ':': 2, 'did': 1, 'Noah': 1, ...})
161 | >>> cfd["living"]
162 | FreqDist({'creature': 7, 'thing': 4, 'substance': 2, 'soul': 1, '.': 1, ',': 1})
163 |
164 | >>> generate_model(cfd, "so")
165 | so that he said , and the land of the land of the land of
166 | >>> generate_model(cfd, "living")
167 | living creature that he said , and the land of the land of the land
168 | >>>
169 |
170 | ```
171 |
172 | # 4 词汇资源
173 | ## 4.1 词汇列表语料库
174 | **词汇语料库**是Unix 中的/usr/share/dict/words文件,被一些拼写检查程序使用。我们可以用它来寻找文本语料中不寻常的或拼写错误的词汇。
175 | ```py
176 | def unusual_words(text):
177 | text_vocab = set(w.lower() for w in text if w.isalpha())
178 | english_vocab = set(w.lower() for w in nltk.corpus.words.words())
179 | unusual = text_vocab - english_vocab
180 | return sorted(unusual)
181 |
182 | >>> unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
183 | ['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused', 'abuses',
184 | 'accents', 'accepting', 'accommodations', 'accompanied', 'accounted', 'accounts',
185 | 'accustomary', 'aches', 'acknowledging', 'acknowledgment', 'acknowledgments', ...]
186 | >>> unusual_words(nltk.corpus.nps_chat.words())
187 | ['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abortions', 'abou', 'abourted', 'abs', 'ack',
188 | 'acros', 'actualy', 'adams', 'adds', 'adduser', 'adjusts', 'adoted', 'adreniline',
189 | 'ads', 'adults', 'afe', 'affairs', 'affari', 'affects', 'afk', 'agaibn', 'ages', ...]
190 | ```
191 |
192 | **停用词语料库**,就是那些高频词汇,如the,to和also,我们有时在进一步的处理之前想要将它们从文档中过滤。
193 | ```py
194 | >>> from nltk.corpus import stopwords
195 | >>> stopwords.words('english')
196 | ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
197 | 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
198 | 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
199 | 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
200 | 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
201 | 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
202 | 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
203 | 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
204 | 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
205 | 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
206 | 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
207 | 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
208 | ```
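
停用词表的一个典型用法,是过滤之后统计“实词”所占的比例;下面是沿用书中 content_fraction 思路的一个简短示意(假设已 import nltk):
```py
>>> def content_fraction(text):
...     stopwords = nltk.corpus.stopwords.words('english')
...     content = [w for w in text if w.lower() not in stopwords]
...     return len(content) / len(text)
...
>>> content_fraction(nltk.corpus.reuters.words())   # 返回非停用词所占的比例,大约0.7左右
```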
209 |
210 | **一个字母拼词谜题**:在由随机选择的字母组成的网格中,选择里面的字母组成词;这个谜题叫做“目标”。
211 | 
212 | ```py
213 | >>> puzzle_letters = nltk.FreqDist('egivrvonl')
214 | # 注意,这里如果是字符串'egivrvonl'的话,给出的就是每个字母的频数
215 | >>> obligatory = 'r'
216 | >>> wordlist = nltk.corpus.words.words()
217 | >>> [w for w in wordlist if len(w) >= 6
218 | ... and obligatory in w
219 | ... and nltk.FreqDist(w) <= puzzle_letters]
220 | ['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor',
221 | 'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi',
222 | 'revolving', 'ringle', 'roving', 'violer', 'virole']
223 | ```
224 |
225 | **名字语料库**,包括8000个按性别分类的名字。男性和女性的名字存储在单独的文件中。让我们找出同时出现在两个文件中的名字,即性别暧昧的名字:
226 | ```py
227 | >>> names = nltk.corpus.names
228 | >>> names.fileids()
229 | ['female.txt', 'male.txt']
230 | >>> male_names = names.words('male.txt')
231 | >>> female_names = names.words('female.txt')
232 | >>> [w for w in male_names if w in female_names]
233 | ['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis',
234 | 'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel',
235 | 'Angie', 'Ariel', 'Ashley', 'Aubrey', 'Augustine', 'Austin', 'Averil', ...]
236 | ```
237 | ## 4.2 发音的词典
238 | ## 4.3 比较词表
239 | ## 4.4 词汇工具:Shoebox和Toolbox
240 | # 5 WordNet
241 |
242 |
--------------------------------------------------------------------------------
/【Python自然语言处理】读书笔记:第四章:编写结构化程序.md:
--------------------------------------------------------------------------------
1 |
2 | # 4 编写结构化程序
3 |
4 | # 4.1 回到基础
5 |
6 | ## 1、赋值:
7 | 列表赋值是“引用”,改变其中一个,其他都会改变
8 |
9 |
10 |
11 | ```python
12 | foo = ["1", "2"]
13 | bar = foo
14 | foo[1] = "3"
15 | print(bar)
16 | ```
17 |
18 | ['1', '3']
19 |
20 |
21 |
22 | ```python
23 | empty = []
24 | nested = [empty, empty, empty]
25 | print(nested)
26 | nested[1].append("3")
27 | print(nested)
28 | ```
29 |
30 | [[], [], []]
31 | [['3'], ['3'], ['3']]
32 |
33 |
34 |
35 | ```python
36 | nes = [[]] * 3
37 | nes[1].append("3")
38 | print(nes)
39 | nes[1] = ["2"] # 这里最新赋值时,不会传递给其他元素
40 | print(nes)
41 | ```
42 |
43 | [['3'], ['3'], ['3']]
44 | [['3'], ['2'], ['3']]
45 |
46 |
47 |
48 | ```python
49 | new = nested[:]
50 | print(new)
51 | new[2] = ["new"]
52 | print(new)
53 | print(nested)
54 | ```
55 |
56 | [['3'], ['3'], ['3']]
57 | [['3'], ['3'], ['new']]
58 | [['3'], ['3'], ['3']]
59 |
60 |
61 |
62 | ```python
63 | import copy
64 | new2 = copy.deepcopy(nested)
65 | print(new2)
66 | new2[2] = ["new2"]
67 | print(new2)
68 | print(nested)
69 | ```
70 |
71 | [['3'], ['3'], ['3']]
72 | [['3'], ['3'], ['new2']]
73 | [['3'], ['3'], ['3']]
74 |
75 |
76 | ## 2、等式
77 | Python提供两种方法来检查一对项目是否相同。
78 |
79 | is 操作符测试对象的ID。
80 |
81 | == 检测对象是否相等。
82 |
83 |
84 | ```python
85 | snake_nest = [["Python"]] * 5
86 | snake_nest[2] = ['Python']
87 |
88 | print(snake_nest)
89 |
90 | print(snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4])
91 |
92 | print(snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4])
93 |
94 | ```
95 |
96 | [['Python'], ['Python'], ['Python'], ['Python'], ['Python']]
97 | True
98 | False
99 |
100 |
101 | ## 3、elif 和 else
102 | if elif 表示 if 为假而且elif 后边为真,表示如果第一个if执行,则不在执行elif
103 |
104 | if if 表示无论第一个if是否执行,第二个if都会执行
105 |
106 |
107 | ```python
108 | animals = ["cat", "dog"]
109 | if "cat" in animals:
110 | print(1)
111 | if "dog" in animals:
112 | print(2)
113 | ```
114 |
115 | 1
116 | 2
117 |
118 |
119 |
120 | ```python
121 | animals = ["cat", "dog"]
122 | if "cat" in animals:
123 | print(1)
124 | elif "dog" in animals:
125 | print(2)
126 | ```
127 |
128 | 1
129 |
130 |
131 | # 4.2 序列
132 |
133 | 字符串,列表,元组
134 |
135 | ## 1、元组
136 | 由逗号隔开,通常使用括号括起来,可以被索引和切片,并且有长度
137 |
138 |
139 | ```python
140 | t = "walk", "fem", 3
141 | print(t)
142 | print(t[0])
143 | print(t[1:])
144 | print(len(t))
145 | ```
146 |
147 | ('walk', 'fem', 3)
148 | walk
149 | ('fem', 3)
150 | 3
151 |
152 |
153 | ## 2、序列可以直接相互赋值
154 |
155 |
156 | ```python
157 | words = ["I", "turned", "off", "the", "spectroroute"]
158 | words[1], words[4] = words[4], words[1]
159 | print(words)
160 | ```
161 |
162 | ['I', 'spectroroute', 'off', 'the', 'turned']
163 |
164 |
165 | ## 3、处理序列的函数
166 | sorted()函数、reversed()函数、zip()函数、enumerate()函数
167 |
168 |
169 | ```python
170 | print("\n",words)
171 | print(sorted(words))
172 |
173 | print("\n",words)
174 | print(reversed(words))
175 | print(list(reversed(words)))
176 |
177 | print("\n",words)
178 | print(zip(words, range(len(words))))
179 | print(list(zip(words, range(len(words)))))
180 |
181 | print("\n",words)
182 | print(enumerate(words))
183 | print(list(enumerate(words)))
184 | ```
185 |
186 |
187 | ['I', 'spectroroute', 'off', 'the', 'turned']
188 | ['I', 'off', 'spectroroute', 'the', 'turned']
189 |
190 | ['I', 'spectroroute', 'off', 'the', 'turned']
191 |
192 | ['turned', 'the', 'off', 'spectroroute', 'I']
193 |
194 | ['I', 'spectroroute', 'off', 'the', 'turned']
195 |
196 | [('I', 0), ('spectroroute', 1), ('off', 2), ('the', 3), ('turned', 4)]
197 |
198 | ['I', 'spectroroute', 'off', 'the', 'turned']
199 |
200 | [(0, 'I'), (1, 'spectroroute'), (2, 'off'), (3, 'the'), (4, 'turned')]
201 |
202 |
203 | ## 4、合并不同类型的序列
204 |
205 |
206 | ```python
207 | words = "I turned off the spectroroute".split()
208 | print (words)
209 |
210 | wordlens = [(len(word), word) for word in words]
211 | print(wordlens)
212 |
213 | wordlens.sort()
214 | print (wordlens)
215 |
216 | print(" ".join(w for (_, w) in wordlens))
217 | ```
218 |
219 | ['I', 'turned', 'off', 'the', 'spectroroute']
220 | [(1, 'I'), (6, 'turned'), (3, 'off'), (3, 'the'), (12, 'spectroroute')]
221 | [(1, 'I'), (3, 'off'), (3, 'the'), (6, 'turned'), (12, 'spectroroute')]
222 | I off the turned spectroroute
223 |
224 |
225 | ## 5、生成器表达式
226 |
227 | ### 5.1、列表推导
228 | 我们一直在大量使用列表推导,因为用它处理文本结构紧凑和可读性好。下面是一个例子,分词和规范化一个文本:
229 |
230 |
231 |
232 | ```python
233 | from nltk import *
234 | text = '''"When I use a word," Humpty Dumpty said in rather a scornful tone,
235 | ... "it means just what I choose it to mean - neither more nor less."'''
236 | print([w.lower() for w in word_tokenize(text)])
237 | ```
238 |
239 | ['``', 'when', 'i', 'use', 'a', 'word', ',', "''", 'humpty', 'dumpty', 'said', 'in', 'rather', 'a', 'scornful', 'tone', ',', '...', '``', 'it', 'means', 'just', 'what', 'i', 'choose', 'it', 'to', 'mean', '-', 'neither', 'more', 'nor', 'less', '.', "''"]
240 |
241 |
242 | ### 5.2、生成器表达式
243 | 第二行使用了生成器表达式。这不仅仅是标记方便:在许多语言处理的案例中,生成器表达式会更高效。
244 |
245 | 在[1]中,列表对象的存储空间必须在max()的值被计算之前分配。如果文本非常大的,这将会很慢。
246 |
247 | 在[2]中,数据流向调用它的函数。由于调用的函数只是简单的要找最大值——按字典顺序排在最后的词——它可以处理数据流,而无需存储迄今为止的最大值以外的任何值。
248 |
249 |
250 | ```python
251 | print(max([w.lower() for w in word_tokenize(text)]))
252 | print (max(w.lower() for w in word_tokenize(text)))
253 | ```
254 |
255 | word
256 | word
257 |
258 |
259 | # 4.3 风格的问题
260 |
261 | ## 1、过程风格与声明风格
262 | 计算布朗语料库中词的平均长度的程序:
263 |
264 |
265 | ```python
266 | # 过程风格
267 | import nltk
268 | tokens = nltk.corpus.brown.words(categories='news')
269 | count = 0
270 | total = 0
271 | for token in tokens:
272 | count += 1
273 | total += len(token)
274 | total / count
275 | ```
276 |
277 |
278 |
279 |
280 | 4.401545438271973
281 |
282 |
283 |
284 |
285 | ```python
286 | # 声明风格
287 | total = sum(len(w) for w in tokens)
288 | print(total / len(tokens))
289 | ```
290 |
291 | 4.401545438271973
292 |
293 |
294 |
295 | ```python
296 | # 过程风格
297 | text = nltk.corpus.gutenberg.words('milton-paradise.txt')
298 | longest = ''
299 | for word in text:
300 | if len(word) > len(longest):
301 | longest = word
302 | print(longest)
303 | ```
304 |
305 | unextinguishable
306 |
307 |
308 |
309 | ```python
310 | # 声明风格:使用两个列表推导
311 | maxlen = max(len(word) for word in text)
312 | print([word for word in text if len(word) == maxlen])
313 | ```
314 |
315 | ['unextinguishable', 'transubstantiate', 'inextinguishable', 'incomprehensible']
316 |
317 |
318 |
319 | ```python
320 | # enumerate() 枚举频率分布的值
321 | fd = nltk.FreqDist(nltk.corpus.brown.words())
322 | cumulative = 0.0
323 | most_common_words = [word for (word, count) in fd.most_common()]
324 | for rank, word in enumerate(most_common_words):
325 | cumulative += fd.freq(word)
326 | print("%3d %10.2f%% %10s" % (rank + 1, fd.freq(word) * 100, word))
327 | if cumulative > 0.25:
328 | break
329 | ```
330 |
331 | 1 5.40% the
332 | 2 5.02% ,
333 | 3 4.25% .
334 | 4 3.11% of
335 | 5 2.40% and
336 | 6 2.22% to
337 | 7 1.88% a
338 | 8 1.68% in
339 |
340 |
341 | ## 2、计数器的一些合理用途
342 |
343 |
344 | ```python
345 | sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
346 | n = 3
347 | [sent[i:i+n] for i in range(len(sent) - n +1)]
348 | ```
349 |
350 |
351 |
352 |
353 | [['The', 'dog', 'gave'],
354 | ['dog', 'gave', 'John'],
355 | ['gave', 'John', 'the'],
356 | ['John', 'the', 'newspaper']]
357 |
358 |
359 |
360 | ## 3、构建多维数组:使用嵌套列表推导和使用对象复制([ ] * n)
361 |
362 |
363 | ```python
364 | # 使用嵌套列表推导可以修改内容
365 | m, n = 3, 7
366 | array = [[set() for i in range(n)] for j in range(m)]
367 | array[2][5].add("Alice")
368 | import pprint
369 | pprint.pprint(array)
370 | ```
371 |
372 | [[set(), set(), set(), set(), set(), set(), set()],
373 | [set(), set(), set(), set(), set(), set(), set()],
374 | [set(), set(), set(), set(), set(), {'Alice'}, set()]]
375 |
376 |
377 |
378 | ```python
379 | # 使用对象复制,修改一个其他的也会改变
380 | array = [[set()] * n] *m
381 | array[2][5].add(7)
382 | pprint.pprint(array)
383 | ```
384 |
385 | [[{7}, {7}, {7}, {7}, {7}, {7}, {7}],
386 | [{7}, {7}, {7}, {7}, {7}, {7}, {7}],
387 | [{7}, {7}, {7}, {7}, {7}, {7}, {7}]]
388 |
389 |
390 | # 4.4 函数:结构化编程的基础
391 |
392 |
393 | ```python
394 | import re
395 | def get_text(file):
396 | """Read text from a file, normalizing whitespace and stipping HTML markup"""
397 | text = open(file).read()
398 | text = re.sub(r"<.*?>", " ", text)
399 | text = re.sub(r"\s+", " ", text)
400 | return text
401 |
402 | contents = get_text("4.test.html")
403 | print(contents[:300])
404 | ```
405 |
406 | 【Python自然语言处理】读书笔记:第四章:编写结构化程序 /*! * * Twitter Bootstrap * */ /*! * Bootstrap v3.3.7 (http://getbootstrap.com) * Copyright 2011-2016 Twitter, Inc. * Licensed under MIT (https://github.com/twbs/bootstrap/blob/master/LICENSE) */ /*! normalize.css v3.0.3 | MIT License | github.com/necolas/normalize.cs
407 |
408 |
409 |
410 | ```python
411 | help(get_text)
412 | ```
413 |
414 | Help on function get_text in module __main__:
415 |
416 | get_text(file)
417 |     Read text from a file, normalizing whitespace and stripping HTML markup
418 |
419 |
420 |
421 | 1、考虑以下三个排序函数。第三个是危险的,因为程序员可能没有意识到它已经修改了给它的输入。一般情况下,函数应该修改参数的内容(my_sort1())或返回一个值(my_sort2()),而不是两个都做(my_sort3())。
422 |
423 |
424 | ```python
425 | def my_sort1(mylist): # good: modifies its argument, no return value
426 | mylist.sort()
427 | def my_sort2(mylist): # good: doesn't touch its argument, returns value
428 | return sorted(mylist)
429 | def my_sort3(mylist): # bad: modifies its argument and also returns it
430 | mylist.sort()
431 | return mylist
432 |
433 | mylist = [3,2,1]
434 | my_sort1(mylist)
435 | print (mylist,"\n")
436 |
437 | mylist = [3,2,1]
438 | print("my_sort2(mylist):",my_sort2(mylist))
439 | print (mylist,"\n")
440 |
441 |
442 | mylist = [3,2,1]
443 | print("my_sort2(mylist):",my_sort3(mylist))
444 | print (mylist,"\n")
445 |
446 |
447 | ```
448 |
449 | [1, 2, 3]
450 |
451 | my_sort2(mylist): [1, 2, 3]
452 | [3, 2, 1]
453 |
454 | my_sort2(mylist): [1, 2, 3]
455 | [1, 2, 3]
456 |
457 |
458 |
459 | ## 2、参数传递
460 | 早在4.1节中,你就已经看到了赋值操作,而一个结构化对象的值是该对象的引用。函数也是一样的。要理解Python按值传递参数,只要了解它是如何赋值的就足够了。【意思就是在函数中赋值的时候如果是结构化对象,那么赋值仅仅是引用,如果是字符串、数值等非结构化的变量,则在函数中改变,仅仅是局部变量改变】
461 |
462 |
463 | ```python
464 | def set_up(word, properties):
465 | word = 'lolcat'
466 | properties.append('noun')
467 | properties = 5
468 |
469 | w = ''
470 | p = []
471 | set_up(w, p)
472 | print(w)
473 | print(p)
474 |
475 | ```
476 |
477 |
478 | ['noun']
479 |
480 |
481 | 请注意,w没有被函数改变。当我们调用set_up(w, p)时,w(空字符串)的值被分配到一个新的变量word。在函数内部word值被修改。然而,这种变化并没有传播给w。这个参数传递过程与下面的赋值序列是一样的:
482 |
483 |
484 | ```python
485 | w = ''
486 | word = w
487 | word = 'lolcat'
488 | print(w)
489 |
490 | ```
491 |
492 |
493 |
494 |
495 | 让我们来看看列表p上发生了什么。当我们调用set_up(w, p),p的值(一个空列表的引用)被分配到一个新的本地变量properties,所以现在这两个变量引用相同的内存位置。函数修改properties,而这种变化也反映在p值上,正如我们所看到的。函数也分配给properties一个新的值(数字5);这并不能修改该内存位置上的内容,而是创建了一个新的局部变量。这种行为就好像是我们做了下列赋值序列:
496 |
497 |
498 |
499 |
500 | ```python
501 | p = []
502 | properties = p
503 | properties.append('noun')
504 | properties = 5
505 | print(p)
506 |
507 | ```
508 |
509 | ['noun']
510 |
511 |
512 | ## 3、变量的作用域
513 |
514 | 当你在一个函数体内部使用一个现有的名字时,Python解释器先尝试按照函数本地的名字来解释。如果没有发现,解释器检查它是否是一个模块内的全局名称。最后,如果没有成功,解释器会检查是否是Python内置的名字。这就是所谓的名称解析的LGB规则:本地(local),全局(global),然后内置(built-in)。
515 |
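下面用一个小例子示意这个LGB查找顺序(变量名仅为演示):


```python
x = "global"                 # 模块内的全局名称

def scope_demo():
    x = "local"              # 本地名称优先于全局名称
    print(x)                 # 输出 local
    print(len(x))            # len 既不是本地也不是全局名称,最终落到内置(built-in)名字

scope_demo()
print(x)                     # 函数内部的赋值不影响全局的 x,输出 global
```
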
516 | ## 4、参数类型检查
517 | 我们写程序时,Python不会强迫我们声明变量的类型,这允许我们定义参数类型灵活的函数。例如,我们可能希望一个标注只是一个词序列,而不管这个序列被表示为一个列表、元组(或是迭代器,一种新的序列类型,超出了当前的讨论范围)。
518 |
519 | 然而,我们常常想写一些能被他人利用的程序,并希望以一种防守的风格,当函数没有被正确调用时提供有益的警告。下面的tag()函数的作者假设其参数将始终是一个字符串。
520 |
521 |
522 | ```python
523 | def tag(word):
524 | if word in ['a', 'the', 'all']:
525 | return 'det'
526 | else:
527 | return 'noun'
528 |
529 | print(tag('the'))
530 |
531 | print(tag('knight'))
532 |
533 | print(tag(["'Tis", 'but', 'a', 'scratch']))
534 |
535 | ```
536 |
537 | det
538 | noun
539 | noun
540 |
541 |
542 | 该函数对参数'the'和'knight'返回合理的值,但传递给它一个列表[1],看看会发生什么——它没有抱怨,虽然它返回的结果显然是不正确的。此函数的作者可以采取一些额外的步骤来确保tag()函数的参数word是一个字符串。一种直白的做法是使用if not type(word) is str检查参数的类型,如果word不是一个字符串,简单地返回Python特殊的空值None。这是一个略微的改善,因为该函数在检查参数类型,并试图对错误的输入返回一个“特殊的”诊断结果。然而,它也是危险的,因为调用程序可能不会检测到None是故意设定的“特殊”值,这种诊断的返回值可能被传播到程序的其他部分,产生不可预测的后果。更好的解决方案是使用assert语句配合isinstance进行类型检查:原书针对Python 2提到用unicode和str的共同父类basestring,而在Python 3中所有文本都是str,直接检查str即可(如下面的代码所示)。
543 |
544 |
545 | ```python
546 | def tag(word):
547 | assert isinstance(word, str), "argument to tag() must be a string"
548 | if word in ['a', 'the', 'all']:
549 | return 'det'
550 | else:
551 | return 'noun'
552 |
553 | print(tag(["'Tis", 'but', 'a', 'scratch']))
554 |
555 | ```
556 |
557 |
558 | ---------------------------------------------------------------------------
559 |
560 | AssertionError Traceback (most recent call last)
561 |
562 | in
563 | 6 return 'noun'
564 | 7
565 | ----> 8 print(tag(["'Tis", 'but', 'a', 'scratch']))
566 |
567 |
568 | in tag(word)
569 | 1 def tag(word):
570 | ----> 2 assert isinstance(word, str), "argument to tag() must be a string"
571 | 3 if word in ['a', 'the', 'all']:
572 | 4 return 'det'
573 | 5 else:
574 |
575 |
576 | AssertionError: argument to tag() must be a string
577 |
578 |
579 | 如果assert语句失败,它会产生一个不可忽视的错误而停止程序执行。此外,该错误信息是容易理解的。程序中添加断言能帮助你找到逻辑错误,是一种防御性编程。一个更根本的方法是在本节后面描述的使用文档字符串为每个函数记录参数的文档。
580 |
581 | ## 5、功能分解
582 | 当我们使用函数时,主程序可以在一个更高的抽象水平编写,使其结构更透明,例如:
583 |
584 |
585 | ```python
586 | import nltk
587 | tokens = nltk.corpus.brown.words(categories='news')
588 | total = sum(len(w) for w in tokens)
589 | print(total / len(tokens))
590 | ```
591 |
592 | 4.401545438271973
593 |
594 |
595 | 思考例 4-2 中 freq_words 函数。它更新一个作为参数传递进来的频率分布的内容,并输出前 n 个最频繁的词的链表。
596 |
597 | 例 4-2. 设计不佳的函数用来计算高频词。
598 |
599 |
600 | ```python
601 | from nltk import *
602 | from urllib import request
603 | from bs4 import BeautifulSoup
604 |
605 | def freq_words(url, freqdist, n):
606 | html = request.urlopen(url).read().decode('utf8')
607 | raw = BeautifulSoup(html).get_text()
608 | for word in word_tokenize(raw):
609 | freqdist[word.lower()] += 1
610 | result = []
611 | for word, count in freqdist.most_common(n):
612 | result = result + [word]
613 | print(result)
614 | constitution = "http://www.archives.gov/national-archives-experience" \
615 | "/charters/constitution_transcript.html"
616 | fd = nltk.FreqDist()
617 | print([w for (w, _) in fd.most_common(20)])
618 | freq_words(constitution, fd, 20)
619 | print("\n",[w for (w, _) in fd.most_common(30)])
620 | ```
621 |
622 | []
623 | ["''", ',', ':1', 'the', ':', 'of', '{', ';', '}', '(', ')', "'", 'archives', 'and', '.', '[', ']', '``', 'national', 'documents']
624 |
625 | ["''", ',', ':1', 'the', ':', 'of', '{', ';', '}', '(', ')', "'", 'archives', 'and', '.', '[', ']', '``', 'national', 'documents', 'founding', '#', 'to', 'declaration', 'constitution', 'a', 'visit', 'online', 'freedom', 'for']
626 |
627 |
628 | 这个函数有几个问题。该函数有两个副作用:它修改了第二个参数的内容,并输出它已计算结果的一个经过选择的子集。如果我们在函数内部初始化FreqDist()对象(在它被处理的同一个地方),并且去掉选择集而把结果返回给调用程序的话,函数会更容易理解,也更容易在其他地方重用。考虑到它的任务是找出高频词,它应该只返回一个词列表,而不是整个频率分布。在4.4中,我们重构此函数,并通过去掉freqdist参数简化其接口。
629 |
630 |
631 | ```python
632 | from urllib import request
633 | from bs4 import BeautifulSoup
634 |
635 | def freq_words(url, n):
636 | html = request.urlopen(url).read().decode('utf8')
637 | text = BeautifulSoup(html).get_text()
638 | freqdist = nltk.FreqDist(word.lower() for word in word_tokenize(text))
639 | return [word for (word, _) in freqdist.most_common(n)]
640 |
641 | constitution = "http://www.archives.gov/national-archives-experience" \
642 | "/charters/constitution_transcript.html"
643 | print(freq_words(constitution, 20))
644 | ```
645 |
646 | ["''", ',', ':1', 'the', ':', 'of', '{', ';', '}', '(', ')', "'", 'archives', 'and', '.', '[', ']', '``', 'national', 'documents']
647 |
648 |
649 | ## 6、编写函数的文档
650 | 在函数的定义顶部的文档字符串中提供这些描述。这个说明不应该解释函数是如何实现的;实际上,应该能够不改变这个说明,使用不同的方法,重新实现这个函数。
651 |
652 |
653 | ```python
654 | def accuracy(reference, test):
655 | """
656 | Calculate the fraction of test items that equal the corresponding reference items.
657 |
658 | Given a list of reference values and a corresponding list of test values,
659 | return the fraction of corresponding values that are equal.
660 |     In particular, return the fraction of indexes {0<i<=len(test)} such that test[i] == reference[i].
661 |         >>> accuracy(['ADJ', 'N', 'V', 'N'], ['N', 'N', 'V', 'ADJ'])
664 | 0.5
665 |
666 | :param reference: An ordered list of reference values
667 | :type reference: list
668 | :param test: A list of values to compare against the corresponding
669 | reference values
670 | :type test: list
671 | :return: the accuracy score
672 | :rtype: float
673 |     :raises ValueError: If reference and test do not have the same length
674 | """
675 |
676 | if len(reference) != len(test):
677 | raise ValueError("Lists must have the same length.")
678 | num_correct = 0
679 | for x, y in zip(reference, test):
680 | if x == y:
681 | num_correct += 1
682 | return float(num_correct) / len(reference)
683 | ```
684 |
685 | # 4.5 更多关于函数
686 |
687 | ## 1、作为参数的函数
688 |
689 | 到目前为止,我们传递给函数的参数一直都是简单的对象,如字符串或列表等结构化对象。Python也允许我们传递一个函数作为另一个函数的参数。现在,我们可以抽象出操作,对相同数据进行不同操作。正如下面的例子表示的,我们可以传递内置函数len()或用户定义的函数last_letter()作为另一个函数的参数:
690 |
691 |
692 | ```python
693 | sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
694 | 'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']
695 | def extract_property(prop):
696 | return [prop(word) for word in sent]
697 |
698 | print(extract_property(len))
699 |
700 | def last_letter(word):
701 | return word[-1]
702 |
703 | print(extract_property(last_letter))
704 | ```
705 |
706 | [4, 4, 2, 3, 5, 1, 3, 3, 6, 4, 4, 4, 2, 10, 1]
707 | ['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.']
708 |
709 |
710 | 对象len和last_letter可以像列表和字典那样被传递。请注意,只有在我们调用该函数时,才在函数名后使用括号;当我们只是将函数作为一个对象,括号被省略。
711 |
712 | Python提供了更多的方式来定义函数作为其他函数的参数,即所谓的lambda 表达式。试想在很多地方没有必要使用上述的last_letter()函数,因此没有必要给它一个名字。我们可以等价地写以下内容:
713 |
714 |
715 | ```python
716 | extract_property(lambda w: w[-1])
717 | ```
718 |
719 |
720 |
721 |
722 | ['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.']
723 |
724 |
725 |
726 | 我们的下一个例子演示传递一个函数给sorted()函数。当我们只用一个参数(需要排序的链表)调用它时,它使用默认的顺序进行比较(Python 2中的cmp()在Python 3中已不存在)。我们也可以通过key参数提供自己的排序依据,例如下面按每个词的最后一个字母排序。
727 |
728 |
729 | ```python
730 | print(sorted(sent))
731 |
732 | print(sorted(sent, key = lambda x:x[-1]))
733 | ```
734 |
735 | [',', '.', 'Take', 'and', 'care', 'care', 'of', 'of', 'sense', 'sounds', 'take', 'the', 'the', 'themselves', 'will']
736 | [',', '.', 'and', 'Take', 'care', 'the', 'sense', 'the', 'take', 'care', 'of', 'of', 'will', 'sounds', 'themselves']
737 |
738 |
739 | ## 2、累计函数
740 |
741 | 这些函数以初始化一些存储开始,迭代和处理输入的数据,最后返回一些最终的对象(一个大的结构或汇总的结果)。做到这一点的一个标准的方式是初始化一个空链表,累计材料,然后返回这个链表,如4.6中所示函数search1()。
742 |
743 |
744 | ```python
745 | import nltk
746 |
747 | def search1(substring, words):
748 | result = []
749 | for word in words:
750 | if substring in word:
751 | result.append(word)
752 | return result
753 |
754 | def search2(substring, words):
755 | for word in words:
756 | if substring in word:
757 | yield word
758 |
759 | for item in search1('fizzled', nltk.corpus.brown.words()):
760 | print (item)
761 |
762 | for item in search2('Grizzlies', nltk.corpus.brown.words()):
763 | print (item)
764 | ```
765 |
766 | fizzled
767 | Grizzlies'
768 |
769 |
770 | 函数search2()是一个生成器。第一次调用此函数,它运行到yield语句然后停下来。调用程序获得第一个词,完成任何必要的处理。一旦调用程序对另一个词做好准备,函数会从停下来的地方继续执行,直到再次遇到yield语句。这种方法通常更有效,因为函数只产生调用程序需要的数据,并不需要分配额外的内存来存储输出(参见前面关于生成器表达式的讨论)。
771 |
772 | 下面是一个更复杂的生成器的例子,产生一个词列表的所有排列。为了强制permutations()函数产生所有它的输出,我们将它包装在list()调用中[1]。
773 |
774 |
775 | ```python
776 | def permutations(seq):
777 | if len(seq) <= 1:
778 | yield seq
779 | else:
780 | for perm in permutations(seq[1:]):
781 | for i in range(len(perm)+1):
782 | yield perm[:i] + seq[0:1] + perm[i:]
783 |
784 | list(permutations(['police', 'fish', 'buffalo']))
785 | ```
786 |
787 |
788 |
789 |
790 | [['police', 'fish', 'buffalo'],
791 | ['fish', 'police', 'buffalo'],
792 | ['fish', 'buffalo', 'police'],
793 | ['police', 'buffalo', 'fish'],
794 | ['buffalo', 'police', 'fish'],
795 | ['buffalo', 'fish', 'police']]
796 |
797 |
798 |
799 | permutations函数使用了一种叫做递归的技术,将在下面4.7讨论。产生一组词的所有排列,对于创建用来测试语法的数据十分有用(见第8章)。
800 |
801 | ## 3、高阶函数
802 |
803 | ### filter()
804 | 我们使用函数作为filter()的第一个参数,它对作为它的第二个参数的序列中的每个项目运用该函数,只保留该函数返回True的项目。
805 |
806 |
807 | ```python
808 | def is_content_word(word):
809 | return word.lower() not in ['a', 'of', 'the', 'and', 'will', ',', '.']
810 | sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the',
811 | 'sounds', 'will', 'take', 'care', 'of', 'themselves', '.']
812 | list(filter(is_content_word, sent))
813 | ```
814 |
815 |
816 |
817 |
818 | ['Take', 'care', 'sense', 'sounds', 'take', 'care', 'themselves']
819 |
820 |
821 |
822 | ### map()
823 | 将一个函数运用到一个序列中的每一项。
824 |
825 |
826 | ```python
827 | lengths = list(map(len, nltk.corpus.brown.sents(categories='news')))
828 | print(sum(lengths) / len(lengths))
829 |
830 |
831 | lengths = [len(sent) for sent in nltk.corpus.brown.sents(categories='news')]
832 | print(sum(lengths) / len(lengths))
833 | ```
834 |
835 | 21.75081116158339
836 | 21.75081116158339
837 |
838 |
839 | ### lambda 表达式
840 | 这里是两个等效的例子,计数每个词中的元音的数量。
841 |
842 |
843 | ```python
844 | list(map(lambda w: len(list(filter(lambda c: c.lower() in "aeiou", w))), sent))
845 | ```
846 |
847 |
848 |
849 |
850 | [2, 2, 1, 1, 2, 0, 1, 1, 2, 1, 2, 2, 1, 3, 0]
851 |
852 |
853 |
854 |
855 | ```python
856 | [len([c for c in w if c.lower() in "aeiou"]) for w in sent]
857 | ```
858 |
859 |
860 |
861 |
862 | [2, 2, 1, 1, 2, 0, 1, 1, 2, 1, 2, 2, 1, 3, 0]
863 |
864 |
865 |
866 | 以列表推导为基础的解决方案通常比基于高阶函数的解决方案可读性更好,本书整体上更青睐使用前者。
867 |
868 | ## 4、命名的参数
869 | 当有很多参数时,很容易混淆正确的顺序。我们可以通过名字引用参数,甚至可以给它们分配默认值以供调用程序没有提供该参数时使用。现在参数可以按任意顺序指定,也可以省略。
870 |
871 |
872 | ```python
873 | def repeat(msg='', num=1):
874 | return msg * num
875 | print(repeat(num=3))
876 |
877 | print(repeat(msg='Alice'))
878 |
879 | print(repeat(num=5, msg='Alice'))
880 |
881 | ```
882 |
883 |
884 | Alice
885 | AliceAliceAliceAliceAlice
886 |
887 |
888 | ### *args作为函数参数
889 |
890 |
891 | ```python
892 | song = [['four', 'calling', 'birds'],
893 | ['three', 'French', 'hens'],
894 | ['two', 'turtle', 'doves']]
895 | print(song[0])
896 | print(list(zip(song[0], song[1], song[2])))
897 |
898 | print(list(zip(*song)))
899 | ```
900 |
901 | ['four', 'calling', 'birds']
902 | [('four', 'three', 'two'), ('calling', 'French', 'turtle'), ('birds', 'hens', 'doves')]
903 | [('four', 'three', 'two'), ('calling', 'French', 'turtle'), ('birds', 'hens', 'doves')]
904 |
905 |
906 | 应该从这个例子中明白输入*song仅仅是一个方便的记号,相当于输入了song[0], song[1], song[2]。
907 |
908 | ### 命名参数的另一个作用是它们允许选择性使用参数。
909 |
910 |
911 | ```python
912 | from nltk import *
913 | def freq_words(file, min=1, num=10):
914 | text = open(file).read()
915 | tokens = word_tokenize(text)
916 | freqdist = nltk.FreqDist(t for t in tokens if len(t) >= min)
917 | return freqdist.most_common(num)
918 | fw = freq_words('4.test.html', 4, 10)
919 | fw = freq_words('4.test.html', min=4, num=10)
920 | fw = freq_words('4.test.html', num=10, min=4)
921 | ```
922 |
923 | ### 可选参数的另一个常见用途是作为标志使用。
924 | 这里是同一个的函数的修订版本,如果设置了verbose标志将会报告其进展情况:
925 |
926 |
927 | ```python
928 | def freq_words(file, min=1, num=10, verbose=False):
929 | freqdist = FreqDist()
930 | if verbose: print("Opening", file)
931 | with open(file) as f:
932 | text = f.read()
933 | if verbose: print("Read in %d characters" % len(file))
934 | for word in word_tokenize(text):
935 | if len(word) >= min:
936 | freqdist[word] += 1
937 | if verbose and freqdist.N() % 100 == 0: print(".", sep="",end = " ")
938 | if verbose: print
939 | return freqdist.most_common(num)
940 | fw = freq_words("4.test.html", 4 ,10, True)
941 | ```
942 |
943 | Opening 4.test.html
944 | Read in 11 characters
945 | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
946 |
947 | ### 注意
948 | 注意不要使用可变对象作为参数的默认值。这个函数的一系列调用将使用同一个对象,有时会出现离奇的结果,就像我们稍后会在关于调试的讨论中看到的那样。
949 |
950 | ### 注意
951 | 如果你的程序将使用大量的文件,它是一个好主意来关闭任何一旦不再需要的已经打开的文件。如果你使用with语句,Python会自动关闭打开的文件︰
952 |
953 |
954 | ```python
955 | with open("4.test.html") as f:
956 | data = f.read()
957 | ```
958 |
959 | # 4.6 程序开发
960 |
961 | 编程是一种技能,需要获得几年的各种编程语言和任务的经验。
962 |
963 | **关键的高层次能力**:是算法设计及其在结构化编程中的实现。
964 |
965 | **关键的低层次的能力**包括熟悉语言的语法结构,以及排除故障的程序(不能表现预期的行为的程序)的各种诊断方法的知识。
966 |
967 | ## 1、Python模块的结构
968 |
969 | 与其他NLTK的模块一样,distance.py以一组注释行开始,包括一行模块标题和作者信息。
970 |
971 | (由于代码会被发布,也包括代码可用的URL、版权声明和许可信息。)
972 |
973 | 接下来是模块级的文档字符串,三重引号的多行字符串,其中包括当有人输入help(nltk.metrics.distance)将被输出的关于模块的信息。
974 |
975 |
976 | ```python
977 | # Natural Language Toolkit: Distance Metrics
978 | #
979 | # Copyright (C) 2001-2013 NLTK Project
980 | # Author: Edward Loper
981 | # Steven Bird
982 | # Tom Lippincott
983 | # URL: <http://nltk.org/>
984 | # For license information, see LICENSE.TXT
985 | #
986 |
987 | """
988 | Distance Metrics.
989 |
990 | Compute the distance between two items (usually strings).
991 | As metrics, they must satisfy the following three requirements:
992 |
993 | 1. d(a, a) = 0
994 | 2. d(a, b) >= 0
995 | 3. d(a, c) <= d(a, b) + d(b, c)
996 | """
997 | ```
998 |
999 |
1000 |
1001 |
1002 | '\nDistance Metrics.\n\nCompute the distance between two items (usually strings).\nAs metrics, they must satisfy the following three requirements:\n\n1. d(a, a) = 0\n2. d(a, b) >= 0\n3. d(a, c) <= d(a, b) + d(b, c)\n'
1003 |
1004 |
1005 |
1006 | ## 2、多模块程序
1007 | 一些程序汇集多种任务,例如从语料库加载数据、对数据进行一些分析、然后将其可视化。我们可能已经有了稳定的模块来加载数据和实现数据可视化。我们的工作可能会涉及到那些分析任务的编码,只是从现有的模块调用一些函数。4.7描述了这种情景。
1008 | 
1009 |
1010 | ## 3、错误源头
1011 | 首先,输入的数据可能包含一些意想不到的字符。
1012 |
1013 | 第二,提供的函数可能不会像预期的那样运作。
1014 |
1015 | 第三,我们对Python语义的理解可能出错。
1016 |
1017 |
1018 | ```python
1019 | print("%s.%s.%02d" % "ph.d.", "n", 1)
1020 | ```
1021 |
1022 |
1023 | ---------------------------------------------------------------------------
1024 |
1025 | TypeError Traceback (most recent call last)
1026 |
1027 | in
1028 | ----> 1 print("%s.%s.%02d" % "ph.d.", "n", 1)
1029 |
1030 |
1031 | TypeError: not enough arguments for format string
1032 |
1033 |
1034 |
1035 | ```python
1036 | print("%s.%s.%02d" % ("ph.d.", "n", 1))
1037 | ```
1038 |
1039 | ph.d..n.01
1040 |
1041 |
1042 | ### 在函数中命名参数不能设置列表等对象
1043 | 程序的行为并不如预期,因为我们错误地认为在函数被调用时会创建默认值。然而,它只创建了一次,在Python解释器加载这个函数时。这一个列表对象会被使用,只要没有给函数提供明确的值。
1044 |
1045 |
1046 | ```python
1047 | def find_words(text, wordlength, result=[]):
1048 | for word in text:
1049 | if len(word) == wordlength:
1050 | result.append(word)
1051 | return result
1052 |
1053 | print(find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 3) )
1054 |
1055 | print(find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 2, ['ur']) )
1056 |
1057 | print(find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 3) ) # 明显错误
1058 |
1059 | ```
1060 |
1061 | ['omg', 'teh', 'teh', 'mat']
1062 | ['ur', 'on']
1063 | ['omg', 'teh', 'teh', 'mat', 'omg', 'teh', 'teh', 'mat']
1064 |
1065 |
1066 | ## 4、调试技术
1067 | 1.由于大多数代码错误是因为程序员的不正确的假设,你检测bug要做的第一件事是检查你的假设。通过给程序添加print语句定位问题,显示重要的变量的值,并显示程序的进展程度。
1068 |
1069 | 2.解释器会输出一个堆栈跟踪,精确定位错误发生时程序执行的位置。
1070 |
1071 | 3.Python提供了一个调试器,它允许你监视程序的执行,指定程序暂停运行的行号(即断点),逐步调试代码段和检查变量的值。你可以如下方式在你的代码中调用调试器:
1072 |
1073 |
1074 | ```python
1075 | import pdb
1076 | print(find_words(['cat'], 3)) # [_first-run]
1077 |
1078 | pdb.run("find_words(['dog'], 3)") # [_second-run]
1079 | ```
1080 |
1081 | ['omg', 'teh', 'teh', 'mat', 'omg', 'teh', 'teh', 'mat', 'cat']
1082 | > (1)()
1083 | (Pdb) step
1084 | --Call--
1085 | > (1)find_words()
1086 | -> def find_words(text, wordlength, result=[]):
1087 | (Pdb) step
1088 | > (2)find_words()
1089 | -> for word in text:
1090 | (Pdb) args
1091 | text = ['dog']
1092 | wordlength = 3
1093 | result = ['omg', 'teh', 'teh', 'mat', 'omg', 'teh', 'teh', 'mat', 'cat']
1094 | (Pdb) next
1095 | > (3)find_words()
1096 | -> if len(word) == wordlength:
1097 | (Pdb) b
1098 | (Pdb) c
1099 |
1100 |
1101 | 它会给出一个提示(Pdb),你可以在那里输入指令给调试器。输入help来查看命令的完整列表。输入step(或只输入s)将执行当前行然后停止。如果当前行调用一个函数,它将进入这个函数并停止在第一行。输入next(或只输入n)是类似的,但它会在当前函数中的下一行停止执行。break(或b)命令可用于创建或列出断点。输入continue(或c)会继续执行直到遇到下一个断点。输入任何变量的名称可以检查它的值。
1102 |
1103 | ## 5、防御性编程
1104 |
1105 | 1.考虑在你的代码中添加assert语句,指定变量的属性,例如assert(isinstance(text, list))。如果text的值在你的代码被用在一些较大的环境中时变为了一个字符串,将产生一个AssertionError,于是你会立即得到问题的通知。
1106 |
1107 | 2.一旦你觉得你发现了错误,作为一个假设查看你的解决方案。在重新运行该程序之前尝试预测你修正错误的影响。如果bug不能被修正,不要陷入盲目修改代码希望它会奇迹般地重新开始运作的陷阱。相反,每一次修改都要尝试阐明错误是什么和为什么这样修改会解决这个问题的假设。如果这个问题没有解决就撤消这次修改。
1108 |
1109 | 3.当你开发你的程序,扩展其功能并修复bug时,维护一套测试用例是有益的。这被称为回归测试,因为它用来检测代码的“回归”——即修改代码后带来的意想不到的副作用,使以前能运作的程序不再运作。Python以doctest模块的形式提供了一个简单的回归测试框架。这个模块在代码或文档文件中搜索类似交互式Python会话的文本块(这种形式你已经在本书中看到很多次),执行找到的Python命令,并测试其输出是否与原始文件中所提供的输出匹配。每当有不匹配时,它会报告预期值和实际值。有关详情,请查阅 http://docs.python.org/library/doctest.html 上的doctest文档。除了回归测试的价值之外,doctest模块还有助于确保你的软件文档与代码保持同步(下面给出一个最小的doctest示例)。
1110 |
1111 | 4.也许最重要的防御性编程策略是要清楚的表述你的代码,选择有意义的变量和函数名,并通过将代码分解成拥有良好文档的接口的函数和模块尽可能的简化代码。
1112 |
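作为示意,下面给出一个最小的doctest用法草图(函数与数据只是演示用的假设):


```python
def average_word_length(words):
    """计算词列表的平均词长。

    >>> average_word_length(['the', 'cat', 'sat'])
    3.0
    """
    return sum(len(w) for w in words) / len(words)

if __name__ == "__main__":
    import doctest
    doctest.testmod()   # 执行文档字符串中的交互式示例,并核对输出是否一致
```
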
1113 | # 4.7 算法设计
1114 | 解决算法问题的一个重要部分是为手头的问题选择或改造一个合适的算法。有时会有几种选择,能否选择最好的一个取决于对每个选择随数据增长如何执行的知识。
1115 |
1116 | 算法设计的另一种方法,我们解决问题通过将它转化为一个我们已经知道如何解决的问题的一个实例。例如,为了检测列表中的重复项,我们可以预排序这个列表,然后通过一次扫描检查是否有相邻的两个元素是相同的。
1117 |
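按这个思路,检测重复项的一个简短示意如下:


```python
def has_duplicates(items):
    # 先排序,再一次扫描,检查相邻的两个元素是否相同
    items = sorted(items)
    for prev, cur in zip(items, items[1:]):
        if prev == cur:
            return True
    return False

print(has_duplicates(['cat', 'dog', 'cat']))   # True
print(has_duplicates(['cat', 'dog', 'fish']))  # False
```
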
1118 | ## 1、迭代与递归
1119 | 例如,假设我们有n个词,要计算出它们结合在一起有多少不同的方式能组成一个词序列。
1120 |
1121 | ### 迭代
1122 | 如果我们只有一个词(n=1),只是一种方式组成一个序列。如果我们有2个词,就有2种方式将它们组成一个序列。3个词有6种可能性。一般的,n个词有n × n-1 × … × 2 × 1种方式(即n的阶乘)。我们可以将这些编写成如下代码:
1123 |
1124 |
1125 | ```python
1126 | def factorial1(n):
1127 | result = 1
1128 | for i in range(n):
1129 | result *= (i+1)
1130 | return result
1131 | ```
1132 |
1133 | ### 递归
1134 | 假设我们有办法为n-1 不同的词构建所有的排列。然后对于每个这样的排列,有n个地方我们可以插入一个新词:开始、结束或任意两个词之间的n-2个空隙。因此,我们简单的将n-1个词的解决方案数乘以n的值。我们还需要基础案例,也就是说,如果我们有一个词,只有一个顺序。我们可以将这些编写成如下代码:
1135 |
1136 |
1137 |
1138 |
1139 | ```python
1140 | def factorial2(n):
1141 | if n == 1:
1142 | return 1
1143 | else:
1144 | return n * factorial2(n-1)
1145 | ```
1146 |
1147 | 尽管递归编程结构简单,但它是有代价的。每次函数调用时,一些状态信息需要推入堆栈,这样一旦函数执行完成可以从离开的地方继续执行。出于这个原因,迭代的解决方案往往比递归解决方案的更高效。
1148 |
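可以用timeit粗略比较两个版本的差别(假设上面定义的factorial1和factorial2已经可用;具体数值依机器而定,下面只是示意):


```python
from timeit import timeit

print(timeit("factorial1(20)", globals=globals(), number=100000))  # 迭代版本
print(timeit("factorial2(20)", globals=globals(), number=100000))  # 递归版本
```
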
1149 | ## 2、权衡空间与时间
1150 | 我们有时可以显著的加快程序的执行,通过建设一个辅助的数据结构,例如索引。4.10实现一个简单的电影评论语料库的全文检索系统。通过索引文档集合,它提供更快的查找。
1151 |
1152 |
1153 | ```python
1154 | def raw(file):
1155 | with open(file) as f:
1156 | contents = f.read()
1157 | contents = re.sub(r"<.*?>", " ", contents)
1158 | contents = re.sub("\s+", " ", contents)
1159 | return contents
1160 |
1161 | def snippet(doc, term):
1162 | text = " "*30 + raw(doc) + " "*30
1163 | pos = text.index(term)
1164 | return text[pos - 30 : pos + 30]
1165 |
1166 |
1167 | print("Building Index...")
1168 | files = nltk.corpus.movie_reviews.abspaths()
1169 | idx = nltk.Index((w, f) for f in files for w in raw(f).split())
1170 |
1171 |
1172 | query = " "
1173 | while query != "quit":
1174 | query = input("query> ")
1175 | if query in idx:
1176 | for doc in idx[query]:
1177 | print(snippet(doc, query))
1178 | else:
1179 | print("Not found")
1180 |
1181 |
1182 |
1183 |
1184 |
1185 |
1186 | ```
1187 |
1188 | Building Index...
1189 | query> so what does joe do
1190 | Not found
1191 | query> quit
1192 | s funded by her mother . lucy quit working professionally 10
1193 | erick . i disliked that movie quite a bit , but since " prac
1194 | t disaster . babe ruth didn't quit baseball after one season
1195 | o-be fiance . i think she can quit that job and get a more r
1196 | and rose mcgowan should just quit acting . she has no chari
1197 | and get a day job . and don't quit it .
1198 | kubrick , alas , should have quit while he was ahead . this
1199 | everyone involved should have quit while they were still ahe
1200 | l die . so what does joe do ? quit his job , of course ! ! w
1201 | red " implant . he's ready to quit the biz and get a portion
1202 | hat he always recorded , they quit and become disillusioned
1203 | admit that i ? ? ? ve become quite the " scream " fan . no
1204 | again , the fact that he has quit his job to feel what it's
1205 | school reunion . he has since quit his job as a travel journ
1206 | ells one of his friends , " i quit school because i didn't l
1207 | ms , cursing off the boss and quitting his job ( " today i q
1208 | e , the arrival of the now ubiquitous videocassette . burt r
1209 | in capitol city , that he has quit his job and hopes to open
1210 | before his death at age 67 to quit filmmaking once a homosex
1211 | - joss's explanation that he quit the priesthood because of
1212 | is a former prosecutor , and quit because of tensions betwe
1213 |
1214 |
1215 | 一个更微妙的空间与时间折中的例子涉及使用整数标识符替换一个语料库的词符。我们为语料库创建一个词汇表,每个词都被存储一次的列表,然后转化这个列表以便我们能通过查找任意词来找到它的标识符。每个文档都进行预处理,使一个词列表变成一个整数列表。现在所有的语言模型都可以使用整数。见4.11中的内容,如何为一个已标注的语料库做这个的例子的列表。
1216 |
1217 |
1218 | ```python
1219 | def preprocess(tagged_corpus):
1220 | words = set()
1221 | tags = set()
1222 | for sent in tagged_corpus:
1223 | for word, tag in sent:
1224 | words.add(word)
1225 | tags.add(tag)
1226 | wm = dict((w, i) for (i, w) in enumerate(words))
1227 | tm = dict((t, i) for (i, t) in enumerate(tags))
1228 | return [[(wm[w], tm[t]) for (w, t) in sent] for sent in tagged_corpus]
1229 | ```
1230 |
1231 | ### 集合中的元素会自动索引,所以测试一个大的集合的成员将远远快于测试相应的列表的成员。
1232 | 空间时间权衡的另一个例子是维护一个词汇表。如果你需要处理一段输入文本检查所有的词是否在现有的词汇表中,词汇表应存储为一个集合,而不是一个列表。集合中的元素会自动索引,所以测试一个大的集合的成员将远远快于测试相应的列表的成员。
1233 |
1234 | We can test this claim using the timeit module. The Timer class takes two parameters: a statement that is executed many times, and setup code that is executed once at the start. We will simulate a vocabulary of 100,000 items using a list of integers [1] and a set of integers [2]. The test statement generates a random item that has a 50% chance of being in the vocabulary [3].
1235 |
1236 |
1237 | ```python
1238 | from timeit import Timer
1239 | vocab_size = 100000
1240 | setup_list = "import random; vocab = list(range(%d))" % vocab_size #[1] a real list (bare range() is not a list in Python 3)
1241 | setup_set = "import random; vocab = set(range(%d))" % vocab_size #[2]
1242 | statement = "random.randint(0, %d) in vocab" % (vocab_size * 2) #[3]
1243 | print(Timer(statement, setup_list).timeit(1000))
1244 |
1245 | print(Timer(statement, setup_set).timeit(1000))
1246 |
1247 | ```
1248 |
1249 | 0.024326185999598238
1250 | 0.005347012998754508
1251 |
1252 |
1253 | In the book's original timing, performing 1000 list membership tests took a total of 2.8 seconds, whereas the equivalent tests on a set took a mere 0.0037 seconds, that is, about three orders of magnitude faster! (The run recorded above shows a much smaller gap because it timed a bare range object rather than a true list.)
1254 |
1255 | ## 3. Dynamic Programming
1256 | Dynamic programming is a general method of algorithm design that is widely used in natural language processing. The term "programming" is used in a different sense from what you might expect: it means planning or scheduling. Dynamic programming is used when a problem contains overlapping subproblems. Instead of computing these subproblems repeatedly, we simply store their results in a lookup table. In the remainder of this section we introduce dynamic programming, but in a rather different context from syntactic parsing, which is introduced later.
1262 |
1263 | Pingala was an Indian author who lived around the 5th century B.C. and wrote a treatise on Sanskrit prosody called the Chandas Shastra. Virahanka extended this work around the 6th century A.D., studying the number of ways of combining short and long syllables to create a meter of length n. Short syllables, marked S, take up one unit of length, while long syllables, marked L, take two. Pingala found, for example, that there are five ways to construct a meter of length 4: V4 = {LL, SSL, SLS, LSS, SSSS}. Observe that we can split V4 into two subsets, those beginning with L and those beginning with S, as shown in (1).
1269 |
1270 |
1271 |
1272 | ```
1273 | (1) V4 = {LL, LSS}             i.e. L prefixed to each item of V2 = {L, SS}
1274 |        ∪ {SSL, SLS, SSSS}      i.e. S prefixed to each item of V3 = {SL, LS, SSS}
1275 |     V1 = {S}
1276 |     V0 = {""}
1277 | ```
1281 |
1282 | With this observation, we can write a small recursive function called virahanka1() to compute these meters, shown in Example 4-9. Notice that in order to compute V4 we first compute V3 and V2. But to compute V3, we also need to compute V2 and V1. This call structure is depicted in (2).
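The tree referred to as (2) is not reproduced in these notes; roughly, the call structure for V4 looks like this, with each node recursing into the two smaller subproblems:

```
V4
├── V3
│   ├── V2
│   │   ├── V1
│   │   └── V0
│   └── V1
└── V2
    ├── V1
    └── V0
```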
1285 |
1286 | Example 4-9. Four ways to compute Sanskrit meter:
1287 | (i) plain recursion;
1288 | (ii) bottom-up dynamic programming;
1289 | (iii) top-down dynamic programming;
1290 | (iv) built-in memoization.
1291 |
1292 |
1293 | ```python
1294 | def virahanka1(n):                             # (i) plain recursion
1295 | if n == 0:
1296 | return [""]
1297 | elif n == 1:
1298 | return ["S"]
1299 | else:
1300 | s = ["S" + prosody for prosody in virahanka1(n-1)]
1301 | l = ["L" + prosody for prosody in virahanka1(n-2)]
1302 | return s + l
1303 |
1304 | def virahanka2(n):                             # (ii) bottom-up dynamic programming
1305 | lookup = [[""], ["S"]]
1306 | for i in range(n-1):
1307 | s = ["S" + prosody for prosody in lookup[i+1]]
1308 | l = ["L" + prosody for prosody in lookup[i]]
1309 | lookup.append(s + l)
1310 | return lookup[n]
1311 |
1312 |
1313 | def virahanka3(n, lookup={0:[""], 1:["S"]}):   # (iii) top-down dynamic programming
1314 | if n not in lookup:
1315 | s = ["S" + prosody for prosody in virahanka3(n - 1)]
1316 | l = ["L" + prosody for prosody in virahanka3(n - 2)]
1317 | lookup[n] = s + l
1318 | return lookup[n]
1319 |
1320 | from nltk import memoize
1321 | @memoize
1322 | def virahanka4(n):                             # (iv) built-in memoization via @memoize
1323 | if n == 0:
1324 | return [""]
1325 | elif n == 1:
1326 | return ["S"]
1327 | else:
1328 | s = ["S" + prosody for prosody in virahanka4(n - 1)]
1329 | l = ["L" + prosody for prosody in virahanka4(n - 2)]
1330 | return s + l
1331 |
1332 | print(virahanka1(4))
1333 |
1334 | print(virahanka2(4))
1335 |
1336 | print(virahanka3(4))
1337 |
1338 | print(virahanka4(4))
1339 |
1340 | ```
1341 |
1342 | ['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
1343 | ['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
1344 | ['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
1345 | ['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
1346 |
1347 |
1348 | ### Method 1: Plain recursion
1349 | As you can see, V2 is computed twice. This might not seem like a significant problem, but it turns out to be wasteful as n gets large: to compute V20 using this recursive technique we would compute V2 4,181 times, and for V40 we would compute V2 63,245,986 times!
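A quick way to see this blow-up for yourself is to instrument the recursive version with a call counter (an illustrative sketch, not part of the book's example); the counts grow like the Fibonacci numbers:

```python
from collections import Counter

calls = Counter()

def virahanka1_counted(n):
    """Same recursion as virahanka1(), but record how often each subproblem is solved."""
    calls[n] += 1
    if n == 0:
        return [""]
    elif n == 1:
        return ["S"]
    return (["S" + p for p in virahanka1_counted(n - 1)] +
            ["L" + p for p in virahanka1_counted(n - 2)])

virahanka1_counted(20)
print(calls[2])   # how many times V2 was recomputed on the way to V20
```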
1352 | ### Method 2: Bottom-up dynamic programming
1353 | The function virahanka2() implements a dynamic programming approach to the problem. It works by filling up a table (called lookup) with solutions to all smaller instances of the problem, stopping as soon as we reach the value we are interested in. At that point we read off the value and return it. Crucially, each subproblem is only solved once.
1357 | ### Method 3: Top-down dynamic programming
1358 | Notice that the approach taken by virahanka2() is to solve smaller problems on the way to solving larger problems; accordingly this is known as the bottom-up approach to dynamic programming. Unfortunately it can be quite wasteful for some applications, since it may compute subproblems that are never needed for solving the main problem of interest. This wasted computation is avoided by the top-down approach to dynamic programming, illustrated by the function virahanka3() in Example 4-9. Unlike the bottom-up approach, this approach is recursive. It avoids the enormous waste of virahanka1() by checking whether it has previously stored the result; if not, it computes the result recursively and stores it in the table. The last step is to return the stored result.
1364 | ### Method 4: Built-in memoization
1365 | The final approach, in virahanka4(), uses a Python "decorator" called memoize, which takes care of the housekeeping that virahanka3() does, without cluttering up the program. This "memoization" process stores the result of each call to the function along with the parameters that were used. If the function is subsequently called with the same parameters, it returns the stored result instead of recalculating it. (This aspect of Python syntax is beyond the scope of the book.)
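The Python standard library provides the same idea through functools.lru_cache; a minimal sketch of how it could be applied here (an alternative to nltk's memoize, not what the notes use):

```python
from functools import lru_cache

@lru_cache(maxsize=None)          # cache every result; repeated calls with the same n are free
def virahanka5(n):
    if n == 0:
        return [""]
    elif n == 1:
        return ["S"]
    return (["S" + p for p in virahanka5(n - 1)] +
            ["L" + p for p in virahanka5(n - 2)])

print(virahanka5(4))              # ['SSSS', 'SSL', 'SLS', 'LSS', 'LL']
```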
1370 |
1371 | # 4.8 A Sample of Python Libraries
1372 |
1373 | Example 4-10. Frequency of modal verbs in different sections of the Brown Corpus.
1374 |
1375 |
1376 | ```python
1377 | import nltk
1378 | from numpy import arange
1379 | from matplotlib import pyplot
1380 | colors = 'rgbcmyk' # red, green, blue, cyan, magenta, yellow, black
1381 |
1382 | def bar_chart(categories, words, counts):
1383 | "Plot a bar chart showing counts for each word by category"
1384 | ind = arange(len(words))
1385 | width = 1 / (len(categories) + 1)
1386 | bar_groups = []
1387 | for c in range(len(categories)):
1388 | bars = pyplot.bar(ind+c*width, counts[categories[c]], width,
1389 | color=colors[c % len(colors)])
1390 | bar_groups.append(bars)
1391 | pyplot.xticks(ind+width, words)
1392 | pyplot.legend([b[0] for b in bar_groups], categories, loc='upper left')
1393 | pyplot.ylabel('Frequency')
1394 | pyplot.title('Frequency of Six Modal Verbs by Genre')
1395 | pyplot.show()
1396 |
1397 | genres = ['news', 'religion', 'hobbies', 'government', 'adventure']
1398 | modals = ['can', 'could', 'may', 'might', 'must', 'will']
1399 | cfdist = nltk.ConditionalFreqDist(
1400 | (genre, word)
1401 | for genre in genres
1402 | for word in nltk.corpus.brown.words(categories=genre)
1403 | if word in modals)
1404 |
1405 | counts = {}
1406 | for genre in genres:
1407 | counts[genre] = [cfdist[genre][word] for word in modals]
1408 | print(counts)                       # the raw counts dict shown in the output below
1409 | bar_chart(genres, modals, counts)
1410 | ```

1411 | {'news': [93, 86, 66, 38, 50, 389], 'religion': [82, 59, 78, 12, 54, 71], 'hobbies': [268, 58, 131, 22, 83, 264], 'government': [117, 38, 153, 13, 102, 244], 'adventure': [46, 151, 5, 58, 27, 50]}
1412 |
1413 |
1414 |
1415 | 
1416 |
1417 |
1418 | ## 1. NetworkX
1419 | The NetworkX package defines and manipulates structures consisting of nodes and edges, known as graphs. (The original notes omit the example.)
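As a small illustration of what NetworkX does (a minimal sketch, not taken from the book or these notes), we can build a tiny graph of words linked by a relation and query it:

```python
import networkx as nx

g = nx.Graph()
g.add_edge("dog", "canine")          # nodes are created on demand
g.add_edge("canine", "carnivore")
g.add_edge("cat", "feline")
g.add_edge("feline", "carnivore")

print(g.number_of_nodes(), g.number_of_edges())   # 5 4
print(nx.shortest_path(g, "dog", "cat"))          # ['dog', 'canine', 'carnivore', 'feline', 'cat']
```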
1420 |
1421 | ## 2. csv
1422 | Work in language analysis often involves tabular data, such as information about lexical items, the participants in an empirical study, or linguistic features extracted from a corpus. Here is a fragment of a simple lexicon in CSV format:
1423 | sleep, sli:p, v.i, a condition of body and mind ...
1424 | walk, wo:k, v.intr, progress by lifting and setting down each foot ...
1425 | wake, weik, intrans, cease to sleep
1426 |
1427 | We can use Python's csv library to read and write files stored in this format. For example, we can open a CSV file called lexicon.csv [1] and iterate over its rows [2]:
1428 |
1429 |
1430 |
1431 |
1432 | ```python
1433 | >>> import csv
1434 | >>> input_file = open("lexicon.csv", newline="")  # [1] open in text mode; csv needs str rows in Python 3
1435 | >>> for row in csv.reader(input_file):
1436 | ... print(row)
1437 | ['sleep', 'sli:p', 'v.i', 'a condition of body and mind ...']
1438 | ['walk', 'wo:k', 'v.intr', 'progress by lifting and setting down each foot ...']
1439 | ['wake', 'weik', 'intrans', 'cease to sleep']
1440 | ```
1441 |
1442 |
1443 | Each row is a list of strings. If any fields contain numerical data, they appear as strings too, and have to be converted using int() or float().
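The same module also writes CSV; a small sketch (the output file name and rows are invented for illustration):

```python
import csv

rows = [["sleep", "sli:p", "v.i", "a condition of body and mind ..."],
        ["dream", "dri:m", "v.i", "experience images during sleep"]]

with open("lexicon_out.csv", "w", newline="") as output_file:
    writer = csv.writer(output_file)
    writer.writerows(rows)           # one CSV line per list of fields
```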
1444 |
1445 | ## 3. NumPy
1446 | The NumPy package provides substantial support for numerical processing in Python. NumPy has a multi-dimensional array object which is easy to initialize and access:
1447 |
1448 |
1449 | ```python
1450 | from numpy import array
1451 | cube = array([ [[0,0,0], [1,1,1], [2,2,2]],
1452 | [[3,3,3], [4,4,4], [5,5,5]],
1453 | [[6,6,6], [7,7,7], [8,8,8]] ])
1454 |
1455 | print(cube[1,1,1],"\n")
1456 |
1457 | print(cube[2].transpose(),"\n")
1458 |
1459 | print(cube[2,1:],"\n")
1460 |
1461 | ```
1462 |
1463 | 4
1464 |
1465 | [[6 7 8]
1466 | [6 7 8]
1467 | [6 7 8]]
1468 |
1469 | [[7 7 7]
1470 | [8 8 8]]
1471 |
1472 |
1473 |
1474 | NumPy includes linear algebra functions. Here we perform singular value decomposition of a matrix, an operation used in latent semantic analysis to help identify implicit concepts in a document collection.
1475 |
1476 |
1477 | ```python
1478 | from numpy import linalg
1479 | a=array([[4,0], [3,-5]])
1480 | u,s,vt = linalg.svd(a)
1481 | print(u,"\n\n",s,"\n\n",vt)
1482 | ```
1483 |
1484 | [[-0.4472136 -0.89442719]
1485 | [-0.89442719 0.4472136 ]]
1486 |
1487 | [ 6.32455532 3.16227766]
1488 |
1489 | [[-0.70710678 0.70710678]
1490 | [-0.70710678 -0.70710678]]
1491 |
1492 |
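As a quick sanity check (a sketch added to these notes), the factors can be multiplied back together to recover the original matrix, since A = U·diag(s)·Vᵀ:

```python
from numpy import diag, allclose

print(u @ diag(s) @ vt)                 # approximately [[4, 0], [3, -5]]
print(allclose(u @ diag(s) @ vt, a))    # True, up to floating-point error
```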
1493 | NLTK's clustering package nltk.cluster makes extensive use of NumPy arrays, and supports k-means clustering, Gaussian EM clustering, group average agglomerative clustering, and dendrogram plots. For more information, type help(nltk.cluster).
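For instance, k-means clustering of a few two-dimensional vectors might look roughly like this (a sketch based on the nltk.cluster interface as documented; the toy vectors are invented):

```python
from numpy import array
from nltk.cluster import KMeansClusterer, euclidean_distance

vectors = [array(v) for v in [[1, 1], [1, 2], [8, 8], [8, 9]]]
clusterer = KMeansClusterer(2, euclidean_distance)
assignments = clusterer.cluster(vectors, assign_clusters=True)
print(assignments)        # e.g. [0, 0, 1, 1]; which label each cluster gets may vary
```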
1494 |
1495 | ## 4. Other Python Libraries
1496 | There are many other Python libraries, and you can find them using the Python Package Index at http://pypi.python.org/. Many libraries provide interfaces to external software, such as relational databases (e.g. mysql-python) and large document collections (e.g. PyLucene). Many others give access to various file formats such as PDF, MSWord, and XML (pypdf, pywin32, xml.etree), to RSS feeds (e.g. feedparser), and to electronic mail (e.g. imaplib, email).
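As one concrete standard-library example, xml.etree parses XML into a tree of elements; a tiny sketch (the markup below is invented for illustration):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<lexicon><entry word='sleep' pos='v.i'/><entry word='walk' pos='v.intr'/></lexicon>")
for entry in doc.findall("entry"):
    print(entry.get("word"), entry.get("pos"))   # sleep v.i / walk v.intr
```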
1497 |
1498 |
1502 |
--------------------------------------------------------------------------------