├── .ipynb_checkpoints ├── 【Python自然语言处理】读书笔记:第七章:从文本提取信息-checkpoint.ipynb ├── 【Python自然语言处理】读书笔记:第三章:处理原始文本-checkpoint.ipynb ├── 【Python自然语言处理】读书笔记:第五章:分类和标注词汇-checkpoint.ipynb ├── 【Python自然语言处理】读书笔记:第六章:学习分类文本-checkpoint.ipynb └── 【Python自然语言处理】读书笔记:第四章:编写结构化程序-checkpoint.ipynb ├── 3.document.txt ├── 3.output.txt ├── 4.test.html ├── 5.t2.pkl ├── PYTHON 自然语言处理中文翻译.pdf ├── README.md ├── picture ├── 1.1.png ├── 2.1.png ├── 2.2.png ├── 2.3.png ├── 2.4.png ├── 3.1.png ├── 3.2.png ├── 3.3.png ├── 4.1.png ├── 4.2.png ├── 5.1.png ├── 6.1.png ├── 6.2.png ├── 6.3.png ├── 6.4.png ├── 6.5.png ├── 6.6.png ├── 6.7.png ├── 7.1.png ├── 7.2.png ├── 7.3.png ├── 7.4.png └── 7.5.png ├── 【Python自然语言处理】读书笔记:第一章:语言处理与Python.md ├── 【Python自然语言处理】读书笔记:第七章:从文本提取信息.ipynb ├── 【Python自然语言处理】读书笔记:第三章:处理原始文本.ipynb ├── 【Python自然语言处理】读书笔记:第二章:获得文本语料和词汇资源.md ├── 【Python自然语言处理】读书笔记:第五章:分类和标注词汇.ipynb ├── 【Python自然语言处理】读书笔记:第六章:学习分类文本.ipynb ├── 【Python自然语言处理】读书笔记:第四章:编写结构化程序.ipynb └── 【Python自然语言处理】读书笔记:第四章:编写结构化程序.md /.ipynb_checkpoints/【Python自然语言处理】读书笔记:第七章:从文本提取信息-checkpoint.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "本章原文:https://usyiyi.github.io/nlp-py-2e-zh/7.html\n", 8 | "\n", 9 | " 1.我们如何能构建一个系统,从非结构化文本中提取结构化数据如表格?\n", 10 | " 2.有哪些稳健的方法识别一个文本中描述的实体和关系?\n", 11 | " 3.哪些语料库适合这项工作,我们如何使用它们来训练和评估我们的模型?\n", 12 | "\n", 13 | "**分块** 和 **命名实体识别**。" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "# 1 信息提取\n", 21 | "\n", 22 | "信息有很多种形状和大小。一个重要的形式是结构化数据:实体和关系的可预测的规范的结构。例如,我们可能对公司和地点之间的关系感兴趣。给定一个公司,我们希望能够确定它做业务的位置;反过来,给定位置,我们会想发现哪些公司在该位置做业务。如果我们的数据是表格形式,如1.1中的例子,那么回答这些问题就很简单了。\n", 23 | "\n" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 1, 29 | "metadata": {}, 30 | "outputs": [ 31 | { 32 | "name": "stdout", 33 | "output_type": "stream", 34 | "text": [ 35 | "['BBDO South', 'Georgia-Pacific']\n" 36 | ] 37 | } 38 | ], 39 | "source": [ 40 | "locs = [('Omnicom', 'IN', 'New York'),\n", 41 | " ('DDB Needham', 'IN', 'New York'),\n", 42 | " ('Kaplan Thaler Group', 'IN', 'New York'),\n", 43 | " ('BBDO South', 'IN', 'Atlanta'),\n", 44 | " ('Georgia-Pacific', 'IN', 'Atlanta')]\n", 45 | "query = [e1 for (e1, rel, e2) in locs if e2=='Atlanta']\n", 46 | "print(query)" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "我们可以定义一个函数,简单地连接 NLTK 中默认的句子分割器[1],分词器[2]和词性标注器[3]:" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 2, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "def ie_preprocess(document):\n", 63 | " sentences = nltk.sent_tokenize(document) # [1] 句子分割器\n", 64 | " sentences = [nltk.word_tokenize(sent) for sent in sentences] # [2] 分词器\n", 65 | " sentences = [nltk.pos_tag(sent) for sent in sentences] # [3] 词性标注器" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 3, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "import nltk, re, pprint" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "# 2 词块划分\n", 82 | "\n", 83 | "我们将用于实体识别的基本技术是词块划分,它分割和标注多词符的序列,如2.1所示。小框显示词级分词和词性标注,大框显示高级别的词块划分。每个这种较大的框叫做一个词块。就像分词忽略空白符,词块划分通常选择词符的一个子集。同样像分词一样,词块划分器生成的片段在源文本中不能重叠。\n", 84 | "![7.1.png](./picture/7.1.png)\n", 85 | "\n", 86 | "在本节中,我们将在较深的层面探讨词块划分,以**词块**的定义和表示开始。我们将看到**正则表达式**和**N-gram**的方法来词块划分,使用CoNLL-2000词块划分语料库**开发**和**评估词块划分器**。我们将在(5)和6回到**命名实体识别**和**关系抽取**的任务。" 87 | ] 88 | 
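补充:把第 1 节的 `import nltk` 与 `ie_preprocess` 合并成一个可以直接运行的小例子(示意代码;原单元格没有 `return`,这里补上以便后续使用,并假设所需的 punkt 与 averaged_perceptron_tagger 数据包已用 `nltk.download()` 装好)。

```python
import nltk

def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document)                      # [1] 句子分割器
    sentences = [nltk.word_tokenize(sent) for sent in sentences]  # [2] 分词器
    sentences = [nltk.pos_tag(sent) for sent in sentences]        # [3] 词性标注器
    return sentences                                              # 补充:返回标注结果,原笔记中省略了这一句

doc = "BBDO South is located in Atlanta. Georgia-Pacific also does business in Atlanta."
for sent in ie_preprocess(doc):
    print(sent)
```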
}, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "## 2.1 名词短语词块划分\n", 94 | "\n", 95 | "我们将首先思考名词短语词块划分或NP词块划分任务,在那里我们寻找单独名词短语对应的词块。例如,这里是一些《华尔街日报》文本,其中的NP词块用方括号标记:" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 5, 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "name": "stdout", 105 | "output_type": "stream", 106 | "text": [ 107 | "(S\n", 108 | " (NP the/DT little/JJ yellow/JJ dog/NN)\n", 109 | " barked/VBD\n", 110 | " at/IN\n", 111 | " (NP the/DT cat/NN))\n" 112 | ] 113 | } 114 | ], 115 | "source": [ 116 | "sentence = [(\"the\", \"DT\"), (\"little\", \"JJ\"), (\"yellow\", \"JJ\"), \n", 117 | " (\"dog\", \"NN\"), (\"barked\", \"VBD\"), (\"at\", \"IN\"), (\"the\", \"DT\"), (\"cat\", \"NN\")]\n", 118 | "\n", 119 | "grammar = \"NP: {
?*}\" \n", 120 | "cp = nltk.RegexpParser(grammar) \n", 121 | "result = cp.parse(sentence) \n", 122 | "print(result) \n", 123 | "result.draw()" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "![7.2.png](./picture/7.2.png)" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "## 2.2 标记模式\n", 138 | "\n", 139 | "组成一个词块语法的规则使用标记模式来描述已标注的词的序列。一个标记模式是一个词性标记序列,用尖括号分隔,如\n", 140 | "```\n", 141 | "
<DT>?<JJ>*<NN>\n", 142 | "```\n", 143 | "标记模式类似于正则表达式模式(3.4)。现在,思考下面的来自《华尔街日报》的名词短语:\n", 144 | "```py\n", 145 | "another/DT sharp/JJ dive/NN\n", 146 | "trade/NN figures/NNS\n", 147 | "any/DT new/JJ policy/NN measures/NNS\n", 148 | "earlier/JJR stages/NNS\n", 149 | "Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP\n", 150 | "```" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "## 2.3 用正则表达式进行词块划分\n", 158 | "\n", 159 | "要找到一个给定的句子的词块结构,RegexpParser词块划分器以一个没有词符被划分的平面结构开始。词块划分规则轮流应用,依次更新词块结构。一旦所有的规则都被调用,返回生成的词块结构。\n", 160 | "\n", 161 | "2.3显示了一个由2个规则组成的简单的词块语法。第一条规则匹配一个可选的限定词或所有格代名词,零个或多个形容词,然后跟一个名词。第二条规则匹配一个或多个专有名词。我们还定义了一个进行词块划分的例句[1],并在此输入上运行这个词块划分器[2]。" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 7, 167 | "metadata": {}, 168 | "outputs": [ 169 | { 170 | "name": "stdout", 171 | "output_type": "stream", 172 | "text": [ 173 | "(S\n", 174 | " (NP Rapunzel/NNP)\n", 175 | " let/VBD\n", 176 | " down/RP\n", 177 | " (NP her/PP$ long/JJ golden/JJ hair/NN))\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "grammar = r\"\"\"\n", 183 | "NP: {<DT|PP\\$>?<JJ>*<NN>} \n", 184 | "{<NNP>+}\n", 185 | "\"\"\"\n", 186 | "cp = nltk.RegexpParser(grammar)\n", 187 | "sentence = [(\"Rapunzel\", \"NNP\"), (\"let\", \"VBD\"), (\"down\", \"RP\"), (\"her\", \"PP$\"), (\"long\", \"JJ\"), (\"golden\", \"JJ\"), (\"hair\", \"NN\")]\n", 188 | "print (cp.parse(sentence))" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "注意\n", 196 | "\n", 197 | "```\n", 198 | "$\n", 199 | "```\n", 200 | "符号是正则表达式中的一个特殊字符,必须使用反斜杠转义来匹配\n", 201 | "```\n", 202 | "PP\\$\n", 203 | "```\n", 204 | "标记。\n", 205 | "\n", 206 | "如果标记模式匹配位置重叠,最左边的匹配优先。例如,如果我们应用一个匹配两个连续的名词文本的规则到一个包含三个连续的名词的文本,则只有前两个名词将被划分:" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": 8, 212 | "metadata": {}, 213 | "outputs": [ 214 | { 215 | "name": "stdout", 216 | "output_type": "stream", 217 | "text": [ 218 | "(S (NP money/NN market/NN) fund/NN)\n" 219 | ] 220 | } 221 | ], 222 | "source": [ 223 | "nouns = [(\"money\", \"NN\"), (\"market\", \"NN\"), (\"fund\", \"NN\")]\n", 224 | "grammar = \"NP: {<NN><NN>} # Chunk two consecutive nouns\"\n", 225 | "cp = nltk.RegexpParser(grammar)\n", 226 | "print(cp.parse(nouns))" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "## 2.4 探索文本语料库\n", 234 | "\n", 235 | "在2中,我们看到了我们如何在已标注的语料库中提取匹配的特定的词性标记序列的短语。我们可以使用词块划分器更容易的做同样的工作,如下:" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 12, 241 | "metadata": {}, 242 | "outputs": [ 243 | { 244 | "name": "stdout", 245 | "output_type": "stream", 246 | "text": [ 247 | "(CHUNK combined/VBN to/TO achieve/VB)\n", 248 | "(CHUNK continue/VB to/TO place/VB)\n", 249 | "(CHUNK serve/VB to/TO protect/VB)\n" 250 | ] 251 | } 252 | ], 253 | "source": [ 254 | "cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')\n", 255 | "brown = nltk.corpus.brown\n", 256 | "count = 0\n", 257 | "for sent in brown.tagged_sents():\n", 258 | " tree = cp.parse(sent)\n", 259 | " for subtree in tree.subtrees():\n", 260 | " if subtree.label() == 'CHUNK': print(subtree)\n", 261 | " count += 1\n", 262 | " if count >= 30: break" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "## 2.5 词缝加塞\n", 270 | "\n", 271 | "有时定义我们想从一个词块中排除什么比较容易。我们可以定义词缝为一个不包含在词块中的一个词符序列。在下面的例子中,barked/VBD at/IN是一个词缝:\n", 272 | "```\n", 273 | "[ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]\n", 274 | "```\n"
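2.5 节只抄录了文字说明;原文(7.html 的例 2.5)还配有一段代码,大意如下:先用 `{<.*>+}` 把整句划成一个 NP 词块,再用词缝规则 `}<VBD|IN>+{` 把 barked/VBD at/IN 这样的序列从词块中挖掉。

```python
grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
  """
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
print(cp.parse(sentence))
# 预期输出:(S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
```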
275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "## 2.6 词块的表示:标记与树\n", 282 | "\n", 283 | "作为标注和分析之间的中间状态(8.,词块结构可以使用标记或树来表示。最广泛的文件表示使用IOB标记。在这个方案中,每个词符被三个特殊的词块标记之一标注,I(内部),O(外部)或B(开始)。一个词符被标注为B,如果它标志着一个词块的开始。块内的词符子序列被标注为I。所有其他的词符被标注为O。B和I标记后面跟着词块类型,如B-NP, I-NP。当然,没有必要指定出现在词块外的词符类型,所以这些都只标注为O。这个方案的例子如2.5所示。\n", 284 | "![7.3.png](./picture/7.3.png)\n", 285 | "\n", 286 | "IOB标记已成为文件中表示词块结构的标准方式,我们也将使用这种格式。下面是2.5中的信息如何出现在一个文件中的:\n", 287 | "```\n", 288 | "We PRP B-NP\n", 289 | "saw VBD O\n", 290 | "the DT B-NP\n", 291 | "yellow JJ I-NP\n", 292 | "dog NN I-NP\n", 293 | "```" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "# 3 开发和评估词块划分器\n", 301 | "\n", 302 | "现在你对分块的作用有了一些了解,但我们并没有解释如何评估词块划分器。和往常一样,这需要一个合适的已标注语料库。我们一开始寻找将IOB格式转换成NLTK树的机制,然后是使用已化分词块的语料库如何在一个更大的规模上做这个。我们将看到如何为一个词块划分器相对一个语料库的准确性打分,再看看一些数据驱动方式搜索NP词块。我们整个的重点在于扩展一个词块划分器的覆盖范围。\n", 303 | "## 3.1 读取IOB格式与CoNLL2000语料库\n", 304 | "\n", 305 | "使用corpus模块,我们可以加载已经标注并使用IOB符号划分词块的《华尔街日报》文本。这个语料库提供的词块类型有NP,VP和PP。正如我们已经看到的,每个句子使用多行表示,如下所示:\n", 306 | "```\n", 307 | "he PRP B-NP\n", 308 | "accepted VBD B-VP\n", 309 | "the DT B-NP\n", 310 | "position NN I-NP\n", 311 | "...\n", 312 | "```\n", 313 | "![7.4.png](./picture/7.4.png)\n", 314 | "我们可以使用NLTK的corpus模块访问较大量的已经划分词块的文本。CoNLL2000语料库包含27万词的《华尔街日报文本》,分为“训练”和“测试”两部分,标注有词性标记和IOB格式词块标记。我们可以使用nltk.corpus.conll2000访问这些数据。下面是一个读取语料库的“训练”部分的第100个句子的例子:" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 14, 320 | "metadata": {}, 321 | "outputs": [ 322 | { 323 | "name": "stdout", 324 | "output_type": "stream", 325 | "text": [ 326 | "(S\n", 327 | " (PP Over/IN)\n", 328 | " (NP a/DT cup/NN)\n", 329 | " (PP of/IN)\n", 330 | " (NP coffee/NN)\n", 331 | " ,/,\n", 332 | " (NP Mr./NNP Stone/NNP)\n", 333 | " (VP told/VBD)\n", 334 | " (NP his/PRP$ story/NN)\n", 335 | " ./.)\n" 336 | ] 337 | } 338 | ], 339 | "source": [ 340 | "from nltk.corpus import conll2000\n", 341 | "print(conll2000.chunked_sents('train.txt')[99])" 342 | ] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "metadata": {}, 347 | "source": [ 348 | "正如你看到的,CoNLL2000语料库包含三种词块类型:NP词块,我们已经看到了;VP词块如has already delivered;PP块如because of。因为现在我们唯一感兴趣的是NP词块,我们可以使用chunk_types参数选择它们:" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": 15, 354 | "metadata": {}, 355 | "outputs": [ 356 | { 357 | "name": "stdout", 358 | "output_type": "stream", 359 | "text": [ 360 | "(S\n", 361 | " Over/IN\n", 362 | " (NP a/DT cup/NN)\n", 363 | " of/IN\n", 364 | " (NP coffee/NN)\n", 365 | " ,/,\n", 366 | " (NP Mr./NNP Stone/NNP)\n", 367 | " told/VBD\n", 368 | " (NP his/PRP$ story/NN)\n", 369 | " ./.)\n" 370 | ] 371 | } 372 | ], 373 | "source": [ 374 | "print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": {}, 380 | "source": [ 381 | "## 3.2 简单的评估和基准\n", 382 | "\n", 383 | "现在,我们可以访问一个已划分词块语料,可以评估词块划分器。我们开始为没有什么意义的词块解析器cp建立一个基准,它不划分任何词块:" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": 16, 389 | "metadata": {}, 390 | "outputs": [ 391 | { 392 | "name": "stdout", 393 | "output_type": "stream", 394 | "text": [ 395 | "ChunkParse score:\n", 396 | " IOB Accuracy: 43.4%%\n", 397 | " Precision: 0.0%%\n", 398 | " Recall: 0.0%%\n", 399 | " F-Measure: 0.0%%\n" 400 | ] 401 | } 402 | ], 403 | "source": [ 404 | "from nltk.corpus import conll2000\n", 405 | "cp = nltk.RegexpParser(\"\")\n", 406 | "test_sents = 
conll2000.chunked_sents('test.txt', chunk_types=['NP'])\n", 407 | "print(cp.evaluate(test_sents))" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "IOB标记准确性表明超过三分之一的词被标注为O,即没有在NP词块中。然而,由于我们的标注器没有找到任何词块,其精度、召回率和F-度量均为零。现在让我们尝试一个初级的正则表达式词块划分器,查找以名词短语标记的特征字母开头的标记(如CD, DT和JJ)。" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": 17, 420 | "metadata": {}, 421 | "outputs": [ 422 | { 423 | "name": "stdout", 424 | "output_type": "stream", 425 | "text": [ 426 | "ChunkParse score:\n", 427 | " IOB Accuracy: 87.7%%\n", 428 | " Precision: 70.6%%\n", 429 | " Recall: 67.8%%\n", 430 | " F-Measure: 69.2%%\n" 431 | ] 432 | } 433 | ], 434 | "source": [ 435 | "grammar = r\"NP: {<[CDJNP].*>+}\"\n", 436 | "cp = nltk.RegexpParser(grammar)\n", 437 | "print(cp.evaluate(test_sents))" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": {}, 443 | "source": [ 444 | "正如你看到的,这种方法达到相当好的结果。但是,我们可以采用更多数据驱动的方法改善它,在这里我们使用训练语料找到对每个词性标记最有可能的块标记(I, O或B)。换句话说,我们可以使用一元标注器(4)建立一个词块划分器。但不是尝试确定每个词的正确的词性标记,而是根据每个词的词性标记,尝试确定正确的词块标记。\n", 445 | "\n", 446 | "在3.1中,我们定义了UnigramChunker类,使用一元标注器给句子加词块标记。这个类的大部分代码只是用来在NLTK 的ChunkParserI接口使用的词块树表示和嵌入式标注器使用的IOB表示之间镜像转换。类定义了两个方法:一个构造函数[1],当我们建立一个新的UnigramChunker时调用;以及parse方法[3],用来给新句子划分词块。" 447 | ] 448 | }, 449 | { 450 | "cell_type": "code", 451 | "execution_count": 18, 452 | "metadata": {}, 453 | "outputs": [], 454 | "source": [ 455 | "class UnigramChunker(nltk.ChunkParserI):\n", 456 | " def __init__(self, train_sents): \n", 457 | " train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]\n", 458 | " for sent in train_sents]\n", 459 | " self.tagger = nltk.UnigramTagger(train_data) \n", 460 | "\n", 461 | " def parse(self, sentence): \n", 462 | " pos_tags = [pos for (word,pos) in sentence]\n", 463 | " tagged_pos_tags = self.tagger.tag(pos_tags)\n", 464 | " chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]\n", 465 | " conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)\n", 466 | " in zip(sentence, chunktags)]\n", 467 | " return nltk.chunk.conlltags2tree(conlltags)" 468 | ] 469 | }, 470 | { 471 | "cell_type": "markdown", 472 | "metadata": {}, 473 | "source": [ 474 | "构造函数[1]需要训练句子的一个列表,这将是词块树的形式。它首先将训练数据转换成适合训练标注器的形式,使用tree2conlltags映射每个词块树到一个word,tag,chunk三元组的列表。然后使用转换好的训练数据训练一个一元标注器,并存储在self.tagger供以后使用。\n", 475 | "\n", 476 | "parse方法[3]接收一个已标注的句子作为其输入,以从那句话提取词性标记开始。它然后使用在构造函数中训练过的标注器self.tagger,为词性标记标注IOB词块标记。接下来,它提取词块标记,与原句组合,产生conlltags。最后,它使用conlltags2tree将结果转换成一个词块树。\n", 477 | "\n", 478 | "现在我们有了UnigramChunker,可以使用CoNLL2000语料库训练它,并测试其表现:" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": 20, 484 | "metadata": {}, 485 | "outputs": [ 486 | { 487 | "name": "stdout", 488 | "output_type": "stream", 489 | "text": [ 490 | "ChunkParse score:\n", 491 | " IOB Accuracy: 92.9%%\n", 492 | " Precision: 79.9%%\n", 493 | " Recall: 86.8%%\n", 494 | " F-Measure: 83.2%%\n" 495 | ] 496 | } 497 | ], 498 | "source": [ 499 | "test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])\n", 500 | "train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])\n", 501 | "unigram_chunker = UnigramChunker(train_sents)\n", 502 | "print(unigram_chunker.evaluate(test_sents))" 503 | ] 504 | }, 505 | { 506 | "cell_type": "markdown", 507 | "metadata": {}, 508 | "source": [ 509 | "这个分块器相当不错,达到整体F-度量83%的得分。让我们来看一看通过使用一元标注器分配一个标记给每个语料库中出现的词性标记,它学到了什么:" 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": 21, 515 | "metadata": {}, 
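补充说明:UnigramChunker 中用到的 `nltk.chunk.tree2conlltags` 与 `nltk.chunk.conlltags2tree` 是一对互逆的转换函数,下面用 CoNLL2000 训练集里的同一个句子单独演示词块树与 IOB 三元组之间的来回转换(示意代码)。

```python
import nltk
from nltk.corpus import conll2000

# 取训练集第 100 个句子(只保留 NP 词块),转成 (词, 词性, IOB 词块标记) 三元组
tree = conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99]
iob = nltk.chunk.tree2conlltags(tree)
print(iob[:5])
# 大致输出:[('Over', 'IN', 'O'), ('a', 'DT', 'B-NP'), ('cup', 'NN', 'I-NP'), ('of', 'IN', 'O'), ('coffee', 'NN', 'B-NP')]

# conlltags2tree 把三元组再还原成词块树
print(nltk.chunk.conlltags2tree(iob))
```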
516 | "outputs": [ 517 | { 518 | "name": "stdout", 519 | "output_type": "stream", 520 | "text": [ 521 | "[('#', 'B-NP'), ('$', 'B-NP'), (\"''\", 'O'), ('(', 'O'), (')', 'O'), (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'), ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'), ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'), ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'), ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'), ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'), ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'), ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'), ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]\n" 522 | ] 523 | } 524 | ], 525 | "source": [ 526 | "postags = sorted(set(pos for sent in train_sents\n", 527 | " for (word,pos) in sent.leaves()))\n", 528 | "print(unigram_chunker.tagger.tag(postags))" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "metadata": {}, 534 | "source": [ 535 | "它已经发现大多数标点符号出现在NP词块外,除了两种货币符号#和*\\$*。它也发现限定词(DT)和所有格(PRP*\\$*和WP$)出现在NP词块的开头,而名词类型(NN, NNP, NNPS,NNS)大多出现在NP词块内。\n", 536 | "\n", 537 | "建立了一个一元分块器,很容易建立一个二元分块器:我们只需要改变类的名称为BigramChunker,修改3.1行[2]构造一个BigramTagger而不是UnigramTagger。由此产生的词块划分器的性能略高于一元词块划分器:" 538 | ] 539 | }, 540 | { 541 | "cell_type": "code", 542 | "execution_count": 24, 543 | "metadata": {}, 544 | "outputs": [ 545 | { 546 | "name": "stdout", 547 | "output_type": "stream", 548 | "text": [ 549 | "ChunkParse score:\n", 550 | " IOB Accuracy: 93.3%%\n", 551 | " Precision: 82.3%%\n", 552 | " Recall: 86.8%%\n", 553 | " F-Measure: 84.5%%\n" 554 | ] 555 | } 556 | ], 557 | "source": [ 558 | "class BigramChunker(nltk.ChunkParserI):\n", 559 | " def __init__(self, train_sents): \n", 560 | " train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]\n", 561 | " for sent in train_sents]\n", 562 | " self.tagger = nltk.BigramTagger(train_data)\n", 563 | "\n", 564 | " def parse(self, sentence): \n", 565 | " pos_tags = [pos for (word,pos) in sentence]\n", 566 | " tagged_pos_tags = self.tagger.tag(pos_tags)\n", 567 | " chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]\n", 568 | " conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)\n", 569 | " in zip(sentence, chunktags)]\n", 570 | " return nltk.chunk.conlltags2tree(conlltags)\n", 571 | "bigram_chunker = BigramChunker(train_sents)\n", 572 | "print(bigram_chunker.evaluate(test_sents))" 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "metadata": {}, 578 | "source": [ 579 | "## 3.3 训练基于分类器的词块划分器\n", 580 | "\n", 581 | "无论是基于正则表达式的词块划分器还是n-gram词块划分器,决定创建什么词块完全基于词性标记。然而,有时词性标记不足以确定一个句子应如何划分词块。例如,考虑下面的两个语句:" 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": 31, 587 | "metadata": {}, 588 | "outputs": [], 589 | "source": [ 590 | "class ConsecutiveNPChunkTagger(nltk.TaggerI): \n", 591 | "\n", 592 | " def __init__(self, train_sents):\n", 593 | " train_set = []\n", 594 | " for tagged_sent in train_sents:\n", 595 | " untagged_sent = nltk.tag.untag(tagged_sent)\n", 596 | " history = []\n", 597 | " for i, (word, tag) in enumerate(tagged_sent):\n", 598 | " featureset = npchunk_features(untagged_sent, i, history) \n", 599 | " train_set.append( (featureset, tag) )\n", 600 | " history.append(tag)\n", 601 | " self.classifier = nltk.MaxentClassifier.train( \n", 602 | " train_set, algorithm='megam', trace=0)\n", 603 | "\n", 604 | " def tag(self, sentence):\n", 605 | " history = []\n", 606 | " for i, word in 
enumerate(sentence):\n", 607 | " featureset = npchunk_features(sentence, i, history)\n", 608 | " tag = self.classifier.classify(featureset)\n", 609 | " history.append(tag)\n", 610 | " return zip(sentence, history)\n", 611 | "\n", 612 | "class ConsecutiveNPChunker(nltk.ChunkParserI):\n", 613 | " def __init__(self, train_sents):\n", 614 | " tagged_sents = [[((w,t),c) for (w,t,c) in\n", 615 | " nltk.chunk.tree2conlltags(sent)]\n", 616 | " for sent in train_sents]\n", 617 | " self.tagger = ConsecutiveNPChunkTagger(tagged_sents)\n", 618 | "\n", 619 | " def parse(self, sentence):\n", 620 | " tagged_sents = self.tagger.tag(sentence)\n", 621 | " conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]\n", 622 | " return nltk.chunk.conlltags2tree(conlltags)" 623 | ] 624 | }, 625 | { 626 | "cell_type": "markdown", 627 | "metadata": {}, 628 | "source": [ 629 | "留下来唯一需要填写的是特征提取器。首先,我们定义一个简单的特征提取器,它只是提供了当前词符的词性标记。使用此特征提取器,我们的基于分类器的词块划分器的表现与一元词块划分器非常类似:" 630 | ] 631 | }, 632 | { 633 | "cell_type": "code", 634 | "execution_count": null, 635 | "metadata": {}, 636 | "outputs": [], 637 | "source": [ 638 | "def npchunk_features(sentence, i, history):\n", 639 | " word, pos = sentence[i]\n", 640 | " return {\"pos\": pos}\n", 641 | "chunker = ConsecutiveNPChunker(train_sents)\n", 642 | "print(chunker.evaluate(test_sents))" 643 | ] 644 | }, 645 | { 646 | "cell_type": "markdown", 647 | "metadata": {}, 648 | "source": [ 649 | "```\n", 650 | "ChunkParse score:\n", 651 | " IOB Accuracy: 92.9%\n", 652 | " Precision: 79.9%\n", 653 | " Recall: 86.7%\n", 654 | " F-Measure: 83.2%\n", 655 | " ```" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "我们还可以添加一个特征表示前面词的词性标记。添加此特征允许词块划分器模拟相邻标记之间的相互作用,由此产生的词块划分器与二元词块划分器非常接近。\n" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": null, 668 | "metadata": {}, 669 | "outputs": [], 670 | "source": [ 671 | "def npchunk_features(sentence, i, history):\n", 672 | " word, pos = sentence[i]\n", 673 | " if i == 0:\n", 674 | " prevword, prevpos = \"\", \"\"\n", 675 | " else:\n", 676 | " prevword, prevpos = sentence[i-1]\n", 677 | " return {\"pos\": pos, \"prevpos\": prevpos}\n", 678 | "chunker = ConsecutiveNPChunker(train_sents)\n", 679 | "print(chunker.evaluate(test_sents))" 680 | ] 681 | }, 682 | { 683 | "cell_type": "markdown", 684 | "metadata": {}, 685 | "source": [ 686 | "```\n", 687 | "ChunkParse score:\n", 688 | " IOB Accuracy: 93.6%\n", 689 | " Precision: 81.9%\n", 690 | " Recall: 87.2%\n", 691 | " F-Measure: 84.5%\n", 692 | "```" 693 | ] 694 | }, 695 | { 696 | "cell_type": "markdown", 697 | "metadata": {}, 698 | "source": [ 699 | "下一步,我们将尝试为当前词增加特征,因为我们假设这个词的内容应该对词块划有用。我们发现这个特征确实提高了词块划分器的表现,大约1.5个百分点(相应的错误率减少大约10%)。" 700 | ] 701 | }, 702 | { 703 | "cell_type": "code", 704 | "execution_count": null, 705 | "metadata": {}, 706 | "outputs": [], 707 | "source": [ 708 | "def npchunk_features(sentence, i, history):\n", 709 | " word, pos = sentence[i]\n", 710 | " if i == 0:\n", 711 | " prevword, prevpos = \"\", \"\"\n", 712 | " else:\n", 713 | " prevword, prevpos = sentence[i-1]\n", 714 | " return {\"pos\": pos, \"word\": word, \"prevpos\": prevpos}\n", 715 | "chunker = ConsecutiveNPChunker(train_sents)\n", 716 | "print(chunker.evaluate(test_sents))" 717 | ] 718 | }, 719 | { 720 | "cell_type": "markdown", 721 | "metadata": {}, 722 | "source": [ 723 | "```\n", 724 | "ChunkParse score:\n", 725 | " IOB Accuracy: 94.5%\n", 726 | " Precision: 84.2%\n", 727 | " Recall: 89.4%\n", 728 | " F-Measure: 86.7%\n", 729 | "```" 730 | 
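注意:ConsecutiveNPChunkTagger 的构造函数用 `algorithm='megam'` 训练最大熵分类器,这依赖外部的 MEGAM 可执行程序,未安装时训练会直接报错。下面是一个不依赖外部程序的替代写法草图(类名 NaiveBayesNPChunkTagger 是本笔记自拟的,不是原书代码):特征与流程照旧,只把分类器换成 NLTK 自带的朴素贝叶斯;再把 ConsecutiveNPChunker 里构造的标注器换成这个类即可,评测分数可能与书中略有出入。

```python
class NaiveBayesNPChunkTagger(ConsecutiveNPChunkTagger):
    """与 ConsecutiveNPChunkTagger 相同的特征提取流程,只是换掉分类器。"""
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        # 朴素贝叶斯无外部依赖、训练较快;也可改用
        # nltk.MaxentClassifier.train(train_set, algorithm='IIS', trace=0)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)
```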
] 731 | }, 732 | { 733 | "cell_type": "markdown", 734 | "metadata": {}, 735 | "source": [ 736 | "最后,我们尝试用多种附加特征扩展特征提取器,例如预取特征[1]、配对特征[2]和复杂的语境特征[3]。这最后一个特征,称为tags-since-dt,创建一个字符串,描述自最近的限定词以来遇到的所有词性标记,或如果没有限定词则在索引i之前自语句开始以来遇到的所有词性标记。" 737 | ] 738 | }, 739 | { 740 | "cell_type": "code", 741 | "execution_count": 36, 742 | "metadata": {}, 743 | "outputs": [], 744 | "source": [ 745 | "def npchunk_features(sentence, i, history):\n", 746 | " word, pos = sentence[i]\n", 747 | " if i == 0:\n", 748 | " prevword, prevpos = \"\", \"\"\n", 749 | " else:\n", 750 | " prevword, prevpos = sentence[i-1]\n", 751 | " if i == len(sentence)-1:\n", 752 | " nextword, nextpos = \"\", \"\"\n", 753 | " else:\n", 754 | " nextword, nextpos = sentence[i+1]\n", 755 | " return {\"pos\": pos,\n", 756 | " \"word\": word,\n", 757 | " \"prevpos\": prevpos,\n", 758 | " \"nextpos\": nextpos,\n", 759 | " \"prevpos+pos\": \"%s+%s\" % (prevpos, pos), \n", 760 | " \"pos+nextpos\": \"%s+%s\" % (pos, nextpos),\n", 761 | " \"tags-since-dt\": tags_since_dt(sentence, i)} " 762 | ] 763 | }, 764 | { 765 | "cell_type": "code", 766 | "execution_count": 37, 767 | "metadata": {}, 768 | "outputs": [], 769 | "source": [ 770 | "def tags_since_dt(sentence, i):\n", 771 | " tags = set()\n", 772 | " for word, pos in sentence[:i]:\n", 773 | " if pos == 'DT':\n", 774 | " tags = set()\n", 775 | " else:\n", 776 | " tags.add(pos)\n", 777 | " return '+'.join(sorted(tags))" 778 | ] 779 | }, 780 | { 781 | "cell_type": "code", 782 | "execution_count": null, 783 | "metadata": {}, 784 | "outputs": [], 785 | "source": [ 786 | "chunker = ConsecutiveNPChunker(train_sents)\n", 787 | "print(chunker.evaluate(test_sents))" 788 | ] 789 | }, 790 | { 791 | "cell_type": "markdown", 792 | "metadata": {}, 793 | "source": [ 794 | "```\n", 795 | "ChunkParse score:\n", 796 | " IOB Accuracy: 96.0%\n", 797 | " Precision: 88.6%\n", 798 | " Recall: 91.0%\n", 799 | " F-Measure: 89.8%\n", 800 | "```" 801 | ] 802 | }, 803 | { 804 | "cell_type": "markdown", 805 | "metadata": {}, 806 | "source": [ 807 | "# 4 语言结构中的递归\n", 808 | "## 4.1 用级联词块划分器构建嵌套结构\n", 809 | "\n", 810 | "到目前为止,我们的词块结构一直是相对平的。已标注词符组成的树在如NP这样的词块节点下任意组合。然而,只需创建一个包含递归规则的多级的词块语法,就可以建立任意深度的词块结构。4.1是名词短语、介词短语、动词短语和句子的模式。这是一个四级词块语法器,可以用来创建深度最多为4的结构。\n", 811 | "\n" 812 | ] 813 | }, 814 | { 815 | "cell_type": "code", 816 | "execution_count": 42, 817 | "metadata": {}, 818 | "outputs": [ 819 | { 820 | "name": "stdout", 821 | "output_type": "stream", 822 | "text": [ 823 | "(S\n", 824 | " (NP Mary/NN)\n", 825 | " saw/VBD\n", 826 | " (CLAUSE\n", 827 | " (NP the/DT cat/NN)\n", 828 | " (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))\n" 829 | ] 830 | } 831 | ], 832 | "source": [ 833 | "grammar = r\"\"\"\n", 834 | " NP: {+} \n", 835 | " PP: {} \n", 836 | " VP: {+$} \n", 837 | " CLAUSE: {} \n", 838 | " \"\"\"\n", 839 | "cp = nltk.RegexpParser(grammar)\n", 840 | "sentence = [(\"Mary\", \"NN\"), (\"saw\", \"VBD\"), (\"the\", \"DT\"), (\"cat\", \"NN\"),\n", 841 | " (\"sit\", \"VB\"), (\"on\", \"IN\"), (\"the\", \"DT\"), (\"mat\", \"NN\")]\n", 842 | "print(cp.parse(sentence))" 843 | ] 844 | }, 845 | { 846 | "cell_type": "markdown", 847 | "metadata": {}, 848 | "source": [ 849 | "不幸的是,这一结果丢掉了saw为首的VP。它还有其他缺陷。当我们将此词块划分器应用到一个有更深嵌套的句子时,让我们看看会发生什么。请注意,它无法识别[1]开始的VP词块。" 850 | ] 851 | }, 852 | { 853 | "cell_type": "code", 854 | "execution_count": 43, 855 | "metadata": {}, 856 | "outputs": [ 857 | { 858 | "name": "stdout", 859 | "output_type": "stream", 860 | "text": [ 861 | "(S\n", 862 | " (NP John/NNP)\n", 863 | " thinks/VBZ\n", 864 | " (NP 
Mary/NN)\n", 865 | " saw/VBD\n", 866 | " (CLAUSE\n", 867 | " (NP the/DT cat/NN)\n", 868 | " (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))\n" 869 | ] 870 | } 871 | ], 872 | "source": [ 873 | "sentence = [(\"John\", \"NNP\"), (\"thinks\", \"VBZ\"), (\"Mary\", \"NN\"),\n", 874 | " (\"saw\", \"VBD\"), (\"the\", \"DT\"), (\"cat\", \"NN\"), (\"sit\", \"VB\"),\n", 875 | " (\"on\", \"IN\"), (\"the\", \"DT\"), (\"mat\", \"NN\")]\n", 876 | "print(cp.parse(sentence))" 877 | ] 878 | }, 879 | { 880 | "cell_type": "markdown", 881 | "metadata": {}, 882 | "source": [ 883 | "这些问题的解决方案是让词块划分器在它的模式中循环:尝试完所有模式之后,重复此过程。我们添加一个可选的第二个参数loop指定这套模式应该循环的次数:" 884 | ] 885 | }, 886 | { 887 | "cell_type": "code", 888 | "execution_count": 44, 889 | "metadata": {}, 890 | "outputs": [ 891 | { 892 | "name": "stdout", 893 | "output_type": "stream", 894 | "text": [ 895 | "(S\n", 896 | " (NP John/NNP)\n", 897 | " thinks/VBZ\n", 898 | " (CLAUSE\n", 899 | " (NP Mary/NN)\n", 900 | " (VP\n", 901 | " saw/VBD\n", 902 | " (CLAUSE\n", 903 | " (NP the/DT cat/NN)\n", 904 | " (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))\n" 905 | ] 906 | } 907 | ], 908 | "source": [ 909 | "cp = nltk.RegexpParser(grammar, loop=2)\n", 910 | "print(cp.parse(sentence))" 911 | ] 912 | }, 913 | { 914 | "cell_type": "markdown", 915 | "metadata": {}, 916 | "source": [ 917 | "注意\n", 918 | "\n", 919 | "这个级联过程使我们能创建深层结构。然而,创建和调试级联过程是困难的,关键点是它能更有效地做全面的分析(见第8.章)。另外,级联过程只能产生固定深度的树(不超过级联级数),完整的句法分析这是不够的。\n" 920 | ] 921 | }, 922 | { 923 | "cell_type": "markdown", 924 | "metadata": {}, 925 | "source": [ 926 | "## 4.2 Trees\n", 927 | "\n", 928 | "tree是一组连接的加标签节点,从一个特殊的根节点沿一条唯一的路径到达每个节点。下面是一棵树的例子(注意它们标准的画法是颠倒的):\n", 929 | "```\n", 930 | "(S\n", 931 | " (NP Alice)\n", 932 | " (VP\n", 933 | " (V chased)\n", 934 | " (NP\n", 935 | " (Det the)\n", 936 | " (N rabbit))))\n", 937 | "```\n", 938 | "虽然我们将只集中关注语法树,树可以用来编码任何同构的超越语言形式序列的层次结构(如形态结构、篇章结构)。一般情况下,叶子和节点值不一定要是字符串。\n", 939 | "\n", 940 | "在NLTK中,我们通过给一个节点添加标签和一系列的孩子创建一棵树:" 941 | ] 942 | }, 943 | { 944 | "cell_type": "code", 945 | "execution_count": 46, 946 | "metadata": {}, 947 | "outputs": [ 948 | { 949 | "name": "stdout", 950 | "output_type": "stream", 951 | "text": [ 952 | "(NP Alice)\n", 953 | "(NP the rabbit)\n" 954 | ] 955 | } 956 | ], 957 | "source": [ 958 | "tree1 = nltk.Tree('NP', ['Alice'])\n", 959 | "print(tree1)\n", 960 | "tree2 = nltk.Tree('NP', ['the', 'rabbit'])\n", 961 | "print(tree2)" 962 | ] 963 | }, 964 | { 965 | "cell_type": "markdown", 966 | "metadata": {}, 967 | "source": [ 968 | "我们可以将这些不断合并成更大的树,如下所示:" 969 | ] 970 | }, 971 | { 972 | "cell_type": "code", 973 | "execution_count": 47, 974 | "metadata": {}, 975 | "outputs": [ 976 | { 977 | "name": "stdout", 978 | "output_type": "stream", 979 | "text": [ 980 | "(S (NP Alice) (VP chased (NP the rabbit)))\n" 981 | ] 982 | } 983 | ], 984 | "source": [ 985 | "tree3 = nltk.Tree('VP', ['chased', tree2])\n", 986 | "tree4 = nltk.Tree('S', [tree1, tree3])\n", 987 | "print(tree4)" 988 | ] 989 | }, 990 | { 991 | "cell_type": "markdown", 992 | "metadata": {}, 993 | "source": [ 994 | "下面是树对象的一些的方法:" 995 | ] 996 | }, 997 | { 998 | "cell_type": "code", 999 | "execution_count": 49, 1000 | "metadata": {}, 1001 | "outputs": [ 1002 | { 1003 | "name": "stdout", 1004 | "output_type": "stream", 1005 | "text": [ 1006 | "(VP chased (NP the rabbit))\n", 1007 | "VP\n", 1008 | "['Alice', 'chased', 'the', 'rabbit']\n", 1009 | "rabbit\n" 1010 | ] 1011 | } 1012 | ], 1013 | "source": [ 1014 | "print(tree4[1])\n", 1015 | "print(tree4[1].label())\n", 1016 | "print(tree4.leaves())\n", 1017 | 
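回看 4.1 节:那个级联语法单元格里的尖括号标记模式(如 `<DT|JJ|NN.*>`)在渲染时疑似被当作 HTML 标签吞掉,只剩下 `{+}`、`{}` 这样的空壳。按原文(7.html 例 4.1),该语法应大致如下;用它构造 RegexpParser(以及 loop=2 的版本)才能得到上面两个单元格展示的输出。

```python
grammar = r"""
  NP: {<DT|JJ|NN.*>+}           # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}                # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$}  # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}            # Chunk NP, VP
  """
cp = nltk.RegexpParser(grammar)               # 4.1 节第一个例子用的解析器
cp_loop = nltk.RegexpParser(grammar, loop=2)  # 循环两遍,得到嵌套的 CLAUSE 结构
```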
"print(tree4[1][1][1])" 1018 | ] 1019 | }, 1020 | { 1021 | "cell_type": "markdown", 1022 | "metadata": {}, 1023 | "source": [ 1024 | "复杂的树用括号表示难以阅读。在这些情况下,draw方法是非常有用的。它会打开一个新窗口,包含树的一个图形表示。树显示窗口可以放大和缩小,子树可以折叠和展开,并将图形表示输出为一个postscript文件(包含在一个文档中)。" 1025 | ] 1026 | }, 1027 | { 1028 | "cell_type": "code", 1029 | "execution_count": 50, 1030 | "metadata": {}, 1031 | "outputs": [], 1032 | "source": [ 1033 | "tree3.draw()\n" 1034 | ] 1035 | }, 1036 | { 1037 | "cell_type": "markdown", 1038 | "metadata": {}, 1039 | "source": [ 1040 | "![7.5.png](./picture/7.5.png)" 1041 | ] 1042 | }, 1043 | { 1044 | "cell_type": "markdown", 1045 | "metadata": {}, 1046 | "source": [ 1047 | "## 4.3 树遍历\n", 1048 | "\n", 1049 | "使用递归函数来遍历树是标准的做法。4.2中的内容进行了演示。\n", 1050 | "\n" 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "execution_count": 53, 1056 | "metadata": {}, 1057 | "outputs": [ 1058 | { 1059 | "name": "stdout", 1060 | "output_type": "stream", 1061 | "text": [ 1062 | "( S ( NP Alice ) ( VP chased ( NP the rabbit ) ) ) " 1063 | ] 1064 | } 1065 | ], 1066 | "source": [ 1067 | "def traverse(t):\n", 1068 | " try:\n", 1069 | " t.label()\n", 1070 | " except AttributeError:\n", 1071 | " print(t, end=\" \")\n", 1072 | " else:\n", 1073 | " # Now we know that t.node is defined\n", 1074 | " print('(', t.label(), end=\" \")\n", 1075 | " for child in t:\n", 1076 | " traverse(child)\n", 1077 | " print(')', end=\" \")\n", 1078 | "\n", 1079 | "t = tree4\n", 1080 | "traverse(t)" 1081 | ] 1082 | }, 1083 | { 1084 | "cell_type": "markdown", 1085 | "metadata": {}, 1086 | "source": [ 1087 | "# 5 命名实体识别\n", 1088 | "\n", 1089 | "在本章开头,我们简要介绍了命名实体(NE)。命名实体是确切的名词短语,指示特定类型的个体,如组织、人、日期等。5.1列出了一些较常用的NE类型。这些应该是不言自明的,除了“FACILITY”:建筑和土木工程领域的人造产品;以及“GPE”:地缘政治实体,如城市、州/省、国家。\n", 1090 | "\n", 1091 | "\n", 1092 | "常用命名实体类型\n", 1093 | "```\n", 1094 | "Eddy N B-PER\n", 1095 | "Bonte N I-PER\n", 1096 | "is V O\n", 1097 | "woordvoerder N O\n", 1098 | "van Prep O\n", 1099 | "diezelfde Pron O\n", 1100 | "Hogeschool N B-ORG\n", 1101 | ". 
Punc O\n", 1102 | "```" 1103 | ] 1104 | }, 1105 | { 1106 | "cell_type": "code", 1107 | "execution_count": 54, 1108 | "metadata": {}, 1109 | "outputs": [ 1110 | { 1111 | "name": "stdout", 1112 | "output_type": "stream", 1113 | "text": [ 1114 | "(S\n", 1115 | " From/IN\n", 1116 | " what/WDT\n", 1117 | " I/PPSS\n", 1118 | " was/BEDZ\n", 1119 | " able/JJ\n", 1120 | " to/IN\n", 1121 | " gauge/NN\n", 1122 | " in/IN\n", 1123 | " a/AT\n", 1124 | " swift/JJ\n", 1125 | " ,/,\n", 1126 | " greedy/JJ\n", 1127 | " glance/NN\n", 1128 | " ,/,\n", 1129 | " the/AT\n", 1130 | " figure/NN\n", 1131 | " inside/IN\n", 1132 | " the/AT\n", 1133 | " coral-colored/JJ\n", 1134 | " boucle/NN\n", 1135 | " dress/NN\n", 1136 | " was/BEDZ\n", 1137 | " stupefying/VBG\n", 1138 | " ./.)\n" 1139 | ] 1140 | } 1141 | ], 1142 | "source": [ 1143 | "print(nltk.ne_chunk(sent)) " 1144 | ] 1145 | }, 1146 | { 1147 | "cell_type": "markdown", 1148 | "metadata": {}, 1149 | "source": [ 1150 | "# 6 关系抽取\n", 1151 | "\n", 1152 | "一旦文本中的命名实体已被识别,我们就可以提取它们之间存在的关系。如前所述,我们通常会寻找指定类型的命名实体之间的关系。进行这一任务的方法之一是首先寻找所有X, α, Y)形式的三元组,其中X和Y是指定类型的命名实体,α表示X和Y之间关系的字符串。然后我们可以使用正则表达式从α的实体中抽出我们正在查找的关系。下面的例子搜索包含词in的字符串。特殊的正则表达式(?!\\b.+ing\\b)是一个否定预测先行断言,允许我们忽略如success in supervising the transition of中的字符串,其中in后面跟一个动名词。" 1153 | ] 1154 | }, 1155 | { 1156 | "cell_type": "code", 1157 | "execution_count": 55, 1158 | "metadata": {}, 1159 | "outputs": [ 1160 | { 1161 | "name": "stdout", 1162 | "output_type": "stream", 1163 | "text": [ 1164 | "[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']\n", 1165 | "[ORG: 'McGlashan & Sarrail'] 'firm in' [LOC: 'San Mateo']\n", 1166 | "[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']\n", 1167 | "[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']\n", 1168 | "[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']\n", 1169 | "[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']\n", 1170 | "[ORG: 'WGBH'] 'in' [LOC: 'Boston']\n", 1171 | "[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']\n", 1172 | "[ORG: 'Omnicom'] 'in' [LOC: 'New York']\n", 1173 | "[ORG: 'DDB Needham'] 'in' [LOC: 'New York']\n", 1174 | "[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']\n", 1175 | "[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']\n", 1176 | "[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']\n" 1177 | ] 1178 | } 1179 | ], 1180 | "source": [ 1181 | "IN = re.compile(r'.*\\bin\\b(?!\\b.+ing)')\n", 1182 | "for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):\n", 1183 | " for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,\n", 1184 | " corpus='ieer', pattern = IN):\n", 1185 | " print(nltk.sem.rtuple(rel))" 1186 | ] 1187 | }, 1188 | { 1189 | "cell_type": "markdown", 1190 | "metadata": {}, 1191 | "source": [ 1192 | "搜索关键字in执行的相当不错,虽然它的检索结果也会误报,例如[ORG: House Transportation Committee] , secured the most money in the [LOC: New York];一种简单的基于字符串的方法排除这样的填充字符串似乎不太可能。\n", 1193 | "\n", 1194 | "如前文所示,conll2002命名实体语料库的荷兰语部分不只包含命名实体标注,也包含词性标注。这允许我们设计对这些标记敏感的模式,如下面的例子所示。clause()方法以分条形式输出关系,其中二元关系符号作为参数relsym的值被指定[1]。" 1195 | ] 1196 | }, 1197 | { 1198 | "cell_type": "code", 1199 | "execution_count": 57, 1200 | "metadata": {}, 1201 | "outputs": [ 1202 | { 1203 | "name": "stdout", 1204 | "output_type": "stream", 1205 | "text": [ 1206 | "VAN(\"cornet_d'elzius\", 'buitenlandse_handel')\n", 1207 | "VAN('johan_rottiers', 'kardinaal_van_roey_instituut')\n", 1208 | "VAN('annie_lennox', 'eurythmics')\n" 1209 | ] 1210 | } 1211 | ], 1212 | "source": [ 1213 | "from nltk.corpus import conll2002\n", 1214 | "vnv = \"\"\"\n", 1215 | "(\n", 1216 | "is/V| # 3rd 
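关于第 5 节的 `print(nltk.ne_chunk(sent))`:该单元格没有重新定义 `sent`,从输出里的布朗语料词性标记(PPSS、BEDZ、AT 等)看,它很可能沿用了 2.4 节循环遗留的变量,所以结果中没有任何命名实体。原文用的是宾州树库中已标注的句子,大致用法如下(示意;索引 22 是原书的例子,并假设已下载 maxent_ne_chunker 与 words 数据包)。

```python
import nltk

sent = nltk.corpus.treebank.tagged_sents()[22]   # 一条宾州树库的已标注句子
print(nltk.ne_chunk(sent, binary=True))          # binary=True:只区分 NE 与非 NE
print(nltk.ne_chunk(sent))                       # 默认细分 PERSON / ORGANIZATION / GPE 等
```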
sing present and\n", 1217 | "was/V| # past forms of the verb zijn ('be')\n", 1218 | "werd/V| # and also present\n", 1219 | "wordt/V # past of worden ('become)\n", 1220 | ")\n", 1221 | ".* # followed by anything\n", 1222 | "van/Prep # followed by van ('of')\n", 1223 | "\"\"\"\n", 1224 | "VAN = re.compile(vnv, re.VERBOSE)\n", 1225 | "for doc in conll2002.chunked_sents('ned.train'):\n", 1226 | " for r in nltk.sem.extract_rels('PER', 'ORG', doc,\n", 1227 | " corpus='conll2002', pattern=VAN):\n", 1228 | " print(nltk.sem.clause(r, relsym=\"VAN\"))" 1229 | ] 1230 | }, 1231 | { 1232 | "cell_type": "code", 1233 | "execution_count": null, 1234 | "metadata": {}, 1235 | "outputs": [], 1236 | "source": [] 1237 | } 1238 | ], 1239 | "metadata": { 1240 | "kernelspec": { 1241 | "display_name": "Python 3", 1242 | "language": "python", 1243 | "name": "python3" 1244 | }, 1245 | "language_info": { 1246 | "codemirror_mode": { 1247 | "name": "ipython", 1248 | "version": 3 1249 | }, 1250 | "file_extension": ".py", 1251 | "mimetype": "text/x-python", 1252 | "name": "python", 1253 | "nbconvert_exporter": "python", 1254 | "pygments_lexer": "ipython3", 1255 | "version": "3.6.2" 1256 | } 1257 | }, 1258 | "nbformat": 4, 1259 | "nbformat_minor": 2 1260 | } 1261 | -------------------------------------------------------------------------------- /3.document.txt: -------------------------------------------------------------------------------- 1 | 沁园春·雪 2 | 作者:毛泽东 3 | 北国风光,千里冰封,万里雪飘。 4 | 望长城内外,惟余莽莽;大河上下,顿失滔滔。 5 | 山舞银蛇,原驰蜡象,欲与天公试比高。 6 | 须晴日,看红装素裹,分外妖娆。 7 | 江山如此多娇,引无数英雄竞折腰。 8 | 惜秦皇汉武,略输文采;唐宗宋祖,稍逊风骚。 -------------------------------------------------------------------------------- /3.output.txt: -------------------------------------------------------------------------------- 1 | ! 2 | ' 3 | ( 4 | ) 5 | , 6 | ,) 7 | . 8 | .) 9 | : 10 | ; 11 | ;) 12 | ? 13 | ?) 
14 | A 15 | Abel 16 | Abelmizraim 17 | Abidah 18 | Abide 19 | Abimael 20 | Abimelech 21 | Abr 22 | Abrah 23 | Abraham 24 | Abram 25 | Accad 26 | Achbor 27 | Adah 28 | Adam 29 | Adbeel 30 | Admah 31 | Adullamite 32 | After 33 | Aholibamah 34 | Ahuzzath 35 | Ajah 36 | Akan 37 | All 38 | Allonbachuth 39 | Almighty 40 | Almodad 41 | Also 42 | Alvah 43 | Alvan 44 | Am 45 | Amal 46 | Amalek 47 | Amalekites 48 | Ammon 49 | Amorite 50 | Amorites 51 | Amraphel 52 | An 53 | Anah 54 | Anamim 55 | And 56 | Aner 57 | Angel 58 | Appoint 59 | Aram 60 | Aran 61 | Ararat 62 | Arbah 63 | Ard 64 | Are 65 | Areli 66 | Arioch 67 | Arise 68 | Arkite 69 | Arodi 70 | Arphaxad 71 | Art 72 | Arvadite 73 | As 74 | Asenath 75 | Ashbel 76 | Asher 77 | Ashkenaz 78 | Ashteroth 79 | Ask 80 | Asshur 81 | Asshurim 82 | Assyr 83 | Assyria 84 | At 85 | Atad 86 | Avith 87 | Baalhanan 88 | Babel 89 | Bashemath 90 | Be 91 | Because 92 | Becher 93 | Bedad 94 | Beeri 95 | Beerlahairoi 96 | Beersheba 97 | Behold 98 | Bela 99 | Belah 100 | Benam 101 | Benjamin 102 | Beno 103 | Beor 104 | Bera 105 | Bered 106 | Beriah 107 | Bethel 108 | Bethlehem 109 | Bethuel 110 | Beware 111 | Bilhah 112 | Bilhan 113 | Binding 114 | Birsha 115 | Bless 116 | Blessed 117 | Both 118 | Bow 119 | Bozrah 120 | Bring 121 | But 122 | Buz 123 | By 124 | Cain 125 | Cainan 126 | Calah 127 | Calneh 128 | Can 129 | Cana 130 | Canaan 131 | Canaanite 132 | Canaanites 133 | Canaanitish 134 | Caphtorim 135 | Carmi 136 | Casluhim 137 | Cast 138 | Cause 139 | Chaldees 140 | Chedorlaomer 141 | Cheran 142 | Cherubims 143 | Chesed 144 | Chezib 145 | Come 146 | Cursed 147 | Cush 148 | Damascus 149 | Dan 150 | Day 151 | Deborah 152 | Dedan 153 | Deliver 154 | Diklah 155 | Din 156 | Dinah 157 | Dinhabah 158 | Discern 159 | Dishan 160 | Dishon 161 | Do 162 | Dodanim 163 | Dothan 164 | Drink 165 | Duke 166 | Dumah 167 | Earth 168 | Ebal 169 | Eber 170 | Edar 171 | Eden 172 | Edom 173 | Edomites 174 | Egy 175 | Egypt 176 | Egyptia 177 | Egyptian 178 | Egyptians 179 | Ehi 180 | Elah 181 | Elam 182 | Elbethel 183 | Eldaah 184 | EleloheIsrael 185 | Eliezer 186 | Eliphaz 187 | Elishah 188 | Ellasar 189 | Elon 190 | Elparan 191 | Emins 192 | En 193 | Enmishpat 194 | Eno 195 | Enoch 196 | Enos 197 | Ephah 198 | Epher 199 | Ephra 200 | Ephraim 201 | Ephrath 202 | Ephron 203 | Er 204 | Erech 205 | Eri 206 | Es 207 | Esau 208 | Escape 209 | Esek 210 | Eshban 211 | Eshcol 212 | Ethiopia 213 | Euphrat 214 | Euphrates 215 | Eve 216 | Even 217 | Every 218 | Except 219 | Ezbon 220 | Ezer 221 | Fear 222 | Feed 223 | Fifteen 224 | Fill 225 | For 226 | Forasmuch 227 | Forgive 228 | From 229 | Fulfil 230 | G 231 | Gad 232 | Gaham 233 | Galeed 234 | Gatam 235 | Gather 236 | Gaza 237 | Gentiles 238 | Gera 239 | Gerar 240 | Gershon 241 | Get 242 | Gether 243 | Gihon 244 | Gilead 245 | Girgashites 246 | Girgasite 247 | Give 248 | Go 249 | God 250 | Gomer 251 | Gomorrah 252 | Goshen 253 | Guni 254 | Hadad 255 | Hadar 256 | Hadoram 257 | Hagar 258 | Haggi 259 | Hai 260 | Ham 261 | Hamathite 262 | Hamor 263 | Hamul 264 | Hanoch 265 | Happy 266 | Haran 267 | Hast 268 | Haste 269 | Have 270 | Havilah 271 | Hazarmaveth 272 | Hazezontamar 273 | Hazo 274 | He 275 | Hear 276 | Heaven 277 | Heber 278 | Hebrew 279 | Hebrews 280 | Hebron 281 | Hemam 282 | Hemdan 283 | Here 284 | Hereby 285 | Heth 286 | Hezron 287 | Hiddekel 288 | Hinder 289 | Hirah 290 | His 291 | Hitti 292 | Hittite 293 | Hittites 294 | Hivite 295 | Hobah 296 | Hori 297 | Horite 298 | Horites 299 | How 300 | Hul 301 | Huppim 302 | Husham 
303 | Hushim 304 | Huz 305 | I 306 | If 307 | In 308 | Irad 309 | Iram 310 | Is 311 | Isa 312 | Isaac 313 | Iscah 314 | Ishbak 315 | Ishmael 316 | Ishmeelites 317 | Ishuah 318 | Isra 319 | Israel 320 | Issachar 321 | Isui 322 | It 323 | Ithran 324 | Jaalam 325 | Jabal 326 | Jabbok 327 | Jac 328 | Jachin 329 | Jacob 330 | Jahleel 331 | Jahzeel 332 | Jamin 333 | Japhe 334 | Japheth 335 | Jared 336 | Javan 337 | Jebusite 338 | Jebusites 339 | Jegarsahadutha 340 | Jehovahjireh 341 | Jemuel 342 | Jerah 343 | Jetheth 344 | Jetur 345 | Jeush 346 | Jezer 347 | Jidlaph 348 | Jimnah 349 | Job 350 | Jobab 351 | Jokshan 352 | Joktan 353 | Jordan 354 | Joseph 355 | Jubal 356 | Judah 357 | Judge 358 | Judith 359 | Kadesh 360 | Kadmonites 361 | Karnaim 362 | Kedar 363 | Kedemah 364 | Kemuel 365 | Kenaz 366 | Kenites 367 | Kenizzites 368 | Keturah 369 | Kiriathaim 370 | Kirjatharba 371 | Kittim 372 | Know 373 | Kohath 374 | Kor 375 | Korah 376 | LO 377 | LORD 378 | Laban 379 | Lahairoi 380 | Lamech 381 | Lasha 382 | Lay 383 | Leah 384 | Lehabim 385 | Lest 386 | Let 387 | Letushim 388 | Leummim 389 | Levi 390 | Lie 391 | Lift 392 | Lo 393 | Look 394 | Lot 395 | Lotan 396 | Lud 397 | Ludim 398 | Luz 399 | Maachah 400 | Machir 401 | Machpelah 402 | Madai 403 | Magdiel 404 | Magog 405 | Mahalaleel 406 | Mahalath 407 | Mahanaim 408 | Make 409 | Malchiel 410 | Male 411 | Mam 412 | Mamre 413 | Man 414 | Manahath 415 | Manass 416 | Manasseh 417 | Mash 418 | Masrekah 419 | Massa 420 | Matred 421 | Me 422 | Medan 423 | Mehetabel 424 | Mehujael 425 | Melchizedek 426 | Merari 427 | Mesha 428 | Meshech 429 | Mesopotamia 430 | Methusa 431 | Methusael 432 | Methuselah 433 | Mezahab 434 | Mibsam 435 | Mibzar 436 | Midian 437 | Midianites 438 | Milcah 439 | Mishma 440 | Mizpah 441 | Mizraim 442 | Mizz 443 | Moab 444 | Moabites 445 | Moreh 446 | Moreover 447 | Moriah 448 | Muppim 449 | My 450 | Naamah 451 | Naaman 452 | Nahath 453 | Nahor 454 | Naphish 455 | Naphtali 456 | Naphtuhim 457 | Nay 458 | Nebajoth 459 | Neither 460 | Night 461 | Nimrod 462 | Nineveh 463 | Noah 464 | Nod 465 | Not 466 | Now 467 | O 468 | Obal 469 | Of 470 | Oh 471 | Ohad 472 | Omar 473 | On 474 | Onam 475 | Onan 476 | Only 477 | Ophir 478 | Our 479 | Out 480 | Padan 481 | Padanaram 482 | Paran 483 | Pass 484 | Pathrusim 485 | Pau 486 | Peace 487 | Peleg 488 | Peniel 489 | Penuel 490 | Peradventure 491 | Perizzit 492 | Perizzite 493 | Perizzites 494 | Phallu 495 | Phara 496 | Pharaoh 497 | Pharez 498 | Phichol 499 | Philistim 500 | Philistines 501 | Phut 502 | Phuvah 503 | Pildash 504 | Pinon 505 | Pison 506 | Potiphar 507 | Potipherah 508 | Put 509 | Raamah 510 | Rachel 511 | Rameses 512 | Rebek 513 | Rebekah 514 | Rehoboth 515 | Remain 516 | Rephaims 517 | Resen 518 | Return 519 | Reu 520 | Reub 521 | Reuben 522 | Reuel 523 | Reumah 524 | Riphath 525 | Rosh 526 | Sabtah 527 | Sabtech 528 | Said 529 | Salah 530 | Salem 531 | Samlah 532 | Sarah 533 | Sarai 534 | Saul 535 | Save 536 | Say 537 | Se 538 | Seba 539 | See 540 | Seeing 541 | Seir 542 | Sell 543 | Send 544 | Sephar 545 | Serah 546 | Sered 547 | Serug 548 | Set 549 | Seth 550 | Shalem 551 | Shall 552 | Shalt 553 | Shammah 554 | Shaul 555 | Shaveh 556 | She 557 | Sheba 558 | Shebah 559 | Shechem 560 | Shed 561 | Shel 562 | Shelah 563 | Sheleph 564 | Shem 565 | Shemeber 566 | Shepho 567 | Shillem 568 | Shiloh 569 | Shimron 570 | Shinab 571 | Shinar 572 | Shobal 573 | Should 574 | Shuah 575 | Shuni 576 | Shur 577 | Sichem 578 | Siddim 579 | Sidon 580 | Simeon 581 | Sinite 582 | Sitnah 583 | 
Slay 584 | So 585 | Sod 586 | Sodom 587 | Sojourn 588 | Some 589 | Spake 590 | Speak 591 | Spirit 592 | Stand 593 | Succoth 594 | Surely 595 | Swear 596 | Syrian 597 | Take 598 | Tamar 599 | Tarshish 600 | Tebah 601 | Tell 602 | Tema 603 | Teman 604 | Temani 605 | Terah 606 | Thahash 607 | That 608 | The 609 | Then 610 | There 611 | Therefore 612 | These 613 | They 614 | Thirty 615 | This 616 | Thorns 617 | Thou 618 | Thus 619 | Thy 620 | Tidal 621 | Timna 622 | Timnah 623 | Timnath 624 | Tiras 625 | To 626 | Togarmah 627 | Tola 628 | Tubal 629 | Tubalcain 630 | Twelve 631 | Two 632 | Unstable 633 | Until 634 | Unto 635 | Up 636 | Upon 637 | Ur 638 | Uz 639 | Uzal 640 | We 641 | What 642 | When 643 | Whence 644 | Where 645 | Whereas 646 | Wherefore 647 | Which 648 | While 649 | Who 650 | Whose 651 | Whoso 652 | Why 653 | Wilt 654 | With 655 | Woman 656 | Ye 657 | Yea 658 | Yet 659 | Zaavan 660 | Zaphnathpaaneah 661 | Zar 662 | Zarah 663 | Zeboiim 664 | Zeboim 665 | Zebul 666 | Zebulun 667 | Zemarite 668 | Zepho 669 | Zerah 670 | Zibeon 671 | Zidon 672 | Zillah 673 | Zilpah 674 | Zimran 675 | Ziphion 676 | Zo 677 | Zoar 678 | Zohar 679 | Zuzims 680 | a 681 | abated 682 | abide 683 | able 684 | abode 685 | abomination 686 | about 687 | above 688 | abroad 689 | absent 690 | abundantly 691 | accept 692 | accepted 693 | according 694 | acknowledged 695 | activity 696 | add 697 | adder 698 | afar 699 | afflict 700 | affliction 701 | afraid 702 | after 703 | afterward 704 | afterwards 705 | aga 706 | again 707 | against 708 | age 709 | aileth 710 | air 711 | al 712 | alive 713 | all 714 | almon 715 | alo 716 | alone 717 | aloud 718 | also 719 | altar 720 | altogether 721 | always 722 | am 723 | among 724 | amongst 725 | an 726 | and 727 | angel 728 | angels 729 | anger 730 | angry 731 | anguish 732 | anointedst 733 | anoth 734 | another 735 | answer 736 | answered 737 | any 738 | anything 739 | appe 740 | appear 741 | appeared 742 | appease 743 | appoint 744 | appointed 745 | aprons 746 | archer 747 | archers 748 | are 749 | arise 750 | ark 751 | armed 752 | arms 753 | army 754 | arose 755 | arrayed 756 | art 757 | artificer 758 | as 759 | ascending 760 | ash 761 | ashamed 762 | ask 763 | asked 764 | asketh 765 | ass 766 | assembly 767 | asses 768 | assigned 769 | asswaged 770 | at 771 | attained 772 | audience 773 | avenged 774 | aw 775 | awaked 776 | away 777 | awoke 778 | back 779 | backward 780 | bad 781 | bade 782 | badest 783 | badne 784 | bak 785 | bake 786 | bakemeats 787 | baker 788 | bakers 789 | balm 790 | bands 791 | bank 792 | bare 793 | barr 794 | barren 795 | basket 796 | baskets 797 | battle 798 | bdellium 799 | be 800 | bear 801 | beari 802 | bearing 803 | beast 804 | beasts 805 | beautiful 806 | became 807 | because 808 | become 809 | bed 810 | been 811 | befall 812 | befell 813 | before 814 | began 815 | begat 816 | beget 817 | begettest 818 | begin 819 | beginning 820 | begotten 821 | beguiled 822 | beheld 823 | behind 824 | behold 825 | being 826 | believed 827 | belly 828 | belong 829 | beneath 830 | bereaved 831 | beside 832 | besides 833 | besought 834 | best 835 | betimes 836 | better 837 | between 838 | betwixt 839 | beyond 840 | binding 841 | bird 842 | birds 843 | birthday 844 | birthright 845 | biteth 846 | bitter 847 | blame 848 | blameless 849 | blasted 850 | bless 851 | blessed 852 | blesseth 853 | blessi 854 | blessing 855 | blessings 856 | blindness 857 | blood 858 | blossoms 859 | bodies 860 | boldly 861 | bondman 862 | bondmen 863 | bondwoman 864 | bone 865 | 
bones 866 | book 867 | booths 868 | border 869 | borders 870 | born 871 | bosom 872 | both 873 | bottle 874 | bou 875 | boug 876 | bough 877 | bought 878 | bound 879 | bow 880 | bowed 881 | bowels 882 | bowing 883 | boys 884 | bracelets 885 | branches 886 | brass 887 | bre 888 | breach 889 | bread 890 | breadth 891 | break 892 | breaketh 893 | breaking 894 | breasts 895 | breath 896 | breathed 897 | breed 898 | brethren 899 | brick 900 | brimstone 901 | bring 902 | brink 903 | broken 904 | brook 905 | broth 906 | brother 907 | brought 908 | brown 909 | bruise 910 | budded 911 | build 912 | builded 913 | built 914 | bulls 915 | bundle 916 | bundles 917 | burdens 918 | buried 919 | burn 920 | burning 921 | burnt 922 | bury 923 | buryingplace 924 | business 925 | but 926 | butler 927 | butlers 928 | butlership 929 | butter 930 | buy 931 | by 932 | cakes 933 | calf 934 | call 935 | called 936 | came 937 | camel 938 | camels 939 | camest 940 | can 941 | cannot 942 | canst 943 | captain 944 | captive 945 | captives 946 | carcases 947 | carried 948 | carry 949 | cast 950 | castles 951 | catt 952 | cattle 953 | caught 954 | cause 955 | caused 956 | cave 957 | cease 958 | ceased 959 | certain 960 | certainly 961 | chain 962 | chamber 963 | change 964 | changed 965 | changes 966 | charge 967 | charged 968 | chariot 969 | chariots 970 | chesnut 971 | chi 972 | chief 973 | child 974 | childless 975 | childr 976 | children 977 | chode 978 | choice 979 | chose 980 | circumcis 981 | circumcise 982 | circumcised 983 | citi 984 | cities 985 | city 986 | clave 987 | clean 988 | clear 989 | cleave 990 | clo 991 | closed 992 | clothed 993 | clothes 994 | cloud 995 | clusters 996 | co 997 | coat 998 | coats 999 | coffin 1000 | cold 1001 | colours 1002 | colt 1003 | colts 1004 | come 1005 | comest 1006 | cometh 1007 | comfort 1008 | comforted 1009 | comi 1010 | coming 1011 | command 1012 | commanded 1013 | commanding 1014 | commandment 1015 | commandments 1016 | commended 1017 | committed 1018 | commune 1019 | communed 1020 | communing 1021 | company 1022 | compassed 1023 | compasseth 1024 | conceal 1025 | conceive 1026 | conceived 1027 | conception 1028 | concerning 1029 | concubi 1030 | concubine 1031 | concubines 1032 | confederate 1033 | confound 1034 | consent 1035 | conspired 1036 | consume 1037 | consumed 1038 | content 1039 | continually 1040 | continued 1041 | cool 1042 | corn 1043 | corrupt 1044 | corrupted 1045 | couch 1046 | couched 1047 | couching 1048 | could 1049 | counted 1050 | countenance 1051 | countries 1052 | country 1053 | covenant 1054 | covered 1055 | covering 1056 | created 1057 | creature 1058 | creepeth 1059 | creeping 1060 | cried 1061 | crieth 1062 | crown 1063 | cru 1064 | cruelty 1065 | cry 1066 | cubit 1067 | cubits 1068 | cunning 1069 | cup 1070 | current 1071 | curse 1072 | cursed 1073 | curseth 1074 | custom 1075 | cut 1076 | d 1077 | da 1078 | dainties 1079 | dale 1080 | damsel 1081 | damsels 1082 | dark 1083 | darkne 1084 | darkness 1085 | daughers 1086 | daught 1087 | daughte 1088 | daughter 1089 | daughters 1090 | day 1091 | days 1092 | dea 1093 | dead 1094 | deal 1095 | dealt 1096 | dearth 1097 | death 1098 | deceitfully 1099 | deceived 1100 | deceiver 1101 | declare 1102 | decreased 1103 | deed 1104 | deeds 1105 | deep 1106 | deferred 1107 | defiled 1108 | defiledst 1109 | delight 1110 | deliver 1111 | deliverance 1112 | delivered 1113 | denied 1114 | depart 1115 | departed 1116 | departing 1117 | deprived 1118 | descending 1119 | desire 1120 | desired 1121 | desolate 
1122 | despised 1123 | destitute 1124 | destroy 1125 | destroyed 1126 | devour 1127 | devoured 1128 | dew 1129 | did 1130 | didst 1131 | die 1132 | died 1133 | digged 1134 | dignity 1135 | dim 1136 | dine 1137 | dipped 1138 | direct 1139 | discern 1140 | discerned 1141 | discreet 1142 | displease 1143 | displeased 1144 | distress 1145 | distressed 1146 | divide 1147 | divided 1148 | divine 1149 | divineth 1150 | do 1151 | doe 1152 | doer 1153 | doest 1154 | doeth 1155 | doing 1156 | dominion 1157 | done 1158 | door 1159 | dost 1160 | doth 1161 | double 1162 | doubled 1163 | doubt 1164 | dove 1165 | down 1166 | dowry 1167 | drank 1168 | draw 1169 | dread 1170 | dreadful 1171 | dream 1172 | dreamed 1173 | dreamer 1174 | dreams 1175 | dress 1176 | dressed 1177 | drew 1178 | dried 1179 | drink 1180 | drinketh 1181 | drinking 1182 | driven 1183 | drought 1184 | drove 1185 | droves 1186 | drunken 1187 | dry 1188 | duke 1189 | dukes 1190 | dunge 1191 | dungeon 1192 | dust 1193 | dwe 1194 | dwell 1195 | dwelled 1196 | dwelling 1197 | dwelt 1198 | e 1199 | ea 1200 | each 1201 | ear 1202 | earing 1203 | early 1204 | earring 1205 | earrings 1206 | ears 1207 | earth 1208 | east 1209 | eastward 1210 | eat 1211 | eaten 1212 | eatest 1213 | edge 1214 | eight 1215 | eighteen 1216 | eighty 1217 | either 1218 | elder 1219 | elders 1220 | eldest 1221 | eleven 1222 | else 1223 | embalm 1224 | embalmed 1225 | embraced 1226 | emptied 1227 | empty 1228 | end 1229 | ended 1230 | endued 1231 | endure 1232 | enemies 1233 | enlarge 1234 | enmity 1235 | enough 1236 | enquire 1237 | enter 1238 | entered 1239 | entreated 1240 | envied 1241 | erected 1242 | errand 1243 | escape 1244 | escaped 1245 | espied 1246 | establish 1247 | established 1248 | ev 1249 | even 1250 | evening 1251 | eventide 1252 | ever 1253 | everlasting 1254 | every 1255 | evil 1256 | ewe 1257 | ewes 1258 | exceeding 1259 | exceedingly 1260 | excel 1261 | excellency 1262 | except 1263 | exchange 1264 | experience 1265 | ey 1266 | eyed 1267 | eyes 1268 | fa 1269 | face 1270 | faces 1271 | fai 1272 | fail 1273 | failed 1274 | faileth 1275 | fainted 1276 | fair 1277 | fall 1278 | fallen 1279 | falsely 1280 | fame 1281 | families 1282 | famine 1283 | famished 1284 | far 1285 | fashion 1286 | fast 1287 | fat 1288 | fatfleshed 1289 | fath 1290 | fathe 1291 | father 1292 | fathers 1293 | fatness 1294 | faults 1295 | favour 1296 | favoured 1297 | fear 1298 | feared 1299 | fearest 1300 | feast 1301 | fed 1302 | feeble 1303 | feebler 1304 | feed 1305 | feeding 1306 | feel 1307 | feet 1308 | fell 1309 | fellow 1310 | felt 1311 | fema 1312 | female 1313 | fetch 1314 | fetched 1315 | fetcht 1316 | few 1317 | fie 1318 | field 1319 | fierce 1320 | fifteen 1321 | fifth 1322 | fifty 1323 | fig 1324 | fill 1325 | filled 1326 | find 1327 | findest 1328 | findeth 1329 | finding 1330 | fine 1331 | finish 1332 | finished 1333 | fir 1334 | fire 1335 | firmame 1336 | firmament 1337 | first 1338 | firstborn 1339 | firstlings 1340 | fish 1341 | fishes 1342 | five 1343 | flaming 1344 | fle 1345 | fled 1346 | fleddest 1347 | flee 1348 | flesh 1349 | flo 1350 | floc 1351 | flock 1352 | flocks 1353 | flood 1354 | floor 1355 | fly 1356 | fo 1357 | foal 1358 | foals 1359 | folk 1360 | follow 1361 | followed 1362 | following 1363 | folly 1364 | food 1365 | foolishly 1366 | foot 1367 | for 1368 | forbid 1369 | force 1370 | ford 1371 | foremost 1372 | foreskin 1373 | forgat 1374 | forget 1375 | forgive 1376 | forgotten 1377 | form 1378 | formed 1379 | former 1380 | forth 1381 | forty 
1382 | forward 1383 | fou 1384 | found 1385 | fountain 1386 | fountains 1387 | four 1388 | fourscore 1389 | fourteen 1390 | fourteenth 1391 | fourth 1392 | fowl 1393 | fowls 1394 | freely 1395 | friend 1396 | friends 1397 | fro 1398 | from 1399 | frost 1400 | fruit 1401 | fruitful 1402 | fruits 1403 | fugitive 1404 | fulfilled 1405 | full 1406 | furnace 1407 | furniture 1408 | fury 1409 | gard 1410 | garden 1411 | garmen 1412 | garment 1413 | garments 1414 | gat 1415 | gate 1416 | gather 1417 | gathered 1418 | gathering 1419 | gave 1420 | gavest 1421 | generatio 1422 | generation 1423 | generations 1424 | get 1425 | getting 1426 | ghost 1427 | giants 1428 | gift 1429 | gifts 1430 | give 1431 | given 1432 | giveth 1433 | giving 1434 | glory 1435 | go 1436 | goa 1437 | goat 1438 | goats 1439 | gods 1440 | goest 1441 | goeth 1442 | going 1443 | gold 1444 | golden 1445 | gone 1446 | good 1447 | goodly 1448 | goods 1449 | gopher 1450 | got 1451 | gotten 1452 | governor 1453 | gr 1454 | grace 1455 | gracious 1456 | graciously 1457 | grap 1458 | grapes 1459 | grass 1460 | grave 1461 | gray 1462 | gre 1463 | great 1464 | greater 1465 | greatly 1466 | green 1467 | grew 1468 | grief 1469 | grieved 1470 | grievous 1471 | grisl 1472 | grisled 1473 | gro 1474 | ground 1475 | grove 1476 | grow 1477 | grown 1478 | guard 1479 | guiding 1480 | guiltiness 1481 | guilty 1482 | gutters 1483 | h 1484 | ha 1485 | habitations 1486 | had 1487 | hadst 1488 | hairs 1489 | hairy 1490 | half 1491 | halted 1492 | han 1493 | hand 1494 | handfuls 1495 | handle 1496 | handmaid 1497 | handmaidens 1498 | handmaids 1499 | hands 1500 | hang 1501 | hanged 1502 | hard 1503 | hardly 1504 | harlot 1505 | harm 1506 | harp 1507 | harvest 1508 | hast 1509 | haste 1510 | hasted 1511 | hastened 1512 | hastily 1513 | hate 1514 | hated 1515 | hath 1516 | have 1517 | haven 1518 | having 1519 | hazel 1520 | he 1521 | head 1522 | heads 1523 | healed 1524 | health 1525 | heap 1526 | hear 1527 | heard 1528 | hearken 1529 | hearkened 1530 | heart 1531 | hearth 1532 | hearts 1533 | heat 1534 | heav 1535 | heaven 1536 | heavens 1537 | heed 1538 | heel 1539 | heels 1540 | heifer 1541 | height 1542 | heir 1543 | held 1544 | help 1545 | hence 1546 | henceforth 1547 | her 1548 | herb 1549 | herd 1550 | herdmen 1551 | herds 1552 | here 1553 | herein 1554 | herself 1555 | hid 1556 | hide 1557 | high 1558 | hil 1559 | hills 1560 | him 1561 | himself 1562 | hind 1563 | hindermost 1564 | hire 1565 | hired 1566 | his 1567 | hith 1568 | hither 1569 | hold 1570 | hollow 1571 | home 1572 | honey 1573 | honour 1574 | honourable 1575 | hor 1576 | horror 1577 | horse 1578 | horsemen 1579 | horses 1580 | host 1581 | hotly 1582 | hou 1583 | hous 1584 | house 1585 | household 1586 | households 1587 | how 1588 | hundred 1589 | hundredfo 1590 | hundredth 1591 | hunt 1592 | hunter 1593 | hunting 1594 | hurt 1595 | husba 1596 | husband 1597 | husbandman 1598 | if 1599 | ill 1600 | image 1601 | images 1602 | imagination 1603 | imagined 1604 | in 1605 | increase 1606 | increased 1607 | indeed 1608 | inhabitants 1609 | inhabited 1610 | inherit 1611 | inheritance 1612 | iniquity 1613 | inn 1614 | innocency 1615 | instead 1616 | instructor 1617 | instruments 1618 | integrity 1619 | interpret 1620 | interpretation 1621 | interpretations 1622 | interpreted 1623 | interpreter 1624 | into 1625 | intreat 1626 | intreated 1627 | ir 1628 | is 1629 | isles 1630 | issue 1631 | it 1632 | itself 1633 | jewels 1634 | joined 1635 | joint 1636 | journey 1637 | journeyed 1638 | 
journeys 1639 | jud 1640 | judge 1641 | judged 1642 | judgment 1643 | just 1644 | justice 1645 | keep 1646 | keeper 1647 | kept 1648 | ki 1649 | kid 1650 | kids 1651 | kill 1652 | killed 1653 | kind 1654 | kindled 1655 | kindly 1656 | kindness 1657 | kindred 1658 | kinds 1659 | kine 1660 | king 1661 | kingdom 1662 | kings 1663 | kiss 1664 | kissed 1665 | kn 1666 | knead 1667 | kneel 1668 | knees 1669 | knew 1670 | knife 1671 | know 1672 | knowest 1673 | knoweth 1674 | knowing 1675 | knowledge 1676 | known 1677 | la 1678 | labour 1679 | lack 1680 | lad 1681 | ladder 1682 | lade 1683 | laded 1684 | laden 1685 | lads 1686 | laid 1687 | lamb 1688 | lambs 1689 | lamentati 1690 | lamp 1691 | lan 1692 | land 1693 | lands 1694 | language 1695 | large 1696 | last 1697 | laugh 1698 | laughed 1699 | law 1700 | lawgiver 1701 | laws 1702 | lay 1703 | lead 1704 | leaf 1705 | lean 1706 | leanfleshed 1707 | leap 1708 | leaped 1709 | learned 1710 | least 1711 | leave 1712 | leaves 1713 | led 1714 | left 1715 | length 1716 | lentiles 1717 | lesser 1718 | lest 1719 | let 1720 | li 1721 | lie 1722 | lien 1723 | liest 1724 | lieth 1725 | life 1726 | lift 1727 | lifted 1728 | light 1729 | lighted 1730 | lightly 1731 | lights 1732 | like 1733 | likene 1734 | likeness 1735 | linen 1736 | lingered 1737 | lion 1738 | little 1739 | live 1740 | lived 1741 | lives 1742 | liveth 1743 | living 1744 | lo 1745 | lodge 1746 | lodged 1747 | loins 1748 | long 1749 | longedst 1750 | longeth 1751 | look 1752 | looked 1753 | loose 1754 | lord 1755 | lords 1756 | loss 1757 | loud 1758 | love 1759 | loved 1760 | lovest 1761 | loveth 1762 | lower 1763 | lying 1764 | m 1765 | ma 1766 | made 1767 | magicians 1768 | magnified 1769 | maid 1770 | maiden 1771 | maidservants 1772 | make 1773 | male 1774 | males 1775 | man 1776 | mandrakes 1777 | manner 1778 | many 1779 | mark 1780 | marriages 1781 | married 1782 | marry 1783 | marvelled 1784 | mast 1785 | master 1786 | matter 1787 | may 1788 | mayest 1789 | me 1790 | mead 1791 | meadow 1792 | meal 1793 | mean 1794 | meanest 1795 | meant 1796 | measures 1797 | meat 1798 | meditate 1799 | meet 1800 | meeteth 1801 | men 1802 | menservants 1803 | mention 1804 | merchant 1805 | merchantmen 1806 | mercies 1807 | merciful 1808 | mercy 1809 | merry 1810 | mess 1811 | messenger 1812 | messengers 1813 | messes 1814 | met 1815 | mi 1816 | midst 1817 | midwife 1818 | might 1819 | mightier 1820 | mighty 1821 | milch 1822 | milk 1823 | millions 1824 | mind 1825 | mine 1826 | mirth 1827 | mischief 1828 | mist 1829 | mistress 1830 | mock 1831 | mocked 1832 | mocking 1833 | money 1834 | month 1835 | months 1836 | moon 1837 | more 1838 | moreover 1839 | morever 1840 | morning 1841 | morrow 1842 | morsel 1843 | morter 1844 | most 1845 | mother 1846 | mou 1847 | mount 1848 | mountain 1849 | mountains 1850 | mourn 1851 | mourned 1852 | mourning 1853 | mouth 1854 | mouths 1855 | moved 1856 | moveth 1857 | moving 1858 | much 1859 | mules 1860 | multiplied 1861 | multiply 1862 | multiplying 1863 | multitude 1864 | must 1865 | my 1866 | myrrh 1867 | myself 1868 | n 1869 | na 1870 | naked 1871 | nakedness 1872 | name 1873 | named 1874 | names 1875 | nati 1876 | natio 1877 | nation 1878 | nations 1879 | nativity 1880 | ne 1881 | near 1882 | neck 1883 | needeth 1884 | needs 1885 | neither 1886 | never 1887 | next 1888 | nig 1889 | nigh 1890 | night 1891 | nights 1892 | nine 1893 | nineteen 1894 | ninety 1895 | no 1896 | none 1897 | noon 1898 | nor 1899 | north 1900 | northward 1901 | nostrils 1902 | not 1903 | 
nothing 1904 | nought 1905 | nourish 1906 | nourished 1907 | now 1908 | number 1909 | numbered 1910 | numbering 1911 | nurse 1912 | nuts 1913 | o 1914 | oa 1915 | oak 1916 | oath 1917 | obeisance 1918 | obey 1919 | obeyed 1920 | observed 1921 | obtain 1922 | occasion 1923 | occupation 1924 | of 1925 | off 1926 | offended 1927 | offer 1928 | offered 1929 | offeri 1930 | offering 1931 | offerings 1932 | office 1933 | officer 1934 | officers 1935 | oil 1936 | old 1937 | olive 1938 | on 1939 | one 1940 | ones 1941 | only 1942 | onyx 1943 | open 1944 | opened 1945 | openly 1946 | or 1947 | order 1948 | organ 1949 | oth 1950 | other 1951 | ou 1952 | ought 1953 | our 1954 | ours 1955 | ourselves 1956 | out 1957 | over 1958 | overcome 1959 | overdrive 1960 | overseer 1961 | oversig 1962 | overspread 1963 | overtake 1964 | overthrew 1965 | overthrow 1966 | overtook 1967 | own 1968 | oxen 1969 | parcel 1970 | part 1971 | parted 1972 | parts 1973 | pass 1974 | passed 1975 | past 1976 | pasture 1977 | path 1978 | pea 1979 | peace 1980 | peaceable 1981 | peaceably 1982 | peop 1983 | people 1984 | peradventure 1985 | perceived 1986 | perfect 1987 | perform 1988 | perish 1989 | perpetual 1990 | person 1991 | persons 1992 | physicians 1993 | piece 1994 | pieces 1995 | pigeon 1996 | pilgrimage 1997 | pillar 1998 | pilled 1999 | pillows 2000 | pit 2001 | pitch 2002 | pitched 2003 | pitcher 2004 | pla 2005 | place 2006 | placed 2007 | places 2008 | plagued 2009 | plagues 2010 | plain 2011 | plains 2012 | plant 2013 | planted 2014 | played 2015 | pleasant 2016 | pleased 2017 | pleaseth 2018 | pleasure 2019 | pledge 2020 | plenteous 2021 | plenteousness 2022 | plenty 2023 | pluckt 2024 | point 2025 | poor 2026 | poplar 2027 | portion 2028 | possess 2029 | possessi 2030 | possession 2031 | possessions 2032 | possessor 2033 | posterity 2034 | pottage 2035 | poured 2036 | poverty 2037 | pow 2038 | power 2039 | praise 2040 | pray 2041 | prayed 2042 | precious 2043 | prepared 2044 | presence 2045 | present 2046 | presented 2047 | preserve 2048 | preserved 2049 | pressed 2050 | prevail 2051 | prevailed 2052 | prey 2053 | priest 2054 | priests 2055 | prince 2056 | princes 2057 | pris 2058 | prison 2059 | prisoners 2060 | proceedeth 2061 | process 2062 | profit 2063 | progenitors 2064 | prophet 2065 | prosper 2066 | prospered 2067 | prosperous 2068 | protest 2069 | proved 2070 | provender 2071 | provide 2072 | provision 2073 | pulled 2074 | punishment 2075 | purchase 2076 | purchased 2077 | purposing 2078 | pursue 2079 | pursued 2080 | put 2081 | putting 2082 | quart 2083 | quickly 2084 | quite 2085 | quiver 2086 | raiment 2087 | rain 2088 | rained 2089 | raise 2090 | ram 2091 | rams 2092 | ran 2093 | rank 2094 | raven 2095 | ravin 2096 | reach 2097 | reached 2098 | ready 2099 | reason 2100 | rebelled 2101 | rebuked 2102 | receive 2103 | received 2104 | red 2105 | redeemed 2106 | refrain 2107 | refrained 2108 | refused 2109 | regard 2110 | reign 2111 | reigned 2112 | remained 2113 | remaineth 2114 | remember 2115 | remembered 2116 | remove 2117 | removed 2118 | removing 2119 | renown 2120 | rent 2121 | repented 2122 | repenteth 2123 | replenish 2124 | report 2125 | reproa 2126 | reproach 2127 | reproved 2128 | require 2129 | required 2130 | requite 2131 | reserved 2132 | respect 2133 | rest 2134 | rested 2135 | restore 2136 | restored 2137 | restrained 2138 | return 2139 | returned 2140 | reviv 2141 | reward 2142 | rewarded 2143 | ri 2144 | rib 2145 | ribs 2146 | rich 2147 | riches 2148 | rid 2149 | ride 2150 | rider 
2151 | right 2152 | righteous 2153 | righteousness 2154 | rightly 2155 | ring 2156 | ringstraked 2157 | ripe 2158 | rise 2159 | risen 2160 | riv 2161 | river 2162 | rode 2163 | rods 2164 | roll 2165 | rolled 2166 | roof 2167 | room 2168 | rooms 2169 | rose 2170 | roughly 2171 | round 2172 | rouse 2173 | royal 2174 | rul 2175 | rule 2176 | ruled 2177 | ruler 2178 | rulers 2179 | run 2180 | s 2181 | sa 2182 | sac 2183 | sack 2184 | sackcloth 2185 | sacks 2186 | sacrifice 2187 | sacrifices 2188 | sad 2189 | saddled 2190 | sadly 2191 | said 2192 | saidst 2193 | saith 2194 | sake 2195 | sakes 2196 | salt 2197 | salvation 2198 | same 2199 | sanctified 2200 | sand 2201 | sat 2202 | save 2203 | saved 2204 | saving 2205 | savour 2206 | savoury 2207 | saw 2208 | sawest 2209 | say 2210 | saying 2211 | scarce 2212 | scarlet 2213 | scatter 2214 | scattered 2215 | sceptre 2216 | sea 2217 | searched 2218 | seas 2219 | season 2220 | seasons 2221 | second 2222 | secret 2223 | secretly 2224 | see 2225 | seed 2226 | seedtime 2227 | seeing 2228 | seek 2229 | seekest 2230 | seem 2231 | seemed 2232 | seen 2233 | seest 2234 | seeth 2235 | selfsame 2236 | selfwill 2237 | sell 2238 | send 2239 | sent 2240 | separate 2241 | separated 2242 | sepulchre 2243 | sepulchres 2244 | serpent 2245 | serva 2246 | servan 2247 | servant 2248 | servants 2249 | serve 2250 | served 2251 | service 2252 | set 2253 | seven 2254 | sevenfold 2255 | sevens 2256 | seventeen 2257 | seventeenth 2258 | seventh 2259 | seventy 2260 | sewed 2261 | sh 2262 | shadow 2263 | shall 2264 | shalt 2265 | shamed 2266 | shaved 2267 | she 2268 | sheaf 2269 | shear 2270 | sheaves 2271 | shed 2272 | sheddeth 2273 | sheep 2274 | sheepshearers 2275 | shekel 2276 | shekels 2277 | shepherd 2278 | shepherds 2279 | shew 2280 | shewed 2281 | sheweth 2282 | shield 2283 | ships 2284 | shoelatchet 2285 | shore 2286 | shortly 2287 | shot 2288 | should 2289 | shoulder 2290 | shoulders 2291 | shouldest 2292 | shrank 2293 | shrubs 2294 | shut 2295 | si 2296 | side 2297 | sight 2298 | signet 2299 | signs 2300 | silv 2301 | silver 2302 | sin 2303 | since 2304 | sinew 2305 | sinners 2306 | sinning 2307 | sir 2308 | sist 2309 | sister 2310 | sit 2311 | six 2312 | sixteen 2313 | sixth 2314 | sixty 2315 | skins 2316 | slain 2317 | slaughter 2318 | slay 2319 | slayeth 2320 | sle 2321 | sleep 2322 | slept 2323 | slew 2324 | slime 2325 | slimepits 2326 | small 2327 | smell 2328 | smelled 2329 | smite 2330 | smoke 2331 | smoking 2332 | smooth 2333 | smote 2334 | so 2335 | sod 2336 | softly 2337 | sojourn 2338 | sojourned 2339 | sojourner 2340 | sold 2341 | sole 2342 | solemnly 2343 | some 2344 | son 2345 | songs 2346 | sons 2347 | soon 2348 | sore 2349 | sorely 2350 | sorrow 2351 | sort 2352 | sou 2353 | sought 2354 | soul 2355 | souls 2356 | south 2357 | southward 2358 | sow 2359 | sowed 2360 | space 2361 | spake 2362 | spare 2363 | spe 2364 | speak 2365 | speaketh 2366 | speaking 2367 | speckl 2368 | speckled 2369 | spee 2370 | speech 2371 | speed 2372 | speedily 2373 | spent 2374 | spi 2375 | spicery 2376 | spices 2377 | spies 2378 | spilled 2379 | spirit 2380 | spoil 2381 | spoiled 2382 | spoken 2383 | sporting 2384 | spotted 2385 | spread 2386 | springing 2387 | sprung 2388 | staff 2389 | stalk 2390 | stand 2391 | standest 2392 | stars 2393 | state 2394 | statutes 2395 | stay 2396 | stayed 2397 | ste 2398 | stead 2399 | steal 2400 | steward 2401 | still 2402 | stink 2403 | sto 2404 | stole 2405 | stolen 2406 | stone 2407 | stones 2408 | stood 2409 | stooped 2410 | stopped 
2411 | store 2412 | storehouses 2413 | stories 2414 | straitly 2415 | strakes 2416 | strange 2417 | stranger 2418 | strangers 2419 | straw 2420 | street 2421 | strength 2422 | strengthened 2423 | stretched 2424 | stricken 2425 | strife 2426 | stript 2427 | strive 2428 | strong 2429 | stronger 2430 | strove 2431 | struggled 2432 | stuff 2433 | subdue 2434 | submit 2435 | substance 2436 | subtil 2437 | subtilty 2438 | such 2439 | suck 2440 | suffered 2441 | summer 2442 | sun 2443 | supplanted 2444 | sure 2445 | surely 2446 | surety 2447 | sustained 2448 | sware 2449 | swear 2450 | sweat 2451 | sweet 2452 | sword 2453 | sworn 2454 | tabret 2455 | tak 2456 | take 2457 | taken 2458 | talked 2459 | talking 2460 | tar 2461 | tarried 2462 | tarry 2463 | teeth 2464 | tell 2465 | tempt 2466 | ten 2467 | tender 2468 | tenor 2469 | tent 2470 | tenth 2471 | tents 2472 | terror 2473 | th 2474 | than 2475 | that 2476 | the 2477 | thee 2478 | their 2479 | them 2480 | themselv 2481 | themselves 2482 | then 2483 | thence 2484 | there 2485 | thereby 2486 | therefore 2487 | therein 2488 | thereof 2489 | thereon 2490 | these 2491 | they 2492 | thi 2493 | thicket 2494 | thigh 2495 | thin 2496 | thine 2497 | thing 2498 | things 2499 | think 2500 | third 2501 | thirteen 2502 | thirteenth 2503 | thirty 2504 | this 2505 | thistles 2506 | thither 2507 | thoroughly 2508 | those 2509 | thou 2510 | though 2511 | thought 2512 | thoughts 2513 | thousand 2514 | thousands 2515 | thread 2516 | three 2517 | threescore 2518 | threshingfloor 2519 | throne 2520 | through 2521 | throughout 2522 | thus 2523 | thy 2524 | thyself 2525 | tidings 2526 | till 2527 | tiller 2528 | tillest 2529 | tim 2530 | time 2531 | times 2532 | tithes 2533 | to 2534 | togeth 2535 | together 2536 | toil 2537 | token 2538 | told 2539 | tongue 2540 | tongues 2541 | too 2542 | took 2543 | top 2544 | tops 2545 | torn 2546 | touch 2547 | touched 2548 | toucheth 2549 | touching 2550 | toward 2551 | tower 2552 | towns 2553 | tr 2554 | trade 2555 | traffick 2556 | trained 2557 | travail 2558 | travailed 2559 | treasure 2560 | tree 2561 | trees 2562 | trembled 2563 | trespass 2564 | tribes 2565 | tribute 2566 | troop 2567 | troubled 2568 | trough 2569 | troughs 2570 | tru 2571 | true 2572 | truly 2573 | truth 2574 | turn 2575 | turned 2576 | turtledove 2577 | twel 2578 | twelve 2579 | twentieth 2580 | twenty 2581 | twice 2582 | twins 2583 | two 2584 | unawares 2585 | uncircumcised 2586 | uncovered 2587 | under 2588 | understand 2589 | understood 2590 | ungirded 2591 | unit 2592 | unleavened 2593 | until 2594 | unto 2595 | up 2596 | upon 2597 | uppermost 2598 | upright 2599 | upward 2600 | urged 2601 | us 2602 | utmost 2603 | vagabond 2604 | vail 2605 | vale 2606 | valley 2607 | vengeance 2608 | venison 2609 | verified 2610 | verily 2611 | very 2612 | vessels 2613 | vestures 2614 | victuals 2615 | vine 2616 | vineyard 2617 | violence 2618 | violently 2619 | virgin 2620 | vision 2621 | visions 2622 | visit 2623 | visited 2624 | voi 2625 | voice 2626 | void 2627 | vow 2628 | vowed 2629 | vowedst 2630 | w 2631 | wa 2632 | wages 2633 | wagons 2634 | waited 2635 | walk 2636 | walked 2637 | walketh 2638 | walking 2639 | wall 2640 | wander 2641 | wandered 2642 | wandering 2643 | war 2644 | ward 2645 | was 2646 | wash 2647 | washed 2648 | wast 2649 | wat 2650 | watch 2651 | water 2652 | watered 2653 | watering 2654 | waters 2655 | waxed 2656 | waxen 2657 | way 2658 | ways 2659 | we 2660 | wealth 2661 | weaned 2662 | weapons 2663 | wearied 2664 | weary 2665 | week 2666 
| weep 2667 | weig 2668 | weighed 2669 | weight 2670 | welfare 2671 | well 2672 | wells 2673 | went 2674 | wentest 2675 | wept 2676 | were 2677 | west 2678 | westwa 2679 | whales 2680 | what 2681 | whatsoever 2682 | wheat 2683 | whelp 2684 | when 2685 | whence 2686 | whensoever 2687 | where 2688 | whereby 2689 | wherefore 2690 | wherein 2691 | whereof 2692 | whereon 2693 | wherewith 2694 | whether 2695 | which 2696 | while 2697 | white 2698 | whither 2699 | who 2700 | whole 2701 | whom 2702 | whomsoever 2703 | whoredom 2704 | whose 2705 | whosoever 2706 | why 2707 | wi 2708 | wick 2709 | wicked 2710 | wickedly 2711 | wickedness 2712 | widow 2713 | widowhood 2714 | wife 2715 | wild 2716 | wilderness 2717 | will 2718 | willing 2719 | wilt 2720 | wind 2721 | window 2722 | windows 2723 | wine 2724 | winged 2725 | winter 2726 | wise 2727 | wit 2728 | with 2729 | withered 2730 | withheld 2731 | withhold 2732 | within 2733 | without 2734 | witness 2735 | wittingly 2736 | wiv 2737 | wives 2738 | wo 2739 | wolf 2740 | woman 2741 | womb 2742 | wombs 2743 | women 2744 | womenservan 2745 | womenservants 2746 | wondering 2747 | wood 2748 | wor 2749 | word 2750 | words 2751 | work 2752 | worse 2753 | worship 2754 | worshipped 2755 | worth 2756 | worthy 2757 | wot 2758 | wotteth 2759 | would 2760 | wouldest 2761 | wounding 2762 | wrapped 2763 | wrath 2764 | wrestled 2765 | wrestlings 2766 | wrong 2767 | wroth 2768 | wrought 2769 | y 2770 | ye 2771 | yea 2772 | year 2773 | yearn 2774 | years 2775 | yesternight 2776 | yet 2777 | yield 2778 | yielded 2779 | yielding 2780 | yoke 2781 | yonder 2782 | you 2783 | young 2784 | younge 2785 | younger 2786 | youngest 2787 | your 2788 | yourselves 2789 | youth 2790 | 2789 2791 | 2789 2792 | -------------------------------------------------------------------------------- /5.t2.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/5.t2.pkl -------------------------------------------------------------------------------- /PYTHON 自然语言处理中文翻译.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/PYTHON 自然语言处理中文翻译.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Python_nlp_notes 2 | 这是我《 Python 自然语言处理 中文第二版 》笔记 3 | 4 | 原文在线阅读:https://usyiyi.github.io/nlp-py-2e-zh 5 | 6 | # 【Python自然语言处理】读书笔记:第一章:语言处理与Python 7 | [【Python自然语言处理】读书笔记:第一章:语言处理与Python](https://github.com/JackKuo666/Python_nlp_notes/blob/master/%E3%80%90Python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0%EF%BC%9A%E7%AC%AC%E4%B8%80%E7%AB%A0%EF%BC%9A%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E4%B8%8EPython.md) 8 | 9 | # 【Python自然语言处理】读书笔记:第二章:获得文本语料和词汇资源 10 | [【Python自然语言处理】读书笔记:第二章:获得文本语料和词汇资源](https://github.com/JackKuo666/Python_nlp_notes/blob/master/%E3%80%90Python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0%EF%BC%9A%E7%AC%AC%E4%BA%8C%E7%AB%A0%EF%BC%9A%E8%8E%B7%E5%BE%97%E6%96%87%E6%9C%AC%E8%AF%AD%E6%96%99%E5%92%8C%E8%AF%8D%E6%B1%87%E8%B5%84%E6%BA%90.md) 11 | 12 | # 【Python自然语言处理】读书笔记:第三章:处理原始文本 13 | 
[【Python自然语言处理】读书笔记:第三章:处理原始文本](https://github.com/JackKuo666/Python_nlp_notes/blob/master/%E3%80%90Python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0%EF%BC%9A%E7%AC%AC%E4%B8%89%E7%AB%A0%EF%BC%9A%E5%A4%84%E7%90%86%E5%8E%9F%E5%A7%8B%E6%96%87%E6%9C%AC.ipynb) 14 | 15 | # 【Python自然语言处理】读书笔记:第四章:编写结构化程序 16 | [【Python自然语言处理】读书笔记:第四章:编写结构化程序](https://github.com/JackKuo666/Python_nlp_notes/blob/master/%E3%80%90Python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0%EF%BC%9A%E7%AC%AC%E5%9B%9B%E7%AB%A0%EF%BC%9A%E7%BC%96%E5%86%99%E7%BB%93%E6%9E%84%E5%8C%96%E7%A8%8B%E5%BA%8F.ipynb) 17 | 18 | # 【Python自然语言处理】读书笔记:第五章:分类和标注词汇 19 | [【Python自然语言处理】读书笔记:第五章:分类和标注词汇](https://github.com/JackKuo666/Python_nlp_notes/blob/master/%E3%80%90Python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0%EF%BC%9A%E7%AC%AC%E4%BA%94%E7%AB%A0%EF%BC%9A%E5%88%86%E7%B1%BB%E5%92%8C%E6%A0%87%E6%B3%A8%E8%AF%8D%E6%B1%87.ipynb) 20 | 21 | # 【Python自然语言处理】读书笔记:第六章:学习分类文本 22 | [【Python自然语言处理】读书笔记:第六章:学习分类文本](https://github.com/JackKuo666/Python_nlp_notes/blob/master/%E3%80%90Python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0%EF%BC%9A%E7%AC%AC%E5%85%AD%E7%AB%A0%EF%BC%9A%E5%AD%A6%E4%B9%A0%E5%88%86%E7%B1%BB%E6%96%87%E6%9C%AC.ipynb) 23 | 24 | # 【Python自然语言处理】读书笔记:第七章:从文本提取信息 25 | [【Python自然语言处理】读书笔记:第七章:从文本提取信息](https://github.com/JackKuo666/Python_nlp_notes/blob/master/%E3%80%90Python%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%AF%BB%E4%B9%A6%E7%AC%94%E8%AE%B0%EF%BC%9A%E7%AC%AC%E4%B8%83%E7%AB%A0%EF%BC%9A%E4%BB%8E%E6%96%87%E6%9C%AC%E6%8F%90%E5%8F%96%E4%BF%A1%E6%81%AF.ipynb) 26 | 27 | ----------------------------------未完待续------------------------- 28 | 29 | # 更多NLP知识请访问: 30 | 31 | 我的主页:https://jackkuo666.github.io/ 32 | 33 | 我的博客:https://blog.csdn.net/weixin_37251044 34 | -------------------------------------------------------------------------------- /picture/1.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/1.1.png -------------------------------------------------------------------------------- /picture/2.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/2.1.png -------------------------------------------------------------------------------- /picture/2.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/2.2.png -------------------------------------------------------------------------------- /picture/2.3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/2.3.png -------------------------------------------------------------------------------- /picture/2.4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/2.4.png 
-------------------------------------------------------------------------------- /picture/3.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/3.1.png -------------------------------------------------------------------------------- /picture/3.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/3.2.png -------------------------------------------------------------------------------- /picture/3.3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/3.3.png -------------------------------------------------------------------------------- /picture/4.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/4.1.png -------------------------------------------------------------------------------- /picture/4.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/4.2.png -------------------------------------------------------------------------------- /picture/5.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/5.1.png -------------------------------------------------------------------------------- /picture/6.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/6.1.png -------------------------------------------------------------------------------- /picture/6.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/6.2.png -------------------------------------------------------------------------------- /picture/6.3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/6.3.png -------------------------------------------------------------------------------- /picture/6.4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/6.4.png -------------------------------------------------------------------------------- /picture/6.5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/6.5.png -------------------------------------------------------------------------------- /picture/6.6.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/6.6.png -------------------------------------------------------------------------------- /picture/6.7.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/6.7.png -------------------------------------------------------------------------------- /picture/7.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/7.1.png -------------------------------------------------------------------------------- /picture/7.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/7.2.png -------------------------------------------------------------------------------- /picture/7.3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/7.3.png -------------------------------------------------------------------------------- /picture/7.4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/7.4.png -------------------------------------------------------------------------------- /picture/7.5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/JackKuo666/Python_nlp_notes/0ff8c305ed8e827b5842d8bf8be43d73d613b11e/picture/7.5.png -------------------------------------------------------------------------------- /【Python自然语言处理】读书笔记:第一章:语言处理与Python.md: -------------------------------------------------------------------------------- 1 | 原书:《Python自然语言处理》:https://usyiyi.github.io/nlp-py-2e-zh/ 2 | # 语言处理与Python 3 | 原文:https://usyiyi.github.io/nlp-py-2e-zh/1.html 4 | # 1.NLTK入门 5 | ## 1.NKLT的安装,nltk.book的安装 6 | ## 2.搜索文本 7 | ```py 8 | text1.concordance("monstrous") # 搜索文本text1中含有“monstrous”的句子 9 | text1.similar("monstrous") # 搜索文本text1中与“monstrous”相似的单词 10 | text2.common_contexts(["monstrous", "very"]) # 搜索文本text2中两个单词共同的上下文 11 | text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"]) # 显示在文本text4中各个单词的使用频率 12 | ``` 13 | ## 3.词汇计数 14 | ```py 15 | len(text3) # 文本text3的符号总数 16 | sorted(set(text3)) # 不重复的符号排序 17 | len(set(text3)) # 不重复的符号总数 18 | len(set(text3)) / len(text3) # 词汇丰富度:不重复符号占总符号6%,或者:每个单词平均使用16词 19 | text3.count("smote") # 文本中“smote”的计数 20 | def lexivcal_diversity(text): # 计算词汇丰富度 21 | return len(set(text))/len(text) 22 | def percentage(word,text): # 计算词word在文本中出现的频率 23 | return 100*text.count(word)/len(text) 24 | 25 | ``` 26 | ## 4.索引列表 27 | ```py 28 | >>> text4[173] 29 | 'awaken' 30 | >>> 31 | ``` 32 | ```py 33 | >>> text4.index('awaken') 34 | 173 35 | >>> 36 | >>> sent[5:8] 37 | ['word6', 'word7', 'word8'] 38 | ``` 39 | # 5.字符串与列表的相互转换 40 | ```py 41 | >>> ' '.join(['Monty', 'Python']) 42 | 'Monty Python' 43 | >>> 'Monty Python'.split() 44 | ['Monty', 'Python'] 45 | >>> 46 | ``` 47 | # 6.词频分布 48 | ```py 49 | >>> fdist1 = FreqDist(text1) # 计算text1的每个符号的词频 50 | >>> print(fdist1) 51 | 52 | >>> 
fdist1.most_common(50) [3] 53 | [(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), 54 | ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982), 55 | ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124), 56 | ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632), 57 | ('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280), 58 | ('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103), 59 | ('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005), 60 | ('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767), 61 | ('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680), 62 | ('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)] 63 | >>> fdist1['whale'] 64 | 906 65 | >>> 66 | ``` 67 | ```py 68 | fdist1.plot(50, cumulative=True) # 50个常用词的累计频率图 69 | ``` 70 | ![在这里插入图片描述](./picture/1.1.png) 71 | 72 | ```py 73 | fdist1.hapaxes() # 返回词频为1的词 74 | ``` 75 | # 7.细粒度的选择词 76 | 选出长度大于15的单词 77 | ```py 78 | sorted(w for w in set(text1) if len(w) > 15) 79 | ['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 80 | ``` 81 | 82 | 选出长度大于7且词频大于7的单词 83 | ```py 84 | sorted(w for w in set(text5) if len(w) > 7 and FreqDist(text5)[w] > 7) 85 | ['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question', 86 | ``` 87 | 提取词汇中的次对 88 | ```py 89 | >>> list(bigrams(['more', 'is', 'said', 'than', 'done'])) 90 | [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')] 91 | ``` 92 | 提取文本中的频繁出现的双连词 93 | ```py 94 | >>> text4.collocations() 95 | United States; fellow citizens; four years; years ago; Federal 96 | Government; General Government; American people; Vice President; Old 97 | ``` 98 | # 8.查看文本中词长的分布 99 | ```py 100 | >>> [len(w) for w in text1] # 文本中每个词的长度 101 | [1, 4, 4, 2, 6, 8, 4, 1, 9, 1, 1, 8, 2, 1, 4, 11, 5, 2, 1, 7, 6, 1, 3, 4, 5, 2, ...] 102 | >>> fdist = FreqDist(len(w) for w in text1) # 文本中词长的频数 103 | >>> print(fdist) [3] 104 | 105 | >>> fdist 106 | FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399, 107 | 8: 9966, 9: 6428, 10: 3528, ...}) 108 | >>> 109 | ``` 110 | ```py 111 | >>> fdist.most_common() 112 | [(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399), 113 | (8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177), 114 | (15, 70), (16, 22), (17, 12), (18, 1), (20, 1)] 115 | >>> fdist.max() 116 | 3 117 | >>> fdist[3] 118 | 50223 119 | >>> fdist.freq(3) # 词频中词长为“3”的频率 120 | 0.19255882431878046 121 | >>> 122 | ``` 123 | # 9.```[w for w in text if condition ]```模式 124 | 选出以```ableness```结尾的单词 125 | ```py 126 | >>> sorted(w for w in set(text1) if w.endswith('ableness')) 127 | ['comfortableness', 'honourableness', 'immutableness', 'indispensableness', ...] 128 | ``` 129 | 选出含有```gnt```的单词 130 | ```py 131 | >>> sorted(term for term in set(text4) if 'gnt' in term) 132 | ['Sovereignty', 'sovereignties', 'sovereignty'] 133 | ``` 134 | 选出以**大写字母**开头的单词 135 | ```py 136 | >>> sorted(item for item in set(text6) if item.istitle()) 137 | ['A', 'Aaaaaaaaah', 'Aaaaaaaah', 'Aaaaaah', 'Aaaah', 'Aaaaugh', 'Aaagh', ...] 
138 | ``` 139 | 选出**数字** 140 | ```py 141 | >>> sorted(item for item in set(sent7) if item.isdigit()) 142 | ['29', '61'] 143 | >>> 144 | ``` 145 | 选出**全部小写字母**的单词 146 | ```py 147 | sorted(w for w in set(sent7) if not w.islower()) 148 | ``` 149 | 将单词变为**全部大写字母** 150 | ```py 151 | >>> [w.upper() for w in text1] 152 | ['[', 'MOBY', 'DICK', 'BY', 'HERMAN', 'MELVILLE', '1851', ']', 'ETYMOLOGY', '.', ...] 153 | >>> 154 | ``` 155 | 将text1中过滤掉不是字母的,然后全部转换成小写,然后去重,然后计数 156 | ```py 157 | >>> len(set(word.lower() for word in text1 if word.isalpha())) 158 | 16948 159 | ``` 160 | # 10.条件循环 161 | 这里可以不换行打印```print(word, end=' ')``` 162 | ```py 163 | >>> tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w) 164 | >>> for word in tricky: 165 | ... print(word, end=' ') 166 | ancient ceiling conceit conceited conceive conscience 167 | conscientious conscientiously deceitful deceive ... 168 | >>> 169 | ``` 170 | # 11.作业 171 | 计算词频,以百分比表示 172 | ```py 173 | 174 | >>> def percent(word, text): 175 | ... return 100*text.count(word)/len([w for w in text if w.isalpha()]) 176 | >>> percent(",", text1) 177 | 8.569753756394228 178 | ``` 179 | 计算文本词汇量 180 | ```py 181 | >>> def vocab_size(text): 182 | ... return len(set(w.lower() for w in text if w.isalpha())) 183 | >>> vocab_size(text1) 184 | 16948 185 | 186 | ``` 187 | -------------------------------------------------------------------------------- /【Python自然语言处理】读书笔记:第七章:从文本提取信息.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "本章原文:https://usyiyi.github.io/nlp-py-2e-zh/7.html\n", 8 | "\n", 9 | " 1.我们如何能构建一个系统,从非结构化文本中提取结构化数据如表格?\n", 10 | " 2.有哪些稳健的方法识别一个文本中描述的实体和关系?\n", 11 | " 3.哪些语料库适合这项工作,我们如何使用它们来训练和评估我们的模型?\n", 12 | "\n", 13 | "**分块** 和 **命名实体识别**。" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "# 1 信息提取\n", 21 | "\n", 22 | "信息有很多种形状和大小。一个重要的形式是结构化数据:实体和关系的可预测的规范的结构。例如,我们可能对公司和地点之间的关系感兴趣。给定一个公司,我们希望能够确定它做业务的位置;反过来,给定位置,我们会想发现哪些公司在该位置做业务。如果我们的数据是表格形式,如1.1中的例子,那么回答这些问题就很简单了。\n", 23 | "\n" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 1, 29 | "metadata": {}, 30 | "outputs": [ 31 | { 32 | "name": "stdout", 33 | "output_type": "stream", 34 | "text": [ 35 | "['BBDO South', 'Georgia-Pacific']\n" 36 | ] 37 | } 38 | ], 39 | "source": [ 40 | "locs = [('Omnicom', 'IN', 'New York'),\n", 41 | " ('DDB Needham', 'IN', 'New York'),\n", 42 | " ('Kaplan Thaler Group', 'IN', 'New York'),\n", 43 | " ('BBDO South', 'IN', 'Atlanta'),\n", 44 | " ('Georgia-Pacific', 'IN', 'Atlanta')]\n", 45 | "query = [e1 for (e1, rel, e2) in locs if e2=='Atlanta']\n", 46 | "print(query)" 47 | ] 48 | }, 49 | { 50 | "cell_type": "markdown", 51 | "metadata": {}, 52 | "source": [ 53 | "我们可以定义一个函数,简单地连接 NLTK 中默认的句子分割器[1],分词器[2]和词性标注器[3]:" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 2, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "def ie_preprocess(document):\n", 63 | " sentences = nltk.sent_tokenize(document) # [1] 句子分割器\n", 64 | " sentences = [nltk.word_tokenize(sent) for sent in sentences] # [2] 分词器\n", 65 | " sentences = [nltk.pos_tag(sent) for sent in sentences] # [3] 词性标注器" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 3, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "import nltk, re, pprint" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "# 2 
词块划分\n", 82 | "\n", 83 | "我们将用于实体识别的基本技术是词块划分,它分割和标注多词符的序列,如2.1所示。小框显示词级分词和词性标注,大框显示高级别的词块划分。每个这种较大的框叫做一个词块。就像分词忽略空白符,词块划分通常选择词符的一个子集。同样像分词一样,词块划分器生成的片段在源文本中不能重叠。\n", 84 | "![7.1.png](./picture/7.1.png)\n", 85 | "\n", 86 | "在本节中,我们将在较深的层面探讨词块划分,以**词块**的定义和表示开始。我们将看到**正则表达式**和**N-gram**的方法来词块划分,使用CoNLL-2000词块划分语料库**开发**和**评估词块划分器**。我们将在(5)和6回到**命名实体识别**和**关系抽取**的任务。" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "## 2.1 名词短语词块划分\n", 94 | "\n", 95 | "我们将首先思考名词短语词块划分或NP词块划分任务,在那里我们寻找单独名词短语对应的词块。例如,这里是一些《华尔街日报》文本,其中的NP词块用方括号标记:" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 5, 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "name": "stdout", 105 | "output_type": "stream", 106 | "text": [ 107 | "(S\n", 108 | " (NP the/DT little/JJ yellow/JJ dog/NN)\n", 109 | " barked/VBD\n", 110 | " at/IN\n", 111 | " (NP the/DT cat/NN))\n" 112 | ] 113 | } 114 | ], 115 | "source": [ 116 | "sentence = [(\"the\", \"DT\"), (\"little\", \"JJ\"), (\"yellow\", \"JJ\"), \n", 117 | " (\"dog\", \"NN\"), (\"barked\", \"VBD\"), (\"at\", \"IN\"), (\"the\", \"DT\"), (\"cat\", \"NN\")]\n", 118 | "\n", 119 | "grammar = \"NP: {
?*}\" \n", 120 | "cp = nltk.RegexpParser(grammar) \n", 121 | "result = cp.parse(sentence) \n", 122 | "print(result) \n", 123 | "result.draw()" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "![7.2.png](./picture/7.2.png)" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "## 2.2 标记模式\n", 138 | "\n", 139 | "组成一个词块语法的规则使用标记模式来描述已标注的词的序列。一个标记模式是一个词性标记序列,用尖括号分隔,如\n", 140 | "```\n", 141 | "
?*\n", 142 | "```\n", 143 | "标记模式类似于正则表达式模式(3.4)。现在,思考下面的来自《华尔街日报》的名词短语:\n", 144 | "```py\n", 145 | "another/DT sharp/JJ dive/NN\n", 146 | "trade/NN figures/NNS\n", 147 | "any/DT new/JJ policy/NN measures/NNS\n", 148 | "earlier/JJR stages/NNS\n", 149 | "Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP\n", 150 | "```" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "## 2.3 用正则表达式进行词块划分\n", 158 | "\n", 159 | "要找到一个给定的句子的词块结构,RegexpParser词块划分器以一个没有词符被划分的平面结构开始。词块划分规则轮流应用,依次更新词块结构。一旦所有的规则都被调用,返回生成的词块结构。\n", 160 | "\n", 161 | "2.3显示了一个由2个规则组成的简单的词块语法。第一条规则匹配一个可选的限定词或所有格代名词,零个或多个形容词,然后跟一个名词。第二条规则匹配一个或多个专有名词。我们还定义了一个进行词块划分的例句[1],并在此输入上运行这个词块划分器[2]。" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 7, 167 | "metadata": {}, 168 | "outputs": [ 169 | { 170 | "name": "stdout", 171 | "output_type": "stream", 172 | "text": [ 173 | "(S\n", 174 | " (NP Rapunzel/NNP)\n", 175 | " let/VBD\n", 176 | " down/RP\n", 177 | " (NP her/PP$ long/JJ golden/JJ hair/NN))\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "grammar = r\"\"\"\n", 183 | "NP: {?*} \n", 184 | "{+}\n", 185 | "\"\"\"\n", 186 | "cp = nltk.RegexpParser(grammar)\n", 187 | "sentence = [(\"Rapunzel\", \"NNP\"), (\"let\", \"VBD\"), (\"down\", \"RP\"), (\"her\", \"PP$\"), (\"long\", \"JJ\"), (\"golden\", \"JJ\"), (\"hair\", \"NN\")]\n", 188 | "print (cp.parse(sentence))" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "注意\n", 196 | "\n", 197 | "```\n", 198 | "$\n", 199 | "```\n", 200 | "符号是正则表达式中的一个特殊字符,必须使用反斜杠转义来匹配\n", 201 | "```\n", 202 | "PP\\$\n", 203 | "```\n", 204 | "标记。\n", 205 | "\n", 206 | "如果标记模式匹配位置重叠,最左边的匹配优先。例如,如果我们应用一个匹配两个连续的名词文本的规则到一个包含三个连续的名词的文本,则只有前两个名词将被划分:" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": 8, 212 | "metadata": {}, 213 | "outputs": [ 214 | { 215 | "name": "stdout", 216 | "output_type": "stream", 217 | "text": [ 218 | "(S (NP money/NN market/NN) fund/NN)\n" 219 | ] 220 | } 221 | ], 222 | "source": [ 223 | "nouns = [(\"money\", \"NN\"), (\"market\", \"NN\"), (\"fund\", \"NN\")]\n", 224 | "grammar = \"NP: {} # Chunk two consecutive nouns\"\n", 225 | "cp = nltk.RegexpParser(grammar)\n", 226 | "print(cp.parse(nouns))" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "## 2.4 探索文本语料库\n", 234 | "\n", 235 | "在2中,我们看到了我们如何在已标注的语料库中提取匹配的特定的词性标记序列的短语。我们可以使用词块划分器更容易的做同样的工作,如下:" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 12, 241 | "metadata": {}, 242 | "outputs": [ 243 | { 244 | "name": "stdout", 245 | "output_type": "stream", 246 | "text": [ 247 | "(CHUNK combined/VBN to/TO achieve/VB)\n", 248 | "(CHUNK continue/VB to/TO place/VB)\n", 249 | "(CHUNK serve/VB to/TO protect/VB)\n" 250 | ] 251 | } 252 | ], 253 | "source": [ 254 | "cp = nltk.RegexpParser('CHUNK: { }')\n", 255 | "brown = nltk.corpus.brown\n", 256 | "count = 0\n", 257 | "for sent in brown.tagged_sents():\n", 258 | " tree = cp.parse(sent)\n", 259 | " for subtree in tree.subtrees():\n", 260 | " if subtree.label() == 'CHUNK': print(subtree)\n", 261 | " count += 1\n", 262 | " if count >= 30: break" 263 | ] 264 | }, 265 | { 266 | "cell_type": "markdown", 267 | "metadata": {}, 268 | "source": [ 269 | "## 2.5 词缝加塞\n", 270 | "\n", 271 | "有时定义我们想从一个词块中排除什么比较容易。我们可以定义词缝为一个不包含在词块中的一个词符序列。在下面的例子中,barked/VBD at/IN是一个词缝:\n", 272 | "```\n", 273 | "[ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]\n", 274 | "```\n" 
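上面 2.1–2.4 节若干词块语法字符串里的尖括号标记模式(如 <DT>、<JJ>、<NN>)显示不完整,只剩下 `?`、`*`、`{}` 等残片。下面按 NLTK 原书第 7 章的标准写法补出这些语法的完整形式,仅供对照参考,并非原笔记内容,细节可能与作者原稿略有出入:

```py
import nltk

# 2.1 节:限定词 + 若干形容词 + 名词
grammar = "NP: {<DT>?<JJ>*<NN>}"

# 2.3 节:两条规则的 NP 语法(PP$ 中的 $ 需要用反斜杠转义)
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # 限定词/所有格、若干形容词和名词
      {<NNP>+}                # 连续的专有名词
"""

# 2.3 节末尾:匹配两个连续名词
grammar = "NP: {<NN><NN>}  # Chunk two consecutive nouns"

# 2.4 节:在布朗语料库中搜索 “动词 to 动词” 模式
cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')

# 后文 4.1 节的四级级联词块语法,按原书写法应为:
grammar = r"""
  NP: {<DT|JJ|NN.*>+}           # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}                # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$}  # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}            # Chunk NP, VP
  """
```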
275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [ 281 | "## 2.6 词块的表示:标记与树\n", 282 | "\n", 283 | "作为标注和分析之间的中间状态(8.,词块结构可以使用标记或树来表示。最广泛的文件表示使用IOB标记。在这个方案中,每个词符被三个特殊的词块标记之一标注,I(内部),O(外部)或B(开始)。一个词符被标注为B,如果它标志着一个词块的开始。块内的词符子序列被标注为I。所有其他的词符被标注为O。B和I标记后面跟着词块类型,如B-NP, I-NP。当然,没有必要指定出现在词块外的词符类型,所以这些都只标注为O。这个方案的例子如2.5所示。\n", 284 | "![7.3.png](./picture/7.3.png)\n", 285 | "\n", 286 | "IOB标记已成为文件中表示词块结构的标准方式,我们也将使用这种格式。下面是2.5中的信息如何出现在一个文件中的:\n", 287 | "```\n", 288 | "We PRP B-NP\n", 289 | "saw VBD O\n", 290 | "the DT B-NP\n", 291 | "yellow JJ I-NP\n", 292 | "dog NN I-NP\n", 293 | "```" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "# 3 开发和评估词块划分器\n", 301 | "\n", 302 | "现在你对分块的作用有了一些了解,但我们并没有解释如何评估词块划分器。和往常一样,这需要一个合适的已标注语料库。我们一开始寻找将IOB格式转换成NLTK树的机制,然后是使用已化分词块的语料库如何在一个更大的规模上做这个。我们将看到如何为一个词块划分器相对一个语料库的准确性打分,再看看一些数据驱动方式搜索NP词块。我们整个的重点在于扩展一个词块划分器的覆盖范围。\n", 303 | "## 3.1 读取IOB格式与CoNLL2000语料库\n", 304 | "\n", 305 | "使用corpus模块,我们可以加载已经标注并使用IOB符号划分词块的《华尔街日报》文本。这个语料库提供的词块类型有NP,VP和PP。正如我们已经看到的,每个句子使用多行表示,如下所示:\n", 306 | "```\n", 307 | "he PRP B-NP\n", 308 | "accepted VBD B-VP\n", 309 | "the DT B-NP\n", 310 | "position NN I-NP\n", 311 | "...\n", 312 | "```\n", 313 | "![7.4.png](./picture/7.4.png)\n", 314 | "我们可以使用NLTK的corpus模块访问较大量的已经划分词块的文本。CoNLL2000语料库包含27万词的《华尔街日报文本》,分为“训练”和“测试”两部分,标注有词性标记和IOB格式词块标记。我们可以使用nltk.corpus.conll2000访问这些数据。下面是一个读取语料库的“训练”部分的第100个句子的例子:" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 14, 320 | "metadata": {}, 321 | "outputs": [ 322 | { 323 | "name": "stdout", 324 | "output_type": "stream", 325 | "text": [ 326 | "(S\n", 327 | " (PP Over/IN)\n", 328 | " (NP a/DT cup/NN)\n", 329 | " (PP of/IN)\n", 330 | " (NP coffee/NN)\n", 331 | " ,/,\n", 332 | " (NP Mr./NNP Stone/NNP)\n", 333 | " (VP told/VBD)\n", 334 | " (NP his/PRP$ story/NN)\n", 335 | " ./.)\n" 336 | ] 337 | } 338 | ], 339 | "source": [ 340 | "from nltk.corpus import conll2000\n", 341 | "print(conll2000.chunked_sents('train.txt')[99])" 342 | ] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "metadata": {}, 347 | "source": [ 348 | "正如你看到的,CoNLL2000语料库包含三种词块类型:NP词块,我们已经看到了;VP词块如has already delivered;PP块如because of。因为现在我们唯一感兴趣的是NP词块,我们可以使用chunk_types参数选择它们:" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": 15, 354 | "metadata": {}, 355 | "outputs": [ 356 | { 357 | "name": "stdout", 358 | "output_type": "stream", 359 | "text": [ 360 | "(S\n", 361 | " Over/IN\n", 362 | " (NP a/DT cup/NN)\n", 363 | " of/IN\n", 364 | " (NP coffee/NN)\n", 365 | " ,/,\n", 366 | " (NP Mr./NNP Stone/NNP)\n", 367 | " told/VBD\n", 368 | " (NP his/PRP$ story/NN)\n", 369 | " ./.)\n" 370 | ] 371 | } 372 | ], 373 | "source": [ 374 | "print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": {}, 380 | "source": [ 381 | "## 3.2 简单的评估和基准\n", 382 | "\n", 383 | "现在,我们可以访问一个已划分词块语料,可以评估词块划分器。我们开始为没有什么意义的词块解析器cp建立一个基准,它不划分任何词块:" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": 16, 389 | "metadata": {}, 390 | "outputs": [ 391 | { 392 | "name": "stdout", 393 | "output_type": "stream", 394 | "text": [ 395 | "ChunkParse score:\n", 396 | " IOB Accuracy: 43.4%%\n", 397 | " Precision: 0.0%%\n", 398 | " Recall: 0.0%%\n", 399 | " F-Measure: 0.0%%\n" 400 | ] 401 | } 402 | ], 403 | "source": [ 404 | "from nltk.corpus import conll2000\n", 405 | "cp = nltk.RegexpParser(\"\")\n", 406 | "test_sents = 
conll2000.chunked_sents('test.txt', chunk_types=['NP'])\n", 407 | "print(cp.evaluate(test_sents))" 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "IOB标记准确性表明超过三分之一的词被标注为O,即没有在NP词块中。然而,由于我们的标注器没有找到任何词块,其精度、召回率和F-度量均为零。现在让我们尝试一个初级的正则表达式词块划分器,查找以名词短语标记的特征字母开头的标记(如CD, DT和JJ)。" 415 | ] 416 | }, 417 | { 418 | "cell_type": "code", 419 | "execution_count": 17, 420 | "metadata": {}, 421 | "outputs": [ 422 | { 423 | "name": "stdout", 424 | "output_type": "stream", 425 | "text": [ 426 | "ChunkParse score:\n", 427 | " IOB Accuracy: 87.7%%\n", 428 | " Precision: 70.6%%\n", 429 | " Recall: 67.8%%\n", 430 | " F-Measure: 69.2%%\n" 431 | ] 432 | } 433 | ], 434 | "source": [ 435 | "grammar = r\"NP: {<[CDJNP].*>+}\"\n", 436 | "cp = nltk.RegexpParser(grammar)\n", 437 | "print(cp.evaluate(test_sents))" 438 | ] 439 | }, 440 | { 441 | "cell_type": "markdown", 442 | "metadata": {}, 443 | "source": [ 444 | "正如你看到的,这种方法达到相当好的结果。但是,我们可以采用更多数据驱动的方法改善它,在这里我们使用训练语料找到对每个词性标记最有可能的块标记(I, O或B)。换句话说,我们可以使用一元标注器(4)建立一个词块划分器。但不是尝试确定每个词的正确的词性标记,而是根据每个词的词性标记,尝试确定正确的词块标记。\n", 445 | "\n", 446 | "在3.1中,我们定义了UnigramChunker类,使用一元标注器给句子加词块标记。这个类的大部分代码只是用来在NLTK 的ChunkParserI接口使用的词块树表示和嵌入式标注器使用的IOB表示之间镜像转换。类定义了两个方法:一个构造函数[1],当我们建立一个新的UnigramChunker时调用;以及parse方法[3],用来给新句子划分词块。" 447 | ] 448 | }, 449 | { 450 | "cell_type": "code", 451 | "execution_count": 18, 452 | "metadata": {}, 453 | "outputs": [], 454 | "source": [ 455 | "class UnigramChunker(nltk.ChunkParserI):\n", 456 | " def __init__(self, train_sents): \n", 457 | " train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]\n", 458 | " for sent in train_sents]\n", 459 | " self.tagger = nltk.UnigramTagger(train_data) \n", 460 | "\n", 461 | " def parse(self, sentence): \n", 462 | " pos_tags = [pos for (word,pos) in sentence]\n", 463 | " tagged_pos_tags = self.tagger.tag(pos_tags)\n", 464 | " chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]\n", 465 | " conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)\n", 466 | " in zip(sentence, chunktags)]\n", 467 | " return nltk.chunk.conlltags2tree(conlltags)" 468 | ] 469 | }, 470 | { 471 | "cell_type": "markdown", 472 | "metadata": {}, 473 | "source": [ 474 | "构造函数[1]需要训练句子的一个列表,这将是词块树的形式。它首先将训练数据转换成适合训练标注器的形式,使用tree2conlltags映射每个词块树到一个word,tag,chunk三元组的列表。然后使用转换好的训练数据训练一个一元标注器,并存储在self.tagger供以后使用。\n", 475 | "\n", 476 | "parse方法[3]接收一个已标注的句子作为其输入,以从那句话提取词性标记开始。它然后使用在构造函数中训练过的标注器self.tagger,为词性标记标注IOB词块标记。接下来,它提取词块标记,与原句组合,产生conlltags。最后,它使用conlltags2tree将结果转换成一个词块树。\n", 477 | "\n", 478 | "现在我们有了UnigramChunker,可以使用CoNLL2000语料库训练它,并测试其表现:" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": 20, 484 | "metadata": {}, 485 | "outputs": [ 486 | { 487 | "name": "stdout", 488 | "output_type": "stream", 489 | "text": [ 490 | "ChunkParse score:\n", 491 | " IOB Accuracy: 92.9%%\n", 492 | " Precision: 79.9%%\n", 493 | " Recall: 86.8%%\n", 494 | " F-Measure: 83.2%%\n" 495 | ] 496 | } 497 | ], 498 | "source": [ 499 | "test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])\n", 500 | "train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])\n", 501 | "unigram_chunker = UnigramChunker(train_sents)\n", 502 | "print(unigram_chunker.evaluate(test_sents))" 503 | ] 504 | }, 505 | { 506 | "cell_type": "markdown", 507 | "metadata": {}, 508 | "source": [ 509 | "这个分块器相当不错,达到整体F-度量83%的得分。让我们来看一看通过使用一元标注器分配一个标记给每个语料库中出现的词性标记,它学到了什么:" 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": 21, 515 | "metadata": {}, 
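补充一个小例子,演示上面 UnigramChunker 中用到的 tree2conlltags 与 conlltags2tree 在词块树和 IOB 三元组之间的互相转换(示意代码,非原笔记内容,输出以实际运行为准):

```py
import nltk
from nltk.corpus import conll2000

# 取一个只保留 NP 词块的句子,转成 (词, 词性, IOB词块标记) 三元组列表
sent = conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99]
iob = nltk.chunk.tree2conlltags(sent)
print(iob[:5])   # 形如 [('Over', 'IN', 'O'), ('a', 'DT', 'B-NP'), ('cup', 'NN', 'I-NP'), ...]

# 再从 IOB 三元组还原成词块树
tree = nltk.chunk.conlltags2tree(iob)
print(tree)
```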
516 | "outputs": [ 517 | { 518 | "name": "stdout", 519 | "output_type": "stream", 520 | "text": [ 521 | "[('#', 'B-NP'), ('$', 'B-NP'), (\"''\", 'O'), ('(', 'O'), (')', 'O'), (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'), ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'), ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'), ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'), ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'), ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'), ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'), ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'), ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]\n" 522 | ] 523 | } 524 | ], 525 | "source": [ 526 | "postags = sorted(set(pos for sent in train_sents\n", 527 | " for (word,pos) in sent.leaves()))\n", 528 | "print(unigram_chunker.tagger.tag(postags))" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "metadata": {}, 534 | "source": [ 535 | "它已经发现大多数标点符号出现在NP词块外,除了两种货币符号#和*\\$*。它也发现限定词(DT)和所有格(PRP*\\$*和WP$)出现在NP词块的开头,而名词类型(NN, NNP, NNPS,NNS)大多出现在NP词块内。\n", 536 | "\n", 537 | "建立了一个一元分块器,很容易建立一个二元分块器:我们只需要改变类的名称为BigramChunker,修改3.1行[2]构造一个BigramTagger而不是UnigramTagger。由此产生的词块划分器的性能略高于一元词块划分器:" 538 | ] 539 | }, 540 | { 541 | "cell_type": "code", 542 | "execution_count": 24, 543 | "metadata": {}, 544 | "outputs": [ 545 | { 546 | "name": "stdout", 547 | "output_type": "stream", 548 | "text": [ 549 | "ChunkParse score:\n", 550 | " IOB Accuracy: 93.3%%\n", 551 | " Precision: 82.3%%\n", 552 | " Recall: 86.8%%\n", 553 | " F-Measure: 84.5%%\n" 554 | ] 555 | } 556 | ], 557 | "source": [ 558 | "class BigramChunker(nltk.ChunkParserI):\n", 559 | " def __init__(self, train_sents): \n", 560 | " train_data = [[(t,c) for w,t,c in nltk.chunk.tree2conlltags(sent)]\n", 561 | " for sent in train_sents]\n", 562 | " self.tagger = nltk.BigramTagger(train_data)\n", 563 | "\n", 564 | " def parse(self, sentence): \n", 565 | " pos_tags = [pos for (word,pos) in sentence]\n", 566 | " tagged_pos_tags = self.tagger.tag(pos_tags)\n", 567 | " chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]\n", 568 | " conlltags = [(word, pos, chunktag) for ((word,pos),chunktag)\n", 569 | " in zip(sentence, chunktags)]\n", 570 | " return nltk.chunk.conlltags2tree(conlltags)\n", 571 | "bigram_chunker = BigramChunker(train_sents)\n", 572 | "print(bigram_chunker.evaluate(test_sents))" 573 | ] 574 | }, 575 | { 576 | "cell_type": "markdown", 577 | "metadata": {}, 578 | "source": [ 579 | "## 3.3 训练基于分类器的词块划分器\n", 580 | "\n", 581 | "无论是基于正则表达式的词块划分器还是n-gram词块划分器,决定创建什么词块完全基于词性标记。然而,有时词性标记不足以确定一个句子应如何划分词块。例如,考虑下面的两个语句:" 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": 31, 587 | "metadata": {}, 588 | "outputs": [], 589 | "source": [ 590 | "class ConsecutiveNPChunkTagger(nltk.TaggerI): \n", 591 | "\n", 592 | " def __init__(self, train_sents):\n", 593 | " train_set = []\n", 594 | " for tagged_sent in train_sents:\n", 595 | " untagged_sent = nltk.tag.untag(tagged_sent)\n", 596 | " history = []\n", 597 | " for i, (word, tag) in enumerate(tagged_sent):\n", 598 | " featureset = npchunk_features(untagged_sent, i, history) \n", 599 | " train_set.append( (featureset, tag) )\n", 600 | " history.append(tag)\n", 601 | " self.classifier = nltk.MaxentClassifier.train( \n", 602 | " train_set, algorithm='megam', trace=0)\n", 603 | "\n", 604 | " def tag(self, sentence):\n", 605 | " history = []\n", 606 | " for i, word in 
enumerate(sentence):\n", 607 | " featureset = npchunk_features(sentence, i, history)\n", 608 | " tag = self.classifier.classify(featureset)\n", 609 | " history.append(tag)\n", 610 | " return zip(sentence, history)\n", 611 | "\n", 612 | "class ConsecutiveNPChunker(nltk.ChunkParserI):\n", 613 | " def __init__(self, train_sents):\n", 614 | " tagged_sents = [[((w,t),c) for (w,t,c) in\n", 615 | " nltk.chunk.tree2conlltags(sent)]\n", 616 | " for sent in train_sents]\n", 617 | " self.tagger = ConsecutiveNPChunkTagger(tagged_sents)\n", 618 | "\n", 619 | " def parse(self, sentence):\n", 620 | " tagged_sents = self.tagger.tag(sentence)\n", 621 | " conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]\n", 622 | " return nltk.chunk.conlltags2tree(conlltags)" 623 | ] 624 | }, 625 | { 626 | "cell_type": "markdown", 627 | "metadata": {}, 628 | "source": [ 629 | "留下来唯一需要填写的是特征提取器。首先,我们定义一个简单的特征提取器,它只是提供了当前词符的词性标记。使用此特征提取器,我们的基于分类器的词块划分器的表现与一元词块划分器非常类似:" 630 | ] 631 | }, 632 | { 633 | "cell_type": "code", 634 | "execution_count": null, 635 | "metadata": {}, 636 | "outputs": [], 637 | "source": [ 638 | "def npchunk_features(sentence, i, history):\n", 639 | " word, pos = sentence[i]\n", 640 | " return {\"pos\": pos}\n", 641 | "chunker = ConsecutiveNPChunker(train_sents)\n", 642 | "print(chunker.evaluate(test_sents))" 643 | ] 644 | }, 645 | { 646 | "cell_type": "markdown", 647 | "metadata": {}, 648 | "source": [ 649 | "```\n", 650 | "ChunkParse score:\n", 651 | " IOB Accuracy: 92.9%\n", 652 | " Precision: 79.9%\n", 653 | " Recall: 86.7%\n", 654 | " F-Measure: 83.2%\n", 655 | " ```" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "我们还可以添加一个特征表示前面词的词性标记。添加此特征允许词块划分器模拟相邻标记之间的相互作用,由此产生的词块划分器与二元词块划分器非常接近。\n" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": null, 668 | "metadata": {}, 669 | "outputs": [], 670 | "source": [ 671 | "def npchunk_features(sentence, i, history):\n", 672 | " word, pos = sentence[i]\n", 673 | " if i == 0:\n", 674 | " prevword, prevpos = \"\", \"\"\n", 675 | " else:\n", 676 | " prevword, prevpos = sentence[i-1]\n", 677 | " return {\"pos\": pos, \"prevpos\": prevpos}\n", 678 | "chunker = ConsecutiveNPChunker(train_sents)\n", 679 | "print(chunker.evaluate(test_sents))" 680 | ] 681 | }, 682 | { 683 | "cell_type": "markdown", 684 | "metadata": {}, 685 | "source": [ 686 | "```\n", 687 | "ChunkParse score:\n", 688 | " IOB Accuracy: 93.6%\n", 689 | " Precision: 81.9%\n", 690 | " Recall: 87.2%\n", 691 | " F-Measure: 84.5%\n", 692 | "```" 693 | ] 694 | }, 695 | { 696 | "cell_type": "markdown", 697 | "metadata": {}, 698 | "source": [ 699 | "下一步,我们将尝试为当前词增加特征,因为我们假设这个词的内容应该对词块划有用。我们发现这个特征确实提高了词块划分器的表现,大约1.5个百分点(相应的错误率减少大约10%)。" 700 | ] 701 | }, 702 | { 703 | "cell_type": "code", 704 | "execution_count": null, 705 | "metadata": {}, 706 | "outputs": [], 707 | "source": [ 708 | "def npchunk_features(sentence, i, history):\n", 709 | " word, pos = sentence[i]\n", 710 | " if i == 0:\n", 711 | " prevword, prevpos = \"\", \"\"\n", 712 | " else:\n", 713 | " prevword, prevpos = sentence[i-1]\n", 714 | " return {\"pos\": pos, \"word\": word, \"prevpos\": prevpos}\n", 715 | "chunker = ConsecutiveNPChunker(train_sents)\n", 716 | "print(chunker.evaluate(test_sents))" 717 | ] 718 | }, 719 | { 720 | "cell_type": "markdown", 721 | "metadata": {}, 722 | "source": [ 723 | "```\n", 724 | "ChunkParse score:\n", 725 | " IOB Accuracy: 94.5%\n", 726 | " Precision: 84.2%\n", 727 | " Recall: 89.4%\n", 728 | " F-Measure: 86.7%\n", 729 | "```" 730 | 
] 731 | }, 732 | { 733 | "cell_type": "markdown", 734 | "metadata": {}, 735 | "source": [ 736 | "最后,我们尝试用多种附加特征扩展特征提取器,例如预取特征[1]、配对特征[2]和复杂的语境特征[3]。这最后一个特征,称为tags-since-dt,创建一个字符串,描述自最近的限定词以来遇到的所有词性标记,或如果没有限定词则在索引i之前自语句开始以来遇到的所有词性标记。" 737 | ] 738 | }, 739 | { 740 | "cell_type": "code", 741 | "execution_count": 36, 742 | "metadata": {}, 743 | "outputs": [], 744 | "source": [ 745 | "def npchunk_features(sentence, i, history):\n", 746 | " word, pos = sentence[i]\n", 747 | " if i == 0:\n", 748 | " prevword, prevpos = \"\", \"\"\n", 749 | " else:\n", 750 | " prevword, prevpos = sentence[i-1]\n", 751 | " if i == len(sentence)-1:\n", 752 | " nextword, nextpos = \"\", \"\"\n", 753 | " else:\n", 754 | " nextword, nextpos = sentence[i+1]\n", 755 | " return {\"pos\": pos,\n", 756 | " \"word\": word,\n", 757 | " \"prevpos\": prevpos,\n", 758 | " \"nextpos\": nextpos,\n", 759 | " \"prevpos+pos\": \"%s+%s\" % (prevpos, pos), \n", 760 | " \"pos+nextpos\": \"%s+%s\" % (pos, nextpos),\n", 761 | " \"tags-since-dt\": tags_since_dt(sentence, i)} " 762 | ] 763 | }, 764 | { 765 | "cell_type": "code", 766 | "execution_count": 37, 767 | "metadata": {}, 768 | "outputs": [], 769 | "source": [ 770 | "def tags_since_dt(sentence, i):\n", 771 | " tags = set()\n", 772 | " for word, pos in sentence[:i]:\n", 773 | " if pos == 'DT':\n", 774 | " tags = set()\n", 775 | " else:\n", 776 | " tags.add(pos)\n", 777 | " return '+'.join(sorted(tags))" 778 | ] 779 | }, 780 | { 781 | "cell_type": "code", 782 | "execution_count": null, 783 | "metadata": {}, 784 | "outputs": [], 785 | "source": [ 786 | "chunker = ConsecutiveNPChunker(train_sents)\n", 787 | "print(chunker.evaluate(test_sents))" 788 | ] 789 | }, 790 | { 791 | "cell_type": "markdown", 792 | "metadata": {}, 793 | "source": [ 794 | "```\n", 795 | "ChunkParse score:\n", 796 | " IOB Accuracy: 96.0%\n", 797 | " Precision: 88.6%\n", 798 | " Recall: 91.0%\n", 799 | " F-Measure: 89.8%\n", 800 | "```" 801 | ] 802 | }, 803 | { 804 | "cell_type": "markdown", 805 | "metadata": {}, 806 | "source": [ 807 | "# 4 语言结构中的递归\n", 808 | "## 4.1 用级联词块划分器构建嵌套结构\n", 809 | "\n", 810 | "到目前为止,我们的词块结构一直是相对平的。已标注词符组成的树在如NP这样的词块节点下任意组合。然而,只需创建一个包含递归规则的多级的词块语法,就可以建立任意深度的词块结构。4.1是名词短语、介词短语、动词短语和句子的模式。这是一个四级词块语法器,可以用来创建深度最多为4的结构。\n", 811 | "\n" 812 | ] 813 | }, 814 | { 815 | "cell_type": "code", 816 | "execution_count": 42, 817 | "metadata": {}, 818 | "outputs": [ 819 | { 820 | "name": "stdout", 821 | "output_type": "stream", 822 | "text": [ 823 | "(S\n", 824 | " (NP Mary/NN)\n", 825 | " saw/VBD\n", 826 | " (CLAUSE\n", 827 | " (NP the/DT cat/NN)\n", 828 | " (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))\n" 829 | ] 830 | } 831 | ], 832 | "source": [ 833 | "grammar = r\"\"\"\n", 834 | " NP: {+} \n", 835 | " PP: {} \n", 836 | " VP: {+$} \n", 837 | " CLAUSE: {} \n", 838 | " \"\"\"\n", 839 | "cp = nltk.RegexpParser(grammar)\n", 840 | "sentence = [(\"Mary\", \"NN\"), (\"saw\", \"VBD\"), (\"the\", \"DT\"), (\"cat\", \"NN\"),\n", 841 | " (\"sit\", \"VB\"), (\"on\", \"IN\"), (\"the\", \"DT\"), (\"mat\", \"NN\")]\n", 842 | "print(cp.parse(sentence))" 843 | ] 844 | }, 845 | { 846 | "cell_type": "markdown", 847 | "metadata": {}, 848 | "source": [ 849 | "不幸的是,这一结果丢掉了saw为首的VP。它还有其他缺陷。当我们将此词块划分器应用到一个有更深嵌套的句子时,让我们看看会发生什么。请注意,它无法识别[1]开始的VP词块。" 850 | ] 851 | }, 852 | { 853 | "cell_type": "code", 854 | "execution_count": 43, 855 | "metadata": {}, 856 | "outputs": [ 857 | { 858 | "name": "stdout", 859 | "output_type": "stream", 860 | "text": [ 861 | "(S\n", 862 | " (NP John/NNP)\n", 863 | " thinks/VBZ\n", 864 | " (NP 
Mary/NN)\n", 865 | " saw/VBD\n", 866 | " (CLAUSE\n", 867 | " (NP the/DT cat/NN)\n", 868 | " (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))\n" 869 | ] 870 | } 871 | ], 872 | "source": [ 873 | "sentence = [(\"John\", \"NNP\"), (\"thinks\", \"VBZ\"), (\"Mary\", \"NN\"),\n", 874 | " (\"saw\", \"VBD\"), (\"the\", \"DT\"), (\"cat\", \"NN\"), (\"sit\", \"VB\"),\n", 875 | " (\"on\", \"IN\"), (\"the\", \"DT\"), (\"mat\", \"NN\")]\n", 876 | "print(cp.parse(sentence))" 877 | ] 878 | }, 879 | { 880 | "cell_type": "markdown", 881 | "metadata": {}, 882 | "source": [ 883 | "这些问题的解决方案是让词块划分器在它的模式中循环:尝试完所有模式之后,重复此过程。我们添加一个可选的第二个参数loop指定这套模式应该循环的次数:" 884 | ] 885 | }, 886 | { 887 | "cell_type": "code", 888 | "execution_count": 44, 889 | "metadata": {}, 890 | "outputs": [ 891 | { 892 | "name": "stdout", 893 | "output_type": "stream", 894 | "text": [ 895 | "(S\n", 896 | " (NP John/NNP)\n", 897 | " thinks/VBZ\n", 898 | " (CLAUSE\n", 899 | " (NP Mary/NN)\n", 900 | " (VP\n", 901 | " saw/VBD\n", 902 | " (CLAUSE\n", 903 | " (NP the/DT cat/NN)\n", 904 | " (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))\n" 905 | ] 906 | } 907 | ], 908 | "source": [ 909 | "cp = nltk.RegexpParser(grammar, loop=2)\n", 910 | "print(cp.parse(sentence))" 911 | ] 912 | }, 913 | { 914 | "cell_type": "markdown", 915 | "metadata": {}, 916 | "source": [ 917 | "注意\n", 918 | "\n", 919 | "这个级联过程使我们能创建深层结构。然而,创建和调试级联过程是困难的,关键点是它能更有效地做全面的分析(见第8.章)。另外,级联过程只能产生固定深度的树(不超过级联级数),完整的句法分析这是不够的。\n" 920 | ] 921 | }, 922 | { 923 | "cell_type": "markdown", 924 | "metadata": {}, 925 | "source": [ 926 | "## 4.2 Trees\n", 927 | "\n", 928 | "tree是一组连接的加标签节点,从一个特殊的根节点沿一条唯一的路径到达每个节点。下面是一棵树的例子(注意它们标准的画法是颠倒的):\n", 929 | "```\n", 930 | "(S\n", 931 | " (NP Alice)\n", 932 | " (VP\n", 933 | " (V chased)\n", 934 | " (NP\n", 935 | " (Det the)\n", 936 | " (N rabbit))))\n", 937 | "```\n", 938 | "虽然我们将只集中关注语法树,树可以用来编码任何同构的超越语言形式序列的层次结构(如形态结构、篇章结构)。一般情况下,叶子和节点值不一定要是字符串。\n", 939 | "\n", 940 | "在NLTK中,我们通过给一个节点添加标签和一系列的孩子创建一棵树:" 941 | ] 942 | }, 943 | { 944 | "cell_type": "code", 945 | "execution_count": 46, 946 | "metadata": {}, 947 | "outputs": [ 948 | { 949 | "name": "stdout", 950 | "output_type": "stream", 951 | "text": [ 952 | "(NP Alice)\n", 953 | "(NP the rabbit)\n" 954 | ] 955 | } 956 | ], 957 | "source": [ 958 | "tree1 = nltk.Tree('NP', ['Alice'])\n", 959 | "print(tree1)\n", 960 | "tree2 = nltk.Tree('NP', ['the', 'rabbit'])\n", 961 | "print(tree2)" 962 | ] 963 | }, 964 | { 965 | "cell_type": "markdown", 966 | "metadata": {}, 967 | "source": [ 968 | "我们可以将这些不断合并成更大的树,如下所示:" 969 | ] 970 | }, 971 | { 972 | "cell_type": "code", 973 | "execution_count": 47, 974 | "metadata": {}, 975 | "outputs": [ 976 | { 977 | "name": "stdout", 978 | "output_type": "stream", 979 | "text": [ 980 | "(S (NP Alice) (VP chased (NP the rabbit)))\n" 981 | ] 982 | } 983 | ], 984 | "source": [ 985 | "tree3 = nltk.Tree('VP', ['chased', tree2])\n", 986 | "tree4 = nltk.Tree('S', [tree1, tree3])\n", 987 | "print(tree4)" 988 | ] 989 | }, 990 | { 991 | "cell_type": "markdown", 992 | "metadata": {}, 993 | "source": [ 994 | "下面是树对象的一些的方法:" 995 | ] 996 | }, 997 | { 998 | "cell_type": "code", 999 | "execution_count": 49, 1000 | "metadata": {}, 1001 | "outputs": [ 1002 | { 1003 | "name": "stdout", 1004 | "output_type": "stream", 1005 | "text": [ 1006 | "(VP chased (NP the rabbit))\n", 1007 | "VP\n", 1008 | "['Alice', 'chased', 'the', 'rabbit']\n", 1009 | "rabbit\n" 1010 | ] 1011 | } 1012 | ], 1013 | "source": [ 1014 | "print(tree4[1])\n", 1015 | "print(tree4[1].label())\n", 1016 | "print(tree4.leaves())\n", 1017 | 
"print(tree4[1][1][1])" 1018 | ] 1019 | }, 1020 | { 1021 | "cell_type": "markdown", 1022 | "metadata": {}, 1023 | "source": [ 1024 | "复杂的树用括号表示难以阅读。在这些情况下,draw方法是非常有用的。它会打开一个新窗口,包含树的一个图形表示。树显示窗口可以放大和缩小,子树可以折叠和展开,并将图形表示输出为一个postscript文件(包含在一个文档中)。" 1025 | ] 1026 | }, 1027 | { 1028 | "cell_type": "code", 1029 | "execution_count": 50, 1030 | "metadata": {}, 1031 | "outputs": [], 1032 | "source": [ 1033 | "tree3.draw()\n" 1034 | ] 1035 | }, 1036 | { 1037 | "cell_type": "markdown", 1038 | "metadata": {}, 1039 | "source": [ 1040 | "![7.5.png](./picture/7.5.png)" 1041 | ] 1042 | }, 1043 | { 1044 | "cell_type": "markdown", 1045 | "metadata": {}, 1046 | "source": [ 1047 | "## 4.3 树遍历\n", 1048 | "\n", 1049 | "使用递归函数来遍历树是标准的做法。4.2中的内容进行了演示。\n", 1050 | "\n" 1051 | ] 1052 | }, 1053 | { 1054 | "cell_type": "code", 1055 | "execution_count": 53, 1056 | "metadata": {}, 1057 | "outputs": [ 1058 | { 1059 | "name": "stdout", 1060 | "output_type": "stream", 1061 | "text": [ 1062 | "( S ( NP Alice ) ( VP chased ( NP the rabbit ) ) ) " 1063 | ] 1064 | } 1065 | ], 1066 | "source": [ 1067 | "def traverse(t):\n", 1068 | " try:\n", 1069 | " t.label()\n", 1070 | " except AttributeError:\n", 1071 | " print(t, end=\" \")\n", 1072 | " else:\n", 1073 | " # Now we know that t.node is defined\n", 1074 | " print('(', t.label(), end=\" \")\n", 1075 | " for child in t:\n", 1076 | " traverse(child)\n", 1077 | " print(')', end=\" \")\n", 1078 | "\n", 1079 | "t = tree4\n", 1080 | "traverse(t)" 1081 | ] 1082 | }, 1083 | { 1084 | "cell_type": "markdown", 1085 | "metadata": {}, 1086 | "source": [ 1087 | "# 5 命名实体识别\n", 1088 | "\n", 1089 | "在本章开头,我们简要介绍了命名实体(NE)。命名实体是确切的名词短语,指示特定类型的个体,如组织、人、日期等。5.1列出了一些较常用的NE类型。这些应该是不言自明的,除了“FACILITY”:建筑和土木工程领域的人造产品;以及“GPE”:地缘政治实体,如城市、州/省、国家。\n", 1090 | "\n", 1091 | "\n", 1092 | "常用命名实体类型\n", 1093 | "```\n", 1094 | "Eddy N B-PER\n", 1095 | "Bonte N I-PER\n", 1096 | "is V O\n", 1097 | "woordvoerder N O\n", 1098 | "van Prep O\n", 1099 | "diezelfde Pron O\n", 1100 | "Hogeschool N B-ORG\n", 1101 | ". 
Punc O\n", 1102 | "```" 1103 | ] 1104 | }, 1105 | { 1106 | "cell_type": "code", 1107 | "execution_count": 54, 1108 | "metadata": {}, 1109 | "outputs": [ 1110 | { 1111 | "name": "stdout", 1112 | "output_type": "stream", 1113 | "text": [ 1114 | "(S\n", 1115 | " From/IN\n", 1116 | " what/WDT\n", 1117 | " I/PPSS\n", 1118 | " was/BEDZ\n", 1119 | " able/JJ\n", 1120 | " to/IN\n", 1121 | " gauge/NN\n", 1122 | " in/IN\n", 1123 | " a/AT\n", 1124 | " swift/JJ\n", 1125 | " ,/,\n", 1126 | " greedy/JJ\n", 1127 | " glance/NN\n", 1128 | " ,/,\n", 1129 | " the/AT\n", 1130 | " figure/NN\n", 1131 | " inside/IN\n", 1132 | " the/AT\n", 1133 | " coral-colored/JJ\n", 1134 | " boucle/NN\n", 1135 | " dress/NN\n", 1136 | " was/BEDZ\n", 1137 | " stupefying/VBG\n", 1138 | " ./.)\n" 1139 | ] 1140 | } 1141 | ], 1142 | "source": [ 1143 | "print(nltk.ne_chunk(sent)) " 1144 | ] 1145 | }, 1146 | { 1147 | "cell_type": "markdown", 1148 | "metadata": {}, 1149 | "source": [ 1150 | "# 6 关系抽取\n", 1151 | "\n", 1152 | "一旦文本中的命名实体已被识别,我们就可以提取它们之间存在的关系。如前所述,我们通常会寻找指定类型的命名实体之间的关系。进行这一任务的方法之一是首先寻找所有X, α, Y)形式的三元组,其中X和Y是指定类型的命名实体,α表示X和Y之间关系的字符串。然后我们可以使用正则表达式从α的实体中抽出我们正在查找的关系。下面的例子搜索包含词in的字符串。特殊的正则表达式(?!\\b.+ing\\b)是一个否定预测先行断言,允许我们忽略如success in supervising the transition of中的字符串,其中in后面跟一个动名词。" 1153 | ] 1154 | }, 1155 | { 1156 | "cell_type": "code", 1157 | "execution_count": 55, 1158 | "metadata": {}, 1159 | "outputs": [ 1160 | { 1161 | "name": "stdout", 1162 | "output_type": "stream", 1163 | "text": [ 1164 | "[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']\n", 1165 | "[ORG: 'McGlashan & Sarrail'] 'firm in' [LOC: 'San Mateo']\n", 1166 | "[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']\n", 1167 | "[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']\n", 1168 | "[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']\n", 1169 | "[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']\n", 1170 | "[ORG: 'WGBH'] 'in' [LOC: 'Boston']\n", 1171 | "[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']\n", 1172 | "[ORG: 'Omnicom'] 'in' [LOC: 'New York']\n", 1173 | "[ORG: 'DDB Needham'] 'in' [LOC: 'New York']\n", 1174 | "[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']\n", 1175 | "[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']\n", 1176 | "[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']\n" 1177 | ] 1178 | } 1179 | ], 1180 | "source": [ 1181 | "IN = re.compile(r'.*\\bin\\b(?!\\b.+ing)')\n", 1182 | "for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):\n", 1183 | " for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,\n", 1184 | " corpus='ieer', pattern = IN):\n", 1185 | " print(nltk.sem.rtuple(rel))" 1186 | ] 1187 | }, 1188 | { 1189 | "cell_type": "markdown", 1190 | "metadata": {}, 1191 | "source": [ 1192 | "搜索关键字in执行的相当不错,虽然它的检索结果也会误报,例如[ORG: House Transportation Committee] , secured the most money in the [LOC: New York];一种简单的基于字符串的方法排除这样的填充字符串似乎不太可能。\n", 1193 | "\n", 1194 | "如前文所示,conll2002命名实体语料库的荷兰语部分不只包含命名实体标注,也包含词性标注。这允许我们设计对这些标记敏感的模式,如下面的例子所示。clause()方法以分条形式输出关系,其中二元关系符号作为参数relsym的值被指定[1]。" 1195 | ] 1196 | }, 1197 | { 1198 | "cell_type": "code", 1199 | "execution_count": 57, 1200 | "metadata": {}, 1201 | "outputs": [ 1202 | { 1203 | "name": "stdout", 1204 | "output_type": "stream", 1205 | "text": [ 1206 | "VAN(\"cornet_d'elzius\", 'buitenlandse_handel')\n", 1207 | "VAN('johan_rottiers', 'kardinaal_van_roey_instituut')\n", 1208 | "VAN('annie_lennox', 'eurythmics')\n" 1209 | ] 1210 | } 1211 | ], 1212 | "source": [ 1213 | "from nltk.corpus import conll2002\n", 1214 | "vnv = \"\"\"\n", 1215 | "(\n", 1216 | "is/V| # 3rd 
sing present and\n", 1217 | "was/V| # past forms of the verb zijn ('be')\n", 1218 | "werd/V| # and also present\n", 1219 | "wordt/V # past of worden ('become)\n", 1220 | ")\n", 1221 | ".* # followed by anything\n", 1222 | "van/Prep # followed by van ('of')\n", 1223 | "\"\"\"\n", 1224 | "VAN = re.compile(vnv, re.VERBOSE)\n", 1225 | "for doc in conll2002.chunked_sents('ned.train'):\n", 1226 | " for r in nltk.sem.extract_rels('PER', 'ORG', doc,\n", 1227 | " corpus='conll2002', pattern=VAN):\n", 1228 | " print(nltk.sem.clause(r, relsym=\"VAN\"))" 1229 | ] 1230 | }, 1231 | { 1232 | "cell_type": "code", 1233 | "execution_count": null, 1234 | "metadata": {}, 1235 | "outputs": [], 1236 | "source": [] 1237 | } 1238 | ], 1239 | "metadata": { 1240 | "kernelspec": { 1241 | "display_name": "Python 3", 1242 | "language": "python", 1243 | "name": "python3" 1244 | }, 1245 | "language_info": { 1246 | "codemirror_mode": { 1247 | "name": "ipython", 1248 | "version": 3 1249 | }, 1250 | "file_extension": ".py", 1251 | "mimetype": "text/x-python", 1252 | "name": "python", 1253 | "nbconvert_exporter": "python", 1254 | "pygments_lexer": "ipython3", 1255 | "version": "3.6.2" 1256 | } 1257 | }, 1258 | "nbformat": 4, 1259 | "nbformat_minor": 2 1260 | } 1261 | -------------------------------------------------------------------------------- /【Python自然语言处理】读书笔记:第二章:获得文本语料和词汇资源.md: -------------------------------------------------------------------------------- 1 | 原文在线阅读:https://usyiyi.github.io/nlp-py-2e-zh/2.html 2 | # 1 获取文本语料库 3 | ## 1.1 古腾堡语料库 4 | ```py 5 | >>> for fileid in gutenberg.fileids(): 6 | >... num_words = len(gutenberg.words(fileid)) 7 | >... num_vocab = len(set(w.lower() for w in gutenberg.words(fileid))) 8 | >... num_sents = len(gutenberg.sents(fileid)) 9 | >... num_chars = len(gutenberg.raw(fileid)) 10 | >... print("平均词长:", round(num_chars/num_words), "平均句长:", round(num_words/num_sents), "每个单词出现的平均次数:", round(num_words/num_vocab), fileid) 11 | ... 12 | 平均词长: 5 平均句长: 25 每个单词出现的平均次数: 26 austen-emma.txt 13 | 平均词长: 5 平均句长: 26 每个单词出现的平均次数: 17 austen-persuasion.txt 14 | 平均词长: 5 平均句长: 28 每个单词出现的平均次数: 22 austen-sense.txt 15 | 平均词长: 4 平均句长: 34 每个单词出现的平均次数: 79 bible-kjv.txt 16 | 平均词长: 5 平均句长: 19 每个单词出现的平均次数: 5 blake-poems.txt 17 | 平均词长: 4 平均句长: 19 每个单词出现的平均次数: 14 bryant-stories.txt 18 | 平均词长: 4 平均句长: 18 每个单词出现的平均次数: 12 burgess-busterbrown.txt 19 | 平均词长: 4 平均句长: 20 每个单词出现的平均次数: 13 carroll-alice.txt 20 | 平均词长: 5 平均句长: 20 每个单词出现的平均次数: 12 chesterton-ball.txt 21 | 平均词长: 5 平均句长: 23 每个单词出现的平均次数: 11 chesterton-brown.txt 22 | 平均词长: 5 平均句长: 18 每个单词出现的平均次数: 11 chesterton-thursday.txt 23 | 平均词长: 4 平均句长: 21 每个单词出现的平均次数: 25 edgeworth-parents.txt 24 | 平均词长: 5 平均句长: 26 每个单词出现的平均次数: 15 melville-moby_dick.txt 25 | 平均词长: 5 平均句长: 52 每个单词出现的平均次数: 11 milton-paradise.txt 26 | 平均词长: 4 平均句长: 12 每个单词出现的平均次数: 9 shakespeare-caesar.txt 27 | 平均词长: 4 平均句长: 12 每个单词出现的平均次数: 8 shakespeare-hamlet.txt 28 | 平均词长: 4 平均句长: 12 每个单词出现的平均次数: 7 shakespeare-macbeth.txt 29 | 平均词长: 5 平均句长: 36 每个单词出现的平均次数: 12 whitman-leaves.txt 30 | >>> 31 | 32 | ``` 33 | 这个程序显示每个文本的三个统计量:**平均词长**、**平均句子长度**和本文中每个词出现的平均次数(我们的**词汇多样性**得分)。请看,平均词长似乎是英语的一个一般属性,因为它的值总是4。(事实上,平均词长是3而不是4,因为num_chars变量计数了空白字符。)相比之下,平均句子长度和词汇多样性看上去是作者个人的特点。 34 | 35 | ## 1.2 网络和聊天文本 36 | 37 | ## 1.3 布朗语料库 38 | ```py 39 | >>> from nltk.corpus import brown 40 | >>> news_text = brown.words(categories='news') 41 | >>> fdist = nltk.FreqDist(w.lower() for w in news_text) 42 | >>> for k in fdist: 43 | ... if k[:2] == "wh": 44 | ... print(k + ":",fdist[k], end = " ") 45 | ... 
46 | which: 245 when: 169 who: 268 whether: 18 where: 59 what: 95 while: 55 why: 14 whipped: 2 white: 57 whom: 8 whereby: 3 whole: 11 wherever: 1 whose: 22 wholesale: 1 wheel: 4 whatever: 2 whipple: 1 whitey: 1 whiz: 2 whitfield: 1 whip: 2 whirling: 1 wheeled: 2 whee: 1 wheeler: 2 whisking: 1 wheels: 1 whitney: 1 whopping: 1 wholly-owned: 1 whims: 1 whelan: 1 white-clad: 1 wheat: 1 whites: 2 whiplash: 1 whichever: 1 what's: 1 wholly: 1 >>> 47 | ``` 48 | 49 | 50 | 布朗语料库查看一星期各天在news和humor类别语料库的出现的条件下的频率分布: 51 | ```py 52 | >>> cfd = nltk.ConditionalFreqDist((genre, word) for genre in brown.categories() for word in brown.words(categories = genre)) 53 | >>> genres = ["news", "humor"] # 填写我们想要展示的种类 54 | >>> day = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"] 55 | # 填写我们想要统计的词 56 | >>> cfd.tabulate(conditions = genres, samples = day) 57 | Monday Tuesday Wednesday Thursday Friday Saturday Sunday 58 | news 54 43 22 20 41 33 51 59 | humor 1 0 0 0 0 3 0 60 | >>> cfd.plot(conditions = genres, samples = day) 61 | ``` 62 | ![在这里插入图片描述](./picture/2.1.png) 63 | ## 1.4 路透社语料库 64 | ## 1.5 就职演说语料库 65 | ```py 66 | >>> cfd = nltk.ConditionalFreqDist( 67 | ... (target, fileid[:4]) 68 | ... for fileid in inaugural.fileids() 69 | ... for w in inaugural.words(fileid) 70 | ... for target in ['america', 'citizen'] 71 | ... if w.lower().startswith(target)) [1] 72 | >>> cfd.plot() 73 | 74 | ``` 75 | ![在这里插入图片描述](./picture/2.2.png) 76 | 77 | 条件频率分布图:计数就职演说语料库中所有以america 或citizen开始的词。 78 | 79 | ## 1.6 标注文本语料库 80 | ## 1.8 文本语料库的结构 81 | 文本语料库的常见结构:最简单的一种语料库是一些孤立的没有什么特别的组织的文本集合;一些语料库按如文体(布朗语料库)等分类组织结构;一些分类会重叠,如主题类别(路透社语料库);另外一些语料库可以表示随时间变化语言用法的改变(就职演说语料库)。 82 | ![在这里插入图片描述](./picture/2.3.png) 83 | 84 | 85 | ## 1.9 加载你自己的语料库 86 | ```py 87 | >>> from nltk.corpus import PlaintextCorpusReader 88 | >>> corpus_root = '/usr/share/dict' 89 | >>> wordlists = PlaintextCorpusReader(corpus_root, '.*') 90 | >>> wordlists.fileids() 91 | ['README', 'connectives', 'propernames', 'web2', 'web2a', 'words'] 92 | >>> wordlists.words('connectives') 93 | ['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...] 94 | ``` 95 | # 2 条件频率分布 96 | ## 2.1 条件和事件 97 | 每个**配对pairs**的形式是:(条件, 事件)。如果我们按文体处理整个布朗语料库,将有15 个条件(每个文体一个条件)和1,161,192 个事件(每一个词一个事件)。 98 | ```py 99 | >>> text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...] 100 | >>> pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...] 101 | ``` 102 | ## 2.2 按文体计数词汇 103 | ```py 104 | >>> genre_word = [(genre, word) 105 | ... for genre in ['news', 'romance'] 106 | ... 
for word in brown.words(categories=genre)] 107 | >>> len(genre_word) 108 | 170576 109 | >>> genre_word[:4] 110 | [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')] # [_start-genre] 111 | >>> genre_word[-4:] 112 | [('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')] # [_end-genre] 113 | 114 | ``` 115 | ```py 116 | >>> cfd = nltk.ConditionalFreqDist(genre_word) 117 | >>> cfd [1] 118 | 119 | >>> cfd.conditions() 120 | ['news', 'romance'] # [_conditions-cfd] 121 | ``` 122 | ```py 123 | >>> print(cfd['news']) 124 | 125 | >>> print(cfd['romance']) 126 | 127 | >>> cfd['romance'].most_common(20) 128 | [(',', 3899), ('.', 3736), ('the', 2758), ('and', 1776), ('to', 1502), 129 | ('a', 1335), ('of', 1186), ('``', 1045), ("''", 1044), ('was', 993), 130 | ('I', 951), ('in', 875), ('he', 702), ('had', 692), ('?', 690), 131 | ('her', 651), ('that', 583), ('it', 573), ('his', 559), ('she', 496)] 132 | >>> cfd['romance']['could'] 133 | 193 134 | ``` 135 | 136 | 137 | ## 2.3 绘制分布图和分布表 138 | 见: **1.3 布朗语料库** 139 | 140 | ## 2.4 使用双连词生成随机文本 141 | 142 | 利用bigrams制作生成模型: 143 | ```py 144 | >>> def generate_model(cfdist, word, num = 15): 145 | ... for i in range(num): 146 | ... print (word,end = " ") 147 | ... word = cfdist[word].max() 148 | ... 149 | 150 | >>> text = nltk.corpus.genesis.words("english-kjv.txt") 151 | >>> bigrams = nltk.bigrams(text) 152 | >>> cfd = nltk.ConditionalFreqDist(bigrams) 153 | >>> cfd 154 | 155 | >>> list(cfd) 156 | [('they', 'embalmed'), ('embalmed', 'him'), ('him', ','), (',', 'and'), ('and', 'he'), ('he', 'was'), ('was', 'put'), ('put', 'in'), ('in', 'a'), ('a', 'coffin'), ('coffin', 'in'), ('in', 'Egypt'), ('Egypt', '.')] 157 | 158 | 159 | >>> cfd["so"] 160 | FreqDist({'that': 8, '.': 7, ',': 4, 'the': 3, 'I': 2, 'doing': 2, 'much': 2, ':': 2, 'did': 1, 'Noah': 1, ...}) 161 | >>> cfd["living"] 162 | FreqDist({'creature': 7, 'thing': 4, 'substance': 2, 'soul': 1, '.': 1, ',': 1}) 163 | 164 | >>> generate_model(cfd, "so") 165 | so that he said , and the land of the land of the land of 166 | >>> generate_model(cfd, "living") 167 | living creature that he said , and the land of the land of the land 168 | >>> 169 | 170 | ``` 171 | 172 | # 4 词汇资源 173 | ## 4.1 词汇列表语料库 174 | **词汇语料库**是Unix 中的/usr/share/dict/words文件,被一些拼写检查程序使用。我们可以用它来寻找文本语料中不寻常的或拼写错误的词汇。 175 | ```py 176 | def unusual_words(text): 177 | text_vocab = set(w.lower() for w in text if w.isalpha()) 178 | english_vocab = set(w.lower() for w in nltk.corpus.words.words()) 179 | unusual = text_vocab - english_vocab 180 | return sorted(unusual) 181 | 182 | >>> unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt')) 183 | ['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused', 'abuses', 184 | 'accents', 'accepting', 'accommodations', 'accompanied', 'accounted', 'accounts', 185 | 'accustomary', 'aches', 'acknowledging', 'acknowledgment', 'acknowledgments', ...] 186 | >>> unusual_words(nltk.corpus.nps_chat.words()) 187 | ['aaaaaaaaaaaaaaaaa', 'aaahhhh', 'abortions', 'abou', 'abourted', 'abs', 'ack', 188 | 'acros', 'actualy', 'adams', 'adds', 'adduser', 'adjusts', 'adoted', 'adreniline', 189 | 'ads', 'adults', 'afe', 'affairs', 'affari', 'affects', 'afk', 'agaibn', 'ages', ...] 
190 | ``` 191 | 192 | **停用词语料库**,就是那些高频词汇,如the,to和also,我们有时在进一步的处理之前想要将它们从文档中过滤。 193 | ```py 194 | >>> from nltk.corpus import stopwords 195 | >>> stopwords.words('english') 196 | ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 197 | 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 198 | 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 199 | 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 200 | 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 201 | 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 202 | 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 203 | 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 204 | 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 205 | 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 206 | 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 207 | 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now'] 208 | ``` 209 | 210 | **一个字母拼词谜题**:在由随机选择的字母组成的网格中,选择里面的字母组成词;这个谜题叫做“目标”。 211 | ![在这里插入图片描述](./picture/2.4.png) 212 | ```py 213 | >>> puzzle_letters = nltk.FreqDist('egivrvonl') 214 | # 注意,这里如果是字符串'egivrvonl'的话,给出的就是每个字母的频数 215 | >>> obligatory = 'r' 216 | >>> wordlist = nltk.corpus.words.words() 217 | >>> [w for w in wordlist if len(w) >= 6 218 | ... and obligatory in w 219 | ... and nltk.FreqDist(w) <= puzzle_letters] 220 | ['glover', 'gorlin', 'govern', 'grovel', 'ignore', 'involver', 'lienor', 221 | 'linger', 'longer', 'lovering', 'noiler', 'overling', 'region', 'renvoi', 222 | 'revolving', 'ringle', 'roving', 'violer', 'virole'] 223 | ``` 224 | 225 | **名字语料库**,包括8000个按性别分类的名字。男性和女性的名字存储在单独的文件中。让我们找出同时出现在两个文件中的名字,即性别暧昧的名字: 226 | ```py 227 | >>> names = nltk.corpus.names 228 | >>> names.fileids() 229 | ['female.txt', 'male.txt'] 230 | >>> male_names = names.words('male.txt') 231 | >>> female_names = names.words('female.txt') 232 | >>> [w for w in male_names if w in female_names] 233 | ['Abbey', 'Abbie', 'Abby', 'Addie', 'Adrian', 'Adrien', 'Ajay', 'Alex', 'Alexis', 234 | 'Alfie', 'Ali', 'Alix', 'Allie', 'Allyn', 'Andie', 'Andrea', 'Andy', 'Angel', 235 | 'Angie', 'Ariel', 'Ashley', 'Aubrey', 'Augustine', 'Austin', 'Averil', ...] 
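# 补充示例(非原文;沿用上面的 names 语料库):按名字末字母做条件频率分布,
# 可以直观比较男性和女性名字结尾字母的分布差异
>>> cfd = nltk.ConditionalFreqDist(
...     (fileid, name[-1])
...     for fileid in names.fileids()
...     for name in names.words(fileid))
>>> cfd.plot()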
236 | ``` 237 | ## 4.2 发音的词典 238 | ## 4.3 比较词表 239 | ## 4.4 词汇工具:Shoebox和Toolbox 240 | # 5 WordNet 241 | 242 | -------------------------------------------------------------------------------- /【Python自然语言处理】读书笔记:第四章:编写结构化程序.md: -------------------------------------------------------------------------------- 1 | 2 | # 4 编写结构化程序 3 | 4 | # 4.1 回到基础 5 | 6 | ## 1、赋值: 7 | 列表赋值是“引用”,改变其中一个,其他都会改变 8 | 9 | 10 | 11 | ```python 12 | foo = ["1", "2"] 13 | bar = foo 14 | foo[1] = "3" 15 | print(bar) 16 | ``` 17 | 18 | ['1', '3'] 19 | 20 | 21 | 22 | ```python 23 | empty = [] 24 | nested = [empty, empty, empty] 25 | print(nested) 26 | nested[1].append("3") 27 | print(nested) 28 | ``` 29 | 30 | [[], [], []] 31 | [['3'], ['3'], ['3']] 32 | 33 | 34 | 35 | ```python 36 | nes = [[]] * 3 37 | nes[1].append("3") 38 | print(nes) 39 | nes[1] = ["2"] # 这里最新赋值时,不会传递给其他元素 40 | print(nes) 41 | ``` 42 | 43 | [['3'], ['3'], ['3']] 44 | [['3'], ['2'], ['3']] 45 | 46 | 47 | 48 | ```python 49 | new = nested[:] 50 | print(new) 51 | new[2] = ["new"] 52 | print(new) 53 | print(nested) 54 | ``` 55 | 56 | [['3'], ['3'], ['3']] 57 | [['3'], ['3'], ['new']] 58 | [['3'], ['3'], ['3']] 59 | 60 | 61 | 62 | ```python 63 | import copy 64 | new2 = copy.deepcopy(nested) 65 | print(new2) 66 | new2[2] = ["new2"] 67 | print(new2) 68 | print(nested) 69 | ``` 70 | 71 | [['3'], ['3'], ['3']] 72 | [['3'], ['3'], ['new2']] 73 | [['3'], ['3'], ['3']] 74 | 75 | 76 | ## 2、等式 77 | Python提供两种方法来检查一对项目是否相同。 78 | 79 | is 操作符测试对象的ID。 80 | 81 | == 检测对象是否相等。 82 | 83 | 84 | ```python 85 | snake_nest = [["Python"]] * 5 86 | snake_nest[2] = ['Python'] 87 | 88 | print(snake_nest) 89 | 90 | print(snake_nest[0] == snake_nest[1] == snake_nest[2] == snake_nest[3] == snake_nest[4]) 91 | 92 | print(snake_nest[0] is snake_nest[1] is snake_nest[2] is snake_nest[3] is snake_nest[4]) 93 | 94 | ``` 95 | 96 | [['Python'], ['Python'], ['Python'], ['Python'], ['Python']] 97 | True 98 | False 99 | 100 | 101 | ## 3、elif 和 else 102 | if elif 表示 if 为假而且elif 后边为真,表示如果第一个if执行,则不在执行elif 103 | 104 | if if 表示无论第一个if是否执行,第二个if都会执行 105 | 106 | 107 | ```python 108 | animals = ["cat", "dog"] 109 | if "cat" in animals: 110 | print(1) 111 | if "dog" in animals: 112 | print(2) 113 | ``` 114 | 115 | 1 116 | 2 117 | 118 | 119 | 120 | ```python 121 | animals = ["cat", "dog"] 122 | if "cat" in animals: 123 | print(1) 124 | elif "dog" in animals: 125 | print(2) 126 | ``` 127 | 128 | 1 129 | 130 | 131 | # 4.2 序列 132 | 133 | 字符串,列表,元组 134 | 135 | ## 1、元组 136 | 由逗号隔开,通常使用括号括起来,可以被索引和切片,并且由长度 137 | 138 | 139 | ```python 140 | t = "walk", "fem", 3 141 | print(t) 142 | print(t[0]) 143 | print(t[1:]) 144 | print(len(t)) 145 | ``` 146 | 147 | ('walk', 'fem', 3) 148 | walk 149 | ('fem', 3) 150 | 3 151 | 152 | 153 | ## 2、序列可以直接相互赋值 154 | 155 | 156 | ```python 157 | words = ["I", "turned", "off", "the", "spectroroute"] 158 | words[1], words[4] = words[4], words[1] 159 | print(words) 160 | ``` 161 | 162 | ['I', 'spectroroute', 'off', 'the', 'turned'] 163 | 164 | 165 | ## 3、处理序列的函数 166 | sorted()函数、reversed()函数、zip()函数、enumerate()函数 167 | 168 | 169 | ```python 170 | print("\n",words) 171 | print(sorted(words)) 172 | 173 | print("\n",words) 174 | print(reversed(words)) 175 | print(list(reversed(words))) 176 | 177 | print("\n",words) 178 | print(zip(words, range(len(words)))) 179 | print(list(zip(words, range(len(words))))) 180 | 181 | print("\n",words) 182 | print(enumerate(words)) 183 | print(list(enumerate(words))) 184 | ``` 185 | 186 | 187 | ['I', 'spectroroute', 'off', 'the', 'turned'] 188 | ['I', 
'off', 'spectroroute', 'the', 'turned'] 189 | 190 | ['I', 'spectroroute', 'off', 'the', 'turned'] 191 | 192 | ['turned', 'the', 'off', 'spectroroute', 'I'] 193 | 194 | ['I', 'spectroroute', 'off', 'the', 'turned'] 195 | 196 | [('I', 0), ('spectroroute', 1), ('off', 2), ('the', 3), ('turned', 4)] 197 | 198 | ['I', 'spectroroute', 'off', 'the', 'turned'] 199 | 200 | [(0, 'I'), (1, 'spectroroute'), (2, 'off'), (3, 'the'), (4, 'turned')] 201 | 202 | 203 | ## 4、合并不同类型的序列 204 | 205 | 206 | ```python 207 | words = "I turned off the spectroroute".split() 208 | print (words) 209 | 210 | wordlens = [(len(word), word) for word in words] 211 | print(wordlens) 212 | 213 | wordlens.sort() 214 | print (wordlens) 215 | 216 | print(" ".join(w for (_, w) in wordlens)) 217 | ``` 218 | 219 | ['I', 'turned', 'off', 'the', 'spectroroute'] 220 | [(1, 'I'), (6, 'turned'), (3, 'off'), (3, 'the'), (12, 'spectroroute')] 221 | [(1, 'I'), (3, 'off'), (3, 'the'), (6, 'turned'), (12, 'spectroroute')] 222 | I off the turned spectroroute 223 | 224 | 225 | ## 5、生成器表达式 226 | 227 | ### 5.1、列表推导 228 | 我们一直在大量使用列表推导,因为用它处理文本结构紧凑和可读性好。下面是一个例子,分词和规范化一个文本: 229 | 230 | 231 | 232 | ```python 233 | from nltk import * 234 | text = '''"When I use a word," Humpty Dumpty said in rather a scornful tone, 235 | ... "it means just what I choose it to mean - neither more nor less."''' 236 | print([w.lower() for w in word_tokenize(text)]) 237 | ``` 238 | 239 | ['``', 'when', 'i', 'use', 'a', 'word', ',', "''", 'humpty', 'dumpty', 'said', 'in', 'rather', 'a', 'scornful', 'tone', ',', '...', '``', 'it', 'means', 'just', 'what', 'i', 'choose', 'it', 'to', 'mean', '-', 'neither', 'more', 'nor', 'less', '.', "''"] 240 | 241 | 242 | ### 5.2、生成器表达式 243 | 第二行使用了生成器表达式。这不仅仅是标记方便:在许多语言处理的案例中,生成器表达式会更高效。 244 | 245 | 在[1]中,列表对象的存储空间必须在max()的值被计算之前分配。如果文本非常大的,这将会很慢。 246 | 247 | 在[2]中,数据流向调用它的函数。由于调用的函数只是简单的要找最大值——按字典顺序排在最后的词——它可以处理数据流,而无需存储迄今为止的最大值以外的任何值。 248 | 249 | 250 | ```python 251 | print(max([w.lower() for w in word_tokenize(text)])) 252 | print (max(w.lower() for w in word_tokenize(text))) 253 | ``` 254 | 255 | word 256 | word 257 | 258 | 259 | # 4.3 风格的问题 260 | 261 | ## 1、过程风格与声明风格 262 | 计算布朗语料库中词的平均长度的程序: 263 | 264 | 265 | ```python 266 | # 过程风格 267 | import nltk 268 | tokens = nltk.corpus.brown.words(categories='news') 269 | count = 0 270 | total = 0 271 | for token in tokens: 272 | count += 1 273 | total += len(token) 274 | total / count 275 | ``` 276 | 277 | 278 | 279 | 280 | 4.401545438271973 281 | 282 | 283 | 284 | 285 | ```python 286 | # 声明风格 287 | total = sum(len(w) for w in tokens) 288 | print(total / len(tokens)) 289 | ``` 290 | 291 | 4.401545438271973 292 | 293 | 294 | 295 | ```python 296 | # 过程风格 297 | text = nltk.corpus.gutenberg.words('milton-paradise.txt') 298 | longest = '' 299 | for word in text: 300 | if len(word) > len(longest): 301 | longest = word 302 | print(longest) 303 | ``` 304 | 305 | unextinguishable 306 | 307 | 308 | 309 | ```python 310 | # 声明风格:使用两个列表推到 311 | maxlen = max(len(word) for word in text) 312 | print([word for word in text if len(word) == maxlen]) 313 | ``` 314 | 315 | ['unextinguishable', 'transubstantiate', 'inextinguishable', 'incomprehensible'] 316 | 317 | 318 | 319 | ```python 320 | # enumerate() 枚举频率分布的值 321 | fd = nltk.FreqDist(nltk.corpus.brown.words()) 322 | cumulative = 0.0 323 | most_common_words = [word for (word, count) in fd.most_common()] 324 | for rank, word in enumerate(most_common_words): 325 | cumulative += fd.freq(word) 326 | print("%3d %10.2f%% %10s" % (rank + 1, fd.freq(word) * 100, 
word)) 327 | if cumulative > 0.25: 328 | break 329 | ``` 330 | 331 | 1 5.40% the 332 | 2 5.02% , 333 | 3 4.25% . 334 | 4 3.11% of 335 | 5 2.40% and 336 | 6 2.22% to 337 | 7 1.88% a 338 | 8 1.68% in 339 | 340 | 341 | ## 2、计数器的一些合理用途 342 | 343 | 344 | ```python 345 | sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper'] 346 | n = 3 347 | [sent[i:i+n] for i in range(len(sent) - n +1)] 348 | ``` 349 | 350 | 351 | 352 | 353 | [['The', 'dog', 'gave'], 354 | ['dog', 'gave', 'John'], 355 | ['gave', 'John', 'the'], 356 | ['John', 'the', 'newspaper']] 357 | 358 | 359 | 360 | ## 3、构建多维数组:使用嵌套列表推导和使用对象复制([ ] * n) 361 | 362 | 363 | ```python 364 | # 使用嵌套列表推导可以修改内容 365 | m, n = 3, 7 366 | array = [[set() for i in range(n)] for j in range(m)] 367 | array[2][5].add("Alice") 368 | import pprint 369 | pprint.pprint(array) 370 | ``` 371 | 372 | [[set(), set(), set(), set(), set(), set(), set()], 373 | [set(), set(), set(), set(), set(), set(), set()], 374 | [set(), set(), set(), set(), set(), {'Alice'}, set()]] 375 | 376 | 377 | 378 | ```python 379 | # 使用对象复制,修改一个其他的也会改变 380 | array = [[set()] * n] *m 381 | array[2][5].add(7) 382 | pprint.pprint(array) 383 | ``` 384 | 385 | [[{7}, {7}, {7}, {7}, {7}, {7}, {7}], 386 | [{7}, {7}, {7}, {7}, {7}, {7}, {7}], 387 | [{7}, {7}, {7}, {7}, {7}, {7}, {7}]] 388 | 389 | 390 | # 4.4 函数:结构化编程的基础 391 | 392 | 393 | ```python 394 | import re 395 | def get_text(file): 396 | """Read text from a file, normalizing whitespace and stipping HTML markup""" 397 | text = open(file).read() 398 | text = re.sub(r"<.*?>", " ", text) 399 | text = re.sub(r"\s+", " ", text) 400 | return text 401 | 402 | contents = get_text("4.test.html") 403 | print(contents[:300]) 404 | ``` 405 | 406 | 【Python自然语言处理】读书笔记:第四章:编写结构化程序 /*! * * Twitter Bootstrap * */ /*! * Bootstrap v3.3.7 (http://getbootstrap.com) * Copyright 2011-2016 Twitter, Inc. * Licensed under MIT (https://github.com/twbs/bootstrap/blob/master/LICENSE) */ /*! 
normalize.css v3.0.3 | MIT License | github.com/necolas/normalize.cs 407 | 408 | 409 | 410 | ```python 411 | help(get_text) 412 | ``` 413 | 414 | Help on function get_text in module __main__: 415 | 416 | get_text(file) 417 | Read text from a file, normalizing whitespace and stipping HTML markup 418 | 419 | 420 | 421 | 1、考虑以下三个排序函数。第三个是危险的,因为程序员可能没有意识到它已经修改了给它的输入。一般情况下,函数应该修改参数的内容(my_sort1())或返回一个值(my_sort2()),而不是两个都做(my_sort3())。 422 | 423 | 424 | ```python 425 | def my_sort1(mylist): # good: modifies its argument, no return value 426 | mylist.sort() 427 | def my_sort2(mylist): # good: doesn't touch its argument, returns value 428 | return sorted(mylist) 429 | def my_sort3(mylist): # bad: modifies its argument and also returns it 430 | mylist.sort() 431 | return mylist 432 | 433 | mylist = [3,2,1] 434 | my_sort1(mylist) 435 | print (mylist,"\n") 436 | 437 | mylist = [3,2,1] 438 | print("my_sort2(mylist):",my_sort2(mylist)) 439 | print (mylist,"\n") 440 | 441 | 442 | mylist = [3,2,1] 443 | print("my_sort2(mylist):",my_sort3(mylist)) 444 | print (mylist,"\n") 445 | 446 | 447 | ``` 448 | 449 | [1, 2, 3] 450 | 451 | my_sort2(mylist): [1, 2, 3] 452 | [3, 2, 1] 453 | 454 | my_sort2(mylist): [1, 2, 3] 455 | [1, 2, 3] 456 | 457 | 458 | 459 | ## 2、参数传递 460 | 早在4.1节中,你就已经看到了赋值操作,而一个结构化对象的值是该对象的引用。函数也是一样的。要理解Python按值传递参数,只要了解它是如何赋值的就足够了。【意思就是在函数中赋值的时候如果是结构化对象,那么赋值仅仅是引用,如果是字符串、数值等非结构化的变量,则在函数中改变,仅仅是局部变量改变】 461 | 462 | 463 | ```python 464 | def set_up(word, properties): 465 | word = 'lolcat' 466 | properties.append('noun') 467 | properties = 5 468 | 469 | w = '' 470 | p = [] 471 | set_up(w, p) 472 | print(w) 473 | print(p) 474 | 475 | ``` 476 | 477 | 478 | ['noun'] 479 | 480 | 481 | 请注意,w没有被函数改变。当我们调用set_up(w, p)时,w(空字符串)的值被分配到一个新的变量word。在函数内部word值被修改。然而,这种变化并没有传播给w。这个参数传递过程与下面的赋值序列是一样的: 482 | 483 | 484 | ```python 485 | w = '' 486 | word = w 487 | word = 'lolcat' 488 | print(w) 489 | 490 | ``` 491 | 492 | 493 | 494 | 495 | 让我们来看看列表p上发生了什么。当我们调用set_up(w, p),p的值(一个空列表的引用)被分配到一个新的本地变量properties,所以现在这两个变量引用相同的内存位置。函数修改properties,而这种变化也反映在p值上,正如我们所看到的。函数也分配给properties一个新的值(数字5);这并不能修改该内存位置上的内容,而是创建了一个新的局部变量。这种行为就好像是我们做了下列赋值序列: 496 | 497 | 498 | 499 | 500 | ```python 501 | p = [] 502 | properties = p 503 | properties.append('noun') 504 | properties = 5 505 | print(p) 506 | 507 | ``` 508 | 509 | ['noun'] 510 | 511 | 512 | ## 3、变量的作用域 513 | 514 | 当你在一个函数体内部使用一个现有的名字时,Python解释器先尝试按照函数本地的名字来解释。如果没有发现,解释器检查它是否是一个模块内的全局名称。最后,如果没有成功,解释器会检查是否是Python内置的名字。这就是所谓的名称解析的LGB规则:本地(local),全局(global),然后内置(built-in)。 515 | 516 | ## 4、参数类型检查 517 | 我们写程序时,Python不会强迫我们声明变量的类型,这允许我们定义参数类型灵活的函数。例如,我们可能希望一个标注只是一个词序列,而不管这个序列被表示为一个列表、元组(或是迭代器,一种新的序列类型,超出了当前的讨论范围)。 518 | 519 | 然而,我们常常想写一些能被他人利用的程序,并希望以一种防守的风格,当函数没有被正确调用时提供有益的警告。下面的tag()函数的作者假设其参数将始终是一个字符串。 520 | 521 | 522 | ```python 523 | def tag(word): 524 | if word in ['a', 'the', 'all']: 525 | return 'det' 526 | else: 527 | return 'noun' 528 | 529 | print(tag('the')) 530 | 531 | print(tag('knight')) 532 | 533 | print(tag(["'Tis", 'but', 'a', 'scratch'])) 534 | 535 | ``` 536 | 537 | det 538 | noun 539 | noun 540 | 541 | 542 | 该函数对参数'the'和'knight'返回合理的值,传递给它一个列表[1],看看会发生什么——它没有抱怨,虽然它返回的结果显然是不正确的。此函数的作者可以采取一些额外的步骤来确保tag()函数的参数word是一个字符串。一种直白的做法是使用if not type(word) is str检查参数的类型,如果word不是一个字符串,简单地返回Python特殊的空值None。这是一个略微的改善,因为该函数在检查参数类型,并试图对错误的输入返回一个“特殊的”诊断结果。然而,它也是危险的,因为调用程序可能不会检测None是故意设定的“特殊”值,这种诊断的返回值可能被传播到程序的其他部分产生不可预测的后果。如果这个词是一个Unicode字符串这种方法也会失败。因为它的类型是unicode而不是str。这里有一个更好的解决方案,使用assert语句和Python的basestring的类型一起,它是unicode和str的共同类型。 543 | 544 | 545 | ```python 546 | def 
tag(word): 547 | assert isinstance(word, str), "argument to tag() must be a string" 548 | if word in ['a', 'the', 'all']: 549 | return 'det' 550 | else: 551 | return 'noun' 552 | 553 | print(tag(["'Tis", 'but', 'a', 'scratch'])) 554 | 555 | ``` 556 | 557 | 558 | --------------------------------------------------------------------------- 559 | 560 | AssertionError Traceback (most recent call last) 561 | 562 | in 563 | 6 return 'noun' 564 | 7 565 | ----> 8 print(tag(["'Tis", 'but', 'a', 'scratch'])) 566 | 567 | 568 | in tag(word) 569 | 1 def tag(word): 570 | ----> 2 assert isinstance(word, str), "argument to tag() must be a string" 571 | 3 if word in ['a', 'the', 'all']: 572 | 4 return 'det' 573 | 5 else: 574 | 575 | 576 | AssertionError: argument to tag() must be a string 577 | 578 | 579 | 如果assert语句失败,它会产生一个不可忽视的错误而停止程序执行。此外,该错误信息是容易理解的。程序中添加断言能帮助你找到逻辑错误,是一种防御性编程。一个更根本的方法是在本节后面描述的使用文档字符串为每个函数记录参数的文档。 580 | 581 | ## 5、功能分解 582 | 当我们使用函数时,主程序可以在一个更高的抽象水平编写,使其结构更透明,例如: 583 | 584 | 585 | ```python 586 | import nltk 587 | tokens = nltk.corpus.brown.words(categories='news') 588 | total = sum(len(w) for w in tokens) 589 | print(total / len(tokens)) 590 | ``` 591 | 592 | 4.401545438271973 593 | 594 | 595 | 思考例 4-2 中 freq_words 函数。它更新一个作为参数传递进来的频率分布的内容,并输出前 n 个最频繁的词的链表。 596 | 597 | 例 4-2. 设计不佳的函数用来计算高频词。 598 | 599 | 600 | ```python 601 | from nltk import * 602 | from urllib import request 603 | from bs4 import BeautifulSoup 604 | 605 | def freq_words(url, freqdist, n): 606 | html = request.urlopen(url).read().decode('utf8') 607 | raw = BeautifulSoup(html).get_text() 608 | for word in word_tokenize(raw): 609 | freqdist[word.lower()] += 1 610 | result = [] 611 | for word, count in freqdist.most_common(n): 612 | result = result + [word] 613 | print(result) 614 | constitution = "http://www.archives.gov/national-archives-experience" \ 615 | "/charters/constitution_transcript.html" 616 | fd = nltk.FreqDist() 617 | print([w for (w, _) in fd.most_common(20)]) 618 | freq_words(constitution, fd, 20) 619 | print("\n",[w for (w, _) in fd.most_common(30)]) 620 | ``` 621 | 622 | [] 623 | ["''", ',', ':1', 'the', ':', 'of', '{', ';', '}', '(', ')', "'", 'archives', 'and', '.', '[', ']', '``', 'national', 'documents'] 624 | 625 | ["''", ',', ':1', 'the', ':', 'of', '{', ';', '}', '(', ')', "'", 'archives', 'and', '.', '[', ']', '``', 'national', 'documents', 'founding', '#', 'to', 'declaration', 'constitution', 'a', 'visit', 'online', 'freedom', 'for'] 626 | 627 | 628 | 这个函数有几个问题。该函数有两个副作用:它修改了第二个参数的内容,并输出它已计算的结果的经过选择的子集。如果我们在函数内部初始化FreqDist()对象(在它被处理的同一个地方),并且去掉选择集而将结果显示给调用程序的话,函数会更容易理解和更容易在其他地方重用。考虑到它的任务是找出频繁的一个词,它应该只应该返回一个列表,而不是整个频率分布。在4.4中,我们重构此函数,并通过去掉freqdist参数简化其接口。 629 | 630 | 631 | ```python 632 | from urllib import request 633 | from bs4 import BeautifulSoup 634 | 635 | def freq_words(url, n): 636 | html = request.urlopen(url).read().decode('utf8') 637 | text = BeautifulSoup(html).get_text() 638 | freqdist = nltk.FreqDist(word.lower() for word in word_tokenize(text)) 639 | return [word for (word, _) in freqdist.most_common(n)] 640 | 641 | constitution = "http://www.archives.gov/national-archives-experience" \ 642 | "/charters/constitution_transcript.html" 643 | print(freq_words(constitution, 20)) 644 | ``` 645 | 646 | ["''", ',', ':1', 'the', ':', 'of', '{', ';', '}', '(', ')', "'", 'archives', 'and', '.', '[', ']', '``', 'national', 'documents'] 647 | 648 | 649 | ## 6、编写函数的文档 650 | 在函数的定义顶部的文档字符串中提供这些描述。这个说明不应该解释函数是如何实现的;实际上,应该能够不改变这个说明,使用不同的方法,重新实现这个函数。 651 | 652 | 653 | ```python 654 | def 
accuracy(reference, test): 655 | """ 656 | Calculate the fraction of test items that equal the corresponding reference items. 657 | 658 | Given a list of reference values and a corresponding list of test values, 659 | return the fraction of corresponding values that are equal. 660 | In particular, return the fraction of indexes 661 | {0>> accuracy(['ADJ', 'N', 'V', 'N'], ['N', 'N', 'V', 'ADJ']) 664 | 0.5 665 | 666 | :param reference: An ordered list of reference values 667 | :type reference: list 668 | :param test: A list of values to compare against the corresponding 669 | reference values 670 | :type test: list 671 | :return: the accuracy score 672 | :rtype: float 673 | :raises ValueError: If reference and length do not have the same length 674 | """ 675 | 676 | if len(reference) != len(test): 677 | raise ValueError("Lists must have the same length.") 678 | num_correct = 0 679 | for x, y in zip(reference, test): 680 | if x == y: 681 | num_correct += 1 682 | return float(num_correct) / len(reference) 683 | ``` 684 | 685 | # 4.5 更多关于函数 686 | 687 | ## 1、作为参数的函数 688 | 689 | 到目前为止,我们传递给函数的参数一直都是简单的对象,如字符串或列表等结构化对象。Python也允许我们传递一个函数作为另一个函数的参数。现在,我们可以抽象出操作,对相同数据进行不同操作。正如下面的例子表示的,我们可以传递内置函数len()或用户定义的函数last_letter()作为另一个函数的参数: 690 | 691 | 692 | ```python 693 | sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the', 694 | 'sounds', 'will', 'take', 'care', 'of', 'themselves', '.'] 695 | def extract_property(prop): 696 | return [prop(word) for word in sent] 697 | 698 | print(extract_property(len)) 699 | 700 | def last_letter(word): 701 | return word[-1] 702 | 703 | print(extract_property(last_letter)) 704 | ``` 705 | 706 | [4, 4, 2, 3, 5, 1, 3, 3, 6, 4, 4, 4, 2, 10, 1] 707 | ['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.'] 708 | 709 | 710 | 对象len和last_letter可以像列表和字典那样被传递。请注意,只有在我们调用该函数时,才在函数名后使用括号;当我们只是将函数作为一个对象,括号被省略。 711 | 712 | Python提供了更多的方式来定义函数作为其他函数的参数,即所谓的lambda 表达式。试想在很多地方没有必要使用上述的last_letter()函数,因此没有必要给它一个名字。我们可以等价地写以下内容: 713 | 714 | 715 | ```python 716 | extract_property(lambda w: w[-1]) 717 | ``` 718 | 719 | 720 | 721 | 722 | ['e', 'e', 'f', 'e', 'e', ',', 'd', 'e', 's', 'l', 'e', 'e', 'f', 's', '.'] 723 | 724 | 725 | 726 | 我们的下一个例子演示传递一个函数给sorted()函数。当我们用唯一的参数(需要排序的链表)调用后者,它使用内置的比较函数cmp()。然而,我们可以提供自己的排序函数,例如按长度递减排序。 727 | 728 | 729 | ```python 730 | print(sorted(sent)) 731 | 732 | print(sorted(sent, key = lambda x:x[-1])) 733 | ``` 734 | 735 | [',', '.', 'Take', 'and', 'care', 'care', 'of', 'of', 'sense', 'sounds', 'take', 'the', 'the', 'themselves', 'will'] 736 | [',', '.', 'and', 'Take', 'care', 'the', 'sense', 'the', 'take', 'care', 'of', 'of', 'will', 'sounds', 'themselves'] 737 | 738 | 739 | ## 2、累计函数 740 | 741 | 这些函数以初始化一些存储开始,迭代和处理输入的数据,最后返回一些最终的对象(一个大的结构或汇总的结果)。做到这一点的一个标准的方式是初始化一个空链表,累计材料,然后返回这个链表,如4.6中所示函数search1()。 742 | 743 | 744 | ```python 745 | import nltk 746 | 747 | def search1(substring, words): 748 | result = [] 749 | for word in words: 750 | if substring in word: 751 | result.append(word) 752 | return result 753 | 754 | def search2(substring, words): 755 | for word in words: 756 | if substring in word: 757 | yield word 758 | 759 | for item in search1('fizzled', nltk.corpus.brown.words()): 760 | print (item) 761 | 762 | for item in search2('Grizzlies', nltk.corpus.brown.words()): 763 | print (item) 764 | ``` 765 | 766 | fizzled 767 | Grizzlies' 768 | 769 | 770 | 函数search2()是一个生成器。第一次调用此函数,它运行到yield语句然后停下来。调用程序获得第一个词,完成任何必要的处理。一旦调用程序对另一个词做好准备,函数会从停下来的地方继续执行,直到再次遇到yield语句。这种方法通常更有效,因为函数只产生调用程序需要的数据,并不需要分配额外的内存来存储输出(参见前面关于生成器表达式的讨论)。 
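下面补充一个最小的演示(非原文内容;沿用上面定义的 search2):生成器是按需产生结果的,调用方取到需要的项就可以停下来,函数并不会把整个输入扫描完:

```python
gen = search2('zz', ['fizz', 'buzz', 'hello', 'jazz'])
print(next(gen))   # 输出 'fizz',函数在第一个 yield 处暂停
print(next(gen))   # 输出 'buzz',再次请求时从暂停处继续执行
```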
771 | 772 | 下面是一个更复杂的生成器的例子,产生一个词列表的所有排列。为了强制permutations()函数产生所有它的输出,我们将它包装在list()调用中[1]。 773 | 774 | 775 | ```python 776 | def permutations(seq): 777 | if len(seq) <= 1: 778 | yield seq 779 | else: 780 | for perm in permutations(seq[1:]): 781 | for i in range(len(perm)+1): 782 | yield perm[:i] + seq[0:1] + perm[i:] 783 | 784 | list(permutations(['police', 'fish', 'buffalo'])) 785 | ``` 786 | 787 | 788 | 789 | 790 | [['police', 'fish', 'buffalo'], 791 | ['fish', 'police', 'buffalo'], 792 | ['fish', 'buffalo', 'police'], 793 | ['police', 'buffalo', 'fish'], 794 | ['buffalo', 'police', 'fish'], 795 | ['buffalo', 'fish', 'police']] 796 | 797 | 798 | 799 | permutations函数使用了一种技术叫递归,将在下面4.7讨论。产生一组词的排列对于创建测试一个语法的数据十分有用(8.)。 800 | 801 | ## 3、高阶函数 802 | 803 | ### filter() 804 | 我们使用函数作为filter()的第一个参数,它对作为它的第二个参数的序列中的每个项目运用该函数,只保留该函数返回True的项目。 805 | 806 | 807 | ```python 808 | def is_content_word(word): 809 | return word.lower() not in ['a', 'of', 'the', 'and', 'will', ',', '.'] 810 | sent = ['Take', 'care', 'of', 'the', 'sense', ',', 'and', 'the', 811 | 'sounds', 'will', 'take', 'care', 'of', 'themselves', '.'] 812 | list(filter(is_content_word, sent)) 813 | ``` 814 | 815 | 816 | 817 | 818 | ['Take', 'care', 'sense', 'sounds', 'take', 'care', 'themselves'] 819 | 820 | 821 | 822 | ### map() 823 | 将一个函数运用到一个序列中的每一项。 824 | 825 | 826 | ```python 827 | lengths = list(map(len, nltk.corpus.brown.sents(categories='news'))) 828 | print(sum(lengths) / len(lengths)) 829 | 830 | 831 | lengths = [len(sent) for sent in nltk.corpus.brown.sents(categories='news')] 832 | print(sum(lengths) / len(lengths)) 833 | ``` 834 | 835 | 21.75081116158339 836 | 21.75081116158339 837 | 838 | 839 | ### lambda 表达式 840 | 这里是两个等效的例子,计数每个词中的元音的数量。 841 | 842 | 843 | ```python 844 | list(map(lambda w: len(list(filter(lambda c: c.lower() in "aeiou", w))), sent)) 845 | ``` 846 | 847 | 848 | 849 | 850 | [2, 2, 1, 1, 2, 0, 1, 1, 2, 1, 2, 2, 1, 3, 0] 851 | 852 | 853 | 854 | 855 | ```python 856 | [len([c for c in w if c.lower() in "aeiou"]) for w in sent] 857 | ``` 858 | 859 | 860 | 861 | 862 | [2, 2, 1, 1, 2, 0, 1, 1, 2, 1, 2, 2, 1, 3, 0] 863 | 864 | 865 | 866 | 列表推导为基础的解决方案通常比基于高阶函数的解决方案可读性更好,我们在整个这本书的青睐于使用前者。 867 | 868 | ## 4、命名的参数 869 | 当有很多参数时,很容易混淆正确的顺序。我们可以通过名字引用参数,甚至可以给它们分配默认值以供调用程序没有提供该参数时使用。现在参数可以按任意顺序指定,也可以省略。 870 | 871 | 872 | ```python 873 | def repeat(msg='', num=1): 874 | return msg * num 875 | print(repeat(num=3)) 876 | 877 | print(repeat(msg='Alice')) 878 | 879 | print(repeat(num=5, msg='Alice')) 880 | 881 | ``` 882 | 883 | 884 | Alice 885 | AliceAliceAliceAliceAlice 886 | 887 | 888 | ### *args作为函数参数 889 | 890 | 891 | ```python 892 | song = [['four', 'calling', 'birds'], 893 | ['three', 'French', 'hens'], 894 | ['two', 'turtle', 'doves']] 895 | print(song[0]) 896 | print(list(zip(song[0], song[1], song[2]))) 897 | 898 | print(list(zip(*song))) 899 | ``` 900 | 901 | ['four', 'calling', 'birds'] 902 | [('four', 'three', 'two'), ('calling', 'French', 'turtle'), ('birds', 'hens', 'doves')] 903 | [('four', 'three', 'two'), ('calling', 'French', 'turtle'), ('birds', 'hens', 'doves')] 904 | 905 | 906 | 应该从这个例子中明白输入*song仅仅是一个方便的记号,相当于输入了song[0], song[1], song[2]。 907 | 908 | ### 命名参数的另一个作用是它们允许选择性使用参数。 909 | 910 | 911 | ```python 912 | from nltk import * 913 | def freq_words(file, min=1, num=10): 914 | text = open(file).read() 915 | tokens = word_tokenize(text) 916 | freqdist = nltk.FreqDist(t for t in tokens if len(t) >= min) 917 | return freqdist.most_common(num) 918 | fw = freq_words('4.test.html', 4, 10) 919 | fw = 
freq_words('4.test.html', min=4, num=10) 920 | fw = freq_words('4.test.html', num=10, min=4) 921 | ``` 922 | 923 | ### 可选参数的另一个常见用途是作为标志使用。 924 | 这里是同一个的函数的修订版本,如果设置了verbose标志将会报告其进展情况: 925 | 926 | 927 | ```python 928 | def freq_words(file, min=1, num=10, verbose=False): 929 | freqdist = FreqDist() 930 | if verbose: print("Opening", file) 931 | with open(file) as f: 932 | text = f.read() 933 | if verbose: print("Read in %d characters" % len(file)) 934 | for word in word_tokenize(text): 935 | if len(word) >= min: 936 | freqdist[word] += 1 937 | if verbose and freqdist.N() % 100 == 0: print(".", sep="",end = " ") 938 | if verbose: print 939 | return freqdist.most_common(num) 940 | fw = freq_words("4.test.html", 4 ,10, True) 941 | ``` 942 | 943 | Opening 4.test.html 944 | Read in 11 characters 945 | . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 946 | 947 | ### 注意 948 | 注意不要使用可变对象作为参数的默认值。这个函数的一系列调用将使用同一个对象,有时会出现离奇的结果,就像我们稍后会在关于调试的讨论中看到的那样。 949 | 950 | ### 注意 951 | 如果你的程序将使用大量的文件,它是一个好主意来关闭任何一旦不再需要的已经打开的文件。如果你使用with语句,Python会自动关闭打开的文件︰ 952 | 953 | 954 | ```python 955 | with open("4.test.html") as f: 956 | data = f.read() 957 | ``` 958 | 959 | # 4.6 程序开发 960 | 961 | 编程是一种技能,需要获得几年的各种编程语言和任务的经验。 962 | 963 | **关键的高层次能力**:是算法设计及其在结构化编程中的实现。 964 | 965 | **关键的低层次的能力**包括熟悉语言的语法结构,以及排除故障的程序(不能表现预期的行为的程序)的各种诊断方法的知识。 966 | 967 | ## 1、Python模块的结构 968 | 969 | 与其他NLTK的模块一样,distance.py以一组注释行开始,包括一行模块标题和作者信息。 970 | 971 | (由于代码会被发布,也包括代码可用的URL、版权声明和许可信息。) 972 | 973 | 接下来是模块级的文档字符串,三重引号的多行字符串,其中包括当有人输入help(nltk.metrics.distance)将被输出的关于模块的信息。 974 | 975 | 976 | ```python 977 | # Natural Language Toolkit: Distance Metrics 978 | # 979 | # Copyright (C) 2001-2013 NLTK Project 980 | # Author: Edward Loper 981 | # Steven Bird 982 | # Tom Lippincott 983 | # URL: 984 | # For license information, see LICENSE.TXT 985 | # 986 | 987 | """ 988 | Distance Metrics. 989 | 990 | Compute the distance between two items (usually strings). 991 | As metrics, they must satisfy the following three requirements: 992 | 993 | 1. d(a, a) = 0 994 | 2. d(a, b) >= 0 995 | 3. d(a, c) <= d(a, b) + d(b, c) 996 | """ 997 | ``` 998 | 999 | 1000 | 1001 | 1002 | '\nDistance Metrics.\n\nCompute the distance between two items (usually strings).\nAs metrics, they must satisfy the following three requirements:\n\n1. d(a, a) = 0\n2. d(a, b) >= 0\n3. 
d(a, c) <= d(a, b) + d(b, c)\n' 1003 | 1004 | 1005 | 1006 | ## 2、多模块程序 1007 | 一些程序汇集多种任务,例如从语料库加载数据、对数据进行一些分析、然后将其可视化。我们可能已经有了稳定的模块来加载数据和实现数据可视化。我们的工作可能会涉及到那些分析任务的编码,只是从现有的模块调用一些函数。4.7描述了这种情景。 1008 | ![4.1](./picture/4.1.png) 1009 | 1010 | ## 3、错误源头 1011 | 首先,输入的数据可能包含一些意想不到的字符。 1012 | 1013 | 第二,提供的函数可能不会像预期的那样运作。 1014 | 1015 | 第三,我们对Python语义的理解可能出错。 1016 | 1017 | 1018 | ```python 1019 | print("%s.%s.%02d" % "ph.d.", "n", 1) 1020 | ``` 1021 | 1022 | 1023 | --------------------------------------------------------------------------- 1024 | 1025 | TypeError Traceback (most recent call last) 1026 | 1027 | in 1028 | ----> 1 print("%s.%s.%02d" % "ph.d.", "n", 1) 1029 | 1030 | 1031 | TypeError: not enough arguments for format string 1032 | 1033 | 1034 | 1035 | ```python 1036 | print("%s.%s.%02d" % ("ph.d.", "n", 1)) 1037 | ``` 1038 | 1039 | ph.d..n.01 1040 | 1041 | 1042 | ### 在函数中命名参数不能设置列表等对象 1043 | 程序的行为并不如预期,因为我们错误地认为在函数被调用时会创建默认值。然而,它只创建了一次,在Python解释器加载这个函数时。这一个列表对象会被使用,只要没有给函数提供明确的值。 1044 | 1045 | 1046 | ```python 1047 | def find_words(text, wordlength, result=[]): 1048 | for word in text: 1049 | if len(word) == wordlength: 1050 | result.append(word) 1051 | return result 1052 | 1053 | print(find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 3) ) 1054 | 1055 | print(find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 2, ['ur']) ) 1056 | 1057 | print(find_words(['omg', 'teh', 'lolcat', 'sitted', 'on', 'teh', 'mat'], 3) ) # 明显错误 1058 | 1059 | ``` 1060 | 1061 | ['omg', 'teh', 'teh', 'mat'] 1062 | ['ur', 'on'] 1063 | ['omg', 'teh', 'teh', 'mat', 'omg', 'teh', 'teh', 'mat'] 1064 | 1065 | 1066 | ## 4、调试技术 1067 | 1.由于大多数代码错误是因为程序员的不正确的假设,你检测bug要做的第一件事是检查你的假设。通过给程序添加print语句定位问题,显示重要的变量的值,并显示程序的进展程度。 1068 | 1069 | 2.解释器会输出一个堆栈跟踪,精确定位错误发生时程序执行的位置。 1070 | 1071 | 3.Python提供了一个调试器,它允许你监视程序的执行,指定程序暂停运行的行号(即断点),逐步调试代码段和检查变量的值。你可以如下方式在你的代码中调用调试器: 1072 | 1073 | 1074 | ```python 1075 | import pdb 1076 | print(find_words(['cat'], 3)) # [_first-run] 1077 | 1078 | pdb.run("find_words(['dog'], 3)") # [_second-run] 1079 | ``` 1080 | 1081 | ['omg', 'teh', 'teh', 'mat', 'omg', 'teh', 'teh', 'mat', 'cat'] 1082 | > (1)() 1083 | (Pdb) step 1084 | --Call-- 1085 | > (1)find_words() 1086 | -> def find_words(text, wordlength, result=[]): 1087 | (Pdb) step 1088 | > (2)find_words() 1089 | -> for word in text: 1090 | (Pdb) args 1091 | text = ['dog'] 1092 | wordlength = 3 1093 | result = ['omg', 'teh', 'teh', 'mat', 'omg', 'teh', 'teh', 'mat', 'cat'] 1094 | (Pdb) next 1095 | > (3)find_words() 1096 | -> if len(word) == wordlength: 1097 | (Pdb) b 1098 | (Pdb) c 1099 | 1100 | 1101 | 它会给出一个提示(Pdb),你可以在那里输入指令给调试器。输入help来查看命令的完整列表。输入step(或只输入s)将执行当前行然后停止。如果当前行调用一个函数,它将进入这个函数并停止在第一行。输入next(或只输入n)是类似的,但它会在当前函数中的下一行停止执行。break(或b)命令可用于创建或列出断点。输入continue(或c)会继续执行直到遇到下一个断点。输入任何变量的名称可以检查它的值。 1102 | 1103 | ## 5、防御性编程 1104 | 1105 | 1.考虑在你的代码中添加assert语句,指定变量的属性,例如assert(isinstance(text, list))。如果text的值在你的代码被用在一些较大的环境中时变为了一个字符串,将产生一个AssertionError,于是你会立即得到问题的通知。 1106 | 1107 | 2.一旦你觉得你发现了错误,作为一个假设查看你的解决方案。在重新运行该程序之前尝试预测你修正错误的影响。如果bug不能被修正,不要陷入盲目修改代码希望它会奇迹般地重新开始运作的陷阱。相反,每一次修改都要尝试阐明错误是什么和为什么这样修改会解决这个问题的假设。如果这个问题没有解决就撤消这次修改。 1108 | 1109 | 3.当你开发你的程序时,扩展其功能,并修复所有bug,维护一套测试用例是有益的。这被称为回归测试,因为它是用来检测代码“回归”的地方——修改代码后会带来一个意想不到的副作用是以前能运作的程序不运作了的地方。Python以doctest模块的形式提供了一个简单的回归测试框架。这个模块搜索一个代码或文档文件查找类似与交互式Python会话这样的文本块,这种形式你已经在这本书中看到了很多次。它执行找到的Python命令,测试其输出是否与原始文件中所提供的输出匹配。每当有不匹配时,它会报告预期值和实际值。有关详情,请查询在 documentation at http://docs.python.org/library/doctest.html 上的doctest文档。除了回归测试它的值,doctest模块有助于确保你的软件文档与你的代码保持同步。 1110 | 1111 
| 4.也许最重要的防御性编程策略是要清楚的表述你的代码,选择有意义的变量和函数名,并通过将代码分解成拥有良好文档的接口的函数和模块尽可能的简化代码。 1112 | 1113 | # 4.7 算法设计 1114 | 解决算法问题的一个重要部分是为手头的问题选择或改造一个合适的算法。有时会有几种选择,能否选择最好的一个取决于对每个选择随数据增长如何执行的知识。 1115 | 1116 | 算法设计的另一种方法,我们解决问题通过将它转化为一个我们已经知道如何解决的问题的一个实例。例如,为了检测列表中的重复项,我们可以预排序这个列表,然后通过一次扫描检查是否有相邻的两个元素是相同的。 1117 | 1118 | ## 1、迭代与递归 1119 | 例如,假设我们有n个词,要计算出它们结合在一起有多少不同的方式能组成一个词序列。 1120 | 1121 | ### 迭代 1122 | 如果我们只有一个词(n=1),只是一种方式组成一个序列。如果我们有2个词,就有2种方式将它们组成一个序列。3个词有6种可能性。一般的,n个词有n × n-1 × … × 2 × 1种方式(即n的阶乘)。我们可以将这些编写成如下代码: 1123 | 1124 | 1125 | ```python 1126 | def factorial1(n): 1127 | result = 1 1128 | for i in range(n): 1129 | result *= (i+1) 1130 | return result 1131 | ``` 1132 | 1133 | ### 递归 1134 | 假设我们有办法为n-1 不同的词构建所有的排列。然后对于每个这样的排列,有n个地方我们可以插入一个新词:开始、结束或任意两个词之间的n-2个空隙。因此,我们简单的将n-1个词的解决方案数乘以n的值。我们还需要基础案例,也就是说,如果我们有一个词,只有一个顺序。我们可以将这些编写成如下代码: 1135 | 1136 | 1137 | 1138 | 1139 | ```python 1140 | def factorial2(n): 1141 | if n == 1: 1142 | return 1 1143 | else: 1144 | return n * factorial2(n-1) 1145 | ``` 1146 | 1147 | 尽管递归编程结构简单,但它是有代价的。每次函数调用时,一些状态信息需要推入堆栈,这样一旦函数执行完成可以从离开的地方继续执行。出于这个原因,迭代的解决方案往往比递归解决方案的更高效。 1148 | 1149 | ## 2、权衡空间与时间 1150 | 我们有时可以显著的加快程序的执行,通过建设一个辅助的数据结构,例如索引。4.10实现一个简单的电影评论语料库的全文检索系统。通过索引文档集合,它提供更快的查找。 1151 | 1152 | 1153 | ```python 1154 | def raw(file): 1155 | with open(file) as f: 1156 | contents = f.read() 1157 | contents = re.sub(r"<.*?>", " ", contents) 1158 | contents = re.sub("\s+", " ", contents) 1159 | return contents 1160 | 1161 | def snippet(doc, term): 1162 | text = " "*30 + raw(doc) + " "*30 1163 | pos = text.index(term) 1164 | return text[pos - 30 : pos + 30] 1165 | 1166 | 1167 | print("Building Index...") 1168 | files = nltk.corpus.movie_reviews.abspaths() 1169 | idx = nltk.Index((w, f) for f in files for w in raw(f).split()) 1170 | 1171 | 1172 | query = " " 1173 | while query != "quit": 1174 | query = input("query> ") 1175 | if query in idx: 1176 | for doc in idx[query]: 1177 | print(snippet(doc, query)) 1178 | else: 1179 | print("Not found") 1180 | 1181 | 1182 | 1183 | 1184 | 1185 | 1186 | ``` 1187 | 1188 | Building Index... 1189 | query> so what does joe do 1190 | Not found 1191 | query> quit 1192 | s funded by her mother . lucy quit working professionally 10 1193 | erick . i disliked that movie quite a bit , but since " prac 1194 | t disaster . babe ruth didn't quit baseball after one season 1195 | o-be fiance . i think she can quit that job and get a more r 1196 | and rose mcgowan should just quit acting . she has no chari 1197 | and get a day job . and don't quit it . 1198 | kubrick , alas , should have quit while he was ahead . this 1199 | everyone involved should have quit while they were still ahe 1200 | l die . so what does joe do ? quit his job , of course ! ! w 1201 | red " implant . he's ready to quit the biz and get a portion 1202 | hat he always recorded , they quit and become disillusioned 1203 | admit that i ? ? ? ve become quite the " scream " fan . no 1204 | again , the fact that he has quit his job to feel what it's 1205 | school reunion . he has since quit his job as a travel journ 1206 | ells one of his friends , " i quit school because i didn't l 1207 | ms , cursing off the boss and quitting his job ( " today i q 1208 | e , the arrival of the now ubiquitous videocassette . 
burt r 1209 | in capitol city , that he has quit his job and hopes to open 1210 | before his death at age 67 to quit filmmaking once a homosex 1211 | - joss's explanation that he quit the priesthood because of 1212 | is a former prosecutor , and quit because of tensions betwe 1213 | 1214 | 1215 | 一个更微妙的空间与时间折中的例子涉及使用整数标识符替换一个语料库的词符。我们为语料库创建一个词汇表,每个词都被存储一次的列表,然后转化这个列表以便我们能通过查找任意词来找到它的标识符。每个文档都进行预处理,使一个词列表变成一个整数列表。现在所有的语言模型都可以使用整数。见4.11中的内容,如何为一个已标注的语料库做这个的例子的列表。 1216 | 1217 | 1218 | ```python 1219 | def preprocess(tagged_corpus): 1220 | words = set() 1221 | tags = set() 1222 | for sent in tagged_corpus: 1223 | for word, tag in sent: 1224 | words.add(word) 1225 | tags.add(tag) 1226 | wm = dict((w, i) for (i, w) in enumerate(words)) 1227 | tm = dict((t, i) for (i, t) in enumerate(tags)) 1228 | return [[(wm[w], tm[t]) for (w, t) in sent] for sent in tagged_corpus] 1229 | ``` 1230 | 1231 | ### 集合中的元素会自动索引,所以测试一个大的集合的成员将远远快于测试相应的列表的成员。 1232 | 空间时间权衡的另一个例子是维护一个词汇表。如果你需要处理一段输入文本检查所有的词是否在现有的词汇表中,词汇表应存储为一个集合,而不是一个列表。集合中的元素会自动索引,所以测试一个大的集合的成员将远远快于测试相应的列表的成员。 1233 | 1234 | 我们可以使用timeit模块测试这种说法。Timer类有两个参数:一个是多次执行的语句,一个是只在开始执行一次的设置代码。我们将分别使用一个整数的列表[1]和一个整数的集合[2]模拟10 万个项目的词汇表。测试语句将产生一个随机项,它有50%的机会在词汇表中[3]。 1235 | 1236 | 1237 | ```python 1238 | from timeit import Timer 1239 | vocab_size = 100000 1240 | setup_list = "import random; vocab = range(%d)" % vocab_size #[1] 1241 | setup_set = "import random; vocab = set(range(%d))" % vocab_size #[2] 1242 | statement = "random.randint(0, %d) in vocab" % (vocab_size * 2) #[3] 1243 | print(Timer(statement, setup_list).timeit(1000)) 1244 | 1245 | print(Timer(statement, setup_set).timeit(1000)) 1246 | 1247 | ``` 1248 | 1249 | 0.024326185999598238 1250 | 0.005347012998754508 1251 | 1252 | 1253 | 执行1000 次链表成员资格测试总共需要2.8秒,而在集合上的等效试验仅需0.0037 秒,也就是说快了三个数量级! 1254 | 1255 | ## 3、动态规划 1256 | 动态规划(Dynamic programming)是一种自然语言处理中被广泛使用的算法设计的一 1257 | 般方法。 1258 | “programming”一词的用法与你可能想到的感觉不同,是规划或调度的意思。动态 1259 | 规划用于解决包含多个重叠的子问题的问题。不是反复计算这些子问题,而是简单的将它们 1260 | 的计算结果存储在一个查找表中。在本节的余下部分,我们将介绍动态规划,但在一个相当 1261 | 不同的背景:句法分析,下介绍。 1262 | 1263 | Pingala 是大约生活在公元前 5 世纪的印度作家,作品有被称为《Chandas Shastra》的 1264 | 梵文韵律专著。Virahanka 大约在公元 6 世纪延续了这项工作,研究短音节和长音节组合产 1265 | 生一个长度为 n 的旋律的组合数。短音节,标记为 S,占一个长度单位,而长音节,标记为 1266 | L,占 2 个长度单位。Pingala 发现,例如:有 5 种方式构造一个长度为 4 的旋律:V 4 = {L 1267 | L, SSL, SLS, LSS, SSSS}。请看,我们可以将 V 4 分成两个子集,以 L 开始的子集和以 S 1268 | 开始的子集,如(1)所示。 1269 | 1270 | 1271 | 1272 | ```python 1273 | (1) V 4 = 1274 | LL, LSS 1275 | i.e. L prefixed to each item of V 2 = {L, SS} 1276 | SSL, SLS, SSSS 1277 | i.e. S prefixed to each item of V 3 = {SL, LS, SSS} 1278 | V1 = { S} 1279 | V0 = {""} 1280 | ``` 1281 | 1282 | 有了这个观察结果,我们可以写一个小的递归函数称为 virahanka1()来计算这些旋 1283 | 律,如例 4-9 所示。请注意,要计算 V 4 ,我们先要计算 V 3 和 V 2 。但要计算 V 3 ,我们先要 1284 | 计算 V 2 和 V 1 。在(2)中描述了这种调用结构。 1285 | 1286 | 例 4-9. 
四种方法计算梵文旋律: 1287 | (一)迭代(递归); 1288 | (二)自底向上的动态规划; 1289 | (三)自上而下的动态规划; 1290 | (四)内置默记法。 1291 | 1292 | 1293 | ```python 1294 | def virahanka1(n): 1295 | if n == 0: 1296 | return [""] 1297 | elif n == 1: 1298 | return ["S"] 1299 | else: 1300 | s = ["S" + prosody for prosody in virahanka1(n-1)] 1301 | l = ["L" + prosody for prosody in virahanka1(n-2)] 1302 | return s + l 1303 | 1304 | def virahanka2(n): 1305 | lookup = [[""], ["S"]] 1306 | for i in range(n-1): 1307 | s = ["S" + prosody for prosody in lookup[i+1]] 1308 | l = ["L" + prosody for prosody in lookup[i]] 1309 | lookup.append(s + l) 1310 | return lookup[n] 1311 | 1312 | 1313 | def virahanka3(n, lookup={0:[""], 1:["S"]}): 1314 | if n not in lookup: 1315 | s = ["S" + prosody for prosody in virahanka3(n - 1)] 1316 | l = ["L" + prosody for prosody in virahanka3(n - 2)] 1317 | lookup[n] = s + l 1318 | return lookup[n] 1319 | 1320 | from nltk import memoize 1321 | @memoize 1322 | def virahanka4(n): 1323 | if n == 0: 1324 | return [""] 1325 | elif n == 1: 1326 | return ["S"] 1327 | else: 1328 | s = ["S" + prosody for prosody in virahanka4(n - 1)] 1329 | l = ["L" + prosody for prosody in virahanka4(n - 2)] 1330 | return s + l 1331 | 1332 | print(virahanka1(4)) 1333 | 1334 | print(virahanka2(4)) 1335 | 1336 | print(virahanka3(4)) 1337 | 1338 | print(virahanka4(4)) 1339 | 1340 | ``` 1341 | 1342 | ['SSSS', 'SSL', 'SLS', 'LSS', 'LL'] 1343 | ['SSSS', 'SSL', 'SLS', 'LSS', 'LL'] 1344 | ['SSSS', 'SSL', 'SLS', 'LSS', 'LL'] 1345 | ['SSSS', 'SSL', 'SLS', 'LSS', 'LL'] 1346 | 1347 | 1348 | ### 法1.递归(迭代) 1349 | 正如你可以看到,V 2 计算了两次。这看上去可能并不像是一个重大的问题,但事实证 1350 | 明,当 n 变大时使用这种递归技术计算 V 20 ,我们将计算 V 2 4,181 次;对 V 40 我们将计算 V 2 1351 | 63245986 次! 1352 | ### 法2.动态规划(自下而上) 1353 | 函数 virahanka2()实现动态规划方法解决这个问题。它的工作原 1354 | 理是使用问题的所有较小的实例的计算结果填充一个表格(叫做 lookup),一旦我们得到 1355 | 了我们感兴趣的值就立即停止。此时,我们读出值,并返回它。最重要的是,每个子问题只 1356 | 计算了一次。 1357 | ### 法3.动态规划(自上而下) 1358 | 请注意,virahanka2()所采取的办法是解决较大问题前先解决较小的问题。因此,这 1359 | 被称为自下而上的方法进行动态规划。不幸的是,对于某些应用它还是相当浪费资源的,因 1360 | 为它计算的一些子问题在解决主问题时可能并不需要。采用自上而下的方法进行动态规划可 1361 | 避免这种计算的浪费,如例 4-9 中函数 virahanka3()所示。不同于自下而上的方法,这种 1362 | 方法是递归的。通过检查是否先前已存储了结果,它避免了 virahanka1()的巨大浪费。如 1363 | 果没有存储,就递归的计算结果,并将结果存储在表中。最后一步返回存储的结果。 1364 | ### 法4.内置默记法 1365 | 最后一 1366 | 种方法,invirahanka4(),使用一个 Python 的“装饰器”称为默记法(memoize),它会 1367 | 做 virahanka3()所做的繁琐的工作而不会搞乱程序。这种“默记”过程中会存储每次函数 1368 | 调用的结果以及使用到的参数。如果随后的函数调用了同样的参数,它会返回存储的结果, 1369 | 而不是重新计算。(这方面的 Python 语法超出了本书的范围。) 1370 | 1371 | # 4.8 Python 库的样例 1372 | 1373 | 例 4-10. 
# 4.8 A Sample of Python Libraries

Example 4-10. Frequency of modal verbs in different sections of the Brown Corpus.


```python
from numpy import arange
from matplotlib import pyplot

colors = 'rgbcmyk' # red, green, blue, cyan, magenta, yellow, black

def bar_chart(categories, words, counts):
    "Plot a bar chart showing counts for each word by category"
    ind = arange(len(words))
    width = 1 / (len(categories) + 1)
    bar_groups = []
    for c in range(len(categories)):
        bars = pyplot.bar(ind+c*width, counts[categories[c]], width,
                          color=colors[c % len(colors)])
        bar_groups.append(bars)
    pyplot.xticks(ind+width, words)
    pyplot.legend([b[0] for b in bar_groups], categories, loc='upper left')
    pyplot.ylabel('Frequency')
    pyplot.title('Frequency of Six Modal Verbs by Genre')
    pyplot.show()

genres = ['news', 'religion', 'hobbies', 'government', 'adventure']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfdist = nltk.ConditionalFreqDist(
             (genre, word)
             for genre in genres
             for word in nltk.corpus.brown.words(categories=genre)
             if word in modals)

counts = {}
for genre in genres:
    counts[genre] = [cfdist[genre][word] for word in modals]
print(counts)                        # show the raw counts before plotting
bar_chart(genres, modals, counts)
```

    {'news': [93, 86, 66, 38, 50, 389], 'religion': [82, 59, 78, 12, 54, 71], 'hobbies': [268, 58, 131, 22, 83, 264], 'government': [117, 38, 153, 13, 102, 244], 'adventure': [46, 151, 5, 58, 27, 50]}



![png](./picture/4.2.png)


## 1. NetworkX
The NetworkX package defines and manipulates structures consisting of nodes and edges, known as graphs. (Skipped in these notes.)

## 2. csv
Language analysis work often involves tables of data, containing information about lexical items, the participants in an empirical study, or the linguistic features extracted from a corpus. Here is a fragment of a simple lexicon in CSV format:

    sleep, sli:p, v.i, a condition of body and mind ...
    walk, wo:k, v.intr, progress by lifting and setting down each foot ...
    wake, weik, intrans, cease to sleep

We can use Python's CSV library to read and write files stored in this format. For example, we can open a CSV file called lexicon.csv [1] and iterate over its rows [2]:




```python
>>> import csv
>>> input_file = open("lexicon.csv", "r", newline='')   # text mode in Python 3, not "rb"
>>> for row in csv.reader(input_file):
...     print(row)
['sleep', 'sli:p', 'v.i', 'a condition of body and mind ...']
['walk', 'wo:k', 'v.intr', 'progress by lifting and setting down each foot ...']
['wake', 'weik', 'intrans', 'cease to sleep']
```


Each row is a list of strings. If any fields contain numerical data, they will appear as strings and will have to be converted with int() or float().
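The text says the csv library can both read and write this format, but only reading is shown above. As a small supplementary sketch (not from the book; the file name lexicon2.csv is just an example), csv.writer writes rows of the same shape back out:


```python
# Supplementary sketch (not from the book): write a lexicon with the same
# structure to a new file. The name 'lexicon2.csv' is only an example.
import csv

rows = [
    ['sleep', 'sli:p', 'v.i', 'a condition of body and mind ...'],
    ['walk', 'wo:k', 'v.intr', 'progress by lifting and setting down each foot ...'],
]

with open('lexicon2.csv', 'w', newline='') as output_file:
    writer = csv.writer(output_file)
    writer.writerows(rows)    # each inner list becomes one comma-separated line
```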
## 3. NumPy
The NumPy package provides substantial support for numerical processing in Python. NumPy has a multi-dimensional array object that is easy to initialize and access:


```python
from numpy import array
cube = array([ [[0,0,0], [1,1,1], [2,2,2]],
               [[3,3,3], [4,4,4], [5,5,5]],
               [[6,6,6], [7,7,7], [8,8,8]] ])

print(cube[1,1,1], "\n")

print(cube[2].transpose(), "\n")

print(cube[2,1:], "\n")

```

    4

    [[6 7 8]
     [6 7 8]
     [6 7 8]]

    [[7 7 7]
     [8 8 8]]



NumPy includes linear algebra functions. Here we perform singular value decomposition of a matrix, an operation used in latent semantic analysis to help identify implicit concepts in a document collection. (A small illustrative sketch of that use is given at the end of this section.)


```python
from numpy import linalg
a = array([[4,0], [3,-5]])
u, s, vt = linalg.svd(a)
print(u, "\n\n", s, "\n\n", vt)
```

    [[-0.4472136  -0.89442719]
     [-0.89442719  0.4472136 ]]

     [ 6.32455532  3.16227766]

     [[-0.70710678  0.70710678]
     [-0.70710678 -0.70710678]]


NLTK's clustering package nltk.cluster makes extensive use of NumPy arrays, and supports k-means clustering, Gaussian EM clustering, group average agglomerative clustering, and dendrogram plots. For details, type help(nltk.cluster).

## 4. Other Python Libraries
There are many other Python libraries, and you can find them with the help of the Python Package Index at http://pypi.python.org/. Many libraries provide interfaces to external software, such as relational databases (e.g. mysql-python) and large document collections (e.g. PyLucene). Many other libraries give access to file formats such as PDF, MSWord, and XML (pypdf, pywin32, xml.etree), RSS feeds (e.g. feedparser), and electronic mail (e.g. imaplib, email).
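As promised above, here is a minimal sketch (not from the book) of how SVD is used in latent semantic analysis. The term-document counts below are invented purely for illustration; keeping only the largest singular values gives a low-rank approximation in which related words and documents share the same latent "concept" dimensions.


```python
# Minimal LSA-style sketch (not from the book): the counts are invented.
from numpy import array, diag, linalg

# rows: terms ('dog', 'cat', 'car', 'truck'); columns: four documents
td = array([[2, 1, 0, 0],
            [1, 2, 0, 0],
            [0, 0, 3, 1],
            [0, 0, 1, 2]])

u, s, vt = linalg.svd(td, full_matrices=False)

k = 2                                              # keep only the two strongest "concepts"
approx = u[:, :k].dot(diag(s[:k])).dot(vt[:k, :])  # rank-k reconstruction

print(s)                 # singular values: the weight of each latent dimension
print(approx.round(2))   # rank-2 approximation of the term-document matrix
```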