├── LICENSE.txt ├── README.md ├── code ├── __pycache__ │ ├── __init__.cpython-38.pyc │ ├── seq2seq_model.cpython-38.pyc │ ├── transformer.cpython-38.pyc │ └── utils.cpython-38.pyc ├── transformer.py └── utils.py ├── en.txt ├── ner_labels.json ├── ner_test.jsonl ├── ner_train.jsonl ├── test.jsonl ├── the little prince.txt ├── train.jsonl ├── zh.txt ├── 第10章-成分句法分析.ipynb ├── 第11章-依存句法分析.ipynb ├── 第12章-语义分析.ipynb ├── 第13章-篇章分析.ipynb ├── 第2章-文本规范化.ipynb ├── 第3章-文本表示.ipynb ├── 第4章-文本分类.ipynb ├── 第5章-文本聚类.ipynb ├── 第6章-语言模型.ipynb ├── 第7章-序列到序列模型.ipynb ├── 第8章-预训练语言模型.ipynb └── 第9章-序列标注.ipynb /LICENSE.txt: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 
202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 动手学自然语言处理 2 | 3 | 欢迎来到《动手学自然语言处理》(Hands-on NLP)的地带。该系列从词的表示等基础讲起,一步步由浅入深,介绍目前一些主流的NLP算法。每一章内容都是一个Jupyter Notebook,内含详细的图文介绍和代码讲解。 4 | 5 | * 由于GitHub上渲染notebook效果有限,我们推荐读者前往[Hands-on NLP主页](https://hnlp.boyuai.com/)进行浏览,我们在此提供了纯代码版本的notebook,供大家下载运行。 6 | 7 | * 欢迎在[京东](https://item.jd.com/???.html)和[当当网](http://product.dangdang.com/???.html)购买《动手学自然语言处理》。 8 | 9 | * 如果你发现了本书的任何问题,或者有任何改善建议的,欢迎提交issue! 10 | 11 | * 本书配套的自然语言课程已上线到[伯禹学习平台](https://www.boyuai.com/course/course/5vAvkjf6AbHEf6nJ),所有人都可以学习和讨论。 12 | 13 | ![](https://boyuai.oss-cn-shanghai.aliyuncs.com/disk/tmp/hnlp-poster.jpeg) 14 | -------------------------------------------------------------------------------- /code/__pycache__/__init__.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/boyu-ai/Hands-on-NLP/3a511b8c2fe3fa4c7e75aee0a0c478d46b000683/code/__pycache__/__init__.cpython-38.pyc -------------------------------------------------------------------------------- /code/__pycache__/seq2seq_model.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/boyu-ai/Hands-on-NLP/3a511b8c2fe3fa4c7e75aee0a0c478d46b000683/code/__pycache__/seq2seq_model.cpython-38.pyc -------------------------------------------------------------------------------- /code/__pycache__/transformer.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/boyu-ai/Hands-on-NLP/3a511b8c2fe3fa4c7e75aee0a0c478d46b000683/code/__pycache__/transformer.cpython-38.pyc -------------------------------------------------------------------------------- /code/__pycache__/utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/boyu-ai/Hands-on-NLP/3a511b8c2fe3fa4c7e75aee0a0c478d46b000683/code/__pycache__/utils.cpython-38.pyc -------------------------------------------------------------------------------- /code/transformer.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import numpy as np 5 | 6 | 7 | # 实现transformer模型 8 | class EmbeddingLayer(nn.Module): 9 | def __init__(self, vocab_size, max_len, embed_size): 10 | super().__init__() 11 | self.vocab_size = vocab_size 12 | self.max_len = max_len 13 | self.embed_size = embed_size 14 | self.word_embedding = nn.Embedding(vocab_size, embed_size) 15 | self.pos_embedding = nn.Embedding(max_len, embed_size) 16 | 17 | def forward(self, input_ids, pos_ids): 18 | """ 19 | input_ids/pos_ids: batch_size * seq_len 20 | return: batch_size * seq_len * embed_size 21 | """ 22 | word_embed = self.word_embedding(input_ids) 23 | pos_embed = self.pos_embedding(pos_ids) 24 | # 将词嵌入和位置嵌入相加得到嵌入层输出 25 | return word_embed + pos_embed 26 | 27 | # 缩放点乘注意力 28 | class ScaledDotProductAttention(nn.Module): 29 | def __init__(self, dropout): 30 | super().__init__() 31 | self.dropout = nn.Dropout(dropout) 32 | 33 | def forward(self, queries, keys, values, attention_mask): 34 | """ 35 | queries/keys/values: batch_size * seq_len * hidden_size 36 | attention_mask: batch_size * seq_len * seq_len 37 | return: batch_size * seq_len * hidden_size 38 | """ 39 | d = 
queries.size(-1) 40 | # 根据点乘注意力的矩阵形式计算注意力分数,除以查询向量或键向量维度的平方根,即为缩放点乘注意力 41 | scores = torch.bmm(queries, torch.transpose(keys, 1, 2)) / np.sqrt(d) 42 | # 将掩码为0的位置的注意力分数设为一个大负数,根据softmax函数的性质,这些注意力分数归一化后接近0 43 | scores[attention_mask == 0] = -1e6 44 | self.attention_weights = F.softmax(scores, dim=-1) 45 | return torch.bmm(self.dropout(self.attention_weights), values) 46 | 47 | class MultiHeadSelfAttention(nn.Module): 48 | def __init__(self, hidden_size, num_heads, dropout): 49 | super().__init__() 50 | assert hidden_size % num_heads == 0 51 | self.hidden_size = hidden_size 52 | self.num_heads = num_heads 53 | self.W_q = nn.Linear(hidden_size, hidden_size) 54 | self.W_k = nn.Linear(hidden_size, hidden_size) 55 | self.W_v = nn.Linear(hidden_size, hidden_size) 56 | self.W_o = nn.Linear(hidden_size, hidden_size) 57 | self.attention = ScaledDotProductAttention(dropout) 58 | 59 | def transpose_qkv(self, states): 60 | # 将长度为hidden_size的向量分成num_heads个长度相等的向量 61 | states = states.reshape(states.shape[0], states.shape[1], self.num_heads, self.hidden_size // self.num_heads) 62 | states = torch.permute(states, (0, 2, 1, 3)) 63 | return states.reshape(-1, states.shape[2], states.shape[3]) 64 | 65 | # 与transpose_qkv的变换相反 66 | def transpose_output(self, states): 67 | states = states.reshape(-1, self.num_heads, states.shape[1], states.shape[2]) 68 | states = torch.permute(states, (0, 2, 1, 3)) 69 | return states.reshape(states.shape[0], states.shape[1], -1) 70 | 71 | def forward(self, queries, keys, values, attention_mask): 72 | """ 73 | querys/keys/values: batch * seq_len * hidden_size 74 | attention_mask: batch * seq_len * seq_len 75 | return: 76 | """ 77 | # (batch_size * num_heads) * seq_len * (hidden_size / num_heads) 78 | queries = self.transpose_qkv(self.W_q(queries)) 79 | keys = self.transpose_qkv(self.W_k(keys)) 80 | values = self.transpose_qkv(self.W_v(values)) 81 | # 重复张量的元素,用以支持多个注意力头的运算 82 | # (batch_size * num_heads) * seq_len * seq_len 83 | attention_mask = torch.repeat_interleave(attention_mask, repeats=self.num_heads, dim=0) 84 | # (batch_size * num_heads) * seq_len * (hidden_size / num_heads) 85 | output = self.attention(queries, keys, values, attention_mask) 86 | # batch * seq_len * hidden_size 87 | output_concat = self.transpose_output(output) 88 | return self.W_o(output_concat) 89 | 90 | # 两个简单的前向层 91 | class PositionWiseFNN(nn.Module): 92 | def __init__(self, hidden_size, intermediate_size): 93 | super().__init__() 94 | self.dense1 = nn.Linear(hidden_size, intermediate_size) 95 | self.relu = nn.ReLU() 96 | self.dense2 = nn.Linear(intermediate_size, hidden_size) 97 | 98 | def forward(self, X): 99 | return self.dense2(self.relu(self.dense1(X))) 100 | 101 | # 层归一化 102 | class LayerNorm(nn.Module): 103 | def __init__(self, normalized_shape, eps=1e-6): 104 | super().__init__() 105 | self.gamma = nn.Parameter(torch.ones(normalized_shape)) 106 | self.beta = nn.Parameter(torch.zeros(normalized_shape)) 107 | # 一个小量用于数值稳定(防止除0) 108 | self.eps = eps 109 | 110 | def forward(self, hidden_states): 111 | mean = torch.mean(hidden_states, -1, keepdim=True) 112 | std = torch.std(hidden_states, -1, keepdim=True) 113 | return self.gamma * (hidden_states - mean) / (std + self.eps) + self.beta 114 | 115 | # 将两个输入相加并归一化 116 | class AddNorm(nn.Module): 117 | def __init__(self, hidden_size, dropout): 118 | super().__init__() 119 | self.dropout = nn.Dropout(dropout) 120 | self.layer_norm = LayerNorm(hidden_size) 121 | 122 | def forward(self, X, Y): 123 | return self.layer_norm(self.dropout(Y) + X) 124 | 125 | 
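# 注:下面的TransformerLayer中引用的名称是PositionWiseFFN,而上面定义的类名
# 是PositionWiseFNN,两者拼写不一致,按原文直接运行会触发NameError。
# 这里补充一个别名,使两种写法指向同一个前馈网络类。
PositionWiseFFN = PositionWiseFNN

# 下面是一个最小的形状检查示例(超参数均为假设值,仅作示意,并非原书代码)。
# 函数体只有在被手动调用时才会执行,因此放在TransformerLayer定义之前也不影响导入。
def _demo_transformer_layer():
    vocab_size, max_len, hidden_size = 100, 16, 8
    batch_size, seq_len = 2, 4
    embed = EmbeddingLayer(vocab_size, max_len, hidden_size)
    layer = TransformerLayer(hidden_size, num_heads=2, dropout=0.1,
                             intermediate_size=32)
    input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
    pos_ids = torch.arange(seq_len).unsqueeze(0).expand(batch_size, seq_len)
    # 全1掩码表示不屏蔽任何位置
    attention_mask = torch.ones(batch_size, seq_len, seq_len)
    hidden = layer(embed(input_ids, pos_ids), attention_mask)
    print(hidden.shape)  # 期望输出:torch.Size([2, 4, 8])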
# 一个完整的transformer层 126 | class TransformerLayer(nn.Module): 127 | def __init__(self, hidden_size, num_heads, dropout, intermediate_size): 128 | super().__init__() 129 | self.self_attention = MultiHeadSelfAttention(hidden_size, num_heads, dropout) 130 | self.add_norm1 = AddNorm(hidden_size, dropout) 131 | self.fnn = PositionWiseFFN(hidden_size, intermediate_size) 132 | self.add_norm2 = AddNorm(hidden_size, dropout) 133 | 134 | def forward(self, X, attention_mask): 135 | Y = self.add_norm1(X, self.self_attention(X, X, X, attention_mask)) 136 | return self.add_norm2(Y, self.fnn(Y)) 137 | -------------------------------------------------------------------------------- /code/utils.py: -------------------------------------------------------------------------------- 1 | import json 2 | import os 3 | import requests 4 | import nltk 5 | from nltk.tokenize import sent_tokenize, word_tokenize 6 | import re 7 | from nltk.corpus import stopwords 8 | from tqdm import tqdm 9 | from collections import defaultdict 10 | from string import punctuation 11 | import numpy as np 12 | import torch 13 | import torch.nn as nn 14 | import spacy 15 | from spacy.lang.zh.stop_words import STOP_WORDS 16 | nlp = spacy.load('zh_core_web_sm') 17 | 18 | 19 | class TheLittlePrinceDataset: 20 | def __init__(self, tokenize=True): 21 | # 利用nltk函数进行分句和分词 22 | text = open('the little prince.txt', 'r', encoding='utf-8').read() 23 | if tokenize: 24 | self.sentences = sent_tokenize(text.lower()) 25 | self.tokens = [word_tokenize(sent) for sent in self.sentences] 26 | else: 27 | self.text = text 28 | 29 | def build_vocab(self, min_freq=1): 30 | # 统计词频 31 | frequency = defaultdict(int) 32 | for sentence in self.tokens: 33 | for token in sentence: 34 | frequency[token] += 1 35 | self.frequency = frequency 36 | 37 | # 加入处理未登录词,加入用于对齐变长输入进而加速 38 | self.token2id = {'': 1, '': 0} 39 | self.id2token = {1: '', 0: ''} 40 | for token, freq in sorted(frequency.items(), key=lambda x: -x[1]): 41 | # 丢弃低频词 42 | if freq > min_freq: 43 | self.token2id[token] = len(self.token2id) 44 | self.id2token[len(self.id2token)] = token 45 | else: 46 | break 47 | 48 | def get_word_distribution(self): 49 | distribution = np.zeros(vocab_size) 50 | for token, freq in self.frequency.items(): 51 | if token in dataset.token2id: 52 | distribution[dataset.token2id[token]] = freq 53 | else: 54 | # 不在词表中的词按计算 55 | distribution[1] += freq 56 | distribution /= distribution.sum() 57 | return distribution 58 | 59 | # 将分词结果转化为索引表示 60 | def convert_tokens_to_ids(self, drop_single_word=True): 61 | self.token_ids = [] 62 | for sentence in self.tokens: 63 | token_ids = [self.token2id.get(token, 1) for token in sentence] 64 | # 忽略只有一个token的序列,无法计算loss 65 | if len(token_ids) == 1 and drop_single_word: 66 | continue 67 | self.token_ids.append(token_ids) 68 | 69 | return self.token_ids 70 | 71 | 72 | class TFIDF: 73 | def __init__(self, vocab_size, norm='l2', smooth_idf=True, sublinear_tf=True): 74 | self.vocab_size = vocab_size 75 | self.norm = norm 76 | self.smooth_idf = smooth_idf 77 | self.sublinear_tf = sublinear_tf 78 | 79 | def fit(self, X): 80 | doc_freq = np.zeros(self.vocab_size, dtype=np.float64) 81 | for data in X: 82 | for token_id in set(data): 83 | doc_freq[token_id] += 1 84 | doc_freq += int(self.smooth_idf) 85 | n_samples = len(X) + int(self.smooth_idf) 86 | self.idf = np.log(n_samples / doc_freq) + 1 87 | 88 | def transform(self, X): 89 | assert hasattr(self, 'idf') 90 | term_freq = np.zeros((len(X), self.vocab_size), dtype=np.float64) 91 | for i, data in 
enumerate(X): 92 | for token in data: 93 | term_freq[i, token] += 1 94 | if self.sublinear_tf: 95 | term_freq = np.log(term_freq + 1) 96 | Y = term_freq * self.idf 97 | if self.norm: 98 | row_norm = (Y**2).sum(axis=1) 99 | row_norm[row_norm == 0] = 1 100 | Y /= np.sqrt(row_norm)[:, None] 101 | return Y 102 | 103 | def fit_transform(self, X): 104 | self.fit(X) 105 | return self.transform(X) 106 | 107 | 108 | class BooksDataset: 109 | def __init__(self): 110 | train_file, test_file = 'train.jsonl', 'test.jsonl' 111 | 112 | # 下载数据为JSON格式,转化为Python对象 113 | def read_file(file_name): 114 | with open(file_name, 'r', encoding='utf-8') as fin: 115 | json_list = list(fin) 116 | data_split = [] 117 | for json_str in json_list: 118 | data_split.append(json.loads(json_str)) 119 | return data_split 120 | 121 | self.train_data, self.test_data = read_file(train_file), read_file(test_file) 122 | print('train size =', len(self.train_data), ', test size =', len(self.test_data)) 123 | 124 | # 建立文本标签和数字标签的映射 125 | self.label2id, self.id2label = {}, {} 126 | for data_split in [self.train_data, self.test_data]: 127 | for data in data_split: 128 | txt = data['class'] 129 | if txt not in self.label2id: 130 | idx = len(self.label2id) 131 | self.label2id[txt] = idx 132 | self.id2label[idx] = txt 133 | label_id = self.label2id[txt] 134 | data['label'] = label_id 135 | 136 | def tokenize(self, attr='book'): 137 | # 使用以下两行命令安装spacy用于中文分词 138 | # pip install -U spacy 139 | # python -m spacy download zh_core_web_sm 140 | # 去除文本中的符号和停用词 141 | for data_split in [self.train_data, self.test_data]: 142 | for data in tqdm(data_split): 143 | # 转为小写 144 | text = data[attr].lower() 145 | # 符号替换为空 146 | tokens = [t.text for t in nlp(text) if t.text not in STOP_WORDS] 147 | # 这一步比较耗时,因此把tokenize的结果储存起来 148 | data['tokens'] = tokens 149 | 150 | # 根据分词结果建立词表,忽略部分低频词,可以设置词最短长度和词表最大大小 151 | def build_vocab(self, min_freq=3, min_len=2, max_size=None): 152 | frequency = defaultdict(int) 153 | for data in self.train_data: 154 | tokens = data['tokens'] 155 | for token in tokens: 156 | frequency[token] += 1 157 | 158 | print(f'unique tokens = {len(frequency)}, total counts = {sum(frequency.values())}, ' 159 | f'max freq = {max(frequency.values())}, min freq = {min(frequency.values())}') 160 | 161 | self.token2id = {} 162 | self.id2token = {} 163 | total_count = 0 164 | for token, freq in sorted(frequency.items(), key=lambda x: -x[1]): 165 | if max_size and len(self.token2id) >= max_size: 166 | break 167 | if freq > min_freq: 168 | if (min_len is None) or (min_len and len(token) >= min_len): 169 | self.token2id[token] = len(self.token2id) 170 | self.id2token[len(self.id2token)] = token 171 | total_count += freq 172 | else: 173 | break 174 | print(f'min_freq = {min_freq}, min_len = {min_len}, max_size = {max_size}, ' 175 | f'remaining tokens = {len(self.token2id)}, ' 176 | f'in-vocab rate = {total_count / sum(frequency.values())}') 177 | 178 | # 将分词后的结果转化为数字索引 179 | def convert_tokens_to_ids(self): 180 | for data_split in [self.train_data, self.test_data]: 181 | for data in data_split: 182 | data['token_ids'] = [] 183 | for token in data['tokens']: 184 | if token in self.token2id: 185 | data['token_ids'].append(self.token2id[token]) 186 | 187 | 188 | class Biaffine(nn.Module): 189 | def __init__(self, n_in, n_out=1, bias_x=True, bias_y=True, diagonal=False): 190 | super(Biaffine, self).__init__() 191 | # n_in:输入特征大小 192 | # n_out:输出的打分数量(边预测为1,标签预测即为标签数量) 193 | # bias_x:为输入x加入线性层 194 | # bias_y:为输入y加入线性层 195 | self.n_in = n_in 196 | 
self.n_out = n_out 197 | self.bias_x = bias_x 198 | self.bias_y = bias_y 199 | self.diagonal = diagonal 200 | # 对角线化参数,让原本的参数矩阵变成了对角线矩阵,从而大幅度减少运算复杂度,一般在计算标签的得分时会使用 201 | if self.diagonal: 202 | self.weight = nn.Parameter(torch.Tensor(n_out, 203 | n_in + bias_x)) 204 | else: 205 | self.weight = nn.Parameter(torch.Tensor(n_out, 206 | n_in + bias_x, 207 | n_in + bias_y)) 208 | self.reset_parameters() 209 | 210 | 211 | def reset_parameters(self): 212 | nn.init.normal_(self.weight) 213 | 214 | def forward(self, x, y): 215 | # 当bias_x或bias_y为True时,为输入x或y的向量拼接额外的1 216 | if self.bias_x: 217 | x = torch.cat((x, torch.ones_like(x[..., :1])), -1) 218 | if self.bias_y: 219 | y = torch.cat((y, torch.ones_like(y[..., :1])), -1) 220 | 221 | # torch中的einsum可以很简单的实现矩阵运算 222 | # 思路是为输入的张量的每个维度分别定义一个符号(例如输入x、y的第一维是批大小,定义为b) 223 | # 并且定义输出的张量大小,这个函数会自动地根据前后的变化计算张量乘法、求和的过程 224 | # 例如下面的bxi,byi,oi->boxy,表示的是输入的三个张量大小分别为b * x * i,b * y * i和o * i 225 | # 输出则是b * o * x * y 226 | # 根据这个式子,我们可以看出三个张量都有i这个维度,在输出时被消除了 227 | # 因此三个张量的i维通过张量乘法(三者按位相乘、然后求和)进行消除 228 | # 这个算法的好处是相比于手动实现,einsum可以更容易地避免运算过程中出现很大的张量大幅占用显存 229 | # 同时也避免了手动实现的流程 230 | # 具体使用方法请参考https://pytorch.org/docs/stable/generated/torch.einsum.html 231 | if self.diagonal: 232 | s = torch.einsum('bxi,byi,oi->boxy', x, y, self.weight) 233 | else: 234 | s = torch.einsum('bxi,oij,byj->boxy', x, self.weight, y) 235 | # 当n_out=1时,将第一维移除 236 | s = s.squeeze(1) 237 | 238 | return s 239 | -------------------------------------------------------------------------------- /ner_labels.json: -------------------------------------------------------------------------------- 1 | {"B-NAME": 1, "I-ORG": 2, "B-PRO": 3, "B-EDU": 4, "I-NAME": 5, "B-LOC": 6, "B-TITLE": 7, "B-RACE": 8, "I-TITLE": 9, "I-RACE": 10, "I-PRO": 11, "B-CONT": 12, "I-EDU": 13, "I-LOC": 14, "I-CONT": 15, "B-ORG": 16, "O": 0} -------------------------------------------------------------------------------- /第10章-成分句法分析.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "b53f78ca", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "以下是双仿射函数的代码实现。" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 6, 15 | "id": "5d5fbc71", 16 | "metadata": {}, 17 | "outputs": [], 18 | "source": [ 19 | "# 代码来源于GitHub项目Alibaba-NLP/ACE\n", 20 | "# (Copyright (c) 2020, Xinyu Wang, MIT License(见附录))\n", 21 | "import torch\n", 22 | "import torch.nn as nn\n", 23 | "\n", 24 | "class Biaffine(nn.Module):\n", 25 | " def __init__(self, n_in, n_out=1, bias_x=True, bias_y=True,\\\n", 26 | " diagonal=False):\n", 27 | " super(Biaffine, self).__init__()\n", 28 | " # n_in:输入特征大小\n", 29 | " # n_out:输出的打分数量(边预测为1,标签预测即为标签数量)\n", 30 | " # bias_x:为输入x加入线性层\n", 31 | " # bias_y:为输入y加入线性层\n", 32 | " self.n_in = n_in\n", 33 | " self.n_out = n_out\n", 34 | " self.bias_x = bias_x\n", 35 | " self.bias_y = bias_y\n", 36 | " self.diagonal = diagonal\n", 37 | " # 对角线化参数,让原本的参数矩阵变成了对角线矩阵,\n", 38 | " # 从而大幅度减少运算复杂度,一般在计算标签的得分时会使用\n", 39 | " if self.diagonal:\n", 40 | " self.weight = nn.Parameter(torch.Tensor(n_out,\n", 41 | " n_in + bias_x))\n", 42 | " else:\n", 43 | " self.weight = nn.Parameter(torch.Tensor(n_out,\n", 44 | " n_in + bias_x,\n", 45 | " n_in + bias_y))\n", 46 | " self.reset_parameters()\n", 47 | "\n", 48 | "\n", 49 | " def reset_parameters(self):\n", 50 | " nn.init.normal_(self.weight)\n", 51 | "\n", 52 | " def forward(self, x, y):\n", 53 | " # 当bias_x或bias_y为True时,为输入x或y的向量拼接额外的1\n", 54 | " if self.bias_x:\n", 55 | " x = 
torch.cat((x, torch.ones_like(x[..., :1])), -1)\n", 56 | " if self.bias_y:\n", 57 | " y = torch.cat((y, torch.ones_like(y[..., :1])), -1)\n", 58 | "\n", 59 | " # PyTorch中的einsum可以很简单地实现矩阵运算\n", 60 | " # 思路是为输入的张量的每个维度分别定义一个符号\n", 61 | " # (例如输入x、y的第一维是批大小,定义为b)\n", 62 | " # 并且定义输出的张量大小,这个函数会自动地根据\n", 63 | " # 前后的变化计算张量乘法、求和的过程\n", 64 | " # 例如下面的einsum函数中的bxi,byi,oi->boxy,\n", 65 | " # 表示的是输入的3个张量大小分别为b * x * i、b * y * i和o * i\n", 66 | " # 输出则是b * o * x * y\n", 67 | " # 根据这个式子,可以看出3个张量都有i这个维度,\n", 68 | " # 在输出时被消除了\n", 69 | " # 因此3个张量的i维通过张量乘法(三者按位相乘然后求和)\n", 70 | " # 进行消除\n", 71 | " # 这个算法的好处是相比于手动实现,\n", 72 | " # einsum可以更容易地避免运算过程中出现很大的张量大幅占用显存\n", 73 | " # 同时也避免了手动实现的流程\n", 74 | " # 具体使用方法请参考PyTorch文档\n", 75 | " if self.diagonal:\n", 76 | " s = torch.einsum('bxi,byi,oi->boxy', x, y, self.weight)\n", 77 | " else:\n", 78 | " s = torch.einsum('bxi,oij,byj->boxy', x, self.weight, y)\n", 79 | " # 当n_out=1时,将第一维移除\n", 80 | " s = s.squeeze(1)\n", 81 | "\n", 82 | " return s" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "id": "90b793a4", 88 | "metadata": {}, 89 | "source": [ 90 | "现在我们举一个简单的例子来看双仿射函数是如何进行打分的:" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": 9, 96 | "id": "63b1c31c", 97 | "metadata": {}, 98 | "outputs": [ 99 | { 100 | "name": "stdout", 101 | "output_type": "stream", 102 | "text": [ 103 | "tensor([[[0.7518, 4.9461],\n", 104 | " [2.2237, 1.5331]]], grad_fn=)\n" 105 | ] 106 | } 107 | ], 108 | "source": [ 109 | "biaffine = Biaffine(4)\n", 110 | "# 假设批大小为1,句子长度为2,词数量为4\n", 111 | "x=torch.randn(1,2,4)\n", 112 | "y=torch.randn(1,2,4)\n", 113 | "scores = biaffine(x,y)\n", 114 | "print(scores)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "id": "161ab212", 120 | "metadata": {}, 121 | "source": [ 122 | "下面给出具体的CKY算法实现。" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 3, 128 | "id": "46063375", 129 | "metadata": {}, 130 | "outputs": [], 131 | "source": [ 132 | "# 代码来源于GitHub项目yzhangcs/crfpar\n", 133 | "# (Copyright (c) 2020, Yu Zhang, MIT License(见附录))\n", 134 | "import torch\n", 135 | "\n", 136 | "def cky(scores, mask):\n", 137 | " '''\n", 138 | " scores:大小为批大小 * 序列长度 * 序列长度,\n", 139 | " 每个位置表示跨度的打分,例如scores[0,1,2]就表示\n", 140 | " 第0个输入样例上跨度(1,2)的打分(跨度(i,j)不包括j位置,\n", 141 | " 因此跨度(1,2)即对应的只有1个词的跨度的打分)。\n", 142 | " \n", 143 | " mask:与scores的大小一样,表示对打分的掩码,\n", 144 | " 根据句子长度的不同,掩码大小不同。假设句子0长度为5,\n", 145 | " 那么mask[0,:5,:5]中所有值都为1,mask[0]上的其余位置为0\n", 146 | " '''\n", 147 | " # 通过对掩码求和的方式找到每个句子的长度\n", 148 | " lens = mask[:, 0].sum(-1).long()\n", 149 | " # 批大小 * 序列长度 * 序列长度 -> 序列长度 * 序列长度 * 批大小,\n", 150 | " # 方便后续运算\n", 151 | " scores = scores.permute(1, 2, 0)\n", 152 | " seq_len, seq_len, batch_size = scores.shape\n", 153 | " \n", 154 | " # 复制一个新的与scores大小相同的张量,其中s表示新计算出的跨度得分,\n", 155 | " # p表示得分最高的位置(也就是max k),位置信息用long()表示\n", 156 | " s = scores.new_zeros(seq_len, seq_len, batch_size)\n", 157 | " p = scores.new_zeros(seq_len, seq_len, batch_size).long()\n", 158 | " \n", 159 | " # 设w为跨度,从小的跨度到大跨度进行遍历\n", 160 | " for w in range(1, seq_len):\n", 161 | " # 通过seq_len - w可以计算出当前长度有多少长度为w的跨度\n", 162 | " n = seq_len - w\n", 163 | " # 根据n生成0到n的列表\n", 164 | " starts = p.new_tensor(range(n)).unsqueeze(0)\n", 165 | " \n", 166 | " # 当跨度w为1的时候,没有中间值k,\n", 167 | " # 直接将结果赋值到s中作为max score\n", 168 | " if w == 1:\n", 169 | " # diagonal(w)表示抽取对角线,w的大小为偏置,\n", 170 | " # 当偏置为0时,则直接为对角线,\n", 171 | " # 当偏置大于0,则对角线上移(也就是(0,1),(1,2),……)\n", 172 | " # 具体细节请查看PyTorch文档\n", 173 | " s.diagonal(w).copy_(scores.diagonal(w))\n", 174 | " 
continue\n", 175 | "\n", 176 | " # 计算跨度为w情况下,s_best(i,k)+s_best(k,j)的值,\n", 177 | " # strip()函数下面会介绍,\n", 178 | " # 它每次取出大小为n * w-1 * batch_size的矩阵\n", 179 | " s_span = stripe(s, n, w-1, (0, 1)) + stripe(s, n, w-1, (1, w), 0)\n", 180 | " # n * w-1 * batch_size -> batch_size * n * w-1\n", 181 | " s_span = s_span.permute(2, 0, 1)\n", 182 | " # 计算max(s_best(i,k)+s_best(k,j)),以及对应的k值\n", 183 | " s_span, p_span = s_span.max(-1)\n", 184 | " # 更新s_best(i,j) = s(i,j)+max(s_best(i,k)+s_best(k,j))\n", 185 | " s.diagonal(w).copy_(s_span + scores.diagonal(w))\n", 186 | " # 保留最大的k值,由于p_span并不对应在原来的矩阵中的位置,\n", 187 | " # 因此需要加上starts+1来还原\n", 188 | " p.diagonal(w).copy_(p_span + starts + 1)\n", 189 | "\n", 190 | " def backtrack(p, i, j):\n", 191 | " # 通过分治法找到之前所有得分最大的span\n", 192 | " if j == i + 1:\n", 193 | " return [(i, j)]\n", 194 | " split = p[i][j]\n", 195 | " ltree = backtrack(p, i, split)\n", 196 | " rtree = backtrack(p, split, j)\n", 197 | " return [(i, j)] + ltree + rtree\n", 198 | " \n", 199 | " p = p.permute(2, 0, 1).tolist()\n", 200 | " # 从最大的跨度(0,length)开始,逐渐找到中间最大的k值,还原整个成分\n", 201 | " trees = [backtrack(p[i], 0, length)\n", 202 | " for i, length in enumerate(lens.tolist())]\n", 203 | "\n", 204 | " return trees\n", 205 | "\n", 206 | "\n", 207 | "\n", 208 | "def stripe(x, n, w, offset=(0, 0), dim=1):\n", 209 | " r'''Returns a diagonal stripe of the tensor.\n", 210 | " Parameters:\n", 211 | " x:输入的超过2维的张量\n", 212 | " n:输出的斜对角矩阵的长度\n", 213 | " w:输出的斜对角矩阵的宽度\n", 214 | " offset:前两个维度的偏置\n", 215 | " dim:当其为0则抽取纵向斜对角矩阵;1则是横向斜对角矩阵\n", 216 | " 例子:\n", 217 | " >>> x = torch.arange(25).view(5, 5)\n", 218 | " >>> x\n", 219 | " tensor([[ 0, 1, 2, 3, 4],\n", 220 | " [ 5, 6, 7, 8, 9],\n", 221 | " [10, 11, 12, 13, 14],\n", 222 | " [15, 16, 17, 18, 19],\n", 223 | " [20, 21, 22, 23, 24]])\n", 224 | " >>> n = 2\n", 225 | " >>> w = 3\n", 226 | " >>> stripe(x, n, w-1, (0, 1))\n", 227 | " tensor([[ 1, 2],\n", 228 | " [ 7, 8]])\n", 229 | " >>> stripe(x, n, w-1, (1, w) dim=0)\n", 230 | " tensor([[ 8, 13],\n", 231 | " [ 14, 19]])\n", 232 | " 可以看出,当跨度长度为3时,\n", 233 | " 两个矩阵的第一行分别表示跨度为:\n", 234 | " [(0,1),(0,2)]和[(1,3),(2,3)],\n", 235 | " 可以看出枚举了对于跨度为(0,3)有两种跨度组合:\n", 236 | " s(0,1)+s(1,3)\n", 237 | " s(0,2)+s(2,3)\n", 238 | " '''\n", 239 | " x, seq_len = x.contiguous(), x.size(1)\n", 240 | " # 根据x的形状创建步长列表,numel为批大小\n", 241 | " stride, numel = list(x.stride()), x[0, 0].numel()\n", 242 | " # 设置行和列的步长,假设当前位置为(i,j),\n", 243 | " # stride[0]会取出(i+1,j+1)的值,作为输出矩阵的下一行值\n", 244 | " stride[0] = (seq_len + 1) * numel\n", 245 | " # 假设当前位置为(i,j),stride[1]会取出(i+1,j)的值,作为下一列的值\n", 246 | " stride[1] = (1 if dim == 1 else seq_len) * numel\n", 247 | " return x.as_strided(size=(n, w, *x.shape[2:]), stride=stride,\n", 248 | " storage_offset=(offset[0]*seq_len+offset[1])*numel)\n" 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "id": "0947414c", 254 | "metadata": {}, 255 | "source": [ 256 | "给定一个句子“learning probalistic grammar is difficult”,为每个跨度打分(简单起见,这里没有考虑标签),并调用CKY算法得到句法树:" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 4, 262 | "id": "de20299c", 263 | "metadata": {}, 264 | "outputs": [ 265 | { 266 | "name": "stdout", 267 | "output_type": "stream", 268 | "text": [ 269 | "tensor([[[0., 1., 1., 1., 1., 1.],\n", 270 | " [0., 0., 1., 1., 1., 1.],\n", 271 | " [0., 0., 0., 1., 1., 1.],\n", 272 | " [0., 0., 0., 0., 1., 1.],\n", 273 | " [0., 0., 0., 0., 0., 1.],\n", 274 | " [0., 0., 0., 0., 0., 0.]]])\n", 275 | "[[(0, 5), (0, 3), (0, 1), (1, 3), (1, 2), (2, 3), (3, 5), (3, 4), (4, 5)]]\n" 276 | 
] 277 | } 278 | ], 279 | "source": [ 280 | "# 句子“learning probalistic grammar is difficult”一共有5个词,\n", 281 | "# 额外加上根节点后,score的张量大小为1 * 6 * 6\n", 282 | "# score是一个上三角矩阵,其余部分用-999代替\n", 283 | "\n", 284 | "score = torch.Tensor([\n", 285 | " [ -999, 1, -1, 1, -1, 1],\n", 286 | " [ -999, -999, 1, 1, -1, -1],\n", 287 | " [ -999, -999, -999, 1, -1, -1],\n", 288 | " [ -999, -999, -999, -999, 1, 1],\n", 289 | " [ -999, -999, -999, -999, -999, 1],\n", 290 | " [ -999, -999, -999, -999, -999, -999]]\n", 291 | " ).unsqueeze(0)\n", 292 | "\n", 293 | "# mask应该是一个上三角矩阵\n", 294 | "mask = torch.ones_like(score)\n", 295 | "mask = torch.triu(mask,diagonal=1)\n", 296 | "\n", 297 | "print(mask)\n", 298 | "\n", 299 | "trees=cky(score,mask)\n", 300 | "print(trees)" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "id": "d4c9ff39", 306 | "metadata": {}, 307 | "source": [ 308 | "现在画出这个成分句法树。这里使用supar代码包来画成分句法树。supar是一个开源且好用的句法、成分、语义分析工具包。我们没有为标签进行打分,因此这里使用空标签。" 309 | ] 310 | }, 311 | { 312 | "cell_type": "code", 313 | "execution_count": 9, 314 | "id": "2e2c308f", 315 | "metadata": {}, 316 | "outputs": [ 317 | { 318 | "name": "stdout", 319 | "output_type": "stream", 320 | "text": [ 321 | " TOP \n", 322 | " | \n", 323 | " | \n", 324 | " _____________|_________ \n", 325 | " | | \n", 326 | " __________|_______ | \n", 327 | " | | | \n", 328 | " | _______|_____ ___|______ \n", 329 | " | | | | | \n", 330 | " | | | | | \n", 331 | " _ _ _ _ _ \n", 332 | " | | | | | \n", 333 | "learning probalistic grammar is difficult\n", 334 | "\n" 335 | ] 336 | } 337 | ], 338 | "source": [ 339 | "from supar.models.const.crf.transform import Tree\n", 340 | "draw_tree=[(i,j,'|') for i,j in trees[0]]\n", 341 | "Tree.build(['learning', 'probalistic', 'grammar', 'is',\\\n", 342 | " 'difficult'],draw_tree,root='TOP').pretty_print()\n" 343 | ] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "id": "c79ac426", 348 | "metadata": {}, 349 | "source": [ 350 | "内向算法的实现也和CKY算法类似。下面是具体的实现代码。" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 15, 356 | "id": "56390d41", 357 | "metadata": {}, 358 | "outputs": [], 359 | "source": [ 360 | "# 代码来源于GitHub项目yzhangcs/crfpar\n", 361 | "# (Copyright (c) 2020 Yu Zhang, MIT License(见附录))\n", 362 | "def inside(scores, mask):\n", 363 | " # 大部分内容与cky()函数相同,相同部分不再复述\n", 364 | " batch_size, seq_len, _ = scores.shape\n", 365 | " scores, mask = scores.permute(1, 2, 0), mask.permute(1, 2, 0)\n", 366 | " # 我们会对score求exponential,因此\n", 367 | " s = torch.full_like(scores, float('-inf'))\n", 368 | "\n", 369 | " for w in range(1, seq_len):\n", 370 | " n = seq_len - w\n", 371 | " if w == 1:\n", 372 | " s.diagonal(w).copy_(scores.diagonal(w))\n", 373 | " continue\n", 374 | " s_span = stripe(s, n, w-1, (0, 1)) + stripe(s, n, w-1, (1, w), 0)\n", 375 | " s_span = s_span.permute(2, 0, 1)\n", 376 | " # 防止数据出现nan\n", 377 | " if s_span.requires_grad:\n", 378 | " s_span.register_hook(lambda x: \\\n", 379 | " x.masked_fill_(torch.isnan(x), 0))\n", 380 | " # 这里使用PyTorch中的logsumexp()函数实现指数求和的过程,\n", 381 | " # logsumexp()可以有效防止exp后数据溢出的情况\n", 382 | " s_span = s_span.logsumexp(-1)\n", 383 | " s.diagonal(w).copy_(s_span + scores.diagonal(w))\n", 384 | "\n", 385 | " return s\n", 386 | "\n" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "id": "17647378", 392 | "metadata": {}, 393 | "source": [ 394 | "下面是计算损失函数的过程。首先计算人工标注句法树的得分,也就是将该树的所有跨度得分求和,然后用内向算法得出的所有可能的树的得分取幂之和,最终计算损失函数值。" 395 | ] 396 | }, 397 | { 398 | "cell_type": "code", 399 | "execution_count": 45, 400 | "id": "7b66fb2f", 401 
| "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "# 代码来源于GitHub项目yzhangcs/crfpar\n", 405 | "# (Copyright (c) 2020 Yu Zhang, MIT License(见附录))\n", 406 | "@torch.enable_grad()\n", 407 | "def crf(scores, mask, target=None):\n", 408 | " lens = mask[:, 0].sum(-1).long()\n", 409 | " total = lens.sum()\n", 410 | " batch_size, seq_len, _ = scores.shape\n", 411 | " # 训练过程中需要保证scores能够返回梯度\n", 412 | " training = scores.requires_grad\n", 413 | " # 计算内向算法,求得可能的树的得分和\n", 414 | " s = inside(scores.requires_grad_(), mask)\n", 415 | " logZ = s[0].gather(0, lens.unsqueeze(0)).sum()\n", 416 | " # (scores * target * mask).sum()可以求出目标树的得分和,\n", 417 | " # 与logZ相减后求平均损失\n", 418 | " loss = (logZ - (scores * target * mask).sum()) / total\n", 419 | " return loss,\n" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "id": "dd0120f1", 425 | "metadata": {}, 426 | "source": [ 427 | "接下来计算以下实际的损失函数:" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": 46, 433 | "id": "e86fab75", 434 | "metadata": {}, 435 | "outputs": [ 436 | { 437 | "name": "stdout", 438 | "output_type": "stream", 439 | "text": [ 440 | "目标矩阵: tensor([[[0, 1, 0, 1, 0, 1],\n", 441 | " [0, 0, 1, 1, 0, 0],\n", 442 | " [0, 0, 0, 1, 0, 0],\n", 443 | " [0, 0, 0, 0, 1, 1],\n", 444 | " [0, 0, 0, 0, 0, 1],\n", 445 | " [0, 0, 0, 0, 0, 0]]])\n", 446 | "内部函数计算结果(即所有可能的成分树分数和): tensor(9.4121)\n", 447 | "loss: (tensor(0.0824, grad_fn=),)\n" 448 | ] 449 | } 450 | ], 451 | "source": [ 452 | "# score是一个上三角矩阵,其余部分用-999代替\n", 453 | "\n", 454 | "score = torch.Tensor([\n", 455 | " [ -999, 1, -1, 1, -1, 1],\n", 456 | " [ -999, -999, 1, 1, -1, -1],\n", 457 | " [ -999, -999, -999, 1, -1, -1],\n", 458 | " [ -999, -999, -999, -999, 1, 1],\n", 459 | " [ -999, -999, -999, -999, -999, 1],\n", 460 | " [ -999, -999, -999, -999, -999, -999]]).unsqueeze(0)\n", 461 | "\n", 462 | "# mask应该是一个上三角矩阵\n", 463 | "mask = torch.ones_like(score)\n", 464 | "mask = torch.triu(mask,diagonal=1).long()\n", 465 | "\n", 466 | "# 在本例子中,假设score预测树与训练的目标target一样\n", 467 | "target = torch.Tensor([\n", 468 | " [ 0, 1, 0, 1, 0, 1],\n", 469 | " [ 0, 0, 1, 1, 0, 0],\n", 470 | " [ 0, 0, 0, 1, 0, 0],\n", 471 | " [ 0, 0, 0, 0, 1, 1],\n", 472 | " [ 0, 0, 0, 0, 0, 1],\n", 473 | " [ 0, 0, 0, 0, 0, 0]]).unsqueeze(0)\n", 474 | "\n", 475 | "\n", 476 | "print('目标矩阵:',target)\n", 477 | "\n", 478 | "s=inside(score,mask)\n", 479 | "logZ = s[0].gather(0, lens.unsqueeze(0)).sum()\n", 480 | "\n", 481 | "print('内部函数计算结果(即所有可能的成分树分数和):',logZ)\n", 482 | "\n", 483 | "loss = crf(score,mask,target)\n", 484 | "print('loss:',loss)" 485 | ] 486 | }, 487 | { 488 | "cell_type": "markdown", 489 | "id": "e002dd70", 490 | "metadata": {}, 491 | "source": [ 492 | "\n", 493 | "\n", 494 | "下面我们提供一套代码来展示在基于转移的句法分析过程中,如何根据转移动作的打分去对栈和缓存进行操作。如果读者想要运行该代码,需自行定义model。" 495 | ] 496 | }, 497 | { 498 | "cell_type": "code", 499 | "execution_count": 1, 500 | "id": "6b69ca97", 501 | "metadata": {}, 502 | "outputs": [], 503 | "source": [ 504 | "# 部分代码参考了GitHub项目kmkurn/pytorch-rnng \n", 505 | "# (Copyright 2017 Kemal Kurniawan, MIT License(见附录))\n", 506 | "# 假设当模型预测的SHIFT 操作id为0,REDUCE操作id为1,\n", 507 | "# 非终极符的标签为其他大于1的值\n", 508 | "SHIFT_ID=0\n", 509 | "REDUCE_ID=1\n", 510 | "\n", 511 | "# 这里假设只有3个标签\n", 512 | "label_set = {2: 'S', 3: 'NP', 4: 'VP'}\n", 513 | "\n", 514 | "class element:\n", 515 | " # word_id:一个存储成分结构的列表,例如(S (NP 1 2) 3)\n", 516 | " # 这棵树表示为[S [NP 1 2] 3]\n", 517 | " def __init__(self,word_id,is_open_nt):\n", 518 | " self.is_open_nt=is_open_nt\n", 519 | " if self.is_open_nt:\n", 520 | " 
self.word_id=[word_id]\n", 521 | " else:\n", 522 | " self.word_id=word_id\n", 523 | "\n", 524 | "\n", 525 | "def decode(words,model):\n", 526 | " # words:每个元素为(word_idx, word_id)的元组,\n", 527 | " # word_idx为句子中的位置,word_id则为词表中的id\n", 528 | " # model:这里不具体构建模型,仅作为一个示例\n", 529 | " # 缓存buffer初始化,将words翻转,能够保证pop()操作能够从前往后进行\n", 530 | " buffer = words[::-1]\n", 531 | " # 栈stack初始化\n", 532 | " stack = []\n", 533 | " # 保存操作历史\n", 534 | " history = []\n", 535 | " # 统计当前栈中开放的非终极符数量\n", 536 | " num_open_nt = 0\n", 537 | " # 循环转移迭代\n", 538 | " while 1:\n", 539 | " # 模型通过buffer,stack和history计算下一步操作的打分\n", 540 | " log_probs = model(buffer,stack,history)\n", 541 | " # 得到得分最高的操作id\n", 542 | " action_id = torch.max(log_probs)[1]\n", 543 | " # 当action_id分别为0,1和大于1时,\n", 544 | " # 分别为其做SHIFT,REDUCE和push_nt操作\n", 545 | " if action_id == SHIFT_ID:\n", 546 | " buffer,stack = shift(buffer,stack)\n", 547 | " elif action_id == REDUCE_ID:\n", 548 | " stack = reduce(buffer,stack)\n", 549 | " num_open_nt -= 1\n", 550 | " else:\n", 551 | " stack = push_nt(stack,action_id)\n", 552 | " num_open_nt += 1\n", 553 | " # 将当前操作记录到历史中\n", 554 | " history.append(action_id)\n", 555 | " # 当缓存为空,栈只有一个元素且它不是开放的非终极符时,则退出\n", 556 | " if num_open_nt == 0 and len(buffer) == 0 and\\\n", 557 | " len(stack) == 1 and stack[0].is_open_nt == False:\n", 558 | " break\n", 559 | " # 返回操作历史和整棵树\n", 560 | " return history, stack[0]\n", 561 | "\n", 562 | "def shift(buffer,stack):\n", 563 | " # 将buffer中的词移动到栈顶\n", 564 | " word_id=buffer.pop()\n", 565 | " stack.append(element(word_id,False))\n", 566 | " return buffer, stack \n", 567 | "\n", 568 | "def reduce(stack):\n", 569 | " children = []\n", 570 | " # 重复地从栈中弹出完整的子树或终极符,直到遇到一个开放的非终极符\n", 571 | " while len(stack) > 0 and stack[-1].is_open_nt == False:\n", 572 | " children.append(stack.pop())\n", 573 | " # 循环pop()过程会将顺序颠倒,这里将其变回原来的顺序\n", 574 | " children = children[::-1]\n", 575 | " # 这些节点的word_id将成为当前开放的非终极符的子节点,\n", 576 | " # 我们将这些节点取出,成为一个新的列表\n", 577 | " children_ids = [child.word_id for child in children]\n", 578 | " # 将子节点放入非终极符的word_id中\n", 579 | " stack[-1].word_id+=children_ids\n", 580 | " # 将非终极符关闭\n", 581 | " stack[-1].is_open_nt = False\n", 582 | " \n", 583 | " return stack\n", 584 | "\n", 585 | "def push_nt(stack,action_id):\n", 586 | " # 将action_id转换为具体的标签,放入栈顶\n", 587 | " stack.append(element(label_set[action_id],False))\n", 588 | " return stack\n", 589 | "\n" 590 | ] 591 | } 592 | ], 593 | "metadata": { 594 | "kernelspec": { 595 | "display_name": "Python 3 (ipykernel)", 596 | "language": "python", 597 | "name": "python3" 598 | }, 599 | "language_info": { 600 | "codemirror_mode": { 601 | "name": "ipython", 602 | "version": 3 603 | }, 604 | "file_extension": ".py", 605 | "mimetype": "text/x-python", 606 | "name": "python", 607 | "nbconvert_exporter": "python", 608 | "pygments_lexer": "ipython3", 609 | "version": "3.8.17" 610 | } 611 | }, 612 | "nbformat": 4, 613 | "nbformat_minor": 5 614 | } 615 | -------------------------------------------------------------------------------- /第11章-依存句法分析.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "35f5da24", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "\n", 10 | "下面的代码展示了Eisner算法:" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 2, 16 | "id": "29443580", 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "# 代码来源于GitHub项目yzhangcs/crfpar \n", 21 | "# (Copyright (c) 2020 Yu Zhang, MIT 
License(见附录))\n", 22 | "import torch\n", 23 | "import sys\n", 24 | "sys.path.append('../code')\n", 25 | "from utils import stripe, pad\n", 26 | "def eisner(scores, mask):\n", 27 | " '''\n", 28 | " scores:大小为批大小 * 序列长度 * 序列长度,\n", 29 | " 每个位置表示依存关系的打分,\n", 30 | " 例如scores[0,1,2]就表示第0个输入样例上,\n", 31 | " 边2->1的打分,2为中心词,1为依存词。\n", 32 | " \n", 33 | " mask:批大小 * 序列长度,掩码长度与句子长度相同。\n", 34 | " '''\n", 35 | " # 获取输入的基本信息\n", 36 | " lens = mask.sum(1)-1\n", 37 | " batch_size, seq_len, _ = scores.shape\n", 38 | " # 将scores矩阵从(batch,dep,head)形式转成(head,dep,batch)形式,\n", 39 | " # 方便并行计算\n", 40 | " scores = scores.permute(2, 1, 0)\n", 41 | " # 初始化不完整跨度情况下的打分\n", 42 | " s_i = torch.full_like(scores, float('-inf'))\n", 43 | " # 初始化完整跨度情况下的打分\n", 44 | " s_c = torch.full_like(scores, float('-inf'))\n", 45 | " # 保存两种情况下的max j的位置\n", 46 | " p_i = scores.new_zeros(seq_len, seq_len, batch_size).long()\n", 47 | " p_c = scores.new_zeros(seq_len, seq_len, batch_size).long()\n", 48 | " # 初始化完整跨度下长度为0的打分\n", 49 | " s_c.diagonal().fill_(0)\n", 50 | "\n", 51 | " for w in range(1, seq_len):\n", 52 | " # 通过seq_len - w可以计算出当前长度有多少长度为w的跨度\n", 53 | " n = seq_len - w\n", 54 | " # 根据n生成0到n的列表\n", 55 | " starts = p_i.new_tensor(range(n)).unsqueeze(0)\n", 56 | " \n", 57 | " # ---计算不完整跨度s(i,k,R,I)和s(i,k,L,I)的得分与最大值---\n", 58 | " \n", 59 | " # 计算s(i,j,R,C)+s(j+1,k,L,C)的值,\n", 60 | " # 对于s(i,k,R,I)和s(i,k,L,I)的计算过程中,这部分相同\n", 61 | " ilr = stripe(s_c, n, w) + stripe(s_c, n, w, (w, 1))\n", 62 | " # n * w * batch_size -> batch_size * n * w\n", 63 | " il = ir = ilr.permute(2, 0, 1)\n", 64 | " # 在s(i,k,L,I)中,计算max(s(i,j,R,C)+s(j+1,k,L,C))的值\n", 65 | " # 以及相应的位置\n", 66 | " il_span, il_path = il.max(-1)\n", 67 | " # 在求s_{ki}的过程时,我们的计算过程与第10章成分句法分析\n", 68 | " # 中的基于跨度的方法类似。\n", 69 | " # 不同的是由于k>i,因此在diagonal命令时需要用-w,让对角线下移\n", 70 | " # 具体细节请查看PyTorch文档\n", 71 | " s_i.diagonal(-w).copy_(il_span + scores.diagonal(-w))\n", 72 | " # 保留最大的j值\n", 73 | " p_i.diagonal(-w).copy_(il_path + starts)\n", 74 | " \n", 75 | " # 在s(i,k,R,I)中,计算max(s(i,j,R,C)+s(j+1,k,L,C))的值\n", 76 | " # 以及相应的位置\n", 77 | " ir_span, ir_path = ir.max(-1)\n", 78 | " # 求s_{ik},此时对角线上移\n", 79 | " # 与此同时,这种方式可以保证s_i保存的方向为L的值与\n", 80 | " # 方向为R的值互相不冲突,下同\n", 81 | " s_i.diagonal(w).copy_(ir_span + scores.diagonal(w))\n", 82 | " # 保留最大的j值\n", 83 | " p_i.diagonal(w).copy_(ir_path + starts)\n", 84 | " \n", 85 | " \n", 86 | " # ---计算不完整跨度s(i,k,R,C)和s(i,k,L,I)的得分与最大值---\n", 87 | " \n", 88 | " # 计算 s(i,j,L,C)+s(j,k,L,I) \n", 89 | " cl = stripe(s_c, n, w, (0, 0), 0) + stripe(s_i, n, w, (w, 0))\n", 90 | " cl_span, cl_path = cl.permute(2, 0, 1).max(-1)\n", 91 | " # 将最大的得分进行保存\n", 92 | " s_c.diagonal(-w).copy_(cl_span)\n", 93 | " # 将最大的得分的位置进行保存\n", 94 | " p_c.diagonal(-w).copy_(cl_path + starts)\n", 95 | " \n", 96 | " # 计算 s(i,j,R,I)+s(j,k,R,C)\n", 97 | " cr = stripe(s_i, n, w, (0, 1)) + stripe(s_c, n, w, (1, w), 0)\n", 98 | " cr_span, cr_path = cr.permute(2, 0, 1).max(-1)\n", 99 | " # 将最大的得分进行保存\n", 100 | " s_c.diagonal(w).copy_(cr_span)\n", 101 | " # 将句子长度不等于w的(0,w)得分置为负无穷,\n", 102 | " # 因为其在结构上不可能存在\n", 103 | " s_c[0, w][lens.ne(w)] = float('-inf')\n", 104 | " # 将最大的得分的位置进行保存\n", 105 | " p_c.diagonal(w).copy_(cr_path + starts + 1)\n", 106 | "\n", 107 | " def backtrack(p_i, p_c, heads, i, k, complete):\n", 108 | " # 通过分治法找到当前跨度的最优分割\n", 109 | " if i == k:\n", 110 | " return\n", 111 | " if complete:\n", 112 | " # 如果当前跨度是完整跨度,取出得分最大的位置\n", 113 | " j = p_c[i, k]\n", 114 | " # 分别追溯s(i,j,I)和s(j,k,C)的最大值\n", 115 | " backtrack(p_i, p_c, heads, i, j, False)\n", 116 | " backtrack(p_i, p_c, heads, j, 
k, True)\n", 117 | " else:\n", 118 | " # 由于当前跨度是不完整跨度,因此根据定义,k的父节点一定是i\n", 119 | " j, heads[k] = p_i[i, k], i\n", 120 | " i, k = sorted((i, k))\n", 121 | " # 追溯s(i,j,C)和s(j+1,k,C)的最大值\n", 122 | " backtrack(p_i, p_c, heads, i, j, True)\n", 123 | " backtrack(p_i, p_c, heads, k, j + 1, True)\n", 124 | "\n", 125 | " preds = []\n", 126 | " p_c = p_c.permute(2, 0, 1).cpu()\n", 127 | " p_i = p_i.permute(2, 0, 1).cpu()\n", 128 | " # 追溯最终生成的每个词的父节点\n", 129 | " for i, length in enumerate(lens.tolist()):\n", 130 | " heads = p_c.new_zeros(length + 1, dtype=torch.long)\n", 131 | " backtrack(p_i[i], p_c[i], heads, 0, length, True)\n", 132 | " preds.append(heads.to(mask.device))\n", 133 | "\n", 134 | " return pad(preds, total_length=seq_len).to(mask.device)\n" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "id": "b9b06fc0", 140 | "metadata": {}, 141 | "source": [ 142 | "给定输入句子“she learns the book hands-on-NLP”,让我们来看最终输出结果:" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 24, 148 | "id": "5f82f589", 149 | "metadata": {}, 150 | "outputs": [ 151 | { 152 | "name": "stdout", 153 | "output_type": "stream", 154 | "text": [ 155 | "tensor([[0, 2, 0, 5, 5, 2]])\n" 156 | ] 157 | } 158 | ], 159 | "source": [ 160 | "score = torch.Tensor([\n", 161 | " [ -1, -1, -1, -1, -1, -1],\n", 162 | " [ -1, -1, 1, -1, -1, -1],\n", 163 | " [ 1, -1, -1, -1, -1, -1],\n", 164 | " [ -1, -1, -1, -1, -1, 1],\n", 165 | " [ -1, -1, -1, -1, -1, 1],\n", 166 | " [ -1, -1, 1, -1, -1, -1]]).unsqueeze(0)\n", 167 | "\n", 168 | "mask = torch.ones_like(score[:,:,0]).long()\n", 169 | "\n", 170 | "deps=eisner(score,mask)\n", 171 | "# deps 中第0位为根节点\n", 172 | "print(deps)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "id": "90c41ccf", 178 | "metadata": {}, 179 | "source": [ 180 | "现在,我们来画一下这个依存句法树。这里使用HanLP代码包来画这个依存句法树。由于没有为标签进行打分,因此这里只给根节点打上ROOT标签,其余依存边无标签。" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 20, 186 | "id": "51d3bd4a", 187 | "metadata": {}, 188 | "outputs": [ 189 | { 190 | "data": { 191 | "text/html": [ 192 | "
Dep Tree     Token         Relation
───────      ────────────  ────────
    ┌─►      she
┌───┴──      learns        ROOT
│  ┌──►      the
│  │┌─►      book
└─►└┴──      hands-on-NLP
" 193 | ], 194 | "text/plain": [ 195 | "" 196 | ] 197 | }, 198 | "metadata": {}, 199 | "output_type": "display_data" 200 | } 201 | ], 202 | "source": [ 203 | "# !pip install -e hanlp_common\n", 204 | "from hanlp_common.document import Document\n", 205 | "# from document import Document\n", 206 | "\n", 207 | "tokens = [\"she\",\"learns\",\"the\",\"book\",\"hands-on-NLP\"]\n", 208 | "dependencies = [[x.item(), '' if x.item()!=0 else \"ROOT\"]\\\n", 209 | " for x in deps[0,1:]]\n", 210 | "doc = Document(tok=tokens,dep=dependencies)\n", 211 | "doc.pretty_print()" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "id": "8957599c", 217 | "metadata": {}, 218 | "source": [ 219 | "以下是MST的代码实现。" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": 38, 225 | "id": "aeff5fb7", 226 | "metadata": {}, 227 | "outputs": [], 228 | "source": [ 229 | "# 代码来源于GitHub项目tdozat/Parser-v1 \n", 230 | "# (Copyright (c) 2016 Timothy Dozat, Apache-2.0 License(见附录))\n", 231 | "import numpy as np\n", 232 | "from tarjan import Tarjan\n", 233 | "\n", 234 | "def MST_inference(parse_probs, length, mask, ensure_tree = True):\n", 235 | " # parse_probs:模型预测的每个词的父节点的概率分布,\n", 236 | " # 大小为 length * length,顺序为(孩子节点,父节点)\n", 237 | " # length:当前句子长度\n", 238 | " # mask:与parse_probs大小一致,表示这句话的掩码\n", 239 | " if ensure_tree:\n", 240 | " # 根据mask大小,生成单位矩阵\n", 241 | " I = np.eye(len(mask))\n", 242 | " # 去除不合理元素,其中,通过(1-I)将对角线上的元素去除,\n", 243 | " # 因为句法树不可能存在自环\n", 244 | " parse_probs = parse_probs * mask * (1-I)\n", 245 | " # 求出每个位置上概率最高的父节点\n", 246 | " parse_preds = np.argmax(parse_probs, axis=1)\n", 247 | " tokens = np.arange(1, length)\n", 248 | " # 确认目前的根节点\n", 249 | " roots = np.where(parse_preds[tokens] == 0)[0]+1\n", 250 | " # 当没有根节点时,保证至少有一个根节点\n", 251 | " if len(roots) < 1:\n", 252 | " # 当前每个位置对根节点的概率\n", 253 | " root_probs = parse_probs[tokens,0]\n", 254 | " # 当前每个位置对概率最高的节点的概率\n", 255 | " old_head_probs = parse_probs[tokens, parse_preds[tokens]]\n", 256 | " # 计算根节点与概率最高节点的比值,作为选取根节点的相对概率\n", 257 | " new_root_probs = root_probs / old_head_probs\n", 258 | " # 选择最可能的根节点\n", 259 | " new_root = tokens[np.argmax(new_root_probs)]\n", 260 | " # 更新预测结果\n", 261 | " parse_preds[new_root] = 0\n", 262 | " # 当根节点数量超过1时,让根节点数量变为1\n", 263 | " elif len(roots) > 1:\n", 264 | " # 当前父节点的概率\n", 265 | " root_probs = parse_probs[roots,0]\n", 266 | " # 让当前所有的依存于根节点的位置(roots)归零\n", 267 | " parse_probs[roots,0] = 0\n", 268 | " # 获得新的潜在的父节点及其概率\n", 269 | " new_heads = np.argmax(parse_probs[roots][:,\\\n", 270 | " tokens], axis=1)+1\n", 271 | " new_head_probs = parse_probs[roots,\\\n", 272 | " new_heads] / root_probs\n", 273 | " # 选择roots的潜在的新的父节点中,概率最小的位置,\n", 274 | " # 将其父节点作为根节点\n", 275 | " new_root = roots[np.argmin(new_head_probs)]\n", 276 | " # 更新预测结果\n", 277 | " parse_preds[roots] = new_heads\n", 278 | " parse_preds[new_root] = 0\n", 279 | " # 在通过贪心的方式获得所有位置的父节点后,\n", 280 | " # 使用Tarjan算法找到当前图中的强联通分量,\n", 281 | " # 使用MST算法将其中的环接触并且重新进行链接\n", 282 | " tarjan = Tarjan(parse_preds, tokens)\n", 283 | " # 当前的强联通分量(环)\n", 284 | " cycles = tarjan.SCCs\n", 285 | " for SCC in tarjan.SCCs:\n", 286 | " # 当强联通分量里的节点数量超过1个,那么说明其有环\n", 287 | " if len(SCC) > 1:\n", 288 | " dependents = set()\n", 289 | " to_visit = set(SCC)\n", 290 | " # 将环内所有的节点以及它们所连接的外部节点\n", 291 | " # 都加入孩子节点中\n", 292 | " while len(to_visit) > 0:\n", 293 | " node = to_visit.pop()\n", 294 | " if not node in dependents:\n", 295 | " dependents.add(node)\n", 296 | " # 将当前节点指向的节点(孩子节点)\n", 297 | " # 加入要访问的队列中\n", 298 | " to_visit.update(tarjan.edges[node])\n", 299 
| " # 参与循环的节点的位置\n", 300 | " cycle = np.array(list(SCC))\n", 301 | " # 当前父节点的概率\n", 302 | " old_heads = parse_preds[cycle]\n", 303 | " old_head_probs = parse_probs[cycle, old_heads]\n", 304 | " # 为了计算环里每个节点的新的父节点,\n", 305 | " # 这些节点的孩子节点是这些节点的父节点显然是不可能的,\n", 306 | " # 因此需要将它们的概率置为0\n", 307 | " non_heads = np.array(list(dependents))\n", 308 | " parse_probs[np.repeat(cycle, len(non_heads)),\\\n", 309 | " np.repeat([non_heads], len(cycle),\\\n", 310 | " axis=0).flatten()] = 0\n", 311 | " # 新的概率分布下,求得环内所有节点新的\n", 312 | " # 潜在父节点及其概率\n", 313 | " new_heads = np.argmax(parse_probs[cycle][:,\\\n", 314 | " tokens], axis=1)+1\n", 315 | " # 与旧的父节点计算比例\n", 316 | " new_head_probs = parse_probs[cycle,\\\n", 317 | " new_heads] / old_head_probs\n", 318 | " # 选择最有可能的变化,这样对于树的整体概率\n", 319 | " # 影响最小,同时能将当前的环解除\n", 320 | " change = np.argmax(new_head_probs)\n", 321 | " changed_cycle = cycle[change]\n", 322 | " old_head = old_heads[change]\n", 323 | " new_head = new_heads[change]\n", 324 | " # 更新预测结果\n", 325 | " parse_preds[changed_cycle] = new_head\n", 326 | " tarjan.edges[new_head].add(changed_cycle)\n", 327 | " tarjan.edges[old_head].remove(changed_cycle)\n", 328 | " return parse_preds\n", 329 | " else:\n", 330 | " # 当不强制要求树结构时,直接将预测结果返回\n", 331 | " parse_probs = parse_probs * mask\n", 332 | " parse_preds = np.argmax(parse_probs, axis=1)\n", 333 | " return parse_preds" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "id": "201b4abb", 339 | "metadata": {}, 340 | "source": [ 341 | "下面,我们设计一个使用11.2.2节所介绍的中心词选择解码会导致有环的情况,来看看MST算法的运行结果:" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": 42, 347 | "id": "0b87c935", 348 | "metadata": {}, 349 | "outputs": [ 350 | { 351 | "name": "stdout", 352 | "output_type": "stream", 353 | "text": [ 354 | "不使用MST算法得到的依存关系为: [0 2 0 5 5 4]\n", 355 | "使用MST算法得到的依存关系为: [0 2 0 5 5 2]\n" 356 | ] 357 | } 358 | ], 359 | "source": [ 360 | "\n", 361 | "# 第5个词的分数最高的中心词为第4个词,形成4->5 5->4的环\n", 362 | "score = np.array([\n", 363 | " [ -1, -1, -1, -1, -1, -1],\n", 364 | " [ -1, -1, 1, -1, -1, -1],\n", 365 | " [ 1, -1, -1, -1, -1, -1],\n", 366 | " [ -1, -1, -1, -1, -1, 1],\n", 367 | " [ -1, -1, -1, -1, -1, 1],\n", 368 | " [ -1, -1, 1, -1, 1.1, -1]]) \n", 369 | "\n", 370 | "mask = np.ones_like(score)\n", 371 | "# 可以看出直接预测最大值会有环形成\n", 372 | "print('不使用MST算法得到的依存关系为:',np.argmax(score,1))\n", 373 | "deps=MST_inference(score,len(mask),mask)\n", 374 | "print('使用MST算法得到的依存关系为:',deps)" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "id": "81fb5488", 380 | "metadata": {}, 381 | "source": [ 382 | "这里我们来简单求一下边的交叉熵损失:\n" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": 23, 388 | "id": "2c9eed6d", 389 | "metadata": {}, 390 | "outputs": [ 391 | { 392 | "name": "stdout", 393 | "output_type": "stream", 394 | "text": [ 395 | "tensor(0.6081)\n" 396 | ] 397 | } 398 | ], 399 | "source": [ 400 | "score = torch.Tensor([\n", 401 | " [ -1, -1, -1, -1, -1, -1],\n", 402 | " [ -1, -1, 1, -1, -1, -1],\n", 403 | " [ 1, -1, -1, -1, -1, -1],\n", 404 | " [ -1, -1, -1, -1, -1, 1],\n", 405 | " [ -1, -1, -1, -1, -1, 1],\n", 406 | " [ -1, -1, 1, -1, 1.1, -1]])\n", 407 | "# 假设我们的目标\n", 408 | "target = torch.Tensor([2,0,5,5,2]).long()\n", 409 | "# 计算交叉熵损失\n", 410 | "loss_func = torch.nn.NLLLoss()\n", 411 | "loss = loss_func(torch.nn.functional.log_softmax(score[1:],-1),target)\n", 412 | "print(loss)" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "id": "995f769b", 418 | "metadata": {}, 419 | "source": [ 420 | "\n", 421 | "\n", 422 | 
"下面提供一套代码来展示在解码过程中如何根据转移动作的打分去对栈和缓存进行操作。如果读者想要运行该代码,需自行定义model。" 423 | ] 424 | }, 425 | { 426 | "cell_type": "code", 427 | "execution_count": 1, 428 | "id": "10e02c70", 429 | "metadata": {}, 430 | "outputs": [], 431 | "source": [ 432 | "SHIFT_ID=0\n", 433 | "# 假设left_arc有两个label,nsubj和dobj\n", 434 | "LEFT_ARC_ID = {1: 'nsubj',2: 'dobj'}\n", 435 | "# 假设right_arc有3个label,nsubj、dobj和root\n", 436 | "RIGHT_ARC_ID = {3:'nsubj',4:'dobj',5:'root'}\n", 437 | "\n", 438 | "\n", 439 | "def decode(words,model):\n", 440 | " # words:每个元素为(word_idx, word_text)的元组,\n", 441 | " # word_idx为句子中的位置,word_text则为文本\n", 442 | " # model:这里不具体构建模型,仅作为一个示例\n", 443 | " # 缓存buffer初始化,将words翻转,能够保证pop()操作\n", 444 | " # 能够从前往后进行\n", 445 | " buffer = words[::-1]\n", 446 | " # 栈stack初始化,0表示root节点\n", 447 | " stack = [(0,'ROOT')]\n", 448 | " # 保存生成的边\n", 449 | " deps = []\n", 450 | " # 循环转移迭代\n", 451 | " while 1:\n", 452 | " # 模型通过buffer、stack和history计算下一步操作的打分\n", 453 | " log_probs = model(buffer,stack,history)\n", 454 | " # 得到得分最高的操作id,范围为[0,5]\n", 455 | " action_id = torch.max(log_probs)[1]\n", 456 | " # 当action_id分别为0、1和大于1时,分别为其做SHIFT、\n", 457 | " # REDUCE和push_nt操作\n", 458 | " if action_id == SHIFT_ID:\n", 459 | " buffer,stack = shift(buffer,stack)\n", 460 | " elif action_id in LEFT_ARC_ID:\n", 461 | " stack,deps = left_arc(stack,deps,action_id)\n", 462 | " else:\n", 463 | " stack,deps = right_arc(stack,deps,action_id)\n", 464 | " \n", 465 | " # 当缓存为空,栈只有一个子树时则退出\n", 466 | " if len(buffer) == 0 and len(stack) == 1:\n", 467 | " break\n", 468 | " # 返回生成的树\n", 469 | " return deps\n", 470 | "\n", 471 | "def shift(buffer,stack):\n", 472 | " # 将buffer中的词移动到栈顶\n", 473 | " word=buffer.pop()\n", 474 | " # 这里只需要保留word_idx\n", 475 | " stack.append(word)\n", 476 | " return buffer, stack \n", 477 | "\n", 478 | "def left_arc(stack,deps,action_id):\n", 479 | " # 因为是向左的弧,所以取出stack最后的两个词,倒数第一个词为中心词\n", 480 | " head_word = stack.pop()\n", 481 | " dep_word = stack.pop()\n", 482 | " # 保存head,dep位置以及它们所对应的边,只需要保存word_idx\n", 483 | " deps.append((head_word[0],dep_word[0],LEFT_ARC_ID[action_id]))\n", 484 | " # 将中心词放回stack中\n", 485 | " stack.append(head_word)\n", 486 | " return stack, deps\n", 487 | "\n", 488 | "def right_arc(stack,deps,action_id):\n", 489 | " # 因为是向右的弧,所以取出stack最后的两个词,倒数第二个词为中心词\n", 490 | " dep_word = stack.pop()\n", 491 | " head_word = stack.pop()\n", 492 | " # 保存head,dep位置以及它们所对应的边\n", 493 | " deps.append((head_word[0],dep_word[0],LEFT_ARC_ID[action_id]))\n", 494 | " # 将中心词放回stack中\n", 495 | " stack.append(head_word)\n", 496 | " return stack, deps" 497 | ] 498 | } 499 | ], 500 | "metadata": { 501 | "kernelspec": { 502 | "display_name": "Python 3 (ipykernel)", 503 | "language": "python", 504 | "name": "python3" 505 | }, 506 | "language_info": { 507 | "codemirror_mode": { 508 | "name": "ipython", 509 | "version": 3 510 | }, 511 | "file_extension": ".py", 512 | "mimetype": "text/x-python", 513 | "name": "python", 514 | "nbconvert_exporter": "python", 515 | "pygments_lexer": "ipython3", 516 | "version": "3.8.17" 517 | } 518 | }, 519 | "nbformat": 4, 520 | "nbformat_minor": 5 521 | } 522 | -------------------------------------------------------------------------------- /第12章-语义分析.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "52da442f", 6 | "metadata": {}, 7 | "source": [ 8 | "下面展示WordNet的使用示例。首先,从NLTK中引入WordNet,并且简写成wn。" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 8, 14 | "id": "b65972e1", 15 
| "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "# 若无法运行,请将下面两行注释取消\n", 19 | "# import nltk\n", 20 | "# nltk.download('omw-1.4')\n", 21 | "\n", 22 | "from nltk.corpus import wordnet as wn" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "id": "1c0fce14", 28 | "metadata": {}, 29 | "source": [ 30 | "我们来看“cat”的词义(对应不同的同义词集)有哪些:" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": 6, 36 | "id": "8aaf527a", 37 | "metadata": {}, 38 | "outputs": [ 39 | { 40 | "name": "stdout", 41 | "output_type": "stream", 42 | "text": [ 43 | "[Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset('cat-o'-nine-tails.n.01'), Synset('caterpillar.n.02'), Synset('big_cat.n.01'), Synset('computerized_tomography.n.01'), Synset('cat.v.01'), Synset('vomit.v.01')]\n" 44 | ] 45 | } 46 | ], 47 | "source": [ 48 | "print(wn.synsets('cat'))" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "856d8bc7", 54 | "metadata": {}, 55 | "source": [ 56 | "接下来,我们来看“cat.n.01”,即cat作为名词的第一个同义词集的定义是什么样的:" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 18, 62 | "id": "9b75fa0f", 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "name": "stdout", 67 | "output_type": "stream", 68 | "text": [ 69 | "feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats\n" 70 | ] 71 | } 72 | ], 73 | "source": [ 74 | "print(wn.synset('cat.n.01').definition())" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "id": "4f8c9702", 80 | "metadata": {}, 81 | "source": [ 82 | "“cat.n.01”的词目以及在其他语言上的词目:" 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": 26, 88 | "id": "b5f82d34", 89 | "metadata": { 90 | "scrolled": false 91 | }, 92 | "outputs": [ 93 | { 94 | "name": "stdout", 95 | "output_type": "stream", 96 | "text": [ 97 | "[Lemma('cat.n.01.cat'), Lemma('cat.n.01.true_cat')]\n", 98 | "['にゃんにゃん', 'キャット', 'ネコ', '猫']\n" 99 | ] 100 | } 101 | ], 102 | "source": [ 103 | "print(wn.synset('cat.n.01').lemmas())\n", 104 | "# 暂不支持中文\n", 105 | "print(wn.synset('cat.n.01').lemma_names('jpn'))\n" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "id": "b4ba2cbf", 111 | "metadata": {}, 112 | "source": [ 113 | "最后,我们看一下“cat.n.01”的上位词和下位词:" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": null, 119 | "id": "c8f621fc", 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [ 123 | "# 上位词\n", 124 | "print(wn.synset('cat.n.01').hypernyms())\n", 125 | "# 下位词\n", 126 | "print(wn.synset('cat.n.01').hyponyms())" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 30, 132 | "id": "89dcaa8b", 133 | "metadata": {}, 134 | "outputs": [ 135 | { 136 | "name": "stdout", 137 | "output_type": "stream", 138 | "text": [ 139 | "[Synset('feline.n.01')]\n", 140 | "[Synset('domestic_cat.n.01'), Synset('wildcat.n.03')]\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "# 上位词\n", 146 | "print(wn.synset('cat.n.01').hypernyms())\n", 147 | "# 下位词\n", 148 | "print(wn.synset('cat.n.01').hyponyms())" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "id": "03428711", 154 | "metadata": {}, 155 | "source": [ 156 | "下面对比“boy.n.01”、“girl.n.01”、“cat.n.01”之间的相似度:\n" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 37, 162 | "id": "1768d060", 163 | "metadata": {}, 164 | "outputs": [ 165 | { 166 | "name": "stdout", 167 | "output_type": "stream", 168 | "text": [ 169 | "boy和girl 0.16666666666666666\n", 170 | "boy和cat 0.08333333333333333\n", 
171 | "girl和cat 0.07692307692307693\n", 172 | "boy和dog 0.14285714285714285\n", 173 | "girl和dog 0.125\n", 174 | "cat和dog 0.2\n" 175 | ] 176 | } 177 | ], 178 | "source": [ 179 | "boy = wn.synset('boy.n.01')\n", 180 | "girl = wn.synset('girl.n.01')\n", 181 | "cat = wn.synset('cat.n.01')\n", 182 | "dog = wn.synset('dog.n.01')\n", 183 | "\n", 184 | "\n", 185 | "print(\"boy和girl\",boy.path_similarity(girl))\n", 186 | "print(\"boy和cat\",boy.path_similarity(cat))\n", 187 | "print(\"girl和cat\",girl.path_similarity(cat))\n", 188 | "print(\"boy和dog\",boy.path_similarity(dog))\n", 189 | "print(\"girl和dog\",girl.path_similarity(dog))\n", 190 | "print(\"cat和dog\",cat.path_similarity(dog))" 191 | ] 192 | } 193 | ], 194 | "metadata": { 195 | "kernelspec": { 196 | "display_name": "Python 3 (ipykernel)", 197 | "language": "python", 198 | "name": "python3" 199 | }, 200 | "language_info": { 201 | "codemirror_mode": { 202 | "name": "ipython", 203 | "version": 3 204 | }, 205 | "file_extension": ".py", 206 | "mimetype": "text/x-python", 207 | "name": "python", 208 | "nbconvert_exporter": "python", 209 | "pygments_lexer": "ipython3", 210 | "version": "3.8.17" 211 | } 212 | }, 213 | "nbformat": 4, 214 | "nbformat_minor": 5 215 | } 216 | -------------------------------------------------------------------------------- /第13章-篇章分析.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "f11db778", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "这里以中文BERT为例,实现提及聚类:" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 1, 15 | "id": "0c9861b9", 16 | "metadata": {}, 17 | "outputs": [ 18 | { 19 | "name": "stdout", 20 | "output_type": "stream", 21 | "text": [ 22 | "['[CLS]', '小', '明', '给', '小', '红', '一', '束', '花', ',', '她', '很', '高', '兴', '。', '[SEP]']\n", 23 | "[101, 2207, 3209, 5314, 2207, 5273, 671, 3338, 5709, 8024, 1961, 2523, 7770, 1069, 511, 102]\n", 24 | "torch.Size([1, 16, 768])\n" 25 | ] 26 | } 27 | ], 28 | "source": [ 29 | "import torch\n", 30 | "from transformers import AutoTokenizer, AutoModel\n", 31 | "tokenizer = AutoTokenizer.from_pretrained(\"bert-base-chinese\")\n", 32 | "model = AutoModel.from_pretrained(\"bert-base-chinese\")\n", 33 | "\n", 34 | "# 进行分词\n", 35 | "sentence=\"小明给小红一束花,她很高兴。\"\n", 36 | "subtokenized_sentence=tokenizer.tokenize(sentence)\n", 37 | "subtokenized_sentence = [tokenizer._cls_token] + \\\n", 38 | " subtokenized_sentence + [tokenizer._sep_token]\n", 39 | "subtoken_ids_sentence = tokenizer.convert_tokens_to_ids(\\\n", 40 | " subtokenized_sentence)\n", 41 | "print(subtokenized_sentence)\n", 42 | "print(subtoken_ids_sentence)\n", 43 | "\n", 44 | "# 计算对应的特征\n", 45 | "outputs = model(torch.Tensor(subtoken_ids_sentence).\\\n", 46 | " unsqueeze(0).long())\n", 47 | "hidden_states = outputs[0]\n", 48 | "print(hidden_states.shape)" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "49d7247c", 54 | "metadata": {}, 55 | "source": [ 56 | "假设已经通过提及检测模型找到了句子中的提及,这里用每个提及的第一个子词(在中文中也就是第一个字)作为词特征:" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 2, 62 | "id": "60d1eedf", 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "name": "stdout", 67 | "output_type": "stream", 68 | "text": [ 69 | "torch.Size([4, 768])\n" 70 | ] 71 | } 72 | ], 73 | "source": [ 74 | "# 提及的跨度,假设(-1,0)表示[CLS]的跨度,用于表示空提及[NA],\n", 75 | "# 在实际训练中也可以额外定义个空提及符号\n", 76 | "mention_spans = [(-1,0),(0,2),(3,5),(10,11)]\n", 77 | "word_features = 
torch.concat([hidden_states[0,x+1].unsqueeze(0)\\\n", 78 | " for (x,y) in mention_spans],0)\n", 79 | "print(word_features.shape)" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "id": "fd5d787f", 85 | "metadata": {}, 86 | "source": [ 87 | "首先,通过双仿射函数计算打分。" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 3, 93 | "id": "ca991cb8", 94 | "metadata": {}, 95 | "outputs": [ 96 | { 97 | "name": "stdout", 98 | "output_type": "stream", 99 | "text": [ 100 | "tensor([[[ -inf, -inf, -inf, -inf],\n", 101 | " [ 58.9533, -inf, -inf, -inf],\n", 102 | " [ 571.2849, -515.9794, -inf, -inf],\n", 103 | " [ -341.3851, -697.8577, -1196.0930, -inf]]],\n", 104 | " grad_fn=)\n", 105 | "tensor([[0, 0, 0]])\n" 106 | ] 107 | } 108 | ], 109 | "source": [ 110 | "import sys\n", 111 | "sys.path.append('../code')\n", 112 | "from utils import Biaffine\n", 113 | "biaffine = Biaffine(word_features.shape[1])\n", 114 | "\n", 115 | "# 对word_features进行打分\n", 116 | "scores = biaffine(word_features.unsqueeze(0),\\\n", 117 | " word_features.unsqueeze(0))\n", 118 | "# 由于只关注当前提及之前的提及是否与其进行共指,\n", 119 | "# 因此将它转换为下三角函数,并且为上三角部分置为负无穷:\n", 120 | "scores = scores.tril(diagonal=-1)\n", 121 | "inf_mask = torch.zeros_like(scores)-torch.inf\n", 122 | "inf_mask = inf_mask.triu()\n", 123 | "scores += inf_mask\n", 124 | "print(scores)\n", 125 | "print(scores.argmax(-1)[:,1:])" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "id": "de88f8d7", 131 | "metadata": {}, 132 | "source": [ 133 | "由于模型未经过训练,因此仅通过双仿射函数初始化获得结构显然是错误的。我们可以训练模型,计算损失函数计算方式如下:" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 4, 139 | "id": "9f2abedd", 140 | "metadata": {}, 141 | "outputs": [ 142 | { 143 | "name": "stdout", 144 | "output_type": "stream", 145 | "text": [ 146 | "tensor(118.8242, grad_fn=)\n" 147 | ] 148 | } 149 | ], 150 | "source": [ 151 | "# 只计算除了[NA]以外的提及的损失\n", 152 | "target = torch.Tensor([0,0,1]).long()\n", 153 | "loss_func = torch.nn.NLLLoss()\n", 154 | "loss = loss_func(torch.nn.functional.log_softmax(scores[:,1:].\\\n", 155 | " squeeze(0),-1),target)\n", 156 | "print(loss)" 157 | ] 158 | }, 159 | { 160 | "cell_type": "markdown", 161 | "id": "880e61fb", 162 | "metadata": {}, 163 | "source": [ 164 | "接下来通过点积计算打分。" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": 5, 170 | "id": "f6a04704", 171 | "metadata": {}, 172 | "outputs": [ 173 | { 174 | "name": "stdout", 175 | "output_type": "stream", 176 | "text": [ 177 | "tensor([[ -inf, -inf, -inf, -inf],\n", 178 | " [235.2013, -inf, -inf, -inf],\n", 179 | " [188.3145, 267.1166, -inf, -inf],\n", 180 | " [221.3709, 101.3911, 292.7802, -inf]], grad_fn=)\n", 181 | "tensor([0, 1, 2])\n" 182 | ] 183 | } 184 | ], 185 | "source": [ 186 | "scores2 = torch.matmul(word_features,word_features.T)\n", 187 | "scores2 = scores2.tril(diagonal=-1)\n", 188 | "inf_mask = torch.zeros_like(scores2)-torch.inf\n", 189 | "inf_mask = inf_mask.triu()\n", 190 | "scores2 += inf_mask\n", 191 | "print(scores2)\n", 192 | "print(scores2.argmax(-1)[1:])" 193 | ] 194 | } 195 | ], 196 | "metadata": { 197 | "kernelspec": { 198 | "display_name": "Python 3 (ipykernel)", 199 | "language": "python", 200 | "name": "python3" 201 | }, 202 | "language_info": { 203 | "codemirror_mode": { 204 | "name": "ipython", 205 | "version": 3 206 | }, 207 | "file_extension": ".py", 208 | "mimetype": "text/x-python", 209 | "name": "python", 210 | "nbconvert_exporter": "python", 211 | "pygments_lexer": "ipython3", 212 | "version": "3.8.17" 213 | } 214 | }, 215 | "nbformat": 
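The `Biaffine` module used above comes from this repository's code/utils.py. As a rough sketch of what such a scorer looks like, the minimal module below is consistent with how it is called (two [batch, n, dim] inputs, one [batch, n, n] score matrix out); the actual implementation in utils.py may differ, for example in bias handling and initialization.

import torch
from torch import nn

class BiaffineSketch(nn.Module):
    def __init__(self, dim, bias=True):
        super().__init__()
        self.bias = bias
        # +1 for an optional constant feature appended to each input vector
        self.W = nn.Parameter(torch.randn(dim + int(bias), dim + int(bias)) * 0.01)

    def forward(self, x, y):
        if self.bias:
            x = torch.cat([x, torch.ones_like(x[..., :1])], dim=-1)
            y = torch.cat([y, torch.ones_like(y[..., :1])], dim=-1)
        # score[b, i, j] = x[b, i] @ W @ y[b, j]
        return torch.einsum('bid,de,bje->bij', x, self.W, y)

scorer = BiaffineSketch(768)
feats = torch.randn(1, 4, 768)
print(scorer(feats, feats).shape)   # torch.Size([1, 4, 4])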
4, 216 | "nbformat_minor": 5 217 | } 218 | -------------------------------------------------------------------------------- /第2章-文本规范化.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "7ff1e403", 6 | "metadata": {}, 7 | "source": [ 8 | "在以英语为代表的印欧语系中,大部分语言都使用空格字符来切分词。因此分词的一种非常简单的方式就是基于空格进行分词:\n" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "id": "5d203093", 15 | "metadata": {}, 16 | "outputs": [ 17 | { 18 | "name": "stdout", 19 | "output_type": "stream", 20 | "text": [ 21 | "输入语句:I learn natural language processing with dongshouxueNLP, too.\n", 22 | "分词结果:['I', 'learn', 'natural', 'language', 'processing', 'with',\n", 23 | " 'dongshouxueNLP,', 'too.']\n" 24 | ] 25 | } 26 | ], 27 | "source": [ 28 | "sentence = \"I learn natural language processing with dongshouxueNLP, too.\"\n", 29 | "tokens = sentence.split(' ')\n", 30 | "print(f'输入语句:{sentence}')\n", 31 | "print(f\"分词结果:{tokens}\")" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "id": "2bc0fa24", 37 | "metadata": {}, 38 | "source": [ 39 | "从上面的代码可以看到,最简单的基于空格的分词方法无法将词与词后面的标点符号分割。如果标点符号对于后续任务(例如文本分类)并不重要,可以去除这些标点符号后再进一步分词:\n" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 2, 45 | "id": "fe2f9e9d", 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "name": "stdout", 50 | "output_type": "stream", 51 | "text": [ 52 | "输入语句:I learn natural language processing with dongshouxueNLP, too.\n", 53 | "分词结果:['I', 'learn', 'natural', 'language', 'processing', 'with',\n", 54 | " 'dongshouxueNLP', 'too']\n" 55 | ] 56 | } 57 | ], 58 | "source": [ 59 | "#引入正则表达式包\n", 60 | "import re\n", 61 | "sentence = \"I learn natural language processing with dongshouxueNLP, too.\"\n", 62 | "print(f'输入语句:{sentence}')\n", 63 | "\n", 64 | "#去除句子中的“,”和“.”\n", 65 | "sentence = re.sub(r'\\,|\\.','',sentence)\n", 66 | "tokens = sentence.split(' ')\n", 67 | "print(f\"分词结果:{tokens}\")" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "id": "6a1fb4d0", 73 | "metadata": {}, 74 | "source": [ 75 | "正则表达式使用单个字符串(通常称为“模式”即pattern)来描述、匹配对应文本中全部匹配某个指定规则的字符串。\n", 76 | "我们也可以使用正则表达式来实现空格分词:\n" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 3, 82 | "id": "391cf386", 83 | "metadata": {}, 84 | "outputs": [ 85 | { 86 | "name": "stdout", 87 | "output_type": "stream", 88 | "text": [ 89 | "['Did', 'you', 'spend', '3', '4', 'on', 'arxiv', 'org', 'for', 'your',\n", 90 | " 'pre', 'print', 'No', 'it', 's', 'free', 'It', 's']\n" 91 | ] 92 | } 93 | ], 94 | "source": [ 95 | "import re\n", 96 | "sentence = \"Did you spend $3.4 on arxiv.org for your pre-print?\"+\\\n", 97 | " \" No, it's free! 
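One small caveat about split(' ') that the example above does not run into: consecutive spaces produce empty strings, whereas split() without an argument or re.split(r'\s+', ...) collapse the whitespace. A quick illustration on a made-up string:

import re

s = "I  learn   NLP"
print(s.split(' '))         # ['I', '', 'learn', '', '', 'NLP']
print(s.split())            # ['I', 'learn', 'NLP']
print(re.split(r'\s+', s))  # ['I', 'learn', 'NLP']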
It's ...\"\n", 98 | "# 其中,\\w表示匹配a-z,A-Z,0-9和“_”这4种类型的字符,等价于[a-zA-Z0-9_],\n", 99 | "# +表示匹配前面的表达式1次或者多次。因此\\w+表示匹配上述4种类型的字符1次或多次。\n", 100 | "pattern = r\"\\w+\"\n", 101 | "print(re.findall(pattern, sentence))" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "id": "28ffa18a", 107 | "metadata": {}, 108 | "source": [ 109 | "处理标点:" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 2, 115 | "id": "5c616bf5", 116 | "metadata": {}, 117 | "outputs": [ 118 | { 119 | "name": "stdout", 120 | "output_type": "stream", 121 | "text": [ 122 | "['Did', 'you', 'spend', '$3', '.4', 'on', 'arxiv', '.org', 'for', 'your',\n", 123 | " 'pre', '-print', '?', 'No', ',', 'it', \"'s\", 'free', '!', 'It', \"'s\",\n", 124 | " '.', '.', '.']\n" 125 | ] 126 | } 127 | ], 128 | "source": [ 129 | "\n", 130 | "# 可以在正则表达式中使用\\S来表示除了空格以外的所有字符(\\s在正则表达式中表示空格字符,\\S则相应的表示\\s的补集)\n", 131 | "# |表示或运算,*表示匹配前面的表达式0次或多次,\\S\\w* 表示先匹配除了空格以外的1个字符,后面可以包含0个或多个\\w字符。\n", 132 | "pattern = r\"\\w+|\\S\\w*\"\n", 133 | "print(re.findall(pattern, sentence))" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "id": "28f7cbab", 139 | "metadata": {}, 140 | "source": [ 141 | "处理连字符:" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 5, 147 | "id": "1a678ed0", 148 | "metadata": {}, 149 | "outputs": [ 150 | { 151 | "name": "stdout", 152 | "output_type": "stream", 153 | "text": [ 154 | "['Did', 'you', 'spend', '3', '4', 'on', 'arxiv', 'org', 'for', 'your',\n", 155 | "'pre-print', 'No', \"it's\", 'free', \"It's\"]\n" 156 | ] 157 | } 158 | ], 159 | "source": [ 160 | "\n", 161 | "# -表示匹配连字符-,(?:[-']\\w+)*表示匹配0次或多次括号内的模式。(?:...)表示匹配括号内的模式,\n", 162 | "# 可以和+/*等符号连用。其中?:表示不保存匹配到的括号中的内容,是re代码库中的特殊标准要求的部分。\n", 163 | "pattern = r\"\\w+(?:[-']\\w+)*\"\n", 164 | "print(re.findall(pattern, sentence))" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | "id": "af2bfcce", 170 | "metadata": {}, 171 | "source": [ 172 | "将前面的匹配符号的模式\\S\\w\\*组合起来,可以得到一个既可以处理标点符号又可以处理连字符的正则表达式:" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | "execution_count": 4, 178 | "id": "d4a45574", 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "name": "stdout", 183 | "output_type": "stream", 184 | "text": [ 185 | "['Did', 'you', 'spend', '$3', '.4', 'on', 'arxiv', '.org', 'for', 'your',\n", 186 | "'pre-print', '?', 'No', ',', \"it's\", 'free', '!', \"It's\", '.', '.', '.']\n" 187 | ] 188 | } 189 | ], 190 | "source": [ 191 | "pattern = r\"\\w+(?:[-']\\w+)*|\\S\\w*\"\n", 192 | "print(re.findall(pattern, sentence))" 193 | ] 194 | }, 195 | { 196 | "cell_type": "markdown", 197 | "id": "1a4b2e70", 198 | "metadata": {}, 199 | "source": [ 200 | "\n", 201 | "在英文简写和网址中,常常会使用'.',它与英文中的句号为同一个符号,匹配这种情况的正则表达式为:\n", 202 | "\n", 203 | "* 正则表达式模式:(\\w+\\\\.)+\\w+(\\\\.)*\n", 204 | "* 符合匹配的字符串示例:\n", 205 | " * U.S.A.、arxiv.org\n", 206 | "* 不符合的字符串示例:\n", 207 | " * $3.4、..." 
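The remark above about ?: deserves a concrete example: with an ordinary capturing group, re.findall returns only the captured group instead of the whole match, which is why the non-capturing form is used. A quick check on a toy string:

import re

s = "pre-print it's"
print(re.findall(r"\w+([-']\w+)*", s))    # ['-print', "'s"]  -- only the groups
print(re.findall(r"\w+(?:[-']\w+)*", s))  # ['pre-print', "it's"]  -- full matches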
208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": 5, 213 | "id": "06772a7b", 214 | "metadata": {}, 215 | "outputs": [ 216 | { 217 | "name": "stdout", 218 | "output_type": "stream", 219 | "text": [ 220 | "['Did', 'you', 'spend', '$3', '.4', 'on', 'arxiv.org', 'for', 'your',\n", 221 | "'pre-print', '?', 'No', ',', \"it's\", 'free', '!', \"It's\", '.', '.', '.']\n" 222 | ] 223 | } 224 | ], 225 | "source": [ 226 | "#新的匹配模式\n", 227 | "new_pattern = r\"(?:\\w+\\.)+\\w+(?:\\.)*\"\n", 228 | "pattern = new_pattern +r\"|\"+pattern\n", 229 | "print(re.findall(pattern, sentence))" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "id": "8eabc8e5", 235 | "metadata": {}, 236 | "source": [ 237 | "需要注意的是,字符“.”在正则表达式中表示匹配任意字符,因此要表示字符本身的含义时,需要在该符号前面加入转义字符(Escape Character)\"\\\\\",即“\\\\.”。同理,想要表示“+”“?”“(”“)”“$”这些特殊字符时,需要在前面加入转义字符“\\\\”。" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "id": "a1808683", 243 | "metadata": {}, 244 | "source": [ 245 | "在许多语言中,货币和百分比符号与数字是直接相连的,匹配这种情况的正则表达式为:\n", 246 | "\n", 247 | "* 正则表达式模式:\\\\$?\\d+(\\\\.\\d+)?%?\n", 248 | "* 符合匹配的字符串示例:\n", 249 | " * \\$3.40、3.5%\n", 250 | "* 不符合的字符串示例:\n", 251 | " * \\$.4、1.4.0、1\\%\\%" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 150, 257 | "id": "1bd6d2c5", 258 | "metadata": {}, 259 | "outputs": [ 260 | { 261 | "name": "stdout", 262 | "output_type": "stream", 263 | "text": [ 264 | "['Did', 'you', 'spend', '$3.4', 'on', 'arxiv.org', 'for', 'your',\n", 265 | " 'pre-print', '?', 'No', ',', \"it's\", 'free', '!', \"It's\", '.', '.', '.']\n" 266 | ] 267 | } 268 | ], 269 | "source": [ 270 | "#新的匹配pattern,匹配价格符号\n", 271 | "new_pattern2 = r\"\\$?\\d+(?:\\.\\d+)?%?\"\n", 272 | "pattern = new_pattern2 +r\"|\" + new_pattern +r\"|\"+pattern\n", 273 | "print(re.findall(pattern, sentence))" 274 | ] 275 | }, 276 | { 277 | "cell_type": "markdown", 278 | "id": "c4971854", 279 | "metadata": {}, 280 | "source": [ 281 | "其中\\d表示所有的数字字符,?表示匹配前面的模式0次或者1次。" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "id": "d4dde345", 287 | "metadata": {}, 288 | "source": [ 289 | "省略号本身表达了一定的含义,因此要在分词中将其保留,匹配它的正则表达式为:\n", 290 | "\n", 291 | "* 正则表达式模式:$\\text{\\\\}$.$\\text{\\\\}$.$\\text{\\\\}$.\n", 292 | "* 符合匹配的字符串示例:\n", 293 | " * ..." 
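Note that the new, more specific pattern is always prepended rather than appended: Python's re tries the alternatives of | from left to right at each position, so a general sub-pattern listed first would already consume part of the price before the price pattern is ever tried. A small demonstration with the same patterns as above:

import re

generic = r"\w+(?:[-']\w+)*|\S\w*"
price   = r"\$?\d+(?:\.\d+)?%?"
s = "spend $3.4 now"

print(re.findall(generic, s))                # ['spend', '$3', '.4', 'now']
print(re.findall(generic + "|" + price, s))  # price listed last: still split apart
print(re.findall(price + "|" + generic, s))  # ['spend', '$3.4', 'now']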
294 | ] 295 | }, 296 | { 297 | "cell_type": "code", 298 | "execution_count": 151, 299 | "id": "1ad794a2", 300 | "metadata": {}, 301 | "outputs": [ 302 | { 303 | "name": "stdout", 304 | "output_type": "stream", 305 | "text": [ 306 | "['Did', 'you', 'spend', '$3.4', 'on', 'arxiv.org', 'for', 'your',\n", 307 | " 'pre-print', '?', 'No', ',', \"it's\", 'free', '!', \"It's\", '...']\n" 308 | ] 309 | } 310 | ], 311 | "source": [ 312 | "#新的匹配pattern,匹配价格符号\n", 313 | "new_pattern3 = r\"\\.\\.\\.\" \n", 314 | "pattern = new_pattern3 +r\"|\" + new_pattern2 +r\"|\" +\\\n", 315 | " new_pattern +r\"|\"+pattern\n", 316 | "print(re.findall(pattern, sentence))" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "id": "2598928d", 322 | "metadata": {}, 323 | "source": [ 324 | "\n", 325 | "NLTK {cite:p}`bird2009natural` 是基于Python的NLP工具包,也可以用于实现前面提到的基于正则表达式的分词。\n", 326 | "\n", 327 | "" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": 155, 334 | "id": "8ec8b313", 335 | "metadata": {}, 336 | "outputs": [ 337 | { 338 | "name": "stdout", 339 | "output_type": "stream", 340 | "text": [ 341 | "['Did', 'you', 'spend', '$3.4', 'on', 'arxiv.org', 'for', 'your',\n", 342 | " 'pre-print', '?', 'No', ',', \"it's\", 'free', '!', \"It's\", '...']\n" 343 | ] 344 | } 345 | ], 346 | "source": [ 347 | "import re\n", 348 | "import nltk\n", 349 | "#引入NLTK分词器\n", 350 | "from nltk.tokenize import word_tokenize\n", 351 | "from nltk.tokenize import regexp_tokenize\n", 352 | "\n", 353 | "tokens = regexp_tokenize(sentence,pattern)\n", 354 | "print(tokens)" 355 | ] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "id": "9b1a7442", 360 | "metadata": {}, 361 | "source": [ 362 | "基于BPE的词元学习器。\n", 363 | "\n", 364 | "给定一个词表包含所有的字符(如,{A, B, C, D, ..., a, b, c, d, ...}),词元学习器重复以下步骤来构建词表:\n", 365 | "\n", 366 | "(1)找出在训练语料中最常相连的两个符号,这里称其为“$C_1$”和“$C_2$”;\n", 367 | "\n", 368 | "(2)将新组合的符号“$C_1$$C_2$”加入词表当中;\n", 369 | "\n", 370 | "(3)将训练语料中所有相连的“$C_1$”和“$C_2$”转换成“$C_1$$C_2$”;\n", 371 | "\n", 372 | "(4)重复上述步骤$k$次。\n", 373 | "\n", 374 | "假设有一个训练语料包含了一些方向和中国的地名的拼音:\n", 375 | "```\n", 376 | "nan nan nan nan nan nanjing nanjing beijing beijing beijing beijing beijing beijing dongbei dongbei dongbei bei bei\n", 377 | "```\n", 378 | "首先,我们基于空格将语料分解成词元,然后加入特殊符号“_”来作为词尾的标识符,通过这种方式可以更好地去包含相似子串的词语(例如区分al在formalalmost中的区别)。" 379 | ] 380 | }, 381 | { 382 | "cell_type": "markdown", 383 | "id": "2617fbb2", 384 | "metadata": {}, 385 | "source": [ 386 | "第一步,根据语料构建初始的词表:" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": 19, 392 | "id": "bc7f60a9", 393 | "metadata": {}, 394 | "outputs": [ 395 | { 396 | "name": "stdout", 397 | "output_type": "stream", 398 | "text": [ 399 | "语料:\n", 400 | "5 ['n', 'a', 'n', '_']\n", 401 | "2 ['n', 'a', 'n', 'j', 'i', 'n', 'g', '_']\n", 402 | "6 ['b', 'e', 'i', 'j', 'i', 'n', 'g', '_']\n", 403 | "3 ['d', 'o', 'n', 'g', 'b', 'e', 'i', '_']\n", 404 | "2 ['b', 'e', 'i', '_']\n", 405 | "词表:['_', 'a', 'b', 'd', 'e', 'g', 'i', 'j', 'n', 'o']\n" 406 | ] 407 | } 408 | ], 409 | "source": [ 410 | "corpus = \"nan nan nan nan nan nanjing nanjing beijing beijing \"+\\\n", 411 | " \"beijing beijing beijing beijing dongbei dongbei dongbei bei bei\"\n", 412 | "tokens = corpus.split(' ')\n", 413 | "\n", 414 | "#构建基于字符的初始词表\n", 415 | "vocabulary = set(corpus) \n", 416 | "vocabulary.remove(' ')\n", 417 | "vocabulary.add('_')\n", 418 | "vocabulary = sorted(list(vocabulary))\n", 419 | "\n", 420 | "#根据语料构建词表\n", 421 | "corpus_dict = {}\n", 422 | "for token in tokens:\n", 423 | " key = 
token+'_'\n", 424 | " if key not in corpus_dict:\n", 425 | " corpus_dict[key] = {\"split\": list(key), \"count\": 0}\n", 426 | " corpus_dict[key]['count'] += 1\n", 427 | "\n", 428 | "print(f\"语料:\")\n", 429 | "for key in corpus_dict:\n", 430 | " print(corpus_dict[key]['count'], corpus_dict[key]['split'])\n", 431 | "print(f\"词表:{vocabulary}\")" 432 | ] 433 | }, 434 | { 435 | "cell_type": "markdown", 436 | "id": "fa28c018", 437 | "metadata": {}, 438 | "source": [ 439 | "第二步,词元学习器通过迭代的方式逐步组合新的符号加入到词表中:" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": 20, 445 | "id": "1409448a", 446 | "metadata": {}, 447 | "outputs": [ 448 | { 449 | "name": "stdout", 450 | "output_type": "stream", 451 | "text": [ 452 | "第1次迭代\n", 453 | "当前最常出现的前5个符号组合:\n", 454 | "[('ng', 11), ('be', 11), ('ei', 11), ('ji', 8), ('in', 8)]\n", 455 | "本次迭代组合的符号为:ng\n", 456 | "\n", 457 | "迭代后的语料为:\n", 458 | "5 ['n', 'a', 'n', '_']\n", 459 | "2 ['n', 'a', 'n', 'j', 'i', 'ng', '_']\n", 460 | "6 ['b', 'e', 'i', 'j', 'i', 'ng', '_']\n", 461 | "3 ['d', 'o', 'ng', 'b', 'e', 'i', '_']\n", 462 | "2 ['b', 'e', 'i', '_']\n", 463 | "词表:['_', 'a', 'b', 'd', 'e', 'g', 'i', 'j', 'n', 'o', 'ng']\n", 464 | "\n", 465 | "-------------------------------------\n", 466 | "第2次迭代\n", 467 | "当前最常出现的前5个符号组合:\n", 468 | "[('be', 11), ('ei', 11), ('ji', 8), ('ing', 8), ('ng_', 8)]\n", 469 | "本次迭代组合的符号为:be\n", 470 | "\n", 471 | "迭代后的语料为:\n", 472 | "5 ['n', 'a', 'n', '_']\n", 473 | "2 ['n', 'a', 'n', 'j', 'i', 'ng', '_']\n", 474 | "6 ['be', 'i', 'j', 'i', 'ng', '_']\n", 475 | "3 ['d', 'o', 'ng', 'be', 'i', '_']\n", 476 | "2 ['be', 'i', '_']\n", 477 | "词表:['_', 'a', 'b', 'd', 'e', 'g', 'i', 'j', 'n', 'o', 'ng', 'be']\n", 478 | "\n", 479 | "-------------------------------------\n", 480 | "第3次迭代\n", 481 | "当前最常出现的前5个符号组合:\n", 482 | "[('bei', 11), ('ji', 8), ('ing', 8), ('ng_', 8), ('na', 7)]\n", 483 | "本次迭代组合的符号为:bei\n", 484 | "\n", 485 | "迭代后的语料为:\n", 486 | "5 ['n', 'a', 'n', '_']\n", 487 | "2 ['n', 'a', 'n', 'j', 'i', 'ng', '_']\n", 488 | "6 ['bei', 'j', 'i', 'ng', '_']\n", 489 | "3 ['d', 'o', 'ng', 'bei', '_']\n", 490 | "2 ['bei', '_']\n", 491 | "词表:['_', 'a', 'b', 'd', 'e', 'g', 'i', 'j', 'n', 'o', 'ng', 'be', 'bei']\n", 492 | "\n", 493 | "-------------------------------------\n", 494 | "第9次迭代\n", 495 | "当前最常出现的前5个符号组合:\n", 496 | "[('beijing_', 6), ('nan_', 5), ('bei_', 5), ('do', 3), ('ong', 3)]\n", 497 | "本次迭代组合的符号为:beijing_\n", 498 | "\n", 499 | "迭代后的语料为:\n", 500 | "5 ['nan', '_']\n", 501 | "2 ['nan', 'jing_']\n", 502 | "6 ['beijing_']\n", 503 | "3 ['d', 'o', 'ng', 'bei', '_']\n", 504 | "2 ['bei', '_']\n", 505 | "词表:['_', 'a', 'b', 'd', 'e', 'g', 'i', 'j', 'n', 'o', 'ng', 'be', 'bei',\n", 506 | " 'ji', 'jing', 'jing_', 'na', 'nan', 'beijing_']\n", 507 | "\n", 508 | "-------------------------------------\n" 509 | ] 510 | } 511 | ], 512 | "source": [ 513 | "for step in range(9):\n", 514 | " # 如果想要将每一步的结果都输出,请读者自行将max_print_step改成999\n", 515 | " max_print_step = 3\n", 516 | " if step < max_print_step or step == 8: \n", 517 | " print(f\"第{step+1}次迭代\")\n", 518 | " split_dict = {}\n", 519 | " for key in corpus_dict:\n", 520 | " splits = corpus_dict[key]['split']\n", 521 | " # 遍历所有符号进行统计\n", 522 | " for i in range(len(splits)-1):\n", 523 | " # 组合两个符号作为新的符号\n", 524 | " current_group = splits[i]+splits[i+1]\n", 525 | " if current_group not in split_dict:\n", 526 | " split_dict[current_group] = 0\n", 527 | " split_dict[current_group] += corpus_dict[key]['count']\n", 528 | "\n", 529 | " group_hist=[(k, v) for k, v in 
sorted(split_dict.items(), \\\n", 530 | " key=lambda item: item[1],reverse=True)]\n", 531 | " if step < max_print_step or step == 8:\n", 532 | " print(f\"当前最常出现的前5个符号组合:{group_hist[:5]}\")\n", 533 | " \n", 534 | " merge_key = group_hist[0][0]\n", 535 | " if step < max_print_step or step == 8:\n", 536 | " print(f\"本次迭代组合的符号为:{merge_key}\")\n", 537 | " for key in corpus_dict:\n", 538 | " if merge_key in key:\n", 539 | " new_splits = []\n", 540 | " splits = corpus_dict[key]['split']\n", 541 | " i = 0\n", 542 | " while i < len(splits):\n", 543 | " if i+1>=len(splits):\n", 544 | " new_splits.append(splits[i])\n", 545 | " i+=1\n", 546 | " continue\n", 547 | " if merge_key == splits[i]+splits[i+1]:\n", 548 | " new_splits.append(merge_key)\n", 549 | " i+=2\n", 550 | " else:\n", 551 | " new_splits.append(splits[i])\n", 552 | " i+=1\n", 553 | " corpus_dict[key]['split']=new_splits\n", 554 | " \n", 555 | " vocabulary.append(merge_key)\n", 556 | " if step < max_print_step or step == 8:\n", 557 | " print()\n", 558 | " print(f\"迭代后的语料为:\")\n", 559 | " for key in corpus_dict:\n", 560 | " print(corpus_dict[key]['count'], corpus_dict[key]['split'])\n", 561 | " print(f\"词表:{vocabulary}\")\n", 562 | " print()\n", 563 | " print('-------------------------------------')" 564 | ] 565 | }, 566 | { 567 | "cell_type": "markdown", 568 | "id": "b41e133d", 569 | "metadata": {}, 570 | "source": [ 571 | "得到学习到的词表之后,给定一句新的句子,使用BPE词元分词器根据词表中每个符号学到的顺序,贪心地将字符组合起来。例如输入是“nanjing beijing”,那么根据上面例子里的词表,会先把“n”和“g”组合成“ng”,然后组合“be”“bei”……最终分词成:" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": 24, 577 | "id": "9a938e54", 578 | "metadata": {}, 579 | "outputs": [ 580 | { 581 | "name": "stdout", 582 | "output_type": "stream", 583 | "text": [ 584 | "输入语句:nanjing beijing\n", 585 | "分词结果:['nan', 'jing_', 'beijing_']\n" 586 | ] 587 | } 588 | ], 589 | "source": [ 590 | "ordered_vocabulary = {key: x for x, key in enumerate(vocabulary)}\n", 591 | "sentence = \"nanjing beijing\"\n", 592 | "print(f\"输入语句:{sentence}\")\n", 593 | "tokens = sentence.split(' ')\n", 594 | "tokenized_string = []\n", 595 | "for token in tokens:\n", 596 | " key = token+'_'\n", 597 | " splits = list(key)\n", 598 | " #用于在没有更新的时候跳出\n", 599 | " flag = 1\n", 600 | " while flag:\n", 601 | " flag = 0\n", 602 | " split_dict = {}\n", 603 | " #遍历所有符号进行统计\n", 604 | " for i in range(len(splits)-1): \n", 605 | " #组合两个符号作为新的符号\n", 606 | " current_group = splits[i]+splits[i+1] \n", 607 | " if current_group not in ordered_vocabulary:\n", 608 | " continue\n", 609 | " if current_group not in split_dict:\n", 610 | " #判断当前组合是否在词表里,如果是的话加入split_dict\n", 611 | " split_dict[current_group] = ordered_vocabulary[current_group] \n", 612 | " flag = 1\n", 613 | " if not flag:\n", 614 | " continue\n", 615 | " \n", 616 | " #对每个组合进行优先级的排序(此处为从小到大)\n", 617 | " group_hist=[(k, v) for k, v in sorted(split_dict.items(),\\\n", 618 | " key=lambda item: item[1])] \n", 619 | " #优先级最高的组合\n", 620 | " merge_key = group_hist[0][0] \n", 621 | " new_splits = []\n", 622 | " i = 0\n", 623 | " # 根据优先级最高的组合产生新的分词\n", 624 | " while i < len(splits):\n", 625 | " if i+1>=len(splits):\n", 626 | " new_splits.append(splits[i])\n", 627 | " i+=1\n", 628 | " continue\n", 629 | " if merge_key == splits[i]+splits[i+1]:\n", 630 | " new_splits.append(merge_key)\n", 631 | " i+=2\n", 632 | " else:\n", 633 | " new_splits.append(splits[i])\n", 634 | " i+=1\n", 635 | " splits=new_splits\n", 636 | " tokenized_string+=splits\n", 637 | "\n", 638 | "print(f\"分词结果:{tokenized_string}\")" 639 | ] 640 | }, 641 | { 
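In practice, BPE vocabularies are rarely trained with hand-written loops like the one above; libraries such as Hugging Face tokenizers implement the same learner efficiently. A hedged sketch (assumes `pip install tokenizers`; its conventions differ slightly from the code above, for example it does not use the trailing "_" end-of-word marker, so the resulting segmentation is similar but not identical):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = ("nan nan nan nan nan nanjing nanjing beijing beijing "
          "beijing beijing beijing beijing dongbei dongbei dongbei bei bei")

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=30, special_tokens=["[UNK]"])
tokenizer.train_from_iterator([corpus], trainer=trainer)

print(tokenizer.encode("nanjing beijing").tokens)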
642 | "cell_type": "markdown", 643 | "id": "432defd3", 644 | "metadata": {}, 645 | "source": [ 646 | "\n", 647 | "大小写折叠(case folding)是将所有的英文大写字母转化成小写字母的过程。在搜索场景中,用户往往喜欢使用小写,而在计算机中,大写字母和小写字母并非同一字符,当遇到用户想要搜索一些人名、地名等带有大写字母的专有名词的情况下,正确的搜索结果可能会比较难匹配上。\n", 648 | "\n" 649 | ] 650 | }, 651 | { 652 | "cell_type": "code", 653 | "execution_count": 1, 654 | "id": "ff4810fc", 655 | "metadata": {}, 656 | "outputs": [ 657 | { 658 | "name": "stdout", 659 | "output_type": "stream", 660 | "text": [ 661 | "let's study hands-on-nlp\n" 662 | ] 663 | } 664 | ], 665 | "source": [ 666 | "# Case Folding\n", 667 | "sentence = \"Let's study Hands-on-NLP\"\n", 668 | "print(sentence.lower())" 669 | ] 670 | }, 671 | { 672 | "cell_type": "markdown", 673 | "id": "2ad4136b", 674 | "metadata": {}, 675 | "source": [ 676 | "\n", 677 | "在诸如英文这样的语言中,很多单词都会根据不同的主语、语境、时态等情形修改形态,而这些单词本身表达的含义是接近甚至是相同的。例如英文中的am、is、are都可以还原成be,英文名词cat根据不同情形有cat、cats、cat's、cats'等多种形态。这些形态对文本的语义影响相对较小,但是大幅度提高了词表的大小,因而提高了自然语言处理模型的构建成本。因此在有些文本处理问题上,会将所有的词进行词目还原(lemmatization),即找出词的原型。人类在学习这些语言的过程中,可以通过词典找词的原型;类似地,计算机可以通过建立词典来进行词目还原:\n" 678 | ] 679 | }, 680 | { 681 | "cell_type": "code", 682 | "execution_count": 18, 683 | "id": "cc1f7c18", 684 | "metadata": {}, 685 | "outputs": [ 686 | { 687 | "name": "stdout", 688 | "output_type": "stream", 689 | "text": [ 690 | "词目还原前:['Two', 'dogs', 'are', 'chasing', 'three', 'cats']\n", 691 | "词目还原后:['Two', 'dog', 'be', 'chase', 'three', 'cat']\n" 692 | ] 693 | } 694 | ], 695 | "source": [ 696 | "# 构建词典\n", 697 | "lemma_dict = {'am': 'be','is': 'be','are': 'be','cats': 'cat',\\\n", 698 | " \"cats'\": 'cat',\"cat's\": 'cat','dogs': 'dog',\"dogs'\": 'dog',\\\n", 699 | " \"dog's\": 'dog', 'chasing': \"chase\"}\n", 700 | "\n", 701 | "sentence = \"Two dogs are chasing three cats\"\n", 702 | "words = sentence.split(' ')\n", 703 | "print(f'词目还原前:{words}')\n", 704 | "lemmatized_words = []\n", 705 | "for word in words:\n", 706 | " if word in lemma_dict:\n", 707 | " lemmatized_words.append(lemma_dict[word])\n", 708 | " else:\n", 709 | " lemmatized_words.append(word)\n", 710 | "\n", 711 | "print(f'词目还原后:{lemmatized_words}')\n" 712 | ] 713 | }, 714 | { 715 | "cell_type": "markdown", 716 | "id": "6b6d329d", 717 | "metadata": {}, 718 | "source": [ 719 | "另外,也可以利用NLTK自带的词典来进行词目还原:" 720 | ] 721 | }, 722 | { 723 | "cell_type": "code", 724 | "execution_count": 26, 725 | "id": "7bf7cbd4", 726 | "metadata": {}, 727 | "outputs": [ 728 | { 729 | "name": "stdout", 730 | "output_type": "stream", 731 | "text": [ 732 | "词目还原前:['Two', 'dogs', 'are', 'chasing', 'three', 'cats']\n", 733 | "词目还原后:['Two', 'dog', 'be', 'chase', 'three', 'cat']\n" 734 | ] 735 | } 736 | ], 737 | "source": [ 738 | "import nltk\n", 739 | "#引入nltk分词器、lemmatizer,引入wordnet还原动词\n", 740 | "from nltk.tokenize import word_tokenize\n", 741 | "from nltk.stem import WordNetLemmatizer\n", 742 | "from nltk.corpus import wordnet\n", 743 | "\n", 744 | "#下载分词包、wordnet包\n", 745 | "nltk.download('punkt', quiet=True)\n", 746 | "nltk.download('wordnet', quiet=True)\n", 747 | "\n", 748 | "\n", 749 | "lemmatizer = WordNetLemmatizer()\n", 750 | "sentence = \"Two dogs are chasing three cats\"\n", 751 | "\n", 752 | "words = word_tokenize(sentence)\n", 753 | "print(f'词目还原前:{words}')\n", 754 | "lemmatized_words = []\n", 755 | "for word in words:\n", 756 | " lemmatized_words.append(lemmatizer.lemmatize(word, wordnet.VERB))\n", 757 | "\n", 758 | "print(f'词目还原后:{lemmatized_words}')" 759 | ] 760 | }, 761 | { 762 | "cell_type": "markdown", 763 | "id": "d74d3eea", 764 | "metadata": {}, 765 | 
"source": [ 766 | "\n", 767 | "很多实际场景中,我们往往需要处理很长的文本,例如新闻、财报、日志等。让计算机直接同时处理整个文本会非常的困难,因此需要将文本分成许多句子来让计算机分别进行处理。对于分句问题,最常见的方法是根据标点符号来分割文本,例如“!”“?”“。”等符号。然而,在某些语言当中,个别分句符号会有歧义。例如英文中的句号“.”也同时有省略符(例如“Inc.”、“Ph.D.”、“Mr.”等)、小数点(例如“3.5”、“.3%”)等含义。这些歧义会导致分句困难。为了解决这种问题,常见的方案是先进行分词,使用基于正则表达式或者基于机器学习的分词方法将文本分解成词元,随后基于符号判断句子边界。例如:" 768 | ] 769 | }, 770 | { 771 | "cell_type": "code", 772 | "execution_count": 222, 773 | "id": "4071468e", 774 | "metadata": {}, 775 | "outputs": [ 776 | { 777 | "name": "stdout", 778 | "output_type": "stream", 779 | "text": [ 780 | "分句结果:\n", 781 | "['Did', 'you', 'spend', '$3.4', 'on', 'arxiv.org', 'for', 'your',\n", 782 | "'pre-print', '?']\n", 783 | "['No', ',', \"it's\", 'free', '!']\n", 784 | "[\"It's\", '...']\n" 785 | ] 786 | } 787 | ], 788 | "source": [ 789 | "sentence_spliter = set([\".\",\"?\",'!','...'])\n", 790 | "sentence = \"Did you spend $3.4 on arxiv.org for your pre-print? \" + \\\n", 791 | " \"No, it's free! It's ...\"\n", 792 | "\n", 793 | "tokens = regexp_tokenize(sentence,pattern)\n", 794 | "\n", 795 | "sentences = []\n", 796 | "boundary = [0]\n", 797 | "for token_id, token in enumerate(tokens):\n", 798 | " # 判断句子边界\n", 799 | " if token in sentence_spliter:\n", 800 | " #如果是句子边界,则把分句结果加入进去\n", 801 | " sentences.append(tokens[boundary[-1]:token_id+1]) \n", 802 | " #将下一句句子起始位置加入boundary\n", 803 | " boundary.append(token_id+1) \n", 804 | "\n", 805 | "if boundary[-1]!=len(tokens):\n", 806 | " sentences.append(tokens[boundary[-1]:])\n", 807 | "\n", 808 | "print(f\"分句结果:\")\n", 809 | "for seg_sentence in sentences:\n", 810 | " print(seg_sentence)\n" 811 | ] 812 | }, 813 | { 814 | "cell_type": "markdown", 815 | "id": "56f5a604", 816 | "metadata": {}, 817 | "source": [ 818 | "" 824 | ] 825 | } 826 | ], 827 | "metadata": { 828 | "kernelspec": { 829 | "display_name": "Python 3 (ipykernel)", 830 | "language": "python", 831 | "name": "python3" 832 | }, 833 | "language_info": { 834 | "codemirror_mode": { 835 | "name": "ipython", 836 | "version": 3 837 | }, 838 | "file_extension": ".py", 839 | "mimetype": "text/x-python", 840 | "name": "python", 841 | "nbconvert_exporter": "python", 842 | "pygments_lexer": "ipython3", 843 | "version": "3.8.17" 844 | } 845 | }, 846 | "nbformat": 4, 847 | "nbformat_minor": 5 848 | } 849 | -------------------------------------------------------------------------------- /第3章-文本表示.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "67c1e6b1", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "\n", 10 | "下面的例子将展示词向量标准工具包——gensim提供的词嵌入,并展示词嵌入如何表示词的相似度。\n", 11 | "" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 2, 17 | "id": "5c5a740a", 18 | "metadata": {}, 19 | "outputs": [], 20 | "source": [ 21 | "import numpy as np\n", 22 | "import pprint\n", 23 | "\n", 24 | "from gensim.models import KeyedVectors\n", 25 | "\n", 26 | "# 从GloVe官网下载GloVe向量,此处使用的是glove.6B.zip\n", 27 | "# 解压缩zip文件并将以下路径改为解压后对应文件的路径\n", 28 | "model = KeyedVectors.load_word2vec_format('/your/path/here'+\\\n", 29 | " '/glove.6B.100d.txt', binary=False, no_header=True)" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 4, 35 | "id": "01a2e4a5", 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "name": "stdout", 40 | "output_type": "stream", 41 | "text": [ 42 | "[('movie', 0.9055121541023254),\n", 43 | " ('films', 0.8914433717727661),\n", 44 | " ('directed', 0.8124364018440247),\n", 45 | " ('documentary', 0.8075793981552124),\n", 46 
| " ('drama', 0.7929168939590454),\n", 47 | " ('movies', 0.7889865040779114),\n", 48 | " ('comedy', 0.7842751741409302),\n", 49 | " ('starring', 0.7573286294937134),\n", 50 | " ('cinema', 0.7419455647468567),\n", 51 | " ('hollywood', 0.7307389378547668)]\n", 52 | "[('vehicle', 0.8630837798118591),\n", 53 | " ('truck', 0.8597878813743591),\n", 54 | " ('cars', 0.837166965007782),\n", 55 | " ('driver', 0.8185911178588867),\n", 56 | " ('driving', 0.7812635898590088),\n", 57 | " ('motorcycle', 0.7553157210350037),\n", 58 | " ('vehicles', 0.7462256550788879),\n", 59 | " ('parked', 0.74594646692276),\n", 60 | " ('bus', 0.7372707724571228),\n", 61 | " ('taxi', 0.7155268788337708)]\n" 62 | ] 63 | } 64 | ], 65 | "source": [ 66 | "# 使用most_similar()找到词表中距离给定词最近(最相似)的n个词\n", 67 | "pprint.pprint(model.most_similar('film'))\n", 68 | "pprint.pprint(model.most_similar('car'))" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": 5, 74 | "id": "8b62f7ad", 75 | "metadata": { 76 | "scrolled": false 77 | }, 78 | "outputs": [ 79 | { 80 | "name": "stdout", 81 | "output_type": "stream", 82 | "text": [ 83 | "japanese\n", 84 | "panda\n", 85 | "longest\n", 86 | "terrible\n", 87 | "queen\n" 88 | ] 89 | } 90 | ], 91 | "source": [ 92 | "# 利用GloVe展示一个类比的例子\n", 93 | "def analogy(x1, x2, y1):\n", 94 | " # 寻找top-N最相似的词。\n", 95 | " result = model.most_similar(positive=[y1, x2], negative=[x1])\n", 96 | " return result[0][0]\n", 97 | "\n", 98 | "print(analogy('china', 'chinese', 'japan'))\n", 99 | "print(analogy('australia', 'koala', 'china'))\n", 100 | "print(analogy('tall', 'tallest', 'long'))\n", 101 | "print(analogy('good', 'fantastic', 'bad'))\n", 102 | "print(analogy('man', 'woman', 'king'))" 103 | ] 104 | }, 105 | { 106 | "cell_type": "markdown", 107 | "id": "0c308cee", 108 | "metadata": {}, 109 | "source": [ 110 | "下面将展示word2vec的代码,包括文本预处理、skipgram算法的实现、以及使用PyTorch进行优化。这里使用《小王子》这本书作为训练语料。" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 1, 116 | "id": "590fc408", 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "# 安装NLTK,使用如下代码下载punkt组件\n", 121 | "#import nltk\n", 122 | "#nltk.download('punkt')\n", 123 | "\n", 124 | "from nltk.tokenize import sent_tokenize, word_tokenize\n", 125 | "from collections import defaultdict\n", 126 | "\n", 127 | "# 使用类管理数据对象,包括文本读取、文本预处理等\n", 128 | "class TheLittlePrinceDataset:\n", 129 | " def __init__(self, tokenize=True):\n", 130 | " # 利用NLTK函数进行分句和分词\n", 131 | " text = open('the little prince.txt', 'r', encoding='utf-8').read()\n", 132 | " if tokenize:\n", 133 | " self.sentences = sent_tokenize(text.lower())\n", 134 | " self.tokens = [word_tokenize(sent) for sent in self.sentences]\n", 135 | " else:\n", 136 | " self.text = text\n", 137 | "\n", 138 | " def build_vocab(self, min_freq=1):\n", 139 | " # 统计词频\n", 140 | " frequency = defaultdict(int)\n", 141 | " for sentence in self.tokens:\n", 142 | " for token in sentence:\n", 143 | " frequency[token] += 1\n", 144 | " self.frequency = frequency\n", 145 | "\n", 146 | " # 加入处理未登录词,加入用于对齐变长输入进而加速\n", 147 | " self.token2id = {'': 1, '': 0}\n", 148 | " self.id2token = {1: '', 0: ''}\n", 149 | " for token, freq in sorted(frequency.items(), key=lambda x: -x[1]):\n", 150 | " # 丢弃低频词\n", 151 | " if freq > min_freq:\n", 152 | " self.token2id[token] = len(self.token2id)\n", 153 | " self.id2token[len(self.id2token)] = token\n", 154 | " else:\n", 155 | " break\n", 156 | "\n", 157 | " def get_word_distribution(self):\n", 158 | " distribution = np.zeros(vocab_size)\n", 159 | " for token, 
freq in self.frequency.items():\n", 160 | " if token in dataset.token2id:\n", 161 | " distribution[dataset.token2id[token]] = freq\n", 162 | " else:\n", 163 | " # 不在词表中的词按计算\n", 164 | " distribution[1] += freq\n", 165 | " distribution /= distribution.sum()\n", 166 | " return distribution\n", 167 | "\n", 168 | " # 将分词结果转化为索引表示\n", 169 | " def convert_tokens_to_ids(self, drop_single_word=True):\n", 170 | " self.token_ids = []\n", 171 | " for sentence in self.tokens:\n", 172 | " token_ids = [self.token2id.get(token, 1) for token in sentence]\n", 173 | " # 忽略只有一个token的序列,无法计算loss\n", 174 | " if len(token_ids) == 1 and drop_single_word:\n", 175 | " continue\n", 176 | " self.token_ids.append(token_ids)\n", 177 | " \n", 178 | " return self.token_ids\n", 179 | "\n", 180 | "dataset = TheLittlePrinceDataset()\n", 181 | "dataset.build_vocab(min_freq=1)\n", 182 | "sentences = dataset.convert_tokens_to_ids()" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 2, 188 | "id": "efc882de", 189 | "metadata": {}, 190 | "outputs": [ 191 | { 192 | "name": "stdout", 193 | "output_type": "stream", 194 | "text": [ 195 | "(74374, 2) [[ 4 17]\n", 196 | " [ 4 20]\n", 197 | " [ 17 4]\n", 198 | " ...\n", 199 | " [131 2]\n", 200 | " [ 2 86]\n", 201 | " [ 2 131]]\n" 202 | ] 203 | } 204 | ], 205 | "source": [ 206 | "# 遍历所有的中心词-上下文词对\n", 207 | "window_size = 2\n", 208 | "data = []\n", 209 | "\n", 210 | "for sentence in sentences:\n", 211 | " for i in range(len(sentence)):\n", 212 | " for j in range(i-window_size, i+window_size+1):\n", 213 | " if j == i or j < 0 or j >= len(sentence):\n", 214 | " continue\n", 215 | " center_word = sentence[i]\n", 216 | " context_word = sentence[j]\n", 217 | " data.append([center_word, context_word])\n", 218 | "\n", 219 | "# 需要提前安装numpy\n", 220 | "import numpy as np\n", 221 | "data = np.array(data)\n", 222 | "print(data.shape, data)" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": 3, 228 | "id": "30903b3d", 229 | "metadata": {}, 230 | "outputs": [], 231 | "source": [ 232 | "# 需要提前安装PyTorch\n", 233 | "import torch\n", 234 | "from torch import nn\n", 235 | "import torch.nn.functional as F\n", 236 | "\n", 237 | "# 实现skipgram算法,使用对比学习计算损失\n", 238 | "class SkipGramNCE(nn.Module):\n", 239 | " def __init__(self, vocab_size, embed_size, distribution,\\\n", 240 | " neg_samples=20):\n", 241 | " super(SkipGramNCE, self).__init__()\n", 242 | " print(f'vocab_size = {vocab_size}, embed_size = {embed_size}, '+\\\n", 243 | " f'neg_samples = {neg_samples}')\n", 244 | " self.input_embeddings = nn.Embedding(vocab_size, embed_size)\n", 245 | " self.output_embeddings = nn.Embedding(vocab_size, embed_size)\n", 246 | " distribution = np.power(distribution, 0.75)\n", 247 | " distribution /= distribution.sum()\n", 248 | " self.distribution = torch.tensor(distribution)\n", 249 | " self.neg_samples = neg_samples\n", 250 | " \n", 251 | " def forward(self, input_ids, labels):\n", 252 | " i_embed = self.input_embeddings(input_ids)\n", 253 | " o_embed = self.output_embeddings(labels)\n", 254 | " batch_size = i_embed.size(0)\n", 255 | " n_words = torch.multinomial(self.distribution, batch_size * \\\n", 256 | " self.neg_samples, replacement=True).view(batch_size, -1)\n", 257 | " n_embed = self.output_embeddings(n_words)\n", 258 | " pos_term = F.logsigmoid(torch.sum(i_embed * o_embed, dim=1))\n", 259 | " # 负采样,用于对比学习\n", 260 | " neg_term = F.logsigmoid(- torch.bmm(n_embed, \\\n", 261 | " i_embed.unsqueeze(2)).squeeze())\n", 262 | " neg_term = torch.sum(neg_term, 
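To make the nested loops above concrete, here is the same pair extraction run on a tiny made-up id sequence with window_size=2; every position is paired with each neighbour at most two steps away, which is exactly what the (74374, 2) array printed above contains for the full corpus.

toy_sentences = [[4, 17, 20]]
window_size = 2
pairs = []
for sentence in toy_sentences:
    for i in range(len(sentence)):
        for j in range(i - window_size, i + window_size + 1):
            if j == i or j < 0 or j >= len(sentence):
                continue
            pairs.append([sentence[i], sentence[j]])

print(pairs)  # [[4, 17], [4, 20], [17, 4], [17, 20], [20, 4], [20, 17]]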
dim=1)\n", 263 | " loss = - torch.mean(pos_term + neg_term)\n", 264 | " return loss" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 4, 270 | "id": "1d9da6c8", 271 | "metadata": {}, 272 | "outputs": [ 273 | { 274 | "name": "stdout", 275 | "output_type": "stream", 276 | "text": [ 277 | "[0.00000000e+00 4.95799942e-02 5.48904123e-02 ... 9.65530559e-05\n", 278 | " 9.65530559e-05 9.65530559e-05]\n", 279 | "vocab_size = 1071, embed_size = 128, neg_samples = 20\n" 280 | ] 281 | }, 282 | { 283 | "name": "stderr", 284 | "output_type": "stream", 285 | "text": [ 286 | "epoch-99, loss=2.8468: 100%|█| 100/100 [05:03<00:00, 3.04s/\n" 287 | ] 288 | }, 289 | { 290 | "data": { 291 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAjMAAAGwCAYAAABcnuQpAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8pXeV/AAAACXBIWXMAAA9hAAAPYQGoP6dpAABO/0lEQVR4nO3deVyU1f4H8M8zM8ywDSCgIAqKSu64ay6pZWpqZtmulf26Nyv3lmv7zW43tc2sLLtttprWzcxbaaIpau6IirgrCiqIsq8zzMz5/THMA8MmzgzzMPJ5v168kpln4PiE8OF7vuccSQghQEREROShVEoPgIiIiMgZDDNERETk0RhmiIiIyKMxzBAREZFHY5ghIiIij8YwQ0RERB6NYYaIiIg8mkbpATQ0i8WCCxcuQK/XQ5IkpYdDRERE9SCEQEFBASIiIqBS1V17uebDzIULFxAZGan0MIiIiMgBaWlpaN26dZ3XXPNhRq/XA7DejICAAIVHQ0RERPWRn5+PyMhI+ed4Xa75MGObWgoICGCYISIi8jD1aRFhAzARERF5NIYZIiIi8mgMM0REROTRGGaIiIjIozHMEBERkUdjmCEiIiKPxjBDREREHo1hhoiIiDwawwwRERF5NIYZIiIi8mgMM0REROTRGGaIiIjIo13zB002lEKDCbnFRvh4qRHir1N6OERERE0WKzMO+vKvFAx5YxPe+uOY0kMhIiJq0hhmHOSltt46o9mi8EiIiIiaNoYZB2k11ltXZhYKj4SIiKhpY5hxkK0yU2ZiZYaIiEhJDDMO0nKaiYiIqFFgmHGQl0YCAJQxzBARESmKYcZBcgMwp5mIiIgUxTDjINs0EyszREREymKYcZCXhj0zREREjQHDjIPkyoyJS7OJiIiUxDDjIC9OMxERETUKDDMO0nKaiYiIqFFgmHGQl5pLs4mIiBoDhhkHabk0m4iIqFFgmHFQRc8MG4CJiIiUxDDjIPbMEBERNQ4MMw6qvJpJCFZniIiIlMIw4yBbz4wQgMnCMENERKQUhhkH2Q6aBLiiiYiISEkMMw6yVWYA7gJMRESkJIYZB6lVEqTy4gybgImIiJTDMOMgSZLkJmCGGSIiIuUwzDih4rBJhhkiIiKlMMw4gUcaEBERKY9hxgncOI+IiEh5DDNO8OL5TERERIpjmHGCluczERERKY5hxgmVjzQgIiIiZTDMOIE9M0RERMpjmHGCvJqJPTNERESKYZhxAjfNIyIiUh7DjBNs00zsmSEiIlIOw4wTKnYA5momIiIipTDMOIHTTERERMprNGFmwYIFkCQJc+bMkR8TQmDevHmIiIiAj48Phg8fjuTkZOUGWYWXhpvmERERKa1RhJk9e/bgk08+QWxsrN3jb775JhYtWoQlS5Zgz549CA8Px8iRI1FQUKDQSO3xbCYiIiLlKR5mCgsLMXnyZHz66ado1qyZ/LgQAosXL8aLL76IiRMnolu3bvjqq69QXFyM5cuX1/rxDAYD8vPz7d4aipab5hERESlO8TAzffp0jBs3DjfffLPd4ykpKcjIyMCoUaPkx3Q6HYYNG4bt27fX+vEWLFiAwMBA+S0yMrLBxl6xaR4bgImIiJSiaJhZsWIF9u3bhwULFlR7LiMjAwAQFhZm93hYWJj8XE2ef/555OXlyW9paWmuHXQlPGiSiIhIeRqlPnFaWhpmz56N9evXw9vbu9brJEmye18IUe2xynQ6HXQ6ncvGWReezURERKQ8xSozCQkJyMzMRJ8+faDRaKDRaBAfH4/3338fGo1GrshUrcJkZmZWq9YoRcsGYCIiIsUpFmZGjBiBpKQk7N+/X37r27cvJk+ejP3796Ndu3YIDw9HXFyc/Bqj0Yj4+HgMGjRIqWHb4Q7AREREylNsmkmv16Nbt252j/n5+SEkJER+fM6cOZg/fz5iYmIQExOD+fPnw9fXF5MmTVJiyNXYppkM7JkhIiJSjGJhpj7mzp2LkpISTJs2DTk5ORgwYADWr18PvV6v9NAAVO6Z4WomIiIipTSqMLN582a79yVJwrx58zBv3jxFxnMlth2Ay1iZISIiUozi+8x4Mh1XMxERESmOYcYJXhrraiYeNElERKQchhkncNM8IiIi5THMOIGb5hERESmPYcYJFfvMcDUTERGRUhhmnMBTs4mIiJTHMOME9swQEREpj2HGCV5qrmYiIiJSGsOME9gATEREpDyGGSfo2ABMRESkOIYZJ7BnhoiISHkMM06wnc3EnhkiIiLlMMw4wdYAXGa2QAhONRERESmBYcYJOrUaACAEYLYwzBARESmBYcYJtoMmAU41ERERKYVhxgm2BmAAKDOxMkNERKQEhhknaFSszBARESmNYcYJkiRVOmySYYaIiEgJDDNO4mGTREREymKYcZJ8PhM3ziMiIlIEw4yT5F2AWZkhIiJSBMOMkyoOm+RqJiIiIiUwzDhJxwZgIiIiRTHMOImHTRIRESmLYcZJtl2A2TNDRESkDIYZJ8k9M6zMEBERKYJhxklaNgATEREpimHGSbYdgI1ms8IjISIiapoYZpxUMc3EygwREZESGGacJO8AzAZgIiIiRTDMOEmrUQPgPjNERERKYZhxEs9mIiIiUhbDjJN4ajYREZGyGGacVHHQJBuAiYiIlMAw4yQvVmaIiIgUxTDjJNs+M9wBmIiISBkMM07Scmk2ERGRohhmnMRpJiIiImUxzDjJy3acAXcAJiIiUgTDjJO4NJuIiEhZDDNOqqjMMMwQEREpgWHGSbYGYFZmiIiIlMEw46SKTfMYZoiIiJTAMOMkeZ8ZhhkiIiJFMMw4Sa7MsGeGiIhIEQwzTqpYzcSl
[... base64-encoded PNG data of the training-loss curve omitted; the plot shows loss (y-axis) over training epochs (x-axis) ...]", 292 | "text/plain": [ 293 | "
" 294 | ] 295 | }, 296 | "metadata": {}, 297 | "output_type": "display_data" 298 | } 299 | ], 300 | "source": [ 301 | "# 为对比学习负采样准备词频率分布\n", 302 | "vocab_size = len(dataset.token2id)\n", 303 | "embed_size = 128\n", 304 | "distribution = dataset.get_word_distribution()\n", 305 | "print(distribution)\n", 306 | "model = SkipGramNCE(vocab_size, embed_size, distribution)\n", 307 | "\n", 308 | "from torch.utils.data import DataLoader\n", 309 | "from torch.optim import SGD, Adam\n", 310 | "\n", 311 | "# 定义静态方法collate_batch批量处理数据,转化为PyTorch可以需要的张量类型\n", 312 | "class DataCollator:\n", 313 | " @classmethod\n", 314 | " def collate_batch(cls, batch):\n", 315 | " batch = np.array(batch)\n", 316 | " input_ids = torch.tensor(batch[:, 0], dtype=torch.long)\n", 317 | " labels = torch.tensor(batch[:, 1], dtype=torch.long)\n", 318 | " return {'input_ids': input_ids, 'labels': labels}\n", 319 | "\n", 320 | "# 定义训练参数以及训练循环\n", 321 | "epochs = 100\n", 322 | "batch_size = 128\n", 323 | "learning_rate = 1e-3\n", 324 | "epoch_loss = []\n", 325 | "\n", 326 | "data_collator = DataCollator()\n", 327 | "dataloader = DataLoader(data, batch_size=batch_size, shuffle=True,\\\n", 328 | " collate_fn=data_collator.collate_batch)\n", 329 | "optimizer = Adam(model.parameters(), lr=learning_rate)\n", 330 | "model.zero_grad()\n", 331 | "model.train()\n", 332 | "\n", 333 | "# 需要提前安装tqdm\n", 334 | "from tqdm import trange\n", 335 | "import matplotlib.pyplot as plt\n", 336 | "\n", 337 | "# 训练过程,每步读取数据,送入模型计算损失,并使用PyTorch进行优化\n", 338 | "with trange(epochs, desc='epoch', ncols=60) as pbar:\n", 339 | " for epoch in pbar:\n", 340 | " for step, batch in enumerate(dataloader):\n", 341 | " loss = model(**batch)\n", 342 | " pbar.set_description(f'epoch-{epoch}, loss={loss.item():.4f}')\n", 343 | " loss.backward()\n", 344 | " optimizer.step()\n", 345 | " model.zero_grad()\n", 346 | " epoch_loss.append(loss.item())\n", 347 | " \n", 348 | "epoch_loss = np.array(epoch_loss)\n", 349 | "plt.plot(range(len(epoch_loss)), epoch_loss)\n", 350 | "plt.xlabel('training epoch')\n", 351 | "plt.ylabel('loss')\n", 352 | "plt.show()" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "id": "c9430e9a", 358 | "metadata": {}, 359 | "source": [ 360 | "TF-IDF加权\n", 361 | "\n", 362 | "定义词频率(term frequency)。注意到不同长度的文章词频率会有较大差距,不利于比较和运算,因此可以对词频率取对数。\n", 363 | "\n", 364 | "$$\\text{tf}_{t,d} = \\log (\\text{count}(t,d) + 1)$$\n", 365 | "\n", 366 | "其中$\\text{count}(t,d)$表示词$t$在文档$d$中出现的次数,为了避免对0取对数,把所有的计数加1。\n", 367 | "\n", 368 | "那么如何区分高频词与低频词呢?TF-IDF引入了另一个重要的评价指标——文档频率(document frequency),即一个词在语料库所包含的多少篇文档中出现。在所有文档里出现的词往往是虚词或是常见实词,而只在少量文档里出现的词往往是具有明确含义的实词并且具有很强的文档区分度。用$\\text{df}_t$来表示在多少篇文档中出现了词$t$。\n", 369 | "\n", 370 | "为了压低高频词和提升低频词的影响,TF-IDF使用文档频率的倒数,也就是逆向文档频率(inverse document frequency)来对词频率进行加权。这很好理解,一个词的文档频率越高,其倒数就越小,权重就越小。\n", 371 | "\n", 372 | "$$\\text{idf}_t = \\log \\frac{N}{\\text{df}_t}$$\n", 373 | "\n", 374 | "其中$N$表示文档总数。为了避免分母为0,通常会将分母改为$\\text{df}_t+1$。\n", 375 | "\n", 376 | "基于词频率和逆向文档频率,得到TF-IDF的最终值为:\n", 377 | "\n", 378 | "$$w_{t,d} = \\text{tf}_{t,d} \\times \\text{idf}_{t}$$\n" 379 | ] 380 | }, 381 | { 382 | "cell_type": "markdown", 383 | "id": "f765e353", 384 | "metadata": {}, 385 | "source": [ 386 | "很多情况下会额外对文档的TF-IDF向量使用L2归一化,使得不同文档的TF-IDF向量具有相同的模长,便于相互比较。\n", 387 | "下面给出了TF-IDF的代码实现。" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": 5, 393 | "id": "9ce8e610", 394 | "metadata": {}, 395 | "outputs": [], 396 | "source": [ 397 | "class TFIDF:\n", 398 | " def __init__(self, vocab_size, norm='l2', 
smooth_idf=True,\\\n", 399 | " sublinear_tf=True):\n", 400 | " self.vocab_size = vocab_size\n", 401 | " self.norm = norm\n", 402 | " self.smooth_idf = smooth_idf\n", 403 | " self.sublinear_tf = sublinear_tf\n", 404 | " \n", 405 | " def fit(self, X):\n", 406 | " doc_freq = np.zeros(self.vocab_size, dtype=np.float64)\n", 407 | " for data in X:\n", 408 | " for token_id in set(data):\n", 409 | " doc_freq[token_id] += 1\n", 410 | " doc_freq += int(self.smooth_idf)\n", 411 | " n_samples = len(X) + int(self.smooth_idf)\n", 412 | " self.idf = np.log(n_samples / doc_freq) + 1\n", 413 | " \n", 414 | " def transform(self, X):\n", 415 | " assert hasattr(self, 'idf')\n", 416 | " term_freq = np.zeros((len(X), self.vocab_size), dtype=np.float64)\n", 417 | " for i, data in enumerate(X):\n", 418 | " for token in data:\n", 419 | " term_freq[i, token] += 1\n", 420 | " if self.sublinear_tf:\n", 421 | " term_freq = np.log(term_freq + 1)\n", 422 | " Y = term_freq * self.idf\n", 423 | " if self.norm:\n", 424 | " row_norm = (Y**2).sum(axis=1)\n", 425 | " row_norm[row_norm == 0] = 1\n", 426 | " Y /= np.sqrt(row_norm)[:, None]\n", 427 | " return Y\n", 428 | " \n", 429 | " def fit_transform(self, X):\n", 430 | " self.fit(X)\n", 431 | " return self.transform(X)" 432 | ] 433 | } 434 | ], 435 | "metadata": { 436 | "kernelspec": { 437 | "display_name": "Python 3 (ipykernel)", 438 | "language": "python", 439 | "name": "python3" 440 | }, 441 | "language_info": { 442 | "codemirror_mode": { 443 | "name": "ipython", 444 | "version": 3 445 | }, 446 | "file_extension": ".py", 447 | "mimetype": "text/x-python", 448 | "name": "python", 449 | "nbconvert_exporter": "python", 450 | "pygments_lexer": "ipython3", 451 | "version": "3.8.17" 452 | } 453 | }, 454 | "nbformat": 4, 455 | "nbformat_minor": 5 456 | } 457 | -------------------------------------------------------------------------------- /第5章-文本聚类.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "66e0d34e", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "\n", 10 | "下面我们实现*k*均值算法,进行文本聚类。这里使用的数据集与第4章的数据集类似,包含3种主题约1万本图书的信息,但文本内容是图书摘要而非标题。首先我们复用第4章的代码进行预处理。\n" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 2, 16 | "id": "ce835777", 17 | "metadata": { 18 | "scrolled": false 19 | }, 20 | "outputs": [ 21 | { 22 | "name": "stdout", 23 | "output_type": "stream", 24 | "text": [ 25 | "train size = 8627 , test size = 2157\n", 26 | "{0: '计算机类', 1: '艺术传媒类', 2: '经管类'}\n" 27 | ] 28 | }, 29 | { 30 | "name": "stderr", 31 | "output_type": "stream", 32 | "text": [ 33 | "100%|██████████| 8627/8627 [03:10<00:00, 45.23it/s]\n", 34 | "100%|██████████| 2157/2157 [00:47<00:00, 45.56it/s]\n" 35 | ] 36 | }, 37 | { 38 | "name": "stdout", 39 | "output_type": "stream", 40 | "text": [ 41 | "unique tokens = 34252, total counts = 806900, max freq = 19197, min freq = 1\n", 42 | "min_freq = 3, min_len = 2, max_size = None, remaining tokens = 9504,\n", 43 | " in-vocab rate = 0.8910459784359895\n" 44 | ] 45 | } 46 | ], 47 | "source": [ 48 | "import os\n", 49 | "import sys\n", 50 | "\n", 51 | "# 导入前面实现的Books数据集\n", 52 | "sys.path.append('./code')\n", 53 | "from utils import BooksDataset\n", 54 | "\n", 55 | "dataset = BooksDataset()\n", 56 | "# 打印出类和标签ID\n", 57 | "print(dataset.id2label)\n", 58 | "\n", 59 | "dataset.tokenize(attr='abstract')\n", 60 | "dataset.build_vocab(min_freq=3)\n", 61 | "dataset.convert_tokens_to_ids()\n", 62 | "\n", 63 | "train_data, test_data = 
dataset.train_data, dataset.test_data" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "id": "96285754", 69 | "metadata": {}, 70 | "source": [ 71 | "接下来导入实现TF-IDF算法的函数,将处理后的数据集输入到函数中,得到文档特征:" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 3, 77 | "id": "1a16e90b", 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "name": "stdout", 82 | "output_type": "stream", 83 | "text": [ 84 | "(8627, 9504)\n" 85 | ] 86 | } 87 | ], 88 | "source": [ 89 | "# 导入之前实现的TF-IDF算法\n", 90 | "from utils import TFIDF\n", 91 | "\n", 92 | "vocab_size = len(dataset.token2id)\n", 93 | "train_X = []\n", 94 | "for data in train_data:\n", 95 | " train_X.append(data['token_ids'])\n", 96 | "# 对TF-IDF的结果进行归一化(norm='l2')对聚类非常重要,\n", 97 | "# 不经过归一化会导致数据在某些方向上过于分散从而聚类失败\n", 98 | "# 初始化TFIDF()函数\n", 99 | "tfidf = TFIDF(vocab_size, norm='l2', smooth_idf=True, sublinear_tf=True)\n", 100 | "# 计算词频率和逆文档频率\n", 101 | "tfidf.fit(train_X)\n", 102 | "# 转化为TF-IDF向量\n", 103 | "train_F = tfidf.transform(train_X)\n", 104 | "print(train_F.shape)" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "id": "a8f04c44", 110 | "metadata": {}, 111 | "source": [ 112 | "在有了数据之后,运行*k*均值聚类算法为文本进行聚类。我们需要事先确定簇数$K$。为了方便与实际的标签数据进行对比,这里假设$K$为3。" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 10, 118 | "id": "d8493c19", 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "name": "stdout", 123 | "output_type": "stream", 124 | "text": [ 125 | "-----------初始化-----------\n", 126 | "-----------初始化完成-----------\n", 127 | "第1步,中心点平均移动距离:0.059189038070756865\n", 128 | "...\n", 129 | "第10步,中心点平均移动距离:0.002389605545132419\n", 130 | "...\n", 131 | "第16步,中心点平均移动距离:0.0\n", 132 | "中心点不再移动,退出程序\n" 133 | ] 134 | } 135 | ], 136 | "source": [ 137 | "import numpy as np\n", 138 | "\n", 139 | "# 更改簇的标签数量\n", 140 | "K = 3\n", 141 | "\n", 142 | "class KMeans:\n", 143 | " def __init__(self, K, dim, stop_val = 1e-4, max_step = 100):\n", 144 | " self.K = K\n", 145 | " self.dim = dim\n", 146 | " self.stop_val = stop_val\n", 147 | " self.max_step = max_step\n", 148 | "\n", 149 | " def update_mean_vec(self, X):\n", 150 | " mean_vec = np.zeros([self.K, self.dim])\n", 151 | " for k in range(self.K):\n", 152 | " data = X[self.cluster_num == k]\n", 153 | " if len(data) > 0:\n", 154 | " mean_vec[k] = data.mean(axis=0)\n", 155 | " return mean_vec\n", 156 | " \n", 157 | " # 运行k均值算法的迭代循环\n", 158 | " def fit(self, X):\n", 159 | " print('-----------初始化-----------')\n", 160 | " N = len(X)\n", 161 | " dim = len(X[0])\n", 162 | " # 给每个数据点随机分配簇\n", 163 | " self.cluster_num = np.random.randint(0, self.K, N)\n", 164 | " self.mean_vec = self.update_mean_vec(X)\n", 165 | " \n", 166 | " print('-----------初始化完成-----------')\n", 167 | " global_step = 0\n", 168 | " while global_step < self.max_step:\n", 169 | " global_step += 1\n", 170 | " self.cluster_num = np.zeros(N, int) \n", 171 | " for i, data_point in enumerate(X):\n", 172 | " # 计算每个数据点和每个簇中心的L2距离\n", 173 | " dist = np.linalg.norm(data_point[None, :] - \\\n", 174 | " self.mean_vec, ord=2, axis=-1)\n", 175 | " # 找到每个数据点所属新的聚类\n", 176 | " self.cluster_num[i] = dist.argmin(-1)\n", 177 | "\n", 178 | " '''\n", 179 | " 上面的循环过程也可以以下面的代码进行并行处理,但是可能\n", 180 | " 会使得显存过大,建议在数据点的特征向量维度较小时\n", 181 | " 或者进行降维后使用\n", 182 | " # N x D - K x D -> N x K x D\n", 183 | " dist = np.linalg.norm(train_X[:,None,:] - self.mean_vec, \\\n", 184 | " ord = 2, axis = -1) \n", 185 | " # 找到每个数据点所属新的聚类\n", 186 | " self.cluster_num = dist.argmin(-1)\n", 187 | " '''\n", 188 | "\n", 189 | " new_mean_vec = 
self.update_mean_vec(X)\n", 190 | "\n", 191 | " # 计算新的簇中心点和上一步迭代的中心点的距离\n", 192 | " moving_dist = np.linalg.norm(new_mean_vec - self.mean_vec,\\\n", 193 | " ord = 2, axis = -1).mean()\n", 194 | " print(f\"第{global_step}步,中心点平均移动距离:{moving_dist}\")\n", 195 | " if moving_dist < self.stop_val:\n", 196 | " print(\"中心点不再移动,退出程序\")\n", 197 | " break\n", 198 | "\n", 199 | " # 将mean_vec更新\n", 200 | " self.mean_vec = new_mean_vec\n", 201 | "\n", 202 | "kmeans = KMeans(K, train_F.shape[1])\n", 203 | "kmeans.fit(train_F)" 204 | ] 205 | }, 206 | { 207 | "cell_type": "markdown", 208 | "id": "b8f3f765", 209 | "metadata": {}, 210 | "source": [ 211 | "为了更直观地展示聚类的效果,我们定义show_clusters()这个函数,显示每个真实分类下包含的每个簇的比重。下面对*k*均值算法的聚类结果进行展示,并观察3个标签中不同簇的占比。" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 11, 217 | "id": "1c3158e6", 218 | "metadata": { 219 | "scrolled": true 220 | }, 221 | "outputs": [ 222 | { 223 | "name": "stdout", 224 | "output_type": "stream", 225 | "text": [ 226 | "8627\n", 227 | "计算机类:\t{ 0: 2583(0.67), 1: 1222(0.32), 2: 37(0.01), }\n", 228 | "艺术传媒类:\t{ 0: 281(0.12), 1: 72(0.03), 2: 1947(0.85), }\n", 229 | "经管类:\t{ 0: 2452(0.99), 1: 26(0.01), 2: 7(0.00), }\n" 230 | ] 231 | } 232 | ], 233 | "source": [ 234 | "# 取出每条数据的标签和标签ID\n", 235 | "labels = []\n", 236 | "for data in train_data:\n", 237 | " labels.append(data['label'])\n", 238 | "print(len(labels))\n", 239 | "\n", 240 | "# 展示聚类结果\n", 241 | "def show_clusters(clusters, K):\n", 242 | " # 每个标签下的数据可能被聚类到不同的簇,因此对所有标签、所有簇进行初始化\n", 243 | " label_clusters = {label_id: {} for label_id in dataset.id2label}\n", 244 | " for k, v in label_clusters.items():\n", 245 | " label_clusters[k] = {i: 0 for i in range(K)}\n", 246 | " # 统计每个标签下,分到每个簇的数据条数\n", 247 | " for label_id, cluster_id in zip(labels, clusters):\n", 248 | " label_clusters[label_id][cluster_id] += 1\n", 249 | " \n", 250 | " for label_id in sorted(dataset.id2label.keys()):\n", 251 | " _str = dataset.id2label[label_id] + ':\\t{ '\n", 252 | " for cluster_id in range(K):\n", 253 | " # 计算label_id这个标签ID下,簇为cluster_id的占比\n", 254 | " _cnt = label_clusters[label_id][cluster_id]\n", 255 | " _total = sum(label_clusters[label_id].values())\n", 256 | " _str += f'{str(cluster_id)}: {_cnt}({_cnt / _total:.2f}), '\n", 257 | " _str += '}'\n", 258 | " print(_str)\n", 259 | "\n", 260 | "clusters = kmeans.cluster_num\n", 261 | "show_clusters(clusters, K)" 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "id": "be29cb62", 267 | "metadata": {}, 268 | "source": [ 269 | "接下来演示如何使用高斯混合来进行聚类。注意高斯混合会计算每个数据点归属于各簇的概率分布,这里将概率最高的簇作为聚类输出。" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 12, 275 | "id": "9353dbb8", 276 | "metadata": {}, 277 | "outputs": [], 278 | "source": [ 279 | "from scipy.stats import multivariate_normal as gaussian\n", 280 | "from tqdm import tqdm\n", 281 | "\n", 282 | "# 高斯混合模型\n", 283 | "class GMM:\n", 284 | " def __init__(self, K, dim, max_iter=100):\n", 285 | " # K为聚类数目,dim为向量维度,max_iter为最大迭代次数\n", 286 | " self.K = K\n", 287 | " self.dim = dim\n", 288 | " self.max_iter = max_iter\n", 289 | " \n", 290 | " # 初始化,pi = 1/K为先验概率,miu ~[-1,1]为高斯分布的均值,\n", 291 | " # sigma = eye为高斯分布的协方差矩阵\n", 292 | " self.pi = np.ones(K) / K\n", 293 | " self.miu = np.random.rand(K, dim) * 2 - 1\n", 294 | " self.sigma = np.zeros((K, dim, dim))\n", 295 | " for i in range(K):\n", 296 | " self.sigma[i] = np.eye(dim)\n", 297 | " \n", 298 | " # GMM的E步骤\n", 299 | " def E_step(self, X):\n", 300 | " # 计算每个数据点被分到不同簇的密度\n", 301 | " for i in range(self.K):\n", 302 | " 
self.Y[:, i] = self.pi[i] * gaussian.pdf(X, \\\n", 303 | " mean=self.miu[i], cov=self.sigma[i])\n", 304 | " # 对密度进行归一化,得到概率分布\n", 305 | " self.Y /= self.Y.sum(axis=1, keepdims=True)\n", 306 | " \n", 307 | " # GMM的M步骤\n", 308 | " def M_step(self, X):\n", 309 | " # 更新先验概率分布\n", 310 | " Y_sum = self.Y.sum(axis=0)\n", 311 | " self.pi = Y_sum / self.N\n", 312 | " # 更新每个簇的均值\n", 313 | " self.miu = np.matmul(self.Y.T, X) / Y_sum[:, None]\n", 314 | " # 更新每个簇的协方差矩阵\n", 315 | " for i in range(self.K):\n", 316 | " # N * 1 * D\n", 317 | " delta = np.expand_dims(X, axis=1) - self.miu[i]\n", 318 | " # N * D * D\n", 319 | " sigma = np.matmul(delta.transpose(0, 2, 1), delta)\n", 320 | " # D * D\n", 321 | " self.sigma[i] = np.matmul(sigma.transpose(1, 2, 0),\\\n", 322 | " self.Y[:, i]) / Y_sum[i]\n", 323 | " \n", 324 | " # 计算对数似然,用于判断迭代终止\n", 325 | " def log_likelihood(self, X):\n", 326 | " ll = 0\n", 327 | " for x in X:\n", 328 | " p = 0\n", 329 | " for i in range(self.K):\n", 330 | " p += self.pi[i] * gaussian.pdf(x, mean=self.miu[i],\\\n", 331 | " cov=self.sigma[i])\n", 332 | " ll += np.log(p)\n", 333 | " return ll / self.N\n", 334 | " \n", 335 | " # 运行GMM算法的E步骤、M步骤迭代循环\n", 336 | " def fit(self, X):\n", 337 | " self.N = len(X)\n", 338 | " self.Y = np.zeros((self.N, self.K))\n", 339 | " ll = self.log_likelihood(X)\n", 340 | " print('开始迭代')\n", 341 | " for i in range(self.max_iter):\n", 342 | " self.E_step(X)\n", 343 | " self.M_step(X)\n", 344 | " new_ll = self.log_likelihood(X)\n", 345 | " print(f'第{i}步, log-likelihood = {new_ll:.4f}')\n", 346 | " if new_ll - ll < 1e-4:\n", 347 | " print('log-likelihood不再变化,退出程序')\n", 348 | " break\n", 349 | " else:\n", 350 | " ll = new_ll\n", 351 | " \n", 352 | " # 根据学习到的参数将一个数据点分配到概率最大的簇\n", 353 | " def transform(self, X):\n", 354 | " assert hasattr(self, 'Y') and len(self.Y) == len(X)\n", 355 | " return np.argmax(self.Y, axis=1)\n", 356 | " \n", 357 | " def fit_transform(self, X):\n", 358 | " self.fit(X)\n", 359 | " return self.transform(X)" 360 | ] 361 | }, 362 | { 363 | "cell_type": "markdown", 364 | "id": "2e5c256e", 365 | "metadata": {}, 366 | "source": [ 367 | "与*k*均值聚类方法类似,在使用最大期望值法的高斯混合的情况下,观察在Books数据集3个真实类别中不同簇的占比:" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": 13, 373 | "id": "259eb004", 374 | "metadata": { 375 | "scrolled": false 376 | }, 377 | "outputs": [ 378 | { 379 | "name": "stdout", 380 | "output_type": "stream", 381 | "text": [ 382 | "开始迭代\n", 383 | "第0步, log-likelihood = 77.2685\n", 384 | "...\n", 385 | "第10步, log-likelihood = 95.9564\n", 386 | "...\n", 387 | "第20步, log-likelihood = 97.8945\n", 388 | "...\n", 389 | "第30步, log-likelihood = 98.2401\n", 390 | "...\n", 391 | "第39步, log-likelihood = 98.2509\n", 392 | "log-likelihood不再变化,退出程序\n", 393 | "[2 0 2 ... 
1 2 1]\n", 394 | "计算机类:\t{ 0: 114(0.03), 1: 1256(0.33), 2: 2472(0.64), }\n", 395 | "艺术传媒类:\t{ 0: 2129(0.93), 1: 23(0.01), 2: 148(0.06), }\n", 396 | "经管类:\t{ 0: 268(0.11), 1: 2152(0.87), 2: 65(0.03), }\n" 397 | ] 398 | } 399 | ], 400 | "source": [ 401 | "# 直接对TF-IDF特征聚类运行速度过慢,因此使用PCA降维,将TF-IDF向量降到50维\n", 402 | "from sklearn.decomposition import PCA\n", 403 | "pca = PCA(n_components=50)\n", 404 | "train_P = pca.fit_transform(train_F)\n", 405 | "\n", 406 | "# 运行GMM算法,展示聚类结果\n", 407 | "gmm = GMM(K, dim=train_P.shape[1])\n", 408 | "clusters = gmm.fit_transform(train_P)\n", 409 | "print(clusters)\n", 410 | "show_clusters(clusters, K)" 411 | ] 412 | }, 413 | { 414 | "cell_type": "markdown", 415 | "id": "b58851b6", 416 | "metadata": {}, 417 | "source": [ 418 | "下面演示基于朴素贝叶斯模型的聚类算法实现:" 419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": 14, 424 | "id": "f215a250", 425 | "metadata": {}, 426 | "outputs": [], 427 | "source": [ 428 | "from scipy.special import logsumexp\n", 429 | "\n", 430 | "# 无监督朴素贝叶斯\n", 431 | "class UnsupervisedNaiveBayes:\n", 432 | " def __init__(self, K, dim, max_iter=100):\n", 433 | " self.K = K\n", 434 | " self.dim = dim\n", 435 | " self.max_iter = max_iter\n", 436 | " \n", 437 | " # 初始化参数,pi为先验概率分布,P用于保存K个朴素贝叶斯模型的参数\n", 438 | " self.pi = np.ones(K) / K\n", 439 | " self.P = np.random.random((K, dim))\n", 440 | " self.P /= self.P.sum(axis=1, keepdims=True)\n", 441 | " \n", 442 | " # E步骤\n", 443 | " def E_step(self, X):\n", 444 | " # 根据朴素贝叶斯公式,计算每个数据点分配到每个簇的概率分布\n", 445 | " for i, x in enumerate(X):\n", 446 | " # 由于朴素贝叶斯使用了许多概率连乘,容易导致精度溢出,\n", 447 | " # 因此使用对数概率\n", 448 | " self.Y[i, :] = np.log(self.pi) + (np.log(self.P) *\\\n", 449 | " x).sum(axis=1)\n", 450 | " # 使用对数概率、logsumexp和exp,等价于直接计算概率,\n", 451 | " # 好处是数值更加稳定\n", 452 | " self.Y[i, :] -= logsumexp(self.Y[i, :])\n", 453 | " self.Y[i, :] = np.exp(self.Y[i, :])\n", 454 | " \n", 455 | " # M步骤\n", 456 | " def M_step(self, X):\n", 457 | " # 根据估计的簇概率分布更新先验概率分布\n", 458 | " self.pi = self.Y.sum(axis=0) / self.N\n", 459 | " self.pi /= self.pi.sum()\n", 460 | " # 更新每个朴素贝叶斯模型的参数\n", 461 | " for i in range(self.K):\n", 462 | " self.P[i] = (self.Y[:, i:i+1] * X).sum(axis=0) / \\\n", 463 | " (self.Y[:, i] * X.sum(axis=1)).sum()\n", 464 | " # 防止除0\n", 465 | " self.P += 1e-10\n", 466 | " self.P /= self.P.sum(axis=1, keepdims=True)\n", 467 | " \n", 468 | " # 计算对数似然,用于判断迭代终止\n", 469 | " def log_likelihood(self, X):\n", 470 | " ll = 0\n", 471 | " for x in X:\n", 472 | " # 使用对数概率和logsumexp防止精度溢出\n", 473 | " logp = []\n", 474 | " for i in range(self.K):\n", 475 | " logp.append(np.log(self.pi[i]) + (np.log(self.P[i]) *\\\n", 476 | " x).sum())\n", 477 | " ll += logsumexp(logp)\n", 478 | " return ll / len(X)\n", 479 | " \n", 480 | " # 无监督朴素贝叶斯的迭代循环\n", 481 | " def fit(self, X):\n", 482 | " self.N = len(X)\n", 483 | " self.Y = np.zeros((self.N, self.K))\n", 484 | " ll = self.log_likelihood(X)\n", 485 | " print(f'初始化log-likelihood = {ll:.4f}')\n", 486 | " print('开始迭代')\n", 487 | " for i in range(self.max_iter):\n", 488 | " self.E_step(X)\n", 489 | " self.M_step(X)\n", 490 | " new_ll = self.log_likelihood(X)\n", 491 | " print(f'第{i}步, log-likelihood = {new_ll:.4f}')\n", 492 | " if new_ll - ll < 1e-4:\n", 493 | " print('log-likelihood不再变化,退出程序')\n", 494 | " break\n", 495 | " else:\n", 496 | " ll = new_ll\n", 497 | " \n", 498 | " def transform(self, X):\n", 499 | " assert hasattr(self, 'Y') and len(self.Y) == len(X)\n", 500 | " return np.argmax(self.Y, axis=1)\n", 501 | " \n", 502 | " def fit_transform(self, X):\n", 503 | " 
self.fit(X)\n", 504 | " return self.transform(X)" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": 15, 510 | "id": "57113e8b", 511 | "metadata": { 512 | "scrolled": false 513 | }, 514 | "outputs": [ 515 | { 516 | "name": "stdout", 517 | "output_type": "stream", 518 | "text": [ 519 | "初始化log-likelihood = -779.0355\n", 520 | "开始迭代\n", 521 | "第0步, log-likelihood = -589.0541\n", 522 | "...\n", 523 | "第10步, log-likelihood = -571.5391\n", 524 | "...\n", 525 | "第20步, log-likelihood = -567.4288\n", 526 | "...\n", 527 | "第30步, log-likelihood = -567.3908\n", 528 | "...\n", 529 | "第38步, log-likelihood = -567.3578\n", 530 | "log-likelihood不再变化,退出程序\n", 531 | "[1 2 1 ... 1 1 1]\n", 532 | "计算机类:\t{ 0: 307(0.08), 1: 3437(0.89), 2: 98(0.03), }\n", 533 | "艺术传媒类:\t{ 0: 59(0.03), 1: 156(0.07), 2: 2085(0.91), }\n", 534 | "经管类:\t{ 0: 2252(0.91), 1: 79(0.03), 2: 154(0.06), }\n" 535 | ] 536 | } 537 | ], 538 | "source": [ 539 | "# 根据朴素贝叶斯模型,需要统计出每个数据点包含的词表中每个词的数目\n", 540 | "train_C = np.zeros((len(train_X), vocab_size))\n", 541 | "for i, data in enumerate(train_X):\n", 542 | " for token_id in data:\n", 543 | " train_C[i, token_id] += 1\n", 544 | "\n", 545 | "unb = UnsupervisedNaiveBayes(K, dim=vocab_size, max_iter=100)\n", 546 | "clusters = unb.fit_transform(train_C)\n", 547 | "print(clusters)\n", 548 | "show_clusters(clusters, K)" 549 | ] 550 | } 551 | ], 552 | "metadata": { 553 | "kernelspec": { 554 | "display_name": "Python 3 (ipykernel)", 555 | "language": "python", 556 | "name": "python3" 557 | }, 558 | "language_info": { 559 | "codemirror_mode": { 560 | "name": "ipython", 561 | "version": 3 562 | }, 563 | "file_extension": ".py", 564 | "mimetype": "text/x-python", 565 | "name": "python", 566 | "nbconvert_exporter": "python", 567 | "pygments_lexer": "ipython3", 568 | "version": "3.8.17" 569 | } 570 | }, 571 | "nbformat": 4, 572 | "nbformat_minor": 5 573 | } 574 | -------------------------------------------------------------------------------- /第8章-预训练语言模型.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "31d093db", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "下面用代码展示BERT的基本用法。\n", 10 | "\n", 11 | "下面展示给定输入为“The capital of China is \\[MASK\\]”的情况下,模型会如何预测被掩码的词。这里输出概率最高的5个词。" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "id": "026be16d", 18 | "metadata": { 19 | "scrolled": true 20 | }, 21 | "outputs": [ 22 | { 23 | "name": "stdout", 24 | "output_type": "stream", 25 | "text": [ 26 | "The capital of China is beijing.\n", 27 | "The capital of China is nanjing.\n", 28 | "The capital of China is shanghai.\n", 29 | "The capital of China is guangzhou.\n", 30 | "The capital of China is shenzhen.\n" 31 | ] 32 | } 33 | ], 34 | "source": [ 35 | "\"\"\"\n", 36 | "代码来源于GitHub项目huggingface/transformers\n", 37 | "(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License(见附录))\n", 38 | "\"\"\"\n", 39 | "from transformers import BertTokenizer, BertForMaskedLM\n", 40 | "from torch.nn import functional as F\n", 41 | "import torch\n", 42 | "\n", 43 | "# 选用bert-base-uncased模型进行预测,使用相应的分词器\n", 44 | "tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n", 45 | "model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)\n", 46 | "\n", 47 | "# 准备输入句子“The capital of China is [MASK].”\n", 48 | "text = 'The capital of China is ' + tokenizer.mask_token + '.'\n", 49 | "# 将输入句子编码为PyTorch张量\n", 50 | "inputs = tokenizer.encode_plus(text, 
return_tensors='pt')\n", 51 | "# 定位[MASK]所在的位置\n", 52 | "mask_index = torch.where(inputs['input_ids'][0] == tokenizer.mask_token_id)\n", 53 | "output = model(**inputs)\n", 54 | "logits = output.logits\n", 55 | "# 从[MASK]所在位置的输出分布中,选择概率最高的5个并打印\n", 56 | "distribution = F.softmax(logits, dim=-1)\n", 57 | "mask_word = distribution[0, mask_index, :]\n", 58 | "top_5 = torch.topk(mask_word, 5, dim=1)[1][0]\n", 59 | "for token in top_5:\n", 60 | " word = tokenizer.decode([token])\n", 61 | " new_sentence = text.replace(tokenizer.mask_token, word)\n", 62 | " print(new_sentence)" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "id": "a03f59ac", 68 | "metadata": {}, 69 | "source": [ 70 | "\n", 71 | "下面展示如何微调BERT用于文本分类。这里使用第4章的Books数据集。" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 1, 77 | "id": "895c0795", 78 | "metadata": { 79 | "scrolled": false 80 | }, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | "train size = 8627 , test size = 2157\n", 87 | "{0: '计算机类', 1: '艺术传媒类', 2: '经管类'}\n", 88 | "8627 2157\n" 89 | ] 90 | }, 91 | { 92 | "name": "stderr", 93 | "output_type": "stream", 94 | "text": [ 95 | "100%|██████████| 100/100 [00:00<00:00, 8225.09it/s]\n", 96 | "100%|██████████| 100/100 [00:00<00:00, 4294.85it/s]\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "\"\"\"\n", 102 | "代码来源于GitHub项目huggingface/transformers\n", 103 | "(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License(见附录))\n", 104 | "\"\"\"\n", 105 | "import sys\n", 106 | "from tqdm import tqdm\n", 107 | "\n", 108 | "# 导入前面实现的Books数据集\n", 109 | "sys.path.append('./code')\n", 110 | "from utils import BooksDataset\n", 111 | "\n", 112 | "dataset = BooksDataset()\n", 113 | "# 打印出类和标签ID\n", 114 | "print(dataset.id2label)\n", 115 | "print(len(dataset.train_data), len(dataset.test_data))\n", 116 | "\n", 117 | "# 接下来使用分词器进行分词,并采样100条数据用于训练和测试\n", 118 | "# 为防止运行时间过长,此处为了在CPU上顺利运行,只选用100条数据。\n", 119 | "from transformers import AutoTokenizer\n", 120 | "tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')\n", 121 | "\n", 122 | "def tokenize_function(text):\n", 123 | " return tokenizer(text, padding='max_length', truncation=True)\n", 124 | "\n", 125 | "def tokenize(raw_data):\n", 126 | " dataset = []\n", 127 | " for data in tqdm(raw_data):\n", 128 | " tokens = tokenize_function(data['en_book'])\n", 129 | " tokens['label'] = data['label']\n", 130 | " dataset.append(tokens)\n", 131 | " return dataset\n", 132 | " \n", 133 | "small_train_dataset = tokenize(dataset.train_data[:100])\n", 134 | "small_eval_dataset = tokenize(dataset.test_data[:100])" 135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "id": "39353b6d", 141 | "metadata": { 142 | "scrolled": false 143 | }, 144 | "outputs": [], 145 | "source": [ 146 | "# 加载bert-base-cased这个预训练模型,并指定序列分类作为模型输出头,\n", 147 | "# 分类标签数为3类\n", 148 | "from transformers import AutoModelForSequenceClassification\n", 149 | "model = AutoModelForSequenceClassification.from_pretrained(\\\n", 150 | " 'bert-base-cased', num_labels=len(dataset.id2label))\n", 151 | "\n", 152 | "# 为了在训练过程中及时地监控模型性能,定义评估函数,计算分类准确率\n", 153 | "import numpy as np\n", 154 | "# 可以使用如下指令安装evaluate\n", 155 | "# conda install evaluate\n", 156 | "import evaluate\n", 157 | "\n", 158 | "metric = evaluate.load('accuracy')\n", 159 | "\n", 160 | "def compute_metrics(eval_pred):\n", 161 | " logits, labels = eval_pred\n", 162 | " predictions = np.argmax(logits, axis=-1)\n", 163 | " return 
metric.compute(predictions=predictions, references=labels)\n", 164 | "\n", 165 | "# 通过TrainingArguments这个类来构造训练所需的参数\n", 166 | "# evaluation_strategy='epoch'指定每个epoch结束的时候计算评价指标\n", 167 | "from transformers import TrainingArguments, Trainer\n", 168 | "training_args = TrainingArguments(output_dir='test_trainer',\\\n", 169 | " evaluation_strategy='epoch')\n", 170 | "\n", 171 | "# transformers这个库自带的Trainer类封装了大量模型训练的细节,\n", 172 | "# 例如数据转换、性能评测、保存模型等\n", 173 | "# 可以调用Trainer类来非常方便地调用标准的微调流程,默认训练3个epoch\n", 174 | "trainer = Trainer(\n", 175 | " model=model,\n", 176 | " args=training_args,\n", 177 | " train_dataset=small_train_dataset,\n", 178 | " eval_dataset=small_eval_dataset,\n", 179 | " compute_metrics=compute_metrics,\n", 180 | ")" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 3, 186 | "id": "337b979a", 187 | "metadata": {}, 188 | "outputs": [ 189 | { 190 | "data": { 191 | "text/html": [ 192 | "\n", 193 | "
\n", 194 | " \n", 195 | " \n", 196 | " [39/39 12:20, Epoch 3/3]\n", 197 | "
\n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | "
Epoch | Training Loss | Validation Loss | Accuracy
1 | No log | 0.962486 | 0.520000
2 | No log | 0.852982 | 0.670000
3 | No log | 0.816384 | 0.680000

" 228 | ], 229 | "text/plain": [ 230 | "" 231 | ] 232 | }, 233 | "metadata": {}, 234 | "output_type": "display_data" 235 | } 236 | ], 237 | "source": [ 238 | "# 默认的微调流程使用wandb记录训练log,访问wandb官网了解如何使用\n", 239 | "# 此处通过WANDB_DISABLED环境变量禁用wandb,减少不必要的网络访问\n", 240 | "import os\n", 241 | "os.environ[\"WANDB_DISABLED\"] = \"true\"\n", 242 | "trainer.train()" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "id": "fb019379", 248 | "metadata": {}, 249 | "source": [ 250 | "以上代码通过调用Trainer类来实现简单的微调流程,接下来展示如何自定义微调流程。" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 11, 256 | "id": "7c8f4428", 257 | "metadata": { 258 | "scrolled": false 259 | }, 260 | "outputs": [ 261 | { 262 | "data": { 263 | "application/vnd.jupyter.widget-view+json": { 264 | "model_id": "2f7c36a2ba7e40bfb645c9f49e543810", 265 | "version_major": 2, 266 | "version_minor": 0 267 | }, 268 | "text/plain": [] 269 | }, 270 | "metadata": {}, 271 | "output_type": "display_data" 272 | } 273 | ], 274 | "source": [ 275 | "import torch\n", 276 | "\n", 277 | "del model\n", 278 | "del trainer\n", 279 | "# 如果你使用了GPU,清空GPU缓存\n", 280 | "torch.cuda.empty_cache()\n", 281 | "\n", 282 | "# 使用DataLoader类为模型提供数据\n", 283 | "from torch.utils.data import DataLoader\n", 284 | "\n", 285 | "# 将Python列表转为PyTorch张量\n", 286 | "def collate(batch):\n", 287 | " input_ids, token_type_ids, attention_mask, labels = [], [], [], []\n", 288 | " for d in batch:\n", 289 | " input_ids.append(d['input_ids'])\n", 290 | " token_type_ids.append(d['token_type_ids'])\n", 291 | " attention_mask.append(d['attention_mask'])\n", 292 | " labels.append(d['label'])\n", 293 | " input_ids = torch.tensor(input_ids)\n", 294 | " token_type_ids = torch.tensor(token_type_ids)\n", 295 | " attention_mask = torch.tensor(attention_mask)\n", 296 | " labels = torch.tensor(labels)\n", 297 | " return {'input_ids': input_ids, 'token_type_ids': token_type_ids,\\\n", 298 | " 'attention_mask': attention_mask, 'labels': labels}\n", 299 | "\n", 300 | "train_dataloader = DataLoader(small_train_dataset, shuffle=True,\\\n", 301 | " batch_size=8, collate_fn=collate)\n", 302 | "eval_dataloader = DataLoader(small_eval_dataset, batch_size=8,\\\n", 303 | " collate_fn=collate)\n", 304 | "\n", 305 | "# 载入模型,准备优化器(用于优化参数),以及scheduler\n", 306 | "# (在训练时调整学习率,以达到更好的微调效果)\n", 307 | "from transformers import AutoModelForSequenceClassification\n", 308 | "model = AutoModelForSequenceClassification.from_pretrained(\\\n", 309 | " \"bert-base-cased\", num_labels=len(dataset.id2label))\n", 310 | "\n", 311 | "from torch.optim import AdamW\n", 312 | "optimizer = AdamW(model.parameters(), lr=5e-5)\n", 313 | "\n", 314 | "from transformers import get_scheduler\n", 315 | "num_epochs = 3\n", 316 | "num_training_steps = num_epochs * len(train_dataloader)\n", 317 | "lr_scheduler = get_scheduler(\n", 318 | " name=\"linear\", optimizer=optimizer, num_warmup_steps=0,\\\n", 319 | " num_training_steps=num_training_steps\n", 320 | ")\n", 321 | "\n", 322 | "import torch\n", 323 | "# 自动判断是否有GPU可以使用,如果可用,将model移动到GPU显存中\n", 324 | "device = torch.device(\"cuda\") if torch.cuda.is_available()\\\n", 325 | " else torch.device(\"cpu\")\n", 326 | "model.to(device)\n", 327 | "\n", 328 | "# 训练流程\n", 329 | "from tqdm.auto import tqdm\n", 330 | "progress_bar = tqdm(range(num_training_steps))\n", 331 | "\n", 332 | "for epoch in range(num_epochs):\n", 333 | " # 在每个epoch开始时将model的is_training设为True,\n", 334 | " # 该变量将会影响到dropout等层的行为(训练时开启dropout)\n", 335 | " model.train()\n", 336 | " for batch in train_dataloader:\n", 
337 | " # 如果GPU可用,这一步将把数据转移到GPU显存中\n", 338 | " batch = {k: v.to(device) for k, v in batch.items()}\n", 339 | " outputs = model(**batch)\n", 340 | " loss = outputs.loss\n", 341 | " loss.backward()\n", 342 | "\n", 343 | " optimizer.step()\n", 344 | " lr_scheduler.step()\n", 345 | " # 更新参数之后清除上一步的梯度\n", 346 | " optimizer.zero_grad()\n", 347 | " progress_bar.update(1)\n", 348 | "progress_bar.close()\n", 349 | "import evaluate\n", 350 | "\n", 351 | "# 训练结束时对测试集进行评估,得到模型分数\n", 352 | "model.eval()\n", 353 | "metric = evaluate.load(\"accuracy\")\n", 354 | "for batch in eval_dataloader:\n", 355 | " batch = {k: v.to(device) for k, v in batch.items()}\n", 356 | " with torch.no_grad():\n", 357 | " outputs = model(**batch)\n", 358 | "\n", 359 | " logits = outputs.logits\n", 360 | " predictions = torch.argmax(logits, dim=-1)\n", 361 | " metric.add_batch(predictions=predictions, references=batch[\"labels\"])\n", 362 | "acc = metric.compute()" 363 | ] 364 | }, 365 | { 366 | "cell_type": "markdown", 367 | "id": "f871e435", 368 | "metadata": {}, 369 | "source": [ 370 | "\n", 371 | "下面的代码演示了如何使用GPT-2进行训练。" 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 1, 377 | "id": "d31da145", 378 | "metadata": {}, 379 | "outputs": [ 380 | { 381 | "name": "stdout", 382 | "output_type": "stream", 383 | "text": [ 384 | "19206 4802\n", 385 | "['the', 'Ġlittle', 'Ġprince', 'Ġ', 'ĊĊ', 'Ċ', 'Ċ', 'anto', 'ine', 'Ġde']\n" 386 | ] 387 | } 388 | ], 389 | "source": [ 390 | "\"\"\"\n", 391 | "代码来源于GitHub项目huggingface/transformers\n", 392 | "(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License(见附录))\n", 393 | "\"\"\"\n", 394 | "import sys\n", 395 | "\n", 396 | "# 导入第3章使用的《小王子》数据集\n", 397 | "sys.path.append('../code')\n", 398 | "from utils import TheLittlePrinceDataset\n", 399 | "\n", 400 | "full_text = TheLittlePrinceDataset(tokenize=False).text\n", 401 | "# 接下来载入GPT2模型的分词器并完成分词。\n", 402 | "from transformers import AutoTokenizer\n", 403 | "tokenizer = AutoTokenizer.from_pretrained('gpt2')\n", 404 | "\n", 405 | "full_tokens = tokenizer.tokenize(full_text.lower())\n", 406 | "train_size = int(len(full_tokens) * 0.8)\n", 407 | "train_tokens = full_tokens[:train_size]\n", 408 | "test_tokens = full_tokens[train_size:]\n", 409 | "print(len(train_tokens), len(test_tokens))\n", 410 | "print(train_tokens[:10])" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 6, 416 | "id": "b301b7e6", 417 | "metadata": { 418 | "scrolled": false 419 | }, 420 | "outputs": [], 421 | "source": [ 422 | "import torch\n", 423 | "from torch.utils.data import TensorDataset\n", 424 | "\n", 425 | "# 将文本根据block_size分成小块\n", 426 | "block_size = 128\n", 427 | "\n", 428 | "def split_blocks(tokens):\n", 429 | " token_ids = []\n", 430 | " for i in range(len(tokens) // block_size):\n", 431 | " _tokens = tokens[i*block_size:(i+1)*block_size]\n", 432 | " if len(_tokens) < block_size:\n", 433 | " _tokens += [tokenizer.pad_token] * (block_size - len(_tokens))\n", 434 | " _token_ids = tokenizer.convert_tokens_to_ids(_tokens)\n", 435 | " token_ids.append(_token_ids)\n", 436 | " return token_ids\n", 437 | "\n", 438 | "train_dataset = split_blocks(train_tokens)\n", 439 | "test_dataset = split_blocks(test_tokens)" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": 7, 445 | "id": "94647ae7", 446 | "metadata": { 447 | "scrolled": false 448 | }, 449 | "outputs": [ 450 | { 451 | "data": { 452 | "text/html": [ 453 | "\n", 454 | "

\n", 455 | " \n", 456 | " \n", 457 | " [57/57 04:51, Epoch 3/3]\n", 458 | "
\n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | "
EpochTraining LossValidation Loss
1No log3.240260
2No log3.152012
3No log3.125814
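The validation loss in the last row above and the perplexity printed by the next code cell are two views of the same quantity: perplexity is simply the exponential of the average cross-entropy loss. A minimal check (the value 3.125814 is copied from the table; `math.exp` mirrors the `math.exp(eval_results['eval_loss'])` call used later in the notebook):

```python
import math

# Validation loss of the final epoch, taken from the table above.
eval_loss = 3.125814

# Perplexity = exp(cross-entropy loss); this reproduces the "Perplexity: 22.78" output below.
print(f"Perplexity: {math.exp(eval_loss):.2f}")
```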

" 485 | ], 486 | "text/plain": [ 487 | "" 488 | ] 489 | }, 490 | "metadata": {}, 491 | "output_type": "display_data" 492 | }, 493 | { 494 | "data": { 495 | "text/html": [ 496 | "\n", 497 | "

\n", 498 | " \n", 499 | " \n", 500 | " [5/5 00:05]\n", 501 | "
\n", 502 | " " 503 | ], 504 | "text/plain": [ 505 | "" 506 | ] 507 | }, 508 | "metadata": {}, 509 | "output_type": "display_data" 510 | }, 511 | { 512 | "name": "stdout", 513 | "output_type": "stream", 514 | "text": [ 515 | "Perplexity: 22.78\n" 516 | ] 517 | } 518 | ], 519 | "source": [ 520 | "# 创建一个DataCollator,用于在训练时把分词的结果转化为模型可以训练的张量\n", 521 | "# 注意此时微调的任务是语言模型,而不是掩码语言模型\n", 522 | "from transformers import DataCollatorForLanguageModeling\n", 523 | "\n", 524 | "tokenizer.pad_token = tokenizer.eos_token\n", 525 | "data_collator = DataCollatorForLanguageModeling(tokenizer=\\\n", 526 | " tokenizer, mlm=False)\n", 527 | "\n", 528 | "# 导入模型,准备训练参数,调用Trainer类完成训练\n", 529 | "from transformers import AutoModelForCausalLM, TrainingArguments, Trainer\n", 530 | "\n", 531 | "model = AutoModelForCausalLM.from_pretrained(\"gpt2\")\n", 532 | "\n", 533 | "training_args = TrainingArguments(\n", 534 | " output_dir=\"test_trainer\",\n", 535 | " evaluation_strategy=\"epoch\",\n", 536 | " learning_rate=2e-5,\n", 537 | " weight_decay=0.01,\n", 538 | ")\n", 539 | "\n", 540 | "trainer = Trainer(\n", 541 | " model=model,\n", 542 | " args=training_args,\n", 543 | " train_dataset=train_dataset,\n", 544 | " eval_dataset=test_dataset,\n", 545 | " data_collator=data_collator,\n", 546 | ")\n", 547 | "\n", 548 | "trainer.train()\n", 549 | "\n", 550 | "# 在测试集上测试得到困惑度\n", 551 | "import math\n", 552 | "eval_results = trainer.evaluate()\n", 553 | "print(f\"Perplexity: {math.exp(eval_results['eval_loss']):.2f}\")" 554 | ] 555 | }, 556 | { 557 | "cell_type": "markdown", 558 | "id": "dd0dbdb9", 559 | "metadata": {}, 560 | "source": [ 561 | "这里基于HuggingFace来展示如何使用GPT-2模型生成文本。" 562 | ] 563 | }, 564 | { 565 | "cell_type": "code", 566 | "execution_count": 5, 567 | "id": "399168ba", 568 | "metadata": { 569 | "scrolled": false 570 | }, 571 | "outputs": [ 572 | { 573 | "name": "stdout", 574 | "output_type": "stream", 575 | "text": [ 576 | "Output:\n", 577 | "----------------------------------------------------------------------------------------------------\n", 578 | "I enjoy learning with this book. I have been reading it for a while now and I am very happy with it. 
I have been reading it for a while now and I am very happy with it.\n", 579 | "\n", 580 | "I have been reading it for a\n", 581 | "Output:\n", 582 | "----------------------------------------------------------------------------------------------------\n", 583 | "I enjoy learning with this book, and I hope you enjoy reading it as much as I do.\n", 584 | "\n", 585 | "I hope you enjoy reading this book, and I hope you enjoy reading it as much as I do.\n", 586 | "\n", 587 | "I hope you enjoy reading\n", 588 | "Output:\n", 589 | "----------------------------------------------------------------------------------------------------\n", 590 | "0: I enjoy learning with this book, and I hope you enjoy reading it as much as I do.\n", 591 | "\n", 592 | "If you have any questions or comments, feel free to leave them in the comments below.\n", 593 | "1: I enjoy learning with this book, and I hope you enjoy reading it as much as I do.\n", 594 | "\n", 595 | "If you have any questions or comments, please feel free to leave them in the comments below.\n", 596 | "2: I enjoy learning with this book, and I hope you enjoy reading it as much as I do.\n", 597 | "\n", 598 | "If you have any questions or comments, feel free to leave them in the comment section below.\n", 599 | "3: I enjoy learning with this book, and I hope you enjoy reading it as much as I do.\n", 600 | "\n", 601 | "If you have any questions or comments, feel free to leave them in the comments section below.\n", 602 | "4: I enjoy learning with this book, and I hope you enjoy reading it as much as I do.\n", 603 | "\n", 604 | "If you have any questions or comments, feel free to leave them in the comments below!\n" 605 | ] 606 | } 607 | ], 608 | "source": [ 609 | "\"\"\"\n", 610 | "代码来源于GitHub项目huggingface/transformers\n", 611 | "(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License(见附录))\n", 612 | "\"\"\"\n", 613 | "import torch\n", 614 | "from transformers import GPT2LMHeadModel, GPT2Tokenizer\n", 615 | "\n", 616 | "tokenizer = GPT2Tokenizer.from_pretrained('gpt2')\n", 617 | "model = GPT2LMHeadModel.from_pretrained('gpt2',\\\n", 618 | " pad_token_id=tokenizer.eos_token_id)\n", 619 | "# 输入文本\n", 620 | "input_ids = tokenizer.encode('I enjoy learning with this book',\\\n", 621 | " return_tensors='pt')\n", 622 | "\n", 623 | "# 输出文本\n", 624 | "greedy_output = model.generate(input_ids, max_length=50)\n", 625 | "print(\"Output:\\n\" + 100 * '-')\n", 626 | "print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))\n", 627 | "\n", 628 | "# 通过束搜索来生成句子,一旦生成足够多的句子即停止搜索\n", 629 | "beam_output = model.generate(\n", 630 | " input_ids, \n", 631 | " max_length=50, \n", 632 | " num_beams=5, \n", 633 | " early_stopping=True\n", 634 | ")\n", 635 | "\n", 636 | "print(\"Output:\\n\" + 100 * '-')\n", 637 | "print(tokenizer.decode(beam_output[0], skip_special_tokens=True))\n", 638 | "\n", 639 | "# 输出多个句子\n", 640 | "beam_outputs = model.generate(\n", 641 | " input_ids, \n", 642 | " max_length=50, \n", 643 | " num_beams=5, \n", 644 | " no_repeat_ngram_size=2, \n", 645 | " num_return_sequences=5, \n", 646 | " early_stopping=True\n", 647 | ")\n", 648 | "\n", 649 | "print(\"Output:\\n\" + 100 * '-')\n", 650 | "for i, beam_output in enumerate(beam_outputs):\n", 651 | " print(\"{}: {}\".format(i, tokenizer.decode(beam_output,\\\n", 652 | " skip_special_tokens=True)))" 653 | ] 654 | }, 655 | { 656 | "cell_type": "markdown", 657 | "id": "00d323e9", 658 | "metadata": {}, 659 | "source": [ 660 | 
"HuggingFace中集成了许多预训练语言模型。你可以直接通过具体的接口调用某一个预训练语言模型,但这种方式相对复杂,需要对具体模型和接口有所了解。或者,你也可以通过pipeline模块黑箱地使用这些模型,pipeline模块会根据指定的任务自动分配一个合适的预训练语言模型,你也可以通过参数指定一个预训练语言模型。下面演示pipeline模块处理不同任务的代码,你也可以在HuggingFace官网上了解HuggingFace支持哪些模型。" 661 | ] 662 | }, 663 | { 664 | "cell_type": "markdown", 665 | "id": "5fcc3533", 666 | "metadata": {}, 667 | "source": [ 668 | "\n", 669 | "\n", 670 | "下面以情感分类为例演示文本分类任务上预训练语言模型的使用。" 671 | ] 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": 6, 676 | "id": "62235e0e", 677 | "metadata": { 678 | "scrolled": false 679 | }, 680 | "outputs": [ 681 | { 682 | "name": "stderr", 683 | "output_type": "stream", 684 | "text": [ 685 | "No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english...\n", 686 | "No model was supplied, defaulted to facebook/bart-large-mnli..." 687 | ] 688 | }, 689 | { 690 | "name": "stdout", 691 | "output_type": "stream", 692 | "text": [ 693 | "[{'label': 'POSITIVE', 'score': 0.9998708963394165}]\n", 694 | "[{'label': 'POSITIVE', 'score': 0.9998835325241089}, {'label': 'NEGATIVE', 'score': 0.9994825124740601}, {'label': 'POSITIVE', 'score': 0.9998630285263062}]\n", 695 | "[{'sequence': 'A helicopter is flying in the sky', 'labels': ['machine', 'animal'], 'scores': [0.9938627481460571, 0.006137245334684849]}, {'sequence': 'A bird is flying in the sky', 'labels': ['animal', 'machine'], 'scores': [0.9987970590591431, 0.001202935236506164]}]\n" 696 | ] 697 | } 698 | ], 699 | "source": [ 700 | "\"\"\"\n", 701 | "代码来源于GitHub项目huggingface/transformers\n", 702 | "(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License(见附录))\n", 703 | "\"\"\"\n", 704 | "from transformers import pipeline\n", 705 | "\n", 706 | "clf = pipeline('sentiment-analysis')\n", 707 | "print(clf('Haha, today is a nice day!'))\n", 708 | "\n", 709 | "print(clf(['The food is amazing', 'The assignment is weigh too hard',\\\n", 710 | " 'NLP is so much fun']))\n", 711 | "\n", 712 | "clf = pipeline('zero-shot-classification')\n", 713 | "print(clf(sequences=['A helicopter is flying in the sky',\\\n", 714 | " 'A bird is flying in the sky'],\n", 715 | " candidate_labels=['animal', 'machine']))" 716 | ] 717 | }, 718 | { 719 | "cell_type": "markdown", 720 | "id": "3008cc7b", 721 | "metadata": {}, 722 | "source": [ 723 | "\n", 724 | "下面演示两种文本生成任务上预训练语言模型的使用。" 725 | ] 726 | }, 727 | { 728 | "cell_type": "code", 729 | "execution_count": 7, 730 | "id": "71d72551", 731 | "metadata": { 732 | "scrolled": false 733 | }, 734 | "outputs": [ 735 | { 736 | "name": "stderr", 737 | "output_type": "stream", 738 | "text": [ 739 | "No model was supplied, defaulted to gpt2...\n", 740 | "No model was supplied, defaulted to distilroberta-base...\n" 741 | ] 742 | }, 743 | { 744 | "name": "stdout", 745 | "output_type": "stream", 746 | "text": [ 747 | "[{'generated_text': \"In this course, we will teach you how to get started with the code of a given app. 
This way, you will build new apps that fit your needs but are still well behaved and understandable (unless you're already using Swift to understand the language\"}]\n", 748 | "[{'score': 0.1961982101202011, 'token': 30412, 'token_str': ' mathematical', 'sequence': 'This course will teach you all about mathematical models.'}, {'score': 0.040527306497097015, 'token': 38163, 'token_str': ' computational', 'sequence': 'This course will teach you all about computational models.'}, {'score': 0.033017922192811966, 'token': 27930, 'token_str': ' predictive', 'sequence': 'This course will teach you all about predictive models.'}, {'score': 0.0319414846599102, 'token': 745, 'token_str': ' building', 'sequence': 'This course will teach you all about building models.'}, {'score': 0.024523010477423668, 'token': 3034, 'token_str': ' computer', 'sequence': 'This course will teach you all about computer models.'}]\n" 749 | ] 750 | } 751 | ], 752 | "source": [ 753 | "\"\"\"\n", 754 | "代码来源于GitHub项目huggingface/transformers\n", 755 | "(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License(见附录))\n", 756 | "\"\"\"\n", 757 | "generator = pipeline('text-generation')\n", 758 | "print(generator('In this course, we will teach you how to'))\n", 759 | "\n", 760 | "unmasker = pipeline('fill-mask')\n", 761 | "print(unmasker('This course will teach you all about models.'))" 762 | ] 763 | }, 764 | { 765 | "cell_type": "markdown", 766 | "id": "f61e334c", 767 | "metadata": {}, 768 | "source": [ 769 | "\n", 770 | "\n", 771 | "输入任务“question-answering”,pipeline会自动返回默认的问答预训练语言模型“distilbert-base-cased-distilled-squad”,输入问题和上下文,就能得到答案。" 772 | ] 773 | }, 774 | { 775 | "cell_type": "code", 776 | "execution_count": 8, 777 | "id": "3f943e42", 778 | "metadata": { 779 | "scrolled": false 780 | }, 781 | "outputs": [ 782 | { 783 | "name": "stderr", 784 | "output_type": "stream", 785 | "text": [ 786 | "No model was supplied, defaulted to distilbert-base-cased-distilled-squad...\n" 787 | ] 788 | }, 789 | { 790 | "name": "stdout", 791 | "output_type": "stream", 792 | "text": [ 793 | "{'score': 0.7787413597106934, 'start': 34, 'end': 63, 'answer': 'Shanghai Jiao Tong University'}\n" 794 | ] 795 | } 796 | ], 797 | "source": [ 798 | "\"\"\"\n", 799 | "代码来源于GitHub项目huggingface/transformers\n", 800 | "(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License(见附录))\n", 801 | "\"\"\"\n", 802 | "question_answerer = pipeline('question-answering')\n", 803 | "print(question_answerer(question='Where do I graduate from?', \n", 804 | " context=\"I received my bachlor\\'s degree at Shanghai\"+\\\n", 805 | " \"Jiao Tong University (SJTU).\"))" 806 | ] 807 | }, 808 | { 809 | "cell_type": "markdown", 810 | "id": "68098304", 811 | "metadata": {}, 812 | "source": [ 813 | "\n", 814 | "\n", 815 | "输入任务“summarization”,pipeline会自动返回默认的预训练语言模型“sshleifer/distilbart-cnn-12-6”,输入一段文本,就能得到摘要。" 816 | ] 817 | }, 818 | { 819 | "cell_type": "code", 820 | "execution_count": 9, 821 | "id": "1977a486", 822 | "metadata": { 823 | "scrolled": false 824 | }, 825 | "outputs": [ 826 | { 827 | "name": "stderr", 828 | "output_type": "stream", 829 | "text": [ 830 | "No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6...\n" 831 | ] 832 | }, 833 | { 834 | "name": "stdout", 835 | "output_type": "stream", 836 | "text": [ 837 | "[{'summary_text': \" The 2022 Winter Olympics was held in Beijing, China, and surrounding areas . It was the 24th edition of the Winter Olympic Games . 
The Games featured a record 109 events across 15 disciplines, with big air freestyle skiing and women's monobob making their Olympic debuts as medal events . Norway won 37 medals, of which 16 were gold, setting a new record for the largest number of gold medals won at a single Winter Olympics .\"}]\n" 838 | ] 839 | } 840 | ], 841 | "source": [ 842 | "\"\"\"\n", 843 | "代码来源于GitHub项目huggingface/transformers\n", 844 | "(Copyright (c) 2020, The HuggingFace Team, Apache-2.0 License(见附录))\n", 845 | "\"\"\"\n", 846 | "summarizer = pipeline('summarization')\n", 847 | "print(summarizer(\n", 848 | " \"\"\"\n", 849 | " The 2022 Winter Olympics (2022年冬季奥林匹克运动会), officially \n", 850 | " called the XXIV Olympic Winter Games (Chinese: 第二十四届冬季奥\n", 851 | " 林匹克运动会; pinyin: Dì Èrshísì Jiè Dōngjì Àolínpǐkè Yùndònghuì) \n", 852 | " and commonly known as Beijing 2022 (北京2022), was an international \n", 853 | " winter multi-sport event held from 4 to 20 February 2022 in Beijing, \n", 854 | " China, and surrounding areas with competition in selected events \n", 855 | " beginning 2 February 2022.[1] It was the 24th edition of the Winter \n", 856 | " Olympic Games. Beijing was selected as host city in 2015 at the \n", 857 | " 128th IOC Session in Kuala Lumpur, Malaysia, marking its second \n", 858 | " time hosting the Olympics, and the last of three consecutive \n", 859 | " Olympics hosted in East Asia following the 2018 Winter Olympics \n", 860 | " in Pyeongchang County, South Korea, and the 2020 Summer Olympics \n", 861 | " in Tokyo, Japan. Having previously hosted the 2008 Summer Olympics, \n", 862 | " Beijing became the first city to have hosted both the Summer and \n", 863 | " Winter Olympics. The venues for the Games were concentrated around \n", 864 | " Beijing, its suburb Yanqing District, and Zhangjiakou, with some \n", 865 | " events (including the ceremonies and curling) repurposing venues \n", 866 | " originally built for Beijing 2008 (such as Beijing National \n", 867 | " Stadium and the Beijing National Aquatics Centre). The Games \n", 868 | " featured a record 109 events across 15 disciplines, with big air \n", 869 | " freestyle skiing and women's monobob making their Olympic debuts \n", 870 | " as medal events, as well as several new mixed competitions. \n", 871 | " A total of 2,871 athletes representing 91 teams competed in the \n", 872 | " Games, with Haiti and Saudi Arabia making their Winter Olympic \n", 873 | " debut. Norway finished at the top of the medal table \n", 874 | " for the second successive Winter Olympics, winning a total of 37 \n", 875 | " medals, of which 16 were gold, setting a new record for the \n", 876 | " largest number of gold medals won at a single Winter Olympics. 
\n", 877 | " The host nation China finished third with nine gold medals and \n", 878 | " also eleventh place by total medals won, marking its most \n", 879 | " successful performance in Winter Olympics history.[4]\n", 880 | " \"\"\"\n", 881 | "))" 882 | ] 883 | } 884 | ], 885 | "metadata": { 886 | "kernelspec": { 887 | "display_name": "Python 3 (ipykernel)", 888 | "language": "python", 889 | "name": "python3" 890 | }, 891 | "language_info": { 892 | "codemirror_mode": { 893 | "name": "ipython", 894 | "version": 3 895 | }, 896 | "file_extension": ".py", 897 | "mimetype": "text/x-python", 898 | "name": "python", 899 | "nbconvert_exporter": "python", 900 | "pygments_lexer": "ipython3", 901 | "version": "3.8.17" 902 | } 903 | }, 904 | "nbformat": 4, 905 | "nbformat_minor": 5 906 | } 907 | --------------------------------------------------------------------------------