├── .gitignore
├── LICENSE
├── MANIFEST.in
├── README.md
├── README_EN.md
├── config_classification.yaml
├── config_sequence_labeling.yaml
├── examples
├── chunking.ipynb
└── sentiment.ipynb
├── images
├── entity_visualization_sample.jpg
└── framework.jpg
├── nlp_toolkit
├── __init__.py
├── bin
│ ├── run_classifier
│ └── run_seq_tagger
├── callbacks.py
├── chunk_segmentor
│ ├── README.md
│ ├── __init__.py
│ ├── segment.py
│ ├── tagger.py
│ ├── tests
│ │ ├── data.sh
│ │ ├── test_functions.py
│ │ └── test_speed.py
│ ├── trie.py
│ └── utils.py
├── classifier.py
├── config.py
├── data.py
├── data
│ └── radical.txt
├── labeler.py
├── models
│ ├── __init__.py
│ ├── base_model.py
│ ├── bi_lstm_att.py
│ ├── char_rnn.py
│ ├── dpcnn.py
│ ├── han.py
│ ├── idcnn.py
│ ├── text_cnn.py
│ ├── transformer.py
│ └── word_rnn.py
├── modules
│ ├── __init__.py
│ ├── attentions
│ │ ├── __init__.py
│ │ ├── attention.py
│ │ ├── multi_dim_attention.py
│ │ └── self_attention.py
│ ├── custom_loss.py
│ ├── logits.py
│ └── token_embedders
│ │ ├── __init__.py
│ │ ├── embedding.py
│ │ └── position_embedding.py
├── sequence.py
├── trainer.py
├── utilities.py
└── visualization.py
├── reproduction
├── company_pro_con_classify.py
└── noun_phrases_detect.py
├── requirements-gpu.txt
├── requirements.txt
├── sample_data
├── company_pro_con.txt
├── cv_word_basic.txt
└── cv_word_conll.txt
└── setup.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | MANIFEST
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 | .pytest_cache/
49 |
50 | # Translations
51 | *.mo
52 | *.pot
53 |
54 | # Django stuff:
55 | *.log
56 | local_settings.py
57 | db.sqlite3
58 |
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 |
63 | # Scrapy stuff:
64 | .scrapy
65 |
66 | # Sphinx documentation
67 | docs/_build/
68 |
69 | # PyBuilder
70 | target/
71 |
72 | # Jupyter Notebook
73 | .ipynb_checkpoints
74 |
75 | # pyenv
76 | .python-version
77 |
78 | # celery beat schedule file
79 | celerybeat-schedule
80 |
81 | # SageMath parsed files
82 | *.sage.py
83 |
84 | # Environments
85 | .env
86 | .venv
87 | env/
88 | venv/
89 | ENV/
90 | env.bak/
91 | venv.bak/
92 |
93 | # Spyder project settings
94 | .spyderproject
95 | .spyproject
96 |
97 | # Rope project settings
98 | .ropeproject
99 |
100 | # mkdocs documentation
101 | /site
102 |
103 | # mypy
104 | .mypy_cache/
105 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2018 stevewyl
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include README.md
2 | include requirements.txt
3 | include requirements-gpu.txt
4 | include nlp_toolkit/data/*
5 | include nlp_toolkit/modules/*
6 | include nlp_toolkit/models/*
7 | include nlp_toolkit/chunk_segmentor/*
8 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # nlp_toolkit
2 |
3 | 中文NLP基础工具箱,包括以下任务:例如文本分类、序列标注等。
4 |
5 | 本仓库复现了一些近几年比较火的nlp论文。所有的代码是基于keras开发的。
6 |
7 | 不到10行代码,你就可以快速训练一个文本分类模型(暂时不支持多标签任务)或序列标注模型,或者可以体验基于名词短语切分的分词器
8 |
9 | ## 直接安装
10 |
11 | ```bash
12 | pip install nlp_toolkit
13 |
14 | # 使用GPU
15 | pip install tensorflow-gpu GPUtil
16 | ```
17 |
18 | ## 手动安装
19 |
20 | ```bash
21 | git clone https://github.com/stevewyl/nlp_toolkit
22 | cd nlp_toolkit
23 |
24 | # 只使用CPU
25 | pip install -r requirements.txt
26 |
27 | # 使用GPU
28 | pip install -r requirements-gpu.txt
29 |
30 | # 如果keras_contrib安装失败
31 | pip install git+https://www.github.com/keras-team/keras-contrib.git
32 | ```
33 |
34 | ### 安装错误
35 |
36 | 1. ImportError: cannot import name 'normalize_data_format'
37 |
38 | ```bash
39 | pip install -U keras
40 | ```
41 |
42 | ## 使用方法
43 |
44 | 本仓库的框架图:
45 |
46 | 
47 |
48 | 主要由以下几大模块组成:
49 |
50 | 1. Dataset:处理文本和标签数据为适合模型输入的格式,主要进行的处理操作有清理、分词、index化
51 |
52 | 2. Model Zoo & Layer:近几年在该任务中常用的模型汇总及一些Keras的自定义层
53 |
54 | 目前支持的自定义层有如下:
55 |
56 | * 1D注意力层 🆗
57 | * 2D注意力层 🆗
58 | * 多头注意力层 🆗
59 | * 位置嵌入层 🆗
60 | * K-max池化层
61 |
62 | 3. Trainer:定义模型的训练流程,支持bucket序列、自定义callbacks和N折交叉验证
63 |
64 |     * bucket序列:通过将相似长度的文本放入同一batch,减少padding带来的多余计算,从而加速模型训练;在文本分类任务中,能够对RNN网络提速2倍以上(**暂时不支持含有Flatten层的网络**),下方列表结束后附有一个简单的分桶示意
65 |
66 | * callbacks:通过自定义回调器来控制训练流程,目前预设的回调器有提前终止训练,学习率自动变化,更丰富的评估函数等
67 |
68 | * N折交叉验证:支持交叉验证来考验模型的真实能力
69 |
70 | 4. Classifier & Sequence Labeler:封装类,支持不同的训练任务
71 |
72 | 5. Application:目前工具箱内封装了基于jieba的名词短语分词器 Chunk_Segmentor (如需模型文件,可以邮件联系我)
73 |
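下面给出一个极简的分桶(bucket)思路示意,仅用于说明“按长度分组以减少padding”的想法,并非本仓库所用 BucketedSequence 的真实实现(实际代码来自 https://github.com/tbennun/keras-bucketed-sequence):

```python
# 极简示意:把长度相近的文本排在一起再切成 batch,
# 这样每个 batch 内 padding 到的长度比较接近,浪费的计算更少
texts = [
    '福利 不错',
    '公司 目前 地理 位置 不 太 理想 , 离 城市 中心 较 远点 。',
    '颇 具 发展 潜力',
    '加班 较 多',
]
batch_size = 2
sorted_texts = sorted(texts, key=lambda t: len(t.split()))
batches = [sorted_texts[i:i + batch_size]
           for i in range(0, len(sorted_texts), batch_size)]
print(batches)
```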
74 | 简单的用法如下:
75 |
76 | ```python
77 | from nlp_toolkit import Dataset, Classifier, Labeler
78 | import yaml
79 |
80 | config = yaml.load(open('your_config.yaml'))
81 |
82 | # 分类任务
83 | dataset = Dataset(fname='your_data.txt', task_type='classification', mode='train', config=config)
84 | text_classifier = Classifier('multi_head_self_att', dataset)
85 | trained_model = text_classifier.train()
86 |
87 | # 序列标注任务
88 | dataset = Dataset(fname='your_data.txt', task_type='sequence_labeling', mode='train', config=config)
89 | seq_labeler = Labeler('word_rnn', dataset)
90 | trained_model = seq_labeler.train()
91 |
92 | # 预测(以文本分类为例)
93 | dataset = Dataset(fname='your_data.txt', task_type='classification', mode='predict', tran_fname='your_transformer.h5')
94 | text_classifier = Classifier('bi_lstm_att', dataset)
95 | text_classifier.load(weight_fname='your_model_weights.h5', para_fname='your_model_parameters.json')
96 | y_pred = text_classifier.predict(dataset.texts)
97 |
98 | # chunk分词
99 | # 第一次import的时候,会自动下载模型和字典数据
100 | # 支持单句和多句文本的输入格式,建议以列表的形式传入分词器
101 | # 源代码中已略去相关数据的下载路径,有需要的请邮件联系
102 | from nlp_toolkit.chunk_segmentor import Chunk_Segmentor
103 | cutter = Chunk_Segmentor()
104 | s = '这是一个能够输出名词短语的分词器,欢迎试用!'
105 | res = [item for item in cutter.cut([s] * 10000)] # 1080ti上耗时8s
106 | # 提供两个版本,accurate为精确版,fast为快速版但召回会降低一些,默认精确版
107 | cutter = Chunk_Segmentor(mode='accurate')
108 | cutter = Chunk_Segmentor(mode='fast')
109 | # 是否输出词性, 默认开启
110 | cutter.cut(s, pos=False)
111 | # 是否将可切分的名词短语切分,默认关闭
112 | cutter.cut(s, cut_all=True)
113 | # 输出格式(词列表,词性列表,名词短语集合)
114 | [
115 | (
116 | ['这', '是', '一个', '能够', '输出', '名词_短语', '的', '分词器', ',', '欢迎', '试用', '!'],
117 | ['r', 'v', 'mq', 'v', 'vn', 'np', 'ude1', 'np', 'w', 'v', 'v', 'w'],
118 | ['分词器', '名词_短语']
119 | )
120 | ...
121 | ]
122 | ```
123 |
124 | 更多使用细节,请阅读[**examples**](https://github.com/stevewyl/nlp_toolkit/tree/master/examples)文件夹中的Jupyter Notebook和chunk_segmentor页面的[**README**](https://github.com/stevewyl/nlp_toolkit/tree/master/nlp_toolkit/chunk_segmentor)
125 |
126 | ### 数据格式
127 |
128 | 1. 文本分类:每一行预先分好词的文件,每一行的格式如下:
129 |
130 | __label__标签1 __label__标签2 ... 词 词 ... 词\n
131 |
132 | 例如 “__label__neg 公司 目前 地理 位置 不 太 理想 , 离 城市 中心 较 远点 。”
133 |
134 | 2. 序列标注:每一行预先分好词的文件,支持两种数据格式,每一行的格式如下:
135 |
136 | 词###标签 [TAB] 词###标签 [TAB] ... \n
137 |
138 | 例如 “目前###O\t公司###O\t地理###B-Chunk\t位置###E-Chunk\t不###O\t太###O\t理想###O\n”
139 |
140 | 或者 CONLL的标准格式
141 |
142 | 词 [TAB] 标签
143 |
144 | 词 [TAB] 标签
145 |
146 | ...
147 |
148 | 词 [TAB] 标签
149 |
150 | 词 [TAB] 标签
151 |
152 | ...
153 |
154 | 例如:
155 |
156 | 目前\tO
157 |
158 | 公司\tO
159 |
160 | ...
161 |
162 | 地理\tB-Chunk
163 |
164 | 位置\tE-Chunk
165 |
166 | 不\tO
167 |
168 | 太\tO
169 |
170 | 理想\tO
171 |
172 | 标签含义(这里以chunk为例):
173 |
174 | * O:普通词
175 | * B-Chunk:表示chunk词的开始
176 | * I-Chunk:表示chunk词的中间
177 | * E-Chunk:表示chunk词的结束
178 |
179 | 建议:文本序列以短句为主,针对标注实体的任务,最好保证每行数据中有实体词(即非全O的序列)
180 |
181 | 你可以通过以下方式互相转换两种数据格式:
182 | ```python
183 | from nlp_toolkit.utilities import convert_seq_format
184 | # here we convert dataset from conll format to basic format
185 | convert_seq_format(input_file, output_file, 'basic')
186 | ```
187 |
188 | ps: 具体可查看data文件夹中对应的[**示例数据**](https://github.com/stevewyl/nlp_toolkit/tree/master/sample_data)
189 |
190 | 3. 预测:不同任务每一行均为预先分好词的文本序列
191 |
192 | 4. 支持简单的自己添加数据的方法
193 |
194 | ```python
195 | dataset = Dataset(task_type='classification', mode='train', config=config)
196 | # classification
197 | dataset.add({'text': '我 爱 机器 学习', 'label': 'pos'})
198 | # sequence labeling
199 | dataset.add({'text': '我 爱 机器 学习', 'label': 'O O B-Chunk E-Chunk'})
200 | # after you add all your data
201 | dataset.fit()
202 | ```
203 |
204 | ### 配置文件
205 |
206 | nlp_toolkit通过配置文件来初始化训练任务
207 |
208 | train: 表示训练过程中的参数,包括batch大小,epoch数量,训练模式等
209 |
210 | data: 表示数据预处理的参数,包括最大词数和字符数,是否使用词内部字符序列等
211 |
212 | embed: 词向量,pre表示是否使用预训练词向量
213 |
214 | 剩下的模块对应不同的模型的超参数
215 |
216 | 具体细节可查看仓库根目录下的两个**配置文件**注释
217 |
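下面是一个读取配置文件的极简示意(键名取自仓库根目录的 config_classification.yaml,仅作说明,训练时请以配置文件注释为准):

```python
import yaml

config = yaml.load(open('config_classification.yaml'))
print(config['train']['batch_size'])               # batch大小,示例值为 64
print(config['data']['max_words'])                 # 最大词数,示例值为 100
print(config['embed']['pre'])                      # 是否使用预训练词向量
print(config['model']['bi_lstm_att']['rnn_size'])  # 对应模型的超参数
```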
218 | ### 可视化
219 |
220 | 1. attention权重可视化
221 |
222 | ```python
223 | # only support model bi_lstm_att currently
224 | # first you need to get attention_weights from model predictions
225 | # you can find the actual usage in examples/sentiment.ipynb
226 | texts = '有 能力 的 人 就 有 很多 机会'
227 | from nlp_toolkit import visualization as vs
228 | vs.mk_html(texts, attention_weights)
229 | ```
230 |
231 | 有 能力 的 人 就 有 很多 机会
232 |
233 | 2. 实体预测结果可视化
234 |
235 | ```python
236 | from nlp_toolkit import visualization as vs
237 | vs.entity_visualization(dataset.texts, y_pred, output_fname='result.html')
238 | ```
239 |
240 | 3. acc/loss 曲线可视化
241 |
242 | ```python
243 | # after your have trained one model, you will also get a history object, which contains some loss and metrics info
244 | from nlp_toolkit import visualization as vs
245 | vs.plot_loss_acc(history, task='sequence_labeling')
246 | ```
247 |
248 | ### 其他
249 |
250 | 1. 生成词向量小文件
251 |
252 | ```python
253 | from nlp_toolkit.utilities import gen_small_embedding
254 | gen_small_embedding(vocab_file, embed_file, output_file)
255 | ```
256 |
257 | ## 模型
258 |
259 | ### 文本分类
260 |
261 | 1. 双层双向LSTM + Attention 🆗
262 |
263 | [DeepMoji](https://arxiv.org/abs/1708.00524)一文中所采用的的模型框架,本仓库中对attention层作了扩展
264 |
265 | 对应配置文件中的名称:bi_lstm_att
266 |
267 | 2. [Transformer](http://papers.nips.cc/paper/7181-attention-is-all-you-need) 🆗
268 |
269 | 采用Transformer中的多头自注意力层来表征文本信息,详细的细节可阅读此[文章](https://kexue.fm/archives/4765)
270 |
271 | 对应配置文件中的名称:multi_head_self_att
272 |
273 | 3. [TextCNN](https://arxiv.org/abs/1408.5882) 🆗
274 |
275 | CNN网络之于文本分类任务的开山之作,在过去几年中经常被用作baseline,详细的细节可阅读此[文章](http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/)
276 |
277 | 对应配置文件中的名称:text_cnn
278 |
279 | 4. [DPCNN](http://www.aclweb.org/anthology/P17-1052) 🆗
280 |
281 | 在textCNN的基础上,DPCNN使用残差连接、固定feature map数量和1/2池化层等技巧来实现更丰富的文本表示,详细的细节可阅读此[文章](https://zhuanlan.zhihu.com/p/35457093)
282 |
283 | 对应配置文件中的名称:dpcnn
284 | 暂时不支持bucket序列化的数据
285 |
286 | 5. [HAN](https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf)
287 |
288 | 使用attention机制的文档分类模型
289 |
290 | ### 序列标注
291 |
292 | 1. [WordRNN](https://arxiv.org/abs/1707.06799) 🆗
293 |
294 | Baseline模型,文本序列经过双向LSTM后,由CRF层编码作为输出
295 |
296 | 对应配置文件中的名称:word_rnn
297 |
298 | 2. [CharRNN](https://pdfs.semanticscholar.org/b944/5206f592423f0b2faf05f99de124ccc6aaa8.pdf) 🆗
299 |
300 | 基于汉语的特点,在字符级别的LSTM信息外,加入偏旁部首,分词,Ngram信息
301 |
302 | 3. [InnerChar](https://arxiv.org/abs/1611.04361) 🆗
303 |
304 | 基于另外一篇[论文](https://arxiv.org/abs/1511.08308),扩展了本文的模型,使用bi-lstm或CNN在词内部的char级别进行信息的抽取,然后与原来的词向量进行concat或attention计算
305 |
306 | 对应配置文件中的名称:word_rnn,并设置配置文件data模块中的inner_char为True
307 |
308 | 4. [IDCNN](https://arxiv.org/abs/1702.02098) 🆗
309 |
310 | 膨胀卷积网络,在保持参数量不变的情况下,增大了卷积核的感受野,详细的细节可阅读此[文章](http://www.crownpku.com//2017/08/26/%E7%94%A8IDCNN%E5%92%8CCRF%E5%81%9A%E7%AB%AF%E5%88%B0%E7%AB%AF%E7%9A%84%E4%B8%AD%E6%96%87%E5%AE%9E%E4%BD%93%E8%AF%86%E5%88%AB.html)
311 |
312 | 对应配置文件中的名称:idcnn
313 |
314 | ## 性能
315 |
316 | 后续加入对中文NLP的标准数据集的测试
317 |
318 | ### 文本分类
319 |
320 | 测试数据集:
321 |
322 | 1. 公司优缺点评价,二分类,数据规模:95K
323 |
324 | Model | 10-fold_f1 | Model Size | Time per epoch
325 | ----------------------- | :------: | :----------: | :-------------:
326 | Bi-LSTM Attention | | |
327 | Transformer | | 7M | 12s
328 | TextCNN | 96.57 | 10M | 19s
329 | DPCNN | 93.35 | 9M | 28s
330 | HAN | | |
331 |
332 | ### 序列标注
333 |
334 | 测试数据集:
335 |
336 | 1. 简历工作经历,chunk,数据规模:58K
337 |
338 | Model | 10-fold_f1 | Model Size | Time per epoch
339 | ----------------------- | :------: | :----------: | :-------------:
340 | Baseline(WordRNN) | | |
341 | WordRNN + InnerChar | | 3M | 165s
342 | CharRNN(seg+radical) | | |
343 | IDCNN | | 2.7M | 43s
344 |
345 | ps: 模型大小表示为模型的参数量,其中K表示千,M表示百万;测试设备为1080ti+i7-6800K
346 |
347 | ## To-Do列表
348 |
349 | 1. 加入更多SOTA的模型和自定义层
350 |
351 | 2. 下一版本规划:增加抽象类Sentence
352 |
353 | 3. V2.0规划:切换为tf.estimator和tf.keras的API
354 |
355 | ## 感谢
356 |
357 | * 数据流模块部分代码借鉴于此: https://github.com/Hironsan/anago/
358 |
359 | * 序列标注任务的评估函数来源于此: https://github.com/chakki-works/seqeval
360 |
361 | * bucket序列化代码来自:https://github.com/tbennun/keras-bucketed-sequence
362 |
363 | * 多头注意力层和位置嵌入层代码来自:https://github.com/bojone/attention
364 |
365 | ## 联系方式
366 |
367 | 联系人:王奕磊
368 |
369 | 📧 邮箱:stevewyl@163.com
370 |
371 | 微信:Steve_1125
372 |
--------------------------------------------------------------------------------
/README_EN.md:
--------------------------------------------------------------------------------
1 | # nlp_toolkit
2 |
3 | A basic Chinese NLP toolkit covering tasks such as text classification and sequence labeling.
4 |
5 | This repo reproduces several popular NLP papers from recent years. All the code is based on Keras.
6 |
7 | With fewer than 10 lines of code, you can quickly train a text classification model or a sequence labeling model.
8 |
9 | ## Install
10 |
11 | ```bash
12 | git clone https://github.com/stevewyl/nlp_toolkit
13 | cd nlp_toolkit
14 |
15 | # Use cpu-only
16 | pip install -r requirements.txt
17 |
18 | # Use GPU
19 | pip install -r requirements-gpu.txt
20 |
21 | # if keras_contrib install fail
22 | pip install git+https://www.github.com/keras-team/keras-contrib.git
23 | ```
24 |
25 | ## Usage
26 |
27 | The framework of this repository:
28 |
29 | 
30 |
31 | The toolkit consists of the following modules:
32 |
33 | 1. Dataset: processes text and label data into a format suitable for model input. The main operations are cleaning, word segmentation and indexing.
34 |
35 | 2. Model Zoo & Layer: a collection of models commonly used for these tasks in recent years, plus some custom Keras layers.
36 |
37 |     The custom layers currently available are:
38 |
39 |     * Attention
40 |
41 |     * Multi-Head Attention
42 |
43 |     * Position Embedding
44 |
45 | 3. Trainer: defines the training process for different models, with support for bucket sequences, custom callbacks and N-fold cross-validation.
46 |
47 |     * Bucket Iterator: accelerates model training by putting texts of similar length into the same batch, which reduces the extra computation spent on padding. In text classification it can speed up RNN models by more than 2x. (Currently not supported for networks with a Flatten layer.)
48 |
49 |     * callbacks: the training process is controlled by custom callbacks. The preset callbacks currently include early stopping, automatic learning rate decay and richer evaluation metrics (a callback sketch is given after the quick-start example below).
50 |
51 |     * N-fold cross validation: supports cross-validation to assess the true capability of the model.
52 |
53 | 4. Classifier & Sequence Labeler: wrapper classes that support the different training tasks.
54 |
55 | Quick start:
56 |
57 | ```python
58 | from nlp_toolkit import Dataset, Classifier, Labeler
59 | import yaml
60 |
61 | config = yaml.load(open('your_config.yaml'))
62 |
63 | # text classification task
64 | dataset = Dataset(fname='your_data.txt', task_type='classification', mode='train', config=config)
65 | x, y, config = dataset.transform()
66 | text_classifier = Classifier(config=config, model_name='multi_head_self_att', seq_type='bucket', transformer=dataset.transformer)
67 | trained_model = text_classifier.train(x, y)
68 |
69 | # sequence labeling task
70 | dataset = Dataset(fname='your_data.txt', task_type='sequence_labeling', mode='train', config=config)
71 | x, y, config = dataset.transform()
72 | seq_labeler = Labeler(config=config, model_name='word_rnn', seq_type='bucket', transformer=dataset.transformer)
73 | trained_model = seq_labeler.train(x, y)
74 |
75 | # predict (for text classification task)
76 | dataset = Dataset('your_data.txt', task_type='classification', mode='predict', tran_fname='your_transformer.h5', segment=False)
77 | x_seq = dataset.transform()
78 | text_classifier = Classifier('bi_lstm_att', dataset.transformer)
79 | text_classifier.load(weight_fname='your_model_weights.h5', para_fname='your_model_parameters.json')
80 | y_pred = text_classifier.predict(x_seq['word'])
81 | ```
82 |
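If you want to customize the training callbacks mentioned above, the helpers in `nlp_toolkit/callbacks.py` can be used directly. A minimal sketch (here `valid_seq` is a placeholder for your validation batch sequence; `dataset` is the object created in the quick start):

```python
from nlp_toolkit.callbacks import History, get_callbacks

# preset callbacks: F1 monitoring on valid_seq, checkpointing to log_dir,
# early stopping and learning-rate reduction on plateau
history = History(metric=['f1'])
callbacks = get_callbacks(history=history, log_dir='logs', valid=valid_seq,
                          metric='f1', transformer=dataset.transformer, patiences=3)
```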
83 | For more details, please read the jupyter notebooks in **examples** folder
84 |
85 | ### Data Format
86 |
87 | 1. Text Classification: a pretokenised file where each line is in the following format (multi-label tasks are temporarily not supported):
88 |
89 | WORD [SPACE] WORD [SPACE] ... [TAB] LABEL \n
90 |
91 | such as "公司 目前 地理 位置 不 太 理想 , 离 城市 中心 较 远点 。\tneg\n"
92 |
93 | 2. Sequence Labeling: A pretokenised file where each line is in the following format:
94 |
95 | WORD###TAG [TAB] WORD###TAG [TAB] ..... \n
96 |
97 | such as "目前###O\t公司###O\t地理###B-Chunk\t位置###E-Chunk\t不###O\t太###O\t理想###O\n"
98 |
99 | label format (chunking as an example):
100 |
101 | * O:common words
102 | * B-Chunk:indicates the beginning of the chunk word
103 | * I-Chunk:indicates the middle of the chunk word
104 | * E-Chunk:indicates the end of the chunk word
105 |
106 | Suggestion: keep text sequences mostly to short sentences. For entity labeling tasks, it is best to ensure that every line contains at least one entity word (i.e. no all-O sequences).
107 |
108 | 3. Prediction: for all tasks, each line is a pretokenised text sequence.
109 |
110 |
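The two sequence-labeling formats can be converted into each other, and data can also be added to a Dataset programmatically; both helpers below are taken from the Chinese README (`input_file` / `output_file` are placeholders for your own paths):

```python
from nlp_toolkit.utilities import convert_seq_format
# here we convert a dataset from conll format to the basic (word###tag) format
convert_seq_format(input_file, output_file, 'basic')
```

```python
dataset = Dataset(task_type='classification', mode='train', config=config)
# classification
dataset.add({'text': '我 爱 机器 学习', 'label': 'pos'})
# sequence labeling
dataset.add({'text': '我 爱 机器 学习', 'label': 'O O B-Chunk E-Chunk'})
# after you have added all your data
dataset.fit()
```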
111 | ### Configuration file
112 |
113 | train: parameters of the training process, including batch size, number of epochs, training mode, etc.
114 |
115 | data: parameters of data preprocessing, including the maximum number of words and characters, whether to use word-internal character sequences, and whether to use word segmentation
116 |
117 | embed: word embedding settings; pre indicates whether to use pre-trained word vectors
118 |
119 | The remaining blocks correspond to the hyperparameters of the different models
120 |
121 | See the configuration file comments for details.
122 |
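A minimal sketch of how the configuration is consumed (key names taken from `config_classification.yaml` in the repository root):

```python
import yaml

config = yaml.load(open('config_classification.yaml'))
print(config['train']['batch_size'])   # e.g. 64
print(config['data']['max_words'])     # e.g. 100
print(config['embed']['pre'])          # whether to use pre-trained word vectors

# values can be overridden in code before building the Dataset,
# e.g. switch to N-fold cross-validation
config['train']['train_mode'] = 'fold'
```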
123 | ## Models
124 |
125 | 1. Double Bi-LSTM + Attention 🆗
126 |
127 | The model framework used in paper [DeepMoji](https://arxiv.org/abs/1708.00524). The attention layer has been extended in nlp_toolkit.
128 |
129 | Corresponding to the name in the configuration file: bi_lstm_att
130 |
131 | 2. [Transformer](http://papers.nips.cc/paper/7181-attention-is-all-you-need) 🆗
132 |
133 | Use the multi-head-self-attention layer in Transformer to characterize text information. Read the [article](https://kexue.fm/archives/4765) for details.
134 |
135 | Corresponding to the name in the configuration file: multi_head_self_att
136 |
137 | 3. [TextCNN](https://arxiv.org/abs/1408.5882) 🆗
138 |
139 | The pioneering work on applying CNNs to text classification; it has often been used as a baseline over the past few years. For details, read this [article](http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/)
140 |
141 | Corresponding to the name in the configuration file: text_cnn
142 |
143 | 4. [DPCNN](http://www.aclweb.org/anthology/P17-1052)
144 |
145 | Obtains richer text representations by progressively deepening the CNN network.
146 |
147 | 5. [HAN](https://www.cs.cmu.edu/~hovy/papers/16HLT-hierarchical-attention-networks.pdf)
148 |
149 | Document classification model using the attention mechanism
150 |
151 | ### Sequence Labeling
152 |
153 | 1. [WordRNN](https://arxiv.org/abs/1707.06799) 🆗
154 |
155 | Baseline model: the text sequence is passed through a bidirectional LSTM and then a CRF layer produces the output tags
156 |
157 | Corresponding to the name in the configuration file: word_rnn
158 |
159 | 2. [CharRNN](https://pdfs.semanticscholar.org/b944/5206f592423f0b2faf05f99de124ccc6aaa8.pdf)
160 |
161 | Designed around the characteristics of Chinese: in addition to character-level LSTM features, radical, word segmentation and N-gram information are added.
162 |
163 | 3. [InnerChar](https://arxiv.org/abs/1611.04361) 🆗
164 |
165 | Based on another [paper](https://arxiv.org/abs/1511.08308), the WordRNN model is extended by using a bi-LSTM or CNN to extract character-level information inside each word, which is then concatenated with the original word vectors or combined with them via attention.
166 |
167 | Corresponding to the name in the configuration file: word_rnn, and set the inner_char in the data module in the configuration file to True.
168 |
169 | 4. [IDCNN](https://arxiv.org/abs/1702.02098) 🆗
170 |
171 | The iterated dilated CNN increases the receptive field of the convolution kernels while keeping the number of parameters constant. For details, read this [article](http://www.crownpku.com//2017/08/26/%E7%94%A8IDCNN%E5%92%8CCRF%E5%81%9A%E7%AB%AF%E5%88%B0%E7%AB%AF%E7%9A%84%E4%B8%AD%E6%96%87%E5%AE%9E%E4%BD%93%E8%AF%86%E5%88%AB.html)
172 |
173 | Corresponding to the name in the configuration file: idcnn
174 |
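A rough sketch of the iterated-dilated idea (an illustration only, not the exact implementation in `nlp_toolkit/models/idcnn.py`; the filter count, kernel size and dilation rates below are the defaults from `config_sequence_labeling.yaml`):

```python
from keras.layers import Conv1D

def dilated_block(x, nb_filters=64, conv_kernel_size=3, dilation_rate=(1, 1, 2)):
    # stack convolutions with growing dilation so the receptive field widens
    # while the number of parameters per layer stays the same
    for rate in dilation_rate:
        x = Conv1D(nb_filters, conv_kernel_size, padding='same',
                   dilation_rate=rate, activation='relu')(x)
    return x
```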
175 |
180 | ## Performance
181 |
182 | Performance is reported on the following two datasets:
183 |
184 | 1. Company Pros and Cons: Crawled from Kanzhun.com and Dajie.com, it contains 95K reviews on the pros and cons of different companies.
185 | 2. Resume Work Experience: 58K work-experience sentences from resumes, used for the chunking task.
186 |
187 | ### Text Classification
188 |
189 | Model | 10-fold_f1 | Model Size | Time per epoch
190 | ----------------------- | :------: | :----------: | :-------------:
191 | Bi-LSTM Attention | | |
192 | Transformer | | |
193 | TextCNN | | |
194 | DPCNN | | |
195 | HAN | | |
196 |
197 | ### Sequence Labeling
198 |
199 | Model | 10-fold_f1 | Model Size | Time per epoch
200 | ----------------------- | :------: | :----------: | :-------------:
201 | Baseline(WordRNN) | | |
202 | WordRNN + InnerChar | | |
203 | CharRNN | | |
204 | IDCNN | | |
205 |
206 | ## To-Do List
207 |
208 | 1. Sentence split module
209 |
210 | 2. Add more SOTA models (such as BERT)
211 |
212 | 3. Support for training language models
213 |
214 | 4. Support for customized modules
215 |
216 | 5. Generate a unique configuration file for each model
217 |
218 | ## Acknowledgments
219 |
220 | * The preprocessor part is derived from https://github.com/Hironsan/anago/
221 | * The evaluations for sequence labeling are based on a modified version of https://github.com/chakki-works/seqeval
222 | * Bucket sequences are based on https://github.com/tbennun/keras-bucketed-sequence
223 | * Multi-head attention and position embedding are from: https://github.com/bojone/attention
224 |
225 | ## Contact
226 | Contact: Yilei Wang
227 |
228 | 📧 E-mail: stevewyl@163.com
229 |
230 | WeChat: Steve_1125
--------------------------------------------------------------------------------
/config_classification.yaml:
--------------------------------------------------------------------------------
1 | model:
2 | bi_lstm_att:
3 | # rnn隐层大小
4 | rnn_size: 512
5 | # attention层隐层大小
6 | attention_dim: 128
7 | # 向量层丢弃率
8 | embed_drop_rate: 0.15
9 | # 输出层前一层丢弃率
10 | final_drop_rate: 0.5
11 | # 是否返回attention权重
12 | return_att: True
13 |
14 | transformer:
15 | # head个数
16 | nb_head: 8
17 | # head大小
18 | head_size: 16
19 | # attention层个数
20 | nb_transformer: 2
21 | # 是否使用位置嵌入向量
22 | pos_embed: True
23 | # 词向量层丢弃率
24 | embed_drop_rate: 0.15
25 | # 输出层前一层丢弃率
26 | final_drop_rate: 0.5
27 |
28 | text_cnn:
29 | # 卷积核大小
30 | conv_kernel_size: [3, 4, 5]
31 | # 池化层核大小
32 | pool_size: [2, 2, 2]
33 | # 滤波器个数
34 | nb_filters: 128
35 | # 全连接层隐层大小
36 | fc_size: 128
37 | # 词向量层丢弃率
38 | embed_drop_rate: 0.15
39 |
40 | dpcnn:
41 | # text_cnn特征
42 | region_kernel_size: [3, 4, 5]
43 | # 卷积核大小
44 | conv_kernel_size: 3
45 | # 池化层核大小
46 | pool_size: 3
47 | # cnn层个数
48 | repeat_time: 2
49 | # 词向量层丢弃率
50 | embed_drop_rate: 0.15
51 | # 输出层前一层丢弃率
52 | final_drop_rate: 0.5
53 | # 滤波器个数
54 | nb_filters: 250
55 |
56 | train:
57 | # bucket个数
58 | nb_bucket: 100
59 | # batch大小
60 | batch_size: 64
61 |   # 最大迭代次数
62 | epochs: 25
63 | # 评估指标
64 | metric: f1
65 | # 交叉验证的次数
66 | nb_fold: 10
67 | # 训练模式,有single和fold两种
68 | train_mode: single
69 | # 测试集比例
70 | test_size: 0.2
71 | # early_stopping的终止条件
72 | patiences: 3
73 |
74 | data:
75 | # 最小的token粒度,有word和char两种
76 | basic_token: word
77 | # 最大词数
78 | max_words: 100
79 | # 最大字符数
80 | max_chars: 150
81 | # 最大词内部字符数
82 | max_inner_chars: 8
83 | # 是否开启词内部序列
84 | inner_char: False
85 |
86 | embed:
87 | # 是否使用预训练词向量
88 | pre: True
89 | # 词向量
90 | word:
91 | path: ../data/embeddings/fasttext_cv_all_300d.txt
92 | dim: 256
93 | # 字向量
94 | char:
95 | path: null
96 | dim: 128
97 |
--------------------------------------------------------------------------------
/config_sequence_labeling.yaml:
--------------------------------------------------------------------------------
1 | model:
2 | word_rnn:
3 | # 词级别rnn隐层大小
4 | word_rnn_size: 128
5 | # 字符级别rnn隐层大小
6 | char_rnn_size: 32
7 | # 是否使用CRF
8 | use_crf: True
9 | # 词内部字符信息表征方式,有cnn和rnn两种
10 | char_feature_method: cnn
11 | # 词和词内部字符信息的连接方式,有concat和attention两种
12 | integration_method: attention
13 | # rnn层的类别,有lstm和gru两种
14 | rnn_type: lstm
15 | # rnn层的个数
16 | nb_rnn_layers: 2
17 | # 滤波器个数
18 | nb_filters: 64
19 | # 卷积核大小
20 | conv_kernel_size: 2
21 | # 丢弃率
22 | drop_rate: 0.5
23 | # 词向量层丢弃率
24 | embed_drop_rate: 0.15
25 | # rnn层的内部丢弃率
26 | re_drop_rate: 0.15
27 |
28 | char_rnn:
29 | # 是否使用偏旁部首
30 | use_radical: False
31 | # 是否使用分词信息
32 | use_seg: False
33 | # 是否使用CRF
34 | use_crf: True
35 | # 字符级别rnn隐层大小
36 | char_rnn_size: 64
37 | # rnn层的类别,有lstm和gru两种
38 | rnn_type: lstm
39 | # rnn层的个数
40 | nb_rnn_layers: 2
41 | # 词向量层丢弃率
42 | embed_drop_rate: 0.15
43 | # 丢弃率
44 | drop_rate: 0.5
45 | # rnn层的内部丢弃率
46 | re_drop_rate: 0.15
47 |
48 | idcnn:
49 | # 词向量层丢弃率
50 | embed_drop_rate: 0.15
51 | # 丢弃率
52 | drop_rate: 0.5
53 | # 滤波器个数
54 | nb_filters: 64
55 | # 卷积核大小
56 | conv_kernel_size: 3
57 | # 膨胀率
58 | dilation_rate: [1, 1, 2]
59 | # 膨胀卷积层重复次数
60 | repeat_times: 4
61 | # 是否使用CRF
62 | use_crf: True
63 |
64 | train:
65 | # bucket个数
66 | nb_bucket: 100
67 | # batch大小
68 | batch_size: 64
69 |   # 最大迭代次数
70 | epochs: 25
71 | # 评估指标
72 | metric: f1_seq
73 | # 交叉验证的次数
74 | nb_fold: 10
75 | # 训练模式,有single和fold两种
76 | train_mode: single
77 | # 测试集比例
78 | test_size: 0.2
79 | # early_stopping的终止条件
80 | patiences: 3
81 |
82 | data:
83 | # 最小的token粒度,有word和char两种
84 | basic_token: word
85 | # 最大词数
86 | max_words: 80
87 | # 最大字符数
88 | max_chars: 120
89 | # 最大词内部字符数
90 | max_inner_chars: 8
91 | # 是否开启词内部序列
92 | inner_char: True
93 | # 数据格式,有basic和conll两种
94 | format: basic
95 | # 是否使用偏旁部首
96 | use_radical: False
97 | # 是否使用分词信息
98 | use_seg: False
99 |
100 | embed:
101 | # 是否使用预训练词向量
102 | pre: False
103 | # 词向量
104 | word:
105 | path: null
106 | dim: 64
107 | # 字向量
108 | char:
109 | path: null
110 | dim: 32
--------------------------------------------------------------------------------
/images/entity_visualization_sample.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stevewyl/nlp_toolkit/257dabd300b29957a0be38e7a8049a54f2095ccc/images/entity_visualization_sample.jpg
--------------------------------------------------------------------------------
/images/framework.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stevewyl/nlp_toolkit/257dabd300b29957a0be38e7a8049a54f2095ccc/images/framework.jpg
--------------------------------------------------------------------------------
/nlp_toolkit/__init__.py:
--------------------------------------------------------------------------------
1 | import gc
2 | import os
3 | import logging
4 | import numpy as np
5 | import tensorflow as tf
6 | from nlp_toolkit.classifier import Classifier
7 | from nlp_toolkit.labeler import Labeler
8 | from nlp_toolkit.data import Dataset
9 | from nlp_toolkit.config import YParams
10 |
11 | logging.basicConfig(level=logging.INFO)
12 |
13 | try:
14 | import GPUtil
15 | from keras.backend.tensorflow_backend import set_session
16 |
17 | num_all_gpu = len(GPUtil.getGPUs())
18 | avail_gpu = GPUtil.getAvailable(order='memory')
19 | num_avail_gpu = len(avail_gpu)
20 |
21 | gpu_no = str(avail_gpu[0])
22 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
23 | os.environ['CUDA_VISIBLE_DEVICES'] = gpu_no
24 | logging.info('Choose the most free GPU: %s, currently not support multi-gpus' % gpu_no)
25 |
26 | tf_config = tf.ConfigProto()
27 | tf_config.gpu_options.allow_growth = True
28 | set_session(tf.Session(config=tf_config))
29 |
30 | except (ImportError, FileNotFoundError, IndexError):
31 |     logging.info('GPUtil is not installed, nvidia-smi is missing or no GPU is available. '
32 |                  'Falling back to CPU!')
33 |
34 | gc.disable()
35 |
--------------------------------------------------------------------------------
/nlp_toolkit/bin/run_classifier:
--------------------------------------------------------------------------------
1 | import argparse
2 | import sys
3 |
4 | from nlp_toolkit.data import Dataset
5 | from nlp_toolkit.classifier import Classifier
6 |
7 | def get_args():
8 | parser = argparse.ArgumentParser()
9 | parser.add_argument('-model_dir', type=str, required=True,
10 |                         help='directory of a pretrained BERT model')
11 |     return parser.parse_args()
12 |
13 |
14 | def main(args):
15 |     pass
16 |
17 |
18 | if __name__ == '__main__':
19 |     args = get_args()
20 |     main(args)
--------------------------------------------------------------------------------
/nlp_toolkit/bin/run_seq_tagger:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stevewyl/nlp_toolkit/257dabd300b29957a0be38e7a8049a54f2095ccc/nlp_toolkit/bin/run_seq_tagger
--------------------------------------------------------------------------------
/nlp_toolkit/callbacks.py:
--------------------------------------------------------------------------------
1 | """
2 | Different kinds of callbacks during model training
3 | """
4 |
5 | import numpy as np
6 | from collections import defaultdict
7 | from typing import List
8 | from pathlib import Path
9 | from seqeval.metrics import accuracy_score
10 | from seqeval.metrics import f1_score as f1_seq_score
11 | from seqeval.metrics import classification_report as sequence_report
12 | from sklearn.metrics import confusion_matrix, f1_score, classification_report
13 | from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping, ReduceLROnPlateau
14 |
15 |
16 | class Top_N_Acc(Callback):
17 | """
18 | Evaluate model with top n label acc at each epoch
19 | """
20 |
21 | def __init__(self, seq, top_n=5, attention=False, transformer=None):
22 | super(Top_N_Acc, self).__init__()
23 | self.seq = seq
24 | self.top_n = top_n
25 | self.t = transformer
26 | self.attention = attention
27 |
28 | def on_epoch_end(self, epoch, logs={}):
29 | label_true, label_pred = [], []
30 | for i in range(len(self.seq)):
31 | x_true, y_true = self.seq[i]
32 | y_pred = self.model.predict_on_batch(x_true)
33 | if self.attention:
34 | y_pred = y_pred[:, :self.t.label_size]
35 | y_true = self.t.inverse_transform(y_true)
36 | y_pred = self.t.inverse_transform(y_pred, top_k=self.top_n)
37 | label_true.extend(y_true)
38 | label_pred.extend(y_pred)
39 | assert len(label_pred) == len(label_true)
40 | correct = 0
41 | for i in range(len(label_pred)):
42 | if label_true[i] in label_pred[i]:
43 | correct += 1
44 | top_n_acc = correct / len(label_pred)
45 | print(' - top_{}_acc: {:04.2f}'.format(self.top_n, top_n_acc * 100))
46 | logs['acc_%d' % self.top_n] = np.float64(top_n_acc)
47 |
48 |
49 | class F1score(Callback):
50 | """
51 | Evaluate classification model with f1 score at each epoch
52 | """
53 |
54 | def __init__(self, seq, attention=False, transformer=None):
55 | super(F1score, self).__init__()
56 | self.seq = seq
57 | self.t = transformer
58 | self.attention = attention
59 |
60 | def on_epoch_end(self, epoch, logs={}):
61 | label_true, label_pred = [], []
62 | for i in range(len(self.seq)):
63 | x_true, y_true = self.seq[i]
64 | y_true = np.argmax(y_true, -1)
65 | y_pred = self.model.predict_on_batch(x_true)
66 | if self.attention:
67 | y_pred = y_pred[:, :self.t.label_size]
68 | y_pred = np.argmax(y_pred, -1)
69 | label_true.extend(y_true)
70 | label_pred.extend(y_pred)
71 |
72 | assert len(label_pred) == len(label_true)
73 | f1 = self._calc_f1(label_true, label_pred)
74 | assert f1.shape[0] == self.t.label_size
75 | for i in range(f1.shape[0]):
76 | label = self.t._label_vocab._id2token[i]
77 | print(label, '- f1: {:04.2f}'.format(f1[i] * 100))
78 | # print(classification_report(label_true, label_pred))
79 | logs['f1'] = f1_score(label_true, label_pred, average='weighted')
80 |
81 | def _calc_f1(self, y_true, y_pred):
82 | cm = confusion_matrix(y_true, y_pred)
83 | correct_preds = np.diagonal(cm)
84 | r = correct_preds / np.sum(cm, axis=1)
85 | p = correct_preds / np.sum(cm, axis=0)
86 | f1 = 2 * p * r / (p + r)
87 | return f1
88 |
89 |
90 | class F1score_seq(Callback):
91 | """
92 | Evaluate sequence labeling model with f1 score at each epoch
93 | """
94 |
95 | def __init__(self, seq, transformer=None):
96 | super(F1score_seq, self).__init__()
97 | self.seq = seq
98 | self.t = transformer
99 |
100 | def get_lengths(self, y_true):
101 | lengths = []
102 | for y in np.argmax(y_true, -1):
103 | try:
104 | i = list(y).index(0)
105 | except ValueError:
106 | i = len(y)
107 | lengths.append(i)
108 | return lengths
109 |
110 | def on_epoch_end(self, epoch, logs={}):
111 | label_true, label_pred = [], []
112 | for i in range(len(self.seq)):
113 | x_true, y_true = self.seq[i]
114 | lengths = self.get_lengths(y_true)
115 | y_pred = self.model.predict_on_batch(x_true)
116 | y_true = self.t.inverse_transform(y_true, lengths)
117 | y_pred = self.t.inverse_transform(y_pred, lengths)
118 | label_true.extend(y_true)
119 | label_pred.extend(y_pred)
120 | acc = accuracy_score(label_true, label_pred)
121 | f1 = f1_seq_score(label_true, label_pred)
122 | print(' - acc: {:04.2f}'.format(acc * 100))
123 | print(' - f1: {:04.2f}'.format(f1 * 100))
124 | print(sequence_report(label_true, label_pred))
125 | logs['f1_seq'] = np.float64(f1)
126 | logs['seq_acc'] = np.float64(acc)
127 |
128 |
129 | class History(Callback):
130 | def __init__(self, metric: List[str]):
131 | self.metric = metric
132 |
133 | def on_train_begin(self, logs={}):
134 | self.loss = []
135 | self.acc = []
136 | self.val_loss = []
137 | self.val_acc = []
138 | self.metrics = defaultdict(list)
139 |
140 | def on_batch_end(self, batch, logs={}):
141 | self.loss.append(logs.get('loss'))
142 | self.acc.append(logs.get('acc'))
143 |
144 | def on_epoch_end(self, epoch, logs={}):
145 | for m in self.metric:
146 | self.metrics[m].append(logs.get(m))
147 | self.val_loss.append(logs.get('val_loss'))
148 | self.val_acc.append(logs.get('val_acc'))
149 |
150 |
151 | def get_callbacks(history=None, log_dir=None, valid=None, metric='f1',
152 | transformer=None, early_stopping=True, patiences=3,
153 | LRPlateau=True, top_n=5, attention=False):
154 | """
155 | Define list of callbacks for Keras model
156 | """
157 | callbacks = []
158 | if valid is not None:
159 | if metric == 'top_n_acc':
160 |             print('monitor training process using top_%d_acc score' % top_n)
161 | callbacks.append(Top_N_Acc(valid, top_n, attention, transformer))
162 | elif metric == 'f1':
163 |             print('monitor training process using f1 score')
164 | callbacks.append(F1score(valid, attention, transformer))
165 | elif metric == 'f1_seq':
166 |             print('monitor training process using f1 score and label acc')
167 | callbacks.append(F1score_seq(valid, transformer))
168 |
169 | if log_dir:
170 | path = Path(log_dir)
171 | if not path.exists():
172 | print('Successfully made a directory: {}'.format(log_dir))
173 | path.mkdir()
174 |
175 | file_name = '_'.join(
176 | ['model_weights', '{epoch:02d}', '{val_acc:2.4f}', '{%s:2.4f}' % metric]) + '.h5'
177 | weight_file = path / file_name
178 | save_model = ModelCheckpoint(str(weight_file),
179 | monitor=metric,
180 | verbose=1,
181 | save_best_only=True,
182 | save_weights_only=True,
183 | mode='max')
184 | callbacks.append(save_model)
185 |
186 | if early_stopping:
187 | print('using Early Stopping')
188 | callbacks.append(EarlyStopping(
189 | monitor=metric, patience=patiences, mode='max'))
190 |
191 | if LRPlateau:
192 | print('using Reduce LR On Plateau')
193 | callbacks.append(ReduceLROnPlateau(
194 | monitor=metric, factor=0.2, patience=patiences-2, min_lr=0.00001))
195 |
196 | if history:
197 | print('tracking loss history and metrics')
198 | callbacks.append(history)
199 |
200 | return callbacks
201 |
--------------------------------------------------------------------------------
/nlp_toolkit/chunk_segmentor/README.md:
--------------------------------------------------------------------------------
1 | # Chunk分词器使用指南
2 |
3 | 环境依赖:python 3.6.5 (暂时只支持python3)
4 |
5 | **不再维护更新**
6 | **源代码中已略去相关数据的下载路径,有需要的请邮件联系**
7 |
8 | ## 安装
9 |
10 | ```bash
11 | pip install nlp_toolkit
12 |
13 | # 如果keras_contrib安装失败
14 | pip install git+https://www.github.com/keras-team/keras-contrib.git
15 | ```
16 |
17 | ## 主要功能
18 |
19 | 1. 能够输出名词短语
20 | 2. 支持词性输出,名词短语词性为np
21 | 3. 支持名词短语以限定词+中心词的形式输出(以“_”分隔)
22 |
23 | >不可分割的名词短语是不存在限定词+中心词的形式的,如“机器学习”,而“经典机器学习算法”可拆解为“经典_机器学习_算法”
24 |
25 | ## 如何使用
26 |
27 | * 第一次import的时候,会自动下载模型和字典数据
28 | * 支持单句和多句文本的输入格式,建议以列表的形式传入分词器
29 |
30 | ```python
31 | from nlp_toolkit.chunk_segmentor import Chunk_Segmentor
32 | cutter = Chunk_Segmentor()
33 | s = '这是一个能够输出名词短语的分词器,欢迎试用!'
34 | res = [item for item in cutter.cut([s] * 10000)] # 1080ti上耗时8s
35 |
36 | # 提供两个版本,accurate为精确版,fast为快速版但召回会降低一些,默认精确版
37 | cutter = Chunk_Segmentor(mode='accurate')
38 | cutter = Chunk_Segmentor(mode='fast')
39 | # 支持用户自定义字典
40 | # 格式为每行 “词 词性”,必须为utf8编码,词性可省略
41 | cutter = Chunk_Segmentor(user_dict='your_dict.txt')
42 | # 是否输出词性, 默认开启
43 | cutter.cut(s, pos=False)
44 | # 是否需要更细粒度的切分结果, 默认关闭
45 | # 开启后会将部分名词短语以限定词+中心词的形式切开,词性均为np
46 | cutter.cut(s, cut_all=True)
47 |
48 | # 输出格式(词列表,词性列表,名词短语集合)
49 | [
50 | (
51 | ['这', '是', '一个', '能够', '输出', '名词_短语', '的', '分词器', ',', '欢迎', '试用', '!'],
52 | ['r', 'v', 'mq', 'v', 'vn', 'np', 'ude1', 'np', 'w', 'v', 'v', 'w'],
53 | ['分词器', '名词_短语']
54 | )
55 | ...
56 | ]
57 | ```
58 |
59 | ## Step 3 后续更新
60 |
61 | 若存在新的模型和字典数据,会提示你是否需要更新
62 |
63 | ## To-Do Lists
64 |
65 | 1. 提升限定词和名词短语的准确性 ---> 新的模型
66 | 2. char模型存在GPU调用内存溢出的问题 ---> 使用cnn提取Nchar信息来代替embedding的方式,缩小模型规模
67 | 3. 自定义字典,支持不同粒度的切分
68 | 4. 多进程模型加载和预测
69 |
--------------------------------------------------------------------------------
/nlp_toolkit/chunk_segmentor/__init__.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import glob
4 | import pickle
5 | import socket
6 | from pathlib import Path
7 | from datetime import datetime
8 |
9 | INIT_PATH = os.path.realpath(__file__)
10 | STATIC_ROOT = os.path.dirname(INIT_PATH)
11 | DATA_PATH = Path(STATIC_ROOT) / 'data'
12 | MD5_FILE_PATH = DATA_PATH / 'model_data.md5'
13 | UPDATE_TAG_PATH = DATA_PATH / 'last_update.pkl'
14 | UPDATE_INIT_PATH = DATA_PATH / 'init_update.txt'
15 | MD5_HDFS_PATH = '/user/xxxx/chunk_segmentor/model_data.md5'
16 | MODEL_HDFS_PATH = '/user/xxxx/chunk_segmentor/model_data.zip'
17 | USER_NAME = 'xxxx'
18 | PASSWORD = 'xxxxx'
19 | FTP_PATH_1 = 'ftp://xxx.xxx.xx.xx:xx/chunk_segmentor'
20 | FTP_PATH_2 = 'ftp://xxx.xxx.xx.xx:xx/chunk_segmentor'
21 | IP = socket.gethostbyname(socket.gethostname())
22 |
23 |
24 | def check_version():
25 | if MD5_FILE_PATH.exists():
26 | src = get_data_md5()
27 | if src:
28 | flag = update(src)
29 | if not flag:
30 | print('模型和数据更新失败!')
31 | else:
32 | for fname in glob.glob('model_data.md5*'):
33 | os.remove(fname)
34 | else:
35 | print('拉取md5文件失败!')
36 | else:
37 | print("这是第一次启动Chunk分词器。 请耐心等待片刻至数据和模型下载完成。")
38 | flag = download()
39 | if flag:
40 | current_time = datetime.now()
41 | init_update_time = str(os.path.getctime(INIT_PATH))
42 | pickle.dump(current_time, open(UPDATE_TAG_PATH, 'wb'))
43 | with open(UPDATE_INIT_PATH, 'w') as fout:
44 | fout.write(init_update_time)
45 | else:
46 | print('请寻找一台有hadoop或者能访问ftp://xxx.xxx.xx.xx:xx或者ftp://xxx.xxx.xx.xx:xx的机器')
47 |
48 |
49 | def write_config(config_path, new_root_path):
50 | content = []
51 | with open(config_path, encoding='utf8') as f:
52 | for line in f:
53 | if line.startswith('root'):
54 | line = 'root={}{}'.format(new_root_path, os.linesep)
55 | content.append(line)
56 | with open(config_path, 'w', encoding='utf8') as f:
57 | f.writelines(content)
58 |
59 |
60 | def download():
61 | # 下载数据文件
62 | ret1 = -1
63 | ret2 = -1
64 | for fname in glob.glob('model_data.md5*'):
65 | os.remove(fname)
66 | for fname in glob.glob('model_data.zip*'):
67 | os.remove(fname)
68 |
69 | if not IP.startswith('127'):
70 | print('尝试从ftp://xxx.xxx.xx.xx:xx获取数据')
71 | ret2 = os.system('wget -q --timeout=2 --tries=1 --ftp-user=%s --ftp-password=%s %s/model_data.md5' %
72 | (USER_NAME, PASSWORD, FTP_PATH_1))
73 | if ret2 == 0:
74 | ret1 = os.system('wget --ftp-user=%s --ftp-password=%s %s/model_data.zip' %
75 | (USER_NAME, PASSWORD, FTP_PATH_1))
76 | if ret1 != 0:
77 | print('尝试从hdfs上拉取数据,大约20-30s')
78 | ret1 = os.system('hadoop fs -get %s' % MODEL_HDFS_PATH)
79 | ret2 = os.system('hadoop fs -get %s' % MD5_HDFS_PATH)
80 | else:
81 | print('尝试从ftp://xxx.xxx.xx.xx:xx获取数据')
82 | ret2 = os.system('wget -q --timeout=2 --tries=1 --ftp-user=%s --ftp-password=%s %s/model_data.md5' %
83 | (USER_NAME, PASSWORD, FTP_PATH_2))
84 | if ret2 == 0:
85 | ret1 = os.system('wget --ftp-user=%s --ftp-password=%s %s/model_data.zip' %
86 | (USER_NAME, PASSWORD, FTP_PATH_2))
87 | if ret1 != 0 or ret2 != 0:
88 | return False
89 | if ret1 == 0 and ret2 == 0:
90 | os.system('unzip -q model_data.zip')
91 | os.system('cp -r model_data/data %s' % STATIC_ROOT)
92 | os.system('cp -f model_data/best_model.txt %s' % DATA_PATH)
93 | os.system('cp -f model_data.md5 %s' % DATA_PATH)
94 | os.system('rm -r model_data')
95 | os.system('rm model_data.md5*')
96 | os.system('rm model_data.zip*')
97 | print('数据和模型下载成功')
98 | return True
99 |
100 |
101 | def get_data_md5():
102 | for fname in glob.glob('model_data.md5*'):
103 | os.remove(fname)
104 | ret = -1
105 |
106 | if not IP.startswith('127'):
107 | ret = os.system('wget -q --timeout=2 --tries=1 --ftp-user=%s --ftp-password=%s %s/model_data.md5' %
108 | (USER_NAME, PASSWORD, FTP_PATH_1))
109 | if ret == 0:
110 | src = 'ftp1'
111 | else:
112 |             ret = os.system('hadoop fs -get %s' % MD5_HDFS_PATH)
113 | if ret == 0:
114 | src = 'hdfs'
115 | else:
116 | ret = os.system('wget -q --timeout=2 --tries=1 --ftp-user=%s --ftp-password=%s %s/model_data.md5' %
117 | (USER_NAME, PASSWORD, FTP_PATH_2))
118 | if ret == 0:
119 | src = 'ftp2'
120 | if ret != 0:
121 | print('请寻找一台有hadoop或者能访问ftp://xxx.xxx.xx.xx:xx或者ftp://xxx.xxx.xx.xx:xx的机器')
122 | return None
123 | else:
124 | return src
125 |
126 |
127 | def update(src):
128 | with open(MD5_FILE_PATH, 'rb') as f:
129 | current_data_md5 = f.readlines()[0].strip()
130 | with open('model_data.md5', 'rb') as f:
131 | latest_data_md5 = f.readlines()[0].strip()
132 | try:
133 | if current_data_md5 != latest_data_md5:
134 | x = input('发现新的数据和模型?是否决定下载更新? Yes/No?')
135 | if x in ['Yes', 'Y', 'y', 'YES', '1', 1, 'yes'] or x == '':
136 | flag = update_data(src)
137 | if flag:
138 | print('模型和字典数据已更新到最新版本')
139 | return True
140 | else:
141 | return False
142 | else:
143 | print('希望您下次来更新数据!')
144 | return True
145 | else:
146 | return True
147 | except:
148 | return False
149 |
150 |
151 | def update_data(src):
152 | try:
153 | for fname in glob.glob('model_data.zip*'):
154 | os.remove(fname)
155 | if src == 'hdfs':
156 | print('尝试从hdfs上拉取数据,大约20-30s')
157 |             os.system('hadoop fs -get %s' % MODEL_HDFS_PATH)
158 | elif src == 'ftp1':
159 | print('尝试从ftp://xxx.xxx.xx.xx:xx获取数据')
160 | os.system('wget --ftp-user=%s --ftp-password=%s %s/model_data.zip' % (USER_NAME, PASSWORD, FTP_PATH_1))
161 | elif src == 'ftp2':
162 | print('尝试从ftp://xxx.xxx.xx.xx:xx获取数据')
163 | os.system('wget --ftp-user=%s --ftp-password=%s %s/model_data.zip' % (USER_NAME, PASSWORD, FTP_PATH_2))
164 |
165 | os.system('unzip -q model_data.zip')
166 | os.system('rm -r %s' % DATA_PATH)
167 | os.system('cp -r model_data/data %s' % STATIC_ROOT)
168 | os.system('cp -f model_data/best_model.txt %s' % DATA_PATH)
169 | os.system('cp -f model_data.md5 %s' % DATA_PATH)
170 | os.system('rm -r model_data')
171 | os.system('rm model_data.md5*')
172 | os.system('rm model_data.zip*')
173 | return True
174 | except:
175 | return False
176 |
177 |
178 | check_version()
179 | from .segment import Chunk_Segmentor
180 |
--------------------------------------------------------------------------------
/nlp_toolkit/chunk_segmentor/segment.py:
--------------------------------------------------------------------------------
1 | # ======主程序========
2 | import sys
3 | import os
4 | import pickle
5 | import time
6 | import logging
7 | from pathlib import Path
8 | from collections import Counter
9 |
10 | import jieba
11 | import jieba.posseg as pseg
12 | from nlp_toolkit.chunk_segmentor.trie import Trie
13 | from nlp_toolkit.chunk_segmentor.utils import read_line, flatten_gen, sent_split, preprocess, jieba_cut
14 | from nlp_toolkit.sequence import IndexTransformer
15 | from nlp_toolkit.models import Word_RNN, Char_RNN
16 | from nlp_toolkit.chunk_segmentor.tagger import Tagger
17 |
18 | global model_loaded
19 | global last_model_name
20 | global Tree
21 | global Labeler
22 | global load_dict
23 | global load_qualifier
24 | global qualifier_dict
25 | last_model_name = ''
26 | tree_loaded = False
27 | Labeler = None
28 | Tree = None
29 | load_dict = False
30 | load_qualifier = False
31 | qualifier_dict = None
32 |
33 | # 关闭jieba的日志输出
34 | jieba.setLogLevel(logging.INFO)
35 |
36 |
37 | class Chunk_Labeler(object):
38 | def __init__(self, model_name='word-rnn', tagger=None):
39 | self.model_name = model_name
40 | if self.model_name != 'word-rnn':
41 | print('char-rnn model will update soon!')
42 | sys.exit()
43 | self.tagger = tagger
44 |
45 | def analyze(self, text, has_seq=True, char_input=False,
46 | mode='batch', batch_size=256, radical_file=''):
47 | if mode == 'single':
48 | batch_size = 1
49 | if not self.tagger:
50 | if self.model_name in ['char-rnn', 'idcnn']:
51 | char_input = True
52 | self.tagger = Tagger(self.model, self.p, char_input,
53 | mode, batch_size, radical_file)
54 | return self.tagger.analyze(text)
55 |
56 | @classmethod
57 | def load(cls, model_name, weight_file, params_file, preprocessor_file):
58 | self = cls(model_name=model_name)
59 | self.p = IndexTransformer.load(preprocessor_file)
60 | if model_name == 'word-rnn':
61 | self.model = Word_RNN.load(weight_file, params_file)
62 | elif model_name == 'char-rnn':
63 | self.model = Char_RNN.load(weight_file, params_file)
64 | else:
65 | print('No other available models for chunking')
66 | print('Please use word-rnn or char-rnn')
67 | return self
68 |
69 |
70 | class Chunk_Segmentor(object):
71 | def __init__(self, user_dict='', model_name='word-rnn', mode='accurate', verbose=0):
72 | try:
73 | assert mode in ['accurate', 'fast']
74 | except:
75 |             print('Only support the following two modes: accurate, fast')
76 | sys.exit()
77 | self.pos = True
78 | self.mode = mode
79 | self.verbose = verbose
80 | self.path = os.path.abspath(os.path.dirname(__file__))
81 | if model_name != '':
82 | self.model_name = model_name
83 | else:
84 | try:
85 | self.model_name = read_line(Path(self.path) / 'data' / 'best_model.txt')[0]
86 | except Exception:
87 | self.model_name = model_name
88 |
89 | # jieba初始化
90 | base_dict = Path(self.path) / 'data' / 'dict' / 'jieba_base_supplyment.txt'
91 | jieba.load_userdict(str(base_dict))
92 | if mode == 'fast':
93 | global load_dict
94 | if not load_dict:
95 | if self.verbose:
96 | print('loading np dict to jieba cache')
97 | dict_path = Path(self.path) / 'data' / 'dict' / 'chunk_pos.txt'
98 | jieba.load_userdict(str(dict_path))
99 | load_dict = True
100 | if user_dict:
101 | jieba.load_userdict(user_dict)
102 | self.seg = pseg
103 |
104 | # model变量
105 | self.weight_file = os.path.join(self.path, 'data/model/%s_weights.h5' % self.model_name)
106 | self.param_file = os.path.join(self.path, 'data/model/%s_parameters.json' % self.model_name)
107 | self.preprocess_file = os.path.join(self.path, 'data/model/%s_transformer.h5' % self.model_name)
108 | self.define_tagger()
109 |
110 | def define_tagger(self):
111 | global load_qualifier
112 | global qualifier_dict
113 | if not load_qualifier:
114 | qualifier_word_path = os.path.join(self.path, 'data/dict/chunk_qualifier.dict')
115 | self.qualifier_word = pickle.load(open(qualifier_word_path, 'rb'))
116 | load_qualifier = True
117 | qualifier_dict = self.qualifier_word
118 | else:
119 | self.qualifier_word = qualifier_dict
120 |
121 | self.basic_token = 'char' if self.model_name[:4] == 'char' else 'word'
122 |
123 | # acc模式变量
124 | if self.mode == 'accurate':
125 | global tree_loaded
126 | global last_model_name
127 | global Labeler
128 | global Tree
129 | if self.verbose:
130 | if not load_dict:
131 | print('Model and Trie Tree are loading. It will cost 10-20s.')
132 | if self.model_name != last_model_name:
133 | self.labeler = Chunk_Labeler.load(
134 | self.model_name, self.weight_file, self.param_file, self.preprocess_file)
135 | if self.verbose:
136 | print('load model succeed')
137 | last_model_name = self.model_name
138 | Labeler = self.labeler
139 | else:
140 | self.labeler = Labeler
141 | if not tree_loaded:
142 | chunk_dict = read_line(os.path.join(self.path, 'data/dict/chunk.txt'))
143 | self.tree = Trie()
144 | for chunk in chunk_dict:
145 | self.tree.insert(chunk)
146 | if self.verbose:
147 | print('trie tree succeed')
148 | tree_loaded = True
149 | Tree = self.tree
150 | else:
151 | self.tree = Tree
152 | radical_file = os.path.join(self.path, 'data/dict/radical.txt')
153 | self.tagger = Tagger(self.labeler.model, self.labeler.p,
154 | basic_token=self.basic_token, radical_file=radical_file,
155 | tree=self.tree, qualifier_dict=self.qualifier_word,
156 | verbose=self.verbose)
157 |
158 | @property
159 | def get_segmentor_info(self):
160 | params = {'model_name': self.model_name,
161 | 'mode': self.mode,
162 | 'pos': self.pos}
163 | return params
164 |
165 | def extract_item(self, item):
166 | C_CUT_WORD, C_CUT_POS, C_CUT_CHUNK = 0, 1, 2
167 | complete_words = [sub[C_CUT_WORD] for sub in item]
168 | complete_poss = [sub[C_CUT_POS] for sub in item]
169 | if load_dict:
170 | all_chunks = [x for sub in item for x, y in zip(
171 | sub[C_CUT_WORD], sub[C_CUT_POS]) if y == 'np']
172 | else:
173 | all_chunks = list(flatten_gen([sub[C_CUT_CHUNK] for sub in item]))
174 | words = list(flatten_gen(complete_words))
175 | poss = list(flatten_gen(complete_poss))
176 | if self.cut_all:
177 | words, poss = zip(*[(x1, y1) for x, y in zip(words, poss) for x1, y1 in self.cut_qualifier(x, y)])
178 | words = [' ' if word == 's_' else word for word in words]
179 | if self.pos:
180 | d = (words, # C_CUT_WORD
181 | poss, # C_CUT_POS
182 | list(dict.fromkeys(all_chunks))) # C_CUT_CHUNK
183 | else:
184 | d = (words, list(dict.fromkeys(all_chunks)))
185 | if self.verbose:
186 | print(d)
187 | return d
188 |
189 | def cut_qualifier(self, x, y):
190 | if y == 'np' and '_' in x and x not in ['s_', 'ss_', 'lan_']:
191 | for sub_word in x.split('_'):
192 | yield sub_word, y
193 | else:
194 | yield x, y
195 |
196 | def output(self, data):
197 | idx_list, strings = zip(
198 | *[[idx, sub] for idx, item in enumerate(data) for sub in sent_split(preprocess(item))])
199 | cc = list(Counter(idx_list).values())
200 | end_idx = [sum(cc[:i]) for i in range(len(cc)+1)]
201 | seg_res = jieba_cut(strings, self.seg,
202 | self.qualifier_word, mode=self.mode,
203 | dict_loaded=load_dict)
204 | if self.verbose:
205 | print(seg_res)
206 | if self.mode == 'accurate':
207 | outputs, _ = self.tagger.analyze(seg_res)
208 | else:
209 | outputs = [list(zip(*item)) for item in seg_res]
210 | if self.verbose:
211 | print(outputs)
212 | new_res = (outputs[end_idx[i]: end_idx[i+1]]
213 | for i in range(len(end_idx)-1))
214 | for item in new_res:
215 | yield self.extract_item(item)
216 |
217 | def cut(self, data, batch_size=512, pos=True, cut_all=False):
218 | if isinstance(data, str):
219 | data = [data]
220 | if not pos:
221 | self.pos = False
222 | else:
223 | self.pos = True
224 | if not cut_all:
225 | self.cut_all = False
226 | else:
227 | self.cut_all = True
228 | self.define_tagger()
229 | assert isinstance(data, list)
230 | data_cnt = len(data)
231 |         num_batches = (data_cnt + batch_size - 1) // batch_size  # ceil division, avoids a trailing empty batch
232 | if self.verbose:
233 | print('total_batch_num: ', num_batches)
234 | for batch_num in range(num_batches):
235 | start_index = batch_num * batch_size
236 | end_index = min((batch_num + 1) * batch_size, data_cnt)
237 | batch_input = data[start_index:end_index]
238 | for res in self.output(batch_input):
239 | yield res
240 |
241 |
242 | if __name__ == "__main__":
243 | cutter = Chunk_Segmentor(verbose=1)
244 | cutter.cut('这是一个能够输出名词短语的分词器,欢迎试用!')
245 |
--------------------------------------------------------------------------------
/nlp_toolkit/chunk_segmentor/tagger.py:
--------------------------------------------------------------------------------
1 | """预测类"""
2 |
3 | import re
4 | import numpy as np
5 | import tensorflow as tf
6 | from pathlib import Path
7 | from collections import Counter
8 | from seqeval.metrics.sequence_labeling import get_entities
9 | from nlp_toolkit.chunk_segmentor.utils import flatten_gen, tag_by_dict, read_line, compare_idx
10 |
11 | global special_tokens
12 | global graph
13 | special_tokens = set(['s_', 'lan_', 'ss_'])
14 | graph = tf.get_default_graph()
15 |
16 |
17 | def check_in(check_list, filter_list):
18 | combine = set(check_list) & filter_list
19 | if len(combine) > 0:
20 | return True
21 | else:
22 | return False
23 |
24 |
25 | # judge char type ['cn', 'en', 'num', 'other']
26 | def char_type(word):
27 | for char in word:
28 | unicode_char = ord(char)
29 | if unicode_char >= 19968 and unicode_char <= 40869:
30 | yield (char, 'cn')
31 | elif unicode_char >= 65 and unicode_char <= 122:
32 | yield (char, 'en')
33 | elif unicode_char >= 48 and unicode_char <= 57:
34 | yield (char, 'num')
35 | else:
36 | yield (char, 'other')
37 |
38 |
39 | # split word into chars
40 | def split_cn_en(word):
41 | new_word = [c for c in char_type(word)]
42 | new_word_len = len(new_word)
43 | tmp = ''
44 | for ix, item in enumerate(new_word):
45 | if item[1] in {'en', 'num'}:
46 | if ix < new_word_len - 1:
47 | if new_word[ix+1][1] == item[1]:
48 | tmp += item[0]
49 | else:
50 | tmp += item[0]
51 | yield tmp
52 | tmp = ''
53 | else:
54 | tmp += item[0]
55 | yield tmp
56 | else:
57 | yield item[0]
58 |
59 |
60 | def split_word(word):
61 | word, pos = word.rsplit('-', 1)
62 | if len(word) == 1 or word in special_tokens or not re.search(r'[^a-z0-9]+', word):
63 | yield [word, word, pos, 'S']
64 | else:
65 | char_list = list(split_cn_en(word))
66 | l_c = len(char_list)
67 | word_list = [word] * l_c
68 | pos_list = [pos] * l_c
69 | seg_list = ['M'] * l_c
70 | seg_list[0] = 'B'
71 | seg_list[-1] = 'E'
72 | for i in range(l_c):
73 | yield [char_list[i], word_list[i], pos_list[i], seg_list[i]]
74 |
75 |
76 | def word2char(word_list):
77 | return list(flatten_gen([list(split_word(word)) for word in word_list]))
78 |
79 |
80 | def chunk_list(word_list, max_length):
81 | l_w = len(word_list)
82 | if l_w > max_length:
83 | for i in range(0, len(word_list), max_length):
84 | yield word_list[0+i: max_length+i]
85 | else:
86 | yield word_list
87 |
88 |
89 | def split_sent(possible_idx, num_split, max_length, word_list):
90 | start = 0
91 | end = max_length
92 | if len(possible_idx) > 0:
93 | for _ in range(num_split):
94 | sub_possible_idx = [
95 | idx for idx in possible_idx if idx > start and idx <= end]
96 | if sub_possible_idx != []:
97 | end = max(sub_possible_idx, key=lambda x: x - end)
98 | yield word_list[start:end+1]
99 | start = end + 1
100 | end += max_length
101 | yield word_list[start:]
102 | else:
103 | yield word_list
104 |
105 |
106 | def split_long_sent(word_list, max_length):
107 | if len(word_list) <= max_length:
108 | return [word_list]
109 | num_split = int(len(word_list) / max_length)
110 | possible_split = [',', '.', 's_', '、', '/']
111 | possible_idx = [idx for idx, item in enumerate(word_list) if item[0] in possible_split]
112 | split_text = split_sent(possible_idx, num_split, max_length, word_list)
113 | new_list = [sub_item for item in split_text for sub_item in chunk_list(item, max_length)]
114 | return new_list
115 |
116 |
117 | def get_radical(d, char_list):
118 | return [d[char] if char in d else '' for char in char_list]
119 |
120 |
121 | class Tagger(object):
122 | def __init__(self, model, preprocessor, basic_token='word', pos=True,
123 | batch_size=512, radical_file='', tree=None,
124 | qualifier_dict=None, verbose=0):
125 | self.wrong = []
126 | self.model = model
127 | self.p = preprocessor
128 | self.basic_token = basic_token
129 | self.pos = pos
130 | self.tree = tree
131 | self.qualifier_dict = qualifier_dict
132 | self.verbose = verbose
133 | if self.basic_token == 'char':
134 | if self.p.radical_vocab_size > 2:
135 | self.use_radical = True
136 | self.radical_dict = {item.split('\t')[0]: item.split(
137 | '\t')[1] for item in read_line(radical_file)}
138 | else:
139 | self.use_radical = False
140 | if self.p.seg_vocab_size > 2:
141 | self.use_seg = True
142 | else:
143 | self.use_seg = False
144 | elif self.basic_token == 'word':
145 | self.use_radical = False
146 | self.use_seg = False
147 | if self.p.char_vocab_size > 2:
148 | self.use_inner_char = True
149 | else:
150 | self.use_inner_char = False
151 |
152 | self.char_tokenizer = word2char
153 | self.word_tokenizer = str.split
154 | self.batch_size = batch_size
155 |
156 | dict_path = Path(__file__).parent / 'data' / 'dict'
157 | self.stopwords = set(read_line(dict_path / 'stopwords.txt'))
158 | self.stopwords_first = set(read_line(dict_path / 'stopwords_first_word.txt'))
159 | self.stopwords_last = set(read_line(dict_path / 'stopwords_last_word.txt'))
160 | self.pos_filter = set(read_line(dict_path / 'pos_filter_jieba.txt'))
161 | self.pos_filter_first = set(read_line(dict_path / 'pos_filter_first_jieba.txt'))
162 | self.MAIN_INPUT_IDX = 0
163 | self.POS_IDX = 1
164 | self.SEG_IDX = 2
165 | self.RADICAL_IDX = 3
166 | self.WORD_IDX = 4
167 |
168 | @property
169 | def get_tagger_info(self):
170 | params = {'basic_token': self.basic_token,
171 | 'pos': self.pos,
172 | 'batch_size': self.batch_size,
173 | 'use_seg': self.use_seg,
174 | 'use_radical': self.use_radical,
175 | 'use_inner_char': self.use_inner_char}
176 | return params
177 |
178 | def data_generator(self, batch_input):
179 | input_data = {}
180 | batch_input = [self.preprocess_data(item) for item in batch_input]
181 | text_pos_idx = [(idx, each, item['pos'][i]) for idx, item in enumerate(batch_input) for i, each in enumerate(item['token'])]
182 | sent_idx, sub_text, sub_pos = zip(*text_pos_idx)
183 |
184 | try:
185 | input_data['token'] = sub_text
186 | input_data['pos'] = sub_pos
187 | if self.basic_token == 'char':
188 | input_data['word'] = [each for item in batch_input for each in item['word']]
189 | if self.use_seg:
190 | input_data['seg'] = [each for item in batch_input for each in item['seg']]
191 | if self.use_radical:
192 | input_data['radical'] = [each for item in batch_input for each in item['radical']]
193 | else:
194 | if self.use_inner_char:
195 | pass
196 | cc = list(Counter(sent_idx).values())
197 | end_idx = [sum(cc[:i]) for i in range(len(cc)+1)]
198 | return end_idx, input_data
199 | except Exception as e:
200 | print(e)
201 | length = [len(each) for idx, item in enumerate(batch_input) for each in item['token']]
202 | print(len(batch_input), length, sub_text)
203 | self.wrong.append((len(batch_input), length, sub_text))
204 |
205 | def preprocess_data(self, seg_res):
206 | assert isinstance(seg_res, list)
207 | assert len(seg_res) > 0
208 | input_data = {}
209 | if self.basic_token == 'char':
210 | string_c = self.char_tokenizer(seg_res)
211 | string_c = list(flatten_gen([sub_item for item in string_c for sub_item in split_long_sent(item, self.p.max_tokens)]))
212 | try:
213 | input_data['token'] = [item[0] for item in string_c]
214 | input_data['word'] = [item[1] for item in string_c]
215 | input_data['pos'] = [item[2] for item in string_c]
216 | if self.use_seg:
217 | input_data['seg'] = [item[3] for item in string_c]
218 | except Exception as e:
219 | print('char tokenizer error: ', e)
220 | print(string_c)
221 | if self.use_radical:
222 | input_data['radical'] = [get_radical(self.radical_dict, item) for item in input_data['token']]
223 | else:
224 | string_w = split_long_sent([item.split('-')
225 | for item in seg_res], self.p.max_tokens)
226 | input_data['token'] = [[each[0] for each in item] for item in string_w]
227 | input_data['pos'] = [[each[1] for each in item] for item in string_w]
228 | return input_data
229 |
230 | def predict_proba_batch(self, batch_data):
231 | split_text = batch_data['token']
232 | pos = batch_data['pos']
233 | if self.basic_token == 'char':
234 | segs = batch_data['seg']
235 | words = batch_data['word']
236 | else:
237 | segs = []
238 | words = []
239 | X = self.p.transform(batch_data)
240 | with graph.as_default():
241 | Y = self.model.model.predict_on_batch(X)
242 | return split_text, pos, Y, segs, words
243 |
244 | def _get_prob(self, pred):
245 | prob = np.max(pred, -1)
246 | return prob
247 |
248 | def _get_tags(self, pred):
249 | tags = self.p.inverse_transform([pred])
250 | tags = tags[0]
251 | return tags
252 |
253 | def _build_response(self, split_text, tags, poss, segs=[], words=[]):
254 | if self.basic_token == 'char':
255 | res = {
256 | 'words': split_text,
257 | 'pos': poss,
258 | 'char_pos': poss,
259 | 'char_word': words,
260 | 'seg': segs,
261 | 'entities': []
262 | }
263 | else:
264 | res = {
265 | 'words': split_text,
266 | 'pos': poss,
267 | 'entities': []
268 | }
269 | chunks = get_entities(tags)
270 | for chunk_type, chunk_start, chunk_end in chunks:
271 | chunk = self.post_process_chunk(chunk_type, chunk_start, chunk_end, split_text, poss)
272 | if chunk is not None:
273 | entity = {
274 | 'text': chunk,
275 | 'type': chunk_type,
276 | 'beginOffset': chunk_start,
277 | 'endOffset': chunk_end
278 | }
279 | res['entities'].append(entity)
280 | return res
281 |
282 | def post_process_chunk(self, chunk_type, chunk_start, chunk_end, split_text, pos):
283 | if chunk_type == 'Chunk':
284 | chunk_inner_words = split_text[chunk_start: chunk_end+1]
285 | chunk = ''.join(chunk_inner_words)
286 | check_char = not re.search(r'[^a-zA-Z0-9\u4e00-\u9fa5\.\+#]+', chunk)
287 | if len(chunk) < 15 and len(chunk) > 2 and check_char and len(chunk_inner_words) > 1:
288 | chunk_inner_poss = pos[chunk_start: chunk_end+1]
289 | filter_flag = any([check_in(chunk_inner_words, self.stopwords),
290 | check_in([chunk_inner_words[0]], self.stopwords_first),
291 | check_in([chunk_inner_words[-1]], self.stopwords_last),
292 | check_in(chunk_inner_poss, self.pos_filter),
293 | check_in([chunk_inner_poss[-1]], self.pos_filter_first)])
294 | if not filter_flag:
295 | return chunk
296 | else:
297 | return None
298 | else:
299 | return None
300 |
301 | def output(self, res):
302 | if self.verbose:
303 | print(res)
304 | words = res['words']
305 | poss = res['pos']
306 | dict_idx = tag_by_dict(words, self.tree)
307 | model_idx = [[item['beginOffset'], item['endOffset']] for item in res['entities']]
308 | new_idx = sorted(list(compare_idx(dict_idx, model_idx)), key=lambda x: x[1])
309 | new_idx = [item for item in new_idx if item[0] != item[1]]
310 | new_word = []
311 | new_pos = []
312 | new_chunk = []
313 | if self.basic_token == 'char':
314 | seg = res['seg']
315 | tag = ['O'] * len(seg)
316 | char_pos = res['char_pos']
317 | char_word = res['char_word']
318 | assert len(char_pos) == len(seg) == len(char_word)
319 | for s, e in new_idx:
320 | tag[s:e] = ['B-Chunk'] + ['I-Chunk'] * (e-s-1) + ['E-Chunk']
321 | chunks = {e: ''.join(words[s:e+1]) for s, e in new_idx}
322 | start = 0
323 | mid = 0
324 | for j, item_BEMS in enumerate(seg):
325 | if tag[j] == 'O':
326 | if item_BEMS == 'S':
327 | new_word.append(char_word[j])
328 | new_pos.append(char_pos[j])
329 | elif item_BEMS == 'E':
330 | if not tag[j-1].endswith('Chunk'):
331 | if not tag[start].endswith('Chunk'):
332 | new_word.append(char_word[j])
333 | else:
334 | new_word.append(''.join(words[mid:j]))
335 | else:
336 | new_word.append(words[j])
337 | new_pos.append(char_pos[j])
338 | else:
339 | if item_BEMS == 'B':
340 | start = j
341 | if tag[j+1].endswith('Chunk'):
342 | new_word.append(''.join(words[start:j]))
343 | new_pos.append(char_pos[j])
344 | if tag[j-1].endswith('Chunk') and item_BEMS == 'M':
345 | mid = j
346 | elif tag[j] == 'E-Chunk':
347 | try:
348 | chunk = chunks[j]
349 | if chunk in self.qualifier_dict:
350 | qualifier_word = self.qualifier_dict[chunk]
351 | new_word.append(qualifier_word)
352 | new_chunk.append(qualifier_word)
353 | else:
354 | new_word.append(chunk)
355 | new_chunk.append(chunk)
356 | except Exception as e:
357 | print(e)
358 | new_pos.append('np')
359 | else:
360 | chunks = {item[1]: ''.join(words[item[0]: item[1]+1]) for item in new_idx}
361 | if self.verbose:
362 | print(chunks)
363 | chunk_idx = [i for item in new_idx for i in range(item[0], item[1] + 1)]
364 | for i, item in enumerate(words):
365 | if i not in chunk_idx:
366 | new_word.append(item)
367 | new_pos.append(poss[i])
368 | else:
369 | if i in chunks.keys():
370 | chunk = chunks[i]
371 | if chunk in self.qualifier_dict:
372 | qualifier_word = self.qualifier_dict[chunk]
373 | new_word.append(qualifier_word)
374 | new_chunk.append(qualifier_word)
375 | else:
376 | new_word.append(chunk)
377 | new_chunk.append(chunk)
378 | new_pos.append('np')
379 | try:
380 | assert len(new_word) == len(new_pos)
381 | except Exception as e:
382 | print('new word list length does not equal new pos list length')
383 | print(new_word, len(new_word))
384 | print(new_pos, len(new_pos))
385 | print(chunks)
386 | print(dict_idx, model_idx, new_idx)
387 | return (new_word, new_pos, new_chunk) # C_WORD=0 C_POS=1 C_CHUNK=2
388 |
389 | def analyze(self, text):
390 | assert isinstance(text, list) or isinstance(text, tuple)
391 | final_res = []
392 | sent_idx, batch_data = self.data_generator(text)
393 | split_text, split_pos, pred, segs, word = self.predict_proba_batch(batch_data)
394 | split_text = [split_text[sent_idx[i]:sent_idx[i+1]] for i in range(len(sent_idx)-1)]
395 | split_pos = [split_pos[sent_idx[i]:sent_idx[i+1]] for i in range(len(sent_idx)-1)]
396 | pred = [np.array(pred[sent_idx[i]:sent_idx[i+1]]) for i in range(len(sent_idx)-1)]
397 | if self.verbose:
398 | print(pred)
399 | if self.basic_token == 'char':
400 | segs = [segs[sent_idx[i]:sent_idx[i+1]] for i in range(len(sent_idx)-1)]
401 | word = [word[sent_idx[i]:sent_idx[i+1]] for i in range(len(sent_idx)-1)]
402 | assert len(segs) == len(split_text) == len(pred)
403 | for k, item in enumerate(pred):
404 | tmp_y = [y[:len(x)] for x, y in zip(split_text[k], item)]
405 | Y = np.concatenate(tmp_y)
406 | words = list(flatten_gen(split_text[k]))
407 | poss = list(flatten_gen(split_pos[k]))
408 | # assert len(words) == len(poss)
409 | if self.basic_token == 'char':
410 | split_segs = list(flatten_gen(segs[k]))
411 | split_words = list(flatten_gen(word[k]))
412 | else:
413 | split_segs = []
414 | split_words = []
415 | tags = self._get_tags(Y)
416 | if self.verbose:
417 | print(tags)
418 | # prob = self._get_prob(Y)
419 | res = self._build_response(words, tags, poss, split_segs, split_words)
420 | final_res.append(self.output(res))
421 | return final_res, self.wrong
422 |
--------------------------------------------------------------------------------
/nlp_toolkit/chunk_segmentor/tests/data.sh:
--------------------------------------------------------------------------------
1 | rm model_data.zip
2 | zip -qr model_data.zip model_data
3 | echo "zip data folder successfully"
4 | md5sum model_data.zip > model_data.md5
5 | echo "calculate md5 successfully"
6 | hadoop fs -rm chunk_segmentor/model_data.md5
7 | hadoop fs -rm chunk_segmentor/model_data.zip
8 | hadoop fs -put model_data.zip chunk_segmentor
9 | hadoop fs -put model_data.md5 chunk_segmentor
10 | echo "commit new data file to hdfs successfully"
11 | PUTFILE_1=model_data.md5
12 | PUTFILE_2=model_data.zip
13 | ftp -v -n 192.168.8.23 << EOF
14 | user yilei.wang ifchange0829FWGR
15 | delete chunk_segmentor/model_data.md5
16 | delete chunk_segmentor/model_data.zip
17 | put model_data.md5 chunk_segmentor/model_data.md5
18 | put model_data.zip chunk_segmentor/model_data.zip
19 | bye
20 | EOF
21 | echo "commit new data file to ftp successfully"
22 |
--------------------------------------------------------------------------------
/nlp_toolkit/chunk_segmentor/tests/test_functions.py:
--------------------------------------------------------------------------------
1 | import sys
2 | # sys.path.append('../../..')
3 | from nlp_toolkit.chunk_segmentor import Chunk_Segmentor
4 | import time
5 | import os
6 |
7 | VERBOSE = 1
8 | text = '主要配合UI设计师100%还原设计图,使用前端技术解决各大浏览器兼容问题,使用HTML5+css3完成页面优化及提高用户体验,www.s.com使用bootstrap、jQuery完成界面效果展示,使用JavaScript完成页面功能展示,并且,在规定的时间内提前完成任务大大提高了工作的效率'
9 |
10 | print('test model loading')
11 | cutter = Chunk_Segmentor(verbose=VERBOSE)
12 |
13 | print('test Chunk_Segmentor object reload')
14 | start = time.time()
15 | cutter = Chunk_Segmentor(verbose=VERBOSE)
16 | if time.time() - start < 1:
17 | pass
18 | else:
19 | print('not pass reload model. Quit!')
20 | sys.exit()
21 |
22 | '''
23 | print('test switch model')
24 | cutter = Chunk_Segmentor(model_name='char-rnn', verbose=VERBOSE)
25 | print(list(cutter.cut(text)))
26 | '''
27 |
28 | print('test cutting performance')
29 | cutter = Chunk_Segmentor(verbose=VERBOSE)
30 | start = time.time()
31 | print(list(cutter.cut(text, pos=False)))
32 | print('cut single sentence used {:04.2f}s'.format(time.time() - start))
33 | print('test pos')
34 | print(list(cutter.cut(text, pos=True)))
35 | print('test cut_all')
36 | print(list(cutter.cut(text, cut_all=True)))
37 |
38 | print('test user dict')
39 | fin = open('user_dict.txt', 'w', encoding='utf8')
40 | fin.write('用户体验 np\n')
41 | fin.close()
42 | cutter = Chunk_Segmentor(verbose=VERBOSE, user_dict='user_dict.txt')
43 | print(list(cutter.cut(text)))
44 | os.system('rm user_dict.txt')
45 |
46 | text_list = [text] * 10000
47 | start = time.time()
48 | result = list(cutter.cut(text_list, pos=False))
49 | print('cut 10000 sentences no pos used {:04.2f}s'.format(time.time() - start))
50 | start = time.time()
51 | result = list(cutter.cut(text_list))
52 | print('cut 10000 sentences used {:04.2f}s'.format(time.time() - start))
53 |
54 | print('test fast mode')
55 | cutter = Chunk_Segmentor(mode='fast', verbose=VERBOSE)
56 | print(list(cutter.cut(text)))
57 | print('test cut_all')
58 | print(list(cutter.cut(text, cut_all=True)))
59 | start = time.time()
60 | result = list(cutter.cut(text_list))
61 | print('cut 10000 sentences in fast mode used {:04.2f}s'.format(time.time() - start))
62 |
63 | print('test all pass')
64 |
--------------------------------------------------------------------------------
/nlp_toolkit/chunk_segmentor/tests/test_speed.py:
--------------------------------------------------------------------------------
1 | import sys
2 | # sys.path.append('../../..')
3 | from nlp_toolkit.chunk_segmentor import Chunk_Segmentor
4 |
5 | mode = sys.argv[1]
6 |
7 | if mode == 'short':
8 | text = '这是一个能够输出名词短语的分词器,欢迎试用!'
9 | elif mode == 'long':
10 | text = '主要配合UI设计师100%还原设计图,使用前端技术解决各大浏览器兼容问题,使用HTML5+css3完成页面优化及提高用户体验,www.s.com使用bootstrap、jQuery完成界面效果展示,使用JavaScript完成页面功能展示,并且,在规定的时间内提前完成任务大大提高了工作的效率'
11 |
12 |
13 | def load_fast():
14 | return Chunk_Segmentor(mode='fast')
15 |
16 |
17 | def test_fast():
18 | return list(CUTTER.cut([text] * 10000))
19 |
20 |
21 | def load_accurate():
22 | return Chunk_Segmentor(mode='accurate')
23 |
24 |
25 | def test_accurate():
26 | return list(CUTTER.cut([text] * 10000))
27 |
28 |
29 | if __name__ == "__main__":
30 | import cProfile
31 | global CUTTER
32 | CUTTER = load_accurate()
33 | cProfile.run("test_accurate()", filename='chunk_speed_accurate_%s.out' % mode)
34 | CUTTER = load_fast()
35 | cProfile.run("test_fast()", filename='chunk_speed_fast_%s.out' % mode)
36 |
--------------------------------------------------------------------------------
/nlp_toolkit/chunk_segmentor/trie.py:
--------------------------------------------------------------------------------
1 | """Trie树结构"""
2 |
3 |
4 | class TrieNode(object):
5 | def __init__(self):
6 | """
7 | Initialize your data structure here.
8 | """
9 | self.data = {}
10 | self.is_word = False
11 |
12 |
13 | class Trie(object):
14 | def __init__(self):
15 | self.root = TrieNode()
16 |
17 | def insert(self, word):
18 | """
19 | Inserts a word into the trie.
20 | :type word: str
21 | :rtype: void
22 | """
23 | node = self.root
24 | for letter in word:
25 | child = node.data.get(letter)
26 | if not child:
27 | node.data[letter] = TrieNode()
28 | node = node.data[letter]
29 | node.is_word = True
30 |
31 | def search(self, word):
32 | """
33 | Returns whether the word is in the trie.
34 | :type word: str
35 | :rtype: bool
36 | """
37 | node = self.root
38 | for letter in word:
39 | node = node.data.get(letter)
40 | if not node:
41 | return False
42 | return node.is_word  # whether the word is stored in the trie as a complete word
43 |
44 | def starts_with(self, prefix):
45 | """
46 | Returns whether there is any word in the trie
47 | that starts with the given prefix.
48 | :type prefix: str
49 | :rtype: bool
50 | """
51 | node = self.root
52 | for letter in prefix:
53 | node = node.data.get(letter)
54 | if not node:
55 | return False
56 | return True
57 |
58 | def get_start(self, prefix):
59 | """
60 | Returns the words in the trie that start with the given prefix
61 | :param prefix:
62 | :return: words (list)
63 | """
64 | def _get_key(pre, pre_node):
65 | words_list = []
66 | if pre_node.is_word:
67 | words_list.append(pre)
68 | for x in pre_node.data.keys():
69 | words_list.extend(_get_key(pre + str(x), pre_node.data.get(x)))
70 | return words_list
71 |
72 | words = []
73 | if not self.starts_with(prefix):
74 | return words
75 | if self.search(prefix):
76 | words.append(prefix)
77 | return words
78 | node = self.root
79 | for letter in prefix:
80 | node = node.data.get(letter)
81 | return _get_key(prefix, node)
82 |
83 |
84 | if __name__ == '__main__':
85 | tree = Trie()
86 | tree.insert('深度学习')
87 | tree.insert('深度神经网络')
88 | tree.insert('深度网络')
89 | tree.insert('机器学习')
90 | tree.insert('机器学习模型')
91 | print(tree.search('深度学习'))
92 | print(tree.search('机器学习模型'))
93 | print(tree.get_start('深度'))
94 | print(tree.get_start('深度网'))
95 |
--------------------------------------------------------------------------------
/nlp_toolkit/chunk_segmentor/utils.py:
--------------------------------------------------------------------------------
1 | """一些nlp的常用函数"""
2 |
3 | import re
4 | import itertools
5 | import collections
6 | from hanziconv import HanziConv
7 |
8 |
9 | # flatten a nested list
10 | # ['1', '12', ['abc', 'df'], ['a']] ---> ['1','12','abc','df','a']
11 | def flatten(x):
12 | tmp = [([i] if isinstance(i, str) else i) for i in x]
13 | return list(itertools.chain(*tmp))
14 |
15 |
16 | def flatten_gen(x):
17 | for i in x:
18 | if isinstance(i, list) or isinstance(i, tuple):
19 | for inner_i in i:
20 | yield inner_i
21 | else:
22 | yield i
23 |
24 |
25 | def n_grams(a, n):
26 | z = (itertools.islice(a, i, None) for i in range(n))
27 | return zip(*z)
28 |
29 |
30 | def tag_by_dict(word_list, tree):
31 | idx = []
32 | length = len(word_list)
33 | start_idx = 0
34 | end_idx = 0
35 | while start_idx < length - 1:
36 | tmp_end_idx = 0
37 | tmp_chunk = ''.join(word_list[start_idx: end_idx+1])
38 | while tree.starts_with(tmp_chunk) and end_idx < length:
39 | tmp_end_idx = end_idx
40 | end_idx += 1
41 | tmp_chunk = ''.join(word_list[start_idx: end_idx+1])
42 | if tmp_end_idx != 0 and tree.search(''.join(word_list[start_idx: end_idx])):
43 | idx.append([start_idx, tmp_end_idx])
44 | start_idx += 1
45 | end_idx = start_idx
46 | if idx != []:
47 | idx = list(combine_idx(idx))
48 | return idx
49 |
50 |
51 | # merge overlapping chunk index spans
52 | def combine_idx(idx_list):
53 | l_idx = len(idx_list)
54 | if l_idx > 1:
55 | idx = 0
56 | used = []
57 | last_idx = l_idx - 1
58 | while idx <= l_idx - 2:
59 | if idx_list[idx+1][0] > idx_list[idx][1]:
60 | if idx not in used:
61 | yield idx_list[idx]
62 | if idx + 1 == last_idx:
63 | yield idx_list[idx+1]
64 | idx += 1
65 | else:
66 | start = idx_list[idx][0]
67 | while idx_list[idx+1][0] <= idx_list[idx][1]:
68 | end = idx_list[idx+1][1]
69 | used.append(idx)
70 | idx += 1
71 | if idx > l_idx - 2:
72 | break
73 | used.append(idx)
74 | yield [start, end]
75 | else:
76 | yield idx_list[0]
77 |
78 |
79 | def combine_two_idx(x, y):
80 | if x[0] >= y[0] and x[1] <= y[1]:
81 | return y
82 | elif x[0] < y[0] and x[1] > y[1]:
83 | return x
84 | else:
85 | all_idx = set(x + y)
86 | return [min(all_idx), max(all_idx)]
87 |
88 |
89 | def compare_idx(dict_idx, model_idx):
90 | if dict_idx == model_idx or dict_idx == []:
91 | for idx in model_idx:
92 | yield idx
93 | elif model_idx == []:
94 | for idx in dict_idx:
95 | yield idx
96 | else:
97 | union_idx = dict_idx + model_idx
98 | uniq_idx = [list(x) for x in set([tuple(x) for x in union_idx])]
99 | sort_idx = sorted(uniq_idx, key=lambda x: (x[0], x[1]))
100 | for idx in list(combine_idx(sort_idx)):
101 | yield idx
102 |
103 |
104 | def word_length(segs):
105 | cnt = []
106 | i = 0
107 | for item in segs:
108 | if item == 'E':
109 | i += 1
110 | cnt.append(i)
111 | i = 0
112 | elif item == 'S':
113 | cnt.append(1)
114 | else:
115 | i += 1
116 | return cnt
117 |
118 |
119 | # split list2 into sub-lists according to the item lengths of list1
120 | def split_sublist(list1, list2):
121 | if len(list1) == 1:
122 | return [list2]
123 | else:
124 | list1_len = [len(item) for item in list1]
125 | new_list = []
126 | for i in range(len(list1)):
127 | if i == 0:
128 | start = 0
129 | end = sum(list1_len[:i+1])
130 | new_list.append(list2[start: end])
131 | start = end
132 | return new_list
133 |
134 |
135 | def output_reform(a, b, mode, dict_loaded=False):
136 | if mode == 'accurate':
137 | if dict_loaded:
138 | a = a.replace('_', '')
139 | return a + '-' + b
140 | else:
141 | return (a, b)
142 |
143 |
144 | def reshape_term(term, qualifier_word=None, mode='accurate', dict_loaded=False):
145 | # pos = str(term.nature)
146 | # word = term.word
147 | term = str(term).split('/')
148 | pos = term[1]
149 | word = term[0]
150 | if pos == 'np':
151 | if word in qualifier_word:
152 | return output_reform(qualifier_word[word], pos, mode, dict_loaded)
153 | else:
154 | return output_reform(word, pos, mode, dict_loaded)
155 | else:
156 | return output_reform(word, pos, mode, dict_loaded)
157 |
158 |
159 | def hanlp_cut(sent_list, segmentor, qualifier_word=None, mode='accurate'):
160 | if qualifier_word is None:
161 | if mode == 'accurate':
162 | res = [[term.word + '-' + str(term.nature) for term in segmentor.segment(sub)] for sub in sent_list]
163 | else:
164 | res = [[(term.word, str(term.nature)) for term in segmentor.segment(sub)] for sub in sent_list]
165 | else:
166 | res = [[reshape_term(term, qualifier_word, mode) for term in segmentor.segment(sub)] for sub in sent_list]
167 | return res
168 |
169 |
170 | def jieba_cut(sent_list, segmentor, qualifier_word=None, mode='accurate', dict_loaded=False):
171 | if qualifier_word is None:
172 | if mode == 'accurate':
173 | res = [[word + '-' + flag for word, flag in segmentor.cut(sub)] for sub in sent_list]
174 | else:
175 | res = [[(word, flag) for word, flag in segmentor.cut(sub)] for sub in sent_list]
176 | else:
177 | res = [[reshape_term(term, qualifier_word, mode, dict_loaded) for term in segmentor.cut(sub)] for sub in sent_list]
178 | return res
179 |
180 |
181 | EMOJI_UNICODE = r'[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\u2600-\u26FF\u2700-\u27BF]'
182 | REGEX_STR = [
183 | r'转发微博|欢迎转发|^回复|…{2,}|图片评论', # Weibo-specific stopwords
184 | r'<[^>]+>', # HTML tags
185 | r'/{0,2}@\w+-?\w*[::]?', # @-mentions
186 | # r'#.+#', # hash-tags
187 | # URLs
188 | # r'(?:https?://|www\.)(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
189 | # r'\b[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-] +\.[a-zA-Z0-9-.] +\b' # E-MAIL
190 | r'[\s\w\d]+;'
191 | ]
192 | START_PATTERN = [
193 | r'\* ',
194 | r'\d{1,2}\.\d{1,2}\.\d{1,2}', # 1.2.1
195 | r'\d+\t',
196 | r'([1-9][0-9]){1,2}[。;::,,、\.\t/]{1}\s?(?![年月日\d+])',
197 | r'([1-9][0-9]){1,2}[))]{1}、?',
198 | r' \| ',
199 | r'\n[1-9][0-9]',
200 | r'\n{2,}',
201 | r'(?|&\w+;?|br\s*|li>', '', string)
219 | string = re.sub(invalid_char, '', string)
220 | string = re.sub(r'|||', '', string)
221 | string = re.sub(r'[ \u3000]+', 's_', string)
222 | string = re.sub(invalid_unicode, 'ss_', string)
223 | string = re.sub(lang_char, 'lan_', string)
224 | # string = re.sub(r'(工作描述|工作职责|岗位职责|任职要求)(:|:)', '', string)
225 | string = HanziConv.toSimplified(strQ2B(string))
226 | string = re.sub(
227 | r'[^\u4e00-\u9fa5\u0020-\u007f,。!?;、():\n\u2029\u2028a-zA-Z0-9]+', '', string)
228 | return string
229 |
230 |
231 | # sentence-splitting strategy (about 3x slower than splitting directly)
232 | def sent_split(string):
233 | string = re.sub(END_PATTERN, '\\1', re.sub(
234 | START_PATTERN, '\\1', string))
235 | return [item for item in re.split(r'\n|\u2029|\u2028|', string) if item != '']
236 |
237 |
238 | def strQ2B(ustring):
239 | """全角转半角"""
240 | rstring = ""
241 | for uchar in ustring:
242 | inside_code = ord(uchar)
243 | if inside_code == 12288:
244 | inside_code = 32
245 | elif (inside_code >= 65281 and inside_code <= 65374):
246 | inside_code -= 65248
247 | rstring += chr(inside_code)
248 | return rstring.lower().strip()
249 |
250 |
251 | '''
252 | File operations
253 | '''
254 |
255 |
256 | # read a text file line by line
257 | def read_line(fname):
258 | return open(fname, encoding='utf8').read().split('\n')
259 |
260 |
261 | # save an object to a text file
262 | def save_line(obj, fname='result.txt'):
263 | with open(fname, 'w', encoding='utf8') as f:
264 | if isinstance(obj, list):
265 | for k, v in enumerate(obj):
266 | v = str(v)
267 | if v != '\n' and k != len(obj) - 1:
268 | f.write(v + '\n')
269 | else:
270 | f.write(v)
271 | if isinstance(obj, collections.Counter) or isinstance(obj, dict):
272 | row = 0
273 | for k, v in sorted(obj.items(), key=lambda x: x[1], reverse=True):
274 | v = str(v)
275 | if str(v) != '\n' and k != len(obj) - 1:
276 | f.write(k + '\t' + str(v) + '\n')
277 | row += 1
278 | else:
279 | f.write(k + '\t' + str(v))
280 |
--------------------------------------------------------------------------------
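A small sketch of how the dictionary-matching helpers above work together with the Trie from trie.py; the phrases and the segmented sentence are illustrative:

# dictionary-based chunk lookup with Trie + tag_by_dict
from nlp_toolkit.chunk_segmentor.trie import Trie
from nlp_toolkit.chunk_segmentor.utils import tag_by_dict

tree = Trie()
for phrase in ['深度学习', '深度神经网络', '机器学习模型']:
    tree.insert(phrase)

# tag_by_dict scans a word-segmented sentence and returns inclusive [start, end]
# word-index spans whose concatenation is a phrase stored in the trie;
# overlapping spans are merged by combine_idx
words = ['我', '喜欢', '深度', '学习', '和', '机器', '学习', '模型']
print(tag_by_dict(words, tree))  # -> [[2, 3], [5, 7]]
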
/nlp_toolkit/classifier.py:
--------------------------------------------------------------------------------
1 | """
2 | Classifier Wrapper
3 | """
4 |
5 | import sys
6 | import time
7 | from nlp_toolkit.models import bi_lstm_attention
8 | from nlp_toolkit.models import Transformer
9 | from nlp_toolkit.models import textCNN, DPCNN
10 | from nlp_toolkit.trainer import Trainer
11 | from nlp_toolkit.utilities import logger
12 | from nlp_toolkit.sequence import BasicIterator
13 | from nlp_toolkit.data import Dataset
14 | from typing import List, Dict
15 | from copy import deepcopy
16 | from sklearn.metrics import classification_report
17 |
18 | # TODO
19 | # 1. evaluate func
20 | class Classifier(object):
21 | """
22 | Classifier Model Zoos. Include following models:
23 |
24 | 1. TextCNN
25 | 2. DPCNN (Deep Pyramid CNN)
26 | 3. Bi-LSTM-Attention
27 | 4. Multi-Head-Self-Attention (Transformer)
28 | 5. HAN (Hierarchical Attention Network)
29 | """
30 |
31 | def __init__(self, model_name, dataset: Dataset, seq_type='bucket'):
32 | self.model_name = model_name
33 | self.dataset = dataset
34 | self.transformer = dataset.transformer
35 | if dataset.mode == 'train':
36 | self.config = self.dataset.config
37 | self.m_cfg = self.config['model'][self.model_name]
38 | self.seq_type = seq_type
39 | if seq_type == 'bucket':
40 | self.config['maxlen'] = None
41 | self.model = self.get_model()
42 | self.model_trainer = self.get_trainer()
43 | elif dataset.mode == 'predict' or dataset.mode == 'eval':
44 | pass
45 | else:
46 | logger.warning('invalid mode name. Currently only supports "train", "eval" and "predict"')
47 |
48 | def get_model(self):
49 | if self.model_name == 'bi_lstm_att':
50 | model = bi_lstm_attention(
51 | nb_classes=self.config['nb_classes'],
52 | nb_tokens=self.config['nb_tokens'],
53 | maxlen=self.config['maxlen'],
54 | embedding_dim=self.config['embedding_dim'],
55 | embeddings=self.config['token_embeddings'],
56 | rnn_size=self.m_cfg['rnn_size'],
57 | attention_dim=self.m_cfg['attention_dim'],
58 | final_dropout_rate=self.m_cfg['final_drop_rate'],
59 | embed_dropout_rate=self.m_cfg['embed_drop_rate'],
60 | return_attention=self.m_cfg['return_att']
61 | )
62 | elif self.model_name == 'transformer':
63 | model = Transformer(
64 | nb_classes=self.config['nb_classes'],
65 | nb_tokens=self.config['nb_tokens'],
66 | maxlen=self.config['maxlen'],
67 | embedding_dim=self.config['embedding_dim'],
68 | embeddings=self.config['token_embeddings'],
69 | pos_embed=self.m_cfg['pos_embed'],
70 | nb_transformer=self.m_cfg['nb_transformer'],
71 | final_dropout_rate=self.m_cfg['final_drop_rate'],
72 | embed_dropout_rate=self.m_cfg['embed_drop_rate']
73 | )
74 | elif self.model_name == 'text_cnn':
75 | model = textCNN(
76 | nb_classes=self.config['nb_classes'],
77 | nb_tokens=self.config['nb_tokens'],
78 | maxlen=self.config['maxlen'],
79 | embedding_dim=self.config['embedding_dim'],
80 | embeddings=self.config['token_embeddings'],
81 | conv_kernel_size=self.m_cfg['conv_kernel_size'],
82 | pool_size=self.m_cfg['pool_size'],
83 | nb_filters=self.m_cfg['nb_filters'],
84 | fc_size=self.m_cfg['fc_size'],
85 | embed_dropout_rate=self.m_cfg['embed_drop_rate']
86 | )
87 | elif self.model_name == 'dpcnn':
88 | model = DPCNN(
89 | nb_classes=self.config['nb_classes'],
90 | nb_tokens=self.config['nb_tokens'],
91 | maxlen=self.config['maxlen'],
92 | embedding_dim=self.config['embedding_dim'],
93 | embeddings=self.config['token_embeddings'],
94 | region_kernel_size=self.m_cfg['region_kernel_size'],
95 | conv_kernel_size=self.m_cfg['conv_kernel_size'],
96 | pool_size=self.m_cfg['pool_size'],
97 | nb_filters=self.m_cfg['nb_filters'],
98 | repeat_time=self.m_cfg['repeat_time'],
99 | final_dropout_rate=self.m_cfg['final_drop_rate'],
100 | embed_dropout_rate=self.m_cfg['embed_drop_rate']
101 | )
102 | else:
103 | logger.warning('The model name ' + self.model_name + ' is unknown')
104 | model = None
105 | return model
106 |
107 | def get_trainer(self):
108 | t_cfg = self.config['train']
109 | model_trainer = Trainer(
110 | self.model,
111 | model_name=self.model_name,
112 | task_type=self.config['task_type'],
113 | batch_size=t_cfg['batch_size'],
114 | max_epoch=t_cfg['epochs'],
115 | train_mode=t_cfg['train_mode'],
116 | fold_cnt=t_cfg['nb_fold'],
117 | test_size=t_cfg['test_size'],
118 | metric=['f1'],
119 | nb_bucket=t_cfg['nb_bucket'],
120 | patiences=t_cfg['patiences']
121 | )
122 | return model_trainer
123 |
124 | def train(self):
125 | if self.model_name == 'bi_lstm_att':
126 | return_att = self.m_cfg['return_att']
127 | else:
128 | return_att = False
129 | return self.model_trainer.train(
130 | self.dataset.texts, self.dataset.labels,
131 | self.transformer, self.seq_type, return_att)
132 |
133 | def predict(self, x: Dict[str, List[List[str]]], batch_size=64,
134 | return_attention=False, return_prob=False):
135 | n_labels = len(self.transformer._label_vocab._id2token)
136 | x_c = deepcopy(x)
137 | start = time.time()
138 | x_len = [item[-1] for item in x_c['token']]
139 | x_c['token'] = [item[:-1] for item in x_c['token']]
140 | x_seq = BasicIterator('classification', self.transformer,
141 | x_c, batch_size=batch_size)
142 | result = self.model.model.predict_generator(x_seq)
143 | if return_prob:
144 | y_pred = result[:, :n_labels]
145 | else:
146 | y_pred = self.transformer.inverse_transform(result[:, :n_labels])
147 | used_time = time.time() - start
148 | logger.info('predict {} samples used {:4.1f}s'.format(
149 | len(x['token']), used_time))
150 | if result.shape[1] > n_labels and self.model_name == 'bi_lstm_att':
151 | attention = result[:, n_labels:]
152 | attention = [attention[idx][:l] for idx, l in enumerate(x_len)]
153 | return y_pred, attention
154 | else:
155 | return y_pred
156 |
157 | def evaluate(self, x: Dict[str, List[List[str]]], y: List[str],
158 | batch_size=64):
159 | n_labels = len(self.transformer._label_vocab._id2token)
160 | y = [item[0] for item in y]
161 | x_c = deepcopy(x)
162 | x_len = [item[-1] for item in x_c['token']]
163 | x_c['token'] = [item[:-1] for item in x_c['token']]
164 | x_seq = BasicIterator('classification', self.transformer,
165 | x_c, batch_size=batch_size)
166 | result = self.model.model.predict_generator(x_seq)
167 | result = result[:, :n_labels]
168 | y_pred = self.transformer.inverse_transform(result, lengths=x_len)
169 | print(classification_report(y, y_pred))
170 |
171 | def load(self, weight_fname, para_fname):
172 | if self.model_name == 'bi_lstm_att':
173 | self.model = bi_lstm_attention.load(weight_fname, para_fname)
174 | elif self.model_name == 'transformer':
175 | self.model = Transformer.load(weight_fname, para_fname)
176 | elif self.model_name == 'text_cnn':
177 | self.model = textCNN.load(weight_fname, para_fname)
178 | elif self.model_name == 'dpcnn':
179 | self.model = DPCNN.load(weight_fname, para_fname)
180 | else:
181 | logger.warning('invalid model name')
182 | sys.exit()
183 |
--------------------------------------------------------------------------------
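A sketch of the intended training flow, assuming config_classification.yaml parses into a dict carrying the data / embed / train / model sections that Dataset and Classifier read above; the training file name is illustrative:

from ruamel.yaml import YAML
from nlp_toolkit.data import Dataset
from nlp_toolkit.classifier import Classifier

with open('config_classification.yaml', encoding='utf8') as fp:
    config = YAML().load(fp)

# 'train.txt' is an illustrative fastText-style file: __label__<class> <text> per line
dataset = Dataset(mode='train', fname='train.txt',
                  config=config, task_type='classification')
clf = Classifier('bi_lstm_att', dataset, seq_type='bucket')
clf.train()
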
/nlp_toolkit/config.py:
--------------------------------------------------------------------------------
1 | from ruamel.yaml import YAML
2 | from tensorflow.contrib.training import HParams
3 |
4 |
5 | class YParams(HParams):
6 | def __init__(self, yaml_fn, config_name):
7 | super().__init__()
8 | with open(yaml_fn, encoding='utf8') as fp:
9 | for k, v in YAML().load(fp)[config_name].items():
10 | self.add_hparam(k, v)
11 |
12 |
13 | if __name__ == "__main__":
14 | hparams = YParams('./config_classification.yaml', 'data')
15 | print(hparams.basic_token)
16 |
--------------------------------------------------------------------------------
/nlp_toolkit/data.py:
--------------------------------------------------------------------------------
1 | """
2 | Text preprocessing utilities
3 | """
4 |
5 | import re
6 | import os
7 | import sys
8 | from pathlib import Path
9 | from hanziconv import HanziConv
10 | from typing import Dict
11 | from nlp_toolkit.sequence import IndexTransformer
12 | from nlp_toolkit.utilities import load_vectors, logger, word2char
13 |
14 | EMOJI_UNICODE = r'[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\u2600-\u26FF\u2700-\u27BF]'
15 | REGEX_STR = [
16 | r'转发微博|欢迎转发|^回复|…{2,}|图片评论', # Weibo-specific stopwords
17 | r'<[^>]+>', # HTML tags
18 | r'/{0,2}@\w+-?\w*[::]?', # @-mentions
19 | r'#.+#', # hash-tags
20 | # URLs
21 | r'(?:https{0,1}?://|www\.)(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
22 | r'\b[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-] +\.[a-zA-Z0-9-.] +\b' # E-MAIL
23 | ]
24 | NEGATIVES = ['不', '没有', '无', '莫', '非', '没']
25 | ADVERSATIVES = ['但是', '但', '然而']
26 | SENT_SEP_LIST = r'[。!!?\?]+'
27 |
28 |
29 | # TODO
30 | class Dataset(object):
31 | """
32 | Clean text before further processing. Contains the following steps:
33 | 1. remove specific tokens (e.g. weibo emoticons, emojis, html tags etc.)
34 | 2. must contain Chinese character
35 | 3. simplify Chinese character
36 | 4. segment words supported by pyhanlp (removed)
37 | Then transforms texts and labels into index sequences according to the given task
38 |
39 | Data format by line:
40 | classification: __label__<class_1> __label__<class_2> ... <text>
41 | sequence labeling: token_1###label_1\ttoken_2###label_2\t... \ttoken_n###label_n
42 | language model: token_1 token_2 ... token_n
43 | """
44 |
45 | def __init__(self, mode, fname='', tran_fname='',
46 | config=None, task_type=None, data_format=''):
47 | self.mode = mode
48 | self.fname = fname
49 | self.inner_char = False
50 | self.use_seg = False
51 | self.use_radical = False
52 | self.radical_dict = None
53 |
54 | if data_format != '':
55 | self.data_format = data_format
56 |
57 | if config:
58 | self.basic_token = config['data']['basic_token']
59 | self.html_texts = re.compile(r'('+'|'.join(REGEX_STR)+')', re.UNICODE)
60 |
61 | if task_type:
62 | if mode == 'train' and config is None:
63 | logger.error('please specify the config file path')
64 | sys.exit()
65 | self.task_type = task_type
66 | else:
67 | try:
68 | self.task_type = re.findall(r'config_(\w+)\.yaml', config)[0]
69 | except:
70 | logger.error('please check your config filename')
71 | sys.exit()
72 |
73 | if mode == 'train':
74 | if 'data' in config:
75 | self.config = config
76 | self.data_config = config['data']
77 | self.embed_config = config['embed']
78 | if self.task_type == 'sequence':
79 | self.data_format = self.data_config['format']
80 | if self.basic_token == 'word':
81 | self.max_tokens = self.data_config['max_words']
82 | self.inner_char = self.data_config['inner_char']
83 | elif self.basic_token == 'char':
84 | self.max_tokens = self.data_config['max_chars']
85 | if self.task_type == 'sequence_labeling':
86 | self.use_seg = self.data_config['use_seg']
87 | self.use_radical = self.data_config['use_radical']
88 | if self.config['train']['metric'] not in ['f1_seq']:
89 | self.config['train']['metric'] = 'f1_seq'
90 | logger.warning('sequence labeling task currently only supports the f1_seq callback')
91 | elif self.task_type == 'classification':
92 | if self.config['train']['metric'] in ['f1_seq']:
93 | self.config['train']['metric'] = 'f1'
94 | logger.warning('text classification task does not support the f1_seq callback, changed to f1')
95 | else:
96 | logger.error('invalid token type, only support word and char')
97 | sys.exit()
98 | else:
99 | logger.error("please pass in the correct config dict")
100 | sys.exit()
101 |
102 | if self.basic_token == 'char':
103 | self.use_seg = config['data']['use_seg']
104 | self.use_radical = config['data']['use_radical']
105 |
106 | if self.use_radical:
107 | radical_file = Path(os.path.dirname(
108 | os.path.realpath(__file__))) / 'data' / 'radical.txt'
109 | self.radical_dict = {line.split()[0]: line.split()[1].strip()
110 | for line in open(radical_file, encoding='utf8')}
111 |
112 | self.transformer = IndexTransformer(
113 | task_type=self.task_type,
114 | max_tokens=self.max_tokens,
115 | max_inner_chars=self.data_config['max_inner_chars'],
116 | use_inner_char=self.inner_char,
117 | use_seg=self.use_seg,
118 | use_radical=self.use_radical,
119 | radical_dict=self.radical_dict,
120 | basic_token=self.basic_token)
121 |
122 | elif mode != 'train':
123 | if len(tran_fname) > 0:
124 | logger.info('transformer loaded')
125 | self.transformer = IndexTransformer.load(tran_fname)
126 | self.basic_token = self.transformer.basic_token
127 | self.use_seg = self.transformer.use_seg
128 | self.use_radical = self.transformer.use_radical
129 | self.inner_char = self.transformer.use_inner_char
130 | self.max_tokens = self.transformer.max_tokens
131 | else:
132 | logger.error("please pass in the transformer's filepath")
133 | sys.exit()
134 |
135 | if fname:
136 | self.load_data()
137 | self.fit()
138 | else:
139 | self.texts = []
140 | self.labels = []
141 |
142 | def clean(self, line):
143 | line = re.sub(r'\[[\u4e00-\u9fa5a-z]{1,4}\]|\[aloha\]', '', line)
144 | line = re.sub(EMOJI_UNICODE, '', line)
145 | line = re.sub(self.html_texts, '', line)
146 | if re.search(r'[\u4e00-\u9fa5]+', line):
147 | line = HanziConv.toSimplified(line)
148 | return re.sub(' {2,}|\t', ' ', line).lower()
149 | else:
150 | return None
151 |
152 | def load_data(self):
153 | if self.task_type == 'classification':
154 | self.load_tc_data()
155 | elif self.task_type == 'sequence_labeling':
156 | if self.mode != 'predict':
157 | self.load_sl_data()
158 | else:
159 | self.texts = [line.strip().split() for line in open(self.fname, 'r', encoding='utf8')]
160 | logger.info('data loaded')
161 |
162 | def load_tc_data(self, max_tokens_per_doc=256):
163 | """
164 | Reads a data file for text classification. The file should contain one document/text per line.
165 | The line should have the following format:
166 | __label__<class_name> <text>
167 | If you have a multi label task, you can have as many labels as you want at the beginning of the line, e.g.,
168 | __label__<class_name_1> __label__<class_name_2> <text>
169 | """
170 | label_prefix = '__label__'
171 | self.texts = []
172 | self.labels = []
173 |
174 | with open(self.fname, 'r', encoding='utf8') as fin:
175 | for line in fin:
176 | words = self.clean(line.strip()).split()
177 | if self.mode != 'predict':
178 | if words:
179 | nb_labels = 0
180 | label_line = []
181 | for word in words:
182 | if word.startswith(label_prefix):
183 | nb_labels += 1
184 | label = word.replace(label_prefix, "")
185 | label_line.append(label)
186 | else:
187 | break
188 | text = words[nb_labels:]
189 | if len(text) > max_tokens_per_doc:
190 | text = text[:max_tokens_per_doc]
191 | self.texts.append(text)
192 | self.labels.append(label_line)
193 | else:
194 | self.texts.append(words)
195 |
196 | def load_sl_data(self):
197 | """
198 | Reads a data file for sequence labeling. The file should contain one document/text per line.
199 | The line should have the following formats:
200 | 1. conll:
201 | word\ttag
202 | ...
203 | word\ttag
204 |
205 | word\ttag
206 | ...
207 | 2. basic:
208 | word###tag\tword###tag\t...word###tag
209 | """
210 | data = (line.strip() for line in open(self.fname, 'r', encoding='utf8'))
211 | if self.data_format == 'basic':
212 | self.texts, self.labels = zip(
213 | *[zip(*[item.rsplit('###', 1) for item in line.split('\t')]) for line in data])
214 | self.texts = list(map(list, self.texts))
215 | self.labels = list(map(list, self.labels))
216 | elif self.data_format == 'conll':
217 | self.texts, self.labels = self.process_conll(data)
218 | else:
219 | logger.warning('invalid data format for sequence labeling task')
220 | sys.exit()
221 |
222 | def process_conll(self, data):
223 | sents, labels = [], []
224 | tokens, tags = [], []
225 | for line in data:
226 | if line:
227 | token, tag = line.split('\t')
228 | tokens.append(token)
229 | tags.append(tag)
230 | else:
231 | sents.append(tokens)
232 | labels.append(tags)
233 | tokens, tags = [], []
234 | return sents, labels
235 |
236 | def add(self, line: Dict[str, str]):
237 | t = line['text'].strip().split()
238 | if self.mode == 'train':
239 | l = line['label'].strip().split()
240 | if self.task_type == 'sequence_labeling':
241 | assert len(t) == len(l)
242 | self.texts.append(t)
243 | self.labels.append(l)
244 | elif self.mode == 'predict':
245 | self.texts.append(t)
246 |
247 | # simple split on adversative conjunctions
248 | def adv_split(self, line):
249 | return re.sub('(' + '|'.join(ADVERSATIVES) + ')', r'', line)
250 |
251 | def fit(self):
252 | if self.mode != 'predict':
253 | if self.basic_token == 'char':
254 | if self.task_type == 'sequence_labeling':
255 | self.texts = [
256 | word2char(x, y, self.task_type, self.use_seg, self.radical_dict)
257 | for x, y in zip(self.texts, self.labels)]
258 | self.texts = {k: [dic[k] for dic in self.texts] for k in self.texts[0]}
259 | self.labels = self.texts['label']
260 | del self.texts['label']
261 | else:
262 | self.texts = {'token': [word2char(x, task_type=self.task_type) for x in self.texts]}
263 | else:
264 | self.texts = {'token': self.texts}
265 | if self.mode == 'train':
266 | self.config['mode'] = self.mode
267 | self.transformer.fit(self.texts['token'], self.labels)
268 | logger.info('transformer fitting complete')
269 | embed = {}
270 | if self.embed_config['pre']:
271 | token_embed, dim = load_vectors(
272 | self.embed_config[self.basic_token]['path'], self.transformer._token_vocab)
273 | embed[self.basic_token] = token_embed
274 | logger.info('Loaded Pre_trained Embeddings')
275 | else:
276 | logger.info('Use embeddings trained from scratch')
277 | dim = self.embed_config[self.basic_token]['dim']
278 | embed[self.basic_token] = None
279 | # update config
280 | self.config['nb_classes'] = self.transformer.label_size
281 | self.config['nb_tokens'] = self.transformer.token_vocab_size
282 | self.config['extra_features'] = []
283 | if self.inner_char:
284 | self.config['nb_char_tokens'] = self.transformer.char_vocab_size
285 | else:
286 | self.config['nb_char_tokens'] = 0
287 | self.config['use_inner_char'] = False
288 | if self.use_seg:
289 | self.config['nb_seg_tokens'] = self.transformer.seg_vocab_size
290 | self.config['extra_features'].append('seg')
291 | self.config['use_seg'] = self.use_seg
292 | else:
293 | self.config['nb_seg_tokens'] = 0
294 | self.config['use_seg'] = False
295 | if self.use_radical:
296 | self.config['nb_radical_tokens'] = self.transformer.radical_vocab_size
297 | self.config['extra_features'].append('radical')
298 | self.config['use_radical'] = self.use_radical
299 | else:
300 | self.config['nb_radical_tokens'] = 0
301 | self.config['use_radical'] = False
302 | self.config['embedding_dim'] = dim
303 | self.config['token_embeddings'] = embed[self.basic_token]
304 | self.config['maxlen'] = self.max_tokens
305 | self.config['task_type'] = self.task_type
306 | else:
307 | if self.basic_token == 'char':
308 | self.texts = [
309 | word2char(x, None, self.task_type, self.use_seg, self.radical_dict)
310 | for x in self.texts]
311 | self.texts = {k: [dic[k] for dic in self.texts]
312 | for k in self.texts[0]}
313 | else:
314 | self.texts = {'token': self.texts}
315 |
316 | lengths = [len(item) if len(item) <= self.max_tokens else self.max_tokens
317 | for item in self.texts['token']]
318 | self.texts['token'] = list(map(list, self.texts['token']))
319 | self.texts['token'] = [item + [lengths[idx]] for idx, item in enumerate(self.texts['token'])]
320 |
321 |
322 | # TODO
323 | class Sentence(object):
324 | """
325 | """
326 |
327 | def __init__(self, transformer):
328 | pass
329 |
--------------------------------------------------------------------------------
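The 'basic' sequence-labeling format described in load_sl_data can be illustrated in a couple of lines; the tokens and tag names below are made up for the example:

# how a 'basic'-format line is parsed (same rsplit('###', 1) logic as load_sl_data)
line = 'javascript###B-Chunk\t开发###E-Chunk\t经验###O'
tokens, tags = zip(*[item.rsplit('###', 1) for item in line.split('\t')])
print(list(tokens))  # ['javascript', '开发', '经验']
print(list(tags))    # ['B-Chunk', 'E-Chunk', 'O']
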
/nlp_toolkit/labeler.py:
--------------------------------------------------------------------------------
1 | """
2 | Sequence Labeler Wrapper
3 | """
4 |
5 | import sys
6 | import time
7 | import numpy as np
8 | from copy import deepcopy
9 | from nlp_toolkit.models import Word_RNN, IDCNN, Char_RNN
10 | from nlp_toolkit.trainer import Trainer
11 | from nlp_toolkit.utilities import logger
12 | from nlp_toolkit.sequence import BasicIterator
13 | from nlp_toolkit.data import Dataset
14 | from typing import List, Dict
15 | from seqeval.metrics import classification_report as sequence_report
16 |
17 |
18 | class Labeler(object):
19 | """
20 | Sequence Labeling Model Zoos. Include following models:
21 |
22 | 1. WordRNN + Inner_Char
23 | 2. CharRNN + Extra Embeddings (segment, radical, nchar)
24 | 3. IDCNN
25 | """
26 |
27 | def __init__(self, model_name, dataset: Dataset, seq_type='bucket'):
28 | self.model_name = model_name
29 | self.dataset = dataset
30 | self.transformer = dataset.transformer
31 | if dataset.mode == 'train':
32 | self.config = self.dataset.config
33 | self.m_cfg = self.config['model'][self.model_name]
34 | self.seq_type = seq_type
35 | if seq_type == 'bucket':
36 | self.config['maxlen'] = None
37 | self.model = self.get_model()
38 | self.model_trainer = self.get_trainer()
39 | elif dataset.mode == 'predict' or dataset.mode == 'eval':
40 | pass
41 | else:
42 | logger.warning('invalid mode name. Currently only supports "train", "eval" and "predict"')
43 |
44 | def get_model(self):
45 | if self.model_name == 'word_rnn':
46 | model = Word_RNN(
47 | nb_classes=self.config['nb_classes'],
48 | nb_tokens=self.config['nb_tokens'],
49 | nb_char_tokens=self.config['nb_char_tokens'],
50 | maxlen=self.config['maxlen'],
51 | embedding_dim=self.config['embedding_dim'],
52 | embeddings=self.config['token_embeddings'],
53 | inner_char=self.config['data']['inner_char'],
54 | use_crf=self.m_cfg['use_crf'],
55 | char_feature_method=self.m_cfg['char_feature_method'],
56 | integration_method=self.m_cfg['integration_method'],
57 | rnn_type=self.m_cfg['rnn_type'],
58 | nb_rnn_layers=self.m_cfg['nb_rnn_layers'],
59 | nb_filters=self.m_cfg['nb_filters'],
60 | conv_kernel_size=self.m_cfg['conv_kernel_size'],
61 | drop_rate=self.m_cfg['drop_rate'],
62 | re_drop_rate=self.m_cfg['re_drop_rate'],
63 | word_rnn_size=self.m_cfg['word_rnn_size'],
64 | embed_dropout_rate=self.m_cfg['embed_drop_rate']
65 | )
66 | elif self.model_name == 'char_rnn':
67 | model = Char_RNN(
68 | nb_classes=self.config['nb_classes'],
69 | nb_tokens=self.config['nb_tokens'],
70 | nb_seg_tokens=self.config['nb_seg_tokens'],
71 | nb_radical_tokens=self.config['nb_radical_tokens'],
72 | maxlen=self.config['maxlen'],
73 | embedding_dim=self.config['embedding_dim'],
74 | use_seg=self.config['use_seg'],
75 | use_radical=self.config['use_radical'],
76 | use_crf=self.m_cfg['use_crf'],
77 | rnn_type=self.m_cfg['rnn_type'],
78 | nb_rnn_layers=self.m_cfg['nb_rnn_layers'],
79 | drop_rate=self.m_cfg['drop_rate'],
80 | re_drop_rate=self.m_cfg['re_drop_rate'],
81 | char_rnn_size=self.m_cfg['char_rnn_size'],
82 | embed_dropout_rate=self.m_cfg['embed_drop_rate']
83 | )
84 | elif self.model_name == 'idcnn':
85 | model = IDCNN(
86 | nb_classes=self.config['nb_classes'],
87 | nb_tokens=self.config['nb_tokens'],
88 | maxlen=self.config['maxlen'],
89 | embedding_dim=self.config['embedding_dim'],
90 | embeddings=self.config['token_embeddings'],
91 | use_crf=self.m_cfg['use_crf'],
92 | nb_filters=self.m_cfg['nb_filters'],
93 | conv_kernel_size=self.m_cfg['conv_kernel_size'],
94 | drop_rate=self.m_cfg['drop_rate'],
95 | repeat_times=self.m_cfg['repeat_times'],
96 | dilation_rate=self.m_cfg['dilation_rate'],
97 | embed_dropout_rate=self.m_cfg['embed_drop_rate']
98 | )
99 | else:
100 | logger.warning('The model name ' + self.model_name + ' is unknown')
101 | model = None
102 | return model
103 |
104 | def get_trainer(self):
105 | t_cfg = self.config['train']
106 | model_trainer = Trainer(
107 | self.model,
108 | model_name=self.model_name,
109 | task_type=self.config['task_type'],
110 | batch_size=t_cfg['batch_size'],
111 | max_epoch=t_cfg['epochs'],
112 | train_mode=t_cfg['train_mode'],
113 | fold_cnt=t_cfg['nb_fold'],
114 | test_size=t_cfg['test_size'],
115 | metric=['f1_seq', 'seq_acc'],
116 | nb_bucket=t_cfg['nb_bucket'],
117 | patiences=t_cfg['patiences']
118 | )
119 | return model_trainer
120 |
121 | def train(self):
122 | return self.model_trainer.train(
123 | self.dataset.texts, self.dataset.labels,
124 | self.transformer, self.seq_type)
125 |
126 | def predict(self, x: Dict[str, List[List[str]]], batch_size=64,
127 | return_prob=False):
128 | start = time.time()
129 | x_c = deepcopy(x)
130 | x_len = [item[-1] for item in x_c['token']]
131 | x_c['token'] = [item[:-1] for item in x_c['token']]
132 | x_seq = BasicIterator('sequence_labeling', self.transformer,
133 | x_c, batch_size=batch_size)
134 | result = self.model.model.predict_generator(x_seq)
135 | if return_prob:
136 | y_pred = [result[idx][:l] for idx, l in enumerate(x_len)]
137 | else:
138 | y_pred = self.transformer.inverse_transform(result, lengths=x_len)
139 | used_time = time.time() - start
140 | logger.info('predict {} samples used {:4.1f}s'.format(
141 | len(x['token']), used_time))
142 | return y_pred
143 |
144 | def show_results(self, x, y_pred):
145 | return [[(xi, yi) for xi, yi in zip(xs, ys)] for xs, ys in zip(x, y_pred)]
146 |
147 | def evaluate(self, x: Dict[str, List[List[str]]], y: List[List[str]],
148 | batch_size=64):
149 | x_c = deepcopy(x)
150 | x_len = [item[-1] for item in x_c['token']]
151 | x_c['token'] = [item[:-1] for item in x_c['token']]
152 | x_seq = BasicIterator('sequence_labeling', self.transformer,
153 | x_c, batch_size=batch_size)
154 | result = self.model.model.predict_generator(x_seq)
155 | y_pred = self.transformer.inverse_transform(result, lengths=x_len)
156 | print(sequence_report(y, y_pred))
157 |
158 | def load(self, weight_fname, para_fname):
159 | if self.model_name == 'word_rnn':
160 | self.model = Word_RNN.load(weight_fname, para_fname)
161 | elif self.model_name == 'char_rnn':
162 | self.model = Char_RNN.load(weight_fname, para_fname)
163 | elif self.model_name == 'idcnn':
164 | self.model = IDCNN.load(weight_fname, para_fname)
165 | else:
166 | logger.warning('invalid model name')
167 | sys.exit()
168 |
--------------------------------------------------------------------------------
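A sketch of inference with a trained sequence labeler; every file name below is illustrative and stands for the artifacts produced during training (the fitted transformer, model weights and model params):

from nlp_toolkit.data import Dataset
from nlp_toolkit.labeler import Labeler

# in predict mode the input file holds one whitespace-tokenized sentence per line,
# and the transformer fitted at training time is loaded via tran_fname
dataset = Dataset(mode='predict', fname='to_predict.txt',
                  tran_fname='word_rnn_transformer.pkl',
                  task_type='sequence_labeling')
labeler = Labeler('word_rnn', dataset)
labeler.load('word_rnn_weights.h5', 'word_rnn_params.json')
y_pred = labeler.predict(dataset.texts, batch_size=64)
print(y_pred[0])
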
/nlp_toolkit/models/__init__.py:
--------------------------------------------------------------------------------
1 | # text classification models
2 | from nlp_toolkit.models.base_model import Base_Model
3 | from nlp_toolkit.models.bi_lstm_att import bi_lstm_attention
4 | from nlp_toolkit.models.text_cnn import textCNN
5 | from nlp_toolkit.models.transformer import Transformer
6 | from nlp_toolkit.models.dpcnn import DPCNN
7 | # sequence labeling models
8 | from nlp_toolkit.models.word_rnn import Word_RNN
9 | from nlp_toolkit.models.char_rnn import Char_RNN
10 | from nlp_toolkit.models.idcnn import IDCNN
11 |
--------------------------------------------------------------------------------
/nlp_toolkit/models/base_model.py:
--------------------------------------------------------------------------------
1 | import json
2 |
3 |
4 | class Base_Model(object):
5 | """
6 | Base Keras model for all SOTA models
7 | """
8 | def __init__(self):
9 | self.model = None
10 |
11 | def save(self, weights_file, params_file):
12 | self.save_weights(weights_file)
13 | self.save_params(params_file)
14 |
15 | def save_params(self, file_path, invalid_params={}):
16 | with open(file_path, 'w') as f:
17 | invalid_params = {'_loss', '_acc', 'model', 'invalid_params', 'token_embeddings'}.union(invalid_params)
18 | params = {name.lstrip('_'): val for name, val in vars(self).items()
19 | if name not in invalid_params}
20 | print('model hyperparameters:\n', params)
21 | json.dump(params, f, sort_keys=True, indent=4)
22 |
23 | def save_weights(self, filepath):
24 | self.model.save_weights(filepath)
25 |
26 | @classmethod
27 | def load(cls, weights_file, params_file):
28 | params = cls.load_params(params_file)
29 | self = cls(**params)
30 | self.forward()
31 | self.model.load_weights(weights_file)
32 | print('model loaded')
33 | return self
34 |
35 | @classmethod
36 | def load_params(cls, file_path):
37 | with open(file_path) as f:
38 | params = json.load(f)
39 | return params
40 |
--------------------------------------------------------------------------------
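The save/load contract in short: save_params() dumps the constructor arguments (minus the excluded attribute names) to JSON, and load() re-instantiates the class from that JSON, rebuilds the graph with forward(), then restores the weights. A sketch with illustrative file names; invalid_params is passed explicitly so non-serializable attributes such as the attention layer are skipped:

from nlp_toolkit.models import bi_lstm_attention

model = bi_lstm_attention(nb_classes=2, nb_tokens=10000, maxlen=100)
model.forward()                                              # build the Keras graph
model.save_weights('clf_weights.h5')
model.save_params('clf_params.json', model.invalid_params)   # skip non-JSON attributes
restored = bi_lstm_attention.load('clf_weights.h5', 'clf_params.json')
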
/nlp_toolkit/models/bi_lstm_att.py:
--------------------------------------------------------------------------------
1 | from nlp_toolkit.models import Base_Model
2 | from nlp_toolkit.modules.attentions import Attention
3 | from nlp_toolkit.modules.token_embedders import Token_Embedding
4 | from nlp_toolkit.modules.logits import tc_output_logits
5 | from nlp_toolkit.modules.custom_loss import custom_binary_crossentropy, custom_categorical_crossentropy
6 | from keras.layers import Input, Activation
7 | from keras.layers import LSTM, Bidirectional
8 | from keras.layers.merge import concatenate
9 | from keras.models import Model
10 |
11 |
12 | class bi_lstm_attention(Base_Model):
13 | """
14 | Model is modified from DeepMoji.
15 |
16 | Model structure: double bi-lstm followed by attention with some dropout techniques
17 |
18 | # Arguments:
19 |         nb_classes: number of classes in the dataset.
20 |         nb_tokens: number of tokens in the dataset (i.e. vocabulary size).
21 |         maxlen: Maximum length of an input sequence (in tokens).
22 | embedding_dim: Embedding layer output dim.
23 | embeddings: Embedding weights. Default word embeddings.
24 | feature_output: If True the model returns the penultimate
25 | feature vector rather than Softmax probabilities
26 | (defaults to False).
27 | embed_dropout_rate: Dropout rate for the embedding layer.
28 | final_dropout_rate: Dropout rate for the final Softmax layer.
29 |         embed_l2: L2 regularization for the embedding layer.
30 |
31 | # Returns:
32 | Model with the given parameters.
33 | """
34 |
35 | def __init__(self, nb_classes, nb_tokens, maxlen,
36 | embedding_dim=256, embeddings=None,
37 | rnn_size=512, attention_dim=None,
38 | embed_dropout_rate=0,
39 | final_dropout_rate=0, embed_l2=1E-6,
40 | return_attention=False):
41 |         super(bi_lstm_attention, self).__init__()
42 | self.nb_classes = nb_classes
43 | self.nb_tokens = nb_tokens
44 | self.maxlen = maxlen
45 | self.embedding_dim = embedding_dim
46 | self.rnn_size = rnn_size
47 | self.attention_dim = attention_dim
48 | if embeddings is not None:
49 | self.token_embeddings = [embeddings]
50 | else:
51 | self.token_embeddings = None
52 | self.embed_dropout_rate = embed_dropout_rate
53 | self.final_dropout_rate = final_dropout_rate
54 | self.return_attention = return_attention
55 | self.attention_layer = Attention(
56 | attention_dim=attention_dim,
57 | return_attention=return_attention, name='attlayer')
58 |
59 | self.invalid_params = {'attention_layer'}
60 |
61 | def forward(self):
62 | model_input = Input(shape=(self.maxlen,), dtype='int32', name='token')
63 | x = Token_Embedding(model_input, self.nb_tokens, self.embedding_dim,
64 | self.token_embeddings, True, self.maxlen,
65 | self.embed_dropout_rate, name='token_embeddings')
66 | x = Activation('tanh')(x)
67 |
68 | # skip-connection from embedding to output eases gradient-flow and allows access to lower-level features
69 | # ordering of the way the merge is done is important for consistency with the pretrained model
70 | lstm_0_output = Bidirectional(
71 | LSTM(self.rnn_size, return_sequences=True), name="bi_lstm_0")(x)
72 | lstm_1_output = Bidirectional(
73 | LSTM(self.rnn_size, return_sequences=True), name="bi_lstm_1")(lstm_0_output)
74 | x = concatenate([lstm_1_output, lstm_0_output, x], name='concatenate')
75 |
76 | x = self.attention_layer(x)
77 | if self.return_attention:
78 | x, weights = x
79 | outputs = tc_output_logits(x, self.nb_classes, self.final_dropout_rate)
80 | if self.return_attention:
81 | outputs.append(weights)
82 | outputs = concatenate(outputs, axis=-1, name='outputs')
83 |
84 | self.model = Model(inputs=model_input,
85 | outputs=outputs, name="Bi_LSTM_Attention")
86 |
87 | def get_loss(self):
88 | if self.nb_classes == 2:
89 | if self.return_attention:
90 | return custom_binary_crossentropy
91 | else:
92 | return 'binary_crossentropy'
93 | elif self.nb_classes > 2:
94 | if self.return_attention:
95 | return custom_categorical_crossentropy
96 | else:
97 | return 'categorical_crossentropy'
98 |
99 | def get_metrics(self):
100 | return ['acc']
101 |
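A hedged usage sketch (sizes hypothetical): return_attention appends the attention weights to the output tensor, so get_loss() switches to the custom losses that score only the class columns of y_pred:

    from nlp_toolkit.models import bi_lstm_attention

    model = bi_lstm_attention(nb_classes=2, nb_tokens=20000, maxlen=100,
                              return_attention=True)
    model.forward()
    model.model.compile(loss=model.get_loss(),        # custom_binary_crossentropy here
                        optimizer='adam', metrics=model.get_metrics())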
--------------------------------------------------------------------------------
/nlp_toolkit/models/char_rnn.py:
--------------------------------------------------------------------------------
1 | from nlp_toolkit.models import Base_Model
2 | from nlp_toolkit.modules.token_embedders import Token_Embedding
3 | from nlp_toolkit.modules.logits import sl_output_logits
4 | from keras.layers import Input, BatchNormalization
5 | from keras.layers import LSTM, GRU, Bidirectional
6 | from keras.layers.merge import concatenate
7 | from keras.models import Model
8 | import sys
9 |
10 |
11 | class Char_RNN(Base_Model):
12 | """
13 |     Similar model structure to Word_RNN, but uses the char as the basic token.
14 |     Some useful char-level features can be included: 1. radicals 2. segmentation tags 3. nchar
15 | """
16 |
17 | def __init__(self, nb_classes, nb_tokens, maxlen,
18 | embedding_dim=64, use_crf=True,
19 | use_seg=False, use_radical=False,
20 | use_nchar=False,
21 | nb_seg_tokens=None, nb_radical_tokens=None,
22 | rnn_type='lstm', nb_rnn_layers=2,
23 | char_rnn_size=128, drop_rate=0.5,
24 | re_drop_rate=0.15, embed_dropout_rate=0.15):
25 | self.nb_classes = nb_classes
26 | self.nb_tokens = nb_tokens
27 | self.maxlen = maxlen
28 | self.embedding_dim = embedding_dim
29 | self.use_crf = use_crf
30 | self.use_seg = use_seg
31 | self.use_radical = use_radical
32 |         self.use_nchar = use_nchar
33 | self.rnn_type = rnn_type
34 | self.nb_rnn_layers = nb_rnn_layers
35 | self.drop_rate = drop_rate
36 | self.re_drop_rate = re_drop_rate
37 | self.char_rnn_size = char_rnn_size
38 | self.embed_dropout_rate = embed_dropout_rate
39 | if use_seg:
40 | self.nb_seg_tokens = nb_seg_tokens
41 | if use_radical:
42 | self.nb_radical_tokens = nb_radical_tokens
43 |
44 | self.invalid_params = {}
45 |         super(Char_RNN, self).__init__()
46 |
47 | def forward(self):
48 | char_ids = Input(shape=(self.maxlen,), dtype='int32', name='token')
49 | input_data = [char_ids]
50 | char_embed = Token_Embedding(
51 | char_ids, self.nb_tokens,
52 | self.embedding_dim, None, True,
53 | self.maxlen, self.embed_dropout_rate, name='char_embeddings')
54 | embed_features = [char_embed]
55 | if self.use_seg:
56 | seg_ids = Input(shape=(self.maxlen,), dtype='int32', name='seg')
57 | input_data.append(seg_ids)
58 |             seg_embed = Token_Embedding(
59 | seg_ids, self.nb_seg_tokens, 8, None, True,
60 | self.maxlen, name='seg_embeddings')
61 |             embed_features.append(seg_embed)
62 | if self.use_radical:
63 | radical_ids = Input(shape=(self.maxlen,), dtype='int32', name='radical')
64 | input_data.append(radical_ids)
65 | radical_embed = Token_Embedding(
66 | radical_ids, self.nb_radical_tokens, 32,
67 | None, True, self.maxlen, name='radical_embeddings')
68 | embed_features.append(radical_embed)
69 | if self.use_nchar:
70 | pass
71 | if self.use_seg or self.use_radical:
72 | x = concatenate(embed_features, axis=-1, name='embed')
73 | else:
74 | x = char_embed
75 | x = BatchNormalization()(x)
76 |
77 | for i in range(self.nb_rnn_layers):
78 | if self.rnn_type == 'lstm':
79 | x = Bidirectional(
80 | LSTM(self.char_rnn_size, dropout=self.drop_rate,
81 | recurrent_dropout=self.re_drop_rate,
82 | return_sequences=True), name='char_lstm_%d' % (i+1))(x)
83 | elif self.rnn_type == 'gru':
84 | x = Bidirectional(
85 | GRU(self.char_rnn_size, dropout=self.drop_rate,
86 | recurrent_dropout=self.re_drop_rate,
87 | return_sequences=True), name='char_gru_%d' % (i+1))(x)
88 | else:
89 | print('invalid rnn type, only support lstm and gru')
90 | sys.exit()
91 |
92 | outputs, self._loss, self._acc = sl_output_logits(
93 | x, self.nb_classes, self.use_crf)
94 | self.model = Model(inputs=input_data, outputs=outputs)
95 |
96 | def get_loss(self):
97 | return self._loss
98 |
99 | def get_metrics(self):
100 | return self._acc
101 |
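A hedged instantiation sketch (all sizes hypothetical): enabling use_seg/use_radical requires passing the matching vocabulary sizes and adds extra named inputs next to 'token':

    from nlp_toolkit.models import Char_RNN

    tagger = Char_RNN(nb_classes=5, nb_tokens=6000, maxlen=100,
                      use_seg=True, nb_seg_tokens=6,
                      use_radical=True, nb_radical_tokens=300)
    tagger.forward()   # builds a model with 'token', 'seg' and 'radical' inputs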
--------------------------------------------------------------------------------
/nlp_toolkit/models/dpcnn.py:
--------------------------------------------------------------------------------
1 | from nlp_toolkit.models import Base_Model
2 | from nlp_toolkit.modules.logits import tc_output_logits
3 | from nlp_toolkit.modules.token_embedders import Token_Embedding
4 | from keras.layers import Input, Dense, add, Activation
5 | from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D
6 | from keras.layers.merge import concatenate
7 | from keras.models import Model
8 |
9 |
10 | class DPCNN(Base_Model):
11 | """
12 | Deep Pyramid CNN
13 |     Three key points of DPCNN:
14 | 1. region embeddings
15 | 2. fixed feature maps
16 | 3. residual connection
17 | """
18 |
19 | def __init__(self, nb_classes, nb_tokens, maxlen,
20 | embedding_dim=256, embeddings=None,
21 | region_kernel_size=[3, 4, 5],
22 | conv_kernel_size=3, nb_filters=250, pool_size=3,
23 | repeat_time=2,
24 | embed_dropout_rate=0.15, final_dropout_rate=0.25):
25 |         super(DPCNN, self).__init__()
26 | self.nb_classes = nb_classes
27 | self.nb_tokens = nb_tokens
28 | self.maxlen = maxlen
29 | self.embedding_dim = embedding_dim
30 | if embeddings is not None:
31 | self.token_embeddings = [embeddings]
32 | else:
33 | self.token_embeddings = None
34 | self.region_kernel_size = region_kernel_size
35 | self.conv_kernel_size = conv_kernel_size
36 | self.nb_filters = nb_filters
37 | self.pool_size = pool_size
38 | self.repeat_time = repeat_time
39 | self.embed_dropout_rate = embed_dropout_rate
40 | self.final_dropout_rate = final_dropout_rate
41 | self.invalid_params = {}
42 |
43 | def forward(self):
44 | model_input = Input(shape=(self.maxlen,), dtype='int32', name='token')
45 | # region embedding
46 | x = Token_Embedding(model_input, self.nb_tokens, self.embedding_dim,
47 | self.token_embeddings, False, self.maxlen,
48 | self.embed_dropout_rate, name='token_embeddings')
49 | if isinstance(self.region_kernel_size, list):
50 | region = [Conv1D(self.nb_filters, f, padding='same')(x)
51 | for f in self.region_kernel_size]
52 | region_embedding = add(region, name='region_embeddings')
53 | else:
54 | region_embedding = Conv1D(
55 | self.nb_filters, self.region_kernel_size, padding='same', name='region_embeddings')(x)
56 | # same padding convolution
57 | x = Activation('relu')(region_embedding)
58 | x = Conv1D(self.nb_filters, self.conv_kernel_size,
59 | padding='same', name='conv_1')(x)
60 | x = Activation('relu')(x)
61 | x = Conv1D(self.nb_filters, self.conv_kernel_size,
62 | padding='same', name='conv_2')(x)
63 | # residual connection
64 | x = add([x, region_embedding], name='pre_block_hidden')
65 |
66 | for k in range(self.repeat_time):
67 | x = self._block(x, k)
68 | x = GlobalMaxPooling1D()(x)
69 | outputs = tc_output_logits(x, self.nb_classes, self.final_dropout_rate)
70 |
71 | self.model = Model(inputs=model_input,
72 | outputs=outputs, name="Deep Pyramid CNN")
73 |
74 | def _block(self, x, k):
75 | x = MaxPooling1D(self.pool_size, strides=2)(x)
76 | last_x = x
77 | x = Activation('relu')(x)
78 | x = Conv1D(self.nb_filters, self.conv_kernel_size,
79 | padding='same', name='block_%d_conv_1' % k)(x)
80 | x = Activation('relu')(x)
81 | x = Conv1D(self.nb_filters, self.conv_kernel_size,
82 | padding='same', name='block_%d_conv_2' % k)(x)
83 | # residual connection
84 | x = add([x, last_x])
85 | return x
86 |
87 | def get_loss(self):
88 | if self.nb_classes == 2:
89 | return 'binary_crossentropy'
90 | elif self.nb_classes > 2:
91 | return 'categorical_crossentropy'
92 |
93 | def get_metrics(self):
94 | return ['acc']
95 |
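A hedged sketch (sizes hypothetical): a list-valued region_kernel_size sums several same-padding convolutions into the region embedding, and each repeated block roughly halves the time dimension (MaxPooling1D with stride 2), so maxlen should remain positive after repeat_time poolings:

    from nlp_toolkit.models import DPCNN

    model = DPCNN(nb_classes=4, nb_tokens=30000, maxlen=200,
                  region_kernel_size=[3, 4, 5], repeat_time=3)
    model.forward()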
--------------------------------------------------------------------------------
/nlp_toolkit/models/han.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stevewyl/nlp_toolkit/257dabd300b29957a0be38e7a8049a54f2095ccc/nlp_toolkit/models/han.py
--------------------------------------------------------------------------------
/nlp_toolkit/models/idcnn.py:
--------------------------------------------------------------------------------
1 | from nlp_toolkit.models import Base_Model
2 | from nlp_toolkit.modules.token_embedders import Token_Embedding
3 | from nlp_toolkit.modules.logits import sl_output_logits
4 | from keras.layers import Input, Dropout, Activation
5 | from keras.layers import Conv1D, MaxPooling1D
6 | from keras.layers.merge import concatenate
7 | from keras.models import Model
8 |
9 |
10 | class IDCNN(Base_Model):
11 | """
12 |     Iterated Dilated Convolutional Neural Networks (ID-CNN) with CRF
13 | """
14 |
15 | def __init__(self, nb_classes,
16 | nb_tokens,
17 | maxlen,
18 | embeddings=None,
19 | embedding_dim=64,
20 | embed_dropout_rate=0.25,
21 | drop_rate=0.5,
22 | nb_filters=64,
23 | conv_kernel_size=3,
24 | dilation_rate=[1, 1, 2],
25 | repeat_times=4,
26 | use_crf=True,
27 | ):
28 |         super(IDCNN, self).__init__()
29 | self.nb_classes = nb_classes
30 | self.nb_tokens = nb_tokens
31 | self.maxlen = maxlen
32 | self.embedding_dim = embedding_dim
33 | self.embed_dropout_rate = embed_dropout_rate
34 | self.drop_rate = drop_rate
35 | self.nb_filters = nb_filters
36 | self.conv_kernel_size = conv_kernel_size
37 | self.dilation_rate = dilation_rate
38 | self.repeat_times = repeat_times
39 | self.use_crf = use_crf
40 | if embeddings is not None:
41 | self.token_embeddings = [embeddings]
42 | else:
43 | self.token_embeddings = None
44 | self.invalid_params = {}
45 |
46 | def forward(self):
47 | word_ids = Input(shape=(self.maxlen,), dtype='int32', name='token')
48 | input_data = [word_ids]
49 | embed = Token_Embedding(word_ids, self.nb_tokens, self.embedding_dim,
50 | self.token_embeddings, False, self.maxlen,
51 | self.embed_dropout_rate, name='token_embeddings')
52 | layerInput = Conv1D(
53 | self.nb_filters, self.conv_kernel_size, padding='same', name='conv_first')(embed)
54 | dilation_layers = []
55 | totalWidthForLastDim = 0
56 | for j in range(self.repeat_times):
57 | for i in range(len(self.dilation_rate)):
58 | islast = True if i == len(self.dilation_rate) - 1 else False
59 | conv = Conv1D(self.nb_filters, self.conv_kernel_size, use_bias=True,
60 | padding='same', dilation_rate=self.dilation_rate[i],
61 | name='atrous_conv_%d_%d' % (j, i))(layerInput)
62 | conv = Activation('relu')(conv)
63 | if islast:
64 | dilation_layers.append(conv)
65 | totalWidthForLastDim += self.nb_filters
66 | layerInput = conv
67 | dilation_conv = concatenate(
68 | dilation_layers, axis=-1, name='dilated_conv')
69 |         # keep a defined tensor even when dropout is disabled
70 |         enc = Dropout(self.drop_rate)(dilation_conv) if self.drop_rate > 0 else dilation_conv
71 |
72 | outputs, self._loss, self._acc = sl_output_logits(
73 | enc, self.nb_classes, self.use_crf)
74 | self.model = Model(inputs=input_data, outputs=outputs)
75 |
76 | def get_loss(self):
77 | return self._loss
78 |
79 | def get_metrics(self):
80 | return self._acc
81 |
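A hedged sketch (sizes hypothetical): the dilated stack is applied repeat_times times and only the last dilation of each pass is kept before the CRF/softmax head; with use_crf=True the CRF loss and accuracy returned by get_loss()/get_metrics() must be passed to compile:

    from nlp_toolkit.models import IDCNN

    tagger = IDCNN(nb_classes=7, nb_tokens=20000, maxlen=150,
                   dilation_rate=[1, 1, 2], repeat_times=4, use_crf=True)
    tagger.forward()
    tagger.model.compile(loss=tagger.get_loss(), optimizer='adam',
                         metrics=tagger.get_metrics())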
--------------------------------------------------------------------------------
/nlp_toolkit/models/text_cnn.py:
--------------------------------------------------------------------------------
1 | from nlp_toolkit.models import Base_Model
2 | from nlp_toolkit.modules.logits import tc_output_logits
3 | from nlp_toolkit.modules.token_embedders import Token_Embedding
4 | from keras.layers import Input, Dense, Flatten, Dropout
5 | from keras.layers import Conv1D, MaxPooling1D
6 | from keras.layers.merge import concatenate
7 | from keras.models import Model
8 |
9 |
10 | class textCNN(Base_Model):
11 | """
12 |     The well-known Kim CNN model for text classification.
13 |     It uses multi-channel CNNs to encode texts.
14 | """
15 |
16 | def __init__(self, nb_classes, nb_tokens, maxlen,
17 | embedding_dim=256, embeddings=None, embed_l2=1E-6,
18 | conv_kernel_size=[3, 4, 5], pool_size=[2, 2, 2],
19 | nb_filters=128, fc_size=128,
20 | embed_dropout_rate=0.25, final_dropout_rate=0.5):
21 |         super(textCNN, self).__init__()
22 | self.nb_classes = nb_classes
23 | self.nb_tokens = nb_tokens
24 | self.maxlen = maxlen
25 | self.embedding_dim = embedding_dim
26 | self.nb_filters = nb_filters
27 | self.pool_size = pool_size
28 | self.conv_kernel_size = conv_kernel_size
29 | self.fc_size = fc_size
30 | self.final_dropout_rate = final_dropout_rate
31 | self.embed_dropout_rate = embed_dropout_rate
32 |
33 | # core layer: multi-channel cnn-pool layers
34 | self.cnn_list = [Conv1D(
35 | nb_filters, f, padding='same', name='conv_%d' % k) for k, f in enumerate(conv_kernel_size)]
36 | self.pool_list = [MaxPooling1D(p, name='pool_%d' % k)
37 | for k, p in enumerate(pool_size)]
38 | self.fc = Dense(fc_size, activation='relu',
39 | kernel_initializer='he_normal')
40 | if embeddings is not None:
41 | self.token_embeddings = [embeddings]
42 | else:
43 | self.token_embeddings = None
44 | self.invalid_params = {'cnn_list', 'pool_list', 'fc'}
45 |
46 | def forward(self):
47 | model_input = Input(shape=(self.maxlen,), dtype='int32', name='token')
48 | x = Token_Embedding(model_input, self.nb_tokens, self.embedding_dim,
49 | self.token_embeddings, False, self.maxlen,
50 | self.embed_dropout_rate, name='token_embeddings')
51 | cnn_combine = []
52 | for i in range(len(self.conv_kernel_size)):
53 | cnn = self.cnn_list[i](x)
54 | pool = self.pool_list[i](cnn)
55 | cnn_combine.append(pool)
56 | x = concatenate(cnn_combine, axis=-1)
57 |
58 | x = Flatten()(x)
59 | x = Dropout(self.final_dropout_rate)(x)
60 | x = self.fc(x)
61 |
62 | outputs = tc_output_logits(x, self.nb_classes, self.final_dropout_rate)
63 |
64 | self.model = Model(inputs=model_input,
65 | outputs=outputs, name="TextCNN")
66 |
67 | def get_loss(self):
68 | if self.nb_classes == 2:
69 | return 'binary_crossentropy'
70 | elif self.nb_classes > 2:
71 | return 'categorical_crossentropy'
72 |
73 | def get_metrics(self):
74 | return ['acc']
75 |
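A brief sketch (sizes hypothetical): conv_kernel_size and pool_size are parallel lists, one conv/pool branch per entry, so they should have the same length:

    from nlp_toolkit.models import textCNN

    model = textCNN(nb_classes=2, nb_tokens=20000, maxlen=100,
                    conv_kernel_size=[2, 3, 4], pool_size=[2, 2, 2])
    model.forward()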
--------------------------------------------------------------------------------
/nlp_toolkit/models/transformer.py:
--------------------------------------------------------------------------------
1 | from nlp_toolkit.models import Base_Model
2 | from nlp_toolkit.modules.attentions import Self_Attention
3 | from nlp_toolkit.modules.token_embedders import Position_Embedding
4 | from nlp_toolkit.modules.token_embedders import Token_Embedding
5 | from nlp_toolkit.modules.logits import tc_output_logits
6 | from keras.layers import Input, GlobalAveragePooling1D
7 | from keras.models import Model
8 |
9 |
10 | class Transformer(Base_Model):
11 | """
12 |     Multi-Head Self-Attention Model.
13 |     Uses the Transformer encoder architecture to encode texts.
14 |
15 |     # Arguments:
16 |         1. nb_transformer: the number of self-attention layers.
17 |         2. nb_head: the number of attention heads in one layer
18 |         3. head_size: the hidden size of each attention head
19 |         4. pos_embed: whether to use position embeddings
20 | """
21 |
22 | def __init__(self, nb_classes, nb_tokens, maxlen,
23 | nb_head=8, head_size=16, nb_transformer=2,
24 | embedding_dim=256, embeddings=None, embed_l2=1E-6,
25 | pos_embed=False, final_dropout_rate=0.15,
26 | embed_dropout_rate=0.15):
27 | self.nb_classes = nb_classes
28 | self.nb_tokens = nb_tokens
29 | self.maxlen = maxlen
30 | self.nb_head = nb_head
31 | self.head_size = head_size
32 | self.embedding_dim = embedding_dim
33 | self.nb_transformer = nb_transformer
34 | if embeddings is not None:
35 | self.token_embeddings = [embeddings]
36 | else:
37 | self.token_embeddings = None
38 | self.pos_embed = pos_embed
39 | self.final_dropout_rate = final_dropout_rate
40 | self.embed_dropout_rate = embed_dropout_rate
41 | self.pos_embed_layer = Position_Embedding(name='position_embedding')
42 | self.transformers = [Self_Attention(
43 | nb_head, head_size, name='self_attention_%d' % i) for i in range(nb_transformer)]
44 | self.pool = GlobalAveragePooling1D()
45 | self.invalid_params = {'pos_embed_layer', 'transformers', 'pool'}
46 |
47 | def forward(self):
48 | model_input = Input(shape=(self.maxlen,), dtype='int32', name='token')
49 | x = Token_Embedding(model_input, self.nb_tokens, self.embedding_dim,
50 | self.token_embeddings, False, self.maxlen,
51 | self.embed_dropout_rate, name='token_embeddings')
52 | if self.pos_embed:
53 | x = self.pos_embed_layer(x)
54 | for i in range(self.nb_transformer):
55 | x = self.transformers[i]([x, x, x])
56 | x = self.pool(x)
57 | outputs = tc_output_logits(x, self.nb_classes, self.final_dropout_rate)
58 | self.model = Model(inputs=model_input,
59 | outputs=outputs, name="Self_Multi_Head_Attention")
60 |
61 | def get_loss(self):
62 | if self.nb_classes == 2:
63 | return 'binary_crossentropy'
64 | elif self.nb_classes > 2:
65 | return 'categorical_crossentropy'
66 |
67 | def get_metrics(self):
68 | return ['acc']
69 |
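A hedged sketch (sizes hypothetical): each self-attention layer outputs nb_head * head_size channels, and pos_embed=True sums sinusoidal position information into the token embeddings before the first layer:

    from nlp_toolkit.models import Transformer

    model = Transformer(nb_classes=3, nb_tokens=20000, maxlen=120,
                        nb_head=8, head_size=16, nb_transformer=2, pos_embed=True)
    model.forward()   # embeddings -> position embedding -> 2 self-attention layers -> pooling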
--------------------------------------------------------------------------------
/nlp_toolkit/models/word_rnn.py:
--------------------------------------------------------------------------------
1 | from nlp_toolkit.models import Base_Model
2 | from nlp_toolkit.modules.token_embedders import Token_Embedding
3 | from nlp_toolkit.modules.logits import sl_output_logits
4 | from keras.layers import Input, Activation, TimeDistributed, Dense
5 | from keras.layers import LSTM, GRU, Bidirectional
6 | from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D
7 | from keras.layers import subtract, multiply, add, Lambda
8 | from keras.layers.merge import concatenate
9 | from keras.models import Model
10 | import keras.backend as K
11 | import sys
12 |
13 |
14 | class Word_RNN(Base_Model):
15 | """
16 |     Baseline sequence labeling model. The basic token is the word.
17 |     Supports the following extensions:
18 |     1. Extract inner-char features with an LSTM or CNN
19 |     2. Combine word and char features via concatenation or attention
20 | """
21 |
22 | def __init__(self, nb_classes, nb_tokens, maxlen,
23 | nb_char_tokens=None, max_charlen=10,
24 | embedding_dim=128, char_embedding_dim=32,
25 | word_rnn_size=128, char_rnn_size=32,
26 | embeddings=None, char_embeddings=None,
27 | inner_char=False, use_crf=True,
28 | char_feature_method='rnn',
29 | integration_method='concat',
30 | rnn_type='lstm',
31 | nb_rnn_layers=1,
32 | nb_filters=32,
33 | conv_kernel_size=2,
34 | drop_rate=0.5,
35 | re_drop_rate=0.15,
36 | embed_l2=1E-6,
37 | embed_dropout_rate=0.15):
38 |         super(Word_RNN, self).__init__()
39 | self.nb_classes = nb_classes
40 | self.nb_tokens = nb_tokens
41 | self.maxlen = maxlen
42 | self.embedding_dim = embedding_dim
43 | self.rnn_type = rnn_type
44 | self.nb_rnn_layers = nb_rnn_layers
45 | self.drop_rate = drop_rate
46 | self.re_drop_rate = re_drop_rate
47 | self.use_crf = use_crf
48 | self.inner_char = inner_char
49 | self.word_rnn_size = word_rnn_size
50 | self.embed_dropout_rate = embed_dropout_rate
51 |
52 | if self.inner_char:
53 | self.integration_method = integration_method
54 | self.char_feature_method = char_feature_method
55 | self.max_charlen = max_charlen
56 | self.nb_char_tokens = nb_char_tokens
57 | self.char_embedding_dim = char_embedding_dim
58 | if char_feature_method == 'rnn':
59 | if self.integration_method == 'attention':
60 | self.char_rnn_size = int(self.embedding_dim / 2)
61 | else:
62 | self.char_rnn_size = char_rnn_size
63 | elif char_feature_method == 'cnn':
64 | self.nb_filters = nb_filters
65 | self.conv_kernel_size = conv_kernel_size
66 | if self.integration_method == 'attention':
67 | self.nb_filters = self.embedding_dim
68 | if embeddings is not None:
69 | self.token_embeddings = [embeddings]
70 | else:
71 | self.token_embeddings = None
72 | if char_feature_method == 'rnn':
73 | self.mask_zero = True
74 | else:
75 | self.mask_zero = False
76 | self.char_lstm = LSTM(char_rnn_size, return_sequences=False)
77 | self.char_gru = GRU(char_rnn_size, return_sequences=False)
78 | self.conv = Conv1D(
79 | kernel_size=conv_kernel_size, filters=self.nb_filters, padding='same')
80 | self.fc_tanh = Dense(
81 | embedding_dim, kernel_initializer="glorot_uniform", activation='tanh')
82 | self.fc_sigmoid = Dense(embedding_dim, activation='sigmoid')
83 |
84 | self.invalid_params = {'char_lstm', 'char_gru', 'mask_zero',
85 | 'conv', 'fc_tanh', 'fc_sigmoid'}
86 |
87 | def forward(self):
88 | word_ids = Input(shape=(self.maxlen,), dtype='int32', name='token')
89 | input_data = [word_ids]
90 | x = Token_Embedding(word_ids, self.nb_tokens, self.embedding_dim,
91 | self.token_embeddings, True, self.maxlen,
92 | self.embed_dropout_rate)
93 |
94 | # char features
95 | if self.inner_char:
96 | char_ids = Input(batch_shape=(None, None, None),
97 | dtype='int32', name='char')
98 | input_data.append(char_ids)
99 | x_c = Token_Embedding(
100 | char_ids, input_dim=self.nb_char_tokens,
101 | output_dim=self.char_embedding_dim,
102 | mask_zero=self.mask_zero, name='char_embeddings',
103 | time_distributed=True)
104 | if self.char_feature_method == 'rnn':
105 | if self.rnn_type == 'lstm':
106 | char_feature = TimeDistributed(
107 | Bidirectional(self.char_lstm), name="char_lstm")(x_c)
108 | elif self.rnn_type == 'gru':
109 | char_feature = TimeDistributed(
110 | Bidirectional(self.char_gru), name="char_gru")(x_c)
111 | else:
112 | print('invalid rnn type, only support lstm and gru')
113 | sys.exit()
114 | elif self.char_feature_method == 'cnn':
115 | conv1d_out = TimeDistributed(self.conv, name='char_cnn')(x_c)
116 | char_feature = TimeDistributed(
117 | GlobalMaxPooling1D(), name='char_pooling')(conv1d_out)
118 | if self.integration_method == 'concat':
119 | concat_tensor = concatenate([x, char_feature], axis=-1, name='concat_feature')
120 | elif self.integration_method == 'attention':
121 | word_embed_dense = self.fc_tanh(x)
122 | char_embed_dense = self.fc_tanh(char_feature)
123 | attention_evidence_tensor = add(
124 | [word_embed_dense, char_embed_dense])
125 | attention_output = self.fc_sigmoid(attention_evidence_tensor)
126 | part1 = multiply([attention_output, x])
127 | tmp = subtract([Lambda(lambda x: K.ones_like(x))(
128 | attention_output), attention_output])
129 | part2 = multiply([tmp, char_feature])
130 | concat_tensor = add([part1, part2], name='attention_feature')
131 |
132 | # rnn encoder
133 | if self.inner_char:
134 | enc = concat_tensor
135 | else:
136 | enc = x
137 | for i in range(self.nb_rnn_layers):
138 | if self.rnn_type == 'lstm':
139 | enc = Bidirectional(
140 | LSTM(self.word_rnn_size, dropout=self.drop_rate,
141 | recurrent_dropout=self.re_drop_rate,
142 | return_sequences=True), name='word_lstm_%d' % (i+1))(enc)
143 | elif self.rnn_type == 'gru':
144 | enc = Bidirectional(
145 | GRU(self.word_rnn_size, dropout=self.drop_rate,
146 | recurrent_dropout=self.re_drop_rate,
147 | return_sequences=True), name='word_gru_%d' % (i+1))(enc)
148 | else:
149 | print('invalid rnn type, only support lstm and gru')
150 | sys.exit()
151 |
152 | # output logits
153 | outputs, self._loss, self._acc = sl_output_logits(
154 | enc, self.nb_classes, self.use_crf)
155 | self.model = Model(inputs=input_data, outputs=outputs)
156 |
157 | def get_loss(self):
158 | return self._loss
159 |
160 | def get_metrics(self):
161 | return self._acc
162 |
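A hedged sketch (sizes hypothetical): with inner_char=True the model expects a second 'char' input holding per-word char ids, as produced by IndexTransformer with use_inner_char=True:

    from nlp_toolkit.models import Word_RNN

    tagger = Word_RNN(nb_classes=10, nb_tokens=30000, maxlen=80,
                      inner_char=True, nb_char_tokens=4000,
                      char_feature_method='cnn', integration_method='concat')
    tagger.forward()   # inputs: 'token' (batch, maxlen) and 'char' (batch, None, None)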
--------------------------------------------------------------------------------
/nlp_toolkit/modules/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/stevewyl/nlp_toolkit/257dabd300b29957a0be38e7a8049a54f2095ccc/nlp_toolkit/modules/__init__.py
--------------------------------------------------------------------------------
/nlp_toolkit/modules/attentions/__init__.py:
--------------------------------------------------------------------------------
1 | from .attention import Attention
2 | from .self_attention import Self_Attention
3 | from .multi_dim_attention import Multi_Dim_Attention
4 |
--------------------------------------------------------------------------------
/nlp_toolkit/modules/attentions/attention.py:
--------------------------------------------------------------------------------
1 | from keras.engine import Layer
2 | from keras import backend as K
3 |
4 |
5 | class Attention(Layer):
6 | """
7 | Basic attention layer.
8 |     Attention layers are normally used to find the tokens that matter most for a given label.
9 |     Uses the 'max trick' (subtracting the max score before exp) for numerical stability.
10 | # Arguments:
11 | 1. use_bias: whether to use bias
12 | 2. use_context: whether to use context vector
13 | 3. return_attention: whether to return attention weights as part of output
14 | 4. attention_dim: dimensionality of the inner attention
15 | 5. activation: whether to use activation func in first MLP
16 | # Inputs:
17 | Tensor with shape (batch_size, time_steps, hidden_size)
18 | # Returns:
19 | Tensor with shape (batch_size, hidden_size)
20 | If return attention weight,
21 | an additional tensor with shape (batch_size, time_steps) will be returned.
22 | """
23 |
24 | def __init__(self,
25 | use_bias=True,
26 | use_context=True,
27 | return_attention=False,
28 | attention_dim=None,
29 | activation=True,
30 | **kwargs):
31 | self.use_bias = use_bias
32 | self.use_context = use_context
33 | self.return_attention = return_attention
34 | self.attention_dim = attention_dim
35 | self.activation = activation
36 | super(Attention, self).__init__(**kwargs)
37 |
38 | def build(self, input_shape):
39 | if len(input_shape) < 3:
40 | raise ValueError(
41 | "Expected input shape of `(batch_size, time_steps, features)`, found `{}`".format(input_shape))
42 | if self.attention_dim is None:
43 | attention_dim = input_shape[-1]
44 | else:
45 | attention_dim = self.attention_dim
46 |
47 | self.kernel = self.add_weight(name='kernel',
48 | shape=(input_shape[-1], attention_dim),
49 | initializer="glorot_normal",
50 | trainable=True)
51 | if self.use_bias:
52 | self.bias = self.add_weight(name='bias',
53 | shape=(attention_dim,),
54 | initializer="zeros",
55 | trainable=True)
56 | else:
57 | self.bias = None
58 | if self.use_context:
59 | self.context_kernel = self.add_weight(name='context_kernel',
60 | shape=(attention_dim, 1),
61 | initializer="glorot_normal",
62 | trainable=True)
63 | else:
64 | self.context_kernel = None
65 |
66 | super(Attention, self).build(input_shape)
67 |
68 | def call(self, x, mask=None):
69 | # MLP
70 | ut = K.dot(x, self.kernel)
71 | if self.use_bias:
72 | ut = K.bias_add(ut, self.bias)
73 | if self.activation:
74 | ut = K.tanh(ut)
75 |         if self.use_context:
76 | ut = K.dot(ut, self.context_kernel)
77 | ut = K.squeeze(ut, axis=-1)
78 | # softmax
79 | at = K.exp(ut - K.max(ut, axis=-1, keepdims=True))
80 | if mask is not None:
81 | at *= K.cast(mask, K.floatx())
82 | att_weights = at / (K.sum(at, axis=1, keepdims=True) + K.epsilon())
83 | # output
84 | atx = x * K.expand_dims(att_weights, axis=-1)
85 | output = K.sum(atx, axis=1)
86 | if self.return_attention:
87 | return [output, att_weights]
88 | return output
89 |
90 | def compute_mask(self, input, input_mask=None):
91 | if isinstance(input_mask, list):
92 | return [None] * len(input_mask)
93 | else:
94 | return None
95 |
96 | def compute_output_shape(self, input_shape):
97 | output_len = input_shape[2]
98 | if self.return_attention:
99 | return [(input_shape[0], output_len), (input_shape[0], input_shape[1])]
100 | return (input_shape[0], output_len)
101 |
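In the notation of call() above (mine, not from the source), with context vector $u_c$, optional mask $m_t$ and timestep inputs $x_t$, the layer computes

    u_t = \tanh(W x_t + b), \qquad s_t = u_t^{\top} u_c, \qquad
    a_t = \frac{m_t \exp\!\big(s_t - \max_{t'} s_{t'}\big)}{\sum_{t'} m_{t'} \exp\!\big(s_{t'} - \max_{t''} s_{t''}\big) + \epsilon}, \qquad
    \mathrm{output} = \sum_t a_t\, x_t

where subtracting the maximum score before the exponential is the 'max trick' mentioned in the docstring.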
--------------------------------------------------------------------------------
/nlp_toolkit/modules/attentions/multi_dim_attention.py:
--------------------------------------------------------------------------------
1 | from keras.engine import Layer
2 | from keras import backend as K
3 | from keras import initializers
4 |
5 |
6 | class Multi_Dim_Attention(Layer):
7 | """
8 | 2D attention from "A Structured Self-Attentive Sentence Embedding" (2017)
9 | """
10 |
11 | def __init__(self, ws1, ws2, punish, init='glorot_normal', **kwargs):
12 | self.kernel_initializer = initializers.get(init)
13 | self.weight_ws1 = ws1
14 | self.weight_ws2 = ws2
15 | self.punish = punish
16 | super(Multi_Dim_Attention, self).__init__(** kwargs)
17 |
18 | def build(self, input_shape):
19 | self.Ws1 = self.add_weight(shape=(input_shape[-1], self.weight_ws1),
20 | initializer=self.kernel_initializer,
21 | trainable=True,
22 | name='{}_Ws1'.format(self.name))
23 | self.Ws2 = self.add_weight(shape=(self.weight_ws1, self.weight_ws2),
24 | initializer=self.kernel_initializer,
25 | trainable=True,
26 | name='{}_Ws2'.format(self.name))
27 | self.batch_size = input_shape[0]
28 | super(Multi_Dim_Attention, self).build(input_shape)
29 |
30 | def compute_mask(self, input, input_mask=None):
31 | return None
32 |
33 | def call(self, x, mask=None):
34 | uit = K.tanh(K.dot(x, self.Ws1))
35 | ait = K.dot(uit, self.Ws2)
36 | ait = K.permute_dimensions(ait, (0, 2, 1))
37 | A = K.softmax(ait, axis=1)
38 | M = K.batch_dot(A, x)
39 | if self.punish:
40 | A_T = K.permute_dimensions(A, (0, 2, 1))
41 | tile_eye = K.tile(K.eye(self.weight_ws2), [self.batch_size, 1])
42 | tile_eye = K.reshape(
43 | tile_eye, shape=[-1, self.weight_ws2, self.weight_ws2])
44 | AA_T = K.batch_dot(A, A_T) - tile_eye
45 | P = K.l2_normalize(AA_T, axis=(1, 2))
46 | return M, P
47 | else:
48 | return M
49 |
50 | def compute_output_shape(self, input_shape):
51 | if self.punish:
52 | out1 = (input_shape[0], self.weight_ws2, input_shape[-1])
53 | out2 = (input_shape[0], self.weight_ws2, self.weight_ws2)
54 | return [out1, out2]
55 | else:
56 | return (input_shape[0], self.weight_ws2, input_shape[-1])
57 |
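For reference (paper notation, not extracted from this file), the 2D attention of the cited paper, "A Structured Self-Attentive Sentence Embedding" (Lin et al., 2017), which the Ws1/Ws2 weights and the punish option correspond to, is

    A = \mathrm{softmax}\!\big(W_{s2}\,\tanh(W_{s1} H^{\top})\big), \qquad M = A H, \qquad
    P = \big\lVert A A^{\top} - I \big\rVert_F^2

where $H$ is the (time x hidden) input, $A$ the (ws2 x time) attention matrix, $M$ the sentence embedding and $P$ the redundancy penalty added when punish is enabled.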
--------------------------------------------------------------------------------
/nlp_toolkit/modules/attentions/self_attention.py:
--------------------------------------------------------------------------------
1 | from keras.engine import Layer
2 | from keras import backend as K
3 |
4 |
5 | class Self_Attention(Layer):
6 | """
7 |     Multi-head attention layer defined in "Attention Is All You Need".
8 |     To use it as self-attention, pass in the same tensor three times.
9 | https://github.com/bojone/attention/blob/master/attention_keras.py
10 | """
11 |
12 | def __init__(self, nb_head, size_per_head, **kwargs):
13 | self.nb_head = nb_head
14 | self.size_per_head = size_per_head
15 | self.output_dim = nb_head*size_per_head
16 | super(Self_Attention, self).__init__(**kwargs)
17 |
18 | def build(self, input_shape):
19 | self.WQ = self.add_weight(name='WQ',
20 | shape=(input_shape[0][-1], self.output_dim),
21 | initializer='glorot_uniform',
22 | trainable=True)
23 | self.WK = self.add_weight(name='WK',
24 | shape=(input_shape[1][-1], self.output_dim),
25 | initializer='glorot_uniform',
26 | trainable=True)
27 | self.WV = self.add_weight(name='WV',
28 | shape=(input_shape[2][-1], self.output_dim),
29 | initializer='glorot_uniform',
30 | trainable=True)
31 | super(Self_Attention, self).build(input_shape)
32 |
33 | def Mask(self, inputs, seq_len, mode='mul'):
34 | """
35 | # Arguments:
36 | inputs: input tensor with shape (batch_size, seq_len, input_size)
37 | seq_len: Each sequence's actual length with shape (batch_size,)
38 | mode:
39 | mul: mask the rest dim with zero, used before fully-connected layer
40 | add: subtract a big constant from the rest, used before softmax layer
41 |         # Returns:
42 | Masked tensors with the same shape of input tensor
43 | """
44 | if seq_len is None:
45 | return inputs
46 | else:
47 | mask = K.one_hot(seq_len[:, 0], K.shape(inputs)[1])
48 | mask = 1 - K.cumsum(mask, 1)
49 | for _ in range(len(inputs.shape) - 2):
50 | mask = K.expand_dims(mask, 2)
51 | if mode == 'mul':
52 | return inputs * mask
53 | if mode == 'add':
54 | return inputs - (1 - mask) * 1e12
55 |
56 | def call(self, x):
57 |         # if only [Q_seq, K_seq, V_seq] are passed in, no Mask operation is applied
58 |         # if [Q_len, V_len] are also passed in, Mask is applied to the padded positions
59 | if len(x) == 3:
60 | Q_seq, K_seq, V_seq = x
61 | Q_len, V_len = None, None
62 | elif len(x) == 5:
63 | Q_seq, K_seq, V_seq, Q_len, V_len = x
64 | # linear transformation of Q, K, V
65 | Q_seq = K.dot(Q_seq, self.WQ)
66 | Q_seq = K.reshape(
67 | Q_seq, (-1, K.shape(Q_seq)[1], self.nb_head, self.size_per_head))
68 | Q_seq = K.permute_dimensions(Q_seq, (0, 2, 1, 3))
69 | K_seq = K.dot(K_seq, self.WK)
70 | K_seq = K.reshape(
71 | K_seq, (-1, K.shape(K_seq)[1], self.nb_head, self.size_per_head))
72 | K_seq = K.permute_dimensions(K_seq, (0, 2, 1, 3))
73 | V_seq = K.dot(V_seq, self.WV)
74 | V_seq = K.reshape(
75 | V_seq, (-1, K.shape(V_seq)[1], self.nb_head, self.size_per_head))
76 | V_seq = K.permute_dimensions(V_seq, (0, 2, 1, 3))
77 | # compute inner product, then mask, then softmax
78 | A = K.batch_dot(Q_seq, K_seq, axes=[3, 3]) / self.size_per_head ** 0.5
79 | A = K.permute_dimensions(A, (0, 3, 2, 1))
80 | A = self.Mask(A, V_len, 'add')
81 | A = K.permute_dimensions(A, (0, 3, 2, 1))
82 | A = K.softmax(A)
83 | # output and mask
84 | O_seq = K.batch_dot(A, V_seq, axes=[3, 2])
85 | O_seq = K.permute_dimensions(O_seq, (0, 2, 1, 3))
86 | O_seq = K.reshape(O_seq, (-1, K.shape(O_seq)[1], self.output_dim))
87 | O_seq = self.Mask(O_seq, Q_len, 'mul')
88 | return O_seq
89 |
90 | def compute_output_shape(self, input_shape):
91 | return (input_shape[0][0], input_shape[0][1], self.output_dim)
92 |
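For reference (notation mine), each head computes scaled dot-product attention as in "Attention Is All You Need", and the heads are concatenated back to nb_head * size_per_head channels:

    \mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(Q W_i^{Q})(K W_i^{K})^{\top}}{\sqrt{d_{\mathrm{head}}}}\right)(V W_i^{V}), \qquad
    \mathrm{output} = \mathrm{concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)

with $d_{\mathrm{head}}$ = size_per_head, matching the division by size_per_head ** 0.5 in call().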
--------------------------------------------------------------------------------
/nlp_toolkit/modules/custom_loss.py:
--------------------------------------------------------------------------------
1 | '''
2 | custom loss functions for model outputs that concatenate extra tensors (e.g. attention weights): only the leading class columns of y_pred are scored
3 | '''
4 |
5 | from keras import backend as K
6 |
7 |
8 | def custom_binary_crossentropy(y_true, y_pred):
9 | return K.mean(K.binary_crossentropy(y_true, y_pred[:, :2]), axis=-1)
10 |
11 |
12 | def custom_categorical_crossentropy(y_true, y_pred, n):
13 | return K.categorical_crossentropy(y_true, y_pred[:, :n])
14 |
--------------------------------------------------------------------------------
/nlp_toolkit/modules/logits.py:
--------------------------------------------------------------------------------
1 | """
2 | common output layers for different tasks
3 | """
4 |
5 | from keras_contrib.layers import CRF
6 | from keras.layers import Dense, Dropout
7 | from keras.regularizers import l2
8 |
9 |
10 | def tc_output_logits(x, nb_classes, final_dropout_rate=0):
11 | if final_dropout_rate != 0:
12 | x = Dropout(final_dropout_rate)(x)
13 | if nb_classes > 2:
14 | activation_func = 'softmax'
15 | else:
16 | activation_func = 'sigmoid'
17 | logits = Dense(nb_classes, kernel_regularizer=l2(0.01),
18 | activation=activation_func, name='softmax')(x)
19 | outputs = [logits]
20 | return outputs
21 |
22 |
23 | def sl_output_logits(x, nb_classes, use_crf=True):
24 | if use_crf:
25 | crf = CRF(nb_classes, sparse_target=False)
26 | loss = crf.loss_function
27 | acc = [crf.accuracy]
28 | outputs = crf(x)
29 | else:
30 | loss = 'categorical_crossentropy'
31 | acc = ['acc']
32 | outputs = Dense(nb_classes, activation='softmax')(x)
33 | return outputs, loss, acc
34 |
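A hedged sketch (shapes hypothetical) of how the sequence-labeling head is wired into a model; with use_crf=True the returned loss and metrics are the ones that must be passed to compile:

    from keras.layers import Input
    from keras.models import Model
    from nlp_toolkit.modules.logits import sl_output_logits

    x_in = Input(shape=(50, 128))                 # (time_steps, features)
    outputs, loss, acc = sl_output_logits(x_in, nb_classes=7, use_crf=True)
    m = Model(x_in, outputs)
    m.compile(optimizer='adam', loss=loss, metrics=acc)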
--------------------------------------------------------------------------------
/nlp_toolkit/modules/token_embedders/__init__.py:
--------------------------------------------------------------------------------
1 | from .embedding import Token_Embedding
2 | from .position_embedding import Position_Embedding
3 |
--------------------------------------------------------------------------------
/nlp_toolkit/modules/token_embedders/embedding.py:
--------------------------------------------------------------------------------
1 | from keras.engine import Layer
2 | from keras import backend as K
3 | from keras.layers import Embedding, Dropout, SpatialDropout1D, TimeDistributed
4 | from keras.regularizers import L1L2
5 |
6 |
7 | def Token_Embedding(x, input_dim, output_dim, embed_weights=None,
8 | mask_zero=False, input_length=None, dropout_rate=0,
9 | embed_l2=1E-6, name='', time_distributed=False, **kwargs):
10 | """
11 | Basic token embedding layer, also included some dropout layer.
12 | """
13 | embed_reg = L1L2(l2=embed_l2) if embed_l2 != 0 else None
14 | embed_layer = Embedding(input_dim=input_dim,
15 | output_dim=output_dim,
16 | weights=embed_weights,
17 | mask_zero=mask_zero,
18 | input_length=input_length,
19 | embeddings_regularizer=embed_reg,
20 | name=name)
21 | if time_distributed:
22 | embed = TimeDistributed(embed_layer)(x)
23 | else:
24 | embed = embed_layer(x)
25 | # entire embedding channels are dropped out instead of the
26 | # normal Keras embedding dropout, which drops all channels for entire words
27 | # many of the datasets contain so few words that losing one or more words can alter the emotions completely
28 | if dropout_rate != 0:
29 | embed = SpatialDropout1D(dropout_rate)(embed)
30 | return embed
31 |
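A hedged sketch (sizes hypothetical) of the helper above; it returns the embedded tensor, optionally wrapped in TimeDistributed and followed by SpatialDropout1D:

    from keras.layers import Input
    from nlp_toolkit.modules.token_embedders import Token_Embedding

    tokens = Input(shape=(100,), dtype='int32')
    embedded = Token_Embedding(tokens, input_dim=20000, output_dim=256,
                               mask_zero=True, input_length=100,
                               dropout_rate=0.15, name='token_embeddings')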
--------------------------------------------------------------------------------
/nlp_toolkit/modules/token_embedders/position_embedding.py:
--------------------------------------------------------------------------------
1 | from keras.engine import Layer
2 | from keras import backend as K
3 |
4 |
5 | class Position_Embedding(Layer):
6 | """
7 | Computes sequence position information for Attention based models
8 | https://github.com/bojone/attention/blob/master/attention_keras.py
9 |
10 | # Arguments:
11 | A tensor with shape (batch_size, seq_len, word_size)
12 | # Returns:
13 | A position tensor with shape (batch_size, seq_len, position_size)
14 | """
15 |
16 | def __init__(self, size=None, mode='sum', **kwargs):
17 |         self.size = size  # must be an even number
18 | self.mode = mode
19 | super(Position_Embedding, self).__init__(**kwargs)
20 |
21 | def call(self, x):
22 | if (self.size is None) or (self.mode == 'sum'):
23 | self.size = int(x.shape[-1])
24 | batch_size, seq_len = K.shape(x)[0], K.shape(x)[1]
25 | position_j = 1. / K.pow(10000.,
26 | 2 * K.arange(self.size / 2, dtype='float32'
27 | ) / self.size)
28 | position_j = K.expand_dims(position_j, 0)
29 |         # K.arange does not support variable-length sequences, so positions are generated this way instead
30 | position_i = K.cumsum(K.ones_like(x[:, :, 0]), 1) - 1
31 | position_i = K.expand_dims(position_i, 2)
32 | position_ij = K.dot(position_i, position_j)
33 | position_ij = K.concatenate(
34 | [K.cos(position_ij), K.sin(position_ij)], 2)
35 | if self.mode == 'sum':
36 | return position_ij + x
37 | elif self.mode == 'concat':
38 | return K.concatenate([position_ij, x], 2)
39 |
40 | def compute_output_shape(self, input_shape):
41 | if self.mode == 'sum':
42 | return input_shape
43 | elif self.mode == 'concat':
44 | return (input_shape[0], input_shape[1], input_shape[2] + self.size)
45 |
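In the notation of call() (mine, not from the file): with $d$ = size, position $p$ and frequency index $j < d/2$, the layer builds

    \omega_j = 10000^{-2j/d}, \qquad
    \mathrm{PE}_p = \big[\cos(p\,\omega_0), \dots, \cos(p\,\omega_{d/2-1}),\ \sin(p\,\omega_0), \dots, \sin(p\,\omega_{d/2-1})\big]

(cosines first, then sines, concatenated rather than interleaved), which is summed with or concatenated to the input depending on mode.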
--------------------------------------------------------------------------------
/nlp_toolkit/sequence.py:
--------------------------------------------------------------------------------
1 | """
2 | Text Sequence Utilties
3 | """
4 |
5 | import math
6 | import random
7 | import numpy as np
8 | from collections import Counter
9 | from keras.utils import Sequence
10 | from keras.utils.np_utils import to_categorical
11 | from keras.preprocessing.sequence import pad_sequences
12 | from sklearn.externals import joblib
13 | from sklearn.base import BaseEstimator, TransformerMixin
14 | from nlp_toolkit.utilities import logger, word2char
15 | from typing import Dict, List
16 | from collections import defaultdict
17 |
18 |
19 | def top_elements(array, k):
20 | ind = np.argpartition(array, -k)[-k:]
21 | return ind[np.argsort(array[ind])][::-1]
22 |
23 |
24 | class Vocabulary(object):
25 | """
26 | Vocab Class for any NLP Tasks
27 | """
28 |
29 |     def __init__(self, max_size=None, lower=True, unk_token=True, specials=('<pad>',)):
30 | self._max_size = max_size
31 | self._lower = lower
32 | self._unk = unk_token
33 | if specials:
34 | self._token2id = {token: i for i, token in enumerate(specials)}
35 | self._id2token = list(specials)
36 | else:
37 | self._token2id = {}
38 | self._id2token = []
39 | self._token_count = Counter()
40 |
41 | def __len__(self):
42 | return len(self._token2id)
43 |
44 | def add_token(self, token):
45 | token = self.process_token(token)
46 | self._token_count.update([token])
47 |
48 | def add_documents(self, docs):
49 | for sent in docs:
50 | sent = map(self.process_token, sent)
51 | self._token_count.update(sent)
52 |
53 | def doc2id(self, doc):
54 | # doc = map(self.process_token, doc)
55 | return [self.token_to_id(token) for token in doc]
56 |
57 | def id2doc(self, ids):
58 | return [self.id_to_token(idx) for idx in ids]
59 |
60 | def build(self):
61 | token_freq = self._token_count.most_common(self._max_size)
62 | idx = len(self.vocab)
63 | for token, _ in token_freq:
64 | self._token2id[token] = idx
65 | self._id2token.append(token)
66 | idx += 1
67 | if self._unk:
68 |             unk = '<unk>'
69 | self._token2id[unk] = idx
70 | self._id2token.append(unk)
71 |
72 | def process_token(self, token):
73 | if self._lower:
74 | token = token.lower()
75 |
76 | return token
77 |
78 | def token_to_id(self, token):
79 | # token = self.process_token(token)
80 | return self._token2id.get(token, len(self._token2id) - 1)
81 |
82 | def id_to_token(self, idx):
83 | return self._id2token[idx]
84 |
85 | def extend_vocab(self, new_vocab, max_tokens=10000):
86 | assert isinstance(new_vocab, list)
87 | if max_tokens < 0:
88 | max_tokens = 10000
89 | base_index = self.__len__()
90 | added = 0
91 | for word in new_vocab:
92 | if added >= max_tokens:
93 | break
94 | if word not in self._token2id:
95 | self._token2id[word] = base_index + added
96 | self._id2token.append(word)
97 | added += 1
98 | logger.info('%d new words have been added to vocab' % added)
99 | return added
100 |
101 | @property
102 | def vocab(self):
103 | return self._token2id
104 |
105 | @property
106 | def reverse_vocab(self):
107 | return self._id2token
108 |
109 |
110 | class IndexTransformer(BaseEstimator, TransformerMixin):
111 | """
112 |     Sklearn-style transformer for converting text tokens to indices
113 | Basic tokens are usually words.
114 |
115 | # Arguments:
116 | 1. max_tokens: maximum number of basic tokens in one sentence
117 | 2. max_inner_chars: maximum number of char tokens in one word
118 |         3. lower: whether to lowercase tokens
119 |         4. use_inner_char: whether to use inner-char tokens (depends on your model)
120 |         5. initial_vocab: additional basic tokens which are not in the corpus
121 |
122 | # Usage:
123 |         p = IndexTransformer(task_type='classification')
124 | new_data = p.fit_transform(data)
125 | # save
126 | p.save(file_name)
127 | # load
128 | p = IndexTransformer.load(file_name)
129 | # inverse transform y label
130 |         y_true_label = p.inverse_transform(y_pred)
131 | """
132 |
133 | def __init__(self, task_type, max_tokens=80, max_inner_chars=8, lower=True,
134 | use_inner_char=False, initial_vocab=None,
135 | use_seg=False, use_radical=False, radical_dict=None, basic_token='word'):
136 | self.basic_token = basic_token
137 | self.task_type = task_type
138 | self.max_tokens = max_tokens
139 | self.max_inner_chars = max_inner_chars
140 | self.use_inner_char = use_inner_char
141 | self.use_seg = use_seg
142 | self.use_radical = use_radical
143 | self._token_vocab = Vocabulary(lower=lower)
144 | self._label_vocab = Vocabulary(
145 | lower=False, unk_token=False, specials=None)
146 | if use_inner_char:
147 | self._inner_char_vocab = Vocabulary(lower=lower)
148 | if initial_vocab:
149 | self._token_vocab.add_documents([initial_vocab])
150 | if use_seg:
151 | self._seg_vocab = Vocabulary(lower=False)
152 | if use_radical:
153 | self._radical_vocab = Vocabulary(lower=False)
154 | self.radical_dict = radical_dict
155 |
156 | def fit(self, X, y=None):
157 | # assert isinstance(X, dict)
158 | self._token_vocab.add_documents(X)
159 | self._token_vocab.build()
160 | if y is not None:
161 | self._label_vocab.add_documents(y)
162 | self._label_vocab.build()
163 | if self.use_inner_char:
164 | for doc in X:
165 | self._inner_char_vocab.add_documents(doc)
166 | self._inner_char_vocab.build()
167 | if self.use_seg:
168 | self._seg_vocab.add_documents([['B'], ['E'], ['M'], ['S']])
169 | self._seg_vocab.build()
170 | if self.use_radical:
171 | self._radical_vocab.add_documents([[w] for w in self.radical_dict])
172 | self._radical_vocab.build()
173 |
174 | return self
175 |
176 | def transform(self, X, y=None, max_len=None):
177 | if max_len is not None:
178 | max_tokens = max_len
179 | else:
180 | max_tokens = self.max_tokens
181 | tokens = X['token']
182 | token_ids = [self._token_vocab.doc2id(doc) for doc in tokens]
183 | token_ids = pad_sequences(
184 | token_ids, maxlen=max_tokens, padding='post')
185 |
186 | features = {'token': token_ids}
187 |
188 | if self.use_inner_char:
189 | char_ids = [[self._inner_char_vocab.doc2id(w) for w in doc] for doc in tokens]
190 | char_ids = pad_nested_sequences(
191 | char_ids, max_tokens, self.max_inner_chars)
192 | features['char'] = char_ids
193 |
194 | if self.use_seg:
195 | seg_ids = [self._seg_vocab.doc2id(doc) for doc in X['seg']]
196 | seg_ids = pad_sequences(
197 | seg_ids, maxlen=max_tokens, padding='post')
198 | features['seg'] = seg_ids
199 |
200 | if self.use_radical:
201 | radical_ids = [self._radical_vocab.doc2id(doc) for doc in X['radical']]
202 | radical_ids = pad_sequences(
203 | radical_ids, maxlen=max_tokens, padding='post')
204 | features['radical'] = radical_ids
205 |
206 | if y is not None:
207 | y = [self._label_vocab.doc2id(doc) for doc in y]
208 | if self.task_type == 'sequence_labeling':
209 | y = pad_sequences(y, maxlen=max_tokens, padding='post')
210 | y = to_categorical(y, self.label_size).astype(float)
211 |
212 | return features, y
213 | else:
214 | return features
215 |
216 | def fit_transform(self, X, y=None, **params):
217 | return self.fit(X, y).transform(X, y)
218 |
219 | def inverse_transform(self, y, lengths=None, top_k=1, return_percentage=False):
220 | if self.task_type == 'classification':
221 | if top_k == 1:
222 | ind_top = np.argmax(y, -1)
223 | inverse_y = [self._label_vocab.id2doc([idx])[0] for idx in ind_top]
224 | return inverse_y
225 | elif top_k > 1:
226 | ind_top = [top_elements(prob, top_k) for prob in y]
227 | inverse_y = [self._label_vocab.id2doc(id_list) for id_list in ind_top]
228 | if not return_percentage:
229 | return inverse_y
230 | else:
231 | pct_top = [[prob[ind] for ind in ind_top[idx]] for idx, prob in enumerate(y)]
232 | return inverse_y, pct_top
233 | elif self.task_type == 'sequence_labeling':
234 | ind_top = np.argmax(y, -1)
235 | inverse_y = [self._label_vocab.id2doc(idx) for idx in ind_top]
236 | if lengths is not None:
237 | inverse_y = [iy[:l] for iy, l in zip(inverse_y, lengths)]
238 | return inverse_y
239 |
240 | @property
241 | def token_vocab_size(self):
242 | return len(self._token_vocab)
243 |
244 | @property
245 | def char_vocab_size(self):
246 | return len(self._inner_char_vocab)
247 |
248 | @property
249 | def seg_vocab_size(self):
250 | return len(self._seg_vocab)
251 |
252 | @property
253 | def radical_vocab_size(self):
254 | return len(self._radical_vocab)
255 |
256 | @property
257 | def label_size(self):
258 | return len(self._label_vocab)
259 |
260 | def save(self, file_path):
261 | joblib.dump(self, file_path)
262 |
263 | @classmethod
264 | def load(cls, file_path):
265 | p = joblib.load(file_path)
266 | # print('data transformer loaded')
267 | return p
268 |
269 |
270 | def pad_nested_sequences(sequences, max_sent_len, max_word_len, dtype='int32'):
271 | """
272 | Pad char sequences of one single word
273 | """
274 | x = np.zeros((len(sequences), max_sent_len, max_word_len)).astype(dtype)
275 | for i, sent in enumerate(sequences):
276 | if len(sent) > max_sent_len:
277 | sent = sent[:max_sent_len]
278 | for j, word in enumerate(sent):
279 | if len(word) < max_word_len:
280 | x[i, j, :len(word)] = word
281 | else:
282 | x[i, j, :] = word[:max_word_len]
283 | return x
284 |
285 |
286 | class BasicIterator(Sequence):
287 | """
288 | Wrapper for Keras Sequence Class
289 | """
290 |
291 | def __init__(self, task_type: str, transformer: IndexTransformer,
292 | x: Dict[str, List[List[str]]], y: List[List[str]] = None, batch_size=1):
293 | self.task_type = task_type
294 | self.t = transformer
295 | self.x = x
296 | self.y = y
297 | self.batch_size = batch_size
298 | if self.t.use_radical:
299 | self.radical_dict = self.t.radical_dict
300 | else:
301 | self.radical_dict = None
302 |
303 | def __getitem__(self, idx):
304 | idx_begin = self.batch_size * idx
305 | idx_end = self.batch_size * (idx + 1)
306 | x_batch = {k: v[idx_begin: idx_end] for k, v in self.x.items()}
307 |
308 | if self.y is not None:
309 | y_batch = self.y[idx_begin: idx_end]
310 | features, labels = self.t.transform(X=x_batch, y=y_batch)
311 | return features, labels
312 | else:
313 | features = self.t.transform(X=x_batch)
314 | return features
315 |
316 | def __len__(self):
317 | return math.ceil(len(self.x['token']) / self.batch_size)
318 |
319 |
320 | def _roundto(val, batch_size):
321 | return int(math.ceil(val / batch_size)) * batch_size
322 |
323 |
324 | # TODO
325 | # cluster by length: long texts get a smaller batch_size, short texts a larger one
326 | class BucketIterator(Sequence):
327 | """
328 |     A Keras Sequence (dataset reader) that serves input sequences from bucketed bins.
329 |     Each batch is padded with 'pad_sequences' (post padding)
330 |     up to the maximum sequence length of its bucket.
331 | """
332 |
333 | def __init__(self, task_type: str, transformer: IndexTransformer,
334 | seq_lengths: List[int],
335 | x: Dict[str, List[List[str]]], y: List[List[str]],
336 | num_buckets: int = 8, batch_size=1):
337 | self.task_type = task_type
338 | self.t = transformer
339 | self.batch_size = batch_size
340 | self.task_type = task_type
341 | self.x = x
342 | self.y = y
343 | if self.t.use_radical:
344 | self.radical_dict = self.t.radical_dict
345 | else:
346 | self.radical_dict = None
347 |
348 | # Count bucket sizes
349 | bucket_sizes, bucket_ranges = np.histogram(
350 | seq_lengths, bins=num_buckets)
351 | # Looking for non-empty buckets
352 | actual_buckets = [bucket_ranges[i+1]
353 | for i, bs in enumerate(bucket_sizes) if bs > 0]
354 | actual_bucket_sizes = [bs for bs in bucket_sizes if bs > 0]
355 | self.bucket_seqlen = [int(math.ceil(bs)) for bs in actual_buckets]
356 | num_actual = len(actual_buckets)
357 | logger.info('Training with %d non-empty buckets' % num_actual)
358 |
359 | self.bins = [(defaultdict(list), []) for bs in actual_bucket_sizes]
360 | assert len(self.bins) == num_actual
361 |
362 | # Insert the sequences into the bins
363 | self.feature_keys = list(self.x.keys())
364 | for i, sl in enumerate(seq_lengths):
365 | for j in range(num_actual):
366 | bsl = self.bucket_seqlen[j]
367 | if sl < bsl or j == num_actual - 1:
368 | for k in self.feature_keys:
369 | self.bins[j][0][k].append(x[k][i])
370 | self.bins[j][1].append(y[i])
371 | break
372 |
373 | self.num_samples = len(self.x['token'])
374 | self.dataset_len = int(sum([math.ceil(bs / self.batch_size)
375 | for bs in actual_bucket_sizes]))
376 | self._permute()
377 |
378 | def _permute(self):
379 | # Shuffle bins
380 | random.shuffle(self.bins)
381 |
382 | # Shuffle bin contents
383 | for i, (xbin, ybin) in enumerate(self.bins):
384 | index_array = np.random.permutation(len(ybin))
385 | self.bins[i] = ({k: [xbin[k][i] for i in index_array] for k in self.feature_keys}, [ybin[i] for i in index_array])
386 |
387 | def on_epoch_end(self):
388 | self._permute()
389 |
390 | def __len__(self):
391 | return self.dataset_len
392 |
393 | def __getitem__(self, idx):
394 | idx_begin = self.batch_size * idx
395 | idx_end = self.batch_size * (idx + 1)
396 |
397 | # Obtain bin index
398 | for idx, (xbin, ybin) in enumerate(self.bins):
399 | rounded_bin = _roundto(len(ybin), self.batch_size)
400 | if idx_begin >= rounded_bin:
401 | idx_begin -= rounded_bin
402 | idx_end -= rounded_bin
403 | continue
404 |
405 | # Found bin
406 | idx_end = min(len(ybin), idx_end) # Clamp to end of bin
407 | x_batch = {k: v[idx_begin: idx_end] for k, v in xbin.items()}
408 | y_batch = ybin[idx_begin: idx_end]
409 |
410 | max_len_i = self.bucket_seqlen[idx]
411 | features, labels = self.t.transform(x_batch, y_batch, max_len_i)
412 |
413 | return features, labels
414 | raise ValueError('out of bounds')
415 |
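A hedged end-to-end sketch (toy data, hypothetical labels) tying IndexTransformer to BasicIterator for a classification task:

    from nlp_toolkit.sequence import IndexTransformer, BasicIterator

    x = {'token': [['i', 'love', 'it'], ['too', 'slow']]}
    y = [['pos'], ['neg']]

    p = IndexTransformer(task_type='classification', max_tokens=10)
    p.fit(x['token'], y)
    batches = BasicIterator('classification', p, x, y, batch_size=2)
    features, labels = batches[0]   # padded token ids and categorical labels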
--------------------------------------------------------------------------------
/nlp_toolkit/trainer.py:
--------------------------------------------------------------------------------
1 | """
2 | Trainer Class: define the training process
3 | """
4 |
5 | import os
6 | import time
7 | import numpy as np
8 | from pathlib import Path
9 | from keras.optimizers import Adam, Nadam
10 | from sklearn.model_selection import train_test_split
11 | from sklearn.metrics import precision_recall_fscore_support
12 | from nlp_toolkit.callbacks import get_callbacks, History
13 | from nlp_toolkit.utilities import logger
14 | from nlp_toolkit.sequence import BasicIterator, BucketIterator
15 | from nlp_toolkit.modules.custom_loss import custom_binary_crossentropy, custom_categorical_crossentropy
16 | from typing import Dict
17 | from copy import deepcopy
18 |
19 | np.random.seed(1050)
20 |
21 |
22 | # TODO 自适应的学习率
23 | # 1. 基于valid数据的自适应学习率下降
24 | # 2. 三角学习率
25 | class Trainer(object):
26 | """
27 | Trainer class for all model training
28 | support single training and n-fold training
29 |
30 | # Arguments:
31 | 1. model: Keras Model object
32 | 2. model_name
33 | 3. task_type: text classification or sequence labeling
34 | 4. metric: the main metric used to track model performance on epoch end
35 |         5. extra_features: extra features (besides tokens) that will be included as model inputs
36 | 6. batch_size: minimum batch size
37 | 7. max_epoch: maximum epoch numbers
38 | 8. optimizer: default is Adam
39 |         9. checkpoint_path: the folder path for saving models
40 |         10. early_stopping: whether to use the early stopping strategy
41 |         11. lrplateau: whether to reduce the learning rate when the metric plateaus
42 |         12. tensorboard: whether to open tensorboard to log the training process
43 |         13. nb_bucket: the bucket size
44 |         14. train_mode: single-turn training or n-fold training
45 |         15. fold_cnt: the number of folds
46 |         16. test_size: default is 0.2
47 |         17. shuffle: whether to shuffle data between epochs, default is true
48 |         18. patiences: the number of epochs with no metric improvement after which training stops
49 |
50 | # Returns:
51 | The trained model or average performance of the model
52 | """
53 |
54 | def __init__(self, model,
55 | model_name,
56 | task_type,
57 | metric,
58 | batch_size=64,
59 | max_epoch=25,
60 | optimizer=Adam(),
61 | checkpoint_path='./models/',
62 | early_stopping=True,
63 | lrplateau=True,
64 | tensorboard=False,
65 | nb_bucket=100,
66 | train_mode='single',
67 | fold_cnt=10,
68 | test_size=0.2,
69 | shuffle=True,
70 | patiences=3):
71 | self.single_model = deepcopy(model)
72 | self.fold_model = deepcopy(model)
73 | self.model_name = model_name
74 | self.task_type = task_type
75 | self.metric = metric
76 | self.batch_size = batch_size
77 | self.max_epoch = max_epoch
78 | self.optimizer = optimizer
79 | self.test_size = test_size
80 | self.train_mode = train_mode
81 | self.fold_cnt = fold_cnt
82 | self.shuffle = shuffle
83 | self.nb_bucket = nb_bucket
84 | self.patiences = patiences
85 | base_dir = Path(checkpoint_path)
86 | if not base_dir.exists():
87 | base_dir.mkdir()
88 | current_time = time.strftime(
89 | '%Y%m%d%H%M', time.localtime(time.time()))
90 | save_dir = self.model_name + '_' + current_time
91 | self.checkpoint_path = Path(checkpoint_path) / save_dir
92 |
93 | def data_generator(self, seq_type, x_train, x_valid, y_train, y_valid,
94 | x_len_train=None, x_len_valid=None,):
95 | if seq_type == 'bucket':
96 | logger.info('use bucket sequence to speed up model training')
97 | train_batches = BucketIterator(
98 | self.task_type, self.transformer, x_len_train,
99 | x_train, y_train, self.nb_bucket, self.batch_size)
100 | valid_batches = BucketIterator(
101 | self.task_type, self.transformer, x_len_valid,
102 | x_valid, y_valid, self.nb_bucket, self.batch_size)
103 | elif seq_type == 'basic':
104 | train_batches = BasicIterator(
105 | self.task_type, self.transformer,
106 | x_train, y_train, self.batch_size)
107 | valid_batches = BasicIterator(
108 | self.task_type, self.transformer,
109 | x_valid, y_valid, self.batch_size)
110 | else:
111 |             raise ValueError('invalid data iterator type, only supports "basic" or "bucket"')
112 | return train_batches, valid_batches
113 |
114 | def train(self, x_ori, y, transformer,
115 | seq_type='bucket',
116 | return_attention=False):
117 | self.transformer = transformer
118 | self.feature_keys = list(x_ori.keys())
119 |
120 | if self.train_mode == 'single':
121 | x = deepcopy(x_ori)
122 | x_len = [item[-1] for item in x['token']]
123 | x['token'] = [item[:-1] for item in x['token']]
124 |
125 | # model initialization
126 | self.single_model.forward()
127 | logger.info('%s model structure...' % self.model_name)
128 | self.single_model.model.summary()
129 |
130 | # split dataset
131 | indices = np.random.permutation(len(x['token']))
132 | cut_point = int(len(x['token']) * (1 - self.test_size))
133 | train_idx, valid_idx = indices[:cut_point], indices[cut_point:]
134 | x_train = {k: [x[k][i] for i in train_idx] for k in self.feature_keys}
135 | x_valid = {k: [x[k][i] for i in valid_idx] for k in self.feature_keys}
136 | y_train, y_valid = [y[i] for i in train_idx], [y[i] for i in valid_idx]
137 | x_len_train, x_len_valid = [x_len[i] for i in train_idx], [x_len[i] for i in valid_idx]
138 | logger.info(
139 | 'train/valid set: {}/{}'.format(train_idx.shape[0], valid_idx.shape[0]))
140 |
141 | # transform data to sequence data streamer
142 | train_batches, valid_batches = self.data_generator(
143 | seq_type,
144 | x_train, x_valid, y_train, y_valid,
145 | x_len_train, x_len_valid)
146 |
147 | # define callbacks
148 | history = History(self.metric)
149 | self.callbacks = get_callbacks(
150 | history=history,
151 | metric=self.metric[0],
152 | log_dir=self.checkpoint_path,
153 | valid=valid_batches,
154 | transformer=transformer,
155 | attention=return_attention)
156 |
157 | # model compile
158 | self.single_model.model.compile(
159 | loss=self.single_model.get_loss(),
160 | optimizer=self.optimizer,
161 | metrics=self.single_model.get_metrics())
162 |
163 | # save transformer and model parameters
164 | if not self.checkpoint_path.exists():
165 | self.checkpoint_path.mkdir()
166 | transformer.save(self.checkpoint_path / 'transformer.h5')
167 | invalid_params = self.single_model.invalid_params
168 | param_file = self.checkpoint_path / 'model_parameters.json'
169 | self.single_model.save_params(param_file, invalid_params)
170 | logger.info('saving model parameters and transformer to {}'.format(
171 | self.checkpoint_path))
172 |
173 | # actual training start
174 | self.single_model.model.fit_generator(
175 | generator=train_batches,
176 | epochs=self.max_epoch,
177 | callbacks=self.callbacks,
178 | shuffle=self.shuffle,
179 | validation_data=valid_batches)
180 | print('best {}: {:04.2f}'.format(self.metric[0],
181 | max(history.metrics[self.metric[0]]) * 100))
182 | return self.single_model.model, history
183 |
184 | elif self.train_mode == 'fold':
185 | x = deepcopy(x_ori)
186 | x_len = [item[-1] for item in x['token']]
187 | x['token'] = [item[:-1] for item in x['token']]
188 | x_token_first = x['token'][0]
189 |
190 | fold_size = len(x['token']) // self.fold_cnt
191 | scores = []
192 | logger.info('%d-fold starts!' % self.fold_cnt)
193 |
194 | for fold_id in range(self.fold_cnt):
195 | print('\n------------------------ fold ' + str(fold_id) + '------------------------')
196 |
197 | assert x_token_first == x['token'][0]
198 | model_init = self.fold_model
199 | model_init.forward()
200 |
201 | fold_start = fold_size * fold_id
202 | fold_end = fold_start + fold_size
203 |                 if fold_id == self.fold_cnt - 1:
204 |                     fold_end = len(x['token'])
205 | if fold_id == 0:
206 | logger.info('%s model structure...' % self.model_name)
207 | model_init.model.summary()
208 |
209 | x_train = {k: x[k][:fold_start] + x[k][fold_end:] for k in self.feature_keys}
210 | x_len_train = x_len[:fold_start] + x_len[fold_end:]
211 | y_train = y[:fold_start] + y[fold_end:]
212 | x_valid = {k: x[k][fold_start:fold_end] for k in self.feature_keys}
213 | x_len_valid = x_len[fold_start:fold_end]
214 | y_valid = y[fold_start:fold_end]
215 |
216 | train_batches, valid_batches = self.data_generator(
217 | seq_type,
218 | x_train, x_valid, y_train, y_valid,
219 | x_len_train, x_len_valid)
220 |
221 | history = History(self.metric)
222 | self.callbacks = get_callbacks(
223 | history=history, metric=self.metric[0],
224 | valid=valid_batches, transformer=transformer,
225 | attention=return_attention)
226 |
227 | model_init.model.compile(
228 | loss=model_init.get_loss(),
229 | optimizer=self.optimizer,
230 | metrics=model_init.get_metrics())
231 |
232 | model_init.model.fit_generator(
233 | generator=train_batches,
234 | epochs=self.max_epoch,
235 | callbacks=self.callbacks,
236 | shuffle=self.shuffle,
237 | validation_data=valid_batches)
238 | scores.append(max(history.metrics[self.metric[0]]))
239 |
240 | logger.info('training finished! The mean {} scores: {:4.2f}(±{:4.2f})'.format(
241 | self.metric[0], np.mean(scores) * 100, np.std(scores) * 100))
242 |
--------------------------------------------------------------------------------
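The bookkeeping in the 'fold' branch above is easiest to see on plain lists. The sketch below reproduces the same slicing with toy data (list indices instead of the feature dict that Trainer uses), including the convention that the last fold absorbs any remainder.

# toy illustration of the n-fold split used in Trainer.train (train_mode='fold')
data = list(range(11))
fold_cnt = 5
fold_size = len(data) // fold_cnt  # 2

for fold_id in range(fold_cnt):
    fold_start = fold_size * fold_id
    # the last fold takes everything that is left over
    fold_end = len(data) if fold_id == fold_cnt - 1 else fold_start + fold_size
    train = data[:fold_start] + data[fold_end:]
    valid = data[fold_start:fold_end]
    print(fold_id, train, valid)
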
/nlp_toolkit/utilities.py:
--------------------------------------------------------------------------------
1 | """
2 | some NLP processing utility functions
3 | """
4 |
5 | import io
6 | import re
7 | import sys
8 | import time
9 | import logging
10 | import numpy as np
11 | from itertools import groupby
12 |
13 | logging.basicConfig(level=logging.INFO,
14 | format='%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s')
15 | logger = logging.getLogger('nlp_toolkit')
16 |
17 | # placeholder tokens that are always kept as single units (never split into chars)
18 | special_tokens = set(['s_', 'lan_', 'ss_'])
19 |
20 |
21 | # [1, ['a', 'b'], [True, False]] ---> [1, 'a', 'b', True, False]
22 | def flatten_gen(x):
23 | for i in x:
24 | if isinstance(i, list) or isinstance(i, tuple):
25 | for inner_i in i:
26 | yield inner_i
27 | else:
28 | yield i
29 |
30 |
31 | # classify each char as one of ['cn', 'en', 'num', 'other']
32 | def char_type(word):
33 | for char in word:
34 | unicode_char = ord(char)
35 | if unicode_char >= 19968 and unicode_char <= 40869:
36 | yield (char, 'cn')
37 | elif unicode_char >= 65 and unicode_char <= 122:
38 | yield (char, 'en')
39 | elif unicode_char >= 48 and unicode_char <= 57:
40 | yield (char, 'num')
41 | else:
42 | yield (char, 'other')
43 |
44 |
45 | # split a word into chars, keeping latin/digit runs intact
46 | def split_cn_en(word):
47 | new_word = [c for c in char_type(word)]
48 | new_word_len = len(new_word)
49 | tmp = ''
50 | for ix, item in enumerate(new_word):
51 | if item[1] in {'en', 'num'}:
52 | if ix < new_word_len - 1:
53 | if new_word[ix+1][1] == item[1]:
54 | tmp += item[0]
55 | else:
56 | tmp += item[0]
57 | yield tmp
58 | tmp = ''
59 | else:
60 | tmp += item[0]
61 | yield tmp
62 | else:
63 | yield item[0]
64 |
65 |
66 | # reassign token labels according to the new char-level tokens
67 | def extract_char(word_list, label_list=None, use_seg=False):
68 | if label_list:
69 | for word, label in zip(word_list, label_list):
70 | # label = label.strip('#')
71 | single_check = word in special_tokens or not re.search(r'[^a-z0-9]+', word)
72 | if len(word) == 1 or single_check:
73 | if use_seg:
74 | yield (word, label, 'S')
75 | else:
76 | yield (word, label)
77 | else:
78 | try:
79 | new_word = list(split_cn_en(word))
80 | word_len = len(new_word)
81 | if label == 'O':
82 | new_label = ['O'] * word_len
83 | elif label.startswith('I'):
84 | new_label = [label] * word_len
85 | else:
86 | label_i = 'I' + label[1:]
87 | if label.startswith('B'):
88 | new_label = [label] + [label_i] * (word_len - 1)
89 | elif label.startswith('E'):
90 | new_label = [label_i] * (word_len - 1) + [label]
91 | if use_seg:
92 | seg_tag = ['M'] * word_len
93 | seg_tag[0] = 'B'
94 | seg_tag[-1] = 'E'
95 | for x, y, z in zip(new_word, new_label, seg_tag):
96 | yield (x, y, z)
97 | else:
98 | for x, y in zip(new_word, new_label):
99 | yield (x, y)
100 | except Exception as e:
101 | print(e)
102 | print(list(zip(word_list, label_list)))
103 | sys.exit()
104 | else:
105 | for word in word_list:
106 | single_check = word in special_tokens or not re.search(r'[^a-z0-9]+', word)
107 | if len(word) == 1 or single_check:
108 | if use_seg:
109 | yield (word, 'S')
110 | else:
111 | yield (word)
112 | else:
113 | new_word = list(split_cn_en(word))
114 | if use_seg:
115 | seg_tag = ['M'] * len(new_word)
116 | seg_tag[0] = 'B'
117 | seg_tag[-1] = 'E'
118 | for x, y in zip(new_word, seg_tag):
119 | yield (x, y)
120 | else:
121 | for x in new_word:
122 | yield x
123 |
124 |
125 | # look up the radical of each char (empty string if unknown)
126 | def get_radical(d, char_list):
127 | return [d[char] if char in d else '' for char in char_list]
128 |
129 |
130 | def word2char(word_list, label_list=None, task_type='',
131 | use_seg=False, radical_dict=None):
132 | """
133 |     convert the basic token unit from word to char
134 |     non-Chinese words are not naively split into char sequences,
135 |     e.g. "machine02" will be split into "machine" and "02"
136 | """
137 |
138 | if task_type == 'classification':
139 | assert label_list is None
140 | assert radical_dict is None
141 | assert use_seg is False
142 | return [char for word in word_list for char in list(split_cn_en(word))]
143 | elif task_type == 'sequence_labeling':
144 | results = list(
145 | zip(*[item for item in extract_char(word_list, label_list, use_seg)]))
146 | if label_list:
147 | if use_seg:
148 | chars, new_labels, seg_tags = results
149 | assert len(chars) == len(new_labels) == len(seg_tags)
150 | else:
151 | chars, new_labels = results
152 | assert len(chars) == len(new_labels)
153 | new_result = {'token': chars, 'label': new_labels}
154 | else:
155 | if use_seg:
156 | chars, seg_tags = results
157 | assert len(chars) == len(seg_tags)
158 | else:
159 | chars = results
160 | new_result = {'token': chars}
161 | if use_seg:
162 | new_result['seg'] = seg_tags
163 | if radical_dict:
164 | new_result['radical'] = get_radical(radical_dict, chars)
165 | return new_result
166 | else:
167 | logger.error('invalid task type')
168 | sys.exit()
169 |
170 |
171 | def shorten_word(word):
172 | """
173 | Shorten groupings of 3+ identical consecutive chars to 2, e.g. '!!!!' --> '!!'
174 | """
175 |
176 |     # must have at least 3 chars to be shortened
177 | if len(word) < 3:
178 | return word
179 | # find groups of 3+ consecutive letters
180 | letter_groups = [list(g) for k, g in groupby(word)]
181 | triple_or_more = [''.join(g) for g in letter_groups if len(g) >= 3]
182 | if len(triple_or_more) == 0:
183 | return word
184 | # replace letters to find the short word
185 | short_word = word
186 | for trip in triple_or_more:
187 | short_word = short_word.replace(trip, trip[0] * 2)
188 |
189 | return short_word
190 |
191 |
192 | # Command line arguments are cast to bool type
193 | def boolean_string(s):
194 | if s not in {'False', 'True'}:
195 | raise ValueError('Not a valid boolean string')
196 | return s == 'True'
197 |
198 |
199 | # decorator to time a function
200 | def timer(function):
201 |     def log_time(*args, **kwargs):
202 |         start_time = time.time()
203 |         result = function(*args, **kwargs)
204 |         logger.info('Function "{name}" finished in {time:.2f} s'.format(name=function.__name__, time=time.time() - start_time))
205 |         return result
206 |     return log_time
207 |
208 |
209 | # generate a smaller embedding file restricted to the given vocab
210 | def gen_small_embedding(vocab_file, embed_file, output_file):
211 | vocab = set([word.strip() for word in open(vocab_file, encoding='utf8')])
212 | print('total vocab: ', len(vocab))
213 | fin = io.open(embed_file, 'r', encoding='utf-8', newline='\n', errors='ignore')
214 | try:
215 | n, d = map(int, fin.readline().split())
216 | except Exception:
217 |         sys.exit('please make sure the embed file is gensim-formatted')
218 |
219 | def gen():
220 | for line in fin:
221 | token = line.rstrip().split(' ', 1)[0]
222 | if token in vocab:
223 | yield line
224 |
225 | result = [line for line in gen()]
226 | rate = 1 - len(result) / len(vocab)
227 | print('oov rate: {:4.2f}%'.format(rate * 100))
228 |
229 | with open(output_file, 'w', encoding='utf8') as fout:
230 | fout.write(str(len(result)) + ' ' + str(d) + '\n')
231 | for line in result:
232 | fout.write(line)
233 |
234 |
235 | # load embeddings from text file
236 | def load_vectors(fname, vocab):
237 | fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
238 | _, d = map(int, fin.readline().split())
239 | data = {}
240 | for line in fin:
241 | tokens = line.rstrip().split(' ')
242 | data[tokens[0]] = np.asarray(tokens[1:], dtype='float32')
243 |
244 | scale = 0.25
245 | # scale = np.sqrt(3.0 / n_dim)
246 | embedding_matrix = np.random.uniform(-scale, scale, [len(vocab), d])
247 | embedding_matrix[0] = np.zeros(d)
248 | cnt = 0
249 | for word, i in vocab._token2id.items():
250 | embedding_vector = data.get(word)
251 | if embedding_vector is not None:
252 | cnt += 1
253 | embedding_matrix[i] = embedding_vector
254 |     logger.info('OOV rate: {:04.2f} %'.format((1 - cnt / len(vocab._token2id)) * 100))
255 | return embedding_matrix, d
256 |
257 |
258 | def load_tc_data(fname, label_prefix='__label__', max_tokens_per_doc=256):
259 |
260 | def gen():
261 | with open(fname, 'r', encoding='utf8') as fin:
262 | for line in fin:
263 | words = line.strip().split()
264 | if words:
265 | nb_labels = 0
266 | label_line = []
267 | for word in words:
268 | if word.startswith(label_prefix):
269 | nb_labels += 1
270 | label = word.replace(label_prefix, "")
271 | label_line.append(label)
272 | else:
273 | break
274 | text = words[nb_labels:]
275 | if len(text) > max_tokens_per_doc:
276 | text = text[:max_tokens_per_doc]
277 | yield (text, label_line)
278 |
279 | texts, labels = zip(*[item for item in gen()])
280 | return texts, labels
281 |
282 |
283 | def load_sl_data(fname, data_format='basic'):
284 |
285 | def process_conll(data):
286 | sents, labels = [], []
287 | tokens, tags = [], []
288 | for line in data:
289 | if line:
290 | token, tag = line.split('\t')
291 | tokens.append(token)
292 | tags.append(tag)
293 | else:
294 | sents.append(tokens)
295 | labels.append(tags)
296 | tokens, tags = [], []
297 | return sents, labels
298 |
299 | data = (line.strip() for line in open(fname, 'r', encoding='utf8'))
300 | if data_format:
301 | if data_format == 'basic':
302 | texts, labels = zip(
303 | *[zip(*[item.rsplit('###', 1) for item in line.split('\t')]) for line in data])
304 | elif data_format == 'conll':
305 | texts, labels = process_conll(data)
306 | return texts, labels
307 | else:
308 | print('invalid data format for sequence labeling task')
309 |
310 |
311 | def convert_seq_format(fin_name, fout_name, dest_format='conll'):
312 | if dest_format == 'conll':
313 | basic2conll(fin_name, fout_name)
314 | elif dest_format == 'basic':
315 | conll2basic(fin_name, fout_name)
316 | else:
317 | logger.warning('invalid data format')
318 |
319 |
320 | def basic2conll(fin_name, fout_name):
321 | data = [line.strip() for line in open(fin_name, 'r', encoding='utf8')]
322 | with open(fout_name, 'w', encoding='utf8') as fout:
323 | for line in data:
324 | for item in line.split('\t'):
325 |                 token, label = item.rsplit('###', 1)
326 | label = label.strip('#')
327 | fout.write(token + '\t' + label + '\n')
328 | fout.write('\n')
329 |
330 |
331 | def conll2basic(fin_name, fout_name):
332 | data = [line.strip() for line in open(fin_name, 'r', encoding='utf8')]
333 | with open(fout_name, 'w', encoding='utf8') as fout:
334 | tmp = []
335 | for line in data:
336 | if line:
337 | token, label = line.split('\t')
338 | label = label.strip('\t')
339 | item = token + '###' + label
340 | tmp.append(item)
341 | else:
342 | new_line = '\t'.join(tmp) + '\n'
343 | fout.write(new_line)
344 | tmp = []
345 |
--------------------------------------------------------------------------------
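A minimal sketch of two helpers above that are easy to try in isolation; the commented outputs follow from the logic shown and assume the package is importable.

from nlp_toolkit.utilities import shorten_word, word2char

# runs of 3+ identical chars are collapsed to 2
print(shorten_word('cooool!!!!'))
# -> 'cool!!'

# for classification, words are split into chars while latin/digit runs stay whole
print(word2char(['机器', '学习', 'machine02'], task_type='classification'))
# -> ['机', '器', '学', '习', 'machine', '02']
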
/nlp_toolkit/visualization.py:
--------------------------------------------------------------------------------
1 | """
2 | some Visualization Functions
3 | """
4 | import random
5 | from seqeval.metrics.sequence_labeling import get_entities
6 | from typing import List
7 | from copy import deepcopy
8 |
9 | ENTITY_COLOR = ['#ff9900', '#00ccff', '#66ff99', '#ff3300', '#9933ff', '#669999']
10 |
11 |
12 | def highlight_by_weight(word, att_weight):
13 | html_color = '#%02X%02X%02X' % (255, int(255 * (1 - att_weight)), int(255 * (1 - att_weight)))
14 |     return '<span style="background-color: {}">{}</span>'.format(html_color, word)
15 |
16 |
17 | def att2html(words, att_weights):
18 | html = ""
19 | for word, att_weight in zip(words, att_weights):
20 | html += ' ' + highlight_by_weight(word, att_weight)
21 |     return html + "<br>\n"
22 |
23 |
24 | def attention_visualization(texts: List[List[str]], attention_weights,
25 | output_fname='attention_texts.html'):
26 | with open(output_fname, 'w') as fout:
27 | for x, y in zip(texts, attention_weights):
28 | fout.write(att2html(x, y))
29 |
30 |
31 | def highlight_entity(words: List[str], entity_type, entity_color):
32 | if entity_type:
33 | html_color = entity_color[entity_type]
34 | words = ' '.join(words) + ' [%s]' % entity_type
35 |         return '<span style="background-color: {}">{}</span>'.format(html_color, words)
36 | else:
37 | return ' '.join(words)
38 |
39 |
40 | def entity2html(words, labels, entity_colors):
41 | html = ""
42 | entity_dict = {item[1]: [item[0], item[-1]] for item in labels}
43 | start, end = 0, 0
44 | while end < len(words):
45 | if end not in entity_dict:
46 | end += 1
47 | if end == len(words):
48 |                 html += ' '.join(words[start:])
49 | else:
50 | if end > start:
51 | html += highlight_entity(words[start: end], None, entity_colors) + ' '
52 | entity_info = entity_dict[end]
53 | entity_start = end
54 | entity_end = entity_info[-1] + 1
55 | html += highlight_entity(words[entity_start: entity_end], entity_info[0], entity_colors) + ' '
56 | start = entity_end
57 | end = start
58 |     return html + "<br>\n"
59 |
60 |
61 | def entity_visualization(texts: List[List[str]], labels: List[List[str]],
62 | output_fname='entity_texts.html'):
63 | texts_c = deepcopy(texts)
64 | texts_c = [item[:-1] for item in texts_c]
65 | entities = [get_entities(item) for item in labels]
66 | all_entities = list(set([sub_item[0] for item in entities for sub_item in item]))
67 | all_entities = [item for item in all_entities if item != 'O']
68 | nb_entities = len(all_entities)
69 | if nb_entities > len(ENTITY_COLOR):
70 | rest_nb_colors = nb_entities - len(ENTITY_COLOR)
71 | colors = ENTITY_COLOR + ['#' + ''.join([random.choice('0123456789ABCDEF') for j in range(6)])
72 | for i in range(rest_nb_colors)]
73 | else:
74 | colors = ENTITY_COLOR[:nb_entities]
75 | assert len(colors) == nb_entities
76 | entity_colors = {all_entities[i]: colors[i] for i in range(nb_entities)}
77 |
78 | with open(output_fname, 'w') as fout:
79 | for x, y in zip(texts_c, entities):
80 | fout.write(entity2html(x, y, entity_colors))
81 |
82 |
83 | def plot_loss_acc(history, task):
84 | import matplotlib.pyplot as plt
85 |
86 | nb_epochs = len(history.val_acc)
87 | epoch_size_nearly = len(history.acc) // nb_epochs
88 | val_x = [i for i in range(len(history.acc)) if i %
89 | epoch_size_nearly == 0][1:] + [len(history.acc)-1]
90 |
91 | f = plt.figure(figsize=(15, 45))
92 | ax1 = f.add_subplot(311)
93 | ax2 = f.add_subplot(312)
94 | ax3 = f.add_subplot(313)
95 |
96 | ax1.set_title("Train & Dev Acc")
97 | ax1.plot(history.acc, color="g", label="Train")
98 | ax1.plot(val_x, history.val_acc, color="b", label="Dev")
99 | ax1.legend(loc="best")
100 |
101 | ax2.set_title("Train & Dev Loss")
102 | ax2.plot(history.loss, color="g", label="Train")
103 | ax2.plot(val_x, history.val_loss, color="b", label="Dev")
104 | ax2.legend(loc="best")
105 |
106 | if task == 'classification':
107 | ax3.set_title("F1 per epoch")
108 | ax3.plot(history.metrics['f1'], color="g", label="F1")
109 | elif task == 'sequence_labeling':
110 | ax3.set_title("F1 and acc per epoch")
111 | ax3.plot(history.metrics['f1_seq'], color="g", label="F1")
112 | ax3.plot(history.metrics['seq_acc'], color="b", label="Acc")
113 | ax3.legend(loc="best")
114 |
115 | plt.tight_layout()
116 | plt.show()
117 |
--------------------------------------------------------------------------------
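A minimal sketch of attention_visualization with hand-made inputs. In practice the texts and weights come from a classifier trained with return_attention=True (see reproduction/company_pro_con_classify.py below); here the per-token weights are simply assumed to lie in [0, 1].

from nlp_toolkit.visualization import attention_visualization

texts = [['service', 'was', 'great'], ['food', 'was', 'cold']]
weights = [[0.1, 0.2, 0.9], [0.3, 0.1, 0.8]]  # toy per-token attention scores
attention_visualization(texts, weights, output_fname='attention_demo.html')
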
/reproduction/company_pro_con_classify.py:
--------------------------------------------------------------------------------
1 | from nlp_toolkit.data import Dataset
2 | from nlp_toolkit.classifier import Classifier
3 | import yaml
4 |
5 | data_path = '../sample_data/company_pro_con.txt'
6 | config_path = '../config_classification.yaml'
7 |
8 | # yaml.safe_load() is recommended
9 | config = yaml.safe_load(open(config_path, encoding='utf8'))
10 | config['model']['bi_lstm_att']['return_attention'] = True
11 |
12 | # load the data and initialize parameters
13 | dataset = Dataset(fname=data_path, task_type='classification',
14 | mode='train', config=config)
15 |
16 | # define the classifier
17 | classifier = Classifier(model_name='bi_lstm_att', dataset=dataset,
18 | seq_type='bucket')
19 |
20 | # model training
21 | # a "models" directory will be created in the current directory to store the training results
22 | trained_model = classifier.train()
23 |
--------------------------------------------------------------------------------
/reproduction/noun_phrases_detect.py:
--------------------------------------------------------------------------------
1 | from nlp_toolkit.data import Dataset
2 | from nlp_toolkit.labeler import Labeler
3 | import yaml
4 |
5 | data_path = '../sample_data/cv_word_basic.txt'
6 | config_path = '../config_sequence_labeling.yaml'
7 |
8 | # yaml.safe_load() is recommended
9 | config = yaml.safe_load(open(config_path, encoding='utf8'))
10 | config['data']['basic_token'] = 'char'
11 | config['data']['use_seg'] = True
12 | config['data']['use_radical'] = True
13 |
14 | # load the data and initialize parameters
15 | dataset = Dataset(fname=data_path, task_type='sequence_labeling',
16 | mode='train', config=config)
17 |
18 | # define the sequence labeler
19 | seq_labeler = Labeler(model_name='char_rnn', dataset=dataset,
20 | seq_type='bucket')
21 |
22 | # model training
23 | # a "models" directory will be created in the current directory to store the training results
24 | trained_model = seq_labeler.train()
25 |
--------------------------------------------------------------------------------
/requirements-gpu.txt:
--------------------------------------------------------------------------------
1 | tensorflow-gpu>=1.9.0
2 | Keras==2.2.4
3 | numpy>=1.14.3
4 | scikit-learn>=0.19.1
5 | seqeval>=0.0.5
6 | hanziconv>=0.3.2
7 | jieba>=0.39
8 | GPUtil>=1.3.0
9 | ruamel.yaml>=0.15.81
10 | -e git+https://www.github.com/keras-team/keras-contrib.git#egg=keras-contrib
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | tensorflow>=1.9.0
2 | Keras==2.2.4
3 | numpy>=1.14.3
4 | scikit-learn>=0.19.1
5 | seqeval>=0.0.5
6 | hanziconv>=0.3.2
7 | jieba>=0.39
8 | ruamel.yaml>=0.15.81
9 | -e git+https://www.github.com/keras-team/keras-contrib.git#egg=keras-contrib
--------------------------------------------------------------------------------
/sample_data/company_pro_con.txt:
--------------------------------------------------------------------------------
1 | __label__pos 进去 前 许诺 的 工资 给 的 高
2 | __label__pos 校园 环境 优美 , 美女 很多 , 适合 居住 , 食堂 饭菜 便宜 , 操场 好 , 可以 天天 运动
3 | __label__pos 老板 人 很好 老 员工 会 各种 教 你 东西 , 而且 不会 有所 保留 薪水 在 大连 还 算 可以
4 | __label__neg 人员 比较 多 , 复杂 办公室 容易 形成 拉帮结派 不利于 企业 发展
5 | __label__neg 出差 太多 了 。 在 现场 开发 很苦 逼 。
6 | __label__neg 公司 目前 地理位置 不 太 理想 , 离 城市 中心 较 远点 。
7 | __label__pos 公司 的 技术 水平 国内 顶尖 , 十几 年 的 资历 , 制作 的 作品 几乎 都 是 精品 , 参与 过 很多 知名 项目 。
8 | __label__neg 工作 流程 复杂 个人 上升 空间 有限 新产品 的 创新 能力 有限 组织 架构 稍 显 臃肿
9 | __label__neg 无偿 加班 , 加班 多 , 没 加班费 , 压力 很大
10 | __label__pos 环境 比较 轻松 , 跟 项目 走 , 能 学 不少 专业 知识 , 经验 很 重要
11 | __label__neg 有 命 挣钱 没命 花 , 不 适合 发展
12 | __label__neg 制度 管理 不是 很 完善 。
13 | __label__pos 人文 氛围 厚重 、 和谐 , 涉及 多 个 领域 有 发展 前景 。
14 | __label__neg 实习 的 时候 看 , 比较 死板 整个 工作 的 活力 不大 , 规矩 特别 多 , 会 很多 ,
15 | __label__pos 有 能力 的 人 就 有 很多 机会
16 | __label__neg 加班 很多 加班 费用 很少 领导 不一定 都 晓得 下面 人员 流动 大 。
17 | __label__neg 环境 小 , 人员 少且 流动性 很大 ; 工作 平台 局限性 较大 , 难以 得到 较好 的 锻炼 与 发展 。
18 | __label__pos 一 年 14 薪 , 每年 都加 工资 , 不 随便 裁员
19 | __label__pos 差劲 的 公司 , 刚开始 用 中 韩 合资 的 名称 唬人 , 其实 就是 私人 土 老板 家族 企业
20 | __label__pos 地理位置 较好 , 位于 浙江省 境内 , 天时地利 人 和 , 可以 向 外国 出口 产品
21 | __label__neg 薪水 不 高
22 | __label__pos 随意 年轻时 有 一定 成长 空间
23 | __label__neg 小气 , 抠门 , 没有 什么 发展前途 。
24 | __label__neg 有时 压力 偏 大 , 当 老师 估计 都 不 轻松
25 | __label__pos 感觉 公司 还是 挺 正规 !
--------------------------------------------------------------------------------
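The file above uses the fastText-style layout parsed by load_tc_data in utilities.py: each line starts with one or more __label__ tags followed by space-separated tokens. A minimal sketch, assuming the repository root as the working directory:

from nlp_toolkit.utilities import load_tc_data

texts, labels = load_tc_data('sample_data/company_pro_con.txt')
print(labels[0])     # ['pos']
print(texts[0][:4])  # ['进去', '前', '许诺', '的']
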
/sample_data/cv_word_basic.txt:
--------------------------------------------------------------------------------
1 | 主要###O 帮助###O 工地###B-Chunk 师傅###E-Chunk 一起###O 超平###O ,###O 防线###O 工作###O
2 | 协助###O 线###O 上###O 、###O 线###B-Chunk 下###I-Chunk 活动###E-Chunk 的###O 执行###O
3 | 执行###O 各项###O 培训###O 相关###O 的###O 各项###O 工作###B-Chunk 流程###E-Chunk
4 | 云南###O :###O 曲靖###O 、###O 昭通###O 下属###O 的###O 5###O 个###O 县级###O 供电###B-Chunk 公司###E-Chunk 10###O 个###O 供电所###O
5 | 担任###O 培训###O 学校###B-Chunk 英语###I-Chunk 讲师###E-Chunk 一###O 职###O 和###O 学生###B-Chunk 管理###E-Chunk
6 | 搜寻###O 招标###B-Chunk 公告###E-Chunk ,###O 告知###O 领导###O 及###O 业务###B-Chunk 人员###E-Chunk ,###O 确认###O 是否###O 报名###O
7 | 2001###O /###O 10###O --###O 2002###O /###O 04###O :###O 上海###O 润###O 宝###O 工贸###B-Chunk 公司###E-Chunk ###O 所属###O 行业###O :###O ###O 环保###O ###O 销售部###O ###O 销售###B-Chunk 代表###E-Chunk ###O 负责###O 江浙###O 一带###O 工业###O 圆###O 区###O 的###O 空气过滤器###O 的###O 销售###O 和###O 维护###O ,###O 期间###O 昆山###O 翊###O 腾###O 电子###O 是###O 长期###O 的###O 客户###O
8 | ###O 仓库###B-Chunk 管理###E-Chunk :###O 对###O 仓库###O 进行###O 合理###O 布局###O ,###O 为###O 方便###O 员工###B-Chunk 操作###E-Chunk 和###O 减少###O 失误###O ,###O 能够###O 独立###O 编排###O 库###O 位###O 图###O 和###O 货位###O 表###O
9 | ###O 档案###B-Chunk 管理###E-Chunk :###O 能###O 独立###O 制定###O 仓库###B-Chunk 管理###E-Chunk 文档###O ,###O 专人###O 负责###O 仓库###O 资料###O 的###O 更新###O 归档###O 并###O 定期###O 检查###O
10 | 电子###B-Chunk 技术###E-Chunk /###O 半导体###O /###O 集成电路###O
11 | 手机###B-Chunk 射频###I-Chunk 信号###I-Chunk 测试###E-Chunk
12 | 在###O 金源###B-Chunk 集团###E-Chunk 的###O 世纪城###O 三期###O 担任###O 置业###B-Chunk 顾问###E-Chunk ,###O 负责###O 房地产###B-Chunk 销售###E-Chunk
13 | 根据###O 用户###O 的###O 反馈###O 以及###O 运营###O 数据###O 的###O 分析###O ,###O 绘制###O 并###O 撰写###O 新版本###O 的###O 原型###B-Chunk 图###E-Chunk 和###O prd###O 文###O 档###O
14 | 根据###O 各###O 渠道###O 的###O 平台###B-Chunk 要求###E-Chunk ,###O 定制###O 个性化###B-Chunk 产品###E-Chunk ,###O 并###O 负责###O 所###O 发布###O 版本###O 的###O 跟踪###O 管理###O
15 | ###O 2005###O /###O 12###O --###O 2006###O /###O 05###O :###O 广州###O 南沣###O 电子###B-Chunk 有限公司###E-Chunk ###O 所属###O 行业###O :###O ###O 电子###B-Chunk 技术###E-Chunk /###O 半导体###O ###O 技术###O 部###O ###O 维修###B-Chunk 技术###I-Chunk 员###E-Chunk ###O 负责###O 调试###O 与###O 维修###O 线切割###B-Chunk 机床###E-Chunk 控制器###O ###O 2006###O /###O 03###O ###O 至今###O :###O 中国###O 电器###B-Chunk 科学研究###E-Chunk 园###O ###O 擎天###B-Chunk 实业###I-Chunk 有限公司###E-Chunk ###O 所属###O 行业###O :###O 电器###O ,###O 电子###O ###O 技术###O 部###O ###O 维修###B-Chunk 技术员###E-Chunk
16 | ###O 南京###B-Chunk 项目###E-Chunk 先期###O 股权###B-Chunk 融资###E-Chunk 置换###O
17 | 每月###O 核对###O 及###O 结算###B-Chunk 银行###I-Chunk 流水账###E-Chunk 和###O 现金###B-Chunk 日记账###E-Chunk ,###O 做到###O 账###O 实###O 相符###O ,###O 出具###O 汇总###O 对账###B-Chunk 表###E-Chunk
--------------------------------------------------------------------------------
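The file above is the 'basic' sequence-labeling layout handled by load_sl_data in utilities.py: one sentence per line, token###label items separated by tabs (the next file holds the same sentences in CoNLL layout). A minimal sketch, assuming the columns are tab-separated as the loader expects and the repository root is the working directory:

from nlp_toolkit.utilities import load_sl_data

texts, labels = load_sl_data('sample_data/cv_word_basic.txt', data_format='basic')
print(list(texts[0])[:3], list(labels[0])[:3])

# the CoNLL variant of the same data loads with data_format='conll'
texts_c, labels_c = load_sl_data('sample_data/cv_word_conll.txt', data_format='conll')
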
/sample_data/cv_word_conll.txt:
--------------------------------------------------------------------------------
1 | 主要 O
2 | 帮助 O
3 | 工地 B-Chunk
4 | 师傅 E-Chunk
5 | 一起 O
6 | 超平 O
7 | , O
8 | 防线 O
9 | 工作 O
10 |
11 | 协助 O
12 | 线 O
13 | 上 O
14 | 、 O
15 | 线 B-Chunk
16 | 下 I-Chunk
17 | 活动 E-Chunk
18 | 的 O
19 | 执行 O
20 |
21 | 执行 O
22 | 各项 O
23 | 培训 O
24 | 相关 O
25 | 的 O
26 | 各项 O
27 | 工作 B-Chunk
28 | 流程 E-Chunk
29 |
30 | 云南 O
31 | : O
32 | 曲靖 O
33 | 、 O
34 | 昭通 O
35 | 下属 O
36 | 的 O
37 | 5 O
38 | 个 O
39 | 县级 O
40 | 供电 B-Chunk
41 | 公司 E-Chunk
42 | 10 O
43 | 个 O
44 | 供电所 O
45 |
46 | 担任 O
47 | 培训 O
48 | 学校 B-Chunk
49 | 英语 I-Chunk
50 | 讲师 E-Chunk
51 | 一 O
52 | 职 O
53 | 和 O
54 | 学生 B-Chunk
55 | 管理 E-Chunk
56 |
57 | 搜寻 O
58 | 招标 B-Chunk
59 | 公告 E-Chunk
60 | , O
61 | 告知 O
62 | 领导 O
63 | 及 O
64 | 业务 B-Chunk
65 | 人员 E-Chunk
66 | , O
67 | 确认 O
68 | 是否 O
69 | 报名 O
70 |
71 | 2001 O
72 | / O
73 | 10 O
74 | -- O
75 | 2002 O
76 | / O
77 | 04 O
78 | : O
79 | 上海 O
80 | 润 O
81 | 宝 O
82 | 工贸 B-Chunk
83 | 公司 E-Chunk
84 | O
85 | 所属 O
86 | 行业 O
87 | : O
88 | O
89 | 环保 O
90 | O
91 | 销售部 O
92 | O
93 | 销售 B-Chunk
94 | 代表 E-Chunk
95 | O
96 | 负责 O
97 | 江浙 O
98 | 一带 O
99 | 工业 O
100 | 圆 O
101 | 区 O
102 | 的 O
103 | 空气过滤器 O
104 | 的 O
105 | 销售 O
106 | 和 O
107 | 维护 O
108 | , O
109 | 期间 O
110 | 昆山 O
111 | 翊 O
112 | 腾 O
113 | 电子 O
114 | 是 O
115 | 长期 O
116 | 的 O
117 | 客户 O
118 |
119 | O
120 | 仓库 B-Chunk
121 | 管理 E-Chunk
122 | : O
123 | 对 O
124 | 仓库 O
125 | 进行 O
126 | 合理 O
127 | 布局 O
128 | , O
129 | 为 O
130 | 方便 O
131 | 员工 B-Chunk
132 | 操作 E-Chunk
133 | 和 O
134 | 减少 O
135 | 失误 O
136 | , O
137 | 能够 O
138 | 独立 O
139 | 编排 O
140 | 库 O
141 | 位 O
142 | 图 O
143 | 和 O
144 | 货位 O
145 | 表 O
146 |
147 | O
148 | 档案 B-Chunk
149 | 管理 E-Chunk
150 | : O
151 | 能 O
152 | 独立 O
153 | 制定 O
154 | 仓库 B-Chunk
155 | 管理 E-Chunk
156 | 文档 O
157 | , O
158 | 专人 O
159 | 负责 O
160 | 仓库 O
161 | 资料 O
162 | 的 O
163 | 更新 O
164 | 归档 O
165 | 并 O
166 | 定期 O
167 | 检查 O
168 |
169 | 电子 B-Chunk
170 | 技术 E-Chunk
171 | / O
172 | 半导体 O
173 | / O
174 | 集成电路 O
175 |
176 | 手机 B-Chunk
177 | 射频 I-Chunk
178 | 信号 I-Chunk
179 | 测试 E-Chunk
180 |
181 | 在 O
182 | 金源 B-Chunk
183 | 集团 E-Chunk
184 | 的 O
185 | 世纪城 O
186 | 三期 O
187 | 担任 O
188 | 置业 B-Chunk
189 | 顾问 E-Chunk
190 | , O
191 | 负责 O
192 | 房地产 B-Chunk
193 | 销售 E-Chunk
194 |
195 | 根据 O
196 | 用户 O
197 | 的 O
198 | 反馈 O
199 | 以及 O
200 | 运营 O
201 | 数据 O
202 | 的 O
203 | 分析 O
204 | , O
205 | 绘制 O
206 | 并 O
207 | 撰写 O
208 | 新版本 O
209 | 的 O
210 | 原型 B-Chunk
211 | 图 E-Chunk
212 | 和 O
213 | prd O
214 | 文 O
215 | 档 O
216 |
217 | 根据 O
218 | 各 O
219 | 渠道 O
220 | 的 O
221 | 平台 B-Chunk
222 | 要求 E-Chunk
223 | , O
224 | 定制 O
225 | 个性化 B-Chunk
226 | 产品 E-Chunk
227 | , O
228 | 并 O
229 | 负责 O
230 | 所 O
231 | 发布 O
232 | 版本 O
233 | 的 O
234 | 跟踪 O
235 | 管理 O
236 |
237 | O
238 | 2005 O
239 | / O
240 | 12 O
241 | -- O
242 | 2006 O
243 | / O
244 | 05 O
245 | : O
246 | 广州 O
247 | 南沣 O
248 | 电子 B-Chunk
249 | 有限公司 E-Chunk
250 | O
251 | 所属 O
252 | 行业 O
253 | : O
254 | O
255 | 电子 B-Chunk
256 | 技术 E-Chunk
257 | / O
258 | 半导体 O
259 | O
260 | 技术 O
261 | 部 O
262 | O
263 | 维修 B-Chunk
264 | 技术 I-Chunk
265 | 员 E-Chunk
266 | O
267 | 负责 O
268 | 调试 O
269 | 与 O
270 | 维修 O
271 | 线切割 B-Chunk
272 | 机床 E-Chunk
273 | 控制器 O
274 | O
275 | 2006 O
276 | / O
277 | 03 O
278 | O
279 | 至今 O
280 | : O
281 | 中国 O
282 | 电器 B-Chunk
283 | 科学研究 E-Chunk
284 | 园 O
285 | O
286 | 擎天 B-Chunk
287 | 实业 I-Chunk
288 | 有限公司 E-Chunk
289 | O
290 | 所属 O
291 | 行业 O
292 | : O
293 | 电器 O
294 | , O
295 | 电子 O
296 | O
297 | 技术 O
298 | 部 O
299 | O
300 | 维修 B-Chunk
301 | 技术员 E-Chunk
302 |
303 | O
304 | 南京 B-Chunk
305 | 项目 E-Chunk
306 | 先期 O
307 | 股权 B-Chunk
308 | 融资 E-Chunk
309 | 置换 O
310 |
311 | 每月 O
312 | 核对 O
313 | 及 O
314 | 结算 B-Chunk
315 | 银行 I-Chunk
316 | 流水账 E-Chunk
317 | 和 O
318 | 现金 B-Chunk
319 | 日记账 E-Chunk
320 | , O
321 | 做到 O
322 | 账 O
323 | 实 O
324 | 相符 O
325 | , O
326 | 出具 O
327 | 汇总 O
328 | 对账 B-Chunk
329 | 表 E-Chunk
330 |
331 |
--------------------------------------------------------------------------------
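When basic_token is set to 'char' (as in reproduction/noun_phrases_detect.py), word-level rows like the ones above are converted with word2char from utilities.py: B/I/E chunk labels are re-assigned to the characters and optional 'seg' tags mark word boundaries. A small sketch; the commented outputs are what the logic above implies.

from nlp_toolkit.utilities import word2char

out = word2char(['工地', '师傅', '一起'], ['B-Chunk', 'E-Chunk', 'O'],
                task_type='sequence_labeling', use_seg=True)
print(out['token'])  # ('工', '地', '师', '傅', '一', '起')
print(out['label'])  # ('B-Chunk', 'I-Chunk', 'I-Chunk', 'E-Chunk', 'O', 'O')
print(out['seg'])    # ('B', 'E', 'B', 'E', 'B', 'E')
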
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup, find_packages
2 |
3 | # runtime dependencies are listed explicitly below; the editable keras-contrib line in
4 | # requirements.txt cannot be expressed as an install_requires entry
5 | long_description = open('README.md', encoding='utf-8').read()
6 |
7 | REQUIREMENTS = ['seqeval>=0.0.5', 'Keras>=2.2.4',
8 | 'tensorflow>=1.9.0', 'jieba>=0.39',
9 | 'numpy>=1.14.3', 'scikit-learn>=0.19.1',
10 | 'hanziconv>=0.3.2', 'ruamel.yaml>=0.15.81']
11 |
12 | setup(
13 | name='nlp_toolkit',
14 | version='1.3.2',
15 | description='NLP Toolkit with easy model training and applications',
16 | long_description=long_description,
17 | long_description_content_type='text/markdown',
18 | author='yilei.wang',
19 | author_email='stevewyl@163.com',
20 | license='MIT',
21 | install_requires=REQUIREMENTS,
22 |     extras_require={
23 | 'tensorflow_gpu': ['tensorflow-gpu>=1.10.0'],
24 | 'GPUtil': ['GPUtil>=1.3.0'],
25 | },
26 | python_requires='>=3.6',
27 | packages=find_packages(),
28 | package_data={'nlp_toolkit': ['data/*.txt']},
29 | include_package_data=True,
30 | url='https://github.com/stevewyl/nlp_toolkit',
31 | classifiers=[
32 | 'Programming Language :: Python :: 3.6',
33 | 'License :: OSI Approved :: MIT License',
34 | 'Topic :: Scientific/Engineering :: Artificial Intelligence',
35 | ],
36 | keywords='nlp keras text classification sequence labeling',
37 | )
38 |
--------------------------------------------------------------------------------