├── README.md
├── bert_downstream
│   ├── README.md
│   ├── bert_master
│   │   └── README.md
│   ├── data_path
│   │   └── README.md
│   ├── model_ckpt
│   │   └── README.md
│   ├── pre_trained
│   │   └── README.md
│   ├── train_classifier.py
│   ├── train_multi_learning.py
│   └── train_ner.py
├── ckbqa
│   ├── DUTIR中文开放域知识问答评测报告.pdf
│   ├── README.md
│   └── 基于特征融合的中文知识库问答方法.pdf
├── named_entity_recognition
│   ├── README.md
│   ├── convert_bio.py
│   ├── data_path
│   │   └── README.md
│   ├── data_utils
│   │   ├── __init__.py
│   │   └── datasets.py
│   ├── inference.py
│   ├── model_ckpt
│   │   └── README.md
│   ├── model_pb
│   │   └── README.md
│   ├── models
│   │   ├── __init__.py
│   │   └── bilstm_crf.py
│   ├── ner_main.py
│   ├── pics
│   │   ├── 命名实体识别数据图.png
│   │   └── 命名实体识别的模型总结图.png
│   └── preprocess.py
└── text_classification
    ├── README.md
    ├── data_path
    │   ├── README.md
    │   ├── tnews_data.pkl
    │   └── vocab.txt
    ├── inference.py
    ├── model_ckpt
    │   └── README.md
    ├── model_pb
    │   └── README.md
    ├── models
    │   ├── __init__.py
    │   ├── attention.py
    │   ├── base_model.py
    │   ├── bilstm_model.py
    │   ├── ffnn_model.py
    │   ├── model_utils.py
    │   └── text_cnn.py
    ├── preprocess.py
    ├── tf_metrics.py
    ├── tnews_data_eda.ipynb
    └── train_main.py
/README.md:
--------------------------------------------------------------------------------
1 | ## AwesomeNLPBaseline
2 |
3 | This project provides baseline implementations for a number of NLP tasks, including text classification, named entity recognition, entity-relation extraction, NL2SQL, CKBQA, and various BERT downstream applications.
4 |
5 | It is built mainly with TensorFlow 1.x.
6 |
7 | Honestly, TensorFlow 1.x is genuinely harder to work with than torch for some of these tasks, enough to put people off. Don't ask why not switch to torch if it is that painful (because I haven't learned it); the official answer is that the difficulty is exactly the point, and besides, TF's as-server deployment mode is really convenient when shipping projects inside a company.
8 |
9 | **Tasks**
10 |
11 | - Text classification
12 | - Named entity recognition
13 | - BERT downstream tasks
14 | - Entity-relation extraction
15 | - NL2SQL
16 | - CKBQA
17 | - more to come (continuously updated)
18 |
19 | **Directory layout**:
20 |
21 | * text_classification: text classification
22 | * named_entity_recognition: named entity recognition
23 | * entity_relation_extraction: entity-relation extraction
24 | * ckbqa: Chinese knowledge base question answering (CKBQA)
25 | * nl2sql: natural language to SQL
26 | * bert_downstream: BERT fine-tuning for downstream tasks and related BERT experiments
27 |
28 | Tip: only text classification, the BERT downstream tasks, and named entity recognition are implemented so far; the rest will be added when time allows.
29 |
30 |
31 | **Disclaimer**
32 |
33 | This project is a collection of NLP tasks the author has come across in study and work, intended for learning and exchange only. Issues and PRs are welcome.
34 |
35 |
--------------------------------------------------------------------------------
/bert_downstream/README.md:
--------------------------------------------------------------------------------
1 | ## Introduction to BERT
2 |
3 | **Overview**
4 |
5 | BERT is a pre-trained model released by Google in October 2018. As soon as it was published it drew wide attention from academia and industry alike. In terms of results, BERT set new state-of-the-art scores on 11 NLP tasks, and the work was named one of the major NLP advances of 2018 and won the NAACL 2019 best paper award. BERT follows essentially the same technical route as OpenAI's earlier GPT, differing only in technical details. The main contribution of both works is tackling NLP problems with a pre-train + fine-tune approach. Taking BERT as an example, applying the model involves two stages:
6 |
7 | - Pre-training: the network parameters are learned on large general-purpose corpora such as Wikipedia and BookCorpus, which provide enough text to expose the model to a rich range of linguistic phenomena.
8 | - Fine-tuning: the network parameters are adapted with task-specific labeled data, so there is no need to design a task-specific network for the target task and train it from scratch.
9 |
10 | For details see the original paper, [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805v1); this is the version released in October 2018, which differs slightly from the May 2019 revision [v2](https://arxiv.org/abs/1810.04805v2).
11 |
12 | A Chinese translation of the paper is also available: [BERT论文中文翻译](https://github.com/yuanxiaosc/BERT_Paper_Chinese_Translation)
13 |
14 | **Pre-trained BERT models**:
15 |
16 | - Official BERT: https://github.com/google-research/bert
17 | - Transformers: https://github.com/huggingface/transformers
18 | - HFL (HIT & iFLYTEK): https://github.com/ymcui/Chinese-BERT-wwm
19 | - Brightmart: https://github.com/brightmart/roberta_zh
20 | - CLUEPretrainedModels: https://github.com/CLUEbenchmark/CLUEPretrainedModels
21 |
22 | **BERT downstream tasks**
23 |
24 | Pre-trained models have greatly reduced the need to design task-specific architectures: attaching a simple network on top of BERT (or another pre-trained model) is enough to handle most NLP tasks, and it works remarkably well.
25 |
26 | The reason is simple: through unsupervised learning over massive corpora, BERT has already distilled the knowledge in those corpora into its pre-trained representations, so we only need to add a small task-specific structure and fine-tune it to adapt to the task at hand. That is the appeal of transfer learning.
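
As a concrete illustration, here is a minimal sketch of such a task-specific layer for single-sentence classification: one dense layer over the pooled `[CLS]` output (it mirrors `create_model` in `train_classifier.py`; `num_labels` is the number of target classes):

```
import tensorflow as tf
import bert_master.modeling as modeling

def classification_head(bert_config, input_ids, input_mask, segment_ids,
                        num_labels, is_training):
    # Shared pre-trained encoder.
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=False)
    # [CLS] representation, shape [batch_size, hidden_size].
    pooled = model.get_pooled_output()
    hidden_size = pooled.shape[-1].value
    # The entire task-specific network: a single dense layer.
    weights = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    bias = tf.get_variable("output_bias", [num_labels],
                           initializer=tf.zeros_initializer())
    return tf.nn.bias_add(tf.matmul(pooled, weights, transpose_b=True), bias)
```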
27 |
28 | Several common types of downstream tasks:
29 |
30 | - Sentence-pair classification, e.g. natural language inference (NLI); typical datasets include MNLI, QNLI, STS-B, and MRPC
31 | - Single-sentence classification, e.g. text classification; typical datasets include SST-2 and CoLA
32 | - Question answering; a typical dataset is SQuAD v1.1
33 | - Single-sentence token tagging, e.g. named entity recognition (NER); a typical dataset is CoNLL-2003 (see the sketch below)
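
For the token-tagging case, the per-token sequence output is used instead of the pooled `[CLS]` vector. A minimal sketch, assuming the same `modeling.BertModel` as above and a hypothetical `num_tags` equal to the size of the tag set:

```
# `model` is a modeling.BertModel built as in the sketch above.
sequence_output = model.get_sequence_output()   # [batch_size, seq_len, hidden_size]
token_logits = tf.layers.dense(
    sequence_output, num_tags,
    kernel_initializer=tf.truncated_normal_initializer(stddev=0.02))
# token_logits: [batch_size, seq_len, num_tags]; decode per token with argmax,
# or feed the scores into a CRF layer.
```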
34 |
35 |
36 |
37 | ## BERT downstream task code
38 |
39 | The following baselines for text classification, named entity recognition, and multi-task learning are built on the official BERT fine-tuning code.
40 |
41 | (All tasks below use the **`BERT-wwm-ext, Chinese`** model from the HFL (HIT & iFLYTEK) lab as the pre-trained model; see the links above for downloads.)
42 |
43 | 1. Text classification
44 |
45 | ```
46 | Dataset: sentiment classification, a fine-grained sentiment analysis dataset with 7 classes from the NLP Chinese Pre-trained Model Generalization Ability Challenge
47 | Training script: train_classifier.py
48 |
49 | 3 epochs with BERT-wwm-ext, Chinese and no tricks; the score submitted to CLUE is 56.04
50 | An earlier online submission with a BiLSTM scored 50.92 (code in text_classification)
51 | ([ALBERT-xxlarge](https://github.com/google-research/albert): 59.46; current [UER-ensemble](https://github.com/dbiir/UER-py): 72.20)
52 | ```
53 |
54 | 2. Named entity recognition
55 |
56 | ```
57 | Dataset: CLUENER fine-grained named entity recognition, with 10 label categories; details: https://github.com/CLUEbenchmark/CLUENER2020
58 | Training script: train_ner.py
59 | # todo next
60 | ```
61 |
62 | 3. Multi-task learning
63 |
64 | ```
65 | Dataset: NLP Chinese Pre-trained Model Generalization Ability Challenge, https://tianchi.aliyun.com/competition/entrance/531841/introduction
66 | Training script: train_multi_learning.py
67 |
68 | 3 epochs, no tricks: current score 0.5717
69 | 3 epochs, no tricks, with roberta-large: score 0.6236
70 | ```
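
The multi-task script shares a single BERT encoder and keeps a separate dense head per task; every training step draws one batch from one task's TFRecord file. The task schedule is proportional to dataset size and shuffled, as done in `set_random_task` of `train_multi_learning.py`; a minimal sketch:

```
import random

def build_task_schedule(emotion_cnt, news_cnt, nli_cnt, batch_size):
    # One entry per available batch, so within an epoch each task
    # is visited in proportion to its dataset size.
    schedule = ([1] * (emotion_cnt // batch_size)
                + [2] * (news_cnt // batch_size)
                + [3] * (nli_cnt // batch_size))
    random.shuffle(schedule)
    return schedule
```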
71 |
72 |
73 |
74 | ## Further reading
75 |
76 | Two interesting papers worth a look:
77 |
78 | 1. How to Fine-Tune BERT for Text Classification
79 |
80 | 2. Few-sample BERT fine-tuning
--------------------------------------------------------------------------------
/bert_downstream/bert_master/README.md:
--------------------------------------------------------------------------------
1 | ### File description
2 |
3 | This directory holds the official BERT model code from https://github.com/google-research/bert
4 | It contains three files:
5 | - modeling.py
6 | - optimization.py
7 | - tokenization.py
--------------------------------------------------------------------------------
/bert_downstream/data_path/README.md:
--------------------------------------------------------------------------------
1 | ### File description
2 |
3 | This directory holds the datasets.
--------------------------------------------------------------------------------
/bert_downstream/model_ckpt/README.md:
--------------------------------------------------------------------------------
1 | ### File description
2 |
3 | Checkpoint files produced by training are saved here.
--------------------------------------------------------------------------------
/bert_downstream/pre_trained/README.md:
--------------------------------------------------------------------------------
1 | ### File description
2 |
3 | This directory holds the pre-trained models. The project mainly uses the BERT model from the HFL (HIT & iFLYTEK) lab; see https://github.com/ymcui/Chinese-BERT-wwm
--------------------------------------------------------------------------------
/bert_downstream/train_classifier.py:
--------------------------------------------------------------------------------
1 | """BERT finetuning runner for text classification."""
2 |
3 | import collections
4 | import os
5 | import json
6 | import tensorflow as tf
7 |
8 | import bert_master.modeling as modeling
9 | import bert_master.optimization as optimization
10 | import bert_master.tokenization as tokenization
11 |
12 | flags = tf.flags
13 | FLAGS = flags.FLAGS
14 |
15 | # Required parameters
16 | flags.DEFINE_string(
17 | "data_dir", './data_path/tnews/',
18 | "The input data dir. Should contain the .tsv files (or other data files) "
19 | "for the task.")
20 |
21 | flags.DEFINE_string(
22 | "bert_config_file", './pre_trained/bert_config.json',
23 | "The config json file corresponding to the pre-trained BERT model. "
24 | "This specifies the model architecture.")
25 |
26 | flags.DEFINE_string("task_name", 'tnews',
27 | "The name of the task to train.")
28 |
29 | flags.DEFINE_string("vocab_file", './pre_trained/vocab.txt',
30 | "The vocabulary file that the BERT model was trained on.")
31 |
32 | flags.DEFINE_string(
33 | "output_dir", './model_ckpt/tnews/',
34 | "The output directory where the model checkpoints will be written.")
35 |
36 | flags.DEFINE_string(
37 | "init_checkpoint", './pre_trained/bert_model.ckpt',
38 | "Initial checkpoint (usually from a pre-trained BERT model).")
39 |
40 | flags.DEFINE_bool(
41 | "do_lower_case", True,
42 | "Whether to lower case the input text. Should be True for uncased "
43 | "models and False for cased models.")
44 |
45 | flags.DEFINE_integer(
46 | "max_seq_length", 128,
47 | "The maximum total input sequence length after WordPiece tokenization. "
48 | "Sequences longer than this will be truncated, and sequences shorter "
49 | "than this will be padded.")
50 |
51 | flags.DEFINE_bool("do_train", True, "Whether to run training.")
52 |
53 | flags.DEFINE_bool("do_eval", True, "Whether to run eval on the dev set.")
54 |
55 | flags.DEFINE_bool(
56 | "do_predict", True,
57 | "Whether to run the model in inference mode on the test set.")
58 |
59 | flags.DEFINE_integer("train_batch_size", 16, "Total batch size for training.")
60 |
61 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")
62 |
63 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")
64 |
65 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")
66 |
67 | flags.DEFINE_float("num_train_epochs", 3.0,
68 | "Total number of training epochs to perform.")
69 |
70 | flags.DEFINE_float(
71 | "warmup_proportion", 0.1,
72 | "Proportion of training to perform linear learning rate warmup for. "
73 | "E.g., 0.1 = 10% of training.")
74 |
75 | flags.DEFINE_integer("save_checkpoints_steps", 1000,
76 | "How often to save the model checkpoint.")
77 |
78 | flags.DEFINE_integer("iterations_per_loop", 1000,
79 | "How many steps to make in each estimator call.")
80 |
81 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")
82 |
83 | tf.flags.DEFINE_string(
84 | "tpu_name", None,
85 | "The Cloud TPU to use for training. This should be either the name "
86 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
87 | "url.")
88 |
89 | tf.flags.DEFINE_string(
90 | "tpu_zone", None,
91 | "[Optional] GCE zone where the Cloud TPU is located in. If not "
92 | "specified, we will attempt to automatically detect the GCE project from "
93 | "metadata.")
94 |
95 | tf.flags.DEFINE_string(
96 | "gcp_project", None,
97 | "[Optional] Project name for the Cloud TPU-enabled project. If not "
98 | "specified, we will attempt to automatically detect the GCE project from "
99 | "metadata.")
100 |
101 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
102 |
103 | flags.DEFINE_integer(
104 | "num_tpu_cores", 8,
105 | "Only used if `use_tpu` is True. Total number of TPU cores to use.")
106 |
107 |
108 | class InputExample(object):
109 | """A single training/test example for simple sequence classification."""
110 |
111 | def __init__(self, guid, text_a, text_b=None, label=None):
112 | """Constructs a InputExample.
113 |
114 | Args:
115 | guid: Unique id for the example.
116 | text_a: string. The untokenized text of the first sequence. For single
117 | sequence tasks, only this sequence must be specified.
118 | text_b: (Optional) string. The untokenized text of the second sequence.
119 | Only must be specified for sequence pair tasks.
120 | label: (Optional) string. The label of the example. This should be
121 | specified for train and dev examples, but not for test examples.
122 | """
123 | self.guid = guid
124 | self.text_a = text_a
125 | self.text_b = text_b
126 | self.label = label
127 |
128 |
129 | class PaddingInputExample(object):
130 | """Fake example so the num input examples is a multiple of the batch size.
131 |
132 | When running eval/predict on the TPU, we need to pad the number of examples
133 | to be a multiple of the batch size, because the TPU requires a fixed batch
134 | size. The alternative is to drop the last batch, which is bad because it means
135 | the entire output data won't be generated.
136 |
137 | We use this class instead of `None` because treating `None` as padding
138 |   batches could cause silent errors.
139 | """
140 |
141 |
142 | class InputFeatures(object):
143 | """A single set of features of data."""
144 |
145 | def __init__(self,
146 | input_ids,
147 | input_mask,
148 | segment_ids,
149 | label_id,
150 | is_real_example=True):
151 | self.input_ids = input_ids
152 | self.input_mask = input_mask
153 | self.segment_ids = segment_ids
154 | self.label_id = label_id
155 | self.is_real_example = is_real_example
156 |
157 |
158 | class TnewsProcessor:
159 | def get_train_examples(self, data_dir):
160 | """获取训练集."""
161 | return self._create_examples(
162 | self._read_tsv(os.path.join(data_dir, "train.json")), "train")
163 |
164 | def get_dev_examples(self, data_dir):
165 | """获取验证集."""
166 | return self._create_examples(
167 | self._read_tsv(os.path.join(data_dir, "dev.json")), "dev")
168 |
169 | def get_test_examples(self, data_dir):
170 | """获取测试集."""
171 | return self._create_examples(
172 | self._read_tsv(os.path.join(data_dir, "test.json")), "test")
173 |
174 | def get_labels(self):
175 | """填写新闻分类的类别标签"""
176 | return ['100', '101', '102', '103', '104', '106', '107',
177 | '108', '109', '110', '112', '113', '114', '115', '116']
178 |
179 | def _read_tsv(self, input_file):
180 | """读取数据集"""
181 | with open(input_file, encoding='utf-8') as fr:
182 | lines = fr.readlines()
183 | return lines
184 |
185 | def _create_examples(self, lines, set_type):
186 | """Creates examples for the training and dev sets."""
187 | examples = []
188 | for (i, line) in enumerate(lines):
189 | json_str = json.loads(line)
190 | guid = "%s-%s" % (set_type, i)
191 | if set_type == "test":
192 | text_a = tokenization.convert_to_unicode(json_str['sentence'])
193 | label = None
194 | guid = json_str['id']
195 | else:
196 | text_a = tokenization.convert_to_unicode(json_str['sentence'])
197 | label = tokenization.convert_to_unicode(json_str['label'])
198 | examples.append(
199 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
200 | return examples
201 |
202 |
203 | def convert_single_example(ex_index, example, label_list, max_seq_length,
204 | tokenizer):
205 | """Converts a single `InputExample` into a single `InputFeatures`."""
206 |
207 | if isinstance(example, PaddingInputExample):
208 | return InputFeatures(
209 | input_ids=[0] * max_seq_length,
210 | input_mask=[0] * max_seq_length,
211 | segment_ids=[0] * max_seq_length,
212 | label_id=0,
213 | is_real_example=False)
214 |
215 | label_map = {}
216 | for (i, label) in enumerate(label_list):
217 | label_map[label] = i
218 |
219 | tokens_a = tokenizer.tokenize(example.text_a)
220 | tokens_b = None
221 | if example.text_b:
222 | tokens_b = tokenizer.tokenize(example.text_b)
223 |
224 | if tokens_b:
225 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
226 | else:
227 | # Account for [CLS] and [SEP] with "- 2"
228 | if len(tokens_a) > max_seq_length - 2:
229 | tokens_a = tokens_a[0:(max_seq_length - 2)]
230 |
231 | tokens = []
232 | segment_ids = []
233 | tokens.append("[CLS]")
234 | segment_ids.append(0)
235 | for token in tokens_a:
236 | tokens.append(token)
237 | segment_ids.append(0)
238 | tokens.append("[SEP]")
239 | segment_ids.append(0)
240 |
241 | if tokens_b:
242 | for token in tokens_b:
243 | tokens.append(token)
244 | segment_ids.append(1)
245 | tokens.append("[SEP]")
246 | segment_ids.append(1)
247 |
248 | input_ids = tokenizer.convert_tokens_to_ids(tokens)
249 |
250 | input_mask = [1] * len(input_ids)
251 |
252 | # Zero-pad up to the sequence length.
253 | while len(input_ids) < max_seq_length:
254 | input_ids.append(0)
255 | input_mask.append(0)
256 | segment_ids.append(0)
257 |
258 | assert len(input_ids) == max_seq_length
259 | assert len(input_mask) == max_seq_length
260 | assert len(segment_ids) == max_seq_length
261 |
262 | if example.label:
263 | label_id = label_map[example.label]
264 | else:
265 | label_id = 0
266 |
267 | if ex_index < 5:
268 | tf.logging.info("*** Example ***")
269 | tf.logging.info("guid: %s" % (example.guid))
270 | tf.logging.info("tokens: %s" % " ".join(
271 | [tokenization.printable_text(x) for x in tokens]))
272 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
273 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
274 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
275 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
276 |
277 | feature = InputFeatures(
278 | input_ids=input_ids,
279 | input_mask=input_mask,
280 | segment_ids=segment_ids,
281 | label_id=label_id,
282 | is_real_example=True)
283 | return feature
284 |
285 |
286 | def file_based_convert_examples_to_features(
287 | examples, label_list, max_seq_length, tokenizer, output_file):
288 | """Convert a set of `InputExample`s to a TFRecord file."""
289 |
290 | writer = tf.python_io.TFRecordWriter(output_file)
291 |
292 | for (ex_index, example) in enumerate(examples):
293 | if ex_index % 10000 == 0:
294 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
295 |
296 | feature = convert_single_example(ex_index, example, label_list,
297 | max_seq_length, tokenizer)
298 |
299 | def create_int_feature(values):
300 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
301 | return f
302 |
303 | features = collections.OrderedDict()
304 | features["input_ids"] = create_int_feature(feature.input_ids)
305 | features["input_mask"] = create_int_feature(feature.input_mask)
306 | features["segment_ids"] = create_int_feature(feature.segment_ids)
307 | features["label_ids"] = create_int_feature([feature.label_id])
308 | features["is_real_example"] = create_int_feature(
309 | [int(feature.is_real_example)])
310 |
311 | tf_example = tf.train.Example(features=tf.train.Features(feature=features))
312 | writer.write(tf_example.SerializeToString())
313 | writer.close()
314 |
315 |
316 | def file_based_input_fn_builder(input_file, seq_length, is_training,
317 | drop_remainder):
318 | """Creates an `input_fn` closure to be passed to TPUEstimator."""
319 |
320 | name_to_features = {
321 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
322 | "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
323 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
324 | "label_ids": tf.FixedLenFeature([], tf.int64),
325 | "is_real_example": tf.FixedLenFeature([], tf.int64),
326 | }
327 |
328 | def _decode_record(record, name_to_features):
329 | """Decodes a record to a TensorFlow example."""
330 | example = tf.parse_single_example(record, name_to_features)
331 |
332 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
333 | # So cast all int64 to int32.
334 | for name in list(example.keys()):
335 | t = example[name]
336 | if t.dtype == tf.int64:
337 | t = tf.to_int32(t)
338 | example[name] = t
339 |
340 | return example
341 |
342 | def input_fn(params):
343 | """The actual input function."""
344 | batch_size = params["batch_size"]
345 |
346 | d = tf.data.TFRecordDataset(input_file)
347 | if is_training:
348 | d = d.repeat()
349 | d = d.shuffle(buffer_size=100)
350 |
351 | d = d.apply(
352 | tf.contrib.data.map_and_batch(
353 | lambda record: _decode_record(record, name_to_features),
354 | batch_size=batch_size,
355 | drop_remainder=drop_remainder))
356 |
357 | return d
358 |
359 | return input_fn
360 |
361 |
362 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
363 | """Truncates a sequence pair in place to the maximum length."""
364 |
365 | while True:
366 | total_length = len(tokens_a) + len(tokens_b)
367 | if total_length <= max_length:
368 | break
369 | if len(tokens_a) > len(tokens_b):
370 | tokens_a.pop()
371 | else:
372 | tokens_b.pop()
373 |
374 |
375 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
376 | labels, num_labels, use_one_hot_embeddings):
377 | """Creates a classification model."""
378 | model = modeling.BertModel(
379 | config=bert_config,
380 | is_training=is_training,
381 | input_ids=input_ids,
382 | input_mask=input_mask,
383 | token_type_ids=segment_ids,
384 | use_one_hot_embeddings=use_one_hot_embeddings)
385 |
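  # Pooled output: the final hidden state of the [CLS] token passed through the
  # pooler dense layer, shape [batch_size, hidden_size].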
386 | output_layer = model.get_pooled_output()
387 |
388 | hidden_size = output_layer.shape[-1].value
389 |
390 | output_weights = tf.get_variable(
391 | "output_weights", [num_labels, hidden_size],
392 | initializer=tf.truncated_normal_initializer(stddev=0.02))
393 |
394 | output_bias = tf.get_variable(
395 | "output_bias", [num_labels], initializer=tf.zeros_initializer())
396 |
397 | with tf.variable_scope("loss"):
398 | if is_training:
399 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
400 |
401 | logits = tf.matmul(output_layer, output_weights, transpose_b=True)
402 | logits = tf.nn.bias_add(logits, output_bias)
403 |
404 | log_probs = tf.nn.log_softmax(logits, axis=-1)
405 |
406 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
407 |
408 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
409 | loss = tf.reduce_mean(per_example_loss)
410 |
411 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
412 |
413 | return loss, per_example_loss, predictions
414 |
415 |
416 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
417 | num_train_steps, num_warmup_steps, use_tpu,
418 | use_one_hot_embeddings):
419 | """Returns `model_fn` closure for TPUEstimator."""
420 |
421 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
422 | """The `model_fn` for TPUEstimator."""
423 |
424 | tf.logging.info("*** Features ***")
425 | for name in sorted(features.keys()):
426 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
427 |
428 | input_ids = features["input_ids"]
429 | input_mask = features["input_mask"]
430 | segment_ids = features["segment_ids"]
431 | label_ids = features["label_ids"]
432 | is_real_example = None
433 | if "is_real_example" in features:
434 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
435 | else:
436 | is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)
437 |
438 | is_training = (mode == tf.estimator.ModeKeys.TRAIN)
439 |
440 | (total_loss, per_example_loss, predictions) = create_model(
441 | bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
442 | num_labels, use_one_hot_embeddings)
443 |
444 | tvars = tf.trainable_variables()
445 | initialized_variable_names = {}
446 | scaffold_fn = None
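    # Initialize matching variables from the pre-trained BERT checkpoint.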
447 | if init_checkpoint:
448 | (assignment_map, initialized_variable_names
449 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
450 | if use_tpu:
451 | def tpu_scaffold():
452 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
453 | return tf.train.Scaffold()
454 |
455 | scaffold_fn = tpu_scaffold
456 | else:
457 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
458 |
459 | tf.logging.info("**** Trainable Variables ****")
460 | for var in tvars:
461 | init_string = ""
462 | if var.name in initialized_variable_names:
463 | init_string = ", *INIT_FROM_CKPT*"
464 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
465 | init_string)
466 |
467 | if mode == tf.estimator.ModeKeys.TRAIN:
468 |       # Add a logging hook for the loss; otherwise it is not printed when running on GPU/CPU.
469 | logging_hook = tf.train.LoggingTensorHook({"loss": total_loss}, every_n_iter=10)
470 | train_op = optimization.create_optimizer(
471 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
472 | output_spec = tf.contrib.tpu.TPUEstimatorSpec(
473 | mode=mode,
474 | loss=total_loss,
475 | train_op=train_op,
476 | training_hooks=[logging_hook],
477 | scaffold_fn=scaffold_fn)
478 | elif mode == tf.estimator.ModeKeys.EVAL:
479 | def metric_fn(per_example_loss, label_ids, is_real_example):
480 | accuracy = tf.metrics.accuracy(
481 | labels=label_ids, predictions=predictions, weights=is_real_example)
482 | loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example)
483 | return {
484 | "eval_accuracy": accuracy,
485 | "eval_loss": loss,
486 | }
487 |
488 | eval_metrics = (metric_fn,
489 | [per_example_loss, label_ids, is_real_example])
490 | output_spec = tf.contrib.tpu.TPUEstimatorSpec(
491 | mode=mode,
492 | loss=total_loss,
493 | eval_metrics=eval_metrics,
494 | scaffold_fn=scaffold_fn)
495 | else:
496 | output_spec = tf.contrib.tpu.TPUEstimatorSpec(
497 | mode=mode,
498 | predictions={"predictions": predictions},
499 | scaffold_fn=scaffold_fn)
500 | return output_spec
501 |
502 | return model_fn
503 |
504 |
505 | def main():
506 | tf.logging.set_verbosity(tf.logging.INFO)
507 |
508 | processors = {
509 | "tnews": TnewsProcessor,
510 | }
511 |
512 | tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case,
513 | FLAGS.init_checkpoint)
514 |
515 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
516 | raise ValueError(
517 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.")
518 |
519 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
520 |
521 | if FLAGS.max_seq_length > bert_config.max_position_embeddings:
522 | raise ValueError(
523 | "Cannot use sequence length %d because the BERT model "
524 | "was only trained up to sequence length %d" %
525 | (FLAGS.max_seq_length, bert_config.max_position_embeddings))
526 |
527 | tf.gfile.MakeDirs(FLAGS.output_dir)
528 |
529 | task_name = FLAGS.task_name.lower()
530 |
531 | if task_name not in processors:
532 | raise ValueError("Task not found: %s" % (task_name))
533 |
534 | processor = processors[task_name]()
535 |
536 | label_list = processor.get_labels()
537 |
538 | tokenizer = tokenization.FullTokenizer(
539 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
540 |
541 | tpu_cluster_resolver = None
542 | if FLAGS.use_tpu and FLAGS.tpu_name:
543 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
544 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
545 |
546 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
547 | run_config = tf.contrib.tpu.RunConfig(
548 | cluster=tpu_cluster_resolver,
549 | master=FLAGS.master,
550 | model_dir=FLAGS.output_dir,
551 | save_checkpoints_steps=FLAGS.save_checkpoints_steps,
552 | tpu_config=tf.contrib.tpu.TPUConfig(
553 | iterations_per_loop=FLAGS.iterations_per_loop,
554 | num_shards=FLAGS.num_tpu_cores,
555 | per_host_input_for_training=is_per_host))
556 |
557 | train_examples = None
558 | num_train_steps = None
559 | num_warmup_steps = None
560 | if FLAGS.do_train:
561 | train_examples = processor.get_train_examples(FLAGS.data_dir)
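    # Total optimization steps = (num examples / batch size) * epochs;
    # warmup covers the first warmup_proportion of them.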
562 | num_train_steps = int(
563 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
564 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
565 |
566 | model_fn = model_fn_builder(
567 | bert_config=bert_config,
568 | num_labels=len(label_list),
569 | init_checkpoint=FLAGS.init_checkpoint,
570 | learning_rate=FLAGS.learning_rate,
571 | num_train_steps=num_train_steps,
572 | num_warmup_steps=num_warmup_steps,
573 | use_tpu=FLAGS.use_tpu,
574 | use_one_hot_embeddings=FLAGS.use_tpu)
575 |
576 | # If TPU is not available, this will fall back to normal Estimator on CPU
577 | # or GPU.
578 | estimator = tf.contrib.tpu.TPUEstimator(
579 | use_tpu=FLAGS.use_tpu,
580 | model_fn=model_fn,
581 | config=run_config,
582 | train_batch_size=FLAGS.train_batch_size,
583 | eval_batch_size=FLAGS.eval_batch_size,
584 | predict_batch_size=FLAGS.predict_batch_size)
585 |
586 | if FLAGS.do_train:
587 | train_file = os.path.join(FLAGS.data_dir, "train.tf_record")
588 | file_based_convert_examples_to_features(
589 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
590 | tf.logging.info("***** Running training *****")
591 | tf.logging.info(" Num examples = %d", len(train_examples))
592 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
593 | tf.logging.info(" Num steps = %d", num_train_steps)
594 | train_input_fn = file_based_input_fn_builder(
595 | input_file=train_file,
596 | seq_length=FLAGS.max_seq_length,
597 | is_training=True,
598 | drop_remainder=True)
599 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
600 |
601 | if FLAGS.do_eval:
602 | eval_examples = processor.get_dev_examples(FLAGS.data_dir)
603 | num_actual_eval_examples = len(eval_examples)
604 | if FLAGS.use_tpu:
605 | while len(eval_examples) % FLAGS.eval_batch_size != 0:
606 | eval_examples.append(PaddingInputExample())
607 |
608 | eval_file = os.path.join(FLAGS.data_dir, "eval.tf_record")
609 | file_based_convert_examples_to_features(
610 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)
611 |
612 | tf.logging.info("***** Running evaluation *****")
613 | tf.logging.info(" Num examples = %d (%d actual, %d padding)",
614 | len(eval_examples), num_actual_eval_examples,
615 | len(eval_examples) - num_actual_eval_examples)
616 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)
617 |
618 | # This tells the estimator to run through the entire set.
619 | eval_steps = None
620 | if FLAGS.use_tpu:
621 | assert len(eval_examples) % FLAGS.eval_batch_size == 0
622 | eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size)
623 |
624 | eval_drop_remainder = True if FLAGS.use_tpu else False
625 | eval_input_fn = file_based_input_fn_builder(
626 | input_file=eval_file,
627 | seq_length=FLAGS.max_seq_length,
628 | is_training=False,
629 | drop_remainder=eval_drop_remainder)
630 |
631 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
632 |
633 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
634 | with tf.gfile.GFile(output_eval_file, "w") as writer:
635 | tf.logging.info("***** Eval results *****")
636 | for key in sorted(result.keys()):
637 | tf.logging.info(" %s = %s", key, str(result[key]))
638 | writer.write("%s = %s\n" % (key, str(result[key])))
639 |
640 | if FLAGS.do_predict:
641 |     # Mapping from predicted class index to TNEWS label id, and from label id to description.
642 | label_dict = {0: 100, 1: 101, 2: 102, 3: 103,
643 | 4: 104, 5: 106, 6: 107, 7: 108,
644 | 8: 109, 9: 110, 10: 112, 11: 113,
645 | 12: 114, 13: 115, 14: 116}
646 | label_desc = {100: "news_story", 101: "news_culture", 102: "news_entertainment",
647 | 103: "news_sports", 104: "news_finance", 106: "news_house",
648 | 107: "news_car", 108: "news_edu", 109: "news_tech",
649 | 110: "news_military", 112: "news_travel", 113: "news_world",
650 | 114: "news_stock", 115: "news_agriculture", 116: "news_game"}
651 |
652 | predict_examples = processor.get_test_examples(FLAGS.data_dir)
653 | num_actual_predict_examples = len(predict_examples)
654 | test_file = os.path.join(FLAGS.data_dir, "test.tf_record")
655 | file_based_convert_examples_to_features(predict_examples, label_list,
656 | FLAGS.max_seq_length, tokenizer,
657 | test_file)
658 |
659 | tf.logging.info("***** Running prediction*****")
660 | tf.logging.info(" Num examples = %d (%d actual, %d padding)",
661 | len(predict_examples), num_actual_predict_examples,
662 | len(predict_examples) - num_actual_predict_examples)
663 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size)
664 |
665 | predict_drop_remainder = True if FLAGS.use_tpu else False
666 | predict_input_fn = file_based_input_fn_builder(
667 | input_file=test_file,
668 | seq_length=FLAGS.max_seq_length,
669 | is_training=False,
670 | drop_remainder=predict_drop_remainder)
671 |
672 | results = estimator.predict(input_fn=predict_input_fn)
673 |
674 | output_file = os.path.join(FLAGS.output_dir, 'news_predict.json')
675 | with open(output_file, 'w', encoding='utf-8') as fr:
676 | print(results)
677 | for index, result in enumerate(results):
678 | pre_id = result['predictions']
679 | print(f'the index is {index} preid is {pre_id}')
680 | label = label_dict.get(pre_id)
681 | label_d = label_desc.get(label)
682 |
683 | json_str = json.dumps({"id": index, "label": str(label), "label_desc": label_d})
684 | fr.write(json_str)
685 | fr.write('\n')
686 |
687 |
688 | if __name__ == "__main__":
689 | main()
690 |
--------------------------------------------------------------------------------
/bert_downstream/train_multi_learning.py:
--------------------------------------------------------------------------------
1 | """ BERT finetuning runner for multi-learning task """
2 |
3 | import collections
4 | import math
5 | import os
6 | import random
7 | import pandas as pd
8 | import numpy as np
9 | import json
10 | import tqdm
11 |
12 | import bert_master.modeling as modeling
13 | import bert_master.optimization as optimization
14 | import bert_master.tokenization as tokenization
15 | import tensorflow as tf
16 |
17 | flags = tf.flags
18 | FLAGS = flags.FLAGS
19 |
20 | # Required parameters
21 | flags.DEFINE_string(
22 | "data_dir", './data_path/',
23 | "The input data dir. Should contain the .tsv files (or other data files) "
24 | "for the task.")
25 |
26 | flags.DEFINE_string(
27 | "bert_config_file", './pre_trained/bert_config.json',
28 | "The config json file corresponding to the pre-trained BERT model. "
29 | "This specifies the model architecture.")
30 |
31 | flags.DEFINE_string("vocab_file", './pre_trained/vocab.txt',
32 | "The vocabulary file that the BERT model was trained on.")
33 |
34 | flags.DEFINE_string(
35 | "output_dir", './model_ckpt/multi_learning/',
36 | "The output directory where the model checkpoints will be written.")
37 |
38 | flags.DEFINE_string(
39 | "init_checkpoint", './pre_trained/bert_model.ckpt',
40 | "Initial checkpoint (usually from a pre-trained BERT model).")
41 |
42 | flags.DEFINE_bool(
43 | "do_lower_case", True,
44 | "Whether to lower case the input text. Should be True for uncased "
45 | "models and False for cased models.")
46 |
47 | flags.DEFINE_integer(
48 | "max_seq_length", 128,
49 | "The maximum total input sequence length after WordPiece tokenization. "
50 | "Sequences longer than this will be truncated, and sequences shorter "
51 | "than this will be padded.")
52 |
53 | flags.DEFINE_integer("train_batch_size", 16, "Total batch size for training.")
54 |
55 | flags.DEFINE_integer("eval_batch_size", 16, "Total batch size for eval.")
56 |
57 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")
58 |
59 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")
60 |
61 | flags.DEFINE_integer("num_train_epochs", 3,
62 | "Total number of training epochs to perform.")
63 |
64 | flags.DEFINE_float(
65 | "warmup_proportion", 0.1,
66 | "Proportion of training to perform linear learning rate warmup for. "
67 | "E.g., 0.1 = 10% of training.")
68 |
69 |
70 | class InputExample(object):
71 | """A single training/test example for simple sequence classification."""
72 |
73 | def __init__(self, guid, text_a, text_b=None, label=None, task=None):
74 | """Constructs a InputExample.
75 |
76 | Args:
77 | guid: Unique id for the example.
78 | text_a: string. The untokenized text of the first sequence. For single
79 | sequence tasks, only this sequence must be specified.
80 | text_b: (Optional) string. The untokenized text of the second sequence.
81 | Only must be specified for sequence pair tasks.
82 | label: (Optional) string. The label of the example. This should be
83 | specified for train and dev examples, but not for test examples.
84 | """
85 | self.guid = guid
86 | self.text_a = text_a
87 | self.text_b = text_b
88 | self.label = label
89 | self.task = task
90 |
91 |
92 | class PaddingInputExample(object):
93 | """Fake example so the num input examples is a multiple of the batch size.
94 |
95 | When running eval/predict on the TPU, we need to pad the number of examples
96 | to be a multiple of the batch size, because the TPU requires a fixed batch
97 | size. The alternative is to drop the last batch, which is bad because it means
98 | the entire output data won't be generated.
99 |
100 | We use this class instead of `None` because treating `None` as padding
101 |   batches could cause silent errors.
102 | """
103 |
104 |
105 | class InputFeatures(object):
106 | """A single set of features of data."""
107 |
108 | def __init__(self,
109 | input_ids,
110 | input_mask,
111 | segment_ids,
112 | label_id,
113 | task,
114 | is_real_example=True):
115 | self.input_ids = input_ids
116 | self.input_mask = input_mask
117 | self.segment_ids = segment_ids
118 | self.label_id = label_id
119 | self.task = task
120 | self.is_real_example = is_real_example
121 |
122 |
123 | class DataProcessor(object):
124 | """Base class for data converters for sequence classification data sets."""
125 |
126 | def get_train_examples(self, data_dir):
127 | """Gets a collection of `InputExample`s for the train set."""
128 | raise NotImplementedError()
129 |
130 | def get_dev_examples(self, data_dir):
131 | """Gets a collection of `InputExample`s for the dev set."""
132 | raise NotImplementedError()
133 |
134 | def get_test_examples(self, data_dir):
135 | """Gets a collection of `InputExample`s for prediction."""
136 | raise NotImplementedError()
137 |
138 | def get_labels(self):
139 | """Gets the list of labels for this data set."""
140 | raise NotImplementedError()
141 |
142 | @classmethod
143 | def _read_csv(cls, input_file, task):
144 | data = pd.read_csv(input_file, sep='\t', encoding='utf-8', header=None)
145 | if task == 'nli':
146 | data.columns = ['id', 'texta', 'textb', 'label']
147 | else:
148 | data.columns = ['id', 'text', 'label']
149 | lines = []
150 | for index, row in data.iterrows():
151 | if task == 'nli':
152 | lines.append((row['texta'], row['textb'], row['label']))
153 | else:
154 | lines.append((row['text'], row['label']))
155 | return lines
156 |
157 | @classmethod
158 | def _read_test(cls, input_file, task):
159 | data = pd.read_csv(input_file, sep='\t', encoding='utf-8', header=None)
160 | if task == 'nli':
161 | data.columns = ['id', 'texta', 'textb']
162 | else:
163 | data.columns = ['id', 'text']
164 | lines = []
165 |     # Keep the id so the prediction file can be matched for submission.
166 | for index, row in data.iterrows():
167 | if task == 'nli':
168 | lines.append((row['id'], row['texta'], row['textb']))
169 | else:
170 | lines.append((row['id'], row['text']))
171 | return lines
172 |
173 |
174 | class AllProcessor(DataProcessor):
175 | """Processor for the CoLA data set (GLUE version)."""
176 |
177 | def get_train_examples(self, data_dir):
178 | """See base class."""
179 | emotion_dir = os.path.join(data_dir, 'train_emotion.csv')
180 | news_dir = os.path.join(data_dir, 'train_news.csv')
181 | nli_dir = os.path.join(data_dir, 'train_nli.csv')
182 | emotion_lines = self._read_csv(emotion_dir, 'emotion')
183 | news_lines = self._read_csv(news_dir, 'news')
184 | nli_lines = self._read_csv(nli_dir, 'nli')
185 | return self._create_examples(emotion_lines, news_lines, nli_lines, 'train')
186 |
187 | def get_dev_examples(self, data_dir):
188 | """See base class."""
189 | emotion_dir = os.path.join(data_dir, 'dev_emotion.csv')
190 | news_dir = os.path.join(data_dir, 'dev_news.csv')
191 | nli_dir = os.path.join(data_dir, 'dev_nli.csv')
192 | emotion_lines = self._read_csv(emotion_dir, 'emotion')
193 | news_lines = self._read_csv(news_dir, 'news')
194 | nli_lines = self._read_csv(nli_dir, 'nli')
195 | return self._create_examples(emotion_lines, news_lines, nli_lines, 'dev')
196 |
197 | def get_test_examples(self, data_dir):
198 | """See base class."""
199 | emotion_dir = os.path.join(data_dir, 'test_emotion.csv')
200 | news_dir = os.path.join(data_dir, 'test_news.csv')
201 | nli_dir = os.path.join(data_dir, 'test_nli.csv')
202 | emotion_lines = self._read_test(emotion_dir, 'emotion')
203 | news_lines = self._read_test(news_dir, 'news')
204 | nli_lines = self._read_test(nli_dir, 'nli')
205 | return self._create_examples(emotion_lines, news_lines, nli_lines, 'test')
206 |
207 | def get_labels(self):
208 | """See base class."""
209 | return [['sadness', 'anger', 'happiness', 'fear', 'like',
210 | 'disgust', 'surprise'],
211 | ['108', '104', '106', '112', '109', '103', '116', '101',
212 | '107', '100', '102', '110', '115', '113', '114'],
213 | ['0', '1', '2']]
214 |
215 | def _create_examples(self, emotion_lines, news_lines, nli_lines, set_type):
216 | """Creates examples for the training and dev sets."""
217 | examples = []
218 |
219 | # emotion
220 | for (i, line) in enumerate(emotion_lines):
221 | guid = "%s-%s" % (set_type, i)
222 | if set_type == "test":
223 | text_a = tokenization.convert_to_unicode(line[1])
224 | label = None
225 | guid = line[0]
226 | else:
227 | text_a = tokenization.convert_to_unicode(line[0])
228 | label = tokenization.convert_to_unicode(line[1])
229 | examples.append(
230 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label, task='1'))
231 |
232 | # news
233 | for i, line in enumerate(news_lines):
234 | guid = f'news_{set_type}_{i}'
235 | if set_type == 'test':
236 | text_a = tokenization.convert_to_unicode(line[1])
237 | label = None
238 | guid = line[0]
239 | else:
240 | text_a = tokenization.convert_to_unicode(line[0])
241 | label = tokenization.convert_to_unicode(str(line[1]))
242 |
243 | examples.append(
244 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label, task='2'))
245 |
246 | # nli
247 | for i, line in enumerate(nli_lines):
248 |         guid = f'nli_{set_type}_{i}'
249 | if set_type == 'test':
250 | text_a = tokenization.convert_to_unicode(line[1])
251 | text_b = tokenization.convert_to_unicode(line[2])
252 | label = None
253 | guid = line[0]
254 | else:
255 | text_a = tokenization.convert_to_unicode(line[0])
256 | text_b = tokenization.convert_to_unicode(line[1])
257 | label = tokenization.convert_to_unicode(str(line[2]))
258 |
259 | examples.append(
260 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label, task='3'))
261 |
262 | return examples
263 |
264 |
265 | def convert_single_example(ex_index, example, label_list, max_seq_length,
266 | tokenizer):
267 | """Converts a single `InputExample` into a single `InputFeatures`."""
268 |
269 | emotion_label_map = {}
270 | news_label_map = {}
271 | nli_label_map = {}
272 | for (i, label) in enumerate(label_list[0]):
273 | emotion_label_map[label] = i
274 | for (i, label) in enumerate(label_list[1]):
275 | news_label_map[label] = i
276 | for (i, label) in enumerate(label_list[2]):
277 | nli_label_map[label] = i
278 |
279 | tokens_a = tokenizer.tokenize(example.text_a)
280 | tokens_b = None
281 | if example.text_b:
282 | tokens_b = tokenizer.tokenize(example.text_b)
283 |
284 | if tokens_b:
285 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
286 | else:
287 | if len(tokens_a) > max_seq_length - 2:
288 | tokens_a = tokens_a[0:(max_seq_length - 2)]
289 |
290 | tokens = []
291 | segment_ids = []
292 | tokens.append("[CLS]")
293 | segment_ids.append(0)
294 | for token in tokens_a:
295 | tokens.append(token)
296 | segment_ids.append(0)
297 | tokens.append("[SEP]")
298 | segment_ids.append(0)
299 |
300 | if tokens_b:
301 | for token in tokens_b:
302 | tokens.append(token)
303 | segment_ids.append(1)
304 | tokens.append("[SEP]")
305 | segment_ids.append(1)
306 |
307 | input_ids = tokenizer.convert_tokens_to_ids(tokens)
308 |
309 | input_mask = [1] * len(input_ids)
310 |
311 | # Zero-pad up to the sequence length.
312 | while len(input_ids) < max_seq_length:
313 | input_ids.append(0)
314 | input_mask.append(0)
315 | segment_ids.append(0)
316 |
317 | assert len(input_ids) == max_seq_length
318 | assert len(input_mask) == max_seq_length
319 | assert len(segment_ids) == max_seq_length
320 | task = example.task
321 | if example.label:
322 | if task == '1': label_id = emotion_label_map[example.label]
323 | if task == '2': label_id = news_label_map[example.label]
324 | if task == '3': label_id = nli_label_map[example.label]
325 | else:
326 | label_id = 0
327 |
328 | if ex_index < 5:
329 | tf.logging.info("*** Example ***")
330 | tf.logging.info("guid: %s" % (example.guid))
331 | tf.logging.info("tokens: %s" % " ".join(
332 | [tokenization.printable_text(x) for x in tokens]))
333 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
334 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
335 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
336 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
337 |
338 | feature = InputFeatures(
339 | input_ids=input_ids,
340 | input_mask=input_mask,
341 | segment_ids=segment_ids,
342 | label_id=label_id,
343 | task=int(task),
344 | is_real_example=True)
345 | return feature
346 |
347 |
348 | def file_based_convert_examples_to_features(
349 | examples, label_list, max_seq_length, tokenizer, output_file, type):
350 | """Convert a set of `InputExample`s to a TFRecord file."""
351 |
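  # One TFRecord file per task, so every training batch contains a single task's examples.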
352 | emotion_out = os.path.join(output_file, f'emotion_{type}.record')
353 | news_out = os.path.join(output_file, f'news_{type}.record')
354 | nli_out = os.path.join(output_file, f'nli_{type}.record')
355 |
356 | emotion_writer = tf.python_io.TFRecordWriter(emotion_out)
357 | news_writer = tf.python_io.TFRecordWriter(news_out)
358 | nli_writer = tf.python_io.TFRecordWriter(nli_out)
359 |
360 | emotion_cnt = 0
361 | news_cnt = 0
362 | nli_cnt = 0
363 | for (ex_index, example) in enumerate(examples):
364 | if ex_index % 10000 == 0:
365 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
366 |
367 | feature = convert_single_example(ex_index, example, label_list,
368 | max_seq_length, tokenizer)
369 |
370 | def create_int_feature(values):
371 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
372 | return f
373 |
374 | features = collections.OrderedDict()
375 | features["input_ids"] = create_int_feature(feature.input_ids)
376 | features["input_mask"] = create_int_feature(feature.input_mask)
377 | features["segment_ids"] = create_int_feature(feature.segment_ids)
378 | features["label_ids"] = create_int_feature([feature.label_id])
379 | features["task"] = create_int_feature([feature.task])
380 | features["is_real_example"] = create_int_feature(
381 | [int(feature.is_real_example)])
382 |
383 | tf_example = tf.train.Example(features=tf.train.Features(feature=features))
384 |
385 | if feature.task == 1:
386 | emotion_cnt += 1
387 | emotion_writer.write(tf_example.SerializeToString())
388 | if feature.task == 2:
389 | news_cnt += 1
390 | news_writer.write(tf_example.SerializeToString())
391 | if feature.task == 3:
392 | nli_cnt += 1
393 | nli_writer.write(tf_example.SerializeToString())
394 |
395 | emotion_writer.close()
396 | news_writer.close()
397 | nli_writer.close()
398 | print(f'the emotion news nli cnt is {emotion_cnt} {news_cnt} {nli_cnt}')
399 | return emotion_cnt, news_cnt, nli_cnt
400 |
401 |
402 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
403 | """Truncates a sequence pair in place to the maximum length."""
404 |
405 | while True:
406 | total_length = len(tokens_a) + len(tokens_b)
407 | if total_length <= max_length:
408 | break
409 | if len(tokens_a) > len(tokens_b):
410 | tokens_a.pop()
411 | else:
412 | tokens_b.pop()
413 |
414 |
415 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
416 | labels, use_one_hot_embeddings, task):
417 | """Creates a classification model."""
418 | model = modeling.BertModel(
419 | config=bert_config,
420 | is_training=is_training,
421 | input_ids=input_ids,
422 | input_mask=input_mask,
423 | token_type_ids=segment_ids,
424 | use_one_hot_embeddings=use_one_hot_embeddings)
425 |
426 | output_layer = model.get_pooled_output()
427 |
428 | hidden_size = output_layer.shape[-1].value
429 |
430 |   # Three fully connected heads, one per task (emotion: 7 classes, news: 15, nli: 3).
431 | emotion_weights = tf.get_variable(
432 | "emotion_weights", [7, hidden_size],
433 | initializer=tf.truncated_normal_initializer(stddev=0.02))
434 | emotion_bias = tf.get_variable(
435 | "emotion_bias", [7], initializer=tf.zeros_initializer())
436 |
437 | news_weights = tf.get_variable(
438 | "news_weights", [15, hidden_size],
439 | initializer=tf.truncated_normal_initializer(stddev=0.02))
440 | news_bias = tf.get_variable(
441 | "news_bias", [15], initializer=tf.zeros_initializer())
442 |
443 | nli_weights = tf.get_variable(
444 | "nli_weights", [3, hidden_size],
445 | initializer=tf.truncated_normal_initializer(stddev=0.02))
446 | nli_bias = tf.get_variable(
447 | "nli_bias", [3], initializer=tf.zeros_initializer())
448 |
449 | if is_training:
450 | # I.e., 0.1 dropout
451 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
452 |
453 | emotion_logits = tf.matmul(output_layer, emotion_weights, transpose_b=True)
454 | emotion_logits = tf.nn.bias_add(emotion_logits, emotion_bias)
455 |
456 | news_logits = tf.matmul(output_layer, news_weights, transpose_b=True)
457 | news_logits = tf.nn.bias_add(news_logits, news_bias)
458 |
459 | nli_logits = tf.matmul(output_layer, nli_weights, transpose_b=True)
460 | nli_logits = tf.nn.bias_add(nli_logits, nli_bias)
461 |
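  # Route the shared pooled output to the head of the current task
  # (1 = emotion, 2 = news, 3 = nli), selected by the `task` placeholder at run time.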
462 | logits = tf.cond(
463 | tf.equal(task, 1),
464 | lambda: emotion_logits,
465 | lambda: tf.cond(tf.equal(task, 2), lambda: news_logits, lambda: nli_logits)
466 | )
467 | depth = tf.cond(
468 | tf.equal(task, 1),
469 | lambda: 7,
470 | lambda: tf.cond(tf.equal(task, 2), lambda: 15, lambda: 3)
471 | )
472 |
473 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int64, name='pre_id')
474 |
475 | with tf.variable_scope("loss"):
476 | log_probs = tf.nn.log_softmax(logits, axis=-1)
477 | one_hot_labels = tf.one_hot(labels, depth=depth, dtype=tf.float32)
478 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
479 | loss = tf.reduce_mean(per_example_loss)
480 |
481 | equals = tf.reduce_sum(tf.cast(tf.equal(predictions, labels), tf.int64))
482 | acc = equals / FLAGS.eval_batch_size
483 | return loss, logits, acc, predictions
484 |
485 |
486 | def get_input_data(input_file, seq_len, batch_size, is_training):
487 | def parser(record):
488 | name_to_features = {
489 | "input_ids": tf.FixedLenFeature([seq_len], tf.int64),
490 | "input_mask": tf.FixedLenFeature([seq_len], tf.int64),
491 | "segment_ids": tf.FixedLenFeature([seq_len], tf.int64),
492 | "label_ids": tf.FixedLenFeature([], tf.int64),
493 | }
494 |         # tf.Example features must be parsed as int64.
495 | example = tf.parse_single_example(record, features=name_to_features)
496 | input_ids = example["input_ids"]
497 | input_mask = example["input_mask"]
498 | segment_ids = example["segment_ids"]
499 | labels = example["label_ids"]
500 | return input_ids, input_mask, segment_ids, labels
501 |
502 | dataset = tf.data.TFRecordDataset(input_file)
503 | if is_training:
504 | dataset = dataset.map(parser).batch(batch_size).shuffle(buffer_size=3000)
505 | else:
506 | dataset = dataset.map(parser).batch(batch_size)
507 | iterator = dataset.make_one_shot_iterator()
508 | input_ids, input_mask, segment_ids, labels = iterator.get_next()
509 | return input_ids, input_mask, segment_ids, labels
510 |
511 |
512 | def main():
513 | """ 训练主入口 """
514 | tf.logging.info('start to train')
515 |
516 |     # Processor, label list, and tokenizer setup
517 | process = AllProcessor()
518 | label_list = process.get_labels()
519 | tokenizer = tokenization.FullTokenizer(
520 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
521 |
522 | train_examples = process.get_train_examples(FLAGS.data_dir)
523 | train_cnt = file_based_convert_examples_to_features(
524 | train_examples,
525 | label_list,
526 | FLAGS.max_seq_length,
527 | tokenizer,
528 | FLAGS.data_dir,
529 | 'train'
530 | )
531 | dev_examples = process.get_dev_examples(FLAGS.data_dir)
532 | dev_cnt = file_based_convert_examples_to_features(
533 | dev_examples,
534 | label_list,
535 | FLAGS.max_seq_length,
536 | tokenizer,
537 | FLAGS.data_dir,
538 | 'dev'
539 | )
540 |
541 |     # Model input placeholders
542 | input_ids = tf.placeholder(tf.int64, shape=[None, FLAGS.max_seq_length],
543 | name='input_ids')
544 | input_mask = tf.placeholder(tf.int64, shape=[None, FLAGS.max_seq_length],
545 | name='input_mask')
546 | segment_ids = tf.placeholder(tf.int64, shape=[None, FLAGS.max_seq_length],
547 | name='segment_ids')
548 | labels = tf.placeholder(tf.int64, shape=[None], name='labels')
549 | task = tf.placeholder(tf.int64, name='task')
550 |
551 |     # BERT configuration
552 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
553 |
554 | loss, logits, acc, pre_id = create_model(
555 | bert_config,
556 | True,
557 | input_ids,
558 | input_mask,
559 | segment_ids,
560 | labels,
561 | False,
562 | task
563 | )
564 | num_train_steps = int(len(train_examples) / FLAGS.train_batch_size)
565 | num_warmup_steps = math.ceil(
566 |         num_train_steps * FLAGS.num_train_epochs * FLAGS.warmup_proportion)
567 | train_op = optimization.create_optimizer(
568 | loss,
569 | FLAGS.learning_rate,
570 | num_train_steps * FLAGS.num_train_epochs,
571 | num_warmup_steps,
572 | False
573 | )
574 |
575 |     # Initialize variables; the saver excludes Adam slot variables to keep checkpoints small.
576 | init_global = tf.global_variables_initializer()
577 | saver = tf.train.Saver(
578 | [v for v in tf.global_variables()
579 | if 'adam_v' not in v.name and 'adam_m' not in v.name])
580 |
581 | with tf.Session() as sess:
582 | sess.run(init_global)
583 | print('start to load bert params')
584 | if FLAGS.init_checkpoint:
585 | # tvars = tf.global_variables()
586 | tvars = tf.trainable_variables()
587 | print("global_variables", len(tvars))
588 | assignment_map, initialized_variable_names = \
589 | modeling.get_assignment_map_from_checkpoint(tvars,
590 | FLAGS.init_checkpoint)
591 | print("initialized_variable_names:", len(initialized_variable_names))
592 | saver_ = tf.train.Saver([v for v in tvars if v.name in initialized_variable_names])
593 | saver_.restore(sess, FLAGS.init_checkpoint)
594 | tvars = tf.global_variables()
595 | # initialized_vars = [v for v in tvars if v.name in initialized_variable_names]
596 | not_initialized_vars = [v for v in tvars if v.name not in initialized_variable_names]
597 | print('all size %s; not initialized size %s' % (len(tvars), len(not_initialized_vars)))
598 | if len(not_initialized_vars):
599 | sess.run(tf.variables_initializer(not_initialized_vars))
600 | # for v in initialized_vars:
601 | # print('initialized: %s, shape = %s' % (v.name, v.shape))
602 | # for v in not_initialized_vars:
603 | # print('not initialized: %s, shape = %s' % (v.name, v.shape))
604 | else:
605 | print('the bert init checkpoint is None!!!')
606 | sess.run(tf.global_variables_initializer())
607 |
608 |         # One training step
609 | def train_step(ids, mask, seg, true_y, task_id):
610 | feed = {input_ids: ids,
611 | input_mask: mask,
612 | segment_ids: seg,
613 | labels: true_y,
614 | task: task_id}
615 | _, logits_out, loss_out = sess.run([train_op, logits, loss], feed_dict=feed)
616 | return logits_out, loss_out
617 |
618 |         # One evaluation step
619 | def dev_step(ids, mask, seg, true_y, task_id):
620 | feed = {input_ids: ids,
621 | input_mask: mask,
622 | segment_ids: seg,
623 | labels: true_y,
624 | task: task_id}
625 | pre_out, acc_out = sess.run([pre_id, acc], feed_dict=feed)
626 | return pre_out, acc_out
627 |
628 |         # Training loop
629 | for epoch in range(FLAGS.num_train_epochs):
630 | tf.logging.info(f'start to train and the epoch:{epoch}')
631 | epoch_loss = do_train(sess, train_cnt, train_step, epoch)
632 | tf.logging.info(f'the epoch{epoch} loss is {epoch_loss}')
633 | saver.save(sess, FLAGS.output_dir + 'bert.ckpt', global_step=epoch)
634 |             # Evaluate the model after every epoch
635 | do_eval(sess, dev_cnt, dev_step)
636 |
637 |         # Run prediction and save the results
638 | do_predict(label_list, process, tokenizer, dev_step)
639 |
640 | tf.logging.info('the training is over!!!!')
641 |
642 |
643 | def set_random_task(train_cnt):
644 | """ 任务采样 : 各任务每个epoch 迭代的step次数 """
645 | # emotion cnt
646 | emotion_cnt = train_cnt[0] // FLAGS.train_batch_size
647 | news_cnt = train_cnt[1] // FLAGS.train_batch_size
648 | nli_cnt = train_cnt[2] // FLAGS.train_batch_size
649 |
650 | emotion_list = [1] * emotion_cnt
651 | news_list = [2] * news_cnt
652 | nli_list = [3] * nli_cnt
653 |
654 | task_list = emotion_list + news_list + nli_list
655 |
656 | random.shuffle(task_list)
657 |
658 | return task_list
659 |
660 |
661 | def do_train(sess, train_cnt, train_step, epoch):
662 | """ 模型训练 """
663 | emotion_train_file = os.path.join(FLAGS.data_dir, 'emotion_train.record')
664 | news_train_file = os.path.join(FLAGS.data_dir, 'news_train.record')
665 | nli_train_file = os.path.join(FLAGS.data_dir, 'nli_train.record')
666 | ids1, mask1, seg1, labels1 = get_input_data(
667 | emotion_train_file, FLAGS.max_seq_length,
668 | FLAGS.train_batch_size, True)
669 | ids2, mask2, seg2, labels2 = get_input_data(
670 | news_train_file, FLAGS.max_seq_length,
671 | FLAGS.train_batch_size, True)
672 | ids3, mask3, seg3, labels3 = get_input_data(
673 | nli_train_file, FLAGS.max_seq_length,
674 | FLAGS.train_batch_size, True)
675 |
676 |     # Build the shuffled task schedule for this epoch
677 | tasks = set_random_task(train_cnt)
678 |
679 | total_loss = 0
680 | for step, task_id in enumerate(tasks):
681 | if task_id == 1:
682 | ids_train, mask_train, seg_train, y_train = sess.run(
683 | [ids1, mask1, seg1, labels1])
684 | if task_id == 2:
685 | ids_train, mask_train, seg_train, y_train = sess.run(
686 | [ids2, mask2, seg2, labels2])
687 | if task_id == 3:
688 | ids_train, mask_train, seg_train, y_train = sess.run(
689 | [ids3, mask3, seg3, labels3])
690 |
691 | _, step_loss = train_step(ids_train, mask_train, seg_train, y_train, task_id)
692 |
693 | tf.logging.info(f'epoch {epoch} the step loss: {step_loss}')
694 |
695 | total_loss += step_loss
696 |
697 | return total_loss / len(tasks)
698 |
699 |
700 | def do_eval(sess, dev_cnt, dev_step):
701 | """ 模型验证 """
702 | tf.logging.info(f'start to do eval')
703 | emotion_dev_file = os.path.join(FLAGS.data_dir, 'emotion_dev.record')
704 | news_dev_file = os.path.join(FLAGS.data_dir, 'news_dev.record')
705 | nli_dev_file = os.path.join(FLAGS.data_dir, 'nli_dev.record')
706 |
707 | ids1, mask1, seg1, labels1 = get_input_data(
708 | emotion_dev_file, FLAGS.max_seq_length,
709 | FLAGS.eval_batch_size, False)
710 | ids2, mask2, seg2, labels2 = get_input_data(
711 | news_dev_file, FLAGS.max_seq_length,
712 | FLAGS.eval_batch_size, False)
713 | ids3, mask3, seg3, labels3 = get_input_data(
714 | nli_dev_file, FLAGS.max_seq_length,
715 | FLAGS.eval_batch_size, False)
716 |
717 | # 验证emotion的
718 | total_dev_acc = 0
719 | step_cnt = dev_cnt[0] // FLAGS.eval_batch_size
720 | for step in range(step_cnt):
721 | ids_dev, mask_dev, seg_dev, y_dev = sess.run(
722 | [ids1, mask1, seg1, labels1])
723 | _, dev_acc = dev_step(ids_dev, mask_dev, seg_dev, y_dev, 1)
724 | total_dev_acc += dev_acc
725 | tf.logging.info(f'===the emotion acc is {total_dev_acc / step_cnt}===')
726 |
727 | total_dev_acc = 0
728 | step_cnt = dev_cnt[1] // FLAGS.eval_batch_size
729 | for step in range(step_cnt):
730 | ids_dev, mask_dev, seg_dev, y_dev = sess.run(
731 | [ids2, mask2, seg2, labels2])
732 | _, dev_acc = dev_step(ids_dev, mask_dev, seg_dev, y_dev, 2)
733 | total_dev_acc += dev_acc
734 | tf.logging.info(f'===the news acc is {total_dev_acc / step_cnt}===')
735 |
736 | total_dev_acc = 0
737 | step_cnt = dev_cnt[2] // FLAGS.eval_batch_size
738 | for step in range(step_cnt):
739 | ids_dev, mask_dev, seg_dev, y_dev = sess.run(
740 | [ids3, mask3, seg3, labels3])
741 | _, dev_acc = dev_step(ids_dev, mask_dev, seg_dev, y_dev, 3)
742 | total_dev_acc += dev_acc
743 | tf.logging.info(f'===the nli acc is {total_dev_acc / step_cnt}===')
744 |
745 |
746 | def do_predict(label_list, process, tokenizer, dev_step):
747 | """ 预测 """
748 | tf.logging.info('start to do predict')
749 | # 设置标签到索引的对应
750 | emotion_map = {}
751 | news_map = {}
752 | nli_map = {}
753 | for (i, label) in enumerate(label_list[0]):
754 | emotion_map[i] = label
755 | for (i, label) in enumerate(label_list[1]):
756 | news_map[i] = label
757 | for (i, label) in enumerate(label_list[2]):
758 | nli_map[i] = label
759 |
760 | test_examples = process.get_test_examples(FLAGS.data_dir)
761 | emotion_res = []
762 | news_res = []
763 | nli_res = []
764 | batch_size = 1
765 | index = 0
766 | for example in tqdm.tqdm(test_examples):
767 | index += 1
768 | feature = convert_single_example(index, example, label_list,
769 | FLAGS.max_seq_length, tokenizer)
770 | ids = np.reshape([feature.input_ids], (batch_size, FLAGS.max_seq_length))
771 | mask = np.reshape([feature.input_mask], (batch_size, FLAGS.max_seq_length))
772 | seg = np.reshape([feature.segment_ids], (batch_size, FLAGS.max_seq_length))
773 | true_y = np.reshape([0], batch_size)
774 |
775 | task_id = example.task
776 | pred_res, _ = dev_step(ids, mask, seg, true_y, int(task_id))
777 |
778 | guid = str(example.guid).strip()
779 | if task_id == '1':
780 | label_res = emotion_map.get(pred_res[0])
781 | emotion_res.append(json.dumps({"id": str(guid), "label": str(label_res)}))
782 | if task_id == '2':
783 | label_res = news_map.get(pred_res[0])
784 | news_res.append(json.dumps({"id": str(guid), "label": str(label_res)}))
785 | if task_id == '3':
786 | label_res = nli_map.get(pred_res[0])
787 | nli_res.append(json.dumps({"id": str(guid), "label": str(label_res)}))
788 |
789 | # 写入预测文件
790 | with open('./data_path/ocemotion_predict.json', 'w', encoding='utf-8') as fr:
791 | for res in emotion_res:
792 | fr.write(res)
793 | fr.write('\n')
794 |
795 | with open('./data_path/tnews_predict.json', 'w', encoding='utf-8') as fr:
796 | for res in news_res:
797 | fr.write(res)
798 | fr.write('\n')
799 |
800 | with open('./data_path/ocnli_predict.json', 'w', encoding='utf-8') as fr:
801 | for res in nli_res:
802 | fr.write(res)
803 | fr.write('\n')
804 | tf.logging.info('predict and write file over!')
805 |
806 |
807 | if __name__ == "__main__":
808 | tf.logging.set_verbosity(tf.logging.INFO)
809 | main()
810 |
--------------------------------------------------------------------------------
/bert_downstream/train_ner.py:
--------------------------------------------------------------------------------
1 | """BERT finetuning runner for ner (sequence label classification)."""
2 |
3 | import collections
4 | import os
5 | import json
6 | import tensorflow as tf
7 |
8 | import bert_master.modeling as modeling
9 | import bert_master.optimization as optimization
10 | import bert_master.tokenization as tokenization
11 |
12 | flags = tf.flags
13 | FLAGS = flags.FLAGS
14 |
15 | # Required parameters
16 | flags.DEFINE_string(
17 | "data_dir", './data_path/clue_ner/',
18 | "The input data dir. Should contain the .tsv files (or other data files) "
19 | "for the task.")
20 |
21 | flags.DEFINE_string(
22 | "bert_config_file", './pre_trained/bert_config.json',
23 | "The config json file corresponding to the pre-trained BERT model. "
24 | "This specifies the model architecture.")
25 |
26 | flags.DEFINE_string("task_name", 'cluener',
27 | "The name of the task to train.")
28 |
29 | flags.DEFINE_string("vocab_file", './pre_trained/vocab.txt',
30 | "The vocabulary file that the BERT model was trained on.")
31 |
32 | flags.DEFINE_string(
33 | "output_dir", './model_ckpt/clue_ner/',
34 | "The output directory where the model checkpoints will be written.")
35 |
36 | flags.DEFINE_string(
37 | "init_checkpoint", './pre_trained/bert_model.ckpt',
38 | "Initial checkpoint (usually from a pre-trained BERT model).")
39 |
40 | flags.DEFINE_bool(
41 | "do_lower_case", True,
42 | "Whether to lower case the input text. Should be True for uncased "
43 | "models and False for cased models.")
44 |
45 | flags.DEFINE_integer(
46 | "max_seq_length", 128,
47 | "The maximum total input sequence length after WordPiece tokenization. "
48 | "Sequences longer than this will be truncated, and sequences shorter "
49 | "than this will be padded.")
50 |
51 | flags.DEFINE_bool("do_train", False, "Whether to run training.")
52 |
53 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")
54 |
55 | flags.DEFINE_bool(
56 | "do_predict", True,
57 | "Whether to run the model in inference mode on the test set.")
58 |
59 | flags.DEFINE_integer("train_batch_size", 16, "Total batch size for training.")
60 |
61 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")
62 |
63 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")
64 |
65 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")
66 |
67 | flags.DEFINE_float("num_train_epochs", 1.0,
68 | "Total number of training epochs to perform.")
69 |
70 | flags.DEFINE_float(
71 | "warmup_proportion", 0.1,
72 | "Proportion of training to perform linear learning rate warmup for. "
73 | "E.g., 0.1 = 10% of training.")
74 |
75 | flags.DEFINE_integer("save_checkpoints_steps", 1000,
76 | "How often to save the model checkpoint.")
77 |
78 | flags.DEFINE_integer("iterations_per_loop", 1000,
79 | "How many steps to make in each estimator call.")
80 |
81 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")
82 |
83 | tf.flags.DEFINE_string(
84 | "tpu_name", None,
85 | "The Cloud TPU to use for training. This should be either the name "
86 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
87 | "url.")
88 |
89 | tf.flags.DEFINE_string(
90 | "tpu_zone", None,
91 | "[Optional] GCE zone where the Cloud TPU is located in. If not "
92 | "specified, we will attempt to automatically detect the GCE project from "
93 | "metadata.")
94 |
95 | tf.flags.DEFINE_string(
96 | "gcp_project", None,
97 | "[Optional] Project name for the Cloud TPU-enabled project. If not "
98 | "specified, we will attempt to automatically detect the GCE project from "
99 | "metadata.")
100 |
101 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
102 |
103 | flags.DEFINE_integer(
104 | "num_tpu_cores", 8,
105 | "Only used if `use_tpu` is True. Total number of TPU cores to use.")
106 |
107 |
108 | class InputExample(object):
109 | """A single training/test example for simple sequence classification."""
110 |
111 | def __init__(self, guid, text_a, text_b=None, tag=None):
112 | """Constructs a InputExample.
113 |
114 | Args:
115 | guid: Unique id for the example.
116 | text_a: string. The untokenized text of the first sequence. For single
117 | sequence tasks, only this sequence must be specified.
118 | text_b: (Optional) string. The untokenized text of the second sequence.
119 | Only must be specified for sequence pair tasks.
120 | tag: (Optional) string. The tag sequence of the example. This should be
121 | specified for train and dev examples, but not for test examples.
122 | """
123 | self.guid = guid
124 | self.text_a = text_a
125 | self.text_b = text_b
126 | self.tag = tag
127 |
128 |
129 | class PaddingInputExample(object):
130 | """Fake example so the num input examples is a multiple of the batch size.
131 |
132 | When running eval/predict on the TPU, we need to pad the number of examples
133 | to be a multiple of the batch size, because the TPU requires a fixed batch
134 | size. The alternative is to drop the last batch, which is bad because it means
135 | the entire output data won't be generated.
136 |
137 | We use this class instead of `None` because treating `None` as padding
138 | batches could cause silent errors.
139 | """
140 |
141 |
142 | class InputFeatures(object):
143 | """A single set of features of data."""
144 |
145 | def __init__(self,
146 | input_ids,
147 | input_mask,
148 | segment_ids,
149 | tag_ids,
150 | is_real_example=True):
151 | self.input_ids = input_ids
152 | self.input_mask = input_mask
153 | self.segment_ids = segment_ids
154 | self.tag_ids = tag_ids
155 | self.is_real_example = is_real_example
156 |
157 |
158 | class NerProcessor:
159 | def get_train_examples(self, data_dir):
160 | """获取训练集."""
161 | return self._create_examples(
162 | self._read_tsv(os.path.join(data_dir, "train.txt")), "train")
163 |
164 | def get_dev_examples(self, data_dir):
165 | """获取验证集."""
166 | return self._create_examples(
167 | self._read_tsv(os.path.join(data_dir, "dev.txt")), "dev")
168 |
169 | def get_test_examples(self, data_dir):
170 | """获取测试集."""
171 | return self._create_examples(
172 | self._read_tsv(os.path.join(data_dir, "test.json")), "test")
173 |
174 | def get_tags(self):
175 | """填写tag的标签,采用BIO形式标注"""
176 | # 会在convert_single_example方法中添加头,成为BIO形式标签
177 | return ['address', 'book', 'company', 'game', 'government',
178 | 'movie', 'name', 'organization', 'position', 'scene']
179 |
180 | def _read_tsv(self, input_file):
181 | """读取数据集"""
182 | with open(input_file, encoding='utf-8') as fr:
183 | lines = fr.readlines()
184 | return lines
185 |
186 | def _create_examples(self, lines, set_type):
187 | """Creates examples for the training and dev sets."""
188 | examples = []
189 | for (i, line) in enumerate(lines):
190 | if set_type == 'test':
191 | json_str = json.loads(line)
192 | text_a = tokenization.convert_to_unicode(json_str['text'])
193 | tag = None
194 | guid = json_str['id']
195 | else:
196 | text_tag = line.split('\t')
197 | guid = "%s-%s" % (set_type, i)
198 | text_a = tokenization.convert_to_unicode(text_tag[0])
199 | tag = tokenization.convert_to_unicode(text_tag[1])
200 | examples.append(
201 | InputExample(guid=guid, text_a=text_a, text_b=None, tag=tag))
202 | return examples
203 |
204 |
205 | def convert_single_example(ex_index, example, tag_list, max_seq_length,
206 | tokenizer):
207 | """Converts a single `InputExample` into a single `InputFeatures`."""
208 |
209 | if isinstance(example, PaddingInputExample):
210 | return InputFeatures(
211 | input_ids=[0] * max_seq_length,
212 | input_mask=[0] * max_seq_length,
213 | segment_ids=[0] * max_seq_length,
214 | tag_ids=[0] * max_seq_length,
215 | is_real_example=False)
216 |
217 | tag_map = {'O': 0}
218 | for tag in tag_list:
219 | tag_b = 'B-' + tag
220 | tag_i = 'I-' + tag
221 | tag_map[tag_b] = len(tag_map)
222 | tag_map[tag_i] = len(tag_map)
223 |
224 | # 因为CLUE要求提交文件中包含索引,所以不能直接使用tokenizer去分割text
225 | tokens_a = []
226 | text_list = list(example.text_a)
227 | for word in text_list:
228 | token = tokenizer.tokenize(word)
229 | tokens_a.extend(token)
230 |
231 | if len(tokens_a) > max_seq_length - 2:
232 | tokens_a = tokens_a[0:(max_seq_length - 2)]
233 |
234 | tokens = []
235 | segment_ids = []
236 | tokens.append("[CLS]")
237 | segment_ids.append(0)
238 | for token in tokens_a:
239 | tokens.append(token)
240 | segment_ids.append(0)
241 | tokens.append("[SEP]")
242 | segment_ids.append(0)
243 |
244 | input_ids = tokenizer.convert_tokens_to_ids(tokens)
245 | input_mask = [1] * len(input_ids)
246 |
247 | if example.tag:
248 | tag_ids = [0] # input第一位是[CLS]
249 | tags = example.tag.strip().split(' ')
250 | for tag in tags:
251 | tag_ids.append(tag_map.get(tag))
252 | tag_ids.append(0) # input最后一位是[SEP]
253 | else:
254 | tag_ids = [0] * max_seq_length
255 | # Zero-pad up to the sequence length.
256 | while len(input_ids) < max_seq_length:
257 | input_ids.append(0)
258 | input_mask.append(0)
259 | segment_ids.append(0)
260 | # test的时候已经*max_len所以不需要再继续padding
261 | if example.tag:
262 | tag_ids.append(0)
263 |
264 | assert len(input_ids) == max_seq_length
265 | assert len(input_mask) == max_seq_length
266 | assert len(segment_ids) == max_seq_length
267 | assert len(tag_ids) == max_seq_length
268 |
269 | if ex_index < 5:
270 | tf.logging.info("*** Example ***")
271 | tf.logging.info("guid: %s" % (example.guid))
272 | tf.logging.info("tokens: %s" % " ".join(
273 | [tokenization.printable_text(x) for x in tokens]))
274 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
275 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
276 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
277 | tf.logging.info("tag_ids: %s" % " ".join([str(x) for x in tag_ids]))
278 |
279 | feature = InputFeatures(
280 | input_ids=input_ids,
281 | input_mask=input_mask,
282 | segment_ids=segment_ids,
283 | tag_ids=tag_ids,
284 | is_real_example=True)
285 | return feature
286 |
287 |
288 | def file_based_convert_examples_to_features(
289 | examples, tag_list, max_seq_length, tokenizer, output_file):
290 | """Convert a set of `InputExample`s to a TFRecord file."""
291 |
292 | writer = tf.python_io.TFRecordWriter(output_file)
293 |
294 | for (ex_index, example) in enumerate(examples):
295 | if ex_index % 10000 == 0:
296 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
297 |
298 | feature = convert_single_example(ex_index, example, tag_list,
299 | max_seq_length, tokenizer)
300 |
301 | def create_int_feature(values):
302 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
303 | return f
304 |
305 | features = collections.OrderedDict()
306 | features["input_ids"] = create_int_feature(feature.input_ids)
307 | features["input_mask"] = create_int_feature(feature.input_mask)
308 | features["segment_ids"] = create_int_feature(feature.segment_ids)
309 | features["tag_ids"] = create_int_feature(feature.tag_ids)
310 | features["is_real_example"] = create_int_feature(
311 | [int(feature.is_real_example)])
312 |
313 | tf_example = tf.train.Example(features=tf.train.Features(feature=features))
314 | writer.write(tf_example.SerializeToString())
315 | writer.close()
316 |
317 |
318 | def file_based_input_fn_builder(input_file, seq_length, is_training,
319 | drop_remainder):
320 | """Creates an `input_fn` closure to be passed to TPUEstimator."""
321 |
322 | name_to_features = {
323 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
324 | "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
325 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
326 | "tag_ids": tf.FixedLenFeature([seq_length], tf.int64),
327 | "is_real_example": tf.FixedLenFeature([], tf.int64),
328 | }
329 |
330 | def _decode_record(record, name_to_features):
331 | """Decodes a record to a TensorFlow example."""
332 | example = tf.parse_single_example(record, name_to_features)
333 |
334 | for name in list(example.keys()):
335 | t = example[name]
336 | if t.dtype == tf.int64:
337 | t = tf.to_int32(t)
338 | example[name] = t
339 |
340 | return example
341 |
342 | def input_fn(params):
343 | """The actual input function."""
344 | batch_size = params["batch_size"]
345 |
346 | d = tf.data.TFRecordDataset(input_file)
347 | if is_training:
348 | d = d.repeat()
349 | d = d.shuffle(buffer_size=100)
350 |
351 | d = d.apply(
352 | tf.contrib.data.map_and_batch(
353 | lambda record: _decode_record(record, name_to_features),
354 | batch_size=batch_size,
355 | drop_remainder=drop_remainder))
356 |
357 | return d
358 |
359 | return input_fn
360 |
361 |
362 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
363 | """Truncates a sequence pair in place to the maximum length."""
364 |
365 | while True:
366 | total_length = len(tokens_a) + len(tokens_b)
367 | if total_length <= max_length:
368 | break
369 | if len(tokens_a) > len(tokens_b):
370 | tokens_a.pop()
371 | else:
372 | tokens_b.pop()
373 |
374 |
375 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
376 | tags, num_tags, use_one_hot_embeddings):
377 | """Creates a classification model."""
378 | model = modeling.BertModel(
379 | config=bert_config,
380 | is_training=is_training,
381 | input_ids=input_ids,
382 | input_mask=input_mask,
383 | token_type_ids=segment_ids,
384 | use_one_hot_embeddings=use_one_hot_embeddings)
385 |
386 | # 用bert的sequence输出层
387 | output_layer = model.get_sequence_output()
388 |
389 | hidden_size = output_layer.shape[-1].value
390 | seq_len = output_layer.shape[1].value
391 | # [batch, seq_len, emb_size] 16 128 768
392 |
393 | output_weights = tf.get_variable(
394 | "output_weights", [num_tags, hidden_size],
395 | initializer=tf.truncated_normal_initializer(stddev=0.02))
396 |
397 | output_bias = tf.get_variable(
398 | "output_bias", [num_tags], initializer=tf.zeros_initializer())
399 |
400 | with tf.variable_scope("loss"):
401 | if is_training:
402 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
403 |
404 | # 进行matmul需要reshape
405 | output_layer = tf.reshape(output_layer, [-1, hidden_size])
406 | # [batch*seq_len, num_tags]
407 | logits = tf.matmul(output_layer, output_weights, transpose_b=True)
408 | logits = tf.nn.bias_add(logits, output_bias)
409 |
410 | logits = tf.reshape(logits, [-1, seq_len, num_tags])
411 |
412 | # 真实的长度
413 | input_m = tf.count_nonzero(input_mask, -1)
414 | log_likelihood, transition_matrix = tf.contrib.crf.crf_log_likelihood(
415 | logits, tags, input_m)
416 | loss = tf.reduce_mean(-log_likelihood)
417 |
418 | # 使用crf_decode输出
419 | viterbi_sequence, _ = tf.contrib.crf.crf_decode(
420 | logits, transition_matrix, input_m)
421 |
422 | return loss, logits, viterbi_sequence
423 |
424 |
425 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
426 | num_train_steps, num_warmup_steps, use_tpu,
427 | use_one_hot_embeddings):
428 | """Returns `model_fn` closure for TPUEstimator."""
429 |
430 | def model_fn(features, labels, mode, params):
431 | """The `model_fn` for TPUEstimator."""
432 |
433 | tf.logging.info("*** Features ***")
434 | for name in sorted(features.keys()):
435 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
436 |
437 | input_ids = features["input_ids"]
438 | input_mask = features["input_mask"]
439 | segment_ids = features["segment_ids"]
440 | tag_ids = features["tag_ids"]
441 | is_real_example = None
442 | if "is_real_example" in features:
443 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
444 | else:
445 | is_real_example = tf.ones(tf.shape(tag_ids), dtype=tf.float32)
446 |
447 | is_training = (mode == tf.estimator.ModeKeys.TRAIN)
448 |
449 | total_loss, logits, predictions = create_model(
450 | bert_config, is_training, input_ids, input_mask, segment_ids, tag_ids,
451 | num_labels, use_one_hot_embeddings)
452 |
453 | tvars = tf.trainable_variables()
454 | initialized_variable_names = {}
455 | scaffold_fn = None
456 | if init_checkpoint:
457 | (assignment_map, initialized_variable_names
458 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
459 | if use_tpu:
460 | def tpu_scaffold():
461 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
462 | return tf.train.Scaffold()
463 |
464 | scaffold_fn = tpu_scaffold
465 | else:
466 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
467 |
468 | tf.logging.info("**** Trainable Variables ****")
469 | for var in tvars:
470 | init_string = ""
471 | if var.name in initialized_variable_names:
472 | init_string = ", *INIT_FROM_CKPT*"
473 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
474 | init_string)
475 |
476 | if mode == tf.estimator.ModeKeys.TRAIN:
477 | # 添加loss的hook,不然在GPU/CPU上不打印loss
478 | logging_hook = tf.train.LoggingTensorHook({"loss": total_loss}, every_n_iter=10)
479 | train_op = optimization.create_optimizer(
480 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
481 | output_spec = tf.contrib.tpu.TPUEstimatorSpec(
482 | mode=mode,
483 | loss=total_loss,
484 | train_op=train_op,
485 | training_hooks=[logging_hook],
486 | scaffold_fn=scaffold_fn)
487 | elif mode == tf.estimator.ModeKeys.EVAL:
488 | def metric_fn(per_example_loss, tag_ids, is_real_example):
489 | # 这里使用的accuracy来计算,宽松匹配方法
490 | accuracy = tf.metrics.accuracy(
491 | labels=tag_ids, predictions=predictions, weights=is_real_example)
492 | return {
493 | "eval_accuracy": accuracy,
494 | }
495 |
496 | eval_metrics = (metric_fn,
497 | [total_loss, tag_ids, is_real_example])
498 | output_spec = tf.contrib.tpu.TPUEstimatorSpec(
499 | mode=mode,
500 | loss=total_loss,
501 | eval_metrics=eval_metrics,
502 | scaffold_fn=scaffold_fn)
503 | else:
504 | output_spec = tf.contrib.tpu.TPUEstimatorSpec(
505 | mode=mode,
506 | predictions={"predictions": predictions},
507 | scaffold_fn=scaffold_fn)
508 | return output_spec
509 |
510 | return model_fn
511 |
512 |
513 | def main():
514 | tf.logging.set_verbosity(tf.logging.INFO)
515 |
516 | processors = {
517 | "cluener": NerProcessor,
518 | }
519 |
520 | tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case,
521 | FLAGS.init_checkpoint)
522 |
523 | # if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
524 | # raise ValueError(
525 | # "At least one of `do_train`, `do_eval` or `do_predict' must be True.")
526 |
527 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
528 |
529 | if FLAGS.max_seq_length > bert_config.max_position_embeddings:
530 | raise ValueError(
531 | "Cannot use sequence length %d because the BERT model "
532 | "was only trained up to sequence length %d" %
533 | (FLAGS.max_seq_length, bert_config.max_position_embeddings))
534 |
535 | tf.gfile.MakeDirs(FLAGS.output_dir)
536 |
537 | task_name = FLAGS.task_name.lower()
538 |
539 | if task_name not in processors:
540 | raise ValueError("Task not found: %s" % (task_name))
541 |
542 | processor = processors[task_name]()
543 |
544 | tag_list = processor.get_tags()
545 |
546 | tokenizer = tokenization.FullTokenizer(
547 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
548 |
549 | tpu_cluster_resolver = None
550 | if FLAGS.use_tpu and FLAGS.tpu_name:
551 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
552 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
553 |
554 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
555 | run_config = tf.contrib.tpu.RunConfig(
556 | cluster=tpu_cluster_resolver,
557 | master=FLAGS.master,
558 | model_dir=FLAGS.output_dir,
559 | save_checkpoints_steps=FLAGS.save_checkpoints_steps,
560 | tpu_config=tf.contrib.tpu.TPUConfig(
561 | iterations_per_loop=FLAGS.iterations_per_loop,
562 | num_shards=FLAGS.num_tpu_cores,
563 | per_host_input_for_training=is_per_host))
564 |
565 | train_examples = None
566 | num_train_steps = None
567 | num_warmup_steps = None
568 | if FLAGS.do_train:
569 | train_examples = processor.get_train_examples(FLAGS.data_dir)
570 | num_train_steps = int(
571 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
572 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
573 | # num_labels=2 * len(tag_list) + 1 BI两种外加一个O
574 | model_fn = model_fn_builder(
575 | bert_config=bert_config,
576 | num_labels=2*len(tag_list) + 1,
577 | init_checkpoint=FLAGS.init_checkpoint,
578 | learning_rate=FLAGS.learning_rate,
579 | num_train_steps=num_train_steps,
580 | num_warmup_steps=num_warmup_steps,
581 | use_tpu=FLAGS.use_tpu,
582 | use_one_hot_embeddings=FLAGS.use_tpu)
583 |
584 | estimator = tf.contrib.tpu.TPUEstimator(
585 | use_tpu=FLAGS.use_tpu,
586 | model_fn=model_fn,
587 | config=run_config,
588 | train_batch_size=FLAGS.train_batch_size,
589 | eval_batch_size=FLAGS.eval_batch_size,
590 | predict_batch_size=FLAGS.predict_batch_size)
591 |
592 | if FLAGS.do_train:
593 | train_file = os.path.join(FLAGS.data_dir, "train.tf_record")
594 | file_based_convert_examples_to_features(
595 | train_examples, tag_list, FLAGS.max_seq_length, tokenizer, train_file)
596 | tf.logging.info("***** Running training *****")
597 | tf.logging.info(" Num examples = %d", len(train_examples))
598 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
599 | tf.logging.info(" Num steps = %d", num_train_steps)
600 | train_input_fn = file_based_input_fn_builder(
601 | input_file=train_file,
602 | seq_length=FLAGS.max_seq_length,
603 | is_training=True,
604 | drop_remainder=True)
605 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
606 |
607 | if FLAGS.do_eval:
608 | eval_examples = processor.get_dev_examples(FLAGS.data_dir)
609 | num_actual_eval_examples = len(eval_examples)
610 | if FLAGS.use_tpu:
611 | while len(eval_examples) % FLAGS.eval_batch_size != 0:
612 | eval_examples.append(PaddingInputExample())
613 |
614 | eval_file = os.path.join(FLAGS.data_dir, "eval.tf_record")
615 | file_based_convert_examples_to_features(
616 | eval_examples, tag_list, FLAGS.max_seq_length, tokenizer, eval_file)
617 |
618 | tf.logging.info("***** Running evaluation *****")
619 | tf.logging.info(" Num examples = %d (%d actual, %d padding)",
620 | len(eval_examples), num_actual_eval_examples,
621 | len(eval_examples) - num_actual_eval_examples)
622 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)
623 |
624 | # This tells the estimator to run through the entire set.
625 | eval_steps = None
626 | if FLAGS.use_tpu:
627 | assert len(eval_examples) % FLAGS.eval_batch_size == 0
628 | eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size)
629 |
630 | eval_drop_remainder = True if FLAGS.use_tpu else False
631 | eval_input_fn = file_based_input_fn_builder(
632 | input_file=eval_file,
633 | seq_length=FLAGS.max_seq_length,
634 | is_training=False,
635 | drop_remainder=eval_drop_remainder)
636 |
637 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
638 |
639 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
640 | with tf.gfile.GFile(output_eval_file, "w") as writer:
641 | tf.logging.info("***** Eval results *****")
642 | for key in sorted(result.keys()):
643 | tf.logging.info(" %s = %s", key, str(result[key]))
644 | writer.write("%s = %s\n" % (key, str(result[key])))
645 |
646 | if FLAGS.do_predict:
647 | # label dict的设置
648 | tag_ids = {0: 'O', 1: 'B-address', 2: 'I-address', 3: 'B-book', 4: 'I-book',
649 | 5: 'B-company', 6: 'I-company', 7: 'B-game', 8: 'I-game',
650 | 9: 'B-government', 10: 'I-government', 11: 'B-movie', 12: 'I-movie',
651 | 13: 'B-name', 14: 'I-name', 15: 'B-organization', 16: 'I-organization',
652 | 17: 'B-position', 18: 'I-position', 19: 'B-scene', 20: 'I-scene'}
653 |
654 | predict_examples = processor.get_test_examples(FLAGS.data_dir)
655 | num_actual_predict_examples = len(predict_examples)
656 | test_file = os.path.join(FLAGS.data_dir, "test.tf_record")
657 | file_based_convert_examples_to_features(predict_examples, tag_list,
658 | FLAGS.max_seq_length, tokenizer,
659 | test_file)
660 |
661 | tf.logging.info("***** Running prediction*****")
662 | tf.logging.info(" Num examples = %d (%d actual, %d padding)",
663 | len(predict_examples), num_actual_predict_examples,
664 | len(predict_examples) - num_actual_predict_examples)
665 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size)
666 |
667 | predict_drop_remainder = True if FLAGS.use_tpu else False
668 | predict_input_fn = file_based_input_fn_builder(
669 | input_file=test_file,
670 | seq_length=FLAGS.max_seq_length,
671 | is_training=False,
672 | drop_remainder=predict_drop_remainder)
673 |
674 | results = estimator.predict(input_fn=predict_input_fn)
675 |
676 | output_file = os.path.join(FLAGS.data_dir, 'clue_predict.json')
677 | with open(output_file, 'w', encoding='utf-8') as fr:
678 | for example, result in zip(predict_examples, results):
679 | pre_id = result['predictions']
680 | # print(f'text is {example.text_a}')
681 | # print(f'preid is {pre_id}')
682 | text = example.text_a
683 | # 只获取text中的长度的tag输出
684 | tags = [tag_ids[tag] for tag in pre_id][1:len(text) + 1]
685 | res_words, res_pos = get_result(text, tags)
686 | rs = {}
687 | for w, t in zip(res_words, res_pos):
688 | rs[t] = rs.get(t, []) + [w]
689 | pres = {}
690 | for t, ws in rs.items():
691 | temp = {}
692 | for w in ws:
693 | word = text[w[0]: w[1] + 1]
694 | temp[word] = temp.get(word, []) + [w]
695 | pres[t] = temp
696 | output_line = json.dumps({'id': example.guid, 'label': pres}, ensure_ascii=False) + '\n'
697 | fr.write(output_line)
698 |
699 |
700 | def get_result(text, tags):
701 | """ 改写成clue要提交的格式 """
702 | result_words = []
703 | result_pos = []
704 | temp_word = []
705 | temp_pos = ''
706 | for i in range(min(len(text), len(tags))):
707 | if tags[i].startswith('O'):
708 | if len(temp_word) > 0:
709 | result_words.append([min(temp_word), max(temp_word)])
710 | result_pos.append(temp_pos)
711 | temp_word = []
712 | temp_pos = ''
713 | elif tags[i].startswith('B-'):
714 | if len(temp_word) > 0:
715 | result_words.append([min(temp_word), max(temp_word)])
716 | result_pos.append(temp_pos)
717 | temp_word = [i]
718 | temp_pos = tags[i].split('-')[1]
719 | elif tags[i].startswith('I-'):
720 | if len(temp_word) > 0:
721 | temp_word.append(i)
722 | if temp_pos == '':
723 | temp_pos = tags[i].split('-')[1]
724 | else:
725 | if len(temp_word) > 0:
726 | temp_word.append(i)
727 | if temp_pos == '':
728 | temp_pos = tags[i].split('-')[1]
729 | result_words.append([min(temp_word), max(temp_word)])
730 | result_pos.append(temp_pos)
731 | temp_word = []
732 | temp_pos = ''
733 | return result_words, result_pos
734 |
735 |
736 | if __name__ == "__main__":
737 | main()
738 |
--------------------------------------------------------------------------------
/ckbqa/DUTIR中文开放域知识问答评测报告.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xudongMk/AwesomeNLPBaseline/83774119d8b3a68b062df25f0e8775970fab8079/ckbqa/DUTIR中文开放域知识问答评测报告.pdf
--------------------------------------------------------------------------------
/ckbqa/README.md:
--------------------------------------------------------------------------------
1 | ## KBQA简介
2 |
3 | 基于知识库的问答(Knowledge Based Question Answering,KBQA)是自然语言处理(NLP)领域的热门研究方向。知识库(知识图谱,Knowledge Base/Knowledge Graph)是知识的结构化表示,一般由一组SPO三元组(主语Subject,谓语Predicate,宾语Object)构成(也称实体、关系、属性三元组),表示实体和实体间存在的语义关系。例如,中国的首都是北京,可以表示为:[中国,首都,北京]。
4 |
5 | 基于知识库的问答主要步骤是接收一个自然语言问句,识别出句子中的实体,理解问句的语义关系,构建有关实体和关系的查询语句,进而从知识库中检索出答案。
6 |
7 | 目前基于知识库的问答主要方法有:
8 |
9 | - 基于语义解析/规则的方法
10 | - 基于信息检索/信息抽取的方法
11 |
12 | 这里有一篇2019年KGQA的综述:Introduction to Neural Network Based Approaches for Question Answering over Knowledge Graphs。这篇文章将KGQA/KBQA当作语义解析的任务来对待,然后介绍了几种语义解析方法,如Classification、Ranking、Translation等。这里不做介绍,感兴趣的可以去翻原文。
13 |
14 | 基于中文知识库的问答(**Chinese Knowledge Based Question Answering,CKBQA**)相比英文KBQA,中文知识库包含的关系更多,数据集难以覆盖所有关系,再加上中文语言本身的特点,存在诸多挑战。
15 |
16 | **基于语义解析/规则的方法:**
17 |
18 | 该类方法使用字典、规则和机器学习,直接从问题中解析出实体、关系和逻辑组合。这里介绍两篇论文,一篇是 The APVA-TURBO Approach to Question Answering in Knowledge Base,文章使用序列标注模型解析问题中的实体,利用端到端模型解析问题中的关系序列。
19 | 另一篇是 A State-transition Framework to Answer Complex Questions over Knowledge Base,文章中提出了一种状态转移框架并结合卷积神经网络等方法。(上述方法均基于英文数据集)
20 |
21 | 基于语义解析/规则的方法一般步骤:
22 |
23 | - 实体识别:使用领域词表,相似度等(也可以使用深度学习模型,如BiLstm+CRF,BERT等)
24 | - 属性关系识别:词表规则,或使用分类模型
25 | - 答案查询:基于前两个步骤,根据规则模板转换成SPARQL等查询语言进行查询(下方给出一个简化示例)
26 |
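下面给出"根据规则模板转换SPARQL"这一步的一个极简示意(仅为说明流程,并非本仓库或上述论文的实现,实体与属性在这里假设已经识别完成,前缀和三元组格式需按具体知识库的 schema 调整):

```python
def build_sparql(entity, relation):
    """根据规则模板拼接 SPARQL 查询(示意写法)"""
    return f"SELECT ?x WHERE {{ <{entity}> <{relation}> ?x . }}"


if __name__ == '__main__':
    # 例:问题"中国的首都是哪里" -> 实体"中国",属性"首都"
    print(build_sparql('中国', '首都'))
```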
27 | 基于语义解析/规则的方法比较简单,当前Github上很多KBQA的项目都是基于这种模式。
28 |
29 | 这里推荐几个基于语义解析/规则的 KBQA项目:
30 |
31 | - 豆瓣的电影知识图谱问答:https://github.com/weizhixiaoyi/DouBan-KGQA
32 | - 基于NLPCC数据的KBQA:https://zhuanlan.zhihu.com/p/62946533
33 |
34 | **基于信息检索/信息抽取的方法:**
35 |
36 | 该类方法首先根据问题得到若干个候选实体,根据预定义的逻辑形式,从知识库中抽取与候选实体相连的关系作为候选查询路径,再使用文本匹配模型,选择出与问题相似度最高的候选查询路径,到知识库中检索答案。这里介绍一种增强路径匹配的方法: Improved neural relation detection for knowledge base question answering。
37 |
38 | 当前CKBQA任务上,大多采用的是基于信息检索/信息抽取的方法,一般步骤如下(列表后附一个路径匹配的简化示意):
39 |
40 | - 实体与关系识别
41 | - 路径匹配
42 | - 答案检索
43 |
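下面给出"路径匹配"这一步的一个极简示意(仅为说明思路,并非本仓库或上述评测方案的实现,这里用字符重合度代替真实的文本匹配模型打分):

```python
def char_overlap(question, path):
    """用字符重合度粗略衡量问题与候选路径的相似度(真实系统中应使用文本匹配模型打分)"""
    q, p = set(question), set(path)
    return len(q & p) / max(len(q | p), 1)


def rank_paths(question, candidate_paths):
    """对候选查询路径按相似度排序,取分数最高者到知识库中检索答案"""
    return sorted(candidate_paths, key=lambda p: char_overlap(question, p), reverse=True)


if __name__ == '__main__':
    question = '中国的首都是哪里'
    candidates = ['中国-首都', '中国-人口', '中国-国歌']
    print(rank_paths(question, candidates)[0])  # 期望输出:中国-首都
```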
44 | 在CCKS的KBQA比赛中这种方法非常常见,CCKS官网网站上有每一年的评测论文,下面推荐几个最新的:
45 |
46 | - 2019年CCKS的KBQA任务第四名方案:DUTIR中文开放域知识问答评测报告
47 | - 2020年CCKS的KBQA任务第一名方案:基于特征融合的中文知识库问答方法
48 |
49 | 具体内容可见官网的评测论文,这里附件上传,见ckbqa目录下两个pdf文件。
50 |
51 | ## 中英文数据集
52 |
53 | 英文数据集:
54 |
55 | - FREE917:第一个大规模的KBQA数据集,于2013年提出,包含917 个问题,同时提供相应逻辑查询,覆盖600多种freebase上的关系。
56 | - Webquestions:数据集中有6642个问题答案对,数据集规模虽然较FREE917提高了不少,但有两个突出的缺陷:没有提供对应的查询,不利于基于逻辑表达式模型的训练;另外webquestions中简单问句多而复杂问句少。
57 | - WebQSP:是WEBQUESTIONS的子集,问题都是需要多跳才能回答,属于multi-relation KBQA dataset,另外补全了对应的查询句。
58 | - ComplexQuestions、GraphQuestions:在问句的结构和表达多样性等方面进一步增强了WebQuestionsSP,包括类型约束、显/隐式的时间约束、聚合操作。
59 | - SimpleQuestions:数据规模较大,共100K,数据形式为(question,knowledge base fact),均为简单问题,只需KB中的一个三元组即可回答,即single-relation dataset。
60 |
61 | 英文数据集较多,这里只列举几个常见的。详细的数据集可见北航的[KBQA调研](https://github.com/BDBC-KG-NLP/QA-Survey/blob/master/KBQA%E8%B0%83%E7%A0%94-%E5%AD%A6%E6%9C%AF%E7%95%8C.md#13-%E6%95%B0%E6%8D%AE%E9%9B%86)
62 |
63 | 中文数据集:
64 |
65 | - NLPCC开放领域知识图谱问答的数据集:简单问题(单跳问题),14609条训练数据,9870条验证和测试数据,数据集下载。
66 | - CCKS开放领域知识图谱问答的数据集:包含简单问题和复杂问题,2298条训练数据,766条验证和测试数据,数据集下载。
67 |
68 | 除了上述两个中文数据集(提取码均是),CLUE上还提供了一些问答的数据集,可以见[CLUE的数据集搜索](https://www.cluebenchmarks.com/dataSet_search_modify.html?keywords=QA)。
69 |
70 | ## KBQA的实现
71 |
72 | 下面基于CCKS的数据集来实现2019年第四名方案和2020年第一名方案。
73 |
74 | CCKS的数据集,百度网盘下载地址:链接:https://pan.baidu.com/s/1NI9VrhuvOgyTFk1tGjlZIw 提取码:l7pm
75 |
76 | todo list(等有空实现了就补上):
77 |
78 | - 使用tensorflow实现2019年第四名方案
79 | - 使用tensorflow实现2020年第一名方案
80 |
81 | 附上2019年第四名方案的开源地址 https://github.com/atom32/ccks2019-ckbqa-4th-codes
82 | 流程还算完整,但想端到端完整运行有点困难,而且很多数据的处理过程都耦合在模型中。需要花一定的时间去整理。
83 |
84 | 2020年第一名方案代码暂未开源。
85 |
86 |
87 |
88 | ## 扩展
89 |
90 | - 美团大脑:知识图谱的建模方法及其应用:https://tech.meituan.com/2018/11/01/meituan-ai-nlp.html
91 | - 百度大脑UNIT3.0详解之知识图谱与对话:https://baijiahao.baidu.com/s?id=1643915882369765998&wfr=spider&for=pc
92 | - 更新ing
93 |
--------------------------------------------------------------------------------
/ckbqa/基于特征融合的中文知识库问答方法.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xudongMk/AwesomeNLPBaseline/83774119d8b3a68b062df25f0e8775970fab8079/ckbqa/基于特征融合的中文知识库问答方法.pdf
--------------------------------------------------------------------------------
/named_entity_recognition/README.md:
--------------------------------------------------------------------------------
1 | ## 命名实体识别(Named Entity Recognition)
2 |
3 | 这里首先介绍一篇基于深度学习的命名实体识别综述,《A Survey on Deep Learning for Named Entity Recognition》,论文来源:https://arxiv.org/abs/1812.09449(2020年3月份发表在TKDE)
4 |
5 | **1.命名实体识别简介**
6 |
7 | 命名实体识别(Named Entity Recognition,NER)旨在从给定文本中识别出属于预定义类别的片段(如人名、地点、组织等)。NER一直是很多自然语言处理应用的基础,如机器问答、文本摘要和机器翻译。
8 |
9 | NER任务最早是由第六届消息理解会议(Sixth Message Understanding Conference,MUC-6)提出,但当时仅定义了一些通用的实体类型,如组织、人名和地点。
10 |
11 | **2.命名实体识别常用方法**
12 |
13 | - 基于规则的方法(Rule-based Approaches):不需要标注数据,依赖人工规则,特定领域需要专家知识
14 | - 无监督学习方法(Unsupervised Learning Approaches):不需要标注数据,依赖于无监督学习方法,如聚类算法
15 | - 基于特征的有监督学习方法(Feature-based Supervised Learning Approaches):将NER当作一个多分类问题或序列标签分类任务,依赖于特征工程
16 | - 基于深度学习的方法(DL-based Approaches):后面详细介绍
17 |
18 | 论文简单介绍了前三种方法,这里也不再赘述,感兴趣的可以看论文。
19 |
20 | **3.基于深度学习的方法**
21 |
22 | 文章中将NER任务拆解成三个结构:
23 |
24 | - 输入的分布式表示(Distributed Representations for Input)
25 | - 上下文编码(Context Encoder Architectures)
26 | - 标签解码(Tag Decoder Architectures)
27 |
28 | 这里不再展开描述具体的内容(有兴趣的可以去翻论文),下表总结了基于神经网络的NER模型的工作,并展示了每个NER模型在各类数据集上的表现。
29 |
30 | 
31 |
32 | 总结:BiLstm+CRF是使用深度学习的NER最常见的体系结构,以Cloze风格使用预训练双向Transformer在CoNLL03数据集上达到了SOTA效果(93.5%),另外Bert+Dice Loss在OntoNotes5.0数据集上达到了SOTA效果(92.07%)。
33 |
34 | **4.评测指标**
35 |
36 | 文中将NER的评测指标Precision、Recall和F1的计算方式分成了两类(两者的区别见下方的简化示例)。
37 |
38 | - Exact match:严格匹配方法,需要识别的边界和类别都正确
39 | - Relaxed match:宽松匹配方法,实体位置区间重叠、位置正确类别错误等都视为正确
40 |
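下面用一个简化的小例子说明严格匹配(Exact match)下 Precision/Recall/F1 的计算方式(实体用 (类别, 起始位置, 结束位置) 三元组表示,仅为示意,并非本仓库的评测代码):

```python
def exact_match_f1(gold_entities, pred_entities):
    """gold/pred 均为 (类别, start, end) 三元组列表,边界和类别都一致才算预测正确"""
    gold, pred = set(gold_entities), set(pred_entities)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == '__main__':
    gold = [('scene', 3, 4), ('name', 0, 1)]
    # name 的边界不对,严格匹配下记为错误(宽松匹配下区间有重叠、类别正确,可视为正确)
    pred = [('scene', 3, 4), ('name', 0, 2)]
    print(exact_match_f1(gold, pred))  # (0.5, 0.5, 0.5)
```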
41 |
42 |
43 | ## 命名实体识别数据集
44 |
45 | 命名实体识别数据集一般采用BIO或者BIOES模式标注,两种模式的区别见下方示例。
46 |
47 | - BIO模式:具体指B-begin、I-inside、O-outside
48 | - BIOES模式:具体指B-begin、I-inside、O-outside、E-end、S-single
49 |
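以 convert_bio.py 注释中的例句"我要去故宫"为例(实体"故宫",类别沿用该例子中的 location),两种标注模式的区别如下(示意):

```python
text  = list("我要去故宫")
bio   = ["O", "O", "O", "B-location", "I-location"]   # BIO:实体首字用 B-,其余用 I-
bioes = ["O", "O", "O", "B-location", "E-location"]   # BIOES:实体末字用 E-,单字实体用 S-
```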
50 | 首先是综述中提到的几个数据集,见下表,具体的就不介绍了。
51 |
52 | 
53 |
54 |
55 |
56 | 下面介绍一个中文的命名实体识别数据集,**CLUENER 细粒度命名实体识别**,地址:https://github.com/CLUEbenchmark/CLUENER2020
57 |
58 | - 数据类别:10个,地址、书名、公司、游戏、政府、电影、姓名、组织、职位和景点
59 | - 数据分布:训练集10748,验证集1343,具体类别分布见原文
60 | - 数据来源:在THUCTC文本分类数据集基础上,选出部分数据进行细粒度实体标注
61 |
62 |
63 |
64 | ## 命名实体识别Baseline算法实现
65 |
66 | 使用Tensorflow1.x版本Estimator高阶api实现常见的命名实体识别算法,主要包括BiLstm+CRF、Bert、Bert+CRF。
67 |
68 | (当前只在本目录下实现了BiLstm+CRF,至于BERT的在bert_downstream目录下暂未实现)
69 |
70 | 环境信息:
71 |
72 | tensorflow==1.13.1
73 |
74 | python==3.7
75 |
76 | **数据预处理**
77 |
78 | 训练集和测试集需要分开存储,且数据集格式需为BIO标注形式。
79 |
80 | 在训练模型前,需要先运行preprocess.py文件进行数据预处理,将数据处理成id形式并保存为pkl形式,另外中间过程产生的词表也会保存为vocab.txt文件。
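结合 ner_main.py 与 data_utils/datasets.py 的读取逻辑,preprocess.py 保存的 pkl 大致是一个包含 train/test 两部分的字典,每部分含有 words 和 tags 两个 id 序列列表。下面是一个读取检查的小示意(字段名以实际 preprocess.py 的输出为准):

```python
import _pickle as cPickle

# 读取 preprocess.py 生成的 pkl 文件,检查训练数据规模(示意代码)
data = cPickle.load(open('./data_path/clue_data.pkl', 'rb'))
train = data['train']
print('train size:', len(train['words']))
print('first sample word ids:', train['words'][0])
print('first sample tag ids :', train['tags'][0])
```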
81 |
82 | **文件结构**
83 |
84 | - data_path:数据集存放的位置
85 | - data_utils:数据处理相关的工具类存放位置
86 | - model_ckpt:checkpoint模型保存的位置
87 | - model_pb:pb形式的模型保存的位置
88 | - models:ner基本的算法存放位置,如BiLstm等
89 | - preprocess.py:数据预处理代码
90 | - ner_main.py:训练主入口
91 |
92 | **模型训练**
93 |
94 | - 首先准备好数据集,放在data_path下,然后运行preprocess.py文件
95 | - 运行ner_main.py,具体的模型参数可以在ARGS里面设置,也可以使用 `python ner_main.py --train_path='./data_path/clue_data.pkl'` 的形式在命令行传入
96 |
97 | **模型推理**
98 |
99 | - 推理代码在inference.py中
100 |
101 |
102 |
103 | ## 示例
104 |
105 | 下面使用中文任务测评基准(CLUE benchmark)的CLUENER数据进行demo示例演示:
106 |
107 | 数据集下载地址[[CLUENER细粒度命名实体识别](https://github.com/CLUEbenchmark/CLUENER2020)],该数据由CLUEBenchMark整理,数据分为10个标签类别分别为: 地址(address),书名(book),公司(company),游戏(game),政府(government),电影(movie),姓名(name),组织机构(organization),职位(position),景点(scene)
108 |
109 | 数据集分布:
110 |
111 | ```
112 | 训练集:10748
113 | 验证集:1343
114 |
115 | 按照不同标签类别统计,训练集数据分布如下(注:一条数据中出现的所有实体都进行标注,如果一条数据出现两个地址(address)实体,那么统计地址(address)类别数据的时候,算两条数据):
116 | 【训练集】标签数据分布如下:
117 | 地址(address):2829
118 | 书名(book):1131
119 | 公司(company):2897
120 | 游戏(game):2325
121 | 政府(government):1797
122 | 电影(movie):1109
123 | 姓名(name):3661
124 | 组织机构(organization):3075
125 | 职位(position):3052
126 | 景点(scene):1462
127 |
128 | 【验证集】标签数据分布如下:
129 | 地址(address):364
130 | 书名(book):152
131 | 公司(company):366
132 | 游戏(game):287
133 | 政府(government):244
134 | 电影(movie):150
135 | 姓名(name):451
136 | 组织机构(organization):344
137 | 职位(position):425
138 | 景点(scene):199
139 | ```
140 |
141 | **1.数据EDA:**
142 |
143 | 省略,需要的可以自己分析一下数据集的分布情况
144 |
145 | **2.数据预处理:**
146 |
147 | 转换BIO形式:具体见convert_bio.py,将CLUE提供的数据集转换为BIO标注形式;然后运行preprocess.py将数据集转换为id形式并保存为pkl文件。
148 |
149 | **3.模型训练:**
150 |
151 | 代码见ner_main.py,参数设置的时候有几个参数需要根据自己的数据分布来设置:
152 |
153 | - vocab_size:词表大小,一般需要根据自己生成的vocab.txt中词表的大小来设置
154 | - num_tags:类别标签的数量,算上O,这里是21类
155 | - train_path/eval_path:数据集的路径
156 |
157 | 其他的参数视个人情况而定
158 |
159 | **4.开始预测并提交结果**
160 |
161 | 预测代码见inference.py
162 |
163 | #todo next 只完成了一部分,写入文件的部分暂时未完成。因为其提交的文件格式有点难受....太细化了...
164 |
165 |
166 |
167 | ## NER的比赛
168 |
169 | 1.天池的比赛 https://tianchi.aliyun.com/competition/entrance/531824/introduction
170 |
171 | 2.CLUE的评测 https://www.cluebenchmarks.com/introduce.html
172 |
173 |
174 |
175 | ## 扩展
176 |
177 | - 美团搜索中NER技术的探索和实践:https://tech.meituan.com/2020/07/23/ner-in-meituan-nlp.html
178 |
179 |
180 |
181 |
182 |
--------------------------------------------------------------------------------
/named_entity_recognition/convert_bio.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/15 21:46
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : convert_bio.py
6 | # @Software: PyCharm
7 | import json
8 |
9 | """
10 | 将数据转换成bio形式
11 | """
12 |
13 |
14 | def read_data(file_path):
15 | """
16 | 读取数据集
17 | :param file_path:
18 | :return:
19 | """
20 | with open(file_path, encoding='utf-8') as fr:
21 | lines = fr.readlines()
22 | print(f'the data size is {len(lines)}')
23 | return lines
24 |
25 |
26 | def convert_bio_data(file_path, out):
27 | """
28 | 转换成bio形式
29 | example:
30 | 我要去故宫 O O O B-location I-location
31 | :param file_path:
32 | :return:
33 | """
34 | lines = read_data(file_path)
35 | bio_data = []
36 | for line in lines:
37 | data = json.loads(line)
38 | text = data['text']
39 | labels = data['label']
40 | # 遍历处理label
41 | bios = ['O'] * len(text)
42 | for label in labels:
43 | entitys = labels[label]
44 | for entity in entitys:
45 | indexs = entitys[entity]
46 | for index in indexs:
47 | start = index[0]
48 | end = index[1]
49 | for i in range(start, end + 1):
50 | if i == start:
51 | bios[i] = f'B-{label}'
52 | else:
53 | bios[i] = f'I-{label}'
54 | bio_data.append(text + '\t' + ' '.join(bios))
55 | # write to file
56 | with open(out, 'w', encoding='utf-8') as fr:
57 | for data in bio_data:
58 | fr.write(data + '\n')
59 | print(f'convert bio data over!')
60 |
61 |
62 | if __name__ == '__main__':
63 | convert_bio_data('./data_path/train.json', './data_path/train.txt')
64 | convert_bio_data('./data_path/dev.json', './data_path/dev.txt')
65 |
--------------------------------------------------------------------------------
/named_entity_recognition/data_path/README.md:
--------------------------------------------------------------------------------
1 | ### 文件说明
2 |
3 | 这里存放的是数据集
--------------------------------------------------------------------------------
/named_entity_recognition/data_utils/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/9 20:34
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : __init__.py.py
6 | # @Software: PyCharm
7 |
--------------------------------------------------------------------------------
/named_entity_recognition/data_utils/datasets.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/9 20:34
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : datasets.py
6 | # @Software: PyCharm
7 |
8 | import numpy as np
9 | import tensorflow as tf
10 |
11 | """
12 | 数据集构建类
13 | 将数据转换成模型所需要的dataset输入
14 | """
15 |
16 |
17 | class DataBuilder:
18 | def __init__(self, data):
19 | self.words = np.asarray(data['words'])
20 | self.tags = np.asarray(data['tags'])
21 |
22 | @property
23 | def size(self):
24 | return len(self.words)
25 |
26 | def build_generator(self):
27 | """
28 | build data generator for model
29 | :return:
30 | """
31 | for word, tag in zip(self.words, self.tags):
32 | yield (word, len(word)), tag
33 |
34 | def build_dataset(self):
35 | """
36 | build dataset from generator
37 | :return:
38 | """
39 | dataset = tf.data.Dataset.from_generator(
40 | self.build_generator,
41 | ((tf.int64, tf.int64), tf.int64),
42 | ((tf.TensorShape([None]), tf.TensorShape([])), tf.TensorShape([None]))
43 | )
44 | return dataset
45 |
46 | def get_train_batch(self, dataset, batch_size, epoch):
47 | """
48 | get one batch train data
49 | :param dataset:
50 | :param batch_size:
51 | :param epoch:
52 | :return:
53 | """
54 | dataset = dataset.cache()\
55 | .shuffle(buffer_size=10000)\
56 | .padded_batch(batch_size, padded_shapes=(([None], []), [None]))\
57 | .repeat(epoch)
58 | return dataset.make_one_shot_iterator().get_next()
59 |
60 | def get_test_batch(self, dataset, batch_size):
61 | """
62 | get one batch test data
63 | :param dataset:
64 | :param batch_size:
65 | :return:
66 | """
67 | dataset = dataset.padded_batch(batch_size,
68 | padded_shapes=(([None], []), [None]))
69 | return dataset.make_one_shot_iterator().get_next()
70 |
--------------------------------------------------------------------------------
/named_entity_recognition/inference.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2021/1/6 22:59
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : inference.py
6 | # @Software: PyCharm
7 |
8 | import tensorflow as tf
9 | import tqdm
10 | import json
11 | import _pickle as cPickle
12 |
13 | """
14 | 命名实体识别推理代码
15 | """
16 |
17 | # 加载词典
18 | word_dict = {}
19 | with open('./data_path/clue_vocab.txt', encoding='utf-8') as fr:
20 | lines = fr.readlines()
21 | for line in lines:
22 | word = line.split('\t')[0]
23 | id = int(line.split('\t')[1].strip())
24 | word_dict[word] = id
25 | print(f'load vocab over, vocab size: {len(word_dict)}')
26 |
27 | # label dict的设置 这个和preprocess中的tag_dict对应
28 | tag_ids = {0: 'O', 1: 'B-address', 2: 'I-address', 3: 'B-book', 4: 'I-book',
29 | 5: 'B-company', 6: 'I-company', 7: 'B-game', 8: 'I-game',
30 | 9: 'B-government', 10: 'I-government', 11: 'B-movie', 12: 'I-movie',
31 | 13: 'B-name', 14: 'I-name', 15: 'B-organization', 16: 'I-organization',
32 | 17: 'B-position', 18: 'I-position', 19: 'B-scene', 20: 'I-scene'}
33 |
34 |
35 | def words_to_ids(words, word_dict):
36 | """ 将words 转换成ids形式 """
37 | ids = [word_dict.get(word, 1) for word in words]
38 | return ids
39 |
40 |
41 | def predict_main(test_file, out_path):
42 | """ 预测主入口 """
43 | model_path = './model_pb/1609946529'
44 | with tf.Session(graph=tf.Graph()) as sess:
45 | model = tf.saved_model.loader.load(sess, ['serve'], model_path)
46 | # print(model)
47 | out = sess.graph.get_tensor_by_name('tag_ids:0')
48 | input_id = sess.graph.get_tensor_by_name('input_words:0')
49 | input_len = sess.graph.get_tensor_by_name('input_len:0')
50 |
51 | with open(test_file, encoding='utf-8') as fr:
52 | lines = fr.readlines()
53 | res_list = []
54 |
55 | cnt = 0
56 | for line in tqdm.tqdm(lines):
57 | json_str = json.loads(line)
58 | id = json_str['id']
59 | text = json_str['text']
60 | if len(text) < 1:
61 | print('there are some sample error!')
62 | text_features = words_to_ids(text, word_dict)
63 | text_label = len(text)
64 | feed = {input_id: [text_features], input_len: [text_label]}
65 | score = sess.run(out, feed_dict=feed)
66 |
67 | cnt += 1
68 | tags = [tag_ids[tag] for tag in score[0]]
69 | # print(tags)
70 | res_words, res_pos = get_result(text, tags)
71 | rs = {}
72 | for w, t in zip(res_words, res_pos):
73 | rs[t] = rs.get(t, []) + [w]
74 | pres = {}
75 | for t, ws in rs.items():
76 | temp = {}
77 | for w in ws:
78 | word = text[w[0]: w[1] + 1]
79 | temp[word] = temp.get(word, []) + [w]
80 | pres[t] = temp
81 | output_line = json.dumps({'id': id, 'label': pres}, ensure_ascii=False)
82 | res_list.append(output_line)
83 | # print(output_line)
84 | # write to file
85 | with open(out_path, 'w', encoding='utf-8') as fr:
86 | for res in res_list:
87 | fr.write(res)
88 | fr.write('\n')
89 |
90 |
91 | def get_result(text, tags):
92 | """ 改写成clue要提交的格式 """
93 | result_words = []
94 | result_pos = []
95 | temp_word = []
96 | temp_pos = ''
97 | for i in range(min(len(text), len(tags))):
98 | if tags[i].startswith('O'):
99 | if len(temp_word) > 0:
100 | result_words.append([min(temp_word), max(temp_word)])
101 | result_pos.append(temp_pos)
102 | temp_word = []
103 | temp_pos = ''
104 | elif tags[i].startswith('B-'):
105 | if len(temp_word) > 0:
106 | result_words.append([min(temp_word), max(temp_word)])
107 | result_pos.append(temp_pos)
108 | temp_word = [i]
109 | temp_pos = tags[i].split('-')[1]
110 | elif tags[i].startswith('I-'):
111 | if len(temp_word) > 0:
112 | temp_word.append(i)
113 | if temp_pos == '':
114 | temp_pos = tags[i].split('-')[1]
115 | else:
116 | if len(temp_word) > 0:
117 | temp_word.append(i)
118 | if temp_pos == '':
119 | temp_pos = tags[i].split('-')[1]
120 | result_words.append([min(temp_word), max(temp_word)])
121 | result_pos.append(temp_pos)
122 | temp_word = []
123 | temp_pos = ''
124 | return result_words, result_pos
125 |
126 |
127 | if __name__ == '__main__':
128 | test_file = './data_path/test.json'
129 | out_path = './data_path/clue_predict.json'
130 | predict_main(test_file, out_path)
131 |
--------------------------------------------------------------------------------
/named_entity_recognition/model_ckpt/README.md:
--------------------------------------------------------------------------------
1 | ### 文件说明
2 |
3 | 这里保存训练后的checkpoint文件
--------------------------------------------------------------------------------
/named_entity_recognition/model_pb/README.md:
--------------------------------------------------------------------------------
1 | ### 文件说明
2 |
3 | 这里保存训练后的pb模型文件
--------------------------------------------------------------------------------
/named_entity_recognition/models/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/9 20:34
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : __init__.py.py
6 | # @Software: PyCharm
7 |
--------------------------------------------------------------------------------
/named_entity_recognition/models/bilstm_crf.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/9 20:34
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : bilstm_crf.py
6 | # @Software: PyCharm
7 |
8 | import tensorflow as tf
9 | from tensorflow.contrib.rnn import LSTMCell
10 | from tensorflow.contrib.rnn import MultiRNNCell
11 |
12 |
13 | class Linear:
14 | """
14 | 全连接层
16 | """
17 | def __init__(self, scope_name, input_size, output_size,
18 | drop_out=0., trainable=True):
19 | with tf.variable_scope(scope_name):
20 | w_init = tf.random_uniform_initializer(-0.1, 0.1)
21 | self.W = tf.get_variable('W', [input_size, output_size],
22 | initializer=w_init,
23 | trainable=trainable)
24 |
25 | self.b = tf.get_variable('b', [output_size],
26 | initializer=tf.zeros_initializer(),
27 | trainable=trainable)
28 |
29 | self.drop_out = tf.layers.Dropout(drop_out)
30 |
31 | self.output_size = output_size
32 |
33 | def __call__(self, inputs, training):
34 | size = tf.shape(inputs)
35 | input_trans = tf.reshape(inputs, [-1, size[-1]])
36 | input_trans = tf.nn.xw_plus_b(input_trans, self.W, self.b)
37 | input_trans = self.drop_out(input_trans, training=training)
38 |
39 | input_trans = tf.reshape(input_trans, [-1, size[1], self.output_size])
40 |
41 | return input_trans
42 |
43 |
44 | class LookupTable:
45 | """
46 | embedding layer
47 | """
48 | def __init__(self, scope_name, vocab_size, embed_size, reuse=False, trainable=True):
49 | self.vocab_size = vocab_size
50 | self.embed_size = embed_size
51 |
52 | with tf.variable_scope(scope_name, reuse=bool(reuse)):
53 | self.embedding = tf.get_variable('embedding', [vocab_size, embed_size],
54 | initializer=tf.random_uniform_initializer(-0.25, 0.25),
55 | trainable=trainable)
56 |
57 | def __call__(self, input):
58 | input = tf.where(tf.less(input, self.vocab_size), input, tf.ones_like(input))
59 | return tf.nn.embedding_lookup(self.embedding, input)
60 |
61 |
62 | class LstmBase:
63 | """
64 | build rnn cell
65 | """
66 | def build_rnn(self, hidden_size, num_layes):
67 | cells = []
68 | for i in range(num_layes):
69 | cell = LSTMCell(num_units=hidden_size,
70 | state_is_tuple=True,
71 | initializer=tf.random_uniform_initializer(-0.25, 0.25))
72 | cells.append(cell)
73 | cells = MultiRNNCell(cells, state_is_tuple=True)
74 |
75 | return cells
76 |
77 |
78 | class BiLstm(LstmBase):
79 | """
80 | define the lstm
81 | """
82 | def __init__(self, scope_name, hidden_size, num_layers):
83 | super(BiLstm, self).__init__()
84 | assert hidden_size % 2 == 0
85 | hidden_size //= 2  # 前向/后向各占一半隐层维度,保持为整数
86 |
87 | self.fw_rnns = []
88 | self.bw_rnns = []
89 | for i in range(num_layers):
90 | self.fw_rnns.append(self.build_rnn(hidden_size, 1))
91 | self.bw_rnns.append(self.build_rnn(hidden_size, 1))
92 |
93 | self.scope_name = scope_name
94 |
95 | def __call__(self, input, input_len):
96 | for idx, (fw_rnn, bw_rnn) in enumerate(zip(self.fw_rnns, self.bw_rnns)):
97 | scope_name = '{}_{}'.format(self.scope_name, idx)
98 | ctx, _ = tf.nn.bidirectional_dynamic_rnn(
99 | fw_rnn, bw_rnn, input, sequence_length=input_len,
100 | dtype=tf.float32, time_major=False,
101 | scope=scope_name
102 | )
103 | input = tf.concat(ctx, -1)
104 | ctx = input
105 | return ctx
106 |
107 |
108 | class BiLstm_Crf:
109 | def __init__(self, args, vocab_size, emb_size):
110 | # embedding
111 | scope_name = 'look_up'
112 | self.lookuptables = LookupTable(scope_name, vocab_size, emb_size)
113 |
114 | # rnn
115 | scope_name = 'bi_lstm'
116 | self.rnn = BiLstm(scope_name, args.hidden_dim, 1)
117 |
118 | # linear
119 | scope_name = 'linear'
120 | self.linear = Linear(scope_name, args.hidden_dim, args.num_tags,
121 | drop_out=args.drop_out)
122 |
123 | # crf
124 | scope_name = 'crf_param'
125 | self.crf_param = tf.get_variable(scope_name, [args.num_tags, args.num_tags],
126 | dtype=tf.float32)
127 |
128 | def __call__(self, inputs, training):
129 | masks = tf.sign(inputs)
130 | sent_len = tf.reduce_sum(masks, axis=1)
131 |
132 | embedding = self.lookuptables(inputs)
133 |
134 | rnn_out = self.rnn(embedding, sent_len)
135 |
136 | logits = self.linear(rnn_out, training)
137 |
138 | pred_ids, _ = tf.contrib.crf.crf_decode(logits, self.crf_param, sent_len)
139 |
140 | return logits, pred_ids, self.crf_param
141 |
142 |
143 |
144 |
145 |
--------------------------------------------------------------------------------
/named_entity_recognition/ner_main.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020-10-09 23:07
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : ner_main.py
6 | # @Software: PyCharm
7 |
8 | import sys
9 | import time
10 | import tensorflow as tf
11 | from data_utils import datasets
12 |
13 | import _pickle as cPickle
14 |
15 | from argparse import ArgumentParser
16 | from models.bilstm_crf import BiLstm_Crf
17 |
18 | parser = ArgumentParser()
19 |
20 | parser.add_argument("--vocab_size", type=int, default=4000, help='vocab size')
21 | parser.add_argument("--emb_size", type=int, default=300, help='emb size')
22 | parser.add_argument("--train_path", type=str, default='./data_path/clue_data.pkl')
23 | parser.add_argument("--test_path", type=str, default='./data_path/clue_data.pkl')
24 | parser.add_argument("--model_ckpt_dir", type=str, default='./model_ckpt/')
25 | parser.add_argument("--model_pb_dir", type=str, default='./model_pb')
26 | parser.add_argument("--hidden_dim", type=int, default=300)
27 | parser.add_argument("--num_tags", type=int, default=21)
28 | parser.add_argument("--drop_out", type=float, default=0.1)
29 | parser.add_argument("--batch_size", type=int, default=16)
30 | parser.add_argument("--epoch", type=int, default=50)
31 | parser.add_argument("--lr", type=float, default=1e-4,
32 | help='the learning rate for optimizer')
33 |
34 |
35 | tf.logging.set_verbosity(tf.logging.INFO)
36 | ARGS, unparsed = parser.parse_known_args()
37 | print(ARGS)
38 |
39 | sys.stdout.flush()
40 |
41 |
42 | def init_data(file_name, type=None):
43 | """
44 | init data
45 | :param file_name:
46 | :param type:
47 | :return:
48 | """
49 | data = cPickle.load(open(file_name, 'rb'))[type]
50 |
51 | data_builder = datasets.DataBuilder(data)
52 | dataset = data_builder.build_dataset()
53 |
54 | def train_input():
55 | return data_builder.get_train_batch(dataset, ARGS.batch_size, ARGS.epoch)
56 |
57 | def test_input():
58 | return data_builder.get_test_batch(dataset, ARGS.batch_size)
59 |
60 | return train_input if type == 'train' else test_input
61 |
62 |
63 | def model_fn(features, labels, mode, params):
64 | """
65 | build model fn
66 | :return:
67 | """
68 | vocab_size = ARGS.vocab_size
69 | emb_size = ARGS.emb_size
70 | model = BiLstm_Crf(ARGS, vocab_size, emb_size)
71 |
72 | if isinstance(features, dict):
73 | features = features['words'], features['words_len']
74 |
75 | words, words_len = features
76 |
77 | if mode == tf.estimator.ModeKeys.PREDICT:
78 | _, pred_ids, _ = model(words, training=False)
79 |
80 | prediction = {'tag_ids': tf.identity(pred_ids, name='tag_ids')}
81 |
82 | return tf.estimator.EstimatorSpec(
83 | mode=mode,
84 | predictions=prediction,
85 | export_outputs={'classify': tf.estimator.export.PredictOutput(prediction)}
86 | )
87 | else:
88 | tags = labels
89 | weights = tf.sequence_mask(words_len)
90 | if mode == tf.estimator.ModeKeys.TRAIN:
91 | logits, pred_ids, crf_params = model(words, training=True)
92 |
 93 | log_likelihood, _ = tf.contrib.crf.crf_log_likelihood(
94 | logits, tags, words_len, crf_params
95 | )
 96 | loss = -tf.reduce_mean(log_likelihood)
97 | accuracy = tf.metrics.accuracy(tags, pred_ids, weights)
98 |
99 | tf.identity(accuracy[1], name='train_accuracy')
100 | tf.summary.scalar('train_accuracy', accuracy[1])
101 | optimizer = tf.train.AdamOptimizer(learning_rate=ARGS.lr)  # 使用--lr参数,避免学习率被硬编码
102 | return tf.estimator.EstimatorSpec(
103 | mode=mode,
104 | loss=loss,
105 | train_op=optimizer.minimize(loss, tf.train.get_or_create_global_step())
106 | )
107 | else:
108 | _, pred_ids, _ = model(words, training=False)
109 | accuracy = tf.metrics.accuracy(tags, pred_ids, weights)
110 | metrics = {
111 | 'accuracy': accuracy
112 | }
113 | return tf.estimator.EstimatorSpec(
114 | mode=mode,
115 | loss=tf.constant(0., dtype=tf.float32),  # 评估阶段未单独计算CRF损失,这里仅用0占位
116 | eval_metric_ops=metrics
117 | )
118 |
119 |
120 | def main_es(unparsed):
121 | """
122 | main method
123 | :param unparsed:
124 | :return:
125 | """
126 | cur_time = time.time()
127 | model_dir = ARGS.model_ckpt_dir + str(int(cur_time))
128 |
129 | classifier = tf.estimator.Estimator(
130 | model_fn=model_fn,
131 | model_dir=model_dir,
132 | params={}
133 | )
134 |
135 | # train
136 | train_input = init_data(ARGS.train_path, 'train')
137 | tensors_to_log = {'train_accuracy': 'train_accuracy'}
138 | logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=100)
139 | classifier.train(input_fn=train_input, hooks=[logging_hook])
140 |
141 | # eval
142 | test_input = init_data(ARGS.test_path, 'test')
143 | eval_res = classifier.evaluate(input_fn=test_input)
144 | print(f'Evaluation res is : \n\t{eval_res}')
145 |
146 | if ARGS.model_pb_dir:
147 | words = tf.placeholder(tf.int64, [None, None], name='input_words')
148 | words_len = tf.placeholder(tf.int64, [None], name='input_len')
149 | input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({
150 | 'words': words,
151 | 'words_len': words_len
152 | })
153 | classifier.export_savedmodel(ARGS.model_pb_dir, input_fn)
154 |
155 |
156 | if __name__ == '__main__':
157 | tf.app.run(main=main_es, argv=[sys.argv[0]])
--------------------------------------------------------------------------------
/named_entity_recognition/pics/命名实体识别数据图.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xudongMk/AwesomeNLPBaseline/83774119d8b3a68b062df25f0e8775970fab8079/named_entity_recognition/pics/命名实体识别数据图.png
--------------------------------------------------------------------------------
/named_entity_recognition/pics/命名实体识别的模型总结图.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xudongMk/AwesomeNLPBaseline/83774119d8b3a68b062df25f0e8775970fab8079/named_entity_recognition/pics/命名实体识别的模型总结图.png
--------------------------------------------------------------------------------
/named_entity_recognition/preprocess.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020-10-11 18:52
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : preprocess.py
6 | # @Software: PyCharm
7 |
8 | import os
9 | import _pickle as cPickle
10 | import pandas as pd
11 | import random
12 |
13 | """
14 | 数据预处理
15 | 将数据处理成id,并封装成pkl形式
16 | """
17 |
18 |
19 | # clue2020细粒度命名实体识别的类别
20 | tag_list = ['address', 'book', 'company', 'game', 'government',
21 | 'movie', 'name', 'organization', 'position', 'scene']
22 | tag_dict = {'O': 0}
23 |
24 | for tag in tag_list:
25 | tag_B = 'B-' + tag
26 | tag_I = 'I-' + tag
27 | tag_dict[tag_B] = len(tag_dict)
28 | tag_dict[tag_I] = len(tag_dict)
29 |
30 | print(tag_dict)
31 |
32 |
33 | def make_vocab(file_path):
34 | """
35 | 构建词典
36 | :param file_path:
37 | :return:
38 | """
39 | data = pd.read_csv(file_path, sep='\t', header=None)
40 | data.columns = ['text', 'tag']
41 | vocab = {'PAD': 0, 'UNK': 1}
42 | words_list = []
43 | for index, row in data.iterrows():
44 | words = row['text']
45 | for word in words:
46 | words_list.append(word)
47 |
48 | random.shuffle(words_list)
49 | for word in words_list:
50 | if word not in vocab:
51 | vocab[word] = len(vocab)
52 | return vocab
53 |
54 |
55 | def make_data(file_path, vocab):
56 | """
57 | 构建数据
58 | :param file_path:
59 | :param vocab
60 | :return:
61 | """
62 | data = pd.read_csv(file_path, sep='\t', header=None)
63 | data.columns = ['text', 'tag']
64 | word_ids = []
65 | tag_ids = []
66 | for index, row in data.iterrows():
67 | tag_str = row['tag']
68 | tags = tag_str.split(' ')
69 | words = row['text']
70 |
71 | word_id = [vocab.get(word) if word in vocab else 1 for word in words]
72 | tag_id = [tag_dict.get(tag) for tag in tags]
73 |
74 | word_ids.append(word_id)
75 | tag_ids.append(tag_id)
76 | print(word_ids[0])
77 | print(tag_ids[0])
78 | return {'words': word_ids, 'tags': tag_ids}
79 |
80 |
81 | def save_vocab(vocab, output):
82 | """
83 | save vocab dict
84 | :param vocab:
85 | :param output:
86 | :return:
87 | """
88 | with open(output, 'w', encoding='utf-8') as fr:
89 | for word in vocab:
90 | fr.write(word + '\t' + str(vocab.get(word)) + '\n')
91 | print('save vocab is ok.')
92 |
93 |
94 | def main(output_path):
95 | """
96 | main method
97 | :param output_path:
98 | :return:
99 | """
100 | data = {}
101 | # 这里是bio形式的数据集,如果不是需要提前转换成bio形式
102 | train_path = './data_path/train.txt'
103 | test_path = './data_path/dev.txt'
104 | vocab = make_vocab(train_path)
105 | train_data = make_data(train_path, vocab)
106 | test_data = make_data(test_path, vocab)
107 |
108 | data['train'] = train_data
109 | data['test'] = test_data
110 |
111 | data_path = os.path.join(output_path, 'clue_data.pkl')
112 | cPickle.dump(data, open(data_path, 'wb'), protocol=2)
113 | print('save data to pkl ok.')
114 |
115 | vocab_path = os.path.join(output_path, 'clue_vocab.txt')
116 | save_vocab(vocab, vocab_path)
117 |
118 |
119 | if __name__ == '__main__':
120 | output = './data_path/'
121 | main(output)
122 |
--------------------------------------------------------------------------------
/text_classification/README.md:
--------------------------------------------------------------------------------
1 | ## 文本分类
2 |
3 | 这里首先介绍一篇基于深度学习的文本分类综述,《Deep Learning Based Text Classification: A Comprehensive Review》,论文来源:https://arxiv.org/abs/2004.03705
4 |
5 | **文本分类简介**:
6 |
7 | 文本分类是NLP中一个非常经典的任务(对给定的句子、查询、段落或者文档打上相应的类别标签)。其应用包括机器问答、垃圾邮件识别、情感分析、新闻分类、用户意图识别等。文本数据的来源也十分广泛,比如网页数据、邮件内容、聊天记录、社交媒体、用户评论等。
8 |
9 | **文本分类三大方法**:
10 |
11 | 1. Rule-based methods:使用预定义的规则进行分类,需要很强的领域知识而且系统很难维护
12 | 2. ML (data-driven) based methods:经典的机器学习方法先通过特征工程(BoW词袋等)提取特征,再使用朴素贝叶斯、SVM、HMM、Gradient Boosting Tree和随机森林等方法进行分类。深度学习方法通常使用的是end2end形式,比如Transformer、Bert等。
13 | 3. Hybrid methods:基于规则和基于机器学习(深度学习)方法的混合
14 |
15 | **文本分类任务**:
16 |
17 | 1. 情感分析(Sentiment Analysis):给定文本,分析用户的观点并且抽取出他们的主要观点。可以是二分类,也可以是多分类任务
18 | 2. 新闻分类(News Categorization):识别新闻主题,并给用户推荐相关的新闻。主要应用于推荐系统
19 | 3. 主题分析(Topic Analysis):给定文本,抽取出其文本的一个或者多个主题
20 | 4. 机器问答(Question Answering):提取式(extractive),给定问题和一堆候选答案,从中识别出正确答案;生成式(generative),给定问题,然后生成答案。(NL2SQL?)
21 | 5. 自然语言推理(Natural Language Inference):文本蕴含任务,预测一个文本是否可以从另一个文本中推断出。一般包括entailment、contradiction和neutral三种关系类型
22 |
23 | **文本分类模型(深度学习)**:
24 |
25 | 1. 基于前馈神经网络(Feed-Forward Neural Networks)
26 | 2. 基于循环神经网络(RNN)
27 | 3. 基于卷积神经网络(CNN)
28 | 4. 基于胶囊神经网络(Capsule networks)
29 | 5. 基于Attention机制
30 | 6. 基于记忆增强网络(Memory-augmented networks)
31 | 7. 基于Transformer机制
32 | 8. 基于图神经网络
33 | 9. 基于孪生神经网络(Siamese Neural Network)
34 | 10. 混合神经网络(Hybrid models)
35 |
36 | 详解见https://blog.csdn.net/u013963380/article/details/106957420(只详细描述了前4种深度学习模型)。
37 |
38 | ## 文本分类数据集
39 |
40 | Deep Learning Based Text Classification: A Comprehensive Review一文中提到了很多的文本分类的数据集,大多数是英文的。
41 |
42 | 下面列出一些中文文本分类数据集:
43 |
44 | | 数据集 | 说明 | 链接 |
45 | | :------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
46 | | THUCNews | THUCNews是根据新浪新闻RSS订阅频道2005~2011年间的历史数据筛选过滤生成。<br>包含财经、彩票、房产、股票、家居、教育等14个类别。<br>原始数据集见:[链接](http://thuctc.thunlp.org/#%E4%B8%AD%E6%96%87%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB%E6%95%B0%E6%8D%AE%E9%9B%86THUCNews) | [下载地址](http://thuctc.thunlp.org/#%E4%B8%AD%E6%96%87%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB%E6%95%B0%E6%8D%AE%E9%9B%86THUCNews) |
47 | | 今日头条 | 来源于今日头条,为短文本分类任务,数据包含15个类别 | [下载地址](https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip) |
48 | | IFLYTEK | 1.7万多条关于app应用描述的长文本标注数据,包含和日常生活相关的各类应用主题,共119个类别 | [下载地址](https://storage.googleapis.com/cluebenchmark/tasks/iflytek_public.zip) |
49 | | 新闻标题 | 数据集来源于Kesci平台,为新闻标题领域短文本分类任务。<br>内容大多为短文本标题(length<50),数据包含15个类别,共38w条样本 | [下载地址](https://pan.baidu.com/s/1vyGSIycsan3YWHEjBod9pw)<br>提取码:lrmv |
50 | | 复旦文本 | 数据集来源于复旦大学,为文本分类任务,数据包含20个类别,共9804篇文档 | [下载地址](https://pan.baidu.com/s/1vyGSIycsan3YWHEjBod9pw)<br>提取码:lrmv |
51 | | OCNLI | 中文原版自然语言推理,是第一个非翻译的、使用原生汉语的大型中文自然语言推理数据集。<br>详细见https://github.com/CLUEbenchmark/OCNLI | [下载地址](https://storage.googleapis.com/cluebenchmark/tasks/ocnli_public.zip) |
52 | | 情感分析 | OCEMOTION–中文情感分类,对应文章https://www.aclweb.org/anthology/L16-1291.pdf<br>原始数据集未找到,只有一部分数据 | [下载地址](https://pan.baidu.com/s/1vyGSIycsan3YWHEjBod9pw)<br>提取码:lrmv |
53 | | 更新ing | ... | ... |
54 |
55 | 还有一些其他的中文文本数据集,可以在CLUE上搜索,CLUE地址:https://www.cluebenchmarks.com/ ,但是下载需要注册账号,有的链接失效,有的限制日下载次数,这里放到百度网盘供下载学习使用。(请勿用于商业目的)
56 |
57 | ## 文本分类Baseline算法实现
58 |
59 | 使用Tensorflow1.x版本Estimator高阶API实现常见文本分类算法,主要包括前馈神经网络(纯全连接层)模型、双向LSTM模型、文本卷积网络(TextCnn)、Transformer。
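
这些模型统一通过 Estimator 的 model_fn 组织训练、评估和预测逻辑,大致套路如下(仅为示意,完整实现见 train_main.py;其中 MyModel 为占位名,实际可替换为 models 目录下的 BiLstmModel、TextCnn 或 FCModel):

```python
import tensorflow as tf

def model_fn(features, labels, mode, params):
    # MyModel 为占位写法,代表 models 目录下的任意一个分类模型
    model = MyModel(params['vocab_size'], params['emb_size'], params['args'])
    logits = model(features['words'], training=(mode == tf.estimator.ModeKeys.TRAIN))

    if mode == tf.estimator.ModeKeys.PREDICT:
        preds = {'class_out': tf.argmax(logits, axis=-1, name='class_out')}
        return tf.estimator.EstimatorSpec(mode, predictions=preds)

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.AdamOptimizer(1e-4).minimize(
            loss, global_step=tf.train.get_or_create_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

    metrics = {'accuracy': tf.metrics.accuracy(labels, tf.argmax(logits, axis=-1))}
    return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)
```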
60 |
61 | 环境信息:
62 |
63 | tensorflow==1.13.1
64 |
65 | python==3.7
66 |
67 | **数据预处理**
68 |
69 | 要求训练集和测试集分开存储(提供划分数据集方法),另外需要对文本进行分词,数据EDA部分可以见示例中的tnews_data_eda.ipynb文件。
70 |
71 | 在训练模型前,需要先运行preprocess.py文件进行数据预处理,将数据处理成id形式并保存为pkl形式,另外中间过程产生的词表也会保存为vocab.txt文件。
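
preprocess.py 处理完成后会在 data_path 下生成 tnews_data.pkl,其大致结构可以用下面的代码快速检查(示意代码,路径以实际生成的文件为准):

```python
import _pickle as cPickle

# pkl 的结构为:{'train': {'words': [...], 'labels': [...]}, 'test': {...}}
data = cPickle.load(open('./data_path/tnews_data.pkl', 'rb'))
print(len(data['train']['words']), len(data['test']['words']))  # 训练/测试样本数
print(data['train']['words'][0])   # 一条样本的词id序列
print(data['train']['labels'][0])  # 对应的标签id
```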
72 |
73 | **文件结构**
74 |
75 | - data_path:数据集存放的位置
76 | - data_utils:数据处理相关的工具类存放位置
77 | - model_ckpt:模型checkpoint保存的位置
78 | - model_pb:pb形式的模型保存的位置
79 | - models:文本分类baseline模型存放的位置,包括BiLstm、TextCnn等
80 | - train_main.py:模型训练主入口
81 | - preprocess.py:数据预处理代码,包括划分数据集、转换文本为id等
82 | - tf_metrics.py:tensorflow1.x版本不支持多分类的指标函数,这里使用的是Guillaume Genthial编写的多分类指标函数,[github地址](https://github.com/guillaumegenthial/tf_metrics)
83 | - inference.py:推理主入口
84 |
85 | **模型训练过程**
86 |
87 | - 首先准备好数据集,放在data_path下,然后运行preprocess.py文件
88 | - 运行train_main.py,具体的模型参数可以在ARGS里面设置,也可以使用 `python train_main.py --train_path='./data_path/emotion_data.pkl'` 这种命令行传参的形式,完整示例见下方命令
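
一个更完整的启动示例如下(参数名与 train_main.py 中 ArgumentParser 的定义对应,取值仅供参考):

```
python train_main.py \
  --train_path='./data_path/tnews_data.pkl' \
  --eval_path='./data_path/tnews_data.pkl' \
  --vocab_size=68000 \
  --num_label=15 \
  --model_name='lstm' \
  --batch_size=16 \
  --epoch=5
```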
89 |
90 | **模型推理**
91 |
92 | - 推理代码在inference.py中
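
其核心流程是加载 model_pb 下导出的 SavedModel,按名称取出输入、输出张量后执行 sess.run,示意如下(与 inference.py 中的实现思路一致,目录名以实际导出的时间戳为准):

```python
import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, ['serve'], './model_pb/1609247078')
    input_p = sess.graph.get_tensor_by_name('input_words:0')  # 输入:词id序列
    out = sess.graph.get_tensor_by_name('class_out:0')        # 输出:预测类别
    pred = sess.run(out, feed_dict={input_p: [[23, 7, 105]]})  # 示例id,实际应来自vocab映射
    print(pred)
```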
93 |
94 | ## 示例
95 |
96 | 下面使用中文任务测评基准(CLUE benchmark)的头条新闻分类数据来进行demo演示:
97 |
98 | 数据集下载地址:https://github.com/CLUEbenchmark/CLUE 中的[TNEWS'数据集下载](https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip)
99 |
100 | 该数据集来自今日头条新闻版块,共15个类别的新闻,包括旅游、教育、金融、军事等。
101 |
102 | ```
103 | 数据量:训练集(53,360),验证集(10,000),测试集(10,000)
104 | 例子:
105 | {"label": "102", "label_des": "news_entertainment", "sentence": "江疏影甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物"}
106 | 每一条数据有三个属性,从前往后分别是 分类ID,分类名称,新闻字符串(仅含标题)。
107 | ```
108 |
109 | **1.数据EDA**
110 |
111 | 数据EDA部分见tnews_data_eda.ipynb,主要是简单分析一下数据集文本的长度分布、类别标签的数量比。然后对文本进行分词,这里使用的是jieba分词工具。分词后将数据集保存到data_path目录下。
112 |
113 | ```
114 | # 各种类别标签的数量分布
115 | 109 5955
116 | 104 5200
117 | 102 4976
118 | 113 4851
119 | 107 4118
120 | 101 4081
121 | 103 3991
122 | 110 3632
123 | 108 3437
124 | 116 3390
125 | 112 3368
126 | 115 2886
127 | 106 2107
128 | 100 1111
129 | 114 257
130 | ```
131 |
132 | **2.设置训练参数**
133 |
134 | 参数设置的时候有几个参数需要根据自己的数据分布来设置:
135 |
136 | - vocab_size:词表大小,一般需要根据自己生成的vocab.txt中词表的大小来设置
137 | - num_label:类别标签的数量
138 | - train_path/eval_path:数据集的路径
139 | - weights权重设置:根据数据EDA中的类别标签分布,设置weights=[0.9,0.9,0.9,0.9,1,1,1,1,1,1,1,1,1,1.2,1.5],后面几个类别的数量明显偏少,权重设置大一点。具体数值可根据自己的分析来定,权重在损失计算中的用法见下方示意代码
140 |
141 | 其他的参数视个人情况而定
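
上面提到的 weights 的一种常见用法,是在计算交叉熵损失时按样本标签加权,示意如下(仅为思路示意,并非 train_main.py 的原样实现):

```python
import tensorflow as tf

# 15个类别的权重,顺序与 label id 对应,数值即上面 weights 的设置
class_weights = tf.constant(
    [0.9, 0.9, 0.9, 0.9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1.2, 1.5], dtype=tf.float32)

def weighted_loss(logits, labels):
    # 按每条样本的标签取出对应权重,再做加权的 sparse softmax 交叉熵
    sample_weights = tf.gather(class_weights, labels)
    return tf.losses.sparse_softmax_cross_entropy(
        labels=labels, logits=logits, weights=sample_weights)
```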
142 |
143 | **3.模型训练并保存模型**
144 |
145 | 这里使用的是BiLstm模型。
146 |
147 | 代码中保存了两种模型形式,一种是checkpoint,另一种是pb格式
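
pb 格式通过 Estimator 的 export_savedmodel 导出,导出时需要先定义 serving 的输入,示意如下(与 ner_main.py 中的导出方式同一思路;classifier 指训练完成的 Estimator 实例):

```python
import tensorflow as tf

words = tf.placeholder(tf.int64, [None, None], name='input_words')
serving_input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({'words': words})
# classifier 为训练完成的 tf.estimator.Estimator 实例(此处仅为示意)
classifier.export_savedmodel('./model_pb', serving_input_fn)
```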
148 |
149 | **4.开始预测并提交结果**
150 |
151 | 预测代码见inference.py,最后在CLUE上提交的结果是50.92([ALBERT-xxlarge](https://github.com/google-research/albert):59.46,目前[UER-ensemble](https://github.com/dbiir/UER-py):72.20)
152 |
153 | ## 中文文本分类比赛OR评测
154 |
155 | 1.[零基础入门NLP-新闻文本分类](https://tianchi.aliyun.com/competition/entrance/531810/introduction?spm=5176.12281973.1005.4.3dd52448KQuWQe)(DataWhale和天池举办的学习赛)
156 |
157 | 2.[中文CLUE的各种分类任务的评测](https://www.cluebenchmarks.com/)
--------------------------------------------------------------------------------
/text_classification/data_path/README.md:
--------------------------------------------------------------------------------
1 | ### 文件说明
2 |
3 | 这里存放的是数据集
--------------------------------------------------------------------------------
/text_classification/data_path/tnews_data.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xudongMk/AwesomeNLPBaseline/83774119d8b3a68b062df25f0e8775970fab8079/text_classification/data_path/tnews_data.pkl
--------------------------------------------------------------------------------
/text_classification/inference.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/28 21:45
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : inference.py
6 | # @Software: PyCharm
7 |
8 | import tensorflow as tf
9 | import tqdm
10 | import json
11 | import jieba
12 |
13 | """
14 | 文本分类推理代码
15 | """
16 |
17 | # 设置filter
18 | filter = './??;。(())【】{}[]!!,,<>《》+'
19 | # 加载词典
20 | word_dict = {}
21 | with open('./data_path/vocab.txt', encoding='utf-8') as fr:
22 | lines = fr.readlines()
23 | for line in lines:
24 | word = line.split('\t')[0]
25 | id = int(line.split('\t')[1])  # 转成int,避免把'5\n'这类字符串feed给int64的placeholder
26 | word_dict[word] = id
27 | print(word_dict)
28 |
29 | # label dict的设置
30 | label_id = {0: 109, 1: 104, 2: 102, 3: 113,
31 | 4: 107, 5: 101, 6: 103, 7: 110,
32 | 8: 108, 9: 116, 10: 112, 11: 115,
33 | 12: 106, 13: 100, 14: 114}
34 | label_desc = {100: "news_story", 101: "news_culture", 102: "news_entertainment",
35 | 103: "news_sports", 104: "news_finance", 106: "news_house",
36 | 107: "news_car", 108: "news_edu", 109: "news_tech",
37 | 110: "news_military", 112: "news_travel", 113: "news_world",
38 | 114: "news_stock", 115: "news_agriculture", 116: "news_game"}
39 |
40 |
41 | def cut_with_jieba(text, filter=None):
42 | """ 使用jieba切分句子 """
43 | if filter:
44 | for c in filter:
45 | text = text.replace(c, '')
46 | words = ['Number' if word.isdigit() else word for word in jieba.cut(text)]
47 | return words
48 |
49 |
50 | def words_to_ids(words, word_dict):
51 | """ 将words 转换成ids形式 """
52 | ids = [word_dict.get(word, 1) for word in words]
53 | return ids
54 |
55 |
56 | def predict_main(test_file, out_path):
57 | """ 预测主入口 """
58 | model_path = './model_pb/1609247078'
59 | with tf.Session(graph=tf.Graph()) as sess:
60 | model = tf.saved_model.loader.load(sess, ['serve'], model_path)
61 | # print(model)
62 | out = sess.graph.get_tensor_by_name('class_out:0')
63 | input_p = sess.graph.get_tensor_by_name('input_words:0')
64 |
65 | with open(test_file, encoding='utf-8') as fr:
66 | lines = fr.readlines()
67 | res_list = []
68 | for line in tqdm.tqdm(lines):
69 | json_str = json.loads(line)
70 | id = json_str['id']
71 | sentence = json_str['sentence']
72 |
73 | words = cut_with_jieba(str(sentence), filter)
74 | if len(words) < 1:
75 | print(f'sample {id} is empty after preprocessing!')
76 | text_features = words_to_ids(words, word_dict)
77 | feed = {input_p: [text_features]}
78 | score = sess.run(out, feed_dict=feed)
79 |
80 | label = label_id.get(score[0])
81 | label_d = label_desc.get(label)
82 |
83 | res_list.append(
84 | json.dumps({"id": id, "label": str(label), "label_desc": label_d}))
85 | # 写入到文件
86 | with open(out_path, 'w', encoding='utf-8') as fr:
87 | for res in res_list:
88 | fr.write(res)
89 | fr.write('\n')
90 | print('predict and write to file over!!!')
91 |
92 |
93 | if __name__ == '__main__':
94 | test_file = './data_path/test.json'
95 | out_path = './data_path/tnews_predict.json'
96 | predict_main(test_file, out_path)
97 |
--------------------------------------------------------------------------------
/text_classification/model_ckpt/README.md:
--------------------------------------------------------------------------------
1 | ### 文件说明
2 |
3 | 这里保存训练后的checkpoint文件
--------------------------------------------------------------------------------
/text_classification/model_pb/README.md:
--------------------------------------------------------------------------------
1 | ### 文件说明
2 |
3 | 这里保存训练后的pb模型文件
--------------------------------------------------------------------------------
/text_classification/models/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/10 21:34
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : __init__.py.py
6 | # @Software: PyCharm
7 |
8 |
--------------------------------------------------------------------------------
/text_classification/models/attention.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/10 21:51
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : attention.py
6 | # @Software: PyCharm
7 |
8 | import tensorflow as tf
9 | from .base_model import Linear
10 |
11 |
12 | class Attention:
13 | """
14 | the attention
15 | """
16 | def __init__(self, scope_name, hidden_size, num_heads, dropout):
17 | if hidden_size % num_heads != 0:
18 | raise ValueError('hidden_size must be divisible by num_heads!')
19 |
20 | self.hidden_size = hidden_size
21 | self.num_heads = num_heads
22 |
23 | self.q_layer = Linear(f'{scope_name}_q', hidden_size, hidden_size, bias=False)
24 | self.k_layer = Linear(f'{scope_name}_k', hidden_size, hidden_size, bias=False)
25 | self.v_layer = Linear(f'{scope_name}_v', hidden_size, hidden_size, bias=False)
26 |
27 | self.out_layer = Linear(f'{scope_name}_output', hidden_size,
28 | hidden_size, bias=False)
29 | self.dropout = tf.layers.Dropout(dropout)
30 |
31 | def split_heads(self, x):
32 | """ split the heads """
33 | with tf.name_scope('split_heads'):
34 | batch_size = tf.shape(x)[0]
35 | length = tf.shape(x)[1]
36 |
37 | depth = self.hidden_size // self.num_heads
38 |
39 | x = tf.reshape(x, [batch_size, length, self.num_heads, depth])
40 |
41 | return tf.transpose(x, [0, 2, 1, 3])
42 |
43 | def combine_heads(self, x):
44 | """ combine the heads """
45 | with tf.name_scope('combine_heads'):
46 | batch_size = tf.shape(x)[0]
47 | length = tf.shape(x)[2]
48 |
49 | x = tf.transpose(x, [0, 2, 1, 3])  # batch, length, heads, depth
50 | return tf.reshape(x, [batch_size, length, self.hidden_size])
51 |
52 | def call(self, x, y, training, bias, cache=None):
53 | q = self.q_layer(x, training)
54 | k = self.k_layer(y, training)
55 | v = self.v_layer(y, training)
56 |
57 | if cache:
58 | k = tf.concat([cache['k'], k], axis=1)
59 | v = tf.concat([cache['v'], v], axis=1)
60 |
61 | cache['k'] = k
62 | cache['v'] = v
63 |
64 | q = self.split_heads(q)
65 | k = self.split_heads(k)
66 | v = self.split_heads(v)
67 |
68 | depth = self.hidden_size // self.num_heads
69 | q *= depth ** -0.5
70 |
71 | # calculate dot product attention
72 | logits = tf.matmul(q, k, transpose_b=True)
73 | logits += bias
74 | weights = tf.nn.softmax(logits)
75 | weights = self.dropout(weights, training=training)
76 | attention_output = tf.matmul(weights, v)
77 |
78 | attention_output = self.combine_heads(attention_output)
79 |
80 | attention_output = self.out_layer(attention_output, training)
81 | return attention_output
82 |
83 |
84 | class SelfAttention(Attention):
85 | def __call__(self, x, training, bias, cache=None):
86 | return super(SelfAttention, self).call(x, x, training, bias, cache)
--------------------------------------------------------------------------------
/text_classification/models/base_model.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/11/10 21:34
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : base_model.py
6 | # @Software: PyCharm
7 |
8 | import tensorflow as tf
9 |
10 | from tensorflow.contrib.rnn import LSTMCell
11 | from tensorflow.contrib.rnn import MultiRNNCell
12 | from tensorflow.contrib.rnn import GRUCell
13 | from tensorflow.contrib.rnn import BasicRNNCell
14 |
15 |
16 | class Linear:
17 | """
18 | 线性层,全连接层
19 | """
20 | def __init__(self, scope_name, input_size, output_sizes, bias=True,
21 | activator='', drop_out=0., reuse=False, trainable=True):
22 | self.input_size = input_size
23 |
24 | # todo 判断 output_sizes 是不是列表
25 | if not isinstance(output_sizes, list):
26 | output_sizes = [output_sizes]
27 |
28 | self.output_size = output_sizes[-1]
29 |
30 | self.W = []
31 | self.b = []
32 | size = input_size
33 | with tf.variable_scope(scope_name, reuse=reuse):
34 | for i, output_size in enumerate(output_sizes):
35 | W = tf.get_variable(
36 | 'W{0}'.format(i), [size, output_size],
37 | initializer=tf.random_uniform_initializer(-0.25, 0.25),
38 | trainable=trainable
39 | )
40 | if bias:
41 | b = tf.get_variable(
42 | 'b{0}'.format(i), [output_size],
43 | initializer=tf.zeros_initializer(),
44 | trainable=trainable
45 | )
46 | else:
47 | b = None
48 |
49 | self.W.append(W)
50 | self.b.append(b)
51 | size = output_size
52 |
53 | if activator == 'relu':
54 | self.activator = tf.nn.relu
55 | elif activator == 'relu6':
56 | self.activator = tf.nn.relu6
57 | elif activator == 'tanh':
58 | self.activator = tf.nn.tanh
59 | else:
60 | self.activator = tf.identity
61 |
62 | self.drop_out = tf.layers.Dropout(drop_out)
63 |
64 | def __call__(self, input, training):
65 | size = tf.shape(input)
66 | input_trans = tf.reshape(input, [-1, size[-1]])
67 | for W, b in zip(self.W, self.b):
68 | if b is not None:
69 | input_trans = tf.nn.xw_plus_b(input_trans, W, b)
70 | else:
71 | input_trans = tf.matmul(input_trans, W)
72 |
73 | input_trans = self.drop_out(input_trans, training)
74 | input_trans = self.activator(input_trans)
75 |
76 | new_size = tf.concat([size[:-1], tf.constant([self.output_size])], 0)
77 | input_trans = tf.reshape(input_trans, new_size)
78 | return input_trans
79 |
80 |
81 | class LookupTable:
82 | """
83 | embedding层
84 | """
85 | def __init__(self, scope_name, vocab_size, embed_size, reuse=False, trainable=True):
86 | self.vocab_size = vocab_size
87 | self.embed_size = embed_size
88 |
89 | with tf.variable_scope(scope_name, reuse=bool(reuse)):
90 | self.embedding = tf.get_variable(
91 | 'embedding', [vocab_size, embed_size],
92 | initializer=tf.random_uniform_initializer(-0.25, 0.25),
93 | trainable=trainable
94 | )
95 |
96 | def __call__(self, input):
97 | input = tf.where(tf.less(input, self.vocab_size), input, tf.ones_like(input))
98 | return tf.nn.embedding_lookup(self.embedding, input)
99 |
100 |
101 | class AttentionPooling:
102 | """
103 | attention pooling层
104 | """
105 | def __init__(self, scope_name, input_size, hidden_size, reuse=False,
106 | trainable=True):
107 | name = scope_name
108 | self.linear1 = Linear(f'{name}_linear1', input_size,
109 | hidden_size, bias=False, reuse=reuse,
110 | trainable=trainable)
111 | self.linear2 = Linear(f'{name}_linear2', hidden_size, 1,
112 | bias=False, reuse=reuse, trainable=trainable)
113 |
114 | def __call__(self, input, mask, training):
115 | output_linear1 = self.linear1(input, training)
116 | output_linear2 = self.linear2(output_linear1, training)
117 | weights = tf.squeeze(output_linear2, [-1])
118 | if mask is not None:
119 | weights += mask
120 | weights = tf.nn.softmax(weights, -1)
121 | return tf.reduce_sum(input * tf.expand_dims(weights, -1), axis=1)
122 |
123 |
124 | class LayerNormalization:
125 | """
126 | 归一化层
127 | """
128 | def __init__(self, scope_name, hidden_size):
129 | with tf.variable_scope(scope_name):
130 | self.scale = tf.get_variable('layer_norm_scale', [hidden_size],
131 | initializer=tf.ones_initializer())
132 | self.bias = tf.get_variable('layer_norm_bias', [hidden_size],
133 | initializer=tf.zeros_initializer())
134 |
135 | def __call__(self, x, epsilon=1e-6):
136 | mean, variance = tf.nn.moments(x, -1, keep_dims=True)
137 | norm_x = (x - mean) * tf.rsqrt(variance + epsilon)
138 | return norm_x * self.scale + self.bias
139 |
140 |
141 | class LstmBase:
142 | """
143 | RNN的基础层
144 | """
145 | def build_rnn(self, rnn_type, hidden_size, num_layes):
146 | cells = []
147 | for i in range(num_layes):
148 | if rnn_type == 'lstm':
149 | cell = LSTMCell(num_units=hidden_size,
150 | state_is_tuple=True,
151 | initializer=tf.random_uniform_initializer(-0.25, 0.25))
152 | elif rnn_type == 'gru':
153 | cell = GRUCell(num_units=hidden_size)
154 | elif rnn_type == 'rnn':
155 | cell = BasicRNNCell(num_units=hidden_size)
156 | else:
157 | raise NotImplementedError(f'unknown rnn type: {rnn_type}')
158 | cells.append(cell)
159 |
160 | cells = MultiRNNCell(cells, state_is_tuple=True)
161 |
162 | return cells
163 |
164 |
165 | class BiLstm(LstmBase):
166 | """
167 | 双向LSTM层
168 | """
169 | def __init__(self, scope_name, hidden_size, num_layers):
170 | super(BiLstm, self).__init__()
171 | assert hidden_size % 2 == 0
172 | hidden_size //= 2  # 每个方向的num_units为总hidden size的一半,需保持整数
173 |
174 | self.fw_rnns = []
175 | self.bw_rnns = []
176 | for i in range(num_layers):
177 | self.fw_rnns.append(self.build_rnn('lstm', hidden_size, 1))
178 | self.bw_rnns.append(self.build_rnn('lstm', hidden_size, 1))
179 |
180 | self.scope_name = scope_name
181 |
182 | def __call__(self, input, input_len):
183 | for idx, (fw_rnn, bw_rnn) in enumerate(zip(self.fw_rnns, self.bw_rnns)):
184 | scope_name = '{}_{}'.format(self.scope_name, idx)
185 | ctx, _ = tf.nn.bidirectional_dynamic_rnn(
186 | fw_rnn, bw_rnn, input, sequence_length=input_len,
187 | dtype=tf.float32, time_major=False,
188 | scope=scope_name
189 | )
190 | input = tf.concat(ctx, -1)
191 | ctx = input
192 | return ctx
193 |
194 |
195 | class Cnn:
196 | """
197 | define cnn
198 | """
199 | def __init__(self, scope_name, input_size, hidden_size):
200 | kws = [3]
201 | self.conv_ws = []
202 | self.conv_bs = []
203 | for idx, kw in enumerate(kws):
204 | w = tf.get_variable(
205 | f"conv_w_{idx}",
206 | [kw, input_size, hidden_size],
207 | initializer=tf.random_uniform_initializer(-0.1, 0.1)
208 | )
209 | b = tf.get_variable(
210 | f"conv_b_{idx}",
211 | [hidden_size],
212 | initializer=tf.zeros_initializer()
213 | )
214 | self.conv_ws.append(w)
215 | self.conv_bs.append(b)
216 |
217 | def __call__(self, input, mask):
218 | outputs = []
219 | for conv_w, conv_b in zip(self.conv_ws, self.conv_bs):
220 | conv = tf.nn.conv1d(input, conv_w, 1, 'SAME')
221 | conv = tf.nn.bias_add(conv, conv_b)
222 | if mask is not None:
223 | conv += tf.expand_dims(mask, -1)
224 | outputs.append(conv)
225 | output = tf.concat(outputs, -1)
226 | return output
227 |
--------------------------------------------------------------------------------
/text_classification/models/bilstm_model.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020-12-11 12:37
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : bilstm_model.py
6 | # @Software: PyCharm
7 | import tensorflow as tf
8 |
9 | from .base_model import LookupTable
10 | from .base_model import BiLstm
11 | from .base_model import Linear
12 |
13 |
14 | class BiLstmModel:
15 | """
16 | BiLstm模型的实现:
17 | 主要包含:embedding层、rnn层、池化层、两层全连接层和一个Dropout层
18 | """
19 | def __init__(self, vocab_size, emb_size, args):
20 |
21 | # embedding
22 | scope_name = 'look_up'
23 | self.lookuptables = LookupTable(scope_name, vocab_size, emb_size)
24 |
25 | # rnn
26 | scope_name = 'bi_lstm'
27 | # rnn的层数 这里设置为1
28 | num_layers = 1
29 | self.rnn = BiLstm(scope_name, args.hidden_size, num_layers)
30 |
31 | # linear1
32 | scope_name = 'linear1'
33 | self.linear1 = Linear(scope_name, args.hidden_size, args.fc_layer_size,
34 | activator=args.activator)
35 |
36 | # logits out
37 | scope_name = 'linear2'
38 | self.linear2 = Linear(scope_name, args.fc_layer_size, args.num_label)
39 |
40 | self.dropout = tf.layers.Dropout(args.drop_out)
41 |
42 | def max_pool(inputs):
43 | return tf.reduce_max(inputs, 1)
44 |
45 | def mean_pool(inputs):
46 | return tf.reduce_mean(inputs, 1)
47 |
48 | if args.pool == 'max':
49 | self.pool = max_pool
50 | else:
51 | self.pool = mean_pool
52 |
53 | def __call__(self, inputs, training):
54 | masks = tf.sign(inputs)
55 | sent_len = tf.reduce_sum(masks, axis=1)
56 |
57 | embedding = self.lookuptables(inputs)
58 |
59 | rnn_out = self.rnn(embedding, sent_len)
60 | pool_out = self.pool(rnn_out)
61 | linear_out = self.linear1(pool_out, training)
62 | # dropout
63 | linear_out = self.dropout(linear_out, training)
64 | # linear
65 | output = self.linear2(linear_out, training)
66 | return output
--------------------------------------------------------------------------------
/text_classification/models/ffnn_model.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020-12-11 19:05
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : ffnn_model.py
6 | # @Software: PyCharm
7 |
8 | import tensorflow as tf
9 |
10 | from .base_model import LookupTable
11 | from .base_model import Linear
12 |
13 |
14 | class FCModel:
15 | """
16 | 前馈网络
17 | 主要包括:embedding层、两个全连接层和一个dropout层
18 | """
19 | def __init__(self, vocab_size, emb_size, args):
20 |
21 | # embedding
22 | scope_name = 'look_up'
23 | self.lookuptables = LookupTable(scope_name, vocab_size, emb_size)
24 |
25 | fc_layer_size = args.fc_layer_size # 全链接层的size
26 | scope_name = 'linear1'
27 | self.linear1 = Linear(scope_name, emb_size, fc_layer_size, activator=args.activator)  # activator需用关键字传参,否则会被当作bias参数
28 |
29 | scope_name = 'linear2'
30 | self.linear2 = Linear(scope_name, fc_layer_size, args.num_label)
31 |
32 | self.dropout = tf.layers.Dropout(args.drop_out)
33 |
34 | def __call__(self, inputs, training):
35 | embedding = self.lookuptables(inputs)
36 | pool_out = tf.reduce_mean(embedding, 1)
37 | pool_out = self.dropout(pool_out, training)
38 | pool_out = self.linear1(pool_out, training)
39 | pool_out = self.dropout(pool_out, training)
40 | output = self.linear2(pool_out, training)
41 |
42 | return output
43 |
--------------------------------------------------------------------------------
/text_classification/models/model_utils.py:
--------------------------------------------------------------------------------
1 | """Transformer model helper methods."""
2 |
3 | import math
4 |
5 | import numpy as np
6 | import tensorflow as tf
7 |
8 | _NEG_INF_FP32 = -1e9
9 | _NEG_INF_FP16 = np.finfo(np.float16).min
10 |
11 |
12 | def get_position_encoding(length,
13 | hidden_size,
14 | min_timescale=1.0,
15 | max_timescale=1.0e4):
16 | """Return positional encoding.
17 | Calculates the position encoding as a mix of sine and cosine functions with
18 | geometrically increasing wavelengths.
19 | Defined and formulized in Attention is All You Need, section 3.5.
20 | Args:
21 | length: Sequence length.
22 | hidden_size: Size of the positional encoding (i.e. the model hidden size).
23 | min_timescale: Minimum scale that will be applied at each position
24 | max_timescale: Maximum scale that will be applied at each position
25 | Returns:
26 | Tensor with shape [length, hidden_size]
27 | """
28 | # We compute the positional encoding in float32 even if the model uses
29 | # float16, as many of the ops used, like log and exp, are numerically unstable
30 | # in float16.
31 | position = tf.cast(tf.range(length), tf.float32)
32 | num_timescales = hidden_size // 2
33 | log_timescale_increment = (
34 | math.log(float(max_timescale) / float(min_timescale)) /
35 | (tf.cast(num_timescales, tf.float32) - 1))
36 | inv_timescales = min_timescale * tf.exp(
37 | tf.cast(tf.range(num_timescales), tf.float32) * -log_timescale_increment)
38 | scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)
39 | signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)
40 | return signal
41 |
42 |
43 | def get_decoder_self_attention_bias(length, dtype=tf.float32):
44 | """Calculate bias for decoder that maintains model's autoregressive property.
45 | Creates a tensor that masks out locations that correspond to illegal
46 | connections, so prediction at position i cannot draw information from future
47 | positions.
48 | Args:
49 | length: int length of sequences in batch.
50 | dtype: The dtype of the return value.
51 | Returns:
52 | float tensor of shape [1, 1, length, length]
53 | """
54 | neg_inf = _NEG_INF_FP16 if dtype == tf.float16 else _NEG_INF_FP32
55 | with tf.name_scope("decoder_self_attention_bias"):
56 | valid_locs = tf.linalg.band_part(
57 | tf.ones([length, length], dtype=dtype), -1, 0)
58 | valid_locs = tf.reshape(valid_locs, [1, 1, length, length])
59 | decoder_bias = neg_inf * (1.0 - valid_locs)
60 | return decoder_bias
61 |
62 |
63 | def get_padding(x, padding_value=0, dtype=tf.float32):
64 | """Return float tensor representing the padding values in x.
65 | Args:
66 | x: int tensor with any shape
67 | padding_value: int which represents padded values in input
68 | dtype: The dtype of the return value.
69 | Returns:
70 | float tensor with same shape as x containing values 0 or 1.
71 | 0 -> non-padding, 1 -> padding
72 | """
73 | with tf.name_scope("padding"):
74 | return tf.cast(tf.equal(x, padding_value), dtype)
75 |
76 |
77 | def get_padding_bias(x, padding_value=0, dtype=tf.float32):
78 | """Calculate bias tensor from padding values in tensor.
79 | Bias tensor that is added to the pre-softmax multi-headed attention logits,
80 | which has shape [batch_size, num_heads, length, length]. The tensor is zero at
81 | non-padding locations, and -1e9 (negative infinity) at padding locations.
82 | Args:
83 | x: int tensor with shape [batch_size, length]
84 | padding_value: int which represents padded values in input
85 | dtype: The dtype of the return value
86 | Returns:
87 | Attention bias tensor of shape [batch_size, 1, 1, length].
88 | """
89 | with tf.name_scope("attention_bias"):
90 | padding = get_padding(x, padding_value, dtype)
91 | attention_bias = padding * _NEG_INF_FP32
92 | attention_bias = tf.expand_dims(
93 | tf.expand_dims(attention_bias, axis=1), axis=1)
94 | return attention_bias
95 |
--------------------------------------------------------------------------------
/text_classification/models/text_cnn.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020-12-11 19:19
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : text_cnn.py
6 | # @Software: PyCharm
7 |
8 | import tensorflow as tf
9 |
10 | from .base_model import LookupTable
11 | from .base_model import BiLstm
12 | from .base_model import Linear
13 |
14 |
15 | class TextCnn:
16 | """
17 | text cnn model
18 | 主要包括:embedding层、三个不同size卷积核层、两个全连接层和dropout层
19 | """
20 | def __init__(self, vocab_size, emb_size, args):
21 |
22 | # embedding
23 | scope_name = 'look_up'
24 | self.lookuptables = LookupTable(scope_name, vocab_size, emb_size)
25 |
26 | # 三个卷积核
27 | kws = [2, 3, 5]
28 | self.conv_ws = []
29 | self.conv_bs = []
30 |
31 | # the num of filter 卷积核的数量
32 | filter_num = args.filter_num
33 | for idx, kw in enumerate(kws):
34 | w = tf.get_variable(
35 | f"conv_w_{idx}",
36 | [kw, emb_size, filter_num],
37 | initializer=tf.random_uniform_initializer(-0.25, 0.25)
38 | )
39 | b = tf.get_variable(
40 | f"conv_b_{idx}",
41 | [filter_num],
42 | initializer=tf.random_uniform_initializer(-0.25, 0.25)
43 | )
44 | self.conv_ws.append(w)
45 | self.conv_bs.append(b)
46 |
47 | scope_name = 'linear1'
48 | self.linear1 = Linear(scope_name, len(kws) * filter_num,
49 | args.fc_layer_size, activator=args.activator)
50 |
51 | scope_name = 'linear2'
52 | self.linear2 = Linear(scope_name, args.fc_layer_size, args.num_label)
53 |
54 | self.dropout = tf.layers.Dropout(args.drop_out)
55 |
56 | def __call__(self, inputs, training):
57 | embedding = self.lookuptables(inputs)
58 |
59 | outputs = []
60 | for conv_w, conv_b in zip(self.conv_ws, self.conv_bs):
61 | conv = tf.nn.conv1d(embedding, conv_w, 1, 'SAME')
62 | conv = tf.nn.bias_add(conv, conv_b)
63 | pool = tf.reduce_max(conv, axis=1)
64 | outputs.append(pool)
65 | output = tf.concat(outputs, -1)
66 | output = self.linear1(output, training)
67 | output = self.dropout(output, training)
68 | output = self.linear2(output, training)
69 | return output
--------------------------------------------------------------------------------
/text_classification/preprocess.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020-10-11 18:52
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : preprocess.py
6 | # @Software: PyCharm
7 |
8 | import os
9 | import _pickle as pickle
10 | import pandas as pd
11 | import random
12 |
13 | from sklearn.model_selection import train_test_split
14 |
15 | """
16 | 数据预处理
17 | 将数据处理成id,并封装成pkl形式
18 | """
19 |
20 | # 可以人为自定义label dict
21 | label_dict_default = {109: 0, 104: 1, 102: 2, 113: 3,
22 | 107: 4, 101: 5, 103: 6, 110: 7,
23 | 108: 8, 116: 9, 112: 10, 115: 11,
24 | 106: 12, 100: 13, 114: 14}
25 |
26 |
27 | def make_vocab(file_path):
28 | """
29 | 构建词典和label映射词典
30 | :param file_path:
31 | :return:
32 | """
33 | data = pd.read_csv(file_path, sep='\t')
34 | vocab = {'PAD': 0, 'UNK': 1}
35 | words_list = []
36 | for index, row in data.iterrows():
37 | label = row['label']
38 | words = row['words'].split(' ')
39 | for word in words:
40 | words_list.append(word)
41 | random.shuffle(words_list)
42 | for word in words_list:
43 | if word not in vocab:
44 | vocab[word] = len(vocab)
45 | # save to file and print the label dict
46 | save_path = './data_path/vocab.txt'
47 | save_vocab(vocab, save_path)
48 | print(f'the vocab size is {len(vocab)}')
49 | return vocab
50 |
51 |
52 | def make_data(file_path, vocab, type):
53 | """
54 | 构建数据
55 | :param file_path:
56 | :param vocab
57 | :return:
58 | """
59 | data = pd.read_csv(file_path, sep='\t')
60 | word_ids = []
61 | label_ids = []
62 | for index, row in data.iterrows():
63 | label = row['label']
64 | words = row['words'].split(' ')
65 | word_id_temp = [vocab.get(word) if word in vocab else 1 for word in words]
66 | word_ids.append(word_id_temp)
67 | label_ids.append(label_dict_default.get(label))
68 |
69 | print(f'the {type} data size is {len(word_ids)}')
70 | print(word_ids[0])
71 | print(label_ids[0])
72 |
73 | return {'words': word_ids, 'labels': label_ids}
74 |
75 |
76 | def save_vocab(vocab, output):
77 | """
78 | 保存vocab到本地文件
79 | :param vocab:
80 | :param output:
81 | :return:
82 | """
83 | with open(output, 'w', encoding='utf-8') as fr:
84 | for word in vocab:
85 | fr.write(word + '\t' + str(vocab.get(word)) + '\n')
86 | print('save vocab is ok.')
87 |
88 |
89 | def main(output_path):
90 | """
91 | main method
92 | :param output_path:
93 | :return:
94 | """
95 | data = {}
96 | train_path = './data_path/train_data.csv'
97 | test_path = './data_path/dev_data.csv'
98 | vocab = make_vocab(train_path)
99 | train_data = make_data(train_path, vocab, 'train')
100 | test_data = make_data(test_path, vocab, 'test')
101 |
102 | data['train'] = train_data
103 | data['test'] = test_data
104 |
105 | data_path = os.path.join(output_path, 'tnews_data.pkl')
106 | pickle.dump(data, open(data_path, 'wb'), protocol=2)
107 | print('save data to pkl over.')
108 |
109 |
110 | def split_data(file_path, output):
111 | """
112 | 划分数据集
113 | :param file_path:
114 | :param output:
115 | :return:
116 | """
117 | all_data = pd.read_csv(file_path, sep='\t', header=None)
118 | all_data.columns = ['id', 'texta', 'textb', 'label']
119 | train_data, test_data = train_test_split(all_data, stratify=all_data['label'],
120 | test_size=0.2, shuffle=True,
121 | random_state=42)
122 | print(train_data)
123 | print(test_data)
124 | train_path = os.path.join(output, 'train_nli.csv')
125 | test_path = os.path.join(output, 'dev_nli.csv')
126 | train_data.to_csv(train_path, sep='\t', header=False, index=False)
127 | test_data.to_csv(test_path, sep='\t', header=False, index=False)
128 | print(f'split data train size={len(train_data)} test size={len(test_data)}')
129 |
130 |
131 | if __name__ == '__main__':
132 | output_path = './data_path'
133 | main(output_path)
134 |
--------------------------------------------------------------------------------
/text_classification/tf_metrics.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | """Multiclass"""
4 |
5 | __author__ = "Guillaume Genthial"
6 |
7 | import numpy as np
8 | import tensorflow as tf
9 | from tensorflow.python.ops.metrics_impl import _streaming_confusion_matrix
10 |
11 |
12 | def precision(labels, predictions, num_classes, pos_indices=None,
13 | weights=None, average='micro'):
14 | """Multi-class precision metric for Tensorflow
15 | Parameters
16 | ----------
17 | labels : Tensor of tf.int32 or tf.int64
18 | The true labels
19 | predictions : Tensor of tf.int32 or tf.int64
20 | The predictions, same shape as labels
21 | num_classes : int
22 | The number of classes
23 | pos_indices : list of int, optional
24 | The indices of the positive classes, default is all
25 | weights : Tensor of tf.int32, optional
26 | Mask, must be of compatible shape with labels
27 | average : str, optional
28 | 'micro': counts the total number of true positives, false
29 | positives, and false negatives for the classes in
30 | `pos_indices` and infer the metric from it.
31 | 'macro': will compute the metric separately for each class in
32 | `pos_indices` and average. Will not account for class
33 | imbalance.
34 | 'weighted': will compute the metric separately for each class in
35 | `pos_indices` and perform a weighted average by the total
36 | number of true labels for each class.
37 | Returns
38 | -------
39 | tuple of (scalar float Tensor, update_op)
40 | """
41 | cm, op = _streaming_confusion_matrix(
42 | labels, predictions, num_classes, weights)
43 | pr, _, _ = metrics_from_confusion_matrix(
44 | cm, pos_indices, average=average)
45 | op, _, _ = metrics_from_confusion_matrix(
46 | op, pos_indices, average=average)
47 | return (pr, op)
48 |
49 |
50 | def recall(labels, predictions, num_classes, pos_indices=None, weights=None,
51 | average='micro'):
52 | """Multi-class recall metric for Tensorflow
53 | Parameters
54 | ----------
55 | labels : Tensor of tf.int32 or tf.int64
56 | The true labels
57 | predictions : Tensor of tf.int32 or tf.int64
58 | The predictions, same shape as labels
59 | num_classes : int
60 | The number of classes
61 | pos_indices : list of int, optional
62 | The indices of the positive classes, default is all
63 | weights : Tensor of tf.int32, optional
64 | Mask, must be of compatible shape with labels
65 | average : str, optional
66 | 'micro': counts the total number of true positives, false
67 | positives, and false negatives for the classes in
68 | `pos_indices` and infer the metric from it.
69 | 'macro': will compute the metric separately for each class in
70 | `pos_indices` and average. Will not account for class
71 | imbalance.
72 | 'weighted': will compute the metric separately for each class in
73 | `pos_indices` and perform a weighted average by the total
74 | number of true labels for each class.
75 | Returns
76 | -------
77 | tuple of (scalar float Tensor, update_op)
78 | """
79 | cm, op = _streaming_confusion_matrix(
80 | labels, predictions, num_classes, weights)
81 | _, re, _ = metrics_from_confusion_matrix(
82 | cm, pos_indices, average=average)
83 | _, op, _ = metrics_from_confusion_matrix(
84 | op, pos_indices, average=average)
85 | return (re, op)
86 |
87 |
88 | def f1(labels, predictions, num_classes, pos_indices=None, weights=None,
89 | average='micro'):
90 | return fbeta(labels, predictions, num_classes, pos_indices, weights,
91 | average)
92 |
93 |
94 | def fbeta(labels, predictions, num_classes, pos_indices=None, weights=None,
95 | average='micro', beta=1):
96 | """Multi-class fbeta metric for Tensorflow
97 | Parameters
98 | ----------
99 | labels : Tensor of tf.int32 or tf.int64
100 | The true labels
101 | predictions : Tensor of tf.int32 or tf.int64
102 | The predictions, same shape as labels
103 | num_classes : int
104 | The number of classes
105 | pos_indices : list of int, optional
106 | The indices of the positive classes, default is all
107 | weights : Tensor of tf.int32, optional
108 | Mask, must be of compatible shape with labels
109 | average : str, optional
110 | 'micro': counts the total number of true positives, false
111 | positives, and false negatives for the classes in
112 | `pos_indices` and infer the metric from it.
113 | 'macro': will compute the metric separately for each class in
114 | `pos_indices` and average. Will not account for class
115 | imbalance.
116 | 'weighted': will compute the metric separately for each class in
117 | `pos_indices` and perform a weighted average by the total
118 | number of true labels for each class.
119 | beta : int, optional
120 | Weight of precision in harmonic mean
121 | Returns
122 | -------
123 | tuple of (scalar float Tensor, update_op)
124 | """
125 | cm, op = _streaming_confusion_matrix(
126 | labels, predictions, num_classes, weights)
127 | _, _, fbeta = metrics_from_confusion_matrix(
128 | cm, pos_indices, average=average, beta=beta)
129 | _, _, op = metrics_from_confusion_matrix(
130 | op, pos_indices, average=average, beta=beta)
131 | return (fbeta, op)
132 |
133 |
134 | def safe_div(numerator, denominator):
135 | """Safe division, return 0 if denominator is 0"""
136 | numerator, denominator = tf.to_float(numerator), tf.to_float(denominator)
137 | zeros = tf.zeros_like(numerator, dtype=numerator.dtype)
138 | denominator_is_zero = tf.equal(denominator, zeros)
139 | return tf.where(denominator_is_zero, zeros, numerator / denominator)
140 |
141 |
142 | def pr_re_fbeta(cm, pos_indices, beta=1):
143 | """Uses a confusion matrix to compute precision, recall and fbeta"""
144 | num_classes = cm.shape[0]
145 | neg_indices = [i for i in range(num_classes) if i not in pos_indices]
146 | cm_mask = np.ones([num_classes, num_classes])
147 | cm_mask[neg_indices, neg_indices] = 0
148 | diag_sum = tf.reduce_sum(tf.diag_part(cm * cm_mask))
149 |
150 | cm_mask = np.ones([num_classes, num_classes])
151 | cm_mask[:, neg_indices] = 0
152 | tot_pred = tf.reduce_sum(cm * cm_mask)
153 |
154 | cm_mask = np.ones([num_classes, num_classes])
155 | cm_mask[neg_indices, :] = 0
156 | tot_gold = tf.reduce_sum(cm * cm_mask)
157 |
158 | pr = safe_div(diag_sum, tot_pred)
159 | re = safe_div(diag_sum, tot_gold)
160 | fbeta = safe_div((1. + beta**2) * pr * re, beta**2 * pr + re)
161 |
162 | return pr, re, fbeta
163 |
164 |
165 | def metrics_from_confusion_matrix(cm, pos_indices=None, average='micro',
166 | beta=1):
167 | """Precision, Recall and F1 from the confusion matrix
168 | Parameters
169 | ----------
170 | cm : tf.Tensor of type tf.int32, of shape (num_classes, num_classes)
171 | The streaming confusion matrix.
172 | pos_indices : list of int, optional
173 | The indices of the positive classes
174 | beta : int, optional
175 | Weight of precision in harmonic mean
176 | average : str, optional
177 | 'micro', 'macro' or 'weighted'
178 | """
179 | num_classes = cm.shape[0]
180 | if pos_indices is None:
181 | pos_indices = [i for i in range(num_classes)]
182 |
183 | if average == 'micro':
184 | return pr_re_fbeta(cm, pos_indices, beta)
185 | elif average in {'macro', 'weighted'}:
186 | precisions, recalls, fbetas, n_golds = [], [], [], []
187 | for idx in pos_indices:
188 | pr, re, fbeta = pr_re_fbeta(cm, [idx], beta)
189 | precisions.append(pr)
190 | recalls.append(re)
191 | fbetas.append(fbeta)
192 | cm_mask = np.zeros([num_classes, num_classes])
193 | cm_mask[idx, :] = 1
194 | n_golds.append(tf.to_float(tf.reduce_sum(cm * cm_mask)))
195 |
196 | if average == 'macro':
197 | pr = tf.reduce_mean(precisions)
198 | re = tf.reduce_mean(recalls)
199 | fbeta = tf.reduce_mean(fbetas)
200 | return pr, re, fbeta
201 | if average == 'weighted':
202 | n_gold = tf.reduce_sum(n_golds)
203 | pr_sum = sum(p * n for p, n in zip(precisions, n_golds))
204 | pr = safe_div(pr_sum, n_gold)
205 | re_sum = sum(r * n for r, n in zip(recalls, n_golds))
206 | re = safe_div(re_sum, n_gold)
207 | fbeta_sum = sum(f * n for f, n in zip(fbetas, n_golds))
208 | fbeta = safe_div(fbeta_sum, n_gold)
209 | return pr, re, fbeta
210 |
211 | else:
212 | raise NotImplementedError()
213 |
--------------------------------------------------------------------------------
/text_classification/tnews_data_eda.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "CLUEBenchmark的头条中文新闻分类 数据EDA过程\n",
8 | "任务介绍:https://www.cluebenchmarks.com/introduce.html"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "metadata": {},
15 | "outputs": [],
16 | "source": [
17 | "import pandas as pd\n",
18 | "import numpy as np\n",
19 | "import json"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "def convert_df(file_path, task):\n",
29 | " with open(file_path, encoding='utf-8') as fr:\n",
30 | " lines = fr.readlines()\n",
31 | " label_list = []\n",
32 | " sentence_list = []\n",
33 | " ids = []\n",
34 | " for line in lines:\n",
35 | " json_str = json.loads(line)\n",
36 | " if task == 'test':\n",
37 | " ids.append(json_str['id'])\n",
38 | " sentence_list.append(json_str['sentence'])\n",
39 | " else:\n",
40 | " label_list.append(json_str['label'])\n",
41 | " sentence_list.append(json_str['sentence'])\n",
42 | " if task == 'test':\n",
43 | " data_dict = {'id': ids, 'text': sentence_list}\n",
44 | " else:\n",
45 | " data_dict = {'label': label_list, 'text': sentence_list}\n",
46 | " data = pd.DataFrame(data_dict)\n",
47 | " \n",
48 | " return data"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": 3,
54 | "metadata": {},
55 | "outputs": [
56 | {
57 | "name": "stdout",
58 | "output_type": "stream",
59 | "text": [
60 | " label text\n",
61 | "0 108 上课时学生手机响个不停,老师一怒之下把手机摔了,家长拿发票让老师赔,大家怎么看待这种事?\n",
62 | "1 104 商赢环球股份有限公司关于延期回复上海证券交易所对公司2017年年度报告的事后审核问询函的公告\n",
63 | "2 106 通过中介公司买了二手房,首付都付了,现在卖家不想卖了。怎么处理?\n",
64 | "3 112 2018年去俄罗斯看世界杯得花多少钱?\n",
65 | "4 109 剃须刀的个性革新,雷明登天猫定制版新品首发\n",
66 | " label text\n",
67 | "0 102 江疏影甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物\n",
68 | "1 110 以色列大规模空袭开始!伊朗多个军事目标遭遇打击,誓言对等反击\n",
69 | "2 104 出栏一头猪亏损300元,究竟谁能笑到最后!\n",
70 | "3 109 以前很火的巴铁为何现在只字不提?\n",
71 | "4 112 作为一名酒店从业人员,你经历过房客哪些特别没有素质的行为?\n"
72 | ]
73 | }
74 | ],
75 | "source": [
76 | "train_path = './data/train.json'\n",
77 | "dev_path = './data/dev.json'\n",
78 | "test_path = './data/test.json'\n",
79 | "\n",
80 | "train_data = convert_df(train_path, 'train')\n",
81 | "print(train_data.head(5))\n",
82 | "\n",
83 | "dev_data = convert_df(dev_path, 'dev')\n",
84 | "print(dev_data.head(5))"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": 4,
90 | "metadata": {},
91 | "outputs": [
92 | {
93 | "name": "stdout",
94 | "output_type": "stream",
95 | "text": [
96 | " label text word_cnt\n",
97 | "0 108 上课时学生手机响个不停,老师一怒之下把手机摔了,家长拿发票让老师赔,大家怎么看待这种事? 44\n",
98 | "1 104 商赢环球股份有限公司关于延期回复上海证券交易所对公司2017年年度报告的事后审核问询函的公告 46\n",
99 | "2 106 通过中介公司买了二手房,首付都付了,现在卖家不想卖了。怎么处理? 32\n",
100 | "3 112 2018年去俄罗斯看世界杯得花多少钱? 19\n",
101 | "4 109 剃须刀的个性革新,雷明登天猫定制版新品首发 21\n"
102 | ]
103 | }
104 | ],
105 | "source": [
106 | "train_data['word_cnt'] = train_data['text'].apply(lambda x: len(x))\n",
107 | "print(train_data.head(5))"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 5,
113 | "metadata": {},
114 | "outputs": [
115 | {
116 | "data": {
117 | "text/plain": [
118 | "count 53360.000000\n",
119 | "mean 22.131241\n",
120 | "std 7.309860\n",
121 | "min 2.000000\n",
122 | "25% 17.000000\n",
123 | "50% 22.000000\n",
124 | "75% 28.000000\n",
125 | "max 145.000000\n",
126 | "Name: word_cnt, dtype: float64"
127 | ]
128 | },
129 | "execution_count": 5,
130 | "metadata": {},
131 | "output_type": "execute_result"
132 | }
133 | ],
134 | "source": [
135 | "train_data['word_cnt'].describe()"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": 6,
141 | "metadata": {},
142 | "outputs": [
143 | {
144 | "data": {
145 | "text/plain": [
146 | "109 5955\n",
147 | "104 5200\n",
148 | "102 4976\n",
149 | "113 4851\n",
150 | "107 4118\n",
151 | "101 4081\n",
152 | "103 3991\n",
153 | "110 3632\n",
154 | "108 3437\n",
155 | "116 3390\n",
156 | "112 3368\n",
157 | "115 2886\n",
158 | "106 2107\n",
159 | "100 1111\n",
160 | "114 257\n",
161 | "Name: label, dtype: int64"
162 | ]
163 | },
164 | "execution_count": 6,
165 | "metadata": {},
166 | "output_type": "execute_result"
167 | }
168 | ],
169 | "source": [
170 | "train_data['label'].value_counts()"
171 | ]
172 | },
173 | {
174 | "cell_type": "raw",
175 | "metadata": {},
176 | "source": [
177 | "从上面的数据看出tricks:\n",
178 | "1.文本最长是145,大部分都是28左右\n",
179 | "2.label的数量是不均衡的,可以在loss计算的时候加上label的权重"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": 16,
185 | "metadata": {},
186 | "outputs": [
187 | {
188 | "name": "stdout",
189 | "output_type": "stream",
190 | "text": [
191 | " label text word_cnt \\\n",
192 | "0 108 上课时学生手机响个不停,老师一怒之下把手机摔了,家长拿发票让老师赔,大家怎么看待这种事? 44 \n",
193 | "1 104 商赢环球股份有限公司关于延期回复上海证券交易所对公司2017年年度报告的事后审核问询函的公告 46 \n",
194 | "2 106 通过中介公司买了二手房,首付都付了,现在卖家不想卖了。怎么处理? 32 \n",
195 | "3 112 2018年去俄罗斯看世界杯得花多少钱? 19 \n",
196 | "4 109 剃须刀的个性革新,雷明登天猫定制版新品首发 21 \n",
197 | "\n",
198 | " words \n",
199 | "0 上课时 学生 手机 响个 不停 老师 一怒之下 把 手机 摔 了 家长 拿 发票 让 老师 ... \n",
200 | "1 商赢 环球 股份 有限公司 关于 延期 回复 上海证券交易所 对 公司 Number 年 年... \n",
201 | "2 通过 中介 公司 买 了 二手房 首付 都 付 了 现在 卖家 不想 卖 了 怎么 处理 \n",
202 | "3 Number 年 去 俄罗斯 看 世界杯 得花 多少 钱 \n",
203 | "4 剃须刀 的 个性 革新 雷明登 天猫 定制 版 新品 首发 \n"
204 | ]
205 | }
206 | ],
207 | "source": [
208 | "# 分词去掉一些无用词\n",
209 | "import jieba\n",
210 | "def cut_with_jieba(text, filter=None):\n",
211 | " if filter:\n",
212 | " for c in filter:\n",
213 | " text = text.replace(c, '')\n",
214 | " words = ['Number' if word.isdigit() else word for word in jieba.cut(text)]\n",
215 | " # todo 停用词表还可以加进来\n",
216 | " return ' '.join(words)\n",
217 | "\n",
218 | "filter = './??;。(())【】{}[]!!,,<>《》+'\n",
219 | "\n",
220 | "train_data['words'] = train_data['text'].apply(lambda x: cut_with_jieba(x, filter))\n",
221 | "\n",
222 | "print(train_data.head(5))"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": 17,
228 | "metadata": {},
229 | "outputs": [
230 | {
231 | "name": "stdout",
232 | "output_type": "stream",
233 | "text": [
234 | " label text \\\n",
235 | "0 102 江疏影甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物 \n",
236 | "1 110 以色列大规模空袭开始!伊朗多个军事目标遭遇打击,誓言对等反击 \n",
237 | "2 104 出栏一头猪亏损300元,究竟谁能笑到最后! \n",
238 | "3 109 以前很火的巴铁为何现在只字不提? \n",
239 | "4 112 作为一名酒店从业人员,你经历过房客哪些特别没有素质的行为? \n",
240 | "\n",
241 | " words \n",
242 | "0 江 疏影 甜甜 圈自 拍迷 之 角度 竟 这么 好看 美 吸引 一切 事物 \n",
243 | "1 以色列 大规模 空袭 开始 伊朗 多个 军事 目标 遭遇 打击 誓言 对 等 反击 \n",
244 | "2 出栏 一头 猪 亏损 Number 元 究竟 谁 能 笑 到 最后 \n",
245 | "3 以前 很火 的 巴铁 为何 现在 只字不提 \n",
246 | "4 作为 一名 酒店 从业人员 你 经历 过 房客 哪些 特别 没有 素质 的 行为 \n"
247 | ]
248 | }
249 | ],
250 | "source": [
251 | "dev_data['words'] = dev_data['text'].apply(lambda x: cut_with_jieba(x, filter))\n",
252 | "print(dev_data.head(5))"
253 | ]
254 | },
255 | {
256 | "cell_type": "raw",
257 | "metadata": {},
258 | "source": [
259 | "可以看出分词其实不太准确,这个地方还可以加入原始数据集中的key word"
260 | ]
261 | },
262 | {
263 | "cell_type": "code",
264 | "execution_count": 18,
265 | "metadata": {},
266 | "outputs": [],
267 | "source": [
268 | "# 写入到文件\n",
269 | "train_data[['words', 'label']].to_csv('./data/train_data.csv', sep='\\t', encoding='utf-8', index=None)\n",
270 | "dev_data[['words', 'label']].to_csv('./data/dev_data.csv', sep='\\t', encoding='utf-8', index=None)"
271 | ]
272 | },
273 | {
274 | "cell_type": "code",
275 | "execution_count": 19,
276 | "metadata": {},
277 | "outputs": [
278 | {
279 | "name": "stdout",
280 | "output_type": "stream",
281 | "text": [
282 | " id text \\\n",
283 | "0 0 在设计史上,每当相对稳定的发展时期,这种设计思想就会成为主导 \n",
284 | "1 1 利希施泰纳宣布赛季结束后离队:我需要新的挑战 \n",
285 | "2 2 庄家一般都是什么操盘思路? \n",
286 | "3 3 王者荣耀里搅屎棍英雄都有谁? \n",
287 | "4 4 照片不小心被删,看看下面的教程,完美找回来! \n",
288 | "\n",
289 | " words \n",
290 | "0 在 设计 史上 每当 相对 稳定 的 发展 时期 这种 设计 思想 就 会 成为 主导 \n",
291 | "1 利希 施泰纳 宣布 赛季 结束 后 离队 : 我 需要 新 的 挑战 \n",
292 | "2 庄家 一般 都 是 什么 操盘 思路 \n",
293 | "3 王者 荣耀 里 搅 屎 棍 英雄 都 有 谁 \n",
294 | "4 照片 不 小心 被删 看看 下面 的 教程 完美 找 回来 \n"
295 | ]
296 | }
297 | ],
298 | "source": [
299 | "\n",
300 | "# Prepare the test set\n",
301 | "test_data = convert_df(test_path, 'test')\n",
302 | "\n",
303 | "test_data['words'] = test_data['text'].apply(lambda x: cut_with_jieba(x, filter))\n",
304 | "\n",
305 | "print(test_data.head(5))\n",
306 | "\n",
307 | "test_data[['id', 'words']].to_csv('./data/test_data.csv', sep='\\t', encoding='utf-8', index=None)\n"
308 | ]
309 | },
310 | {
311 | "cell_type": "code",
312 | "execution_count": null,
313 | "metadata": {},
314 | "outputs": [],
315 | "source": []
316 | }
317 | ],
318 | "metadata": {
319 | "kernelspec": {
320 | "display_name": "Python [conda env:tf_envs]",
321 | "language": "python",
322 | "name": "conda-env-tf_envs-py"
323 | },
324 | "language_info": {
325 | "codemirror_mode": {
326 | "name": "ipython",
327 | "version": 3
328 | },
329 | "file_extension": ".py",
330 | "mimetype": "text/x-python",
331 | "name": "python",
332 | "nbconvert_exporter": "python",
333 | "pygments_lexer": "ipython3",
334 | "version": "3.7.7"
335 | }
336 | },
337 | "nbformat": 4,
338 | "nbformat_minor": 4
339 | }
340 |
--------------------------------------------------------------------------------
/text_classification/train_main.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020-12-11 19:47
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : train_main.py
6 | # @Software: PyCharm
7 |
8 | import sys
9 | import time
10 | import tensorflow as tf
11 | import tf_metrics
12 | import _pickle as cPickle
13 |
14 | from data_utils import datasets
15 | from argparse import ArgumentParser
16 | from models.bilstm_model import BiLstmModel
17 | from models.text_cnn import TextCnn
18 | from models.ffnn_model import FCModel
19 |
20 |
21 | # Set up the command-line arguments
22 | parser = ArgumentParser()
23 |
24 | parser.add_argument("--train_path", type=str, default='./data_path/tnews_data.pkl',
25 | help='the file path of train data, needs pkl type')
26 | parser.add_argument("--eval_path", type=str, default='./data_path/tnews_data.pkl',
27 | help='the file path of test data, needs pkl type')
28 | parser.add_argument("--model_ckpt_dir", type=str, default='./model_ckpt/',
29 | help='the dir of the checkpoint type model')
30 | parser.add_argument("--model_pb_dir", type=str, default='./model_pb',
31 | help='the dir of the pb type model')
32 |
33 | parser.add_argument("--vocab_size", type=int, default=68000, help='the vocab size')
34 | parser.add_argument("--emb_size", type=int, default=300, help='the embedding size')
35 | parser.add_argument("--hidden_size", type=int, default=300,
36 |                     help='the hidden size of the rnn layer; it is split in half between the forward and backward directions')
37 | parser.add_argument("--fc_layer_size", type=int, default=300,
38 | help='the hidden size of fully connect layer')
39 | parser.add_argument("--num_label", type=int, default=15, help='the number of task label')
40 | parser.add_argument("--drop_out", type=float, default=0.2,
41 | help='the dropout rate in layers')
42 | parser.add_argument("--batch_size", type=int, default=16,
43 | help='the batch size of dataset in one step training')
44 | parser.add_argument("--epoch", type=int, default=5,
45 | help='the epoch count we want to train')
46 | parser.add_argument("--model_name", type=str, default='lstm',
47 | help='which model we want use in our task, [lstm, cnn, fc, ...]')
48 | parser.add_argument("--pool", type=str, default='max',
49 | help='the pool function, [max, mean, ...]')
50 | parser.add_argument("--activator", type=str, default='relu',
51 | help='the activate function, [relu, relu6, tanh, ...]')
52 | parser.add_argument("--filter_num", type=int, default=128,
53 | help='the number of the cnn filters')
54 | parser.add_argument("--use_pos", type=int, default=0,
55 | help='whether to use position encoding in embedding layer')
56 | parser.add_argument("--lr", type=float, default=1e-3,
57 | help='the learning rate for optimizer')
58 |
59 |
60 | # TODO: position information could also be added to the embedding layer
61 | # TODO: attention pooling could also be added as a pooling option
62 |
63 |
64 | tf.logging.set_verbosity(tf.logging.INFO)
65 | ARGS, unparsed = parser.parse_known_args()
66 | print(ARGS)
67 | sys.stdout.flush()
68 |
69 |
70 | def init_data(file_name, type=None):
71 | """
72 |     Initialize the dataset and build the estimator input function.
73 |     :param file_name: path to the pickled dataset file
74 |     :param type: which split to load, 'train' or 'test'
75 |     :return: an input_fn callable for tf.estimator
76 | """
77 | data = cPickle.load(open(file_name, 'rb'))[type]
78 |
79 | data_builder = datasets.DataBuilder(data)
80 | dataset = data_builder.build_dataset()
81 |
82 | def train_input():
83 | return data_builder.get_train_batch(dataset, ARGS.batch_size, ARGS.epoch)
84 |
85 | def test_input():
86 | return data_builder.get_test_batch(dataset, ARGS.batch_size)
87 |
88 | return train_input if type == 'train' else test_input
89 |
90 |
91 | def make_model():
92 | """
93 |     Build the model selected by --model_name.
94 | :return:
95 | """
96 | vocab_size = ARGS.vocab_size
97 | emb_size = ARGS.emb_size
98 | print(f'the model name is {ARGS.model_name}')
99 | if ARGS.model_name == 'lstm':
100 | model = BiLstmModel(vocab_size, emb_size, ARGS)
101 | elif ARGS.model_name == 'cnn':
102 | model = TextCnn(vocab_size, emb_size, ARGS)
103 | elif ARGS.model_name == 'fc':
104 | model = FCModel(vocab_size, emb_size, ARGS)
105 | else:
106 | raise KeyError('the model type is not implemented!')
107 | return model
108 |
109 |
110 | def model_fn(features, labels, mode, params):
111 | """
112 | the model fn
113 | :return:
114 | """
115 | model = make_model()
116 |
117 | if isinstance(features, dict):
118 | features = features['words']
119 |
120 | words = features
121 |
122 | if mode == tf.estimator.ModeKeys.PREDICT:
123 | logits = model(words, training=False)
124 |
125 | prediction = {'class_id': tf.argmax(logits, axis=1, name='class_out'),
126 | 'prob': tf.nn.softmax(logits, name='prob_out')}
127 |
128 | return tf.estimator.EstimatorSpec(
129 | mode=mode,
130 | predictions=prediction,
131 | export_outputs={'classify': tf.estimator.export.PredictOutput(prediction)}
132 | )
133 | else:
134 | if mode == tf.estimator.ModeKeys.TRAIN:
135 | logits = model(words, training=True)
136 | weights = tf.constant(
137 | [0.9, 0.9, 0.9, 0.9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1.2, 1.5])
138 | weights = tf.gather(weights, labels)
139 | loss = tf.losses.sparse_softmax_cross_entropy(
140 | labels, logits,
141 | weights=weights,
142 | reduction=tf.losses.Reduction.MEAN)
143 | prediction = tf.argmax(logits, axis=1)
144 | accuracy = tf.metrics.accuracy(labels=labels,
145 | predictions=prediction)
146 | tf.identity(accuracy[1], name='train_accuracy')
147 | tf.summary.scalar('train_accuracy', accuracy[1])
148 | optimizer = tf.train.AdamOptimizer(learning_rate=ARGS.lr)
149 | return tf.estimator.EstimatorSpec(
150 | mode=mode,
151 | loss=loss,
152 | train_op=optimizer.minimize(loss, tf.train.get_or_create_global_step())
153 | )
154 | else:
155 | logits = model(words, training=False)
156 | prediction = tf.argmax(logits, axis=1)
157 |             # the built-in tf.metrics do not support multi-class precision/recall, so use tf_metrics
158 | precision = tf_metrics.precision(labels, prediction, ARGS.num_label)
159 | recall = tf_metrics.recall(labels, prediction, ARGS.num_label)
160 | accuracy = tf.metrics.accuracy(labels, predictions=prediction)
161 | metrics = {
162 | 'accuracy': accuracy,
163 | 'recall': recall,
164 | 'precision': precision
165 | }
166 | return tf.estimator.EstimatorSpec(
167 | mode=mode,
168 |                 loss=tf.losses.sparse_softmax_cross_entropy(labels, logits),  # report the real eval loss
169 | eval_metric_ops=metrics
170 | )
171 |
172 |
173 | def main_es(unparsed):
174 | """
175 | main method
176 | :param unparsed:
177 | :return:
178 | """
179 | cur_time = time.time()
180 | model_dir = ARGS.model_ckpt_dir + str(int(cur_time))
181 |
182 |     classifier = tf.estimator.Estimator(
183 | model_fn=model_fn,
184 | model_dir=model_dir,
185 | params={}
186 | )
187 |
188 | # train model
189 | train_input = init_data(ARGS.train_path, 'train')
190 | tensors_to_log = {'train_accuracy': 'train_accuracy'}
191 | logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=100)
192 |     classifier.train(input_fn=train_input, hooks=[logging_hook])
193 |
194 | # eval model
195 | eval_input = init_data(ARGS.eval_path, 'test')
196 |     eval_res = classifier.evaluate(input_fn=eval_input)
197 | print(f'Evaluation res is : \n\t{eval_res}')
198 |
199 | if ARGS.model_pb_dir:
200 | words = tf.placeholder(tf.int64, [None, None], name='input_words')
201 | input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({
202 | 'words': words
203 | })
204 |         classifier.export_savedmodel(ARGS.model_pb_dir, input_fn)
205 |
206 |
207 | if __name__ == '__main__':
208 | tf.app.run(main=main_es, argv=[sys.argv[0]] + unparsed)
209 |
--------------------------------------------------------------------------------
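For a quick sanity check of the SavedModel that `main_es()` exports via `export_savedmodel`, the exported `classify` signature (input key `words`, output keys `class_id`/`prob`) can be loaded with TF1's `tf.contrib.predictor`. This is only a minimal sketch under those assumptions, not the repo's `inference.py`; the glob pattern assumes the default `./model_pb` export directory and the token ids below are made up:

```python
import glob
import tensorflow as tf

export_dir = sorted(glob.glob('./model_pb/*'))[-1]              # newest timestamped export
predict_fn = tf.contrib.predictor.from_saved_model(export_dir,
                                                   signature_def_key='classify')

word_ids = [[12, 345, 678, 9, 0, 0]]                            # already-mapped token ids (hypothetical)
result = predict_fn({'words': word_ids})
print(result['class_id'], result['prob'])
```

Training itself is driven by the flags defined at the top of the file, e.g. `python train_main.py --model_name cnn --epoch 10`.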