├── README.md
├── bert_downstream
│   ├── README.md
│   ├── bert_master
│   │   └── README.md
│   ├── data_path
│   │   └── README.md
│   ├── model_ckpt
│   │   └── README.md
│   ├── pre_trained
│   │   └── README.md
│   ├── train_classifier.py
│   ├── train_multi_learning.py
│   └── train_ner.py
├── ckbqa
│   ├── DUTIR中文开放域知识问答评测报告.pdf
│   ├── README.md
│   └── 基于特征融合的中文知识库问答方法.pdf
├── named_entity_recognition
│   ├── README.md
│   ├── convert_bio.py
│   ├── data_path
│   │   └── README.md
│   ├── data_utils
│   │   ├── __init__.py
│   │   └── datasets.py
│   ├── inference.py
│   ├── model_ckpt
│   │   └── README.md
│   ├── model_pb
│   │   └── README.md
│   ├── models
│   │   ├── __init__.py
│   │   └── bilstm_crf.py
│   ├── ner_main.py
│   ├── pics
│   │   ├── 命名实体识别数据图.png
│   │   └── 命名实体识别的模型总结图.png
│   └── preprocess.py
└── text_classification
    ├── README.md
    ├── data_path
    │   ├── README.md
    │   ├── tnews_data.pkl
    │   └── vocab.txt
    ├── inference.py
    ├── model_ckpt
    │   └── README.md
    ├── model_pb
    │   └── README.md
    ├── models
    │   ├── __init__.py
    │   ├── attention.py
    │   ├── base_model.py
    │   ├── bilstm_model.py
    │   ├── ffnn_model.py
    │   ├── model_utils.py
    │   └── text_cnn.py
    ├── preprocess.py
    ├── tf_metrics.py
    ├── tnews_data_eda.ipynb
    └── train_main.py
/README.md:
--------------------------------------------------------------------------------
1 | ## AwesomeNLPBaseline
2 |
3 | This project provides baseline implementations for a number of NLP tasks, including text classification, named entity recognition, entity-relation extraction, NL2SQL, CKBQA, and various BERT downstream applications.
4 |
5 | It is built mainly with TensorFlow 1.x.
6 |
7 | Honestly, TensorFlow 1.x is genuinely harder to work with than torch for some of these tasks, enough to put people off. Don't ask why not switch to torch if it is that painful (because I haven't learned it); the official answer is that the difficulty is exactly the point, and besides, TF's as-server deployment mode is really convenient when shipping projects inside a company.
8 |
9 | **Tasks**
10 |
11 | - Text classification
12 | - Named entity recognition
13 | - BERT downstream tasks
14 | - Entity-relation extraction
15 | - NL2SQL
16 | - CKBQA
17 | - more to come (continuously updated)
18 |
19 | **Directory layout**:
20 |
21 | * text_classification: text classification
22 | * named_entity_recognition: named entity recognition
23 | * entity_relation_extraction: entity-relation extraction
24 | * ckbqa: Chinese knowledge base question answering (CKBQA)
25 | * nl2sql: natural language to SQL
26 | * bert_downstream: BERT fine-tuning for downstream tasks and related BERT experiments
27 |
28 | Tip: only text classification, the BERT downstream tasks, and named entity recognition are implemented so far; the rest will be added when time allows.
29 |
30 |
31 | **Disclaimer**
32 |
33 | This project is a collection of NLP tasks the author has come across in study and work, intended for learning and exchange only. Issues and PRs are welcome.
34 |
35 |
--------------------------------------------------------------------------------
/bert_downstream/README.md:
--------------------------------------------------------------------------------
1 | ## Introduction to BERT
2 |
3 | **Overview**
4 |
5 | BERT is a pre-trained model released by Google in October 2018. As soon as it was published it drew wide attention from academia and industry alike. In terms of results, BERT set new state-of-the-art scores on 11 NLP tasks, and the work was named one of the major NLP advances of 2018 and won the NAACL 2019 best paper award. BERT follows essentially the same technical route as OpenAI's earlier GPT, differing only in technical details. The main contribution of both works is tackling NLP problems with a pre-train + fine-tune approach. Taking BERT as an example, applying the model involves two stages:
6 |
7 | - Pre-training: the network parameters are learned on large general-purpose corpora such as Wikipedia and BookCorpus, which provide enough text to expose the model to a rich range of linguistic phenomena.
8 | - Fine-tuning: the network parameters are adapted with task-specific labeled data, so there is no need to design a task-specific network for the target task and train it from scratch.
9 |
10 | For details see the original paper, [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805v1); this is the version released in October 2018, which differs slightly from the May 2019 revision [v2](https://arxiv.org/abs/1810.04805v2).
11 |
12 | A Chinese translation of the paper is also available: [BERT论文中文翻译](https://github.com/yuanxiaosc/BERT_Paper_Chinese_Translation)
13 |
14 | **Pre-trained BERT models**:
15 |
16 | - Official BERT: https://github.com/google-research/bert
17 | - Transformers: https://github.com/huggingface/transformers
18 | - HFL (HIT & iFLYTEK): https://github.com/ymcui/Chinese-BERT-wwm
19 | - Brightmart: https://github.com/brightmart/roberta_zh
20 | - CLUEPretrainedModels: https://github.com/CLUEbenchmark/CLUEPretrainedModels
21 |
22 | **BERT downstream tasks**
23 |
24 | Pre-trained models have greatly reduced the need to design task-specific architectures: attaching a simple network on top of BERT (or another pre-trained model) is enough to handle most NLP tasks, and it works remarkably well.
25 |
26 | The reason is simple: through unsupervised learning over massive corpora, BERT has already distilled the knowledge in those corpora into its pre-trained representations, so we only need to add a small task-specific structure and fine-tune it to adapt to the task at hand. That is the appeal of transfer learning.
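
As a concrete illustration, here is a minimal sketch of such a task-specific layer for single-sentence classification: one dense layer over the pooled `[CLS]` output (it mirrors `create_model` in `train_classifier.py`; `num_labels` is the number of target classes):

```
import tensorflow as tf
import bert_master.modeling as modeling

def classification_head(bert_config, input_ids, input_mask, segment_ids,
                        num_labels, is_training):
    # Shared pre-trained encoder.
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=False)
    # [CLS] representation, shape [batch_size, hidden_size].
    pooled = model.get_pooled_output()
    hidden_size = pooled.shape[-1].value
    # The entire task-specific network: a single dense layer.
    weights = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    bias = tf.get_variable("output_bias", [num_labels],
                           initializer=tf.zeros_initializer())
    return tf.nn.bias_add(tf.matmul(pooled, weights, transpose_b=True), bias)
```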
27 |
28 | Several common types of downstream tasks:
29 |
30 | - Sentence-pair classification, e.g. natural language inference (NLI); typical datasets include MNLI, QNLI, STS-B, and MRPC
31 | - Single-sentence classification, e.g. text classification; typical datasets include SST-2 and CoLA
32 | - Question answering; a typical dataset is SQuAD v1.1
33 | - Single-sentence token tagging, e.g. named entity recognition (NER); a typical dataset is CoNLL-2003 (see the sketch below)
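
For the token-tagging case, the per-token sequence output is used instead of the pooled `[CLS]` vector. A minimal sketch, assuming the same `modeling.BertModel` as above and a hypothetical `num_tags` equal to the size of the tag set:

```
# `model` is a modeling.BertModel built as in the sketch above.
sequence_output = model.get_sequence_output()   # [batch_size, seq_len, hidden_size]
token_logits = tf.layers.dense(
    sequence_output, num_tags,
    kernel_initializer=tf.truncated_normal_initializer(stddev=0.02))
# token_logits: [batch_size, seq_len, num_tags]; decode per token with argmax,
# or feed the scores into a CRF layer.
```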
34 |
35 |
36 |
37 | ## BERT downstream task code
38 |
39 | The following baselines for text classification, named entity recognition, and multi-task learning are built on the official BERT fine-tuning code.
40 |
41 | (All tasks below use the **`BERT-wwm-ext, Chinese`** model from the HFL (HIT & iFLYTEK) lab as the pre-trained model; see the links above for downloads.)
42 |
43 | 1. Text classification
44 |
45 | ```
46 | Dataset: sentiment classification, a fine-grained sentiment analysis dataset with 7 classes from the NLP Chinese Pre-trained Model Generalization Ability Challenge
47 | Training script: train_classifier.py
48 |
49 | 3 epochs with BERT-wwm-ext, Chinese and no tricks; the score submitted to CLUE is 56.04
50 | An earlier online submission with a BiLSTM scored 50.92 (code in text_classification)
51 | ([ALBERT-xxlarge](https://github.com/google-research/albert): 59.46; current [UER-ensemble](https://github.com/dbiir/UER-py): 72.20)
52 | ```
53 |
54 | 2. Named entity recognition
55 |
56 | ```
57 | Dataset: CLUENER fine-grained named entity recognition, with 10 label categories; details: https://github.com/CLUEbenchmark/CLUENER2020
58 | Training script: train_ner.py
59 | # todo next
60 | ```
61 |
62 | 3. Multi-task learning
63 |
64 | ```
65 | Dataset: NLP Chinese Pre-trained Model Generalization Ability Challenge, https://tianchi.aliyun.com/competition/entrance/531841/introduction
66 | Training script: train_multi_learning.py
67 |
68 | 3 epochs, no tricks: current score 0.5717
69 | 3 epochs, no tricks, with roberta-large: score 0.6236
70 | ```
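
The multi-task script shares a single BERT encoder and keeps a separate dense head per task; every training step draws one batch from one task's TFRecord file. The task schedule is proportional to dataset size and shuffled, as done in `set_random_task` of `train_multi_learning.py`; a minimal sketch:

```
import random

def build_task_schedule(emotion_cnt, news_cnt, nli_cnt, batch_size):
    # One entry per available batch, so within an epoch each task
    # is visited in proportion to its dataset size.
    schedule = ([1] * (emotion_cnt // batch_size)
                + [2] * (news_cnt // batch_size)
                + [3] * (nli_cnt // batch_size))
    random.shuffle(schedule)
    return schedule
```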
71 |
72 |
73 |
74 | ## Further reading
75 |
76 | Two interesting papers worth a look:
77 |
78 | 1. How to Fine-Tune BERT for Text Classification
79 |
80 | 2. Few-sample BERT fine-tuning
--------------------------------------------------------------------------------
/bert_downstream/bert_master/README.md:
--------------------------------------------------------------------------------
1 | ### File description
2 |
3 | This directory holds the official BERT model code from https://github.com/google-research/bert
4 | It contains three files:
5 | - modeling.py
6 | - optimization.py
7 | - tokenization.py
--------------------------------------------------------------------------------
/bert_downstream/data_path/README.md:
--------------------------------------------------------------------------------
1 | ### File description
2 |
3 | This directory holds the datasets.
--------------------------------------------------------------------------------
/bert_downstream/model_ckpt/README.md:
--------------------------------------------------------------------------------
1 | ### File description
2 |
3 | Checkpoint files produced by training are saved here.
--------------------------------------------------------------------------------
/bert_downstream/pre_trained/README.md:
--------------------------------------------------------------------------------
1 | ### File description
2 |
3 | This directory holds the pre-trained models. The project mainly uses the BERT model from the HFL (HIT & iFLYTEK) lab; see https://github.com/ymcui/Chinese-BERT-wwm
--------------------------------------------------------------------------------
/bert_downstream/train_classifier.py:
--------------------------------------------------------------------------------
1 | """BERT finetuning runner for text classification."""
2 |
3 | import collections
4 | import os
5 | import json
6 | import tensorflow as tf
7 |
8 | import bert_master.modeling as modeling
9 | import bert_master.optimization as optimization
10 | import bert_master.tokenization as tokenization
11 |
12 | flags = tf.flags
13 | FLAGS = flags.FLAGS
14 |
15 | # Required parameters
16 | flags.DEFINE_string(
17 | "data_dir", './data_path/tnews/',
18 | "The input data dir. Should contain the .tsv files (or other data files) "
19 | "for the task.")
20 |
21 | flags.DEFINE_string(
22 | "bert_config_file", './pre_trained/bert_config.json',
23 | "The config json file corresponding to the pre-trained BERT model. "
24 | "This specifies the model architecture.")
25 |
26 | flags.DEFINE_string("task_name", 'tnews',
27 | "The name of the task to train.")
28 |
29 | flags.DEFINE_string("vocab_file", './pre_trained/vocab.txt',
30 | "The vocabulary file that the BERT model was trained on.")
31 |
32 | flags.DEFINE_string(
33 | "output_dir", './model_ckpt/tnews/',
34 | "The output directory where the model checkpoints will be written.")
35 |
36 | flags.DEFINE_string(
37 | "init_checkpoint", './pre_trained/bert_model.ckpt',
38 | "Initial checkpoint (usually from a pre-trained BERT model).")
39 |
40 | flags.DEFINE_bool(
41 | "do_lower_case", True,
42 | "Whether to lower case the input text. Should be True for uncased "
43 | "models and False for cased models.")
44 |
45 | flags.DEFINE_integer(
46 | "max_seq_length", 128,
47 | "The maximum total input sequence length after WordPiece tokenization. "
48 | "Sequences longer than this will be truncated, and sequences shorter "
49 | "than this will be padded.")
50 |
51 | flags.DEFINE_bool("do_train", True, "Whether to run training.")
52 |
53 | flags.DEFINE_bool("do_eval", True, "Whether to run eval on the dev set.")
54 |
55 | flags.DEFINE_bool(
56 | "do_predict", True,
57 | "Whether to run the model in inference mode on the test set.")
58 |
59 | flags.DEFINE_integer("train_batch_size", 16, "Total batch size for training.")
60 |
61 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")
62 |
63 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")
64 |
65 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")
66 |
67 | flags.DEFINE_float("num_train_epochs", 3.0,
68 | "Total number of training epochs to perform.")
69 |
70 | flags.DEFINE_float(
71 | "warmup_proportion", 0.1,
72 | "Proportion of training to perform linear learning rate warmup for. "
73 | "E.g., 0.1 = 10% of training.")
74 |
75 | flags.DEFINE_integer("save_checkpoints_steps", 1000,
76 | "How often to save the model checkpoint.")
77 |
78 | flags.DEFINE_integer("iterations_per_loop", 1000,
79 | "How many steps to make in each estimator call.")
80 |
81 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")
82 |
83 | tf.flags.DEFINE_string(
84 | "tpu_name", None,
85 | "The Cloud TPU to use for training. This should be either the name "
86 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
87 | "url.")
88 |
89 | tf.flags.DEFINE_string(
90 | "tpu_zone", None,
91 | "[Optional] GCE zone where the Cloud TPU is located in. If not "
92 | "specified, we will attempt to automatically detect the GCE project from "
93 | "metadata.")
94 |
95 | tf.flags.DEFINE_string(
96 | "gcp_project", None,
97 | "[Optional] Project name for the Cloud TPU-enabled project. If not "
98 | "specified, we will attempt to automatically detect the GCE project from "
99 | "metadata.")
100 |
101 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
102 |
103 | flags.DEFINE_integer(
104 | "num_tpu_cores", 8,
105 | "Only used if `use_tpu` is True. Total number of TPU cores to use.")
106 |
107 |
108 | class InputExample(object):
109 | """A single training/test example for simple sequence classification."""
110 |
111 | def __init__(self, guid, text_a, text_b=None, label=None):
112 | """Constructs a InputExample.
113 |
114 | Args:
115 | guid: Unique id for the example.
116 | text_a: string. The untokenized text of the first sequence. For single
117 | sequence tasks, only this sequence must be specified.
118 | text_b: (Optional) string. The untokenized text of the second sequence.
119 | Only must be specified for sequence pair tasks.
120 | label: (Optional) string. The label of the example. This should be
121 | specified for train and dev examples, but not for test examples.
122 | """
123 | self.guid = guid
124 | self.text_a = text_a
125 | self.text_b = text_b
126 | self.label = label
127 |
128 |
129 | class PaddingInputExample(object):
130 | """Fake example so the num input examples is a multiple of the batch size.
131 |
132 | When running eval/predict on the TPU, we need to pad the number of examples
133 | to be a multiple of the batch size, because the TPU requires a fixed batch
134 | size. The alternative is to drop the last batch, which is bad because it means
135 | the entire output data won't be generated.
136 |
137 | We use this class instead of `None` because treating `None` as padding
138 |   batches could cause silent errors.
139 | """
140 |
141 |
142 | class InputFeatures(object):
143 | """A single set of features of data."""
144 |
145 | def __init__(self,
146 | input_ids,
147 | input_mask,
148 | segment_ids,
149 | label_id,
150 | is_real_example=True):
151 | self.input_ids = input_ids
152 | self.input_mask = input_mask
153 | self.segment_ids = segment_ids
154 | self.label_id = label_id
155 | self.is_real_example = is_real_example
156 |
157 |
158 | class TnewsProcessor:
159 | def get_train_examples(self, data_dir):
160 | """获取训练集."""
161 | return self._create_examples(
162 | self._read_tsv(os.path.join(data_dir, "train.json")), "train")
163 |
164 | def get_dev_examples(self, data_dir):
165 | """获取验证集."""
166 | return self._create_examples(
167 | self._read_tsv(os.path.join(data_dir, "dev.json")), "dev")
168 |
169 | def get_test_examples(self, data_dir):
170 | """获取测试集."""
171 | return self._create_examples(
172 | self._read_tsv(os.path.join(data_dir, "test.json")), "test")
173 |
174 | def get_labels(self):
175 | """填写新闻分类的类别标签"""
176 | return ['100', '101', '102', '103', '104', '106', '107',
177 | '108', '109', '110', '112', '113', '114', '115', '116']
178 |
179 | def _read_tsv(self, input_file):
180 | """读取数据集"""
181 | with open(input_file, encoding='utf-8') as fr:
182 | lines = fr.readlines()
183 | return lines
184 |
185 | def _create_examples(self, lines, set_type):
186 | """Creates examples for the training and dev sets."""
187 | examples = []
188 | for (i, line) in enumerate(lines):
189 | json_str = json.loads(line)
190 | guid = "%s-%s" % (set_type, i)
191 | if set_type == "test":
192 | text_a = tokenization.convert_to_unicode(json_str['sentence'])
193 | label = None
194 | guid = json_str['id']
195 | else:
196 | text_a = tokenization.convert_to_unicode(json_str['sentence'])
197 | label = tokenization.convert_to_unicode(json_str['label'])
198 | examples.append(
199 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
200 | return examples
201 |
202 |
203 | def convert_single_example(ex_index, example, label_list, max_seq_length,
204 | tokenizer):
205 | """Converts a single `InputExample` into a single `InputFeatures`."""
206 |
207 | if isinstance(example, PaddingInputExample):
208 | return InputFeatures(
209 | input_ids=[0] * max_seq_length,
210 | input_mask=[0] * max_seq_length,
211 | segment_ids=[0] * max_seq_length,
212 | label_id=0,
213 | is_real_example=False)
214 |
215 | label_map = {}
216 | for (i, label) in enumerate(label_list):
217 | label_map[label] = i
218 |
219 | tokens_a = tokenizer.tokenize(example.text_a)
220 | tokens_b = None
221 | if example.text_b:
222 | tokens_b = tokenizer.tokenize(example.text_b)
223 |
224 | if tokens_b:
225 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
226 | else:
227 | # Account for [CLS] and [SEP] with "- 2"
228 | if len(tokens_a) > max_seq_length - 2:
229 | tokens_a = tokens_a[0:(max_seq_length - 2)]
230 |
231 | tokens = []
232 | segment_ids = []
233 | tokens.append("[CLS]")
234 | segment_ids.append(0)
235 | for token in tokens_a:
236 | tokens.append(token)
237 | segment_ids.append(0)
238 | tokens.append("[SEP]")
239 | segment_ids.append(0)
240 |
241 | if tokens_b:
242 | for token in tokens_b:
243 | tokens.append(token)
244 | segment_ids.append(1)
245 | tokens.append("[SEP]")
246 | segment_ids.append(1)
247 |
248 | input_ids = tokenizer.convert_tokens_to_ids(tokens)
249 |
250 | input_mask = [1] * len(input_ids)
251 |
252 | # Zero-pad up to the sequence length.
253 | while len(input_ids) < max_seq_length:
254 | input_ids.append(0)
255 | input_mask.append(0)
256 | segment_ids.append(0)
257 |
258 | assert len(input_ids) == max_seq_length
259 | assert len(input_mask) == max_seq_length
260 | assert len(segment_ids) == max_seq_length
261 |
262 | if example.label:
263 | label_id = label_map[example.label]
264 | else:
265 | label_id = 0
266 |
267 | if ex_index < 5:
268 | tf.logging.info("*** Example ***")
269 | tf.logging.info("guid: %s" % (example.guid))
270 | tf.logging.info("tokens: %s" % " ".join(
271 | [tokenization.printable_text(x) for x in tokens]))
272 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
273 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
274 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
275 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
276 |
277 | feature = InputFeatures(
278 | input_ids=input_ids,
279 | input_mask=input_mask,
280 | segment_ids=segment_ids,
281 | label_id=label_id,
282 | is_real_example=True)
283 | return feature
284 |
285 |
286 | def file_based_convert_examples_to_features(
287 | examples, label_list, max_seq_length, tokenizer, output_file):
288 | """Convert a set of `InputExample`s to a TFRecord file."""
289 |
290 | writer = tf.python_io.TFRecordWriter(output_file)
291 |
292 | for (ex_index, example) in enumerate(examples):
293 | if ex_index % 10000 == 0:
294 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
295 |
296 | feature = convert_single_example(ex_index, example, label_list,
297 | max_seq_length, tokenizer)
298 |
299 | def create_int_feature(values):
300 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
301 | return f
302 |
303 | features = collections.OrderedDict()
304 | features["input_ids"] = create_int_feature(feature.input_ids)
305 | features["input_mask"] = create_int_feature(feature.input_mask)
306 | features["segment_ids"] = create_int_feature(feature.segment_ids)
307 | features["label_ids"] = create_int_feature([feature.label_id])
308 | features["is_real_example"] = create_int_feature(
309 | [int(feature.is_real_example)])
310 |
311 | tf_example = tf.train.Example(features=tf.train.Features(feature=features))
312 | writer.write(tf_example.SerializeToString())
313 | writer.close()
314 |
315 |
316 | def file_based_input_fn_builder(input_file, seq_length, is_training,
317 | drop_remainder):
318 | """Creates an `input_fn` closure to be passed to TPUEstimator."""
319 |
320 | name_to_features = {
321 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
322 | "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
323 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
324 | "label_ids": tf.FixedLenFeature([], tf.int64),
325 | "is_real_example": tf.FixedLenFeature([], tf.int64),
326 | }
327 |
328 | def _decode_record(record, name_to_features):
329 | """Decodes a record to a TensorFlow example."""
330 | example = tf.parse_single_example(record, name_to_features)
331 |
332 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
333 | # So cast all int64 to int32.
334 | for name in list(example.keys()):
335 | t = example[name]
336 | if t.dtype == tf.int64:
337 | t = tf.to_int32(t)
338 | example[name] = t
339 |
340 | return example
341 |
342 | def input_fn(params):
343 | """The actual input function."""
344 | batch_size = params["batch_size"]
345 |
346 | d = tf.data.TFRecordDataset(input_file)
347 | if is_training:
348 | d = d.repeat()
349 | d = d.shuffle(buffer_size=100)
350 |
351 | d = d.apply(
352 | tf.contrib.data.map_and_batch(
353 | lambda record: _decode_record(record, name_to_features),
354 | batch_size=batch_size,
355 | drop_remainder=drop_remainder))
356 |
357 | return d
358 |
359 | return input_fn
360 |
361 |
362 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
363 | """Truncates a sequence pair in place to the maximum length."""
364 |
365 | while True:
366 | total_length = len(tokens_a) + len(tokens_b)
367 | if total_length <= max_length:
368 | break
369 | if len(tokens_a) > len(tokens_b):
370 | tokens_a.pop()
371 | else:
372 | tokens_b.pop()
373 |
374 |
375 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
376 | labels, num_labels, use_one_hot_embeddings):
377 | """Creates a classification model."""
378 | model = modeling.BertModel(
379 | config=bert_config,
380 | is_training=is_training,
381 | input_ids=input_ids,
382 | input_mask=input_mask,
383 | token_type_ids=segment_ids,
384 | use_one_hot_embeddings=use_one_hot_embeddings)
385 |
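  # Pooled output: the final hidden state of the [CLS] token passed through the
  # pooler dense layer, shape [batch_size, hidden_size].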
386 | output_layer = model.get_pooled_output()
387 |
388 | hidden_size = output_layer.shape[-1].value
389 |
390 | output_weights = tf.get_variable(
391 | "output_weights", [num_labels, hidden_size],
392 | initializer=tf.truncated_normal_initializer(stddev=0.02))
393 |
394 | output_bias = tf.get_variable(
395 | "output_bias", [num_labels], initializer=tf.zeros_initializer())
396 |
397 | with tf.variable_scope("loss"):
398 | if is_training:
399 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
400 |
401 | logits = tf.matmul(output_layer, output_weights, transpose_b=True)
402 | logits = tf.nn.bias_add(logits, output_bias)
403 |
404 | log_probs = tf.nn.log_softmax(logits, axis=-1)
405 |
406 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
407 |
408 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
409 | loss = tf.reduce_mean(per_example_loss)
410 |
411 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
412 |
413 | return loss, per_example_loss, predictions
414 |
415 |
416 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
417 | num_train_steps, num_warmup_steps, use_tpu,
418 | use_one_hot_embeddings):
419 | """Returns `model_fn` closure for TPUEstimator."""
420 |
421 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument
422 | """The `model_fn` for TPUEstimator."""
423 |
424 | tf.logging.info("*** Features ***")
425 | for name in sorted(features.keys()):
426 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
427 |
428 | input_ids = features["input_ids"]
429 | input_mask = features["input_mask"]
430 | segment_ids = features["segment_ids"]
431 | label_ids = features["label_ids"]
432 | is_real_example = None
433 | if "is_real_example" in features:
434 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
435 | else:
436 | is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)
437 |
438 | is_training = (mode == tf.estimator.ModeKeys.TRAIN)
439 |
440 | (total_loss, per_example_loss, predictions) = create_model(
441 | bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
442 | num_labels, use_one_hot_embeddings)
443 |
444 | tvars = tf.trainable_variables()
445 | initialized_variable_names = {}
446 | scaffold_fn = None
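    # Initialize matching variables from the pre-trained BERT checkpoint.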
447 | if init_checkpoint:
448 | (assignment_map, initialized_variable_names
449 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
450 | if use_tpu:
451 | def tpu_scaffold():
452 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
453 | return tf.train.Scaffold()
454 |
455 | scaffold_fn = tpu_scaffold
456 | else:
457 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
458 |
459 | tf.logging.info("**** Trainable Variables ****")
460 | for var in tvars:
461 | init_string = ""
462 | if var.name in initialized_variable_names:
463 | init_string = ", *INIT_FROM_CKPT*"
464 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
465 | init_string)
466 |
467 | if mode == tf.estimator.ModeKeys.TRAIN:
468 |       # Add a logging hook for the loss; otherwise it is not printed when running on GPU/CPU.
469 | logging_hook = tf.train.LoggingTensorHook({"loss": total_loss}, every_n_iter=10)
470 | train_op = optimization.create_optimizer(
471 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
472 | output_spec = tf.contrib.tpu.TPUEstimatorSpec(
473 | mode=mode,
474 | loss=total_loss,
475 | train_op=train_op,
476 | training_hooks=[logging_hook],
477 | scaffold_fn=scaffold_fn)
478 | elif mode == tf.estimator.ModeKeys.EVAL:
479 | def metric_fn(per_example_loss, label_ids, is_real_example):
480 | accuracy = tf.metrics.accuracy(
481 | labels=label_ids, predictions=predictions, weights=is_real_example)
482 | loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example)
483 | return {
484 | "eval_accuracy": accuracy,
485 | "eval_loss": loss,
486 | }
487 |
488 | eval_metrics = (metric_fn,
489 | [per_example_loss, label_ids, is_real_example])
490 | output_spec = tf.contrib.tpu.TPUEstimatorSpec(
491 | mode=mode,
492 | loss=total_loss,
493 | eval_metrics=eval_metrics,
494 | scaffold_fn=scaffold_fn)
495 | else:
496 | output_spec = tf.contrib.tpu.TPUEstimatorSpec(
497 | mode=mode,
498 | predictions={"predictions": predictions},
499 | scaffold_fn=scaffold_fn)
500 | return output_spec
501 |
502 | return model_fn
503 |
504 |
505 | def main():
506 | tf.logging.set_verbosity(tf.logging.INFO)
507 |
508 | processors = {
509 | "tnews": TnewsProcessor,
510 | }
511 |
512 | tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case,
513 | FLAGS.init_checkpoint)
514 |
515 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
516 | raise ValueError(
517 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.")
518 |
519 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
520 |
521 | if FLAGS.max_seq_length > bert_config.max_position_embeddings:
522 | raise ValueError(
523 | "Cannot use sequence length %d because the BERT model "
524 | "was only trained up to sequence length %d" %
525 | (FLAGS.max_seq_length, bert_config.max_position_embeddings))
526 |
527 | tf.gfile.MakeDirs(FLAGS.output_dir)
528 |
529 | task_name = FLAGS.task_name.lower()
530 |
531 | if task_name not in processors:
532 | raise ValueError("Task not found: %s" % (task_name))
533 |
534 | processor = processors[task_name]()
535 |
536 | label_list = processor.get_labels()
537 |
538 | tokenizer = tokenization.FullTokenizer(
539 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
540 |
541 | tpu_cluster_resolver = None
542 | if FLAGS.use_tpu and FLAGS.tpu_name:
543 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
544 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
545 |
546 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
547 | run_config = tf.contrib.tpu.RunConfig(
548 | cluster=tpu_cluster_resolver,
549 | master=FLAGS.master,
550 | model_dir=FLAGS.output_dir,
551 | save_checkpoints_steps=FLAGS.save_checkpoints_steps,
552 | tpu_config=tf.contrib.tpu.TPUConfig(
553 | iterations_per_loop=FLAGS.iterations_per_loop,
554 | num_shards=FLAGS.num_tpu_cores,
555 | per_host_input_for_training=is_per_host))
556 |
557 | train_examples = None
558 | num_train_steps = None
559 | num_warmup_steps = None
560 | if FLAGS.do_train:
561 | train_examples = processor.get_train_examples(FLAGS.data_dir)
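    # Total optimization steps = (num examples / batch size) * epochs;
    # warmup covers the first warmup_proportion of them.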
562 | num_train_steps = int(
563 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
564 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
565 |
566 | model_fn = model_fn_builder(
567 | bert_config=bert_config,
568 | num_labels=len(label_list),
569 | init_checkpoint=FLAGS.init_checkpoint,
570 | learning_rate=FLAGS.learning_rate,
571 | num_train_steps=num_train_steps,
572 | num_warmup_steps=num_warmup_steps,
573 | use_tpu=FLAGS.use_tpu,
574 | use_one_hot_embeddings=FLAGS.use_tpu)
575 |
576 | # If TPU is not available, this will fall back to normal Estimator on CPU
577 | # or GPU.
578 | estimator = tf.contrib.tpu.TPUEstimator(
579 | use_tpu=FLAGS.use_tpu,
580 | model_fn=model_fn,
581 | config=run_config,
582 | train_batch_size=FLAGS.train_batch_size,
583 | eval_batch_size=FLAGS.eval_batch_size,
584 | predict_batch_size=FLAGS.predict_batch_size)
585 |
586 | if FLAGS.do_train:
587 | train_file = os.path.join(FLAGS.data_dir, "train.tf_record")
588 | file_based_convert_examples_to_features(
589 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
590 | tf.logging.info("***** Running training *****")
591 | tf.logging.info(" Num examples = %d", len(train_examples))
592 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
593 | tf.logging.info(" Num steps = %d", num_train_steps)
594 | train_input_fn = file_based_input_fn_builder(
595 | input_file=train_file,
596 | seq_length=FLAGS.max_seq_length,
597 | is_training=True,
598 | drop_remainder=True)
599 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
600 |
601 | if FLAGS.do_eval:
602 | eval_examples = processor.get_dev_examples(FLAGS.data_dir)
603 | num_actual_eval_examples = len(eval_examples)
604 | if FLAGS.use_tpu:
605 | while len(eval_examples) % FLAGS.eval_batch_size != 0:
606 | eval_examples.append(PaddingInputExample())
607 |
608 | eval_file = os.path.join(FLAGS.data_dir, "eval.tf_record")
609 | file_based_convert_examples_to_features(
610 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file)
611 |
612 | tf.logging.info("***** Running evaluation *****")
613 | tf.logging.info(" Num examples = %d (%d actual, %d padding)",
614 | len(eval_examples), num_actual_eval_examples,
615 | len(eval_examples) - num_actual_eval_examples)
616 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)
617 |
618 | # This tells the estimator to run through the entire set.
619 | eval_steps = None
620 | if FLAGS.use_tpu:
621 | assert len(eval_examples) % FLAGS.eval_batch_size == 0
622 | eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size)
623 |
624 | eval_drop_remainder = True if FLAGS.use_tpu else False
625 | eval_input_fn = file_based_input_fn_builder(
626 | input_file=eval_file,
627 | seq_length=FLAGS.max_seq_length,
628 | is_training=False,
629 | drop_remainder=eval_drop_remainder)
630 |
631 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
632 |
633 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
634 | with tf.gfile.GFile(output_eval_file, "w") as writer:
635 | tf.logging.info("***** Eval results *****")
636 | for key in sorted(result.keys()):
637 | tf.logging.info(" %s = %s", key, str(result[key]))
638 | writer.write("%s = %s\n" % (key, str(result[key])))
639 |
640 | if FLAGS.do_predict:
641 |     # Mapping from predicted class index to TNEWS label id, and from label id to description.
642 | label_dict = {0: 100, 1: 101, 2: 102, 3: 103,
643 | 4: 104, 5: 106, 6: 107, 7: 108,
644 | 8: 109, 9: 110, 10: 112, 11: 113,
645 | 12: 114, 13: 115, 14: 116}
646 | label_desc = {100: "news_story", 101: "news_culture", 102: "news_entertainment",
647 | 103: "news_sports", 104: "news_finance", 106: "news_house",
648 | 107: "news_car", 108: "news_edu", 109: "news_tech",
649 | 110: "news_military", 112: "news_travel", 113: "news_world",
650 | 114: "news_stock", 115: "news_agriculture", 116: "news_game"}
651 |
652 | predict_examples = processor.get_test_examples(FLAGS.data_dir)
653 | num_actual_predict_examples = len(predict_examples)
654 | test_file = os.path.join(FLAGS.data_dir, "test.tf_record")
655 | file_based_convert_examples_to_features(predict_examples, label_list,
656 | FLAGS.max_seq_length, tokenizer,
657 | test_file)
658 |
659 | tf.logging.info("***** Running prediction*****")
660 | tf.logging.info(" Num examples = %d (%d actual, %d padding)",
661 | len(predict_examples), num_actual_predict_examples,
662 | len(predict_examples) - num_actual_predict_examples)
663 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size)
664 |
665 | predict_drop_remainder = True if FLAGS.use_tpu else False
666 | predict_input_fn = file_based_input_fn_builder(
667 | input_file=test_file,
668 | seq_length=FLAGS.max_seq_length,
669 | is_training=False,
670 | drop_remainder=predict_drop_remainder)
671 |
672 | results = estimator.predict(input_fn=predict_input_fn)
673 |
674 | output_file = os.path.join(FLAGS.output_dir, 'news_predict.json')
675 | with open(output_file, 'w', encoding='utf-8') as fr:
676 | print(results)
677 | for index, result in enumerate(results):
678 | pre_id = result['predictions']
679 | print(f'the index is {index} preid is {pre_id}')
680 | label = label_dict.get(pre_id)
681 | label_d = label_desc.get(label)
682 |
683 | json_str = json.dumps({"id": index, "label": str(label), "label_desc": label_d})
684 | fr.write(json_str)
685 | fr.write('\n')
686 |
687 |
688 | if __name__ == "__main__":
689 | main()
690 |
--------------------------------------------------------------------------------
/bert_downstream/train_multi_learning.py:
--------------------------------------------------------------------------------
1 | """ BERT finetuning runner for multi-learning task """
2 |
3 | import collections
4 | import math
5 | import os
6 | import random
7 | import pandas as pd
8 | import numpy as np
9 | import json
10 | import tqdm
11 |
12 | import bert_master.modeling as modeling
13 | import bert_master.optimization as optimization
14 | import bert_master.tokenization as tokenization
15 | import tensorflow as tf
16 |
17 | flags = tf.flags
18 | FLAGS = flags.FLAGS
19 |
20 | # Required parameters
21 | flags.DEFINE_string(
22 | "data_dir", './data_path/',
23 | "The input data dir. Should contain the .tsv files (or other data files) "
24 | "for the task.")
25 |
26 | flags.DEFINE_string(
27 | "bert_config_file", './pre_trained/bert_config.json',
28 | "The config json file corresponding to the pre-trained BERT model. "
29 | "This specifies the model architecture.")
30 |
31 | flags.DEFINE_string("vocab_file", './pre_trained/vocab.txt',
32 | "The vocabulary file that the BERT model was trained on.")
33 |
34 | flags.DEFINE_string(
35 | "output_dir", './model_ckpt/multi_learning/',
36 | "The output directory where the model checkpoints will be written.")
37 |
38 | flags.DEFINE_string(
39 | "init_checkpoint", './pre_trained/bert_model.ckpt',
40 | "Initial checkpoint (usually from a pre-trained BERT model).")
41 |
42 | flags.DEFINE_bool(
43 | "do_lower_case", True,
44 | "Whether to lower case the input text. Should be True for uncased "
45 | "models and False for cased models.")
46 |
47 | flags.DEFINE_integer(
48 | "max_seq_length", 128,
49 | "The maximum total input sequence length after WordPiece tokenization. "
50 | "Sequences longer than this will be truncated, and sequences shorter "
51 | "than this will be padded.")
52 |
53 | flags.DEFINE_integer("train_batch_size", 16, "Total batch size for training.")
54 |
55 | flags.DEFINE_integer("eval_batch_size", 16, "Total batch size for eval.")
56 |
57 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")
58 |
59 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")
60 |
61 | flags.DEFINE_integer("num_train_epochs", 3,
62 | "Total number of training epochs to perform.")
63 |
64 | flags.DEFINE_float(
65 | "warmup_proportion", 0.1,
66 | "Proportion of training to perform linear learning rate warmup for. "
67 | "E.g., 0.1 = 10% of training.")
68 |
69 |
70 | class InputExample(object):
71 | """A single training/test example for simple sequence classification."""
72 |
73 | def __init__(self, guid, text_a, text_b=None, label=None, task=None):
74 | """Constructs a InputExample.
75 |
76 | Args:
77 | guid: Unique id for the example.
78 | text_a: string. The untokenized text of the first sequence. For single
79 | sequence tasks, only this sequence must be specified.
80 | text_b: (Optional) string. The untokenized text of the second sequence.
81 | Only must be specified for sequence pair tasks.
82 | label: (Optional) string. The label of the example. This should be
83 | specified for train and dev examples, but not for test examples.
84 | """
85 | self.guid = guid
86 | self.text_a = text_a
87 | self.text_b = text_b
88 | self.label = label
89 | self.task = task
90 |
91 |
92 | class PaddingInputExample(object):
93 | """Fake example so the num input examples is a multiple of the batch size.
94 |
95 | When running eval/predict on the TPU, we need to pad the number of examples
96 | to be a multiple of the batch size, because the TPU requires a fixed batch
97 | size. The alternative is to drop the last batch, which is bad because it means
98 | the entire output data won't be generated.
99 |
100 | We use this class instead of `None` because treating `None` as padding
101 |   batches could cause silent errors.
102 | """
103 |
104 |
105 | class InputFeatures(object):
106 | """A single set of features of data."""
107 |
108 | def __init__(self,
109 | input_ids,
110 | input_mask,
111 | segment_ids,
112 | label_id,
113 | task,
114 | is_real_example=True):
115 | self.input_ids = input_ids
116 | self.input_mask = input_mask
117 | self.segment_ids = segment_ids
118 | self.label_id = label_id
119 | self.task = task
120 | self.is_real_example = is_real_example
121 |
122 |
123 | class DataProcessor(object):
124 | """Base class for data converters for sequence classification data sets."""
125 |
126 | def get_train_examples(self, data_dir):
127 | """Gets a collection of `InputExample`s for the train set."""
128 | raise NotImplementedError()
129 |
130 | def get_dev_examples(self, data_dir):
131 | """Gets a collection of `InputExample`s for the dev set."""
132 | raise NotImplementedError()
133 |
134 | def get_test_examples(self, data_dir):
135 | """Gets a collection of `InputExample`s for prediction."""
136 | raise NotImplementedError()
137 |
138 | def get_labels(self):
139 | """Gets the list of labels for this data set."""
140 | raise NotImplementedError()
141 |
142 | @classmethod
143 | def _read_csv(cls, input_file, task):
144 | data = pd.read_csv(input_file, sep='\t', encoding='utf-8', header=None)
145 | if task == 'nli':
146 | data.columns = ['id', 'texta', 'textb', 'label']
147 | else:
148 | data.columns = ['id', 'text', 'label']
149 | lines = []
150 | for index, row in data.iterrows():
151 | if task == 'nli':
152 | lines.append((row['texta'], row['textb'], row['label']))
153 | else:
154 | lines.append((row['text'], row['label']))
155 | return lines
156 |
157 | @classmethod
158 | def _read_test(cls, input_file, task):
159 | data = pd.read_csv(input_file, sep='\t', encoding='utf-8', header=None)
160 | if task == 'nli':
161 | data.columns = ['id', 'texta', 'textb']
162 | else:
163 | data.columns = ['id', 'text']
164 | lines = []
165 |     # Keep the id so the prediction file can be matched for submission.
166 | for index, row in data.iterrows():
167 | if task == 'nli':
168 | lines.append((row['id'], row['texta'], row['textb']))
169 | else:
170 | lines.append((row['id'], row['text']))
171 | return lines
172 |
173 |
174 | class AllProcessor(DataProcessor):
175 | """Processor for the CoLA data set (GLUE version)."""
176 |
177 | def get_train_examples(self, data_dir):
178 | """See base class."""
179 | emotion_dir = os.path.join(data_dir, 'train_emotion.csv')
180 | news_dir = os.path.join(data_dir, 'train_news.csv')
181 | nli_dir = os.path.join(data_dir, 'train_nli.csv')
182 | emotion_lines = self._read_csv(emotion_dir, 'emotion')
183 | news_lines = self._read_csv(news_dir, 'news')
184 | nli_lines = self._read_csv(nli_dir, 'nli')
185 | return self._create_examples(emotion_lines, news_lines, nli_lines, 'train')
186 |
187 | def get_dev_examples(self, data_dir):
188 | """See base class."""
189 | emotion_dir = os.path.join(data_dir, 'dev_emotion.csv')
190 | news_dir = os.path.join(data_dir, 'dev_news.csv')
191 | nli_dir = os.path.join(data_dir, 'dev_nli.csv')
192 | emotion_lines = self._read_csv(emotion_dir, 'emotion')
193 | news_lines = self._read_csv(news_dir, 'news')
194 | nli_lines = self._read_csv(nli_dir, 'nli')
195 | return self._create_examples(emotion_lines, news_lines, nli_lines, 'dev')
196 |
197 | def get_test_examples(self, data_dir):
198 | """See base class."""
199 | emotion_dir = os.path.join(data_dir, 'test_emotion.csv')
200 | news_dir = os.path.join(data_dir, 'test_news.csv')
201 | nli_dir = os.path.join(data_dir, 'test_nli.csv')
202 | emotion_lines = self._read_test(emotion_dir, 'emotion')
203 | news_lines = self._read_test(news_dir, 'news')
204 | nli_lines = self._read_test(nli_dir, 'nli')
205 | return self._create_examples(emotion_lines, news_lines, nli_lines, 'test')
206 |
207 | def get_labels(self):
208 | """See base class."""
209 | return [['sadness', 'anger', 'happiness', 'fear', 'like',
210 | 'disgust', 'surprise'],
211 | ['108', '104', '106', '112', '109', '103', '116', '101',
212 | '107', '100', '102', '110', '115', '113', '114'],
213 | ['0', '1', '2']]
214 |
215 | def _create_examples(self, emotion_lines, news_lines, nli_lines, set_type):
216 | """Creates examples for the training and dev sets."""
217 | examples = []
218 |
219 | # emotion
220 | for (i, line) in enumerate(emotion_lines):
221 | guid = "%s-%s" % (set_type, i)
222 | if set_type == "test":
223 | text_a = tokenization.convert_to_unicode(line[1])
224 | label = None
225 | guid = line[0]
226 | else:
227 | text_a = tokenization.convert_to_unicode(line[0])
228 | label = tokenization.convert_to_unicode(line[1])
229 | examples.append(
230 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label, task='1'))
231 |
232 | # news
233 | for i, line in enumerate(news_lines):
234 | guid = f'news_{set_type}_{i}'
235 | if set_type == 'test':
236 | text_a = tokenization.convert_to_unicode(line[1])
237 | label = None
238 | guid = line[0]
239 | else:
240 | text_a = tokenization.convert_to_unicode(line[0])
241 | label = tokenization.convert_to_unicode(str(line[1]))
242 |
243 | examples.append(
244 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label, task='2'))
245 |
246 | # nli
247 | for i, line in enumerate(nli_lines):
248 |         guid = f'nli_{set_type}_{i}'
249 | if set_type == 'test':
250 | text_a = tokenization.convert_to_unicode(line[1])
251 | text_b = tokenization.convert_to_unicode(line[2])
252 | label = None
253 | guid = line[0]
254 | else:
255 | text_a = tokenization.convert_to_unicode(line[0])
256 | text_b = tokenization.convert_to_unicode(line[1])
257 | label = tokenization.convert_to_unicode(str(line[2]))
258 |
259 | examples.append(
260 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label, task='3'))
261 |
262 | return examples
263 |
264 |
265 | def convert_single_example(ex_index, example, label_list, max_seq_length,
266 | tokenizer):
267 | """Converts a single `InputExample` into a single `InputFeatures`."""
268 |
269 | emotion_label_map = {}
270 | news_label_map = {}
271 | nli_label_map = {}
272 | for (i, label) in enumerate(label_list[0]):
273 | emotion_label_map[label] = i
274 | for (i, label) in enumerate(label_list[1]):
275 | news_label_map[label] = i
276 | for (i, label) in enumerate(label_list[2]):
277 | nli_label_map[label] = i
278 |
279 | tokens_a = tokenizer.tokenize(example.text_a)
280 | tokens_b = None
281 | if example.text_b:
282 | tokens_b = tokenizer.tokenize(example.text_b)
283 |
284 | if tokens_b:
285 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
286 | else:
287 | if len(tokens_a) > max_seq_length - 2:
288 | tokens_a = tokens_a[0:(max_seq_length - 2)]
289 |
290 | tokens = []
291 | segment_ids = []
292 | tokens.append("[CLS]")
293 | segment_ids.append(0)
294 | for token in tokens_a:
295 | tokens.append(token)
296 | segment_ids.append(0)
297 | tokens.append("[SEP]")
298 | segment_ids.append(0)
299 |
300 | if tokens_b:
301 | for token in tokens_b:
302 | tokens.append(token)
303 | segment_ids.append(1)
304 | tokens.append("[SEP]")
305 | segment_ids.append(1)
306 |
307 | input_ids = tokenizer.convert_tokens_to_ids(tokens)
308 |
309 | input_mask = [1] * len(input_ids)
310 |
311 | # Zero-pad up to the sequence length.
312 | while len(input_ids) < max_seq_length:
313 | input_ids.append(0)
314 | input_mask.append(0)
315 | segment_ids.append(0)
316 |
317 | assert len(input_ids) == max_seq_length
318 | assert len(input_mask) == max_seq_length
319 | assert len(segment_ids) == max_seq_length
320 | task = example.task
321 | if example.label:
322 | if task == '1': label_id = emotion_label_map[example.label]
323 | if task == '2': label_id = news_label_map[example.label]
324 | if task == '3': label_id = nli_label_map[example.label]
325 | else:
326 | label_id = 0
327 |
328 | if ex_index < 5:
329 | tf.logging.info("*** Example ***")
330 | tf.logging.info("guid: %s" % (example.guid))
331 | tf.logging.info("tokens: %s" % " ".join(
332 | [tokenization.printable_text(x) for x in tokens]))
333 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
334 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
335 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
336 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
337 |
338 | feature = InputFeatures(
339 | input_ids=input_ids,
340 | input_mask=input_mask,
341 | segment_ids=segment_ids,
342 | label_id=label_id,
343 | task=int(task),
344 | is_real_example=True)
345 | return feature
346 |
347 |
348 | def file_based_convert_examples_to_features(
349 | examples, label_list, max_seq_length, tokenizer, output_file, type):
350 | """Convert a set of `InputExample`s to a TFRecord file."""
351 |
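  # One TFRecord file per task, so every training batch contains a single task's examples.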
352 | emotion_out = os.path.join(output_file, f'emotion_{type}.record')
353 | news_out = os.path.join(output_file, f'news_{type}.record')
354 | nli_out = os.path.join(output_file, f'nli_{type}.record')
355 |
356 | emotion_writer = tf.python_io.TFRecordWriter(emotion_out)
357 | news_writer = tf.python_io.TFRecordWriter(news_out)
358 | nli_writer = tf.python_io.TFRecordWriter(nli_out)
359 |
360 | emotion_cnt = 0
361 | news_cnt = 0
362 | nli_cnt = 0
363 | for (ex_index, example) in enumerate(examples):
364 | if ex_index % 10000 == 0:
365 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
366 |
367 | feature = convert_single_example(ex_index, example, label_list,
368 | max_seq_length, tokenizer)
369 |
370 | def create_int_feature(values):
371 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
372 | return f
373 |
374 | features = collections.OrderedDict()
375 | features["input_ids"] = create_int_feature(feature.input_ids)
376 | features["input_mask"] = create_int_feature(feature.input_mask)
377 | features["segment_ids"] = create_int_feature(feature.segment_ids)
378 | features["label_ids"] = create_int_feature([feature.label_id])
379 | features["task"] = create_int_feature([feature.task])
380 | features["is_real_example"] = create_int_feature(
381 | [int(feature.is_real_example)])
382 |
383 | tf_example = tf.train.Example(features=tf.train.Features(feature=features))
384 |
385 | if feature.task == 1:
386 | emotion_cnt += 1
387 | emotion_writer.write(tf_example.SerializeToString())
388 | if feature.task == 2:
389 | news_cnt += 1
390 | news_writer.write(tf_example.SerializeToString())
391 | if feature.task == 3:
392 | nli_cnt += 1
393 | nli_writer.write(tf_example.SerializeToString())
394 |
395 | emotion_writer.close()
396 | news_writer.close()
397 | nli_writer.close()
398 | print(f'the emotion news nli cnt is {emotion_cnt} {news_cnt} {nli_cnt}')
399 | return emotion_cnt, news_cnt, nli_cnt
400 |
401 |
402 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
403 | """Truncates a sequence pair in place to the maximum length."""
404 |
405 | while True:
406 | total_length = len(tokens_a) + len(tokens_b)
407 | if total_length <= max_length:
408 | break
409 | if len(tokens_a) > len(tokens_b):
410 | tokens_a.pop()
411 | else:
412 | tokens_b.pop()
413 |
414 |
415 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
416 | labels, use_one_hot_embeddings, task):
417 | """Creates a classification model."""
418 | model = modeling.BertModel(
419 | config=bert_config,
420 | is_training=is_training,
421 | input_ids=input_ids,
422 | input_mask=input_mask,
423 | token_type_ids=segment_ids,
424 | use_one_hot_embeddings=use_one_hot_embeddings)
425 |
426 | output_layer = model.get_pooled_output()
427 |
428 | hidden_size = output_layer.shape[-1].value
429 |
430 |   # Three fully connected heads, one per task (emotion: 7 classes, news: 15, nli: 3).
431 | emotion_weights = tf.get_variable(
432 | "emotion_weights", [7, hidden_size],
433 | initializer=tf.truncated_normal_initializer(stddev=0.02))
434 | emotion_bias = tf.get_variable(
435 | "emotion_bias", [7], initializer=tf.zeros_initializer())
436 |
437 | news_weights = tf.get_variable(
438 | "news_weights", [15, hidden_size],
439 | initializer=tf.truncated_normal_initializer(stddev=0.02))
440 | news_bias = tf.get_variable(
441 | "news_bias", [15], initializer=tf.zeros_initializer())
442 |
443 | nli_weights = tf.get_variable(
444 | "nli_weights", [3, hidden_size],
445 | initializer=tf.truncated_normal_initializer(stddev=0.02))
446 | nli_bias = tf.get_variable(
447 | "nli_bias", [3], initializer=tf.zeros_initializer())
448 |
449 | if is_training:
450 | # I.e., 0.1 dropout
451 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
452 |
453 | emotion_logits = tf.matmul(output_layer, emotion_weights, transpose_b=True)
454 | emotion_logits = tf.nn.bias_add(emotion_logits, emotion_bias)
455 |
456 | news_logits = tf.matmul(output_layer, news_weights, transpose_b=True)
457 | news_logits = tf.nn.bias_add(news_logits, news_bias)
458 |
459 | nli_logits = tf.matmul(output_layer, nli_weights, transpose_b=True)
460 | nli_logits = tf.nn.bias_add(nli_logits, nli_bias)
461 |
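  # Route the shared pooled output to the head of the current task
  # (1 = emotion, 2 = news, 3 = nli), selected by the `task` placeholder at run time.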
462 | logits = tf.cond(
463 | tf.equal(task, 1),
464 | lambda: emotion_logits,
465 | lambda: tf.cond(tf.equal(task, 2), lambda: news_logits, lambda: nli_logits)
466 | )
467 | depth = tf.cond(
468 | tf.equal(task, 1),
469 | lambda: 7,
470 | lambda: tf.cond(tf.equal(task, 2), lambda: 15, lambda: 3)
471 | )
472 |
473 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int64, name='pre_id')
474 |
475 | with tf.variable_scope("loss"):
476 | log_probs = tf.nn.log_softmax(logits, axis=-1)
477 | one_hot_labels = tf.one_hot(labels, depth=depth, dtype=tf.float32)
478 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
479 | loss = tf.reduce_mean(per_example_loss)
480 |
481 | equals = tf.reduce_sum(tf.cast(tf.equal(predictions, labels), tf.int64))
482 | acc = equals / FLAGS.eval_batch_size
483 | return loss, logits, acc, predictions
484 |
485 |
486 | def get_input_data(input_file, seq_len, batch_size, is_training):
487 | def parser(record):
488 | name_to_features = {
489 | "input_ids": tf.FixedLenFeature([seq_len], tf.int64),
490 | "input_mask": tf.FixedLenFeature([seq_len], tf.int64),
491 | "segment_ids": tf.FixedLenFeature([seq_len], tf.int64),
492 | "label_ids": tf.FixedLenFeature([], tf.int64),
493 | }
494 |         # tf.Example features must be parsed as int64.
495 | example = tf.parse_single_example(record, features=name_to_features)
496 | input_ids = example["input_ids"]
497 | input_mask = example["input_mask"]
498 | segment_ids = example["segment_ids"]
499 | labels = example["label_ids"]
500 | return input_ids, input_mask, segment_ids, labels
501 |
502 | dataset = tf.data.TFRecordDataset(input_file)
503 | if is_training:
504 | dataset = dataset.map(parser).batch(batch_size).shuffle(buffer_size=3000)
505 | else:
506 | dataset = dataset.map(parser).batch(batch_size)
507 | iterator = dataset.make_one_shot_iterator()
508 | input_ids, input_mask, segment_ids, labels = iterator.get_next()
509 | return input_ids, input_mask, segment_ids, labels
510 |
511 |
512 | def main():
513 | """ 训练主入口 """
514 | tf.logging.info('start to train')
515 |
516 |     # Processor, label list, and tokenizer setup
517 | process = AllProcessor()
518 | label_list = process.get_labels()
519 | tokenizer = tokenization.FullTokenizer(
520 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
521 |
522 | train_examples = process.get_train_examples(FLAGS.data_dir)
523 | train_cnt = file_based_convert_examples_to_features(
524 | train_examples,
525 | label_list,
526 | FLAGS.max_seq_length,
527 | tokenizer,
528 | FLAGS.data_dir,
529 | 'train'
530 | )
531 | dev_examples = process.get_dev_examples(FLAGS.data_dir)
532 | dev_cnt = file_based_convert_examples_to_features(
533 | dev_examples,
534 | label_list,
535 | FLAGS.max_seq_length,
536 | tokenizer,
537 | FLAGS.data_dir,
538 | 'dev'
539 | )
540 |
541 |     # Model input placeholders
542 | input_ids = tf.placeholder(tf.int64, shape=[None, FLAGS.max_seq_length],
543 | name='input_ids')
544 | input_mask = tf.placeholder(tf.int64, shape=[None, FLAGS.max_seq_length],
545 | name='input_mask')
546 | segment_ids = tf.placeholder(tf.int64, shape=[None, FLAGS.max_seq_length],
547 | name='segment_ids')
548 | labels = tf.placeholder(tf.int64, shape=[None], name='labels')
549 | task = tf.placeholder(tf.int64, name='task')
550 |
551 |     # BERT configuration
552 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
553 |
554 | loss, logits, acc, pre_id = create_model(
555 | bert_config,
556 | True,
557 | input_ids,
558 | input_mask,
559 | segment_ids,
560 | labels,
561 | False,
562 | task
563 | )
564 | num_train_steps = int(len(train_examples) / FLAGS.train_batch_size)
565 | num_warmup_steps = math.ceil(
566 |         num_train_steps * FLAGS.num_train_epochs * FLAGS.warmup_proportion)
567 | train_op = optimization.create_optimizer(
568 | loss,
569 | FLAGS.learning_rate,
570 | num_train_steps * FLAGS.num_train_epochs,
571 | num_warmup_steps,
572 | False
573 | )
574 |
575 |     # Initialize variables; the saver excludes Adam slot variables to keep checkpoints small.
576 | init_global = tf.global_variables_initializer()
577 | saver = tf.train.Saver(
578 | [v for v in tf.global_variables()
579 | if 'adam_v' not in v.name and 'adam_m' not in v.name])
580 |
581 | with tf.Session() as sess:
582 | sess.run(init_global)
583 | print('start to load bert params')
584 | if FLAGS.init_checkpoint:
585 | # tvars = tf.global_variables()
586 | tvars = tf.trainable_variables()
587 | print("global_variables", len(tvars))
588 | assignment_map, initialized_variable_names = \
589 | modeling.get_assignment_map_from_checkpoint(tvars,
590 | FLAGS.init_checkpoint)
591 | print("initialized_variable_names:", len(initialized_variable_names))
592 | saver_ = tf.train.Saver([v for v in tvars if v.name in initialized_variable_names])
593 | saver_.restore(sess, FLAGS.init_checkpoint)
594 | tvars = tf.global_variables()
595 | # initialized_vars = [v for v in tvars if v.name in initialized_variable_names]
596 | not_initialized_vars = [v for v in tvars if v.name not in initialized_variable_names]
597 | print('all size %s; not initialized size %s' % (len(tvars), len(not_initialized_vars)))
598 | if len(not_initialized_vars):
599 | sess.run(tf.variables_initializer(not_initialized_vars))
600 | # for v in initialized_vars:
601 | # print('initialized: %s, shape = %s' % (v.name, v.shape))
602 | # for v in not_initialized_vars:
603 | # print('not initialized: %s, shape = %s' % (v.name, v.shape))
604 | else:
605 | print('the bert init checkpoint is None!!!')
606 | sess.run(tf.global_variables_initializer())
607 |
608 |         # One training step
609 | def train_step(ids, mask, seg, true_y, task_id):
610 | feed = {input_ids: ids,
611 | input_mask: mask,
612 | segment_ids: seg,
613 | labels: true_y,
614 | task: task_id}
615 | _, logits_out, loss_out = sess.run([train_op, logits, loss], feed_dict=feed)
616 | return logits_out, loss_out
617 |
618 |         # One evaluation step
619 | def dev_step(ids, mask, seg, true_y, task_id):
620 | feed = {input_ids: ids,
621 | input_mask: mask,
622 | segment_ids: seg,
623 | labels: true_y,
624 | task: task_id}
625 | pre_out, acc_out = sess.run([pre_id, acc], feed_dict=feed)
626 | return pre_out, acc_out
627 |
628 |         # Training loop
629 | for epoch in range(FLAGS.num_train_epochs):
630 | tf.logging.info(f'start to train and the epoch:{epoch}')
631 | epoch_loss = do_train(sess, train_cnt, train_step, epoch)
632 | tf.logging.info(f'the epoch{epoch} loss is {epoch_loss}')
633 | saver.save(sess, FLAGS.output_dir + 'bert.ckpt', global_step=epoch)
634 |             # Evaluate the model after every epoch
635 | do_eval(sess, dev_cnt, dev_step)
636 |
637 |         # Run prediction and save the results
638 | do_predict(label_list, process, tokenizer, dev_step)
639 |
640 | tf.logging.info('the training is over!!!!')
641 |
642 |
643 | def set_random_task(train_cnt):
644 | """ 任务采样 : 各任务每个epoch 迭代的step次数 """
645 | # emotion cnt
646 | emotion_cnt = train_cnt[0] // FLAGS.train_batch_size
647 | news_cnt = train_cnt[1] // FLAGS.train_batch_size
648 | nli_cnt = train_cnt[2] // FLAGS.train_batch_size
649 |
650 | emotion_list = [1] * emotion_cnt
651 | news_list = [2] * news_cnt
652 | nli_list = [3] * nli_cnt
653 |
654 | task_list = emotion_list + news_list + nli_list
655 |
656 | random.shuffle(task_list)
657 |
658 | return task_list
659 |
660 |
661 | def do_train(sess, train_cnt, train_step, epoch):
662 | """ 模型训练 """
663 | emotion_train_file = os.path.join(FLAGS.data_dir, 'emotion_train.record')
664 | news_train_file = os.path.join(FLAGS.data_dir, 'news_train.record')
665 | nli_train_file = os.path.join(FLAGS.data_dir, 'nli_train.record')
666 | ids1, mask1, seg1, labels1 = get_input_data(
667 | emotion_train_file, FLAGS.max_seq_length,
668 | FLAGS.train_batch_size, True)
669 | ids2, mask2, seg2, labels2 = get_input_data(
670 | news_train_file, FLAGS.max_seq_length,
671 | FLAGS.train_batch_size, True)
672 | ids3, mask3, seg3, labels3 = get_input_data(
673 | nli_train_file, FLAGS.max_seq_length,
674 | FLAGS.train_batch_size, True)
675 |
676 |     # Build the shuffled task schedule for this epoch
677 | tasks = set_random_task(train_cnt)
678 |
679 | total_loss = 0
680 | for step, task_id in enumerate(tasks):
681 | if task_id == 1:
682 | ids_train, mask_train, seg_train, y_train = sess.run(
683 | [ids1, mask1, seg1, labels1])
684 | if task_id == 2:
685 | ids_train, mask_train, seg_train, y_train = sess.run(
686 | [ids2, mask2, seg2, labels2])
687 | if task_id == 3:
688 | ids_train, mask_train, seg_train, y_train = sess.run(
689 | [ids3, mask3, seg3, labels3])
690 |
691 | _, step_loss = train_step(ids_train, mask_train, seg_train, y_train, task_id)
692 |
693 | tf.logging.info(f'epoch {epoch} the step loss: {step_loss}')
694 |
695 | total_loss += step_loss
696 |
697 | return total_loss / len(tasks)
698 |
699 |
700 | def do_eval(sess, dev_cnt, dev_step):
701 | """ 模型验证 """
702 | tf.logging.info(f'start to do eval')
703 | emotion_dev_file = os.path.join(FLAGS.data_dir, 'emotion_dev.record')
704 | news_dev_file = os.path.join(FLAGS.data_dir, 'news_dev.record')
705 | nli_dev_file = os.path.join(FLAGS.data_dir, 'nli_dev.record')
706 |
707 | ids1, mask1, seg1, labels1 = get_input_data(
708 | emotion_dev_file, FLAGS.max_seq_length,
709 | FLAGS.eval_batch_size, False)
710 | ids2, mask2, seg2, labels2 = get_input_data(
711 | news_dev_file, FLAGS.max_seq_length,
712 | FLAGS.eval_batch_size, False)
713 | ids3, mask3, seg3, labels3 = get_input_data(
714 | nli_dev_file, FLAGS.max_seq_length,
715 | FLAGS.eval_batch_size, False)
716 |
717 | # 验证emotion的
718 | total_dev_acc = 0
719 | step_cnt = dev_cnt[0] // FLAGS.eval_batch_size
720 | for step in range(step_cnt):
721 | ids_dev, mask_dev, seg_dev, y_dev = sess.run(
722 | [ids1, mask1, seg1, labels1])
723 | _, dev_acc = dev_step(ids_dev, mask_dev, seg_dev, y_dev, 1)
724 | total_dev_acc += dev_acc
725 | tf.logging.info(f'===the emotion acc is {total_dev_acc / step_cnt}===')
726 |
727 | total_dev_acc = 0
728 | step_cnt = dev_cnt[1] // FLAGS.eval_batch_size
729 | for step in range(step_cnt):
730 | ids_dev, mask_dev, seg_dev, y_dev = sess.run(
731 | [ids2, mask2, seg2, labels2])
732 | _, dev_acc = dev_step(ids_dev, mask_dev, seg_dev, y_dev, 2)
733 | total_dev_acc += dev_acc
734 | tf.logging.info(f'===the news acc is {total_dev_acc / step_cnt}===')
735 |
736 | total_dev_acc = 0
737 | step_cnt = dev_cnt[2] // FLAGS.eval_batch_size
738 | for step in range(step_cnt):
739 | ids_dev, mask_dev, seg_dev, y_dev = sess.run(
740 | [ids3, mask3, seg3, labels3])
741 | _, dev_acc = dev_step(ids_dev, mask_dev, seg_dev, y_dev, 3)
742 | total_dev_acc += dev_acc
743 | tf.logging.info(f'===the nli acc is {total_dev_acc / step_cnt}===')
744 |
745 |
746 | def do_predict(label_list, process, tokenizer, dev_step):
747 | """ 预测 """
748 | tf.logging.info('start to do predict')
749 | # 设置标签到索引的对应
750 | emotion_map = {}
751 | news_map = {}
752 | nli_map = {}
753 | for (i, label) in enumerate(label_list[0]):
754 | emotion_map[i] = label
755 | for (i, label) in enumerate(label_list[1]):
756 | news_map[i] = label
757 | for (i, label) in enumerate(label_list[2]):
758 | nli_map[i] = label
759 |
760 | test_examples = process.get_test_examples(FLAGS.data_dir)
761 | emotion_res = []
762 | news_res = []
763 | nli_res = []
764 | batch_size = 1
765 | index = 0
766 | for example in tqdm.tqdm(test_examples):
767 | index += 1
768 | feature = convert_single_example(index, example, label_list,
769 | FLAGS.max_seq_length, tokenizer)
770 | ids = np.reshape([feature.input_ids], (batch_size, FLAGS.max_seq_length))
771 | mask = np.reshape([feature.input_mask], (batch_size, FLAGS.max_seq_length))
772 | seg = np.reshape([feature.segment_ids], (batch_size, FLAGS.max_seq_length))
773 | true_y = np.reshape([0], batch_size)
774 |
775 | task_id = example.task
776 | pred_res, _ = dev_step(ids, mask, seg, true_y, int(task_id))
777 |
778 | guid = str(example.guid).strip()
779 | if task_id == '1':
780 | label_res = emotion_map.get(pred_res[0])
781 | emotion_res.append(json.dumps({"id": str(guid), "label": str(label_res)}))
782 | if task_id == '2':
783 | label_res = news_map.get(pred_res[0])
784 | news_res.append(json.dumps({"id": str(guid), "label": str(label_res)}))
785 | if task_id == '3':
786 | label_res = nli_map.get(pred_res[0])
787 | nli_res.append(json.dumps({"id": str(guid), "label": str(label_res)}))
788 |
789 | # 写入预测文件
790 | with open('./data_path/ocemotion_predict.json', 'w', encoding='utf-8') as fr:
791 | for res in emotion_res:
792 | fr.write(res)
793 | fr.write('\n')
794 |
795 | with open('./data_path/tnews_predict.json', 'w', encoding='utf-8') as fr:
796 | for res in news_res:
797 | fr.write(res)
798 | fr.write('\n')
799 |
800 | with open('./data_path/ocnli_predict.json', 'w', encoding='utf-8') as fr:
801 | for res in nli_res:
802 | fr.write(res)
803 | fr.write('\n')
804 | tf.logging.info('predict and write file over!')
805 |
806 |
807 | if __name__ == "__main__":
808 | tf.logging.set_verbosity(tf.logging.INFO)
809 | main()
810 |
--------------------------------------------------------------------------------
/bert_downstream/train_ner.py:
--------------------------------------------------------------------------------
1 | """BERT finetuning runner for ner (sequence label classification)."""
2 |
3 | import collections
4 | import os
5 | import json
6 | import tensorflow as tf
7 |
8 | import bert_master.modeling as modeling
9 | import bert_master.optimization as optimization
10 | import bert_master.tokenization as tokenization
11 |
12 | flags = tf.flags
13 | FLAGS = flags.FLAGS
14 |
15 | # Required parameters
16 | flags.DEFINE_string(
17 | "data_dir", './data_path/clue_ner/',
18 | "The input data dir. Should contain the .tsv files (or other data files) "
19 | "for the task.")
20 |
21 | flags.DEFINE_string(
22 | "bert_config_file", './pre_trained/bert_config.json',
23 | "The config json file corresponding to the pre-trained BERT model. "
24 | "This specifies the model architecture.")
25 |
26 | flags.DEFINE_string("task_name", 'cluener',
27 | "The name of the task to train.")
28 |
29 | flags.DEFINE_string("vocab_file", './pre_trained/vocab.txt',
30 | "The vocabulary file that the BERT model was trained on.")
31 |
32 | flags.DEFINE_string(
33 | "output_dir", './model_ckpt/clue_ner/',
34 | "The output directory where the model checkpoints will be written.")
35 |
36 | flags.DEFINE_string(
37 | "init_checkpoint", './pre_trained/bert_model.ckpt',
38 | "Initial checkpoint (usually from a pre-trained BERT model).")
39 |
40 | flags.DEFINE_bool(
41 | "do_lower_case", True,
42 | "Whether to lower case the input text. Should be True for uncased "
43 | "models and False for cased models.")
44 |
45 | flags.DEFINE_integer(
46 | "max_seq_length", 128,
47 | "The maximum total input sequence length after WordPiece tokenization. "
48 | "Sequences longer than this will be truncated, and sequences shorter "
49 | "than this will be padded.")
50 |
51 | flags.DEFINE_bool("do_train", False, "Whether to run training.")
52 |
53 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.")
54 |
55 | flags.DEFINE_bool(
56 | "do_predict", True,
57 | "Whether to run the model in inference mode on the test set.")
58 |
59 | flags.DEFINE_integer("train_batch_size", 16, "Total batch size for training.")
60 |
61 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.")
62 |
63 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.")
64 |
65 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.")
66 |
67 | flags.DEFINE_float("num_train_epochs", 1.0,
68 | "Total number of training epochs to perform.")
69 |
70 | flags.DEFINE_float(
71 | "warmup_proportion", 0.1,
72 | "Proportion of training to perform linear learning rate warmup for. "
73 | "E.g., 0.1 = 10% of training.")
74 |
75 | flags.DEFINE_integer("save_checkpoints_steps", 1000,
76 | "How often to save the model checkpoint.")
77 |
78 | flags.DEFINE_integer("iterations_per_loop", 1000,
79 | "How many steps to make in each estimator call.")
80 |
81 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.")
82 |
83 | tf.flags.DEFINE_string(
84 | "tpu_name", None,
85 | "The Cloud TPU to use for training. This should be either the name "
86 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 "
87 | "url.")
88 |
89 | tf.flags.DEFINE_string(
90 | "tpu_zone", None,
91 | "[Optional] GCE zone where the Cloud TPU is located in. If not "
92 | "specified, we will attempt to automatically detect the GCE project from "
93 | "metadata.")
94 |
95 | tf.flags.DEFINE_string(
96 | "gcp_project", None,
97 | "[Optional] Project name for the Cloud TPU-enabled project. If not "
98 | "specified, we will attempt to automatically detect the GCE project from "
99 | "metadata.")
100 |
101 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.")
102 |
103 | flags.DEFINE_integer(
104 | "num_tpu_cores", 8,
105 | "Only used if `use_tpu` is True. Total number of TPU cores to use.")
106 |
107 |
108 | class InputExample(object):
109 | """A single training/test example for simple sequence classification."""
110 |
111 | def __init__(self, guid, text_a, text_b=None, tag=None):
112 | """Constructs a InputExample.
113 |
114 | Args:
115 | guid: Unique id for the example.
116 | text_a: string. The untokenized text of the first sequence. For single
117 | sequence tasks, only this sequence must be specified.
118 | text_b: (Optional) string. The untokenized text of the second sequence.
119 | Only must be specified for sequence pair tasks.
120 | tag: (Optional) string. The tag sequence of the example. This should be
121 | specified for train and dev examples, but not for test examples.
122 | """
123 | self.guid = guid
124 | self.text_a = text_a
125 | self.text_b = text_b
126 | self.tag = tag
127 |
128 |
129 | class PaddingInputExample(object):
130 | """Fake example so the num input examples is a multiple of the batch size.
131 |
132 | When running eval/predict on the TPU, we need to pad the number of examples
133 | to be a multiple of the batch size, because the TPU requires a fixed batch
134 | size. The alternative is to drop the last batch, which is bad because it means
135 | the entire output data won't be generated.
136 |
137 | We use this class instead of `None` because treating `None` as padding
138 | batches could cause silent errors.
139 | """
140 |
141 |
142 | class InputFeatures(object):
143 | """A single set of features of data."""
144 |
145 | def __init__(self,
146 | input_ids,
147 | input_mask,
148 | segment_ids,
149 | tag_ids,
150 | is_real_example=True):
151 | self.input_ids = input_ids
152 | self.input_mask = input_mask
153 | self.segment_ids = segment_ids
154 | self.tag_ids = tag_ids
155 | self.is_real_example = is_real_example
156 |
157 |
158 | class NerProcessor:
159 | def get_train_examples(self, data_dir):
160 | """获取训练集."""
161 | return self._create_examples(
162 | self._read_tsv(os.path.join(data_dir, "train.txt")), "train")
163 |
164 | def get_dev_examples(self, data_dir):
165 | """获取验证集."""
166 | return self._create_examples(
167 | self._read_tsv(os.path.join(data_dir, "dev.txt")), "dev")
168 |
169 | def get_test_examples(self, data_dir):
170 | """获取测试集."""
171 | return self._create_examples(
172 | self._read_tsv(os.path.join(data_dir, "test.json")), "test")
173 |
174 | def get_tags(self):
175 | """填写tag的标签,采用BIO形式标注"""
176 | # 会在convert_single_example方法中添加头,成为BIO形式标签
177 | return ['address', 'book', 'company', 'game', 'government',
178 | 'movie', 'name', 'organization', 'position', 'scene']
179 |
180 | def _read_tsv(self, input_file):
181 | """读取数据集"""
182 | with open(input_file, encoding='utf-8') as fr:
183 | lines = fr.readlines()
184 | return lines
185 |
186 | def _create_examples(self, lines, set_type):
187 | """Creates examples for the training and dev sets."""
188 | examples = []
189 | for (i, line) in enumerate(lines):
190 | if set_type == 'test':
191 | json_str = json.loads(line)
192 | text_a = tokenization.convert_to_unicode(json_str['text'])
193 | tag = None
194 | guid = json_str['id']
195 | else:
196 | text_tag = line.split('\t')
197 | guid = "%s-%s" % (set_type, i)
198 | text_a = tokenization.convert_to_unicode(text_tag[0])
199 | tag = tokenization.convert_to_unicode(text_tag[1])
200 | examples.append(
201 | InputExample(guid=guid, text_a=text_a, text_b=None, tag=tag))
202 | return examples
203 |
204 |
205 | def convert_single_example(ex_index, example, tag_list, max_seq_length,
206 | tokenizer):
207 | """Converts a single `InputExample` into a single `InputFeatures`."""
208 |
209 | if isinstance(example, PaddingInputExample):
210 | return InputFeatures(
211 | input_ids=[0] * max_seq_length,
212 | input_mask=[0] * max_seq_length,
213 | segment_ids=[0] * max_seq_length,
214 | tag_ids=[0] * max_seq_length,
215 | is_real_example=False)
216 |
217 | tag_map = {'O': 0}
218 | for tag in tag_list:
219 | tag_b = 'B-' + tag
220 | tag_i = 'I-' + tag
221 | tag_map[tag_b] = len(tag_map)
222 | tag_map[tag_i] = len(tag_map)
223 |
224 | # 因为CLUE要求提交文件中包含索引,所以不能直接使用tokenizer去分割text
225 | tokens_a = []
226 | text_list = list(example.text_a)
227 | for word in text_list:
228 | token = tokenizer.tokenize(word)
229 | tokens_a.extend(token)
230 |
231 | if len(tokens_a) > max_seq_length - 2:
232 | tokens_a = tokens_a[0:(max_seq_length - 2)]
233 |
234 | tokens = []
235 | segment_ids = []
236 | tokens.append("[CLS]")
237 | segment_ids.append(0)
238 | for token in tokens_a:
239 | tokens.append(token)
240 | segment_ids.append(0)
241 | tokens.append("[SEP]")
242 | segment_ids.append(0)
243 |
244 | input_ids = tokenizer.convert_tokens_to_ids(tokens)
245 | input_mask = [1] * len(input_ids)
246 |
247 | if example.tag:
248 | tag_ids = [0] # input第一位是[CLS]
249 | tags = example.tag.strip().split(' ')
250 | for tag in tags:
251 | tag_ids.append(tag_map.get(tag))
252 | tag_ids.append(0) # input最后一位是[SEP]
253 | else:
254 | tag_ids = [0] * max_seq_length
255 | # Zero-pad up to the sequence length.
256 | while len(input_ids) < max_seq_length:
257 | input_ids.append(0)
258 | input_mask.append(0)
259 | segment_ids.append(0)
260 | # test的时候已经*max_len所以不需要再继续padding
261 | if example.tag:
262 | tag_ids.append(0)
263 |
264 | assert len(input_ids) == max_seq_length
265 | assert len(input_mask) == max_seq_length
266 | assert len(segment_ids) == max_seq_length
267 | assert len(tag_ids) == max_seq_length
268 |
269 | if ex_index < 5:
270 | tf.logging.info("*** Example ***")
271 | tf.logging.info("guid: %s" % (example.guid))
272 | tf.logging.info("tokens: %s" % " ".join(
273 | [tokenization.printable_text(x) for x in tokens]))
274 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
275 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
276 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
277 | tf.logging.info("tag_ids: %s" % " ".join([str(x) for x in tag_ids]))
278 |
279 | feature = InputFeatures(
280 | input_ids=input_ids,
281 | input_mask=input_mask,
282 | segment_ids=segment_ids,
283 | tag_ids=tag_ids,
284 | is_real_example=True)
285 | return feature
286 |
287 |
288 | def file_based_convert_examples_to_features(
289 | examples, tag_list, max_seq_length, tokenizer, output_file):
290 | """Convert a set of `InputExample`s to a TFRecord file."""
291 |
292 | writer = tf.python_io.TFRecordWriter(output_file)
293 |
294 | for (ex_index, example) in enumerate(examples):
295 | if ex_index % 10000 == 0:
296 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))
297 |
298 | feature = convert_single_example(ex_index, example, tag_list,
299 | max_seq_length, tokenizer)
300 |
301 | def create_int_feature(values):
302 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
303 | return f
304 |
305 | features = collections.OrderedDict()
306 | features["input_ids"] = create_int_feature(feature.input_ids)
307 | features["input_mask"] = create_int_feature(feature.input_mask)
308 | features["segment_ids"] = create_int_feature(feature.segment_ids)
309 | features["tag_ids"] = create_int_feature(feature.tag_ids)
310 | features["is_real_example"] = create_int_feature(
311 | [int(feature.is_real_example)])
312 |
313 | tf_example = tf.train.Example(features=tf.train.Features(feature=features))
314 | writer.write(tf_example.SerializeToString())
315 | writer.close()
316 |
317 |
318 | def file_based_input_fn_builder(input_file, seq_length, is_training,
319 | drop_remainder):
320 | """Creates an `input_fn` closure to be passed to TPUEstimator."""
321 |
322 | name_to_features = {
323 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
324 | "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
325 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
326 | "tag_ids": tf.FixedLenFeature([seq_length], tf.int64),
327 | "is_real_example": tf.FixedLenFeature([], tf.int64),
328 | }
329 |
330 | def _decode_record(record, name_to_features):
331 | """Decodes a record to a TensorFlow example."""
332 | example = tf.parse_single_example(record, name_to_features)
333 |
334 | for name in list(example.keys()):
335 | t = example[name]
336 | if t.dtype == tf.int64:
337 | t = tf.to_int32(t)
338 | example[name] = t
339 |
340 | return example
341 |
342 | def input_fn(params):
343 | """The actual input function."""
344 | batch_size = params["batch_size"]
345 |
346 | d = tf.data.TFRecordDataset(input_file)
347 | if is_training:
348 | d = d.repeat()
349 | d = d.shuffle(buffer_size=100)
350 |
351 | d = d.apply(
352 | tf.contrib.data.map_and_batch(
353 | lambda record: _decode_record(record, name_to_features),
354 | batch_size=batch_size,
355 | drop_remainder=drop_remainder))
356 |
357 | return d
358 |
359 | return input_fn
360 |
361 |
362 | def _truncate_seq_pair(tokens_a, tokens_b, max_length):
363 | """Truncates a sequence pair in place to the maximum length."""
364 |
365 | while True:
366 | total_length = len(tokens_a) + len(tokens_b)
367 | if total_length <= max_length:
368 | break
369 | if len(tokens_a) > len(tokens_b):
370 | tokens_a.pop()
371 | else:
372 | tokens_b.pop()
373 |
374 |
375 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
376 | tags, num_tags, use_one_hot_embeddings):
377 | """Creates a classification model."""
378 | model = modeling.BertModel(
379 | config=bert_config,
380 | is_training=is_training,
381 | input_ids=input_ids,
382 | input_mask=input_mask,
383 | token_type_ids=segment_ids,
384 | use_one_hot_embeddings=use_one_hot_embeddings)
385 |
386 | # 用bert的sequence输出层
387 | output_layer = model.get_sequence_output()
388 |
389 | hidden_size = output_layer.shape[-1].value
390 | seq_len = output_layer.shape[1].value
391 | # [batch, seq_len, emb_size] 16 128 768
392 |
393 | output_weights = tf.get_variable(
394 | "output_weights", [num_tags, hidden_size],
395 | initializer=tf.truncated_normal_initializer(stddev=0.02))
396 |
397 | output_bias = tf.get_variable(
398 | "output_bias", [num_tags], initializer=tf.zeros_initializer())
399 |
400 | with tf.variable_scope("loss"):
401 | if is_training:
402 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
403 |
404 | # 进行matmul需要reshape
405 | output_layer = tf.reshape(output_layer, [-1, hidden_size])
406 | # [batch*seq_len, num_tags]
407 | logits = tf.matmul(output_layer, output_weights, transpose_b=True)
408 | logits = tf.nn.bias_add(logits, output_bias)
409 |
410 | logits = tf.reshape(logits, [-1, seq_len, num_tags])
411 |
412 | # 真实的长度
413 | input_m = tf.count_nonzero(input_mask, -1)
414 | log_likelihood, transition_matrix = tf.contrib.crf.crf_log_likelihood(
415 | logits, tags, input_m)
416 | loss = tf.reduce_mean(-log_likelihood)
417 |
418 | # 使用crf_decode输出
419 | viterbi_sequence, _ = tf.contrib.crf.crf_decode(
420 | logits, transition_matrix, input_m)
421 |
422 | return loss, logits, viterbi_sequence
423 |
424 |
425 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
426 | num_train_steps, num_warmup_steps, use_tpu,
427 | use_one_hot_embeddings):
428 | """Returns `model_fn` closure for TPUEstimator."""
429 |
430 | def model_fn(features, labels, mode, params):
431 | """The `model_fn` for TPUEstimator."""
432 |
433 | tf.logging.info("*** Features ***")
434 | for name in sorted(features.keys()):
435 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape))
436 |
437 | input_ids = features["input_ids"]
438 | input_mask = features["input_mask"]
439 | segment_ids = features["segment_ids"]
440 | tag_ids = features["tag_ids"]
441 | is_real_example = None
442 | if "is_real_example" in features:
443 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
444 | else:
445 | is_real_example = tf.ones(tf.shape(tag_ids), dtype=tf.float32)
446 |
447 | is_training = (mode == tf.estimator.ModeKeys.TRAIN)
448 |
449 | total_loss, logits, predictions = create_model(
450 | bert_config, is_training, input_ids, input_mask, segment_ids, tag_ids,
451 | num_labels, use_one_hot_embeddings)
452 |
453 | tvars = tf.trainable_variables()
454 | initialized_variable_names = {}
455 | scaffold_fn = None
456 | if init_checkpoint:
457 | (assignment_map, initialized_variable_names
458 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
459 | if use_tpu:
460 | def tpu_scaffold():
461 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
462 | return tf.train.Scaffold()
463 |
464 | scaffold_fn = tpu_scaffold
465 | else:
466 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
467 |
468 | tf.logging.info("**** Trainable Variables ****")
469 | for var in tvars:
470 | init_string = ""
471 | if var.name in initialized_variable_names:
472 | init_string = ", *INIT_FROM_CKPT*"
473 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape,
474 | init_string)
475 |
476 | if mode == tf.estimator.ModeKeys.TRAIN:
477 | # 添加loss的hook,不然在GPU/CPU上不打印loss
478 | logging_hook = tf.train.LoggingTensorHook({"loss": total_loss}, every_n_iter=10)
479 | train_op = optimization.create_optimizer(
480 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)
481 | output_spec = tf.contrib.tpu.TPUEstimatorSpec(
482 | mode=mode,
483 | loss=total_loss,
484 | train_op=train_op,
485 | training_hooks=[logging_hook],
486 | scaffold_fn=scaffold_fn)
487 | elif mode == tf.estimator.ModeKeys.EVAL:
488 | def metric_fn(per_example_loss, tag_ids, is_real_example):
489 | # 这里使用的accuracy来计算,宽松匹配方法
490 | accuracy = tf.metrics.accuracy(
491 | labels=tag_ids, predictions=predictions, weights=is_real_example)
492 | return {
493 | "eval_accuracy": accuracy,
494 | }
495 |
496 | eval_metrics = (metric_fn,
497 | [total_loss, tag_ids, is_real_example])
498 | output_spec = tf.contrib.tpu.TPUEstimatorSpec(
499 | mode=mode,
500 | loss=total_loss,
501 | eval_metrics=eval_metrics,
502 | scaffold_fn=scaffold_fn)
503 | else:
504 | output_spec = tf.contrib.tpu.TPUEstimatorSpec(
505 | mode=mode,
506 | predictions={"predictions": predictions},
507 | scaffold_fn=scaffold_fn)
508 | return output_spec
509 |
510 | return model_fn
511 |
512 |
513 | def main():
514 | tf.logging.set_verbosity(tf.logging.INFO)
515 |
516 | processors = {
517 | "cluener": NerProcessor,
518 | }
519 |
520 | tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case,
521 | FLAGS.init_checkpoint)
522 |
523 | # if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict:
524 | # raise ValueError(
525 | # "At least one of `do_train`, `do_eval` or `do_predict' must be True.")
526 |
527 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)
528 |
529 | if FLAGS.max_seq_length > bert_config.max_position_embeddings:
530 | raise ValueError(
531 | "Cannot use sequence length %d because the BERT model "
532 | "was only trained up to sequence length %d" %
533 | (FLAGS.max_seq_length, bert_config.max_position_embeddings))
534 |
535 | tf.gfile.MakeDirs(FLAGS.output_dir)
536 |
537 | task_name = FLAGS.task_name.lower()
538 |
539 | if task_name not in processors:
540 | raise ValueError("Task not found: %s" % (task_name))
541 |
542 | processor = processors[task_name]()
543 |
544 | tag_list = processor.get_tags()
545 |
546 | tokenizer = tokenization.FullTokenizer(
547 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
548 |
549 | tpu_cluster_resolver = None
550 | if FLAGS.use_tpu and FLAGS.tpu_name:
551 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
552 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)
553 |
554 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
555 | run_config = tf.contrib.tpu.RunConfig(
556 | cluster=tpu_cluster_resolver,
557 | master=FLAGS.master,
558 | model_dir=FLAGS.output_dir,
559 | save_checkpoints_steps=FLAGS.save_checkpoints_steps,
560 | tpu_config=tf.contrib.tpu.TPUConfig(
561 | iterations_per_loop=FLAGS.iterations_per_loop,
562 | num_shards=FLAGS.num_tpu_cores,
563 | per_host_input_for_training=is_per_host))
564 |
565 | train_examples = None
566 | num_train_steps = None
567 | num_warmup_steps = None
568 | if FLAGS.do_train:
569 | train_examples = processor.get_train_examples(FLAGS.data_dir)
570 | num_train_steps = int(
571 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
572 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)
573 | # num_labels=2 * len(tag_list) + 1 BI两种外加一个O
574 | model_fn = model_fn_builder(
575 | bert_config=bert_config,
576 | num_labels=2*len(tag_list) + 1,
577 | init_checkpoint=FLAGS.init_checkpoint,
578 | learning_rate=FLAGS.learning_rate,
579 | num_train_steps=num_train_steps,
580 | num_warmup_steps=num_warmup_steps,
581 | use_tpu=FLAGS.use_tpu,
582 | use_one_hot_embeddings=FLAGS.use_tpu)
583 |
584 | estimator = tf.contrib.tpu.TPUEstimator(
585 | use_tpu=FLAGS.use_tpu,
586 | model_fn=model_fn,
587 | config=run_config,
588 | train_batch_size=FLAGS.train_batch_size,
589 | eval_batch_size=FLAGS.eval_batch_size,
590 | predict_batch_size=FLAGS.predict_batch_size)
591 |
592 | if FLAGS.do_train:
593 | train_file = os.path.join(FLAGS.data_dir, "train.tf_record")
594 | file_based_convert_examples_to_features(
595 | train_examples, tag_list, FLAGS.max_seq_length, tokenizer, train_file)
596 | tf.logging.info("***** Running training *****")
597 | tf.logging.info(" Num examples = %d", len(train_examples))
598 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size)
599 | tf.logging.info(" Num steps = %d", num_train_steps)
600 | train_input_fn = file_based_input_fn_builder(
601 | input_file=train_file,
602 | seq_length=FLAGS.max_seq_length,
603 | is_training=True,
604 | drop_remainder=True)
605 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
606 |
607 | if FLAGS.do_eval:
608 | eval_examples = processor.get_dev_examples(FLAGS.data_dir)
609 | num_actual_eval_examples = len(eval_examples)
610 | if FLAGS.use_tpu:
611 | while len(eval_examples) % FLAGS.eval_batch_size != 0:
612 | eval_examples.append(PaddingInputExample())
613 |
614 | eval_file = os.path.join(FLAGS.data_dir, "eval.tf_record")
615 | file_based_convert_examples_to_features(
616 | eval_examples, tag_list, FLAGS.max_seq_length, tokenizer, eval_file)
617 |
618 | tf.logging.info("***** Running evaluation *****")
619 | tf.logging.info(" Num examples = %d (%d actual, %d padding)",
620 | len(eval_examples), num_actual_eval_examples,
621 | len(eval_examples) - num_actual_eval_examples)
622 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size)
623 |
624 | # This tells the estimator to run through the entire set.
625 | eval_steps = None
626 | if FLAGS.use_tpu:
627 | assert len(eval_examples) % FLAGS.eval_batch_size == 0
628 | eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size)
629 |
630 | eval_drop_remainder = True if FLAGS.use_tpu else False
631 | eval_input_fn = file_based_input_fn_builder(
632 | input_file=eval_file,
633 | seq_length=FLAGS.max_seq_length,
634 | is_training=False,
635 | drop_remainder=eval_drop_remainder)
636 |
637 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
638 |
639 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
640 | with tf.gfile.GFile(output_eval_file, "w") as writer:
641 | tf.logging.info("***** Eval results *****")
642 | for key in sorted(result.keys()):
643 | tf.logging.info(" %s = %s", key, str(result[key]))
644 | writer.write("%s = %s\n" % (key, str(result[key])))
645 |
646 | if FLAGS.do_predict:
647 | # label dict的设置
648 | tag_ids = {0: 'O', 1: 'B-address', 2: 'I-address', 3: 'B-book', 4: 'I-book',
649 | 5: 'B-company', 6: 'I-company', 7: 'B-game', 8: 'I-game',
650 | 9: 'B-government', 10: 'I-government', 11: 'B-movie', 12: 'I-movie',
651 | 13: 'B-name', 14: 'I-name', 15: 'B-organization', 16: 'I-organization',
652 | 17: 'B-position', 18: 'I-position', 19: 'B-scene', 20: 'I-scene'}
653 |
654 | predict_examples = processor.get_test_examples(FLAGS.data_dir)
655 | num_actual_predict_examples = len(predict_examples)
656 | test_file = os.path.join(FLAGS.data_dir, "test.tf_record")
657 | file_based_convert_examples_to_features(predict_examples, tag_list,
658 | FLAGS.max_seq_length, tokenizer,
659 | test_file)
660 |
661 | tf.logging.info("***** Running prediction*****")
662 | tf.logging.info(" Num examples = %d (%d actual, %d padding)",
663 | len(predict_examples), num_actual_predict_examples,
664 | len(predict_examples) - num_actual_predict_examples)
665 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size)
666 |
667 | predict_drop_remainder = True if FLAGS.use_tpu else False
668 | predict_input_fn = file_based_input_fn_builder(
669 | input_file=test_file,
670 | seq_length=FLAGS.max_seq_length,
671 | is_training=False,
672 | drop_remainder=predict_drop_remainder)
673 |
674 | results = estimator.predict(input_fn=predict_input_fn)
675 |
676 | output_file = os.path.join(FLAGS.data_dir, 'clue_predict.json')
677 | with open(output_file, 'w', encoding='utf-8') as fr:
678 | for example, result in zip(predict_examples, results):
679 | pre_id = result['predictions']
680 | # print(f'text is {example.text_a}')
681 | # print(f'preid is {pre_id}')
682 | text = example.text_a
683 | # 只获取text中的长度的tag输出
684 | tags = [tag_ids[tag] for tag in pre_id][1:len(text) + 1]
685 | res_words, res_pos = get_result(text, tags)
686 | rs = {}
687 | for w, t in zip(res_words, res_pos):
688 | rs[t] = rs.get(t, []) + [w]
689 | pres = {}
690 | for t, ws in rs.items():
691 | temp = {}
692 | for w in ws:
693 | word = text[w[0]: w[1] + 1]
694 | temp[word] = temp.get(word, []) + [w]
695 | pres[t] = temp
696 | output_line = json.dumps({'id': example.guid, 'label': pres}, ensure_ascii=False) + '\n'
697 | fr.write(output_line)
698 |
699 |
700 | def get_result(text, tags):
701 | """ 改写成clue要提交的格式 """
702 | result_words = []
703 | result_pos = []
704 | temp_word = []
705 | temp_pos = ''
706 | for i in range(min(len(text), len(tags))):
707 | if tags[i].startswith('O'):
708 | if len(temp_word) > 0:
709 | result_words.append([min(temp_word), max(temp_word)])
710 | result_pos.append(temp_pos)
711 | temp_word = []
712 | temp_pos = ''
713 | elif tags[i].startswith('B-'):
714 | if len(temp_word) > 0:
715 | result_words.append([min(temp_word), max(temp_word)])
716 | result_pos.append(temp_pos)
717 | temp_word = [i]
718 | temp_pos = tags[i].split('-')[1]
719 | elif tags[i].startswith('I-'):
720 | if len(temp_word) > 0:
721 | temp_word.append(i)
722 | if temp_pos == '':
723 | temp_pos = tags[i].split('-')[1]
724 | else:
725 | if len(temp_word) > 0:
726 | temp_word.append(i)
727 | if temp_pos == '':
728 | temp_pos = tags[i].split('-')[1]
729 | result_words.append([min(temp_word), max(temp_word)])
730 | result_pos.append(temp_pos)
731 | temp_word = []
732 | temp_pos = ''
733 | return result_words, result_pos
734 |
735 |
736 | if __name__ == "__main__":
737 | main()
738 |
--------------------------------------------------------------------------------
/ckbqa/DUTIR中文开放域知识问答评测报告.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xudongMk/AwesomeNLPBaseline/83774119d8b3a68b062df25f0e8775970fab8079/ckbqa/DUTIR中文开放域知识问答评测报告.pdf
--------------------------------------------------------------------------------
/ckbqa/README.md:
--------------------------------------------------------------------------------
1 | ## KBQA简介
2 |
3 | 基于知识库的问答(Knowledge Based Question Answering,KBQA)是自然语言处理(NLP)领域的热门研究方向。知识库(知识图谱,Knowledge Base/Knowledge Graph)是知识的结构化表示,一般由一组SPO三元组(主语Subject,谓语Predicate,宾语Object)构成(也称实体、关系、属性三元组),表示实体和实体间存在的语义关系。例如,中国的首都是北京,可以表示为:[中国,首都,北京]。
4 |
5 | 基于知识库的问答主要步骤是接收一个自然语言问句,识别出句子中的实体,理解问句的语义关系,构建有关实体和关系的查询语句,进而从知识库中检索出答案。
6 |
7 | 目前基于知识库的问答主要方法有:
8 |
9 | - 基于语义解析/规则的方法
10 | - 基于信息检索/信息抽取的方法
11 |
12 | 这里有一篇2019年KGQA的综述:Introduction to Neural Network Based Approaches for Question Answering over Knowledge Graphs。这篇文章将KGQA/KBQA当作语义解析的任务来对待,然后介绍了几种语义解析方法,如Classification、Ranking、Translation等。这里不做介绍,感兴趣的可以去翻原文。
13 |
14 | 基于中文知识库的问答(**Chinese Knowledge Based Question Answering,CKBQA**)相比英文KBQA,中文知识库包含的关系更多,数据集难以覆盖所有关系,再加上中文语言本身的特点,存在诸多挑战。
15 |
16 | **基于语义解析/规则的方法:**
17 |
18 | 该类方法使用字典、规则和机器学习,直接从问题中解析出实体、关系和逻辑组合。这里介绍两篇论文,一篇是 The APVA-TURBO Approach to Question Answering in Knowledge Base,文章使用序列标注模型解析问题中的实体,利用端到端模型解析问题中的关系序列。
19 | 另一篇是 A State-transition Framework to Answer Complex Questions over Knowledge Base,文章中提出了一种状态转移框架并结合卷积神经网络等方法。(上述方法均基于英文数据集)
20 |
21 | 基于语义解析/规则的方法一般步骤:
22 |
23 | - 实体识别:使用领域词表,相似度等(也可以使用深度学习模型,如BiLstm+CRF,BERT等)
24 | - 属性关系识别:词表规则,或使用分类模型
25 | - 答案查询:基于前两个步骤,根据规则模板转换成SPARQL等查询语言进行查询(下方给出一个简化示例)
26 |
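下面给出"根据规则模板转换SPARQL"这一步的一个极简示意(仅为说明流程,并非本仓库或上述论文的实现,实体与属性在这里假设已经识别完成,前缀和三元组格式需按具体知识库的 schema 调整):

```python
def build_sparql(entity, relation):
    """根据规则模板拼接 SPARQL 查询(示意写法)"""
    return f"SELECT ?x WHERE {{ <{entity}> <{relation}> ?x . }}"


if __name__ == '__main__':
    # 例:问题"中国的首都是哪里" -> 实体"中国",属性"首都"
    print(build_sparql('中国', '首都'))
```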
27 | 基于语义解析/规则的方法比较简单,当前Github上很多KBQA的项目都是基于这种模式。
28 |
29 | 这里推荐几个基于语义解析/规则的 KBQA项目:
30 |
31 | - 豆瓣的电影知识图谱问答:https://github.com/weizhixiaoyi/DouBan-KGQA
32 | - 基于NLPCC数据的KBQA:https://zhuanlan.zhihu.com/p/62946533
33 |
34 | **基于信息检索/信息抽取的方法:**
35 |
36 | 该类方法首先根据问题得到若干个候选实体,根据预定义的逻辑形式,从知识库中抽取与候选实体相连的关系作为候选查询路径,再使用文本匹配模型,选择出与问题相似度最高的候选查询路径,到知识库中检索答案。这里介绍一种增强路径匹配的方法: Improved neural relation detection for knowledge base question answering。
37 |
38 | 当前CKBQA任务上,大多采用的是基于信息检索/信息抽取的方法,一般步骤如下(列表后附一个路径匹配的简化示意):
39 |
40 | - 实体与关系识别
41 | - 路径匹配
42 | - 答案检索
43 |
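下面给出"路径匹配"这一步的一个极简示意(仅为说明思路,并非本仓库或上述评测方案的实现,这里用字符重合度代替真实的文本匹配模型打分):

```python
def char_overlap(question, path):
    """用字符重合度粗略衡量问题与候选路径的相似度(真实系统中应使用文本匹配模型打分)"""
    q, p = set(question), set(path)
    return len(q & p) / max(len(q | p), 1)


def rank_paths(question, candidate_paths):
    """对候选查询路径按相似度排序,取分数最高者到知识库中检索答案"""
    return sorted(candidate_paths, key=lambda p: char_overlap(question, p), reverse=True)


if __name__ == '__main__':
    question = '中国的首都是哪里'
    candidates = ['中国-首都', '中国-人口', '中国-国歌']
    print(rank_paths(question, candidates)[0])  # 期望输出:中国-首都
```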
44 | 在CCKS的KBQA比赛中这种方法非常常见,CCKS官网网站上有每一年的评测论文,下面推荐几个最新的:
45 |
46 | - 2019年CCKS的KBQA任务第四名方案:DUTIR中文开放域知识问答评测报告
47 | - 2020年CCKS的KBQA任务第一名方案:基于特征融合的中文知识库问答方法
48 |
49 | 具体内容可见官网的评测论文,这里附件上传,见ckbqa目录下两个pdf文件。
50 |
51 | ## 中英文数据集
52 |
53 | 英文数据集:
54 |
55 | - FREE917:第一个大规模的KBQA数据集,于2013年提出,包含917 个问题,同时提供相应逻辑查询,覆盖600多种freebase上的关系。
56 | - Webquestions:数据集中有6642个问题答案对,数据集规模虽然较FREE917提高了不少,但有两个突出的缺陷:没有提供对应的查询,不利于基于逻辑表达式模型的训练;另外webquestions中简单问句多而复杂问句少。
57 | - WebQSP:是WEBQUESTIONS的子集,问题都是需要多跳才能回答,属于multi-relation KBQA dataset,另外补全了对应的查询句。
58 | - ComplexQuestions、GraphQuestions:在问句的结构和表达多样性等方面进一步增强了WebQuestionsSP,包括类型约束、显/隐式的时间约束、聚合操作。
59 | - SimpleQuestions:数据规模较大,共100K,数据形式为(question,knowledge base fact),均为简单问题,只需KB中的一个三元组即可回答,即single-relation dataset。
60 |
61 | 英文数据集较多,这里只列举几个常见的。详细的数据集可见北航的[KBQA调研](https://github.com/BDBC-KG-NLP/QA-Survey/blob/master/KBQA%E8%B0%83%E7%A0%94-%E5%AD%A6%E6%9C%AF%E7%95%8C.md#13-%E6%95%B0%E6%8D%AE%E9%9B%86)
62 |
63 | 中文数据集:
64 |
65 | - NLPCC开放领域知识图谱问答的数据集:简单问题(单跳问题),14609条训练数据,9870条验证和测试数据,数据集下载。
66 | - CCKS开放领域知识图谱问答的数据集:包含简单问题和复杂问题,2298条训练数据,766条验证和测试数据,数据集下载。
67 |
68 | 除了上述两个中文数据集(提取码均是),CLUE上还提供了一些问答的数据集,可以见[CLUE的数据集搜索](https://www.cluebenchmarks.com/dataSet_search_modify.html?keywords=QA)。
69 |
70 | ## KBQA的实现
71 |
72 | 下面基于CCKS的数据集来实现2019年第四名方案和2020年第一名方案。
73 |
74 | CCKS的数据集,百度网盘下载地址:链接:https://pan.baidu.com/s/1NI9VrhuvOgyTFk1tGjlZIw 提取码:l7pm
75 |
76 | todo list(等有空实现了就补上):
77 |
78 | - 使用tensorflow实现2019年第四名方案
79 | - 使用tensorflow实现2020年第一名方案
80 |
81 | 附上2019年第四名方案的开源地址 https://github.com/atom32/ccks2019-ckbqa-4th-codes
82 | 流程还算完整,但想端到端完整运行有点困难,而且很多数据的处理过程都耦合在模型中。需要花一定的时间去整理。
83 |
84 | 2020年第一名方案代码暂未开源。
85 |
86 |
87 |
88 | ## 扩展
89 |
90 | - 美团大脑:知识图谱的建模方法及其应用:https://tech.meituan.com/2018/11/01/meituan-ai-nlp.html
91 | - 百度大脑UNIT3.0详解之知识图谱与对话:https://baijiahao.baidu.com/s?id=1643915882369765998&wfr=spider&for=pc
92 | - 更新ing
93 |
--------------------------------------------------------------------------------
/ckbqa/基于特征融合的中文知识库问答方法.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xudongMk/AwesomeNLPBaseline/83774119d8b3a68b062df25f0e8775970fab8079/ckbqa/基于特征融合的中文知识库问答方法.pdf
--------------------------------------------------------------------------------
/named_entity_recognition/README.md:
--------------------------------------------------------------------------------
1 | ## 命名实体识别(Named Entity Recognition)
2 |
3 | 这里首先介绍一篇基于深度学习的命名实体识别综述,《A Survey on Deep Learning for Named Entity Recognition》,论文来源:https://arxiv.org/abs/1812.09449(2020年3月份发表在TKDE)
4 |
5 | **1.命名实体识别简介**
6 |
7 | 命名实体识别(Named Entity Recognition,NER)旨在从给定文本中识别出属于预定义类别的片段(如人名、地点、组织等)。NER一直是很多自然语言处理应用的基础,如机器问答、文本摘要和机器翻译。
8 |
9 | NER任务最早是由第六届消息理解会议(Sixth Message Understanding Conference,MUC-6)提出,但当时仅定义了一些通用的实体类型,如组织、人名和地点。
10 |
11 | **2.命名实体识别常用方法**
12 |
13 | - 基于规则的方法(Rule-based Approaches):不需要标注数据,依赖人工规则,特定领域需要专家知识
14 | - 无监督学习方法(Unsupervised Learning Approaches):不需要标注数据,依赖于无监督学习方法,如聚类算法
15 | - 基于特征的有监督学习方法(Feature-based Supervised Learning Approaches):将NER当作一个多分类问题或序列标签分类任务,依赖于特征工程
16 | - 基于深度学习的方法(DL-based Approaches):后面详细介绍
17 |
18 | 论文简单介绍了前三种方法,这里也不再赘述,感兴趣的可以看论文。
19 |
20 | **3.基于深度学习的方法**
21 |
22 | 文章中将NER任务拆解成三个结构:
23 |
24 | - 输入的分布式表示(Distributed Representations for Input)
25 | - 上下文编码(Context Encoder Architectures)
26 | - 标签解码(Tag Decoder Architectures)
27 |
28 | 这里不再展开描述具体的内容(有兴趣的可以去翻论文),下表总结了基于神经网络的NER模型的工作,并展示了每个NER模型在各类数据集上的表现。
29 |
30 | 
31 |
32 | 总结:BiLstm+CRF是使用深度学习的NER最常见的体系结构,以Cloze风格使用预训练双向Transformer在CoNLL03数据集上达到了SOTA效果(93.5%),另外Bert+Dice Loss在OntoNotes5.0数据集上达到了SOTA效果(92.07%)。
33 |
34 | **4.评测指标**
35 |
36 | 文中将NER的评测指标Precision、Recall和F1的计算方式分成了两类(两者的区别见下方的简化示例)。
37 |
38 | - Exact match:严格匹配方法,需要识别的边界和类别都正确
39 | - Relaxed match:宽松匹配方法,实体位置区间重叠、位置正确类别错误等都视为正确
40 |
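下面用一个简化的小例子说明严格匹配(Exact match)下 Precision/Recall/F1 的计算方式(实体用 (类别, 起始位置, 结束位置) 三元组表示,仅为示意,并非本仓库的评测代码):

```python
def exact_match_f1(gold_entities, pred_entities):
    """gold/pred 均为 (类别, start, end) 三元组列表,边界和类别都一致才算预测正确"""
    gold, pred = set(gold_entities), set(pred_entities)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == '__main__':
    gold = [('scene', 3, 4), ('name', 0, 1)]
    # name 的边界不对,严格匹配下记为错误(宽松匹配下区间有重叠、类别正确,可视为正确)
    pred = [('scene', 3, 4), ('name', 0, 2)]
    print(exact_match_f1(gold, pred))  # (0.5, 0.5, 0.5)
```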
41 |
42 |
43 | ## 命名实体识别数据集
44 |
45 | 命名实体识别数据集一般采用BIO或者BIOES模式标注,两种模式的区别见下方示例。
46 |
47 | - BIO模式:具体指B-begin、I-inside、O-outside
48 | - BIOES模式:具体指B-begin、I-inside、O-outside、E-end、S-single
49 |
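以 convert_bio.py 注释中的例句"我要去故宫"为例(实体"故宫",类别沿用该例子中的 location),两种标注模式的区别如下(示意):

```python
text  = list("我要去故宫")
bio   = ["O", "O", "O", "B-location", "I-location"]   # BIO:实体首字用 B-,其余用 I-
bioes = ["O", "O", "O", "B-location", "E-location"]   # BIOES:实体末字用 E-,单字实体用 S-
```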
50 | 首先是综述中提到的几个数据集,见下表,具体的就不介绍了。
51 |
52 | 
53 |
54 |
55 |
56 | 下面介绍一个中文的命名实体识别数据集,**CLUENER 细粒度命名实体识别**,地址:https://github.com/CLUEbenchmark/CLUENER2020
57 |
58 | - 数据类别:10个,地址、书名、公司、游戏、政府、电影、姓名、组织、职位和景点
59 | - 数据分布:训练集10748,验证集1343,具体类别分布见原文
60 | - 数据来源:在THUCTC文本分类数据集基础上,选出部分数据进行细粒度实体标注
61 |
62 |
63 |
64 | ## 命名实体识别Baseline算法实现
65 |
66 | 使用Tensorflow1.x版本Estimator高阶api实现常见的命名实体识别算法,主要包括BiLstm+CRF、Bert、Bert+CRF。
67 |
68 | (当前只在本目录下实现了BiLstm+CRF,至于BERT的在bert_downstream目录下暂未实现)
69 |
70 | 环境信息:
71 |
72 | tensorflow==1.13.1
73 |
74 | python==3.7
75 |
76 | **数据预处理**
77 |
78 | 训练集和测试集需要分开存储,且数据集格式需为BIO标注形式。
79 |
80 | 在训练模型前,需要先运行preprocess.py文件进行数据预处理,将数据处理成id形式并保存为pkl形式,另外中间过程产生的词表也会保存为vocab.txt文件。
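结合 ner_main.py 与 data_utils/datasets.py 的读取逻辑,preprocess.py 保存的 pkl 大致是一个包含 train/test 两部分的字典,每部分含有 words 和 tags 两个 id 序列列表。下面是一个读取检查的小示意(字段名以实际 preprocess.py 的输出为准):

```python
import _pickle as cPickle

# 读取 preprocess.py 生成的 pkl 文件,检查训练数据规模(示意代码)
data = cPickle.load(open('./data_path/clue_data.pkl', 'rb'))
train = data['train']
print('train size:', len(train['words']))
print('first sample word ids:', train['words'][0])
print('first sample tag ids :', train['tags'][0])
```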
81 |
82 | **文件结构**
83 |
84 | - data_path:数据集存放的位置
85 | - data_utils:数据处理相关的工具类存放位置
86 | - model_ckpt:checkpoint模型保存的位置
87 | - model_pb:pb形式的模型保存的位置
88 | - models:ner基本的算法存放位置,如BiLstm等
89 | - preprocess.py:数据预处理代码
90 | - ner_main.py:训练主入口
91 |
92 | **模型训练**
93 |
94 | - 首先准备好数据集,放在data_path下,然后运行preprocess.py文件
95 | - 运行ner_main.py,具体的模型参数可以在ARGS里面设置,也可以使用 `python ner_main.py --train_path='./data_path/clue_data.pkl'` 的形式在命令行传入
96 |
97 | **模型推理**
98 |
99 | - 推理代码在inference.py中
100 |
101 |
102 |
103 | ## 示例
104 |
105 | 下面使用中文任务测评基准(CLUE benchmark)的CLUENER数据进行demo示例演示:
106 |
107 | 数据集下载地址[[CLUENER细粒度命名实体识别](https://github.com/CLUEbenchmark/CLUENER2020)],该数据由CLUEBenchMark整理,数据分为10个标签类别分别为: 地址(address),书名(book),公司(company),游戏(game),政府(government),电影(movie),姓名(name),组织机构(organization),职位(position),景点(scene)
108 |
109 | 数据集分布:
110 |
111 | ```
112 | 训练集:10748
113 | 验证集:1343
114 |
115 | 按照不同标签类别统计,训练集数据分布如下(注:一条数据中出现的所有实体都进行标注,如果一条数据出现两个地址(address)实体,那么统计地址(address)类别数据的时候,算两条数据):
116 | 【训练集】标签数据分布如下:
117 | 地址(address):2829
118 | 书名(book):1131
119 | 公司(company):2897
120 | 游戏(game):2325
121 | 政府(government):1797
122 | 电影(movie):1109
123 | 姓名(name):3661
124 | 组织机构(organization):3075
125 | 职位(position):3052
126 | 景点(scene):1462
127 |
128 | 【验证集】标签数据分布如下:
129 | 地址(address):364
130 | 书名(book):152
131 | 公司(company):366
132 | 游戏(game):287
133 | 政府(government):244
134 | 电影(movie):150
135 | 姓名(name):451
136 | 组织机构(organization):344
137 | 职位(position):425
138 | 景点(scene):199
139 | ```
140 |
141 | **1.数据EDA:**
142 |
143 | 省略,需要的可以自己分析一下数据集的分布情况
144 |
145 | **2.数据预处理:**
146 |
147 | 转换BIO形式:具体见convert_bio.py,将CLUE提供的数据集转换为BIO标注形式;然后运行preprocess.py将数据集转换为id形式并保存为pkl文件。
148 |
149 | **3.模型训练:**
150 |
151 | 代码见ner_main.py,参数设置的时候有几个参数需要根据自己的数据分布来设置:
152 |
153 | - vocab_size:词表大小,一般需要根据自己生成的vocab.txt中词表的大小来设置
154 | - num_tags:类别标签的数量,算上O,这里是21类
155 | - train_path/eval_path:数据集的路径
156 |
157 | 其他的参数视个人情况而定
158 |
159 | **4.开始预测并提交结果**
160 |
161 | 预测代码见inference.py
162 |
163 | #todo next 只完成了一部分,写入文件的部分暂时未完成。因为其提交的文件格式有点难受....太细化了...
164 |
165 |
166 |
167 | ## NER的比赛
168 |
169 | 1.天池的比赛 https://tianchi.aliyun.com/competition/entrance/531824/introduction
170 |
171 | 2.CLUE的评测 https://www.cluebenchmarks.com/introduce.html
172 |
173 |
174 |
175 | ## 扩展
176 |
177 | - 美团搜索中NER技术的探索和实践:https://tech.meituan.com/2020/07/23/ner-in-meituan-nlp.html
178 |
179 |
180 |
181 |
182 |
--------------------------------------------------------------------------------
/named_entity_recognition/convert_bio.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/15 21:46
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : convert_bio.py
6 | # @Software: PyCharm
7 | import json
8 |
9 | """
10 | 将数据转换成bio形式
11 | """
12 |
13 |
14 | def read_data(file_path):
15 | """
16 | 读取数据集
17 | :param file_path:
18 | :return:
19 | """
20 | with open(file_path, encoding='utf-8') as fr:
21 | lines = fr.readlines()
22 | print(f'the data size is {len(lines)}')
23 | return lines
24 |
25 |
26 | def convert_bio_data(file_path, out):
27 | """
28 | 转换成bio形式
29 | example:
30 | 我要去故宫 O O O B-location I-location
31 | :param file_path:
32 | :return:
33 | """
34 | lines = read_data(file_path)
35 | bio_data = []
36 | for line in lines:
37 | data = json.loads(line)
38 | text = data['text']
39 | labels = data['label']
40 | # 遍历处理label
41 | bios = ['O'] * len(text)
42 | for label in labels:
43 | entitys = labels[label]
44 | for entity in entitys:
45 | indexs = entitys[entity]
46 | for index in indexs:
47 | start = index[0]
48 | end = index[1]
49 | for i in range(start, end + 1):
50 | if i == start:
51 | bios[i] = f'B-{label}'
52 | else:
53 | bios[i] = f'I-{label}'
54 | bio_data.append(text + '\t' + ' '.join(bios))
55 | # write to file
56 | with open(out, 'w', encoding='utf-8') as fr:
57 | for data in bio_data:
58 | fr.write(data + '\n')
59 | print(f'convert bio data over!')
60 |
61 |
62 | if __name__ == '__main__':
63 | convert_bio_data('./data_path/train.json', './data_path/train.txt')
64 | convert_bio_data('./data_path/dev.json', './data_path/dev.txt')
65 |
--------------------------------------------------------------------------------
/named_entity_recognition/data_path/README.md:
--------------------------------------------------------------------------------
1 | ### 文件说明
2 |
3 | 这里存放的是数据集
--------------------------------------------------------------------------------
/named_entity_recognition/data_utils/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/9 20:34
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : __init__.py.py
6 | # @Software: PyCharm
7 |
--------------------------------------------------------------------------------
/named_entity_recognition/data_utils/datasets.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/9 20:34
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : datasets.py
6 | # @Software: PyCharm
7 |
8 | import numpy as np
9 | import tensorflow as tf
10 |
11 | """
12 | 数据集构建类
13 | 将数据转换成模型所需要的dataset输入
14 | """
15 |
16 |
17 | class DataBuilder:
18 | def __init__(self, data):
19 | self.words = np.asarray(data['words'])
20 | self.tags = np.asarray(data['tags'])
21 |
22 | @property
23 | def size(self):
24 | return len(self.words)
25 |
26 | def build_generator(self):
27 | """
28 | build data generator for model
29 | :return:
30 | """
31 | for word, tag in zip(self.words, self.tags):
32 | yield (word, len(word)), tag
33 |
34 | def build_dataset(self):
35 | """
36 | build dataset from generator
37 | :return:
38 | """
39 | dataset = tf.data.Dataset.from_generator(
40 | self.build_generator,
41 | ((tf.int64, tf.int64), tf.int64),
42 | ((tf.TensorShape([None]), tf.TensorShape([])), tf.TensorShape([None]))
43 | )
44 | return dataset
45 |
46 | def get_train_batch(self, dataset, batch_size, epoch):
47 | """
48 | get one batch train data
49 | :param dataset:
50 | :param batch_size:
51 | :param epoch:
52 | :return:
53 | """
54 | dataset = dataset.cache()\
55 | .shuffle(buffer_size=10000)\
56 | .padded_batch(batch_size, padded_shapes=(([None], []), [None]))\
57 | .repeat(epoch)
58 | return dataset.make_one_shot_iterator().get_next()
59 |
60 | def get_test_batch(self, dataset, batch_size):
61 | """
62 | get one batch test data
63 | :param dataset:
64 | :param batch_size:
65 | :return:
66 | """
67 | dataset = dataset.padded_batch(batch_size,
68 | padded_shapes=(([None], []), [None]))
69 | return dataset.make_one_shot_iterator().get_next()
70 |
--------------------------------------------------------------------------------
/named_entity_recognition/inference.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2021/1/6 22:59
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : inference.py
6 | # @Software: PyCharm
7 |
8 | import tensorflow as tf
9 | import tqdm
10 | import json
11 | import _pickle as cPickle
12 |
13 | """
14 | 命名实体识别推理代码
15 | """
16 |
17 | # 加载词典
18 | word_dict = {}
19 | with open('./data_path/clue_vocab.txt', encoding='utf-8') as fr:
20 | lines = fr.readlines()
21 | for line in lines:
22 | word = line.split('\t')[0]
23 | id = int(line.split('\t')[1].strip())
24 | word_dict[word] = id
25 | print(f'load vocab over, vocab size: {len(word_dict)}')
26 |
27 | # label dict的设置 这个和preprocess中的tag_dict对应
28 | tag_ids = {0: 'O', 1: 'B-address', 2: 'I-address', 3: 'B-book', 4: 'I-book',
29 | 5: 'B-company', 6: 'I-company', 7: 'B-game', 8: 'I-game',
30 | 9: 'B-government', 10: 'I-government', 11: 'B-movie', 12: 'I-movie',
31 | 13: 'B-name', 14: 'I-name', 15: 'B-organization', 16: 'I-organization',
32 | 17: 'B-position', 18: 'I-position', 19: 'B-scene', 20: 'I-scene'}
33 |
34 |
35 | def words_to_ids(words, word_dict):
36 | """ 将words 转换成ids形式 """
37 | ids = [word_dict.get(word, 1) for word in words]
38 | return ids
39 |
40 |
41 | def predict_main(test_file, out_path):
42 | """ 预测主入口 """
43 | model_path = './model_pb/1609946529'
44 | with tf.Session(graph=tf.Graph()) as sess:
45 | model = tf.saved_model.loader.load(sess, ['serve'], model_path)
46 | # print(model)
47 | out = sess.graph.get_tensor_by_name('tag_ids:0')
48 | input_id = sess.graph.get_tensor_by_name('input_words:0')
49 | input_len = sess.graph.get_tensor_by_name('input_len:0')
50 |
51 | with open(test_file, encoding='utf-8') as fr:
52 | lines = fr.readlines()
53 | res_list = []
54 |
55 | cnt = 0
56 | for line in tqdm.tqdm(lines):
57 | json_str = json.loads(line)
58 | id = json_str['id']
59 | text = json_str['text']
60 | if len(text) < 1:
61 | print('there are some sample error!')
62 | text_features = words_to_ids(text, word_dict)
63 | text_label = len(text)
64 | feed = {input_id: [text_features], input_len: [text_label]}
65 | score = sess.run(out, feed_dict=feed)
66 |
67 | cnt += 1
68 | tags = [tag_ids[tag] for tag in score[0]]
69 | # print(tags)
70 | res_words, res_pos = get_result(text, tags)
71 | rs = {}
72 | for w, t in zip(res_words, res_pos):
73 | rs[t] = rs.get(t, []) + [w]
74 | pres = {}
75 | for t, ws in rs.items():
76 | temp = {}
77 | for w in ws:
78 | word = text[w[0]: w[1] + 1]
79 | temp[word] = temp.get(word, []) + [w]
80 | pres[t] = temp
81 | output_line = json.dumps({'id': id, 'label': pres}, ensure_ascii=False)
82 | res_list.append(output_line)
83 | # print(output_line)
84 | # write to file
85 | with open(out_path, 'w', encoding='utf-8') as fr:
86 | for res in res_list:
87 | fr.write(res)
88 | fr.write('\n')
89 |
90 |
91 | def get_result(text, tags):
92 | """ 改写成clue要提交的格式 """
93 | result_words = []
94 | result_pos = []
95 | temp_word = []
96 | temp_pos = ''
97 | for i in range(min(len(text), len(tags))):
98 | if tags[i].startswith('O'):
99 | if len(temp_word) > 0:
100 | result_words.append([min(temp_word), max(temp_word)])
101 | result_pos.append(temp_pos)
102 | temp_word = []
103 | temp_pos = ''
104 | elif tags[i].startswith('B-'):
105 | if len(temp_word) > 0:
106 | result_words.append([min(temp_word), max(temp_word)])
107 | result_pos.append(temp_pos)
108 | temp_word = [i]
109 | temp_pos = tags[i].split('-')[1]
110 | elif tags[i].startswith('I-'):
111 | if len(temp_word) > 0:
112 | temp_word.append(i)
113 | if temp_pos == '':
114 | temp_pos = tags[i].split('-')[1]
115 | else:
116 | if len(temp_word) > 0:
117 | temp_word.append(i)
118 | if temp_pos == '':
119 | temp_pos = tags[i].split('-')[1]
120 | result_words.append([min(temp_word), max(temp_word)])
121 | result_pos.append(temp_pos)
122 | temp_word = []
123 | temp_pos = ''
124 | return result_words, result_pos
125 |
126 |
127 | if __name__ == '__main__':
128 | test_file = './data_path/test.json'
129 | out_path = './data_path/clue_predict.json'
130 | predict_main(test_file, out_path)
131 |
--------------------------------------------------------------------------------
/named_entity_recognition/model_ckpt/README.md:
--------------------------------------------------------------------------------
1 | ### 文件说明
2 |
3 | 这里保存训练后的checkpoint文件
--------------------------------------------------------------------------------
/named_entity_recognition/model_pb/README.md:
--------------------------------------------------------------------------------
1 | ### 文件说明
2 |
3 | 这里保存训练后的pb模型文件
--------------------------------------------------------------------------------
/named_entity_recognition/models/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/9 20:34
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : __init__.py.py
6 | # @Software: PyCharm
7 |
--------------------------------------------------------------------------------
/named_entity_recognition/models/bilstm_crf.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/9 20:34
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : bilstm_crf.py
6 | # @Software: PyCharm
7 |
8 | import tensorflow as tf
9 | from tensorflow.contrib.rnn import LSTMCell
10 | from tensorflow.contrib.rnn import MultiRNNCell
11 |
12 |
13 | class Linear:
14 | """
14 | 全连接层
16 | """
17 | def __init__(self, scope_name, input_size, output_size,
18 | drop_out=0., trainable=True):
19 | with tf.variable_scope(scope_name):
20 | w_init = tf.random_uniform_initializer(-0.1, 0.1)
21 | self.W = tf.get_variable('W', [input_size, output_size],
22 | initializer=w_init,
23 | trainable=trainable)
24 |
25 | self.b = tf.get_variable('b', [output_size],
26 | initializer=tf.zeros_initializer(),
27 | trainable=trainable)
28 |
29 | self.drop_out = tf.layers.Dropout(drop_out)
30 |
31 | self.output_size = output_size
32 |
33 | def __call__(self, inputs, training):
34 | size = tf.shape(inputs)
35 | input_trans = tf.reshape(inputs, [-1, size[-1]])
36 | input_trans = tf.nn.xw_plus_b(input_trans, self.W, self.b)
37 | input_trans = self.drop_out(input_trans, training=training)
38 |
39 | input_trans = tf.reshape(input_trans, [-1, size[1], self.output_size])
40 |
41 | return input_trans
42 |
43 |
44 | class LookupTable:
45 | """
46 | embedding layer
47 | """
48 | def __init__(self, scope_name, vocab_size, embed_size, reuse=False, trainable=True):
49 | self.vocab_size = vocab_size
50 | self.embed_size = embed_size
51 |
52 | with tf.variable_scope(scope_name, reuse=bool(reuse)):
53 | self.embedding = tf.get_variable('embedding', [vocab_size, embed_size],
54 | initializer=tf.random_uniform_initializer(-0.25, 0.25),
55 | trainable=trainable)
56 |
57 | def __call__(self, input):
58 | input = tf.where(tf.less(input, self.vocab_size), input, tf.ones_like(input))
59 | return tf.nn.embedding_lookup(self.embedding, input)
60 |
61 |
62 | class LstmBase:
63 | """
64 | build rnn cell
65 | """
66 | def build_rnn(self, hidden_size, num_layes):
67 | cells = []
68 | for i in range(num_layes):
69 | cell = LSTMCell(num_units=hidden_size,
70 | state_is_tuple=True,
71 | initializer=tf.random_uniform_initializer(-0.25, 0.25))
72 | cells.append(cell)
73 | cells = MultiRNNCell(cells, state_is_tuple=True)
74 |
75 | return cells
76 |
77 |
78 | class BiLstm(LstmBase):
79 | """
80 | define the lstm
81 | """
82 | def __init__(self, scope_name, hidden_size, num_layers):
83 | super(BiLstm, self).__init__()
84 | assert hidden_size % 2 == 0
85 | hidden_size //= 2  # 前向/后向各占一半隐层维度,保持为整数
86 |
87 | self.fw_rnns = []
88 | self.bw_rnns = []
89 | for i in range(num_layers):
90 | self.fw_rnns.append(self.build_rnn(hidden_size, 1))
91 | self.bw_rnns.append(self.build_rnn(hidden_size, 1))
92 |
93 | self.scope_name = scope_name
94 |
95 | def __call__(self, input, input_len):
96 | for idx, (fw_rnn, bw_rnn) in enumerate(zip(self.fw_rnns, self.bw_rnns)):
97 | scope_name = '{}_{}'.format(self.scope_name, idx)
98 | ctx, _ = tf.nn.bidirectional_dynamic_rnn(
99 | fw_rnn, bw_rnn, input, sequence_length=input_len,
100 | dtype=tf.float32, time_major=False,
101 | scope=scope_name
102 | )
103 | input = tf.concat(ctx, -1)
104 | ctx = input
105 | return ctx
106 |
107 |
108 | class BiLstm_Crf:
109 | def __init__(self, args, vocab_size, emb_size):
110 | # embedding
111 | scope_name = 'look_up'
112 | self.lookuptables = LookupTable(scope_name, vocab_size, emb_size)
113 |
114 | # rnn
115 | scope_name = 'bi_lstm'
116 | self.rnn = BiLstm(scope_name, args.hidden_dim, 1)
117 |
118 | # linear
119 | scope_name = 'linear'
120 | self.linear = Linear(scope_name, args.hidden_dim, args.num_tags,
121 | drop_out=args.drop_out)
122 |
123 | # crf
124 | scope_name = 'crf_param'
125 | self.crf_param = tf.get_variable(scope_name, [args.num_tags, args.num_tags],
126 | dtype=tf.float32)
127 |
128 | def __call__(self, inputs, training):
129 | masks = tf.sign(inputs)
130 | sent_len = tf.reduce_sum(masks, axis=1)
131 |
132 | embedding = self.lookuptables(inputs)
133 |
134 | rnn_out = self.rnn(embedding, sent_len)
135 |
136 | logits = self.linear(rnn_out, training)
137 |
138 | pred_ids, _ = tf.contrib.crf.crf_decode(logits, self.crf_param, sent_len)
139 |
140 | return logits, pred_ids, self.crf_param
141 |
142 |
143 |
144 |
145 |
--------------------------------------------------------------------------------
/named_entity_recognition/ner_main.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020-10-09 23:07
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : ner_main.py
6 | # @Software: PyCharm
7 |
8 | import sys
9 | import time
10 | import tensorflow as tf
11 | from data_utils import datasets
12 |
13 | import _pickle as cPickle
14 |
15 | from argparse import ArgumentParser
16 | from models.bilstm_crf import BiLstm_Crf
17 |
18 | parser = ArgumentParser()
19 |
20 | parser.add_argument("--vocab_size", type=int, default=4000, help='vocab size')
21 | parser.add_argument("--emb_size", type=int, default=300, help='emb size')
22 | parser.add_argument("--train_path", type=str, default='./data_path/clue_data.pkl')
23 | parser.add_argument("--test_path", type=str, default='./data_path/clue_data.pkl')
24 | parser.add_argument("--model_ckpt_dir", type=str, default='./model_ckpt/')
25 | parser.add_argument("--model_pb_dir", type=str, default='./model_pb')
26 | parser.add_argument("--hidden_dim", type=int, default=300)
27 | parser.add_argument("--num_tags", type=int, default=21)
28 | parser.add_argument("--drop_out", type=float, default=0.1)
29 | parser.add_argument("--batch_size", type=int, default=16)
30 | parser.add_argument("--epoch", type=int, default=50)
31 | parser.add_argument("--lr", type=float, default=1e-4,
32 | help='the learning rate for optimizer')
33 |
34 |
35 | tf.logging.set_verbosity(tf.logging.INFO)
36 | ARGS, unparsed = parser.parse_known_args()
37 | print(ARGS)
38 |
39 | sys.stdout.flush()
40 |
41 |
42 | def init_data(file_name, type=None):
43 | """
44 | init data
45 | :param file_name:
46 | :param type:
47 | :return:
48 | """
49 | data = cPickle.load(open(file_name, 'rb'))[type]
50 |
51 | data_builder = datasets.DataBuilder(data)
52 | dataset = data_builder.build_dataset()
53 |
54 | def train_input():
55 | return data_builder.get_train_batch(dataset, ARGS.batch_size, ARGS.epoch)
56 |
57 | def test_input():
58 | return data_builder.get_test_batch(dataset, ARGS.batch_size)
59 |
60 | return train_input if type == 'train' else test_input
61 |
62 |
63 | def model_fn(features, labels, mode, params):
64 | """
65 | build model fn
66 | :return:
67 | """
68 | vocab_size = ARGS.vocab_size
69 | emb_size = ARGS.emb_size
70 | model = BiLstm_Crf(ARGS, vocab_size, emb_size)
71 |
72 | if isinstance(features, dict):
73 | features = features['words'], features['words_len']
74 |
75 | words, words_len = features
76 |
77 | if mode == tf.estimator.ModeKeys.PREDICT:
78 | _, pred_ids, _ = model(words, training=False)
79 |
80 | prediction = {'tag_ids': tf.identity(pred_ids, name='tag_ids')}
81 |
82 | return tf.estimator.EstimatorSpec(
83 | mode=mode,
84 | predictions=prediction,
85 | export_outputs={'classify': tf.estimator.export.PredictOutput(prediction)}
86 | )
87 | else:
88 | tags = labels
89 | weights = tf.sequence_mask(words_len)
90 | if mode == tf.estimator.ModeKeys.TRAIN:
91 | logits, pred_ids, crf_params = model(words, training=True)
92 |
 93 | log_likelihood, _ = tf.contrib.crf.crf_log_likelihood(
94 | logits, tags, words_len, crf_params
95 | )
 96 | loss = -tf.reduce_mean(log_likelihood)
97 | accuracy = tf.metrics.accuracy(tags, pred_ids, weights)
98 |
99 | tf.identity(accuracy[1], name='train_accuracy')
100 | tf.summary.scalar('train_accuracy', accuracy[1])
101 | optimizer = tf.train.AdamOptimizer(learning_rate=ARGS.lr)  # 使用--lr参数,避免学习率被硬编码
102 | return tf.estimator.EstimatorSpec(
103 | mode=mode,
104 | loss=loss,
105 | train_op=optimizer.minimize(loss, tf.train.get_or_create_global_step())
106 | )
107 | else:
108 | _, pred_ids, _ = model(words, training=False)
109 | accuracy = tf.metrics.accuracy(tags, pred_ids, weights)
110 | metrics = {
111 | 'accuracy': accuracy
112 | }
113 | return tf.estimator.EstimatorSpec(
114 | mode=mode,
115 | loss=tf.constant(0., dtype=tf.float32),  # 评估阶段未单独计算CRF损失,这里仅用0占位
116 | eval_metric_ops=metrics
117 | )
118 |
119 |
120 | def main_es(unparsed):
121 | """
122 | main method
123 | :param unparsed:
124 | :return:
125 | """
126 | cur_time = time.time()
127 | model_dir = ARGS.model_ckpt_dir + str(int(cur_time))
128 |
129 | classifier = tf.estimator.Estimator(
130 | model_fn=model_fn,
131 | model_dir=model_dir,
132 | params={}
133 | )
134 |
135 | # train
136 | train_input = init_data(ARGS.train_path, 'train')
137 | tensors_to_log = {'train_accuracy': 'train_accuracy'}
138 | logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=100)
139 | classifier.train(input_fn=train_input, hooks=[logging_hook])
140 |
141 | # eval
142 | test_input = init_data(ARGS.test_path, 'test')
143 | eval_res = classifier.evaluate(input_fn=test_input)
144 | print(f'Evaluation res is : \n\t{eval_res}')
145 |
146 | if ARGS.model_pb_dir:
147 | words = tf.placeholder(tf.int64, [None, None], name='input_words')
148 | words_len = tf.placeholder(tf.int64, [None], name='input_len')
149 | input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({
150 | 'words': words,
151 | 'words_len': words_len
152 | })
153 | classifier.export_savedmodel(ARGS.model_pb_dir, input_fn)
154 |
155 |
156 | if __name__ == '__main__':
157 | tf.app.run(main=main_es, argv=[sys.argv[0]])
--------------------------------------------------------------------------------
/named_entity_recognition/pics/命名实体识别数据图.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xudongMk/AwesomeNLPBaseline/83774119d8b3a68b062df25f0e8775970fab8079/named_entity_recognition/pics/命名实体识别数据图.png
--------------------------------------------------------------------------------
/named_entity_recognition/pics/命名实体识别的模型总结图.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xudongMk/AwesomeNLPBaseline/83774119d8b3a68b062df25f0e8775970fab8079/named_entity_recognition/pics/命名实体识别的模型总结图.png
--------------------------------------------------------------------------------
/named_entity_recognition/preprocess.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020-10-11 18:52
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : preprocess.py
6 | # @Software: PyCharm
7 |
8 | import os
9 | import _pickle as cPickle
10 | import pandas as pd
11 | import random
12 |
13 | """
14 | 数据预处理
15 | 将数据处理成id,并封装成pkl形式
16 | """
17 |
18 |
19 | # clue2020细粒度命名实体识别的类别
20 | tag_list = ['address', 'book', 'company', 'game', 'government',
21 | 'movie', 'name', 'organization', 'position', 'scene']
22 | tag_dict = {'O': 0}
23 |
24 | for tag in tag_list:
25 | tag_B = 'B-' + tag
26 | tag_I = 'I-' + tag
27 | tag_dict[tag_B] = len(tag_dict)
28 | tag_dict[tag_I] = len(tag_dict)
29 |
30 | print(tag_dict)
31 |
32 |
33 | def make_vocab(file_path):
34 | """
35 | 构建词典
36 | :param file_path:
37 | :return:
38 | """
39 | data = pd.read_csv(file_path, sep='\t', header=None)
40 | data.columns = ['text', 'tag']
41 | vocab = {'PAD': 0, 'UNK': 1}
42 | words_list = []
43 | for index, row in data.iterrows():
44 | words = row['text']
45 | for word in words:
46 | words_list.append(word)
47 |
48 | random.shuffle(words_list)
49 | for word in words_list:
50 | if word not in vocab:
51 | vocab[word] = len(vocab)
52 | return vocab
53 |
54 |
55 | def make_data(file_path, vocab):
56 | """
57 | 构建数据
58 | :param file_path:
59 | :param vocab
60 | :return:
61 | """
62 | data = pd.read_csv(file_path, sep='\t', header=None)
63 | data.columns = ['text', 'tag']
64 | word_ids = []
65 | tag_ids = []
66 | for index, row in data.iterrows():
67 | tag_str = row['tag']
68 | tags = tag_str.split(' ')
69 | words = row['text']
70 |
71 | word_id = [vocab.get(word) if word in vocab else 1 for word in words]
72 | tag_id = [tag_dict.get(tag) for tag in tags]
73 |
74 | word_ids.append(word_id)
75 | tag_ids.append(tag_id)
76 | print(word_ids[0])
77 | print(tag_ids[0])
78 | return {'words': word_ids, 'tags': tag_ids}
79 |
80 |
81 | def save_vocab(vocab, output):
82 | """
83 | save vocab dict
84 | :param vocab:
85 | :param output:
86 | :return:
87 | """
88 | with open(output, 'w', encoding='utf-8') as fr:
89 | for word in vocab:
90 | fr.write(word + '\t' + str(vocab.get(word)) + '\n')
91 | print('save vocab is ok.')
92 |
93 |
94 | def main(output_path):
95 | """
96 | main method
97 | :param output_path:
98 | :return:
99 | """
100 | data = {}
101 | # 这里是bio形式的数据集,如果不是需要提前转换成bio形式
102 | train_path = './data_path/train.txt'
103 | test_path = './data_path/dev.txt'
104 | vocab = make_vocab(train_path)
105 | train_data = make_data(train_path, vocab)
106 | test_data = make_data(test_path, vocab)
107 |
108 | data['train'] = train_data
109 | data['test'] = test_data
110 |
111 | data_path = os.path.join(output_path, 'clue_data.pkl')
112 | cPickle.dump(data, open(data_path, 'wb'), protocol=2)
113 | print('save data to pkl ok.')
114 |
115 | vocab_path = os.path.join(output_path, 'clue_vocab.txt')
116 | save_vocab(vocab, vocab_path)
117 |
118 |
119 | if __name__ == '__main__':
120 | output = './data_path/'
121 | main(output)
122 |
--------------------------------------------------------------------------------
/text_classification/README.md:
--------------------------------------------------------------------------------
1 | ## 文本分类
2 |
3 | 这里首先介绍一篇基于深度学习的文本分类综述,《Deep Learning Based Text Classification: A Comprehensive Review》,论文来源:https://arxiv.org/abs/2004.03705
4 |
5 | **文本分类简介**:
6 |
7 | 文本分类是NLP中一个非常经典的任务(对给定的句子、查询、段落或者文档打上相应的类别标签)。其应用包括机器问答、垃圾邮件识别、情感分析、新闻分类、用户意图识别等。文本数据的来源也十分广泛,比如网页数据、邮件内容、聊天记录、社交媒体、用户评论等。
8 |
9 | **文本分类三大方法**:
10 |
11 | 1. Rule-based methods:使用预定义的规则进行分类,需要很强的领域知识而且系统很难维护
12 | 2. ML (data-driven) based methods:经典的机器学习方法先通过特征工程(BoW词袋等)提取特征,再使用朴素贝叶斯、SVM、HMM、Gradient Boosting Tree和随机森林等方法进行分类。深度学习方法通常使用的是end2end形式,比如Transformer、Bert等。
13 | 3. Hybrid methods:基于规则和基于机器学习(深度学习)方法的混合
14 |
15 | **文本分类任务**:
16 |
17 | 1. 情感分析(Sentiment Analysis):给定文本,分析用户的观点并且抽取出他们的主要观点。可以是二分类,也可以是多分类任务
18 | 2. 新闻分类(News Categorization):识别新闻主题,并给用户推荐相关的新闻。主要应用于推荐系统
19 | 3. 主题分析(Topic Analysis):给定文本,抽取出其文本的一个或者多个主题
20 | 4. 机器问答(Question Answering):提取式(extractive),给定问题和一堆候选答案,从中识别出正确答案;生成式(generative),给定问题,然后生成答案。(NL2SQL?)
21 | 5. 自然语言推理(Natural Language Inference):文本蕴含任务,预测一个文本是否可以从另一个文本中推断出。一般包括entailment、contradiction和neutral三种关系类型
22 |
23 | **文本分类模型(深度学习)**:
24 |
25 | 1. 基于前馈神经网络(Feed-Forward Neural Networks)
26 | 2. 基于循环神经网络(RNN)
27 | 3. 基于卷积神经网络(CNN)
28 | 4. 基于胶囊神经网络(Capsule networks)
29 | 5. 基于Attention机制
30 | 6. 基于记忆增强网络(Memory-augmented networks)
31 | 7. 基于Transformer机制
32 | 8. 基于图神经网络
33 | 9. 基于孪生神经网络(Siamese Neural Network)
34 | 10. 混合神经网络(Hybrid models)
35 |
36 | 详解见https://blog.csdn.net/u013963380/article/details/106957420(只详细描述了前4种深度学习模型)。
37 |
38 | ## 文本分类数据集
39 |
40 | Deep Learning Based Text Classification: A Comprehensive Review一文中提到了很多的文本分类的数据集,大多数是英文的。
41 |
42 | 下面列出一些中文文本分类数据集:
43 |
44 | | 数据集 | 说明 | 链接 |
45 | | :------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
46 | | THUCNews | THUCNews是根据新浪新闻RSS订阅频道2005~2011年间的历史数据筛选过滤生成。<br>包含财经、彩票、房产、股票、家居、教育等14个类别。<br>原始数据集见:[链接](http://thuctc.thunlp.org/#%E4%B8%AD%E6%96%87%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB%E6%95%B0%E6%8D%AE%E9%9B%86THUCNews) | [下载地址](http://thuctc.thunlp.org/#%E4%B8%AD%E6%96%87%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB%E6%95%B0%E6%8D%AE%E9%9B%86THUCNews) |
47 | | 今日头条 | 来源于今日头条,为短文本分类任务,数据包含15个类别 | [下载地址](https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip) |
48 | | IFLYTEK | 1.7万多条关于app应用描述的长文本标注数据,包含和日常生活相关的各类应用主题,共119个类别 | [下载地址](https://storage.googleapis.com/cluebenchmark/tasks/iflytek_public.zip) |
49 | | 新闻标题 | 数据集来源于Kesci平台,为新闻标题领域短文本分类任务。<br>内容大多为短文本标题(length<50),数据包含15个类别,共38w条样本 | [下载地址](https://pan.baidu.com/s/1vyGSIycsan3YWHEjBod9pw)<br>提取码:lrmv |
50 | | 复旦文本 | 数据集来源于复旦大学,为文本分类任务,数据包含20个类别,共9804篇文档 | [下载地址](https://pan.baidu.com/s/1vyGSIycsan3YWHEjBod9pw)<br>提取码:lrmv |
51 | | OCNLI | 中文原版自然语言推理,是第一个非翻译的、使用原生汉语的大型中文自然语言推理数据集。<br>详细见https://github.com/CLUEbenchmark/OCNLI | [下载地址](https://storage.googleapis.com/cluebenchmark/tasks/ocnli_public.zip) |
52 | | 情感分析 | OCEMOTION–中文情感分类,对应文章https://www.aclweb.org/anthology/L16-1291.pdf<br>原始数据集未找到,只有一部分数据 | [下载地址](https://pan.baidu.com/s/1vyGSIycsan3YWHEjBod9pw)<br>提取码:lrmv |
53 | | 更新ing | ... | ... |
54 |
55 | 还有一些其他的中文文本数据集,可以在CLUE上搜索,CLUE地址:https://www.cluebenchmarks.com/ ,但是下载需要注册账号,有的链接失效,有的限制日下载次数,这里放到百度网盘供下载学习使用。(请勿用于商业目的)
56 |
57 | ## 文本分类Baseline算法实现
58 |
59 | 使用Tensorflow1.x版本Estimator高阶API实现常见文本分类算法,主要包括前馈神经网络(纯全连接层)模型、双向LSTM模型、文本卷积网络(TextCnn)、Transformer。
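
这些模型统一通过 Estimator 的 model_fn 组织训练、评估和预测逻辑,大致套路如下(仅为示意,完整实现见 train_main.py;其中 MyModel 为占位名,实际可替换为 models 目录下的 BiLstmModel、TextCnn 或 FCModel):

```python
import tensorflow as tf

def model_fn(features, labels, mode, params):
    # MyModel 为占位写法,代表 models 目录下的任意一个分类模型
    model = MyModel(params['vocab_size'], params['emb_size'], params['args'])
    logits = model(features['words'], training=(mode == tf.estimator.ModeKeys.TRAIN))

    if mode == tf.estimator.ModeKeys.PREDICT:
        preds = {'class_out': tf.argmax(logits, axis=-1, name='class_out')}
        return tf.estimator.EstimatorSpec(mode, predictions=preds)

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = tf.train.AdamOptimizer(1e-4).minimize(
            loss, global_step=tf.train.get_or_create_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

    metrics = {'accuracy': tf.metrics.accuracy(labels, tf.argmax(logits, axis=-1))}
    return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)
```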
60 |
61 | 环境信息:
62 |
63 | tensorflow==1.13.1
64 |
65 | python==3.7
66 |
67 | **数据预处理**
68 |
69 | 要求训练集和测试集分开存储(提供划分数据集方法),另外需要对文本进行分词,数据EDA部分可以见示例中的tnews_data_eda.ipynb文件。
70 |
71 | 在训练模型前,需要先运行preprocess.py文件进行数据预处理,将数据处理成id形式并保存为pkl形式,另外中间过程产生的词表也会保存为vocab.txt文件。
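
preprocess.py 处理完成后会在 data_path 下生成 tnews_data.pkl,其大致结构可以用下面的代码快速检查(示意代码,路径以实际生成的文件为准):

```python
import _pickle as cPickle

# pkl 的结构为:{'train': {'words': [...], 'labels': [...]}, 'test': {...}}
data = cPickle.load(open('./data_path/tnews_data.pkl', 'rb'))
print(len(data['train']['words']), len(data['test']['words']))  # 训练/测试样本数
print(data['train']['words'][0])   # 一条样本的词id序列
print(data['train']['labels'][0])  # 对应的标签id
```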
72 |
73 | **文件结构**
74 |
75 | - data_path:数据集存放的位置
76 | - data_utils:数据处理相关的工具类存放位置
77 | - model_ckpt:模型checkpoint保存的位置
78 | - model_pb:pb形式的模型保存的位置
79 | - models:文本分类baseline模型存放的位置,包括BiLstm、TextCnn等
80 | - train_main.py:模型训练主入口
81 | - preprocess.py:数据预处理代码,包括划分数据集、转换文本为id等
82 | - tf_metrics.py:tensorflow1.x版本不支持多分类的指标函数,这里使用的是Guillaume Genthial编写的多分类指标函数,[github地址](https://github.com/guillaumegenthial/tf_metrics)
83 | - inference.py:推理主入口
84 |
85 | **模型训练过程**
86 |
87 | - 首先准备好数据集,放在data_path下,然后运行preprocess.py文件
88 | - 运行train_main.py,具体的模型参数可以在ARGS里面设置,也可以使用 `python train_main.py --train_path='./data_path/emotion_data.pkl'` 这种命令行传参的形式,完整示例见下方命令
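
一个更完整的启动示例如下(参数名与 train_main.py 中 ArgumentParser 的定义对应,取值仅供参考):

```
python train_main.py \
  --train_path='./data_path/tnews_data.pkl' \
  --eval_path='./data_path/tnews_data.pkl' \
  --vocab_size=68000 \
  --num_label=15 \
  --model_name='lstm' \
  --batch_size=16 \
  --epoch=5
```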
89 |
90 | **模型推理**
91 |
92 | - 推理代码在inference.py中
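
其核心流程是加载 model_pb 下导出的 SavedModel,按名称取出输入、输出张量后执行 sess.run,示意如下(与 inference.py 中的实现思路一致,目录名以实际导出的时间戳为准):

```python
import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, ['serve'], './model_pb/1609247078')
    input_p = sess.graph.get_tensor_by_name('input_words:0')  # 输入:词id序列
    out = sess.graph.get_tensor_by_name('class_out:0')        # 输出:预测类别
    pred = sess.run(out, feed_dict={input_p: [[23, 7, 105]]})  # 示例id,实际应来自vocab映射
    print(pred)
```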
93 |
94 | ## 示例
95 |
96 | 下面使用中文任务测评基准(CLUE benchmark)的头条新闻分类数据来进行demo演示:
97 |
98 | 数据集下载地址:https://github.com/CLUEbenchmark/CLUE 中的[TNEWS'数据集下载](https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip)
99 |
100 | 该数据集来自今日头条新闻版块,共15个类别的新闻,包括旅游、教育、金融、军事等。
101 |
102 | ```
103 | 数据量:训练集(53,360),验证集(10,000),测试集(10,000)
104 | 例子:
105 | {"label": "102", "label_des": "news_entertainment", "sentence": "江疏影甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物"}
106 | 每一条数据有三个属性,从前往后分别是 分类ID,分类名称,新闻字符串(仅含标题)。
107 | ```
108 |
109 | **1.数据EDA**
110 |
111 | 数据EDA部分见tnews_data_eda.ipynb,主要是简单分析一下数据集文本的长度分布、类别标签的数量比。然后对文本进行分词,这里使用的是jieba分词工具。分词后将数据集保存到data_path目录下。
112 |
113 | ```
114 | # 各种类别标签的数量分布
115 | 109 5955
116 | 104 5200
117 | 102 4976
118 | 113 4851
119 | 107 4118
120 | 101 4081
121 | 103 3991
122 | 110 3632
123 | 108 3437
124 | 116 3390
125 | 112 3368
126 | 115 2886
127 | 106 2107
128 | 100 1111
129 | 114 257
130 | ```
131 |
132 | **2.设置训练参数**
133 |
134 | 参数设置的时候有几个参数需要根据自己的数据分布来设置:
135 |
136 | - vocab_size:词表大小,一般需要根据自己生成的vocab.txt中词表的大小来设置
137 | - num_label:类别标签的数量
138 | - train_path/eval_path:数据集的路径
139 | - weights权重设置:根据数据EDA中的类别标签分布,设置weights=[0.9,0.9,0.9,0.9,1,1,1,1,1,1,1,1,1,1.2,1.5],后面几个类别的数量明显偏少,权重设置大一点。具体数值可根据自己的分析来定,权重在损失计算中的用法见下方示意代码
140 |
141 | 其他的参数视个人情况而定
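
上面提到的 weights 的一种常见用法,是在计算交叉熵损失时按样本标签加权,示意如下(仅为思路示意,并非 train_main.py 的原样实现):

```python
import tensorflow as tf

# 15个类别的权重,顺序与 label id 对应,数值即上面 weights 的设置
class_weights = tf.constant(
    [0.9, 0.9, 0.9, 0.9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1.2, 1.5], dtype=tf.float32)

def weighted_loss(logits, labels):
    # 按每条样本的标签取出对应权重,再做加权的 sparse softmax 交叉熵
    sample_weights = tf.gather(class_weights, labels)
    return tf.losses.sparse_softmax_cross_entropy(
        labels=labels, logits=logits, weights=sample_weights)
```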
142 |
143 | **3.模型训练并保存模型**
144 |
145 | 这里使用的是BiLstm模型。
146 |
147 | 代码中保存了两种模型形式,一种是checkpoint,另一种是pb格式
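
pb 格式通过 Estimator 的 export_savedmodel 导出,导出时需要先定义 serving 的输入,示意如下(与 ner_main.py 中的导出方式同一思路;classifier 指训练完成的 Estimator 实例):

```python
import tensorflow as tf

words = tf.placeholder(tf.int64, [None, None], name='input_words')
serving_input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({'words': words})
# classifier 为训练完成的 tf.estimator.Estimator 实例(此处仅为示意)
classifier.export_savedmodel('./model_pb', serving_input_fn)
```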
148 |
149 | **4.开始预测并提交结果**
150 |
151 | 预测代码见inference.py,最后在CLUE上提交的结果是50.92([ALBERT-xxlarge](https://github.com/google-research/albert):59.46,目前[UER-ensemble](https://github.com/dbiir/UER-py):72.20)
152 |
153 | ## 中文文本分类比赛OR评测
154 |
155 | 1.[零基础入门NLP-新闻文本分类](https://tianchi.aliyun.com/competition/entrance/531810/introduction?spm=5176.12281973.1005.4.3dd52448KQuWQe)(DataWhale和天池举办的学习赛)
156 |
157 | 2.[中文CLUE的各种分类任务的评测](https://www.cluebenchmarks.com/)
--------------------------------------------------------------------------------
/text_classification/data_path/README.md:
--------------------------------------------------------------------------------
1 | ### 文件说明
2 |
3 | 这里存放的是数据集
--------------------------------------------------------------------------------
/text_classification/data_path/tnews_data.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/xudongMk/AwesomeNLPBaseline/83774119d8b3a68b062df25f0e8775970fab8079/text_classification/data_path/tnews_data.pkl
--------------------------------------------------------------------------------
/text_classification/inference.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/28 21:45
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : inference.py
6 | # @Software: PyCharm
7 |
8 | import tensorflow as tf
9 | import tqdm
10 | import json
11 | import jieba
12 |
13 | """
14 | 文本分类推理代码
15 | """
16 |
17 | # 设置filter
18 | filter = './??;。(())【】{}[]!!,,<>《》+'
19 | # 加载词典
20 | word_dict = {}
21 | with open('./data_path/vocab.txt', encoding='utf-8') as fr:
22 | lines = fr.readlines()
23 | for line in lines:
24 | word = line.split('\t')[0]
25 | id = int(line.split('\t')[1])  # 转成int,避免把'5\n'这类字符串feed给int64的placeholder
26 | word_dict[word] = id
27 | print(word_dict)
28 |
29 | # label dict的设置
30 | label_id = {0: 109, 1: 104, 2: 102, 3: 113,
31 | 4: 107, 5: 101, 6: 103, 7: 110,
32 | 8: 108, 9: 116, 10: 112, 11: 115,
33 | 12: 106, 13: 100, 14: 114}
34 | label_desc = {100: "news_story", 101: "news_culture", 102: "news_entertainment",
35 | 103: "news_sports", 104: "news_finance", 106: "news_house",
36 | 107: "news_car", 108: "news_edu", 109: "news_tech",
37 | 110: "news_military", 112: "news_travel", 113: "news_world",
38 | 114: "news_stock", 115: "news_agriculture", 116: "news_game"}
39 |
40 |
41 | def cut_with_jieba(text, filter=None):
42 | """ 使用jieba切分句子 """
43 | if filter:
44 | for c in filter:
45 | text = text.replace(c, '')
46 | words = ['Number' if word.isdigit() else word for word in jieba.cut(text)]
47 | return words
48 |
49 |
50 | def words_to_ids(words, word_dict):
51 | """ 将words 转换成ids形式 """
52 | ids = [word_dict.get(word, 1) for word in words]
53 | return ids
54 |
55 |
56 | def predict_main(test_file, out_path):
57 | """ 预测主入口 """
58 | model_path = './model_pb/1609247078'
59 | with tf.Session(graph=tf.Graph()) as sess:
60 | model = tf.saved_model.loader.load(sess, ['serve'], model_path)
61 | # print(model)
62 | out = sess.graph.get_tensor_by_name('class_out:0')
63 | input_p = sess.graph.get_tensor_by_name('input_words:0')
64 |
65 | with open(test_file, encoding='utf-8') as fr:
66 | lines = fr.readlines()
67 | res_list = []
68 | for line in tqdm.tqdm(lines):
69 | json_str = json.loads(line)
70 | id = json_str['id']
71 | sentence = json_str['sentence']
72 |
73 | words = cut_with_jieba(str(sentence), filter)
74 | if len(words) < 1:
75 | print(f'sample {id} is empty after preprocessing!')
76 | text_features = words_to_ids(words, word_dict)
77 | feed = {input_p: [text_features]}
78 | score = sess.run(out, feed_dict=feed)
79 |
80 | label = label_id.get(score[0])
81 | label_d = label_desc.get(label)
82 |
83 | res_list.append(
84 | json.dumps({"id": id, "label": str(label), "label_desc": label_d}))
85 | # 写入到文件
86 | with open(out_path, 'w', encoding='utf-8') as fr:
87 | for res in res_list:
88 | fr.write(res)
89 | fr.write('\n')
90 | print('predict and write to file over!!!')
91 |
92 |
93 | if __name__ == '__main__':
94 | test_file = './data_path/test.json'
95 | out_path = './data_path/tnews_predict.json'
96 | predict_main(test_file, out_path)
97 |
--------------------------------------------------------------------------------
/text_classification/model_ckpt/README.md:
--------------------------------------------------------------------------------
1 | ### 文件说明
2 |
3 | 这里保存训练后的checkpoint文件
--------------------------------------------------------------------------------
/text_classification/model_pb/README.md:
--------------------------------------------------------------------------------
1 | ### 文件说明
2 |
3 | 这里保存训练后的pb模型文件
--------------------------------------------------------------------------------
/text_classification/models/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/10 21:34
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : __init__.py.py
6 | # @Software: PyCharm
7 |
8 |
--------------------------------------------------------------------------------
/text_classification/models/attention.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/12/10 21:51
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : attention.py
6 | # @Software: PyCharm
7 |
8 | import tensorflow as tf
9 | from .base_model import Linear
10 |
11 |
12 | class Attention:
13 | """
14 | the attention
15 | """
16 | def __init__(self, scope_name, hidden_size, num_heads, dropout):
17 | if hidden_size % num_heads != 0:
18 | raise ValueError('hidden_size must be divisible by num_heads!')
19 |
20 | self.hidden_size = hidden_size
21 | self.num_heads = num_heads
22 |
23 | self.q_layer = Linear(f'{scope_name}_q', hidden_size, hidden_size, bias=False)
24 | self.k_layer = Linear(f'{scope_name}_k', hidden_size, hidden_size, bias=False)
25 | self.v_layer = Linear(f'{scope_name}_v', hidden_size, hidden_size, bias=False)
26 |
27 | self.out_layer = Linear(f'{scope_name}_output', hidden_size,
28 | hidden_size, bias=False)
29 | self.dropout = tf.layers.Dropout(dropout)
30 |
31 | def split_heads(self, x):
32 | """ split the heads """
33 | with tf.name_scope('split_heads'):
34 | batch_size = tf.shape(x)[0]
35 | length = tf.shape(x)[1]
36 |
37 | depth = self.hidden_size // self.num_heads
38 |
39 | x = tf.reshape(x, [batch_size, length, self.num_heads, depth])
40 |
41 | return tf.transpose(x, [0, 2, 1, 3])
42 |
43 | def combine_heads(self, x):
44 | """ combine the heads """
45 | with tf.name_scope('combine_heads'):
46 | batch_size = tf.shape(x)[0]
47 | length = tf.shape(x)[2]
48 |
49 | x = tf.transpose(x, [0, 2, 1, 3])  # batch, length, heads, depth
50 | return tf.reshape(x, [batch_size, length, self.hidden_size])
51 |
52 | def call(self, x, y, training, bias, cache=None):
53 | q = self.q_layer(x, training)
54 | k = self.k_layer(y, training)
55 | v = self.v_layer(y, training)
56 |
57 | if cache:
58 | k = tf.concat([cache['k'], k], axis=1)
59 | v = tf.concat([cache['v'], v], axis=1)
60 |
61 | cache['k'] = k
62 | cache['v'] = v
63 |
64 | q = self.split_heads(q)
65 | k = self.split_heads(k)
66 | v = self.split_heads(v)
67 |
68 | depth = self.hidden_size // self.num_heads
69 | q *= depth ** -0.5
70 |
71 | # calculate dot product attention
72 | logits = tf.matmul(q, k, transpose_b=True)
73 | logits += bias
74 | weights = tf.nn.softmax(logits)
75 | weights = self.dropout(weights, training=training)
76 | attention_output = tf.matmul(weights, v)
77 |
78 | attention_output = self.combine_heads(attention_output)
79 |
80 | attention_output = self.out_layer(attention_output, training)
81 | return attention_output
82 |
83 |
84 | class SelfAttention(Attention):
85 | def __call__(self, x, training, bias, cache=None):
86 | return super(SelfAttention, self).call(x, x, training, bias, cache)
--------------------------------------------------------------------------------
/text_classification/models/base_model.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020/11/10 21:34
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : base_model.py
6 | # @Software: PyCharm
7 |
8 | import tensorflow as tf
9 |
10 | from tensorflow.contrib.rnn import LSTMCell
11 | from tensorflow.contrib.rnn import MultiRNNCell
12 | from tensorflow.contrib.rnn import GRUCell
13 | from tensorflow.contrib.rnn import BasicRNNCell
14 |
15 |
16 | class Linear:
17 | """
18 | 线性层,全连接层
19 | """
20 | def __init__(self, scope_name, input_size, output_sizes, bias=True,
21 | activator='', drop_out=0., reuse=False, trainable=True):
22 | self.input_size = input_size
23 |
24 | # todo 判断 output_sizes 是不是列表
25 | if not isinstance(output_sizes, list):
26 | output_sizes = [output_sizes]
27 |
28 | self.output_size = output_sizes[-1]
29 |
30 | self.W = []
31 | self.b = []
32 | size = input_size
33 | with tf.variable_scope(scope_name, reuse=reuse):
34 | for i, output_size in enumerate(output_sizes):
35 | W = tf.get_variable(
36 | 'W{0}'.format(i), [size, output_size],
37 | initializer=tf.random_uniform_initializer(-0.25, 0.25),
38 | trainable=trainable
39 | )
40 | if bias:
41 | b = tf.get_variable(
42 | 'b{0}'.format(i), [output_size],
43 | initializer=tf.zeros_initializer(),
44 | trainable=trainable
45 | )
46 | else:
47 | b = None
48 |
49 | self.W.append(W)
50 | self.b.append(b)
51 | size = output_size
52 |
53 | if activator == 'relu':
54 | self.activator = tf.nn.relu
55 | elif activator == 'relu6':
56 | self.activator = tf.nn.relu6
57 | elif activator == 'tanh':
58 | self.activator = tf.nn.tanh
59 | else:
60 | self.activator = tf.identity
61 |
62 | self.drop_out = tf.layers.Dropout(drop_out)
63 |
64 | def __call__(self, input, training):
65 | size = tf.shape(input)
66 | input_trans = tf.reshape(input, [-1, size[-1]])
67 | for W, b in zip(self.W, self.b):
68 | if b is not None:
69 | input_trans = tf.nn.xw_plus_b(input_trans, W, b)
70 | else:
71 | input_trans = tf.matmul(input_trans, W)
72 |
73 | input_trans = self.drop_out(input_trans, training)
74 | input_trans = self.activator(input_trans)
75 |
76 | new_size = tf.concat([size[:-1], tf.constant([self.output_size])], 0)
77 | input_trans = tf.reshape(input_trans, new_size)
78 | return input_trans
79 |
80 |
81 | class LookupTable:
82 | """
83 | embedding层
84 | """
85 | def __init__(self, scope_name, vocab_size, embed_size, reuse=False, trainable=True):
86 | self.vocab_size = vocab_size
87 | self.embed_size = embed_size
88 |
89 | with tf.variable_scope(scope_name, reuse=bool(reuse)):
90 | self.embedding = tf.get_variable(
91 | 'embedding', [vocab_size, embed_size],
92 | initializer=tf.random_uniform_initializer(-0.25, 0.25),
93 | trainable=trainable
94 | )
95 |
96 | def __call__(self, input):
97 | input = tf.where(tf.less(input, self.vocab_size), input, tf.ones_like(input))
98 | return tf.nn.embedding_lookup(self.embedding, input)
99 |
100 |
101 | class AttentionPooling:
102 | """
103 | attention pooling层
104 | """
105 | def __init__(self, scope_name, input_size, hidden_size, reuse=False,
106 | trainable=True):
107 | name = scope_name
108 | self.linear1 = Linear(f'{name}_linear1', input_size,
109 | hidden_size, bias=False, reuse=reuse,
110 | trainable=trainable)
111 | self.linear2 = Linear(f'{name}_linear2', hidden_size, 1,
112 | bias=False, reuse=reuse, trainable=trainable)
113 |
114 | def __call__(self, input, mask, training):
115 | output_linear1 = self.linear1(input, training)
116 | output_linear2 = self.linear2(output_linear1, training)
117 | weights = tf.squeeze(output_linear2, [-1])
118 | if mask is not None:
119 | weights += mask
120 | weights = tf.nn.softmax(weights, -1)
121 | return tf.reduce_sum(input * tf.expand_dims(weights, -1), axis=1)
122 |
123 |
124 | class LayerNormalization:
125 | """
126 | 归一化层
127 | """
128 | def __init__(self, scope_name, hidden_size):
129 | with tf.variable_scope(scope_name):
130 | self.scale = tf.get_variable('layer_norm_scale', [hidden_size],
131 | initializer=tf.ones_initializer())
132 | self.bias = tf.get_variable('layer_norm_bias', [hidden_size],
133 | initializer=tf.zeros_initializer())
134 |
135 | def __call__(self, x, epsilon=1e-6):
136 | mean, variance = tf.nn.moments(x, -1, keep_dims=True)
137 | norm_x = (x - mean) * tf.rsqrt(variance + epsilon)
138 | return norm_x * self.scale + self.bias
139 |
140 |
141 | class LstmBase:
142 | """
143 | RNN的基础层
144 | """
145 | def build_rnn(self, rnn_type, hidden_size, num_layes):
146 | cells = []
147 | for i in range(num_layes):
148 | if rnn_type == 'lstm':
149 | cell = LSTMCell(num_units=hidden_size,
150 | state_is_tuple=True,
151 | initializer=tf.random_uniform_initializer(-0.25, 0.25))
152 | elif rnn_type == 'gru':
153 | cell = GRUCell(num_units=hidden_size)
154 | elif rnn_type == 'rnn':
155 | cell = BasicRNNCell(num_units=hidden_size)
156 | else:
157 | raise NotImplementedError(f'unknown rnn type: {rnn_type}')
158 | cells.append(cell)
159 |
160 | cells = MultiRNNCell(cells, state_is_tuple=True)
161 |
162 | return cells
163 |
164 |
165 | class BiLstm(LstmBase):
166 | """
167 | 双向LSTM层
168 | """
169 | def __init__(self, scope_name, hidden_size, num_layers):
170 | super(BiLstm, self).__init__()
171 | assert hidden_size % 2 == 0
172 | hidden_size //= 2  # 每个方向的num_units为总hidden size的一半,需保持整数
173 |
174 | self.fw_rnns = []
175 | self.bw_rnns = []
176 | for i in range(num_layers):
177 | self.fw_rnns.append(self.build_rnn('lstm', hidden_size, 1))
178 | self.bw_rnns.append(self.build_rnn('lstm', hidden_size, 1))
179 |
180 | self.scope_name = scope_name
181 |
182 | def __call__(self, input, input_len):
183 | for idx, (fw_rnn, bw_rnn) in enumerate(zip(self.fw_rnns, self.bw_rnns)):
184 | scope_name = '{}_{}'.format(self.scope_name, idx)
185 | ctx, _ = tf.nn.bidirectional_dynamic_rnn(
186 | fw_rnn, bw_rnn, input, sequence_length=input_len,
187 | dtype=tf.float32, time_major=False,
188 | scope=scope_name
189 | )
190 | input = tf.concat(ctx, -1)
191 | ctx = input
192 | return ctx
193 |
194 |
195 | class Cnn:
196 | """
197 | define cnn
198 | """
199 | def __init__(self, scope_name, input_size, hidden_size):
200 | kws = [3]
201 | self.conv_ws = []
202 | self.conv_bs = []
203 | for idx, kw in enumerate(kws):
204 | w = tf.get_variable(
205 | f"conv_w_{idx}",
206 | [kw, input_size, hidden_size],
207 | initializer=tf.random_uniform_initializer(-0.1, 0.1)
208 | )
209 | b = tf.get_variable(
210 | f"conv_b_{idx}",
211 | [hidden_size],
212 | initializer=tf.zeros_initializer()
213 | )
214 | self.conv_ws.append(w)
215 | self.conv_bs.append(b)
216 |
217 | def __call__(self, input, mask):
218 | outputs = []
219 | for conv_w, conv_b in zip(self.conv_ws, self.conv_bs):
220 | conv = tf.nn.conv1d(input, conv_w, 1, 'SAME')
221 | conv = tf.nn.bias_add(conv, conv_b)
222 | if mask is not None:
223 | conv += tf.expand_dims(mask, -1)
224 | outputs.append(conv)
225 | output = tf.concat(outputs, -1)
226 | return output
227 |
--------------------------------------------------------------------------------
/text_classification/models/bilstm_model.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020-12-11 12:37
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : bilstm_model.py
6 | # @Software: PyCharm
7 | import tensorflow as tf
8 |
9 | from .base_model import LookupTable
10 | from .base_model import BiLstm
11 | from .base_model import Linear
12 |
13 |
14 | class BiLstmModel:
15 | """
16 | BiLstm模型的实现:
17 | 主要包含:embedding层、rnn层、池化层、两层全连接层和一个Dropout层
18 | """
19 | def __init__(self, vocab_size, emb_size, args):
20 |
21 | # embedding
22 | scope_name = 'look_up'
23 | self.lookuptables = LookupTable(scope_name, vocab_size, emb_size)
24 |
25 | # rnn
26 | scope_name = 'bi_lstm'
27 | # rnn的层数 这里设置为1
28 | num_layers = 1
29 | self.rnn = BiLstm(scope_name, args.hidden_size, num_layers)
30 |
31 | # linear1
32 | scope_name = 'linear1'
33 | self.linear1 = Linear(scope_name, args.hidden_size, args.fc_layer_size,
34 | activator=args.activator)
35 |
36 | # logits out
37 | scope_name = 'linear2'
38 | self.linear2 = Linear(scope_name, args.fc_layer_size, args.num_label)
39 |
40 | self.dropout = tf.layers.Dropout(args.drop_out)
41 |
42 | def max_pool(inputs):
43 | return tf.reduce_max(inputs, 1)
44 |
45 | def mean_pool(inputs):
46 | return tf.reduce_mean(inputs, 1)
47 |
48 | if args.pool == 'max':
49 | self.pool = max_pool
50 | else:
51 | self.pool = mean_pool
52 |
53 | def __call__(self, inputs, training):
54 | masks = tf.sign(inputs)
55 | sent_len = tf.reduce_sum(masks, axis=1)
56 |
57 | embedding = self.lookuptables(inputs)
58 |
59 | rnn_out = self.rnn(embedding, sent_len)
60 | pool_out = self.pool(rnn_out)
61 | linear_out = self.linear1(pool_out, training)
62 | # dropout
63 | linear_out = self.dropout(linear_out, training)
64 | # linear
65 | output = self.linear2(linear_out, training)
66 | return output
--------------------------------------------------------------------------------
/text_classification/models/ffnn_model.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020-12-11 19:05
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : ffnn_model.py
6 | # @Software: PyCharm
7 |
8 | import tensorflow as tf
9 |
10 | from .base_model import LookupTable
11 | from .base_model import Linear
12 |
13 |
14 | class FCModel:
15 | """
16 | 前馈网络
17 | 主要包括:embedding层、两个全连接层和一个dropout层
18 | """
19 | def __init__(self, vocab_size, emb_size, args):
20 |
21 | # embedding
22 | scope_name = 'look_up'
23 | self.lookuptables = LookupTable(scope_name, vocab_size, emb_size)
24 |
25 | fc_layer_size = args.fc_layer_size # 全链接层的size
26 | scope_name = 'linear1'
27 | self.linear1 = Linear(scope_name, emb_size, fc_layer_size, activator=args.activator)  # activator需用关键字传参,否则会被当作bias参数
28 |
29 | scope_name = 'linear2'
30 | self.linear2 = Linear(scope_name, fc_layer_size, args.num_label)
31 |
32 | self.dropout = tf.layers.Dropout(args.drop_out)
33 |
34 | def __call__(self, inputs, training):
35 | embedding = self.lookuptables(inputs)
36 | pool_out = tf.reduce_mean(embedding, 1)
37 | pool_out = self.dropout(pool_out, training)
38 | pool_out = self.linear1(pool_out, training)
39 | pool_out = self.dropout(pool_out, training)
40 | output = self.linear2(pool_out, training)
41 |
42 | return output
43 |
--------------------------------------------------------------------------------
/text_classification/models/model_utils.py:
--------------------------------------------------------------------------------
1 | """Transformer model helper methods."""
2 |
3 | import math
4 |
5 | import numpy as np
6 | import tensorflow as tf
7 |
8 | _NEG_INF_FP32 = -1e9
9 | _NEG_INF_FP16 = np.finfo(np.float16).min
10 |
11 |
12 | def get_position_encoding(length,
13 | hidden_size,
14 | min_timescale=1.0,
15 | max_timescale=1.0e4):
16 | """Return positional encoding.
17 | Calculates the position encoding as a mix of sine and cosine functions with
18 | geometrically increasing wavelengths.
19 | Defined and formulized in Attention is All You Need, section 3.5.
20 | Args:
21 | length: Sequence length.
22 | hidden_size: Size of the positional encoding (i.e. the model hidden size).
23 | min_timescale: Minimum scale that will be applied at each position
24 | max_timescale: Maximum scale that will be applied at each position
25 | Returns:
26 | Tensor with shape [length, hidden_size]
27 | """
28 | # We compute the positional encoding in float32 even if the model uses
29 | # float16, as many of the ops used, like log and exp, are numerically unstable
30 | # in float16.
31 | position = tf.cast(tf.range(length), tf.float32)
32 | num_timescales = hidden_size // 2
33 | log_timescale_increment = (
34 | math.log(float(max_timescale) / float(min_timescale)) /
35 | (tf.cast(num_timescales, tf.float32) - 1))
36 | inv_timescales = min_timescale * tf.exp(
37 | tf.cast(tf.range(num_timescales), tf.float32) * -log_timescale_increment)
38 | scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)
39 | signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)
40 | return signal
41 |
42 |
43 | def get_decoder_self_attention_bias(length, dtype=tf.float32):
44 | """Calculate bias for decoder that maintains model's autoregressive property.
45 | Creates a tensor that masks out locations that correspond to illegal
46 | connections, so prediction at position i cannot draw information from future
47 | positions.
48 | Args:
49 | length: int length of sequences in batch.
50 | dtype: The dtype of the return value.
51 | Returns:
52 | float tensor of shape [1, 1, length, length]
53 | """
54 | neg_inf = _NEG_INF_FP16 if dtype == tf.float16 else _NEG_INF_FP32
55 | with tf.name_scope("decoder_self_attention_bias"):
56 | valid_locs = tf.linalg.band_part(
57 | tf.ones([length, length], dtype=dtype), -1, 0)
58 | valid_locs = tf.reshape(valid_locs, [1, 1, length, length])
59 | decoder_bias = neg_inf * (1.0 - valid_locs)
60 | return decoder_bias
61 |
62 |
63 | def get_padding(x, padding_value=0, dtype=tf.float32):
64 | """Return float tensor representing the padding values in x.
65 | Args:
66 | x: int tensor with any shape
67 | padding_value: int which represents padded values in input
68 | dtype: The dtype of the return value.
69 | Returns:
70 | float tensor with same shape as x containing values 0 or 1.
71 | 0 -> non-padding, 1 -> padding
72 | """
73 | with tf.name_scope("padding"):
74 | return tf.cast(tf.equal(x, padding_value), dtype)
75 |
76 |
77 | def get_padding_bias(x, padding_value=0, dtype=tf.float32):
78 | """Calculate bias tensor from padding values in tensor.
79 | Bias tensor that is added to the pre-softmax multi-headed attention logits,
80 | which has shape [batch_size, num_heads, length, length]. The tensor is zero at
81 | non-padding locations, and -1e9 (negative infinity) at padding locations.
82 | Args:
83 | x: int tensor with shape [batch_size, length]
84 | padding_value: int which represents padded values in input
85 | dtype: The dtype of the return value
86 | Returns:
87 | Attention bias tensor of shape [batch_size, 1, 1, length].
88 | """
89 | with tf.name_scope("attention_bias"):
90 | padding = get_padding(x, padding_value, dtype)
91 | attention_bias = padding * _NEG_INF_FP32
92 | attention_bias = tf.expand_dims(
93 | tf.expand_dims(attention_bias, axis=1), axis=1)
94 | return attention_bias
95 |
--------------------------------------------------------------------------------
/text_classification/models/text_cnn.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020-12-11 19:19
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : text_cnn.py
6 | # @Software: PyCharm
7 |
8 | import tensorflow as tf
9 |
10 | from .base_model import LookupTable
11 | from .base_model import BiLstm
12 | from .base_model import Linear
13 |
14 |
15 | class TextCnn:
16 | """
17 | text cnn model
18 | 主要包括:embedding层、三个不同size卷积核层、两个全连接层和dropout层
19 | """
20 | def __init__(self, vocab_size, emb_size, args):
21 |
22 | # embedding
23 | scope_name = 'look_up'
24 | self.lookuptables = LookupTable(scope_name, vocab_size, emb_size)
25 |
26 | # 三个卷积核
27 | kws = [2, 3, 5]
28 | self.conv_ws = []
29 | self.conv_bs = []
30 |
31 | # the num of filter 卷积核的数量
32 | filter_num = args.filter_num
33 | for idx, kw in enumerate(kws):
34 | w = tf.get_variable(
35 | f"conv_w_{idx}",
36 | [kw, emb_size, filter_num],
37 | initializer=tf.random_uniform_initializer(-0.25, 0.25)
38 | )
39 | b = tf.get_variable(
40 | f"conv_b_{idx}",
41 | [filter_num],
42 | initializer=tf.random_uniform_initializer(-0.25, 0.25)
43 | )
44 | self.conv_ws.append(w)
45 | self.conv_bs.append(b)
46 |
47 | scope_name = 'linear1'
48 | self.linear1 = Linear(scope_name, len(kws) * filter_num,
49 | args.fc_layer_size, activator=args.activator)
50 |
51 | scope_name = 'linear2'
52 | self.linear2 = Linear(scope_name, args.fc_layer_size, args.num_label)
53 |
54 | self.dropout = tf.layers.Dropout(args.drop_out)
55 |
56 | def __call__(self, inputs, training):
57 | embedding = self.lookuptables(inputs)
58 |
59 | outputs = []
60 | for conv_w, conv_b in zip(self.conv_ws, self.conv_bs):
61 | conv = tf.nn.conv1d(embedding, conv_w, 1, 'SAME')
62 | conv = tf.nn.bias_add(conv, conv_b)
63 | pool = tf.reduce_max(conv, axis=1)
64 | outputs.append(pool)
65 | output = tf.concat(outputs, -1)
66 | output = self.linear1(output, training)
67 | output = self.dropout(output, training)
68 | output = self.linear2(output, training)
69 | return output
--------------------------------------------------------------------------------
/text_classification/preprocess.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020-10-11 18:52
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : preprocess.py
6 | # @Software: PyCharm
7 |
8 | import os
9 | import _pickle as pickle
10 | import pandas as pd
11 | import random
12 |
13 | from sklearn.model_selection import train_test_split
14 |
15 | """
16 | 数据预处理
17 | 将数据处理成id,并封装成pkl形式
18 | """
19 |
20 | # 可以人为自定义label dict
21 | label_dict_default = {109: 0, 104: 1, 102: 2, 113: 3,
22 | 107: 4, 101: 5, 103: 6, 110: 7,
23 | 108: 8, 116: 9, 112: 10, 115: 11,
24 | 106: 12, 100: 13, 114: 14}
25 |
26 |
27 | def make_vocab(file_path):
28 | """
29 | 构建词典和label映射词典
30 | :param file_path:
31 | :return:
32 | """
33 | data = pd.read_csv(file_path, sep='\t')
34 | vocab = {'PAD': 0, 'UNK': 1}
35 | words_list = []
36 | for index, row in data.iterrows():
37 | label = row['label']
38 | words = row['words'].split(' ')
39 | for word in words:
40 | words_list.append(word)
41 | random.shuffle(words_list)
42 | for word in words_list:
43 | if word not in vocab:
44 | vocab[word] = len(vocab)
45 | # save to file and print the label dict
46 | save_path = './data_path/vocab.txt'
47 | save_vocab(vocab, save_path)
48 | print(f'the vocab size is {len(vocab)}')
49 | return vocab
50 |
51 |
52 | def make_data(file_path, vocab, type):
53 | """
54 | 构建数据
55 | :param file_path:
56 | :param vocab
57 | :return:
58 | """
59 | data = pd.read_csv(file_path, sep='\t')
60 | word_ids = []
61 | label_ids = []
62 | for index, row in data.iterrows():
63 | label = row['label']
64 | words = row['words'].split(' ')
65 | word_id_temp = [vocab.get(word) if word in vocab else 1 for word in words]
66 | word_ids.append(word_id_temp)
67 | label_ids.append(label_dict_default.get(label))
68 |
69 | print(f'the {type} data size is {len(word_ids)}')
70 | print(word_ids[0])
71 | print(label_ids[0])
72 |
73 | return {'words': word_ids, 'labels': label_ids}
74 |
75 |
76 | def save_vocab(vocab, output):
77 | """
78 | 保存vocab到本地文件
79 | :param vocab:
80 | :param output:
81 | :return:
82 | """
83 | with open(output, 'w', encoding='utf-8') as fr:
84 | for word in vocab:
85 | fr.write(word + '\t' + str(vocab.get(word)) + '\n')
86 | print('save vocab is ok.')
87 |
88 |
89 | def main(output_path):
90 | """
91 | main method
92 | :param output_path:
93 | :return:
94 | """
95 | data = {}
96 | train_path = './data_path/train_data.csv'
97 | test_path = './data_path/dev_data.csv'
98 | vocab = make_vocab(train_path)
99 | train_data = make_data(train_path, vocab, 'train')
100 | test_data = make_data(test_path, vocab, 'test')
101 |
102 | data['train'] = train_data
103 | data['test'] = test_data
104 |
105 | data_path = os.path.join(output_path, 'tnews_data.pkl')
106 | pickle.dump(data, open(data_path, 'wb'), protocol=2)
107 | print('save data to pkl over.')
108 |
109 |
110 | def split_data(file_path, output):
111 | """
112 | 划分数据集
113 | :param file_path:
114 | :param output:
115 | :return:
116 | """
117 | all_data = pd.read_csv(file_path, sep='\t', header=None)
118 | all_data.columns = ['id', 'texta', 'textb', 'label']
119 | train_data, test_data = train_test_split(all_data, stratify=all_data['label'],
120 | test_size=0.2, shuffle=True,
121 | random_state=42)
122 | print(train_data)
123 | print(test_data)
124 | train_path = os.path.join(output, 'train_nli.csv')
125 | test_path = os.path.join(output, 'dev_nli.csv')
126 | train_data.to_csv(train_path, sep='\t', header=False, index=False)
127 | test_data.to_csv(test_path, sep='\t', header=False, index=False)
128 | print(f'split data train size={len(train_data)} test size={len(test_data)}')
129 |
130 |
131 | if __name__ == '__main__':
132 | output_path = './data_path'
133 | main(output_path)
134 |
--------------------------------------------------------------------------------
/text_classification/tf_metrics.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | """Multiclass"""
4 |
5 | __author__ = "Guillaume Genthial"
6 |
7 | import numpy as np
8 | import tensorflow as tf
9 | from tensorflow.python.ops.metrics_impl import _streaming_confusion_matrix
10 |
11 |
12 | def precision(labels, predictions, num_classes, pos_indices=None,
13 | weights=None, average='micro'):
14 | """Multi-class precision metric for Tensorflow
15 | Parameters
16 | ----------
17 | labels : Tensor of tf.int32 or tf.int64
18 | The true labels
19 | predictions : Tensor of tf.int32 or tf.int64
20 | The predictions, same shape as labels
21 | num_classes : int
22 | The number of classes
23 | pos_indices : list of int, optional
24 | The indices of the positive classes, default is all
25 | weights : Tensor of tf.int32, optional
26 | Mask, must be of compatible shape with labels
27 | average : str, optional
28 | 'micro': counts the total number of true positives, false
29 | positives, and false negatives for the classes in
30 | `pos_indices` and infer the metric from it.
31 | 'macro': will compute the metric separately for each class in
32 | `pos_indices` and average. Will not account for class
33 | imbalance.
34 | 'weighted': will compute the metric separately for each class in
35 | `pos_indices` and perform a weighted average by the total
36 | number of true labels for each class.
37 | Returns
38 | -------
39 | tuple of (scalar float Tensor, update_op)
40 | """
41 | cm, op = _streaming_confusion_matrix(
42 | labels, predictions, num_classes, weights)
43 | pr, _, _ = metrics_from_confusion_matrix(
44 | cm, pos_indices, average=average)
45 | op, _, _ = metrics_from_confusion_matrix(
46 | op, pos_indices, average=average)
47 | return (pr, op)
48 |
49 |
50 | def recall(labels, predictions, num_classes, pos_indices=None, weights=None,
51 | average='micro'):
52 | """Multi-class recall metric for Tensorflow
53 | Parameters
54 | ----------
55 | labels : Tensor of tf.int32 or tf.int64
56 | The true labels
57 | predictions : Tensor of tf.int32 or tf.int64
58 | The predictions, same shape as labels
59 | num_classes : int
60 | The number of classes
61 | pos_indices : list of int, optional
62 | The indices of the positive classes, default is all
63 | weights : Tensor of tf.int32, optional
64 | Mask, must be of compatible shape with labels
65 | average : str, optional
66 | 'micro': counts the total number of true positives, false
67 | positives, and false negatives for the classes in
68 | `pos_indices` and infer the metric from it.
69 | 'macro': will compute the metric separately for each class in
70 | `pos_indices` and average. Will not account for class
71 | imbalance.
72 | 'weighted': will compute the metric separately for each class in
73 | `pos_indices` and perform a weighted average by the total
74 | number of true labels for each class.
75 | Returns
76 | -------
77 | tuple of (scalar float Tensor, update_op)
78 | """
79 | cm, op = _streaming_confusion_matrix(
80 | labels, predictions, num_classes, weights)
81 | _, re, _ = metrics_from_confusion_matrix(
82 | cm, pos_indices, average=average)
83 | _, op, _ = metrics_from_confusion_matrix(
84 | op, pos_indices, average=average)
85 | return (re, op)
86 |
87 |
88 | def f1(labels, predictions, num_classes, pos_indices=None, weights=None,
89 | average='micro'):
90 | return fbeta(labels, predictions, num_classes, pos_indices, weights,
91 | average)
92 |
93 |
94 | def fbeta(labels, predictions, num_classes, pos_indices=None, weights=None,
95 | average='micro', beta=1):
96 | """Multi-class fbeta metric for Tensorflow
97 | Parameters
98 | ----------
99 | labels : Tensor of tf.int32 or tf.int64
100 | The true labels
101 | predictions : Tensor of tf.int32 or tf.int64
102 | The predictions, same shape as labels
103 | num_classes : int
104 | The number of classes
105 | pos_indices : list of int, optional
106 | The indices of the positive classes, default is all
107 | weights : Tensor of tf.int32, optional
108 | Mask, must be of compatible shape with labels
109 | average : str, optional
110 | 'micro': counts the total number of true positives, false
111 | positives, and false negatives for the classes in
112 | `pos_indices` and infer the metric from it.
113 | 'macro': will compute the metric separately for each class in
114 | `pos_indices` and average. Will not account for class
115 | imbalance.
116 | 'weighted': will compute the metric separately for each class in
117 | `pos_indices` and perform a weighted average by the total
118 | number of true labels for each class.
119 | beta : int, optional
120 | Weight of precision in harmonic mean
121 | Returns
122 | -------
123 | tuple of (scalar float Tensor, update_op)
124 | """
125 | cm, op = _streaming_confusion_matrix(
126 | labels, predictions, num_classes, weights)
127 | _, _, fbeta = metrics_from_confusion_matrix(
128 | cm, pos_indices, average=average, beta=beta)
129 | _, _, op = metrics_from_confusion_matrix(
130 | op, pos_indices, average=average, beta=beta)
131 | return (fbeta, op)
132 |
133 |
134 | def safe_div(numerator, denominator):
135 | """Safe division, return 0 if denominator is 0"""
136 | numerator, denominator = tf.to_float(numerator), tf.to_float(denominator)
137 | zeros = tf.zeros_like(numerator, dtype=numerator.dtype)
138 | denominator_is_zero = tf.equal(denominator, zeros)
139 | return tf.where(denominator_is_zero, zeros, numerator / denominator)
140 |
141 |
142 | def pr_re_fbeta(cm, pos_indices, beta=1):
143 | """Uses a confusion matrix to compute precision, recall and fbeta"""
144 | num_classes = cm.shape[0]
145 | neg_indices = [i for i in range(num_classes) if i not in pos_indices]
146 | cm_mask = np.ones([num_classes, num_classes])
147 | cm_mask[neg_indices, neg_indices] = 0
148 | diag_sum = tf.reduce_sum(tf.diag_part(cm * cm_mask))
149 |
150 | cm_mask = np.ones([num_classes, num_classes])
151 | cm_mask[:, neg_indices] = 0
152 | tot_pred = tf.reduce_sum(cm * cm_mask)
153 |
154 | cm_mask = np.ones([num_classes, num_classes])
155 | cm_mask[neg_indices, :] = 0
156 | tot_gold = tf.reduce_sum(cm * cm_mask)
157 |
158 | pr = safe_div(diag_sum, tot_pred)
159 | re = safe_div(diag_sum, tot_gold)
160 | fbeta = safe_div((1. + beta**2) * pr * re, beta**2 * pr + re)
161 |
162 | return pr, re, fbeta
163 |
164 |
165 | def metrics_from_confusion_matrix(cm, pos_indices=None, average='micro',
166 | beta=1):
167 | """Precision, Recall and F1 from the confusion matrix
168 | Parameters
169 | ----------
170 | cm : tf.Tensor of type tf.int32, of shape (num_classes, num_classes)
171 | The streaming confusion matrix.
172 | pos_indices : list of int, optional
173 | The indices of the positive classes
174 | beta : int, optional
175 | Weight of precision in harmonic mean
176 | average : str, optional
177 | 'micro', 'macro' or 'weighted'
178 | """
179 | num_classes = cm.shape[0]
180 | if pos_indices is None:
181 | pos_indices = [i for i in range(num_classes)]
182 |
183 | if average == 'micro':
184 | return pr_re_fbeta(cm, pos_indices, beta)
185 | elif average in {'macro', 'weighted'}:
186 | precisions, recalls, fbetas, n_golds = [], [], [], []
187 | for idx in pos_indices:
188 | pr, re, fbeta = pr_re_fbeta(cm, [idx], beta)
189 | precisions.append(pr)
190 | recalls.append(re)
191 | fbetas.append(fbeta)
192 | cm_mask = np.zeros([num_classes, num_classes])
193 | cm_mask[idx, :] = 1
194 | n_golds.append(tf.to_float(tf.reduce_sum(cm * cm_mask)))
195 |
196 | if average == 'macro':
197 | pr = tf.reduce_mean(precisions)
198 | re = tf.reduce_mean(recalls)
199 | fbeta = tf.reduce_mean(fbetas)
200 | return pr, re, fbeta
201 | if average == 'weighted':
202 | n_gold = tf.reduce_sum(n_golds)
203 | pr_sum = sum(p * n for p, n in zip(precisions, n_golds))
204 | pr = safe_div(pr_sum, n_gold)
205 | re_sum = sum(r * n for r, n in zip(recalls, n_golds))
206 | re = safe_div(re_sum, n_gold)
207 | fbeta_sum = sum(f * n for f, n in zip(fbetas, n_golds))
208 | fbeta = safe_div(fbeta_sum, n_gold)
209 | return pr, re, fbeta
210 |
211 | else:
212 | raise NotImplementedError()
213 |
--------------------------------------------------------------------------------
/text_classification/tnews_data_eda.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "raw",
5 | "metadata": {},
6 | "source": [
7 | "CLUEBenchmark的头条中文新闻分类 数据EDA过程\n",
8 | "任务介绍:https://www.cluebenchmarks.com/introduce.html"
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "metadata": {},
15 | "outputs": [],
16 | "source": [
17 | "import pandas as pd\n",
18 | "import numpy as np\n",
19 | "import json"
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 2,
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "def convert_df(file_path, task):\n",
29 | " with open(file_path, encoding='utf-8') as fr:\n",
30 | " lines = fr.readlines()\n",
31 | " label_list = []\n",
32 | " sentence_list = []\n",
33 | " ids = []\n",
34 | " for line in lines:\n",
35 | " json_str = json.loads(line)\n",
36 | " if task == 'test':\n",
37 | " ids.append(json_str['id'])\n",
38 | " sentence_list.append(json_str['sentence'])\n",
39 | " else:\n",
40 | " label_list.append(json_str['label'])\n",
41 | " sentence_list.append(json_str['sentence'])\n",
42 | " if task == 'test':\n",
43 | " data_dict = {'id': ids, 'text': sentence_list}\n",
44 | " else:\n",
45 | " data_dict = {'label': label_list, 'text': sentence_list}\n",
46 | " data = pd.DataFrame(data_dict)\n",
47 | " \n",
48 | " return data"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": 3,
54 | "metadata": {},
55 | "outputs": [
56 | {
57 | "name": "stdout",
58 | "output_type": "stream",
59 | "text": [
60 | " label text\n",
61 | "0 108 上课时学生手机响个不停,老师一怒之下把手机摔了,家长拿发票让老师赔,大家怎么看待这种事?\n",
62 | "1 104 商赢环球股份有限公司关于延期回复上海证券交易所对公司2017年年度报告的事后审核问询函的公告\n",
63 | "2 106 通过中介公司买了二手房,首付都付了,现在卖家不想卖了。怎么处理?\n",
64 | "3 112 2018年去俄罗斯看世界杯得花多少钱?\n",
65 | "4 109 剃须刀的个性革新,雷明登天猫定制版新品首发\n",
66 | " label text\n",
67 | "0 102 江疏影甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物\n",
68 | "1 110 以色列大规模空袭开始!伊朗多个军事目标遭遇打击,誓言对等反击\n",
69 | "2 104 出栏一头猪亏损300元,究竟谁能笑到最后!\n",
70 | "3 109 以前很火的巴铁为何现在只字不提?\n",
71 | "4 112 作为一名酒店从业人员,你经历过房客哪些特别没有素质的行为?\n"
72 | ]
73 | }
74 | ],
75 | "source": [
76 | "train_path = './data/train.json'\n",
77 | "dev_path = './data/dev.json'\n",
78 | "test_path = './data/test.json'\n",
79 | "\n",
80 | "train_data = convert_df(train_path, 'train')\n",
81 | "print(train_data.head(5))\n",
82 | "\n",
83 | "dev_data = convert_df(dev_path, 'dev')\n",
84 | "print(dev_data.head(5))"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": 4,
90 | "metadata": {},
91 | "outputs": [
92 | {
93 | "name": "stdout",
94 | "output_type": "stream",
95 | "text": [
96 | " label text word_cnt\n",
97 | "0 108 上课时学生手机响个不停,老师一怒之下把手机摔了,家长拿发票让老师赔,大家怎么看待这种事? 44\n",
98 | "1 104 商赢环球股份有限公司关于延期回复上海证券交易所对公司2017年年度报告的事后审核问询函的公告 46\n",
99 | "2 106 通过中介公司买了二手房,首付都付了,现在卖家不想卖了。怎么处理? 32\n",
100 | "3 112 2018年去俄罗斯看世界杯得花多少钱? 19\n",
101 | "4 109 剃须刀的个性革新,雷明登天猫定制版新品首发 21\n"
102 | ]
103 | }
104 | ],
105 | "source": [
106 | "train_data['word_cnt'] = train_data['text'].apply(lambda x: len(x))\n",
107 | "print(train_data.head(5))"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 5,
113 | "metadata": {},
114 | "outputs": [
115 | {
116 | "data": {
117 | "text/plain": [
118 | "count 53360.000000\n",
119 | "mean 22.131241\n",
120 | "std 7.309860\n",
121 | "min 2.000000\n",
122 | "25% 17.000000\n",
123 | "50% 22.000000\n",
124 | "75% 28.000000\n",
125 | "max 145.000000\n",
126 | "Name: word_cnt, dtype: float64"
127 | ]
128 | },
129 | "execution_count": 5,
130 | "metadata": {},
131 | "output_type": "execute_result"
132 | }
133 | ],
134 | "source": [
135 | "train_data['word_cnt'].describe()"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": 6,
141 | "metadata": {},
142 | "outputs": [
143 | {
144 | "data": {
145 | "text/plain": [
146 | "109 5955\n",
147 | "104 5200\n",
148 | "102 4976\n",
149 | "113 4851\n",
150 | "107 4118\n",
151 | "101 4081\n",
152 | "103 3991\n",
153 | "110 3632\n",
154 | "108 3437\n",
155 | "116 3390\n",
156 | "112 3368\n",
157 | "115 2886\n",
158 | "106 2107\n",
159 | "100 1111\n",
160 | "114 257\n",
161 | "Name: label, dtype: int64"
162 | ]
163 | },
164 | "execution_count": 6,
165 | "metadata": {},
166 | "output_type": "execute_result"
167 | }
168 | ],
169 | "source": [
170 | "train_data['label'].value_counts()"
171 | ]
172 | },
173 | {
174 | "cell_type": "raw",
175 | "metadata": {},
176 | "source": [
177 | "从上面的数据看出tricks:\n",
178 | "1.文本最长是145,大部分都是28左右\n",
179 | "2.label的数量是不均衡的,可以在loss计算的时候加上label的权重"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": 16,
185 | "metadata": {},
186 | "outputs": [
187 | {
188 | "name": "stdout",
189 | "output_type": "stream",
190 | "text": [
191 | " label text word_cnt \\\n",
192 | "0 108 上课时学生手机响个不停,老师一怒之下把手机摔了,家长拿发票让老师赔,大家怎么看待这种事? 44 \n",
193 | "1 104 商赢环球股份有限公司关于延期回复上海证券交易所对公司2017年年度报告的事后审核问询函的公告 46 \n",
194 | "2 106 通过中介公司买了二手房,首付都付了,现在卖家不想卖了。怎么处理? 32 \n",
195 | "3 112 2018年去俄罗斯看世界杯得花多少钱? 19 \n",
196 | "4 109 剃须刀的个性革新,雷明登天猫定制版新品首发 21 \n",
197 | "\n",
198 | " words \n",
199 | "0 上课时 学生 手机 响个 不停 老师 一怒之下 把 手机 摔 了 家长 拿 发票 让 老师 ... \n",
200 | "1 商赢 环球 股份 有限公司 关于 延期 回复 上海证券交易所 对 公司 Number 年 年... \n",
201 | "2 通过 中介 公司 买 了 二手房 首付 都 付 了 现在 卖家 不想 卖 了 怎么 处理 \n",
202 | "3 Number 年 去 俄罗斯 看 世界杯 得花 多少 钱 \n",
203 | "4 剃须刀 的 个性 革新 雷明登 天猫 定制 版 新品 首发 \n"
204 | ]
205 | }
206 | ],
207 | "source": [
208 | "# 分词去掉一些无用词\n",
209 | "import jieba\n",
210 | "def cut_with_jieba(text, filter=None):\n",
211 | " if filter:\n",
212 | " for c in filter:\n",
213 | " text = text.replace(c, '')\n",
214 | " words = ['Number' if word.isdigit() else word for word in jieba.cut(text)]\n",
215 | " # todo 停用词表还可以加进来\n",
216 | " return ' '.join(words)\n",
217 | "\n",
218 | "filter = './??;。(())【】{}[]!!,,<>《》+'\n",
219 | "\n",
220 | "train_data['words'] = train_data['text'].apply(lambda x: cut_with_jieba(x, filter))\n",
221 | "\n",
222 | "print(train_data.head(5))"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": 17,
228 | "metadata": {},
229 | "outputs": [
230 | {
231 | "name": "stdout",
232 | "output_type": "stream",
233 | "text": [
234 | " label text \\\n",
235 | "0 102 江疏影甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物 \n",
236 | "1 110 以色列大规模空袭开始!伊朗多个军事目标遭遇打击,誓言对等反击 \n",
237 | "2 104 出栏一头猪亏损300元,究竟谁能笑到最后! \n",
238 | "3 109 以前很火的巴铁为何现在只字不提? \n",
239 | "4 112 作为一名酒店从业人员,你经历过房客哪些特别没有素质的行为? \n",
240 | "\n",
241 | " words \n",
242 | "0 江 疏影 甜甜 圈自 拍迷 之 角度 竟 这么 好看 美 吸引 一切 事物 \n",
243 | "1 以色列 大规模 空袭 开始 伊朗 多个 军事 目标 遭遇 打击 誓言 对 等 反击 \n",
244 | "2 出栏 一头 猪 亏损 Number 元 究竟 谁 能 笑 到 最后 \n",
245 | "3 以前 很火 的 巴铁 为何 现在 只字不提 \n",
246 | "4 作为 一名 酒店 从业人员 你 经历 过 房客 哪些 特别 没有 素质 的 行为 \n"
247 | ]
248 | }
249 | ],
250 | "source": [
251 | "dev_data['words'] = dev_data['text'].apply(lambda x: cut_with_jieba(x, filter))\n",
252 | "print(dev_data.head(5))"
253 | ]
254 | },
255 | {
256 | "cell_type": "raw",
257 | "metadata": {},
258 | "source": [
259 | "可以看出分词其实不太准确,这个地方还可以加入原始数据集中的key word"
260 | ]
261 | },
262 | {
263 | "cell_type": "code",
264 | "execution_count": 18,
265 | "metadata": {},
266 | "outputs": [],
267 | "source": [
268 | "# 写入到文件\n",
269 | "train_data[['words', 'label']].to_csv('./data/train_data.csv', sep='\\t', encoding='utf-8', index=None)\n",
270 | "dev_data[['words', 'label']].to_csv('./data/dev_data.csv', sep='\\t', encoding='utf-8', index=None)"
271 | ]
272 | },
273 | {
274 | "cell_type": "code",
275 | "execution_count": 19,
276 | "metadata": {},
277 | "outputs": [
278 | {
279 | "name": "stdout",
280 | "output_type": "stream",
281 | "text": [
282 | " id text \\\n",
283 | "0 0 在设计史上,每当相对稳定的发展时期,这种设计思想就会成为主导 \n",
284 | "1 1 利希施泰纳宣布赛季结束后离队:我需要新的挑战 \n",
285 | "2 2 庄家一般都是什么操盘思路? \n",
286 | "3 3 王者荣耀里搅屎棍英雄都有谁? \n",
287 | "4 4 照片不小心被删,看看下面的教程,完美找回来! \n",
288 | "\n",
289 | " words \n",
290 | "0 在 设计 史上 每当 相对 稳定 的 发展 时期 这种 设计 思想 就 会 成为 主导 \n",
291 | "1 利希 施泰纳 宣布 赛季 结束 后 离队 : 我 需要 新 的 挑战 \n",
292 | "2 庄家 一般 都 是 什么 操盘 思路 \n",
293 | "3 王者 荣耀 里 搅 屎 棍 英雄 都 有 谁 \n",
294 | "4 照片 不 小心 被删 看看 下面 的 教程 完美 找 回来 \n"
295 | ]
296 | }
297 | ],
298 | "source": [
299 | "\n",
300 | "# Prepare the test set\n",
301 | "test_data = convert_df(test_path, 'test')\n",
302 | "\n",
303 | "test_data['words'] = test_data['text'].apply(lambda x: cut_with_jieba(x, filter))\n",
304 | "\n",
305 | "print(test_data.head(5))\n",
306 | "\n",
307 | "test_data[['id', 'words']].to_csv('./data/test_data.csv', sep='\\t', encoding='utf-8', index=None)\n"
308 | ]
309 | },
310 | {
311 | "cell_type": "code",
312 | "execution_count": null,
313 | "metadata": {},
314 | "outputs": [],
315 | "source": []
316 | }
317 | ],
318 | "metadata": {
319 | "kernelspec": {
320 | "display_name": "Python [conda env:tf_envs]",
321 | "language": "python",
322 | "name": "conda-env-tf_envs-py"
323 | },
324 | "language_info": {
325 | "codemirror_mode": {
326 | "name": "ipython",
327 | "version": 3
328 | },
329 | "file_extension": ".py",
330 | "mimetype": "text/x-python",
331 | "name": "python",
332 | "nbconvert_exporter": "python",
333 | "pygments_lexer": "ipython3",
334 | "version": "3.7.7"
335 | }
336 | },
337 | "nbformat": 4,
338 | "nbformat_minor": 4
339 | }
340 |
--------------------------------------------------------------------------------
/text_classification/train_main.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # @Time : 2020-12-11 19:47
3 | # @Author : xudong
4 | # @email : dongxu222mk@163.com
5 | # @File : train_main.py
6 | # @Software: PyCharm
7 |
8 | import sys
9 | import time
10 | import tensorflow as tf
11 | import tf_metrics
12 | import _pickle as cPickle
13 |
14 | from data_utils import datasets
15 | from argparse import ArgumentParser
16 | from models.bilstm_model import BiLstmModel
17 | from models.text_cnn import TextCnn
18 | from models.ffnn_model import FCModel
19 |
20 |
21 | # Set up the command-line arguments
22 | parser = ArgumentParser()
23 |
24 | parser.add_argument("--train_path", type=str, default='./data_path/tnews_data.pkl',
25 | help='the file path of train data, needs pkl type')
26 | parser.add_argument("--eval_path", type=str, default='./data_path/tnews_data.pkl',
27 | help='the file path of test data, needs pkl type')
28 | parser.add_argument("--model_ckpt_dir", type=str, default='./model_ckpt/',
29 | help='the dir of the checkpoint type model')
30 | parser.add_argument("--model_pb_dir", type=str, default='./model_pb',
31 | help='the dir of the pb type model')
32 |
33 | parser.add_argument("--vocab_size", type=int, default=68000, help='the vocab size')
34 | parser.add_argument("--emb_size", type=int, default=300, help='the embedding size')
35 | parser.add_argument("--hidden_size", type=int, default=300,
36 |                     help='the hidden size of the rnn layer; it is split in half between the forward and backward directions')
37 | parser.add_argument("--fc_layer_size", type=int, default=300,
38 | help='the hidden size of fully connect layer')
39 | parser.add_argument("--num_label", type=int, default=15, help='the number of task label')
40 | parser.add_argument("--drop_out", type=float, default=0.2,
41 | help='the dropout rate in layers')
42 | parser.add_argument("--batch_size", type=int, default=16,
43 | help='the batch size of dataset in one step training')
44 | parser.add_argument("--epoch", type=int, default=5,
45 | help='the epoch count we want to train')
46 | parser.add_argument("--model_name", type=str, default='lstm',
47 | help='which model we want use in our task, [lstm, cnn, fc, ...]')
48 | parser.add_argument("--pool", type=str, default='max',
49 | help='the pool function, [max, mean, ...]')
50 | parser.add_argument("--activator", type=str, default='relu',
51 | help='the activate function, [relu, relu6, tanh, ...]')
52 | parser.add_argument("--filter_num", type=int, default=128,
53 | help='the number of the cnn filters')
54 | parser.add_argument("--use_pos", type=int, default=0,
55 | help='whether to use position encoding in embedding layer')
56 | parser.add_argument("--lr", type=float, default=1e-3,
57 | help='the learning rate for optimizer')
58 |
59 |
60 | # TODO: position information could also be added to the embedding layer
61 | # TODO: attention pooling could also be added as a pooling option
62 |
63 |
64 | tf.logging.set_verbosity(tf.logging.INFO)
65 | ARGS, unparsed = parser.parse_known_args()
66 | print(ARGS)
67 | sys.stdout.flush()
68 |
69 |
70 | def init_data(file_name, type=None):
71 | """
72 |     Initialize the dataset and build the estimator input function.
73 |     :param file_name: path to the pickled dataset file
74 |     :param type: which split to load, 'train' or 'test'
75 |     :return: an input_fn callable for tf.estimator
76 | """
77 | data = cPickle.load(open(file_name, 'rb'))[type]
78 |
79 | data_builder = datasets.DataBuilder(data)
80 | dataset = data_builder.build_dataset()
81 |
82 | def train_input():
83 | return data_builder.get_train_batch(dataset, ARGS.batch_size, ARGS.epoch)
84 |
85 | def test_input():
86 | return data_builder.get_test_batch(dataset, ARGS.batch_size)
87 |
88 | return train_input if type == 'train' else test_input
89 |
90 |
91 | def make_model():
92 | """
93 |     Build the model selected by --model_name.
94 | :return:
95 | """
96 | vocab_size = ARGS.vocab_size
97 | emb_size = ARGS.emb_size
98 | print(f'the model name is {ARGS.model_name}')
99 | if ARGS.model_name == 'lstm':
100 | model = BiLstmModel(vocab_size, emb_size, ARGS)
101 | elif ARGS.model_name == 'cnn':
102 | model = TextCnn(vocab_size, emb_size, ARGS)
103 | elif ARGS.model_name == 'fc':
104 | model = FCModel(vocab_size, emb_size, ARGS)
105 | else:
106 | raise KeyError('the model type is not implemented!')
107 | return model
108 |
109 |
110 | def model_fn(features, labels, mode, params):
111 | """
112 | the model fn
113 | :return:
114 | """
115 | model = make_model()
116 |
117 | if isinstance(features, dict):
118 | features = features['words']
119 |
120 | words = features
121 |
122 | if mode == tf.estimator.ModeKeys.PREDICT:
123 | logits = model(words, training=False)
124 |
125 | prediction = {'class_id': tf.argmax(logits, axis=1, name='class_out'),
126 | 'prob': tf.nn.softmax(logits, name='prob_out')}
127 |
128 | return tf.estimator.EstimatorSpec(
129 | mode=mode,
130 | predictions=prediction,
131 | export_outputs={'classify': tf.estimator.export.PredictOutput(prediction)}
132 | )
133 | else:
134 | if mode == tf.estimator.ModeKeys.TRAIN:
135 | logits = model(words, training=True)
136 | weights = tf.constant(
137 | [0.9, 0.9, 0.9, 0.9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1.2, 1.5])
138 | weights = tf.gather(weights, labels)
139 | loss = tf.losses.sparse_softmax_cross_entropy(
140 | labels, logits,
141 | weights=weights,
142 | reduction=tf.losses.Reduction.MEAN)
143 | prediction = tf.argmax(logits, axis=1)
144 | accuracy = tf.metrics.accuracy(labels=labels,
145 | predictions=prediction)
146 | tf.identity(accuracy[1], name='train_accuracy')
147 | tf.summary.scalar('train_accuracy', accuracy[1])
148 | optimizer = tf.train.AdamOptimizer(learning_rate=ARGS.lr)
149 | return tf.estimator.EstimatorSpec(
150 | mode=mode,
151 | loss=loss,
152 | train_op=optimizer.minimize(loss, tf.train.get_or_create_global_step())
153 | )
154 | else:
155 | logits = model(words, training=False)
156 | prediction = tf.argmax(logits, axis=1)
157 |             # the built-in tf.metrics do not support multi-class precision/recall, so use tf_metrics
158 | precision = tf_metrics.precision(labels, prediction, ARGS.num_label)
159 | recall = tf_metrics.recall(labels, prediction, ARGS.num_label)
160 | accuracy = tf.metrics.accuracy(labels, predictions=prediction)
161 | metrics = {
162 | 'accuracy': accuracy,
163 | 'recall': recall,
164 | 'precision': precision
165 | }
166 | return tf.estimator.EstimatorSpec(
167 | mode=mode,
168 |                 loss=tf.losses.sparse_softmax_cross_entropy(labels, logits),  # report the real eval loss
169 | eval_metric_ops=metrics
170 | )
171 |
172 |
173 | def main_es(unparsed):
174 | """
175 | main method
176 | :param unparsed:
177 | :return:
178 | """
179 | cur_time = time.time()
180 | model_dir = ARGS.model_ckpt_dir + str(int(cur_time))
181 |
182 |     classifier = tf.estimator.Estimator(
183 | model_fn=model_fn,
184 | model_dir=model_dir,
185 | params={}
186 | )
187 |
188 | # train model
189 | train_input = init_data(ARGS.train_path, 'train')
190 | tensors_to_log = {'train_accuracy': 'train_accuracy'}
191 | logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=100)
192 |     classifier.train(input_fn=train_input, hooks=[logging_hook])
193 |
194 | # eval model
195 | eval_input = init_data(ARGS.eval_path, 'test')
196 |     eval_res = classifier.evaluate(input_fn=eval_input)
197 | print(f'Evaluation res is : \n\t{eval_res}')
198 |
199 | if ARGS.model_pb_dir:
200 | words = tf.placeholder(tf.int64, [None, None], name='input_words')
201 | input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({
202 | 'words': words
203 | })
204 |         classifier.export_savedmodel(ARGS.model_pb_dir, input_fn)
205 |
206 |
207 | if __name__ == '__main__':
208 | tf.app.run(main=main_es, argv=[sys.argv[0]] + unparsed)
209 |
--------------------------------------------------------------------------------
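For a quick sanity check of the SavedModel that `main_es()` exports via `export_savedmodel`, the exported `classify` signature (input key `words`, output keys `class_id`/`prob`) can be loaded with TF1's `tf.contrib.predictor`. This is only a minimal sketch under those assumptions, not the repo's `inference.py`; the glob pattern assumes the default `./model_pb` export directory and the token ids below are made up:

```python
import glob
import tensorflow as tf

export_dir = sorted(glob.glob('./model_pb/*'))[-1]              # newest timestamped export
predict_fn = tf.contrib.predictor.from_saved_model(export_dir,
                                                   signature_def_key='classify')

word_ids = [[12, 345, 678, 9, 0, 0]]                            # already-mapped token ids (hypothetical)
result = predict_fn({'words': word_ids})
print(result['class_id'], result['prob'])
```

Training itself is driven by the flags defined at the top of the file, e.g. `python train_main.py --model_name cnn --epoch 10`.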