├── README.md
├── bert_downstream
│   ├── README.md
│   ├── bert_master
│   │   └── README.md
│   ├── data_path
│   │   └── README.md
│   ├── model_ckpt
│   │   └── README.md
│   ├── pre_trained
│   │   └── README.md
│   ├── train_classifier.py
│   ├── train_multi_learning.py
│   └── train_ner.py
├── ckbqa
│   ├── DUTIR中文开放域知识问答评测报告.pdf
│   ├── README.md
│   └── 基于特征融合的中文知识库问答方法.pdf
├── named_entity_recognition
│   ├── README.md
│   ├── convert_bio.py
│   ├── data_path
│   │   └── README.md
│   ├── data_utils
│   │   ├── __init__.py
│   │   └── datasets.py
│   ├── inference.py
│   ├── model_ckpt
│   │   └── README.md
│   ├── model_pb
│   │   └── README.md
│   ├── models
│   │   ├── __init__.py
│   │   └── bilstm_crf.py
│   ├── ner_main.py
│   ├── pics
│   │   ├── 命名实体识别数据图.png
│   │   └── 命名实体识别的模型总结图.png
│   └── preprocess.py
└── text_classification
    ├── README.md
    ├── data_path
    │   ├── README.md
    │   ├── tnews_data.pkl
    │   └── vocab.txt
    ├── inference.py
    ├── model_ckpt
    │   └── README.md
    ├── model_pb
    │   └── README.md
    ├── models
    │   ├── __init__.py
    │   ├── attention.py
    │   ├── base_model.py
    │   ├── bilstm_model.py
    │   ├── ffnn_model.py
    │   ├── model_utils.py
    │   └── text_cnn.py
    ├── preprocess.py
    ├── tf_metrics.py
    ├── tnews_data_eda.ipynb
    └── train_main.py

/README.md:
--------------------------------------------------------------------------------
1 | ## AwesomeNLPBaseline
2 | 
3 | This project provides baseline implementations for a number of NLP tasks, including text classification, named entity recognition, entity-relation extraction, NL2SQL, CKBQA, and various BERT downstream applications.
4 | 
5 | It is built mainly on TensorFlow 1.x.
6 | 
7 | Admittedly, TensorFlow 1.x is genuinely less pleasant than torch for some of these tasks and can be rather discouraging. Don't ask why we still use it instead of torch if it is so painful (because I haven't learned torch); the official answer is that the difficulty is exactly the point, and besides, TF's serving ("as-server") mode is a real convenience when deploying projects at work.
8 | 
9 | **Tasks**
10 | 
11 | - Text classification
12 | - Named entity recognition
13 | - BERT downstream tasks
14 | - Entity-relation extraction
15 | - nl2sql
16 | - ckbqa
17 | - more to come (continuously updated)
18 | 
19 | **Directory layout**:
20 | 
21 | * text_classification: text classification
22 | * named_entity_recognition: named entity recognition
23 | * entity_relation_extraction: entity-relation extraction
24 | * ckbqa: Chinese knowledge base question answering
25 | * nl2sql: natural language to SQL
26 | * bert_downstream: BERT fine-tuning on downstream tasks and BERT-related experiments
27 | 
28 | Tip: only text classification, BERT downstream tasks and named entity recognition are implemented so far; the rest will be added when time permits.
29 | 
30 | 
31 | **Disclaimer**
32 | 
33 | This project is a collection of NLP tasks the author has come across in study and work, intended for learning and exchange only. Issues and PRs are welcome.
34 | 
--------------------------------------------------------------------------------
/bert_downstream/README.md:
--------------------------------------------------------------------------------
1 | ## About BERT
2 | 
3 | **Overview**
4 | 
5 | BERT is a pre-trained model released by Google in October 2018. It attracted broad attention from academia and industry as soon as it was published: it set new state-of-the-art results on 11 NLP tasks, was named one of the major NLP advances of 2018, and its paper won the NAACL 2019 best paper award. BERT follows essentially the same technical route as OpenAI's earlier GPT, differing only in details; the key contribution of both works is solving NLP problems with a pre-train + fine-tune pipeline. Taking BERT as the example, applying the model involves two stages:
6 | 
7 | - Pre-training: the network parameters are learned on large general-purpose corpora such as Wikipedia and Book Corpus, which contain enough text to expose the model to a rich range of linguistic phenomena.
8 | - Fine-tuning: the parameters are then fine-tuned on task-specific labeled data, so there is no need to design and train a task-specific network from scratch.
9 | 
10 | Details are in the paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805v1), the version released in October 2018, which differs slightly from the May 2019 revision [v2](https://arxiv.org/abs/1810.04805v2).
11 | 
12 | A Chinese translation of the paper is available here: [BERT论文中文翻译](https://github.com/yuanxiaosc/BERT_Paper_Chinese_Translation)
13 | 
14 | **Pre-trained BERT models**:
15 | 
16 | - Official BERT: https://github.com/google-research/bert
17 | - Transformers: https://github.com/huggingface/transformers
18 | - HFL (HIT & iFLYTEK lab): https://github.com/ymcui/Chinese-BERT-wwm
19 | - Brightmart: https://github.com/brightmart/roberta_zh
20 | - CLUEPretrainedModels: https://github.com/CLUEbenchmark/CLUEPretrainedModels
21 | 
22 | **BERT downstream tasks**
23 | 
24 | Pre-trained models have greatly reduced the need to design task-specific architectures: attaching a simple network on top of BERT (or a similar encoder) is usually enough to solve an NLP task, and it works very well.
25 | 
26 | The reason is simple. Through unsupervised learning on massive corpora, BERT has already distilled the knowledge in those corpora into its embeddings, so we only need to add a small task-specific structure and fine-tune it to adapt to the current task; this is the appeal of transfer learning. (A minimal sketch of such a head follows the task list below.)
27 | 
28 | Several kinds of downstream tasks are listed below:
29 | 
30 | - Sentence-pair classification, e.g. natural language inference (NLI); common datasets include MNLI, QNLI, STS-B and MRPC
31 | - Single-sentence classification, e.g. text classification; common datasets include SST-2 and CoLA
32 | - Question answering; a common dataset is SQuAD v1.1
33 | - Single-sentence token labeling, e.g. named entity recognition (NER); a common dataset is CoNLL-2003
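As a concrete illustration of the point above, here is a minimal sketch, not code from this repository, of how a single-sentence classification head can be attached to the pooled `[CLS]` output in the TF1 style used by `train_classifier.py` below. The function name `build_classifier` and the `num_labels` argument are placeholders for this example:

```
import tensorflow as tf
import bert_master.modeling as modeling  # assumed import path, mirroring the bert_master directory above


def build_classifier(bert_config, input_ids, input_mask, segment_ids,
                     num_labels, is_training):
    """Attach a single dense softmax head to the pooled [CLS] vector."""
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids)
    pooled = model.get_pooled_output()  # shape [batch_size, hidden_size]
    if is_training:
        # the same 0.1 dropout the official fine-tuning code applies
        pooled = tf.nn.dropout(pooled, keep_prob=0.9)
    logits = tf.layers.dense(
        pooled, num_labels,
        kernel_initializer=tf.truncated_normal_initializer(stddev=0.02))
    probabilities = tf.nn.softmax(logits, axis=-1)
    return logits, probabilities
```

`train_classifier.py` follows essentially this pattern, only with the dense layer written out as explicit weight and bias variables and wrapped in the Estimator training loop.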
34 | 
35 | 
36 | 
37 | ## BERT downstream task code
38 | 
39 | Below, the official BERT fine-tuning code is used to build baseline models for text classification, named entity recognition and multi-task learning.
40 | 
41 | (All of the tasks below use the **`BERT-wwm-ext, Chinese`** pre-trained model from the HFL (HIT & iFLYTEK) lab; see the links above for the download address.)
42 | 
43 | 1. Text classification
44 | 
45 | ```
46 | Dataset: sentiment classification, a fine-grained sentiment analysis dataset with 7 classes, from the "NLP Chinese pre-trained model generalization ability challenge"
47 | Training script: train_classifier.py
48 | 
49 | 3 epochs with BERT-wwm-ext, Chinese and no tricks; the result submitted to CLUE is 56.04
50 | An earlier BiLSTM submission scored 50.92 (code in text_classification)
51 | ([ALBERT-xxlarge](https://github.com/google-research/albert): 59.46; current [UER-ensemble](https://github.com/dbiir/UER-py): 72.20)
52 | ```
53 | 
54 | 2. Named entity recognition
55 | 
56 | ```
57 | Dataset: CLUENER fine-grained named entity recognition, 10 label categories; details: https://github.com/CLUEbenchmark/CLUENER2020
58 | Training script: train_ner.py
59 | # todo next
60 | ```
61 | 
62 | 3. Multi-task learning
63 | 
64 | ```
65 | Dataset: the "NLP Chinese pre-trained model generalization ability challenge", https://tianchi.aliyun.com/competition/entrance/531841/introduction
66 | Training script: train_multi_learning.py
67 | 
68 | 3 epochs without any tricks: current score 0.5717
69 | 3 epochs without any tricks, using roberta-large: score 0.6236
70 | ```
71 | 
72 | 
73 | 
74 | ## Further reading
75 | 
76 | Two interesting papers on BERT fine-tuning are recommended:
77 | 
78 | 1. How to Fine-Tune BERT for Text Classification
79 | 
80 | 2. Few-sample BERT fine-tuning
--------------------------------------------------------------------------------
/bert_downstream/bert_master/README.md:
--------------------------------------------------------------------------------
1 | ### File description
2 | 
3 | This directory holds the official BERT model code; see https://github.com/google-research/bert
4 | It mainly contains three files:
5 | - modeling.py
6 | - optimization.py
7 | - tokenization.py
--------------------------------------------------------------------------------
/bert_downstream/data_path/README.md:
--------------------------------------------------------------------------------
1 | ### File description
2 | 
3 | The datasets are stored here.
--------------------------------------------------------------------------------
/bert_downstream/model_ckpt/README.md:
--------------------------------------------------------------------------------
1 | ### File description
2 | 
3 | Checkpoint files produced by training are saved here.
--------------------------------------------------------------------------------
/bert_downstream/pre_trained/README.md:
--------------------------------------------------------------------------------
1 | ### File description
2 | 
3 | The pre-trained models are stored here; this project mainly uses the BERT models from the HFL (HIT & iFLYTEK) lab, see https://github.com/ymcui/Chinese-BERT-wwm
--------------------------------------------------------------------------------
/bert_downstream/train_classifier.py:
--------------------------------------------------------------------------------
1 | """BERT finetuning runner for text classification."""
2 | 
3 | import collections
4 | import os
5 | import json
6 | import tensorflow as tf
7 | 
8 | import bert_master.modeling as modeling
9 | import bert_master.optimization as optimization
10 | import bert_master.tokenization as tokenization
11 | 
12 | flags = tf.flags
13 | FLAGS = flags.FLAGS
14 | 
15 | # Required parameters
16 | flags.DEFINE_string(
17 |     "data_dir", './data_path/tnews/',
18 |     "The input data dir. Should contain the .tsv files (or other data files) "
19 |     "for the task.")
20 | 
21 | flags.DEFINE_string(
22 |     "bert_config_file", './pre_trained/bert_config.json',
23 |     "The config json file corresponding to the pre-trained BERT model. 
" 24 | "This specifies the model architecture.") 25 | 26 | flags.DEFINE_string("task_name", 'tnews', 27 | "The name of the task to train.") 28 | 29 | flags.DEFINE_string("vocab_file", './pre_trained/vocab.txt', 30 | "The vocabulary file that the BERT model was trained on.") 31 | 32 | flags.DEFINE_string( 33 | "output_dir", './model_ckpt/tnews/', 34 | "The output directory where the model checkpoints will be written.") 35 | 36 | flags.DEFINE_string( 37 | "init_checkpoint", './pre_trained/bert_model.ckpt', 38 | "Initial checkpoint (usually from a pre-trained BERT model).") 39 | 40 | flags.DEFINE_bool( 41 | "do_lower_case", True, 42 | "Whether to lower case the input text. Should be True for uncased " 43 | "models and False for cased models.") 44 | 45 | flags.DEFINE_integer( 46 | "max_seq_length", 128, 47 | "The maximum total input sequence length after WordPiece tokenization. " 48 | "Sequences longer than this will be truncated, and sequences shorter " 49 | "than this will be padded.") 50 | 51 | flags.DEFINE_bool("do_train", True, "Whether to run training.") 52 | 53 | flags.DEFINE_bool("do_eval", True, "Whether to run eval on the dev set.") 54 | 55 | flags.DEFINE_bool( 56 | "do_predict", True, 57 | "Whether to run the model in inference mode on the test set.") 58 | 59 | flags.DEFINE_integer("train_batch_size", 16, "Total batch size for training.") 60 | 61 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 62 | 63 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 64 | 65 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 66 | 67 | flags.DEFINE_float("num_train_epochs", 3.0, 68 | "Total number of training epochs to perform.") 69 | 70 | flags.DEFINE_float( 71 | "warmup_proportion", 0.1, 72 | "Proportion of training to perform linear learning rate warmup for. " 73 | "E.g., 0.1 = 10% of training.") 74 | 75 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 76 | "How often to save the model checkpoint.") 77 | 78 | flags.DEFINE_integer("iterations_per_loop", 1000, 79 | "How many steps to make in each estimator call.") 80 | 81 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 82 | 83 | tf.flags.DEFINE_string( 84 | "tpu_name", None, 85 | "The Cloud TPU to use for training. This should be either the name " 86 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 87 | "url.") 88 | 89 | tf.flags.DEFINE_string( 90 | "tpu_zone", None, 91 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 92 | "specified, we will attempt to automatically detect the GCE project from " 93 | "metadata.") 94 | 95 | tf.flags.DEFINE_string( 96 | "gcp_project", None, 97 | "[Optional] Project name for the Cloud TPU-enabled project. If not " 98 | "specified, we will attempt to automatically detect the GCE project from " 99 | "metadata.") 100 | 101 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 102 | 103 | flags.DEFINE_integer( 104 | "num_tpu_cores", 8, 105 | "Only used if `use_tpu` is True. Total number of TPU cores to use.") 106 | 107 | 108 | class InputExample(object): 109 | """A single training/test example for simple sequence classification.""" 110 | 111 | def __init__(self, guid, text_a, text_b=None, label=None): 112 | """Constructs a InputExample. 113 | 114 | Args: 115 | guid: Unique id for the example. 116 | text_a: string. The untokenized text of the first sequence. For single 117 | sequence tasks, only this sequence must be specified. 
118 | text_b: (Optional) string. The untokenized text of the second sequence. 119 | Only must be specified for sequence pair tasks. 120 | label: (Optional) string. The label of the example. This should be 121 | specified for train and dev examples, but not for test examples. 122 | """ 123 | self.guid = guid 124 | self.text_a = text_a 125 | self.text_b = text_b 126 | self.label = label 127 | 128 | 129 | class PaddingInputExample(object): 130 | """Fake example so the num input examples is a multiple of the batch size. 131 | 132 | When running eval/predict on the TPU, we need to pad the number of examples 133 | to be a multiple of the batch size, because the TPU requires a fixed batch 134 | size. The alternative is to drop the last batch, which is bad because it means 135 | the entire output data won't be generated. 136 | 137 | We use this class instead of `None` because treating `None` as padding 138 | battches could cause silent errors. 139 | """ 140 | 141 | 142 | class InputFeatures(object): 143 | """A single set of features of data.""" 144 | 145 | def __init__(self, 146 | input_ids, 147 | input_mask, 148 | segment_ids, 149 | label_id, 150 | is_real_example=True): 151 | self.input_ids = input_ids 152 | self.input_mask = input_mask 153 | self.segment_ids = segment_ids 154 | self.label_id = label_id 155 | self.is_real_example = is_real_example 156 | 157 | 158 | class TnewsProcessor: 159 | def get_train_examples(self, data_dir): 160 | """获取训练集.""" 161 | return self._create_examples( 162 | self._read_tsv(os.path.join(data_dir, "train.json")), "train") 163 | 164 | def get_dev_examples(self, data_dir): 165 | """获取验证集.""" 166 | return self._create_examples( 167 | self._read_tsv(os.path.join(data_dir, "dev.json")), "dev") 168 | 169 | def get_test_examples(self, data_dir): 170 | """获取测试集.""" 171 | return self._create_examples( 172 | self._read_tsv(os.path.join(data_dir, "test.json")), "test") 173 | 174 | def get_labels(self): 175 | """填写新闻分类的类别标签""" 176 | return ['100', '101', '102', '103', '104', '106', '107', 177 | '108', '109', '110', '112', '113', '114', '115', '116'] 178 | 179 | def _read_tsv(self, input_file): 180 | """读取数据集""" 181 | with open(input_file, encoding='utf-8') as fr: 182 | lines = fr.readlines() 183 | return lines 184 | 185 | def _create_examples(self, lines, set_type): 186 | """Creates examples for the training and dev sets.""" 187 | examples = [] 188 | for (i, line) in enumerate(lines): 189 | json_str = json.loads(line) 190 | guid = "%s-%s" % (set_type, i) 191 | if set_type == "test": 192 | text_a = tokenization.convert_to_unicode(json_str['sentence']) 193 | label = None 194 | guid = json_str['id'] 195 | else: 196 | text_a = tokenization.convert_to_unicode(json_str['sentence']) 197 | label = tokenization.convert_to_unicode(json_str['label']) 198 | examples.append( 199 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label)) 200 | return examples 201 | 202 | 203 | def convert_single_example(ex_index, example, label_list, max_seq_length, 204 | tokenizer): 205 | """Converts a single `InputExample` into a single `InputFeatures`.""" 206 | 207 | if isinstance(example, PaddingInputExample): 208 | return InputFeatures( 209 | input_ids=[0] * max_seq_length, 210 | input_mask=[0] * max_seq_length, 211 | segment_ids=[0] * max_seq_length, 212 | label_id=0, 213 | is_real_example=False) 214 | 215 | label_map = {} 216 | for (i, label) in enumerate(label_list): 217 | label_map[label] = i 218 | 219 | tokens_a = tokenizer.tokenize(example.text_a) 220 | tokens_b = None 221 | if 
example.text_b: 222 | tokens_b = tokenizer.tokenize(example.text_b) 223 | 224 | if tokens_b: 225 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 226 | else: 227 | # Account for [CLS] and [SEP] with "- 2" 228 | if len(tokens_a) > max_seq_length - 2: 229 | tokens_a = tokens_a[0:(max_seq_length - 2)] 230 | 231 | tokens = [] 232 | segment_ids = [] 233 | tokens.append("[CLS]") 234 | segment_ids.append(0) 235 | for token in tokens_a: 236 | tokens.append(token) 237 | segment_ids.append(0) 238 | tokens.append("[SEP]") 239 | segment_ids.append(0) 240 | 241 | if tokens_b: 242 | for token in tokens_b: 243 | tokens.append(token) 244 | segment_ids.append(1) 245 | tokens.append("[SEP]") 246 | segment_ids.append(1) 247 | 248 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 249 | 250 | input_mask = [1] * len(input_ids) 251 | 252 | # Zero-pad up to the sequence length. 253 | while len(input_ids) < max_seq_length: 254 | input_ids.append(0) 255 | input_mask.append(0) 256 | segment_ids.append(0) 257 | 258 | assert len(input_ids) == max_seq_length 259 | assert len(input_mask) == max_seq_length 260 | assert len(segment_ids) == max_seq_length 261 | 262 | if example.label: 263 | label_id = label_map[example.label] 264 | else: 265 | label_id = 0 266 | 267 | if ex_index < 5: 268 | tf.logging.info("*** Example ***") 269 | tf.logging.info("guid: %s" % (example.guid)) 270 | tf.logging.info("tokens: %s" % " ".join( 271 | [tokenization.printable_text(x) for x in tokens])) 272 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 273 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 274 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 275 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) 276 | 277 | feature = InputFeatures( 278 | input_ids=input_ids, 279 | input_mask=input_mask, 280 | segment_ids=segment_ids, 281 | label_id=label_id, 282 | is_real_example=True) 283 | return feature 284 | 285 | 286 | def file_based_convert_examples_to_features( 287 | examples, label_list, max_seq_length, tokenizer, output_file): 288 | """Convert a set of `InputExample`s to a TFRecord file.""" 289 | 290 | writer = tf.python_io.TFRecordWriter(output_file) 291 | 292 | for (ex_index, example) in enumerate(examples): 293 | if ex_index % 10000 == 0: 294 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 295 | 296 | feature = convert_single_example(ex_index, example, label_list, 297 | max_seq_length, tokenizer) 298 | 299 | def create_int_feature(values): 300 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 301 | return f 302 | 303 | features = collections.OrderedDict() 304 | features["input_ids"] = create_int_feature(feature.input_ids) 305 | features["input_mask"] = create_int_feature(feature.input_mask) 306 | features["segment_ids"] = create_int_feature(feature.segment_ids) 307 | features["label_ids"] = create_int_feature([feature.label_id]) 308 | features["is_real_example"] = create_int_feature( 309 | [int(feature.is_real_example)]) 310 | 311 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 312 | writer.write(tf_example.SerializeToString()) 313 | writer.close() 314 | 315 | 316 | def file_based_input_fn_builder(input_file, seq_length, is_training, 317 | drop_remainder): 318 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 319 | 320 | name_to_features = { 321 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64), 322 | "input_mask": 
tf.FixedLenFeature([seq_length], tf.int64), 323 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64), 324 | "label_ids": tf.FixedLenFeature([], tf.int64), 325 | "is_real_example": tf.FixedLenFeature([], tf.int64), 326 | } 327 | 328 | def _decode_record(record, name_to_features): 329 | """Decodes a record to a TensorFlow example.""" 330 | example = tf.parse_single_example(record, name_to_features) 331 | 332 | # tf.Example only supports tf.int64, but the TPU only supports tf.int32. 333 | # So cast all int64 to int32. 334 | for name in list(example.keys()): 335 | t = example[name] 336 | if t.dtype == tf.int64: 337 | t = tf.to_int32(t) 338 | example[name] = t 339 | 340 | return example 341 | 342 | def input_fn(params): 343 | """The actual input function.""" 344 | batch_size = params["batch_size"] 345 | 346 | d = tf.data.TFRecordDataset(input_file) 347 | if is_training: 348 | d = d.repeat() 349 | d = d.shuffle(buffer_size=100) 350 | 351 | d = d.apply( 352 | tf.contrib.data.map_and_batch( 353 | lambda record: _decode_record(record, name_to_features), 354 | batch_size=batch_size, 355 | drop_remainder=drop_remainder)) 356 | 357 | return d 358 | 359 | return input_fn 360 | 361 | 362 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 363 | """Truncates a sequence pair in place to the maximum length.""" 364 | 365 | while True: 366 | total_length = len(tokens_a) + len(tokens_b) 367 | if total_length <= max_length: 368 | break 369 | if len(tokens_a) > len(tokens_b): 370 | tokens_a.pop() 371 | else: 372 | tokens_b.pop() 373 | 374 | 375 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids, 376 | labels, num_labels, use_one_hot_embeddings): 377 | """Creates a classification model.""" 378 | model = modeling.BertModel( 379 | config=bert_config, 380 | is_training=is_training, 381 | input_ids=input_ids, 382 | input_mask=input_mask, 383 | token_type_ids=segment_ids, 384 | use_one_hot_embeddings=use_one_hot_embeddings) 385 | 386 | output_layer = model.get_pooled_output() 387 | 388 | hidden_size = output_layer.shape[-1].value 389 | 390 | output_weights = tf.get_variable( 391 | "output_weights", [num_labels, hidden_size], 392 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 393 | 394 | output_bias = tf.get_variable( 395 | "output_bias", [num_labels], initializer=tf.zeros_initializer()) 396 | 397 | with tf.variable_scope("loss"): 398 | if is_training: 399 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 400 | 401 | logits = tf.matmul(output_layer, output_weights, transpose_b=True) 402 | logits = tf.nn.bias_add(logits, output_bias) 403 | 404 | log_probs = tf.nn.log_softmax(logits, axis=-1) 405 | 406 | one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) 407 | 408 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 409 | loss = tf.reduce_mean(per_example_loss) 410 | 411 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int32) 412 | 413 | return loss, per_example_loss, predictions 414 | 415 | 416 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate, 417 | num_train_steps, num_warmup_steps, use_tpu, 418 | use_one_hot_embeddings): 419 | """Returns `model_fn` closure for TPUEstimator.""" 420 | 421 | def model_fn(features, labels, mode, params): # pylint: disable=unused-argument 422 | """The `model_fn` for TPUEstimator.""" 423 | 424 | tf.logging.info("*** Features ***") 425 | for name in sorted(features.keys()): 426 | tf.logging.info(" name = %s, shape = %s" % (name, 
features[name].shape)) 427 | 428 | input_ids = features["input_ids"] 429 | input_mask = features["input_mask"] 430 | segment_ids = features["segment_ids"] 431 | label_ids = features["label_ids"] 432 | is_real_example = None 433 | if "is_real_example" in features: 434 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32) 435 | else: 436 | is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32) 437 | 438 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 439 | 440 | (total_loss, per_example_loss, predictions) = create_model( 441 | bert_config, is_training, input_ids, input_mask, segment_ids, label_ids, 442 | num_labels, use_one_hot_embeddings) 443 | 444 | tvars = tf.trainable_variables() 445 | initialized_variable_names = {} 446 | scaffold_fn = None 447 | if init_checkpoint: 448 | (assignment_map, initialized_variable_names 449 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 450 | if use_tpu: 451 | def tpu_scaffold(): 452 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 453 | return tf.train.Scaffold() 454 | 455 | scaffold_fn = tpu_scaffold 456 | else: 457 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 458 | 459 | tf.logging.info("**** Trainable Variables ****") 460 | for var in tvars: 461 | init_string = "" 462 | if var.name in initialized_variable_names: 463 | init_string = ", *INIT_FROM_CKPT*" 464 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 465 | init_string) 466 | 467 | if mode == tf.estimator.ModeKeys.TRAIN: 468 | # 添加loss的hook,不然在GPU/CPU上不打印loss 469 | logging_hook = tf.train.LoggingTensorHook({"loss": total_loss}, every_n_iter=10) 470 | train_op = optimization.create_optimizer( 471 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 472 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 473 | mode=mode, 474 | loss=total_loss, 475 | train_op=train_op, 476 | training_hooks=[logging_hook], 477 | scaffold_fn=scaffold_fn) 478 | elif mode == tf.estimator.ModeKeys.EVAL: 479 | def metric_fn(per_example_loss, label_ids, is_real_example): 480 | accuracy = tf.metrics.accuracy( 481 | labels=label_ids, predictions=predictions, weights=is_real_example) 482 | loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example) 483 | return { 484 | "eval_accuracy": accuracy, 485 | "eval_loss": loss, 486 | } 487 | 488 | eval_metrics = (metric_fn, 489 | [per_example_loss, label_ids, is_real_example]) 490 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 491 | mode=mode, 492 | loss=total_loss, 493 | eval_metrics=eval_metrics, 494 | scaffold_fn=scaffold_fn) 495 | else: 496 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 497 | mode=mode, 498 | predictions={"predictions": predictions}, 499 | scaffold_fn=scaffold_fn) 500 | return output_spec 501 | 502 | return model_fn 503 | 504 | 505 | def main(): 506 | tf.logging.set_verbosity(tf.logging.INFO) 507 | 508 | processors = { 509 | "tnews": TnewsProcessor, 510 | } 511 | 512 | tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case, 513 | FLAGS.init_checkpoint) 514 | 515 | if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: 516 | raise ValueError( 517 | "At least one of `do_train`, `do_eval` or `do_predict' must be True.") 518 | 519 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 520 | 521 | if FLAGS.max_seq_length > bert_config.max_position_embeddings: 522 | raise ValueError( 523 | "Cannot use sequence length %d because the BERT model " 524 | "was only trained up to 
sequence length %d" % 525 | (FLAGS.max_seq_length, bert_config.max_position_embeddings)) 526 | 527 | tf.gfile.MakeDirs(FLAGS.output_dir) 528 | 529 | task_name = FLAGS.task_name.lower() 530 | 531 | if task_name not in processors: 532 | raise ValueError("Task not found: %s" % (task_name)) 533 | 534 | processor = processors[task_name]() 535 | 536 | label_list = processor.get_labels() 537 | 538 | tokenizer = tokenization.FullTokenizer( 539 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) 540 | 541 | tpu_cluster_resolver = None 542 | if FLAGS.use_tpu and FLAGS.tpu_name: 543 | tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver( 544 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 545 | 546 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 547 | run_config = tf.contrib.tpu.RunConfig( 548 | cluster=tpu_cluster_resolver, 549 | master=FLAGS.master, 550 | model_dir=FLAGS.output_dir, 551 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 552 | tpu_config=tf.contrib.tpu.TPUConfig( 553 | iterations_per_loop=FLAGS.iterations_per_loop, 554 | num_shards=FLAGS.num_tpu_cores, 555 | per_host_input_for_training=is_per_host)) 556 | 557 | train_examples = None 558 | num_train_steps = None 559 | num_warmup_steps = None 560 | if FLAGS.do_train: 561 | train_examples = processor.get_train_examples(FLAGS.data_dir) 562 | num_train_steps = int( 563 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) 564 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) 565 | 566 | model_fn = model_fn_builder( 567 | bert_config=bert_config, 568 | num_labels=len(label_list), 569 | init_checkpoint=FLAGS.init_checkpoint, 570 | learning_rate=FLAGS.learning_rate, 571 | num_train_steps=num_train_steps, 572 | num_warmup_steps=num_warmup_steps, 573 | use_tpu=FLAGS.use_tpu, 574 | use_one_hot_embeddings=FLAGS.use_tpu) 575 | 576 | # If TPU is not available, this will fall back to normal Estimator on CPU 577 | # or GPU. 
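# Note: with use_tpu=False, TPUEstimator runs on CPU/GPU like a regular Estimator; the
# train/eval/predict batch sizes passed below are handed to the input_fn through
# params["batch_size"], which is why file_based_input_fn_builder reads the batch size
# from params rather than from FLAGS.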
578 | estimator = tf.contrib.tpu.TPUEstimator( 579 | use_tpu=FLAGS.use_tpu, 580 | model_fn=model_fn, 581 | config=run_config, 582 | train_batch_size=FLAGS.train_batch_size, 583 | eval_batch_size=FLAGS.eval_batch_size, 584 | predict_batch_size=FLAGS.predict_batch_size) 585 | 586 | if FLAGS.do_train: 587 | train_file = os.path.join(FLAGS.data_dir, "train.tf_record") 588 | file_based_convert_examples_to_features( 589 | train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file) 590 | tf.logging.info("***** Running training *****") 591 | tf.logging.info(" Num examples = %d", len(train_examples)) 592 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 593 | tf.logging.info(" Num steps = %d", num_train_steps) 594 | train_input_fn = file_based_input_fn_builder( 595 | input_file=train_file, 596 | seq_length=FLAGS.max_seq_length, 597 | is_training=True, 598 | drop_remainder=True) 599 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 600 | 601 | if FLAGS.do_eval: 602 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 603 | num_actual_eval_examples = len(eval_examples) 604 | if FLAGS.use_tpu: 605 | while len(eval_examples) % FLAGS.eval_batch_size != 0: 606 | eval_examples.append(PaddingInputExample()) 607 | 608 | eval_file = os.path.join(FLAGS.data_dir, "eval.tf_record") 609 | file_based_convert_examples_to_features( 610 | eval_examples, label_list, FLAGS.max_seq_length, tokenizer, eval_file) 611 | 612 | tf.logging.info("***** Running evaluation *****") 613 | tf.logging.info(" Num examples = %d (%d actual, %d padding)", 614 | len(eval_examples), num_actual_eval_examples, 615 | len(eval_examples) - num_actual_eval_examples) 616 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 617 | 618 | # This tells the estimator to run through the entire set. 
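# With eval_steps left as None, estimator.evaluate() keeps consuming batches until the eval
# input_fn signals end of input, i.e. the whole dev set is evaluated exactly once. On TPU a
# fixed step count is required instead, which is why the eval examples were padded above to
# a multiple of eval_batch_size before eval_steps is computed.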
619 | eval_steps = None 620 | if FLAGS.use_tpu: 621 | assert len(eval_examples) % FLAGS.eval_batch_size == 0 622 | eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size) 623 | 624 | eval_drop_remainder = True if FLAGS.use_tpu else False 625 | eval_input_fn = file_based_input_fn_builder( 626 | input_file=eval_file, 627 | seq_length=FLAGS.max_seq_length, 628 | is_training=False, 629 | drop_remainder=eval_drop_remainder) 630 | 631 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps) 632 | 633 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") 634 | with tf.gfile.GFile(output_eval_file, "w") as writer: 635 | tf.logging.info("***** Eval results *****") 636 | for key in sorted(result.keys()): 637 | tf.logging.info(" %s = %s", key, str(result[key])) 638 | writer.write("%s = %s\n" % (key, str(result[key]))) 639 | 640 | if FLAGS.do_predict: 641 | # label dict的设置 642 | label_dict = {0: 100, 1: 101, 2: 102, 3: 103, 643 | 4: 104, 5: 106, 6: 107, 7: 108, 644 | 8: 109, 9: 110, 10: 112, 11: 113, 645 | 12: 114, 13: 115, 14: 116} 646 | label_desc = {100: "news_story", 101: "news_culture", 102: "news_entertainment", 647 | 103: "news_sports", 104: "news_finance", 106: "news_house", 648 | 107: "news_car", 108: "news_edu", 109: "news_tech", 649 | 110: "news_military", 112: "news_travel", 113: "news_world", 650 | 114: "news_stock", 115: "news_agriculture", 116: "news_game"} 651 | 652 | predict_examples = processor.get_test_examples(FLAGS.data_dir) 653 | num_actual_predict_examples = len(predict_examples) 654 | test_file = os.path.join(FLAGS.data_dir, "test.tf_record") 655 | file_based_convert_examples_to_features(predict_examples, label_list, 656 | FLAGS.max_seq_length, tokenizer, 657 | test_file) 658 | 659 | tf.logging.info("***** Running prediction*****") 660 | tf.logging.info(" Num examples = %d (%d actual, %d padding)", 661 | len(predict_examples), num_actual_predict_examples, 662 | len(predict_examples) - num_actual_predict_examples) 663 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size) 664 | 665 | predict_drop_remainder = True if FLAGS.use_tpu else False 666 | predict_input_fn = file_based_input_fn_builder( 667 | input_file=test_file, 668 | seq_length=FLAGS.max_seq_length, 669 | is_training=False, 670 | drop_remainder=predict_drop_remainder) 671 | 672 | results = estimator.predict(input_fn=predict_input_fn) 673 | 674 | output_file = os.path.join(FLAGS.output_dir, 'news_predict.json') 675 | with open(output_file, 'w', encoding='utf-8') as fr: 676 | print(results) 677 | for index, result in enumerate(results): 678 | pre_id = result['predictions'] 679 | print(f'the index is {index} preid is {pre_id}') 680 | label = label_dict.get(pre_id) 681 | label_d = label_desc.get(label) 682 | 683 | json_str = json.dumps({"id": index, "label": str(label), "label_desc": label_d}) 684 | fr.write(json_str) 685 | fr.write('\n') 686 | 687 | 688 | if __name__ == "__main__": 689 | main() 690 | -------------------------------------------------------------------------------- /bert_downstream/train_multi_learning.py: -------------------------------------------------------------------------------- 1 | """ BERT finetuning runner for multi-learning task """ 2 | 3 | import collections 4 | import math 5 | import os 6 | import random 7 | import pandas as pd 8 | import numpy as np 9 | import json 10 | import tqdm 11 | 12 | import bert_master.modeling as modeling 13 | import bert_master.optimization as optimization 14 | import bert_master.tokenization as tokenization 15 | import 
tensorflow as tf 16 | 17 | flags = tf.flags 18 | FLAGS = flags.FLAGS 19 | 20 | # Required parameters 21 | flags.DEFINE_string( 22 | "data_dir", './data_path/', 23 | "The input data dir. Should contain the .tsv files (or other data files) " 24 | "for the task.") 25 | 26 | flags.DEFINE_string( 27 | "bert_config_file", './pre_trained/bert_config.json', 28 | "The config json file corresponding to the pre-trained BERT model. " 29 | "This specifies the model architecture.") 30 | 31 | flags.DEFINE_string("vocab_file", './pre_trained/vocab.txt', 32 | "The vocabulary file that the BERT model was trained on.") 33 | 34 | flags.DEFINE_string( 35 | "output_dir", './model_ckpt/multi_learning/', 36 | "The output directory where the model checkpoints will be written.") 37 | 38 | flags.DEFINE_string( 39 | "init_checkpoint", './pre_trained/bert_model.ckpt', 40 | "Initial checkpoint (usually from a pre-trained BERT model).") 41 | 42 | flags.DEFINE_bool( 43 | "do_lower_case", True, 44 | "Whether to lower case the input text. Should be True for uncased " 45 | "models and False for cased models.") 46 | 47 | flags.DEFINE_integer( 48 | "max_seq_length", 128, 49 | "The maximum total input sequence length after WordPiece tokenization. " 50 | "Sequences longer than this will be truncated, and sequences shorter " 51 | "than this will be padded.") 52 | 53 | flags.DEFINE_integer("train_batch_size", 16, "Total batch size for training.") 54 | 55 | flags.DEFINE_integer("eval_batch_size", 16, "Total batch size for eval.") 56 | 57 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 58 | 59 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 60 | 61 | flags.DEFINE_integer("num_train_epochs", 3, 62 | "Total number of training epochs to perform.") 63 | 64 | flags.DEFINE_float( 65 | "warmup_proportion", 0.1, 66 | "Proportion of training to perform linear learning rate warmup for. " 67 | "E.g., 0.1 = 10% of training.") 68 | 69 | 70 | class InputExample(object): 71 | """A single training/test example for simple sequence classification.""" 72 | 73 | def __init__(self, guid, text_a, text_b=None, label=None, task=None): 74 | """Constructs a InputExample. 75 | 76 | Args: 77 | guid: Unique id for the example. 78 | text_a: string. The untokenized text of the first sequence. For single 79 | sequence tasks, only this sequence must be specified. 80 | text_b: (Optional) string. The untokenized text of the second sequence. 81 | Only must be specified for sequence pair tasks. 82 | label: (Optional) string. The label of the example. This should be 83 | specified for train and dev examples, but not for test examples. 84 | """ 85 | self.guid = guid 86 | self.text_a = text_a 87 | self.text_b = text_b 88 | self.label = label 89 | self.task = task 90 | 91 | 92 | class PaddingInputExample(object): 93 | """Fake example so the num input examples is a multiple of the batch size. 94 | 95 | When running eval/predict on the TPU, we need to pad the number of examples 96 | to be a multiple of the batch size, because the TPU requires a fixed batch 97 | size. The alternative is to drop the last batch, which is bad because it means 98 | the entire output data won't be generated. 99 | 100 | We use this class instead of `None` because treating `None` as padding 101 | battches could cause silent errors. 
102 | """ 103 | 104 | 105 | class InputFeatures(object): 106 | """A single set of features of data.""" 107 | 108 | def __init__(self, 109 | input_ids, 110 | input_mask, 111 | segment_ids, 112 | label_id, 113 | task, 114 | is_real_example=True): 115 | self.input_ids = input_ids 116 | self.input_mask = input_mask 117 | self.segment_ids = segment_ids 118 | self.label_id = label_id 119 | self.task = task 120 | self.is_real_example = is_real_example 121 | 122 | 123 | class DataProcessor(object): 124 | """Base class for data converters for sequence classification data sets.""" 125 | 126 | def get_train_examples(self, data_dir): 127 | """Gets a collection of `InputExample`s for the train set.""" 128 | raise NotImplementedError() 129 | 130 | def get_dev_examples(self, data_dir): 131 | """Gets a collection of `InputExample`s for the dev set.""" 132 | raise NotImplementedError() 133 | 134 | def get_test_examples(self, data_dir): 135 | """Gets a collection of `InputExample`s for prediction.""" 136 | raise NotImplementedError() 137 | 138 | def get_labels(self): 139 | """Gets the list of labels for this data set.""" 140 | raise NotImplementedError() 141 | 142 | @classmethod 143 | def _read_csv(cls, input_file, task): 144 | data = pd.read_csv(input_file, sep='\t', encoding='utf-8', header=None) 145 | if task == 'nli': 146 | data.columns = ['id', 'texta', 'textb', 'label'] 147 | else: 148 | data.columns = ['id', 'text', 'label'] 149 | lines = [] 150 | for index, row in data.iterrows(): 151 | if task == 'nli': 152 | lines.append((row['texta'], row['textb'], row['label'])) 153 | else: 154 | lines.append((row['text'], row['label'])) 155 | return lines 156 | 157 | @classmethod 158 | def _read_test(cls, input_file, task): 159 | data = pd.read_csv(input_file, sep='\t', encoding='utf-8', header=None) 160 | if task == 'nli': 161 | data.columns = ['id', 'texta', 'textb'] 162 | else: 163 | data.columns = ['id', 'text'] 164 | lines = [] 165 | # 添加id防止预测提交出错 166 | for index, row in data.iterrows(): 167 | if task == 'nli': 168 | lines.append((row['id'], row['texta'], row['textb'])) 169 | else: 170 | lines.append((row['id'], row['text'])) 171 | return lines 172 | 173 | 174 | class AllProcessor(DataProcessor): 175 | """Processor for the CoLA data set (GLUE version).""" 176 | 177 | def get_train_examples(self, data_dir): 178 | """See base class.""" 179 | emotion_dir = os.path.join(data_dir, 'train_emotion.csv') 180 | news_dir = os.path.join(data_dir, 'train_news.csv') 181 | nli_dir = os.path.join(data_dir, 'train_nli.csv') 182 | emotion_lines = self._read_csv(emotion_dir, 'emotion') 183 | news_lines = self._read_csv(news_dir, 'news') 184 | nli_lines = self._read_csv(nli_dir, 'nli') 185 | return self._create_examples(emotion_lines, news_lines, nli_lines, 'train') 186 | 187 | def get_dev_examples(self, data_dir): 188 | """See base class.""" 189 | emotion_dir = os.path.join(data_dir, 'dev_emotion.csv') 190 | news_dir = os.path.join(data_dir, 'dev_news.csv') 191 | nli_dir = os.path.join(data_dir, 'dev_nli.csv') 192 | emotion_lines = self._read_csv(emotion_dir, 'emotion') 193 | news_lines = self._read_csv(news_dir, 'news') 194 | nli_lines = self._read_csv(nli_dir, 'nli') 195 | return self._create_examples(emotion_lines, news_lines, nli_lines, 'dev') 196 | 197 | def get_test_examples(self, data_dir): 198 | """See base class.""" 199 | emotion_dir = os.path.join(data_dir, 'test_emotion.csv') 200 | news_dir = os.path.join(data_dir, 'test_news.csv') 201 | nli_dir = os.path.join(data_dir, 'test_nli.csv') 202 | emotion_lines = 
self._read_test(emotion_dir, 'emotion') 203 | news_lines = self._read_test(news_dir, 'news') 204 | nli_lines = self._read_test(nli_dir, 'nli') 205 | return self._create_examples(emotion_lines, news_lines, nli_lines, 'test') 206 | 207 | def get_labels(self): 208 | """See base class.""" 209 | return [['sadness', 'anger', 'happiness', 'fear', 'like', 210 | 'disgust', 'surprise'], 211 | ['108', '104', '106', '112', '109', '103', '116', '101', 212 | '107', '100', '102', '110', '115', '113', '114'], 213 | ['0', '1', '2']] 214 | 215 | def _create_examples(self, emotion_lines, news_lines, nli_lines, set_type): 216 | """Creates examples for the training and dev sets.""" 217 | examples = [] 218 | 219 | # emotion 220 | for (i, line) in enumerate(emotion_lines): 221 | guid = "%s-%s" % (set_type, i) 222 | if set_type == "test": 223 | text_a = tokenization.convert_to_unicode(line[1]) 224 | label = None 225 | guid = line[0] 226 | else: 227 | text_a = tokenization.convert_to_unicode(line[0]) 228 | label = tokenization.convert_to_unicode(line[1]) 229 | examples.append( 230 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label, task='1')) 231 | 232 | # news 233 | for i, line in enumerate(news_lines): 234 | guid = f'news_{set_type}_{i}' 235 | if set_type == 'test': 236 | text_a = tokenization.convert_to_unicode(line[1]) 237 | label = None 238 | guid = line[0] 239 | else: 240 | text_a = tokenization.convert_to_unicode(line[0]) 241 | label = tokenization.convert_to_unicode(str(line[1])) 242 | 243 | examples.append( 244 | InputExample(guid=guid, text_a=text_a, text_b=None, label=label, task='2')) 245 | 246 | # nli 247 | for i, line in enumerate(nli_lines): 248 | guid = f'news_{set_type}_{i}' 249 | if set_type == 'test': 250 | text_a = tokenization.convert_to_unicode(line[1]) 251 | text_b = tokenization.convert_to_unicode(line[2]) 252 | label = None 253 | guid = line[0] 254 | else: 255 | text_a = tokenization.convert_to_unicode(line[0]) 256 | text_b = tokenization.convert_to_unicode(line[1]) 257 | label = tokenization.convert_to_unicode(str(line[2])) 258 | 259 | examples.append( 260 | InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label, task='3')) 261 | 262 | return examples 263 | 264 | 265 | def convert_single_example(ex_index, example, label_list, max_seq_length, 266 | tokenizer): 267 | """Converts a single `InputExample` into a single `InputFeatures`.""" 268 | 269 | emotion_label_map = {} 270 | news_label_map = {} 271 | nli_label_map = {} 272 | for (i, label) in enumerate(label_list[0]): 273 | emotion_label_map[label] = i 274 | for (i, label) in enumerate(label_list[1]): 275 | news_label_map[label] = i 276 | for (i, label) in enumerate(label_list[2]): 277 | nli_label_map[label] = i 278 | 279 | tokens_a = tokenizer.tokenize(example.text_a) 280 | tokens_b = None 281 | if example.text_b: 282 | tokens_b = tokenizer.tokenize(example.text_b) 283 | 284 | if tokens_b: 285 | _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3) 286 | else: 287 | if len(tokens_a) > max_seq_length - 2: 288 | tokens_a = tokens_a[0:(max_seq_length - 2)] 289 | 290 | tokens = [] 291 | segment_ids = [] 292 | tokens.append("[CLS]") 293 | segment_ids.append(0) 294 | for token in tokens_a: 295 | tokens.append(token) 296 | segment_ids.append(0) 297 | tokens.append("[SEP]") 298 | segment_ids.append(0) 299 | 300 | if tokens_b: 301 | for token in tokens_b: 302 | tokens.append(token) 303 | segment_ids.append(1) 304 | tokens.append("[SEP]") 305 | segment_ids.append(1) 306 | 307 | input_ids = 
tokenizer.convert_tokens_to_ids(tokens) 308 | 309 | input_mask = [1] * len(input_ids) 310 | 311 | # Zero-pad up to the sequence length. 312 | while len(input_ids) < max_seq_length: 313 | input_ids.append(0) 314 | input_mask.append(0) 315 | segment_ids.append(0) 316 | 317 | assert len(input_ids) == max_seq_length 318 | assert len(input_mask) == max_seq_length 319 | assert len(segment_ids) == max_seq_length 320 | task = example.task 321 | if example.label: 322 | if task == '1': label_id = emotion_label_map[example.label] 323 | if task == '2': label_id = news_label_map[example.label] 324 | if task == '3': label_id = nli_label_map[example.label] 325 | else: 326 | label_id = 0 327 | 328 | if ex_index < 5: 329 | tf.logging.info("*** Example ***") 330 | tf.logging.info("guid: %s" % (example.guid)) 331 | tf.logging.info("tokens: %s" % " ".join( 332 | [tokenization.printable_text(x) for x in tokens])) 333 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 334 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 335 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 336 | tf.logging.info("label: %s (id = %d)" % (example.label, label_id)) 337 | 338 | feature = InputFeatures( 339 | input_ids=input_ids, 340 | input_mask=input_mask, 341 | segment_ids=segment_ids, 342 | label_id=label_id, 343 | task=int(task), 344 | is_real_example=True) 345 | return feature 346 | 347 | 348 | def file_based_convert_examples_to_features( 349 | examples, label_list, max_seq_length, tokenizer, output_file, type): 350 | """Convert a set of `InputExample`s to a TFRecord file.""" 351 | 352 | emotion_out = os.path.join(output_file, f'emotion_{type}.record') 353 | news_out = os.path.join(output_file, f'news_{type}.record') 354 | nli_out = os.path.join(output_file, f'nli_{type}.record') 355 | 356 | emotion_writer = tf.python_io.TFRecordWriter(emotion_out) 357 | news_writer = tf.python_io.TFRecordWriter(news_out) 358 | nli_writer = tf.python_io.TFRecordWriter(nli_out) 359 | 360 | emotion_cnt = 0 361 | news_cnt = 0 362 | nli_cnt = 0 363 | for (ex_index, example) in enumerate(examples): 364 | if ex_index % 10000 == 0: 365 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 366 | 367 | feature = convert_single_example(ex_index, example, label_list, 368 | max_seq_length, tokenizer) 369 | 370 | def create_int_feature(values): 371 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 372 | return f 373 | 374 | features = collections.OrderedDict() 375 | features["input_ids"] = create_int_feature(feature.input_ids) 376 | features["input_mask"] = create_int_feature(feature.input_mask) 377 | features["segment_ids"] = create_int_feature(feature.segment_ids) 378 | features["label_ids"] = create_int_feature([feature.label_id]) 379 | features["task"] = create_int_feature([feature.task]) 380 | features["is_real_example"] = create_int_feature( 381 | [int(feature.is_real_example)]) 382 | 383 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 384 | 385 | if feature.task == 1: 386 | emotion_cnt += 1 387 | emotion_writer.write(tf_example.SerializeToString()) 388 | if feature.task == 2: 389 | news_cnt += 1 390 | news_writer.write(tf_example.SerializeToString()) 391 | if feature.task == 3: 392 | nli_cnt += 1 393 | nli_writer.write(tf_example.SerializeToString()) 394 | 395 | emotion_writer.close() 396 | news_writer.close() 397 | nli_writer.close() 398 | print(f'the emotion news nli cnt is {emotion_cnt} 
{news_cnt} {nli_cnt}') 399 | return emotion_cnt, news_cnt, nli_cnt 400 | 401 | 402 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 403 | """Truncates a sequence pair in place to the maximum length.""" 404 | 405 | while True: 406 | total_length = len(tokens_a) + len(tokens_b) 407 | if total_length <= max_length: 408 | break 409 | if len(tokens_a) > len(tokens_b): 410 | tokens_a.pop() 411 | else: 412 | tokens_b.pop() 413 | 414 | 415 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids, 416 | labels, use_one_hot_embeddings, task): 417 | """Creates a classification model.""" 418 | model = modeling.BertModel( 419 | config=bert_config, 420 | is_training=is_training, 421 | input_ids=input_ids, 422 | input_mask=input_mask, 423 | token_type_ids=segment_ids, 424 | use_one_hot_embeddings=use_one_hot_embeddings) 425 | 426 | output_layer = model.get_pooled_output() 427 | 428 | hidden_size = output_layer.shape[-1].value 429 | 430 | # 三个任务对应的三个全连接层参数 431 | emotion_weights = tf.get_variable( 432 | "emotion_weights", [7, hidden_size], 433 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 434 | emotion_bias = tf.get_variable( 435 | "emotion_bias", [7], initializer=tf.zeros_initializer()) 436 | 437 | news_weights = tf.get_variable( 438 | "news_weights", [15, hidden_size], 439 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 440 | news_bias = tf.get_variable( 441 | "news_bias", [15], initializer=tf.zeros_initializer()) 442 | 443 | nli_weights = tf.get_variable( 444 | "nli_weights", [3, hidden_size], 445 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 446 | nli_bias = tf.get_variable( 447 | "nli_bias", [3], initializer=tf.zeros_initializer()) 448 | 449 | if is_training: 450 | # I.e., 0.1 dropout 451 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 452 | 453 | emotion_logits = tf.matmul(output_layer, emotion_weights, transpose_b=True) 454 | emotion_logits = tf.nn.bias_add(emotion_logits, emotion_bias) 455 | 456 | news_logits = tf.matmul(output_layer, news_weights, transpose_b=True) 457 | news_logits = tf.nn.bias_add(news_logits, news_bias) 458 | 459 | nli_logits = tf.matmul(output_layer, nli_weights, transpose_b=True) 460 | nli_logits = tf.nn.bias_add(nli_logits, nli_bias) 461 | 462 | logits = tf.cond( 463 | tf.equal(task, 1), 464 | lambda: emotion_logits, 465 | lambda: tf.cond(tf.equal(task, 2), lambda: news_logits, lambda: nli_logits) 466 | ) 467 | depth = tf.cond( 468 | tf.equal(task, 1), 469 | lambda: 7, 470 | lambda: tf.cond(tf.equal(task, 2), lambda: 15, lambda: 3) 471 | ) 472 | 473 | predictions = tf.argmax(logits, axis=-1, output_type=tf.int64, name='pre_id') 474 | 475 | with tf.variable_scope("loss"): 476 | log_probs = tf.nn.log_softmax(logits, axis=-1) 477 | one_hot_labels = tf.one_hot(labels, depth=depth, dtype=tf.float32) 478 | per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) 479 | loss = tf.reduce_mean(per_example_loss) 480 | 481 | equals = tf.reduce_sum(tf.cast(tf.equal(predictions, labels), tf.int64)) 482 | acc = equals / FLAGS.eval_batch_size 483 | return loss, logits, acc, predictions 484 | 485 | 486 | def get_input_data(input_file, seq_len, batch_size, is_training): 487 | def parser(record): 488 | name_to_features = { 489 | "input_ids": tf.FixedLenFeature([seq_len], tf.int64), 490 | "input_mask": tf.FixedLenFeature([seq_len], tf.int64), 491 | "segment_ids": tf.FixedLenFeature([seq_len], tf.int64), 492 | "label_ids": tf.FixedLenFeature([], tf.int64), 493 | } 494 | # 解析的时候需要的是int64 495 | 
example = tf.parse_single_example(record, features=name_to_features) 496 | input_ids = example["input_ids"] 497 | input_mask = example["input_mask"] 498 | segment_ids = example["segment_ids"] 499 | labels = example["label_ids"] 500 | return input_ids, input_mask, segment_ids, labels 501 | 502 | dataset = tf.data.TFRecordDataset(input_file) 503 | if is_training: 504 | dataset = dataset.map(parser).batch(batch_size).shuffle(buffer_size=3000) 505 | else: 506 | dataset = dataset.map(parser).batch(batch_size) 507 | iterator = dataset.make_one_shot_iterator() 508 | input_ids, input_mask, segment_ids, labels = iterator.get_next() 509 | return input_ids, input_mask, segment_ids, labels 510 | 511 | 512 | def main(): 513 | """ 训练主入口 """ 514 | tf.logging.info('start to train') 515 | 516 | # 部分参数设置 517 | process = AllProcessor() 518 | label_list = process.get_labels() 519 | tokenizer = tokenization.FullTokenizer( 520 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) 521 | 522 | train_examples = process.get_train_examples(FLAGS.data_dir) 523 | train_cnt = file_based_convert_examples_to_features( 524 | train_examples, 525 | label_list, 526 | FLAGS.max_seq_length, 527 | tokenizer, 528 | FLAGS.data_dir, 529 | 'train' 530 | ) 531 | dev_examples = process.get_dev_examples(FLAGS.data_dir) 532 | dev_cnt = file_based_convert_examples_to_features( 533 | dev_examples, 534 | label_list, 535 | FLAGS.max_seq_length, 536 | tokenizer, 537 | FLAGS.data_dir, 538 | 'dev' 539 | ) 540 | 541 | # 输入输出定义 542 | input_ids = tf.placeholder(tf.int64, shape=[None, FLAGS.max_seq_length], 543 | name='input_ids') 544 | input_mask = tf.placeholder(tf.int64, shape=[None, FLAGS.max_seq_length], 545 | name='input_mask') 546 | segment_ids = tf.placeholder(tf.int64, shape=[None, FLAGS.max_seq_length], 547 | name='segment_ids') 548 | labels = tf.placeholder(tf.int64, shape=[None], name='labels') 549 | task = tf.placeholder(tf.int64, name='task') 550 | 551 | # bert相关参数设置 552 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 553 | 554 | loss, logits, acc, pre_id = create_model( 555 | bert_config, 556 | True, 557 | input_ids, 558 | input_mask, 559 | segment_ids, 560 | labels, 561 | False, 562 | task 563 | ) 564 | num_train_steps = int(len(train_examples) / FLAGS.train_batch_size) 565 | num_warmup_steps = math.ceil( 566 | num_train_steps * FLAGS.train_batch_size * FLAGS.warmup_proportion) 567 | train_op = optimization.create_optimizer( 568 | loss, 569 | FLAGS.learning_rate, 570 | num_train_steps * FLAGS.num_train_epochs, 571 | num_warmup_steps, 572 | False 573 | ) 574 | 575 | # 初始化参数 576 | init_global = tf.global_variables_initializer() 577 | saver = tf.train.Saver( 578 | [v for v in tf.global_variables() 579 | if 'adam_v' not in v.name and 'adam_m' not in v.name]) 580 | 581 | with tf.Session() as sess: 582 | sess.run(init_global) 583 | print('start to load bert params') 584 | if FLAGS.init_checkpoint: 585 | # tvars = tf.global_variables() 586 | tvars = tf.trainable_variables() 587 | print("global_variables", len(tvars)) 588 | assignment_map, initialized_variable_names = \ 589 | modeling.get_assignment_map_from_checkpoint(tvars, 590 | FLAGS.init_checkpoint) 591 | print("initialized_variable_names:", len(initialized_variable_names)) 592 | saver_ = tf.train.Saver([v for v in tvars if v.name in initialized_variable_names]) 593 | saver_.restore(sess, FLAGS.init_checkpoint) 594 | tvars = tf.global_variables() 595 | # initialized_vars = [v for v in tvars if v.name in initialized_variable_names] 596 | 
not_initialized_vars = [v for v in tvars if v.name not in initialized_variable_names] 597 | print('all size %s; not initialized size %s' % (len(tvars), len(not_initialized_vars))) 598 | if len(not_initialized_vars): 599 | sess.run(tf.variables_initializer(not_initialized_vars)) 600 | # for v in initialized_vars: 601 | # print('initialized: %s, shape = %s' % (v.name, v.shape)) 602 | # for v in not_initialized_vars: 603 | # print('not initialized: %s, shape = %s' % (v.name, v.shape)) 604 | else: 605 | print('the bert init checkpoint is None!!!') 606 | sess.run(tf.global_variables_initializer()) 607 | 608 | # 训练的step 609 | def train_step(ids, mask, seg, true_y, task_id): 610 | feed = {input_ids: ids, 611 | input_mask: mask, 612 | segment_ids: seg, 613 | labels: true_y, 614 | task: task_id} 615 | _, logits_out, loss_out = sess.run([train_op, logits, loss], feed_dict=feed) 616 | return logits_out, loss_out 617 | 618 | # 验证的step 619 | def dev_step(ids, mask, seg, true_y, task_id): 620 | feed = {input_ids: ids, 621 | input_mask: mask, 622 | segment_ids: seg, 623 | labels: true_y, 624 | task: task_id} 625 | pre_out, acc_out = sess.run([pre_id, acc], feed_dict=feed) 626 | return pre_out, acc_out 627 | 628 | # 开始训练 629 | for epoch in range(FLAGS.num_train_epochs): 630 | tf.logging.info(f'start to train and the epoch:{epoch}') 631 | epoch_loss = do_train(sess, train_cnt, train_step, epoch) 632 | tf.logging.info(f'the epoch{epoch} loss is {epoch_loss}') 633 | saver.save(sess, FLAGS.output_dir + 'bert.ckpt', global_step=epoch) 634 | # 每一个epoch开始验证模型 635 | do_eval(sess, dev_cnt, dev_step) 636 | 637 | # 进行预测并保存结果 638 | do_predict(label_list, process, tokenizer, dev_step) 639 | 640 | tf.logging.info('the training is over!!!!') 641 | 642 | 643 | def set_random_task(train_cnt): 644 | """ 任务采样 : 各任务每个epoch 迭代的step次数 """ 645 | # emotion cnt 646 | emotion_cnt = train_cnt[0] // FLAGS.train_batch_size 647 | news_cnt = train_cnt[1] // FLAGS.train_batch_size 648 | nli_cnt = train_cnt[2] // FLAGS.train_batch_size 649 | 650 | emotion_list = [1] * emotion_cnt 651 | news_list = [2] * news_cnt 652 | nli_list = [3] * nli_cnt 653 | 654 | task_list = emotion_list + news_list + nli_list 655 | 656 | random.shuffle(task_list) 657 | 658 | return task_list 659 | 660 | 661 | def do_train(sess, train_cnt, train_step, epoch): 662 | """ 模型训练 """ 663 | emotion_train_file = os.path.join(FLAGS.data_dir, 'emotion_train.record') 664 | news_train_file = os.path.join(FLAGS.data_dir, 'news_train.record') 665 | nli_train_file = os.path.join(FLAGS.data_dir, 'nli_train.record') 666 | ids1, mask1, seg1, labels1 = get_input_data( 667 | emotion_train_file, FLAGS.max_seq_length, 668 | FLAGS.train_batch_size, True) 669 | ids2, mask2, seg2, labels2 = get_input_data( 670 | news_train_file, FLAGS.max_seq_length, 671 | FLAGS.train_batch_size, True) 672 | ids3, mask3, seg3, labels3 = get_input_data( 673 | nli_train_file, FLAGS.max_seq_length, 674 | FLAGS.train_batch_size, True) 675 | 676 | # 设置任务list 677 | tasks = set_random_task(train_cnt) 678 | 679 | total_loss = 0 680 | for step, task_id in enumerate(tasks): 681 | if task_id == 1: 682 | ids_train, mask_train, seg_train, y_train = sess.run( 683 | [ids1, mask1, seg1, labels1]) 684 | if task_id == 2: 685 | ids_train, mask_train, seg_train, y_train = sess.run( 686 | [ids2, mask2, seg2, labels2]) 687 | if task_id == 3: 688 | ids_train, mask_train, seg_train, y_train = sess.run( 689 | [ids3, mask3, seg3, labels3]) 690 | 691 | _, step_loss = train_step(ids_train, mask_train, seg_train, y_train, task_id) 
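# Each pass through this loop pulls the next mini-batch from whichever task's TFRecord
# dataset was sampled for this position in `tasks`, so the three tasks are interleaved
# within an epoch roughly in proportion to their training-set sizes; all tasks share the
# BERT encoder and differ only in the output layer selected through the `task` placeholder.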
692 | 
693 |         tf.logging.info(f'epoch {epoch} the step loss: {step_loss}')
694 | 
695 |         total_loss += step_loss
696 | 
697 |     return total_loss / len(tasks)
698 | 
699 | 
700 | def do_eval(sess, dev_cnt, dev_step):
701 |     """ Evaluate the model on the three dev sets """
702 |     tf.logging.info(f'start to do eval')
703 |     emotion_dev_file = os.path.join(FLAGS.data_dir, 'emotion_dev.record')
704 |     news_dev_file = os.path.join(FLAGS.data_dir, 'news_dev.record')
705 |     nli_dev_file = os.path.join(FLAGS.data_dir, 'nli_dev.record')
706 | 
707 |     ids1, mask1, seg1, labels1 = get_input_data(
708 |         emotion_dev_file, FLAGS.max_seq_length,
709 |         FLAGS.eval_batch_size, False)
710 |     ids2, mask2, seg2, labels2 = get_input_data(
711 |         news_dev_file, FLAGS.max_seq_length,
712 |         FLAGS.eval_batch_size, False)
713 |     ids3, mask3, seg3, labels3 = get_input_data(
714 |         nli_dev_file, FLAGS.max_seq_length,
715 |         FLAGS.eval_batch_size, False)
716 | 
717 |     # evaluate the emotion task
718 |     total_dev_acc = 0
719 |     step_cnt = dev_cnt[0] // FLAGS.eval_batch_size
720 |     for step in range(step_cnt):
721 |         ids_dev, mask_dev, seg_dev, y_dev = sess.run(
722 |             [ids1, mask1, seg1, labels1])
723 |         _, dev_acc = dev_step(ids_dev, mask_dev, seg_dev, y_dev, 1)
724 |         total_dev_acc += dev_acc
725 |     tf.logging.info(f'===the emotion acc is {total_dev_acc / step_cnt}===')
726 | 
727 |     total_dev_acc = 0
728 |     step_cnt = dev_cnt[1] // FLAGS.eval_batch_size
729 |     for step in range(step_cnt):
730 |         ids_dev, mask_dev, seg_dev, y_dev = sess.run(
731 |             [ids2, mask2, seg2, labels2])
732 |         _, dev_acc = dev_step(ids_dev, mask_dev, seg_dev, y_dev, 2)
733 |         total_dev_acc += dev_acc
734 |     tf.logging.info(f'===the news acc is {total_dev_acc / step_cnt}===')
735 | 
736 |     total_dev_acc = 0
737 |     step_cnt = dev_cnt[2] // FLAGS.eval_batch_size  # dev_cnt is (emotion, news, nli); index 2 is the nli count
738 |     for step in range(step_cnt):
739 |         ids_dev, mask_dev, seg_dev, y_dev = sess.run(
740 |             [ids3, mask3, seg3, labels3])
741 |         _, dev_acc = dev_step(ids_dev, mask_dev, seg_dev, y_dev, 3)
742 |         total_dev_acc += dev_acc
743 |     tf.logging.info(f'===the nli acc is {total_dev_acc / step_cnt}===')
744 | 
745 | 
746 | def do_predict(label_list, process, tokenizer, dev_step):
747 |     """ Run prediction on the test sets """
748 |     tf.logging.info('start to do predict')
749 |     # map prediction indices back to label strings
750 |     emotion_map = {}
751 |     news_map = {}
752 |     nli_map = {}
753 |     for (i, label) in enumerate(label_list[0]):
754 |         emotion_map[i] = label
755 |     for (i, label) in enumerate(label_list[1]):
756 |         news_map[i] = label
757 |     for (i, label) in enumerate(label_list[2]):
758 |         nli_map[i] = label
759 | 
760 |     test_examples = process.get_test_examples(FLAGS.data_dir)
761 |     emotion_res = []
762 |     news_res = []
763 |     nli_res = []
764 |     batch_size = 1
765 |     index = 0
766 |     for example in tqdm.tqdm(test_examples):
767 |         index += 1
768 |         feature = convert_single_example(index, example, label_list,
769 |                                          FLAGS.max_seq_length, tokenizer)
770 |         ids = np.reshape([feature.input_ids], (batch_size, FLAGS.max_seq_length))
771 |         mask = np.reshape([feature.input_mask], (batch_size, FLAGS.max_seq_length))
772 |         seg = np.reshape([feature.segment_ids], (batch_size, FLAGS.max_seq_length))
773 |         true_y = np.reshape([0], batch_size)
774 | 
775 |         task_id = example.task
776 |         pred_res, _ = dev_step(ids, mask, seg, true_y, int(task_id))
777 | 
778 |         guid = str(example.guid).strip()
779 |         if task_id == '1':
780 |             label_res = emotion_map.get(pred_res[0])
781 |             emotion_res.append(json.dumps({"id": str(guid), "label": str(label_res)}))
782 |         if task_id == '2':
783 |             label_res = news_map.get(pred_res[0])
784 |             news_res.append(json.dumps({"id": str(guid), "label": 
str(label_res)})) 785 | if task_id == '3': 786 | label_res = nli_map.get(pred_res[0]) 787 | nli_res.append(json.dumps({"id": str(guid), "label": str(label_res)})) 788 | 789 | # 写入预测文件 790 | with open('./data_path/ocemotion_predict.json', 'w', encoding='utf-8') as fr: 791 | for res in emotion_res: 792 | fr.write(res) 793 | fr.write('\n') 794 | 795 | with open('./data_path/tnews_predict.json', 'w', encoding='utf-8') as fr: 796 | for res in news_res: 797 | fr.write(res) 798 | fr.write('\n') 799 | 800 | with open('./data_path/ocnli_predict.json', 'w', encoding='utf-8') as fr: 801 | for res in nli_res: 802 | fr.write(res) 803 | fr.write('\n') 804 | tf.logging.info('predict and write file over!') 805 | 806 | 807 | if __name__ == "__main__": 808 | tf.logging.set_verbosity(tf.logging.INFO) 809 | main() 810 | -------------------------------------------------------------------------------- /bert_downstream/train_ner.py: -------------------------------------------------------------------------------- 1 | """BERT finetuning runner for ner (sequence label classification).""" 2 | 3 | import collections 4 | import os 5 | import json 6 | import tensorflow as tf 7 | 8 | import bert_master.modeling as modeling 9 | import bert_master.optimization as optimization 10 | import bert_master.tokenization as tokenization 11 | 12 | flags = tf.flags 13 | FLAGS = flags.FLAGS 14 | 15 | # Required parameters 16 | flags.DEFINE_string( 17 | "data_dir", './data_path/clue_ner/', 18 | "The input data dir. Should contain the .tsv files (or other data files) " 19 | "for the task.") 20 | 21 | flags.DEFINE_string( 22 | "bert_config_file", './pre_trained/bert_config.json', 23 | "The config json file corresponding to the pre-trained BERT model. " 24 | "This specifies the model architecture.") 25 | 26 | flags.DEFINE_string("task_name", 'cluener', 27 | "The name of the task to train.") 28 | 29 | flags.DEFINE_string("vocab_file", './pre_trained/vocab.txt', 30 | "The vocabulary file that the BERT model was trained on.") 31 | 32 | flags.DEFINE_string( 33 | "output_dir", './model_ckpt/clue_ner/', 34 | "The output directory where the model checkpoints will be written.") 35 | 36 | flags.DEFINE_string( 37 | "init_checkpoint", './pre_trained/bert_model.ckpt', 38 | "Initial checkpoint (usually from a pre-trained BERT model).") 39 | 40 | flags.DEFINE_bool( 41 | "do_lower_case", True, 42 | "Whether to lower case the input text. Should be True for uncased " 43 | "models and False for cased models.") 44 | 45 | flags.DEFINE_integer( 46 | "max_seq_length", 128, 47 | "The maximum total input sequence length after WordPiece tokenization. 
" 48 | "Sequences longer than this will be truncated, and sequences shorter " 49 | "than this will be padded.") 50 | 51 | flags.DEFINE_bool("do_train", False, "Whether to run training.") 52 | 53 | flags.DEFINE_bool("do_eval", False, "Whether to run eval on the dev set.") 54 | 55 | flags.DEFINE_bool( 56 | "do_predict", True, 57 | "Whether to run the model in inference mode on the test set.") 58 | 59 | flags.DEFINE_integer("train_batch_size", 16, "Total batch size for training.") 60 | 61 | flags.DEFINE_integer("eval_batch_size", 8, "Total batch size for eval.") 62 | 63 | flags.DEFINE_integer("predict_batch_size", 8, "Total batch size for predict.") 64 | 65 | flags.DEFINE_float("learning_rate", 5e-5, "The initial learning rate for Adam.") 66 | 67 | flags.DEFINE_float("num_train_epochs", 1.0, 68 | "Total number of training epochs to perform.") 69 | 70 | flags.DEFINE_float( 71 | "warmup_proportion", 0.1, 72 | "Proportion of training to perform linear learning rate warmup for. " 73 | "E.g., 0.1 = 10% of training.") 74 | 75 | flags.DEFINE_integer("save_checkpoints_steps", 1000, 76 | "How often to save the model checkpoint.") 77 | 78 | flags.DEFINE_integer("iterations_per_loop", 1000, 79 | "How many steps to make in each estimator call.") 80 | 81 | flags.DEFINE_bool("use_tpu", False, "Whether to use TPU or GPU/CPU.") 82 | 83 | tf.flags.DEFINE_string( 84 | "tpu_name", None, 85 | "The Cloud TPU to use for training. This should be either the name " 86 | "used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 " 87 | "url.") 88 | 89 | tf.flags.DEFINE_string( 90 | "tpu_zone", None, 91 | "[Optional] GCE zone where the Cloud TPU is located in. If not " 92 | "specified, we will attempt to automatically detect the GCE project from " 93 | "metadata.") 94 | 95 | tf.flags.DEFINE_string( 96 | "gcp_project", None, 97 | "[Optional] Project name for the Cloud TPU-enabled project. If not " 98 | "specified, we will attempt to automatically detect the GCE project from " 99 | "metadata.") 100 | 101 | tf.flags.DEFINE_string("master", None, "[Optional] TensorFlow master URL.") 102 | 103 | flags.DEFINE_integer( 104 | "num_tpu_cores", 8, 105 | "Only used if `use_tpu` is True. Total number of TPU cores to use.") 106 | 107 | 108 | class InputExample(object): 109 | """A single training/test example for simple sequence classification.""" 110 | 111 | def __init__(self, guid, text_a, text_b=None, tag=None): 112 | """Constructs a InputExample. 113 | 114 | Args: 115 | guid: Unique id for the example. 116 | text_a: string. The untokenized text of the first sequence. For single 117 | sequence tasks, only this sequence must be specified. 118 | text_b: (Optional) string. The untokenized text of the second sequence. 119 | Only must be specified for sequence pair tasks. 120 | label: (Optional) string. The label of the example. This should be 121 | specified for train and dev examples, but not for test examples. 122 | """ 123 | self.guid = guid 124 | self.text_a = text_a 125 | self.text_b = text_b 126 | self.tag = tag 127 | 128 | 129 | class PaddingInputExample(object): 130 | """Fake example so the num input examples is a multiple of the batch size. 131 | 132 | When running eval/predict on the TPU, we need to pad the number of examples 133 | to be a multiple of the batch size, because the TPU requires a fixed batch 134 | size. The alternative is to drop the last batch, which is bad because it means 135 | the entire output data won't be generated. 
136 | 137 | We use this class instead of `None` because treating `None` as padding 138 | battches could cause silent errors. 139 | """ 140 | 141 | 142 | class InputFeatures(object): 143 | """A single set of features of data.""" 144 | 145 | def __init__(self, 146 | input_ids, 147 | input_mask, 148 | segment_ids, 149 | tag_ids, 150 | is_real_example=True): 151 | self.input_ids = input_ids 152 | self.input_mask = input_mask 153 | self.segment_ids = segment_ids 154 | self.tag_ids = tag_ids 155 | self.is_real_example = is_real_example 156 | 157 | 158 | class NerProcessor: 159 | def get_train_examples(self, data_dir): 160 | """获取训练集.""" 161 | return self._create_examples( 162 | self._read_tsv(os.path.join(data_dir, "train.txt")), "train") 163 | 164 | def get_dev_examples(self, data_dir): 165 | """获取验证集.""" 166 | return self._create_examples( 167 | self._read_tsv(os.path.join(data_dir, "dev.txt")), "dev") 168 | 169 | def get_test_examples(self, data_dir): 170 | """获取测试集.""" 171 | return self._create_examples( 172 | self._read_tsv(os.path.join(data_dir, "test.json")), "test") 173 | 174 | def get_tags(self): 175 | """填写tag的标签,采用BIO形式标注""" 176 | # 会在convert_single_example方法中添加头,成为BIO形式标签 177 | return ['address', 'book', 'company', 'game', 'government', 178 | 'movie', 'name', 'organization', 'position', 'scene'] 179 | 180 | def _read_tsv(self, input_file): 181 | """读取数据集""" 182 | with open(input_file, encoding='utf-8') as fr: 183 | lines = fr.readlines() 184 | return lines 185 | 186 | def _create_examples(self, lines, set_type): 187 | """Creates examples for the training and dev sets.""" 188 | examples = [] 189 | for (i, line) in enumerate(lines): 190 | if set_type == 'test': 191 | json_str = json.loads(line) 192 | text_a = tokenization.convert_to_unicode(json_str['text']) 193 | tag = None 194 | guid = json_str['id'] 195 | else: 196 | text_tag = line.split('\t') 197 | guid = "%s-%s" % (set_type, i) 198 | text_a = tokenization.convert_to_unicode(text_tag[0]) 199 | tag = tokenization.convert_to_unicode(text_tag[1]) 200 | examples.append( 201 | InputExample(guid=guid, text_a=text_a, text_b=None, tag=tag)) 202 | return examples 203 | 204 | 205 | def convert_single_example(ex_index, example, tag_list, max_seq_length, 206 | tokenizer): 207 | """Converts a single `InputExample` into a single `InputFeatures`.""" 208 | 209 | if isinstance(example, PaddingInputExample): 210 | return InputFeatures( 211 | input_ids=[0] * max_seq_length, 212 | input_mask=[0] * max_seq_length, 213 | segment_ids=[0] * max_seq_length, 214 | tag_ids=[0] * max_seq_length, 215 | is_real_example=False) 216 | 217 | tag_map = {'O': 0} 218 | for tag in tag_list: 219 | tag_b = 'B-' + tag 220 | tag_i = 'I-' + tag 221 | tag_map[tag_b] = len(tag_map) 222 | tag_map[tag_i] = len(tag_map) 223 | 224 | # 因为CLUE要求提交文件中包含索引,所以不能直接使用tokenizer去分割text 225 | tokens_a = [] 226 | text_list = list(example.text_a) 227 | for word in text_list: 228 | token = tokenizer.tokenize(word) 229 | tokens_a.extend(token) 230 | 231 | if len(tokens_a) > max_seq_length - 2: 232 | tokens_a = tokens_a[0:(max_seq_length - 2)] 233 | 234 | tokens = [] 235 | segment_ids = [] 236 | tokens.append("[CLS]") 237 | segment_ids.append(0) 238 | for token in tokens_a: 239 | tokens.append(token) 240 | segment_ids.append(0) 241 | tokens.append("[SEP]") 242 | segment_ids.append(0) 243 | 244 | input_ids = tokenizer.convert_tokens_to_ids(tokens) 245 | input_mask = [1] * len(input_ids) 246 | 247 | if example.tag: 248 | tag_ids = [0] # input第一位是[CLS] 249 | tags = example.tag.strip().split(' ') 
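            # 注:上面按单字切分 text 后再逐字 tokenize,通常每个字对应一个 token,
            # 因此这里的 BIO 标签可与 token 一一对应(首尾分别留给 [CLS] 和 [SEP])。
            # 若文本长度超过 max_seq_length-2,tokens_a 会被截断而 tags 不会,
            # 此时下方的长度断言将不成立,需要同步截断 tags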
250 | for tag in tags: 251 | tag_ids.append(tag_map.get(tag)) 252 | tag_ids.append(0) # input最后一位是[SEP] 253 | else: 254 | tag_ids = [0] * max_seq_length 255 | # Zero-pad up to the sequence length. 256 | while len(input_ids) < max_seq_length: 257 | input_ids.append(0) 258 | input_mask.append(0) 259 | segment_ids.append(0) 260 | # test的时候已经*max_len所以不需要再继续padding 261 | if example.tag: 262 | tag_ids.append(0) 263 | 264 | assert len(input_ids) == max_seq_length 265 | assert len(input_mask) == max_seq_length 266 | assert len(segment_ids) == max_seq_length 267 | assert len(tag_ids) == max_seq_length 268 | 269 | if ex_index < 5: 270 | tf.logging.info("*** Example ***") 271 | tf.logging.info("guid: %s" % (example.guid)) 272 | tf.logging.info("tokens: %s" % " ".join( 273 | [tokenization.printable_text(x) for x in tokens])) 274 | tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids])) 275 | tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask])) 276 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids])) 277 | tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in tag_ids])) 278 | 279 | feature = InputFeatures( 280 | input_ids=input_ids, 281 | input_mask=input_mask, 282 | segment_ids=segment_ids, 283 | tag_ids=tag_ids, 284 | is_real_example=True) 285 | return feature 286 | 287 | 288 | def file_based_convert_examples_to_features( 289 | examples, tag_list, max_seq_length, tokenizer, output_file): 290 | """Convert a set of `InputExample`s to a TFRecord file.""" 291 | 292 | writer = tf.python_io.TFRecordWriter(output_file) 293 | 294 | for (ex_index, example) in enumerate(examples): 295 | if ex_index % 10000 == 0: 296 | tf.logging.info("Writing example %d of %d" % (ex_index, len(examples))) 297 | 298 | feature = convert_single_example(ex_index, example, tag_list, 299 | max_seq_length, tokenizer) 300 | 301 | def create_int_feature(values): 302 | f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values))) 303 | return f 304 | 305 | features = collections.OrderedDict() 306 | features["input_ids"] = create_int_feature(feature.input_ids) 307 | features["input_mask"] = create_int_feature(feature.input_mask) 308 | features["segment_ids"] = create_int_feature(feature.segment_ids) 309 | features["tag_ids"] = create_int_feature(feature.tag_ids) 310 | features["is_real_example"] = create_int_feature( 311 | [int(feature.is_real_example)]) 312 | 313 | tf_example = tf.train.Example(features=tf.train.Features(feature=features)) 314 | writer.write(tf_example.SerializeToString()) 315 | writer.close() 316 | 317 | 318 | def file_based_input_fn_builder(input_file, seq_length, is_training, 319 | drop_remainder): 320 | """Creates an `input_fn` closure to be passed to TPUEstimator.""" 321 | 322 | name_to_features = { 323 | "input_ids": tf.FixedLenFeature([seq_length], tf.int64), 324 | "input_mask": tf.FixedLenFeature([seq_length], tf.int64), 325 | "segment_ids": tf.FixedLenFeature([seq_length], tf.int64), 326 | "tag_ids": tf.FixedLenFeature([seq_length], tf.int64), 327 | "is_real_example": tf.FixedLenFeature([], tf.int64), 328 | } 329 | 330 | def _decode_record(record, name_to_features): 331 | """Decodes a record to a TensorFlow example.""" 332 | example = tf.parse_single_example(record, name_to_features) 333 | 334 | for name in list(example.keys()): 335 | t = example[name] 336 | if t.dtype == tf.int64: 337 | t = tf.to_int32(t) 338 | example[name] = t 339 | 340 | return example 341 | 342 | def input_fn(params): 343 | """The actual input function.""" 
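        # TPUEstimator 会通过 params 传入 train/eval/predict 各自的 batch_size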
344 | batch_size = params["batch_size"] 345 | 346 | d = tf.data.TFRecordDataset(input_file) 347 | if is_training: 348 | d = d.repeat() 349 | d = d.shuffle(buffer_size=100) 350 | 351 | d = d.apply( 352 | tf.contrib.data.map_and_batch( 353 | lambda record: _decode_record(record, name_to_features), 354 | batch_size=batch_size, 355 | drop_remainder=drop_remainder)) 356 | 357 | return d 358 | 359 | return input_fn 360 | 361 | 362 | def _truncate_seq_pair(tokens_a, tokens_b, max_length): 363 | """Truncates a sequence pair in place to the maximum length.""" 364 | 365 | while True: 366 | total_length = len(tokens_a) + len(tokens_b) 367 | if total_length <= max_length: 368 | break 369 | if len(tokens_a) > len(tokens_b): 370 | tokens_a.pop() 371 | else: 372 | tokens_b.pop() 373 | 374 | 375 | def create_model(bert_config, is_training, input_ids, input_mask, segment_ids, 376 | tags, num_tags, use_one_hot_embeddings): 377 | """Creates a classification model.""" 378 | model = modeling.BertModel( 379 | config=bert_config, 380 | is_training=is_training, 381 | input_ids=input_ids, 382 | input_mask=input_mask, 383 | token_type_ids=segment_ids, 384 | use_one_hot_embeddings=use_one_hot_embeddings) 385 | 386 | # 用bert的sequence输出层 387 | output_layer = model.get_sequence_output() 388 | 389 | hidden_size = output_layer.shape[-1].value 390 | seq_len = output_layer.shape[1].value 391 | # [batch, seq_len, emb_size] 16 128 768 392 | 393 | output_weights = tf.get_variable( 394 | "output_weights", [num_tags, hidden_size], 395 | initializer=tf.truncated_normal_initializer(stddev=0.02)) 396 | 397 | output_bias = tf.get_variable( 398 | "output_bias", [num_tags], initializer=tf.zeros_initializer()) 399 | 400 | with tf.variable_scope("loss"): 401 | if is_training: 402 | output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) 403 | 404 | # 进行matmul需要reshape 405 | output_layer = tf.reshape(output_layer, [-1, hidden_size]) 406 | # [batch*seq_len, num_tags] 407 | logits = tf.matmul(output_layer, output_weights, transpose_b=True) 408 | logits = tf.nn.bias_add(logits, output_bias) 409 | 410 | logits = tf.reshape(logits, [-1, seq_len, num_tags]) 411 | 412 | # 真实的长度 413 | input_m = tf.count_nonzero(input_mask, -1) 414 | log_likelihood, transition_matrix = tf.contrib.crf.crf_log_likelihood( 415 | logits, tags, input_m) 416 | loss = tf.reduce_mean(-log_likelihood) 417 | 418 | # 使用crf_decode输出 419 | viterbi_sequence, _ = tf.contrib.crf.crf_decode( 420 | logits, transition_matrix, input_m) 421 | 422 | return loss, logits, viterbi_sequence 423 | 424 | 425 | def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate, 426 | num_train_steps, num_warmup_steps, use_tpu, 427 | use_one_hot_embeddings): 428 | """Returns `model_fn` closure for TPUEstimator.""" 429 | 430 | def model_fn(features, labels, mode, params): 431 | """The `model_fn` for TPUEstimator.""" 432 | 433 | tf.logging.info("*** Features ***") 434 | for name in sorted(features.keys()): 435 | tf.logging.info(" name = %s, shape = %s" % (name, features[name].shape)) 436 | 437 | input_ids = features["input_ids"] 438 | input_mask = features["input_mask"] 439 | segment_ids = features["segment_ids"] 440 | tag_ids = features["tag_ids"] 441 | is_real_example = None 442 | if "is_real_example" in features: 443 | is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32) 444 | else: 445 | is_real_example = tf.ones(tf.shape(tag_ids), dtype=tf.float32) 446 | 447 | is_training = (mode == tf.estimator.ModeKeys.TRAIN) 448 | 449 | total_loss, logits, predictions = 
create_model( 450 | bert_config, is_training, input_ids, input_mask, segment_ids, tag_ids, 451 | num_labels, use_one_hot_embeddings) 452 | 453 | tvars = tf.trainable_variables() 454 | initialized_variable_names = {} 455 | scaffold_fn = None 456 | if init_checkpoint: 457 | (assignment_map, initialized_variable_names 458 | ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint) 459 | if use_tpu: 460 | def tpu_scaffold(): 461 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 462 | return tf.train.Scaffold() 463 | 464 | scaffold_fn = tpu_scaffold 465 | else: 466 | tf.train.init_from_checkpoint(init_checkpoint, assignment_map) 467 | 468 | tf.logging.info("**** Trainable Variables ****") 469 | for var in tvars: 470 | init_string = "" 471 | if var.name in initialized_variable_names: 472 | init_string = ", *INIT_FROM_CKPT*" 473 | tf.logging.info(" name = %s, shape = %s%s", var.name, var.shape, 474 | init_string) 475 | 476 | if mode == tf.estimator.ModeKeys.TRAIN: 477 | # 添加loss的hook,不然在GPU/CPU上不打印loss 478 | logging_hook = tf.train.LoggingTensorHook({"loss": total_loss}, every_n_iter=10) 479 | train_op = optimization.create_optimizer( 480 | total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu) 481 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 482 | mode=mode, 483 | loss=total_loss, 484 | train_op=train_op, 485 | training_hooks=[logging_hook], 486 | scaffold_fn=scaffold_fn) 487 | elif mode == tf.estimator.ModeKeys.EVAL: 488 | def metric_fn(per_example_loss, tag_ids, is_real_example): 489 | # 这里使用的accuracy来计算,宽松匹配方法 490 | accuracy = tf.metrics.accuracy( 491 | labels=tag_ids, predictions=predictions, weights=is_real_example) 492 | return { 493 | "eval_accuracy": accuracy, 494 | } 495 | 496 | eval_metrics = (metric_fn, 497 | [total_loss, tag_ids, is_real_example]) 498 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 499 | mode=mode, 500 | loss=total_loss, 501 | eval_metrics=eval_metrics, 502 | scaffold_fn=scaffold_fn) 503 | else: 504 | output_spec = tf.contrib.tpu.TPUEstimatorSpec( 505 | mode=mode, 506 | predictions={"predictions": predictions}, 507 | scaffold_fn=scaffold_fn) 508 | return output_spec 509 | 510 | return model_fn 511 | 512 | 513 | def main(): 514 | tf.logging.set_verbosity(tf.logging.INFO) 515 | 516 | processors = { 517 | "cluener": NerProcessor, 518 | } 519 | 520 | tokenization.validate_case_matches_checkpoint(FLAGS.do_lower_case, 521 | FLAGS.init_checkpoint) 522 | 523 | # if not FLAGS.do_train and not FLAGS.do_eval and not FLAGS.do_predict: 524 | # raise ValueError( 525 | # "At least one of `do_train`, `do_eval` or `do_predict' must be True.") 526 | 527 | bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file) 528 | 529 | if FLAGS.max_seq_length > bert_config.max_position_embeddings: 530 | raise ValueError( 531 | "Cannot use sequence length %d because the BERT model " 532 | "was only trained up to sequence length %d" % 533 | (FLAGS.max_seq_length, bert_config.max_position_embeddings)) 534 | 535 | tf.gfile.MakeDirs(FLAGS.output_dir) 536 | 537 | task_name = FLAGS.task_name.lower() 538 | 539 | if task_name not in processors: 540 | raise ValueError("Task not found: %s" % (task_name)) 541 | 542 | processor = processors[task_name]() 543 | 544 | tag_list = processor.get_tags() 545 | 546 | tokenizer = tokenization.FullTokenizer( 547 | vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case) 548 | 549 | tpu_cluster_resolver = None 550 | if FLAGS.use_tpu and FLAGS.tpu_name: 551 | tpu_cluster_resolver = 
tf.contrib.cluster_resolver.TPUClusterResolver( 552 | FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project) 553 | 554 | is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2 555 | run_config = tf.contrib.tpu.RunConfig( 556 | cluster=tpu_cluster_resolver, 557 | master=FLAGS.master, 558 | model_dir=FLAGS.output_dir, 559 | save_checkpoints_steps=FLAGS.save_checkpoints_steps, 560 | tpu_config=tf.contrib.tpu.TPUConfig( 561 | iterations_per_loop=FLAGS.iterations_per_loop, 562 | num_shards=FLAGS.num_tpu_cores, 563 | per_host_input_for_training=is_per_host)) 564 | 565 | train_examples = None 566 | num_train_steps = None 567 | num_warmup_steps = None 568 | if FLAGS.do_train: 569 | train_examples = processor.get_train_examples(FLAGS.data_dir) 570 | num_train_steps = int( 571 | len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs) 572 | num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion) 573 | # num_labels=2 * len(tag_list) + 1 BI两种外加一个O 574 | model_fn = model_fn_builder( 575 | bert_config=bert_config, 576 | num_labels=2*len(tag_list) + 1, 577 | init_checkpoint=FLAGS.init_checkpoint, 578 | learning_rate=FLAGS.learning_rate, 579 | num_train_steps=num_train_steps, 580 | num_warmup_steps=num_warmup_steps, 581 | use_tpu=FLAGS.use_tpu, 582 | use_one_hot_embeddings=FLAGS.use_tpu) 583 | 584 | estimator = tf.contrib.tpu.TPUEstimator( 585 | use_tpu=FLAGS.use_tpu, 586 | model_fn=model_fn, 587 | config=run_config, 588 | train_batch_size=FLAGS.train_batch_size, 589 | eval_batch_size=FLAGS.eval_batch_size, 590 | predict_batch_size=FLAGS.predict_batch_size) 591 | 592 | if FLAGS.do_train: 593 | train_file = os.path.join(FLAGS.data_dir, "train.tf_record") 594 | file_based_convert_examples_to_features( 595 | train_examples, tag_list, FLAGS.max_seq_length, tokenizer, train_file) 596 | tf.logging.info("***** Running training *****") 597 | tf.logging.info(" Num examples = %d", len(train_examples)) 598 | tf.logging.info(" Batch size = %d", FLAGS.train_batch_size) 599 | tf.logging.info(" Num steps = %d", num_train_steps) 600 | train_input_fn = file_based_input_fn_builder( 601 | input_file=train_file, 602 | seq_length=FLAGS.max_seq_length, 603 | is_training=True, 604 | drop_remainder=True) 605 | estimator.train(input_fn=train_input_fn, max_steps=num_train_steps) 606 | 607 | if FLAGS.do_eval: 608 | eval_examples = processor.get_dev_examples(FLAGS.data_dir) 609 | num_actual_eval_examples = len(eval_examples) 610 | if FLAGS.use_tpu: 611 | while len(eval_examples) % FLAGS.eval_batch_size != 0: 612 | eval_examples.append(PaddingInputExample()) 613 | 614 | eval_file = os.path.join(FLAGS.data_dir, "eval.tf_record") 615 | file_based_convert_examples_to_features( 616 | eval_examples, tag_list, FLAGS.max_seq_length, tokenizer, eval_file) 617 | 618 | tf.logging.info("***** Running evaluation *****") 619 | tf.logging.info(" Num examples = %d (%d actual, %d padding)", 620 | len(eval_examples), num_actual_eval_examples, 621 | len(eval_examples) - num_actual_eval_examples) 622 | tf.logging.info(" Batch size = %d", FLAGS.eval_batch_size) 623 | 624 | # This tells the estimator to run through the entire set. 
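        # eval_steps=None 时 Estimator 会完整遍历验证集;TPU 上需要固定步数,下面按样本数计算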
625 | eval_steps = None 626 | if FLAGS.use_tpu: 627 | assert len(eval_examples) % FLAGS.eval_batch_size == 0 628 | eval_steps = int(len(eval_examples) // FLAGS.eval_batch_size) 629 | 630 | eval_drop_remainder = True if FLAGS.use_tpu else False 631 | eval_input_fn = file_based_input_fn_builder( 632 | input_file=eval_file, 633 | seq_length=FLAGS.max_seq_length, 634 | is_training=False, 635 | drop_remainder=eval_drop_remainder) 636 | 637 | result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps) 638 | 639 | output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt") 640 | with tf.gfile.GFile(output_eval_file, "w") as writer: 641 | tf.logging.info("***** Eval results *****") 642 | for key in sorted(result.keys()): 643 | tf.logging.info(" %s = %s", key, str(result[key])) 644 | writer.write("%s = %s\n" % (key, str(result[key]))) 645 | 646 | if FLAGS.do_predict: 647 | # label dict的设置 648 | tag_ids = {0: 'O', 1: 'B-address', 2: 'I-address', 3: 'B-book', 4: 'I-book', 649 | 5: 'B-company', 6: 'I-company', 7: 'B-game', 8: 'I-game', 650 | 9: 'B-government', 10: 'I-government', 11: 'B-movie', 12: 'I-movie', 651 | 13: 'B-name', 14: 'I-name', 15: 'B-organization', 16: 'I-organization', 652 | 17: 'B-position', 18: 'I-position', 19: 'B-scene', 20: 'I-scene'} 653 | 654 | predict_examples = processor.get_test_examples(FLAGS.data_dir) 655 | num_actual_predict_examples = len(predict_examples) 656 | test_file = os.path.join(FLAGS.data_dir, "test.tf_record") 657 | file_based_convert_examples_to_features(predict_examples, tag_list, 658 | FLAGS.max_seq_length, tokenizer, 659 | test_file) 660 | 661 | tf.logging.info("***** Running prediction*****") 662 | tf.logging.info(" Num examples = %d (%d actual, %d padding)", 663 | len(predict_examples), num_actual_predict_examples, 664 | len(predict_examples) - num_actual_predict_examples) 665 | tf.logging.info(" Batch size = %d", FLAGS.predict_batch_size) 666 | 667 | predict_drop_remainder = True if FLAGS.use_tpu else False 668 | predict_input_fn = file_based_input_fn_builder( 669 | input_file=test_file, 670 | seq_length=FLAGS.max_seq_length, 671 | is_training=False, 672 | drop_remainder=predict_drop_remainder) 673 | 674 | results = estimator.predict(input_fn=predict_input_fn) 675 | 676 | output_file = os.path.join(FLAGS.data_dir, 'clue_predict.json') 677 | with open(output_file, 'w', encoding='utf-8') as fr: 678 | for example, result in zip(predict_examples, results): 679 | pre_id = result['predictions'] 680 | # print(f'text is {example.text_a}') 681 | # print(f'preid is {pre_id}') 682 | text = example.text_a 683 | # 只获取text中的长度的tag输出 684 | tags = [tag_ids[tag] for tag in pre_id][1:len(text) + 1] 685 | res_words, res_pos = get_result(text, tags) 686 | rs = {} 687 | for w, t in zip(res_words, res_pos): 688 | rs[t] = rs.get(t, []) + [w] 689 | pres = {} 690 | for t, ws in rs.items(): 691 | temp = {} 692 | for w in ws: 693 | word = text[w[0]: w[1] + 1] 694 | temp[word] = temp.get(word, []) + [w] 695 | pres[t] = temp 696 | output_line = json.dumps({'id': example.guid, 'label': pres}, ensure_ascii=False) + '\n' 697 | fr.write(output_line) 698 | 699 | 700 | def get_result(text, tags): 701 | """ 改写成clue要提交的格式 """ 702 | result_words = [] 703 | result_pos = [] 704 | temp_word = [] 705 | temp_pos = '' 706 | for i in range(min(len(text), len(tags))): 707 | if tags[i].startswith('O'): 708 | if len(temp_word) > 0: 709 | result_words.append([min(temp_word), max(temp_word)]) 710 | result_pos.append(temp_pos) 711 | temp_word = [] 712 | temp_pos = '' 713 | elif 
tags[i].startswith('B-'): 714 | if len(temp_word) > 0: 715 | result_words.append([min(temp_word), max(temp_word)]) 716 | result_pos.append(temp_pos) 717 | temp_word = [i] 718 | temp_pos = tags[i].split('-')[1] 719 | elif tags[i].startswith('I-'): 720 | if len(temp_word) > 0: 721 | temp_word.append(i) 722 | if temp_pos == '': 723 | temp_pos = tags[i].split('-')[1] 724 | else: 725 | if len(temp_word) > 0: 726 | temp_word.append(i) 727 | if temp_pos == '': 728 | temp_pos = tags[i].split('-')[1] 729 | result_words.append([min(temp_word), max(temp_word)]) 730 | result_pos.append(temp_pos) 731 | temp_word = [] 732 | temp_pos = '' 733 | return result_words, result_pos 734 | 735 | 736 | if __name__ == "__main__": 737 | main() 738 | -------------------------------------------------------------------------------- /ckbqa/DUTIR中文开放域知识问答评测报告.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xudongMk/AwesomeNLPBaseline/83774119d8b3a68b062df25f0e8775970fab8079/ckbqa/DUTIR中文开放域知识问答评测报告.pdf -------------------------------------------------------------------------------- /ckbqa/README.md: -------------------------------------------------------------------------------- 1 | ## KBQA简介 2 | 3 | 基于知识库的问答(Knowledge Based Question Answering,KBQA)是自然语言处理(NLP)领域的热门研究方向。知识库(知识图谱, Knowledge Based/Knowledge Graph)是知识的结构化表示,一般是由一组SPO三元组(主语Subject,谓语Predicate,宾语Object)形式构成(也称实体,关系,属性三元组),表示实体和实体间存在的语义关系。例如,中国的首都是北京,可以表示为:[中国,首都,北京]。 4 | 5 | 基于知识库的问答主要步骤是接收一个自然语言问句,识别出句子中的实体,理解问句的语义关系,构建有关实体和关系的查询语句,进而从知识库中检索出答案。 6 | 7 | 目前基于知识库的问答主要方法有: 8 | 9 | - 基于语义解析/规则的方法 10 | - 基于信息检索/信息抽取的方法 11 | 12 | 这里有一篇2019年KGQA的综述:Introduction to Neural Network Based Approaches for Question Answering over Knowledge Graphs。这篇文章将KGQA/KBQA当作语义解析的任务来对待,然后介绍了几种语义解析方法,如Classification、Ranking、Translation等。这里不做介绍,感兴趣的可以去翻原文。 13 | 14 | 基于中文知识库问答(**Chinese Knowledge Based Question Answering,CKBQA**)相比英文的KBQA,中文知识库包含关系多,数据集难以覆盖所有关系,另外中文语言的特点,有居多的挑战。 15 | 16 | **基于语义解析/规则的方法:** 17 | 18 | 该类方法使用字典、规则和机器学习,直接从问题中解析出实体、关系和逻辑组合。这里介绍两篇论文,一篇是 The APVA-TURBO Approach to Question Answering in Knowledge Base,文章使用序列标注模型解析问题中的实体,利用端到端模型解析问题中的关系序列。另一篇 A State-transition Framework to Answer Complex 19 | Questions over Knowledge Base,文章中提出了一种状态转移框架并结合卷积神经网络等方法。(上述方法均基于英文数据集) 20 | 21 | 基于语义解析/规则的方法一般步骤: 22 | 23 | - 实体识别:使用领域词表,相似度等(也可以使用深度学习模型,如BiLstm+CRF,BERT等) 24 | - 属性关系识别:词表规则,或使用分类模型 25 | - 答案查询:基于前两个步骤,更加规则模板转换SPARQL等查询语言进行查询 26 | 27 | 基于语义解析/规则的方法比较简单,当前Github上很多KBQA的项目都是基于这种模式。 28 | 29 | 这里推荐几个基于语义解析/规则的 KBQA项目: 30 | 31 | - 豆瓣的电影知识图谱问答:https://github.com/weizhixiaoyi/DouBan-KGQA 32 | - 基于NLPCC数据的KBQA:https://zhuanlan.zhihu.com/p/62946533 33 | 34 | **基于信息检索/信息抽取的方法:** 35 | 36 | 该类方法首先根据问题得到若干个候选实体,根据预定义的逻辑形式,从知识库中抽取与候选实体相连的关系作为候选查询路径,再使用文本匹配模型,选择出与问题相似度最高的候选查询路径,到知识库中检索答案。这里介绍一种增强路径匹配的方法: Improved neural relation detection for knowledge base question answering。 37 | 38 | 当前CKBQA任务上,大多采用的是基于信息检索/信息抽取的方法,一般的步骤: 39 | 40 | - 实体与关系识别 41 | - 路径匹配 42 | - 答案检索 43 | 44 | 在CCKS的KBQA比赛中这种方法非常常见,CCKS官网网站上有每一年的评测论文,下面推荐几个最新的: 45 | 46 | - 2019年CCKS的KBQA任务第四名方案:DUTIR中文开放域知识问答评测报告 47 | - 2020年CCKS的KBQA任务第一名方案:基于特征融合的中文知识库问答方法 48 | 49 | 具体内容可见官网的评测论文,这里附件上传,见ckbqa目录下两个pdf文件。 50 | 51 | ## 中英文数据集 52 | 53 | 英文数据集: 54 | 55 | - FREE917:第一个大规模的KBQA数据集,于2013年提出,包含917 个问题,同时提供相应逻辑查询,覆盖600多种freebase上的关系。 56 | - Webquestions:数据集中有6642个问题答案对,数据集规模虽然较FREE917提高了不少,但有两个突出的缺陷:没有提供对应的查询,不利于基于逻辑表达式模型的训练;另外webquestions中简单问句多而复杂问句少。 57 | - WebQSP:是WEBQUESTIONS的子集,问题都是需要多跳才能回答,属于multi-relation KBQA dataset,另外补全了对应的查询句。 58 | - 
Complexquestion、GRAPHQUESTIONS:在问句的结构和表达多样性等方面进一步增强了WEBQUESTIONSP,,包括类型约束,显\隐式的时间约束,聚合操作。 59 | - SimpleQuestions:数据规模较大,共100K,数据形式为(quesition,knowledge base fact),均为简单问题,只需KB中的一个三元组即可回答,即single-relation dataset。 60 | 61 | 英文数据集较多,这里只列举几个常见的。详细的数据集可见北航的[KBQA调研](https://github.com/BDBC-KG-NLP/QA-Survey/blob/master/KBQA%E8%B0%83%E7%A0%94-%E5%AD%A6%E6%9C%AF%E7%95%8C.md#13-%E6%95%B0%E6%8D%AE%E9%9B%86) 62 | 63 | 中文数据集: 64 | 65 | - NLPCC开放领域知识图谱问答的数据集:简单问题(单跳问题),14609条训练数据,9870条验证和测试数据,数据集下载。 66 | - CCKS开放领域知识图谱问答的数据集:包含简单问题和复杂问答,2298条训练数据,766的验证和测试数据,数据集下载。 67 | 68 | 除了上述两个中文数据集(提取码均是),CLUE上还提供了一些问答的数据集,可以见[CLUE的数据集搜索](https://www.cluebenchmarks.com/dataSet_search_modify.html?keywords=QA)。 69 | 70 | ## KBQA的实现 71 | 72 | 下面基于CCKS的数据集来实现2019年第四名方案和2020年第一名方案。 73 | 74 | CCKS的数据集,百度网盘下载地址:链接:https://pan.baidu.com/s/1NI9VrhuvOgyTFk1tGjlZIw 提取码:l7pm 75 | 76 | todo list(等有空实现了就补上): 77 | 78 | - 使用tensorflow实现2019年第四名方案 79 | - 使用tensorflow实现2020年第一名方案 80 | 81 | 附上2019年第四名方案的开源地址 https://github.com/atom32/ccks2019-ckbqa-4th-codes 82 | 流程还算完整,但想端到端完整运行有点困难,而且很多数据的处理过程都耦合在模型中。需要花一定的时间去整理。 83 | 84 | 2020年第一名方案代码暂未开源。 85 | 86 | 87 | 88 | ## 扩展 89 | 90 | - 美团大脑:知识图谱的建模方法及其应用:https://tech.meituan.com/2018/11/01/meituan-ai-nlp.html 91 | - 百度大脑UNIT3.0详解之知识图谱与对话:https://baijiahao.baidu.com/s?id=1643915882369765998&wfr=spider&for=pc 92 | - 更新ing 93 | -------------------------------------------------------------------------------- /ckbqa/基于特征融合的中文知识库问答方法.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xudongMk/AwesomeNLPBaseline/83774119d8b3a68b062df25f0e8775970fab8079/ckbqa/基于特征融合的中文知识库问答方法.pdf -------------------------------------------------------------------------------- /named_entity_recognition/README.md: -------------------------------------------------------------------------------- 1 | ## 命名实体识别(Named Entity Recognition) 2 | 3 | 这里首先介绍一篇基于深度学习的命名实体识别综述,《A Survey on Deep Learning for Named Entity Recognition》,论文来源:https://arxiv.org/abs/1812.09449(2020年3月份发表在TKDE) 4 | 5 | **1.命名实体识别简介** 6 | 7 | 命名实体识别(Named Entity Recognition,NER)旨在给定的文本中识别出属于预定义的类别片段(如人名、位置、组织等)。NER一直是很多自然语言应用的基础,如机器问答、文本摘要和机器翻译。 8 | 9 | NER任务最早是由第六届语义理解会议(Sixth Message Understanding Conference)提出,但当时仅定义一些通用的实体类型,如组织、人名和地点。 10 | 11 | **2.命名实体识别常用方法** 12 | 13 | - 基于规则的方法(Rule-based Approaches):不需要标注数据,依赖人工规则,特定领域需要专家知识 14 | - 无监督学习方法(Unsupervised Learning Approaches):不需要标注数据,依赖于无监督学习方法,如聚类算法 15 | - 基于特征的有监督学习方法(Feature-based Supervised Learning Approaches):将NER当作一个多分类问题或序列标签分类任务,依赖于特征工程 16 | - 基于深度学习的方法(DL-based Approaches):后面详细介绍 17 | 18 | 论文简单介绍了前三种方法,这里也不在赘述,感兴趣的可以看论文。 19 | 20 | **3.基于深度学习的方法** 21 | 22 | 文章中将NER任务拆解成三个结构: 23 | 24 | - 输入的分布式表示(Distributed Representations for Input) 25 | - 上下文编码(Context Encoder Architectures) 26 | - 标签解码(Tag Decoder Architectures) 27 | 28 | 这里不在展开描述具体的内容(有兴趣的可以去翻论文),下表总结了基于神经网络的NER模型的工作,并展示了每个NER模型在各类数据集上的表现。 29 | 30 | ![image](https://github.com/xudongMk/AwesomeNLPBaseline/blob/main/named_entity_recognition/pics/%E5%91%BD%E5%90%8D%E5%AE%9E%E4%BD%93%E8%AF%86%E5%88%AB%E7%9A%84%E6%A8%A1%E5%9E%8B%E6%80%BB%E7%BB%93%E5%9B%BE.png) 31 | 32 | 总结:BiLstm+CRF是使用深度学习的NER最常见的体系结构,以Cloze风格使用预训练双向Transformer在CoNLL03数据集上达到了SOTA效果(93.5%),另外Bert+Dice Loss在OntoNotes5.0数据集上达到了SOTA效果(92.07%)。 33 | 34 | **4.评测指标** 35 | 36 | 文中将NER的评测指标Precision、Recall和F1分成了两类。 37 | 38 | - Exact match:严格匹配方法,需要识别的边界和类别都正确 39 | - Relaxed match:宽松匹配方法,实体位置区间重叠、位置正确类别错误等都视为正确 40 | 41 | 42 | 43 | ## 命名实体识别数据集 44 | 45 | 命名实体识别数据集一般是BIO或者BIOES模式标注。 46 | 47 | - 
BIO模式:具体指B-begin、I-inside、O-outside 48 | - BIOES模式:具体指B-begin、I-inside、O-outside、E-end、S-single 49 | 50 | 首先是综述中提到的几个数据集,见下表,具体的就不介绍了。 51 | 52 | ![image](https://github.com/xudongMk/AwesomeNLPBaseline/blob/main/named_entity_recognition/pics/%E5%91%BD%E5%90%8D%E5%AE%9E%E4%BD%93%E8%AF%86%E5%88%AB%E6%95%B0%E6%8D%AE%E5%9B%BE.png) 53 | 54 | 55 | 56 | 下面介绍一个中文的命名实体识别数据集,**CLUENER 细粒度命名实体识别**,地址:https://github.com/CLUEbenchmark/CLUENER2020 57 | 58 | - 数据类别:10个,地址、书名、公司、游戏、政府、电影、姓名、组织、职位和景点 59 | - 数据分布:训练集10748,测试集1343,具体类别分布见原文 60 | - 数据来源:在THUCTC文本分类数据集基础上,选出部分数据进行细粒度实体标注 61 | 62 | 63 | 64 | ## 命名实体识别Baseline算法实现 65 | 66 | 使用Tensorflow1.x版本Estimator高阶api实现常见的命名实体识别算法,主要包括BiLstm+CRF、Bert、Bert+CRF。 67 | 68 | (当前只在本目录下实现了BiLstm+CRF,至于BERT的在bert_downstream目录下暂未实现) 69 | 70 | 环境信息: 71 | 72 | tensorflow==1.13.1 73 | 74 | python==3.7 75 | 76 | **数据预处理** 77 | 78 | 要求训练集和测试集分开存储,要求数据集格式为BIO形式。 79 | 80 | 在训练模型前,需要先运行preprocess.py文件进行数据预处理,将数据处理成id形式并保存为pkl形式,另外中间过程产生的词表也会保存为vocab.txt文件。 81 | 82 | **文件结构** 83 | 84 | - data_path:数据集存放的位置 85 | - data_utils:数据处理相关的工具类存放位置 86 | - model_ckpt:chekpoint模型保存的位置 87 | - model_pb:pb形式的模型保存为位置 88 | - models:ner基本的算法存放位置,如BiLstm等 89 | - preprocess.py:数据预处理代码 90 | - ner_main.py:训练主入口 91 | 92 | **模型训练** 93 | 94 | - 首先准备好数据集,放在data_path下,然后运行preprocess.py文件 95 | - 运行ner_main.py,具体的模型参数可以在ARGS里面设置,也可以使用python ner_main.py --train_path='./data_path/clue_data.pkl'的形式 96 | 97 | **模型推理** 98 | 99 | - 推理代码在inference.py中 100 | 101 | 102 | 103 | ## 示例 104 | 105 | 下面使用中文任务测评基准(CLUE benchmark)的CLUENER数据进行demo示例演示: 106 | 107 | 数据集下载地址[[CLUENER细粒度命名实体识别](https://github.com/CLUEbenchmark/CLUENER2020)],该数据由CLUEBenchMark整理,数据分为10个标签类别分别为: 地址(address),书名(book),公司(company),游戏(game),政府(government),电影(movie),姓名(name),组织机构(organization),职位(position),景点(scene) 108 | 109 | 数据集分布: 110 | 111 | ``` 112 | 训练集:10748 113 | 验证集集:1343 114 | 115 | 按照不同标签类别统计,训练集数据分布如下(注:一条数据中出现的所有实体都进行标注,如果一条数据出现两个地址(address)实体,那么统计地址(address)类别数据的时候,算两条数据): 116 | 【训练集】标签数据分布如下: 117 | 地址(address):2829 118 | 书名(book):1131 119 | 公司(company):2897 120 | 游戏(game):2325 121 | 政府(government):1797 122 | 电影(movie):1109 123 | 姓名(name):3661 124 | 组织机构(organization):3075 125 | 职位(position):3052 126 | 景点(scene):1462 127 | 128 | 【验证集】标签数据分布如下: 129 | 地址(address):364 130 | 书名(book):152 131 | 公司(company):366 132 | 游戏(game):287 133 | 政府(government):244 134 | 电影(movie):150 135 | 姓名(name):451 136 | 组织机构(organization):344 137 | 职位(position):425 138 | 景点(scene):199 139 | ``` 140 | 141 | **1.数据EDA:** 142 | 143 | 省略,需要的可以自己分析一下数据集的分布情况 144 | 145 | **2.数据预处理:** 146 | 147 | 转换BIO形式,具体conver_bio.py,将CLUE提供的数据集转换为BIO标注形式;运行preprocess.py将数据集转换为id形式并保存为pkl形式。 148 | 149 | **3.模型训练:** 150 | 151 | 代码见ner_main.py,参数设置的时候有几个参数需要根据自己的数据分布来设置: 152 | 153 | - vocab_size:这里的大小,一般需要根据自己生成的vocab.txt中词表的大小来设置 154 | - num_tags:类别标签的数量,算上O,这里是21类 155 | - train_path/eval_path:数据集的路径 156 | 157 | 其他的参数视个人情况而定 158 | 159 | **4.开始预测并提交结果** 160 | 161 | 预测代码见inference.py 162 | 163 | #todo next 只完成了一部分,写入文件的部分暂时未完成。因为其提交的文件格式有点难受....太细化了... 
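
下面给出一个把 BIO 预测结果转换成 CLUE 提交格式的最小示意(假设:`bio_to_clue_label` 等函数名仅为演示,项目中并无该接口;输入为与文本逐字对齐的 BIO 标签序列),思路与 inference.py 中的 get_result 一致,仅供参考:

```
import json


def bio_to_clue_label(text, tags):
    """把单条文本的 BIO 标签转换成 CLUE 要求的 label 结构:
    {类别: {实体文本: [[start, end], ...]}},其中 end 为闭区间下标"""
    label = {}
    start, ent_type = None, None
    for i, tag in enumerate(tags[:len(text)]):
        if tag.startswith('B-'):
            if start is not None:  # 先收尾上一个实体
                _add_entity(label, text, ent_type, start, i - 1)
            start, ent_type = i, tag[2:]
        elif tag.startswith('I-') and start is not None:
            continue  # 实体内部,继续向后扫
        else:  # 'O' 或孤立的 I-,结束当前实体
            if start is not None:
                _add_entity(label, text, ent_type, start, i - 1)
            start, ent_type = None, None
    if start is not None:  # 句尾实体
        _add_entity(label, text, ent_type, start, min(len(text), len(tags)) - 1)
    return label


def _add_entity(label, text, ent_type, start, end):
    word = text[start:end + 1]
    label.setdefault(ent_type, {}).setdefault(word, []).append([start, end])


if __name__ == '__main__':
    text = '我要去故宫'
    tags = ['O', 'O', 'O', 'B-scene', 'I-scene']
    line = json.dumps({'id': 0, 'label': bio_to_clue_label(text, tags)},
                      ensure_ascii=False)
    print(line)  # {"id": 0, "label": {"scene": {"故宫": [[3, 4]]}}}
```

把每条预测按上述结构 json.dumps 后逐行写入文件,即可得到与 train_ner.py 中 do_predict 相同格式的提交文件(每行一个 json)。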
164 | 165 | 166 | 167 | ## NER的比赛 168 | 169 | 1.天池的比赛 https://tianchi.aliyun.com/competition/entrance/531824/introduction 170 | 171 | 2.CLUE的评测 https://www.cluebenchmarks.com/introduce.html 172 | 173 | 174 | 175 | ## 扩展 176 | 177 | - 美团搜索中NER技术的探索和实践:https://tech.meituan.com/2020/07/23/ner-in-meituan-nlp.html 178 | 179 | 180 | 181 | 182 | -------------------------------------------------------------------------------- /named_entity_recognition/convert_bio.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020/12/15 21:46 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : convert_bio.py 6 | # @Software: PyCharm 7 | import json 8 | 9 | """ 10 | 将数据转换成bio形式 11 | """ 12 | 13 | 14 | def read_data(file_path): 15 | """ 16 | 读取数据集 17 | :param file_path: 18 | :return: 19 | """ 20 | with open(file_path, encoding='utf-8') as fr: 21 | lines = fr.readlines() 22 | print(f'the data size is {len(lines)}') 23 | return lines 24 | 25 | 26 | def convert_bio_data(file_path, out): 27 | """ 28 | 转换成bio形式 29 | example: 30 | 我要去故宫 O O O B-location I-location 31 | :param file_path: 32 | :return: 33 | """ 34 | lines = read_data(file_path) 35 | bio_data = [] 36 | for line in lines: 37 | data = json.loads(line) 38 | text = data['text'] 39 | labels = data['label'] 40 | # 遍历处理label 41 | bios = ['O'] * len(text) 42 | for label in labels: 43 | entitys = labels[label] 44 | for entity in entitys: 45 | indexs = entitys[entity] 46 | for index in indexs: 47 | start = index[0] 48 | end = index[1] 49 | for i in range(start, end + 1): 50 | if i == start: 51 | bios[i] = f'B-{label}' 52 | else: 53 | bios[i] = f'I-{label}' 54 | bio_data.append(text + '\t' + ' '.join(bios)) 55 | # write to file 56 | with open(out, 'w', encoding='utf-8') as fr: 57 | for data in bio_data: 58 | fr.write(data + '\n') 59 | print(f'convert bio data over!') 60 | 61 | 62 | if __name__ == '__main__': 63 | convert_bio_data('./data_path/train.json', './data_path/train.txt') 64 | convert_bio_data('./data_path/dev.json', './data_path/dev.txt') 65 | -------------------------------------------------------------------------------- /named_entity_recognition/data_path/README.md: -------------------------------------------------------------------------------- 1 | ### 文件说明 2 | 3 | 这里存放的是数据集 -------------------------------------------------------------------------------- /named_entity_recognition/data_utils/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020/12/9 20:34 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : __init__.py.py 6 | # @Software: PyCharm 7 | -------------------------------------------------------------------------------- /named_entity_recognition/data_utils/datasets.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020/12/9 20:34 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : datasets.py 6 | # @Software: PyCharm 7 | 8 | import numpy as np 9 | import tensorflow as tf 10 | 11 | """ 12 | 数据集构建类 13 | 将数据转换成模型所需要的dataset输入 14 | """ 15 | 16 | 17 | class DataBuilder: 18 | def __init__(self, data): 19 | self.words = np.asarray(data['words']) 20 | self.tags = np.asarray(data['tags']) 21 | 22 | @property 23 | def size(self): 24 | return len(self.words) 25 | 26 | def build_generator(self): 27 | """ 28 | build data generator for model 29 | :return: 30 | """ 
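        # 逐条产出 ((word_ids, 句子长度), tag_ids),结构与 build_dataset 中声明的 shapes 一致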
31 | for word, tag in zip(self.words, self.tags): 32 | yield (word, len(word)), tag 33 | 34 | def build_dataset(self): 35 | """ 36 | build dataset from generator 37 | :return: 38 | """ 39 | dataset = tf.data.Dataset.from_generator( 40 | self.build_generator, 41 | ((tf.int64, tf.int64), tf.int64), 42 | ((tf.TensorShape([None]), tf.TensorShape([])), tf.TensorShape([None])) 43 | ) 44 | return dataset 45 | 46 | def get_train_batch(self, dataset, batch_size, epoch): 47 | """ 48 | get one batch train data 49 | :param dataset: 50 | :param batch_size: 51 | :param epoch: 52 | :return: 53 | """ 54 | dataset = dataset.cache()\ 55 | .shuffle(buffer_size=10000)\ 56 | .padded_batch(batch_size, padded_shapes=(([None], []), [None]))\ 57 | .repeat(epoch) 58 | return dataset.make_one_shot_iterator().get_next() 59 | 60 | def get_test_batch(self, dataset, batch_size): 61 | """ 62 | get one batch test data 63 | :param dataset: 64 | :param batch_size: 65 | :return: 66 | """ 67 | dataset = dataset.padded_batch(batch_size, 68 | padded_shapes=(([None], []), [None])) 69 | return dataset.make_one_shot_iterator().get_next() 70 | -------------------------------------------------------------------------------- /named_entity_recognition/inference.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2021/1/6 22:59 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : inference.py 6 | # @Software: PyCharm 7 | 8 | import tensorflow as tf 9 | import tqdm 10 | import json 11 | import _pickle as cPickle 12 | 13 | """ 14 | 命名实体识别推理代码 15 | """ 16 | 17 | # 加载词典 18 | word_dict = {} 19 | with open('./data_path/clue_vocab.txt', encoding='utf-8') as fr: 20 | lines = fr.readlines() 21 | for line in lines: 22 | word = line.split('\t')[0] 23 | id = line.split('\t')[1] 24 | word_dict[word] = id 25 | print(word_dict) 26 | 27 | # label dict的设置 这个和preprocess中的tag_dict对应 28 | tag_ids = {0: 'O', 1: 'B-address', 2: 'I-address', 3: 'B-book', 4: 'I-book', 29 | 5: 'B-company', 6: 'I-company', 7: 'B-game', 8: 'I-game', 30 | 9: 'B-government', 10: 'I-government', 11: 'B-movie', 12: 'I-movie', 31 | 13: 'B-name', 14: 'I-name', 15: 'B-organization', 16: 'I-organization', 32 | 17: 'B-position', 18: 'I-position', 19: 'B-scene', 20: 'I-scene'} 33 | 34 | 35 | def words_to_ids(words, word_dict): 36 | """ 将words 转换成ids形式 """ 37 | ids = [word_dict.get(word, 1) for word in words] 38 | return ids 39 | 40 | 41 | def predict_main(test_file, out_path): 42 | """ 预测主入口 """ 43 | model_path = './model_pb/1609946529' 44 | with tf.Session(graph=tf.Graph()) as sess: 45 | model = tf.saved_model.loader.load(sess, ['serve'], model_path) 46 | # print(model) 47 | out = sess.graph.get_tensor_by_name('tag_ids:0') 48 | input_id = sess.graph.get_tensor_by_name('input_words:0') 49 | input_len = sess.graph.get_tensor_by_name('input_len:0') 50 | 51 | with open(test_file, encoding='utf-8') as fr: 52 | lines = fr.readlines() 53 | res_list = [] 54 | 55 | cnt = 0 56 | for line in tqdm.tqdm(lines): 57 | json_str = json.loads(line) 58 | id = json_str['id'] 59 | text = json_str['text'] 60 | if len(text) < 1: 61 | print('there are some sample error!') 62 | text_features = words_to_ids(text, word_dict) 63 | text_label = len(text) 64 | feed = {input_id: [text_features], input_len: [text_label]} 65 | score = sess.run(out, feed_dict=feed) 66 | 67 | cnt += 1 68 | tags = [tag_ids[tag] for tag in score[0]] 69 | # print(tags) 70 | res_words, res_pos = get_result(text, tags) 71 | rs = {} 72 | for w, t in 
zip(res_words, res_pos): 73 | rs[t] = rs.get(t, []) + [w] 74 | pres = {} 75 | for t, ws in rs.items(): 76 | temp = {} 77 | for w in ws: 78 | word = text[w[0]: w[1] + 1] 79 | temp[word] = temp.get(word, []) + [w] 80 | pres[t] = temp 81 | output_line = json.dumps({'id': id, 'label': pres}, ensure_ascii=False) 82 | res_list.append(output_line) 83 | # print(output_line) 84 | # write to file 85 | with open(out_path, 'w', encoding='utf-8') as fr: 86 | for res in res_list: 87 | fr.write(res) 88 | fr.write('\n') 89 | 90 | 91 | def get_result(text, tags): 92 | """ 改写成clue要提交的格式 """ 93 | result_words = [] 94 | result_pos = [] 95 | temp_word = [] 96 | temp_pos = '' 97 | for i in range(min(len(text), len(tags))): 98 | if tags[i].startswith('O'): 99 | if len(temp_word) > 0: 100 | result_words.append([min(temp_word), max(temp_word)]) 101 | result_pos.append(temp_pos) 102 | temp_word = [] 103 | temp_pos = '' 104 | elif tags[i].startswith('B-'): 105 | if len(temp_word) > 0: 106 | result_words.append([min(temp_word), max(temp_word)]) 107 | result_pos.append(temp_pos) 108 | temp_word = [i] 109 | temp_pos = tags[i].split('-')[1] 110 | elif tags[i].startswith('I-'): 111 | if len(temp_word) > 0: 112 | temp_word.append(i) 113 | if temp_pos == '': 114 | temp_pos = tags[i].split('-')[1] 115 | else: 116 | if len(temp_word) > 0: 117 | temp_word.append(i) 118 | if temp_pos == '': 119 | temp_pos = tags[i].split('-')[1] 120 | result_words.append([min(temp_word), max(temp_word)]) 121 | result_pos.append(temp_pos) 122 | temp_word = [] 123 | temp_pos = '' 124 | return result_words, result_pos 125 | 126 | 127 | if __name__ == '__main__': 128 | test_file = './data_path/test.json' 129 | out_path = './data_path/clue_predict.json' 130 | predict_main(test_file, out_path) 131 | -------------------------------------------------------------------------------- /named_entity_recognition/model_ckpt/README.md: -------------------------------------------------------------------------------- 1 | ### 文件说明 2 | 3 | 这里保存训练后的checkpoint文件 -------------------------------------------------------------------------------- /named_entity_recognition/model_pb/README.md: -------------------------------------------------------------------------------- 1 | ### 文件说明 2 | 3 | 这里保存训练后的pb模型文件 -------------------------------------------------------------------------------- /named_entity_recognition/models/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020/12/9 20:34 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : __init__.py.py 6 | # @Software: PyCharm 7 | -------------------------------------------------------------------------------- /named_entity_recognition/models/bilstm_crf.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020/12/9 20:34 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : bilstm_crf.py 6 | # @Software: PyCharm 7 | 8 | import tensorflow as tf 9 | from tensorflow.contrib.rnn import LSTMCell 10 | from tensorflow.contrib.rnn import MultiRNNCell 11 | 12 | 13 | class Linear: 14 | """ 15 | 全链接层 16 | """ 17 | def __init__(self, scope_name, input_size, output_size, 18 | drop_out=0., trainable=True): 19 | with tf.variable_scope(scope_name): 20 | w_init = tf.random_uniform_initializer(-0.1, 0.1) 21 | self.W = tf.get_variable('W', [input_size, output_size], 22 | initializer=w_init, 23 | trainable=trainable) 24 | 25 | self.b = 
tf.get_variable('b', [output_size], 26 | initializer=tf.zeros_initializer(), 27 | trainable=trainable) 28 | 29 | self.drop_out = tf.layers.Dropout(drop_out) 30 | 31 | self.output_size = output_size 32 | 33 | def __call__(self, inputs, training): 34 | size = tf.shape(inputs) 35 | input_trans = tf.reshape(inputs, [-1, size[-1]]) 36 | input_trans = tf.nn.xw_plus_b(input_trans, self.W, self.b) 37 | input_trans = self.drop_out(input_trans, training=training) 38 | 39 | input_trans = tf.reshape(input_trans, [-1, size[1], self.output_size]) 40 | 41 | return input_trans 42 | 43 | 44 | class LookupTable: 45 | """ 46 | embedding layer 47 | """ 48 | def __init__(self, scope_name, vocab_size, embed_size, reuse=False, trainable=True): 49 | self.vocab_size = vocab_size 50 | self.embed_size = embed_size 51 | 52 | with tf.variable_scope(scope_name, reuse=bool(reuse)): 53 | self.embedding = tf.get_variable('embedding', [vocab_size, embed_size], 54 | initializer=tf.random_uniform_initializer(-0.25, 0.25), 55 | trainable=trainable) 56 | 57 | def __call__(self, input): 58 | input = tf.where(tf.less(input, self.vocab_size), input, tf.ones_like(input)) 59 | return tf.nn.embedding_lookup(self.embedding, input) 60 | 61 | 62 | class LstmBase: 63 | """ 64 | build rnn cell 65 | """ 66 | def build_rnn(self, hidden_size, num_layes): 67 | cells = [] 68 | for i in range(num_layes): 69 | cell = LSTMCell(num_units=hidden_size, 70 | state_is_tuple=True, 71 | initializer=tf.random_uniform_initializer(-0.25, 0.25)) 72 | cells.append(cell) 73 | cells = MultiRNNCell(cells, state_is_tuple=True) 74 | 75 | return cells 76 | 77 | 78 | class BiLstm(LstmBase): 79 | """ 80 | define the lstm 81 | """ 82 | def __init__(self, scope_name, hidden_size, num_layers): 83 | super(BiLstm, self).__init__() 84 | assert hidden_size % 2 == 0 85 | hidden_size /= 2 86 | 87 | self.fw_rnns = [] 88 | self.bw_rnns = [] 89 | for i in range(num_layers): 90 | self.fw_rnns.append(self.build_rnn(hidden_size, 1)) 91 | self.bw_rnns.append(self.build_rnn(hidden_size, 1)) 92 | 93 | self.scope_name = scope_name 94 | 95 | def __call__(self, input, input_len): 96 | for idx, (fw_rnn, bw_rnn) in enumerate(zip(self.fw_rnns, self.bw_rnns)): 97 | scope_name = '{}_{}'.format(self.scope_name, idx) 98 | ctx, _ = tf.nn.bidirectional_dynamic_rnn( 99 | fw_rnn, bw_rnn, input, sequence_length=input_len, 100 | dtype=tf.float32, time_major=False, 101 | scope=scope_name 102 | ) 103 | input = tf.concat(ctx, -1) 104 | ctx = input 105 | return ctx 106 | 107 | 108 | class BiLstm_Crf: 109 | def __init__(self, args, vocab_size, emb_size): 110 | # embedding 111 | scope_name = 'look_up' 112 | self.lookuptables = LookupTable(scope_name, vocab_size, emb_size) 113 | 114 | # rnn 115 | scope_name = 'bi_lstm' 116 | self.rnn = BiLstm(scope_name, args.hidden_dim, 1) 117 | 118 | # linear 119 | scope_name = 'linear' 120 | self.linear = Linear(scope_name, args.hidden_dim, args.num_tags, 121 | drop_out=args.drop_out) 122 | 123 | # crf 124 | scope_name = 'crf_param' 125 | self.crf_param = tf.get_variable(scope_name, [args.num_tags, args.num_tags], 126 | dtype=tf.float32) 127 | 128 | def __call__(self, inputs, training): 129 | masks = tf.sign(inputs) 130 | sent_len = tf.reduce_sum(masks, axis=1) 131 | 132 | embedding = self.lookuptables(inputs) 133 | 134 | rnn_out = self.rnn(embedding, sent_len) 135 | 136 | logits = self.linear(rnn_out, training) 137 | 138 | pred_ids, _ = tf.contrib.crf.crf_decode(logits, self.crf_param, sent_len) 139 | 140 | return logits, pred_ids, self.crf_param 141 | 142 | 143 | 144 
| 145 | -------------------------------------------------------------------------------- /named_entity_recognition/ner_main.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020-10-09 23:07 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : ner_main.py 6 | # @Software: PyCharm 7 | 8 | import sys 9 | import time 10 | import tensorflow as tf 11 | from data_utils import datasets 12 | 13 | import _pickle as cPickle 14 | 15 | from argparse import ArgumentParser 16 | from models.bilstm_crf import BiLstm_Crf 17 | 18 | parser = ArgumentParser() 19 | 20 | parser.add_argument("--vocab_size", type=int, default=4000, help='vocab size') 21 | parser.add_argument("--emb_size", type=int, default=300, help='emb size') 22 | parser.add_argument("--train_path", type=str, default='./data_path/clue_data.pkl') 23 | parser.add_argument("--test_path", type=str, default='./data_path/clue_data.pkl') 24 | parser.add_argument("--model_ckpt_dir", type=str, default='./model_ckpt/') 25 | parser.add_argument("--model_pb_dir", type=str, default='./model_pb') 26 | parser.add_argument("--hidden_dim", type=int, default=300) 27 | parser.add_argument("--num_tags", type=int, default=21) 28 | parser.add_argument("--drop_out", type=float, default=0.1) 29 | parser.add_argument("--batch_size", type=int, default=16) 30 | parser.add_argument("--epoch", type=int, default=50) 31 | parser.add_argument("--lr", type=float, default=1e-4, 32 | help='the learning rate for optimizer') 33 | 34 | 35 | tf.logging.set_verbosity(tf.logging.INFO) 36 | ARGS, unparsed = parser.parse_known_args() 37 | print(ARGS) 38 | 39 | sys.stdout.flush() 40 | 41 | 42 | def init_data(file_name, type=None): 43 | """ 44 | init data 45 | :param file_name: 46 | :param type: 47 | :return: 48 | """ 49 | data = cPickle.load(open(file_name, 'rb'))[type] 50 | 51 | data_builder = datasets.DataBuilder(data) 52 | dataset = data_builder.build_dataset() 53 | 54 | def train_input(): 55 | return data_builder.get_train_batch(dataset, ARGS.batch_size, ARGS.epoch) 56 | 57 | def test_input(): 58 | return data_builder.get_test_batch(dataset, ARGS.batch_size) 59 | 60 | return train_input if type == 'train' else test_input 61 | 62 | 63 | def model_fn(features, labels, mode, params): 64 | """ 65 | build model fn 66 | :return: 67 | """ 68 | vocab_size = ARGS.vocab_size 69 | emb_size = ARGS.emb_size 70 | model = BiLstm_Crf(ARGS, vocab_size, emb_size) 71 | 72 | if isinstance(features, dict): 73 | features = features['words'], features['words_len'] 74 | 75 | words, words_len = features 76 | 77 | if mode == tf.estimator.ModeKeys.PREDICT: 78 | _, pred_ids, _ = model(words, training=False) 79 | 80 | prediction = {'tag_ids': tf.identity(pred_ids, name='tag_ids')} 81 | 82 | return tf.estimator.EstimatorSpec( 83 | mode=mode, 84 | predictions=prediction, 85 | export_outputs={'classify': tf.estimator.export.PredictOutput(prediction)} 86 | ) 87 | else: 88 | tags = labels 89 | weights = tf.sequence_mask(words_len) 90 | if mode == tf.estimator.ModeKeys.TRAIN: 91 | logits, pred_ids, crf_params = model(words, training=True) 92 | 93 | log_like_lihood, _ = tf.contrib.crf.crf_log_likelihood( 94 | logits, tags, words_len, crf_params 95 | ) 96 | loss = -tf.reduce_mean(log_like_lihood) 97 | accuracy = tf.metrics.accuracy(tags, pred_ids, weights) 98 | 99 | tf.identity(accuracy[1], name='train_accuracy') 100 | tf.summary.scalar('train_accuracy', accuracy[1]) 101 | optimizer = 
tf.train.AdamOptimizer(learning_rate=1e-4) 102 | return tf.estimator.EstimatorSpec( 103 | mode=mode, 104 | loss=loss, 105 | train_op=optimizer.minimize(loss, tf.train.get_or_create_global_step()) 106 | ) 107 | else: 108 | _, pred_ids, _ = model(words, training=False) 109 | accuracy = tf.metrics.accuracy(tags, pred_ids, weights) 110 | metrics = { 111 | 'accuracy': accuracy 112 | } 113 | return tf.estimator.EstimatorSpec( 114 | mode=mode, 115 | loss=tf.constant(0), 116 | eval_metric_ops=metrics 117 | ) 118 | 119 | 120 | def main_es(unparsed): 121 | """ 122 | main method 123 | :param unparsed: 124 | :return: 125 | """ 126 | cur_time = time.time() 127 | model_dir = ARGS.model_ckpt_dir + str(int(cur_time)) 128 | 129 | classifer = tf.estimator.Estimator( 130 | model_fn=model_fn, 131 | model_dir=model_dir, 132 | params={} 133 | ) 134 | 135 | # train 136 | train_input = init_data(ARGS.train_path, 'train') 137 | tensors_to_log = {'train_accuracy': 'train_accuracy'} 138 | logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=100) 139 | classifer.train(input_fn=train_input, hooks=[logging_hook]) 140 | 141 | # eval 142 | test_input = init_data(ARGS.test_path, 'test') 143 | eval_res = classifer.evaluate(input_fn=test_input) 144 | print(f'Evaluation res is : \n\t{eval_res}') 145 | 146 | if ARGS.model_pb_dir: 147 | words = tf.placeholder(tf.int64, [None, None], name='input_words') 148 | words_len = tf.placeholder(tf.int64, [None], name='input_len') 149 | input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({ 150 | 'words': words, 151 | 'words_len': words_len 152 | }) 153 | classifer.export_savedmodel(ARGS.model_pb_dir, input_fn) 154 | 155 | 156 | if __name__ == '__main__': 157 | tf.app.run(main=main_es, argv=[sys.argv[0]]) -------------------------------------------------------------------------------- /named_entity_recognition/pics/命名实体识别数据图.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xudongMk/AwesomeNLPBaseline/83774119d8b3a68b062df25f0e8775970fab8079/named_entity_recognition/pics/命名实体识别数据图.png -------------------------------------------------------------------------------- /named_entity_recognition/pics/命名实体识别的模型总结图.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xudongMk/AwesomeNLPBaseline/83774119d8b3a68b062df25f0e8775970fab8079/named_entity_recognition/pics/命名实体识别的模型总结图.png -------------------------------------------------------------------------------- /named_entity_recognition/preprocess.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020-10-11 18:52 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : preprocess.py 6 | # @Software: PyCharm 7 | 8 | import os 9 | import _pickle as cPickle 10 | import pandas as pd 11 | import random 12 | 13 | """ 14 | 数据预处理 15 | 将数据处理成id,并封装成pkl形式 16 | """ 17 | 18 | 19 | # clue2020细粒度命名实体识别的类别 20 | tag_list = ['address', 'book', 'company', 'game', 'government', 21 | 'movie', 'name', 'organization', 'position', 'scene'] 22 | tag_dict = {'O': 0} 23 | 24 | for tag in tag_list: 25 | tag_B = 'B-' + tag 26 | tag_I = 'I-' + tag 27 | tag_dict[tag_B] = len(tag_dict) 28 | tag_dict[tag_I] = len(tag_dict) 29 | 30 | print(tag_dict) 31 | 32 | 33 | def make_vocab(file_path): 34 | """ 35 | 构建词典 36 | :param file_path: 37 | :return: 38 | """ 39 | data = pd.read_csv(file_path, sep='\t', header=None) 40 
| data.columns = ['text', 'tag'] 41 | vocab = {'PAD': 0, 'UNK': 1} 42 | words_list = [] 43 | for index, row in data.iterrows(): 44 | words = row['text'] 45 | for word in words: 46 | words_list.append(word) 47 | 48 | random.shuffle(words_list) 49 | for word in words_list: 50 | if word not in vocab: 51 | vocab[word] = len(vocab) 52 | return vocab 53 | 54 | 55 | def make_data(file_path, vocab): 56 | """ 57 | 构建数据 58 | :param file_path: 59 | :param vocab 60 | :return: 61 | """ 62 | data = pd.read_csv(file_path, sep='\t', header=None) 63 | data.columns = ['text', 'tag'] 64 | word_ids = [] 65 | tag_ids = [] 66 | for index, row in data.iterrows(): 67 | tag_str = row['tag'] 68 | tags = tag_str.split(' ') 69 | words = row['text'] 70 | 71 | word_id = [vocab.get(word) if word in vocab else 1 for word in words] 72 | tag_id = [tag_dict.get(tag) for tag in tags] 73 | 74 | word_ids.append(word_id) 75 | tag_ids.append(tag_id) 76 | print(word_ids[0]) 77 | print(tag_ids[0]) 78 | return {'words': word_ids, 'tags': tag_ids} 79 | 80 | 81 | def save_vocab(vocab, output): 82 | """ 83 | save vocab dict 84 | :param vocab: 85 | :param output: 86 | :return: 87 | """ 88 | with open(output, 'w', encoding='utf-8') as fr: 89 | for word in vocab: 90 | fr.write(word + '\t' + str(vocab.get(word)) + '\n') 91 | print('save vocab is ok.') 92 | 93 | 94 | def main(output_path): 95 | """ 96 | main method 97 | :param output_path: 98 | :return: 99 | """ 100 | data = {} 101 | # 这里是bio形式的数据集,如果不是需要提前转换成bio形式 102 | train_path = './data_path/train.txt' 103 | test_path = './data_path/dev.txt' 104 | vocab = make_vocab(train_path) 105 | train_data = make_data(train_path, vocab) 106 | test_data = make_data(test_path, vocab) 107 | 108 | data['train'] = train_data 109 | data['test'] = test_data 110 | 111 | data_path = os.path.join(output_path, 'clue_data.pkl') 112 | cPickle.dump(data, open(data_path, 'wb'), protocol=2) 113 | print('save data to pkl ok.') 114 | 115 | vocab_path = os.path.join(output_path, 'clue_vocab.txt') 116 | save_vocab(vocab, vocab_path) 117 | 118 | 119 | if __name__ == '__main__': 120 | output = './data_path/' 121 | main(output) 122 | -------------------------------------------------------------------------------- /text_classification/README.md: -------------------------------------------------------------------------------- 1 | ## 文本分类 2 | 3 | 这里首先介绍一篇基于深度学习的文本分类综述,《Deep Learning Based Text Classification: A Comprehensive Review》,论文来源:https://arxiv.org/abs/2004.03705 4 | 5 | **文本分类简介**: 6 | 7 | 文本分类是NLP中一个非常经典任务(对给定的句子、查询、段落或者文档打上相应的类别标签)。其应用包括机器问答、垃圾邮件识别、情感分析、新闻分类、用户意图识别等。文本数据的来源也十分的广泛,比如网页数据、邮件内容、聊天记录、社交媒体、用户评论等。 8 | 9 | **文本分类三大方法**: 10 | 11 | 1. Rule-based methods:使用预定义的规则进行分类,需要很强的领域知识而且系统很难维护 12 | 2. ML (data-driven) based methods:经典的机器学习方法使用特征提取(Bow词袋等)来提取特征,再使用朴素贝叶斯、SVM、HMM、Gradien Boosting Tree和随机森林等方法进行分类。深度学习方法通常使用的是end2end形式,比如Transformer、Bert等。 13 | 3. Hybrid methods:基于规则和基于机器学习(深度学习)方法的混合 14 | 15 | **文本分类任务**: 16 | 17 | 1. 情感分析(Sentiment Analysis):给定文本,分析用户的观点并且抽取出他们的主要观点。可以是二分类,也可以是多分类任务 18 | 2. 新闻分类(News Categorization):识别新闻主题,并给用户推荐相关的新闻。主要应用于推荐系统 19 | 3. 主题分析(Topic Analysis):给定文本,抽取出其文本的一个或者多个主题 20 | 4. 机器问答(Question Answering):提取式(extractive),给定问题和一堆候选答案,从中识别出正确答案;生成式(generative),给定问题,然后生成答案。(NL2SQL?) 21 | 5. 自然语言推理(Natural Language Inference):文本蕴含任务,预测一个文本是否可以从另一个文本中推断出。一般包括entailment、contradiction和neutral三种关系类型 22 | 23 | **文本分类模型(深度学习)**: 24 | 25 | 1. 基于前馈神经网络(Feed-Forward Neural Networks) 26 | 2. 基于循环神经网络(RNN) 27 | 3. 基于卷积神经网络(CNN) 28 | 4. 基于胶囊高神经网络(Capsule networks) 29 | 5. 基于Attention机制 30 | 6. 
基于记忆增强网络(Memory-augmented networks) 31 | 7. 基于Transformer机制 32 | 8. 基于图神经网络 33 | 9. 基于孪生神经网络(Siamese Neural Network) 34 | 10. 混合神经网络(Hybrid models) 35 | 36 | 详解见https://blog.csdn.net/u013963380/article/details/106957420(只详细描述了前4种深度学习模型)。 37 | 38 | ## 文本分类数据集 39 | 40 | Deep Learning Based Text Classification: A Comprehensive Review一文中提到了很多的文本分类的数据集,大多数是英文的。 41 | 42 | 下面列出一些中文文本分类数据集: 43 | 44 | | 数据集 | 说明 | 链接 | 45 | | :------- | ------------------------------------------------------------ | ------------------------------------------------------------ | 46 | | THUCNews | THUCNews是根据新浪新闻RSS订阅频道2005~2011年间的历史数据筛选过滤生成。
包含财经、彩票、房产、股票、家居、教育等14个类别。
原始数据集见:[链接](http://thuctc.thunlp.org/#%E4%B8%AD%E6%96%87%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB%E6%95%B0%E6%8D%AE%E9%9B%86THUCNews) | [下载地址](http://thuctc.thunlp.org/#%E4%B8%AD%E6%96%87%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB%E6%95%B0%E6%8D%AE%E9%9B%86THUCNews) | 47 | | 今日头条 | 来源于今日头条,为短文本分类任务,数据包含15个类别 | [下载地址](https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip) | 48 | | IFLYTEK | 1.7万多条关于app应用描述的长文本标注数据,包含和日常生活相关的各类应用主题,共119个类别 | [下载地址](https://storage.googleapis.com/cluebenchmark/tasks/iflytek_public.zip) | 49 | | 新闻标题 | 数据集来源于Kesci平台,为新闻标题领域短文本分类任务。
内容大多为短文本标题(length<50),数据包含15个类别,共38w条样本 | [下载地址](https://pan.baidu.com/s/1vyGSIycsan3YWHEjBod9pw)
提取码:lrmv | 50 | | 复大文本 | 数据集来源于复旦大学,为短文本分类任务,数据包含20个类别,共9804篇文档 | [下载地址](https://pan.baidu.com/s/1vyGSIycsan3YWHEjBod9pw)
提取码:lrmv | 51 | | OCNLI | 中文原版自然语言推理,是第一个非翻译的、使用原生汉语的大型中文自然语言推理数据集
详细见https://github.com/CLUEbenchmark/OCNLI | [下载地址](https://storage.googleapis.com/cluebenchmark/tasks/ocnli_public.zip) | 52 | | 情感分析 | OCEMOTION–中文情感分类,对应文章https://www.aclweb.org/anthology/L16-1291.pdf
原始数据集未找到,只有一部分数据 | [下载地址](https://pan.baidu.com/s/1vyGSIycsan3YWHEjBod9pw)
提取码:lrmv | 53 | | 更新ing | ... | ... | 54 | 55 | 还有一些其他的中文文本数据集,可以在CLUE上搜索,CLUE地址:https://www.cluebenchmarks.com/ ,但是下载需要注册账号,有的链接失效,有的限制日下载次数,这里放到百度网盘供下载学习使用。(请勿用于商业目的) 56 | 57 | ## 文本分类Baseline算法实现 58 | 59 | 使用Tensorflow1.x版本Estimator高阶api实现常见文本分类算法,主要包括前馈神经网络(all 全连接层)模型、双向LSTM模型、文本卷积网络(TextCnn)、Transformer。 60 | 61 | 环境信息: 62 | 63 | tensorflow==1.13.1 64 | 65 | python==3.7 66 | 67 | **数据预处理** 68 | 69 | 要求训练集和测试集分开存储(提供划分数据集方法),另外需要对文本进行分词,数据EDA部分可以见示例中的tnews_data_eda.ipynb文件。 70 | 71 | 在训练模型前,需要先运行preprocess.py文件进行数据预处理,将数据处理成id形式并保存为pkl形式,另外中间过程产生的词表也会保存为vocab.txt文件。 72 | 73 | **文件结构** 74 | 75 | - data_path:数据集存放的位置 76 | - data_utils:数据处理相关的工具类存放位置 77 | - model_ckpt:模型checkpoint保存的位置 78 | - model_pb:pb形式的模型保存的位置 79 | - models:文本分类baseline模型存放的位置,包括BiLstm、TextCnn等 80 | - train_main.py:模型训练主入口 81 | - preprocess.py:数据预处理代码,包括划分数据集、转换文本为id等 82 | - tf_metrics.py:tensorflow1.x版本不支持多分类的指标函数,这里使用的是Guillaume Genthial编写的多分类指标函数,[github地址](https://github.com/guillaumegenthial/tf_metrics) 83 | - inference.py:推理主入口 84 | 85 | **模型训练过程** 86 | 87 | - 首先准备好数据集,放在data_path下,然后运行preprocess.py文件 88 | - 运行train_main.py,具体的模型参数可以在ARGS里面设置,也可以使用python train_main.py --train_path='./data_path/emotion_data.pkl'的形式 89 | 90 | **模型推理** 91 | 92 | - 推理代码在inference.py中 93 | 94 | ## 示例 95 | 96 | 下面使用中文任务测评基准(CLUE benchmark)的头条新闻分类数据来进行demo示例演示: 97 | 98 | 数据集下载地址:https://github.com/CLUEbenchmark/CLUE 中的[TNEWS'数据集下载](https://storage.googleapis.com/cluebenchmark/tasks/tnews_public.zip) 99 | 100 | 该数据集来自今日头条新闻版块,共15个类别的新闻,包括旅游、教育、金融、军事等。 101 | 102 | ``` 103 | 数据量:训练集(53,360),验证集(10,000),测试集(10,000) 104 | 例子: 105 | {"label": "102", "label_des": "news_entertainment", "sentence": "江疏影甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物"} 106 | 每一条数据有三个属性,从前往后分别是 分类ID,分类名称,新闻字符串(仅含标题)。 107 | ``` 108 | 109 | **1.数据EDA** 110 | 111 | 数据EDA部分见tnews_data_eda.ipynb,主要是简单分析一下数据集的文本的长度分布、类别标签的数量比。然后对文本进行分词,这里使用的jieba分词软件。分词后将数据集保存到data_path目录下。 112 | 113 | ``` 114 | # 各种类别标签的数量分布 115 | 109 5955 116 | 104 5200 117 | 102 4976 118 | 113 4851 119 | 107 4118 120 | 101 4081 121 | 103 3991 122 | 110 3632 123 | 108 3437 124 | 116 3390 125 | 112 3368 126 | 115 2886 127 | 106 2107 128 | 100 1111 129 | 114 257 130 | ``` 131 | 132 | **2.设置训练参数** 133 | 134 | 参数设置的时候有几个参数需要根据自己的数据分布来设置: 135 | 136 | - vocab_size:这里的大小,一般需要根据自己生成的vocab.txt中词表的大小来设置 137 | - num_label:类别标签的数量 138 | - train_path/eval_path:数据集的路径 139 | - weights权重设置:根据数据EDA中的类别标签分布,设置weights=[0.9,0.9,0.9,0.9,1,1,1,1,1,1,1,1,1,1.2,1.5],后面几个类别的数量明显很少,权重设置大一点。具体数值自己根据个人分析来定义 140 | 141 | 其他的参数视个人情况而定 142 | 143 | **3.模型训练并保存模型** 144 | 145 | 这里使用的是BiLstm模型。 146 | 147 | 代码中保存了两种模型形式,一种是checkpoint,另一种是pb格式 148 | 149 | **4.开始预测并提交结果** 150 | 151 | 预测代码见inferecen.py,最后在CLUE上提交的结果是50.92([ALBERT-xxlarge](https://github.com/google-research/albert) :59.46,目前[UER-ensemble](https://github.com/dbiir/UER-py):72.20) 152 | 153 | ## 中文文本分类比赛OR评测 154 | 155 | 1.[零基础入门NLP-新闻文本分类](https://tianchi.aliyun.com/competition/entrance/531810/introduction?spm=5176.12281973.1005.4.3dd52448KQuWQe)(DataWhale和天池举办的学习赛) 156 | 157 | 2.[中文CLUE的各种分类任务的评测](https://www.cluebenchmarks.com/) -------------------------------------------------------------------------------- /text_classification/data_path/README.md: -------------------------------------------------------------------------------- 1 | ### 文件说明 2 | 3 | 这里存放的是数据集 -------------------------------------------------------------------------------- /text_classification/data_path/tnews_data.pkl: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/xudongMk/AwesomeNLPBaseline/83774119d8b3a68b062df25f0e8775970fab8079/text_classification/data_path/tnews_data.pkl -------------------------------------------------------------------------------- /text_classification/inference.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020/12/28 21:45 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : inference.py 6 | # @Software: PyCharm 7 | 8 | import tensorflow as tf 9 | import tqdm 10 | import json 11 | import jieba 12 | 13 | """ 14 | 文本分类推理代码 15 | """ 16 | 17 | # 设置filter 18 | filter = './??;。(())【】{}[]!!,,<>《》+' 19 | # 加载词典 20 | word_dict = {} 21 | with open('./data_path/vocab.txt', encoding='utf-8') as fr: 22 | lines = fr.readlines() 23 | for line in lines: 24 | word = line.split('\t')[0] 25 | id = line.split('\t')[1] 26 | word_dict[word] = id 27 | print(word_dict) 28 | 29 | # label dict的设置 30 | label_id = {0: 109, 1: 104, 2: 102, 3: 113, 31 | 4: 107, 5: 101, 6: 103, 7: 110, 32 | 8: 108, 9: 116, 10: 112, 11: 115, 33 | 12: 106, 13: 100, 14: 114} 34 | label_desc = {100: "news_story", 101: "news_culture", 102: "news_entertainment", 35 | 103: "news_sports", 104: "news_finance", 106: "news_house", 36 | 107: "news_car", 108: "news_edu", 109: "news_tech", 37 | 110: "news_military", 112: "news_travel", 113: "news_world", 38 | 114: "news_stock", 115: "news_agriculture", 116: "news_game"} 39 | 40 | 41 | def cut_with_jieba(text, filter=None): 42 | """ 使用jieba切分句子 """ 43 | if filter: 44 | for c in filter: 45 | text = text.replace(c, '') 46 | words = ['Number' if word.isdigit() else word for word in jieba.cut(text)] 47 | return words 48 | 49 | 50 | def words_to_ids(words, word_dict): 51 | """ 将words 转换成ids形式 """ 52 | ids = [word_dict.get(word, 1) for word in words] 53 | return ids 54 | 55 | 56 | def predict_main(test_file, out_path): 57 | """ 预测主入口 """ 58 | model_path = './model_pb/1609247078' 59 | with tf.Session(graph=tf.Graph()) as sess: 60 | model = tf.saved_model.loader.load(sess, ['serve'], model_path) 61 | # print(model) 62 | out = sess.graph.get_tensor_by_name('class_out:0') 63 | input_p = sess.graph.get_tensor_by_name('input_words:0') 64 | 65 | with open(test_file, encoding='utf-8') as fr: 66 | lines = fr.readlines() 67 | res_list = [] 68 | for line in tqdm.tqdm(lines): 69 | json_str = json.loads(line) 70 | id = json_str['id'] 71 | sentence = json_str['sentence'] 72 | 73 | words = cut_with_jieba(str(sentence), filter) 74 | if len(words) < 1: 75 | print('there are some sample error!') 76 | text_features = words_to_ids(words, word_dict) 77 | feed = {input_p: [text_features]} 78 | score = sess.run(out, feed_dict=feed) 79 | 80 | label = label_id.get(score[0]) 81 | label_d = label_desc.get(label) 82 | 83 | res_list.append( 84 | json.dumps({"id": id, "label": str(label), "label_desc": label_d})) 85 | # 写入到文件 86 | with open(out_path, 'w', encoding='utf-8') as fr: 87 | for res in res_list: 88 | fr.write(res) 89 | fr.write('\n') 90 | print('predict and write to file over!!!') 91 | 92 | 93 | if __name__ == '__main__': 94 | test_file = './data_path/test.json' 95 | out_path = './data_path/tnews_predict.json' 96 | predict_main(test_file, out_path) 97 | -------------------------------------------------------------------------------- /text_classification/model_ckpt/README.md: -------------------------------------------------------------------------------- 1 | ### 文件说明 2 | 3 | 这里保存训练后的checkpoint文件 
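补充说明:前面 inference.py 中的 label_id 字典是手工硬编码的,它实际上就是 preprocess.py 里 label_dict_default 的反向映射。下面是一个最小示意(非仓库原有代码,假设在 text_classification/ 目录下运行、preprocess 模块可直接 import),用推导代替硬编码,可以避免两处映射不一致:

```python
# 示意:由 preprocess.py 的 label_dict_default 反推 inference.py 所需的 label_id
# (假设当前工作目录为 text_classification/,preprocess 模块可直接导入)
from preprocess import label_dict_default

label_id = {idx: label for label, idx in label_dict_default.items()}
# 与 inference.py 中硬编码的字典一致,例如 label_id[0] == 109、label_id[14] == 114
assert label_id[0] == 109 and label_id[14] == 114
```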
-------------------------------------------------------------------------------- /text_classification/model_pb/README.md: -------------------------------------------------------------------------------- 1 | ### 文件说明 2 | 3 | 这里保存训练后的pb模型文件 -------------------------------------------------------------------------------- /text_classification/models/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020/12/10 21:34 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : __init__.py.py 6 | # @Software: PyCharm 7 | 8 | -------------------------------------------------------------------------------- /text_classification/models/attention.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020/12/10 21:51 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : attention.py 6 | # @Software: PyCharm 7 | 8 | import tensorflow as tf 9 | from .base_model import Linear 10 | 11 | 12 | class Attention: 13 | """ 14 | the attention 15 | """ 16 | def __init__(self, scope_name, hidden_size, num_heads, dropout): 17 | if hidden_size % num_heads != 0: 18 | raise ValueError('the hidden size and heads is not match!') 19 | 20 | self.hidden_size = hidden_size 21 | self.num_heads = num_heads 22 | 23 | self.q_layer = Linear(f'{scope_name}_q', hidden_size, hidden_size, bias=False) 24 | self.k_layer = Linear(f'{scope_name}_k', hidden_size, hidden_size, bias=False) 25 | self.v_layer = Linear(f'{scope_name}_v', hidden_size, hidden_size, bias=False) 26 | 27 | self.out_layer = Linear(f'{scope_name}_output', hidden_size, 28 | hidden_size, bias=False) 29 | self.dropout = tf.layers.Dropout(dropout) 30 | 31 | def split_heads(self, x): 32 | """ split the heads """ 33 | with tf.name_scope('split_heads'): 34 | batch_size = tf.shape(x)[0] 35 | length = tf.shape(x)[1] 36 | 37 | depth = self.hidden_size // self.num_heads 38 | 39 | x = tf.reshape(x, [batch_size, length, self.num_heads, depth]) 40 | 41 | return tf.transpose(x, [0, 2, 1, 3]) 42 | 43 | def combine_heads(self, x): 44 | """ combine the heads """ 45 | with tf.name_scope('combine_heads'): 46 | batch_size = tf.shape(x)[0] 47 | length = tf.shape(x)[2] 48 | 49 | x = tf.transpose(x, [0, 2, 1, 3]) # bacth length, heads, depth 50 | return tf.reshape(x, [batch_size, length, self.hidden_size]) 51 | 52 | def call(self, x, y, training, bias, cache=None): 53 | q = self.q_layer(x, training) 54 | k = self.k_layer(y, training) 55 | v = self.v_layer(y, training) 56 | 57 | if cache: 58 | k = tf.concat([cache['k'], k], axis=1) 59 | v = tf.concat([cache['v'], v], axis=1) 60 | 61 | cache['k'] = k 62 | cache['v'] = v 63 | 64 | q = self.split_heads(q) 65 | k = self.split_heads(k) 66 | v = self.split_heads(v) 67 | 68 | depth = self.hidden_size // self.num_heads 69 | q *= depth ** -0.5 70 | 71 | # calculate dot product attention 72 | logits = tf.matmul(q, k, transpose_b=True) 73 | logits += bias 74 | weights = tf.nn.softmax(logits) 75 | weights = self.dropout(weights, training=training) 76 | attention_output = tf.matmul(weights, v) 77 | 78 | attention_output = self.combine_heads(attention_output) 79 | 80 | attention_output = self.out_layer(attention_output, training) 81 | return attention_output 82 | 83 | 84 | class SelfAttention(Attention): 85 | def __call__(self, x, training, bias, cache=None): 86 | return super(SelfAttention, self).call(x, x, training, bias, cache) 
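补充说明:attention.py 中的 SelfAttention 是标准的多头自注意力层(缩放点积注意力 + 多头拆分/合并),可以配合 model_utils.py 的 get_padding_bias 对 PAD 位置做屏蔽。下面是一个最小调用示意(非仓库原有代码,假设 TF 1.x 静态图模式、text_classification/ 在 PYTHONPATH 中,其中的 scope 名称与超参均为演示用的假设值):

```python
# 最小示意:embedding 输出接 SelfAttention,并用 get_padding_bias 屏蔽 PAD 位置
import tensorflow as tf

from models.attention import SelfAttention
from models.base_model import LookupTable
from models.model_utils import get_padding_bias

hidden_size, num_heads = 64, 4
token_ids = tf.placeholder(tf.int64, [None, None], name='token_ids')  # 0 表示 PAD

# 演示用的小词表 embedding,实际应与任务的 vocab_size / emb_size 保持一致
embed = LookupTable('demo_embed', vocab_size=1000, embed_size=hidden_size)(token_ids)

# PAD 位置得到 -1e9 的加性偏置,softmax 之后这些位置的注意力权重趋近于 0
bias = get_padding_bias(token_ids)

self_att = SelfAttention('demo_self_att', hidden_size, num_heads, dropout=0.1)
outputs = self_att(embed, training=True, bias=bias)  # [batch, length, hidden_size]
```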
-------------------------------------------------------------------------------- /text_classification/models/base_model.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020/11/10 21:34 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : base_model.py 6 | # @Software: PyCharm 7 | 8 | import tensorflow as tf 9 | 10 | from tensorflow.contrib.rnn import LSTMCell 11 | from tensorflow.contrib.rnn import MultiRNNCell 12 | from tensorflow.contrib.rnn import GRUCell 13 | from tensorflow.contrib.rnn import BasicRNNCell 14 | 15 | 16 | class Linear: 17 | """ 18 | 线性层,全连接层 19 | """ 20 | def __init__(self, scope_name, input_size, output_sizes, bias=True, 21 | activator='', drop_out=0., reuse=False, trainable=True): 22 | self.input_size = input_size 23 | 24 | # todo 判断 output_sizes 是不是列表 25 | if not isinstance(output_sizes, list): 26 | output_sizes = [output_sizes] 27 | 28 | self.output_size = output_sizes[-1] 29 | 30 | self.W = [] 31 | self.b = [] 32 | size = input_size 33 | with tf.variable_scope(scope_name, reuse=reuse): 34 | for i, output_size in enumerate(output_sizes): 35 | W = tf.get_variable( 36 | 'W{0}'.format(i), [size, output_size], 37 | initializer=tf.random_uniform_initializer(-0.25, 0.25), 38 | trainable=trainable 39 | ) 40 | if bias: 41 | b = tf.get_variable( 42 | 'b{0}'.format(i), [output_size], 43 | initializer=tf.zeros_initializer(), 44 | trainable=trainable 45 | ) 46 | else: 47 | b = None 48 | 49 | self.W.append(W) 50 | self.b.append(b) 51 | size = output_size 52 | 53 | if activator == 'relu': 54 | self.activator = tf.nn.relu 55 | elif activator == 'relu6': 56 | self.activator = tf.nn.relu6 57 | elif activator == 'tanh': 58 | self.activator = tf.nn.tanh 59 | else: 60 | self.activator = tf.identity 61 | 62 | self.drop_out = tf.layers.Dropout(drop_out) 63 | 64 | def __call__(self, input, training): 65 | size = tf.shape(input) 66 | input_trans = tf.reshape(input, [-1, size[-1]]) 67 | for W, b in zip(self.W, self.b): 68 | if b is not None: 69 | input_trans = tf.nn.xw_plus_b(input_trans, W, b) 70 | else: 71 | input_trans = tf.matmul(input_trans, W) 72 | 73 | input_trans = self.drop_out(input_trans, training) 74 | input_trans = self.activator(input_trans) 75 | 76 | new_size = tf.concat([size[:-1], tf.constant([self.output_size])], 0) 77 | input_trans = tf.reshape(input_trans, new_size) 78 | return input_trans 79 | 80 | 81 | class LookupTable: 82 | """ 83 | embedding层 84 | """ 85 | def __init__(self, scope_name, vocab_size, embed_size, reuse=False, trainable=True): 86 | self.vocab_size = vocab_size 87 | self.embed_size = embed_size 88 | 89 | with tf.variable_scope(scope_name, reuse=bool(reuse)): 90 | self.embedding = tf.get_variable( 91 | 'embedding', [vocab_size, embed_size], 92 | initializer=tf.random_uniform_initializer(-0.25, 0.25), 93 | trainable=trainable 94 | ) 95 | 96 | def __call__(self, input): 97 | input = tf.where(tf.less(input, self.vocab_size), input, tf.ones_like(input)) 98 | return tf.nn.embedding_lookup(self.embedding, input) 99 | 100 | 101 | class AttentionPooling: 102 | """ 103 | attention pooling层 104 | """ 105 | def __init__(self, scope_name, input_size, hidden_size, reuse=False, 106 | trainable=True): 107 | name = scope_name 108 | self.linear1 = Linear(f'{name}_linear1', input_size, 109 | hidden_size, bias=False, reuse=reuse, 110 | trainable=trainable) 111 | self.linear2 = Linear(f'{name}_linear2', hidden_size, 1, 112 | bias=False, reuse=reuse, trainable=trainable) 113 | 
114 | def __call__(self, input, mask, training): 115 | output_linear1 = self.linear1(input, training) 116 | output_linear2 = self.linear2(output_linear1, training) 117 | weights = tf.squeeze(output_linear2, [-1]) 118 | if mask is not None: 119 | weights += mask 120 | weights = tf.nn.softmax(weights, -1) 121 | return tf.reduce_sum(input * tf.expand_dims(weights, -1), axis=1) 122 | 123 | 124 | class LayerNormalization: 125 | """ 126 | 归一化层 127 | """ 128 | def __init__(self, scope_name, hidden_size): 129 | with tf.variable_scope(scope_name): 130 | self.scale = tf.get_variable('layer_norm_scale', [hidden_size], 131 | initializer=tf.ones_initializer()) 132 | self.bias = tf.get_variable('layer_norm_bias', [hidden_size], 133 | initializer=tf.zeros_initializer()) 134 | 135 | def __call__(self, x, epsilon=1e-6): 136 | mean, variance = tf.nn.moments(x, -1, keep_dims=True) 137 | norm_x = (x - mean) * tf.rsqrt(variance + epsilon) 138 | return norm_x * self.scale + self.bias 139 | 140 | 141 | class LstmBase: 142 | """ 143 | RNN的基础层 144 | """ 145 | def build_rnn(self, rnn_type, hidden_size, num_layes): 146 | cells = [] 147 | for i in range(num_layes): 148 | if rnn_type == 'lstm': 149 | cell = LSTMCell(num_units=hidden_size, 150 | state_is_tuple=True, 151 | initializer=tf.random_uniform_initializer(-0.25, 0.25)) 152 | elif rnn_type == 'gru': 153 | cell = GRUCell(num_units=hidden_size) 154 | elif rnn_type: 155 | cell = BasicRNNCell(num_units=hidden_size) 156 | else: 157 | raise NotImplementedError(f'the rnn type is unexist: {rnn_type}') 158 | cells.append(cell) 159 | 160 | cells = MultiRNNCell(cells, state_is_tuple=True) 161 | 162 | return cells 163 | 164 | 165 | class BiLstm(LstmBase): 166 | """ 167 | 双向LSTM层 168 | """ 169 | def __init__(self, scope_name, hidden_size, num_layers): 170 | super(BiLstm, self).__init__() 171 | assert hidden_size % 2 == 0 172 | hidden_size /= 2 173 | 174 | self.fw_rnns = [] 175 | self.bw_rnns = [] 176 | for i in range(num_layers): 177 | self.fw_rnns.append(self.build_rnn('lstm', hidden_size, 1)) 178 | self.bw_rnns.append(self.build_rnn('lstm', hidden_size, 1)) 179 | 180 | self.scope_name = scope_name 181 | 182 | def __call__(self, input, input_len): 183 | for idx, (fw_rnn, bw_rnn) in enumerate(zip(self.fw_rnns, self.bw_rnns)): 184 | scope_name = '{}_{}'.format(self.scope_name, idx) 185 | ctx, _ = tf.nn.bidirectional_dynamic_rnn( 186 | fw_rnn, bw_rnn, input, sequence_length=input_len, 187 | dtype=tf.float32, time_major=False, 188 | scope=scope_name 189 | ) 190 | input = tf.concat(ctx, -1) 191 | ctx = input 192 | return ctx 193 | 194 | 195 | class Cnn: 196 | """ 197 | define cnn 198 | """ 199 | def __init__(self, scoep_name, input_size, hidden_size): 200 | kws=[3] 201 | self.conv_ws = [] 202 | self.conv_bs = [] 203 | for idx, kw in enumerate(kws): 204 | w = tf.get_variable( 205 | f"conv_w_{idx}", 206 | [kw, input_size, hidden_size], 207 | initializer=tf.random_uniform_initializer(-0.1, 0.1) 208 | ) 209 | b = tf.get_variable( 210 | f"conv_b_{idx}", 211 | [hidden_size], 212 | initializer=tf.zeros_initializer() 213 | ) 214 | self.conv_ws.append(w) 215 | self.conv_bs.append(b) 216 | 217 | def __call__(self, input, mask): 218 | outputs = [] 219 | for conv_w, conv_b in zip(self.conv_ws, self.conv_bs): 220 | conv = tf.nn.conv1d(input, conv_w, 1, 'SAME') 221 | conv = tf.nn.bias_add(conv, conv_b) 222 | if mask is not None: 223 | conv += tf.expand_dims(mask, -1) 224 | outputs.append(conv) 225 | output = tf.concat(outputs, -1) 226 | return output 227 | 
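补充说明:base_model.py 提供的是可复用的积木层(LookupTable、BiLstm、AttentionPooling、Cnn、Linear 等)。train_main.py 中有一条 TODO 提到 pool 层还可以换成 attention pool,下面给出一个组合示意(非仓库原有代码,scope 名称与超参为演示用假设值),展示 embedding → BiLstm → AttentionPooling → 分类层的接法,以及 AttentionPooling 期望的加性 mask 应如何构造:

```python
# 最小组合示意:embedding -> BiLSTM -> attention pooling -> 全连接分类层
import tensorflow as tf

from models.base_model import LookupTable, BiLstm, AttentionPooling, Linear

words = tf.placeholder(tf.int64, [None, None], name='demo_words')  # [batch, length],0 为 PAD
mask = tf.sign(words)                                              # 1 表示真实 token,0 表示 PAD
sent_len = tf.reduce_sum(mask, axis=1)

embed = LookupTable('demo_look_up', vocab_size=68000, embed_size=300)(words)
rnn_out = BiLstm('demo_bi_lstm', hidden_size=300, num_layers=1)(embed, sent_len)

# AttentionPooling 期望的是加性 mask:真实 token 处为 0,PAD 处为很大的负数,
# 这样 softmax 之后 PAD 位置的权重趋近于 0
att_mask = (1.0 - tf.cast(mask, tf.float32)) * -1e9
pooled = AttentionPooling('demo_att_pool', input_size=300, hidden_size=300)(
    rnn_out, att_mask, training=True)                              # [batch, 300]

logits = Linear('demo_cls', 300, 15)(pooled, training=True)        # [batch, num_label]
```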
-------------------------------------------------------------------------------- /text_classification/models/bilstm_model.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020-12-11 12:37 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : bilstm_model.py 6 | # @Software: PyCharm 7 | import tensorflow as tf 8 | 9 | from .base_model import LookupTable 10 | from .base_model import BiLstm 11 | from .base_model import Linear 12 | 13 | 14 | class BiLstmModel: 15 | """ 16 | BiLstm模型的实现: 17 | 主要包含:embedding层、rnn层、池化层、两层全连接层和一个Dropout层 18 | """ 19 | def __init__(self, vocab_size, emb_size, args): 20 | 21 | # embedding 22 | scope_name = 'look_up' 23 | self.lookuptables = LookupTable(scope_name, vocab_size, emb_size) 24 | 25 | # rnn 26 | scope_name = 'bi_lstm' 27 | # rnn的层数 这里设置为1 28 | num_layers = 1 29 | self.rnn = BiLstm(scope_name, args.hidden_size, num_layers) 30 | 31 | # linear1 32 | scope_name = 'linear1' 33 | self.linear1 = Linear(scope_name, args.hidden_size, args.fc_layer_size, 34 | activator=args.activator) 35 | 36 | # logits out 37 | scope_name = 'linear2' 38 | self.linear2 = Linear(scope_name, args.fc_layer_size, args.num_label) 39 | 40 | self.dropout = tf.layers.Dropout(args.drop_out) 41 | 42 | def max_pool(inputs): 43 | return tf.reduce_max(inputs, 1) 44 | 45 | def mean_pool(inputs): 46 | return tf.reduce_mean(inputs, 1) 47 | 48 | if args.pool == 'max': 49 | self.pool = max_pool 50 | else: 51 | self.pool = mean_pool 52 | 53 | def __call__(self, inputs, training): 54 | masks = tf.sign(inputs) 55 | sent_len = tf.reduce_sum(masks, axis=1) 56 | 57 | embedding = self.lookuptables(inputs) 58 | 59 | rnn_out = self.rnn(embedding, sent_len) 60 | pool_out = self.pool(rnn_out) 61 | linear_out = self.linear1(pool_out, training) 62 | # dropout 63 | linear_out = self.dropout(linear_out, training) 64 | # linear 65 | output = self.linear2(linear_out, training) 66 | return output -------------------------------------------------------------------------------- /text_classification/models/ffnn_model.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020-12-11 19:05 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : ffnn_model.py 6 | # @Software: PyCharm 7 | 8 | import tensorflow as tf 9 | 10 | from .base_model import LookupTable 11 | from .base_model import Linear 12 | 13 | 14 | class FCModel: 15 | """ 16 | 前馈网络 17 | 主要包括:embedding层、两个全连接层和一个dropout层 18 | """ 19 | def __init__(self, vocab_size, emb_size, args): 20 | 21 | # embedding 22 | scope_name = 'look_up' 23 | self.lookuptables = LookupTable(scope_name, vocab_size, emb_size) 24 | 25 | fc_layer_size = args.fc_layer_size # 全链接层的size 26 | scope_name = 'linear1' 27 | self.linear1 = Linear(scope_name, emb_size, fc_layer_size, args.activator) 28 | 29 | scope_name = 'linear2' 30 | self.linear2 = Linear(scope_name, fc_layer_size, args.num_label) 31 | 32 | self.dropout = tf.layers.Dropout(args.drop_out) 33 | 34 | def __call__(self, inputs, training): 35 | embedding = self.lookuptables(inputs) 36 | pool_out = tf.reduce_mean(embedding, 1) 37 | pool_out = self.dropout(pool_out, training) 38 | pool_out = self.linear1(pool_out, training) 39 | pool_out = self.dropout(pool_out, training) 40 | output = self.linear2(pool_out, training) 41 | 42 | return output 43 | -------------------------------------------------------------------------------- 
/text_classification/models/model_utils.py: -------------------------------------------------------------------------------- 1 | """Transformer model helper methods.""" 2 | 3 | import math 4 | 5 | import numpy as np 6 | import tensorflow as tf 7 | 8 | _NEG_INF_FP32 = -1e9 9 | _NEG_INF_FP16 = np.finfo(np.float16).min 10 | 11 | 12 | def get_position_encoding(length, 13 | hidden_size, 14 | min_timescale=1.0, 15 | max_timescale=1.0e4): 16 | """Return positional encoding. 17 | Calculates the position encoding as a mix of sine and cosine functions with 18 | geometrically increasing wavelengths. 19 | Defined and formulized in Attention is All You Need, section 3.5. 20 | Args: 21 | length: Sequence length. 22 | hidden_size: Size of the 23 | min_timescale: Minimum scale that will be applied at each position 24 | max_timescale: Maximum scale that will be applied at each position 25 | Returns: 26 | Tensor with shape [length, hidden_size] 27 | """ 28 | # We compute the positional encoding in float32 even if the model uses 29 | # float16, as many of the ops used, like log and exp, are numerically unstable 30 | # in float16. 31 | position = tf.cast(tf.range(length), tf.float32) 32 | num_timescales = hidden_size // 2 33 | log_timescale_increment = ( 34 | math.log(float(max_timescale) / float(min_timescale)) / 35 | (tf.cast(num_timescales, tf.float32) - 1)) 36 | inv_timescales = min_timescale * tf.exp( 37 | tf.cast(tf.range(num_timescales), tf.float32) * -log_timescale_increment) 38 | scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0) 39 | signal = tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1) 40 | return signal 41 | 42 | 43 | def get_decoder_self_attention_bias(length, dtype=tf.float32): 44 | """Calculate bias for decoder that maintains model's autoregressive property. 45 | Creates a tensor that masks out locations that correspond to illegal 46 | connections, so prediction at position i cannot draw information from future 47 | positions. 48 | Args: 49 | length: int length of sequences in batch. 50 | dtype: The dtype of the return value. 51 | Returns: 52 | float tensor of shape [1, 1, length, length] 53 | """ 54 | neg_inf = _NEG_INF_FP16 if dtype == tf.float16 else _NEG_INF_FP32 55 | with tf.name_scope("decoder_self_attention_bias"): 56 | valid_locs = tf.linalg.band_part( 57 | tf.ones([length, length], dtype=dtype), -1, 0) 58 | valid_locs = tf.reshape(valid_locs, [1, 1, length, length]) 59 | decoder_bias = neg_inf * (1.0 - valid_locs) 60 | return decoder_bias 61 | 62 | 63 | def get_padding(x, padding_value=0, dtype=tf.float32): 64 | """Return float tensor representing the padding values in x. 65 | Args: 66 | x: int tensor with any shape 67 | padding_value: int which represents padded values in input 68 | dtype: The dtype of the return value. 69 | Returns: 70 | float tensor with same shape as x containing values 0 or 1. 71 | 0 -> non-padding, 1 -> padding 72 | """ 73 | with tf.name_scope("padding"): 74 | return tf.cast(tf.equal(x, padding_value), dtype) 75 | 76 | 77 | def get_padding_bias(x, padding_value=0, dtype=tf.float32): 78 | """Calculate bias tensor from padding values in tensor. 79 | Bias tensor that is added to the pre-softmax multi-headed attention logits, 80 | which has shape [batch_size, num_heads, length, length]. The tensor is zero at 81 | non-padding locations, and -1e9 (negative infinity) at padding locations. 
82 | Args: 83 | x: int tensor with shape [batch_size, length] 84 | padding_value: int which represents padded values in input 85 | dtype: The dtype of the return value 86 | Returns: 87 | Attention bias tensor of shape [batch_size, 1, 1, length]. 88 | """ 89 | with tf.name_scope("attention_bias"): 90 | padding = get_padding(x, padding_value, dtype) 91 | attention_bias = padding * _NEG_INF_FP32 92 | attention_bias = tf.expand_dims( 93 | tf.expand_dims(attention_bias, axis=1), axis=1) 94 | return attention_bias 95 | -------------------------------------------------------------------------------- /text_classification/models/text_cnn.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020-12-11 19:19 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : text_cnn.py 6 | # @Software: PyCharm 7 | 8 | import tensorflow as tf 9 | 10 | from .base_model import LookupTable 11 | from .base_model import BiLstm 12 | from .base_model import Linear 13 | 14 | 15 | class TextCnn: 16 | """ 17 | text cnn model 18 | 主要包括:embedding层、三个不同size卷积核层、两个全连接层和dropout层 19 | """ 20 | def __init__(self, vocab_size, emb_size, args): 21 | 22 | # embedding 23 | scope_name = 'look_up' 24 | self.lookuptables = LookupTable(scope_name, vocab_size, emb_size) 25 | 26 | # 三个卷积核 27 | kws = [2, 3, 5] 28 | self.conv_ws = [] 29 | self.conv_bs = [] 30 | 31 | # the num of filter 卷积核的数量 32 | filter_num = args.filter_num 33 | for idx, kw in enumerate(kws): 34 | w = tf.get_variable( 35 | f"conv_w_{idx}", 36 | [kw, emb_size, filter_num], 37 | initializer=tf.random_uniform_initializer(-0.25, 0.25) 38 | ) 39 | b = tf.get_variable( 40 | f"conv_b_{idx}", 41 | [filter_num], 42 | initializer=tf.random_uniform_initializer(-0.25, 0.25) 43 | ) 44 | self.conv_ws.append(w) 45 | self.conv_bs.append(b) 46 | 47 | scope_name = 'linear1' 48 | self.linear1 = Linear(scope_name, len(kws) * filter_num, 49 | args.fc_layer_size, activator=args.activator) 50 | 51 | scope_name = 'linear2' 52 | self.linear2 = Linear(scope_name, args.fc_layer_size, args.num_label) 53 | 54 | self.dropout = tf.layers.Dropout(args.drop_out) 55 | 56 | def __call__(self, inputs, training): 57 | embedding = self.lookuptables(inputs) 58 | 59 | outputs = [] 60 | for conv_w, conv_b in zip(self.conv_ws, self.conv_bs): 61 | conv = tf.nn.conv1d(embedding, conv_w, 1, 'SAME') 62 | conv = tf.nn.bias_add(conv, conv_b) 63 | pool = tf.reduce_max(conv, axis=1) 64 | outputs.append(pool) 65 | output = tf.concat(outputs, -1) 66 | output = self.linear1(output, training) 67 | output = self.dropout(output, training) 68 | output = self.linear2(output, training) 69 | return output -------------------------------------------------------------------------------- /text_classification/preprocess.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020-10-11 18:52 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : preprocess.py 6 | # @Software: PyCharm 7 | 8 | import os 9 | import _pickle as pickle 10 | import pandas as pd 11 | import random 12 | 13 | from sklearn.model_selection import train_test_split 14 | 15 | """ 16 | 数据预处理 17 | 将数据处理成id,并封装成pkl形式 18 | """ 19 | 20 | # 可以人为自定义label dict 21 | label_dict_default = {109: 0, 104: 1, 102: 2, 113: 3, 22 | 107: 4, 101: 5, 103: 6, 110: 7, 23 | 108: 8, 116: 9, 112: 10, 115: 11, 24 | 106: 12, 100: 13, 114: 14} 25 | 26 | 27 | def make_vocab(file_path): 28 | """ 29 | 构建词典和label映射词典 30 | 
:param file_path: 31 | :return: 32 | """ 33 | data = pd.read_csv(file_path, sep='\t') 34 | vocab = {'PAD': 0, 'UNK': 1} 35 | words_list = [] 36 | for index, row in data.iterrows(): 37 | label = row['label'] 38 | words = row['words'].split(' ') 39 | for word in words: 40 | words_list.append(word) 41 | random.shuffle(words_list) 42 | for word in words_list: 43 | if word not in vocab: 44 | vocab[word] = len(vocab) 45 | # save to file and print the label dict 46 | save_path = './data_path/vocab.txt' 47 | save_vocab(vocab, save_path) 48 | print(f'the vocab size is {len(vocab)}') 49 | return vocab 50 | 51 | 52 | def make_data(file_path, vocab, type): 53 | """ 54 | 构建数据 55 | :param file_path: 56 | :param vocab 57 | :return: 58 | """ 59 | data = pd.read_csv(file_path, sep='\t') 60 | word_ids = [] 61 | label_ids = [] 62 | for index, row in data.iterrows(): 63 | label = row['label'] 64 | words = row['words'].split(' ') 65 | word_id_temp = [vocab.get(word) if word in vocab else 1 for word in words] 66 | word_ids.append(word_id_temp) 67 | label_ids.append(label_dict_default.get(label)) 68 | 69 | print(f'the {type} data size is {len(word_ids)}') 70 | print(word_ids[0]) 71 | print(label_ids[0]) 72 | 73 | return {'words': word_ids, 'labels': label_ids} 74 | 75 | 76 | def save_vocab(vocab, output): 77 | """ 78 | 保存vocab到本地文件 79 | :param vocab: 80 | :param output: 81 | :return: 82 | """ 83 | with open(output, 'w', encoding='utf-8') as fr: 84 | for word in vocab: 85 | fr.write(word + '\t' + str(vocab.get(word)) + '\n') 86 | print('save vocab is ok.') 87 | 88 | 89 | def main(output_path): 90 | """ 91 | main method 92 | :param output_path: 93 | :return: 94 | """ 95 | data = {} 96 | train_path = './data_path/train_data.csv' 97 | test_path = './data_path/dev_data.csv' 98 | vocab = make_vocab(train_path) 99 | train_data = make_data(train_path, vocab, 'train') 100 | test_data = make_data(test_path, vocab, 'test') 101 | 102 | data['train'] = train_data 103 | data['test'] = test_data 104 | 105 | data_path = os.path.join(output_path, 'tnews_data.pkl') 106 | pickle.dump(data, open(data_path, 'wb'), protocol=2) 107 | print('save data to pkl over.') 108 | 109 | 110 | def split_data(file_path, output): 111 | """ 112 | 划分数据集 113 | :param file_path: 114 | :param output: 115 | :return: 116 | """ 117 | all_data = pd.read_csv(file_path, sep='\t', header=None) 118 | all_data.columns = ['id', 'texta', 'textb', 'label'] 119 | train_data, test_data = train_test_split(all_data, stratify=all_data['label'], 120 | test_size=0.2, shuffle=True, 121 | random_state=42) 122 | print(train_data) 123 | print(test_data) 124 | train_path = os.path.join(output, 'train_nli.csv') 125 | test_path = os.path.join(output, 'dev_nli.csv') 126 | train_data.to_csv(train_path, sep='\t', header=False, index=False) 127 | test_data.to_csv(test_path, sep='\t', header=False, index=False) 128 | print(f'split data train size={len(train_data)} test size={len(test_data)}') 129 | 130 | 131 | if __name__ == '__main__': 132 | output_path = './data_path' 133 | main(output_path) 134 | -------------------------------------------------------------------------------- /text_classification/tf_metrics.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | """Multiclass""" 4 | 5 | __author__ = "Guillaume Genthial" 6 | 7 | import numpy as np 8 | import tensorflow as tf 9 | from tensorflow.python.ops.metrics_impl import _streaming_confusion_matrix 10 | 11 | 12 | def precision(labels, predictions, num_classes, 
pos_indices=None, 13 | weights=None, average='micro'): 14 | """Multi-class precision metric for Tensorflow 15 | Parameters 16 | ---------- 17 | labels : Tensor of tf.int32 or tf.int64 18 | The true labels 19 | predictions : Tensor of tf.int32 or tf.int64 20 | The predictions, same shape as labels 21 | num_classes : int 22 | The number of classes 23 | pos_indices : list of int, optional 24 | The indices of the positive classes, default is all 25 | weights : Tensor of tf.int32, optional 26 | Mask, must be of compatible shape with labels 27 | average : str, optional 28 | 'micro': counts the total number of true positives, false 29 | positives, and false negatives for the classes in 30 | `pos_indices` and infer the metric from it. 31 | 'macro': will compute the metric separately for each class in 32 | `pos_indices` and average. Will not account for class 33 | imbalance. 34 | 'weighted': will compute the metric separately for each class in 35 | `pos_indices` and perform a weighted average by the total 36 | number of true labels for each class. 37 | Returns 38 | ------- 39 | tuple of (scalar float Tensor, update_op) 40 | """ 41 | cm, op = _streaming_confusion_matrix( 42 | labels, predictions, num_classes, weights) 43 | pr, _, _ = metrics_from_confusion_matrix( 44 | cm, pos_indices, average=average) 45 | op, _, _ = metrics_from_confusion_matrix( 46 | op, pos_indices, average=average) 47 | return (pr, op) 48 | 49 | 50 | def recall(labels, predictions, num_classes, pos_indices=None, weights=None, 51 | average='micro'): 52 | """Multi-class recall metric for Tensorflow 53 | Parameters 54 | ---------- 55 | labels : Tensor of tf.int32 or tf.int64 56 | The true labels 57 | predictions : Tensor of tf.int32 or tf.int64 58 | The predictions, same shape as labels 59 | num_classes : int 60 | The number of classes 61 | pos_indices : list of int, optional 62 | The indices of the positive classes, default is all 63 | weights : Tensor of tf.int32, optional 64 | Mask, must be of compatible shape with labels 65 | average : str, optional 66 | 'micro': counts the total number of true positives, false 67 | positives, and false negatives for the classes in 68 | `pos_indices` and infer the metric from it. 69 | 'macro': will compute the metric separately for each class in 70 | `pos_indices` and average. Will not account for class 71 | imbalance. 72 | 'weighted': will compute the metric separately for each class in 73 | `pos_indices` and perform a weighted average by the total 74 | number of true labels for each class. 
75 | Returns 76 | ------- 77 | tuple of (scalar float Tensor, update_op) 78 | """ 79 | cm, op = _streaming_confusion_matrix( 80 | labels, predictions, num_classes, weights) 81 | _, re, _ = metrics_from_confusion_matrix( 82 | cm, pos_indices, average=average) 83 | _, op, _ = metrics_from_confusion_matrix( 84 | op, pos_indices, average=average) 85 | return (re, op) 86 | 87 | 88 | def f1(labels, predictions, num_classes, pos_indices=None, weights=None, 89 | average='micro'): 90 | return fbeta(labels, predictions, num_classes, pos_indices, weights, 91 | average) 92 | 93 | 94 | def fbeta(labels, predictions, num_classes, pos_indices=None, weights=None, 95 | average='micro', beta=1): 96 | """Multi-class fbeta metric for Tensorflow 97 | Parameters 98 | ---------- 99 | labels : Tensor of tf.int32 or tf.int64 100 | The true labels 101 | predictions : Tensor of tf.int32 or tf.int64 102 | The predictions, same shape as labels 103 | num_classes : int 104 | The number of classes 105 | pos_indices : list of int, optional 106 | The indices of the positive classes, default is all 107 | weights : Tensor of tf.int32, optional 108 | Mask, must be of compatible shape with labels 109 | average : str, optional 110 | 'micro': counts the total number of true positives, false 111 | positives, and false negatives for the classes in 112 | `pos_indices` and infer the metric from it. 113 | 'macro': will compute the metric separately for each class in 114 | `pos_indices` and average. Will not account for class 115 | imbalance. 116 | 'weighted': will compute the metric separately for each class in 117 | `pos_indices` and perform a weighted average by the total 118 | number of true labels for each class. 119 | beta : int, optional 120 | Weight of precision in harmonic mean 121 | Returns 122 | ------- 123 | tuple of (scalar float Tensor, update_op) 124 | """ 125 | cm, op = _streaming_confusion_matrix( 126 | labels, predictions, num_classes, weights) 127 | _, _, fbeta = metrics_from_confusion_matrix( 128 | cm, pos_indices, average=average, beta=beta) 129 | _, _, op = metrics_from_confusion_matrix( 130 | op, pos_indices, average=average, beta=beta) 131 | return (fbeta, op) 132 | 133 | 134 | def safe_div(numerator, denominator): 135 | """Safe division, return 0 if denominator is 0""" 136 | numerator, denominator = tf.to_float(numerator), tf.to_float(denominator) 137 | zeros = tf.zeros_like(numerator, dtype=numerator.dtype) 138 | denominator_is_zero = tf.equal(denominator, zeros) 139 | return tf.where(denominator_is_zero, zeros, numerator / denominator) 140 | 141 | 142 | def pr_re_fbeta(cm, pos_indices, beta=1): 143 | """Uses a confusion matrix to compute precision, recall and fbeta""" 144 | num_classes = cm.shape[0] 145 | neg_indices = [i for i in range(num_classes) if i not in pos_indices] 146 | cm_mask = np.ones([num_classes, num_classes]) 147 | cm_mask[neg_indices, neg_indices] = 0 148 | diag_sum = tf.reduce_sum(tf.diag_part(cm * cm_mask)) 149 | 150 | cm_mask = np.ones([num_classes, num_classes]) 151 | cm_mask[:, neg_indices] = 0 152 | tot_pred = tf.reduce_sum(cm * cm_mask) 153 | 154 | cm_mask = np.ones([num_classes, num_classes]) 155 | cm_mask[neg_indices, :] = 0 156 | tot_gold = tf.reduce_sum(cm * cm_mask) 157 | 158 | pr = safe_div(diag_sum, tot_pred) 159 | re = safe_div(diag_sum, tot_gold) 160 | fbeta = safe_div((1. 
+ beta**2) * pr * re, beta**2 * pr + re) 161 | 162 | return pr, re, fbeta 163 | 164 | 165 | def metrics_from_confusion_matrix(cm, pos_indices=None, average='micro', 166 | beta=1): 167 | """Precision, Recall and F1 from the confusion matrix 168 | Parameters 169 | ---------- 170 | cm : tf.Tensor of type tf.int32, of shape (num_classes, num_classes) 171 | The streaming confusion matrix. 172 | pos_indices : list of int, optional 173 | The indices of the positive classes 174 | beta : int, optional 175 | Weight of precision in harmonic mean 176 | average : str, optional 177 | 'micro', 'macro' or 'weighted' 178 | """ 179 | num_classes = cm.shape[0] 180 | if pos_indices is None: 181 | pos_indices = [i for i in range(num_classes)] 182 | 183 | if average == 'micro': 184 | return pr_re_fbeta(cm, pos_indices, beta) 185 | elif average in {'macro', 'weighted'}: 186 | precisions, recalls, fbetas, n_golds = [], [], [], [] 187 | for idx in pos_indices: 188 | pr, re, fbeta = pr_re_fbeta(cm, [idx], beta) 189 | precisions.append(pr) 190 | recalls.append(re) 191 | fbetas.append(fbeta) 192 | cm_mask = np.zeros([num_classes, num_classes]) 193 | cm_mask[idx, :] = 1 194 | n_golds.append(tf.to_float(tf.reduce_sum(cm * cm_mask))) 195 | 196 | if average == 'macro': 197 | pr = tf.reduce_mean(precisions) 198 | re = tf.reduce_mean(recalls) 199 | fbeta = tf.reduce_mean(fbetas) 200 | return pr, re, fbeta 201 | if average == 'weighted': 202 | n_gold = tf.reduce_sum(n_golds) 203 | pr_sum = sum(p * n for p, n in zip(precisions, n_golds)) 204 | pr = safe_div(pr_sum, n_gold) 205 | re_sum = sum(r * n for r, n in zip(recalls, n_golds)) 206 | re = safe_div(re_sum, n_gold) 207 | fbeta_sum = sum(f * n for f, n in zip(fbetas, n_golds)) 208 | fbeta = safe_div(fbeta_sum, n_gold) 209 | return pr, re, fbeta 210 | 211 | else: 212 | raise NotImplementedError() 213 | -------------------------------------------------------------------------------- /text_classification/tnews_data_eda.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "raw", 5 | "metadata": {}, 6 | "source": [ 7 | "CLUEBenchmark的头条中文新闻分类 数据EDA过程\n", 8 | "任务介绍:https://www.cluebenchmarks.com/introduce.html" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "import pandas as pd\n", 18 | "import numpy as np\n", 19 | "import json" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 2, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "def convert_df(file_path, task):\n", 29 | " with open(file_path, encoding='utf-8') as fr:\n", 30 | " lines = fr.readlines()\n", 31 | " label_list = []\n", 32 | " sentence_list = []\n", 33 | " ids = []\n", 34 | " for line in lines:\n", 35 | " json_str = json.loads(line)\n", 36 | " if task == 'test':\n", 37 | " ids.append(json_str['id'])\n", 38 | " sentence_list.append(json_str['sentence'])\n", 39 | " else:\n", 40 | " label_list.append(json_str['label'])\n", 41 | " sentence_list.append(json_str['sentence'])\n", 42 | " if task == 'test':\n", 43 | " data_dict = {'id': ids, 'text': sentence_list}\n", 44 | " else:\n", 45 | " data_dict = {'label': label_list, 'text': sentence_list}\n", 46 | " data = pd.DataFrame(data_dict)\n", 47 | " \n", 48 | " return data" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 3, 54 | "metadata": {}, 55 | "outputs": [ 56 | { 57 | "name": "stdout", 58 | "output_type": "stream", 59 | "text": [ 60 | " label 
text\n", 61 | "0 108 上课时学生手机响个不停,老师一怒之下把手机摔了,家长拿发票让老师赔,大家怎么看待这种事?\n", 62 | "1 104 商赢环球股份有限公司关于延期回复上海证券交易所对公司2017年年度报告的事后审核问询函的公告\n", 63 | "2 106 通过中介公司买了二手房,首付都付了,现在卖家不想卖了。怎么处理?\n", 64 | "3 112 2018年去俄罗斯看世界杯得花多少钱?\n", 65 | "4 109 剃须刀的个性革新,雷明登天猫定制版新品首发\n", 66 | " label text\n", 67 | "0 102 江疏影甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物\n", 68 | "1 110 以色列大规模空袭开始!伊朗多个军事目标遭遇打击,誓言对等反击\n", 69 | "2 104 出栏一头猪亏损300元,究竟谁能笑到最后!\n", 70 | "3 109 以前很火的巴铁为何现在只字不提?\n", 71 | "4 112 作为一名酒店从业人员,你经历过房客哪些特别没有素质的行为?\n" 72 | ] 73 | } 74 | ], 75 | "source": [ 76 | "train_path = './data/train.json'\n", 77 | "dev_path = './data/dev.json'\n", 78 | "test_path = './data/test.json'\n", 79 | "\n", 80 | "train_data = convert_df(train_path, 'train')\n", 81 | "print(train_data.head(5))\n", 82 | "\n", 83 | "dev_data = convert_df(dev_path, 'dev')\n", 84 | "print(dev_data.head(5))" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 4, 90 | "metadata": {}, 91 | "outputs": [ 92 | { 93 | "name": "stdout", 94 | "output_type": "stream", 95 | "text": [ 96 | " label text word_cnt\n", 97 | "0 108 上课时学生手机响个不停,老师一怒之下把手机摔了,家长拿发票让老师赔,大家怎么看待这种事? 44\n", 98 | "1 104 商赢环球股份有限公司关于延期回复上海证券交易所对公司2017年年度报告的事后审核问询函的公告 46\n", 99 | "2 106 通过中介公司买了二手房,首付都付了,现在卖家不想卖了。怎么处理? 32\n", 100 | "3 112 2018年去俄罗斯看世界杯得花多少钱? 19\n", 101 | "4 109 剃须刀的个性革新,雷明登天猫定制版新品首发 21\n" 102 | ] 103 | } 104 | ], 105 | "source": [ 106 | "train_data['word_cnt'] = train_data['text'].apply(lambda x: len(x))\n", 107 | "print(train_data.head(5))" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 5, 113 | "metadata": {}, 114 | "outputs": [ 115 | { 116 | "data": { 117 | "text/plain": [ 118 | "count 53360.000000\n", 119 | "mean 22.131241\n", 120 | "std 7.309860\n", 121 | "min 2.000000\n", 122 | "25% 17.000000\n", 123 | "50% 22.000000\n", 124 | "75% 28.000000\n", 125 | "max 145.000000\n", 126 | "Name: word_cnt, dtype: float64" 127 | ] 128 | }, 129 | "execution_count": 5, 130 | "metadata": {}, 131 | "output_type": "execute_result" 132 | } 133 | ], 134 | "source": [ 135 | "train_data['word_cnt'].describe()" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 6, 141 | "metadata": {}, 142 | "outputs": [ 143 | { 144 | "data": { 145 | "text/plain": [ 146 | "109 5955\n", 147 | "104 5200\n", 148 | "102 4976\n", 149 | "113 4851\n", 150 | "107 4118\n", 151 | "101 4081\n", 152 | "103 3991\n", 153 | "110 3632\n", 154 | "108 3437\n", 155 | "116 3390\n", 156 | "112 3368\n", 157 | "115 2886\n", 158 | "106 2107\n", 159 | "100 1111\n", 160 | "114 257\n", 161 | "Name: label, dtype: int64" 162 | ] 163 | }, 164 | "execution_count": 6, 165 | "metadata": {}, 166 | "output_type": "execute_result" 167 | } 168 | ], 169 | "source": [ 170 | "train_data['label'].value_counts()" 171 | ] 172 | }, 173 | { 174 | "cell_type": "raw", 175 | "metadata": {}, 176 | "source": [ 177 | "从上面的数据看出tricks:\n", 178 | "1.文本最长是145,大部分都是28左右\n", 179 | "2.label的数量是不均衡的,可以在loss计算的时候加上label的权重" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": 16, 185 | "metadata": {}, 186 | "outputs": [ 187 | { 188 | "name": "stdout", 189 | "output_type": "stream", 190 | "text": [ 191 | " label text word_cnt \\\n", 192 | "0 108 上课时学生手机响个不停,老师一怒之下把手机摔了,家长拿发票让老师赔,大家怎么看待这种事? 44 \n", 193 | "1 104 商赢环球股份有限公司关于延期回复上海证券交易所对公司2017年年度报告的事后审核问询函的公告 46 \n", 194 | "2 106 通过中介公司买了二手房,首付都付了,现在卖家不想卖了。怎么处理? 32 \n", 195 | "3 112 2018年去俄罗斯看世界杯得花多少钱? 19 \n", 196 | "4 109 剃须刀的个性革新,雷明登天猫定制版新品首发 21 \n", 197 | "\n", 198 | " words \n", 199 | "0 上课时 学生 手机 响个 不停 老师 一怒之下 把 手机 摔 了 家长 拿 发票 让 老师 ... 
\n", 200 | "1 商赢 环球 股份 有限公司 关于 延期 回复 上海证券交易所 对 公司 Number 年 年... \n", 201 | "2 通过 中介 公司 买 了 二手房 首付 都 付 了 现在 卖家 不想 卖 了 怎么 处理 \n", 202 | "3 Number 年 去 俄罗斯 看 世界杯 得花 多少 钱 \n", 203 | "4 剃须刀 的 个性 革新 雷明登 天猫 定制 版 新品 首发 \n" 204 | ] 205 | } 206 | ], 207 | "source": [ 208 | "# 分词去掉一些无用词\n", 209 | "import jieba\n", 210 | "def cut_with_jieba(text, filter=None):\n", 211 | " if filter:\n", 212 | " for c in filter:\n", 213 | " text = text.replace(c, '')\n", 214 | " words = ['Number' if word.isdigit() else word for word in jieba.cut(text)]\n", 215 | " # todo 停用词表还可以加进来\n", 216 | " return ' '.join(words)\n", 217 | "\n", 218 | "filter = './??;。(())【】{}[]!!,,<>《》+'\n", 219 | "\n", 220 | "train_data['words'] = train_data['text'].apply(lambda x: cut_with_jieba(x, filter))\n", 221 | "\n", 222 | "print(train_data.head(5))" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": 17, 228 | "metadata": {}, 229 | "outputs": [ 230 | { 231 | "name": "stdout", 232 | "output_type": "stream", 233 | "text": [ 234 | " label text \\\n", 235 | "0 102 江疏影甜甜圈自拍,迷之角度竟这么好看,美吸引一切事物 \n", 236 | "1 110 以色列大规模空袭开始!伊朗多个军事目标遭遇打击,誓言对等反击 \n", 237 | "2 104 出栏一头猪亏损300元,究竟谁能笑到最后! \n", 238 | "3 109 以前很火的巴铁为何现在只字不提? \n", 239 | "4 112 作为一名酒店从业人员,你经历过房客哪些特别没有素质的行为? \n", 240 | "\n", 241 | " words \n", 242 | "0 江 疏影 甜甜 圈自 拍迷 之 角度 竟 这么 好看 美 吸引 一切 事物 \n", 243 | "1 以色列 大规模 空袭 开始 伊朗 多个 军事 目标 遭遇 打击 誓言 对 等 反击 \n", 244 | "2 出栏 一头 猪 亏损 Number 元 究竟 谁 能 笑 到 最后 \n", 245 | "3 以前 很火 的 巴铁 为何 现在 只字不提 \n", 246 | "4 作为 一名 酒店 从业人员 你 经历 过 房客 哪些 特别 没有 素质 的 行为 \n" 247 | ] 248 | } 249 | ], 250 | "source": [ 251 | "dev_data['words'] = dev_data['text'].apply(lambda x: cut_with_jieba(x, filter))\n", 252 | "print(dev_data.head(5))" 253 | ] 254 | }, 255 | { 256 | "cell_type": "raw", 257 | "metadata": {}, 258 | "source": [ 259 | "可以看出分词其实不太准确,这个地方还可以加入原始数据集中的key word" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 18, 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [ 268 | "# 写入到文件\n", 269 | "train_data[['words', 'label']].to_csv('./data/train_data.csv', sep='\\t', encoding='utf-8', index=None)\n", 270 | "dev_data[['words', 'label']].to_csv('./data/dev_data.csv', sep='\\t', encoding='utf-8', index=None)" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": 19, 276 | "metadata": {}, 277 | "outputs": [ 278 | { 279 | "name": "stdout", 280 | "output_type": "stream", 281 | "text": [ 282 | " id text \\\n", 283 | "0 0 在设计史上,每当相对稳定的发展时期,这种设计思想就会成为主导 \n", 284 | "1 1 利希施泰纳宣布赛季结束后离队:我需要新的挑战 \n", 285 | "2 2 庄家一般都是什么操盘思路? \n", 286 | "3 3 王者荣耀里搅屎棍英雄都有谁? \n", 287 | "4 4 照片不小心被删,看看下面的教程,完美找回来! 
\n", 288 | "\n", 289 | " words \n", 290 | "0 在 设计 史上 每当 相对 稳定 的 发展 时期 这种 设计 思想 就 会 成为 主导 \n", 291 | "1 利希 施泰纳 宣布 赛季 结束 后 离队 : 我 需要 新 的 挑战 \n", 292 | "2 庄家 一般 都 是 什么 操盘 思路 \n", 293 | "3 王者 荣耀 里 搅 屎 棍 英雄 都 有 谁 \n", 294 | "4 照片 不 小心 被删 看看 下面 的 教程 完美 找 回来 \n" 295 | ] 296 | } 297 | ], 298 | "source": [ 299 | "\n", 300 | "# 准备测试集\n", 301 | "test_data = convert_df(test_path, 'test')\n", 302 | "\n", 303 | "test_data['words'] = test_data['text'].apply(lambda x: cut_with_jieba(x, filter))\n", 304 | "\n", 305 | "print(test_data.head(5))\n", 306 | "\n", 307 | "test_data[['id', 'words']].to_csv('./data/test_data.csv', sep='\\t', encoding='utf-8', index=None)\n" 308 | ] 309 | }, 310 | { 311 | "cell_type": "code", 312 | "execution_count": null, 313 | "metadata": {}, 314 | "outputs": [], 315 | "source": [] 316 | } 317 | ], 318 | "metadata": { 319 | "kernelspec": { 320 | "display_name": "Python [conda env:tf_envs]", 321 | "language": "python", 322 | "name": "conda-env-tf_envs-py" 323 | }, 324 | "language_info": { 325 | "codemirror_mode": { 326 | "name": "ipython", 327 | "version": 3 328 | }, 329 | "file_extension": ".py", 330 | "mimetype": "text/x-python", 331 | "name": "python", 332 | "nbconvert_exporter": "python", 333 | "pygments_lexer": "ipython3", 334 | "version": "3.7.7" 335 | } 336 | }, 337 | "nbformat": 4, 338 | "nbformat_minor": 4 339 | } 340 | -------------------------------------------------------------------------------- /text_classification/train_main.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | # @Time : 2020-12-11 19:47 3 | # @Author : xudong 4 | # @email : dongxu222mk@163.com 5 | # @File : train_main.py 6 | # @Software: PyCharm 7 | 8 | import sys 9 | import time 10 | import tensorflow as tf 11 | import tf_metrics 12 | import _pickle as cPickle 13 | 14 | from data_utils import datasets 15 | from argparse import ArgumentParser 16 | from models.bilstm_model import BiLstmModel 17 | from models.text_cnn import TextCnn 18 | from models.ffnn_model import FCModel 19 | 20 | 21 | # 设置参数 22 | parser = ArgumentParser() 23 | 24 | parser.add_argument("--train_path", type=str, default='./data_path/tnews_data.pkl', 25 | help='the file path of train data, needs pkl type') 26 | parser.add_argument("--eval_path", type=str, default='./data_path/tnews_data.pkl', 27 | help='the file path of test data, needs pkl type') 28 | parser.add_argument("--model_ckpt_dir", type=str, default='./model_ckpt/', 29 | help='the dir of the checkpoint type model') 30 | parser.add_argument("--model_pb_dir", type=str, default='./model_pb', 31 | help='the dir of the pb type model') 32 | 33 | parser.add_argument("--vocab_size", type=int, default=68000, help='the vocab size') 34 | parser.add_argument("--emb_size", type=int, default=300, help='the embedding size') 35 | parser.add_argument("--hidden_size", type=int, default=300, 36 | help='the hidden size of rnn layer, will split it half in rnn') 37 | parser.add_argument("--fc_layer_size", type=int, default=300, 38 | help='the hidden size of fully connect layer') 39 | parser.add_argument("--num_label", type=int, default=15, help='the number of task label') 40 | parser.add_argument("--drop_out", type=float, default=0.2, 41 | help='the dropout rate in layers') 42 | parser.add_argument("--batch_size", type=int, default=16, 43 | help='the batch size of dataset in one step training') 44 | parser.add_argument("--epoch", type=int, default=5, 45 | help='the epoch count we want to train') 46 | 
parser.add_argument("--model_name", type=str, default='lstm', 47 | help='which model we want use in our task, [lstm, cnn, fc, ...]') 48 | parser.add_argument("--pool", type=str, default='max', 49 | help='the pool function, [max, mean, ...]') 50 | parser.add_argument("--activator", type=str, default='relu', 51 | help='the activate function, [relu, relu6, tanh, ...]') 52 | parser.add_argument("--filter_num", type=int, default=128, 53 | help='the number of the cnn filters') 54 | parser.add_argument("--use_pos", type=int, default=0, 55 | help='whether to use position encoding in embedding layer') 56 | parser.add_argument("--lr", type=float, default=1e-3, 57 | help='the learning rate for optimizer') 58 | 59 | 60 | # todo 还可以加入位置信息在embedding层 61 | # todo pool层还可以加入attention pool 62 | 63 | 64 | tf.logging.set_verbosity(tf.logging.INFO) 65 | ARGS, unparsed = parser.parse_known_args() 66 | print(ARGS) 67 | sys.stdout.flush() 68 | 69 | 70 | def init_data(file_name, type=None): 71 | """ 72 | 初始化数据集并构建input function 73 | :param file_name: 74 | :param type: 75 | :return: 76 | """ 77 | data = cPickle.load(open(file_name, 'rb'))[type] 78 | 79 | data_builder = datasets.DataBuilder(data) 80 | dataset = data_builder.build_dataset() 81 | 82 | def train_input(): 83 | return data_builder.get_train_batch(dataset, ARGS.batch_size, ARGS.epoch) 84 | 85 | def test_input(): 86 | return data_builder.get_test_batch(dataset, ARGS.batch_size) 87 | 88 | return train_input if type == 'train' else test_input 89 | 90 | 91 | def make_model(): 92 | """ 93 | 构建模型 94 | :return: 95 | """ 96 | vocab_size = ARGS.vocab_size 97 | emb_size = ARGS.emb_size 98 | print(f'the model name is {ARGS.model_name}') 99 | if ARGS.model_name == 'lstm': 100 | model = BiLstmModel(vocab_size, emb_size, ARGS) 101 | elif ARGS.model_name == 'cnn': 102 | model = TextCnn(vocab_size, emb_size, ARGS) 103 | elif ARGS.model_name == 'fc': 104 | model = FCModel(vocab_size, emb_size, ARGS) 105 | else: 106 | raise KeyError('the model type is not implemented!') 107 | return model 108 | 109 | 110 | def model_fn(features, labels, mode, params): 111 | """ 112 | the model fn 113 | :return: 114 | """ 115 | model = make_model() 116 | 117 | if isinstance(features, dict): 118 | features = features['words'] 119 | 120 | words = features 121 | 122 | if mode == tf.estimator.ModeKeys.PREDICT: 123 | logits = model(words, training=False) 124 | 125 | prediction = {'class_id': tf.argmax(logits, axis=1, name='class_out'), 126 | 'prob': tf.nn.softmax(logits, name='prob_out')} 127 | 128 | return tf.estimator.EstimatorSpec( 129 | mode=mode, 130 | predictions=prediction, 131 | export_outputs={'classify': tf.estimator.export.PredictOutput(prediction)} 132 | ) 133 | else: 134 | if mode == tf.estimator.ModeKeys.TRAIN: 135 | logits = model(words, training=True) 136 | weights = tf.constant( 137 | [0.9, 0.9, 0.9, 0.9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1.2, 1.5]) 138 | weights = tf.gather(weights, labels) 139 | loss = tf.losses.sparse_softmax_cross_entropy( 140 | labels, logits, 141 | weights=weights, 142 | reduction=tf.losses.Reduction.MEAN) 143 | prediction = tf.argmax(logits, axis=1) 144 | accuracy = tf.metrics.accuracy(labels=labels, 145 | predictions=prediction) 146 | tf.identity(accuracy[1], name='train_accuracy') 147 | tf.summary.scalar('train_accuracy', accuracy[1]) 148 | optimizer = tf.train.AdamOptimizer(learning_rate=ARGS.lr) 149 | return tf.estimator.EstimatorSpec( 150 | mode=mode, 151 | loss=loss, 152 | train_op=optimizer.minimize(loss, tf.train.get_or_create_global_step()) 153 | ) 154 
| else: 155 | logits = model(words, training=False) 156 | prediction = tf.argmax(logits, axis=1) 157 | # tf原始的metrics不支持多类别计算 158 | precision = tf_metrics.precision(labels, prediction, ARGS.num_label) 159 | recall = tf_metrics.recall(labels, prediction, ARGS.num_label) 160 | accuracy = tf.metrics.accuracy(labels, predictions=prediction) 161 | metrics = { 162 | 'accuracy': accuracy, 163 | 'recall': recall, 164 | 'precision': precision 165 | } 166 | return tf.estimator.EstimatorSpec( 167 | mode=mode, 168 | loss=tf.constant(0), 169 | eval_metric_ops=metrics 170 | ) 171 | 172 | 173 | def main_es(unparsed): 174 | """ 175 | main method 176 | :param unparsed: 177 | :return: 178 | """ 179 | cur_time = time.time() 180 | model_dir = ARGS.model_ckpt_dir + str(int(cur_time)) 181 | 182 | classifer = tf.estimator.Estimator( 183 | model_fn=model_fn, 184 | model_dir=model_dir, 185 | params={} 186 | ) 187 | 188 | # train model 189 | train_input = init_data(ARGS.train_path, 'train') 190 | tensors_to_log = {'train_accuracy': 'train_accuracy'} 191 | logging_hook = tf.train.LoggingTensorHook(tensors=tensors_to_log, every_n_iter=100) 192 | classifer.train(input_fn=train_input, hooks=[logging_hook]) 193 | 194 | # eval model 195 | eval_input = init_data(ARGS.eval_path, 'test') 196 | eval_res = classifer.evaluate(input_fn=eval_input) 197 | print(f'Evaluation res is : \n\t{eval_res}') 198 | 199 | if ARGS.model_pb_dir: 200 | words = tf.placeholder(tf.int64, [None, None], name='input_words') 201 | input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({ 202 | 'words': words 203 | }) 204 | classifer.export_savedmodel(ARGS.model_pb_dir, input_fn) 205 | 206 | 207 | if __name__ == '__main__': 208 | tf.app.run(main=main_es, argv=[sys.argv[0]] + unparsed) 209 | --------------------------------------------------------------------------------
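补充说明:train_main.py 的 model_fn 训练分支里,weights=[0.9, 0.9, 0.9, 0.9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1.2, 1.5] 是按 preprocess.py 中 label_dict_default 的 id 顺序(即 109, 104, …, 114)人工设定的:高频类别略降权,低频类别(100、114)加权,README 中也说明具体数值需结合自己的数据分布调整。如果想根据类别频次自动生成一组权重,可以参考下面的示意(非仓库原有代码,alpha 为演示用的平滑系数):

```python
# 示意:根据训练集标签分布自动生成类别权重,顺序与 label_dict_default 的 id 顺序一致。
# counts 来自 tnews_data_eda.ipynb 统计的各标签样本数(对应标签 109, 104, ..., 114)。
counts = [5955, 5200, 4976, 4851, 4118, 4081, 3991, 3632,
          3437, 3390, 3368, 2886, 2107, 1111, 257]

alpha = 0.3  # 平滑系数:取 0 表示不加权(全为 1),取 1 表示完全按逆频率加权
total = sum(counts)
raw = [(total / c) ** alpha for c in counts]
mean = sum(raw) / len(raw)
class_weights = [round(w / mean, 2) for w in raw]  # 归一化到均值约为 1,可用来替换 model_fn 中的 weights
print(class_weights)
```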