├── README.md
├── bert
│   └── README.md
├── data
│   ├── README.md
│   ├── ccks2019_event_entity_extract
│   │   ├── event_type_entity_extract_eval.csv
│   │   └── event_type_entity_extract_train.csv
│   └── event_type_entity_extract_test.csv
├── requirements.txt
└── src
    └── SEBERT_model.py

/README.md:
--------------------------------------------------------------------------------
1 | # SEBERTNets: An Event Subject Extraction Method for the Financial Domain
2 | 
3 | 
4 | # Introduction
5 | 
6 | Event recognition is one of the key tasks in public-opinion monitoring and in the financial domain, where events are an important reference for investment analysis and asset-management decisions. The difficulty of event recognition lies in determining the event type and the event subject. For example, in "公司A产品出现添加剂,其下属子公司B和公司C遭到了调查" (additives were found in Company A's product, and its subsidiaries Company B and Company C were investigated), the subject of the event type "产品出现问题" (product problem) is Company A, not Company B or Company C. The entity in which a specific event type occurs is called the event subject. In this task, event subjects are restricted to companies and institutions, and the event types include: product problems, executive share reduction, legal and regulatory violations, and so on.
7 | 
8 | 
9 | The goal of this evaluation task is to extract the subject of a specific event type from real news text: given a text T and the event type S it belongs to, extract from T the subject of event type S.
10 | 
11 | 
12 | Input: a piece of text and an event type S
13 | 
14 | Output: the event subject
15 | 
16 | Examples:
17 | 
18 | Example 1
19 | 
20 | Input: "公司A产品出现添加剂,其下属子公司B和公司C遭到了调查", "产品出现问题"
21 | 
22 | Output: "公司A"
23 | 
24 | 
25 | 
26 | Example 2
27 | 
28 | Input: "公司A高管涉嫌违规减持", "交易违规"
29 | 
30 | Output: "公司A"
31 | 
32 | 
33 | # Download the data
34 | 
35 | Download the dataset from
36 | Baidu Netdisk:
37 | - ccks2019_event_entity_extract.zip
38 | 
39 | https://pan.baidu.com/s/1HNTcqWf0594rtmwBd1p9HQ
40 | Extraction code: jh4u
41 | 
42 | 
43 | 
44 | - event_type_entity_extract_test.csv
45 | 
46 | https://pan.baidu.com/s/1cWRq-9IKx8lOWakZFLhS-A
47 | Extraction code: qdr9
48 | 
49 | or
50 | 
51 | download the dataset from
52 | Dropbox:
53 | - ccks2019_event_entity_extract.zip
54 | 
55 | https://www.dropbox.com/s/lli5mgip2clguya/ccks2019_event_entity_extract.zip?dl=0
56 | 
57 | - event_type_entity_extract_test.csv
58 | 
59 | https://www.dropbox.com/s/e0ajdb93s2lfdw0/event_type_entity_extract_test.csv?dl=0
60 | 
61 | 
62 | # Method
63 | 
64 | Event recognition is one of the key tasks in public-opinion monitoring and in the financial domain, where it is an important reference for investment analysis and asset management. Its difficulty lies in determining the event type and the event subject. This work proposes a new model, Sequence Enhanced BERT Networks (SEBERTNets), which inherits the strengths of BERT, namely strong performance from only a small number of labelled samples, while using a sequence model (such as GRU or LSTM) to capture the sequential semantic information of the text.
65 | 
66 | 
67 | # Environment
68 | 
69 | - tensorflow==1.14.0
70 | 
71 | - keras==2.2.4
72 | 
73 | - keras-bert==0.89.0
74 | 
75 | - scikit-learn==0.24.2
76 | 
77 | - pandas==1.1.5
78 | 
79 | - tqdm==4.64.0
80 | 
81 | 
82 | # How to run
83 | 
84 | ```shell
85 | wget -P ./bert https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
86 | unzip -d ./bert ./bert/chinese_L-12_H-768_A-12.zip
87 | cd src
88 | python SEBERT_model.py
89 | ```
90 | 
91 | 
92 | # Running with Docker
93 | ```shell
94 | sudo docker pull tensorflow/tensorflow:1.14.0-gpu-py3
95 | sudo docker run -dit --restart unless-stopped --gpus=all --name=evententityextraction tensorflow/tensorflow:1.14.0-gpu-py3
96 | sudo docker exec -it evententityextraction bash
97 | apt update
98 | apt install git wget unzip
99 | git clone https://github.com/hecongqing/CCKS2019_EventEntityExtraction_Rank5.git
100 | cd CCKS2019_EventEntityExtraction_Rank5
101 | pip install -r requirements.txt
102 | wget -P ./bert https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
103 | unzip -d ./bert ./bert/chinese_L-12_H-768_A-12.zip
104 | # place the competition data under ./data (see "Download the data" above), then:
105 | cd src
106 | python SEBERT_model.py
107 | ```
108 | 
--------------------------------------------------------------------------------
/bert/README.md:
--------------------------------------------------------------------------------
1 | wget https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
2 | 
--------------------------------------------------------------------------------
/data/README.md:
--------------------------------------------------------------------------------
1 | # Add dataset
2 | 
3 | download dataset from
4 | Baidu Netdisk
5 | - ccks2019_event_entity_extract.zip
6 | 
7 | https://pan.baidu.com/share/init?surl=HNTcqWf0594rtmwBd1p9HQ\
8 | 提取码:jh4u
9 | 10 | - event_type_entity_extract_test.csv 11 | 12 | https://pan.baidu.com/share/init?surl=cWRq-9IKx8lOWakZFLhS-A\ 13 | 提取码:qdr9 14 | 15 | or 16 | 17 | download dataset from 18 | Dropbox 19 | - ccks2019_event_entity_extract.zip 20 | 21 | https://www.dropbox.com/s/lli5mgip2clguya/ccks2019_event_entity_extract.zip?dl=0 22 | 23 | - event_type_entity_extract_test.csv 24 | 25 | https://www.dropbox.com/s/e0ajdb93s2lfdw0/event_type_entity_extract_test.csv?dl=0 -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | absl-py==0.7.1 2 | asn1crypto==0.24.0 3 | astor==0.8.0 4 | backcall==0.2.0 5 | cryptography==2.1.4 6 | decorator==5.1.1 7 | gast==0.2.2 8 | google-pasta==0.1.7 9 | grpcio==1.21.1 10 | h5py==2.9.0 11 | idna==2.6 12 | importlib-resources==5.4.0 13 | ipython==7.16.3 14 | ipython-genutils==0.2.0 15 | jedi==0.17.2 16 | joblib==1.1.0 17 | Keras==2.2.4 18 | Keras-Applications==1.0.8 19 | keras-bert==0.89.0 20 | keras-embed-sim==0.10.0 21 | keras-layer-normalization==0.16.0 22 | keras-multi-head==0.29.0 23 | keras-pos-embd==0.13.0 24 | keras-position-wise-feed-forward==0.8.0 25 | Keras-Preprocessing==1.1.0 26 | keras-self-attention==0.51.0 27 | keras-transformer==0.40.0 28 | keyring==10.6.0 29 | keyrings.alt==3.0 30 | Markdown==3.1.1 31 | numpy==1.16.4 32 | pandas==1.1.5 33 | parso==0.7.1 34 | pexpect==4.8.0 35 | pickleshare==0.7.5 36 | prompt-toolkit==3.0.29 37 | protobuf==3.8.0 38 | ptyprocess==0.7.0 39 | pycrypto==2.6.1 40 | Pygments==2.12.0 41 | pygobject==3.26.1 42 | python-apt==1.6.4 43 | python-dateutil==2.8.2 44 | pytz==2022.1 45 | pyxdg==0.25 46 | PyYAML==6.0 47 | scikit-learn==0.24.2 48 | scipy==1.5.4 49 | SecretStorage==2.3.1 50 | six==1.11.0 51 | sklearn==0.0 52 | tensorboard==1.14.0 53 | tensorflow-estimator==1.14.0 54 | tensorflow-gpu==1.14.0 55 | termcolor==1.1.0 56 | threadpoolctl==3.1.0 57 | tqdm==4.64.0 58 | traitlets==4.3.3 59 | wcwidth==0.2.5 60 | Werkzeug==0.15.4 61 | wrapt==1.11.2 62 | zipp==3.6.0 63 | -------------------------------------------------------------------------------- /src/SEBERT_model.py: -------------------------------------------------------------------------------- 1 | #! 
-*- coding: utf-8 -*- 2 | 3 | import tensorflow as tf 4 | gpu_options = tf.GPUOptions(allow_growth=True) 5 | sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) 6 | # os.environ["TF_KERAS"]='1' 7 | import json 8 | from tqdm import tqdm 9 | import os, re 10 | import numpy as np 11 | import pandas as pd 12 | from keras_bert import load_trained_model_from_checkpoint, Tokenizer 13 | import codecs 14 | import gc 15 | from random import choice 16 | import tensorflow.keras.backend as K 17 | import tensorflow as tf 18 | from tensorflow.keras.layers import * 19 | # from tensorflow.keras.engine.topology import Layer 20 | from tensorflow.keras.models import Model 21 | from tensorflow.keras.callbacks import * 22 | from tensorflow.keras.optimizers import Adam,SGD 23 | from sklearn.model_selection import KFold 24 | 25 | 26 | DEBUG = True 27 | if DEBUG == True: 28 | maxlen = 32 29 | else: 30 | maxlen = 140 # 140 31 | 32 | 33 | learning_rate = 5e-5 # 5e-5 34 | min_learning_rate = 1e-5 # 1e-5 35 | 36 | 37 | config_path = '../bert/chinese_L-12_H-768_A-12/bert_config.json' 38 | checkpoint_path = '../bert/chinese_L-12_H-768_A-12/bert_model.ckpt' 39 | dict_path = '../bert/chinese_L-12_H-768_A-12/vocab.txt' 40 | 41 | model_save_path="./" 42 | 43 | import os 44 | if not os.path.exists(model_save_path): 45 | os.mkdir(model_save_path) 46 | 47 | 48 | 49 | token_dict = {} 50 | 51 | with codecs.open(dict_path, 'r', 'utf8') as reader: 52 | for line in reader: 53 | token = line.strip() 54 | token_dict[token] = len(token_dict) 55 | 56 | 57 | class OurTokenizer(Tokenizer): 58 | def _tokenize(self, text): 59 | R = [] 60 | for c in text: 61 | if c in self._token_dict: 62 | R.append(c) 63 | elif self._is_space(c): 64 | R.append('[unused1]') # space类用未经训练的[unused1]表示 65 | else: 66 | R.append('[UNK]') # 剩余的字符是[UNK] 67 | return R 68 | 69 | tokenizer = OurTokenizer(token_dict) 70 | 71 | 72 | def seq_padding(X, padding=0): 73 | L = [len(x) for x in X] 74 | ML = max(L) 75 | return np.array([ 76 | np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X 77 | ]) 78 | 79 | def list_find(list1, list2): 80 | """在list1中寻找子串list2,如果找到,返回第一个下标; 81 | 如果找不到,返回-1。 82 | """ 83 | n_list2 = len(list2) 84 | for i in range(len(list1)): 85 | if list1[i: i+n_list2] == list2: 86 | return i 87 | return -1 88 | 89 | #读取训练集 90 | D = pd.read_csv('../data/ccks2019_event_entity_extract/event_type_entity_extract_train.csv', encoding='utf-8', names = ['a','b','c','d']) 91 | D = D[D["c"] != u'其他'] 92 | classes = set(D["c"].unique()) 93 | 94 | entity_train= list(set(D['d'].values.tolist())) 95 | 96 | 97 | # ClearData 98 | D.drop("a", axis=1, inplace=True) # drop id 99 | D["d"] = D["d"].map(lambda x:x.replace(u'其他','')) 100 | D["e"] = D.apply(lambda row:1 if row[2] in row[0] else 0,axis=1) 101 | D = D[D["e"] == 1] 102 | #D.drop_duplicates(["b","c"],keep='last',inplace = True) # drop duplicates 103 | 104 | train_data = [] 105 | for t,c,n in zip(D["b"], D["c"], D["d"]): 106 | train_data.append((t, c, n)) 107 | 108 | D = pd.read_csv('../data/ccks2019_event_entity_extract/event_type_entity_extract_eval.csv',header=None,names=["id","text","event"]) 109 | D['event']=D['event'].map(lambda x: "公司股市异常" if x=="股市异常" else x) 110 | D['text']=D['text'].map(lambda x:x.replace("\x07","").replace("\x05","").replace("\x08","").replace("\x06","").replace("\x04","")) 111 | 112 | import re 113 | comp=re.compile(r"(\d{4}-\d{1,2}-\d{1,2})") 114 | D['text']=D['text'].map(lambda x:re.sub(comp,"▲",x)) 115 | 116 | test_data = [] 117 | for id,t,c in 
zip(D["id"], D["text"], D["event"]): 118 | test_data.append((id, t, c)) 119 | 120 | 121 | additional_chars = set() 122 | for d in train_data: 123 | additional_chars.update(re.findall(u'[^\u4e00-\u9fa5a-zA-Z0-9\*]', d[2])) 124 | 125 | additional_chars.remove(u',') 126 | 127 | 128 | class data_generator: 129 | def __init__(self, data, batch_size=32): 130 | self.data = data 131 | self.batch_size = batch_size 132 | self.steps = len(self.data) // self.batch_size 133 | if len(self.data) % self.batch_size != 0: 134 | self.steps += 1 135 | def __len__(self): 136 | return self.steps 137 | def __iter__(self): 138 | while True: 139 | idxs = list(range(len(self.data))) 140 | np.random.shuffle(idxs) 141 | X1, X2, S1, S2 = [], [], [], [] 142 | for i in idxs: 143 | d = self.data[i] 144 | text, c = d[0][:maxlen], d[1] 145 | text = u'___%s___%s' % (c, text) 146 | tokens = tokenizer.tokenize(text) 147 | e = d[2] 148 | e_tokens = tokenizer.tokenize(e)[1:-1] 149 | s1, s2 = np.zeros(len(tokens)), np.zeros(len(tokens)) 150 | start = list_find(tokens, e_tokens) 151 | if start != -1: 152 | end = start + len(e_tokens) - 1 153 | s1[start] = 1 154 | s2[end] = 1 155 | x1, x2 = tokenizer.encode(first=text) 156 | X1.append(x1) 157 | X2.append(x2) 158 | S1.append(s1) 159 | S2.append(s2) 160 | if len(X1) == self.batch_size or i == idxs[-1]: 161 | X1 = seq_padding(X1) 162 | X2 = seq_padding(X2) 163 | S1 = seq_padding(S1) 164 | S2 = seq_padding(S2) 165 | yield [X1, X2, S1, S2], None 166 | X1, X2, S1, S2 = [], [], [], [] 167 | 168 | 169 | 170 | 171 | 172 | 173 | from keras.optimizers import Optimizer 174 | import keras.backend as K 175 | 176 | 177 | class AccumOptimizer(Optimizer): 178 | """继承Optimizer类,包装原有优化器,实现梯度累积。 179 | # 参数 180 | optimizer:优化器实例,支持目前所有的keras优化器; 181 | steps_per_update:累积的步数。 182 | # 返回 183 | 一个新的keras优化器 184 | Inheriting Optimizer class, wrapping the original optimizer 185 | to achieve a new corresponding optimizer of gradient accumulation. 186 | # Arguments 187 | optimizer: an instance of keras optimizer (supporting 188 | all keras optimizers currently available); 189 | steps_per_update: the steps of gradient accumulation 190 | # Returns 191 | a new keras optimizer. 192 | """ 193 | def __init__(self, optimizer, steps_per_update=1, **kwargs): 194 | super(AccumOptimizer, self).__init__(**kwargs) 195 | self.optimizer = optimizer 196 | with K.name_scope(self.__class__.__name__): 197 | self.steps_per_update = steps_per_update 198 | self.iterations = K.variable(0, dtype='int64', name='iterations') 199 | self.cond = K.equal(self.iterations % self.steps_per_update, 0) 200 | self.lr = self.optimizer.lr 201 | self.optimizer.lr = K.switch(self.cond, self.optimizer.lr, 0.) 202 | for attr in ['momentum', 'rho', 'beta_1', 'beta_2']: 203 | if hasattr(self.optimizer, attr): 204 | value = getattr(self.optimizer, attr) 205 | setattr(self, attr, value) 206 | setattr(self.optimizer, attr, K.switch(self.cond, value, 1 - 1e-7)) 207 | for attr in self.optimizer.get_config(): 208 | if not hasattr(self, attr): 209 | value = getattr(self.optimizer, attr) 210 | setattr(self, attr, value) 211 | # 覆盖原有的获取梯度方法,指向累积梯度 212 | # Cover the original get_gradients method with accumulative gradients. 
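        # Note: the nested function below is assigned to the *instance* attribute self.optimizer.get_gradients, so it is never bound and takes no `self`; it simply returns the accumulated gradients averaged over steps_per_update.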
213 | def get_gradients(loss, params): 214 | return [ag / self.steps_per_update for ag in self.accum_grads] 215 | self.optimizer.get_gradients = get_gradients 216 | def get_updates(self, loss, params): 217 | self.updates = [ 218 | K.update_add(self.iterations, 1), 219 | K.update_add(self.optimizer.iterations, K.cast(self.cond, 'int64')), 220 | ] 221 | # 累积梯度 (gradient accumulation) 222 | self.accum_grads = [K.zeros(K.int_shape(p), dtype=K.dtype(p)) for p in params] 223 | grads = self.get_gradients(loss, params) 224 | for g, ag in zip(grads, self.accum_grads): 225 | self.updates.append(K.update(ag, K.switch(self.cond, ag * 0, ag + g))) 226 | # 继承optimizer的更新 (inheriting updates of original optimizer) 227 | self.updates.extend(self.optimizer.get_updates(loss, params)[1:]) 228 | self.weights.extend(self.optimizer.weights) 229 | return self.updates 230 | def get_config(self): 231 | iterations = K.eval(self.iterations) 232 | K.set_value(self.iterations, 0) 233 | config = self.optimizer.get_config() 234 | K.set_value(self.iterations, iterations) 235 | return config 236 | 237 | 238 | def modify_bert_model_3(): # BiGRU + DNN # 239 | # [0.8855585831063352, 0.878065395095436, 0.8739782016349456, 0.8773841961853542, 240 | # 0.8827539195638037, 0.8766189502386503, 0.8684389911384458, 0.8663940013633947, 241 | # 0.8663940013633947, 0.8718473074301977] 242 | # mean score: 0.8747433547119957 243 | bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path) 244 | 245 | for l in bert_model.layers: 246 | l.trainable = True 247 | 248 | x1_in = Input(shape=(None,)) # 待识别句子输入 249 | x2_in = Input(shape=(None,)) # 待识别句子输入 250 | s1_in = Input(shape=(None,)) # 实体左边界(标签) 251 | s2_in = Input(shape=(None,)) # 实体右边界(标签) 252 | 253 | x1, x2, s1, s2 = x1_in, x2_in, s1_in, s2_in 254 | x_mask = Lambda(lambda x: K.cast(K.greater(K.expand_dims(x, 2), 0), 'float32'))(x1) 255 | x = bert_model([x1, x2]) 256 | 257 | l = Lambda(lambda t: t[:, -1])(x) 258 | x = Add()([x, l]) 259 | x = Dropout(0.1)(x) 260 | x = Lambda(lambda x: x[0] * x[1])([x, x_mask]) 261 | 262 | x = SpatialDropout1D(0.1)(x) 263 | x = Bidirectional(CuDNNGRU(200, return_sequences=True))(x) 264 | x = Lambda(lambda x: x[0] * x[1])([x, x_mask]) 265 | x = Bidirectional(CuDNNGRU(200, return_sequences=True))(x) 266 | x = Lambda(lambda x: x[0] * x[1])([x, x_mask]) 267 | 268 | x = Dense(1024, use_bias=False, activation='tanh')(x) 269 | x = Dropout(0.2)(x) 270 | x = Dense(64, use_bias=False, activation='tanh')(x) 271 | x = Dropout(0.2)(x) 272 | x = Dense(8, use_bias=False, activation='tanh')(x) 273 | 274 | ps1 = Dense(1, use_bias=False)(x) 275 | ps1 = Lambda(lambda x: x[0][..., 0] - (1 - x[1][..., 0]) * 1e10)([ps1, x_mask]) 276 | ps2 = Dense(1, use_bias=False)(x) 277 | ps2 = Lambda(lambda x: x[0][..., 0] - (1 - x[1][..., 0]) * 1e10)([ps2, x_mask]) 278 | 279 | model = Model([x1_in, x2_in], [ps1, ps2]) 280 | 281 | train_model = Model([x1_in, x2_in, s1_in, s2_in], [ps1, ps2]) 282 | 283 | loss1 = K.mean(K.categorical_crossentropy(s1_in, ps1, from_logits=True)) 284 | ps2 -= (1 - K.cumsum(s1, 1)) * 1e10 285 | loss2 = K.mean(K.categorical_crossentropy(s2_in, ps2, from_logits=True)) 286 | loss = loss1 + loss2 287 | 288 | train_model.add_loss(loss) 289 | train_model.compile(optimizer=Adam(learning_rate),metrics=['accuracy']) 290 | train_model.summary() 291 | return model, train_model 292 | 293 | 294 | 295 | 296 | def modify_bert_model_h3(): # BiGRU + DNN # 297 | # [0.8855585831063352, 0.878065395095436, 0.8739782016349456, 0.8773841961853542, 298 | # 
0.8827539195638037, 0.8766189502386503, 0.8684389911384458, 0.8663940013633947, 299 | # 0.8663940013633947, 0.8718473074301977] 300 | # mean score: 0.8747433547119957 301 | bert_model = load_trained_model_from_checkpoint(config_path, checkpoint_path) 302 | 303 | for l in bert_model.layers: 304 | l.trainable = True 305 | 306 | x1_in = Input(shape=(None,)) # 待识别句子输入 307 | x2_in = Input(shape=(None,)) # 待识别句子输入 308 | s1_in = Input(shape=(None,)) # 实体左边界(标签) 309 | s2_in = Input(shape=(None,)) # 实体右边界(标签) 310 | 311 | x1, x2, s1, s2 = x1_in, x2_in, s1_in, s2_in 312 | x_mask = Lambda(lambda x: K.cast(K.greater(K.expand_dims(x, 2), 0), 'float32'))(x1) 313 | x = bert_model([x1, x2]) 314 | 315 | l = Lambda(lambda t: t[:, -1])(x) 316 | x = Add()([x, l]) 317 | x = Dropout(0.1)(x) 318 | x = Lambda(lambda x: x[0] * x[1])([x, x_mask]) 319 | 320 | x = SpatialDropout1D(0.1)(x) 321 | x = Bidirectional(CuDNNGRU(200, return_sequences=True))(x) 322 | x = Lambda(lambda x: x[0] * x[1])([x, x_mask]) 323 | x = Bidirectional(CuDNNGRU(200, return_sequences=True))(x) 324 | x = Lambda(lambda x: x[0] * x[1])([x, x_mask]) 325 | 326 | x = Dense(1024, use_bias=False, activation='tanh')(x) 327 | x = Dropout(0.2)(x) 328 | x = Dense(64, use_bias=False, activation='tanh')(x) 329 | x = Dropout(0.2)(x) 330 | x = Dense(8, use_bias=False, activation='tanh')(x) 331 | 332 | ps1 = Dense(1, use_bias=False)(x) 333 | ps1 = Lambda(lambda x: x[0][..., 0] - (1 - x[1][..., 0]) * 1e10)([ps1, x_mask]) 334 | ps2 = Dense(1, use_bias=False)(x) 335 | ps2 = Lambda(lambda x: x[0][..., 0] - (1 - x[1][..., 0]) * 1e10)([ps2, x_mask]) 336 | 337 | model = Model([x1_in, x2_in], [ps1, ps2]) 338 | 339 | train_model = Model([x1_in, x2_in, s1_in, s2_in], [ps1, ps2]) 340 | 341 | loss1 = K.mean(K.categorical_crossentropy(s1_in, ps1, from_logits=True)) 342 | ps2 -= (1 - K.cumsum(s1, 1)) * 1e10 343 | loss2 = K.mean(K.categorical_crossentropy(s2_in, ps2, from_logits=True)) 344 | loss = loss1 + loss2 345 | 346 | train_model.add_loss(loss) 347 | sgd=SGD(lr=0.0001, momentum=0.0, decay=0.0, nesterov=False) 348 | train_model.compile(optimizer=sgd,metrics=['accuracy']) 349 | train_model.summary() 350 | return model, train_model 351 | 352 | 353 | 354 | 355 | 356 | def softmax(x): 357 | x = x - np.max(x) 358 | x = np.exp(x) 359 | return x / np.sum(x) 360 | 361 | def extract_entity(text_in, c_in): 362 | if c_in not in classes: 363 | return 'NaN' 364 | text_in = u'___%s___%s' % (c_in, text_in) 365 | text_in = text_in[:510] 366 | _tokens = tokenizer.tokenize(text_in) 367 | _x1, _x2 = tokenizer.encode(first=text_in) 368 | _x1, _x2 = np.array([_x1]), np.array([_x2]) 369 | _ps1, _ps2 = model.predict([_x1, _x2]) 370 | _ps1, _ps2 = softmax(_ps1[0]), softmax(_ps2[0]) 371 | for i, _t in enumerate(_tokens): 372 | if len(_t) == 1 and re.findall(u'[^\u4e00-\u9fa5a-zA-Z0-9\*]', _t) and _t not in additional_chars: 373 | _ps1[i] -= 10 374 | start = _ps1.argmax() 375 | for end in range(start, len(_tokens)): 376 | _t = _tokens[end] 377 | if len(_t) == 1 and re.findall(u'[^\u4e00-\u9fa5a-zA-Z0-9\*]', _t) and _t not in additional_chars: 378 | break 379 | end = _ps2[start:end+1].argmax() + start 380 | a = text_in[start-1: end] 381 | return a 382 | 383 | class Evaluate(Callback): 384 | def __init__(self, dev_data, model_path): 385 | self.ACC = [] 386 | self.best = 0. 
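        # passed counts the training batches seen so far; on_batch_begin uses it to warm the learning rate up over the first epoch and then decay it towards min_learning_rate over the second.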
387 | self.passed = 0 388 | self.dev_data = dev_data 389 | self.model_path = model_path 390 | def on_batch_begin(self, batch, logs=None): 391 | """第一个epoch用来warmup,第二个epoch把学习率降到最低 392 | """ 393 | if self.passed < self.params['steps']: 394 | lr = (self.passed + 1.) / self.params['steps'] * learning_rate 395 | K.set_value(self.model.optimizer.lr, lr) 396 | self.passed += 1 397 | elif self.params['steps'] <= self.passed < self.params['steps'] * 2: 398 | lr = (2 - (self.passed + 1.) / self.params['steps']) * (learning_rate - min_learning_rate) 399 | lr += min_learning_rate 400 | K.set_value(self.model.optimizer.lr, lr) 401 | self.passed += 1 402 | def on_epoch_end(self, epoch, logs=None): 403 | acc = self.evaluate() 404 | self.ACC.append(acc) 405 | if acc > self.best: 406 | self.best = acc 407 | print("save best model weights ...") 408 | train_model.save_weights(self.model_path) 409 | print('acc: %.4f, best acc: %.4f\n' % (acc, self.best)) 410 | def evaluate(self): 411 | A = 1e-10 412 | # F = open('dev_pred.json', 'w', encoding = 'utf-8') 413 | for d in tqdm(iter(self.dev_data)): 414 | R = extract_entity(d[0], d[1]) 415 | if R == d[2]: 416 | A += 1 417 | # s = ', '.join(d + (R,)) 418 | # F.write(s + "\n") 419 | # F.close() 420 | return A / len(self.dev_data) 421 | 422 | def test(test_data,result_path): 423 | F = open(result_path, 'w', encoding = 'utf-8') 424 | for d in tqdm(iter(test_data)): 425 | s = u'"%s","%s"\n' % (d[0], extract_entity(d[1], d[2])) 426 | # s = s.encode('utf-8') 427 | F.write(s) 428 | F.close() 429 | 430 | 431 | def evaluate(dev_data): 432 | A = 1e-10 433 | # F = open('dev_pred.json', 'w', encoding='utf-8') 434 | for d in tqdm(iter(dev_data)): 435 | R = extract_entity(d[0], d[1]) 436 | if R == d[2]: 437 | A += 1 438 | # s = ', '.join(d + (R,)) 439 | # F.write(s + "\n") 440 | # F.close() 441 | return A / len(dev_data) 442 | 443 | 444 | # Model 445 | flodnums = 10 446 | kf = KFold(n_splits=flodnums, shuffle=True, random_state=520).split(train_data) 447 | 448 | score = [] 449 | 450 | 451 | for i, (train_fold, test_fold) in enumerate(kf): 452 | print("kFlod ",i,"/",flodnums) 453 | train_ = [train_data[i] for i in train_fold] 454 | dev_ = [train_data[i] for i in test_fold] 455 | 456 | model, train_model = modify_bert_model_3() 457 | 458 | train_D = data_generator(train_) 459 | dev_D = data_generator(dev_) 460 | 461 | model_path = model_save_path+"modify_bert_model" + str(i) + ".weights" 462 | if not os.path.exists(model_path): 463 | evaluator = Evaluate(dev_, model_path) 464 | train_model.fit_generator(train_D.__iter__(), 465 | steps_per_epoch = len(train_D), 466 | epochs = 5, 467 | callbacks = [evaluator], 468 | validation_data = dev_D.__iter__(), 469 | validation_steps = len(dev_D) 470 | ) 471 | print("load best model weights ...") 472 | del train_model 473 | gc.collect() 474 | del model 475 | gc.collect() 476 | K.clear_session() 477 | 478 | model, train_model = modify_bert_model_h3() 479 | 480 | 481 | 482 | model_h_path = model_save_path+"modify_bert_model_h" + str(i) + ".weights" 483 | if not os.path.exists(model_h_path): 484 | train_model.load_weights(model_path) 485 | evaluator = Evaluate(dev_, model_h_path) 486 | train_model.fit_generator(train_D.__iter__(), 487 | steps_per_epoch = len(train_D), 488 | epochs = 10, 489 | callbacks = [evaluator], 490 | validation_data = dev_D.__iter__(), 491 | validation_steps = len(dev_D) 492 | ) 493 | score.append(evaluate(dev_)) 494 | print("valid evluation:", score) 495 | print("valid mean score:", np.mean(score)) 496 | 497 | 
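    # Reload the best weights saved by the Evaluate callback for this fold before writing its test-set predictions.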
    train_model.load_weights(model_h_path)
498 | 
499 | 
500 | 
501 | 
502 |     # write this fold's predictions for the test set
503 |     result_path = model_save_path+"result_k" + str(i) + ".txt"
504 |     test(test_data,result_path)
505 | 
506 |     del train_model
507 |     gc.collect()
508 |     del model
509 |     gc.collect()
510 |     K.clear_session()
511 | 
512 | 
513 | 
514 | 
515 | 
516 | # Ensemble: merge the per-fold prediction files and keep, for each sid, the company predicted most often.
517 | ####### Submit #######
518 | data = pd.DataFrame(columns=["sid","company"])
519 | 
520 | dataid = pd.read_csv(model_save_path+"result_k0.txt",sep=',',names=["sid","company"])[['sid']]
521 | 
522 | for i in range(flodnums):
523 |     datak = pd.read_csv(model_save_path+"result_k"+str(i)+".txt",sep=',',names=["sid","company"])
524 |     print(datak.shape)
525 |     data = pd.concat([data, datak], axis = 0)
526 | 
527 | submit = data.groupby(['sid','company']).size().reset_index(name='count')  # the old dict-renaming .agg({"count":"count"}) form raises SpecificationError on pandas >= 1.0 (requirements pin pandas==1.1.5)
528 | 
529 | print(submit.shape)
530 | print(submit[submit.company == 'NaN'])
531 | 
532 | submit = submit.sort_values(by=["sid","count"],ascending=False).groupby("sid",as_index=False).first()
533 | 
534 | print(submit.shape)
535 | 
536 | submit = dataid.merge(submit,how='left',on = 'sid').fillna("NaN")
537 | print(data[['sid']].drop_duplicates().shape)
538 | print(submit.shape)
539 | 
540 | submit[['sid','company']].to_csv(model_save_path+"result.txt",header=None,index=False,sep=',')
541 | 
542 | print(submit)
543 | 
--------------------------------------------------------------------------------