├── .gitignore
├── LICENSE
├── README.md
├── baseline
│   ├── DemoModel.py
│   ├── config.py
│   ├── data_process.py
│   ├── data_process_aug.py
│   ├── examine_dev.py
│   ├── examine_dev_ensemble.py
│   ├── file_save.py
│   ├── main.py
│   ├── model.py
│   ├── nn_func.py
│   └── util.py
├── best_single_model
│   ├── config.py
│   ├── data_process_addAnswer.py
│   ├── examine_dev.py
│   ├── file_save.py
│   ├── focal_loss.py
│   ├── main.py
│   ├── model_addAnswer_newGraph.py
│   ├── nlp_feature.json
│   ├── nn_func.py
│   └── util_addAnswer.py
├── ensemble
│   ├── dev_soft
│   │   ├── model_char_1102_0.7278.txt
│   │   ├── model_newgraph_1101_0.7474.txt
│   │   └── model_newgraph_2lr_1101_0.7459.txt
│   ├── ensemble_predict.py
│   ├── ensemble_train.py
│   └── test_soft
│       ├── model_char_1102_0.7278.txt
│       ├── model_newgraph_1101_0.7474.txt
│       └── model_newgraph_2lr_1101_0.7459.txt
└── pics
    └── model.png
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 | 
6 | # C extensions
7 | *.so
8 | 
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | MANIFEST
27 | 
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 | 
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 | 
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 | .pytest_cache/
49 | 
50 | # Translations
51 | *.mo
52 | *.pot
53 | 
54 | # Django stuff:
55 | *.log
56 | local_settings.py
57 | db.sqlite3
58 | 
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 | 
63 | # Scrapy stuff:
64 | .scrapy
65 | 
66 | # Sphinx documentation
67 | docs/_build/
68 | 
69 | # PyBuilder
70 | target/
71 | 
72 | # Jupyter Notebook
73 | .ipynb_checkpoints
74 | 
75 | # pyenv
76 | .python-version
77 | 
78 | # celery beat schedule file
79 | celerybeat-schedule
80 | 
81 | # SageMath parsed files
82 | *.sage.py
83 | 
84 | # Environments
85 | .env
86 | .venv
87 | env/
88 | venv/
89 | ENV/
90 | env.bak/
91 | venv.bak/
92 | 
93 | # Spyder project settings
94 | .spyderproject
95 | .spyproject
96 | 
97 | # Rope project settings
98 | .ropeproject
99 | 
100 | # mkdocs documentation
101 | /site
102 | 
103 | # mypy
104 | .mypy_cache/
105 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2018 yuhaitao1994
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # AIchallenger_MachineReadingComprehension
2 | 8th place solution for the AI Challenger 2018 opinion-type question machine reading comprehension competition
3 | 
4 | ****
5 | 
6 | |Author|[yuhaitao](https://github.com/yuhaitao1994)|[little_white](https://github.com/faverous)|
7 | |---|---|---|
8 | 
9 | [Competition write-up]()
10 | ****
11 | 
12 | ## 1. Results
13 | |Model|Accuracy|
14 | |---|---|
15 | |baseline|72.36%|
16 | |test_A ensemble|76.39%|
17 | |best single model|75.13% (dev)|
18 | |test_B ensemble|77.33%|
19 | 
20 | 
21 | ## 2. Environment
22 | 
23 | |Environment / Library|Version|
24 | |---|---|
25 | |Ubuntu|16.04|
26 | |Python|>=3.5|
27 | |TensorFlow|>=1.6|
28 | 
29 | ## **3. baseline**
30 | 
31 | The baseline model borrows from Microsoft's R-Net; thanks to [HKUST-KnowComp](https://github.com/HKUST-KnowComp/R-Net) for their TensorFlow implementation.
32 | 
33 | Unlike R-Net, we drop the pointer network (ptrNet) at the tail of the model and replace it with a unidirectional GRU followed by a softmax layer.
34 | 
35 | ### How to run
36 | 
37 | Create a `file` directory and move the raw training, validation and test_A data into it.
38 | 
39 | Data preprocessing
40 | 
41 | python config.py --mode prepro
42 | 
43 | Training
44 | 
45 | python config.py --mode train
46 | 
47 | Evaluate on the validation set
48 | 
49 | python config.py --mode examine
50 | 
51 | Generate the test predictions
52 | 
53 | python config.py --mode test
54 | 
55 | 
56 | ## **4. best single model**
57 | 
58 | Our best-scoring single model is an improved R-Net that additionally uses the semantics of the alternatives and feature engineering.
59 | 
60 | **Alternatives semantics**: some candidate answers of opinion-type questions carry semantic information of their own, so we encode the candidate answers as well.
61 | 
62 | **Feature engineering**: we extract features with methods such as tf-idf and feed the feature vector to the deep model as an additional input, processed only by a linear layer. Because of the nature of reading-comprehension data this brings only a marginal improvement, and its code is not released.
63 | 
64 | ### Model architecture
65 | ![best single model](/pics/model.png)
66 | 
67 | ## **5. ensemble**
68 | 
69 | The final test_B submission fuses 16 models by stacking: the weight given to each model's softmax output is trained on the validation set. In principle this could overfit the validation set, but in our tests it did not.
70 | 
71 | In total we use three kinds of improved models, based on R-Net, QA-Net and BiDAF respectively.
72 | 
73 | ### How to run the ensemble
74 | 
75 | Train the ensemble weights
76 | 
77 | python ensemble_train.py
78 | 
79 | Predict the test_A results
80 | 
81 | python ensemble_predict.py
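82 | 
83 | ### Stacking sketch (illustrative)
84 | 
85 | The snippet below is only a minimal sketch of the stacking step described in section 5, not the repository's actual `ensemble_train.py`/`ensemble_predict.py`; the array shapes, the 3-option label encoding and every function and parameter name here are assumptions. The per-model softmax outputs appear to be stored under `ensemble/dev_soft/` and `ensemble/test_soft/`, and the sketch assumes they have already been loaded into NumPy arrays.
86 | 
87 | ```python
88 | import numpy as np
89 | 
90 | def train_stack_weights(dev_probs, dev_labels, lr=0.1, epochs=200):
91 |     """dev_probs: [n_models, n_examples, 3] softmax outputs on the dev set.
92 |     dev_labels: [n_examples] gold option index (0/1/2).
93 |     Learns one weight per model by minimizing the cross-entropy of the
94 |     weighted-sum distribution on the dev set, using plain gradient descent."""
95 |     w = np.zeros(dev_probs.shape[0])                   # unnormalized model weights
96 |     onehot = np.eye(3)[dev_labels]                     # [n_examples, 3]
97 |     for _ in range(epochs):
98 |         a = np.exp(w - w.max()); a /= a.sum()          # softmax over w -> mixture weights
99 |         mix = np.einsum("m,mnc->nc", a, dev_probs)     # fused distribution per example
100 |         g_mix = -(onehot / np.clip(mix, 1e-8, None)) / len(dev_labels)
101 |         g_a = np.einsum("nc,mnc->m", g_mix, dev_probs) # gradient w.r.t. the mixture weights
102 |         w -= lr * a * (g_a - np.dot(a, g_a))           # backprop through the softmax over w
103 |     a = np.exp(w - w.max())
104 |     return a / a.sum()
105 | 
106 | def stack_predict(test_probs, weights):
107 |     """test_probs: [n_models, n_examples, 3]; returns the fused option index per example."""
108 |     return np.einsum("m,mnc->nc", weights, test_probs).argmax(axis=1)
109 | ```
110 | 
111 | Weights learned this way on the dev outputs are then applied unchanged to the matching test outputs.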
--------------------------------------------------------------------------------
/baseline/DemoModel.py:
--------------------------------------------------------------------------------
1 | """
2 | AI Challenger opinion-type question machine reading comprehension
3 | 
4 | DemoModel.py: demo code that checks the model runs end to end
5 | 
6 | @author: yuhaitao
7 | """
8 | # -*- coding:utf-8 -*-
9 | 
10 | import tensorflow as tf
11 | import os
12 | import numpy as np
13 | import random
14 | from nn_func import cudnn_gru, native_gru, dot_attention, summ, dropout
15 | 
16 | os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # TensorFlow log verbosity
17 | os.environ["CUDA_VISIBLE_DEVICES"] = "1"
18 | 
19 | hidden = 75
20 | use_cudnn = False
21 | batch_size = 2
22 | learning_rate = 0.001
23 | emb_lr = 0.000001
24 | keep_prob = 0.7
25 | grad_clip = 5.0
26 | len_limit = 15
27 | 
28 | class DemoModel(object):
29 | 
30 | def __init__(self, word_mat, trainable=True, opt=True):
31 | # note: placeholders are the data entry points and must not be re-assigned inside the graph
32 | self.passage = tf.placeholder(tf.int32, [batch_size, None], name="passage")
33 | self.question = tf.placeholder(tf.int32, [batch_size, None], name="question")
34 | self.answer = tf.placeholder(tf.int32, [batch_size], name="answer")
35 | self.qa_id = tf.placeholder(tf.int32, [batch_size], name="qa_id")
36 | self.is_train = tf.placeholder(tf.bool, name="is_train")
37 | 
38 | self.global_step = tf.get_variable('global_step', shape=[], dtype=tf.int32,
39 | initializer=tf.constant_initializer(0), trainable=False)
40 | 
41 | self.word_mat = tf.get_variable("word_mat", initializer=tf.constant(
42 | word_mat, dtype=tf.float32), trainable=True)  # try out a trainable word embedding
43 | 
44 | self.c_mask = tf.cast(self.passage, tf.bool)
45 | self.q_mask = tf.cast(self.question, tf.bool)
46 | self.c_len = tf.reduce_sum(tf.cast(self.c_mask, tf.int32), axis=1)
47 | self.q_len = tf.reduce_sum(tf.cast(self.q_mask, tf.int32), axis=1)
48 | 
49 | if opt:
50 | self.c_maxlen = tf.reduce_max(self.c_len)
51 | self.q_maxlen = tf.reduce_max(self.q_len)
52 | self.c = tf.slice(self.passage, [0, 0], [batch_size, self.c_maxlen])
53 | self.q = tf.slice(self.question, [0, 0], [batch_size, self.q_maxlen])
54 | self.c_mask = tf.slice(self.c_mask, [0, 0], [
55 | batch_size, self.c_maxlen])
56 | self.q_mask = tf.slice(self.q_mask, [0, 0], [
57 | batch_size, self.q_maxlen])
58 | else:
59 | self.c_maxlen, self.q_maxlen = len_limit, len_limit  # this demo has no config object; fall back to the module-level len_limit
60 | 
61 | self.RNet()
62 | 
63 | if trainable:
64 | # use a separate learning rate for the embedding layer
65 | self.emb_lr = tf.get_variable("emb_lr", shape=[], dtype=tf.float32, trainable=False)
66 | self.learning_rate = tf.get_variable(
67 | "learning_rate", shape=[], dtype=tf.float32, trainable=False)
68 | self.emb_opt = tf.train.AdamOptimizer(learning_rate=self.emb_lr, epsilon=1e-8)
69 | self.opt = tf.train.AdamOptimizer(
70 | learning_rate=self.learning_rate, epsilon=1e-8)
71 | # split the trainable variables into two lists
72 | self.var_list = tf.trainable_variables()
73 | var_list1 = []
74 | var_list2 = []
75 | for var in self.var_list:
76 | if var.op.name == "word_mat":
77 | var_list1.append(var)
78 | else:
79 | var_list2.append(var)
80 | 
81 | grads = tf.gradients(self.loss, var_list1 + var_list2)
82 | # grads = self.opt.compute_gradients(self.loss)
83 | # gradients, variables = zip(*grads)
84 | capped_grads, _ = tf.clip_by_global_norm(
85 | grads, grad_clip)
86 | grads1 = capped_grads[:len(var_list1)]
87 | grads2 = capped_grads[len(var_list1):]
88 | self.train_op1 = self.emb_opt.apply_gradients(
89 | zip(grads1, var_list1), global_step=self.global_step)
90 | self.train_op2 = self.opt.apply_gradients(zip(grads2, var_list2))
91 | self.train_op = tf.group(self.train_op1, self.train_op2)
92 | 
93 | def RNet(self):
94 | PL, QL, d = self.c_maxlen, self.q_maxlen, hidden
95 | gru = cudnn_gru if use_cudnn else native_gru
96 | 
97 | with tf.variable_scope("embedding"):
98 | with tf.name_scope("word"):
99 | c_emb = tf.nn.embedding_lookup(self.word_mat, self.c)
100 | q_emb = tf.nn.embedding_lookup(self.word_mat, self.q)
101 | 
102 | with tf.variable_scope("encoding"):
103 | rnn = gru(num_layers=3, num_units=d, batch_size=batch_size, input_size=c_emb.get_shape(
104 | ).as_list()[-1], keep_prob=keep_prob, is_train=self.is_train)
105 | c = rnn(c_emb, seq_len=self.c_len)
106 | q = rnn(q_emb, seq_len=self.q_len)
107 | 
108 | with tf.variable_scope("attention"):
109 | qc_att = dot_attention(inputs=c, memory=q, mask=self.q_mask, hidden=d,
110 | keep_prob=keep_prob, is_train=self.is_train)
111 | rnn = gru(num_layers=1, num_units=d, batch_size=batch_size, input_size=qc_att.get_shape(
112 | ).as_list()[-1], keep_prob=keep_prob, is_train=self.is_train)
113 | att = rnn(qc_att, seq_len=self.c_len)
114 | print(att.get_shape().as_list())
115 | 
116 | 
with tf.variable_scope("match"): 117 | self_att = dot_attention( 118 | att, att, mask=self.c_mask, hidden=d, keep_prob=keep_prob, is_train=self.is_train) 119 | rnn = gru(num_layers=1, num_units=d, batch_size=batch_size, input_size=self_att.get_shape( 120 | ).as_list()[-1], keep_prob=keep_prob, is_train=self.is_train) 121 | # match:[batch_size, c_maxlen, 6*hidden] 122 | match = rnn(self_att, seq_len=self.c_len) 123 | print(match.get_shape().as_list()) 124 | 125 | with tf.variable_scope("YesNo_classification"): 126 | init = summ(q[:, :, -2 * d:], d, mask=self.q_mask, 127 | keep_prob=keep_prob, is_train=self.is_train) 128 | print(init.get_shape().as_list()) 129 | match = dropout(match, keep_prob=keep_prob, 130 | is_train=self.is_train) 131 | final_hiddens = init.get_shape().as_list()[-1] 132 | final_gru = tf.contrib.rnn.GRUCell(final_hiddens) 133 | _, final_state = tf.nn.dynamic_rnn( 134 | final_gru, match, initial_state=init, dtype=tf.float32) 135 | final_w = tf.get_variable(name="final_w", shape=[final_hiddens, 2]) 136 | final_b = tf.get_variable(name="final_b", shape=[ 137 | 2], initializer=tf.constant_initializer(0.)) 138 | self.logits = tf.matmul(final_state, final_w) 139 | self.logits = tf.nn.bias_add(self.logits, final_b) # logits:[batch_size, 3] 140 | 141 | with tf.variable_scope("softmax_and_loss"): 142 | final_softmax = tf.nn.softmax(self.logits) 143 | self.classes = tf.cast( 144 | tf.argmax(final_softmax, axis=1), dtype=tf.int32, name="classes") 145 | self.loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits( 146 | logits=self.logits, labels=tf.stop_gradient(self.answer))) 147 | 148 | def get_loss(self): 149 | return self.loss 150 | 151 | def get_global_step(self): 152 | return self.global_step 153 | 154 | def get_bacth(examples, word2idx_dict, batch_size): 155 | """ 156 | 获取mini-batch 157 | """ 158 | passages = [] 159 | questions = [] 160 | answers = [] 161 | qa_ids = [] 162 | for i in range(batch_size): 163 | passage_id = [] 164 | question_id = [] 165 | passage = examples[i]["passage"] 166 | for j in range(15): 167 | if (j + 1) <= len(passage): 168 | p = passage[j] 169 | passage_id.append(word2idx_dict[p]) 170 | else: 171 | passage_id.append(0) 172 | question = examples[i]["question"] 173 | for j in range(15): 174 | if (j + 1) <= len(question): 175 | q = question[j] 176 | question_id.append(word2idx_dict[q]) 177 | else: 178 | question_id.append(0) 179 | answer = examples[i]["answer"] 180 | qa_id = examples[i]["qa_id"] 181 | passages.append(passage_id) 182 | questions.append(question_id) 183 | answers.append(answer) 184 | qa_ids.append(qa_id) 185 | passages = np.array(passages).astype(np.int32) 186 | questions = np.array(questions) 187 | answers = np.array(answers) 188 | qa_ids = np.array(qa_ids) 189 | return passages, questions, answers, qa_ids 190 | 191 | 192 | def main(_): 193 | """ 194 | 测试模型的demo 195 | """ 196 | train_examples = [ 197 | { 198 | "passage": ['苹果', '是', '甜的', '它', '是', '硬的'], 199 | "question":['苹果', '是', '甜的', '吗'], 200 | "answer":0, 201 | "qa_id":1}, 202 | { 203 | "passage": ['橘子', '是', '酸的', '它', '是', '软的', '也是', '好吃的'], 204 | "question":['橘子', '是', '甜的', '吗'], 205 | "answer":1, 206 | "qa_id":2}, 207 | { 208 | "passage": ['梨', '是', '甜的', '它', '是', '硬的'], 209 | "question":['梨', '是', '软的', '吗'], 210 | "answer":1, 211 | "qa_id":3}, 212 | { 213 | "passage": ['西瓜', '是', '甜的', '它', '是', '硬的', '也是', '大的', '和', '圆的'], 214 | "question":['西瓜', '是', '酸的', '吗'], 215 | "answer":2, 216 | "qa_id":4} 217 | ] 218 | 219 | dev_examples = [ 220 | { 221 | 
"passage": ['葡萄', '是', '甜的', '它', '是', '软的'], 222 | "question":['葡萄', '是', '硬的', '吗'], 223 | "answer":1, 224 | "qa_id":5}, 225 | { 226 | "passage": ['香蕉', '是', '甜的', '它', '是', '软的', '也是', '好吃的'], 227 | "question":['香蕉', '是', '好吃的', '吗'], 228 | "answer":0, 229 | "qa_id":6} 230 | ] 231 | 232 | train_2_examples = [ 233 | { 234 | "passage": ['苹果'], 235 | "question": ['苹果'], 236 | "answer":0, 237 | "qa_id":7}, 238 | { 239 | "passage": ['梨'], 240 | "question": ['西瓜'], 241 | "answer":1, 242 | "qa_id":8}, 243 | { 244 | "passage": ['葡萄', '香蕉'], 245 | "question": ['葡萄', '香蕉'], 246 | "answer":0, 247 | "qa_id":9}, 248 | { 249 | "passage": ['西瓜', '橘子'], 250 | "question": ['甜的'], 251 | "answer":1, 252 | "qa_id":10}, 253 | ] 254 | 255 | dev_2_examples = [ 256 | { 257 | "passage": ['橘子'], 258 | "question": ['苹果'], 259 | "answer":1, 260 | "qa_id":11}, 261 | { 262 | "passage": ['梨', '西瓜'], 263 | "question": ['梨', '西瓜'], 264 | "answer":0, 265 | "qa_id":12}, 266 | ] 267 | 268 | word2idx_dict = {"null":0,"苹果":1,"梨":2,"西瓜":3,"葡萄":4,"香蕉":5,"橘子":6,"甜的":7,"酸的":8,"硬的":9,\ 269 | "软的":10,"大的":11,"圆的":12,"好吃的":13,"是":14,"也是":15,"它":16,"和":17,"吗":18} 270 | """ 271 | id2vec = { 272 | 0:[0.0,0.0,0.0,0.0], 273 | 1:[0.1,0.1,0.1,0.1], 274 | 2:[0.1,0.2,0.1,0.2], 275 | 3:[0.2,0.1,0.3,0.2], 276 | 4:[0.4,0.2,0.3,0.4], 277 | 5:[0.4,0.4,0.4,0.4], 278 | 6:[0.5,0.4,0.5,0.5], 279 | 7:[0.6,0.5,0.5,0.6], 280 | 8:[0.5,0.7,0.6,0.5], 281 | 9:[0.7,0.6,0.5,0.5], 282 | 10:[0.8,0.5,0.7,0.6], 283 | 11:[0.8,0.6,0.6,0.6], 284 | 12:[0.6,0.8,0.8,0.5], 285 | 13:[0.5,0.8,0.8,0.6], 286 | 14:[0.9,0.9,0.9,0.9], 287 | 15:[0.9,0.9,0.8,0.8], 288 | 17:[0.9,0.8,0.7,0.6], 289 | 18:[0.9,0.5,0.6,0.7] 290 | } 291 | """ 292 | 293 | id2vec = [ 294 | [0.0,0.0,0.0,0.0], 295 | [0.05,0.05,0.05,0.05], 296 | [0.1,0.1,0.1,0.1], 297 | [0.15,0.15,0.15,0.15], 298 | [0.2,0.2,0.2,0.2], 299 | [0.25,0.25,0.25,0.25], 300 | [0.3,0.3,0.3,0.3], 301 | [0.35,0.35,0.35,0.35], 302 | [0.4,0.4,0.4,0.4], 303 | [0.45,0.46,0.45,0.45], 304 | [0.5,0.5,0.5,0.5], 305 | [0.55,0.55,0.55,0.55], 306 | [0.6,0.6,0.6,0.6], 307 | [0.65,0.65,0.65,0.65], 308 | [0.7,0.7,0.7,0.7], 309 | [0.75,0.75,0.75,0.75], 310 | [0.8,0.8,0.8,0.8], 311 | [0.85,0.85,0.85,0.85] 312 | ] 313 | 314 | print("Building model...") 315 | word_mat = np.array(id2vec) 316 | model = DemoModel(word_mat) 317 | 318 | sess_config = tf.ConfigProto(allow_soft_placement=True) 319 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 320 | sess_config.gpu_options.allow_growth = True 321 | 322 | with tf.Session(config=sess_config) as sess: 323 | sess.run(tf.global_variables_initializer()) 324 | sess.run(tf.assign(model.learning_rate, 325 | tf.constant(learning_rate, dtype=tf.float32))) 326 | sess.run(tf.assign(model.emb_lr, tf.constant(emb_lr, dtype=tf.float32))) 327 | 328 | dev_p, dev_q, dev_a, dev_id = get_bacth(dev_2_examples, word2idx_dict, batch_size) 329 | 330 | def get_acc(outputs, targets): 331 | t = 0 332 | for i in range(len(outputs)): 333 | if outputs[i] == targets[i]: 334 | t += 1 335 | return (t / len(outputs)) * 1.0 336 | 337 | for i in range(10): 338 | global_step = sess.run(model.global_step) + 1 339 | random.shuffle(train_examples) 340 | train_p, train_q, train_a, train_id = get_bacth(train_2_examples, word2idx_dict, batch_size) 341 | # train 342 | feed = {model.passage: train_p, model.question: train_q, model.answer: train_a, model.qa_id: train_id, model.is_train:True} 343 | train_loss, train_op, t_classes, t_id = sess.run([model.loss, model.train_op, model.classes, model.qa_id], feed_dict=feed) 344 | # dev 345 | 
feed2 = {model.passage: dev_p, model.question: dev_q, model.answer: dev_a, model.qa_id: dev_id, model.is_train:False} 346 | dev_loss, d_classes, d_id = sess.run([model.loss, model.classes, model.qa_id], feed_dict=feed2) 347 | # 输出 348 | train_acc = get_acc(t_classes, train_a) 349 | dev_acc = get_acc(d_classes, dev_a) 350 | if (i + 1) % 1 == 0: 351 | print("steps:{},train_loss:{:.4f},train_acc:{:.4f},dev_loss:{:.4f},dev_acc:{:.4f}"\ 352 | .format(global_step, train_loss, train_acc, dev_loss, dev_acc)) 353 | for j in range(2): 354 | print("dev_id:{},answer:{},my_answer:{}".format(dev_id[j],dev_a[j],d_classes[j])) 355 | 356 | 357 | if __name__ == '__main__': 358 | tf.app.run() 359 | 360 | -------------------------------------------------------------------------------- /baseline/config.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | config.py:配置文件,程序运行入口 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import os 10 | import tensorflow as tf 11 | 12 | import data_process 13 | from main import train, test, dev 14 | from file_save import * 15 | from examine_dev import examine_dev 16 | 17 | flags = tf.flags 18 | os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" 19 | 20 | train_file = os.path.join("file", "ai_challenger_oqmrc_trainingset.json") 21 | dev_file = os.path.join("file", "ai_challenger_oqmrc_validationset.json") 22 | test_file = os.path.join("file", "ai_challenger_oqmrc_testa.json") 23 | ''' 24 | train_file = os.path.join("file", "train_demo.json") 25 | dev_file = os.path.join("file", "val_demo.json") 26 | test_file = os.path.join("file", "test_demo.json")''' 27 | 28 | target_dir = "data" 29 | log_dir = "log/event" 30 | save_dir = "log/model" 31 | prediction_dir = "log/prediction" 32 | train_record_file = os.path.join(target_dir, "train.tfrecords") 33 | dev_record_file = os.path.join(target_dir, "dev.tfrecords") 34 | test_record_file = os.path.join(target_dir, "test.tfrecords") 35 | id2vec_file = os.path.join(target_dir, "id2vec.json") # id号->向量 36 | word2id_file = os.path.join(target_dir, "word2id.json") # 词->id号 37 | train_eval = os.path.join(target_dir, "train_eval.json") 38 | dev_eval = os.path.join(target_dir, "dev_eval.json") 39 | test_eval = os.path.join(target_dir, "test_eval.json") 40 | 41 | if not os.path.exists(target_dir): 42 | os.makedirs(target_dir) 43 | if not os.path.exists(log_dir): 44 | os.makedirs(log_dir) 45 | if not os.path.exists(save_dir): 46 | os.makedirs(save_dir) 47 | if not os.path.exists(prediction_dir): 48 | os.makedirs(prediction_dir) 49 | 50 | flags.DEFINE_string("mode", "train", "train/debug/test") 51 | flags.DEFINE_string("gpu", "0", "0/1") 52 | flags.DEFINE_string("experiment", "lalala", "每次存不同模型分不同的文件夹") 53 | flags.DEFINE_string("model_name", "default", "选取不同的模型") 54 | 55 | flags.DEFINE_string("target_dir", target_dir, "") 56 | flags.DEFINE_string("log_dir", log_dir, "") 57 | flags.DEFINE_string("save_dir", save_dir, "") 58 | flags.DEFINE_string("prediction_dir", prediction_dir, "") 59 | flags.DEFINE_string("train_file", train_file, "") 60 | flags.DEFINE_string("dev_file", dev_file, "") 61 | flags.DEFINE_string("test_file", test_file, "") 62 | 63 | flags.DEFINE_string("train_record_file", train_record_file, "") 64 | flags.DEFINE_string("dev_record_file", dev_record_file, "") 65 | flags.DEFINE_string("test_record_file", test_record_file, "") 66 | flags.DEFINE_string("train_eval_file", train_eval, "") 67 | flags.DEFINE_string("dev_eval_file", dev_eval, "") 68 | 
flags.DEFINE_string("test_eval_file", test_eval, "") 69 | flags.DEFINE_string("word2id_file", word2id_file, "") 70 | flags.DEFINE_string("id2vec_file", id2vec_file, "") 71 | 72 | flags.DEFINE_integer("para_limit", 150, "Limit length for paragraph") 73 | flags.DEFINE_integer("ques_limit", 30, "Limit length for question") 74 | flags.DEFINE_integer("min_count", 1, "embedding 的最小出现次数") 75 | flags.DEFINE_integer("embedding_size", 300, "the dimension of vector") 76 | 77 | flags.DEFINE_integer("capacity", 15000, "Batch size of dataset shuffle") 78 | flags.DEFINE_integer("num_threads", 4, "Number of threads in input pipeline") 79 | # 使用cudnn训练,提升6倍速度 80 | flags.DEFINE_boolean("use_cudnn", True, "Whether to use cudnn (only for GPU)") 81 | flags.DEFINE_boolean("is_bucket", False, "Whether to use bucketing") 82 | 83 | flags.DEFINE_integer("batch_size", 64, "Batch size") 84 | flags.DEFINE_integer("num_steps", 250000, "Number of steps") 85 | flags.DEFINE_integer("checkpoint", 1000, "checkpoint for evaluation") 86 | flags.DEFINE_integer("period", 500, "period to save batch loss") 87 | flags.DEFINE_integer("val_num_batches", 150, "Num of batches for evaluation") 88 | flags.DEFINE_float("init_learning_rate", 0.001, 89 | "Initial learning rate for Adam") 90 | flags.DEFINE_float("init_emb_lr", 0., "") 91 | flags.DEFINE_float("keep_prob", 0.7, "Keep prob in rnn") 92 | flags.DEFINE_float("grad_clip", 5.0, "Global Norm gradient clipping rate") 93 | flags.DEFINE_integer("hidden", 60, "Hidden size") # best:128 94 | flags.DEFINE_integer("patience", 5, "Patience for learning rate decay") 95 | flags.DEFINE_string("optimizer", "Adam", "") 96 | flags.DEFINE_string("loss_function", "default", "") 97 | flags.DEFINE_boolean("use_dropout", True, "") 98 | 99 | 100 | def main(_): 101 | config = flags.FLAGS 102 | os.environ["CUDA_VISIBLE_DEVICES"] = config.gpu # 选择一块gpu 103 | if config.mode == "train": 104 | train(config) 105 | elif config.mode == "prepro": 106 | data_process.prepro(config) 107 | elif config.mode == "debug": 108 | config.num_steps = 2 109 | config.val_num_batches = 1 110 | config.checkpoint = 1 111 | config.period = 1 112 | train(config) 113 | elif config.mode == "test": 114 | test(config) 115 | elif config.mode == "examine": 116 | examine_dev(config) 117 | elif config.mode == "save_dev": 118 | save_dev(config) 119 | elif config.mode == "save_test": 120 | save_test(config) 121 | else: 122 | print("Unknown mode") 123 | exit(0) 124 | 125 | 126 | if __name__ == "__main__": 127 | tf.app.run() 128 | -------------------------------------------------------------------------------- /baseline/data_process.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | data_process.py:数据预处理代码 5 | 6 | @author: haomaojie 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import pandas as pd 10 | import time 11 | import json 12 | import jieba 13 | import csv 14 | import word2vec 15 | import re 16 | import random 17 | import tensorflow as tf 18 | import numpy as np 19 | from tqdm import tqdm # 进度条 20 | import os 21 | import gensim 22 | 23 | 24 | def read_data(json_path, output_path, line_count): 25 | ''' 26 | 读取json文件并转成Dataframe 27 | ''' 28 | start_time = time.time() 29 | data = [] 30 | with open(json_path, 'r') as f: 31 | for i in range(line_count): 32 | data_list = json.loads(f.readline()) 33 | data.append([data_list['passage'], data_list['query']]) 34 | df = pd.DataFrame(data, columns=['passage', 'query']) 35 | df.to_csv(output_path, index=False) 36 | 
print('转化成功,已生成csv文件') 37 | end_time = time.time() 38 | print(end_time - start_time) 39 | 40 | 41 | def de_word(data_path, out_path): 42 | ''' 43 | 分词 44 | ''' 45 | start_time = time.time() 46 | word = [] 47 | data_file = open(data_path).read().split('\n') 48 | for i in range(len(data_file)): 49 | result = [] 50 | seg_list = jieba.cut(data_file[i]) 51 | for w in seg_list: 52 | result.append(w) 53 | word.append(result) 54 | print('分词完成') 55 | with open(out_path, 'w+') as txt_write: 56 | for i in range(len(word)): 57 | s = str(word[i]).replace( 58 | '[', '').replace(']', '') # 去除[],这两行按数据不同,可以选择 59 | s = s.replace("'", '').replace(',', '') + \ 60 | '\n' # 去除单引号,逗号,每行末尾追加换行符 61 | txt_write.write(s) 62 | print('保存成功') 63 | end_time = time.time() 64 | print(end_time - start_time) 65 | 66 | 67 | def word_vec(file_txt, file_bin, min_count, size): 68 | word2vec.word2vec(file_txt, file_bin, min_count=min_count, 69 | size=size, verbose=True) 70 | 71 | 72 | def merge_csv(target_dir, output_file): 73 | for inputfile in [os.path.join(target_dir, 'train_oridata.csv'), 74 | os.path.join(target_dir, 'test_oridata.csv'), os.path.join(target_dir, 'validation_oridata.csv')]: 75 | data = pd.read_csv(inputfile) 76 | df = pd.DataFrame(data) 77 | df.to_csv(output_file, mode='a', index=False) 78 | 79 | # 词转id,id转向量 80 | 81 | 82 | def transfer(model_path, embedding_size): 83 | start_time = time.time() 84 | model = word2vec.load(model_path) 85 | word2id_dic = {} 86 | init_0 = [0.0 for i in range(embedding_size)] 87 | id2vec_dic = [init_0] 88 | for i in range(len(model.vocab)): 89 | id = i + 1 90 | word2id_dic[model.vocab[i]] = id 91 | id2vec_dic.append(model[model.vocab[i]].tolist()) 92 | end_time = time.time() 93 | print('词转id,id转向量完成') 94 | print(end_time - start_time) 95 | return word2id_dic, id2vec_dic 96 | 97 | 98 | def transfer_txt(model_path, embedding_size): 99 | print("开始转换...") 100 | start_time = time.time() 101 | model = gensim.models.KeyedVectors.load_word2vec_format( 102 | model_path, binary=False) 103 | word_dic = model.wv.vocab 104 | word2id_dic = {} 105 | init_0 = [0.0 for i in range(embedding_size)] 106 | id2vec_dic = [init_0] 107 | id = 1 108 | for i in word_dic: 109 | word2id_dic[i] = id 110 | id2vec_dic.append(model[i].tolist()) 111 | id += 1 112 | end_time = time.time() 113 | print('词转id,id转向量完成') 114 | print(end_time - start_time) 115 | return word2id_dic, id2vec_dic 116 | 117 | # 存入json文件 118 | 119 | 120 | def save_json(output_path, dic_data, message=None): 121 | start_time = time.time() 122 | if message is not None: 123 | print("Saving {}...".format(message)) 124 | with open(output_path, "w") as fh: 125 | json.dump(dic_data, fh, ensure_ascii=False, indent=4) 126 | print('保存完成') 127 | end_time = time.time() 128 | print(end_time - start_time) 129 | 130 | # 将原文中的passage,query,alternative,answer,query_id转成id号 131 | # 输入参数为词典的位置和训练集的位置 132 | 133 | 134 | def TrainningsetProcess(dic_url, dataset_url, passage_len_limit): 135 | res = [] # 最后返回的结果 136 | rule = re.compile(r'\|') 137 | id2alternatives = {} 138 | # 读取字典 139 | with open(dic_url, 'r', encoding='utf-8') as dic_file: 140 | dic = dict() 141 | dic = json.load(dic_file) 142 | # 读取训练集 143 | over_limit = 0 144 | with open(dataset_url, 'r', encoding='utf-8') as ts_file: 145 | for file_line in ts_file: 146 | line = json.loads(file_line) # 读取一行json文件 147 | this_line_res = dict() # 变量定义,代表这一行映射之后的结果 148 | passage = line['passage'] 149 | alternatives = line['alternatives'] 150 | query = line['query'] 151 | if dataset_url.find('test') == -1: 152 | 
answer = line['answer'] 153 | query_idx = line['query_id'] 154 | 155 | # 用jieba将passage和query分词,lcut返回list 156 | passage_cut = jieba.lcut(passage, cut_all=False) 157 | query_cut = jieba.lcut(query, cut_all=False) 158 | 159 | # 用词典将passage和query映射到id 160 | passage_id = [] 161 | query_id = [] 162 | for each_passage_word in passage_cut: 163 | passage_id.append(dic.get(each_passage_word)) 164 | for each_query_word in query_cut: 165 | query_id.append(dic.get(each_query_word)) 166 | 167 | # 对选项进行排序 168 | alternatives_cut = re.split(rule, alternatives) 169 | alternatives_cut = [s.strip() for s in alternatives_cut] 170 | tmp = [0, 0, 0] 171 | 172 | # 选项少于三个 173 | if len(alternatives_cut) == 1: 174 | alternatives_cut.append(alternatives_cut[0]) 175 | alternatives_cut.append(alternatives_cut[0]) 176 | if len(alternatives_cut) == 2: 177 | alternatives_cut.append(alternatives_cut[0]) 178 | 179 | # 跳过无效数据(135条) 180 | if alternatives.find("无法") == -1 and alternatives.find("不确定") == -1: 181 | if dataset_url.find('test') != -1: 182 | tmp[0] = alternatives_cut[0] 183 | tmp[1] = alternatives_cut[1] 184 | tmp[2] = alternatives_cut[2] 185 | else: 186 | print(1) 187 | continue 188 | if alternatives.count("无法确定") > 1 or alternatives.count("没") > 1: 189 | if dataset_url.find('test') != -1: 190 | tmp[0] = alternatives_cut[0] 191 | tmp[1] = alternatives_cut[1] 192 | tmp[2] = alternatives_cut[2] 193 | else: 194 | print(2) 195 | continue # 第64772条数据 196 | if alternatives.find("没") != -1 and alternatives.find("不") != -1 and alternatives.find("不确定") == -1: 197 | print(3) 198 | continue # 第144146条数据 199 | if "不确定" in alternatives_cut and "无法确定" in alternatives_cut: 200 | tmp[0] = "确定" 201 | tmp[1] = "不确定" 202 | tmp[2] = "无法确定" 203 | # 肯定/否定/无法确定 204 | elif alternatives.find("不") != -1 or alternatives.find("没") != -1: 205 | if alternatives.count("不") == 1 and alternatives.find("不确定") != -1: 206 | alternatives_cut.remove("不确定") 207 | alternatives_cut.append("不确定") 208 | tmp[0] = alternatives_cut[0] 209 | tmp[1] = alternatives_cut[1] 210 | tmp[2] = alternatives_cut[2] 211 | elif alternatives.count("不") > 1: 212 | if alternatives.find("不确定") == -1: 213 | if dataset_url.find("test") != -1: 214 | tmp[0] = alternatives_cut[0] 215 | tmp[1] = alternatives_cut[1] 216 | tmp[2] = alternatives_cut[2] 217 | else: 218 | print(line) 219 | continue 220 | else: 221 | alternatives_cut.remove("不确定") 222 | if alternatives_cut[0].find("不") != -1: 223 | tmp[1] = alternatives_cut[0] 224 | tmp[0] = alternatives_cut[1] 225 | else: 226 | tmp[1] = alternatives_cut[1] 227 | tmp[0] = alternatives_cut[0] 228 | alternatives_cut.append("不确定") 229 | tmp[2] = alternatives_cut[2] 230 | else: 231 | for tmp_alternatives in alternatives_cut: 232 | if tmp_alternatives.find("无法") != -1: 233 | tmp[2] = tmp_alternatives 234 | elif tmp_alternatives.find("不") != -1 or tmp_alternatives.find("没") != -1: 235 | tmp[1] = tmp_alternatives 236 | else: 237 | tmp[0] = tmp_alternatives 238 | # 无明显肯定与否定词义 239 | else: 240 | for tmp_alternatives in alternatives_cut: 241 | if tmp_alternatives.find("无法") != -1 or alternatives.find("不确定") != -1: 242 | alternatives_cut.remove(tmp_alternatives) 243 | alternatives_cut.append(tmp_alternatives) 244 | break 245 | tmp[0] = alternatives_cut[0] 246 | tmp[1] = alternatives_cut[1] 247 | tmp[2] = alternatives_cut[2] 248 | 249 | # 根据tmp列表生成answer_id 250 | if dataset_url.find('test') == -1: 251 | answer_id = tmp.index(answer.strip()) 252 | # 得到这一行映射后的结果,是dict类型的数据 253 | if len(passage_id) > passage_len_limit: 254 | passage_id = 
passage_id[:passage_len_limit] 255 | over_limit += 1 256 | this_line_res['passage'] = passage_id 257 | this_line_res['query'] = query_id 258 | this_line_res['alternatives'] = tmp 259 | if dataset_url.find('test') == -1: 260 | this_line_res['answer'] = answer_id 261 | this_line_res['query_id'] = query_idx 262 | # 创建query_id到alternatives的字典,保存为json 263 | id2alternatives[query_idx] = tmp 264 | res.append(this_line_res) 265 | print(len(res)) 266 | print("over_limit:{}".format(over_limit)) 267 | return res, id2alternatives 268 | 269 | 270 | def data_process(config): 271 | target_dir = config.target_dir 272 | # 这里如果使用自己训练好的词向量就可以注释掉 273 | ''' 274 | read_data(config.train_file, os.path.join( 275 | target_dir, 'train_oridata.csv'), 250000) # 250000 276 | read_data(config.test_file, os.path.join( 277 | target_dir, 'test_oridata.csv'), 10000) # 10000 278 | read_data(config.dev_file, os.path.join( 279 | target_dir, 'validation_oridata.csv'), 30000) # 30000 280 | merge_csv(target_dir, os.path.join(target_dir, 'ori_data.csv')) 281 | de_word(os.path.join(target_dir, 'ori_data.csv'), 282 | os.path.join(target_dir, 'seg_list.txt')) 283 | word_vec(os.path.join(target_dir, 'seg_list.txt'), 284 | os.path.join(target_dir, 'seg_listWord2Vec.bin'), config.min_count, config.embedding_size) 285 | # 如果是用外部词向量,从这里开始 286 | # word2id_dic, id2vec_dic = transfer_txt( 287 | # os.path.join(target_dir, 'baidu_300_wc+ng_sgns.baidubaike.bigram-char.txt'), config.embedding_size) 288 | word2id_dic, id2vec_dic = transfer( 289 | os.path.join(target_dir, 'seg_listWord2Vec.bin'), config.embedding_size) 290 | save_json(config.word2id_file, word2id_dic, "word to id") 291 | save_json(config.id2vec_file, id2vec_dic, "id to vec") 292 | ''' 293 | train_examples, train_id2alternatives = TrainningsetProcess( 294 | config.word2id_file, config.train_file, config.para_limit) 295 | test_examples, test_id2alternatives = TrainningsetProcess( 296 | config.word2id_file, config.test_file, config.para_limit) 297 | validation_examples, validation_id2alternatives = TrainningsetProcess( 298 | config.word2id_file, config.dev_file, config.para_limit) 299 | save_json(config.train_eval_file, train_id2alternatives, 300 | message='保存train每条数据的alternatives') 301 | save_json(config.test_eval_file, test_id2alternatives, 302 | message='保存test每条数据的alternatives') 303 | save_json(config.dev_eval_file, validation_id2alternatives, 304 | message='保存validation每条数据的alternatives') 305 | return train_examples, test_examples, validation_examples 306 | 307 | 308 | def build_features(config, examples, data_type, out_file, is_test=False): 309 | """ 310 | 将数据读入TFrecords 311 | """ 312 | 313 | para_limit = config.para_limit 314 | ques_limit = config.ques_limit 315 | 316 | print("Processing {} examples...".format(data_type)) 317 | writer = tf.python_io.TFRecordWriter(out_file) 318 | total = 0 319 | meta = {} 320 | random.shuffle(examples) # 先给打乱顺序 321 | for example in tqdm(examples): 322 | total += 1 323 | passage_idxs = np.zeros([para_limit], dtype=np.int32) 324 | question_idxs = np.zeros([ques_limit], dtype=np.int32) 325 | 326 | for i, token in enumerate(example["passage"]): 327 | if token == None: 328 | passage_idxs[i] = 0 329 | else: 330 | passage_idxs[i] = token 331 | for i, token in enumerate(example["query"]): 332 | if token == None: 333 | question_idxs[i] = 0 334 | else: 335 | question_idxs[i] = token 336 | # print(passage_idxs) 337 | # print(example["passage"]) 338 | if not is_test: 339 | record = tf.train.Example(features=tf.train.Features(feature={ 340 | 
"passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 341 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 342 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["answer"]])), 343 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"]])) 344 | })) 345 | else: 346 | record = tf.train.Example(features=tf.train.Features(feature={ 347 | "passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 348 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 349 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(-1)])), 350 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"]])) 351 | })) 352 | # print(record) 353 | writer.write(record.SerializeToString()) 354 | print("Build {} instances of features in total".format(total)) 355 | writer.close() 356 | 357 | 358 | def prepro(config): 359 | """ 360 | 数据预处理函数 361 | """ 362 | train_examples, test_examples, dev_examples = data_process(config) 363 | ''' 364 | print(train_examples) 365 | print(test_examples) 366 | print(dev_examples) 367 | print(word2id_dict) 368 | ''' 369 | # train: 249778, test: 10000, dev: 29968 370 | # train: 439, test: 18, dev: 48 371 | 372 | build_features(config, train_examples, "train", config.train_record_file) 373 | build_features(config, dev_examples, "dev", config.dev_record_file) 374 | build_features(config, test_examples, "test", 375 | config.test_record_file, is_test=True) 376 | 377 | print("done!!!") 378 | -------------------------------------------------------------------------------- /baseline/data_process_aug.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | data_process_aug.py:数据预处理代码(数据增强) 5 | 6 | @author: haomaojie 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import pandas as pd 10 | import time 11 | import json 12 | import jieba 13 | import csv 14 | import word2vec 15 | import re 16 | import tensorflow as tf 17 | import numpy as np 18 | from tqdm import tqdm # 进度条 19 | import os 20 | import random 21 | #import gensim 22 | 23 | 24 | def read_data(json_path, output_path, line_count): 25 | ''' 26 | 读取json文件并转成Dataframe 27 | ''' 28 | start_time = time.time() 29 | data = [] 30 | with open(json_path, 'r') as f: 31 | for i in range(line_count): 32 | data_list = json.loads(f.readline()) 33 | data.append([data_list['passage'], data_list['query']]) 34 | df = pd.DataFrame(data, columns=['passage', 'query']) 35 | df.to_csv(output_path, index=False) 36 | print('转化成功,已生成csv文件') 37 | end_time = time.time() 38 | print(end_time - start_time) 39 | 40 | 41 | def de_word(data_path, out_path): 42 | ''' 43 | 分词 44 | ''' 45 | start_time = time.time() 46 | word = [] 47 | data_file = open(data_path).read().split('\n') 48 | for i in range(len(data_file)): 49 | result = [] 50 | seg_list = jieba.cut(data_file[i]) 51 | for w in seg_list: 52 | result.append(w) 53 | word.append(result) 54 | print('分词完成') 55 | with open(out_path, 'w+') as txt_write: 56 | for i in range(len(word)): 57 | s = str(word[i]).replace( 58 | '[', '').replace(']', '') # 去除[],这两行按数据不同,可以选择 59 | s = s.replace("'", '').replace(',', '') + \ 60 | '\n' # 去除单引号,逗号,每行末尾追加换行符 61 | txt_write.write(s) 62 | print('保存成功') 63 | end_time = time.time() 64 | print(end_time - start_time) 65 | 66 | 67 | def word_vec(file_txt, file_bin, min_count, 
size): 68 | word2vec.word2vec(file_txt, file_bin, min_count=min_count, 69 | size=size, verbose=True) 70 | 71 | 72 | def merge_csv(target_dir, output_file): 73 | for inputfile in [os.path.join(target_dir, 'train_oridata.csv'), 74 | os.path.join(target_dir, 'test_oridata.csv'), os.path.join(target_dir, 'validation_oridata.csv')]: 75 | data = pd.read_csv(inputfile) 76 | df = pd.DataFrame(data) 77 | df.to_csv(output_file, mode='a', index=False) 78 | 79 | # 词转id,id转向量 80 | 81 | 82 | def transfer(model_path, embedding_size): 83 | start_time = time.time() 84 | model = word2vec.load(model_path) 85 | word2id_dic = {} 86 | init_0 = [0.0 for i in range(embedding_size)] 87 | id2vec_dic = [init_0] 88 | for i in range(len(model.vocab)): 89 | id = i + 1 90 | word2id_dic[model.vocab[i]] = id 91 | id2vec_dic.append(model[model.vocab[i]].tolist()) 92 | end_time = time.time() 93 | print('词转id,id转向量完成') 94 | print(end_time - start_time) 95 | return word2id_dic, id2vec_dic 96 | 97 | # 存入json文件 98 | 99 | 100 | def save_json(output_path, dic_data, message=None): 101 | start_time = time.time() 102 | if message is not None: 103 | print("Saving {}...".format(message)) 104 | with open(output_path, "w") as fh: 105 | json.dump(dic_data, fh, ensure_ascii=False, indent=4) 106 | print('保存完成') 107 | end_time = time.time() 108 | print(end_time - start_time) 109 | 110 | # 将原文中的passage,query,alternative,answer,query_id转成id号 111 | # 输入参数为词典的位置和训练集的位置 112 | 113 | 114 | def TrainningsetProcess(dic_url, dataset_url): 115 | res = [] # 最后返回的结果 116 | rule = re.compile(r'\|') 117 | id2alternatives = {} 118 | # 读取字典 119 | with open(dic_url, 'r', encoding='utf-8') as dic_file: 120 | dic = dict() 121 | dic = json.load(dic_file) 122 | # 读取训练集 123 | over_limit = 0 124 | with open(dataset_url, 'r', encoding='utf-8') as ts_file: 125 | for file_line in ts_file: 126 | line = json.loads(file_line) # 读取一行json文件 127 | this_line_res = dict() # 变量定义,代表这一行映射之后的结果 128 | passage = line['passage'] 129 | alternatives = line['alternatives'] 130 | query = line['query'] 131 | if dataset_url.find('test') == -1: 132 | answer = line['answer'] 133 | query_idx = line['query_id'] 134 | 135 | # 用jieba将passage和query分词,lcut返回list 136 | passage_cut = jieba.lcut(passage, cut_all=False) 137 | query_cut = jieba.lcut(query, cut_all=False) 138 | 139 | # 用词典将passage和query映射到id 140 | passage_id = [] 141 | query_id = [] 142 | for each_passage_word in passage_cut: 143 | passage_id.append(dic.get(each_passage_word)) 144 | for each_query_word in query_cut: 145 | query_id.append(dic.get(each_query_word)) 146 | 147 | # 对选项进行排序 148 | alternatives_cut = re.split(rule, alternatives) 149 | alternatives_cut = [s.strip() for s in alternatives_cut] 150 | tmp = [0, 0, 0] 151 | 152 | # 选项少于三个 153 | if len(alternatives_cut) == 1: 154 | alternatives_cut.append(alternatives_cut[0]) 155 | alternatives_cut.append(alternatives_cut[0]) 156 | if len(alternatives_cut) == 2: 157 | alternatives_cut.append(alternatives_cut[0]) 158 | 159 | # 跳过无效数据(135条) 160 | if alternatives.find("无法") == -1 and alternatives.find("不确定") == -1: 161 | if dataset_url.find('test') != -1: 162 | tmp[0] = alternatives_cut[0] 163 | tmp[1] = alternatives_cut[1] 164 | tmp[2] = alternatives_cut[2] 165 | else: 166 | print(1) 167 | continue 168 | if alternatives.count("无法确定") > 1 or alternatives.count("没") > 1: 169 | if dataset_url.find('test') != -1: 170 | tmp[0] = alternatives_cut[0] 171 | tmp[1] = alternatives_cut[1] 172 | tmp[2] = alternatives_cut[2] 173 | else: 174 | print(2) 175 | continue # 第64772条数据 176 | if 
alternatives.find("没") != -1 and alternatives.find("不") != -1 and alternatives.find("不确定") == -1: 177 | print(3) 178 | continue # 第144146条数据 179 | if "不确定" in alternatives_cut and "无法确定" in alternatives_cut: 180 | tmp[0] = "确定" 181 | tmp[1] = "不确定" 182 | tmp[2] = "无法确定" 183 | # 肯定/否定/无法确定 184 | elif alternatives.find("不") != -1 or alternatives.find("没") != -1: 185 | if alternatives.count("不") == 1 and alternatives.find("不确定") != -1: 186 | alternatives_cut.remove("不确定") 187 | alternatives_cut.append("不确定") 188 | tmp[0] = alternatives_cut[0] 189 | tmp[1] = alternatives_cut[1] 190 | tmp[2] = alternatives_cut[2] 191 | elif alternatives.count("不") > 1: 192 | if alternatives.find("不确定") == -1: 193 | if dataset_url.find("test") != -1: 194 | tmp[0] = alternatives_cut[0] 195 | tmp[1] = alternatives_cut[1] 196 | tmp[2] = alternatives_cut[2] 197 | else: 198 | print(line) 199 | continue 200 | else: 201 | alternatives_cut.remove("不确定") 202 | if alternatives_cut[0].find("不") != -1: 203 | tmp[1] = alternatives_cut[0] 204 | tmp[0] = alternatives_cut[1] 205 | else: 206 | tmp[1] = alternatives_cut[1] 207 | tmp[0] = alternatives_cut[0] 208 | alternatives_cut.append("不确定") 209 | tmp[2] = alternatives_cut[2] 210 | else: 211 | for tmp_alternatives in alternatives_cut: 212 | if tmp_alternatives.find("无法") != -1: 213 | tmp[2] = tmp_alternatives 214 | elif tmp_alternatives.find("不") != -1 or tmp_alternatives.find("没") != -1: 215 | tmp[1] = tmp_alternatives 216 | else: 217 | tmp[0] = tmp_alternatives 218 | # 无明显肯定与否定词义 219 | else: 220 | for tmp_alternatives in alternatives_cut: 221 | if tmp_alternatives.find("无法") != -1 or alternatives.find("不确定") != -1: 222 | alternatives_cut.remove(tmp_alternatives) 223 | alternatives_cut.append(tmp_alternatives) 224 | break 225 | tmp[0] = alternatives_cut[0] 226 | tmp[1] = alternatives_cut[1] 227 | tmp[2] = alternatives_cut[2] 228 | 229 | # 根据tmp列表生成answer_id 230 | if dataset_url.find('test') == -1: 231 | answer_id = tmp.index(answer.strip()) 232 | # 得到这一行映射后的结果,是dict类型的数据 233 | if len(passage_id) > 500: 234 | passage_id = passage_id[:500] 235 | over_limit += 1 236 | this_line_res['passage'] = passage_id 237 | this_line_res['query'] = query_id 238 | this_line_res['alternatives'] = tmp 239 | if dataset_url.find('test') == -1: 240 | this_line_res['answer'] = answer_id 241 | this_line_res['query_id'] = query_idx 242 | # 创建query_id到alternatives的字典,保存为json 243 | id2alternatives[query_idx] = tmp 244 | res.append(this_line_res) 245 | print(len(res)) 246 | print("over_limit:{}".format(over_limit)) 247 | return res, id2alternatives 248 | 249 | 250 | def data_process(config, train_file, test_file, validation_file): 251 | target_dir = config.target_dir 252 | read_data(train_file, os.path.join( 253 | target_dir, 'train_oridata.csv'), 250000) # 250000 254 | read_data(test_file, os.path.join( 255 | target_dir, 'test_oridata.csv'), 10000) # 10000 256 | read_data(validation_file, os.path.join( 257 | target_dir, 'validation_oridata.csv'), 30000) # 30000 258 | merge_csv(target_dir, os.path.join(target_dir, 'ori_data.csv')) 259 | de_word(os.path.join(target_dir, 'ori_data.csv'), 260 | os.path.join(target_dir, 'seg_list.txt')) 261 | word_vec(os.path.join(target_dir, 'seg_list.txt'), 262 | os.path.join(target_dir, 'seg_listWord2Vec.bin'), config.min_count, config.embedding_size) 263 | word2id_dic, id2vec_dic = transfer( 264 | os.path.join(target_dir, 'seg_listWord2Vec.bin'), config.embedding_size) 265 | save_json(config.word2id_file, word2id_dic, "word to id") 266 | save_json(config.id2vec_file, 
id2vec_dic, "id to vec") 267 | train_examples, train_id2alternatives = TrainningsetProcess( 268 | config.word2id_file, train_file) 269 | test_examples, test_id2alternatives = TrainningsetProcess( 270 | config.word2id_file, test_file) 271 | validation_examples, validation_id2alternatives = TrainningsetProcess( 272 | config.word2id_file, validation_file) 273 | save_json(config.train_eval_file, train_id2alternatives, 274 | message='保存train每条数据的alternatives') 275 | save_json(config.test_eval_file, test_id2alternatives, 276 | message='保存test每条数据的alternatives') 277 | save_json(config.dev_eval_file, validation_id2alternatives, 278 | message='保存validation每条数据的alternatives') 279 | return train_examples, test_examples, validation_examples, word2id_dic 280 | 281 | 282 | def build_features(config, examples, data_type, out_file, word2idx_dict, is_test=False): 283 | """ 284 | 将数据读入TFrecords 285 | """ 286 | 287 | para_limit = config.para_limit 288 | ques_limit = config.ques_limit 289 | 290 | print("Processing {} examples...".format(data_type)) 291 | writer = tf.python_io.TFRecordWriter(out_file) 292 | total = 0 293 | meta = {} 294 | # 数据增强用 295 | yes_examples = [] 296 | no_examples = [] 297 | depend_examples = [] 298 | 299 | for example in tqdm(examples): 300 | if data_type == "train": 301 | if example["answer"] == 0: 302 | yes_examples.append(example) 303 | elif example["answer"] == 1: 304 | no_examples.append(example) 305 | else: 306 | depend_examples.append(example) 307 | 308 | total += 1 309 | passage_idxs = np.zeros([para_limit], dtype=np.int32) 310 | question_idxs = np.zeros([ques_limit], dtype=np.int32) 311 | 312 | for i, token in enumerate(example["passage"]): 313 | if token == None: 314 | passage_idxs[i] = 0 315 | else: 316 | passage_idxs[i] = token 317 | for i, token in enumerate(example["query"]): 318 | if token == None: 319 | question_idxs[i] = 0 320 | else: 321 | question_idxs[i] = token 322 | # print(passage_idxs) 323 | # print(example["passage"]) 324 | if not is_test: 325 | record = tf.train.Example(features=tf.train.Features(feature={ 326 | "passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 327 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 328 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["answer"]])), 329 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"]])) 330 | })) 331 | else: 332 | record = tf.train.Example(features=tf.train.Features(feature={ 333 | "passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 334 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 335 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(-1)])), 336 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"]])) 337 | })) 338 | # print(record) 339 | writer.write(record.SerializeToString()) 340 | 341 | # 数据增强,初步的增强是将不确定选项的答案搭配其他passage生成新的数据 342 | if data_type == "train": 343 | for example in depend_examples: 344 | random1 = random.randint(0, len(yes_examples) - 1) 345 | example["passage"] = yes_examples[random1]["passage"] 346 | total += 1 347 | passage_idxs = np.zeros([para_limit], dtype=np.int32) 348 | question_idxs = np.zeros([ques_limit], dtype=np.int32) 349 | for i, token in enumerate(example["passage"]): 350 | if token == None: 351 | passage_idxs[i] = 0 352 | else: 353 | passage_idxs[i] = token 354 | for i, token 
in enumerate(example["query"]): 355 | if token == None: 356 | question_idxs[i] = 0 357 | else: 358 | question_idxs[i] = token 359 | record = tf.train.Example(features=tf.train.Features(feature={ 360 | "passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 361 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 362 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["answer"]])), 363 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"] + 500000])) 364 | })) 365 | writer.write(record.SerializeToString()) 366 | for example in depend_examples: 367 | random2 = random.randint(0, len(no_examples) - 1) 368 | example["passage"] = no_examples[random2]["passage"] 369 | total += 1 370 | passage_idxs = np.zeros([para_limit], dtype=np.int32) 371 | question_idxs = np.zeros([ques_limit], dtype=np.int32) 372 | for i, token in enumerate(example["passage"]): 373 | if token == None: 374 | passage_idxs[i] = 0 375 | else: 376 | passage_idxs[i] = token 377 | for i, token in enumerate(example["query"]): 378 | if token == None: 379 | question_idxs[i] = 0 380 | else: 381 | question_idxs[i] = token 382 | record = tf.train.Example(features=tf.train.Features(feature={ 383 | "passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 384 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 385 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["answer"]])), 386 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"] + 1000000])) 387 | })) 388 | writer.write(record.SerializeToString()) 389 | 390 | print("Build {} instances of features in total".format(total)) 391 | writer.close() 392 | 393 | 394 | def prepro(config): 395 | """ 396 | 数据预处理函数 397 | """ 398 | train_examples, test_examples, dev_examples, word2id_dict = data_process( 399 | config, config.train_file, config.test_file, config.dev_file) 400 | ''' 401 | print(train_examples) 402 | print(test_examples) 403 | print(dev_examples) 404 | print(word2id_dict) 405 | ''' 406 | # train: 249778, test: 10000, dev: 29968 407 | # train: 439, test: 18, dev: 48 408 | 409 | build_features(config, train_examples, "train", 410 | config.train_record_file, word2id_dict) 411 | build_features(config, dev_examples, "dev", 412 | config.dev_record_file, word2id_dict) 413 | build_features(config, test_examples, "test", 414 | config.test_record_file, word2id_dict, is_test=True) 415 | 416 | print("done!!!") 417 | -------------------------------------------------------------------------------- /baseline/examine_dev.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | examine_dev.py:检查dev集的结果,辅助分析 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import os 14 | import codecs 15 | import time 16 | 17 | from model import Model 18 | from util import * 19 | 20 | 21 | def examine_dev(config): 22 | """ 23 | 检查dev集的结果,辅助分析 24 | """ 25 | with open(config.id2vec_file, "r") as fh: 26 | id2vec = np.array(json.load(fh), dtype=np.float32) 27 | with open(config.dev_eval_file, "r") as fh: 28 | dev_eval_file = json.load(fh) 29 | 30 | total = 29968 31 | # 读取模型的路径和预测存储的路径 32 | save_dir = config.save_dir + config.experiment 33 | if not 
os.path.exists(save_dir): 34 | print("no save!") 35 | return 36 | predic_time = time.strftime("%Y-%m-%d_%H:%M:%S ", time.localtime()) 37 | os.path.join(config.prediction_dir, (predic_time + "_examine_dev.txt")) 38 | 39 | print("Loading model...") 40 | examine_batch = get_dataset(config.dev_record_file, get_record_parser( 41 | config), config).make_one_shot_iterator() 42 | 43 | model = Model(config, examine_batch, id2vec, trainable=False) 44 | 45 | sess_config = tf.ConfigProto(allow_soft_placement=True) 46 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 47 | sess_config.gpu_options.allow_growth = True 48 | 49 | print("examining ...") 50 | with tf.Session(config=sess_config) as sess: 51 | sess.run(tf.global_variables_initializer()) 52 | saver = tf.train.Saver() 53 | saver.restore(sess, tf.train.latest_checkpoint(save_dir)) 54 | sess.run(tf.assign(model.is_train, tf.constant(False, dtype=tf.bool))) 55 | answer_dict = {} 56 | truth_dict = {} 57 | for step in tqdm(range(total // config.batch_size + 1)): 58 | # 预测答案 59 | qa_id, answer, truth = sess.run( 60 | [model.qa_id, model.classes, model.answer]) 61 | answer_dict_ = {} 62 | truth_dict_ = {} 63 | for ids, tr, ans in zip(qa_id, truth, answer): 64 | answer_dict_[str(ids)] = ans 65 | truth_dict_[str(ids)] = tr 66 | answer_dict.update(answer_dict_) 67 | truth_dict.update(truth_dict_) 68 | metrics = evaluate_acc(truth_dict, answer_dict) 69 | print(len(truth_dict)) 70 | print(len(answer_dict)) 71 | print("accuracy:{}".format(metrics["accuracy"])) 72 | 73 | yes_predictions = [] # 正确答案是肯定的错题 74 | no_predictions = [] # 正确答案是否定的错题 75 | depend_predictions = [] # 正确答案是不确定的错题 76 | yes, no, depend = 0, 0, 0 77 | yes_wrong, no_wrong, depend_wrong = 0, 0, 0 78 | for key, value in answer_dict.items(): 79 | if truth_dict[key] != value: 80 | if truth_dict[key] == 0: 81 | yes += 1 82 | yes_wrong += 1 83 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 84 | wrong_answer = dev_eval_file[str(key)][value] 85 | yes_predictions.append( 86 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 87 | elif truth_dict[key] == 1: 88 | no += 1 89 | no_wrong += 1 90 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 91 | wrong_answer = dev_eval_file[str(key)][value] 92 | no_predictions.append( 93 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 94 | else: 95 | depend += 1 96 | depend_wrong += 1 97 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 98 | wrong_answer = dev_eval_file[str(key)][value] 99 | depend_predictions.append( 100 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 101 | else: 102 | if truth_dict[key] == 0: 103 | yes += 1 104 | elif truth_dict[key] == 1: 105 | no += 1 106 | else: 107 | depend += 1 108 | 109 | print("肯定型问题个数:{},否定型问题个数:{},不确定问题个数:{}".format(yes, no, depend)) 110 | print("肯定型问题正确率:{}".format((yes - yes_wrong) / yes * 1.0)) 111 | print("否定型问题正确率:{}".format((no - no_wrong) / no * 1.0)) 112 | print("不确定型问题正确率:{}".format((depend - depend_wrong) / depend * 1.0)) 113 | outputs_0 = u'\n'.join(yes_predictions) 114 | outputs_1 = u'\n'.join(no_predictions) 115 | outputs_2 = u'\n'.join(depend_predictions) 116 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_0.txt")), 'w', encoding='utf-8') as f: 117 | f.write(outputs_0) 118 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_1.txt")), 'w', encoding='utf-8') as f: 119 | f.write(outputs_1) 120 | with codecs.open(os.path.join(config.prediction_dir, 
(predic_time + "_examine_dev_2.txt")), 'w', encoding='utf-8') as f: 121 | f.write(outputs_2) 122 | print("done!") 123 | -------------------------------------------------------------------------------- /baseline/examine_dev_ensemble.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | examine_dev.py:检查dev集的结果,辅助分析 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import os 14 | import pickle 15 | import codecs 16 | import time 17 | from collections import Counter 18 | from model import Model 19 | from util import * 20 | 21 | 22 | def examine_dev_ensemble(config): 23 | """ 24 | 检查dev集的结果,辅助分析 25 | """ 26 | with open(config.id2vec_file, "r") as fh: 27 | id2vec = np.array(json.load(fh), dtype=np.float32) 28 | with open(config.dev_eval_file, "r") as fh: 29 | dev_eval_file = json.load(fh) 30 | 31 | total = 29968 * 3 32 | # 读取模型的路径和预测存储的路径 33 | save_dir = config.save_dir + config.experiment 34 | if not os.path.exists(save_dir): 35 | print("no save!") 36 | return 37 | predic_time = time.strftime("%Y-%m-%d_%H:%M:%S ", time.localtime()) 38 | os.path.join(config.prediction_dir, (predic_time + "_examine_dev.txt")) 39 | 40 | print("Loading model...") 41 | examine_batch = get_dataset("./data/dev_aug.tfrecords", get_record_parser( 42 | config), config).make_one_shot_iterator() 43 | 44 | model = Model(config, examine_batch, id2vec, trainable=False) 45 | 46 | sess_config = tf.ConfigProto(allow_soft_placement=True) 47 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 48 | sess_config.gpu_options.allow_growth = True 49 | 50 | print("examining ...") 51 | with tf.Session(config=sess_config) as sess: 52 | sess.run(tf.global_variables_initializer()) 53 | saver = tf.train.Saver() 54 | saver.restore(sess, tf.train.latest_checkpoint(save_dir)) 55 | sess.run(tf.assign(model.is_train, tf.constant(False, dtype=tf.bool))) 56 | logits_dict = {} 57 | truth_dict = {} 58 | answer_dict = {} 59 | for step in tqdm(range(total // config.batch_size + 1)): 60 | # 预测答案 61 | qa_id, softmax, truth = sess.run( 62 | [model.qa_id, model.final_softmax, model.answer]) 63 | # 往字典中添加每个id的三个logits 64 | for ids, tr, log in zip(qa_id, truth, softmax): 65 | if str(ids) not in logits_dict.keys(): 66 | logits_dict[str(ids)] = log 67 | truth_dict[str(ids)] = tr 68 | else: 69 | logits_dict[str(ids)] += log 70 | # if str(ids) not in class_dict.keys(): 71 | # class_dict[str(ids)] = [int(cla)] 72 | # truth_dict[str(ids)] = tr 73 | # else: 74 | # class_dict[str(ids)].append(int(cla)) 75 | # 根据合并的logits求answer 76 | for key, value in logits_dict.items(): 77 | val = value.tolist() 78 | answer_dict[key] = val.index(max(val)) 79 | # answer_dict[key], _ = Counter(value).most_common(1)[0] 80 | metrics = evaluate_acc(truth_dict, answer_dict) 81 | print(len(truth_dict)) 82 | print(len(answer_dict)) 83 | print("accuracy:{}".format(metrics["accuracy"])) 84 | 85 | print("正在保存dev的softmax结果文件!") 86 | if not os.path.exists("./dev_soft"): 87 | os.makedirs("./dev_soft") 88 | with open("./dev_soft/model_aug_dev_ensemble.txt", "wb") as f1: # 手动更改保存的名字,路径不用改 89 | pickle.dump(logits_dict, f1) 90 | 91 | yes_predictions = [] # 正确答案是肯定的错题 92 | no_predictions = [] # 正确答案是否定的错题 93 | depend_predictions = [] # 正确答案是不确定的错题 94 | yes, no, depend = 0, 0, 0 95 | yes_wrong, no_wrong, depend_wrong = 0, 0, 0 96 | for key, value in answer_dict.items(): 97 | if truth_dict[key] != 
value: 98 | if truth_dict[key] == 0: 99 | yes += 1 100 | yes_wrong += 1 101 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 102 | wrong_answer = dev_eval_file[str(key)][value] 103 | yes_predictions.append( 104 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 105 | elif truth_dict[key] == 1: 106 | no += 1 107 | no_wrong += 1 108 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 109 | wrong_answer = dev_eval_file[str(key)][value] 110 | no_predictions.append( 111 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 112 | else: 113 | depend += 1 114 | depend_wrong += 1 115 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 116 | wrong_answer = dev_eval_file[str(key)][value] 117 | depend_predictions.append( 118 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 119 | else: 120 | if truth_dict[key] == 0: 121 | yes += 1 122 | elif truth_dict[key] == 1: 123 | no += 1 124 | else: 125 | depend += 1 126 | 127 | print("肯定型问题个数:{},否定型问题个数:{},不确定问题个数:{}".format(yes, no, depend)) 128 | print("肯定型问题正确率:{}".format((yes - yes_wrong) / yes * 1.0)) 129 | print("否定型问题正确率:{}".format((no - no_wrong) / no * 1.0)) 130 | print("不确定型问题正确率:{}".format((depend - depend_wrong) / depend * 1.0)) 131 | outputs_0 = u'\n'.join(yes_predictions) 132 | outputs_1 = u'\n'.join(no_predictions) 133 | outputs_2 = u'\n'.join(depend_predictions) 134 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_0.txt")), 'w', encoding='utf-8') as f: 135 | f.write(outputs_0) 136 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_1.txt")), 'w', encoding='utf-8') as f: 137 | f.write(outputs_1) 138 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_2.txt")), 'w', encoding='utf-8') as f: 139 | f.write(outputs_2) 140 | print("done!") 141 | -------------------------------------------------------------------------------- /baseline/file_save.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | examine_dev.py:检查dev集的结果,辅助分析 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import pickle 14 | import os 15 | import codecs 16 | import time 17 | from model import Model 18 | from util import * 19 | 20 | 21 | def save_dev(config): 22 | """ 23 | 验证dev集的结果,保存文件 24 | """ 25 | with open(config.id2vec_file, "r") as fh: 26 | id2vec = np.array(json.load(fh), dtype=np.float32) 27 | with open(config.dev_eval_file, "r") as fh: 28 | dev_eval_file = json.load(fh) 29 | total = 29968 30 | 31 | print("Loading model...") 32 | sess_config = tf.ConfigProto(allow_soft_placement=True) 33 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 34 | sess_config.gpu_options.allow_growth = True 35 | 36 | truth_dict = {} 37 | predict_dict = {} 38 | logits_dict1 = {} 39 | 40 | print("正在模型预测!") 41 | g1 = tf.Graph() 42 | with tf.Session(graph=g1, config=sess_config) as sess1: 43 | with g1.as_default(): 44 | dev_batch1 = get_dataset(config.dev_record_file, get_record_parser( 45 | config), config).make_one_shot_iterator() 46 | model_1 = Model(config, dev_batch1, id2vec, trainable=False) 47 | sess1.run(tf.global_variables_initializer()) 48 | saver1 = tf.train.Saver() 49 | # 需要手动更改路径 50 | saver1.restore( 51 | sess1, "./log/model/model_10000_devAcc_0.662240.ckpt") 52 | sess1.run(tf.assign(model_1.is_train, 53 | tf.constant(False, 
dtype=tf.bool))) 54 | for step in tqdm(range(total // config.batch_size + 1)): 55 | qa_id, logits, truths = sess1.run( 56 | [model_1.qa_id, model_1.logits, model_1.answer]) 57 | for ids, logits, truth in zip(qa_id, logits, truths): 58 | logits_dict1[str(ids)] = logits 59 | truth_dict[str(ids)] = truth 60 | if len(logits_dict1) != len(dev_eval_file): 61 | print("logits1 data number not match") 62 | 63 | a = tf.placeholder(shape=[3], dtype=tf.float32, name="me") 64 | softmax = tf.nn.softmax(a) 65 | for key, val in truth_dict.items(): 66 | value = sess1.run(softmax, feed_dict={a: logits_dict1[key]}) 67 | predict_dict[key] = value 68 | print("正在保存dev的softmax结果文件!") 69 | if not os.path.exists("./dev_soft"): 70 | os.makedirs("./dev_soft") 71 | with open("./dev_soft/BIDAF_b64_e256_h150_v256.txt", "wb") as f1: # 手动更改保存的名字,路径不用改 72 | pickle.dump(predict_dict, f1) 73 | if not os.path.exists("./truth"): 74 | os.makedirs("./truth") 75 | with open("./truth/truth_dict.txt", "wb") as f2: # 不用改 76 | pickle.dump(truth_dict, f2) 77 | 78 | 79 | def save_test(config): 80 | """ 81 | 输出test集的结果,保存文件 82 | """ 83 | with open(config.id2vec_file, "r") as fh: 84 | id2vec = np.array(json.load(fh), dtype=np.float32) 85 | with open(config.test_eval_file, "r") as fh: 86 | test_eval_file = json.load(fh) 87 | total = 10000 88 | 89 | print("Loading model...") 90 | sess_config = tf.ConfigProto(allow_soft_placement=True) 91 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 92 | sess_config.gpu_options.allow_growth = True 93 | 94 | predict_dict = {} 95 | logits_dict1 = {} 96 | 97 | print("正在模型预测!") 98 | g1 = tf.Graph() 99 | with tf.Session(graph=g1, config=sess_config) as sess1: 100 | with g1.as_default(): 101 | test_batch1 = get_dataset(config.test_record_file, get_record_parser( 102 | config), config).make_one_shot_iterator() 103 | model_1 = Model(config, test_batch1, id2vec, trainable=False) 104 | sess1.run(tf.global_variables_initializer()) 105 | saver1 = tf.train.Saver() 106 | # 需要手动更改路径 107 | saver1.restore( 108 | sess1, "./log/model/model_131000_devAcc_0.732782.ckpt") 109 | sess1.run(tf.assign(model_1.is_train, 110 | tf.constant(False, dtype=tf.bool))) 111 | for step in tqdm(range(total // config.batch_size + 1)): 112 | qa_id, logits = sess1.run( 113 | [model_1.qa_id, model_1.logits]) 114 | for ids, logits in zip(qa_id, logits): 115 | logits_dict1[str(ids)] = logits 116 | if len(logits_dict1) != len(test_eval_file): 117 | print("logits1 data number not match") 118 | 119 | a = tf.placeholder(shape=[3], dtype=tf.float32, name="me") 120 | softmax = tf.nn.softmax(a) 121 | for key, val in logits_dict1.items(): 122 | value = sess1.run(softmax, feed_dict={a: logits_dict1[key]}) 123 | predict_dict[key] = value 124 | 125 | print("正在保存dev的softmax结果文件!") 126 | if not os.path.exists("./test_soft"): 127 | os.makedirs("./test_soft") 128 | with open("./test_soft/RNET_b64_e300_h60_v300.txt", "wb") as f1: 129 | pickle.dump(predict_dict, f1) 130 | -------------------------------------------------------------------------------- /baseline/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | main.py:train and test 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import os 14 | import codecs 15 | import time 16 | import math 17 | 18 | from model import Model 19 | from util import * 20 | 21 | 22 | def train(config): 23 | """ 24 | 
训练与验证函数 25 | """ 26 | with open(config.id2vec_file, "r") as fh: 27 | id2vec = np.array(json.load(fh), dtype=np.float32) 28 | with open(config.train_eval_file, "r") as fh: 29 | train_eval_file = json.load(fh) 30 | with open(config.dev_eval_file, "r") as fh: 31 | dev_eval_file = json.load(fh) 32 | 33 | dev_total = 29968 # 验证集数据量 34 | 35 | # 不同参数的训练在不同的文件夹下存储 36 | log_dir = config.log_dir + config.experiment 37 | save_dir = config.save_dir + config.experiment 38 | if not os.path.exists(log_dir): 39 | os.makedirs(log_dir) 40 | if not os.path.exists(save_dir): 41 | os.makedirs(save_dir) 42 | 43 | print("Building model...") 44 | parser = get_record_parser(config) 45 | train_dataset = get_batch_dataset(config.train_record_file, parser, config) 46 | dev_dataset = get_dataset(config.dev_record_file, parser, config) 47 | 48 | # 可馈送迭代器,通过feed_dict机制选择每次sess.run时调用train_iterator还是dev_iterator 49 | handle = tf.placeholder(tf.string, shape=[]) 50 | iterator = tf.data.Iterator.from_string_handle( 51 | handle, train_dataset.output_types, train_dataset.output_shapes) 52 | train_iterator = train_dataset.make_one_shot_iterator() 53 | dev_iterator = dev_dataset.make_one_shot_iterator() 54 | 55 | # 选取模型 56 | if config.model_name == "default": 57 | model = Model(config, iterator, id2vec) 58 | else: 59 | print("model error") 60 | return 61 | 62 | sess_config = tf.ConfigProto(allow_soft_placement=True) 63 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 64 | sess_config.gpu_options.allow_growth = True 65 | 66 | loss_save = 100.0 67 | patience = 0 68 | lr = config.init_learning_rate 69 | emb_lr = config.init_emb_lr 70 | 71 | with tf.Session(config=sess_config) as sess: 72 | writer = tf.summary.FileWriter(log_dir, sess.graph) # 存储计算图 73 | sess.run(tf.global_variables_initializer()) 74 | saver = tf.train.Saver() 75 | train_handle = sess.run(train_iterator.string_handle()) 76 | dev_handle = sess.run(dev_iterator.string_handle()) 77 | sess.run(tf.assign(model.is_train, tf.constant(True, dtype=tf.bool))) 78 | sess.run(tf.assign(model.learning_rate, 79 | tf.constant(lr, dtype=tf.float32))) 80 | # sess.run(tf.assign(model.emb_lr, tf.constant(emb_lr, dtype=tf.float32))) 81 | 82 | best_dev_acc = 0.0 # 定义一个最佳验证准确率,只有当准确率高于它才保存模型 83 | print("Training ...") 84 | for go in tqdm(range(1, config.num_steps + 1)): 85 | global_step = sess.run(model.global_step) + 1 86 | loss, train_op = sess.run([model.loss, model.train_op], feed_dict={ 87 | handle: train_handle}) 88 | if global_step % config.period == 0: # 每隔一段步数就记录一次train_loss和learning_rate 89 | loss_sum = tf.Summary(value=[tf.Summary.Value( 90 | tag="model/loss", simple_value=loss), ]) 91 | writer.add_summary(loss_sum, global_step) 92 | lr_sum = tf.Summary(value=[tf.Summary.Value( 93 | tag="model/learning_rate", simple_value=sess.run(model.learning_rate)), ]) 94 | writer.add_summary(lr_sum, global_step) 95 | # emb_lr_sum = tf.Summary(value=[tf.Summary.Value( 96 | # tag="model/emb_lr", simple_value=sess.run(model.emb_lr)), ]) 97 | # writer.add_summary(emb_lr_sum, global_step) 98 | 99 | if global_step % config.checkpoint == 0: # 验证acc,并保存模型 100 | sess.run(tf.assign(model.is_train, 101 | tf.constant(False, dtype=tf.bool))) 102 | 103 | # 评估训练集 104 | _, summ = evaluate_batch( 105 | model, config.val_num_batches, train_eval_file, sess, "train_eval", handle, train_handle) 106 | for s in summ: 107 | writer.add_summary(s, global_step) 108 | 109 | # 评估验证集 110 | metrics, summ = evaluate_batch( 111 | model, dev_total // config.batch_size + 1, dev_eval_file, sess, "dev", 
handle, dev_handle) 112 | sess.run(tf.assign(model.is_train, 113 | tf.constant(True, dtype=tf.bool))) 114 | for s in summ: 115 | writer.add_summary(s, global_step) 116 | writer.flush() # 将事件文件刷新到磁盘 117 | 118 | # 学习率衰减的策略1 119 | if config.optimizer == "Adadelta": 120 | dev_loss = metrics["loss"] 121 | if dev_loss < loss_save: 122 | loss_save = dev_loss 123 | patience = 0 124 | else: 125 | patience += 1 126 | if patience >= config.patience: 127 | lr /= 2.0 128 | loss_save = dev_loss 129 | patience = 0 130 | elif config.optimizer == "Adam": 131 | # 学习率衰减策略2 132 | if global_step <= 50000: 133 | lr = config.init_learning_rate 134 | elif global_step <= 100000: 135 | lr = config.init_learning_rate / \ 136 | math.sqrt((global_step - 45000) / 5000) 137 | emb_lr = 5e-6 138 | elif global_step <= 200000: 139 | lr = config.init_learning_rate / \ 140 | math.sqrt((global_step - 45000) / 1000) 141 | emb_lr = 3e-6 142 | else: 143 | lr = config.init_learning_rate / \ 144 | math.sqrt(global_step / 1000) 145 | emb_lr = 1e-6 146 | else: 147 | print("error") 148 | return 149 | 150 | sess.run(tf.assign(model.learning_rate, 151 | tf.constant(lr, dtype=tf.float32))) 152 | # sess.run(tf.assign(model.emb_lr, tf.constant( 153 | # emb_lr, dtype=tf.float32))) 154 | 155 | # 保存模型的逻辑 156 | if metrics["accuracy"] > best_dev_acc: 157 | best_dev_acc = metrics["accuracy"] 158 | filename = os.path.join( 159 | save_dir, "model_{}_devAcc_{:.6f}.ckpt".format(global_step, best_dev_acc)) 160 | saver.save(sess, filename) 161 | 162 | print("finished!") 163 | 164 | 165 | def evaluate_batch(model, num_batches, eval_file, sess, data_type, handle, str_handle): 166 | """ 167 | 模型评估函数 168 | """ 169 | answer_dict = {} # 答案词典 170 | truth_dict = {} # 真实答案词典 171 | losses = [] 172 | for _ in tqdm(range(1, num_batches + 1)): 173 | qa_id, loss, truth, answer = sess.run( 174 | [model.qa_id, model.loss, model.answer, model.classes], feed_dict={handle: str_handle}) 175 | answer_dict_ = {} 176 | truth_dict_ = {} 177 | for ids, tr, ans in zip(qa_id, truth, answer): 178 | answer_dict_[str(ids)] = ans 179 | truth_dict_[str(ids)] = tr 180 | answer_dict.update(answer_dict_) 181 | truth_dict.update(truth_dict_) 182 | losses.append(loss) 183 | loss = np.mean(losses) 184 | metrics = evaluate_acc(truth_dict, answer_dict) 185 | metrics["loss"] = loss 186 | loss_sum = tf.Summary(value=[tf.Summary.Value( 187 | tag="{}/loss".format(data_type), simple_value=metrics["loss"]), ]) 188 | acc_sum = tf.Summary(value=[tf.Summary.Value( 189 | tag="{}/accuracy".format(data_type), simple_value=metrics["accuracy"]), ]) 190 | return metrics, [loss_sum, acc_sum] 191 | 192 | 193 | def test(config): 194 | """ 195 | 测试函数 196 | """ 197 | with open(config.id2vec_file, "r") as fh: 198 | id2vec = np.array(json.load(fh), dtype=np.float32) 199 | with open(config.test_eval_file, "r") as fh: 200 | test_eval_file = json.load(fh) 201 | 202 | total = 10000 203 | # 读取模型的路径和预测存储的路径 204 | save_dir = config.save_dir + config.experiment 205 | if not os.path.exists(save_dir): 206 | print("no save!") 207 | return 208 | predic_time = time.strftime("%Y-%m-%d_%H:%M:%S ", time.localtime()) 209 | prediction_file = os.path.join( 210 | config.prediction_dir, (predic_time + "_predictions.txt")) 211 | 212 | print("Loading model...") 213 | test_batch = get_dataset(config.test_record_file, get_record_parser( 214 | config), config).make_one_shot_iterator() 215 | 216 | # 选取模型 217 | if config.model_name == "default": 218 | model = Model(config, test_batch, id2vec, trainable=False) 219 | else: 220 | print("model 
error") 221 | return 222 | 223 | sess_config = tf.ConfigProto(allow_soft_placement=True) 224 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 225 | sess_config.gpu_options.allow_growth = True 226 | 227 | print("testing ...") 228 | with tf.Session(config=sess_config) as sess: 229 | sess.run(tf.global_variables_initializer()) 230 | saver = tf.train.Saver() 231 | saver.restore(sess, tf.train.latest_checkpoint(save_dir)) 232 | sess.run(tf.assign(model.is_train, tf.constant(False, dtype=tf.bool))) 233 | answer_dict = {} 234 | for step in tqdm(range(total // config.batch_size + 1)): 235 | # 预测答案 236 | qa_id, answer = sess.run([model.qa_id, model.classes]) 237 | answer_dict_ = {} 238 | for ids, ans in zip(qa_id, answer): 239 | answer_dict_[str(ids)] = ans 240 | answer_dict.update(answer_dict_) 241 | # 将结果写文件的操作,不用考虑问题顺序 242 | if len(answer_dict) != len(test_eval_file): 243 | print("data number not match") 244 | predictions = [] 245 | for key, value in answer_dict.items(): 246 | prediction_answer = test_eval_file[str(key)][value] 247 | predictions.append(str(key) + '\t' + str(prediction_answer)) 248 | outputs = u'\n'.join(predictions) 249 | with codecs.open(prediction_file, 'w', encoding='utf-8') as f: 250 | f.write(outputs) 251 | print("done!") 252 | 253 | 254 | def dev(config): 255 | with open(config.id2vec_file, "r") as fh: 256 | id2vec = np.array(json.load(fh), dtype=np.float32) 257 | with open(config.dev_eval_file, "r") as fh: 258 | dev_eval_file = json.load(fh) 259 | 260 | total = 29968 261 | print("Loading model...") 262 | sess_config = tf.ConfigProto(allow_soft_placement=True) 263 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 264 | sess_config.gpu_options.allow_growth = True 265 | 266 | truth_dict = {} 267 | predict_dict = {} 268 | logits_dict1 = {} 269 | logits_dict2 = {} 270 | logits_dict3 = {} 271 | logits_dict = {} 272 | 273 | print("model-1 predicting!") 274 | g1 = tf.Graph() 275 | with tf.Session(graph=g1, config=sess_config) as sess1: 276 | with g1.as_default(): 277 | dev_batch1 = get_dataset(config.dev_record_file, get_record_parser(config), 278 | config).make_one_shot_iterator() 279 | model_1 = Model( 280 | config, dev_batch1, id2vec, trainable=False) 281 | sess1.run(tf.global_variables_initializer()) 282 | saver1 = tf.train.Saver() 283 | saver1.restore( 284 | sess1, "./log/model/model_131000_devAcc_0.732782.ckpt") 285 | sess1.run(tf.assign(model_1.is_train, 286 | tf.constant(False, dtype=tf.bool))) 287 | for step in tqdm(range(total // config.batch_size + 1)): 288 | qa_id, logits, truths = sess1.run( 289 | [model_1.qa_id, model_1.logits, model_1.answer]) 290 | for ids, logits, truth in zip(qa_id, logits, truths): 291 | logits_dict1[str(ids)] = logits 292 | truth_dict[str(ids)] = truth 293 | if len(logits_dict1) != len(dev_eval_file): 294 | print("logits1 data number not match") 295 | 296 | print("logits相加,模型融合分类!") 297 | predictions = [] 298 | g4 = tf.Graph() 299 | with tf.Session(graph=g4, config=sess_config) as sess4: 300 | a = tf.placeholder(shape=[3], dtype=tf.float32, name="mee") 301 | b = tf.placeholder(shape=[3], dtype=tf.float32, name="me") 302 | c = tf.placeholder(shape=[3], dtype=tf.float32, name="xiaodong") 303 | softmax_a = tf.nn.softmax(a) 304 | softmax_b = tf.nn.softmax(b) 305 | softmax_c = tf.nn.softmax(c) 306 | final = 0.4 * softmax_a + 0.4 * softmax_b + 0.2 * softmax_c 307 | final_class = tf.cast(tf.argmax(final), dtype=tf.int32) 308 | for key, val in truth_dict.items(): 309 | value = sess4.run(final_class, feed_dict={ 310 | a: 
logits_dict1[key], b: logits_dict2[key], c: logits_dict3[key]}) 311 | predict_dict[key] = value 312 | print(evaluate_acc(truth_dict, predict_dict)) 313 | -------------------------------------------------------------------------------- /baseline/model.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | model.py:基于R-Net的改进模型,将PtrNet改成分类器 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | 10 | import tensorflow as tf 11 | from nn_func import cudnn_gru, native_gru, dot_attention, summ, dropout 12 | 13 | 14 | class Model(object): 15 | def __init__(self, config, batch, word_mat=None, trainable=True, opt=True): 16 | """ 17 | 模型初始化函数 18 | Args: 19 | config:是tf.flag.FLAGS,配置整个项目的超参数 20 | batch:是一个tf.data.iterator对象,读取数据的迭代器,可能联系到tf.records,如果我们的数据集比较小就可以不用 21 | word_mat:np.array数组,是词向量? 22 | char_mat:同上 23 | """ 24 | self.config = config 25 | self.global_step = tf.get_variable('global_step', shape=[], dtype=tf.int32, 26 | initializer=tf.constant_initializer(0), trainable=False) 27 | # tf.data.iterator的get_next方法,返回dataset中下一个element的tensor对象,在sess.run中实现迭代 28 | """ 29 | passage: passage序列的每个词的id号的tensor(tf.int32),长度应该是都取最大限制长度,空余的填充空值?(这里待定) 30 | question: question序列的每个词的id号的tensor(tf.int32) 31 | ch, qh, y1, y2: 本项目不需要,已经取消 32 | qa_id: question的id 33 | answer: 新添加的answer标签,(0/1/2),shape初步定义为[batch_size] 34 | """ 35 | self.passage, self.question, self.answer, self.qa_id = batch.get_next() 36 | self.is_train = tf.get_variable( 37 | "is_train", shape=[], dtype=tf.bool, trainable=False) 38 | 39 | # word embeddings的变量,这里定义的是不能训练的 40 | self.word_mat = tf.get_variable("word_mat", initializer=tf.constant( 41 | word_mat, dtype=tf.float32), trainable=False) 42 | 43 | # tf.cast将tensor转换为bool类型,生成mask,有值部分用true,空值用false 44 | self.c_mask = tf.cast(self.passage, tf.bool) 45 | self.q_mask = tf.cast(self.question, tf.bool) 46 | # 求每个序列的真实长度,得到_len的tensor 47 | self.c_len = tf.reduce_sum(tf.cast(self.c_mask, tf.int32), axis=1) 48 | self.q_len = tf.reduce_sum(tf.cast(self.q_mask, tf.int32), axis=1) 49 | 50 | if opt: 51 | batch_size = config.batch_size 52 | # 求一个batch中序列最大长度,并按照最大长度对对tensor进行slice划分 53 | self.c_maxlen = tf.reduce_max(self.c_len) 54 | self.q_maxlen = tf.reduce_max(self.q_len) 55 | self.c = tf.slice(self.passage, [0, 0], [ 56 | batch_size, self.c_maxlen]) 57 | self.q = tf.slice(self.question, [0, 0], [ 58 | batch_size, self.q_maxlen]) 59 | self.c_mask = tf.slice(self.c_mask, [0, 0], [ 60 | batch_size, self.c_maxlen]) 61 | self.q_mask = tf.slice(self.q_mask, [0, 0], [ 62 | batch_size, self.q_maxlen]) 63 | else: 64 | self.c_maxlen, self.q_maxlen = config.para_limit, config.ques_limit 65 | 66 | self.RNet() # 构造R-Net模型 67 | 68 | if trainable: 69 | 70 | self.learning_rate = tf.get_variable( 71 | "learning_rate", shape=[], dtype=tf.float32, trainable=False) 72 | 73 | if config.optimizer == "Adam": 74 | self.opt = tf.train.AdamOptimizer( 75 | learning_rate=self.learning_rate, epsilon=1e-8) 76 | elif config.optimizer == "Adadelta": 77 | self.opt = tf.train.AdadeltaOptimizer( 78 | learning_rate=self.learning_rate, epsilon=1e-6) 79 | else: 80 | print("optimizer error") 81 | return 82 | 83 | grads = self.opt.compute_gradients(self.loss) 84 | gradients, variables = zip(*grads) 85 | capped_grads, _ = tf.clip_by_global_norm( 86 | gradients, config.grad_clip) 87 | self.train_op = self.opt.apply_gradients( 88 | zip(capped_grads, variables), global_step=self.global_step) 89 | 90 | # # 对embedding层设置单独的学习率 91 | # self.emb_lr = 
tf.get_variable( 92 | # "emb_lr", shape=[], dtype=tf.float32, trainable=False) 93 | # self.learning_rate = tf.get_variable( 94 | # "learning_rate", shape=[], dtype=tf.float32, trainable=False) 95 | # self.emb_opt = tf.train.AdamOptimizer( 96 | # learning_rate=self.emb_lr, epsilon=1e-8) 97 | # self.opt = tf.train.AdamOptimizer( 98 | # learning_rate=self.learning_rate, epsilon=1e-8) 99 | # # 区分不同的变量列表 100 | # self.var_list = tf.trainable_variables() 101 | # var_list1 = [] 102 | # var_list2 = [] 103 | # for var in self.var_list: 104 | # if var.op.name == "word_mat": 105 | # var_list1.append(var) 106 | # else: 107 | # var_list2.append(var) 108 | 109 | # grads = tf.gradients(self.loss, var_list1 + var_list2) 110 | # capped_grads, _ = tf.clip_by_global_norm( 111 | # grads, config.grad_clip) 112 | # grads1 = capped_grads[:len(var_list1)] 113 | # grads2 = capped_grads[len(var_list1):] 114 | # self.train_op1 = self.emb_opt.apply_gradients( 115 | # zip(grads1, var_list1)) 116 | # self.train_op2 = self.opt.apply_gradients( 117 | # zip(grads2, var_list2), global_step=self.global_step) 118 | # self.train_op = tf.group(self.train_op1, self.train_op2) 119 | 120 | def RNet(self): 121 | config = self.config 122 | batch_size, PL, QL, d = config.batch_size, self.c_maxlen, self.q_maxlen, config.hidden 123 | gru = cudnn_gru if config.use_cudnn else native_gru # 选择使用哪种gru网络 124 | 125 | with tf.variable_scope("embedding"): 126 | # word_embedding层 127 | with tf.name_scope("word"): 128 | # embedding后的shape是[batch_size, max_len, vec_len] 129 | c_emb = tf.nn.embedding_lookup(self.word_mat, self.c) 130 | q_emb = tf.nn.embedding_lookup(self.word_mat, self.q) 131 | 132 | with tf.variable_scope("encoding"): 133 | # encoder层,将context和question分别输入双向GRU 134 | rnn = gru(num_layers=3, num_units=d, batch_size=batch_size, input_size=c_emb.get_shape( 135 | ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train) 136 | # RNN每层的正向反向输出合并,本代码默认的是每层的输出也合并 137 | # 所以对于3层rnn,输出的shape是[batch_size, max_len, 6*num_units] 138 | # 并且,序列空值处的输出都清零了 139 | c = rnn(c_emb, seq_len=self.c_len) 140 | q = rnn(q_emb, seq_len=self.q_len) 141 | 142 | with tf.variable_scope("QP_attention"): 143 | """ 144 | 基于注意力的循环神经网络层,匹配context和question 145 | """ 146 | # qc_att的shape [batch_size, c_maxlen, 12*hidden] 147 | qc_att_ = dot_attention(inputs=c, memory=q, mask=self.q_mask, hidden=d, 148 | keep_prob=config.keep_prob, is_train=self.is_train) 149 | rnn = gru(num_layers=1, num_units=d, batch_size=batch_size, input_size=qc_att_.get_shape( 150 | ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train) 151 | # att:[batch_size, c_maxlen, 6*hidden] 152 | qc_att = rnn(qc_att_, seq_len=self.c_len) 153 | 154 | with tf.variable_scope("passage_match"): 155 | """ 156 | context自匹配层 157 | """ 158 | c_att = dot_attention( 159 | qc_att, qc_att, mask=self.c_mask, hidden=d, keep_prob=config.keep_prob, is_train=self.is_train) 160 | rnn = gru(num_layers=1, num_units=d, batch_size=batch_size, input_size=c_att.get_shape( 161 | ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train) 162 | # match:[batch_size, c_maxlen, 6*hidden] 163 | c_match = rnn(c_att, seq_len=self.c_len) 164 | 165 | with tf.variable_scope("YesNo_classification"): 166 | """ 167 | 对问题答案的分类层, 需要的输入有question的编码结果q和context的match 168 | """ 169 | # init的shape:[batch_size, 2*hidden] 170 | # 这步的作用初始猜测是将question进行pooling操作,然后再输入给一个rnn层进行分类 171 | init = summ(q[:, :, -2 * d:], d, mask=self.q_mask, 172 | keep_prob=config.keep_prob, is_train=self.is_train) 173 | c_match_ = 
dropout(c_match, keep_prob=config.keep_prob, 174 | is_train=self.is_train) 175 | final_hiddens = init.get_shape().as_list()[-1] 176 | final_gru = tf.contrib.rnn.GRUCell(final_hiddens) 177 | _, final_state = tf.nn.dynamic_rnn( 178 | final_gru, c_match_, initial_state=init, dtype=tf.float32) 179 | final_w = tf.get_variable(name="final_w", shape=[final_hiddens, 3]) 180 | final_b = tf.get_variable(name="final_b", shape=[ 181 | 3], initializer=tf.constant_initializer(0.)) 182 | self.logits = tf.matmul(final_state, final_w) 183 | self.logits = tf.nn.bias_add( 184 | self.logits, final_b) # logits:[batch_size, 3] 185 | 186 | with tf.variable_scope("softmax_and_loss"): 187 | final_softmax = tf.nn.softmax(self.logits) 188 | self.classes = tf.cast( 189 | tf.argmax(final_softmax, axis=1), dtype=tf.int32, name="classes") 190 | # 注意stop_gradient的使用,因为answer不是placeholder传进来的,所以要注明不对其计算梯度 191 | if config.loss_function == "focal_loss": 192 | self.loss = tf.reduce_mean(sparse_focal_loss( 193 | logits=self.logits, labels=tf.stop_gradient(self.answer))) 194 | else: 195 | self.loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits( 196 | logits=self.logits, labels=tf.stop_gradient(self.answer))) 197 | 198 | def get_loss(self): 199 | return self.loss 200 | 201 | def get_global_step(self): 202 | return self.global_step 203 | -------------------------------------------------------------------------------- /baseline/nn_func.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | nn_func.py:神经网络模型的组件 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | 11 | INF = 1e30 12 | 13 | 14 | class cudnn_gru: 15 | 16 | def __init__(self, num_layers, num_units, batch_size, input_size, keep_prob=1.0, is_train=None, scope=None): 17 | self.num_layers = num_layers 18 | self.grus = [] 19 | self.inits = [] 20 | self.dropout_mask = [] 21 | for layer in range(num_layers): 22 | input_size_ = input_size if layer == 0 else 2 * num_units 23 | gru_fw = tf.contrib.cudnn_rnn.CudnnGRU(1, num_units) 24 | gru_bw = tf.contrib.cudnn_rnn.CudnnGRU(1, num_units) 25 | init_fw = tf.tile(tf.Variable( 26 | tf.zeros([1, 1, num_units])), [1, batch_size, 1]) 27 | init_bw = tf.tile(tf.Variable( 28 | tf.zeros([1, 1, num_units])), [1, batch_size, 1]) 29 | mask_fw = dropout(tf.ones([1, batch_size, input_size_], dtype=tf.float32), 30 | keep_prob=keep_prob, is_train=is_train, mode=None) 31 | mask_bw = dropout(tf.ones([1, batch_size, input_size_], dtype=tf.float32), 32 | keep_prob=keep_prob, is_train=is_train, mode=None) 33 | self.grus.append((gru_fw, gru_bw, )) 34 | self.inits.append((init_fw, init_bw, )) 35 | self.dropout_mask.append((mask_fw, mask_bw, )) 36 | 37 | def __call__(self, inputs, seq_len, keep_prob=1.0, is_train=None, concat_layers=True): 38 | # cudnn GRU需要交换张量的维度,可能是便于计算 39 | outputs = [tf.transpose(inputs, [1, 0, 2])] 40 | for layer in range(self.num_layers): 41 | gru_fw, gru_bw = self.grus[layer] 42 | init_fw, init_bw = self.inits[layer] 43 | mask_fw, mask_bw = self.dropout_mask[layer] 44 | with tf.variable_scope("fw_{}".format(layer)): 45 | out_fw, _ = gru_fw( 46 | outputs[-1] * mask_fw, initial_state=(init_fw, )) 47 | with tf.variable_scope("bw_{}".format(layer)): 48 | inputs_bw = tf.reverse_sequence( 49 | outputs[-1] * mask_bw, seq_lengths=seq_len, seq_dim=0, batch_dim=1) 50 | out_bw, _ = gru_bw(inputs_bw, initial_state=(init_bw, )) 51 | out_bw = tf.reverse_sequence( 52 | out_bw, seq_lengths=seq_len, seq_dim=0, 
batch_dim=1) 53 | outputs.append(tf.concat([out_fw, out_bw], axis=2)) 54 | if concat_layers: 55 | res = tf.concat(outputs[1:], axis=2) 56 | else: 57 | res = outputs[-1] 58 | res = tf.transpose(res, [1, 0, 2]) 59 | return res 60 | 61 | 62 | class native_gru: 63 | 64 | def __init__(self, num_layers, num_units, batch_size, input_size, keep_prob=1.0, is_train=None, scope="native_gru"): 65 | self.num_layers = num_layers 66 | self.grus = [] 67 | self.inits = [] 68 | self.dropout_mask = [] 69 | self.scope = scope 70 | for layer in range(num_layers): 71 | input_size_ = input_size if layer == 0 else 2 * num_units 72 | # 双向Bi-GRU f:forward b:back 73 | gru_fw = tf.contrib.rnn.GRUCell(num_units) 74 | gru_bw = tf.contrib.rnn.GRUCell(num_units) 75 | # tf.tile 平铺给定的张量,这里是将初始状态扩张到batch_size倍 76 | init_fw = tf.tile(tf.Variable( 77 | tf.zeros([1, num_units])), [batch_size, 1]) 78 | init_bw = tf.tile(tf.Variable( 79 | tf.zeros([1, num_units])), [batch_size, 1]) 80 | mask_fw = dropout(tf.ones([batch_size, 1, input_size_], dtype=tf.float32), 81 | keep_prob=keep_prob, is_train=is_train, mode=None) 82 | mask_bw = dropout(tf.ones([batch_size, 1, input_size_], dtype=tf.float32), 83 | keep_prob=keep_prob, is_train=is_train, mode=None) 84 | self.grus.append((gru_fw, gru_bw, )) 85 | self.inits.append((init_fw, init_bw, )) 86 | self.dropout_mask.append((mask_fw, mask_bw, )) 87 | 88 | def __call__(self, inputs, seq_len, concat_layers=True): 89 | """ 90 | 运行RNN 91 | 这里的keep_prob和is_train没用,在__init__中就已设置好了 92 | """ 93 | outputs = [inputs] 94 | with tf.variable_scope(self.scope): 95 | for layer in range(self.num_layers): 96 | gru_fw, gru_bw = self.grus[layer] 97 | init_fw, init_bw = self.inits[layer] 98 | mask_fw, mask_bw = self.dropout_mask[layer] 99 | # 正向RNN 100 | with tf.variable_scope("fw_{}".format(layer)): 101 | # 每一层使用上层的输出 102 | # dynamic_rnn中的超过seq_len的部分就不计算了,state直接重复,output直接清零,节省资源 103 | out_fw, _ = tf.nn.dynamic_rnn( 104 | gru_fw, outputs[-1] * mask_fw, seq_len, initial_state=init_fw, dtype=tf.float32) 105 | # 反向RNN 106 | with tf.variable_scope("bw_{}".format(layer)): 107 | inputs_bw = tf.reverse_sequence( 108 | outputs[-1] * mask_bw, seq_lengths=seq_len, seq_dim=1, batch_dim=0) 109 | out_bw, _ = tf.nn.dynamic_rnn( 110 | gru_bw, inputs_bw, seq_len, initial_state=init_bw, dtype=tf.float32) 111 | out_bw = tf.reverse_sequence( 112 | out_bw, seq_lengths=seq_len, seq_dim=1, batch_dim=0) 113 | # 正向输出和反向输出合并 114 | outputs.append(tf.concat([out_fw, out_bw], axis=2)) 115 | if concat_layers: 116 | res = tf.concat(outputs[1:], axis=2) 117 | else: 118 | res = outputs[-1] 119 | return res 120 | 121 | 122 | def dropout(args, keep_prob, is_train, mode="recurrent"): 123 | """ 124 | dropout层,args初始是1.0 125 | """ 126 | if keep_prob < 1.0: 127 | noise_shape = None 128 | scale = 1.0 129 | shape = tf.shape(args) 130 | if mode == "embedding": 131 | noise_shape = [shape[0], 1] 132 | scale = keep_prob 133 | if mode == "recurrent" and len(args.get_shape().as_list()) == 3: 134 | noise_shape = [shape[0], 1, shape[-1]] 135 | args = tf.cond(is_train, lambda: tf.nn.dropout( 136 | args, keep_prob, noise_shape=noise_shape) * scale, lambda: args) 137 | return args 138 | 139 | 140 | def softmax_mask(val, mask): 141 | """ 142 | 作用是给空值处减小注意力 143 | """ 144 | return -INF * (1 - tf.cast(mask, tf.float32)) + val # tf.cast:true转为1.0,false转为0.0 145 | 146 | 147 | def summ(memory, hidden, mask, keep_prob=1.0, is_train=None, scope="summ"): 148 | """ 149 | 对question进行最后一步的处理,可以看作是pooling吗 150 | """ 151 | with tf.variable_scope(scope): 152 | d_memory = 
dropout(memory, keep_prob=keep_prob, is_train=is_train) 153 | s0 = tf.nn.tanh(dense(d_memory, hidden, scope="s0")) 154 | s = dense(s0, 1, use_bias=False, scope="s") 155 | # tf.squeeze把长度只有1的维度去掉 156 | # s1:[batch_size, c_maxlen] 157 | s1 = softmax_mask(tf.squeeze(s, [2]), mask) 158 | a = tf.expand_dims(tf.nn.softmax(s1), axis=2) 159 | res = tf.reduce_sum(a * memory, axis=1) # 逐元素相乘,shape跟随memory一致 160 | return res # [batch_size, 2*hidden] 161 | 162 | 163 | def dot_attention(inputs, memory, mask, hidden, keep_prob=1.0, is_train=None, scope="dot_attention"): 164 | """ 165 | 门控attention层 166 | """ 167 | with tf.variable_scope(scope): 168 | 169 | d_inputs = dropout(inputs, keep_prob=keep_prob, is_train=is_train) 170 | d_memory = dropout(memory, keep_prob=keep_prob, is_train=is_train) 171 | JX = tf.shape(inputs)[1] # inputs的1维度,应该是c_maxlen 172 | 173 | with tf.variable_scope("attention"): 174 | # inputs_的shape:[batch_size, c_maxlen, hidden] 175 | inputs_ = tf.nn.relu( 176 | dense(d_inputs, hidden, use_bias=False, scope="inputs")) 177 | memory_ = tf.nn.relu( 178 | dense(d_memory, hidden, use_bias=False, scope="memory")) 179 | # 三维矩阵相乘,结果的shape是[batch_size, c_maxlen, q_maxlen] 180 | outputs = tf.matmul(inputs_, tf.transpose( 181 | memory_, [0, 2, 1])) / (hidden ** 0.5) 182 | # 将mask平铺成与outputs相同的形状,这里考虑,改进成input和memory都需要mask 183 | mask = tf.tile(tf.expand_dims(mask, axis=1), [1, JX, 1]) 184 | logits = tf.nn.softmax(softmax_mask(outputs, mask)) 185 | outputs = tf.matmul(logits, memory) 186 | # res:[batch_size, c_maxlen, 12*hidden] 187 | res = tf.concat([inputs, outputs], axis=2) 188 | 189 | with tf.variable_scope("gate"): 190 | """ 191 | attention * gate 192 | """ 193 | dim = res.get_shape().as_list()[-1] 194 | d_res = dropout(res, keep_prob=keep_prob, is_train=is_train) 195 | gate = tf.nn.sigmoid(dense(d_res, dim, use_bias=False)) 196 | return res * gate # 向量的逐元素相乘 197 | 198 | 199 | # 写一个谷歌论文中新的attention模块 200 | def multihead_attention(Q, K, V, mask, hidden, head_num=4, keep_prob=1.0, is_train=None, has_gate=True, scope="multihead_attention"): 201 | """ 202 | Q : passage 203 | K,V: question 204 | mask: Q的mask 205 | """ 206 | size = int(hidden / head_num) # 每个attention的大小 207 | 208 | with tf.variable_scope(scope): 209 | d_Q = dropout(Q, keep_prob=keep_prob, is_train=is_train) 210 | d_K = dropout(K, keep_prob=keep_prob, is_train=is_train) 211 | JX = tf.shape(Q)[1] 212 | 213 | with tf.variable_scope("attention"): 214 | Q_ = tf.nn.relu(dense(d_Q, hidden, use_bias=False, scope="Q")) 215 | K_ = tf.nn.relu(dense(d_K, hidden, use_bias=False, scope="K")) 216 | V_ = tf.nn.relu(dense(V, hidden, use_bias=False, scope="V")) 217 | Q_ = tf.reshape(Q_, (-1, tf.shape(Q_)[1], head_num, size)) 218 | K_ = tf.reshape(K_, (-1, tf.shape(K_)[1], head_num, size)) 219 | V_ = tf.reshape(V_, (-1, tf.shape(V_)[1], head_num, size)) 220 | Q_ = tf.transpose(Q_, [0, 2, 1, 3]) 221 | K_ = tf.transpose(K_, [0, 2, 1, 3]) 222 | V_ = tf.transpose(V_, [0, 2, 1, 3]) 223 | # scale:[batch_size, head_num, c_maxlen, q_maxlen] 224 | scale = tf.matmul(Q_, K_, transpose_b=True) / tf.sqrt(float(size)) 225 | scale = tf.transpose(scale, [0, 3, 2, 1]) 226 | for _ in range(len(scale.shape) - 2): 227 | mask = tf.expand_dims(mask, axis=2) 228 | mask_scale = softmax_mask(scale, mask) 229 | mask_scale = tf.transpose(scale, [0, 3, 2, 1]) 230 | logits = tf.nn.softmax(mask_scale) 231 | outputs = tf.matmul(logits, V_) # [b,h,c,s] 232 | outputs = tf.transpose(outputs, [0, 2, 1, 3]) 233 | # [batch_size, c_maxlen, hidden] 234 | outputs = tf.reshape(outputs, 
(-1, tf.shape(Q)[1], hidden)) 235 | # res连接 236 | res = tf.concat([Q, outputs], axis=2) 237 | 238 | if has_gate: 239 | with tf.variable_scope("gate"): 240 | dim = res.get_shape().as_list()[-1] 241 | d_res = dropout(res, keep_prob=keep_prob, is_train=is_train) 242 | gate = tf.nn.sigmoid(dense(d_res, dim, use_bias=False)) 243 | return res * gate 244 | else: 245 | return res 246 | 247 | 248 | def dense(inputs, hidden, use_bias=True, scope="dense"): 249 | """ 250 | 全连接层 251 | """ 252 | with tf.variable_scope(scope): 253 | shape = tf.shape(inputs) 254 | dim = inputs.get_shape().as_list()[-1] 255 | out_shape = [shape[idx] for idx in range( 256 | len(inputs.get_shape().as_list()) - 1)] + [hidden] 257 | # 三维的inputs,reshape成二维 258 | flat_inputs = tf.reshape(inputs, [-1, dim]) 259 | W = tf.get_variable("W", [dim, hidden]) 260 | res = tf.matmul(flat_inputs, W) 261 | if use_bias: 262 | b = tf.get_variable( 263 | "b", [hidden], initializer=tf.constant_initializer(0.)) 264 | res = tf.nn.bias_add(res, b) 265 | # outshape就是input的最后一维变成hidden 266 | res = tf.reshape(res, out_shape) 267 | return res 268 | -------------------------------------------------------------------------------- /baseline/util.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | util.py:一些工具 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import numpy as np 11 | import re 12 | from collections import Counter 13 | import string 14 | 15 | 16 | def get_record_parser(config): 17 | def parse(example): 18 | para_limit = config.para_limit 19 | ques_limit = config.ques_limit 20 | features = tf.parse_single_example(example, 21 | features={ 22 | "passage_idxs": tf.FixedLenFeature([], tf.string), 23 | "question_idxs": tf.FixedLenFeature([], tf.string), 24 | "answer": tf.FixedLenFeature([], tf.int64), 25 | "id": tf.FixedLenFeature([], tf.int64) 26 | }) 27 | # tf.decode_raw: 将字符串的字节重新解释为数字向量 28 | passage_idxs = tf.reshape(tf.decode_raw( 29 | features["passage_idxs"], tf.int32), [para_limit]) 30 | question_idxs = tf.reshape(tf.decode_raw( 31 | features["question_idxs"], tf.int32), [ques_limit]) 32 | answer = features["answer"] 33 | qa_id = features["id"] 34 | return passage_idxs, question_idxs, answer, qa_id 35 | return parse 36 | 37 | 38 | def get_batch_dataset(record_file, parser, config): 39 | """ 40 | 训练数据集TFRecordDataset的batch生成器。 41 | Args: 42 | record_file: 训练数据tf_record路径 43 | parser: 数据存储的格式 44 | config: 超参数 45 | """ 46 | num_threads = tf.constant(config.num_threads, dtype=tf.int32) 47 | dataset = tf.data.TFRecordDataset(record_file).map( 48 | parser, num_parallel_calls=num_threads).shuffle(config.capacity).repeat() 49 | if config.is_bucket: 50 | # bucket方法,用于解决序列长度不同的mini-batch的计算效率问题 51 | buckets = [tf.constant(num) for num in range(*config.bucket_range)] 52 | 53 | def key_func(context_idxs, ques_idxs, context_char_idxs, ques_char_idxs, y1, y2, qa_id): 54 | c_len = tf.reduce_sum( 55 | tf.cast(tf.cast(context_idxs, tf.bool), tf.int32)) 56 | buckets_min = [np.iinfo(np.int32).min] + buckets 57 | buckets_max = buckets + [np.iinfo(np.int32).max] 58 | conditions_c = tf.logical_and( 59 | tf.less(buckets_min, c_len), tf.less_equal(c_len, buckets_max)) 60 | bucket_id = tf.reduce_min(tf.where(conditions_c)) 61 | return bucket_id 62 | 63 | def reduce_func(key, elements): 64 | return elements.batch(config.batch_size) 65 | 66 | dataset = dataset.apply(tf.contrib.data.group_by_window( 67 | key_func, reduce_func, window_size=5 * 
config.batch_size)).shuffle(len(buckets) * 25) 68 | else: 69 | dataset = dataset.batch(config.batch_size) 70 | return dataset 71 | 72 | 73 | def get_dataset(record_file, parser, config): 74 | num_threads = tf.constant(config.num_threads, dtype=tf.int32) 75 | dataset = tf.data.TFRecordDataset(record_file).map( 76 | parser, num_parallel_calls=num_threads).repeat().batch(config.batch_size) 77 | return dataset 78 | 79 | 80 | def evaluate_acc(truth_dict, answer_dict): 81 | """ 82 | 计算准确率,还可以设计返回正确问题和错误问题列表 83 | """ 84 | total = 0 85 | right = 0 86 | wrong = 0 87 | for key, value in answer_dict.items(): 88 | total += 1 89 | ground_truths = truth_dict[key] 90 | prediction = value 91 | if prediction == ground_truths: 92 | right += 1 93 | else: 94 | wrong += 1 95 | accuracy = (right / total) * 1.0 96 | return {"accuracy": accuracy} 97 | 98 | def f1_score(truth_dict, answer_dict): 99 | """ 100 | 计算平均f1分数 101 | """ 102 | 103 | 104 | -------------------------------------------------------------------------------- /best_single_model/config.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | config.py:配置文件,程序运行入口 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import os 10 | import tensorflow as tf 11 | import data_process_addAnswer 12 | from main import train, test 13 | from file_save import * 14 | from examine_dev import examine_dev 15 | 16 | flags = tf.flags 17 | os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" 18 | 19 | train_file = os.path.join("file", "ai_challenger_oqmrc_trainingset.json") 20 | dev_file = os.path.join("file", "ai_challenger_oqmrc_validationset.json") 21 | test_file = os.path.join("file", "ai_challenger_oqmrc_testa.json") 22 | 23 | target_dir = "data" 24 | log_dir = "log/event" 25 | save_dir = "log/model" 26 | prediction_dir = "log/prediction" 27 | train_record_file = os.path.join(target_dir, "train.tfrecords") 28 | dev_record_file = os.path.join(target_dir, "dev.tfrecords") 29 | test_record_file = os.path.join(target_dir, "test.tfrecords") 30 | id2vec_file = os.path.join(target_dir, "id2vec.json") # id号->向量 31 | word2id_file = os.path.join(target_dir, "word2id.json") # 词->id号 32 | train_eval = os.path.join(target_dir, "train_eval.json") 33 | dev_eval = os.path.join(target_dir, "dev_eval.json") 34 | test_eval = os.path.join(target_dir, "test_eval.json") 35 | 36 | if not os.path.exists(target_dir): 37 | os.makedirs(target_dir) 38 | if not os.path.exists(log_dir): 39 | os.makedirs(log_dir) 40 | if not os.path.exists(save_dir): 41 | os.makedirs(save_dir) 42 | if not os.path.exists(prediction_dir): 43 | os.makedirs(prediction_dir) 44 | 45 | flags.DEFINE_string("mode", "train", "train/debug/test") 46 | flags.DEFINE_string("gpu", "0", "0/1") 47 | flags.DEFINE_string("experiment", "lalala", "每次存不同模型分不同的文件夹") 48 | flags.DEFINE_string("model_name", "default", "选取不同的模型") 49 | 50 | flags.DEFINE_string("target_dir", target_dir, "") 51 | flags.DEFINE_string("log_dir", log_dir, "") 52 | flags.DEFINE_string("save_dir", save_dir, "") 53 | flags.DEFINE_string("prediction_dir", prediction_dir, "") 54 | flags.DEFINE_string("train_file", train_file, "") 55 | flags.DEFINE_string("dev_file", dev_file, "") 56 | flags.DEFINE_string("test_file", test_file, "") 57 | 58 | flags.DEFINE_string("train_record_file", train_record_file, "") 59 | flags.DEFINE_string("dev_record_file", dev_record_file, "") 60 | flags.DEFINE_string("test_record_file", test_record_file, "") 61 | flags.DEFINE_string("train_eval_file", train_eval, "") 
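# ----------------------------------------------------------------------
# Annotation (added for clarity; not part of the original config.py).
# The *_eval.json files referenced by the flags around this point map each
# query_id to its three alternatives in a fixed affirmative/negative/uncertain
# (肯定/否定/不确定) order, so a predicted class index can be mapped back to
# answer text. A minimal sketch, assuming a hypothetical id "250001":
#
#   import json
#   with open(dev_eval, "r") as fh:
#       dev_eval_file = json.load(fh)   # e.g. {"250001": ["热", "不热", "无法确定"]}
#   predicted_class = 1                 # e.g. taken from model.classes
#   answer_text = dev_eval_file["250001"][predicted_class]   # -> "不热"
# ----------------------------------------------------------------------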
62 | flags.DEFINE_string("dev_eval_file", dev_eval, "") 63 | flags.DEFINE_string("test_eval_file", test_eval, "") 64 | flags.DEFINE_string("word2id_file", word2id_file, "") 65 | flags.DEFINE_string("id2vec_file", id2vec_file, "") 66 | 67 | flags.DEFINE_integer("para_limit", 150, "Limit length for paragraph") 68 | flags.DEFINE_integer("ques_limit", 30, "Limit length for question") 69 | flags.DEFINE_integer("ans_limit", 5, "Limit length for Answer") 70 | flags.DEFINE_integer("min_count", 1, "embedding 的最小出现次数") 71 | flags.DEFINE_integer("embedding_size", 300, "the dimension of vector") 72 | 73 | flags.DEFINE_integer("capacity", 15000, "Batch size of dataset shuffle") 74 | flags.DEFINE_integer("num_threads", 4, "Number of threads in input pipeline") 75 | # 使用cudnn训练,提升6倍速度 76 | flags.DEFINE_boolean("use_cudnn", True, "Whether to use cudnn (only for GPU)") 77 | flags.DEFINE_boolean("is_bucket", False, "Whether to use bucketing") 78 | flags.DEFINE_list("bucket_range", [40, 361, 40], "range of bucket") 79 | 80 | flags.DEFINE_integer("batch_size", 64, "Batch size") 81 | flags.DEFINE_integer("num_steps", 300000, "Number of steps") 82 | flags.DEFINE_integer("checkpoint", 1000, "checkpoint for evaluation") 83 | flags.DEFINE_integer("period", 500, "period to save batch loss") 84 | flags.DEFINE_integer("val_num_batches", 150, "Num of batches for evaluation") 85 | # 关于学习率 86 | flags.DEFINE_float("init_learning_rate", 0.001, 87 | "Initial learning rate for Adam") 88 | flags.DEFINE_float("init_emb_lr", 0., "") 89 | flags.DEFINE_boolean("training_embedding", False, "") 90 | 91 | flags.DEFINE_float("keep_prob", 0.7, "Keep prob in rnn") 92 | flags.DEFINE_float("grad_clip", 5.0, "Global Norm gradient clipping rate") 93 | flags.DEFINE_integer("hidden", 60, "Hidden size") # best:128 94 | flags.DEFINE_integer("patience", 3, "Patience for learning rate decay") 95 | flags.DEFINE_string("optimizer", "Adam", "") 96 | flags.DEFINE_string("loss_function", "default", "") 97 | 98 | 99 | def main(_): 100 | config = flags.FLAGS 101 | os.environ["CUDA_VISIBLE_DEVICES"] = config.gpu # 选择一块gpu 102 | if config.mode == "train": 103 | train(config) 104 | elif config.mode == "prepro": 105 | data_process_addAnswer.prepro(config) 106 | elif config.mode == "test": 107 | test(config) 108 | elif config.mode == "examine": 109 | examine_dev(config) 110 | elif config.mode == "save_dev": 111 | save_dev(config) 112 | elif config.mode == "save_test": 113 | save_test(config) 114 | else: 115 | print("Unknown mode") 116 | exit(0) 117 | 118 | 119 | if __name__ == "__main__": 120 | tf.app.run() 121 | -------------------------------------------------------------------------------- /best_single_model/data_process_addAnswer.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | data_process_addAnswer.py:数据预处理代码, 加入alternatives的语义,以及特征工程。 5 | 6 | @author: haomaojie 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import pandas as pd 10 | import time 11 | import json 12 | import jieba 13 | import csv 14 | import word2vec 15 | import re 16 | import random 17 | import tensorflow as tf 18 | import numpy as np 19 | from tqdm import tqdm # 进度条 20 | import os 21 | import gensim 22 | 23 | 24 | def read_data(json_path, output_path, line_count): 25 | ''' 26 | 读取json文件并转成Dataframe 27 | ''' 28 | start_time = time.time() 29 | data = [] 30 | with open(json_path, 'r') as f: 31 | for i in range(line_count): 32 | data_list = json.loads(f.readline()) 33 | data.append([data_list['passage'], 
data_list['query']]) 34 | df = pd.DataFrame(data, columns=['passage', 'query']) 35 | df.to_csv(output_path, index=False) 36 | print('转化成功,已生成csv文件') 37 | end_time = time.time() 38 | print(end_time - start_time) 39 | 40 | 41 | def de_word(data_path, out_path): 42 | ''' 43 | 分词 44 | ''' 45 | start_time = time.time() 46 | word = [] 47 | data_file = open(data_path).read().split('\n') 48 | for i in range(len(data_file)): 49 | result = [] 50 | seg_list = jieba.cut(data_file[i]) 51 | for w in seg_list: 52 | result.append(w) 53 | word.append(result) 54 | print('分词完成') 55 | with open(out_path, 'w+') as txt_write: 56 | for i in range(len(word)): 57 | s = str(word[i]).replace( 58 | '[', '').replace(']', '') # 去除[],这两行按数据不同,可以选择 59 | s = s.replace("'", '').replace(',', '') + \ 60 | '\n' # 去除单引号,逗号,每行末尾追加换行符 61 | txt_write.write(s) 62 | print('保存成功') 63 | end_time = time.time() 64 | print(end_time - start_time) 65 | 66 | 67 | def word_vec(file_txt, file_bin, min_count, size): 68 | word2vec.word2vec(file_txt, file_bin, min_count=min_count, 69 | size=size, verbose=True) 70 | 71 | 72 | def merge_csv(target_dir, output_file): 73 | for inputfile in [os.path.join(target_dir, 'train_oridata.csv'), 74 | os.path.join(target_dir, 'test_oridata.csv'), os.path.join(target_dir, 'validation_oridata.csv')]: 75 | data = pd.read_csv(inputfile) 76 | df = pd.DataFrame(data) 77 | df.to_csv(output_file, mode='a', index=False) 78 | 79 | # 词转id,id转向量 80 | 81 | 82 | def transfer(model_path, embedding_size): 83 | start_time = time.time() 84 | model = word2vec.load(model_path) 85 | word2id_dic = {} 86 | init_0 = [0.0 for i in range(embedding_size)] 87 | id2vec_dic = [init_0] 88 | for i in range(len(model.vocab)): 89 | id = i + 1 90 | word2id_dic[model.vocab[i]] = id 91 | id2vec_dic.append(model[model.vocab[i]].tolist()) 92 | end_time = time.time() 93 | print('词转id,id转向量完成') 94 | print(end_time - start_time) 95 | return word2id_dic, id2vec_dic 96 | 97 | 98 | def transfer_txt(model_path, embedding_size): 99 | print("开始转换...") 100 | start_time = time.time() 101 | model = gensim.models.KeyedVectors.load_word2vec_format( 102 | model_path, binary=False) 103 | word_dic = model.wv.vocab 104 | word2id_dic = {} 105 | init_0 = [0.0 for i in range(embedding_size)] 106 | id2vec_dic = [init_0] 107 | id = 1 108 | for i in word_dic: 109 | word2id_dic[i] = id 110 | id2vec_dic.append(model[i].tolist()) 111 | id += 1 112 | end_time = time.time() 113 | print('词转id,id转向量完成') 114 | print(end_time - start_time) 115 | return word2id_dic, id2vec_dic 116 | 117 | # 存入json文件 118 | 119 | 120 | def save_json(output_path, dic_data, message=None): 121 | start_time = time.time() 122 | if message is not None: 123 | print("Saving {}...".format(message)) 124 | with open(output_path, "w") as fh: 125 | json.dump(dic_data, fh, ensure_ascii=False, indent=4) 126 | print('保存完成') 127 | end_time = time.time() 128 | print(end_time - start_time) 129 | 130 | # 将原文中的passage,query,alternative,answer,query_id转成id号 131 | # 输入参数为词典的位置和训练集的位置 132 | 133 | 134 | def TrainningsetProcess(dic_url, dataset_url, passage_len_limit): 135 | res = [] # 最后返回的结果 136 | rule = re.compile(r'\|') 137 | id2alternatives = {} 138 | # 读取字典 139 | with open(dic_url, 'r', encoding='utf-8') as dic_file: 140 | dic = dict() 141 | dic = json.load(dic_file) 142 | # 读取训练集 143 | over_limit = 0 144 | ans_over_limit = 0 145 | with open(dataset_url, 'r', encoding='utf-8') as ts_file: 146 | for file_line in ts_file: 147 | line = json.loads(file_line) # 读取一行json文件 148 | this_line_res = dict() # 变量定义,代表这一行映射之后的结果 
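# ----------------------------------------------------------------------
# Annotation (added for clarity; not part of the original function). A rough
# sketch of what one pass of this loop is expected to produce; the values are
# hypothetical and the real ids depend on the word2id dictionary built above:
#
#   input line    : {"passage": "...", "query": "天气热吗", "query_id": 1,
#                    "alternatives": "热|不热|无法确定", "answer": "热"}
#   this_line_res : {"passage": [14, 2, 77, ...],   # word ids, cut to para_limit
#                    "query": [9, 51],              # word ids
#                    "alternatives": [[23], [6, 23], [88, 90]],  # affirmative / negative / uncertain
#                    "answer": 0,                    # index of the true answer in that order
#                    "query_id": 1}
# ----------------------------------------------------------------------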
149 | passage = line['passage'] 150 | alternatives = line['alternatives'] 151 | query = line['query'] 152 | if dataset_url.find('test') == -1: 153 | answer = line['answer'] 154 | query_idx = line['query_id'] 155 | 156 | # 用jieba将passage和query分词,lcut返回list 157 | passage_cut = jieba.lcut(passage, cut_all=False) 158 | query_cut = jieba.lcut(query, cut_all=False) 159 | 160 | # 用词典将passage和query映射到id 161 | passage_id = [] 162 | query_id = [] 163 | for each_passage_word in passage_cut: 164 | passage_id.append(dic.get(each_passage_word)) 165 | for each_query_word in query_cut: 166 | query_id.append(dic.get(each_query_word)) 167 | 168 | # 对选项进行排序 169 | alternatives_cut = re.split(rule, alternatives) 170 | alternatives_cut = [s.strip() for s in alternatives_cut] 171 | tmp = [0, 0, 0] 172 | 173 | # 选项少于三个 174 | if len(alternatives_cut) == 1: 175 | alternatives_cut.append("") 176 | alternatives_cut.append("") 177 | if len(alternatives_cut) == 2: 178 | alternatives_cut.append(" 1 or alternatives.count("没") > 1: 190 | if dataset_url.find('test') != -1: 191 | tmp[0] = alternatives_cut[0] 192 | tmp[1] = alternatives_cut[1] 193 | tmp[2] = alternatives_cut[2] 194 | else: 195 | print(2) 196 | continue # 第64772条数据 197 | if alternatives.find("没") != -1 and alternatives.find("不") != -1 and alternatives.find("不确定") == -1: 198 | print(3) 199 | continue # 第144146条数据 200 | if "不确定" in alternatives_cut and "无法确定" in alternatives_cut: 201 | tmp[0] = "确定" 202 | tmp[1] = "不确定" 203 | tmp[2] = "无法确定" 204 | # 肯定/否定/无法确定 205 | elif alternatives.find("不") != -1 or alternatives.find("没") != -1: 206 | if alternatives.count("不") == 1 and alternatives.find("不确定") != -1: 207 | alternatives_cut.remove("不确定") 208 | alternatives_cut.append("不确定") 209 | tmp[0] = alternatives_cut[0] 210 | tmp[1] = alternatives_cut[1] 211 | tmp[2] = alternatives_cut[2] 212 | elif alternatives.count("不") > 1: 213 | if alternatives.find("不确定") == -1: 214 | if dataset_url.find("test") != -1: 215 | tmp[0] = alternatives_cut[0] 216 | tmp[1] = alternatives_cut[1] 217 | tmp[2] = alternatives_cut[2] 218 | else: 219 | print(line) 220 | continue 221 | else: 222 | alternatives_cut.remove("不确定") 223 | if alternatives_cut[0].find("不") != -1: 224 | tmp[1] = alternatives_cut[0] 225 | tmp[0] = alternatives_cut[1] 226 | else: 227 | tmp[1] = alternatives_cut[1] 228 | tmp[0] = alternatives_cut[0] 229 | alternatives_cut.append("不确定") 230 | tmp[2] = alternatives_cut[2] 231 | else: 232 | for tmp_alternatives in alternatives_cut: 233 | if tmp_alternatives.find("无法") != -1: 234 | tmp[2] = tmp_alternatives 235 | elif tmp_alternatives.find("不") != -1 or tmp_alternatives.find("没") != -1: 236 | tmp[1] = tmp_alternatives 237 | else: 238 | tmp[0] = tmp_alternatives 239 | # 无明显肯定与否定词义 240 | else: 241 | for tmp_alternatives in alternatives_cut: 242 | if tmp_alternatives.find("无法") != -1 or alternatives.find("不确定") != -1: 243 | alternatives_cut.remove(tmp_alternatives) 244 | alternatives_cut.append(tmp_alternatives) 245 | break 246 | tmp[0] = alternatives_cut[0] 247 | tmp[1] = alternatives_cut[1] 248 | tmp[2] = alternatives_cut[2] 249 | 250 | # 根据tmp列表生成answer_id 251 | if dataset_url.find('test') == -1: 252 | answer_id = tmp.index(answer.strip()) 253 | 254 | # 将tmp列表分词存id 255 | tmp_id = [] 256 | for ans in tmp: 257 | if ans == None or ans == "": 258 | tmp_id.append([0]) 259 | else: 260 | ans = jieba.lcut(ans, cut_all=False) 261 | if len(ans) > 5: 262 | ans = ans[:5] 263 | ans_over_limit += 1 264 | tmp_id.append([dic.get(x) for x in ans]) 265 | 266 | # 得到这一行映射后的结果,是dict类型的数据 267 | if 
len(passage_id) > passage_len_limit: 268 | passage_id = passage_id[:passage_len_limit] 269 | over_limit += 1 270 | this_line_res['passage'] = passage_id 271 | this_line_res['query'] = query_id 272 | this_line_res['alternatives'] = tmp_id 273 | if dataset_url.find('test') == -1: 274 | this_line_res['answer'] = answer_id 275 | this_line_res['query_id'] = query_idx 276 | # 创建query_id到alternatives的字典,保存为json 277 | id2alternatives[query_idx] = tmp 278 | res.append(this_line_res) 279 | print(len(res)) 280 | print("over_limit:{}".format(over_limit)) 281 | print("ans_over_limit:{}".format(ans_over_limit)) 282 | return res, id2alternatives 283 | 284 | 285 | def data_process(config): 286 | target_dir = config.target_dir 287 | # 这里如果使用自己训练好的词向量就可以注释掉 288 | read_data(config.train_file, os.path.join( 289 | target_dir, 'train_oridata.csv'), 250000) # 250000 290 | read_data(config.test_file, os.path.join( 291 | target_dir, 'test_oridata.csv'), 10000) # 10000 292 | read_data(config.dev_file, os.path.join( 293 | target_dir, 'validation_oridata.csv'), 30000) # 30000 294 | merge_csv(target_dir, os.path.join(target_dir, 'ori_data.csv')) 295 | de_word(os.path.join(target_dir, 'ori_data.csv'), 296 | os.path.join(target_dir, 'seg_list.txt')) 297 | word_vec(os.path.join(target_dir, 'seg_list.txt'), 298 | os.path.join(target_dir, 'seg_listWord2Vec.bin'), config.min_count, config.embedding_size) 299 | # 如果是用外部词向量,从这里开始 300 | # word2id_dic, id2vec_dic = transfer_txt( 301 | # os.path.join(target_dir, 'baidu_300_wc+ng_sgns.baidubaike.bigram-char.txt'), config.embedding_size) 302 | word2id_dic, id2vec_dic = transfer( 303 | os.path.join(target_dir, 'seg_listWord2Vec.bin'), config.embedding_size) 304 | save_json(config.word2id_file, word2id_dic, "word to id") 305 | save_json(config.id2vec_file, id2vec_dic, "id to vec") 306 | train_examples, train_id2alternatives = TrainningsetProcess( 307 | config.word2id_file, config.train_file, config.para_limit) 308 | test_examples, test_id2alternatives = TrainningsetProcess( 309 | config.word2id_file, config.test_file, config.para_limit) 310 | validation_examples, validation_id2alternatives = TrainningsetProcess( 311 | config.word2id_file, config.dev_file, config.para_limit) 312 | save_json(config.train_eval_file, train_id2alternatives, 313 | message='保存train每条数据的alternatives') 314 | save_json(config.test_eval_file, test_id2alternatives, 315 | message='保存test每条数据的alternatives') 316 | save_json(config.dev_eval_file, validation_id2alternatives, 317 | message='保存validation每条数据的alternatives') 318 | return train_examples, test_examples, validation_examples 319 | 320 | 321 | def build_features(config, examples, data_type, out_file, is_test=False): 322 | """ 323 | 将数据读入TFrecords 324 | """ 325 | 326 | para_limit = config.para_limit 327 | ques_limit = config.ques_limit 328 | ans_limit = config.ans_limit 329 | 330 | print("Processing {} examples...".format(data_type)) 331 | 332 | list_nlp=[] 333 | with open("nlp_feature.json","r") as f: 334 | list_nlp=json.load(f) 335 | 336 | writer = tf.python_io.TFRecordWriter(out_file) 337 | total = 0 338 | meta = {} 339 | random.shuffle(examples) # 先给打乱顺序 340 | for example in tqdm(examples): 341 | total += 1 342 | passage_idxs = np.zeros([para_limit], dtype=np.int32) 343 | question_idxs = np.zeros([ques_limit], dtype=np.int32) 344 | alternative_idxs = np.zeros([3, ans_limit], dtype=np.int32) 345 | 346 | for i, token in enumerate(example["passage"]): 347 | if token == None: 348 | passage_idxs[i] = 0 349 | else: 350 | passage_idxs[i] = token 351 | for i, 
token in enumerate(example["query"]): 352 | if token == None: 353 | question_idxs[i] = 0 354 | else: 355 | question_idxs[i] = token 356 | for i, token in enumerate(example["alternatives"]): 357 | for j, tk in enumerate(token): 358 | if tk == None: 359 | alternative_idxs[i][j] = 0 360 | else: 361 | alternative_idxs[i][j] = tk 362 | # print(example["passage"]) 363 | if not is_test: 364 | record = tf.train.Example(features=tf.train.Features(feature={ 365 | "passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 366 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 367 | "alternative_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[alternative_idxs.tostring()])), 368 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["answer"]])), 369 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"]])), 370 | "nlp_feature":tf.train.Feature(float_list=tf.train.FloatList(value=list_nlp[int(example["query_id"])-1][1:])) 371 | })) 372 | else: 373 | record = tf.train.Example(features=tf.train.Features(feature={ 374 | "passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 375 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 376 | "alternative_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[alternative_idxs.tostring()])), 377 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(-1)])), 378 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"]])) 379 | #,"nlp_feature": tf.train.Feature(float_list=tf.train.FloatList(value=list_nlp[int(example["query_id"]) - 1][1:])) 380 | })) 381 | # print(record) 382 | writer.write(record.SerializeToString()) 383 | print("Build {} instances of features in total".format(total)) 384 | writer.close() 385 | 386 | 387 | def prepro(config): 388 | """ 389 | 数据预处理函数 390 | """ 391 | train_examples, test_examples, dev_examples = data_process(config) 392 | 393 | # print(train_examples) 394 | # print(test_examples) 395 | # print(dev_examples) 396 | 397 | # train: 249778, test: 10000, dev: 29968 398 | # train: 439, test: 18, dev: 48 399 | 400 | build_features(config, train_examples, "train", config.train_record_file) 401 | build_features(config, dev_examples, "dev", config.dev_record_file) 402 | build_features(config, test_examples, "test", 403 | config.test_record_file, is_test=True) 404 | 405 | print("done!!!") 406 | -------------------------------------------------------------------------------- /best_single_model/examine_dev.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | examine_dev.py:检查验证集的结果,辅助分析。 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import os 14 | import codecs 15 | import time 16 | 17 | from model_addAnswer_newGraph import Model 18 | from util_addAnswer import * 19 | 20 | 21 | def examine_dev(config): 22 | """ 23 | 检查dev集的结果,辅助分析 24 | """ 25 | with open(config.id2vec_file, "r") as fh: 26 | id2vec = np.array(json.load(fh), dtype=np.float32) 27 | with open(config.dev_eval_file, "r") as fh: 28 | dev_eval_file = json.load(fh) 29 | 30 | total = 29968 31 | # 读取模型的路径和预测存储的路径 32 | save_dir = config.save_dir + config.experiment 33 | if not os.path.exists(save_dir): 34 | 
print("no save!") 35 | return 36 | predic_time = time.strftime("%Y-%m-%d_%H:%M:%S ", time.localtime()) 37 | os.path.join(config.prediction_dir, (predic_time + "_examine_dev.txt")) 38 | 39 | print("Loading model...") 40 | examine_batch = get_dataset(config.dev_record_file, get_record_parser( 41 | config), config).make_one_shot_iterator() 42 | 43 | model = Model(config, examine_batch, id2vec, trainable=False) 44 | 45 | sess_config = tf.ConfigProto(allow_soft_placement=True) 46 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 47 | sess_config.gpu_options.allow_growth = True 48 | 49 | print("examining ...") 50 | with tf.Session(config=sess_config) as sess: 51 | sess.run(tf.global_variables_initializer()) 52 | saver = tf.train.Saver() 53 | saver.restore(sess, tf.train.latest_checkpoint(save_dir)) 54 | sess.run(tf.assign(model.is_train, tf.constant(False, dtype=tf.bool))) 55 | answer_dict = {} 56 | truth_dict = {} 57 | for step in tqdm(range(total // config.batch_size + 1)): 58 | # 预测答案 59 | qa_id, answer, truth = sess.run( 60 | [model.qa_id, model.classes, model.answer]) 61 | answer_dict_ = {} 62 | truth_dict_ = {} 63 | for ids, tr, ans in zip(qa_id, truth, answer): 64 | answer_dict_[str(ids)] = ans 65 | truth_dict_[str(ids)] = tr 66 | answer_dict.update(answer_dict_) 67 | truth_dict.update(truth_dict_) 68 | metrics = evaluate_acc(truth_dict, answer_dict) 69 | print(len(truth_dict)) 70 | print(len(answer_dict)) 71 | print("accuracy:{}".format(metrics["accuracy"])) 72 | 73 | yes_predictions = [] # 正确答案是肯定的错题 74 | no_predictions = [] # 正确答案是否定的错题 75 | depend_predictions = [] # 正确答案是不确定的错题 76 | yes, no, depend = 0, 0, 0 77 | yes_wrong, no_wrong, depend_wrong = 0, 0, 0 78 | for key, value in answer_dict.items(): 79 | if truth_dict[key] != value: 80 | if truth_dict[key] == 0: 81 | yes += 1 82 | yes_wrong += 1 83 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 84 | wrong_answer = dev_eval_file[str(key)][value] 85 | yes_predictions.append( 86 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 87 | elif truth_dict[key] == 1: 88 | no += 1 89 | no_wrong += 1 90 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 91 | wrong_answer = dev_eval_file[str(key)][value] 92 | no_predictions.append( 93 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 94 | else: 95 | depend += 1 96 | depend_wrong += 1 97 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 98 | wrong_answer = dev_eval_file[str(key)][value] 99 | depend_predictions.append( 100 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 101 | else: 102 | if truth_dict[key] == 0: 103 | yes += 1 104 | elif truth_dict[key] == 1: 105 | no += 1 106 | else: 107 | depend += 1 108 | 109 | print("肯定型问题个数:{},否定型问题个数:{},不确定问题个数:{}".format(yes, no, depend)) 110 | print("肯定型问题正确率:{}".format((yes - yes_wrong) / yes * 1.0)) 111 | print("否定型问题正确率:{}".format((no - no_wrong) / no * 1.0)) 112 | print("不确定型问题正确率:{}".format((depend - depend_wrong) / depend * 1.0)) 113 | outputs_0 = u'\n'.join(yes_predictions) 114 | outputs_1 = u'\n'.join(no_predictions) 115 | outputs_2 = u'\n'.join(depend_predictions) 116 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_0.txt")), 'w', encoding='utf-8') as f: 117 | f.write(outputs_0) 118 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_1.txt")), 'w', encoding='utf-8') as f: 119 | f.write(outputs_1) 120 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_2.txt")), 
'w', encoding='utf-8') as f: 121 | f.write(outputs_2) 122 | print("done!") 123 | -------------------------------------------------------------------------------- /best_single_model/file_save.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | file_save.py: 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import pickle 14 | import os 15 | import codecs 16 | import time 17 | from model_addAnswer_newGraph import Model 18 | from util_addAnswer import * 19 | 20 | 21 | def save_dev(config): 22 | """ 23 | 验证dev集的结果,保存文件 24 | """ 25 | with open(config.id2vec_file, "r") as fh: 26 | id2vec = np.array(json.load(fh), dtype=np.float32) 27 | with open(config.dev_eval_file, "r") as fh: 28 | dev_eval_file = json.load(fh) 29 | total = 29968 30 | 31 | print("Loading model...") 32 | sess_config = tf.ConfigProto(allow_soft_placement=True) 33 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 34 | sess_config.gpu_options.allow_growth = True 35 | 36 | truth_dict = {} 37 | predict_dict = {} 38 | logits_dict1 = {} 39 | 40 | print("正在模型预测!") 41 | g1 = tf.Graph() 42 | with tf.Session(graph=g1, config=sess_config) as sess1: 43 | with g1.as_default(): 44 | dev_batch1 = get_dataset(config.dev_record_file, get_record_parser( 45 | config), config).make_one_shot_iterator() 46 | model_1 = Model(config, dev_batch1, id2vec, trainable=False) 47 | sess1.run(tf.global_variables_initializer()) 48 | saver1 = tf.train.Saver() 49 | # 需要手动更改路径 50 | saver1.restore( 51 | sess1, "./log/modellalala/model_99000_devAcc_0.751301.ckpt") 52 | sess1.run(tf.assign(model_1.is_train, 53 | tf.constant(False, dtype=tf.bool))) 54 | for step in tqdm(range(total // config.batch_size + 1)): 55 | qa_id, logits, truths = sess1.run( 56 | [model_1.qa_id, model_1.logits, model_1.answer]) 57 | for ids, logits, truth in zip(qa_id, logits, truths): 58 | logits_dict1[str(ids)] = logits 59 | truth_dict[str(ids)] = truth 60 | if len(logits_dict1) != len(dev_eval_file): 61 | print("logits1 data number not match") 62 | 63 | a = tf.placeholder(shape=[3], dtype=tf.float32, name="me") 64 | softmax = tf.nn.softmax(a) 65 | for key, val in truth_dict.items(): 66 | value = sess1.run(softmax, feed_dict={a: logits_dict1[key]}) 67 | predict_dict[key] = value 68 | print("正在保存dev的softmax结果文件!") 69 | if not os.path.exists("./dev_soft"): 70 | os.makedirs("./dev_soft") 71 | with open("./dev_soft/nlp_model_0.7513.txt", "wb") as f1: # 手动更改保存的名字,路径不用改 72 | pickle.dump(predict_dict, f1) 73 | if not os.path.exists("./truth"): 74 | os.makedirs("./truth") 75 | with open("./truth/truth_dict.txt", "wb") as f2: # 不用改 76 | pickle.dump(truth_dict, f2) 77 | 78 | 79 | def save_test(config): 80 | """ 81 | 输出test集的结果,保存文件 82 | """ 83 | with open(config.id2vec_file, "r") as fh: 84 | id2vec = np.array(json.load(fh), dtype=np.float32) 85 | with open(config.test_eval_file, "r") as fh: 86 | test_eval_file = json.load(fh) 87 | total = 10000 88 | 89 | print("Loading model...") 90 | sess_config = tf.ConfigProto(allow_soft_placement=True) 91 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 92 | sess_config.gpu_options.allow_growth = True 93 | 94 | predict_dict = {} 95 | logits_dict1 = {} 96 | 97 | print("正在模型预测!") 98 | g1 = tf.Graph() 99 | with tf.Session(graph=g1, config=sess_config) as sess1: 100 | with g1.as_default(): 101 | test_batch1 = get_dataset(config.test_record_file, 
get_record_parser( 102 | config), config).make_one_shot_iterator() 103 | model_1 = Model(config, test_batch1, id2vec, trainable=False) 104 | sess1.run(tf.global_variables_initializer()) 105 | saver1 = tf.train.Saver() 106 | # 需要手动更改路径 107 | saver1.restore( 108 | sess1, "./log/modellalala/model_99000_devAcc_0.751301.ckpt") 109 | sess1.run(tf.assign(model_1.is_train, 110 | tf.constant(False, dtype=tf.bool))) 111 | for step in tqdm(range(total // config.batch_size + 1)): 112 | qa_id, logits = sess1.run( 113 | [model_1.qa_id, model_1.logits]) 114 | for ids, logits in zip(qa_id, logits): 115 | logits_dict1[str(ids)] = logits 116 | if len(logits_dict1) != len(test_eval_file): 117 | print("logits1 data number not match") 118 | 119 | a = tf.placeholder(shape=[3], dtype=tf.float32, name="me") 120 | softmax = tf.nn.softmax(a) 121 | for key, val in logits_dict1.items(): 122 | value = sess1.run(softmax, feed_dict={a: logits_dict1[key]}) 123 | predict_dict[key] = value 124 | 125 | print("正在保存test的softmax结果文件!") 126 | if not os.path.exists("./test_soft"): 127 | os.makedirs("./test_soft") 128 | with open("./test_soft/nlp_model_0.7446.txt", "wb") as f1: 129 | pickle.dump(predict_dict, f1) 130 | -------------------------------------------------------------------------------- /best_single_model/focal_loss.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | focal_loss.py 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | 11 | 12 | def sparse_focal_loss(logits, labels, gamma=2): 13 | """ 14 | Compute focal loss for multi-class classification 15 | Args: 16 | labels: An int32 tensor of shape [batch_size]. 17 | logits: A float32 tensor of shape [batch_size,num_classes]. 18 | gamma: A scalar for focal loss gamma hyper-parameter.
19 | Returns: 20 | A tensor of the same shape as `lables` 21 | """ 22 | with tf.name_scope("focal_loss"): 23 | y_pred = tf.nn.softmax(logits, dim=-1) # [batch_size,num_classes] 24 | labels = tf.one_hot(labels, depth=y_pred.shape[1]) 25 | L = -labels * ((1 - y_pred)**gamma) * tf.log(y_pred) 26 | L = tf.reduce_sum(L, axis=1) 27 | return L 28 | 29 | ''' 30 | if __name__ == '__main__': 31 | labels = tf.constant([0, 1], name="labels") 32 | logits = tf.constant([[0.7, 0.2, 0.1], [0.6, 0.1, 0.3]], name="logits") 33 | a = tf.reduce_mean(sparse_focal_loss(logits, tf.stop_gradient(labels))) 34 | with tf.Session() as sess: 35 | print(sess.run(a))''' 36 | -------------------------------------------------------------------------------- /best_single_model/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | main.py:train and test 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import os 14 | import codecs 15 | import time 16 | import math 17 | 18 | from model_addAnswer_newGraph import Model 19 | from util_addAnswer import * 20 | 21 | 22 | def train(config): 23 | """ 24 | 训练与验证函数 25 | """ 26 | with open(config.id2vec_file, "r") as fh: 27 | id2vec = np.array(json.load(fh), dtype=np.float32) 28 | with open(config.train_eval_file, "r") as fh: 29 | train_eval_file = json.load(fh) 30 | with open(config.dev_eval_file, "r") as fh: 31 | dev_eval_file = json.load(fh) 32 | 33 | dev_total = 29968 # 验证集数据量 34 | 35 | # 不同参数的训练在不同的文件夹下存储 36 | log_dir = config.log_dir + config.experiment 37 | save_dir = config.save_dir + config.experiment 38 | if not os.path.exists(log_dir): 39 | os.makedirs(log_dir) 40 | if not os.path.exists(save_dir): 41 | os.makedirs(save_dir) 42 | 43 | print("Building model...") 44 | parser = get_record_parser(config) 45 | train_dataset = get_batch_dataset(config.train_record_file, parser, config) 46 | dev_dataset = get_dataset(config.dev_record_file, parser, config) 47 | 48 | # 可馈送迭代器,通过feed_dict机制选择每次sess.run时调用train_iterator还是dev_iterator 49 | handle = tf.placeholder(tf.string, shape=[]) 50 | iterator = tf.data.Iterator.from_string_handle( 51 | handle, train_dataset.output_types, train_dataset.output_shapes) 52 | train_iterator = train_dataset.make_one_shot_iterator() 53 | dev_iterator = dev_dataset.make_one_shot_iterator() 54 | 55 | # 选取模型 56 | if config.model_name == "default": 57 | model = Model(config, iterator, id2vec) 58 | else: 59 | print("model error") 60 | return 61 | 62 | sess_config = tf.ConfigProto(allow_soft_placement=True) 63 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 64 | sess_config.gpu_options.allow_growth = True 65 | 66 | loss_save = 100.0 67 | patience = 0 68 | lr = config.init_learning_rate 69 | emb_lr = config.init_emb_lr 70 | 71 | with tf.Session(config=sess_config) as sess: 72 | writer = tf.summary.FileWriter(log_dir, sess.graph) # 存储计算图 73 | sess.run(tf.global_variables_initializer()) 74 | saver = tf.train.Saver() 75 | train_handle = sess.run(train_iterator.string_handle()) 76 | dev_handle = sess.run(dev_iterator.string_handle()) 77 | sess.run(tf.assign(model.is_train, tf.constant(True, dtype=tf.bool))) 78 | sess.run(tf.assign(model.learning_rate, 79 | tf.constant(lr, dtype=tf.float32))) 80 | if config.training_embedding: 81 | sess.run(tf.assign(model.emb_lr, tf.constant( 82 | emb_lr, dtype=tf.float32))) 83 | 84 | best_dev_acc = 0.0 # 
定义一个最佳验证准确率,只有当准确率高于它才保存模型 85 | print("Training ...") 86 | for go in tqdm(range(1, config.num_steps + 1)): 87 | global_step = sess.run(model.global_step) + 1 88 | loss, train_op = sess.run([model.loss, model.train_op], feed_dict={ 89 | handle: train_handle}) 90 | if global_step % config.period == 0: # 每隔一段步数就记录一次train_loss和learning_rate 91 | loss_sum = tf.Summary(value=[tf.Summary.Value( 92 | tag="model/loss", simple_value=loss), ]) 93 | writer.add_summary(loss_sum, global_step) 94 | lr_sum = tf.Summary(value=[tf.Summary.Value( 95 | tag="model/learning_rate", simple_value=sess.run(model.learning_rate)), ]) 96 | writer.add_summary(lr_sum, global_step) 97 | if config.training_embedding: 98 | emb_lr_sum = tf.Summary(value=[tf.Summary.Value( 99 | tag="model/emb_lr", simple_value=sess.run(model.emb_lr)), ]) 100 | writer.add_summary(emb_lr_sum, global_step) 101 | 102 | if global_step % config.checkpoint == 0: # 验证acc,并保存模型 103 | sess.run(tf.assign(model.is_train, 104 | tf.constant(False, dtype=tf.bool))) 105 | 106 | # 评估训练集 107 | _, summ = evaluate_batch( 108 | model, config.val_num_batches, train_eval_file, sess, "train_eval", handle, train_handle) 109 | for s in summ: 110 | writer.add_summary(s, global_step) 111 | 112 | # 评估验证集 113 | metrics, summ = evaluate_batch( 114 | model, dev_total // config.batch_size + 1, dev_eval_file, sess, "dev", handle, dev_handle) 115 | sess.run(tf.assign(model.is_train, 116 | tf.constant(True, dtype=tf.bool))) 117 | for s in summ: 118 | writer.add_summary(s, global_step) 119 | writer.flush() # 将事件文件刷新到磁盘 120 | 121 | # 1101 122 | if global_step <= 40000: 123 | lr = config.init_learning_rate 124 | elif global_step <= 60000: 125 | lr = config.init_learning_rate 126 | emb_lr = 1e-5 127 | elif global_step <= 120000: 128 | lr = config.init_learning_rate / \ 129 | math.sqrt((global_step - 50000) / 10000) 130 | emb_lr = (1e-5) / \ 131 | math.sqrt((global_step - 50000) / 10000) 132 | elif global_step <= 200000: 133 | lr = config.init_learning_rate / \ 134 | math.sqrt((global_step - 50000) / 5000) 135 | emb_lr = (1e-5) / \ 136 | math.sqrt((global_step - 50000) / 5000) 137 | else: 138 | lr = config.init_learning_rate / \ 139 | math.sqrt(global_step / 1000) 140 | emb_lr = (1e-5) / \ 141 | math.sqrt(global_step / 1000) 142 | 143 | sess.run(tf.assign(model.learning_rate, 144 | tf.constant(lr, dtype=tf.float32))) 145 | if config.training_embedding: 146 | sess.run(tf.assign(model.emb_lr, tf.constant( 147 | emb_lr, dtype=tf.float32))) 148 | 149 | # 保存模型的逻辑 150 | if metrics["accuracy"] > best_dev_acc: 151 | best_dev_acc = metrics["accuracy"] 152 | filename = os.path.join( 153 | save_dir, "model_{}_devAcc_{:.6f}.ckpt".format(global_step, best_dev_acc)) 154 | saver.save(sess, filename) 155 | 156 | print("finished!") 157 | 158 | 159 | def evaluate_batch(model, num_batches, eval_file, sess, data_type, handle, str_handle): 160 | """ 161 | 模型评估函数 162 | """ 163 | answer_dict = {} # 答案词典 164 | truth_dict = {} # 真实答案词典 165 | losses = [] 166 | for _ in tqdm(range(1, num_batches + 1)): 167 | qa_id, loss, truth, answer = sess.run( 168 | [model.qa_id, model.loss, model.answer, model.classes], feed_dict={handle: str_handle}) 169 | answer_dict_ = {} 170 | truth_dict_ = {} 171 | for ids, tr, ans in zip(qa_id, truth, answer): 172 | answer_dict_[str(ids)] = ans 173 | truth_dict_[str(ids)] = tr 174 | answer_dict.update(answer_dict_) 175 | truth_dict.update(truth_dict_) 176 | losses.append(loss) 177 | loss = np.mean(losses) 178 | metrics = evaluate_acc(truth_dict, answer_dict) 179 | metrics["loss"] = 
loss 180 | loss_sum = tf.Summary(value=[tf.Summary.Value( 181 | tag="{}/loss".format(data_type), simple_value=metrics["loss"]), ]) 182 | acc_sum = tf.Summary(value=[tf.Summary.Value( 183 | tag="{}/accuracy".format(data_type), simple_value=metrics["accuracy"]), ]) 184 | return metrics, [loss_sum, acc_sum] 185 | 186 | 187 | def test(config): 188 | """ 189 | 测试函数 190 | """ 191 | with open(config.id2vec_file, "r") as fh: 192 | id2vec = np.array(json.load(fh), dtype=np.float32) 193 | with open(config.test_eval_file, "r") as fh: 194 | test_eval_file = json.load(fh) 195 | 196 | total = 10000 197 | # 读取模型的路径和预测存储的路径 198 | save_dir = config.save_dir + config.experiment 199 | if not os.path.exists(save_dir): 200 | print("no save!") 201 | return 202 | predic_time = time.strftime("%Y-%m-%d_%H:%M:%S ", time.localtime()) 203 | prediction_file = os.path.join( 204 | config.prediction_dir, (predic_time + "_predictions.txt")) 205 | 206 | print("Loading model...") 207 | test_batch = get_dataset(config.test_record_file, get_record_parser( 208 | config), config).make_one_shot_iterator() 209 | 210 | # 选取模型 211 | if config.model_name == "default": 212 | model = Model(config, test_batch, id2vec, trainable=False) 213 | else: 214 | print("model error") 215 | return 216 | 217 | sess_config = tf.ConfigProto(allow_soft_placement=True) 218 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 219 | sess_config.gpu_options.allow_growth = True 220 | 221 | print("testing ...") 222 | with tf.Session(config=sess_config) as sess: 223 | sess.run(tf.global_variables_initializer()) 224 | saver = tf.train.Saver() 225 | saver.restore(sess, tf.train.latest_checkpoint(save_dir)) 226 | sess.run(tf.assign(model.is_train, tf.constant(False, dtype=tf.bool))) 227 | answer_dict = {} 228 | for step in tqdm(range(total // config.batch_size + 1)): 229 | # 预测答案 230 | qa_id, answer = sess.run([model.qa_id, model.classes]) 231 | answer_dict_ = {} 232 | for ids, ans in zip(qa_id, answer): 233 | answer_dict_[str(ids)] = ans 234 | answer_dict.update(answer_dict_) 235 | # 将结果写文件的操作,不用考虑问题顺序 236 | if len(answer_dict) != len(test_eval_file): 237 | print("data number not match") 238 | predictions = [] 239 | for key, value in answer_dict.items(): 240 | prediction_answer = test_eval_file[str(key)][value] 241 | predictions.append(str(key) + '\t' + str(prediction_answer)) 242 | outputs = u'\n'.join(predictions) 243 | with codecs.open(prediction_file, 'w', encoding='utf-8') as f: 244 | f.write(outputs) 245 | print("done!") 246 | -------------------------------------------------------------------------------- /best_single_model/model_addAnswer_newGraph.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | model_addAnswer_newGraph.py:改进R-net模型,引入alternatives信息和特征工程。 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | 10 | import tensorflow as tf 11 | from nn_func import cudnn_gru, native_gru, dot_attention, summ, dropout, dense 12 | from focal_loss import sparse_focal_loss 13 | 14 | class Model(object): 15 | def __init__(self, config, batch, word_mat=None, trainable=True): 16 | """ 17 | 模型初始化函数 18 | Args: 19 | config:是tf.flag.FLAGS,配置整个项目的超参数 20 | batch:是一个tf.data.iterator对象,读取数据的迭代器,可能联系到tf.records,如果我们的数据集比较小就可以不用 21 | word_mat:np.array数组,是词向量?
22 | char_mat:同上 23 | """ 24 | self.config = config 25 | batch_size = config.batch_size 26 | self.global_step = tf.get_variable('global_step', shape=[], dtype=tf.int32, 27 | initializer=tf.constant_initializer(0), trainable=False) 28 | # tf.data.iterator的get_next方法,返回dataset中下一个element的tensor对象,在sess.run中实现迭代 29 | """ 30 | passage: passage序列的每个词的id号的tensor(tf.int32),长度应该是都取最大限制长度,空余的填充空值?(这里待定) 31 | question: question序列的每个词的id号的tensor(tf.int32) 32 | ch, qh, y1, y2: 本项目不需要,已经取消 33 | qa_id: question的id 34 | answer: 新添加的answer标签,(0/1/2),shape初步定义为[batch_size] 35 | """ 36 | self.passage, self.question, self.alternatives, self.answer, self.qa_id, self.nlp_feature = batch.get_next() 37 | self.is_train = tf.get_variable( 38 | "is_train", shape=[], dtype=tf.bool, trainable=False) 39 | 40 | # word embeddings的变量,可以选择是否训练. 41 | if self.config.training_embedding: 42 | self.word_mat = tf.get_variable("word_mat", initializer=tf.constant( 43 | word_mat, dtype=tf.float32), trainable=True) 44 | else: 45 | self.word_mat = tf.get_variable("word_mat", initializer=tf.constant( 46 | word_mat, dtype=tf.float32), trainable=False) 47 | 48 | with tf.name_scope("process"): 49 | # tf.cast将tensor转换为bool类型,生成mask,有值部分用true,空值用false 50 | self.c_mask = tf.cast(self.passage, tf.bool) 51 | self.q_mask = tf.cast(self.question, tf.bool) 52 | # 求每个序列的真实长度,得到_len的tensor 53 | self.c_len = tf.reduce_sum(tf.cast(self.c_mask, tf.int32), axis=1) 54 | self.q_len = tf.reduce_sum(tf.cast(self.q_mask, tf.int32), axis=1) 55 | # alternatives编码过程用到的 56 | self.a_len = tf.constant( 57 | value=3 * self.config.ans_limit, shape=[batch_size], dtype=tf.int32, name="a_len") 58 | 59 | # 求一个batch中序列最大长度,并按照最大长度对对tensor进行slice划分 60 | self.c_maxlen = tf.reduce_max(self.c_len) 61 | self.q_maxlen = tf.reduce_max(self.q_len) 62 | self.c = tf.slice(self.passage, [0, 0], [ 63 | batch_size, self.c_maxlen]) 64 | self.q = tf.slice(self.question, [0, 0], [ 65 | batch_size, self.q_maxlen]) 66 | self.c_mask = tf.slice(self.c_mask, [0, 0], [ 67 | batch_size, self.c_maxlen]) 68 | self.q_mask = tf.slice(self.q_mask, [0, 0], [ 69 | batch_size, self.q_maxlen]) 70 | # a_mask 71 | self.a_mask = tf.constant( 72 | value=True, shape=[batch_size, 3], dtype=tf.bool, name="a_mask") 73 | 74 | self.Structure() # 构造R-Net模型结构 75 | 76 | if trainable: 77 | 78 | if not self.config.training_embedding: 79 | self.learning_rate = tf.get_variable( 80 | "learning_rate", shape=[], dtype=tf.float32, trainable=False) 81 | self.opt = tf.train.AdamOptimizer( 82 | learning_rate=self.learning_rate, epsilon=1e-8) 83 | 84 | grads = self.opt.compute_gradients(self.loss) 85 | gradients, variables = zip(*grads) 86 | capped_grads, _ = tf.clip_by_global_norm( 87 | gradients, config.grad_clip) 88 | self.train_op = self.opt.apply_gradients( 89 | zip(capped_grads, variables), global_step=self.global_step) 90 | else: 91 | # 对embedding层设置单独的学习率 92 | self.emb_lr = tf.get_variable( 93 | "emb_lr", shape=[], dtype=tf.float32, trainable=False) 94 | self.learning_rate = tf.get_variable( 95 | "learning_rate", shape=[], dtype=tf.float32, trainable=False) 96 | self.emb_opt = tf.train.AdamOptimizer( 97 | learning_rate=self.emb_lr, epsilon=1e-8) 98 | self.opt = tf.train.AdamOptimizer( 99 | learning_rate=self.learning_rate, epsilon=1e-8) 100 | # 区分不同的变量列表 101 | self.var_list = tf.trainable_variables() 102 | var_list1 = [] 103 | var_list2 = [] 104 | for var in self.var_list: 105 | if var.op.name == "word_mat": 106 | var_list1.append(var) 107 | else: 108 | var_list2.append(var) 109 | 110 | grads = 
tf.gradients(self.loss, var_list1 + var_list2) 111 | capped_grads, _ = tf.clip_by_global_norm( 112 | grads, config.grad_clip) 113 | grads1 = capped_grads[:len(var_list1)] 114 | grads2 = capped_grads[len(var_list1):] 115 | self.train_op1 = self.emb_opt.apply_gradients( 116 | zip(grads1, var_list1)) 117 | self.train_op2 = self.opt.apply_gradients( 118 | zip(grads2, var_list2), global_step=self.global_step) 119 | self.train_op = tf.group(self.train_op1, self.train_op2) 120 | 121 | def Structure(self): 122 | config = self.config 123 | batch_size, PL, QL, d = config.batch_size, self.c_maxlen, self.q_maxlen, config.hidden 124 | gru = cudnn_gru if config.use_cudnn else native_gru # 选择使用哪种gru网络 125 | 126 | with tf.variable_scope("embedding"): 127 | # word_embedding层 128 | with tf.name_scope("word"): 129 | # embedding后的shape是[batch_size, max_len, vec_len] 130 | c_emb = tf.nn.embedding_lookup(self.word_mat, self.c) 131 | q_emb = tf.nn.embedding_lookup(self.word_mat, self.q) 132 | a_emb = tf.nn.embedding_lookup( 133 | self.word_mat, self.alternatives) # [batch_size, 3, ans_limit, 300] 134 | 135 | with tf.variable_scope("nlp_feature"): 136 | nlp_w = tf.get_variable( 137 | "nlp_w", shape=[187, 512], dtype=tf.float32) 138 | nlp_input = dropout(tf.matmul(self.nlp_feature, nlp_w), 139 | keep_prob=config.keep_prob, is_train=self.is_train) 140 | nlp_w2 = tf.get_variable( 141 | "nlp_w2", shape=[512, d], dtype=tf.float32) 142 | nlp_out = tf.nn.relu(tf.matmul(nlp_input, nlp_w2)) 143 | 144 | with tf.variable_scope("encoding"): 145 | rnn = gru(num_layers=3, num_units=d, batch_size=batch_size, input_size=c_emb.get_shape( 146 | ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train, scope="p_encoder") 147 | c = rnn(c_emb, seq_len=self.c_len) 148 | q = rnn(q_emb, seq_len=self.q_len) 149 | al_inputs = tf.reshape( 150 | a_emb, [-1, 3 * config.ans_limit, a_emb.get_shape().as_list()[-1]]) 151 | # [batch_size, 3*ans_limit, 2*hidden] 152 | al_encode = rnn(al_inputs, seq_len=self.a_len) 153 | # al_encode = rnn(al_inputs, seq_len=self.a_len)[:, :, -2 * d:]这个还没试 154 | al_output_ = tf.reshape(al_encode, [batch_size, 3, -1]) 155 | al_output = tf.nn.relu(dense(al_output_, d)) 156 | 157 | # with tf.variable_scope("alternative_encoding"): 158 | # # al_inputs = tf.reduce_sum(a_emb, axis=2) # [batch_size, 3, 300] 159 | # # al_encode = dense(al_inputs, d, use_bias=False, 160 | # # scope="al_encoder") # [batch_size, 3, hidden] 161 | # rnn = gru(num_layers=1, num_units=d, batch_size=batch_size, input_size=a_emb.get_shape( 162 | # ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train, scope="al_encoder") 163 | # al_inputs = tf.reshape( 164 | # a_emb, [-1, 3 * config.ans_limit, a_emb.get_shape().as_list()[-1]]) 165 | # # [batch_size, 3*ans_limit, 2*hidden] 166 | # al_encode = rnn(al_inputs, seq_len=self.a_len) 167 | # al_output_ = tf.reshape(al_encode, [batch_size, 3, -1]) 168 | # al_output = tf.nn.relu(dense(al_output_, d)) 169 | 170 | # with tf.variable_scope("question_encoding"): 171 | # # encoder层,将context和question分别输入双向GRU 172 | # rnn = gru(num_layers=3, num_units=d, batch_size=batch_size, input_size=q_emb.get_shape( 173 | # ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train, scope="q_encoder") 174 | # # RNN每层的正向反向输出合并,本代码默认的是每层的输出也合并 175 | # # 所以对于3层rnn,输出的shape是[batch_size, max_len, 6*hidden] 176 | # # 并且,序列空值处的输出都清零了 177 | # q = rnn(q_emb, seq_len=self.q_len) 178 | 179 | # with tf.variable_scope("passage_encoding"): 180 | # rnn = gru(num_layers=3, num_units=d, batch_size=batch_size, 
input_size=c_emb.get_shape( 181 | # ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train, scope="p_encoder") 182 | # c = rnn(c_emb, seq_len=self.c_len) 183 | 184 | with tf.variable_scope("QP_attention"): 185 | """ 186 | 基于注意力的循环神经网络层,匹配context和question 187 | """ 188 | # qc_att的shape [batch_size, c_maxlen, 12*hidden] 189 | qc_att_ = dot_attention(inputs=c, memory=q, mask=self.q_mask, hidden=d, 190 | keep_prob=config.keep_prob, is_train=self.is_train) 191 | rnn = gru(num_layers=1, num_units=d, batch_size=batch_size, input_size=qc_att_.get_shape( 192 | ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train, scope="qp") 193 | qc_att = rnn(qc_att_, seq_len=self.c_len) 194 | 195 | with tf.variable_scope("passage_match"): 196 | """ 197 | context自匹配层 198 | """ 199 | c_att = dot_attention( 200 | qc_att, qc_att, mask=self.c_mask, hidden=d, keep_prob=config.keep_prob, is_train=self.is_train) 201 | rnn = gru(num_layers=1, num_units=d, batch_size=batch_size, input_size=c_att.get_shape( 202 | ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train, scope="p_match") 203 | # [batch_size, c_maxlen, 2*hidden] 204 | c_match = rnn(c_att, seq_len=self.c_len) 205 | 206 | with tf.variable_scope("YesNo_classification"): 207 | """ 208 | 对问题答案的分类层, 需要的输入有question的编码结果q和context的match 209 | """ 210 | # init的shape:[batch_size, 2*hidden] 211 | # 这步的作用初始猜测是将question进行pooling操作,然后再输入给一个rnn层进行分类 212 | init = summ(q[:, :, -2 * d:], d, mask=self.q_mask, 213 | keep_prob=config.keep_prob, is_train=self.is_train) 214 | c_match_ = dropout(c_match, keep_prob=config.keep_prob, 215 | is_train=self.is_train) 216 | final_hiddens = init.get_shape().as_list()[-1] 217 | final_gru = tf.contrib.rnn.GRUCell(final_hiddens) 218 | qp_output_, _ = tf.nn.dynamic_rnn( 219 | final_gru, c_match_, initial_state=init, dtype=tf.float32) # [batch_size, c_maxlen, 2*hidden] 220 | qp_output = dense(qp_output_, d) 221 | 222 | # final_att: [batch_size, 3, 2*hidden] 223 | final_att = dot_attention(al_output, qp_output, self.c_mask, 224 | hidden=d, keep_prob=config.keep_prob, is_train=self.is_train) 225 | # 将特征工程的信息融合进来 226 | nlp_final = tf.expand_dims(nlp_out, axis=1) 227 | nlp_final = tf.tile(nlp_final, [1, 3, 1]) 228 | 229 | final_concat = tf.concat([final_att, nlp_final], axis=2) 230 | 231 | final_output = dense( 232 | final_concat, 1, use_bias=True, scope="final_output") 233 | self.logits = tf.squeeze(final_output) 234 | 235 | with tf.variable_scope("softmax_and_loss"): 236 | self.final_softmax = tf.nn.softmax(self.logits) 237 | self.classes = tf.cast( 238 | tf.argmax(self.final_softmax, axis=1), dtype=tf.int32, name="classes") 239 | # 注意stop_gradient的使用,因为answer不是placeholder传进来的,所以要注明不对其计算梯度 240 | if config.loss_function == "focal_loss": 241 | self.loss = tf.reduce_mean(sparse_focal_loss( 242 | logits=self.logits, labels=tf.stop_gradient(self.answer))) 243 | else: 244 | self.loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits( 245 | logits=self.logits, labels=tf.stop_gradient(self.answer))) 246 | 247 | def get_loss(self): 248 | return self.loss 249 | 250 | def get_global_step(self): 251 | return self.global_step 252 | -------------------------------------------------------------------------------- /best_single_model/nlp_feature.json: -------------------------------------------------------------------------------- 1 | {"说明":"此文件存储数据样本经特征工程处理之后的特征向量。"} -------------------------------------------------------------------------------- /best_single_model/nn_func.py: 
-------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | nn_func.py:神经网络模型的组件 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | 11 | INF = 1e30 12 | 13 | 14 | class cudnn_gru: 15 | 16 | def __init__(self, num_layers, num_units, batch_size, input_size, keep_prob=1.0, is_train=None, scope=None): 17 | self.num_layers = num_layers 18 | self.grus = [] 19 | self.inits = [] 20 | self.dropout_mask = [] 21 | self.scope = scope 22 | for layer in range(num_layers): 23 | input_size_ = input_size if layer == 0 else 2 * num_units 24 | gru_fw = tf.contrib.cudnn_rnn.CudnnGRU( 25 | 1, num_units, name="f_cudnn_gru") 26 | gru_bw = tf.contrib.cudnn_rnn.CudnnGRU( 27 | 1, num_units, name="b_cudnn_gru") 28 | init_fw = tf.tile(tf.Variable( 29 | tf.zeros([1, 1, num_units])), [1, batch_size, 1]) 30 | init_bw = tf.tile(tf.Variable( 31 | tf.zeros([1, 1, num_units])), [1, batch_size, 1]) 32 | mask_fw = dropout(tf.ones([1, batch_size, input_size_], dtype=tf.float32), 33 | keep_prob=keep_prob, is_train=is_train, mode=None) 34 | mask_bw = dropout(tf.ones([1, batch_size, input_size_], dtype=tf.float32), 35 | keep_prob=keep_prob, is_train=is_train, mode=None) 36 | self.grus.append((gru_fw, gru_bw, )) 37 | self.inits.append((init_fw, init_bw, )) 38 | self.dropout_mask.append((mask_fw, mask_bw, )) 39 | 40 | def __call__(self, inputs, seq_len, keep_prob=1.0, is_train=None, concat_layers=True): 41 | # cudnn GRU需要交换张量的维度,可能是便于计算 42 | outputs = [tf.transpose(inputs, [1, 0, 2])] 43 | with tf.variable_scope(self.scope): 44 | for layer in range(self.num_layers): 45 | gru_fw, gru_bw = self.grus[layer] 46 | init_fw, init_bw = self.inits[layer] 47 | mask_fw, mask_bw = self.dropout_mask[layer] 48 | with tf.variable_scope("fw_{}".format(layer)): 49 | out_fw, _ = gru_fw( 50 | outputs[-1] * mask_fw, initial_state=(init_fw, )) 51 | with tf.variable_scope("bw_{}".format(layer)): 52 | inputs_bw = tf.reverse_sequence( 53 | outputs[-1] * mask_bw, seq_lengths=seq_len, seq_dim=0, batch_dim=1) 54 | out_bw, _ = gru_bw( 55 | inputs_bw, initial_state=(init_bw, )) 56 | out_bw = tf.reverse_sequence( 57 | out_bw, seq_lengths=seq_len, seq_dim=0, batch_dim=1) 58 | outputs.append(tf.concat([out_fw, out_bw], axis=2)) 59 | if concat_layers: 60 | res = tf.concat(outputs[1:], axis=2) 61 | else: 62 | res = outputs[-1] 63 | res = tf.transpose(res, [1, 0, 2]) 64 | return res 65 | 66 | 67 | class native_gru: 68 | 69 | def __init__(self, num_layers, num_units, batch_size, input_size, keep_prob=1.0, is_train=None, scope="native_gru"): 70 | self.num_layers = num_layers 71 | self.grus = [] 72 | self.inits = [] 73 | self.dropout_mask = [] 74 | self.scope = scope 75 | for layer in range(num_layers): 76 | input_size_ = input_size if layer == 0 else 2 * num_units 77 | # 双向Bi-GRU f:forward b:back 78 | gru_fw = tf.contrib.rnn.GRUCell(num_units) 79 | gru_bw = tf.contrib.rnn.GRUCell(num_units) 80 | # tf.tile 平铺给定的张量,这里是将初始状态扩张到batch_size倍 81 | init_fw = tf.tile(tf.Variable( 82 | tf.zeros([1, num_units])), [batch_size, 1]) 83 | init_bw = tf.tile(tf.Variable( 84 | tf.zeros([1, num_units])), [batch_size, 1]) 85 | mask_fw = dropout(tf.ones([batch_size, 1, input_size_], dtype=tf.float32), 86 | keep_prob=keep_prob, is_train=is_train, mode=None) 87 | mask_bw = dropout(tf.ones([batch_size, 1, input_size_], dtype=tf.float32), 88 | keep_prob=keep_prob, is_train=is_train, mode=None) 89 | self.grus.append((gru_fw, gru_bw, )) 90 | self.inits.append((init_fw, init_bw, )) 91 | 
self.dropout_mask.append((mask_fw, mask_bw, )) 92 | 93 | def __call__(self, inputs, seq_len, concat_layers=True): 94 | """ 95 | 运行RNN 96 | 这里的keep_prob和is_train没用,在__init__中就已设置好了 97 | """ 98 | outputs = [inputs] 99 | with tf.variable_scope(self.scope): 100 | for layer in range(self.num_layers): 101 | gru_fw, gru_bw = self.grus[layer] 102 | init_fw, init_bw = self.inits[layer] 103 | mask_fw, mask_bw = self.dropout_mask[layer] 104 | # 正向RNN 105 | with tf.variable_scope("fw_{}".format(layer)): 106 | # 每一层使用上层的输出 107 | # dynamic_rnn中的超过seq_len的部分就不计算了,state直接重复,output直接清零,节省资源 108 | out_fw, _ = tf.nn.dynamic_rnn( 109 | gru_fw, outputs[-1] * mask_fw, seq_len, initial_state=init_fw, dtype=tf.float32) 110 | # 反向RNN 111 | with tf.variable_scope("bw_{}".format(layer)): 112 | inputs_bw = tf.reverse_sequence( 113 | outputs[-1] * mask_bw, seq_lengths=seq_len, seq_dim=1, batch_dim=0) 114 | out_bw, _ = tf.nn.dynamic_rnn( 115 | gru_bw, inputs_bw, seq_len, initial_state=init_bw, dtype=tf.float32) 116 | out_bw = tf.reverse_sequence( 117 | out_bw, seq_lengths=seq_len, seq_dim=1, batch_dim=0) 118 | # 正向输出和反向输出合并 119 | outputs.append(tf.concat([out_fw, out_bw], axis=2)) 120 | if concat_layers: 121 | res = tf.concat(outputs[1:], axis=2) 122 | else: 123 | res = outputs[-1] 124 | return res 125 | 126 | 127 | def dropout(args, keep_prob, is_train, mode="recurrent"): 128 | """ 129 | dropout层,args初始是1.0 130 | """ 131 | if keep_prob < 1.0: 132 | noise_shape = None 133 | scale = 1.0 134 | shape = tf.shape(args) 135 | if mode == "embedding": 136 | noise_shape = [shape[0], 1] 137 | scale = keep_prob 138 | if mode == "recurrent" and len(args.get_shape().as_list()) == 3: 139 | noise_shape = [shape[0], 1, shape[-1]] 140 | args = tf.cond(is_train, lambda: tf.nn.dropout( 141 | args, keep_prob, noise_shape=noise_shape) * scale, lambda: args) 142 | return args 143 | 144 | 145 | def softmax_mask(val, mask): 146 | """ 147 | 作用是给空值处减小注意力 148 | """ 149 | return -INF * (1 - tf.cast(mask, tf.float32)) + val # tf.cast:true转为1.0,false转为0.0 150 | 151 | 152 | def summ(memory, hidden, mask, keep_prob=1.0, is_train=None, scope="summ"): 153 | """ 154 | 对question进行最后一步的处理,可以看作是pooling吗 155 | """ 156 | with tf.variable_scope(scope): 157 | d_memory = dropout(memory, keep_prob=keep_prob, is_train=is_train) 158 | s0 = tf.nn.tanh(dense(d_memory, hidden, scope="s0")) 159 | s = dense(s0, 1, use_bias=False, scope="s") 160 | # tf.squeeze把长度只有1的维度去掉 161 | # s1:[batch_size, c_maxlen] 162 | s1 = softmax_mask(tf.squeeze(s, [2]), mask) 163 | a = tf.expand_dims(tf.nn.softmax(s1), axis=2) 164 | res = tf.reduce_sum(a * memory, axis=1) # 逐元素相乘,shape跟随memory一致 165 | return res # [batch_size, 2*hidden] 166 | 167 | 168 | def dot_attention(inputs, memory, mask, hidden, keep_prob=1.0, is_train=None, scope="dot_attention"): 169 | """ 170 | 门控attention层 171 | """ 172 | with tf.variable_scope(scope): 173 | 174 | d_inputs = dropout(inputs, keep_prob=keep_prob, is_train=is_train) 175 | d_memory = dropout(memory, keep_prob=keep_prob, is_train=is_train) 176 | JX = tf.shape(inputs)[1] # inputs的1维度,应该是c_maxlen 177 | 178 | with tf.variable_scope("attention"): 179 | # inputs_的shape:[batch_size, c_maxlen, hidden] 180 | inputs_ = tf.nn.relu( 181 | dense(d_inputs, hidden, use_bias=False, scope="inputs")) 182 | memory_ = tf.nn.relu( 183 | dense(d_memory, hidden, use_bias=False, scope="memory")) 184 | # 三维矩阵相乘,结果的shape是[batch_size, c_maxlen, q_maxlen] 185 | outputs = tf.matmul(inputs_, tf.transpose( 186 | memory_, [0, 2, 1])) / (hidden ** 0.5) 187 | # 
将mask平铺成与outputs相同的形状,这里考虑,改进成input和memory都需要mask 188 | mask = tf.tile(tf.expand_dims(mask, axis=1), [1, JX, 1]) 189 | logits = tf.nn.softmax(softmax_mask(outputs, mask)) 190 | outputs = tf.matmul(logits, memory) 191 | # res:[batch_size, c_maxlen, 12*hidden] 192 | res = tf.concat([inputs, outputs], axis=2) 193 | 194 | with tf.variable_scope("gate"): 195 | """ 196 | attention * gate 197 | """ 198 | dim = res.get_shape().as_list()[-1] 199 | d_res = dropout(res, keep_prob=keep_prob, is_train=is_train) 200 | gate = tf.nn.sigmoid(dense(d_res, dim, use_bias=False)) 201 | return res * gate # 向量的逐元素相乘 202 | 203 | 204 | # 写一个谷歌论文中新的attention模块 205 | def multihead_attention(Q, K, V, mask, hidden, head_num=4, keep_prob=1.0, is_train=None, has_gate=True, scope="multihead_attention"): 206 | """ 207 | Q : passage 208 | K,V: question 209 | mask: Q的mask 210 | """ 211 | size = int(hidden / head_num) # 每个attention的大小 212 | 213 | with tf.variable_scope(scope): 214 | d_Q = dropout(Q, keep_prob=keep_prob, is_train=is_train) 215 | d_K = dropout(K, keep_prob=keep_prob, is_train=is_train) 216 | JX = tf.shape(Q)[1] 217 | 218 | with tf.variable_scope("attention"): 219 | Q_ = tf.nn.relu(dense(d_Q, hidden, use_bias=False, scope="Q")) 220 | K_ = tf.nn.relu(dense(d_K, hidden, use_bias=False, scope="K")) 221 | V_ = tf.nn.relu(dense(V, hidden, use_bias=False, scope="V")) 222 | Q_ = tf.reshape(Q_, (-1, tf.shape(Q_)[1], head_num, size)) 223 | K_ = tf.reshape(K_, (-1, tf.shape(K_)[1], head_num, size)) 224 | V_ = tf.reshape(V_, (-1, tf.shape(V_)[1], head_num, size)) 225 | Q_ = tf.transpose(Q_, [0, 2, 1, 3]) 226 | K_ = tf.transpose(K_, [0, 2, 1, 3]) 227 | V_ = tf.transpose(V_, [0, 2, 1, 3]) 228 | # scale:[batch_size, head_num, c_maxlen, q_maxlen] 229 | scale = tf.matmul(Q_, K_, transpose_b=True) / tf.sqrt(float(size)) 230 | scale = tf.transpose(scale, [0, 3, 2, 1]) 231 | for _ in range(len(scale.shape) - 2): 232 | mask = tf.expand_dims(mask, axis=2) 233 | mask_scale = softmax_mask(scale, mask) 234 | mask_scale = tf.transpose(scale, [0, 3, 2, 1]) 235 | logits = tf.nn.softmax(mask_scale) 236 | outputs = tf.matmul(logits, V_) # [b,h,c,s] 237 | outputs = tf.transpose(outputs, [0, 2, 1, 3]) 238 | # [batch_size, c_maxlen, hidden] 239 | outputs = tf.reshape(outputs, (-1, tf.shape(Q)[1], hidden)) 240 | # res连接 241 | res = tf.concat([Q, outputs], axis=2) 242 | 243 | if has_gate: 244 | with tf.variable_scope("gate"): 245 | dim = res.get_shape().as_list()[-1] 246 | d_res = dropout(res, keep_prob=keep_prob, is_train=is_train) 247 | gate = tf.nn.sigmoid(dense(d_res, dim, use_bias=False)) 248 | return res * gate 249 | else: 250 | return res 251 | 252 | 253 | def dense(inputs, hidden, use_bias=True, scope="dense"): 254 | """ 255 | 全连接层 256 | """ 257 | with tf.variable_scope(scope): 258 | shape = tf.shape(inputs) 259 | dim = inputs.get_shape().as_list()[-1] 260 | out_shape = [shape[idx] for idx in range( 261 | len(inputs.get_shape().as_list()) - 1)] + [hidden] 262 | # 三维的inputs,reshape成二维 263 | flat_inputs = tf.reshape(inputs, [-1, dim]) 264 | W = tf.get_variable("W", [dim, hidden]) 265 | res = tf.matmul(flat_inputs, W) 266 | if use_bias: 267 | b = tf.get_variable( 268 | "b", [hidden], initializer=tf.constant_initializer(0.)) 269 | res = tf.nn.bias_add(res, b) 270 | # outshape就是input的最后一维变成hidden 271 | res = tf.reshape(res, out_shape) 272 | return res 273 | -------------------------------------------------------------------------------- /best_single_model/util_addAnswer.py: 
-------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | util_addAnswer.py:读取batch的工具。 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import numpy as np 11 | import re 12 | from collections import Counter 13 | import string 14 | 15 | 16 | def get_record_parser(config): 17 | def parse(example): 18 | para_limit = config.para_limit 19 | ques_limit = config.ques_limit 20 | ans_limit = config.ans_limit 21 | features = tf.parse_single_example(example, 22 | features={ 23 | "passage_idxs": tf.FixedLenFeature([], tf.string), 24 | "question_idxs": tf.FixedLenFeature([], tf.string), 25 | "alternative_idxs": tf.FixedLenFeature([], tf.string), 26 | "answer": tf.FixedLenFeature([], tf.int64), 27 | "id": tf.FixedLenFeature([], tf.int64), 28 | "nlp_feature": tf.FixedLenFeature([187], tf.float32) 29 | }) 30 | # tf.decode_raw: 将字符串的字节重新解释为数字向量 31 | passage_idxs = tf.reshape(tf.decode_raw( 32 | features["passage_idxs"], tf.int32), [para_limit]) 33 | question_idxs = tf.reshape(tf.decode_raw( 34 | features["question_idxs"], tf.int32), [ques_limit]) 35 | alternative_idxs = tf.reshape(tf.decode_raw( 36 | features["alternative_idxs"], tf.int32), [3, ans_limit]) 37 | answer = features["answer"] 38 | qa_id = features["id"] 39 | nlp_feature= features["nlp_feature"] 40 | return passage_idxs, question_idxs, alternative_idxs, answer, qa_id ,nlp_feature 41 | return parse 42 | 43 | 44 | def get_batch_dataset(record_file, parser, config): 45 | """ 46 | 训练数据集TFRecordDataset的batch生成器。 47 | Args: 48 | record_file: 训练数据tf_record路径 49 | parser: 数据存储的格式 50 | config: 超参数 51 | """ 52 | num_threads = tf.constant(config.num_threads, dtype=tf.int32) 53 | dataset = tf.data.TFRecordDataset(record_file).map( 54 | parser, num_parallel_calls=num_threads).shuffle(config.capacity).repeat() 55 | if config.is_bucket: 56 | # bucket方法,用于解决序列长度不同的mini-batch的计算效率问题 57 | buckets = [tf.constant(num) for num in range(*config.bucket_range)] 58 | 59 | def key_func(context_idxs, ques_idxs, context_char_idxs, ques_char_idxs, y1, y2, qa_id): 60 | c_len = tf.reduce_sum( 61 | tf.cast(tf.cast(context_idxs, tf.bool), tf.int32)) 62 | buckets_min = [np.iinfo(np.int32).min] + buckets 63 | buckets_max = buckets + [np.iinfo(np.int32).max] 64 | conditions_c = tf.logical_and( 65 | tf.less(buckets_min, c_len), tf.less_equal(c_len, buckets_max)) 66 | bucket_id = tf.reduce_min(tf.where(conditions_c)) 67 | return bucket_id 68 | 69 | def reduce_func(key, elements): 70 | return elements.batch(config.batch_size) 71 | 72 | dataset = dataset.apply(tf.contrib.data.group_by_window( 73 | key_func, reduce_func, window_size=5 * config.batch_size)).shuffle(len(buckets) * 25) 74 | else: 75 | dataset = dataset.batch(config.batch_size) 76 | return dataset 77 | 78 | 79 | def get_dataset(record_file, parser, config): 80 | num_threads = tf.constant(config.num_threads, dtype=tf.int32) 81 | dataset = tf.data.TFRecordDataset(record_file).map( 82 | parser, num_parallel_calls=num_threads).repeat().batch(config.batch_size) 83 | return dataset 84 | 85 | 86 | def evaluate_acc(truth_dict, answer_dict): 87 | """ 88 | 计算准确率,还可以设计返回正确问题和错误问题列表 89 | """ 90 | total = 0 91 | right = 0 92 | wrong = 0 93 | for key, value in answer_dict.items(): 94 | total += 1 95 | ground_truths = truth_dict[key] 96 | prediction = value 97 | if prediction == ground_truths: 98 | right += 1 99 | else: 100 | wrong += 1 101 | accuracy = (right / total) * 1.0 102 | return {"accuracy": accuracy} 103 | 
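# Illustrative usage sketch: evaluate_acc only compares two {qa_id: label} dicts,
# so it can be sanity-checked without TensorFlow or any TFRecord data.
# The ids and labels below are made-up example values.
if __name__ == "__main__":
    example_truth = {"1": 0, "2": 1, "3": 2, "4": 1}
    example_pred = {"1": 0, "2": 1, "3": 0, "4": 1}
    # 3 of the 4 predictions match the ground truth, so the expected accuracy is 0.75
    print(evaluate_acc(example_truth, example_pred))  # {'accuracy': 0.75}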
-------------------------------------------------------------------------------- /ensemble/dev_soft/model_char_1102_0.7278.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuhaitao1994/AIchallenger2018_MachineReadingComprehension/03c8d4ab60f6ac9c7f777fd2c932cc01300b5c42/ensemble/dev_soft/model_char_1102_0.7278.txt -------------------------------------------------------------------------------- /ensemble/dev_soft/model_newgraph_1101_0.7474.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuhaitao1994/AIchallenger2018_MachineReadingComprehension/03c8d4ab60f6ac9c7f777fd2c932cc01300b5c42/ensemble/dev_soft/model_newgraph_1101_0.7474.txt -------------------------------------------------------------------------------- /ensemble/dev_soft/model_newgraph_2lr_1101_0.7459.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuhaitao1994/AIchallenger2018_MachineReadingComprehension/03c8d4ab60f6ac9c7f777fd2c932cc01300b5c42/ensemble/dev_soft/model_newgraph_2lr_1101_0.7459.txt -------------------------------------------------------------------------------- /ensemble/ensemble_predict.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | ensemble_predict.py:将模型权重用于融合,预测结果。 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import pickle 14 | import os 15 | import codecs 16 | import time 17 | 18 | os.environ["CUDA_VISIBLE_DEVICES"] = "" 19 | os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" 20 | 21 | 22 | def ensemble_predict(): 23 | with open("test_eval.json", "r") as fh: 24 | test_eval_file = json.load(fh) 25 | with open("ensemble_wb.json", "r") as fh: 26 | best_wb = json.load(fh) 27 | 28 | predic_time = time.strftime("%Y-%m-%d_%H:%M:%S ", time.localtime()) 29 | prediction_file = os.path.join( 30 | "predictions", (predic_time + "_predictions.txt")) 31 | 32 | print("正在读取test的softmax结果文件!") 33 | rootdir = "./test_soft" 34 | # 定义融合后的测试集softmax字典 35 | dev_dict = {} 36 | predict_dict = {} 37 | 38 | # 获取目录下所有文件,并去除隐藏文件 39 | filelist = os.listdir(rootdir) 40 | filenames = [ 41 | filename for filename in filelist if not filename.startswith('.')] 42 | for i in range(len(filenames)): 43 | print("{}: {}".format(i + 1, filenames[i])) 44 | # print(filenames) 45 | 46 | # 初始化dev_dict 47 | if len(filenames) == 0: 48 | print("没有softmax文件") 49 | return 50 | path = os.path.join(rootdir, filenames[0]) 51 | with open(path, "rb") as f1: 52 | soft = pickle.load(f1) 53 | for key, value in soft.items(): 54 | dev_dict[key] = best_wb[filenames[0]][0] * value 55 | print("初始化完成") 56 | 57 | # 遍历剩下的test文件 58 | for i in range(1, len(filenames)): 59 | weight = best_wb[filenames[i]][0] 60 | print(weight) 61 | path = os.path.join(rootdir, filenames[i]) 62 | with open(path, "rb") as f1: 63 | soft = pickle.load(f1) 64 | for key in soft.keys(): 65 | dev_dict[key] += weight * soft[key] 66 | print(i + 1, "个文件处理完成") 67 | for key in dev_dict.keys(): 68 | dev_dict[key] += best_wb["bias"][0] 69 | # 计算预测类别 70 | with tf.Session(graph=tf.Graph()) as sess: 71 | ddev = tf.placeholder(shape=[3], dtype=tf.float32, name='all') 72 | dev_class = tf.cast(tf.argmax(ddev), dtype=tf.int32) 73 | for k in range(280001, 290001): 74 | key = str(k) 75 | value = sess.run(dev_class, 
feed_dict={ddev: dev_dict[key]}) 76 | predict_dict[key] = value 77 | 78 | predictions = [] 79 | for key, value in predict_dict.items(): 80 | prediction_answer = test_eval_file[str(key)][value] 81 | predictions.append(str(key) + '\t' + str(prediction_answer)) 82 | outputs = u'\n'.join(predictions) 83 | with codecs.open(prediction_file, 'w', encoding='utf-8') as f: 84 | f.write(outputs) 85 | print("done!") 86 | 87 | 88 | if __name__ == '__main__': 89 | ensemble_predict() 90 | -------------------------------------------------------------------------------- /ensemble/ensemble_train.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | ensemble_train.py:在验证集中训练模型融合的权重。 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import pickle 14 | import os 15 | import codecs 16 | import time 17 | 18 | os.environ["CUDA_VISIBLE_DEVICES"] = "0" 19 | os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" 20 | 21 | 22 | def ensemble_train(): 23 | total = 29968 24 | print("正在读取dev的softmax结果文件!") 25 | rootdir = "./dev_soft" 26 | dev_dict = {} 27 | predict_dict = {} 28 | with open('truth/truth_dict.txt', 'rb') as f1: 29 | truth_dict = pickle.load(f1) 30 | filelist = os.listdir(rootdir) 31 | filenames = [ 32 | filename for filename in filelist if not filename.startswith('.')] 33 | for i in range(len(filenames)): 34 | print("{}: {}".format(i + 1, filenames[i])) 35 | if len(filenames) == 0: 36 | print("没有softmax文件") 37 | return 38 | 39 | # 定义整个输入和标签矩阵 40 | all_inputs = np.zeros(shape=[total, 3, len(filenames)], dtype=np.float32) 41 | all_labels = np.zeros(shape=[total], dtype=np.int32) 42 | keys = [] 43 | for k in truth_dict.keys(): 44 | keys.append(k) 45 | keys.sort(reverse=False) 46 | if len(keys) != total: 47 | print("keys number error") 48 | return 49 | # 给标签赋值 50 | for i in range(total): 51 | all_labels[i] = truth_dict[keys[i]] 52 | # 遍历文件加入矩阵 53 | for i in range(len(filenames)): 54 | path = os.path.join(rootdir, filenames[i]) 55 | with open(path, "rb") as f1: 56 | soft = pickle.load(f1) 57 | for j in range(total): 58 | all_inputs[j, :, i] = soft[keys[j]] 59 | print(i + 1, "个文件处理完成") 60 | # print(all_labels[:10]) 61 | # print(all_inputs[:10, :, :]) 62 | sess_config = tf.ConfigProto(allow_soft_placement=True) 63 | sess_config.gpu_options.allow_growth = True 64 | with tf.Session(config=sess_config) as sess: 65 | inputs = tf.placeholder(shape=[total, 3, len( 66 | filenames)], dtype=tf.float32, name="inputs") 67 | labels = tf.placeholder(shape=[total], dtype=tf.int32, name="labels") 68 | W = tf.get_variable(shape=[len(filenames), 1], 69 | dtype=tf.float32, name="weights") 70 | b = tf.get_variable( 71 | shape=[1], dtype=tf.float32, name="bias") 72 | re_inputs = tf.reshape(inputs, shape=[-1, len(filenames)]) 73 | pred = tf.matmul(re_inputs, W) 74 | re_pred = tf.reshape(pred, shape=[total, 3, 1]) 75 | outputs = tf.squeeze(re_pred) 76 | predictions = tf.cast(tf.argmax(outputs, axis=1), tf.int32) 77 | # loss and opt 78 | loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits( 79 | logits=outputs, labels=tf.stop_gradient(labels))) 80 | train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss) 81 | 82 | # run 83 | sess.run(tf.global_variables_initializer()) 84 | best_acc = 0. 
85 | best_wb = {} 86 | for steps in range(1500): 87 | answers, train_OP = sess.run([predictions, train_op], feed_dict={ 88 | inputs: all_inputs, labels: all_labels}) 89 | answer_dict = {} 90 | for i in range(total): 91 | answer_dict[keys[i]] = answers[i] 92 | if evaluate_acc(truth_dict, answer_dict)["accuracy"] > best_acc: 93 | best_acc = evaluate_acc(truth_dict, answer_dict)["accuracy"] 94 | best_wb["weights"] = sess.run(W).tolist() 95 | best_wb["bias"] = sess.run(b).tolist() 96 | if (steps + 1) % 50 == 0: 97 | print("steps:{},acc:{:.5f}".format( 98 | steps + 1, evaluate_acc(truth_dict, answer_dict)["accuracy"])) 99 | for i in range(len(best_wb["weights"])): 100 | print("{}: {}".format(i + 1, best_wb["weights"][i])) 101 | save_wb = {} 102 | for file, weight in zip(filenames, best_wb["weights"]): 103 | save_wb[file] = weight 104 | save_wb["bias"] = best_wb["bias"] 105 | print(best_acc) 106 | with open("ensemble_wb.json", "w") as fw: 107 | json.dump(save_wb, fw) 108 | 109 | 110 | def evaluate_acc(truth_dict, answer_dict): 111 | """ 112 | 计算准确率,还可以设计返回正确问题和错误问题列表 113 | """ 114 | total = 0 115 | right = 0 116 | wrong = 0 117 | for key, value in answer_dict.items(): 118 | total += 1 119 | ground_truths = truth_dict[key] 120 | prediction = value 121 | if prediction == ground_truths: 122 | right += 1 123 | else: 124 | wrong += 1 125 | accuracy = (right / total) * 1.0 126 | return {"accuracy": accuracy} 127 | 128 | 129 | if __name__ == '__main__': 130 | ensemble_train() 131 | -------------------------------------------------------------------------------- /ensemble/test_soft/model_char_1102_0.7278.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuhaitao1994/AIchallenger2018_MachineReadingComprehension/03c8d4ab60f6ac9c7f777fd2c932cc01300b5c42/ensemble/test_soft/model_char_1102_0.7278.txt -------------------------------------------------------------------------------- /ensemble/test_soft/model_newgraph_1101_0.7474.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuhaitao1994/AIchallenger2018_MachineReadingComprehension/03c8d4ab60f6ac9c7f777fd2c932cc01300b5c42/ensemble/test_soft/model_newgraph_1101_0.7474.txt -------------------------------------------------------------------------------- /ensemble/test_soft/model_newgraph_2lr_1101_0.7459.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuhaitao1994/AIchallenger2018_MachineReadingComprehension/03c8d4ab60f6ac9c7f777fd2c932cc01300b5c42/ensemble/test_soft/model_newgraph_2lr_1101_0.7459.txt -------------------------------------------------------------------------------- /pics/model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuhaitao1994/AIchallenger2018_MachineReadingComprehension/03c8d4ab60f6ac9c7f777fd2c932cc01300b5c42/pics/model.png --------------------------------------------------------------------------------
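Note: below is a minimal NumPy sketch of the stacking combination implemented in ensemble_train.py / ensemble_predict.py above. The per-model softmax vectors, weights and bias are made-up numbers; in the real pipeline they come from the ./test_soft pickles and the ensemble_wb.json written by ensemble_train.py.

    import numpy as np

    # Made-up softmax outputs of two models for one question (3 alternatives each).
    model_softmax = {
        "model_a.txt": np.array([0.6, 0.3, 0.1]),
        "model_b.txt": np.array([0.5, 0.4, 0.1]),
    }
    # Same layout as ensemble_wb.json: one single-element weight list per model file, plus a bias.
    weights = {"model_a.txt": [0.8], "model_b.txt": [1.2]}
    bias = [0.05]

    # Weighted sum of the softmax vectors plus bias, then argmax over the 3 alternatives.
    combined = sum(weights[name][0] * soft for name, soft in model_softmax.items()) + bias[0]
    prediction = int(np.argmax(combined))  # 0 / 1 / 2, an index into the alternatives
    print(combined, prediction)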