├── .gitignore
├── LICENSE
├── README.md
├── baseline
│   ├── DemoModel.py
│   ├── config.py
│   ├── data_process.py
│   ├── data_process_aug.py
│   ├── examine_dev.py
│   ├── examine_dev_ensemble.py
│   ├── file_save.py
│   ├── main.py
│   ├── model.py
│   ├── nn_func.py
│   └── util.py
├── best_single_model
│   ├── config.py
│   ├── data_process_addAnswer.py
│   ├── examine_dev.py
│   ├── file_save.py
│   ├── focal_loss.py
│   ├── main.py
│   ├── model_addAnswer_newGraph.py
│   ├── nlp_feature.json
│   ├── nn_func.py
│   └── util_addAnswer.py
├── ensemble
│   ├── dev_soft
│   │   ├── model_char_1102_0.7278.txt
│   │   ├── model_newgraph_1101_0.7474.txt
│   │   └── model_newgraph_2lr_1101_0.7459.txt
│   ├── ensemble_predict.py
│   ├── ensemble_train.py
│   └── test_soft
│       ├── model_char_1102_0.7278.txt
│       ├── model_newgraph_1101_0.7474.txt
│       └── model_newgraph_2lr_1101_0.7459.txt
└── pics
    └── model.png
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 | 
6 | # C extensions
7 | *.so
8 | 
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | MANIFEST
27 | 
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 | 
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 | 
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 | .pytest_cache/
49 | 
50 | # Translations
51 | *.mo
52 | *.pot
53 | 
54 | # Django stuff:
55 | *.log
56 | local_settings.py
57 | db.sqlite3
58 | 
59 | # Flask stuff:
60 | instance/
61 | .webassets-cache
62 | 
63 | # Scrapy stuff:
64 | .scrapy
65 | 
66 | # Sphinx documentation
67 | docs/_build/
68 | 
69 | # PyBuilder
70 | target/
71 | 
72 | # Jupyter Notebook
73 | .ipynb_checkpoints
74 | 
75 | # pyenv
76 | .python-version
77 | 
78 | # celery beat schedule file
79 | celerybeat-schedule
80 | 
81 | # SageMath parsed files
82 | *.sage.py
83 | 
84 | # Environments
85 | .env
86 | .venv
87 | env/
88 | venv/
89 | ENV/
90 | env.bak/
91 | venv.bak/
92 | 
93 | # Spyder project settings
94 | .spyderproject
95 | .spyproject
96 | 
97 | # Rope project settings
98 | .ropeproject
99 | 
100 | # mkdocs documentation
101 | /site
102 | 
103 | # mypy
104 | .mypy_cache/
105 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2018 yuhaitao1994
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # AIchallenger_MachineReadingComprehension
2 | 8th place solution for the AI Challenger 2018 opinion-type question machine reading comprehension competition
3 | 
4 | ****
5 | 
6 | |Author|[yuhaitao](https://github.com/yuhaitao1994)|[little_white](https://github.com/faverous)|
7 | |---|---|---|
8 | 
9 | [Competition write-up]()
10 | ****
11 | 
12 | ## 1. Results
13 | |Model|Accuracy|
14 | |---|---|
15 | |baseline|72.36%|
16 | |test_A ensemble|76.39%|
17 | |best single model|75.13% (dev)|
18 | |test_B ensemble|77.33%|
19 | 
20 | 
21 | ## 2. Environment
22 | 
23 | |Environment / Library|Version|
24 | |---|---|
25 | |Ubuntu|16.04|
26 | |Python|>=3.5|
27 | |TensorFlow|>=1.6|
28 | 
29 | ## **3. baseline**
30 | 
31 | The baseline model borrows from Microsoft's R-Net; thanks to [HKUST-KnowComp](https://github.com/HKUST-KnowComp/R-Net) for their TensorFlow implementation.
32 | 
33 | Unlike R-Net, we drop the pointer network (ptrNet) at the tail of the model and replace it with a unidirectional GRU followed by a softmax layer.
34 | 
35 | ### How to run
36 | 
37 | Create a `file` directory and move the raw training, validation and test_A data into it.
38 | 
39 | Data preprocessing
40 | 
41 | python config.py --mode prepro
42 | 
43 | Training
44 | 
45 | python config.py --mode train
46 | 
47 | Evaluate on the validation set
48 | 
49 | python config.py --mode examine
50 | 
51 | Generate the test predictions
52 | 
53 | python config.py --mode test
54 | 
55 | 
56 | ## **4. best single model**
57 | 
58 | Our best-scoring single model is an improved R-Net that additionally uses the semantics of the alternatives and feature engineering.
59 | 
60 | **Alternatives semantics**: some candidate answers of opinion-type questions carry semantic information of their own, so we encode the candidate answers as well.
61 | 
62 | **Feature engineering**: we extract features with methods such as tf-idf and feed the feature vector to the deep model as an additional input, processed only by a linear layer. Because of the nature of reading-comprehension data this brings only a marginal improvement, and its code is not released.
63 | 
64 | ### Model architecture
65 | ![best single model](/pics/model.png)
66 | 
67 | ## **5. ensemble**
68 | 
69 | The final test_B submission fuses 16 models by stacking: the weight given to each model's softmax output is trained on the validation set. In principle this could overfit the validation set, but in our tests it did not.
70 | 
71 | In total we use three kinds of improved models, based on R-Net, QA-Net and BiDAF respectively.
72 | 
73 | ### How to run the ensemble
74 | 
75 | Train the ensemble weights
76 | 
77 | python ensemble_train.py
78 | 
79 | Predict the test_A results
80 | 
81 | python ensemble_predict.py
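82 | 
83 | ### Stacking sketch (illustrative)
84 | 
85 | The snippet below is only a minimal sketch of the stacking step described in section 5, not the repository's actual `ensemble_train.py`/`ensemble_predict.py`; the array shapes, the 3-option label encoding and every function and parameter name here are assumptions. The per-model softmax outputs appear to be stored under `ensemble/dev_soft/` and `ensemble/test_soft/`, and the sketch assumes they have already been loaded into NumPy arrays.
86 | 
87 | ```python
88 | import numpy as np
89 | 
90 | def train_stack_weights(dev_probs, dev_labels, lr=0.1, epochs=200):
91 |     """dev_probs: [n_models, n_examples, 3] softmax outputs on the dev set.
92 |     dev_labels: [n_examples] gold option index (0/1/2).
93 |     Learns one weight per model by minimizing the cross-entropy of the
94 |     weighted-sum distribution on the dev set, using plain gradient descent."""
95 |     w = np.zeros(dev_probs.shape[0])                   # unnormalized model weights
96 |     onehot = np.eye(3)[dev_labels]                     # [n_examples, 3]
97 |     for _ in range(epochs):
98 |         a = np.exp(w - w.max()); a /= a.sum()          # softmax over w -> mixture weights
99 |         mix = np.einsum("m,mnc->nc", a, dev_probs)     # fused distribution per example
100 |         g_mix = -(onehot / np.clip(mix, 1e-8, None)) / len(dev_labels)
101 |         g_a = np.einsum("nc,mnc->m", g_mix, dev_probs) # gradient w.r.t. the mixture weights
102 |         w -= lr * a * (g_a - np.dot(a, g_a))           # backprop through the softmax over w
103 |     a = np.exp(w - w.max())
104 |     return a / a.sum()
105 | 
106 | def stack_predict(test_probs, weights):
107 |     """test_probs: [n_models, n_examples, 3]; returns the fused option index per example."""
108 |     return np.einsum("m,mnc->nc", weights, test_probs).argmax(axis=1)
109 | ```
110 | 
111 | Weights learned this way on the dev outputs are then applied unchanged to the matching test outputs.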
--------------------------------------------------------------------------------
/baseline/DemoModel.py:
--------------------------------------------------------------------------------
1 | """
2 | AI Challenger opinion-type question machine reading comprehension
3 | 
4 | DemoModel.py: demo code that checks the model runs end to end
5 | 
6 | @author: yuhaitao
7 | """
8 | # -*- coding:utf-8 -*-
9 | 
10 | import tensorflow as tf
11 | import os
12 | import numpy as np
13 | import random
14 | from nn_func import cudnn_gru, native_gru, dot_attention, summ, dropout
15 | 
16 | os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # TensorFlow log verbosity
17 | os.environ["CUDA_VISIBLE_DEVICES"] = "1"
18 | 
19 | hidden = 75
20 | use_cudnn = False
21 | batch_size = 2
22 | learning_rate = 0.001
23 | emb_lr = 0.000001
24 | keep_prob = 0.7
25 | grad_clip = 5.0
26 | len_limit = 15
27 | 
28 | class DemoModel(object):
29 | 
30 | def __init__(self, word_mat, trainable=True, opt=True):
31 | # note: placeholders are the data entry points and must not be re-assigned inside the graph
32 | self.passage = tf.placeholder(tf.int32, [batch_size, None], name="passage")
33 | self.question = tf.placeholder(tf.int32, [batch_size, None], name="question")
34 | self.answer = tf.placeholder(tf.int32, [batch_size], name="answer")
35 | self.qa_id = tf.placeholder(tf.int32, [batch_size], name="qa_id")
36 | self.is_train = tf.placeholder(tf.bool, name="is_train")
37 | 
38 | self.global_step = tf.get_variable('global_step', shape=[], dtype=tf.int32,
39 | initializer=tf.constant_initializer(0), trainable=False)
40 | 
41 | self.word_mat = tf.get_variable("word_mat", initializer=tf.constant(
42 | word_mat, dtype=tf.float32), trainable=True)  # try out a trainable word embedding
43 | 
44 | self.c_mask = tf.cast(self.passage, tf.bool)
45 | self.q_mask = tf.cast(self.question, tf.bool)
46 | self.c_len = tf.reduce_sum(tf.cast(self.c_mask, tf.int32), axis=1)
47 | self.q_len = tf.reduce_sum(tf.cast(self.q_mask, tf.int32), axis=1)
48 | 
49 | if opt:
50 | self.c_maxlen = tf.reduce_max(self.c_len)
51 | self.q_maxlen = tf.reduce_max(self.q_len)
52 | self.c = tf.slice(self.passage, [0, 0], [batch_size, self.c_maxlen])
53 | self.q = tf.slice(self.question, [0, 0], [batch_size, self.q_maxlen])
54 | self.c_mask = tf.slice(self.c_mask, [0, 0], [
55 | batch_size, self.c_maxlen])
56 | self.q_mask = tf.slice(self.q_mask, [0, 0], [
57 | batch_size, self.q_maxlen])
58 | else:
59 | self.c_maxlen, self.q_maxlen = len_limit, len_limit  # this demo has no config object; fall back to the module-level len_limit
60 | 
61 | self.RNet()
62 | 
63 | if trainable:
64 | # use a separate learning rate for the embedding layer
65 | self.emb_lr = tf.get_variable("emb_lr", shape=[], dtype=tf.float32, trainable=False)
66 | self.learning_rate = tf.get_variable(
67 | "learning_rate", shape=[], dtype=tf.float32, trainable=False)
68 | self.emb_opt = tf.train.AdamOptimizer(learning_rate=self.emb_lr, epsilon=1e-8)
69 | self.opt = tf.train.AdamOptimizer(
70 | learning_rate=self.learning_rate, epsilon=1e-8)
71 | # split the trainable variables into two lists
72 | self.var_list = tf.trainable_variables()
73 | var_list1 = []
74 | var_list2 = []
75 | for var in self.var_list:
76 | if var.op.name == "word_mat":
77 | var_list1.append(var)
78 | else:
79 | var_list2.append(var)
80 | 
81 | grads = tf.gradients(self.loss, var_list1 + var_list2)
82 | # grads = self.opt.compute_gradients(self.loss)
83 | # gradients, variables = zip(*grads)
84 | capped_grads, _ = tf.clip_by_global_norm(
85 | grads, grad_clip)
86 | grads1 = capped_grads[:len(var_list1)]
87 | grads2 = capped_grads[len(var_list1):]
88 | self.train_op1 = self.emb_opt.apply_gradients(
89 | zip(grads1, var_list1), global_step=self.global_step)
90 | self.train_op2 = self.opt.apply_gradients(zip(grads2, var_list2))
91 | self.train_op = tf.group(self.train_op1, self.train_op2)
92 | 
93 | def RNet(self):
94 | PL, QL, d = self.c_maxlen, self.q_maxlen, hidden
95 | gru = cudnn_gru if use_cudnn else native_gru
96 | 
97 | with tf.variable_scope("embedding"):
98 | with tf.name_scope("word"):
99 | c_emb = tf.nn.embedding_lookup(self.word_mat, self.c)
100 | q_emb = tf.nn.embedding_lookup(self.word_mat, self.q)
101 | 
102 | with tf.variable_scope("encoding"):
103 | rnn = gru(num_layers=3, num_units=d, batch_size=batch_size, input_size=c_emb.get_shape(
104 | ).as_list()[-1], keep_prob=keep_prob, is_train=self.is_train)
105 | c = rnn(c_emb, seq_len=self.c_len)
106 | q = rnn(q_emb, seq_len=self.q_len)
107 | 
108 | with tf.variable_scope("attention"):
109 | qc_att = dot_attention(inputs=c, memory=q, mask=self.q_mask, hidden=d,
110 | keep_prob=keep_prob, is_train=self.is_train)
111 | rnn = gru(num_layers=1, num_units=d, batch_size=batch_size, input_size=qc_att.get_shape(
112 | ).as_list()[-1], keep_prob=keep_prob, is_train=self.is_train)
113 | att = rnn(qc_att, seq_len=self.c_len)
114 | print(att.get_shape().as_list())
115 | 
116 | 
with tf.variable_scope("match"): 117 | self_att = dot_attention( 118 | att, att, mask=self.c_mask, hidden=d, keep_prob=keep_prob, is_train=self.is_train) 119 | rnn = gru(num_layers=1, num_units=d, batch_size=batch_size, input_size=self_att.get_shape( 120 | ).as_list()[-1], keep_prob=keep_prob, is_train=self.is_train) 121 | # match:[batch_size, c_maxlen, 6*hidden] 122 | match = rnn(self_att, seq_len=self.c_len) 123 | print(match.get_shape().as_list()) 124 | 125 | with tf.variable_scope("YesNo_classification"): 126 | init = summ(q[:, :, -2 * d:], d, mask=self.q_mask, 127 | keep_prob=keep_prob, is_train=self.is_train) 128 | print(init.get_shape().as_list()) 129 | match = dropout(match, keep_prob=keep_prob, 130 | is_train=self.is_train) 131 | final_hiddens = init.get_shape().as_list()[-1] 132 | final_gru = tf.contrib.rnn.GRUCell(final_hiddens) 133 | _, final_state = tf.nn.dynamic_rnn( 134 | final_gru, match, initial_state=init, dtype=tf.float32) 135 | final_w = tf.get_variable(name="final_w", shape=[final_hiddens, 2]) 136 | final_b = tf.get_variable(name="final_b", shape=[ 137 | 2], initializer=tf.constant_initializer(0.)) 138 | self.logits = tf.matmul(final_state, final_w) 139 | self.logits = tf.nn.bias_add(self.logits, final_b) # logits:[batch_size, 3] 140 | 141 | with tf.variable_scope("softmax_and_loss"): 142 | final_softmax = tf.nn.softmax(self.logits) 143 | self.classes = tf.cast( 144 | tf.argmax(final_softmax, axis=1), dtype=tf.int32, name="classes") 145 | self.loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits( 146 | logits=self.logits, labels=tf.stop_gradient(self.answer))) 147 | 148 | def get_loss(self): 149 | return self.loss 150 | 151 | def get_global_step(self): 152 | return self.global_step 153 | 154 | def get_bacth(examples, word2idx_dict, batch_size): 155 | """ 156 | 获取mini-batch 157 | """ 158 | passages = [] 159 | questions = [] 160 | answers = [] 161 | qa_ids = [] 162 | for i in range(batch_size): 163 | passage_id = [] 164 | question_id = [] 165 | passage = examples[i]["passage"] 166 | for j in range(15): 167 | if (j + 1) <= len(passage): 168 | p = passage[j] 169 | passage_id.append(word2idx_dict[p]) 170 | else: 171 | passage_id.append(0) 172 | question = examples[i]["question"] 173 | for j in range(15): 174 | if (j + 1) <= len(question): 175 | q = question[j] 176 | question_id.append(word2idx_dict[q]) 177 | else: 178 | question_id.append(0) 179 | answer = examples[i]["answer"] 180 | qa_id = examples[i]["qa_id"] 181 | passages.append(passage_id) 182 | questions.append(question_id) 183 | answers.append(answer) 184 | qa_ids.append(qa_id) 185 | passages = np.array(passages).astype(np.int32) 186 | questions = np.array(questions) 187 | answers = np.array(answers) 188 | qa_ids = np.array(qa_ids) 189 | return passages, questions, answers, qa_ids 190 | 191 | 192 | def main(_): 193 | """ 194 | 测试模型的demo 195 | """ 196 | train_examples = [ 197 | { 198 | "passage": ['苹果', '是', '甜的', '它', '是', '硬的'], 199 | "question":['苹果', '是', '甜的', '吗'], 200 | "answer":0, 201 | "qa_id":1}, 202 | { 203 | "passage": ['橘子', '是', '酸的', '它', '是', '软的', '也是', '好吃的'], 204 | "question":['橘子', '是', '甜的', '吗'], 205 | "answer":1, 206 | "qa_id":2}, 207 | { 208 | "passage": ['梨', '是', '甜的', '它', '是', '硬的'], 209 | "question":['梨', '是', '软的', '吗'], 210 | "answer":1, 211 | "qa_id":3}, 212 | { 213 | "passage": ['西瓜', '是', '甜的', '它', '是', '硬的', '也是', '大的', '和', '圆的'], 214 | "question":['西瓜', '是', '酸的', '吗'], 215 | "answer":2, 216 | "qa_id":4} 217 | ] 218 | 219 | dev_examples = [ 220 | { 221 | 
"passage": ['葡萄', '是', '甜的', '它', '是', '软的'], 222 | "question":['葡萄', '是', '硬的', '吗'], 223 | "answer":1, 224 | "qa_id":5}, 225 | { 226 | "passage": ['香蕉', '是', '甜的', '它', '是', '软的', '也是', '好吃的'], 227 | "question":['香蕉', '是', '好吃的', '吗'], 228 | "answer":0, 229 | "qa_id":6} 230 | ] 231 | 232 | train_2_examples = [ 233 | { 234 | "passage": ['苹果'], 235 | "question": ['苹果'], 236 | "answer":0, 237 | "qa_id":7}, 238 | { 239 | "passage": ['梨'], 240 | "question": ['西瓜'], 241 | "answer":1, 242 | "qa_id":8}, 243 | { 244 | "passage": ['葡萄', '香蕉'], 245 | "question": ['葡萄', '香蕉'], 246 | "answer":0, 247 | "qa_id":9}, 248 | { 249 | "passage": ['西瓜', '橘子'], 250 | "question": ['甜的'], 251 | "answer":1, 252 | "qa_id":10}, 253 | ] 254 | 255 | dev_2_examples = [ 256 | { 257 | "passage": ['橘子'], 258 | "question": ['苹果'], 259 | "answer":1, 260 | "qa_id":11}, 261 | { 262 | "passage": ['梨', '西瓜'], 263 | "question": ['梨', '西瓜'], 264 | "answer":0, 265 | "qa_id":12}, 266 | ] 267 | 268 | word2idx_dict = {"null":0,"苹果":1,"梨":2,"西瓜":3,"葡萄":4,"香蕉":5,"橘子":6,"甜的":7,"酸的":8,"硬的":9,\ 269 | "软的":10,"大的":11,"圆的":12,"好吃的":13,"是":14,"也是":15,"它":16,"和":17,"吗":18} 270 | """ 271 | id2vec = { 272 | 0:[0.0,0.0,0.0,0.0], 273 | 1:[0.1,0.1,0.1,0.1], 274 | 2:[0.1,0.2,0.1,0.2], 275 | 3:[0.2,0.1,0.3,0.2], 276 | 4:[0.4,0.2,0.3,0.4], 277 | 5:[0.4,0.4,0.4,0.4], 278 | 6:[0.5,0.4,0.5,0.5], 279 | 7:[0.6,0.5,0.5,0.6], 280 | 8:[0.5,0.7,0.6,0.5], 281 | 9:[0.7,0.6,0.5,0.5], 282 | 10:[0.8,0.5,0.7,0.6], 283 | 11:[0.8,0.6,0.6,0.6], 284 | 12:[0.6,0.8,0.8,0.5], 285 | 13:[0.5,0.8,0.8,0.6], 286 | 14:[0.9,0.9,0.9,0.9], 287 | 15:[0.9,0.9,0.8,0.8], 288 | 17:[0.9,0.8,0.7,0.6], 289 | 18:[0.9,0.5,0.6,0.7] 290 | } 291 | """ 292 | 293 | id2vec = [ 294 | [0.0,0.0,0.0,0.0], 295 | [0.05,0.05,0.05,0.05], 296 | [0.1,0.1,0.1,0.1], 297 | [0.15,0.15,0.15,0.15], 298 | [0.2,0.2,0.2,0.2], 299 | [0.25,0.25,0.25,0.25], 300 | [0.3,0.3,0.3,0.3], 301 | [0.35,0.35,0.35,0.35], 302 | [0.4,0.4,0.4,0.4], 303 | [0.45,0.46,0.45,0.45], 304 | [0.5,0.5,0.5,0.5], 305 | [0.55,0.55,0.55,0.55], 306 | [0.6,0.6,0.6,0.6], 307 | [0.65,0.65,0.65,0.65], 308 | [0.7,0.7,0.7,0.7], 309 | [0.75,0.75,0.75,0.75], 310 | [0.8,0.8,0.8,0.8], 311 | [0.85,0.85,0.85,0.85] 312 | ] 313 | 314 | print("Building model...") 315 | word_mat = np.array(id2vec) 316 | model = DemoModel(word_mat) 317 | 318 | sess_config = tf.ConfigProto(allow_soft_placement=True) 319 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 320 | sess_config.gpu_options.allow_growth = True 321 | 322 | with tf.Session(config=sess_config) as sess: 323 | sess.run(tf.global_variables_initializer()) 324 | sess.run(tf.assign(model.learning_rate, 325 | tf.constant(learning_rate, dtype=tf.float32))) 326 | sess.run(tf.assign(model.emb_lr, tf.constant(emb_lr, dtype=tf.float32))) 327 | 328 | dev_p, dev_q, dev_a, dev_id = get_bacth(dev_2_examples, word2idx_dict, batch_size) 329 | 330 | def get_acc(outputs, targets): 331 | t = 0 332 | for i in range(len(outputs)): 333 | if outputs[i] == targets[i]: 334 | t += 1 335 | return (t / len(outputs)) * 1.0 336 | 337 | for i in range(10): 338 | global_step = sess.run(model.global_step) + 1 339 | random.shuffle(train_examples) 340 | train_p, train_q, train_a, train_id = get_bacth(train_2_examples, word2idx_dict, batch_size) 341 | # train 342 | feed = {model.passage: train_p, model.question: train_q, model.answer: train_a, model.qa_id: train_id, model.is_train:True} 343 | train_loss, train_op, t_classes, t_id = sess.run([model.loss, model.train_op, model.classes, model.qa_id], feed_dict=feed) 344 | # dev 345 | 
feed2 = {model.passage: dev_p, model.question: dev_q, model.answer: dev_a, model.qa_id: dev_id, model.is_train:False} 346 | dev_loss, d_classes, d_id = sess.run([model.loss, model.classes, model.qa_id], feed_dict=feed2) 347 | # 输出 348 | train_acc = get_acc(t_classes, train_a) 349 | dev_acc = get_acc(d_classes, dev_a) 350 | if (i + 1) % 1 == 0: 351 | print("steps:{},train_loss:{:.4f},train_acc:{:.4f},dev_loss:{:.4f},dev_acc:{:.4f}"\ 352 | .format(global_step, train_loss, train_acc, dev_loss, dev_acc)) 353 | for j in range(2): 354 | print("dev_id:{},answer:{},my_answer:{}".format(dev_id[j],dev_a[j],d_classes[j])) 355 | 356 | 357 | if __name__ == '__main__': 358 | tf.app.run() 359 | 360 | -------------------------------------------------------------------------------- /baseline/config.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | config.py:配置文件,程序运行入口 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import os 10 | import tensorflow as tf 11 | 12 | import data_process 13 | from main import train, test, dev 14 | from file_save import * 15 | from examine_dev import examine_dev 16 | 17 | flags = tf.flags 18 | os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" 19 | 20 | train_file = os.path.join("file", "ai_challenger_oqmrc_trainingset.json") 21 | dev_file = os.path.join("file", "ai_challenger_oqmrc_validationset.json") 22 | test_file = os.path.join("file", "ai_challenger_oqmrc_testa.json") 23 | ''' 24 | train_file = os.path.join("file", "train_demo.json") 25 | dev_file = os.path.join("file", "val_demo.json") 26 | test_file = os.path.join("file", "test_demo.json")''' 27 | 28 | target_dir = "data" 29 | log_dir = "log/event" 30 | save_dir = "log/model" 31 | prediction_dir = "log/prediction" 32 | train_record_file = os.path.join(target_dir, "train.tfrecords") 33 | dev_record_file = os.path.join(target_dir, "dev.tfrecords") 34 | test_record_file = os.path.join(target_dir, "test.tfrecords") 35 | id2vec_file = os.path.join(target_dir, "id2vec.json") # id号->向量 36 | word2id_file = os.path.join(target_dir, "word2id.json") # 词->id号 37 | train_eval = os.path.join(target_dir, "train_eval.json") 38 | dev_eval = os.path.join(target_dir, "dev_eval.json") 39 | test_eval = os.path.join(target_dir, "test_eval.json") 40 | 41 | if not os.path.exists(target_dir): 42 | os.makedirs(target_dir) 43 | if not os.path.exists(log_dir): 44 | os.makedirs(log_dir) 45 | if not os.path.exists(save_dir): 46 | os.makedirs(save_dir) 47 | if not os.path.exists(prediction_dir): 48 | os.makedirs(prediction_dir) 49 | 50 | flags.DEFINE_string("mode", "train", "train/debug/test") 51 | flags.DEFINE_string("gpu", "0", "0/1") 52 | flags.DEFINE_string("experiment", "lalala", "每次存不同模型分不同的文件夹") 53 | flags.DEFINE_string("model_name", "default", "选取不同的模型") 54 | 55 | flags.DEFINE_string("target_dir", target_dir, "") 56 | flags.DEFINE_string("log_dir", log_dir, "") 57 | flags.DEFINE_string("save_dir", save_dir, "") 58 | flags.DEFINE_string("prediction_dir", prediction_dir, "") 59 | flags.DEFINE_string("train_file", train_file, "") 60 | flags.DEFINE_string("dev_file", dev_file, "") 61 | flags.DEFINE_string("test_file", test_file, "") 62 | 63 | flags.DEFINE_string("train_record_file", train_record_file, "") 64 | flags.DEFINE_string("dev_record_file", dev_record_file, "") 65 | flags.DEFINE_string("test_record_file", test_record_file, "") 66 | flags.DEFINE_string("train_eval_file", train_eval, "") 67 | flags.DEFINE_string("dev_eval_file", dev_eval, "") 68 | 
flags.DEFINE_string("test_eval_file", test_eval, "") 69 | flags.DEFINE_string("word2id_file", word2id_file, "") 70 | flags.DEFINE_string("id2vec_file", id2vec_file, "") 71 | 72 | flags.DEFINE_integer("para_limit", 150, "Limit length for paragraph") 73 | flags.DEFINE_integer("ques_limit", 30, "Limit length for question") 74 | flags.DEFINE_integer("min_count", 1, "embedding 的最小出现次数") 75 | flags.DEFINE_integer("embedding_size", 300, "the dimension of vector") 76 | 77 | flags.DEFINE_integer("capacity", 15000, "Batch size of dataset shuffle") 78 | flags.DEFINE_integer("num_threads", 4, "Number of threads in input pipeline") 79 | # 使用cudnn训练,提升6倍速度 80 | flags.DEFINE_boolean("use_cudnn", True, "Whether to use cudnn (only for GPU)") 81 | flags.DEFINE_boolean("is_bucket", False, "Whether to use bucketing") 82 | 83 | flags.DEFINE_integer("batch_size", 64, "Batch size") 84 | flags.DEFINE_integer("num_steps", 250000, "Number of steps") 85 | flags.DEFINE_integer("checkpoint", 1000, "checkpoint for evaluation") 86 | flags.DEFINE_integer("period", 500, "period to save batch loss") 87 | flags.DEFINE_integer("val_num_batches", 150, "Num of batches for evaluation") 88 | flags.DEFINE_float("init_learning_rate", 0.001, 89 | "Initial learning rate for Adam") 90 | flags.DEFINE_float("init_emb_lr", 0., "") 91 | flags.DEFINE_float("keep_prob", 0.7, "Keep prob in rnn") 92 | flags.DEFINE_float("grad_clip", 5.0, "Global Norm gradient clipping rate") 93 | flags.DEFINE_integer("hidden", 60, "Hidden size") # best:128 94 | flags.DEFINE_integer("patience", 5, "Patience for learning rate decay") 95 | flags.DEFINE_string("optimizer", "Adam", "") 96 | flags.DEFINE_string("loss_function", "default", "") 97 | flags.DEFINE_boolean("use_dropout", True, "") 98 | 99 | 100 | def main(_): 101 | config = flags.FLAGS 102 | os.environ["CUDA_VISIBLE_DEVICES"] = config.gpu # 选择一块gpu 103 | if config.mode == "train": 104 | train(config) 105 | elif config.mode == "prepro": 106 | data_process.prepro(config) 107 | elif config.mode == "debug": 108 | config.num_steps = 2 109 | config.val_num_batches = 1 110 | config.checkpoint = 1 111 | config.period = 1 112 | train(config) 113 | elif config.mode == "test": 114 | test(config) 115 | elif config.mode == "examine": 116 | examine_dev(config) 117 | elif config.mode == "save_dev": 118 | save_dev(config) 119 | elif config.mode == "save_test": 120 | save_test(config) 121 | else: 122 | print("Unknown mode") 123 | exit(0) 124 | 125 | 126 | if __name__ == "__main__": 127 | tf.app.run() 128 | -------------------------------------------------------------------------------- /baseline/data_process.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | data_process.py:数据预处理代码 5 | 6 | @author: haomaojie 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import pandas as pd 10 | import time 11 | import json 12 | import jieba 13 | import csv 14 | import word2vec 15 | import re 16 | import random 17 | import tensorflow as tf 18 | import numpy as np 19 | from tqdm import tqdm # 进度条 20 | import os 21 | import gensim 22 | 23 | 24 | def read_data(json_path, output_path, line_count): 25 | ''' 26 | 读取json文件并转成Dataframe 27 | ''' 28 | start_time = time.time() 29 | data = [] 30 | with open(json_path, 'r') as f: 31 | for i in range(line_count): 32 | data_list = json.loads(f.readline()) 33 | data.append([data_list['passage'], data_list['query']]) 34 | df = pd.DataFrame(data, columns=['passage', 'query']) 35 | df.to_csv(output_path, index=False) 36 | 
print('转化成功,已生成csv文件') 37 | end_time = time.time() 38 | print(end_time - start_time) 39 | 40 | 41 | def de_word(data_path, out_path): 42 | ''' 43 | 分词 44 | ''' 45 | start_time = time.time() 46 | word = [] 47 | data_file = open(data_path).read().split('\n') 48 | for i in range(len(data_file)): 49 | result = [] 50 | seg_list = jieba.cut(data_file[i]) 51 | for w in seg_list: 52 | result.append(w) 53 | word.append(result) 54 | print('分词完成') 55 | with open(out_path, 'w+') as txt_write: 56 | for i in range(len(word)): 57 | s = str(word[i]).replace( 58 | '[', '').replace(']', '') # 去除[],这两行按数据不同,可以选择 59 | s = s.replace("'", '').replace(',', '') + \ 60 | '\n' # 去除单引号,逗号,每行末尾追加换行符 61 | txt_write.write(s) 62 | print('保存成功') 63 | end_time = time.time() 64 | print(end_time - start_time) 65 | 66 | 67 | def word_vec(file_txt, file_bin, min_count, size): 68 | word2vec.word2vec(file_txt, file_bin, min_count=min_count, 69 | size=size, verbose=True) 70 | 71 | 72 | def merge_csv(target_dir, output_file): 73 | for inputfile in [os.path.join(target_dir, 'train_oridata.csv'), 74 | os.path.join(target_dir, 'test_oridata.csv'), os.path.join(target_dir, 'validation_oridata.csv')]: 75 | data = pd.read_csv(inputfile) 76 | df = pd.DataFrame(data) 77 | df.to_csv(output_file, mode='a', index=False) 78 | 79 | # 词转id,id转向量 80 | 81 | 82 | def transfer(model_path, embedding_size): 83 | start_time = time.time() 84 | model = word2vec.load(model_path) 85 | word2id_dic = {} 86 | init_0 = [0.0 for i in range(embedding_size)] 87 | id2vec_dic = [init_0] 88 | for i in range(len(model.vocab)): 89 | id = i + 1 90 | word2id_dic[model.vocab[i]] = id 91 | id2vec_dic.append(model[model.vocab[i]].tolist()) 92 | end_time = time.time() 93 | print('词转id,id转向量完成') 94 | print(end_time - start_time) 95 | return word2id_dic, id2vec_dic 96 | 97 | 98 | def transfer_txt(model_path, embedding_size): 99 | print("开始转换...") 100 | start_time = time.time() 101 | model = gensim.models.KeyedVectors.load_word2vec_format( 102 | model_path, binary=False) 103 | word_dic = model.wv.vocab 104 | word2id_dic = {} 105 | init_0 = [0.0 for i in range(embedding_size)] 106 | id2vec_dic = [init_0] 107 | id = 1 108 | for i in word_dic: 109 | word2id_dic[i] = id 110 | id2vec_dic.append(model[i].tolist()) 111 | id += 1 112 | end_time = time.time() 113 | print('词转id,id转向量完成') 114 | print(end_time - start_time) 115 | return word2id_dic, id2vec_dic 116 | 117 | # 存入json文件 118 | 119 | 120 | def save_json(output_path, dic_data, message=None): 121 | start_time = time.time() 122 | if message is not None: 123 | print("Saving {}...".format(message)) 124 | with open(output_path, "w") as fh: 125 | json.dump(dic_data, fh, ensure_ascii=False, indent=4) 126 | print('保存完成') 127 | end_time = time.time() 128 | print(end_time - start_time) 129 | 130 | # 将原文中的passage,query,alternative,answer,query_id转成id号 131 | # 输入参数为词典的位置和训练集的位置 132 | 133 | 134 | def TrainningsetProcess(dic_url, dataset_url, passage_len_limit): 135 | res = [] # 最后返回的结果 136 | rule = re.compile(r'\|') 137 | id2alternatives = {} 138 | # 读取字典 139 | with open(dic_url, 'r', encoding='utf-8') as dic_file: 140 | dic = dict() 141 | dic = json.load(dic_file) 142 | # 读取训练集 143 | over_limit = 0 144 | with open(dataset_url, 'r', encoding='utf-8') as ts_file: 145 | for file_line in ts_file: 146 | line = json.loads(file_line) # 读取一行json文件 147 | this_line_res = dict() # 变量定义,代表这一行映射之后的结果 148 | passage = line['passage'] 149 | alternatives = line['alternatives'] 150 | query = line['query'] 151 | if dataset_url.find('test') == -1: 152 | 
answer = line['answer'] 153 | query_idx = line['query_id'] 154 | 155 | # 用jieba将passage和query分词,lcut返回list 156 | passage_cut = jieba.lcut(passage, cut_all=False) 157 | query_cut = jieba.lcut(query, cut_all=False) 158 | 159 | # 用词典将passage和query映射到id 160 | passage_id = [] 161 | query_id = [] 162 | for each_passage_word in passage_cut: 163 | passage_id.append(dic.get(each_passage_word)) 164 | for each_query_word in query_cut: 165 | query_id.append(dic.get(each_query_word)) 166 | 167 | # 对选项进行排序 168 | alternatives_cut = re.split(rule, alternatives) 169 | alternatives_cut = [s.strip() for s in alternatives_cut] 170 | tmp = [0, 0, 0] 171 | 172 | # 选项少于三个 173 | if len(alternatives_cut) == 1: 174 | alternatives_cut.append(alternatives_cut[0]) 175 | alternatives_cut.append(alternatives_cut[0]) 176 | if len(alternatives_cut) == 2: 177 | alternatives_cut.append(alternatives_cut[0]) 178 | 179 | # 跳过无效数据(135条) 180 | if alternatives.find("无法") == -1 and alternatives.find("不确定") == -1: 181 | if dataset_url.find('test') != -1: 182 | tmp[0] = alternatives_cut[0] 183 | tmp[1] = alternatives_cut[1] 184 | tmp[2] = alternatives_cut[2] 185 | else: 186 | print(1) 187 | continue 188 | if alternatives.count("无法确定") > 1 or alternatives.count("没") > 1: 189 | if dataset_url.find('test') != -1: 190 | tmp[0] = alternatives_cut[0] 191 | tmp[1] = alternatives_cut[1] 192 | tmp[2] = alternatives_cut[2] 193 | else: 194 | print(2) 195 | continue # 第64772条数据 196 | if alternatives.find("没") != -1 and alternatives.find("不") != -1 and alternatives.find("不确定") == -1: 197 | print(3) 198 | continue # 第144146条数据 199 | if "不确定" in alternatives_cut and "无法确定" in alternatives_cut: 200 | tmp[0] = "确定" 201 | tmp[1] = "不确定" 202 | tmp[2] = "无法确定" 203 | # 肯定/否定/无法确定 204 | elif alternatives.find("不") != -1 or alternatives.find("没") != -1: 205 | if alternatives.count("不") == 1 and alternatives.find("不确定") != -1: 206 | alternatives_cut.remove("不确定") 207 | alternatives_cut.append("不确定") 208 | tmp[0] = alternatives_cut[0] 209 | tmp[1] = alternatives_cut[1] 210 | tmp[2] = alternatives_cut[2] 211 | elif alternatives.count("不") > 1: 212 | if alternatives.find("不确定") == -1: 213 | if dataset_url.find("test") != -1: 214 | tmp[0] = alternatives_cut[0] 215 | tmp[1] = alternatives_cut[1] 216 | tmp[2] = alternatives_cut[2] 217 | else: 218 | print(line) 219 | continue 220 | else: 221 | alternatives_cut.remove("不确定") 222 | if alternatives_cut[0].find("不") != -1: 223 | tmp[1] = alternatives_cut[0] 224 | tmp[0] = alternatives_cut[1] 225 | else: 226 | tmp[1] = alternatives_cut[1] 227 | tmp[0] = alternatives_cut[0] 228 | alternatives_cut.append("不确定") 229 | tmp[2] = alternatives_cut[2] 230 | else: 231 | for tmp_alternatives in alternatives_cut: 232 | if tmp_alternatives.find("无法") != -1: 233 | tmp[2] = tmp_alternatives 234 | elif tmp_alternatives.find("不") != -1 or tmp_alternatives.find("没") != -1: 235 | tmp[1] = tmp_alternatives 236 | else: 237 | tmp[0] = tmp_alternatives 238 | # 无明显肯定与否定词义 239 | else: 240 | for tmp_alternatives in alternatives_cut: 241 | if tmp_alternatives.find("无法") != -1 or alternatives.find("不确定") != -1: 242 | alternatives_cut.remove(tmp_alternatives) 243 | alternatives_cut.append(tmp_alternatives) 244 | break 245 | tmp[0] = alternatives_cut[0] 246 | tmp[1] = alternatives_cut[1] 247 | tmp[2] = alternatives_cut[2] 248 | 249 | # 根据tmp列表生成answer_id 250 | if dataset_url.find('test') == -1: 251 | answer_id = tmp.index(answer.strip()) 252 | # 得到这一行映射后的结果,是dict类型的数据 253 | if len(passage_id) > passage_len_limit: 254 | passage_id = 
passage_id[:passage_len_limit] 255 | over_limit += 1 256 | this_line_res['passage'] = passage_id 257 | this_line_res['query'] = query_id 258 | this_line_res['alternatives'] = tmp 259 | if dataset_url.find('test') == -1: 260 | this_line_res['answer'] = answer_id 261 | this_line_res['query_id'] = query_idx 262 | # 创建query_id到alternatives的字典,保存为json 263 | id2alternatives[query_idx] = tmp 264 | res.append(this_line_res) 265 | print(len(res)) 266 | print("over_limit:{}".format(over_limit)) 267 | return res, id2alternatives 268 | 269 | 270 | def data_process(config): 271 | target_dir = config.target_dir 272 | # 这里如果使用自己训练好的词向量就可以注释掉 273 | ''' 274 | read_data(config.train_file, os.path.join( 275 | target_dir, 'train_oridata.csv'), 250000) # 250000 276 | read_data(config.test_file, os.path.join( 277 | target_dir, 'test_oridata.csv'), 10000) # 10000 278 | read_data(config.dev_file, os.path.join( 279 | target_dir, 'validation_oridata.csv'), 30000) # 30000 280 | merge_csv(target_dir, os.path.join(target_dir, 'ori_data.csv')) 281 | de_word(os.path.join(target_dir, 'ori_data.csv'), 282 | os.path.join(target_dir, 'seg_list.txt')) 283 | word_vec(os.path.join(target_dir, 'seg_list.txt'), 284 | os.path.join(target_dir, 'seg_listWord2Vec.bin'), config.min_count, config.embedding_size) 285 | # 如果是用外部词向量,从这里开始 286 | # word2id_dic, id2vec_dic = transfer_txt( 287 | # os.path.join(target_dir, 'baidu_300_wc+ng_sgns.baidubaike.bigram-char.txt'), config.embedding_size) 288 | word2id_dic, id2vec_dic = transfer( 289 | os.path.join(target_dir, 'seg_listWord2Vec.bin'), config.embedding_size) 290 | save_json(config.word2id_file, word2id_dic, "word to id") 291 | save_json(config.id2vec_file, id2vec_dic, "id to vec") 292 | ''' 293 | train_examples, train_id2alternatives = TrainningsetProcess( 294 | config.word2id_file, config.train_file, config.para_limit) 295 | test_examples, test_id2alternatives = TrainningsetProcess( 296 | config.word2id_file, config.test_file, config.para_limit) 297 | validation_examples, validation_id2alternatives = TrainningsetProcess( 298 | config.word2id_file, config.dev_file, config.para_limit) 299 | save_json(config.train_eval_file, train_id2alternatives, 300 | message='保存train每条数据的alternatives') 301 | save_json(config.test_eval_file, test_id2alternatives, 302 | message='保存test每条数据的alternatives') 303 | save_json(config.dev_eval_file, validation_id2alternatives, 304 | message='保存validation每条数据的alternatives') 305 | return train_examples, test_examples, validation_examples 306 | 307 | 308 | def build_features(config, examples, data_type, out_file, is_test=False): 309 | """ 310 | 将数据读入TFrecords 311 | """ 312 | 313 | para_limit = config.para_limit 314 | ques_limit = config.ques_limit 315 | 316 | print("Processing {} examples...".format(data_type)) 317 | writer = tf.python_io.TFRecordWriter(out_file) 318 | total = 0 319 | meta = {} 320 | random.shuffle(examples) # 先给打乱顺序 321 | for example in tqdm(examples): 322 | total += 1 323 | passage_idxs = np.zeros([para_limit], dtype=np.int32) 324 | question_idxs = np.zeros([ques_limit], dtype=np.int32) 325 | 326 | for i, token in enumerate(example["passage"]): 327 | if token == None: 328 | passage_idxs[i] = 0 329 | else: 330 | passage_idxs[i] = token 331 | for i, token in enumerate(example["query"]): 332 | if token == None: 333 | question_idxs[i] = 0 334 | else: 335 | question_idxs[i] = token 336 | # print(passage_idxs) 337 | # print(example["passage"]) 338 | if not is_test: 339 | record = tf.train.Example(features=tf.train.Features(feature={ 340 | 
"passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 341 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 342 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["answer"]])), 343 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"]])) 344 | })) 345 | else: 346 | record = tf.train.Example(features=tf.train.Features(feature={ 347 | "passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 348 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 349 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(-1)])), 350 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"]])) 351 | })) 352 | # print(record) 353 | writer.write(record.SerializeToString()) 354 | print("Build {} instances of features in total".format(total)) 355 | writer.close() 356 | 357 | 358 | def prepro(config): 359 | """ 360 | 数据预处理函数 361 | """ 362 | train_examples, test_examples, dev_examples = data_process(config) 363 | ''' 364 | print(train_examples) 365 | print(test_examples) 366 | print(dev_examples) 367 | print(word2id_dict) 368 | ''' 369 | # train: 249778, test: 10000, dev: 29968 370 | # train: 439, test: 18, dev: 48 371 | 372 | build_features(config, train_examples, "train", config.train_record_file) 373 | build_features(config, dev_examples, "dev", config.dev_record_file) 374 | build_features(config, test_examples, "test", 375 | config.test_record_file, is_test=True) 376 | 377 | print("done!!!") 378 | -------------------------------------------------------------------------------- /baseline/data_process_aug.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | data_process_aug.py:数据预处理代码(数据增强) 5 | 6 | @author: haomaojie 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import pandas as pd 10 | import time 11 | import json 12 | import jieba 13 | import csv 14 | import word2vec 15 | import re 16 | import tensorflow as tf 17 | import numpy as np 18 | from tqdm import tqdm # 进度条 19 | import os 20 | import random 21 | #import gensim 22 | 23 | 24 | def read_data(json_path, output_path, line_count): 25 | ''' 26 | 读取json文件并转成Dataframe 27 | ''' 28 | start_time = time.time() 29 | data = [] 30 | with open(json_path, 'r') as f: 31 | for i in range(line_count): 32 | data_list = json.loads(f.readline()) 33 | data.append([data_list['passage'], data_list['query']]) 34 | df = pd.DataFrame(data, columns=['passage', 'query']) 35 | df.to_csv(output_path, index=False) 36 | print('转化成功,已生成csv文件') 37 | end_time = time.time() 38 | print(end_time - start_time) 39 | 40 | 41 | def de_word(data_path, out_path): 42 | ''' 43 | 分词 44 | ''' 45 | start_time = time.time() 46 | word = [] 47 | data_file = open(data_path).read().split('\n') 48 | for i in range(len(data_file)): 49 | result = [] 50 | seg_list = jieba.cut(data_file[i]) 51 | for w in seg_list: 52 | result.append(w) 53 | word.append(result) 54 | print('分词完成') 55 | with open(out_path, 'w+') as txt_write: 56 | for i in range(len(word)): 57 | s = str(word[i]).replace( 58 | '[', '').replace(']', '') # 去除[],这两行按数据不同,可以选择 59 | s = s.replace("'", '').replace(',', '') + \ 60 | '\n' # 去除单引号,逗号,每行末尾追加换行符 61 | txt_write.write(s) 62 | print('保存成功') 63 | end_time = time.time() 64 | print(end_time - start_time) 65 | 66 | 67 | def word_vec(file_txt, file_bin, min_count, 
size): 68 | word2vec.word2vec(file_txt, file_bin, min_count=min_count, 69 | size=size, verbose=True) 70 | 71 | 72 | def merge_csv(target_dir, output_file): 73 | for inputfile in [os.path.join(target_dir, 'train_oridata.csv'), 74 | os.path.join(target_dir, 'test_oridata.csv'), os.path.join(target_dir, 'validation_oridata.csv')]: 75 | data = pd.read_csv(inputfile) 76 | df = pd.DataFrame(data) 77 | df.to_csv(output_file, mode='a', index=False) 78 | 79 | # 词转id,id转向量 80 | 81 | 82 | def transfer(model_path, embedding_size): 83 | start_time = time.time() 84 | model = word2vec.load(model_path) 85 | word2id_dic = {} 86 | init_0 = [0.0 for i in range(embedding_size)] 87 | id2vec_dic = [init_0] 88 | for i in range(len(model.vocab)): 89 | id = i + 1 90 | word2id_dic[model.vocab[i]] = id 91 | id2vec_dic.append(model[model.vocab[i]].tolist()) 92 | end_time = time.time() 93 | print('词转id,id转向量完成') 94 | print(end_time - start_time) 95 | return word2id_dic, id2vec_dic 96 | 97 | # 存入json文件 98 | 99 | 100 | def save_json(output_path, dic_data, message=None): 101 | start_time = time.time() 102 | if message is not None: 103 | print("Saving {}...".format(message)) 104 | with open(output_path, "w") as fh: 105 | json.dump(dic_data, fh, ensure_ascii=False, indent=4) 106 | print('保存完成') 107 | end_time = time.time() 108 | print(end_time - start_time) 109 | 110 | # 将原文中的passage,query,alternative,answer,query_id转成id号 111 | # 输入参数为词典的位置和训练集的位置 112 | 113 | 114 | def TrainningsetProcess(dic_url, dataset_url): 115 | res = [] # 最后返回的结果 116 | rule = re.compile(r'\|') 117 | id2alternatives = {} 118 | # 读取字典 119 | with open(dic_url, 'r', encoding='utf-8') as dic_file: 120 | dic = dict() 121 | dic = json.load(dic_file) 122 | # 读取训练集 123 | over_limit = 0 124 | with open(dataset_url, 'r', encoding='utf-8') as ts_file: 125 | for file_line in ts_file: 126 | line = json.loads(file_line) # 读取一行json文件 127 | this_line_res = dict() # 变量定义,代表这一行映射之后的结果 128 | passage = line['passage'] 129 | alternatives = line['alternatives'] 130 | query = line['query'] 131 | if dataset_url.find('test') == -1: 132 | answer = line['answer'] 133 | query_idx = line['query_id'] 134 | 135 | # 用jieba将passage和query分词,lcut返回list 136 | passage_cut = jieba.lcut(passage, cut_all=False) 137 | query_cut = jieba.lcut(query, cut_all=False) 138 | 139 | # 用词典将passage和query映射到id 140 | passage_id = [] 141 | query_id = [] 142 | for each_passage_word in passage_cut: 143 | passage_id.append(dic.get(each_passage_word)) 144 | for each_query_word in query_cut: 145 | query_id.append(dic.get(each_query_word)) 146 | 147 | # 对选项进行排序 148 | alternatives_cut = re.split(rule, alternatives) 149 | alternatives_cut = [s.strip() for s in alternatives_cut] 150 | tmp = [0, 0, 0] 151 | 152 | # 选项少于三个 153 | if len(alternatives_cut) == 1: 154 | alternatives_cut.append(alternatives_cut[0]) 155 | alternatives_cut.append(alternatives_cut[0]) 156 | if len(alternatives_cut) == 2: 157 | alternatives_cut.append(alternatives_cut[0]) 158 | 159 | # 跳过无效数据(135条) 160 | if alternatives.find("无法") == -1 and alternatives.find("不确定") == -1: 161 | if dataset_url.find('test') != -1: 162 | tmp[0] = alternatives_cut[0] 163 | tmp[1] = alternatives_cut[1] 164 | tmp[2] = alternatives_cut[2] 165 | else: 166 | print(1) 167 | continue 168 | if alternatives.count("无法确定") > 1 or alternatives.count("没") > 1: 169 | if dataset_url.find('test') != -1: 170 | tmp[0] = alternatives_cut[0] 171 | tmp[1] = alternatives_cut[1] 172 | tmp[2] = alternatives_cut[2] 173 | else: 174 | print(2) 175 | continue # 第64772条数据 176 | if 
alternatives.find("没") != -1 and alternatives.find("不") != -1 and alternatives.find("不确定") == -1: 177 | print(3) 178 | continue # 第144146条数据 179 | if "不确定" in alternatives_cut and "无法确定" in alternatives_cut: 180 | tmp[0] = "确定" 181 | tmp[1] = "不确定" 182 | tmp[2] = "无法确定" 183 | # 肯定/否定/无法确定 184 | elif alternatives.find("不") != -1 or alternatives.find("没") != -1: 185 | if alternatives.count("不") == 1 and alternatives.find("不确定") != -1: 186 | alternatives_cut.remove("不确定") 187 | alternatives_cut.append("不确定") 188 | tmp[0] = alternatives_cut[0] 189 | tmp[1] = alternatives_cut[1] 190 | tmp[2] = alternatives_cut[2] 191 | elif alternatives.count("不") > 1: 192 | if alternatives.find("不确定") == -1: 193 | if dataset_url.find("test") != -1: 194 | tmp[0] = alternatives_cut[0] 195 | tmp[1] = alternatives_cut[1] 196 | tmp[2] = alternatives_cut[2] 197 | else: 198 | print(line) 199 | continue 200 | else: 201 | alternatives_cut.remove("不确定") 202 | if alternatives_cut[0].find("不") != -1: 203 | tmp[1] = alternatives_cut[0] 204 | tmp[0] = alternatives_cut[1] 205 | else: 206 | tmp[1] = alternatives_cut[1] 207 | tmp[0] = alternatives_cut[0] 208 | alternatives_cut.append("不确定") 209 | tmp[2] = alternatives_cut[2] 210 | else: 211 | for tmp_alternatives in alternatives_cut: 212 | if tmp_alternatives.find("无法") != -1: 213 | tmp[2] = tmp_alternatives 214 | elif tmp_alternatives.find("不") != -1 or tmp_alternatives.find("没") != -1: 215 | tmp[1] = tmp_alternatives 216 | else: 217 | tmp[0] = tmp_alternatives 218 | # 无明显肯定与否定词义 219 | else: 220 | for tmp_alternatives in alternatives_cut: 221 | if tmp_alternatives.find("无法") != -1 or alternatives.find("不确定") != -1: 222 | alternatives_cut.remove(tmp_alternatives) 223 | alternatives_cut.append(tmp_alternatives) 224 | break 225 | tmp[0] = alternatives_cut[0] 226 | tmp[1] = alternatives_cut[1] 227 | tmp[2] = alternatives_cut[2] 228 | 229 | # 根据tmp列表生成answer_id 230 | if dataset_url.find('test') == -1: 231 | answer_id = tmp.index(answer.strip()) 232 | # 得到这一行映射后的结果,是dict类型的数据 233 | if len(passage_id) > 500: 234 | passage_id = passage_id[:500] 235 | over_limit += 1 236 | this_line_res['passage'] = passage_id 237 | this_line_res['query'] = query_id 238 | this_line_res['alternatives'] = tmp 239 | if dataset_url.find('test') == -1: 240 | this_line_res['answer'] = answer_id 241 | this_line_res['query_id'] = query_idx 242 | # 创建query_id到alternatives的字典,保存为json 243 | id2alternatives[query_idx] = tmp 244 | res.append(this_line_res) 245 | print(len(res)) 246 | print("over_limit:{}".format(over_limit)) 247 | return res, id2alternatives 248 | 249 | 250 | def data_process(config, train_file, test_file, validation_file): 251 | target_dir = config.target_dir 252 | read_data(train_file, os.path.join( 253 | target_dir, 'train_oridata.csv'), 250000) # 250000 254 | read_data(test_file, os.path.join( 255 | target_dir, 'test_oridata.csv'), 10000) # 10000 256 | read_data(validation_file, os.path.join( 257 | target_dir, 'validation_oridata.csv'), 30000) # 30000 258 | merge_csv(target_dir, os.path.join(target_dir, 'ori_data.csv')) 259 | de_word(os.path.join(target_dir, 'ori_data.csv'), 260 | os.path.join(target_dir, 'seg_list.txt')) 261 | word_vec(os.path.join(target_dir, 'seg_list.txt'), 262 | os.path.join(target_dir, 'seg_listWord2Vec.bin'), config.min_count, config.embedding_size) 263 | word2id_dic, id2vec_dic = transfer( 264 | os.path.join(target_dir, 'seg_listWord2Vec.bin'), config.embedding_size) 265 | save_json(config.word2id_file, word2id_dic, "word to id") 266 | save_json(config.id2vec_file, 
id2vec_dic, "id to vec") 267 | train_examples, train_id2alternatives = TrainningsetProcess( 268 | config.word2id_file, train_file) 269 | test_examples, test_id2alternatives = TrainningsetProcess( 270 | config.word2id_file, test_file) 271 | validation_examples, validation_id2alternatives = TrainningsetProcess( 272 | config.word2id_file, validation_file) 273 | save_json(config.train_eval_file, train_id2alternatives, 274 | message='保存train每条数据的alternatives') 275 | save_json(config.test_eval_file, test_id2alternatives, 276 | message='保存test每条数据的alternatives') 277 | save_json(config.dev_eval_file, validation_id2alternatives, 278 | message='保存validation每条数据的alternatives') 279 | return train_examples, test_examples, validation_examples, word2id_dic 280 | 281 | 282 | def build_features(config, examples, data_type, out_file, word2idx_dict, is_test=False): 283 | """ 284 | 将数据读入TFrecords 285 | """ 286 | 287 | para_limit = config.para_limit 288 | ques_limit = config.ques_limit 289 | 290 | print("Processing {} examples...".format(data_type)) 291 | writer = tf.python_io.TFRecordWriter(out_file) 292 | total = 0 293 | meta = {} 294 | # 数据增强用 295 | yes_examples = [] 296 | no_examples = [] 297 | depend_examples = [] 298 | 299 | for example in tqdm(examples): 300 | if data_type == "train": 301 | if example["answer"] == 0: 302 | yes_examples.append(example) 303 | elif example["answer"] == 1: 304 | no_examples.append(example) 305 | else: 306 | depend_examples.append(example) 307 | 308 | total += 1 309 | passage_idxs = np.zeros([para_limit], dtype=np.int32) 310 | question_idxs = np.zeros([ques_limit], dtype=np.int32) 311 | 312 | for i, token in enumerate(example["passage"]): 313 | if token == None: 314 | passage_idxs[i] = 0 315 | else: 316 | passage_idxs[i] = token 317 | for i, token in enumerate(example["query"]): 318 | if token == None: 319 | question_idxs[i] = 0 320 | else: 321 | question_idxs[i] = token 322 | # print(passage_idxs) 323 | # print(example["passage"]) 324 | if not is_test: 325 | record = tf.train.Example(features=tf.train.Features(feature={ 326 | "passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 327 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 328 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["answer"]])), 329 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"]])) 330 | })) 331 | else: 332 | record = tf.train.Example(features=tf.train.Features(feature={ 333 | "passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 334 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 335 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(-1)])), 336 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"]])) 337 | })) 338 | # print(record) 339 | writer.write(record.SerializeToString()) 340 | 341 | # 数据增强,初步的增强是将不确定选项的答案搭配其他passage生成新的数据 342 | if data_type == "train": 343 | for example in depend_examples: 344 | random1 = random.randint(0, len(yes_examples) - 1) 345 | example["passage"] = yes_examples[random1]["passage"] 346 | total += 1 347 | passage_idxs = np.zeros([para_limit], dtype=np.int32) 348 | question_idxs = np.zeros([ques_limit], dtype=np.int32) 349 | for i, token in enumerate(example["passage"]): 350 | if token == None: 351 | passage_idxs[i] = 0 352 | else: 353 | passage_idxs[i] = token 354 | for i, token 
in enumerate(example["query"]): 355 | if token == None: 356 | question_idxs[i] = 0 357 | else: 358 | question_idxs[i] = token 359 | record = tf.train.Example(features=tf.train.Features(feature={ 360 | "passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 361 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 362 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["answer"]])), 363 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"] + 500000])) 364 | })) 365 | writer.write(record.SerializeToString()) 366 | for example in depend_examples: 367 | random2 = random.randint(0, len(no_examples) - 1) 368 | example["passage"] = no_examples[random2]["passage"] 369 | total += 1 370 | passage_idxs = np.zeros([para_limit], dtype=np.int32) 371 | question_idxs = np.zeros([ques_limit], dtype=np.int32) 372 | for i, token in enumerate(example["passage"]): 373 | if token == None: 374 | passage_idxs[i] = 0 375 | else: 376 | passage_idxs[i] = token 377 | for i, token in enumerate(example["query"]): 378 | if token == None: 379 | question_idxs[i] = 0 380 | else: 381 | question_idxs[i] = token 382 | record = tf.train.Example(features=tf.train.Features(feature={ 383 | "passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 384 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 385 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["answer"]])), 386 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"] + 1000000])) 387 | })) 388 | writer.write(record.SerializeToString()) 389 | 390 | print("Build {} instances of features in total".format(total)) 391 | writer.close() 392 | 393 | 394 | def prepro(config): 395 | """ 396 | 数据预处理函数 397 | """ 398 | train_examples, test_examples, dev_examples, word2id_dict = data_process( 399 | config, config.train_file, config.test_file, config.dev_file) 400 | ''' 401 | print(train_examples) 402 | print(test_examples) 403 | print(dev_examples) 404 | print(word2id_dict) 405 | ''' 406 | # train: 249778, test: 10000, dev: 29968 407 | # train: 439, test: 18, dev: 48 408 | 409 | build_features(config, train_examples, "train", 410 | config.train_record_file, word2id_dict) 411 | build_features(config, dev_examples, "dev", 412 | config.dev_record_file, word2id_dict) 413 | build_features(config, test_examples, "test", 414 | config.test_record_file, word2id_dict, is_test=True) 415 | 416 | print("done!!!") 417 | -------------------------------------------------------------------------------- /baseline/examine_dev.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | examine_dev.py:检查dev集的结果,辅助分析 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import os 14 | import codecs 15 | import time 16 | 17 | from model import Model 18 | from util import * 19 | 20 | 21 | def examine_dev(config): 22 | """ 23 | 检查dev集的结果,辅助分析 24 | """ 25 | with open(config.id2vec_file, "r") as fh: 26 | id2vec = np.array(json.load(fh), dtype=np.float32) 27 | with open(config.dev_eval_file, "r") as fh: 28 | dev_eval_file = json.load(fh) 29 | 30 | total = 29968 31 | # 读取模型的路径和预测存储的路径 32 | save_dir = config.save_dir + config.experiment 33 | if not 
os.path.exists(save_dir): 34 | print("no save!") 35 | return 36 | predic_time = time.strftime("%Y-%m-%d_%H:%M:%S ", time.localtime()) 37 | os.path.join(config.prediction_dir, (predic_time + "_examine_dev.txt")) 38 | 39 | print("Loading model...") 40 | examine_batch = get_dataset(config.dev_record_file, get_record_parser( 41 | config), config).make_one_shot_iterator() 42 | 43 | model = Model(config, examine_batch, id2vec, trainable=False) 44 | 45 | sess_config = tf.ConfigProto(allow_soft_placement=True) 46 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 47 | sess_config.gpu_options.allow_growth = True 48 | 49 | print("examining ...") 50 | with tf.Session(config=sess_config) as sess: 51 | sess.run(tf.global_variables_initializer()) 52 | saver = tf.train.Saver() 53 | saver.restore(sess, tf.train.latest_checkpoint(save_dir)) 54 | sess.run(tf.assign(model.is_train, tf.constant(False, dtype=tf.bool))) 55 | answer_dict = {} 56 | truth_dict = {} 57 | for step in tqdm(range(total // config.batch_size + 1)): 58 | # 预测答案 59 | qa_id, answer, truth = sess.run( 60 | [model.qa_id, model.classes, model.answer]) 61 | answer_dict_ = {} 62 | truth_dict_ = {} 63 | for ids, tr, ans in zip(qa_id, truth, answer): 64 | answer_dict_[str(ids)] = ans 65 | truth_dict_[str(ids)] = tr 66 | answer_dict.update(answer_dict_) 67 | truth_dict.update(truth_dict_) 68 | metrics = evaluate_acc(truth_dict, answer_dict) 69 | print(len(truth_dict)) 70 | print(len(answer_dict)) 71 | print("accuracy:{}".format(metrics["accuracy"])) 72 | 73 | yes_predictions = [] # 正确答案是肯定的错题 74 | no_predictions = [] # 正确答案是否定的错题 75 | depend_predictions = [] # 正确答案是不确定的错题 76 | yes, no, depend = 0, 0, 0 77 | yes_wrong, no_wrong, depend_wrong = 0, 0, 0 78 | for key, value in answer_dict.items(): 79 | if truth_dict[key] != value: 80 | if truth_dict[key] == 0: 81 | yes += 1 82 | yes_wrong += 1 83 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 84 | wrong_answer = dev_eval_file[str(key)][value] 85 | yes_predictions.append( 86 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 87 | elif truth_dict[key] == 1: 88 | no += 1 89 | no_wrong += 1 90 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 91 | wrong_answer = dev_eval_file[str(key)][value] 92 | no_predictions.append( 93 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 94 | else: 95 | depend += 1 96 | depend_wrong += 1 97 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 98 | wrong_answer = dev_eval_file[str(key)][value] 99 | depend_predictions.append( 100 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 101 | else: 102 | if truth_dict[key] == 0: 103 | yes += 1 104 | elif truth_dict[key] == 1: 105 | no += 1 106 | else: 107 | depend += 1 108 | 109 | print("肯定型问题个数:{},否定型问题个数:{},不确定问题个数:{}".format(yes, no, depend)) 110 | print("肯定型问题正确率:{}".format((yes - yes_wrong) / yes * 1.0)) 111 | print("否定型问题正确率:{}".format((no - no_wrong) / no * 1.0)) 112 | print("不确定型问题正确率:{}".format((depend - depend_wrong) / depend * 1.0)) 113 | outputs_0 = u'\n'.join(yes_predictions) 114 | outputs_1 = u'\n'.join(no_predictions) 115 | outputs_2 = u'\n'.join(depend_predictions) 116 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_0.txt")), 'w', encoding='utf-8') as f: 117 | f.write(outputs_0) 118 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_1.txt")), 'w', encoding='utf-8') as f: 119 | f.write(outputs_1) 120 | with codecs.open(os.path.join(config.prediction_dir, 
(predic_time + "_examine_dev_2.txt")), 'w', encoding='utf-8') as f: 121 | f.write(outputs_2) 122 | print("done!") 123 | -------------------------------------------------------------------------------- /baseline/examine_dev_ensemble.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | examine_dev.py:检查dev集的结果,辅助分析 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import os 14 | import pickle 15 | import codecs 16 | import time 17 | from collections import Counter 18 | from model import Model 19 | from util import * 20 | 21 | 22 | def examine_dev_ensemble(config): 23 | """ 24 | 检查dev集的结果,辅助分析 25 | """ 26 | with open(config.id2vec_file, "r") as fh: 27 | id2vec = np.array(json.load(fh), dtype=np.float32) 28 | with open(config.dev_eval_file, "r") as fh: 29 | dev_eval_file = json.load(fh) 30 | 31 | total = 29968 * 3 32 | # 读取模型的路径和预测存储的路径 33 | save_dir = config.save_dir + config.experiment 34 | if not os.path.exists(save_dir): 35 | print("no save!") 36 | return 37 | predic_time = time.strftime("%Y-%m-%d_%H:%M:%S ", time.localtime()) 38 | os.path.join(config.prediction_dir, (predic_time + "_examine_dev.txt")) 39 | 40 | print("Loading model...") 41 | examine_batch = get_dataset("./data/dev_aug.tfrecords", get_record_parser( 42 | config), config).make_one_shot_iterator() 43 | 44 | model = Model(config, examine_batch, id2vec, trainable=False) 45 | 46 | sess_config = tf.ConfigProto(allow_soft_placement=True) 47 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 48 | sess_config.gpu_options.allow_growth = True 49 | 50 | print("examining ...") 51 | with tf.Session(config=sess_config) as sess: 52 | sess.run(tf.global_variables_initializer()) 53 | saver = tf.train.Saver() 54 | saver.restore(sess, tf.train.latest_checkpoint(save_dir)) 55 | sess.run(tf.assign(model.is_train, tf.constant(False, dtype=tf.bool))) 56 | logits_dict = {} 57 | truth_dict = {} 58 | answer_dict = {} 59 | for step in tqdm(range(total // config.batch_size + 1)): 60 | # 预测答案 61 | qa_id, softmax, truth = sess.run( 62 | [model.qa_id, model.final_softmax, model.answer]) 63 | # 往字典中添加每个id的三个logits 64 | for ids, tr, log in zip(qa_id, truth, softmax): 65 | if str(ids) not in logits_dict.keys(): 66 | logits_dict[str(ids)] = log 67 | truth_dict[str(ids)] = tr 68 | else: 69 | logits_dict[str(ids)] += log 70 | # if str(ids) not in class_dict.keys(): 71 | # class_dict[str(ids)] = [int(cla)] 72 | # truth_dict[str(ids)] = tr 73 | # else: 74 | # class_dict[str(ids)].append(int(cla)) 75 | # 根据合并的logits求answer 76 | for key, value in logits_dict.items(): 77 | val = value.tolist() 78 | answer_dict[key] = val.index(max(val)) 79 | # answer_dict[key], _ = Counter(value).most_common(1)[0] 80 | metrics = evaluate_acc(truth_dict, answer_dict) 81 | print(len(truth_dict)) 82 | print(len(answer_dict)) 83 | print("accuracy:{}".format(metrics["accuracy"])) 84 | 85 | print("正在保存dev的softmax结果文件!") 86 | if not os.path.exists("./dev_soft"): 87 | os.makedirs("./dev_soft") 88 | with open("./dev_soft/model_aug_dev_ensemble.txt", "wb") as f1: # 手动更改保存的名字,路径不用改 89 | pickle.dump(logits_dict, f1) 90 | 91 | yes_predictions = [] # 正确答案是肯定的错题 92 | no_predictions = [] # 正确答案是否定的错题 93 | depend_predictions = [] # 正确答案是不确定的错题 94 | yes, no, depend = 0, 0, 0 95 | yes_wrong, no_wrong, depend_wrong = 0, 0, 0 96 | for key, value in answer_dict.items(): 97 | if truth_dict[key] != 
value: 98 | if truth_dict[key] == 0: 99 | yes += 1 100 | yes_wrong += 1 101 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 102 | wrong_answer = dev_eval_file[str(key)][value] 103 | yes_predictions.append( 104 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 105 | elif truth_dict[key] == 1: 106 | no += 1 107 | no_wrong += 1 108 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 109 | wrong_answer = dev_eval_file[str(key)][value] 110 | no_predictions.append( 111 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 112 | else: 113 | depend += 1 114 | depend_wrong += 1 115 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 116 | wrong_answer = dev_eval_file[str(key)][value] 117 | depend_predictions.append( 118 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 119 | else: 120 | if truth_dict[key] == 0: 121 | yes += 1 122 | elif truth_dict[key] == 1: 123 | no += 1 124 | else: 125 | depend += 1 126 | 127 | print("肯定型问题个数:{},否定型问题个数:{},不确定问题个数:{}".format(yes, no, depend)) 128 | print("肯定型问题正确率:{}".format((yes - yes_wrong) / yes * 1.0)) 129 | print("否定型问题正确率:{}".format((no - no_wrong) / no * 1.0)) 130 | print("不确定型问题正确率:{}".format((depend - depend_wrong) / depend * 1.0)) 131 | outputs_0 = u'\n'.join(yes_predictions) 132 | outputs_1 = u'\n'.join(no_predictions) 133 | outputs_2 = u'\n'.join(depend_predictions) 134 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_0.txt")), 'w', encoding='utf-8') as f: 135 | f.write(outputs_0) 136 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_1.txt")), 'w', encoding='utf-8') as f: 137 | f.write(outputs_1) 138 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_2.txt")), 'w', encoding='utf-8') as f: 139 | f.write(outputs_2) 140 | print("done!") 141 | -------------------------------------------------------------------------------- /baseline/file_save.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | examine_dev.py:检查dev集的结果,辅助分析 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import pickle 14 | import os 15 | import codecs 16 | import time 17 | from model import Model 18 | from util import * 19 | 20 | 21 | def save_dev(config): 22 | """ 23 | 验证dev集的结果,保存文件 24 | """ 25 | with open(config.id2vec_file, "r") as fh: 26 | id2vec = np.array(json.load(fh), dtype=np.float32) 27 | with open(config.dev_eval_file, "r") as fh: 28 | dev_eval_file = json.load(fh) 29 | total = 29968 30 | 31 | print("Loading model...") 32 | sess_config = tf.ConfigProto(allow_soft_placement=True) 33 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 34 | sess_config.gpu_options.allow_growth = True 35 | 36 | truth_dict = {} 37 | predict_dict = {} 38 | logits_dict1 = {} 39 | 40 | print("正在模型预测!") 41 | g1 = tf.Graph() 42 | with tf.Session(graph=g1, config=sess_config) as sess1: 43 | with g1.as_default(): 44 | dev_batch1 = get_dataset(config.dev_record_file, get_record_parser( 45 | config), config).make_one_shot_iterator() 46 | model_1 = Model(config, dev_batch1, id2vec, trainable=False) 47 | sess1.run(tf.global_variables_initializer()) 48 | saver1 = tf.train.Saver() 49 | # 需要手动更改路径 50 | saver1.restore( 51 | sess1, "./log/model/model_10000_devAcc_0.662240.ckpt") 52 | sess1.run(tf.assign(model_1.is_train, 53 | tf.constant(False, 
dtype=tf.bool))) 54 | for step in tqdm(range(total // config.batch_size + 1)): 55 | qa_id, logits, truths = sess1.run( 56 | [model_1.qa_id, model_1.logits, model_1.answer]) 57 | for ids, logits, truth in zip(qa_id, logits, truths): 58 | logits_dict1[str(ids)] = logits 59 | truth_dict[str(ids)] = truth 60 | if len(logits_dict1) != len(dev_eval_file): 61 | print("logits1 data number not match") 62 | 63 | a = tf.placeholder(shape=[3], dtype=tf.float32, name="me") 64 | softmax = tf.nn.softmax(a) 65 | for key, val in truth_dict.items(): 66 | value = sess1.run(softmax, feed_dict={a: logits_dict1[key]}) 67 | predict_dict[key] = value 68 | print("正在保存dev的softmax结果文件!") 69 | if not os.path.exists("./dev_soft"): 70 | os.makedirs("./dev_soft") 71 | with open("./dev_soft/BIDAF_b64_e256_h150_v256.txt", "wb") as f1: # 手动更改保存的名字,路径不用改 72 | pickle.dump(predict_dict, f1) 73 | if not os.path.exists("./truth"): 74 | os.makedirs("./truth") 75 | with open("./truth/truth_dict.txt", "wb") as f2: # 不用改 76 | pickle.dump(truth_dict, f2) 77 | 78 | 79 | def save_test(config): 80 | """ 81 | 输出test集的结果,保存文件 82 | """ 83 | with open(config.id2vec_file, "r") as fh: 84 | id2vec = np.array(json.load(fh), dtype=np.float32) 85 | with open(config.test_eval_file, "r") as fh: 86 | test_eval_file = json.load(fh) 87 | total = 10000 88 | 89 | print("Loading model...") 90 | sess_config = tf.ConfigProto(allow_soft_placement=True) 91 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 92 | sess_config.gpu_options.allow_growth = True 93 | 94 | predict_dict = {} 95 | logits_dict1 = {} 96 | 97 | print("正在模型预测!") 98 | g1 = tf.Graph() 99 | with tf.Session(graph=g1, config=sess_config) as sess1: 100 | with g1.as_default(): 101 | test_batch1 = get_dataset(config.test_record_file, get_record_parser( 102 | config), config).make_one_shot_iterator() 103 | model_1 = Model(config, test_batch1, id2vec, trainable=False) 104 | sess1.run(tf.global_variables_initializer()) 105 | saver1 = tf.train.Saver() 106 | # 需要手动更改路径 107 | saver1.restore( 108 | sess1, "./log/model/model_131000_devAcc_0.732782.ckpt") 109 | sess1.run(tf.assign(model_1.is_train, 110 | tf.constant(False, dtype=tf.bool))) 111 | for step in tqdm(range(total // config.batch_size + 1)): 112 | qa_id, logits = sess1.run( 113 | [model_1.qa_id, model_1.logits]) 114 | for ids, logits in zip(qa_id, logits): 115 | logits_dict1[str(ids)] = logits 116 | if len(logits_dict1) != len(test_eval_file): 117 | print("logits1 data number not match") 118 | 119 | a = tf.placeholder(shape=[3], dtype=tf.float32, name="me") 120 | softmax = tf.nn.softmax(a) 121 | for key, val in logits_dict1.items(): 122 | value = sess1.run(softmax, feed_dict={a: logits_dict1[key]}) 123 | predict_dict[key] = value 124 | 125 | print("正在保存dev的softmax结果文件!") 126 | if not os.path.exists("./test_soft"): 127 | os.makedirs("./test_soft") 128 | with open("./test_soft/RNET_b64_e300_h60_v300.txt", "wb") as f1: 129 | pickle.dump(predict_dict, f1) 130 | -------------------------------------------------------------------------------- /baseline/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | main.py:train and test 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import os 14 | import codecs 15 | import time 16 | import math 17 | 18 | from model import Model 19 | from util import * 20 | 21 | 22 | def train(config): 23 | """ 24 | 
训练与验证函数 25 | """ 26 | with open(config.id2vec_file, "r") as fh: 27 | id2vec = np.array(json.load(fh), dtype=np.float32) 28 | with open(config.train_eval_file, "r") as fh: 29 | train_eval_file = json.load(fh) 30 | with open(config.dev_eval_file, "r") as fh: 31 | dev_eval_file = json.load(fh) 32 | 33 | dev_total = 29968 # 验证集数据量 34 | 35 | # 不同参数的训练在不同的文件夹下存储 36 | log_dir = config.log_dir + config.experiment 37 | save_dir = config.save_dir + config.experiment 38 | if not os.path.exists(log_dir): 39 | os.makedirs(log_dir) 40 | if not os.path.exists(save_dir): 41 | os.makedirs(save_dir) 42 | 43 | print("Building model...") 44 | parser = get_record_parser(config) 45 | train_dataset = get_batch_dataset(config.train_record_file, parser, config) 46 | dev_dataset = get_dataset(config.dev_record_file, parser, config) 47 | 48 | # 可馈送迭代器,通过feed_dict机制选择每次sess.run时调用train_iterator还是dev_iterator 49 | handle = tf.placeholder(tf.string, shape=[]) 50 | iterator = tf.data.Iterator.from_string_handle( 51 | handle, train_dataset.output_types, train_dataset.output_shapes) 52 | train_iterator = train_dataset.make_one_shot_iterator() 53 | dev_iterator = dev_dataset.make_one_shot_iterator() 54 | 55 | # 选取模型 56 | if config.model_name == "default": 57 | model = Model(config, iterator, id2vec) 58 | else: 59 | print("model error") 60 | return 61 | 62 | sess_config = tf.ConfigProto(allow_soft_placement=True) 63 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 64 | sess_config.gpu_options.allow_growth = True 65 | 66 | loss_save = 100.0 67 | patience = 0 68 | lr = config.init_learning_rate 69 | emb_lr = config.init_emb_lr 70 | 71 | with tf.Session(config=sess_config) as sess: 72 | writer = tf.summary.FileWriter(log_dir, sess.graph) # 存储计算图 73 | sess.run(tf.global_variables_initializer()) 74 | saver = tf.train.Saver() 75 | train_handle = sess.run(train_iterator.string_handle()) 76 | dev_handle = sess.run(dev_iterator.string_handle()) 77 | sess.run(tf.assign(model.is_train, tf.constant(True, dtype=tf.bool))) 78 | sess.run(tf.assign(model.learning_rate, 79 | tf.constant(lr, dtype=tf.float32))) 80 | # sess.run(tf.assign(model.emb_lr, tf.constant(emb_lr, dtype=tf.float32))) 81 | 82 | best_dev_acc = 0.0 # 定义一个最佳验证准确率,只有当准确率高于它才保存模型 83 | print("Training ...") 84 | for go in tqdm(range(1, config.num_steps + 1)): 85 | global_step = sess.run(model.global_step) + 1 86 | loss, train_op = sess.run([model.loss, model.train_op], feed_dict={ 87 | handle: train_handle}) 88 | if global_step % config.period == 0: # 每隔一段步数就记录一次train_loss和learning_rate 89 | loss_sum = tf.Summary(value=[tf.Summary.Value( 90 | tag="model/loss", simple_value=loss), ]) 91 | writer.add_summary(loss_sum, global_step) 92 | lr_sum = tf.Summary(value=[tf.Summary.Value( 93 | tag="model/learning_rate", simple_value=sess.run(model.learning_rate)), ]) 94 | writer.add_summary(lr_sum, global_step) 95 | # emb_lr_sum = tf.Summary(value=[tf.Summary.Value( 96 | # tag="model/emb_lr", simple_value=sess.run(model.emb_lr)), ]) 97 | # writer.add_summary(emb_lr_sum, global_step) 98 | 99 | if global_step % config.checkpoint == 0: # 验证acc,并保存模型 100 | sess.run(tf.assign(model.is_train, 101 | tf.constant(False, dtype=tf.bool))) 102 | 103 | # 评估训练集 104 | _, summ = evaluate_batch( 105 | model, config.val_num_batches, train_eval_file, sess, "train_eval", handle, train_handle) 106 | for s in summ: 107 | writer.add_summary(s, global_step) 108 | 109 | # 评估验证集 110 | metrics, summ = evaluate_batch( 111 | model, dev_total // config.batch_size + 1, dev_eval_file, sess, "dev", 
handle, dev_handle) 112 | sess.run(tf.assign(model.is_train, 113 | tf.constant(True, dtype=tf.bool))) 114 | for s in summ: 115 | writer.add_summary(s, global_step) 116 | writer.flush() # 将事件文件刷新到磁盘 117 | 118 | # 学习率衰减的策略1 119 | if config.optimizer == "Adadelta": 120 | dev_loss = metrics["loss"] 121 | if dev_loss < loss_save: 122 | loss_save = dev_loss 123 | patience = 0 124 | else: 125 | patience += 1 126 | if patience >= config.patience: 127 | lr /= 2.0 128 | loss_save = dev_loss 129 | patience = 0 130 | elif config.optimizer == "Adam": 131 | # 学习率衰减策略2 132 | if global_step <= 50000: 133 | lr = config.init_learning_rate 134 | elif global_step <= 100000: 135 | lr = config.init_learning_rate / \ 136 | math.sqrt((global_step - 45000) / 5000) 137 | emb_lr = 5e-6 138 | elif global_step <= 200000: 139 | lr = config.init_learning_rate / \ 140 | math.sqrt((global_step - 45000) / 1000) 141 | emb_lr = 3e-6 142 | else: 143 | lr = config.init_learning_rate / \ 144 | math.sqrt(global_step / 1000) 145 | emb_lr = 1e-6 146 | else: 147 | print("error") 148 | return 149 | 150 | sess.run(tf.assign(model.learning_rate, 151 | tf.constant(lr, dtype=tf.float32))) 152 | # sess.run(tf.assign(model.emb_lr, tf.constant( 153 | # emb_lr, dtype=tf.float32))) 154 | 155 | # 保存模型的逻辑 156 | if metrics["accuracy"] > best_dev_acc: 157 | best_dev_acc = metrics["accuracy"] 158 | filename = os.path.join( 159 | save_dir, "model_{}_devAcc_{:.6f}.ckpt".format(global_step, best_dev_acc)) 160 | saver.save(sess, filename) 161 | 162 | print("finished!") 163 | 164 | 165 | def evaluate_batch(model, num_batches, eval_file, sess, data_type, handle, str_handle): 166 | """ 167 | 模型评估函数 168 | """ 169 | answer_dict = {} # 答案词典 170 | truth_dict = {} # 真实答案词典 171 | losses = [] 172 | for _ in tqdm(range(1, num_batches + 1)): 173 | qa_id, loss, truth, answer = sess.run( 174 | [model.qa_id, model.loss, model.answer, model.classes], feed_dict={handle: str_handle}) 175 | answer_dict_ = {} 176 | truth_dict_ = {} 177 | for ids, tr, ans in zip(qa_id, truth, answer): 178 | answer_dict_[str(ids)] = ans 179 | truth_dict_[str(ids)] = tr 180 | answer_dict.update(answer_dict_) 181 | truth_dict.update(truth_dict_) 182 | losses.append(loss) 183 | loss = np.mean(losses) 184 | metrics = evaluate_acc(truth_dict, answer_dict) 185 | metrics["loss"] = loss 186 | loss_sum = tf.Summary(value=[tf.Summary.Value( 187 | tag="{}/loss".format(data_type), simple_value=metrics["loss"]), ]) 188 | acc_sum = tf.Summary(value=[tf.Summary.Value( 189 | tag="{}/accuracy".format(data_type), simple_value=metrics["accuracy"]), ]) 190 | return metrics, [loss_sum, acc_sum] 191 | 192 | 193 | def test(config): 194 | """ 195 | 测试函数 196 | """ 197 | with open(config.id2vec_file, "r") as fh: 198 | id2vec = np.array(json.load(fh), dtype=np.float32) 199 | with open(config.test_eval_file, "r") as fh: 200 | test_eval_file = json.load(fh) 201 | 202 | total = 10000 203 | # 读取模型的路径和预测存储的路径 204 | save_dir = config.save_dir + config.experiment 205 | if not os.path.exists(save_dir): 206 | print("no save!") 207 | return 208 | predic_time = time.strftime("%Y-%m-%d_%H:%M:%S ", time.localtime()) 209 | prediction_file = os.path.join( 210 | config.prediction_dir, (predic_time + "_predictions.txt")) 211 | 212 | print("Loading model...") 213 | test_batch = get_dataset(config.test_record_file, get_record_parser( 214 | config), config).make_one_shot_iterator() 215 | 216 | # 选取模型 217 | if config.model_name == "default": 218 | model = Model(config, test_batch, id2vec, trainable=False) 219 | else: 220 | print("model 
error") 221 | return 222 | 223 | sess_config = tf.ConfigProto(allow_soft_placement=True) 224 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 225 | sess_config.gpu_options.allow_growth = True 226 | 227 | print("testing ...") 228 | with tf.Session(config=sess_config) as sess: 229 | sess.run(tf.global_variables_initializer()) 230 | saver = tf.train.Saver() 231 | saver.restore(sess, tf.train.latest_checkpoint(save_dir)) 232 | sess.run(tf.assign(model.is_train, tf.constant(False, dtype=tf.bool))) 233 | answer_dict = {} 234 | for step in tqdm(range(total // config.batch_size + 1)): 235 | # 预测答案 236 | qa_id, answer = sess.run([model.qa_id, model.classes]) 237 | answer_dict_ = {} 238 | for ids, ans in zip(qa_id, answer): 239 | answer_dict_[str(ids)] = ans 240 | answer_dict.update(answer_dict_) 241 | # 将结果写文件的操作,不用考虑问题顺序 242 | if len(answer_dict) != len(test_eval_file): 243 | print("data number not match") 244 | predictions = [] 245 | for key, value in answer_dict.items(): 246 | prediction_answer = test_eval_file[str(key)][value] 247 | predictions.append(str(key) + '\t' + str(prediction_answer)) 248 | outputs = u'\n'.join(predictions) 249 | with codecs.open(prediction_file, 'w', encoding='utf-8') as f: 250 | f.write(outputs) 251 | print("done!") 252 | 253 | 254 | def dev(config): 255 | with open(config.id2vec_file, "r") as fh: 256 | id2vec = np.array(json.load(fh), dtype=np.float32) 257 | with open(config.dev_eval_file, "r") as fh: 258 | dev_eval_file = json.load(fh) 259 | 260 | total = 29968 261 | print("Loading model...") 262 | sess_config = tf.ConfigProto(allow_soft_placement=True) 263 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 264 | sess_config.gpu_options.allow_growth = True 265 | 266 | truth_dict = {} 267 | predict_dict = {} 268 | logits_dict1 = {} 269 | logits_dict2 = {} 270 | logits_dict3 = {} 271 | logits_dict = {} 272 | 273 | print("model-1 predicting!") 274 | g1 = tf.Graph() 275 | with tf.Session(graph=g1, config=sess_config) as sess1: 276 | with g1.as_default(): 277 | dev_batch1 = get_dataset(config.dev_record_file, get_record_parser(config), 278 | config).make_one_shot_iterator() 279 | model_1 = Model( 280 | config, dev_batch1, id2vec, trainable=False) 281 | sess1.run(tf.global_variables_initializer()) 282 | saver1 = tf.train.Saver() 283 | saver1.restore( 284 | sess1, "./log/model/model_131000_devAcc_0.732782.ckpt") 285 | sess1.run(tf.assign(model_1.is_train, 286 | tf.constant(False, dtype=tf.bool))) 287 | for step in tqdm(range(total // config.batch_size + 1)): 288 | qa_id, logits, truths = sess1.run( 289 | [model_1.qa_id, model_1.logits, model_1.answer]) 290 | for ids, logits, truth in zip(qa_id, logits, truths): 291 | logits_dict1[str(ids)] = logits 292 | truth_dict[str(ids)] = truth 293 | if len(logits_dict1) != len(dev_eval_file): 294 | print("logits1 data number not match") 295 | 296 | print("logits相加,模型融合分类!") 297 | predictions = [] 298 | g4 = tf.Graph() 299 | with tf.Session(graph=g4, config=sess_config) as sess4: 300 | a = tf.placeholder(shape=[3], dtype=tf.float32, name="mee") 301 | b = tf.placeholder(shape=[3], dtype=tf.float32, name="me") 302 | c = tf.placeholder(shape=[3], dtype=tf.float32, name="xiaodong") 303 | softmax_a = tf.nn.softmax(a) 304 | softmax_b = tf.nn.softmax(b) 305 | softmax_c = tf.nn.softmax(c) 306 | final = 0.4 * softmax_a + 0.4 * softmax_b + 0.2 * softmax_c 307 | final_class = tf.cast(tf.argmax(final), dtype=tf.int32) 308 | for key, val in truth_dict.items(): 309 | value = sess4.run(final_class, feed_dict={ 310 | a: 
logits_dict1[key], b: logits_dict2[key], c: logits_dict3[key]}) 311 | predict_dict[key] = value 312 | print(evaluate_acc(truth_dict, predict_dict)) 313 | -------------------------------------------------------------------------------- /baseline/model.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | model.py:基于R-Net的改进模型,将PtrNet改成分类器 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | 10 | import tensorflow as tf 11 | from nn_func import cudnn_gru, native_gru, dot_attention, summ, dropout 12 | 13 | 14 | class Model(object): 15 | def __init__(self, config, batch, word_mat=None, trainable=True, opt=True): 16 | """ 17 | 模型初始化函数 18 | Args: 19 | config:是tf.flag.FLAGS,配置整个项目的超参数 20 | batch:是一个tf.data.iterator对象,读取数据的迭代器,可能联系到tf.records,如果我们的数据集比较小就可以不用 21 | word_mat:np.array数组,是词向量? 22 | char_mat:同上 23 | """ 24 | self.config = config 25 | self.global_step = tf.get_variable('global_step', shape=[], dtype=tf.int32, 26 | initializer=tf.constant_initializer(0), trainable=False) 27 | # tf.data.iterator的get_next方法,返回dataset中下一个element的tensor对象,在sess.run中实现迭代 28 | """ 29 | passage: passage序列的每个词的id号的tensor(tf.int32),长度应该是都取最大限制长度,空余的填充空值?(这里待定) 30 | question: question序列的每个词的id号的tensor(tf.int32) 31 | ch, qh, y1, y2: 本项目不需要,已经取消 32 | qa_id: question的id 33 | answer: 新添加的answer标签,(0/1/2),shape初步定义为[batch_size] 34 | """ 35 | self.passage, self.question, self.answer, self.qa_id = batch.get_next() 36 | self.is_train = tf.get_variable( 37 | "is_train", shape=[], dtype=tf.bool, trainable=False) 38 | 39 | # word embeddings的变量,这里定义的是不能训练的 40 | self.word_mat = tf.get_variable("word_mat", initializer=tf.constant( 41 | word_mat, dtype=tf.float32), trainable=False) 42 | 43 | # tf.cast将tensor转换为bool类型,生成mask,有值部分用true,空值用false 44 | self.c_mask = tf.cast(self.passage, tf.bool) 45 | self.q_mask = tf.cast(self.question, tf.bool) 46 | # 求每个序列的真实长度,得到_len的tensor 47 | self.c_len = tf.reduce_sum(tf.cast(self.c_mask, tf.int32), axis=1) 48 | self.q_len = tf.reduce_sum(tf.cast(self.q_mask, tf.int32), axis=1) 49 | 50 | if opt: 51 | batch_size = config.batch_size 52 | # 求一个batch中序列最大长度,并按照最大长度对对tensor进行slice划分 53 | self.c_maxlen = tf.reduce_max(self.c_len) 54 | self.q_maxlen = tf.reduce_max(self.q_len) 55 | self.c = tf.slice(self.passage, [0, 0], [ 56 | batch_size, self.c_maxlen]) 57 | self.q = tf.slice(self.question, [0, 0], [ 58 | batch_size, self.q_maxlen]) 59 | self.c_mask = tf.slice(self.c_mask, [0, 0], [ 60 | batch_size, self.c_maxlen]) 61 | self.q_mask = tf.slice(self.q_mask, [0, 0], [ 62 | batch_size, self.q_maxlen]) 63 | else: 64 | self.c_maxlen, self.q_maxlen = config.para_limit, config.ques_limit 65 | 66 | self.RNet() # 构造R-Net模型 67 | 68 | if trainable: 69 | 70 | self.learning_rate = tf.get_variable( 71 | "learning_rate", shape=[], dtype=tf.float32, trainable=False) 72 | 73 | if config.optimizer == "Adam": 74 | self.opt = tf.train.AdamOptimizer( 75 | learning_rate=self.learning_rate, epsilon=1e-8) 76 | elif config.optimizer == "Adadelta": 77 | self.opt = tf.train.AdadeltaOptimizer( 78 | learning_rate=self.learning_rate, epsilon=1e-6) 79 | else: 80 | print("optimizer error") 81 | return 82 | 83 | grads = self.opt.compute_gradients(self.loss) 84 | gradients, variables = zip(*grads) 85 | capped_grads, _ = tf.clip_by_global_norm( 86 | gradients, config.grad_clip) 87 | self.train_op = self.opt.apply_gradients( 88 | zip(capped_grads, variables), global_step=self.global_step) 89 | 90 | # # 对embedding层设置单独的学习率 91 | # self.emb_lr = 
tf.get_variable( 92 | # "emb_lr", shape=[], dtype=tf.float32, trainable=False) 93 | # self.learning_rate = tf.get_variable( 94 | # "learning_rate", shape=[], dtype=tf.float32, trainable=False) 95 | # self.emb_opt = tf.train.AdamOptimizer( 96 | # learning_rate=self.emb_lr, epsilon=1e-8) 97 | # self.opt = tf.train.AdamOptimizer( 98 | # learning_rate=self.learning_rate, epsilon=1e-8) 99 | # # 区分不同的变量列表 100 | # self.var_list = tf.trainable_variables() 101 | # var_list1 = [] 102 | # var_list2 = [] 103 | # for var in self.var_list: 104 | # if var.op.name == "word_mat": 105 | # var_list1.append(var) 106 | # else: 107 | # var_list2.append(var) 108 | 109 | # grads = tf.gradients(self.loss, var_list1 + var_list2) 110 | # capped_grads, _ = tf.clip_by_global_norm( 111 | # grads, config.grad_clip) 112 | # grads1 = capped_grads[:len(var_list1)] 113 | # grads2 = capped_grads[len(var_list1):] 114 | # self.train_op1 = self.emb_opt.apply_gradients( 115 | # zip(grads1, var_list1)) 116 | # self.train_op2 = self.opt.apply_gradients( 117 | # zip(grads2, var_list2), global_step=self.global_step) 118 | # self.train_op = tf.group(self.train_op1, self.train_op2) 119 | 120 | def RNet(self): 121 | config = self.config 122 | batch_size, PL, QL, d = config.batch_size, self.c_maxlen, self.q_maxlen, config.hidden 123 | gru = cudnn_gru if config.use_cudnn else native_gru # 选择使用哪种gru网络 124 | 125 | with tf.variable_scope("embedding"): 126 | # word_embedding层 127 | with tf.name_scope("word"): 128 | # embedding后的shape是[batch_size, max_len, vec_len] 129 | c_emb = tf.nn.embedding_lookup(self.word_mat, self.c) 130 | q_emb = tf.nn.embedding_lookup(self.word_mat, self.q) 131 | 132 | with tf.variable_scope("encoding"): 133 | # encoder层,将context和question分别输入双向GRU 134 | rnn = gru(num_layers=3, num_units=d, batch_size=batch_size, input_size=c_emb.get_shape( 135 | ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train) 136 | # RNN每层的正向反向输出合并,本代码默认的是每层的输出也合并 137 | # 所以对于3层rnn,输出的shape是[batch_size, max_len, 6*num_units] 138 | # 并且,序列空值处的输出都清零了 139 | c = rnn(c_emb, seq_len=self.c_len) 140 | q = rnn(q_emb, seq_len=self.q_len) 141 | 142 | with tf.variable_scope("QP_attention"): 143 | """ 144 | 基于注意力的循环神经网络层,匹配context和question 145 | """ 146 | # qc_att的shape [batch_size, c_maxlen, 12*hidden] 147 | qc_att_ = dot_attention(inputs=c, memory=q, mask=self.q_mask, hidden=d, 148 | keep_prob=config.keep_prob, is_train=self.is_train) 149 | rnn = gru(num_layers=1, num_units=d, batch_size=batch_size, input_size=qc_att_.get_shape( 150 | ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train) 151 | # att:[batch_size, c_maxlen, 6*hidden] 152 | qc_att = rnn(qc_att_, seq_len=self.c_len) 153 | 154 | with tf.variable_scope("passage_match"): 155 | """ 156 | context自匹配层 157 | """ 158 | c_att = dot_attention( 159 | qc_att, qc_att, mask=self.c_mask, hidden=d, keep_prob=config.keep_prob, is_train=self.is_train) 160 | rnn = gru(num_layers=1, num_units=d, batch_size=batch_size, input_size=c_att.get_shape( 161 | ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train) 162 | # match:[batch_size, c_maxlen, 6*hidden] 163 | c_match = rnn(c_att, seq_len=self.c_len) 164 | 165 | with tf.variable_scope("YesNo_classification"): 166 | """ 167 | 对问题答案的分类层, 需要的输入有question的编码结果q和context的match 168 | """ 169 | # init的shape:[batch_size, 2*hidden] 170 | # 这步的作用初始猜测是将question进行pooling操作,然后再输入给一个rnn层进行分类 171 | init = summ(q[:, :, -2 * d:], d, mask=self.q_mask, 172 | keep_prob=config.keep_prob, is_train=self.is_train) 173 | c_match_ = 
dropout(c_match, keep_prob=config.keep_prob, 174 | is_train=self.is_train) 175 | final_hiddens = init.get_shape().as_list()[-1] 176 | final_gru = tf.contrib.rnn.GRUCell(final_hiddens) 177 | _, final_state = tf.nn.dynamic_rnn( 178 | final_gru, c_match_, initial_state=init, dtype=tf.float32) 179 | final_w = tf.get_variable(name="final_w", shape=[final_hiddens, 3]) 180 | final_b = tf.get_variable(name="final_b", shape=[ 181 | 3], initializer=tf.constant_initializer(0.)) 182 | self.logits = tf.matmul(final_state, final_w) 183 | self.logits = tf.nn.bias_add( 184 | self.logits, final_b) # logits:[batch_size, 3] 185 | 186 | with tf.variable_scope("softmax_and_loss"): 187 | final_softmax = tf.nn.softmax(self.logits) 188 | self.classes = tf.cast( 189 | tf.argmax(final_softmax, axis=1), dtype=tf.int32, name="classes") 190 | # 注意stop_gradient的使用,因为answer不是placeholder传进来的,所以要注明不对其计算梯度 191 | if config.loss_function == "focal_loss": 192 | self.loss = tf.reduce_mean(sparse_focal_loss( 193 | logits=self.logits, labels=tf.stop_gradient(self.answer))) 194 | else: 195 | self.loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits( 196 | logits=self.logits, labels=tf.stop_gradient(self.answer))) 197 | 198 | def get_loss(self): 199 | return self.loss 200 | 201 | def get_global_step(self): 202 | return self.global_step 203 | -------------------------------------------------------------------------------- /baseline/nn_func.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | nn_func.py:神经网络模型的组件 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | 11 | INF = 1e30 12 | 13 | 14 | class cudnn_gru: 15 | 16 | def __init__(self, num_layers, num_units, batch_size, input_size, keep_prob=1.0, is_train=None, scope=None): 17 | self.num_layers = num_layers 18 | self.grus = [] 19 | self.inits = [] 20 | self.dropout_mask = [] 21 | for layer in range(num_layers): 22 | input_size_ = input_size if layer == 0 else 2 * num_units 23 | gru_fw = tf.contrib.cudnn_rnn.CudnnGRU(1, num_units) 24 | gru_bw = tf.contrib.cudnn_rnn.CudnnGRU(1, num_units) 25 | init_fw = tf.tile(tf.Variable( 26 | tf.zeros([1, 1, num_units])), [1, batch_size, 1]) 27 | init_bw = tf.tile(tf.Variable( 28 | tf.zeros([1, 1, num_units])), [1, batch_size, 1]) 29 | mask_fw = dropout(tf.ones([1, batch_size, input_size_], dtype=tf.float32), 30 | keep_prob=keep_prob, is_train=is_train, mode=None) 31 | mask_bw = dropout(tf.ones([1, batch_size, input_size_], dtype=tf.float32), 32 | keep_prob=keep_prob, is_train=is_train, mode=None) 33 | self.grus.append((gru_fw, gru_bw, )) 34 | self.inits.append((init_fw, init_bw, )) 35 | self.dropout_mask.append((mask_fw, mask_bw, )) 36 | 37 | def __call__(self, inputs, seq_len, keep_prob=1.0, is_train=None, concat_layers=True): 38 | # cudnn GRU需要交换张量的维度,可能是便于计算 39 | outputs = [tf.transpose(inputs, [1, 0, 2])] 40 | for layer in range(self.num_layers): 41 | gru_fw, gru_bw = self.grus[layer] 42 | init_fw, init_bw = self.inits[layer] 43 | mask_fw, mask_bw = self.dropout_mask[layer] 44 | with tf.variable_scope("fw_{}".format(layer)): 45 | out_fw, _ = gru_fw( 46 | outputs[-1] * mask_fw, initial_state=(init_fw, )) 47 | with tf.variable_scope("bw_{}".format(layer)): 48 | inputs_bw = tf.reverse_sequence( 49 | outputs[-1] * mask_bw, seq_lengths=seq_len, seq_dim=0, batch_dim=1) 50 | out_bw, _ = gru_bw(inputs_bw, initial_state=(init_bw, )) 51 | out_bw = tf.reverse_sequence( 52 | out_bw, seq_lengths=seq_len, seq_dim=0, 
batch_dim=1) 53 | outputs.append(tf.concat([out_fw, out_bw], axis=2)) 54 | if concat_layers: 55 | res = tf.concat(outputs[1:], axis=2) 56 | else: 57 | res = outputs[-1] 58 | res = tf.transpose(res, [1, 0, 2]) 59 | return res 60 | 61 | 62 | class native_gru: 63 | 64 | def __init__(self, num_layers, num_units, batch_size, input_size, keep_prob=1.0, is_train=None, scope="native_gru"): 65 | self.num_layers = num_layers 66 | self.grus = [] 67 | self.inits = [] 68 | self.dropout_mask = [] 69 | self.scope = scope 70 | for layer in range(num_layers): 71 | input_size_ = input_size if layer == 0 else 2 * num_units 72 | # 双向Bi-GRU f:forward b:back 73 | gru_fw = tf.contrib.rnn.GRUCell(num_units) 74 | gru_bw = tf.contrib.rnn.GRUCell(num_units) 75 | # tf.tile 平铺给定的张量,这里是将初始状态扩张到batch_size倍 76 | init_fw = tf.tile(tf.Variable( 77 | tf.zeros([1, num_units])), [batch_size, 1]) 78 | init_bw = tf.tile(tf.Variable( 79 | tf.zeros([1, num_units])), [batch_size, 1]) 80 | mask_fw = dropout(tf.ones([batch_size, 1, input_size_], dtype=tf.float32), 81 | keep_prob=keep_prob, is_train=is_train, mode=None) 82 | mask_bw = dropout(tf.ones([batch_size, 1, input_size_], dtype=tf.float32), 83 | keep_prob=keep_prob, is_train=is_train, mode=None) 84 | self.grus.append((gru_fw, gru_bw, )) 85 | self.inits.append((init_fw, init_bw, )) 86 | self.dropout_mask.append((mask_fw, mask_bw, )) 87 | 88 | def __call__(self, inputs, seq_len, concat_layers=True): 89 | """ 90 | 运行RNN 91 | 这里的keep_prob和is_train没用,在__init__中就已设置好了 92 | """ 93 | outputs = [inputs] 94 | with tf.variable_scope(self.scope): 95 | for layer in range(self.num_layers): 96 | gru_fw, gru_bw = self.grus[layer] 97 | init_fw, init_bw = self.inits[layer] 98 | mask_fw, mask_bw = self.dropout_mask[layer] 99 | # 正向RNN 100 | with tf.variable_scope("fw_{}".format(layer)): 101 | # 每一层使用上层的输出 102 | # dynamic_rnn中的超过seq_len的部分就不计算了,state直接重复,output直接清零,节省资源 103 | out_fw, _ = tf.nn.dynamic_rnn( 104 | gru_fw, outputs[-1] * mask_fw, seq_len, initial_state=init_fw, dtype=tf.float32) 105 | # 反向RNN 106 | with tf.variable_scope("bw_{}".format(layer)): 107 | inputs_bw = tf.reverse_sequence( 108 | outputs[-1] * mask_bw, seq_lengths=seq_len, seq_dim=1, batch_dim=0) 109 | out_bw, _ = tf.nn.dynamic_rnn( 110 | gru_bw, inputs_bw, seq_len, initial_state=init_bw, dtype=tf.float32) 111 | out_bw = tf.reverse_sequence( 112 | out_bw, seq_lengths=seq_len, seq_dim=1, batch_dim=0) 113 | # 正向输出和反向输出合并 114 | outputs.append(tf.concat([out_fw, out_bw], axis=2)) 115 | if concat_layers: 116 | res = tf.concat(outputs[1:], axis=2) 117 | else: 118 | res = outputs[-1] 119 | return res 120 | 121 | 122 | def dropout(args, keep_prob, is_train, mode="recurrent"): 123 | """ 124 | dropout层,args初始是1.0 125 | """ 126 | if keep_prob < 1.0: 127 | noise_shape = None 128 | scale = 1.0 129 | shape = tf.shape(args) 130 | if mode == "embedding": 131 | noise_shape = [shape[0], 1] 132 | scale = keep_prob 133 | if mode == "recurrent" and len(args.get_shape().as_list()) == 3: 134 | noise_shape = [shape[0], 1, shape[-1]] 135 | args = tf.cond(is_train, lambda: tf.nn.dropout( 136 | args, keep_prob, noise_shape=noise_shape) * scale, lambda: args) 137 | return args 138 | 139 | 140 | def softmax_mask(val, mask): 141 | """ 142 | 作用是给空值处减小注意力 143 | """ 144 | return -INF * (1 - tf.cast(mask, tf.float32)) + val # tf.cast:true转为1.0,false转为0.0 145 | 146 | 147 | def summ(memory, hidden, mask, keep_prob=1.0, is_train=None, scope="summ"): 148 | """ 149 | 对question进行最后一步的处理,可以看作是pooling吗 150 | """ 151 | with tf.variable_scope(scope): 152 | d_memory = 
dropout(memory, keep_prob=keep_prob, is_train=is_train) 153 | s0 = tf.nn.tanh(dense(d_memory, hidden, scope="s0")) 154 | s = dense(s0, 1, use_bias=False, scope="s") 155 | # tf.squeeze把长度只有1的维度去掉 156 | # s1:[batch_size, c_maxlen] 157 | s1 = softmax_mask(tf.squeeze(s, [2]), mask) 158 | a = tf.expand_dims(tf.nn.softmax(s1), axis=2) 159 | res = tf.reduce_sum(a * memory, axis=1) # 逐元素相乘,shape跟随memory一致 160 | return res # [batch_size, 2*hidden] 161 | 162 | 163 | def dot_attention(inputs, memory, mask, hidden, keep_prob=1.0, is_train=None, scope="dot_attention"): 164 | """ 165 | 门控attention层 166 | """ 167 | with tf.variable_scope(scope): 168 | 169 | d_inputs = dropout(inputs, keep_prob=keep_prob, is_train=is_train) 170 | d_memory = dropout(memory, keep_prob=keep_prob, is_train=is_train) 171 | JX = tf.shape(inputs)[1] # inputs的1维度,应该是c_maxlen 172 | 173 | with tf.variable_scope("attention"): 174 | # inputs_的shape:[batch_size, c_maxlen, hidden] 175 | inputs_ = tf.nn.relu( 176 | dense(d_inputs, hidden, use_bias=False, scope="inputs")) 177 | memory_ = tf.nn.relu( 178 | dense(d_memory, hidden, use_bias=False, scope="memory")) 179 | # 三维矩阵相乘,结果的shape是[batch_size, c_maxlen, q_maxlen] 180 | outputs = tf.matmul(inputs_, tf.transpose( 181 | memory_, [0, 2, 1])) / (hidden ** 0.5) 182 | # 将mask平铺成与outputs相同的形状,这里考虑,改进成input和memory都需要mask 183 | mask = tf.tile(tf.expand_dims(mask, axis=1), [1, JX, 1]) 184 | logits = tf.nn.softmax(softmax_mask(outputs, mask)) 185 | outputs = tf.matmul(logits, memory) 186 | # res:[batch_size, c_maxlen, 12*hidden] 187 | res = tf.concat([inputs, outputs], axis=2) 188 | 189 | with tf.variable_scope("gate"): 190 | """ 191 | attention * gate 192 | """ 193 | dim = res.get_shape().as_list()[-1] 194 | d_res = dropout(res, keep_prob=keep_prob, is_train=is_train) 195 | gate = tf.nn.sigmoid(dense(d_res, dim, use_bias=False)) 196 | return res * gate # 向量的逐元素相乘 197 | 198 | 199 | # 写一个谷歌论文中新的attention模块 200 | def multihead_attention(Q, K, V, mask, hidden, head_num=4, keep_prob=1.0, is_train=None, has_gate=True, scope="multihead_attention"): 201 | """ 202 | Q : passage 203 | K,V: question 204 | mask: Q的mask 205 | """ 206 | size = int(hidden / head_num) # 每个attention的大小 207 | 208 | with tf.variable_scope(scope): 209 | d_Q = dropout(Q, keep_prob=keep_prob, is_train=is_train) 210 | d_K = dropout(K, keep_prob=keep_prob, is_train=is_train) 211 | JX = tf.shape(Q)[1] 212 | 213 | with tf.variable_scope("attention"): 214 | Q_ = tf.nn.relu(dense(d_Q, hidden, use_bias=False, scope="Q")) 215 | K_ = tf.nn.relu(dense(d_K, hidden, use_bias=False, scope="K")) 216 | V_ = tf.nn.relu(dense(V, hidden, use_bias=False, scope="V")) 217 | Q_ = tf.reshape(Q_, (-1, tf.shape(Q_)[1], head_num, size)) 218 | K_ = tf.reshape(K_, (-1, tf.shape(K_)[1], head_num, size)) 219 | V_ = tf.reshape(V_, (-1, tf.shape(V_)[1], head_num, size)) 220 | Q_ = tf.transpose(Q_, [0, 2, 1, 3]) 221 | K_ = tf.transpose(K_, [0, 2, 1, 3]) 222 | V_ = tf.transpose(V_, [0, 2, 1, 3]) 223 | # scale:[batch_size, head_num, c_maxlen, q_maxlen] 224 | scale = tf.matmul(Q_, K_, transpose_b=True) / tf.sqrt(float(size)) 225 | scale = tf.transpose(scale, [0, 3, 2, 1]) 226 | for _ in range(len(scale.shape) - 2): 227 | mask = tf.expand_dims(mask, axis=2) 228 | mask_scale = softmax_mask(scale, mask) 229 | mask_scale = tf.transpose(scale, [0, 3, 2, 1]) 230 | logits = tf.nn.softmax(mask_scale) 231 | outputs = tf.matmul(logits, V_) # [b,h,c,s] 232 | outputs = tf.transpose(outputs, [0, 2, 1, 3]) 233 | # [batch_size, c_maxlen, hidden] 234 | outputs = tf.reshape(outputs, 
(-1, tf.shape(Q)[1], hidden)) 235 | # res连接 236 | res = tf.concat([Q, outputs], axis=2) 237 | 238 | if has_gate: 239 | with tf.variable_scope("gate"): 240 | dim = res.get_shape().as_list()[-1] 241 | d_res = dropout(res, keep_prob=keep_prob, is_train=is_train) 242 | gate = tf.nn.sigmoid(dense(d_res, dim, use_bias=False)) 243 | return res * gate 244 | else: 245 | return res 246 | 247 | 248 | def dense(inputs, hidden, use_bias=True, scope="dense"): 249 | """ 250 | 全连接层 251 | """ 252 | with tf.variable_scope(scope): 253 | shape = tf.shape(inputs) 254 | dim = inputs.get_shape().as_list()[-1] 255 | out_shape = [shape[idx] for idx in range( 256 | len(inputs.get_shape().as_list()) - 1)] + [hidden] 257 | # 三维的inputs,reshape成二维 258 | flat_inputs = tf.reshape(inputs, [-1, dim]) 259 | W = tf.get_variable("W", [dim, hidden]) 260 | res = tf.matmul(flat_inputs, W) 261 | if use_bias: 262 | b = tf.get_variable( 263 | "b", [hidden], initializer=tf.constant_initializer(0.)) 264 | res = tf.nn.bias_add(res, b) 265 | # outshape就是input的最后一维变成hidden 266 | res = tf.reshape(res, out_shape) 267 | return res 268 | -------------------------------------------------------------------------------- /baseline/util.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | util.py:一些工具 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import numpy as np 11 | import re 12 | from collections import Counter 13 | import string 14 | 15 | 16 | def get_record_parser(config): 17 | def parse(example): 18 | para_limit = config.para_limit 19 | ques_limit = config.ques_limit 20 | features = tf.parse_single_example(example, 21 | features={ 22 | "passage_idxs": tf.FixedLenFeature([], tf.string), 23 | "question_idxs": tf.FixedLenFeature([], tf.string), 24 | "answer": tf.FixedLenFeature([], tf.int64), 25 | "id": tf.FixedLenFeature([], tf.int64) 26 | }) 27 | # tf.decode_raw: 将字符串的字节重新解释为数字向量 28 | passage_idxs = tf.reshape(tf.decode_raw( 29 | features["passage_idxs"], tf.int32), [para_limit]) 30 | question_idxs = tf.reshape(tf.decode_raw( 31 | features["question_idxs"], tf.int32), [ques_limit]) 32 | answer = features["answer"] 33 | qa_id = features["id"] 34 | return passage_idxs, question_idxs, answer, qa_id 35 | return parse 36 | 37 | 38 | def get_batch_dataset(record_file, parser, config): 39 | """ 40 | 训练数据集TFRecordDataset的batch生成器。 41 | Args: 42 | record_file: 训练数据tf_record路径 43 | parser: 数据存储的格式 44 | config: 超参数 45 | """ 46 | num_threads = tf.constant(config.num_threads, dtype=tf.int32) 47 | dataset = tf.data.TFRecordDataset(record_file).map( 48 | parser, num_parallel_calls=num_threads).shuffle(config.capacity).repeat() 49 | if config.is_bucket: 50 | # bucket方法,用于解决序列长度不同的mini-batch的计算效率问题 51 | buckets = [tf.constant(num) for num in range(*config.bucket_range)] 52 | 53 | def key_func(context_idxs, ques_idxs, context_char_idxs, ques_char_idxs, y1, y2, qa_id): 54 | c_len = tf.reduce_sum( 55 | tf.cast(tf.cast(context_idxs, tf.bool), tf.int32)) 56 | buckets_min = [np.iinfo(np.int32).min] + buckets 57 | buckets_max = buckets + [np.iinfo(np.int32).max] 58 | conditions_c = tf.logical_and( 59 | tf.less(buckets_min, c_len), tf.less_equal(c_len, buckets_max)) 60 | bucket_id = tf.reduce_min(tf.where(conditions_c)) 61 | return bucket_id 62 | 63 | def reduce_func(key, elements): 64 | return elements.batch(config.batch_size) 65 | 66 | dataset = dataset.apply(tf.contrib.data.group_by_window( 67 | key_func, reduce_func, window_size=5 * 
config.batch_size)).shuffle(len(buckets) * 25) 68 | else: 69 | dataset = dataset.batch(config.batch_size) 70 | return dataset 71 | 72 | 73 | def get_dataset(record_file, parser, config): 74 | num_threads = tf.constant(config.num_threads, dtype=tf.int32) 75 | dataset = tf.data.TFRecordDataset(record_file).map( 76 | parser, num_parallel_calls=num_threads).repeat().batch(config.batch_size) 77 | return dataset 78 | 79 | 80 | def evaluate_acc(truth_dict, answer_dict): 81 | """ 82 | 计算准确率,还可以设计返回正确问题和错误问题列表 83 | """ 84 | total = 0 85 | right = 0 86 | wrong = 0 87 | for key, value in answer_dict.items(): 88 | total += 1 89 | ground_truths = truth_dict[key] 90 | prediction = value 91 | if prediction == ground_truths: 92 | right += 1 93 | else: 94 | wrong += 1 95 | accuracy = (right / total) * 1.0 96 | return {"accuracy": accuracy} 97 | 98 | def f1_score(truth_dict, answer_dict): 99 | """ 100 | 计算平均f1分数 101 | """ 102 | 103 | 104 | -------------------------------------------------------------------------------- /best_single_model/config.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | config.py:配置文件,程序运行入口 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import os 10 | import tensorflow as tf 11 | import data_process_addAnswer 12 | from main import train, test 13 | from file_save import * 14 | from examine_dev import examine_dev 15 | 16 | flags = tf.flags 17 | os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" 18 | 19 | train_file = os.path.join("file", "ai_challenger_oqmrc_trainingset.json") 20 | dev_file = os.path.join("file", "ai_challenger_oqmrc_validationset.json") 21 | test_file = os.path.join("file", "ai_challenger_oqmrc_testa.json") 22 | 23 | target_dir = "data" 24 | log_dir = "log/event" 25 | save_dir = "log/model" 26 | prediction_dir = "log/prediction" 27 | train_record_file = os.path.join(target_dir, "train.tfrecords") 28 | dev_record_file = os.path.join(target_dir, "dev.tfrecords") 29 | test_record_file = os.path.join(target_dir, "test.tfrecords") 30 | id2vec_file = os.path.join(target_dir, "id2vec.json") # id号->向量 31 | word2id_file = os.path.join(target_dir, "word2id.json") # 词->id号 32 | train_eval = os.path.join(target_dir, "train_eval.json") 33 | dev_eval = os.path.join(target_dir, "dev_eval.json") 34 | test_eval = os.path.join(target_dir, "test_eval.json") 35 | 36 | if not os.path.exists(target_dir): 37 | os.makedirs(target_dir) 38 | if not os.path.exists(log_dir): 39 | os.makedirs(log_dir) 40 | if not os.path.exists(save_dir): 41 | os.makedirs(save_dir) 42 | if not os.path.exists(prediction_dir): 43 | os.makedirs(prediction_dir) 44 | 45 | flags.DEFINE_string("mode", "train", "train/debug/test") 46 | flags.DEFINE_string("gpu", "0", "0/1") 47 | flags.DEFINE_string("experiment", "lalala", "每次存不同模型分不同的文件夹") 48 | flags.DEFINE_string("model_name", "default", "选取不同的模型") 49 | 50 | flags.DEFINE_string("target_dir", target_dir, "") 51 | flags.DEFINE_string("log_dir", log_dir, "") 52 | flags.DEFINE_string("save_dir", save_dir, "") 53 | flags.DEFINE_string("prediction_dir", prediction_dir, "") 54 | flags.DEFINE_string("train_file", train_file, "") 55 | flags.DEFINE_string("dev_file", dev_file, "") 56 | flags.DEFINE_string("test_file", test_file, "") 57 | 58 | flags.DEFINE_string("train_record_file", train_record_file, "") 59 | flags.DEFINE_string("dev_record_file", dev_record_file, "") 60 | flags.DEFINE_string("test_record_file", test_record_file, "") 61 | flags.DEFINE_string("train_eval_file", train_eval, "") 
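# ----------------------------------------------------------------------
# Annotation (added for clarity; not part of the original config.py).
# The *_eval.json files referenced by the flags around this point map each
# query_id to its three alternatives in a fixed affirmative/negative/uncertain
# (肯定/否定/不确定) order, so a predicted class index can be mapped back to
# answer text. A minimal sketch, assuming a hypothetical id "250001":
#
#   import json
#   with open(dev_eval, "r") as fh:
#       dev_eval_file = json.load(fh)   # e.g. {"250001": ["热", "不热", "无法确定"]}
#   predicted_class = 1                 # e.g. taken from model.classes
#   answer_text = dev_eval_file["250001"][predicted_class]   # -> "不热"
# ----------------------------------------------------------------------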
62 | flags.DEFINE_string("dev_eval_file", dev_eval, "") 63 | flags.DEFINE_string("test_eval_file", test_eval, "") 64 | flags.DEFINE_string("word2id_file", word2id_file, "") 65 | flags.DEFINE_string("id2vec_file", id2vec_file, "") 66 | 67 | flags.DEFINE_integer("para_limit", 150, "Limit length for paragraph") 68 | flags.DEFINE_integer("ques_limit", 30, "Limit length for question") 69 | flags.DEFINE_integer("ans_limit", 5, "Limit length for Answer") 70 | flags.DEFINE_integer("min_count", 1, "embedding 的最小出现次数") 71 | flags.DEFINE_integer("embedding_size", 300, "the dimension of vector") 72 | 73 | flags.DEFINE_integer("capacity", 15000, "Batch size of dataset shuffle") 74 | flags.DEFINE_integer("num_threads", 4, "Number of threads in input pipeline") 75 | # 使用cudnn训练,提升6倍速度 76 | flags.DEFINE_boolean("use_cudnn", True, "Whether to use cudnn (only for GPU)") 77 | flags.DEFINE_boolean("is_bucket", False, "Whether to use bucketing") 78 | flags.DEFINE_list("bucket_range", [40, 361, 40], "range of bucket") 79 | 80 | flags.DEFINE_integer("batch_size", 64, "Batch size") 81 | flags.DEFINE_integer("num_steps", 300000, "Number of steps") 82 | flags.DEFINE_integer("checkpoint", 1000, "checkpoint for evaluation") 83 | flags.DEFINE_integer("period", 500, "period to save batch loss") 84 | flags.DEFINE_integer("val_num_batches", 150, "Num of batches for evaluation") 85 | # 关于学习率 86 | flags.DEFINE_float("init_learning_rate", 0.001, 87 | "Initial learning rate for Adam") 88 | flags.DEFINE_float("init_emb_lr", 0., "") 89 | flags.DEFINE_boolean("training_embedding", False, "") 90 | 91 | flags.DEFINE_float("keep_prob", 0.7, "Keep prob in rnn") 92 | flags.DEFINE_float("grad_clip", 5.0, "Global Norm gradient clipping rate") 93 | flags.DEFINE_integer("hidden", 60, "Hidden size") # best:128 94 | flags.DEFINE_integer("patience", 3, "Patience for learning rate decay") 95 | flags.DEFINE_string("optimizer", "Adam", "") 96 | flags.DEFINE_string("loss_function", "default", "") 97 | 98 | 99 | def main(_): 100 | config = flags.FLAGS 101 | os.environ["CUDA_VISIBLE_DEVICES"] = config.gpu # 选择一块gpu 102 | if config.mode == "train": 103 | train(config) 104 | elif config.mode == "prepro": 105 | data_process_addAnswer.prepro(config) 106 | elif config.mode == "test": 107 | test(config) 108 | elif config.mode == "examine": 109 | examine_dev(config) 110 | elif config.mode == "save_dev": 111 | save_dev(config) 112 | elif config.mode == "save_test": 113 | save_test(config) 114 | else: 115 | print("Unknown mode") 116 | exit(0) 117 | 118 | 119 | if __name__ == "__main__": 120 | tf.app.run() 121 | -------------------------------------------------------------------------------- /best_single_model/data_process_addAnswer.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | data_process_addAnswer.py:数据预处理代码, 加入alternatives的语义,以及特征工程。 5 | 6 | @author: haomaojie 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import pandas as pd 10 | import time 11 | import json 12 | import jieba 13 | import csv 14 | import word2vec 15 | import re 16 | import random 17 | import tensorflow as tf 18 | import numpy as np 19 | from tqdm import tqdm # 进度条 20 | import os 21 | import gensim 22 | 23 | 24 | def read_data(json_path, output_path, line_count): 25 | ''' 26 | 读取json文件并转成Dataframe 27 | ''' 28 | start_time = time.time() 29 | data = [] 30 | with open(json_path, 'r') as f: 31 | for i in range(line_count): 32 | data_list = json.loads(f.readline()) 33 | data.append([data_list['passage'], 
data_list['query']]) 34 | df = pd.DataFrame(data, columns=['passage', 'query']) 35 | df.to_csv(output_path, index=False) 36 | print('转化成功,已生成csv文件') 37 | end_time = time.time() 38 | print(end_time - start_time) 39 | 40 | 41 | def de_word(data_path, out_path): 42 | ''' 43 | 分词 44 | ''' 45 | start_time = time.time() 46 | word = [] 47 | data_file = open(data_path).read().split('\n') 48 | for i in range(len(data_file)): 49 | result = [] 50 | seg_list = jieba.cut(data_file[i]) 51 | for w in seg_list: 52 | result.append(w) 53 | word.append(result) 54 | print('分词完成') 55 | with open(out_path, 'w+') as txt_write: 56 | for i in range(len(word)): 57 | s = str(word[i]).replace( 58 | '[', '').replace(']', '') # 去除[],这两行按数据不同,可以选择 59 | s = s.replace("'", '').replace(',', '') + \ 60 | '\n' # 去除单引号,逗号,每行末尾追加换行符 61 | txt_write.write(s) 62 | print('保存成功') 63 | end_time = time.time() 64 | print(end_time - start_time) 65 | 66 | 67 | def word_vec(file_txt, file_bin, min_count, size): 68 | word2vec.word2vec(file_txt, file_bin, min_count=min_count, 69 | size=size, verbose=True) 70 | 71 | 72 | def merge_csv(target_dir, output_file): 73 | for inputfile in [os.path.join(target_dir, 'train_oridata.csv'), 74 | os.path.join(target_dir, 'test_oridata.csv'), os.path.join(target_dir, 'validation_oridata.csv')]: 75 | data = pd.read_csv(inputfile) 76 | df = pd.DataFrame(data) 77 | df.to_csv(output_file, mode='a', index=False) 78 | 79 | # 词转id,id转向量 80 | 81 | 82 | def transfer(model_path, embedding_size): 83 | start_time = time.time() 84 | model = word2vec.load(model_path) 85 | word2id_dic = {} 86 | init_0 = [0.0 for i in range(embedding_size)] 87 | id2vec_dic = [init_0] 88 | for i in range(len(model.vocab)): 89 | id = i + 1 90 | word2id_dic[model.vocab[i]] = id 91 | id2vec_dic.append(model[model.vocab[i]].tolist()) 92 | end_time = time.time() 93 | print('词转id,id转向量完成') 94 | print(end_time - start_time) 95 | return word2id_dic, id2vec_dic 96 | 97 | 98 | def transfer_txt(model_path, embedding_size): 99 | print("开始转换...") 100 | start_time = time.time() 101 | model = gensim.models.KeyedVectors.load_word2vec_format( 102 | model_path, binary=False) 103 | word_dic = model.wv.vocab 104 | word2id_dic = {} 105 | init_0 = [0.0 for i in range(embedding_size)] 106 | id2vec_dic = [init_0] 107 | id = 1 108 | for i in word_dic: 109 | word2id_dic[i] = id 110 | id2vec_dic.append(model[i].tolist()) 111 | id += 1 112 | end_time = time.time() 113 | print('词转id,id转向量完成') 114 | print(end_time - start_time) 115 | return word2id_dic, id2vec_dic 116 | 117 | # 存入json文件 118 | 119 | 120 | def save_json(output_path, dic_data, message=None): 121 | start_time = time.time() 122 | if message is not None: 123 | print("Saving {}...".format(message)) 124 | with open(output_path, "w") as fh: 125 | json.dump(dic_data, fh, ensure_ascii=False, indent=4) 126 | print('保存完成') 127 | end_time = time.time() 128 | print(end_time - start_time) 129 | 130 | # 将原文中的passage,query,alternative,answer,query_id转成id号 131 | # 输入参数为词典的位置和训练集的位置 132 | 133 | 134 | def TrainningsetProcess(dic_url, dataset_url, passage_len_limit): 135 | res = [] # 最后返回的结果 136 | rule = re.compile(r'\|') 137 | id2alternatives = {} 138 | # 读取字典 139 | with open(dic_url, 'r', encoding='utf-8') as dic_file: 140 | dic = dict() 141 | dic = json.load(dic_file) 142 | # 读取训练集 143 | over_limit = 0 144 | ans_over_limit = 0 145 | with open(dataset_url, 'r', encoding='utf-8') as ts_file: 146 | for file_line in ts_file: 147 | line = json.loads(file_line) # 读取一行json文件 148 | this_line_res = dict() # 变量定义,代表这一行映射之后的结果 
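# ----------------------------------------------------------------------
# Annotation (added for clarity; not part of the original function). A rough
# sketch of what one pass of this loop is expected to produce; the values are
# hypothetical and the real ids depend on the word2id dictionary built above:
#
#   input line    : {"passage": "...", "query": "天气热吗", "query_id": 1,
#                    "alternatives": "热|不热|无法确定", "answer": "热"}
#   this_line_res : {"passage": [14, 2, 77, ...],   # word ids, cut to para_limit
#                    "query": [9, 51],              # word ids
#                    "alternatives": [[23], [6, 23], [88, 90]],  # affirmative / negative / uncertain
#                    "answer": 0,                    # index of the true answer in that order
#                    "query_id": 1}
# ----------------------------------------------------------------------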
149 | passage = line['passage'] 150 | alternatives = line['alternatives'] 151 | query = line['query'] 152 | if dataset_url.find('test') == -1: 153 | answer = line['answer'] 154 | query_idx = line['query_id'] 155 | 156 | # 用jieba将passage和query分词,lcut返回list 157 | passage_cut = jieba.lcut(passage, cut_all=False) 158 | query_cut = jieba.lcut(query, cut_all=False) 159 | 160 | # 用词典将passage和query映射到id 161 | passage_id = [] 162 | query_id = [] 163 | for each_passage_word in passage_cut: 164 | passage_id.append(dic.get(each_passage_word)) 165 | for each_query_word in query_cut: 166 | query_id.append(dic.get(each_query_word)) 167 | 168 | # 对选项进行排序 169 | alternatives_cut = re.split(rule, alternatives) 170 | alternatives_cut = [s.strip() for s in alternatives_cut] 171 | tmp = [0, 0, 0] 172 | 173 | # 选项少于三个 174 | if len(alternatives_cut) == 1: 175 | alternatives_cut.append("") 176 | alternatives_cut.append("") 177 | if len(alternatives_cut) == 2: 178 | alternatives_cut.append(" 1 or alternatives.count("没") > 1: 190 | if dataset_url.find('test') != -1: 191 | tmp[0] = alternatives_cut[0] 192 | tmp[1] = alternatives_cut[1] 193 | tmp[2] = alternatives_cut[2] 194 | else: 195 | print(2) 196 | continue # 第64772条数据 197 | if alternatives.find("没") != -1 and alternatives.find("不") != -1 and alternatives.find("不确定") == -1: 198 | print(3) 199 | continue # 第144146条数据 200 | if "不确定" in alternatives_cut and "无法确定" in alternatives_cut: 201 | tmp[0] = "确定" 202 | tmp[1] = "不确定" 203 | tmp[2] = "无法确定" 204 | # 肯定/否定/无法确定 205 | elif alternatives.find("不") != -1 or alternatives.find("没") != -1: 206 | if alternatives.count("不") == 1 and alternatives.find("不确定") != -1: 207 | alternatives_cut.remove("不确定") 208 | alternatives_cut.append("不确定") 209 | tmp[0] = alternatives_cut[0] 210 | tmp[1] = alternatives_cut[1] 211 | tmp[2] = alternatives_cut[2] 212 | elif alternatives.count("不") > 1: 213 | if alternatives.find("不确定") == -1: 214 | if dataset_url.find("test") != -1: 215 | tmp[0] = alternatives_cut[0] 216 | tmp[1] = alternatives_cut[1] 217 | tmp[2] = alternatives_cut[2] 218 | else: 219 | print(line) 220 | continue 221 | else: 222 | alternatives_cut.remove("不确定") 223 | if alternatives_cut[0].find("不") != -1: 224 | tmp[1] = alternatives_cut[0] 225 | tmp[0] = alternatives_cut[1] 226 | else: 227 | tmp[1] = alternatives_cut[1] 228 | tmp[0] = alternatives_cut[0] 229 | alternatives_cut.append("不确定") 230 | tmp[2] = alternatives_cut[2] 231 | else: 232 | for tmp_alternatives in alternatives_cut: 233 | if tmp_alternatives.find("无法") != -1: 234 | tmp[2] = tmp_alternatives 235 | elif tmp_alternatives.find("不") != -1 or tmp_alternatives.find("没") != -1: 236 | tmp[1] = tmp_alternatives 237 | else: 238 | tmp[0] = tmp_alternatives 239 | # 无明显肯定与否定词义 240 | else: 241 | for tmp_alternatives in alternatives_cut: 242 | if tmp_alternatives.find("无法") != -1 or alternatives.find("不确定") != -1: 243 | alternatives_cut.remove(tmp_alternatives) 244 | alternatives_cut.append(tmp_alternatives) 245 | break 246 | tmp[0] = alternatives_cut[0] 247 | tmp[1] = alternatives_cut[1] 248 | tmp[2] = alternatives_cut[2] 249 | 250 | # 根据tmp列表生成answer_id 251 | if dataset_url.find('test') == -1: 252 | answer_id = tmp.index(answer.strip()) 253 | 254 | # 将tmp列表分词存id 255 | tmp_id = [] 256 | for ans in tmp: 257 | if ans == None or ans == "": 258 | tmp_id.append([0]) 259 | else: 260 | ans = jieba.lcut(ans, cut_all=False) 261 | if len(ans) > 5: 262 | ans = ans[:5] 263 | ans_over_limit += 1 264 | tmp_id.append([dic.get(x) for x in ans]) 265 | 266 | # 得到这一行映射后的结果,是dict类型的数据 267 | if 
len(passage_id) > passage_len_limit: 268 | passage_id = passage_id[:passage_len_limit] 269 | over_limit += 1 270 | this_line_res['passage'] = passage_id 271 | this_line_res['query'] = query_id 272 | this_line_res['alternatives'] = tmp_id 273 | if dataset_url.find('test') == -1: 274 | this_line_res['answer'] = answer_id 275 | this_line_res['query_id'] = query_idx 276 | # 创建query_id到alternatives的字典,保存为json 277 | id2alternatives[query_idx] = tmp 278 | res.append(this_line_res) 279 | print(len(res)) 280 | print("over_limit:{}".format(over_limit)) 281 | print("ans_over_limit:{}".format(ans_over_limit)) 282 | return res, id2alternatives 283 | 284 | 285 | def data_process(config): 286 | target_dir = config.target_dir 287 | # 这里如果使用自己训练好的词向量就可以注释掉 288 | read_data(config.train_file, os.path.join( 289 | target_dir, 'train_oridata.csv'), 250000) # 250000 290 | read_data(config.test_file, os.path.join( 291 | target_dir, 'test_oridata.csv'), 10000) # 10000 292 | read_data(config.dev_file, os.path.join( 293 | target_dir, 'validation_oridata.csv'), 30000) # 30000 294 | merge_csv(target_dir, os.path.join(target_dir, 'ori_data.csv')) 295 | de_word(os.path.join(target_dir, 'ori_data.csv'), 296 | os.path.join(target_dir, 'seg_list.txt')) 297 | word_vec(os.path.join(target_dir, 'seg_list.txt'), 298 | os.path.join(target_dir, 'seg_listWord2Vec.bin'), config.min_count, config.embedding_size) 299 | # 如果是用外部词向量,从这里开始 300 | # word2id_dic, id2vec_dic = transfer_txt( 301 | # os.path.join(target_dir, 'baidu_300_wc+ng_sgns.baidubaike.bigram-char.txt'), config.embedding_size) 302 | word2id_dic, id2vec_dic = transfer( 303 | os.path.join(target_dir, 'seg_listWord2Vec.bin'), config.embedding_size) 304 | save_json(config.word2id_file, word2id_dic, "word to id") 305 | save_json(config.id2vec_file, id2vec_dic, "id to vec") 306 | train_examples, train_id2alternatives = TrainningsetProcess( 307 | config.word2id_file, config.train_file, config.para_limit) 308 | test_examples, test_id2alternatives = TrainningsetProcess( 309 | config.word2id_file, config.test_file, config.para_limit) 310 | validation_examples, validation_id2alternatives = TrainningsetProcess( 311 | config.word2id_file, config.dev_file, config.para_limit) 312 | save_json(config.train_eval_file, train_id2alternatives, 313 | message='保存train每条数据的alternatives') 314 | save_json(config.test_eval_file, test_id2alternatives, 315 | message='保存test每条数据的alternatives') 316 | save_json(config.dev_eval_file, validation_id2alternatives, 317 | message='保存validation每条数据的alternatives') 318 | return train_examples, test_examples, validation_examples 319 | 320 | 321 | def build_features(config, examples, data_type, out_file, is_test=False): 322 | """ 323 | 将数据读入TFrecords 324 | """ 325 | 326 | para_limit = config.para_limit 327 | ques_limit = config.ques_limit 328 | ans_limit = config.ans_limit 329 | 330 | print("Processing {} examples...".format(data_type)) 331 | 332 | list_nlp=[] 333 | with open("nlp_feature.json","r") as f: 334 | list_nlp=json.load(f) 335 | 336 | writer = tf.python_io.TFRecordWriter(out_file) 337 | total = 0 338 | meta = {} 339 | random.shuffle(examples) # 先给打乱顺序 340 | for example in tqdm(examples): 341 | total += 1 342 | passage_idxs = np.zeros([para_limit], dtype=np.int32) 343 | question_idxs = np.zeros([ques_limit], dtype=np.int32) 344 | alternative_idxs = np.zeros([3, ans_limit], dtype=np.int32) 345 | 346 | for i, token in enumerate(example["passage"]): 347 | if token == None: 348 | passage_idxs[i] = 0 349 | else: 350 | passage_idxs[i] = token 351 | for i, 
token in enumerate(example["query"]): 352 | if token == None: 353 | question_idxs[i] = 0 354 | else: 355 | question_idxs[i] = token 356 | for i, token in enumerate(example["alternatives"]): 357 | for j, tk in enumerate(token): 358 | if tk == None: 359 | alternative_idxs[i][j] = 0 360 | else: 361 | alternative_idxs[i][j] = tk 362 | # print(example["passage"]) 363 | if not is_test: 364 | record = tf.train.Example(features=tf.train.Features(feature={ 365 | "passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 366 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 367 | "alternative_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[alternative_idxs.tostring()])), 368 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["answer"]])), 369 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"]])), 370 | "nlp_feature":tf.train.Feature(float_list=tf.train.FloatList(value=list_nlp[int(example["query_id"])-1][1:])) 371 | })) 372 | else: 373 | record = tf.train.Example(features=tf.train.Features(feature={ 374 | "passage_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[passage_idxs.tostring()])), 375 | "question_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[question_idxs.tostring()])), 376 | "alternative_idxs": tf.train.Feature(bytes_list=tf.train.BytesList(value=[alternative_idxs.tostring()])), 377 | "answer": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(-1)])), 378 | "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[example["query_id"]])) 379 | #,"nlp_feature": tf.train.Feature(float_list=tf.train.FloatList(value=list_nlp[int(example["query_id"]) - 1][1:])) 380 | })) 381 | # print(record) 382 | writer.write(record.SerializeToString()) 383 | print("Build {} instances of features in total".format(total)) 384 | writer.close() 385 | 386 | 387 | def prepro(config): 388 | """ 389 | 数据预处理函数 390 | """ 391 | train_examples, test_examples, dev_examples = data_process(config) 392 | 393 | # print(train_examples) 394 | # print(test_examples) 395 | # print(dev_examples) 396 | 397 | # train: 249778, test: 10000, dev: 29968 398 | # train: 439, test: 18, dev: 48 399 | 400 | build_features(config, train_examples, "train", config.train_record_file) 401 | build_features(config, dev_examples, "dev", config.dev_record_file) 402 | build_features(config, test_examples, "test", 403 | config.test_record_file, is_test=True) 404 | 405 | print("done!!!") 406 | -------------------------------------------------------------------------------- /best_single_model/examine_dev.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | examine_dev.py:检查验证集的结果,辅助分析。 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import os 14 | import codecs 15 | import time 16 | 17 | from model_addAnswer_newGraph import Model 18 | from util_addAnswer import * 19 | 20 | 21 | def examine_dev(config): 22 | """ 23 | 检查dev集的结果,辅助分析 24 | """ 25 | with open(config.id2vec_file, "r") as fh: 26 | id2vec = np.array(json.load(fh), dtype=np.float32) 27 | with open(config.dev_eval_file, "r") as fh: 28 | dev_eval_file = json.load(fh) 29 | 30 | total = 29968 31 | # 读取模型的路径和预测存储的路径 32 | save_dir = config.save_dir + config.experiment 33 | if not os.path.exists(save_dir): 34 | 
print("no save!") 35 | return 36 | predic_time = time.strftime("%Y-%m-%d_%H:%M:%S ", time.localtime()) 37 | os.path.join(config.prediction_dir, (predic_time + "_examine_dev.txt")) 38 | 39 | print("Loading model...") 40 | examine_batch = get_dataset(config.dev_record_file, get_record_parser( 41 | config), config).make_one_shot_iterator() 42 | 43 | model = Model(config, examine_batch, id2vec, trainable=False) 44 | 45 | sess_config = tf.ConfigProto(allow_soft_placement=True) 46 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 47 | sess_config.gpu_options.allow_growth = True 48 | 49 | print("examining ...") 50 | with tf.Session(config=sess_config) as sess: 51 | sess.run(tf.global_variables_initializer()) 52 | saver = tf.train.Saver() 53 | saver.restore(sess, tf.train.latest_checkpoint(save_dir)) 54 | sess.run(tf.assign(model.is_train, tf.constant(False, dtype=tf.bool))) 55 | answer_dict = {} 56 | truth_dict = {} 57 | for step in tqdm(range(total // config.batch_size + 1)): 58 | # 预测答案 59 | qa_id, answer, truth = sess.run( 60 | [model.qa_id, model.classes, model.answer]) 61 | answer_dict_ = {} 62 | truth_dict_ = {} 63 | for ids, tr, ans in zip(qa_id, truth, answer): 64 | answer_dict_[str(ids)] = ans 65 | truth_dict_[str(ids)] = tr 66 | answer_dict.update(answer_dict_) 67 | truth_dict.update(truth_dict_) 68 | metrics = evaluate_acc(truth_dict, answer_dict) 69 | print(len(truth_dict)) 70 | print(len(answer_dict)) 71 | print("accuracy:{}".format(metrics["accuracy"])) 72 | 73 | yes_predictions = [] # 正确答案是肯定的错题 74 | no_predictions = [] # 正确答案是否定的错题 75 | depend_predictions = [] # 正确答案是不确定的错题 76 | yes, no, depend = 0, 0, 0 77 | yes_wrong, no_wrong, depend_wrong = 0, 0, 0 78 | for key, value in answer_dict.items(): 79 | if truth_dict[key] != value: 80 | if truth_dict[key] == 0: 81 | yes += 1 82 | yes_wrong += 1 83 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 84 | wrong_answer = dev_eval_file[str(key)][value] 85 | yes_predictions.append( 86 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 87 | elif truth_dict[key] == 1: 88 | no += 1 89 | no_wrong += 1 90 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 91 | wrong_answer = dev_eval_file[str(key)][value] 92 | no_predictions.append( 93 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 94 | else: 95 | depend += 1 96 | depend_wrong += 1 97 | right_answer = dev_eval_file[str(key)][truth_dict[key]] 98 | wrong_answer = dev_eval_file[str(key)][value] 99 | depend_predictions.append( 100 | str(key) + '\t' + str(right_answer) + '\t' + str(wrong_answer)) 101 | else: 102 | if truth_dict[key] == 0: 103 | yes += 1 104 | elif truth_dict[key] == 1: 105 | no += 1 106 | else: 107 | depend += 1 108 | 109 | print("肯定型问题个数:{},否定型问题个数:{},不确定问题个数:{}".format(yes, no, depend)) 110 | print("肯定型问题正确率:{}".format((yes - yes_wrong) / yes * 1.0)) 111 | print("否定型问题正确率:{}".format((no - no_wrong) / no * 1.0)) 112 | print("不确定型问题正确率:{}".format((depend - depend_wrong) / depend * 1.0)) 113 | outputs_0 = u'\n'.join(yes_predictions) 114 | outputs_1 = u'\n'.join(no_predictions) 115 | outputs_2 = u'\n'.join(depend_predictions) 116 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_0.txt")), 'w', encoding='utf-8') as f: 117 | f.write(outputs_0) 118 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_1.txt")), 'w', encoding='utf-8') as f: 119 | f.write(outputs_1) 120 | with codecs.open(os.path.join(config.prediction_dir, (predic_time + "_examine_dev_2.txt")), 
'w', encoding='utf-8') as f: 121 | f.write(outputs_2) 122 | print("done!") 123 | -------------------------------------------------------------------------------- /best_single_model/file_save.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | file_save.py: 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import pickle 14 | import os 15 | import codecs 16 | import time 17 | from model_addAnswer_newGraph import Model 18 | from util_addAnswer import * 19 | 20 | 21 | def save_dev(config): 22 | """ 23 | 验证dev集的结果,保存文件 24 | """ 25 | with open(config.id2vec_file, "r") as fh: 26 | id2vec = np.array(json.load(fh), dtype=np.float32) 27 | with open(config.dev_eval_file, "r") as fh: 28 | dev_eval_file = json.load(fh) 29 | total = 29968 30 | 31 | print("Loading model...") 32 | sess_config = tf.ConfigProto(allow_soft_placement=True) 33 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 34 | sess_config.gpu_options.allow_growth = True 35 | 36 | truth_dict = {} 37 | predict_dict = {} 38 | logits_dict1 = {} 39 | 40 | print("正在模型预测!") 41 | g1 = tf.Graph() 42 | with tf.Session(graph=g1, config=sess_config) as sess1: 43 | with g1.as_default(): 44 | dev_batch1 = get_dataset(config.dev_record_file, get_record_parser( 45 | config), config).make_one_shot_iterator() 46 | model_1 = Model(config, dev_batch1, id2vec, trainable=False) 47 | sess1.run(tf.global_variables_initializer()) 48 | saver1 = tf.train.Saver() 49 | # 需要手动更改路径 50 | saver1.restore( 51 | sess1, "./log/modellalala/model_99000_devAcc_0.751301.ckpt") 52 | sess1.run(tf.assign(model_1.is_train, 53 | tf.constant(False, dtype=tf.bool))) 54 | for step in tqdm(range(total // config.batch_size + 1)): 55 | qa_id, logits, truths = sess1.run( 56 | [model_1.qa_id, model_1.logits, model_1.answer]) 57 | for ids, logits, truth in zip(qa_id, logits, truths): 58 | logits_dict1[str(ids)] = logits 59 | truth_dict[str(ids)] = truth 60 | if len(logits_dict1) != len(dev_eval_file): 61 | print("logits1 data number not match") 62 | 63 | a = tf.placeholder(shape=[3], dtype=tf.float32, name="me") 64 | softmax = tf.nn.softmax(a) 65 | for key, val in truth_dict.items(): 66 | value = sess1.run(softmax, feed_dict={a: logits_dict1[key]}) 67 | predict_dict[key] = value 68 | print("正在保存dev的softmax结果文件!") 69 | if not os.path.exists("./dev_soft"): 70 | os.makedirs("./dev_soft") 71 | with open("./dev_soft/nlp_model_0.7513.txt", "wb") as f1: # 手动更改保存的名字,路径不用改 72 | pickle.dump(predict_dict, f1) 73 | if not os.path.exists("./truth"): 74 | os.makedirs("./truth") 75 | with open("./truth/truth_dict.txt", "wb") as f2: # 不用改 76 | pickle.dump(truth_dict, f2) 77 | 78 | 79 | def save_test(config): 80 | """ 81 | 输出test集的结果,保存文件 82 | """ 83 | with open(config.id2vec_file, "r") as fh: 84 | id2vec = np.array(json.load(fh), dtype=np.float32) 85 | with open(config.test_eval_file, "r") as fh: 86 | test_eval_file = json.load(fh) 87 | total = 10000 88 | 89 | print("Loading model...") 90 | sess_config = tf.ConfigProto(allow_soft_placement=True) 91 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 92 | sess_config.gpu_options.allow_growth = True 93 | 94 | predict_dict = {} 95 | logits_dict1 = {} 96 | 97 | print("正在模型预测!") 98 | g1 = tf.Graph() 99 | with tf.Session(graph=g1, config=sess_config) as sess1: 100 | with g1.as_default(): 101 | test_batch1 = get_dataset(config.test_record_file, 
get_record_parser( 102 | config), config).make_one_shot_iterator() 103 | model_1 = Model(config, test_batch1, id2vec, trainable=False) 104 | sess1.run(tf.global_variables_initializer()) 105 | saver1 = tf.train.Saver() 106 | # 需要手动更改路径 107 | saver1.restore( 108 | sess1, "./log/modellalala/model_99000_devAcc_0.751301.ckpt") 109 | sess1.run(tf.assign(model_1.is_train, 110 | tf.constant(False, dtype=tf.bool))) 111 | for step in tqdm(range(total // config.batch_size + 1)): 112 | qa_id, logits = sess1.run( 113 | [model_1.qa_id, model_1.logits]) 114 | for ids, logits in zip(qa_id, logits): 115 | logits_dict1[str(ids)] = logits 116 | if len(logits_dict1) != len(test_eval_file): 117 | print("logits1 data number not match") 118 | 119 | a = tf.placeholder(shape=[3], dtype=tf.float32, name="me") 120 | softmax = tf.nn.softmax(a) 121 | for key, val in logits_dict1.items(): 122 | value = sess1.run(softmax, feed_dict={a: logits_dict1[key]}) 123 | predict_dict[key] = value 124 | 125 | print("正在保存test的softmax结果文件!") 126 | if not os.path.exists("./test_soft"): 127 | os.makedirs("./test_soft") 128 | with open("./test_soft/nlp_model_0.7446.txt", "wb") as f1: 129 | pickle.dump(predict_dict, f1) 130 | -------------------------------------------------------------------------------- /best_single_model/focal_loss.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | focal_loss.py 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | 11 | 12 | def sparse_focal_loss(logits, labels, gamma=2): 13 | """ 14 | Compute focal loss for multi-class classification 15 | Args: 16 | labels: An int32 tensor of shape [batch_size]. 17 | logits: A float32 tensor of shape [batch_size,num_classes]. 18 | gamma: A scalar for focal loss gamma hyper-parameter.
19 | Returns: 20 | A tensor of the same shape as `lables` 21 | """ 22 | with tf.name_scope("focal_loss"): 23 | y_pred = tf.nn.softmax(logits, dim=-1) # [batch_size,num_classes] 24 | labels = tf.one_hot(labels, depth=y_pred.shape[1]) 25 | L = -labels * ((1 - y_pred)**gamma) * tf.log(y_pred) 26 | L = tf.reduce_sum(L, axis=1) 27 | return L 28 | 29 | ''' 30 | if __name__ == '__main__': 31 | labels = tf.constant([0, 1], name="labels") 32 | logits = tf.constant([[0.7, 0.2, 0.1], [0.6, 0.1, 0.3]], name="logits") 33 | a = tf.reduce_mean(sparse_focal_loss(logits, tf.stop_gradient(labels))) 34 | with tf.Session() as sess: 35 | print(sess.run(a))''' 36 | -------------------------------------------------------------------------------- /best_single_model/main.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | main.py:train and test 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import os 14 | import codecs 15 | import time 16 | import math 17 | 18 | from model_addAnswer_newGraph import Model 19 | from util_addAnswer import * 20 | 21 | 22 | def train(config): 23 | """ 24 | 训练与验证函数 25 | """ 26 | with open(config.id2vec_file, "r") as fh: 27 | id2vec = np.array(json.load(fh), dtype=np.float32) 28 | with open(config.train_eval_file, "r") as fh: 29 | train_eval_file = json.load(fh) 30 | with open(config.dev_eval_file, "r") as fh: 31 | dev_eval_file = json.load(fh) 32 | 33 | dev_total = 29968 # 验证集数据量 34 | 35 | # 不同参数的训练在不同的文件夹下存储 36 | log_dir = config.log_dir + config.experiment 37 | save_dir = config.save_dir + config.experiment 38 | if not os.path.exists(log_dir): 39 | os.makedirs(log_dir) 40 | if not os.path.exists(save_dir): 41 | os.makedirs(save_dir) 42 | 43 | print("Building model...") 44 | parser = get_record_parser(config) 45 | train_dataset = get_batch_dataset(config.train_record_file, parser, config) 46 | dev_dataset = get_dataset(config.dev_record_file, parser, config) 47 | 48 | # 可馈送迭代器,通过feed_dict机制选择每次sess.run时调用train_iterator还是dev_iterator 49 | handle = tf.placeholder(tf.string, shape=[]) 50 | iterator = tf.data.Iterator.from_string_handle( 51 | handle, train_dataset.output_types, train_dataset.output_shapes) 52 | train_iterator = train_dataset.make_one_shot_iterator() 53 | dev_iterator = dev_dataset.make_one_shot_iterator() 54 | 55 | # 选取模型 56 | if config.model_name == "default": 57 | model = Model(config, iterator, id2vec) 58 | else: 59 | print("model error") 60 | return 61 | 62 | sess_config = tf.ConfigProto(allow_soft_placement=True) 63 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 64 | sess_config.gpu_options.allow_growth = True 65 | 66 | loss_save = 100.0 67 | patience = 0 68 | lr = config.init_learning_rate 69 | emb_lr = config.init_emb_lr 70 | 71 | with tf.Session(config=sess_config) as sess: 72 | writer = tf.summary.FileWriter(log_dir, sess.graph) # 存储计算图 73 | sess.run(tf.global_variables_initializer()) 74 | saver = tf.train.Saver() 75 | train_handle = sess.run(train_iterator.string_handle()) 76 | dev_handle = sess.run(dev_iterator.string_handle()) 77 | sess.run(tf.assign(model.is_train, tf.constant(True, dtype=tf.bool))) 78 | sess.run(tf.assign(model.learning_rate, 79 | tf.constant(lr, dtype=tf.float32))) 80 | if config.training_embedding: 81 | sess.run(tf.assign(model.emb_lr, tf.constant( 82 | emb_lr, dtype=tf.float32))) 83 | 84 | best_dev_acc = 0.0 # 
定义一个最佳验证准确率,只有当准确率高于它才保存模型 85 | print("Training ...") 86 | for go in tqdm(range(1, config.num_steps + 1)): 87 | global_step = sess.run(model.global_step) + 1 88 | loss, train_op = sess.run([model.loss, model.train_op], feed_dict={ 89 | handle: train_handle}) 90 | if global_step % config.period == 0: # 每隔一段步数就记录一次train_loss和learning_rate 91 | loss_sum = tf.Summary(value=[tf.Summary.Value( 92 | tag="model/loss", simple_value=loss), ]) 93 | writer.add_summary(loss_sum, global_step) 94 | lr_sum = tf.Summary(value=[tf.Summary.Value( 95 | tag="model/learning_rate", simple_value=sess.run(model.learning_rate)), ]) 96 | writer.add_summary(lr_sum, global_step) 97 | if config.training_embedding: 98 | emb_lr_sum = tf.Summary(value=[tf.Summary.Value( 99 | tag="model/emb_lr", simple_value=sess.run(model.emb_lr)), ]) 100 | writer.add_summary(emb_lr_sum, global_step) 101 | 102 | if global_step % config.checkpoint == 0: # 验证acc,并保存模型 103 | sess.run(tf.assign(model.is_train, 104 | tf.constant(False, dtype=tf.bool))) 105 | 106 | # 评估训练集 107 | _, summ = evaluate_batch( 108 | model, config.val_num_batches, train_eval_file, sess, "train_eval", handle, train_handle) 109 | for s in summ: 110 | writer.add_summary(s, global_step) 111 | 112 | # 评估验证集 113 | metrics, summ = evaluate_batch( 114 | model, dev_total // config.batch_size + 1, dev_eval_file, sess, "dev", handle, dev_handle) 115 | sess.run(tf.assign(model.is_train, 116 | tf.constant(True, dtype=tf.bool))) 117 | for s in summ: 118 | writer.add_summary(s, global_step) 119 | writer.flush() # 将事件文件刷新到磁盘 120 | 121 | # 1101 122 | if global_step <= 40000: 123 | lr = config.init_learning_rate 124 | elif global_step <= 60000: 125 | lr = config.init_learning_rate 126 | emb_lr = 1e-5 127 | elif global_step <= 120000: 128 | lr = config.init_learning_rate / \ 129 | math.sqrt((global_step - 50000) / 10000) 130 | emb_lr = (1e-5) / \ 131 | math.sqrt((global_step - 50000) / 10000) 132 | elif global_step <= 200000: 133 | lr = config.init_learning_rate / \ 134 | math.sqrt((global_step - 50000) / 5000) 135 | emb_lr = (1e-5) / \ 136 | math.sqrt((global_step - 50000) / 5000) 137 | else: 138 | lr = config.init_learning_rate / \ 139 | math.sqrt(global_step / 1000) 140 | emb_lr = (1e-5) / \ 141 | math.sqrt(global_step / 1000) 142 | 143 | sess.run(tf.assign(model.learning_rate, 144 | tf.constant(lr, dtype=tf.float32))) 145 | if config.training_embedding: 146 | sess.run(tf.assign(model.emb_lr, tf.constant( 147 | emb_lr, dtype=tf.float32))) 148 | 149 | # 保存模型的逻辑 150 | if metrics["accuracy"] > best_dev_acc: 151 | best_dev_acc = metrics["accuracy"] 152 | filename = os.path.join( 153 | save_dir, "model_{}_devAcc_{:.6f}.ckpt".format(global_step, best_dev_acc)) 154 | saver.save(sess, filename) 155 | 156 | print("finished!") 157 | 158 | 159 | def evaluate_batch(model, num_batches, eval_file, sess, data_type, handle, str_handle): 160 | """ 161 | 模型评估函数 162 | """ 163 | answer_dict = {} # 答案词典 164 | truth_dict = {} # 真实答案词典 165 | losses = [] 166 | for _ in tqdm(range(1, num_batches + 1)): 167 | qa_id, loss, truth, answer = sess.run( 168 | [model.qa_id, model.loss, model.answer, model.classes], feed_dict={handle: str_handle}) 169 | answer_dict_ = {} 170 | truth_dict_ = {} 171 | for ids, tr, ans in zip(qa_id, truth, answer): 172 | answer_dict_[str(ids)] = ans 173 | truth_dict_[str(ids)] = tr 174 | answer_dict.update(answer_dict_) 175 | truth_dict.update(truth_dict_) 176 | losses.append(loss) 177 | loss = np.mean(losses) 178 | metrics = evaluate_acc(truth_dict, answer_dict) 179 | metrics["loss"] = 
loss 180 | loss_sum = tf.Summary(value=[tf.Summary.Value( 181 | tag="{}/loss".format(data_type), simple_value=metrics["loss"]), ]) 182 | acc_sum = tf.Summary(value=[tf.Summary.Value( 183 | tag="{}/accuracy".format(data_type), simple_value=metrics["accuracy"]), ]) 184 | return metrics, [loss_sum, acc_sum] 185 | 186 | 187 | def test(config): 188 | """ 189 | 测试函数 190 | """ 191 | with open(config.id2vec_file, "r") as fh: 192 | id2vec = np.array(json.load(fh), dtype=np.float32) 193 | with open(config.test_eval_file, "r") as fh: 194 | test_eval_file = json.load(fh) 195 | 196 | total = 10000 197 | # 读取模型的路径和预测存储的路径 198 | save_dir = config.save_dir + config.experiment 199 | if not os.path.exists(save_dir): 200 | print("no save!") 201 | return 202 | predic_time = time.strftime("%Y-%m-%d_%H:%M:%S ", time.localtime()) 203 | prediction_file = os.path.join( 204 | config.prediction_dir, (predic_time + "_predictions.txt")) 205 | 206 | print("Loading model...") 207 | test_batch = get_dataset(config.test_record_file, get_record_parser( 208 | config), config).make_one_shot_iterator() 209 | 210 | # 选取模型 211 | if config.model_name == "default": 212 | model = Model(config, test_batch, id2vec, trainable=False) 213 | else: 214 | print("model error") 215 | return 216 | 217 | sess_config = tf.ConfigProto(allow_soft_placement=True) 218 | sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9 219 | sess_config.gpu_options.allow_growth = True 220 | 221 | print("testing ...") 222 | with tf.Session(config=sess_config) as sess: 223 | sess.run(tf.global_variables_initializer()) 224 | saver = tf.train.Saver() 225 | saver.restore(sess, tf.train.latest_checkpoint(save_dir)) 226 | sess.run(tf.assign(model.is_train, tf.constant(False, dtype=tf.bool))) 227 | answer_dict = {} 228 | for step in tqdm(range(total // config.batch_size + 1)): 229 | # 预测答案 230 | qa_id, answer = sess.run([model.qa_id, model.classes]) 231 | answer_dict_ = {} 232 | for ids, ans in zip(qa_id, answer): 233 | answer_dict_[str(ids)] = ans 234 | answer_dict.update(answer_dict_) 235 | # 将结果写文件的操作,不用考虑问题顺序 236 | if len(answer_dict) != len(test_eval_file): 237 | print("data number not match") 238 | predictions = [] 239 | for key, value in answer_dict.items(): 240 | prediction_answer = test_eval_file[str(key)][value] 241 | predictions.append(str(key) + '\t' + str(prediction_answer)) 242 | outputs = u'\n'.join(predictions) 243 | with codecs.open(prediction_file, 'w', encoding='utf-8') as f: 244 | f.write(outputs) 245 | print("done!") 246 | -------------------------------------------------------------------------------- /best_single_model/model_addAnswer_newGraph.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | model_addAnswer_newGraph.py:改进R-net模型,引入alternatives信息和特征工程。 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | 10 | import tensorflow as tf 11 | from nn_func import cudnn_gru, native_gru, dot_attention, summ, dropout, dense 12 | from focal_loss import sparse_focal_loss 13 | 14 | class Model(object): 15 | def __init__(self, config, batch, word_mat=None, trainable=True): 16 | """ 17 | 模型初始化函数 18 | Args: 19 | config:是tf.flag.FLAGS,配置整个项目的超参数 20 | batch:是一个tf.data.iterator对象,读取数据的迭代器,可能联系到tf.records,如果我们的数据集比较小就可以不用 21 | word_mat:np.array数组,是词向量?
22 | char_mat:同上 23 | """ 24 | self.config = config 25 | batch_size = config.batch_size 26 | self.global_step = tf.get_variable('global_step', shape=[], dtype=tf.int32, 27 | initializer=tf.constant_initializer(0), trainable=False) 28 | # tf.data.iterator的get_next方法,返回dataset中下一个element的tensor对象,在sess.run中实现迭代 29 | """ 30 | passage: passage序列的每个词的id号的tensor(tf.int32),长度应该是都取最大限制长度,空余的填充空值?(这里待定) 31 | question: question序列的每个词的id号的tensor(tf.int32) 32 | ch, qh, y1, y2: 本项目不需要,已经取消 33 | qa_id: question的id 34 | answer: 新添加的answer标签,(0/1/2),shape初步定义为[batch_size] 35 | """ 36 | self.passage, self.question, self.alternatives, self.answer, self.qa_id, self.nlp_feature = batch.get_next() 37 | self.is_train = tf.get_variable( 38 | "is_train", shape=[], dtype=tf.bool, trainable=False) 39 | 40 | # word embeddings的变量,可以选择是否训练. 41 | if self.config.training_embedding: 42 | self.word_mat = tf.get_variable("word_mat", initializer=tf.constant( 43 | word_mat, dtype=tf.float32), trainable=True) 44 | else: 45 | self.word_mat = tf.get_variable("word_mat", initializer=tf.constant( 46 | word_mat, dtype=tf.float32), trainable=False) 47 | 48 | with tf.name_scope("process"): 49 | # tf.cast将tensor转换为bool类型,生成mask,有值部分用true,空值用false 50 | self.c_mask = tf.cast(self.passage, tf.bool) 51 | self.q_mask = tf.cast(self.question, tf.bool) 52 | # 求每个序列的真实长度,得到_len的tensor 53 | self.c_len = tf.reduce_sum(tf.cast(self.c_mask, tf.int32), axis=1) 54 | self.q_len = tf.reduce_sum(tf.cast(self.q_mask, tf.int32), axis=1) 55 | # alternatives编码过程用到的 56 | self.a_len = tf.constant( 57 | value=3 * self.config.ans_limit, shape=[batch_size], dtype=tf.int32, name="a_len") 58 | 59 | # 求一个batch中序列最大长度,并按照最大长度对对tensor进行slice划分 60 | self.c_maxlen = tf.reduce_max(self.c_len) 61 | self.q_maxlen = tf.reduce_max(self.q_len) 62 | self.c = tf.slice(self.passage, [0, 0], [ 63 | batch_size, self.c_maxlen]) 64 | self.q = tf.slice(self.question, [0, 0], [ 65 | batch_size, self.q_maxlen]) 66 | self.c_mask = tf.slice(self.c_mask, [0, 0], [ 67 | batch_size, self.c_maxlen]) 68 | self.q_mask = tf.slice(self.q_mask, [0, 0], [ 69 | batch_size, self.q_maxlen]) 70 | # a_mask 71 | self.a_mask = tf.constant( 72 | value=True, shape=[batch_size, 3], dtype=tf.bool, name="a_mask") 73 | 74 | self.Structure() # 构造R-Net模型结构 75 | 76 | if trainable: 77 | 78 | if not self.config.training_embedding: 79 | self.learning_rate = tf.get_variable( 80 | "learning_rate", shape=[], dtype=tf.float32, trainable=False) 81 | self.opt = tf.train.AdamOptimizer( 82 | learning_rate=self.learning_rate, epsilon=1e-8) 83 | 84 | grads = self.opt.compute_gradients(self.loss) 85 | gradients, variables = zip(*grads) 86 | capped_grads, _ = tf.clip_by_global_norm( 87 | gradients, config.grad_clip) 88 | self.train_op = self.opt.apply_gradients( 89 | zip(capped_grads, variables), global_step=self.global_step) 90 | else: 91 | # 对embedding层设置单独的学习率 92 | self.emb_lr = tf.get_variable( 93 | "emb_lr", shape=[], dtype=tf.float32, trainable=False) 94 | self.learning_rate = tf.get_variable( 95 | "learning_rate", shape=[], dtype=tf.float32, trainable=False) 96 | self.emb_opt = tf.train.AdamOptimizer( 97 | learning_rate=self.emb_lr, epsilon=1e-8) 98 | self.opt = tf.train.AdamOptimizer( 99 | learning_rate=self.learning_rate, epsilon=1e-8) 100 | # 区分不同的变量列表 101 | self.var_list = tf.trainable_variables() 102 | var_list1 = [] 103 | var_list2 = [] 104 | for var in self.var_list: 105 | if var.op.name == "word_mat": 106 | var_list1.append(var) 107 | else: 108 | var_list2.append(var) 109 | 110 | grads = 
tf.gradients(self.loss, var_list1 + var_list2) 111 | capped_grads, _ = tf.clip_by_global_norm( 112 | grads, config.grad_clip) 113 | grads1 = capped_grads[:len(var_list1)] 114 | grads2 = capped_grads[len(var_list1):] 115 | self.train_op1 = self.emb_opt.apply_gradients( 116 | zip(grads1, var_list1)) 117 | self.train_op2 = self.opt.apply_gradients( 118 | zip(grads2, var_list2), global_step=self.global_step) 119 | self.train_op = tf.group(self.train_op1, self.train_op2) 120 | 121 | def Structure(self): 122 | config = self.config 123 | batch_size, PL, QL, d = config.batch_size, self.c_maxlen, self.q_maxlen, config.hidden 124 | gru = cudnn_gru if config.use_cudnn else native_gru # 选择使用哪种gru网络 125 | 126 | with tf.variable_scope("embedding"): 127 | # word_embedding层 128 | with tf.name_scope("word"): 129 | # embedding后的shape是[batch_size, max_len, vec_len] 130 | c_emb = tf.nn.embedding_lookup(self.word_mat, self.c) 131 | q_emb = tf.nn.embedding_lookup(self.word_mat, self.q) 132 | a_emb = tf.nn.embedding_lookup( 133 | self.word_mat, self.alternatives) # [batch_size, 3, ans_limit, 300] 134 | 135 | with tf.variable_scope("nlp_feature"): 136 | nlp_w = tf.get_variable( 137 | "nlp_w", shape=[187, 512], dtype=tf.float32) 138 | nlp_input = dropout(tf.matmul(self.nlp_feature, nlp_w), 139 | keep_prob=config.keep_prob, is_train=self.is_train) 140 | nlp_w2 = tf.get_variable( 141 | "nlp_w2", shape=[512, d], dtype=tf.float32) 142 | nlp_out = tf.nn.relu(tf.matmul(nlp_input, nlp_w2)) 143 | 144 | with tf.variable_scope("encoding"): 145 | rnn = gru(num_layers=3, num_units=d, batch_size=batch_size, input_size=c_emb.get_shape( 146 | ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train, scope="p_encoder") 147 | c = rnn(c_emb, seq_len=self.c_len) 148 | q = rnn(q_emb, seq_len=self.q_len) 149 | al_inputs = tf.reshape( 150 | a_emb, [-1, 3 * config.ans_limit, a_emb.get_shape().as_list()[-1]]) 151 | # [batch_size, 3*ans_limit, 2*hidden] 152 | al_encode = rnn(al_inputs, seq_len=self.a_len) 153 | # al_encode = rnn(al_inputs, seq_len=self.a_len)[:, :, -2 * d:]这个还没试 154 | al_output_ = tf.reshape(al_encode, [batch_size, 3, -1]) 155 | al_output = tf.nn.relu(dense(al_output_, d)) 156 | 157 | # with tf.variable_scope("alternative_encoding"): 158 | # # al_inputs = tf.reduce_sum(a_emb, axis=2) # [batch_size, 3, 300] 159 | # # al_encode = dense(al_inputs, d, use_bias=False, 160 | # # scope="al_encoder") # [batch_size, 3, hidden] 161 | # rnn = gru(num_layers=1, num_units=d, batch_size=batch_size, input_size=a_emb.get_shape( 162 | # ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train, scope="al_encoder") 163 | # al_inputs = tf.reshape( 164 | # a_emb, [-1, 3 * config.ans_limit, a_emb.get_shape().as_list()[-1]]) 165 | # # [batch_size, 3*ans_limit, 2*hidden] 166 | # al_encode = rnn(al_inputs, seq_len=self.a_len) 167 | # al_output_ = tf.reshape(al_encode, [batch_size, 3, -1]) 168 | # al_output = tf.nn.relu(dense(al_output_, d)) 169 | 170 | # with tf.variable_scope("question_encoding"): 171 | # # encoder层,将context和question分别输入双向GRU 172 | # rnn = gru(num_layers=3, num_units=d, batch_size=batch_size, input_size=q_emb.get_shape( 173 | # ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train, scope="q_encoder") 174 | # # RNN每层的正向反向输出合并,本代码默认的是每层的输出也合并 175 | # # 所以对于3层rnn,输出的shape是[batch_size, max_len, 6*hidden] 176 | # # 并且,序列空值处的输出都清零了 177 | # q = rnn(q_emb, seq_len=self.q_len) 178 | 179 | # with tf.variable_scope("passage_encoding"): 180 | # rnn = gru(num_layers=3, num_units=d, batch_size=batch_size, 
input_size=c_emb.get_shape( 181 | # ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train, scope="p_encoder") 182 | # c = rnn(c_emb, seq_len=self.c_len) 183 | 184 | with tf.variable_scope("QP_attention"): 185 | """ 186 | 基于注意力的循环神经网络层,匹配context和question 187 | """ 188 | # qc_att的shape [batch_size, c_maxlen, 12*hidden] 189 | qc_att_ = dot_attention(inputs=c, memory=q, mask=self.q_mask, hidden=d, 190 | keep_prob=config.keep_prob, is_train=self.is_train) 191 | rnn = gru(num_layers=1, num_units=d, batch_size=batch_size, input_size=qc_att_.get_shape( 192 | ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train, scope="qp") 193 | qc_att = rnn(qc_att_, seq_len=self.c_len) 194 | 195 | with tf.variable_scope("passage_match"): 196 | """ 197 | context自匹配层 198 | """ 199 | c_att = dot_attention( 200 | qc_att, qc_att, mask=self.c_mask, hidden=d, keep_prob=config.keep_prob, is_train=self.is_train) 201 | rnn = gru(num_layers=1, num_units=d, batch_size=batch_size, input_size=c_att.get_shape( 202 | ).as_list()[-1], keep_prob=config.keep_prob, is_train=self.is_train, scope="p_match") 203 | # [batch_size, c_maxlen, 2*hidden] 204 | c_match = rnn(c_att, seq_len=self.c_len) 205 | 206 | with tf.variable_scope("YesNo_classification"): 207 | """ 208 | 对问题答案的分类层, 需要的输入有question的编码结果q和context的match 209 | """ 210 | # init的shape:[batch_size, 2*hidden] 211 | # 这步的作用初始猜测是将question进行pooling操作,然后再输入给一个rnn层进行分类 212 | init = summ(q[:, :, -2 * d:], d, mask=self.q_mask, 213 | keep_prob=config.keep_prob, is_train=self.is_train) 214 | c_match_ = dropout(c_match, keep_prob=config.keep_prob, 215 | is_train=self.is_train) 216 | final_hiddens = init.get_shape().as_list()[-1] 217 | final_gru = tf.contrib.rnn.GRUCell(final_hiddens) 218 | qp_output_, _ = tf.nn.dynamic_rnn( 219 | final_gru, c_match_, initial_state=init, dtype=tf.float32) # [batch_size, c_maxlen, 2*hidden] 220 | qp_output = dense(qp_output_, d) 221 | 222 | # final_att: [batch_size, 3, 2*hidden] 223 | final_att = dot_attention(al_output, qp_output, self.c_mask, 224 | hidden=d, keep_prob=config.keep_prob, is_train=self.is_train) 225 | # 将特征工程的信息融合进来 226 | nlp_final = tf.expand_dims(nlp_out, axis=1) 227 | nlp_final = tf.tile(nlp_final, [1, 3, 1]) 228 | 229 | final_concat = tf.concat([final_att, nlp_final], axis=2) 230 | 231 | final_output = dense( 232 | final_concat, 1, use_bias=True, scope="final_output") 233 | self.logits = tf.squeeze(final_output) 234 | 235 | with tf.variable_scope("softmax_and_loss"): 236 | self.final_softmax = tf.nn.softmax(self.logits) 237 | self.classes = tf.cast( 238 | tf.argmax(self.final_softmax, axis=1), dtype=tf.int32, name="classes") 239 | # 注意stop_gradient的使用,因为answer不是placeholder传进来的,所以要注明不对其计算梯度 240 | if config.loss_function == "focal_loss": 241 | self.loss = tf.reduce_mean(sparse_focal_loss( 242 | logits=self.logits, labels=tf.stop_gradient(self.answer))) 243 | else: 244 | self.loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits( 245 | logits=self.logits, labels=tf.stop_gradient(self.answer))) 246 | 247 | def get_loss(self): 248 | return self.loss 249 | 250 | def get_global_step(self): 251 | return self.global_step 252 | -------------------------------------------------------------------------------- /best_single_model/nlp_feature.json: -------------------------------------------------------------------------------- 1 | {"说明":"此文件存储数据样本经特征工程处理之后的特征向量。"} -------------------------------------------------------------------------------- /best_single_model/nn_func.py: 
-------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | nn_func.py:神经网络模型的组件 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | 11 | INF = 1e30 12 | 13 | 14 | class cudnn_gru: 15 | 16 | def __init__(self, num_layers, num_units, batch_size, input_size, keep_prob=1.0, is_train=None, scope=None): 17 | self.num_layers = num_layers 18 | self.grus = [] 19 | self.inits = [] 20 | self.dropout_mask = [] 21 | self.scope = scope 22 | for layer in range(num_layers): 23 | input_size_ = input_size if layer == 0 else 2 * num_units 24 | gru_fw = tf.contrib.cudnn_rnn.CudnnGRU( 25 | 1, num_units, name="f_cudnn_gru") 26 | gru_bw = tf.contrib.cudnn_rnn.CudnnGRU( 27 | 1, num_units, name="b_cudnn_gru") 28 | init_fw = tf.tile(tf.Variable( 29 | tf.zeros([1, 1, num_units])), [1, batch_size, 1]) 30 | init_bw = tf.tile(tf.Variable( 31 | tf.zeros([1, 1, num_units])), [1, batch_size, 1]) 32 | mask_fw = dropout(tf.ones([1, batch_size, input_size_], dtype=tf.float32), 33 | keep_prob=keep_prob, is_train=is_train, mode=None) 34 | mask_bw = dropout(tf.ones([1, batch_size, input_size_], dtype=tf.float32), 35 | keep_prob=keep_prob, is_train=is_train, mode=None) 36 | self.grus.append((gru_fw, gru_bw, )) 37 | self.inits.append((init_fw, init_bw, )) 38 | self.dropout_mask.append((mask_fw, mask_bw, )) 39 | 40 | def __call__(self, inputs, seq_len, keep_prob=1.0, is_train=None, concat_layers=True): 41 | # cudnn GRU需要交换张量的维度,可能是便于计算 42 | outputs = [tf.transpose(inputs, [1, 0, 2])] 43 | with tf.variable_scope(self.scope): 44 | for layer in range(self.num_layers): 45 | gru_fw, gru_bw = self.grus[layer] 46 | init_fw, init_bw = self.inits[layer] 47 | mask_fw, mask_bw = self.dropout_mask[layer] 48 | with tf.variable_scope("fw_{}".format(layer)): 49 | out_fw, _ = gru_fw( 50 | outputs[-1] * mask_fw, initial_state=(init_fw, )) 51 | with tf.variable_scope("bw_{}".format(layer)): 52 | inputs_bw = tf.reverse_sequence( 53 | outputs[-1] * mask_bw, seq_lengths=seq_len, seq_dim=0, batch_dim=1) 54 | out_bw, _ = gru_bw( 55 | inputs_bw, initial_state=(init_bw, )) 56 | out_bw = tf.reverse_sequence( 57 | out_bw, seq_lengths=seq_len, seq_dim=0, batch_dim=1) 58 | outputs.append(tf.concat([out_fw, out_bw], axis=2)) 59 | if concat_layers: 60 | res = tf.concat(outputs[1:], axis=2) 61 | else: 62 | res = outputs[-1] 63 | res = tf.transpose(res, [1, 0, 2]) 64 | return res 65 | 66 | 67 | class native_gru: 68 | 69 | def __init__(self, num_layers, num_units, batch_size, input_size, keep_prob=1.0, is_train=None, scope="native_gru"): 70 | self.num_layers = num_layers 71 | self.grus = [] 72 | self.inits = [] 73 | self.dropout_mask = [] 74 | self.scope = scope 75 | for layer in range(num_layers): 76 | input_size_ = input_size if layer == 0 else 2 * num_units 77 | # 双向Bi-GRU f:forward b:back 78 | gru_fw = tf.contrib.rnn.GRUCell(num_units) 79 | gru_bw = tf.contrib.rnn.GRUCell(num_units) 80 | # tf.tile 平铺给定的张量,这里是将初始状态扩张到batch_size倍 81 | init_fw = tf.tile(tf.Variable( 82 | tf.zeros([1, num_units])), [batch_size, 1]) 83 | init_bw = tf.tile(tf.Variable( 84 | tf.zeros([1, num_units])), [batch_size, 1]) 85 | mask_fw = dropout(tf.ones([batch_size, 1, input_size_], dtype=tf.float32), 86 | keep_prob=keep_prob, is_train=is_train, mode=None) 87 | mask_bw = dropout(tf.ones([batch_size, 1, input_size_], dtype=tf.float32), 88 | keep_prob=keep_prob, is_train=is_train, mode=None) 89 | self.grus.append((gru_fw, gru_bw, )) 90 | self.inits.append((init_fw, init_bw, )) 91 | 
self.dropout_mask.append((mask_fw, mask_bw, )) 92 | 93 | def __call__(self, inputs, seq_len, concat_layers=True): 94 | """ 95 | 运行RNN 96 | 这里的keep_prob和is_train没用,在__init__中就已设置好了 97 | """ 98 | outputs = [inputs] 99 | with tf.variable_scope(self.scope): 100 | for layer in range(self.num_layers): 101 | gru_fw, gru_bw = self.grus[layer] 102 | init_fw, init_bw = self.inits[layer] 103 | mask_fw, mask_bw = self.dropout_mask[layer] 104 | # 正向RNN 105 | with tf.variable_scope("fw_{}".format(layer)): 106 | # 每一层使用上层的输出 107 | # dynamic_rnn中的超过seq_len的部分就不计算了,state直接重复,output直接清零,节省资源 108 | out_fw, _ = tf.nn.dynamic_rnn( 109 | gru_fw, outputs[-1] * mask_fw, seq_len, initial_state=init_fw, dtype=tf.float32) 110 | # 反向RNN 111 | with tf.variable_scope("bw_{}".format(layer)): 112 | inputs_bw = tf.reverse_sequence( 113 | outputs[-1] * mask_bw, seq_lengths=seq_len, seq_dim=1, batch_dim=0) 114 | out_bw, _ = tf.nn.dynamic_rnn( 115 | gru_bw, inputs_bw, seq_len, initial_state=init_bw, dtype=tf.float32) 116 | out_bw = tf.reverse_sequence( 117 | out_bw, seq_lengths=seq_len, seq_dim=1, batch_dim=0) 118 | # 正向输出和反向输出合并 119 | outputs.append(tf.concat([out_fw, out_bw], axis=2)) 120 | if concat_layers: 121 | res = tf.concat(outputs[1:], axis=2) 122 | else: 123 | res = outputs[-1] 124 | return res 125 | 126 | 127 | def dropout(args, keep_prob, is_train, mode="recurrent"): 128 | """ 129 | dropout层,args初始是1.0 130 | """ 131 | if keep_prob < 1.0: 132 | noise_shape = None 133 | scale = 1.0 134 | shape = tf.shape(args) 135 | if mode == "embedding": 136 | noise_shape = [shape[0], 1] 137 | scale = keep_prob 138 | if mode == "recurrent" and len(args.get_shape().as_list()) == 3: 139 | noise_shape = [shape[0], 1, shape[-1]] 140 | args = tf.cond(is_train, lambda: tf.nn.dropout( 141 | args, keep_prob, noise_shape=noise_shape) * scale, lambda: args) 142 | return args 143 | 144 | 145 | def softmax_mask(val, mask): 146 | """ 147 | 作用是给空值处减小注意力 148 | """ 149 | return -INF * (1 - tf.cast(mask, tf.float32)) + val # tf.cast:true转为1.0,false转为0.0 150 | 151 | 152 | def summ(memory, hidden, mask, keep_prob=1.0, is_train=None, scope="summ"): 153 | """ 154 | 对question进行最后一步的处理,可以看作是pooling吗 155 | """ 156 | with tf.variable_scope(scope): 157 | d_memory = dropout(memory, keep_prob=keep_prob, is_train=is_train) 158 | s0 = tf.nn.tanh(dense(d_memory, hidden, scope="s0")) 159 | s = dense(s0, 1, use_bias=False, scope="s") 160 | # tf.squeeze把长度只有1的维度去掉 161 | # s1:[batch_size, c_maxlen] 162 | s1 = softmax_mask(tf.squeeze(s, [2]), mask) 163 | a = tf.expand_dims(tf.nn.softmax(s1), axis=2) 164 | res = tf.reduce_sum(a * memory, axis=1) # 逐元素相乘,shape跟随memory一致 165 | return res # [batch_size, 2*hidden] 166 | 167 | 168 | def dot_attention(inputs, memory, mask, hidden, keep_prob=1.0, is_train=None, scope="dot_attention"): 169 | """ 170 | 门控attention层 171 | """ 172 | with tf.variable_scope(scope): 173 | 174 | d_inputs = dropout(inputs, keep_prob=keep_prob, is_train=is_train) 175 | d_memory = dropout(memory, keep_prob=keep_prob, is_train=is_train) 176 | JX = tf.shape(inputs)[1] # inputs的1维度,应该是c_maxlen 177 | 178 | with tf.variable_scope("attention"): 179 | # inputs_的shape:[batch_size, c_maxlen, hidden] 180 | inputs_ = tf.nn.relu( 181 | dense(d_inputs, hidden, use_bias=False, scope="inputs")) 182 | memory_ = tf.nn.relu( 183 | dense(d_memory, hidden, use_bias=False, scope="memory")) 184 | # 三维矩阵相乘,结果的shape是[batch_size, c_maxlen, q_maxlen] 185 | outputs = tf.matmul(inputs_, tf.transpose( 186 | memory_, [0, 2, 1])) / (hidden ** 0.5) 187 | # 
将mask平铺成与outputs相同的形状,这里考虑,改进成input和memory都需要mask 188 | mask = tf.tile(tf.expand_dims(mask, axis=1), [1, JX, 1]) 189 | logits = tf.nn.softmax(softmax_mask(outputs, mask)) 190 | outputs = tf.matmul(logits, memory) 191 | # res:[batch_size, c_maxlen, 12*hidden] 192 | res = tf.concat([inputs, outputs], axis=2) 193 | 194 | with tf.variable_scope("gate"): 195 | """ 196 | attention * gate 197 | """ 198 | dim = res.get_shape().as_list()[-1] 199 | d_res = dropout(res, keep_prob=keep_prob, is_train=is_train) 200 | gate = tf.nn.sigmoid(dense(d_res, dim, use_bias=False)) 201 | return res * gate # 向量的逐元素相乘 202 | 203 | 204 | # 写一个谷歌论文中新的attention模块 205 | def multihead_attention(Q, K, V, mask, hidden, head_num=4, keep_prob=1.0, is_train=None, has_gate=True, scope="multihead_attention"): 206 | """ 207 | Q : passage 208 | K,V: question 209 | mask: Q的mask 210 | """ 211 | size = int(hidden / head_num) # 每个attention的大小 212 | 213 | with tf.variable_scope(scope): 214 | d_Q = dropout(Q, keep_prob=keep_prob, is_train=is_train) 215 | d_K = dropout(K, keep_prob=keep_prob, is_train=is_train) 216 | JX = tf.shape(Q)[1] 217 | 218 | with tf.variable_scope("attention"): 219 | Q_ = tf.nn.relu(dense(d_Q, hidden, use_bias=False, scope="Q")) 220 | K_ = tf.nn.relu(dense(d_K, hidden, use_bias=False, scope="K")) 221 | V_ = tf.nn.relu(dense(V, hidden, use_bias=False, scope="V")) 222 | Q_ = tf.reshape(Q_, (-1, tf.shape(Q_)[1], head_num, size)) 223 | K_ = tf.reshape(K_, (-1, tf.shape(K_)[1], head_num, size)) 224 | V_ = tf.reshape(V_, (-1, tf.shape(V_)[1], head_num, size)) 225 | Q_ = tf.transpose(Q_, [0, 2, 1, 3]) 226 | K_ = tf.transpose(K_, [0, 2, 1, 3]) 227 | V_ = tf.transpose(V_, [0, 2, 1, 3]) 228 | # scale:[batch_size, head_num, c_maxlen, q_maxlen] 229 | scale = tf.matmul(Q_, K_, transpose_b=True) / tf.sqrt(float(size)) 230 | scale = tf.transpose(scale, [0, 3, 2, 1]) 231 | for _ in range(len(scale.shape) - 2): 232 | mask = tf.expand_dims(mask, axis=2) 233 | mask_scale = softmax_mask(scale, mask) 234 | mask_scale = tf.transpose(scale, [0, 3, 2, 1]) 235 | logits = tf.nn.softmax(mask_scale) 236 | outputs = tf.matmul(logits, V_) # [b,h,c,s] 237 | outputs = tf.transpose(outputs, [0, 2, 1, 3]) 238 | # [batch_size, c_maxlen, hidden] 239 | outputs = tf.reshape(outputs, (-1, tf.shape(Q)[1], hidden)) 240 | # res连接 241 | res = tf.concat([Q, outputs], axis=2) 242 | 243 | if has_gate: 244 | with tf.variable_scope("gate"): 245 | dim = res.get_shape().as_list()[-1] 246 | d_res = dropout(res, keep_prob=keep_prob, is_train=is_train) 247 | gate = tf.nn.sigmoid(dense(d_res, dim, use_bias=False)) 248 | return res * gate 249 | else: 250 | return res 251 | 252 | 253 | def dense(inputs, hidden, use_bias=True, scope="dense"): 254 | """ 255 | 全连接层 256 | """ 257 | with tf.variable_scope(scope): 258 | shape = tf.shape(inputs) 259 | dim = inputs.get_shape().as_list()[-1] 260 | out_shape = [shape[idx] for idx in range( 261 | len(inputs.get_shape().as_list()) - 1)] + [hidden] 262 | # 三维的inputs,reshape成二维 263 | flat_inputs = tf.reshape(inputs, [-1, dim]) 264 | W = tf.get_variable("W", [dim, hidden]) 265 | res = tf.matmul(flat_inputs, W) 266 | if use_bias: 267 | b = tf.get_variable( 268 | "b", [hidden], initializer=tf.constant_initializer(0.)) 269 | res = tf.nn.bias_add(res, b) 270 | # outshape就是input的最后一维变成hidden 271 | res = tf.reshape(res, out_shape) 272 | return res 273 | -------------------------------------------------------------------------------- /best_single_model/util_addAnswer.py: 
-------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | util_addAnswer.py:读取batch的工具。 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import numpy as np 11 | import re 12 | from collections import Counter 13 | import string 14 | 15 | 16 | def get_record_parser(config): 17 | def parse(example): 18 | para_limit = config.para_limit 19 | ques_limit = config.ques_limit 20 | ans_limit = config.ans_limit 21 | features = tf.parse_single_example(example, 22 | features={ 23 | "passage_idxs": tf.FixedLenFeature([], tf.string), 24 | "question_idxs": tf.FixedLenFeature([], tf.string), 25 | "alternative_idxs": tf.FixedLenFeature([], tf.string), 26 | "answer": tf.FixedLenFeature([], tf.int64), 27 | "id": tf.FixedLenFeature([], tf.int64), 28 | "nlp_feature": tf.FixedLenFeature([187], tf.float32) 29 | }) 30 | # tf.decode_raw: 将字符串的字节重新解释为数字向量 31 | passage_idxs = tf.reshape(tf.decode_raw( 32 | features["passage_idxs"], tf.int32), [para_limit]) 33 | question_idxs = tf.reshape(tf.decode_raw( 34 | features["question_idxs"], tf.int32), [ques_limit]) 35 | alternative_idxs = tf.reshape(tf.decode_raw( 36 | features["alternative_idxs"], tf.int32), [3, ans_limit]) 37 | answer = features["answer"] 38 | qa_id = features["id"] 39 | nlp_feature= features["nlp_feature"] 40 | return passage_idxs, question_idxs, alternative_idxs, answer, qa_id ,nlp_feature 41 | return parse 42 | 43 | 44 | def get_batch_dataset(record_file, parser, config): 45 | """ 46 | 训练数据集TFRecordDataset的batch生成器。 47 | Args: 48 | record_file: 训练数据tf_record路径 49 | parser: 数据存储的格式 50 | config: 超参数 51 | """ 52 | num_threads = tf.constant(config.num_threads, dtype=tf.int32) 53 | dataset = tf.data.TFRecordDataset(record_file).map( 54 | parser, num_parallel_calls=num_threads).shuffle(config.capacity).repeat() 55 | if config.is_bucket: 56 | # bucket方法,用于解决序列长度不同的mini-batch的计算效率问题 57 | buckets = [tf.constant(num) for num in range(*config.bucket_range)] 58 | 59 | def key_func(context_idxs, ques_idxs, context_char_idxs, ques_char_idxs, y1, y2, qa_id): 60 | c_len = tf.reduce_sum( 61 | tf.cast(tf.cast(context_idxs, tf.bool), tf.int32)) 62 | buckets_min = [np.iinfo(np.int32).min] + buckets 63 | buckets_max = buckets + [np.iinfo(np.int32).max] 64 | conditions_c = tf.logical_and( 65 | tf.less(buckets_min, c_len), tf.less_equal(c_len, buckets_max)) 66 | bucket_id = tf.reduce_min(tf.where(conditions_c)) 67 | return bucket_id 68 | 69 | def reduce_func(key, elements): 70 | return elements.batch(config.batch_size) 71 | 72 | dataset = dataset.apply(tf.contrib.data.group_by_window( 73 | key_func, reduce_func, window_size=5 * config.batch_size)).shuffle(len(buckets) * 25) 74 | else: 75 | dataset = dataset.batch(config.batch_size) 76 | return dataset 77 | 78 | 79 | def get_dataset(record_file, parser, config): 80 | num_threads = tf.constant(config.num_threads, dtype=tf.int32) 81 | dataset = tf.data.TFRecordDataset(record_file).map( 82 | parser, num_parallel_calls=num_threads).repeat().batch(config.batch_size) 83 | return dataset 84 | 85 | 86 | def evaluate_acc(truth_dict, answer_dict): 87 | """ 88 | 计算准确率,还可以设计返回正确问题和错误问题列表 89 | """ 90 | total = 0 91 | right = 0 92 | wrong = 0 93 | for key, value in answer_dict.items(): 94 | total += 1 95 | ground_truths = truth_dict[key] 96 | prediction = value 97 | if prediction == ground_truths: 98 | right += 1 99 | else: 100 | wrong += 1 101 | accuracy = (right / total) * 1.0 102 | return {"accuracy": accuracy} 103 | 
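# Illustrative usage sketch: evaluate_acc only compares two {qa_id: label} dicts,
# so it can be sanity-checked without TensorFlow or any TFRecord data.
# The ids and labels below are made-up example values.
if __name__ == "__main__":
    example_truth = {"1": 0, "2": 1, "3": 2, "4": 1}
    example_pred = {"1": 0, "2": 1, "3": 0, "4": 1}
    # 3 of the 4 predictions match the ground truth, so the expected accuracy is 0.75
    print(evaluate_acc(example_truth, example_pred))  # {'accuracy': 0.75}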
-------------------------------------------------------------------------------- /ensemble/dev_soft/model_char_1102_0.7278.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuhaitao1994/AIchallenger2018_MachineReadingComprehension/03c8d4ab60f6ac9c7f777fd2c932cc01300b5c42/ensemble/dev_soft/model_char_1102_0.7278.txt -------------------------------------------------------------------------------- /ensemble/dev_soft/model_newgraph_1101_0.7474.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuhaitao1994/AIchallenger2018_MachineReadingComprehension/03c8d4ab60f6ac9c7f777fd2c932cc01300b5c42/ensemble/dev_soft/model_newgraph_1101_0.7474.txt -------------------------------------------------------------------------------- /ensemble/dev_soft/model_newgraph_2lr_1101_0.7459.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuhaitao1994/AIchallenger2018_MachineReadingComprehension/03c8d4ab60f6ac9c7f777fd2c932cc01300b5c42/ensemble/dev_soft/model_newgraph_2lr_1101_0.7459.txt -------------------------------------------------------------------------------- /ensemble/ensemble_predict.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | ensemble_predict.py:将模型权重用于融合,预测结果。 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import pickle 14 | import os 15 | import codecs 16 | import time 17 | 18 | os.environ["CUDA_VISIBLE_DEVICES"] = "" 19 | os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" 20 | 21 | 22 | def ensemble_predict(): 23 | with open("test_eval.json", "r") as fh: 24 | test_eval_file = json.load(fh) 25 | with open("ensemble_wb.json", "r") as fh: 26 | best_wb = json.load(fh) 27 | 28 | predic_time = time.strftime("%Y-%m-%d_%H:%M:%S ", time.localtime()) 29 | prediction_file = os.path.join( 30 | "predictions", (predic_time + "_predictions.txt")) 31 | 32 | print("正在读取test的softmax结果文件!") 33 | rootdir = "./test_soft" 34 | # 定义融合后的测试集softmax字典 35 | dev_dict = {} 36 | predict_dict = {} 37 | 38 | # 获取目录下所有文件,并去除隐藏文件 39 | filelist = os.listdir(rootdir) 40 | filenames = [ 41 | filename for filename in filelist if not filename.startswith('.')] 42 | for i in range(len(filenames)): 43 | print("{}: {}".format(i + 1, filenames[i])) 44 | # print(filenames) 45 | 46 | # 初始化dev_dict 47 | if len(filenames) == 0: 48 | print("没有softmax文件") 49 | return 50 | path = os.path.join(rootdir, filenames[0]) 51 | with open(path, "rb") as f1: 52 | soft = pickle.load(f1) 53 | for key, value in soft.items(): 54 | dev_dict[key] = best_wb[filenames[0]][0] * value 55 | print("初始化完成") 56 | 57 | # 遍历剩下的test文件 58 | for i in range(1, len(filenames)): 59 | weight = best_wb[filenames[i]][0] 60 | print(weight) 61 | path = os.path.join(rootdir, filenames[i]) 62 | with open(path, "rb") as f1: 63 | soft = pickle.load(f1) 64 | for key in soft.keys(): 65 | dev_dict[key] += weight * soft[key] 66 | print(i + 1, "个文件处理完成") 67 | for key in dev_dict.keys(): 68 | dev_dict[key] += best_wb["bias"][0] 69 | # 计算预测类别 70 | with tf.Session(graph=tf.Graph()) as sess: 71 | ddev = tf.placeholder(shape=[3], dtype=tf.float32, name='all') 72 | dev_class = tf.cast(tf.argmax(ddev), dtype=tf.int32) 73 | for k in range(280001, 290001): 74 | key = str(k) 75 | value = sess.run(dev_class, 
feed_dict={ddev: dev_dict[key]}) 76 | predict_dict[key] = value 77 | 78 | predictions = [] 79 | for key, value in predict_dict.items(): 80 | prediction_answer = test_eval_file[str(key)][value] 81 | predictions.append(str(key) + '\t' + str(prediction_answer)) 82 | outputs = u'\n'.join(predictions) 83 | with codecs.open(prediction_file, 'w', encoding='utf-8') as f: 84 | f.write(outputs) 85 | print("done!") 86 | 87 | 88 | if __name__ == '__main__': 89 | ensemble_predict() 90 | -------------------------------------------------------------------------------- /ensemble/ensemble_train.py: -------------------------------------------------------------------------------- 1 | """ 2 | AI Challenger观点型问题阅读理解 3 | 4 | ensemble_train.py:在验证集中训练模型融合的权重。 5 | 6 | @author: yuhaitao 7 | """ 8 | # -*- coding:utf-8 -*- 9 | import tensorflow as tf 10 | import json as json 11 | import numpy as np 12 | from tqdm import tqdm 13 | import pickle 14 | import os 15 | import codecs 16 | import time 17 | 18 | os.environ["CUDA_VISIBLE_DEVICES"] = "0" 19 | os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" 20 | 21 | 22 | def ensemble_train(): 23 | total = 29968 24 | print("正在读取dev的softmax结果文件!") 25 | rootdir = "./dev_soft" 26 | dev_dict = {} 27 | predict_dict = {} 28 | with open('truth/truth_dict.txt', 'rb') as f1: 29 | truth_dict = pickle.load(f1) 30 | filelist = os.listdir(rootdir) 31 | filenames = [ 32 | filename for filename in filelist if not filename.startswith('.')] 33 | for i in range(len(filenames)): 34 | print("{}: {}".format(i + 1, filenames[i])) 35 | if len(filenames) == 0: 36 | print("没有softmax文件") 37 | return 38 | 39 | # 定义整个输入和标签矩阵 40 | all_inputs = np.zeros(shape=[total, 3, len(filenames)], dtype=np.float32) 41 | all_labels = np.zeros(shape=[total], dtype=np.int32) 42 | keys = [] 43 | for k in truth_dict.keys(): 44 | keys.append(k) 45 | keys.sort(reverse=False) 46 | if len(keys) != total: 47 | print("keys number error") 48 | return 49 | # 给标签赋值 50 | for i in range(total): 51 | all_labels[i] = truth_dict[keys[i]] 52 | # 遍历文件加入矩阵 53 | for i in range(len(filenames)): 54 | path = os.path.join(rootdir, filenames[i]) 55 | with open(path, "rb") as f1: 56 | soft = pickle.load(f1) 57 | for j in range(total): 58 | all_inputs[j, :, i] = soft[keys[j]] 59 | print(i + 1, "个文件处理完成") 60 | # print(all_labels[:10]) 61 | # print(all_inputs[:10, :, :]) 62 | sess_config = tf.ConfigProto(allow_soft_placement=True) 63 | sess_config.gpu_options.allow_growth = True 64 | with tf.Session(config=sess_config) as sess: 65 | inputs = tf.placeholder(shape=[total, 3, len( 66 | filenames)], dtype=tf.float32, name="inputs") 67 | labels = tf.placeholder(shape=[total], dtype=tf.int32, name="labels") 68 | W = tf.get_variable(shape=[len(filenames), 1], 69 | dtype=tf.float32, name="weights") 70 | b = tf.get_variable( 71 | shape=[1], dtype=tf.float32, name="bias") 72 | re_inputs = tf.reshape(inputs, shape=[-1, len(filenames)]) 73 | pred = tf.matmul(re_inputs, W) 74 | re_pred = tf.reshape(pred, shape=[total, 3, 1]) 75 | outputs = tf.squeeze(re_pred) 76 | predictions = tf.cast(tf.argmax(outputs, axis=1), tf.int32) 77 | # loss and opt 78 | loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits( 79 | logits=outputs, labels=tf.stop_gradient(labels))) 80 | train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss) 81 | 82 | # run 83 | sess.run(tf.global_variables_initializer()) 84 | best_acc = 0. 
85 | best_wb = {} 86 | for steps in range(1500): 87 | answers, train_OP = sess.run([predictions, train_op], feed_dict={ 88 | inputs: all_inputs, labels: all_labels}) 89 | answer_dict = {} 90 | for i in range(total): 91 | answer_dict[keys[i]] = answers[i] 92 | if evaluate_acc(truth_dict, answer_dict)["accuracy"] > best_acc: 93 | best_acc = evaluate_acc(truth_dict, answer_dict)["accuracy"] 94 | best_wb["weights"] = sess.run(W).tolist() 95 | best_wb["bias"] = sess.run(b).tolist() 96 | if (steps + 1) % 50 == 0: 97 | print("steps:{},acc:{:.5f}".format( 98 | steps + 1, evaluate_acc(truth_dict, answer_dict)["accuracy"])) 99 | for i in range(len(best_wb["weights"])): 100 | print("{}: {}".format(i + 1, best_wb["weights"][i])) 101 | save_wb = {} 102 | for file, weight in zip(filenames, best_wb["weights"]): 103 | save_wb[file] = weight 104 | save_wb["bias"] = best_wb["bias"] 105 | print(best_acc) 106 | with open("ensemble_wb.json", "w") as fw: 107 | json.dump(save_wb, fw) 108 | 109 | 110 | def evaluate_acc(truth_dict, answer_dict): 111 | """ 112 | 计算准确率,还可以设计返回正确问题和错误问题列表 113 | """ 114 | total = 0 115 | right = 0 116 | wrong = 0 117 | for key, value in answer_dict.items(): 118 | total += 1 119 | ground_truths = truth_dict[key] 120 | prediction = value 121 | if prediction == ground_truths: 122 | right += 1 123 | else: 124 | wrong += 1 125 | accuracy = (right / total) * 1.0 126 | return {"accuracy": accuracy} 127 | 128 | 129 | if __name__ == '__main__': 130 | ensemble_train() 131 | -------------------------------------------------------------------------------- /ensemble/test_soft/model_char_1102_0.7278.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuhaitao1994/AIchallenger2018_MachineReadingComprehension/03c8d4ab60f6ac9c7f777fd2c932cc01300b5c42/ensemble/test_soft/model_char_1102_0.7278.txt -------------------------------------------------------------------------------- /ensemble/test_soft/model_newgraph_1101_0.7474.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuhaitao1994/AIchallenger2018_MachineReadingComprehension/03c8d4ab60f6ac9c7f777fd2c932cc01300b5c42/ensemble/test_soft/model_newgraph_1101_0.7474.txt -------------------------------------------------------------------------------- /ensemble/test_soft/model_newgraph_2lr_1101_0.7459.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuhaitao1994/AIchallenger2018_MachineReadingComprehension/03c8d4ab60f6ac9c7f777fd2c932cc01300b5c42/ensemble/test_soft/model_newgraph_2lr_1101_0.7459.txt -------------------------------------------------------------------------------- /pics/model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yuhaitao1994/AIchallenger2018_MachineReadingComprehension/03c8d4ab60f6ac9c7f777fd2c932cc01300b5c42/pics/model.png --------------------------------------------------------------------------------
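Note: below is a minimal NumPy sketch of the stacking combination implemented in ensemble_train.py / ensemble_predict.py above. The per-model softmax vectors, weights and bias are made-up numbers; in the real pipeline they come from the ./test_soft pickles and the ensemble_wb.json written by ensemble_train.py.

    import numpy as np

    # Made-up softmax outputs of two models for one question (3 alternatives each).
    model_softmax = {
        "model_a.txt": np.array([0.6, 0.3, 0.1]),
        "model_b.txt": np.array([0.5, 0.4, 0.1]),
    }
    # Same layout as ensemble_wb.json: one single-element weight list per model file, plus a bias.
    weights = {"model_a.txt": [0.8], "model_b.txt": [1.2]}
    bias = [0.05]

    # Weighted sum of the softmax vectors plus bias, then argmax over the 3 alternatives.
    combined = sum(weights[name][0] * soft for name, soft in model_softmax.items()) + bias[0]
    prediction = int(np.argmax(combined))  # 0 / 1 / 2, an index into the alternatives
    print(combined, prediction)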