├── Query
├── code
│   ├── model.pyc
│   ├── blacklist
│   ├── view.py
│   ├── preprocess_test.py
│   ├── predict.py
│   ├── test.py
│   ├── preprocess_train.py
│   ├── train.py
│   └── model.py
├── temp
│   ├── test_data
│   ├── to_sentence
│   ├── att_sents.pickle
│   ├── att_words.pickle
│   ├── attention.pickle
│   └── predict_y.pickle
├── dictionary
│   ├── 公司简称.xlsx
│   ├── 公告负面词.xlsx
│   ├── 新闻负面词.xlsx
│   ├── 组合管理-持仓清单.xlsx
│   ├── positive.txt
│   ├── negative.txt
│   └── stopwords_CN.dat
├── test_data
│   └── sample.xlsx
├── training_data
│   └── sample.xlsx
└── ReadMe.md

/Query:
--------------------------------------------------------------------------------
1 | $query1$
2 | $query2$
3 | $query3$
4 | 
--------------------------------------------------------------------------------
/code/model.pyc:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LLluoling/FISHQA/HEAD/code/model.pyc
--------------------------------------------------------------------------------
/temp/test_data:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LLluoling/FISHQA/HEAD/temp/test_data
--------------------------------------------------------------------------------
/temp/to_sentence:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LLluoling/FISHQA/HEAD/temp/to_sentence
--------------------------------------------------------------------------------
/dictionary/公司简称.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LLluoling/FISHQA/HEAD/dictionary/公司简称.xlsx
--------------------------------------------------------------------------------
/dictionary/公告负面词.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LLluoling/FISHQA/HEAD/dictionary/公告负面词.xlsx
--------------------------------------------------------------------------------
/dictionary/新闻负面词.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LLluoling/FISHQA/HEAD/dictionary/新闻负面词.xlsx
--------------------------------------------------------------------------------
/temp/att_sents.pickle:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LLluoling/FISHQA/HEAD/temp/att_sents.pickle
--------------------------------------------------------------------------------
/temp/att_words.pickle:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LLluoling/FISHQA/HEAD/temp/att_words.pickle
--------------------------------------------------------------------------------
/temp/attention.pickle:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LLluoling/FISHQA/HEAD/temp/attention.pickle
--------------------------------------------------------------------------------
/temp/predict_y.pickle:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LLluoling/FISHQA/HEAD/temp/predict_y.pickle
--------------------------------------------------------------------------------
/test_data/sample.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LLluoling/FISHQA/HEAD/test_data/sample.xlsx
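
Note: the sample spreadsheets under `test_data/` and `training_data/` are read with pandas by the preprocessing scripts. A minimal sketch of the per-row access pattern, based on how `preprocess_test.py` reads each file (the column names `title`, `content`, `score` and the 0 = non-negative / otherwise-negative convention come from that script; the loop itself is only illustrative):

```python
import pandas as pd

# Each sample file holds one news item per row with at least these columns:
#   title   - headline, content - article body, score - sentiment label
# Relative path assumes the script is run from the code/ folder, as in the repo.
sheet = pd.read_excel("../test_data/sample.xlsx")
for row in range(len(sheet)):
    title = str(sheet.loc[row, "title"])
    text = str(sheet.loc[row, "content"])
    # score == 0 is treated as non-negative, anything else as negative
    label = 0 if sheet.loc[row, "score"] == 0 else 1
    print(label, title)
```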
--------------------------------------------------------------------------------
/dictionary/组合管理-持仓清单.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LLluoling/FISHQA/HEAD/dictionary/组合管理-持仓清单.xlsx
--------------------------------------------------------------------------------
/training_data/sample.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/LLluoling/FISHQA/HEAD/training_data/sample.xlsx
--------------------------------------------------------------------------------
/code/blacklist:
--------------------------------------------------------------------------------
1 | 年度
2 | 年
3 | 位于
4 | 年内
5 | 日
6 | 月
7 | 他
8 | 她
9 | 它
10 | 上
11 | 一
12 | 二
13 | 三
14 | 四
15 | 五
16 | 六
17 | 气
18 | 七
19 | 八
20 | 九
21 | 十
22 | 成立
23 | 昨日
24 | 今日
25 | 明日
26 | 全天
27 | 在
28 | 网
29 | 名单
30 | 新
31 | 是
32 | 上
33 | 下
34 | 左
35 | 右
36 | !
37 | 拟
--------------------------------------------------------------------------------
/dictionary/positive.txt:
--------------------------------------------------------------------------------
1 | 涨停
2 | 利好
3 | 追加担保
4 | 推荐
5 | 未受影响
6 | 资产注入
7 | 感谢
8 | 拯救
9 | 增持
10 | 拟投
11 | 定增
12 | 利好
13 | 化解
14 | 转型
15 | 看好
16 | 优质
17 | 牛股
18 | 金股
19 | 上调
20 | 升级
21 | 推动
22 | 机遇
23 | 腾飞
24 | 拓展
25 | 整合
26 | 启航
27 | 不减持
28 | 良机
29 | 孕育
30 | 机会
31 | 全额
32 | 推进
33 | 进军
34 | 加强
35 | 补贴
36 | 携手
37 | 振兴
38 | 扭亏为盈
39 | 助力
40 | 改革
41 | 转机
42 | 扫清
43 | 接盘
44 | 跳出
45 | 付息
46 | 逆袭
47 | 潜力
48 | 潜质
49 | 突围
50 | 有望
51 | 护航
52 | 优先
53 | 稳定
54 | 挺进
55 | 低估
56 | 翻番
57 | 在即
58 | 抄底
59 | 回应
60 | 回升
61 | 增补
62 | 加大
63 | 调升
64 | 上调
65 | 反弹
66 | 提高
67 | 持续提高
68 | 大有可为
69 | 战略重组
70 | 重启
71 | 追捧
72 | 积极
73 | 登顶
74 | 强力
75 | 自救成功
76 | 成功转让
77 | 增加
78 | 中标
79 | 领衔
80 | 解冻
81 | 付息
82 | 实现盈利
83 | 亏损收窄
84 | 收购
85 | 激励
86 | 澄清
87 | 盈利
88 | 派息
--------------------------------------------------------------------------------
/ReadMe.md:
--------------------------------------------------------------------------------
1 | # FISHQA (Financial Sentiment Analysis with Hierarchical Query-driven Attention)
2 | ### This is a TensorFlow implementation of [Beyond Polarity: Interpretable Financial Sentiment Analysis with Hierarchical Query-driven Attention](https://www.ijcai.org/proceedings/2018/0590.pdf)
3 | 
4 | ## Requirements
5 | * python 3.6.1
6 | * TensorFlow 1.11.0
7 | * jieba 0.39
8 | 
9 | ## Code Introduction
10 | 
11 | ### Step 1: Preprocess data
12 | ```bash
13 | python preprocess_train.py
14 | python preprocess_test.py
15 | ```
16 | Preprocess the training and test datasets.
17 | Remember to modify the dictionaries and `filterwords` based on your own datasets.
18 | 
19 | ### Step 2: Training model
20 | ```bash
21 | cd FISHQA/code
22 | python train.py
23 | ```
24 | 
25 | 
26 | Set the parameters based on your own datasets and train your own model.
27 | 
28 | ### Step 3: Test model
29 | ```bash
30 | python test.py
31 | ```
32 | 
33 | ### Step 4: Simple attention visualization
34 | ```bash
35 | python view.py
36 | ```
37 | 
38 | 
39 | ## Data Introduction
40 | * Modify your own queries (`FISHQA/Query`) based on your own datasets and prior knowledge. Each `query` can be decided manually.
41 | * Note that the folder `temp/` contains a subset of our preprocessed data.
42 | * As our dataset is private, we cannot release it. We put two raw samples in the folders `training_data` and `test_data` respectively.
43 | * Under folder `dictionary/`, there are some extra dictionaries summarized by professional for Chinese financial news. 44 | -------------------------------------------------------------------------------- /dictionary/negative.txt: -------------------------------------------------------------------------------- 1 | 违约 2 | 实质违约 3 | 不确定性 4 | 不确定 5 | 退市 6 | 未按时 7 | 未按期 8 | 暂停上市 9 | 暂停交易 10 | 终止上市 11 | 逾期 12 | 债务逾期 13 | 贷款逾期 14 | 新增贷款 15 | 亏损 16 | 预亏 17 | 巨亏 18 | 血本无归 19 | 风险 20 | 偿付风险 21 | 兑付风险 22 | 特别风险 23 | 风险提示 24 | 风险警示 25 | 警示 26 | 缩减 27 | 风波不断 28 | 下调 29 | 下跌 30 | 下滑 31 | 重整 32 | 重组 33 | 重大事项 34 | 破产重组 35 | 破产 36 | 清盘 37 | 偿债 38 | 还债 39 | 免职 40 | 免去 41 | 解聘 42 | 判决 43 | 诉讼 44 | 审理 45 | 法律诉讼 46 | 司法 47 | 冻结 48 | 涉嫌 49 | 涉诉 50 | 起诉 51 | 纠纷 52 | 败诉 53 | 查封 54 | 仲裁 55 | 通缉 56 | 查处 57 | 禁止 58 | 扣押 59 | 推迟 60 | 延期 61 | 取消 62 | 终止 63 | 停牌 64 | 停盘 65 | 停产 66 | 停工 67 | 停止 68 | 跌停 69 | 欠息 70 | 降级 71 | 调降 72 | 下调 73 | 观察名单 74 | 无法偿还 75 | 无法兑付 76 | 无力偿还 77 | 未及时兑付 78 | 未足额 79 | 未履行 80 | 不能履行 81 | 大额对外担保 82 | 经理变更 83 | 股东变更 84 | 通报批评 85 | 谴责 86 | 调查 87 | 立案 88 | 处罚 89 | 违规 90 | 违反 91 | 处分 92 | 警告 93 | 警示 94 | 受贿 95 | 强制 96 | 非法 97 | 挪用 98 | 行贿 99 | 洗钱 100 | 内幕交易 101 | 违纪 102 | 违法 103 | 调查 104 | 开除 105 | 贪污 106 | 事故 107 | 会计差错 108 | 自杀 109 | 跳楼 110 | 上吊 111 | 身亡 112 | 溺水 113 | 坠楼 114 | 抑郁 115 | 抑郁症 116 | 自缢 117 | 死者 118 | 负面 119 | 审计署 120 | 整治 121 | 整顿 122 | 资不抵债 123 | 低效 124 | 暴跌 125 | 寒冬 126 | 恶化 127 | 调低 128 | 产能过剩 129 | 低迷 130 | 限用令 131 | 过剩 132 | 缩水 133 | 问询 134 | 危机 135 | 剥离 136 | 跌 137 | 下挫 138 | 走弱 139 | 走低 140 | 造假 141 | 大跌 142 | 临停 143 | 乏力 144 | 悉售 145 | 停产 146 | 补充质押 147 | 打开涨停 -------------------------------------------------------------------------------- /code/view.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | #coding:utf-8 3 | import numpy as np 4 | import pandas as pd 5 | import pickle 6 | #import matplotlib.pyplot as plt 7 | #import seaborn as sns 8 | from functools import reduce 9 | from tqdm import tqdm 10 | import os 11 | import codecs 12 | data_path = "../test_data" 13 | output_path = "../output" 14 | def union_f(x, y = ""): 15 | return x +" "+ y 16 | with codecs.open("blacklist.txt",'r') as fr: 17 | b = fr.readlines() 18 | blacklist = [] 19 | for i in b: 20 | blacklist.append(i.strip()) 21 | blacklist = set(blacklist) 22 | 23 | f = open(os.path.join(output_path,"deal_list")) 24 | title,content = [],[] 25 | for line in f: 26 | filename = os.path.join(data_path,line.strip()) 27 | sheet = pd.read_excel(filename) 28 | title.extend(list(sheet.loc[:,"title"])) 29 | content.extend(list(sheet.loc[:,"content"])) 30 | 31 | # load att_words 32 | with open('../temp/att_words.pickle', 'rb') as f: 33 | att_words = pickle.load(f) 34 | # load att_sents 35 | with open('../temp/att_sents.pickle', 'rb') as f: 36 | att_sents = pickle.load(f) 37 | # load predict_y 38 | with open('../temp/predict_y.pickle', 'rb') as f: 39 | y_pred = pickle.load(f) 40 | # Y = [i.index(1) for i in y_pred] 41 | with open('../temp/to_sentence', 'rb') as f: 42 | to_sentence = pickle.load(f) 43 | with open('../temp/test_data', 'rb') as f: 44 | X,_ = pickle.load(f) 45 | 46 | with open('../model/vocab.pickle', 'rb') as f: 47 | vocab = pickle.load(f) 48 | 49 | new_vocab = dict(map(lambda t:(t[1],t[0]), vocab.items())) 50 | 51 | print("title,content,att_words,att_sents,y_pred: ",len(title),len(content),len(att_words),len(att_sents),len(y_pred)) 52 | 53 | # output a file to view attended sentences 54 | S,W = [],[] 55 | for doc in tqdm(range(len(att_sents))): 
56 | b = [] 57 | for query in range(len(att_sents[0])): 58 | b.append(sorted(range(len(att_sents[doc][query])), key=list(att_sents[doc][query]).__getitem__,reverse=True)) 59 | # load sents 60 | try: 61 | tmp = "" 62 | count = 0;i = 0 63 | for i in range(3): 64 | temp_list = [] 65 | for sent in b[i]: 66 | if len(to_sentence[str(X[doc][sent])])>3: 67 | temp_list.append(to_sentence[str(X[doc][sent])]) 68 | count += 1 69 | if count >=3: 70 | break 71 | tmp += reduce(union_f,temp_list)+"||" 72 | S.append(tmp) 73 | except: 74 | S.append(title[doc]) 75 | # load words 76 | try: 77 | tmp = "" 78 | for i in range(30): 79 | word = new_vocab[X[doc][i][att_words[doc][i]]] 80 | if word!="UNKNOW_TOKEN": 81 | tmp += word+" " 82 | W.append(tmp) 83 | except: 84 | pass 85 | 86 | 87 | writer = pd.ExcelWriter('../output/result.xlsx') 88 | df = pd.DataFrame(columns=['title','content',"predict_score","attened_sents","attened_words"]) 89 | df.loc[:,"title"] = title 90 | df.loc[:,"content"] = content 91 | df.loc[:,"predict_score"] = list(y_pred) 92 | df.loc[:,"attened_sents"] = S 93 | df.loc[:,"attened_words"] = W 94 | 95 | df.to_excel(writer,'Sheet1') 96 | writer.save() 97 | print("done!") 98 | -------------------------------------------------------------------------------- /code/preprocess_test.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | #coding:utf-8 3 | import numpy as np 4 | import pandas as pd 5 | import jieba.posseg as pseg 6 | import jieba 7 | import re 8 | import os 9 | import codecs 10 | from collections import defaultdict 11 | from tqdm import tqdm 12 | import pickle 13 | import random 14 | import argparse 15 | import os 16 | 17 | 18 | # load dictionary/names of all the corps 19 | # noted that jieba is Chinese text segmentation; see https://github.com/fxsjy/jieba 20 | name = ["太平洋资产管理有限责任公司","张家港农商银行","江苏银行","中建投信托有限责任公司","华宝兴业基金"] 21 | corps =set() 22 | for i in range(len(name)): 23 | tmp_sheet = pd.read_excel("../dictionary/组合管理-持仓清单.xlsx",sheetname=name[i]) 24 | corps = corps|set(list(tmp_sheet.loc[:,"主体名称"])) 25 | corps = corps|set(list(tmp_sheet.loc[:,"债券名称"])) 26 | sheet2 = pd.read_excel('../dictionary/公司简称.xlsx',sheetname = 0) 27 | corps = corps|(set(list(sheet2.iloc[:,1]))) 28 | corps = corps|(set(list(sheet2.iloc[:,0]))) 29 | jieba.load_userdict('../dictionary/mydict.txt') 30 | jieba.load_userdict('../dictionary/negative.txt') 31 | jieba.load_userdict('../dictionary/positive.txt') 32 | jieba.load_userdict(corps) 33 | 34 | # load negative words 35 | neg_words = pd.read_excel("../dictionary/新闻负面词.xlsx") 36 | jieba.load_userdict(list(neg_words.loc[:,"NewsClass"])) 37 | 38 | # filter some nosiy data 39 | # pattern="[\s\.\!\/_,-:;~{}`^\\\[\]<=>?$%^*()+\"\']+|[+——!·【】‘’“”《》,。:;?、~@#¥%……&*()]+0123456789qwertyuioplkjhgfdsazxcvbnm" 40 | pattern="[\.\\/_,,.:;~{}`^\\\[\]<=>?$%^*()+\"\']+|[+·。:【】‘’“”《》、~@#¥%……&*()]+0123456789" 41 | pat = set(pattern)|set(["\n",'\u3000'," ","\s","","
"]) 42 | filterwords = ["
","责任编辑","DF","点击查看","热点栏目 资金流向 千股千评 个股诊断 最新评级 模拟交易 客户端","进入【新浪财经股吧】讨论","记者","鸣谢","报道","重点提示","重大事项","重要内容提示","提示:键盘也能翻页,试试“← →”键","原标题"] 43 | # with codecs.open('../dictionary/stopwords_CN.dat','r') as fr: 44 | # stopwords=fr.readlines() 45 | # stopwords=[i.strip() for i in stopwords] 46 | # stopwords=set(stopwords) 47 | 48 | 49 | # get test data 50 | test_x, test_y = [],[] 51 | #测试集每个分词后的句子对应的真实的句子,存在词典里面, 52 | to_sentence = {} 53 | #每个句子对应的文档index 54 | to_document = {} 55 | max_sent_in_doc = 30 56 | max_word_in_sent = 45 57 | UNKNOWN = 0 58 | num_classes =2 59 | with open("../model/vocab.pickle",'rb') as f: 60 | vocab = pickle.load(f) 61 | def FormData(sheet): 62 | for row in tqdm(range(len(sheet))): 63 | doc=np.zeros((30,45), dtype=np.int32) 64 | title = str(sheet.loc[row,"title"]) 65 | text = str(sheet.loc[row,"content"]) 66 | for item in filterwords: 67 | text = text.replace(item,"") 68 | sents = title +"。"+text 69 | count1 = 0 70 | for i, sent in enumerate(sents.split("。")): 71 | # filter the code in the news 72 | if "function()" in sent: 73 | continue 74 | if count1 < max_sent_in_doc: 75 | count = 0 76 | for j, word in enumerate(pseg.lcut(sent)): 77 | kind = (list(word))[1][0] 78 | tmpword = (list(word))[0] 79 | if (tmpword not in pat) and (tmpword[0] not in pat) and (count < max_word_in_sent): 80 | doc[count1][count] = vocab.get(tmpword, UNKNOWN) 81 | count +=1 82 | to_sentence[str(doc[count1].tolist())] = sent 83 | count1 +=1 84 | # score==1: negative; score==0: positive 85 | # try: 86 | if sheet.loc[row,"score"]==0: 87 | label = 0 88 | else: 89 | label = 1 90 | labels = [0] * num_classes 91 | labels[label] = 1 92 | # except: 93 | # labels = [0] * num_classes 94 | test_y.append(labels) 95 | test_x.append(doc.tolist()) 96 | 97 | # deal with every file in test_data 98 | path = "../test_data" 99 | f = open("../output/deal_list","w") 100 | for n_file in os.listdir(path): 101 | try: 102 | file_path = os.path.join(path,n_file) 103 | data = pd.read_excel(file_path) 104 | # dat = data.loc[data.clas=="财经网站"].copy() 105 | # print(len(dat)) 106 | f.write(n_file+"\n") 107 | FormData(data) 108 | except: 109 | pass 110 | f.close() 111 | pickle.dump((to_sentence), open('../temp/to_sentence', 'wb')) 112 | pickle.dump((test_x, test_y), open('../temp/test_data', 'wb')) 113 | print("load test_data finished") 114 | -------------------------------------------------------------------------------- /code/predict.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | #coding:utf-8 3 | from model import FISHQA,read_question,shuffle_data 4 | import tensorflow as tf 5 | import time 6 | import pickle 7 | import numpy as np 8 | from tqdm import tqdm 9 | from tensorflow.contrib import rnn 10 | from tensorflow.contrib import layers 11 | import random 12 | import pandas as pd 13 | import os 14 | import argparse 15 | parser = argparse.ArgumentParser() 16 | parser.add_argument('--logdir', default='1526700733') 17 | args = parser.parse_args() 18 | # Data loading params 19 | tf.flags.DEFINE_string("data_dir", "../data/data.dat", "data directory") 20 | tf.flags.DEFINE_integer("vocab_size", 52812, "vocabulary size") 21 | tf.flags.DEFINE_integer("num_classes", 2, "number of classes") 22 | tf.flags.DEFINE_integer("embedding_size", 200, "Dimensionality of character embedding (default: 200)") 23 | tf.flags.DEFINE_integer("hidden_size", 100, "Dimensionality of GRU hidden layer (default: 50)") 24 | tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)") 
25 | tf.flags.DEFINE_integer("num_epochs", 20, "Number of training epochs (default: 50)") 26 | tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps (default: 100)") 27 | tf.flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store (default: 5)") 28 | tf.flags.DEFINE_integer("evaluate_every", 10, "evaluate every this many batches") 29 | tf.flags.DEFINE_float("learning_rate", 0.001, "learning rate") 30 | tf.flags.DEFINE_float("grad_clip", 5, "grad clip to prevent gradient explode") 31 | tf.flags.DEFINE_float("sentence_num", 30, "the max number of sentence in a document") 32 | tf.flags.DEFINE_float("sentence_length", 45, "the max length of each sentence") 33 | 34 | with open("../temp/test_data", 'rb') as f: 35 | test_x,test_y = pickle.load(f) 36 | 37 | FLAGS = tf.flags.FLAGS 38 | print("loading test data finished") 39 | 40 | def main(): 41 | with tf.Session() as sess: 42 | fishqa = FISHQA(vocab_size=FLAGS.vocab_size, 43 | num_classes=FLAGS.num_classes, 44 | embedding_size=FLAGS.embedding_size, 45 | hidden_size=FLAGS.hidden_size, 46 | dropout_keep_proba=0.5, 47 | query = read_question() 48 | ) 49 | with tf.name_scope('loss'): 50 | loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=fishqa.input_y, 51 | logits=fishqa.out, 52 | name='loss')) 53 | with tf.name_scope('accuracy'): 54 | predict = tf.argmax(fishqa.out, axis=1, name='predict') 55 | label = tf.argmax(fishqa.input_y, axis=1, name='label') 56 | acc = tf.reduce_mean(tf.cast(tf.equal(predict, label), tf.float32)) 57 | 58 | with tf.name_scope('att_words'): 59 | att_words = tf.reshape(fishqa.att_word,[-1,30]) 60 | with tf.name_scope('att_sents'): 61 | att_sents = tf.reshape(fishqa.att_sent,[-1,4,30]) 62 | timestamp = str(int(time.time())) 63 | out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", args.logdir)) 64 | 65 | 66 | checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints")) 67 | checkpoint_path = checkpoint_dir + '/my-model.ckpt' 68 | 69 | # saver = tf.train.Saver(tf.global_variables(), max_to_keep=1) 70 | saver = tf.train.Saver(tf.global_variables()) 71 | sess.run(tf.global_variables_initializer()) 72 | def test_step(x, y): 73 | predictions,labels = [],[] 74 | attend_w,attend_s = [],[] 75 | for i in range(0, len(x), FLAGS.batch_size): 76 | 77 | feed_dict = { 78 | fishqa.input_x: x[i:i + FLAGS.batch_size], 79 | fishqa.input_y: y[i:i + FLAGS.batch_size], 80 | fishqa.max_sentence_num: 30, 81 | fishqa.max_sentence_length: 45, 82 | fishqa.batch_size: 64, 83 | fishqa.is_training:False 84 | } 85 | # step, summaries,cost, accuracy,correctNumber = sess.run([global_step, dev_summary_op,loss,acc,accNUM], feed_dict) 86 | pre,att_w,att_s= sess.run([predict,att_words,att_sents], feed_dict) 87 | attend_w.extend(att_w) 88 | attend_s.extend(att_s) 89 | predictions.extend(pre) 90 | 91 | print("predict score done!") 92 | pickle.dump(attend_w, open('../temp/att_words.pickle', 'wb')) 93 | pickle.dump(attend_s, open('../temp/att_sents.pickle', 'wb')) 94 | pickle.dump(predictions, open('../temp/predict_y.pickle', 'wb')) 95 | print("attention weights loaded!") 96 | 97 | saver.restore(sess, checkpoint_path) 98 | test_step(test_x, test_y) 99 | 100 | if __name__ == '__main__': 101 | main() 102 | -------------------------------------------------------------------------------- /code/test.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | #coding:utf-8 3 | from model import FISHQA,read_question,shuffle_data 4 | import 
tensorflow as tf 5 | import time 6 | import pickle 7 | import numpy as np 8 | from tqdm import tqdm 9 | from tensorflow.contrib import rnn 10 | from tensorflow.contrib import layers 11 | import random 12 | import pandas as pd 13 | import os 14 | import argparse 15 | parser = argparse.ArgumentParser() 16 | parser.add_argument('--logdir', default='1526700733') 17 | args = parser.parse_args() 18 | # Data loading params 19 | tf.flags.DEFINE_string("data_dir", "../data/data.dat", "data directory") 20 | tf.flags.DEFINE_integer("vocab_size", 52812, "vocabulary size") 21 | tf.flags.DEFINE_integer("num_classes", 2, "number of classes") 22 | tf.flags.DEFINE_integer("embedding_size", 200, "Dimensionality of character embedding (default: 200)") 23 | tf.flags.DEFINE_integer("hidden_size", 100, "Dimensionality of GRU hidden layer (default: 50)") 24 | tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)") 25 | tf.flags.DEFINE_integer("num_epochs", 20, "Number of training epochs (default: 50)") 26 | tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps (default: 100)") 27 | tf.flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store (default: 5)") 28 | tf.flags.DEFINE_integer("evaluate_every", 10, "evaluate every this many batches") 29 | tf.flags.DEFINE_float("learning_rate", 0.001, "learning rate") 30 | tf.flags.DEFINE_float("grad_clip", 5, "grad clip to prevent gradient explode") 31 | tf.flags.DEFINE_float("sentence_num", 30, "the max number of sentence in a document") 32 | tf.flags.DEFINE_float("sentence_length", 45, "the max length of each sentence") 33 | 34 | with open("../temp/test_data", 'rb') as f: 35 | test_x,test_y = pickle.load(f) 36 | 37 | FLAGS = tf.flags.FLAGS 38 | print("loading test data finished") 39 | 40 | def main(): 41 | with tf.Session() as sess: 42 | fishqa = FISHQA(vocab_size=FLAGS.vocab_size, 43 | num_classes=FLAGS.num_classes, 44 | embedding_size=FLAGS.embedding_size, 45 | hidden_size=FLAGS.hidden_size, 46 | dropout_keep_proba=0.5, 47 | query = read_question() 48 | ) 49 | with tf.name_scope('loss'): 50 | loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=fishqa.input_y, 51 | logits=fishqa.out, 52 | name='loss')) 53 | with tf.name_scope('accuracy'): 54 | predict = tf.argmax(fishqa.out, axis=1, name='predict') 55 | label = tf.argmax(fishqa.input_y, axis=1, name='label') 56 | acc = tf.reduce_mean(tf.cast(tf.equal(predict, label), tf.float32)) 57 | 58 | with tf.name_scope('att_words'): 59 | att_words = tf.reshape(fishqa.att_word,[-1,30]) 60 | with tf.name_scope('att_sents'): 61 | att_sents = tf.reshape(fishqa.att_sent,[-1,4,30]) 62 | timestamp = str(int(time.time())) 63 | out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", args.logdir)) 64 | 65 | 66 | checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints")) 67 | checkpoint_path = checkpoint_dir + '/my-model.ckpt' 68 | 69 | # saver = tf.train.Saver(tf.global_variables(), max_to_keep=1) 70 | saver = tf.train.Saver(tf.global_variables()) 71 | sess.run(tf.global_variables_initializer()) 72 | def test_step(x, y): 73 | predictions,labels = [],[] 74 | attend_w,attend_s = [],[] 75 | for i in range(0, len(x), FLAGS.batch_size): 76 | 77 | feed_dict = { 78 | fishqa.input_x: x[i:i + FLAGS.batch_size], 79 | fishqa.input_y: y[i:i + FLAGS.batch_size], 80 | fishqa.max_sentence_num: 30, 81 | fishqa.max_sentence_length: 45, 82 | fishqa.batch_size: 64, 83 | fishqa.is_training:False 84 | } 85 | pre, groundtruth, att_w, att_s = 
sess.run([predict,label,att_words,att_sents], feed_dict) 86 | predictions.extend(pre) 87 | labels.extend(groundtruth) 88 | attend_w.extend(att_w) 89 | attend_s.extend(att_s) 90 | df = pd.DataFrame({'predictions': predictions, 'labels': labels}) 91 | acc_dev = (df['predictions'] == df['labels']).mean() 92 | print("++++++++++++++++++test++++++++++++++: acc {:g} ".format(acc_dev)) 93 | pickle.dump(attend_w, open('../temp/att_words.pickle', 'wb')) 94 | pickle.dump(attend_s, open('../temp/att_sents.pickle', 'wb')) 95 | pickle.dump(predictions, open('../temp/predict_y.pickle', 'wb')) 96 | print("attention weights loaded!") 97 | 98 | saver.restore(sess, checkpoint_path) 99 | test_step(test_x, test_y) 100 | 101 | if __name__ == '__main__': 102 | main() 103 | -------------------------------------------------------------------------------- /code/preprocess_train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | #coding:utf-8 3 | import numpy as np 4 | import pandas as pd 5 | import jieba.posseg as pseg 6 | import jieba 7 | import re 8 | import os 9 | import codecs 10 | from collections import defaultdict 11 | from tqdm import tqdm 12 | import pickle 13 | import random 14 | import argparse 15 | from model import shuffle_data 16 | 17 | 18 | # load dictionary/names of all the corps 19 | # noted that jieba is Chinese text segmentation; see https://github.com/fxsjy/jieba 20 | name = ["太平洋资产管理有限责任公司","张家港农商银行","江苏银行","中建投信托有限责任公司","华宝兴业基金"] 21 | corps =set() 22 | for i in range(len(name)): 23 | sheet = pd.read_excel("../dictionary/组合管理-持仓清单.xlsx",sheetname=name[i]) 24 | corps = corps|set(list(sheet.loc[:,"主体名称"])) 25 | corps = corps|set(list(sheet.loc[:,"债券名称"])) 26 | sheet2 = pd.read_excel('../dictionary/公司简称.xlsx',sheetname = 0) 27 | corps = corps|(set(list(sheet2.iloc[:,1]))) 28 | corps = corps|(set(list(sheet2.iloc[:,0]))) 29 | jieba.load_userdict('../dictionary/mydict.txt') 30 | jieba.load_userdict('../dictionary/negative.txt') 31 | jieba.load_userdict('../dictionary/positive.txt') 32 | jieba.load_userdict(corps) 33 | 34 | # load negative words 35 | neg_words = pd.read_excel("../dictionary/新闻负面词.xlsx") 36 | jieba.load_userdict(list(neg_words.loc[:,"NewsClass"])) 37 | 38 | # filter some nosiy marks 39 | pattern="[\.\\/_,,.:;~{}`^\\\[\]<=>?$%^*()+\"\']+|[+·。:【】‘’“”《》、~@#¥%……&*()]+0123456789" 40 | pat = set(pattern)|set(["\n",'\u3000'," ","\s","","
"]) 41 | 42 | # some noise words in chinese news 43 | filterwords = ["
","责任编辑","DF","点击查看","热点栏目 资金流向 千股千评 个股诊断 最新评级 模拟交易 客户端","进入【新浪财经股吧】讨论","记者","鸣谢","报道","重点提示","重大事项","重要内容提示","提示:键盘也能翻页,试试“← →”键","原标题"] 44 | # with codecs.open('../dictionary/stopwords_CN.dat','r') as fr: 45 | # stopwords=fr.readlines() 46 | # stopwords=[i.strip() for i in stopwords] 47 | # stopwords=set(stopwords) 48 | 49 | 50 | # count the frequency of each word in documents 51 | print("count word frequency") 52 | word_freq = defaultdict(int) 53 | def Getdata(sheet): 54 | for row in tqdm(range(len(sheet))): 55 | title=str(sheet.loc[row,"title"]) 56 | content=str(sheet.loc[row,"content"]) 57 | for item in filterwords: 58 | content = content.replace(item,"") 59 | sents = title +" " +content 60 | words=pseg.lcut(sents) 61 | for j, word in enumerate(words): 62 | kind = (list(word))[1][0] 63 | tmpword = (list(word))[0] 64 | #if (kind not in ['e','x','m','u']) and (tmpword not in stopwords): 65 | if (tmpword not in pat) and (tmpword[0] not in pat): 66 | word_freq[tmpword]+=1 67 | path = "../training_data" 68 | for n_file in os.listdir(path): 69 | file_path = os.path.join(path,n_file) 70 | sheet = pd.read_excel(file_path) 71 | Getdata(sheet) 72 | 73 | # count the frequency of each word in query set 74 | q=[] 75 | f = open("../Query") 76 | for line in f: 77 | q.append(line) 78 | words=pseg.lcut(str(line.strip())) 79 | for j, word in enumerate(words): 80 | kind = (list(word))[1][0] 81 | tmpword = (list(word))[0] 82 | if (tmpword not in pat) and (tmpword[0] not in pat): 83 | word_freq[tmpword]+=1 84 | f.close() 85 | print("previous data length:",len(word_freq)) 86 | 87 | 88 | # load word frequency 89 | if not os.path.exists("../model"): 90 | os.mkdir("../model") 91 | with open('../model/word_freq.pickle', 'wb') as g: 92 | pickle.dump(word_freq, g) 93 | print(len(word_freq),"word_freq save finished") 94 | # sorted by word frequency and remove those whose frquency < 3 95 | sort_words = list(sorted(word_freq.items(), key=lambda x:-x[1])) 96 | print("the 10 most words:",sort_words[:10],"\n the 10 least words:",sort_words[-10:]) 97 | 98 | 99 | # load word vocab 100 | vocab = {} 101 | i = 3 102 | vocab['UNKNOW_TOKEN'] = 0 103 | 104 | for word, freq in sort_words: 105 | if freq > 3: 106 | vocab[word] = i 107 | i += 1 108 | with open('../model/vocab.pickle', 'wb') as f: 109 | pickle.dump(vocab, f) 110 | print(len(vocab),"vocab save finished") 111 | UNKNOWN = 0 112 | num_classes = 2 113 | 114 | 115 | # get training data 116 | data_x,data_y =[],[] 117 | max_sent_in_doc = 30 118 | max_word_in_sent = 45 119 | 120 | # we form 3 queries for our model (depending on your datasets and your need) 121 | question = np.zeros((3,max_word_in_sent), dtype=np.int32) 122 | 123 | for i,ite in enumerate(q): 124 | words=pseg.lcut(ite) 125 | count = 0 126 | for j, word in enumerate(words): 127 | kind = (list(word))[1][0] 128 | tmpword = (list(word))[0] 129 | if (tmpword not in pat) and (tmpword[0] not in pat): 130 | question[i][count] = vocab.get(tmpword, UNKNOWN) 131 | count +=1 132 | def FormData(sheet): 133 | for row in tqdm(range(len(sheet))): 134 | doc=np.zeros((30,45), dtype=np.int32) 135 | title = str(sheet.loc[row,"title"]) 136 | text = str(sheet.loc[row,"content"]) 137 | for item in filterwords: 138 | text = text.replace(item,"") 139 | sents = title +"。"+text 140 | count1 = 0 141 | for i, sent in enumerate(sents.split("。")): 142 | # filter the code in the news 143 | if "function()" in sent: 144 | continue 145 | if count1 < max_sent_in_doc: 146 | count = 0 147 | for j, word in enumerate(pseg.lcut(sent)): 148 | kind = 
(list(word))[1][0] 149 | tmpword = (list(word))[0] 150 | if (tmpword not in pat) and (tmpword[0] not in pat) and (count < max_word_in_sent): 151 | doc[count1][count] = vocab.get(tmpword, UNKNOWN) 152 | count +=1 153 | count1 +=1 154 | # 0: non-neg 1: neg 155 | if sheet.loc[row,"score"]==0: 156 | label = 0 157 | else: 158 | label = 1 159 | labels = [0] * num_classes 160 | labels[label] = 1 161 | data_y.append(labels) 162 | data_x.append(doc.tolist()) 163 | for n_file in os.listdir(path): 164 | file_path = os.path.join(path,n_file) 165 | sheet = pd.read_excel(file_path) 166 | FormData(sheet) 167 | print("load train_data finished, length: ",len(data_x)) 168 | 169 | 170 | # load training data 171 | data_x,data_y = shuffle_data(data_x,data_y) 172 | train_x,train_y,eval_x,eval_y = [],[],[],[] 173 | for i in range(len(data_x)): 174 | r = random.random() 175 | if r<0.8: 176 | train_x.append(data_x[i]) 177 | train_y.append(data_y[i]) 178 | else: 179 | eval_x.append(data_x[i]) 180 | eval_y.append(data_y[i]) 181 | 182 | print("shuffle data finished!") 183 | pickle.dump((train_x,train_y), open('../model/train_data', 'wb')) 184 | pickle.dump((eval_x,eval_y), open('../model/dev_data', 'wb')) 185 | pickle.dump((question[0].tolist()), open('../model/q1_data', 'wb')) 186 | pickle.dump((question[1].tolist()), open('../model/q2_data', 'wb')) 187 | pickle.dump((question[2].tolist()), open('../model/q3_data', 'wb')) 188 | print("store training data finished!") 189 | -------------------------------------------------------------------------------- /code/train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | #coding:utf-8 3 | from model import FISHQA,read_question,shuffle_data 4 | import tensorflow as tf 5 | import time 6 | import pickle 7 | import numpy as np 8 | from tqdm import tqdm 9 | from tensorflow.contrib import rnn 10 | from tensorflow.contrib import layers 11 | import random 12 | import pandas as pd 13 | import os 14 | 15 | 16 | # Data loading params 17 | tf.flags.DEFINE_string("data_dir", "../data/data.dat", "data directory") 18 | tf.flags.DEFINE_integer("vocab_size", 52812, "vocabulary size") 19 | tf.flags.DEFINE_integer("num_classes", 2, "number of classes") 20 | tf.flags.DEFINE_integer("embedding_size", 200, "Dimensionality of character embedding (default: 200)") 21 | tf.flags.DEFINE_integer("hidden_size", 100, "Dimensionality of GRU hidden layer (default: 50)") 22 | tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)") 23 | tf.flags.DEFINE_integer("num_epochs", 15, "Number of training epochs (default: 50)") 24 | tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps (default: 100)") 25 | tf.flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store (default: 5)") 26 | tf.flags.DEFINE_integer("evaluate_every", 10, "evaluate every this many batches") 27 | tf.flags.DEFINE_float("learning_rate", 0.001, "learning rate") 28 | tf.flags.DEFINE_float("grad_clip", 5, "grad clip to prevent gradient explode") 29 | tf.flags.DEFINE_float("sentence_num", 30, "the max number of sentence in a document") 30 | tf.flags.DEFINE_float("sentence_length", 45, "the max length of each sentence") 31 | 32 | def read_dataset(): 33 | train_x, train_y,dev_x, dev_y =[],[],[],[] 34 | with open("../model/train_data", 'rb') as f: 35 | train_x, train_y = pickle.load(f) 36 | with open("../model/dev_data", 'rb') as g: 37 | dev_x, dev_y = pickle.load(g) 38 | return train_x, train_y,dev_x, dev_y 39 | 40 | FLAGS = 
tf.flags.FLAGS 41 | train_x, train_y,dev_x, dev_y = read_dataset() 42 | acc_record = 0 43 | print("data load finished") 44 | 45 | 46 | 47 | def main(): 48 | with tf.Session() as sess: 49 | fishqa = FISHQA(vocab_size=FLAGS.vocab_size, 50 | num_classes=FLAGS.num_classes, 51 | embedding_size=FLAGS.embedding_size, 52 | hidden_size=FLAGS.hidden_size, 53 | dropout_keep_proba=0.5, 54 | query = read_question() 55 | ) 56 | with tf.name_scope('loss'): 57 | loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=fishqa.input_y, 58 | logits=fishqa.out, 59 | name='loss')) 60 | with tf.name_scope('accuracy'): 61 | predict = tf.argmax(fishqa.out, axis=1, name='predict') 62 | label = tf.argmax(fishqa.input_y, axis=1, name='label') 63 | acc = tf.reduce_mean(tf.cast(tf.equal(predict, label), tf.float32)) 64 | 65 | timestamp = str(int(time.time())) 66 | out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp)) 67 | print("Model Writing to {}\n".format(out_dir)) 68 | global_step = tf.Variable(0, trainable=False) 69 | 70 | optimizer = tf.train.AdamOptimizer(FLAGS.learning_rate) 71 | #optimizer = tf.train.MomentumOptimizer(FLAGS.learning_rate,0.9) 72 | 73 | tvars = tf.trainable_variables() 74 | grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), FLAGS.grad_clip) 75 | grads_and_vars = tuple(zip(grads, tvars)) 76 | train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step) 77 | 78 | # Keep track of gradient values and sparsity (optional) 79 | # grad_summaries = grad_summary 80 | grad_summaries = [] 81 | for g, v in grads_and_vars: 82 | if g is not None: 83 | grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g) 84 | grad_summaries.append(grad_hist_summary) 85 | 86 | # grad_summaries_merged = tf.summary.merge(grad_summaries) 87 | 88 | loss_summary = tf.summary.scalar('loss', loss) 89 | acc_summary = tf.summary.scalar('accuracy', acc) 90 | 91 | 92 | # train_summary_op = tf.summary.merge([loss_summary, acc_summary, grad_summaries_merged]) 93 | train_summary_op = tf.summary.merge_all()#tf.merge_all_summaries() 94 | train_summary_dir = os.path.join(out_dir, "summaries", "train") 95 | train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph) 96 | 97 | checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints")) 98 | checkpoint_prefix = os.path.join(checkpoint_dir, "model") 99 | checkpoint_path = checkpoint_dir + '/my-model.ckpt' 100 | if not os.path.exists(checkpoint_dir): 101 | os.makedirs(checkpoint_dir) 102 | # saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints) 103 | # saver = tf.train.Saver() 104 | # saver = tf.train.Saver(tf.global_variables(), max_to_keep=1) 105 | saver = tf.train.Saver(tf.global_variables(), max_to_keep=1) 106 | sess.run(tf.global_variables_initializer()) 107 | 108 | 109 | def train_step(x_batch, y_batch): 110 | feed_dict = { 111 | fishqa.input_x: x_batch, 112 | fishqa.input_y: y_batch, 113 | fishqa.max_sentence_num: FLAGS.sentence_num, 114 | fishqa.max_sentence_length: FLAGS.sentence_length, 115 | fishqa.batch_size: FLAGS.batch_size, 116 | fishqa.is_training: True 117 | } 118 | _, step, summaries, cost, accuracy = sess.run([train_op, global_step, train_summary_op, loss, acc], feed_dict) 119 | 120 | time_str = str(int(time.time())) 121 | # print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, cost, accuracy)) 122 | train_summary_writer.add_summary(summaries, step) 123 | return step 124 | 125 | def dev_step(x, y): 126 | global acc_record 127 | 
predictions = [] 128 | labels = [] 129 | for i in range(0, len(x), FLAGS.batch_size): 130 | 131 | feed_dict = { 132 | fishqa.input_x: x[i:i + FLAGS.batch_size], 133 | fishqa.input_y: y[i:i + FLAGS.batch_size], 134 | fishqa.max_sentence_num: 30, 135 | fishqa.max_sentence_length: 45, 136 | fishqa.batch_size: 64, 137 | fishqa.is_training:False 138 | } 139 | # step, summaries,cost, accuracy,correctNumber = sess.run([global_step, dev_summary_op,loss,acc,accNUM], feed_dict) 140 | step, pre, groundtruth= sess.run([global_step, predict, label], feed_dict) 141 | predictions.extend(pre) 142 | labels.extend(groundtruth) 143 | time_str = str(int(time.time())) 144 | df = pd.DataFrame({'predictions': predictions, 'labels': labels}) 145 | acc_dev = (df['predictions'] == df['labels']).mean() 146 | print("++++++++++++++++++dev++++++++++++++{}: step {}, acc {:g} ".format(time_str, step, acc_dev)) 147 | if acc_dev>acc_record: 148 | acc_record = acc_dev 149 | saver.save(sess, checkpoint_path) 150 | 151 | for epoch in range(FLAGS.num_epochs): 152 | X,Y = shuffle_data(train_x,train_y) 153 | print('current epoch %s' % (epoch + 1)) 154 | for i in range(0, len(X), FLAGS.batch_size): 155 | x = X[i:i + FLAGS.batch_size] 156 | y = Y[i:i + FLAGS.batch_size] 157 | step = train_step(x, y) 158 | if step % FLAGS.evaluate_every == 0: 159 | dev_step(dev_x, dev_y) 160 | 161 | if __name__ == '__main__': 162 | main() 163 | -------------------------------------------------------------------------------- /code/model.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | #coding:utf-8 3 | import tensorflow as tf 4 | import time 5 | import pickle 6 | import numpy as np 7 | from tqdm import tqdm 8 | from tensorflow.contrib import rnn 9 | from tensorflow.contrib import layers 10 | import random 11 | import pandas as pd 12 | 13 | # return the length of each sequence 14 | def length(sequences): 15 | used = tf.sign(tf.reduce_max(tf.abs(sequences), reduction_indices=2)) 16 | seq_len = tf.reduce_sum(used, reduction_indices=1) 17 | return tf.cast(seq_len, tf.int32) 18 | # load 3 query set (we set 3 based on ) 19 | def read_question(): 20 | with open('../model/q1_data', 'rb') as f: 21 | q1 = pickle.load(f) 22 | with open('../model/q2_data', 'rb') as g: 23 | q2 = pickle.load(g) 24 | with open('../model/q3_data', 'rb') as b: 25 | q3 = pickle.load(b) 26 | return [q1,q2,q3] 27 | def shuffle_data(x,y): 28 | train_x = [];train_y=[] 29 | li = np.random.permutation(len(x)) 30 | for i in tqdm(range(len(li))): 31 | train_x.append(x[li[i]]) 32 | train_y.append(y[li[i]]) 33 | return train_x,train_y 34 | class FISHQA(): 35 | 36 | def __init__(self, vocab_size, num_classes, embedding_size=200, hidden_size=50, dropout_keep_proba=0.5,query=[]): 37 | 38 | self.vocab_size = vocab_size 39 | self.num_classes = num_classes 40 | self.embedding_size = embedding_size 41 | self.hidden_size = hidden_size 42 | self.dropout_keep_proba = dropout_keep_proba 43 | self.query = query 44 | 45 | with tf.name_scope('placeholder'): 46 | self.max_sentence_num = tf.placeholder(tf.int32, name='max_sentence_num') 47 | self.max_sentence_length = tf.placeholder(tf.int32, name='max_sentence_length') 48 | self.batch_size = tf.placeholder(tf.int32, name='batch_size') 49 | #x shape [batch_size, sentence_num,word_num ] 50 | #y shape [batch_size, num_classes] 51 | self.input_x = tf.placeholder(tf.int32, [None, None, None], name='input_x') 52 | self.input_y = tf.placeholder(tf.float32, [None, num_classes], name='input_y') 53 | 
self.is_training = tf.placeholder(dtype=tf.bool, name='is_training') 54 | 55 | word_embedded,q1_emb,q2_emb,q3_emb = self.word2vec() 56 | sent_vec,att_word = self.sent2vec(word_embedded,q1_emb,q2_emb,q3_emb) 57 | doc_vec,att_sent = self.doc2vec(sent_vec,q1_emb,q2_emb,q3_emb) 58 | out = self.classifer(doc_vec) 59 | 60 | self.out = out 61 | self.att_word = att_word 62 | self.att_sent = att_sent 63 | def word2vec(self): 64 | with tf.name_scope("embedding"): 65 | embedding_mat = tf.Variable(tf.truncated_normal((self.vocab_size, self.embedding_size))) 66 | #shape: [batch_size, sent_in_doc, word_in_sent, embedding_size] 67 | # 45 is the max 68 | word_embedded = tf.nn.embedding_lookup(embedding_mat, self.input_x) 69 | q1_emb = tf.reduce_sum(tf.nn.embedding_lookup(embedding_mat, self.query[0]),axis=0)/45 70 | q2_emb = tf.reduce_sum(tf.nn.embedding_lookup(embedding_mat, self.query[1]),axis=0)/45 71 | q3_emb = tf.reduce_sum(tf.nn.embedding_lookup(embedding_mat, self.query[2]),axis=0)/45 72 | return word_embedded,q1_emb,q2_emb,q3_emb 73 | 74 | def sent2vec(self, word_embedded,q1_emb,q2_emb,q3_emb): 75 | with tf.name_scope("sent2vec"): 76 | #GRU input size : [batch_size, max_time, ...] 77 | #shape: [batch_size*sent_in_doc, word_in_sent, embedding_size] 78 | word_embedded = tf.reshape(word_embedded, [-1, self.max_sentence_length, self.embedding_size]) 79 | #shape: [batch_size*sent_in_doce, word_in_sent, hidden_size*2] 80 | word_encoded = self.BidirectionalGRUEncoder(word_embedded, name='word_encoder') 81 | #shape: [batch_size*sent_in_doc, hidden_size*2] 82 | sent_temp,att_word = self.AttentionLayer(word_encoded,q1_emb,q2_emb,q3_emb, name='word_attention') 83 | sent_vec = layers.dropout(sent_temp, keep_prob=self.dropout_keep_proba,is_training=self.is_training,) 84 | return sent_vec,att_word 85 | 86 | def doc2vec(self, sent_vec,q1_embedded,q2_embedded,q3_embedded): 87 | # the same with sent2vec 88 | with tf.name_scope("doc2vec"): 89 | sent_vec = tf.reshape(sent_vec, [-1, self.max_sentence_num, self.hidden_size*2]) 90 | #shape为[batch_size, sent_in_doc, hidden_size*2] 91 | doc_encoded = self.BidirectionalGRUEncoder(sent_vec, name='sent_encoder') 92 | #shape为[batch_szie, hidden_szie*2] 93 | doc_temp,att_sent = self.SentenceAttentionLayer(doc_encoded,q1_embedded,q2_embedded,q3_embedded,name='sent_attention') 94 | doc_vec = layers.dropout(doc_temp, keep_prob=self.dropout_keep_proba,is_training=self.is_training,) 95 | return doc_vec,att_sent 96 | 97 | def classifer(self, doc_vec): 98 | with tf.name_scope('doc_classification'): 99 | out = layers.fully_connected(inputs=doc_vec, num_outputs=self.num_classes, activation_fn=None) 100 | return out 101 | 102 | def BidirectionalGRUEncoder(self, inputs, name): 103 | #inputs shape: [batch_size, max_time, voc_size] 104 | with tf.variable_scope(name): 105 | GRU_cell_fw = rnn.GRUCell(self.hidden_size) 106 | GRU_cell_bw = rnn.GRUCell(self.hidden_size) 107 | #fw_outputs, bw_outputs size: [batch_size, max_time, hidden_size] 108 | # time_major=False, 109 | # if time_major = True, tensor shape: `[max_time, batch_size, depth]`. 110 | # if time_major = False, tensor shape`[batch_size, max_time, depth]`. 
111 | ((fw_outputs, bw_outputs), (_, _)) = tf.nn.bidirectional_dynamic_rnn(cell_fw=GRU_cell_fw, 112 | cell_bw=GRU_cell_bw, 113 | inputs=inputs, 114 | sequence_length=length(inputs), 115 | dtype=tf.float32) 116 | #outputs size [batch_size, max_time, hidden_size*2] 117 | outputs = tf.concat((fw_outputs, bw_outputs), 2) 118 | return outputs 119 | 120 | def AttentionLayer(self, inputs, q1_emb,q2_emb,q3_emb,name): 121 | #inputs size [batch_size, max_time, encoder_size(hidden_size * 2)] 122 | with tf.variable_scope(name): 123 | # u_context length is 2×hidden_size 124 | u_context = tf.Variable(tf.truncated_normal([self.hidden_size * 2]), name='u_context') 125 | # output size [batch_size, max_time, hidden_size * 2] 126 | h1 = layers.fully_connected(inputs, self.hidden_size * 2, activation_fn=tf.nn.tanh) 127 | h2 = layers.fully_connected(inputs, self.hidden_size * 2, activation_fn=tf.nn.tanh) 128 | h3 = layers.fully_connected(inputs, self.hidden_size * 2, activation_fn=tf.nn.tanh) 129 | h4 = layers.fully_connected(inputs, self.hidden_size * 2, activation_fn=tf.nn.tanh) 130 | 131 | # shape [batch_size, max_time, 1] 132 | t_alpha = tf.nn.softmax(tf.reduce_sum(tf.multiply(h1, u_context), axis=2, keep_dims=True), dim=1) 133 | q_alpha1 = tf.nn.softmax(tf.reduce_sum(tf.multiply(h2, q1_emb), axis=2, keep_dims=True), dim=1) 134 | q_alpha2 = tf.nn.softmax(tf.reduce_sum(tf.multiply(h3, q2_emb), axis=2, keep_dims=True), dim=1) 135 | q_alpha3 = tf.nn.softmax(tf.reduce_sum(tf.multiply(h4, q3_emb), axis=2, keep_dims=True), dim=1) 136 | 137 | alpha = (t_alpha+q_alpha1+q_alpha2+q_alpha3)/4 138 | 139 | a = tf.nn.top_k((tf.reshape(alpha,[-1,self.max_sentence_length])),k=1).indices 140 | # shape [batch_size, max_time, 1] 141 | # alpha = tf.nn.softmax(tf.reduce_sum(tf.multiply(h, u_context), axis=2, keep_dims=True), dim=1) 142 | # reduce_sum [batch_size, max_time, hidden_size*2] ---> [batch_size, hidden_size*2] 143 | atten_output = tf.reduce_sum(tf.multiply(inputs, alpha), axis=1) 144 | # atten_output = tf.reduce_sum(inputs,axis=1) 145 | return atten_output,a 146 | def SentenceAttentionLayer(self, inputs,q1_emb,q2_emb,q3_emb, name): 147 | # inputs size [batch_size, max_time, encoder_size(hidden_size * 2)] 148 | with tf.variable_scope(name): 149 | u_context = tf.Variable(tf.truncated_normal([self.hidden_size * 2]), name='u_context') 150 | 151 | h1 = layers.fully_connected(inputs, self.hidden_size * 2, activation_fn=tf.nn.tanh) 152 | h2 = layers.fully_connected(inputs, self.hidden_size * 2, activation_fn=tf.nn.tanh) 153 | h3 = layers.fully_connected(inputs, self.hidden_size * 2, activation_fn=tf.nn.tanh) 154 | h4 = layers.fully_connected(inputs, self.hidden_size * 2, activation_fn=tf.nn.tanh) 155 | 156 | # shape [batch_size, max_time, 1] 157 | t_alpha = tf.nn.softmax(tf.reduce_sum(tf.multiply(h1, u_context), axis=2, keep_dims=True), dim=1) 158 | q_alpha1 = tf.nn.softmax(tf.reduce_sum(tf.multiply(h2, q1_emb), axis=2, keep_dims=True), dim=1) 159 | q_alpha2 = tf.nn.softmax(tf.reduce_sum(tf.multiply(h3, q2_emb), axis=2, keep_dims=True), dim=1) 160 | q_alpha3 = tf.nn.softmax(tf.reduce_sum(tf.multiply(h4, q3_emb), axis=2, keep_dims=True), dim=1) 161 | 162 | # sents shape [batch_size, sent_in_doc, hidden_size*2] 163 | 164 | alpha = (t_alpha+q_alpha1+q_alpha2+q_alpha3)/4 165 | #tf.add_to_collection('attention_value',alpha) 166 | #reduce_sum [batch_szie, max_time, hidden_szie*2] ---> [batch_size, hidden_size*2] 167 | atten_output = tf.reduce_sum(tf.multiply(inputs, alpha), axis=1) 168 | 169 | att = 
tf.concat([t_alpha,q_alpha1,q_alpha2,q_alpha3],0) 170 | #atten_output = tf.reduce_sum(inputs,axis=1) 171 | return atten_output,att 172 | -------------------------------------------------------------------------------- /dictionary/stopwords_CN.dat: -------------------------------------------------------------------------------- 1 | 责任编辑 2 | 末 3 | 年 4 | 月 5 | 日 6 | 啊 7 | 阿 8 | 哎 9 | 哎呀 10 | 哎哟 11 | 唉 12 | 俺 13 | 俺们 14 | 按 15 | 按照 16 | 吧 17 | 吧哒 18 | 把 19 | 罢了 20 | 被 21 | 本 22 | 本着 23 | 比 24 | 比方 25 | 比如 26 | 鄙人 27 | 彼 28 | 彼此 29 | 边 30 | 别 31 | 别的 32 | 别说 33 | 并 34 | 并且 35 | 不单 36 | 不但 37 | 不独 38 | 不管 39 | 不光 40 | 不过 41 | 不仅 42 | 不拘 43 | 不论 44 | 不怕 45 | 不然 46 | 不特 47 | 不惟 48 | 不问 49 | 不只 50 | 朝 51 | 朝着 52 | 趁 53 | 趁着 54 | 乘 55 | 冲 56 | 除 57 | 除此之外 58 | 除非 59 | 除了 60 | 此 61 | 此间 62 | 此外 63 | 从 64 | 从而 65 | 打 66 | 待 67 | 当 68 | 当着 69 | 到 70 | 得 71 | 的 72 | 的话 73 | 等 74 | 等等 75 | 地 76 | 第 77 | 叮咚 78 | 对 79 | 对于 80 | 多 81 | 多少 82 | 而 83 | 而况 84 | 而且 85 | 而是 86 | 而外 87 | 而言 88 | 而已 89 | 尔后 90 | 反过来 91 | 反过来说 92 | 反之 93 | 非但 94 | 非徒 95 | 否则 96 | 嘎 97 | 嘎登 98 | 该 99 | 赶 100 | 个 101 | 各 102 | 各个 103 | 各位 104 | 各种 105 | 各自 106 | 给 107 | 根据 108 | 跟 109 | 故 110 | 故此 111 | 固然 112 | 关于 113 | 管 114 | 归 115 | 果然 116 | 果真 117 | 过 118 | 哈 119 | 哈哈 120 | 呵 121 | 和 122 | 何 123 | 何处 124 | 何况 125 | 何时 126 | 嘿 127 | 哼 128 | 哼唷 129 | 呼哧 130 | 乎 131 | 哗 132 | 还是 133 | 还有 134 | 换句话说 135 | 换言之 136 | 或 137 | 或是 138 | 或者 139 | 极了 140 | 及 141 | 及其 142 | 及至 143 | 即 144 | 即便 145 | 即或 146 | 即令 147 | 即若 148 | 即使 149 | 几 150 | 几时 151 | 己 152 | 既 153 | 既然 154 | 既是 155 | 继而 156 | 加之 157 | 假如 158 | 假若 159 | 假使 160 | 鉴于 161 | 将 162 | 较 163 | 较之 164 | 叫 165 | 接着 166 | 结果 167 | 借 168 | 紧接着 169 | 进而 170 | 尽 171 | 尽管 172 | 经 173 | 经过 174 | 就 175 | 就是 176 | 就是说 177 | 据 178 | 具体地说 179 | 具体说来 180 | 开始 181 | 开外 182 | 靠 183 | 咳 184 | 可 185 | 可见 186 | 可是 187 | 可以 188 | 况且 189 | 啦 190 | 来 191 | 来着 192 | 离 193 | 例如 194 | 哩 195 | 连 196 | 连同 197 | 两者 198 | 了 199 | 临 200 | 另 201 | 另外 202 | 另一方面 203 | 论 204 | 嘛 205 | 吗 206 | 慢说 207 | 漫说 208 | 冒 209 | 么 210 | 每 211 | 每当 212 | 们 213 | 莫若 214 | 某 215 | 某个 216 | 某些 217 | 拿 218 | 哪 219 | 哪边 220 | 哪儿 221 | 哪个 222 | 哪里 223 | 哪年 224 | 哪怕 225 | 哪天 226 | 哪些 227 | 哪样 228 | 那 229 | 那边 230 | 那儿 231 | 那个 232 | 那会儿 233 | 那里 234 | 那么 235 | 那么些 236 | 那么样 237 | 那时 238 | 那些 239 | 那样 240 | 乃 241 | 乃至 242 | 呢 243 | 能 244 | 你 245 | 你们 246 | 您 247 | 宁 248 | 宁可 249 | 宁肯 250 | 宁愿 251 | 哦 252 | 呕 253 | 啪达 254 | 旁人 255 | 呸 256 | 凭 257 | 凭借 258 | 其 259 | 其次 260 | 其二 261 | 其他 262 | 其它 263 | 其一 264 | 其余 265 | 其中 266 | 起 267 | 起见 268 | 岂但 269 | 恰恰相反 270 | 前后 271 | 前者 272 | 且 273 | 然而 274 | 然后 275 | 然则 276 | 让 277 | 人家 278 | 任 279 | 任何 280 | 任凭 281 | 如 282 | 如此 283 | 如果 284 | 如何 285 | 如其 286 | 如若 287 | 如上所述 288 | 若 289 | 若非 290 | 若是 291 | 啥 292 | 上下 293 | 尚且 294 | 设若 295 | 设使 296 | 甚而 297 | 甚么 298 | 甚至 299 | 省得 300 | 时候 301 | 什么 302 | 什么样 303 | 使得 304 | 是 305 | 是的 306 | 首先 307 | 谁 308 | 谁知 309 | 顺 310 | 顺着 311 | 似的 312 | 虽 313 | 虽然 314 | 虽说 315 | 虽则 316 | 随 317 | 随着 318 | 所 319 | 所以 320 | 他 321 | 他们 322 | 他人 323 | 它 324 | 它们 325 | 她 326 | 她们 327 | 倘 328 | 倘或 329 | 倘然 330 | 倘若 331 | 倘使 332 | 腾 333 | 替 334 | 通过 335 | 同 336 | 同时 337 | 哇 338 | 万一 339 | 往 340 | 望 341 | 为 342 | 为何 343 | 为了 344 | 为什么 345 | 为着 346 | 喂 347 | 嗡嗡 348 | 我 349 | 我们 350 | 呜 351 | 呜呼 352 | 乌乎 353 | 无论 354 | 无宁 355 | 毋宁 356 | 嘻 357 | 吓 358 | 相对而言 359 | 像 360 | 向 361 | 向着 362 | 嘘 363 | 呀 364 | 焉 365 | 沿 366 | 沿着 367 | 要 368 | 要不 369 | 要不然 370 | 要不是 371 | 要么 372 | 要是 373 | 也 374 | 也罢 375 | 也好 376 | 一 377 | 一般 378 | 一旦 379 | 一方面 380 | 一来 381 | 一切 382 | 一样 383 | 
一则 384 | 依 385 | 依照 386 | 矣 387 | 以 388 | 以便 389 | 以及 390 | 以免 391 | 以至 392 | 以至于 393 | 以致 394 | 抑或 395 | 因 396 | 因此 397 | 因而 398 | 因为 399 | 哟 400 | 用 401 | 由 402 | 由此可见 403 | 由于 404 | 有 405 | 有的 406 | 有关 407 | 有些 408 | 又 409 | 于 410 | 于是 411 | 于是乎 412 | 与 413 | 与此同时 414 | 与否 415 | 与其 416 | 越是 417 | 云云 418 | 哉 419 | 再说 420 | 再者 421 | 在 422 | 在下 423 | 咱 424 | 咱们 425 | 则 426 | 怎 427 | 怎么 428 | 怎么办 429 | 怎么样 430 | 怎样 431 | 咋 432 | 照 433 | 照着 434 | 者 435 | 这 436 | 这边 437 | 这儿 438 | 这个 439 | 这会儿 440 | 这就是说 441 | 这里 442 | 这么 443 | 这么点儿 444 | 这么些 445 | 这么样 446 | 这时 447 | 这些 448 | 这样 449 | 正如 450 | 吱 451 | 之 452 | 之类 453 | 之所以 454 | 之一 455 | 只是 456 | 只限 457 | 只要 458 | 只有 459 | 至 460 | 至于 461 | 诸位 462 | 着 463 | 着呢 464 | 自 465 | 自从 466 | 自个儿 467 | 自各儿 468 | 自己 469 | 自家 470 | 自身 471 | 综上所述 472 | 总的来看 473 | 总的来说 474 | 总的说来 475 | 总而言之 476 | 总之 477 | 纵 478 | 纵令 479 | 纵然 480 | 纵使 481 | 遵照 482 | 作为 483 | 兮 484 | 呃 485 | 呗 486 | 咚 487 | 咦 488 | 喏 489 | 啐 490 | 喔唷 491 | 嗬 492 | 嗯 493 | 嗳 494 | 啊哈 495 | 啊呀 496 | 啊哟 497 | 挨次 498 | 挨个 499 | 挨家挨户 500 | 挨门挨户 501 | 挨门逐户 502 | 挨着 503 | 按理 504 | 按期 505 | 按时 506 | 按说 507 | 暗地里 508 | 暗中 509 | 暗自 510 | 昂然 511 | 八成 512 | 白白 513 | 半 514 | 梆 515 | 保管 516 | 保险 517 | 饱 518 | 背地里 519 | 背靠背 520 | 倍感 521 | 倍加 522 | 本人 523 | 本身 524 | 甭 525 | 比起 526 | 比如说 527 | 比照 528 | 毕竟 529 | 必 530 | 必定 531 | 必将 532 | 必须 533 | 便 534 | 别人 535 | 并肩 536 | 并排 537 | 勃然 538 | 策略地 539 | 差不多 540 | 差一点 541 | 常 542 | 常常 543 | 常言道 544 | 常言说 545 | 常言说得好 546 | 长此下去 547 | 长话短说 548 | 长期以来 549 | 长线 550 | 敞开儿 551 | 彻夜 552 | 陈年 553 | 趁便 554 | 趁机 555 | 趁热 556 | 趁势 557 | 趁早 558 | 成年 559 | 成年累月 560 | 成心 561 | 乘机 562 | 乘胜 563 | 乘势 564 | 乘隙 565 | 乘虚 566 | 诚然 567 | 迟早 568 | 充分 569 | 充其极 570 | 充其量 571 | 抽冷子 572 | 臭 573 | 初 574 | 出 575 | 出来 576 | 出去 577 | 除此 578 | 除此而外 579 | 除此以外 580 | 除开 581 | 除去 582 | 除却 583 | 除外 584 | 处处 585 | 川流不息 586 | 传 587 | 传说 588 | 传闻 589 | 串行 590 | 纯 591 | 纯粹 592 | 此后 593 | 此中 594 | 次第 595 | 匆匆 596 | 从不 597 | 从此 598 | 从此以后 599 | 从古到今 600 | 从古至今 601 | 从今以后 602 | 从宽 603 | 从来 604 | 从轻 605 | 从速 606 | 从头 607 | 从未 608 | 从无到有 609 | 从小 610 | 从新 611 | 从严 612 | 从优 613 | 从早到晚 614 | 从中 615 | 从重 616 | 凑巧 617 | 粗 618 | 存心 619 | 达旦 620 | 打从 621 | 打开天窗说亮话 622 | 大 623 | 大不了 624 | 大大 625 | 大抵 626 | 大都 627 | 大多 628 | 大凡 629 | 大概 630 | 大家 631 | 大举 632 | 大略 633 | 大面儿上 634 | 大事 635 | 大体 636 | 大体上 637 | 大约 638 | 大张旗鼓 639 | 大致 640 | 呆呆地 641 | 带 642 | 殆 643 | 待到 644 | 单 645 | 单纯 646 | 单单 647 | 但愿 648 | 弹指之间 649 | 当场 650 | 当儿 651 | 当即 652 | 当口儿 653 | 当然 654 | 当庭 655 | 当头 656 | 当下 657 | 当真 658 | 当中 659 | 倒不如 660 | 倒不如说 661 | 倒是 662 | 到处 663 | 到底 664 | 到了儿 665 | 到目前为止 666 | 到头 667 | 到头来 668 | 得起 669 | 得天独厚 670 | 的确 671 | 等到 672 | 叮当 673 | 顶多 674 | 定 675 | 动不动 676 | 动辄 677 | 陡然 678 | 都 679 | 独 680 | 独自 681 | 断然 682 | 顿时 683 | 多次 684 | 多多 685 | 多多少少 686 | 多多益善 687 | 多亏 688 | 多年来 689 | 多年前 690 | 而后 691 | 而论 692 | 而又 693 | 尔等 694 | 二话不说 695 | 二话没说 696 | 反倒 697 | 反倒是 698 | 反而 699 | 反手 700 | 反之亦然 701 | 反之则 702 | 方 703 | 方才 704 | 方能 705 | 放量 706 | 非常 707 | 非得 708 | 分期 709 | 分期分批 710 | 分头 711 | 奋勇 712 | 愤然 713 | 风雨无阻 714 | 逢 715 | 弗 716 | 甫 717 | 嘎嘎 718 | 该当 719 | 概 720 | 赶快 721 | 赶早不赶晚 722 | 敢 723 | 敢情 724 | 敢于 725 | 刚 726 | 刚才 727 | 刚好 728 | 刚巧 729 | 高低 730 | 格外 731 | 隔日 732 | 隔夜 733 | 个人 734 | 各式 735 | 更 736 | 更加 737 | 更进一步 738 | 更为 739 | 公然 740 | 共 741 | 共总 742 | 够瞧的 743 | 姑且 744 | 古来 745 | 故而 746 | 故意 747 | 固 748 | 怪 749 | 怪不得 750 | 惯常 751 | 光 752 | 光是 753 | 归根到底 754 | 归根结底 755 | 过于 756 | 毫不 757 | 毫无 758 | 毫无保留地 759 | 毫无例外 760 | 好在 761 | 何必 762 | 何尝 763 | 何妨 764 | 何苦 765 | 何乐而不为 766 | 何须 767 | 何止 768 | 很 769 | 很多 770 | 
很少 771 | 轰然 772 | 后来 773 | 呼啦 774 | 忽地 775 | 忽然 776 | 互 777 | 互相 778 | 哗啦 779 | 话说 780 | 还 781 | 恍然 782 | 会 783 | 豁然 784 | 活 785 | 伙同 786 | 或多或少 787 | 或许 788 | 基本 789 | 基本上 790 | 基于 791 | 极 792 | 极大 793 | 极度 794 | 极端 795 | 极力 796 | 极其 797 | 极为 798 | 急匆匆 799 | 即将 800 | 即刻 801 | 即是说 802 | 几度 803 | 几番 804 | 几乎 805 | 几经 806 | 既...又 807 | 继之 808 | 加上 809 | 加以 810 | 间或 811 | 简而言之 812 | 简言之 813 | 简直 814 | 见 815 | 将才 816 | 将近 817 | 将要 818 | 交口 819 | 较比 820 | 较为 821 | 接连不断 822 | 接下来 823 | 皆可 824 | 截然 825 | 截至 826 | 藉以 827 | 借此 828 | 借以 829 | 届时 830 | 仅 831 | 仅仅 832 | 谨 833 | 进来 834 | 进去 835 | 近 836 | 近几年来 837 | 近来 838 | 近年来 839 | 尽管如此 840 | 尽可能 841 | 尽快 842 | 尽量 843 | 尽然 844 | 尽如人意 845 | 尽心竭力 846 | 尽心尽力 847 | 尽早 848 | 精光 849 | 经常 850 | 竟 851 | 竟然 852 | 究竟 853 | 就此 854 | 就地 855 | 就算 856 | 居然 857 | 局外 858 | 举凡 859 | 据称 860 | 据此 861 | 据实 862 | 据说 863 | 据我所知 864 | 据悉 865 | 具体来说 866 | 决不 867 | 决非 868 | 绝 869 | 绝不 870 | 绝顶 871 | 绝对 872 | 绝非 873 | 均 874 | 喀 875 | 看 876 | 看来 877 | 看起来 878 | 看上去 879 | 看样子 880 | 可好 881 | 可能 882 | 恐怕 883 | 快 884 | 快要 885 | 来不及 886 | 来得及 887 | 来讲 888 | 来看 889 | 拦腰 890 | 牢牢 891 | 老 892 | 老大 893 | 老老实实 894 | 老是 895 | 累次 896 | 累年 897 | 理当 898 | 理该 899 | 理应 900 | 历 901 | 立 902 | 立地 903 | 立刻 904 | 立马 905 | 立时 906 | 联袂 907 | 连连 908 | 连日 909 | 连日来 910 | 连声 911 | 连袂 912 | 临到 913 | 另方面 914 | 另行 915 | 另一个 916 | 路经 917 | 屡 918 | 屡次 919 | 屡次三番 920 | 屡屡 921 | 缕缕 922 | 率尔 923 | 率然 924 | 略 925 | 略加 926 | 略微 927 | 略为 928 | 论说 929 | 马上 930 | 蛮 931 | 满 932 | 每逢 933 | 每每 934 | 每时每刻 935 | 猛然 936 | 猛然间 937 | 莫 938 | 莫不 939 | 莫非 940 | 莫如 941 | 默默地 942 | 默然 943 | 呐 944 | 那末 945 | 奈 946 | 难道 947 | 难得 948 | 难怪 949 | 难说 950 | 内 951 | 年复一年 952 | 凝神 953 | 偶而 954 | 偶尔 955 | 怕 956 | 砰 957 | 碰巧 958 | 譬如 959 | 偏偏 960 | 乒 961 | 平素 962 | 颇 963 | 迫于 964 | 扑通 965 | 其后 966 | 其实 967 | 奇 968 | 齐 969 | 起初 970 | 起来 971 | 起首 972 | 起头 973 | 起先 974 | 岂 975 | 岂非 976 | 岂止 977 | 迄 978 | 恰逢 979 | 恰好 980 | 恰恰 981 | 恰巧 982 | 恰如 983 | 恰似 984 | 千 985 | 万 986 | 千万 987 | 千万千万 988 | 切 989 | 切不可 990 | 切莫 991 | 切切 992 | 切勿 993 | 窃 994 | 亲口 995 | 亲身 996 | 亲手 997 | 亲眼 998 | 亲自 999 | 顷 1000 | 顷刻 1001 | 顷刻间 1002 | 顷刻之间 1003 | 请勿 1004 | 穷年累月 1005 | 取道 1006 | 去 1007 | 权时 1008 | 全都 1009 | 全力 1010 | 全年 1011 | 全然 1012 | 全身心 1013 | 然 1014 | 人人 1015 | 仍 1016 | 仍旧 1017 | 仍然 1018 | 日复一日 1019 | 日见 1020 | 日渐 1021 | 日益 1022 | 日臻 1023 | 如常 1024 | 如此等等 1025 | 如次 1026 | 如今 1027 | 如期 1028 | 如前所述 1029 | 如上 1030 | 如下 1031 | 汝 1032 | 三番两次 1033 | 三番五次 1034 | 三天两头 1035 | 瑟瑟 1036 | 沙沙 1037 | 上 1038 | 上来 1039 | 上去 1040 | 一 1041 | 一一 1042 | 一下 1043 | 一个 1044 | 一些 1045 | 一何 1046 | 一则通过 1047 | 一天 1048 | 一定 1049 | 一时 1050 | 一次 1051 | 一片 1052 | 一番 1053 | 一直 1054 | 一致 1055 | 一起 1056 | 一转眼 1057 | 一边 1058 | 一面 1059 | 上升 1060 | 上述 1061 | 上面 1062 | 下 1063 | 下列 1064 | 下去 1065 | 下来 1066 | 下面 1067 | 不一 1068 | 不久 1069 | 不变 1070 | 不可 1071 | 不够 1072 | 不尽 1073 | 不尽然 1074 | 不敢 1075 | 不断 1076 | 不若 1077 | 不足 1078 | 与其说 1079 | 专门 1080 | 且不说 1081 | 且说 1082 | 严格 1083 | 严重 1084 | 个别 1085 | 中小 1086 | 中间 1087 | 丰富 1088 | 为主 1089 | 为什麽 1090 | 为止 1091 | 为此 1092 | 主张 1093 | 主要 1094 | 举行 1095 | 乃至于 1096 | 之前 1097 | 之后 1098 | 之後 1099 | 也就是说 1100 | 也是 1101 | 了解 1102 | 争取 1103 | 二来 1104 | 云尔 1105 | 些 1106 | 亦 1107 | 产生 1108 | 人 1109 | 人们 1110 | 什麽 1111 | 今 1112 | 今后 1113 | 今天 1114 | 今年 1115 | 今後 1116 | 介于 1117 | 从事 1118 | 他是 1119 | 他的 1120 | 代替 1121 | 以上 1122 | 以下 1123 | 以为 1124 | 以前 1125 | 以后 1126 | 以外 1127 | 以後 1128 | 以故 1129 | 以期 1130 | 以来 1131 | 任务 1132 | 企图 1133 | 伟大 1134 | 似乎 1135 | 但凡 1136 | 何以 1137 | 余外 1138 | 你是 1139 | 你的 1140 | 使 1141 | 使用 1142 | 依据 1143 | 依靠 1144 | 便于 1145 | 促进 
1146 | 保持 1147 | 做到 1148 | 傥然 1149 | 儿 1150 | 允许 1151 | 元/吨 1152 | 先不先 1153 | 先后 1154 | 先後 1155 | 先生 1156 | 全体 1157 | 全部 1158 | 全面 1159 | 共同 1160 | 具体 1161 | 具有 1162 | 兼之 1163 | 再 1164 | 再其次 1165 | 再则 1166 | 再有 1167 | 再次 1168 | 再者说 1169 | 决定 1170 | 准备 1171 | 凡 1172 | 凡是 1173 | 出于 1174 | 出现 1175 | 分别 1176 | 则甚 1177 | 别处 1178 | 别是 1179 | 别管 1180 | 前此 1181 | 前进 1182 | 前面 1183 | 加入 1184 | 加强 1185 | 十分 1186 | 即如 1187 | 却 1188 | 却不 1189 | 原来 1190 | 又及 1191 | 及时 1192 | 双方 1193 | 反应 1194 | 反映 1195 | 取得 1196 | 受到 1197 | 变成 1198 | 另悉 1199 | 只 1200 | 只当 1201 | 只怕 1202 | 只消 1203 | 叫做 1204 | 召开 1205 | 各人 1206 | 各地 1207 | 各级 1208 | 合理 1209 | 同一 1210 | 同样 1211 | 后 1212 | 后者 1213 | 后面 1214 | 向使 1215 | 周围 1216 | 呵呵 1217 | 咧 1218 | 唯有 1219 | 啷当 1220 | 喽 1221 | 嗡 1222 | 嘿嘿 1223 | 因了 1224 | 因着 1225 | 在于 1226 | 坚决 1227 | 坚持 1228 | 处在 1229 | 处理 1230 | 复杂 1231 | 多么 1232 | 多数 1233 | 大力 1234 | 大多数 1235 | 大批 1236 | 大量 1237 | 失去 1238 | 她是 1239 | 她的 1240 | 好 1241 | 好的 1242 | 好象 1243 | 如同 1244 | 如是 1245 | 始而 1246 | 存在 1247 | 孰料 1248 | 孰知 1249 | 它们的 1250 | 它是 1251 | 它的 1252 | 安全 1253 | 完全 1254 | 完成 1255 | 实现 1256 | 实际 1257 | 宣布 1258 | 容易 1259 | 密切 1260 | 对应 1261 | 对待 1262 | 对方 1263 | 对比 1264 | 小 1265 | 少数 1266 | 尔 1267 | 尔尔 1268 | 尤其 1269 | 就是了 1270 | 就要 1271 | 属于 1272 | 左右 1273 | 巨大 1274 | 巩固 1275 | 已 1276 | 已矣 1277 | 已经 1278 | 巴 1279 | 巴巴 1280 | 帮助 1281 | 并不 1282 | 并不是 1283 | 广大 1284 | 广泛 1285 | 应当 1286 | 应用 1287 | 应该 1288 | 庶乎 1289 | 庶几 1290 | 开展 1291 | 引起 1292 | 强烈 1293 | 强调 1294 | 归齐 1295 | 当前 1296 | 当地 1297 | 当时 1298 | 形成 1299 | 彻底 1300 | 彼时 1301 | 往往 1302 | 後来 1303 | 後面 1304 | 得了 1305 | 得出 1306 | 得到 1307 | 心里 1308 | 必然 1309 | 必要 1310 | 怎奈 1311 | 怎麽 1312 | 总是 1313 | 总结 1314 | 您们 1315 | 您是 1316 | 惟其 1317 | 意思 1318 | 愿意 1319 | 成为 1320 | 我是 1321 | 我的 1322 | 或则 1323 | 或曰 1324 | 战斗 1325 | 所在 1326 | 所幸 1327 | 所有 1328 | 所谓 1329 | 扩大 1330 | 掌握 1331 | 接著 1332 | 数/ 1333 | 整个 1334 | 方便 1335 | 方面 1336 | 既往 1337 | 明显 1338 | 明确 1339 | 是不是 1340 | 是以 1341 | 是否 1342 | 显然 1343 | 显著 1344 | 普通 1345 | 普遍 1346 | 曾 1347 | 曾经 1348 | 替代 1349 | 最 1350 | 最后 1351 | 最大 1352 | 最好 1353 | 最後 1354 | 最近 1355 | 最高 1356 | 有利 1357 | 有力 1358 | 有及 1359 | 有所 1360 | 有效 1361 | 有时 1362 | 有点 1363 | 有的是 1364 | 有着 1365 | 有著 1366 | 末 1367 | 本地 1368 | 来自 1369 | 来说 1370 | 构成 1371 | 某某 1372 | 根本 1373 | 欢迎 1374 | 欤 1375 | 正值 1376 | 正在 1377 | 正巧 1378 | 正常 1379 | 正是 1380 | 此地 1381 | 此处 1382 | 此时 1383 | 此次 1384 | 每个 1385 | 每天 1386 | 每年 1387 | 比及 1388 | 比较 1389 | 没奈何 1390 | 注意 1391 | 深入 1392 | 清楚 1393 | 满足 1394 | 然後 1395 | 特别是 1396 | 特殊 1397 | 特点 1398 | 犹且 1399 | 犹自 1400 | 现代 1401 | 现在 1402 | 甚且 1403 | 甚或 1404 | 甚至于 1405 | 用来 1406 | 由是 1407 | 由此 1408 | 目前 1409 | 直到 1410 | 直接 1411 | 相似 1412 | 相信 1413 | 相反 1414 | 相同 1415 | 相对 1416 | 相应 1417 | 相当 1418 | 相等 1419 | 看出 1420 | 看到 1421 | 看看 1422 | 看见 1423 | 真是 1424 | 真正 1425 | 眨眼 1426 | 矣乎 1427 | 矣哉 1428 | 知道 1429 | 确定 1430 | 种 1431 | 积极 1432 | 移动 1433 | 突出 1434 | 突然 1435 | 立即 1436 | 竟而 1437 | 第二 1438 | 类如 1439 | 练习 1440 | 组成 1441 | 结合 1442 | 继后 1443 | 继续 1444 | 维持 1445 | 考虑 1446 | 联系 1447 | 能否 1448 | 能够 1449 | 自后 1450 | 自打 1451 | 至今 1452 | 至若 1453 | 致 1454 | 般的 1455 | 良好 1456 | 若夫 1457 | 若果 1458 | 范围 1459 | 莫不然 1460 | 获得 1461 | 行为 1462 | 行动 1463 | 表明 1464 | 表示 1465 | 要求 1466 | 规定 1467 | 觉得 1468 | 譬喻 1469 | 认为 1470 | 认真 1471 | 认识 1472 | 许多 1473 | 设或 1474 | 诚如 1475 | 说明 1476 | 说来 1477 | 说说 1478 | 诸 1479 | 诸如 1480 | 谁人 1481 | 谁料 1482 | 贼死 1483 | 赖以 1484 | 距 1485 | 转动 1486 | 转变 1487 | 转贴 1488 | 达到 1489 | 迅速 1490 | 过去 1491 | 过来 1492 | 运用 1493 | 还要 1494 | 这一来 1495 | 这次 1496 | 这点 1497 | 这种 1498 | 这般 1499 | 这麽 1500 | 进入 1501 | 进步 
1502 | 进行 1503 | 适应 1504 | 适当 1505 | 适用 1506 | 逐步 1507 | 逐渐 1508 | 通常 1509 | 造成 1510 | 遇到 1511 | 遭到 1512 | 遵循 1513 | 避免 1514 | 那般 1515 | 那麽 1516 | 部分 1517 | 采取 1518 | 里面 1519 | 重大 1520 | 重新 1521 | 重要 1522 | 针对 1523 | 问题 1524 | 防止 1525 | 附近 1526 | 限制 1527 | 随后 1528 | 随时 1529 | 随著 1530 | 难道说 1531 | 集中 1532 | 需要 1533 | 非特 1534 | 非独 1535 | 高兴 1536 | 若果 --------------------------------------------------------------------------------
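
Note: `dictionary/stopwords_CN.dat` is only referenced by a commented-out block in `preprocess_train.py` / `preprocess_test.py`. A minimal sketch of how it could be re-enabled to drop stopwords during jieba segmentation, following that commented-out code; the helper name `remove_stopwords` is illustrative and not part of the repository:

```python
import codecs
import jieba.posseg as pseg

# Load one stopword per line, mirroring the commented-out block in the
# preprocessing scripts (path assumes the script runs from the code/ folder).
with codecs.open('../dictionary/stopwords_CN.dat', 'r', encoding='utf-8') as fr:
    stopwords = set(line.strip() for line in fr)

def remove_stopwords(sentence):
    # Illustrative helper: segment with jieba and keep only non-stopword tokens.
    return [w.word for w in pseg.lcut(sentence) if w.word not in stopwords]
```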