├── .gitignore ├── README.md ├── bash ├── elmo_inference.sh └── elmo_train.sh ├── dataset.py ├── img └── model.png ├── labels.txt ├── main.py ├── model.py ├── scripts ├── data_preprocess.py ├── preprocess.sh └── readme.md ├── thrid_utils.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | db.sqlite3 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # Jupyter Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # SageMath parsed files 82 | *.sage.py 83 | 84 | # Environments 85 | .env 86 | .venv 87 | env/ 88 | venv/ 89 | ENV/ 90 | env.bak/ 91 | venv.bak/ 92 | 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | 100 | # mkdocs documentation 101 | /site 102 | 103 | # mypy 104 | .mypy_cache/ 105 | 106 | # data folder 107 | scripts/data/ 108 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # fsauor2018 2 | 3 | Code for Fine-grained Sentiment Analysis of User Reviews of AI Challenger 2018. 4 | 5 | Single model can achieve 0.71 marco-f1 score. 6 | 7 | Testa rank: 27 8 | 9 | Testb rank: 16 10 | 11 | > The final result is achieved by ensemble 10 models by simple voting. 12 | 13 | Issues and starts are welcomed! 14 | 15 | ## Train from scratch 16 | 17 | For those who don't want to preprocess data, refer to [scripts](./scripts/readme.md). 18 | 19 | ## Data 20 | 21 | For those who want to get the raw dataset, please refer to this link [data](https://drive.google.com/file/d/1OInXRx_OmIJgK3ZdoFZnmqUi0rGfOaQo/view?usp=sharing). 22 | 23 | ## Requirements 24 | 25 | tensorflow == 1.4.1 26 | 27 | ## Model Architecture 28 | 29 | The model architecture is simple. Basiclly, you can think of it as a seq2seq model. 30 | 31 | ![模型结构](img/model.png) 32 | 33 | Some details of the model: 34 | 35 | - Embedding layer + 3 * Bi-LSTM layers as encoder 36 | - Residual connection is added on the second and third Bi-LSTM layers 37 | - The final encoder outputs are weighted sum of outputs of each layer. Scalars and weight are learned variables. This idea is copied from ELMO. 
38 | - A simple LSTM cell + attention is used as the decoder; it decodes 20 steps to get one output per label
39 | - Inputs to the decoder are learnable embeddings
40 | - Outputs of the decoder are fed to two FC layers to get the final sentiment logits
41 | 
42 | ## Data preprocess
43 | 
44 | The exact preprocessing code used in the competition is not provided here; a simplified pipeline is available under [scripts](./scripts/readme.md).
45 | 
46 | To use this project, you need the following files:
47 | 
48 | - train.json / validation.json / testa.json
49 | - vocab.txt
50 | - embedding.txt
51 | - labels.txt
52 | 
53 | ### Training files
54 | 
55 | You need to preprocess the original data into JSON files, where each line is a JSON object like the following:
56 | 
57 | ```json
58 | {"id": "0", "content": "吼吼吼 , 萌 死 人 的 棒棒糖 , 中 了 大众 点评 的 霸王餐 , 太 可爱 了 。 一直 就 好奇 这个 棒棒 糖 是 怎么 个 东西 , 大众 点评 给 了 我 这个 土老 冒 一个 见识 的 机会 。 看 介绍 棒棒 糖 是 用 糖 做 的 , 不 会 很 甜 , 中间 的 照片 是 糯米 的 , 能 食用 , 真是 太 高端 大气 上档次 了 , 还 可以 买 蝴蝶 结扎口 , 送 人 可以 买 礼盒 。 我 是 先 打 的 卖家 电话 , 加 了 微信 , 给 卖家传 的 照片 。 等 了 几 天 , 卖家 就 告诉 我 可以 取 货 了 , 去 那 取 的 。 虽然 连 卖家 的 面 都 没 见到 , 但是 还是 谢谢 卖家 送 我 这么 可爱 的 东西 , 太 喜欢 了 , 这 哪 舍得 吃 啊 。", "location_traffic_convenience": "-2", "location_distance_from_business_district": "-2", "location_easy_to_find": "-2", "service_wait_time": "-2", "service_waiters_attitude": "1", "service_parking_convenience": "-2", "service_serving_speed": "-2", "price_level": "-2", "price_cost_effective": "-2", "price_discount": "1", "environment_decoration": "-2", "environment_noise": "-2", "environment_space": "-2", "environment_cleaness": "-2", "dish_portion": "-2", "dish_taste": "-2", "dish_look": "1", "dish_recommendation": "-2", "others_overall_experience": "1", "others_willing_to_consume_again": "-2"}
59 | ```
60 | 
61 | To be specific:
62 | - content should be tokenized words separated by spaces
63 | - You can use jieba/LTP to do the segmentation
64 | - Use NER toolkits to replace place and organization names with the special tokens '\','\'
65 | - the other fields are the same as in the original data files
66 | - for test files, whose labels are unknown, you can leave them as empty strings ("")
67 | 
68 | ### Vocab file
69 | 
70 | I keep the top 50k most common words in the training file.
71 | 
72 | The top 3 words are special tokens:
73 | - \: unknown token
74 | - \: start of content
75 | - \: end of content, also used as the padding token
76 | 
77 | ### Embedding file
78 | 
79 | This is a GloVe-format embedding file. I use [Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors) as the pretrained embeddings (specifically the Sogou News word2vec vectors).
80 | 
81 | ### Label file
82 | 
83 | All the label names, one per line (see labels.txt).
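As a quick sanity check before training, a minimal sketch like the one below (a hypothetical helper, not part of this repo) can confirm that every line of a preprocessed JSON file carries all the labels listed in labels.txt with values the model expects:

```python
# Hypothetical sanity check for the preprocessed JSON-lines files; not part of this repo.
import json

def check_data_file(data_path, label_path):
    label_names = [l.strip() for l in open(label_path, encoding='utf-8') if l.strip()]
    allowed = {"1", "0", "-1", "-2", ""}  # "" is only expected in test files
    count = 0
    with open(data_path, encoding='utf-8') as f:
        for count, line in enumerate(f, 1):
            item = json.loads(line)
            assert 'id' in item and 'content' in item, "line %d: missing id/content" % count
            for name in label_names:
                assert item.get(name) in allowed, "line %d: bad value for %s" % (count, name)
    print("checked %d lines" % count)

# Example: check_data_file('scripts/data/train.json', 'scripts/data/labels.txt')
```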
84 | 85 | ## Train 86 | 87 | Refer to bash/elmo_train.sh 88 | 89 | ## Inference 90 | 91 | Refer to bash/elmo_inference.sh 92 | -------------------------------------------------------------------------------- /bash/elmo_inference.sh: -------------------------------------------------------------------------------- 1 | python main.py \ 2 | --mode=inference \ 3 | --data_files=scripts/data/testa.json \ 4 | --label_file=scripts/data/labels.txt \ 5 | --vocab_file=scripts/data/vocab.txt \ 6 | --out_file=scripts/data/out.testa.json \ 7 | --prob=False \ 8 | --batch_size=300 \ 9 | --feature_num=20 \ 10 | --checkpoint_dir=scripts/data/elmo_ema_0120 -------------------------------------------------------------------------------- /bash/elmo_train.sh: -------------------------------------------------------------------------------- 1 | python main.py \ 2 | --mode=train \ 3 | --data_files scripts/data/train.json \ 4 | --eval_files=scripts/data/validation.json \ 5 | --label_file=scripts/data/labels.txt \ 6 | --vocab_file=scripts/data/vocab.txt \ 7 | --embed_file=scripts/data/embedding.txt \ 8 | --num_layers=3 \ 9 | --batch_size=32 \ 10 | --encoder=elmo \ 11 | --rnn_cell_name=lstm \ 12 | --feature_num=20 \ 13 | --steps_per_eval=2000 \ 14 | --learning_rate=0.001 \ 15 | --focal_loss=0.0 \ 16 | --checkpoint_dir=scripts/data/elmo_ema_0120 -------------------------------------------------------------------------------- /dataset.py: -------------------------------------------------------------------------------- 1 | # ======================================== 2 | # Author: Xueyou Luo 3 | # Email: xueyou.luo@aidigger.com 4 | # Copyright: Eigen Tech @ 2018 5 | # ======================================== 6 | import codecs 7 | import json 8 | from collections import namedtuple 9 | 10 | import numpy as np 11 | import tensorflow as tf 12 | 13 | from utils import print_out 14 | from thrid_utils import read_vocab 15 | 16 | UNK_ID = 0 17 | SOS_ID = 1 18 | EOS_ID = 2 19 | 20 | def _padding(tokens_list, max_len): 21 | ret = np.zeros((len(tokens_list),max_len),np.int32) 22 | for i,t in enumerate(tokens_list): 23 | t = t + (max_len-len(t)) * [EOS_ID] 24 | ret[i] = t 25 | return ret 26 | 27 | def _tokenize(content, w2i, max_tokens=1200, reverse=False, split=True): 28 | def get_tokens(content): 29 | tokens = content.strip().split() 30 | ids = [] 31 | for t in tokens: 32 | if t in w2i: 33 | ids.append(w2i[t]) 34 | else: 35 | for c in t: 36 | ids.append(w2i.get(c,UNK_ID)) 37 | return ids 38 | if split: 39 | ids = get_tokens(content) 40 | else: 41 | ids = [w2i.get(t,UNK_ID) for t in content.strip().split()] 42 | if reverse: 43 | ids = list(reversed(ids)) 44 | tokens = [SOS_ID] + ids[:max_tokens] + [EOS_ID] 45 | return tokens 46 | 47 | class DataItem(namedtuple("DataItem",('content','length','labels','id'))): 48 | pass 49 | 50 | class DataSet(object): 51 | def __init__(self, data_files, vocab_file, label_file, batch_size=32, reverse=False, split_word=True, max_len = 1200): 52 | self.reverse = reverse 53 | self.split_word = split_word 54 | self.data_files = data_files 55 | self.batch_size = batch_size 56 | self.max_len = max_len 57 | 58 | self.vocab, self.w2i = read_vocab(vocab_file) 59 | self.i2w = {v:k for k,v in self.w2i.items()} 60 | self.label_names, self.l2i = read_vocab(label_file) 61 | self.i2l = {v:k for k,v in self.l2i.items()} 62 | 63 | self.tag_l2i = {"1":0,"0":1,"-1":2,"-2":3} 64 | self.tag_i2l = {v:k for k,v in self.tag_l2i.items()} 65 | 66 | self._raw_data = [] 67 | self.items = [] 68 | self._preprocess() 69 | 70 | def 
get_label(self, labels, l2i, normalize=False): 71 | one_hot_labels = np.zeros(len(l2i),dtype=np.float32) 72 | for n in labels: 73 | if n: 74 | one_hot_labels[l2i[n]] = 1 75 | 76 | if normalize: 77 | one_hot_labels = one_hot_labels / len(labels) 78 | return one_hot_labels 79 | 80 | def _preprocess(self): 81 | print_out("# Start to preprocessing data...") 82 | for fname in self.data_files: 83 | print_out("# load data from %s ..." % fname) 84 | for line in open(fname): 85 | item = json.loads(line.strip()) 86 | content = item['content'] 87 | content = _tokenize(content, self.w2i, self.max_len, self.reverse, self.split_word) 88 | item_labels = [] 89 | for label_name in self.label_names: 90 | labels = [item[label_name]] 91 | labels = self.get_label(labels,self.tag_l2i) 92 | item_labels.append(labels) 93 | self._raw_data.append(DataItem(content=content,labels=np.asarray(item_labels),length=len(content),id=int(item['id']))) 94 | self.items.append(item) 95 | 96 | self.num_batches = len(self._raw_data) // self.batch_size 97 | self.data_size = len(self._raw_data) 98 | print_out("# Got %d data items with %d batches" % (self.data_size, self.num_batches)) 99 | 100 | def _shuffle(self): 101 | # code from https://github.com/fastai/fastai/blob/3f2079f7bc07ef84a750f6417f68b7b9fdc9525a/fastai/text.py#L125 102 | idxs = np.random.permutation(self.data_size) 103 | sz = self.batch_size * 50 104 | ck_idx = [idxs[i:i+sz] for i in range(0, len(idxs), sz)] 105 | sort_idx = np.concatenate([sorted(s, key=lambda x:self._raw_data[x].length, reverse=True) for s in ck_idx]) 106 | sz = self.batch_size 107 | ck_idx = [sort_idx[i:i+sz] for i in range(0, len(sort_idx), sz)] 108 | max_ck = np.argmax([self._raw_data[ck[0]].length for ck in ck_idx]) # find the chunk with the largest key, 109 | ck_idx[0],ck_idx[max_ck] = ck_idx[max_ck],ck_idx[0] # then make sure it goes first. 
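# Shuffle the order of the remaining batch-sized chunks but keep the longest-sequence chunk at the front, so batches stay roughly sorted by length (less padding per batch) while the batch order still changes every epoch.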
110 | sort_idx = np.concatenate(np.random.permutation(ck_idx[1:])) 111 | sort_idx = np.concatenate((ck_idx[0], sort_idx)) 112 | return iter(sort_idx) 113 | 114 | def process_batch(self, batch): 115 | contents = [item.content for item in batch] 116 | lengths = [item.length for item in batch] 117 | contents = _padding(contents,max(lengths)) 118 | lengths = np.asarray(lengths) 119 | targets = np.asarray([item.labels for item in batch]) 120 | ids = [item.id for item in batch] 121 | return contents, lengths, targets, ids 122 | 123 | def get_next(self, shuffle=True): 124 | if shuffle: 125 | idxs = self._shuffle() 126 | else: 127 | idxs = range(self.data_size) 128 | 129 | batch = [] 130 | for i in idxs: 131 | item = self._raw_data[i] 132 | if len(batch) >= self.batch_size: 133 | yield self.process_batch(batch) 134 | batch = [item] 135 | else: 136 | batch.append(item) 137 | if len(batch) > 0: 138 | yield self.process_batch(batch) 139 | -------------------------------------------------------------------------------- /img/model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xueyouluo/fsauor2018/03a624517c31387d2b6029b0ff952d7fe01f8c1d/img/model.png -------------------------------------------------------------------------------- /labels.txt: -------------------------------------------------------------------------------- 1 | location_traffic_convenience 2 | location_distance_from_business_district 3 | location_easy_to_find 4 | service_wait_time 5 | service_waiters_attitude 6 | service_parking_convenience 7 | service_serving_speed 8 | price_level 9 | price_cost_effective 10 | price_discount 11 | environment_decoration 12 | environment_noise 13 | environment_space 14 | environment_cleaness 15 | dish_portion 16 | dish_taste 17 | dish_look 18 | dish_recommendation 19 | others_overall_experience 20 | others_willing_to_consume_again 21 | -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | # ======================================== 2 | # Author: Xueyou Luo 3 | # Email: xueyou.luo@aidigger.com 4 | # Copyright: Eigen Tech @ 2018 5 | # ======================================== 6 | 7 | import argparse 8 | import json 9 | import time 10 | 11 | import numpy as np 12 | import tensorflow as tf 13 | 14 | from dataset import DataSet 15 | from model import Model 16 | from utils import * 17 | 18 | def add_arguments(parser): 19 | """Build ArgumentParser.""" 20 | parser.register("type", "bool", lambda v: v.lower() == "true") 21 | 22 | # mode 23 | parser.add_argument("--mode", type=str, default='train', help="running mode: train | eval | inference") 24 | 25 | # data 26 | parser.add_argument("--data_files", type=str, nargs='+', default=None, help="data file for train or inference") 27 | parser.add_argument("--eval_files", type=str, nargs='+', default=None, help="eval data file for evaluation") 28 | parser.add_argument("--label_file", type=str, default=None, help="label file") 29 | parser.add_argument("--vocab_file", type=str, default=None, help="vocab file") 30 | parser.add_argument("--embed_file", type=str, default=None, help="embedding file to restore") 31 | parser.add_argument("--out_file", type=str, default=None, help="output file for inference") 32 | parser.add_argument("--split_word", type='bool', nargs="?", const=True, default=True, help="Whether to split word when oov") 33 | parser.add_argument("--max_len", type=int, 
default=1200, help='max length for doc') 34 | parser.add_argument("--batch_size", type=int, default=32, help="batch size") 35 | parser.add_argument("--reverse", type='bool', nargs="?", const=True, default=False, help="Whether to reverse data") 36 | parser.add_argument("--prob", type='bool', nargs="?", const=True, default=False, help="Whether to export prob") 37 | 38 | # model 39 | parser.add_argument("--num_layers", type=int, default=2, help="number of layers") 40 | parser.add_argument("--decay_schema", type=str, default='hand', help = 'learning rate decay: exp | hand') 41 | parser.add_argument("--encoder", type=str, default='gnmt', help="gnmt | elmo") 42 | parser.add_argument("--decay_steps", type=int, default=10000, help="decay steps") 43 | parser.add_argument("--learning_rate", type=float, default=0.001, help="Learning rate. RMS: 0.001 | 0.0001") 44 | parser.add_argument("--focal_loss", type=float, default=2., help="gamma of focal loss") 45 | parser.add_argument("--embedding_dropout", type=float, default=0.1, help="embedding_dropout") 46 | parser.add_argument("--max_gradient_norm", type=float, default=5.0, help="Clip gradients to this norm.") 47 | parser.add_argument("--dropout_keep_prob", type=float, default=0.8, help="drop out keep ratio for training") 48 | parser.add_argument("--weight_keep_drop", type=float, default=0.8, help="weight keep drop") 49 | parser.add_argument("--l2_loss_ratio", type=float, default=0.0, help="l2 loss ratio") 50 | parser.add_argument("--rnn_cell_name", type=str, default='lstm', help = 'rnn cell name') 51 | parser.add_argument("--embedding_size", type=int, default=300, help="embedding_size") 52 | parser.add_argument("--num_units", type=int, default=300, help="num_units") 53 | parser.add_argument("--double_decoder", type='bool', nargs="?", const=True, default=False, help="Whether to double decoder size") 54 | parser.add_argument("--variational_dropout", type='bool', nargs="?", const=True, default=True, help="Whether to use variational_dropout") 55 | 56 | # clf 57 | parser.add_argument("--target_label_num", type=int, default=4, help="target_label_num") 58 | parser.add_argument("--feature_num", type=int, default=20, help="feature_num") 59 | 60 | # train 61 | parser.add_argument("--need_early_stop", type='bool', nargs="?", const=True, default=True, help="Whether to early stop") 62 | parser.add_argument("--patient", type=int, default=5, help="patient of early stop") 63 | parser.add_argument("--debug", type='bool', nargs="?", const=True, default=False, help="Whether use debug mode") 64 | parser.add_argument("--num_train_epoch", type=int, default=50, help="training epoches") 65 | parser.add_argument("--steps_per_stats", type=int, default=20, help="steps to print stats") 66 | parser.add_argument("--steps_per_summary", type=int, default=50, help="steps to save summary") 67 | parser.add_argument("--steps_per_eval", type=int, default=2000, help="steps to save model") 68 | 69 | parser.add_argument("--checkpoint_dir", type=str, default='/tmp/visual-semantic', help="checkpoint dir to save model") 70 | 71 | 72 | def convert_to_hparams(params): 73 | hparams = tf.contrib.training.HParams() 74 | for k,v in params.items(): 75 | hparams.add_hparam(k,v) 76 | return hparams 77 | 78 | def inference(flags): 79 | print_out("inference data file {0}".format(flags.data_files)) 80 | dataset = DataSet(flags.data_files, flags.vocab_file, flags.label_file, flags.batch_size, reverse=flags.reverse, split_word=flags.split_word, max_len=flags.max_len) 81 | hparams = 
load_hparams(flags.checkpoint_dir,{"mode":'inference','checkpoint_dir':flags.checkpoint_dir+"/best_eval",'embed_file':None}) 82 | with tf.Session(config = get_config_proto(log_device_placement=False)) as sess: 83 | model = Model(hparams) 84 | model.build() 85 | 86 | try: 87 | model.restore_model(sess) #restore best solution 88 | except Exception as e: 89 | print("unable to restore model with exception",e) 90 | exit(1) 91 | 92 | scalars = model.scalars.eval(session=sess) 93 | print("Scalars:", scalars) 94 | weight = model.weight.eval(session=sess) 95 | print("Weight:",weight) 96 | cnt = 0 97 | for (source, lengths, _, ids) in dataset.get_next(shuffle=False): 98 | predict,logits = model.inference_clf_one_batch(sess, source, lengths) 99 | for i,(p,l) in enumerate(zip(predict,logits)): 100 | for j in range(flags.feature_num): 101 | label_name = dataset.i2l[j] 102 | if flags.prob: 103 | tag = [float(v) for v in l[j]] 104 | else: 105 | tag = dataset.tag_i2l[np.argmax(p[j])] 106 | dataset.items[cnt + i][label_name] = tag 107 | cnt += len(lengths) 108 | print_out("\r# process {0:.2%}".format(cnt/dataset.data_size),new_line=False) 109 | 110 | print_out("# Write result to file ...") 111 | with open(flags.out_file,'w') as f: 112 | for item in dataset.items: 113 | f.write(json.dumps(item,ensure_ascii=False) + '\n') 114 | print_out("# Done") 115 | 116 | def train_eval_clf(model, sess, dataset): 117 | from collections import defaultdict 118 | checkpoint_loss, acc = 0.0, 0.0 119 | 120 | predicts, truths = defaultdict(list), defaultdict(list) 121 | for i,(source, lengths, targets, _) in enumerate(dataset.get_next(shuffle=False)): 122 | batch_loss, accuracy, batch_size, predict = model.eval_clf_one_step(sess, source, lengths, targets) 123 | # batch * 20 * 4 124 | for i,p in enumerate(predict): 125 | for j in range(model.hparams.feature_num): 126 | label_name = dataset.i2l[j] 127 | truths[label_name].append(targets[i][j]) 128 | predicts[label_name].append(p[j]) 129 | checkpoint_loss += batch_loss 130 | acc += accuracy 131 | if (i+1) % 100 == 0: 132 | print_out("# batch %d/%d" %(i+1,dataset.num_batches)) 133 | 134 | results = {} 135 | total_f1 = 0.0 136 | for label_name in dataset.label_names: 137 | # print("# Get f1 score for",label_name) 138 | f1,precision,recall = cal_f1(model.hparams.target_label_num,np.asarray(predicts[label_name]),np.asarray(truths[label_name])) 139 | results[label_name] = f1 140 | total_f1 += f1 141 | print("# {0} - {1}".format(label_name,f1)) 142 | 143 | final_f1 = total_f1 / len(results) 144 | 145 | print_out( "# Eval loss %.5f, f1 %.5f" % (checkpoint_loss/i, final_f1)) 146 | return -1 * final_f1, checkpoint_loss/i 147 | 148 | def train_clf(flags): 149 | dataset = DataSet(flags.data_files, flags.vocab_file, flags.label_file, flags.batch_size, reverse=flags.reverse, split_word=flags.split_word, max_len=flags.max_len) 150 | eval_dataset = DataSet(flags.eval_files, flags.vocab_file, flags.label_file, 5 * flags.batch_size, reverse=flags.reverse, split_word=flags.split_word, max_len=flags.max_len) 151 | 152 | params = vars(flags) 153 | params['vocab_size'] = len(dataset.w2i) 154 | hparams = convert_to_hparams(params) 155 | 156 | save_hparams(flags.checkpoint_dir, hparams) 157 | print(hparams) 158 | 159 | train_graph = tf.Graph() 160 | eval_graph = tf.Graph() 161 | 162 | with train_graph.as_default(): 163 | train_model = Model(hparams) 164 | train_model.build() 165 | initializer = tf.global_variables_initializer() 166 | 167 | with eval_graph.as_default(): 168 | eval_hparams = 
load_hparams(flags.checkpoint_dir,{"mode":'eval','checkpoint_dir':flags.checkpoint_dir+"/best_eval"}) 169 | eval_model = Model(eval_hparams) 170 | eval_model.build() 171 | 172 | train_sess = tf.Session(graph=train_graph, config=get_config_proto(log_device_placement=False )) 173 | train_model.init_model(train_sess, initializer=initializer) 174 | try: 175 | train_model.restore_model(train_sess) 176 | except: 177 | print_out("unable to restore model, train from scratch") 178 | 179 | print_out("# Start to train with learning rate {0}, {1}".format(flags.learning_rate,time.ctime())) 180 | 181 | global_step = train_sess.run(train_model.global_step) 182 | print("# Global step", global_step) 183 | 184 | eval_ppls = [] 185 | best_eval = 1000000000 186 | pre_best_checkpoint = None 187 | final_learn = 2 188 | for epoch in range(flags.num_train_epoch): 189 | step_time, checkpoint_loss, acc, iters = 0.0, 0.0, 0.0, 0 190 | for i,(source, lengths, targets, _) in enumerate(dataset.get_next()): 191 | start_time = time.time() 192 | add_summary = (global_step % flags.steps_per_summary == 0) 193 | batch_loss, global_step, accuracy, token_num,batch_size = train_model.train_clf_one_step(train_sess,source, lengths, targets, add_summary = add_summary, run_info= add_summary and flags.debug) 194 | step_time += (time.time() - start_time) 195 | checkpoint_loss += batch_loss 196 | acc += accuracy 197 | iters += token_num 198 | 199 | if global_step == 0: 200 | continue 201 | 202 | if global_step % flags.steps_per_stats == 0: 203 | train_acc = (acc / flags.steps_per_stats) * 100 204 | acc_summary = tf.Summary() 205 | acc_summary.value.add(tag='accuracy', simple_value = train_acc) 206 | train_model.summary_writer.add_summary(acc_summary, global_step=global_step) 207 | 208 | print_out( 209 | "# Epoch %d global step %d loss %.5f batch %d/%d lr %g " 210 | "accuracy %.5f wps %.2f step time %.2fs" % 211 | (epoch+1, global_step, checkpoint_loss/flags.steps_per_stats, i+1,dataset.num_batches, train_model.learning_rate.eval(session=train_sess), 212 | train_acc, (iters)/step_time, step_time/(flags.steps_per_stats))) 213 | step_time, checkpoint_loss, iters, acc = 0.0, 0.0, 0, 0.0 214 | 215 | if global_step % flags.steps_per_eval == 0: 216 | print_out("# global step {0}, eval model at {1}".format(global_step, time.ctime())) 217 | checkpoint_path = train_model.save_model(train_sess) 218 | with tf.Session(graph=eval_graph, config=get_config_proto(log_device_placement=False)) as eval_sess: 219 | eval_model.init_model(eval_sess) 220 | eval_model.restore_ema_model(eval_sess, checkpoint_path) 221 | eval_ppl, eval_loss = train_eval_clf(eval_model, eval_sess, eval_dataset) 222 | print_out("# current result {0}, previous best result {1}".format(eval_ppl,best_eval)) 223 | loss_summary = tf.Summary() 224 | loss_summary.value.add(tag='eval_loss', simple_value = eval_loss) 225 | train_model.summary_writer.add_summary(loss_summary, global_step=global_step) 226 | if eval_ppl < best_eval: 227 | pre_best_checkpoint = checkpoint_path 228 | eval_model.save_model(eval_sess,global_step) 229 | best_eval = eval_ppl 230 | eval_ppls.append(eval_ppl) 231 | if flags.need_early_stop: 232 | if early_stop(eval_ppls, flags.patient): 233 | print_out("# No loss decrease, restore previous best model and set learning rate to half of previous one") 234 | current_lr = train_model.learning_rate.eval(session=train_sess) 235 | if final_learn > 0: 236 | final_learn -= 1 237 | else: 238 | print_out("# Early stop, exit") 239 | exit(0) 240 | 
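# Not the final restart yet: roll back to the best checkpoint so far, divide the learning rate by 10, and on the last allowed restart (final_learn == 0) also disable dropout for the remaining fine-tuning.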
train_model.saver.restore(train_sess, pre_best_checkpoint) 241 | lr = tf.assign(train_model.learning_rate, current_lr/10) 242 | if final_learn==0: 243 | dropout = tf.assign(train_model.dropout_keep_prob, 1.0) 244 | emd_drop = tf.assign(train_model.embedding_dropout, 0.0) 245 | train_sess.run([dropout,emd_drop]) 246 | train_sess.run(lr) 247 | eval_ppls = [best_eval] 248 | continue 249 | 250 | print_out("# Finsh epoch {1}, global step {0}".format(global_step, epoch+1)) 251 | print_out("# Best accuracy {0}".format(best_eval)) 252 | 253 | if __name__ == "__main__": 254 | parser = argparse.ArgumentParser() 255 | add_arguments(parser) 256 | flags, unparsed = parser.parse_known_args() 257 | if flags.mode == 'train': 258 | train_clf(flags) 259 | elif flags.mode == 'inference': 260 | inference(flags) 261 | -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | # ======================================== 2 | # Author: Xueyou Luo 3 | # Email: xueyou.luo@aidigger.com 4 | # Copyright: Eigen Tech @ 2018 5 | # ======================================== 6 | import os 7 | 8 | import numpy as np 9 | import tensorflow as tf 10 | 11 | from utils import (_reverse, focal_loss, gelu, get_total_param_num, print_out, 12 | single_rnn_cell) 13 | from thrid_utils import create_embedding 14 | 15 | class Model(object): 16 | def __init__(self, hparams): 17 | self.hparams = hparams 18 | 19 | def is_training(self): 20 | return self.hparams.mode == 'train' 21 | 22 | def build(self): 23 | self.setup_input_placeholders() 24 | self.setup_embedding() 25 | if self.hparams.encoder == 'gnmt': 26 | self.gnmt_encoder() 27 | elif self.hparams.encoder == 'elmo': 28 | self.elmo_encoder() 29 | else: 30 | raise ValueError("Un-supported encoder %s" % self.hparams.encoder) 31 | self.setup_clf() 32 | 33 | self.params = tf.trainable_variables() 34 | self.ema = tf.train.ExponentialMovingAverage(decay=0.9999) 35 | 36 | if self.hparams.mode in ['train', 'eval']: 37 | self.setup_loss() 38 | if self.hparams.mode == 'train': 39 | self.setup_training() 40 | self.setup_summary() 41 | self.saver = tf.train.Saver(tf.global_variables(),max_to_keep=5) 42 | 43 | def init_model(self, sess, initializer=None): 44 | if initializer: 45 | sess.run(initializer) 46 | else: 47 | sess.run(tf.global_variables_initializer()) 48 | 49 | def save_model(self, sess, global_step=None): 50 | return self.saver.save(sess, os.path.join(self.hparams.checkpoint_dir, 51 | "model.ckpt"), global_step=global_step if global_step else self.global_step) 52 | 53 | def restore_best_model(self, sess): 54 | self.saver.restore(sess, tf.train.latest_checkpoint( 55 | self.hparams.checkpoint_dir + '/best_dev')) 56 | 57 | def restore_ema_model(self, sess, path): 58 | shadow_vars = {self.ema.average_name(v):v for v in self.params} 59 | saver = tf.train.Saver(shadow_vars) 60 | saver.restore(sess, path) 61 | 62 | def restore_model(self, sess, epoch=None): 63 | if epoch is None: 64 | self.saver.restore(sess, tf.train.latest_checkpoint( 65 | self.hparams.checkpoint_dir)) 66 | else: 67 | self.saver.restore( 68 | sess, os.path.join(self.hparams.checkpoint_dir, "model.ckpt" + ("-%d" % epoch))) 69 | print("restored model") 70 | 71 | def setup_input_placeholders(self): 72 | self.source_tokens = tf.placeholder( 73 | tf.int32, shape=[None, None], name='source_tokens') 74 | 75 | # for training and evaluation 76 | if self.hparams.mode in ['train', 'eval']: 77 | self.target_labels = 
tf.placeholder( 78 | tf.float32, shape=[None, self.hparams.feature_num, self.hparams.target_label_num], name='target_labels') 79 | 80 | self.batch_size = tf.shape(self.source_tokens,out_type=tf.int32)[0] 81 | 82 | self.sequence_length = tf.placeholder( 83 | tf.int32, shape=[None], name='sequence_length') 84 | 85 | self.global_step = tf.Variable( 86 | initial_value=0, 87 | name="global_step", 88 | trainable=False, 89 | collections=[tf.GraphKeys.GLOBAL_STEP, tf.GraphKeys.GLOBAL_VARIABLES]) 90 | 91 | self.predict_token_num = tf.reduce_sum(self.sequence_length) 92 | self.embedding_dropout = tf.Variable(self.hparams.embedding_dropout, trainable=False) 93 | self.dropout_keep_prob = tf.Variable(self.hparams.dropout_keep_prob, trainable=False) 94 | 95 | def setup_embedding(self): 96 | # load pretrained embedding 97 | self.embedding = create_embedding( 98 | "embedding", 99 | self.hparams.vocab_size, 100 | self.hparams.embedding_size, 101 | vocab_file=self.hparams.vocab_file, 102 | embed_file=self.hparams.embed_file) 103 | 104 | if self.hparams.embedding_dropout > 0 and self.is_training(): 105 | vocab_size = tf.shape(self.embedding)[0] 106 | mask = tf.nn.dropout(tf.ones([vocab_size]),keep_prob=1-self.embedding_dropout) * (1-self.embedding_dropout) 107 | mask = tf.expand_dims(mask,1) 108 | self.embedding = mask * self.embedding 109 | 110 | self.source_embedding = tf.nn.embedding_lookup( 111 | self.embedding, self.source_tokens) 112 | # [20] 113 | features = tf.range(self.hparams.feature_num,dtype=tf.int32) 114 | feature_embedding_var = create_embedding("feature_embedding", self.hparams.feature_num, self.hparams.embedding_size) 115 | # [20 * embedding_size] 116 | feature_embedding = tf.nn.embedding_lookup(feature_embedding_var, features) 117 | # [batch * 20 * embedding_size] 118 | self.feature_embedding = tf.tile(tf.expand_dims(feature_embedding,axis=0),[self.batch_size,1,1]) 119 | 120 | if self.is_training(): 121 | self.source_embedding = tf.nn.dropout( 122 | self.source_embedding, keep_prob=self.dropout_keep_prob) 123 | self.feature_embedding = tf.nn.dropout( 124 | self.feature_embedding, keep_prob=self.dropout_keep_prob) 125 | 126 | def elmo_encoder(self): 127 | print_out("build elmo encoder") 128 | with tf.variable_scope("elmo_encoder") as scope: 129 | inputs = tf.transpose(self.source_embedding,[1,0,2]) 130 | inputs_reverse = _reverse( 131 | inputs, seq_lengths=self.sequence_length, 132 | seq_dim=0, batch_dim=1) 133 | encoder_states = [] 134 | outputs = [tf.concat([inputs,inputs],axis=-1)] 135 | fw_cell_inputs = inputs 136 | bw_cell_inputs = inputs_reverse 137 | for i in range(self.hparams.num_layers): 138 | with tf.variable_scope("fw_%d" % i) as s: 139 | cell = tf.contrib.rnn.LSTMBlockFusedCell(self.hparams.num_units,use_peephole=False) 140 | fused_outputs_op, fused_state_op = cell(fw_cell_inputs,sequence_length=self.sequence_length,dtype=inputs.dtype) 141 | encoder_states.append(fused_state_op) 142 | with tf.variable_scope("bw_%d" % i) as s: 143 | bw_cell = tf.contrib.rnn.LSTMBlockFusedCell(self.hparams.num_units,use_peephole=False) 144 | bw_fused_outputs_op_reverse, bw_fused_state_op = bw_cell(bw_cell_inputs,sequence_length=self.sequence_length,dtype=inputs.dtype) 145 | bw_fused_outputs_op = _reverse( 146 | bw_fused_outputs_op_reverse, seq_lengths=self.sequence_length, 147 | seq_dim=0, batch_dim=1) 148 | encoder_states.append(bw_fused_state_op) 149 | output = tf.concat([fused_outputs_op,bw_fused_outputs_op],axis=-1) 150 | if i > 0: 151 | fw_cell_inputs = output + fw_cell_inputs 152 | 
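# Residual connections from the second Bi-LSTM layer on: the next forward input is this layer's output added to the current forward input, and the next backward input is the re-reversed output added to the current backward input.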
bw_cell_inputs = _reverse( 153 | output, seq_lengths=self.sequence_length, 154 | seq_dim=0, batch_dim=1) + bw_cell_inputs 155 | else: 156 | fw_cell_inputs = output 157 | bw_cell_inputs = _reverse( 158 | output, seq_lengths=self.sequence_length, 159 | seq_dim=0, batch_dim=1) 160 | outputs.append(output) 161 | 162 | final_output = None 163 | # embedding + num_layers 164 | n = 1 + self.hparams.num_layers 165 | scalars = tf.get_variable('scalar',initializer=tf.constant([1/(n)]*n)) 166 | self.scalars = scalars 167 | weight = tf.get_variable('weight',initializer=tf.constant(0.001)) 168 | self.weight = weight 169 | 170 | soft_scalars = tf.nn.softmax(scalars) 171 | for i, output in enumerate(outputs): 172 | if final_output is None: 173 | final_output = soft_scalars[i] * tf.transpose(output,[1,0,2]) 174 | else: 175 | final_output = final_output + soft_scalars[i] * tf.transpose(output,[1,0,2]) 176 | 177 | self.final_outputs = weight * final_output 178 | self.final_state = tuple(encoder_states) 179 | 180 | def gnmt_encoder(self): 181 | print_out("build gnmt encoder") 182 | with tf.variable_scope("gnmt_encoder") as scope: 183 | inputs = tf.transpose(self.source_embedding,[1,0,2]) 184 | inputs_reverse = _reverse( 185 | inputs, seq_lengths=self.sequence_length, 186 | seq_dim=0, batch_dim=1) 187 | encoder_states = [] 188 | outputs = [inputs] 189 | 190 | with tf.variable_scope("fw") as s: 191 | cell = tf.contrib.rnn.LSTMBlockFusedCell(self.hparams.num_units,use_peephole=False) 192 | fused_outputs_op, fused_state_op = cell(inputs,sequence_length=self.sequence_length,dtype=inputs.dtype) 193 | encoder_states.append(fused_state_op) 194 | outputs.append(fused_outputs_op) 195 | 196 | with tf.variable_scope('bw') as s: 197 | bw_cell = tf.contrib.rnn.LSTMBlockFusedCell(self.hparams.num_units,use_peephole=False) 198 | bw_fused_outputs_op, bw_fused_state_op = bw_cell(inputs_reverse,sequence_length=self.sequence_length,dtype=inputs.dtype) 199 | bw_fused_outputs_op = _reverse( 200 | bw_fused_outputs_op, seq_lengths=self.sequence_length, 201 | seq_dim=0, batch_dim=1) 202 | encoder_states.append(bw_fused_state_op) 203 | outputs.append(bw_fused_outputs_op) 204 | 205 | with tf.variable_scope("uni") as s: 206 | uni_inputs = tf.concat([fused_outputs_op,bw_fused_outputs_op],axis=-1) 207 | for i in range(self.hparams.num_layers-1): 208 | with tf.variable_scope("layer_%d" % i) as scope: 209 | uni_cell = tf.contrib.rnn.LSTMBlockFusedCell(self.hparams.num_units,use_peephole=False) 210 | uni_fused_outputs_op, uni_fused_state_op = uni_cell(uni_inputs,sequence_length=self.sequence_length,dtype=inputs.dtype) 211 | encoder_states.append(uni_fused_state_op) 212 | outputs.append(uni_fused_outputs_op) 213 | if i > 0: 214 | uni_fused_outputs_op = uni_fused_outputs_op + uni_inputs 215 | uni_inputs = uni_fused_outputs_op 216 | 217 | final_output = None 218 | # embedding + fw + bw + uni 219 | n = 3 + self.hparams.num_layers - 1 220 | scalars = tf.get_variable('scalar',initializer=tf.constant([1/(n)]*n)) 221 | self.scalars = scalars 222 | weight = tf.get_variable('weight',initializer=tf.constant(0.001)) 223 | self.weight = weight 224 | 225 | soft_scalars = tf.nn.softmax(scalars) 226 | for i, output in enumerate(outputs): 227 | if final_output is None: 228 | final_output = soft_scalars[i] * tf.transpose(output,[1,0,2]) 229 | else: 230 | final_output = final_output + soft_scalars[i] * tf.transpose(output,[1,0,2]) 231 | 232 | self.final_outputs = weight * final_output 233 | self.final_state = tuple(encoder_states) 234 | 235 | def 
setup_attention_semantic(self): 236 | num_units = self.hparams.num_units * 2 if self.hparams.double_decoder else self.hparams.num_units 237 | with tf.variable_scope("attention_semantic") as scope: 238 | cell = single_rnn_cell(self.hparams.rnn_cell_name, num_units, self.is_training(), self.dropout_keep_prob, self.hparams.weight_keep_drop, self.hparams.variational_dropout) 239 | attention = tf.contrib.seq2seq.LuongAttention(num_units, self.final_outputs, self.sequence_length,scale=True) 240 | attn_cell = tf.contrib.seq2seq.AttentionWrapper(cell, attention, output_attention=True) 241 | if 'lstm' in self.hparams.rnn_cell_name.lower(): 242 | h = tf.layers.dense(tf.concat([state.h for state in self.final_state],axis=-1),num_units, use_bias=True) 243 | c = tf.layers.dense(tf.concat([state.c for state in self.final_state],axis=-1),num_units, use_bias=True) 244 | initial_state = attn_cell.zero_state(self.batch_size,dtype=tf.float32).clone(cell_state=tf.contrib.rnn.LSTMStateTuple(c=c,h=h)) 245 | else: 246 | h = tf.layers.dense(tf.concat([state for state in self.final_state],axis=-1),num_units, use_bias=True) 247 | 248 | initial_state = attn_cell.zero_state(self.batch_size,dtype=tf.float32).clone(cell_state=h) 249 | outputs = [] 250 | state = initial_state 251 | for i in range(self.hparams.feature_num): 252 | if i > 0: tf.get_variable_scope().reuse_variables() 253 | inputs = self.feature_embedding[:,i,:] 254 | cell_output, state = attn_cell(inputs, state) 255 | if 'lstm' in self.hparams.rnn_cell_name.lower(): 256 | out_state = tf.concat([state.cell_state.h,cell_output],axis=-1) 257 | else: 258 | out_state = tf.concat([state.cell_state,cell_output],axis=-1) 259 | outputs.append(out_state) 260 | return outputs 261 | 262 | def setup_clf(self): 263 | num_units = self.hparams.num_units * 2 if self.hparams.double_decoder else self.hparams.num_units 264 | with tf.variable_scope("classification",reuse=tf.AUTO_REUSE) as scope: 265 | states = self.setup_attention_semantic() 266 | final_logits = [] 267 | final_predicts = [] 268 | with tf.variable_scope("predict_clf"): 269 | hidden_layer = tf.layers.Dense(num_units, use_bias=True, activation=tf.nn.relu) 270 | output_layer = tf.layers.Dense(self.hparams.target_label_num) 271 | 272 | for i,state in enumerate(states): 273 | semantic = hidden_layer(state) 274 | logits = output_layer(semantic) 275 | 276 | final_logits.append(logits) 277 | predict = tf.argmax(logits,axis=-1) 278 | predict = tf.one_hot(predict,self.hparams.target_label_num) 279 | final_predicts.append(predict) 280 | 281 | self.final_logits = tf.concat([tf.expand_dims(l,1) for l in final_logits],axis=1) 282 | self.final_predict = tf.concat([tf.expand_dims(p,1) for p in final_predicts],axis=1) 283 | if self.hparams.mode in ['train','eval']: 284 | self.accurary = tf.contrib.metrics.accuracy(tf.to_int32(self.final_predict),tf.to_int32(self.target_labels)) 285 | 286 | def setup_loss(self): 287 | if self.hparams.focal_loss > 0: 288 | self.gamma = tf.Variable(self.hparams.focal_loss,dtype=tf.float32, trainable=False) 289 | label_losses = focal_loss(self.target_labels, self.final_logits, self.gamma) 290 | else: 291 | label_losses = tf.losses.softmax_cross_entropy(onehot_labels=self.target_labels, logits=self.final_logits, reduction=tf.losses.Reduction.MEAN) 292 | self.losses = label_losses 293 | 294 | def setup_summary(self): 295 | self.summary_writer = tf.summary.FileWriter( 296 | self.hparams.checkpoint_dir, tf.get_default_graph()) 297 | tf.summary.scalar("train_loss", self.losses) 298 | 
tf.summary.scalar("learning_rate", self.learning_rate) 299 | tf.summary.scalar("accuracy", self.accurary) 300 | tf.summary.scalar('gN', self.gradient_norm) 301 | tf.summary.scalar('pN', self.param_norm) 302 | self.summary_op = tf.summary.merge_all() 303 | 304 | def setup_training(self): 305 | # learning rate decay 306 | if self.hparams.decay_schema == 'exp': 307 | self.learning_rate = tf.train.exponential_decay(self.hparams.learning_rate, self.global_step, 308 | self.hparams.decay_steps, 0.96, staircase=True) 309 | else: 310 | self.learning_rate = tf.Variable( 311 | self.hparams.learning_rate, dtype=tf.float32, trainable=False) 312 | 313 | params = self.params 314 | if self.hparams.l2_loss_ratio > 0: 315 | l2_loss = self.hparams.l2_loss_ratio * tf.add_n([tf.nn.l2_loss(p) for p in params if ('predict_clf' in p.name and 'bias' not in p.name)]) 316 | self.losses += l2_loss 317 | 318 | get_total_param_num(params) 319 | 320 | self.param_norm = tf.global_norm(params) 321 | 322 | gradients = tf.gradients(self.losses, params, colocate_gradients_with_ops=True) 323 | clipped_gradients, _ = tf.clip_by_global_norm( 324 | gradients, self.hparams.max_gradient_norm) 325 | self.gradient_norm = tf.global_norm(gradients) 326 | opt = tf.train.RMSPropOptimizer(self.learning_rate) 327 | train_op = opt.apply_gradients( 328 | zip(clipped_gradients, params), global_step=self.global_step) 329 | with tf.control_dependencies([train_op]): 330 | train_op = self.ema.apply(params) 331 | self.train_op = train_op 332 | 333 | def train_clf_one_step(self, sess, source, lengths, targets, add_summary=False, run_info=False): 334 | feed_dict = {} 335 | feed_dict[self.source_tokens] = source 336 | feed_dict[self.sequence_length] = lengths 337 | feed_dict[self.target_labels] = targets 338 | if run_info: 339 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 340 | run_metadata = tf.RunMetadata() 341 | 342 | _, batch_loss, summary, global_step, accuracy, token_num, batch_size = sess.run( 343 | [self.train_op, self.losses, self.summary_op, self.global_step, self.accurary, self.predict_token_num, self.batch_size], 344 | feed_dict=feed_dict, 345 | options=run_options, 346 | run_metadata=run_metadata) 347 | 348 | else: 349 | _, batch_loss, summary, global_step, accuracy, token_num, batch_size = sess.run( 350 | [self.train_op, self.losses, self.summary_op, self.global_step, self.accurary, self.predict_token_num, self.batch_size], 351 | feed_dict = feed_dict 352 | ) 353 | if run_info: 354 | self.summary_writer.add_run_metadata( 355 | run_metadata, 'step%03d' % global_step) 356 | print("adding run meta for", global_step) 357 | 358 | 359 | if add_summary: 360 | self.summary_writer.add_summary(summary, global_step=global_step) 361 | return batch_loss, global_step, accuracy, token_num, batch_size 362 | 363 | def eval_clf_one_step(self, sess, source, lengths, targets): 364 | feed_dict = {} 365 | feed_dict[self.source_tokens] = source 366 | feed_dict[self.sequence_length] = lengths 367 | feed_dict[self.target_labels] = targets 368 | 369 | batch_loss, accuracy,batch_size, predict = sess.run( 370 | [self.losses, self.accurary,self.batch_size, self.final_predict], 371 | feed_dict = feed_dict 372 | ) 373 | return batch_loss, accuracy,batch_size,predict 374 | 375 | def inference_clf_one_batch(self, sess, source, lengths): 376 | feed_dict = {} 377 | feed_dict[self.source_tokens] = source 378 | feed_dict[self.sequence_length] = lengths 379 | 380 | predict,logits = sess.run([self.final_predict, tf.nn.softmax(self.final_logits)], 
feed_dict=feed_dict) 381 | return predict, logits 382 | -------------------------------------------------------------------------------- /scripts/data_preprocess.py: -------------------------------------------------------------------------------- 1 | # ======================================== 2 | # Author: Xueyou Luo 3 | # Email: xueyou.luo@aidigger.com 4 | # Copyright: Eigen Tech @ 2018 5 | # ======================================== 6 | 7 | import argparse 8 | import csv 9 | import json 10 | import re 11 | from collections import Counter 12 | 13 | import jieba 14 | 15 | def add_arguments(parser): 16 | """Build ArgumentParser.""" 17 | parser.register("type", "bool", lambda v: v.lower() == "true") 18 | 19 | parser.add_argument("--data_file", type=str, default=None, required=True, help="data file to process") 20 | parser.add_argument("--output_file", type=str, default=None, required=True, help="data file to process") 21 | parser.add_argument("--vocab_file", type=str, default=None, help="vocab file, needed when data file is training file") 22 | parser.add_argument("--vocab_size", type=int, default=50000, help='vocab size') 23 | parser.add_argument("--embedding", type='bool', nargs="?", const=True, default=False, help='whether process embedding file') 24 | 25 | def replace_dish(content): 26 | return re.sub("【.{5,20}】","",content) 27 | 28 | def normalize_num(words): 29 | '''Normalize numbers 30 | for example: 123 -> 100, 3934 -> 3000 31 | ''' 32 | tokens = [] 33 | for w in words: 34 | try: 35 | ww = w 36 | num = int(float(ww)) 37 | if len(ww) < 2: 38 | tokens.append(ww) 39 | else: 40 | num = int(ww[0]) * (10**(len(str(num))-1)) 41 | tokens.append(str(num)) 42 | except: 43 | tokens.append(w) 44 | return tokens 45 | 46 | def tokenize(content): 47 | content = content.replace("\u0006",'').replace("\u0005",'').replace("\u0007",'') 48 | tokens = [] 49 | content = content.lower() 50 | # 去除重复字符 51 | content = re.sub('~+','~',content) 52 | content = re.sub('~+','~',content) 53 | content = re.sub('(\n)+','\n',content) 54 | for para in content.split('\n'): 55 | para_tokens = [] 56 | words = list(jieba.cut(para)) 57 | words = normalize_num(words) 58 | para_tokens.extend(words) 59 | para_tokens.append('') 60 | tokens.append(' '.join(para_tokens)) 61 | content = " ".join(tokens) 62 | content = re.sub('\s+',' ',content) 63 | content = re.sub('( )+',' ',content) 64 | content = re.sub('(- )+','- ',content) 65 | content = re.sub('(= )+','= ',content) 66 | content = re.sub('(\. )+','. 
',content).strip() 67 | content = replace_dish(content) 68 | if content.endswith(""): 69 | content = content[:-7] 70 | return content 71 | 72 | def create_vocab(data, vocab_file, vocab_size): 73 | print("# Start to create vocab ...") 74 | words = Counter() 75 | for item in data: 76 | words.update(item['content'].split()) 77 | special_tokens = ['','',''] 78 | with open(vocab_file,'w') as f: 79 | for w in special_tokens: 80 | f.write(w + '\n') 81 | for w,_ in words.most_common(vocab_size-len(special_tokens)): 82 | f.write(w + '\n') 83 | print("# Created vocab file {0} with vocab size {1}".format(vocab_file,vocab_size)) 84 | 85 | def process_data(output_file, data_file): 86 | data = [] 87 | with open(output_file,'w') as f: 88 | with open(data_file,encoding='utf-8-sig') as csvfile: 89 | reader = csv.DictReader(csvfile) 90 | for i,item in enumerate(reader): 91 | content = tokenize(item['content'].strip()[1:-1]) 92 | item['content'] = content 93 | f.write(json.dumps(item,ensure_ascii=False)+'\n') 94 | data.append(item) 95 | if (i+1) % 10000 == 0: 96 | print("# processed -- %d --"%(i+1)) 97 | return data 98 | 99 | def process_embedding(embedding_file, vocab_file, out_embedding_file): 100 | words = set([line.strip() for line in open(vocab_file)]) 101 | with open(out_embedding_file,'w') as f: 102 | for line in open(embedding_file): 103 | tokens = line.split() 104 | # skip the first line 105 | if len(tokens) == 2: 106 | continue 107 | word = tokens[0].lower() 108 | if word in words: 109 | f.write(word + ' ' + ' '.join(tokens[1:]) + '\n') 110 | 111 | if __name__ == "__main__": 112 | parser = argparse.ArgumentParser() 113 | add_arguments(parser) 114 | flags, unparsed = parser.parse_known_args() 115 | if flags.embedding: 116 | process_embedding(flags.data_file, flags.vocab_file, flags.output_file) 117 | else: 118 | if 'train' in flags.data_file: 119 | if flags.vocab_file is None: 120 | raise ValueError("Must provided a vocab file to save vocab") 121 | data = process_data(flags.output_file, flags.data_file) 122 | create_vocab(data,flags.vocab_file,flags.vocab_size) 123 | else: 124 | process_data(flags.output_file, flags.data_file) 125 | -------------------------------------------------------------------------------- /scripts/preprocess.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Modify the following values depend on your environment 4 | # Path to the csv files 5 | TRAIN_FILE=/data/xueyou/data/ai_challenger_sentiment/ai_challenger_sentiment_analysis_trainingset_20180816/sentiment_analysis_trainingset.csv 6 | VALIDATION_FILE=/data/xueyou/data/ai_challenger_sentiment/ai_challenger_sentiment_analysis_validationset_20180816/sentiment_analysis_validationset.csv 7 | TESTA_FILE=/data/xueyou/data/ai_challenger_sentiment/ai_challenger_sentiment_analysis_testa_20180816/sentiment_analysis_testa.csv 8 | TESTB_FILE=/data/xueyou/data/ai_challenger_sentiment/ai_challenger_sentimetn_analysis_testb_20180816/sentiment_analysis_testb.csv 9 | 10 | # Path to pretrained embedding file 11 | EMBEDDING_FILE=/data/xueyou/data/embedding/sgns.sogou.word 12 | 13 | VOCAB_SIZE=50000 14 | 15 | # Create a folder to save training files 16 | mkdir -p data 17 | 18 | echo 'Process training file ...' 19 | python data_preprocess.py \ 20 | --data_file=$TRAIN_FILE \ 21 | --output_file=data/train.json \ 22 | --vocab_file=data/vocab.txt \ 23 | --vocab_size=$VOCAB_SIZE 24 | 25 | echo 'Process validation file ...' 
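# Note: vocab.txt is only built while processing the training file (data_preprocess.py creates the vocab when 'train' appears in --data_file), so the validation/test runs below just tokenize and write JSON.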
26 | python data_preprocess.py \ 27 | --data_file=$VALIDATION_FILE \ 28 | --output_file=data/validation.json 29 | 30 | echo 'Process testa file ...' 31 | python data_preprocess.py \ 32 | --data_file=$TESTA_FILE \ 33 | --output_file=data/testa.json 34 | 35 | # Uncomment following code to get testb file 36 | # echo 'Process testb file ...' 37 | # python data_preprocess.py \ 38 | # --data_file=$TESTB_FILE \ 39 | # --output_file=data/testb.json 40 | 41 | echo 'Get pretrained embedding ...' 42 | python data_preprocess.py \ 43 | --data_file=$EMBEDDING_FILE \ 44 | --output_file=data/embedding.txt \ 45 | --vocab_file=data/vocab.txt \ 46 | --embedding=True 47 | 48 | echo "Get label file ..." 49 | cp ../labels.txt data/labels.txt -------------------------------------------------------------------------------- /scripts/readme.md: -------------------------------------------------------------------------------- 1 | ## Train Model from scratch with GPU 2 | 3 | This is a simple script used to run this code from scratch. Using the default settings, you can get macro f1 score 0.70769. 4 | 5 | > The data preprocess steps are no the same as the one I used during the competition, so the final f1 score may be not the same: 6 | 7 | Some major differences: 8 | - Jieba is used instead of LTP here 9 | - NER is not used here 10 | - No custom dictionary used here 11 | 12 | ### 1. Download raw data 13 | 14 | - Download data from this [data link](https://drive.google.com/file/d/1OInXRx_OmIJgK3ZdoFZnmqUi0rGfOaQo/view?usp=sharing). 15 | - Unzip the files to get the raw csv files. 16 | 17 | ### 2. Download pretrained embedding file 18 | 19 | - Download embedding file from this [embedding link](https://pan.baidu.com/s/1tUghuTno5yOvOx4LXA9-wg). 20 | - Unzip the file to get the embedding file. 21 | 22 | ### 3. Preprocess data 23 | 24 | Modify the file paths in preprocess.sh: 25 | - TRAIN_FILE 26 | - VALIDATION_FILE 27 | - TESTA_FILE 28 | - TESTB_FILE 29 | > Refer to 1 to get the data file path 30 | - EMBEDDING_FILE 31 | > Refer to 2 to get the embedding file path 32 | - VOCAB_SIZE 33 | > You can try different vocab size 34 | 35 | Then run 36 | 37 | ``` 38 | bash preprocess.sh 39 | ``` 40 | 41 | We will create all the files needed under ./data folder. 42 | 43 | ### 4. Run training 44 | 45 | Change your workdir to parent folder, and run the training scripts: 46 | 47 | ``` 48 | bash bash/elmo_train.sh 49 | ``` 50 | 51 | ### 5. 
Run inference 52 | 53 | After training, we can get the predicted results of test files: 54 | 55 | ``` 56 | bash bash/elmo_inference.sh 57 | ``` 58 | -------------------------------------------------------------------------------- /thrid_utils.py: -------------------------------------------------------------------------------- 1 | # ======================================== 2 | # Author: Xueyou Luo 3 | # Email: xueyou.luo@aidigger.com 4 | # Copyright: Eigen Tech @ 2018 5 | # ======================================== 6 | 7 | '''These codes are copied from eigen-tensorflow''' 8 | import codecs 9 | import csv 10 | import os 11 | 12 | import numpy as np 13 | import tensorflow as tf 14 | 15 | # If a vocab size is greater than this value, put the embedding on cpu instead 16 | VOCAB_SIZE_THRESHOLD_CPU = 30000 17 | 18 | def read_vocab(vocab_file): 19 | """read vocab from file 20 | 21 | Args: 22 | vocab_file ([type]): path to the vocab file, the vocab file should contains a word each line 23 | 24 | Returns: 25 | list of words 26 | """ 27 | 28 | if not os.path.isfile(vocab_file): 29 | raise ValueError("%s is not a vaild file"%vocab_file) 30 | 31 | vocab = [] 32 | word2id = {} 33 | with codecs.getreader("utf-8")(tf.gfile.GFile(vocab_file, "rb")) as f: 34 | for i,line in enumerate(f): 35 | word = line.strip() 36 | if not word: 37 | raise ValueError("Got empty word at line %d"%(i+1)) 38 | vocab.append(word) 39 | word2id[word] = len(word2id) 40 | 41 | print("# vocab size: ",len(vocab)) 42 | return vocab, word2id 43 | 44 | def load_embed_file(embed_file): 45 | """Load embed_file into a python dictionary. 46 | 47 | Note: the embed_file should be a Glove formated txt file. Assuming 48 | embed_size=5, for example: 49 | 50 | the -0.071549 0.093459 0.023738 -0.090339 0.056123 51 | to 0.57346 0.5417 -0.23477 -0.3624 0.4037 52 | and 0.20327 0.47348 0.050877 0.002103 0.060547 53 | 54 | Args: 55 | embed_file: file path to the embedding file. 56 | Returns: 57 | a dictionary that maps word to vector, and the size of embedding dimensions. 
58 | """ 59 | emb_dict = dict() 60 | emb_size = None 61 | with codecs.getreader("utf-8")(tf.gfile.GFile(embed_file, 'rb')) as f: 62 | for i,line in enumerate(f): 63 | tokens = line.strip().split(" ") 64 | word = tokens[0] 65 | vec = list(map(float, tokens[1:])) 66 | emb_dict[word] = vec 67 | if emb_size: 68 | assert emb_size == len( 69 | vec), "All embedding size should be same, but got {0} at line {1}".format(len(vec),i+1) 70 | else: 71 | emb_size = len(vec) 72 | return emb_dict, emb_size 73 | 74 | def embedding_dropout(embedding, dropout=0.1): 75 | vocab_size = tf.shape(embedding)[0] 76 | mask = tf.nn.dropout(tf.ones([vocab_size]),keep_prob=1-dropout) * (1-dropout) 77 | mask = tf.expand_dims(mask, 1) 78 | return mask * embedding 79 | 80 | def _get_embed_device(vocab_size): 81 | """Decide on which device to place an embed matrix given its vocab size.""" 82 | if vocab_size > VOCAB_SIZE_THRESHOLD_CPU: 83 | return "/cpu:0" 84 | else: 85 | return "/gpu:0" 86 | 87 | def _load_pretrained_emb_from_file(name, vocab_file, embed_file, num_trainable_tokens=0, dtype=tf.float32): 88 | print("# Start to load pretrained embedding...") 89 | vocab,_ = read_vocab(vocab_file) 90 | if num_trainable_tokens: 91 | trainable_tokens = vocab[:num_trainable_tokens] 92 | else: 93 | trainable_tokens = vocab 94 | 95 | emb_dict, emb_size = load_embed_file(embed_file) 96 | print("# pretrained embedding size",len(emb_dict),emb_size) 97 | 98 | for token in trainable_tokens: 99 | if token not in emb_dict: 100 | if '' in emb_dict: 101 | emb_dict[token] = emb_dict[''] 102 | else: 103 | emb_dict[token] = list(np.random.random(emb_size)) 104 | 105 | emb_mat = np.array([emb_dict[token] for token in vocab], dtype=dtype.as_numpy_dtype()) 106 | if num_trainable_tokens: 107 | emb_mat = tf.constant(emb_mat) 108 | emb_mat_const = tf.slice(emb_mat,[num_trainable_tokens,0],[-1,-1]) 109 | with tf.device(_get_embed_device(num_trainable_tokens)): 110 | emb_mat_var = tf.get_variable(name + "_emb_mat_var", [num_trainable_tokens, emb_size]) 111 | return tf.concat([emb_mat_var,emb_mat_const],0,name=name) 112 | else: 113 | with tf.device(_get_embed_device(len(vocab))): 114 | emb_mat_var = tf.get_variable(name,emb_mat.shape,initializer=tf.constant_initializer(emb_mat)) 115 | return emb_mat_var 116 | 117 | def create_embedding(name, vocab_size, embed_size, vocab_file=None, embed_file=None, num_trainable_tokens=0, dtype=tf.float32, scope=None): 118 | '''create a new embedding tensor or load from a pretrained embedding file 119 | 120 | Args: 121 | name: name of the embedding 122 | vocab_size : vocab size 123 | embed_size : embeddign size 124 | vocab_file ([type], optional): Defaults to None. vocab file 125 | embed_file ([type], optional): Defaults to None. 126 | num_trainable_tokens (int, optional): Defaults to 0. the number of tokens to be trained, if 0 then train all the tokens 127 | dtype ([type], optional): Defaults to tf.float32. [description] 128 | scope ([type], optional): Defaults to None. 
[description] 129 | 130 | Returns: 131 | embedding variable 132 | ''' 133 | 134 | with tf.variable_scope(scope or "embedding", dtype=dtype) as scope: 135 | if vocab_file and embed_file: 136 | embedding = _load_pretrained_emb_from_file(name, vocab_file, embed_file, num_trainable_tokens, dtype) 137 | else: 138 | with tf.device(_get_embed_device(vocab_size)): 139 | embedding = tf.get_variable(name,[vocab_size,embed_size],dtype) 140 | return embedding 141 | 142 | class DropConnectLayer(tf.layers.Dense): 143 | def __init__(self, units, 144 | mode=tf.estimator.ModeKeys.TRAIN, 145 | keep_prob=0.7, 146 | activation=None, 147 | use_bias=True, 148 | kernel_initializer=None, 149 | bias_initializer=tf.zeros_initializer(), 150 | kernel_regularizer=None, 151 | bias_regularizer=None, 152 | activity_regularizer=None, 153 | kernel_constraint=None, 154 | bias_constraint=None, 155 | trainable=True, 156 | name=None, 157 | **kwargs): 158 | super(DropConnectLayer,self).__init__( units, 159 | activation=activation, 160 | use_bias=use_bias, 161 | kernel_initializer=kernel_initializer, 162 | bias_initializer=bias_initializer, 163 | kernel_regularizer=kernel_regularizer, 164 | bias_regularizer=bias_regularizer, 165 | activity_regularizer=activity_regularizer, 166 | kernel_constraint=kernel_constraint, 167 | bias_constraint=bias_constraint, 168 | trainable=trainable, 169 | name=name, 170 | **kwargs) 171 | self.mode = mode 172 | self.keep_prob = keep_prob 173 | self.mask = None 174 | 175 | def build(self, input_shape): 176 | from tensorflow.python.layers import base 177 | from tensorflow.python.framework import tensor_shape 178 | input_shape = tensor_shape.TensorShape(input_shape) 179 | if input_shape[-1].value is None: 180 | raise ValueError('The last dimension of the inputs to `Dense` ' 181 | 'should be defined. Found `None`.') 182 | self.input_spec = base.InputSpec(min_ndim=2, 183 | axes={-1: input_shape[-1].value}) 184 | self.kernel = self.add_variable('kernel', 185 | shape=[input_shape[-1].value, self.units], 186 | initializer=self.kernel_initializer, 187 | regularizer=self.kernel_regularizer, 188 | constraint=self.kernel_constraint, 189 | dtype=self.dtype, 190 | trainable=True) 191 | if self.mode == tf.estimator.ModeKeys.TRAIN: 192 | if self.mask is None: 193 | mask = tf.ones_like(self.kernel) 194 | self.mask = tf.nn.dropout(mask, keep_prob=self.keep_prob) * self.keep_prob 195 | self.kernel = self.kernel * self.mask 196 | if self.use_bias: 197 | self.bias = self.add_variable('bias', 198 | shape=[self.units,], 199 | initializer=self.bias_initializer, 200 | regularizer=self.bias_regularizer, 201 | constraint=self.bias_constraint, 202 | dtype=self.dtype, 203 | trainable=True) 204 | else: 205 | self.bias = None 206 | self.built = True 207 | 208 | 209 | class WeightDropLSTMCell(tf.contrib.rnn.BasicLSTMCell): 210 | '''Apply dropout on hidden-to-hidden weights''' 211 | 212 | def __init__(self, num_units, weight_keep_drop=0.7, mode=tf.estimator.ModeKeys.TRAIN, 213 | forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None): 214 | """Initialize the parameters for an LSTM cell. 
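Args:
  num_units: int, the number of units in the LSTM cell.
  weight_keep_drop: keep probability of the DropConnect mask applied to the hidden-to-hidden weights.
  mode: tf.estimator.ModeKeys; the weight mask is only applied in TRAIN mode.
  forget_bias: float, bias added to the forget gate.
  state_is_tuple: if True, the state is an LSTMStateTuple of (c, h).
  activation: activation function of the inner states.
  reuse: whether to reuse variables in an existing scope.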
215 | """ 216 | super(WeightDropLSTMCell,self).__init__( num_units, forget_bias, state_is_tuple, activation, reuse) 217 | self.w_layer = tf.layers.Dense(4 * num_units) 218 | self.h_layer = DropConnectLayer(4 * num_units, mode, weight_keep_drop, use_bias=False) 219 | 220 | def build(self, inputs_shape): 221 | # compatible with tf-1.5 222 | self.built = True 223 | 224 | def call(self, inputs, state): 225 | """Long short-term memory cell (LSTM). 226 | Args: 227 | inputs: `2-D` tensor with shape `[batch_size x input_size]`. 228 | state: An `LSTMStateTuple` of state tensors, each shaped 229 | `[batch_size x self.state_size]`, if `state_is_tuple` has been set to 230 | `True`. Otherwise, a `Tensor` shaped 231 | `[batch_size x 2 * self.state_size]`. 232 | Returns: 233 | A pair containing the new hidden state, and the new state (either a 234 | `LSTMStateTuple` or a concatenated state, depending on 235 | `state_is_tuple`). 236 | """ 237 | sigmoid = tf.sigmoid 238 | # Parameters of gates are concatenated into one multiply for efficiency. 239 | if self._state_is_tuple: 240 | c, h = state 241 | else: 242 | c, h = tf.split(value=state, num_or_size_splits=2, axis=1) 243 | 244 | # W * x + b 245 | inputs = self.w_layer(inputs) 246 | # U * h(t-1) 247 | h = self.h_layer(h) 248 | 249 | # i = input_gate, j = new_input, f = forget_gate, o = output_gate 250 | i, j, f, o = tf.split( 251 | value=inputs + h, num_or_size_splits=4, axis=1) 252 | 253 | new_c = ( 254 | c * sigmoid(f + self._forget_bias) + sigmoid(i) * self._activation(j)) 255 | new_h = self._activation(new_c) * sigmoid(o) 256 | 257 | if self._state_is_tuple: 258 | new_state = tf.contrib.rnn.LSTMStateTuple(new_c, new_h) 259 | else: 260 | new_state = tf.concat([new_c, new_h], 1) 261 | return new_h, new_state 262 | -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | # ======================================== 2 | # Author: Xueyou Luo 3 | # Email: xueyou.luo@aidigger.com 4 | # Copyright: Eigen Tech @ 2018 5 | # ======================================== 6 | import codecs 7 | import json 8 | import os 9 | import sys 10 | 11 | import numpy as np 12 | import tensorflow as tf 13 | 14 | 15 | def print_out(s, f=None, new_line=True): 16 | """Similar to print but with support to flush and output to a file.""" 17 | if isinstance(s, bytes): 18 | s = s.decode("utf-8") 19 | 20 | if f: 21 | f.write(s.encode("utf-8")) 22 | if new_line: 23 | f.write(b"\n") 24 | 25 | # stdout 26 | out_s = s.encode("utf-8") 27 | if not isinstance(out_s, str): 28 | out_s = out_s.decode("utf-8") 29 | 30 | print(out_s, end="", file=sys.stdout) 31 | 32 | if new_line: 33 | sys.stdout.write("\n") 34 | sys.stdout.flush() 35 | 36 | def _reverse(input_, seq_lengths, seq_dim, batch_dim): 37 | if seq_lengths is not None: 38 | return tf.reverse_sequence( 39 | input=input_, seq_lengths=seq_lengths, 40 | seq_dim=seq_dim, batch_dim=batch_dim) 41 | else: 42 | return tf.reverse(input_, axis=[seq_dim]) 43 | 44 | def gelu(input_tensor): 45 | """Gaussian Error Linear Unit. 46 | This is a smoother version of the RELU. 47 | Original paper: https://arxiv.org/abs/1606.08415 48 | Args: 49 | input_tensor: float Tensor to perform activation. 50 | Returns: 51 | `input_tensor` with the GELU activation applied. 
52 | """ 53 | cdf = 0.5 * (1.0 + tf.erf(input_tensor / tf.sqrt(2.0))) 54 | return input_tensor * cdf 55 | 56 | def single_rnn_cell(cell_name, num_units, train_phase=True, keep_prob=0.75, weight_keep_drop=0.65, variational_dropout = False): 57 | """ 58 | Get a single rnn cell 59 | """ 60 | cell_name = cell_name.upper() 61 | if cell_name == "GRU": 62 | cell = tf.contrib.rnn.GRUCell(num_units) 63 | elif cell_name == "LSTM": 64 | cell = tf.contrib.rnn.LSTMCell(num_units) 65 | elif cell_name == 'block_lstm'.upper(): 66 | cell = tf.contrib.rnn.LSTMBlockCell(num_units) 67 | elif cell_name == 'WEIGHT_LSTM': 68 | from thrid_utils import WeightDropLSTMCell 69 | cell = WeightDropLSTMCell(num_units,weight_keep_drop=weight_keep_drop,mode=tf.estimator.ModeKeys.TRAIN if train_phase and weight_keep_drop<1.0 else tf.estimator.ModeKeys.PREDICT) 70 | elif cell_name == 'LAYERNORM_LSTM': 71 | cell = tf.contrib.rnn.LayerNormBasicLSTMCell(num_units) 72 | else: 73 | cell = tf.contrib.rnn.BasicRNNCell(num_units) 74 | 75 | # dropout wrapper 76 | if train_phase: 77 | # TODO: variational_recurrent=True and input_keep_prob < 1 then we need provide input_size 78 | # But because we use different size in different layers, we will got shape in-compatible error 79 | # So I just set input_keep_prob to 1.0 when we use variational dropout to avoid this error for now. 80 | cell = tf.contrib.rnn.DropoutWrapper( 81 | cell=cell, 82 | input_keep_prob=keep_prob if not variational_dropout else 1.0, 83 | output_keep_prob=keep_prob, 84 | variational_recurrent=variational_dropout, 85 | dtype=tf.float32) 86 | 87 | return cell 88 | 89 | def focal_loss(labels, logits, gamma=2): 90 | epsilon = 1.e-9 91 | y_pred = tf.nn.softmax(logits,dim=-1) 92 | y_pred = y_pred + epsilon # to avoid 0.0 in log 93 | L = -labels*tf.pow((1-y_pred),gamma)*tf.log(y_pred) 94 | L = tf.reduce_sum(L) 95 | batch_size = tf.shape(labels)[0] 96 | return L / tf.to_float(batch_size) 97 | 98 | def get_total_param_num(params, threshold = 1): 99 | total_parameters = 0 100 | #iterating over all variables 101 | for variable in params: 102 | local_parameters=1 103 | shape = variable.get_shape() #getting shape of a variable 104 | for i in shape: 105 | local_parameters*=i.value #mutiplying dimension values 106 | if local_parameters >= threshold: 107 | print("variable {0} with parameter number {1}".format(variable, local_parameters)) 108 | total_parameters+=local_parameters 109 | print('# total parameter number',total_parameters) 110 | return total_parameters 111 | 112 | def cal_f1(label_num,predicted,truth): 113 | results = [] 114 | for i in range(label_num): 115 | results.append({"TP": 0, "FP": 0, "FN": 0, "TN": 0}) 116 | 117 | for i, p in enumerate(predicted): 118 | t = truth[i] 119 | for j in range(label_num): 120 | if p[j] == 1: 121 | if t[j] == 1: 122 | results[j]['TP'] += 1 123 | else: 124 | results[j]['FP'] += 1 125 | else: 126 | if t[j] == 1: 127 | results[j]['FN'] += 1 128 | else: 129 | results[j]['TN'] += 1 130 | 131 | precision = [0.0] * label_num 132 | recall = [0.0] * label_num 133 | f1 = [0.0] * label_num 134 | for i in range(label_num): 135 | if results[i]['TP'] == 0: 136 | if results[i]['FP']==0 and results[i]['FN']==0: 137 | precision[i] = 1.0 138 | recall[i] = 1.0 139 | f1[i] = 1.0 140 | else: 141 | precision[i] = 0.0 142 | recall[i] = 0.0 143 | f1[i] = 0.0 144 | else: 145 | precision[i] = results[i]['TP'] / (results[i]['TP'] + results[i]['FP']) 146 | recall[i] = results[i]['TP'] / (results[i]['TP'] + results[i]['FN']) 147 | f1[i] = 2 * precision[i] * recall[i] 
/ (precision[i] + recall[i]) 148 | 149 | # for i in range(label_num): 150 | # print(i,results[i], precision[i], recall[i], f1[i]) 151 | return sum(f1)/label_num, sum(precision)/label_num, sum(recall)/label_num 152 | 153 | 154 | def load_hparams(out_dir, overidded = None): 155 | hparams_file = os.path.join(out_dir,"hparams") 156 | print("loading hparams from %s" % hparams_file) 157 | hparams_json = json.load(open(hparams_file)) 158 | hparams = tf.contrib.training.HParams() 159 | for k,v in hparams_json.items(): 160 | hparams.add_hparam(k,v) 161 | if overidded: 162 | for k,v in overidded.items(): 163 | if k not in hparams_json: 164 | hparams.add_hparam(k,v) 165 | else: 166 | hparams.set_hparam(k,v) 167 | return hparams 168 | 169 | def save_hparams(out_dir, hparams): 170 | """Save hparams.""" 171 | if not os.path.isdir(out_dir): 172 | os.mkdir(out_dir) 173 | hparams_file = os.path.join(out_dir, "hparams") 174 | print(" saving hparams to %s" % hparams_file) 175 | with codecs.getwriter("utf-8")(tf.gfile.GFile(hparams_file, "wb")) as f: 176 | f.write(hparams.to_json()) 177 | 178 | def get_config_proto(log_device_placement=True, allow_soft_placement=True, 179 | num_intra_threads=0, num_inter_threads=0, per_process_gpu_memory_fraction=0.95, allow_growth=True): 180 | # GPU options: 181 | # https://www.tensorflow.org/versions/r0.10/how_tos/using_gpu/index.html 182 | config_proto = tf.ConfigProto( 183 | log_device_placement=log_device_placement, 184 | allow_soft_placement=allow_soft_placement) 185 | config_proto.gpu_options.allow_growth = allow_growth 186 | config_proto.gpu_options.per_process_gpu_memory_fraction = per_process_gpu_memory_fraction 187 | # CPU threads options 188 | if num_intra_threads: 189 | config_proto.intra_op_parallelism_threads = num_intra_threads 190 | if num_inter_threads: 191 | config_proto.inter_op_parallelism_threads = num_inter_threads 192 | 193 | return config_proto 194 | 195 | def early_stop(values, no_decrease=3): 196 | if len(values) < 2: 197 | return False 198 | best_index = np.argmin(values) 199 | if values[-1] > values[best_index] and (best_index + no_decrease) <= len(values): 200 | return True 201 | else: 202 | return False 203 | 204 | def gl_stop(values, alpha=5): 205 | if len(values) < 2: 206 | return False 207 | best = -1 * min(values) 208 | current = -1 * values[-1] 209 | if 100 * ( 1 - (current / best) ) > alpha: 210 | return True 211 | else: 212 | return False --------------------------------------------------------------------------------