├── .gitignore ├── README.md ├── bash ├── elmo_inference.sh └── elmo_train.sh ├── dataset.py ├── img └── model.png ├── labels.txt ├── main.py ├── model.py ├── scripts ├── data_preprocess.py ├── preprocess.sh └── readme.md ├── thrid_utils.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | MANIFEST 27 | 28 | # PyInstaller 29 | # Usually these files are written by a python script from a template 30 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 31 | *.manifest 32 | *.spec 33 | 34 | # Installer logs 35 | pip-log.txt 36 | pip-delete-this-directory.txt 37 | 38 | # Unit test / coverage reports 39 | htmlcov/ 40 | .tox/ 41 | .coverage 42 | .coverage.* 43 | .cache 44 | nosetests.xml 45 | coverage.xml 46 | *.cover 47 | .hypothesis/ 48 | .pytest_cache/ 49 | 50 | # Translations 51 | *.mo 52 | *.pot 53 | 54 | # Django stuff: 55 | *.log 56 | local_settings.py 57 | db.sqlite3 58 | 59 | # Flask stuff: 60 | instance/ 61 | .webassets-cache 62 | 63 | # Scrapy stuff: 64 | .scrapy 65 | 66 | # Sphinx documentation 67 | docs/_build/ 68 | 69 | # PyBuilder 70 | target/ 71 | 72 | # Jupyter Notebook 73 | .ipynb_checkpoints 74 | 75 | # pyenv 76 | .python-version 77 | 78 | # celery beat schedule file 79 | celerybeat-schedule 80 | 81 | # SageMath parsed files 82 | *.sage.py 83 | 84 | # Environments 85 | .env 86 | .venv 87 | env/ 88 | venv/ 89 | ENV/ 90 | env.bak/ 91 | venv.bak/ 92 | 93 | # Spyder project settings 94 | .spyderproject 95 | .spyproject 96 | 97 | # Rope project settings 98 | .ropeproject 99 | 100 | # mkdocs documentation 101 | /site 102 | 103 | # mypy 104 | .mypy_cache/ 105 | 106 | # data folder 107 | scripts/data/ 108 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # fsauor2018 2 | 3 | Code for Fine-grained Sentiment Analysis of User Reviews of AI Challenger 2018. 4 | 5 | Single model can achieve 0.71 marco-f1 score. 6 | 7 | Testa rank: 27 8 | 9 | Testb rank: 16 10 | 11 | > The final result is achieved by ensemble 10 models by simple voting. 12 | 13 | Issues and starts are welcomed! 14 | 15 | ## Train from scratch 16 | 17 | For those who don't want to preprocess data, refer to [scripts](./scripts/readme.md). 18 | 19 | ## Data 20 | 21 | For those who want to get the raw dataset, please refer to this link [data](https://drive.google.com/file/d/1OInXRx_OmIJgK3ZdoFZnmqUi0rGfOaQo/view?usp=sharing). 22 | 23 | ## Requirements 24 | 25 | tensorflow == 1.4.1 26 | 27 | ## Model Architecture 28 | 29 | The model architecture is simple. Basiclly, you can think of it as a seq2seq model. 30 | 31 | ![模型结构](img/model.png) 32 | 33 | Some details of the model: 34 | 35 | - Embedding layer + 3 * Bi-LSTM layers as encoder 36 | - Residual connection is added on the second and third Bi-LSTM layers 37 | - The final encoder outputs are weighted sum of outputs of each layer. Scalars and weight are learned variables. This idea is copied from ELMO. 
38 | - A simple LSTM cell + attention is used as the decoder; it decodes 20 steps to get one output per label
39 | - Inputs to the decoder are learnable embeddings
40 | - Outputs of the decoder are fed to two FC layers to get the final sentiment logits
41 | 
42 | ## Data preprocess
43 | 
44 | The exact preprocessing code used in the competition is not provided here; a simplified pipeline is available under [scripts](./scripts/readme.md).
45 | 
46 | To use this project, you need the following files:
47 | 
48 | - train.json / validation.json / testa.json
49 | - vocab.txt
50 | - embedding.txt
51 | - labels.txt
52 | 
53 | ### Training files
54 | 
55 | You need to preprocess the original data into JSON files, where each line is a JSON object like the following:
56 | 
57 | ```json
58 | {"id": "0", "content": "吼吼吼 , 萌 死 人 的 棒棒糖 , 中 了 大众 点评 的 霸王餐 , 太 可爱 了 。 一直 就 好奇 这个 棒棒 糖 是 怎么 个 东西 , 大众 点评 给 了 我 这个 土老 冒 一个 见识 的 机会 。 看 介绍 棒棒 糖 是 用 糖 做 的 , 不 会 很 甜 , 中间 的 照片 是 糯米 的 , 能 食用 , 真是 太 高端 大气 上档次 了 , 还 可以 买 蝴蝶 结扎口 , 送 人 可以 买 礼盒 。 我 是 先 打 的 卖家 电话 , 加 了 微信 , 给 卖家传 的 照片 。 等 了 几 天 , 卖家 就 告诉 我 可以 取 货 了 , 去 那 取 的 。 虽然 连 卖家 的 面 都 没 见到 , 但是 还是 谢谢 卖家 送 我 这么 可爱 的 东西 , 太 喜欢 了 , 这 哪 舍得 吃 啊 。", "location_traffic_convenience": "-2", "location_distance_from_business_district": "-2", "location_easy_to_find": "-2", "service_wait_time": "-2", "service_waiters_attitude": "1", "service_parking_convenience": "-2", "service_serving_speed": "-2", "price_level": "-2", "price_cost_effective": "-2", "price_discount": "1", "environment_decoration": "-2", "environment_noise": "-2", "environment_space": "-2", "environment_cleaness": "-2", "dish_portion": "-2", "dish_taste": "-2", "dish_look": "1", "dish_recommendation": "-2", "others_overall_experience": "1", "others_willing_to_consume_again": "-2"}
59 | ```
60 | 
61 | To be specific:
62 | - content should be tokenized words separated by spaces
63 | - You can use jieba/LTP to do the segmentation
64 | - Use NER toolkits to replace place and organization names with the special tokens '\','\'
65 | - the other fields are the same as in the original data files
66 | - for test files, whose labels are unknown, you can leave them as empty strings ("")
67 | 
68 | ### Vocab file
69 | 
70 | I keep the top 50k most common words in the training file.
71 | 
72 | The top 3 words are special tokens:
73 | - \: unknown token
74 | - \: start of content
75 | - \: end of content, also used as the padding token
76 | 
77 | ### Embedding file
78 | 
79 | This is a GloVe-format embedding file. I use [Chinese-Word-Vectors](https://github.com/Embedding/Chinese-Word-Vectors) as the pretrained embeddings (specifically the Sogou News word2vec vectors).
80 | 
81 | ### Label file
82 | 
83 | All the label names, one per line (see labels.txt).
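As a quick sanity check before training, a minimal sketch like the one below (a hypothetical helper, not part of this repo) can confirm that every line of a preprocessed JSON file carries all the labels listed in labels.txt with values the model expects:

```python
# Hypothetical sanity check for the preprocessed JSON-lines files; not part of this repo.
import json

def check_data_file(data_path, label_path):
    label_names = [l.strip() for l in open(label_path, encoding='utf-8') if l.strip()]
    allowed = {"1", "0", "-1", "-2", ""}  # "" is only expected in test files
    count = 0
    with open(data_path, encoding='utf-8') as f:
        for count, line in enumerate(f, 1):
            item = json.loads(line)
            assert 'id' in item and 'content' in item, "line %d: missing id/content" % count
            for name in label_names:
                assert item.get(name) in allowed, "line %d: bad value for %s" % (count, name)
    print("checked %d lines" % count)

# Example: check_data_file('scripts/data/train.json', 'scripts/data/labels.txt')
```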
84 | 85 | ## Train 86 | 87 | Refer to bash/elmo_train.sh 88 | 89 | ## Inference 90 | 91 | Refer to bash/elmo_inference.sh 92 | -------------------------------------------------------------------------------- /bash/elmo_inference.sh: -------------------------------------------------------------------------------- 1 | python main.py \ 2 | --mode=inference \ 3 | --data_files=scripts/data/testa.json \ 4 | --label_file=scripts/data/labels.txt \ 5 | --vocab_file=scripts/data/vocab.txt \ 6 | --out_file=scripts/data/out.testa.json \ 7 | --prob=False \ 8 | --batch_size=300 \ 9 | --feature_num=20 \ 10 | --checkpoint_dir=scripts/data/elmo_ema_0120 -------------------------------------------------------------------------------- /bash/elmo_train.sh: -------------------------------------------------------------------------------- 1 | python main.py \ 2 | --mode=train \ 3 | --data_files scripts/data/train.json \ 4 | --eval_files=scripts/data/validation.json \ 5 | --label_file=scripts/data/labels.txt \ 6 | --vocab_file=scripts/data/vocab.txt \ 7 | --embed_file=scripts/data/embedding.txt \ 8 | --num_layers=3 \ 9 | --batch_size=32 \ 10 | --encoder=elmo \ 11 | --rnn_cell_name=lstm \ 12 | --feature_num=20 \ 13 | --steps_per_eval=2000 \ 14 | --learning_rate=0.001 \ 15 | --focal_loss=0.0 \ 16 | --checkpoint_dir=scripts/data/elmo_ema_0120 -------------------------------------------------------------------------------- /dataset.py: -------------------------------------------------------------------------------- 1 | # ======================================== 2 | # Author: Xueyou Luo 3 | # Email: xueyou.luo@aidigger.com 4 | # Copyright: Eigen Tech @ 2018 5 | # ======================================== 6 | import codecs 7 | import json 8 | from collections import namedtuple 9 | 10 | import numpy as np 11 | import tensorflow as tf 12 | 13 | from utils import print_out 14 | from thrid_utils import read_vocab 15 | 16 | UNK_ID = 0 17 | SOS_ID = 1 18 | EOS_ID = 2 19 | 20 | def _padding(tokens_list, max_len): 21 | ret = np.zeros((len(tokens_list),max_len),np.int32) 22 | for i,t in enumerate(tokens_list): 23 | t = t + (max_len-len(t)) * [EOS_ID] 24 | ret[i] = t 25 | return ret 26 | 27 | def _tokenize(content, w2i, max_tokens=1200, reverse=False, split=True): 28 | def get_tokens(content): 29 | tokens = content.strip().split() 30 | ids = [] 31 | for t in tokens: 32 | if t in w2i: 33 | ids.append(w2i[t]) 34 | else: 35 | for c in t: 36 | ids.append(w2i.get(c,UNK_ID)) 37 | return ids 38 | if split: 39 | ids = get_tokens(content) 40 | else: 41 | ids = [w2i.get(t,UNK_ID) for t in content.strip().split()] 42 | if reverse: 43 | ids = list(reversed(ids)) 44 | tokens = [SOS_ID] + ids[:max_tokens] + [EOS_ID] 45 | return tokens 46 | 47 | class DataItem(namedtuple("DataItem",('content','length','labels','id'))): 48 | pass 49 | 50 | class DataSet(object): 51 | def __init__(self, data_files, vocab_file, label_file, batch_size=32, reverse=False, split_word=True, max_len = 1200): 52 | self.reverse = reverse 53 | self.split_word = split_word 54 | self.data_files = data_files 55 | self.batch_size = batch_size 56 | self.max_len = max_len 57 | 58 | self.vocab, self.w2i = read_vocab(vocab_file) 59 | self.i2w = {v:k for k,v in self.w2i.items()} 60 | self.label_names, self.l2i = read_vocab(label_file) 61 | self.i2l = {v:k for k,v in self.l2i.items()} 62 | 63 | self.tag_l2i = {"1":0,"0":1,"-1":2,"-2":3} 64 | self.tag_i2l = {v:k for k,v in self.tag_l2i.items()} 65 | 66 | self._raw_data = [] 67 | self.items = [] 68 | self._preprocess() 69 | 70 | def 
get_label(self, labels, l2i, normalize=False): 71 | one_hot_labels = np.zeros(len(l2i),dtype=np.float32) 72 | for n in labels: 73 | if n: 74 | one_hot_labels[l2i[n]] = 1 75 | 76 | if normalize: 77 | one_hot_labels = one_hot_labels / len(labels) 78 | return one_hot_labels 79 | 80 | def _preprocess(self): 81 | print_out("# Start to preprocessing data...") 82 | for fname in self.data_files: 83 | print_out("# load data from %s ..." % fname) 84 | for line in open(fname): 85 | item = json.loads(line.strip()) 86 | content = item['content'] 87 | content = _tokenize(content, self.w2i, self.max_len, self.reverse, self.split_word) 88 | item_labels = [] 89 | for label_name in self.label_names: 90 | labels = [item[label_name]] 91 | labels = self.get_label(labels,self.tag_l2i) 92 | item_labels.append(labels) 93 | self._raw_data.append(DataItem(content=content,labels=np.asarray(item_labels),length=len(content),id=int(item['id']))) 94 | self.items.append(item) 95 | 96 | self.num_batches = len(self._raw_data) // self.batch_size 97 | self.data_size = len(self._raw_data) 98 | print_out("# Got %d data items with %d batches" % (self.data_size, self.num_batches)) 99 | 100 | def _shuffle(self): 101 | # code from https://github.com/fastai/fastai/blob/3f2079f7bc07ef84a750f6417f68b7b9fdc9525a/fastai/text.py#L125 102 | idxs = np.random.permutation(self.data_size) 103 | sz = self.batch_size * 50 104 | ck_idx = [idxs[i:i+sz] for i in range(0, len(idxs), sz)] 105 | sort_idx = np.concatenate([sorted(s, key=lambda x:self._raw_data[x].length, reverse=True) for s in ck_idx]) 106 | sz = self.batch_size 107 | ck_idx = [sort_idx[i:i+sz] for i in range(0, len(sort_idx), sz)] 108 | max_ck = np.argmax([self._raw_data[ck[0]].length for ck in ck_idx]) # find the chunk with the largest key, 109 | ck_idx[0],ck_idx[max_ck] = ck_idx[max_ck],ck_idx[0] # then make sure it goes first. 
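# Shuffle the order of the remaining batch-sized chunks but keep the longest-sequence chunk at the front, so batches stay roughly sorted by length (less padding per batch) while the batch order still changes every epoch.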
110 | sort_idx = np.concatenate(np.random.permutation(ck_idx[1:])) 111 | sort_idx = np.concatenate((ck_idx[0], sort_idx)) 112 | return iter(sort_idx) 113 | 114 | def process_batch(self, batch): 115 | contents = [item.content for item in batch] 116 | lengths = [item.length for item in batch] 117 | contents = _padding(contents,max(lengths)) 118 | lengths = np.asarray(lengths) 119 | targets = np.asarray([item.labels for item in batch]) 120 | ids = [item.id for item in batch] 121 | return contents, lengths, targets, ids 122 | 123 | def get_next(self, shuffle=True): 124 | if shuffle: 125 | idxs = self._shuffle() 126 | else: 127 | idxs = range(self.data_size) 128 | 129 | batch = [] 130 | for i in idxs: 131 | item = self._raw_data[i] 132 | if len(batch) >= self.batch_size: 133 | yield self.process_batch(batch) 134 | batch = [item] 135 | else: 136 | batch.append(item) 137 | if len(batch) > 0: 138 | yield self.process_batch(batch) 139 | -------------------------------------------------------------------------------- /img/model.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/xueyouluo/fsauor2018/03a624517c31387d2b6029b0ff952d7fe01f8c1d/img/model.png -------------------------------------------------------------------------------- /labels.txt: -------------------------------------------------------------------------------- 1 | location_traffic_convenience 2 | location_distance_from_business_district 3 | location_easy_to_find 4 | service_wait_time 5 | service_waiters_attitude 6 | service_parking_convenience 7 | service_serving_speed 8 | price_level 9 | price_cost_effective 10 | price_discount 11 | environment_decoration 12 | environment_noise 13 | environment_space 14 | environment_cleaness 15 | dish_portion 16 | dish_taste 17 | dish_look 18 | dish_recommendation 19 | others_overall_experience 20 | others_willing_to_consume_again 21 | -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | # ======================================== 2 | # Author: Xueyou Luo 3 | # Email: xueyou.luo@aidigger.com 4 | # Copyright: Eigen Tech @ 2018 5 | # ======================================== 6 | 7 | import argparse 8 | import json 9 | import time 10 | 11 | import numpy as np 12 | import tensorflow as tf 13 | 14 | from dataset import DataSet 15 | from model import Model 16 | from utils import * 17 | 18 | def add_arguments(parser): 19 | """Build ArgumentParser.""" 20 | parser.register("type", "bool", lambda v: v.lower() == "true") 21 | 22 | # mode 23 | parser.add_argument("--mode", type=str, default='train', help="running mode: train | eval | inference") 24 | 25 | # data 26 | parser.add_argument("--data_files", type=str, nargs='+', default=None, help="data file for train or inference") 27 | parser.add_argument("--eval_files", type=str, nargs='+', default=None, help="eval data file for evaluation") 28 | parser.add_argument("--label_file", type=str, default=None, help="label file") 29 | parser.add_argument("--vocab_file", type=str, default=None, help="vocab file") 30 | parser.add_argument("--embed_file", type=str, default=None, help="embedding file to restore") 31 | parser.add_argument("--out_file", type=str, default=None, help="output file for inference") 32 | parser.add_argument("--split_word", type='bool', nargs="?", const=True, default=True, help="Whether to split word when oov") 33 | parser.add_argument("--max_len", type=int, 
default=1200, help='max length for doc') 34 | parser.add_argument("--batch_size", type=int, default=32, help="batch size") 35 | parser.add_argument("--reverse", type='bool', nargs="?", const=True, default=False, help="Whether to reverse data") 36 | parser.add_argument("--prob", type='bool', nargs="?", const=True, default=False, help="Whether to export prob") 37 | 38 | # model 39 | parser.add_argument("--num_layers", type=int, default=2, help="number of layers") 40 | parser.add_argument("--decay_schema", type=str, default='hand', help = 'learning rate decay: exp | hand') 41 | parser.add_argument("--encoder", type=str, default='gnmt', help="gnmt | elmo") 42 | parser.add_argument("--decay_steps", type=int, default=10000, help="decay steps") 43 | parser.add_argument("--learning_rate", type=float, default=0.001, help="Learning rate. RMS: 0.001 | 0.0001") 44 | parser.add_argument("--focal_loss", type=float, default=2., help="gamma of focal loss") 45 | parser.add_argument("--embedding_dropout", type=float, default=0.1, help="embedding_dropout") 46 | parser.add_argument("--max_gradient_norm", type=float, default=5.0, help="Clip gradients to this norm.") 47 | parser.add_argument("--dropout_keep_prob", type=float, default=0.8, help="drop out keep ratio for training") 48 | parser.add_argument("--weight_keep_drop", type=float, default=0.8, help="weight keep drop") 49 | parser.add_argument("--l2_loss_ratio", type=float, default=0.0, help="l2 loss ratio") 50 | parser.add_argument("--rnn_cell_name", type=str, default='lstm', help = 'rnn cell name') 51 | parser.add_argument("--embedding_size", type=int, default=300, help="embedding_size") 52 | parser.add_argument("--num_units", type=int, default=300, help="num_units") 53 | parser.add_argument("--double_decoder", type='bool', nargs="?", const=True, default=False, help="Whether to double decoder size") 54 | parser.add_argument("--variational_dropout", type='bool', nargs="?", const=True, default=True, help="Whether to use variational_dropout") 55 | 56 | # clf 57 | parser.add_argument("--target_label_num", type=int, default=4, help="target_label_num") 58 | parser.add_argument("--feature_num", type=int, default=20, help="feature_num") 59 | 60 | # train 61 | parser.add_argument("--need_early_stop", type='bool', nargs="?", const=True, default=True, help="Whether to early stop") 62 | parser.add_argument("--patient", type=int, default=5, help="patient of early stop") 63 | parser.add_argument("--debug", type='bool', nargs="?", const=True, default=False, help="Whether use debug mode") 64 | parser.add_argument("--num_train_epoch", type=int, default=50, help="training epoches") 65 | parser.add_argument("--steps_per_stats", type=int, default=20, help="steps to print stats") 66 | parser.add_argument("--steps_per_summary", type=int, default=50, help="steps to save summary") 67 | parser.add_argument("--steps_per_eval", type=int, default=2000, help="steps to save model") 68 | 69 | parser.add_argument("--checkpoint_dir", type=str, default='/tmp/visual-semantic', help="checkpoint dir to save model") 70 | 71 | 72 | def convert_to_hparams(params): 73 | hparams = tf.contrib.training.HParams() 74 | for k,v in params.items(): 75 | hparams.add_hparam(k,v) 76 | return hparams 77 | 78 | def inference(flags): 79 | print_out("inference data file {0}".format(flags.data_files)) 80 | dataset = DataSet(flags.data_files, flags.vocab_file, flags.label_file, flags.batch_size, reverse=flags.reverse, split_word=flags.split_word, max_len=flags.max_len) 81 | hparams = 
load_hparams(flags.checkpoint_dir,{"mode":'inference','checkpoint_dir':flags.checkpoint_dir+"/best_eval",'embed_file':None}) 82 | with tf.Session(config = get_config_proto(log_device_placement=False)) as sess: 83 | model = Model(hparams) 84 | model.build() 85 | 86 | try: 87 | model.restore_model(sess) #restore best solution 88 | except Exception as e: 89 | print("unable to restore model with exception",e) 90 | exit(1) 91 | 92 | scalars = model.scalars.eval(session=sess) 93 | print("Scalars:", scalars) 94 | weight = model.weight.eval(session=sess) 95 | print("Weight:",weight) 96 | cnt = 0 97 | for (source, lengths, _, ids) in dataset.get_next(shuffle=False): 98 | predict,logits = model.inference_clf_one_batch(sess, source, lengths) 99 | for i,(p,l) in enumerate(zip(predict,logits)): 100 | for j in range(flags.feature_num): 101 | label_name = dataset.i2l[j] 102 | if flags.prob: 103 | tag = [float(v) for v in l[j]] 104 | else: 105 | tag = dataset.tag_i2l[np.argmax(p[j])] 106 | dataset.items[cnt + i][label_name] = tag 107 | cnt += len(lengths) 108 | print_out("\r# process {0:.2%}".format(cnt/dataset.data_size),new_line=False) 109 | 110 | print_out("# Write result to file ...") 111 | with open(flags.out_file,'w') as f: 112 | for item in dataset.items: 113 | f.write(json.dumps(item,ensure_ascii=False) + '\n') 114 | print_out("# Done") 115 | 116 | def train_eval_clf(model, sess, dataset): 117 | from collections import defaultdict 118 | checkpoint_loss, acc = 0.0, 0.0 119 | 120 | predicts, truths = defaultdict(list), defaultdict(list) 121 | for i,(source, lengths, targets, _) in enumerate(dataset.get_next(shuffle=False)): 122 | batch_loss, accuracy, batch_size, predict = model.eval_clf_one_step(sess, source, lengths, targets) 123 | # batch * 20 * 4 124 | for i,p in enumerate(predict): 125 | for j in range(model.hparams.feature_num): 126 | label_name = dataset.i2l[j] 127 | truths[label_name].append(targets[i][j]) 128 | predicts[label_name].append(p[j]) 129 | checkpoint_loss += batch_loss 130 | acc += accuracy 131 | if (i+1) % 100 == 0: 132 | print_out("# batch %d/%d" %(i+1,dataset.num_batches)) 133 | 134 | results = {} 135 | total_f1 = 0.0 136 | for label_name in dataset.label_names: 137 | # print("# Get f1 score for",label_name) 138 | f1,precision,recall = cal_f1(model.hparams.target_label_num,np.asarray(predicts[label_name]),np.asarray(truths[label_name])) 139 | results[label_name] = f1 140 | total_f1 += f1 141 | print("# {0} - {1}".format(label_name,f1)) 142 | 143 | final_f1 = total_f1 / len(results) 144 | 145 | print_out( "# Eval loss %.5f, f1 %.5f" % (checkpoint_loss/i, final_f1)) 146 | return -1 * final_f1, checkpoint_loss/i 147 | 148 | def train_clf(flags): 149 | dataset = DataSet(flags.data_files, flags.vocab_file, flags.label_file, flags.batch_size, reverse=flags.reverse, split_word=flags.split_word, max_len=flags.max_len) 150 | eval_dataset = DataSet(flags.eval_files, flags.vocab_file, flags.label_file, 5 * flags.batch_size, reverse=flags.reverse, split_word=flags.split_word, max_len=flags.max_len) 151 | 152 | params = vars(flags) 153 | params['vocab_size'] = len(dataset.w2i) 154 | hparams = convert_to_hparams(params) 155 | 156 | save_hparams(flags.checkpoint_dir, hparams) 157 | print(hparams) 158 | 159 | train_graph = tf.Graph() 160 | eval_graph = tf.Graph() 161 | 162 | with train_graph.as_default(): 163 | train_model = Model(hparams) 164 | train_model.build() 165 | initializer = tf.global_variables_initializer() 166 | 167 | with eval_graph.as_default(): 168 | eval_hparams = 
load_hparams(flags.checkpoint_dir,{"mode":'eval','checkpoint_dir':flags.checkpoint_dir+"/best_eval"}) 169 | eval_model = Model(eval_hparams) 170 | eval_model.build() 171 | 172 | train_sess = tf.Session(graph=train_graph, config=get_config_proto(log_device_placement=False )) 173 | train_model.init_model(train_sess, initializer=initializer) 174 | try: 175 | train_model.restore_model(train_sess) 176 | except: 177 | print_out("unable to restore model, train from scratch") 178 | 179 | print_out("# Start to train with learning rate {0}, {1}".format(flags.learning_rate,time.ctime())) 180 | 181 | global_step = train_sess.run(train_model.global_step) 182 | print("# Global step", global_step) 183 | 184 | eval_ppls = [] 185 | best_eval = 1000000000 186 | pre_best_checkpoint = None 187 | final_learn = 2 188 | for epoch in range(flags.num_train_epoch): 189 | step_time, checkpoint_loss, acc, iters = 0.0, 0.0, 0.0, 0 190 | for i,(source, lengths, targets, _) in enumerate(dataset.get_next()): 191 | start_time = time.time() 192 | add_summary = (global_step % flags.steps_per_summary == 0) 193 | batch_loss, global_step, accuracy, token_num,batch_size = train_model.train_clf_one_step(train_sess,source, lengths, targets, add_summary = add_summary, run_info= add_summary and flags.debug) 194 | step_time += (time.time() - start_time) 195 | checkpoint_loss += batch_loss 196 | acc += accuracy 197 | iters += token_num 198 | 199 | if global_step == 0: 200 | continue 201 | 202 | if global_step % flags.steps_per_stats == 0: 203 | train_acc = (acc / flags.steps_per_stats) * 100 204 | acc_summary = tf.Summary() 205 | acc_summary.value.add(tag='accuracy', simple_value = train_acc) 206 | train_model.summary_writer.add_summary(acc_summary, global_step=global_step) 207 | 208 | print_out( 209 | "# Epoch %d global step %d loss %.5f batch %d/%d lr %g " 210 | "accuracy %.5f wps %.2f step time %.2fs" % 211 | (epoch+1, global_step, checkpoint_loss/flags.steps_per_stats, i+1,dataset.num_batches, train_model.learning_rate.eval(session=train_sess), 212 | train_acc, (iters)/step_time, step_time/(flags.steps_per_stats))) 213 | step_time, checkpoint_loss, iters, acc = 0.0, 0.0, 0, 0.0 214 | 215 | if global_step % flags.steps_per_eval == 0: 216 | print_out("# global step {0}, eval model at {1}".format(global_step, time.ctime())) 217 | checkpoint_path = train_model.save_model(train_sess) 218 | with tf.Session(graph=eval_graph, config=get_config_proto(log_device_placement=False)) as eval_sess: 219 | eval_model.init_model(eval_sess) 220 | eval_model.restore_ema_model(eval_sess, checkpoint_path) 221 | eval_ppl, eval_loss = train_eval_clf(eval_model, eval_sess, eval_dataset) 222 | print_out("# current result {0}, previous best result {1}".format(eval_ppl,best_eval)) 223 | loss_summary = tf.Summary() 224 | loss_summary.value.add(tag='eval_loss', simple_value = eval_loss) 225 | train_model.summary_writer.add_summary(loss_summary, global_step=global_step) 226 | if eval_ppl < best_eval: 227 | pre_best_checkpoint = checkpoint_path 228 | eval_model.save_model(eval_sess,global_step) 229 | best_eval = eval_ppl 230 | eval_ppls.append(eval_ppl) 231 | if flags.need_early_stop: 232 | if early_stop(eval_ppls, flags.patient): 233 | print_out("# No loss decrease, restore previous best model and set learning rate to half of previous one") 234 | current_lr = train_model.learning_rate.eval(session=train_sess) 235 | if final_learn > 0: 236 | final_learn -= 1 237 | else: 238 | print_out("# Early stop, exit") 239 | exit(0) 240 | 
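# Not the final restart yet: roll back to the best checkpoint so far, divide the learning rate by 10, and on the last allowed restart (final_learn == 0) also disable dropout for the remaining fine-tuning.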
train_model.saver.restore(train_sess, pre_best_checkpoint) 241 | lr = tf.assign(train_model.learning_rate, current_lr/10) 242 | if final_learn==0: 243 | dropout = tf.assign(train_model.dropout_keep_prob, 1.0) 244 | emd_drop = tf.assign(train_model.embedding_dropout, 0.0) 245 | train_sess.run([dropout,emd_drop]) 246 | train_sess.run(lr) 247 | eval_ppls = [best_eval] 248 | continue 249 | 250 | print_out("# Finsh epoch {1}, global step {0}".format(global_step, epoch+1)) 251 | print_out("# Best accuracy {0}".format(best_eval)) 252 | 253 | if __name__ == "__main__": 254 | parser = argparse.ArgumentParser() 255 | add_arguments(parser) 256 | flags, unparsed = parser.parse_known_args() 257 | if flags.mode == 'train': 258 | train_clf(flags) 259 | elif flags.mode == 'inference': 260 | inference(flags) 261 | -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | # ======================================== 2 | # Author: Xueyou Luo 3 | # Email: xueyou.luo@aidigger.com 4 | # Copyright: Eigen Tech @ 2018 5 | # ======================================== 6 | import os 7 | 8 | import numpy as np 9 | import tensorflow as tf 10 | 11 | from utils import (_reverse, focal_loss, gelu, get_total_param_num, print_out, 12 | single_rnn_cell) 13 | from thrid_utils import create_embedding 14 | 15 | class Model(object): 16 | def __init__(self, hparams): 17 | self.hparams = hparams 18 | 19 | def is_training(self): 20 | return self.hparams.mode == 'train' 21 | 22 | def build(self): 23 | self.setup_input_placeholders() 24 | self.setup_embedding() 25 | if self.hparams.encoder == 'gnmt': 26 | self.gnmt_encoder() 27 | elif self.hparams.encoder == 'elmo': 28 | self.elmo_encoder() 29 | else: 30 | raise ValueError("Un-supported encoder %s" % self.hparams.encoder) 31 | self.setup_clf() 32 | 33 | self.params = tf.trainable_variables() 34 | self.ema = tf.train.ExponentialMovingAverage(decay=0.9999) 35 | 36 | if self.hparams.mode in ['train', 'eval']: 37 | self.setup_loss() 38 | if self.hparams.mode == 'train': 39 | self.setup_training() 40 | self.setup_summary() 41 | self.saver = tf.train.Saver(tf.global_variables(),max_to_keep=5) 42 | 43 | def init_model(self, sess, initializer=None): 44 | if initializer: 45 | sess.run(initializer) 46 | else: 47 | sess.run(tf.global_variables_initializer()) 48 | 49 | def save_model(self, sess, global_step=None): 50 | return self.saver.save(sess, os.path.join(self.hparams.checkpoint_dir, 51 | "model.ckpt"), global_step=global_step if global_step else self.global_step) 52 | 53 | def restore_best_model(self, sess): 54 | self.saver.restore(sess, tf.train.latest_checkpoint( 55 | self.hparams.checkpoint_dir + '/best_dev')) 56 | 57 | def restore_ema_model(self, sess, path): 58 | shadow_vars = {self.ema.average_name(v):v for v in self.params} 59 | saver = tf.train.Saver(shadow_vars) 60 | saver.restore(sess, path) 61 | 62 | def restore_model(self, sess, epoch=None): 63 | if epoch is None: 64 | self.saver.restore(sess, tf.train.latest_checkpoint( 65 | self.hparams.checkpoint_dir)) 66 | else: 67 | self.saver.restore( 68 | sess, os.path.join(self.hparams.checkpoint_dir, "model.ckpt" + ("-%d" % epoch))) 69 | print("restored model") 70 | 71 | def setup_input_placeholders(self): 72 | self.source_tokens = tf.placeholder( 73 | tf.int32, shape=[None, None], name='source_tokens') 74 | 75 | # for training and evaluation 76 | if self.hparams.mode in ['train', 'eval']: 77 | self.target_labels = 
tf.placeholder( 78 | tf.float32, shape=[None, self.hparams.feature_num, self.hparams.target_label_num], name='target_labels') 79 | 80 | self.batch_size = tf.shape(self.source_tokens,out_type=tf.int32)[0] 81 | 82 | self.sequence_length = tf.placeholder( 83 | tf.int32, shape=[None], name='sequence_length') 84 | 85 | self.global_step = tf.Variable( 86 | initial_value=0, 87 | name="global_step", 88 | trainable=False, 89 | collections=[tf.GraphKeys.GLOBAL_STEP, tf.GraphKeys.GLOBAL_VARIABLES]) 90 | 91 | self.predict_token_num = tf.reduce_sum(self.sequence_length) 92 | self.embedding_dropout = tf.Variable(self.hparams.embedding_dropout, trainable=False) 93 | self.dropout_keep_prob = tf.Variable(self.hparams.dropout_keep_prob, trainable=False) 94 | 95 | def setup_embedding(self): 96 | # load pretrained embedding 97 | self.embedding = create_embedding( 98 | "embedding", 99 | self.hparams.vocab_size, 100 | self.hparams.embedding_size, 101 | vocab_file=self.hparams.vocab_file, 102 | embed_file=self.hparams.embed_file) 103 | 104 | if self.hparams.embedding_dropout > 0 and self.is_training(): 105 | vocab_size = tf.shape(self.embedding)[0] 106 | mask = tf.nn.dropout(tf.ones([vocab_size]),keep_prob=1-self.embedding_dropout) * (1-self.embedding_dropout) 107 | mask = tf.expand_dims(mask,1) 108 | self.embedding = mask * self.embedding 109 | 110 | self.source_embedding = tf.nn.embedding_lookup( 111 | self.embedding, self.source_tokens) 112 | # [20] 113 | features = tf.range(self.hparams.feature_num,dtype=tf.int32) 114 | feature_embedding_var = create_embedding("feature_embedding", self.hparams.feature_num, self.hparams.embedding_size) 115 | # [20 * embedding_size] 116 | feature_embedding = tf.nn.embedding_lookup(feature_embedding_var, features) 117 | # [batch * 20 * embedding_size] 118 | self.feature_embedding = tf.tile(tf.expand_dims(feature_embedding,axis=0),[self.batch_size,1,1]) 119 | 120 | if self.is_training(): 121 | self.source_embedding = tf.nn.dropout( 122 | self.source_embedding, keep_prob=self.dropout_keep_prob) 123 | self.feature_embedding = tf.nn.dropout( 124 | self.feature_embedding, keep_prob=self.dropout_keep_prob) 125 | 126 | def elmo_encoder(self): 127 | print_out("build elmo encoder") 128 | with tf.variable_scope("elmo_encoder") as scope: 129 | inputs = tf.transpose(self.source_embedding,[1,0,2]) 130 | inputs_reverse = _reverse( 131 | inputs, seq_lengths=self.sequence_length, 132 | seq_dim=0, batch_dim=1) 133 | encoder_states = [] 134 | outputs = [tf.concat([inputs,inputs],axis=-1)] 135 | fw_cell_inputs = inputs 136 | bw_cell_inputs = inputs_reverse 137 | for i in range(self.hparams.num_layers): 138 | with tf.variable_scope("fw_%d" % i) as s: 139 | cell = tf.contrib.rnn.LSTMBlockFusedCell(self.hparams.num_units,use_peephole=False) 140 | fused_outputs_op, fused_state_op = cell(fw_cell_inputs,sequence_length=self.sequence_length,dtype=inputs.dtype) 141 | encoder_states.append(fused_state_op) 142 | with tf.variable_scope("bw_%d" % i) as s: 143 | bw_cell = tf.contrib.rnn.LSTMBlockFusedCell(self.hparams.num_units,use_peephole=False) 144 | bw_fused_outputs_op_reverse, bw_fused_state_op = bw_cell(bw_cell_inputs,sequence_length=self.sequence_length,dtype=inputs.dtype) 145 | bw_fused_outputs_op = _reverse( 146 | bw_fused_outputs_op_reverse, seq_lengths=self.sequence_length, 147 | seq_dim=0, batch_dim=1) 148 | encoder_states.append(bw_fused_state_op) 149 | output = tf.concat([fused_outputs_op,bw_fused_outputs_op],axis=-1) 150 | if i > 0: 151 | fw_cell_inputs = output + fw_cell_inputs 152 | 
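# Residual connections from the second Bi-LSTM layer on: the next forward input is this layer's output added to the current forward input, and the next backward input is the re-reversed output added to the current backward input.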
bw_cell_inputs = _reverse( 153 | output, seq_lengths=self.sequence_length, 154 | seq_dim=0, batch_dim=1) + bw_cell_inputs 155 | else: 156 | fw_cell_inputs = output 157 | bw_cell_inputs = _reverse( 158 | output, seq_lengths=self.sequence_length, 159 | seq_dim=0, batch_dim=1) 160 | outputs.append(output) 161 | 162 | final_output = None 163 | # embedding + num_layers 164 | n = 1 + self.hparams.num_layers 165 | scalars = tf.get_variable('scalar',initializer=tf.constant([1/(n)]*n)) 166 | self.scalars = scalars 167 | weight = tf.get_variable('weight',initializer=tf.constant(0.001)) 168 | self.weight = weight 169 | 170 | soft_scalars = tf.nn.softmax(scalars) 171 | for i, output in enumerate(outputs): 172 | if final_output is None: 173 | final_output = soft_scalars[i] * tf.transpose(output,[1,0,2]) 174 | else: 175 | final_output = final_output + soft_scalars[i] * tf.transpose(output,[1,0,2]) 176 | 177 | self.final_outputs = weight * final_output 178 | self.final_state = tuple(encoder_states) 179 | 180 | def gnmt_encoder(self): 181 | print_out("build gnmt encoder") 182 | with tf.variable_scope("gnmt_encoder") as scope: 183 | inputs = tf.transpose(self.source_embedding,[1,0,2]) 184 | inputs_reverse = _reverse( 185 | inputs, seq_lengths=self.sequence_length, 186 | seq_dim=0, batch_dim=1) 187 | encoder_states = [] 188 | outputs = [inputs] 189 | 190 | with tf.variable_scope("fw") as s: 191 | cell = tf.contrib.rnn.LSTMBlockFusedCell(self.hparams.num_units,use_peephole=False) 192 | fused_outputs_op, fused_state_op = cell(inputs,sequence_length=self.sequence_length,dtype=inputs.dtype) 193 | encoder_states.append(fused_state_op) 194 | outputs.append(fused_outputs_op) 195 | 196 | with tf.variable_scope('bw') as s: 197 | bw_cell = tf.contrib.rnn.LSTMBlockFusedCell(self.hparams.num_units,use_peephole=False) 198 | bw_fused_outputs_op, bw_fused_state_op = bw_cell(inputs_reverse,sequence_length=self.sequence_length,dtype=inputs.dtype) 199 | bw_fused_outputs_op = _reverse( 200 | bw_fused_outputs_op, seq_lengths=self.sequence_length, 201 | seq_dim=0, batch_dim=1) 202 | encoder_states.append(bw_fused_state_op) 203 | outputs.append(bw_fused_outputs_op) 204 | 205 | with tf.variable_scope("uni") as s: 206 | uni_inputs = tf.concat([fused_outputs_op,bw_fused_outputs_op],axis=-1) 207 | for i in range(self.hparams.num_layers-1): 208 | with tf.variable_scope("layer_%d" % i) as scope: 209 | uni_cell = tf.contrib.rnn.LSTMBlockFusedCell(self.hparams.num_units,use_peephole=False) 210 | uni_fused_outputs_op, uni_fused_state_op = uni_cell(uni_inputs,sequence_length=self.sequence_length,dtype=inputs.dtype) 211 | encoder_states.append(uni_fused_state_op) 212 | outputs.append(uni_fused_outputs_op) 213 | if i > 0: 214 | uni_fused_outputs_op = uni_fused_outputs_op + uni_inputs 215 | uni_inputs = uni_fused_outputs_op 216 | 217 | final_output = None 218 | # embedding + fw + bw + uni 219 | n = 3 + self.hparams.num_layers - 1 220 | scalars = tf.get_variable('scalar',initializer=tf.constant([1/(n)]*n)) 221 | self.scalars = scalars 222 | weight = tf.get_variable('weight',initializer=tf.constant(0.001)) 223 | self.weight = weight 224 | 225 | soft_scalars = tf.nn.softmax(scalars) 226 | for i, output in enumerate(outputs): 227 | if final_output is None: 228 | final_output = soft_scalars[i] * tf.transpose(output,[1,0,2]) 229 | else: 230 | final_output = final_output + soft_scalars[i] * tf.transpose(output,[1,0,2]) 231 | 232 | self.final_outputs = weight * final_output 233 | self.final_state = tuple(encoder_states) 234 | 235 | def 
setup_attention_semantic(self): 236 | num_units = self.hparams.num_units * 2 if self.hparams.double_decoder else self.hparams.num_units 237 | with tf.variable_scope("attention_semantic") as scope: 238 | cell = single_rnn_cell(self.hparams.rnn_cell_name, num_units, self.is_training(), self.dropout_keep_prob, self.hparams.weight_keep_drop, self.hparams.variational_dropout) 239 | attention = tf.contrib.seq2seq.LuongAttention(num_units, self.final_outputs, self.sequence_length,scale=True) 240 | attn_cell = tf.contrib.seq2seq.AttentionWrapper(cell, attention, output_attention=True) 241 | if 'lstm' in self.hparams.rnn_cell_name.lower(): 242 | h = tf.layers.dense(tf.concat([state.h for state in self.final_state],axis=-1),num_units, use_bias=True) 243 | c = tf.layers.dense(tf.concat([state.c for state in self.final_state],axis=-1),num_units, use_bias=True) 244 | initial_state = attn_cell.zero_state(self.batch_size,dtype=tf.float32).clone(cell_state=tf.contrib.rnn.LSTMStateTuple(c=c,h=h)) 245 | else: 246 | h = tf.layers.dense(tf.concat([state for state in self.final_state],axis=-1),num_units, use_bias=True) 247 | 248 | initial_state = attn_cell.zero_state(self.batch_size,dtype=tf.float32).clone(cell_state=h) 249 | outputs = [] 250 | state = initial_state 251 | for i in range(self.hparams.feature_num): 252 | if i > 0: tf.get_variable_scope().reuse_variables() 253 | inputs = self.feature_embedding[:,i,:] 254 | cell_output, state = attn_cell(inputs, state) 255 | if 'lstm' in self.hparams.rnn_cell_name.lower(): 256 | out_state = tf.concat([state.cell_state.h,cell_output],axis=-1) 257 | else: 258 | out_state = tf.concat([state.cell_state,cell_output],axis=-1) 259 | outputs.append(out_state) 260 | return outputs 261 | 262 | def setup_clf(self): 263 | num_units = self.hparams.num_units * 2 if self.hparams.double_decoder else self.hparams.num_units 264 | with tf.variable_scope("classification",reuse=tf.AUTO_REUSE) as scope: 265 | states = self.setup_attention_semantic() 266 | final_logits = [] 267 | final_predicts = [] 268 | with tf.variable_scope("predict_clf"): 269 | hidden_layer = tf.layers.Dense(num_units, use_bias=True, activation=tf.nn.relu) 270 | output_layer = tf.layers.Dense(self.hparams.target_label_num) 271 | 272 | for i,state in enumerate(states): 273 | semantic = hidden_layer(state) 274 | logits = output_layer(semantic) 275 | 276 | final_logits.append(logits) 277 | predict = tf.argmax(logits,axis=-1) 278 | predict = tf.one_hot(predict,self.hparams.target_label_num) 279 | final_predicts.append(predict) 280 | 281 | self.final_logits = tf.concat([tf.expand_dims(l,1) for l in final_logits],axis=1) 282 | self.final_predict = tf.concat([tf.expand_dims(p,1) for p in final_predicts],axis=1) 283 | if self.hparams.mode in ['train','eval']: 284 | self.accurary = tf.contrib.metrics.accuracy(tf.to_int32(self.final_predict),tf.to_int32(self.target_labels)) 285 | 286 | def setup_loss(self): 287 | if self.hparams.focal_loss > 0: 288 | self.gamma = tf.Variable(self.hparams.focal_loss,dtype=tf.float32, trainable=False) 289 | label_losses = focal_loss(self.target_labels, self.final_logits, self.gamma) 290 | else: 291 | label_losses = tf.losses.softmax_cross_entropy(onehot_labels=self.target_labels, logits=self.final_logits, reduction=tf.losses.Reduction.MEAN) 292 | self.losses = label_losses 293 | 294 | def setup_summary(self): 295 | self.summary_writer = tf.summary.FileWriter( 296 | self.hparams.checkpoint_dir, tf.get_default_graph()) 297 | tf.summary.scalar("train_loss", self.losses) 298 | 
tf.summary.scalar("learning_rate", self.learning_rate) 299 | tf.summary.scalar("accuracy", self.accurary) 300 | tf.summary.scalar('gN', self.gradient_norm) 301 | tf.summary.scalar('pN', self.param_norm) 302 | self.summary_op = tf.summary.merge_all() 303 | 304 | def setup_training(self): 305 | # learning rate decay 306 | if self.hparams.decay_schema == 'exp': 307 | self.learning_rate = tf.train.exponential_decay(self.hparams.learning_rate, self.global_step, 308 | self.hparams.decay_steps, 0.96, staircase=True) 309 | else: 310 | self.learning_rate = tf.Variable( 311 | self.hparams.learning_rate, dtype=tf.float32, trainable=False) 312 | 313 | params = self.params 314 | if self.hparams.l2_loss_ratio > 0: 315 | l2_loss = self.hparams.l2_loss_ratio * tf.add_n([tf.nn.l2_loss(p) for p in params if ('predict_clf' in p.name and 'bias' not in p.name)]) 316 | self.losses += l2_loss 317 | 318 | get_total_param_num(params) 319 | 320 | self.param_norm = tf.global_norm(params) 321 | 322 | gradients = tf.gradients(self.losses, params, colocate_gradients_with_ops=True) 323 | clipped_gradients, _ = tf.clip_by_global_norm( 324 | gradients, self.hparams.max_gradient_norm) 325 | self.gradient_norm = tf.global_norm(gradients) 326 | opt = tf.train.RMSPropOptimizer(self.learning_rate) 327 | train_op = opt.apply_gradients( 328 | zip(clipped_gradients, params), global_step=self.global_step) 329 | with tf.control_dependencies([train_op]): 330 | train_op = self.ema.apply(params) 331 | self.train_op = train_op 332 | 333 | def train_clf_one_step(self, sess, source, lengths, targets, add_summary=False, run_info=False): 334 | feed_dict = {} 335 | feed_dict[self.source_tokens] = source 336 | feed_dict[self.sequence_length] = lengths 337 | feed_dict[self.target_labels] = targets 338 | if run_info: 339 | run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE) 340 | run_metadata = tf.RunMetadata() 341 | 342 | _, batch_loss, summary, global_step, accuracy, token_num, batch_size = sess.run( 343 | [self.train_op, self.losses, self.summary_op, self.global_step, self.accurary, self.predict_token_num, self.batch_size], 344 | feed_dict=feed_dict, 345 | options=run_options, 346 | run_metadata=run_metadata) 347 | 348 | else: 349 | _, batch_loss, summary, global_step, accuracy, token_num, batch_size = sess.run( 350 | [self.train_op, self.losses, self.summary_op, self.global_step, self.accurary, self.predict_token_num, self.batch_size], 351 | feed_dict = feed_dict 352 | ) 353 | if run_info: 354 | self.summary_writer.add_run_metadata( 355 | run_metadata, 'step%03d' % global_step) 356 | print("adding run meta for", global_step) 357 | 358 | 359 | if add_summary: 360 | self.summary_writer.add_summary(summary, global_step=global_step) 361 | return batch_loss, global_step, accuracy, token_num, batch_size 362 | 363 | def eval_clf_one_step(self, sess, source, lengths, targets): 364 | feed_dict = {} 365 | feed_dict[self.source_tokens] = source 366 | feed_dict[self.sequence_length] = lengths 367 | feed_dict[self.target_labels] = targets 368 | 369 | batch_loss, accuracy,batch_size, predict = sess.run( 370 | [self.losses, self.accurary,self.batch_size, self.final_predict], 371 | feed_dict = feed_dict 372 | ) 373 | return batch_loss, accuracy,batch_size,predict 374 | 375 | def inference_clf_one_batch(self, sess, source, lengths): 376 | feed_dict = {} 377 | feed_dict[self.source_tokens] = source 378 | feed_dict[self.sequence_length] = lengths 379 | 380 | predict,logits = sess.run([self.final_predict, tf.nn.softmax(self.final_logits)], 
feed_dict=feed_dict) 381 | return predict, logits 382 | -------------------------------------------------------------------------------- /scripts/data_preprocess.py: -------------------------------------------------------------------------------- 1 | # ======================================== 2 | # Author: Xueyou Luo 3 | # Email: xueyou.luo@aidigger.com 4 | # Copyright: Eigen Tech @ 2018 5 | # ======================================== 6 | 7 | import argparse 8 | import csv 9 | import json 10 | import re 11 | from collections import Counter 12 | 13 | import jieba 14 | 15 | def add_arguments(parser): 16 | """Build ArgumentParser.""" 17 | parser.register("type", "bool", lambda v: v.lower() == "true") 18 | 19 | parser.add_argument("--data_file", type=str, default=None, required=True, help="data file to process") 20 | parser.add_argument("--output_file", type=str, default=None, required=True, help="data file to process") 21 | parser.add_argument("--vocab_file", type=str, default=None, help="vocab file, needed when data file is training file") 22 | parser.add_argument("--vocab_size", type=int, default=50000, help='vocab size') 23 | parser.add_argument("--embedding", type='bool', nargs="?", const=True, default=False, help='whether process embedding file') 24 | 25 | def replace_dish(content): 26 | return re.sub("【.{5,20}】","",content) 27 | 28 | def normalize_num(words): 29 | '''Normalize numbers 30 | for example: 123 -> 100, 3934 -> 3000 31 | ''' 32 | tokens = [] 33 | for w in words: 34 | try: 35 | ww = w 36 | num = int(float(ww)) 37 | if len(ww) < 2: 38 | tokens.append(ww) 39 | else: 40 | num = int(ww[0]) * (10**(len(str(num))-1)) 41 | tokens.append(str(num)) 42 | except: 43 | tokens.append(w) 44 | return tokens 45 | 46 | def tokenize(content): 47 | content = content.replace("\u0006",'').replace("\u0005",'').replace("\u0007",'') 48 | tokens = [] 49 | content = content.lower() 50 | # 去除重复字符 51 | content = re.sub('~+','~',content) 52 | content = re.sub('~+','~',content) 53 | content = re.sub('(\n)+','\n',content) 54 | for para in content.split('\n'): 55 | para_tokens = [] 56 | words = list(jieba.cut(para)) 57 | words = normalize_num(words) 58 | para_tokens.extend(words) 59 | para_tokens.append('') 60 | tokens.append(' '.join(para_tokens)) 61 | content = " ".join(tokens) 62 | content = re.sub('\s+',' ',content) 63 | content = re.sub('( )+',' ',content) 64 | content = re.sub('(- )+','- ',content) 65 | content = re.sub('(= )+','= ',content) 66 | content = re.sub('(\. )+','. 
',content).strip() 67 | content = replace_dish(content) 68 | if content.endswith(""): 69 | content = content[:-7] 70 | return content 71 | 72 | def create_vocab(data, vocab_file, vocab_size): 73 | print("# Start to create vocab ...") 74 | words = Counter() 75 | for item in data: 76 | words.update(item['content'].split()) 77 | special_tokens = ['','',''] 78 | with open(vocab_file,'w') as f: 79 | for w in special_tokens: 80 | f.write(w + '\n') 81 | for w,_ in words.most_common(vocab_size-len(special_tokens)): 82 | f.write(w + '\n') 83 | print("# Created vocab file {0} with vocab size {1}".format(vocab_file,vocab_size)) 84 | 85 | def process_data(output_file, data_file): 86 | data = [] 87 | with open(output_file,'w') as f: 88 | with open(data_file,encoding='utf-8-sig') as csvfile: 89 | reader = csv.DictReader(csvfile) 90 | for i,item in enumerate(reader): 91 | content = tokenize(item['content'].strip()[1:-1]) 92 | item['content'] = content 93 | f.write(json.dumps(item,ensure_ascii=False)+'\n') 94 | data.append(item) 95 | if (i+1) % 10000 == 0: 96 | print("# processed -- %d --"%(i+1)) 97 | return data 98 | 99 | def process_embedding(embedding_file, vocab_file, out_embedding_file): 100 | words = set([line.strip() for line in open(vocab_file)]) 101 | with open(out_embedding_file,'w') as f: 102 | for line in open(embedding_file): 103 | tokens = line.split() 104 | # skip the first line 105 | if len(tokens) == 2: 106 | continue 107 | word = tokens[0].lower() 108 | if word in words: 109 | f.write(word + ' ' + ' '.join(tokens[1:]) + '\n') 110 | 111 | if __name__ == "__main__": 112 | parser = argparse.ArgumentParser() 113 | add_arguments(parser) 114 | flags, unparsed = parser.parse_known_args() 115 | if flags.embedding: 116 | process_embedding(flags.data_file, flags.vocab_file, flags.output_file) 117 | else: 118 | if 'train' in flags.data_file: 119 | if flags.vocab_file is None: 120 | raise ValueError("Must provided a vocab file to save vocab") 121 | data = process_data(flags.output_file, flags.data_file) 122 | create_vocab(data,flags.vocab_file,flags.vocab_size) 123 | else: 124 | process_data(flags.output_file, flags.data_file) 125 | -------------------------------------------------------------------------------- /scripts/preprocess.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Modify the following values depend on your environment 4 | # Path to the csv files 5 | TRAIN_FILE=/data/xueyou/data/ai_challenger_sentiment/ai_challenger_sentiment_analysis_trainingset_20180816/sentiment_analysis_trainingset.csv 6 | VALIDATION_FILE=/data/xueyou/data/ai_challenger_sentiment/ai_challenger_sentiment_analysis_validationset_20180816/sentiment_analysis_validationset.csv 7 | TESTA_FILE=/data/xueyou/data/ai_challenger_sentiment/ai_challenger_sentiment_analysis_testa_20180816/sentiment_analysis_testa.csv 8 | TESTB_FILE=/data/xueyou/data/ai_challenger_sentiment/ai_challenger_sentimetn_analysis_testb_20180816/sentiment_analysis_testb.csv 9 | 10 | # Path to pretrained embedding file 11 | EMBEDDING_FILE=/data/xueyou/data/embedding/sgns.sogou.word 12 | 13 | VOCAB_SIZE=50000 14 | 15 | # Create a folder to save training files 16 | mkdir -p data 17 | 18 | echo 'Process training file ...' 19 | python data_preprocess.py \ 20 | --data_file=$TRAIN_FILE \ 21 | --output_file=data/train.json \ 22 | --vocab_file=data/vocab.txt \ 23 | --vocab_size=$VOCAB_SIZE 24 | 25 | echo 'Process validation file ...' 
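# Note: vocab.txt is only built while processing the training file (data_preprocess.py creates the vocab when 'train' appears in --data_file), so the validation/test runs below just tokenize and write JSON.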
26 | python data_preprocess.py \ 27 | --data_file=$VALIDATION_FILE \ 28 | --output_file=data/validation.json 29 | 30 | echo 'Process testa file ...' 31 | python data_preprocess.py \ 32 | --data_file=$TESTA_FILE \ 33 | --output_file=data/testa.json 34 | 35 | # Uncomment following code to get testb file 36 | # echo 'Process testb file ...' 37 | # python data_preprocess.py \ 38 | # --data_file=$TESTB_FILE \ 39 | # --output_file=data/testb.json 40 | 41 | echo 'Get pretrained embedding ...' 42 | python data_preprocess.py \ 43 | --data_file=$EMBEDDING_FILE \ 44 | --output_file=data/embedding.txt \ 45 | --vocab_file=data/vocab.txt \ 46 | --embedding=True 47 | 48 | echo "Get label file ..." 49 | cp ../labels.txt data/labels.txt -------------------------------------------------------------------------------- /scripts/readme.md: -------------------------------------------------------------------------------- 1 | ## Train Model from scratch with GPU 2 | 3 | This is a simple script used to run this code from scratch. Using the default settings, you can get macro f1 score 0.70769. 4 | 5 | > The data preprocess steps are no the same as the one I used during the competition, so the final f1 score may be not the same: 6 | 7 | Some major differences: 8 | - Jieba is used instead of LTP here 9 | - NER is not used here 10 | - No custom dictionary used here 11 | 12 | ### 1. Download raw data 13 | 14 | - Download data from this [data link](https://drive.google.com/file/d/1OInXRx_OmIJgK3ZdoFZnmqUi0rGfOaQo/view?usp=sharing). 15 | - Unzip the files to get the raw csv files. 16 | 17 | ### 2. Download pretrained embedding file 18 | 19 | - Download embedding file from this [embedding link](https://pan.baidu.com/s/1tUghuTno5yOvOx4LXA9-wg). 20 | - Unzip the file to get the embedding file. 21 | 22 | ### 3. Preprocess data 23 | 24 | Modify the file paths in preprocess.sh: 25 | - TRAIN_FILE 26 | - VALIDATION_FILE 27 | - TESTA_FILE 28 | - TESTB_FILE 29 | > Refer to 1 to get the data file path 30 | - EMBEDDING_FILE 31 | > Refer to 2 to get the embedding file path 32 | - VOCAB_SIZE 33 | > You can try different vocab size 34 | 35 | Then run 36 | 37 | ``` 38 | bash preprocess.sh 39 | ``` 40 | 41 | We will create all the files needed under ./data folder. 42 | 43 | ### 4. Run training 44 | 45 | Change your workdir to parent folder, and run the training scripts: 46 | 47 | ``` 48 | bash bash/elmo_train.sh 49 | ``` 50 | 51 | ### 5. 
Run inference 52 | 53 | After training, we can get the predicted results of test files: 54 | 55 | ``` 56 | bash bash/elmo_inference.sh 57 | ``` 58 | -------------------------------------------------------------------------------- /thrid_utils.py: -------------------------------------------------------------------------------- 1 | # ======================================== 2 | # Author: Xueyou Luo 3 | # Email: xueyou.luo@aidigger.com 4 | # Copyright: Eigen Tech @ 2018 5 | # ======================================== 6 | 7 | '''These codes are copied from eigen-tensorflow''' 8 | import codecs 9 | import csv 10 | import os 11 | 12 | import numpy as np 13 | import tensorflow as tf 14 | 15 | # If a vocab size is greater than this value, put the embedding on cpu instead 16 | VOCAB_SIZE_THRESHOLD_CPU = 30000 17 | 18 | def read_vocab(vocab_file): 19 | """read vocab from file 20 | 21 | Args: 22 | vocab_file ([type]): path to the vocab file, the vocab file should contains a word each line 23 | 24 | Returns: 25 | list of words 26 | """ 27 | 28 | if not os.path.isfile(vocab_file): 29 | raise ValueError("%s is not a vaild file"%vocab_file) 30 | 31 | vocab = [] 32 | word2id = {} 33 | with codecs.getreader("utf-8")(tf.gfile.GFile(vocab_file, "rb")) as f: 34 | for i,line in enumerate(f): 35 | word = line.strip() 36 | if not word: 37 | raise ValueError("Got empty word at line %d"%(i+1)) 38 | vocab.append(word) 39 | word2id[word] = len(word2id) 40 | 41 | print("# vocab size: ",len(vocab)) 42 | return vocab, word2id 43 | 44 | def load_embed_file(embed_file): 45 | """Load embed_file into a python dictionary. 46 | 47 | Note: the embed_file should be a Glove formated txt file. Assuming 48 | embed_size=5, for example: 49 | 50 | the -0.071549 0.093459 0.023738 -0.090339 0.056123 51 | to 0.57346 0.5417 -0.23477 -0.3624 0.4037 52 | and 0.20327 0.47348 0.050877 0.002103 0.060547 53 | 54 | Args: 55 | embed_file: file path to the embedding file. 56 | Returns: 57 | a dictionary that maps word to vector, and the size of embedding dimensions. 
58 | """ 59 | emb_dict = dict() 60 | emb_size = None 61 | with codecs.getreader("utf-8")(tf.gfile.GFile(embed_file, 'rb')) as f: 62 | for i,line in enumerate(f): 63 | tokens = line.strip().split(" ") 64 | word = tokens[0] 65 | vec = list(map(float, tokens[1:])) 66 | emb_dict[word] = vec 67 | if emb_size: 68 | assert emb_size == len( 69 | vec), "All embedding size should be same, but got {0} at line {1}".format(len(vec),i+1) 70 | else: 71 | emb_size = len(vec) 72 | return emb_dict, emb_size 73 | 74 | def embedding_dropout(embedding, dropout=0.1): 75 | vocab_size = tf.shape(embedding)[0] 76 | mask = tf.nn.dropout(tf.ones([vocab_size]),keep_prob=1-dropout) * (1-dropout) 77 | mask = tf.expand_dims(mask, 1) 78 | return mask * embedding 79 | 80 | def _get_embed_device(vocab_size): 81 | """Decide on which device to place an embed matrix given its vocab size.""" 82 | if vocab_size > VOCAB_SIZE_THRESHOLD_CPU: 83 | return "/cpu:0" 84 | else: 85 | return "/gpu:0" 86 | 87 | def _load_pretrained_emb_from_file(name, vocab_file, embed_file, num_trainable_tokens=0, dtype=tf.float32): 88 | print("# Start to load pretrained embedding...") 89 | vocab,_ = read_vocab(vocab_file) 90 | if num_trainable_tokens: 91 | trainable_tokens = vocab[:num_trainable_tokens] 92 | else: 93 | trainable_tokens = vocab 94 | 95 | emb_dict, emb_size = load_embed_file(embed_file) 96 | print("# pretrained embedding size",len(emb_dict),emb_size) 97 | 98 | for token in trainable_tokens: 99 | if token not in emb_dict: 100 | if '' in emb_dict: 101 | emb_dict[token] = emb_dict[''] 102 | else: 103 | emb_dict[token] = list(np.random.random(emb_size)) 104 | 105 | emb_mat = np.array([emb_dict[token] for token in vocab], dtype=dtype.as_numpy_dtype()) 106 | if num_trainable_tokens: 107 | emb_mat = tf.constant(emb_mat) 108 | emb_mat_const = tf.slice(emb_mat,[num_trainable_tokens,0],[-1,-1]) 109 | with tf.device(_get_embed_device(num_trainable_tokens)): 110 | emb_mat_var = tf.get_variable(name + "_emb_mat_var", [num_trainable_tokens, emb_size]) 111 | return tf.concat([emb_mat_var,emb_mat_const],0,name=name) 112 | else: 113 | with tf.device(_get_embed_device(len(vocab))): 114 | emb_mat_var = tf.get_variable(name,emb_mat.shape,initializer=tf.constant_initializer(emb_mat)) 115 | return emb_mat_var 116 | 117 | def create_embedding(name, vocab_size, embed_size, vocab_file=None, embed_file=None, num_trainable_tokens=0, dtype=tf.float32, scope=None): 118 | '''create a new embedding tensor or load from a pretrained embedding file 119 | 120 | Args: 121 | name: name of the embedding 122 | vocab_size : vocab size 123 | embed_size : embeddign size 124 | vocab_file ([type], optional): Defaults to None. vocab file 125 | embed_file ([type], optional): Defaults to None. 126 | num_trainable_tokens (int, optional): Defaults to 0. the number of tokens to be trained, if 0 then train all the tokens 127 | dtype ([type], optional): Defaults to tf.float32. [description] 128 | scope ([type], optional): Defaults to None. 
[description] 129 | 130 | Returns: 131 | embedding variable 132 | ''' 133 | 134 | with tf.variable_scope(scope or "embedding", dtype=dtype) as scope: 135 | if vocab_file and embed_file: 136 | embedding = _load_pretrained_emb_from_file(name, vocab_file, embed_file, num_trainable_tokens, dtype) 137 | else: 138 | with tf.device(_get_embed_device(vocab_size)): 139 | embedding = tf.get_variable(name,[vocab_size,embed_size],dtype) 140 | return embedding 141 | 142 | class DropConnectLayer(tf.layers.Dense): 143 | def __init__(self, units, 144 | mode=tf.estimator.ModeKeys.TRAIN, 145 | keep_prob=0.7, 146 | activation=None, 147 | use_bias=True, 148 | kernel_initializer=None, 149 | bias_initializer=tf.zeros_initializer(), 150 | kernel_regularizer=None, 151 | bias_regularizer=None, 152 | activity_regularizer=None, 153 | kernel_constraint=None, 154 | bias_constraint=None, 155 | trainable=True, 156 | name=None, 157 | **kwargs): 158 | super(DropConnectLayer,self).__init__( units, 159 | activation=activation, 160 | use_bias=use_bias, 161 | kernel_initializer=kernel_initializer, 162 | bias_initializer=bias_initializer, 163 | kernel_regularizer=kernel_regularizer, 164 | bias_regularizer=bias_regularizer, 165 | activity_regularizer=activity_regularizer, 166 | kernel_constraint=kernel_constraint, 167 | bias_constraint=bias_constraint, 168 | trainable=trainable, 169 | name=name, 170 | **kwargs) 171 | self.mode = mode 172 | self.keep_prob = keep_prob 173 | self.mask = None 174 | 175 | def build(self, input_shape): 176 | from tensorflow.python.layers import base 177 | from tensorflow.python.framework import tensor_shape 178 | input_shape = tensor_shape.TensorShape(input_shape) 179 | if input_shape[-1].value is None: 180 | raise ValueError('The last dimension of the inputs to `Dense` ' 181 | 'should be defined. Found `None`.') 182 | self.input_spec = base.InputSpec(min_ndim=2, 183 | axes={-1: input_shape[-1].value}) 184 | self.kernel = self.add_variable('kernel', 185 | shape=[input_shape[-1].value, self.units], 186 | initializer=self.kernel_initializer, 187 | regularizer=self.kernel_regularizer, 188 | constraint=self.kernel_constraint, 189 | dtype=self.dtype, 190 | trainable=True) 191 | if self.mode == tf.estimator.ModeKeys.TRAIN: 192 | if self.mask is None: 193 | mask = tf.ones_like(self.kernel) 194 | self.mask = tf.nn.dropout(mask, keep_prob=self.keep_prob) * self.keep_prob 195 | self.kernel = self.kernel * self.mask 196 | if self.use_bias: 197 | self.bias = self.add_variable('bias', 198 | shape=[self.units,], 199 | initializer=self.bias_initializer, 200 | regularizer=self.bias_regularizer, 201 | constraint=self.bias_constraint, 202 | dtype=self.dtype, 203 | trainable=True) 204 | else: 205 | self.bias = None 206 | self.built = True 207 | 208 | 209 | class WeightDropLSTMCell(tf.contrib.rnn.BasicLSTMCell): 210 | '''Apply dropout on hidden-to-hidden weights''' 211 | 212 | def __init__(self, num_units, weight_keep_drop=0.7, mode=tf.estimator.ModeKeys.TRAIN, 213 | forget_bias=1.0, state_is_tuple=True, activation=None, reuse=None): 214 | """Initialize the parameters for an LSTM cell. 
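Args:
  num_units: int, the number of units in the LSTM cell.
  weight_keep_drop: keep probability of the DropConnect mask applied to the hidden-to-hidden weights.
  mode: tf.estimator.ModeKeys; the weight mask is only applied in TRAIN mode.
  forget_bias: float, bias added to the forget gate.
  state_is_tuple: if True, the state is an LSTMStateTuple of (c, h).
  activation: activation function of the inner states.
  reuse: whether to reuse variables in an existing scope.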
215 | """ 216 | super(WeightDropLSTMCell,self).__init__( num_units, forget_bias, state_is_tuple, activation, reuse) 217 | self.w_layer = tf.layers.Dense(4 * num_units) 218 | self.h_layer = DropConnectLayer(4 * num_units, mode, weight_keep_drop, use_bias=False) 219 | 220 | def build(self, inputs_shape): 221 | # compatible with tf-1.5 222 | self.built = True 223 | 224 | def call(self, inputs, state): 225 | """Long short-term memory cell (LSTM). 226 | Args: 227 | inputs: `2-D` tensor with shape `[batch_size x input_size]`. 228 | state: An `LSTMStateTuple` of state tensors, each shaped 229 | `[batch_size x self.state_size]`, if `state_is_tuple` has been set to 230 | `True`. Otherwise, a `Tensor` shaped 231 | `[batch_size x 2 * self.state_size]`. 232 | Returns: 233 | A pair containing the new hidden state, and the new state (either a 234 | `LSTMStateTuple` or a concatenated state, depending on 235 | `state_is_tuple`). 236 | """ 237 | sigmoid = tf.sigmoid 238 | # Parameters of gates are concatenated into one multiply for efficiency. 239 | if self._state_is_tuple: 240 | c, h = state 241 | else: 242 | c, h = tf.split(value=state, num_or_size_splits=2, axis=1) 243 | 244 | # W * x + b 245 | inputs = self.w_layer(inputs) 246 | # U * h(t-1) 247 | h = self.h_layer(h) 248 | 249 | # i = input_gate, j = new_input, f = forget_gate, o = output_gate 250 | i, j, f, o = tf.split( 251 | value=inputs + h, num_or_size_splits=4, axis=1) 252 | 253 | new_c = ( 254 | c * sigmoid(f + self._forget_bias) + sigmoid(i) * self._activation(j)) 255 | new_h = self._activation(new_c) * sigmoid(o) 256 | 257 | if self._state_is_tuple: 258 | new_state = tf.contrib.rnn.LSTMStateTuple(new_c, new_h) 259 | else: 260 | new_state = tf.concat([new_c, new_h], 1) 261 | return new_h, new_state 262 | -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | # ======================================== 2 | # Author: Xueyou Luo 3 | # Email: xueyou.luo@aidigger.com 4 | # Copyright: Eigen Tech @ 2018 5 | # ======================================== 6 | import codecs 7 | import json 8 | import os 9 | import sys 10 | 11 | import numpy as np 12 | import tensorflow as tf 13 | 14 | 15 | def print_out(s, f=None, new_line=True): 16 | """Similar to print but with support to flush and output to a file.""" 17 | if isinstance(s, bytes): 18 | s = s.decode("utf-8") 19 | 20 | if f: 21 | f.write(s.encode("utf-8")) 22 | if new_line: 23 | f.write(b"\n") 24 | 25 | # stdout 26 | out_s = s.encode("utf-8") 27 | if not isinstance(out_s, str): 28 | out_s = out_s.decode("utf-8") 29 | 30 | print(out_s, end="", file=sys.stdout) 31 | 32 | if new_line: 33 | sys.stdout.write("\n") 34 | sys.stdout.flush() 35 | 36 | def _reverse(input_, seq_lengths, seq_dim, batch_dim): 37 | if seq_lengths is not None: 38 | return tf.reverse_sequence( 39 | input=input_, seq_lengths=seq_lengths, 40 | seq_dim=seq_dim, batch_dim=batch_dim) 41 | else: 42 | return tf.reverse(input_, axis=[seq_dim]) 43 | 44 | def gelu(input_tensor): 45 | """Gaussian Error Linear Unit. 46 | This is a smoother version of the RELU. 47 | Original paper: https://arxiv.org/abs/1606.08415 48 | Args: 49 | input_tensor: float Tensor to perform activation. 50 | Returns: 51 | `input_tensor` with the GELU activation applied. 
52 | """ 53 | cdf = 0.5 * (1.0 + tf.erf(input_tensor / tf.sqrt(2.0))) 54 | return input_tensor * cdf 55 | 56 | def single_rnn_cell(cell_name, num_units, train_phase=True, keep_prob=0.75, weight_keep_drop=0.65, variational_dropout = False): 57 | """ 58 | Get a single rnn cell 59 | """ 60 | cell_name = cell_name.upper() 61 | if cell_name == "GRU": 62 | cell = tf.contrib.rnn.GRUCell(num_units) 63 | elif cell_name == "LSTM": 64 | cell = tf.contrib.rnn.LSTMCell(num_units) 65 | elif cell_name == 'block_lstm'.upper(): 66 | cell = tf.contrib.rnn.LSTMBlockCell(num_units) 67 | elif cell_name == 'WEIGHT_LSTM': 68 | from thrid_utils import WeightDropLSTMCell 69 | cell = WeightDropLSTMCell(num_units,weight_keep_drop=weight_keep_drop,mode=tf.estimator.ModeKeys.TRAIN if train_phase and weight_keep_drop<1.0 else tf.estimator.ModeKeys.PREDICT) 70 | elif cell_name == 'LAYERNORM_LSTM': 71 | cell = tf.contrib.rnn.LayerNormBasicLSTMCell(num_units) 72 | else: 73 | cell = tf.contrib.rnn.BasicRNNCell(num_units) 74 | 75 | # dropout wrapper 76 | if train_phase: 77 | # TODO: variational_recurrent=True and input_keep_prob < 1 then we need provide input_size 78 | # But because we use different size in different layers, we will got shape in-compatible error 79 | # So I just set input_keep_prob to 1.0 when we use variational dropout to avoid this error for now. 80 | cell = tf.contrib.rnn.DropoutWrapper( 81 | cell=cell, 82 | input_keep_prob=keep_prob if not variational_dropout else 1.0, 83 | output_keep_prob=keep_prob, 84 | variational_recurrent=variational_dropout, 85 | dtype=tf.float32) 86 | 87 | return cell 88 | 89 | def focal_loss(labels, logits, gamma=2): 90 | epsilon = 1.e-9 91 | y_pred = tf.nn.softmax(logits,dim=-1) 92 | y_pred = y_pred + epsilon # to avoid 0.0 in log 93 | L = -labels*tf.pow((1-y_pred),gamma)*tf.log(y_pred) 94 | L = tf.reduce_sum(L) 95 | batch_size = tf.shape(labels)[0] 96 | return L / tf.to_float(batch_size) 97 | 98 | def get_total_param_num(params, threshold = 1): 99 | total_parameters = 0 100 | #iterating over all variables 101 | for variable in params: 102 | local_parameters=1 103 | shape = variable.get_shape() #getting shape of a variable 104 | for i in shape: 105 | local_parameters*=i.value #mutiplying dimension values 106 | if local_parameters >= threshold: 107 | print("variable {0} with parameter number {1}".format(variable, local_parameters)) 108 | total_parameters+=local_parameters 109 | print('# total parameter number',total_parameters) 110 | return total_parameters 111 | 112 | def cal_f1(label_num,predicted,truth): 113 | results = [] 114 | for i in range(label_num): 115 | results.append({"TP": 0, "FP": 0, "FN": 0, "TN": 0}) 116 | 117 | for i, p in enumerate(predicted): 118 | t = truth[i] 119 | for j in range(label_num): 120 | if p[j] == 1: 121 | if t[j] == 1: 122 | results[j]['TP'] += 1 123 | else: 124 | results[j]['FP'] += 1 125 | else: 126 | if t[j] == 1: 127 | results[j]['FN'] += 1 128 | else: 129 | results[j]['TN'] += 1 130 | 131 | precision = [0.0] * label_num 132 | recall = [0.0] * label_num 133 | f1 = [0.0] * label_num 134 | for i in range(label_num): 135 | if results[i]['TP'] == 0: 136 | if results[i]['FP']==0 and results[i]['FN']==0: 137 | precision[i] = 1.0 138 | recall[i] = 1.0 139 | f1[i] = 1.0 140 | else: 141 | precision[i] = 0.0 142 | recall[i] = 0.0 143 | f1[i] = 0.0 144 | else: 145 | precision[i] = results[i]['TP'] / (results[i]['TP'] + results[i]['FP']) 146 | recall[i] = results[i]['TP'] / (results[i]['TP'] + results[i]['FN']) 147 | f1[i] = 2 * precision[i] * recall[i] 
/ (precision[i] + recall[i]) 148 | 149 | # for i in range(label_num): 150 | # print(i,results[i], precision[i], recall[i], f1[i]) 151 | return sum(f1)/label_num, sum(precision)/label_num, sum(recall)/label_num 152 | 153 | 154 | def load_hparams(out_dir, overidded = None): 155 | hparams_file = os.path.join(out_dir,"hparams") 156 | print("loading hparams from %s" % hparams_file) 157 | hparams_json = json.load(open(hparams_file)) 158 | hparams = tf.contrib.training.HParams() 159 | for k,v in hparams_json.items(): 160 | hparams.add_hparam(k,v) 161 | if overidded: 162 | for k,v in overidded.items(): 163 | if k not in hparams_json: 164 | hparams.add_hparam(k,v) 165 | else: 166 | hparams.set_hparam(k,v) 167 | return hparams 168 | 169 | def save_hparams(out_dir, hparams): 170 | """Save hparams.""" 171 | if not os.path.isdir(out_dir): 172 | os.mkdir(out_dir) 173 | hparams_file = os.path.join(out_dir, "hparams") 174 | print(" saving hparams to %s" % hparams_file) 175 | with codecs.getwriter("utf-8")(tf.gfile.GFile(hparams_file, "wb")) as f: 176 | f.write(hparams.to_json()) 177 | 178 | def get_config_proto(log_device_placement=True, allow_soft_placement=True, 179 | num_intra_threads=0, num_inter_threads=0, per_process_gpu_memory_fraction=0.95, allow_growth=True): 180 | # GPU options: 181 | # https://www.tensorflow.org/versions/r0.10/how_tos/using_gpu/index.html 182 | config_proto = tf.ConfigProto( 183 | log_device_placement=log_device_placement, 184 | allow_soft_placement=allow_soft_placement) 185 | config_proto.gpu_options.allow_growth = allow_growth 186 | config_proto.gpu_options.per_process_gpu_memory_fraction = per_process_gpu_memory_fraction 187 | # CPU threads options 188 | if num_intra_threads: 189 | config_proto.intra_op_parallelism_threads = num_intra_threads 190 | if num_inter_threads: 191 | config_proto.inter_op_parallelism_threads = num_inter_threads 192 | 193 | return config_proto 194 | 195 | def early_stop(values, no_decrease=3): 196 | if len(values) < 2: 197 | return False 198 | best_index = np.argmin(values) 199 | if values[-1] > values[best_index] and (best_index + no_decrease) <= len(values): 200 | return True 201 | else: 202 | return False 203 | 204 | def gl_stop(values, alpha=5): 205 | if len(values) < 2: 206 | return False 207 | best = -1 * min(values) 208 | current = -1 * values[-1] 209 | if 100 * ( 1 - (current / best) ) > alpha: 210 | return True 211 | else: 212 | return False --------------------------------------------------------------------------------