├── .idea
│   ├── dictionaries
│   │   └── xuliang.xml
│   └── vcs.xml
├── BERT_BASE_DIR
│   └── readMe.md
├── PRE_TRAIN_DIR
│   └── readMe.md
├── README.md
├── README_bert_chinese_tutorial.md
├── TEXT_DIR
│   ├── dev2.tsv
│   ├── readMe.md
│   └── train2.tsv
├── __init__.py
├── ai_challenger_sentiment_analysis_testa_20180816
│   ├── README.txt
│   └── protocol.txt
├── ai_challenger_sentiment_analysis_trainingset_20180816
│   ├── README.txt
│   └── protocol.txt
├── ai_challenger_sentiment_analysis_validationset_20180816
│   ├── README.txt
│   └── protocol.txt
├── ai_challenger_sentimetn_analysis_testb_20180816
│   ├── README.txt
│   └── protocol.txt
├── bigru_char_checkpoint
│   └── README.txt
├── create_pretraining_data.py
├── data
│   └── img
│       ├── bert_sa.jpg
│       ├── bert_sentiment_analysis.jpg
│       └── fine_grain.jpg
├── data_util_hdf5.py
├── evaluation_matrix.py
├── model
│   ├── __init__.py
│   ├── base_model.py
│   ├── bert_cnn_fine_grain_model.py
│   ├── bert_cnn_model.py
│   ├── bert_model.py
│   ├── bert_modeling.py
│   ├── config.py
│   ├── config_transformer.py
│   ├── encoder.py
│   ├── layer_norm_residual_conn.py
│   ├── multi_head_attention.py
│   ├── optimization.py
│   ├── poistion_wise_feed_forward.py
│   └── transfomer_model.py
├── old
│   ├── JoinAttLayer.py
│   ├── Preprocess_char_old.ipynb
│   ├── classifier_bigru.py
│   ├── classifier_capsule.py
│   ├── classifier_rcnn.py
│   ├── evaluate_char.py
│   ├── model_bigru_char.py
│   ├── model_capsule_char.py
│   ├── model_rcnn_char.py
│   ├── predict_bigru_char.py
│   ├── predict_rcnn_char.py
│   ├── rcnn_retrain.py
│   ├── stopwords.txt
│   ├── temp_covert.py
│   ├── train_transform.py
│   ├── validation_bigru_char.py
│   └── validation_rcnn_char.py
├── preprocess_char.ipynb
├── preprocess_char
│   └── README.txt
├── preprocess_word.ipynb
├── pretrain_task.py
├── run_classifier_multi_labels_bert.py
├── run_pretraining.py
├── tokenization.py
├── tokenizer_char.pickle
├── train_bert_fine_tuning.py
├── train_cnn_fine_grain.py
├── train_cnn_lm.py
└── word2vec
    └── README.txt
/BERT_BASE_DIR/readMe.md:
--------------------------------------------------------------------------------
1 | You need to download the pre-trained model from Google and put it into a folder (e.g. BERT_BASE_DIR).
--------------------------------------------------------------------------------
/PRE_TRAIN_DIR/readMe.md:
--------------------------------------------------------------------------------
1 | # You can put the pre-training files in a directory, e.g. "PRE_TRAIN_DIR".
2 | # Input file format:
3 | # (1) One sentence per line. These should ideally be actual sentences, not
4 | # entire paragraphs or arbitrary spans of text. (Because we use the
5 | # sentence boundaries for the "next sentence prediction" task).
6 | # (2) Blank lines between documents. Document boundaries are needed so
7 | # that the "next sentence prediction" task doesn't span between documents.
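
A hypothetical example of this layout (the sentences below are made up; note one sentence per line and a blank line between documents):

```
这家店的环境很好。
服务也很周到。

菜的味道一般。
价格比较实惠。
```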
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Introduction
2 |
3 | With this repository, you will be able to train a multi-label classification model with BERT,
4 |
5 | and deploy BERT for online prediction.
6 |
7 | You can also find a short tutorial on how to use BERT with Chinese: BERT short chinese tutorial
8 |
9 | You can find an introduction to fine-grained sentiment analysis from AI Challenger
10 |
11 | ## Basic Ideas
12 |
13 | Add something here.
14 |
15 |
16 | ## Experiment on New Models
17 |
18 |
19 |
20 | For more, check model/bert_cnn_fine_grain_model.py.
21 |
22 | ## Performance
23 |
24 | Model | TextCNN(No-pretrain) | TextCNN(Pretrain-Finetuning) | Bert(base_model_zh) | Bert(base_model_zh, pre-train on corpus)
25 | --- | --- | --- | --- | ---
26 | F1 Score | 0.678 | 0.685 | ADD A NUMBER HERE | ADD A NUMBER HERE
27 |
28 |
29 | Note: the F1 score is reported on the validation set.
30 |
31 |
32 |
33 | ## Usage
34 |
35 | ### BERT for Multi-label Classification [data for fine-tuning and pre-train]
36 |
37 | export BERT_BASE_DIR=BERT_BASE_DIR/chinese_L-12_H-768_A-12
38 | export TEXT_DIR=TEXT_DIR
39 | nohup python run_classifier_multi_labels_bert.py \
40 | --task_name=sentiment_analysis \
41 | --do_train=true \
42 | --do_eval=true \
43 | --data_dir=$TEXT_DIR \
44 | --vocab_file=$BERT_BASE_DIR/vocab.txt \
45 | --bert_config_file=$BERT_BASE_DIR/bert_config.json \
46 | --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
47 | --max_seq_length=512 \
48 | --train_batch_size=4 \
49 | --learning_rate=2e-5 \
50 | --num_train_epochs=3 \
51 | --output_dir=./checkpoint_bert &
52 |
53 | 1. First, you need to download the pre-trained model from Google and put it into a folder (e.g. BERT_BASE_DIR):
54 |
55 | chinese_L-12_H-768_A-12 from bert
56 |
57 | 2. Second, you need training data (e.g. train.tsv) and validation data (e.g. dev.tsv), and to put them under a
58 |
59 | folder (e.g. TEXT_DIR). You can also download data from here: data to train bert for AI challenger-Sentiment Analysis.
60 |
61 | It contains processed data you can use both for fine-tuning on sentiment analysis and for pre-training with BERT.
62 |
63 | It was generated by following this notebook step by step:
64 |
65 | preprocess_char.ipynb
66 |
67 | You can also generate the data yourself, as long as the data format is compatible with the
68 |
69 | processor SentimentAnalysisFineGrainProcessor (alias: sentiment_analysis).
70 |
71 |
72 | data format: label1,label2,label3\t here is sentence or sentences\t
73 |
74 | It contains only two columns: the first is the target (one or more labels), the second is the input string.
75 |
76 | No tokenization is needed.
77 |
78 | sample:"0_1,1_-2,2_-2,3_-2,4_1,5_-2,6_-2,7_-2,8_1,9_1,10_-2,11_-2,12_-2,13_-2,14_-2,15_1,16_-2,17_-2,18_0,19_-2 浦东五莲路站,老饭店福瑞轩属于上海的本帮菜,交通方便,最近又重新装修,来拨草了,饭店活动满188元送50元钱,环境干净,简单。朋友提前一天来预订包房也没有订到,只有大堂,五点半到店基本上每个台子都客满了,都是附近居民,每道冷菜量都比以前小,味道还可以,热菜烤茄子,炒河虾仁,脆皮鸭,照牌鸡,小牛排,手撕腊味花菜等每道菜都很入味好吃,会员价划算,服务员人手太少,服务态度好,要能团购更好。可以用支付宝方便"
79 |
80 | Check the sample data in the ./BERT_BASE_DIR folder.
81 |
82 | For more detail, check create_model and SentimentAnalysisFineGrainProcessor in run_classifier.py.
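
As a rough illustration only (the real reading logic lives in SentimentAnalysisFineGrainProcessor; the file name below is just an example), a line of this two-column TSV can be split like this:

```python
# Minimal sketch, not the repo's actual reader: split one line of the
# two-column TSV described above into its label list and input text.
def parse_example(line):
    labels_str, text = line.rstrip("\t\n").split("\t", 1)  # columns are separated by \t
    labels = labels_str.split(",")                          # e.g. ["0_1", "1_-2", ..., "19_-2"]
    return labels, text

# Example path; use whatever file you placed under TEXT_DIR.
with open("TEXT_DIR/train2.tsv", encoding="utf-8") as f:
    for line in f:
        labels, text = parse_example(line)
```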
83 |
84 | ### Pre-train the BERT model based on the open-sourced model, then do the classification task
85 |
86 | 1. generate raw data: [ADD SOMETHING HERE]
87 |
88 | Make sure each line is a sentence, and that there is a blank line between documents.
89 |
90 | You can find the generated data in the zip file.
91 |
92 | Use write_pre_train_doc() from preprocess_char.ipynb (a minimal sketch of the expected layout follows this list).
93 |
94 | 2. generate data for the pre-training stage using:
95 |
96 | export BERT_BASE_DIR=./BERT_BASE_DIR/chinese_L-12_H-768_A-12
97 | nohup python create_pretraining_data.py \
98 | --input_file=./PRE_TRAIN_DIR/bert_*_pretrain.txt \
99 | --output_file=./PRE_TRAIN_DIR/tf_examples.tfrecord \
100 | --vocab_file=$BERT_BASE_DIR/vocab.txt \
101 | --do_lower_case=True \
102 | --max_seq_length=512 \
103 | --max_predictions_per_seq=60 \
104 | --masked_lm_prob=0.15 \
105 | --random_seed=12345 \
106 | --dupe_factor=5 > nohup_pre.out &
107 |
108 | 3. pre-train the model with the generated data:
109 |
110 | python run_pretraining.py
111 |
112 | 4. fine-tuning
113 |
114 | python run_classifier.py
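
The repository's own helper for step 1 is write_pre_train_doc() in preprocess_char.ipynb; the sketch below is only a hypothetical stand-in showing the expected layout of the raw file (one sentence per line, a blank line between documents), with a made-up output name that matches the bert_*_pretrain.txt pattern used above.

```python
# Hypothetical sketch (not the repo's write_pre_train_doc): write documents as
# one sentence per line, with a blank line marking each document boundary.
def write_pretrain_file(documents, path="PRE_TRAIN_DIR/bert_demo_pretrain.txt"):
    with open(path, "w", encoding="utf-8") as f:
        for doc in documents:                 # documents: list of lists of sentences
            for sentence in doc:
                f.write(sentence.strip() + "\n")
            f.write("\n")                     # document boundary

write_pretrain_file([["第一篇文档的第一句。", "第一篇文档的第二句。"],
                     ["第二篇文档只有一句。"]])
```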
115 |
116 | ### TextCNN
117 |
118 | 1. download the cache file for sentiment analysis (tokens are at the word level)
119 |
120 | 2. train the model:
121 |
122 | python train_cnn_fine_grain.py
123 |
124 |
125 | The cache file for the TextCNN model was generated by following the steps in preprocess_word.ipynb.
126 |
127 | It contains everything you need to run TextCNN.
128 |
129 | It includes: the processed train/validation/test sets, the word vocabulary, and a dict mapping labels to indices.
130 |
131 | Take train_valid_test_vocab_cache.pik and put it under the preprocess_word/ folder.
132 |
133 | Raw data is also included in this zip file.
134 |
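As a hedged sketch only (the exact object layout inside the pickle is an assumption; preprocess_word.ipynb is the authoritative reference), loading the cache could look like:

```python
import pickle

# Hypothetical sketch: the cache is said to hold the processed train/validation/test
# sets, the word vocabulary and a label-to-index dict; inspect it to see the exact layout.
with open("preprocess_word/train_valid_test_vocab_cache.pik", "rb") as f:
    cache = pickle.load(f)
print(type(cache), len(cache) if hasattr(cache, "__len__") else "")
```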
135 |
136 | ### Pre-train TextCNN
137 |
138 | 1. pre-train TextCNN with a masked language model
139 |
140 | python train_cnn_lm.py
141 |
142 | 2. fine-tune TextCNN
143 |
144 | python train_cnn_fine_grain.py
145 |
146 | ### Deploy BERT for online prediction
147 |
148 | With the session-and-feed style, you can easily deploy BERT.
149 |
150 | For online prediction with BERT, check more from here.
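
As a rough sketch only (assuming model/bert_modeling.py mirrors Google's modeling.py; the placeholder names, paths, and the missing classification layer on top of the pooled output are all assumptions), session-and-feed prediction looks roughly like:

```python
import tensorflow as tf
from model import bert_modeling as modeling   # assumed to mirror google-research/bert modeling.py

# Build the graph once with placeholders, restore the fine-tuned checkpoint,
# then feed token ids per request (session + feed_dict style).
bert_config = modeling.BertConfig.from_json_file("BERT_BASE_DIR/chinese_L-12_H-768_A-12/bert_config.json")
max_seq_length = 512

input_ids = tf.placeholder(tf.int32, [None, max_seq_length], name="input_ids")
input_mask = tf.placeholder(tf.int32, [None, max_seq_length], name="input_mask")
segment_ids = tf.placeholder(tf.int32, [None, max_seq_length], name="segment_ids")

model = modeling.BertModel(config=bert_config, is_training=False,
                           input_ids=input_ids, input_mask=input_mask,
                           token_type_ids=segment_ids, use_one_hot_embeddings=False)
pooled_output = model.get_pooled_output()      # [batch_size, hidden_size]; add your classification layer on top

saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint("./checkpoint_bert"))  # example output_dir from above
    # In a real service, ids/mask/segments come from tokenization.FullTokenizer on the request text.
    dummy = [[0] * max_seq_length]
    print(sess.run(pooled_output, feed_dict={input_ids: dummy, input_mask: dummy, segment_ids: dummy}).shape)
```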
151 |
152 |
153 | ## Reference
154 |
155 | 1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
156 |
157 | 2. google-research/bert
158 |
159 | 3. pengshuang/AI-Comp
160 |
161 | 4. AI Challenger 2018
162 |
163 | 5. Convolutional Neural Networks for Sentence Classification
--------------------------------------------------------------------------------
/README_bert_chinese_tutorial.md:
--------------------------------------------------------------------------------
1 | [Using the pre-trained Chinese BERT model: a minimal tutorial]
2 |
3 | [Using the provided datasets]
4 |
5 | 1. Download the model and the datasets: https://github.com/google-research/bert
6 |
7 | 2. Run run_classifier.py with the appropriate arguments.
8 |
9 | [How to use a custom dataset?]
10 |
11 | 1. Add a Processor to run_classifier.py that tells it how to read the inputs and labels, and register this processor in processors in main (a hedged sketch follows at the end of this tutorial).
12 |
13 | 2. Put your dataset into the designated directory. Each line is one example, containing the input and the label separated by "\t".
14 |
15 | 3. Run run_classifier.py with the appropriate arguments.
16 |
17 | [Using the BERT model in session-feed style; online prediction with BERT]
18 |
19 | Online prediction with BERT: a concise example
20 |
21 | [Currently supported task types]
22 |
23 | 1. Text classification (binary or multi-class);
24 |
25 | 2. Sentence pair classification (two sentences in, one label out)
26 |
27 | 3. Text classification (multi-label classification)
28 |
29 | For multi-label tasks with BERT (e.g. the AI Challenger sentiment analysis task), see run_classifier_multi_labels_bert.py.
30 |
31 | [Pre-training on top of the Chinese BERT model, then fine-tuning]
32 |
33 | 1. Generate the files needed for pre-training: one sentence per line, with a blank line between documents.
34 | 2. Generate the pre-training corpus in tf.record format:
35 | create_pretraining_data.py
36 | 3. Pre-train with the generated data; an initial checkpoint can be specified:
37 | run_pretraining.py
38 | 4. Fine-tune:
39 | run_classifier.py
40 |
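A hedged sketch of the custom Processor described above, assuming Google's run_classifier.py (or this repo's run_classifier_multi_labels_bert.py) is importable; the file names, label set, and column order are assumptions to adapt to your own data:

```python
import os
from run_classifier import DataProcessor, InputExample  # assumed importable; adjust to your copy

class MyProcessor(DataProcessor):
    """Reads "input \t label" lines, as described in step 2 above."""

    def get_train_examples(self, data_dir):
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        return ["0", "1"]  # replace with your own label set

    def _create_examples(self, lines, set_type):
        examples = []
        for i, line in enumerate(lines):
            text, label = line[0], line[1]
            examples.append(InputExample(guid="%s-%d" % (set_type, i),
                                         text_a=text, text_b=None, label=label))
        return examples

# Then register it in main() of run_classifier.py:
# processors = {..., "my_task": MyProcessor}
```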
--------------------------------------------------------------------------------
/TEXT_DIR/readMe.md:
--------------------------------------------------------------------------------
1 | You need training data (e.g. train.tsv) and validation data (e.g. dev.tsv); put them under a folder (e.g. TEXT_DIR).
2 |
3 | You can generate the data yourself, as long as the data format is compatible with the processor
4 |
5 | SentimentAnalysisFineGrainProcessor (alias: sentiment_analysis).
6 |
7 | You can also download data from here: AI challenger-sentiment analysis, which was generated by following
8 |
9 | preprocess_char.ipynb step by step.
10 |
11 |
12 | data format: label1,label2,label3\t here is sentence or sentences\t
13 |
14 | It contains only two columns: the first is the target (one or more labels), the second is the input string.
15 |
16 | No tokenization is needed.
17 |
18 | sample:"0_1,1_-2,2_-2,3_-2,4_1,5_-2,6_-2,7_-2,8_1,9_1,10_-2,11_-2,12_-2,13_-2,14_-2,15_1,16_-2,17_-2,18_0,19_-2 浦东五莲路站,老饭店福瑞轩属于上海的本帮菜,交通方便,最近又重新装修,来拨草了,饭店活动满188元送50元钱,环境干净,简单。朋友提前一天来预订包房也没有订到,只有大堂,五点半到店基本上每个台子都客满了,都是附近居民,每道冷菜量都比以前小,味道还可以,热菜烤茄子,炒河虾仁,脆皮鸭,照牌鸡,小牛排,手撕腊味花菜等每道菜都很入味好吃,会员价划算,服务员人手太少,服务态度好,要能团购更好。可以用支付宝方便"
19 |
20 | Check the sample data in the ./BERT_BASE_DIR folder.
21 |
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brightmart/sentiment_analysis_fine_grain/bad7c61a2610eec614e1b42b07cadb1ec57a2ef7/__init__.py
--------------------------------------------------------------------------------
/ai_challenger_sentiment_analysis_testa_20180816/README.txt:
--------------------------------------------------------------------------------
1 | sentiment_analysis_testa.csv is the test set A data file, containing 15000 review records
2 | protocol.txt is the dataset download agreement
3 |
--------------------------------------------------------------------------------
/ai_challenger_sentiment_analysis_testa_20180816/protocol.txt:
--------------------------------------------------------------------------------
1 | Dataset Download Agreement
2 |
3 | You (hereinafter the "Researcher") are requesting that the organizer grant you the right to access, download and use the dataset (hereinafter the "Dataset"), referred to below as the "Authorization". As a condition of obtaining this Authorization, you agree to the following terms:
4 |
5 | 1. The Researcher agrees to use the Dataset only for non-commercial scientific research or classroom teaching purposes, and shall not use the Dataset for any commercial purpose;
6 | 2. We do not hold the intellectual property rights to the images, audio, text or other content used in the Dataset, and make no warranty regarding such content, including but not limited to non-infringement of third-party intellectual property rights or fitness for any particular purpose;
7 | 3. We assume no liability for any form of loss or harm caused by use of the Dataset, and accept no responsibility for any legal consequences arising from use of the competition data;
8 | 4. Any legal liability related to use of the Dataset is borne by the Researcher; if use of the Dataset by the Researcher or the Researcher's employees, agents or affiliates causes us reputational or economic damage, the Researcher shall be liable for compensation;
9 | 5. The Researcher may authorize assistants, colleagues or other collaborators to access and use the Dataset, but shall ensure that such persons have carefully read and agreed to be bound by this agreement;
10 | 6. If the Researcher is employed by a for-profit commercial entity, the Researcher shall ensure that the Dataset is used only for non-commercial purposes and that the employer is likewise bound by this agreement; the Researcher confirms having obtained full authorization from the employer before entering into this agreement.
11 | 7. We have the right to cancel or withdraw the Researcher's Authorization to use the Dataset at any time, and to require the Researcher to delete any downloaded copies of the Dataset;
12 | 8. Any dispute arising from or in connection with this agreement shall be submitted to the China International Economic and Trade Arbitration Commission for arbitration in accordance with its arbitration rules in effect at the time of application, with the law of the People's Republic of China applying. The language of arbitration shall be Chinese.
13 |
--------------------------------------------------------------------------------
/ai_challenger_sentiment_analysis_trainingset_20180816/README.txt:
--------------------------------------------------------------------------------
1 | sentiment_analysis_trainingset.csv is the training set data file, containing 105000 review records
2 | sentiment_analysis_trainingset_annotations.docx is the data annotation guideline
3 | protocol.txt is the dataset download agreement
4 |
--------------------------------------------------------------------------------
/ai_challenger_sentiment_analysis_trainingset_20180816/protocol.txt:
--------------------------------------------------------------------------------
1 | Dataset Download Agreement
2 |
3 | You (hereinafter the "Researcher") are requesting that the organizer grant you the right to access, download and use the dataset (hereinafter the "Dataset"), referred to below as the "Authorization". As a condition of obtaining this Authorization, you agree to the following terms:
4 |
5 | 1. The Researcher agrees to use the Dataset only for non-commercial scientific research or classroom teaching purposes, and shall not use the Dataset for any commercial purpose;
6 | 2. We do not hold the intellectual property rights to the images, audio, text or other content used in the Dataset, and make no warranty regarding such content, including but not limited to non-infringement of third-party intellectual property rights or fitness for any particular purpose;
7 | 3. We assume no liability for any form of loss or harm caused by use of the Dataset, and accept no responsibility for any legal consequences arising from use of the competition data;
8 | 4. Any legal liability related to use of the Dataset is borne by the Researcher; if use of the Dataset by the Researcher or the Researcher's employees, agents or affiliates causes us reputational or economic damage, the Researcher shall be liable for compensation;
9 | 5. The Researcher may authorize assistants, colleagues or other collaborators to access and use the Dataset, but shall ensure that such persons have carefully read and agreed to be bound by this agreement;
10 | 6. If the Researcher is employed by a for-profit commercial entity, the Researcher shall ensure that the Dataset is used only for non-commercial purposes and that the employer is likewise bound by this agreement; the Researcher confirms having obtained full authorization from the employer before entering into this agreement.
11 | 7. We have the right to cancel or withdraw the Researcher's Authorization to use the Dataset at any time, and to require the Researcher to delete any downloaded copies of the Dataset;
12 | 8. Any dispute arising from or in connection with this agreement shall be submitted to the China International Economic and Trade Arbitration Commission for arbitration in accordance with its arbitration rules in effect at the time of application, with the law of the People's Republic of China applying. The language of arbitration shall be Chinese.
13 |
--------------------------------------------------------------------------------
/ai_challenger_sentiment_analysis_validationset_20180816/README.txt:
--------------------------------------------------------------------------------
1 | sentiment_analysis_validationset.csv is the validation set data file, containing 15000 review records
2 | sentiment_analysis_validationset_annotations.docx is the data annotation guideline
3 | protocol.txt is the dataset download agreement
4 |
--------------------------------------------------------------------------------
/ai_challenger_sentiment_analysis_validationset_20180816/protocol.txt:
--------------------------------------------------------------------------------
1 | Dataset Download Agreement
2 |
3 | You (hereinafter the "Researcher") are requesting that the organizer grant you the right to access, download and use the dataset (hereinafter the "Dataset"), referred to below as the "Authorization". As a condition of obtaining this Authorization, you agree to the following terms:
4 |
5 | 1. The Researcher agrees to use the Dataset only for non-commercial scientific research or classroom teaching purposes, and shall not use the Dataset for any commercial purpose;
6 | 2. We do not hold the intellectual property rights to the images, audio, text or other content used in the Dataset, and make no warranty regarding such content, including but not limited to non-infringement of third-party intellectual property rights or fitness for any particular purpose;
7 | 3. We assume no liability for any form of loss or harm caused by use of the Dataset, and accept no responsibility for any legal consequences arising from use of the competition data;
8 | 4. Any legal liability related to use of the Dataset is borne by the Researcher; if use of the Dataset by the Researcher or the Researcher's employees, agents or affiliates causes us reputational or economic damage, the Researcher shall be liable for compensation;
9 | 5. The Researcher may authorize assistants, colleagues or other collaborators to access and use the Dataset, but shall ensure that such persons have carefully read and agreed to be bound by this agreement;
10 | 6. If the Researcher is employed by a for-profit commercial entity, the Researcher shall ensure that the Dataset is used only for non-commercial purposes and that the employer is likewise bound by this agreement; the Researcher confirms having obtained full authorization from the employer before entering into this agreement.
11 | 7. We have the right to cancel or withdraw the Researcher's Authorization to use the Dataset at any time, and to require the Researcher to delete any downloaded copies of the Dataset;
12 | 8. Any dispute arising from or in connection with this agreement shall be submitted to the China International Economic and Trade Arbitration Commission for arbitration in accordance with its arbitration rules in effect at the time of application, with the law of the People's Republic of China applying. The language of arbitration shall be Chinese.
13 |
--------------------------------------------------------------------------------
/ai_challenger_sentimetn_analysis_testb_20180816/README.txt:
--------------------------------------------------------------------------------
1 | sentiment_analysis_testb.csv is the test set B data file, containing 200000 review records
2 | protocol.txt is the dataset download agreement
3 |
--------------------------------------------------------------------------------
/ai_challenger_sentimetn_analysis_testb_20180816/protocol.txt:
--------------------------------------------------------------------------------
1 | Dataset Download Agreement
2 |
3 | You (hereinafter the "Researcher") are requesting that the organizer grant you the right to access, download and use the dataset (hereinafter the "Dataset"), referred to below as the "Authorization". As a condition of obtaining this Authorization, you agree to the following terms:
4 |
5 | 1. The Researcher agrees to use the Dataset only for non-commercial scientific research or classroom teaching purposes, and shall not use the Dataset for any commercial purpose;
6 | 2. We do not hold the intellectual property rights to the images, audio, text or other content used in the Dataset, and make no warranty regarding such content, including but not limited to non-infringement of third-party intellectual property rights or fitness for any particular purpose;
7 | 3. We assume no liability for any form of loss or harm caused by use of the Dataset, and accept no responsibility for any legal consequences arising from use of the competition data;
8 | 4. Any legal liability related to use of the Dataset is borne by the Researcher; if use of the Dataset by the Researcher or the Researcher's employees, agents or affiliates causes us reputational or economic damage, the Researcher shall be liable for compensation;
9 | 5. The Researcher may authorize assistants, colleagues or other collaborators to access and use the Dataset, but shall ensure that such persons have carefully read and agreed to be bound by this agreement;
10 | 6. If the Researcher is employed by a for-profit commercial entity, the Researcher shall ensure that the Dataset is used only for non-commercial purposes and that the employer is likewise bound by this agreement; the Researcher confirms having obtained full authorization from the employer before entering into this agreement.
11 | 7. We have the right to cancel or withdraw the Researcher's Authorization to use the Dataset at any time, and to require the Researcher to delete any downloaded copies of the Dataset;
12 | 8. Any dispute arising from or in connection with this agreement shall be submitted to the China International Economic and Trade Arbitration Commission for arbitration in accordance with its arbitration rules in effect at the time of application, with the law of the People's Republic of China applying. The language of arbitration shall be Chinese.
13 |
--------------------------------------------------------------------------------
/bigru_char_checkpoint/README.txt:
--------------------------------------------------------------------------------
1 | The checkpoint of bigru_char will be saved here.
--------------------------------------------------------------------------------
/data/img/bert_sa.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brightmart/sentiment_analysis_fine_grain/bad7c61a2610eec614e1b42b07cadb1ec57a2ef7/data/img/bert_sa.jpg
--------------------------------------------------------------------------------
/data/img/bert_sentiment_analysis.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brightmart/sentiment_analysis_fine_grain/bad7c61a2610eec614e1b42b07cadb1ec57a2ef7/data/img/bert_sentiment_analysis.jpg
--------------------------------------------------------------------------------
/data/img/fine_grain.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brightmart/sentiment_analysis_fine_grain/bad7c61a2610eec614e1b42b07cadb1ec57a2ef7/data/img/fine_grain.jpg
--------------------------------------------------------------------------------
/evaluation_matrix.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import numpy as np
3 | import random
4 | import codecs
5 | """
6 | compute single evaluation metrics for task1, task2 and task3:
7 | compute f1 score (micro, macro) for accusation & relevant article, and score for penalty
8 | """
9 |
10 | small_value=0.00001
11 | random_number=1000
12 | def compute_confuse_matrix_batch(y_targetlabel_list,y_logits_array,label_dict,name='default'):
13 | """
14 | compute confuse matrix for a batch
15 | :param y_targetlabel_list: a list; each element is a multi-hot list, e.g. [1,0,0,1,...]
16 | :param y_logits_array: a 2-d array. [batch_size,num_class]
17 | :param label_dict:{label:(TP, FP, FN)}
18 | :param name: a string for debug purpose
19 | :return:label_dict:{label:(TP, FP, FN)}
20 | """
21 | for i,y_targetlabel_list_single in enumerate(y_targetlabel_list):
22 | label_dict=compute_confuse_matrix(y_targetlabel_list_single,y_logits_array[i],label_dict,name=name)
23 | return label_dict
24 |
25 | def compute_confuse_matrix(y_targetlabel_list_single,y_logit_array_single,label_dict,name='default'):
26 | """
27 | compute true positive(TP), false positive(FP), false negative(FN) given the target labels and predicted labels
28 | :param y_targetlabel_list_single: a multi-hot list, like '[0,0,1,0,1,...]'
29 | :param y_logit_array_single: a numpy array. shape is:[num_classes]
30 | :param label_dict: {label:(TP,FP,FN)}
31 | :return: label_dict: {label:(TP,FP,FN)}, updated with the counts of this example
32 | """
33 | #1.get target label and predict label
34 | y_target_labels=get_target_label_short(y_targetlabel_list_single) #e.g. y_targetlabel_list[0]=[2,12,88]
35 | #y_logit=y_logit_array_single #y_logit_array[0] #[202]
36 | y_predict_labels=[i for i in range(len(y_logit_array_single)) if y_logit_array_single[i]>=0.50] #TODO 0.5PW e.g.[2,12,13,10]
37 | if len(y_predict_labels) < 1: y_predict_labels = [np.argmax(y_logit_array_single)]
38 |
39 | #if len(y_predict_labels)<1: y_predict_labels=[np.argmax(y_logit_array_single)] #TODO ADD 2018.05.29
40 | if random.choice([x for x in range(random_number)]) ==1:print(name+".y_target_labels:",y_target_labels,";y_predict_labels:",y_predict_labels) #debug purpose
41 |
42 | #2.count number of TP,FP,FN for each class
43 | y_labels_unique=[]
44 | y_labels_unique.extend(y_target_labels)
45 | y_labels_unique.extend(y_predict_labels)
46 | y_labels_unique=list(set(y_labels_unique))
47 | for i,label in enumerate(y_labels_unique): #e.g. label=2
48 | TP, FP, FN = label_dict[label]
49 | if label in y_predict_labels and label in y_target_labels:#predict=1,truth=1 (TP)
50 | TP=TP+1
51 | elif label in y_predict_labels and label not in y_target_labels:#predict=1,truth=0(FP)
52 | FP=FP+1
53 | elif label not in y_predict_labels and label in y_target_labels:#predict=0,truth=1(FN)
54 | FN=FN+1
55 | label_dict[label] = (TP, FP, FN)
56 | return label_dict
57 |
58 |
59 | def compute_penalty_score_batch(target_deaths, predict_deaths,target_lifeimprisons, predict_lifeimprisons,target_imprsions, predict_imprisons):
60 | """
61 | compute penalty score(task 3) for a batch.
62 | :param target_deaths: a list. each element is a multi-hot list
63 | :param predict_deaths: a 2-d array. [batch_size,num_class]
64 | :param target_lifeimprisons: a list. each element is a multi-hot list
65 | :param predict_lifeimprisons: a 2-d array. [batch_size,num_class]
66 | :param target_imprsions: a list. each element is a multi-hot list
67 | :param predict_imprisons: a 2-d array. [batch_size,num_class]
68 | :return: score_batch: a scalar, average score for that batch
69 | """
70 | length=len(target_deaths)
71 | score_total=0.0
72 | for i in range(length):
73 | score=compute_penalty_score(target_deaths[i], predict_deaths[i], target_lifeimprisons[i],predict_lifeimprisons[i],target_imprsions[i], predict_imprisons[i])
74 | score_total=score_total+score
75 | score_batch=score_total/float(length)
76 | return score_batch
77 |
78 | def compute_penalty_score(target_death, predict_death,target_lifeimprison, predict_lifeimprison,target_imprsion, predict_imprison):
79 | """
80 | compute penalty score(task 3) for a single data
81 | :param target_death: a multi-hot list. e.g. [1,0,0,1,...]
82 | :param predict_death: [num_class]
83 | :param target_lifeimprison: a multi-hot list. e.g. [1,0,0,1,...]
84 | :param predict_lifeimprison: [num_class]
85 | :param target_imprsion: a multi-hot list. e.g. [1,0,0,1,...]
86 | :param predict_imprison: [num_class]
87 | :return: score: a scalar,score for this data
88 | """
89 | score_death=compute_death_lifeimprisonment_score(target_death, predict_death)
90 | score_lifeimprisonment=compute_death_lifeimprisonment_score(target_lifeimprison, predict_lifeimprison)
91 | score_imprisonment=compute_imprisonment_score(target_imprsion, predict_imprison)
92 | score=((score_death+score_lifeimprisonment+score_imprisonment)/3.0)*(100.0)
93 | return score
94 |
95 | def compute_death_lifeimprisonment_score(target,predict):
96 | """
97 | compute score for death or life imprisonment
98 | :param target: a list
99 | :param predict: an array
100 | :return: score: a scalar
101 | """
102 |
103 | score=0.0
104 | target=np.argmax(target)
105 | predict=np.argmax(predict)
106 | if random.choice([x for x in range(random_number)]) == 1:print("death_lifeimprisonment_score.target:", target, ";predict:", predict)
107 | if target==predict:
108 | score=1.0
109 | if random.choice([x for x in range(random_number)]) == 1:print("death_lifeimprisonment_score:",score)
110 | return score
111 |
112 | def compute_imprisonment_score(target_value,predict_value):
113 | """
114 | compute imprisonment score
115 | :param target_value: a scalar
116 | :param predict_value:a scalar
117 | :return: score: a scalar
118 | """
119 | if random.choice([x for x in range(random_number)]) ==1:print("x.imprisonment_score.target_value:",target_value,";predict_value:",predict_value)
120 | score=0.0
121 | v=np.abs(np.log(predict_value+1.0)-np.log(target_value+1.0))
122 | if v<=0.2:
123 | score=1.0
124 | elif v<=0.4:
125 | score=0.8
126 | elif v<=0.6:
127 | score=0.6
128 | elif v<=0.8:
129 | score=0.4
130 | elif v<=1.0:
131 | score=0.2
132 | else:
133 | score=0.0
134 | if random.choice([x for x in range(random_number)]) ==1:print("imprisonment_score:",score)
135 | return score
136 |
137 | def compute_micro_macro(label_dict):
138 | """
139 | compute f1 of micro and macro
140 | :param label_dict:
141 | :return: f1_micro,f1_macro: scalar, scalar
142 | """
143 | f1_micro = compute_f1_micro_use_TFFPFN(label_dict)
144 | f1_macro= compute_f1_macro_use_TFFPFN(label_dict)
145 | return f1_micro,f1_macro
146 |
147 | def compute_f1_micro_use_TFFPFN(label_dict):
148 | """
149 | compute f1_micro
150 | :param label_dict: {label:(TP,FP,FN)}
151 | :return: f1_micro: a scalar
152 | """
153 | TF_micro_accusation, FP_micro_accusation, FN_micro_accusation =compute_TF_FP_FN_micro(label_dict)
154 | f1_micro_accusation = compute_f1(TF_micro_accusation, FP_micro_accusation, FN_micro_accusation,'micro')
155 | return f1_micro_accusation
156 |
157 | def compute_f1_macro_use_TFFPFN(label_dict):
158 | """
159 | compute f1_macro
160 | :param label_dict: {label:(TP,FP,FN)}
161 | :return: f1_macro
162 | """
163 | f1_dict= {}
164 | num_classes=len(label_dict)
165 | for label, tuplee in label_dict.items():
166 | TP,FP,FN=tuplee
167 | f1_score_onelabel=compute_f1(TP,FP,FN,'macro')
168 | f1_dict[label]=f1_score_onelabel
169 | f1_score_sum=0.0
170 | for label,f1_score in f1_dict.items():
171 | f1_score_sum=f1_score_sum+f1_score
172 | f1_score=f1_score_sum/float(num_classes)
173 | return f1_score
174 |
175 | #[this function is for debug purpose only]
176 | def compute_f1_score_write_for_debug(label_dict,label2index):
177 | """
178 | compute f1 score. basically you can also use other functions to get the result
179 | :param label_dict: {label:(TP,FP,FN)}
180 | :return: a dict. key is label name, value is f1 score.
181 | """
182 | f1score_dict={}
183 | # 1. compute f1 score for each accusation.
184 | for label, tuplee in label_dict.items():
185 | TP, FP, FN = tuplee
186 | f1_score_single = compute_f1(TP, FP, FN, 'normal_f1_score')
187 | accusation_index2label = {kv[1]: kv[0] for kv in label2index.items()}
188 | label_name=accusation_index2label[label]
189 | f1score_dict[label_name]=f1_score_single
190 |
191 | # 2. write each score to the file system for debug purposes.
192 | f1score_file='debug_accuracy.txt'
193 | write_object = codecs.open(f1score_file, mode='a', encoding='utf-8')
194 | write_object.write("\n\n")
195 |
196 | #tuple_list = sorted(f1score_dict.items(), lambda x, y: cmp(x[1], y[1]), reverse=False)
197 | tuple_list = sorted(f1score_dict.items(), key=lambda x: x[1], reverse=False)
198 |
199 | for tuplee in tuple_list:
200 | label_name,f1_score=tuplee
201 | write_object.write(label_name+":"+str(f1_score)+"\n")
202 | write_object.close()
203 | return f1score_dict
204 |
205 | def compute_f1(TP,FP,FN,compute_type):
206 | """
207 | compute f1
208 | :param TP: a number, e.g. 200
209 | :param FP: a number, e.g. 200
210 | :param FN: a number, e.g. 200
211 | :return: f1_score: a scalar
212 | """
213 | precison=TP/(TP+FP+small_value)
214 | recall=TP/(TP+FN+small_value)
215 | f1_score=(2*precison*recall)/(precison+recall+small_value)
216 |
217 | if random.choice([x for x in range(500)]) == 1:print(compute_type,"precison:",str(precison),";recall:",str(recall),";f1_score:",f1_score)
218 |
219 | return f1_score
220 |
221 | def compute_TF_FP_FN_micro(label_dict):
222 | """
223 | compute micro TP,FP,FN
224 | :param label_dict: a dict. {label:(TP, FP, FN)}
225 | :return: TP_micro,FP_micro,FN_micro
226 | """
227 | TP_micro,FP_micro,FN_micro=0.0,0.0,0.0
228 | for label,tuplee in label_dict.items():
229 | TP,FP,FN=tuplee
230 | TP_micro=TP_micro+TP
231 | FP_micro=FP_micro+FP
232 | FN_micro=FN_micro+FN
233 | return TP_micro,FP_micro,FN_micro
234 |
235 | def init_label_dict(num_classes):
236 | """
237 | init label dict. this dict will be used to save TP,FP,FN
238 | :param num_classes:
239 | :return: label_dict: a dict. {label_index:(0,0,0)}
240 | """
241 | label_dict={}
242 | for i in range(num_classes):
243 | label_dict[i]=(0,0,0)
244 | return label_dict
245 |
246 | def get_target_label_short(y_mulitihot):
247 | """
248 | get the indices of the target labels.
249 | :param y_mulitihot: a multi-hot list, e.g. [0,0,1,0,1,0,...]
250 | :return: a list of label indices, e.g. [3,5,100]
251 | """
252 | taget_list = [];
253 | for i, element in enumerate(y_mulitihot):
254 | if element == 1:
255 | taget_list.append(i)
256 | return taget_list
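
# ---------------------------------------------------------------------------
# Usage sketch added for illustration (the numbers below are made up):
# build the per-label TP/FP/FN table with init_label_dict and
# compute_confuse_matrix_batch, then reduce it to micro/macro F1.
if __name__ == "__main__":
    num_classes = 4
    label_dict = init_label_dict(num_classes)                    # {label_index: (TP, FP, FN)}
    y_target = [[0, 1, 0, 1], [1, 0, 0, 0]]                      # multi-hot targets
    y_logits = np.array([[0.1, 0.9, 0.2, 0.7],                   # sigmoid outputs, threshold 0.5
                         [0.8, 0.3, 0.1, 0.2]])
    label_dict = compute_confuse_matrix_batch(y_target, y_logits, label_dict, name="demo")
    f1_micro, f1_macro = compute_micro_macro(label_dict)
    print("f1_micro:", f1_micro, ";f1_macro:", f1_macro)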
--------------------------------------------------------------------------------
/model/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brightmart/sentiment_analysis_fine_grain/bad7c61a2610eec614e1b42b07cadb1ec57a2ef7/model/__init__.py
--------------------------------------------------------------------------------
/model/base_model.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import tensorflow as tf
3 | from model.multi_head_attention import MultiHeadAttention
4 | from model.poistion_wise_feed_forward import PositionWiseFeedFoward
5 | from model.layer_norm_residual_conn import LayerNormResidualConnection
6 | class BaseClass(object):
7 | """
8 | base class has some common fields and functions.
9 | """
10 | def __init__(self,d_model,d_k,d_v,sequence_length,h,batch_size,num_layer=6,decoder_sent_length=None):
11 | """
12 | :param d_model:
13 | :param d_k:
14 | :param d_v:
15 | :param sequence_length:
16 | :param h:
17 | :param batch_size:
18 | :param embedded_words: shape:[batch_size,sequence_length,embed_size]
19 | """
20 | self.d_model=d_model
21 | self.d_k=d_k
22 | self.d_v=d_v
23 | self.sequence_length=sequence_length
24 | self.h=h
25 | self.num_layer=num_layer
26 | self.batch_size=batch_size
27 | self.decoder_sent_length=decoder_sent_length
28 |
29 | def sub_layer_postion_wise_feed_forward(self, x, layer_index) :# COMMON FUNCTION
30 | """
31 | position-wise feed forward. you can implement it as feed forward network, or two layers of CNN.
32 | :param x: shape should be:[batch_size,sequence_length,d_model]
33 | :param layer_index: index of layer number
34 | :return: [batch_size,sequence_length,d_model]
35 | """
36 | # use variable scope here with input of layer index, to make sure each layer has different parameters.
37 | with tf.variable_scope("sub_layer_postion_wise_feed_forward" + str(layer_index)):
38 | postion_wise_feed_forward = PositionWiseFeedFoward(x, layer_index,d_model=self.d_model,d_ff=self.d_model*4)
39 | postion_wise_feed_forward_output = postion_wise_feed_forward.position_wise_feed_forward_fn()
40 | return postion_wise_feed_forward_output
41 |
42 | def sub_layer_multi_head_attention(self ,layer_index ,Q ,K_s,V_s,mask=None,is_training=None,dropout_keep_prob=0.9) :# COMMON FUNCTION
43 | """
44 | multi head attention as sub layer
45 | :param layer_index: index of layer number
46 | :param Q: shape should be: [batch_size,sequence_length,embed_size]
47 | :param K_s: shape should be: [batch_size,sequence_length,embed_size]
48 | :param mask: when a mask is used, illegal connections are masked with a huge negative value, so their attention probability becomes zero.
49 | :return: output of multi head attention.shape:[batch_size,sequence_length,d_model]
50 | """
51 | #print("sub_layer_multi_head_attention.",";layer_index:",layer_index)
52 | with tf.variable_scope("base_mode_sub_layer_multi_head_attention_" +str(layer_index)):
53 | #2. call function of multi head attention to get result
54 | multi_head_attention_class = MultiHeadAttention(Q, K_s, V_s, self.d_model, self.d_k, self.d_v, self.sequence_length,self.h,
55 | is_training=is_training,mask=mask,dropout_rate=(1.0-dropout_keep_prob))
56 | sub_layer_multi_head_attention_output = multi_head_attention_class.multi_head_attention_fn() # [batch_size*sequence_length,d_model]
57 | return sub_layer_multi_head_attention_output # [batch_size,sequence_length,d_model]
58 |
59 | def sub_layer_layer_norm_residual_connection(self,layer_input ,layer_output,layer_index,dropout_keep_prob=0.9,use_residual_conn=True,sub_layer_name='layer1'): # COMMON FUNCTION
60 | """
61 | layer norm & residual connection
62 | :param layer_input: [batch_size,sequence_length,d_model]
63 | :param layer_output: [batch_size,sequence_length,d_model]
64 | :return:
65 | """
66 | #print("sub_layer_layer_norm_residual_connection.layer_input:",layer_input,";layer_output:",layer_output,";dropout_keep_prob:",dropout_keep_prob)
67 | #assert layer_input.get_shape().as_list()==layer_output.get_shape().as_list()
68 | #layer_output_new= layer_input+ layer_output
69 | variable_scope="sub_layer_layer_norm_residual_connection_" +str(layer_index)+'_'+sub_layer_name
70 | #print("######sub_layer_layer_norm_residual_connection.variable_scope:",variable_scope)
71 | with tf.variable_scope(variable_scope):
72 | layer_norm_residual_conn=LayerNormResidualConnection(layer_input,layer_output,layer_index,residual_dropout=(1-dropout_keep_prob),use_residual_conn=use_residual_conn)
73 | output = layer_norm_residual_conn.layer_norm_residual_connection()
74 | return output # [batch_size,sequence_length,d_model]
--------------------------------------------------------------------------------
/model/config.py:
--------------------------------------------------------------------------------
1 |
2 | class Config:
3 | def __init__(self):
4 | self.learning_rate=0.0003
5 | self.num_classes = 2
6 | self.batch_size = 64
7 | self.sequence_length = 100
8 | self.vocab_size = 50000
9 |
10 | self.d_model =512
11 | self.num_layer=6
12 | self.h=8
13 | self.d_k=64
14 | self.d_v=64
15 |
16 | self.clip_gradients = 5.0
17 | self.decay_steps = 1000
18 | self.decay_rate = 0.9
19 | self.dropout_keep_prob = 0.9
20 | self.ckpt_dir = 'checkpoint/dummy_test/'
21 | self.is_training=True
22 | self.is_pretrain=True
23 | self.num_classes_lm=self.vocab_size
24 |
--------------------------------------------------------------------------------
/model/config_transformer.py:
--------------------------------------------------------------------------------
1 | class Config:
2 | def __init__(self):
3 | self.learning_rate = 0.0003
4 | self.num_classes = 2
5 | self.batch_size = 64
6 | self.sequence_length = 100
7 | self.vocab_size = 50000
8 |
9 | self.d_model = 512
10 | self.num_layer = 6
11 | self.h = 8
12 | self.d_k = 64
13 | self.d_v = 64
14 |
15 | self.clip_gradients = 5.0
16 | self.decay_steps = 1000
17 | self.decay_rate = 0.9
18 | self.dropout_keep_prob = 0.9
19 | self.ckpt_dir = 'checkpoint/dummy_test/'
20 | self.is_training = True
21 |
--------------------------------------------------------------------------------
/model/encoder.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | encoder for the transformer:
4 | 6 layers. each layer has two sub-layers:
5 | the first is a multi-head self-attention mechanism;
6 | the second is a position-wise fully connected feed-forward network.
7 | for each sub-layer, use LayerNorm(x+Sublayer(x)). all dimensions=512.
8 | """
9 | import tensorflow as tf
10 | from model.base_model import BaseClass
11 | import time
12 | class Encoder(BaseClass):
13 | def __init__(self,d_model,d_k,d_v,sequence_length,h,batch_size,num_layer,Q,K_s,mask=None,dropout_keep_prob=0.9,use_residual_conn=True):
14 | """
15 | :param d_model:
16 | :param d_k:
17 | :param d_v:
18 | :param sequence_length:
19 | :param h:
20 | :param batch_size:
21 | :param embedded_words: shape:[batch_size*sequence_length,embed_size]
22 | """
23 | super(Encoder, self).__init__(d_model,d_k,d_v,sequence_length,h,batch_size,num_layer=num_layer)
24 | self.Q=Q
25 | self.K_s=K_s
26 | self.mask=mask
27 | self.initializer = tf.random_normal_initializer(stddev=0.1)
28 | self.dropout_keep_prob=dropout_keep_prob
29 | self.use_residual_conn=use_residual_conn
30 |
31 | def encoder_fn(self):
32 | """
33 | use transformer encoder to encode the input, output a sequence. input: [batch_size,sequence_length,d_embedding]
34 | :return: output:[batch_size*sequence_length,d_model]
35 | """
36 | start = time.time()
37 | #print("encoder_fn.started.")
38 | x=self.Q
39 | for layer_index in range(self.num_layer):
40 | x=self.encoder_single_layer(x,x,x,layer_index) # Q,K_s,V_s
41 | #print("encoder_fn.",layer_index,".x:",x)
42 | end = time.time()
43 | #print("encoder_fn.ended.x:",x)
44 | #print("time spent:",(end-start))
45 | return x
46 |
47 | def encoder_single_layer(self,Q,K_s,V_s,layer_index):
48 | """
49 | a single encoder layer. each layer has two sub-layers:
50 | the first is a multi-head self-attention mechanism; the second is a position-wise fully connected feed-forward network.
51 | for each sub-layer, use LayerNorm(x+Sublayer(x)). input and output of the last dimension: d_model
52 | :param Q: shape should be: [batch_size,sequence_length,d_model]
53 | :param K_s: shape should be: [batch_size,sequence_length,d_model]
54 | :return:output: shape should be: [batch_size,sequence_length,d_model]
55 | """
56 | #1.1 the first is multi-head self-attention mechanism
57 | multi_head_attention_output=self.sub_layer_multi_head_attention(layer_index,Q,K_s,V_s,mask=self.mask,dropout_keep_prob=self.dropout_keep_prob) #[batch_size,sequence_length,d_model]
58 | #1.2 use LayerNorm(x+Sublayer(x)). all dimension=512.
59 | multi_head_attention_output=self.sub_layer_layer_norm_residual_connection(K_s,multi_head_attention_output,layer_index,
60 | dropout_keep_prob=self.dropout_keep_prob,use_residual_conn=self.use_residual_conn,sub_layer_name='layer1')
61 | #2.1 the second is position-wise fully connected feed-forward network.
62 | postion_wise_feed_forward_output=self.sub_layer_postion_wise_feed_forward(multi_head_attention_output,layer_index)
63 | #2.2 use LayerNorm(x+Sublayer(x)). all dimension=512.
64 | postion_wise_feed_forward_output= self.sub_layer_layer_norm_residual_connection(multi_head_attention_output,postion_wise_feed_forward_output,layer_index,
65 | dropout_keep_prob=self.dropout_keep_prob,sub_layer_name='layer2')
66 | return postion_wise_feed_forward_output #,postion_wise_feed_forward_output
67 |
68 |
69 | def init():
70 | #1. assign value to fields
71 | vocab_size=1000
72 | d_model = 512
73 | d_k = 64
74 | d_v = 64
75 | sequence_length = 5*10
76 | h = 8
77 | batch_size=4*32
78 | initializer = tf.random_normal_initializer(stddev=0.1)
79 | # 2.set values for Q,K,V
80 | vocab_size=1000
81 | embed_size=d_model
82 | Embedding = tf.get_variable("Embedding_E", shape=[vocab_size, embed_size],initializer=initializer)
83 | input_x = tf.placeholder(tf.int32, [batch_size,sequence_length], name="input_x") #[4,10]
84 | print("input_x:",input_x)
85 | embedded_words = tf.nn.embedding_lookup(Embedding, input_x) #[batch_size*sequence_length,embed_size]
86 | Q = embedded_words # [batch_size*sequence_length,embed_size]
87 | K_s = embedded_words # [batch_size*sequence_length,embed_size]
88 | V_s = embedded_words # [batch_size*sequence_length,embed_size]
89 | num_layer=6
90 | mask = get_mask(batch_size, sequence_length)
91 | #3. get class object
92 | encoder_class=Encoder(d_model,d_k,d_v,sequence_length,h,batch_size,num_layer,Q,K_s,mask=mask) #Q,K_s,embedded_words
93 | return encoder_class,Q,K_s,V_s
94 |
95 | def get_mask(batch_size,sequence_length):
96 | lower_triangle=tf.matrix_band_part(tf.ones([sequence_length,sequence_length]),-1,0)
97 | result=-1e9*(1.0-lower_triangle)
98 | print("get_mask==>result:",result)
99 | return result
100 |
101 | def test_postion_wise_feed_forward(encoder_class,x,layer_index):
102 | sub_layer_postion_wise_feed_forward_output=encoder_class.sub_layer_postion_wise_feed_forward(x, layer_index)
103 | return sub_layer_postion_wise_feed_forward_output
104 |
105 | def test_sub_layer_multi_head_attention(encoder_class,index_layer,Q,K_s,V_s):
106 | sub_layer_multi_head_attention_output=encoder_class.sub_layer_multi_head_attention(index_layer,Q,K_s,V_s)
107 | return sub_layer_multi_head_attention_output
108 |
109 |
110 | encoder_class,Q,K_s,V_s=init()
111 |
112 |
113 | # below are 4 callable tests: from the sub (small) functions up to the whole encoder function.
114 |
115 | def test():
116 | #1.test 1: for sub layer of multi head attention
117 | index_layer=0
118 | #sub_layer_multi_head_attention_output=test_sub_layer_multi_head_attention(encoder_class,index_layer,Q,K_s,V_s)
119 | #print("sub_layer_multi_head_attention_output1:",sub_layer_multi_head_attention_output)
120 |
121 | #2. test 2: for sub layer of multi head attention with poistion-wise feed forward
122 | #d1,d2,d3=sub_layer_multi_head_attention_output.get_shape().as_list()
123 | #print("d1:",d1,";d2:",d2,";d3:",d3)
124 | #postion_wise_ff_input=sub_layer_multi_head_attention_output #tf.reshape(sub_layer_multi_head_attention_output,shape=[-1,d3])
125 | #print("sub_layer_postion_wise_feed_forward_input:",postion_wise_ff_input)
126 | #sub_layer_postion_wise_feed_forward_output=test_postion_wise_feed_forward(encoder_class,postion_wise_ff_input,index_layer)
127 | #sub_layer_postion_wise_feed_forward_output=tf.reshape(sub_layer_postion_wise_feed_forward_output,shape=(d1,d2,d3))
128 | #print("sub_layer_postion_wise_feed_forward_output:",sub_layer_postion_wise_feed_forward_output)
129 | #3.test 3: test for single layer of encoder
130 | #encoder_class.encoder_single_layer(Q,K_s,V_s,index_layer)
131 | #4.test 4: test for encoder. with N layers
132 |
133 | representation = encoder_class.encoder_fn()
134 | print("representation:",representation)
135 |
136 | # test()
--------------------------------------------------------------------------------
/model/layer_norm_residual_conn.py:
--------------------------------------------------------------------------------
1 | import tensorflow as tf
2 | import time
3 | """
4 | We employ a residual connection around each of the two sub-layers, followed by layer normalization.
5 | That is, the output of each sub-layer is LayerNorm(x+ Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. """
6 | class LayerNormResidualConnection(object):
7 | def __init__(self,x,y,layer_index,residual_dropout=0.1,use_residual_conn=True):
8 | self.x=x
9 | self.y=y
10 | self.layer_index=layer_index
11 | self.residual_dropout=residual_dropout
12 | #print("LayerNormResidualConnection.residual_dropout:",self.residual_dropout)
13 | self.use_residual_conn=use_residual_conn
14 |
15 | #call residual connection and layer normalization
16 | def layer_norm_residual_connection(self):
17 | #print("LayerNormResidualConnection.use_residual_conn:",self.use_residual_conn)
18 | if self.use_residual_conn: # todo previously it is removed in a classification task, may be because result become not stable
19 | x_residual=self.residual_connection()
20 | x_layer_norm=self.layer_normalization(x_residual)
21 | else:
22 | x_layer_norm = self.layer_normalization(self.x)
23 | return x_layer_norm
24 |
25 | def residual_connection(self):
26 | output=self.x + tf.nn.dropout(self.y, 1.0 - self.residual_dropout)
27 | return output
28 |
29 | # layer normalize the tensor x, averaging over the last dimension.
30 | def layer_normalization(self,x):
31 | """
32 | x should be:[batch_size,sequence_length,d_model]
33 | :return:
34 | """
35 | filter=x.get_shape()[-1] #last dimension of x. e.g. 512
36 | #print("layer_normalization:==================>variable_scope:","layer_normalization"+str(self.layer_index))
37 | with tf.variable_scope("layer_normalization"+str(self.layer_index)):
38 | # 1. normalize input by using mean and variance according to last dimension
39 | mean=tf.reduce_mean(x,axis=-1,keepdims=True) #[batch_size,sequence_length,1]
40 | variance=tf.reduce_mean(tf.square(x-mean),axis=-1,keepdims=True) #[batch_size,sequence_length,1]
41 | norm_x=(x-mean)*tf.rsqrt(variance+1e-6) #[batch_size,sequence_length,d_model]
42 | # 2. re-scale normalized input back
43 | scale=tf.get_variable("layer_norm_scale",[filter],initializer=tf.ones_initializer) #[filter]
44 | bias=tf.get_variable("layer_norm_bias",[filter],initializer=tf.ones_initializer) #[filter]
45 | output=norm_x*scale+bias #[batch_size,sequence_length,d_model]
46 | return output #[batch_size,sequence_length,d_model]
47 |
48 | def test():
49 | start = time.time()
50 |
51 | batch_size=128
52 | sequence_length=1000
53 | d_model=512
54 | x=tf.ones((batch_size,sequence_length,d_model))
55 | y=x*3-0.5
56 | layer_norm_residual_conn=LayerNormResidualConnection(x,y,0)
57 | output=layer_norm_residual_conn.layer_norm_residual_connection()
58 |
59 | end = time.time()
60 | print("x:",x,";y:",y)
61 | print("output:",output,";time spent:",(end-start))
62 |
63 | #test()
--------------------------------------------------------------------------------
/model/multi_head_attention.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #test self-attention
3 | import tensorflow as tf
4 | import time
5 | """
6 | multi head attention.
7 | 1.linearly project the queries,keys and values h times(with different,learned linear projections to d_k,d_k,d_v dimensions)
8 | 2.scaled dot product attention for each projected version of Q,K,V
9 | 3.concatenated result
10 | 4.linear projection to get final result
11 | three kinds of usage:
12 | 1. attention for encoder
13 | 2. attention for decoder(need a mask to pay attention for only known position)
14 | 3. attention as bridge of encoder and decoder
15 | """
16 | class MultiHeadAttention(object):
17 | """ multi head attention"""
18 | def __init__(self,Q,K_s,V_s,d_model,d_k,d_v,sequence_length,h,typee=None,is_training=None,mask=None,dropout_rate=0.1):
19 | self.d_model=d_model
20 | self.d_k=d_k
21 | self.d_v=d_v
22 | self.sequence_length=sequence_length
23 | self.h=h
24 | self.Q=Q
25 | self.K_s=K_s
26 | self.V_s=V_s
27 | self.typee=typee
28 | self.is_training=is_training
29 | self.mask=mask
30 | self.dropout_rate=dropout_rate
31 | #print("MultiHeadAttention.self.dropout_rate:",self.dropout_rate)
32 |
33 | def multi_head_attention_fn(self):
34 | """
35 | multi head attention
36 | :param Q: query. shape:[batch,sequence_length,d_model]
37 | :param K_s: keys. shape:[batch,sequence_length,d_model].
38 | :param V_s:values.shape:[batch,sequence_length,d_model].
39 | :param h: h times
40 | :return: result of multi-head attention. shape:[batch,sequence_length,d_model]
41 | """
42 | # 1. linearly project the queries,keys and values h times(with different,learned linear projections to d_k,d_k,d_v dimensions)
43 | Q_projected = tf.layers.dense(self.Q,units=self.d_model) # [batch,sequence_length,d_model]
44 | K_s_projected = tf.layers.dense(self.K_s, units=self.d_model) # [batch,sequence_length,d_model]
45 | V_s_projected = tf.layers.dense(self.V_s, units=self.d_model) # [batch,sequence_length,d_model]
46 | # 2. scaled dot product attention for each projected version of Q,K,V
47 | dot_product=self.scaled_dot_product_attention_batch(Q_projected,K_s_projected,V_s_projected) # [batch,h,sequence_length,d_v]
48 | # 3. concatenated
49 | batch_size,h,length,d_v=dot_product.get_shape().as_list()
50 | #print("dot_product:",dot_product,";self.sequence_length:",self.sequence_length) ##dot_product:(128, 8, 6, 64);5
51 | dot_product=tf.reshape(dot_product,shape=(-1,length,self.d_model)) # [batch,sequence_length,d_model]
52 | # 4. linear projection
53 | output=tf.layers.dense(dot_product,units=self.d_model) # [batch,sequence_length,d_model]
54 | return output #[batch,sequence_length,d_model]
55 |
56 | def scaled_dot_product_attention_batch_mine(self,Q,K_s,V_s): #my own implementation of scaled dot product attention.
57 | """
58 | scaled dot product attention
59 | :param Q: query. shape:[batch,sequence_length,d_model]
60 | :param K_s: keys. shape:[batch,sequence_length,d_model]
61 | :param V_s:values. shape:[batch,sequence_length,d_model]
62 | :param mask: shape:[batch,sequence_length]
63 | :return: result of scaled dot product attention. shape:[batch,h,sequence_length,d_k]
64 | """
65 | # 1. split Q,K,V
66 | Q_heads = tf.stack(tf.split(Q,self.h,axis=2),axis=1) # [batch,h,sequence_length,d_k]
67 | K_heads = tf.stack(tf.split(K_s, self.h, axis=2), axis=1) # [batch,h,sequence_length,d_k]
68 | V_heads = tf.stack(tf.split(V_s, self.h, axis=2), axis=1) # [batch,h,sequence_length,d_k]
69 | dot_product=tf.multiply(Q_heads,K_heads) # [batch,h,sequence_length,d_k]
70 | # 2. dot product
71 | dot_product=dot_product*(1.0/tf.sqrt(tf.cast(self.d_model,tf.float32))) # [batch,h,sequence_length,d_k]
72 | dot_product=tf.reduce_sum(dot_product,axis=-1,keep_dims=True) # [batch,h,sequence_length,1]
73 | # 3. add mask if it is not None
74 | if self.mask is not None:
75 | mask = tf.expand_dims(self.mask, axis=-1) # [batch,sequence_length,1]
76 | mask = tf.expand_dims(mask, axis=1) # [batch,1,sequence_length,1]
77 | dot_product=dot_product+mask # [batch,h,sequence_length,1]
78 | # 4. get probability
79 | p=tf.nn.softmax(dot_product) # [batch,h,sequence_length,1]
80 | # 5. final output
81 | output=tf.multiply(p,V_heads) # [batch,h,sequence_length,d_k]
82 | return output # [batch,h,sequence_length,d_k]
83 |
84 | def scaled_dot_product_attention_batch(self, Q, K_s, V_s):# scaled dot product attention: implementation style like tensor2tensor from google
85 | """
86 | scaled dot product attention
87 | :param Q: query. shape:[batch,sequence_length,d_model]
88 | :param K_s: keys. shape:[batch,sequence_length,d_model]
89 | :param V_s:values. shape:[batch,sequence_length,d_model]
90 | :param mask: shape:[sequence_length,sequence_length]
91 | :return: result of scaled dot product attention. shape:[batch,h,sequence_length,d_k]
92 | """
93 | # 1. split Q,K,V
94 | #K_s=tf.layers.dense(K_s,self.d_model) # transform K_s, while keep as shape. TODO add 2018.10.21. so that Q and K shoud be not the same.
95 | Q_heads = tf.stack(tf.split(Q,self.h,axis=2),axis=1) # [batch,h,sequence_length,d_k]
96 | K_heads = tf.stack(tf.split(K_s, self.h, axis=2), axis=1) # [batch,h,sequence_length,d_k]
97 | V_heads = tf.stack(tf.split(V_s, self.h, axis=2), axis=1) # [batch,h,sequence_length,d_v]. during implementation, d_v=d_k.
98 | # 2. dot product of Q,K
99 | dot_product=tf.matmul(Q_heads,K_heads,transpose_b=True) # [batch,h,sequence_length,sequence_length]
100 | dot_product=dot_product*(1.0/tf.sqrt(tf.cast(self.d_model,tf.float32))) # [batch,h,sequence_length,sequence_length]
101 | # 3. add mask if it is not None
102 | #print("scaled_dot_product_attention_batch.mask is not none?",self.mask is not None)
103 | if self.mask is not None:
104 | mask_expand=tf.expand_dims(tf.expand_dims(self.mask,axis=0),axis=0) # [1,1,sequence_length,sequence_length]
105 | #dot_product:(128, 8, 6, 6);mask_expand:(1, 1, 6, 6)
106 | #print("scaled_dot_product_attention_batch.dot_product:",dot_product,";mask_expand:",mask_expand)
107 | dot_product=dot_product+mask_expand # [batch,h,sequence_length,sequence_length]
108 | # 4. get probability
109 | weights=tf.nn.softmax(dot_product) # [batch,h,sequence_length,sequence_length]
110 | # drop out weights
111 | weights=tf.nn.dropout(weights,1.0-self.dropout_rate) # [batch,h,sequence_length,sequence_length]
112 | # 5. final output
113 | output=tf.matmul(weights,V_heads) # [batch,h,sequence_length,d_v]
114 | return output
115 |
116 |
117 | #vectorized implementation of multi head attention for sentences with batch
118 | def multi_head_attention_for_sentence_vectorized(layer_number):
119 | print("started...")
120 | start = time.time()
121 | # 1.set parameter
122 | d_model = 512
123 | d_k = 64
124 | d_v = 64
125 | sequence_length = 1000
126 | h = 8
127 | batch_size=128
128 | initializer = tf.random_normal_initializer(stddev=0.1)
129 | # 2.set Q,K,V
130 | vocab_size=1000
131 | embed_size=d_model
132 | typee='decoder'
133 | Embedding = tf.get_variable("Embedding_", shape=[vocab_size, embed_size],initializer=initializer)
134 | input_x = tf.placeholder(tf.int32, [batch_size,sequence_length], name="input_x")
135 | embedded_words = tf.nn.embedding_lookup(Embedding, input_x) #[batch_size,sequence_length,embed_size]
136 | mask=get_mask(batch_size,sequence_length) #tf.ones((batch_size,sequence_length))*-1e8 #[batch,sequence_length]
137 | with tf.variable_scope("query_at_each_sentence"+str(layer_number)):
138 | Q = embedded_words # [batch_size*sequence_length,embed_size]
139 | K_s=embedded_words #[batch_size*sequence_length,embed_size]
140 | V_s=embedded_words #tf.get_variable("V_s_original_", shape=embedded_words.get_shape().as_list(),initializer=initializer) #[batch_size,sequence_length,embed_size]
141 | # 3.call method to get result
142 | multi_head_attention_class = MultiHeadAttention(Q, K_s, V_s, d_model, d_k, d_v, sequence_length, h,typee='decoder',mask=mask)
143 | encoder_output=multi_head_attention_class.multi_head_attention_fn() #shape:[sequence_length,d_model]
144 | encoder_output=tf.reshape(encoder_output,shape=(batch_size,sequence_length,d_model))
145 | end = time.time()
146 | print("input_x:",input_x)
147 | print("encoder_output:",encoder_output,";time_spent:",(end-start))
148 |
149 | def get_mask(batch_size,sequence_length):
150 | lower_triangle=tf.matrix_band_part(tf.ones([sequence_length,sequence_length]),-1,0)
151 | result=-1e9*(1.0-lower_triangle)
152 | print("get_mask==>result:",result)
153 | return result
154 |
155 | layer_number=0
156 | #multi_head_attention_for_sentence_vectorized(0)
--------------------------------------------------------------------------------
/model/optimization.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Language Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | """Functions and classes related to optimization (weight updates)."""
16 |
17 | from __future__ import absolute_import
18 | from __future__ import division
19 | from __future__ import print_function
20 |
21 | import re
22 | import tensorflow as tf
23 |
24 |
25 | def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu):
26 | """Creates an optimizer training op."""
27 | global_step = tf.train.get_or_create_global_step()
28 |
29 | learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32)
30 |
31 | # Implements linear decay of the learning rate.
32 | learning_rate = tf.train.polynomial_decay(
33 | learning_rate,
34 | global_step,
35 | num_train_steps,
36 | end_learning_rate=0.0,
37 | power=1.0,
38 | cycle=False)
39 |
40 | # Implements linear warmup. I.e., if global_step < num_warmup_steps, the
41 | # learning rate will be `global_step/num_warmup_steps * init_lr`.
42 | if num_warmup_steps:
43 | global_steps_int = tf.cast(global_step, tf.int32)
44 | warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32)
45 |
46 | global_steps_float = tf.cast(global_steps_int, tf.float32)
47 | warmup_steps_float = tf.cast(warmup_steps_int, tf.float32)
48 |
49 | warmup_percent_done = global_steps_float / warmup_steps_float
50 | warmup_learning_rate = init_lr * warmup_percent_done
51 |
52 | is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32)
53 | learning_rate = (
54 | (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate)
55 |
56 | # It is recommended that you use this optimizer for fine tuning, since this
57 | # is how the model was trained (note that the Adam m/v variables are NOT
58 | # loaded from init_checkpoint.)
59 | optimizer = AdamWeightDecayOptimizer(
60 | learning_rate=learning_rate,
61 | weight_decay_rate=0.01,
62 | beta_1=0.9,
63 | beta_2=0.999,
64 | epsilon=1e-6,
65 | exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"])
66 |
67 | if use_tpu:
68 | optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
69 |
70 | tvars = tf.trainable_variables()
71 | grads = tf.gradients(loss, tvars)
72 |
73 | # This is how the model was pre-trained.
74 | (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0)
75 |
76 | train_op = optimizer.apply_gradients(
77 | zip(grads, tvars), global_step=global_step)
78 |
79 | new_global_step = global_step + 1
80 | train_op = tf.group(train_op, [global_step.assign(new_global_step)])
81 | return train_op
82 |
83 |
84 | class AdamWeightDecayOptimizer(tf.train.Optimizer):
85 | """A basic Adam optimizer that includes "correct" L2 weight decay."""
86 |
87 | def __init__(self,
88 | learning_rate,
89 | weight_decay_rate=0.0,
90 | beta_1=0.9,
91 | beta_2=0.999,
92 | epsilon=1e-6,
93 | exclude_from_weight_decay=None,
94 | name="AdamWeightDecayOptimizer"):
95 | """Constructs a AdamWeightDecayOptimizer."""
96 | super(AdamWeightDecayOptimizer, self).__init__(False, name)
97 |
98 | self.learning_rate = learning_rate
99 | self.weight_decay_rate = weight_decay_rate
100 | self.beta_1 = beta_1
101 | self.beta_2 = beta_2
102 | self.epsilon = epsilon
103 | self.exclude_from_weight_decay = exclude_from_weight_decay
104 |
105 | def apply_gradients(self, grads_and_vars, global_step=None, name=None):
106 | """See base class."""
107 | assignments = []
108 | for (grad, param) in grads_and_vars:
109 | if grad is None or param is None:
110 | continue
111 |
112 | param_name = self._get_variable_name(param.name)
113 |
114 | m = tf.get_variable(
115 | name=param_name + "/adam_m",
116 | shape=param.shape.as_list(),
117 | dtype=tf.float32,
118 | trainable=False,
119 | initializer=tf.zeros_initializer())
120 | v = tf.get_variable(
121 | name=param_name + "/adam_v",
122 | shape=param.shape.as_list(),
123 | dtype=tf.float32,
124 | trainable=False,
125 | initializer=tf.zeros_initializer())
126 |
127 | # Standard Adam update.
128 | next_m = (
129 | tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
130 | next_v = (
131 | tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
132 | tf.square(grad)))
133 |
134 | update = next_m / (tf.sqrt(next_v) + self.epsilon)
135 |
136 | # Just adding the square of the weights to the loss function is *not*
137 | # the correct way of using L2 regularization/weight decay with Adam,
138 | # since that will interact with the m and v parameters in strange ways.
139 | #
140 | # Instead we want to decay the weights in a manner that doesn't interact
141 | # with the m/v parameters. This is equivalent to adding the square
142 | # of the weights to the loss with plain (non-momentum) SGD. (See the standalone sketch after this file.)
143 | if self._do_use_weight_decay(param_name):
144 | update += self.weight_decay_rate * param
145 |
146 | update_with_lr = self.learning_rate * update
147 |
148 | next_param = param - update_with_lr
149 |
150 | assignments.extend(
151 | [param.assign(next_param),
152 | m.assign(next_m),
153 | v.assign(next_v)])
154 | return tf.group(*assignments, name=name)
155 |
156 | def _do_use_weight_decay(self, param_name):
157 | """Whether to use L2 weight decay for `param_name`."""
158 | if not self.weight_decay_rate:
159 | return False
160 | if self.exclude_from_weight_decay:
161 | for r in self.exclude_from_weight_decay:
162 | if re.search(r, param_name) is not None:
163 | return False
164 | return True
165 |
166 | def _get_variable_name(self, param_name):
167 | """Get the variable name from the tensor name."""
168 | m = re.match("^(.*):\\d+$", param_name)
169 | if m is not None:
170 | param_name = m.group(1)
171 | return param_name
172 |
--------------------------------------------------------------------------------
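# A minimal, framework-free sketch of the two ideas in optimization.py above: the
# warmup-then-linear-decay learning-rate schedule, and the decoupled weight decay in
# AdamWeightDecayOptimizer (decay is added to the update, not to the loss). The
# hyperparameter values and arrays below are made-up illustrations, not repo defaults.
import numpy as np

def lr_at_step(step, init_lr=5e-5, num_train_steps=1000, num_warmup_steps=100):
    # linear warmup: lr = step / num_warmup_steps * init_lr while step < num_warmup_steps
    if step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    # linear (power=1.0) decay from init_lr down to 0.0 at num_train_steps
    return init_lr * max(0.0, 1.0 - step / num_train_steps)

def adamw_update(param, grad, m, v, lr, weight_decay_rate=0.01,
                 beta_1=0.9, beta_2=0.999, epsilon=1e-6):
    # standard Adam moment estimates (no bias correction, matching the code above)
    m = beta_1 * m + (1.0 - beta_1) * grad
    v = beta_2 * v + (1.0 - beta_2) * grad ** 2
    update = m / (np.sqrt(v) + epsilon)
    # decoupled weight decay: applied to the update directly, independent of m and v
    update += weight_decay_rate * param
    return param - lr * update, m, v

param, m, v = np.array([0.5, -0.3]), np.zeros(2), np.zeros(2)
grad = np.array([0.1, -0.2])
param, m, v = adamw_update(param, grad, m, v, lr_at_step(50))
print(param)  # parameters move against the gradient and are nudged slightly toward zero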
/model/poistion_wise_feed_forward.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import tensorflow as tf
3 | import time
4 | """
5 | Position-wise Feed-Forward Networks
6 | In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully
7 | connected feed-forward network, which is applied to each position separately and identically. This
8 | consists of two linear transformations with a ReLU activation in between.
9 | FFN(x) = max(0,xW1+b1)W2+b2
10 | While the linear transformations are the same across different positions, they use different parameters
11 | from layer to layer. Another way of describing this is as two convolutions with kernel size 1.
12 | The dimensionality of input and output is d_model = 512, and the inner layer has dimensionality d_ff = 2048. (See the standalone NumPy sketch after this file.)
13 | """
14 | class PositionWiseFeedFoward(object):
15 | """
16 | position-wise feed forward networks. formula as below:
17 | FFN(x)=max(0,xW1+b1)W2+b2
18 | """
19 | def __init__(self,x,layer_index,d_model=512,d_ff=2048):
20 | """
21 | :param x: shape should be:[batch,sequence_length,d_model]
22 | :param layer_index: index of layer
23 | :return: shape:[batch,sequence_length,d_model]
24 | """
25 | shape_list=x.get_shape().as_list()
26 | assert(len(shape_list)==3)
27 | self.x=x
28 | self.layer_index=layer_index
29 | self.d_model=d_model
30 | self.d_ff=d_ff
31 | self.initializer = tf.random_normal_initializer(stddev=0.1)
32 |
33 | def position_wise_feed_forward_fn(self):
34 | """
35 | position-wise fully connected feed-forward network, implemented as two convolution layers
36 | x: [batch,sequence_length,d_model]
37 | :return: [batch,sequence_length,d_model]
38 | """
39 | # 1.conv layer 1
40 | input=tf.expand_dims(self.x,axis=3) # [batch,sequence_length,d_model,1]
41 | # conv2d.input: [batch,sentence_length,embed_size,1]. filter=[filter_size,self.embed_size,1,self.num_filters]
42 | output_conv1=tf.layers.conv2d( # output_conv1: [batch_size,sequence_length,1,d_ff]
43 | input,filters=self.d_ff,kernel_size=[1,self.d_model],padding="VALID",
44 | name='conv1',kernel_initializer=self.initializer,activation=tf.nn.relu
45 | )
46 | output_conv1 = tf.transpose(output_conv1, [0,1,3,2]) #output_conv1:[batch_size,sequence_length,d_ff,1]
47 | # print("output_conv1:",output_conv1)
48 |
49 | # 2.conv layer 2
50 | output_conv2 = tf.layers.conv2d( # output_conv2:[batch_size, sequence_length,1,d_model]
51 | output_conv1,filters=self.d_model,kernel_size=[1,self.d_ff],padding="VALID",
52 | name='conv2',kernel_initializer=self.initializer,activation=None
53 | )
54 | output=tf.squeeze(output_conv2) #[batch,sequence_length,d_model]
55 | return output #[batch,sequence_length,d_model]
56 |
57 | def position_wise_feed_forward_fc_fn(self):
58 | """
59 | position-wise fully connected feed-forward network, implemented as in the original formulation.
60 | FFN(x) = max(0,xW1+b1)W2+b2
61 | this function is provided as an alternative if you prefer the original formulation or don't want to use the two convolution layers,
62 | but it may be less efficient as the sequence becomes longer.
63 | x: [batch,sequence_length,d_model]
64 | :return: [batch,sequence_length,d_model]
65 | """
66 | # 0. pre-process input x
67 | _,sequence_length,d_model=self.x.get_shape().as_list()
68 |
69 | element_list = tf.split(self.x, sequence_length,axis=1) # it is a list,length is sequence_length, each element is [batch_size,1,d_model]
70 | element_list = [tf.squeeze(element, axis=1) for element in element_list] # it is a list,length is sequence_length, each element is [batch_size,d_model]
71 | output_list=[]
72 | for i, element in enumerate(element_list):
73 | with tf.variable_scope("foo", reuse=True if i>0 else False):
74 | # 1. layer 1
75 | W1 = tf.get_variable("ff_layer1", shape=[self.d_model, self.d_ff], initializer=self.initializer)
76 | z1=tf.nn.relu(tf.matmul(element,W1)) # z1:[batch_size,d_ff]<--------tf.matmul([batch_size,d_model],[d_model, d_ff])
77 | # 2. layer 2
78 | W2 = tf.get_variable("ff_layer2", shape=[self.d_ff, self.d_model], initializer=self.initializer)
79 | output_element=tf.matmul(z1,W2) # output:[batch_size,d_model]<----------tf.matmul([batch_size,d_ff],[d_ff, d_model])
80 | output_list.append(output_element) # a list, each element is [batch_size,d_model]
81 | output=tf.stack(output_list,axis=1) # [batch,sequence_length,d_model]
82 | return output # [batch,sequence_length,d_model]
83 |
84 | # test function of position_wise_feed_forward_fn
85 | # time spent: OLD VERSION (FC): length=1000, 2.04 s; NEW VERSION (CNN): 0.03 s, roughly a 68x speed-up.
86 | def test_position_wise_feed_forward_fn():
87 | start=time.time()
88 | x=tf.ones((8,1000,512)) # batch_size=8, sequence_length=1000, d_model=512
89 | layer_index=0
90 | postion_wise_feed_forward=PositionWiseFeedFoward(x,layer_index)
91 | output=postion_wise_feed_forward.position_wise_feed_forward_fn()
92 | end=time.time()
93 | print("x:",x.shape,";output:",output.shape)
94 | print("time spent:",(end-start))
95 | return output
96 |
97 | def test():
98 | with tf.Session() as sess:
99 | result=test_position_wise_feed_forward_fn()
100 | sess.run(tf.global_variables_initializer())
101 | result_=sess.run(result)
102 | print("result_.shape:",result_.shape)
103 |
104 | #test()
--------------------------------------------------------------------------------
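# A minimal NumPy sketch of the position-wise feed-forward formula used above,
# FFN(x) = max(0, x·W1 + b1)·W2 + b2, applied independently to every position.
# Dimensions are scaled down from the paper's d_model=512 / d_ff=2048 for readability;
# the weights here are random stand-ins, not trained parameters.
import numpy as np

batch, seq_len, d_model, d_ff = 2, 5, 8, 32
rng = np.random.default_rng(0)
x = rng.normal(size=(batch, seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Because the same W1/W2 are applied at every position, a single matmul over the
# last axis performs the per-position transform for the whole batch at once.
hidden = np.maximum(0.0, x @ W1 + b1)      # [batch, seq_len, d_ff], ReLU
ffn_out = hidden @ W2 + b2                 # [batch, seq_len, d_model]
print(ffn_out.shape)                       # (2, 5, 8): same shape as the input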
/model/transfomer_model.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # -*- coding: utf-8 -*-
4 | """
5 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
6 | main idea: based on a multi-layer self-attention model (the Transformer encoder), pre-train two tasks (masked language model and next-sentence prediction)
7 | on a large corpus, then fine-tune by adding a single classification layer.
8 | """
9 |
10 | import tensorflow as tf
11 | import numpy as np
12 | from model.encoder import Encoder
13 | from model.config_transformer import Config
14 | import os
15 | os.environ["CUDA_VISIBLE_DEVICES"] = "6"
16 |
17 | class TransformerModel:
18 | def __init__(self,config):
19 | """
20 | init all hyperparameter with config class, define placeholder, computation graph
21 | """
22 | self.num_classes = config.num_classes
23 | print("BertModel.num_classes:",self.num_classes)
24 | self.batch_size = config.batch_size
25 | self.sequence_length = config.sequence_length
26 | self.vocab_size = config.vocab_size
27 | self.d_model = config.d_model
28 | self.learning_rate = tf.Variable(config.learning_rate, trainable=False, name="learning_rate")
29 | self.clip_gradients=config.clip_gradients
30 | self.decay_steps=config.decay_steps
31 | self.decay_rate=config.decay_rate
32 | self.d_k=config.d_k
33 | self.d_model=config.d_model
34 | self.h=config.h
35 | self.d_v=config.d_v
36 | self.num_layer=config.num_layer
37 | self.use_residual_conn=True
38 | self.is_training=config.is_training
39 |
40 | # place holder(X,y)
41 | self.input_x= tf.placeholder(tf.int32, [self.batch_size, self.sequence_length], name="input_x") # input is a sequence of token ids, e.g. 'the man [mask1] to [mask2] store'
42 | self.input_y=tf.placeholder(tf.float32, [self.batch_size, self.num_classes],name="input_y")
43 |
44 | self.learning_rate_decay_half_op = tf.assign(self.learning_rate, self.learning_rate *config.decay_rate)
45 | self.initializer=tf.random_normal_initializer(stddev=0.1)
46 | self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")
47 | self.global_step = tf.Variable(0, trainable=False, name="Global_Step")
48 | self.epoch_step = tf.Variable(0, trainable=False, name="Epoch_Step")
49 | self.epoch_increment = tf.assign(self.epoch_step, tf.add(self.epoch_step, tf.constant(1)))
50 |
51 | self.instantiate_weights()
52 | self.logits =self.inference() # shape:[None,self.num_classes]
53 | self.predictions = tf.argmax(self.logits, axis=1, name="predictions") # shape:[None,]
54 |
55 | if not self.is_training:
56 | return
57 | self.loss_val = self.loss()
58 | self.train_op = self.train()
59 |
60 | def inference(self):
61 | """
62 | main inference logic: invoke the transformer encoder to do inference. input is a sequence; the encoder output is also a sequence.
63 | pipeline: input representation --> encoder (Nx blocks) --> projection layer for logits
64 | :return: logits, shape [batch_size, num_classes]
65 | """
66 | # 1. input representation(input embedding, positional encoding, segment encoding)
67 | token_embeddings = tf.nn.embedding_lookup(self.embedding,self.input_x) # [batch_size,sequence_length,embed_size]
68 | self.input_representation=tf.add(tf.add(token_embeddings,self.segment_embeddings),self.position_embeddings) # [batch_size,sequence_length,embed_size]
69 |
70 | # 2. repeat Nx times of building block( multi-head attention followed by Add & Norm; feed forward followed by Add & Norm)
71 | encoder_class=Encoder(self.d_model,self.d_k,self.d_v,self.sequence_length,self.h,self.batch_size,self.num_layer,self.input_representation,
72 | self.input_representation,dropout_keep_prob=self.dropout_keep_prob,use_residual_conn=self.use_residual_conn)
73 | h = encoder_class.encoder_fn() # [batch_size,sequence_length,d_model]
74 |
75 | # 3. get logits for different tasks by applying projection layer
76 | logits=self.project_tasks(h) # shape:[None,self.num_classes]
77 | return logits # shape:[None,self.num_classes]
78 |
79 | def project_tasks(self,h):
80 | """
81 | project the representation, then do classification.
82 | :param h: [batch_size,sequence_length,d_model]
83 | :return: logits: [batch_size, num_classes]
84 | transform the representation with a one-layer MLP, then get logits.
85 | (borrows some insight from recent work on densely connected layers)
86 | """
87 | cls_representation = h[:, 0, :] # [CLS] token's information: classification task's representation
88 | logits = tf.layers.dense(cls_representation, self.num_classes) # shape:[None,self.num_classes]
89 | logits = tf.nn.dropout(logits,keep_prob=self.dropout_keep_prob) # shape:[None,self.num_classes]
90 | return logits
91 |
92 | def loss(self,l2_lambda=0.0001*3,epislon=0.000001):
93 | # input: `logits` and `labels` must have the same shape `[batch_size, num_classes]`
94 | # output: a `Tensor` of the same shape as `logits` with the element-wise sigmoid cross-entropy loss.
95 | # let `x = logits`, `z = labels`. The logistic loss is: z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x)). (See the NumPy check after this file.)
96 | losses= tf.nn.sigmoid_cross_entropy_with_logits(labels=self.input_y,logits=self.logits) #[batch_size,num_classes]
97 | self.losses = tf.reduce_mean((tf.reduce_sum(losses,axis=1))) # shape=(?,)-->(). loss for all data in the batch-->single loss
98 | self.l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'bias' not in v.name]) * l2_lambda
99 |
100 | loss=self.losses+self.l2_loss
101 | return loss
102 |
103 | def train(self):
104 | """based on the loss, use SGD to update parameter"""
105 | learning_rate = tf.train.exponential_decay(self.learning_rate, self.global_step, self.decay_steps,self.decay_rate, staircase=True)
106 | train_op = tf.contrib.layers.optimize_loss(self.loss_val, global_step=self.global_step,learning_rate=learning_rate, optimizer="Adam",clip_gradients=self.clip_gradients)
107 | return train_op
108 |
109 | def instantiate_weights(self):
110 | """define all weights here"""
111 | with tf.name_scope("embedding"): # embedding matrix
112 | self.embedding = tf.get_variable("embedding", shape=[self.vocab_size, self.d_model],initializer=self.initializer) # [vocab_size,embed_size]
113 | self.segment_embeddings = tf.get_variable("segment_embeddings", [self.d_model],initializer=tf.constant_initializer(1.0)) # a learned sequence embedding
114 | self.position_embeddings = tf.get_variable("position_embeddings", [self.sequence_length, self.d_model],initializer=tf.constant_initializer(1.0)) # [sequence_length, d_model]
115 |
116 |
117 | # train the model on a toy task: learn to count, i.e. sum up all the inputs and decide whether the total is below or above a threshold.
118 | # usage: first run train() to train the model; it saves checkpoints to the file system. then run predict() to make predictions from the saved checkpoint.
119 | def train():
120 | # 1.init config and model
121 | config=Config()
122 | threshold=(config.sequence_length/2)+1
123 | model = TransformerModel(config)
124 | gpu_config = tf.ConfigProto()
125 | gpu_config.gpu_options.allow_growth = True
126 | saver = tf.train.Saver()
127 | save_path = config.ckpt_dir + "model.ckpt"
128 | #if not os.path.exists(config.ckpt_dir):
129 | # os.makedirs(config.ckpt_dir)
130 | with tf.Session(config=gpu_config) as sess:
131 | sess.run(tf.global_variables_initializer())
132 | if os.path.exists(config.ckpt_dir): # if a checkpoint directory exists, restore the previously trained model
133 | saver.restore(sess, tf.train.latest_checkpoint(config.ckpt_dir))
134 | for i in range(100000):
135 | # 2.feed data
136 | input_x = np.random.randn(config.batch_size, config.sequence_length) # [None, self.sequence_length]
137 | input_x[input_x >= 0] = 1
138 | input_x[input_x < 0] = 0
139 | input_y = generate_label(input_x,threshold)
140 | # 3.run session to train the model, print some logs.
141 | loss, _ = sess.run([model.loss_val, model.train_op],feed_dict={model.input_x: input_x, model.input_y: input_y,model.dropout_keep_prob: config.dropout_keep_prob})
142 | print(i, "loss:", loss, "-------------------------------------------------------")
143 | if i==300:
144 | print("label[0]:", input_y[0]);print("input_x:",input_x)
145 | if i % 500 == 0:
146 | saver.save(sess, save_path, global_step=i)
147 |
148 | # use saved checkpoint from model to make prediction, and print it, to see whether it is able to do toy task successfully.
149 | def predict():
150 | config=Config()
151 | threshold=(config.sequence_length/2)+1
152 | config.batch_size=1
153 | model = TransformerModel(config)
154 | gpu_config = tf.ConfigProto()
155 | gpu_config.gpu_options.allow_growth = True
156 | saver = tf.train.Saver()
157 | ckpt_dir = config.ckpt_dir
158 | print("ckpt_dir:",ckpt_dir)
159 | with tf.Session(config=gpu_config) as sess:
160 | sess.run(tf.global_variables_initializer())
161 | saver.restore(sess, tf.train.latest_checkpoint(ckpt_dir))
162 | for i in range(100):
163 | # 2.feed data
164 | input_x = np.random.randn(config.batch_size, config.sequence_length) # [None, self.sequence_length]
165 | input_x[input_x >= 0] = 1
166 | input_x[input_x < 0] = 0
167 | target_label = generate_label(input_x,threshold)
168 | input_sum=np.sum(input_x)
169 | # 3.run session to train the model, print some logs.
170 | logit,prediction = sess.run([model.logits, model.predictions],feed_dict={model.input_x: input_x ,model.dropout_keep_prob: config.dropout_keep_prob})
171 | print("target_label:", target_label,";input_sum:",input_sum,"threshold:",threshold,";prediction:",prediction);
172 | print("input_x:",input_x,";logit:",logit)
173 |
174 |
175 | def generate_label(input_x,threshold):
176 | """
177 | generate label with input
178 | :param input_x: shape of [batch_size, sequence_length]
179 | :return: y:[batch_size]
180 | """
181 | batch_size,sequence_length=input_x.shape
182 | y=np.zeros((batch_size,2))
183 | for i in range(batch_size):
184 | input_single=input_x[i]
185 | sum=np.sum(input_single)
186 | if i == 0:print("sum:",sum,";threshold:",threshold)
187 | y_single=1 if sum>threshold else 0
188 | if y_single==1:
189 | y[i]=[0,1]
190 | else: # y_single=0
191 | y[i]=[1,0]
192 | return y
193 |
194 | #train()
195 | #predict()
--------------------------------------------------------------------------------
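# A small NumPy check of the multi-label loss commented in TransformerModel.loss() above:
# tf.nn.sigmoid_cross_entropy_with_logits computes, element-wise,
#   z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x)),
# and the model then sums over classes and averages over the batch. The logits/labels
# below are made-up values, just to illustrate the shapes and the reduction.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([[2.0, -1.0], [0.5, 0.0]])   # [batch_size=2, num_classes=2]
labels = np.array([[1.0, 0.0], [0.0, 1.0]])

elementwise = labels * -np.log(sigmoid(logits)) + (1 - labels) * -np.log(1 - sigmoid(logits))
loss = np.mean(np.sum(elementwise, axis=1))     # sum over classes, mean over the batch
print(elementwise.shape, loss)                  # (2, 2) and a single scalar loss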
/old/JoinAttLayer.py:
--------------------------------------------------------------------------------
1 | # coding=utf8
2 | from keras import backend as K
3 | from keras.engine.topology import Layer
4 | from keras import initializers, regularizers, constraints
5 | from keras.layers.merge import _Merge
6 |
7 |
8 | class Attention(Layer):
9 | def __init__(self, step_dim,
10 | W_regularizer=None, b_regularizer=None,
11 | W_constraint=None, b_constraint=None,
12 | bias=True, **kwargs):
13 | """
14 | Keras Layer that implements an Attention mechanism for temporal data.
15 | Supports Masking.
16 | Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756]
17 | # Input shape
18 | 3D tensor with shape: `(samples, steps, features)`.
19 | # Output shape
20 | 2D tensor with shape: `(samples, features)`.
21 | :param kwargs:
22 | Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
23 | The dimensions are inferred based on the output shape of the RNN.
24 | Example:
25 | model.add(LSTM(64, return_sequences=True))
26 | model.add(Attention(maxlen))  # step_dim: the number of timesteps. (See the standalone sketch after this file.)
27 | """
28 | self.supports_masking = True
29 | # self.init = initializations.get('glorot_uniform')
30 | self.init = initializers.get('glorot_uniform')
31 |
32 | self.W_regularizer = regularizers.get(W_regularizer)
33 | self.b_regularizer = regularizers.get(b_regularizer)
34 |
35 | self.W_constraint = constraints.get(W_constraint)
36 | self.b_constraint = constraints.get(b_constraint)
37 |
38 | self.bias = bias
39 | self.step_dim = step_dim
40 | self.features_dim = 0
41 | super(Attention, self).__init__(**kwargs)
42 |
43 | def build(self, input_shape):
44 | assert len(input_shape) == 3
45 |
46 | self.W = self.add_weight((input_shape[-1],),
47 | initializer=self.init,
48 | name='{}_W'.format(self.name),
49 | regularizer=self.W_regularizer,
50 | constraint=self.W_constraint)
51 | self.features_dim = input_shape[-1]
52 |
53 | if self.bias:
54 | self.b = self.add_weight((input_shape[1],),
55 | initializer='zero',
56 | name='{}_b'.format(self.name),
57 | regularizer=self.b_regularizer,
58 | constraint=self.b_constraint)
59 | else:
60 | self.b = None
61 |
62 | self.built = True
63 |
64 | def compute_mask(self, input, input_mask=None):
65 | # do not pass the mask to the next layers
66 | return None
67 |
68 | def call(self, x, mask=None):
69 | input_shape = K.int_shape(x)
70 |
71 | features_dim = self.features_dim
72 | # step_dim = self.step_dim
73 | step_dim = input_shape[1]
74 |
75 | eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim))
76 |
77 | if self.bias:
78 | eij += self.b[:input_shape[1]]
79 |
80 | eij = K.tanh(eij)
81 |
82 | a = K.exp(eij)
83 |
84 | # apply mask after the exp. will be re-normalized next
85 | if mask is not None:
86 | # Cast the mask to floatX to avoid float64 upcasting in theano
87 | a *= K.cast(mask, K.floatx())
88 |
89 | # in some cases especially in the early stages of training the sum may be almost zero
90 | # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
91 | a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
92 |
93 | a = K.expand_dims(a)
94 | weighted_input = x * a
95 | # print(weighted_input.shape)
96 | return K.sum(weighted_input, axis=1)
97 |
98 | def compute_output_shape(self, input_shape):
99 | # return input_shape[0], input_shape[-1]
100 | return input_shape[0], self.features_dim
101 | # end Attention
102 |
103 |
104 | class JoinAttention(_Merge):
105 | def __init__(self, step_dim, hid_size,
106 | W_regularizer=None, b_regularizer=None,
107 | W_constraint=None, b_constraint=None,
108 | bias=True, **kwargs):
109 | """
110 | Keras Layer that implements an Attention mechanism according to other vector.
111 | Supports Masking.
112 | # Input shape, list of
113 | 2D tensor with shape: `(samples, features_1)`.
114 | 3D tensor with shape: `(samples, steps, features_2)`.
115 | # Output shape
116 | 2D tensor with shape: `(samples, features)`.
117 | :param kwargs:
118 | Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True.
119 | The dimensions are inferred based on the output shape of the RNN.
120 | Example:
121 | en = LSTM(64, return_sequences=False)(input)
122 | de = LSTM(64, return_sequences=True)(input2)
123 | output = JoinAttention(64, 20)([en, de])
124 | """
125 | self.supports_masking = True
126 | # self.init = initializations.get('glorot_uniform')
127 | self.init = initializers.get('glorot_uniform')
128 |
129 | self.W_regularizer = regularizers.get(W_regularizer)
130 | self.b_regularizer = regularizers.get(b_regularizer)
131 |
132 | self.W_constraint = constraints.get(W_constraint)
133 | self.b_constraint = constraints.get(b_constraint)
134 |
135 | self.bias = bias
136 | self.step_dim = step_dim
137 | self.hid_size = hid_size
138 | super(JoinAttention, self).__init__(**kwargs)
139 |
140 | def build(self, input_shape):
141 | if not isinstance(input_shape, list):
142 | raise ValueError('A merge layer [JoinAttention] should be called '
143 | 'on a list of inputs.')
144 | if len(input_shape) != 2:
145 | raise ValueError('A merge layer [JoinAttention] should be called '
146 | 'on a list of 2 inputs. '
147 | 'Got ' + str(len(input_shape)) + ' inputs.')
148 | if len(input_shape[0]) != 2 or len(input_shape[1]) != 3:
149 | raise ValueError('A merge layer [JoinAttention] should be called '
150 | 'on a list of 2 inputs with first ndim 2 and second one ndim 3. '
151 | 'Got ' + str(len(input_shape)) + ' inputs.')
152 |
153 | self.W_en1 = self.add_weight((input_shape[0][-1], self.hid_size),
154 | initializer=self.init,
155 | name='{}_W0'.format(self.name),
156 | regularizer=self.W_regularizer,
157 | constraint=self.W_constraint)
158 | self.W_en2 = self.add_weight((input_shape[1][-1], self.hid_size),
159 | initializer=self.init,
160 | name='{}_W1'.format(self.name),
161 | regularizer=self.W_regularizer,
162 | constraint=self.W_constraint)
163 | self.W_de = self.add_weight((self.hid_size,),
164 | initializer=self.init,
165 | name='{}_W2'.format(self.name),
166 | regularizer=self.W_regularizer,
167 | constraint=self.W_constraint)
168 |
169 | if self.bias:
170 | self.b_en1 = self.add_weight((self.hid_size,),
171 | initializer='zero',
172 | name='{}_b0'.format(self.name),
173 | regularizer=self.b_regularizer,
174 | constraint=self.b_constraint)
175 | self.b_en2 = self.add_weight((self.hid_size,),
176 | initializer='zero',
177 | name='{}_b1'.format(self.name),
178 | regularizer=self.b_regularizer,
179 | constraint=self.b_constraint)
180 | self.b_de = self.add_weight((input_shape[1][1],),
181 | initializer='zero',
182 | name='{}_b2'.format(self.name),
183 | regularizer=self.b_regularizer,
184 | constraint=self.b_constraint)
185 | else:
186 | self.b_en1 = None
187 | self.b_en2 = None
188 | self.b_de = None
189 |
190 | self._reshape_required = False
191 | self.built = True
192 |
193 | def compute_output_shape(self, input_shape):
194 | return input_shape[1][0], input_shape[1][-1]
195 |
196 | def compute_mask(self, input, input_mask=None):
197 | # do not pass the mask to the next layers
198 | return None
199 |
200 | def call(self, inputs, mask=None):
201 | en = inputs[0]
202 | de = inputs[1]
203 | de_shape = K.int_shape(de)
204 | step_dim = de_shape[1]
205 |
206 | hid_en = K.dot(en, self.W_en1)
207 | hid_de = K.dot(de, self.W_en2)
208 | if self.bias:
209 | hid_en += self.b_en1
210 | hid_de += self.b_en2
211 | hid = K.tanh(K.expand_dims(hid_en, axis=1) + hid_de)
212 | eij = K.reshape(K.dot(hid, K.reshape(self.W_de, (self.hid_size, 1))), (-1, step_dim))
213 | if self.bias:
214 | eij += self.b_de[:step_dim]
215 |
216 | a = K.exp(eij - K.max(eij, axis=-1, keepdims=True))
217 |
218 | # apply mask after the exp. will be re-normalized next
219 | if mask is not None:
220 | # Cast the mask to floatX to avoid float64 upcasting in theano
221 | a *= K.cast(mask[1], K.floatx())
222 |
223 | # in some cases especially in the early stages of training the sum may be almost zero
224 | # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
225 | a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
226 |
227 | a = K.expand_dims(a)
228 | weighted_input = de * a
229 | return K.sum(weighted_input, axis=1)
230 | # end JoinAttention
231 |
--------------------------------------------------------------------------------
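# A minimal NumPy sketch of the scoring used in the Attention layer above (masking
# omitted for brevity): e = tanh(x·W + b) per timestep, a = exp(e) normalized over the
# timesteps (with a small epsilon for stability), and the output is the attention-weighted
# sum of the timesteps. Shapes mirror the layer's (samples, steps, features) input;
# the weights here are random stand-ins, not trained parameters.
import numpy as np

samples, steps, features = 2, 4, 3
rng = np.random.default_rng(1)
x = rng.normal(size=(samples, steps, features))
W = rng.normal(size=(features,))
b = np.zeros(steps)

e = np.tanh(x @ W + b)                                # [samples, steps]
a = np.exp(e)
a = a / (a.sum(axis=1, keepdims=True) + 1e-7)         # normalize over the step axis
context = (x * a[:, :, None]).sum(axis=1)             # [samples, features]
print(context.shape)                                  # (2, 3)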
/old/classifier_bigru.py:
--------------------------------------------------------------------------------
1 | import keras
2 | from keras import Model
3 | from keras.layers import *
4 | from JoinAttLayer import Attention
5 |
6 |
7 | class TextClassifier():
8 |
9 | def model(self, embeddings_matrix, maxlen, word_index, num_class):
10 | inp = Input(shape=(maxlen,))
11 | encode = Bidirectional(CuDNNGRU(128, return_sequences=True))
12 | encode2 = Bidirectional(CuDNNGRU(128, return_sequences=True))
13 | attention = Attention(maxlen)
14 | x_4 = Embedding(len(word_index) + 1,
15 | embeddings_matrix.shape[1],
16 | weights=[embeddings_matrix],
17 | input_length=maxlen,
18 | trainable=True)(inp)
19 | x_3 = SpatialDropout1D(0.2)(x_4)
20 | x_3 = encode(x_3)
21 | x_3 = Dropout(0.2)(x_3)
22 | x_3 = encode2(x_3)
23 | x_3 = Dropout(0.2)(x_3)
24 | avg_pool_3 = GlobalAveragePooling1D()(x_3)
25 | max_pool_3 = GlobalMaxPooling1D()(x_3)
26 | attention_3 = attention(x_3)
27 | x = keras.layers.concatenate([avg_pool_3, max_pool_3, attention_3], name="fc")
28 | x = Dense(num_class, activation="sigmoid")(x)
29 |
30 | adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08,amsgrad=True)
31 | rmsprop = keras.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-06)
32 | model = Model(inputs=inp, outputs=x)
33 | model.compile(
34 | loss='categorical_crossentropy',
35 | optimizer=adam)
36 | return model
37 |
--------------------------------------------------------------------------------
/old/classifier_capsule.py:
--------------------------------------------------------------------------------
1 | import keras
2 | import random
3 | random.seed(16)
4 | import numpy as np
5 | np.random.seed(16)
6 | from tensorflow import set_random_seed
7 | set_random_seed(16)
8 | import random
9 | random.seed(16)
10 | from keras.models import Model
11 | from keras.layers import *
12 | from JoinAttLayer import Attention
13 |
14 |
15 | def precision(y_true, y_pred):
16 | """Precision metric.
17 | Only computes a batch-wise average of precision.
18 | Computes the precision, a metric for multi-label classification of
19 | how many selected items are relevant.
20 | """
21 | true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
22 | predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
23 | precision = true_positives / (predicted_positives + K.epsilon())
24 | return precision
25 |
26 |
27 | def recall(y_true, y_pred):
28 | """Recall metric.
29 | Only computes a batch-wise average of recall.
30 | Computes the recall, a metric for multi-label classification of
31 | how many relevant items are selected.
32 | """
33 | true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
34 | possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
35 | recall = true_positives / (possible_positives + K.epsilon())
36 | return recall
37 |
38 |
39 | def f1(y_true, y_pred, beta=1):
40 | """Computes the F score.
41 | The F score is the weighted harmonic mean of precision and recall.
42 | Here it is only computed as a batch-wise average, not globally.
43 | This is useful for multi-label classification, where input samples can be
44 | classified as sets of labels. By only using accuracy (precision) a model
45 | would achieve a perfect score by simply assigning every class to every
46 | input. In order to avoid this, a metric should penalize incorrect class
47 | assignments as well (recall). The F-beta score (ranged from 0.0 to 1.0)
48 | computes this, as a weighted mean of the proportion of correct class
49 | assignments vs. the proportion of incorrect class assignments.
50 | With beta = 1, this is equivalent to an F-measure. With beta < 1, assigning
51 | correct classes becomes more important, and with beta > 1 the metric is
52 | instead weighted towards penalizing incorrect class assignments.
53 | """
54 | if beta < 0:
55 | raise ValueError('The lowest choosable beta is zero (only precision).')
56 |
57 | p = precision(y_true, y_pred)
58 | r = recall(y_true, y_pred)
59 | bb = beta ** 2
60 | fbeta_score = (1 + bb) * (p * r) / (bb * p + r + K.epsilon())
61 | return fbeta_score
62 |
63 |
64 | def squash(x, axis=-1):
65 | # s_squared_norm is really small
66 | # s_squared_norm = K.sum(K.square(x), axis, keepdims=True) + K.epsilon()
67 | # scale = K.sqrt(s_squared_norm)/ (0.5 + s_squared_norm)
68 | # return scale * x
69 | s_squared_norm = K.sum(K.square(x), axis, keepdims=True)
70 | scale = K.sqrt(s_squared_norm + K.epsilon())
71 | return x / scale
72 |
73 |
74 | # A Capsule Implement with Pure Keras
75 | class Capsule(Layer):
76 | def __init__(self, num_capsule, dim_capsule, routings=3, kernel_size=(9, 1), share_weights=True,
77 | activation='default', **kwargs):
78 | super(Capsule, self).__init__(**kwargs)
79 | self.num_capsule = num_capsule
80 | self.dim_capsule = dim_capsule
81 | self.routings = routings
82 | self.kernel_size = kernel_size
83 | self.share_weights = share_weights
84 | if activation == 'default':
85 | self.activation = squash
86 | else:
87 | self.activation = Activation(activation)
88 |
89 | def build(self, input_shape):
90 | super(Capsule, self).build(input_shape)
91 | input_dim_capsule = input_shape[-1]
92 | if self.share_weights:
93 | self.W = self.add_weight(name='capsule_kernel',
94 | shape=(1, input_dim_capsule,
95 | self.num_capsule * self.dim_capsule),
96 | # shape=self.kernel_size,
97 | initializer='glorot_uniform',
98 | trainable=True)
99 | else:
100 | input_num_capsule = input_shape[-2]
101 | self.W = self.add_weight(name='capsule_kernel',
102 | shape=(input_num_capsule,
103 | input_dim_capsule,
104 | self.num_capsule * self.dim_capsule),
105 | initializer='glorot_uniform',
106 | trainable=True)
107 |
108 | def call(self, u_vecs):
109 | if self.share_weights:
110 | u_hat_vecs = K.conv1d(u_vecs, self.W)
111 | else:
112 | u_hat_vecs = K.local_conv1d(u_vecs, self.W, [1], [1])
113 |
114 | batch_size = K.shape(u_vecs)[0]
115 | input_num_capsule = K.shape(u_vecs)[1]
116 | u_hat_vecs = K.reshape(u_hat_vecs, (batch_size, input_num_capsule,
117 | self.num_capsule, self.dim_capsule))
118 | u_hat_vecs = K.permute_dimensions(u_hat_vecs, (0, 2, 1, 3))
119 | # final u_hat_vecs.shape = [None, num_capsule, input_num_capsule, dim_capsule]
120 |
121 | b = K.zeros_like(u_hat_vecs[:, :, :, 0]) # shape = [None, num_capsule, input_num_capsule]
122 | for i in range(self.routings):
123 | b = K.permute_dimensions(b, (0, 2, 1)) # shape = [None, input_num_capsule, num_capsule]
124 | c = K.softmax(b)
125 | c = K.permute_dimensions(c, (0, 2, 1))
126 | b = K.permute_dimensions(b, (0, 2, 1))
127 | outputs = self.activation(K.batch_dot(c, u_hat_vecs, [2, 2]))
128 | if i < self.routings - 1:
129 | b = K.batch_dot(outputs, u_hat_vecs, [2, 3])
130 |
131 | return outputs
132 |
133 | def compute_output_shape(self, input_shape):
134 | return (None, self.num_capsule, self.dim_capsule)
135 |
136 |
137 | class TextClassifier():
138 |
139 | def model(self, embeddings_matrix, maxlen, word_index, num_class):
140 | input1 = Input(shape=(maxlen,))
141 | embed_layer = Embedding(len(word_index) + 1,
142 | embeddings_matrix.shape[1],
143 | input_length=maxlen,
144 | weights=[embeddings_matrix],
145 | trainable=True)(input1)
146 | embed_layer = SpatialDropout1D(0.28)(embed_layer)
147 |
148 | x = Bidirectional(
149 | CuDNNGRU(128, return_sequences=True))(
150 | embed_layer)
151 | x = Activation('relu')(x)
152 | x = Dropout(0.25)(x)
153 | x = Bidirectional(
154 | CuDNNGRU(128, return_sequences=True))(
155 | x)
156 | x = Activation('relu')(x)
157 | x = Dropout(0.25)(x)
158 | capsule = Capsule(num_capsule=10, dim_capsule=16, routings=5,
159 | share_weights=True)(x)
160 | # output_capsule = Lambda(lambda x: K.sqrt(K.sum(K.square(x), 2)))(capsule)
161 | capsule = Flatten()(capsule)
162 | capsule = Dropout(0.25)(capsule)
163 | output = Dense(num_class, activation='sigmoid')(capsule)
164 | model = Model(inputs=input1, outputs=output)
165 | model.compile(
166 | loss='binary_crossentropy',
167 | optimizer='adam',
168 | metrics=["categorical_accuracy"])
169 | return model
170 |
171 |
--------------------------------------------------------------------------------
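# A minimal NumPy sketch of the squash() variant used in classifier_capsule.py above:
# it simply rescales each capsule vector to (approximately) unit length,
#   squash(x) = x / sqrt(sum(x^2) + eps),
# rather than the original CapsNet squashing factor shown commented out in that function.
# The capsule vectors below are made-up values.
import numpy as np

def squash(x, axis=-1, eps=1e-7):
    s_squared_norm = np.sum(np.square(x), axis=axis, keepdims=True)
    return x / np.sqrt(s_squared_norm + eps)

u = np.array([[3.0, 4.0], [0.1, 0.0]])     # two capsule vectors
v = squash(u)
print(np.linalg.norm(v, axis=-1))          # both close to 1.0: only the direction is kept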
/old/classifier_rcnn.py:
--------------------------------------------------------------------------------
1 | import keras
2 | from keras import Model
3 | from keras.layers import *
4 | from JoinAttLayer import Attention
5 |
6 |
7 | def precision(y_true, y_pred):
8 | """Precision metric.
9 | Only computes a batch-wise average of precision.
10 | Computes the precision, a metric for multi-label classification of
11 | how many selected items are relevant.
12 | """
13 | true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
14 | predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
15 | precision = true_positives / (predicted_positives + K.epsilon())
16 | return precision
17 |
18 |
19 | def recall(y_true, y_pred):
20 | """Recall metric.
21 | Only computes a batch-wise average of recall.
22 | Computes the recall, a metric for multi-label classification of
23 | how many relevant items are selected.
24 | """
25 | true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
26 | possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
27 | recall = true_positives / (possible_positives + K.epsilon())
28 | return recall
29 |
30 |
31 | def f1(y_true, y_pred, beta=1):
32 | """Computes the F score.
33 | The F score is the weighted harmonic mean of precision and recall.
34 | Here it is only computed as a batch-wise average, not globally.
35 | This is useful for multi-label classification, where input samples can be
36 | classified as sets of labels. By only using accuracy (precision) a model
37 | would achieve a perfect score by simply assigning every class to every
38 | input. In order to avoid this, a metric should penalize incorrect class
39 | assignments as well (recall). The F-beta score (ranged from 0.0 to 1.0)
40 | computes this, as a weighted mean of the proportion of correct class
41 | assignments vs. the proportion of incorrect class assignments.
42 | With beta = 1, this is equivalent to an F-measure. With beta < 1, assigning
43 | correct classes becomes more important, and with beta > 1 the metric is
44 | instead weighted towards penalizing incorrect class assignments. (A small numeric check of this formula follows this file.)
45 | """
46 | if beta < 0:
47 | raise ValueError('The lowest choosable beta is zero (only precision).')
48 |
49 | p = precision(y_true, y_pred)
50 | r = recall(y_true, y_pred)
51 | bb = beta ** 2
52 | fbeta_score = (1 + bb) * (p * r) / (bb * p + r + K.epsilon())
53 | return fbeta_score
54 |
55 |
56 | class TextClassifier():
57 |
58 | def model(self, embeddings_matrix, maxlen, word_index, num_class):
59 | inp = Input(shape=(maxlen,))
60 | encode = Bidirectional(GRU(1, return_sequences=True))
61 | encode2 = Bidirectional(GRU(1, return_sequences=True))
62 | attention = Attention(maxlen)
63 | x_4 = Embedding(len(word_index) + 1,
64 | embeddings_matrix.shape[1],
65 | weights=[embeddings_matrix],
66 | input_length=maxlen,
67 | trainable=True)(inp)
68 | x_3 = SpatialDropout1D(0.2)(x_4)
69 | x_3 = encode(x_3)
70 | x_3 = Dropout(0.2)(x_3)
71 | x_3 = encode2(x_3)
72 | x_3 = Dropout(0.2)(x_3)
73 | x_3 = Conv1D(64, kernel_size=3, padding="valid", kernel_initializer="glorot_uniform")(x_3)
74 | x_3 = Dropout(0.2)(x_3)
75 | avg_pool_3 = GlobalAveragePooling1D()(x_3)
76 | max_pool_3 = GlobalMaxPooling1D()(x_3)
77 | attention_3 = attention(x_3)
78 | x = keras.layers.concatenate([avg_pool_3, max_pool_3, attention_3])
79 | x = Dense(num_class, activation="sigmoid")(x)
80 |
81 | adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
82 | rmsprop = keras.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-06)
83 | model = Model(inputs=inp, outputs=x)
84 | model.compile(
85 | loss='categorical_crossentropy',
86 | optimizer=rmsprop
87 | )
88 | return model
89 |
--------------------------------------------------------------------------------
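# A quick numeric check of the batch-wise F-beta formula implemented in f1() above,
#   F_beta = (1 + beta^2) * P * R / (beta^2 * P + R),
# using made-up precision/recall values rather than real model predictions.
beta = 1
p, r = 0.5, 0.8
f_beta = (1 + beta ** 2) * (p * r) / (beta ** 2 * p + r + 1e-7)
print(round(f_beta, 4))   # ~0.6154: the harmonic mean of 0.5 and 0.8 for beta = 1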
/old/evaluate_char.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | from sklearn.metrics import f1_score, classification_report
3 |
4 |
5 | if __name__ == "__main__":
6 | validation_pred = pd.read_csv("validation_rcnn_char.csv")
7 | validation_real = pd.read_csv("preprocess/validation_char.csv")
8 | f_scores = 0
9 |
10 | # the 20 fine-grained aspect columns scored in this challenge
11 | aspect_columns = [
12 |     "location_traffic_convenience", "location_distance_from_business_district",
13 |     "location_easy_to_find", "service_wait_time", "service_waiters_attitude",
14 |     "service_parking_convenience", "service_serving_speed", "price_level",
15 |     "price_cost_effective", "price_discount", "environment_decoration",
16 |     "environment_noise", "environment_space", "environment_cleaness",
17 |     "dish_portion", "dish_taste", "dish_look", "dish_recommendation",
18 |     "others_overall_experience", "others_willing_to_consume_again",
19 | ]
20 |
21 | # per-aspect classification reports
22 | for column in aspect_columns:
23 |     print(classification_report(validation_real[column], validation_pred[column]))
24 |
25 | # per-aspect macro F1, accumulated for the final average (see the sketch after this file)
26 | for column in aspect_columns:
27 |     score = f1_score(validation_real[column], validation_pred[column], average="macro")
28 |     f_scores += score
29 |     print(score)
30 |
31 | print(f_scores / len(aspect_columns))
--------------------------------------------------------------------------------
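# A tiny illustration of the macro-averaged F1 used per aspect in evaluate_char.py above:
# "macro" averages the per-class F1 scores with equal weight, so rare sentiment classes
# count as much as frequent ones. Labels follow the {-2, -1, 0, 1} scheme used elsewhere
# in this repo; the predictions below are made-up values, not real model output.
from sklearn.metrics import f1_score

y_true = [-2, -1, 0, 1, 1, 0, -1, 1]
y_pred = [-2, 0, 0, 1, 0, 0, -1, 1]
print(f1_score(y_true, y_pred, average="macro"))   # unweighted mean of the 4 per-class F1s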
/old/predict_bigru_char.py:
--------------------------------------------------------------------------------
1 | from keras.backend.tensorflow_backend import set_session
2 | import tensorflow as tf
3 | config = tf.ConfigProto()
4 | config.gpu_options.allow_growth = True
5 | set_session(tf.Session(config=config))
6 | import gc
7 | import pandas as pd
8 | import pickle
9 | import numpy as np
10 | np.random.seed(16)
11 | from tensorflow import set_random_seed
12 | set_random_seed(16)
13 | from keras.layers import *
14 | from keras.preprocessing import sequence
15 | from gensim.models.keyedvectors import KeyedVectors
16 | from classifier_bigru import TextClassifier
17 |
18 |
19 | def getClassification(arr):
20 | arr = list(arr)
21 | if arr.index(max(arr)) == 0:
22 | return -2
23 | elif arr.index(max(arr)) == 1:
24 | return -1
25 | elif arr.index(max(arr)) == 2:
26 | return 0
27 | else:
28 | return 1
29 |
30 |
31 | if __name__ == "__main__":
32 | with open('tokenizer_char.pickle', 'rb') as handle:
33 | maxlen = 1000
34 | model_dir = "model_bigru_char/"
35 | tokenizer = pickle.load(handle)
36 | word_index = tokenizer.word_index
37 | validation = pd.read_csv("preprocess/test_char.csv")
38 | validation["content"] = validation.apply(lambda x: eval(x[1]), axis=1)
39 | X_test = validation["content"].values
40 | list_tokenized_validation = tokenizer.texts_to_sequences(X_test)
41 | input_validation = sequence.pad_sequences(list_tokenized_validation, maxlen=maxlen)
42 | w2_model = KeyedVectors.load_word2vec_format("word2vec/chars.vector", binary=True, encoding='utf8',
43 | unicode_errors='ignore')
44 | embeddings_index = {}
45 | embeddings_matrix = np.zeros((len(word_index) + 1, w2_model.vector_size))
46 | word2idx = {"_PAD": 0}
47 | vocab_list = [(k, w2_model.wv[k]) for k, v in w2_model.wv.vocab.items()]
48 | for word, i in word_index.items():
49 | if word in w2_model:
50 | embedding_vector = w2_model[word]
51 | else:
52 | embedding_vector = None
53 | if embedding_vector is not None:
54 | embeddings_matrix[i] = embedding_vector
55 |
56 | submit = pd.read_csv("ai_challenger_sentiment_analysis_testa_20180816/sentiment_analysis_testa.csv")
57 | submit_prob = pd.read_csv("ai_challenger_sentiment_analysis_testa_20180816/sentiment_analysis_testa.csv")
58 |
59 | model1 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
60 | model1.load_weights(model_dir + "model_ltc_01.hdf5")
61 | submit["location_traffic_convenience"] = list(map(getClassification, model1.predict(input_validation)))
62 | submit_prob["location_traffic_convenience"] = list(model1.predict(input_validation))
63 | del model1
64 | gc.collect()
65 | K.clear_session()
66 |
67 | model2 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
68 | model2.load_weights(model_dir + "model_ldfbd_01.hdf5")
69 | submit["location_distance_from_business_district"] = list(
70 | map(getClassification, model2.predict(input_validation)))
71 | submit_prob["location_distance_from_business_district"] = list(model2.predict(input_validation))
72 | del model2
73 | gc.collect()
74 | K.clear_session()
75 |
76 | model3 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
77 | model3.load_weights(model_dir + "model_letf_02.hdf5")
78 | submit["location_easy_to_find"] = list(map(getClassification, model3.predict(input_validation)))
79 | submit_prob["location_easy_to_find"] = list(model3.predict(input_validation))
80 | del model3
81 | gc.collect()
82 | K.clear_session()
83 |
84 | model4 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
85 | model4.load_weights(model_dir + "model_swt_02.hdf5")
86 | submit["service_wait_time"] = list(map(getClassification, model4.predict(input_validation)))
87 | submit_prob["service_wait_time"] = list(model4.predict(input_validation))
88 | del model4
89 | gc.collect()
90 | K.clear_session()
91 |
92 | model5 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
93 | model5.load_weights(model_dir + "model_swa_02.hdf5")
94 | submit["service_waiters_attitude"] = list(map(getClassification, model5.predict(input_validation)))
95 | submit_prob["service_waiters_attitude"] = list(model5.predict(input_validation))
96 | del model5
97 | gc.collect()
98 | K.clear_session()
99 |
100 | model6 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
101 | model6.load_weights(model_dir + "model_spc_02.hdf5")
102 | submit["service_parking_convenience"] = list(map(getClassification, model6.predict(input_validation)))
103 | submit_prob["service_parking_convenience"] = list(model6.predict(input_validation))
104 | del model6
105 | gc.collect()
106 | K.clear_session()
107 |
108 | model7 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
109 | model7.load_weights(model_dir + "model_ssp_02.hdf5")
110 | submit["service_serving_speed"] = list(map(getClassification, model7.predict(input_validation)))
111 | submit_prob["service_serving_speed"] = list(model7.predict(input_validation))
112 | del model7
113 | gc.collect()
114 | K.clear_session()
115 |
116 | model8 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
117 | model8.load_weights(model_dir + "model_pl_02.hdf5")
118 | submit["price_level"] = list(map(getClassification, model8.predict(input_validation)))
119 | submit_prob["price_level"] = list(model8.predict(input_validation))
120 | del model8
121 | gc.collect()
122 | K.clear_session()
123 |
124 | model9 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
125 | model9.load_weights(model_dir + "model_pce_02.hdf5")
126 | submit["price_cost_effective"] = list(map(getClassification, model9.predict(input_validation)))
127 | submit_prob["price_cost_effective"] = list(model9.predict(input_validation))
128 | del model9
129 | gc.collect()
130 | K.clear_session()
131 |
132 | model10 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
133 | model10.load_weights(model_dir + "model_pd_02.hdf5")
134 | submit["price_discount"] = list(map(getClassification, model10.predict(input_validation)))
135 | submit_prob["price_discount"] = list(model10.predict(input_validation))
136 | del model10
137 | gc.collect()
138 | K.clear_session()
139 |
140 | model11 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
141 | model11.load_weights(model_dir + "model_ed_01.hdf5")
142 | submit["environment_decoration"] = list(map(getClassification, model11.predict(input_validation)))
143 | submit_prob["environment_decoration"] = list(model11.predict(input_validation))
144 | del model11
145 | gc.collect()
146 | K.clear_session()
147 |
148 | model12 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
149 | model12.load_weights(model_dir + "model_en_02.hdf5")
150 | submit["environment_noise"] = list(map(getClassification, model12.predict(input_validation)))
151 | submit_prob["environment_noise"] = list(model12.predict(input_validation))
152 | del model12
153 | gc.collect()
154 | K.clear_session()
155 |
156 | model13 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
157 | model13.load_weights(model_dir + "model_es_02.hdf5")
158 | submit["environment_space"] = list(map(getClassification, model13.predict(input_validation)))
159 | submit_prob["environment_space"] = list(model13.predict(input_validation))
160 | del model13
161 | gc.collect()
162 | K.clear_session()
163 |
164 | model14 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
165 | model14.load_weights(model_dir + "model_ec_01.hdf5")
166 | submit["environment_cleaness"] = list(map(getClassification, model14.predict(input_validation)))
167 | submit_prob["environment_cleaness"] = list(model14.predict(input_validation))
168 | del model14
169 | gc.collect()
170 | K.clear_session()
171 |
172 | model15 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
173 | model15.load_weights(model_dir + "model_dp_01.hdf5")
174 | submit["dish_portion"] = list(map(getClassification, model15.predict(input_validation)))
175 | submit_prob["dish_portion"] = list(model15.predict(input_validation))
176 | del model15
177 | gc.collect()
178 | K.clear_session()
179 |
180 | model16 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
181 | model16.load_weights(model_dir + "model_dt_02.hdf5")
182 | submit["dish_taste"] = list(map(getClassification, model16.predict(input_validation)))
183 | submit_prob["dish_taste"] = list(model16.predict(input_validation))
184 | del model16
185 | gc.collect()
186 | K.clear_session()
187 |
188 | model17 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
189 | model17.load_weights(model_dir + "model_dl_02.hdf5")
190 | submit["dish_look"] = list(map(getClassification, model17.predict(input_validation)))
191 | submit_prob["dish_look"] = list(model17.predict(input_validation))
192 | del model17
193 | gc.collect()
194 | K.clear_session()
195 |
196 | model18 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
197 | model18.load_weights(model_dir + "model_dr_01.hdf5")
198 | submit["dish_recommendation"] = list(map(getClassification, model18.predict(input_validation)))
199 | submit_prob["dish_recommendation"] = list(model18.predict(input_validation))
200 | del model18
201 | gc.collect()
202 | K.clear_session()
203 |
204 | model19 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
205 | model19.load_weights(model_dir + "model_ooe_01.hdf5")
206 | submit["others_overall_experience"] = list(map(getClassification, model19.predict(input_validation)))
207 | submit_prob["others_overall_experience"] = list(model19.predict(input_validation))
208 | del model19
209 | gc.collect()
210 | K.clear_session()
211 |
212 | model20 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
213 | model20.load_weights(model_dir + "model_owta_02.hdf5")
214 | submit["others_willing_to_consume_again"] = list(map(getClassification, model20.predict(input_validation)))
215 | submit_prob["others_willing_to_consume_again"] = list(model20.predict(input_validation))
216 | del model20
217 | gc.collect()
218 | K.clear_session()
219 |
220 | submit.to_csv("baseline_bigru_char.csv", index=None)
221 | submit_prob.to_csv("baseline_bigru_char_prob.csv", index=None)
--------------------------------------------------------------------------------
/old/predict_rcnn_char.py:
--------------------------------------------------------------------------------
1 | from keras.backend.tensorflow_backend import set_session
2 | import tensorflow as tf
3 | config = tf.ConfigProto()
4 | config.gpu_options.allow_growth = True
5 | set_session(tf.Session(config=config))
6 | import gc
7 | import pandas as pd
8 | import pickle
9 | import numpy as np
10 | np.random.seed(16)
11 | from tensorflow import set_random_seed
12 | set_random_seed(16)
13 | from keras.layers import *
14 | from keras.preprocessing import sequence
15 | from gensim.models.keyedvectors import KeyedVectors
16 | from old.classifier_rcnn import TextClassifier
17 |
18 |
19 | def getClassification(arr):
20 | arr = list(arr)
21 | if arr.index(max(arr)) == 0:
22 | return -2
23 | elif arr.index(max(arr)) == 1:
24 | return -1
25 | elif arr.index(max(arr)) == 2:
26 | return 0
27 | else:
28 | return 1
29 |
30 |
31 | if __name__ == "__main__":
32 | with open('tokenizer_char.pickle', 'rb') as handle:
33 | maxlen = 1000
34 | model_dir = "model_rcnn_char/"
35 | tokenizer = pickle.load(handle)
36 | word_index = tokenizer.word_index
37 | validation = pd.read_csv("preprocess/test_char.csv")
38 | validation["content"] = validation.apply(lambda x: eval(x[1]), axis=1)
39 | X_test = validation["content"].values
40 | list_tokenized_validation = tokenizer.texts_to_sequences(X_test)
41 | input_validation = sequence.pad_sequences(list_tokenized_validation, maxlen=maxlen)
42 | w2_model = KeyedVectors.load_word2vec_format("word2vec/chars.vector", binary=True, encoding='utf8',
43 | unicode_errors='ignore')
44 | embeddings_index = {}
45 | embeddings_matrix = np.zeros((len(word_index) + 1, w2_model.vector_size))
46 | word2idx = {"_PAD": 0}
47 | vocab_list = [(k, w2_model.wv[k]) for k, v in w2_model.wv.vocab.items()]
48 | for word, i in word_index.items():
49 | if word in w2_model:
50 | embedding_vector = w2_model[word]
51 | else:
52 | embedding_vector = None
53 | if embedding_vector is not None:
54 | embeddings_matrix[i] = embedding_vector
55 |
56 | submit = pd.read_csv("ai_challenger_sentiment_analysis_testa_20180816/sentiment_analysis_testa.csv")
57 | submit_prob = pd.read_csv("ai_challenger_sentiment_analysis_testa_20180816/sentiment_analysis_testa.csv")
58 |
59 | model1 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
60 | model1.load_weights(model_dir + "model_ltc_02.hdf5")
61 | submit["location_traffic_convenience"] = list(map(getClassification, model1.predict(input_validation)))
62 | submit_prob["location_traffic_convenience"] = list(model1.predict(input_validation))
63 | del model1
64 | gc.collect()
65 | K.clear_session()
66 |
67 | model2 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
68 | model2.load_weights(model_dir + "model_ldfbd_02.hdf5")
69 | submit["location_distance_from_business_district"] = list(
70 | map(getClassification, model2.predict(input_validation)))
71 | submit_prob["location_distance_from_business_district"] = list(model2.predict(input_validation))
72 | del model2
73 | gc.collect()
74 | K.clear_session()
75 |
76 | model3 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
77 | model3.load_weights(model_dir + "model_letf_02.hdf5")
78 | submit["location_easy_to_find"] = list(map(getClassification, model3.predict(input_validation)))
79 | submit_prob["location_easy_to_find"] = list(model3.predict(input_validation))
80 | del model3
81 | gc.collect()
82 | K.clear_session()
83 |
84 | model4 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
85 | model4.load_weights(model_dir + "model_swt_02.hdf5")
86 | submit["service_wait_time"] = list(map(getClassification, model4.predict(input_validation)))
87 | submit_prob["service_wait_time"] = list(model4.predict(input_validation))
88 | del model4
89 | gc.collect()
90 | K.clear_session()
91 |
92 | model5 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
93 | model5.load_weights(model_dir + "model_swa_02.hdf5")
94 | submit["service_waiters_attitude"] = list(map(getClassification, model5.predict(input_validation)))
95 | submit_prob["service_waiters_attitude"] = list(model5.predict(input_validation))
96 | del model5
97 | gc.collect()
98 | K.clear_session()
99 |
100 | model6 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
101 | model6.load_weights(model_dir + "model_spc_01.hdf5")
102 | submit["service_parking_convenience"] = list(map(getClassification, model6.predict(input_validation)))
103 | submit_prob["service_parking_convenience"] = list(model6.predict(input_validation))
104 | del model6
105 | gc.collect()
106 | K.clear_session()
107 |
108 | model7 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
109 | model7.load_weights(model_dir + "model_ssp_02.hdf5")
110 | submit["service_serving_speed"] = list(map(getClassification, model7.predict(input_validation)))
111 | submit_prob["service_serving_speed"] = list(model7.predict(input_validation))
112 | del model7
113 | gc.collect()
114 | K.clear_session()
115 |
116 | model8 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
117 | model8.load_weights(model_dir + "model_pl_02.hdf5")
118 | submit["price_level"] = list(map(getClassification, model8.predict(input_validation)))
119 | submit_prob["price_level"] = list(model8.predict(input_validation))
120 | del model8
121 | gc.collect()
122 | K.clear_session()
123 |
124 | model9 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
125 | model9.load_weights(model_dir + "model_pce_02.hdf5")
126 | submit["price_cost_effective"] = list(map(getClassification, model9.predict(input_validation)))
127 | submit_prob["price_cost_effective"] = list(model9.predict(input_validation))
128 | del model9
129 | gc.collect()
130 | K.clear_session()
131 |
132 | model10 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
133 | model10.load_weights(model_dir + "model_pd_02.hdf5")
134 | submit["price_discount"] = list(map(getClassification, model10.predict(input_validation)))
135 | submit_prob["price_discount"] = list(model10.predict(input_validation))
136 | del model10
137 | gc.collect()
138 | K.clear_session()
139 |
140 | model11 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
141 | model11.load_weights(model_dir + "model_ed_02.hdf5")
142 | submit["environment_decoration"] = list(map(getClassification, model11.predict(input_validation)))
143 | submit_prob["environment_decoration"] = list(model11.predict(input_validation))
144 | del model11
145 | gc.collect()
146 | K.clear_session()
147 |
148 | model12 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
149 | model12.load_weights(model_dir + "model_en_02.hdf5")
150 | submit["environment_noise"] = list(map(getClassification, model12.predict(input_validation)))
151 | submit_prob["environment_noise"] = list(model12.predict(input_validation))
152 | del model12
153 | gc.collect()
154 | K.clear_session()
155 |
156 | model13 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
157 | model13.load_weights(model_dir + "model_es_01.hdf5")
158 | submit["environment_space"] = list(map(getClassification, model13.predict(input_validation)))
159 | submit_prob["environment_space"] = list(model13.predict(input_validation))
160 | del model13
161 | gc.collect()
162 | K.clear_session()
163 |
164 | model14 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
165 | model14.load_weights(model_dir + "model_ec_02.hdf5")
166 | submit["environment_cleaness"] = list(map(getClassification, model14.predict(input_validation)))
167 | submit_prob["environment_cleaness"] = list(model14.predict(input_validation))
168 | del model14
169 | gc.collect()
170 | K.clear_session()
171 |
172 | model15 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
173 | model15.load_weights(model_dir + "model_dp_02.hdf5")
174 | submit["dish_portion"] = list(map(getClassification, model15.predict(input_validation)))
175 | submit_prob["dish_portion"] = list(model15.predict(input_validation))
176 | del model15
177 | gc.collect()
178 | K.clear_session()
179 |
180 | model16 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
181 | model16.load_weights(model_dir + "model_dt_02.hdf5")
182 | submit["dish_taste"] = list(map(getClassification, model16.predict(input_validation)))
183 | submit_prob["dish_taste"] = list(model16.predict(input_validation))
184 | del model16
185 | gc.collect()
186 | K.clear_session()
187 |
188 | model17 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
189 | model17.load_weights(model_dir + "model_dl_02.hdf5")
190 | submit["dish_look"] = list(map(getClassification, model17.predict(input_validation)))
191 | submit_prob["dish_look"] = list(model17.predict(input_validation))
192 | del model17
193 | gc.collect()
194 | K.clear_session()
195 |
196 | model18 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
197 | model18.load_weights(model_dir + "model_dr_02.hdf5")
198 | submit["dish_recommendation"] = list(map(getClassification, model18.predict(input_validation)))
199 | submit_prob["dish_recommendation"] = list(model18.predict(input_validation))
200 | del model18
201 | gc.collect()
202 | K.clear_session()
203 |
204 | model19 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
205 | model19.load_weights(model_dir + "model_ooe_02.hdf5")
206 | submit["others_overall_experience"] = list(map(getClassification, model19.predict(input_validation)))
207 | submit_prob["others_overall_experience"] = list(model19.predict(input_validation))
208 | del model19
209 | gc.collect()
210 | K.clear_session()
211 |
212 | model20 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
213 | model20.load_weights(model_dir + "model_owta_02.hdf5")
214 | submit["others_willing_to_consume_again"] = list(map(getClassification, model20.predict(input_validation)))
215 | submit_prob["others_willing_to_consume_again"] = list(model20.predict(input_validation))
216 | del model20
217 | gc.collect()
218 | K.clear_session()
219 |
220 | submit.to_csv("baseline_rcnn_char.csv", index=None)
221 | submit_prob.to_csv("baseline_rcnn_char_prob.csv", index=None)
--------------------------------------------------------------------------------
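
Note on the predict/validation scripts in this folder: each one repeats the same build -> load_weights -> predict -> release block twenty times, once per aspect column, and every block calls predict twice on the same inputs. Below is a minimal, table-driven sketch of the same loop; it is not part of the repository, the ASPECT_WEIGHTS mapping is abbreviated for illustration (the real checkpoint names are the ones loaded in the scripts), and build_model / to_label are stand-ins for TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) and getClassification.

    import gc
    from keras import backend as K

    # Illustrative mapping: output column -> checkpoint file (only two entries shown).
    ASPECT_WEIGHTS = {
        "location_traffic_convenience": "model_ltc_02.hdf5",
        "location_distance_from_business_district": "model_ldfbd_02.hdf5",
        # ... one entry per remaining aspect column ...
    }

    def predict_all_aspects(build_model, to_label, model_dir, inputs, submit, submit_prob):
        """Predict each aspect with its own checkpoint, releasing the model in between."""
        for column, weight_file in ASPECT_WEIGHTS.items():
            model = build_model()                  # e.g. TextClassifier().model(...)
            model.load_weights(model_dir + weight_file)
            probs = model.predict(inputs)          # predict once and reuse for both outputs
            submit[column] = list(map(to_label, probs))
            submit_prob[column] = list(probs)
            del model
            gc.collect()
            K.clear_session()
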
/old/stopwords.txt:
--------------------------------------------------------------------------------
1 | …
2 | ~
3 | )
4 | (
5 | ”
6 | “
7 | "
8 | 、
9 | :
10 | \n
11 | ~
12 | ?
13 | ,
14 | *
15 | ,
16 | \ufeff
17 | s
18 | 。
19 | .
20 | -
21 | _
22 | '
23 | =
24 | ?
25 | ·
26 | @
27 | !
28 | ^
29 | &
30 | (
31 | )
32 | |
33 | \
34 | +
35 | -
36 | $
37 | 【
38 | 】
39 | ^
40 | _
41 | `
42 | #
43 | $
44 | %
45 | &
46 | >
47 | <
48 | [
49 | ]
50 | 》
51 | 《
52 | /
53 | $
54 | 0
55 | 1
56 | 2
57 | 3
58 | 4
59 | 5
60 | 6
61 | 7
62 | 8
63 | 9
64 | ?
65 | _
66 | “
67 | ”
68 | 、
69 | 。
70 | 《
71 | 》
72 | ———
73 | 》),
74 | )÷(1-
75 | ”,
76 | )、
77 | =(
78 | :
79 | →
80 | ℃
81 | &
82 | *
83 | 一一
84 | ~~~~
85 | ’
86 | .
87 | 『
88 | .一
89 | ./
90 | --
91 | 』
92 | =″
93 | 【
94 | [*]
95 | }>
96 | [⑤]]
97 | [①D]
98 | c]
99 | ng昉
100 | *
101 | //
102 | [
103 | ]
104 | [②e]
105 | [②g]
106 | ={
107 | }
108 | ,也
109 | ‘
110 | A
111 | [①⑥]
112 | [②B]
113 | [①a]
114 | [④a]
115 | [①③]
116 | [③h]
117 | ③]
118 | 1.
119 | --
120 | [②b]
121 | ’‘
122 | ×××
123 | [①⑧]
124 | 0:2
125 | =[
126 | [⑤b]
127 | [②c]
128 | [④b]
129 | [②③]
130 | [③a]
131 | [④c]
132 | [①⑤]
133 | [①⑦]
134 | [①g]
135 | ∈[
136 | [①⑨]
137 | [①④]
138 | [①c]
139 | [②f]
140 | [②⑧]
141 | [②①]
142 | [①C]
143 | [③c]
144 | [③g]
145 | [②⑤]
146 | [②②]
147 | 一.
148 | [①h]
149 | .数
150 | []
151 | [①B]
152 | 数/
153 | [①i]
154 | [③e]
155 | [①①]
156 | [④d]
157 | [④e]
158 | [③b]
159 | [⑤a]
160 | [①A]
161 | [②⑧]
162 | [②⑦]
163 | [①d]
164 | [②j]
165 | 〕〔
166 | ][
167 | ://
168 | ′∈
169 | [②④
170 | [⑤e]
171 | 12%
172 | b]
173 | ...
174 | ...................
175 | …………………………………………………③
176 | ZXFITL
177 | [③F]
178 | 」
179 | [①o]
180 | ]∧′=[
181 | ∪φ∈
182 | ′|
183 | {-
184 | ②c
185 | }
186 | [③①]
187 | R.L.
188 | [①E]
189 | Ψ
190 | -[*]-
191 | ↑
192 | .日
193 | [②d]
194 | [②
195 | [②⑦]
196 | [②②]
197 | [③e]
198 | [①i]
199 | [①B]
200 | [①h]
201 | [①d]
202 | [①g]
203 | [①②]
204 | [②a]
205 | f]
206 | [⑩]
207 | a]
208 | [①e]
209 | [②h]
210 | [②⑥]
211 | [③d]
212 | [②⑩]
213 | e]
214 | 〉
215 | 】
216 | 元/吨
217 | [②⑩]
218 | 2.3%
219 | 5:0
220 | [①]
221 | ::
222 | [②]
223 | [③]
224 | [④]
225 | [⑤]
226 | [⑥]
227 | [⑦]
228 | [⑧]
229 | [⑨]
230 | ……
231 | ——
232 | ?
233 | 、
234 | 。
235 | “
236 | ”
237 | 《
238 | 》
239 | !
240 | ,
241 | :
242 | ;
243 | ?
244 | .
245 | ,
246 | .
247 | '
248 | ?
249 | ·
250 | ———
251 | ──
252 | ?
253 | —
254 | <
255 | >
256 | (
257 | )
258 | 〔
259 | 〕
260 | [
261 | ]
262 | (
263 | )
264 | -
265 | +
266 | ~
267 | ×
268 | /
269 | /
270 | ①
271 | ②
272 | ③
273 | ④
274 | ⑤
275 | ⑥
276 | ⑦
277 | ⑧
278 | ⑨
279 | ⑩
280 | Ⅲ
281 | В
282 | "
283 | ;
284 | #
285 | @
286 | γ
287 | μ
288 | φ
289 | φ.
290 | ×
291 | Δ
292 | ■
293 | ▲
294 | sub
295 | exp
296 | sup
297 | sub
298 | Lex
299 | #
300 | %
301 | &
302 | '
303 | +
304 | +ξ
305 | ++
306 | -
307 | -β
308 | <
309 | <±
310 | <Δ
311 | <λ
312 | <φ
313 | <<
314 | =
315 | =
316 | =☆
317 | =-
318 | >
319 | >λ
320 | _
321 | ~±
322 | ~+
323 | [⑤f]
324 | [⑤d]
325 | [②i]
326 | ≈
327 | [②G]
328 | [①f]
329 | LI
330 | ㈧
331 | [-
332 | ......
333 | 〉
334 | [③⑩]
335 | 第二
336 | 一番
337 | 一直
338 | 一个
339 | 一些
340 | 许多
341 | 种
342 | 有的是
343 | 也就是说
344 | 末##末
345 | 啊
346 | 阿
347 | 哎
348 | 哎呀
349 | 哎哟
350 | 唉
351 | 俺
352 | 俺们
353 | 按
354 | 按照
355 | 吧
356 | 吧哒
357 | 把
358 | 罢了
359 | 被
360 | 本
361 | 本着
362 | 比
363 | 比方
364 | 比如
365 | 鄙人
366 | 彼
367 | 彼此
368 | 边
369 | 别
370 | 别的
371 | 别说
372 | 并
373 | 并且
374 | 不比
375 | 不成
376 | 不单
377 | 不但
378 | 不独
379 | 不管
380 | 不光
381 | 不过
382 | 不仅
383 | 不拘
384 | 不论
385 | 不怕
386 | 不然
387 | 不如
388 | 不特
389 | 不惟
390 | 不问
391 | 不只
392 | 朝
393 | 朝着
394 | 趁
395 | 趁着
396 | 乘
397 | 冲
398 | 除
399 | 除此之外
400 | 除非
401 | 除了
402 | 此
403 | 此间
404 | 此外
405 | 从
406 | 从而
407 | 打
408 | 待
409 | 但
410 | 但是
411 | 当
412 | 当着
413 | 到
414 | 得
415 | 的
416 | 的话
417 | 等
418 | 等等
419 | 地
420 | 第
421 | 叮咚
422 | 对
423 | 对于
424 | 多
425 | 多少
426 | 而
427 | 而况
428 | 而且
429 | 而是
430 | 而外
431 | 而言
432 | 而已
433 | 尔后
434 | 反过来
435 | 反过来说
436 | 反之
437 | 非但
438 | 非徒
439 | 否则
440 | 嘎
441 | 嘎登
442 | 该
443 | 赶
444 | 个
445 | 各
446 | 各个
447 | 各位
448 | 各种
449 | 各自
450 | 给
451 | 根据
452 | 跟
453 | 故
454 | 故此
455 | 固然
456 | 关于
457 | 管
458 | 归
459 | 果然
460 | 果真
461 | 过
462 | 哈
463 | 哈哈
464 | 呵
465 | 和
466 | 何
467 | 何处
468 | 何况
469 | 何时
470 | 嘿
471 | 哼
472 | 哼唷
473 | 呼哧
474 | 乎
475 | 哗
476 | 还是
477 | 还有
478 | 换句话说
479 | 换言之
480 | 或
481 | 或是
482 | 或者
483 | 极了
484 | 及
485 | 及其
486 | 及至
487 | 即
488 | 即便
489 | 即或
490 | 即令
491 | 即若
492 | 即使
493 | 几
494 | 几时
495 | 己
496 | 既
497 | 既然
498 | 既是
499 | 继而
500 | 加之
501 | 假如
502 | 假若
503 | 假使
504 | 鉴于
505 | 将
506 | 较
507 | 较之
508 | 叫
509 | 接着
510 | 结果
511 | 借
512 | 紧接着
513 | 进而
514 | 尽
515 | 尽管
516 | 经
517 | 经过
518 | 就
519 | 就是
520 | 就是说
521 | 据
522 | 具体地说
523 | 具体说来
524 | 开始
525 | 开外
526 | 靠
527 | 咳
528 | 可
529 | 可见
530 | 可是
531 | 可以
532 | 况且
533 | 啦
534 | 来
535 | 来着
536 | 离
537 | 例如
538 | 哩
539 | 连
540 | 连同
541 | 两者
542 | 了
543 | 临
544 | 另
545 | 另外
546 | 另一方面
547 | 论
548 | 嘛
549 | 吗
550 | 慢说
551 | 漫说
552 | 冒
553 | 么
554 | 每
555 | 每当
556 | 们
557 | 莫若
558 | 某
559 | 某个
560 | 某些
561 | 拿
562 | 哪
563 | 哪边
564 | 哪儿
565 | 哪个
566 | 哪里
567 | 哪年
568 | 哪怕
569 | 哪天
570 | 哪些
571 | 哪样
572 | 那
573 | 那边
574 | 那儿
575 | 那个
576 | 那会儿
577 | 那里
578 | 那么
579 | 那么些
580 | 那么样
581 | 那时
582 | 那些
583 | 那样
584 | 乃
585 | 乃至
586 | 呢
587 | 能
588 | 你
589 | 你们
590 | 您
591 | 宁
592 | 宁可
593 | 宁肯
594 | 宁愿
595 | 哦
596 | 呕
597 | 啪达
598 | 旁人
599 | 呸
600 | 凭
601 | 凭借
602 | 其
603 | 其次
604 | 其二
605 | 其他
606 | 其它
607 | 其一
608 | 其余
609 | 其中
610 | 起
611 | 起见
612 | 起见
613 | 岂但
614 | 恰恰相反
615 | 前后
616 | 前者
617 | 且
618 | 然而
619 | 然后
620 | 然则
621 | 让
622 | 人家
623 | 任
624 | 任何
625 | 任凭
626 | 如
627 | 如此
628 | 如果
629 | 如何
630 | 如其
631 | 如若
632 | 如上所述
633 | 若
634 | 若非
635 | 若是
636 | 啥
637 | 上下
638 | 尚且
639 | 设若
640 | 设使
641 | 甚而
642 | 甚么
643 | 甚至
644 | 省得
645 | 时候
646 | 什么
647 | 什么样
648 | 使得
649 | 是
650 | 是的
651 | 首先
652 | 谁
653 | 谁知
654 | 顺
655 | 顺着
656 | 似的
657 | 虽
658 | 虽然
659 | 虽说
660 | 虽则
661 | 随
662 | 随着
663 | 所
664 | 所以
665 | 他
666 | 他们
667 | 他人
668 | 它
669 | 它们
670 | 她
671 | 她们
672 | 倘
673 | 倘或
674 | 倘然
675 | 倘若
676 | 倘使
677 | 腾
678 | 替
679 | 通过
680 | 同
681 | 同时
682 | 哇
683 | 万一
684 | 往
685 | 望
686 | 为
687 | 为何
688 | 为了
689 | 为什么
690 | 为着
691 | 喂
692 | 嗡嗡
693 | 我
694 | 我们
695 | 呜
696 | 呜呼
697 | 乌乎
698 | 无论
699 | 无宁
700 | 毋宁
701 | 嘻
702 | 吓
703 | 相对而言
704 | 像
705 | 向
706 | 向着
707 | 嘘
708 | 呀
709 | 焉
710 | 沿
711 | 沿着
712 | 要
713 | 要不
714 | 要不然
715 | 要不是
716 | 要么
717 | 要是
718 | 也
719 | 也罢
720 | 也好
721 | 一
722 | 一般
723 | 一旦
724 | 一方面
725 | 一来
726 | 一切
727 | 一样
728 | 一则
729 | 依
730 | 依照
731 | 矣
732 | 以
733 | 以便
734 | 以及
735 | 以免
736 | 以至
737 | 以至于
738 | 以致
739 | 抑或
740 | 因
741 | 因此
742 | 因而
743 | 因为
744 | 哟
745 | 用
746 | 由
747 | 由此可见
748 | 由于
749 | 有
750 | 有的
751 | 有关
752 | 有些
753 | 又
754 | 于
755 | 于是
756 | 于是乎
757 | 与
758 | 与此同时
759 | 与否
760 | 与其
761 | 越是
762 | 云云
763 | 哉
764 | 再说
765 | 再者
766 | 在
767 | 在下
768 | 咱
769 | 咱们
770 | 则
771 | 怎
772 | 怎么
773 | 怎么办
774 | 怎么样
775 | 怎样
776 | 咋
777 | 照
778 | 照着
779 | 者
780 | 这
781 | 这边
782 | 这儿
783 | 这个
784 | 这会儿
785 | 这就是说
786 | 这里
787 | 这么
788 | 这么点儿
789 | 这么些
790 | 这么样
791 | 这时
792 | 这些
793 | 这样
794 | 正如
795 | 吱
796 | 之
797 | 之类
798 | 之所以
799 | 之一
800 | 只是
801 | 只限
802 | 只要
803 | 只有
804 | 至
805 | 至于
806 | 诸位
807 | 着
808 | 着呢
809 | 自
810 | 自从
811 | 自个儿
812 | 自各儿
813 | 自己
814 | 自家
815 | 自身
816 | 综上所述
817 | 总的来看
818 | 总的来说
819 | 总的说来
820 | 总而言之
821 | 总之
822 | 纵
823 | 纵令
824 | 纵然
825 | 纵使
826 | 遵照
827 | 作为
828 | 兮
829 | 呃
830 | 呗
831 | 咚
832 | 咦
833 | 喏
834 | 啐
835 | 喔唷
836 | 嗬
837 | 嗯
838 | 嗳
839 |
840 |
841 |
842 |
--------------------------------------------------------------------------------
/old/temp_covert.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import json
3 | import random
4 |
5 | dict_unique={}
6 |
7 | dict_type_ignore_count={'train':0,'valid':0,'test':0}
8 | def transform_data_to_fasttext_format(file_path,target_path,data_type):
9 | file_object=open(file_path,'r')
10 | target_object=open(target_path,'w')
11 | lines=file_object.readlines()
12 | print("length of lines:",len(lines))
13 | random.shuffle(lines)
14 | for i,line in enumerate(lines):
15 | json_string=json.loads(line)
16 | accusation_list=json_string['meta']['accusation']
17 | fact=json_string['fact'].strip('\n\r').replace("\n","").replace("\r","")
18 | unique_value=dict_unique.get(fact,None)
19 | if unique_value is None: # not seen before: record it and process this line
20 | dict_unique[fact] = fact
21 | else: # otherwise, ignore
22 | print("going to ignore.",data_type,fact)
23 | dict_type_ignore_count[data_type]=dict_type_ignore_count[data_type]+1
24 | continue
25 | length_accusation=len(accusation_list)
26 | #if length_accusation>1:
27 | #print("accusation_list:",str(accusation_list))
28 | #print("json_string:",json_string)
29 | accusation_strings=''
30 | for accusation in accusation_list:
31 | accusation_strings+=' __label__'+accusation
32 | target_object.write(fact+accusation_strings+"\n")
33 | target_object.close()
34 | file_object.close()
35 | print("dict_type_ignore_count:",dict_type_ignore_count[data_type])
36 |
37 | file_path='./data/cail2018/data_valid_checked.json'
38 | target_path='./data/data_valid2.txt'
39 | transform_data_to_fasttext_format(file_path,target_path,'valid')
40 |
41 | file_path='./data/cail2018/data_test.json'
42 | target_path='./data/data_test2.txt'
43 | transform_data_to_fasttext_format(file_path,target_path,'test')
44 |
45 | file_path='./data/cail2018/cail2018_big_downsmapled.json'
46 | target_path='./data/data_train2.txt'
47 | transform_data_to_fasttext_format(file_path,target_path,'train')
48 |
49 |
--------------------------------------------------------------------------------
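
transform_data_to_fasttext_format above writes, for each unique fact, one line consisting of the fact text followed by one " __label__<accusation>" suffix per accusation. A minimal sketch of reading such a line back into (text, labels); the function name and example labels are made up for illustration:

    def parse_fasttext_line(line):
        """Split a line written by transform_data_to_fasttext_format into (text, labels)."""
        parts = line.rstrip("\n").split(" __label__")
        return parts[0], parts[1:]

    # e.g. parse_fasttext_line("some fact text __label__A __label__B")
    # returns ("some fact text", ["A", "B"])
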
/old/train_transform.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | # process ---> 1.load data (X: list of int, y: int). 2.create session. 3.feed data. 4.training (5.validation), (6.prediction)
3 | """
4 |
5 | train the model (transformer) with data enhanced by pre-training on two tasks.
6 | the default hyperparameters are d_model=512, h=8, d_k=d_v=64 (big). if you have a small dataset or want to train a
7 | small model, use d_model=128, h=8, d_k=d_v=16 (small), or d_model=64, h=8, d_k=d_v=8 (tiny).
8 | """
9 | #import sys
10 | #reload(sys)
11 | #sys.setdefaultencoding('utf8')
12 | import tensorflow as tf
13 | import numpy as np
14 | from model.transfomer_model import TransformerModel
15 | from data_util_hdf5 import create_or_load_vocabulary,load_data_multilabel,assign_pretrained_word_embedding,set_config
16 | import os
17 | from evaluation_matrix import *
18 | from model.config_transformer import Config
19 | #configuration
20 | FLAGS=tf.app.flags.FLAGS
21 |
22 | # you can change as you like
23 | tf.app.flags.DEFINE_boolean("test_mode",False,"whether it is test mode. if it is test mode, only small percentage of data will be used")
24 | tf.app.flags.DEFINE_string("data_path","./data/","path of traning data.")
25 | tf.app.flags.DEFINE_string("training_data_file","./data/bert_train.txt","path of traning data.") #./data/cail2018_bi.json
26 | tf.app.flags.DEFINE_string("valid_data_file","./data/bert_test.txt","path of validation data.")
27 | tf.app.flags.DEFINE_string("test_data_file","./data/bert_test.txt","path of validation data.")
28 | tf.app.flags.DEFINE_integer("d_model", 64, "dimension of model") # 512-->128-->64
29 | tf.app.flags.DEFINE_integer("num_layer", 6, "number of layer")
30 | tf.app.flags.DEFINE_integer("num_header", 8, "number of header")
31 | tf.app.flags.DEFINE_integer("d_k", 8, "dimension of k") # 64-->16-->8
32 | tf.app.flags.DEFINE_integer("d_v", 8, "dimension of v") # 64-->16-->8
33 |
34 | # below hyperparameter you can use default one, seldom change
35 | tf.app.flags.DEFINE_string("ckpt_dir","./checkpoint_transformer/","checkpoint location for the model") #save to here, so make it easy to upload for test
36 | tf.app.flags.DEFINE_string("tokenize_style","word","checkpoint location for the model")
37 | tf.app.flags.DEFINE_integer("vocab_size",50002,"maximum vocab size.")
38 | tf.app.flags.DEFINE_float("learning_rate",0.0001,"learning rate") #0.001
39 | tf.app.flags.DEFINE_integer("batch_size", 64, "Batch size for training/evaluating.") # 32-->128
40 | tf.app.flags.DEFINE_integer("decay_steps", 1000, "how many steps before decay learning rate.") # 32-->128
41 | tf.app.flags.DEFINE_float("decay_rate", 1.0, "Rate of decay for learning rate.") #0.65
42 | tf.app.flags.DEFINE_float("dropout_keep_prob", 0.9, "percentage to keep when using dropout.") #0.65
43 | tf.app.flags.DEFINE_integer("sequence_length",200,"max sentence length")#400
44 | tf.app.flags.DEFINE_boolean("is_training",True,"is training.true:tranining,false:testing/inference")
45 | tf.app.flags.DEFINE_integer("num_epochs",30,"number of epochs to run.")
46 | tf.app.flags.DEFINE_integer("process_num",3,"number of cpu used")
47 | tf.app.flags.DEFINE_integer("validate_every", 1, "Validate every validate_every epochs.") #
48 | tf.app.flags.DEFINE_boolean("use_pretrained_embedding",False,"whether to use embedding or not.")#
49 | tf.app.flags.DEFINE_string("word2vec_model_path","./data/Tencent_AILab_ChineseEmbedding_100w.txt","word2vec's vocabulary and vectors") # data/sgns.target.word-word.dynwin5.thr10.neg5.dim300.iter5--->data/news_12g_baidubaike_20g_novel_90g_embedding_64.bin--->sgns.merge.char
50 | tf.app.flags.DEFINE_integer("sequence_length_lm",10,"max sentence length of language model")
51 | tf.app.flags.DEFINE_boolean("is_fine_tuning",False,"is_finetuning.ture:this is fine-tuning stage")
52 |
53 | def main(_):
54 | vocab_word2index, label2index= create_or_load_vocabulary(FLAGS.data_path,FLAGS.training_data_file,FLAGS.vocab_size,test_mode=FLAGS.test_mode,tokenize_style=FLAGS.tokenize_style,model_name='transfomer')
55 | vocab_size = len(vocab_word2index);print("cnn_model.vocab_size:",vocab_size);num_classes=len(label2index);print("num_classes:",num_classes)
56 | train,valid, test= load_data_multilabel(FLAGS.data_path,FLAGS.training_data_file,FLAGS.valid_data_file,FLAGS.test_data_file,vocab_word2index,label2index,FLAGS.sequence_length,
57 | process_num=FLAGS.process_num,test_mode=FLAGS.test_mode,tokenize_style=FLAGS.tokenize_style,model_name='transfomer')
58 | train_X, train_Y= train
59 | valid_X, valid_Y= valid
60 | test_X,test_Y = test
61 | print("Test_mode:",FLAGS.test_mode,";length of training data:",train_X.shape,";valid data:",valid_X.shape,";test data:",test_X.shape,";train_Y:",train_Y.shape)
62 | # 1.create session.
63 | gpu_config=tf.ConfigProto()
64 | gpu_config.gpu_options.allow_growth=True
65 | with tf.Session(config=gpu_config) as sess:
66 | #Instantiate Model
67 | config=set_config(FLAGS,num_classes,vocab_size)
68 | model=TransformerModel(config)
69 | #Initialize Save
70 | saver=tf.train.Saver()
71 | if os.path.exists(FLAGS.ckpt_dir+"checkpoint"):
72 | print("Restoring Variables from Checkpoint.")
73 | saver.restore(sess,tf.train.latest_checkpoint(FLAGS.ckpt_dir))
74 | #for i in range(2): #decay learning rate if necessary.
75 | # print(i,"Going to decay learning rate by half.")
76 | # sess.run(model.learning_rate_decay_half_op)
77 | else:
78 | print('Initializing Variables')
79 | sess.run(tf.global_variables_initializer())
80 | if FLAGS.use_pretrained_embedding:
81 | vocabulary_index2word={index:word for word,index in vocab_word2index.items()}
82 | assign_pretrained_word_embedding(sess, vocabulary_index2word, vocab_size,FLAGS.word2vec_model_path,model.embedding,config.d_model) # assign pretrained word embeddings
83 | curr_epoch=sess.run(model.epoch_step)
84 | # 2.feed data & training
85 | number_of_training_data=len(train_X)
86 | batch_size=FLAGS.batch_size
87 | iteration=0
88 | score_best=-100
89 | f1_score=0
90 | for epoch in range(curr_epoch,FLAGS.num_epochs):
91 | loss_total, counter = 0.0, 0
92 | for start, end in zip(range(0, number_of_training_data, batch_size),range(batch_size, number_of_training_data, batch_size)):
93 | iteration=iteration+1
94 | if epoch==0 and counter==0:
95 | print("trainX[start:end]:",train_X[start:end],"train_X.shape:",train_X.shape)
96 | feed_dict = {model.input_x: train_X[start:end],model.input_y:train_Y[start:end],model.dropout_keep_prob: FLAGS.dropout_keep_prob}
97 | current_loss,lr,l2_loss,_=sess.run([model.loss_val,model.learning_rate,model.l2_loss,model.train_op],feed_dict)
98 | loss_total,counter=loss_total+current_loss,counter+1
99 | if counter %30==0:
100 | print("Learning rate:%.5f\tLoss:%.3f\tCurrent_loss:%.3f\tL2_loss%.3f\t"%(lr,float(loss_total)/float(counter),current_loss,l2_loss))
101 | if start!=0 and start%(3000*FLAGS.batch_size)==0:
102 | loss_valid, f1_macro_valid, f1_micro_valid= do_eval(sess, model, valid,num_classes,label2index)
103 | f1_score_valid=((f1_macro_valid+f1_micro_valid)/2.0)*100.0
104 | print("Valid.Epoch %d ValidLoss:%.3f\tF1_score_valid:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t" % (epoch, loss_valid, f1_score_valid, f1_macro_valid, f1_micro_valid))
105 |
106 | # save model to checkpoint
107 | if f1_score_valid>score_best:
108 | save_path = FLAGS.ckpt_dir + "model.ckpt"
109 | print("going to save check point.")
110 | saver.save(sess, save_path, global_step=epoch)
111 | score_best=f1_score_valid
112 | #epoch increment
113 | print("going to increment epoch counter....")
114 | sess.run(model.epoch_increment)
115 |
116 | # 4.validation
117 | print(epoch,FLAGS.validate_every,(epoch % FLAGS.validate_every==0))
118 | if epoch % FLAGS.validate_every==0:
119 | loss_valid,f1_macro_valid2,f1_micro_valid2=do_eval(sess,model,valid,num_classes,label2index)
120 | f1_score_valid2 = ((f1_macro_valid2 + f1_micro_valid2) / 2.0) #* 100.0
121 | print("Valid.Epoch %d ValidLoss:%.3f\tF1 score:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t"% (epoch,loss_valid,f1_score_valid2,f1_macro_valid2,f1_micro_valid2))
122 | #save model to checkpoint
123 | if f1_score_valid2 > score_best:
124 | save_path=FLAGS.ckpt_dir+"model.ckpt"
125 | print("going to save check point.")
126 | saver.save(sess,save_path,global_step=epoch)
127 | score_best = f1_score_valid2
128 | if (epoch == 2 or epoch == 4 or epoch == 6 or epoch == 9 or epoch == 13):
129 | for i in range(1):
130 | print(i, "Going to decay learning rate by half.")
131 | sess.run(model.learning_rate_decay_half_op)
132 |
133 | # 5. finally, evaluate on the test set and report test performance
134 | loss_test, f1_macro_test, f1_micro_test=do_eval(sess, model, test,num_classes, label2index)
135 | f1_score_test=((f1_macro_test + f1_micro_test) / 2.0) #* 100.0
136 | print("Test.Epoch %d TestLoss:%.3f\tF1_score:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t" % (epoch, loss_test, f1_score_test,f1_macro_test, f1_micro_test))
137 | print("training completed...")
138 |
139 | #sess,model,valid,iteration,num_classes,label2index
140 | def do_eval(sess,model,valid,num_classes,label2index):
141 | """
142 | do evaluation on the validation set and report loss and f1 score.
143 | :param sess:
144 | :param model:
145 | :param valid:
146 | :param num_classes:
147 | :param label2index:
148 | :return:
149 | """
150 | number_examples=valid[0].shape[0]
151 | valid_x,valid_y=valid
152 | print("number_examples:",number_examples)
153 | eval_loss,eval_counter=0.0,0
154 | batch_size=FLAGS.batch_size
155 | label_dict=init_label_dict(num_classes)
156 | eval_macro_f1, eval_micro_f1 = 0.0,0.0
157 | for start,end in zip(range(0,number_examples,batch_size),range(batch_size,number_examples,batch_size)):
158 | feed_dict = {model.input_x: valid_x[start:end],model.input_y:valid_y[start:end],model.dropout_keep_prob: 1.0}
159 | curr_eval_loss, logits= sess.run([model.loss_val,model.logits],feed_dict) # logits:[batch_size,label_size]
160 | #compute confuse matrix
161 | label_dict=compute_confuse_matrix_batch(valid_y[start:end],logits,label_dict,name='bright')
162 | eval_loss=eval_loss+curr_eval_loss
163 | eval_counter=eval_counter+1
164 | #compute f1_micro & f1_macro
165 | f1_micro,f1_macro=compute_micro_macro(label_dict) # label_dict is a dict; key: a label, value: (TP,FP,FN), where TP is the number of true positives
166 | compute_f1_score_write_for_debug(label_dict,label2index)
167 | return eval_loss/float(eval_counter+small_value),f1_macro,f1_micro
168 |
169 | if __name__ == "__main__":
170 | tf.app.run()
--------------------------------------------------------------------------------
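
Two details of the training script above are worth calling out. Checkpoints are saved only when the validation score, defined as the mean of macro and micro F1, beats the best score seen so far, and the learning rate is halved at epochs 2, 4, 6, 9 and 13. Also, both the training loop and do_eval build batches with zip(range(0, n, batch_size), range(batch_size, n, batch_size)), which silently drops the final partial batch whenever n is not a multiple of batch_size. Below is a minimal sketch, not from the repository, of an equivalent index generator that can keep that tail if desired (the function name is illustrative):

    def batch_bounds(n, batch_size, keep_tail=True):
        """Yield (start, end) pairs covering n examples; optionally keep the last partial batch."""
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            if end - start < batch_size and not keep_tail:
                break
            yield start, end

    # list(batch_bounds(10, 4))        -> [(0, 4), (4, 8), (8, 10)]
    # list(batch_bounds(10, 4, False)) -> [(0, 4), (4, 8)]   # same as the zip(range, range) pattern
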
/old/validation_bigru_char.py:
--------------------------------------------------------------------------------
1 | from keras.backend.tensorflow_backend import set_session
2 | import tensorflow as tf
3 | config = tf.ConfigProto()
4 | config.gpu_options.allow_growth = True
5 | set_session(tf.Session(config=config))
6 | import gc
7 | import pandas as pd
8 | import pickle
9 | import numpy as np
10 | np.random.seed(16)
11 | from tensorflow import set_random_seed
12 | set_random_seed(16)
13 | from keras.layers import *
14 | from keras.preprocessing import sequence
15 | from gensim.models.keyedvectors import KeyedVectors
16 | from old.classifier_bigru import TextClassifier
17 |
18 |
19 | def getClassification(arr):
20 | arr = list(arr)
21 | if arr.index(max(arr)) == 0:
22 | return -2
23 | elif arr.index(max(arr)) == 1:
24 | return -1
25 | elif arr.index(max(arr)) == 2:
26 | return 0
27 | else:
28 | return 1
29 |
30 |
31 | if __name__ == "__main__":
32 | with open('tokenizer_char.pickle', 'rb') as handle:
33 | maxlen = 1000
34 | model_dir = "model_bigru_char/"
35 | tokenizer = pickle.load(handle)
36 | word_index = tokenizer.word_index
37 | validation = pd.read_csv("preprocess/validation_char.csv")
38 | validation["content"] = validation.apply(lambda x: eval(x[1]), axis=1)
39 | X_test = validation["content"].values
40 | list_tokenized_validation = tokenizer.texts_to_sequences(X_test)
41 | input_validation = sequence.pad_sequences(list_tokenized_validation, maxlen=maxlen)
42 | w2_model = KeyedVectors.load_word2vec_format("word2vec/chars.vector", binary=True, encoding='utf8',
43 | unicode_errors='ignore')
44 | embeddings_index = {}
45 | embeddings_matrix = np.zeros((len(word_index) + 1, w2_model.vector_size))
46 | word2idx = {"_PAD": 0}
47 | vocab_list = [(k, w2_model.wv[k]) for k, v in w2_model.wv.vocab.items()]
48 | for word, i in word_index.items():
49 | if word in w2_model:
50 | embedding_vector = w2_model[word]
51 | else:
52 | embedding_vector = None
53 | if embedding_vector is not None:
54 | embeddings_matrix[i] = embedding_vector
55 |
56 | submit = pd.read_csv("ai_challenger_sentiment_analysis_validationset_20180816/sentiment_analysis_validationset.csv")
57 | submit_prob = pd.read_csv("ai_challenger_sentiment_analysis_validationset_20180816/sentiment_analysis_validationset.csv")
58 |
59 | model1 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
60 | model1.load_weights(model_dir + "model_ltc_01.hdf5")
61 | submit["location_traffic_convenience"] = list(map(getClassification, model1.predict(input_validation)))
62 | submit_prob["location_traffic_convenience"] = list(model1.predict(input_validation))
63 | del model1
64 | gc.collect()
65 | K.clear_session()
66 |
67 | model2 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
68 | model2.load_weights(model_dir + "model_ldfbd_01.hdf5")
69 | submit["location_distance_from_business_district"] = list(
70 | map(getClassification, model2.predict(input_validation)))
71 | submit_prob["location_distance_from_business_district"] = list(model2.predict(input_validation))
72 | del model2
73 | gc.collect()
74 | K.clear_session()
75 |
76 | model3 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
77 | model3.load_weights(model_dir + "model_letf_02.hdf5")
78 | submit["location_easy_to_find"] = list(map(getClassification, model3.predict(input_validation)))
79 | submit_prob["location_easy_to_find"] = list(model3.predict(input_validation))
80 | del model3
81 | gc.collect()
82 | K.clear_session()
83 |
84 | model4 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
85 | model4.load_weights(model_dir + "model_swt_02.hdf5")
86 | submit["service_wait_time"] = list(map(getClassification, model4.predict(input_validation)))
87 | submit_prob["service_wait_time"] = list(model4.predict(input_validation))
88 | del model4
89 | gc.collect()
90 | K.clear_session()
91 |
92 | model5 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
93 | model5.load_weights(model_dir + "model_swa_02.hdf5")
94 | submit["service_waiters_attitude"] = list(map(getClassification, model5.predict(input_validation)))
95 | submit_prob["service_waiters_attitude"] = list(model5.predict(input_validation))
96 | del model5
97 | gc.collect()
98 | K.clear_session()
99 |
100 | model6 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
101 | model6.load_weights(model_dir + "model_spc_02.hdf5")
102 | submit["service_parking_convenience"] = list(map(getClassification, model6.predict(input_validation)))
103 | submit_prob["service_parking_convenience"] = list(model6.predict(input_validation))
104 | del model6
105 | gc.collect()
106 | K.clear_session()
107 |
108 | model7 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
109 | model7.load_weights(model_dir + "model_ssp_02.hdf5")
110 | submit["service_serving_speed"] = list(map(getClassification, model7.predict(input_validation)))
111 | submit_prob["service_serving_speed"] = list(model7.predict(input_validation))
112 | del model7
113 | gc.collect()
114 | K.clear_session()
115 |
116 | model8 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
117 | model8.load_weights(model_dir + "model_pl_02.hdf5")
118 | submit["price_level"] = list(map(getClassification, model8.predict(input_validation)))
119 | submit_prob["price_level"] = list(model8.predict(input_validation))
120 | del model8
121 | gc.collect()
122 | K.clear_session()
123 |
124 | model9 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
125 | model9.load_weights(model_dir + "model_pce_02.hdf5")
126 | submit["price_cost_effective"] = list(map(getClassification, model9.predict(input_validation)))
127 | submit_prob["price_cost_effective"] = list(model9.predict(input_validation))
128 | del model9
129 | gc.collect()
130 | K.clear_session()
131 |
132 | model10 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
133 | model10.load_weights(model_dir + "model_pd_02.hdf5")
134 | submit["price_discount"] = list(map(getClassification, model10.predict(input_validation)))
135 | submit_prob["price_discount"] = list(model10.predict(input_validation))
136 | del model10
137 | gc.collect()
138 | K.clear_session()
139 |
140 | model11 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
141 | model11.load_weights(model_dir + "model_ed_01.hdf5")
142 | submit["environment_decoration"] = list(map(getClassification, model11.predict(input_validation)))
143 | submit_prob["environment_decoration"] = list(model11.predict(input_validation))
144 | del model11
145 | gc.collect()
146 | K.clear_session()
147 |
148 | model12 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
149 | model12.load_weights(model_dir + "model_en_02.hdf5")
150 | submit["environment_noise"] = list(map(getClassification, model12.predict(input_validation)))
151 | submit_prob["environment_noise"] = list(model12.predict(input_validation))
152 | del model12
153 | gc.collect()
154 | K.clear_session()
155 |
156 | model13 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
157 | model13.load_weights(model_dir + "model_es_02.hdf5")
158 | submit["environment_space"] = list(map(getClassification, model13.predict(input_validation)))
159 | submit_prob["environment_space"] = list(model13.predict(input_validation))
160 | del model13
161 | gc.collect()
162 | K.clear_session()
163 |
164 | model14 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
165 | model14.load_weights(model_dir + "model_ec_01.hdf5")
166 | submit["environment_cleaness"] = list(map(getClassification, model14.predict(input_validation)))
167 | submit_prob["environment_cleaness"] = list(model14.predict(input_validation))
168 | del model14
169 | gc.collect()
170 | K.clear_session()
171 |
172 | model15 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
173 | model15.load_weights(model_dir + "model_dp_01.hdf5")
174 | submit["dish_portion"] = list(map(getClassification, model15.predict(input_validation)))
175 | submit_prob["dish_portion"] = list(model15.predict(input_validation))
176 | del model15
177 | gc.collect()
178 | K.clear_session()
179 |
180 | model16 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
181 | model16.load_weights(model_dir + "model_dt_02.hdf5")
182 | submit["dish_taste"] = list(map(getClassification, model16.predict(input_validation)))
183 | submit_prob["dish_taste"] = list(model16.predict(input_validation))
184 | del model16
185 | gc.collect()
186 | K.clear_session()
187 |
188 | model17 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
189 | model17.load_weights(model_dir + "model_dl_02.hdf5")
190 | submit["dish_look"] = list(map(getClassification, model17.predict(input_validation)))
191 | submit_prob["dish_look"] = list(model17.predict(input_validation))
192 | del model17
193 | gc.collect()
194 | K.clear_session()
195 |
196 | model18 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
197 | model18.load_weights(model_dir + "model_dr_01.hdf5")
198 | submit["dish_recommendation"] = list(map(getClassification, model18.predict(input_validation)))
199 | submit_prob["dish_recommendation"] = list(model18.predict(input_validation))
200 | del model18
201 | gc.collect()
202 | K.clear_session()
203 |
204 | model19 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
205 | model19.load_weights(model_dir + "model_ooe_01.hdf5")
206 | submit["others_overall_experience"] = list(map(getClassification, model19.predict(input_validation)))
207 | submit_prob["others_overall_experience"] = list(model19.predict(input_validation))
208 | del model19
209 | gc.collect()
210 | K.clear_session()
211 |
212 | model20 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
213 | model20.load_weights(model_dir + "model_owta_02.hdf5")
214 | submit["others_willing_to_consume_again"] = list(map(getClassification, model20.predict(input_validation)))
215 | submit_prob["others_willing_to_consume_again"] = list(model20.predict(input_validation))
216 | del model20
217 | gc.collect()
218 | K.clear_session()
219 |
220 | submit.to_csv("validation_bigru_char.csv", index=None)
221 | submit_prob.to_csv("validation_bigru_char_prob.csv", index=None)
--------------------------------------------------------------------------------
/old/validation_rcnn_char.py:
--------------------------------------------------------------------------------
1 | from keras.backend.tensorflow_backend import set_session
2 | import tensorflow as tf
3 | config = tf.ConfigProto()
4 | config.gpu_options.allow_growth = True
5 | set_session(tf.Session(config=config))
6 | import gc
7 | import pandas as pd
8 | import pickle
9 | import numpy as np
10 | np.random.seed(16)
11 | from tensorflow import set_random_seed
12 | set_random_seed(16)
13 | from keras.layers import *
14 | from keras.preprocessing import sequence
15 | from gensim.models.keyedvectors import KeyedVectors
16 | from old.classifier_rcnn import TextClassifier
17 |
18 |
19 | def getClassification(arr):
20 | arr = list(arr)
21 | if arr.index(max(arr)) == 0:
22 | return -2
23 | elif arr.index(max(arr)) == 1:
24 | return -1
25 | elif arr.index(max(arr)) == 2:
26 | return 0
27 | else:
28 | return 1
29 |
30 |
31 | if __name__ == "__main__":
32 | with open('tokenizer_char.pickle', 'rb') as handle:
33 | maxlen = 1000
34 | model_dir = "model_rcnn_char/"
35 | tokenizer = pickle.load(handle)
36 | word_index = tokenizer.word_index
37 | validation = pd.read_csv("preprocess/validation_char.csv")
38 | validation["content"] = validation.apply(lambda x: eval(x[1]), axis=1)
39 | X_test = validation["content"].values
40 | list_tokenized_validation = tokenizer.texts_to_sequences(X_test)
41 | input_validation = sequence.pad_sequences(list_tokenized_validation, maxlen=maxlen)
42 | w2_model = KeyedVectors.load_word2vec_format("word2vec/chars.vector", binary=True, encoding='utf8',
43 | unicode_errors='ignore')
44 | embeddings_index = {}
45 | embeddings_matrix = np.zeros((len(word_index) + 1, w2_model.vector_size))
46 | word2idx = {"_PAD": 0}
47 | vocab_list = [(k, w2_model.wv[k]) for k, v in w2_model.wv.vocab.items()]
48 | for word, i in word_index.items():
49 | if word in w2_model:
50 | embedding_vector = w2_model[word]
51 | else:
52 | embedding_vector = None
53 | if embedding_vector is not None:
54 | embeddings_matrix[i] = embedding_vector
55 |
56 | submit = pd.read_csv("ai_challenger_sentiment_analysis_validationset_20180816/sentiment_analysis_validationset.csv")
57 | submit_prob = pd.read_csv("ai_challenger_sentiment_analysis_validationset_20180816/sentiment_analysis_validationset.csv")
58 |
59 | model1 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
60 | model1.load_weights(model_dir + "model_ltc_02.hdf5")
61 | submit["location_traffic_convenience"] = list(map(getClassification, model1.predict(input_validation)))
62 | submit_prob["location_traffic_convenience"] = list(model1.predict(input_validation))
63 | del model1
64 | gc.collect()
65 | K.clear_session()
66 |
67 | model2 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
68 | model2.load_weights(model_dir + "model_ldfbd_02.hdf5")
69 | submit["location_distance_from_business_district"] = list(
70 | map(getClassification, model2.predict(input_validation)))
71 | submit_prob["location_distance_from_business_district"] = list(model2.predict(input_validation))
72 | del model2
73 | gc.collect()
74 | K.clear_session()
75 |
76 | model3 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
77 | model3.load_weights(model_dir + "model_letf_02.hdf5")
78 | submit["location_easy_to_find"] = list(map(getClassification, model3.predict(input_validation)))
79 | submit_prob["location_easy_to_find"] = list(model3.predict(input_validation))
80 | del model3
81 | gc.collect()
82 | K.clear_session()
83 |
84 | model4 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
85 | model4.load_weights(model_dir + "model_swt_02.hdf5")
86 | submit["service_wait_time"] = list(map(getClassification, model4.predict(input_validation)))
87 | submit_prob["service_wait_time"] = list(model4.predict(input_validation))
88 | del model4
89 | gc.collect()
90 | K.clear_session()
91 |
92 | model5 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
93 | model5.load_weights(model_dir + "model_swa_02.hdf5")
94 | submit["service_waiters_attitude"] = list(map(getClassification, model5.predict(input_validation)))
95 | submit_prob["service_waiters_attitude"] = list(model5.predict(input_validation))
96 | del model5
97 | gc.collect()
98 | K.clear_session()
99 |
100 | model6 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
101 | model6.load_weights(model_dir + "model_spc_01.hdf5")
102 | submit["service_parking_convenience"] = list(map(getClassification, model6.predict(input_validation)))
103 | submit_prob["service_parking_convenience"] = list(model6.predict(input_validation))
104 | del model6
105 | gc.collect()
106 | K.clear_session()
107 |
108 | model7 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
109 | model7.load_weights(model_dir + "model_ssp_02.hdf5")
110 | submit["service_serving_speed"] = list(map(getClassification, model7.predict(input_validation)))
111 | submit_prob["service_serving_speed"] = list(model7.predict(input_validation))
112 | del model7
113 | gc.collect()
114 | K.clear_session()
115 |
116 | model8 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
117 | model8.load_weights(model_dir + "model_pl_02.hdf5")
118 | submit["price_level"] = list(map(getClassification, model8.predict(input_validation)))
119 | submit_prob["price_level"] = list(model8.predict(input_validation))
120 | del model8
121 | gc.collect()
122 | K.clear_session()
123 |
124 | model9 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
125 | model9.load_weights(model_dir + "model_pce_02.hdf5")
126 | submit["price_cost_effective"] = list(map(getClassification, model9.predict(input_validation)))
127 | submit_prob["price_cost_effective"] = list(model9.predict(input_validation))
128 | del model9
129 | gc.collect()
130 | K.clear_session()
131 |
132 | model10 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
133 | model10.load_weights(model_dir + "model_pd_02.hdf5")
134 | submit["price_discount"] = list(map(getClassification, model10.predict(input_validation)))
135 | submit_prob["price_discount"] = list(model10.predict(input_validation))
136 | del model10
137 | gc.collect()
138 | K.clear_session()
139 |
140 | model11 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
141 | model11.load_weights(model_dir + "model_ed_02.hdf5")
142 | submit["environment_decoration"] = list(map(getClassification, model11.predict(input_validation)))
143 | submit_prob["environment_decoration"] = list(model11.predict(input_validation))
144 | del model11
145 | gc.collect()
146 | K.clear_session()
147 |
148 | model12 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
149 | model12.load_weights(model_dir + "model_en_02.hdf5")
150 | submit["environment_noise"] = list(map(getClassification, model12.predict(input_validation)))
151 | submit_prob["environment_noise"] = list(model12.predict(input_validation))
152 | del model12
153 | gc.collect()
154 | K.clear_session()
155 |
156 | model13 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
157 | model13.load_weights(model_dir + "model_es_01.hdf5")
158 | submit["environment_space"] = list(map(getClassification, model13.predict(input_validation)))
159 | submit_prob["environment_space"] = list(model13.predict(input_validation))
160 | del model13
161 | gc.collect()
162 | K.clear_session()
163 |
164 | model14 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
165 | model14.load_weights(model_dir + "model_ec_02.hdf5")
166 | submit["environment_cleaness"] = list(map(getClassification, model14.predict(input_validation)))
167 | submit_prob["environment_cleaness"] = list(model14.predict(input_validation))
168 | del model14
169 | gc.collect()
170 | K.clear_session()
171 |
172 | model15 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
173 | model15.load_weights(model_dir + "model_dp_02.hdf5")
174 | submit["dish_portion"] = list(map(getClassification, model15.predict(input_validation)))
175 | submit_prob["dish_portion"] = list(model15.predict(input_validation))
176 | del model15
177 | gc.collect()
178 | K.clear_session()
179 |
180 | model16 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
181 | model16.load_weights(model_dir + "model_dt_02.hdf5")
182 | submit["dish_taste"] = list(map(getClassification, model16.predict(input_validation)))
183 | submit_prob["dish_taste"] = list(model16.predict(input_validation))
184 | del model16
185 | gc.collect()
186 | K.clear_session()
187 |
188 | model17 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
189 | model17.load_weights(model_dir + "model_dl_02.hdf5")
190 | submit["dish_look"] = list(map(getClassification, model17.predict(input_validation)))
191 | submit_prob["dish_look"] = list(model17.predict(input_validation))
192 | del model17
193 | gc.collect()
194 | K.clear_session()
195 |
196 | model18 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
197 | model18.load_weights(model_dir + "model_dr_02.hdf5")
198 | submit["dish_recommendation"] = list(map(getClassification, model18.predict(input_validation)))
199 | submit_prob["dish_recommendation"] = list(model18.predict(input_validation))
200 | del model18
201 | gc.collect()
202 | K.clear_session()
203 |
204 | model19 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
205 | model19.load_weights(model_dir + "model_ooe_02.hdf5")
206 | submit["others_overall_experience"] = list(map(getClassification, model19.predict(input_validation)))
207 | submit_prob["others_overall_experience"] = list(model19.predict(input_validation))
208 | del model19
209 | gc.collect()
210 | K.clear_session()
211 |
212 | model20 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
213 | model20.load_weights(model_dir + "model_owta_02.hdf5")
214 | submit["others_willing_to_consume_again"] = list(map(getClassification, model20.predict(input_validation)))
215 | submit_prob["others_willing_to_consume_again"] = list(model20.predict(input_validation))
216 | del model20
217 | gc.collect()
218 | K.clear_session()
219 |
220 | submit.to_csv("validation_rcnn_char.csv", index=None)
221 | submit_prob.to_csv("validation_rcnn_char_prob.csv", index=None)
--------------------------------------------------------------------------------
/preprocess_char/README.txt:
--------------------------------------------------------------------------------
1 | Processed test/train/validation files will be saved here.
2 |
--------------------------------------------------------------------------------
/tokenization.py:
--------------------------------------------------------------------------------
1 | # coding=utf-8
2 | # Copyright 2018 The Google AI Language Team Authors.
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # http://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | """Tokenization classes."""
16 |
17 | from __future__ import absolute_import
18 | from __future__ import division
19 | from __future__ import print_function
20 |
21 | import collections
22 | import unicodedata
23 | import six
24 | import tensorflow as tf
25 |
26 |
27 | def convert_to_unicode(text):
28 | """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
29 | if six.PY3:
30 | if isinstance(text, str):
31 | return text
32 | elif isinstance(text, bytes):
33 | return text.decode("utf-8", "ignore")
34 | else:
35 | raise ValueError("Unsupported string type: %s" % (type(text)))
36 | elif six.PY2:
37 | if isinstance(text, str):
38 | return text.decode("utf-8", "ignore")
39 | elif isinstance(text, unicode):
40 | return text
41 | else:
42 | raise ValueError("Unsupported string type: %s" % (type(text)))
43 | else:
44 | raise ValueError("Not running on Python2 or Python 3?")
45 |
46 |
47 | def printable_text(text):
48 | """Returns text encoded in a way suitable for print or `tf.logging`."""
49 |
50 | # These functions want `str` for both Python2 and Python3, but in one case
51 | # it's a Unicode string and in the other it's a byte string.
52 | if six.PY3:
53 | if isinstance(text, str):
54 | return text
55 | elif isinstance(text, bytes):
56 | return text.decode("utf-8", "ignore")
57 | else:
58 | raise ValueError("Unsupported string type: %s" % (type(text)))
59 | elif six.PY2:
60 | if isinstance(text, str):
61 | return text
62 | elif isinstance(text, unicode):
63 | return text.encode("utf-8")
64 | else:
65 | raise ValueError("Unsupported string type: %s" % (type(text)))
66 | else:
67 | raise ValueError("Not running on Python2 or Python 3?")
68 |
69 |
70 | def load_vocab(vocab_file):
71 | """Loads a vocabulary file into a dictionary."""
72 | vocab = collections.OrderedDict()
73 | index = 0
74 | with tf.gfile.GFile(vocab_file, "r") as reader:
75 | while True:
76 | token = convert_to_unicode(reader.readline())
77 | if not token:
78 | break
79 | token = token.strip()
80 | vocab[token] = index
81 | index += 1
82 | return vocab
83 |
84 |
85 | def convert_by_vocab(vocab, items):
86 | """Converts a sequence of [tokens|ids] using the vocab."""
87 | output = []
88 | for item in items:
89 | output.append(vocab[item])
90 | return output
91 |
92 |
93 | def convert_tokens_to_ids(vocab, tokens):
94 | return convert_by_vocab(vocab, tokens)
95 |
96 |
97 | def convert_ids_to_tokens(inv_vocab, ids):
98 | return convert_by_vocab(inv_vocab, ids)
99 |
100 |
101 | def whitespace_tokenize(text):
102 | """Runs basic whitespace cleaning and splitting on a peice of text."""
103 | text = text.strip()
104 | if not text:
105 | return []
106 | tokens = text.split()
107 | return tokens
108 |
109 |
110 | class FullTokenizer(object):
111 | """Runs end-to-end tokenziation."""
112 |
113 | def __init__(self, vocab_file, do_lower_case=True):
114 | self.vocab = load_vocab(vocab_file)
115 | self.inv_vocab = {v: k for k, v in self.vocab.items()}
116 | self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case)
117 | self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab)
118 |
119 | def tokenize(self, text):
120 | split_tokens = []
121 | for token in self.basic_tokenizer.tokenize(text):
122 | for sub_token in self.wordpiece_tokenizer.tokenize(token):
123 | split_tokens.append(sub_token)
124 |
125 | return split_tokens
126 |
127 | def convert_tokens_to_ids(self, tokens):
128 | return convert_by_vocab(self.vocab, tokens)
129 |
130 | def convert_ids_to_tokens(self, ids):
131 | return convert_by_vocab(self.inv_vocab, ids)
132 |
133 |
134 | class BasicTokenizer(object):
135 | """Runs basic tokenization (punctuation splitting, lower casing, etc.)."""
136 |
137 | def __init__(self, do_lower_case=True):
138 | """Constructs a BasicTokenizer.
139 |
140 | Args:
141 | do_lower_case: Whether to lower case the input.
142 | """
143 | self.do_lower_case = do_lower_case
144 |
145 | def tokenize(self, text):
146 | """Tokenizes a piece of text."""
147 | text = convert_to_unicode(text)
148 | text = self._clean_text(text)
149 |
150 | # This was added on November 1st, 2018 for the multilingual and Chinese
151 | # models. This is also applied to the English models now, but it doesn't
152 | # matter since the English models were not trained on any Chinese data
153 | # and generally don't have any Chinese data in them (there are Chinese
154 | # characters in the vocabulary because Wikipedia does have some Chinese
155 | # words in the English Wikipedia.).
156 | text = self._tokenize_chinese_chars(text)
157 |
158 | orig_tokens = whitespace_tokenize(text)
159 | split_tokens = []
160 | for token in orig_tokens:
161 | if self.do_lower_case:
162 | token = token.lower()
163 | token = self._run_strip_accents(token)
164 | split_tokens.extend(self._run_split_on_punc(token))
165 |
166 | output_tokens = whitespace_tokenize(" ".join(split_tokens))
167 | return output_tokens
168 |
169 | def _run_strip_accents(self, text):
170 | """Strips accents from a piece of text."""
171 | text = unicodedata.normalize("NFD", text)
172 | output = []
173 | for char in text:
174 | cat = unicodedata.category(char)
175 | if cat == "Mn":
176 | continue
177 | output.append(char)
178 | return "".join(output)
179 |
180 | def _run_split_on_punc(self, text):
181 | """Splits punctuation on a piece of text."""
182 | chars = list(text)
183 | i = 0
184 | start_new_word = True
185 | output = []
186 | while i < len(chars):
187 | char = chars[i]
188 | if _is_punctuation(char):
189 | output.append([char])
190 | start_new_word = True
191 | else:
192 | if start_new_word:
193 | output.append([])
194 | start_new_word = False
195 | output[-1].append(char)
196 | i += 1
197 |
198 | return ["".join(x) for x in output]
199 |
200 | def _tokenize_chinese_chars(self, text):
201 | """Adds whitespace around any CJK character."""
202 | output = []
203 | for char in text:
204 | cp = ord(char)
205 | if self._is_chinese_char(cp):
206 | output.append(" ")
207 | output.append(char)
208 | output.append(" ")
209 | else:
210 | output.append(char)
211 | return "".join(output)
212 |
213 | def _is_chinese_char(self, cp):
214 | """Checks whether CP is the codepoint of a CJK character."""
215 | # This defines a "chinese character" as anything in the CJK Unicode block:
216 | # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
217 | #
218 | # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
219 | # despite its name. The modern Korean Hangul alphabet is a different block,
220 | # as is Japanese Hiragana and Katakana. Those alphabets are used to write
221 | # space-separated words, so they are not treated specially and handled
222 | # like all of the other languages.
223 | if ((cp >= 0x4E00 and cp <= 0x9FFF) or #
224 | (cp >= 0x3400 and cp <= 0x4DBF) or #
225 | (cp >= 0x20000 and cp <= 0x2A6DF) or #
226 | (cp >= 0x2A700 and cp <= 0x2B73F) or #
227 | (cp >= 0x2B740 and cp <= 0x2B81F) or #
228 | (cp >= 0x2B820 and cp <= 0x2CEAF) or
229 | (cp >= 0xF900 and cp <= 0xFAFF) or #
230 | (cp >= 0x2F800 and cp <= 0x2FA1F)): #
231 | return True
232 |
233 | return False
234 |
235 | def _clean_text(self, text):
236 | """Performs invalid character removal and whitespace cleanup on text."""
237 | output = []
238 | for char in text:
239 | cp = ord(char)
240 | if cp == 0 or cp == 0xfffd or _is_control(char):
241 | continue
242 | if _is_whitespace(char):
243 | output.append(" ")
244 | else:
245 | output.append(char)
246 | return "".join(output)
247 |
248 |
249 | class WordpieceTokenizer(object):
250 | """Runs WordPiece tokenziation."""
251 |
252 | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
253 | self.vocab = vocab
254 | self.unk_token = unk_token
255 | self.max_input_chars_per_word = max_input_chars_per_word
256 |
257 | def tokenize(self, text):
258 | """Tokenizes a piece of text into its word pieces.
259 |
260 | This uses a greedy longest-match-first algorithm to perform tokenization
261 | using the given vocabulary.
262 |
263 | For example:
264 | input = "unaffable"
265 | output = ["un", "##aff", "##able"]
266 |
267 | Args:
268 | text: A single token or whitespace separated tokens. This should have
269 | already been passed through `BasicTokenizer`.
270 |
271 | Returns:
272 | A list of wordpiece tokens.
273 | """
274 |
275 | text = convert_to_unicode(text)
276 |
277 | output_tokens = []
278 | for token in whitespace_tokenize(text):
279 | chars = list(token)
280 | if len(chars) > self.max_input_chars_per_word:
281 | output_tokens.append(self.unk_token)
282 | continue
283 |
284 | is_bad = False
285 | start = 0
286 | sub_tokens = []
287 | while start < len(chars):
288 | end = len(chars)
289 | cur_substr = None
290 | while start < end:
291 | substr = "".join(chars[start:end])
292 | if start > 0:
293 | substr = "##" + substr
294 | if substr in self.vocab:
295 | cur_substr = substr
296 | break
297 | end -= 1
298 | if cur_substr is None:
299 | is_bad = True
300 | break
301 | sub_tokens.append(cur_substr)
302 | start = end
303 |
304 | if is_bad:
305 | output_tokens.append(self.unk_token)
306 | else:
307 | output_tokens.extend(sub_tokens)
308 | return output_tokens
309 |
310 |
311 | def _is_whitespace(char):
312 | """Checks whether `chars` is a whitespace character."""
313 | # \t, \n, and \r are technically control characters but we treat them
314 | # as whitespace since they are generally considered as such.
315 | if char == " " or char == "\t" or char == "\n" or char == "\r":
316 | return True
317 | cat = unicodedata.category(char)
318 | if cat == "Zs":
319 | return True
320 | return False
321 |
322 |
323 | def _is_control(char):
324 | """Checks whether `chars` is a control character."""
325 | # These are technically control characters but we count them as whitespace
326 | # characters.
327 | if char == "\t" or char == "\n" or char == "\r":
328 | return False
329 | cat = unicodedata.category(char)
330 | if cat.startswith("C"):
331 | return True
332 | return False
333 |
334 |
335 | def _is_punctuation(char):
336 | """Checks whether `chars` is a punctuation character."""
337 | cp = ord(char)
338 | # We treat all non-letter/number ASCII as punctuation.
339 | # Characters such as "^", "$", and "`" are not in the Unicode
340 | # Punctuation class but we treat them as punctuation anyways, for
341 | # consistency.
342 | if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or
343 | (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)):
344 | return True
345 | cat = unicodedata.category(char)
346 | if cat.startswith("P"):
347 | return True
348 | return False
349 |
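A minimal usage sketch of the tokenizer defined above; the vocab path below is an assumption (any vocab.txt shipped with the pre-trained model placed in BERT_BASE_DIR should work):

# Minimal usage sketch; "./BERT_BASE_DIR/vocab.txt" is an assumed path to a downloaded BERT vocab file.
import tokenization

tokenizer = tokenization.FullTokenizer(vocab_file="./BERT_BASE_DIR/vocab.txt", do_lower_case=True)
tokens = tokenizer.tokenize("BERT makes unaffable text tractable")
# e.g. ["bert", "makes", "un", "##aff", "##able", ...] depending on the vocabulary
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens, ids)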
--------------------------------------------------------------------------------
/tokenizer_char.pickle:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/brightmart/sentiment_analysis_fine_grain/bad7c61a2610eec614e1b42b07cadb1ec57a2ef7/tokenizer_char.pickle
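A hedged sketch for loading this artifact; the exact object type stored in the pickle (it is produced by preprocess_char.ipynb) is an assumption and should be checked after loading:

# Sketch: load the char-level tokenizer artifact and inspect what kind of object it is.
import pickle

with open("tokenizer_char.pickle", "rb") as f:
    tokenizer_char = pickle.load(f)
print(type(tokenizer_char))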
--------------------------------------------------------------------------------
/train_bert_fine_tuning.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #process--->1.load data(X: list of int, y: int). 2.create session. 3.feed data. 4.training (5.validation), (6.prediction)
3 | """
4 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
5 | main idea: based on a multi-layer self-attention model (the encoder of the Transformer), pre-train two tasks (masked language model and next sentence prediction)
6 | on a large corpus, then fine-tune by adding a single classification layer.
7 |
8 | train the model (transformer) with data enhanced by pre-training on the two tasks.
9 | the default hyperparameters are d_model=512, h=8, d_k=d_v=64 (big). if you have a small data set or want to train a
10 | small model, use d_model=128, h=8, d_k=d_v=16 (small), or d_model=64, h=8, d_k=d_v=8 (tiny).
11 | """
12 |
13 | import tensorflow as tf
14 | #import numpy as np
15 | #from model.bert_model import BertModel # TODO: test whether pre-training can boost performance with another model
16 | from model.bert_cnn_fine_grain_model import BertCNNFineGrainModel as BertModel
17 |
18 | from data_util_hdf5 import create_or_load_vocabulary,load_data_multilabel,assign_pretrained_word_embedding,set_config,get_lable2index
19 | import os
20 | from evaluation_matrix import *
21 | #from model.config_transformer import Config
22 | #configuration
23 | FLAGS=tf.app.flags.FLAGS
24 |
25 | tf.app.flags.DEFINE_string("data_path","./data/","path of traning data.")
26 | tf.app.flags.DEFINE_string("training_data_file","./data/bert_train2.txt","path of traning data.") #./data/cail2018_bi.json
27 | tf.app.flags.DEFINE_string("valid_data_file","./data/bert_valid2.txt","path of validation data.")
28 | tf.app.flags.DEFINE_string("test_data_file","./data/bert_test2.txt","path of validation data.")
29 | tf.app.flags.DEFINE_string("ckpt_dir","./checkpoint_lm/","checkpoint location for the model for restore from pre-train") #save to here, so make it easy to upload for test
30 | tf.app.flags.DEFINE_string("ckpt_dir_save","./checkpoint_lm_save/","checkpoint location for the model for save fine-tuning") #save to here, so make it easy to upload for test
31 |
32 | tf.app.flags.DEFINE_string("tokenize_style","word","checkpoint location for the model")
33 | tf.app.flags.DEFINE_string("model_name","BertCNNFineGrainModel","text cnn model. pre-train and fine-tuning.")
34 |
35 | tf.app.flags.DEFINE_integer("vocab_size",50000,"maximum vocab size.")
36 | tf.app.flags.DEFINE_float("learning_rate",0.00001,"learning rate") #0.001
37 | tf.app.flags.DEFINE_integer("batch_size", 64, "Batch size for training/evaluating.") # 32-->128
38 | tf.app.flags.DEFINE_integer("decay_steps", 10000, "how many steps before decay learning rate.") # 32-->128
39 | tf.app.flags.DEFINE_float("decay_rate", 0.8, "Rate of decay for learning rate.") #0.65
40 | tf.app.flags.DEFINE_float("dropout_keep_prob", 0.9, "percentage to keep when using dropout.") #0.65
41 | tf.app.flags.DEFINE_integer("sequence_length",800,"max sentence length")#400
42 | tf.app.flags.DEFINE_integer("sequence_length_lm",10,"max sentence length for masked language model")
43 |
44 | tf.app.flags.DEFINE_boolean("is_training",True,"is training.true:tranining,false:testing/inference")
45 | tf.app.flags.DEFINE_boolean("is_fine_tuning",True,"is_finetuning.ture:this is fine-tuning stage")
46 |
47 | tf.app.flags.DEFINE_integer("num_epochs",35,"number of epochs to run.")
48 | tf.app.flags.DEFINE_integer("process_num",35,"number of cpu used")
49 |
50 | tf.app.flags.DEFINE_integer("validate_every", 1, "Validate every validate_every epochs.") #
51 | tf.app.flags.DEFINE_boolean("use_pretrained_embedding",False,"whether to use embedding or not.")#
52 | tf.app.flags.DEFINE_string("word2vec_model_path","./data/Tencent_AILab_ChineseEmbedding_100w.txt","word2vec's vocabulary and vectors") # data/sgns.target.word-word.dynwin5.thr10.neg5.dim300.iter5--->data/news_12g_baidubaike_20g_novel_90g_embedding_64.bin--->sgns.merge.char
53 | tf.app.flags.DEFINE_boolean("test_mode",False,"whether it is test mode. if it is test mode, only small percentage of data will be used. test mode for test purpose.")
54 |
55 | tf.app.flags.DEFINE_integer("d_model", 128, "dimension of model") # 512-->128
56 | tf.app.flags.DEFINE_integer("num_layer", 6, "number of layer")
57 | tf.app.flags.DEFINE_integer("num_header", 8, "number of header")
58 | tf.app.flags.DEFINE_integer("d_k", 16, "dimension of k") # 64-->16
59 | tf.app.flags.DEFINE_integer("d_v", 16, "dimension of v") # 64-->16
60 |
61 | def main(_):
62 | # 1. load vocabulary of tokens from the cache file saved during the pre-training stage; load the label dict from the training file; print some messages.
63 | vocab_word2index, _= create_or_load_vocabulary(FLAGS.data_path,FLAGS.training_data_file,FLAGS.vocab_size,test_mode=FLAGS.test_mode,tokenize_style=FLAGS.tokenize_style,model_name=FLAGS.model_name)
64 | label2index=get_lable2index(FLAGS.data_path,FLAGS.training_data_file, tokenize_style=FLAGS.tokenize_style)
65 | vocab_size = len(vocab_word2index);print("cnn_model.vocab_size:",vocab_size);num_classes=len(label2index);print("num_classes:",num_classes)
66 | # todo: test the first two functions above, then continue
67 | # load training data.
68 | train,valid, test= load_data_multilabel(FLAGS.data_path,FLAGS.training_data_file,FLAGS.valid_data_file,FLAGS.test_data_file,vocab_word2index,label2index,FLAGS.sequence_length,
69 | process_num=FLAGS.process_num,test_mode=FLAGS.test_mode,tokenize_style=FLAGS.tokenize_style)
70 | train_X, train_Y= train
71 | valid_X, valid_Y= valid
72 | test_X,test_Y = test
73 | print("test_model:",FLAGS.test_mode,";length of training data:",train_X.shape,";valid data:",valid_X.shape,";test data:",test_X.shape,";train_Y:",train_Y.shape)
74 | # 2.create session.
75 | gpu_config=tf.ConfigProto()
76 | gpu_config.gpu_options.allow_growth=True
77 | with tf.Session(config=gpu_config) as sess:
78 | #Instantiate Model
79 | config=set_config(FLAGS,num_classes,vocab_size)
80 | model=BertModel(config)
81 | #Initialize Save
82 | saver=tf.train.Saver()
83 | if os.path.exists(FLAGS.ckpt_dir+"checkpoint"):
84 | print("Restoring Variables from Checkpoint.")
85 | sess.run(tf.global_variables_initializer())
86 | for i in range(6): #decay learning rate if necessary.
87 | print(i,"Going to decay learning rate by a factor of "+str(FLAGS.decay_rate))
88 | sess.run(model.learning_rate_decay_half_op)
89 | # restore those variables whose names and shapes exist in your model from the checkpoint. for details check: https://gist.github.com/iganichev/d2d8a0b1abc6b15d4a07de83171163d4
90 | optimistic_restore(sess, tf.train.latest_checkpoint(FLAGS.ckpt_dir)) #saver.restore(sess,tf.train.latest_checkpoint(FLAGS.ckpt_dir))
91 | else:
92 | print('Initializing Variables as no model checkpoint exists.')
93 | sess.run(tf.global_variables_initializer())
94 | if FLAGS.use_pretrained_embedding:
95 | vocabulary_index2word={index:word for word,index in vocab_word2index.items()}
96 | assign_pretrained_word_embedding(sess, vocabulary_index2word, vocab_size,FLAGS.word2vec_model_path,model.embedding,config.d_model) # assign pretrained word embeddings
97 | curr_epoch=sess.run(model.epoch_step)
98 | # 3.feed data & training
99 | number_of_training_data=len(train_X)
100 | batch_size=FLAGS.batch_size
101 | iteration=0
102 | score_best=-100
103 | f1_score=0
104 | epoch=0
105 | for epoch in range(curr_epoch,FLAGS.num_epochs):
106 | loss_total, counter = 0.0, 0
107 | for start, end in zip(range(0, number_of_training_data, batch_size),range(batch_size, number_of_training_data, batch_size)):
108 | iteration=iteration+1
109 | if epoch==0 and counter==0:
110 | print("trainX[start:end]:",train_X[start:end],"train_X.shape:",train_X.shape)
111 | feed_dict = {model.input_x: train_X[start:end],model.input_y:train_Y[start:end],model.dropout_keep_prob: FLAGS.dropout_keep_prob}
112 | current_loss,lr,l2_loss,_=sess.run([model.loss_val,model.learning_rate,model.l2_loss,model.train_op],feed_dict)
113 | loss_total,counter=loss_total+current_loss,counter+1
114 | if counter %30==0:
115 | print("Learning rate:%.7f\tLoss:%.3f\tCurrent_loss:%.3f\tL2_loss%.3f\t"%(lr,float(loss_total)/float(counter),current_loss,l2_loss))
116 | if start!=0 and start%(4000*FLAGS.batch_size)==0:
117 | loss_valid, f1_macro_valid, f1_micro_valid= do_eval(sess, model, valid,num_classes,label2index)
118 | f1_score_valid=((f1_macro_valid+f1_micro_valid)/2.0) #*100.0
119 | print("Valid.Epoch %d ValidLoss:%.3f\tF1_score_valid:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t" % (epoch, loss_valid, f1_score_valid, f1_macro_valid, f1_micro_valid))
120 |
121 | # save model to checkpoint
122 | if f1_score_valid>score_best:
123 | save_path = FLAGS.ckpt_dir_save + "model.ckpt"
124 | print("going to save check point.")
125 | saver.save(sess, save_path, global_step=epoch)
126 | score_best=f1_score_valid
127 | #epoch increment
128 | print("going to increment epoch counter....")
129 | sess.run(model.epoch_increment)
130 |
131 | # 4.validation
132 | print(epoch,FLAGS.validate_every,(epoch % FLAGS.validate_every==0))
133 | if epoch % FLAGS.validate_every==0:
134 | loss_valid,f1_macro_valid2,f1_micro_valid2=do_eval(sess,model,valid,num_classes,label2index)
135 | f1_score_valid2 = ((f1_macro_valid2 + f1_micro_valid2) / 2.0) #* 100.0
136 | print("Valid.Epoch %d ValidLoss:%.3f\tF1 score:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t"% (epoch,loss_valid,f1_score_valid2,f1_macro_valid2,f1_micro_valid2))
137 | #save model to checkpoint
138 | if f1_score_valid2 > score_best:
139 | save_path=FLAGS.ckpt_dir_save+"model.ckpt"
140 | print("going to save check point.")
141 | saver.save(sess,save_path,global_step=epoch)
142 | score_best = f1_score_valid2
143 | if (epoch == 2 or epoch == 4 or epoch == 6 or epoch == 9 or epoch == 13):
144 | for i in range(1):
145 | print(i, "Going to decay learning rate by half.")
146 | sess.run(model.learning_rate_decay_half_op)
147 |
148 | # 5.report on test set
149 | loss_test, f1_macro_test, f1_micro_test=do_eval(sess, model, test,num_classes, label2index)
150 | f1_score_test=((f1_macro_test + f1_micro_test) / 2.0) * 100.0
151 | print("Test.Epoch %d TestLoss:%.3f\tF1_score:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t" % (epoch, loss_test, f1_score_test,f1_macro_test, f1_micro_test))
152 | print("training completed...")
153 |
154 | #sess,model,valid,iteration,num_classes,label2index
155 | def do_eval(sess,model,valid,num_classes,label2index):
156 | """
157 | do evaluation using the validation set; report loss and f1 score.
158 | :param sess:
159 | :param model:
160 | :param valid:
161 | :param num_classes:
162 | :param label2index:
163 | :return:
164 | """
165 | valid_x,valid_y=valid
166 | valid_x,valid_y=valid_x[0:64*15],valid_y[0:64*15] # limit evaluation to a subset for speed
167 | number_examples=valid_x.shape[0]
168 | print("number_examples:",number_examples)
169 | eval_loss,eval_counter=0.0,0
170 | batch_size=FLAGS.batch_size
171 | label_dict=init_label_dict(num_classes)
172 | eval_macro_f1, eval_micro_f1 = 0.0,0.0
173 | for start,end in zip(range(0,number_examples,batch_size),range(batch_size,number_examples,batch_size)):
174 | feed_dict = {model.input_x: valid_x[start:end],model.input_y:valid_y[start:end],model.dropout_keep_prob: 1.0}
175 | curr_eval_loss, logits= sess.run([model.loss_val,model.logits],feed_dict) # logits:[batch_size,label_size]
176 | #compute confuse matrix
177 | label_dict=compute_confuse_matrix_batch(valid_y[start:end],logits,label_dict,name='bright')
178 | eval_loss=eval_loss+curr_eval_loss
179 | eval_counter=eval_counter+1
180 | #compute f1_micro & f1_macro
181 | f1_micro,f1_macro=compute_micro_macro(label_dict) #label_dict is a dict, key is: an label,value is: (TP,FP,FN). where TP is number of True Positive
182 | compute_f1_score_write_for_debug(label_dict,label2index)
183 | return eval_loss/float(eval_counter+small_value),f1_macro,f1_micro
184 |
185 | def optimistic_restore(session, save_file):
186 | """
187 | restore only those variables that exist in both the model and the checkpoint (with matching shapes)
188 | :param session:
189 | :param save_file:
190 | :return:
191 | """
192 | reader = tf.train.NewCheckpointReader(save_file)
193 | saved_shapes = reader.get_variable_to_shape_map()
194 | var_names = sorted([(var.name, var.name.split(':')[0]) for
195 | var in tf.global_variables()
196 | if var.name.split(':')[0] in saved_shapes])
197 | restore_vars = []
198 | name2var = dict(zip(map(lambda x: x.name.split(':')[0],tf.global_variables()),tf.global_variables()))
199 | with tf.variable_scope('', reuse=True):
200 | for var_name, saved_var_name in var_names:
201 | curr_var = name2var[saved_var_name]
202 | var_shape = curr_var.get_shape().as_list()
203 | if var_shape == saved_shapes[saved_var_name]:
204 | #print("going to restore.var_name:",var_name,";saved_var_name:",saved_var_name)
205 | restore_vars.append(curr_var)
206 | else:
207 | print("variable not trained.var_name:",var_name)
208 | saver = tf.train.Saver(restore_vars)
209 | saver.restore(session, save_file)
210 |
211 | if __name__ == "__main__":
212 | tf.app.run()
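do_eval above accumulates per-label (TP, FP, FN) counts in label_dict and then calls compute_micro_macro from evaluation_matrix. A minimal sketch of how micro and macro F1 can be derived from such a dict, for illustration only (the repository's own implementation lives in evaluation_matrix.py):

# Illustration only: micro/macro F1 from a {label: (TP, FP, FN)} dict.
def f1(tp, fp, fn, eps=1e-8):
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return 2 * precision * recall / (precision + recall + eps)

def micro_macro_f1(label_dict):
    # macro: average the per-label F1 scores
    macro = sum(f1(*counts) for counts in label_dict.values()) / len(label_dict)
    # micro: pool all TP/FP/FN counts, then compute a single F1
    tp, fp, fn = (sum(c[i] for c in label_dict.values()) for i in range(3))
    micro = f1(tp, fp, fn)
    return micro, macro

print(micro_macro_f1({"pos": (8, 2, 1), "neg": (5, 1, 4)}))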
--------------------------------------------------------------------------------
/train_cnn_fine_grain.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #process--->1.load data(X: list of int, y: int). 2.create session. 3.feed data. 4.training (5.validation), (6.prediction)
3 | """
4 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
5 | main idea: based on a multi-layer self-attention model (the encoder of the Transformer), pre-train two tasks (masked language model and next sentence prediction)
6 | on a large corpus, then fine-tune by adding a single classification layer.
7 |
8 | train the model (transformer) with data enhanced by pre-training on the two tasks.
9 | the default hyperparameters are d_model=512, h=8, d_k=d_v=64 (big). if you have a small data set or want to train a
10 | small model, use d_model=128, h=8, d_k=d_v=16 (small), or d_model=64, h=8, d_k=d_v=8 (tiny).
11 | """
12 |
13 | import tensorflow as tf
14 | #import numpy as np
15 | #from model.bert_model import BertModel # TODO: test whether pre-training can boost performance with another model
16 | from model.bert_cnn_fine_grain_model import BertCNNFineGrainModel as BertModel
17 |
18 | from data_util_hdf5 import assign_pretrained_word_embedding,set_config,create_or_load_vocabulary
19 | import os
20 | import pickle
21 | from evaluation_matrix import *
22 | #from model.config_transformer import Config
23 | #configuration
24 | FLAGS=tf.app.flags.FLAGS
25 |
26 | tf.app.flags.DEFINE_string("data_path","./data/","path of traning data.")
27 | tf.app.flags.DEFINE_string("training_data_file","./data/bert_train2.txt","path of traning data.") #./data/cail2018_bi.json
28 | tf.app.flags.DEFINE_string("valid_data_file","./data/bert_valid2.txt","path of validation data.")
29 | tf.app.flags.DEFINE_string("test_data_file","./data/bert_test2.txt","path of validation data.")
30 | tf.app.flags.DEFINE_string("ckpt_dir","./checkpoint_lm/","checkpoint location for the model for restore from pre-train") #save to here, so make it easy to upload for test
31 | tf.app.flags.DEFINE_string("ckpt_dir_save","./checkpoint_lm_save/","checkpoint location for the model for save fine-tuning") #save to here, so make it easy to upload for test
32 |
33 | tf.app.flags.DEFINE_string("tokenize_style","word","checkpoint location for the model")
34 | tf.app.flags.DEFINE_string("model_name","","text cnn model. pre-train and fine-tuning.") # BertCNNFineGrainModel
35 |
36 | tf.app.flags.DEFINE_integer("vocab_size",70000,"maximum vocab size.")
37 | tf.app.flags.DEFINE_float("learning_rate",0.001,"learning rate") #0.001
38 | tf.app.flags.DEFINE_integer("batch_size", 64, "Batch size for training/evaluating.") # 32-->128
39 | tf.app.flags.DEFINE_integer("decay_steps", 10000, "how many steps before decay learning rate.") # 32-->128
40 | tf.app.flags.DEFINE_float("decay_rate", 0.8, "Rate of decay for learning rate.") #0.65
41 | tf.app.flags.DEFINE_float("dropout_keep_prob", 0.9, "percentage to keep when using dropout.") #0.65
42 | tf.app.flags.DEFINE_integer("sequence_length",400,"max sentence length")#400
43 | tf.app.flags.DEFINE_integer("sequence_length_lm",10,"max sentence length for masked language model")
44 |
45 | tf.app.flags.DEFINE_boolean("is_training",True,"is training.true:tranining,false:testing/inference")
46 | tf.app.flags.DEFINE_boolean("is_fine_tuning",True,"is_finetuning.ture:this is fine-tuning stage")
47 |
48 | tf.app.flags.DEFINE_integer("num_epochs",35,"number of epochs to run.")
49 | tf.app.flags.DEFINE_integer("process_num",35,"number of cpu used")
50 |
51 | tf.app.flags.DEFINE_integer("validate_every", 1, "Validate every validate_every epochs.") #
52 | tf.app.flags.DEFINE_boolean("use_pretrained_embedding",False,"whether to use embedding or not.")#
53 | tf.app.flags.DEFINE_string("word2vec_model_path","./data/Tencent_AILab_ChineseEmbedding_100w.txt","word2vec's vocabulary and vectors") # data/sgns.target.word-word.dynwin5.thr10.neg5.dim300.iter5--->data/news_12g_baidubaike_20g_novel_90g_embedding_64.bin--->sgns.merge.char
54 | tf.app.flags.DEFINE_boolean("test_mode",False,"whether it is test mode. if it is test mode, only small percentage of data will be used. test mode for test purpose.")
55 |
56 | tf.app.flags.DEFINE_integer("d_model", 128, "dimension of model") # 128-->200
57 | tf.app.flags.DEFINE_integer("num_layer", 6, "number of layer")
58 | tf.app.flags.DEFINE_integer("num_header", 8, "number of header")
59 | tf.app.flags.DEFINE_integer("d_k", 16, "dimension of k") # 64-->16
60 | tf.app.flags.DEFINE_integer("d_v", 16, "dimension of v") # 64-->16
61 |
62 | tf.app.flags.DEFINE_string("cache_file","./preprocess_word/train_valid_test_vocab_cache.pik","cache file that contains train/valid/test data and vocab of words and label2index")
63 |
64 | def main(_):
65 | # 1. load vocabulary of tokens from the cache file saved during the pre-training stage; load the label dict from the training file; print some messages.
66 | vocab_word2index, _= create_or_load_vocabulary(FLAGS.data_path,FLAGS.training_data_file,FLAGS.vocab_size,test_mode=FLAGS.test_mode,tokenize_style=FLAGS.tokenize_style,model_name=FLAGS.model_name)
67 | #label2index=get_lable2index(FLAGS.data_path,FLAGS.training_data_file, tokenize_style=FLAGS.tokenize_style)
68 | #vocab_size = len(vocab_word2index);print("cnn_model.vocab_size:",vocab_size);num_classes=len(label2index);print("num_classes:",num_classes)
69 | # load training data.
70 | #train,valid, test= load_data_multilabel(FLAGS.data_path,FLAGS.training_data_file,FLAGS.valid_data_file,FLAGS.test_data_file,vocab_word2index,label2index,FLAGS.sequence_length,
71 | # process_num=FLAGS.process_num,test_mode=FLAGS.test_mode,tokenize_style=FLAGS.tokenize_style)
72 | #train_X, train_Y= train
73 | #valid_X, valid_Y= valid
74 | #test_X,test_Y = test
75 | if not os.path.exists(FLAGS.cache_file):
76 | print("cache file is missing. please generate it though step by step with preprocess_word.ipynb")
77 | return
78 | train_X, train_Y, valid_X, valid_Y, test_X, label2index=None,None,None,None,None,None
79 |
80 | with open(FLAGS.cache_file, 'rb') as data_f:
81 | train_X, train_Y, valid_X, valid_Y, test_X,_, label2index=pickle.load(data_f)
82 | valid=(valid_X, valid_Y)
83 | data_f.close()
84 | num_classes=len(label2index)
85 | vocab_size=len(vocab_word2index)
86 | FLAGS.sequence_length=train_X.shape[1] #
87 | print("test_model:",FLAGS.test_mode,";length of training data:",train_X.shape,";valid data:",valid_X.shape,";test data:",test_X.shape,";train_Y:",train_Y.shape)
88 | # 2.create session.
89 | gpu_config=tf.ConfigProto()
90 | gpu_config.gpu_options.allow_growth=True
91 | with tf.Session(config=gpu_config) as sess:
92 | #Instantiate Model
93 | config=set_config(FLAGS,num_classes,vocab_size)
94 | model=BertModel(config)
95 | #Initialize Save
96 | saver=tf.train.Saver()
97 | if os.path.exists(FLAGS.ckpt_dir+"checkpoint"):
98 | print("Restoring Variables from Checkpoint.")
99 | sess.run(tf.global_variables_initializer())
100 | for i in range(6): #decay learning rate if necessary.
101 | print(i,"Going to decay learning rate by a factor of "+str(FLAGS.decay_rate))
102 | sess.run(model.learning_rate_decay_half_op)
103 | # restore those variables whose names and shapes exist in your model from the checkpoint. for details check: https://gist.github.com/iganichev/d2d8a0b1abc6b15d4a07de83171163d4
104 | optimistic_restore(sess, tf.train.latest_checkpoint(FLAGS.ckpt_dir)) #saver.restore(sess,tf.train.latest_checkpoint(FLAGS.ckpt_dir))
105 | else:
106 | print('Initializing Variables as no model checkpoint exists.')
107 | sess.run(tf.global_variables_initializer())
108 | if FLAGS.use_pretrained_embedding:
109 | vocabulary_index2word={index:word for word,index in vocab_word2index.items()}
110 | assign_pretrained_word_embedding(sess, vocabulary_index2word, vocab_size,FLAGS.word2vec_model_path,model.embedding,config.d_model) # assign pretrained word embeddings
111 | curr_epoch=sess.run(model.epoch_step)
112 | # 3.feed data & training
113 | number_of_training_data=len(train_X)
114 | batch_size=FLAGS.batch_size
115 | iteration=0
116 | score_best=-100
117 | f1_score=0
118 | epoch=0
119 | for epoch in range(curr_epoch,FLAGS.num_epochs):
120 | loss_total, counter = 0.0, 0
121 | for start, end in zip(range(0, number_of_training_data, batch_size),range(batch_size, number_of_training_data, batch_size)):
122 | iteration=iteration+1
123 | if epoch==0 and counter==0:
124 | print("trainX[start:end]:",train_X[start:end],"train_X.shape:",train_X.shape)
125 | feed_dict = {model.input_x: train_X[start:end],model.input_y:train_Y[start:end],model.dropout_keep_prob: FLAGS.dropout_keep_prob}
126 | current_loss,lr,l2_loss,_=sess.run([model.loss_val,model.learning_rate,model.l2_loss,model.train_op],feed_dict)
127 | loss_total,counter=loss_total+current_loss,counter+1
128 | if counter %30==0:
129 | print("Learning rate:%.7f\tLoss:%.3f\tCurrent_loss:%.3f\tL2_loss%.3f\t"%(lr,float(loss_total)/float(counter),current_loss,l2_loss))
130 | if start!=0 and start%(4000*FLAGS.batch_size)==0:
131 | loss_valid, f1_macro_valid, f1_micro_valid= do_eval(sess, model, valid,num_classes,label2index)
132 | f1_score_valid=((f1_macro_valid+f1_micro_valid)/2.0) #*100.0
133 | print("Valid.Epoch %d ValidLoss:%.3f\tF1_score_valid:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t" % (epoch, loss_valid, f1_score_valid, f1_macro_valid, f1_micro_valid))
134 |
135 | # save model to checkpoint
136 | if f1_score_valid>score_best:
137 | save_path = FLAGS.ckpt_dir_save + "model.ckpt"
138 | print("going to save check point.")
139 | saver.save(sess, save_path, global_step=epoch)
140 | score_best=f1_score_valid
141 | #epoch increment
142 | print("going to increment epoch counter....")
143 | sess.run(model.epoch_increment)
144 |
145 | # 4.validation
146 | print(epoch,FLAGS.validate_every,(epoch % FLAGS.validate_every==0))
147 | if epoch % FLAGS.validate_every==0:
148 | loss_valid,f1_macro_valid2,f1_micro_valid2=do_eval(sess,model,valid,num_classes,label2index)
149 | f1_score_valid2 = ((f1_macro_valid2 + f1_micro_valid2) / 2.0) #* 100.0
150 | print("Valid.Epoch %d ValidLoss:%.3f\tF1 score:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t"% (epoch,loss_valid,f1_score_valid2,f1_macro_valid2,f1_micro_valid2))
151 | #save model to checkpoint
152 | if f1_score_valid2 > score_best:
153 | save_path=FLAGS.ckpt_dir_save+"model.ckpt"
154 | print("going to save check point.")
155 | saver.save(sess,save_path,global_step=epoch)
156 | score_best = f1_score_valid2
157 | if (epoch == 2 or epoch == 4 or epoch == 6 or epoch == 9 or epoch == 13):
158 | for i in range(1):
159 | print(i, "Going to decay learning rate by half.")
160 | sess.run(model.learning_rate_decay_half_op)
161 |
162 | # 5.report on test set
163 | #loss_test, f1_macro_test, f1_micro_test=do_eval(sess, model, test,num_classes, label2index)
164 | #f1_score_test=((f1_macro_test + f1_micro_test) / 2.0) * 100.0
165 | #print("Test.Epoch %d TestLoss:%.3f\tF1_score:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t" % (epoch, loss_test, f1_score_test,f1_macro_test, f1_micro_test))
166 | print("training completed...")
167 |
168 | #sess,model,valid,iteration,num_classes,label2index
169 | num_fine_grain_type=20 # 20 fine-grained sentiment aspects
170 | num_fine_grain_value=4 # 4 possible values per aspect: [1,0,-1,-2]
171 | def do_eval(sess,model,valid,num_classes,label2index):
172 | """
173 | do evaluation using the validation set; report loss and f1 score.
174 | :param sess:
175 | :param model:
176 | :param valid:
177 | :param num_classes:
178 | :param label2index:
179 | :return:
180 | """
181 | valid_x,valid_y=valid
182 | valid_x,valid_y=valid_x[0:64*80],valid_y[0:64*80] # limit evaluation to a subset for speed
183 | number_examples=valid_x.shape[0]
184 | print("number_examples for valid:",number_examples)
185 | eval_loss,eval_counter=0.0,0
186 | batch_size=FLAGS.batch_size
187 | label_dict=init_label_dict(num_classes)
188 | eval_macro_f1, eval_micro_f1 = 0.0,0.0
189 | for start,end in zip(range(0,number_examples,batch_size),range(batch_size,number_examples,batch_size)):
190 | feed_dict = {model.input_x: valid_x[start:end],model.input_y:valid_y[start:end],model.dropout_keep_prob: 1.0}
191 | curr_eval_loss, logits= sess.run([model.loss_val,model.logits],feed_dict) # logits:[batch_size,label_size]
192 | #compute confuse matrix
193 | label_dict=compute_confuse_matrix_batch(valid_y[start:end],logits,label_dict,name='bright')
194 | #for aspect_index in range(num_fine_grain_type):
195 | # start_sub=aspect_index*num_fine_grain_value
196 | # start_end=start_sub+num_fine_grain_value
197 | # valid_y_sub=valid_y[start:end][:,start_sub:start_end]
198 | # logits_sub=logits[start:end][:,start_sub:start_end]
199 | # label_dict=compute_confuse_matrix_batch(valid_y_sub[start:end],logits_sub,label_dict,name='bright')
200 | # if start%3000==0:
201 | # print("valid_y_sub:",valid_y_sub)
202 | # print("logits_sub:",logits_sub)
203 | eval_loss=eval_loss+curr_eval_loss
204 | eval_counter=eval_counter+1
205 | #compute f1_micro & f1_macro
206 | f1_micro,f1_macro=compute_micro_macro(label_dict) #label_dict is a dict, key is: an label,value is: (TP,FP,FN). where TP is number of True Positive
207 | compute_f1_score_write_for_debug(label_dict,label2index)
208 | return eval_loss/float(eval_counter+small_value),f1_macro,f1_micro
209 |
210 | def optimistic_restore(session, save_file):
211 | """
212 | restore only those variables that exist in both the model and the checkpoint (with matching shapes)
213 | :param session:
214 | :param save_file:
215 | :return:
216 | """
217 | reader = tf.train.NewCheckpointReader(save_file)
218 | saved_shapes = reader.get_variable_to_shape_map()
219 | var_names = sorted([(var.name, var.name.split(':')[0]) for
220 | var in tf.global_variables()
221 | if var.name.split(':')[0] in saved_shapes])
222 | restore_vars = []
223 | name2var = dict(zip(map(lambda x: x.name.split(':')[0],tf.global_variables()),tf.global_variables()))
224 | with tf.variable_scope('', reuse=True):
225 | for var_name, saved_var_name in var_names:
226 | curr_var = name2var[saved_var_name]
227 | var_shape = curr_var.get_shape().as_list()
228 | if var_shape == saved_shapes[saved_var_name]:
229 | #print("going to restore.var_name:",var_name,";saved_var_name:",saved_var_name)
230 | restore_vars.append(curr_var)
231 | else:
232 | print("variable not trained.var_name:",var_name)
233 | saver = tf.train.Saver(restore_vars)
234 | saver.restore(session, save_file)
235 |
236 | if __name__ == "__main__":
237 | tf.app.run()
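The script above expects FLAGS.cache_file to unpickle into seven objects, of which the sixth is unused by training. A small sanity-check sketch for that cache (the path matches the default flag; it assumes the pickle was produced by preprocess_word.ipynb):

# Sketch: verify the layout of the preprocessed cache before training.
import pickle

with open("./preprocess_word/train_valid_test_vocab_cache.pik", "rb") as f:
    train_X, train_Y, valid_X, valid_Y, test_X, _, label2index = pickle.load(f)
print("train:", train_X.shape, train_Y.shape)
print("valid:", valid_X.shape, valid_Y.shape)
print("test:", test_X.shape, "num_classes:", len(label2index))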
--------------------------------------------------------------------------------
/train_cnn_lm.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | #process--->1.load data(X,y). 2.create session. 3.feed data. 4.training (5.validation), (6.prediction)
3 |
4 | """
5 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
6 | main idea: based on a multi-layer self-attention model (the encoder of the Transformer), pre-train two tasks (masked language model and next sentence prediction)
7 | on a large corpus, then fine-tune by adding a single classification layer.
8 | train the model (transformer) with data enhanced by pre-training on the two tasks.
9 | the default hyperparameters are d_model=512, h=8, d_k=d_v=64 (big). if you have a small data set or want to train a
10 | small model, use d_model=128, h=8, d_k=d_v=16 (small), or d_model=64, h=8, d_k=d_v=8 (tiny).
11 | """
12 | import tensorflow as tf
13 | import numpy as np
14 | #from model.bert_model import BertModel # TODO: test whether pre-training can boost performance with another model
15 | from model.bert_cnn_fine_grain_model import BertCNNFineGrainModel as BertModel
16 | from data_util_hdf5 import create_or_load_vocabulary,load_data_multilabel,assign_pretrained_word_embedding,set_config
17 | import os
18 | from evaluation_matrix import *
19 | from pretrain_task import mask_language_model,mask_language_model_multi_processing
20 | from model.config import Config
21 | import random
22 |
23 | #configuration
24 | FLAGS=tf.app.flags.FLAGS
25 |
26 | tf.app.flags.DEFINE_boolean("test_mode",False,"whether it is test mode. if it is test mode, only small percentage of data will be used")
27 | tf.app.flags.DEFINE_string("data_path","./data/","path of traning data.")
28 | tf.app.flags.DEFINE_string("mask_lm_source_file","./data/sentiment_analysis_all.csv","path of traning data.") # sentiment_analysis_all.csv is concat of training and testb of this task, which are both csv format file
29 | tf.app.flags.DEFINE_string("ckpt_dir","./checkpoint_lm/","checkpoint location for the model") #save to here, so make it easy to upload for test
30 | tf.app.flags.DEFINE_integer("vocab_size",70000,"maximum vocab size.")
31 | tf.app.flags.DEFINE_integer("d_model", 200, "dimension of model") # 512-->128
32 | tf.app.flags.DEFINE_integer("num_layer", 6, "number of layer")
33 | tf.app.flags.DEFINE_integer("num_header", 8, "number of header")
34 | tf.app.flags.DEFINE_integer("d_k", 8, "dimension of k") # 64
35 | tf.app.flags.DEFINE_integer("d_v", 8, "dimension of v") # 64
36 |
37 | tf.app.flags.DEFINE_string("tokenize_style","word","checkpoint location for the model")
38 | tf.app.flags.DEFINE_integer("max_allow_sentence_length",10,"max length of allowed sentence for masked language model")
39 | tf.app.flags.DEFINE_float("learning_rate",0.0001,"learning rate") #0.001
40 | tf.app.flags.DEFINE_integer("batch_size", 64, "Batch size for training/evaluating.")
41 | tf.app.flags.DEFINE_integer("decay_steps", 1000, "how many steps before decay learning rate.")
42 | tf.app.flags.DEFINE_float("decay_rate", 1.0, "Rate of decay for learning rate.")
43 | tf.app.flags.DEFINE_float("dropout_keep_prob", 0.9, "percentage to keep when using dropout.")
44 | tf.app.flags.DEFINE_integer("sequence_length",200,"max sentence length")#400
45 | tf.app.flags.DEFINE_integer("sequence_length_lm",10,"max sentence length for masked language model")
46 | tf.app.flags.DEFINE_boolean("is_training",True,"is training.true:tranining,false:testing/inference")
47 | tf.app.flags.DEFINE_boolean("is_fine_tuning",False,"is_finetuning.ture:this is fine-tuning stage")
48 | tf.app.flags.DEFINE_integer("num_epochs",30,"number of epochs to run.")
49 | tf.app.flags.DEFINE_integer("validate_every", 1, "Validate every validate_every epochs.")
50 | tf.app.flags.DEFINE_boolean("use_pretrained_embedding",False,"whether to use embedding or not.")#
51 | tf.app.flags.DEFINE_string("word2vec_model_path","./data/Tencent_AILab_ChineseEmbedding_100w.txt","word2vec's vocabulary and vectors") # data/sgns.target.word-word.dynwin5.thr10.neg5.dim300.iter5--->data/news_12g_baidubaike_20g_novel_90g_embedding_64.bin--->sgns.merge.char
52 | tf.app.flags.DEFINE_integer("process_num",35,"number of cpu process")
53 |
54 | def main(_):
55 | vocab_word2index, _= create_or_load_vocabulary(FLAGS.data_path,FLAGS.mask_lm_source_file,FLAGS.vocab_size,test_mode=FLAGS.test_mode,tokenize_style=FLAGS.tokenize_style)
56 | vocab_size = len(vocab_word2index);print("bert_pertrain_lm.vocab_size:",vocab_size)
57 | index2word={v:k for k,v in vocab_word2index.items()}
58 | #train,valid,test=mask_language_model(FLAGS.mask_lm_source_file,FLAGS.data_path,index2word,max_allow_sentence_length=FLAGS.max_allow_sentence_length,test_mode=FLAGS.test_mode)
59 | train, valid, test = mask_language_model(FLAGS.mask_lm_source_file, FLAGS.data_path, index2word, max_allow_sentence_length=FLAGS.max_allow_sentence_length,test_mode=FLAGS.test_mode, process_num=FLAGS.process_num)
60 |
61 | train_X, train_y,train_p = train
62 | valid_X, valid_y,valid_p = valid
63 | test_X,test_y,test_p = test
64 |
65 | print("length of training data:",train_X.shape,";train_Y:",train_y.shape,";train_p:",train_p.shape,";valid data:",valid_X.shape,";test data:",test_X.shape)
66 | # 1.create session.
67 | gpu_config=tf.ConfigProto()
68 | gpu_config.gpu_options.allow_growth=True
69 | with tf.Session(config=gpu_config) as sess:
70 | #Instantiate Model
71 | config=set_config(FLAGS,vocab_size,vocab_size)
72 | model=BertModel(config)
73 | #Initialize Save
74 | saver=tf.train.Saver()
75 | if os.path.exists(FLAGS.ckpt_dir+"checkpoint"):
76 | print("Restoring Variables from Checkpoint.")
77 | saver.restore(sess,tf.train.latest_checkpoint(FLAGS.ckpt_dir))
78 | for i in range(2): #decay learning rate if necessary.
79 | print(i,"Going to decay learning rate by half.")
80 | sess.run(model.learning_rate_decay_half_op)
81 | else:
82 | print('Initializing Variables')
83 | sess.run(tf.global_variables_initializer())
84 | if FLAGS.use_pretrained_embedding:
85 | vocabulary_index2word={index:word for word,index in vocab_word2index.items()}
86 | assign_pretrained_word_embedding(sess, vocabulary_index2word, vocab_size,FLAGS.word2vec_model_path,model.embedding,config.d_model) # assign pretrained word embeddings
87 | curr_epoch=sess.run(model.epoch_step)
88 |
89 | # 2.feed data & training
90 | number_of_training_data=len(train_X)
91 | print("number_of_training_data:",number_of_training_data)
92 | batch_size=FLAGS.batch_size
93 | iteration=0
94 | score_best=-100
95 | for epoch in range(curr_epoch,FLAGS.num_epochs):
96 | loss_total_lm, counter = 0.0, 0
97 | for start, end in zip(range(0, number_of_training_data, batch_size),range(batch_size, number_of_training_data, batch_size)):
98 | iteration=iteration+1
99 | if epoch==0 and counter==0:
100 | print("trainX[start:end]:",train_X[start:end],"train_X.shape:",train_X.shape)
101 | feed_dict = {model.x_mask_lm: train_X[start:end],model.y_mask_lm: train_y[start:end],model.p_mask_lm:train_p[start:end],
102 | model.dropout_keep_prob: FLAGS.dropout_keep_prob}
103 | current_loss_lm,lr,l2_loss,_=sess.run([model.loss_val_lm,model.learning_rate,model.l2_loss_lm,model.train_op_lm],feed_dict)
104 | loss_total_lm,counter=loss_total_lm+current_loss_lm,counter+1
105 | if counter %30==0:
106 | print("%d\t%d\tLearning rate:%.5f\tLoss_lm:%.3f\tCurrent_loss_lm:%.3f\tL2_loss:%.3f\t"%(epoch,counter,lr,float(loss_total_lm)/float(counter),current_loss_lm,l2_loss))
107 | if start!=0 and start%(3000*FLAGS.batch_size)==0: # epoch!=0
108 | loss_valid, acc_valid= do_eval(sess, model, valid,batch_size)
109 | print("%d\tValid.Epoch %d ValidLoss:%.3f\tAcc_valid:%.3f\t" % (counter,epoch, loss_valid, acc_valid*100))
110 | # save model to checkpoint
111 | if acc_valid>score_best:
112 | save_path = FLAGS.ckpt_dir + "model.ckpt"
113 | print("going to save check point.")
114 | saver.save(sess, save_path, global_step=epoch)
115 | score_best=acc_valid
116 | sess.run(model.epoch_increment)
117 |
118 | validation_size=100*FLAGS.batch_size
119 | def do_eval(sess,model,valid,batch_size):
120 | """
121 | do evaluation using the validation set; report loss and accuracy for the masked language model.
122 | :param sess:
123 | :param model:
124 | :param valid:
126 | :param batch_size:
126 | :return:
127 | """
128 | valid_X,valid_y,valid_p=valid
129 | number_examples=valid_X.shape[0]
130 | if number_examples>10000:
131 | number_examples=validation_size
132 | print("do_eval.valid.number_examples:",number_examples)
133 | if number_examples>validation_size: valid_X,valid_y,valid_p=valid_X[0:validation_size],valid_y[0:validation_size],valid_p[0:validation_size]
134 | eval_loss,eval_counter,eval_acc=0.0,0,0.0
135 | for start,end in zip(range(0,number_examples,batch_size),range(batch_size,number_examples,batch_size)):
136 | feed_dict = {model.x_mask_lm: valid_X[start:end],model.y_mask_lm: valid_y[start:end],model.p_mask_lm:valid_p[start:end],
137 | model.dropout_keep_prob: 1.0} # FLAGS.dropout_keep_prob
138 | curr_eval_loss, logits_lm, accuracy_lm= sess.run([model.loss_val_lm,model.logits_lm,model.accuracy_lm],feed_dict) # logits:[batch_size,label_size]
139 | eval_loss=eval_loss+curr_eval_loss
140 | eval_acc=eval_acc+accuracy_lm
141 | eval_counter=eval_counter+1
142 | return eval_loss/float(eval_counter+small_value), eval_acc/float(eval_counter+small_value)
143 |
144 | if __name__ == "__main__":
145 | tf.app.run()
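All three training scripts slice mini-batches with the same zip(range(...), range(...)) pattern; note that it silently drops the final partial batch. A tiny standalone illustration:

# Illustration of the batch-slicing pattern used in the training loops above.
number_of_training_data, batch_size = 10, 3
for start, end in zip(range(0, number_of_training_data, batch_size),
                      range(batch_size, number_of_training_data, batch_size)):
    print(start, end)  # prints 0 3, 3 6, 6 9; the final partial batch [9:10] is skipped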
--------------------------------------------------------------------------------
/word2vec/README.txt:
--------------------------------------------------------------------------------
1 | save word2vec here.
--------------------------------------------------------------------------------