├── .idea ├── dictionaries │ └── xuliang.xml └── vcs.xml ├── BERT_BASE_DIR └── readMe.md ├── PRE_TRAIN_DIR └── readMe.md ├── README.md ├── README_bert_chinese_tutorial.md ├── TEXT_DIR ├── dev2.tsv ├── readMe.md └── train2.tsv ├── __init__.py ├── ai_challenger_sentiment_analysis_testa_20180816 ├── README.txt └── protocol.txt ├── ai_challenger_sentiment_analysis_trainingset_20180816 ├── README.txt └── protocol.txt ├── ai_challenger_sentiment_analysis_validationset_20180816 ├── README.txt └── protocol.txt ├── ai_challenger_sentimetn_analysis_testb_20180816 ├── README.txt └── protocol.txt ├── bigru_char_checkpoint └── README.txt ├── create_pretraining_data.py ├── data └── img │ ├── bert_sa.jpg │ ├── bert_sentiment_analysis.jpg │ └── fine_grain.jpg ├── data_util_hdf5.py ├── evaluation_matrix.py ├── model ├── __init__.py ├── base_model.py ├── bert_cnn_fine_grain_model.py ├── bert_cnn_model.py ├── bert_model.py ├── bert_modeling.py ├── config.py ├── config_transformer.py ├── encoder.py ├── layer_norm_residual_conn.py ├── multi_head_attention.py ├── optimization.py ├── poistion_wise_feed_forward.py └── transfomer_model.py ├── old ├── JoinAttLayer.py ├── Preprocess_char_old.ipynb ├── classifier_bigru.py ├── classifier_capsule.py ├── classifier_rcnn.py ├── evaluate_char.py ├── model_bigru_char.py ├── model_capsule_char.py ├── model_rcnn_char.py ├── predict_bigru_char.py ├── predict_rcnn_char.py ├── rcnn_retrain.py ├── stopwords.txt ├── temp_covert.py ├── train_transform.py ├── validation_bigru_char.py └── validation_rcnn_char.py ├── preprocess_char.ipynb ├── preprocess_char └── README.txt ├── preprocess_word.ipynb ├── pretrain_task.py ├── run_classifier_multi_labels_bert.py ├── run_pretraining.py ├── tokenization.py ├── tokenizer_char.pickle ├── train_bert_fine_tuning.py ├── train_cnn_fine_grain.py ├── train_cnn_lm.py └── word2vec └── README.txt

/.idea/dictionaries/xuliang.xml:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/.idea/vcs.xml:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/BERT_BASE_DIR/readMe.md:
--------------------------------------------------------------------------------
you need to download the pre-trained model from Google and put it into a folder (e.g. BERT_BASE_DIR)
--------------------------------------------------------------------------------
/PRE_TRAIN_DIR/readMe.md:
--------------------------------------------------------------------------------
# you can put the pre-train files in a directory, e.g. "PRE_TRAIN_DIR".
# Input file format:
# (1) One sentence per line. These should ideally be actual sentences, not
#     entire paragraphs or arbitrary spans of text, because we use the
#     sentence boundaries for the "next sentence prediction" task.
# (2) Blank lines between documents. Document boundaries are needed so
#     that the "next sentence prediction" task doesn't span between documents.
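For illustration only, a minimal pre-train input file in this format could look like the example below; the two short "documents" are made-up placeholder review sentences, not part of the dataset:

```text
交通很方便,就在地铁站旁边。
菜品分量不大,但是味道不错。

服务员态度很好,上菜速度也快。
环境安静,适合朋友聚会。
```

Each block of consecutive lines is treated as one document, and the blank line marks the document boundary that the "next sentence prediction" task relies on.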
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Introduction

With this repository, you will be able to train multi-label classification with BERT and deploy BERT for online prediction.

You can also find a short tutorial on how to use BERT with Chinese: BERT short chinese tutorial

You can find an introduction to fine-grained sentiment analysis from AI Challenger

## Basic Ideas

Add something here.

## Experiment on New Models

for more, check model/bert_cnn_fine_grain_model.py

## Performance

Model    | TextCNN(No-pretrain) | TextCNN(Pretrain-Finetuning) | Bert(base_model_zh) | Bert(base_model_zh,pre-train on corpus)
---      | ---                  | ---                          | ---                 | ---
F1 Score | 0.678                | 0.685                        | ADD A NUMBER HERE   | ADD A NUMBER HERE

Notice: the F1 score is reported on the validation set

## Usage

### Bert for Multi-label Classification [data for fine-tuning and pre-train]

    export BERT_BASE_DIR=BERT_BASE_DIR/chinese_L-12_H-768_A-12
    export TEXT_DIR=TEXT_DIR
    nohup python run_classifier_multi_labels_bert.py
      --task_name=sentiment_analysis
      --do_train=true
      --do_eval=true
      --data_dir=$TEXT_DIR
      --vocab_file=$BERT_BASE_DIR/vocab.txt
      --bert_config_file=$BERT_BASE_DIR/bert_config.json
      --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt
      --max_seq_length=512
      --train_batch_size=4
      --learning_rate=2e-5
      --num_train_epochs=3
      --output_dir=./checkpoint_bert &

1. firstly, you need to download the pre-trained model from Google and put it into a folder (e.g. BERT_BASE_DIR):

   chinese_L-12_H-768_A-12 from bert

2. secondly, you need to have training data (e.g. train.tsv) and validation data (e.g. dev.tsv), and put them under a folder (e.g. TEXT_DIR). you can also download data from here: data to train bert for AI challenger-Sentiment Analysis.

   it contains processed data that you can use both for fine-tuning on sentiment analysis and for pre-training with Bert.

   it is generated by following this notebook step by step:

   preprocess_char.ipynb

   you can generate the data by yourself, as long as the data format is compatible with the processor SentimentAnalysisFineGrainProcessor (alias: sentiment_analysis).

data format: label1,label2,label3\t here is sentence or sentences\t

it only contains two columns: the first one is the target (one or multiple labels), the second one is the input string.

no tokenization is needed.

sample: "0_1,1_-2,2_-2,3_-2,4_1,5_-2,6_-2,7_-2,8_1,9_1,10_-2,11_-2,12_-2,13_-2,14_-2,15_1,16_-2,17_-2,18_0,19_-2 浦东五莲路站,老饭店福瑞轩属于上海的本帮菜,交通方便,最近又重新装修,来拨草了,饭店活动满188元送50元钱,环境干净,简单。朋友提前一天来预订包房也没有订到,只有大堂,五点半到店基本上每个台子都客满了,都是附近居民,每道冷菜量都比以前小,味道还可以,热菜烤茄子,炒河虾仁,脆皮鸭,照牌鸡,小牛排,手撕腊味花菜等每道菜都很入味好吃,会员价划算,服务员人手太少,服务态度好,要能团购更好。可以用支付宝方便"

check the sample data in the ./BERT_BASE_DIR folder

for more detail, check create_model and SentimentAnalysisFineGrainProcessor from run_classifier.py
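To make the two-column format above concrete, here is a minimal, hypothetical Python sketch that parses one line into a per-aspect polarity mapping; the function name and the shortened sample line are illustrative and are not taken from run_classifier.py:

```python
# Illustrative sketch only (not the repo's actual processor code): parse one line of the
# "labels\tsentence" format described above into an aspect -> polarity mapping.
# The label part looks like "0_1,1_-2,...,19_-2": 20 aspects, each with a polarity
# value such as 1, 0, -1 or -2.

def parse_example(line):
    """Split one line into (aspect -> polarity dict, raw review text)."""
    label_part, text = line.rstrip("\t\n").split("\t", 1)
    aspect_polarity = {}
    for item in label_part.split(","):
        aspect, polarity = item.split("_", 1)   # e.g. "4_1" -> aspect 4, polarity 1
        aspect_polarity[int(aspect)] = int(polarity)
    return aspect_polarity, text


if __name__ == "__main__":
    # a shortened, made-up line in the same format (real lines carry all 20 aspects)
    sample = "0_1,1_-2,18_0\t交通方便,环境干净,味道还可以\t"
    labels, text = parse_example(sample)
    print(labels)   # {0: 1, 1: -2, 18: 0}
    print(text)     # the raw, untokenized review text
```

A real processor would additionally turn each (aspect, polarity) pair into the multi-hot target vector expected by the model.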
### Pre-train Bert model based on the open-sourced model, then do the classification task

1. generate raw data: [ADD SOMETHING HERE]

   make sure each line is a sentence; between documents there is a blank line.

   you can find generated data from the zip file.

   use write_pre_train_doc() from preprocess_char.ipynb

2. generate data for the pre-train stage using:

    export BERT_BASE_DIR=./BERT_BASE_DIR/chinese_L-12_H-768_A-12
    nohup python create_pretraining_data.py \
      --input_file=./PRE_TRAIN_DIR/bert_*_pretrain.txt \
      --output_file=./PRE_TRAIN_DIR/tf_examples.tfrecord \
      --vocab_file=$BERT_BASE_DIR/vocab.txt \
      --do_lower_case=True \
      --max_seq_length=512 \
      --max_predictions_per_seq=60 \
      --masked_lm_prob=0.15 \
      --random_seed=12345 \
      --dupe_factor=5 > nohup_pre.out &

3. pre-train the model with the generated data:

   python run_pretraining.py

4. fine-tuning:

   python run_classifier.py

### TextCNN

1. download the cache file of sentiment analysis (tokens are in word level)

2. train the model:

   python train_cnn_fine_grain.py

the cache file of the TextCNN model was generated by following the steps in preprocess_word.ipynb.

it contains everything you need to run TextCNN.

it includes: the processed train/validation/test sets; the word vocabulary; a dict that maps labels to indices.

take train_valid_test_vocab_cache.pik and put it under the folder preprocess_word/

raw data are also included in this zip file.

### Pre-train TextCNN

1. pre-train TextCNN with a masked language model:

   python train_cnn_lm.py

2. fine-tuning for TextCNN:

   python train_cnn_fine_grain.py

### Deploy BERT for online prediction

with the session-and-feed style you can easily deploy BERT.

online prediction with BERT: check more from here

## Reference

1. Bidirectional Encoder Representations from Transformers for Language Understanding

2. google-research/bert

3. pengshuang/AI-Comp

4. AI Challenger 2018

5. Convolutional Neural Networks for Sentence Classification
--------------------------------------------------------------------------------
/README_bert_chinese_tutorial.md:
--------------------------------------------------------------------------------
[Using the BERT pre-trained Chinese model: the shortest tutorial]

[Using the bundled datasets]

1. Download the model and datasets: https://github.com/google-research/bert

2. Run run_classifier.py with the appropriate parameters

[How to use your own dataset?]

1. Add a Processor to run_classifier.py that tells the processor how to read inputs and labels, and register this processor in processors in main

2. Put your own dataset into the expected directory. Each line is one example, containing the input and the label, separated by "\t".

3. Run run_classifier.py with the appropriate parameters

[Using the BERT model in session-feed style; online prediction with BERT]

Online prediction with BERT - a concise example

[Currently supported task types]

1. Text classification (binary or multi-class);

2. Sentence pair classification (two sentences as input, one label as output)

3. Text classification (multi-label classification)

   For multi-label tasks with BERT (e.g. the AI Challenger sentiment analysis task), see run_classifier_multi_labels_bert.py

[Pre-training on top of the Chinese BERT model, then fine-tuning]

1. Generate the files needed for pre-training: one sentence per line; documents are separated by a blank line
2. Generate the pre-training corpus in tf.record format:
   create_pretraining_data.py
3. Pre-train with the generated data; an initial checkpoint can be specified:
   run_pretraining.py
4. Fine-tuning:
   run_classifier.py
--------------------------------------------------------------------------------
/TEXT_DIR/readMe.md:
--------------------------------------------------------------------------------
you need to have training data (e.g. train.tsv) and validation data (e.g.
dev.tsv), and put it under a folder(e.g.TEXT_DIR ) 2 | 3 | you can generate data by yourself as long as data format is compatible with processor 4 | 5 | SentimentAnalysisFineGrainProcessor(alias as sentiment_analysis); 6 | 7 | you can also download data from here AI challenger-sentiment analysis, which is generated by following 8 | 9 | step by step: preprocess_char.ipynb 10 | 11 | 12 | data format: label1,label2,label3\t here is sentence or sentences\t 13 | 14 | it only contains two columns, the first one is target(one or multi-labels), the second one is input strings. 15 | 16 | no need to tokenized. 17 | 18 | sample:"0_1,1_-2,2_-2,3_-2,4_1,5_-2,6_-2,7_-2,8_1,9_1,10_-2,11_-2,12_-2,13_-2,14_-2,15_1,16_-2,17_-2,18_0,19_-2 浦东五莲路站,老饭店福瑞轩属于上海的本帮菜,交通方便,最近又重新装修,来拨草了,饭店活动满188元送50元钱,环境干净,简单。朋友提前一天来预订包房也没有订到,只有大堂,五点半到店基本上每个台子都客满了,都是附近居民,每道冷菜量都比以前小,味道还可以,热菜烤茄子,炒河虾仁,脆皮鸭,照牌鸡,小牛排,手撕腊味花菜等每道菜都很入味好吃,会员价划算,服务员人手太少,服务态度好,要能团购更好。可以用支付宝方便" 19 | 20 | check sample data in ./BERT_BASE_DIR folder 21 | -------------------------------------------------------------------------------- /__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brightmart/sentiment_analysis_fine_grain/bad7c61a2610eec614e1b42b07cadb1ec57a2ef7/__init__.py -------------------------------------------------------------------------------- /ai_challenger_sentiment_analysis_testa_20180816/README.txt: -------------------------------------------------------------------------------- 1 | sentiment_analysis_testa.csv 为测试集A数据文件,共15000条评论数据 2 | protocol.txt 为数据集下载协议 3 | -------------------------------------------------------------------------------- /ai_challenger_sentiment_analysis_testa_20180816/protocol.txt: -------------------------------------------------------------------------------- 1 | 数据集下载协议 2 | 3 | 您(以下称“研究者”)正在请求举办方授予您访问、下载并使用数据集(以下简称“数据集”)的权利(以下简称“授权”),作为获得该等授权的条件,您同意遵守以下条款: 4 | 5 | 1、研究者同意仅为非商业性的科学研究或课堂教学目的使用数据集,并不得将数据集用于任何商业用途; 6 | 2、我们不享有数据集中使用的图片、音频、文字等内容的知识产权,对前述内容不作任何保证,包括但不限于不侵犯他人知识产权或可将前述内容用于任何特定目的; 7 | 3、我们不承担因数据集使用造成的任何形式的损失或伤害,不会对任何因使用比赛数据产生的法律后果承担任何责任; 8 | 4、 与数据集使用有关的任何法律责任均由研究者承担,如研究者或其员工、代理人、分支机构使用数据集的行为给我们造成声誉或经济损失,研究者应当承担赔偿责任; 9 | 5、研究者可以授权其助手、同事或其他合作者访问和使用数据集,但应确保前述人员已经认真阅读并同意接受本协议约束; 10 | 6、如果研究者受雇于以盈利为目的的商业主体,应确保使用数据集仅用于非商业目的,且其雇主同样受本协议约束,研究者确认其签订本协议前已经取得雇主的充分授权。 11 | 7、我们有权随时取消或撤回对研究者使用数据集的授权,并有权要求研究者删除已下载数据集; 12 | 8、凡因本合同引起的或与本合同有关的任何争议,均应提交中国国际经济贸易仲裁委员会,按照申请仲裁时该会现行有效的仲裁规则,并适用中华人民共和国法律解决进行仲裁。仲裁语言应为中文。 13 | -------------------------------------------------------------------------------- /ai_challenger_sentiment_analysis_trainingset_20180816/README.txt: -------------------------------------------------------------------------------- 1 | sentiment_analysis_trainingset.csv 为训练集数据文件,共105000条评论数据 2 | sentiment_analysis_trainingset_annotations.docx 为数据标注说明文件 3 | protocol.txt 为数据集下载协议 4 | -------------------------------------------------------------------------------- /ai_challenger_sentiment_analysis_trainingset_20180816/protocol.txt: -------------------------------------------------------------------------------- 1 | 数据集下载协议 2 | 3 | 您(以下称“研究者”)正在请求举办方授予您访问、下载并使用数据集(以下简称“数据集”)的权利(以下简称“授权”),作为获得该等授权的条件,您同意遵守以下条款: 4 | 5 | 1、研究者同意仅为非商业性的科学研究或课堂教学目的使用数据集,并不得将数据集用于任何商业用途; 6 | 2、我们不享有数据集中使用的图片、音频、文字等内容的知识产权,对前述内容不作任何保证,包括但不限于不侵犯他人知识产权或可将前述内容用于任何特定目的; 7 | 3、我们不承担因数据集使用造成的任何形式的损失或伤害,不会对任何因使用比赛数据产生的法律后果承担任何责任; 8 | 4、 与数据集使用有关的任何法律责任均由研究者承担,如研究者或其员工、代理人、分支机构使用数据集的行为给我们造成声誉或经济损失,研究者应当承担赔偿责任; 9 | 5、研究者可以授权其助手、同事或其他合作者访问和使用数据集,但应确保前述人员已经认真阅读并同意接受本协议约束; 10 | 
6、如果研究者受雇于以盈利为目的的商业主体,应确保使用数据集仅用于非商业目的,且其雇主同样受本协议约束,研究者确认其签订本协议前已经取得雇主的充分授权。 11 | 7、我们有权随时取消或撤回对研究者使用数据集的授权,并有权要求研究者删除已下载数据集; 12 | 8、凡因本合同引起的或与本合同有关的任何争议,均应提交中国国际经济贸易仲裁委员会,按照申请仲裁时该会现行有效的仲裁规则,并适用中华人民共和国法律解决进行仲裁。仲裁语言应为中文。 13 | -------------------------------------------------------------------------------- /ai_challenger_sentiment_analysis_validationset_20180816/README.txt: -------------------------------------------------------------------------------- 1 | sentiment_analysis_validationset.csv 为验证集数据文件,共15000条评论数据 2 | sentiment_analysis_validationset_annotations.docx 为数据标注说明文件 3 | protocol.txt 为数据集下载协议 4 | -------------------------------------------------------------------------------- /ai_challenger_sentiment_analysis_validationset_20180816/protocol.txt: -------------------------------------------------------------------------------- 1 | 数据集下载协议 2 | 3 | 您(以下称“研究者”)正在请求举办方授予您访问、下载并使用数据集(以下简称“数据集”)的权利(以下简称“授权”),作为获得该等授权的条件,您同意遵守以下条款: 4 | 5 | 1、研究者同意仅为非商业性的科学研究或课堂教学目的使用数据集,并不得将数据集用于任何商业用途; 6 | 2、我们不享有数据集中使用的图片、音频、文字等内容的知识产权,对前述内容不作任何保证,包括但不限于不侵犯他人知识产权或可将前述内容用于任何特定目的; 7 | 3、我们不承担因数据集使用造成的任何形式的损失或伤害,不会对任何因使用比赛数据产生的法律后果承担任何责任; 8 | 4、 与数据集使用有关的任何法律责任均由研究者承担,如研究者或其员工、代理人、分支机构使用数据集的行为给我们造成声誉或经济损失,研究者应当承担赔偿责任; 9 | 5、研究者可以授权其助手、同事或其他合作者访问和使用数据集,但应确保前述人员已经认真阅读并同意接受本协议约束; 10 | 6、如果研究者受雇于以盈利为目的的商业主体,应确保使用数据集仅用于非商业目的,且其雇主同样受本协议约束,研究者确认其签订本协议前已经取得雇主的充分授权。 11 | 7、我们有权随时取消或撤回对研究者使用数据集的授权,并有权要求研究者删除已下载数据集; 12 | 8、凡因本合同引起的或与本合同有关的任何争议,均应提交中国国际经济贸易仲裁委员会,按照申请仲裁时该会现行有效的仲裁规则,并适用中华人民共和国法律解决进行仲裁。仲裁语言应为中文。 13 | -------------------------------------------------------------------------------- /ai_challenger_sentimetn_analysis_testb_20180816/README.txt: -------------------------------------------------------------------------------- 1 | sentiment_analysis_testb.csv 为测试集B数据文件,共200000条评论数据 2 | protocol.txt 为数据集下载协议 3 | -------------------------------------------------------------------------------- /ai_challenger_sentimetn_analysis_testb_20180816/protocol.txt: -------------------------------------------------------------------------------- 1 | 数据集下载协议 2 | 3 | 您(以下称“研究者”)正在请求举办方授予您访问、下载并使用数据集(以下简称“数据集”)的权利(以下简称“授权”),作为获得该等授权的条件,您同意遵守以下条款: 4 | 5 | 1、研究者同意仅为非商业性的科学研究或课堂教学目的使用数据集,并不得将数据集用于任何商业用途; 6 | 2、我们不享有数据集中使用的图片、音频、文字等内容的知识产权,对前述内容不作任何保证,包括但不限于不侵犯他人知识产权或可将前述内容用于任何特定目的; 7 | 3、我们不承担因数据集使用造成的任何形式的损失或伤害,不会对任何因使用比赛数据产生的法律后果承担任何责任; 8 | 4、 与数据集使用有关的任何法律责任均由研究者承担,如研究者或其员工、代理人、分支机构使用数据集的行为给我们造成声誉或经济损失,研究者应当承担赔偿责任; 9 | 5、研究者可以授权其助手、同事或其他合作者访问和使用数据集,但应确保前述人员已经认真阅读并同意接受本协议约束; 10 | 6、如果研究者受雇于以盈利为目的的商业主体,应确保使用数据集仅用于非商业目的,且其雇主同样受本协议约束,研究者确认其签订本协议前已经取得雇主的充分授权。 11 | 7、我们有权随时取消或撤回对研究者使用数据集的授权,并有权要求研究者删除已下载数据集; 12 | 8、凡因本合同引起的或与本合同有关的任何争议,均应提交中国国际经济贸易仲裁委员会,按照申请仲裁时该会现行有效的仲裁规则,并适用中华人民共和国法律解决进行仲裁。仲裁语言应为中文。 13 | -------------------------------------------------------------------------------- /bigru_char_checkpoint/README.txt: -------------------------------------------------------------------------------- 1 | checkpoint of bigru_char will be saved here. 
-------------------------------------------------------------------------------- /data/img/bert_sa.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brightmart/sentiment_analysis_fine_grain/bad7c61a2610eec614e1b42b07cadb1ec57a2ef7/data/img/bert_sa.jpg -------------------------------------------------------------------------------- /data/img/bert_sentiment_analysis.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brightmart/sentiment_analysis_fine_grain/bad7c61a2610eec614e1b42b07cadb1ec57a2ef7/data/img/bert_sentiment_analysis.jpg -------------------------------------------------------------------------------- /data/img/fine_grain.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brightmart/sentiment_analysis_fine_grain/bad7c61a2610eec614e1b42b07cadb1ec57a2ef7/data/img/fine_grain.jpg -------------------------------------------------------------------------------- /evaluation_matrix.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import numpy as np 3 | import random 4 | import codecs 5 | """ 6 | compute single evaulation matrix for task1,task2 and task3: 7 | compute f1 score(micro,macro) for accusation & relevant article, and score for pentaly 8 | """ 9 | 10 | small_value=0.00001 11 | random_number=1000 12 | def compute_confuse_matrix_batch(y_targetlabel_list,y_logits_array,label_dict,name='default'): 13 | """ 14 | compute confuse matrix for a batch 15 | :param y_targetlabel_list: a list; each element is a mulit-hot,e.g. [1,0,0,1,...] 16 | :param y_logits_array: a 2-d array. [batch_size,num_class] 17 | :param label_dict:{label:(TP, FP, FN)} 18 | :param name: a string for debug purpose 19 | :return:label_dict:{label:(TP, FP, FN)} 20 | """ 21 | for i,y_targetlabel_list_single in enumerate(y_targetlabel_list): 22 | label_dict=compute_confuse_matrix(y_targetlabel_list_single,y_logits_array[i],label_dict,name=name) 23 | return label_dict 24 | 25 | def compute_confuse_matrix(y_targetlabel_list_single,y_logit_array_single,label_dict,name='default'): 26 | """ 27 | compute true postive(TP), false postive(FP), false negative(FN) given target lable and predict label 28 | :param y_targetlabel_list: a list. length is batch_size(e.g.1). each element is a multi-hot,like '[0,0,1,0,1,...]' 29 | :param y_logit_array: an numpy array. shape is:[batch_size,num_classes] 30 | :param label_dict {label:(TP,FP,FN)} 31 | :return: macro_f1(a scalar),micro_f1(a scalar) 32 | """ 33 | #1.get target label and predict label 34 | y_target_labels=get_target_label_short(y_targetlabel_list_single) #e.g. 
y_targetlabel_list[0]=[2,12,88] 35 | #y_logit=y_logit_array_single #y_logit_array[0] #[202] 36 | y_predict_labels=[i for i in range(len(y_logit_array_single)) if y_logit_array_single[i]>=0.50] #TODO 0.5PW e.g.[2,12,13,10] 37 | if len(y_predict_labels) < 1: y_predict_labels = [np.argmax(y_logit_array_single)] 38 | 39 | #if len(y_predict_labels)<1: y_predict_labels=[np.argmax(y_logit_array_single)] #TODO ADD 2018.05.29 40 | if random.choice([x for x in range(random_number)]) ==1:print(name+".y_target_labels:",y_target_labels,";y_predict_labels:",y_predict_labels) #debug purpose 41 | 42 | #2.count number of TP,FP,FN for each class 43 | y_labels_unique=[] 44 | y_labels_unique.extend(y_target_labels) 45 | y_labels_unique.extend(y_predict_labels) 46 | y_labels_unique=list(set(y_labels_unique)) 47 | for i,label in enumerate(y_labels_unique): #e.g. label=2 48 | TP, FP, FN = label_dict[label] 49 | if label in y_predict_labels and label in y_target_labels:#predict=1,truth=1 (TP) 50 | TP=TP+1 51 | elif label in y_predict_labels and label not in y_target_labels:#predict=1,truth=0(FP) 52 | FP=FP+1 53 | elif label not in y_predict_labels and label in y_target_labels:#predict=0,truth=1(FN) 54 | FN=FN+1 55 | label_dict[label] = (TP, FP, FN) 56 | return label_dict 57 | 58 | 59 | def compute_penalty_score_batch(target_deaths, predict_deaths,target_lifeimprisons, predict_lifeimprisons,target_imprsions, predict_imprisons): 60 | """ 61 | compute penalty score(task 3) for a batch. 62 | :param target_deaths: a list. each element is a mulit-hot list 63 | :param predict_deaths: a 2-d array. [batch_size,num_class] 64 | :param target_lifeimprisons: a list. each element is a mulit-hot list 65 | :param predict_lifeimprisons: a 2-d array. [batch_size,num_class] 66 | :param target_imprsions: a list. each element is a mulit-hot list 67 | :param predict_imprisons: a 2-d array. [batch_size,num_class] 68 | :return: score_batch: a scalar, average score for that batch 69 | """ 70 | length=len(target_deaths) 71 | score_total=0.0 72 | for i in range(length): 73 | score=compute_penalty_score(target_deaths[i], predict_deaths[i], target_lifeimprisons[i],predict_lifeimprisons[i],target_imprsions[i], predict_imprisons[i]) 74 | score_total=score_total+score 75 | score_batch=score_total/float(length) 76 | return score_batch 77 | 78 | def compute_penalty_score(target_death, predict_death,target_lifeimprison, predict_lifeimprison,target_imprsion, predict_imprison): 79 | """ 80 | compute penalty score(task 3) for a single data 81 | :param target_death: a mulit-hot list. e.g. [1,0,0,1,...] 82 | :param predict_death: [num_class] 83 | :param target_lifeimprison: a mulit-hot list. e.g. [1,0,0,1,...] 84 | :param predict_lifeimprison: [num_class] 85 | :param target_imprsion: a mulit-hot list. e.g. [1,0,0,1,...] 
86 | :param predict_imprison:[num_class] 87 | :return: score: a scalar,score for this data 88 | """ 89 | score_death=compute_death_lifeimprisonment_score(target_death, predict_death) 90 | score_lifeimprisonment=compute_death_lifeimprisonment_score(target_lifeimprison, predict_lifeimprison) 91 | score_imprisonment=compute_imprisonment_score(target_imprsion, predict_imprison) 92 | score=((score_death+score_lifeimprisonment+score_imprisonment)/3.0)*(100.0) 93 | return score 94 | 95 | def compute_death_lifeimprisonment_score(target,predict): 96 | """ 97 | compute score for death or life imprisonment 98 | :param target: a list 99 | :param predict: an array 100 | :return: score: a scalar 101 | """ 102 | 103 | score=0.0 104 | target=np.argmax(target) 105 | predict=np.argmax(predict) 106 | if random.choice([x for x in range(random_number)]) == 1:print("death_lifeimprisonment_score.target:", target, ";predict:", predict) 107 | if target==predict: 108 | score=1.0 109 | if random.choice([x for x in range(random_number)]) == 1:print("death_lifeimprisonment_score:",score) 110 | return score 111 | 112 | def compute_imprisonment_score(target_value,predict_value): 113 | """ 114 | compute imprisonment score 115 | :param target_value: a scalar 116 | :param predict_value:a scalar 117 | :return: score: a scalar 118 | """ 119 | if random.choice([x for x in range(random_number)]) ==1:print("x.imprisonment_score.target_value:",target_value,";predict_value:",predict_value) 120 | score=0.0 121 | v=np.abs(np.log(predict_value+1.0)-np.log(target_value+1.0)) 122 | if v<=0.2: 123 | score=1.0 124 | elif v<=0.4: 125 | score=0.8 126 | elif v<=0.6: 127 | score=0.6 128 | elif v<=0.8: 129 | score=0.4 130 | elif v<=1.0: 131 | score=0.2 132 | else: 133 | score=0.0 134 | if random.choice([x for x in range(random_number)]) ==1:print("imprisonment_score:",score) 135 | return score 136 | 137 | def compute_micro_macro(label_dict): 138 | """ 139 | compute f1 of micro and macro 140 | :param label_dict: 141 | :return: f1_micro,f1_macro: scalar, scalar 142 | """ 143 | f1_micro = compute_f1_micro_use_TFFPFN(label_dict) 144 | f1_macro= compute_f1_macro_use_TFFPFN(label_dict) 145 | return f1_micro,f1_macro 146 | 147 | def compute_f1_micro_use_TFFPFN(label_dict): 148 | """ 149 | compute f1_micro 150 | :param label_dict: {label:(TP,FP,FN)} 151 | :return: f1_micro: a scalar 152 | """ 153 | TF_micro_accusation, FP_micro_accusation, FN_micro_accusation =compute_TF_FP_FN_micro(label_dict) 154 | f1_micro_accusation = compute_f1(TF_micro_accusation, FP_micro_accusation, FN_micro_accusation,'micro') 155 | return f1_micro_accusation 156 | 157 | def compute_f1_macro_use_TFFPFN(label_dict): 158 | """ 159 | compute f1_macro 160 | :param label_dict: {label:(TP,FP,FN)} 161 | :return: f1_macro 162 | """ 163 | f1_dict= {} 164 | num_classes=len(label_dict) 165 | for label, tuplee in label_dict.items(): 166 | TP,FP,FN=tuplee 167 | f1_score_onelabel=compute_f1(TP,FP,FN,'macro') 168 | f1_dict[label]=f1_score_onelabel 169 | f1_score_sum=0.0 170 | for label,f1_score in f1_dict.items(): 171 | f1_score_sum=f1_score_sum+f1_score 172 | f1_score=f1_score_sum/float(num_classes) 173 | return f1_score 174 | 175 | #[this function is for debug purpose only] 176 | def compute_f1_score_write_for_debug(label_dict,label2index): 177 | """ 178 | compute f1 score. basicly you can also use other function to get result 179 | :param label_dict: {label:(TP,FP,FN)} 180 | :return: a dict. key is label name, value is f1 score. 181 | """ 182 | f1score_dict={} 183 | # 1. 
compute f1 score for each accusation. 184 | for label, tuplee in label_dict.items(): 185 | TP, FP, FN = tuplee 186 | f1_score_single = compute_f1(TP, FP, FN, 'normal_f1_score') 187 | accusation_index2label = {kv[1]: kv[0] for kv in label2index.items()} 188 | label_name=accusation_index2label[label] 189 | f1score_dict[label_name]=f1_score_single 190 | 191 | # 2. each to file system for debug purpose. 192 | f1score_file='debug_accuracy.txt' 193 | write_object = codecs.open(f1score_file, mode='a', encoding='utf-8') 194 | write_object.write("\n\n") 195 | 196 | #tuple_list = sorted(f1score_dict.items(), lambda x, y: cmp(x[1], y[1]), reverse=False) 197 | tuple_list = sorted(f1score_dict.items(), key=lambda x: x[1], reverse=False) 198 | 199 | for tuplee in tuple_list: 200 | label_name,f1_score=tuplee 201 | write_object.write(label_name+":"+str(f1_score)+"\n") 202 | write_object.close() 203 | return f1score_dict 204 | 205 | def compute_f1(TP,FP,FN,compute_type): 206 | """ 207 | compute f1 208 | :param TP_micro: number.e.g. 200 209 | :param FP_micro: number.e.g. 200 210 | :param FN_micro: number.e.g. 200 211 | :return: f1_score: a scalar 212 | """ 213 | precison=TP/(TP+FP+small_value) 214 | recall=TP/(TP+FN+small_value) 215 | f1_score=(2*precison*recall)/(precison+recall+small_value) 216 | 217 | if random.choice([x for x in range(500)]) == 1:print(compute_type,"precison:",str(precison),";recall:",str(recall),";f1_score:",f1_score) 218 | 219 | return f1_score 220 | 221 | def compute_TF_FP_FN_micro(label_dict): 222 | """ 223 | compute micro FP,FP,FN 224 | :param label_dict_accusation: a dict. {label:(TP, FP, FN)} 225 | :return:TP_micro,FP_micro,FN_micro 226 | """ 227 | TP_micro,FP_micro,FN_micro=0.0,0.0,0.0 228 | for label,tuplee in label_dict.items(): 229 | TP,FP,FN=tuplee 230 | TP_micro=TP_micro+TP 231 | FP_micro=FP_micro+FP 232 | FN_micro=FN_micro+FN 233 | return TP_micro,FP_micro,FN_micro 234 | 235 | def init_label_dict(num_classes): 236 | """ 237 | init label dict. this dict will be used to save TP,FP,FN 238 | :param num_classes: 239 | :return: label_dict: a dict. {label_index:(0,0,0)} 240 | """ 241 | label_dict={} 242 | for i in range(num_classes): 243 | label_dict[i]=(0,0,0) 244 | return label_dict 245 | 246 | def get_target_label_short(y_mulitihot): 247 | """ 248 | get target label. 249 | :param y_mulitihot: [0,0,1,0,1,0,...] 250 | :return: taget_list.e.g. [3,5,100] 251 | """ 252 | taget_list = []; 253 | for i, element in enumerate(y_mulitihot): 254 | if element == 1: 255 | taget_list.append(i) 256 | return taget_list -------------------------------------------------------------------------------- /model/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brightmart/sentiment_analysis_fine_grain/bad7c61a2610eec614e1b42b07cadb1ec57a2ef7/model/__init__.py -------------------------------------------------------------------------------- /model/base_model.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import tensorflow as tf 3 | from model.multi_head_attention import MultiHeadAttention 4 | from model.poistion_wise_feed_forward import PositionWiseFeedFoward 5 | from model.layer_norm_residual_conn import LayerNormResidualConnection 6 | class BaseClass(object): 7 | """ 8 | base class has some common fields and functions. 
9 | """ 10 | def __init__(self,d_model,d_k,d_v,sequence_length,h,batch_size,num_layer=6,decoder_sent_length=None): 11 | """ 12 | :param d_model: 13 | :param d_k: 14 | :param d_v: 15 | :param sequence_length: 16 | :param h: 17 | :param batch_size: 18 | :param embedded_words: shape:[batch_size,sequence_length,embed_size] 19 | """ 20 | self.d_model=d_model 21 | self.d_k=d_k 22 | self.d_v=d_v 23 | self.sequence_length=sequence_length 24 | self.h=h 25 | self.num_layer=num_layer 26 | self.batch_size=batch_size 27 | self.decoder_sent_length=decoder_sent_length 28 | 29 | def sub_layer_postion_wise_feed_forward(self, x, layer_index) :# COMMON FUNCTION 30 | """ 31 | position-wise feed forward. you can implement it as feed forward network, or two layers of CNN. 32 | :param x: shape should be:[batch_size,sequence_length,d_model] 33 | :param layer_index: index of layer number 34 | :return: [batch_size,sequence_length,d_model] 35 | """ 36 | # use variable scope here with input of layer index, to make sure each layer has different parameters. 37 | with tf.variable_scope("sub_layer_postion_wise_feed_forward" + str(layer_index)): 38 | postion_wise_feed_forward = PositionWiseFeedFoward(x, layer_index,d_model=self.d_model,d_ff=self.d_model*4) 39 | postion_wise_feed_forward_output = postion_wise_feed_forward.position_wise_feed_forward_fn() 40 | return postion_wise_feed_forward_output 41 | 42 | def sub_layer_multi_head_attention(self ,layer_index ,Q ,K_s,V_s,mask=None,is_training=None,dropout_keep_prob=0.9) :# COMMON FUNCTION 43 | """ 44 | multi head attention as sub layer 45 | :param layer_index: index of layer number 46 | :param Q: shape should be: [batch_size,sequence_length,embed_size] 47 | :param k_s: shape should be: [batch_size,sequence_length,embed_size] 48 | :param mask: when use mask,illegal connection will be mask as huge big negative value.so it's possiblitity will become zero. 49 | :return: output of multi head attention.shape:[batch_size,sequence_length,d_model] 50 | """ 51 | #print("sub_layer_multi_head_attention.",";layer_index:",layer_index) 52 | with tf.variable_scope("base_mode_sub_layer_multi_head_attention_" +str(layer_index)): 53 | #2. 
call function of multi head attention to get result 54 | multi_head_attention_class = MultiHeadAttention(Q, K_s, V_s, self.d_model, self.d_k, self.d_v, self.sequence_length,self.h, 55 | is_training=is_training,mask=mask,dropout_rate=(1.0-dropout_keep_prob)) 56 | sub_layer_multi_head_attention_output = multi_head_attention_class.multi_head_attention_fn() # [batch_size*sequence_length,d_model] 57 | return sub_layer_multi_head_attention_output # [batch_size,sequence_length,d_model] 58 | 59 | def sub_layer_layer_norm_residual_connection(self,layer_input ,layer_output,layer_index,dropout_keep_prob=0.9,use_residual_conn=True,sub_layer_name='layer1'): # COMMON FUNCTION 60 | """ 61 | layer norm & residual connection 62 | :param input: [batch_size,equence_length,d_model] 63 | :param output:[batch_size,sequence_length,d_model] 64 | :return: 65 | """ 66 | #print("sub_layer_layer_norm_residual_connection.layer_input:",layer_input,";layer_output:",layer_output,";dropout_keep_prob:",dropout_keep_prob) 67 | #assert layer_input.get_shape().as_list()==layer_output.get_shape().as_list() 68 | #layer_output_new= layer_input+ layer_output 69 | variable_scope="sub_layer_layer_norm_residual_connection_" +str(layer_index)+'_'+sub_layer_name 70 | #print("######sub_layer_layer_norm_residual_connection.variable_scope:",variable_scope) 71 | with tf.variable_scope(variable_scope): 72 | layer_norm_residual_conn=LayerNormResidualConnection(layer_input,layer_output,layer_index,residual_dropout=(1-dropout_keep_prob),use_residual_conn=use_residual_conn) 73 | output = layer_norm_residual_conn.layer_norm_residual_connection() 74 | return output # [batch_size,sequence_length,d_model] -------------------------------------------------------------------------------- /model/config.py: -------------------------------------------------------------------------------- 1 | 2 | class Config: 3 | def __init__(self): 4 | self.learning_rate=0.0003 5 | self.num_classes = 2 6 | self.batch_size = 64 7 | self.sequence_length = 100 8 | self.vocab_size = 50000 9 | 10 | self.d_model =512 11 | self.num_layer=6 12 | self.h=8 13 | self.d_k=64 14 | self.d_v=64 15 | 16 | self.clip_gradients = 5.0 17 | self.decay_steps = 1000 18 | self.decay_rate = 0.9 19 | self.dropout_keep_prob = 0.9 20 | self.ckpt_dir = 'checkpoint/dummy_test/' 21 | self.is_training=True 22 | self.is_pretrain=True 23 | self.num_classes_lm=self.vocab_size 24 | -------------------------------------------------------------------------------- /model/config_transformer.py: -------------------------------------------------------------------------------- 1 | class Config: 2 | def __init__(self): 3 | self.learning_rate = 0.0003 4 | self.num_classes = 2 5 | self.batch_size = 64 6 | self.sequence_length = 100 7 | self.vocab_size = 50000 8 | 9 | self.d_model = 512 10 | self.num_layer = 6 11 | self.h = 8 12 | self.d_k = 64 13 | self.d_v = 64 14 | 15 | self.clip_gradients = 5.0 16 | self.decay_steps = 1000 17 | self.decay_rate = 0.9 18 | self.dropout_keep_prob = 0.9 19 | self.ckpt_dir = 'checkpoint/dummy_test/' 20 | self.is_training = True 21 | -------------------------------------------------------------------------------- /model/encoder.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | encoder for the transformer: 4 | 6 layers.each layers has two sub-layers. 5 | the first is multi-head self-attention mechanism; 6 | the second is position-wise fully connected feed-forward network. 7 | for each sublayer. 
use LayerNorm(x+Sublayer(x)). all dimension=512. 8 | """ 9 | import tensorflow as tf 10 | from model.base_model import BaseClass 11 | import time 12 | class Encoder(BaseClass): 13 | def __init__(self,d_model,d_k,d_v,sequence_length,h,batch_size,num_layer,Q,K_s,mask=None,dropout_keep_prob=0.9,use_residual_conn=True): 14 | """ 15 | :param d_model: 16 | :param d_k: 17 | :param d_v: 18 | :param sequence_length: 19 | :param h: 20 | :param batch_size: 21 | :param embedded_words: shape:[batch_size*sequence_length,embed_size] 22 | """ 23 | super(Encoder, self).__init__(d_model,d_k,d_v,sequence_length,h,batch_size,num_layer=num_layer) 24 | self.Q=Q 25 | self.K_s=K_s 26 | self.mask=mask 27 | self.initializer = tf.random_normal_initializer(stddev=0.1) 28 | self.dropout_keep_prob=dropout_keep_prob 29 | self.use_residual_conn=use_residual_conn 30 | 31 | def encoder_fn(self): 32 | """ 33 | use transformer encoder to encode the input, output a sequence. input: [batch_size,sequence_length,d_embedding] 34 | :return: output:[batch_size*sequence_length,d_model] 35 | """ 36 | start = time.time() 37 | #print("encoder_fn.started.") 38 | x=self.Q 39 | for layer_index in range(self.num_layer): 40 | x=self.encoder_single_layer(x,x,x,layer_index) # Q,K_s,V_s 41 | #print("encoder_fn.",layer_index,".x:",x) 42 | end = time.time() 43 | #print("encoder_fn.ended.x:",x) 44 | #print("time spent:",(end-start)) 45 | return x 46 | 47 | def encoder_single_layer(self,Q,K_s,V_s,layer_index): 48 | """ 49 | singel layer for encoder.each layers has two sub-layers: 50 | the first is multi-head self-attention mechanism; the second is position-wise fully connected feed-forward network. 51 | for each sublayer. use LayerNorm(x+Sublayer(x)). input and output of last dimension: d_model 52 | :param Q: shape should be: [batch_size,sequence_length,d_model] 53 | :param K_s: shape should be: [batch_size,sequence_length,d_model] 54 | :return:output: shape should be: [batch_size,sequence_length,d_model] 55 | """ 56 | #1.1 the first is multi-head self-attention mechanism 57 | multi_head_attention_output=self.sub_layer_multi_head_attention(layer_index,Q,K_s,V_s,mask=self.mask,dropout_keep_prob=self.dropout_keep_prob) #[batch_size,sequence_length,d_model] 58 | #1.2 use LayerNorm(x+Sublayer(x)). all dimension=512. 59 | multi_head_attention_output=self.sub_layer_layer_norm_residual_connection(K_s,multi_head_attention_output,layer_index, 60 | dropout_keep_prob=self.dropout_keep_prob,use_residual_conn=self.use_residual_conn,sub_layer_name='layer1') 61 | #2.1 the second is position-wise fully connected feed-forward network. 62 | postion_wise_feed_forward_output=self.sub_layer_postion_wise_feed_forward(multi_head_attention_output,layer_index) 63 | #2.2 use LayerNorm(x+Sublayer(x)). all dimension=512. 64 | postion_wise_feed_forward_output= self.sub_layer_layer_norm_residual_connection(multi_head_attention_output,postion_wise_feed_forward_output,layer_index, 65 | dropout_keep_prob=self.dropout_keep_prob,sub_layer_name='layer2') 66 | return postion_wise_feed_forward_output #,postion_wise_feed_forward_output 67 | 68 | 69 | def init(): 70 | #1. 
assign value to fields 71 | vocab_size=1000 72 | d_model = 512 73 | d_k = 64 74 | d_v = 64 75 | sequence_length = 5*10 76 | h = 8 77 | batch_size=4*32 78 | initializer = tf.random_normal_initializer(stddev=0.1) 79 | # 2.set values for Q,K,V 80 | vocab_size=1000 81 | embed_size=d_model 82 | Embedding = tf.get_variable("Embedding_E", shape=[vocab_size, embed_size],initializer=initializer) 83 | input_x = tf.placeholder(tf.int32, [batch_size,sequence_length], name="input_x") #[4,10] 84 | print("input_x:",input_x) 85 | embedded_words = tf.nn.embedding_lookup(Embedding, input_x) #[batch_size*sequence_length,embed_size] 86 | Q = embedded_words # [batch_size*sequence_length,embed_size] 87 | K_s = embedded_words # [batch_size*sequence_length,embed_size] 88 | V_s = embedded_words # [batch_size*sequence_length,embed_size] 89 | num_layer=6 90 | mask = get_mask(batch_size, sequence_length) 91 | #3. get class object 92 | encoder_class=Encoder(d_model,d_k,d_v,sequence_length,h,batch_size,num_layer,Q,K_s,mask=mask) #Q,K_s,embedded_words 93 | return encoder_class,Q,K_s,V_s 94 | 95 | def get_mask(batch_size,sequence_length): 96 | lower_triangle=tf.matrix_band_part(tf.ones([sequence_length,sequence_length]),-1,0) 97 | result=-1e9*(1.0-lower_triangle) 98 | print("get_mask==>result:",result) 99 | return result 100 | 101 | def test_postion_wise_feed_forward(encoder_class,x,layer_index): 102 | sub_layer_postion_wise_feed_forward_output=encoder_class.sub_layer_postion_wise_feed_forward(x, layer_index) 103 | return sub_layer_postion_wise_feed_forward_output 104 | 105 | def test_sub_layer_multi_head_attention(encoder_class,index_layer,Q,K_s,V_s): 106 | sub_layer_multi_head_attention_output=encoder_class.sub_layer_multi_head_attention(index_layer,Q,K_s,V_s) 107 | return sub_layer_multi_head_attention_output 108 | 109 | 110 | encoder_class,Q,K_s,V_s=init() 111 | 112 | 113 | #below is 4 callable codes for testing functions: from sub(small) function to whole function of encoder. 114 | 115 | def test(): 116 | #1.test 1: for sub layer of multi head attention 117 | index_layer=0 118 | #sub_layer_multi_head_attention_output=test_sub_layer_multi_head_attention(encoder_class,index_layer,Q,K_s,V_s) 119 | #print("sub_layer_multi_head_attention_output1:",sub_layer_multi_head_attention_output) 120 | 121 | #2. test 2: for sub layer of multi head attention with poistion-wise feed forward 122 | #d1,d2,d3=sub_layer_multi_head_attention_output.get_shape().as_list() 123 | #print("d1:",d1,";d2:",d2,";d3:",d3) 124 | #postion_wise_ff_input=sub_layer_multi_head_attention_output #tf.reshape(sub_layer_multi_head_attention_output,shape=[-1,d3]) 125 | #print("sub_layer_postion_wise_feed_forward_input:",postion_wise_ff_input) 126 | #sub_layer_postion_wise_feed_forward_output=test_postion_wise_feed_forward(encoder_class,postion_wise_ff_input,index_layer) 127 | #sub_layer_postion_wise_feed_forward_output=tf.reshape(sub_layer_postion_wise_feed_forward_output,shape=(d1,d2,d3)) 128 | #print("sub_layer_postion_wise_feed_forward_output:",sub_layer_postion_wise_feed_forward_output) 129 | #3.test 3: test for single layer of encoder 130 | #encoder_class.encoder_single_layer(Q,K_s,V_s,index_layer) 131 | #4.test 4: test for encoder. 
with N layers 132 | 133 | representation = encoder_class.encoder_fn() 134 | print("representation:",representation) 135 | 136 | # test() -------------------------------------------------------------------------------- /model/layer_norm_residual_conn.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | import time 3 | """ 4 | We employ a residual connection around each of the two sub-layers, followed by layer normalization. 5 | That is, the output of each sub-layer is LayerNorm(x+ Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. """ 6 | class LayerNormResidualConnection(object): 7 | def __init__(self,x,y,layer_index,residual_dropout=0.1,use_residual_conn=True): 8 | self.x=x 9 | self.y=y 10 | self.layer_index=layer_index 11 | self.residual_dropout=residual_dropout 12 | #print("LayerNormResidualConnection.residual_dropout:",self.residual_dropout) 13 | self.use_residual_conn=use_residual_conn 14 | 15 | #call residual connection and layer normalization 16 | def layer_norm_residual_connection(self): 17 | #print("LayerNormResidualConnection.use_residual_conn:",self.use_residual_conn) 18 | if self.use_residual_conn: # todo previously it is removed in a classification task, may be because result become not stable 19 | x_residual=self.residual_connection() 20 | x_layer_norm=self.layer_normalization(x_residual) 21 | else: 22 | x_layer_norm = self.layer_normalization(self.x) 23 | return x_layer_norm 24 | 25 | def residual_connection(self): 26 | output=self.x + tf.nn.dropout(self.y, 1.0 - self.residual_dropout) 27 | return output 28 | 29 | # layer normalize the tensor x, averaging over the last dimension. 30 | def layer_normalization(self,x): 31 | """ 32 | x should be:[batch_size,sequence_length,d_model] 33 | :return: 34 | """ 35 | filter=x.get_shape()[-1] #last dimension of x. e.g. 512 36 | #print("layer_normalization:==================>variable_scope:","layer_normalization"+str(self.layer_index)) 37 | with tf.variable_scope("layer_normalization"+str(self.layer_index)): 38 | # 1. normalize input by using mean and variance according to last dimension 39 | mean=tf.reduce_mean(x,axis=-1,keepdims=True) #[batch_size,sequence_length,1] 40 | variance=tf.reduce_mean(tf.square(x-mean),axis=-1,keepdims=True) #[batch_size,sequence_length,1] 41 | norm_x=(x-mean)*tf.rsqrt(variance+1e-6) #[batch_size,sequence_length,d_model] 42 | # 2. 
re-scale normalized input back 43 | scale=tf.get_variable("layer_norm_scale",[filter],initializer=tf.ones_initializer) #[filter] 44 | bias=tf.get_variable("layer_norm_bias",[filter],initializer=tf.ones_initializer) #[filter] 45 | output=norm_x*scale+bias #[batch_size,sequence_length,d_model] 46 | return output #[batch_size,sequence_length,d_model] 47 | 48 | def test(): 49 | start = time.time() 50 | 51 | batch_size=128 52 | sequence_length=1000 53 | d_model=512 54 | x=tf.ones((batch_size,sequence_length,d_model)) 55 | y=x*3-0.5 56 | layer_norm_residual_conn=LayerNormResidualConnection(x,y,0,'encoder') 57 | output=layer_norm_residual_conn.layer_norm_residual_connection() 58 | 59 | end = time.time() 60 | print("x:",x,";y:",y) 61 | print("output:",output,";time spent:",(end-start)) 62 | 63 | #test() -------------------------------------------------------------------------------- /model/multi_head_attention.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #test self-attention 3 | import tensorflow as tf 4 | import time 5 | """ 6 | multi head attention. 7 | 1.linearly project the queries,keys and values h times(with different,learned linear projections to d_k,d_k,d_v dimensions) 8 | 2.scaled dot product attention for each projected version of Q,K,V 9 | 3.concatenated result 10 | 4.linear projection to get final result 11 | three kinds of usage: 12 | 1. attention for encoder 13 | 2. attention for decoder(need a mask to pay attention for only known position) 14 | 3. attention as bridge of encoder and decoder 15 | """ 16 | class MultiHeadAttention(object): 17 | """ multi head attention""" 18 | def __init__(self,Q,K_s,V_s,d_model,d_k,d_v,sequence_length,h,typee=None,is_training=None,mask=None,dropout_rate=0.1): 19 | self.d_model=d_model 20 | self.d_k=d_k 21 | self.d_v=d_v 22 | self.sequence_length=sequence_length 23 | self.h=h 24 | self.Q=Q 25 | self.K_s=K_s 26 | self.V_s=V_s 27 | self.typee=typee 28 | self.is_training=is_training 29 | self.mask=mask 30 | self.dropout_rate=dropout_rate 31 | #print("MultiHeadAttention.self.dropout_rate:",self.dropout_rate) 32 | 33 | def multi_head_attention_fn(self): 34 | """ 35 | multi head attention 36 | :param Q: query. shape:[batch,sequence_length,d_model] 37 | :param K_s: keys. shape:[batch,sequence_length,d_model]. 38 | :param V_s:values.shape:[batch,sequence_length,d_model]. 39 | :param h: h times 40 | :return: result of scaled dot product attention. shape:[sequence_length,d_model] 41 | """ 42 | # 1. linearly project the queries,keys and values h times(with different,learned linear projections to d_k,d_k,d_v dimensions) 43 | Q_projected = tf.layers.dense(self.Q,units=self.d_model) # [batch,sequence_length,d_model] 44 | K_s_projected = tf.layers.dense(self.K_s, units=self.d_model) # [batch,sequence_length,d_model] 45 | V_s_projected = tf.layers.dense(self.V_s, units=self.d_model) # [batch,sequence_length,d_model] 46 | # 2. scaled dot product attention for each projected version of Q,K,V 47 | dot_product=self.scaled_dot_product_attention_batch(Q_projected,K_s_projected,V_s_projected) # [batch,h,sequence_length,d_v] 48 | # 3. concatenated 49 | batch_size,h,length,d_v=dot_product.get_shape().as_list() 50 | #print("dot_product:",dot_product,";self.sequence_length:",self.sequence_length) ##dot_product:(128, 8, 6, 64);5 51 | dot_product=tf.reshape(dot_product,shape=(-1,length,self.d_model)) # [batch,sequence_length,d_model] 52 | # 4. 
linear projection 53 | output=tf.layers.dense(dot_product,units=self.d_model) # [batch,sequence_length,d_model] 54 | return output #[batch,sequence_length,d_model] 55 | 56 | def scaled_dot_product_attention_batch_mine(self,Q,K_s,V_s): #my own implementation of scaled dot product attention. 57 | """ 58 | scaled dot product attention 59 | :param Q: query. shape:[batch,sequence_length,d_model] 60 | :param K_s: keys. shape:[batch,sequence_length,d_model] 61 | :param V_s:values. shape:[batch,sequence_length,d_model] 62 | :param mask: shape:[batch,sequence_length] 63 | :return: result of scaled dot product attention. shape:[batch,h,sequence_length,d_k] 64 | """ 65 | # 1. split Q,K,V 66 | Q_heads = tf.stack(tf.split(Q,self.h,axis=2),axis=1) # [batch,h,sequence_length,d_k] 67 | K_heads = tf.stack(tf.split(K_s, self.h, axis=2), axis=1) # [batch,h,sequence_length,d_k] 68 | V_heads = tf.stack(tf.split(V_s, self.h, axis=2), axis=1) # [batch,h,sequence_length,d_k] 69 | dot_product=tf.multiply(Q_heads,K_heads) # [batch,h,sequence_length,d_k] 70 | # 2. dot product 71 | dot_product=dot_product*(1.0/tf.sqrt(tf.cast(self.d_model,tf.float32))) # [batch,h,sequence_length,d_k] 72 | dot_product=tf.reduce_sum(dot_product,axis=-1,keep_dims=True) # [batch,h,sequence_length,1] 73 | # 3. add mask if it is none 74 | if self.mask is not None: 75 | mask = tf.expand_dims(self.mask, axis=-1) # [batch,sequence_length,1] 76 | mask = tf.expand_dims(mask, axis=1) # [batch,1,sequence_length,1] 77 | dot_product=dot_product+mask # [batch,h,sequence_length,1] 78 | # 4. get possibility 79 | p=tf.nn.softmax(dot_product) # [batch,h,sequence_length,1] 80 | # 5. final output 81 | output=tf.multiply(p,V_heads) # [batch,h,sequence_length,d_k] 82 | return output # [batch,h,sequence_length,d_k] 83 | 84 | def scaled_dot_product_attention_batch(self, Q, K_s, V_s):# scaled dot product attention: implementation style like tensor2tensor from google 85 | """ 86 | scaled dot product attention 87 | :param Q: query. shape:[batch,sequence_length,d_model] 88 | :param K_s: keys. shape:[batch,sequence_length,d_model] 89 | :param V_s:values. shape:[batch,sequence_length,d_model] 90 | :param mask: shape:[sequence_length,sequence_length] 91 | :return: result of scaled dot product attention. shape:[batch,h,sequence_length,d_k] 92 | """ 93 | # 1. split Q,K,V 94 | #K_s=tf.layers.dense(K_s,self.d_model) # transform K_s, while keep as shape. TODO add 2018.10.21. so that Q and K shoud be not the same. 95 | Q_heads = tf.stack(tf.split(Q,self.h,axis=2),axis=1) # [batch,h,sequence_length,d_k] 96 | K_heads = tf.stack(tf.split(K_s, self.h, axis=2), axis=1) # [batch,h,sequence_length,d_k] 97 | V_heads = tf.stack(tf.split(V_s, self.h, axis=2), axis=1) # [batch,h,sequence_length,d_v]. during implementation, d_v=d_k. 98 | # 2. dot product of Q,K 99 | dot_product=tf.matmul(Q_heads,K_heads,transpose_b=True) # [batch,h,sequence_length,sequence_length] 100 | dot_product=dot_product*(1.0/tf.sqrt(tf.cast(self.d_model,tf.float32))) # [batch,h,sequence_length,sequence_length] 101 | # 3. 
add mask if it is none 102 | #print("scaled_dot_product_attention_batch.mask is not none?",self.mask is not None) 103 | if self.mask is not None: 104 | mask_expand=tf.expand_dims(tf.expand_dims(self.mask,axis=0),axis=0) # [1,1,sequence_length,sequence_length] 105 | #dot_product:(128, 8, 6, 6);mask_expand:(1, 1, 6, 6) 106 | #print("scaled_dot_product_attention_batch.dot_product:",dot_product,";mask_expand:",mask_expand) 107 | dot_product=dot_product+mask_expand # [batch,h,sequence_length,sequence_length] 108 | # 4.get possibility 109 | weights=tf.nn.softmax(dot_product) # [batch,h,sequence_length,sequence_length] 110 | # drop out weights 111 | weights=tf.nn.dropout(weights,1.0-self.dropout_rate) # [batch,h,sequence_length,sequence_length] 112 | # 5. final output 113 | output=tf.matmul(weights,V_heads) # [batch,h,sequence_length,d_v] 114 | return output 115 | 116 | 117 | #vectorized implementation of multi head attention for sentences with batch 118 | def multi_head_attention_for_sentence_vectorized(layer_number): 119 | print("started...") 120 | start = time.time() 121 | # 1.set parameter 122 | d_model = 512 123 | d_k = 64 124 | d_v = 64 125 | sequence_length = 1000 126 | h = 8 127 | batch_size=128 128 | initializer = tf.random_normal_initializer(stddev=0.1) 129 | # 2.set Q,K,V 130 | vocab_size=1000 131 | embed_size=d_model 132 | typee='decoder' 133 | Embedding = tf.get_variable("Embedding_", shape=[vocab_size, embed_size],initializer=initializer) 134 | input_x = tf.placeholder(tf.int32, [batch_size,sequence_length], name="input_x") 135 | embedded_words = tf.nn.embedding_lookup(Embedding, input_x) #[batch_size,sequence_length,embed_size] 136 | mask=get_mask(batch_size,sequence_length) #tf.ones((batch_size,sequence_length))*-1e8 #[batch,sequence_length] 137 | with tf.variable_scope("query_at_each_sentence"+str(layer_number)): 138 | Q = embedded_words # [batch_size*sequence_length,embed_size] 139 | K_s=embedded_words #[batch_size*sequence_length,embed_size] 140 | V_s=embedded_words #tf.get_variable("V_s_original_", shape=embedded_words.get_shape().as_list(),initializer=initializer) #[batch_size,sequence_length,embed_size] 141 | # 3.call method to get result 142 | multi_head_attention_class = MultiHeadAttention(Q, K_s, V_s, d_model, d_k, d_v, sequence_length, h,typee='decoder',mask=mask) 143 | encoder_output=multi_head_attention_class.multi_head_attention_fn() #shape:[sequence_length,d_model] 144 | encoder_output=tf.reshape(encoder_output,shape=(batch_size,sequence_length,d_model)) 145 | end = time.time() 146 | print("input_x:",input_x) 147 | print("encoder_output:",encoder_output,";time_spent:",(end-start)) 148 | 149 | def get_mask(batch_size,sequence_length): 150 | lower_triangle=tf.matrix_band_part(tf.ones([sequence_length,sequence_length]),-1,0) 151 | result=-1e9*(1.0-lower_triangle) 152 | print("get_mask==>result:",result) 153 | return result 154 | 155 | layer_number=0 156 | #multi_head_attention_for_sentence_vectorized(0) -------------------------------------------------------------------------------- /model/optimization.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 
6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """Functions and classes related to optimization (weight updates).""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import re 22 | import tensorflow as tf 23 | 24 | 25 | def create_optimizer(loss, init_lr, num_train_steps, num_warmup_steps, use_tpu): 26 | """Creates an optimizer training op.""" 27 | global_step = tf.train.get_or_create_global_step() 28 | 29 | learning_rate = tf.constant(value=init_lr, shape=[], dtype=tf.float32) 30 | 31 | # Implements linear decay of the learning rate. 32 | learning_rate = tf.train.polynomial_decay( 33 | learning_rate, 34 | global_step, 35 | num_train_steps, 36 | end_learning_rate=0.0, 37 | power=1.0, 38 | cycle=False) 39 | 40 | # Implements linear warmup. I.e., if global_step < num_warmup_steps, the 41 | # learning rate will be `global_step/num_warmup_steps * init_lr`. 42 | if num_warmup_steps: 43 | global_steps_int = tf.cast(global_step, tf.int32) 44 | warmup_steps_int = tf.constant(num_warmup_steps, dtype=tf.int32) 45 | 46 | global_steps_float = tf.cast(global_steps_int, tf.float32) 47 | warmup_steps_float = tf.cast(warmup_steps_int, tf.float32) 48 | 49 | warmup_percent_done = global_steps_float / warmup_steps_float 50 | warmup_learning_rate = init_lr * warmup_percent_done 51 | 52 | is_warmup = tf.cast(global_steps_int < warmup_steps_int, tf.float32) 53 | learning_rate = ( 54 | (1.0 - is_warmup) * learning_rate + is_warmup * warmup_learning_rate) 55 | 56 | # It is recommended that you use this optimizer for fine tuning, since this 57 | # is how the model was trained (note that the Adam m/v variables are NOT 58 | # loaded from init_checkpoint.) 59 | optimizer = AdamWeightDecayOptimizer( 60 | learning_rate=learning_rate, 61 | weight_decay_rate=0.01, 62 | beta_1=0.9, 63 | beta_2=0.999, 64 | epsilon=1e-6, 65 | exclude_from_weight_decay=["LayerNorm", "layer_norm", "bias"]) 66 | 67 | if use_tpu: 68 | optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer) 69 | 70 | tvars = tf.trainable_variables() 71 | grads = tf.gradients(loss, tvars) 72 | 73 | # This is how the model was pre-trained. 
74 | (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0) 75 | 76 | train_op = optimizer.apply_gradients( 77 | zip(grads, tvars), global_step=global_step) 78 | 79 | new_global_step = global_step + 1 80 | train_op = tf.group(train_op, [global_step.assign(new_global_step)]) 81 | return train_op 82 | 83 | 84 | class AdamWeightDecayOptimizer(tf.train.Optimizer): 85 | """A basic Adam optimizer that includes "correct" L2 weight decay.""" 86 | 87 | def __init__(self, 88 | learning_rate, 89 | weight_decay_rate=0.0, 90 | beta_1=0.9, 91 | beta_2=0.999, 92 | epsilon=1e-6, 93 | exclude_from_weight_decay=None, 94 | name="AdamWeightDecayOptimizer"): 95 | """Constructs a AdamWeightDecayOptimizer.""" 96 | super(AdamWeightDecayOptimizer, self).__init__(False, name) 97 | 98 | self.learning_rate = learning_rate 99 | self.weight_decay_rate = weight_decay_rate 100 | self.beta_1 = beta_1 101 | self.beta_2 = beta_2 102 | self.epsilon = epsilon 103 | self.exclude_from_weight_decay = exclude_from_weight_decay 104 | 105 | def apply_gradients(self, grads_and_vars, global_step=None, name=None): 106 | """See base class.""" 107 | assignments = [] 108 | for (grad, param) in grads_and_vars: 109 | if grad is None or param is None: 110 | continue 111 | 112 | param_name = self._get_variable_name(param.name) 113 | 114 | m = tf.get_variable( 115 | name=param_name + "/adam_m", 116 | shape=param.shape.as_list(), 117 | dtype=tf.float32, 118 | trainable=False, 119 | initializer=tf.zeros_initializer()) 120 | v = tf.get_variable( 121 | name=param_name + "/adam_v", 122 | shape=param.shape.as_list(), 123 | dtype=tf.float32, 124 | trainable=False, 125 | initializer=tf.zeros_initializer()) 126 | 127 | # Standard Adam update. 128 | next_m = ( 129 | tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad)) 130 | next_v = ( 131 | tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2, 132 | tf.square(grad))) 133 | 134 | update = next_m / (tf.sqrt(next_v) + self.epsilon) 135 | 136 | # Just adding the square of the weights to the loss function is *not* 137 | # the correct way of using L2 regularization/weight decay with Adam, 138 | # since that will interact with the m and v parameters in strange ways. 139 | # 140 | # Instead we want ot decay the weights in a manner that doesn't interact 141 | # with the m/v parameters. This is equivalent to adding the square 142 | # of the weights to the loss with plain (non-momentum) SGD. 
143 | if self._do_use_weight_decay(param_name): 144 | update += self.weight_decay_rate * param 145 | 146 | update_with_lr = self.learning_rate * update 147 | 148 | next_param = param - update_with_lr 149 | 150 | assignments.extend( 151 | [param.assign(next_param), 152 | m.assign(next_m), 153 | v.assign(next_v)]) 154 | return tf.group(*assignments, name=name) 155 | 156 | def _do_use_weight_decay(self, param_name): 157 | """Whether to use L2 weight decay for `param_name`.""" 158 | if not self.weight_decay_rate: 159 | return False 160 | if self.exclude_from_weight_decay: 161 | for r in self.exclude_from_weight_decay: 162 | if re.search(r, param_name) is not None: 163 | return False 164 | return True 165 | 166 | def _get_variable_name(self, param_name): 167 | """Get the variable name from the tensor name.""" 168 | m = re.match("^(.*):\\d+$", param_name) 169 | if m is not None: 170 | param_name = m.group(1) 171 | return param_name 172 | -------------------------------------------------------------------------------- /model/poistion_wise_feed_forward.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import tensorflow as tf 3 | import time 4 | """ 5 | Position-wise Feed-Forward Networks 6 | In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully 7 | connected feed-forward network, which is applied to each position separately and identically. This 8 | consists of two linear transformations with a ReLU activation in between. 9 | FFN(x) = max(0,xW1+b1)W2+b2 10 | While the linear transformations are the same across different positions, they use different parameters 11 | from layer to layer. Another way of describing this is as two convolutions with kernel size 1. 12 | The dimensionality of input and output is d_model = 512, and the inner layer has dimensionality d_ff = 2048. 13 | """ 14 | class PositionWiseFeedFoward(object): 15 | """ 16 | position-wise feed-forward network. formula as below: 17 | FFN(x)=max(0,xW1+b1)W2+b2 18 | """ 19 | def __init__(self,x,layer_index,d_model=512,d_ff=2048): 20 | """ 21 | :param x: shape should be:[batch,sequence_length,d_model] 22 | :param layer_index: index of the layer 23 | :return: output shape:[batch,sequence_length,d_model] 24 | """ 25 | shape_list=x.get_shape().as_list() 26 | assert(len(shape_list)==3) 27 | self.x=x 28 | self.layer_index=layer_index 29 | self.d_model=d_model 30 | self.d_ff=d_ff 31 | self.initializer = tf.random_normal_initializer(stddev=0.1) 32 | 33 | def position_wise_feed_forward_fn(self): 34 | """ 35 | position-wise fully connected feed-forward network, implemented as two convolution layers 36 | x: [batch,sequence_length,d_model] 37 | :return: [batch,sequence_length,d_model] 38 | """ 39 | # 1.conv layer 1 40 | input=tf.expand_dims(self.x,axis=3) # [batch,sequence_length,d_model,1] 41 | # conv2d input: [batch,sequence_length,d_model,1]; 
kernel size is [1,self.d_model] with self.d_ff filters 42 | output_conv1=tf.layers.conv2d( # output_conv1: [batch_size,sequence_length,1,d_ff] 43 | input,filters=self.d_ff,kernel_size=[1,self.d_model],padding="VALID", 44 | name='conv1',kernel_initializer=self.initializer,activation=tf.nn.relu 45 | ) 46 | output_conv1 = tf.transpose(output_conv1, [0,1,3,2]) #output_conv1:[batch_size,sequence_length,d_ff,1] 47 | # print("output_conv1:",output_conv1) 48 | 49 | # 2.conv layer 2 50 | output_conv2 = tf.layers.conv2d( # output_conv2:[batch_size, sequence_length,1,d_model] 51 | output_conv1,filters=self.d_model,kernel_size=[1,self.d_ff],padding="VALID", 52 | name='conv2',kernel_initializer=self.initializer,activation=None 53 | ) 54 | output=tf.squeeze(output_conv2) #[batch,sequence_length,d_model] 55 | return output #[batch,sequence_length,d_model] 56 | 57 | def position_wise_feed_forward_fc_fn(self): 58 | """ 59 | position-wise fully connected feed-forward network, implemented as in the original formulation. 60 | FFN(x) = max(0,xW1+b1)W2+b2 61 | this function is provided as an alternative if you want the original formulation, or you don't want to use two convolution layers, 62 | but it may be less efficient as the sequence becomes longer. 63 | x: [batch,sequence_length,d_model] 64 | :return: [batch,sequence_length,d_model] 65 | """ 66 | # 0. pre-process input x 67 | _,sequence_length,d_model=self.x.get_shape().as_list() 68 | 69 | element_list = tf.split(self.x, sequence_length,axis=1) # a list of length sequence_length; each element is [batch_size,1,d_model] 70 | element_list = [tf.squeeze(element, axis=1) for element in element_list] # a list of length sequence_length; each element is [batch_size,d_model] 71 | output_list=[] 72 | for i, element in enumerate(element_list): 73 | with tf.variable_scope("foo", reuse=True if i>0 else False): 74 | # 1. layer 1 75 | W1 = tf.get_variable("ff_layer1", shape=[self.d_model, self.d_ff], initializer=self.initializer) 76 | z1=tf.nn.relu(tf.matmul(element,W1)) # z1:[batch_size,d_ff]<--------tf.matmul([batch_size,d_model],[d_model, d_ff]) 77 | # 2. layer 2 78 | W2 = tf.get_variable("ff_layer2", shape=[self.d_ff, self.d_model], initializer=self.initializer) 79 | output_element=tf.matmul(z1,W2) # output:[batch_size,d_model]<----------tf.matmul([batch_size,d_ff],[d_ff, d_model]) 80 | output_list.append(output_element) # a list; each element is [batch_size,d_model] 81 | output=tf.stack(output_list,axis=1) # [batch,sequence_length,d_model] 82 | return output # [batch,sequence_length,d_model] 83 | 84 | # test function of position_wise_feed_forward_fn 85 | # time spent: OLD VERSION (FC): length=1000, 2.04 s; NEW VERSION (CNN): 0.03 s, roughly a 68x speed-up. 
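# Added note: the two implementations above are equivalent. A conv2d with
# kernel_size=[1, d_model] over the input expanded to [batch, sequence_length, d_model, 1]
# sees exactly one position at a time, so conv1 (d_ff filters, relu) computes
# max(0, x*W1 + b1) per position and conv2 (d_model filters, no activation) computes
# (.)*W2 + b2, with the same weights shared across all positions, as the FFN formula requires.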
86 | def test_position_wise_feed_forward_fn(): 87 | start=time.time() 88 | x=tf.ones((8,1000,512)) # batch_size=8, sequence_length=1000, d_model=512 89 | layer_index=0 90 | postion_wise_feed_forward=PositionWiseFeedFoward(x,layer_index) 91 | output=postion_wise_feed_forward.position_wise_feed_forward_fn() 92 | end=time.time() 93 | print("x:",x.shape,";output:",output.shape) 94 | print("time spent:",(end-start)) 95 | return output 96 | 97 | def test(): 98 | with tf.Session() as sess: 99 | result=test_position_wise_feed_forward_fn() 100 | sess.run(tf.global_variables_initializer()) 101 | result_=sess.run(result) 102 | print("result_.shape:",result_.shape) 103 | 104 | #test() -------------------------------------------------------------------------------- /model/transfomer_model.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | 4 | """ 5 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 6 | main idea: based on a multi-layer self-attention model (the encoder of the Transformer), pre-train two tasks (masked language model and next sentence prediction) 7 | on a large corpus, then fine-tune by adding a single classification layer. 8 | """ 9 | 10 | import tensorflow as tf 11 | import numpy as np 12 | from model.encoder import Encoder 13 | from model.config_transformer import Config 14 | import os 15 | os.environ["CUDA_VISIBLE_DEVICES"] = "6" 16 | 17 | class TransformerModel: 18 | def __init__(self,config): 19 | """ 20 | init all hyperparameters from the config class; define placeholders and the computation graph 21 | """ 22 | self.num_classes = config.num_classes 23 | print("BertModel.num_classes:",self.num_classes) 24 | self.batch_size = config.batch_size 25 | self.sequence_length = config.sequence_length 26 | self.vocab_size = config.vocab_size 27 | self.d_model = config.d_model 28 | self.learning_rate = tf.Variable(config.learning_rate, trainable=False, name="learning_rate") 29 | self.clip_gradients=config.clip_gradients 30 | self.decay_steps=config.decay_steps 31 | self.decay_rate=config.decay_rate 32 | self.d_k=config.d_k 33 | self.d_model=config.d_model 34 | self.h=config.h 35 | self.d_v=config.d_v 36 | self.num_layer=config.num_layer 37 | self.use_residual_conn=True 38 | self.is_training=config.is_training 39 | 40 | # placeholders (X, y) 41 | self.input_x= tf.placeholder(tf.int32, [self.batch_size, self.sequence_length], name="input_x") # a sequence of token ids, e.g. input='the man [mask1] to [mask2] store' 42 | self.input_y=tf.placeholder(tf.float32, [self.batch_size, self.num_classes],name="input_y") 43 | 44 | self.learning_rate_decay_half_op = tf.assign(self.learning_rate, self.learning_rate *config.decay_rate) 45 | self.initializer=tf.random_normal_initializer(stddev=0.1) 46 | self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob") 47 | self.global_step = tf.Variable(0, trainable=False, name="Global_Step") 48 | self.epoch_step = tf.Variable(0, trainable=False, name="Epoch_Step") 49 | self.epoch_increment = tf.assign(self.epoch_step, tf.add(self.epoch_step, tf.constant(1))) 50 | 51 | self.instantiate_weights() 52 | self.logits =self.inference() # shape:[None,self.num_classes] 53 | self.predictions = tf.argmax(self.logits, axis=1, name="predictions") # shape:[None,] 54 | 55 | if not self.is_training: 56 | return 57 | self.loss_val = self.loss() 58 | self.train_op = self.train() 59 | 60 | def inference(self): 61 | """ 62 | main inference logic here: invoke the transformer encoder to do 
inference. the input is a sequence of token ids; the encoder output is a sequence of hidden states, which is then projected to class logits. 63 | pipeline: input representation --> encoder stack (Nx blocks) --> projection layer 64 | :return: logits, shape [batch_size, num_classes] 65 | """ 66 | # 1. input representation(input embedding, positional encoding, segment encoding) 67 | token_embeddings = tf.nn.embedding_lookup(self.embedding,self.input_x) # [batch_size,sequence_length,embed_size] 68 | self.input_representation=tf.add(tf.add(token_embeddings,self.segment_embeddings),self.position_embeddings) # [batch_size,sequence_length,embed_size] 69 | 70 | # 2. repeat the building block Nx times (multi-head attention followed by Add & Norm; feed forward followed by Add & Norm) 71 | encoder_class=Encoder(self.d_model,self.d_k,self.d_v,self.sequence_length,self.h,self.batch_size,self.num_layer,self.input_representation, 72 | self.input_representation,dropout_keep_prob=self.dropout_keep_prob,use_residual_conn=self.use_residual_conn) 73 | h = encoder_class.encoder_fn() # [batch_size,sequence_length,d_model] 74 | 75 | # 3. get logits for different tasks by applying a projection layer 76 | logits=self.project_tasks(h) # shape:[None,self.num_classes] 77 | return logits # shape:[None,self.num_classes] 78 | 79 | def project_tasks(self,h): 80 | """ 81 | project the representation, then do classification. 82 | :param h: [batch_size,sequence_length,d_model] 83 | :return: logits: [batch_size, num_classes] 84 | transform each sub-task using a one-layer MLP, then get logits. 85 | takes some inspiration from recent work on densely connected layers. 86 | """ 87 | cls_representation = h[:, 0, :] # the [CLS] token's hidden state: the classification task's representation 88 | logits = tf.layers.dense(cls_representation, self.num_classes) # shape:[None,self.num_classes] 89 | logits = tf.nn.dropout(logits,keep_prob=self.dropout_keep_prob) # shape:[None,self.num_classes] 90 | return logits 91 | 92 | def loss(self,l2_lambda=0.0001*3,epislon=0.000001): 93 | # input: `logits` and `labels` must have the same shape `[batch_size, num_classes]` 94 | # output: a `Tensor` of the same shape as `logits` with the element-wise sigmoid cross entropy loss. 95 | # let `x = logits`, `z = labels`. The logistic loss is: z * -log(sigmoid(x)) + (1 - z) * -log(1 - sigmoid(x)) 96 | losses= tf.nn.sigmoid_cross_entropy_with_logits(labels=self.input_y,logits=self.logits) #[batch_size,num_classes] 97 | self.losses = tf.reduce_mean((tf.reduce_sum(losses,axis=1))) # [batch_size,num_classes] --> scalar: 
sum over classes, then average over the batch (a single loss for all examples in the batch) 98 | self.l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables() if 'bias' not in v.name]) * l2_lambda 99 | 100 | loss=self.losses+self.l2_loss 101 | return loss 102 | 103 | def train(self): 104 | """based on the loss, compute gradients and update parameters (Adam, via optimize_loss)""" 105 | learning_rate = tf.train.exponential_decay(self.learning_rate, self.global_step, self.decay_steps,self.decay_rate, staircase=True) 106 | train_op = tf.contrib.layers.optimize_loss(self.loss_val, global_step=self.global_step,learning_rate=learning_rate, optimizer="Adam",clip_gradients=self.clip_gradients) 107 | return train_op 108 | 109 | def instantiate_weights(self): 110 | """define all weights here""" 111 | with tf.name_scope("embedding"): # embedding matrix 112 | self.embedding = tf.get_variable("embedding", shape=[self.vocab_size, self.d_model],initializer=self.initializer) # [vocab_size,d_model] 113 | self.segment_embeddings = tf.get_variable("segment_embeddings", [self.d_model],initializer=tf.constant_initializer(1.0)) # a learned segment embedding, shape [d_model] 114 | self.position_embeddings = tf.get_variable("position_embeddings", [self.sequence_length, self.d_model],initializer=tf.constant_initializer(1.0)) # [sequence_length,d_model] 115 | 116 | 117 | # train the model on a toy task: learn to count, i.e. sum all inputs and decide whether the total is below or above a threshold. 118 | # usage: first run train() to train the model; it saves checkpoints to the file system. then run predict() to make predictions based on the checkpoint. 119 | def train(): 120 | # 1.init config and model 121 | config=Config() 122 | threshold=(config.sequence_length/2)+1 123 | model = TransformerModel(config) 124 | gpu_config = tf.ConfigProto() 125 | gpu_config.gpu_options.allow_growth = True 126 | saver = tf.train.Saver() 127 | save_path = config.ckpt_dir + "model.ckpt" 128 | #if not os.path.exists(config.ckpt_dir): 129 | # os.makedirs(config.ckpt_dir) 130 | with tf.Session(config=gpu_config) as sess: 131 | sess.run(tf.global_variables_initializer()) 132 | if os.path.exists(config.ckpt_dir): # if a checkpoint directory exists, restore the previously trained model (latest_checkpoint expects the directory, not the file prefix) 133 | saver.restore(sess, tf.train.latest_checkpoint(config.ckpt_dir)) 134 | for i in range(100000): 135 | # 2.feed data 136 | input_x = np.random.randn(config.batch_size, config.sequence_length) # [None, self.sequence_length] 137 | input_x[input_x >= 0] = 1 138 | input_x[input_x < 0] = 0 139 | input_y = generate_label(input_x,threshold) 140 | # 3.run session to train the model, print some logs. 141 | loss, _ = sess.run([model.loss_val, model.train_op],feed_dict={model.input_x: input_x, model.input_y: input_y,model.dropout_keep_prob: config.dropout_keep_prob}) 142 | print(i, "loss:", loss, "-------------------------------------------------------") 143 | if i==300: 144 | print("label[0]:", input_y[0]);print("input_x:",input_x) 145 | if i % 500 == 0: 146 | saver.save(sess, save_path, global_step=i) 147 | 148 | # use a saved checkpoint to make predictions and print them, to check whether the model solves the toy task. 
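# Worked example of the toy task (assuming config.sequence_length=10): threshold = 10/2 + 1 = 6,
# so an input row containing seven 1s gets label [0, 1] (sum > threshold),
# while a row containing four 1s gets label [1, 0]; see generate_label() below.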
149 | def predict(): 150 | config=Config() 151 | threshold=(config.sequence_length/2)+1 152 | config.batch_size=1 153 | model = TransformerModel(config) 154 | gpu_config = tf.ConfigProto() 155 | gpu_config.gpu_options.allow_growth = True 156 | saver = tf.train.Saver() 157 | ckpt_dir = config.ckpt_dir 158 | print("ckpt_dir:",ckpt_dir) 159 | with tf.Session(config=gpu_config) as sess: 160 | sess.run(tf.global_variables_initializer()) 161 | saver.restore(sess, tf.train.latest_checkpoint(ckpt_dir)) 162 | for i in range(100): 163 | # 2.feed data 164 | input_x = np.random.randn(config.batch_size, config.sequence_length) # [None, self.sequence_length] 165 | input_x[input_x >= 0] = 1 166 | input_x[input_x < 0] = 0 167 | target_label = generate_label(input_x,threshold) 168 | input_sum=np.sum(input_x) 169 | # 3.run session to train the model, print some logs. 170 | logit,prediction = sess.run([model.logits, model.predictions],feed_dict={model.input_x: input_x ,model.dropout_keep_prob: config.dropout_keep_prob}) 171 | print("target_label:", target_label,";input_sum:",input_sum,"threshold:",threshold,";prediction:",prediction); 172 | print("input_x:",input_x,";logit:",logit) 173 | 174 | 175 | def generate_label(input_x,threshold): 176 | """ 177 | generate label with input 178 | :param input_x: shape of [batch_size, sequence_length] 179 | :return: y:[batch_size] 180 | """ 181 | batch_size,sequence_length=input_x.shape 182 | y=np.zeros((batch_size,2)) 183 | for i in range(batch_size): 184 | input_single=input_x[i] 185 | sum=np.sum(input_single) 186 | if i == 0:print("sum:",sum,";threshold:",threshold) 187 | y_single=1 if sum>threshold else 0 188 | if y_single==1: 189 | y[i]=[0,1] 190 | else: # y_single=0 191 | y[i]=[1,0] 192 | return y 193 | 194 | #train() 195 | #predict() -------------------------------------------------------------------------------- /old/JoinAttLayer.py: -------------------------------------------------------------------------------- 1 | # coding=utf8 2 | from keras import backend as K 3 | from keras.engine.topology import Layer 4 | from keras import initializers, regularizers, constraints 5 | from keras.layers.merge import _Merge 6 | 7 | 8 | class Attention(Layer): 9 | def __init__(self, step_dim, 10 | W_regularizer=None, b_regularizer=None, 11 | W_constraint=None, b_constraint=None, 12 | bias=True, **kwargs): 13 | """ 14 | Keras Layer that implements an Attention mechanism for temporal data. 15 | Supports Masking. 16 | Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756] 17 | # Input shape 18 | 3D tensor with shape: `(samples, steps, features)`. 19 | # Output shape 20 | 2D tensor with shape: `(samples, features)`. 21 | :param kwargs: 22 | Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True. 23 | The dimensions are inferred based on the output shape of the RNN. 
24 | Example: 25 | model.add(LSTM(64, return_sequences=True)) 26 | model.add(Attention()) 27 | """ 28 | self.supports_masking = True 29 | # self.init = initializations.get('glorot_uniform') 30 | self.init = initializers.get('glorot_uniform') 31 | 32 | self.W_regularizer = regularizers.get(W_regularizer) 33 | self.b_regularizer = regularizers.get(b_regularizer) 34 | 35 | self.W_constraint = constraints.get(W_constraint) 36 | self.b_constraint = constraints.get(b_constraint) 37 | 38 | self.bias = bias 39 | self.step_dim = step_dim 40 | self.features_dim = 0 41 | super(Attention, self).__init__(**kwargs) 42 | 43 | def build(self, input_shape): 44 | assert len(input_shape) == 3 45 | 46 | self.W = self.add_weight((input_shape[-1],), 47 | initializer=self.init, 48 | name='{}_W'.format(self.name), 49 | regularizer=self.W_regularizer, 50 | constraint=self.W_constraint) 51 | self.features_dim = input_shape[-1] 52 | 53 | if self.bias: 54 | self.b = self.add_weight((input_shape[1],), 55 | initializer='zero', 56 | name='{}_b'.format(self.name), 57 | regularizer=self.b_regularizer, 58 | constraint=self.b_constraint) 59 | else: 60 | self.b = None 61 | 62 | self.built = True 63 | 64 | def compute_mask(self, input, input_mask=None): 65 | # do not pass the mask to the next layers 66 | return None 67 | 68 | def call(self, x, mask=None): 69 | input_shape = K.int_shape(x) 70 | 71 | features_dim = self.features_dim 72 | # step_dim = self.step_dim 73 | step_dim = input_shape[1] 74 | 75 | eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim)) 76 | 77 | if self.bias: 78 | eij += self.b[:input_shape[1]] 79 | 80 | eij = K.tanh(eij) 81 | 82 | a = K.exp(eij) 83 | 84 | # apply mask after the exp. will be re-normalized next 85 | if mask is not None: 86 | # Cast the mask to floatX to avoid float64 upcasting in theano 87 | a *= K.cast(mask, K.floatx()) 88 | 89 | # in some cases especially in the early stages of training the sum may be almost zero 90 | # and this results in NaN's. A workaround is to add a very small positive number ε to the sum. 91 | a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx()) 92 | 93 | a = K.expand_dims(a) 94 | weighted_input = x * a 95 | # print weigthted_input.shape 96 | return K.sum(weighted_input, axis=1) 97 | 98 | def compute_output_shape(self, input_shape): 99 | # return input_shape[0], input_shape[-1] 100 | return input_shape[0], self.features_dim 101 | # end Attention 102 | 103 | 104 | class JoinAttention(_Merge): 105 | def __init__(self, step_dim, hid_size, 106 | W_regularizer=None, b_regularizer=None, 107 | W_constraint=None, b_constraint=None, 108 | bias=True, **kwargs): 109 | """ 110 | Keras Layer that implements an Attention mechanism according to other vector. 111 | Supports Masking. 112 | # Input shape, list of 113 | 2D tensor with shape: `(samples, features_1)`. 114 | 3D tensor with shape: `(samples, steps, features_2)`. 115 | # Output shape 116 | 2D tensor with shape: `(samples, features)`. 117 | :param kwargs: 118 | Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True. 119 | The dimensions are inferred based on the output shape of the RNN. 
120 | Example: 121 | en = LSTM(64, return_sequences=False)(input) 122 | de = LSTM(64, return_sequences=True)(input2) 123 | output = JoinAttention(64, 20)([en, de]) 124 | """ 125 | self.supports_masking = True 126 | # self.init = initializations.get('glorot_uniform') 127 | self.init = initializers.get('glorot_uniform') 128 | 129 | self.W_regularizer = regularizers.get(W_regularizer) 130 | self.b_regularizer = regularizers.get(b_regularizer) 131 | 132 | self.W_constraint = constraints.get(W_constraint) 133 | self.b_constraint = constraints.get(b_constraint) 134 | 135 | self.bias = bias 136 | self.step_dim = step_dim 137 | self.hid_size = hid_size 138 | super(JoinAttention, self).__init__(**kwargs) 139 | 140 | def build(self, input_shape): 141 | if not isinstance(input_shape, list): 142 | raise ValueError('A merge layer [JoinAttention] should be called ' 143 | 'on a list of inputs.') 144 | if len(input_shape) != 2: 145 | raise ValueError('A merge layer [JoinAttention] should be called ' 146 | 'on a list of 2 inputs. ' 147 | 'Got ' + str(len(input_shape)) + ' inputs.') 148 | if len(input_shape[0]) != 2 or len(input_shape[1]) != 3: 149 | raise ValueError('A merge layer [JoinAttention] should be called ' 150 | 'on a list of 2 inputs with first ndim 2 and second one ndim 3. ' 151 | 'Got ' + str(len(input_shape)) + ' inputs.') 152 | 153 | self.W_en1 = self.add_weight((input_shape[0][-1], self.hid_size), 154 | initializer=self.init, 155 | name='{}_W0'.format(self.name), 156 | regularizer=self.W_regularizer, 157 | constraint=self.W_constraint) 158 | self.W_en2 = self.add_weight((input_shape[1][-1], self.hid_size), 159 | initializer=self.init, 160 | name='{}_W1'.format(self.name), 161 | regularizer=self.W_regularizer, 162 | constraint=self.W_constraint) 163 | self.W_de = self.add_weight((self.hid_size,), 164 | initializer=self.init, 165 | name='{}_W2'.format(self.name), 166 | regularizer=self.W_regularizer, 167 | constraint=self.W_constraint) 168 | 169 | if self.bias: 170 | self.b_en1 = self.add_weight((self.hid_size,), 171 | initializer='zero', 172 | name='{}_b0'.format(self.name), 173 | regularizer=self.b_regularizer, 174 | constraint=self.b_constraint) 175 | self.b_en2 = self.add_weight((self.hid_size,), 176 | initializer='zero', 177 | name='{}_b1'.format(self.name), 178 | regularizer=self.b_regularizer, 179 | constraint=self.b_constraint) 180 | self.b_de = self.add_weight((input_shape[1][1],), 181 | initializer='zero', 182 | name='{}_b2'.format(self.name), 183 | regularizer=self.b_regularizer, 184 | constraint=self.b_constraint) 185 | else: 186 | self.b_en1 = None 187 | self.b_en2 = None 188 | self.b_de = None 189 | 190 | self._reshape_required = False 191 | self.built = True 192 | 193 | def compute_output_shape(self, input_shape): 194 | return input_shape[1][0], input_shape[1][-1] 195 | 196 | def compute_mask(self, input, input_mask=None): 197 | # do not pass the mask to the next layers 198 | return None 199 | 200 | def call(self, inputs, mask=None): 201 | en = inputs[0] 202 | de = inputs[1] 203 | de_shape = K.int_shape(de) 204 | step_dim = de_shape[1] 205 | 206 | hid_en = K.dot(en, self.W_en1) 207 | hid_de = K.dot(de, self.W_en2) 208 | if self.bias: 209 | hid_en += self.b_en1 210 | hid_de += self.b_en2 211 | hid = K.tanh(K.expand_dims(hid_en, axis=1) + hid_de) 212 | eij = K.reshape(K.dot(hid, K.reshape(self.W_de, (self.hid_size, 1))), (-1, step_dim)) 213 | if self.bias: 214 | eij += self.b_de[:step_dim] 215 | 216 | a = K.exp(eij - K.max(eij, axis=-1, keepdims=True)) 217 | 218 | # apply mask 
after the exp. will be re-normalized next 219 | if mask is not None: 220 | # Cast the mask to floatX to avoid float64 upcasting in theano 221 | a *= K.cast(mask[1], K.floatx()) 222 | 223 | # in some cases especially in the early stages of training the sum may be almost zero 224 | # and this results in NaN's. A workaround is to add a very small positive number ε to the sum. 225 | a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx()) 226 | 227 | a = K.expand_dims(a) 228 | weighted_input = de * a 229 | return K.sum(weighted_input, axis=1) 230 | # end JoinAttention 231 | -------------------------------------------------------------------------------- /old/classifier_bigru.py: -------------------------------------------------------------------------------- 1 | import keras 2 | from keras import Model 3 | from keras.layers import * 4 | from JoinAttLayer import Attention 5 | 6 | 7 | class TextClassifier(): 8 | 9 | def model(self, embeddings_matrix, maxlen, word_index, num_class): 10 | inp = Input(shape=(maxlen,)) 11 | encode = Bidirectional(CuDNNGRU(128, return_sequences=True)) 12 | encode2 = Bidirectional(CuDNNGRU(128, return_sequences=True)) 13 | attention = Attention(maxlen) 14 | x_4 = Embedding(len(word_index) + 1, 15 | embeddings_matrix.shape[1], 16 | weights=[embeddings_matrix], 17 | input_length=maxlen, 18 | trainable=True)(inp) 19 | x_3 = SpatialDropout1D(0.2)(x_4) 20 | x_3 = encode(x_3) 21 | x_3 = Dropout(0.2)(x_3) 22 | x_3 = encode2(x_3) 23 | x_3 = Dropout(0.2)(x_3) 24 | avg_pool_3 = GlobalAveragePooling1D()(x_3) 25 | max_pool_3 = GlobalMaxPooling1D()(x_3) 26 | attention_3 = attention(x_3) 27 | x = keras.layers.concatenate([avg_pool_3, max_pool_3, attention_3], name="fc") 28 | x = Dense(num_class, activation="sigmoid")(x) 29 | 30 | adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08,amsgrad=True) 31 | rmsprop = keras.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-06) 32 | model = Model(inputs=inp, outputs=x) 33 | model.compile( 34 | loss='categorical_crossentropy', 35 | optimizer=adam) 36 | return model 37 | -------------------------------------------------------------------------------- /old/classifier_capsule.py: -------------------------------------------------------------------------------- 1 | import keras 2 | import random 3 | random.seed = 16 4 | import numpy as np 5 | np.random.seed(16) 6 | from tensorflow import set_random_seed 7 | set_random_seed(16) 8 | import random 9 | random.seed = 16 10 | from keras.models import Model 11 | from keras.layers import * 12 | from JoinAttLayer import Attention 13 | 14 | 15 | def precision(y_true, y_pred): 16 | """Precision metric. 17 | Only computes a batch-wise average of precision. 18 | Computes the precision, a metric for multi-label classification of 19 | how many selected items are relevant. 20 | """ 21 | true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) 22 | predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1))) 23 | precision = true_positives / (predicted_positives + K.epsilon()) 24 | return precision 25 | 26 | 27 | def recall(y_true, y_pred): 28 | """Recall metric. 29 | Only computes a batch-wise average of recall. 30 | Computes the recall, a metric for multi-label classification of 31 | how many relevant items are selected. 
32 | """ 33 | true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) 34 | possible_positives = K.sum(K.round(K.clip(y_true, 0, 1))) 35 | recall = true_positives / (possible_positives + K.epsilon()) 36 | return recall 37 | 38 | 39 | def f1(y_true, y_pred, beta=1): 40 | """Computes the F score. 41 | The F score is the weighted harmonic mean of precision and recall. 42 | Here it is only computed as a batch-wise average, not globally. 43 | This is useful for multi-label classification, where input samples can be 44 | classified as sets of labels. By only using accuracy (precision) a model 45 | would achieve a perfect score by simply assigning every class to every 46 | input. In order to avoid this, a metric should penalize incorrect class 47 | assignments as well (recall). The F-beta score (ranged from 0.0 to 1.0) 48 | computes this, as a weighted mean of the proportion of correct class 49 | assignments vs. the proportion of incorrect class assignments. 50 | With beta = 1, this is equivalent to a F-measure. With beta < 1, assigning 51 | correct classes becomes more important, and with beta > 1 the metric is 52 | instead weighted towards penalizing incorrect class assignments. 53 | """ 54 | if beta < 0: 55 | raise ValueError('The lowest choosable beta is zero (only precision).') 56 | 57 | p = precision(y_true, y_pred) 58 | r = recall(y_true, y_pred) 59 | bb = beta ** 2 60 | fbeta_score = (1 + bb) * (p * r) / (bb * p + r + K.epsilon()) 61 | return fbeta_score 62 | 63 | 64 | def squash(x, axis=-1): 65 | # s_squared_norm is really small 66 | # s_squared_norm = K.sum(K.square(x), axis, keepdims=True) + K.epsilon() 67 | # scale = K.sqrt(s_squared_norm)/ (0.5 + s_squared_norm) 68 | # return scale * x 69 | s_squared_norm = K.sum(K.square(x), axis, keepdims=True) 70 | scale = K.sqrt(s_squared_norm + K.epsilon()) 71 | return x / scale 72 | 73 | 74 | # A Capsule Implement with Pure Keras 75 | class Capsule(Layer): 76 | def __init__(self, num_capsule, dim_capsule, routings=3, kernel_size=(9, 1), share_weights=True, 77 | activation='default', **kwargs): 78 | super(Capsule, self).__init__(**kwargs) 79 | self.num_capsule = num_capsule 80 | self.dim_capsule = dim_capsule 81 | self.routings = routings 82 | self.kernel_size = kernel_size 83 | self.share_weights = share_weights 84 | if activation == 'default': 85 | self.activation = squash 86 | else: 87 | self.activation = Activation(activation) 88 | 89 | def build(self, input_shape): 90 | super(Capsule, self).build(input_shape) 91 | input_dim_capsule = input_shape[-1] 92 | if self.share_weights: 93 | self.W = self.add_weight(name='capsule_kernel', 94 | shape=(1, input_dim_capsule, 95 | self.num_capsule * self.dim_capsule), 96 | # shape=self.kernel_size, 97 | initializer='glorot_uniform', 98 | trainable=True) 99 | else: 100 | input_num_capsule = input_shape[-2] 101 | self.W = self.add_weight(name='capsule_kernel', 102 | shape=(input_num_capsule, 103 | input_dim_capsule, 104 | self.num_capsule * self.dim_capsule), 105 | initializer='glorot_uniform', 106 | trainable=True) 107 | 108 | def call(self, u_vecs): 109 | if self.share_weights: 110 | u_hat_vecs = K.conv1d(u_vecs, self.W) 111 | else: 112 | u_hat_vecs = K.local_conv1d(u_vecs, self.W, [1], [1]) 113 | 114 | batch_size = K.shape(u_vecs)[0] 115 | input_num_capsule = K.shape(u_vecs)[1] 116 | u_hat_vecs = K.reshape(u_hat_vecs, (batch_size, input_num_capsule, 117 | self.num_capsule, self.dim_capsule)) 118 | u_hat_vecs = K.permute_dimensions(u_hat_vecs, (0, 2, 1, 3)) 119 | # final u_hat_vecs.shape = [None, 
num_capsule, input_num_capsule, dim_capsule] 120 | 121 | b = K.zeros_like(u_hat_vecs[:, :, :, 0]) # shape = [None, num_capsule, input_num_capsule] 122 | for i in range(self.routings): 123 | b = K.permute_dimensions(b, (0, 2, 1)) # shape = [None, input_num_capsule, num_capsule] 124 | c = K.softmax(b) 125 | c = K.permute_dimensions(c, (0, 2, 1)) 126 | b = K.permute_dimensions(b, (0, 2, 1)) 127 | outputs = self.activation(K.batch_dot(c, u_hat_vecs, [2, 2])) 128 | if i < self.routings - 1: 129 | b = K.batch_dot(outputs, u_hat_vecs, [2, 3]) 130 | 131 | return outputs 132 | 133 | def compute_output_shape(self, input_shape): 134 | return (None, self.num_capsule, self.dim_capsule) 135 | 136 | 137 | class TextClassifier(): 138 | 139 | def model(self, embeddings_matrix, maxlen, word_index, num_class): 140 | input1 = Input(shape=(maxlen,)) 141 | embed_layer = Embedding(len(word_index) + 1, 142 | embeddings_matrix.shape[1], 143 | input_length=maxlen, 144 | weights=[embeddings_matrix], 145 | trainable=True)(input1) 146 | embed_layer = SpatialDropout1D(0.28)(embed_layer) 147 | 148 | x = Bidirectional( 149 | CuDNNGRU(128, return_sequences=True))( 150 | embed_layer) 151 | x = Activation('relu')(x) 152 | x = Dropout(0.25)(x) 153 | x = Bidirectional( 154 | CuDNNGRU(128, return_sequences=True))( 155 | x) 156 | x = Activation('relu')(x) 157 | x = Dropout(0.25)(x) 158 | capsule = Capsule(num_capsule=10, dim_capsule=16, routings=5, 159 | share_weights=True)(x) 160 | # output_capsule = Lambda(lambda x: K.sqrt(K.sum(K.square(x), 2)))(capsule) 161 | capsule = Flatten()(capsule) 162 | capsule = Dropout(0.25)(capsule) 163 | output = Dense(num_class, activation='sigmoid')(capsule) 164 | model = Model(inputs=input1, outputs=output) 165 | model.compile( 166 | loss='binary_crossentropy', 167 | optimizer='adam', 168 | metrics=["categorical_accuracy"]) 169 | return model 170 | 171 | -------------------------------------------------------------------------------- /old/classifier_rcnn.py: -------------------------------------------------------------------------------- 1 | import keras 2 | from keras import Model 3 | from keras.layers import * 4 | from JoinAttLayer import Attention 5 | 6 | 7 | def precision(y_true, y_pred): 8 | """Precision metric. 9 | Only computes a batch-wise average of precision. 10 | Computes the precision, a metric for multi-label classification of 11 | how many selected items are relevant. 12 | """ 13 | true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) 14 | predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1))) 15 | precision = true_positives / (predicted_positives + K.epsilon()) 16 | return precision 17 | 18 | 19 | def recall(y_true, y_pred): 20 | """Recall metric. 21 | Only computes a batch-wise average of recall. 22 | Computes the recall, a metric for multi-label classification of 23 | how many relevant items are selected. 24 | """ 25 | true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) 26 | possible_positives = K.sum(K.round(K.clip(y_true, 0, 1))) 27 | recall = true_positives / (possible_positives + K.epsilon()) 28 | return recall 29 | 30 | 31 | def f1(y_true, y_pred, beta=1): 32 | """Computes the F score. 33 | The F score is the weighted harmonic mean of precision and recall. 34 | Here it is only computed as a batch-wise average, not globally. 35 | This is useful for multi-label classification, where input samples can be 36 | classified as sets of labels. 
By only using accuracy (precision) a model 37 | would achieve a perfect score by simply assigning every class to every 38 | input. In order to avoid this, a metric should penalize incorrect class 39 | assignments as well (recall). The F-beta score (ranged from 0.0 to 1.0) 40 | computes this, as a weighted mean of the proportion of correct class 41 | assignments vs. the proportion of incorrect class assignments. 42 | With beta = 1, this is equivalent to a F-measure. With beta < 1, assigning 43 | correct classes becomes more important, and with beta > 1 the metric is 44 | instead weighted towards penalizing incorrect class assignments. 45 | """ 46 | if beta < 0: 47 | raise ValueError('The lowest choosable beta is zero (only precision).') 48 | 49 | p = precision(y_true, y_pred) 50 | r = recall(y_true, y_pred) 51 | bb = beta ** 2 52 | fbeta_score = (1 + bb) * (p * r) / (bb * p + r + K.epsilon()) 53 | return fbeta_score 54 | 55 | 56 | class TextClassifier(): 57 | 58 | def model(self, embeddings_matrix, maxlen, word_index, num_class): 59 | inp = Input(shape=(maxlen,)) 60 | encode = Bidirectional(GRU(1, return_sequences=True)) 61 | encode2 = Bidirectional(GRU(1, return_sequences=True)) 62 | attention = Attention(maxlen) 63 | x_4 = Embedding(len(word_index) + 1, 64 | embeddings_matrix.shape[1], 65 | weights=[embeddings_matrix], 66 | input_length=maxlen, 67 | trainable=True)(inp) 68 | x_3 = SpatialDropout1D(0.2)(x_4) 69 | x_3 = encode(x_3) 70 | x_3 = Dropout(0.2)(x_3) 71 | x_3 = encode2(x_3) 72 | x_3 = Dropout(0.2)(x_3) 73 | x_3 = Conv1D(64, kernel_size=3, padding="valid", kernel_initializer="glorot_uniform")(x_3) 74 | x_3 = Dropout(0.2)(x_3) 75 | avg_pool_3 = GlobalAveragePooling1D()(x_3) 76 | max_pool_3 = GlobalMaxPooling1D()(x_3) 77 | attention_3 = attention(x_3) 78 | x = keras.layers.concatenate([avg_pool_3, max_pool_3, attention_3]) 79 | x = Dense(num_class, activation="sigmoid")(x) 80 | 81 | adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08) 82 | rmsprop = keras.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=1e-06) 83 | model = Model(inputs=inp, outputs=x) 84 | model.compile( 85 | loss='categorical_crossentropy', 86 | optimizer=rmsprop 87 | ) 88 | return model 89 | -------------------------------------------------------------------------------- /old/evaluate_char.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from sklearn.metrics import f1_score, classification_report 3 | 4 | 5 | if __name__ == "__main__": 6 | validation_pred = pd.read_csv("validation_rcnn_char.csv") 7 | validation_real = pd.read_csv("preprocess/validation_char.csv") 8 | f_scores = 0 9 | 10 | print(classification_report(validation_real["location_traffic_convenience"], validation_pred["location_traffic_convenience"])) 11 | print(classification_report(validation_real["location_distance_from_business_district"], validation_pred["location_distance_from_business_district"])) 12 | print(classification_report(validation_real["location_easy_to_find"], validation_pred["location_easy_to_find"])) 13 | print(classification_report(validation_real["service_wait_time"], validation_pred["service_wait_time"])) 14 | print(classification_report(validation_real["service_waiters_attitude"], validation_pred["service_waiters_attitude"])) 15 | print(classification_report(validation_real["service_parking_convenience"], validation_pred["service_parking_convenience"])) 16 | print(classification_report(validation_real["service_serving_speed"], 
validation_pred["service_serving_speed"])) 17 | print(classification_report(validation_real["price_level"], validation_pred["price_level"])) 18 | print(classification_report(validation_real["price_cost_effective"], validation_pred["price_cost_effective"])) 19 | print(classification_report(validation_real["price_discount"], validation_pred["price_discount"])) 20 | print(classification_report(validation_real["environment_decoration"], validation_pred["environment_decoration"])) 21 | print(classification_report(validation_real["environment_noise"], validation_pred["environment_noise"])) 22 | print(classification_report(validation_real["environment_space"], validation_pred["environment_space"])) 23 | print(classification_report(validation_real["environment_cleaness"], validation_pred["environment_cleaness"])) 24 | print(classification_report(validation_real["dish_portion"], validation_pred["dish_portion"])) 25 | print(classification_report(validation_real["dish_taste"], validation_pred["dish_taste"])) 26 | print(classification_report(validation_real["dish_look"], validation_pred["dish_look"])) 27 | print(classification_report(validation_real["dish_recommendation"], validation_pred["dish_recommendation"])) 28 | print(classification_report(validation_real["others_overall_experience"], validation_pred["others_overall_experience"])) 29 | print(classification_report(validation_real["others_willing_to_consume_again"], validation_pred["others_willing_to_consume_again"])) 30 | 31 | f_scores += f1_score(validation_real["location_traffic_convenience"], validation_pred["location_traffic_convenience"], 32 | average="macro") 33 | print(f1_score(validation_real["location_traffic_convenience"], validation_pred["location_traffic_convenience"], 34 | average="macro")) 35 | 36 | f_scores += f1_score(validation_real["location_distance_from_business_district"], 37 | validation_pred["location_distance_from_business_district"], average="macro") 38 | print(f1_score(validation_real["location_distance_from_business_district"], 39 | validation_pred["location_distance_from_business_district"], average="macro")) 40 | 41 | f_scores += f1_score(validation_real["location_easy_to_find"], validation_pred["location_easy_to_find"], average="macro") 42 | print(f1_score(validation_real["location_easy_to_find"], validation_pred["location_easy_to_find"], 43 | average="macro")) 44 | 45 | f_scores += f1_score(validation_real["service_wait_time"], validation_pred["service_wait_time"], average="macro") 46 | print(f1_score(validation_real["service_wait_time"], validation_pred["service_wait_time"], 47 | average="macro")) 48 | 49 | f_scores += f1_score(validation_real["service_waiters_attitude"], validation_pred["service_waiters_attitude"], average="macro") 50 | print(f1_score(validation_real["service_waiters_attitude"], validation_pred["service_waiters_attitude"], 51 | average="macro")) 52 | 53 | f_scores += f1_score(validation_real["service_parking_convenience"], validation_pred["service_parking_convenience"], 54 | average="macro") 55 | print(f1_score(validation_real["service_parking_convenience"], validation_pred["service_parking_convenience"], 56 | average="macro")) 57 | 58 | f_scores += f1_score(validation_real["service_serving_speed"], validation_pred["service_serving_speed"], average="macro") 59 | print(f1_score(validation_real["service_serving_speed"], validation_pred["service_serving_speed"], 60 | average="macro")) 61 | 62 | f_scores += f1_score(validation_real["price_level"], validation_pred["price_level"], average="macro") 63 | 
print(f1_score(validation_real["price_level"], validation_pred["price_level"], 64 | average="macro")) 65 | 66 | f_scores += f1_score(validation_real["price_cost_effective"], validation_pred["price_cost_effective"], average="macro") 67 | print(f1_score(validation_real["price_cost_effective"], validation_pred["price_cost_effective"], 68 | average="macro")) 69 | 70 | f_scores += f1_score(validation_real["price_discount"], validation_pred["price_discount"], average="macro") 71 | print(f1_score(validation_real["price_discount"], validation_pred["price_discount"], 72 | average="macro")) 73 | 74 | f_scores += f1_score(validation_real["environment_decoration"], validation_pred["environment_decoration"], average="macro") 75 | print(f1_score(validation_real["environment_decoration"], validation_pred["environment_decoration"], 76 | average="macro")) 77 | 78 | f_scores += f1_score(validation_real["environment_noise"], validation_pred["environment_noise"], average="macro") 79 | print(f1_score(validation_real["environment_noise"], validation_pred["environment_noise"], 80 | average="macro")) 81 | 82 | f_scores += f1_score(validation_real["environment_space"], validation_pred["environment_space"], average="macro") 83 | print(f1_score(validation_real["environment_space"], validation_pred["environment_space"], 84 | average="macro")) 85 | 86 | f_scores += f1_score(validation_real["environment_cleaness"], validation_pred["environment_cleaness"], average="macro") 87 | print(f1_score(validation_real["environment_cleaness"], validation_pred["environment_cleaness"], 88 | average="macro")) 89 | 90 | f_scores += f1_score(validation_real["dish_portion"], validation_pred["dish_portion"], average="macro") 91 | print(f1_score(validation_real["dish_portion"], validation_pred["dish_portion"], 92 | average="macro")) 93 | 94 | f_scores += f1_score(validation_real["dish_taste"], validation_pred["dish_taste"], average="macro") 95 | print(f1_score(validation_real["dish_taste"], validation_pred["dish_taste"], 96 | average="macro")) 97 | 98 | f_scores += f1_score(validation_real["dish_look"], validation_pred["dish_look"], average="macro") 99 | print(f1_score(validation_real["dish_look"], validation_pred["dish_look"], 100 | average="macro")) 101 | 102 | f_scores += f1_score(validation_real["dish_recommendation"], validation_pred["dish_recommendation"], average="macro") 103 | print(f1_score(validation_real["dish_recommendation"], validation_pred["dish_recommendation"], 104 | average="macro")) 105 | 106 | f_scores += f1_score(validation_real["others_overall_experience"], validation_pred["others_overall_experience"], 107 | average="macro") 108 | print(f1_score(validation_real["others_overall_experience"], validation_pred["others_overall_experience"], 109 | average="macro")) 110 | 111 | f_scores += f1_score(validation_real["others_willing_to_consume_again"], validation_pred["others_willing_to_consume_again"], 112 | average="macro") 113 | print(f1_score(validation_real["others_willing_to_consume_again"], validation_pred["others_willing_to_consume_again"], 114 | average="macro")) 115 | 116 | print(f_scores / 20) -------------------------------------------------------------------------------- /old/predict_bigru_char.py: -------------------------------------------------------------------------------- 1 | from keras.backend.tensorflow_backend import set_session 2 | import tensorflow as tf 3 | config = tf.ConfigProto() 4 | config.gpu_options.allow_growth = True 5 | set_session(tf.Session(config=config)) 6 | import gc 7 | import pandas as 
pd 8 | import pickle 9 | import numpy as np 10 | np.random.seed(16) 11 | from tensorflow import set_random_seed 12 | set_random_seed(16) 13 | from keras.layers import * 14 | from keras.preprocessing import sequence 15 | from gensim.models.keyedvectors import KeyedVectors 16 | from classifier_bigru import TextClassifier 17 | 18 | 19 | def getClassification(arr): 20 | arr = list(arr) 21 | if arr.index(max(arr)) == 0: 22 | return -2 23 | elif arr.index(max(arr)) == 1: 24 | return -1 25 | elif arr.index(max(arr)) == 2: 26 | return 0 27 | else: 28 | return 1 29 | 30 | 31 | if __name__ == "__main__": 32 | with open('tokenizer_char.pickle', 'rb') as handle: 33 | maxlen = 1000 34 | model_dir = "model_bigru_char/" 35 | tokenizer = pickle.load(handle) 36 | word_index = tokenizer.word_index 37 | validation = pd.read_csv("preprocess/test_char.csv") 38 | validation["content"] = validation.apply(lambda x: eval(x[1]), axis=1) 39 | X_test = validation["content"].values 40 | list_tokenized_validation = tokenizer.texts_to_sequences(X_test) 41 | input_validation = sequence.pad_sequences(list_tokenized_validation, maxlen=maxlen) 42 | w2_model = KeyedVectors.load_word2vec_format("word2vec/chars.vector", binary=True, encoding='utf8', 43 | unicode_errors='ignore') 44 | embeddings_index = {} 45 | embeddings_matrix = np.zeros((len(word_index) + 1, w2_model.vector_size)) 46 | word2idx = {"_PAD": 0} 47 | vocab_list = [(k, w2_model.wv[k]) for k, v in w2_model.wv.vocab.items()] 48 | for word, i in word_index.items(): 49 | if word in w2_model: 50 | embedding_vector = w2_model[word] 51 | else: 52 | embedding_vector = None 53 | if embedding_vector is not None: 54 | embeddings_matrix[i] = embedding_vector 55 | 56 | submit = pd.read_csv("ai_challenger_sentiment_analysis_testa_20180816/sentiment_analysis_testa.csv") 57 | submit_prob = pd.read_csv("ai_challenger_sentiment_analysis_testa_20180816/sentiment_analysis_testa.csv") 58 | 59 | model1 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 60 | model1.load_weights(model_dir + "model_ltc_01.hdf5") 61 | submit["location_traffic_convenience"] = list(map(getClassification, model1.predict(input_validation))) 62 | submit_prob["location_traffic_convenience"] = list(model1.predict(input_validation)) 63 | del model1 64 | gc.collect() 65 | K.clear_session() 66 | 67 | model2 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 68 | model2.load_weights(model_dir + "model_ldfbd_01.hdf5") 69 | submit["location_distance_from_business_district"] = list( 70 | map(getClassification, model2.predict(input_validation))) 71 | submit_prob["location_distance_from_business_district"] = list(model2.predict(input_validation)) 72 | del model2 73 | gc.collect() 74 | K.clear_session() 75 | 76 | model3 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 77 | model3.load_weights(model_dir + "model_letf_02.hdf5") 78 | submit["location_easy_to_find"] = list(map(getClassification, model3.predict(input_validation))) 79 | submit_prob["location_easy_to_find"] = list(model3.predict(input_validation)) 80 | del model3 81 | gc.collect() 82 | K.clear_session() 83 | 84 | model4 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 85 | model4.load_weights(model_dir + "model_swt_02.hdf5") 86 | submit["service_wait_time"] = list(map(getClassification, model4.predict(input_validation))) 87 | submit_prob["service_wait_time"] = list(model4.predict(input_validation)) 88 | del model4 89 | gc.collect() 90 | K.clear_session() 91 | 92 | model5 = 
TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 93 | model5.load_weights(model_dir + "model_swa_02.hdf5") 94 | submit["service_waiters_attitude"] = list(map(getClassification, model5.predict(input_validation))) 95 | submit_prob["service_waiters_attitude"] = list(model5.predict(input_validation)) 96 | del model5 97 | gc.collect() 98 | K.clear_session() 99 | 100 | model6 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 101 | model6.load_weights(model_dir + "model_spc_02.hdf5") 102 | submit["service_parking_convenience"] = list(map(getClassification, model6.predict(input_validation))) 103 | submit_prob["service_parking_convenience"] = list(model6.predict(input_validation)) 104 | del model6 105 | gc.collect() 106 | K.clear_session() 107 | 108 | model7 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 109 | model7.load_weights(model_dir + "model_ssp_02.hdf5") 110 | submit["service_serving_speed"] = list(map(getClassification, model7.predict(input_validation))) 111 | submit_prob["service_serving_speed"] = list(model7.predict(input_validation)) 112 | del model7 113 | gc.collect() 114 | K.clear_session() 115 | 116 | model8 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 117 | model8.load_weights(model_dir + "model_pl_02.hdf5") 118 | submit["price_level"] = list(map(getClassification, model8.predict(input_validation))) 119 | submit_prob["price_level"] = list(model8.predict(input_validation)) 120 | del model8 121 | gc.collect() 122 | K.clear_session() 123 | 124 | model9 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 125 | model9.load_weights(model_dir + "model_pce_02.hdf5") 126 | submit["price_cost_effective"] = list(map(getClassification, model9.predict(input_validation))) 127 | submit_prob["price_cost_effective"] = list(model9.predict(input_validation)) 128 | del model9 129 | gc.collect() 130 | K.clear_session() 131 | 132 | model10 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 133 | model10.load_weights(model_dir + "model_pd_02.hdf5") 134 | submit["price_discount"] = list(map(getClassification, model10.predict(input_validation))) 135 | submit_prob["price_discount"] = list(model10.predict(input_validation)) 136 | del model10 137 | gc.collect() 138 | K.clear_session() 139 | 140 | model11 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 141 | model11.load_weights(model_dir + "model_ed_01.hdf5") 142 | submit["environment_decoration"] = list(map(getClassification, model11.predict(input_validation))) 143 | submit_prob["environment_decoration"] = list(model11.predict(input_validation)) 144 | del model11 145 | gc.collect() 146 | K.clear_session() 147 | 148 | model12 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 149 | model12.load_weights(model_dir + "model_en_02.hdf5") 150 | submit["environment_noise"] = list(map(getClassification, model12.predict(input_validation))) 151 | submit_prob["environment_noise"] = list(model12.predict(input_validation)) 152 | del model12 153 | gc.collect() 154 | K.clear_session() 155 | 156 | model13 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 157 | model13.load_weights(model_dir + "model_es_02.hdf5") 158 | submit["environment_space"] = list(map(getClassification, model13.predict(input_validation))) 159 | submit_prob["environment_space"] = list(model13.predict(input_validation)) 160 | del model13 161 | gc.collect() 162 | K.clear_session() 163 | 164 | model14 = TextClassifier().model(embeddings_matrix, maxlen, 
word_index, 4) 165 | model14.load_weights(model_dir + "model_ec_01.hdf5") 166 | submit["environment_cleaness"] = list(map(getClassification, model14.predict(input_validation))) 167 | submit_prob["environment_cleaness"] = list(model14.predict(input_validation)) 168 | del model14 169 | gc.collect() 170 | K.clear_session() 171 | 172 | model15 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 173 | model15.load_weights(model_dir + "model_dp_01.hdf5") 174 | submit["dish_portion"] = list(map(getClassification, model15.predict(input_validation))) 175 | submit_prob["dish_portion"] = list(model15.predict(input_validation)) 176 | del model15 177 | gc.collect() 178 | K.clear_session() 179 | 180 | model16 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 181 | model16.load_weights(model_dir + "model_dt_02.hdf5") 182 | submit["dish_taste"] = list(map(getClassification, model16.predict(input_validation))) 183 | submit_prob["dish_taste"] = list(model16.predict(input_validation)) 184 | del model16 185 | gc.collect() 186 | K.clear_session() 187 | 188 | model17 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 189 | model17.load_weights(model_dir + "model_dl_02.hdf5") 190 | submit["dish_look"] = list(map(getClassification, model17.predict(input_validation))) 191 | submit_prob["dish_look"] = list(model17.predict(input_validation)) 192 | del model17 193 | gc.collect() 194 | K.clear_session() 195 | 196 | model18 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 197 | model18.load_weights(model_dir + "model_dr_01.hdf5") 198 | submit["dish_recommendation"] = list(map(getClassification, model18.predict(input_validation))) 199 | submit_prob["dish_recommendation"] = list(model18.predict(input_validation)) 200 | del model18 201 | gc.collect() 202 | K.clear_session() 203 | 204 | model19 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 205 | model19.load_weights(model_dir + "model_ooe_01.hdf5") 206 | submit["others_overall_experience"] = list(map(getClassification, model19.predict(input_validation))) 207 | submit_prob["others_overall_experience"] = list(model19.predict(input_validation)) 208 | del model19 209 | gc.collect() 210 | K.clear_session() 211 | 212 | model20 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 213 | model20.load_weights(model_dir + "model_owta_02.hdf5") 214 | submit["others_willing_to_consume_again"] = list(map(getClassification, model20.predict(input_validation))) 215 | submit_prob["others_willing_to_consume_again"] = list(model20.predict(input_validation)) 216 | del model20 217 | gc.collect() 218 | K.clear_session() 219 | 220 | submit.to_csv("baseline_bigru_char.csv", index=None) 221 | submit_prob.to_csv("baseline_bigru_char_prob.csv", index=None) -------------------------------------------------------------------------------- /old/predict_rcnn_char.py: -------------------------------------------------------------------------------- 1 | from keras.backend.tensorflow_backend import set_session 2 | import tensorflow as tf 3 | config = tf.ConfigProto() 4 | config.gpu_options.allow_growth = True 5 | set_session(tf.Session(config=config)) 6 | import gc 7 | import pandas as pd 8 | import pickle 9 | import numpy as np 10 | np.random.seed(16) 11 | from tensorflow import set_random_seed 12 | set_random_seed(16) 13 | from keras.layers import * 14 | from keras.preprocessing import sequence 15 | from gensim.models.keyedvectors import KeyedVectors 16 | from old.classifier_rcnn import TextClassifier 17 | 
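# Added note: getClassification below maps the argmax of a model's 4-way output back to the
# AI Challenger sentiment scale used in the csv files: index 0 -> -2, 1 -> -1, 2 -> 0, 3 -> 1
# (in this dataset -2 means the aspect is not mentioned, -1 negative, 0 neutral, 1 positive).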
18 | 19 | def getClassification(arr): 20 | arr = list(arr) 21 | if arr.index(max(arr)) == 0: 22 | return -2 23 | elif arr.index(max(arr)) == 1: 24 | return -1 25 | elif arr.index(max(arr)) == 2: 26 | return 0 27 | else: 28 | return 1 29 | 30 | 31 | if __name__ == "__main__": 32 | with open('tokenizer_char.pickle', 'rb') as handle: 33 | maxlen = 1000 34 | model_dir = "model_rcnn_char/" 35 | tokenizer = pickle.load(handle) 36 | word_index = tokenizer.word_index 37 | validation = pd.read_csv("preprocess/test_char.csv") 38 | validation["content"] = validation.apply(lambda x: eval(x[1]), axis=1) 39 | X_test = validation["content"].values 40 | list_tokenized_validation = tokenizer.texts_to_sequences(X_test) 41 | input_validation = sequence.pad_sequences(list_tokenized_validation, maxlen=maxlen) 42 | w2_model = KeyedVectors.load_word2vec_format("word2vec/chars.vector", binary=True, encoding='utf8', 43 | unicode_errors='ignore') 44 | embeddings_index = {} 45 | embeddings_matrix = np.zeros((len(word_index) + 1, w2_model.vector_size)) 46 | word2idx = {"_PAD": 0} 47 | vocab_list = [(k, w2_model.wv[k]) for k, v in w2_model.wv.vocab.items()] 48 | for word, i in word_index.items(): 49 | if word in w2_model: 50 | embedding_vector = w2_model[word] 51 | else: 52 | embedding_vector = None 53 | if embedding_vector is not None: 54 | embeddings_matrix[i] = embedding_vector 55 | 56 | submit = pd.read_csv("ai_challenger_sentiment_analysis_testa_20180816/sentiment_analysis_testa.csv") 57 | submit_prob = pd.read_csv("ai_challenger_sentiment_analysis_testa_20180816/sentiment_analysis_testa.csv") 58 | 59 | model1 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 60 | model1.load_weights(model_dir + "model_ltc_02.hdf5") 61 | submit["location_traffic_convenience"] = list(map(getClassification, model1.predict(input_validation))) 62 | submit_prob["location_traffic_convenience"] = list(model1.predict(input_validation)) 63 | del model1 64 | gc.collect() 65 | K.clear_session() 66 | 67 | model2 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 68 | model2.load_weights(model_dir + "model_ldfbd_02.hdf5") 69 | submit["location_distance_from_business_district"] = list( 70 | map(getClassification, model2.predict(input_validation))) 71 | submit_prob["location_distance_from_business_district"] = list(model2.predict(input_validation)) 72 | del model2 73 | gc.collect() 74 | K.clear_session() 75 | 76 | model3 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 77 | model3.load_weights(model_dir + "model_letf_02.hdf5") 78 | submit["location_easy_to_find"] = list(map(getClassification, model3.predict(input_validation))) 79 | submit_prob["location_easy_to_find"] = list(model3.predict(input_validation)) 80 | del model3 81 | gc.collect() 82 | K.clear_session() 83 | 84 | model4 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 85 | model4.load_weights(model_dir + "model_swt_02.hdf5") 86 | submit["service_wait_time"] = list(map(getClassification, model4.predict(input_validation))) 87 | submit_prob["service_wait_time"] = list(model4.predict(input_validation)) 88 | del model4 89 | gc.collect() 90 | K.clear_session() 91 | 92 | model5 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 93 | model5.load_weights(model_dir + "model_swa_02.hdf5") 94 | submit["service_waiters_attitude"] = list(map(getClassification, model5.predict(input_validation))) 95 | submit_prob["service_waiters_attitude"] = list(model5.predict(input_validation)) 96 | del model5 97 | gc.collect() 
98 | K.clear_session() 99 | 100 | model6 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 101 | model6.load_weights(model_dir + "model_spc_01.hdf5") 102 | submit["service_parking_convenience"] = list(map(getClassification, model6.predict(input_validation))) 103 | submit_prob["service_parking_convenience"] = list(model6.predict(input_validation)) 104 | del model6 105 | gc.collect() 106 | K.clear_session() 107 | 108 | model7 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 109 | model7.load_weights(model_dir + "model_ssp_02.hdf5") 110 | submit["service_serving_speed"] = list(map(getClassification, model7.predict(input_validation))) 111 | submit_prob["service_serving_speed"] = list(model7.predict(input_validation)) 112 | del model7 113 | gc.collect() 114 | K.clear_session() 115 | 116 | model8 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 117 | model8.load_weights(model_dir + "model_pl_02.hdf5") 118 | submit["price_level"] = list(map(getClassification, model8.predict(input_validation))) 119 | submit_prob["price_level"] = list(model8.predict(input_validation)) 120 | del model8 121 | gc.collect() 122 | K.clear_session() 123 | 124 | model9 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 125 | model9.load_weights(model_dir + "model_pce_02.hdf5") 126 | submit["price_cost_effective"] = list(map(getClassification, model9.predict(input_validation))) 127 | submit_prob["price_cost_effective"] = list(model9.predict(input_validation)) 128 | del model9 129 | gc.collect() 130 | K.clear_session() 131 | 132 | model10 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 133 | model10.load_weights(model_dir + "model_pd_02.hdf5") 134 | submit["price_discount"] = list(map(getClassification, model10.predict(input_validation))) 135 | submit_prob["price_discount"] = list(model10.predict(input_validation)) 136 | del model10 137 | gc.collect() 138 | K.clear_session() 139 | 140 | model11 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 141 | model11.load_weights(model_dir + "model_ed_02.hdf5") 142 | submit["environment_decoration"] = list(map(getClassification, model11.predict(input_validation))) 143 | submit_prob["environment_decoration"] = list(model11.predict(input_validation)) 144 | del model11 145 | gc.collect() 146 | K.clear_session() 147 | 148 | model12 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 149 | model12.load_weights(model_dir + "model_en_02.hdf5") 150 | submit["environment_noise"] = list(map(getClassification, model12.predict(input_validation))) 151 | submit_prob["environment_noise"] = list(model12.predict(input_validation)) 152 | del model12 153 | gc.collect() 154 | K.clear_session() 155 | 156 | model13 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 157 | model13.load_weights(model_dir + "model_es_01.hdf5") 158 | submit["environment_space"] = list(map(getClassification, model13.predict(input_validation))) 159 | submit_prob["environment_space"] = list(model13.predict(input_validation)) 160 | del model13 161 | gc.collect() 162 | K.clear_session() 163 | 164 | model14 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 165 | model14.load_weights(model_dir + "model_ec_02.hdf5") 166 | submit["environment_cleaness"] = list(map(getClassification, model14.predict(input_validation))) 167 | submit_prob["environment_cleaness"] = list(model14.predict(input_validation)) 168 | del model14 169 | gc.collect() 170 | K.clear_session() 171 | 172 | model15 = 
TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 173 | model15.load_weights(model_dir + "model_dp_02.hdf5") 174 | submit["dish_portion"] = list(map(getClassification, model15.predict(input_validation))) 175 | submit_prob["dish_portion"] = list(model15.predict(input_validation)) 176 | del model15 177 | gc.collect() 178 | K.clear_session() 179 | 180 | model16 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 181 | model16.load_weights(model_dir + "model_dt_02.hdf5") 182 | submit["dish_taste"] = list(map(getClassification, model16.predict(input_validation))) 183 | submit_prob["dish_taste"] = list(model16.predict(input_validation)) 184 | del model16 185 | gc.collect() 186 | K.clear_session() 187 | 188 | model17 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 189 | model17.load_weights(model_dir + "model_dl_02.hdf5") 190 | submit["dish_look"] = list(map(getClassification, model17.predict(input_validation))) 191 | submit_prob["dish_look"] = list(model17.predict(input_validation)) 192 | del model17 193 | gc.collect() 194 | K.clear_session() 195 | 196 | model18 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 197 | model18.load_weights(model_dir + "model_dr_02.hdf5") 198 | submit["dish_recommendation"] = list(map(getClassification, model18.predict(input_validation))) 199 | submit_prob["dish_recommendation"] = list(model18.predict(input_validation)) 200 | del model18 201 | gc.collect() 202 | K.clear_session() 203 | 204 | model19 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 205 | model19.load_weights(model_dir + "model_ooe_02.hdf5") 206 | submit["others_overall_experience"] = list(map(getClassification, model19.predict(input_validation))) 207 | submit_prob["others_overall_experience"] = list(model19.predict(input_validation)) 208 | del model19 209 | gc.collect() 210 | K.clear_session() 211 | 212 | model20 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 213 | model20.load_weights(model_dir + "model_owta_02.hdf5") 214 | submit["others_willing_to_consume_again"] = list(map(getClassification, model20.predict(input_validation))) 215 | submit_prob["others_willing_to_consume_again"] = list(model20.predict(input_validation)) 216 | del model20 217 | gc.collect() 218 | K.clear_session() 219 | 220 | submit.to_csv("baseline_rcnn_char.csv", index=None) 221 | submit_prob.to_csv("baseline_rcnn_char_prob.csv", index=None) -------------------------------------------------------------------------------- /old/stopwords.txt: -------------------------------------------------------------------------------- 1 | … 2 | ~ 3 | ) 4 | ( 5 | ” 6 | “ 7 | " 8 | 、 9 | : 10 | \n 11 | ~ 12 | ? 13 | , 14 | * 15 | , 16 | \ufeff 17 | s 18 | 。 19 | . 20 | - 21 | _ 22 | ' 23 | = 24 | ? 25 | · 26 | @ 27 | ! 28 | ^ 29 | & 30 | ( 31 | ) 32 | | 33 | \ 34 | + 35 | - 36 | $ 37 | 【 38 | 】 39 | ^ 40 | _ 41 | ` 42 | # 43 | $ 44 | % 45 | & 46 | > 47 | < 48 | [ 49 | ] 50 | 》 51 | 《 52 | / 53 | $ 54 | 0 55 | 1 56 | 2 57 | 3 58 | 4 59 | 5 60 | 6 61 | 7 62 | 8 63 | 9 64 | ? 65 | _ 66 | “ 67 | ” 68 | 、 69 | 。 70 | 《 71 | 》 72 | ——— 73 | 》), 74 | )÷(1- 75 | ”, 76 | )、 77 | =( 78 | : 79 | → 80 | ℃ 81 | & 82 | * 83 | 一一 84 | ~~~~ 85 | ’ 86 | . 87 | 『 88 | .一 89 | ./ 90 | -- 91 | 』 92 | =″ 93 | 【 94 | [*] 95 | }> 96 | [⑤]] 97 | [①D] 98 | c] 99 | ng昉 100 | * 101 | // 102 | [ 103 | ] 104 | [②e] 105 | [②g] 106 | ={ 107 | } 108 | ,也 109 | ‘ 110 | A 111 | [①⑥] 112 | [②B] 113 | [①a] 114 | [④a] 115 | [①③] 116 | [③h] 117 | ③] 118 | 1. 
119 | -- 120 | [②b] 121 | ’‘ 122 | ××× 123 | [①⑧] 124 | 0:2 125 | =[ 126 | [⑤b] 127 | [②c] 128 | [④b] 129 | [②③] 130 | [③a] 131 | [④c] 132 | [①⑤] 133 | [①⑦] 134 | [①g] 135 | ∈[ 136 | [①⑨] 137 | [①④] 138 | [①c] 139 | [②f] 140 | [②⑧] 141 | [②①] 142 | [①C] 143 | [③c] 144 | [③g] 145 | [②⑤] 146 | [②②] 147 | 一. 148 | [①h] 149 | .数 150 | [] 151 | [①B] 152 | 数/ 153 | [①i] 154 | [③e] 155 | [①①] 156 | [④d] 157 | [④e] 158 | [③b] 159 | [⑤a] 160 | [①A] 161 | [②⑧] 162 | [②⑦] 163 | [①d] 164 | [②j] 165 | 〕〔 166 | ][ 167 | :// 168 | ′∈ 169 | [②④ 170 | [⑤e] 171 | 12% 172 | b] 173 | ... 174 | ................... 175 | …………………………………………………③ 176 | ZXFITL 177 | [③F] 178 | 」 179 | [①o] 180 | ]∧′=[ 181 | ∪φ∈ 182 | ′| 183 | {- 184 | ②c 185 | } 186 | [③①] 187 | R.L. 188 | [①E] 189 | Ψ 190 | -[*]- 191 | ↑ 192 | .日 193 | [②d] 194 | [② 195 | [②⑦] 196 | [②②] 197 | [③e] 198 | [①i] 199 | [①B] 200 | [①h] 201 | [①d] 202 | [①g] 203 | [①②] 204 | [②a] 205 | f] 206 | [⑩] 207 | a] 208 | [①e] 209 | [②h] 210 | [②⑥] 211 | [③d] 212 | [②⑩] 213 | e] 214 | 〉 215 | 】 216 | 元/吨 217 | [②⑩] 218 | 2.3% 219 | 5:0 220 | [①] 221 | :: 222 | [②] 223 | [③] 224 | [④] 225 | [⑤] 226 | [⑥] 227 | [⑦] 228 | [⑧] 229 | [⑨] 230 | …… 231 | —— 232 | ? 233 | 、 234 | 。 235 | “ 236 | ” 237 | 《 238 | 》 239 | ! 240 | , 241 | : 242 | ; 243 | ? 244 | . 245 | , 246 | . 247 | ' 248 | ? 249 | · 250 | ——— 251 | ── 252 | ? 253 | — 254 | < 255 | > 256 | ( 257 | ) 258 | 〔 259 | 〕 260 | [ 261 | ] 262 | ( 263 | ) 264 | - 265 | + 266 | ~ 267 | × 268 | / 269 | / 270 | ① 271 | ② 272 | ③ 273 | ④ 274 | ⑤ 275 | ⑥ 276 | ⑦ 277 | ⑧ 278 | ⑨ 279 | ⑩ 280 | Ⅲ 281 | В 282 | " 283 | ; 284 | # 285 | @ 286 | γ 287 | μ 288 | φ 289 | φ. 290 | × 291 | Δ 292 | ■ 293 | ▲ 294 | sub 295 | exp 296 | sup 297 | sub 298 | Lex 299 | # 300 | % 301 | & 302 | ' 303 | + 304 | +ξ 305 | ++ 306 | - 307 | -β 308 | < 309 | <± 310 | <Δ 311 | <λ 312 | <φ 313 | << 314 | = 315 | = 316 | =☆ 317 | =- 318 | > 319 | >λ 320 | _ 321 | ~± 322 | ~+ 323 | [⑤f] 324 | [⑤d] 325 | [②i] 326 | ≈ 327 | [②G] 328 | [①f] 329 | LI 330 | ㈧ 331 | [- 332 | ...... 
333 | 〉 334 | [③⑩] 335 | 第二 336 | 一番 337 | 一直 338 | 一个 339 | 一些 340 | 许多 341 | 种 342 | 有的是 343 | 也就是说 344 | 末##末 345 | 啊 346 | 阿 347 | 哎 348 | 哎呀 349 | 哎哟 350 | 唉 351 | 俺 352 | 俺们 353 | 按 354 | 按照 355 | 吧 356 | 吧哒 357 | 把 358 | 罢了 359 | 被 360 | 本 361 | 本着 362 | 比 363 | 比方 364 | 比如 365 | 鄙人 366 | 彼 367 | 彼此 368 | 边 369 | 别 370 | 别的 371 | 别说 372 | 并 373 | 并且 374 | 不比 375 | 不成 376 | 不单 377 | 不但 378 | 不独 379 | 不管 380 | 不光 381 | 不过 382 | 不仅 383 | 不拘 384 | 不论 385 | 不怕 386 | 不然 387 | 不如 388 | 不特 389 | 不惟 390 | 不问 391 | 不只 392 | 朝 393 | 朝着 394 | 趁 395 | 趁着 396 | 乘 397 | 冲 398 | 除 399 | 除此之外 400 | 除非 401 | 除了 402 | 此 403 | 此间 404 | 此外 405 | 从 406 | 从而 407 | 打 408 | 待 409 | 但 410 | 但是 411 | 当 412 | 当着 413 | 到 414 | 得 415 | 的 416 | 的话 417 | 等 418 | 等等 419 | 地 420 | 第 421 | 叮咚 422 | 对 423 | 对于 424 | 多 425 | 多少 426 | 而 427 | 而况 428 | 而且 429 | 而是 430 | 而外 431 | 而言 432 | 而已 433 | 尔后 434 | 反过来 435 | 反过来说 436 | 反之 437 | 非但 438 | 非徒 439 | 否则 440 | 嘎 441 | 嘎登 442 | 该 443 | 赶 444 | 个 445 | 各 446 | 各个 447 | 各位 448 | 各种 449 | 各自 450 | 给 451 | 根据 452 | 跟 453 | 故 454 | 故此 455 | 固然 456 | 关于 457 | 管 458 | 归 459 | 果然 460 | 果真 461 | 过 462 | 哈 463 | 哈哈 464 | 呵 465 | 和 466 | 何 467 | 何处 468 | 何况 469 | 何时 470 | 嘿 471 | 哼 472 | 哼唷 473 | 呼哧 474 | 乎 475 | 哗 476 | 还是 477 | 还有 478 | 换句话说 479 | 换言之 480 | 或 481 | 或是 482 | 或者 483 | 极了 484 | 及 485 | 及其 486 | 及至 487 | 即 488 | 即便 489 | 即或 490 | 即令 491 | 即若 492 | 即使 493 | 几 494 | 几时 495 | 己 496 | 既 497 | 既然 498 | 既是 499 | 继而 500 | 加之 501 | 假如 502 | 假若 503 | 假使 504 | 鉴于 505 | 将 506 | 较 507 | 较之 508 | 叫 509 | 接着 510 | 结果 511 | 借 512 | 紧接着 513 | 进而 514 | 尽 515 | 尽管 516 | 经 517 | 经过 518 | 就 519 | 就是 520 | 就是说 521 | 据 522 | 具体地说 523 | 具体说来 524 | 开始 525 | 开外 526 | 靠 527 | 咳 528 | 可 529 | 可见 530 | 可是 531 | 可以 532 | 况且 533 | 啦 534 | 来 535 | 来着 536 | 离 537 | 例如 538 | 哩 539 | 连 540 | 连同 541 | 两者 542 | 了 543 | 临 544 | 另 545 | 另外 546 | 另一方面 547 | 论 548 | 嘛 549 | 吗 550 | 慢说 551 | 漫说 552 | 冒 553 | 么 554 | 每 555 | 每当 556 | 们 557 | 莫若 558 | 某 559 | 某个 560 | 某些 561 | 拿 562 | 哪 563 | 哪边 564 | 哪儿 565 | 哪个 566 | 哪里 567 | 哪年 568 | 哪怕 569 | 哪天 570 | 哪些 571 | 哪样 572 | 那 573 | 那边 574 | 那儿 575 | 那个 576 | 那会儿 577 | 那里 578 | 那么 579 | 那么些 580 | 那么样 581 | 那时 582 | 那些 583 | 那样 584 | 乃 585 | 乃至 586 | 呢 587 | 能 588 | 你 589 | 你们 590 | 您 591 | 宁 592 | 宁可 593 | 宁肯 594 | 宁愿 595 | 哦 596 | 呕 597 | 啪达 598 | 旁人 599 | 呸 600 | 凭 601 | 凭借 602 | 其 603 | 其次 604 | 其二 605 | 其他 606 | 其它 607 | 其一 608 | 其余 609 | 其中 610 | 起 611 | 起见 612 | 起见 613 | 岂但 614 | 恰恰相反 615 | 前后 616 | 前者 617 | 且 618 | 然而 619 | 然后 620 | 然则 621 | 让 622 | 人家 623 | 任 624 | 任何 625 | 任凭 626 | 如 627 | 如此 628 | 如果 629 | 如何 630 | 如其 631 | 如若 632 | 如上所述 633 | 若 634 | 若非 635 | 若是 636 | 啥 637 | 上下 638 | 尚且 639 | 设若 640 | 设使 641 | 甚而 642 | 甚么 643 | 甚至 644 | 省得 645 | 时候 646 | 什么 647 | 什么样 648 | 使得 649 | 是 650 | 是的 651 | 首先 652 | 谁 653 | 谁知 654 | 顺 655 | 顺着 656 | 似的 657 | 虽 658 | 虽然 659 | 虽说 660 | 虽则 661 | 随 662 | 随着 663 | 所 664 | 所以 665 | 他 666 | 他们 667 | 他人 668 | 它 669 | 它们 670 | 她 671 | 她们 672 | 倘 673 | 倘或 674 | 倘然 675 | 倘若 676 | 倘使 677 | 腾 678 | 替 679 | 通过 680 | 同 681 | 同时 682 | 哇 683 | 万一 684 | 往 685 | 望 686 | 为 687 | 为何 688 | 为了 689 | 为什么 690 | 为着 691 | 喂 692 | 嗡嗡 693 | 我 694 | 我们 695 | 呜 696 | 呜呼 697 | 乌乎 698 | 无论 699 | 无宁 700 | 毋宁 701 | 嘻 702 | 吓 703 | 相对而言 704 | 像 705 | 向 706 | 向着 707 | 嘘 708 | 呀 709 | 焉 710 | 沿 711 | 沿着 712 | 要 713 | 要不 714 | 要不然 715 | 要不是 716 | 要么 717 | 要是 718 | 也 719 | 也罢 720 | 也好 721 | 一 722 | 一般 723 | 一旦 724 | 一方面 725 | 一来 726 | 一切 727 | 一样 728 | 一则 729 | 依 730 | 依照 731 | 矣 732 | 以 733 | 以便 734 | 以及 735 | 以免 736 | 以至 737 | 以至于 738 | 以致 739 | 
抑或 740 | 因 741 | 因此 742 | 因而 743 | 因为 744 | 哟 745 | 用 746 | 由 747 | 由此可见 748 | 由于 749 | 有 750 | 有的 751 | 有关 752 | 有些 753 | 又 754 | 于 755 | 于是 756 | 于是乎 757 | 与 758 | 与此同时 759 | 与否 760 | 与其 761 | 越是 762 | 云云 763 | 哉 764 | 再说 765 | 再者 766 | 在 767 | 在下 768 | 咱 769 | 咱们 770 | 则 771 | 怎 772 | 怎么 773 | 怎么办 774 | 怎么样 775 | 怎样 776 | 咋 777 | 照 778 | 照着 779 | 者 780 | 这 781 | 这边 782 | 这儿 783 | 这个 784 | 这会儿 785 | 这就是说 786 | 这里 787 | 这么 788 | 这么点儿 789 | 这么些 790 | 这么样 791 | 这时 792 | 这些 793 | 这样 794 | 正如 795 | 吱 796 | 之 797 | 之类 798 | 之所以 799 | 之一 800 | 只是 801 | 只限 802 | 只要 803 | 只有 804 | 至 805 | 至于 806 | 诸位 807 | 着 808 | 着呢 809 | 自 810 | 自从 811 | 自个儿 812 | 自各儿 813 | 自己 814 | 自家 815 | 自身 816 | 综上所述 817 | 总的来看 818 | 总的来说 819 | 总的说来 820 | 总而言之 821 | 总之 822 | 纵 823 | 纵令 824 | 纵然 825 | 纵使 826 | 遵照 827 | 作为 828 | 兮 829 | 呃 830 | 呗 831 | 咚 832 | 咦 833 | 喏 834 | 啐 835 | 喔唷 836 | 嗬 837 | 嗯 838 | 嗳 839 | 840 | 841 | 842 | -------------------------------------------------------------------------------- /old/temp_covert.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | import json 3 | import random 4 | 5 | dict_unique={} 6 | 7 | dict_type_ignore_count={'train':0,'valid':0,'test':0} 8 | def transform_data_to_fasttext_format(file_path,target_path,data_type): 9 | file_object=open(file_path,'r') 10 | target_object=open(target_path,'w') 11 | lines=file_object.readlines() 12 | print("length of lines:",len(lines)) 13 | random.shuffle(lines) 14 | for i,line in enumerate(lines): 15 | json_string=json.loads(line) 16 | accusation_list=json_string['meta']['accusation'] 17 | fact=json_string['fact'].strip('\n\r').replace("\n","").replace("\r","") 18 | unique_value=dict_unique.get(fact,None) 19 | if unique_value is None: # if not exist, put to unique dict, then process 20 | dict_unique[fact] = fact 21 | else: # otherwise, ignore 22 | print("going to ignore.",data_type,fact) 23 | dict_type_ignore_count[data_type]=dict_type_ignore_count[data_type]+1 24 | continue 25 | length_accusation=len(accusation_list) 26 | #if length_accusation>1: 27 | #print("accusation_list:",str(accusation_list)) 28 | #print("json_string:",json_string) 29 | accusation_strings='' 30 | for i,accusation in enumerate(accusation_list): 31 | accusation_strings+=' __label__'+accusation 32 | target_object.write(fact+accusation_strings+"\n") 33 | target_object.close() 34 | file_object.close() 35 | print("dict_type_ignore_count:",dict_type_ignore_count[data_type]) 36 | 37 | file_path='./data/cail2018/data_valid_checked.json' 38 | target_path='./data/data_valid2.txt' 39 | transform_data_to_fasttext_format(file_path,target_path,'valid') 40 | 41 | file_path='./data/cail2018/data_test.json' 42 | target_path='./data/data_test2.txt' 43 | transform_data_to_fasttext_format(file_path,target_path,'test') 44 | 45 | file_path='./data/cail2018/cail2018_big_downsmapled.json' 46 | target_path='./data/data_train2.txt' 47 | transform_data_to_fasttext_format(file_path,target_path,'train') 48 | 49 | -------------------------------------------------------------------------------- /old/train_transform.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #process--->1.load data(X:list of lint,y:int). 2.create session. 3.feed data. 4.training (5.validation) ,(6.prediction) 3 | """ 4 | 5 | train the model(transformer) with data enhanced by pre-training of two tasks. 6 | default hyperparameter is d_model=512,h=8,d_k=d_v=64(big). 
if you have a small data set or want to train a 7 | small model, use d_model=128,h=8,d_k=d_v=16(small), or d_model=64,h=8,d_k=d_v=8(tiny). 8 | """ 9 | #import sys 10 | #reload(sys) 11 | #sys.setdefaultencoding('utf8') 12 | import tensorflow as tf 13 | import numpy as np 14 | from model.transfomer_model import TransformerModel 15 | from data_util_hdf5 import create_or_load_vocabulary,load_data_multilabel,assign_pretrained_word_embedding,set_config 16 | import os 17 | from evaluation_matrix import * 18 | from model.config_transformer import Config 19 | #configuration 20 | FLAGS=tf.app.flags.FLAGS 21 | 22 | # you can change as you like 23 | tf.app.flags.DEFINE_boolean("test_mode",False,"whether it is test mode. if it is test mode, only small percentage of data will be used") 24 | tf.app.flags.DEFINE_string("data_path","./data/","path of traning data.") 25 | tf.app.flags.DEFINE_string("training_data_file","./data/bert_train.txt","path of traning data.") #./data/cail2018_bi.json 26 | tf.app.flags.DEFINE_string("valid_data_file","./data/bert_test.txt","path of validation data.") 27 | tf.app.flags.DEFINE_string("test_data_file","./data/bert_test.txt","path of validation data.") 28 | tf.app.flags.DEFINE_integer("d_model", 64, "dimension of model") # 512-->128-->64 29 | tf.app.flags.DEFINE_integer("num_layer", 6, "number of layer") 30 | tf.app.flags.DEFINE_integer("num_header", 8, "number of header") 31 | tf.app.flags.DEFINE_integer("d_k", 8, "dimension of k") # 64-->16-->8 32 | tf.app.flags.DEFINE_integer("d_v", 8, "dimension of v") # 64-->16-->8 33 | 34 | # below hyperparameter you can use default one, seldom change 35 | tf.app.flags.DEFINE_string("ckpt_dir","./checkpoint_transformer/","checkpoint location for the model") #save to here, so make it easy to upload for test 36 | tf.app.flags.DEFINE_string("tokenize_style","word","checkpoint location for the model") 37 | tf.app.flags.DEFINE_integer("vocab_size",50002,"maximum vocab size.") 38 | tf.app.flags.DEFINE_float("learning_rate",0.0001,"learning rate") #0.001 39 | tf.app.flags.DEFINE_integer("batch_size", 64, "Batch size for training/evaluating.") # 32-->128 40 | tf.app.flags.DEFINE_integer("decay_steps", 1000, "how many steps before decay learning rate.") # 32-->128 41 | tf.app.flags.DEFINE_float("decay_rate", 1.0, "Rate of decay for learning rate.") #0.65 42 | tf.app.flags.DEFINE_float("dropout_keep_prob", 0.9, "percentage to keep when using dropout.") #0.65 43 | tf.app.flags.DEFINE_integer("sequence_length",200,"max sentence length")#400 44 | tf.app.flags.DEFINE_boolean("is_training",True,"is training.true:tranining,false:testing/inference") 45 | tf.app.flags.DEFINE_integer("num_epochs",30,"number of epochs to run.") 46 | tf.app.flags.DEFINE_integer("process_num",3,"number of cpu used") 47 | tf.app.flags.DEFINE_integer("validate_every", 1, "Validate every validate_every epochs.") # 48 | tf.app.flags.DEFINE_boolean("use_pretrained_embedding",False,"whether to use embedding or not.")# 49 | tf.app.flags.DEFINE_string("word2vec_model_path","./data/Tencent_AILab_ChineseEmbedding_100w.txt","word2vec's vocabulary and vectors") # data/sgns.target.word-word.dynwin5.thr10.neg5.dim300.iter5--->data/news_12g_baidubaike_20g_novel_90g_embedding_64.bin--->sgns.merge.char 50 | tf.app.flags.DEFINE_integer("sequence_length_lm",10,"max sentence length of language model") 51 | tf.app.flags.DEFINE_boolean("is_fine_tuning",False,"is_finetuning.ture:this is fine-tuning stage") 52 | 53 | def main(_): 54 | vocab_word2index, label2index= 
create_or_load_vocabulary(FLAGS.data_path,FLAGS.training_data_file,FLAGS.vocab_size,test_mode=FLAGS.test_mode,tokenize_style=FLAGS.tokenize_style,model_name='transfomer') 55 | vocab_size = len(vocab_word2index);print("cnn_model.vocab_size:",vocab_size);num_classes=len(label2index);print("num_classes:",num_classes) 56 | train,valid, test= load_data_multilabel(FLAGS.data_path,FLAGS.training_data_file,FLAGS.valid_data_file,FLAGS.test_data_file,vocab_word2index,label2index,FLAGS.sequence_length, 57 | process_num=FLAGS.process_num,test_mode=FLAGS.test_mode,tokenize_style=FLAGS.tokenize_style,model_name='transfomer') 58 | train_X, train_Y= train 59 | valid_X, valid_Y= valid 60 | test_X,test_Y = test 61 | print("Test_mode:",FLAGS.test_mode,";length of training data:",train_X.shape,";valid data:",valid_X.shape,";test data:",test_X.shape,";train_Y:",train_Y.shape) 62 | # 1.create session. 63 | gpu_config=tf.ConfigProto() 64 | gpu_config.gpu_options.allow_growth=True 65 | with tf.Session(config=gpu_config) as sess: 66 | #Instantiate Model 67 | config=set_config(FLAGS,num_classes,vocab_size) 68 | model=TransformerModel(config) 69 | #Initialize Save 70 | saver=tf.train.Saver() 71 | if os.path.exists(FLAGS.ckpt_dir+"checkpoint"): 72 | print("Restoring Variables from Checkpoint.") 73 | saver.restore(sess,tf.train.latest_checkpoint(FLAGS.ckpt_dir)) 74 | #for i in range(2): #decay learning rate if necessary. 75 | # print(i,"Going to decay learning rate by half.") 76 | # sess.run(model.learning_rate_decay_half_op) 77 | else: 78 | print('Initializing Variables') 79 | sess.run(tf.global_variables_initializer()) 80 | if FLAGS.use_pretrained_embedding: 81 | vocabulary_index2word={index:word for word,index in vocab_word2index.items()} 82 | assign_pretrained_word_embedding(sess, vocabulary_index2word, vocab_size,FLAGS.word2vec_model_path,model.embedding,config.d_model) # assign pretrained word embeddings 83 | curr_epoch=sess.run(model.epoch_step) 84 | # 2.feed data & training 85 | number_of_training_data=len(train_X) 86 | batch_size=FLAGS.batch_size 87 | iteration=0 88 | score_best=-100 89 | f1_score=0 90 | for epoch in range(curr_epoch,FLAGS.num_epochs): 91 | loss_total, counter = 0.0, 0 92 | for start, end in zip(range(0, number_of_training_data, batch_size),range(batch_size, number_of_training_data, batch_size)): 93 | iteration=iteration+1 94 | if epoch==0 and counter==0: 95 | print("trainX[start:end]:",train_X[start:end],"train_X.shape:",train_X.shape) 96 | feed_dict = {model.input_x: train_X[start:end],model.input_y:train_Y[start:end],model.dropout_keep_prob: FLAGS.dropout_keep_prob} 97 | current_loss,lr,l2_loss,_=sess.run([model.loss_val,model.learning_rate,model.l2_loss,model.train_op],feed_dict) 98 | loss_total,counter=loss_total+current_loss,counter+1 99 | if counter %30==0: 100 | print("Learning rate:%.5f\tLoss:%.3f\tCurrent_loss:%.3f\tL2_loss%.3f\t"%(lr,float(loss_total)/float(counter),current_loss,l2_loss)) 101 | if start!=0 and start%(3000*FLAGS.batch_size)==0: 102 | loss_valid, f1_macro_valid, f1_micro_valid= do_eval(sess, model, valid,num_classes,label2index) 103 | f1_score_valid=((f1_macro_valid+f1_micro_valid)/2.0)*100.0 104 | print("Valid.Epoch %d ValidLoss:%.3f\tF1_score_valid:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t" % (epoch, loss_valid, f1_score_valid, f1_macro_valid, f1_micro_valid)) 105 | 106 | # save model to checkpoint 107 | if f1_score_valid>score_best: 108 | save_path = FLAGS.ckpt_dir + "model.ckpt" 109 | print("going to save check point.") 110 | saver.save(sess, save_path, 
global_step=epoch) 111 | score_best=f1_score_valid 112 | #epoch increment 113 | print("going to increment epoch counter....") 114 | sess.run(model.epoch_increment) 115 | 116 | # 4.validation 117 | print(epoch,FLAGS.validate_every,(epoch % FLAGS.validate_every==0)) 118 | if epoch % FLAGS.validate_every==0: 119 | loss_valid,f1_macro_valid2,f1_micro_valid2=do_eval(sess,model,valid,num_classes,label2index) 120 | f1_score_valid2 = ((f1_macro_valid2 + f1_micro_valid2) / 2.0) #* 100.0 121 | print("Valid.Epoch %d ValidLoss:%.3f\tF1 score:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t"% (epoch,loss_valid,f1_score_valid2,f1_macro_valid2,f1_micro_valid2)) 122 | #save model to checkpoint 123 | if f1_score_valid2 > score_best: 124 | save_path=FLAGS.ckpt_dir+"model.ckpt" 125 | print("going to save check point.") 126 | saver.save(sess,save_path,global_step=epoch) 127 | score_best = f1_score_valid2 128 | if (epoch == 2 or epoch == 4 or epoch == 6 or epoch == 9 or epoch == 13): 129 | for i in range(1): 130 | print(i, "Going to decay learning rate by half.") 131 | sess.run(model.learning_rate_decay_half_op) 132 | 133 | # 5.最后在测试集上做测试,并报告测试准确率 Testto 0.0 134 | loss_test, f1_macro_test, f1_micro_test=do_eval(sess, model, test,num_classes, label2index) 135 | f1_score_test=((f1_macro_test + f1_micro_test) / 2.0) #* 100.0 136 | print("Test.Epoch %d TestLoss:%.3f\tF1_score:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t" % (epoch, loss_test, f1_score_test,f1_macro_test, f1_micro_test)) 137 | print("training completed...") 138 | 139 | #sess,model,valid,iteration,num_classes,label2index 140 | def do_eval(sess,model,valid,num_classes,label2index): 141 | """ 142 | do evaluation using validation set, and report loss, and f1 score. 143 | :param sess: 144 | :param model: 145 | :param valid: 146 | :param num_classes: 147 | :param label2index: 148 | :return: 149 | """ 150 | number_examples=valid[0].shape[0] 151 | valid_x,valid_y=valid 152 | print("number_examples:",number_examples) 153 | eval_loss,eval_counter=0.0,0 154 | batch_size=FLAGS.batch_size 155 | label_dict=init_label_dict(num_classes) 156 | eval_macro_f1, eval_micro_f1 = 0.0,0.0 157 | for start,end in zip(range(0,number_examples,batch_size),range(batch_size,number_examples,batch_size)): 158 | feed_dict = {model.input_x: valid_x[start:end],model.input_y:valid_y[start:end],model.dropout_keep_prob: 1.0} 159 | curr_eval_loss, logits= sess.run([model.loss_val,model.logits],feed_dict) # logits:[batch_size,label_size] 160 | #compute confuse matrix 161 | label_dict=compute_confuse_matrix_batch(valid_y[start:end],logits,label_dict,name='bright') 162 | eval_loss=eval_loss+curr_eval_loss 163 | eval_counter=eval_counter+1 164 | #compute f1_micro & f1_macro 165 | f1_micro,f1_macro=compute_micro_macro(label_dict) #label_dict is a dict, key is: an label,value is: (TP,FP,FN). 
where TP is number of True Positive 166 | compute_f1_score_write_for_debug(label_dict,label2index) 167 | return eval_loss/float(eval_counter+small_value),f1_macro,f1_micro 168 | 169 | if __name__ == "__main__": 170 | tf.app.run() -------------------------------------------------------------------------------- /old/validation_bigru_char.py: -------------------------------------------------------------------------------- 1 | from keras.backend.tensorflow_backend import set_session 2 | import tensorflow as tf 3 | config = tf.ConfigProto() 4 | config.gpu_options.allow_growth = True 5 | set_session(tf.Session(config=config)) 6 | import gc 7 | import pandas as pd 8 | import pickle 9 | import numpy as np 10 | np.random.seed(16) 11 | from tensorflow import set_random_seed 12 | set_random_seed(16) 13 | from keras.layers import * 14 | from keras.preprocessing import sequence 15 | from gensim.models.keyedvectors import KeyedVectors 16 | from classifier_bigru import TextClassifier 17 | 18 | 19 | def getClassification(arr): 20 | arr = list(arr) 21 | if arr.index(max(arr)) == 0: 22 | return -2 23 | elif arr.index(max(arr)) == 1: 24 | return -1 25 | elif arr.index(max(arr)) == 2: 26 | return 0 27 | else: 28 | return 1 29 | 30 | 31 | if __name__ == "__main__": 32 | with open('tokenizer_char.pickle', 'rb') as handle: 33 | maxlen = 1000 34 | model_dir = "model_bigru_char/" 35 | tokenizer = pickle.load(handle) 36 | word_index = tokenizer.word_index 37 | validation = pd.read_csv("preprocess/validation_char.csv") 38 | validation["content"] = validation.apply(lambda x: eval(x[1]), axis=1) 39 | X_test = validation["content"].values 40 | list_tokenized_validation = tokenizer.texts_to_sequences(X_test) 41 | input_validation = sequence.pad_sequences(list_tokenized_validation, maxlen=maxlen) 42 | w2_model = KeyedVectors.load_word2vec_format("word2vec/chars.vector", binary=True, encoding='utf8', 43 | unicode_errors='ignore') 44 | embeddings_index = {} 45 | embeddings_matrix = np.zeros((len(word_index) + 1, w2_model.vector_size)) 46 | word2idx = {"_PAD": 0} 47 | vocab_list = [(k, w2_model.wv[k]) for k, v in w2_model.wv.vocab.items()] 48 | for word, i in word_index.items(): 49 | if word in w2_model: 50 | embedding_vector = w2_model[word] 51 | else: 52 | embedding_vector = None 53 | if embedding_vector is not None: 54 | embeddings_matrix[i] = embedding_vector 55 | 56 | submit = pd.read_csv("ai_challenger_sentiment_analysis_validationset_20180816/sentiment_analysis_validationset.csv") 57 | submit_prob = pd.read_csv("ai_challenger_sentiment_analysis_validationset_20180816/sentiment_analysis_validationset.csv") 58 | 59 | model1 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 60 | model1.load_weights(model_dir + "model_ltc_01.hdf5") 61 | submit["location_traffic_convenience"] = list(map(getClassification, model1.predict(input_validation))) 62 | submit_prob["location_traffic_convenience"] = list(model1.predict(input_validation)) 63 | del model1 64 | gc.collect() 65 | K.clear_session() 66 | 67 | model2 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 68 | model2.load_weights(model_dir + "model_ldfbd_01.hdf5") 69 | submit["location_distance_from_business_district"] = list( 70 | map(getClassification, model2.predict(input_validation))) 71 | submit_prob["location_distance_from_business_district"] = list(model2.predict(input_validation)) 72 | del model2 73 | gc.collect() 74 | K.clear_session() 75 | 76 | model3 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 77 | 
model3.load_weights(model_dir + "model_letf_02.hdf5") 78 | submit["location_easy_to_find"] = list(map(getClassification, model3.predict(input_validation))) 79 | submit_prob["location_easy_to_find"] = list(model3.predict(input_validation)) 80 | del model3 81 | gc.collect() 82 | K.clear_session() 83 | 84 | model4 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 85 | model4.load_weights(model_dir + "model_swt_02.hdf5") 86 | submit["service_wait_time"] = list(map(getClassification, model4.predict(input_validation))) 87 | submit_prob["service_wait_time"] = list(model4.predict(input_validation)) 88 | del model4 89 | gc.collect() 90 | K.clear_session() 91 | 92 | model5 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 93 | model5.load_weights(model_dir + "model_swa_02.hdf5") 94 | submit["service_waiters_attitude"] = list(map(getClassification, model5.predict(input_validation))) 95 | submit_prob["service_waiters_attitude"] = list(model5.predict(input_validation)) 96 | del model5 97 | gc.collect() 98 | K.clear_session() 99 | 100 | model6 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 101 | model6.load_weights(model_dir + "model_spc_02.hdf5") 102 | submit["service_parking_convenience"] = list(map(getClassification, model6.predict(input_validation))) 103 | submit_prob["service_parking_convenience"] = list(model6.predict(input_validation)) 104 | del model6 105 | gc.collect() 106 | K.clear_session() 107 | 108 | model7 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 109 | model7.load_weights(model_dir + "model_ssp_02.hdf5") 110 | submit["service_serving_speed"] = list(map(getClassification, model7.predict(input_validation))) 111 | submit_prob["service_serving_speed"] = list(model7.predict(input_validation)) 112 | del model7 113 | gc.collect() 114 | K.clear_session() 115 | 116 | model8 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 117 | model8.load_weights(model_dir + "model_pl_02.hdf5") 118 | submit["price_level"] = list(map(getClassification, model8.predict(input_validation))) 119 | submit_prob["price_level"] = list(model8.predict(input_validation)) 120 | del model8 121 | gc.collect() 122 | K.clear_session() 123 | 124 | model9 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 125 | model9.load_weights(model_dir + "model_pce_02.hdf5") 126 | submit["price_cost_effective"] = list(map(getClassification, model9.predict(input_validation))) 127 | submit_prob["price_cost_effective"] = list(model9.predict(input_validation)) 128 | del model9 129 | gc.collect() 130 | K.clear_session() 131 | 132 | model10 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 133 | model10.load_weights(model_dir + "model_pd_02.hdf5") 134 | submit["price_discount"] = list(map(getClassification, model10.predict(input_validation))) 135 | submit_prob["price_discount"] = list(model10.predict(input_validation)) 136 | del model10 137 | gc.collect() 138 | K.clear_session() 139 | 140 | model11 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 141 | model11.load_weights(model_dir + "model_ed_01.hdf5") 142 | submit["environment_decoration"] = list(map(getClassification, model11.predict(input_validation))) 143 | submit_prob["environment_decoration"] = list(model11.predict(input_validation)) 144 | del model11 145 | gc.collect() 146 | K.clear_session() 147 | 148 | model12 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 149 | model12.load_weights(model_dir + "model_en_02.hdf5") 150 | 
submit["environment_noise"] = list(map(getClassification, model12.predict(input_validation))) 151 | submit_prob["environment_noise"] = list(model12.predict(input_validation)) 152 | del model12 153 | gc.collect() 154 | K.clear_session() 155 | 156 | model13 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 157 | model13.load_weights(model_dir + "model_es_02.hdf5") 158 | submit["environment_space"] = list(map(getClassification, model13.predict(input_validation))) 159 | submit_prob["environment_space"] = list(model13.predict(input_validation)) 160 | del model13 161 | gc.collect() 162 | K.clear_session() 163 | 164 | model14 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 165 | model14.load_weights(model_dir + "model_ec_01.hdf5") 166 | submit["environment_cleaness"] = list(map(getClassification, model14.predict(input_validation))) 167 | submit_prob["environment_cleaness"] = list(model14.predict(input_validation)) 168 | del model14 169 | gc.collect() 170 | K.clear_session() 171 | 172 | model15 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 173 | model15.load_weights(model_dir + "model_dp_01.hdf5") 174 | submit["dish_portion"] = list(map(getClassification, model15.predict(input_validation))) 175 | submit_prob["dish_portion"] = list(model15.predict(input_validation)) 176 | del model15 177 | gc.collect() 178 | K.clear_session() 179 | 180 | model16 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 181 | model16.load_weights(model_dir + "model_dt_02.hdf5") 182 | submit["dish_taste"] = list(map(getClassification, model16.predict(input_validation))) 183 | submit_prob["dish_taste"] = list(model16.predict(input_validation)) 184 | del model16 185 | gc.collect() 186 | K.clear_session() 187 | 188 | model17 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 189 | model17.load_weights(model_dir + "model_dl_02.hdf5") 190 | submit["dish_look"] = list(map(getClassification, model17.predict(input_validation))) 191 | submit_prob["dish_look"] = list(model17.predict(input_validation)) 192 | del model17 193 | gc.collect() 194 | K.clear_session() 195 | 196 | model18 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 197 | model18.load_weights(model_dir + "model_dr_01.hdf5") 198 | submit["dish_recommendation"] = list(map(getClassification, model18.predict(input_validation))) 199 | submit_prob["dish_recommendation"] = list(model18.predict(input_validation)) 200 | del model18 201 | gc.collect() 202 | K.clear_session() 203 | 204 | model19 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 205 | model19.load_weights(model_dir + "model_ooe_01.hdf5") 206 | submit["others_overall_experience"] = list(map(getClassification, model19.predict(input_validation))) 207 | submit_prob["others_overall_experience"] = list(model19.predict(input_validation)) 208 | del model19 209 | gc.collect() 210 | K.clear_session() 211 | 212 | model20 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 213 | model20.load_weights(model_dir + "model_owta_02.hdf5") 214 | submit["others_willing_to_consume_again"] = list(map(getClassification, model20.predict(input_validation))) 215 | submit_prob["others_willing_to_consume_again"] = list(model20.predict(input_validation)) 216 | del model20 217 | gc.collect() 218 | K.clear_session() 219 | 220 | submit.to_csv("validation_bigru_char.csv", index=None) 221 | submit_prob.to_csv("validation_bigru_char_prob.csv", index=None) 
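# Added sketch (not part of the original script): the twenty per-aspect blocks above differ
# only in the output column and the checkpoint file, so the same work can be driven by one
# loop. The pair list below is truncated for brevity and should follow the (column, checkpoint)
# pairs used above; the function takes everything it needs as arguments and reuses
# TextClassifier, getClassification, gc and K from this module. Calling predict() once per
# model and reusing the result also halves the inference cost compared with the blocks above.
ASPECT_CHECKPOINTS_SKETCH = [
    ("location_traffic_convenience", "model_ltc_01.hdf5"),
    ("location_distance_from_business_district", "model_ldfbd_01.hdf5"),
    ("location_easy_to_find", "model_letf_02.hdf5"),
    # ... remaining 17 (column, checkpoint) pairs exactly as loaded above ...
]

def predict_all_aspects_sketch(submit, submit_prob, input_validation,
                               embeddings_matrix, maxlen, word_index, model_dir):
    """For each aspect: rebuild the model, load its weights, predict once, store the hard
    labels and probabilities, then release GPU memory before the next model is built."""
    for column, checkpoint in ASPECT_CHECKPOINTS_SKETCH:
        model = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4)
        model.load_weights(model_dir + checkpoint)
        probs = model.predict(input_validation)
        submit[column] = list(map(getClassification, probs))
        submit_prob[column] = list(probs)
        del model
        gc.collect()
        K.clear_session()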
-------------------------------------------------------------------------------- /old/validation_rcnn_char.py: -------------------------------------------------------------------------------- 1 | from keras.backend.tensorflow_backend import set_session 2 | import tensorflow as tf 3 | config = tf.ConfigProto() 4 | config.gpu_options.allow_growth = True 5 | set_session(tf.Session(config=config)) 6 | import gc 7 | import pandas as pd 8 | import pickle 9 | import numpy as np 10 | np.random.seed(16) 11 | from tensorflow import set_random_seed 12 | set_random_seed(16) 13 | from keras.layers import * 14 | from keras.preprocessing import sequence 15 | from gensim.models.keyedvectors import KeyedVectors 16 | from old.classifier_rcnn import TextClassifier 17 | 18 | 19 | def getClassification(arr): 20 | arr = list(arr) 21 | if arr.index(max(arr)) == 0: 22 | return -2 23 | elif arr.index(max(arr)) == 1: 24 | return -1 25 | elif arr.index(max(arr)) == 2: 26 | return 0 27 | else: 28 | return 1 29 | 30 | 31 | if __name__ == "__main__": 32 | with open('tokenizer_char.pickle', 'rb') as handle: 33 | maxlen = 1000 34 | model_dir = "model_rcnn_char/" 35 | tokenizer = pickle.load(handle) 36 | word_index = tokenizer.word_index 37 | validation = pd.read_csv("preprocess/validation_char.csv") 38 | validation["content"] = validation.apply(lambda x: eval(x[1]), axis=1) 39 | X_test = validation["content"].values 40 | list_tokenized_validation = tokenizer.texts_to_sequences(X_test) 41 | input_validation = sequence.pad_sequences(list_tokenized_validation, maxlen=maxlen) 42 | w2_model = KeyedVectors.load_word2vec_format("word2vec/chars.vector", binary=True, encoding='utf8', 43 | unicode_errors='ignore') 44 | embeddings_index = {} 45 | embeddings_matrix = np.zeros((len(word_index) + 1, w2_model.vector_size)) 46 | word2idx = {"_PAD": 0} 47 | vocab_list = [(k, w2_model.wv[k]) for k, v in w2_model.wv.vocab.items()] 48 | for word, i in word_index.items(): 49 | if word in w2_model: 50 | embedding_vector = w2_model[word] 51 | else: 52 | embedding_vector = None 53 | if embedding_vector is not None: 54 | embeddings_matrix[i] = embedding_vector 55 | 56 | submit = pd.read_csv("ai_challenger_sentiment_analysis_validationset_20180816/sentiment_analysis_validationset.csv") 57 | submit_prob = pd.read_csv("ai_challenger_sentiment_analysis_validationset_20180816/sentiment_analysis_validationset.csv") 58 | 59 | model1 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 60 | model1.load_weights(model_dir + "model_ltc_02.hdf5") 61 | submit["location_traffic_convenience"] = list(map(getClassification, model1.predict(input_validation))) 62 | submit_prob["location_traffic_convenience"] = list(model1.predict(input_validation)) 63 | del model1 64 | gc.collect() 65 | K.clear_session() 66 | 67 | model2 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 68 | model2.load_weights(model_dir + "model_ldfbd_02.hdf5") 69 | submit["location_distance_from_business_district"] = list( 70 | map(getClassification, model2.predict(input_validation))) 71 | submit_prob["location_distance_from_business_district"] = list(model2.predict(input_validation)) 72 | del model2 73 | gc.collect() 74 | K.clear_session() 75 | 76 | model3 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 77 | model3.load_weights(model_dir + "model_letf_02.hdf5") 78 | submit["location_easy_to_find"] = list(map(getClassification, model3.predict(input_validation))) 79 | submit_prob["location_easy_to_find"] = list(model3.predict(input_validation)) 80 
| del model3 81 | gc.collect() 82 | K.clear_session() 83 | 84 | model4 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 85 | model4.load_weights(model_dir + "model_swt_02.hdf5") 86 | submit["service_wait_time"] = list(map(getClassification, model4.predict(input_validation))) 87 | submit_prob["service_wait_time"] = list(model4.predict(input_validation)) 88 | del model4 89 | gc.collect() 90 | K.clear_session() 91 | 92 | model5 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 93 | model5.load_weights(model_dir + "model_swa_02.hdf5") 94 | submit["service_waiters_attitude"] = list(map(getClassification, model5.predict(input_validation))) 95 | submit_prob["service_waiters_attitude"] = list(model5.predict(input_validation)) 96 | del model5 97 | gc.collect() 98 | K.clear_session() 99 | 100 | model6 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 101 | model6.load_weights(model_dir + "model_spc_01.hdf5") 102 | submit["service_parking_convenience"] = list(map(getClassification, model6.predict(input_validation))) 103 | submit_prob["service_parking_convenience"] = list(model6.predict(input_validation)) 104 | del model6 105 | gc.collect() 106 | K.clear_session() 107 | 108 | model7 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 109 | model7.load_weights(model_dir + "model_ssp_02.hdf5") 110 | submit["service_serving_speed"] = list(map(getClassification, model7.predict(input_validation))) 111 | submit_prob["service_serving_speed"] = list(model7.predict(input_validation)) 112 | del model7 113 | gc.collect() 114 | K.clear_session() 115 | 116 | model8 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 117 | model8.load_weights(model_dir + "model_pl_02.hdf5") 118 | submit["price_level"] = list(map(getClassification, model8.predict(input_validation))) 119 | submit_prob["price_level"] = list(model8.predict(input_validation)) 120 | del model8 121 | gc.collect() 122 | K.clear_session() 123 | 124 | model9 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 125 | model9.load_weights(model_dir + "model_pce_02.hdf5") 126 | submit["price_cost_effective"] = list(map(getClassification, model9.predict(input_validation))) 127 | submit_prob["price_cost_effective"] = list(model9.predict(input_validation)) 128 | del model9 129 | gc.collect() 130 | K.clear_session() 131 | 132 | model10 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 133 | model10.load_weights(model_dir + "model_pd_02.hdf5") 134 | submit["price_discount"] = list(map(getClassification, model10.predict(input_validation))) 135 | submit_prob["price_discount"] = list(model10.predict(input_validation)) 136 | del model10 137 | gc.collect() 138 | K.clear_session() 139 | 140 | model11 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 141 | model11.load_weights(model_dir + "model_ed_02.hdf5") 142 | submit["environment_decoration"] = list(map(getClassification, model11.predict(input_validation))) 143 | submit_prob["environment_decoration"] = list(model11.predict(input_validation)) 144 | del model11 145 | gc.collect() 146 | K.clear_session() 147 | 148 | model12 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 149 | model12.load_weights(model_dir + "model_en_02.hdf5") 150 | submit["environment_noise"] = list(map(getClassification, model12.predict(input_validation))) 151 | submit_prob["environment_noise"] = list(model12.predict(input_validation)) 152 | del model12 153 | gc.collect() 154 | K.clear_session() 155 | 156 
| model13 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 157 | model13.load_weights(model_dir + "model_es_01.hdf5") 158 | submit["environment_space"] = list(map(getClassification, model13.predict(input_validation))) 159 | submit_prob["environment_space"] = list(model13.predict(input_validation)) 160 | del model13 161 | gc.collect() 162 | K.clear_session() 163 | 164 | model14 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 165 | model14.load_weights(model_dir + "model_ec_02.hdf5") 166 | submit["environment_cleaness"] = list(map(getClassification, model14.predict(input_validation))) 167 | submit_prob["environment_cleaness"] = list(model14.predict(input_validation)) 168 | del model14 169 | gc.collect() 170 | K.clear_session() 171 | 172 | model15 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 173 | model15.load_weights(model_dir + "model_dp_02.hdf5") 174 | submit["dish_portion"] = list(map(getClassification, model15.predict(input_validation))) 175 | submit_prob["dish_portion"] = list(model15.predict(input_validation)) 176 | del model15 177 | gc.collect() 178 | K.clear_session() 179 | 180 | model16 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 181 | model16.load_weights(model_dir + "model_dt_02.hdf5") 182 | submit["dish_taste"] = list(map(getClassification, model16.predict(input_validation))) 183 | submit_prob["dish_taste"] = list(model16.predict(input_validation)) 184 | del model16 185 | gc.collect() 186 | K.clear_session() 187 | 188 | model17 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 189 | model17.load_weights(model_dir + "model_dl_02.hdf5") 190 | submit["dish_look"] = list(map(getClassification, model17.predict(input_validation))) 191 | submit_prob["dish_look"] = list(model17.predict(input_validation)) 192 | del model17 193 | gc.collect() 194 | K.clear_session() 195 | 196 | model18 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 197 | model18.load_weights(model_dir + "model_dr_02.hdf5") 198 | submit["dish_recommendation"] = list(map(getClassification, model18.predict(input_validation))) 199 | submit_prob["dish_recommendation"] = list(model18.predict(input_validation)) 200 | del model18 201 | gc.collect() 202 | K.clear_session() 203 | 204 | model19 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 205 | model19.load_weights(model_dir + "model_ooe_02.hdf5") 206 | submit["others_overall_experience"] = list(map(getClassification, model19.predict(input_validation))) 207 | submit_prob["others_overall_experience"] = list(model19.predict(input_validation)) 208 | del model19 209 | gc.collect() 210 | K.clear_session() 211 | 212 | model20 = TextClassifier().model(embeddings_matrix, maxlen, word_index, 4) 213 | model20.load_weights(model_dir + "model_owta_02.hdf5") 214 | submit["others_willing_to_consume_again"] = list(map(getClassification, model20.predict(input_validation))) 215 | submit_prob["others_willing_to_consume_again"] = list(model20.predict(input_validation)) 216 | del model20 217 | gc.collect() 218 | K.clear_session() 219 | 220 | submit.to_csv("validation_rcnn_char.csv", index=None) 221 | submit_prob.to_csv("validation_rcnn_char_prob.csv", index=None) -------------------------------------------------------------------------------- /preprocess_char/README.txt: -------------------------------------------------------------------------------- 1 | processed test/train/validation file will be save here 2 | 
-------------------------------------------------------------------------------- /tokenization.py: -------------------------------------------------------------------------------- 1 | # coding=utf-8 2 | # Copyright 2018 The Google AI Language Team Authors. 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 15 | """Tokenization classes.""" 16 | 17 | from __future__ import absolute_import 18 | from __future__ import division 19 | from __future__ import print_function 20 | 21 | import collections 22 | import unicodedata 23 | import six 24 | import tensorflow as tf 25 | 26 | 27 | def convert_to_unicode(text): 28 | """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" 29 | if six.PY3: 30 | if isinstance(text, str): 31 | return text 32 | elif isinstance(text, bytes): 33 | return text.decode("utf-8", "ignore") 34 | else: 35 | raise ValueError("Unsupported string type: %s" % (type(text))) 36 | elif six.PY2: 37 | if isinstance(text, str): 38 | return text.decode("utf-8", "ignore") 39 | elif isinstance(text, unicode): 40 | return text 41 | else: 42 | raise ValueError("Unsupported string type: %s" % (type(text))) 43 | else: 44 | raise ValueError("Not running on Python2 or Python 3?") 45 | 46 | 47 | def printable_text(text): 48 | """Returns text encoded in a way suitable for print or `tf.logging`.""" 49 | 50 | # These functions want `str` for both Python2 and Python3, but in one case 51 | # it's a Unicode string and in the other it's a byte string. 
52 | if six.PY3: 53 | if isinstance(text, str): 54 | return text 55 | elif isinstance(text, bytes): 56 | return text.decode("utf-8", "ignore") 57 | else: 58 | raise ValueError("Unsupported string type: %s" % (type(text))) 59 | elif six.PY2: 60 | if isinstance(text, str): 61 | return text 62 | elif isinstance(text, unicode): 63 | return text.encode("utf-8") 64 | else: 65 | raise ValueError("Unsupported string type: %s" % (type(text))) 66 | else: 67 | raise ValueError("Not running on Python2 or Python 3?") 68 | 69 | 70 | def load_vocab(vocab_file): 71 | """Loads a vocabulary file into a dictionary.""" 72 | vocab = collections.OrderedDict() 73 | index = 0 74 | with tf.gfile.GFile(vocab_file, "r") as reader: 75 | while True: 76 | token = convert_to_unicode(reader.readline()) 77 | if not token: 78 | break 79 | token = token.strip() 80 | vocab[token] = index 81 | index += 1 82 | return vocab 83 | 84 | 85 | def convert_by_vocab(vocab, items): 86 | """Converts a sequence of [tokens|ids] using the vocab.""" 87 | output = [] 88 | for item in items: 89 | output.append(vocab[item]) 90 | return output 91 | 92 | 93 | def convert_tokens_to_ids(vocab, tokens): 94 | return convert_by_vocab(vocab, tokens) 95 | 96 | 97 | def convert_ids_to_tokens(inv_vocab, ids): 98 | return convert_by_vocab(inv_vocab, ids) 99 | 100 | 101 | def whitespace_tokenize(text): 102 | """Runs basic whitespace cleaning and splitting on a peice of text.""" 103 | text = text.strip() 104 | if not text: 105 | return [] 106 | tokens = text.split() 107 | return tokens 108 | 109 | 110 | class FullTokenizer(object): 111 | """Runs end-to-end tokenziation.""" 112 | 113 | def __init__(self, vocab_file, do_lower_case=True): 114 | self.vocab = load_vocab(vocab_file) 115 | self.inv_vocab = {v: k for k, v in self.vocab.items()} 116 | self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) 117 | self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) 118 | 119 | def tokenize(self, text): 120 | split_tokens = [] 121 | for token in self.basic_tokenizer.tokenize(text): 122 | for sub_token in self.wordpiece_tokenizer.tokenize(token): 123 | split_tokens.append(sub_token) 124 | 125 | return split_tokens 126 | 127 | def convert_tokens_to_ids(self, tokens): 128 | return convert_by_vocab(self.vocab, tokens) 129 | 130 | def convert_ids_to_tokens(self, ids): 131 | return convert_by_vocab(self.inv_vocab, ids) 132 | 133 | 134 | class BasicTokenizer(object): 135 | """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" 136 | 137 | def __init__(self, do_lower_case=True): 138 | """Constructs a BasicTokenizer. 139 | 140 | Args: 141 | do_lower_case: Whether to lower case the input. 142 | """ 143 | self.do_lower_case = do_lower_case 144 | 145 | def tokenize(self, text): 146 | """Tokenizes a piece of text.""" 147 | text = convert_to_unicode(text) 148 | text = self._clean_text(text) 149 | 150 | # This was added on November 1st, 2018 for the multilingual and Chinese 151 | # models. This is also applied to the English models now, but it doesn't 152 | # matter since the English models were not trained on any Chinese data 153 | # and generally don't have any Chinese data in them (there are Chinese 154 | # characters in the vocabulary because Wikipedia does have some Chinese 155 | # words in the English Wikipedia.). 
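    # Added illustration (not in the original Google code): _tokenize_chinese_chars()
    # pads every CJK codepoint with spaces, so after the whitespace_tokenize() call
    # below each Chinese character becomes its own token while Latin words survive
    # intact, e.g. "味道很好great" is tokenized to ["味", "道", "很", "好", "great"].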
156 | text = self._tokenize_chinese_chars(text) 157 | 158 | orig_tokens = whitespace_tokenize(text) 159 | split_tokens = [] 160 | for token in orig_tokens: 161 | if self.do_lower_case: 162 | token = token.lower() 163 | token = self._run_strip_accents(token) 164 | split_tokens.extend(self._run_split_on_punc(token)) 165 | 166 | output_tokens = whitespace_tokenize(" ".join(split_tokens)) 167 | return output_tokens 168 | 169 | def _run_strip_accents(self, text): 170 | """Strips accents from a piece of text.""" 171 | text = unicodedata.normalize("NFD", text) 172 | output = [] 173 | for char in text: 174 | cat = unicodedata.category(char) 175 | if cat == "Mn": 176 | continue 177 | output.append(char) 178 | return "".join(output) 179 | 180 | def _run_split_on_punc(self, text): 181 | """Splits punctuation on a piece of text.""" 182 | chars = list(text) 183 | i = 0 184 | start_new_word = True 185 | output = [] 186 | while i < len(chars): 187 | char = chars[i] 188 | if _is_punctuation(char): 189 | output.append([char]) 190 | start_new_word = True 191 | else: 192 | if start_new_word: 193 | output.append([]) 194 | start_new_word = False 195 | output[-1].append(char) 196 | i += 1 197 | 198 | return ["".join(x) for x in output] 199 | 200 | def _tokenize_chinese_chars(self, text): 201 | """Adds whitespace around any CJK character.""" 202 | output = [] 203 | for char in text: 204 | cp = ord(char) 205 | if self._is_chinese_char(cp): 206 | output.append(" ") 207 | output.append(char) 208 | output.append(" ") 209 | else: 210 | output.append(char) 211 | return "".join(output) 212 | 213 | def _is_chinese_char(self, cp): 214 | """Checks whether CP is the codepoint of a CJK character.""" 215 | # This defines a "chinese character" as anything in the CJK Unicode block: 216 | # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) 217 | # 218 | # Note that the CJK Unicode block is NOT all Japanese and Korean characters, 219 | # despite its name. The modern Korean Hangul alphabet is a different block, 220 | # as is Japanese Hiragana and Katakana. Those alphabets are used to write 221 | # space-separated words, so they are not treated specially and handled 222 | # like the all of the other languages. 223 | if ((cp >= 0x4E00 and cp <= 0x9FFF) or # 224 | (cp >= 0x3400 and cp <= 0x4DBF) or # 225 | (cp >= 0x20000 and cp <= 0x2A6DF) or # 226 | (cp >= 0x2A700 and cp <= 0x2B73F) or # 227 | (cp >= 0x2B740 and cp <= 0x2B81F) or # 228 | (cp >= 0x2B820 and cp <= 0x2CEAF) or 229 | (cp >= 0xF900 and cp <= 0xFAFF) or # 230 | (cp >= 0x2F800 and cp <= 0x2FA1F)): # 231 | return True 232 | 233 | return False 234 | 235 | def _clean_text(self, text): 236 | """Performs invalid character removal and whitespace cleanup on text.""" 237 | output = [] 238 | for char in text: 239 | cp = ord(char) 240 | if cp == 0 or cp == 0xfffd or _is_control(char): 241 | continue 242 | if _is_whitespace(char): 243 | output.append(" ") 244 | else: 245 | output.append(char) 246 | return "".join(output) 247 | 248 | 249 | class WordpieceTokenizer(object): 250 | """Runs WordPiece tokenziation.""" 251 | 252 | def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100): 253 | self.vocab = vocab 254 | self.unk_token = unk_token 255 | self.max_input_chars_per_word = max_input_chars_per_word 256 | 257 | def tokenize(self, text): 258 | """Tokenizes a piece of text into its word pieces. 259 | 260 | This uses a greedy longest-match-first algorithm to perform tokenization 261 | using the given vocabulary. 
262 | 263 | For example: 264 | input = "unaffable" 265 | output = ["un", "##aff", "##able"] 266 | 267 | Args: 268 | text: A single token or whitespace separated tokens. This should have 269 | already been passed through `BasicTokenizer. 270 | 271 | Returns: 272 | A list of wordpiece tokens. 273 | """ 274 | 275 | text = convert_to_unicode(text) 276 | 277 | output_tokens = [] 278 | for token in whitespace_tokenize(text): 279 | chars = list(token) 280 | if len(chars) > self.max_input_chars_per_word: 281 | output_tokens.append(self.unk_token) 282 | continue 283 | 284 | is_bad = False 285 | start = 0 286 | sub_tokens = [] 287 | while start < len(chars): 288 | end = len(chars) 289 | cur_substr = None 290 | while start < end: 291 | substr = "".join(chars[start:end]) 292 | if start > 0: 293 | substr = "##" + substr 294 | if substr in self.vocab: 295 | cur_substr = substr 296 | break 297 | end -= 1 298 | if cur_substr is None: 299 | is_bad = True 300 | break 301 | sub_tokens.append(cur_substr) 302 | start = end 303 | 304 | if is_bad: 305 | output_tokens.append(self.unk_token) 306 | else: 307 | output_tokens.extend(sub_tokens) 308 | return output_tokens 309 | 310 | 311 | def _is_whitespace(char): 312 | """Checks whether `chars` is a whitespace character.""" 313 | # \t, \n, and \r are technically contorl characters but we treat them 314 | # as whitespace since they are generally considered as such. 315 | if char == " " or char == "\t" or char == "\n" or char == "\r": 316 | return True 317 | cat = unicodedata.category(char) 318 | if cat == "Zs": 319 | return True 320 | return False 321 | 322 | 323 | def _is_control(char): 324 | """Checks whether `chars` is a control character.""" 325 | # These are technically control characters but we count them as whitespace 326 | # characters. 327 | if char == "\t" or char == "\n" or char == "\r": 328 | return False 329 | cat = unicodedata.category(char) 330 | if cat.startswith("C"): 331 | return True 332 | return False 333 | 334 | 335 | def _is_punctuation(char): 336 | """Checks whether `chars` is a punctuation character.""" 337 | cp = ord(char) 338 | # We treat all non-letter/number ASCII as punctuation. 339 | # Characters such as "^", "$", and "`" are not in the Unicode 340 | # Punctuation class but we treat them as punctuation anyways, for 341 | # consistency. 342 | if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or 343 | (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): 344 | return True 345 | cat = unicodedata.category(char) 346 | if cat.startswith("P"): 347 | return True 348 | return False 349 | -------------------------------------------------------------------------------- /tokenizer_char.pickle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/brightmart/sentiment_analysis_fine_grain/bad7c61a2610eec614e1b42b07cadb1ec57a2ef7/tokenizer_char.pickle -------------------------------------------------------------------------------- /train_bert_fine_tuning.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #process--->1.load data(X:list of lint,y:int). 2.create session. 3.feed data. 
4.training (5.validation) ,(6.prediction) 3 | """ 4 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 5 | main idea: based on multiple layer self-attention model(encoder of Transformer), pretrain two tasks( masked language model and next sentence prediction task) 6 | on large scale of corpus, then fine-tuning by add a single classification layer. 7 | 8 | train the model(transformer) with data enhanced by pre-training of two tasks. 9 | default hyperparameter is d_model=512,h=8,d_k=d_v=64(big). if you have a small data set or want to train a 10 | small model, use d_model=128,h=8,d_k=d_v=16(small), or d_model=64,h=8,d_k=d_v=8(tiny). 11 | """ 12 | 13 | import tensorflow as tf 14 | #import numpy as np 15 | #from model.bert_model import BertModel # TODO TODO TODO test whether pretrain can boost perofrmance with other model 16 | from model.bert_cnn_fine_grain_model import BertCNNFineGrainModel as BertModel 17 | 18 | from data_util_hdf5 import create_or_load_vocabulary,load_data_multilabel,assign_pretrained_word_embedding,set_config,get_lable2index 19 | import os 20 | from evaluation_matrix import * 21 | #from model.config_transformer import Config 22 | #configuration 23 | FLAGS=tf.app.flags.FLAGS 24 | 25 | tf.app.flags.DEFINE_string("data_path","./data/","path of traning data.") 26 | tf.app.flags.DEFINE_string("training_data_file","./data/bert_train2.txt","path of traning data.") #./data/cail2018_bi.json 27 | tf.app.flags.DEFINE_string("valid_data_file","./data/bert_valid2.txt","path of validation data.") 28 | tf.app.flags.DEFINE_string("test_data_file","./data/bert_test2.txt","path of validation data.") 29 | tf.app.flags.DEFINE_string("ckpt_dir","./checkpoint_lm/","checkpoint location for the model for restore from pre-train") #save to here, so make it easy to upload for test 30 | tf.app.flags.DEFINE_string("ckpt_dir_save","./checkpoint_lm_save/","checkpoint location for the model for save fine-tuning") #save to here, so make it easy to upload for test 31 | 32 | tf.app.flags.DEFINE_string("tokenize_style","word","checkpoint location for the model") 33 | tf.app.flags.DEFINE_string("model_name","BertCNNFineGrainModel","text cnn model. 
pre-train and fine-tuning.") 34 | 35 | tf.app.flags.DEFINE_integer("vocab_size",50000,"maximum vocab size.") 36 | tf.app.flags.DEFINE_float("learning_rate",0.00001,"learning rate") #0.001 37 | tf.app.flags.DEFINE_integer("batch_size", 64, "Batch size for training/evaluating.") # 32-->128 38 | tf.app.flags.DEFINE_integer("decay_steps", 10000, "how many steps before decay learning rate.") # 32-->128 39 | tf.app.flags.DEFINE_float("decay_rate", 0.8, "Rate of decay for learning rate.") #0.65 40 | tf.app.flags.DEFINE_float("dropout_keep_prob", 0.9, "percentage to keep when using dropout.") #0.65 41 | tf.app.flags.DEFINE_integer("sequence_length",800,"max sentence length")#400 42 | tf.app.flags.DEFINE_integer("sequence_length_lm",10,"max sentence length for masked language model") 43 | 44 | tf.app.flags.DEFINE_boolean("is_training",True,"is training.true:tranining,false:testing/inference") 45 | tf.app.flags.DEFINE_boolean("is_fine_tuning",True,"is_finetuning.ture:this is fine-tuning stage") 46 | 47 | tf.app.flags.DEFINE_integer("num_epochs",35,"number of epochs to run.") 48 | tf.app.flags.DEFINE_integer("process_num",35,"number of cpu used") 49 | 50 | tf.app.flags.DEFINE_integer("validate_every", 1, "Validate every validate_every epochs.") # 51 | tf.app.flags.DEFINE_boolean("use_pretrained_embedding",False,"whether to use embedding or not.")# 52 | tf.app.flags.DEFINE_string("word2vec_model_path","./data/Tencent_AILab_ChineseEmbedding_100w.txt","word2vec's vocabulary and vectors") # data/sgns.target.word-word.dynwin5.thr10.neg5.dim300.iter5--->data/news_12g_baidubaike_20g_novel_90g_embedding_64.bin--->sgns.merge.char 53 | tf.app.flags.DEFINE_boolean("test_mode",False,"whether it is test mode. if it is test mode, only small percentage of data will be used. test mode for test purpose.") 54 | 55 | tf.app.flags.DEFINE_integer("d_model", 128, "dimension of model") # 512-->128 56 | tf.app.flags.DEFINE_integer("num_layer", 6, "number of layer") 57 | tf.app.flags.DEFINE_integer("num_header", 8, "number of header") 58 | tf.app.flags.DEFINE_integer("d_k", 16, "dimension of k") # 64-->16 59 | tf.app.flags.DEFINE_integer("d_v", 16, "dimension of v") # 64-->16 60 | 61 | def main(_): 62 | # 1.load vocabulary of token from cache file save from pre-trained stage; load label dict from training file; print some message. 63 | vocab_word2index, _= create_or_load_vocabulary(FLAGS.data_path,FLAGS.training_data_file,FLAGS.vocab_size,test_mode=FLAGS.test_mode,tokenize_style=FLAGS.tokenize_style,model_name=FLAGS.model_name) 64 | label2index=get_lable2index(FLAGS.data_path,FLAGS.training_data_file, tokenize_style=FLAGS.tokenize_style) 65 | vocab_size = len(vocab_word2index);print("cnn_model.vocab_size:",vocab_size);num_classes=len(label2index);print("num_classes:",num_classes) 66 | iii=0;iii/0 # todo test first two function, then continue 67 | # load training data. 68 | train,valid, test= load_data_multilabel(FLAGS.data_path,FLAGS.training_data_file,FLAGS.valid_data_file,FLAGS.test_data_file,vocab_word2index,label2index,FLAGS.sequence_length, 69 | process_num=FLAGS.process_num,test_mode=FLAGS.test_mode,tokenize_style=FLAGS.tokenize_style) 70 | train_X, train_Y= train 71 | valid_X, valid_Y= valid 72 | test_X,test_Y = test 73 | print("test_model:",FLAGS.test_mode,";length of training data:",train_X.shape,";valid data:",valid_X.shape,";test data:",test_X.shape,";train_Y:",train_Y.shape) 74 | # 2.create session. 
75 | gpu_config=tf.ConfigProto() 76 | gpu_config.gpu_options.allow_growth=True 77 | with tf.Session(config=gpu_config) as sess: 78 | #Instantiate Model 79 | config=set_config(FLAGS,num_classes,vocab_size) 80 | model=BertModel(config) 81 | #Initialize Save 82 | saver=tf.train.Saver() 83 | if os.path.exists(FLAGS.ckpt_dir+"checkpoint"): 84 | print("Restoring Variables from Checkpoint.") 85 | sess.run(tf.global_variables_initializer()) 86 | for i in range(6): #decay learning rate if necessary. 87 | print(i,"Going to decay learning rate by a factor of "+str(FLAGS.decay_rate)) 88 | sess.run(model.learning_rate_decay_half_op) 89 | # restore those variables that names and shapes exists in your model from checkpoint. for detail check: https://gist.github.com/iganichev/d2d8a0b1abc6b15d4a07de83171163d4 90 | optimistic_restore(sess, tf.train.latest_checkpoint(FLAGS.ckpt_dir)) #saver.restore(sess,tf.train.latest_checkpoint(FLAGS.ckpt_dir)) 91 | else: 92 | print('Initializing Variables as model instance is not exist.') 93 | sess.run(tf.global_variables_initializer()) 94 | if FLAGS.use_pretrained_embedding: 95 | vocabulary_index2word={index:word for word,index in vocab_word2index.items()} 96 | assign_pretrained_word_embedding(sess, vocabulary_index2word, vocab_size,FLAGS.word2vec_model_path,model.embedding,config.d_model) # assign pretrained word embeddings 97 | curr_epoch=sess.run(model.epoch_step) 98 | # 3.feed data & training 99 | number_of_training_data=len(train_X) 100 | batch_size=FLAGS.batch_size 101 | iteration=0 102 | score_best=-100 103 | f1_score=0 104 | epoch=0 105 | for epoch in range(curr_epoch,FLAGS.num_epochs): 106 | loss_total, counter = 0.0, 0 107 | for start, end in zip(range(0, number_of_training_data, batch_size),range(batch_size, number_of_training_data, batch_size)): 108 | iteration=iteration+1 109 | if epoch==0 and counter==0: 110 | print("trainX[start:end]:",train_X[start:end],"train_X.shape:",train_X.shape) 111 | feed_dict = {model.input_x: train_X[start:end],model.input_y:train_Y[start:end],model.dropout_keep_prob: FLAGS.dropout_keep_prob} 112 | current_loss,lr,l2_loss,_=sess.run([model.loss_val,model.learning_rate,model.l2_loss,model.train_op],feed_dict) 113 | loss_total,counter=loss_total+current_loss,counter+1 114 | if counter %30==0: 115 | print("Learning rate:%.7f\tLoss:%.3f\tCurrent_loss:%.3f\tL2_loss%.3f\t"%(lr,float(loss_total)/float(counter),current_loss,l2_loss)) 116 | if start!=0 and start%(4000*FLAGS.batch_size)==0: 117 | loss_valid, f1_macro_valid, f1_micro_valid= do_eval(sess, model, valid,num_classes,label2index) 118 | f1_score_valid=((f1_macro_valid+f1_micro_valid)/2.0) #*100.0 119 | print("Valid.Epoch %d ValidLoss:%.3f\tF1_score_valid:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t" % (epoch, loss_valid, f1_score_valid, f1_macro_valid, f1_micro_valid)) 120 | 121 | # save model to checkpoint 122 | if f1_score_valid>score_best: 123 | save_path = FLAGS.ckpt_dir_save + "model.ckpt" 124 | print("going to save check point.") 125 | saver.save(sess, save_path, global_step=epoch) 126 | score_best=f1_score_valid 127 | #epoch increment 128 | print("going to increment epoch counter....") 129 | sess.run(model.epoch_increment) 130 | 131 | # 4.validation 132 | print(epoch,FLAGS.validate_every,(epoch % FLAGS.validate_every==0)) 133 | if epoch % FLAGS.validate_every==0: 134 | loss_valid,f1_macro_valid2,f1_micro_valid2=do_eval(sess,model,valid,num_classes,label2index) 135 | f1_score_valid2 = ((f1_macro_valid2 + f1_micro_valid2) / 2.0) #* 100.0 136 | print("Valid.Epoch %d 
ValidLoss:%.3f\tF1 score:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t"% (epoch,loss_valid,f1_score_valid2,f1_macro_valid2,f1_micro_valid2))
137 | #save model to checkpoint
138 | if f1_score_valid2 > score_best:
139 | save_path=FLAGS.ckpt_dir_save+"model.ckpt"
140 | print("going to save check point.")
141 | saver.save(sess,save_path,global_step=epoch)
142 | score_best = f1_score_valid2
143 | if (epoch == 2 or epoch == 4 or epoch == 6 or epoch == 9 or epoch == 13):
144 | for i in range(1):
145 | print(i, "Going to decay learning rate by half.")
146 | sess.run(model.learning_rate_decay_half_op)
147 |
148 | # 5.report on test set
149 | loss_test, f1_macro_test, f1_micro_test=do_eval(sess, model, test,num_classes, label2index)
150 | f1_score_test=((f1_macro_test + f1_micro_test) / 2.0) * 100.0
151 | print("Test.Epoch %d TestLoss:%.3f\tF1_score:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t" % (epoch, loss_test, f1_score_test,f1_macro_test, f1_micro_test))
152 | print("training completed...")
153 |
154 | #sess,model,valid,iteration,num_classes,label2index
155 | def do_eval(sess,model,valid,num_classes,label2index):
156 | """
157 | do evaluation using the validation set, and report loss and f1 score.
158 | :param sess:
159 | :param model:
160 | :param valid:
161 | :param num_classes:
162 | :param label2index:
163 | :return:
164 | """
165 | valid_x,valid_y=valid
166 | valid_x,valid_y=valid_x[0:64*15],valid_y[0:64*15] # evaluate on at most 64*15 validation examples to keep evaluation fast
167 | number_examples=valid_x.shape[0]
168 | print("number_examples:",number_examples)
169 | eval_loss,eval_counter=0.0,0
170 | batch_size=FLAGS.batch_size
171 | label_dict=init_label_dict(num_classes)
172 | eval_macro_f1, eval_micro_f1 = 0.0,0.0
173 | for start,end in zip(range(0,number_examples,batch_size),range(batch_size,number_examples,batch_size)):
174 | feed_dict = {model.input_x: valid_x[start:end],model.input_y:valid_y[start:end],model.dropout_keep_prob: 1.0}
175 | curr_eval_loss, logits= sess.run([model.loss_val,model.logits],feed_dict) # logits:[batch_size,label_size]
176 | #compute confusion matrix
177 | label_dict=compute_confuse_matrix_batch(valid_y[start:end],logits,label_dict,name='bright')
178 | eval_loss=eval_loss+curr_eval_loss
179 | eval_counter=eval_counter+1
180 | #compute f1_micro & f1_macro
181 | f1_micro,f1_macro=compute_micro_macro(label_dict) #label_dict is a dict, key is: a label, value is: (TP,FP,FN).
where TP is number of True Positive 182 | compute_f1_score_write_for_debug(label_dict,label2index) 183 | return eval_loss/float(eval_counter+small_value),f1_macro,f1_micro 184 | 185 | def optimistic_restore(session, save_file): 186 | """ 187 | restore only those variable that exists in the model 188 | :param session: 189 | :param save_file: 190 | :return: 191 | """ 192 | reader = tf.train.NewCheckpointReader(save_file) 193 | saved_shapes = reader.get_variable_to_shape_map() 194 | var_names = sorted([(var.name, var.name.split(':')[0]) for 195 | var in tf.global_variables() 196 | if var.name.split(':')[0] in saved_shapes]) 197 | restore_vars = [] 198 | name2var = dict(zip(map(lambda x: x.name.split(':')[0],tf.global_variables()),tf.global_variables())) 199 | with tf.variable_scope('', reuse=True): 200 | for var_name, saved_var_name in var_names: 201 | curr_var = name2var[saved_var_name] 202 | var_shape = curr_var.get_shape().as_list() 203 | if var_shape == saved_shapes[saved_var_name]: 204 | #print("going to restore.var_name:",var_name,";saved_var_name:",saved_var_name) 205 | restore_vars.append(curr_var) 206 | else: 207 | print("variable not trained.var_name:",var_name) 208 | saver = tf.train.Saver(restore_vars) 209 | saver.restore(session, save_file) 210 | 211 | if __name__ == "__main__": 212 | tf.app.run() -------------------------------------------------------------------------------- /train_cnn_fine_grain.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #process--->1.load data(X:list of lint,y:int). 2.create session. 3.feed data. 4.training (5.validation) ,(6.prediction) 3 | """ 4 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 5 | main idea: based on multiple layer self-attention model(encoder of Transformer), pretrain two tasks( masked language model and next sentence prediction task) 6 | on large scale of corpus, then fine-tuning by add a single classification layer. 7 | 8 | train the model(transformer) with data enhanced by pre-training of two tasks. 9 | default hyperparameter is d_model=512,h=8,d_k=d_v=64(big). if you have a small data set or want to train a 10 | small model, use d_model=128,h=8,d_k=d_v=16(small), or d_model=64,h=8,d_k=d_v=8(tiny). 
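For quick reference, the three presets named in the docstring above can be written out as follows; this is an editor's illustration only, and the flag defaults set further down in this script correspond to the "small" preset:

# big / small / tiny presets from the docstring above (illustration only)
PRESETS = {
    "big":   {"d_model": 512, "num_header": 8, "d_k": 64, "d_v": 64},
    "small": {"d_model": 128, "num_header": 8, "d_k": 16, "d_v": 16},
    "tiny":  {"d_model": 64,  "num_header": 8, "d_k": 8,  "d_v": 8},
}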
11 | """ 12 | 13 | import tensorflow as tf 14 | #import numpy as np 15 | #from model.bert_model import BertModel # TODO TODO TODO test whether pretrain can boost perofrmance with other model 16 | from model.bert_cnn_fine_grain_model import BertCNNFineGrainModel as BertModel 17 | 18 | from data_util_hdf5 import assign_pretrained_word_embedding,set_config,create_or_load_vocabulary 19 | import os 20 | import pickle 21 | from evaluation_matrix import * 22 | #from model.config_transformer import Config 23 | #configuration 24 | FLAGS=tf.app.flags.FLAGS 25 | 26 | tf.app.flags.DEFINE_string("data_path","./data/","path of traning data.") 27 | tf.app.flags.DEFINE_string("training_data_file","./data/bert_train2.txt","path of traning data.") #./data/cail2018_bi.json 28 | tf.app.flags.DEFINE_string("valid_data_file","./data/bert_valid2.txt","path of validation data.") 29 | tf.app.flags.DEFINE_string("test_data_file","./data/bert_test2.txt","path of validation data.") 30 | tf.app.flags.DEFINE_string("ckpt_dir","./checkpoint_lm/","checkpoint location for the model for restore from pre-train") #save to here, so make it easy to upload for test 31 | tf.app.flags.DEFINE_string("ckpt_dir_save","./checkpoint_lm_save/","checkpoint location for the model for save fine-tuning") #save to here, so make it easy to upload for test 32 | 33 | tf.app.flags.DEFINE_string("tokenize_style","word","checkpoint location for the model") 34 | tf.app.flags.DEFINE_string("model_name","","text cnn model. pre-train and fine-tuning.") # BertCNNFineGrainModel 35 | 36 | tf.app.flags.DEFINE_integer("vocab_size",70000,"maximum vocab size.") 37 | tf.app.flags.DEFINE_float("learning_rate",0.001,"learning rate") #0.001 38 | tf.app.flags.DEFINE_integer("batch_size", 64, "Batch size for training/evaluating.") # 32-->128 39 | tf.app.flags.DEFINE_integer("decay_steps", 10000, "how many steps before decay learning rate.") # 32-->128 40 | tf.app.flags.DEFINE_float("decay_rate", 0.8, "Rate of decay for learning rate.") #0.65 41 | tf.app.flags.DEFINE_float("dropout_keep_prob", 0.9, "percentage to keep when using dropout.") #0.65 42 | tf.app.flags.DEFINE_integer("sequence_length",400,"max sentence length")#400 43 | tf.app.flags.DEFINE_integer("sequence_length_lm",10,"max sentence length for masked language model") 44 | 45 | tf.app.flags.DEFINE_boolean("is_training",True,"is training.true:tranining,false:testing/inference") 46 | tf.app.flags.DEFINE_boolean("is_fine_tuning",True,"is_finetuning.ture:this is fine-tuning stage") 47 | 48 | tf.app.flags.DEFINE_integer("num_epochs",35,"number of epochs to run.") 49 | tf.app.flags.DEFINE_integer("process_num",35,"number of cpu used") 50 | 51 | tf.app.flags.DEFINE_integer("validate_every", 1, "Validate every validate_every epochs.") # 52 | tf.app.flags.DEFINE_boolean("use_pretrained_embedding",False,"whether to use embedding or not.")# 53 | tf.app.flags.DEFINE_string("word2vec_model_path","./data/Tencent_AILab_ChineseEmbedding_100w.txt","word2vec's vocabulary and vectors") # data/sgns.target.word-word.dynwin5.thr10.neg5.dim300.iter5--->data/news_12g_baidubaike_20g_novel_90g_embedding_64.bin--->sgns.merge.char 54 | tf.app.flags.DEFINE_boolean("test_mode",False,"whether it is test mode. if it is test mode, only small percentage of data will be used. 
test mode for test purpose.") 55 | 56 | tf.app.flags.DEFINE_integer("d_model", 128, "dimension of model") # 128-->200 57 | tf.app.flags.DEFINE_integer("num_layer", 6, "number of layer") 58 | tf.app.flags.DEFINE_integer("num_header", 8, "number of header") 59 | tf.app.flags.DEFINE_integer("d_k", 16, "dimension of k") # 64-->16 60 | tf.app.flags.DEFINE_integer("d_v", 16, "dimension of v") # 64-->16 61 | 62 | tf.app.flags.DEFINE_string("cache_file","./preprocess_word/train_valid_test_vocab_cache.pik","cache file that contains train/valid/test data and vocab of words and label2index") 63 | 64 | def main(_): 65 | # 1.load vocabulary of token from cache file save from pre-trained stage; load label dict from training file; print some message. 66 | vocab_word2index, _= create_or_load_vocabulary(FLAGS.data_path,FLAGS.training_data_file,FLAGS.vocab_size,test_mode=FLAGS.test_mode,tokenize_style=FLAGS.tokenize_style,model_name=FLAGS.model_name) 67 | #label2index=get_lable2index(FLAGS.data_path,FLAGS.training_data_file, tokenize_style=FLAGS.tokenize_style) 68 | #vocab_size = len(vocab_word2index);print("cnn_model.vocab_size:",vocab_size);num_classes=len(label2index);print("num_classes:",num_classes) 69 | # load training data. 70 | #train,valid, test= load_data_multilabel(FLAGS.data_path,FLAGS.training_data_file,FLAGS.valid_data_file,FLAGS.test_data_file,vocab_word2index,label2index,FLAGS.sequence_length, 71 | # process_num=FLAGS.process_num,test_mode=FLAGS.test_mode,tokenize_style=FLAGS.tokenize_style) 72 | #train_X, train_Y= train 73 | #valid_X, valid_Y= valid 74 | #test_X,test_Y = test 75 | if not os.path.exists(FLAGS.cache_file): 76 | print("cache file is missing. please generate it though step by step with preprocess_word.ipynb") 77 | return 78 | train_X, train_Y, valid_X, valid_Y, test_X, label2index=None,None,None,None,None,None 79 | 80 | with open(FLAGS.cache_file, 'rb') as data_f: 81 | train_X, train_Y, valid_X, valid_Y, test_X,_, label2index=pickle.load(data_f) 82 | valid=(valid_X, valid_Y) 83 | data_f.close() 84 | num_classes=len(label2index) 85 | vocab_size=len(vocab_word2index) 86 | FLAGS.sequence_length=train_X.shape[1] # 87 | print("test_model:",FLAGS.test_mode,";length of training data:",train_X.shape,";valid data:",valid_X.shape,";test data:",test_X.shape,";train_Y:",train_Y.shape) 88 | # 2.create session. 89 | gpu_config=tf.ConfigProto() 90 | gpu_config.gpu_options.allow_growth=True 91 | with tf.Session(config=gpu_config) as sess: 92 | #Instantiate Model 93 | config=set_config(FLAGS,num_classes,vocab_size) 94 | model=BertModel(config) 95 | #Initialize Save 96 | saver=tf.train.Saver() 97 | if os.path.exists(FLAGS.ckpt_dir+"checkpoint"): 98 | print("Restoring Variables from Checkpoint.") 99 | sess.run(tf.global_variables_initializer()) 100 | for i in range(6): #decay learning rate if necessary. 101 | print(i,"Going to decay learning rate by a factor of "+str(FLAGS.decay_rate)) 102 | sess.run(model.learning_rate_decay_half_op) 103 | # restore those variables that names and shapes exists in your model from checkpoint. 
for detail check: https://gist.github.com/iganichev/d2d8a0b1abc6b15d4a07de83171163d4 104 | optimistic_restore(sess, tf.train.latest_checkpoint(FLAGS.ckpt_dir)) #saver.restore(sess,tf.train.latest_checkpoint(FLAGS.ckpt_dir)) 105 | else: 106 | print('Initializing Variables as model instance is not exist.') 107 | sess.run(tf.global_variables_initializer()) 108 | if FLAGS.use_pretrained_embedding: 109 | vocabulary_index2word={index:word for word,index in vocab_word2index.items()} 110 | assign_pretrained_word_embedding(sess, vocabulary_index2word, vocab_size,FLAGS.word2vec_model_path,model.embedding,config.d_model) # assign pretrained word embeddings 111 | curr_epoch=sess.run(model.epoch_step) 112 | # 3.feed data & training 113 | number_of_training_data=len(train_X) 114 | batch_size=FLAGS.batch_size 115 | iteration=0 116 | score_best=-100 117 | f1_score=0 118 | epoch=0 119 | for epoch in range(curr_epoch,FLAGS.num_epochs): 120 | loss_total, counter = 0.0, 0 121 | for start, end in zip(range(0, number_of_training_data, batch_size),range(batch_size, number_of_training_data, batch_size)): 122 | iteration=iteration+1 123 | if epoch==0 and counter==0: 124 | print("trainX[start:end]:",train_X[start:end],"train_X.shape:",train_X.shape) 125 | feed_dict = {model.input_x: train_X[start:end],model.input_y:train_Y[start:end],model.dropout_keep_prob: FLAGS.dropout_keep_prob} 126 | current_loss,lr,l2_loss,_=sess.run([model.loss_val,model.learning_rate,model.l2_loss,model.train_op],feed_dict) 127 | loss_total,counter=loss_total+current_loss,counter+1 128 | if counter %30==0: 129 | print("Learning rate:%.7f\tLoss:%.3f\tCurrent_loss:%.3f\tL2_loss%.3f\t"%(lr,float(loss_total)/float(counter),current_loss,l2_loss)) 130 | if start!=0 and start%(4000*FLAGS.batch_size)==0: 131 | loss_valid, f1_macro_valid, f1_micro_valid= do_eval(sess, model, valid,num_classes,label2index) 132 | f1_score_valid=((f1_macro_valid+f1_micro_valid)/2.0) #*100.0 133 | print("Valid.Epoch %d ValidLoss:%.3f\tF1_score_valid:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t" % (epoch, loss_valid, f1_score_valid, f1_macro_valid, f1_micro_valid)) 134 | 135 | # save model to checkpoint 136 | if f1_score_valid>score_best: 137 | save_path = FLAGS.ckpt_dir_save + "model.ckpt" 138 | print("going to save check point.") 139 | saver.save(sess, save_path, global_step=epoch) 140 | score_best=f1_score_valid 141 | #epoch increment 142 | print("going to increment epoch counter....") 143 | sess.run(model.epoch_increment) 144 | 145 | # 4.validation 146 | print(epoch,FLAGS.validate_every,(epoch % FLAGS.validate_every==0)) 147 | if epoch % FLAGS.validate_every==0: 148 | loss_valid,f1_macro_valid2,f1_micro_valid2=do_eval(sess,model,valid,num_classes,label2index) 149 | f1_score_valid2 = ((f1_macro_valid2 + f1_micro_valid2) / 2.0) #* 100.0 150 | print("Valid.Epoch %d ValidLoss:%.3f\tF1 score:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t"% (epoch,loss_valid,f1_score_valid2,f1_macro_valid2,f1_micro_valid2)) 151 | #save model to checkpoint 152 | if f1_score_valid2 > score_best: 153 | save_path=FLAGS.ckpt_dir_save+"model.ckpt" 154 | print("going to save check point.") 155 | saver.save(sess,save_path,global_step=epoch) 156 | score_best = f1_score_valid2 157 | if (epoch == 2 or epoch == 4 or epoch == 6 or epoch == 9 or epoch == 13): 158 | for i in range(1): 159 | print(i, "Going to decay learning rate by half.") 160 | sess.run(model.learning_rate_decay_half_op) 161 | 162 | # 5.report on test set 163 | #loss_test, f1_macro_test, f1_micro_test=do_eval(sess, model, test,num_classes, label2index) 
164 | #f1_score_test=((f1_macro_test + f1_micro_test) / 2.0) * 100.0
165 | #print("Test.Epoch %d TestLoss:%.3f\tF1_score:%.3f\tMacro_f1:%.3f\tMicro_f1:%.3f\t" % (epoch, loss_test, f1_score_test,f1_macro_test, f1_micro_test))
166 | print("training completed...")
167 |
168 | #sess,model,valid,iteration,num_classes,label2index
169 | num_fine_grain_type=20 # 20 fine-grained sentiment aspects
170 | num_fine_grain_value=4 # 4 kinds of values: [1,0,-1,-2]
171 | def do_eval(sess,model,valid,num_classes,label2index):
172 | """
173 | do evaluation using the validation set, and report loss and f1 score.
174 | :param sess:
175 | :param model:
176 | :param valid:
177 | :param num_classes:
178 | :param label2index:
179 | :return:
180 | """
181 | valid_x,valid_y=valid
182 | valid_x,valid_y=valid_x[0:64*80],valid_y[0:64*80] # evaluate on at most 64*80 validation examples to keep evaluation fast
183 | number_examples=valid_x.shape[0]
184 | print("number_examples for valid:",number_examples)
185 | eval_loss,eval_counter=0.0,0
186 | batch_size=FLAGS.batch_size
187 | label_dict=init_label_dict(num_classes)
188 | eval_macro_f1, eval_micro_f1 = 0.0,0.0
189 | for start,end in zip(range(0,number_examples,batch_size),range(batch_size,number_examples,batch_size)):
190 | feed_dict = {model.input_x: valid_x[start:end],model.input_y:valid_y[start:end],model.dropout_keep_prob: 1.0}
191 | curr_eval_loss, logits= sess.run([model.loss_val,model.logits],feed_dict) # logits:[batch_size,label_size]
192 | #compute confusion matrix
193 | label_dict=compute_confuse_matrix_batch(valid_y[start:end],logits,label_dict,name='bright')
194 | #for aspect_index in range(num_fine_grain_type):
195 | # start_sub=aspect_index*num_fine_grain_value
196 | # start_end=start_sub+num_fine_grain_value
197 | # valid_y_sub=valid_y[start:end][:,start_sub:start_end]
198 | # logits_sub=logits[start:end][:,start_sub:start_end]
199 | # label_dict=compute_confuse_matrix_batch(valid_y_sub[start:end],logits_sub,label_dict,name='bright')
200 | # if start%3000==0:
201 | # print("valid_y_sub:",valid_y_sub)
202 | # print("logits_sub:",logits_sub)
203 | eval_loss=eval_loss+curr_eval_loss
204 | eval_counter=eval_counter+1
205 | #compute f1_micro & f1_macro
206 | f1_micro,f1_macro=compute_micro_macro(label_dict) #label_dict is a dict, key is: a label, value is: (TP,FP,FN).
where TP is number of True Positive 207 | compute_f1_score_write_for_debug(label_dict,label2index) 208 | return eval_loss/float(eval_counter+small_value),f1_macro,f1_micro 209 | 210 | def optimistic_restore(session, save_file): 211 | """ 212 | restore only those variable that exists in the model 213 | :param session: 214 | :param save_file: 215 | :return: 216 | """ 217 | reader = tf.train.NewCheckpointReader(save_file) 218 | saved_shapes = reader.get_variable_to_shape_map() 219 | var_names = sorted([(var.name, var.name.split(':')[0]) for 220 | var in tf.global_variables() 221 | if var.name.split(':')[0] in saved_shapes]) 222 | restore_vars = [] 223 | name2var = dict(zip(map(lambda x: x.name.split(':')[0],tf.global_variables()),tf.global_variables())) 224 | with tf.variable_scope('', reuse=True): 225 | for var_name, saved_var_name in var_names: 226 | curr_var = name2var[saved_var_name] 227 | var_shape = curr_var.get_shape().as_list() 228 | if var_shape == saved_shapes[saved_var_name]: 229 | #print("going to restore.var_name:",var_name,";saved_var_name:",saved_var_name) 230 | restore_vars.append(curr_var) 231 | else: 232 | print("variable not trained.var_name:",var_name) 233 | saver = tf.train.Saver(restore_vars) 234 | saver.restore(session, save_file) 235 | 236 | if __name__ == "__main__": 237 | tf.app.run() -------------------------------------------------------------------------------- /train_cnn_lm.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | #process--->1.load data(X,y). 2.create session. 3.feed data. 4.training (5.validation) ,(6.prediction) 3 | 4 | """ 5 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 6 | main idea: based on multiple layer self-attention model(encoder of Transformer), pretrain two tasks( masked language model and next sentence prediction task) 7 | on large scale of corpus, then fine-tuning by add a single classification layer. 8 | train the model(transformer) with data enhanced by pre-training of two tasks. 9 | default hyperparameter is d_model=512,h=8,d_k=d_v=64(big). if you have a small data set or want to train a 10 | small model, use d_model=128,h=8,d_k=d_v=16(small), or d_model=64,h=8,d_k=d_v=8(tiny). 11 | """ 12 | import tensorflow as tf 13 | import numpy as np 14 | #from model.bert_model import BertModel # TODO TODO TODO test whether pretrain can boost perofrmance with other model 15 | from model.bert_cnn_fine_grain_model import BertCNNFineGrainModel as BertModel 16 | from data_util_hdf5 import create_or_load_vocabulary,load_data_multilabel,assign_pretrained_word_embedding,set_config 17 | import os 18 | from evaluation_matrix import * 19 | from pretrain_task import mask_language_model,mask_language_model_multi_processing 20 | from model.config import Config 21 | import random 22 | 23 | #configuration 24 | FLAGS=tf.app.flags.FLAGS 25 | 26 | tf.app.flags.DEFINE_boolean("test_mode",False,"whether it is test mode. 
if it is test mode, only small percentage of data will be used") 27 | tf.app.flags.DEFINE_string("data_path","./data/","path of traning data.") 28 | tf.app.flags.DEFINE_string("mask_lm_source_file","./data/sentiment_analysis_all.csv","path of traning data.") # sentiment_analysis_all.csv is concat of training and testb of this task, which are both csv format file 29 | tf.app.flags.DEFINE_string("ckpt_dir","./checkpoint_lm/","checkpoint location for the model") #save to here, so make it easy to upload for test 30 | tf.app.flags.DEFINE_integer("vocab_size",70000,"maximum vocab size.") 31 | tf.app.flags.DEFINE_integer("d_model", 200, "dimension of model") # 512-->128 32 | tf.app.flags.DEFINE_integer("num_layer", 6, "number of layer") 33 | tf.app.flags.DEFINE_integer("num_header", 8, "number of header") 34 | tf.app.flags.DEFINE_integer("d_k", 8, "dimension of k") # 64 35 | tf.app.flags.DEFINE_integer("d_v", 8, "dimension of v") # 64 36 | 37 | tf.app.flags.DEFINE_string("tokenize_style","word","checkpoint location for the model") 38 | tf.app.flags.DEFINE_integer("max_allow_sentence_length",10,"max length of allowed sentence for masked language model") 39 | tf.app.flags.DEFINE_float("learning_rate",0.0001,"learning rate") #0.001 40 | tf.app.flags.DEFINE_integer("batch_size", 64, "Batch size for training/evaluating.") 41 | tf.app.flags.DEFINE_integer("decay_steps", 1000, "how many steps before decay learning rate.") 42 | tf.app.flags.DEFINE_float("decay_rate", 1.0, "Rate of decay for learning rate.") 43 | tf.app.flags.DEFINE_float("dropout_keep_prob", 0.9, "percentage to keep when using dropout.") 44 | tf.app.flags.DEFINE_integer("sequence_length",200,"max sentence length")#400 45 | tf.app.flags.DEFINE_integer("sequence_length_lm",10,"max sentence length for masked language model") 46 | tf.app.flags.DEFINE_boolean("is_training",True,"is training.true:tranining,false:testing/inference") 47 | tf.app.flags.DEFINE_boolean("is_fine_tuning",False,"is_finetuning.ture:this is fine-tuning stage") 48 | tf.app.flags.DEFINE_integer("num_epochs",30,"number of epochs to run.") 49 | tf.app.flags.DEFINE_integer("validate_every", 1, "Validate every validate_every epochs.") 50 | tf.app.flags.DEFINE_boolean("use_pretrained_embedding",False,"whether to use embedding or not.")# 51 | tf.app.flags.DEFINE_string("word2vec_model_path","./data/Tencent_AILab_ChineseEmbedding_100w.txt","word2vec's vocabulary and vectors") # data/sgns.target.word-word.dynwin5.thr10.neg5.dim300.iter5--->data/news_12g_baidubaike_20g_novel_90g_embedding_64.bin--->sgns.merge.char 52 | tf.app.flags.DEFINE_integer("process_num",35,"number of cpu process") 53 | 54 | def main(_): 55 | vocab_word2index, _= create_or_load_vocabulary(FLAGS.data_path,FLAGS.mask_lm_source_file,FLAGS.vocab_size,test_mode=FLAGS.test_mode,tokenize_style=FLAGS.tokenize_style) 56 | vocab_size = len(vocab_word2index);print("bert_pertrain_lm.vocab_size:",vocab_size) 57 | index2word={v:k for k,v in vocab_word2index.items()} 58 | #train,valid,test=mask_language_model(FLAGS.mask_lm_source_file,FLAGS.data_path,index2word,max_allow_sentence_length=FLAGS.max_allow_sentence_length,test_mode=FLAGS.test_mode) 59 | train, valid, test = mask_language_model(FLAGS.mask_lm_source_file, FLAGS.data_path, index2word, max_allow_sentence_length=FLAGS.max_allow_sentence_length,test_mode=FLAGS.test_mode, process_num=FLAGS.process_num) 60 | 61 | train_X, train_y,train_p = train 62 | valid_X, valid_y,valid_p = valid 63 | test_X,test_y,test_p = test 64 | 65 | print("length of training 
data:",train_X.shape,";train_Y:",train_y.shape,";train_p:",train_p.shape,";valid data:",valid_X.shape,";test data:",test_X.shape) 66 | # 1.create session. 67 | gpu_config=tf.ConfigProto() 68 | gpu_config.gpu_options.allow_growth=True 69 | with tf.Session(config=gpu_config) as sess: 70 | #Instantiate Model 71 | config=set_config(FLAGS,vocab_size,vocab_size) 72 | model=BertModel(config) 73 | #Initialize Save 74 | saver=tf.train.Saver() 75 | if os.path.exists(FLAGS.ckpt_dir+"checkpoint"): 76 | print("Restoring Variables from Checkpoint.") 77 | saver.restore(sess,tf.train.latest_checkpoint(FLAGS.ckpt_dir)) 78 | for i in range(2): #decay learning rate if necessary. 79 | print(i,"Going to decay learning rate by half.") 80 | sess.run(model.learning_rate_decay_half_op) 81 | else: 82 | print('Initializing Variables') 83 | sess.run(tf.global_variables_initializer()) 84 | if FLAGS.use_pretrained_embedding: 85 | vocabulary_index2word={index:word for word,index in vocab_word2index.items()} 86 | assign_pretrained_word_embedding(sess, vocabulary_index2word, vocab_size,FLAGS.word2vec_model_path,model.embedding,config.d_model) # assign pretrained word embeddings 87 | curr_epoch=sess.run(model.epoch_step) 88 | 89 | # 2.feed data & training 90 | number_of_training_data=len(train_X) 91 | print("number_of_training_data:",number_of_training_data) 92 | batch_size=FLAGS.batch_size 93 | iteration=0 94 | score_best=-100 95 | for epoch in range(curr_epoch,FLAGS.num_epochs): 96 | loss_total_lm, counter = 0.0, 0 97 | for start, end in zip(range(0, number_of_training_data, batch_size),range(batch_size, number_of_training_data, batch_size)): 98 | iteration=iteration+1 99 | if epoch==0 and counter==0: 100 | print("trainX[start:end]:",train_X[start:end],"train_X.shape:",train_X.shape) 101 | feed_dict = {model.x_mask_lm: train_X[start:end],model.y_mask_lm: train_y[start:end],model.p_mask_lm:train_p[start:end], 102 | model.dropout_keep_prob: FLAGS.dropout_keep_prob} 103 | current_loss_lm,lr,l2_loss,_=sess.run([model.loss_val_lm,model.learning_rate,model.l2_loss_lm,model.train_op_lm],feed_dict) 104 | loss_total_lm,counter=loss_total_lm+current_loss_lm,counter+1 105 | if counter %30==0: 106 | print("%d\t%d\tLearning rate:%.5f\tLoss_lm:%.3f\tCurrent_loss_lm:%.3f\tL2_loss:%.3f\t"%(epoch,counter,lr,float(loss_total_lm)/float(counter),current_loss_lm,l2_loss)) 107 | if start!=0 and start%(3000*FLAGS.batch_size)==0: # epoch!=0 108 | loss_valid, acc_valid= do_eval(sess, model, valid,batch_size) 109 | print("%d\tValid.Epoch %d ValidLoss:%.3f\tAcc_valid:%.3f\t" % (counter,epoch, loss_valid, acc_valid*100)) 110 | # save model to checkpoint 111 | if acc_valid>score_best: 112 | save_path = FLAGS.ckpt_dir + "model.ckpt" 113 | print("going to save check point.") 114 | saver.save(sess, save_path, global_step=epoch) 115 | score_best=acc_valid 116 | sess.run(model.epoch_increment) 117 | 118 | validation_size=100*FLAGS.batch_size 119 | def do_eval(sess,model,valid,batch_size): 120 | """ 121 | do evaluation using validation set, and report loss, and f1 score. 
122 | :param sess: 123 | :param model: 124 | :param valid: 125 | :param num_classes: 126 | :return: 127 | """ 128 | valid_X,valid_y,valid_p=valid 129 | number_examples=valid_X.shape[0] 130 | if number_examples>10000: 131 | number_examples=validation_size 132 | print("do_eval.valid.number_examples:",number_examples) 133 | if number_examples>validation_size: valid_X,valid_y,valid_p=valid_X[0:validation_size],valid_y[0:validation_size],valid_p[0:validation_size] 134 | eval_loss,eval_counter,eval_acc=0.0,0,0.0 135 | for start,end in zip(range(0,number_examples,batch_size),range(batch_size,number_examples,batch_size)): 136 | feed_dict = {model.x_mask_lm: valid_X[start:end],model.y_mask_lm: valid_y[start:end],model.p_mask_lm:valid_p[start:end], 137 | model.dropout_keep_prob: 1.0} # FLAGS.dropout_keep_prob 138 | curr_eval_loss, logits_lm, accuracy_lm= sess.run([model.loss_val_lm,model.logits_lm,model.accuracy_lm],feed_dict) # logits:[batch_size,label_size] 139 | eval_loss=eval_loss+curr_eval_loss 140 | eval_acc=eval_acc+accuracy_lm 141 | eval_counter=eval_counter+1 142 | return eval_loss/float(eval_counter+small_value), eval_acc/float(eval_counter+small_value) 143 | 144 | if __name__ == "__main__": 145 | tf.app.run() -------------------------------------------------------------------------------- /word2vec/README.txt: -------------------------------------------------------------------------------- 1 | save word2vec here. --------------------------------------------------------------------------------
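A closing note on the masked-language-model inputs used by train_cnn_lm.py above: the script feeds x_mask_lm, y_mask_lm and p_mask_lm, which mask_language_model in pretrain_task.py builds from raw sentences. pretrain_task.py is not included in this listing, so the exact format is not shown here; the sketch below is only a hedged illustration of the kind of (masked sequence, target token id, masked position) triple those feed keys suggest, with the helper name and MASK_ID chosen for the example rather than taken from the repository:

import random

MASK_ID = 3  # assumed id of a [MASK]-style token; the real id would come from the vocabulary

def make_masked_lm_example(token_ids, seq_len=10):
    # pad/truncate to seq_len (cf. sequence_length_lm above), pick one position,
    # record its original id as the label, and hide it in the input
    ids = (list(token_ids) + [0] * seq_len)[:seq_len]
    p = random.randrange(min(len(token_ids), seq_len))
    y = ids[p]
    x = list(ids)
    x[p] = MASK_ID
    return x, y, p

x, y, p = make_masked_lm_example([17, 52, 9, 41, 8])
print(x, y, p)  # e.g. [17, 52, 3, 41, 8, 0, 0, 0, 0, 0] 9 2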