├── README.txt
├── eval_script.sh
├── label_corpus.sh
├── models
│   └── thresh_dict.json
├── requirements.txt
├── src
│   ├── config.py
│   ├── data_aug.py
│   ├── data_augmentation.py
│   ├── dataset.py
│   ├── eval.py
│   ├── eval_ensemble.py
│   ├── eval_ensemble_final.py
│   ├── eval_ensemble_round2.py
│   ├── eval_round2.py
│   ├── finetune_cv.py
│   ├── lr_scheduler.py
│   ├── model.py
│   ├── pretrain.py
│   ├── pretrain2.py
│   ├── pretrain2_cv.py
│   ├── test_cv.py
│   ├── test_ensemble_cv.py
│   ├── train.py
│   ├── train_cv.py
│   └── train_round2.py
└── train_script.sh

/README.txt:
--------------------------------------------------------------------------------
1 | 0. Runtime environment
2 | Software:
3 | Ubuntu 18.04
4 | Python: 3.6.5
5 | PyTorch: 1.1.0
6 | CUDA: 9.0
7 | CUDNN: 7.1.3
8 | Hardware:
9 | GPU: single GTX 1080, 8 GB
10 | RAM: 16 GB
11 | CPU: i7-7700
12 |
13 | 1. Install the dependencies in requirements.txt
14 | From the directory containing requirements.txt, run:
15 | pip install -r requirements.txt
16 |
17 | 2. Training
18 | For one-click training, run sh train_script.sh (this may take a long time).
19 |
20 | Or run the steps separately:
21 | 2.1 Pretraining on the unlabeled laptop corpus + labeled makeup data
22 | In the src/ directory, run:
23 | python pretrain.py --base_model roberta
24 | python pretrain.py --base_model wwm
25 | python pretrain.py --base_model ernie
26 | This pretrains the roberta, wwm and ernie models respectively; the weights are saved under models/ as
27 | pretrained_roberta, pretrained_wwm, pretrained_ernie
28 | Pretraining takes a long time and can be skipped; to save time, use the weights we provide directly in the fine-tuning step below.
29 |
30 | 2.2 Cross-validation fine-tuning on the labeled laptop data
31 | In the src/ directory, run:
32 | python finetune_cv.py --base_model roberta
33 | python finetune_cv.py --base_model wwm
34 | python finetune_cv.py --base_model ernie
35 | This fine-tunes each of the three pretrained models; the weights are saved under models/ with names like roberta_cvX, where X is the CV fold index (5 folds, so X runs from 0 to 4).
36 | During fine-tuning, each model's best filtering threshold on its validation fold is saved to models/thresh_dict.json.
37 |
38 | 3. Testing
39 | For one-click testing, run sh eval_script.sh
40 |
41 | Or:
42 | In the src/ directory, run:
43 | python eval_ensemble_round2.py
44 | The result is saved as submit/Result.csv (a sketch of the ensemble logic is appended at the end of this file).
45 |
46 | We provide the weights of one single-model fold, models/roberta_cv2, so one-click testing produces an output; its score should be slightly below the online result.
47 |
48 | 4. Algorithm
49 | Model highlights: one-stage and end-to-end, with no separate entity-extraction and relation-classification steps -- OpinioNet Only Looks Once.
50 | The base model is a one-stage end-to-end entity-relation extraction model built on a BERT pretrained backbone, using roberta-wwm, bert-wwm, ernie and other initial pretrained models.
51 | Given the characteristics of the final-round data, the basic pipeline is as follows:
52 | 1. Train jointly on MLM and the current downstream task, using the unlabeled laptop corpus and the labeled makeup data.
53 | 2. Fine-tune the resulting model with cross-validation on the small amount of labeled laptop data, transferring it to the laptop domain.
54 | 3. Ensemble the results of the different initial pretrained models.
55 |
56 | 5. Iteration notes
57 | Best final-round score, inference time: 5 min
58 | Best final-round score, training time: about 24 h
59 |
60 | Iteration history:
61 | 6 valid submissions in total from the preliminary round through the end of the final round:
62 | Preliminary round:
63 | 1. 0.7868 -- preliminary, single model, CV (Aug 28)
64 | 2. 0.7793 -- preliminary, single model, single fold (Aug 29)
65 | Final round:
66 | 1. 0.7892 -- final, single model, single fold (Sep 26)
67 | 2. 0.8109 -- final, single model, CV (Sep 27)
68 | * 3. 0.8224 -- final, ensemble of CV models (Sep 28)
69 | 4. 0.8218 -- fine-tuning with data augmentation; it did not help (Sep 30)
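70 |
71 | 6. Ensemble inference sketch (for reference)
72 | The snippet below is a condensed, illustrative outline of how the per-fold thresholds stored in models/thresh_dict.json are combined at inference time; the authoritative version is src/eval_ensemble_final.py.
73 | It assumes the fine-tuned weight files exist under models/, that ../data/TEST/Test_reviews.csv is the file to score, and that ../submit/ exists; the batch size of 64 is arbitrary.
74 |
75 |     import json
76 |     import os.path as osp
77 |     import torch
78 |     from torch.utils.data import DataLoader
79 |     from pytorch_pretrained_bert import BertTokenizer
80 |     from config import PRETRAINED_MODELS
81 |     from dataset import ReviewDataset
82 |     from model import OpinioNet
83 |     from eval_ensemble_final import eval_epoch, accum_result, average_result, gen_submit
84 |
85 |     with open('../models/thresh_dict.json', 'r', encoding='utf-8') as f:
86 |         thresh_dict = json.load(f)
87 |
88 |     ret, raw, num_model = None, None, 0
89 |     for weight_name, info in thresh_dict.items():
90 |         if info['name'] not in PRETRAINED_MODELS or not osp.isfile('../models/' + weight_name):
91 |             continue  # skip folds whose weight files are not present
92 |         cfg = PRETRAINED_MODELS[info['name']]
93 |         tokenizer = BertTokenizer.from_pretrained(cfg['path'], do_lower_case=True)
94 |         dataset = ReviewDataset('../data/TEST/Test_reviews.csv', None, tokenizer, 'laptop')
95 |         loader = DataLoader(dataset, 64, collate_fn=dataset.batchify, shuffle=False)
96 |         if raw is None:
97 |             raw = [s[0][0] for s in dataset.samples]  # raw review strings, used to recover spans
98 |         model = OpinioNet.from_pretrained(cfg['path'], version=cfg['version'], focal=cfg['focal'])
99 |         model.load_state_dict(torch.load('../models/' + weight_name))
100 |         model.cuda()
101 |         # each fold filters its own candidates with the threshold found on its validation split
102 |         ret = accum_result(ret, eval_epoch(model, loader, info['thresh']))
103 |         num_model += 1
104 |         del model
105 |     ret = average_result(ret, num_model)   # average candidate scores over the folds
106 |     ret = OpinioNet.nms_filter(ret, 0.28)  # final global NMS threshold, as in eval_ensemble_final.py
107 |     gen_submit(ret, raw).to_csv('../submit/Result.csv', header=False, index=False)
108 |
109 | Averaging the per-fold candidate scores before the final nms_filter is what lets models fine-tuned with different thresholds vote on the same candidate opinions.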
--------------------------------------------------------------------------------
/eval_script.sh:
--------------------------------------------------------------------------------
1 | cd src
2 | python eval_ensemble_round2.py
--------------------------------------------------------------------------------
/label_corpus.sh:
--------------------------------------------------------------------------------
1 | cd src
2 | python eval_ensemble_final.py --gen_label \
3 | --bs 64 \
4 | --rv ../data/TRAIN/Train_laptop_corpus.csv \
5 | --o ../data/TRAIN/Train_laptop_corpus_labels0.csv \
6 | --labelfold 0
7 |
8 | python eval_ensemble_final.py --gen_label \
9 | --bs 64 \
10 | --rv ../data/TRAIN/Train_laptop_corpus.csv \
11 | --o ../data/TRAIN/Train_laptop_corpus_labels1.csv \
12 | --labelfold 1
13 |
14 | python eval_ensemble_final.py --gen_label \
15 | --bs 64 \
16 | --rv ../data/TRAIN/Train_laptop_corpus.csv \
17 | --o ../data/TRAIN/Train_laptop_corpus_labels2.csv \
18 | --labelfold 2
19 |
20 | python eval_ensemble_final.py --gen_label \
21 | --bs 64 \
22 | --rv ../data/TRAIN/Train_laptop_corpus.csv \
23 | --o ../data/TRAIN/Train_laptop_corpus_labels3.csv \
24 | --labelfold 3
25 |
26 | python eval_ensemble_final.py --gen_label \
27 | --bs 64 \
28 | --rv ../data/TRAIN/Train_laptop_corpus.csv \
29 | --o ../data/TRAIN/Train_laptop_corpus_labels4.csv \
30 | --labelfold 4
--------------------------------------------------------------------------------
/models/thresh_dict.json:
--------------------------------------------------------------------------------
1 | {"roberta_cv0": {"name": "roberta", "thresh": 0.5, "f1": 0.3868520859671303}, "roberta_cv1": {"name": "roberta", "thresh": 0.35, "f1": 0.3868520859671303}, "roberta_cv2": {"name": "roberta", "thresh": 0.55, "f1": 0.3868520859671303}, "roberta_cv3": {"name": "roberta", "thresh": 0.5, "f1": 0.3868520859671303}, "roberta_cv4": {"name": "roberta", "thresh": 0.45, "f1": 0.3868520859671303}, "wwm_cv0": {"name": "wwm", "thresh": 0.4500000000000001, "f1": 0.3868520859671303}, "wwm_cv1": {"name": "wwm", "thresh": 0.600000000000001, "f1": 0.3868520859671303}, "wwm_cv2": {"name": "wwm", "thresh": 0.3500000000000001, "f1": 0.3868520859671303}, "wwm_cv3": {"name": "wwm", "thresh": 0.500000000000001, "f1": 0.3868520859671303}, "wwm_cv4": {"name": "wwm", "thresh": 0.600000000000001, "f1": 0.3868520859671303}, "ernie_cv1": {"name": "ernie", "thresh": 0.7500000000000001, "f1": 0.3868520859671303}, "ernie_cv2": {"name": "ernie", "thresh": 0.500000000000001, "f1": 0.3868520859671303}, "ernie_cv3": {"name": "ernie", "thresh": 0.700000000000001, "f1": 0.3868520859671303}, "ernie_cv4": {"name": "ernie", "thresh": 0.7500000000000001, "f1": 0.3868520859671303}, "roberta_focal_cv0": {"name": "roberta_focal", "thresh": 0.35, "f1": 0.7833333333333333}, "roberta_focal_cv1": {"name": "roberta_focal", "thresh": 0.3749999999999999, "f1": 0.8455114822546973}, "roberta_focal_cv2": {"name": "roberta_focal", "thresh": 0.22499999999999998, "f1": 0.8355739400206825}, "roberta_focal_cv3": {"name": "roberta_focal", "thresh": 0.47499999999999987, "f1": 0.7962382445141065}, "roberta_focal_cv4": {"name": "roberta_focal", "thresh": 0.47499999999999987, "f1": 0.8249496981891348}, "wwm_focal_cv0": {"name": "wwm_focal", "thresh": 0.22499999999999998, "f1": 0.7887323943661972}, "wwm_focal_cv1": {"name": "wwm_focal", "thresh": 0.3749999999999999, "f1": 0.8340248962655602}, "wwm_focal_cv2": {"name": "wwm_focal", "thresh": 0.1, "f1": 0.8231644260599793},
"wwm_focal_cv3": {"name": "wwm_focal", "thresh": 0.3749999999999999, "f1": 0.7732497387669802}, "wwm_focal_cv4": {"name": "wwm_focal", "thresh": 0.3749999999999999, "f1": 0.811740890688259}, "ernie_focal_cv0": {"name": "ernie_focal", "thresh": 0.1, "f1": 0.7650602409638554}, "ernie_focal_cv1": {"name": "ernie_focal", "thresh": 0.5499999999999999, "f1": 0.8254310344827587}, "ernie_focal_cv2": {"name": "ernie_focal", "thresh": 0.3999999999999999, "f1": 0.8250265111346765}, "ernie_focal_cv3": {"name": "ernie_focal", "thresh": 0.3999999999999999, "f1": 0.7950052029136315}, "ernie_focal_cv4": {"name": "ernie_focal", "thresh": 0.3999999999999999, "f1": 0.8084677419354838}, "roberta_tiny_cv0": {"name": "roberta_tiny", "thresh": 0.44999999999999996, "f1": 0.7733050847457626}, "roberta_tiny_cv1": {"name": "roberta_tiny", "thresh": 0.47499999999999987, "f1": 0.8273684210526316}, "roberta_tiny_cv2": {"name": "roberta_tiny", "thresh": 0.5999999999999999, "f1": 0.8401727861771057}, "roberta_tiny_cv3": {"name": "roberta_tiny", "thresh": 0.44999999999999996, "f1": 0.7889344262295082}, "roberta_tiny_cv4": {"name": "roberta_tiny", "thresh": 0.19999999999999998, "f1": 0.8202137998056366}, "wwm_tiny_cv0": {"name": "wwm_tiny", "thresh": 0.24999999999999997, "f1": 0.7817258883248731}, "wwm_tiny_cv1": {"name": "wwm_tiny", "thresh": 0.27499999999999997, "f1": 0.8353909465020576}, "wwm_tiny_cv2": {"name": "wwm_tiny", "thresh": 0.22499999999999998, "f1": 0.8165803108808289}, "wwm_tiny_cv3": {"name": "wwm_tiny", "thresh": 0.42499999999999993, "f1": 0.786008230452675}, "wwm_tiny_cv4": {"name": "wwm_tiny", "thresh": 0.27499999999999997, "f1": 0.8113391984359726}, "ernie_tiny_cv0": {"name": "ernie_tiny", "thresh": 0.125, "f1": 0.7827827827827828}, "ernie_tiny_cv1": {"name": "ernie_tiny", "thresh": 0.42499999999999993, "f1": 0.8366701791359327}, "ernie_tiny_cv2": {"name": "ernie_tiny", "thresh": 0.29999999999999993, "f1": 0.8230366492146597}, "ernie_tiny_cv3": {"name": "ernie_tiny", "thresh": 0.47499999999999987, "f1": 0.7923728813559322}, "ernie_tiny_cv4": {"name": "ernie_tiny", "thresh": 0.4999999999999999, "f1": 0.8109756097560975}, "roberta_focal2_cv0": {"name": "roberta_focal2", "thresh": 0.6500000000000001, "f1": 0.7983706720977596}, "roberta_focal2_cv1": {"name": "roberta_focal2", "thresh": 0.7000000000000002, "f1": 0.8529411764705883}, "roberta_focal2_cv2": {"name": "roberta_focal2", "thresh": 0.8000000000000002, "f1": 0.8365591397849463}, "roberta_focal2_cv3": {"name": "roberta_focal2", "thresh": 0.7500000000000002, "f1": 0.7915789473684212}, "roberta_focal2_cv4": {"name": "roberta_focal2", "thresh": 0.5500000000000002, "f1": 0.8279252704031467}} -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | jieba 2 | torch==1.1.0 3 | pytorch-pretrained-bert==0.6.2 4 | tqdm==4.24.0 -------------------------------------------------------------------------------- /src/config.py: -------------------------------------------------------------------------------- 1 | PRETRAINED_MODELS = { 2 | 'roberta': { 3 | 'name': 'roberta', 4 | 'path': '../models/chinese_roberta_wwm_ext_pytorch', 5 | 'lr': 6e-6, 6 | 'version': 'large', 7 | 'focal': False 8 | }, 9 | 'wwm': { 10 | 'name': 'wwm', 11 | 'path': '../models/chinese_wwm_ext_pytorch', 12 | 'lr': 6e-6, 13 | 'version': 'large', 14 | 'focal': False 15 | }, 16 | 'ernie': { 17 | 'name': 'ernie', 18 | 'path': '../models/ERNIE', 19 | 'lr': 8e-6, 20 | 
'version': 'large', 21 | 'focal': False 22 | }, 23 | 'roberta_focal': { 24 | 'name': 'roberta_focal', 25 | 'path': '../models/chinese_roberta_wwm_ext_pytorch', 26 | 'lr': 6e-6, 27 | 'version': 'large', 28 | 'focal': True 29 | }, 30 | 'wwm_focal': { 31 | 'name': 'wwm_focal', 32 | 'path': '../models/chinese_wwm_ext_pytorch', 33 | 'lr': 6e-6, 34 | 'version': 'large', 35 | 'focal': True 36 | }, 37 | 'ernie_focal': { 38 | 'name': 'ernie_focal', 39 | 'path': '../models/ERNIE', 40 | 'lr': 8e-6, 41 | 'version': 'large', 42 | 'focal': True 43 | }, 44 | 45 | 'roberta_tiny': { 46 | 'name': 'roberta_tiny', 47 | 'path': '../models/chinese_roberta_wwm_ext_pytorch', 48 | 'lr': 6e-6, 49 | 'version': 'tiny', 50 | 'focal': True 51 | }, 52 | 'wwm_tiny': { 53 | 'name': 'wwm_tiny', 54 | 'path': '../models/chinese_wwm_ext_pytorch', 55 | 'lr': 6e-6, 56 | 'version': 'tiny', 57 | 'focal': True 58 | }, 59 | 'ernie_tiny': { 60 | 'name': 'ernie_tiny', 61 | 'path': '../models/ERNIE', 62 | 'lr': 8e-6, 63 | 'version': 'tiny', 64 | 'focal': True 65 | }, 66 | 'roberta_focal2': { 67 | 'name': 'roberta_focal2', 68 | 'path': '../models/chinese_roberta_wwm_ext_pytorch', 69 | 'lr': 6e-6, 70 | 'version': 'large', 71 | 'focal': True 72 | }, 73 | 'roberta2': { 74 | 'name': 'roberta2', 75 | 'path': '../models/chinese_roberta_wwm_ext_pytorch', 76 | 'lr': 6e-6, 77 | 'version': 'large', 78 | 'focal': False 79 | }, 80 | 'ernie_tiny2': { 81 | 'name': 'ernie_tiny2', 82 | 'path': '../models/ERNIE', 83 | 'lr': 8e-6, 84 | 'version': 'large', 85 | 'focal': False 86 | } 87 | } -------------------------------------------------------------------------------- /src/data_aug.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import synonyms 3 | import pandas as pd 4 | 5 | def is_intersec(s1, e1, s2, e2): 6 | if min(e1, e2) > max(s1, s2): 7 | return True 8 | return False 9 | 10 | def aug_single(): 11 | pass 12 | 13 | def aug_df(reviews_df, labels_df, op, n=3): 14 | for idx in reviews_df.index: 15 | id = reviews_df.loc[idx, 'id'] 16 | rv = reviews_df.loc[idx, 'Reviews'] 17 | for i in reversed(range(len(rv))): 18 | if rv[i].strip() == '': 19 | for j in labels_df[labels_df['id'] == id].index: 20 | lb = labels_df[labels_df['id'] == id].loc[j] 21 | a_s = lb['A_start'].strip() 22 | a_e = lb['A_end'].strip() 23 | if a_s != '' and a_e != '': 24 | a_s = int(a_s) 25 | a_e = int(a_e) 26 | if a_s > i: 27 | a_s -= 1 28 | a_e -= 1 29 | labels_df.loc[j, 'A_start'] = str(a_s) 30 | labels_df.loc[j, 'A_end'] = str(a_e) 31 | o_s = lb['O_start'].strip() 32 | o_e = lb['O_end'].strip() 33 | if o_s != '' and o_e != '': 34 | o_s = int(o_s) 35 | o_e = int(o_e) 36 | if o_s > i: 37 | o_s -= 1 38 | o_e -= 1 39 | labels_df.loc[j, 'O_start'] = str(o_s) 40 | labels_df.loc[j, 'O_end'] = str(o_e) 41 | 42 | rv = rv.replace(' ', '') 43 | 44 | still_spans = [] 45 | for i in labels_df[labels_df['id'] == id].index: 46 | lb = labels_df.loc[i] 47 | a_s = lb['A_start'].strip() 48 | a_e = lb['A_end'].strip() 49 | if a_s != '' and a_e != '': 50 | still_spans.append((int(a_s), int(a_e))) 51 | o_s = lb['O_start'].strip() 52 | o_e = lb['O_end'].strip() 53 | if o_s != '' and o_e != '': 54 | still_spans.append((int(o_s), int(o_e))) 55 | 56 | still_spans.sort(key=lambda x: x[0]) 57 | 58 | rv_tokens = synonyms.seg(rv)[0] 59 | editable_tokens = [] 60 | editable_spans = [] 61 | cur = 0 62 | for i in range(len(rv_tokens)): 63 | 64 | end = cur + len(rv_tokens[i]) 65 | editable = True 66 | for span in still_spans: 67 | if is_intersec(cur, 
end, span[0], span[1]): 68 | editable = False 69 | break 70 | if editable and (rv_tokens[i] not in [',', ',', '!', '。', '*', '?', '?']): 71 | editable_spans.append((cur, end)) 72 | editable_tokens.append(rv_tokens[i]) 73 | cur = end 74 | 75 | if not editable_tokens: 76 | continue 77 | 78 | rv_list = list(rv) 79 | if op == 'delete' or op == 'replace' or op == 'insert': 80 | to_edit = sorted(np.random.choice(range(len(editable_tokens)), size=min(len(editable_tokens), n), replace=False), 81 | reverse=True) 82 | for ii in to_edit: 83 | span = editable_spans[ii] 84 | token = editable_tokens[ii] 85 | if op == 'delete' or op == 'replace': 86 | left, right = span 87 | if op == 'delete': 88 | target_token = '' 89 | else: 90 | candi, probs = synonyms.nearby(token) 91 | if len(candi) <= 1: 92 | target_token = '' 93 | else: 94 | probs = np.array(probs[1:]) / sum(probs[1:]) 95 | target_token = np.random.choice(candi[1:], p=probs) 96 | else: 97 | left, right = span[-1], span[-1] 98 | token = '' 99 | candi, probs = synonyms.nearby(editable_tokens[ii]) 100 | if len(candi) <= 1: 101 | target_token = '' 102 | else: 103 | probs = np.array(probs[1:]) / sum(probs[1:]) 104 | target_token = np.random.choice(candi[1:], p=probs) 105 | 106 | shift = len(target_token)-len(token) 107 | 108 | for i in labels_df[labels_df['id'] == id].index: 109 | lb = labels_df.loc[i] 110 | a_s = lb['A_start'].strip() 111 | a_e = lb['A_end'].strip() 112 | if a_s != '' and a_e != '': 113 | a_s = int(a_s) 114 | a_e = int(a_e) 115 | if a_s >= span[-1]: 116 | a_s += shift 117 | a_e += shift 118 | labels_df.loc[i, 'A_start'] = str(a_s) 119 | labels_df.loc[i, 'A_end'] = str(a_e) 120 | o_s = lb['O_start'].strip() 121 | o_e = lb['O_end'].strip() 122 | if o_s != '' and o_e != '': 123 | o_s = int(o_s) 124 | o_e = int(o_e) 125 | if o_s >= span[-1]: 126 | o_s += shift 127 | o_e += shift 128 | labels_df.loc[i, 'O_start'] = str(o_s) 129 | labels_df.loc[i, 'O_end'] = str(o_e) 130 | print(token) 131 | print(''.join(rv_list[:left]), ''.join(rv_list[right:])) 132 | rv_list = rv_list[:left] + list(target_token) + rv_list[right:] 133 | 134 | elif op == 'swap': 135 | cur_time = 0 136 | if len(editable_tokens) < 2: 137 | continue 138 | if len(editable_tokens) == 2: 139 | time = 1 140 | else: 141 | time = n 142 | while cur_time != time: 143 | idx0, idx1 = sorted(np.random.choice(range(len(editable_tokens)), size=2, replace=False)) 144 | token0, token1 = editable_tokens[idx0], editable_tokens[idx1] 145 | span0, span1 = editable_spans[idx0], editable_spans[idx1] 146 | print(token0, token1) 147 | editable_tokens[idx0], editable_tokens[idx1] = token1, token0 148 | if len(token0) != len(token1): 149 | shift = len(token1) - len(token0) 150 | editable_spans[idx0] = (span0[0], span0[0]+len(token1)) 151 | editable_spans[idx1] = (span1[0]+shift, span1[0] + shift + len(token0)) 152 | 153 | for idx_edt in range(len(editable_tokens)): 154 | cur_span = editable_spans[idx_edt] 155 | if cur_span[0] >= span0[1] and cur_span[1] <= span1[0]: 156 | editable_spans[idx_edt] = (cur_span[0]+shift, cur_span[1]+shift) 157 | 158 | for i in labels_df[labels_df['id'] == id].index: 159 | lb = labels_df.loc[i] 160 | a_s = lb['A_start'].strip() 161 | a_e = lb['A_end'].strip() 162 | if a_s != '' and a_e != '': 163 | a_s = int(a_s) 164 | a_e = int(a_e) 165 | if a_s >= span0[1] and a_e <= span1[0]: 166 | a_s += shift 167 | a_e += shift 168 | labels_df.loc[i, 'A_start'] = str(a_s) 169 | labels_df.loc[i, 'A_end'] = str(a_e) 170 | o_s = lb['O_start'].strip() 171 | o_e = lb['O_end'].strip() 172 
| if o_s != '' and o_e != '': 173 | o_s = int(o_s) 174 | o_e = int(o_e) 175 | if o_s >= span0[1] and o_e <= span1[0]: 176 | o_s += shift 177 | o_e += shift 178 | labels_df.loc[i, 'O_start'] = str(o_s) 179 | labels_df.loc[i, 'O_end'] = str(o_e) 180 | 181 | rv_list = rv_list[:span0[0]] + list(token1) + rv_list[span0[1]: span1[0]] + list(token0) + rv_list[span1[1]:] 182 | 183 | cur_time += 1 184 | 185 | rv_new = ''.join(rv_list) 186 | reviews_df.loc[idx, 'Reviews'] = rv_new 187 | print(rv) 188 | print(rv_new) 189 | print(labels_df[labels_df['id'] == id]) 190 | 191 | return reviews_df, labels_df 192 | 193 | if __name__ == '__main__': 194 | reviews_df = pd.read_csv('../data/TRAIN/Train_laptop_reviews.csv', encoding='utf-8') 195 | labels_df = pd.read_csv('../data/TRAIN/Train_laptop_labels.csv', encoding='utf-8') 196 | 197 | reviews_df_replace, labels_df_replace = reviews_df.copy(), labels_df.copy() 198 | reviews_df_replace, labels_df_replace = aug_df(reviews_df_replace, labels_df_replace, 'replace') 199 | reviews_df_replace['id'] += reviews_df.shape[0] 200 | labels_df_replace['id'] += reviews_df.shape[0] 201 | 202 | reviews_df_insert, labels_df_insert = reviews_df.copy(), labels_df.copy() 203 | reviews_df_insert, labels_df_insert = aug_df(reviews_df_insert, labels_df_insert, 'insert') 204 | reviews_df_insert['id'] += reviews_df.shape[0]*2 205 | labels_df_insert['id'] += reviews_df.shape[0] * 2 206 | 207 | reviews_df_swap, labels_df_swap = reviews_df.copy(), labels_df.copy() 208 | reviews_df_swap, labels_df_swap = aug_df(reviews_df_swap, labels_df_swap, 'swap', 3) 209 | reviews_df_swap['id'] += reviews_df.shape[0]*3 210 | labels_df_swap['id'] += reviews_df.shape[0] * 3 211 | 212 | reviews_df_aug = pd.concat([reviews_df, reviews_df_replace, reviews_df_insert, reviews_df_swap], axis=0, ignore_index=True) 213 | labels_df_aug = pd.concat([labels_df, labels_df_replace, labels_df_insert, labels_df_swap], axis=0, ignore_index=True) 214 | 215 | reviews_df_aug.to_csv('../data/TRAIN/Train_laptop_aug_reviews.csv', index=False) 216 | labels_df_aug.to_csv('../data/TRAIN/Train_laptop_aug_labels.csv', index=False) -------------------------------------------------------------------------------- /src/data_augmentation.py: -------------------------------------------------------------------------------- 1 | """ 2 | Author: 周树帆 - SJTU 3 | Email: sfzhou567@163.com 4 | """ 5 | import pandas as pd 6 | from collections import defaultdict 7 | import random 8 | import numpy as np 9 | from tqdm import tqdm 10 | 11 | 12 | def data_augment(reviews_df, labels_df, epochs=5): 13 | POLARITY_DICT = 'polarity_dict' 14 | cate_dict = dict() 15 | for index, row in labels_df.iterrows(): 16 | cate = row['Categories'] 17 | aspect = row['AspectTerms'] 18 | opinion, polarity = row['OpinionTerms'], row['Polarities'] 19 | 20 | if cate not in cate_dict: 21 | cate_dict[cate] = {POLARITY_DICT: dict()} 22 | if polarity not in cate_dict[cate][POLARITY_DICT]: 23 | cate_dict[cate][POLARITY_DICT][polarity] = set() 24 | cate_dict[cate][POLARITY_DICT][polarity].add((aspect, opinion)) 25 | 26 | global_review_id = 1 27 | new_reviews_df = pd.DataFrame(columns=reviews_df.columns) 28 | global_label_idx = 1 29 | new_labels_df = pd.DataFrame(columns=labels_df.columns) 30 | 31 | label_groups = labels_df.groupby('id') 32 | for epoch in range(epochs): 33 | for id, group in tqdm(label_groups): 34 | review = reviews_df.loc[id - 1]['Reviews'] 35 | # TODO: 确认一下是否存在重叠的区间? 
然后把重叠的那部分都去掉 36 | 37 | ## region 区分极性,汇总数据, 然后遍历候选个数做aug 38 | polar_dict = defaultdict(list) 39 | for idx, row in group.iterrows(): 40 | polarity = row['Polarities'] 41 | polar_dict[polarity].append(idx) 42 | ## endregion 43 | for polar in polar_dict: 44 | indices = polar_dict[polar] 45 | for size in range(1, len(indices) + 1): 46 | new_group = group.copy() 47 | new_group['AspectOffset'] = 0 48 | new_group['OpinionOffset'] = 0 49 | chosen_indices = np.random.choice(indices, size, replace=False) 50 | for index in chosen_indices: 51 | row = new_group.loc[index] 52 | cate = row['Categories'] 53 | aspect = row['AspectTerms'] 54 | opinion, polarity = row['OpinionTerms'], row['Polarities'] 55 | 56 | pair_set = cate_dict[cate][POLARITY_DICT][polarity] 57 | pair_list = list(pair_set) 58 | if len(pair_list) > 1: 59 | new_aspect, new_opinion = aspect, opinion 60 | accident_cnt = 0 61 | while (aspect == '_' and new_aspect != '_') or (new_aspect == aspect and new_opinion == opinion): 62 | new_idx = random.randint(0, len(pair_list) - 1) 63 | new_aspect, new_opinion = pair_list[new_idx] 64 | accident_cnt += 1 65 | if accident_cnt >= 1000: # FIXME: 给原aspect为'_'的建一个dict来提速。不过现在这个代码也能用。 66 | break 67 | new_group.loc[index, 'AspectTerms'] = new_aspect 68 | new_group.loc[index, 'AspectOffset'] = (0 if new_aspect == '_' else len(new_aspect)) - len(aspect) 69 | new_group.loc[index, 'OpinionTerms'] = new_opinion 70 | new_group.loc[index, 'OpinionOffset'] = (0 if new_opinion == '_' else len(new_opinion)) - len(opinion) 71 | 72 | ## 把spans拿出来排序,然后再塞回label里去 73 | spans = [] 74 | span_set = set() 75 | for i, row in new_group.iterrows(): 76 | aspect_element = {'idx': i, 77 | 'text': row['AspectTerms'], 78 | 'span': (row['A_start'], row['A_end']), 79 | 'offset': row['AspectOffset'], 80 | 'type': 'a'} 81 | if aspect_element['span'][0].strip() != '' and aspect_element['span'] not in span_set: 82 | span_set.add(aspect_element['span']) 83 | spans.append(aspect_element) 84 | opinion_element = {'idx': i, 85 | 'text': row['OpinionTerms'], 86 | 'span': (row['O_start'], row['O_end']), 87 | 'offset': row['OpinionOffset'], 88 | 'type': 'o'} 89 | if opinion_element['span'][0].strip() != '' and opinion_element['span'] not in span_set: 90 | span_set.add(opinion_element['span']) 91 | spans.append(opinion_element) 92 | sorted_spans = sorted(spans, key=lambda d: int(d['span'][0])) 93 | new_review = '' 94 | last_start = 0 95 | offset = 0 96 | for span in sorted_spans: 97 | # 下面3行顺序不能换, 必须是start, offset, end 98 | idx = span['idx'] 99 | start = int(span['span'][0]) + offset if span['text'] != '_' else ' ' 100 | offset += span['offset'] 101 | end = int(span['span'][1]) + offset if span['text'] != '_' else ' ' 102 | 103 | new_review += review[last_start:int(span['span'][0])] + (span['text'] if span['text'] != '_' else '') 104 | last_start = int(span['span'][1]) 105 | 106 | if span['type'] == 'a': 107 | new_group.loc[idx, 'A_start'] = str(start) 108 | new_group.loc[idx, 'A_end'] = str(end) 109 | else: 110 | new_group.loc[idx, 'O_start'] = str(start) 111 | new_group.loc[idx, 'O_end'] = str(end) 112 | new_review += review[last_start:] 113 | 114 | ## 记录结果 115 | del new_group['AspectOffset'] 116 | del new_group['OpinionOffset'] 117 | for i, row in new_group.iterrows(): 118 | row_data = row.tolist() 119 | row_data[0] = global_review_id 120 | new_labels_df.loc[global_label_idx] = row_data 121 | global_label_idx += 1 122 | new_reviews_df.loc[global_review_id] = [global_review_id, new_review] 123 | global_review_id += 1 124 | 125 | return 
new_reviews_df, new_labels_df 126 | 127 | 128 | if __name__ == '__main__': 129 | data_type = 'laptop_corpus' 130 | epochs = 3 # 控制aug的倍数, 建议取<=5的正整数 131 | 132 | reviews_df = pd.read_csv('../data/TRAIN/Train_%s_reviews.csv' % data_type, encoding='utf-8') 133 | labels_df = pd.read_csv('../data/TRAIN/Train_%s_labels.csv' % data_type, encoding='utf-8') 134 | 135 | new_reviews_df, new_labels_df = data_augment(reviews_df, labels_df, epochs) 136 | 137 | new_reviews_df.to_csv('../data/TRAIN/Train_%s_aug_reviews.csv' % data_type, index=False, encoding='utf-8') 138 | new_labels_df.to_csv('../data/TRAIN/Train_%s_aug_labels.csv' % data_type, index=False, encoding='utf-8') 139 | -------------------------------------------------------------------------------- /src/dataset.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch.utils.data import Dataset, DataLoader, ConcatDataset, random_split 3 | import pandas as pd 4 | from sklearn.model_selection import KFold 5 | import jieba 6 | import numpy as np 7 | from data_augmentation import data_augment 8 | # { 硬件&性能、软件&性能、外观、使用场景、物流、服务、包装、价格、真伪、整体、其他 } 9 | ID2C = ['包装', '成分', '尺寸', '服务', '功效', '价格', '气味', '使用体验', '物流', '新鲜度', '真伪', '整体', '其他'] 10 | ID2COMMON = ['物流', '服务', '包装', '价格', '真伪', '整体', '其他'] 11 | ID2LAPTOP = ID2COMMON + ['硬件&性能', '软件&性能', '外观', '使用场景'] 12 | ID2MAKUP = ID2COMMON + ['成分', '尺寸', '功效', '气味', '使用体验', '新鲜度'] 13 | 14 | ID2P = ['正面', '中性', '负面'] 15 | 16 | 17 | # C2ID = dict(zip(ID2C, range(len(ID2C)))) 18 | LAPTOP2ID = dict(zip(ID2LAPTOP, range(len(ID2LAPTOP)))) 19 | MAKUP2ID = dict(zip(ID2MAKUP, range(len(ID2MAKUP)))) 20 | P2ID = dict(zip(ID2P, range(len(ID2P)))) 21 | 22 | class CorpusDataset(Dataset): 23 | def __init__(self, corpus_path, tokenizer): 24 | super(CorpusDataset, self).__init__() 25 | corpus_df = pd.read_csv(corpus_path, encoding='utf-8') 26 | self.tokenizer = tokenizer 27 | self.samples = self._preprocess_data(corpus_df, tokenizer) 28 | 29 | def _preprocess_data(self, corpus_df, tokenizer): 30 | samples = [] 31 | for id, rv in zip(corpus_df['id'], corpus_df['Reviews']): 32 | if len(rv) >= 120: 33 | continue 34 | RV = ['[CLS]'] 35 | RV_INTERVALS = [] 36 | rv_cut = jieba.cut(rv) 37 | for word in rv_cut: 38 | s = len(RV) 39 | for c in word: 40 | if c == ' ': 41 | RV.append('[unused1]') 42 | elif c in tokenizer.vocab: 43 | RV.append(c) 44 | else: 45 | RV.append('[UNK]') 46 | e = len(RV) 47 | RV_INTERVALS.append((s, e)) 48 | 49 | RV.append('[SEP]') 50 | RV = tokenizer.convert_tokens_to_ids(RV) 51 | 52 | samples.append((rv, RV, RV_INTERVALS)) 53 | return samples 54 | 55 | def batchify(self, batch_samples): 56 | rv_raw = [] 57 | INPUT_IDS = [] 58 | ATTN_MASK = [] 59 | LM_LABEL = [] 60 | for raw, rv, rv_intervals in batch_samples: 61 | rv_raw.append(raw) 62 | masked_rv = [_ for _ in rv] 63 | lm_label = [-1] * len(masked_rv) 64 | mask_word_num = int(len(rv_intervals) * 0.15) 65 | masked_word_idxs = list(np.random.choice(list(range(len(rv_intervals))), mask_word_num, False)) 66 | for i in masked_word_idxs: 67 | s, e = rv_intervals[i] 68 | for j in range(s, e): 69 | lm_label[j] = masked_rv[j] 70 | rand = np.random.rand() 71 | if rand < 0.1: # 随机替换 72 | replace_id = np.random.choice(range(len(self.tokenizer.vocab))) 73 | elif rand < 0.2: # 保留原词 74 | replace_id = lm_label[j] 75 | else: #换为mask 76 | replace_id = self.tokenizer.vocab['[MASK]'] 77 | masked_rv[j] = replace_id 78 | 79 | ATTN_MASK.append([1] * len(masked_rv)) 80 | INPUT_IDS.append(masked_rv) 81 | 
LM_LABEL.append(lm_label) 82 | INPUT_IDS = torch.LongTensor(pad_batch_seqs(INPUT_IDS, self.tokenizer.vocab['[PAD]'])) 83 | ATTN_MASK = torch.LongTensor(pad_batch_seqs(ATTN_MASK, 0)) 84 | LM_LABEL = torch.LongTensor(pad_batch_seqs(LM_LABEL, -1)) 85 | 86 | return INPUT_IDS, ATTN_MASK, LM_LABEL 87 | 88 | 89 | def __getitem__(self, index): 90 | return self.samples[index] 91 | 92 | def __len__(self): 93 | return len(self.samples) 94 | 95 | class ReviewDataset(Dataset): 96 | def __init__(self, reviews, labels, tokenizer, type='makeup'): 97 | super(ReviewDataset, self).__init__() 98 | if isinstance(reviews, str): 99 | reviews_df = pd.read_csv(reviews, encoding='utf-8') 100 | elif isinstance(reviews, pd.DataFrame): 101 | reviews_df = reviews 102 | else: 103 | raise TypeError("接受路径或df") 104 | if type == 'makeup': 105 | self.C2ID = MAKUP2ID 106 | else: 107 | self.C2ID = LAPTOP2ID 108 | labels_df = None 109 | if labels is not None: 110 | if isinstance(labels, str): 111 | labels_df = pd.read_csv(labels, encoding='utf-8') 112 | elif isinstance(labels, pd.DataFrame): 113 | labels_df = labels 114 | else: 115 | raise TypeError("接受路径或df") 116 | 117 | self.samples = self._preprocess_data(reviews_df, labels_df, tokenizer) 118 | self.PAD_ID = tokenizer.vocab['[PAD]'] 119 | 120 | 121 | def __getitem__(self, index): 122 | return self.samples[index] 123 | 124 | def __len__(self): 125 | return len(self.samples) 126 | 127 | def _preprocess_data(self, reviews_df, labels_df, tokenizer): 128 | samples = [] 129 | for id, rv in zip(reviews_df['id'], reviews_df['Reviews']): 130 | rv = rv[:120] 131 | RV = [] 132 | for c in rv: 133 | if c == ' ': 134 | RV.append('[unused1]') 135 | elif c in tokenizer.vocab: 136 | RV.append(c) 137 | else: 138 | RV.append('[UNK]') 139 | 140 | # RV = [c if c in tokenizer.vocab else '[UNK]' for c in rv] 141 | RV = ['[CLS]'] + RV + ['[SEP]'] 142 | RV = tokenizer.convert_tokens_to_ids(RV) 143 | 144 | if labels_df is not None: 145 | lbs = labels_df[labels_df['id'] == id] 146 | # A_start & A_end 147 | LB_AS = [-1] * len(RV) 148 | LB_AE = [-1] * len(RV) 149 | 150 | # O_start & O_end 151 | LB_OS = [-1] * len(RV) 152 | LB_OE = [-1] * len(RV) 153 | 154 | # Objectiveness 155 | LB_OBJ = [0] * len(RV) 156 | 157 | # Categories 158 | LB_C = [-1] * len(RV) 159 | 160 | # Polarities 161 | LB_P = [-1] * len(RV) 162 | 163 | lb_raw = [] 164 | for i in range(len(lbs)): 165 | lb = lbs.iloc[i] 166 | a_s = lb['A_start'].strip() 167 | a_e = lb['A_end'].strip() 168 | o_s = lb['O_start'].strip() 169 | o_e = lb['O_end'].strip() 170 | c = lb['Categories'].strip() 171 | p = lb['Polarities'].strip() 172 | 173 | if c in self.C2ID: 174 | c = self.C2ID[c] 175 | else: 176 | c = -1 177 | 178 | if p in P2ID: 179 | p = P2ID[p] 180 | else: 181 | p = -1 182 | # a和o均从1开始 0 代表CLS 183 | if a_s != '' and a_e != '': 184 | a_s, a_e = int(a_s) + 1, int(a_e) 185 | else: 186 | a_s, a_e = 0, 0 187 | if o_s != '' and o_e != '': 188 | o_s, o_e = int(o_s) + 1, int(o_e) 189 | else: 190 | o_s, o_e = 0, 0 191 | 192 | if a_s >= len(RV) - 1: 193 | a_s, a_e = 0, 0 194 | if o_s >= len(RV) - 1: 195 | o_s, o_e = 0, 0 196 | 197 | a_s = min(a_s, len(RV) - 2) 198 | a_e = min(a_e, len(RV) - 2) 199 | o_s = min(o_s, len(RV) - 2) 200 | o_e = min(o_e, len(RV) - 2) 201 | 202 | # print(a_s, a_e, o_s, o_e, len(RV)) 203 | 204 | if a_s > 0: 205 | LB_AS[a_s: a_e + 1] = [a_s] * (a_e - a_s + 1) 206 | LB_AE[a_s: a_e + 1] = [a_e] * (a_e - a_s + 1) 207 | LB_OS[a_s: a_e + 1] = [o_s] * (a_e - a_s + 1) 208 | LB_OE[a_s: a_e + 1] = [o_e] * (a_e - a_s + 1) 209 | LB_OBJ[a_s: 
a_e + 1] = [1] * (a_e - a_s + 1) 210 | LB_C[a_s: a_e + 1] = [c] * (a_e - a_s + 1) 211 | LB_P[a_s: a_e + 1] = [p] * (a_e - a_s + 1) 212 | 213 | if o_s > 0: 214 | LB_AS[o_s: o_e + 1] = [a_s] * (o_e - o_s + 1) 215 | LB_AE[o_s: o_e + 1] = [a_e] * (o_e - o_s + 1) 216 | LB_OS[o_s: o_e + 1] = [o_s] * (o_e - o_s + 1) 217 | LB_OE[o_s: o_e + 1] = [o_e] * (o_e - o_s + 1) 218 | LB_OBJ[o_s: o_e + 1] = [1] * (o_e - o_s + 1) 219 | LB_C[o_s: o_e + 1] = [c] * (o_e - o_s + 1) 220 | LB_P[o_s: o_e + 1] = [p] * (o_e - o_s + 1) 221 | lb_raw.append((a_s, a_e, o_s, o_e, c, p)) 222 | 223 | # print(LB_NUM) 224 | # obj_weights = 1 / sum(LB_OBJ) 225 | # LB_OBJ = list(map(lambda x: obj_weights if x == 1 else 0, LB_OBJ)) 226 | LABELS = (LB_AS, LB_AE, LB_OS, LB_OE, LB_OBJ, LB_C, LB_P) 227 | rv = (rv, lb_raw) 228 | else: 229 | LABELS = None 230 | rv = (rv, None) 231 | 232 | samples.append((rv, RV, LABELS)) 233 | return samples 234 | 235 | def batchify(self, batch_samples): 236 | rv_raw = [] 237 | lb_raw = [] 238 | IN_RV = [] 239 | IN_ATT_MASK = [] 240 | IN_RV_MASK = [] 241 | 242 | for raw, RV, _ in batch_samples: 243 | rv_raw.append(raw[0]) 244 | lb_raw.append(raw[1]) 245 | IN_RV.append(RV) 246 | IN_ATT_MASK.append([1] * len(RV)) 247 | IN_RV_MASK.append([0] + [1] * (len(RV) - 2) + [0]) 248 | 249 | IN_RV = torch.LongTensor(pad_batch_seqs(IN_RV, self.PAD_ID)) 250 | IN_ATT_MASK = torch.LongTensor(pad_batch_seqs(IN_ATT_MASK, 0)) 251 | IN_RV_MASK = torch.LongTensor(pad_batch_seqs(IN_RV_MASK, 0)) 252 | 253 | INPUTS = [IN_RV, IN_ATT_MASK, IN_RV_MASK] 254 | 255 | if batch_samples[0][2] is not None: 256 | TARGETS = [[] for _ in batch_samples[0][2]] 257 | # TGT_AS, TGT_AE, TGT_OS, TGT_OE, TGT_OBJ, TGT_C, TGT_P = [], [], [], [], [], [], [] 258 | for _, RV, LABELS in batch_samples: 259 | for i in range(len(LABELS)): 260 | TARGETS[i].append(LABELS[i]) 261 | 262 | for i in range(len(TARGETS)): 263 | if i == 4: 264 | TARGETS[i] = torch.FloatTensor(pad_batch_seqs(TARGETS[i], 0)) # OBJ for kldiv 265 | # elif i == len(TARGETS) - 1: 266 | # TARGETS[i] = torch.LongTensor(TARGETS[i]) 267 | else: 268 | TARGETS[i] = torch.LongTensor(pad_batch_seqs(TARGETS[i], -1)) # for CE Loss ignore 269 | else: 270 | TARGETS = None 271 | 272 | return (rv_raw, lb_raw), INPUTS, TARGETS 273 | 274 | 275 | def pad_batch_seqs(seqs: list, pad=None, max_len=None) -> list: 276 | if not max_len: 277 | max_len = max([len(s) for s in seqs]) 278 | if not pad: 279 | pad = 0 280 | for i in range(len(seqs)): 281 | if len(seqs[i]) > max_len: 282 | seqs[i] = seqs[i][:max_len] 283 | else: 284 | seqs[i].extend([pad] * (max_len - len(seqs[i]))) 285 | 286 | return seqs 287 | 288 | 289 | def get_data_loaders(rv_path, lb_path, tokenizer, batch_size, val_split=0.15): 290 | full_dataset = ReviewDataset(rv_path, lb_path, tokenizer) 291 | train_size = int(len(full_dataset) * (1 - val_split)) 292 | val_size = len(full_dataset) - train_size 293 | lengths = [train_size, val_size] 294 | torch.manual_seed(502) 295 | train_data, val_data = random_split(full_dataset, lengths) 296 | train_loader = DataLoader(train_data, batch_size, collate_fn=full_dataset.batchify, shuffle=True, num_workers=5, 297 | drop_last=False) 298 | val_loader = DataLoader(val_data, batch_size, collate_fn=full_dataset.batchify, shuffle=False, num_workers=5, 299 | drop_last=False) 300 | 301 | return train_loader, val_loader 302 | 303 | 304 | def get_full_data_loaders(rv_path, lb_path, tokenizer, batch_size, type='makeup', shuffle=False): 305 | full_dataset = ReviewDataset(rv_path, lb_path, tokenizer, type) 306 | loader = 
DataLoader(full_dataset, batch_size, collate_fn=full_dataset.batchify, shuffle=shuffle, num_workers=5, 307 | drop_last=False) 308 | 309 | return loader 310 | 311 | 312 | def get_data_loaders_cv(rv_path, lb_path, tokenizer, batch_size, type='makeup', folds=5, return_val_idxs=False): 313 | full_dataset = ReviewDataset(rv_path, lb_path, tokenizer, type) 314 | 315 | kf = KFold(n_splits=folds, shuffle=True, random_state=502) 316 | folds = kf.split(full_dataset) 317 | cv_loaders = [] 318 | val_idxs = [] 319 | for train_idx, val_idx in folds: 320 | train_loader = DataLoader([full_dataset.samples[i] for i in train_idx], batch_size, 321 | collate_fn=full_dataset.batchify, shuffle=True, num_workers=5, drop_last=False) 322 | val_loader = DataLoader([full_dataset.samples[i] for i in val_idx], batch_size, 323 | collate_fn=full_dataset.batchify, shuffle=False, num_workers=5, drop_last=False) 324 | cv_loaders.append((train_loader, val_loader)) 325 | val_idxs.append(val_idx) 326 | 327 | if return_val_idxs: 328 | return cv_loaders, val_idxs 329 | 330 | return cv_loaders 331 | 332 | 333 | def get_aug_data_loaders_cv(rv_path, lb_path, tokenizer, batch_size, type='makeup', folds=5): 334 | # full_dataset = ReviewDataset(rv_path, lb_path, tokenizer, type) 335 | rv_df = pd.read_csv(rv_path, encoding='utf-8') 336 | lb_df = pd.read_csv(lb_path, encoding='utf-8') 337 | kf = KFold(n_splits=folds, shuffle=True, random_state=502) 338 | folds = kf.split(range(rv_df.shape[0])) 339 | for train_idx, val_idx in folds: 340 | train_rv_df = rv_df.iloc[train_idx].copy() 341 | train_lb_df = lb_df[lb_df['id'].isin(train_rv_df['id'])].copy() 342 | val_rv_df = rv_df.iloc[val_idx].copy() 343 | val_lb_df = lb_df[lb_df['id'].isin(val_rv_df['id'])].copy() 344 | 345 | train_rv_aug_df, train_lb_aug_df = data_augment(train_rv_df, train_lb_df, 1) 346 | 347 | print(train_rv_aug_df.shape[0]) 348 | print(train_rv_df.shape[0]) 349 | train_rv_df['id'] += train_rv_aug_df.shape[0] 350 | train_lb_df['id'] += train_rv_aug_df.shape[0] 351 | train_rv_aug_df = train_rv_aug_df.append(train_rv_df, ignore_index=True) 352 | train_lb_aug_df = train_lb_aug_df.append(train_lb_df, ignore_index=True) 353 | 354 | train_dataset = ReviewDataset(train_rv_aug_df, train_lb_aug_df, tokenizer, type) 355 | val_dataset = ReviewDataset(val_rv_df, val_lb_df, tokenizer, type) 356 | 357 | 358 | train_loader = DataLoader(train_dataset, batch_size, 359 | collate_fn=train_dataset.batchify, shuffle=True, num_workers=5, drop_last=False) 360 | val_loader = DataLoader(val_dataset, batch_size, 361 | collate_fn=val_dataset.batchify, shuffle=False, num_workers=5, drop_last=False) 362 | yield train_loader, val_loader 363 | 364 | 365 | def get_data_loaders_round2(tokenizer, batch_size, val_split=0.15): 366 | makeup_rv1 = ReviewDataset('../data/TRAIN/Train_reviews.csv', '../data/TRAIN/Train_labels.csv', tokenizer) 367 | makeup_rv2 = ReviewDataset('../data/TRAIN/Train_makeup_reviews.csv', '../data/TRAIN/Train_makeup_labels.csv', tokenizer) 368 | makeup_rv = ConcatDataset([makeup_rv1, makeup_rv2]) 369 | laptop_rv = ReviewDataset('../data/TRAIN/Train_laptop_reviews.csv', '../data/TRAIN/Train_laptop_labels.csv', tokenizer, type='laptop') 370 | 371 | 372 | laptop_corpus1 = CorpusDataset('../data/TEST/Test_reviews.csv', tokenizer) 373 | laptop_corpus2 = CorpusDataset('../data/TRAIN/Train_laptop_corpus.csv', tokenizer) 374 | laptop_corpus3 = CorpusDataset('../data/TRAIN/Train_laptop_reviews.csv', tokenizer) 375 | makeup_corpus1 = CorpusDataset('../data/TEST/Test_reviews1.csv', tokenizer) 376 
| makeup_corpus2 = CorpusDataset('../data/TRAIN/Train_reviews.csv', tokenizer) 377 | makeup_corpus3 = CorpusDataset('../data/TRAIN/Train_makeup_reviews.csv', tokenizer) 378 | 379 | corpus_rv = ConcatDataset([laptop_corpus1, laptop_corpus2, laptop_corpus3, makeup_corpus1, makeup_corpus2, makeup_corpus3]) 380 | corpus_loader = DataLoader(corpus_rv, batch_size, collate_fn=laptop_corpus1.batchify, shuffle=True, num_workers=5, 381 | drop_last=False) 382 | 383 | makeup_train_size = int(len(makeup_rv) * (1 - val_split)) 384 | makeup_val_size = len(makeup_rv) - makeup_train_size 385 | torch.manual_seed(502) 386 | makeup_train, makeup_val = random_split(makeup_rv, [makeup_train_size, makeup_val_size]) 387 | makeup_train_loader = DataLoader(makeup_train, batch_size // 2, collate_fn=makeup_rv1.batchify, shuffle=True, num_workers=5, 388 | drop_last=False) 389 | makeup_val_loader = DataLoader(makeup_val, batch_size, collate_fn=makeup_rv1.batchify, shuffle=False, num_workers=5, 390 | drop_last=False) 391 | 392 | laptop_train_size = int(len(laptop_rv) * (1 - val_split)) 393 | laptop_val_size = len(laptop_rv) - laptop_train_size 394 | torch.manual_seed(502) 395 | laptop_train, laptop_val = random_split(laptop_rv, [laptop_train_size, laptop_val_size]) 396 | laptop_train_loader = DataLoader(laptop_train, batch_size // 2, collate_fn=laptop_rv.batchify, shuffle=True, 397 | num_workers=5, 398 | drop_last=False) 399 | laptop_val_loader = DataLoader(laptop_val, batch_size, collate_fn=laptop_rv.batchify, shuffle=False, num_workers=5, 400 | drop_last=False) 401 | 402 | return makeup_train_loader, makeup_val_loader, laptop_train_loader, laptop_val_loader, corpus_loader 403 | 404 | 405 | def get_pretrain_loaders(tokenizer, batch_size, val_split=0.15): 406 | makeup_rv1 = ReviewDataset('../data/TRAIN/Train_reviews.csv', '../data/TRAIN/Train_labels.csv', tokenizer) 407 | makeup_rv2 = ReviewDataset('../data/TRAIN/Train_makeup_reviews.csv', '../data/TRAIN/Train_makeup_labels.csv', 408 | tokenizer) 409 | makeup_rv = ConcatDataset([makeup_rv1, makeup_rv2]) 410 | laptop_corpus1 = CorpusDataset('../data/TEST/Test_reviews.csv', tokenizer) 411 | laptop_corpus2 = CorpusDataset('../data/TRAIN/Train_laptop_corpus.csv', tokenizer) 412 | laptop_corpus3 = CorpusDataset('../data/TRAIN/Train_laptop_reviews.csv', tokenizer) 413 | makeup_corpus1 = CorpusDataset('../data/TEST/Test_reviews1.csv', tokenizer) 414 | makeup_corpus2 = CorpusDataset('../data/TRAIN/Train_reviews.csv', tokenizer) 415 | makeup_corpus3 = CorpusDataset('../data/TRAIN/Train_makeup_reviews.csv', tokenizer) 416 | 417 | corpus_rv = ConcatDataset( 418 | [laptop_corpus1, laptop_corpus2, laptop_corpus3, makeup_corpus1, makeup_corpus2, makeup_corpus3]) 419 | corpus_loader = DataLoader(corpus_rv, batch_size, collate_fn=laptop_corpus1.batchify, shuffle=True, num_workers=5, 420 | drop_last=False) 421 | makeup_train_size = int(len(makeup_rv) * (1 - val_split)) 422 | makeup_val_size = len(makeup_rv) - makeup_train_size 423 | torch.manual_seed(502) 424 | makeup_train, makeup_val = random_split(makeup_rv, [makeup_train_size, makeup_val_size]) 425 | makeup_train_loader = DataLoader(makeup_train, batch_size, collate_fn=makeup_rv1.batchify, shuffle=True, 426 | num_workers=5, 427 | drop_last=False) 428 | makeup_val_loader = DataLoader(makeup_val, batch_size, collate_fn=makeup_rv1.batchify, shuffle=False, num_workers=5, 429 | drop_last=False) 430 | 431 | return makeup_train_loader, makeup_val_loader, corpus_loader 432 | 433 | 434 | def get_pretrain2_loaders(tokenizer, batch_size, 
val_split=0.15): 435 | makeup_rv1 = ReviewDataset('../data/TRAIN/Train_reviews.csv', '../data/TRAIN/Train_labels.csv', tokenizer) 436 | makeup_rv2 = ReviewDataset('../data/TRAIN/Train_makeup_reviews.csv', '../data/TRAIN/Train_makeup_labels.csv', 437 | tokenizer) 438 | makeup_rv = ConcatDataset([makeup_rv1, makeup_rv2]) 439 | makeup_train_size = int(len(makeup_rv) * (1 - val_split)) 440 | makeup_val_size = len(makeup_rv) - makeup_train_size 441 | torch.manual_seed(502) 442 | makeup_train, makeup_val = random_split(makeup_rv, [makeup_train_size, makeup_val_size]) 443 | makeup_loader = DataLoader(makeup_train, batch_size, collate_fn=makeup_rv1.batchify, shuffle=True, 444 | num_workers=5, 445 | drop_last=False) 446 | makeup_val_loader = DataLoader(makeup_val, batch_size, collate_fn=makeup_rv1.batchify, shuffle=False, num_workers=5, 447 | drop_last=False) 448 | 449 | laptop_rv = ReviewDataset('../data/TRAIN/Train_laptop_corpus.csv', '../data/TRAIN/Train_laptop_corpus_labels.csv', tokenizer, 'laptop') 450 | laptop_val_rv = ReviewDataset('../data/TRAIN/Train_laptop_reviews.csv', '../data/TRAIN/Train_laptop_labels.csv', tokenizer, 'laptop') 451 | 452 | laptop_loader = DataLoader(laptop_rv, batch_size, collate_fn=laptop_rv.batchify, shuffle=True, 453 | num_workers=5, 454 | drop_last=False) 455 | 456 | laptop_val_loader = DataLoader(laptop_val_rv, batch_size, collate_fn=laptop_val_rv.batchify, shuffle=False, 457 | num_workers=5, 458 | drop_last=False) 459 | 460 | laptop_corpus1 = CorpusDataset('../data/TEST/Test_reviews.csv', tokenizer) 461 | laptop_corpus2 = CorpusDataset('../data/TRAIN/Train_laptop_corpus.csv', tokenizer) 462 | laptop_corpus3 = CorpusDataset('../data/TRAIN/Train_laptop_reviews.csv', tokenizer) 463 | makeup_corpus1 = CorpusDataset('../data/TEST/Test_reviews1.csv', tokenizer) 464 | makeup_corpus2 = CorpusDataset('../data/TRAIN/Train_reviews.csv', tokenizer) 465 | makeup_corpus3 = CorpusDataset('../data/TRAIN/Train_makeup_reviews.csv', tokenizer) 466 | 467 | corpus_rv = ConcatDataset( 468 | [laptop_corpus1, laptop_corpus2, laptop_corpus3, makeup_corpus1, makeup_corpus2, makeup_corpus3]) 469 | corpus_loader = DataLoader(corpus_rv, batch_size, collate_fn=laptop_corpus1.batchify, shuffle=True, num_workers=5, 470 | drop_last=False) 471 | 472 | return makeup_loader, makeup_val_loader, laptop_loader, laptop_val_loader, corpus_loader 473 | 474 | def get_pretrain_2_laptop_fake_loaders_cv(tokenizer, batch_size, folds=5): 475 | # ## laptop cv 476 | # laptop_rv_small = ReviewDataset('../data/TRAIN/Train_laptop_reviews.csv', '../data/TRAIN/Train_laptop_labels.csv', 477 | # tokenizer, 'laptop') 478 | # kf = KFold(n_splits=folds, shuffle=True, random_state=502) 479 | # folds = kf.split(laptop_rv_small) 480 | for cv_idx in range(folds): 481 | laptop_rv_big = ReviewDataset('../data/TRAIN/Train_laptop_corpus.csv', 482 | '../data/TRAIN/Train_laptop_corpus_labels' + str(cv_idx) + '.csv', tokenizer, 'laptop') 483 | # laptop_rv_big.samples.extend([laptop_rv_small.samples[i] for i in train_idx]) 484 | 485 | train_loader = DataLoader(laptop_rv_big, batch_size, 486 | collate_fn=laptop_rv_big.batchify, shuffle=True, num_workers=5, drop_last=False) 487 | # val_loader = DataLoader([laptop_rv_big.samples[i] for i in val_idx], batch_size, 488 | # collate_fn=laptop_rv_big.batchify, shuffle=False, num_workers=5, drop_last=False) 489 | yield train_loader 490 | 491 | 492 | def get_pretrain2_loaders_cv(tokenizer, batch_size, val_split=0.15): 493 | 494 | ## makeup split 495 | makeup_rv1 = 
ReviewDataset('../data/TRAIN/Train_reviews.csv', '../data/TRAIN/Train_labels.csv', tokenizer) 496 | makeup_rv2 = ReviewDataset('../data/TRAIN/Train_makeup_reviews.csv', '../data/TRAIN/Train_makeup_labels.csv', 497 | tokenizer) 498 | makeup_rv = ConcatDataset([makeup_rv1, makeup_rv2]) 499 | makeup_train_size = int(len(makeup_rv) * (1 - val_split)) 500 | makeup_val_size = len(makeup_rv) - makeup_train_size 501 | torch.manual_seed(502) 502 | makeup_train, makeup_val = random_split(makeup_rv, [makeup_train_size, makeup_val_size]) 503 | makeup_loader = DataLoader(makeup_train, batch_size, collate_fn=makeup_rv1.batchify, shuffle=True, 504 | num_workers=5, 505 | drop_last=False) 506 | makeup_val_loader = DataLoader(makeup_val, batch_size, collate_fn=makeup_rv1.batchify, shuffle=False, num_workers=5, 507 | drop_last=False) 508 | 509 | ## corpus total 510 | laptop_corpus1 = CorpusDataset('../data/TEST/Test_reviews.csv', tokenizer) 511 | laptop_corpus2 = CorpusDataset('../data/TRAIN/Train_laptop_corpus.csv', tokenizer) 512 | laptop_corpus3 = CorpusDataset('../data/TRAIN/Train_laptop_reviews.csv', tokenizer) 513 | makeup_corpus1 = CorpusDataset('../data/TEST/Test_reviews1.csv', tokenizer) 514 | makeup_corpus2 = CorpusDataset('../data/TRAIN/Train_reviews.csv', tokenizer) 515 | makeup_corpus3 = CorpusDataset('../data/TRAIN/Train_makeup_reviews.csv', tokenizer) 516 | 517 | corpus_rv = ConcatDataset( 518 | [laptop_corpus1, laptop_corpus2, laptop_corpus3, makeup_corpus1, makeup_corpus2, makeup_corpus3]) 519 | corpus_loader = DataLoader(corpus_rv, batch_size, collate_fn=laptop_corpus1.batchify, shuffle=True, num_workers=5, 520 | drop_last=False) 521 | 522 | return makeup_loader, makeup_val_loader, corpus_loader 523 | 524 | def get_makeup_full_loaders(tokenizer, batch_size): 525 | makeup_rv1 = ReviewDataset('../data/TRAIN/Train_reviews.csv', '../data/TRAIN/Train_labels.csv', tokenizer) 526 | makeup_rv2 = ReviewDataset('../data/TRAIN/Train_makeup_reviews.csv', '../data/TRAIN/Train_makeup_labels.csv', 527 | tokenizer) 528 | makeup_rv = ConcatDataset([makeup_rv1, makeup_rv2]) 529 | makeup_train_loader = DataLoader(makeup_rv, batch_size, collate_fn=makeup_rv1.batchify, shuffle=True, 530 | num_workers=5, 531 | drop_last=False) 532 | 533 | return makeup_train_loader 534 | 535 | if __name__ == '__main__': 536 | from pytorch_pretrained_bert import BertTokenizer 537 | 538 | tokenizer = BertTokenizer.from_pretrained('/home/zydq/.torch/models/bert/chinese-bert_chinese_wwm_pytorch', 539 | do_lower_case=True) 540 | d = ReviewDataset('../data/TRAIN/Train_reviews.csv', '../data/TRAIN/Train_labels.csv', tokenizer) 541 | mxl = 0 542 | mxn = 0 543 | for raw, a, b in d: 544 | mxl = max(len(raw), mxl) 545 | mxn = max(b[-1], mxn) 546 | print(mxl, mxn) 547 | -------------------------------------------------------------------------------- /src/eval.py: -------------------------------------------------------------------------------- 1 | from pytorch_pretrained_bert import BertTokenizer 2 | from dataset import ReviewDataset, get_data_loaders 3 | from model import OpinioNet 4 | 5 | import torch 6 | from torch.utils.data import DataLoader 7 | 8 | from tqdm import tqdm 9 | import os.path as osp 10 | import pandas as pd 11 | from dataset import ID2C, ID2P 12 | 13 | 14 | def eval_epoch(model, dataloader): 15 | model.eval() 16 | step = 0 17 | 18 | result = pd.DataFrame(columns=['id', 'A', 'O', 'C', 'P']) 19 | 20 | pbar = tqdm(dataloader) 21 | cur_idx = 1 22 | for raw, x, _ in pbar: 23 | if step == len(dataloader): 24 | pbar.close() 25 | 
break 26 | rv_raw, _ = raw 27 | x = [item.cuda() for item in x] 28 | with torch.no_grad(): 29 | probs, logits = model.forward(x) 30 | pred_result = model.gen_candidates(probs) 31 | pred_result = model.nms_filter(pred_result, 0.1) 32 | for b in range(len(pred_result)): 33 | opinions = pred_result[b] 34 | if len(opinions) == 0: 35 | result = result.append({'id': cur_idx, 'A': '_', 'O': '_', 'C': '_', 'P': '_'}, ignore_index=True) 36 | for opn in opinions: 37 | opn = opn[0] 38 | a_s, a_e, o_s, o_e = opn[0:4] 39 | c, p = opn[4:6] 40 | if a_s == 0: 41 | A = '_' 42 | else: 43 | A = rv_raw[b][a_s - 1: a_e] 44 | if o_s == 0: 45 | O = '_' 46 | else: 47 | O = rv_raw[b][o_s - 1: o_e] 48 | C = ID2C[c] 49 | P = ID2P[p] 50 | result = result.append({'id': cur_idx, 'A': A, 'O': O, 'C': C, 'P': P}, ignore_index=True) 51 | cur_idx += 1 52 | 53 | step += 1 54 | return result 55 | 56 | 57 | if __name__ == '__main__': 58 | EP = 100 59 | SAVING_DIR = '../models/' 60 | tokenizer = BertTokenizer.from_pretrained('/home/zydq/.torch/models/bert/chinese-bert_chinese_wwm_pytorch', 61 | do_lower_case=True) 62 | test_dataset = ReviewDataset('../data/TEST/Test_reviews.csv', None, tokenizer) 63 | test_loader = DataLoader(test_dataset, 12, collate_fn=test_dataset.batchify, shuffle=False, num_workers=5) 64 | 65 | model = OpinioNet.from_pretrained('/home/zydq/.torch/models/bert/chinese-bert_chinese_wwm_pytorch') 66 | model.load_state_dict(torch.load('../models/best_bert_model')) 67 | model.cuda() 68 | result = eval_epoch(model, test_loader) 69 | import time 70 | result.to_csv('../submit/result-'+str(round(time.time())) + '.csv', header=False, index=False) 71 | print(len(result['id'].unique())) -------------------------------------------------------------------------------- /src/eval_ensemble.py: -------------------------------------------------------------------------------- 1 | from pytorch_pretrained_bert import BertTokenizer 2 | from dataset import ReviewDataset, get_data_loaders 3 | from model import OpinioNet 4 | 5 | import torch 6 | from torch.utils.data import DataLoader 7 | 8 | from tqdm import tqdm 9 | import os.path as osp 10 | import pandas as pd 11 | from dataset import ID2C, ID2P 12 | from collections import Counter 13 | 14 | 15 | def eval_epoch(model, dataloader): 16 | model.eval() 17 | step = 0 18 | result = [] 19 | pbar = tqdm(dataloader) 20 | for raw, x, _ in pbar: 21 | if step == len(dataloader): 22 | pbar.close() 23 | break 24 | rv_raw, _ = raw 25 | x = [item.cuda() for item in x] 26 | with torch.no_grad(): 27 | probs, logits = model.forward(x) 28 | pred_result = model.gen_candidates(probs) 29 | pred_result = model.nms_filter(pred_result, 0) 30 | 31 | result += pred_result 32 | 33 | step += 1 34 | return result 35 | 36 | 37 | def accum_result(old, new): 38 | if old is None: 39 | return new 40 | for i in range(len(old)): 41 | merged = Counter(dict(old[i])) + Counter(dict(new[i])) 42 | old[i] = list(merged.items()) 43 | return old 44 | 45 | 46 | def average_result(result, num): 47 | for i in range(len(result)): 48 | for j in range(len(result[i])): 49 | result[i][j] = (result[i][j][0], result[i][j][1] / num) 50 | return result 51 | 52 | 53 | def gen_submit(ret, raw): 54 | result = pd.DataFrame(columns=['id', 'A', 'O', 'C', 'P']) 55 | cur_idx = 1 56 | for i, opinions in enumerate(ret): 57 | 58 | if len(opinions) == 0: 59 | result = result.append({'id': cur_idx, 'A': '_', 'O': '_', 'C': '_', 'P': '_'}, ignore_index=True) 60 | 61 | for j, (opn, score) in enumerate(opinions): 62 | a_s, a_e, o_s, o_e = opn[0:4] 63 
| c, p = opn[4:6] 64 | if a_s == 0: 65 | A = '_' 66 | else: 67 | A = raw[i][a_s - 1: a_e] 68 | if o_s == 0: 69 | O = '_' 70 | else: 71 | O = raw[i][o_s - 1: o_e] 72 | C = ID2C[c] 73 | P = ID2P[p] 74 | result = result.append({'id': cur_idx, 'A': A, 'O': O, 'C': C, 'P': P}, ignore_index=True) 75 | cur_idx += 1 76 | return result 77 | 78 | 79 | if __name__ == '__main__': 80 | THRESH = 0.10 81 | SAVING_DIR = '../models/' 82 | MODELS = [ 83 | 'best_bert_model_774', 84 | 'best_bert_model_77', 85 | 'best_bert_model_cv0', 86 | 'best_bert_model_cv1', 87 | 'best_bert_model_cv2', 88 | 'best_bert_model_cv3', 89 | 'best_bert_model_cv4' 90 | ] 91 | tokenizer = BertTokenizer.from_pretrained('/home/zydq/.torch/models/bert/chinese-bert_chinese_wwm_pytorch', 92 | do_lower_case=True) 93 | test_dataset = ReviewDataset('../data/TEST/Test_reviews.csv', None, tokenizer) 94 | test_loader = DataLoader(test_dataset, 12, collate_fn=test_dataset.batchify, shuffle=False, num_workers=5) 95 | 96 | ret = None 97 | for name in MODELS: 98 | model_path = osp.join(SAVING_DIR, name) 99 | model = OpinioNet.from_pretrained('/home/zydq/.torch/models/bert/chinese-bert_chinese_wwm_pytorch') 100 | model.load_state_dict(torch.load(model_path)) 101 | model.cuda() 102 | ret = accum_result(ret, eval_epoch(model, test_loader)) 103 | del model 104 | ret = average_result(ret, len(MODELS)) 105 | ret = OpinioNet.nms_filter(ret, THRESH) 106 | raw = [s[0][0] for s in test_dataset.samples] 107 | result = gen_submit(ret, raw) 108 | import time 109 | 110 | result.to_csv('../submit/ensemble-' + str(round(time.time())) + '.csv', header=False, index=False) 111 | print(len(result['id'].unique()), result.shape[0]) 112 | -------------------------------------------------------------------------------- /src/eval_ensemble_final.py: -------------------------------------------------------------------------------- 1 | from pytorch_pretrained_bert import BertTokenizer 2 | from dataset import ReviewDataset, get_data_loaders 3 | from model import OpinioNet 4 | 5 | import torch 6 | from torch.utils.data import DataLoader 7 | 8 | from tqdm import tqdm 9 | import os 10 | import os.path as osp 11 | import pandas as pd 12 | from dataset import ID2C, ID2P, ID2LAPTOP 13 | from collections import Counter 14 | 15 | 16 | def eval_epoch(model, dataloader, th): 17 | model.eval() 18 | step = 0 19 | result = [] 20 | pbar = tqdm(dataloader) 21 | for raw, x, _ in pbar: 22 | if step == len(dataloader): 23 | pbar.close() 24 | break 25 | rv_raw, _ = raw 26 | x = [item.cuda() for item in x] 27 | with torch.no_grad(): 28 | probs, logits = model.forward(x, 'laptop') 29 | pred_result = model.gen_candidates(probs) 30 | pred_result = model.nms_filter(pred_result, th) 31 | 32 | result += pred_result 33 | 34 | step += 1 35 | return result 36 | 37 | 38 | def accum_result(old, new): 39 | if old is None: 40 | return new 41 | for i in range(len(old)): 42 | merged = Counter(dict(old[i])) + Counter(dict(new[i])) 43 | old[i] = list(merged.items()) 44 | return old 45 | 46 | 47 | def average_result(result, num): 48 | for i in range(len(result)): 49 | for j in range(len(result[i])): 50 | result[i][j] = (result[i][j][0], result[i][j][1] / num) 51 | return result 52 | 53 | 54 | def gen_submit(ret, raw): 55 | 56 | cur_idx = 1 57 | result = [] 58 | for i, opinions in enumerate(ret): 59 | 60 | if len(opinions) == 0: 61 | result.append([cur_idx, '_', '_', '_', '_']) 62 | # result.loc[result.shape[0]] = {'id': cur_idx, 'A': '_', 'O': '_', 'C': '_', 'P': '_'} 63 | 64 | for j, (opn, score) in 
enumerate(opinions): 65 | a_s, a_e, o_s, o_e = opn[0:4] 66 | c, p = opn[4:6] 67 | if a_s == 0: 68 | A = '_' 69 | else: 70 | A = raw[i][a_s - 1: a_e] 71 | if o_s == 0: 72 | O = '_' 73 | else: 74 | O = raw[i][o_s - 1: o_e] 75 | C = ID2LAPTOP[c] 76 | P = ID2P[p] 77 | # result.loc[result.shape[0]] = {'id': cur_idx, 'A': A, 'O': O, 'C': C, 'P': P} 78 | result.append([cur_idx, A, O, C, P]) 79 | cur_idx += 1 80 | result = pd.DataFrame(data=result, columns=['id', 'A', 'O', 'C', 'P']) 81 | return result 82 | 83 | def gen_label(ret, raw): 84 | 85 | cur_idx = 1 86 | result = [] 87 | for i, opinions in enumerate(ret): 88 | 89 | if len(opinions) == 0: 90 | result.append([cur_idx, '_', ' ', ' ', '_', ' ', ' ', '_', '_']) 91 | # result.loc[result.shape[0]] = {'id': cur_idx, 92 | # 'AspectTerms': '_', 'A_start': ' ', 'A_end': ' ', 93 | # 'OpinionTerms': '_', 'O_start': ' ', 'O_end': ' ', 94 | # 'Categories': '_', 'Polarities': '_'} 95 | 96 | for j, (opn, score) in enumerate(opinions): 97 | a_s, a_e, o_s, o_e = opn[0:4] 98 | c, p = opn[4:6] 99 | if a_s == 0: 100 | A = '_' 101 | a_s = ' ' 102 | a_e = ' ' 103 | else: 104 | A = raw[i][a_s - 1: a_e] 105 | a_s = str(a_s - 1) 106 | a_e = str(a_e) 107 | if o_s == 0: 108 | O = '_' 109 | o_s = ' ' 110 | o_e = ' ' 111 | else: 112 | O = raw[i][o_s - 1: o_e] 113 | o_s = str(o_s - 1) 114 | o_e = str(o_e) 115 | C = ID2LAPTOP[c] 116 | P = ID2P[p] 117 | # result.loc[result.shape[0]] = {'id': cur_idx, 118 | # 'AspectTerms': A, 'A_start': a_s, 'A_end': a_e, 119 | # 'OpinionTerms': O, 'O_start': o_s, 'O_end': o_e, 120 | # 'Categories': C, 'Polarities': P} 121 | result.append([cur_idx, A, a_s, a_e, O, o_s, o_e, C, P]) 122 | cur_idx += 1 123 | result = pd.DataFrame(data=result, 124 | columns=['id', 'AspectTerms', 'A_start', 'A_end', 'OpinionTerms', 'O_start', 'O_end', 'Categories', 125 | 'Polarities']) 126 | 127 | return result 128 | 129 | 130 | import json 131 | import argparse 132 | from config import PRETRAINED_MODELS 133 | if __name__ == '__main__': 134 | parser = argparse.ArgumentParser() 135 | parser.add_argument('--rv', type=str, default='../data/TEST/Test_reviews.csv') 136 | parser.add_argument('--lb', type=str, required=False) 137 | parser.add_argument('--gen_label', action='store_true') 138 | parser.add_argument('--labelfold', type=int, default=None) 139 | parser.add_argument('--o', type=str, default='Result') 140 | parser.add_argument('--bs', type=int, default=64) 141 | args = parser.parse_args() 142 | 143 | FOLDS = 5 144 | SAVING_DIR = '../models/' 145 | THRESH_DIR = '../models/thresh_dict.json' 146 | if not osp.exists('../submit'): 147 | os.mkdir('../submit') 148 | if not osp.exists('../testResults'): 149 | os.mkdir('../testResults') 150 | 151 | SUBMIT_DIR = args.o 152 | LABEL_DIR = args.o 153 | 154 | with open(THRESH_DIR, 'r', encoding='utf-8') as f: 155 | thresh_dict = json.load(f) 156 | 157 | WEIGHT_NAMES, MODEL_NAMES, THRESHS = [], [], [] 158 | for k, v in thresh_dict.items(): 159 | if v['name'] in PRETRAINED_MODELS: 160 | if args.labelfold is None or 'cv' + str(args.labelfold) in k: 161 | WEIGHT_NAMES.append(k) 162 | MODEL_NAMES.append(v['name']) 163 | THRESHS.append(v['thresh']) 164 | 165 | print(WEIGHT_NAMES) 166 | 167 | MODELS = list(zip(WEIGHT_NAMES, MODEL_NAMES, THRESHS)) 168 | # tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODELS['roberta']['path'], do_lower_case=True) 169 | # test_dataset = ReviewDataset(args.rv, args.lb, tokenizer, 'laptop') 170 | # test_loader = DataLoader(test_dataset, args.bs, collate_fn=test_dataset.batchify, 
shuffle=False, num_workers=5) 171 | ret = None 172 | raw = None 173 | lb = None 174 | num_model = 0 175 | for weight_name, model_name, thresh in MODELS: 176 | if not osp.isfile('../models/' + weight_name): 177 | continue 178 | num_model += 1 179 | model_config = PRETRAINED_MODELS[model_name] 180 | tokenizer = BertTokenizer.from_pretrained(model_config['path'], do_lower_case=True) 181 | test_dataset = ReviewDataset(args.rv, args.lb, tokenizer, 'laptop') 182 | test_loader = DataLoader(test_dataset, args.bs, collate_fn=test_dataset.batchify, shuffle=False, num_workers=5) 183 | 184 | if not raw: 185 | raw = [s[0][0] for s in test_dataset.samples] 186 | if not lb and args.lb: 187 | lb = [s[0][1] for s in test_dataset.samples] 188 | 189 | 190 | model = OpinioNet.from_pretrained(model_config['path'], version=model_config['version'], focal=model_config['focal']) 191 | print(weight_name) 192 | model.load_state_dict(torch.load('../models/' + weight_name)) 193 | model.cuda() 194 | ret = accum_result(ret, eval_epoch(model, test_loader, thresh)) 195 | del model 196 | ret = average_result(ret, num_model) 197 | ret = OpinioNet.nms_filter(ret, 0.28) 198 | 199 | if args.lb: 200 | def f1_score(P, G, S): 201 | pr = S / P 202 | rc = S / G 203 | f1 = 2 * pr * rc / (pr + rc) 204 | return f1, pr, rc 205 | 206 | 207 | def evaluate_sample(gt, pred): 208 | gt = set(gt) 209 | pred = set(pred) 210 | p = len(pred) 211 | g = len(gt) 212 | s = len(gt.intersection(pred)) 213 | return p, g, s 214 | P, G, S = 0, 0, 0 215 | for b in range(len(ret)): 216 | gt = lb[b] 217 | pred = [x[0] for x in ret[b]] 218 | p, g, s = evaluate_sample(gt, pred) 219 | 220 | P += p 221 | G += g 222 | S += s 223 | f1, pr, rc = f1_score(P, G, S) 224 | print("f1 %.5f, pr %.5f, rc %.5f" % (f1, pr, rc)) 225 | 226 | if args.gen_label: 227 | result = gen_label(ret, raw) 228 | result.to_csv(LABEL_DIR, header=True, index=False) 229 | print(len(result['id'].unique()), result.shape[0]) 230 | else: 231 | result = gen_submit(ret, raw) 232 | result.to_csv(SUBMIT_DIR, header=False, index=False) 233 | print(len(result['id'].unique()), result.shape[0]) 234 | -------------------------------------------------------------------------------- /src/eval_ensemble_round2.py: -------------------------------------------------------------------------------- 1 | from pytorch_pretrained_bert import BertTokenizer 2 | from dataset import ReviewDataset, get_data_loaders 3 | from model import OpinioNet 4 | 5 | import torch 6 | from torch.utils.data import DataLoader 7 | 8 | from tqdm import tqdm 9 | import os.path as osp 10 | import pandas as pd 11 | from dataset import ID2C, ID2P, ID2LAPTOP 12 | from collections import Counter 13 | 14 | 15 | def eval_epoch(model, dataloader, th): 16 | model.eval() 17 | step = 0 18 | result = [] 19 | pbar = tqdm(dataloader) 20 | for raw, x, _ in pbar: 21 | if step == len(dataloader): 22 | pbar.close() 23 | break 24 | rv_raw, _ = raw 25 | x = [item.cuda() for item in x] 26 | with torch.no_grad(): 27 | probs, logits = model.forward(x, 'laptop') 28 | pred_result = model.gen_candidates(probs) 29 | pred_result = model.nms_filter(pred_result, th) 30 | 31 | result += pred_result 32 | 33 | step += 1 34 | return result 35 | 36 | 37 | def accum_result(old, new): 38 | if old is None: 39 | return new 40 | for i in range(len(old)): 41 | merged = Counter(dict(old[i])) + Counter(dict(new[i])) 42 | old[i] = list(merged.items()) 43 | return old 44 | 45 | 46 | def average_result(result, num): 47 | for i in range(len(result)): 48 | for j in range(len(result[i])): 
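# Ensembling in this script is done at the score level: accum_result() merges
# per-candidate confidences across models by summing them with collections.Counter,
# and average_result() divides each summed confidence by the number of models
# (soft voting over opinion candidates) before the final OpinioNet.nms_filter(ret, 0.3)
# call at the bottom of the script. A minimal illustration with hypothetical toy scores:
#   m1 = [[((1, 2, 3, 4, 0, 1), 0.9)]]   # one candidate from model 1
#   m2 = [[((1, 2, 3, 4, 0, 1), 0.7)]]   # the same candidate from model 2
#   merged = average_result(accum_result(accum_result(None, m1), m2), 2)
#   # -> [[((1, 2, 3, 4, 0, 1), 0.8)]]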
49 | result[i][j] = (result[i][j][0], result[i][j][1] / num) 50 | return result 51 | 52 | 53 | def gen_submit(ret, raw): 54 | result = pd.DataFrame(columns=['id', 'A', 'O', 'C', 'P']) 55 | cur_idx = 1 56 | for i, opinions in enumerate(ret): 57 | 58 | if len(opinions) == 0: 59 | result.loc[result.shape[0]] = {'id': cur_idx, 'A': '_', 'O': '_', 'C': '_', 'P': '_'} 60 | 61 | for j, (opn, score) in enumerate(opinions): 62 | a_s, a_e, o_s, o_e = opn[0:4] 63 | c, p = opn[4:6] 64 | if a_s == 0: 65 | A = '_' 66 | else: 67 | A = raw[i][a_s - 1: a_e] 68 | if o_s == 0: 69 | O = '_' 70 | else: 71 | O = raw[i][o_s - 1: o_e] 72 | C = ID2LAPTOP[c] 73 | P = ID2P[p] 74 | result.loc[result.shape[0]] = {'id': cur_idx, 'A': A, 'O': O, 'C': C, 'P': P} 75 | cur_idx += 1 76 | return result 77 | 78 | def gen_label(ret, raw): 79 | result = pd.DataFrame( 80 | columns=['id', 'AspectTerms', 'A_start', 'A_end', 'OpinionTerms', 'O_start', 'O_end', 'Categories', 81 | 'Polarities']) 82 | cur_idx = 1 83 | for i, opinions in enumerate(ret): 84 | 85 | if len(opinions) == 0: 86 | result.loc[result.shape[0]] = {'id': cur_idx, 87 | 'AspectTerms': '_', 'A_start': ' ', 'A_end': ' ', 88 | 'OpinionTerms': '_', 'O_start': ' ', 'O_end': ' ', 89 | 'Categories': '_', 'Polarities': '_'} 90 | 91 | for j, (opn, score) in enumerate(opinions): 92 | a_s, a_e, o_s, o_e = opn[0:4] 93 | c, p = opn[4:6] 94 | if a_s == 0: 95 | A = '_' 96 | a_s = ' ' 97 | a_e = ' ' 98 | else: 99 | A = raw[i][a_s - 1: a_e] 100 | a_s = str(a_s - 1) 101 | a_e = str(a_e) 102 | if o_s == 0: 103 | O = '_' 104 | o_s = ' ' 105 | o_e = ' ' 106 | else: 107 | O = raw[i][o_s - 1: o_e] 108 | o_s = str(o_s - 1) 109 | o_e = str(o_e) 110 | C = ID2LAPTOP[c] 111 | P = ID2P[p] 112 | result.loc[result.shape[0]] = {'id': cur_idx, 113 | 'AspectTerms': A, 'A_start': a_s, 'A_end': a_e, 114 | 'OpinionTerms': O, 'O_start': o_s, 'O_end': o_e, 115 | 'Categories': C, 'Polarities': P} 116 | cur_idx += 1 117 | return result 118 | 119 | 120 | 121 | import json 122 | from config import PRETRAINED_MODELS 123 | if __name__ == '__main__': 124 | MODE = 'SUBMIT' 125 | 126 | SAVING_DIR = '../models/' 127 | THRESH_DIR = '../models/thresh_dict.json' 128 | 129 | if MODE == 'SUBMIT': 130 | DATA_DIR = '../data/TEST/Test_reviews.csv' 131 | SUBMIT_DIR = '../submit/Result.csv' 132 | LABEL_DIR = None 133 | else: 134 | DATA_DIR = '../data/TRAIN/Train_laptop_corpus.csv' 135 | LABEL_DIR = '../data/TRAIN/Train_laptop_corpus_labels.csv' 136 | SUBMIT_DIR = None 137 | 138 | 139 | with open(THRESH_DIR, 'r', encoding='utf-8') as f: 140 | thresh_dict = json.load(f) 141 | 142 | WEIGHT_NAMES, MODEL_NAMES, THRESHS = [], [], [] 143 | for k, v in thresh_dict.items(): 144 | WEIGHT_NAMES.append(k) 145 | MODEL_NAMES.append(v['name']) 146 | THRESHS.append(v['thresh']) 147 | 148 | MODELS = list(zip(WEIGHT_NAMES, MODEL_NAMES, THRESHS)) 149 | tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODELS['roberta']['path'], do_lower_case=True) 150 | test_dataset = ReviewDataset(DATA_DIR, None, tokenizer, 'laptop') 151 | test_loader = DataLoader(test_dataset, 12, collate_fn=test_dataset.batchify, shuffle=False, num_workers=5) 152 | ret = None 153 | num_model = 0 154 | for weight_name, model_name, thresh in MODELS: 155 | if not osp.isfile('../models/' + weight_name): 156 | continue 157 | num_model += 1 158 | model_config = PRETRAINED_MODELS[model_name] 159 | tokenizer = BertTokenizer.from_pretrained(model_config['path'], do_lower_case=True) 160 | test_dataset = ReviewDataset(DATA_DIR, None, tokenizer, 'laptop') 161 | test_loader = 
DataLoader(test_dataset, 12, collate_fn=test_dataset.batchify, shuffle=False, num_workers=5) 162 | print(model_config) 163 | model = OpinioNet.from_pretrained(model_config['path'], version=model_config['version'], focal=model_config['focal']) 164 | model.load_state_dict(torch.load('../models/' + weight_name)) 165 | model.cuda() 166 | ret = accum_result(ret, eval_epoch(model, test_loader, thresh)) 167 | del model 168 | ret = average_result(ret, num_model) 169 | # import numpy as np 170 | # import copy 171 | # 172 | # min_dis = float('inf') 173 | # threshs = list(np.arange(0.1, 0.9, 0.05)) 174 | # result = None 175 | # target_num = len(test_dataset) * 2456 / 871 176 | # raw = [s[0][0] for s in test_dataset.samples] 177 | # for th in threshs: 178 | # ret_cp = copy.deepcopy(ret) 179 | # ret_cp = OpinioNet.nms_filter(ret_cp, th) 180 | # cur_result = gen_submit(ret_cp, raw) 181 | # 182 | # if abs(cur_result.shape[0] - target_num) < min_dis: 183 | # min_dis = abs(cur_result.shape[0] - target_num) 184 | # result = cur_result 185 | 186 | ret = OpinioNet.nms_filter(ret, 0.3) 187 | raw = [s[0][0] for s in test_dataset.samples] 188 | 189 | 190 | 191 | # import time 192 | # result.to_csv('../submit/ensemble-' + str(round(time.time())) + '.csv', header=False, index=False) 193 | if MODE == 'SUBMIT': 194 | result = gen_submit(ret, raw) 195 | result.to_csv(SUBMIT_DIR, header=False, index=False) 196 | else: 197 | result = gen_label(ret, raw) 198 | result.to_csv(LABEL_DIR, header=True, index=False) 199 | print(len(result['id'].unique()), result.shape[0]) 200 | -------------------------------------------------------------------------------- /src/eval_round2.py: -------------------------------------------------------------------------------- 1 | from pytorch_pretrained_bert import BertTokenizer 2 | from dataset import ReviewDataset, get_data_loaders 3 | from model import OpinioNet 4 | 5 | import torch 6 | from torch.utils.data import DataLoader 7 | 8 | from tqdm import tqdm 9 | import os.path as osp 10 | import pandas as pd 11 | from dataset import ID2C, ID2P, ID2LAPTOP 12 | 13 | 14 | def eval_epoch(model, dataloader): 15 | model.eval() 16 | step = 0 17 | 18 | result = pd.DataFrame(columns=['id', 'A', 'O', 'C', 'P']) 19 | 20 | pbar = tqdm(dataloader) 21 | cur_idx = 1 22 | for raw, x, _ in pbar: 23 | if step == len(dataloader): 24 | pbar.close() 25 | break 26 | rv_raw, _ = raw 27 | x = [item.cuda() for item in x] 28 | with torch.no_grad(): 29 | probs, logits = model.forward(x, type='laptop') 30 | pred_result = model.gen_candidates(probs) 31 | pred_result = model.nms_filter(pred_result, 0.6) 32 | for b in range(len(pred_result)): 33 | opinions = pred_result[b] 34 | if len(opinions) == 0: 35 | result = result.append({'id': cur_idx, 'A': '_', 'O': '_', 'C': '_', 'P': '_'}, ignore_index=True) 36 | for opn in opinions: 37 | opn = opn[0] 38 | a_s, a_e, o_s, o_e = opn[0:4] 39 | c, p = opn[4:6] 40 | if a_s == 0: 41 | A = '_' 42 | else: 43 | A = rv_raw[b][a_s - 1: a_e] 44 | if o_s == 0: 45 | O = '_' 46 | else: 47 | O = rv_raw[b][o_s - 1: o_e] 48 | C = ID2LAPTOP[c] 49 | P = ID2P[p] 50 | result = result.append({'id': cur_idx, 'A': A, 'O': O, 'C': C, 'P': P}, ignore_index=True) 51 | cur_idx += 1 52 | 53 | step += 1 54 | return result 55 | 56 | 57 | if __name__ == '__main__': 58 | EP = 100 59 | SAVING_DIR = '../models/' 60 | tokenizer = BertTokenizer.from_pretrained('/home/zydq/.torch/models/bert/chinese_wwm_ext_pytorch', 61 | do_lower_case=True) 62 | test_dataset = ReviewDataset('../data/TEST/Test_reviews.csv', None, 
tokenizer, type='laptop') 63 | test_loader = DataLoader(test_dataset, 12, collate_fn=test_dataset.batchify, shuffle=False, num_workers=5) 64 | 65 | model = OpinioNet.from_pretrained('/home/zydq/.torch/models/bert/chinese_wwm_ext_pytorch') 66 | model.load_state_dict(torch.load('../models/saved_best_model_wwm_ext')) 67 | model.cuda() 68 | result = eval_epoch(model, test_loader) 69 | import time 70 | result.to_csv('../submit/result-'+str(round(time.time())) + '.csv', header=False, index=False) 71 | print(len(result['id'].unique())) -------------------------------------------------------------------------------- /src/finetune_cv.py: -------------------------------------------------------------------------------- 1 | import os 2 | from pytorch_pretrained_bert import BertTokenizer 3 | from dataset import ReviewDataset, get_data_loaders_cv, get_aug_data_loaders_cv 4 | from lr_scheduler import GradualWarmupScheduler, ReduceLROnPlateau 5 | from model import OpinioNet 6 | 7 | import torch 8 | from torch.optim import Adam 9 | 10 | from tqdm import tqdm 11 | import os.path as osp 12 | import numpy as np 13 | import copy 14 | 15 | 16 | def f1_score(P, G, S): 17 | pr = S / P 18 | rc = S / G 19 | f1 = 2 * pr * rc / (pr + rc) 20 | return f1, pr, rc 21 | 22 | 23 | def evaluate_sample(gt, pred): 24 | gt = set(gt) 25 | pred = set(pred) 26 | p = len(pred) 27 | g = len(gt) 28 | s = len(gt.intersection(pred)) 29 | return p, g, s 30 | 31 | 32 | def train_epoch(model, dataloader, optimizer, scheduler=None, type='makeup'): 33 | model.train() 34 | cum_loss = 0 35 | P, G, S = 0, 0, 0 36 | total_sample = 0 37 | step = 0 38 | pbar = tqdm(dataloader) 39 | for raw, x, y in pbar: 40 | if step == len(dataloader): 41 | pbar.close() 42 | break 43 | rv_raw, lb_raw = raw 44 | x = [item.cuda() for item in x] 45 | y = [item.cuda() for item in y] 46 | 47 | probs, logits = model.forward(x, type) 48 | loss = model.loss(logits, y) 49 | 50 | optimizer.zero_grad() 51 | loss.backward() 52 | optimizer.step() 53 | if scheduler: 54 | scheduler.step() 55 | 56 | pred_result = model.gen_candidates(probs) 57 | pred_result = model.nms_filter(pred_result, 0.1) 58 | cum_loss += loss.data.cpu().numpy() * len(rv_raw) 59 | total_sample += len(rv_raw) 60 | for b in range(len(pred_result)): 61 | gt = lb_raw[b] 62 | pred = [x[0] for x in pred_result[b]] 63 | p, g, s = evaluate_sample(gt, pred) 64 | P += p 65 | G += g 66 | S += s 67 | 68 | step += 1 69 | 70 | total_f1, total_pr, total_rc = f1_score(P, G, S) 71 | total_loss = cum_loss / total_sample 72 | 73 | return total_loss, total_f1, total_pr, total_rc 74 | 75 | 76 | def eval_epoch(model, dataloader, type='makeup'): 77 | model.eval() 78 | cum_loss = 0 79 | # P, G, S = 0, 0, 0 80 | total_sample = 0 81 | step = 0 82 | pbar = tqdm(dataloader) 83 | 84 | PRED = [] 85 | GT = [] 86 | for raw, x, y in pbar: 87 | if step == len(dataloader): 88 | pbar.close() 89 | break 90 | rv_raw, lb_raw = raw 91 | x = [item.cuda() for item in x] 92 | y = [item.cuda() for item in y] 93 | with torch.no_grad(): 94 | probs, logits = model.forward(x, type) 95 | loss = model.loss(logits, y) 96 | pred_result = model.gen_candidates(probs) 97 | PRED += pred_result 98 | GT += lb_raw 99 | cum_loss += loss.data.cpu().numpy() * len(rv_raw) 100 | total_sample += len(rv_raw) 101 | 102 | step += 1 103 | 104 | total_loss = cum_loss / total_sample 105 | 106 | threshs = list(np.arange(0.1, 0.9, 0.025)) 107 | best_f1, best_pr, best_rc = 0, 0, 0 108 | best_thresh = 0.1 109 | for th in threshs: 110 | P, G, S = 0, 0, 0 111 | PRED_COPY = 
copy.deepcopy(PRED) 112 | PRED_COPY = model.nms_filter(PRED_COPY, th) 113 | for b in range(len(PRED_COPY)): 114 | gt = GT[b] 115 | pred = [x[0] for x in PRED_COPY[b]] 116 | p, g, s = evaluate_sample(gt, pred) 117 | 118 | P += p 119 | G += g 120 | S += s 121 | f1, pr, rc = f1_score(P, G, S) 122 | if f1 > best_f1: 123 | best_f1, best_pr, best_rc = f1, pr, rc 124 | best_thresh = th 125 | 126 | return total_loss, best_f1, best_pr, best_rc, best_thresh 127 | 128 | 129 | import json 130 | import argparse 131 | from config import PRETRAINED_MODELS 132 | 133 | if __name__ == '__main__': 134 | parser = argparse.ArgumentParser() 135 | parser.add_argument('--base_model', type=str, default='roberta') 136 | parser.add_argument('--bs', type=int, default=12) 137 | parser.add_argument('--gpu', type=int, default=0) 138 | args = parser.parse_args() 139 | 140 | # os.environ["CUDA_VISIBLE_DEVICES"] = "%d" % args.gpu 141 | 142 | EP = 100 143 | FOLDS = 5 144 | SAVING_DIR = '../models/' 145 | THRESH_DIR = '../models/thresh_dict.json' 146 | model_config = PRETRAINED_MODELS[args.base_model] 147 | print(model_config) 148 | 149 | if osp.isfile(THRESH_DIR): 150 | with open(THRESH_DIR, 'r', encoding='utf-8') as f: 151 | thresh_dict = json.load(f) 152 | else: 153 | thresh_dict = {} 154 | 155 | tokenizer = BertTokenizer.from_pretrained(model_config['path'], do_lower_case=True) 156 | cv_loaders = get_data_loaders_cv(rv_path='../data/TRAIN/Train_laptop_reviews.csv', 157 | lb_path='../data/TRAIN/Train_laptop_labels.csv', 158 | tokenizer=tokenizer, 159 | batch_size=args.bs, 160 | type='laptop', 161 | folds=FOLDS) 162 | 163 | BEST_THRESHS = [0.1] * FOLDS 164 | BEST_F1 = [0] * FOLDS 165 | for cv_idx, (train_loader, val_loader) in enumerate(cv_loaders): 166 | model = OpinioNet.from_pretrained(model_config['path'], version=model_config['version'], focal=model_config['focal']) 167 | model.load_state_dict(torch.load('../models/pretrained_' + model_config['name'])) 168 | model.cuda() 169 | optimizer = Adam(model.parameters(), lr=model_config['lr']) 170 | scheduler = GradualWarmupScheduler(optimizer, total_epoch=10 * len(train_loader)) 171 | best_val_f1 = 0 172 | best_val_loss = float('inf') 173 | 174 | for e in range(EP): 175 | 176 | print('Epoch [%d/%d] train:' % (e, EP)) 177 | train_loss, train_f1, train_pr, train_rc = train_epoch(model, train_loader, optimizer, scheduler, type='laptop') 178 | print("train: loss %.5f, f1 %.5f, pr %.5f, rc %.5f" % (train_loss, train_f1, train_pr, train_rc)) 179 | 180 | print('Epoch [%d/%d] eval:' % (e, EP)) 181 | val_loss, val_f1, val_pr, val_rc, best_th = eval_epoch(model, val_loader, type='laptop') 182 | print("val: loss %.5f, f1 %.5f, pr %.5f, rc %.5f, thresh %.2f" % (val_loss, val_f1, val_pr, val_rc, best_th)) 183 | 184 | if val_loss < best_val_loss: 185 | best_val_loss = val_loss 186 | if val_f1 > best_val_f1: 187 | best_val_f1 = val_f1 188 | if val_f1 >= 0.75: 189 | saving_name = model_config['name'] + '_cv' + str(cv_idx) 190 | saving_dir = osp.join(SAVING_DIR, saving_name) 191 | torch.save(model.state_dict(), saving_dir) 192 | print('saved best model to %s' % saving_dir) 193 | BEST_THRESHS[cv_idx] = best_th 194 | BEST_F1[cv_idx] = best_val_f1 195 | thresh_dict[saving_name] = { 196 | 'name': model_config['name'], 197 | 'thresh': best_th, 198 | 'f1': best_val_f1, 199 | } 200 | with open(THRESH_DIR, 'w', encoding='utf-8') as f: 201 | json.dump(thresh_dict, f) 202 | 203 | print('best loss %.5f' % best_val_loss) 204 | print('best f1 %.5f' % best_val_f1) 205 | 206 | del model, optimizer, 
scheduler 207 | print(BEST_THRESHS) 208 | print(BEST_F1) 209 | -------------------------------------------------------------------------------- /src/lr_scheduler.py: -------------------------------------------------------------------------------- 1 | # from:https://github.com/ildoonet/pytorch-gradual-warmup-lr 2 | from torch.optim.lr_scheduler import _LRScheduler 3 | from torch.optim.lr_scheduler import ReduceLROnPlateau 4 | 5 | 6 | class GradualWarmupScheduler(_LRScheduler): 7 | 8 | def __init__(self, optimizer, total_epoch, after_scheduler=None): 9 | self.total_epoch = total_epoch 10 | self.after_scheduler = after_scheduler 11 | self.finished = False 12 | super(GradualWarmupScheduler, self).__init__(optimizer) 13 | 14 | def get_lr(self): 15 | if self.last_epoch > self.total_epoch: 16 | if self.after_scheduler: 17 | if not self.finished: 18 | self.after_scheduler.base_lrs = [base_lr for base_lr in self.base_lrs] 19 | self.finished = True 20 | return self.after_scheduler.get_lr() 21 | return [base_lr for base_lr in self.base_lrs] 22 | 23 | return [base_lr * (self.last_epoch / self.total_epoch) for base_lr in 24 | self.base_lrs] 25 | 26 | def step_ReduceLROnPlateau(self, metrics, epoch=None): 27 | if epoch is None: 28 | epoch = self.last_epoch + 1 29 | self.last_epoch = epoch if epoch != 0 else 1 # ReduceLROnPlateau is called at the end of epoch, whereas others are called at beginning 30 | if self.last_epoch <= self.total_epoch: 31 | warmup_lr = [base_lr * (self.last_epoch / self.total_epoch) for base_lr in 32 | self.base_lrs] 33 | for param_group, lr in zip(self.optimizer.param_groups, warmup_lr): 34 | param_group['lr'] = lr 35 | else: 36 | if epoch is None: 37 | self.after_scheduler.step(metrics, None) 38 | else: 39 | self.after_scheduler.step(metrics, epoch - self.total_epoch) 40 | 41 | def step(self, epoch=None, metrics=None): 42 | if type(self.after_scheduler) != ReduceLROnPlateau: 43 | if self.finished and self.after_scheduler: 44 | if epoch is None: 45 | self.after_scheduler.step(None) 46 | else: 47 | self.after_scheduler.step(epoch - self.total_epoch) 48 | else: 49 | return super(GradualWarmupScheduler, self).step(epoch) 50 | else: 51 | self.step_ReduceLROnPlateau(metrics, epoch) 52 | -------------------------------------------------------------------------------- /src/model.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | from pytorch_pretrained_bert.modeling import BertPreTrainedModel, BertOnlyMLMHead 5 | from pytorch_pretrained_bert import BertModel, BertAdam, BertConfig 6 | 7 | from dataset import ID2P, ID2COMMON, ID2MAKUP, ID2LAPTOP 8 | import numpy as np 9 | 10 | from collections import Counter 11 | 12 | def margin_negsub_bce_with_logits(logits, target, margin=0.1, neg_sub=0.25): 13 | y = torch.sigmoid(logits) 14 | keep_mask = (torch.abs(target - y) > margin).float() 15 | pos_keep = keep_mask * target 16 | neg_keep = keep_mask - pos_keep 17 | loss_pos = - pos_keep * torch.log(torch.clamp(y, 1e-10)) 18 | loss_neg = - neg_keep * neg_sub * torch.log(torch.clamp(-y + 1, 1e-10, 1.0)) 19 | loss = (loss_pos + loss_neg).mean() 20 | return loss 21 | 22 | 23 | def focalBCE_with_logits(logits, target, gamma=2): 24 | probs = torch.sigmoid(logits) 25 | grad = torch.abs(target - probs) ** gamma 26 | grad /= grad.mean() 27 | loss = grad * F.binary_cross_entropy(probs, target, reduction='none') 28 | return loss.mean() 29 | 30 | 31 | def focalCE_with_logits(logit, target, 
ignore_index=-1, alpha=None, gamma=2, smooth=0.05): 32 | num_classes = logit.size(1) 33 | logit = F.softmax(logit, dim=1) 34 | if not alpha: 35 | alpha = 1.0 36 | if logit.dim() > 2: 37 | # N,C,d1,d2 -> N,C,m (m=d1*d2*...) 38 | logit = logit.view(logit.size(0), logit.size(1), -1) 39 | logit = logit.permute(0, 2, 1).contiguous() 40 | logit = logit.view(-1, logit.size(-1)) # N*m, C 41 | 42 | target = target.view(-1) # N*m 43 | logit = logit[target != ignore_index] 44 | target = target[target != ignore_index] 45 | 46 | target_onehot = F.one_hot(target, num_classes=num_classes).float() 47 | if smooth: 48 | target_onehot = torch.clamp(target_onehot, smooth, 1.0 - smooth) 49 | 50 | pt = (target_onehot * logit).sum(1) + 1e-10 51 | logpt = pt.log() 52 | loss = -alpha * torch.pow((1 - pt), gamma) * logpt 53 | loss = loss.mean() 54 | return loss 55 | 56 | # 57 | # num_classes = logits.shape[1] 58 | # loss = F.cross_entropy(logits, target, ignore_index=ignore_index, reduction='none') 59 | # keep_mask = 1 - target.eq(ignore_index).float() 60 | # probs = torch.softmax(logits, dim=1) 61 | # target = target.masked_fill(target.eq(ignore_index), num_classes) 62 | # target = F.one_hot(target, num_classes=num_classes+1).permute((0, 2, 1))[:, :-1, :].float() # b c seq 63 | # 64 | # focal = (torch.abs(target - probs) ** gamma).max(1)[0] * keep_mask 65 | # # focal /= (focal.sum(-1, keepdim=True) / keep_mask.sum(-1, keepdim=True)) 66 | # # print(focal) 67 | # loss = (focal * loss).sum() / keep_mask.sum() 68 | # return loss 69 | 70 | 71 | class OpinioNet(BertPreTrainedModel): 72 | def __init__(self, config, hidden=100, gpu=True, dropout_prob=0.3, bert_cache_dir=None, version='large', focal=False): 73 | super(OpinioNet, self).__init__(config) 74 | self.version = version 75 | if self.version == 'tiny': 76 | self._tiny_version_init(hidden) 77 | self.focal = focal 78 | 79 | self.bert_cache_dir = bert_cache_dir 80 | 81 | self.bert = BertModel(config) 82 | self.apply(self.init_bert_weights) 83 | self.bert_hidden_size = self.config.hidden_size 84 | 85 | self.w_as11 = nn.Linear(self.bert_hidden_size, hidden) 86 | self.w_as12 = nn.Linear(self.bert_hidden_size, hidden) 87 | self.w_ae11 = nn.Linear(self.bert_hidden_size, hidden) 88 | self.w_ae12 = nn.Linear(self.bert_hidden_size, hidden) 89 | self.w_os11 = nn.Linear(self.bert_hidden_size, hidden) 90 | self.w_os12 = nn.Linear(self.bert_hidden_size, hidden) 91 | self.w_oe11 = nn.Linear(self.bert_hidden_size, hidden) 92 | self.w_oe12 = nn.Linear(self.bert_hidden_size, hidden) 93 | 94 | self.w_as2 = nn.Linear(hidden, 1) 95 | self.w_ae2 = nn.Linear(hidden, 1) 96 | self.w_os2 = nn.Linear(hidden, 1) 97 | self.w_oe2 = nn.Linear(hidden, 1) 98 | 99 | self.w_obj = nn.Linear(self.bert_hidden_size, 1) 100 | 101 | self.w_common = nn.Linear(self.bert_hidden_size, len(ID2COMMON)) 102 | self.w_makeup = nn.Linear(self.bert_hidden_size, len(ID2MAKUP) - len(ID2COMMON)) 103 | self.w_laptop = nn.Linear(self.bert_hidden_size, len(ID2LAPTOP) - len(ID2COMMON)) 104 | self.w_p = nn.Linear(self.bert_hidden_size, len(ID2P)) 105 | 106 | self.cls = BertOnlyMLMHead(config, self.bert.embeddings.word_embeddings.weight) 107 | 108 | # self.w_num = nn.Linear(self.bert_hidden_size, 8) 109 | 110 | self.dropout = nn.Dropout(dropout_prob) 111 | 112 | self.softmax = nn.Softmax(dim=-1) 113 | self.log_softmax = nn.LogSoftmax(dim=-1) 114 | 115 | self.kl_loss = nn.KLDivLoss(reduction='batchmean') 116 | 117 | if gpu: 118 | self.cuda() 119 | 120 | def _tiny_version_init(self, hidden=100): 121 | 122 | self.w_as11t = 
nn.Linear(373, hidden) 123 | self.w_as12t = nn.Linear(373, hidden) 124 | self.w_ae11t = nn.Linear(373, hidden) 125 | self.w_ae12t = nn.Linear(373, hidden) 126 | self.w_os11t = nn.Linear(373, hidden) 127 | self.w_os12t = nn.Linear(373, hidden) 128 | self.w_oe11t = nn.Linear(373, hidden) 129 | self.w_oe12t = nn.Linear(373, hidden) 130 | 131 | 132 | 133 | 134 | def foward_LM(self, input_ids, attention_mask=None, masked_lm_labels=None): 135 | sequence_output, _ = self.bert(input_ids, None, attention_mask, 136 | output_all_encoded_layers=False) 137 | prediction_scores = self.cls(sequence_output) 138 | 139 | if masked_lm_labels is not None: 140 | loss_fct = nn.CrossEntropyLoss(ignore_index=-1) 141 | masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), masked_lm_labels.view(-1)) 142 | return masked_lm_loss 143 | else: 144 | return prediction_scores 145 | 146 | 147 | def _forward_large(self, rv_seq, type='laptop'): 148 | as_logits = self.w_as2(F.leaky_relu(self.w_as11(self.dropout(rv_seq)).unsqueeze(2) 149 | + self.w_as12(self.dropout(rv_seq)).unsqueeze(1))).squeeze(-1) 150 | 151 | ae_logits = self.w_ae2(F.leaky_relu(self.w_ae11(self.dropout(rv_seq)).unsqueeze(2) 152 | + self.w_ae12(self.dropout(rv_seq)).unsqueeze(1))).squeeze(-1) 153 | 154 | os_logits = self.w_os2(F.leaky_relu(self.w_os11(self.dropout(rv_seq)).unsqueeze(2) 155 | + self.w_os12(self.dropout(rv_seq)).unsqueeze(1))).squeeze(-1) 156 | 157 | oe_logits = self.w_oe2(F.leaky_relu(self.w_oe11(self.dropout(rv_seq)).unsqueeze(2) 158 | + self.w_oe12(self.dropout(rv_seq)).unsqueeze(1))).squeeze(-1) 159 | 160 | obj_logits = self.w_obj(self.dropout(rv_seq)).squeeze(-1) 161 | 162 | # c_logits = self.w_c(self.dropout(rv_seq)) 163 | common_logits = self.w_common(self.dropout(rv_seq)) 164 | if type == 'laptop': 165 | special_logits = self.w_laptop(self.dropout(rv_seq)) 166 | else: 167 | special_logits = self.w_makeup(self.dropout(rv_seq)) 168 | 169 | c_logits = torch.cat([common_logits, special_logits], dim=-1) 170 | p_logits = self.w_p(self.dropout(rv_seq)) 171 | 172 | return as_logits, ae_logits, os_logits, oe_logits, obj_logits, c_logits, p_logits 173 | 174 | def _forward_tiny(self, rv_seq, type='laptop'): 175 | 176 | obj_logits = rv_seq[:, :, 0] 177 | 178 | common_logits = rv_seq[:, :, 1: 8] 179 | if type == 'laptop': 180 | special_logits = rv_seq[:, :, 8: 12] 181 | else: 182 | special_logits = rv_seq[:, :, 12: 18] 183 | c_logits = torch.cat([common_logits, special_logits], dim=-1) 184 | p_logits = rv_seq[:, :, 18: 21] 185 | 186 | as_logits = self.w_as2(F.leaky_relu(self.w_as11t(self.dropout(rv_seq[:, :, 21: 394])).unsqueeze(2) 187 | + self.w_as12t(self.dropout(rv_seq[:, :, 21: 394])).unsqueeze(1))).squeeze(-1) 188 | 189 | ae_logits = self.w_ae2(F.leaky_relu(self.w_ae11t(self.dropout(rv_seq[:, :, 21: 394])).unsqueeze(2) 190 | + self.w_ae12t(self.dropout(rv_seq[:, :, 21: 394])).unsqueeze(1))).squeeze(-1) 191 | 192 | os_logits = self.w_os2(F.leaky_relu(self.w_os11t(self.dropout(rv_seq[:, :, 394: 767])).unsqueeze(2) 193 | + self.w_os12t(self.dropout(rv_seq[:, :, 394: 767])).unsqueeze(1))).squeeze(-1) 194 | 195 | oe_logits = self.w_oe2(F.leaky_relu(self.w_oe11t(self.dropout(rv_seq[:, :, 394: 767])).unsqueeze(2) 196 | + self.w_oe12t(self.dropout(rv_seq[:, :, 394: 767])).unsqueeze(1))).squeeze(-1) 197 | 198 | return as_logits, ae_logits, os_logits, oe_logits, obj_logits, c_logits, p_logits 199 | 200 | def forward(self, input, type='laptop'): 201 | rv_seq, att_mask, rv_mask = input 202 | 203 | rv_seq, cls_emb = 
self.bert(input_ids=rv_seq, attention_mask=att_mask, output_all_encoded_layers=False) 204 | 205 | 206 | 207 | as_logits, ae_logits, os_logits, oe_logits, obj_logits, c_logits, p_logits = self._forward_large(rv_seq, type) 208 | if self.version == 'tiny': 209 | as_logits_t, ae_logits_t, os_logits_t, oe_logits_t, obj_logits_t, c_logits_t, p_logits_t = self._forward_tiny(rv_seq, type) 210 | 211 | as_logits += as_logits_t 212 | ae_logits += ae_logits_t 213 | os_logits += os_logits_t 214 | oe_logits += oe_logits_t 215 | 216 | obj_logits += obj_logits_t 217 | c_logits += c_logits_t 218 | p_logits += p_logits_t 219 | 220 | 221 | rv_mask_with_cls = rv_mask.clone() 222 | rv_mask_with_cls[:, 0] = 1 223 | pointer_mask = rv_mask_with_cls.unsqueeze(2) * rv_mask_with_cls.unsqueeze(1) 224 | pointer_mask[:, 0, :] = 0 225 | 226 | pointer_mask = (1 - pointer_mask).byte() 227 | rv_mask = (1 - rv_mask).byte() 228 | 229 | as_logits = as_logits.masked_fill(pointer_mask, -1e5) 230 | ae_logits = ae_logits.masked_fill(pointer_mask, -1e5) 231 | os_logits = os_logits.masked_fill(pointer_mask, -1e5) 232 | oe_logits = oe_logits.masked_fill(pointer_mask, -1e5) 233 | 234 | obj_logits = obj_logits.masked_fill(rv_mask, -1e5) 235 | 236 | probs = [self.softmax(as_logits), 237 | self.softmax(ae_logits), 238 | self.softmax(os_logits), 239 | self.softmax(oe_logits), 240 | torch.sigmoid(obj_logits), 241 | self.softmax(c_logits), 242 | self.softmax(p_logits)] 243 | 244 | logits = [as_logits, ae_logits, os_logits, oe_logits, obj_logits, c_logits, p_logits] 245 | 246 | return probs, logits 247 | 248 | def loss(self, preds, targets, neg_sub=False): 249 | as_logits, ae_logits, os_logits, oe_logits, obj_logits, c_logits, p_logits = preds 250 | as_tgt, ae_tgt, os_tgt, oe_tgt, obj_tgt, c_tgt, p_tgt = targets 251 | 252 | as_logits = as_logits.permute((0, 2, 1)) 253 | ae_logits = ae_logits.permute((0, 2, 1)) 254 | os_logits = os_logits.permute((0, 2, 1)) 255 | oe_logits = oe_logits.permute((0, 2, 1)) 256 | c_logits = c_logits.permute((0, 2, 1)) 257 | p_logits = p_logits.permute((0, 2, 1)) 258 | 259 | loss = 0 260 | 261 | if self.focal: 262 | loss += focalCE_with_logits(as_logits, as_tgt, ignore_index=-1) 263 | loss += focalCE_with_logits(ae_logits, ae_tgt, ignore_index=-1) 264 | loss += focalCE_with_logits(os_logits, os_tgt, ignore_index=-1) 265 | loss += focalCE_with_logits(oe_logits, oe_tgt, ignore_index=-1) 266 | loss += focalCE_with_logits(c_logits, c_tgt, ignore_index=-1) 267 | loss += focalCE_with_logits(p_logits, p_tgt, ignore_index=-1) 268 | else: 269 | loss += F.cross_entropy(as_logits, as_tgt, ignore_index=-1) 270 | loss += F.cross_entropy(ae_logits, ae_tgt, ignore_index=-1) 271 | loss += F.cross_entropy(os_logits, os_tgt, ignore_index=-1) 272 | loss += F.cross_entropy(oe_logits, oe_tgt, ignore_index=-1) 273 | loss += F.cross_entropy(c_logits, c_tgt, ignore_index=-1) 274 | loss += F.cross_entropy(p_logits, p_tgt, ignore_index=-1) 275 | 276 | if neg_sub: 277 | loss += margin_negsub_bce_with_logits(obj_logits, obj_tgt) 278 | else: 279 | loss += margin_negsub_bce_with_logits(obj_logits, obj_tgt, neg_sub=1.0) 280 | 281 | return loss 282 | 283 | def gen_candidates(self, probs, thresh=0.01): 284 | as_probs, ae_probs, os_probs, oe_probs, obj_probs, c_probs, p_probs = probs 285 | as_scores, as_preds = as_probs.max(dim=-1) 286 | ae_scores, ae_preds = ae_probs.max(dim=-1) 287 | os_scores, os_preds = os_probs.max(dim=-1) 288 | oe_scores, oe_preds = oe_probs.max(dim=-1) 289 | 290 | c_scores, c_preds = c_probs.max(dim=-1) 291 | p_scores, 
p_preds = p_probs.max(dim=-1) 292 | confidence = ( 293 | as_scores * ae_scores * os_scores * oe_scores * p_scores * c_scores * obj_probs).data.cpu().numpy() 294 | 295 | as_preds = as_preds.data.cpu().numpy() 296 | ae_preds = ae_preds.data.cpu().numpy() 297 | os_preds = os_preds.data.cpu().numpy() 298 | oe_preds = oe_preds.data.cpu().numpy() 299 | 300 | conf_rank = (-confidence).argsort(-1) 301 | 302 | c_preds = c_preds.data.cpu().numpy() 303 | p_preds = p_preds.data.cpu().numpy() 304 | 305 | result = [] 306 | for b in range(len(c_preds)): 307 | sample_res = [] 308 | for pos in conf_rank[b]: 309 | if sample_res and confidence[b][pos] < thresh: 310 | break 311 | a_s = as_preds[b][pos] 312 | a_e = ae_preds[b][pos] 313 | o_s = os_preds[b][pos] 314 | o_e = oe_preds[b][pos] 315 | 316 | cls = c_preds[b][pos] 317 | polar = p_preds[b][pos] 318 | 319 | conf = confidence[b][pos] 320 | 321 | # 检查自身是否合理 322 | if a_s > a_e: 323 | continue 324 | if o_s > o_e: 325 | continue 326 | if min(a_e, o_e) >= max(a_s, o_s): # 内部重叠 327 | continue 328 | 329 | # 检查与前面的是否重叠 330 | # is_bad = False 331 | # for sample in sample_res: 332 | # s1, e1, s2, e2 = sample[0][:4] 333 | # if min(a_e, e1) >= max(a_s, s1) and min(o_e, e2) >= max(o_s, s2): 334 | # is_bad = True 335 | # break 336 | # if is_bad: 337 | # continue 338 | 339 | sample_res.append(((a_s, a_e, o_s, o_e, cls, polar), conf)) 340 | result.append(sample_res) 341 | return result 342 | 343 | def beam_search(self, probs, thresh=0.01): 344 | as_probs, ae_probs, os_probs, oe_probs, obj_probs, c_probs, p_probs = probs 345 | 346 | c_scores, c_preds = c_probs.max(dim=-1) 347 | p_scores, p_preds = p_probs.max(dim=-1) 348 | 349 | c_preds = c_preds.data.cpu().numpy() 350 | p_preds = p_preds.data.cpu().numpy() 351 | 352 | as_sorted = as_probs.argsort(dim=-1, descending=True) 353 | ae_sorted = ae_probs.argsort(dim=-1, descending=True) 354 | os_sorted = os_probs.argsort(dim=-1, descending=True) 355 | oe_sorted = oe_probs.argsort(dim=-1, descending=True) 356 | max_conf = (as_probs.gather(dim=2, index=as_sorted[:, :, 0:1]).squeeze(-1) * 357 | ae_probs.gather(dim=2, index=ae_sorted[:, :, 0:1]).squeeze(-1) * 358 | os_probs.gather(dim=2, index=os_sorted[:, :, 0:1]).squeeze(-1) * 359 | oe_probs.gather(dim=2, index=oe_sorted[:, :, 0:1]).squeeze(-1) * 360 | obj_probs * c_scores * p_scores) 361 | conf_rank = max_conf.argsort(dim=-1, descending=True) 362 | result = [] 363 | 364 | for b in range(len(conf_rank)): 365 | sample_res = [] 366 | # print('=====start====') 367 | for pos_idx in range(len(conf_rank[b])): 368 | 369 | pos = conf_rank[b][pos_idx] 370 | cur_conf = max_conf[b][pos].data.cpu().item() 371 | # print(max_conf[b]) 372 | # print(as_probs[b][pos]) 373 | # print('entering position %d, conf: %.5f' % (pos, cur_conf)) 374 | if cur_conf < thresh and len(sample_res) > 0: 375 | break 376 | as_idx, ae_idx, os_idx, oe_idx = 0, 0, 0, 0 377 | 378 | cls = c_preds[b][pos] 379 | polar = p_preds[b][pos] 380 | 381 | while True: 382 | 383 | a_s = as_sorted[b][pos][as_idx].data.cpu().item() 384 | a_e = ae_sorted[b][pos][ae_idx].data.cpu().item() 385 | o_s = os_sorted[b][pos][os_idx].data.cpu().item() 386 | o_e = oe_sorted[b][pos][oe_idx].data.cpu().item() 387 | 388 | is_bad = False 389 | # 检查自身是否合理 390 | if a_s > a_e: 391 | is_bad = True 392 | elif o_s > o_e: 393 | is_bad = True 394 | elif min(a_e, o_e) >= max(a_s, o_s): # 内部重叠 395 | is_bad = True 396 | 397 | # 继续搜索 398 | # print(a_s, a_e, o_s, o_e, cur_conf, is_bad) 399 | if is_bad: 400 | if as_idx != as_sorted.shape[-1] - 1: 401 | 
as_cur_score = as_probs[b][pos][as_sorted[b][pos][as_idx]].data.cpu().item() 402 | as_nxt_score = as_probs[b][pos][as_sorted[b][pos][as_idx + 1]].data.cpu().item() 403 | nxt_as_conf = cur_conf * as_nxt_score / as_cur_score 404 | else: 405 | nxt_as_conf = 0 406 | 407 | if ae_idx != ae_sorted.shape[-1] - 1: 408 | ae_cur_score = ae_probs[b][pos][ae_sorted[b][pos][ae_idx]].data.cpu().item() 409 | ae_nxt_score = ae_probs[b][pos][ae_sorted[b][pos][ae_idx + 1]].data.cpu().item() 410 | nxt_ae_conf = cur_conf * ae_nxt_score / ae_cur_score 411 | else: 412 | nxt_ae_conf = 0 413 | 414 | if os_idx != os_sorted.shape[-1] - 1: 415 | os_cur_score = os_probs[b][pos][os_sorted[b][pos][os_idx]].data.cpu().item() 416 | os_nxt_score = os_probs[b][pos][os_sorted[b][pos][os_idx + 1]].data.cpu().item() 417 | nxt_os_conf = cur_conf * os_nxt_score / os_cur_score 418 | else: 419 | nxt_os_conf = 0 420 | 421 | if oe_idx != oe_sorted.shape[-1] - 1: 422 | oe_cur_score = oe_probs[b][pos][oe_sorted[b][pos][oe_idx]].data.cpu().item() 423 | oe_nxt_score = oe_probs[b][pos][oe_sorted[b][pos][oe_idx + 1]].data.cpu().item() 424 | nxt_oe_conf = cur_conf * oe_nxt_score / oe_cur_score 425 | else: 426 | nxt_oe_conf = 0 427 | 428 | if nxt_as_conf == nxt_ae_conf == nxt_os_conf == nxt_oe_conf == 0: 429 | break 430 | 431 | max_conf_idx = np.argmax([nxt_as_conf, nxt_ae_conf, nxt_os_conf, nxt_oe_conf]) 432 | if max_conf_idx == 0: 433 | as_idx += 1 434 | cur_conf = nxt_as_conf 435 | elif max_conf_idx == 1: 436 | ae_idx += 1 437 | cur_conf = nxt_ae_conf 438 | elif max_conf_idx == 2: 439 | os_idx += 1 440 | cur_conf = nxt_os_conf 441 | else: 442 | oe_idx += 1 443 | cur_conf = nxt_oe_conf 444 | # print('next conf', cur_conf) 445 | if cur_conf < thresh and len(sample_res) > 0: 446 | break 447 | 448 | # if pos_idx != len(conf_rank[b]) - 1: 449 | # nxt_pos = conf_rank[b][pos_idx + 1] 450 | # nxt_max_conf = max_conf[b][nxt_pos].data.cpu().item() 451 | # if cur_conf < nxt_max_conf: 452 | # break 453 | 454 | else: 455 | # print('inserting ', ((a_s, a_e, o_s, o_e, cls, polar), cur_conf)) 456 | sample_res.append(((a_s, a_e, o_s, o_e, cls, polar), cur_conf)) 457 | break 458 | 459 | result.append(sample_res) 460 | return result 461 | 462 | @staticmethod 463 | def nms_filter(results, thresh=0.1): 464 | for i, opinions in enumerate(results): 465 | # 对于重复结果分数取平均值 466 | # scores = {} 467 | # for k, v in opinions: 468 | # if k not in scores: 469 | # scores[k] = [0, 0] # 总分、数量 470 | # scores[k][0] += v 471 | # scores[k][1] += 1 472 | # for k in scores.keys(): 473 | # scores[k] = scores[k][0] / scores[k][1] 474 | # # 按照分数排序 进入nms筛选 475 | # opinions = sorted(list(scores.items()), key=lambda x: -x[1]) 476 | opinions = sorted(opinions, key=lambda x: -x[1]) 477 | nmsopns = [] 478 | for opn in opinions: 479 | if opn[1] < thresh and len(nmsopns) > 0: 480 | break 481 | isbad = False 482 | for nmsopn in nmsopns: 483 | as1, ae1, os1, oe1 = opn[0][:4] 484 | as2, ae2, os2, oe2 = nmsopn[0][:4] 485 | if min(ae1, ae2) >= max(as1, as2) and min(oe1, oe2) >= max(os1, os2): 486 | isbad = True 487 | break 488 | if not isbad: 489 | nmsopns.append(opn) 490 | results[i] = nmsopns 491 | # print(results) 492 | return results 493 | 494 | 495 | if __name__ == '__main__': 496 | from pytorch_pretrained_bert import BertTokenizer 497 | from dataset import ReviewDataset 498 | 499 | tokenizer = BertTokenizer.from_pretrained('/home/zydq/.torch/models/bert/ERNIE', 500 | do_lower_case=True) 501 | model = OpinioNet.from_pretrained('/home/zydq/.torch/models/bert/ERNIE') 502 | model.cuda() 503 | 
model.train() 504 | 505 | d = ReviewDataset('../data/TRAIN/Train_reviews.csv', '../data/TRAIN/Train_labels.csv', tokenizer) 506 | b_raw, b_in, b_tgt = d.batchify(d[:10]) 507 | 508 | for i in range(len(b_in)): 509 | b_in[i] = b_in[i].cuda() 510 | for i in range(len(b_tgt)): 511 | b_tgt[i] = b_tgt[i].cuda() 512 | print(b_in) 513 | probs, logits = model.forward(b_in) 514 | loss = model.loss(logits, b_tgt) 515 | result = model.nms(probs) 516 | print(loss) 517 | print(result) 518 | -------------------------------------------------------------------------------- /src/pretrain.py: -------------------------------------------------------------------------------- 1 | import os 2 | from pytorch_pretrained_bert import BertTokenizer 3 | from dataset import get_pretrain_loaders 4 | from model import OpinioNet 5 | 6 | import torch 7 | from torch.optim import Adam 8 | 9 | from lr_scheduler import GradualWarmupScheduler, ReduceLROnPlateau 10 | from tqdm import tqdm 11 | import os.path as osp 12 | import numpy as np 13 | import copy 14 | 15 | 16 | def f1_score(P, G, S): 17 | pr = S / P 18 | rc = S / G 19 | f1 = 2 * pr * rc / (pr + rc) 20 | return f1, pr, rc 21 | 22 | 23 | def evaluate_sample(gt, pred): 24 | gt = set(gt) 25 | pred = set(pred) 26 | p = len(pred) 27 | g = len(gt) 28 | s = len(gt.intersection(pred)) 29 | # print(p, g, s) 30 | return p, g, s 31 | 32 | 33 | def train_epoch(model, makeup_loader, corpus_loader, optimizer, scheduler=None): 34 | model.train() 35 | cum_loss = 0 36 | cum_lm_loss = 0 37 | total_lm_sample = 0 38 | P, G, S = 0, 0, 0 39 | total_sample = 0 40 | step = 0 41 | pbar = tqdm(range(max(len(makeup_loader), len(corpus_loader)))) 42 | makeup_iter = iter(makeup_loader) 43 | corpus_iter = iter(corpus_loader) 44 | for _ in pbar: 45 | if step == max(len(makeup_loader), len(corpus_loader)): 46 | pbar.close() 47 | break 48 | 49 | try: 50 | corpus_ids, corpus_attn, lm_label = next(corpus_iter) 51 | except StopIteration: 52 | corpus_iter = iter(corpus_loader) 53 | corpus_ids, corpus_attn, lm_label = next(corpus_iter) 54 | 55 | corpus_ids = corpus_ids.cuda() 56 | corpus_attn = corpus_attn.cuda() 57 | lm_label = lm_label.cuda() 58 | loss = model.foward_LM(corpus_ids, corpus_attn, lm_label) 59 | optimizer.zero_grad() 60 | loss.backward() 61 | optimizer.step() 62 | if scheduler: 63 | scheduler.step() 64 | cum_lm_loss += loss.data.cpu().numpy() * len(corpus_ids) 65 | total_lm_sample += len(corpus_ids) 66 | del corpus_ids, corpus_attn, lm_label, loss 67 | 68 | try: 69 | makeup_raw, makeup_x, makeup_y = next(makeup_iter) 70 | except StopIteration: 71 | makeup_iter = iter(makeup_loader) 72 | makeup_raw, makeup_x, makeup_y = next(makeup_iter) 73 | 74 | makeup_rv_raw, makeup_lb_raw = makeup_raw 75 | makeup_x = [item.cuda() for item in makeup_x] 76 | makeup_y = [item.cuda() for item in makeup_y] 77 | 78 | makeup_probs, makeup_logits = model.forward(makeup_x, type='makeup') 79 | loss = model.loss(makeup_logits, makeup_y) 80 | 81 | optimizer.zero_grad() 82 | loss.backward() 83 | optimizer.step() 84 | if scheduler: 85 | scheduler.step() 86 | 87 | makeup_pred = model.gen_candidates(makeup_probs) 88 | makeup_pred = model.nms_filter(makeup_pred, 0.1) 89 | 90 | for b in range(len(makeup_pred)): 91 | gt = makeup_lb_raw[b] 92 | pred = [x[0] for x in makeup_pred[b]] 93 | p, g, s = evaluate_sample(gt, pred) 94 | P += p 95 | G += g 96 | S += s 97 | 98 | cum_loss += loss.data.cpu().numpy() * len(makeup_rv_raw) 99 | total_sample += len(makeup_rv_raw) 100 | step += 1 101 | while makeup_x: 102 | a = makeup_x.pop(); 
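# This pop()/del pattern (also applied to makeup_y, makeup_probs and makeup_logits
# below) drops the last Python references to each batch tensor so the corresponding
# CUDA memory can be released between the alternating MLM and extraction steps,
# presumably to keep the peak footprint within the single 8 GB card described in
# the README. A roughly equivalent, more compact form would be:
#   makeup_x.clear(); makeup_y.clear(); del makeup_probs, makeup_logits
#   torch.cuda.empty_cache()   # optional; only returns PyTorch's cached blocks to the driver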
103 | del a 104 | while makeup_y: 105 | a = makeup_y.pop(); 106 | del a 107 | 108 | while makeup_probs: 109 | a = makeup_probs.pop(); 110 | del a 111 | a = makeup_logits.pop(); 112 | del a 113 | 114 | del loss 115 | 116 | total_f1, total_pr, total_rc = f1_score(P, G, S) 117 | total_loss = cum_loss / total_sample 118 | total_lm_loss = cum_lm_loss / total_lm_sample 119 | return total_loss, total_lm_loss, total_f1, total_pr, total_rc 120 | 121 | 122 | def eval_epoch(model, dataloader, type='makeup'): 123 | model.eval() 124 | cum_loss = 0 125 | # P, G, S = 0, 0, 0 126 | total_sample = 0 127 | step = 0 128 | pbar = tqdm(dataloader) 129 | 130 | PRED = [] 131 | GT = [] 132 | for raw, x, y in pbar: 133 | if step == len(dataloader): 134 | pbar.close() 135 | break 136 | rv_raw, lb_raw = raw 137 | x = [item.cuda() for item in x] 138 | y = [item.cuda() for item in y] 139 | with torch.no_grad(): 140 | probs, logits = model.forward(x, type) 141 | loss = model.loss(logits, y) 142 | pred_result = model.gen_candidates(probs) 143 | PRED += pred_result 144 | GT += lb_raw 145 | cum_loss += loss.data.cpu().numpy() * len(rv_raw) 146 | total_sample += len(rv_raw) 147 | 148 | step += 1 149 | 150 | total_loss = cum_loss / total_sample 151 | 152 | threshs = list(np.arange(0.1, 0.9, 0.05)) 153 | best_f1, best_pr, best_rc = 0, 0, 0 154 | best_thresh = 0.1 155 | for th in threshs: 156 | P, G, S = 0, 0, 0 157 | PRED_COPY = copy.deepcopy(PRED) 158 | PRED_COPY = model.nms_filter(PRED_COPY, th) 159 | for b in range(len(PRED_COPY)): 160 | gt = GT[b] 161 | pred = [x[0] for x in PRED_COPY[b]] 162 | p, g, s = evaluate_sample(gt, pred) 163 | 164 | P += p 165 | G += g 166 | S += s 167 | f1, pr, rc = f1_score(P, G, S) 168 | if f1 > best_f1: 169 | best_f1, best_pr, best_rc = f1, pr, rc 170 | best_thresh = th 171 | 172 | return total_loss, best_f1, best_pr, best_rc, best_thresh 173 | 174 | 175 | import argparse 176 | from config import PRETRAINED_MODELS 177 | 178 | if __name__ == '__main__': 179 | parser = argparse.ArgumentParser() 180 | parser.add_argument('--base_model', type=str, default='roberta') 181 | parser.add_argument('--bs', type=int, default=12) 182 | parser.add_argument('--gpu', type=int, default=0) 183 | args = parser.parse_args() 184 | 185 | # os.environ["CUDA_VISIBLE_DEVICES"] = "%d" % args.gpu 186 | 187 | EP = 25 188 | model_config = PRETRAINED_MODELS[args.base_model] 189 | print(model_config) 190 | SAVING_DIR = '../models/' 191 | 192 | tokenizer = BertTokenizer.from_pretrained(model_config['path'], do_lower_case=True) 193 | makeup_train_loader, makeup_val_loader, corpus_loader = get_pretrain_loaders(tokenizer, batch_size=args.bs) 194 | model = OpinioNet.from_pretrained(model_config['path'], version=model_config['version'], focal=model_config['focal']) 195 | model.cuda() 196 | optimizer = Adam(model.parameters(), lr=model_config['lr']) 197 | scheduler = GradualWarmupScheduler(optimizer, total_epoch=2 * max(len(makeup_train_loader), len(corpus_loader))) 198 | best_val_f1 = 0 199 | best_val_loss = float('inf') 200 | for e in range(EP): 201 | 202 | print('Epoch [%d/%d] train:' % (e, EP)) 203 | train_loss, train_lm_loss, train_f1, train_pr, train_rc = train_epoch(model, makeup_train_loader, corpus_loader, 204 | optimizer, scheduler) 205 | print( 206 | "loss %.5f, lm loss %.5f f1 %.5f, pr %.5f, rc %.5f" % (train_loss, train_lm_loss, train_f1, train_pr, train_rc)) 207 | 208 | print('Epoch [%d/%d] makeup eval:' % (e, EP)) 209 | val_loss, val_f1, val_pr, val_rc, best_th = eval_epoch(model, makeup_val_loader, type='makeup') 
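# eval_epoch() above generates candidates once per batch, then sweeps NMS confidence
# thresholds over np.arange(0.1, 0.9, 0.05); best_th is the cutoff that maximises
# micro-averaged F1 on the makeup validation split, where, over the pooled counts,
#   P = #predicted tuples, G = #gold tuples, S = #exact matches
#   precision = S / P, recall = S / G, f1 = 2 * precision * recall / (precision + recall)
# Here best_th is only logged; in finetune_cv.py the analogous per-fold value is
# written to models/thresh_dict.json and reused as each weight's candidate-filtering
# threshold in the eval_ensemble_* scripts.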
210 | print("makeup_val: loss %.5f, f1 %.5f, pr %.5f, rc %.5f, thresh %.2f" % (val_loss, val_f1, val_pr, val_rc, best_th)) 211 | 212 | if val_loss < best_val_loss: 213 | best_val_loss = val_loss 214 | if val_f1 > best_val_f1: 215 | best_val_f1 = val_f1 216 | if best_val_f1 >= 0.75: 217 | saving_dir = osp.join(SAVING_DIR, 'pretrained_' + model_config['name']) 218 | torch.save(model.state_dict(), saving_dir) 219 | print('saved best model to %s' % saving_dir) 220 | else: 221 | break 222 | print('best loss %.5f' % best_val_loss) 223 | print('best f1 %.5f' % best_val_f1) 224 | 225 | -------------------------------------------------------------------------------- /src/pretrain2.py: -------------------------------------------------------------------------------- 1 | import os 2 | from pytorch_pretrained_bert import BertTokenizer 3 | from dataset import get_pretrain2_loaders 4 | from model import OpinioNet 5 | 6 | import torch 7 | from torch.optim import Adam 8 | 9 | from lr_scheduler import GradualWarmupScheduler, ReduceLROnPlateau 10 | from tqdm import tqdm 11 | import os.path as osp 12 | import numpy as np 13 | import copy 14 | 15 | 16 | def f1_score(P, G, S): 17 | pr = S / P 18 | rc = S / G 19 | f1 = 2 * pr * rc / (pr + rc) 20 | return f1, pr, rc 21 | 22 | 23 | def evaluate_sample(gt, pred): 24 | gt = set(gt) 25 | pred = set(pred) 26 | p = len(pred) 27 | g = len(gt) 28 | s = len(gt.intersection(pred)) 29 | # print(p, g, s) 30 | return p, g, s 31 | 32 | 33 | def train_epoch(model, makeup_loader, laptop_loader, corpus_loader, optimizer, scheduler=None): 34 | model.train() 35 | 36 | cum_lm_loss = 0 37 | cum_makeup_loss = 0 38 | cum_laptop_loss = 0 39 | total_lm_sample = 0 40 | total_makeup_sample = 0 41 | total_laptop_sample = 0 42 | P_makeup, G_makeup, S_makeup = 0, 0, 0 43 | P_laptop, G_laptop, S_laptop = 0, 0, 0 44 | step = 0 45 | epoch_len = max(len(makeup_loader), len(corpus_loader), len(laptop_loader)) 46 | pbar = tqdm(range(epoch_len)) 47 | 48 | corpus_iter = iter(corpus_loader) 49 | makeup_iter = iter(makeup_loader) 50 | laptop_iter = iter(laptop_loader) 51 | 52 | for _ in pbar: 53 | if step == epoch_len: 54 | pbar.close() 55 | break 56 | ################ MLM ################### 57 | try: 58 | corpus_ids, corpus_attn, lm_label = next(corpus_iter) 59 | except StopIteration: 60 | corpus_iter = iter(corpus_loader) 61 | corpus_ids, corpus_attn, lm_label = next(corpus_iter) 62 | 63 | corpus_ids = corpus_ids.cuda() 64 | corpus_attn = corpus_attn.cuda() 65 | lm_label = lm_label.cuda() 66 | loss = model.foward_LM(corpus_ids, corpus_attn, lm_label) 67 | optimizer.zero_grad() 68 | loss.backward() 69 | optimizer.step() 70 | if scheduler: 71 | scheduler.step() 72 | cum_lm_loss += loss.data.cpu().numpy() * len(corpus_ids) 73 | total_lm_sample += len(corpus_ids) 74 | del corpus_ids, corpus_attn, lm_label, loss 75 | 76 | ############### makeup ################## 77 | try: 78 | makeup_raw, makeup_x, makeup_y = next(makeup_iter) 79 | except StopIteration: 80 | makeup_iter = iter(makeup_loader) 81 | makeup_raw, makeup_x, makeup_y = next(makeup_iter) 82 | 83 | makeup_rv_raw, makeup_lb_raw = makeup_raw 84 | makeup_x = [item.cuda() for item in makeup_x] 85 | makeup_y = [item.cuda() for item in makeup_y] 86 | 87 | makeup_probs, makeup_logits = model.forward(makeup_x, type='makeup') 88 | loss = model.loss(makeup_logits, makeup_y) 89 | 90 | optimizer.zero_grad() 91 | loss.backward() 92 | optimizer.step() 93 | if scheduler: 94 | scheduler.step() 95 | 96 | makeup_pred = model.gen_candidates(makeup_probs) 97 | 
makeup_pred = model.nms_filter(makeup_pred, 0.1) 98 | 99 | for b in range(len(makeup_pred)): 100 | gt = makeup_lb_raw[b] 101 | pred = [x[0] for x in makeup_pred[b]] 102 | p, g, s = evaluate_sample(gt, pred) 103 | P_makeup += p 104 | G_makeup += g 105 | S_makeup += s 106 | 107 | cum_makeup_loss += loss.data.cpu().numpy() * len(makeup_rv_raw) 108 | total_makeup_sample += len(makeup_rv_raw) 109 | while makeup_x: 110 | a = makeup_x.pop(); 111 | del a 112 | while makeup_y: 113 | a = makeup_y.pop(); 114 | del a 115 | 116 | while makeup_probs: 117 | a = makeup_probs.pop(); 118 | del a 119 | a = makeup_logits.pop(); 120 | del a 121 | 122 | ############### laptop ################## 123 | try: 124 | laptop_raw, laptop_x, laptop_y = next(laptop_iter) 125 | except StopIteration: 126 | laptop_iter = iter(laptop_loader) 127 | laptop_raw, laptop_x, laptop_y = next(laptop_iter) 128 | 129 | laptop_rv_raw, laptop_lb_raw = laptop_raw 130 | laptop_x = [item.cuda() for item in laptop_x] 131 | laptop_y = [item.cuda() for item in laptop_y] 132 | 133 | laptop_probs, laptop_logits = model.forward(laptop_x, type='laptop') 134 | loss = model.loss(laptop_logits, laptop_y) 135 | 136 | optimizer.zero_grad() 137 | loss.backward() 138 | optimizer.step() 139 | if scheduler: 140 | scheduler.step() 141 | 142 | laptop_pred = model.gen_candidates(laptop_probs) 143 | laptop_pred = model.nms_filter(laptop_pred, 0.1) 144 | 145 | for b in range(len(laptop_pred)): 146 | gt = laptop_lb_raw[b] 147 | pred = [x[0] for x in laptop_pred[b]] 148 | p, g, s = evaluate_sample(gt, pred) 149 | P_laptop += p 150 | G_laptop += g 151 | S_laptop += s 152 | 153 | cum_laptop_loss += loss.data.cpu().numpy() * len(laptop_rv_raw) 154 | total_laptop_sample += len(laptop_rv_raw) 155 | while laptop_x: 156 | a = laptop_x.pop(); 157 | del a 158 | while laptop_y: 159 | a = laptop_y.pop(); 160 | del a 161 | 162 | while laptop_probs: 163 | a = laptop_probs.pop(); 164 | del a 165 | a = laptop_logits.pop(); 166 | del a 167 | 168 | del loss 169 | 170 | step += 1 171 | 172 | total_lm_loss = cum_lm_loss / total_lm_sample 173 | 174 | makeup_f1, makeup_pr, makeup_rc = f1_score(P_makeup, G_makeup, S_makeup) 175 | makeup_loss = cum_makeup_loss / total_makeup_sample 176 | 177 | laptop_f1, laptop_pr, laptop_rc = f1_score(P_laptop, G_laptop, S_laptop) 178 | laptop_loss = cum_laptop_loss / total_laptop_sample 179 | 180 | return makeup_loss, makeup_f1, makeup_pr, makeup_rc, \ 181 | laptop_loss, laptop_f1, laptop_pr, laptop_rc, \ 182 | total_lm_loss 183 | 184 | 185 | def eval_epoch(model, dataloader, type='makeup'): 186 | model.eval() 187 | cum_loss = 0 188 | # P, G, S = 0, 0, 0 189 | total_sample = 0 190 | step = 0 191 | pbar = tqdm(dataloader) 192 | 193 | PRED = [] 194 | GT = [] 195 | for raw, x, y in pbar: 196 | if step == len(dataloader): 197 | pbar.close() 198 | break 199 | rv_raw, lb_raw = raw 200 | x = [item.cuda() for item in x] 201 | y = [item.cuda() for item in y] 202 | with torch.no_grad(): 203 | probs, logits = model.forward(x, type) 204 | loss = model.loss(logits, y) 205 | pred_result = model.gen_candidates(probs) 206 | PRED += pred_result 207 | GT += lb_raw 208 | cum_loss += loss.data.cpu().numpy() * len(rv_raw) 209 | total_sample += len(rv_raw) 210 | 211 | step += 1 212 | 213 | total_loss = cum_loss / total_sample 214 | 215 | threshs = list(np.arange(0.1, 0.9, 0.05)) 216 | best_f1, best_pr, best_rc = 0, 0, 0 217 | best_thresh = 0.1 218 | for th in threshs: 219 | P, G, S = 0, 0, 0 220 | PRED_COPY = copy.deepcopy(PRED) 221 | PRED_COPY = 
model.nms_filter(PRED_COPY, th) 222 | for b in range(len(PRED_COPY)): 223 | gt = GT[b] 224 | pred = [x[0] for x in PRED_COPY[b]] 225 | p, g, s = evaluate_sample(gt, pred) 226 | 227 | P += p 228 | G += g 229 | S += s 230 | f1, pr, rc = f1_score(P, G, S) 231 | if f1 > best_f1: 232 | best_f1, best_pr, best_rc = f1, pr, rc 233 | best_thresh = th 234 | 235 | return total_loss, best_f1, best_pr, best_rc, best_thresh 236 | 237 | 238 | import argparse 239 | from config import PRETRAINED_MODELS 240 | 241 | if __name__ == '__main__': 242 | parser = argparse.ArgumentParser() 243 | parser.add_argument('--base_model', type=str, default='roberta') 244 | parser.add_argument('--bs', type=int, default=12) 245 | parser.add_argument('--gpu', type=int, default=0) 246 | args = parser.parse_args() 247 | 248 | os.environ["CUDA_VISIBLE_DEVICES"] = "%d" % args.gpu 249 | 250 | EP = 25 251 | model_config = PRETRAINED_MODELS[args.base_model] 252 | SAVING_DIR = '../models/' 253 | 254 | tokenizer = BertTokenizer.from_pretrained(model_config['path'], do_lower_case=True) 255 | makeup_loader, makeup_val_loader, laptop_loader, laptop_val_loader, corpus_loader = get_pretrain2_loaders(tokenizer, batch_size=args.bs) 256 | model = OpinioNet.from_pretrained(model_config['path'], version=model_config['version']) 257 | model.cuda() 258 | optimizer = Adam(model.parameters(), lr=model_config['lr']) 259 | scheduler = GradualWarmupScheduler(optimizer, 260 | total_epoch=2 * max(len(makeup_loader), len(laptop_loader), len(corpus_loader))) 261 | best_val_f1 = 0 262 | best_val_loss = float('inf') 263 | for e in range(EP): 264 | 265 | print('Epoch [%d/%d] train:' % (e, EP)) 266 | makeup_loss, makeup_f1, makeup_pr, makeup_rc, \ 267 | laptop_loss, laptop_f1, laptop_pr, laptop_rc, \ 268 | total_lm_loss = train_epoch(model, makeup_loader, laptop_loader, corpus_loader, optimizer, scheduler) 269 | print("makeup_train: loss %.5f, f1 %.5f, pr %.5f, rc %.5f" % (makeup_loss, makeup_f1, makeup_pr, makeup_rc)) 270 | print("laptop_train: loss %.5f, f1 %.5f, pr %.5f, rc %.5f" % (laptop_loss, laptop_f1, laptop_pr, laptop_rc)) 271 | print("lm loss %.5f", total_lm_loss) 272 | 273 | print('Epoch [%d/%d] makeup eval:' % (e, EP)) 274 | val_loss, val_f1, val_pr, val_rc, best_th = eval_epoch(model, makeup_val_loader, type='makeup') 275 | print("makeup_val: loss %.5f, f1 %.5f, pr %.5f, rc %.5f, thresh %.2f" % (val_loss, val_f1, val_pr, val_rc, best_th)) 276 | 277 | print('Epoch [%d/%d] laptop eval:' % (e, EP)) 278 | val_loss, val_f1, val_pr, val_rc, best_th = eval_epoch(model, laptop_val_loader, type='laptop') 279 | print("laptop_val: loss %.5f, f1 %.5f, pr %.5f, rc %.5f, thresh %.2f" % (val_loss, val_f1, val_pr, val_rc, best_th)) 280 | 281 | if val_loss < best_val_loss: 282 | best_val_loss = val_loss 283 | if val_f1 > best_val_f1: 284 | best_val_f1 = val_f1 285 | if best_val_f1 >= 0.75: 286 | saving_dir = osp.join(SAVING_DIR, 'pretrained2_' + model_config['name']) 287 | torch.save(model.state_dict(), saving_dir) 288 | print('saved best model to %s' % saving_dir) 289 | 290 | print('best loss %.5f' % best_val_loss) 291 | print('best f1 %.5f' % best_val_f1) 292 | # if best_val_f1 >= 0.82: 293 | # break 294 | -------------------------------------------------------------------------------- /src/pretrain2_cv.py: -------------------------------------------------------------------------------- 1 | import os 2 | from pytorch_pretrained_bert import BertTokenizer 3 | from dataset import get_pretrain_2_laptop_fake_loaders_cv, get_pretrain2_loaders_cv, get_data_loaders_cv 
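# pretrain2_cv.py extends pretrain2.py with a fourth data stream: besides the MLM
# corpus, the labelled makeup data and the gold laptop folds, each step also consumes
# a "fake" laptop batch (presumably the pseudo-labels produced by label_corpus.sh).
# The pseudo-labelled loss is computed with neg_sub=True, i.e. the margin /
# negative-down-weighting variant of the objectness BCE in model.py, and is averaged
# with the gold laptop loss before the single backward pass, so noisy pseudo-labels
# contribute only half of the laptop gradient.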
4 | from model import OpinioNet 5 | 6 | import torch 7 | from torch.optim import Adam 8 | 9 | from lr_scheduler import GradualWarmupScheduler, ReduceLROnPlateau 10 | from tqdm import tqdm 11 | import os.path as osp 12 | import numpy as np 13 | import copy 14 | 15 | 16 | def f1_score(P, G, S): 17 | pr = S / P 18 | rc = S / G 19 | f1 = 2 * pr * rc / (pr + rc) 20 | return f1, pr, rc 21 | 22 | 23 | def evaluate_sample(gt, pred): 24 | gt = set(gt) 25 | pred = set(pred) 26 | p = len(pred) 27 | g = len(gt) 28 | s = len(gt.intersection(pred)) 29 | # print(p, g, s) 30 | return p, g, s 31 | 32 | 33 | def train_epoch(model, makeup_loader, laptop_fake_train, laptop_gt_train, corpus_loader, optimizer, scheduler=None): 34 | model.train() 35 | 36 | cum_lm_loss = 0 37 | cum_makeup_loss = 0 38 | cum_laptop_loss = 0 39 | total_lm_sample = 0 40 | total_makeup_sample = 0 41 | total_laptop_sample = 0 42 | P_makeup, G_makeup, S_makeup = 0, 0, 0 43 | P_laptop, G_laptop, S_laptop = 0, 0, 0 44 | step = 0 45 | epoch_len = max(len(makeup_loader), len(corpus_loader), len(laptop_fake_train)) 46 | pbar = tqdm(range(epoch_len)) 47 | 48 | corpus_iter = iter(corpus_loader) 49 | makeup_iter = iter(makeup_loader) 50 | laptop_fake_iter = iter(laptop_fake_train) 51 | laptop_gt_iter = iter(laptop_gt_train) 52 | 53 | for _ in pbar: 54 | if step == epoch_len: 55 | pbar.close() 56 | break 57 | ################ MLM ################### 58 | try: 59 | corpus_ids, corpus_attn, lm_label = next(corpus_iter) 60 | except StopIteration: 61 | corpus_iter = iter(corpus_loader) 62 | corpus_ids, corpus_attn, lm_label = next(corpus_iter) 63 | 64 | corpus_ids = corpus_ids.cuda() 65 | corpus_attn = corpus_attn.cuda() 66 | lm_label = lm_label.cuda() 67 | loss = model.foward_LM(corpus_ids, corpus_attn, lm_label) 68 | optimizer.zero_grad() 69 | loss.backward() 70 | optimizer.step() 71 | if scheduler: 72 | scheduler.step() 73 | cum_lm_loss += loss.data.cpu().numpy() * len(corpus_ids) 74 | total_lm_sample += len(corpus_ids) 75 | del corpus_ids, corpus_attn, lm_label, loss 76 | 77 | ############### makeup ################## 78 | try: 79 | makeup_raw, makeup_x, makeup_y = next(makeup_iter) 80 | except StopIteration: 81 | makeup_iter = iter(makeup_loader) 82 | makeup_raw, makeup_x, makeup_y = next(makeup_iter) 83 | 84 | makeup_rv_raw, makeup_lb_raw = makeup_raw 85 | makeup_x = [item.cuda() for item in makeup_x] 86 | makeup_y = [item.cuda() for item in makeup_y] 87 | 88 | makeup_probs, makeup_logits = model.forward(makeup_x, type='makeup') 89 | loss = model.loss(makeup_logits, makeup_y) 90 | 91 | optimizer.zero_grad() 92 | loss.backward() 93 | optimizer.step() 94 | if scheduler: 95 | scheduler.step() 96 | 97 | makeup_pred = model.gen_candidates(makeup_probs) 98 | makeup_pred = model.nms_filter(makeup_pred, 0.1) 99 | 100 | for b in range(len(makeup_pred)): 101 | gt = makeup_lb_raw[b] 102 | pred = [x[0] for x in makeup_pred[b]] 103 | p, g, s = evaluate_sample(gt, pred) 104 | P_makeup += p 105 | G_makeup += g 106 | S_makeup += s 107 | 108 | cum_makeup_loss += loss.data.cpu().numpy() * len(makeup_rv_raw) 109 | total_makeup_sample += len(makeup_rv_raw) 110 | while makeup_x: 111 | a = makeup_x.pop(); 112 | del a 113 | while makeup_y: 114 | a = makeup_y.pop(); 115 | del a 116 | 117 | while makeup_probs: 118 | a = makeup_probs.pop(); 119 | del a 120 | a = makeup_logits.pop(); 121 | del a 122 | 123 | ############### laptop fake ################## 124 | try: 125 | laptop_raw, laptop_x, laptop_y = next(laptop_fake_iter) 126 | except StopIteration: 127 | 
laptop_fake_iter = iter(laptop_fake_train) 128 | laptop_raw, laptop_x, laptop_y = next(laptop_fake_iter) 129 | 130 | laptop_rv_raw, laptop_lb_raw = laptop_raw 131 | laptop_x = [item.cuda() for item in laptop_x] 132 | laptop_y = [item.cuda() for item in laptop_y] 133 | 134 | laptop_probs, laptop_logits = model.forward(laptop_x, type='laptop') 135 | loss = model.loss(laptop_logits, laptop_y, neg_sub=True) 136 | 137 | # optimizer.zero_grad() 138 | # loss.backward() 139 | # optimizer.step() 140 | # if scheduler: 141 | # scheduler.step() 142 | 143 | laptop_pred = model.gen_candidates(laptop_probs) 144 | laptop_pred = model.nms_filter(laptop_pred, 0.1) 145 | 146 | for b in range(len(laptop_pred)): 147 | gt = laptop_lb_raw[b] 148 | pred = [x[0] for x in laptop_pred[b]] 149 | p, g, s = evaluate_sample(gt, pred) 150 | P_laptop += p 151 | G_laptop += g 152 | S_laptop += s 153 | 154 | cum_laptop_loss += loss.data.cpu().numpy() * len(laptop_rv_raw) 155 | total_laptop_sample += len(laptop_rv_raw) 156 | while laptop_x: 157 | a = laptop_x.pop(); 158 | del a 159 | while laptop_y: 160 | a = laptop_y.pop(); 161 | del a 162 | 163 | while laptop_probs: 164 | a = laptop_probs.pop(); 165 | del a 166 | a = laptop_logits.pop(); 167 | del a 168 | 169 | ############### laptop gt ################## 170 | try: 171 | laptop_raw, laptop_x, laptop_y = next(laptop_gt_iter) 172 | except StopIteration: 173 | laptop_gt_iter = iter(laptop_gt_train) 174 | laptop_raw, laptop_x, laptop_y = next(laptop_gt_iter) 175 | 176 | laptop_rv_raw, laptop_lb_raw = laptop_raw 177 | laptop_x = [item.cuda() for item in laptop_x] 178 | laptop_y = [item.cuda() for item in laptop_y] 179 | 180 | laptop_probs, laptop_logits = model.forward(laptop_x, type='laptop') 181 | laptop_gt_loss = model.loss(laptop_logits, laptop_y) 182 | loss += laptop_gt_loss 183 | loss /= 2 184 | optimizer.zero_grad() 185 | loss.backward() 186 | optimizer.step() 187 | if scheduler: 188 | scheduler.step() 189 | 190 | laptop_pred = model.gen_candidates(laptop_probs) 191 | laptop_pred = model.nms_filter(laptop_pred, 0.1) 192 | 193 | for b in range(len(laptop_pred)): 194 | gt = laptop_lb_raw[b] 195 | pred = [x[0] for x in laptop_pred[b]] 196 | p, g, s = evaluate_sample(gt, pred) 197 | P_laptop += p 198 | G_laptop += g 199 | S_laptop += s 200 | 201 | cum_laptop_loss += laptop_gt_loss.data.cpu().numpy() * len(laptop_rv_raw) 202 | total_laptop_sample += len(laptop_rv_raw) 203 | while laptop_x: 204 | a = laptop_x.pop(); 205 | del a 206 | while laptop_y: 207 | a = laptop_y.pop(); 208 | del a 209 | 210 | while laptop_probs: 211 | a = laptop_probs.pop(); 212 | del a 213 | a = laptop_logits.pop(); 214 | del a 215 | 216 | del loss 217 | 218 | step += 1 219 | 220 | total_lm_loss = cum_lm_loss / total_lm_sample 221 | 222 | makeup_f1, makeup_pr, makeup_rc = f1_score(P_makeup, G_makeup, S_makeup) 223 | makeup_loss = cum_makeup_loss / total_makeup_sample 224 | 225 | laptop_f1, laptop_pr, laptop_rc = f1_score(P_laptop, G_laptop, S_laptop) 226 | laptop_loss = cum_laptop_loss / total_laptop_sample 227 | 228 | return makeup_loss, makeup_f1, makeup_pr, makeup_rc, \ 229 | laptop_loss, laptop_f1, laptop_pr, laptop_rc, \ 230 | total_lm_loss 231 | 232 | 233 | def eval_epoch(model, dataloader, type='makeup'): 234 | model.eval() 235 | cum_loss = 0 236 | # P, G, S = 0, 0, 0 237 | total_sample = 0 238 | step = 0 239 | pbar = tqdm(dataloader) 240 | 241 | PRED = [] 242 | GT = [] 243 | for raw, x, y in pbar: 244 | if step == len(dataloader): 245 | pbar.close() 246 | break 247 | rv_raw, lb_raw = raw 248 | 
x = [item.cuda() for item in x] 249 | y = [item.cuda() for item in y] 250 | with torch.no_grad(): 251 | probs, logits = model.forward(x, type) 252 | loss = model.loss(logits, y) 253 | pred_result = model.gen_candidates(probs) 254 | PRED += pred_result 255 | GT += lb_raw 256 | cum_loss += loss.data.cpu().numpy() * len(rv_raw) 257 | total_sample += len(rv_raw) 258 | 259 | step += 1 260 | 261 | total_loss = cum_loss / total_sample 262 | 263 | threshs = list(np.arange(0.1, 0.9, 0.05)) 264 | best_f1, best_pr, best_rc = 0, 0, 0 265 | best_thresh = 0.1 266 | for th in threshs: 267 | P, G, S = 0, 0, 0 268 | PRED_COPY = copy.deepcopy(PRED) 269 | PRED_COPY = model.nms_filter(PRED_COPY, th) 270 | for b in range(len(PRED_COPY)): 271 | gt = GT[b] 272 | pred = [x[0] for x in PRED_COPY[b]] 273 | p, g, s = evaluate_sample(gt, pred) 274 | 275 | P += p 276 | G += g 277 | S += s 278 | f1, pr, rc = f1_score(P, G, S) 279 | if f1 > best_f1: 280 | best_f1, best_pr, best_rc = f1, pr, rc 281 | best_thresh = th 282 | 283 | return total_loss, best_f1, best_pr, best_rc, best_thresh 284 | 285 | 286 | import argparse 287 | from config import PRETRAINED_MODELS 288 | import json 289 | if __name__ == '__main__': 290 | parser = argparse.ArgumentParser() 291 | parser.add_argument('--base_model', type=str, default='roberta') 292 | parser.add_argument('--bs', type=int, default=12) 293 | parser.add_argument('--no_improve', type=int, default=2) 294 | parser.add_argument('--gpu', type=int, default=0) 295 | args = parser.parse_args() 296 | 297 | # os.environ["CUDA_VISIBLE_DEVICES"] = "%d" % args.gpu 298 | 299 | THRESH_DIR = '../models/thresh_dict.json' 300 | if osp.isfile(THRESH_DIR): 301 | with open(THRESH_DIR, 'r', encoding='utf-8') as f: 302 | thresh_dict = json.load(f) 303 | else: 304 | thresh_dict = {} 305 | EP = 25 306 | model_config = PRETRAINED_MODELS[args.base_model] 307 | SAVING_DIR = '../models/' 308 | 309 | tokenizer = BertTokenizer.from_pretrained(model_config['path'], do_lower_case=True) 310 | makeup_loader, makeup_val_loader, corpus_loader = get_pretrain2_loaders_cv(tokenizer, batch_size=args.bs) 311 | 312 | laptop_gt_cv_loaders = get_data_loaders_cv(rv_path='../data/TRAIN/Train_laptop_reviews.csv', 313 | lb_path='../data/TRAIN/Train_laptop_labels.csv', 314 | tokenizer=tokenizer, 315 | batch_size=args.bs, 316 | type='laptop', 317 | folds=5) 318 | 319 | laptop_fake_cv_loaders = get_pretrain_2_laptop_fake_loaders_cv(tokenizer, batch_size=args.bs) 320 | BEST_THRESHS = [0.1] * 5 321 | BEST_F1 = [0] * 5 322 | for cv_idx, (laptop_fake_train) in enumerate(laptop_fake_cv_loaders): 323 | laptop_gt_train, laptop_gt_val = laptop_gt_cv_loaders[cv_idx] 324 | 325 | model = OpinioNet.from_pretrained(model_config['path'], version=model_config['version']) 326 | model.cuda() 327 | optimizer = Adam(model.parameters(), lr=model_config['lr']) 328 | scheduler = GradualWarmupScheduler(optimizer, 329 | total_epoch=2 * max(len(makeup_loader), len(laptop_fake_train), len(corpus_loader))) 330 | best_val_f1 = 0 331 | best_val_loss = float('inf') 332 | no_imporve = 0 333 | for e in range(EP): 334 | 335 | print('Epoch [%d/%d] train:' % (e, EP)) 336 | makeup_loss, makeup_f1, makeup_pr, makeup_rc, \ 337 | laptop_loss, laptop_f1, laptop_pr, laptop_rc, \ 338 | total_lm_loss = train_epoch(model, makeup_loader, laptop_fake_train, laptop_gt_train, corpus_loader, optimizer, scheduler) 339 | print("makeup_train: loss %.5f, f1 %.5f, pr %.5f, rc %.5f" % (makeup_loss, makeup_f1, makeup_pr, makeup_rc)) 340 | print("laptop_train: loss %.5f, f1 %.5f, pr %.5f, 
rc %.5f" % (laptop_loss, laptop_f1, laptop_pr, laptop_rc)) 341 | print("lm loss %.5f", total_lm_loss) 342 | 343 | print('Epoch [%d/%d] makeup eval:' % (e, EP)) 344 | val_loss, val_f1, val_pr, val_rc, best_th = eval_epoch(model, makeup_val_loader, type='makeup') 345 | print("makeup_val: loss %.5f, f1 %.5f, pr %.5f, rc %.5f, thresh %.2f" % ( 346 | val_loss, val_f1, val_pr, val_rc, best_th)) 347 | 348 | print('Epoch [%d/%d] laptop eval:' % (e, EP)) 349 | val_loss, val_f1, val_pr, val_rc, best_th = eval_epoch(model, laptop_gt_val, type='laptop') 350 | print("laptop_val: loss %.5f, f1 %.5f, pr %.5f, rc %.5f, thresh %.2f" % ( 351 | val_loss, val_f1, val_pr, val_rc, best_th)) 352 | 353 | if val_loss < best_val_loss: 354 | best_val_loss = val_loss 355 | if val_f1 > best_val_f1: 356 | no_imporve = 0 357 | best_val_f1 = val_f1 358 | if best_val_f1 >= 0.75: 359 | saving_name = model_config['name'] + '_cv' + str(cv_idx) 360 | saving_dir = osp.join(SAVING_DIR, saving_name) 361 | torch.save(model.state_dict(), saving_dir) 362 | print('saved best model to %s' % saving_dir) 363 | BEST_THRESHS[cv_idx] = best_th 364 | BEST_F1[cv_idx] = best_val_f1 365 | thresh_dict[saving_name] = { 366 | 'name': model_config['name'], 367 | 'thresh': best_th, 368 | 'f1': best_val_f1, 369 | } 370 | with open(THRESH_DIR, 'w', encoding='utf-8') as f: 371 | json.dump(thresh_dict, f) 372 | else: 373 | no_imporve += 1 374 | 375 | print('best loss %.5f' % best_val_loss) 376 | print('best f1 %.5f' % best_val_f1) 377 | if no_imporve >= args.no_improve: 378 | break 379 | del model, optimizer, scheduler 380 | print(BEST_F1) 381 | print(BEST_THRESHS) 382 | -------------------------------------------------------------------------------- /src/test_cv.py: -------------------------------------------------------------------------------- 1 | from pytorch_pretrained_bert import BertTokenizer 2 | from dataset import ReviewDataset, get_data_loaders_cv, ID2LAPTOP, ID2P 3 | from lr_scheduler import GradualWarmupScheduler, ReduceLROnPlateau 4 | from model import OpinioNet 5 | 6 | import torch 7 | from torch.optim import Adam 8 | 9 | from tqdm import tqdm 10 | import os.path as osp 11 | import numpy as np 12 | import pandas as pd 13 | import copy 14 | 15 | def f1_score(P, G, S): 16 | pr = S / P 17 | rc = S / G 18 | f1 = 2 * pr * rc / (pr + rc) 19 | return f1, pr, rc 20 | 21 | 22 | def evaluate_sample(gt, pred): 23 | gt = set(gt) 24 | pred = set(pred) 25 | p = len(pred) 26 | g = len(gt) 27 | s = len(gt.intersection(pred)) 28 | return p, g, s 29 | 30 | 31 | def gen_submit(ret, raw): 32 | result = pd.DataFrame( 33 | columns=['id', 'AspectTerms', 'A_start', 'A_end', 'OpinionTerms', 'O_start', 'O_end', 'Categories', 34 | 'Polarities']) 35 | cur_idx = 1 36 | for i, opinions in enumerate(ret): 37 | 38 | if len(opinions) == 0: 39 | result.loc[result.shape[0]] = {'id': cur_idx, 40 | 'AspectTerms': '_', 'A_start': ' ', 'A_end': ' ', 41 | 'OpinionTerms': '_', 'O_start': ' ', 'O_end': ' ', 42 | 'Categories': '_', 'Polarities': '_'} 43 | 44 | for j, (opn, score) in enumerate(opinions): 45 | a_s, a_e, o_s, o_e = opn[0:4] 46 | c, p = opn[4:6] 47 | if a_s == 0: 48 | A = '_' 49 | a_s = ' ' 50 | a_e = ' ' 51 | else: 52 | A = raw[i][a_s - 1: a_e] 53 | a_s = str(a_s - 1) 54 | a_e = str(a_e) 55 | if o_s == 0: 56 | O = '_' 57 | o_s = ' ' 58 | o_e = ' ' 59 | else: 60 | O = raw[i][o_s - 1: o_e] 61 | o_s = str(o_s - 1) 62 | o_e = str(o_e) 63 | C = ID2LAPTOP[c] 64 | P = ID2P[p] 65 | result.loc[result.shape[0]] = {'id': cur_idx, 66 | 'AspectTerms': A, 'A_start': a_s, 'A_end': 
a_e, 67 | 'OpinionTerms': O, 'O_start': o_s, 'O_end': o_e, 68 | 'Categories': C, 'Polarities': P} 69 | cur_idx += 1 70 | return result 71 | 72 | 73 | import json 74 | import os 75 | import argparse 76 | from config import PRETRAINED_MODELS 77 | if __name__ == '__main__': 78 | parser = argparse.ArgumentParser() 79 | parser.add_argument('--base_model', type=str, default='roberta') 80 | parser.add_argument('--bs', type=int, default=12) 81 | args = parser.parse_args() 82 | FOLDS = 5 83 | THRESH_DIR = '../models/thresh_dict.json' 84 | model_config = PRETRAINED_MODELS[args.base_model] 85 | print(model_config) 86 | 87 | if osp.isfile(THRESH_DIR): 88 | with open(THRESH_DIR, 'r', encoding='utf-8') as f: 89 | thresh_dict = json.load(f) 90 | else: 91 | thresh_dict = {} 92 | 93 | tokenizer = BertTokenizer.from_pretrained(model_config['path'], do_lower_case=True) 94 | cv_loaders, val_idxs = get_data_loaders_cv(rv_path='../data/TRAIN/Train_laptop_reviews.csv', 95 | lb_path='../data/TRAIN/Train_laptop_labels.csv', 96 | tokenizer=tokenizer, 97 | batch_size=args.bs, 98 | type='laptop', 99 | folds=FOLDS, 100 | return_val_idxs=True) 101 | 102 | PRED = [] 103 | LB, GT = [], [] 104 | VAL_IDX = [] 105 | for cv_idx, (train_loader, val_loader) in enumerate(cv_loaders): 106 | if not os.path.exists('../models/' + model_config['name'] + '_cv' + str(cv_idx)): 107 | continue 108 | model = OpinioNet.from_pretrained(model_config['path'], version=model_config['version'], focal=model_config['focal']) 109 | model.load_state_dict(torch.load('../models/' + model_config['name']+'_cv'+str(cv_idx))) 110 | model.cuda() 111 | model.eval() 112 | try: 113 | thresh = thresh_dict[model_config['name'] + '_cv' + str(cv_idx)]['thresh'] 114 | except: 115 | thresh = 0.5 116 | VAL_IDX.extend(val_idxs[cv_idx]) 117 | 118 | for idx, ((rv_raw, lb_raw), x, y) in enumerate(val_loader): 119 | x = [item.cuda() for item in x] 120 | y = [item.cuda() for item in y] 121 | with torch.no_grad(): 122 | probs, logits = model.forward(x, 'laptop') 123 | loss = model.loss(logits, y) 124 | pred_result = model.gen_candidates(probs) 125 | pred_result = model.nms_filter(pred_result, thresh) 126 | PRED.extend(pred_result) 127 | LB.extend(lb_raw) 128 | GT.extend(rv_raw) 129 | 130 | del model 131 | 132 | P, G, S = 0, 0, 0 133 | for b in range(len(PRED)): 134 | gt = LB[b] 135 | pred = [x[0] for x in PRED[b]] 136 | p, g, s = evaluate_sample(gt, pred) 137 | 138 | P += p 139 | G += g 140 | S += s 141 | f1, pr, rc = f1_score(P, G, S) 142 | print("f1 %.5f, pr %.5f, rc %.5f" % (f1, pr, rc)) 143 | 144 | ZZ = list(zip(VAL_IDX, PRED, GT)) 145 | 146 | ZZ.sort(key=lambda x: x[0]) 147 | 148 | PRED = [p[1] for p in ZZ] 149 | GT = [p[2] for p in ZZ] 150 | result = gen_submit(PRED, GT) 151 | 152 | result.to_csv('../testResults/' + model_config['name'] + '.csv', header=True, index=False) 153 | print(len(result['id'].unique()), result.shape[0]) 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | -------------------------------------------------------------------------------- /src/test_ensemble_cv.py: -------------------------------------------------------------------------------- 1 | from pytorch_pretrained_bert import BertTokenizer 2 | from dataset import ReviewDataset, get_data_loaders_cv, ID2LAPTOP, ID2P 3 | from lr_scheduler import GradualWarmupScheduler, ReduceLROnPlateau 4 | from model import OpinioNet 5 | 6 | import torch 7 | from torch.optim import Adam 8 | 9 | from tqdm import tqdm 10 | import os.path as osp 11 | import numpy as np 12 | import pandas as pd 13 | import 
copy 14 | 15 | from collections import Counter 16 | def f1_score(P, G, S): 17 | pr = S / P 18 | rc = S / G 19 | f1 = 2 * pr * rc / (pr + rc) 20 | return f1, pr, rc 21 | 22 | 23 | def evaluate_sample(gt, pred): 24 | gt = set(gt) 25 | pred = set(pred) 26 | p = len(pred) 27 | g = len(gt) 28 | s = len(gt.intersection(pred)) 29 | return p, g, s 30 | 31 | 32 | def eval_epoch(model, dataloader, th): 33 | model.eval() 34 | step = 0 35 | result = [] 36 | pbar = tqdm(dataloader) 37 | for raw, x, _ in pbar: 38 | if step == len(dataloader): 39 | pbar.close() 40 | break 41 | rv_raw, _ = raw 42 | x = [item.cuda() for item in x] 43 | with torch.no_grad(): 44 | probs, logits = model.forward(x, 'laptop') 45 | pred_result = model.gen_candidates(probs) 46 | pred_result = model.nms_filter(pred_result, th) 47 | 48 | result += pred_result 49 | 50 | step += 1 51 | return result 52 | 53 | 54 | def accum_result(old, new): 55 | if old is None: 56 | return new 57 | for i in range(len(old)): 58 | merged = Counter(dict(old[i])) + Counter(dict(new[i])) 59 | old[i] = list(merged.items()) 60 | return old 61 | 62 | 63 | def average_result(result, num): 64 | for i in range(len(result)): 65 | for j in range(len(result[i])): 66 | result[i][j] = (result[i][j][0], result[i][j][1] / num) 67 | return result 68 | 69 | def gen_submit(ret, raw): 70 | result = pd.DataFrame( 71 | columns=['id', 'AspectTerms', 'A_start', 'A_end', 'OpinionTerms', 'O_start', 'O_end', 'Categories', 72 | 'Polarities']) 73 | cur_idx = 1 74 | for i, opinions in enumerate(ret): 75 | 76 | if len(opinions) == 0: 77 | result.loc[result.shape[0]] = {'id': cur_idx, 78 | 'AspectTerms': '_', 'A_start': ' ', 'A_end': ' ', 79 | 'OpinionTerms': '_', 'O_start': ' ', 'O_end': ' ', 80 | 'Categories': '_', 'Polarities': '_'} 81 | 82 | for j, (opn, score) in enumerate(opinions): 83 | a_s, a_e, o_s, o_e = opn[0:4] 84 | c, p = opn[4:6] 85 | if a_s == 0: 86 | A = '_' 87 | a_s = ' ' 88 | a_e = ' ' 89 | else: 90 | A = raw[i][a_s - 1: a_e] 91 | a_s = str(a_s - 1) 92 | a_e = str(a_e) 93 | if o_s == 0: 94 | O = '_' 95 | o_s = ' ' 96 | o_e = ' ' 97 | else: 98 | O = raw[i][o_s - 1: o_e] 99 | o_s = str(o_s - 1) 100 | o_e = str(o_e) 101 | C = ID2LAPTOP[c] 102 | P = ID2P[p] 103 | result.loc[result.shape[0]] = {'id': cur_idx, 104 | 'AspectTerms': A, 'A_start': a_s, 'A_end': a_e, 105 | 'OpinionTerms': O, 'O_start': o_s, 'O_end': o_e, 106 | 'Categories': C, 'Polarities': P} 107 | cur_idx += 1 108 | return result 109 | 110 | 111 | import json 112 | import argparse 113 | from config import PRETRAINED_MODELS 114 | if __name__ == '__main__': 115 | parser = argparse.ArgumentParser() 116 | parser.add_argument('--bs', type=int, default=12) 117 | args = parser.parse_args() 118 | FOLDS = 5 119 | THRESH_DIR = '../models/thresh_dict.json' 120 | 121 | 122 | with open(THRESH_DIR, 'r', encoding='utf-8') as f: 123 | thresh_dict = json.load(f) 124 | 125 | tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODELS['roberta']['path'], do_lower_case=True) 126 | cv_loader, val_idxs = get_data_loaders_cv(rv_path='../data/TRAIN/Train_laptop_reviews.csv', 127 | lb_path='../data/TRAIN/Train_laptop_labels.csv', 128 | tokenizer=tokenizer, 129 | batch_size=args.bs, 130 | type='laptop', 131 | folds=FOLDS, 132 | return_val_idxs=True) 133 | VAL_IDX = [] 134 | LB, GT = [], [] 135 | for idxs in val_idxs: 136 | VAL_IDX.extend(idxs) 137 | for train, val in cv_loader: 138 | for ((rv_raw, lb_raw), x, y) in val: 139 | LB.extend(lb_raw) 140 | GT.extend(rv_raw) 141 | tokenizers = dict([(model_name, 142 | 
BertTokenizer.from_pretrained(model_config['path'], do_lower_case=True) 143 | ) for model_name, model_config in PRETRAINED_MODELS.items()]) 144 | # print(tokenizers) 145 | 146 | cv_loaders = dict([(model_name, 147 | get_data_loaders_cv(rv_path='../data/TRAIN/Train_laptop_reviews.csv', 148 | lb_path='../data/TRAIN/Train_laptop_labels.csv', 149 | tokenizer=tokenizers[model_name], 150 | batch_size=args.bs, 151 | type='laptop', 152 | folds=FOLDS) 153 | ) for model_name, model_config in PRETRAINED_MODELS.items()]) 154 | 155 | PRED = [] 156 | for cv_idx in range(FOLDS): 157 | cv_model_num = 0 158 | cvret = None 159 | for model_name, model_config in PRETRAINED_MODELS.items(): 160 | tokenizer = tokenizers[model_name] 161 | _, val_loader = cv_loaders[model_name][cv_idx] 162 | 163 | try: 164 | model = OpinioNet.from_pretrained(model_config['path'], version=model_config['version'], 165 | focal=model_config['focal']) 166 | weight_name = model_config['name'] + '_cv' + str(cv_idx) 167 | weight = torch.load('../models/' + weight_name) 168 | except FileNotFoundError: 169 | continue 170 | print(weight_name) 171 | model.load_state_dict(weight) 172 | model.cuda() 173 | try: 174 | thresh = thresh_dict[weight_name]['thresh'] 175 | except: 176 | thresh = 0.5 177 | cvret = accum_result(cvret, eval_epoch(model, val_loader, thresh)) 178 | cv_model_num += 1 179 | del model 180 | cvret = average_result(cvret, cv_model_num) 181 | PRED.extend(cvret) 182 | 183 | PRED_COPY = copy.deepcopy(PRED) 184 | 185 | # P, G, S = 0, 0, 0 186 | # BEST_PRED = OpinioNet.nms_filter(PRED_COPY, 0.3) 187 | # for b in range(len(PRED_COPY)): 188 | # gt = LB[b] 189 | # pred = [x[0] for x in BEST_PRED[b]] 190 | # p, g, s = evaluate_sample(gt, pred) 191 | # 192 | # P += p 193 | # G += g 194 | # S += s 195 | # f1, pr, rc = f1_score(P, G, S) 196 | # print("f1 %.5f, pr %.5f, rc %.5f, th %.5f" % (f1, pr, rc, 0.3)) 197 | 198 | threshs = list(np.arange(0.1, 0.9, 0.025)) 199 | best_f1, best_pr, best_rc = 0, 0, 0 200 | best_thresh = 0.1 201 | P, G, S = 0, 0, 0 202 | BEST_PRED = PRED_COPY 203 | for th in threshs: 204 | P, G, S = 0, 0, 0 205 | PRED_COPY = copy.deepcopy(PRED) 206 | PRED_COPY = OpinioNet.nms_filter(PRED_COPY, th) 207 | for b in range(len(PRED_COPY)): 208 | gt = LB[b] 209 | pred = [x[0] for x in PRED_COPY[b]] 210 | p, g, s = evaluate_sample(gt, pred) 211 | 212 | P += p 213 | G += g 214 | S += s 215 | f1, pr, rc = f1_score(P, G, S) 216 | if f1 > best_f1: 217 | best_f1, best_pr, best_rc = f1, pr, rc 218 | best_thresh = th 219 | BEST_PRED = copy.deepcopy(PRED_COPY) 220 | 221 | print("f1 %.5f, pr %.5f, rc %.5f, th %.5f" % (best_f1, best_pr, best_rc, best_thresh)) 222 | 223 | ZZ = list(zip(VAL_IDX, BEST_PRED, GT)) 224 | ZZ.sort(key=lambda x: x[0]) 225 | 226 | BEST_PRED = [p[1] for p in ZZ] 227 | GT = [p[2] for p in ZZ] 228 | result = gen_submit(BEST_PRED, GT) 229 | if not osp.exists('../testResults/'): 230 | import os 231 | os.mkdir('../testResults/') 232 | result.to_csv('../testResults/' + 'ensemble_result_label_'+ ('%.5f' % best_f1) +'.csv', header=True, index=False) 233 | print(len(result['id'].unique()), result.shape[0]) 234 | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | -------------------------------------------------------------------------------- /src/train.py: -------------------------------------------------------------------------------- 1 | from pytorch_pretrained_bert import BertTokenizer 2 | from dataset import ReviewDataset, get_data_loaders 3 | from model import OpinioNet 4 | 5 | import torch 6 | from torch.optim import Adam 
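# train.py is the round-1 single-task baseline: it trains and evaluates
# OpinioNet on the makeup reviews (Train_reviews.csv / Train_labels.csv) only.
# Both train_epoch and eval_epoch below decode predictions the same way,
# with the NMS threshold fixed at 0.1 rather than swept:
#   probs, logits = model.forward(x)       # batch predictions
#   cands = model.gen_candidates(probs)    # per review: list of (opinion_tuple, score)
#   cands = model.nms_filter(cands, 0.1)   # keep candidates surviving NMS at threshold 0.1
# The __main__ block runs up to EP epochs and saves weights to
# ../models/saved_best_model once validation F1 reaches 0.75.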
7 | 8 | from lr_scheduler import GradualWarmupScheduler, ReduceLROnPlateau 9 | from tqdm import tqdm 10 | import os.path as osp 11 | 12 | 13 | def f1_score(P, G, S): 14 | pr = S / P 15 | rc = S / G 16 | f1 = 2 * pr * rc / (pr + rc) 17 | return f1, pr, rc 18 | 19 | 20 | def evaluate_sample(gt, pred): 21 | gt = set(gt) 22 | pred = set(pred) 23 | p = len(pred) 24 | g = len(gt) 25 | s = len(gt.intersection(pred)) 26 | # print(p, g, s) 27 | return p, g, s 28 | 29 | 30 | def train_epoch(model, dataloader, optimizer, scheduler=None): 31 | model.train() 32 | cum_loss = 0 33 | P, G, S = 0, 0, 0 34 | total_sample = 0 35 | step = 0 36 | pbar = tqdm(dataloader) 37 | for raw, x, y in pbar: 38 | if step == len(dataloader): 39 | pbar.close() 40 | break 41 | rv_raw, lb_raw = raw 42 | x = [item.cuda() for item in x] 43 | y = [item.cuda() for item in y] 44 | 45 | probs, logits = model.forward(x) 46 | loss = model.loss(logits, y) 47 | 48 | optimizer.zero_grad() 49 | loss.backward() 50 | if scheduler: 51 | scheduler.step() 52 | optimizer.step() 53 | 54 | pred_result = model.gen_candidates(probs) 55 | pred_result = model.nms_filter(pred_result, 0.1) 56 | cum_loss += loss.data.cpu().numpy() * len(rv_raw) 57 | total_sample += len(rv_raw) 58 | for b in range(len(pred_result)): 59 | gt = lb_raw[b] 60 | pred = [x[0] for x in pred_result[b]] 61 | p, g, s = evaluate_sample(gt, pred) 62 | P += p 63 | G += g 64 | S += s 65 | 66 | step += 1 67 | 68 | total_f1, total_pr, total_rc = f1_score(P, G, S) 69 | total_loss = cum_loss / total_sample 70 | 71 | return total_loss, total_f1, total_pr, total_rc 72 | 73 | 74 | def eval_epoch(model, dataloader): 75 | model.eval() 76 | cum_loss = 0 77 | P, G, S = 0, 0, 0 78 | total_sample = 0 79 | step = 0 80 | pbar = tqdm(dataloader) 81 | for raw, x, y in pbar: 82 | if step == len(dataloader): 83 | pbar.close() 84 | break 85 | rv_raw, lb_raw = raw 86 | x = [item.cuda() for item in x] 87 | y = [item.cuda() for item in y] 88 | with torch.no_grad(): 89 | probs, logits = model.forward(x) 90 | loss = model.loss(logits, y) 91 | pred_result = model.gen_candidates(probs) 92 | pred_result = model.nms_filter(pred_result, 0.1) 93 | 94 | cum_loss += loss.data.cpu().numpy() * len(rv_raw) 95 | total_sample += len(rv_raw) 96 | for b in range(len(pred_result)): 97 | gt = lb_raw[b] 98 | pred = [x[0] for x in pred_result[b]] 99 | p, g, s = evaluate_sample(gt, pred) 100 | 101 | P += p 102 | G += g 103 | S += s 104 | 105 | step += 1 106 | 107 | total_f1, total_pr, total_rc = f1_score(P, G, S) 108 | total_loss = cum_loss / total_sample 109 | 110 | return total_loss, total_f1, total_pr, total_rc 111 | 112 | 113 | if __name__ == '__main__': 114 | EP = 100 115 | SAVING_DIR = '../models/' 116 | tokenizer = BertTokenizer.from_pretrained('/home/zydq/.torch/models/bert/chinese-bert_chinese_wwm_pytorch', 117 | do_lower_case=True) 118 | train_loader, val_loader = get_data_loaders(rv_path='../data/TRAIN/Train_reviews.csv', 119 | lb_path='../data/TRAIN/Train_labels.csv', 120 | tokenizer=tokenizer, 121 | batch_size=12, 122 | val_split=0.15) 123 | 124 | model = OpinioNet.from_pretrained('/home/zydq/.torch/models/bert/chinese-bert_chinese_wwm_pytorch') 125 | model.cuda() 126 | optimizer = Adam(model.parameters(), lr=5e-6) 127 | scheduler = GradualWarmupScheduler(optimizer, total_epoch=2) 128 | best_val_f1 = 0 129 | best_val_loss = float('inf') 130 | for e in range(EP): 131 | 132 | print('Epoch [%d/%d] train:' % (e, EP)) 133 | train_loss, train_f1, train_pr, train_rc = train_epoch(model, train_loader, optimizer, 
scheduler) 134 | print("loss %.5f, f1 %.5f, pr %.5f, rc %.5f" % (train_loss, train_f1, train_pr, train_rc)) 135 | 136 | print('Epoch [%d/%d] eval:' % (e, EP)) 137 | val_loss, val_f1, val_pr, val_rc = eval_epoch(model, val_loader) 138 | print("loss %.5f, f1 %.5f, pr %.5f, rc %.5f" % (val_loss, val_f1, val_pr, val_rc)) 139 | if val_loss < best_val_loss: 140 | best_val_loss = val_loss 141 | if val_f1 > best_val_f1: 142 | best_val_f1 = val_f1 143 | if best_val_f1 >= 0.75: 144 | model_name_args = ['ep' + str(e), 'f1' + str(round(val_f1, 5))] 145 | model_name = '-'.join(model_name_args) + '.pth' 146 | saving_dir = osp.join(SAVING_DIR, 'saved_best_model') 147 | torch.save(model.state_dict(), saving_dir) 148 | print('saved best model to %s' % saving_dir) 149 | 150 | print('best loss %.5f' % best_val_loss) 151 | print('best f1 %.5f' % best_val_f1) -------------------------------------------------------------------------------- /src/train_cv.py: -------------------------------------------------------------------------------- 1 | from pytorch_pretrained_bert import BertTokenizer 2 | from dataset import ReviewDataset, get_data_loaders_cv 3 | from lr_scheduler import GradualWarmupScheduler, ReduceLROnPlateau 4 | from model import OpinioNet 5 | 6 | import torch 7 | from torch.optim import Adam 8 | 9 | from tqdm import tqdm 10 | import os.path as osp 11 | 12 | 13 | def f1_score(P, G, S): 14 | pr = S / P 15 | rc = S / G 16 | f1 = 2 * pr * rc / (pr + rc) 17 | return f1, pr, rc 18 | 19 | 20 | def evaluate_sample(gt, pred): 21 | gt = set(gt) 22 | pred = set(pred) 23 | p = len(pred) 24 | g = len(gt) 25 | s = len(gt.intersection(pred)) 26 | return p, g, s 27 | 28 | 29 | def train_epoch(model, dataloader, optimizer, scheduler=None): 30 | model.train() 31 | cum_loss = 0 32 | P, G, S = 0, 0, 0 33 | total_sample = 0 34 | step = 0 35 | pbar = tqdm(dataloader) 36 | for raw, x, y in pbar: 37 | if step == len(dataloader): 38 | pbar.close() 39 | break 40 | rv_raw, lb_raw = raw 41 | x = [item.cuda() for item in x] 42 | y = [item.cuda() for item in y] 43 | 44 | probs, logits = model.forward(x) 45 | loss = model.loss(logits, y) 46 | 47 | optimizer.zero_grad() 48 | loss.backward() 49 | if scheduler: 50 | scheduler.step() 51 | optimizer.step() 52 | 53 | pred_result = model.gen_candidates(probs) 54 | pred_result = model.nms_filter(pred_result, 0.1) 55 | cum_loss += loss.data.cpu().numpy() * len(rv_raw) 56 | total_sample += len(rv_raw) 57 | for b in range(len(pred_result)): 58 | gt = lb_raw[b] 59 | pred = [x[0] for x in pred_result[b]] 60 | p, g, s = evaluate_sample(gt, pred) 61 | P += p 62 | G += g 63 | S += s 64 | 65 | step += 1 66 | 67 | total_f1, total_pr, total_rc = f1_score(P, G, S) 68 | total_loss = cum_loss / total_sample 69 | 70 | return total_loss, total_f1, total_pr, total_rc 71 | 72 | 73 | def eval_epoch(model, dataloader): 74 | model.eval() 75 | cum_loss = 0 76 | P, G, S = 0, 0, 0 77 | total_sample = 0 78 | step = 0 79 | pbar = tqdm(dataloader) 80 | for raw, x, y in pbar: 81 | if step == len(dataloader): 82 | pbar.close() 83 | break 84 | rv_raw, lb_raw = raw 85 | x = [item.cuda() for item in x] 86 | y = [item.cuda() for item in y] 87 | with torch.no_grad(): 88 | probs, logits = model.forward(x) 89 | loss = model.loss(logits, y) 90 | pred_result = model.gen_candidates(probs) 91 | pred_result = model.nms_filter(pred_result, 0.1) 92 | 93 | cum_loss += loss.data.cpu().numpy() * len(rv_raw) 94 | total_sample += len(rv_raw) 95 | for b in range(len(pred_result)): 96 | gt = lb_raw[b] 97 | pred = [x[0] for x in 
pred_result[b]] 98 | p, g, s = evaluate_sample(gt, pred) 99 | 100 | P += p 101 | G += g 102 | S += s 103 | 104 | step += 1 105 | 106 | total_f1, total_pr, total_rc = f1_score(P, G, S) 107 | total_loss = cum_loss / total_sample 108 | 109 | return total_loss, total_f1, total_pr, total_rc 110 | 111 | 112 | if __name__ == '__main__': 113 | EP = 30 114 | FOLDS = 5 115 | SAVING_DIR = '../models/' 116 | tokenizer = BertTokenizer.from_pretrained('/home/zydq/.torch/models/bert/chinese-bert_chinese_wwm_pytorch', 117 | do_lower_case=True) 118 | cv_loaders = get_data_loaders_cv(rv_path='../data/TRAIN/Train_reviews.csv', 119 | lb_path='../data/TRAIN/Train_labels.csv', 120 | tokenizer=tokenizer, 121 | batch_size=12, 122 | folds=FOLDS) 123 | 124 | for cv_idx, (train_loader, val_loader) in enumerate(cv_loaders): 125 | model = OpinioNet.from_pretrained('/home/zydq/.torch/models/bert/chinese-bert_chinese_wwm_pytorch') 126 | model.cuda() 127 | optimizer = Adam(model.parameters(), lr=5e-6) 128 | scheduler = GradualWarmupScheduler(optimizer, total_epoch=2) 129 | best_val_f1 = 0 130 | best_val_loss = float('inf') 131 | for e in range(EP): 132 | 133 | print('Epoch [%d/%d] train:' % (e, EP)) 134 | train_loss, train_f1, train_pr, train_rc = train_epoch(model, train_loader, optimizer, scheduler) 135 | print("train: loss %.5f, f1 %.5f, pr %.5f, rc %.5f" % (train_loss, train_f1, train_pr, train_rc)) 136 | 137 | print('Epoch [%d/%d] eval:' % (e, EP)) 138 | val_loss, val_f1, val_pr, val_rc = eval_epoch(model, val_loader) 139 | print("val: loss %.5f, f1 %.5f, pr %.5f, rc %.5f" % (val_loss, val_f1, val_pr, val_rc)) 140 | 141 | 142 | if val_loss < best_val_loss: 143 | best_val_loss = val_loss 144 | if val_f1 > best_val_f1: 145 | best_val_f1 = val_f1 146 | if val_f1 >= 0.75: 147 | model_name_args = ['ep' + str(e), 'f1' + str(round(val_f1, 5))] 148 | model_name = '-'.join(model_name_args) + '.pth' 149 | saving_dir = osp.join(SAVING_DIR, 'saved_best_model_cv' + str(cv_idx)) 150 | torch.save(model.state_dict(), saving_dir) 151 | print('saved best model to %s' % saving_dir) 152 | 153 | print('best loss %.5f' % best_val_loss) 154 | print('best f1 %.5f' % best_val_f1) 155 | 156 | del model, optimizer, scheduler -------------------------------------------------------------------------------- /src/train_round2.py: -------------------------------------------------------------------------------- 1 | from pytorch_pretrained_bert import BertTokenizer 2 | from dataset import ReviewDataset, get_data_loaders, get_data_loaders_round2 3 | from model import OpinioNet 4 | 5 | import torch 6 | from torch.optim import Adam 7 | 8 | from lr_scheduler import GradualWarmupScheduler, ReduceLROnPlateau 9 | from tqdm import tqdm 10 | import os.path as osp 11 | 12 | 13 | def f1_score(P, G, S): 14 | pr = S / P 15 | rc = S / G 16 | f1 = 2 * pr * rc / (pr + rc) 17 | return f1, pr, rc 18 | 19 | 20 | def evaluate_sample(gt, pred): 21 | gt = set(gt) 22 | pred = set(pred) 23 | p = len(pred) 24 | g = len(gt) 25 | s = len(gt.intersection(pred)) 26 | # print(p, g, s) 27 | return p, g, s 28 | 29 | 30 | def train_epoch(model, makeup_loader, laptop_loader, corpus_loader, optimizer, scheduler=None): 31 | model.train() 32 | cum_loss = 0 33 | cum_lm_loss = 0 34 | total_lm_sample = 0 35 | P, G, S = 0, 0, 0 36 | total_sample = 0 37 | step = 0 38 | pbar = tqdm(range(max(len(makeup_loader), len(laptop_loader), len(corpus_loader)))) 39 | makeup_iter = iter(makeup_loader) 40 | laptop_iter = iter(laptop_loader) 41 | corpus_iter = iter(corpus_loader) 42 | for _ in pbar: 43 | 
if step == max(len(makeup_loader), len(laptop_loader), len(corpus_loader)): 44 | pbar.close() 45 | break 46 | 47 | try: 48 | corpus_ids, corpus_attn, lm_label = next(corpus_iter) 49 | except StopIteration: 50 | corpus_iter = iter(corpus_loader) 51 | corpus_ids, corpus_attn, lm_label = next(corpus_iter) 52 | 53 | corpus_ids = corpus_ids.cuda() 54 | corpus_attn = corpus_attn.cuda() 55 | lm_label = lm_label.cuda() 56 | loss = model.foward_LM(corpus_ids, corpus_attn, lm_label) 57 | optimizer.zero_grad() 58 | loss.backward() 59 | if scheduler: 60 | scheduler.step() 61 | optimizer.step() 62 | cum_lm_loss += loss.data.cpu().numpy() * len(corpus_ids) 63 | total_lm_sample += len(corpus_ids) 64 | del corpus_ids, corpus_attn, lm_label, loss 65 | 66 | try: 67 | makeup_raw, makeup_x, makeup_y = next(makeup_iter) 68 | except StopIteration: 69 | makeup_iter = iter(makeup_loader) 70 | makeup_raw, makeup_x, makeup_y = next(makeup_iter) 71 | try: 72 | laptop_raw, laptop_x, laptop_y = next(laptop_iter) 73 | except StopIteration: 74 | laptop_iter = iter(laptop_loader) 75 | laptop_raw, laptop_x, laptop_y = next(laptop_iter) 76 | 77 | makeup_rv_raw, makeup_lb_raw = makeup_raw 78 | makeup_x = [item.cuda() for item in makeup_x] 79 | makeup_y = [item.cuda() for item in makeup_y] 80 | 81 | laptop_rv_raw, laptop_lb_raw = laptop_raw 82 | laptop_x = [item.cuda() for item in laptop_x] 83 | laptop_y = [item.cuda() for item in laptop_y] 84 | 85 | makeup_probs, makeup_logits = model.forward(makeup_x, type='makeup') 86 | makeup_loss = model.loss(makeup_logits, makeup_y) * len(makeup_rv_raw) 87 | 88 | laptop_probs, laptop_logits = model.forward(laptop_x, type='laptop') 89 | laptop_loss = model.loss(laptop_logits, laptop_y) * len(laptop_rv_raw) 90 | 91 | loss = (makeup_loss + laptop_loss) / (len(makeup_rv_raw) + len(laptop_rv_raw)) 92 | optimizer.zero_grad() 93 | loss.backward() 94 | if scheduler: 95 | scheduler.step() 96 | optimizer.step() 97 | 98 | makeup_pred = model.gen_candidates(makeup_probs) 99 | makeup_pred = model.nms_filter(makeup_pred, 0.1) 100 | 101 | for b in range(len(makeup_pred)): 102 | gt = makeup_lb_raw[b] 103 | pred = [x[0] for x in makeup_pred[b]] 104 | p, g, s = evaluate_sample(gt, pred) 105 | P += p 106 | G += g 107 | S += s 108 | 109 | laptop_pred = model.gen_candidates(laptop_probs) 110 | laptop_pred = model.nms_filter(laptop_pred, 0.1) 111 | for b in range(len(laptop_pred)): 112 | gt = laptop_lb_raw[b] 113 | pred = [x[0] for x in laptop_pred[b]] 114 | p, g, s = evaluate_sample(gt, pred) 115 | P += p 116 | G += g 117 | S += s 118 | 119 | cum_loss += loss.data.cpu().numpy() * (len(makeup_rv_raw) + len(laptop_rv_raw)) 120 | total_sample += (len(makeup_rv_raw) + len(laptop_rv_raw)) 121 | step += 1 122 | while makeup_x: 123 | a = makeup_x.pop(); del a 124 | a = laptop_x.pop(); del a 125 | while makeup_y: 126 | a = makeup_y.pop(); del a 127 | a = laptop_y.pop(); del a 128 | 129 | while makeup_probs: 130 | a = makeup_probs.pop(); del a 131 | a = makeup_logits.pop(); del a 132 | a = laptop_probs.pop(); del a 133 | a = laptop_logits.pop(); del a 134 | 135 | del loss, makeup_loss, laptop_loss 136 | 137 | total_f1, total_pr, total_rc = f1_score(P, G, S) 138 | total_loss = cum_loss / total_sample 139 | total_lm_loss = cum_lm_loss / total_lm_sample 140 | return total_loss, total_lm_loss, total_f1, total_pr, total_rc 141 | 142 | 143 | def eval_epoch(model, dataloader, type='makeup'): 144 | model.eval() 145 | cum_loss = 0 146 | P, G, S = 0, 0, 0 147 | total_sample = 0 148 | step = 0 149 | pbar = tqdm(dataloader) 
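	# train_epoch above alternates an MLM step on the unlabeled corpus with a joint
	# supervised step whose loss is the size-weighted mean of the makeup and laptop
	# batch losses. The evaluation loop below mirrors the single-task scripts: each
	# batch goes through the head selected by `type` ('makeup' or 'laptop'),
	# candidates are NMS-filtered at the fixed 0.1 threshold, P/G/S counts are
	# accumulated so F1/precision/recall are micro-averaged over the whole split,
	# and the loss is averaged per review.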
150 | for raw, x, y in pbar: 151 | if step == len(dataloader): 152 | pbar.close() 153 | break 154 | rv_raw, lb_raw = raw 155 | x = [item.cuda() for item in x] 156 | y = [item.cuda() for item in y] 157 | with torch.no_grad(): 158 | probs, logits = model.forward(x, type) 159 | loss = model.loss(logits, y) 160 | pred_result = model.gen_candidates(probs) 161 | pred_result = model.nms_filter(pred_result, 0.1) 162 | 163 | cum_loss += loss.data.cpu().numpy() * len(rv_raw) 164 | total_sample += len(rv_raw) 165 | for b in range(len(pred_result)): 166 | gt = lb_raw[b] 167 | pred = [x[0] for x in pred_result[b]] 168 | p, g, s = evaluate_sample(gt, pred) 169 | 170 | P += p 171 | G += g 172 | S += s 173 | 174 | step += 1 175 | 176 | total_f1, total_pr, total_rc = f1_score(P, G, S) 177 | total_loss = cum_loss / total_sample 178 | 179 | return total_loss, total_f1, total_pr, total_rc 180 | 181 | 182 | if __name__ == '__main__': 183 | EP = 100 184 | SAVING_DIR = '../models/' 185 | tokenizer = BertTokenizer.from_pretrained('/home/zydq/.torch/models/bert/chinese_roberta_wwm_ext_L-12_H-768_A-12', 186 | do_lower_case=True) 187 | # tokenizer = BertTokenizer.from_pretrained('/home/zydq/.tf/bert/chinese_roberta_wwm_ext_L-12_H-768_A-12', 188 | # do_lower_case=True) 189 | 190 | makeup_train_loader, makeup_val_loader, laptop_train_loader, laptop_val_loader, corpus_loader = \ 191 | get_data_loaders_round2(tokenizer, batch_size=12) 192 | 193 | 194 | model = OpinioNet.from_pretrained('/home/zydq/.torch/models/bert/chinese_roberta_wwm_ext_L-12_H-768_A-12') 195 | # model = OpinioNet.from_pretrained('/home/zydq/.tf/bert/chinese_roberta_wwm_ext_L-12_H-768_A-12', from_tf=True) 196 | model.cuda() 197 | optimizer = Adam(model.parameters(), lr=6e-6) 198 | scheduler = GradualWarmupScheduler(optimizer, total_epoch=2*max(len(makeup_train_loader), len(corpus_loader))) 199 | best_val_f1 = 0 200 | best_val_loss = float('inf') 201 | for e in range(EP): 202 | 203 | print('Epoch [%d/%d] train:' % (e, EP)) 204 | train_loss, train_lm_loss, train_f1, train_pr, train_rc = train_epoch(model, makeup_train_loader, laptop_train_loader, corpus_loader, optimizer, scheduler) 205 | print("loss %.5f, lm loss %.5f f1 %.5f, pr %.5f, rc %.5f" % (train_loss, train_lm_loss, train_f1, train_pr, train_rc)) 206 | 207 | print('Epoch [%d/%d] makeup eval:' % (e, EP)) 208 | val_loss, val_f1, val_pr, val_rc = eval_epoch(model, makeup_val_loader, type='makeup') 209 | print("makeup_val: loss %.5f, f1 %.5f, pr %.5f, rc %.5f" % (val_loss, val_f1, val_pr, val_rc)) 210 | 211 | print('Epoch [%d/%d] laptop eval:' % (e, EP)) 212 | val_loss, val_f1, val_pr, val_rc = eval_epoch(model, laptop_val_loader, type='laptop') 213 | print("laptop_val: loss %.5f, f1 %.5f, pr %.5f, rc %.5f" % (val_loss, val_f1, val_pr, val_rc)) 214 | 215 | if val_loss < best_val_loss: 216 | best_val_loss = val_loss 217 | if val_f1 > best_val_f1: 218 | best_val_f1 = val_f1 219 | if best_val_f1 >= 0.75: 220 | model_name_args = ['ep' + str(e), 'f1' + str(round(val_f1, 5))] 221 | model_name = '-'.join(model_name_args) + '.pth' 222 | saving_dir = osp.join(SAVING_DIR, 'saved_best_model') 223 | torch.save(model.state_dict(), saving_dir) 224 | print('saved best model to %s' % saving_dir) 225 | 226 | print('best loss %.5f' % best_val_loss) 227 | print('best f1 %.5f' % best_val_f1) -------------------------------------------------------------------------------- /train_script.sh: -------------------------------------------------------------------------------- 1 | cd src 2 | python pretrain.py --base_model 
roberta 3 | python finetune_cv.py --base_model roberta 4 | python pretrain.py --base_model wwm 5 | python finetune_cv.py --base_model wwm 6 | python pretrain.py --base_model ernie 7 | python finetune_cv.py --base_model ernie 8 | 9 | # local 10 | ############## focal ################# 11 | python pretrain.py --base_model roberta_focal 12 | python finetune_cv.py --base_model roberta_focal 13 | python pretrain.py --base_model wwm_focal 14 | python finetune_cv.py --base_model wwm_focal 15 | python pretrain.py --base_model ernie_focal 16 | python finetune_cv.py --base_model ernie_focal # ** 17 | 18 | 19 | # on server 20 | ###############tiny################### 21 | python pretrain.py --base_model roberta_tiny 22 | python finetune_cv.py --base_model roberta_tiny 23 | python pretrain.py --base_model wwm_tiny 24 | python finetune_cv.py --base_model wwm_tiny 25 | python pretrain.py --base_model ernie_tiny 26 | python finetune_cv.py --base_model ernie_tiny # ** 27 | --------------------------------------------------------------------------------