├── README.md ├── kcc2022 poster.pdf ├── kcc2022 poster.png ├── model.png ├── requirements.txt ├── run_baseline_torch.py └── src ├── dependency └── merge.py ├── functions ├── biattention.py ├── metric.py ├── processor.py └── utils.py └── model ├── main_functions_multi.py └── model_multi.py /README.md: -------------------------------------------------------------------------------- 1 | # Dual-Classification of Scientific Paper Sentence 2 | Code for the KCC 2022 paper: *[Dual-Classification of Paper Sentence using Chunk Representation Method and Dependency Parsing](https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE11113336)* 3 | 4 | 5 | 6 | 7 | 8 | 9 | ## Setting up the code environment 10 | 11 | ``` 12 | $ virtualenv --python=python3.6 venv 13 | $ source venv/bin/activate 14 | $ pip install -r requirements.txt 15 | ``` 16 | 17 | The code is supported on Linux only. 18 | 19 | ## Model Structure 20 | 21 | 22 | 23 | 24 | 25 | ## Data 26 | 27 | *[국내 논문 문장 의미 태깅 데이터셋](https://aida.kisti.re.kr/data/8d0fd6f4-4bf9-47ae-bd71-7d41f01ad9a6)* (semantic tagging dataset for sentences from Korean research papers) 28 | 29 | ## Directory and Pre-processing 30 | `The dependency parsing model used for pre-processing is not publicly released (의존 구문 분석 모델은 미공개)` 31 | ``` 32 | ├── data 33 | │   ├── origin.json 34 | │   └── origin 35 | │   ├── DP_origin_preprocess.json 36 | │   └── merge_origin_preprocess 37 | │   ├── origin_train.json 38 | │   └── origin_test.json 39 | ├── bert 40 | │   ├── init_weight 41 | │   └── biaffine_model 42 | │   └── multi 43 | ├── src 44 | │   ├── dependency 45 | │   └── merge.py 46 | │   ├── functions 47 | │   ├── biattention.py 48 | │   ├── utils.py 49 | │   ├── metric.py 50 | │   └── processor.py 51 | │   └── model 52 | │   ├── main_functions_multi.py 53 | │   └── model_multi.py 54 | ├── run_baseline_torch.py 55 | ├── requirements.txt 56 | └── README.md 57 | ``` 58 | 59 | * Run the dependency parsing model over the raw data (data/origin.json) to extract word-level (eojeol-level) dependency structures for each input sentence (data/origin/DP_origin_preprocess.json) 60 | 61 | * Convert the word-level dependency structures (data/origin/DP_origin_preprocess.json) into chunk-level dependency structures with src/dependency/merge.py (data/origin/merge_origin_preprocess/origin.json) 62 | 63 | * Split the data into training and test sets at a 4:1 ratio per fine-grained label (data/origin/merge_origin_preprocess/origin_train.json, data/origin/merge_origin_preprocess/origin_test.json) 64 | 65 | * Add the special token `` that marks chunk boundaries to the vocab.json of [bert/init_weight](https://huggingface.co/klue/bert-base) 66 | 67 | ## Train & Test 68 | 69 | ### Pretrained model 70 | * KLUE/BERT-base 71 | ### How To Run 72 | `python run_baseline_torch.py` 73 | 74 | ## Results 75 | 76 | | Model | Macro F1 | Acc | 77 | |---|---------|---------| 78 | | BERT | 89.66% | 89.90% | 79 | | proposed | 89.75% | 89.99% | 80 | -------------------------------------------------------------------------------- /kcc2022 poster.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KUNLP/XAI_BinaryClassifier/4d1e9632654559b916de392857cb815928924b19/kcc2022 poster.pdf -------------------------------------------------------------------------------- /kcc2022 poster.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/KUNLP/XAI_BinaryClassifier/4d1e9632654559b916de392857cb815928924b19/kcc2022 poster.png -------------------------------------------------------------------------------- /model.png: --------------------------------------------------------------------------------
https://raw.githubusercontent.com/KUNLP/XAI_BinaryClassifier/4d1e9632654559b916de392857cb815928924b19/model.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy 2 | pandas 3 | tokenizers 4 | attrdict 5 | fastprogress 6 | tqdm 7 | pytorch-crf 8 | scikit-learn 9 | tensorflow-gpu==1.15.0 10 | transformers==4.7.0 11 | konlpy 12 | tweepy==3.10.0 13 | torch==1.8.1+cu111 14 | -------------------------------------------------------------------------------- /run_baseline_torch.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import logging 4 | from attrdict import AttrDict 5 | 6 | # bert 7 | from transformers import AutoTokenizer 8 | from transformers import BertConfig,RobertaConfig 9 | 10 | from src.model.model_multi import BertForSequenceClassification 11 | 12 | from src.model.main_functions_multi import train, evaluate, predict 13 | 14 | from src.functions.utils import init_logger, set_seed 15 | 16 | import sys 17 | sys.path.append(os.path.dirname(os.path.abspath(os.path.dirname(__file__)))) 18 | 19 | def create_model(args): 20 | 21 | if args.model_name_or_path.split("/")[-2] == "bert": 22 | 23 | # 모델 파라미터 Load 24 | config = BertConfig.from_pretrained( 25 | args.model_name_or_path#'bert/Bert-base' 26 | if args.from_init_weight else os.path.join(args.output_dir,"model/checkpoint-{}".format(args.checkpoint)), 27 | cache_dir=args.cache_dir, 28 | ) 29 | 30 | config.num_coarse_labels = 3 31 | config.num_labels = 9 32 | 33 | # roberta attention 추출하기 34 | config.output_attentions=True 35 | 36 | # tokenizer는 pre-trained된 것을 불러오는 과정이 아닌 불러오는 모델의 vocab 등을 Load 37 | # BertTokenizerFast로 되어있음 38 | tokenizer = AutoTokenizer.from_pretrained( 39 | args.model_name_or_path#'bert/init_weight' 40 | if args.from_init_weight else os.path.join(args.output_dir,"model/checkpoint-{}".format(args.checkpoint)), 41 | do_lower_case=args.do_lower_case, 42 | cache_dir=args.cache_dir, 43 | ) 44 | 45 | model = BertForSequenceClassification.from_pretrained( 46 | args.model_name_or_path#'bert/init_weight' 47 | if args.from_init_weight else os.path.join(args.output_dir,"model/checkpoint-{}".format(args.checkpoint)), 48 | cache_dir=args.cache_dir, 49 | config=config, 50 | max_sentence_length=args.max_sentence_length, 51 | # from_tf= True if args.from_init_weight else False 52 | ) 53 | args.model_name_or_path = args.cache_dir 54 | # print(tokenizer.convert_tokens_to_ids("")) 55 | 56 | elif args.model_name_or_path.split("/")[-2] == "roberta": 57 | 58 | # 모델 파라미터 Load 59 | config = RobertaConfig.from_pretrained( 60 | args.model_name_or_path 61 | if args.from_init_weight else os.path.join(args.output_dir,"model/checkpoint-{}".format(args.checkpoint)), 62 | cache_dir=args.cache_dir, 63 | ) 64 | 65 | config.num_coarse_labels = 3 66 | config.num_labels = 9 67 | 68 | # roberta attention 추출하기 69 | config.output_attentions=True 70 | 71 | # tokenizer는 pre-trained된 것을 불러오는 과정이 아닌 불러오는 모델의 vocab 등을 Load 72 | # BertTokenizerFast로 되어있음 73 | tokenizer = AutoTokenizer.from_pretrained( 74 | args.model_name_or_path 75 | if args.from_init_weight else os.path.join(args.output_dir,"model/checkpoint-{}".format(args.checkpoint)), 76 | do_lower_case=args.do_lower_case, 77 | cache_dir=args.cache_dir, 78 | ) 79 | 80 | model = RobertaForSequenceClassification.from_pretrained( 81 | args.model_name_or_path 82 | if args.from_init_weight else 
os.path.join(args.output_dir,"model/checkpoint-{}".format(args.checkpoint)), 83 | cache_dir=args.cache_dir, 84 | config=config, 85 | max_sentence_length=args.max_sentence_length, 86 | # from_tf= True if args.from_init_weight else False 87 | ) 88 | args.model_name_or_path = args.cache_dir 89 | # print(tokenizer.convert_tokens_to_ids("")) 90 | 91 | model.to(args.device) 92 | return model, tokenizer 93 | 94 | def main(cli_args): 95 | # 파라미터 업데이트 96 | args = AttrDict(vars(cli_args)) 97 | args.device = "cuda" 98 | logger = logging.getLogger(__name__) 99 | 100 | # logger 및 seed 지정 101 | init_logger() 102 | set_seed(args) 103 | 104 | # 모델 불러오기 105 | model, tokenizer = create_model(args) 106 | 107 | # Running mode에 따른 실행 108 | if args.do_train: 109 | train(args, model, tokenizer, logger) 110 | elif args.do_eval: 111 | evaluate(args, model, tokenizer, logger) 112 | elif args.do_predict: 113 | predict(args, model, tokenizer) 114 | 115 | 116 | if __name__ == '__main__': 117 | cli_parser = argparse.ArgumentParser() 118 | 119 | # Directory 120 | 121 | #------------------------------------------------------------------------------------------------ 122 | cli_parser.add_argument("--data_dir", type=str, default="./data/origin/merge_origin_preprocess") 123 | 124 | cli_parser.add_argument("--train_file", type=str, default='origin_train.json') 125 | #cli_parser.add_argument("--train_file", type=str, default='sample.json') 126 | cli_parser.add_argument("--eval_file", type=str, default='origin_test.json') 127 | cli_parser.add_argument("--predict_file", type=str, default='origin_test.json') 128 | 129 | # ------------------------------------------------------------------------------------------------ 130 | 131 | ## roberta 132 | # cli_parser.add_argument("--model_name_or_path", type=str, default="./roberta/init_weight") 133 | # cli_parser.add_argument("--cache_dir", type=str, default="./roberta/init_weight") 134 | 135 | # bert 136 | cli_parser.add_argument("--model_name_or_path", type=str, default="./bert/init_weight") 137 | cli_parser.add_argument("--cache_dir", type=str, default="./bert/init_weight") 138 | 139 | #------------------------------------------------------------------------------------------------------------ 140 | cli_parser.add_argument("--output_dir", type=str, default="./bert/biaffine_model/multi") 141 | #cli_parser.add_argument("--output_dir", type=str, default="./roberta/biaffine_model/multi") 142 | 143 | 144 | # ------------------------------------------------------------------------------------------------------------ 145 | 146 | cli_parser.add_argument("--max_sentence_length", type=int, default=110) 147 | 148 | # https://github.com/KLUE-benchmark/KLUE-baseline/blob/main/run_all.sh 149 | # Model Hyper Parameter 150 | cli_parser.add_argument("--max_seq_length", type=int, default=512) 151 | # Training Parameter 152 | cli_parser.add_argument("--learning_rate", type=float, default=1e-5) 153 | cli_parser.add_argument("--train_batch_size", type=int, default =16) 154 | cli_parser.add_argument("--eval_batch_size", type=int, default = 32) 155 | cli_parser.add_argument("--num_train_epochs", type=int, default=6) 156 | 157 | #cli_parser.add_argument("--save_steps", type=int, default=2000) 158 | cli_parser.add_argument("--logging_steps", type=int, default=100) 159 | cli_parser.add_argument("--seed", type=int, default=42) 160 | cli_parser.add_argument("--threads", type=int, default=8) 161 | 162 | cli_parser.add_argument("--weight_decay", type=float, default=0.0) 163 | 
cli_parser.add_argument("--adam_epsilon", type=int, default=1e-10) 164 | cli_parser.add_argument("--gradient_accumulation_steps", type=int, default=4) 165 | cli_parser.add_argument("--warmup_steps", type=int, default=0) 166 | cli_parser.add_argument("--max_steps", type=int, default=-1) 167 | cli_parser.add_argument("--max_grad_norm", type=int, default=1.0) 168 | 169 | cli_parser.add_argument("--verbose_logging", type=bool, default=False) 170 | cli_parser.add_argument("--do_lower_case", type=bool, default=False) 171 | cli_parser.add_argument("--no_cuda", type=bool, default=False) 172 | 173 | # Running Mode 174 | cli_parser.add_argument("--from_init_weight", type=bool, default= True) #False)#True) 175 | cli_parser.add_argument("--checkpoint", type=str, default="5") 176 | 177 | cli_parser.add_argument("--do_train", type=bool, default=True) #False)#True) 178 | cli_parser.add_argument("--do_eval", type=bool, default=False)#True)#False) 179 | cli_parser.add_argument("--do_predict", type=bool, default=False) #True)#False) 180 | 181 | cli_args = cli_parser.parse_args() 182 | 183 | main(cli_args) 184 | -------------------------------------------------------------------------------- /src/dependency/merge.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | from tqdm import tqdm 4 | import random 5 | 6 | 7 | def change_tag(change_list, tag_list): 8 | new_tag_list = [] 9 | for tag_li_idx, tag_li in enumerate(tag_list): 10 | new_tag_li = [] 11 | for change in change_list: 12 | change_result = set(sorted(change[1])) 13 | for word_idx in change[0]: 14 | for tag_idx, tag in enumerate(tag_li): 15 | ## tag = [[피지배소idx, 지배소idx], [구문정보, 기능정보]] 16 | if list(tag[0])[0] == word_idx: 17 | tag_list[tag_li_idx][tag_idx] = [[change_result, set(sorted(list(tag[0])[1]))], tag[1]] 18 | if list(tag[0])[1] == word_idx: 19 | tag_list[tag_li_idx][tag_idx] = [[set(sorted(list(tag[0])[0])), change_result], tag[1]] 20 | for tag in tag_li: 21 | if (len(tag[0][0].difference(tag[0][1]))!=0): 22 | if (len(tag[0][1].difference(tag[0][0]))!=0):new_tag_li.append(tag) 23 | new_tag_list.append(new_tag_li) 24 | tag_list = new_tag_list 25 | del new_tag_list 26 | return tag_list 27 | 28 | def tag_case1(change_list, tag_l): 29 | ## tag = [[피지배소idx, 지배소idx], [구문정보, 기능정보]] 30 | case1_conti = True;cnt = 0 31 | while case1_conti: 32 | case1_conti = False 33 | del_tag_l = [] 34 | for tag1 in tag_l: 35 | for tag2 in tag_l: 36 | if (max(tag1[0][1]) == min(tag2[0][0])): 37 | case1_conti = True;cnt += 1 38 | change = tag1[0][0].union(tag1[0][1].union(tag2[0][0])) 39 | change_list.append([[tag1[0][0], tag1[0][1], tag2[0][0]], change]) 40 | tag2[0][0] = change 41 | if tag1 in tag_l: del_tag_l.append(tag1) 42 | new_tag_l = [tag for tag in tag_l if tag not in del_tag_l] 43 | tag_l = new_tag_l 44 | del new_tag_l 45 | return change_list, tag_l, cnt 46 | 47 | def tag_case2(change_list, tag_li1, tag_li2): 48 | ## tag = [[피지배소idx, 지배소idx], [구문정보, 기능정보]] 49 | case2_conti = True;cnt = 0 50 | while case2_conti: 51 | case2_conti = False 52 | for tag1 in tag_li1: 53 | del_tag_li2 = [] 54 | for tag2 in tag_li2: 55 | if ((tag1[0][1] == tag2[0][1]) and ((max(tag1[0][0]) - min(tag2[0][0])) == 1)): 56 | case2_conti = True;cnt+=1 57 | change = tag1[0][0].union(tag2[0][0]) 58 | change_list.append([[tag1[0][0], tag2[0][0]], change]) 59 | tag1[0][0] = change 60 | if tag2 in tag_li2: del_tag_li2.append(tag2) 61 | new_tag_li2 = [tag for tag in tag_li2 if tag not in del_tag_li2] 62 | tag_li2 = new_tag_li2 63 
| del new_tag_li2 64 | 65 | return change_list, tag_li1, tag_li2, cnt 66 | 67 | def merge_tag(datas, CNJ=True): 68 | outputs = [] 69 | for id, data in tqdm(enumerate(datas)): 70 | # [{'R', 'VNP', 'L', 'VP', 'S', 'AP', 'NP', 'DP', 'IP', 'X'}, {'None', 'MOD', 'CNJ', 'AJT', 'OBJ', 'SBJ', 'CMP'}] 71 | # 구문 정보 72 | r_list = []; 73 | l_list = []; 74 | s_list = []; 75 | x_list = []; 76 | np_list = []; 77 | dp_list = []; 78 | vp_list = []; 79 | vnp_list = []; 80 | ap_list = []; 81 | ip_list = [] 82 | # 수식어 기능 정보 83 | tag_list = []; 84 | np_cnj_list = [] 85 | 86 | # sentence word 87 | sen_words = data["preprocess"].split() 88 | # 지배소 idx 89 | heads = [x-1 for x in data["parsing"]["heads"][:-1]] 90 | 91 | # 의존관계태그 92 | labels = data["parsing"]["label"][:-1] 93 | assert len(sen_words)-1 == len(heads) == len(labels) 94 | 95 | # 문장내 의존관계태그 분류 96 | for w,(w1, w2_idx, label) in enumerate(zip(sen_words, heads, labels)): 97 | label = label.split("_") 98 | if (len(label)==1):label.append("None") 99 | 100 | dependency_list = [[set([w]),set([w2_idx])], label] 101 | 102 | if (label[0] == "NP"): 103 | if CNJ: 104 | if (label[1] == "CNJ"): 105 | np_cnj_list.append(dependency_list) 106 | if (label[1] != "CNJ"): 107 | np_list.append(dependency_list) 108 | else:np_list.append(dependency_list) 109 | elif (label[0] == "VP"): 110 | vp_list.append(dependency_list) 111 | elif (label[0] == "VNP"): 112 | vnp_list.append(dependency_list) 113 | elif (label[0] == "DP"): 114 | dp_list.append(dependency_list) 115 | elif (label[0] == "AP"): 116 | ap_list.append(dependency_list) 117 | elif (label[0] == "IP"): 118 | ip_list.append(dependency_list) 119 | elif (label[0] == "R"): 120 | r_list.append(dependency_list) 121 | elif (label[0] == "L"): 122 | l_list.append(dependency_list) 123 | elif (label[0] == "S"): 124 | s_list.append(dependency_list) 125 | elif (label[0] == "X"): 126 | x_list.append(dependency_list) 127 | 128 | if (label[1] in ["MOD", "AJT", "CMP", "None"]): 129 | tag_list.append(dependency_list); 130 | vp_list = vp_list + vnp_list 131 | 132 | tag_list = [tag_list] + [x for x in [np_list, dp_list, vp_list, ap_list, ip_list, r_list, l_list, s_list, x_list] if len(x) != 0] 133 | 134 | # NP-CNJ 135 | if np_cnj_list != []: 136 | np_cnj_list = [cnj[0] for cnj in np_cnj_list] 137 | for word_idxs in np_cnj_list: 138 | for tag_li_idx,tag_li in enumerate(tag_list): 139 | new_tag_li = []; new_tag_li2 = [] 140 | for tag_idx, tag in enumerate(tag_li): 141 | if (list(tag[0])[0] == word_idxs[0]): 142 | if (word_idxs[0] != list(tag[0])[1]): new_tag_li.append([[word_idxs[0],list(tag[0])[1]], tag[1]]) 143 | elif (list(tag[0])[1] == word_idxs[0]): 144 | if (list(tag[0])[0] != word_idxs[0]): new_tag_li.append([[list(tag[0])[0],word_idxs[0]], tag[1]]) 145 | elif (list(tag[0])[0] == word_idxs[1]): 146 | if (word_idxs[1] != list(tag[0])[1]): new_tag_li.append([[word_idxs[1],list(tag[0])[1]], tag[1]]) 147 | elif (list(tag[0])[1] == word_idxs[1]): 148 | if (list(tag[0])[0] != word_idxs[1]): new_tag_li.append([[list(tag[0])[0],word_idxs[1]], tag[1]]) 149 | for new_tag in new_tag_li: 150 | if new_tag not in tag_li: 151 | new_tag_li2.append(new_tag) 152 | del new_tag_li 153 | new_tag_li2 = new_tag_li2 + tag_li 154 | tag_list[tag_li_idx] = new_tag_li2 155 | 156 | #print("len of dependency_list:"+str(len(sum(tag_list, [])))) 157 | Done = True 158 | while Done: 159 | origin_tag_list = tag_list.copy() 160 | for tag_li_idx, tag_li in enumerate(tag_list): 161 | ## tag_li = [tag1, tag2, ...] 
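                ## Illustrative toy example (assumed, not taken from the corpus):
                ## if tag_li holds the word-level arcs [[{0}, {1}], ['NP', 'MOD']] and
                ## [[{1}, {2}], ['VP', 'None']] (a 0<-1<-2 chain), case1 below merges the
                ## adjacent dependent-head pair into one chunk, leaving the single
                ## chunk-level arc [[{0, 1}, {2}], ['VP', 'None']].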
162 | ## tag = [[set([피지배소idx]), set([지배소idx])], [구문정보, 기능정보]] 163 | conti = True 164 | while conti: 165 | conti_tf = 0 166 | 167 | ## case1 168 | ## (a<-b<-c) => (ab<-c) 169 | change_list = [] 170 | tag_dist_1 = []# 지배소와 피지배소 물리적 거리가 1 171 | for tag in tag_li: 172 | if ((max(list(tag[0])[1])-min(list(tag[0])[0])) == 1):tag_dist_1.append(tag) 173 | 174 | change_list, tag_dist_1, cnt = tag_case1(change_list, tag_dist_1) 175 | if (cnt != 0): 176 | conti_tf += cnt 177 | tag_list = change_tag(change_list, tag_list) 178 | tag_li = tag_list[tag_li_idx] 179 | #print("tag_case1 done") 180 | 181 | ## case2 182 | ## (a<-b, a<-c) => (ab<-c) 183 | change_list = [] 184 | tag_dist_1 = [] # 지배소와 피지배소 물리적 거리가 1 185 | tag_dist_2 = [] # 지배소와 피지배소 물리적 거리가 2 186 | 187 | for tag in tag_li: 188 | if ((max(list(tag[0])[1])-min(list(tag[0])[0])) == 1):tag_dist_1.append(tag) 189 | elif((max(list(tag[0])[1])-min(list(tag[0])[0])) == 2):tag_dist_2.append(tag) 190 | 191 | 192 | change_list, tag_dist_1, tag_dist_2, cnt = tag_case2(change_list, tag_dist_1, tag_dist_2) 193 | 194 | if (cnt != 0): 195 | conti_tf += cnt 196 | tag_list = change_tag(change_list, tag_list) 197 | tag_li = tag_list[tag_li_idx] 198 | #print("tag_case2 done") 199 | 200 | if conti_tf == 0: conti = False 201 | 202 | if (origin_tag_list == tag_list): Done = False 203 | 204 | dependency_lists = sum(tag_list, []) 205 | 206 | sen_idxs = [set(), set()] 207 | for dep_idx, dependency_list in enumerate(dependency_lists): 208 | # dependency_list = [[set([w]), set([w2_idx])], label] 209 | word_idxs = dependency_list[0] 210 | sen_idxs[0].add(min(word_idxs[0])) 211 | sen_idxs[0].add(min(word_idxs[1])) 212 | sen_idxs[1].add(max(word_idxs[0])+1) 213 | sen_idxs[1].add(max(word_idxs[1])+1) 214 | 215 | # 후처리 216 | ## 삭제 217 | sen_idxs[0] = set(list(sen_idxs[0])[1:]); sen_idxs[1] = set(list(sen_idxs[1])[:-1]) 218 | if len(sen_idxs[0]) != len(sen_idxs[1]): 219 | # print(sen_idxs[0].difference(sen_idxs[1])) 220 | # print(sen_idxs[1].difference(sen_idxs[0])) 221 | # print("len of dependency_lists: "+str(len(dependency_lists))) 222 | del_dependency_lists = [] 223 | for dep_idx, dependency_list in enumerate(dependency_lists): 224 | # dependency_list = [[set([w]), set([w2_idx])], label] 225 | if min(dependency_list[0][0]) in sen_idxs[0].difference(sen_idxs[1]): 226 | if dependency_list in dependency_lists: del_dependency_lists.append(dependency_list) 227 | if min(dependency_list[0][1]) in sen_idxs[0].difference(sen_idxs[1]): 228 | if dependency_list in dependency_lists: del_dependency_lists.append(dependency_list) 229 | if (max(dependency_list[0][0])+1) in sen_idxs[1].difference(sen_idxs[0]): 230 | if dependency_list in dependency_lists: del_dependency_lists.append(dependency_list) 231 | if (max(dependency_list[0][1])+1) in sen_idxs[1].difference(sen_idxs[0]): 232 | if dependency_list in dependency_lists: del_dependency_lists.append(dependency_list) 233 | new_dependency_lists = [dependency_list for dependency_list in dependency_lists if dependency_list not in del_dependency_lists] 234 | dependency_lists = new_dependency_lists 235 | del new_dependency_lists 236 | #print("len of dependency_lists: " + str(len(dependency_lists))) 237 | 238 | sen_idxs = [set(), set()] 239 | for dep_idx, dependency_list in enumerate(dependency_lists): 240 | # dependency_list = [[set([w]), set([w2_idx])], label] 241 | word_idxs = dependency_list[0] 242 | sen_idxs[0].add(min(word_idxs[0])) 243 | sen_idxs[0].add(min(word_idxs[1])) 244 | sen_idxs[1].add(max(word_idxs[0]) + 1) 245 | 
sen_idxs[1].add(max(word_idxs[1]) + 1) 246 | dependency_lists[dep_idx] = [[" ".join(sen_words[min(word_idxs[0]): 1 + max(word_idxs[0])]), " ".join( 247 | sen_words[min(word_idxs[1]): 1 + max(word_idxs[1])])]] + dependency_list 248 | 249 | 250 | new_sen_words = [] 251 | for start_idx, end_idx in zip(sorted(sen_idxs[0]), sorted(sen_idxs[1])): 252 | new_sen_words.append([" ".join(sen_words[start_idx: end_idx]), [i for i in range(start_idx, end_idx)]]) 253 | 254 | for dep_idx, dependency_list in enumerate(dependency_lists): 255 | dependency_lists[dep_idx][1] = [sorted(dependency_lists[dep_idx][1][0]), sorted(dependency_lists[dep_idx][1][1])] 256 | # dependency_lists[dep_idx][1] = [new_sen_words.index(dependency_list[0][0]), new_sen_words.index(dependency_list[0][1])] 257 | 258 | output = {"doc_id":data["doc_id"], 259 | "sentence": data["sentence"], 260 | "preprocess": data["preprocess"], 261 | "merge": { 262 | "origin": new_sen_words, 263 | "parsing": dependency_lists 264 | }, 265 | "keysentence": data["keysentence"], 266 | "tag": data["tag"], 267 | "coarse_tag": data["coarse_tag"] 268 | } 269 | outputs.append(output) 270 | return outputs 271 | 272 | if __name__ == '__main__': 273 | inf_dir = "../../data/origin/DP_origin_preprocess.json" 274 | outf_dir = "../../data/origin/merge_origin_preprocess/origin.json" 275 | # 276 | with open(inf_dir, "r", encoding="utf-8") as inf: 277 | datas = json.load(inf) 278 | outputs = merge_tag(datas, CNJ=True) 279 | 280 | with open(outf_dir, "w", encoding="utf-8") as outf: 281 | json.dump(outputs, outf, ensure_ascii=False, indent=4) 282 | outf.close() 283 | 284 | -------------------------------------------------------------------------------- /src/functions/biattention.py: -------------------------------------------------------------------------------- 1 | # 해당 코드는 아래 링크에서 가져옴 2 | # https://github.com/KLUE-benchmark/KLUE-baseline/blob/8a03c9447e4c225e806877a84242aea11258c790/klue_baseline/models/dependency_parsing.py 3 | import numpy as np 4 | 5 | import torch 6 | import torch.nn as nn 7 | from torch.nn.parameter import Parameter 8 | import torch.nn.functional as F 9 | 10 | 11 | class BiAttention(nn.Module): 12 | def __init__( 13 | self, 14 | input_size_encoder, 15 | input_size_decoder, 16 | num_labels, 17 | biaffine=True, 18 | **kwargs 19 | ): 20 | super(BiAttention, self).__init__() 21 | self.input_size_encoder = input_size_encoder 22 | self.input_size_decoder = input_size_decoder 23 | self.num_labels = num_labels 24 | self.biaffine = biaffine 25 | 26 | self.W_e = Parameter(torch.Tensor(self.num_labels, self.input_size_encoder)) 27 | self.W_d = Parameter(torch.Tensor(self.num_labels, self.input_size_decoder)) 28 | self.b = Parameter(torch.Tensor(self.num_labels, 1, 1)) 29 | if self.biaffine: 30 | self.U = Parameter( 31 | torch.Tensor( 32 | self.num_labels, self.input_size_decoder, self.input_size_encoder 33 | ) 34 | ) 35 | else: 36 | self.register_parameter("U", None) 37 | 38 | self.reset_parameters() 39 | 40 | def reset_parameters(self): 41 | nn.init.xavier_uniform_(self.W_e) 42 | nn.init.xavier_uniform_(self.W_d) 43 | nn.init.constant_(self.b, 0.0) 44 | if self.biaffine: 45 | nn.init.xavier_uniform_(self.U) 46 | 47 | def forward(self, input_e, input_d, mask_d=None, mask_e=None): 48 | assert input_d.size(0) == input_e.size(0) 49 | batch, length_decoder, _ = input_d.size() 50 | _, length_encoder, _ = input_e.size() 51 | 52 | # input_d : [b, t, d] 53 | # input_e : [b, s, e] 54 | # out_d : [b, l, d, 1] 55 | # out_e : [b, l ,1, e] 56 | out_d = 
torch.matmul(self.W_d, input_d.transpose(1, 2)).unsqueeze(3) 57 | out_e = torch.matmul(self.W_e, input_e.transpose(1, 2)).unsqueeze(2) 58 | 59 | if self.biaffine: 60 | # output : [b, 1, t, d] * [l, d, e] -> [b, l, t, e] 61 | output = torch.matmul(input_d.unsqueeze(1), self.U) 62 | # output : [b, l, t, e] * [b, 1, e, s] -> [b, l, t, s] 63 | output = torch.matmul(output, input_e.unsqueeze(1).transpose(2, 3)) 64 | output = output + out_d + out_e + self.b 65 | else: 66 | output = out_d + out_d + self.b 67 | 68 | if mask_d is not None: 69 | output = ( 70 | output 71 | * mask_d.unsqueeze(1).unsqueeze(3) 72 | * mask_e.unsqueeze(1).unsqueeze(2) 73 | ) 74 | 75 | # input1 = (batch_size, input11, input12) 76 | # input2 = (batch_size, input21, input22) 77 | return output # (batch_size, output_size, input11, input21) 78 | 79 | class BiLinear(nn.Module): 80 | def __init__(self, left_features: int, right_features: int, out_features: int): 81 | super(BiLinear, self).__init__() 82 | self.left_features = left_features 83 | self.right_features = right_features 84 | self.out_features = out_features 85 | 86 | self.U = Parameter(torch.Tensor(self.out_features, self.left_features, self.right_features)) 87 | self.W_l = Parameter(torch.Tensor(self.out_features, self.left_features)) 88 | self.W_r = Parameter(torch.Tensor(self.out_features, self.right_features)) 89 | self.bias = Parameter(torch.Tensor(out_features)) 90 | 91 | self.reset_parameters() 92 | 93 | def reset_parameters(self) -> None: 94 | nn.init.xavier_uniform_(self.W_l) 95 | nn.init.xavier_uniform_(self.W_r) 96 | nn.init.constant_(self.bias, 0.0) 97 | nn.init.xavier_uniform_(self.U) 98 | 99 | def forward(self, input_left: torch.Tensor, input_right: torch.Tensor) -> torch.Tensor: 100 | left_size = input_left.size() 101 | right_size = input_right.size() 102 | assert left_size[:-1] == right_size[:-1], "batch size of left and right inputs mis-match: (%s, %s)" % ( 103 | left_size[:-1], 104 | right_size[:-1], 105 | ) 106 | batch = int(np.prod(left_size[:-1])) 107 | 108 | input_left = input_left.contiguous().view(batch, self.left_features) 109 | input_right = input_right.contiguous().view(batch, self.right_features) 110 | 111 | output = F.bilinear(input_left, input_right, self.U, self.bias) 112 | output = output + F.linear(input_left, self.W_l, None) + F.linear(input_right, self.W_r, None) 113 | return output.view(left_size[:-1] + (self.out_features,)) 114 | -------------------------------------------------------------------------------- /src/functions/metric.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score 3 | 4 | 5 | def get_sklearn_score(predicts, corrects, idx2label): 6 | predicts = [idx2label[predict] for predict in predicts] 7 | corrects = [idx2label[correct] for correct in corrects] 8 | result = {"accuracy": accuracy_score(corrects, predicts), 9 | "macro_precision": precision_score(corrects, predicts, average="macro"), 10 | "micro_precision": precision_score(corrects, predicts, average="micro"), 11 | "macro_f1": f1_score(corrects, predicts, average="macro"), 12 | "micro_f1": f1_score(corrects, predicts, average="micro"), 13 | "macro_recall": recall_score(corrects, predicts, average="macro"), 14 | "micro_recall": recall_score(corrects, predicts, average="micro"), 15 | } 16 | 17 | for k, v in result.items(): 18 | result[k] = round(v, 3) 19 | print(k + ": " + str(v)) 20 | return result 21 | 22 | 23 | 
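A minimal usage sketch of `get_sklearn_score` above (an illustration; it assumes the repository root is on `PYTHONPATH` and uses the three coarse tags defined in `src/functions/processor.py` as the label map):

```
# Minimal usage sketch for get_sklearn_score (run from the repository root).
from src.functions.metric import get_sklearn_score

# Index-to-label map; these are the coarse tags from src/functions/processor.py.
idx2label = {0: "연구 목적", 1: "연구 방법", 2: "연구 결과"}

predicts = [0, 1, 2, 2, 1]   # predicted class indices
corrects = [0, 1, 2, 1, 1]   # gold class indices

result = get_sklearn_score(predicts, corrects, idx2label)
# result maps "accuracy", "macro_f1", "micro_f1" and the precision/recall variants
# to values rounded to three decimals; each metric is also printed.
print(result["accuracy"], result["macro_f1"])
```

The `accuracy` and `macro_f1` keys of this dictionary are the ones `train()` in `src/model/main_functions_multi.py` compares against its running maxima before saving a checkpoint.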
-------------------------------------------------------------------------------- /src/functions/processor.py: -------------------------------------------------------------------------------- 1 | import json 2 | import logging 3 | import os 4 | from functools import partial 5 | from multiprocessing import Pool, cpu_count 6 | 7 | import numpy as np 8 | from tqdm import tqdm 9 | 10 | import transformers 11 | from transformers.file_utils import is_tf_available, is_torch_available 12 | from transformers.data.processors.utils import DataProcessor 13 | 14 | if is_torch_available(): 15 | import torch 16 | from torch.utils.data import TensorDataset 17 | 18 | if is_tf_available(): 19 | import tensorflow as tf 20 | 21 | logger = logging.getLogger(__name__) 22 | 23 | 24 | 25 | def convert_example_to_features(example, max_seq_length, is_training, max_sentence_length, language): 26 | 27 | # 데이터의 유효성 검사를 위한 부분 28 | # ======================================================== 29 | label = None 30 | coarse_label = None 31 | if is_training: 32 | # Get label 33 | label = example.label 34 | coarse_label = example.coarse_label 35 | 36 | # label_dictionary에 주어진 label이 존재하지 않으면 None을 feature로 출력 37 | # If the label cannot be found in the text, then skip this example. 38 | ## kind_of_label: label의 종류 39 | kind_of_label = ["문제 정의", "가설 설정", "기술 정의", "제안 방법", "대상 데이터", "데이터처리", "이론/모형", "성능/효과", "후속연구"]#, "기타"] 40 | actual_text = kind_of_label[label] if label<=len(kind_of_label) else label 41 | if actual_text not in kind_of_label: 42 | logger.warning("Could not find label: '%s' \n not in label list", actual_text) 43 | return None 44 | 45 | kind_of_coarse_label = ["연구 목적", "연구 방법", "연구 결과"]#, "기타"] 46 | actual_text = kind_of_coarse_label[coarse_label] if coarse_label <= len(kind_of_coarse_label) else coarse_label 47 | if actual_text not in kind_of_coarse_label: 48 | logger.warning("Could not find coarse_label: '%s' \n not in coarse_label list", actual_text) 49 | return None 50 | 51 | # ======================================================== 52 | 53 | # 단어(어절;word)와 토큰 간의 위치 정보 확인 54 | tok_to_orig_index = {"sentence": []} # token 개수만큼 # token에 대한 word의 위치 55 | orig_to_tok_index = {"sentence": []} # origin 개수만큼 # word를 토큰화하여 나온 첫번째 token의 위치 56 | all_doc_tokens = {"sentence": []} # origin text를 tokenization 57 | token_to_orig_map = {"sentence": []} 58 | 59 | for case in example.merge.keys(): 60 | new_merge = [] 61 | new_word = [] 62 | idx = 0 63 | for merge_idx in example.merge[case]: 64 | for m_idx in merge_idx: 65 | new_word.append(example.doc_tokens[case][m_idx]) 66 | new_word.append("") 67 | merge_idx = [m_idx+idx for m_idx in range(0,len(merge_idx))] 68 | new_merge.append(merge_idx) 69 | idx = max(merge_idx)+1 70 | new_merge.append([idx]) 71 | idx+=1 72 | example.merge[case] = new_merge 73 | example.doc_tokens[case] = new_word 74 | 75 | for case in example.merge.keys(): 76 | for merge_idx in example.merge[case]: 77 | for word_idx in merge_idx: 78 | # word를 토큰화하여 나온 첫번째 token의 위치 79 | orig_to_tok_index[case].append(len(tok_to_orig_index[case])) 80 | if (example.doc_tokens[case][word_idx] == ""): 81 | sub_tokens = [""] 82 | else: sub_tokens = tokenizer.tokenize(example.doc_tokens[case][word_idx]) 83 | for sub_token in sub_tokens: 84 | # token 저장 85 | all_doc_tokens[case].append(sub_token) 86 | # token에 대한 word의 위치 87 | tok_to_orig_index[case].append(word_idx) 88 | # token_to_orig_map: {token:word} 89 | #token_to_orig_map[case][len(tok_to_orig_index[case]) - 1] = len(orig_to_tok_index[case]) - 1 90 | 
token_to_orig_map[case].append(len(orig_to_tok_index[case]) - 1) 91 | 92 | # print("tok_to_orig_index\n"+str(tok_to_orig_index)) 93 | # print("orig_to_tok_index\n"+str(orig_to_tok_index)) 94 | # print("all_doc_tokens\n"+str(all_doc_tokens)) 95 | # print("token_to_orig_map\n\tindex of token : index of word\n\t"+str(token_to_orig_map)) 96 | 97 | # ========================================================= 98 | if language == "bert": 99 | ## 최대 길이 넘는지 확인 100 | if int(transformers.__version__[0]) <= 3: 101 | assert len(all_doc_tokens["sentence"]) + 2 <= tokenizer.max_len 102 | else: 103 | assert len(all_doc_tokens["sentence"]) + 2 <= tokenizer.model_max_length 104 | 105 | input_ids = [tokenizer.cls_token_id] 106 | if language == "KorSciBERT": 107 | input_ids += sum([tokenizer.convert_tokens_to_ids([token]) for token in all_doc_tokens["sentence"]], []) 108 | word_idxs = [0] + list(filter(lambda x: input_ids[x] == tokenizer.convert_tokens_to_ids([""])[0], range(len(input_ids)))) 109 | else: 110 | input_ids += [tokenizer.convert_tokens_to_ids(token) for token in all_doc_tokens["sentence"]] 111 | word_idxs = [0] + list(filter(lambda x: input_ids[x] == tokenizer.convert_tokens_to_ids(""), range(len(input_ids)))) 112 | 113 | input_ids += [tokenizer.sep_token_id] 114 | 115 | token_type_ids = [0] * len(input_ids) 116 | 117 | position_ids = list(range(0, len(input_ids))) 118 | 119 | # non_padded_ids: padding을 제외한 토큰의 index 번호 120 | non_padded_ids = [i for i in input_ids] 121 | 122 | # tokens: padding을 제외한 토큰 123 | non_padded_tokens = tokenizer.convert_ids_to_tokens(non_padded_ids) 124 | 125 | attention_mask = [1]*len(input_ids) 126 | 127 | paddings = [tokenizer.pad_token_id]*(max_seq_length - len(input_ids)) 128 | 129 | if tokenizer.padding_side == "right": 130 | input_ids += paddings 131 | attention_mask += [0]*len(paddings) 132 | token_type_ids += paddings 133 | position_ids += paddings 134 | else: 135 | input_ids = paddings + input_ids 136 | attention_mask = [0]*len(paddings) + attention_mask 137 | token_type_ids = paddings + token_type_ids 138 | position_ids = paddings + position_ids 139 | 140 | word_idxs = [x+len(paddings) for x in word_idxs] 141 | 142 | # """ 143 | # mean pooling 144 | not_word_list = [] 145 | for k, p_idx in enumerate(word_idxs[1:]): 146 | not_word_idxs = [0] * len(input_ids); 147 | for j in range(word_idxs[k] + 1, p_idx): 148 | not_word_idxs[j] = 1 / (p_idx - word_idxs[k] - 1) 149 | not_word_list.append(not_word_idxs) 150 | not_word_list = not_word_list + [[0] * len(input_ids)] * ( 151 | max_sentence_length - len(not_word_list)) 152 | 153 | 154 | """ 155 | # (a,b, |a-b|, a*b) 156 | not_word_list = [[], []] 157 | for k, p_idx in enumerate(word_idxs[1:]): 158 | not_word_list[0].append(word_idxs[k] + 1) 159 | not_word_list[1].append(p_idx - 1) 160 | not_word_list[0] = not_word_list[0] + [int(word_idxs[-1]+i+2) for i in range(0, (max_sentence_length - len(not_word_list)))] 161 | not_word_list[1] = not_word_list[1] + [int(word_idxs[-1] + i + 2) for i in range(0, (max_sentence_length - len(pnot_word_list)))] 162 | """ 163 | 164 | # p_mask: mask with 0 for token which belong premise and hypothesis including CLS TOKEN 165 | # and with 1 otherwise. 
166 | # Original TF implem also keep the classification token (set to 0) 167 | p_mask = np.ones_like(token_type_ids) 168 | if tokenizer.padding_side == "right": 169 | # [CLS] P [SEP] H [SEP] PADDING 170 | p_mask[:len(all_doc_tokens["sentence"]) + 1] = 0 171 | else: 172 | p_mask[-(len(all_doc_tokens["sentence"]) + 1): ] = 0 173 | 174 | # pad_token_indices: input_ids에서 padding된 위치 175 | pad_token_indices = np.array(range(len(non_padded_ids), len(input_ids))) 176 | # special_token_indices: special token의 위치 177 | special_token_indices = np.asarray( 178 | tokenizer.get_special_tokens_mask(input_ids, already_has_special_tokens=True) 179 | ).nonzero() 180 | 181 | p_mask[pad_token_indices] = 1 182 | p_mask[special_token_indices] = 1 183 | 184 | # Set the cls index to 0: the CLS index can be used for impossible answers 185 | # Identify the position of the CLS token 186 | cls_index = input_ids.index(tokenizer.cls_token_id) 187 | 188 | p_mask[cls_index] = 0 189 | 190 | # dependency = [[tail, head, dependency], [], ...] 191 | if example.dependency["sentence"] == [[]]: 192 | example.dependency["sentence"] = [[max_sentence_length-1,max_sentence_length-1,0] for _ in range(0,max_sentence_length)] 193 | else: 194 | example.dependency["sentence"] = example.dependency["sentence"] + [[max_sentence_length-1,max_sentence_length-1,0] for i in range(0, abs(max_sentence_length-len(example.dependency["sentence"])))] 195 | 196 | dependency = example.dependency["sentence"] 197 | 198 | return CLASSIFIERFeatures( 199 | input_ids, 200 | attention_mask, 201 | token_type_ids, 202 | position_ids, 203 | cls_index, 204 | p_mask.tolist(), 205 | example_index=0, 206 | tokens=non_padded_tokens, 207 | token_to_orig_map=token_to_orig_map, 208 | label = label, 209 | coarse_label = coarse_label, 210 | doc_id = example.doc_id, 211 | language = language, 212 | dependency = dependency, 213 | not_word_list = not_word_list, 214 | ) 215 | 216 | 217 | 218 | def convert_example_to_features_init(tokenizer_for_convert): 219 | global tokenizer 220 | tokenizer = tokenizer_for_convert 221 | 222 | 223 | def convert_examples_to_features( 224 | examples, 225 | tokenizer, 226 | max_seq_length, 227 | is_training, 228 | return_dataset=False, 229 | threads=1, 230 | max_sentence_length = 0, 231 | tqdm_enabled=True, 232 | language = None, 233 | ): 234 | """ 235 | Converts a list of examples into a list of features that can be directly given as input to a model. 236 | It is model-dependant and takes advantage of many of the tokenizer's features to create the model's inputs. 237 | 238 | Args: 239 | examples: list of :class:`~transformers.data.processors.squad.SquadExample` 240 | tokenizer: an instance of a child of :class:`~transformers.PreTrainedTokenizer` 241 | max_seq_length: The maximum sequence length of the inputs. 242 | doc_stride: The stride used when the context is too large and is split across several features. 243 | max_query_length: The maximum length of the query. 244 | is_training: whether to create features for model evaluation or model training. 245 | return_dataset: Default False. Either 'pt' or 'tf'. 
246 | if 'pt': returns a torch.data.TensorDataset, 247 | if 'tf': returns a tf.data.Dataset 248 | threads: multiple processing threadsa-smi 249 | 250 | 251 | Returns: 252 | list of :class:`~transformers.data.processors.squad.SquadFeatures` 253 | 254 | Example:: 255 | 256 | processor = SquadV2Processor() 257 | examples = processor.get_dev_examples(data_dir) 258 | 259 | features = squad_convert_examples_to_features( 260 | examples=examples, 261 | tokenizer=tokenizer, 262 | max_seq_length=args.max_seq_length, 263 | doc_stride=args.doc_stride, 264 | max_query_length=args.max_query_length, 265 | is_training=not evaluate, 266 | ) 267 | """ 268 | 269 | # Defining helper methods 270 | features = [] 271 | threads = min(threads, cpu_count()) 272 | with Pool(threads, initializer=convert_example_to_features_init, initargs=(tokenizer,)) as p: 273 | 274 | # annotate_ = 하나의 example에 대한 여러 feature를 리스트로 모은 것 275 | # annotate_ = list(feature1, feature2, ...) 276 | annotate_ = partial( 277 | convert_example_to_features, 278 | max_seq_length=max_seq_length, 279 | max_sentence_length=max_sentence_length, 280 | is_training=is_training, 281 | language = language, 282 | ) 283 | 284 | # examples에 대한 annotate_ 285 | # features = list( feature1, feature2, feature3, ... ) 286 | ## len(features) == len(examples) 287 | features = list( 288 | tqdm( 289 | p.imap(annotate_, examples, chunksize=32), 290 | total=len(examples), 291 | desc="convert bert examples to features", 292 | disable=not tqdm_enabled, 293 | ) 294 | ) 295 | new_features = [] 296 | example_index = 0 # example의 id ## len(features) == len(examples) 297 | for example_feature in tqdm( 298 | features, total=len(features), desc="add example index", disable=not tqdm_enabled 299 | ): 300 | if not example_feature: 301 | continue 302 | 303 | example_feature.example_index = example_index 304 | new_features.append(example_feature) 305 | example_index += 1 306 | 307 | features = new_features 308 | del new_features 309 | 310 | if return_dataset == "pt": 311 | if not is_torch_available(): 312 | raise RuntimeError("PyTorch must be installed to return a PyTorch dataset.") 313 | 314 | # Convert to Tensors and build dataset 315 | all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long) 316 | all_attention_masks = torch.tensor([f.attention_mask for f in features], dtype=torch.long) 317 | 318 | ## RoBERTa doesn’t have token_type_ids, you don’t need to indicate which token belongs to which segment. 319 | all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long) 320 | all_position_ids = torch.tensor([f.position_ids for f in features], dtype=torch.long) 321 | 322 | all_cls_index = torch.tensor([f.cls_index for f in features], dtype=torch.long) 323 | all_p_mask = torch.tensor([f.p_mask for f in features], dtype=torch.float) 324 | 325 | all_example_indices = torch.tensor([f.example_index for f in features], dtype=torch.long) 326 | all_feature_index = torch.arange(all_input_ids.size(0), dtype=torch.long) # 전체 feature의 개별 index 327 | 328 | # all_dependency = [[[premise_tail, premise_head, dependency], [], ...],[[hypothesis_tail, hypothesis_head, dependency], [], ...]], [[],[]], ... 
] 329 | all_dependency = torch.tensor([f.dependency for f in features], dtype=torch.long) 330 | 331 | all_not_word_list = torch.tensor([f.not_word_list for f in features], dtype=torch.float) 332 | 333 | if not is_training: 334 | dataset = TensorDataset( 335 | all_input_ids, 336 | all_attention_masks, all_token_type_ids, all_position_ids, 337 | all_cls_index, all_p_mask, all_feature_index, 338 | all_dependency, 339 | all_not_word_list 340 | ) 341 | else: 342 | all_labels = torch.tensor([f.label for f in features], dtype=torch.long) 343 | all_coarse_labels = torch.tensor([f.coarse_label for f in features], dtype=torch.long) 344 | # label_dict = {"entailment": 0, "contradiction": 1, "neutral": 2} 345 | # all_labels = torch.tensor([label_dict[f.label] for f in features], dtype=torch.long) 346 | 347 | dataset = TensorDataset( 348 | all_input_ids, 349 | all_attention_masks, 350 | all_token_type_ids, 351 | all_position_ids, 352 | all_labels, 353 | all_coarse_labels, 354 | all_cls_index, 355 | all_p_mask, 356 | all_example_indices, 357 | all_feature_index, 358 | all_dependency, 359 | all_not_word_list 360 | ) 361 | 362 | return features, dataset 363 | else: 364 | return features 365 | 366 | class CLASSIFIERProcessor(DataProcessor): 367 | train_file = None 368 | dev_file = None 369 | 370 | def _get_example_from_tensor_dict(self, tensor_dict, evaluate=False): 371 | if not evaluate: 372 | gold_label = None 373 | gold_coarse_label = None 374 | label = tensor_dict["tag"].numpy().decode("utf-8") 375 | coarse_label = tensor_dict["coarse_tag"].numpy().decode("utf-8") 376 | else: 377 | gold_label = tensor_dict["tag"].numpy().decode("utf-8") 378 | gold_coarse_label = tensor_dict["coarse_tag"].numpy().decode("utf-8") 379 | label = None 380 | coarse_label = None 381 | 382 | return CLASSIFIERExample( 383 | doc_id=tensor_dict["doc_id"].numpy().decode("utf-8"), 384 | # sentid=tensor_dict["sentid"].numpy().decode("utf-8"), 385 | sentence=tensor_dict["sentence"].numpy().decode("utf-8"), 386 | preprocess=tensor_dict["preprocess"].numpy().decode("utf-8"), 387 | parsing=tensor_dict["merge"]["parsing"].numpy().decode("utf-8"), 388 | keysentnece=tensor_dict["keysentence"].numpy().decode("utf-8"), 389 | label=label, 390 | coarse_label=coarse_label, 391 | gold_label=gold_label, 392 | gold_coarse_label = gold_coarse_label 393 | ) 394 | 395 | def get_examples_from_dataset(self, dataset, evaluate=False): 396 | """ 397 | Creates a list of :class:`~transformers.data.processors.squad.CLASSIFIERExample` using a TFDS dataset. 398 | 399 | Args: 400 | dataset: The tfds dataset loaded from `tensorflow_datasets.load("squad")` 401 | evaluate: boolean specifying if in evaluation mode or in training mode 402 | 403 | Returns: 404 | List of CLASSIFIERExample 405 | 406 | Examples:: 407 | 408 | import tensorflow_datasets as tfds 409 | dataset = tfds.load("squad") 410 | 411 | training_examples = get_examples_from_dataset(dataset, evaluate=False) 412 | evaluation_examples = get_examples_from_dataset(dataset, evaluate=True) 413 | """ 414 | 415 | if evaluate: 416 | dataset = dataset["validation"] 417 | else: 418 | dataset = dataset["train"] 419 | 420 | examples = [] 421 | for tensor_dict in tqdm(dataset): 422 | examples.append(self._get_example_from_tensor_dict(tensor_dict, evaluate=evaluate)) 423 | 424 | return examples 425 | 426 | def get_train_examples(self, data_dir, filename=None, depend_embedding = None): 427 | """ 428 | Returns the training examples from the data directory. 
429 | 430 | Args: 431 | data_dir: Directory containing the data files used for training and evaluating. 432 | filename: None by default. 433 | 434 | """ 435 | if data_dir is None: 436 | data_dir = "" 437 | 438 | #if self.train_file is None: 439 | # raise ValueError("CLASSIFIERProcessor should be instantiated via CLASSIFIERV1Processor.") 440 | 441 | with open( 442 | os.path.join(data_dir, self.train_file if filename is None else filename), "r", encoding="utf-8" 443 | ) as reader: 444 | input_data = json.load(reader) 445 | return self._create_examples(input_data, 'train', self.train_file if filename is None else filename) 446 | 447 | def get_dev_examples(self, data_dir, filename=None, depend_embedding = None): 448 | """ 449 | Returns the evaluation example from the data directory. 450 | 451 | Args: 452 | data_dir: Directory containing the data files used for training and evaluating. 453 | filename: None by default. 454 | """ 455 | if data_dir is None: 456 | data_dir = "" 457 | 458 | #if self.dev_file is None: 459 | # raise ValueError("CLASSIFIERProcessor should be instantiated via CLASSIFIERV1Processor.") 460 | 461 | with open( 462 | os.path.join(data_dir, self.dev_file if filename is None else filename), "r", encoding="utf-8" 463 | ) as reader: 464 | input_data = json.load(reader) 465 | return self._create_examples(input_data, "dev", self.dev_file if filename is None else filename) 466 | 467 | def get_example_from_input(self, input_dictionary): 468 | 469 | doc_id = input_dictionary["doc_id"] 470 | keysentnece = input_dictionary["keysentnece"] 471 | # sentid = input_dictionary["sentid"] 472 | sentence = input_dictionary["sentence"] 473 | 474 | label = None 475 | coarse_label = None 476 | gold_label = None 477 | gold_coarse_label = None 478 | 479 | examples = [CLASSIFIERExample( 480 | doc_id=doc_id, 481 | # sentid=sentid, 482 | keysentence = keysentnece, 483 | sentence=sentence, 484 | gold_label=gold_label, 485 | gold_coarse_label=gold_coarse_label, 486 | label=label, 487 | coarse_label=coarse_label, 488 | )] 489 | return examples 490 | 491 | def _create_examples(self, input_data, set_type, data_file): 492 | is_training = set_type == "train" 493 | num = 0 494 | examples = [] 495 | for entry in tqdm(input_data): 496 | 497 | doc_id = entry["doc_id"] 498 | # sentid = entry["sentid"] 499 | sentence = entry["sentence"] 500 | preprocess = entry["preprocess"] 501 | merge = entry["merge"]["origin"] 502 | parsing = entry["merge"]["parsing"] 503 | keysentence= entry["keysentence"] 504 | 505 | label = None 506 | coarse_label = None 507 | gold_label = None 508 | gold_coarse_label = None 509 | if is_training: 510 | label = entry["tag"] 511 | coarse_label = entry["coarse_tag"] 512 | else: 513 | gold_label = entry["tag"] 514 | gold_coarse_label = entry["coarse_tag"] 515 | 516 | 517 | 518 | example = CLASSIFIERExample( 519 | doc_id=doc_id, 520 | # sentid=sentid, 521 | keysentence=keysentence, 522 | sentence=sentence, 523 | preprocess=preprocess, 524 | parsing=parsing, 525 | merge=merge, 526 | gold_label=gold_label, 527 | gold_coarse_label=gold_coarse_label, 528 | label=label, 529 | coarse_label=coarse_label, 530 | ) 531 | examples.append(example) 532 | # len(examples) == len(input_data) 533 | return examples 534 | 535 | 536 | class CLASSIFIERV1Processor(CLASSIFIERProcessor): 537 | train_file = "train.json" 538 | dev_file = "dev.json" 539 | 540 | 541 | class CLASSIFIERExample(object): 542 | def __init__( 543 | self, 544 | doc_id, 545 | # sentid, 546 | sentence, 547 | preprocess, 548 | parsing, 549 | merge, 
550 | keysentence, 551 | gold_label=None, 552 | gold_coarse_label=None, 553 | label=None, 554 | coarse_label=None, 555 | ): 556 | self.doc_id = doc_id 557 | # self.sentid = sentid 558 | self.keysentence = keysentence 559 | self.sentence = sentence 560 | self.preprocess = preprocess 561 | self.parsing = parsing 562 | self.merge = merge 563 | 564 | label_dict = {'문제 정의': 0, '가설 설정': 1, '기술 정의': 2, '제안 방법': 3, '대상 데이터': 4, '데이터처리': 5, '이론/모형': 6, '성능/효과': 7, '후속연구': 8, '기타': 9} 565 | coarse_label_dict = {'연구 목적': 0, '연구 방법': 1, '연구 결과': 2, '기타': 3} 566 | if gold_label in label_dict.keys(): 567 | gold_label = label_dict[gold_label] 568 | if gold_coarse_label in coarse_label_dict.keys(): 569 | gold_coarse_label = coarse_label_dict[gold_coarse_label] 570 | self.gold_label = gold_label 571 | self.gold_coarse_label = gold_coarse_label 572 | 573 | if coarse_label in coarse_label_dict.keys(): 574 | coarse_label = coarse_label_dict[coarse_label] 575 | if label in label_dict.keys(): 576 | label = label_dict[label] 577 | self.label = label 578 | self.coarse_label = coarse_label 579 | 580 | # doct_tokens : 띄어쓰기 기준으로 나누어진 어절(word)로 만들어진 리스트 581 | ## sentence1 sentence2 582 | self.doc_tokens = {"sentence":self.preprocess.strip().split()} 583 | 584 | # merge: 말뭉치의 시작위치를 어절 기준으로 만든 리스트 585 | merge_word = []; check_merge_word = [] 586 | merge_index = [] 587 | for merge in self.merge: 588 | if merge != []: merge_index.append(merge[1]) 589 | 590 | # 구문구조 종류 591 | depend2idx = {"None":0}; idx2depend ={0:"None"} 592 | for depend1 in ['DP', 'L', 'NP', 'IP', 'PAD', 'VP', 'VNP', 'X', 'AP']: 593 | for depend2 in ['MOD', 'OBJ', 'CNJ', 'CMP', 'SBJ', 'None', 'AJT']: 594 | depend2idx[depend1 + "-" + depend2] = len(depend2idx) 595 | idx2depend[len(idx2depend)] = depend1 + "-" + depend2 596 | 597 | if ([words for words in self.parsing if words[2][0] != words[2][1]] == []): merge_word.append([]) 598 | else: 599 | for words in self.parsing: 600 | if words[2][0] != words[2][1]: 601 | w1 = merge_index.index(words[1][0]) 602 | w2 = merge_index.index(words[1][1]) 603 | dep = depend2idx["-".join(words[2])] 604 | if [w1,w2] not in check_merge_word: 605 | check_merge_word.append([w1, w2]) 606 | merge_word.append([w1,w2,dep]) 607 | else: 608 | check_index = check_merge_word.index([w1,w2]) 609 | now_dep = idx2depend[merge_word[check_index][2]].split("-")[1] 610 | if (words[2][1] in ['SBJ', 'CNJ', 'OBJ']) and(now_dep in ['CMP', 'MOD', 'AJT', 'None', "UNDEF"]): 611 | merge_word[check_index][2] = dep 612 | 613 | del check_merge_word 614 | self.merge = {"sentence":merge_index} 615 | self.dependency = {"sentence":merge_word} 616 | 617 | class CLASSIFIERFeatures(object): 618 | def __init__( 619 | self, 620 | input_ids, 621 | attention_mask, 622 | token_type_ids, 623 | position_ids, 624 | cls_index, 625 | p_mask, 626 | example_index, 627 | token_to_orig_map, 628 | doc_id, 629 | tokens, 630 | label, 631 | coarse_label, 632 | language, 633 | dependency, 634 | not_word_list, 635 | ): 636 | self.input_ids = input_ids 637 | self.attention_mask = attention_mask 638 | self.token_type_ids = token_type_ids 639 | self.position_ids = position_ids 640 | self.cls_index = cls_index 641 | self.p_mask = p_mask 642 | 643 | self.example_index = example_index 644 | self.token_to_orig_map = token_to_orig_map 645 | self.doc_id = doc_id 646 | self.tokens = tokens 647 | 648 | self.label = label 649 | self.coarse_label = coarse_label 650 | 651 | self.dependency = dependency 652 | 653 | self.not_word_list = not_word_list 654 | 655 | 656 | class KLUEResult(object): 
657 | def __init__(self, example_index, label_logits, gold_label=None, cls_logits=None): 658 | self.label_logits = label_logits 659 | self.example_index = example_index 660 | 661 | if gold_label: 662 | self.gold_label = gold_label 663 | self.cls_logits = cls_logits 664 | -------------------------------------------------------------------------------- /src/functions/utils.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import random 3 | import torch 4 | import numpy as np 5 | import os 6 | 7 | from src.functions.processor import ( 8 | CLASSIFIERProcessor, 9 | convert_examples_to_features 10 | ) 11 | 12 | def init_logger(): 13 | logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s', 14 | datefmt='%m/%d/%Y %H:%M:%S', 15 | level=logging.INFO) 16 | 17 | def set_seed(args): 18 | random.seed(args.seed) 19 | np.random.seed(args.seed) 20 | torch.manual_seed(args.seed) 21 | if not args.no_cuda and torch.cuda.is_available(): 22 | torch.cuda.manual_seed_all(args.seed) 23 | 24 | # tensor를 list 형으로 변환하기위한 함수 25 | def to_list(tensor): 26 | return tensor.detach().cpu().tolist() 27 | 28 | 29 | # dataset을 load 하는 함수 30 | def load_examples(args, tokenizer, evaluate=False, output_examples=False, do_predict=False, input_dict=None): 31 | ''' 32 | 33 | :param args: 하이퍼 파라미터 34 | :param tokenizer: tokenization에 사용되는 tokenizer 35 | :param evaluate: 평가나 open test시, True 36 | :param output_examples: 평가나 open test 시, True / True 일 경우, examples와 features를 같이 return 37 | :param do_predict: open test시, True 38 | :param input_dict: open test시 입력되는 문서와 질문으로 이루어진 dictionary 39 | :return: 40 | examples : max_length 상관 없이, 원문으로 각 데이터를 저장한 리스트 41 | features : max_length에 따라 분할 및 tokenize된 원문 리스트 42 | dataset : max_length에 따라 분할 및 학습에 직접적으로 사용되는 tensor 형태로 변환된 입력 ids 43 | ''' 44 | input_dir = args.data_dir 45 | print("Creating features from dataset file at {}".format(input_dir)) 46 | 47 | # processor 선언 48 | ## json으로 된 train과 dev data_file명 49 | processor = CLASSIFIERProcessor() 50 | 51 | # open test 시 52 | if do_predict: 53 | ## input_dict: guid, premise, hypothesis로 이루어진 dictionary 54 | # examples = processor.get_example_from_input(input_dict) 55 | examples = processor.get_dev_examples(os.path.join(args.data_dir), 56 | filename=args.predict_file) 57 | # 평가 시 58 | elif evaluate: 59 | examples = processor.get_dev_examples(os.path.join(args.data_dir), 60 | filename=args.eval_file) 61 | # 학습 시 62 | else: 63 | examples = processor.get_train_examples(os.path.join(args.data_dir), 64 | filename=args.train_file) 65 | 66 | # features = (prem_features, hypo_features) 67 | features, dataset = convert_examples_to_features( 68 | examples=examples, 69 | tokenizer=tokenizer, 70 | max_seq_length=args.max_seq_length, 71 | is_training=not evaluate, 72 | return_dataset="pt", 73 | threads=args.threads, 74 | max_sentence_length = args.max_sentence_length, 75 | language = args.model_name_or_path.split("/")[-2] 76 | ) 77 | if output_examples: 78 | ## example == feature == dataset 79 | return dataset, examples, features 80 | return dataset 81 | -------------------------------------------------------------------------------- /src/model/main_functions_multi.py: -------------------------------------------------------------------------------- 1 | import os 2 | import numpy as np 3 | import pandas as pd 4 | import torch 5 | import timeit 6 | from fastprogress.fastprogress import master_bar, progress_bar 7 | from torch.utils.data import DataLoader, RandomSampler, 
SequentialSampler 8 | from transformers.file_utils import is_torch_available 9 | 10 | from transformers import ( 11 | AdamW, 12 | get_linear_schedule_with_warmup 13 | ) 14 | 15 | from src.functions.utils import load_examples, set_seed, to_list 16 | from src.functions.metric import get_score, get_ai_score, get_sklearn_score 17 | 18 | from sklearn.metrics import confusion_matrix 19 | from functools import partial 20 | 21 | def train(args, model, tokenizer, logger): 22 | max_f1 =0.89 23 | max_acc = 0.89 24 | # 학습에 사용하기 위한 dataset Load 25 | ## dataset: tensor형태의 데이터셋 26 | ## all_input_ids, 27 | # all_attention_masks, 28 | # all_labels, 29 | # all_cls_index, 30 | # all_p_mask, 31 | # all_example_indices, 32 | # all_feature_index 33 | 34 | train_dataset = load_examples(args, tokenizer, evaluate=False, output_examples=False) 35 | 36 | # tokenizing 된 데이터를 batch size만큼 가져오기 위한 random sampler 및 DataLoader 37 | ## RandomSampler: 데이터 index를 무작위로 선택하여 조정 38 | ## SequentialSampler: 데이터 index를 항상 같은 순서로 조정 39 | train_sampler = RandomSampler(train_dataset) 40 | 41 | train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size) 42 | 43 | # t_total: total optimization step 44 | # optimization 최적화 schedule 을 위한 전체 training step 계산 45 | t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs 46 | 47 | # Layer에 따른 가중치 decay 적용 48 | no_decay = ["bias", "LayerNorm.weight"] 49 | optimizer_grouped_parameters = [ 50 | { 51 | "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 52 | "weight_decay": args.weight_decay, 53 | }, 54 | { 55 | "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 56 | "weight_decay": 0.0}, 57 | ] 58 | 59 | # optimizer 및 scheduler 선언 60 | optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon) 61 | scheduler = get_linear_schedule_with_warmup( 62 | optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total 63 | ) 64 | 65 | # Training Step 66 | logger.info("***** Running training *****") 67 | logger.info(" Num examples = %d", len(train_dataset)) 68 | logger.info(" Num Epochs = %d", args.num_train_epochs) 69 | logger.info(" Train batch size per GPU = %d", args.train_batch_size) 70 | logger.info(" Total train batch size (w. 
parallel, distributed & accumulation) = %d", args.train_batch_size * args.gradient_accumulation_steps) 71 | logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) 72 | logger.info(" Total optimization steps = %d", t_total) 73 | 74 | global_step = 1 75 | if not args.from_init_weight: global_step += int(args.checkpoint) 76 | 77 | tr_loss, logging_loss = 0.0, 0.0 78 | 79 | # loss buffer 초기화 80 | model.zero_grad() 81 | 82 | mb = master_bar(range(int(args.num_train_epochs))) 83 | set_seed(args) 84 | 85 | epoch_idx=0 86 | if not args.from_init_weight: epoch_idx += int(args.checkpoint) 87 | 88 | for epoch in mb: 89 | epoch_iterator = progress_bar(train_dataloader, parent=mb) 90 | for step, batch in enumerate(epoch_iterator): 91 | # train 모드로 설정 92 | model.train() 93 | batch = tuple(t.to(args.device) for t in batch) 94 | 95 | # 모델에 입력할 입력 tensor 저장 96 | inputs_list = ["input_ids", "attention_mask","token_type_ids","position_ids"] 97 | inputs_list.append("labels") 98 | inputs_list.append("coarse_labels") 99 | inputs = dict() 100 | for n, input in enumerate(inputs_list): inputs[input] = batch[n] 101 | inputs_list2 = ['word_idxs', 'span'] 102 | for m, input in enumerate(inputs_list2): 103 | inputs[input] = batch[-(m+1)] 104 | 105 | # Loss 계산 및 저장 106 | ## outputs = (total_loss,) + outputs 107 | outputs = model(**inputs) 108 | loss = outputs[0] 109 | 110 | # 높은 batch size는 학습이 진행하는 중에 발생하는 noisy gradient가 경감되어 불안정한 학습을 안정적이게 되도록 해줌 111 | # 높은 batch size 효과를 주기위한 "gradient_accumulation_step" 112 | ## batch size *= gradient_accumulation_step 113 | # batch size: 16 114 | # gradient_accumulation_step: 2 라고 가정 115 | # 실제 batch size 32의 효과와 동일하진 않지만 비슷한 효과를 보임 116 | if args.gradient_accumulation_steps > 1: 117 | loss = loss / args.gradient_accumulation_steps 118 | 119 | 120 | ## batch_size의 개수만큼의 데이터를 입력으로 받아 만들어진 모델의 loss는 121 | ## 입력 데이터들에 대한 특징을 보유하고 있다(loss를 어떻게 만드느냐에 따라 달라) 122 | ### loss_fct = CrossEntropyLoss(ignore_index=ignored_index, reduction = ?) 123 | ### reduction = mean : 입력 데이터에 대한 평균 124 | loss.backward() 125 | tr_loss += loss.item() 126 | 127 | 128 | # Loss 출력 129 | if (global_step + 1) % 50 == 0: 130 | print("{} step processed.. 
Current Loss : {}".format((global_step+1),loss.item())) 131 | 132 | if (step + 1) % args.gradient_accumulation_steps == 0: 133 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm) 134 | 135 | optimizer.step() 136 | scheduler.step() # Update learning rate schedule 137 | model.zero_grad() 138 | global_step += 1 139 | 140 | epoch_idx += 1 141 | logger.info("***** Eval results *****") 142 | results = evaluate(args, model, tokenizer, logger, epoch_idx = str(epoch_idx), tr_loss = loss.item()) 143 | 144 | # model save 145 | if ((max_acc < float(results["accuracy"])) or ( 146 | max_f1 < float(results["macro_f1_score" if "macro_f1_score" in results.keys() else "macro_f1"]))): 147 | if max_acc < float(results["accuracy"]): max_acc = float(results["accuracy"]) 148 | if max_f1 < float( 149 | results["macro_f1_score" if "macro_f1_score" in results.keys() else "macro_f1"]): max_f1 = float( 150 | results["macro_f1_score" if "macro_f1_score" in results.keys() else "macro_f1"]) 151 | 152 | # 모델 저장 디렉토리 생성 153 | output_dir = os.path.join(args.output_dir, "model/checkpoint-{}".format(epoch_idx)) 154 | if not os.path.exists(output_dir): 155 | os.makedirs(output_dir) 156 | 157 | # 학습된 가중치 및 vocab 저장 158 | ## pretrained 모델같은 경우 model.save_pretrained(...)로 저장 159 | ## nn.Module로 만들어진 모델일 경우 model.save(...)로 저장 160 | ### 두개가 모두 사용되는 모델일 경우 이 두가지 방법으로 저장을 해야한다!!!! 161 | model.save_pretrained(output_dir) 162 | if (args.model_name_or_path.split("/")[-2] != "KorSciBERT"): tokenizer.save_pretrained(output_dir) 163 | torch.save(args, os.path.join(output_dir, "training_args.bin")) 164 | logger.info("Saving model checkpoint to %s", output_dir) 165 | 166 | mb.write("Epoch {} done".format(epoch + 1)) 167 | 168 | return global_step, tr_loss / global_step 169 | 170 | # 정답이 사전부착된 데이터로부터 평가하기 위한 함수 171 | def evaluate(args, model, tokenizer, logger, epoch_idx = "", tr_loss = 1): 172 | # 데이터셋 Load 173 | ## dataset: tensor형태의 데이터셋 174 | ## example: json형태의 origin 데이터셋 175 | ## features: index번호가 추가된 list형태의 examples 데이터셋 176 | dataset, examples, features = load_examples(args, tokenizer, evaluate=True, output_examples=True) 177 | 178 | # 최종 출력 파일 저장을 위한 디렉토리 생성 179 | if not os.path.exists(args.output_dir): 180 | os.makedirs(args.output_dir) 181 | 182 | # tokenizing 된 데이터를 batch size만큼 가져오기 위한 random sampler 및 DataLoader 183 | ## RandomSampler: 데이터 index를 무작위로 선택하여 조정 184 | ## SequentialSampler: 데이터 index를 항상 같은 순서로 조정 185 | eval_sampler = SequentialSampler(dataset) 186 | eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) 187 | 188 | # Eval! 
189 | logger.info("***** Running evaluation {} *****".format(epoch_idx)) 190 | logger.info(" Num examples = %d", len(dataset)) 191 | logger.info(" Batch size = %d", args.eval_batch_size) 192 | 193 | # 평가 시간 측정을 위한 time 변수 194 | start_time = timeit.default_timer() 195 | 196 | # 예측 라벨 197 | pred_logits = torch.tensor([], dtype = torch.long).to(args.device) 198 | pred_coarse_logits = torch.tensor([], dtype = torch.long).to(args.device) 199 | for batch in progress_bar(eval_dataloader): 200 | # 모델을 평가 모드로 변경 201 | model.eval() 202 | batch = tuple(t.to(args.device) for t in batch) 203 | 204 | with torch.no_grad(): 205 | # 평가에 필요한 입력 데이터 저장 206 | inputs_list = ["input_ids", "attention_mask", "token_type_ids", "position_ids"] 207 | inputs = dict() 208 | for n, input in enumerate(inputs_list): inputs[input] = batch[n] 209 | 210 | inputs_list2 = ['word_idxs', 'span'] 211 | for m, input in enumerate(inputs_list2): inputs[input] = batch[-(m + 1)] 212 | 213 | # outputs = (label_logits, ) 214 | # label_logits: [batch_size, num_labels] 215 | outputs = model(**inputs) 216 | 217 | pred_logits = torch.cat([pred_logits,outputs[0][1]], dim = 0) 218 | pred_coarse_logits = torch.cat([pred_coarse_logits, outputs[0][0]], dim=0) 219 | 220 | # pred_label과 gold_label 비교 221 | pred_logits= pred_logits.detach().cpu().numpy() 222 | pred_coarse_logits= pred_coarse_logits.detach().cpu().numpy() 223 | pred_labels = np.argmax(pred_logits, axis=-1) 224 | pred_coarse_labels = np.argmax(pred_coarse_logits, axis=-1) 225 | ## gold_labels 226 | gold_labels = [example.gold_label for example in examples] 227 | gold_coarse_labels = [example.gold_coarse_label for example in examples] 228 | 229 | # print('\n\n=====================outputs=====================') 230 | # for g,p in zip(gold_labels, pred_labels): 231 | # print(str(g)+"\t"+str(p)) 232 | # print('===========================================================') 233 | 234 | # 평가 시간 측정을 위한 time 변수 235 | evalTime = timeit.default_timer() - start_time 236 | logger.info(" Evaluation done in total %f secs (%f sec per example)", evalTime, evalTime / len(dataset)) 237 | 238 | # 최종 예측값과 원문이 저장된 example로 부터 성능 평가 239 | ## results = {"macro_precision":round(macro_precision, 4), "macro_recall":round(macro_recall, 4), "macro_f1_score":round(macro_f1_score, 4), \ 240 | ## "accuracy":round(total_accuracy, 4), \ 241 | ## "micro_precision":round(micro_precision, 4), "micro_recall":round(micro_recall, 4), "micro_f1":round(micro_f1_score, 4)} 242 | idx2label = {0: '문제 정의', 1: '가설 설정', 2: '기술 정의', 3: '제안 방법', 4: '대상 데이터', 5: '데이터처리', 6: '이론/모형', 7: '성능/효과', 8: '후속연구'}# , 9: '기타'} 243 | idx2coarse_label = {0: '연구 목적', 1: '연구 방법', 2: '연구 결과'} # , 3: '기타'} 244 | 245 | # results = get_score(pred_labels, gold_labels, idx2label) 246 | results = get_sklearn_score(pred_labels, gold_labels, idx2label) 247 | coarse_results = get_sklearn_score(pred_coarse_labels, gold_coarse_labels, idx2coarse_label) 248 | 249 | output_dir = os.path.join( args.output_dir, 'eval') 250 | 251 | out_file_type = 'a' 252 | if not os.path.exists(output_dir): 253 | os.makedirs(output_dir) 254 | out_file_type ='w' 255 | 256 | # 평가 스크립트 기반 성능 저장을 위한 파일 생성 257 | if os.path.exists(args.model_name_or_path): 258 | print(args.model_name_or_path) 259 | eval_file_name = list(filter(None, args.model_name_or_path.split("/"))).pop() 260 | else: 261 | eval_file_name = "init_weight" 262 | output_eval_file = os.path.join(output_dir, "eval_result_{}.txt".format(eval_file_name)) 263 | 264 | with open(output_eval_file, out_file_type, 
encoding='utf-8') as f: 265 | f.write("train loss: {}\n".format(tr_loss)) 266 | f.write("epoch: {}\n".format(epoch_idx)) 267 | f.write("세부분류 성능\n") 268 | for k in results.keys(): 269 | f.write("{} : {}\n".format(k, results[k])) 270 | f.write("\n대분류 성능\n") 271 | for k in coarse_results.keys(): 272 | f.write("{} : {}\n".format(k, coarse_results[k])) 273 | 274 | confusion_m = confusion_matrix(pred_labels, gold_labels) 275 | confusion_list = [[], [0 for i in range(0, len(confusion_m))], []] 276 | for i in range(0, len(confusion_m)): 277 | for j in range(0, len(confusion_m[i])): 278 | if (i == j): confusion_list[0].append(confusion_m[i][j]) 279 | all_cnt = sum([sum(i) for i in confusion_m]) 280 | f.write("micro_accuracy: " + str(round((sum(confusion_list[0]) / all_cnt), 4)) +"\n") 281 | print("micro_accuracy: " + str(round((sum(confusion_list[0]) / all_cnt), 4))) 282 | 283 | f.write("=======================================\n\n") 284 | return results 285 | 286 | def predict(args, model, tokenizer): 287 | dataset, examples, features = load_examples(args, tokenizer, evaluate=True, output_examples=True, do_predict=True) 288 | 289 | # 최종 출력 파일 저장을 위한 디렉토리 생성 290 | if not os.path.exists(args.output_dir): 291 | os.makedirs(args.output_dir) 292 | 293 | # tokenizing 된 데이터를 batch size만큼 가져오기 위한 random sampler 및 DataLoader 294 | ## RandomSampler: 데이터 index를 무작위로 선택하여 조정 295 | ## SequentialSampler: 데이터 index를 항상 같은 순서로 조정 296 | eval_sampler = SequentialSampler(dataset) 297 | eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=args.eval_batch_size) 298 | 299 | print("***** Running Prediction *****") 300 | print(" Num examples = %d", len(dataset)) 301 | 302 | # 예측 라벨 303 | pred_coarse_logits = torch.tensor([], dtype=torch.long).to(args.device) 304 | pred_logits = torch.tensor([], dtype=torch.long).to(args.device) 305 | for batch in progress_bar(eval_dataloader): 306 | # 모델을 평가 모드로 변경 307 | model.eval() 308 | batch = tuple(t.to(args.device) for t in batch) 309 | 310 | with torch.no_grad(): 311 | # 평가에 필요한 입력 데이터 저장 312 | inputs_list = ["input_ids", "attention_mask", "token_type_ids", "position_ids"] 313 | inputs = dict() 314 | for n, input in enumerate(inputs_list): inputs[input] = batch[n] 315 | 316 | inputs_list2 = ['word_idxs', 'span'] 317 | for m, input in enumerate(inputs_list2): inputs[input] = batch[-(m + 1)] 318 | 319 | # outputs = (label_logits, ) 320 | # label_logits: [batch_size, num_labels] 321 | outputs = model(**inputs) 322 | 323 | pred_logits = torch.cat([pred_logits, outputs[0][1]], dim=0) 324 | pred_coarse_logits = torch.cat([pred_coarse_logits, outputs[0][0]], dim=0) 325 | 326 | # pred_label과 gold_label 비교 327 | pred_logits = pred_logits.detach().cpu().numpy() 328 | pred_coarse_logits = pred_coarse_logits.detach().cpu().numpy() 329 | pred_labels = np.argmax(pred_logits, axis=-1) 330 | pred_coarse_labels = np.argmax(pred_coarse_logits, axis=-1) 331 | ## gold_labels 332 | gold_labels = [example.gold_label for example in examples] 333 | gold_coarse_labels = [example.gold_coarse_label for example in examples] 334 | 335 | idx2label = {0: '문제 정의', 1: '가설 설정', 2: '기술 정의', 3: '제안 방법', 4: '대상 데이터', 5: '데이터처리', 6: '이론/모형', 7: '성능/효과', 8: '후속연구'}#, 9: '기타'} 336 | idx2coarse_label = {0: '연구 목적', 1: '연구 방법', 2: '연구 결과'} # , 3: '기타'} 337 | 338 | # results = get_score(pred_labels, gold_labels, idx2label) 339 | results = get_sklearn_score(pred_labels, gold_labels, idx2label) 340 | coarse_results = get_sklearn_score(pred_coarse_labels, gold_coarse_labels, idx2coarse_label) 341 | 342 | 
print("result of get_ai_score") 343 | for k in coarse_results.keys(): 344 | print("{} : {}\n".format(k, coarse_results[k])) 345 | for k in results.keys(): 346 | print("{} : {}\n".format(k, results[k])) 347 | 348 | print("result of get_sklearn_score") 349 | sk_results = get_sklearn_score(pred_labels, gold_labels, idx2label) 350 | sk_coarse_results = get_sklearn_score(pred_coarse_labels, gold_coarse_labels, idx2coarse_label) 351 | for k in sk_coarse_results.keys(): 352 | print("{} : {}\n".format(k, sk_coarse_results[k])) 353 | for k in sk_results.keys(): 354 | print("{} : {}\n".format(k, sk_results[k])) 355 | 356 | 357 | # 검증 스크립트 기반 성능 저장 358 | output_dir = os.path.join(args.output_dir, 'test') 359 | 360 | out_file_type = 'a' 361 | if not os.path.exists(output_dir): 362 | os.makedirs(output_dir) 363 | out_file_type = 'w' 364 | 365 | ## 검증 스크립트 기반 성능 저장을 위한 파일 생성 366 | if os.path.exists(args.model_name_or_path): 367 | print(args.model_name_or_path) 368 | eval_file_name = list(filter(None, args.model_name_or_path.split("/"))).pop() 369 | else: 370 | eval_file_name = "init_weight" 371 | 372 | ## 대분류 세부분류 373 | print("===== 대분류 세부분류 =====") 374 | coarse_new_output_1 = torch.zeros([9,9]) 375 | coarse_new_output_2 = torch.zeros([9,9]) 376 | coarse_new_output_3 = torch.zeros([9,9]) 377 | 378 | for i,(cg,cp, g,p) in enumerate(zip(pred_coarse_labels, gold_coarse_labels, gold_labels, pred_labels)): 379 | if cp == 0: 380 | coarse_new_output_1[p][g] += torch.tensor(1) 381 | elif cp == 1: 382 | coarse_new_output_2[p][g] += torch.tensor(1) 383 | elif cp == 2: 384 | coarse_new_output_3[p][g] += torch.tensor(1) 385 | 386 | 387 | print("============대분류 결과====================") 388 | for co in zip(coarse_new_output_1, coarse_new_output_2, coarse_new_output_3): 389 | print(co) 390 | 391 | 392 | ### incorrect data 저장 393 | out_incorrect = {"sentence": [], "correct": [], "predict": []} 394 | print('\n\n=====================outputs=====================') 395 | for i,(g,p) in enumerate(zip(gold_labels, pred_labels)): 396 | if g != p: 397 | out_incorrect["sentence"].append(examples[i].sentence) 398 | out_incorrect["correct"].append(idx2label[g]) 399 | out_incorrect["predict"].append(idx2label[p]) 400 | df_incorrect = pd.DataFrame(out_incorrect) 401 | df_incorrect.to_csv(os.path.join(output_dir, "test_result_{}_incorrect.csv".format(eval_file_name)), index=False) 402 | 403 | ### 전체 data 저장 404 | out = {"sentence":[], "correct":[], "predict":[]} 405 | for i,(g,p) in enumerate(zip(gold_labels, pred_labels)): 406 | for k,v in zip(out.keys(),[examples[i].sentence, idx2label[g], idx2label[p]]): 407 | out[k].append(v) 408 | for k, v in zip(out.keys(), [examples[i].sentence, idx2label[g], idx2label[p]]): 409 | out[k].append(v) 410 | df = pd.DataFrame(out) 411 | df.to_csv(os.path.join(output_dir, "test_result_{}.csv".format(eval_file_name)), index=False) 412 | 413 | return results 414 | -------------------------------------------------------------------------------- /src/model/model_multi.py: -------------------------------------------------------------------------------- 1 | # model += Parsing Infor Collecting Layer (PIC) 2 | 3 | from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss 4 | import torch.nn as nn 5 | import torch 6 | import torch.nn.functional as F 7 | 8 | from transformers import BertModel, RobertaModel 9 | 10 | import transformers 11 | if int(transformers.__version__[0]) <= 3: 12 | from transformers.modeling_roberta import RobertaPreTrainedModel 13 | from transformers.modeling_bert import 
BertPreTrainedModel 14 | from transformers.modeling_electra import ElectraModel, ElectraPreTrainedModel 15 | else: 16 | from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel 17 | from transformers.models.bert.modeling_bert import BertPreTrainedModel 18 | from transformers.models.electra.modeling_electra import ElectraPreTrainedModel 19 | 20 | from src.functions.biattention import BiAttention, BiLinear 21 | 22 | 23 | class KorSciBERTForSequenceClassification(BertPreTrainedModel): 24 | 25 | def __init__(self, config, max_sentence_length, path): 26 | super(KorSciBERTForSequenceClassification, self).__init__(config, max_sentence_length, path) 27 | self.num_labels = config.num_labels 28 | self.num_coarse_labels = 3 29 | self.config = config 30 | 31 | self.bert = transformers.BertModel.from_pretrained(path, config=self.config) 32 | 33 | # special token 추가 34 | self.config.vocab_size = 15330 + 1 35 | self.bert.resize_token_embeddings(self.config.vocab_size) 36 | 37 | # 입력 토큰에서 token1, token2가 있을 때 (index of token1, index of token2)를 하나의 span으로 보고 이에 대한 정보를 학습 38 | self.span_info_collect = SICModel1(config) 39 | #self.span_info_collect = SICModel2(config) 40 | 41 | # biaffine을 통해 premise와 hypothesis span에 대한 정보를 결합후 정규화 42 | self.parsing_info_collect = PICModel(config, max_sentence_length) 43 | 44 | classifier_dropout = ( 45 | config.hidden_dropout_prob if config.hidden_dropout_prob is not None else 0.1 46 | ) 47 | 48 | # 대분류 49 | self.dropout1 = nn.Dropout(classifier_dropout) 50 | self.classifier1 = nn.Linear(config.hidden_size, self.num_coarse_labels) 51 | 52 | # 세부분류 53 | self.dropout2 = nn.Dropout(classifier_dropout) 54 | self.classifier2 = nn.Linear(config.hidden_size+self.num_coarse_labels, self.num_labels) 55 | 56 | self.reset_parameters #self.init_weights() 57 | 58 | def forward( 59 | self, 60 | input_ids=None, 61 | attention_mask=None, 62 | token_type_ids=None, 63 | position_ids=None, 64 | head_mask=None, 65 | inputs_embeds=None, 66 | labels=None, 67 | coarse_labels=None, 68 | span=None, 69 | word_idxs=None, 70 | ): 71 | batch_size = input_ids.shape[0] 72 | discriminator_hidden_states = self.bert( 73 | input_ids=input_ids, 74 | attention_mask=attention_mask, 75 | ) 76 | # last-layer hidden state 77 | # sequence_output: [batch_size, seq_length, hidden_size] 78 | sequence_output = discriminator_hidden_states[0] 79 | 80 | # span info collecting layer(SIC) 81 | h_ij = self.span_info_collect(sequence_output, word_idxs) 82 | 83 | # parser info collecting layer(PIC) 84 | hidden_states = self.parsing_info_collect(h_ij, 85 | batch_size= batch_size, 86 | span=span,) 87 | 88 | # 대분류 89 | hidden_states1 = self.dropout1(hidden_states) 90 | logits1 = self.classifier1(hidden_states1) 91 | 92 | # concat 93 | concat_hidden_states = torch.cat((logits1, hidden_states), dim=1) 94 | 95 | # 세부 분류 96 | hidden_states2 = self.dropout2(concat_hidden_states) 97 | logits2 = self.classifier2(hidden_states2) 98 | 99 | #logits = logits1 100 | logits = [logits1, logits2] 101 | outputs = (logits, ) + discriminator_hidden_states[2:] 102 | 103 | if labels is not None: 104 | if self.num_labels == 1: 105 | # We are doing regression 106 | loss_fct = MSELoss() 107 | loss1 = loss_fct(logits1.view(-1), coarse_labels.view(-1)) 108 | loss2 = loss_fct(logits2.view(-1), labels.view(-1)) 109 | else: 110 | loss_fct = CrossEntropyLoss() 111 | loss1 = loss_fct(logits1.view(-1, self.num_coarse_labels), coarse_labels.view(-1)) 112 | loss2 = loss_fct(logits2.view(-1, self.num_labels), labels.view(-1)) 113 | 114 | 
loss = loss1 + loss2 115 | # print("loss: "+str(loss)) 116 | outputs = (loss,) + outputs 117 | 118 | return outputs # (loss), logits, (hidden_states), (attentions) 119 | 120 | def reset_parameters(self): 121 | self.dropout1.reset_parameters() 122 | self.classifier1.reset_parameters() 123 | self.dropout2.reset_parameters() 124 | self.classifier2.reset_parameters() 125 | 126 | class BertForSequenceClassification(BertPreTrainedModel): 127 | 128 | def __init__(self, config, max_sentence_length): 129 | super().__init__(config) 130 | self.num_labels = config.num_labels 131 | self.num_coarse_labels = 3 132 | self.config = config 133 | self.bert = BertModel(config) 134 | 135 | # 입력 토큰에서 token1, token2가 있을 때 (index of token1, index of token2)를 하나의 span으로 보고 이에 대한 정보를 학습 136 | self.span_info_collect = SICModel1(config) 137 | #self.span_info_collect = SICModel2(config) 138 | 139 | # biaffine을 통해 premise와 hypothesis span에 대한 정보를 결합후 정규화 140 | self.parsing_info_collect = PICModel(config, max_sentence_length) # 구묶음 + tag 정보 + bert-biaffine attention + bilistm + bert-bilinear classification 141 | 142 | classifier_dropout = ( 143 | config.hidden_dropout_prob if config.hidden_dropout_prob is not None else 0.1 144 | ) 145 | self.dropout1 = nn.Dropout(classifier_dropout) 146 | self.dropout2 = nn.Dropout(classifier_dropout) 147 | self.classifier1 = nn.Linear(config.hidden_size, self.num_coarse_labels) 148 | self.classifier2 = nn.Linear(config.hidden_size+self.num_coarse_labels, config.num_labels) 149 | 150 | self.init_weights() 151 | 152 | def forward( 153 | self, 154 | input_ids=None, 155 | attention_mask=None, 156 | token_type_ids=None, 157 | position_ids=None, 158 | head_mask=None, 159 | inputs_embeds=None, 160 | labels=None, 161 | coarse_labels=None, 162 | span=None, 163 | word_idxs=None, 164 | ): 165 | batch_size = input_ids.shape[0] 166 | discriminator_hidden_states = self.bert( 167 | input_ids=input_ids, 168 | attention_mask=attention_mask, 169 | token_type_ids=token_type_ids, 170 | position_ids=position_ids, 171 | ) 172 | # last-layer hidden state 173 | # sequence_output: [batch_size, seq_length, hidden_size] 174 | sequence_output = discriminator_hidden_states[0] 175 | 176 | # span info collecting layer(SIC) 177 | h_ij = self.span_info_collect(sequence_output, word_idxs) 178 | 179 | # parser info collecting layer(PIC) 180 | hidden_states = self.parsing_info_collect(h_ij, 181 | batch_size= batch_size, 182 | span=span,) 183 | 184 | # 대분류 185 | hidden_states1 = self.dropout1(hidden_states) 186 | logits1 = self.classifier1(hidden_states1) 187 | 188 | # concat 189 | concat_hidden_states = torch.cat((logits1, hidden_states), dim=1) 190 | 191 | # 세부 분류 192 | hidden_states2 = self.dropout2(concat_hidden_states) 193 | logits2 = self.classifier2(hidden_states2) 194 | 195 | logits = [logits1, logits2] 196 | outputs = (logits, ) + discriminator_hidden_states[2:] 197 | 198 | if labels is not None: 199 | if self.num_labels == 1: 200 | # We are doing regression 201 | loss_fct = MSELoss() 202 | loss1 = loss_fct(logits1.view(-1), coarse_labels.view(-1)) 203 | loss2 = loss_fct(logits2.view(-1), labels.view(-1)) 204 | else: 205 | loss_fct = CrossEntropyLoss() 206 | loss1 = loss_fct(logits1.view(-1, self.num_coarse_labels), coarse_labels.view(-1)) 207 | loss2 = loss_fct(logits2.view(-1, self.num_labels), labels.view(-1)) 208 | loss = loss1+loss2 209 | #print("loss: "+str(loss)) 210 | outputs = (loss,) + outputs 211 | 212 | return outputs # (loss), logits, (hidden_states), (attentions) 213 | 214 | class 
RobertaForSequenceClassification(BertPreTrainedModel): 215 | 216 | def __init__(self, config, max_sentence_length): 217 | super().__init__(config) 218 | self.num_labels = config.num_labels 219 | self.num_coarse_labels = 3 220 | self.config = config 221 | self.roberta = RobertaModel(config) 222 | 223 | # 입력 토큰에서 token1, token2가 있을 때 (index of token1, index of token2)를 하나의 span으로 보고 이에 대한 정보를 학습 224 | self.span_info_collect = SICModel1(config) 225 | #self.span_info_collect = SICModel2(config) 226 | 227 | # biaffine을 통해 premise와 hypothesis span에 대한 정보를 결합후 정규화 228 | self.parsing_info_collect = PICModel(config, max_sentence_length) # 구묶음 + tag 정보 + bert-biaffine attention + bilistm + bert-bilinear classification 229 | 230 | classifier_dropout = ( 231 | config.hidden_dropout_prob if config.hidden_dropout_prob is not None else 0.1 232 | ) 233 | self.dropout1 = nn.Dropout(classifier_dropout) 234 | self.dropout2 = nn.Dropout(classifier_dropout) 235 | self.classifier1 = nn.Linear(config.hidden_size, self.num_coarse_labels) 236 | self.classifier2 = nn.Linear(config.hidden_size+self.num_coarse_labels, config.num_labels) 237 | 238 | self.init_weights() 239 | 240 | def forward( 241 | self, 242 | input_ids=None, 243 | attention_mask=None, 244 | token_type_ids=None, 245 | position_ids=None, 246 | head_mask=None, 247 | inputs_embeds=None, 248 | labels=None, 249 | coarse_labels=None, 250 | span=None, 251 | word_idxs=None, 252 | ): 253 | batch_size = input_ids.shape[0] 254 | discriminator_hidden_states = self.roberta( 255 | input_ids=input_ids, 256 | attention_mask=attention_mask, 257 | ) 258 | # last-layer hidden state 259 | # sequence_output: [batch_size, seq_length, hidden_size] 260 | sequence_output = discriminator_hidden_states[0] 261 | 262 | # span info collecting layer(SIC) 263 | h_ij = self.span_info_collect(sequence_output, word_idxs) 264 | 265 | # parser info collecting layer(PIC) 266 | hidden_states = self.parsing_info_collect(h_ij, 267 | batch_size= batch_size, 268 | span=span,) 269 | 270 | # 대분류 271 | hidden_states1 = self.dropout1(hidden_states) 272 | logits1 = self.classifier1(hidden_states1) 273 | 274 | # concat 275 | concat_hidden_states = torch.cat((logits1, hidden_states), dim=1) 276 | 277 | # 세부 분류 278 | hidden_states2 = self.dropout2(concat_hidden_states) 279 | logits2 = self.classifier2(hidden_states2) 280 | 281 | logits = [logits1, logits2] 282 | outputs = (logits, ) + discriminator_hidden_states[2:] 283 | 284 | if labels is not None: 285 | if self.num_labels == 1: 286 | # We are doing regression 287 | loss_fct = MSELoss() 288 | loss1 = loss_fct(logits1.view(-1), coarse_labels.view(-1)) 289 | loss2 = loss_fct(logits2.view(-1), labels.view(-1)) 290 | else: 291 | loss_fct = CrossEntropyLoss() 292 | loss1 = loss_fct(logits1.view(-1, self.num_coarse_labels), coarse_labels.view(-1)) 293 | loss2 = loss_fct(logits2.view(-1, self.num_labels), labels.view(-1)) 294 | loss = loss1+loss2 295 | #print("loss: "+str(loss)) 296 | outputs = (loss,) + outputs 297 | 298 | return outputs # (loss), logits, (hidden_states), (attentions) 299 | 300 | class SICModel1(nn.Module): 301 | def __init__(self, config): 302 | super().__init__() 303 | 304 | def forward(self, hidden_states, word_idxs): 305 | # (batch, max_pre_sen, seq_len) @ (batch, seq_len, hidden) = (batch, max_pre_sen, hidden) 306 | word_idxs = word_idxs.squeeze(1) 307 | 308 | sen = torch.matmul(word_idxs, hidden_states) 309 | 310 | return sen 311 | 312 | class SICModel2(nn.Module): 313 | def __init__(self, config): 314 | super().__init__() 315 | 
self.hidden_size = config.hidden_size 316 | 317 | self.W_1 = nn.Linear(self.hidden_size, self.hidden_size) 318 | self.W_2 = nn.Linear(self.hidden_size, self.hidden_size) 319 | self.W_3 = nn.Linear(self.hidden_size, self.hidden_size) 320 | self.W_4 = nn.Linear(self.hidden_size, self.hidden_size) 321 | 322 | def forward(self, hidden_states, word_idxs): 323 | word_idxs = word_idxs.squeeze(1).type(torch.LongTensor).to("cuda") 324 | 325 | W1_h = self.W_1(hidden_states) # (bs, length, hidden_size) 326 | W2_h = self.W_2(hidden_states) 327 | W3_h = self.W_3(hidden_states) 328 | W4_h = self.W_4(hidden_states) 329 | 330 | W1_hi_emb=torch.tensor([], dtype=torch.long).to("cuda") 331 | W2_hi_emb=torch.tensor([], dtype=torch.long).to("cuda") 332 | W3_hi_start_emb = torch.tensor([], dtype=torch.long).to("cuda") 333 | W3_hi_end_emb = torch.tensor([], dtype=torch.long).to("cuda") 334 | W4_hi_start_emb = torch.tensor([], dtype=torch.long).to("cuda") 335 | W4_hi_end_emb = torch.tensor([], dtype=torch.long).to("cuda") 336 | for i in range(0, hidden_states.shape[0]): 337 | sub_W1_hi_emb = torch.index_select(W1_h[i], 0, word_idxs[i][0]) # (max_seq_length, hidden_size) 338 | sub_W2_hi_emb = torch.index_select(W2_h[i], 0, word_idxs[i][1]) 339 | sub_W3_hi_start_emb = torch.index_select(W3_h[i], 0, word_idxs[i][0]) 340 | sub_W3_hi_end_emb = torch.index_select(W3_h[i], 0, word_idxs[i][1]) 341 | sub_W4_hi_start_emb = torch.index_select(W4_h[i], 0, word_idxs[i][0]) 342 | sub_W4_hi_end_emb = torch.index_select(W4_h[i], 0, word_idxs[i][1]) 343 | 344 | W1_hi_emb = torch.cat((W1_hi_emb, sub_W1_hi_emb.unsqueeze(0))) 345 | W2_hi_emb = torch.cat((W2_hi_emb, sub_W2_hi_emb.unsqueeze(0))) 346 | W3_hi_start_emb = torch.cat((W3_hi_start_emb, sub_W3_hi_start_emb.unsqueeze(0))) 347 | W3_hi_end_emb = torch.cat((W3_hi_end_emb, sub_W3_hi_end_emb.unsqueeze(0))) 348 | W4_hi_start_emb = torch.cat((W4_hi_start_emb, sub_W4_hi_start_emb.unsqueeze(0))) 349 | W4_hi_end_emb = torch.cat((W4_hi_end_emb, sub_W4_hi_end_emb.unsqueeze(0))) 350 | 351 | # [w1*hi, w2*hj, w3(hi-hj), w4(hi⊗hj)] 352 | span = W1_hi_emb + W2_hi_emb + (W3_hi_start_emb - W3_hi_end_emb) + torch.mul(W4_hi_start_emb, W4_hi_end_emb) # (batch_size, max_seq_length, hidden_size) 353 | h_ij = torch.tanh(span) 354 | 355 | return h_ij 356 | 357 | 358 | class PICModel(nn.Module): 359 | def __init__(self, config, max_sentence_length): 360 | super().__init__() 361 | self.hidden_size = config.hidden_size 362 | self.max_sentence_length = max_sentence_length 363 | 364 | # 구문구조 종류 365 | depend2idx = {"None": 0}; 366 | idx2depend = {0: "None"}; 367 | for depend1 in ['IP', 'AP', 'DP', 'VP', 'VNP', 'S', 'R', 'NP', 'L', 'X']: 368 | for depend2 in ['CMP', 'MOD', 'SBJ', 'AJT', 'CNJ', 'OBJ', "UNDEF"]: 369 | depend2idx[depend1 + "-" + depend2] = len(depend2idx) 370 | idx2depend[len(idx2depend)] = depend1 + "-" + depend2 371 | self.depend2idx = depend2idx 372 | self.idx2depend = idx2depend 373 | self.depend_embedding = nn.Embedding(len(idx2depend), self.hidden_size, padding_idx=0).to("cuda") 374 | 375 | self.reduction1 = nn.Linear(self.hidden_size , int(self.hidden_size // 3)) 376 | self.reduction2 = nn.Linear(self.hidden_size , int(self.hidden_size // 3)) 377 | 378 | self.biaffine = BiAttention(int(self.hidden_size // 3), int(self.hidden_size // 3), 100) 379 | 380 | self.bi_lism = nn.LSTM(input_size=100, hidden_size=self.hidden_size//2, num_layers=1, bidirectional=True) 381 | 382 | def forward(self, hidden_states, batch_size, span): 383 | # hidden_states: [[batch_size, word_idxs, hidden_size], []] 
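        # `span` carries one dependency triple per chunk pair:
        # (head chunk index, tail chunk index, dependency-label index from depend2idx, e.g. "NP-SBJ").
        # Below, the head and tail chunk vectors are gathered from hidden_states via index_select
        # and summed with the learned label embedding, so each chunk-level dependency arc becomes
        # a single hidden_size vector before the biaffine attention and BiLSTM pooling.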
384 | # span: [batch_size, max_sentence_length, max_sentence_length] 385 | # word_idxs: [batch_size, seq_length] 386 | # -> sequence_outputs: [batch_size, seq_length, hidden_size] 387 | 388 | # span: (batch, max_prem_len, 3) -> (batch, max_prem_len, 3*hidden_size) 389 | new_span = torch.tensor([], dtype=torch.long).to("cuda") 390 | 391 | for i, span in enumerate(span.tolist()): 392 | span_head = torch.tensor([span[0] for span in span]).to("cuda") #(max_prem_len) 393 | span_tail = torch.tensor([span[1] for span in span]).to("cuda") 394 | span_dep = torch.tensor([span[2] for span in span]).to("cuda") 395 | 396 | span_head = torch.index_select(hidden_states[i], 0, span_head) #(max_prem_len, hidden_size) 397 | span_tail = torch.index_select(hidden_states[i], 0, span_tail) 398 | span_dep = self.depend_embedding(span_dep) 399 | 400 | n_span = span_head + span_tail + span_dep 401 | new_span = torch.cat((new_span, n_span.unsqueeze(0))) 402 | 403 | span = new_span 404 | del new_span 405 | 406 | # biaffine attention 407 | # hidden_states: (batch_size, max_prem_len, hidden_size) 408 | # span: (batch, max_prem_len, hidden_size) 409 | # -> biaffine_outputs: [batch_size, 100, max_prem_len, max_prem_len] 410 | span = self.reduction1(span) 411 | hidden_states = self.reduction2(hidden_states) 412 | 413 | biaffine_outputs= self.biaffine(hidden_states, span) 414 | 415 | # bilstm 416 | # biaffine_outputs: [batch_size, 100, max_prem_len, max_prem_len] -> [batch_size, 100, max_prem_len] -> [max_prem_len, batch_size, 100] 417 | # -> hidden_states: [batch_size, max_sentence_length] 418 | biaffine_outputs = biaffine_outputs.mean(-1) 419 | 420 | biaffine_outputs = biaffine_outputs.transpose(1,2).transpose(0,1) 421 | states = None 422 | 423 | bilstm_outputs, states = self.bi_lism(biaffine_outputs) 424 | 425 | hidden_states = states[0].transpose(0, 1).contiguous().view(batch_size, -1) 426 | 427 | return hidden_states 428 | 429 | def reset_parameters(self): 430 | self.W_1_bilinear.reset_parameters() 431 | self.W_1_linear.reset_parameters() 432 | self.W_2_bilinear.reset_parameters() 433 | self.W_2_linear.reset_parameters() 434 | 435 | self.biaffine_W_bilinear.reset_parameters() 436 | self.biaffine_W_linear.reset_parameters() 437 | 438 | 439 | --------------------------------------------------------------------------------
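A condensed, self-contained sketch of the dual-classification head implemented above (the hidden size, batch size, and random inputs are illustrative stand-ins; the real model feeds the PIC layer output and trains both heads jointly):

```
import torch
import torch.nn as nn

hidden_size, num_coarse, num_fine, batch = 768, 3, 9, 4  # illustrative values

coarse_head = nn.Linear(hidden_size, num_coarse)               # coarse head (대분류)
fine_head = nn.Linear(hidden_size + num_coarse, num_fine)      # fine-grained head (세부분류)

sentence_repr = torch.randn(batch, hidden_size)                # stand-in for the PIC output
coarse_logits = coarse_head(sentence_repr)                     # (batch, 3)
fine_input = torch.cat([coarse_logits, sentence_repr], dim=1)  # coarse evidence is reused
fine_logits = fine_head(fine_input)                            # (batch, 9)

coarse_labels = torch.randint(0, num_coarse, (batch,))
fine_labels = torch.randint(0, num_fine, (batch,))
loss_fct = nn.CrossEntropyLoss()
loss = loss_fct(coarse_logits, coarse_labels) + loss_fct(fine_logits, fine_labels)
loss.backward()  # both heads are optimized with the summed loss, as in the training loop
```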