├── README.md
├── kcc2022 poster.pdf
├── kcc2022 poster.png
├── model.png
├── requirements.txt
├── run_baseline_torch.py
└── src
    ├── dependency
    │   └── merge.py
    ├── functions
    │   ├── biattention.py
    │   ├── metric.py
    │   ├── processor.py
    │   └── utils.py
    └── model
        ├── main_functions_multi.py
        └── model_multi.py
/README.md:
--------------------------------------------------------------------------------
1 | # Dual-Classification of Scientific Paper Sentence
2 | Code for KCC 2022 paper: *[Dual-Classification of Paper Sentence using Chunk Representation Method and Dependency Parsing](https://www.dbpia.co.kr/journal/articleDetail?nodeId=NODE11113336)*
3 |
9 | ## Setting up the code environment
10 |
11 | ```
12 | $ virtualenv --python=python3.6 venv
13 | $ source venv/bin/activate
14 | $ pip install -r requirements.txt
15 | ```
16 |
17 | The code is supported on Linux only.
18 |
19 | ## Model Structure
20 |
21 | ![Model structure](model.png)
22 |
25 | ## Data
26 |
27 | *[국내 논문 문장 의미 태깅 데이터셋 (Korean scientific-paper sentence semantic tagging dataset)](https://aida.kisti.re.kr/data/8d0fd6f4-4bf9-47ae-bd71-7d41f01ad9a6)*
28 |
29 | ## Directory and Pre-processing
30 | `The dependency parsing model is not publicly released (의존 구문 분석 모델은 미공개).`
31 | ```
32 | ├── data
33 | │   ├── origin.json
34 | │   └── origin
35 | │       ├── DP_origin_preprocess.json
36 | │       └── merge_origin_preprocess
37 | │           ├── origin.json
38 | │           ├── origin_train.json
39 | │           └── origin_test.json
40 | ├── bert
41 | │   ├── init_weight
42 | │   └── biaffine_model
43 | │       └── multi
44 | ├── src
45 | │   ├── dependency
46 | │   │   └── merge.py
47 | │   ├── functions
48 | │   │   ├── biattention.py
49 | │   │   ├── utils.py
50 | │   │   ├── metric.py
51 | │   │   └── processor.py
52 | │   └── model
53 | │       ├── main_functions_multi.py
54 | │       └── model_multi.py
55 | ├── run_baseline_torch.py
56 | ├── requirements.txt
57 | └── README.md
58 | ```
58 |
59 | * Using the dependency parsing model, extract word-level (eojeol-level) dependency structures for each input sentence pair from the raw data (data/origin.json), producing data/origin/DP_origin_preprocess.json.
60 |
61 | * Convert the word-level dependency structures (data/origin/DP_origin_preprocess.json) into chunk-level dependency structures with src/dependency/merge.py, producing data/origin/merge_origin_preprocess/origin.json.
62 |
63 | * Split the data into training and evaluation sets at a 4:1 ratio per fine-grained class (data/origin/merge_origin_preprocess/origin_train.json, data/origin/merge_origin_preprocess/origin_test.json).
64 |
65 | * Add the special token `` that marks chunk boundaries to the vocab.json of [bert/init_weight](https://huggingface.co/klue/bert-base); a sketch of this step follows below.
66 |
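A minimal sketch of the special-token step, assuming the chunk-boundary token literal (elided above) is `[CHNK]`; substitute the actual token string:

```
from transformers import AutoTokenizer

# "[CHNK]" is a hypothetical stand-in for the elided chunk-boundary token.
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
tokenizer.add_special_tokens({"additional_special_tokens": ["[CHNK]"]})
tokenizer.save_pretrained("./bert/init_weight")
# On the model side, resize the embedding matrix to match if needed:
# model.resize_token_embeddings(len(tokenizer))
```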
67 | ## Train & Test
68 |
69 | ### Pretrained model
70 | * KLUE/BERT-base
71 | ### How To Run
72 | `python run_baseline_torch.py`
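
Running mode is selected by the `--do_train` / `--do_eval` / `--do_predict` defaults in `run_baseline_torch.py`. Since argparse's `type=bool` treats any non-empty string (including `"False"`) as `True`, switch modes by editing the defaults rather than passing flags, e.g. for evaluation:

```
cli_parser.add_argument("--do_train", type=bool, default=False)
cli_parser.add_argument("--do_eval", type=bool, default=True)
cli_parser.add_argument("--do_predict", type=bool, default=False)
```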
73 |
74 | ## Results
75 |
76 | | Model | Macro F1 | Acc |
77 | |---|--------- |--------- |
78 | | BERT | 89.66% | 89.90% |
79 | | proposed | 89.75% | 89.99% |
80 |
--------------------------------------------------------------------------------
/kcc2022 poster.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KUNLP/XAI_BinaryClassifier/4d1e9632654559b916de392857cb815928924b19/kcc2022 poster.pdf
--------------------------------------------------------------------------------
/kcc2022 poster.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KUNLP/XAI_BinaryClassifier/4d1e9632654559b916de392857cb815928924b19/kcc2022 poster.png
--------------------------------------------------------------------------------
/model.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/KUNLP/XAI_BinaryClassifier/4d1e9632654559b916de392857cb815928924b19/model.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy
2 | pandas
3 | tokenizers
4 | attrdict
5 | fastprogress
6 | tqdm
7 | pytorch-crf
8 | scikit-learn
9 | tensorflow-gpu==1.15.0
10 | transformers==4.7.0
11 | konlpy
12 | tweepy==3.10.0
13 | torch==1.8.1+cu111
14 |
--------------------------------------------------------------------------------
/run_baseline_torch.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import os
3 | import logging
4 | from attrdict import AttrDict
5 |
6 | # bert
7 | from transformers import AutoTokenizer
8 | from transformers import BertConfig,RobertaConfig
9 |
10 | from src.model.model_multi import BertForSequenceClassification
11 | # NOTE: the roberta branch of create_model() uses
12 | # RobertaForSequenceClassification; it is presumably defined alongside
13 | # BertForSequenceClassification in src.model.model_multi and must be
14 | # imported here before that branch can run.
15 |
16 | from src.model.main_functions_multi import train, evaluate, predict
13 |
14 | from src.functions.utils import init_logger, set_seed
15 |
16 | import sys
17 | sys.path.append(os.path.dirname(os.path.abspath(os.path.dirname(__file__))))
18 |
19 | def create_model(args):
20 |
21 | if args.model_name_or_path.split("/")[-2] == "bert":
22 |
23 | # load the model configuration
24 | config = BertConfig.from_pretrained(
25 | args.model_name_or_path#'bert/Bert-base'
26 | if args.from_init_weight else os.path.join(args.output_dir,"model/checkpoint-{}".format(args.checkpoint)),
27 | cache_dir=args.cache_dir,
28 | )
29 |
30 | config.num_coarse_labels = 3
31 | config.num_labels = 9
32 |
33 | # expose attention weights
34 | config.output_attentions=True
35 |
36 | # Loading the tokenizer here brings in the target model's vocab etc.,
37 | # not a generic pre-trained tokenizer (resolves to BertTokenizerFast)
38 | tokenizer = AutoTokenizer.from_pretrained(
39 | args.model_name_or_path#'bert/init_weight'
40 | if args.from_init_weight else os.path.join(args.output_dir,"model/checkpoint-{}".format(args.checkpoint)),
41 | do_lower_case=args.do_lower_case,
42 | cache_dir=args.cache_dir,
43 | )
44 |
45 | model = BertForSequenceClassification.from_pretrained(
46 | args.model_name_or_path#'bert/init_weight'
47 | if args.from_init_weight else os.path.join(args.output_dir,"model/checkpoint-{}".format(args.checkpoint)),
48 | cache_dir=args.cache_dir,
49 | config=config,
50 | max_sentence_length=args.max_sentence_length,
51 | # from_tf= True if args.from_init_weight else False
52 | )
53 | args.model_name_or_path = args.cache_dir
54 | # print(tokenizer.convert_tokens_to_ids(""))
55 |
56 | elif args.model_name_or_path.split("/")[-2] == "roberta":
57 |
58 | # load the model configuration
59 | config = RobertaConfig.from_pretrained(
60 | args.model_name_or_path
61 | if args.from_init_weight else os.path.join(args.output_dir,"model/checkpoint-{}".format(args.checkpoint)),
62 | cache_dir=args.cache_dir,
63 | )
64 |
65 | config.num_coarse_labels = 3
66 | config.num_labels = 9
67 |
68 | # expose attention weights
69 | config.output_attentions=True
70 |
71 | # Loading the tokenizer here brings in the target model's vocab etc.,
72 | # not a generic pre-trained tokenizer (resolves to BertTokenizerFast)
73 | tokenizer = AutoTokenizer.from_pretrained(
74 | args.model_name_or_path
75 | if args.from_init_weight else os.path.join(args.output_dir,"model/checkpoint-{}".format(args.checkpoint)),
76 | do_lower_case=args.do_lower_case,
77 | cache_dir=args.cache_dir,
78 | )
79 |
80 | model = RobertaForSequenceClassification.from_pretrained(
81 | args.model_name_or_path
82 | if args.from_init_weight else os.path.join(args.output_dir,"model/checkpoint-{}".format(args.checkpoint)),
83 | cache_dir=args.cache_dir,
84 | config=config,
85 | max_sentence_length=args.max_sentence_length,
86 | # from_tf= True if args.from_init_weight else False
87 | )
88 | args.model_name_or_path = args.cache_dir
89 | # print(tokenizer.convert_tokens_to_ids(""))
90 |
91 | model.to(args.device)
92 | return model, tokenizer
93 |
94 | def main(cli_args):
95 | # collect hyperparameters
96 | args = AttrDict(vars(cli_args))
97 | args.device = "cuda"
98 | logger = logging.getLogger(__name__)
99 |
100 | # set up logger and random seed
101 | init_logger()
102 | set_seed(args)
103 |
104 | # load the model
105 | model, tokenizer = create_model(args)
106 |
107 | # dispatch according to the running mode
108 | if args.do_train:
109 | train(args, model, tokenizer, logger)
110 | elif args.do_eval:
111 | evaluate(args, model, tokenizer, logger)
112 | elif args.do_predict:
113 | predict(args, model, tokenizer)
114 |
115 |
116 | if __name__ == '__main__':
117 | cli_parser = argparse.ArgumentParser()
118 |
119 | # Directory
120 |
121 | #------------------------------------------------------------------------------------------------
122 | cli_parser.add_argument("--data_dir", type=str, default="./data/origin/merge_origin_preprocess")
123 |
124 | cli_parser.add_argument("--train_file", type=str, default='origin_train.json')
125 | #cli_parser.add_argument("--train_file", type=str, default='sample.json')
126 | cli_parser.add_argument("--eval_file", type=str, default='origin_test.json')
127 | cli_parser.add_argument("--predict_file", type=str, default='origin_test.json')
128 |
129 | # ------------------------------------------------------------------------------------------------
130 |
131 | ## roberta
132 | # cli_parser.add_argument("--model_name_or_path", type=str, default="./roberta/init_weight")
133 | # cli_parser.add_argument("--cache_dir", type=str, default="./roberta/init_weight")
134 |
135 | # bert
136 | cli_parser.add_argument("--model_name_or_path", type=str, default="./bert/init_weight")
137 | cli_parser.add_argument("--cache_dir", type=str, default="./bert/init_weight")
138 |
139 | #------------------------------------------------------------------------------------------------------------
140 | cli_parser.add_argument("--output_dir", type=str, default="./bert/biaffine_model/multi")
141 | #cli_parser.add_argument("--output_dir", type=str, default="./roberta/biaffine_model/multi")
142 |
143 |
144 | # ------------------------------------------------------------------------------------------------------------
145 |
146 | cli_parser.add_argument("--max_sentence_length", type=int, default=110)
147 |
148 | # https://github.com/KLUE-benchmark/KLUE-baseline/blob/main/run_all.sh
149 | # Model Hyper Parameter
150 | cli_parser.add_argument("--max_seq_length", type=int, default=512)
151 | # Training Parameter
152 | cli_parser.add_argument("--learning_rate", type=float, default=1e-5)
153 | cli_parser.add_argument("--train_batch_size", type=int, default =16)
154 | cli_parser.add_argument("--eval_batch_size", type=int, default = 32)
155 | cli_parser.add_argument("--num_train_epochs", type=int, default=6)
156 |
157 | #cli_parser.add_argument("--save_steps", type=int, default=2000)
158 | cli_parser.add_argument("--logging_steps", type=int, default=100)
159 | cli_parser.add_argument("--seed", type=int, default=42)
160 | cli_parser.add_argument("--threads", type=int, default=8)
161 |
162 | cli_parser.add_argument("--weight_decay", type=float, default=0.0)
163 | cli_parser.add_argument("--adam_epsilon", type=int, default=1e-10)
164 | cli_parser.add_argument("--gradient_accumulation_steps", type=int, default=4)
165 | cli_parser.add_argument("--warmup_steps", type=int, default=0)
166 | cli_parser.add_argument("--max_steps", type=int, default=-1)
167 | cli_parser.add_argument("--max_grad_norm", type=int, default=1.0)
168 |
169 | cli_parser.add_argument("--verbose_logging", type=bool, default=False)
170 | cli_parser.add_argument("--do_lower_case", type=bool, default=False)
171 | cli_parser.add_argument("--no_cuda", type=bool, default=False)
172 |
173 | # Running Mode
174 | # NOTE: argparse's type=bool treats any non-empty string (even "False") as
175 | # True, so switch modes by editing these defaults rather than passing flags.
176 | cli_parser.add_argument("--from_init_weight", type=bool, default=True)
177 | cli_parser.add_argument("--checkpoint", type=str, default="5")
178 |
179 | cli_parser.add_argument("--do_train", type=bool, default=True)
180 | cli_parser.add_argument("--do_eval", type=bool, default=False)
181 | cli_parser.add_argument("--do_predict", type=bool, default=False)
180 |
181 | cli_args = cli_parser.parse_args()
182 |
183 | main(cli_args)
184 |
--------------------------------------------------------------------------------
/src/dependency/merge.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 | from tqdm import tqdm
4 | import random
5 |
6 |
7 | def change_tag(change_list, tag_list):
8 | new_tag_list = []
9 | for tag_li_idx, tag_li in enumerate(tag_list):
10 | new_tag_li = []
11 | for change in change_list:
12 | change_result = set(sorted(change[1]))
13 | for word_idx in change[0]:
14 | for tag_idx, tag in enumerate(tag_li):
15 | ## tag = [[dependent idx, head idx], [syntactic label, functional label]]
16 | if list(tag[0])[0] == word_idx:
17 | tag_list[tag_li_idx][tag_idx] = [[change_result, set(sorted(list(tag[0])[1]))], tag[1]]
18 | if list(tag[0])[1] == word_idx:
19 | tag_list[tag_li_idx][tag_idx] = [[set(sorted(list(tag[0])[0])), change_result], tag[1]]
20 | for tag in tag_li:
21 | if (len(tag[0][0].difference(tag[0][1]))!=0):
22 | if (len(tag[0][1].difference(tag[0][0]))!=0):new_tag_li.append(tag)
23 | new_tag_list.append(new_tag_li)
24 | tag_list = new_tag_list
25 | del new_tag_list
26 | return tag_list
27 |
28 | def tag_case1(change_list, tag_l):
29 | ## tag = [[dependent idx, head idx], [syntactic label, functional label]]
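## Illustrative example (hypothetical arcs): the chain {0}<-{1}, {1}<-{2}
## collapses into {0,1}<-{2}:
##   tag_case1([], [[[{0}, {1}], ["NP", "MOD"]], [[{1}, {2}], ["NP", "MOD"]]])
##   returns ([[[{0}, {1}, {1}], {0, 1}]], [[[{0, 1}, {2}], ["NP", "MOD"]]], 1)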
30 | case1_conti = True;cnt = 0
31 | while case1_conti:
32 | case1_conti = False
33 | del_tag_l = []
34 | for tag1 in tag_l:
35 | for tag2 in tag_l:
36 | if (max(tag1[0][1]) == min(tag2[0][0])):
37 | case1_conti = True;cnt += 1
38 | change = tag1[0][0].union(tag1[0][1].union(tag2[0][0]))
39 | change_list.append([[tag1[0][0], tag1[0][1], tag2[0][0]], change])
40 | tag2[0][0] = change
41 | if tag1 in tag_l: del_tag_l.append(tag1)
42 | new_tag_l = [tag for tag in tag_l if tag not in del_tag_l]
43 | tag_l = new_tag_l
44 | del new_tag_l
45 | return change_list, tag_l, cnt
46 |
47 | def tag_case2(change_list, tag_li1, tag_li2):
48 | ## tag = [[dependent idx, head idx], [syntactic label, functional label]]
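## Illustrative example (hypothetical arcs): two adjacent dependents sharing
## head {2} are merged, ({1}<-{2}, {0}<-{2}) => ({0,1}<-{2}):
##   tag_case2([], [[[{1}, {2}], ["NP", "MOD"]]], [[[{0}, {2}], ["NP", "MOD"]]])
##   returns ([[[{1}, {0}], {0, 1}]], [[[{0, 1}, {2}], ["NP", "MOD"]]], [], 1)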
49 | case2_conti = True;cnt = 0
50 | while case2_conti:
51 | case2_conti = False
52 | for tag1 in tag_li1:
53 | del_tag_li2 = []
54 | for tag2 in tag_li2:
55 | if ((tag1[0][1] == tag2[0][1]) and ((max(tag1[0][0]) - min(tag2[0][0])) == 1)):
56 | case2_conti = True;cnt+=1
57 | change = tag1[0][0].union(tag2[0][0])
58 | change_list.append([[tag1[0][0], tag2[0][0]], change])
59 | tag1[0][0] = change
60 | if tag2 in tag_li2: del_tag_li2.append(tag2)
61 | new_tag_li2 = [tag for tag in tag_li2 if tag not in del_tag_li2]
62 | tag_li2 = new_tag_li2
63 | del new_tag_li2
64 |
65 | return change_list, tag_li1, tag_li2, cnt
66 |
67 | def merge_tag(datas, CNJ=True):
68 | outputs = []
69 | for id, data in tqdm(enumerate(datas)):
70 | # [{'R', 'VNP', 'L', 'VP', 'S', 'AP', 'NP', 'DP', 'IP', 'X'}, {'None', 'MOD', 'CNJ', 'AJT', 'OBJ', 'SBJ', 'CMP'}]
71 | # buckets per syntactic label
72 | r_list = []; l_list = []; s_list = []; x_list = []
73 | np_list = []; dp_list = []; vp_list = []; vnp_list = []
74 | ap_list = []; ip_list = []
75 | # buckets for modifier functional labels
76 | tag_list = []; np_cnj_list = []
85 |
86 | # sentence words
87 | sen_words = data["preprocess"].split()
88 | # head indices
89 | heads = [x-1 for x in data["parsing"]["heads"][:-1]]
90 |
91 | # dependency labels
92 | labels = data["parsing"]["label"][:-1]
93 | assert len(sen_words)-1 == len(heads) == len(labels)
94 |
95 | # bucket each dependency arc in the sentence by its label
96 | for w, (word, head_idx, label) in enumerate(zip(sen_words, heads, labels)):
97 | label = label.split("_")
98 | if (len(label)==1): label.append("None")
99 |
100 | dependency_list = [[set([w]), set([head_idx])], label]
101 |
102 | if (label[0] == "NP"):
103 | if CNJ:
104 | if (label[1] == "CNJ"):
105 | np_cnj_list.append(dependency_list)
106 | if (label[1] != "CNJ"):
107 | np_list.append(dependency_list)
108 | else:np_list.append(dependency_list)
109 | elif (label[0] == "VP"):
110 | vp_list.append(dependency_list)
111 | elif (label[0] == "VNP"):
112 | vnp_list.append(dependency_list)
113 | elif (label[0] == "DP"):
114 | dp_list.append(dependency_list)
115 | elif (label[0] == "AP"):
116 | ap_list.append(dependency_list)
117 | elif (label[0] == "IP"):
118 | ip_list.append(dependency_list)
119 | elif (label[0] == "R"):
120 | r_list.append(dependency_list)
121 | elif (label[0] == "L"):
122 | l_list.append(dependency_list)
123 | elif (label[0] == "S"):
124 | s_list.append(dependency_list)
125 | elif (label[0] == "X"):
126 | x_list.append(dependency_list)
127 |
128 | if (label[1] in ["MOD", "AJT", "CMP", "None"]):
129 | tag_list.append(dependency_list);
130 | vp_list = vp_list + vnp_list
131 |
132 | tag_list = [tag_list] + [x for x in [np_list, dp_list, vp_list, ap_list, ip_list, r_list, l_list, s_list, x_list] if len(x) != 0]
133 |
134 | # NP-CNJ
135 | if np_cnj_list != []:
136 | np_cnj_list = [cnj[0] for cnj in np_cnj_list]
137 | for word_idxs in np_cnj_list:
138 | for tag_li_idx,tag_li in enumerate(tag_list):
139 | new_tag_li = []; new_tag_li2 = []
140 | for tag_idx, tag in enumerate(tag_li):
141 | if (list(tag[0])[0] == word_idxs[0]):
142 | if (word_idxs[0] != list(tag[0])[1]): new_tag_li.append([[word_idxs[0],list(tag[0])[1]], tag[1]])
143 | elif (list(tag[0])[1] == word_idxs[0]):
144 | if (list(tag[0])[0] != word_idxs[0]): new_tag_li.append([[list(tag[0])[0],word_idxs[0]], tag[1]])
145 | elif (list(tag[0])[0] == word_idxs[1]):
146 | if (word_idxs[1] != list(tag[0])[1]): new_tag_li.append([[word_idxs[1],list(tag[0])[1]], tag[1]])
147 | elif (list(tag[0])[1] == word_idxs[1]):
148 | if (list(tag[0])[0] != word_idxs[1]): new_tag_li.append([[list(tag[0])[0],word_idxs[1]], tag[1]])
149 | for new_tag in new_tag_li:
150 | if new_tag not in tag_li:
151 | new_tag_li2.append(new_tag)
152 | del new_tag_li
153 | new_tag_li2 = new_tag_li2 + tag_li
154 | tag_list[tag_li_idx] = new_tag_li2
155 |
156 | #print("len of dependency_list:"+str(len(sum(tag_list, []))))
157 | Done = True
158 | while Done:
159 | origin_tag_list = tag_list.copy()
160 | for tag_li_idx, tag_li in enumerate(tag_list):
161 | ## tag_li = [tag1, tag2, ...]
162 | ## tag = [[set([dependent idx]), set([head idx])], [syntactic label, functional label]]
163 | conti = True
164 | while conti:
165 | conti_tf = 0
166 |
167 | ## case1
168 | ## (a<-b<-c) => (ab<-c)
169 | change_list = []
170 | tag_dist_1 = []  # arcs whose head and dependent are 1 apart
171 | for tag in tag_li:
172 | if ((max(list(tag[0])[1])-min(list(tag[0])[0])) == 1):tag_dist_1.append(tag)
173 |
174 | change_list, tag_dist_1, cnt = tag_case1(change_list, tag_dist_1)
175 | if (cnt != 0):
176 | conti_tf += cnt
177 | tag_list = change_tag(change_list, tag_list)
178 | tag_li = tag_list[tag_li_idx]
179 | #print("tag_case1 done")
180 |
181 | ## case2
182 | ## (a<-b, a<-c) => (ab<-c)
183 | change_list = []
184 | tag_dist_1 = []  # arcs whose head and dependent are 1 apart
185 | tag_dist_2 = []  # arcs whose head and dependent are 2 apart
186 |
187 | for tag in tag_li:
188 | if ((max(list(tag[0])[1])-min(list(tag[0])[0])) == 1):tag_dist_1.append(tag)
189 | elif((max(list(tag[0])[1])-min(list(tag[0])[0])) == 2):tag_dist_2.append(tag)
190 |
191 |
192 | change_list, tag_dist_1, tag_dist_2, cnt = tag_case2(change_list, tag_dist_1, tag_dist_2)
193 |
194 | if (cnt != 0):
195 | conti_tf += cnt
196 | tag_list = change_tag(change_list, tag_list)
197 | tag_li = tag_list[tag_li_idx]
198 | #print("tag_case2 done")
199 |
200 | if conti_tf == 0: conti = False
201 |
202 | if (origin_tag_list == tag_list): Done = False
203 |
204 | dependency_lists = sum(tag_list, [])
205 |
206 | sen_idxs = [set(), set()]
207 | for dep_idx, dependency_list in enumerate(dependency_lists):
208 | # dependency_list = [[set([w]), set([w2_idx])], label]
209 | word_idxs = dependency_list[0]
210 | sen_idxs[0].add(min(word_idxs[0]))
211 | sen_idxs[0].add(min(word_idxs[1]))
212 | sen_idxs[1].add(max(word_idxs[0])+1)
213 | sen_idxs[1].add(max(word_idxs[1])+1)
214 |
215 | # post-processing
216 | ## drop arcs with inconsistent chunk boundaries
217 | sen_idxs[0] = set(list(sen_idxs[0])[1:]); sen_idxs[1] = set(list(sen_idxs[1])[:-1])
218 | if len(sen_idxs[0]) != len(sen_idxs[1]):
219 | # print(sen_idxs[0].difference(sen_idxs[1]))
220 | # print(sen_idxs[1].difference(sen_idxs[0]))
221 | # print("len of dependency_lists: "+str(len(dependency_lists)))
222 | del_dependency_lists = []
223 | for dep_idx, dependency_list in enumerate(dependency_lists):
224 | # dependency_list = [[set([w]), set([w2_idx])], label]
225 | if min(dependency_list[0][0]) in sen_idxs[0].difference(sen_idxs[1]):
226 | if dependency_list in dependency_lists: del_dependency_lists.append(dependency_list)
227 | if min(dependency_list[0][1]) in sen_idxs[0].difference(sen_idxs[1]):
228 | if dependency_list in dependency_lists: del_dependency_lists.append(dependency_list)
229 | if (max(dependency_list[0][0])+1) in sen_idxs[1].difference(sen_idxs[0]):
230 | if dependency_list in dependency_lists: del_dependency_lists.append(dependency_list)
231 | if (max(dependency_list[0][1])+1) in sen_idxs[1].difference(sen_idxs[0]):
232 | if dependency_list in dependency_lists: del_dependency_lists.append(dependency_list)
233 | new_dependency_lists = [dependency_list for dependency_list in dependency_lists if dependency_list not in del_dependency_lists]
234 | dependency_lists = new_dependency_lists
235 | del new_dependency_lists
236 | #print("len of dependency_lists: " + str(len(dependency_lists)))
237 |
238 | sen_idxs = [set(), set()]
239 | for dep_idx, dependency_list in enumerate(dependency_lists):
240 | # dependency_list = [[set([w]), set([w2_idx])], label]
241 | word_idxs = dependency_list[0]
242 | sen_idxs[0].add(min(word_idxs[0]))
243 | sen_idxs[0].add(min(word_idxs[1]))
244 | sen_idxs[1].add(max(word_idxs[0]) + 1)
245 | sen_idxs[1].add(max(word_idxs[1]) + 1)
246 | dependency_lists[dep_idx] = [[" ".join(sen_words[min(word_idxs[0]): 1 + max(word_idxs[0])]), " ".join(
247 | sen_words[min(word_idxs[1]): 1 + max(word_idxs[1])])]] + dependency_list
248 |
249 |
250 | new_sen_words = []
251 | for start_idx, end_idx in zip(sorted(sen_idxs[0]), sorted(sen_idxs[1])):
252 | new_sen_words.append([" ".join(sen_words[start_idx: end_idx]), [i for i in range(start_idx, end_idx)]])
253 |
254 | for dep_idx, dependency_list in enumerate(dependency_lists):
255 | dependency_lists[dep_idx][1] = [sorted(dependency_lists[dep_idx][1][0]), sorted(dependency_lists[dep_idx][1][1])]
256 | # dependency_lists[dep_idx][1] = [new_sen_words.index(dependency_list[0][0]), new_sen_words.index(dependency_list[0][1])]
257 |
258 | output = {"doc_id":data["doc_id"],
259 | "sentence": data["sentence"],
260 | "preprocess": data["preprocess"],
261 | "merge": {
262 | "origin": new_sen_words,
263 | "parsing": dependency_lists
264 | },
265 | "keysentence": data["keysentence"],
266 | "tag": data["tag"],
267 | "coarse_tag": data["coarse_tag"]
268 | }
269 | outputs.append(output)
270 | return outputs
271 |
272 | if __name__ == '__main__':
273 | inf_dir = "../../data/origin/DP_origin_preprocess.json"
274 | outf_dir = "../../data/origin/merge_origin_preprocess/origin.json"
275 | #
276 | with open(inf_dir, "r", encoding="utf-8") as inf:
277 | datas = json.load(inf)
278 | outputs = merge_tag(datas, CNJ=True)
279 |
280 | with open(outf_dir, "w", encoding="utf-8") as outf:
281 | json.dump(outputs, outf, ensure_ascii=False, indent=4)
283 |
284 |
--------------------------------------------------------------------------------
/src/functions/biattention.py:
--------------------------------------------------------------------------------
1 | # This code is taken from:
2 | # https://github.com/KLUE-benchmark/KLUE-baseline/blob/8a03c9447e4c225e806877a84242aea11258c790/klue_baseline/models/dependency_parsing.py
3 | import numpy as np
4 |
5 | import torch
6 | import torch.nn as nn
7 | from torch.nn.parameter import Parameter
8 | import torch.nn.functional as F
9 |
10 |
11 | class BiAttention(nn.Module):
12 | def __init__(
13 | self,
14 | input_size_encoder,
15 | input_size_decoder,
16 | num_labels,
17 | biaffine=True,
18 | **kwargs
19 | ):
20 | super(BiAttention, self).__init__()
21 | self.input_size_encoder = input_size_encoder
22 | self.input_size_decoder = input_size_decoder
23 | self.num_labels = num_labels
24 | self.biaffine = biaffine
25 |
26 | self.W_e = Parameter(torch.Tensor(self.num_labels, self.input_size_encoder))
27 | self.W_d = Parameter(torch.Tensor(self.num_labels, self.input_size_decoder))
28 | self.b = Parameter(torch.Tensor(self.num_labels, 1, 1))
29 | if self.biaffine:
30 | self.U = Parameter(
31 | torch.Tensor(
32 | self.num_labels, self.input_size_decoder, self.input_size_encoder
33 | )
34 | )
35 | else:
36 | self.register_parameter("U", None)
37 |
38 | self.reset_parameters()
39 |
40 | def reset_parameters(self):
41 | nn.init.xavier_uniform_(self.W_e)
42 | nn.init.xavier_uniform_(self.W_d)
43 | nn.init.constant_(self.b, 0.0)
44 | if self.biaffine:
45 | nn.init.xavier_uniform_(self.U)
46 |
47 | def forward(self, input_e, input_d, mask_d=None, mask_e=None):
48 | assert input_d.size(0) == input_e.size(0)
49 | batch, length_decoder, _ = input_d.size()
50 | _, length_encoder, _ = input_e.size()
51 |
52 | # input_d : [b, t, d]
53 | # input_e : [b, s, e]
54 | # out_d : [b, l, t, 1]
55 | # out_e : [b, l, 1, s]
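# Illustrative (hypothetical sizes): with batch b=2, lengths t=s=5,
# hidden d=e=768 and l=num_labels=3, BiAttention(768, 768, 3) applied to
# two [2, 5, 768] inputs returns scores of shape [2, 3, 5, 5].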
56 | out_d = torch.matmul(self.W_d, input_d.transpose(1, 2)).unsqueeze(3)
57 | out_e = torch.matmul(self.W_e, input_e.transpose(1, 2)).unsqueeze(2)
58 |
59 | if self.biaffine:
60 | # output : [b, 1, t, d] * [l, d, e] -> [b, l, t, e]
61 | output = torch.matmul(input_d.unsqueeze(1), self.U)
62 | # output : [b, l, t, e] * [b, 1, e, s] -> [b, l, t, s]
63 | output = torch.matmul(output, input_e.unsqueeze(1).transpose(2, 3))
64 | output = output + out_d + out_e + self.b
65 | else:
66 | output = out_d + out_e + self.b
67 |
68 | if mask_d is not None:
69 | output = (
70 | output
71 | * mask_d.unsqueeze(1).unsqueeze(3)
72 | * mask_e.unsqueeze(1).unsqueeze(2)
73 | )
74 |
75 | # input1 = (batch_size, input11, input12)
76 | # input2 = (batch_size, input21, input22)
77 | return output # (batch_size, output_size, input11, input21)
78 |
79 | class BiLinear(nn.Module):
80 | def __init__(self, left_features: int, right_features: int, out_features: int):
81 | super(BiLinear, self).__init__()
82 | self.left_features = left_features
83 | self.right_features = right_features
84 | self.out_features = out_features
85 |
86 | self.U = Parameter(torch.Tensor(self.out_features, self.left_features, self.right_features))
87 | self.W_l = Parameter(torch.Tensor(self.out_features, self.left_features))
88 | self.W_r = Parameter(torch.Tensor(self.out_features, self.right_features))
89 | self.bias = Parameter(torch.Tensor(out_features))
90 |
91 | self.reset_parameters()
92 |
93 | def reset_parameters(self) -> None:
94 | nn.init.xavier_uniform_(self.W_l)
95 | nn.init.xavier_uniform_(self.W_r)
96 | nn.init.constant_(self.bias, 0.0)
97 | nn.init.xavier_uniform_(self.U)
98 |
99 | def forward(self, input_left: torch.Tensor, input_right: torch.Tensor) -> torch.Tensor:
100 | left_size = input_left.size()
101 | right_size = input_right.size()
102 | assert left_size[:-1] == right_size[:-1], "batch size of left and right inputs mis-match: (%s, %s)" % (
103 | left_size[:-1],
104 | right_size[:-1],
105 | )
106 | batch = int(np.prod(left_size[:-1]))
107 |
108 | input_left = input_left.contiguous().view(batch, self.left_features)
109 | input_right = input_right.contiguous().view(batch, self.right_features)
110 |
111 | output = F.bilinear(input_left, input_right, self.U, self.bias)
112 | output = output + F.linear(input_left, self.W_l, None) + F.linear(input_right, self.W_r, None)
113 | return output.view(left_size[:-1] + (self.out_features,))
114 |
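# Illustrative (hypothetical sizes): BiLinear(768, 768, 9) fuses two
# [batch, seq, 768] inputs into scores of shape [batch, seq, 9] via a
# bilinear term plus two linear terms and a bias.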
--------------------------------------------------------------------------------
/src/functions/metric.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score
3 |
4 |
5 | def get_sklearn_score(predicts, corrects, idx2label):
6 | predicts = [idx2label[predict] for predict in predicts]
7 | corrects = [idx2label[correct] for correct in corrects]
8 | result = {"accuracy": accuracy_score(corrects, predicts),
9 | "macro_precision": precision_score(corrects, predicts, average="macro"),
10 | "micro_precision": precision_score(corrects, predicts, average="micro"),
11 | "macro_f1": f1_score(corrects, predicts, average="macro"),
12 | "micro_f1": f1_score(corrects, predicts, average="micro"),
13 | "macro_recall": recall_score(corrects, predicts, average="macro"),
14 | "micro_recall": recall_score(corrects, predicts, average="micro"),
15 | }
16 |
17 | for k, v in result.items():
18 | result[k] = round(v, 3)
19 | print(k + ": " + str(v))
20 | return result
21 |
22 |
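# Illustrative usage (hypothetical labels):
#   get_sklearn_score([0, 1, 1], [0, 1, 2], {0: "A", 1: "B", 2: "C"})
# prints and returns accuracy 0.667 together with macro/micro precision,
# recall and F1.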
23 |
--------------------------------------------------------------------------------
/src/functions/processor.py:
--------------------------------------------------------------------------------
1 | import json
2 | import logging
3 | import os
4 | from functools import partial
5 | from multiprocessing import Pool, cpu_count
6 |
7 | import numpy as np
8 | from tqdm import tqdm
9 |
10 | import transformers
11 | from transformers.file_utils import is_tf_available, is_torch_available
12 | from transformers.data.processors.utils import DataProcessor
13 |
14 | if is_torch_available():
15 | import torch
16 | from torch.utils.data import TensorDataset
17 |
18 | if is_tf_available():
19 | import tensorflow as tf
20 |
21 | logger = logging.getLogger(__name__)
22 |
23 |
24 |
25 | def convert_example_to_features(example, max_seq_length, is_training, max_sentence_length, language):
26 |
27 | # validate the example's labels
28 | # ========================================================
29 | label = None
30 | coarse_label = None
31 | if is_training:
32 | # Get label
33 | label = example.label
34 | coarse_label = example.coarse_label
35 |
36 | # if the given label does not exist in the label dictionary, emit None as the feature
37 | # If the label cannot be found, then skip this example.
38 | ## kind_of_label: the set of fine-grained labels
39 | kind_of_label = ["문제 정의", "가설 설정", "기술 정의", "제안 방법", "대상 데이터", "데이터처리", "이론/모형", "성능/효과", "후속연구"]#, "기타"]
40 | actual_text = kind_of_label[label] if label < len(kind_of_label) else label
41 | if actual_text not in kind_of_label:
42 | logger.warning("Could not find label: '%s' \n not in label list", actual_text)
43 | return None
44 |
45 | kind_of_coarse_label = ["연구 목적", "연구 방법", "연구 결과"]#, "기타"]
46 | actual_text = kind_of_coarse_label[coarse_label] if coarse_label < len(kind_of_coarse_label) else coarse_label
47 | if actual_text not in kind_of_coarse_label:
48 | logger.warning("Could not find coarse_label: '%s' \n not in coarse_label list", actual_text)
49 | return None
50 |
51 | # ========================================================
52 |
53 | # track alignment between words (eojeol) and tokens
54 | tok_to_orig_index = {"sentence": []} # one entry per token: index of its source word
55 | orig_to_tok_index = {"sentence": []} # one entry per word: position of its first token
56 | all_doc_tokens = {"sentence": []} # tokenization of the original text
57 | token_to_orig_map = {"sentence": []}
58 |
59 | for case in example.merge.keys():
60 | new_merge = []
61 | new_word = []
62 | idx = 0
63 | for merge_idx in example.merge[case]:
64 | for m_idx in merge_idx:
65 | new_word.append(example.doc_tokens[case][m_idx])
66 | new_word.append("")
67 | merge_idx = [m_idx+idx for m_idx in range(0,len(merge_idx))]
68 | new_merge.append(merge_idx)
69 | idx = max(merge_idx)+1
70 | new_merge.append([idx])
71 | idx+=1
72 | example.merge[case] = new_merge
73 | example.doc_tokens[case] = new_word
74 |
75 | for case in example.merge.keys():
76 | for merge_idx in example.merge[case]:
77 | for word_idx in merge_idx:
78 | # position of the first token produced by tokenizing this word
79 | orig_to_tok_index[case].append(len(tok_to_orig_index[case]))
80 | if (example.doc_tokens[case][word_idx] == ""):
81 | sub_tokens = [""]
82 | else: sub_tokens = tokenizer.tokenize(example.doc_tokens[case][word_idx])
83 | for sub_token in sub_tokens:
84 | # store the token
85 | all_doc_tokens[case].append(sub_token)
86 | # word index for this token
87 | tok_to_orig_index[case].append(word_idx)
88 | # token_to_orig_map: {token:word}
89 | #token_to_orig_map[case][len(tok_to_orig_index[case]) - 1] = len(orig_to_tok_index[case]) - 1
90 | token_to_orig_map[case].append(len(orig_to_tok_index[case]) - 1)
91 |
92 | # print("tok_to_orig_index\n"+str(tok_to_orig_index))
93 | # print("orig_to_tok_index\n"+str(orig_to_tok_index))
94 | # print("all_doc_tokens\n"+str(all_doc_tokens))
95 | # print("token_to_orig_map\n\tindex of token : index of word\n\t"+str(token_to_orig_map))
96 |
97 | # =========================================================
98 | if language == "bert":
99 | ## check that the maximum length is not exceeded
100 | if int(transformers.__version__[0]) <= 3:
101 | assert len(all_doc_tokens["sentence"]) + 2 <= tokenizer.max_len
102 | else:
103 | assert len(all_doc_tokens["sentence"]) + 2 <= tokenizer.model_max_length
104 |
105 | input_ids = [tokenizer.cls_token_id]
106 | if language == "KorSciBERT":
107 | input_ids += sum([tokenizer.convert_tokens_to_ids([token]) for token in all_doc_tokens["sentence"]], [])
108 | word_idxs = [0] + list(filter(lambda x: input_ids[x] == tokenizer.convert_tokens_to_ids([""])[0], range(len(input_ids))))
109 | else:
110 | input_ids += [tokenizer.convert_tokens_to_ids(token) for token in all_doc_tokens["sentence"]]
111 | word_idxs = [0] + list(filter(lambda x: input_ids[x] == tokenizer.convert_tokens_to_ids(""), range(len(input_ids))))
112 |
113 | input_ids += [tokenizer.sep_token_id]
114 |
115 | token_type_ids = [0] * len(input_ids)
116 |
117 | position_ids = list(range(0, len(input_ids)))
118 |
119 | # non_padded_ids: token ids excluding padding
120 | non_padded_ids = [i for i in input_ids]
121 |
122 | # non_padded_tokens: tokens excluding padding
123 | non_padded_tokens = tokenizer.convert_ids_to_tokens(non_padded_ids)
124 |
125 | attention_mask = [1]*len(input_ids)
126 |
127 | paddings = [tokenizer.pad_token_id]*(max_seq_length - len(input_ids))
128 |
129 | if tokenizer.padding_side == "right":
130 | input_ids += paddings
131 | attention_mask += [0]*len(paddings)
132 | token_type_ids += paddings
133 | position_ids += paddings
134 | else:
135 | input_ids = paddings + input_ids
136 | attention_mask = [0]*len(paddings) + attention_mask
137 | token_type_ids = paddings + token_type_ids
138 | position_ids = paddings + position_ids
139 |
140 | word_idxs = [x+len(paddings) for x in word_idxs]
141 |
142 | # """
143 | # mean pooling
144 | not_word_list = []
145 | for k, p_idx in enumerate(word_idxs[1:]):
146 | not_word_idxs = [0] * len(input_ids);
147 | for j in range(word_idxs[k] + 1, p_idx):
148 | not_word_idxs[j] = 1 / (p_idx - word_idxs[k] - 1)
149 | not_word_list.append(not_word_idxs)
150 | not_word_list = not_word_list + [[0] * len(input_ids)] * (
151 | max_sentence_length - len(not_word_list))
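# Illustrative (hypothetical positions): with chunk-boundary positions
# word_idxs = [0, 3, 6], the first weight vector assigns 1/2 to tokens 1-2
# and the second assigns 1/2 to tokens 4-5, so a matmul with the token
# embeddings mean-pools each chunk.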
152 |
153 |
154 | """
155 | # (a,b, |a-b|, a*b)
156 | not_word_list = [[], []]
157 | for k, p_idx in enumerate(word_idxs[1:]):
158 | not_word_list[0].append(word_idxs[k] + 1)
159 | not_word_list[1].append(p_idx - 1)
160 | not_word_list[0] = not_word_list[0] + [int(word_idxs[-1]+i+2) for i in range(0, (max_sentence_length - len(not_word_list)))]
161 | not_word_list[1] = not_word_list[1] + [int(word_idxs[-1] + i + 2) for i in range(0, (max_sentence_length - len(not_word_list)))]
162 | """
163 |
164 | # p_mask: mask with 0 for token which belong premise and hypothesis including CLS TOKEN
165 | # and with 1 otherwise.
166 | # Original TF implem also keep the classification token (set to 0)
167 | p_mask = np.ones_like(token_type_ids)
168 | if tokenizer.padding_side == "right":
169 | # [CLS] P [SEP] H [SEP] PADDING
170 | p_mask[:len(all_doc_tokens["sentence"]) + 1] = 0
171 | else:
172 | p_mask[-(len(all_doc_tokens["sentence"]) + 1): ] = 0
173 |
174 | # pad_token_indices: positions of padding within input_ids
175 | pad_token_indices = np.array(range(len(non_padded_ids), len(input_ids)))
176 | # special_token_indices: positions of special tokens
177 | special_token_indices = np.asarray(
178 | tokenizer.get_special_tokens_mask(input_ids, already_has_special_tokens=True)
179 | ).nonzero()
180 |
181 | p_mask[pad_token_indices] = 1
182 | p_mask[special_token_indices] = 1
183 |
184 | # Set the cls index to 0: the CLS index can be used for impossible answers
185 | # Identify the position of the CLS token
186 | cls_index = input_ids.index(tokenizer.cls_token_id)
187 |
188 | p_mask[cls_index] = 0
189 |
190 | # dependency = [[tail, head, dependency], [], ...]
191 | if example.dependency["sentence"] == [[]]:
192 | example.dependency["sentence"] = [[max_sentence_length-1,max_sentence_length-1,0] for _ in range(0,max_sentence_length)]
193 | else:
194 | example.dependency["sentence"] = example.dependency["sentence"] + [[max_sentence_length-1,max_sentence_length-1,0] for i in range(0, abs(max_sentence_length-len(example.dependency["sentence"])))]
195 |
196 | dependency = example.dependency["sentence"]
197 |
198 | return CLASSIFIERFeatures(
199 | input_ids,
200 | attention_mask,
201 | token_type_ids,
202 | position_ids,
203 | cls_index,
204 | p_mask.tolist(),
205 | example_index=0,
206 | tokens=non_padded_tokens,
207 | token_to_orig_map=token_to_orig_map,
208 | label = label,
209 | coarse_label = coarse_label,
210 | doc_id = example.doc_id,
211 | language = language,
212 | dependency = dependency,
213 | not_word_list = not_word_list,
214 | )
215 |
216 |
217 |
218 | def convert_example_to_features_init(tokenizer_for_convert):
219 | global tokenizer
220 | tokenizer = tokenizer_for_convert
221 |
222 |
223 | def convert_examples_to_features(
224 | examples,
225 | tokenizer,
226 | max_seq_length,
227 | is_training,
228 | return_dataset=False,
229 | threads=1,
230 | max_sentence_length = 0,
231 | tqdm_enabled=True,
232 | language = None,
233 | ):
234 | """
235 | Converts a list of examples into a list of features that can be directly given as input to a model.
236 | It is model-dependent and takes advantage of many of the tokenizer's features to create the model's inputs.
237 |
238 | Args:
239 | examples: list of :class:`CLASSIFIERExample`
240 | tokenizer: an instance of a child of :class:`~transformers.PreTrainedTokenizer`
241 | max_seq_length: The maximum sequence length of the inputs.
242 | is_training: whether to create features for model training or evaluation.
243 | return_dataset: Default False. Either 'pt' or 'tf'.
244 | if 'pt': returns a torch.data.TensorDataset,
245 | if 'tf': returns a tf.data.Dataset
246 | threads: number of worker processes for multiprocessing
247 | max_sentence_length: maximum number of chunks per sentence
248 | tqdm_enabled: whether to show progress bars
249 | language: model family ("bert", "roberta" or "KorSciBERT")
250 |
251 | Returns:
252 | list of :class:`CLASSIFIERFeatures`
253 |
254 | Example::
255 |
256 | processor = CLASSIFIERProcessor()
257 | examples = processor.get_dev_examples(data_dir, filename=eval_file)
258 |
259 | features, dataset = convert_examples_to_features(
260 | examples=examples,
261 | tokenizer=tokenizer,
262 | max_seq_length=args.max_seq_length,
263 | is_training=not evaluate,
264 | return_dataset="pt",
265 | )
266 | """
268 |
269 | # Defining helper methods
270 | features = []
271 | threads = min(threads, cpu_count())
272 | with Pool(threads, initializer=convert_example_to_features_init, initargs=(tokenizer,)) as p:
273 |
274 | # annotate_ = collects the features for a single example into a list
275 | # annotate_ = list(feature1, feature2, ...)
276 | annotate_ = partial(
277 | convert_example_to_features,
278 | max_seq_length=max_seq_length,
279 | max_sentence_length=max_sentence_length,
280 | is_training=is_training,
281 | language = language,
282 | )
283 |
284 | # apply annotate_ to all examples
285 | # features = list( feature1, feature2, feature3, ... )
286 | ## len(features) == len(examples)
287 | features = list(
288 | tqdm(
289 | p.imap(annotate_, examples, chunksize=32),
290 | total=len(examples),
291 | desc="convert bert examples to features",
292 | disable=not tqdm_enabled,
293 | )
294 | )
295 | new_features = []
296 | example_index = 0  # example id; len(features) == len(examples)
297 | for example_feature in tqdm(
298 | features, total=len(features), desc="add example index", disable=not tqdm_enabled
299 | ):
300 | if not example_feature:
301 | continue
302 |
303 | example_feature.example_index = example_index
304 | new_features.append(example_feature)
305 | example_index += 1
306 |
307 | features = new_features
308 | del new_features
309 |
310 | if return_dataset == "pt":
311 | if not is_torch_available():
312 | raise RuntimeError("PyTorch must be installed to return a PyTorch dataset.")
313 |
314 | # Convert to Tensors and build dataset
315 | all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
316 | all_attention_masks = torch.tensor([f.attention_mask for f in features], dtype=torch.long)
317 |
318 | ## RoBERTa doesn’t have token_type_ids, you don’t need to indicate which token belongs to which segment.
319 | all_token_type_ids = torch.tensor([f.token_type_ids for f in features], dtype=torch.long)
320 | all_position_ids = torch.tensor([f.position_ids for f in features], dtype=torch.long)
321 |
322 | all_cls_index = torch.tensor([f.cls_index for f in features], dtype=torch.long)
323 | all_p_mask = torch.tensor([f.p_mask for f in features], dtype=torch.float)
324 |
325 | all_example_indices = torch.tensor([f.example_index for f in features], dtype=torch.long)
326 | all_feature_index = torch.arange(all_input_ids.size(0), dtype=torch.long)  # running index over all features
327 |
328 | # all_dependency = [[[premise_tail, premise_head, dependency], [], ...],[[hypothesis_tail, hypothesis_head, dependency], [], ...]], [[],[]], ... ]
329 | all_dependency = torch.tensor([f.dependency for f in features], dtype=torch.long)
330 |
331 | all_not_word_list = torch.tensor([f.not_word_list for f in features], dtype=torch.float)
332 |
333 | if not is_training:
334 | dataset = TensorDataset(
335 | all_input_ids,
336 | all_attention_masks, all_token_type_ids, all_position_ids,
337 | all_cls_index, all_p_mask, all_feature_index,
338 | all_dependency,
339 | all_not_word_list
340 | )
341 | else:
342 | all_labels = torch.tensor([f.label for f in features], dtype=torch.long)
343 | all_coarse_labels = torch.tensor([f.coarse_label for f in features], dtype=torch.long)
344 | # label_dict = {"entailment": 0, "contradiction": 1, "neutral": 2}
345 | # all_labels = torch.tensor([label_dict[f.label] for f in features], dtype=torch.long)
346 |
347 | dataset = TensorDataset(
348 | all_input_ids,
349 | all_attention_masks,
350 | all_token_type_ids,
351 | all_position_ids,
352 | all_labels,
353 | all_coarse_labels,
354 | all_cls_index,
355 | all_p_mask,
356 | all_example_indices,
357 | all_feature_index,
358 | all_dependency,
359 | all_not_word_list
360 | )
361 |
362 | return features, dataset
363 | else:
364 | return features
365 |
366 | class CLASSIFIERProcessor(DataProcessor):
367 | train_file = None
368 | dev_file = None
369 |
370 | def _get_example_from_tensor_dict(self, tensor_dict, evaluate=False):
371 | if not evaluate:
372 | gold_label = None
373 | gold_coarse_label = None
374 | label = tensor_dict["tag"].numpy().decode("utf-8")
375 | coarse_label = tensor_dict["coarse_tag"].numpy().decode("utf-8")
376 | else:
377 | gold_label = tensor_dict["tag"].numpy().decode("utf-8")
378 | gold_coarse_label = tensor_dict["coarse_tag"].numpy().decode("utf-8")
379 | label = None
380 | coarse_label = None
381 |
382 | return CLASSIFIERExample(
383 | doc_id=tensor_dict["doc_id"].numpy().decode("utf-8"),
384 | # sentid=tensor_dict["sentid"].numpy().decode("utf-8"),
385 | sentence=tensor_dict["sentence"].numpy().decode("utf-8"),
386 | preprocess=tensor_dict["preprocess"].numpy().decode("utf-8"),
387 | parsing=tensor_dict["merge"]["parsing"].numpy().decode("utf-8"),
388 | merge=tensor_dict["merge"]["origin"].numpy().decode("utf-8"),
389 | keysentence=tensor_dict["keysentence"].numpy().decode("utf-8"),
390 | label=label,
391 | coarse_label=coarse_label,
392 | gold_label=gold_label,
393 | gold_coarse_label = gold_coarse_label
394 | )
394 |
395 | def get_examples_from_dataset(self, dataset, evaluate=False):
396 | """
397 | Creates a list of :class:`CLASSIFIERExample` using a TFDS dataset.
398 |
399 | Args:
400 | dataset: The tfds dataset loaded from `tensorflow_datasets.load("squad")`
401 | evaluate: boolean specifying if in evaluation mode or in training mode
402 |
403 | Returns:
404 | List of CLASSIFIERExample
405 |
406 | Examples::
407 |
408 | import tensorflow_datasets as tfds
409 | dataset = tfds.load("squad")
410 |
411 | training_examples = get_examples_from_dataset(dataset, evaluate=False)
412 | evaluation_examples = get_examples_from_dataset(dataset, evaluate=True)
413 | """
414 |
415 | if evaluate:
416 | dataset = dataset["validation"]
417 | else:
418 | dataset = dataset["train"]
419 |
420 | examples = []
421 | for tensor_dict in tqdm(dataset):
422 | examples.append(self._get_example_from_tensor_dict(tensor_dict, evaluate=evaluate))
423 |
424 | return examples
425 |
426 | def get_train_examples(self, data_dir, filename=None, depend_embedding = None):
427 | """
428 | Returns the training examples from the data directory.
429 |
430 | Args:
431 | data_dir: Directory containing the data files used for training and evaluating.
432 | filename: None by default.
433 |
434 | """
435 | if data_dir is None:
436 | data_dir = ""
437 |
438 | #if self.train_file is None:
439 | # raise ValueError("CLASSIFIERProcessor should be instantiated via CLASSIFIERV1Processor.")
440 |
441 | with open(
442 | os.path.join(data_dir, self.train_file if filename is None else filename), "r", encoding="utf-8"
443 | ) as reader:
444 | input_data = json.load(reader)
445 | return self._create_examples(input_data, 'train', self.train_file if filename is None else filename)
446 |
447 | def get_dev_examples(self, data_dir, filename=None, depend_embedding = None):
448 | """
449 | Returns the evaluation example from the data directory.
450 |
451 | Args:
452 | data_dir: Directory containing the data files used for training and evaluating.
453 | filename: None by default.
454 | """
455 | if data_dir is None:
456 | data_dir = ""
457 |
458 | #if self.dev_file is None:
459 | # raise ValueError("CLASSIFIERProcessor should be instantiated via CLASSIFIERV1Processor.")
460 |
461 | with open(
462 | os.path.join(data_dir, self.dev_file if filename is None else filename), "r", encoding="utf-8"
463 | ) as reader:
464 | input_data = json.load(reader)
465 | return self._create_examples(input_data, "dev", self.dev_file if filename is None else filename)
466 |
467 | def get_example_from_input(self, input_dictionary):
468 |
469 | doc_id = input_dictionary["doc_id"]
470 | keysentence = input_dictionary["keysentence"]
471 | # sentid = input_dictionary["sentid"]
472 | sentence = input_dictionary["sentence"]
473 |
474 | label = None
475 | coarse_label = None
476 | gold_label = None
477 | gold_coarse_label = None
478 |
479 | examples = [CLASSIFIERExample(
480 | doc_id=doc_id,
481 | # sentid=sentid,
482 | keysentence=keysentence,
483 | sentence=sentence,
484 | gold_label=gold_label,
485 | gold_coarse_label=gold_coarse_label,
486 | label=label,
487 | coarse_label=coarse_label,
488 | )]
489 | return examples
490 |
491 | def _create_examples(self, input_data, set_type, data_file):
492 | is_training = set_type == "train"
493 | num = 0
494 | examples = []
495 | for entry in tqdm(input_data):
496 |
497 | doc_id = entry["doc_id"]
498 | # sentid = entry["sentid"]
499 | sentence = entry["sentence"]
500 | preprocess = entry["preprocess"]
501 | merge = entry["merge"]["origin"]
502 | parsing = entry["merge"]["parsing"]
503 | keysentence= entry["keysentence"]
504 |
505 | label = None
506 | coarse_label = None
507 | gold_label = None
508 | gold_coarse_label = None
509 | if is_training:
510 | label = entry["tag"]
511 | coarse_label = entry["coarse_tag"]
512 | else:
513 | gold_label = entry["tag"]
514 | gold_coarse_label = entry["coarse_tag"]
515 |
516 |
517 |
518 | example = CLASSIFIERExample(
519 | doc_id=doc_id,
520 | # sentid=sentid,
521 | keysentence=keysentence,
522 | sentence=sentence,
523 | preprocess=preprocess,
524 | parsing=parsing,
525 | merge=merge,
526 | gold_label=gold_label,
527 | gold_coarse_label=gold_coarse_label,
528 | label=label,
529 | coarse_label=coarse_label,
530 | )
531 | examples.append(example)
532 | # len(examples) == len(input_data)
533 | return examples
534 |
535 |
536 | class CLASSIFIERV1Processor(CLASSIFIERProcessor):
537 | train_file = "train.json"
538 | dev_file = "dev.json"
539 |
540 |
541 | class CLASSIFIERExample(object):
542 | def __init__(
543 | self,
544 | doc_id,
545 | # sentid,
546 | sentence,
547 | preprocess,
548 | parsing,
549 | merge,
550 | keysentence,
551 | gold_label=None,
552 | gold_coarse_label=None,
553 | label=None,
554 | coarse_label=None,
555 | ):
556 | self.doc_id = doc_id
557 | # self.sentid = sentid
558 | self.keysentence = keysentence
559 | self.sentence = sentence
560 | self.preprocess = preprocess
561 | self.parsing = parsing
562 | self.merge = merge
563 |
564 | label_dict = {'문제 정의': 0, '가설 설정': 1, '기술 정의': 2, '제안 방법': 3, '대상 데이터': 4, '데이터처리': 5, '이론/모형': 6, '성능/효과': 7, '후속연구': 8, '기타': 9}
565 | coarse_label_dict = {'연구 목적': 0, '연구 방법': 1, '연구 결과': 2, '기타': 3}
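# e.g. label "제안 방법" maps to 3 and coarse label "연구 방법" maps to 1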
566 | if gold_label in label_dict.keys():
567 | gold_label = label_dict[gold_label]
568 | if gold_coarse_label in coarse_label_dict.keys():
569 | gold_coarse_label = coarse_label_dict[gold_coarse_label]
570 | self.gold_label = gold_label
571 | self.gold_coarse_label = gold_coarse_label
572 |
573 | if coarse_label in coarse_label_dict.keys():
574 | coarse_label = coarse_label_dict[coarse_label]
575 | if label in label_dict.keys():
576 | label = label_dict[label]
577 | self.label = label
578 | self.coarse_label = coarse_label
579 |
580 | # doc_tokens: list of words (eojeol) split on whitespace
581 | ## sentence1 sentence2
582 | self.doc_tokens = {"sentence":self.preprocess.strip().split()}
583 |
584 | # merge: per-chunk lists of word-level start positions
585 | merge_word = []; check_merge_word = []
586 | merge_index = []
587 | for merge in self.merge:
588 | if merge != []: merge_index.append(merge[1])
589 |
590 | # dependency label inventory
591 | depend2idx = {"None":0}; idx2depend ={0:"None"}
592 | for depend1 in ['DP', 'L', 'NP', 'IP', 'PAD', 'VP', 'VNP', 'X', 'AP']:
593 | for depend2 in ['MOD', 'OBJ', 'CNJ', 'CMP', 'SBJ', 'None', 'AJT']:
594 | depend2idx[depend1 + "-" + depend2] = len(depend2idx)
595 | idx2depend[len(idx2depend)] = depend1 + "-" + depend2
596 |
597 | if ([words for words in self.parsing if words[2][0] != words[2][1]] == []): merge_word.append([])
598 | else:
599 | for words in self.parsing:
600 | if words[2][0] != words[2][1]:
601 | w1 = merge_index.index(words[1][0])
602 | w2 = merge_index.index(words[1][1])
603 | dep = depend2idx["-".join(words[2])]
604 | if [w1,w2] not in check_merge_word:
605 | check_merge_word.append([w1, w2])
606 | merge_word.append([w1,w2,dep])
607 | else:
608 | check_index = check_merge_word.index([w1,w2])
609 | now_dep = idx2depend[merge_word[check_index][2]].split("-")[1]
610 | if (words[2][1] in ['SBJ', 'CNJ', 'OBJ']) and(now_dep in ['CMP', 'MOD', 'AJT', 'None', "UNDEF"]):
611 | merge_word[check_index][2] = dep
612 |
613 | del check_merge_word
614 | self.merge = {"sentence":merge_index}
615 | self.dependency = {"sentence":merge_word}
616 |
617 | class CLASSIFIERFeatures(object):
618 | def __init__(
619 | self,
620 | input_ids,
621 | attention_mask,
622 | token_type_ids,
623 | position_ids,
624 | cls_index,
625 | p_mask,
626 | example_index,
627 | token_to_orig_map,
628 | doc_id,
629 | tokens,
630 | label,
631 | coarse_label,
632 | language,
633 | dependency,
634 | not_word_list,
635 | ):
636 | self.input_ids = input_ids
637 | self.attention_mask = attention_mask
638 | self.token_type_ids = token_type_ids
639 | self.position_ids = position_ids
640 | self.cls_index = cls_index
641 | self.p_mask = p_mask
642 |
643 | self.example_index = example_index
644 | self.token_to_orig_map = token_to_orig_map
645 | self.doc_id = doc_id
646 | self.tokens = tokens
647 |
648 | self.label = label
649 | self.coarse_label = coarse_label
650 |
651 | self.dependency = dependency
652 |
653 | self.not_word_list = not_word_list
654 |
655 |
656 | class KLUEResult(object):
657 | def __init__(self, example_index, label_logits, gold_label=None, cls_logits=None):
658 | self.label_logits = label_logits
659 | self.example_index = example_index
660 |
661 | if gold_label:
662 | self.gold_label = gold_label
663 | self.cls_logits = cls_logits
664 |
--------------------------------------------------------------------------------
/src/functions/utils.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import random
3 | import torch
4 | import numpy as np
5 | import os
6 |
7 | from src.functions.processor import (
8 | CLASSIFIERProcessor,
9 | convert_examples_to_features
10 | )
11 |
12 | def init_logger():
13 | logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
14 | datefmt='%m/%d/%Y %H:%M:%S',
15 | level=logging.INFO)
16 |
17 | def set_seed(args):
18 | random.seed(args.seed)
19 | np.random.seed(args.seed)
20 | torch.manual_seed(args.seed)
21 | if not args.no_cuda and torch.cuda.is_available():
22 | torch.cuda.manual_seed_all(args.seed)
23 |
24 | # convert a tensor into a plain Python list
25 | def to_list(tensor):
26 | return tensor.detach().cpu().tolist()
27 |
28 |
29 | # load the dataset
30 | def load_examples(args, tokenizer, evaluate=False, output_examples=False, do_predict=False, input_dict=None):
31 | '''
32 |
33 | :param args: hyperparameters
34 | :param tokenizer: tokenizer used for tokenization
35 | :param evaluate: True for evaluation or open test
36 | :param output_examples: True for evaluation or open test; if True, examples and features are returned as well
37 | :param do_predict: True for open test
38 | :param input_dict: dictionary of documents and questions given at open-test time
39 | :return:
40 | examples : list holding each raw example, regardless of max_length
41 | features : original text split and tokenized according to max_length
42 | dataset : input ids converted to tensors, split by max_length and fed directly to training
43 | '''
44 | input_dir = args.data_dir
45 | print("Creating features from dataset file at {}".format(input_dir))
46 |
47 | # declare the processor
48 | ## holds the train/dev JSON data file names
49 | processor = CLASSIFIERProcessor()
50 |
51 | # open test
52 | if do_predict:
53 | ## input_dict: dictionary of guid, premise and hypothesis
54 | # examples = processor.get_example_from_input(input_dict)
55 | examples = processor.get_dev_examples(os.path.join(args.data_dir),
56 | filename=args.predict_file)
57 | # evaluation
58 | elif evaluate:
59 | examples = processor.get_dev_examples(os.path.join(args.data_dir),
60 | filename=args.eval_file)
61 | # training
62 | else:
63 | examples = processor.get_train_examples(os.path.join(args.data_dir),
64 | filename=args.train_file)
65 |
66 | # build features and the tensor dataset
67 | features, dataset = convert_examples_to_features(
68 | examples=examples,
69 | tokenizer=tokenizer,
70 | max_seq_length=args.max_seq_length,
71 | is_training=not evaluate,
72 | return_dataset="pt",
73 | threads=args.threads,
74 | max_sentence_length = args.max_sentence_length,
75 | language = args.model_name_or_path.split("/")[-2]
76 | )
77 | if output_examples:
78 | ## examples, features and dataset are index-aligned
79 | return dataset, examples, features
80 | return dataset
81 |
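# Illustrative call (with args as configured in run_baseline_torch.py):
#   dataset, examples, features = load_examples(
#       args, tokenizer, evaluate=True, output_examples=True)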
--------------------------------------------------------------------------------
/src/model/main_functions_multi.py:
--------------------------------------------------------------------------------
1 | import os
2 | import numpy as np
3 | import pandas as pd
4 | import torch
5 | import timeit
6 | from fastprogress.fastprogress import master_bar, progress_bar
7 | from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
8 | from transformers.file_utils import is_torch_available
9 |
10 | from transformers import (
11 | AdamW,
12 | get_linear_schedule_with_warmup
13 | )
14 |
15 | from src.functions.utils import load_examples, set_seed, to_list
16 | from src.functions.metric import get_sklearn_score
17 |
18 | from sklearn.metrics import confusion_matrix
19 | from functools import partial
20 |
21 | def train(args, model, tokenizer, logger):
22 | max_f1 = 0.89
23 | max_acc = 0.89
24 | # load the dataset used for training
25 | ## dataset: TensorDataset containing
26 | ## all_input_ids,
27 | # all_attention_masks,
28 | # all_labels,
29 | # all_cls_index,
30 | # all_p_mask,
31 | # all_example_indices,
32 | # all_feature_index
33 |
34 | train_dataset = load_examples(args, tokenizer, evaluate=False, output_examples=False)
35 |
36 | # random sampler and DataLoader to draw batch_size tokenized examples at a time
37 | ## RandomSampler: draws data indices in random order
38 | ## SequentialSampler: draws data indices in a fixed order
39 | train_sampler = RandomSampler(train_dataset)
40 |
41 | train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=args.train_batch_size)
42 |
43 | # t_total: total optimization steps
44 | # computed over all of training for the optimization schedule
45 | t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
46 |
47 | # apply weight decay per parameter group
48 | no_decay = ["bias", "LayerNorm.weight"]
49 | optimizer_grouped_parameters = [
50 | {
51 | "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
52 | "weight_decay": args.weight_decay,
53 | },
54 | {
55 | "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
56 | "weight_decay": 0.0},
57 | ]
58 |
59 | # declare the optimizer and scheduler
60 | optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
61 | scheduler = get_linear_schedule_with_warmup(
62 | optimizer, num_warmup_steps=args.warmup_steps, num_training_steps=t_total
63 | )
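   | # The schedule from get_linear_schedule_with_warmup: the learning rate rises
   | # linearly from 0 to args.learning_rate over the first warmup_steps optimizer
   | # steps, then decays linearly back to 0 at step t_total.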
64 |
65 | # Training Step
66 | logger.info("***** Running training *****")
67 | logger.info(" Num examples = %d", len(train_dataset))
68 | logger.info(" Num Epochs = %d", args.num_train_epochs)
69 | logger.info(" Train batch size per GPU = %d", args.train_batch_size)
70 | logger.info(" Total train batch size (w. parallel, distributed & accumulation) = %d", args.train_batch_size * args.gradient_accumulation_steps)
71 | logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps)
72 | logger.info(" Total optimization steps = %d", t_total)
73 |
74 | global_step = 1
75 | if not args.from_init_weight: global_step += int(args.checkpoint)
76 |
77 | tr_loss, logging_loss = 0.0, 0.0
78 |
79 | # zero out the gradient buffers
80 | model.zero_grad()
81 |
82 | mb = master_bar(range(int(args.num_train_epochs)))
83 | set_seed(args)
84 |
85 | epoch_idx=0
86 | if not args.from_init_weight: epoch_idx += int(args.checkpoint)
87 |
88 | for epoch in mb:
89 | epoch_iterator = progress_bar(train_dataloader, parent=mb)
90 | for step, batch in enumerate(epoch_iterator):
91 | # set the model to train mode
92 | model.train()
93 | batch = tuple(t.to(args.device) for t in batch)
94 |
95 | # gather the input tensors fed to the model
96 | inputs_list = ["input_ids", "attention_mask", "token_type_ids", "position_ids"]
97 | inputs_list.append("labels")
98 | inputs_list.append("coarse_labels")
99 | inputs = dict()
100 | for n, name in enumerate(inputs_list): inputs[name] = batch[n]
101 | inputs_list2 = ['word_idxs', 'span']
102 | for m, name in enumerate(inputs_list2):
103 | inputs[name] = batch[-(m+1)]
104 |
105 | # compute and store the loss
106 | ## outputs = (total_loss,) + outputs
107 | outputs = model(**inputs)
108 | loss = outputs[0]
109 |
110 | # a larger batch size dampens the noisy gradients that arise during training, making unstable training more stable
111 | # "gradient_accumulation_steps" emulates the effect of a larger batch size
112 | ## effective batch size = batch_size * gradient_accumulation_steps
113 | # e.g., batch_size: 16
114 | # gradient_accumulation_steps: 2
115 | # not identical to a true batch size of 32, but it shows a similar effect
116 | if args.gradient_accumulation_steps > 1:
117 | loss = loss / args.gradient_accumulation_steps
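   | # Minimal sketch of the accumulation pattern in isolation (hedged; `loader` and
   | # `accum_steps` are illustrative names, mirroring the loop below):
   | #   for step, batch in enumerate(loader):
   | #       loss = model(**batch)[0] / accum_steps  # scale so the accumulated gradient
   | #       loss.backward()                         # approximates one large-batch step
   | #       if (step + 1) % accum_steps == 0:
   | #           optimizer.step(); model.zero_grad()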
118 |
119 |
120 | ## the loss built from batch_size inputs retains
121 | ## the characteristics of those inputs (it differs depending on how the loss is reduced)
122 | ### loss_fct = CrossEntropyLoss(ignore_index=ignored_index, reduction = ?)
123 | ### reduction = mean : average over the input data
124 | loss.backward()
125 | tr_loss += loss.item()
126 |
127 |
128 | # print the loss
129 | if (global_step + 1) % 50 == 0:
130 | print("{} step processed.. Current Loss : {}".format((global_step+1),loss.item()))
131 |
132 | if (step + 1) % args.gradient_accumulation_steps == 0:
133 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)
134 |
135 | optimizer.step()
136 | scheduler.step() # Update learning rate schedule
137 | model.zero_grad()
138 | global_step += 1
139 |
140 | epoch_idx += 1
141 | logger.info("***** Eval results *****")
142 | results = evaluate(args, model, tokenizer, logger, epoch_idx = str(epoch_idx), tr_loss = loss.item())
143 |
144 | # model save
145 | f1_key = "macro_f1_score" if "macro_f1_score" in results.keys() else "macro_f1"
146 | if (max_acc < float(results["accuracy"])) or (max_f1 < float(results[f1_key])):
147 | if max_acc < float(results["accuracy"]): max_acc = float(results["accuracy"])
148 | if max_f1 < float(results[f1_key]): max_f1 = float(results[f1_key])
149 |
150 |
151 |
152 | # create the model save directory
153 | output_dir = os.path.join(args.output_dir, "model/checkpoint-{}".format(epoch_idx))
154 | if not os.path.exists(output_dir):
155 | os.makedirs(output_dir)
156 |
157 | # save the trained weights and vocab
158 | ## a pretrained model is saved with model.save_pretrained(...)
159 | ## a model built as a plain nn.Module is saved with torch.save(...)
160 | ### a model that uses both must be saved in both ways!
161 | model.save_pretrained(output_dir)
162 | if (args.model_name_or_path.split("/")[-2] != "KorSciBERT"): tokenizer.save_pretrained(output_dir)
163 | torch.save(args, os.path.join(output_dir, "training_args.bin"))
164 | logger.info("Saving model checkpoint to %s", output_dir)
165 |
166 | mb.write("Epoch {} done".format(epoch + 1))
167 |
168 | return global_step, tr_loss / global_step
169 |
170 | # function for evaluation on data with pre-attached gold labels
171 | def evaluate(args, model, tokenizer, logger, epoch_idx = "", tr_loss = 1):
172 | # load the dataset
173 | ## dataset: dataset in tensor form
174 | ## examples: the original dataset in json form
175 | ## features: the examples as a list with index numbers attached
176 | dataset, examples, features = load_examples(args, tokenizer, evaluate=True, output_examples=True)
177 |
178 | # create the directory for saving the final output files
179 | if not os.path.exists(args.output_dir):
180 | os.makedirs(args.output_dir)
181 |
182 | # sequential sampler and DataLoader for fetching the tokenized data in batches of batch_size
183 | ## RandomSampler: picks data indices in random order
184 | ## SequentialSampler: always iterates data indices in the same order
185 | eval_sampler = SequentialSampler(dataset)
186 | eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
187 |
188 | # Eval!
189 | logger.info("***** Running evaluation {} *****".format(epoch_idx))
190 | logger.info(" Num examples = %d", len(dataset))
191 | logger.info(" Batch size = %d", args.eval_batch_size)
192 |
193 | # timer for measuring evaluation time
194 | start_time = timeit.default_timer()
195 |
196 | # predicted logits (float buffers; model outputs are float)
197 | pred_logits = torch.tensor([], dtype=torch.float).to(args.device)
198 | pred_coarse_logits = torch.tensor([], dtype=torch.float).to(args.device)
199 | for batch in progress_bar(eval_dataloader):
201 | # switch the model to eval mode
201 | model.eval()
202 | batch = tuple(t.to(args.device) for t in batch)
203 |
204 | with torch.no_grad():
205 | # gather the inputs needed for evaluation
206 | inputs_list = ["input_ids", "attention_mask", "token_type_ids", "position_ids"]
207 | inputs = dict()
208 | for n, name in enumerate(inputs_list): inputs[name] = batch[n]
209 |
210 | inputs_list2 = ['word_idxs', 'span']
211 | for m, name in enumerate(inputs_list2): inputs[name] = batch[-(m + 1)]
212 |
213 | # outputs = (label_logits, )
214 | # label_logits: [batch_size, num_labels]
215 | outputs = model(**inputs)
216 |
217 | pred_logits = torch.cat([pred_logits,outputs[0][1]], dim = 0)
218 | pred_coarse_logits = torch.cat([pred_coarse_logits, outputs[0][0]], dim=0)
219 |
220 | # compare pred_label with gold_label
221 | pred_logits= pred_logits.detach().cpu().numpy()
222 | pred_coarse_logits= pred_coarse_logits.detach().cpu().numpy()
223 | pred_labels = np.argmax(pred_logits, axis=-1)
224 | pred_coarse_labels = np.argmax(pred_coarse_logits, axis=-1)
225 | ## gold_labels
226 | gold_labels = [example.gold_label for example in examples]
227 | gold_coarse_labels = [example.gold_coarse_label for example in examples]
228 |
229 | # print('\n\n=====================outputs=====================')
230 | # for g,p in zip(gold_labels, pred_labels):
231 | # print(str(g)+"\t"+str(p))
232 | # print('===========================================================')
233 |
234 | # measure elapsed evaluation time
235 | evalTime = timeit.default_timer() - start_time
236 | logger.info(" Evaluation done in total %f secs (%f sec per example)", evalTime, evalTime / len(dataset))
237 |
238 | # evaluate performance from the final predictions and the examples holding the original text
239 | ## results = {"macro_precision":round(macro_precision, 4), "macro_recall":round(macro_recall, 4), "macro_f1_score":round(macro_f1_score, 4), \
240 | ## "accuracy":round(total_accuracy, 4), \
241 | ## "micro_precision":round(micro_precision, 4), "micro_recall":round(micro_recall, 4), "micro_f1":round(micro_f1_score, 4)}
242 | idx2label = {0: '문제 정의', 1: '가설 설정', 2: '기술 정의', 3: '제안 방법', 4: '대상 데이터', 5: '데이터처리', 6: '이론/모형', 7: '성능/효과', 8: '후속연구'}# , 9: '기타'}
243 | idx2coarse_label = {0: '연구 목적', 1: '연구 방법', 2: '연구 결과'} # , 3: '기타'}
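   | # (English glosses: problem definition, hypothesis setting, technology definition,
   | #  proposed method, target data, data processing, theory/model, performance/effect,
   | #  follow-up research; coarse: research purpose, research method, research result)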
244 |
245 | # results = get_score(pred_labels, gold_labels, idx2label)
246 | results = get_sklearn_score(pred_labels, gold_labels, idx2label)
247 | coarse_results = get_sklearn_score(pred_coarse_labels, gold_coarse_labels, idx2coarse_label)
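   | # Hedged sketch of what get_sklearn_score plausibly wraps (src/functions/metric.py
   | # is not shown here; key names and rounding follow the results-dict comment above):
   | #   from sklearn.metrics import precision_recall_fscore_support, accuracy_score
   | #   p, r, f1, _ = precision_recall_fscore_support(gold, pred, average="macro")
   | #   results = {"macro_precision": round(p, 4), "macro_recall": round(r, 4),
   | #              "macro_f1_score": round(f1, 4), "accuracy": round(accuracy_score(gold, pred), 4)}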
248 |
249 | output_dir = os.path.join(args.output_dir, 'eval')
250 |
251 | out_file_type = 'a'
252 | if not os.path.exists(output_dir):
253 | os.makedirs(output_dir)
255 | out_file_type = 'w'
255 |
256 | # create a file for storing evaluation-script-based performance
257 | if os.path.exists(args.model_name_or_path):
258 | print(args.model_name_or_path)
259 | eval_file_name = list(filter(None, args.model_name_or_path.split("/"))).pop()
260 | else:
261 | eval_file_name = "init_weight"
262 | output_eval_file = os.path.join(output_dir, "eval_result_{}.txt".format(eval_file_name))
263 |
264 | with open(output_eval_file, out_file_type, encoding='utf-8') as f:
265 | f.write("train loss: {}\n".format(tr_loss))
266 | f.write("epoch: {}\n".format(epoch_idx))
267 | f.write("세부분류 성능\n")
268 | for k in results.keys():
269 | f.write("{} : {}\n".format(k, results[k]))
270 | f.write("\ncoarse classification performance\n")
271 | for k in coarse_results.keys():
272 | f.write("{} : {}\n".format(k, coarse_results[k]))
273 |
274 | # confusion_matrix expects (y_true, y_pred)
275 | confusion_m = confusion_matrix(gold_labels, pred_labels)
276 | # micro accuracy = correct predictions (diagonal) / total count
277 | diagonal = [confusion_m[i][i] for i in range(len(confusion_m))]
278 | all_cnt = confusion_m.sum()
279 | micro_accuracy = round(sum(diagonal) / all_cnt, 4)
280 | f.write("micro_accuracy: " + str(micro_accuracy) + "\n")
281 | print("micro_accuracy: " + str(micro_accuracy))
282 |
283 | f.write("=======================================\n\n")
284 | return results
285 |
286 | def predict(args, model, tokenizer):
287 | dataset, examples, features = load_examples(args, tokenizer, evaluate=True, output_examples=True, do_predict=True)
288 |
289 | # create the directory for saving the final output files
290 | if not os.path.exists(args.output_dir):
291 | os.makedirs(args.output_dir)
292 |
293 | # sequential sampler and DataLoader for fetching the tokenized data in batches of batch_size
294 | ## RandomSampler: picks data indices in random order
295 | ## SequentialSampler: always iterates data indices in the same order
296 | eval_sampler = SequentialSampler(dataset)
297 | eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=args.eval_batch_size)
298 |
299 | print("***** Running Prediction *****")
300 | print(" Num examples = %d", len(dataset))
301 |
302 | # predicted logits (float buffers; model outputs are float)
303 | pred_coarse_logits = torch.tensor([], dtype=torch.float).to(args.device)
304 | pred_logits = torch.tensor([], dtype=torch.float).to(args.device)
305 | for batch in progress_bar(eval_dataloader):
307 | # switch the model to eval mode
307 | model.eval()
308 | batch = tuple(t.to(args.device) for t in batch)
309 |
310 | with torch.no_grad():
311 | # gather the inputs needed for evaluation
312 | inputs_list = ["input_ids", "attention_mask", "token_type_ids", "position_ids"]
313 | inputs = dict()
314 | for n, name in enumerate(inputs_list): inputs[name] = batch[n]
315 |
316 | inputs_list2 = ['word_idxs', 'span']
317 | for m, name in enumerate(inputs_list2): inputs[name] = batch[-(m + 1)]
318 |
319 | # outputs = (label_logits, )
320 | # label_logits: [batch_size, num_labels]
321 | outputs = model(**inputs)
322 |
323 | pred_logits = torch.cat([pred_logits, outputs[0][1]], dim=0)
324 | pred_coarse_logits = torch.cat([pred_coarse_logits, outputs[0][0]], dim=0)
325 |
326 | # compare pred_label with gold_label
327 | pred_logits = pred_logits.detach().cpu().numpy()
328 | pred_coarse_logits = pred_coarse_logits.detach().cpu().numpy()
329 | pred_labels = np.argmax(pred_logits, axis=-1)
330 | pred_coarse_labels = np.argmax(pred_coarse_logits, axis=-1)
331 | ## gold_labels
332 | gold_labels = [example.gold_label for example in examples]
333 | gold_coarse_labels = [example.gold_coarse_label for example in examples]
334 |
335 | idx2label = {0: '문제 정의', 1: '가설 설정', 2: '기술 정의', 3: '제안 방법', 4: '대상 데이터', 5: '데이터처리', 6: '이론/모형', 7: '성능/효과', 8: '후속연구'}#, 9: '기타'}
336 | idx2coarse_label = {0: '연구 목적', 1: '연구 방법', 2: '연구 결과'} # , 3: '기타'}
337 |
338 | # results = get_score(pred_labels, gold_labels, idx2label)
339 | results = get_sklearn_score(pred_labels, gold_labels, idx2label)
340 | coarse_results = get_sklearn_score(pred_coarse_labels, gold_coarse_labels, idx2coarse_label)
341 |
342 | print("result of get_sklearn_score")
343 | for k in coarse_results.keys():
344 | print("{} : {}\n".format(k, coarse_results[k]))
345 | for k in results.keys():
346 | print("{} : {}\n".format(k, results[k]))
347 |
355 |
356 |
357 | # save test-script-based performance
358 | output_dir = os.path.join(args.output_dir, 'test')
359 |
360 | out_file_type = 'a'
361 | if not os.path.exists(output_dir):
362 | os.makedirs(output_dir)
363 | out_file_type = 'w'
364 |
365 | ## create a file for storing test-script-based performance
366 | if os.path.exists(args.model_name_or_path):
367 | print(args.model_name_or_path)
368 | eval_file_name = list(filter(None, args.model_name_or_path.split("/"))).pop()
369 | else:
370 | eval_file_name = "init_weight"
371 |
372 | ## fine-grained results per coarse prediction
373 | print("===== fine-grained results per coarse prediction =====")
374 | coarse_new_output_1 = torch.zeros([9,9])
375 | coarse_new_output_2 = torch.zeros([9,9])
376 | coarse_new_output_3 = torch.zeros([9,9])
377 |
378 | for cp, cg, g, p in zip(pred_coarse_labels, gold_coarse_labels, gold_labels, pred_labels):
379 | if cp == 0:
380 | coarse_new_output_1[p][g] += torch.tensor(1)
381 | elif cp == 1:
382 | coarse_new_output_2[p][g] += torch.tensor(1)
383 | elif cp == 2:
384 | coarse_new_output_3[p][g] += torch.tensor(1)
385 |
386 |
387 | print("============대분류 결과====================")
388 | for co in zip(coarse_new_output_1, coarse_new_output_2, coarse_new_output_3):
389 | print(co)
390 |
391 |
392 | ### save the incorrectly predicted data
393 | out_incorrect = {"sentence": [], "correct": [], "predict": []}
394 | print('\n\n=====================outputs=====================')
395 | for i,(g,p) in enumerate(zip(gold_labels, pred_labels)):
396 | if g != p:
397 | out_incorrect["sentence"].append(examples[i].sentence)
398 | out_incorrect["correct"].append(idx2label[g])
399 | out_incorrect["predict"].append(idx2label[p])
400 | df_incorrect = pd.DataFrame(out_incorrect)
401 | df_incorrect.to_csv(os.path.join(output_dir, "test_result_{}_incorrect.csv".format(eval_file_name)), index=False)
402 |
403 | ### save all data (each row appended exactly once)
404 | out = {"sentence": [], "correct": [], "predict": []}
405 | for i, (g, p) in enumerate(zip(gold_labels, pred_labels)):
406 | out["sentence"].append(examples[i].sentence)
407 | out["correct"].append(idx2label[g])
408 | out["predict"].append(idx2label[p])
409 |
410 | df = pd.DataFrame(out)
411 | df.to_csv(os.path.join(output_dir, "test_result_{}.csv".format(eval_file_name)), index=False)
412 |
413 | return results
414 |
--------------------------------------------------------------------------------
/src/model/model_multi.py:
--------------------------------------------------------------------------------
1 | # model += Parsing Info Collecting layer (PIC)
2 |
3 | from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
4 | import torch.nn as nn
5 | import torch
6 | import torch.nn.functional as F
7 |
8 | from transformers import BertModel, RobertaModel
9 |
10 | import transformers
11 | if int(transformers.__version__[0]) <= 3:
12 | from transformers.modeling_roberta import RobertaPreTrainedModel
13 | from transformers.modeling_bert import BertPreTrainedModel
14 | from transformers.modeling_electra import ElectraModel, ElectraPreTrainedModel
15 | else:
16 | from transformers.models.roberta.modeling_roberta import RobertaPreTrainedModel
17 | from transformers.models.bert.modeling_bert import BertPreTrainedModel
18 | from transformers.models.electra.modeling_electra import ElectraPreTrainedModel
19 |
20 | from src.functions.biattention import BiAttention, BiLinear
21 |
22 |
23 | class KorSciBERTForSequenceClassification(BertPreTrainedModel):
24 |
25 | def __init__(self, config, max_sentence_length, path):
26 | super(KorSciBERTForSequenceClassification, self).__init__(config)
27 | self.num_labels = config.num_labels
28 | self.num_coarse_labels = 3
29 | self.config = config
30 |
31 | self.bert = transformers.BertModel.from_pretrained(path, config=self.config)
32 |
33 | # add the special token
34 | self.config.vocab_size = 15330 + 1
35 | self.bert.resize_token_embeddings(self.config.vocab_size)
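   | # Hedged note: the +1 accounts for one special token added on the tokenizer side;
   | # with the tokenizer object in scope this is usually written as
   | #   self.bert.resize_token_embeddings(len(tokenizer))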
36 |
37 | # given token1 and token2 in the input, treat (index of token1, index of token2) as one span and learn information about it
38 | self.span_info_collect = SICModel1(config)
39 | #self.span_info_collect = SICModel2(config)
40 |
41 | # combine the premise and hypothesis span information via biaffine attention, then normalize
42 | self.parsing_info_collect = PICModel(config, max_sentence_length)
43 |
44 | classifier_dropout = (
45 | config.hidden_dropout_prob if config.hidden_dropout_prob is not None else 0.1
46 | )
47 |
48 | # coarse classification
49 | self.dropout1 = nn.Dropout(classifier_dropout)
50 | self.classifier1 = nn.Linear(config.hidden_size, self.num_coarse_labels)
51 |
52 | # fine-grained classification
53 | self.dropout2 = nn.Dropout(classifier_dropout)
54 | self.classifier2 = nn.Linear(config.hidden_size+self.num_coarse_labels, self.num_labels)
55 |
56 | self.reset_parameters()  # self.init_weights()
57 |
58 | def forward(
59 | self,
60 | input_ids=None,
61 | attention_mask=None,
62 | token_type_ids=None,
63 | position_ids=None,
64 | head_mask=None,
65 | inputs_embeds=None,
66 | labels=None,
67 | coarse_labels=None,
68 | span=None,
69 | word_idxs=None,
70 | ):
71 | batch_size = input_ids.shape[0]
72 | discriminator_hidden_states = self.bert(
73 | input_ids=input_ids,
74 | attention_mask=attention_mask,
75 | )
76 | # last-layer hidden state
77 | # sequence_output: [batch_size, seq_length, hidden_size]
78 | sequence_output = discriminator_hidden_states[0]
79 |
80 | # span info collecting layer(SIC)
81 | h_ij = self.span_info_collect(sequence_output, word_idxs)
82 |
83 | # parser info collecting layer(PIC)
84 | hidden_states = self.parsing_info_collect(h_ij,
85 | batch_size= batch_size,
86 | span=span,)
87 |
88 | # coarse classification
89 | hidden_states1 = self.dropout1(hidden_states)
90 | logits1 = self.classifier1(hidden_states1)
91 |
92 | # concat
93 | concat_hidden_states = torch.cat((logits1, hidden_states), dim=1)
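   | # shapes (hedged): hidden_states is (batch, hidden_size) and logits1 is
   | # (batch, num_coarse_labels=3), so the concatenation is (batch, hidden_size + 3),
   | # matching classifier2's input dimension declared above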
94 |
95 | # fine-grained classification
96 | hidden_states2 = self.dropout2(concat_hidden_states)
97 | logits2 = self.classifier2(hidden_states2)
98 |
99 | #logits = logits1
100 | logits = [logits1, logits2]
101 | outputs = (logits, ) + discriminator_hidden_states[2:]
102 |
103 | if labels is not None:
104 | if self.num_labels == 1:
105 | # We are doing regression
106 | loss_fct = MSELoss()
107 | loss1 = loss_fct(logits1.view(-1), coarse_labels.view(-1))
108 | loss2 = loss_fct(logits2.view(-1), labels.view(-1))
109 | else:
110 | loss_fct = CrossEntropyLoss()
111 | loss1 = loss_fct(logits1.view(-1, self.num_coarse_labels), coarse_labels.view(-1))
112 | loss2 = loss_fct(logits2.view(-1, self.num_labels), labels.view(-1))
113 |
114 | loss = loss1 + loss2
115 | # print("loss: "+str(loss))
116 | outputs = (loss,) + outputs
117 |
118 | return outputs # (loss), logits, (hidden_states), (attentions)
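   | # Hedged usage sketch of this forward pass during training (argument values are
   | # illustrative; output indexing follows the tuples built above):
   | #   outputs = model(input_ids=ids, attention_mask=mask, labels=fine_labels,
   | #                   coarse_labels=coarse_labels, span=span, word_idxs=word_idxs)
   | #   loss = outputs[0]
   | #   coarse_logits, fine_logits = outputs[1]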
119 |
120 | def reset_parameters(self):
121 | # nn.Dropout holds no parameters, so only the classifier layers are reset
122 | self.classifier1.reset_parameters()
123 | self.classifier2.reset_parameters()
124 |
125 |
126 | class BertForSequenceClassification(BertPreTrainedModel):
127 |
128 | def __init__(self, config, max_sentence_length):
129 | super().__init__(config)
130 | self.num_labels = config.num_labels
131 | self.num_coarse_labels = 3
132 | self.config = config
133 | self.bert = BertModel(config)
134 |
135 | # given token1 and token2 in the input, treat (index of token1, index of token2) as one span and learn information about it
136 | self.span_info_collect = SICModel1(config)
137 | #self.span_info_collect = SICModel2(config)
138 |
139 | # combine the premise and hypothesis span information via biaffine attention, then normalize
140 | self.parsing_info_collect = PICModel(config, max_sentence_length) # chunking + tag info + bert-biaffine attention + bilstm + bert-bilinear classification
141 |
142 | classifier_dropout = (
143 | config.hidden_dropout_prob if config.hidden_dropout_prob is not None else 0.1
144 | )
145 | self.dropout1 = nn.Dropout(classifier_dropout)
146 | self.dropout2 = nn.Dropout(classifier_dropout)
147 | self.classifier1 = nn.Linear(config.hidden_size, self.num_coarse_labels)
148 | self.classifier2 = nn.Linear(config.hidden_size+self.num_coarse_labels, config.num_labels)
149 |
150 | self.init_weights()
151 |
152 | def forward(
153 | self,
154 | input_ids=None,
155 | attention_mask=None,
156 | token_type_ids=None,
157 | position_ids=None,
158 | head_mask=None,
159 | inputs_embeds=None,
160 | labels=None,
161 | coarse_labels=None,
162 | span=None,
163 | word_idxs=None,
164 | ):
165 | batch_size = input_ids.shape[0]
166 | discriminator_hidden_states = self.bert(
167 | input_ids=input_ids,
168 | attention_mask=attention_mask,
169 | token_type_ids=token_type_ids,
170 | position_ids=position_ids,
171 | )
172 | # last-layer hidden state
173 | # sequence_output: [batch_size, seq_length, hidden_size]
174 | sequence_output = discriminator_hidden_states[0]
175 |
176 | # span info collecting layer(SIC)
177 | h_ij = self.span_info_collect(sequence_output, word_idxs)
178 |
179 | # parser info collecting layer(PIC)
180 | hidden_states = self.parsing_info_collect(h_ij,
181 | batch_size= batch_size,
182 | span=span,)
183 |
184 | # coarse classification
185 | hidden_states1 = self.dropout1(hidden_states)
186 | logits1 = self.classifier1(hidden_states1)
187 |
188 | # concat
189 | concat_hidden_states = torch.cat((logits1, hidden_states), dim=1)
190 |
191 | # fine-grained classification
192 | hidden_states2 = self.dropout2(concat_hidden_states)
193 | logits2 = self.classifier2(hidden_states2)
194 |
195 | logits = [logits1, logits2]
196 | outputs = (logits, ) + discriminator_hidden_states[2:]
197 |
198 | if labels is not None:
199 | if self.num_labels == 1:
200 | # We are doing regression
201 | loss_fct = MSELoss()
202 | loss1 = loss_fct(logits1.view(-1), coarse_labels.view(-1))
203 | loss2 = loss_fct(logits2.view(-1), labels.view(-1))
204 | else:
205 | loss_fct = CrossEntropyLoss()
206 | loss1 = loss_fct(logits1.view(-1, self.num_coarse_labels), coarse_labels.view(-1))
207 | loss2 = loss_fct(logits2.view(-1, self.num_labels), labels.view(-1))
208 | loss = loss1+loss2
209 | #print("loss: "+str(loss))
210 | outputs = (loss,) + outputs
211 |
212 | return outputs # (loss), logits, (hidden_states), (attentions)
213 |
214 | class RobertaForSequenceClassification(BertPreTrainedModel):
215 |
216 | def __init__(self, config, max_sentence_length):
217 | super().__init__(config)
218 | self.num_labels = config.num_labels
219 | self.num_coarse_labels = 3
220 | self.config = config
221 | self.roberta = RobertaModel(config)
222 |
223 | # given token1 and token2 in the input, treat (index of token1, index of token2) as one span and learn information about it
224 | self.span_info_collect = SICModel1(config)
225 | #self.span_info_collect = SICModel2(config)
226 |
227 | # combine the premise and hypothesis span information via biaffine attention, then normalize
228 | self.parsing_info_collect = PICModel(config, max_sentence_length) # chunking + tag info + bert-biaffine attention + bilstm + bert-bilinear classification
229 |
230 | classifier_dropout = (
231 | config.hidden_dropout_prob if config.hidden_dropout_prob is not None else 0.1
232 | )
233 | self.dropout1 = nn.Dropout(classifier_dropout)
234 | self.dropout2 = nn.Dropout(classifier_dropout)
235 | self.classifier1 = nn.Linear(config.hidden_size, self.num_coarse_labels)
236 | self.classifier2 = nn.Linear(config.hidden_size+self.num_coarse_labels, config.num_labels)
237 |
238 | self.init_weights()
239 |
240 | def forward(
241 | self,
242 | input_ids=None,
243 | attention_mask=None,
244 | token_type_ids=None,
245 | position_ids=None,
246 | head_mask=None,
247 | inputs_embeds=None,
248 | labels=None,
249 | coarse_labels=None,
250 | span=None,
251 | word_idxs=None,
252 | ):
253 | batch_size = input_ids.shape[0]
254 | discriminator_hidden_states = self.roberta(
255 | input_ids=input_ids,
256 | attention_mask=attention_mask,
257 | )
258 | # last-layer hidden state
259 | # sequence_output: [batch_size, seq_length, hidden_size]
260 | sequence_output = discriminator_hidden_states[0]
261 |
262 | # span info collecting layer(SIC)
263 | h_ij = self.span_info_collect(sequence_output, word_idxs)
264 |
265 | # parser info collecting layer(PIC)
266 | hidden_states = self.parsing_info_collect(h_ij,
267 | batch_size= batch_size,
268 | span=span,)
269 |
270 | # coarse classification
271 | hidden_states1 = self.dropout1(hidden_states)
272 | logits1 = self.classifier1(hidden_states1)
273 |
274 | # concat
275 | concat_hidden_states = torch.cat((logits1, hidden_states), dim=1)
276 |
277 | # fine-grained classification
278 | hidden_states2 = self.dropout2(concat_hidden_states)
279 | logits2 = self.classifier2(hidden_states2)
280 |
281 | logits = [logits1, logits2]
282 | outputs = (logits, ) + discriminator_hidden_states[2:]
283 |
284 | if labels is not None:
285 | if self.num_labels == 1:
286 | # We are doing regression
287 | loss_fct = MSELoss()
288 | loss1 = loss_fct(logits1.view(-1), coarse_labels.view(-1))
289 | loss2 = loss_fct(logits2.view(-1), labels.view(-1))
290 | else:
291 | loss_fct = CrossEntropyLoss()
292 | loss1 = loss_fct(logits1.view(-1, self.num_coarse_labels), coarse_labels.view(-1))
293 | loss2 = loss_fct(logits2.view(-1, self.num_labels), labels.view(-1))
294 | loss = loss1+loss2
295 | #print("loss: "+str(loss))
296 | outputs = (loss,) + outputs
297 |
298 | return outputs # (loss), logits, (hidden_states), (attentions)
299 |
300 | class SICModel1(nn.Module):
301 | def __init__(self, config):
302 | super().__init__()
303 |
304 | def forward(self, hidden_states, word_idxs):
305 | # (batch, max_pre_sen, seq_len) @ (batch, seq_len, hidden) = (batch, max_pre_sen, hidden)
306 | word_idxs = word_idxs.squeeze(1)
307 |
308 | sen = torch.matmul(word_idxs, hidden_states)
309 |
310 | return sen
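   | # Hedged illustration: word_idxs acts as an indicator/averaging matrix built in
   | # preprocessing, e.g. word_idxs[b, i, j] = 1/n_i when subword j belongs to chunk i
   | # (n_i subwords), so the matmul pools subword vectors into one vector per chunk.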
311 |
312 | class SICModel2(nn.Module):
313 | def __init__(self, config):
314 | super().__init__()
315 | self.hidden_size = config.hidden_size
316 |
317 | self.W_1 = nn.Linear(self.hidden_size, self.hidden_size)
318 | self.W_2 = nn.Linear(self.hidden_size, self.hidden_size)
319 | self.W_3 = nn.Linear(self.hidden_size, self.hidden_size)
320 | self.W_4 = nn.Linear(self.hidden_size, self.hidden_size)
321 |
322 | def forward(self, hidden_states, word_idxs):
323 | word_idxs = word_idxs.squeeze(1).long().to(hidden_states.device)
324 |
325 | W1_h = self.W_1(hidden_states) # (bs, length, hidden_size)
326 | W2_h = self.W_2(hidden_states)
327 | W3_h = self.W_3(hidden_states)
328 | W4_h = self.W_4(hidden_states)
329 |
330 | W1_hi_emb = torch.tensor([], dtype=torch.float, device=hidden_states.device)
331 | W2_hi_emb = torch.tensor([], dtype=torch.float, device=hidden_states.device)
332 | W3_hi_start_emb = torch.tensor([], dtype=torch.float, device=hidden_states.device)
333 | W3_hi_end_emb = torch.tensor([], dtype=torch.float, device=hidden_states.device)
334 | W4_hi_start_emb = torch.tensor([], dtype=torch.float, device=hidden_states.device)
335 | W4_hi_end_emb = torch.tensor([], dtype=torch.float, device=hidden_states.device)
336 | for i in range(0, hidden_states.shape[0]):
337 | sub_W1_hi_emb = torch.index_select(W1_h[i], 0, word_idxs[i][0]) # (max_seq_length, hidden_size)
338 | sub_W2_hi_emb = torch.index_select(W2_h[i], 0, word_idxs[i][1])
339 | sub_W3_hi_start_emb = torch.index_select(W3_h[i], 0, word_idxs[i][0])
340 | sub_W3_hi_end_emb = torch.index_select(W3_h[i], 0, word_idxs[i][1])
341 | sub_W4_hi_start_emb = torch.index_select(W4_h[i], 0, word_idxs[i][0])
342 | sub_W4_hi_end_emb = torch.index_select(W4_h[i], 0, word_idxs[i][1])
343 |
344 | W1_hi_emb = torch.cat((W1_hi_emb, sub_W1_hi_emb.unsqueeze(0)))
345 | W2_hi_emb = torch.cat((W2_hi_emb, sub_W2_hi_emb.unsqueeze(0)))
346 | W3_hi_start_emb = torch.cat((W3_hi_start_emb, sub_W3_hi_start_emb.unsqueeze(0)))
347 | W3_hi_end_emb = torch.cat((W3_hi_end_emb, sub_W3_hi_end_emb.unsqueeze(0)))
348 | W4_hi_start_emb = torch.cat((W4_hi_start_emb, sub_W4_hi_start_emb.unsqueeze(0)))
349 | W4_hi_end_emb = torch.cat((W4_hi_end_emb, sub_W4_hi_end_emb.unsqueeze(0)))
350 |
351 | # w1*hi + w2*hj + w3*(hi-hj) + w4*(hi⊙hj)
352 | span = W1_hi_emb + W2_hi_emb + (W3_hi_start_emb - W3_hi_end_emb) + torch.mul(W4_hi_start_emb, W4_hi_end_emb) # (batch_size, max_seq_length, hidden_size)
353 | h_ij = torch.tanh(span)
354 |
355 | return h_ij
356 |
357 |
358 | class PICModel(nn.Module):
359 | def __init__(self, config, max_sentence_length):
360 | super().__init__()
361 | self.hidden_size = config.hidden_size
362 | self.max_sentence_length = max_sentence_length
363 |
364 | # dependency relation types (phrase tag - function tag)
365 | depend2idx = {"None": 0}
366 | idx2depend = {0: "None"}
367 | for depend1 in ['IP', 'AP', 'DP', 'VP', 'VNP', 'S', 'R', 'NP', 'L', 'X']:
368 | for depend2 in ['CMP', 'MOD', 'SBJ', 'AJT', 'CNJ', 'OBJ', "UNDEF"]:
369 | depend2idx[depend1 + "-" + depend2] = len(depend2idx)
370 | idx2depend[len(idx2depend)] = depend1 + "-" + depend2
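   | # e.g. 10 phrase tags x 7 function tags = 70 combinations, plus "None",
   | # give len(idx2depend) == 71 embedding rows below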
371 | self.depend2idx = depend2idx
372 | self.idx2depend = idx2depend
373 | self.depend_embedding = nn.Embedding(len(idx2depend), self.hidden_size, padding_idx=0)
374 |
375 | self.reduction1 = nn.Linear(self.hidden_size , int(self.hidden_size // 3))
376 | self.reduction2 = nn.Linear(self.hidden_size , int(self.hidden_size // 3))
377 |
378 | self.biaffine = BiAttention(int(self.hidden_size // 3), int(self.hidden_size // 3), 100)
379 |
380 | self.bi_lstm = nn.LSTM(input_size=100, hidden_size=self.hidden_size//2, num_layers=1, bidirectional=True)
381 |
382 | def forward(self, hidden_states, batch_size, span):
383 | # hidden_states: [batch_size, n_chunks, hidden_size]
384 | # span: [batch_size, max_sentence_length, 3] as (head index, tail index, dependency-label index)
385 | # word_idxs: [batch_size, seq_length]
386 | # -> sequence_outputs: [batch_size, seq_length, hidden_size]
387 |
388 | # span: (batch, max_prem_len, 3) -> (batch, max_prem_len, hidden_size)
389 | new_span = torch.tensor([], dtype=torch.float, device=hidden_states.device)
390 |
391 | for i, spans in enumerate(span.tolist()):
392 | span_head = torch.tensor([s[0] for s in spans], device=hidden_states.device) # (max_prem_len)
393 | span_tail = torch.tensor([s[1] for s in spans], device=hidden_states.device)
394 | span_dep = torch.tensor([s[2] for s in spans], device=hidden_states.device)
395 |
396 | span_head = torch.index_select(hidden_states[i], 0, span_head) # (max_prem_len, hidden_size)
397 | span_tail = torch.index_select(hidden_states[i], 0, span_tail)
398 | span_dep = self.depend_embedding(span_dep)
399 |
400 | n_span = span_head + span_tail + span_dep
401 | new_span = torch.cat((new_span, n_span.unsqueeze(0)))
402 |
403 | span = new_span
404 | del new_span
405 |
406 | # biaffine attention
407 | # hidden_states: (batch_size, max_prem_len, hidden_size)
408 | # span: (batch, max_prem_len, hidden_size)
409 | # -> biaffine_outputs: [batch_size, 100, max_prem_len, max_prem_len]
410 | span = self.reduction1(span)
411 | hidden_states = self.reduction2(hidden_states)
412 |
413 | biaffine_outputs = self.biaffine(hidden_states, span)
414 |
415 | # bilstm
416 | # biaffine_outputs: [batch_size, 100, max_prem_len, max_prem_len] -> [batch_size, 100, max_prem_len] -> [max_prem_len, batch_size, 100]
417 | # -> hidden_states: [batch_size, max_sentence_length]
418 | biaffine_outputs = biaffine_outputs.mean(-1)
419 |
420 | biaffine_outputs = biaffine_outputs.transpose(1,2).transpose(0,1)
421 | states = None
422 |
423 | bilstm_outputs, states = self.bi_lstm(biaffine_outputs)
424 |
425 | hidden_states = states[0].transpose(0, 1).contiguous().view(batch_size, -1)
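   | # shape walkthrough (hedged): states[0] is the final hidden state with shape
   | # (num_layers * num_directions = 2, batch, hidden_size // 2); transpose + view
   | # concatenates the two direction states into (batch, hidden_size)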
426 |
427 | return hidden_states
428 |
429 | def reset_parameters(self):
430 | # only the modules this layer owns directly are reset here
431 | self.reduction1.reset_parameters()
432 | self.reduction2.reset_parameters()
433 |
434 |
435 |
436 |
437 |
438 |
439 |
--------------------------------------------------------------------------------