├── README.md
├── log
├── process_data
│   ├── SKPencoder4tree.py
│   ├── ann.json
│   ├── generate_tree.py
│   └── get_label.py
└── treelstmv3.0.py

/README.md:
--------------------------------------------------------------------------------
# treeLSTM
A personal PyTorch implementation of the ACL 2019 paper "Tree LSTMs with Convolution Units to Predict Stance and Rumor Veracity in Social Media Conversations".

# Performance
|Model|CH|SS|FG|OW|GC|
| ---------- | :----------: | :-----------: | :-----------: | :-----------: | :-----------: |
|My implementation | 0.511 | 0.470 | 0.471 | 0.484 | 0.559 |
|Original paper report| 0.514 | 0.579 | 0.553 | 0.469 | 0.547 |


The performance drops noticeably compared to the originally reported numbers, possibly for the following reasons:

1. Bugs in my implementation that I have not found yet. : )
2. Different deep learning frameworks: the authors build the TreeLSTM with DGL on top of PyTorch, while I build the whole model in pure PyTorch.
3. Different tweet encoding methods: I adopt SKP (the best encoding method in Table 3 of the paper); however, my SKP encoder is based on TensorFlow while the authors' is based on Theano.
4. Unclear details in the paper: the authors report neither the batch size used to train the model nor the initialization of the first hidden state and cell memory (i.e., h_{0}, c_{0}).

   The most important part - resampling the minority classes - is also missing. Its description in Section 3 of the paper is confusing, so I simply set the loss weight of the comment class to 0.5 and keep 1 for the other classes (see the snippet after this list); in my experiments this ratio matters a lot.

   Also, in the Child Conv Tree LSTM, it is unclear how to perform the convolution when a node has only one child, since the kernel height is 2!
5. Different training and development sets: the raw data is not clean and some threads contain more than one source tweet; I keep only the first one, so my dataset may not be identical to the one the authors used.

   Additionally, I have to point out that the dataset of the baseline (Lukasik et al., 2016) differs from the one the authors used (the per-class sample sizes do not match), so I think the comparison between the two methods is not meaningful.
6. *Just a question: in my view, training on four events and evaluating on the remaining one is ill-defined, because it amounts to tuning hyperparameters on the test set.*
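A minimal sketch of the workaround, mirroring the loss actually used in `treelstmv3.0.py`: the comment class (id 3) is down-weighted to 0.5, and the placeholder class 4 (missing annotation) is masked out both by a zero weight and by `ignore_index`.

```python
import torch
from torch import nn

# Class ids: 0 agree/support, 1 deny, 2 appeal-for-more-information,
# 3 comment, 4 missing annotation (kept only to preserve the tree shape).
loss = nn.CrossEntropyLoss(weight=torch.Tensor([1, 1, 1, 0.5, 0]),
                           ignore_index=4)
```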
# Code Structure
```
root
├── process_data
│   ├── generate_tree.py
│   ├── get_label.py
│   ├── SKPencoder4tree.py
│   └── ann.json # stance annotations read by get_label.py
├── treelstmv3.0.py
├── log # training logs
└── README.md
```
### generate_tree.py
Converts the raw data into 1) raw tweets (output: "{event}_tweets.txt"), 2) a cleaned tree with renumbered node ids (output: "{event}_tree.json"), and 3) a cleaned tree that keeps the original tweet ids (output: "{event}_ori_tree.json").

*p.s. Some threads contain more than one main tweet (which means they are not tree-structured anymore); I keep only the first main tweet and generate the tree from it.*
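For illustration, a hypothetical line of "{event}_tree.json" looks like this after renumbering (the source tweet becomes node "0" with two replies "1" and "2", and "2" has one nested reply "3"; leaves are shown here as JSON null, the case that generate_tree.py's `None` check guards against):

```python
import json

# one thread = one JSON line in "{event}_tree.json" (hypothetical example)
tree = json.loads('{"0": {"1": null, "2": {"3": null}}}')
assert tree["0"]["2"]["3"] is None
```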
### get_label.py
Gets the label of every tweet in the tree (JSON) form (output: "{event}_label.txt").

*p.s. Some threads lack a label in ann.json; I label those tweets with class 4 to keep the tree structure. Class 4 is ignored during training and validation.*
### SKPencoder4tree.py
Converts the raw tweets into SKP features using the TensorFlow skip-thoughts implementation from [https://github.com/tensorflow/models/tree/master/research/skip_thoughts] (output: "{event}v2.npy").

*The ACL authors used the Theano-based version; however, Theano is not well suited to Python 3.x and GPUs, so I just use the TF version.*
### treelstmv3.0.py
I implemented two models, the Child Conv Tree LSTM and the Child Sum Tree LSTM; performance is printed to the screen.

*It is hard to run the TreeLSTM in parallel on a GPU, so I keep the batch size equal to 1 and run the model on the CPU. Given the size of the training set, training does not cost much (less than one hour for 30 epochs).*
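For reference, `node_forward` in treelstmv3.0.py implements the standard child-sum update (following Tai et al., 2015); the Conv variant only replaces the sum $\tilde{h}$ with a max-pooling over a height-2 convolution applied to the stacked children hidden states:

$$
\tilde{h} = \sum_{k=1}^{K} h_k, \qquad
i = \sigma(W_i x + U_i \tilde{h}), \qquad
o = \sigma(W_o x + U_o \tilde{h}), \qquad
u = \tanh(W_u x + U_u \tilde{h}),
$$

$$
f_k = \sigma(W_f x + U_f h_k), \qquad
c = i \odot u + \sum_{k=1}^{K} f_k \odot c_k, \qquad
h = o \odot \tanh(c).
$$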
--------------------------------------------------------------------------------
/log:
--------------------------------------------------------------------------------
# Training Logs
# OW
```
evaluating at ow event
/opt/miniconda/lib/python3.7/site-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
current model over dev set macro f measure: 0.32814685314685316
epoch: 0, loss: 1.053495
current model over dev set macro f measure: 0.44771063494891894
epoch: 1, loss: 0.974285
current model over dev set macro f measure: 0.44353529759154486
epoch: 2, loss: 0.937010
current model over dev set macro f measure: 0.4546228323318485
epoch: 3, loss: 0.882013
current model over dev set macro f measure: 0.47094871656160253
epoch: 4, loss: 0.824249
current model over dev set macro f measure: 0.4843187379650678
epoch: 5, loss: 0.684245
current model over dev set macro f measure: 0.4842401018848174
epoch: 6, loss: 0.549408
current model over dev set macro f measure: 0.4787758366822009
epoch: 7, loss: 0.494834
current model over dev set macro f measure: 0.4748493709020024
epoch: 8, loss: 0.423444
current model over dev set macro f measure: 0.4717012288212587
epoch: 9, loss: 0.319792
current model over dev set macro f measure: 0.4725363062887186
epoch: 10, loss: 0.258566
current model over dev set macro f measure: 0.47113708246447766
epoch: 11, loss: 0.235613
current model over dev set macro f measure: 0.4664937999477784
epoch: 12, loss: 0.190939
current model over dev set macro f measure: 0.450311178310301
epoch: 13, loss: 0.169445
current model over dev set macro f measure: 0.4456715461894258
epoch: 14, loss: 0.153646
current model over dev set macro f measure: 0.44732877155442613
epoch: 15, loss: 0.142082
current model over dev set macro f measure: 0.45459050184279204
epoch: 16, loss: 0.130388
current model over dev set macro f measure: 0.45444273056812345
epoch: 17, loss: 0.118868
current model over dev set macro f measure: 0.4525942475008655
epoch: 18, loss: 0.107441
current model over dev set macro f measure: 0.45769082685522366
epoch: 19, loss: 0.096496
current mode
```
# SS
```
evaluating at ss event
/opt/miniconda/lib/python3.7/site-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
current model over dev set macro f measure: 0.38554649064831675
epoch: 0, loss: 0.899244
current model over dev set macro f measure: 0.44304105094607193
epoch: 1, loss: 0.790168
current model over dev set macro f measure: 0.44348542598542606
epoch: 2, loss: 0.722845
current model over dev set macro f measure: 0.4533632192309373
epoch: 3, loss: 0.658700
current model over dev set macro f measure: 0.46135303599087774
epoch: 4, loss: 0.591269
current model over dev set macro f measure: 0.4620673066681815
epoch: 5, loss: 0.521110
current model over dev set macro f measure: 0.470770101891832
epoch: 6, loss: 0.455678
current model over dev set macro f measure: 0.4549394759932677
epoch: 7, loss: 0.406527
current model over dev set macro f measure: 0.4456543314133934
epoch: 8, loss: 0.366591
current model over dev set macro f measure: 0.449490441071695
epoch: 9, loss: 0.330275
current model over dev set macro f measure: 0.45166051770258087
epoch: 10, loss: 0.303290
current model over dev set macro f measure: 0.44476839706064963
epoch: 11, loss: 0.266350
current model over dev set macro f measure: 0.43770881163384534
epoch: 12, loss: 0.250740
current model over dev set macro f measure: 0.4376775890935184
epoch: 13, loss: 0.223240
current model over dev set macro f measure: 0.429444655810157
epoch: 14, loss: 0.201249
current model over dev set macro f measure: 0.4299281721689988
epoch: 15, loss: 0.181897
current model over dev set macro f measure: 0.43136629452418923
epoch: 16, loss: 0.163990
current model over dev set macro f measure: 0.4333102766360029
epoch: 17, loss: 0.142856
current model over dev set macro f measure: 0.43385813167325205
epoch: 18, loss: 0.125632
current model over dev set macro f measure: 0.41829321701396244
epoch: 19, loss: 0.113375
current model
```
# GC
```
evaluating at gc event
/opt/miniconda/lib/python3.7/site-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
current model over dev set macro f measure: 0.45853892543859653
epoch: 0, loss: 0.909895
current model over dev set macro f measure: 0.4965610101371691
epoch: 1, loss: 0.844062
current model over dev set macro f measure: 0.5111332049131092
epoch: 2, loss: 0.743724
current model over dev set macro f measure: 0.5144678584506808
epoch: 3, loss: 0.579992
current model over dev set macro f measure: 0.5139490884072927
epoch: 4, loss: 0.408547
current model over dev set macro f measure: 0.5219542362399505
epoch: 5, loss: 0.290465
current model over dev set macro f measure: 0.5176111619514918
epoch: 6, loss: 0.209248
current model over dev set macro f measure: 0.47858596526203523
epoch: 7, loss: 0.155648
current model over dev set macro f measure: 0.49477931247958745
epoch: 8, loss: 0.121921
current model over dev set macro f measure: 0.49342863926197256
epoch: 9, loss: 0.098100
current model over dev set macro f measure: 0.5311771665187814
epoch: 10, loss: 0.083284
current model over dev set macro f measure: 0.5578165213781652
epoch: 11, loss: 0.074745
current model over dev set macro f measure: 0.5594617676682894
epoch: 12, loss: 0.058982
current model over dev set macro f measure: 0.5430516851569483
epoch: 13, loss: 0.044443
current model over dev set macro f measure: 0.5250667536911289
epoch: 14, loss: 0.037196
current model over dev set macro f measure: 0.5174511830761831
epoch: 15, loss: 0.031683
current model over dev set macro f measure: 0.5131174733015224
epoch: 16, loss: 0.026798
current model over dev set macro f measure: 0.5048771436384234
epoch: 17, loss: 0.021979
current model over dev set macro f measure: 0.4878327670955496
epoch: 18, loss: 0.017993
current model over dev set macro f measure: 0.4552380952380952
epoch: 19, loss: 0.014807
current model
```
# FG
```
evaluating at fg event
/opt/miniconda/lib/python3.7/site-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
current model over dev set macro f measure: 0.3026474244425453
epoch: 0, loss: 1.487222
current model over dev set macro f measure: 0.41397005144400734
epoch: 1, loss: 1.340546
current model over dev set macro f measure: 0.4150733518253834
epoch: 2, loss: 1.315479
current model over dev set macro f measure: 0.42160343735791417
epoch: 3, loss: 1.310306
current model over dev set macro f measure: 0.42371087365420756
epoch: 4, loss: 1.287965
current model over dev set macro f measure: 0.43990777138894926
epoch: 5, loss: 1.243475
current model over dev set macro f measure: 0.42479899239586805
epoch: 6, loss: 1.173173
current model over dev set macro f measure: 0.43246424669530886
epoch: 7, loss: 1.038354
current model over dev set macro f measure: 0.41814064462318923
epoch: 8, loss: 0.870224
current model over dev set macro f measure: 0.4059352875122074
epoch: 9, loss: 0.719146
current model over dev set macro f measure: 0.3994670129792829
epoch: 10, loss: 0.665530
current model over dev set macro f measure: 0.39870466264502613
epoch: 11, loss: 0.572283
current model over dev set macro f measure: 0.3951438663654032
epoch: 12, loss: 0.510518
current model over dev set macro f measure: 0.40380506047036147
epoch: 13, loss: 0.464287
current model over dev set macro f measure: 0.40255254425335896
epoch: 14, loss: 0.391928
current model over dev set macro f measure: 0.40362006291363745
epoch: 15, loss: 0.321788
current model over dev set macro f measure: 0.4053969382916751
epoch: 16, loss: 0.274579
current model over dev set macro f measure: 0.40333268764747676
epoch: 17, loss: 0.242197
current model over dev set macro f measure: 0.40604773405017536
epoch: 18, loss: 0.205969
current model over dev set macro f measure: 0.40862040079610973
epoch: 19, loss: 0.165544
current model over dev set macro f measure: 0.407309000425706
```
# CH
```
evaluating at ch event
/opt/miniconda/lib/python3.7/site-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
current model over dev set macro f measure: 0.34578634892494176
epoch: 0, loss: 1.413275
current model over dev set macro f measure: 0.4619220272022678
epoch: 1, loss: 1.357112
current model over dev set macro f measure: 0.46653980028925535
epoch: 2, loss: 1.193244
current model over dev set macro f measure: 0.47909918913803606
epoch: 3, loss: 0.988928
current model over dev set macro f measure: 0.500840845717071
epoch: 4, loss: 0.716135
current model over dev set macro f measure: 0.5059114025996478
epoch: 5, loss: 0.473390
current model over dev set macro f measure: 0.5112706315637225
epoch: 6, loss: 0.293499
current model over dev set macro f measure: 0.5034207166685517
epoch: 7, loss: 0.184502
current model over dev set macro f measure: 0.4869491664077007
epoch: 8, loss: 0.123235
current model over dev set macro f measure: 0.4979930094434016
epoch: 9, loss: 0.087779
current model over dev set macro f measure: 0.48556375463740686
epoch: 10, loss: 0.069376
current model over dev set macro f measure: 0.48725526413273446
epoch: 11, loss: 0.051334
current model over dev set macro f measure: 0.4778268796769385
epoch: 12, loss: 0.039942
current model over dev set macro f measure: 0.4803421388271305
epoch: 13, loss: 0.032236
current model over dev set macro f measure: 0.48219200503806
epoch: 14, loss: 0.026901
current model over dev set macro f measure: 0.46598598999789703
epoch: 15, loss: 0.023527
current model over dev set macro f measure: 0.47413962276439814
epoch: 16, loss: 0.018108
current model over dev set macro f measure: 0.4608831444438538
epoch: 17, loss: 0.024726
current model over dev set macro f measure: 0.4666727531367006
epoch: 18, loss: 0.014301
current model over dev set macro f measure: 0.45168713482631084
epoch: 19, loss: 0.010897
current model over dev set macro f mea
```
--------------------------------------------------------------------------------
/process_data/SKPencoder4tree.py:
--------------------------------------------------------------------------------
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import numpy as np
import configuration
import encoder_manager
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
# Set paths to the model.
#
VOCAB_FILE = "pretrained/uni/vocab.txt"
EMBEDDING_MATRIX_FILE = "pretrained/uni/embeddings.npy"
CHECKPOINT_PATH = "pretrained/uni/model.ckpt-501424"

# Set up the encoder. Here we are using a single unidirectional model.
# To use a bidirectional model as well, call load_model() again with
# configuration.model_config(bidirectional_encoder=True) and paths to the
# bidirectional model's files. The encoder will use the concatenation of
# all loaded models.
encoder = encoder_manager.EncoderManager()
encoder.load_model(configuration.model_config(),
                   vocabulary_file=VOCAB_FILE,
                   embedding_matrix_file=EMBEDDING_MATRIX_FILE,
                   checkpoint_path=CHECKPOINT_PATH)
VOCAB_FILE = "pretrained/bi/vocab.txt"
EMBEDDING_MATRIX_FILE = "pretrained/bi/embeddings.npy"
CHECKPOINT_PATH = "pretrained/bi/model.ckpt-500008"

encoder.load_model(configuration.model_config(bidirectional_encoder=True),
                   vocabulary_file=VOCAB_FILE,
                   embedding_matrix_file=EMBEDDING_MATRIX_FILE,
                   checkpoint_path=CHECKPOINT_PATH)

five_events = ['ch', 'fg', 'gc', 'ow', 'ss']
stat = {}
stat['ch'] = (74, 93)
stat['fg'] = (46, 95)
stat['gc'] = (25, 43)
stat['ow'] = (58, 46)
stat['ss'] = (71, 103)
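# Note: the uni-skip (2400-d) and bi-skip (2400-d) models loaded above are
# concatenated by the EncoderManager, so encoder.encode() yields the 4800-d
# "combine-skip" vectors. stat[event] = (thread count, longest thread length);
# the longest thread over all five events has 103 tweets ('ss'), which is why
# every thread is padded to max_tweets = 103 below.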
for events in five_events:
    threads_num = stat[events][0]
    max_tweets = 103  # pad every thread to the global maximum thread length
    matrix = np.zeros((threads_num, max_tweets, 4800))  # threads x padded tweets x 4800-d SKP features
    text_in = open(events + '_tweets.txt', 'r')
    thread_index = 0
    for thread in text_in:
        thread = thread.strip()
        thread = thread.split('\t')
        thread = thread + ["padding_tweets"] * (max_tweets - len(thread))
        embedding = encoder.encode(thread)
        embedding = embedding.reshape(1, max_tweets, 4800)
        matrix[thread_index] = embedding
        thread_index = thread_index + 1
        print(str(thread_index))
    text_in.close()
    np.save(events + "v2.npy", matrix)
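# Hypothetical sanity check after this script has run: each event file holds
# one padded feature cube, e.g.
#   np.load("chv2.npy").shape  ->  (74, 103, 4800)
# i.e. 74 'ch' threads, each padded to 103 tweets, one 4800-d vector per tweet.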
--------------------------------------------------------------------------------
/process_data/generate_tree.py:
--------------------------------------------------------------------------------
# -*- coding: UTF-8 -*-
# Some threads contain more than one main (source) tweet; we fix that here by
# truncating such threads and keeping only the very first one.
import json
import os

five_folders = ['ch', 'fg', 'gc', 'ow', 'ss']

for main_folders in five_folders:
    text_out = open(main_folders + '_tweets.txt', 'w', encoding='UTF-8')
    json_tree_out = open(main_folders + '_tree.json', 'w')
    ori_json_out = open(main_folders + '_ori_tree.json', 'w')
    max_tweets = 0
    folder_count = 0
    for folders in os.listdir(main_folders):
        if(folders.isdigit()):  # ignore other files we created
            with open(main_folders + "/" + folders + "/structure.json", 'r') as f:
                tree_in = json.load(f)
            for key in tree_in.keys():
                new_tree_in = {}
                new_tree_in[key] = tree_in[key]  # keep only the first main tweet so each thread has a single root
                tree_in = new_tree_in
                break
            tree_out = str(json.dumps(tree_in))  # work on the string form so tweet ids can be replaced in place
            tree_id = 0
            for key in tree_in:
                tree_out = tree_out.replace(str(key), str(tree_id))
                # each tree only has one key, the source tweet:
                with open(main_folders + "/" + folders + '/source-tweets/' + str(key) + '.json', 'r') as f:
                    twitter_text = json.load(f)
                text = twitter_text['text'].strip().replace('\n', '').replace('\r', '')
                text_out.write(text)
                text_out.write('\t')
                break  # thread 552816020403269632 contains a second source tweet; stop after the first key

            def in_loop(current_data):
                global tree_id
                global tree_out
                if(current_data == None):
                    return 0
                for new_key in current_data:
                    tree_id = tree_id + 1
                    tree_out = tree_out.replace(str(new_key), str(tree_id))
                    with open(main_folders + "/" + folders + '/reactions/' + str(new_key) + '.json', 'r') as f:
                        twitter_text = json.load(f)
                    text = twitter_text['text'].strip().replace('\n', '').replace('\r', '')
                    text_out.write(text)
                    text_out.write('\t')  # tabs separate the tweets within a thread
                    in_loop(current_data[new_key])

            in_loop(tree_in[key])
            text_out.write('\n')  # a newline separates different threads
            folder_count = folder_count + 1
            json_tree_out.write(tree_out)  # save the renumbered tree as one JSON line
            json_tree_out.write("\n")
            ori_json_out.write(str(json.dumps(tree_in)))
            ori_json_out.write("\n")
            if(tree_id + 1 > max_tweets):
                max_tweets = tree_id + 1  # tree_id is 0-based, so the tweet count is tree_id + 1
    text_out.close()
    json_tree_out.close()
    ori_json_out.close()
    print(main_folders + " max tweets: " + str(max_tweets))
    print(main_folders + " thread count: " + str(folder_count))
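# Note on ordering: node ids are assigned in pre-order (a parent receives its
# id before its children) and tweets are written to {event}_tweets.txt in the
# same order, so row i of the encoded feature matrix corresponds to tree node
# id i. get_label.py, in contrast, emits labels in post-order, which is the
# order in which treelstmv3.0.py produces its per-node outputs.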
--------------------------------------------------------------------------------
/process_data/get_label.py:
--------------------------------------------------------------------------------
import json
label_file = open('ann.json', 'r')
label_map = {}

# map stance annotations to class ids; these ids line up with the weight
# vector of the CrossEntropyLoss in treelstmv3.0.py
for line in label_file:
    line = line.strip()
    label_set = json.loads(line)
    tweetid = label_set['tweetid']
    if(label_set.get("responsetype-vs-source", 0) != 0):  # reply tweet
        stance = label_set['responsetype-vs-source']
        if(stance == "agreed"):
            label_map[tweetid] = 0
        elif(stance == "disagreed"):
            label_map[tweetid] = 1
        elif(stance == "appeal-for-more-information"):
            label_map[tweetid] = 2
        elif(stance == "comment"):
            label_map[tweetid] = 3
    elif(label_set.get("support", 0) != 0):  # source tweet
        stance = label_set['support']
        if(stance == "supporting"):
            label_map[tweetid] = 0
        else:
            label_map[tweetid] = 1  # denying
    else:  # neither field is present; fall back to the ignored class
        label_map[tweetid] = 4
label_file.close()

five_events = ['ch', 'fg', 'gc', 'ow', 'ss']

for events in five_events:

    json_file = open(events + '_ori_tree.json', 'r')
    json_out = open(events + '_label.txt', 'w')
    for line in json_file:
        line = line.strip()
        tree_in = json.loads(line)
        for key in tree_in.keys():
            new_tree_in = {}
            new_tree_in[key] = tree_in[key]  # keep only the first main tweet so each thread has a single root
            tree_in = new_tree_in
            break

        # labels are written in post-order (children before their parent),
        # matching the order in which treelstmv3.0.py buffers node outputs
        def in_loop(current_data):
            if(current_data == None):
                return 0
            for new_key in current_data:
                in_loop(current_data[new_key])
                json_out.write(str(label_map.get(new_key, 4)))  # unlabeled tweets get the ignored class 4
                json_out.write('\t')
        in_loop(tree_in)
        json_out.write('\n')
    json_file.close()
    json_out.close()
--------------------------------------------------------------------------------
/treelstmv3.0.py:
--------------------------------------------------------------------------------
import json
import torch
import numpy as np
from torch import nn
from torch.nn import init
from sklearn.metrics import f1_score
import random
"""
Tree LSTM over the JSON tree structure produced by generate_tree.py.
"""
def eval_at_dev(model, val_feature, val_whole_tree, gd_label_arr):
    """
    args: val_feature: numpy array (thread_size, tweets_size, 4800-d features)
          gd_label_arr: the labels of all threads, concatenated in file order
    """
    pred_list = []
    model.eval()
    with torch.no_grad():  # no gradients are needed during evaluation
        for val_index in range(val_feature.shape[0]):
            pred = model.get_hidden_buffer(val_feature[val_index], val_whole_tree[val_index]['0'], 0)  # get model predictions
            pred_index = torch.topk(pred[:, 0:4], 1)[1].view(-1,).tolist()
            pred_list = pred_list + pred_index
    model.train()  # switch back to train mode so dropout stays active during training
    f1 = f1_score(gd_label_arr, pred_list, labels=[0, 1, 2, 3], average='macro')
    print("current model over dev set macro f measure: " + str(f1))
    return f1

class TreeLSTMCell(nn.Module):
    def __init__(self, input_size=4800, hidden_size=64, class_num=5):
        super(TreeLSTMCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.class_num = class_num
        # a bias term on W alone is enough; U does not need its own
        self.W_i = torch.nn.Linear(self.input_size, self.hidden_size)
        self.U_i = torch.nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        self.W_f = torch.nn.Linear(self.input_size, self.hidden_size)
        self.U_f = torch.nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        self.W_o = torch.nn.Linear(self.input_size, self.hidden_size)
        self.U_o = torch.nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        self.W_u = torch.nn.Linear(self.input_size, self.hidden_size)
        self.U_u = torch.nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        self.conv = torch.nn.Conv2d(in_channels=1, out_channels=self.hidden_size, kernel_size=(2, self.hidden_size))
        self.hidden_buffer = []
        # self.fc_1 = torch.nn.Linear(self.hidden_size, 128)
        self.classifier = torch.nn.Linear(self.hidden_size, self.class_num)
        self.dropout = torch.nn.Dropout(p=0.3)
        self.sum = True  # True: Child Sum Tree LSTM; False: Child Conv Tree LSTM
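    # Data layout (as produced by the scripts in process_data/): `tree` below
    # is the children dict parsed from {event}_tree.json, keyed by stringified
    # pre-order node ids; `inputs` is the (max_tweets, 4800) matrix whose row i
    # holds the SKP vector of node id i. hidden_buffer collects one logits row
    # per node in post-order, the order of get_label.py's labels.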
    def forward(self, inputs, tree, current_child_id):
        # works for batch_size = 1
        # computed recursively: to compute the current node we first need all of
        # its children, i.e. tree[child_id]
        if not torch.is_tensor(inputs):
            inputs = torch.Tensor(inputs)  # convert the numpy feature matrix once, at the root call
        batch_size = 1

        if tree is None:  # guard against leaves stored as JSON null (see generate_tree.py)
            tree = {}
        children_outputs = [self.forward(inputs, tree[child_id], child_id)
                            for child_id in tree]  # one (hidden state, cell memory) pair per child
        # at a non-leaf node we have real children states;
        # at a leaf we initialize a single all-zero state pair instead
        if children_outputs:
            children_states = children_outputs
        else:
            children_states = [(torch.zeros(batch_size, self.hidden_size), torch.zeros(batch_size, self.hidden_size))]

        # given the children states, compute the hidden state of this node
        return self.node_forward(inputs[int(current_child_id), :], children_states)

    def node_forward(self, inputs, children_states):
        # notation:
        #   inputs: the 4800-d vector of the current tweet
        #   children_states: K * [(C-dim hidden state, C-dim cell memory)]
        #   C = hidden state dimension, K = number of children
        batch_size = 1
        K = len(children_states)
        if(self.sum):
            # Child Sum Tree LSTM: sum the children hidden states
            average_h = torch.zeros(batch_size, self.hidden_size)
            for index in range(int(K)):
                average_h = average_h + children_states[index][0]
        else:
            # Child Conv Tree LSTM
            if(K < 2):  # the kernel height is 2, so a single child cannot be convolved; duplicate it
                child_tensor_list = []
                child_tensor_list.append(children_states[0][0])
                child_tensor_list.append(children_states[0][0])
                child_tensor = torch.stack(child_tensor_list).view(1, 1, 2, self.hidden_size)
            else:
                # stack the children hidden states into a matrix
                child_tensor_list = []
                for index in range(int(K)):
                    child_tensor_list.append(children_states[index][0])
                child_tensor = torch.stack(child_tensor_list).view(1, 1, K, self.hidden_size)  # batch x channel x children_size x hidden_size
            # convolve over the children dimension
            child_tensor_conv = self.conv(child_tensor)  # batch x hidden_size x (children_size - 1) x 1
            # max-pool over the children dimension
            average_h = torch.max(child_tensor_conv, dim=2)[0]  # batch x hidden_size x 1
            average_h = average_h.view(1, -1)
        # gate outputs for i, o, u
        i = torch.sigmoid(self.W_i(inputs) + self.U_i(average_h))
        o = torch.sigmoid(self.W_o(inputs) + self.U_o(average_h))
        u = torch.tanh(self.W_u(inputs) + self.U_u(average_h))

        # the forget gate is computed separately for each child
        sum_f = torch.zeros(batch_size, self.hidden_size)
        for index in range(int(K)):
            f = torch.sigmoid(self.W_f(inputs) + self.U_f(children_states[index][0]))
            sum_f = sum_f + f * children_states[index][1]

        # calculate cell state and hidden state
        c = sum_f + i * u
        h = o * torch.tanh(c)
        cell_memory = c
        hidden_state = h
        output = hidden_state  # alternatively: torch.relu(self.fc_1(hidden_state))
        output = self.dropout(output)
        output = self.classifier(output)
        self.hidden_buffer.append(output.view(-1,))  # per-node logits, appended in post-order
        return (hidden_state, cell_memory)

    def get_hidden_buffer(self, inputs, tree, current_child_id):
        self.hidden_buffer = []  # empty the buffer left over from the previous thread
        self.forward(inputs, tree, current_child_id)
        return torch.stack(self.hidden_buffer)
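# A quick smoke test for the cell (hypothetical, not part of the training run):
#   toy_cell = TreeLSTMCell()
#   toy_feats = np.random.rand(3, 4800)   # three fake 4800-d tweet vectors
#   toy_tree = {"1": {}, "2": {}}         # children of the root node 0
#   toy_cell.get_hidden_buffer(toy_feats, toy_tree, 0).shape
#   # -> torch.Size([3, 5]): one logits row per node, post-order (1, 2, 0)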
cd = TreeLSTMCell()
# init.orthogonal_(cd.parameters)

# weighted cross entropy in place of the paper's minority-class resampling;
# class 4 (missing annotation) is masked out entirely
# loss = nn.CrossEntropyLoss(ignore_index=4)
loss = nn.CrossEntropyLoss(weight=torch.Tensor([1, 1, 1, 0.5, 0]), ignore_index=4)

# define the optimizer
import torch.optim as optim
optimizer = optim.Adam(cd.parameters(), lr=0.001)  # use the Adam optimizer

# read data

# define train and dev events
five_events = ['ch', 'fg', 'gc', 'ow', 'ss']
dev_events = 1  # index into five_events; 1 selects 'fg' as the held-out dev event
train_events = [3, 0, 2, 4]  # i.e. ow, ch, gc, ss

# read in the dev set

dev_feature = np.load(five_events[dev_events] + "v2.npy")  # SKPencoder4tree.py saves the padded features as "{event}v2.npy"
dev_tree_in = open(five_events[dev_events] + '_tree.json', 'r')
dev_whole_tree = []
for line in dev_tree_in:
    line = line.strip()
    dev_whole_tree.append(json.loads(line))
dev_tree_in.close()
dev_label_set = []
dev_label_in = open(five_events[dev_events] + '_label.txt', 'r')
dev_label_arr = []
for line in dev_label_in:
    dev_label_set.append([int(x) for x in line.strip().split("\t")])
    dev_label_arr = dev_label_arr + [int(x) for x in line.strip().split("\t")]
dev_label_in.close()


# read in the train set

train_whole_tree = []
train_label_set = []
train_label_arr = []

for eve_index in range(0, 4):

    train_tree_in = open(five_events[train_events[eve_index]] + '_tree.json', 'r')
    train_label_in = open(five_events[train_events[eve_index]] + '_label.txt', 'r')

    for line in train_tree_in:
        line = line.strip()
        train_whole_tree.append(json.loads(line))

    for line in train_label_in:
        train_label_set.append([int(x) for x in line.strip().split("\t")])
        train_label_arr = train_label_arr + [int(x) for x in line.strip().split("\t")]
    train_tree_in.close()
    train_label_in.close()

# the "v2" files hold the padded tweet features produced by SKPencoder4tree.py
train_feature = np.concatenate((np.load(five_events[train_events[0]] + "v2.npy"),
                                np.load(five_events[train_events[1]] + "v2.npy"),
                                np.load(five_events[train_events[2]] + "v2.npy"),
                                np.load(five_events[train_events[3]] + "v2.npy")))

# train
max_epoch = 30
train_index_set = [i for i in range(0, train_feature.shape[0])]
random.shuffle(train_index_set)  # fixed shuffle of the thread order, shared across epochs
print('evaluating at ' + five_events[dev_events] + ' event')
for epoch in range(max_epoch):

    for index in train_index_set:
        output = cd.get_hidden_buffer(train_feature[index], train_whole_tree[index]['0'], 0)
        L = loss(output, torch.Tensor(train_label_set[index]).long())
        optimizer.zero_grad()
        L.backward()
        optimizer.step()
    eval_at_dev(cd, dev_feature, dev_whole_tree, dev_label_arr)
    print('epoch: %d, loss: %f' % (epoch, L.item()))
--------------------------------------------------------------------------------