├── README.md
├── build_data.py
├── build_features.py
├── build_network.py
├── data
│   ├── example
│   │   └── example.txt
│   └── stopwords_en.txt
├── model
│   ├── code
│   │   ├── __init__.py
│   │   ├── layers.py
│   │   ├── models.py
│   │   ├── print_log.py
│   │   ├── train.py
│   │   ├── utils.py
│   │   └── utils_inductive.py
│   └── data
│       └── example
│           ├── example.cites
│           ├── example.content.entity
│           ├── example.content.text
│           ├── example.content.topic
│           ├── mapindex.txt
│           ├── test.map
│           ├── train.map
│           └── vali.map
├── tagMe.py
└── utils.py

/README.md:
--------------------------------------------------------------------------------
1 | An implementation of the EMNLP 2019 paper "[Heterogeneous Graph Attention Networks for Semi-supervised Short Text Classification](http://shichuan.org/doc/74.pdf)" and its extension "[HGAT: Heterogeneous Graph Attention Networks for Semi-supervised Short Text Classification](https://doi.org/10.1145/3450352)" (TOIS 2021).
2 |
3 | Thank you for your interest in our work! :smile:
4 |
5 |
6 |
7 | # Requirements
8 |
9 | - Anaconda3 (Python 3.6)
10 | - PyTorch 1.3.1
11 | - gensim 3.6.0
12 |
13 |
14 |
15 | # Easy Run
16 |
17 | ```
18 | cd ./model/code/
19 | python train.py
20 | ```
21 |
22 | You may change the dataset by modifying the variable "dataset = 'example'" at the top of "train.py", or by passing command-line arguments (see train.py).
23 |
24 | Our datasets can be downloaded from [Google Drive](https://drive.google.com/open?id=1pz1IMdJqkKidD7eEc3T_2-VkrUhkUKd4). PS: I accidentally deleted some files but have tried to restore them; hopefully they still run correctly.
25 |
26 |
27 |
28 | # Prepare for your own dataset
29 |
30 | The following files are required:
31 |
32 |     ./model/data/YourData/
33 |     ---- YourData.cites          // the adjacencies (edges)
34 |     ---- YourData.content.text   // the features of texts
35 |     ---- YourData.content.entity // the features of entities
36 |     ---- YourData.content.topic  // the features of topics
37 |     ---- train.map               // the indices of the training nodes
38 |     ---- vali.map                // the indices of the validation nodes
39 |     ---- test.map                // the indices of the test nodes
40 |
41 | The formats are as follows:
42 |
43 | - **YourData.cites**
44 |
45 |   Each line contains an edge: "idx1\tidx2\n". e.g. "98 13"
46 |
47 | - **YourData.content.text**
48 |
49 |   Each line contains a node: "idx\t[features]\t[category]\n". Note that [features] is a list of floats delimited by '\t'. e.g. "59 1.0 0.5 0.751 0.0 0.659 0.0 computers"
50 |   If used for multi-label classification, [category] must be a space-delimited binary (multi-hot) vector, e.g. "59 1.0 0.5 0.751 0.0 0.659 0.0 0 1 1 0 1 0".
51 |
52 | - **YourData.content.entity**
53 |
54 |   Similar to .text; just change [category] to "entity". e.g. "13 0.0 0.0 1.0 0.0 0.0 entity"
55 |
56 | - **YourData.content.topic**
57 |
58 |   Similar to .text; just change [category] to "topic". e.g. "64 0.10 1.21 8.09 0.10 topic"
59 |
60 | - ***.map**
61 |
62 |   Each line contains an index: "idx\n". e.g. "98"
63 |
64 | You can see an example in ./model/data/example/*
65 |
66 | ----
67 |
68 | A simple data preprocessing pipeline is provided. Running it successfully requires a [TagMe](https://sobigdata.d4science.org/web/tagme/tagme-help "TagMe") account token (my personal token is provided in tagMe.py, but it may become invalid in the future), [Wikipedia](https://dumps.wikimedia.org/ "WikiPedia")'s entity descriptions, and a word2vec model containing entity embeddings. You can prepare them yourself or obtain our files from [Google Drive](https://drive.google.com/open?id=1v9GD5ezHGbekoLDw5aAzh6-C-QUS-j93) and unzip them to ./data/.
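For reference, the sketch below shows how build_network.py consumes the word2vec model; the model path and the lowercased, underscore-joined entity lookup follow that script, while the entity names themselves are hypothetical placeholders.

```
import gensim

# build_network.py loads a Gensim Word2Vec model whose vocabulary also covers
# the (lowercased) entity titles linked by TagMe, e.g. "new_york".
model = gensim.models.Word2Vec.load('./data/word2vec/word2vec_gensim_5')

# Entity-entity edges are weighted by this embedding similarity; entity pairs
# missing from the vocabulary are simply skipped by build_network.py.
print(model.wv.similarity('new_york', 'united_states'))
```

Any Gensim word2vec model should work here, as long as the lowercased entity titles appear in its vocabulary.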
69 |
70 | Then, you should prepare a data file like ./data/example/example.txt, whose format is: "[idx]\t[category]\t[content]\n".
71 |
72 | Finally, modify the variable "dataset = 'example'" at the top of the following scripts and run:
73 |
74 | ```
75 | python tagMe.py
76 | python build_network.py
77 | python build_features.py
78 | python build_data.py
79 | ```
80 |
81 |
82 |
83 | # Use HGAT as GNN
84 |
85 | If you just want to use the HGAT model as a graph neural network, prepare some files following the format above:
86 |
87 |     ./model/data/YourData/
88 |     ---- YourData.cites       // the adjacencies (edges)
89 |     ---- YourData.content.*   // the features of *, namely node_type1, node_type2, ...
90 |     ---- train.map            // the indices of the training nodes
91 |     ---- vali.map             // the indices of the validation nodes
92 |     ---- test.map             // the indices of the test nodes
93 |
94 | Then change "load_data()" in ./model/code/utils.py:
95 |
96 | ```
97 | type_list = [node_type1, node_type2, ...]
98 | type_have_label = node_type
99 | ```
100 |
101 | See the code for more details.
102 |
103 |
104 |
105 | # Citation
106 |
107 | If you make use of the HGAT model in your research, please cite the following in your manuscript:
108 |
109 | ```
110 | @article{yang2021hgat,
111 |   author = {Yang, Tianchi and Hu, Linmei and Shi, Chuan and Ji, Houye and Li, Xiaoli and Nie, Liqiang},
112 |   title = {HGAT: Heterogeneous Graph Attention Networks for Semi-Supervised Short Text Classification},
113 |   year = {2021},
114 |   publisher = {Association for Computing Machinery},
115 |   volume = {39},
116 |   number = {3},
117 |   doi = {10.1145/3450352},
118 |   journal = {ACM Transactions on Information Systems},
119 |   month = may,
120 |   articleno = {32},
121 |   numpages = {29},
122 | }
123 |
124 | @inproceedings{linmei2019heterogeneous,
125 |   title={Heterogeneous graph attention networks for semi-supervised short text classification},
126 |   author={Linmei, Hu and Yang, Tianchi and Shi, Chuan and Ji, Houye and Li, Xiaoli},
127 |   booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
128 |   pages={4823--4832},
129 |   year={2019}
130 | }
131 | ```
132 |
--------------------------------------------------------------------------------
/build_data.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 |
4 | import networkx
5 | # import matplotlib.pyplot as plt
6 | import json, os
7 | import pickle
8 | import gensim
9 | import numpy as np
10 | from tqdm import tqdm
11 | from scipy import sparse as spr
12 |
13 | from utils import sample
14 |
15 | DATASETS = 'example'
16 |
17 |
18 |
19 | rootpath = './'
20 | outpath = rootpath + 'model/data/{}/'.format(DATASETS)
21 | datapath = rootpath + 'data/{}/'.format(DATASETS)
22 | SIM = 0.5
23 | LP = 0.75
24 | RHO = 0.3
25 | TopK_for_Topics = 2
26 |
27 |
28 | def cnt_nodes(g):
29 |     text_nodes, entity_nodes, topic_nodes = set(), set(), set()
30 |     for i in g.nodes():
31 |         if i.isdigit():
32 |             text_nodes.add(i)
33 |         elif i[:6] == 'topic_':
34 |             topic_nodes.add(i)
35 |         else:
36 |             entity_nodes.add(i)
37 |     print("# text_nodes: {} # entity_nodes: {} # topic_nodes: {}".format(
38 |         len(text_nodes), len(entity_nodes), len(topic_nodes)))
39 | 
return text_nodes, entity_nodes, topic_nodes 40 | 41 | with open(datapath+'model_network_sampled.pkl', 'rb') as f: 42 | g = pickle.load(f) 43 | text_nodes, entity_nodes, topic_nodes = cnt_nodes(g) 44 | 45 | with open(datapath+"features_BOW.pkl", 'rb') as f: 46 | features_BOW = pickle.load(f) 47 | with open(datapath+"features_TFIDF.pkl", 'rb') as f: 48 | features_TFIDF = pickle.load(f) 49 | with open(datapath+"features_index.pkl", 'rb') as f: 50 | features_index_BOWTFIDF = pickle.load(f) 51 | with open(datapath+"features_entity_descBOW.pkl", 'rb') as f: 52 | features_entity_BOW = pickle.load(f) 53 | with open(datapath+"features_entity_descTFIDF.pkl", 'rb') as f: 54 | features_entity_TFIDF = pickle.load(f) 55 | with open(datapath+"features_entity_index_desc.pkl", 'rb') as f: 56 | features_entity_index_desc = pickle.load(f) 57 | 58 | feature = features_TFIDF 59 | features_index = features_index_BOWTFIDF 60 | entityF = features_entity_TFIDF 61 | features_entity_index = features_entity_index_desc 62 | textShape = feature.shape 63 | entityShape = entityF.shape 64 | print("Shape of text feature:",textShape, 'Shape of entity feature:', entityShape) 65 | 66 | # 删掉没有特征的实体 67 | notinind = set() 68 | 69 | entitySet = set(entity_nodes) 70 | print(len(entitySet)) 71 | 72 | for i in entitySet: 73 | if i not in features_entity_index: 74 | notinind.add(i) 75 | print(len(g.nodes()), len(notinind)) 76 | g.remove_nodes_from(notinind) 77 | entitySet = entitySet - notinind 78 | print(len(entitySet), len(features_entity_index)) 79 | N = len(g.nodes()) 80 | print(len(g.nodes()), len(g.edges())) 81 | text_nodes, entity_nodes, topic_nodes = cnt_nodes(g) 82 | 83 | 84 | 85 | # 删掉一些边 86 | cnt = 0 87 | nodes = g.nodes() 88 | print(len(g.edges())) 89 | for node in tqdm(nodes): 90 | try: 91 | cache = [j for j in g[node] 92 | if ('sim' in g[node][j] and g[node][j]['sim'] < SIM) # 0.5 93 | or ('link_probability' in g[node][j] and g[node][j]['link_probability'] <= LP) 94 | or ('rho' in g[node][j] and g[node][j]['rho'] < RHO) 95 | ] 96 | if len(cache) != 0: 97 | g.remove_edges_from( [(node, i) for i in cache] ) 98 | cnt += len( cache ) 99 | except: 100 | print(g[node]) 101 | break 102 | print(len(g.edges()), cnt) 103 | 104 | # 删掉孤立点(实体) 105 | delete = [n for n in g.nodes() if len(g[n]) == 0 and n not in text_nodes] 106 | print("Num of 孤立点:", len(delete)) 107 | g.remove_nodes_from(delete) 108 | 109 | 110 | 111 | train, vali, test, alltext = sample(datapath, DATASETS) 112 | 113 | 114 | 115 | 116 | 117 | # topic 118 | with open(datapath + 'topic_word_distribution.pkl', 'rb') as f: 119 | topic_word = pickle.load(f) 120 | with open(datapath + 'doc_topic_distribution.pkl', 'rb') as f: 121 | doc_topic = pickle.load(f) 122 | with open(datapath + 'doc_index_LDA.pkl', 'rb') as f: 123 | doc_idx_list = pickle.load(f) 124 | 125 | topic_num = topic_word.shape[0] 126 | topics = [] 127 | for i in range(topic_num): 128 | topicName = 'topic_' + str(i) 129 | topics.append(topicName) 130 | 131 | def naive_arg_topK(matrix, K, axis=0): 132 | """ 133 | perform topK based on np.argsort 134 | :param matrix: to be sorted 135 | :param K: select and sort the top K items 136 | :param axis: dimension to be sorted. 
137 | :return: 138 | """ 139 | full_sort = np.argsort(-matrix, axis=axis) 140 | return full_sort.take(np.arange(K), axis=axis) 141 | 142 | topK_topics = naive_arg_topK(doc_topic, TopK_for_Topics, axis=1) 143 | for i in range(topK_topics.shape[0]): 144 | for j in range(TopK_for_Topics): 145 | g.add_edge(doc_idx_list[i], topics[topK_topics[i, j]]) 146 | 147 | print("gnodes:", len(g.nodes()), "gedges:", len(g.edges())) 148 | 149 | 150 | 151 | 152 | 153 | # build Edges data 154 | import collections 155 | cnt = 0 156 | nodes = g.nodes() 157 | graphdict = collections.defaultdict(list) 158 | for node in tqdm(nodes): 159 | try: 160 | cache = [j for j in g[node] 161 | if ('sim' in g[node][j] and g[node][j]['sim'] >= SIM) or ('sim' not in g[node][j]) # 0.5 162 | if ('link_probability' in g[node][j] and g[node][j]['link_probability'] > LP) or ('link_probability' not in g[node][j]) 163 | if ('rho' in g[node][j] and g[node][j]['rho'] > RHO) or ('rho' not in g[node][j]) 164 | ] 165 | if len(cache) != 0: 166 | graphdict[ node ] = cache 167 | cnt += len( cache ) 168 | except: 169 | print(g[node]) 170 | break 171 | print('edges: ', cnt) 172 | 173 | def normalizeF(mx): 174 | sup = np.absolute(mx).max() 175 | if sup == 0: 176 | return mx 177 | return mx / sup 178 | 179 | text_nodes, entity_nodes, topic_nodes = cnt_nodes(g) 180 | 181 | mapindex = dict() 182 | cnt = 0 183 | for i in text_nodes|entity_nodes|topic_nodes: 184 | mapindex[i] = cnt 185 | cnt += 1 186 | print(len(g.nodes()), len(mapindex)) 187 | 188 | if not os.path.exists(outpath): 189 | os.makedirs(outpath) 190 | 191 | 192 | # build feature data 193 | gnodes = set(g.nodes()) 194 | print(gnodes, mapindex) 195 | with open(outpath + 'train.map', 'w') as f: 196 | f.write('\n'.join([str(mapindex[i]) for i in train if i in gnodes])) 197 | with open(outpath + 'vali.map', 'w') as f: 198 | f.write('\n'.join([str(mapindex[i]) for i in vali if i in gnodes])) 199 | with open(outpath + 'test.map', 'w') as f: 200 | f.write('\n'.join([str(mapindex[i]) for i in test if i in gnodes])) 201 | 202 | flag_zero = False 203 | 204 | input_type = 'text&entity&topic2hgcn' 205 | if input_type == 'text&entity&topic2hgcn': 206 | node_with_feature = set() 207 | DEBUG = False 208 | 209 | # text node 210 | content = dict() 211 | for i in tqdm(range(textShape[0])): 212 | ind = features_index[i] 213 | if (ind) not in text_nodes: 214 | continue 215 | content[ind] = feature[i, :].toarray()[0].tolist() # 216 | if DEBUG: 217 | content[ind] = feature[i, :10].toarray()[0].tolist() 218 | if flag_zero: 219 | entityFlen = entityShape[1] 220 | content[ind] += [0] * (entityFlen + topic_word.shape[1]) 221 | with open(datapath + '{}.txt'.format(DATASETS), 'r') as f: 222 | for line in tqdm(f): 223 | ind, cat = line.strip('\n').split('\t')[:2] 224 | ind = (ind) 225 | if ind not in text_nodes: 226 | continue 227 | content[ind] += [cat] 228 | alllen = len(content[ind]) 229 | with open(outpath + '{}.content.text'.format(DATASETS), 'w') as f: 230 | for ind in tqdm(content): 231 | f.write(str(mapindex[ind]) + '\t' + '\t'.join(map(str, content[ind])) + '\n') 232 | node_with_feature.add(ind) 233 | cache = len(content) 234 | print("共{}个文本".format(len(content))) 235 | 236 | # entity node 237 | content = dict() 238 | for i in tqdm(range(entityShape[0])): 239 | name = features_entity_index[i] 240 | if name not in entity_nodes: 241 | continue 242 | content[name] = entityF[i, :].toarray()[0].tolist() + ['entity'] 243 | if flag_zero: 244 | content[name] = [0] * textShape[1] + content[name] + [0] * 
topic_word.shape[1] + ['entity'] 245 | with open(outpath + '{}.content.entity'.format(DATASETS), 'w') as f: 246 | for ind in tqdm(content): 247 | f.write(str(mapindex[ind]) + '\t' + '\t'.join(map(str, content[ind])) + '\n') 248 | node_with_feature.add(ind) 249 | cache += len(content) 250 | print("共{}个实体".format(len(content))) 251 | 252 | # topic node 253 | content = dict() 254 | for i in range(topic_num): 255 | # zero_num = textShape[1] + entityFlen - topic_num 256 | topicName = topics[i] 257 | if topicName not in topic_nodes: 258 | continue 259 | one_hot = [0] * topic_num 260 | one_hot[i] = 1 261 | content[topicName] = one_hot 262 | content[topicName] = topic_word[i].tolist() + ['topic'] 263 | if flag_zero: 264 | zero_num = textShape[1] + entityFlen 265 | content[topicName] = [0] * zero_num + content[topicName] + ['topic'] 266 | 267 | with open(outpath + '{}.content.topic'.format(DATASETS), 'w') as f: 268 | for ind in tqdm(content): 269 | f.write(str(mapindex[ind]) + '\t' + '\t'.join(map(str, content[ind])) + '\n') 270 | node_with_feature.add(ind) 271 | cache += len(content) 272 | print("共{}个主题".format(len(content))) 273 | 274 | print(cache, len(mapindex)) 275 | print("nodes with features:", len(node_with_feature)) 276 | 277 | 278 | 279 | # save mappings 280 | with open(outpath+'mapindex.txt', 'w') as f: 281 | for i in mapindex: 282 | f.write("{}\t{}\n".format(i, mapindex[i])) 283 | 284 | 285 | 286 | # save adj matrix 287 | with open(outpath+'{}.cites'.format(DATASETS), 'w') as f: 288 | doneSet = set() 289 | nodeSet = set() 290 | for node in graphdict: 291 | for i in graphdict[node]: 292 | if (node, i) not in doneSet: 293 | f.write( str(mapindex[node])+'\t'+str(mapindex[i])+'\n' ) 294 | doneSet.add( (i, node) ) 295 | doneSet.add( (node, i) ) 296 | nodeSet.add( node ) 297 | nodeSet.add( i ) 298 | for i in tqdm(range(len(mapindex))): 299 | f.write(str(i)+'\t'+str(i)+'\n') 300 | 301 | print('Num of nodes with edges: ', len(nodeSet)) -------------------------------------------------------------------------------- /build_features.py: -------------------------------------------------------------------------------- 1 | #!/user/bin/env python 2 | # -*- coding: utf-8 -*- 3 | import networkx 4 | import json 5 | from tqdm import tqdm 6 | from sklearn.feature_extraction.text import CountVectorizer 7 | from sklearn.feature_extraction.text import TfidfTransformer 8 | from sklearn.decomposition import LatentDirichletAllocation 9 | import pickle as pkl 10 | from nltk.tokenize import WordPunctTokenizer 11 | import os 12 | from utils import sample, preprocess_corpus_notDropEntity, load_stopwords 13 | 14 | import jieba 15 | DATASETS = 'example' 16 | 17 | 18 | 19 | rootpath = './' 20 | datapath = rootpath + 'data/{}/'.format(DATASETS) 21 | 22 | def tokenize(sen): 23 | return WordPunctTokenizer().tokenize(sen) 24 | # return jieba.cut(sen) 25 | 26 | 27 | def build_entity_feature_with_description(datapath, stopwords=list()): 28 | with open(datapath + 'model_network_sampled.pkl', 'rb') as f: 29 | g = pkl.load(f) 30 | nodesset = set(g.nodes()) 31 | entityIndex = [] 32 | corpus = [] 33 | cnt = 0 34 | for i in tqdm(range(40), desc="Read desc: "): 35 | filename = str(i).zfill(4) 36 | with open("./data/wikiAbstract/"+filename, 'r') as f: 37 | for line in f: 38 | ent, desc = line.strip('\n').split('\t') 39 | entity = ent.replace(" ", "_") 40 | if entity in nodesset: 41 | if entity not in entityIndex: 42 | entityIndex.append(entity) 43 | cnt += 1 44 | else: 45 | print('error') 46 | content = tokenize(desc) 47 | content 
= ' '.join([ word.lower() for word in content if word.isalpha() ]) 48 | corpus.append(content) 49 | 50 | print(len(corpus), len(entityIndex)) 51 | 52 | vectorizer = CountVectorizer(min_df = 10, stop_words=stopwords) 53 | X = vectorizer.fit_transform(corpus) 54 | print("Entity feature shape: ", X.shape) 55 | transformer = TfidfTransformer() 56 | tfidf = transformer.fit_transform(X) 57 | print("Caculated! Saving...") 58 | with open(datapath+"vectorizer_model.pkl", 'wb') as f: 59 | pkl.dump(vectorizer, f) 60 | with open(datapath+"transformer_model.pkl", 'wb') as f: 61 | pkl.dump(transformer, f) 62 | with open(datapath+"features_entity_descBOW.pkl", 'wb') as f: 63 | pkl.dump(X, f) 64 | with open(datapath+"features_entity_descTFIDF.pkl", 'wb') as f: 65 | pkl.dump(tfidf, f) 66 | with open(datapath+"features_entity_index_desc.pkl", 'wb') as f: 67 | pkl.dump(entityIndex, f) 68 | print("done!") 69 | 70 | def build_text_feature(datapath, DATASETS, rho=0.3, lp=0.5, stopwords=list()): 71 | train, vali, test, alltext = sample(datapath=datapath, DATASETS=DATASETS, resample=False) 72 | # 这里先把未替换的ind-content对存在字典中 73 | pre_replace = dict() 74 | index2ind = {} 75 | cnt = 0 76 | corpus = [] 77 | involved_entity = set() 78 | 79 | with open("{}{}.txt".format(datapath, DATASETS), 'r', encoding='utf8') as f: 80 | for line in f: 81 | ind, cate, content = line.strip('\n').split('\t') 82 | if ind not in alltext: 83 | continue 84 | pre_replace[ind] = content.lower() 85 | content = pre_replace[ind] 86 | corpus.append(content) 87 | index2ind[cnt] = ind 88 | cnt += 1 89 | print(len(pre_replace)) 90 | 91 | print("loading entities...") 92 | with open('{}{}2entity.txt'.format(datapath, DATASETS), 'r', encoding='utf8') as f: 93 | for line in tqdm(f): 94 | ind, entityList = line.strip('\n').split('\t') 95 | # ind = int(ind) 96 | if ind not in pre_replace: 97 | continue 98 | 99 | entityList = json.loads(entityList) 100 | for d in entityList: 101 | if d['rho'] < rho: 102 | continue 103 | if d['link_probability'] < lp: 104 | continue 105 | if 'title' not in d: 106 | print("An entity with no title, whose spot is: {}".format(d['spot'])) 107 | continue 108 | ent = d['title'].replace(" ", '') 109 | involved_entity.add(ent) 110 | ori = d['spot'].lower() 111 | content.replace(ori, ent) 112 | 113 | 114 | len(corpus) 115 | print("text preprocessing...") 116 | 117 | corpus = preprocess_corpus_notDropEntity(corpus, 118 | stopwords=stopwords, involved_entity=involved_entity) 119 | print("text feature transforming...") 120 | 121 | vectorizer = CountVectorizer(min_df=10 if DATASETS != "example" else 0, stop_words=stopwords) 122 | X = vectorizer.fit_transform(corpus) 123 | 124 | with open(datapath + 'TextBoW_model.pkl', 'wb') as f: 125 | pkl.dump(vectorizer, f) 126 | 127 | transformer = TfidfTransformer() 128 | tfidf = transformer.fit_transform(X) 129 | print("text feature transformed.") 130 | 131 | with open(datapath + "features_BOW.pkl", 'wb') as f: 132 | pkl.dump(X, f) 133 | with open(datapath + "features_TFIDF.pkl", 'wb') as f: 134 | pkl.dump(tfidf, f) 135 | with open(datapath + "features_index.pkl", 'wb') as f: 136 | pkl.dump(index2ind, f) 137 | print(X.shape) 138 | 139 | alllength = sum([len(sentence.split(' ')) for sentence in corpus]) 140 | avg_length = alllength / len(corpus) 141 | print('train: {}\tvali: {}\ttest: {}'.format(len(train), len(vali), len(test))) 142 | print('num of all corpus: {}'.format(len(train + vali + test))) 143 | print('avg of tokens: {:.1f}'.format(avg_length)) 144 | vocab = set() 145 | for s in corpus: 146 
| vocab.update(s.split(' ')) 147 | print('involved entities: {}'.format(len(involved_entity))) 148 | print('vocabulary size: {}'.format(len(vocab))) 149 | 150 | 151 | def build_topic_feature_sklearn(datapath, DATASETS, TopicNum=20, stopwords=list(), train=False): 152 | # sklearn-lda 153 | 154 | idxlist = [] 155 | corpus = [] 156 | catelist = [] 157 | with open('{}{}.txt'.format(datapath, DATASETS), 'r', encoding='utf8') as f: 158 | for line in f: 159 | ind, cate, content = line.strip().split('\t') 160 | idxlist.append(ind) 161 | corpus.append(content) 162 | catelist.append(cate) 163 | 164 | with open(datapath + 'doc_index_LDA.pkl', 'wb') as f: 165 | pkl.dump(idxlist, f) 166 | 167 | print("text feature transforming...") 168 | corpus = preprocess_corpus_notDropEntity(corpus,stopwords=stopwords, involved_entity=set()) 169 | 170 | with open(datapath + "features_BOW.pkl", 'rb') as f: 171 | X = pkl.load(f) 172 | # vocabulary_ 的对照关系,读上面那个bow的模型就可以了 173 | 174 | 175 | if train: 176 | alpha, beta = 0.1, 0.1 177 | lda = LatentDirichletAllocation(n_components=TopicNum, max_iter=1200, 178 | learning_method='batch', n_jobs=-1, 179 | doc_topic_prior=alpha, topic_word_prior=beta, 180 | verbose=1, 181 | ) 182 | lda_feature = lda.fit_transform(X) 183 | with open(datapath + 'lda_model.pkl', 'wb') as f: 184 | pkl.dump(lda, f) 185 | with open(datapath + 'topic_word_distribution.pkl', 'wb') as f: 186 | pkl.dump(lda.components_, f) 187 | else: 188 | with open(datapath + 'lda_model.pkl', 'rb') as f: 189 | lda = pkl.load(f) 190 | lda_feature = lda.transform(X) 191 | 192 | with open(datapath + 'doc_topic_distribution.pkl', 'wb') as f: 193 | pkl.dump(lda_feature, f) 194 | 195 | 196 | 197 | 198 | 199 | if __name__ == '__main__': 200 | stopwords = load_stopwords() 201 | 202 | build_entity_feature_with_description(datapath, stopwords=stopwords) 203 | build_text_feature(datapath, DATASETS, stopwords=stopwords) 204 | build_topic_feature_sklearn(datapath, DATASETS, stopwords=stopwords, train=True) -------------------------------------------------------------------------------- /build_network.py: -------------------------------------------------------------------------------- 1 | #!/user/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import networkx 5 | import json 6 | import pickle 7 | import gensim 8 | from tqdm import tqdm 9 | 10 | from utils import sample 11 | 12 | DATASETS = 'example' 13 | 14 | NumOfTrainTextPerClass = 2 15 | TOPK = 10 16 | SIM_MIN = 0.5 17 | 18 | rootpath = './' 19 | datapath = rootpath + 'data/{}/'.format(DATASETS) 20 | g = networkx.Graph() 21 | train, vali, test, alltext = sample(datapath=datapath, DATASETS=DATASETS, resample=True, 22 | trainNumPerClass=NumOfTrainTextPerClass) 23 | 24 | # load text-entity 25 | entitySet = set() 26 | rho = 0.1 27 | noEntity = set() 28 | with open(datapath+'{}2entity.txt'.format(DATASETS), 'r') as f: 29 | for line in tqdm(f, desc="text-ent: "): 30 | ind, entityList = line.strip('\n').split('\t') 31 | # ind = int(ind) 32 | if ind not in alltext: 33 | continue 34 | entityList = json.loads(entityList) 35 | entities = [(d['title'].replace(" ", '_'), d['rho'], d['link_probability']) 36 | for d in entityList if 'title' in d and float(d['rho']) > rho] 37 | entitySet.update([d['title'].replace(" ", '_') 38 | for d in entityList if 'title' in d and float(d['rho']) > rho]) 39 | g.add_edges_from([(ind, e[0], {'rho': e[1], 'link_probability': e[2]}) 40 | for e in entities]) 41 | if len(entities) == 0: 42 | noEntity.add(ind) 43 | g.add_node(ind) 44 | print("text-entity 
done.") 45 | 46 | # load labels 47 | with open(datapath+'{}.txt'.format(DATASETS), 'r', encoding='utf8') as f: 48 | for line in tqdm(f,desc="text label: "): 49 | ind, cate, title = line.strip('\n').split('\t') 50 | # ind = int(ind) 51 | if ind not in alltext: 52 | continue 53 | if ind not in g.nodes(): 54 | g.add_node(ind) 55 | g.node[ind]['type'] = cate 56 | 57 | 58 | # load similarities between entities 59 | print("loading Gensim.word2vec. ") 60 | model = gensim.models.Word2Vec.load(rootpath+'data/word2vec/word2vec_gensim_5') 61 | print("word2vec model done.") 62 | 63 | # topK + 阈值 64 | sim_min = SIM_MIN 65 | topK = TOPK 66 | el = list(entitySet) 67 | entity_edge = [] 68 | cnt_no = 0 69 | cnt_yes = 0 70 | cnt = 0 71 | for i in tqdm(range(len(el)), desc="ent-ent: "): 72 | simList = [] 73 | topKleft = topK 74 | for j in range(len(el)): 75 | if i == j: 76 | continue 77 | cnt += 1 78 | try: 79 | sim = model.wv.similarity(el[i].lower().strip(')'), el[j].lower().strip(')')) 80 | cnt_yes += 1 81 | if sim >= sim_min: 82 | entity_edge.append( (el[i], el[j], {'sim': sim}) ) 83 | topKleft -= 1 84 | else: 85 | simList.append( (sim, el[j]) ) 86 | except Exception as e: 87 | cnt_no += 1 88 | simList = sorted(simList, key=(lambda x: x[0]), reverse=True) 89 | for i in range(min(max(topKleft, 0), len(simList))): 90 | entity_edge.append( (el[i], simList[i][1], {'sim': simList[i][0]}) ) 91 | print(cnt_yes, cnt_no) 92 | 93 | g.add_edges_from(entity_edge) 94 | 95 | # save the network 96 | with open(datapath+'model_network_sampled.pkl', 'wb') as f: 97 | pickle.dump(g, f) -------------------------------------------------------------------------------- /data/example/example.txt: -------------------------------------------------------------------------------- 1 | 0 business manufacture manufacturer directory directory china taiwan products manufacturers directory- taiwan china products manufacturer direcory exporter directory supplier directory suppliers 2 | 1 business empmag electronics manufacturing procurement homepage electronics manufacturing procurement magazine procrement power products production essentials data management 3 | 2 business dfma truecost paper true cost overseas manufacture product design costs manufacturing products china manufacturing redesigned product china save 4 | 3 business thomasnet thomasnet cnc machining metal stamping gaskets fasteners searchable database information products services cad drawings 5 | 4 business crnano products nanotechnology products molecular manufacturing product structural design simplest products software-controlled extra cost manufacturing 6 | 2130 computers digitalmars intro programming language digital mars compiled garbage collected simpler replacement walter bright wrote dos compiler maximum similarity backward 7 | 2131 computers google top computers programming languages google directory computers programming languages programming languages weblog news discussion web access usenet newsgroups programming languages 8 | 2132 computers ruby lang ruby programming language interpreted dynamically typed pure object-oriented scripting language programming japan straightforward extensible 9 | 2133 computers sun java technology experienced programmers migrating java language java programming language sl- -se practice exam 10 | 2134 computers dir yahoo computers and internet programming and development languages computer programming languages yahoo directory links sites code samples tools tutorials articles resources focusing programming languages 11 | 2560 
culture-arts-entertainment oscar oscar com annual academy awards homepage nominee winners carpet photograph gallery press video clips voting ballot information 12 | 2561 culture-arts-entertainment oscars welcome academy motion picture arts sciences academy history information academy awards photographs events screenings press releases 13 | 2562 culture-arts-entertainment https oscar symplicity oscar online system clerkship application review users internet explorer firefox netscape internet browser access oscar 14 | 2563 culture-arts-entertainment wikipedia wiki academy awards academy award wikipedia encyclopedia academy awards oscars prominent oscar statuette academy award merit 15 | 2564 culture-arts-entertainment oscar openclustergroup oscar source cluster application resources oscar users regardless experience level nix environment install beowulf performance computing cluster 16 | 5220 education-science wikipedia wiki mechanics mechanics wikipedia encyclopedia mechanics greek μηχανική branch physics behaviour physical bodies subjected forces displacements 17 | 5221 education-science wikipedia wiki mechanic mechanic wikipedia encyclopedia mechanics specialised field auto mechanics boiler mechanics industrial maintenance mechanics millwrights 18 | 5222 education-science popularmechanics mechanics mechanics service magazine covering information home improvement automotive electronics computers telecommunications 19 | 5223 education-science carpros professional mechanics online car repair question professional mechanics online answer 20 | 5224 education-science bls gov oco ocos automotive service technicians mechanics opportunities automotive service technicians mechanics diagnostic problem-solving skills knowledge electronics 21 | 6760 engineering wikipedia wiki jet engine jet engine wikipedia encyclopedia jet engine engine discharges jet fluid generate thrust accordance newton law motion 22 | 6761 engineering aardvark pjet pulse jet engine information building pulsejet turbojet engine 23 | 6762 engineering aardvark pjet turbinenuts shtml crazy jet engine invention jet engine people crazier strapping 24 | 6763 engineering howstuffworks turbine howstuffworks gas turbine engines wonder jet engine cruising feet jets helicopters power plants class 25 | 6764 engineering inventors about library inventors blhowajetengineworks jet engine jet engine operates application sir isaac newton law physics 26 | 7630 health rcseng news newsitem hospital doctor interview president bernard ribeiro magazine hospital doctor interviewed bernard ribeiro president college feature interview touches 27 | 7631 health medicalnewstoday articles advocate south suburban hospital doctor tips jogging neighborhood biking trails inline skating chicago lakefront 28 | 7632 health doctorshospital doctors hospital life doctors hospital opelousas community healthcare provider dedicated serving healthcare landry parish surrounding 29 | 7633 health webserver bjc sfnet bjh physiciansearch barnes-jewish hospital physician search doctors listed referral directory medical staff barnes-jewish hospital doctors payment referral 30 | 7634 health gwhospital hospital doctor doctor george washington physician referral service help gw-docs 31 | 8520 politics-society congress congressorg congress org election candidates information candidate information accessible zip code national elections archives 32 | 8521 politics-society sos state elections candidates index shtml candidates candidate guide candidates ballot upcoming elections calendar 
election link texas ethics 33 | 8522 politics-society ieee organizations corporate candidates ieee ieee annual election candidates ieee annual election candidates listed positions candidates ieee annual election ballot 34 | 8523 politics-society stc candidatesfaq society technical communication candidates faq information candidates region director election stc annual meeting 35 | 8524 politics-society marketingpilgrim presidential election reputation presidential election candidate reputation study marketing declared presidential candidates negative search engine listings republicans 36 | 9070 sports yalebulldogs cstv yale university bulldogs athletic athletic yale university college sports network comprehensive coverage yale university athletics 37 | 9071 sports texastech cstv texas tech raiders athletic athletic texas tech university fansonly network comprehensive coverage raider athletics internet 38 | 9072 sports wikipedia wiki track and field athletics track field wikipedia encyclopedia athletics track field track field athletics collection sports events running throwing jumping 39 | 9073 sports siusalukis cstv southern illinois university athletic salukis athletic training spirit booster club team merchandise 40 | 9074 sports highschoolsports highschoolsports net school teams world highschoolsports net authority school schedules scores stats download sports schedules sign notified 41 | -------------------------------------------------------------------------------- /data/stopwords_en.txt: -------------------------------------------------------------------------------- 1 | a 2 | b 3 | c 4 | d 5 | e 6 | f 7 | g 8 | h 9 | i 10 | j 11 | k 12 | l 13 | m 14 | n 15 | o 16 | p 17 | q 18 | r 19 | s 20 | t 21 | u 22 | v 23 | w 24 | x 25 | y 26 | z 27 | / 28 | * 29 | & 30 | % 31 | # 32 | @ 33 | ! 34 | ~ 35 | + 36 | - 37 | �C 38 | ( 39 | ) 40 | ? 41 | : 42 | " 43 | ' 44 | \ 45 | = 46 | ` 47 | about 48 | above 49 | across 50 | after 51 | afterwards 52 | again 53 | against 54 | all 55 | almost 56 | alone 57 | along 58 | already 59 | also 60 | although 61 | always 62 | am 63 | among 64 | amongst 65 | amoungst 66 | amount 67 | an 68 | and 69 | another 70 | any 71 | anyhow 72 | anyone 73 | anything 74 | anyway 75 | anywhere 76 | are 77 | around 78 | as 79 | at 80 | bmp 81 | back 82 | be 83 | became 84 | because 85 | become 86 | becomes 87 | becoming 88 | been 89 | before 90 | beforehand 91 | behind 92 | being 93 | below 94 | beside 95 | besides 96 | between 97 | beyond 98 | bill 99 | both 100 | bottom 101 | but 102 | by 103 | call 104 | can 105 | cannot 106 | cant 107 | co 108 | computer 109 | com 110 | con 111 | could 112 | couldnt 113 | cry 114 | de 115 | describe 116 | detail 117 | do 118 | done 119 | down 120 | due 121 | during 122 | each 123 | eg 124 | e.g 125 | e.g. 
126 | eight 127 | either 128 | eleven 129 | else 130 | elsewhere 131 | empty 132 | enough 133 | etc 134 | e.t.c 135 | even 136 | ever 137 | every 138 | everyone 139 | everything 140 | everywhere 141 | except 142 | exactly 143 | few 144 | fifteen 145 | fifty 146 | fill 147 | find 148 | fire 149 | first 150 | five 151 | for 152 | former 153 | formerly 154 | forty 155 | found 156 | four 157 | from 158 | front 159 | full 160 | further 161 | get 162 | give 163 | go 164 | gt 165 | had 166 | has 167 | hasnt 168 | hasn't 169 | have 170 | he 171 | hence 172 | her 173 | here 174 | hereafter 175 | hereby 176 | herein 177 | hereupon 178 | hers 179 | herself 180 | him 181 | himself 182 | his 183 | how 184 | however 185 | http 186 | href 187 | i 188 | ie 189 | i.e 190 | id 191 | if 192 | in 193 | inc 194 | it 195 | image 196 | images 197 | indeed 198 | interest 199 | into 200 | is 201 | it 202 | its 203 | itself 204 | jpg 205 | j 206 | keep 207 | last 208 | latter 209 | latterly 210 | least 211 | less 212 | ltd 213 | lt 214 | made 215 | many 216 | may 217 | me 218 | meanwhile 219 | might 220 | mill 221 | mine 222 | more 223 | moreover 224 | most 225 | mostly 226 | move 227 | much 228 | must 229 | my 230 | myself 231 | name 232 | namely 233 | neither 234 | never 235 | nevertheless 236 | next 237 | nine 238 | no 239 | nobody 240 | none 241 | noone 242 | nor 243 | not 244 | nothing 245 | now 246 | nowhere 247 | of 248 | off 249 | often 250 | on 251 | once 252 | one 253 | only 254 | onto 255 | or 256 | other 257 | others 258 | otherwise 259 | our 260 | ours 261 | ourselves 262 | out 263 | over 264 | own 265 | part 266 | per 267 | perhaps 268 | please 269 | people 270 | person 271 | peoples 272 | persons 273 | profile 274 | profiles 275 | put 276 | rather 277 | re 278 | rt 279 | same 280 | see 281 | seem 282 | seemed 283 | seeming 284 | seems 285 | serious 286 | several 287 | she 288 | should 289 | show 290 | shall 291 | side 292 | since 293 | sincere 294 | six 295 | sixty 296 | so 297 | some 298 | somehow 299 | someone 300 | something 301 | sometime 302 | sometimes 303 | somewhere 304 | still 305 | such 306 | system 307 | take 308 | twitter 309 | tweet 310 | tweets 311 | ten 312 | than 313 | that 314 | the 315 | their 316 | them 317 | themselves 318 | then 319 | thence 320 | there 321 | thereafter 322 | thereby 323 | therefore 324 | therein 325 | thereupon 326 | these 327 | they 328 | thick 329 | thin 330 | third 331 | this 332 | those 333 | though 334 | three 335 | through 336 | throughout 337 | thru 338 | thus 339 | to 340 | together 341 | too 342 | top 343 | toward 344 | towards 345 | twelve 346 | twenty 347 | two 348 | un 349 | under 350 | until 351 | up 352 | upon 353 | us 354 | url 355 | user 356 | very 357 | via 358 | was 359 | we 360 | well 361 | were 362 | what 363 | whatever 364 | when 365 | whence 366 | whenever 367 | where 368 | whereafter 369 | whereas 370 | whereby 371 | wherein 372 | whereupon 373 | wherever 374 | whether 375 | which 376 | while 377 | wheather 378 | who 379 | www 380 | web 381 | whoever 382 | whole 383 | whom 384 | whose 385 | why 386 | will 387 | with 388 | within 389 | without 390 | would 391 | yet 392 | you 393 | your 394 | yours 395 | yourself 396 | yourselves 397 | reuters 398 | related 399 | story 400 | external 401 | link 402 | -------------------------------------------------------------------------------- /model/code/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from 
__future__ import division 3 | -------------------------------------------------------------------------------- /model/code/layers.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch 3 | from torch.nn.parameter import Parameter 4 | from torch.nn.modules.module import Module 5 | import torch.nn.functional as F 6 | from torch import nn 7 | 8 | class GraphConvolution(Module): 9 | def __init__(self, in_features, out_features, bias=True): 10 | super(GraphConvolution, self).__init__() 11 | self.in_features = in_features 12 | self.out_features = out_features 13 | self.weight = Parameter(torch.FloatTensor(in_features, out_features)) 14 | if bias: 15 | self.bias = Parameter(torch.FloatTensor(out_features)) 16 | else: 17 | self.register_parameter('bias', None) 18 | self.reset_parameters() 19 | 20 | def reset_parameters(self): 21 | stdv = 1. / math.sqrt(self.weight.size(1)) 22 | self.weight.data.uniform_(-stdv, stdv) 23 | if self.bias is not None: 24 | self.bias.data.uniform_(-stdv, stdv) 25 | 26 | def forward(self, inputs, adj, global_W = None): 27 | if len(adj._values()) == 0: 28 | return torch.zeros(adj.shape[0], self.out_features, device=inputs.device) 29 | 30 | support = torch.spmm(inputs, self.weight) 31 | if global_W is not None: 32 | support = torch.spmm(support, global_W) 33 | output = torch.spmm(adj, support) 34 | if self.bias is not None: 35 | return output + self.bias 36 | else: 37 | return output 38 | 39 | def __repr__(self): 40 | return self.__class__.__name__ + ' (' \ 41 | + str(self.in_features) + ' -> ' \ 42 | + str(self.out_features) + ')' 43 | 44 | 45 | class SelfAttention_ori(Module): 46 | def __init__(self, in_features): 47 | super(SelfAttention, self).__init__() 48 | self.a = Parameter(torch.FloatTensor(2 * in_features, 1)) 49 | self.reset_parameters() 50 | 51 | def reset_parameters(self): 52 | stdv = 1. / math.sqrt(self.a.size(1)) 53 | self.a.data.uniform_(-stdv, stdv) 54 | 55 | def forward(self, inputs): 56 | x = inputs.transpose(0, 1) 57 | self.n = x.size()[0] 58 | x = torch.cat([x, torch.stack([x] * self.n, dim=0)], dim=2) 59 | U = torch.matmul(x, self.a).transpose(0, 1) 60 | U = F.leaky_relu(U) 61 | weights = F.softmax(U, dim=1) 62 | outputs = torch.matmul(weights.transpose(1, 2), inputs).squeeze(1) 63 | return outputs, weights 64 | 65 | 66 | class SelfAttention(Module): 67 | def __init__(self, in_features, idx, hidden_dim): 68 | super(SelfAttention, self).__init__() 69 | self.idx = idx 70 | self.linear = torch.nn.Linear(in_features, hidden_dim) 71 | self.a = Parameter(torch.FloatTensor(2 * hidden_dim, 1)) 72 | self.reset_parameters() 73 | 74 | def reset_parameters(self): 75 | stdv = 1. 
/ math.sqrt(self.a.size(1)) 76 | self.a.data.uniform_(-stdv, stdv) 77 | 78 | def forward(self, inputs): 79 | # inputs size: node_num * 3 * in_features 80 | x = self.linear(inputs).transpose(0, 1) 81 | self.n = x.size()[0] 82 | x = torch.cat([x, torch.stack([x[self.idx]] * self.n, dim=0)], dim=2) 83 | U = torch.matmul(x, self.a).transpose(0, 1) 84 | U = F.leaky_relu_(U) 85 | weights = F.softmax(U, dim=1) 86 | outputs = torch.matmul(weights.transpose(1, 2), inputs).squeeze(1) * 3 87 | return outputs, weights 88 | 89 | 90 | class GraphAttentionConvolution(Module): 91 | def __init__(self, in_features_list, out_features, bias=True, gamma = 0.1): 92 | super(GraphAttentionConvolution, self).__init__() 93 | self.ntype = len(in_features_list) 94 | self.in_features_list = in_features_list 95 | self.out_features = out_features 96 | self.weights = nn.ParameterList() 97 | for i in range(self.ntype): 98 | cache = Parameter(torch.FloatTensor(in_features_list[i], out_features)) 99 | nn.init.xavier_normal_(cache.data, gain=1.414) 100 | self.weights.append( cache ) 101 | if bias: 102 | self.bias = Parameter(torch.FloatTensor(out_features)) 103 | stdv = 1. / math.sqrt(out_features) 104 | self.bias.data.uniform_(-stdv, stdv) 105 | else: 106 | self.register_parameter('bias', None) 107 | 108 | self.att_list = nn.ModuleList() 109 | for i in range(self.ntype): 110 | self.att_list.append( Attention_NodeLevel(out_features, gamma) ) 111 | 112 | 113 | def forward(self, inputs_list, adj_list, global_W = None): 114 | h = [] 115 | for i in range(self.ntype): 116 | h.append( torch.spmm(inputs_list[i], self.weights[i]) ) 117 | if global_W is not None: 118 | for i in range(self.ntype): 119 | h[i] = ( torch.spmm(h[i], global_W) ) 120 | outputs = [] 121 | for t1 in range(self.ntype): 122 | x_t1 = [] 123 | for t2 in range(self.ntype): 124 | # adj has no non-zeros 125 | if len(adj_list[t1][t2]._values()) == 0: 126 | x_t1.append( torch.zeros(adj_list[t1][t2].shape[0], self.out_features, device=self.bias.device) ) 127 | continue 128 | 129 | if self.bias is not None: 130 | x_t1.append( self.att_list[t1](h[t1], h[t2], adj_list[t1][t2]) + self.bias ) 131 | else: 132 | x_t1.append( self.att_list[t1](h[t1], h[t2], adj_list[t1][t2]) ) 133 | outputs.append(x_t1) 134 | 135 | return outputs 136 | 137 | 138 | 139 | 140 | 141 | class Attention_NodeLevel(nn.Module): 142 | def __init__(self, dim_features, gamma = 0.1): 143 | super(Attention_NodeLevel, self).__init__() 144 | 145 | self.dim_features = dim_features 146 | 147 | self.a1 = nn.Parameter(torch.zeros(size=(dim_features, 1))) 148 | self.a2 = nn.Parameter(torch.zeros(size=(dim_features, 1))) 149 | nn.init.xavier_normal_(self.a1.data, gain=1.414) 150 | nn.init.xavier_normal_(self.a2.data, gain=1.414) 151 | 152 | self.leakyrelu = nn.LeakyReLU(0.2) 153 | self.gamma = gamma 154 | 155 | def forward(self, input1, input2, adj): 156 | h = input1 157 | g = input2 158 | N = h.size()[0] 159 | M = g.size()[0] 160 | 161 | e1 = torch.matmul(h, self.a1).repeat(1, M) 162 | e2 = torch.matmul(g, self.a2).repeat(1, N).t() 163 | e = e1 + e2 164 | e = self.leakyrelu(e) 165 | 166 | zero_vec = -9e15*torch.ones_like(e) 167 | if 'sparse' in adj.type(): 168 | adj_dense = adj.to_dense() 169 | attention = torch.where(adj_dense > 0, e, zero_vec) 170 | attention = F.softmax(attention, dim=1) 171 | attention = torch.mul(attention, adj_dense.sum(1).repeat(M, 1).t()) 172 | attention = torch.add(attention * self.gamma, adj_dense * (1 - self.gamma)) 173 | del(adj_dense) 174 | else: 175 | attention = torch.where(adj > 0, 
e, zero_vec) 176 | attention = F.softmax(attention, dim=1) 177 | attention = torch.mul(attention, adj.sum(1).repeat(M, 1).t()) 178 | attention = torch.add(attention * self.gamma, adj.to_dense() * (1 - self.gamma)) 179 | del(zero_vec) 180 | 181 | h_prime = torch.matmul(attention, g) 182 | 183 | return h_prime 184 | 185 | 186 | -------------------------------------------------------------------------------- /model/code/models.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | from layers import * 6 | from torch.nn.parameter import Parameter 7 | from functools import reduce 8 | from utils import dense_tensor_to_sparse 9 | 10 | 11 | class HGAT(nn.Module): 12 | def __init__(self, nfeat_list, nhid, nclass, dropout, 13 | type_attention=True, node_attention=True, 14 | gamma=0.1, sigmoid=False, orphan=True, 15 | write_emb=True 16 | ): 17 | super(HGAT, self).__init__() 18 | self.sigmoid = sigmoid 19 | self.type_attention = type_attention 20 | self.node_attention = node_attention 21 | 22 | self.write_emb = write_emb 23 | if self.write_emb: 24 | self.emb = None 25 | self.emb2 = None 26 | 27 | self.nonlinear = F.relu_ 28 | 29 | self.nclass = nclass 30 | self.ntype = len(nfeat_list) 31 | 32 | dim_1st = nhid 33 | dim_2nd = nclass 34 | if orphan: 35 | dim_2nd += self.ntype - 1 36 | 37 | self.gc2 = nn.ModuleList() 38 | if not self.node_attention: 39 | self.gc1 = nn.ModuleList() 40 | for t in range(self.ntype): 41 | self.gc1.append( GraphConvolution(nfeat_list[t], dim_1st, bias=False) ) 42 | self.bias1 = Parameter( torch.FloatTensor(dim_1st) ) 43 | stdv = 1. / math.sqrt(dim_1st) 44 | self.bias1.data.uniform_(-stdv, stdv) 45 | else: 46 | self.gc1 = GraphAttentionConvolution(nfeat_list, dim_1st, gamma=gamma) 47 | self.gc2.append( GraphConvolution(dim_1st, dim_2nd, bias=True) ) 48 | 49 | if self.type_attention: 50 | self.at1 = nn.ModuleList() 51 | self.at2 = nn.ModuleList() 52 | for t in range(self.ntype): 53 | self.at1.append( SelfAttention(dim_1st, t, 50) ) 54 | self.at2.append( SelfAttention(dim_2nd, t, 50) ) 55 | 56 | self.dropout = dropout 57 | 58 | def forward(self, x_list, adj_list, adj_all = None): 59 | x0 = x_list 60 | 61 | if not self.node_attention: 62 | x1 = [None for _ in range(self.ntype)] 63 | # First Layer 64 | for t1 in range(self.ntype): 65 | x_t1 = [] 66 | for t2 in range(self.ntype): 67 | idx = t2 68 | x_t1.append( self.gc1[idx](x0[t2], adj_list[t1][t2]) + self.bias1 ) 69 | if self.type_attention: 70 | x_t1, weights = self.at1[t1]( torch.stack(x_t1, dim=1) ) 71 | else: 72 | x_t1 = reduce(torch.add, x_t1) 73 | 74 | x_t1 = self.nonlinear(x_t1) 75 | x_t1 = F.dropout(x_t1, self.dropout, training=self.training) 76 | x1[t1] = x_t1 77 | else: 78 | x1 = [None for _ in range(self.ntype)] 79 | x1_in = self.gc1(x0, adj_list) 80 | for t1 in range(len(x1_in)): 81 | x_t1 = x1_in[t1] 82 | if self.type_attention: 83 | x_t1, weights = self.at1[t1]( torch.stack(x_t1, dim=1) ) 84 | else: 85 | x_t1 = reduce(torch.add, x_t1) 86 | x_t1 = self.nonlinear(x_t1) 87 | x_t1 = F.dropout(x_t1, self.dropout, training=self.training) 88 | x1[t1] = x_t1 89 | if self.write_emb: 90 | self.emb = x1[0] 91 | 92 | x2 = [None for _ in range(self.ntype)] 93 | # Second Layer 94 | for t1 in range(self.ntype): 95 | x_t1 = [] 96 | for t2 in range(self.ntype): 97 | if adj_list[t1][t2] is None: 98 | continue 99 | idx = 0 100 | x_t1.append( self.gc2[idx](x1[t2], adj_list[t1][t2]) ) 101 | if 
self.type_attention: 102 | x_t1, weights = self.at2[t1]( torch.stack(x_t1, dim=1) ) 103 | else: 104 | x_t1 = reduce(torch.add, x_t1) 105 | 106 | x2[t1] = x_t1 107 | if self.write_emb and t1 == 0: 108 | self.emb2 = x2[t1] 109 | 110 | # output layer 111 | if self.sigmoid: 112 | x2[t1] = torch.sigmoid(x_t1) 113 | else: 114 | x2[t1] = F.log_softmax(x_t1, dim=1) 115 | 116 | return x2 117 | 118 | -------------------------------------------------------------------------------- /model/code/print_log.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | 4 | 5 | class Logger(object): 6 | def __init__(self, filename="Default.log", remove=True): 7 | self.terminal = sys.stdout 8 | if remove and os.path.exists(filename): 9 | os.remove(filename) 10 | self.log = open(filename, "a") 11 | 12 | def write(self, message): 13 | self.terminal.write(message) 14 | self.log.write(message) 15 | 16 | def flush(self): 17 | pass 18 | 19 | def change_file(self, filename="Default.log"): 20 | self.log.close() 21 | self.log = open(filename, "a") 22 | 23 | 24 | if __name__ == '__main__': 25 | sys.stdout = Logger("yourlogfilename.txt") 26 | print('content.') -------------------------------------------------------------------------------- /model/code/train.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | from __future__ import print_function 3 | 4 | import time 5 | import argparse 6 | import numpy as np 7 | import pickle as pkl 8 | from copy import deepcopy 9 | from random import shuffle 10 | 11 | import torch 12 | import torch.nn as nn 13 | import torch.nn.functional as F 14 | import torch.optim as optim 15 | from sklearn.metrics import precision_recall_fscore_support 16 | from sklearn.metrics import accuracy_score 17 | from sklearn.metrics import classification_report 18 | 19 | from utils import load_data, accuracy, dense_tensor_to_sparse, resample, makedirs 20 | from utils_inductive import transform_dataset_by_idx 21 | from models import HGAT 22 | import os, gc, sys 23 | from print_log import Logger 24 | 25 | logdir = "log/" 26 | savedir = 'model/' 27 | embdir = 'embeddings/' 28 | makedirs([logdir, savedir, embdir]) 29 | 30 | os.environ["CUDA_VISIBLE_DEVICES"] = "2" 31 | 32 | write_embeddings = True 33 | HOP = 2 34 | 35 | dataset = 'example' 36 | 37 | LR = 0.01 if dataset == 'snippets' else 0.005 38 | DP = 0.95 if dataset in ['agnews', 'tagmynews'] else 0.8 39 | WD = 0 if dataset == 'snippets' else 5e-6 40 | LR = 0.05 if 'multi' in dataset else LR 41 | DP = 0.5 if 'multi' in dataset else DP 42 | WD = 0 if 'multi' in dataset else WD 43 | 44 | # Training settings 45 | parser = argparse.ArgumentParser() 46 | parser.add_argument('--no_cuda', action='store_true', default=False, 47 | help='Disables CUDA training.') 48 | parser.add_argument('--fastmode', action='store_true', default=False, 49 | help='Validate during training pass.') 50 | parser.add_argument('--seed', type=int, default=42, help='Random seed.') 51 | parser.add_argument('--epochs', type=int, default=300, 52 | help='Number of epochs to train.') 53 | parser.add_argument('--lr', type=float, default=LR, 54 | help='Initial learning rate.') 55 | parser.add_argument('--weight_decay', type=float, default=WD, 56 | help='Weight decay (L2 loss on parameters).') 57 | parser.add_argument('--hidden', type=int, default=512, 58 | help='Number of hidden units.') 59 | parser.add_argument('--dropout', type=float, default=DP, 60 | help='Dropout 
rate (1 - keep probability).') 61 | parser.add_argument('--inductive', type=bool, default=False, 62 | help='Whether use the transductive mode or inductive mode. ') 63 | parser.add_argument('--dataset', type=str, default=dataset, 64 | help='Dataset') 65 | parser.add_argument('--repeat', type=int, default=1, 66 | help='Number of repeated trials') 67 | parser.add_argument('--node', action='store_false', default=True, 68 | help='Use node-level attention or not. ') 69 | parser.add_argument('--type', action='store_false', default=True, 70 | help='Use type-level attention or not. ') 71 | 72 | args = parser.parse_args() 73 | dataset = args.dataset 74 | 75 | args.cuda = not args.no_cuda and torch.cuda.is_available() 76 | sys.stdout = Logger(logdir + "{}.log".format(dataset)) 77 | 78 | np.random.seed(args.seed) 79 | torch.manual_seed(args.seed) 80 | if args.cuda: 81 | torch.cuda.manual_seed(args.seed) 82 | 83 | loss_list = dict() 84 | 85 | def margin_loss(preds, y, weighted_sample=False): 86 | nclass = y.shape[1] 87 | preds = preds[:, :nclass] 88 | y = y.float() 89 | lam = 0.25 90 | m = nn.Threshold(0., 0.) 91 | loss = y * m(0.9 - preds) ** 2 + \ 92 | lam * (1.0 - y) * (m(preds - 0.1) ** 2) 93 | 94 | if weighted_sample: 95 | n, N = y.sum(dim=0, keepdim=True), y.shape[0] 96 | weight = torch.where(y == 1, n, torch.zeros_like(loss)) 97 | weight = torch.where(y != 1, N-n, weight) 98 | weight = N / weight / 2 99 | loss = torch.mul(loss, weight) 100 | 101 | loss = torch.mean(torch.sum(loss, dim=1)) 102 | return loss 103 | 104 | def nll_loss(preds, y): 105 | y = y.max(1)[1].type_as(labels) 106 | return F.nll_loss(preds, y) 107 | 108 | def evaluate(preds_list, y_list): 109 | nclass = y_list.shape[1] 110 | preds_list = preds_list[:, :nclass] 111 | if not preds_list.device == 'cpu': 112 | preds_list, y_list = preds_list.cpu(), y_list.cpu() 113 | 114 | threshold = 0.5 115 | multi_label = 'multi' in dataset 116 | if multi_label: 117 | y_list = y_list.numpy() 118 | preds_probs = preds_list.detach().numpy() 119 | preds = deepcopy(preds_probs) 120 | preds[np.arange(preds.shape[0]), preds.argmax(1)] = 1.0 121 | preds[np.where(preds >= threshold)] = 1.0 122 | preds[np.where(preds < threshold)] = 0.0 123 | [precision, recall, F1, support] = \ 124 | precision_recall_fscore_support(y_list[preds.sum(axis=1) != 0], preds[preds.sum(axis=1) != 0], 125 | average='micro') 126 | [precision_ma, recall_ma, F1_ma, support] = \ 127 | precision_recall_fscore_support(y_list[preds.sum(axis=1) != 0], preds[preds.sum(axis=1) != 0], 128 | average='macro') 129 | ER = accuracy_score(y_list, preds) * 100 130 | 131 | report = classification_report(y_list, preds, digits=5) 132 | 133 | print(' ER: %6.2f' % ER, 134 | 'mi-ma: P: %5.1f %5.1f' % (precision*100, precision_ma*100), 135 | 'R: %5.1f %5.1f' % (recall*100, recall_ma*100), 136 | 'F1: %5.1f %5.1f' % (F1*100, F1_ma*100), 137 | end="") 138 | return ER, report 139 | else: 140 | y_list = y_list.numpy() 141 | preds_probs = preds_list.detach().numpy() 142 | preds = deepcopy(preds_probs) 143 | preds[np.arange(preds.shape[0]), preds.argmax(1)] = 1.0 144 | preds[np.where(preds < 1)] = 0.0 145 | [precision, recall, F1, support] = \ 146 | precision_recall_fscore_support(y_list, preds, average='macro') 147 | ER = accuracy_score(y_list, preds) * 100 148 | print(' Ac: %6.2f' % ER, 149 | 'P: %5.1f' % (precision*100), 150 | 'R: %5.1f' % (recall*100), 151 | 'F1: %5.1f' % (F1*100), 152 | end="") 153 | return ER, F1 154 | 155 | LOSS = margin_loss if 'multi' in dataset else nll_loss 156 | 157 | 158 | def 
train(epoch, 159 | input_adj_train, input_features_train, idx_out_train, idx_train, 160 | input_adj_val, input_features_val, idx_out_val, idx_val): 161 | print('Epoch: {:04d}'.format(epoch+1), end='') 162 | t = time.time() 163 | model.train() 164 | optimizer.zero_grad() 165 | output = model(input_features_train, input_adj_train) 166 | 167 | if isinstance(output, list): 168 | O, L = output[0][idx_out_train], labels[idx_train] 169 | else: 170 | O, L = output[idx_out_train], labels[idx_train] 171 | loss_train = LOSS(O, L) 172 | print(' | loss: {:.4f}'.format(loss_train.item()), end='') 173 | acc_train, f1_train = evaluate(O, L) 174 | loss_train.backward() 175 | optimizer.step() 176 | 177 | model.eval() 178 | output = model(input_features_val, input_adj_val) 179 | if isinstance(output, list): 180 | loss_val = LOSS(output[0][idx_out_val], labels[idx_val]) 181 | print(' | loss: {:.4f}'.format(loss_val.item()), end='') 182 | results = evaluate(output[0][idx_out_val], labels[idx_val]) 183 | else: 184 | loss_val = LOSS(output[idx_out_val], labels[idx_val]) 185 | print(' | loss: {:.4f}'.format(loss_val.item()), end='') 186 | results = evaluate(output[idx_out_val], labels[idx_val]) 187 | print(' | time: {:.4f}s'.format(time.time() - t)) 188 | loss_list[epoch] = [loss_train.item()] 189 | 190 | if 'multi' in dataset: 191 | acc_val, res_line = results 192 | return float(acc_val.item()), res_line 193 | else: 194 | acc_val, f1_val = results 195 | return float(acc_val.item()), float(f1_val.item()) 196 | 197 | 198 | def test(epoch, input_adj_test, input_features_test, idx_out_test, idx_test): 199 | print(' '*90 if 'multi' in dataset else ' '*65, end='') 200 | t = time.time() 201 | model.eval() 202 | output = model(input_features_test, input_adj_test) 203 | 204 | if isinstance(output, list): 205 | loss_test = LOSS(output[0][idx_out_test], labels[idx_test]) 206 | print(' | loss: {:.4f}'.format(loss_test.item()), end='') 207 | results = evaluate(output[0][idx_out_test], labels[idx_test]) 208 | else: 209 | loss_test = LOSS(output[idx_out_test], labels[idx_test]) 210 | print(' | loss: {:.4f}'.format(loss_test.item()), end='') 211 | results = evaluate(output[idx_out_test], labels[idx_test]) 212 | print(' | time: {:.4f}s'.format(time.time() - t)) 213 | loss_list[epoch] += [loss_test.item()] 214 | 215 | if 'multi' in dataset: 216 | acc_test, res_line = results 217 | return float(acc_test.item()), res_line 218 | else: 219 | acc_test, f1_test = results 220 | return float(acc_test.item()), float(f1_test.item()) 221 | 222 | 223 | 224 | 225 | path = '../data/'+ dataset +'/' 226 | adj, features, labels, idx_train_ori, idx_val_ori, idx_test_ori, idx_map = load_data(path = path, dataset = dataset) 227 | N = len(adj) 228 | 229 | # Transductive的数据集变换 230 | if args.inductive: 231 | print("Transfer to be inductive.") 232 | 233 | # resample 234 | # 之前的数据集划分: 训练集20 * class 验证集1000 其他的测试集 235 | # 这里换成: 训练集不变, 验证集 -> 测试集, 236 | # 原本的测试集拆出和训练集一样多的样本作为验证集,其余作为无标注样本 237 | idx_train,idx_unlabeled,idx_val,idx_test = resample(idx_train_ori,idx_val_ori,idx_test_ori,path,idx_map) 238 | 239 | # if experimentType == 'unlabeled': 240 | # bias = int(idx_unlabeled.shape[0] * supPara) 241 | # idx_unlabeled = idx_unlabeled[: bias] 242 | # print("\n\tidx_train: {}, idx_unlabeled: {},\n\tidx_val: {}, idx_test: {}".format( 243 | # idx_train.shape[0], idx_unlabeled.shape[0], idx_val.shape[0], idx_test.shape[0])) 244 | 245 | input_adj_train, input_features_train, idx_related_train, idx_out_train = \ 246 | 
transform_dataset_by_idx(adj,features,torch.cat([idx_train, idx_unlabeled]),idx_train, hop=HOP) 247 | input_adj_val, input_features_val, idx_related_val, idx_out_val = \ 248 | transform_dataset_by_idx(adj,features,idx_val,idx_val, hop=HOP) 249 | input_adj_test, input_features_test, idx_related_test, idx_out_test = \ 250 | transform_dataset_by_idx(adj,features,idx_test,idx_test, hop=HOP) 251 | 252 | all_node_count = sum([_.shape[0] for _ in adj[0]]) 253 | all_input_idx, all_related_idx = set(), set() 254 | for input_idx, related_idx in [ [torch.cat([idx_train, idx_unlabeled]), idx_related_train], 255 | [idx_val, idx_related_val], 256 | [idx_test, idx_related_test] ]: 257 | print("# input_nodes: {}, # related_nodes: {} / {}".format( 258 | len(input_idx), len(related_idx), all_node_count)) 259 | all_input_idx.update(input_idx.numpy().tolist()) 260 | all_related_idx.update(related_idx.numpy().tolist()) 261 | print("Sum: # input_nodes: {}, # related_nodes: {} / {}\n".format( 262 | len(all_input_idx), len(all_related_idx), all_node_count)) 263 | else: 264 | print("Transfer to be transductive.") 265 | input_adj_train, input_features_train, idx_out_train = adj, features, idx_train_ori 266 | input_adj_val, input_features_val, idx_out_val = adj, features, idx_val_ori 267 | input_adj_test, input_features_test, idx_out_test = adj, features, idx_test_ori 268 | idx_train, idx_val, idx_test = idx_train_ori, idx_val_ori, idx_test_ori 269 | 270 | 271 | if args.cuda: 272 | N = len(features) 273 | for i in range(N): 274 | if input_features_train[i] is not None: 275 | input_features_train[i] = input_features_train[i].cuda() 276 | if input_features_val[i] is not None: 277 | input_features_val[i] = input_features_val[i].cuda() 278 | if input_features_test[i] is not None: 279 | input_features_test[i] = input_features_test[i].cuda() 280 | for i in range(N): 281 | for j in range(N): 282 | if input_adj_train[i][j] is not None: 283 | input_adj_train[i][j] = input_adj_train[i][j].cuda() 284 | if input_adj_val[i][j] is not None: 285 | input_adj_val[i][j] = input_adj_val[i][j].cuda() 286 | if input_adj_test[i][j] is not None: 287 | input_adj_test[i][j] = input_adj_test[i][j].cuda() 288 | labels = labels.cuda() 289 | idx_train, idx_out_train = idx_train.cuda(), idx_out_train.cuda() 290 | idx_val, idx_out_val = idx_val.cuda(), idx_out_val.cuda() 291 | idx_test, idx_out_test = idx_test.cuda(), idx_out_test.cuda() 292 | 293 | 294 | 295 | FINAL_RESULT = [] 296 | for i in range(args.repeat): 297 | # Model and optimizer 298 | print("\n\nNo. 
{} test.\n".format(i+1)) 299 | model = HGAT(nfeat_list=[i.shape[1] for i in features], 300 | type_attention=args.type, 301 | node_attention=args.node, 302 | nhid=args.hidden, 303 | nclass=labels.shape[1], 304 | dropout=args.dropout, 305 | gamma=0.1, 306 | orphan=True, 307 | ) 308 | 309 | 310 | # print(model) 311 | print(len(list(model.parameters()))) 312 | optimizer = optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.weight_decay) 313 | 314 | if args.cuda: 315 | model.cuda() 316 | 317 | print(len(list(model.parameters()))) 318 | print([i.size() for i in model.parameters()]) 319 | # Train model 320 | t_total = time.time() 321 | vali_max = [0, [0, 0], -1] 322 | 323 | 324 | for epoch in range(args.epochs): 325 | vali_acc, vali_f1 = train(epoch, 326 | input_adj_train, input_features_train, idx_out_train, idx_train, 327 | input_adj_val, input_features_val, idx_out_val, idx_val) 328 | test_acc, test_f1 = test(epoch, 329 | input_adj_test, input_features_test, idx_out_test, idx_test) 330 | if vali_acc > vali_max[0]: 331 | vali_max = [vali_acc, (test_acc, test_f1), epoch+1] 332 | with open(savedir + "{}.pkl".format(dataset), 'wb') as f: 333 | pkl.dump(model, f) 334 | 335 | if write_embeddings: 336 | makedirs([embdir]) 337 | with open(embdir + "{}.emb".format(dataset), 'w') as f: 338 | for i in model.emb.tolist(): 339 | f.write("{}\n".format(i)) 340 | with open(embdir + "{}.emb2".format(dataset), 'w') as f: 341 | for i in model.emb2.tolist(): 342 | f.write("{}\n".format(i)) 343 | 344 | print("Optimization Finished!") 345 | print("Total time elapsed: {:.4f}s".format(time.time() - t_total)) 346 | if 'multi' in dataset: 347 | print("The best result is ACC: {0:.4f}, where epoch is {2}\n{1}\n".format( 348 | vali_max[1][0], 349 | vali_max[1][1], 350 | vali_max[2])) 351 | else: 352 | print("The best result is: ACC: {0:.4f} F1: {1:.4f}, where epoch is {2}\n\n".format( 353 | vali_max[1][0], 354 | vali_max[1][1], 355 | vali_max[2])) 356 | FINAL_RESULT.append(list(vali_max)) 357 | 358 | print("\n") 359 | for i in range(len(FINAL_RESULT)): 360 | if 'multi' in dataset: 361 | print("{0}:\tvali: {1:.5f}\ttest: ACC: {2:.4f}, epoch={4}.\n{3}".format( 362 | i, 363 | FINAL_RESULT[i][0], 364 | FINAL_RESULT[i][1][0], 365 | FINAL_RESULT[i][1][1], 366 | FINAL_RESULT[i][2])) 367 | else: 368 | print("{}:\tvali: {:.5f}\ttest: ACC: {:.4f} F1: {:.4f}, epoch={}".format( 369 | i, 370 | FINAL_RESULT[i][0], 371 | FINAL_RESULT[i][1][0], 372 | FINAL_RESULT[i][1][1], 373 | FINAL_RESULT[i][2])) -------------------------------------------------------------------------------- /model/code/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import scipy.sparse as sp 3 | from random import shuffle 4 | import torch 5 | from tqdm import tqdm 6 | import os 7 | 8 | 9 | def load_data(path="../data/citeseer/", dataset="citeseer"): 10 | print('Loading {} dataset...'.format(dataset)) 11 | features_block = False # concatenate the feature spaces or not 12 | 13 | MULTI_LABEL = 'multi' in dataset 14 | 15 | type_list = ['text', 'topic', 'entity'] 16 | type_have_label = 'text' 17 | 18 | features_list = [] 19 | idx_map_list = [] 20 | idx2type = {t: set() for t in type_list} 21 | 22 | for type_name in type_list: 23 | print('Loading {} content...'.format(type_name)) 24 | print(path) 25 | print(dataset) 26 | print(type_name) 27 | indexes, features, labels = [], [], [] 28 | with open("{}{}.content.{}".format(path, dataset, type_name)) as f: 29 | for line in tqdm(f): 30 | cache = 
line.strip().split('\t') 31 | indexes.append(np.array(cache[0], dtype=int)) 32 | features.append(np.array(cache[1:-1], dtype=np.float32)) 33 | labels.append(np.array([cache[-1]], dtype=str) ) 34 | features = np.stack(features) 35 | features = normalize(features) 36 | if not features_block: 37 | features = torch.FloatTensor(np.array(features)) 38 | features = dense_tensor_to_sparse(features) 39 | 40 | features_list.append(features) 41 | 42 | if type_name == type_have_label: 43 | labels = np.stack(labels) 44 | if not MULTI_LABEL: 45 | labels = encode_onehot(labels) 46 | else: 47 | labels = multi_label(labels) 48 | Labels = torch.LongTensor(labels) 49 | print("label matrix shape: {}".format(Labels.shape)) 50 | 51 | idx = np.stack(indexes) 52 | for i in idx: 53 | idx2type[type_name].add(i) 54 | idx_map = {j: i for i, j in enumerate(idx)} 55 | idx_map_list.append(idx_map) 56 | print('done.') 57 | 58 | len_list = [len(idx2type[t]) for t in type_list] 59 | type2len = {t: len(idx2type[t]) for t in type_list} 60 | len_all = sum(len_list) 61 | if features_block: 62 | flen = [i.shape[1] for i in features_list] 63 | features = sp.lil_matrix(np.zeros((len_all, sum(flen))), dtype=np.float32) 64 | bias = 0 65 | for i_l in range(len(len_list)): 66 | features[bias:bias+len_list[i_l], :flen[i_l]] = features_list[i_l] 67 | features_list[i_l] = features[bias:bias+len_list[i_l], :] 68 | bias += len_list[i_l] 69 | for fi in range(len(features_list)): 70 | features_list[fi] = torch.FloatTensor(np.array(features_list[fi].todense())) 71 | features_list[fi] = dense_tensor_to_sparse(features_list[fi]) 72 | 73 | print('Building graph...') 74 | adj_list = [[None for _ in range(len(type_list))] for __ in range(len(type_list))] 75 | # build graph 76 | edges_unordered = np.genfromtxt("{}{}.cites".format(path, dataset), 77 | dtype=np.int32) 78 | 79 | adj_all = sp.lil_matrix(np.zeros((len_all, len_all)), dtype=np.float32) 80 | 81 | for i1 in range(len(type_list)): 82 | for i2 in range(len(type_list)): 83 | t1, t2 = type_list[i1], type_list[i2] 84 | if i1 == i2: 85 | edges = [] 86 | for edge in edges_unordered: 87 | if (edge[0] in idx2type[t1] and edge[1] in idx2type[t2]): 88 | edges.append([idx_map_list[i1].get(edge[0]), idx_map_list[i2].get(edge[1])]) 89 | edges = np.array(edges) 90 | if len(edges) > 0: 91 | adj = sp.coo_matrix((np.ones(edges.shape[0]), (edges[:, 0], edges[:, 1])), 92 | shape=(type2len[t1], type2len[t2]), dtype=np.float32) 93 | else: 94 | adj = sp.coo_matrix((type2len[t1], type2len[t2]), dtype=np.float32) 95 | adj_all[sum(len_list[:i1]): sum(len_list[:i1 + 1]), 96 | sum(len_list[:i2]): sum(len_list[:i2 + 1])] = adj.tolil() 97 | 98 | elif i1 < i2: 99 | edges = [] 100 | for edge in edges_unordered: 101 | if (edge[0] in idx2type[t1] and edge[1] in idx2type[t2]): 102 | edges.append([idx_map_list[i1].get(edge[0]), idx_map_list[i2].get(edge[1])]) 103 | elif (edge[1] in idx2type[t1] and edge[0] in idx2type[t2]): 104 | edges.append([idx_map_list[i1].get(edge[1]), idx_map_list[i2].get(edge[0])]) 105 | edges = np.array(edges) 106 | if len(edges) > 0: 107 | adj = sp.coo_matrix((np.ones(edges.shape[0]), (edges[:, 0], edges[:, 1])), 108 | shape=(type2len[t1], type2len[t2]), dtype=np.float32) 109 | else: 110 | adj = sp.coo_matrix((type2len[t1], type2len[t2]), dtype=np.float32) 111 | 112 | adj_all[ 113 | sum(len_list[:i1]): sum(len_list[:i1 + 1]), 114 | sum(len_list[:i2]): sum(len_list[:i2 + 1])] = adj.tolil() 115 | adj_all[ 116 | sum(len_list[:i2]): sum(len_list[:i2 + 1]), 117 | sum(len_list[:i1]): sum(len_list[:i1 + 
1])] = adj.T.tolil() 118 | 119 | adj_all = adj_all + adj_all.T.multiply(adj_all.T > adj_all) - adj_all.multiply(adj_all.T > adj_all) 120 | adj_all = normalize_adj(adj_all + sp.eye(adj_all.shape[0])) 121 | 122 | for i1 in range(len(type_list)): 123 | for i2 in range(len(type_list)): 124 | adj_list[i1][i2] = sparse_mx_to_torch_sparse_tensor( 125 | adj_all[sum(len_list[:i1]): sum(len_list[:i1 + 1]), 126 | sum(len_list[:i2]): sum(len_list[:i2 + 1])] 127 | ) 128 | 129 | print("Num of edges: {}".format(len( adj_all.nonzero()[0] ))) 130 | idx_train, idx_val, idx_test = load_divide_idx(path, idx_map_list[0]) 131 | return adj_list, features_list, Labels, idx_train, idx_val, idx_test, idx_map_list[0] 132 | 133 | 134 | def multi_label(labels): 135 | def myfunction(x): 136 | return list(map(int, x[0].split())) 137 | return np.apply_along_axis(myfunction, axis=1, arr=labels) 138 | 139 | 140 | def encode_onehot(labels): 141 | classes = set(labels.T[0]) 142 | classes_dict = {c: np.identity(len(classes))[i, :] for i, c in 143 | enumerate(classes)} 144 | labels_onehot = np.array(list(map(classes_dict.get, labels.T[0])), 145 | dtype=np.int32) 146 | return labels_onehot 147 | 148 | def load_divide_idx(path, idx_map): 149 | idx_train = [] 150 | idx_val = [] 151 | idx_test = [] 152 | with open(path+'train.map', 'r') as f: 153 | for line in f: 154 | idx_train.append( idx_map.get(int(line.strip('\n'))) ) 155 | with open(path+'vali.map', 'r') as f: 156 | for line in f: 157 | idx_val.append( idx_map.get(int(line.strip('\n'))) ) 158 | with open(path+'test.map', 'r') as f: 159 | for line in f: 160 | idx_test.append( idx_map.get(int(line.strip('\n'))) ) 161 | 162 | print("train, vali, test: ", len(idx_train), len(idx_val), len(idx_test)) 163 | idx_train = torch.LongTensor(idx_train) 164 | idx_val = torch.LongTensor(idx_val) 165 | idx_test = torch.LongTensor(idx_test) 166 | return idx_train, idx_val, idx_test 167 | 168 | 169 | def resample(train, val, test : torch.LongTensor, path, idx_map, rewrite=True): 170 | if os.path.exists(path+'train_inductive.map'): 171 | rewrite = False 172 | filenames = ['train', 'unlabeled', 'vali', 'test'] 173 | ans = [] 174 | for file in filenames: 175 | with open(path+file+'_inductive.map', 'r') as f: 176 | cache = [] 177 | for line in f: 178 | cache.append(idx_map.get(int(line))) 179 | ans.append(torch.LongTensor(cache)) 180 | return ans 181 | 182 | idx_train = train 183 | idx_test = val 184 | cache = list(test.numpy()) 185 | shuffle(cache) 186 | idx_val = cache[: idx_train.shape[0]] 187 | idx_unlabeled = cache[idx_train.shape[0]: ] 188 | idx_val = torch.LongTensor(idx_val) 189 | idx_unlabeled = torch.LongTensor(idx_unlabeled) 190 | 191 | print("\n\ttrain: ", idx_train.shape[0], 192 | "\n\tunlabeled: ", idx_unlabeled.shape[0], 193 | "\n\tvali: ", idx_val.shape[0], 194 | "\n\ttest: ", idx_test.shape[0]) 195 | if rewrite: 196 | idx_map_reverse = dict(map(lambda t: (t[1], t[0]), idx_map.items())) 197 | filenames = ['train', 'unlabeled', 'vali', 'test'] 198 | ans = [idx_train, idx_unlabeled, idx_val, idx_test] 199 | for i in range(4): 200 | with open(path+filenames[i]+'_inductive.map', 'w') as f: 201 | f.write("\n".join(map(str, map(idx_map_reverse.get, ans[i].numpy())))) 202 | 203 | return idx_train, idx_unlabeled, idx_val, idx_test 204 | 205 | 206 | def normalize(mx): 207 | rowsum = np.array(mx.sum(1)) 208 | r_inv = np.power(rowsum, -1).flatten() 209 | r_inv[np.isinf(r_inv)] = 0. 
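# rows that sum to zero keep an inverse of 0 (avoids inf); the diagonal matrix of inverse row sums built next row-normalizes the feature matrix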
210 | r_mat_inv = sp.diags(r_inv) 211 | mx = r_mat_inv.dot(mx) 212 | return mx 213 | 214 | 215 | def normalize_adj(mx): 216 | rowsum = np.array(mx.sum(1)) 217 | r_inv_sqrt = np.power(rowsum, -0.5).flatten() 218 | r_inv_sqrt[np.isinf(r_inv_sqrt)] = 0. 219 | r_mat_inv_sqrt = sp.diags(r_inv_sqrt) 220 | return mx.dot(r_mat_inv_sqrt).transpose().dot(r_mat_inv_sqrt) 221 | 222 | 223 | def accuracy(output, labels): 224 | preds = output.max(1)[1].type_as(labels) 225 | correct = preds.eq(labels).double() 226 | correct = correct.sum() 227 | return correct / len(labels) 228 | 229 | 230 | def sparse_mx_to_torch_sparse_tensor(sparse_mx): 231 | """Convert a scipy sparse matrix to a torch sparse tensor.""" 232 | if len(sparse_mx.nonzero()[0]) == 0: 233 | # 空矩阵 234 | r, c = sparse_mx.shape 235 | return torch.sparse.FloatTensor(r, c) 236 | sparse_mx = sparse_mx.tocoo().astype(np.float32) 237 | indices = torch.from_numpy( 238 | np.vstack((sparse_mx.row, sparse_mx.col)).astype(np.int64)) 239 | values = torch.from_numpy(sparse_mx.data) 240 | shape = torch.Size(sparse_mx.shape) 241 | return torch.sparse.FloatTensor(indices, values, shape) 242 | 243 | 244 | def dense_tensor_to_sparse(dense_mx): 245 | return sparse_mx_to_torch_sparse_tensor( sp.coo.coo_matrix(dense_mx) ) 246 | 247 | 248 | def makedirs(dirs: list): 249 | for d in dirs: 250 | if not os.path.exists(d): 251 | os.makedirs(d) 252 | return -------------------------------------------------------------------------------- /model/code/utils_inductive.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import networkx 3 | import torch 4 | import os 5 | from tqdm import tqdm 6 | import pickle as pkl 7 | from utils import dense_tensor_to_sparse 8 | 9 | 10 | def get_related_nodes(adj_list, idx, k): 11 | def bfs_1_deep(adj, candidate): 12 | candidate = torch.LongTensor(list(candidate)) 13 | ans = torch.sum(adj[candidate], dim=0) 14 | ans = ans.nonzero().squeeze() 15 | return set(ans.numpy()) 16 | 17 | def bfs_k_deep(adj, candidate, k): 18 | candidate = torch.LongTensor(list(candidate)) 19 | ans = set(candidate.numpy()) 20 | next_candidate = candidate 21 | for i in range(k): 22 | next_candidate = bfs_1_deep(adj, next_candidate) 23 | next_candidate = next_candidate - ans 24 | ans.update(next_candidate) 25 | return ans 26 | 27 | adj = combine_adj_list(adj_list) 28 | involved_nodes = bfs_k_deep(adj, list(idx), k=k) 29 | ans = torch.LongTensor(list(involved_nodes), device=adj.device) 30 | return ans.sort()[0] 31 | 32 | 33 | def combine_adj_list(adj_list): 34 | ans = [] 35 | for i in adj_list: 36 | cache = [] 37 | for j in i: 38 | cache.append(j.to_dense()) 39 | ans.append(torch.cat(cache, dim=1)) 40 | return torch.cat(ans, dim=0) 41 | 42 | 43 | def transform_idx_for_adjList(adj_list, idx): 44 | N = len(adj_list) 45 | bias = 0 46 | ans = [] 47 | for adj in adj_list[0]: 48 | shape = adj.shape 49 | idx_r = idx[(idx >= bias) & (idx < bias + shape[1])] - bias 50 | # idx_c = idx[(idx>=bias_c) & (idx entity_set( entity_name ), edge_list( (doc_idx, entity_name) ) 86 | 87 | # path = "/home/ytc/GCN/整理的/HGAT/data/entity_recog/test.txt" 88 | rootpath = "../../data/entity_recog/" 89 | with open(rootpath + "test.txt", 'w') as f: 90 | f.write('\n'.join(sentences)) 91 | origin_path = os.getcwd() 92 | os.chdir(rootpath) 93 | os.system("python ER.py --test=test.txt --output=output_file.txt") 94 | os.chdir(origin_path) 95 | 96 | edges = [] 97 | entities = set() 98 | with open(rootpath + "output_file.txt", 'r') as f: 99 | lines 
= f.readlines() 100 | for i in range(0, len(lines), 3): 101 | l1, l2, l3 = lines[i:i + 3] 102 | entity_list = l2.strip('\n').split('\t') 103 | entities.update(set(entity_list)) 104 | edges.extend([("test_{}".format(i//3), e) for e in entity_list]) 105 | 106 | return edges 107 | 108 | 109 | def get_topic(sentences, datapath): # expects pre-tokenized, space-separated sentences 110 | # sent_list(content) => topic_set( topic_name: "topic_idx" ), edge_list( (doc_idx, topic_name) ) 111 | TopK_for_Topics = 2 112 | 113 | def naive_arg_topK(matrix, K, axis=0): 114 | """ 115 | perform topK based on np.argsort 116 | :param matrix: to be sorted 117 | :param K: select and sort the top K items 118 | :param axis: dimension to be sorted. 119 | :return: 120 | """ 121 | full_sort = np.argsort(-matrix, axis=axis) 122 | return full_sort.take(np.arange(K), axis=axis) 123 | 124 | with open(datapath + "vectorizer_model.pkl", 'rb') as f: 125 | vectorizer = pkl.load(f) 126 | with open(datapath + "lda_model.pkl", 'rb') as f: 127 | lda = pkl.load(f) 128 | 129 | X = vectorizer.transform(sentences) 130 | doc_topic = lda.transform(X) 131 | topK_topics = naive_arg_topK(doc_topic, TopK_for_Topics, axis=1) 132 | 133 | topics = [] 134 | for i in range(doc_topic.shape[1]): 135 | topicName = 'topic_' + str(i) 136 | topics.append(topicName) 137 | edges = [] 138 | for i in range(topK_topics.shape[0]): 139 | for j in range(TopK_for_Topics): 140 | edges.append(("test_{}".format(i), topics[topK_topics[i, j]])) 141 | 142 | return edges 143 | 144 | 145 | def build_subnetwork(DATASET, entity_edges, topic_edges): # called with the dataset name; resolve the graph path from it 146 | graph_path = '../../data/{}/model_network_sampled.pkl'.format(DATASET) 147 | with open(graph_path, 'rb') as f: 148 | g = pkl.load(f) 149 | # g.add_edges_from(entity_edges) 150 | ''' 151 | The goal here is to add the newly arrived texts to the original graph, so that the later intersection step does not drop them. 152 | Entity edges are not added because they may introduce new entities; such new entities have no corresponding features (nor parameter matrices) and should be removed. 153 | Topic edges can be added because no new topics appear: all topics already exist, so no new topic nodes are introduced. 154 | ''' 155 | g.add_edges_from(topic_edges) 156 | 157 | sub_g = networkx.Graph() 158 | sub_g.add_edges_from(entity_edges) 159 | sub_g.add_edges_from(topic_edges) 160 | return g, sub_g 161 | 162 | 163 | def judge_node_type(node): 164 | if node[:5] == 'test_': 165 | return 'text' 166 | if node[:6] == 'topic_': 167 | return 'topic' 168 | return 'entity' 169 | 170 | 171 | 172 | def release_feature(DATASET, node_list): 173 | # entity features are the most troublesome part: node names must be mapped back to their original indices via mapindex.txt 174 | mapindex = dict() 175 | with open('../data/{}/mapindex.txt'.format(DATASET), 'r') as f: 176 | for line in f: 177 | k, v = line.strip().split('\t') 178 | mapindex[k] = int(v) 179 | featuremap = dict() 180 | 181 | for node_type in ['entity', 'topic']: 182 | with open('../data/{0}/{0}.content.{1}'.format(DATASET, node_type), 'r') as f: 183 | for line in tqdm(f): 184 | cache = line.strip().split('\t') 185 | index = int(cache[0]) 186 | feature = np.array(cache[1:-1], dtype=np.float32) 187 | featuremap[index] = feature 188 | 189 | with open('../data/{0}/{0}_inductive.content.{1}'.format(DATASET, node_type), 'w') as f: 190 | for n in range(len(node_list)): 191 | if judge_node_type(node_list[n]) == node_type: 192 | f.write( 193 | str(n) + '\t' + '\t'.join(map(str, featuremap[mapindex[node_list[n]]])) 194 | + '\n') 195 | 196 | 197 | def preprocess_inductive_text(sentences, DATASET): 198 | datapath = '../../data/{}/'.format(DATASET) 199 | 200 | entity_edges = get_entity(sentences) 201 | 202 | def tokenize(sen): 203 | from nltk.tokenize import WordPunctTokenizer 204 | return WordPunctTokenizer().tokenize(sen) 205 | # return jieba.cut(sen) 206 | sentences = [[word.lower() for word in tokenize(sentence)] for sentence in sentences] 
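# tokenize with WordPunctTokenizer and lower-case every token; the token lists are re-joined into space-separated strings below before the fitted vectorizer and LDA model consume them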
207 | sentences = [' '.join(i) for i in sentences] 208 | topic_edges = get_topic(sentences, datapath) 209 | 210 | g, sub_g = build_subnetwork(DATASET, entity_edges, topic_edges) 211 | # 剔除不认识的实体 212 | sub_g.remove_nodes_from( set(sub_g.nodes()) - set(g.nodes()) ) 213 | del g 214 | g = sub_g 215 | node_list = list(g.nodes()) 216 | 217 | # build_feature 218 | release_feature(DATASET, node_list) 219 | with open(datapath + "vectorizer_model.pkl", 'rb') as f: 220 | vectorizer = pkl.load(f) 221 | with open(datapath + "transformer_model.pkl", 'rb') as f: 222 | tfidf_model = pkl.load(f) 223 | tfidf = tfidf_model.transform(vectorizer.transform(sentences)) 224 | 225 | with open('../data/{0}/{0}_inductive.content.{1}'.format(DATASET, 'text'), 'w') as f: 226 | for n in range(len(node_list)): 227 | if judge_node_type(node_list[n]) == 'text': 228 | idx = int(node_list[n].split('_')[1]) 229 | f.write( 230 | str(n) + '\t' + '\t'.join(map(str, tfidf[idx, :].toarray()[0].tolist() )) 231 | + '\n') 232 | 233 | # build_adj 234 | adj_dict = g.adj 235 | with open('../data/{0}/{0}_inductive.cites'.format(DATASET), 'w') as f: 236 | for i in adj_dict: 237 | for j in adj_dict[i]: 238 | f.write('{}\t{}\n'.format(i, j)) 239 | 240 | 241 | 242 | if __name__ == '__main__': 243 | sent = [ 244 | "VE子接口接入L3VPN转发,L3VE接口上下行均配置qos-profile模板,qos-profile中配置SQ加入GQ,然后进行主备倒换,查看配置恢复并且上下行流量按SQ参数进行队列调度和限速。查询和Reset SQ和GQ统计,检查统计和清除统计功能正常。", 245 | "整机启动时间48小时内、回退的文件不存在(大包文件不存在、或配置文件不在、或补丁文件不存在、或paf文件不存在),以上场景下,执行一键式版本回退,回退失败,提示对应Error信息", 246 | "全网1588时间同步;dot1q终结子接口终结多Q段接入VLL;IP FPM基本配置,丢包和时延使用默认值,测量周期1s;配置单向流,六元组匹配;MCP主备倒换3次", 247 | "集中式NAT444实例配置natserver,配置满规格日志服务器带不同私网vpn,反复配置去配置日志服务器,主备倒换设备无异常,检查反向建流和流表老化时,日志发送功能正常;修改部分日志服务器私网VPN与用户不一致,检查流表新建和老化时日志发送正常。", 248 | "2R native eth组网,配置基于物理口的PORT CC场景的1ag,使能eth-test测量。接收端设备ISSU升级过程中,发送端设备分别发送有bit错、CRC错、乱序的ETH-TEST报文,统计结果符合预期。配置周期为1s的,二层主接口的outward型的1ag,使能ETH_BN功能,测试仪模拟BNM报文,ISSU升级过程中,带宽通告可恢复,可产生。", 249 | "配置VPLS场景,接口类型为dot1q终结,设置检测接口为物理口,指定测试模式为丢包,手工触发平滑对账,查看测量结果符合规格预期", 250 | "NATIVE IP场景,测试仪模拟TWAMP LIGHT的发起端,设备作为TWAMP LIGHT的反射端,TWAMP LIGHT的时延统计功能正常,reset接口所在单板后,TWAMP LIGHT的时延统计功能能够恢复。", 251 | "接口上行配置qos-profile模板,模板中配置SQ,SQ关联的flow-queue模板中配置八个队列都为同种调度方式(lpq),打不同优先级的流量,八种队列按照配置的调度方式进行正确调度", 252 | "满规格配置IPFPM逐跳多路径检测,中间节点设备主备倒换3次后,功能正常", 253 | "通道口上建立TE隧道,隧道能正常建立并UP,ping和tracert该lsp都通,流量通过TE转发无丢包,ping可设置发送的报文长度,长度覆盖65、4000、8100(覆盖E3 Serial口)", 254 | "双归双活EVPN场景,AC侧接口为eth-trunk,evc封装dot1q,流动作为pop single接入evpn,公网使用te隧道,配置ac侧mac本地化功能,本地动态学习mac后ac侧本地化学习,动态mac清除、ac侧本地化mac清除", 255 | "配置两条静态vxlan隧道场景,Dot1q子接口配置远程端口镜像,切换镜像实例指定pw隧道测试恢复vxlan隧道镜像,再子卡插拔测试", 256 | "单跳检测bdif,交换机接口shut down后BFD会话的状态变为down,接口重新up则BFD会话可以协商UP" 257 | ] 258 | preprocess_inductive_text(sent, 'hw') -------------------------------------------------------------------------------- /model/data/example/example.cites: -------------------------------------------------------------------------------- 1 | 98 92 2 | 98 69 3 | 85 75 4 | 85 43 5 | 85 124 6 | 85 56 7 | 85 118 8 | 52 3 9 | 101 28 10 | 101 21 11 | 101 55 12 | 101 134 13 | 101 117 14 | 43 54 15 | 27 69 16 | 27 92 17 | 124 91 18 | 124 75 19 | 124 139 20 | 124 134 21 | 124 70 22 | 124 44 23 | 124 47 24 | 28 21 25 | 28 135 26 | 135 68 27 | 135 91 28 | 135 49 29 | 135 93 30 | 135 21 31 | 135 42 32 | 135 103 33 | 135 134 34 | 135 95 35 | 135 129 36 | 135 50 37 | 135 13 38 | 135 107 39 | 135 73 40 | 135 102 41 | 135 117 42 | 135 23 43 | 131 69 44 | 131 92 45 | 134 68 46 | 134 91 47 | 134 49 48 | 134 21 49 | 134 120 50 | 134 72 51 | 134 70 52 | 134 95 53 | 31 64 54 | 31 69 55 | 75 
91 56 | 75 93 57 | 75 21 58 | 75 4 59 | 75 129 60 | 75 56 61 | 75 118 62 | 75 23 63 | 56 78 64 | 56 118 65 | 118 78 66 | 103 115 67 | 103 7 68 | 103 136 69 | 103 93 70 | 103 4 71 | 103 42 72 | 103 100 73 | 103 129 74 | 103 50 75 | 103 13 76 | 103 99 77 | 103 73 78 | 103 84 79 | 103 14 80 | 103 117 81 | 103 23 82 | 103 76 83 | 120 49 84 | 120 72 85 | 120 70 86 | 120 44 87 | 137 69 88 | 137 92 89 | 49 70 90 | 107 68 91 | 107 21 92 | 107 42 93 | 107 95 94 | 59 61 95 | 59 34 96 | 59 58 97 | 59 33 98 | 61 144 99 | 61 96 100 | 61 34 101 | 115 68 102 | 115 93 103 | 115 42 104 | 115 100 105 | 115 129 106 | 115 45 107 | 115 13 108 | 115 73 109 | 115 102 110 | 115 130 111 | 115 14 112 | 115 23 113 | 115 76 114 | 13 93 115 | 13 21 116 | 13 42 117 | 13 100 118 | 13 129 119 | 13 50 120 | 13 45 121 | 13 73 122 | 13 102 123 | 13 130 124 | 13 14 125 | 13 23 126 | 13 76 127 | 34 144 128 | 34 96 129 | 34 102 130 | 94 14 131 | 45 93 132 | 45 42 133 | 45 100 134 | 45 129 135 | 45 50 136 | 45 6 137 | 45 73 138 | 45 102 139 | 45 130 140 | 45 14 141 | 45 23 142 | 45 76 143 | 29 125 144 | 29 127 145 | 29 33 146 | 29 92 147 | 142 136 148 | 142 106 149 | 142 127 150 | 142 84 151 | 142 117 152 | 112 21 153 | 112 129 154 | 112 50 155 | 112 73 156 | 112 117 157 | 112 23 158 | 112 76 159 | 127 7 160 | 127 136 161 | 127 84 162 | 127 117 163 | 127 76 164 | 121 58 165 | 121 33 166 | 130 68 167 | 130 93 168 | 130 21 169 | 130 42 170 | 130 100 171 | 130 129 172 | 130 50 173 | 130 6 174 | 130 73 175 | 130 102 176 | 130 14 177 | 130 23 178 | 130 76 179 | 42 68 180 | 42 93 181 | 42 21 182 | 42 100 183 | 42 95 184 | 42 129 185 | 42 50 186 | 42 73 187 | 42 102 188 | 42 14 189 | 42 23 190 | 42 76 191 | 93 68 192 | 93 21 193 | 93 4 194 | 93 100 195 | 93 95 196 | 93 129 197 | 93 50 198 | 93 6 199 | 93 96 200 | 93 73 201 | 93 84 202 | 93 102 203 | 93 14 204 | 93 117 205 | 93 23 206 | 93 76 207 | 102 68 208 | 102 21 209 | 102 100 210 | 102 129 211 | 102 50 212 | 102 96 213 | 102 73 214 | 102 14 215 | 102 23 216 | 86 58 217 | 86 33 218 | 68 100 219 | 68 129 220 | 68 96 221 | 68 73 222 | 68 14 223 | 68 23 224 | 68 76 225 | 53 84 226 | 53 33 227 | 53 92 228 | 21 100 229 | 21 95 230 | 21 129 231 | 21 50 232 | 21 96 233 | 21 73 234 | 21 14 235 | 21 117 236 | 21 23 237 | 21 76 238 | 84 7 239 | 84 136 240 | 84 50 241 | 84 73 242 | 84 14 243 | 84 117 244 | 84 76 245 | 100 7 246 | 100 136 247 | 100 129 248 | 100 50 249 | 100 6 250 | 100 73 251 | 100 14 252 | 100 117 253 | 100 23 254 | 100 76 255 | 4 95 256 | 4 126 257 | 4 50 258 | 4 99 259 | 4 73 260 | 81 15 261 | 81 64 262 | 110 9 263 | 22 15 264 | 22 64 265 | 46 41 266 | 46 126 267 | 99 126 268 | 83 15 269 | 83 109 270 | 96 129 271 | 50 129 272 | 50 6 273 | 50 73 274 | 50 14 275 | 50 117 276 | 50 23 277 | 50 76 278 | 6 73 279 | 6 14 280 | 6 76 281 | 76 7 282 | 76 129 283 | 76 73 284 | 76 14 285 | 76 117 286 | 76 23 287 | 97 15 288 | 97 132 289 | 7 136 290 | 7 117 291 | 32 16 292 | 32 15 293 | 32 109 294 | 73 129 295 | 73 14 296 | 73 117 297 | 73 23 298 | 14 129 299 | 14 117 300 | 14 23 301 | 23 91 302 | 23 129 303 | 23 117 304 | 129 117 305 | 104 15 306 | 104 58 307 | 19 133 308 | 12 15 309 | 12 132 310 | 72 44 311 | 79 58 312 | 79 64 313 | 38 58 314 | 38 15 315 | 105 58 316 | 105 15 317 | 40 41 318 | 40 63 319 | 111 126 320 | 1 39 321 | 1 80 322 | 1 15 323 | 57 8 324 | 57 71 325 | 133 8 326 | 39 2 327 | 116 80 328 | 116 64 329 | 35 80 330 | 35 25 331 | 17 106 332 | 17 80 333 | 17 25 334 | 8 71 335 | 2 80 336 | 2 25 337 | 24 92 338 | 24 25 339 | 60 92 340 | 60 25 341 | 113 140 342 | 30 92 
343 | 30 25 344 | 89 123 345 | 141 62 346 | 141 92 347 | 141 33 348 | 128 92 349 | 128 25 350 | 5 138 351 | 5 37 352 | 5 0 353 | 26 64 354 | 26 108 355 | 66 18 356 | 66 123 357 | 90 108 358 | 90 25 359 | 37 138 360 | 37 77 361 | 37 0 362 | 119 108 363 | 119 64 364 | 114 108 365 | 114 64 366 | 136 117 367 | 122 108 368 | 122 80 369 | 51 65 370 | 51 132 371 | 10 20 372 | 10 11 373 | 138 0 374 | 143 87 375 | 143 36 376 | 74 77 377 | 74 65 378 | 74 132 379 | 77 0 380 | 82 132 381 | 82 15 382 | 88 65 383 | 88 25 384 | 67 48 385 | 67 69 386 | 67 132 387 | 0 0 388 | 1 1 389 | 2 2 390 | 3 3 391 | 4 4 392 | 5 5 393 | 6 6 394 | 7 7 395 | 8 8 396 | 9 9 397 | 10 10 398 | 11 11 399 | 12 12 400 | 13 13 401 | 14 14 402 | 15 15 403 | 16 16 404 | 17 17 405 | 18 18 406 | 19 19 407 | 20 20 408 | 21 21 409 | 22 22 410 | 23 23 411 | 24 24 412 | 25 25 413 | 26 26 414 | 27 27 415 | 28 28 416 | 29 29 417 | 30 30 418 | 31 31 419 | 32 32 420 | 33 33 421 | 34 34 422 | 35 35 423 | 36 36 424 | 37 37 425 | 38 38 426 | 39 39 427 | 40 40 428 | 41 41 429 | 42 42 430 | 43 43 431 | 44 44 432 | 45 45 433 | 46 46 434 | 47 47 435 | 48 48 436 | 49 49 437 | 50 50 438 | 51 51 439 | 52 52 440 | 53 53 441 | 54 54 442 | 55 55 443 | 56 56 444 | 57 57 445 | 58 58 446 | 59 59 447 | 60 60 448 | 61 61 449 | 62 62 450 | 63 63 451 | 64 64 452 | 65 65 453 | 66 66 454 | 67 67 455 | 68 68 456 | 69 69 457 | 70 70 458 | 71 71 459 | 72 72 460 | 73 73 461 | 74 74 462 | 75 75 463 | 76 76 464 | 77 77 465 | 78 78 466 | 79 79 467 | 80 80 468 | 81 81 469 | 82 82 470 | 83 83 471 | 84 84 472 | 85 85 473 | 86 86 474 | 87 87 475 | 88 88 476 | 89 89 477 | 90 90 478 | 91 91 479 | 92 92 480 | 93 93 481 | 94 94 482 | 95 95 483 | 96 96 484 | 97 97 485 | 98 98 486 | 99 99 487 | 100 100 488 | 101 101 489 | 102 102 490 | 103 103 491 | 104 104 492 | 105 105 493 | 106 106 494 | 107 107 495 | 108 108 496 | 109 109 497 | 110 110 498 | 111 111 499 | 112 112 500 | 113 113 501 | 114 114 502 | 115 115 503 | 116 116 504 | 117 117 505 | 118 118 506 | 119 119 507 | 120 120 508 | 121 121 509 | 122 122 510 | 123 123 511 | 124 124 512 | 125 125 513 | 126 126 514 | 127 127 515 | 128 128 516 | 129 129 517 | 130 130 518 | 131 131 519 | 132 132 520 | 133 133 521 | 134 134 522 | 135 135 523 | 136 136 524 | 137 137 525 | 138 138 526 | 139 139 527 | 140 140 528 | 141 141 529 | 142 142 530 | 143 143 531 | 144 144 532 | -------------------------------------------------------------------------------- /model/data/example/example.content.entity: -------------------------------------------------------------------------------- 1 | 110 0.4128976190896328 0.42347086738257994 0.0 0.0 0.0 0.8063423470389967 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 2 | 16 0.0 0.7460770404765611 0.0 0.0 0.0 0.0 0.0 0.0 0.6658596321100535 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 3 | 45 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 4 | 68 0.0 0.16626884101293 0.0 0.0 0.0 0.0 0.29678358480919553 0.928576357788077 0.14839179240459777 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 5 | 52 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 entity 6 | 115 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.9162423370679809 0.0 0.3123917334001648 0.0 0.0 0.0 0.2508214198736721 0.0 0.0 entity 7 | 140 0.5036917525882165 0.5165900056102781 0.0 0.5165900056102781 0.0 0.0 0.0 0.0 0.4610467986894133 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 8 | 13 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 9 | 103 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 10 | 41 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 entity 11 | 66 0.0 0.0 0.0 0.0 0.0 0.0 0.9192504543328178 0.0 0.0 0.0 0.0 0.0 0.0 0.39367321754077705 0.0 0.0 entity 12 | 57 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 13 | 124 0.0 0.7392370411053496 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6734453185358168 0.0 0.0 0.0 0.0 0.0 entity 14 | 139 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 entity 15 | 19 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 16 | 136 0.0 0.9340966471734243 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.3570202427585404 0.0 0.0 entity 17 | 3 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 18 | 130 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 19 | 46 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 20 | 100 0.0 0.0 0.0 0.5148343604877473 0.0 0.0 0.0 0.4792068600200417 0.0 0.0 0.0 0.0 0.5148343604877473 0.0 0.4901550242852527 0.0 entity 21 | 4 0.0 0.0 0.0 0.0 0.0 0.0 0.5041490938189078 0.0 0.0 0.0 0.0 0.0 0.0 0.8636166343937418 0.0 0.0 entity 22 | 8 0.42920383141594415 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.3364927980278603 0.8381865353089701 0.0 entity 23 | 37 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 24 | 123 0.0 0.7460770404765611 0.0 0.0 0.0 0.0 0.0 0.0 0.6658596321100535 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 25 | 76 0.0 0.526361543744378 0.0 0.0 0.0 0.0 0.0 0.0 0.46976771145596224 0.5011296351945588 0.0 0.0 0.0 0.0 0.0 0.5011296351945588 entity 26 | 142 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5071213902423028 0.0 0.0 0.4255242692005045 0.5299797127627106 0.5299797127627106 entity 27 | 138 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 28 | 20 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.8903220484888332 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.4553313628278285 entity 29 | 133 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 30 | 85 0.0 0.582154700206958 0.0 0.582154700206958 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5676194236051949 0.0 0.0 0.0 0.0 entity 31 | 49 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 32 | 107 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 33 | 120 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 34 | 39 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 35 | 89 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 36 | 36 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 37 | 42 0.3537352132330117 0.0 0.3537352132330117 0.0 0.0 0.0 0.0 0.33768747802609306 0.0 0.0 0.33050502943729493 0.0 0.725586928313225 0.0 0.0 0.0 entity 38 | 6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 entity 39 | 71 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6839091107827836 0.0 0.0 0.0 0.0 0.0 0.0 0.7295672197873904 entity 40 | 101 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 41 | 126 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 42 | 21 0.0 0.0 0.0 0.0 0.13033856426006174 0.0 0.36654497008423925 0.12742730569713723 0.0 0.9123699498204323 0.0 0.0 0.0 0.0 0.0 0.0 entity 43 | 63 0.0 0.0 0.0 0.7242527273481703 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6895346161932243 0.0 entity 44 | 5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 45 | 77 0.0 0.0 0.0 0.0 0.0 0.7295672197873904 0.0 0.0 0.6839091107827836 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 46 | 112 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0.626076108355491 0.0 0.7797619550519527 entity 47 | 43 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.8776026932008457 0.0 0.0 0.47938868664855044 0.0 0.0 0.0 0.0 entity 48 | 113 0.0 0.0 0.8898235312117321 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.45630481402444545 0.0 0.0 0.0 entity 49 | 70 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 50 | 11 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 51 | 111 0.0 0.0 0.0 0.0 0.33348940998145365 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7005612453635459 0.535521772064518 0.33348940998145365 0.0 entity 52 | 40 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 53 | 56 0.0 0.0 0.0 0.0 0.0 0.9054732821028685 0.4244032697775304 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 54 | 34 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 55 | 73 0.0 0.0 0.0 0.0 0.24253562503633297 0.0 0.0 0.0 0.0 0.9701425001453319 0.0 0.0 0.0 0.0 0.0 0.0 entity 56 | 106 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 entity 57 | 18 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 58 | 23 0.0 0.0 0.0 0.0 0.0 0.0 0.8045238736139533 0.0 0.0 0.42911714178898863 0.4106090785748029 0.0 0.0 0.0 0.0 0.0 entity 59 | 75 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 60 | 135 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 61 | 54 0.0 0.0 0.0 0.0 0.0 0.0 0.9049839500227976 0.0 0.0 0.0 0.0 0.0 0.3380030437607315 0.2583756811497849 0.0 0.0 entity 62 | 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 63 | 78 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 entity 64 | 55 0.0 0.0 0.0 0.0 0.0 0.0 0.6839091107827836 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7295672197873904 0.0 entity 65 | 134 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 66 | 44 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 67 | 102 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.9630189650151143 0.0 0.0 0.0 0.16813066375666905 0.0 0.13181326293179332 0.16417008448933054 0.0 entity 68 | 95 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 entity 69 | 125 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 70 | 87 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 71 | 96 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 72 | 72 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 entity 73 | 47 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 74 | 62 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 entity 75 | 28 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5844900431032921 0.0 0.0 0.6385529950927725 0.0 0.500621076235471 0.0 0.0 entity 76 | 84 0.0 0.0 0.0 0.5865708015535791 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5865708015535791 0.0 0.0 0.5584526743866337 entity 77 | 10 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 78 | 117 0.0 0.0 0.0 0.46577813790634043 0.0 0.0 0.0 0.0 0.4156981688554791 0.4434503833052836 0.0 0.0 0.46577813790634043 0.0 0.0 0.4434503833052836 entity 79 | 118 0.0 0.0 0.6696705123801245 0.0 0.0 0.32694769098327986 0.6129730024822695 0.0 0.0 0.0 0.0 0.0 0.0 0.2625084959332076 0.0 0.0 entity 80 | 7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 entity 81 | 61 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 82 | 48 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 83 | 127 0.5293979964997674 0.0 0.0 0.0 0.0 
0.5169272033543715 0.0 0.0 0.0 0.0 0.0 0.5293979964997674 0.0 0.41504432177334144 0.0 0.0 entity 84 | 129 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7071067811865475 0.0 0.0 0.0 0.0 0.7071067811865475 0.0 entity 85 | 99 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 86 | 94 0.0 0.0 0.0 0.0 0.4631219854333841 0.0 0.0 0.0 0.0 0.0 0.8862945484477721 0.0 0.0 0.0 0.0 0.0 entity 87 | 14 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6751907623427075 0.0 0.0 0.7376431620011626 0.0 0.0 0.0 0.0 entity 88 | 93 0.0 0.0 0.0 0.2851210859459021 0.27145339070541175 0.27145339070541175 0.0 0.5307803472643001 0.25446517061225893 0.27145339070541175 0.5194909071628946 0.0 0.2851210859459021 0.0 0.0 0.0 entity 89 | 9 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 90 | 144 0.0 0.0 0.0 0.4854274117243385 0.0 0.4621577405149383 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7421391047699457 0.0 0.0 entity 91 | 91 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.9630392083318012 0.0 0.0 0.26936125039741216 0.0 0.0 entity 92 | 50 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6990759925117283 0.0 0.7150473807334323 0.0 0.0 0.0 0.0 0.0 0.0 entity 93 | 143 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 entity 94 | -------------------------------------------------------------------------------- /model/data/example/example.content.text: -------------------------------------------------------------------------------- 1 | 98 0.0 0.0 0.0 0.0 0.0 0.9358650031234387 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.35235876025546164 0.0 0.0 0.0 0.0 business 2 | 27 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.9156288122439306 0.0 0.0 0.4020247233551301 0.0 0.0 0.0 0.0 business 3 | 131 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.9156288122439306 0.0 0.0 0.4020247233551301 0.0 0.0 0.0 0.0 business 4 | 31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6695963650777833 0.0 0.0 0.0 0.0 0.0 0.0 0.7427251900094812 0.0 0.0 0.0 0.0 business 5 | 137 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6046697920338051 0.0 0.0 0.7964762661886384 0.0 0.0 0.0 0.0 business 6 | 59 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7514065815604376 0.0 0.0 0.0 0.0 0.0 0.6598394874419514 0.0 0.0 0.0 computers 7 | 29 0.0 0.0 0.0 0.0 0.0 0.16467742487504686 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7671089427046102 0.0 0.0 0.0 0.0 0.6200203349561516 0.0 0.0 0.0 computers 8 | 121 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7514065815604376 0.0 0.0 0.0 0.0 0.0 0.6598394874419514 0.0 0.0 0.0 computers 9 | 86 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.9156288122439306 0.0 0.0 0.0 0.0 0.0 0.4020247233551301 0.0 0.0 0.0 computers 10 | 53 0.0 0.0 0.0 0.0 0.0 0.21729028321966476 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7591446712281558 0.0 0.0 0.0 0.0 0.6135831654830609 0.0 0.0 0.0 computers 11 | 81 0.40461548377817164 0.40461548377817164 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.32032492957494885 0.0 0.0 0.0 0.0 0.0 0.7549599724930564 0.0 0.0 0.0 0.0 0.0 culture-arts-entertainment 12 | 22 0.9668995057997964 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2551574919223607 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 culture-arts-entertainment 13 | 83 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 culture-arts-entertainment 14 | 97 0.862719620468864 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.18939695437392945 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2012153535427588 0.0 
0.0 0.0 0.18939695437392945 0.3787939087478589 culture-arts-entertainment 15 | 32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 culture-arts-entertainment 16 | 104 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.25819888974716115 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7745966692414835 0.0 0.0 0.0 0.0 0.25819888974716115 0.5163977794943223 education-science 17 | 12 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.20614945054270384 0.0 0.25505467825685685 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.8245978021708154 0.0 0.0 0.0 0.0 0.20614945054270384 0.4122989010854077 education-science 18 | 79 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.41094830562394763 0.0 0.0 0.0 0.0 0.9116586477979609 0.0 0.0 0.0 0.0 0.0 0.0 education-science 19 | 38 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 education-science 20 | 105 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 education-science 21 | 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.16584805931160926 0.628392605698543 0.0 0.0 0.0 0.0 0.663392237246437 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.16584805931160926 0.3316961186232185 engineering 22 | 116 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.8151167991472721 0.0 0.0 0.0 0.38789492827902133 0.4302582112677438 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 engineering 23 | 35 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.687696055429443 0.0 0.0 0.0 0.0 0.7259987158024347 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 engineering 24 | 17 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.687696055429443 0.0 0.0 0.0 0.0 0.7259987158024347 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 engineering 25 | 2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.687696055429443 0.0 0.0 0.0 0.0 0.7259987158024347 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 engineering 26 | 24 0.0 0.0 0.0 0.0 0.0 0.0 0.7514065815604376 0.0 0.0 0.0 0.0 0.6598394874419514 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 health 27 | 60 0.0 0.0 0.0 0.0 0.0 0.0 0.7514065815604376 0.0 0.0 0.0 0.0 0.6598394874419514 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 health 28 | 30 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 health 29 | 141 0.0 0.0 0.0 0.0 0.0 0.46912084121140396 0.0 0.0 0.0 0.0 0.0 0.8831339854977299 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 health 30 | 128 0.0 0.0 0.0 0.0 0.0 0.0 0.9156288122439306 0.0 0.0 0.0 0.0 0.4020247233551301 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 health 31 | 26 0.0 0.0 0.0 0.0 0.4363902117064088 0.0 0.0 0.4363902117064088 0.0 0.0 0.0 0.0 0.0 0.7868463422128056 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 politics-society 32 | 90 0.0 0.0 0.0 0.0 0.9486832980505139 0.0 0.0 0.31622776601683794 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 politics-society 33 | 119 0.0 0.37273755096583017 0.0 0.0 0.4364205489016015 0.0 0.0 0.32731541167620115 0.0 0.0 0.0 0.0 0.7506453515979827 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 politics-society 34 | 114 0.0 0.4270855370625232 0.0 0.0 0.7500810048375636 0.0 0.0 0.3750405024187818 0.0 0.0 0.0 0.0 0.0 0.33811396268026966 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 politics-society 35 | 122 0.0 0.0 0.0 0.0 0.41178889441522537 0.0 0.0 0.8235777888304507 0.0 0.3900634976275425 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 politics-society 36 | 51 0.0 0.0 0.5345224838248488 0.2672612419124244 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.8017837257372731 0.0 0.0 sports 37 | 
74 0.0 0.0 0.8164965809277259 0.40824829046386296 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.40824829046386296 0.0 0.0 sports 38 | 82 0.0 0.0 0.0 0.5261283520506196 0.0 0.0 0.0 0.0 0.15400462309180898 0.0 0.7621577353583696 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.15400462309180898 0.30800924618361797 sports 39 | 88 0.0 0.0 0.8944271909999159 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.4472135954999579 0.0 0.0 sports 40 | 67 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 sports 41 | -------------------------------------------------------------------------------- /model/data/example/example.content.topic: -------------------------------------------------------------------------------- 1 | 69 0.10000000064493393 0.1000000008140825 0.10000000028266168 0.10000000035382671 0.10000000047539047 0.1000000004828411 0.10000000060871007 0.1000000005324753 0.100000000711586 0.10000000066359978 0.10000000052238267 0.10000000067661378 0.10000000014003871 0.10000310368917421 0.10000000045052576 0.10000000044994944 0.10000000015794049 6.099999976462049 0.1000000010483934 0.10000000090751568 6.100029142682792 0.10000000061865216 0.1000000002451704 0.100000000711586 0.10000000061927684 topic 2 | 108 0.10000000030248687 3.9869331920755253 0.1000000001324301 0.1000000001657717 11.099988383941328 0.10000000022568818 0.10000000028518735 8.099985059782904 0.10000000033338584 0.10000959756685197 0.10000000024474201 0.10000000031700097 5.099999989264985 0.10000958461921998 0.10000000021107626 0.10000000021080625 0.10000000007399686 0.10000000014486304 0.1000000004911838 0.10000000042576132 0.10000000034382804 0.10000000028984532 0.10000000011486503 0.10000000033338584 0.10000000029013799 topic 3 | 33 0.10000000050165696 0.10000000063322788 0.10000000021986624 0.10000000027522142 0.10000000036977887 2.099991023864151 0.10000000047348051 0.10000000041418187 0.10000000055350176 0.10000000051617604 0.10000000040633139 0.10000000052634735 0.10000000010892805 0.10000000066465843 0.10000000035043805 0.100000000350928 7.09999998794141 0.10000000024050797 0.10000000081548484 0.10000000070590417 0.10000000057101618 7.100064058850597 0.10000000019070393 0.10000000055350176 0.10000000048169978 topic 4 | 65 0.100000000601957 0.1000000007598339 6.099999978463166 2.099979341443886 0.10000000044371153 0.10000000044961346 0.10000000056814703 0.10000000049699237 0.10000000066421821 0.10000000061937901 0.10000000048760901 0.10000000063152578 0.10000000013070685 0.1000000007975486 0.10000000042050376 0.10000000041996585 0.1000000001474157 0.10000000028859453 0.10000000097853087 0.10000000084704092 0.10000000068497036 0.1000000005774266 5.099999981319692 0.10000000066421821 0.10000000057805047 topic 5 | 80 0.10000000032255908 0.10000000040710746 0.10000000014135384 0.10000000017696559 0.10000074886048957 0.10000000024089607 0.10000000030440456 0.10000208808596316 1.0999923634983155 12.099987122216671 0.1000000002613717 0.10000000033836194 0.10000000007003075 0.10000620317028404 10.099998014746577 0.10000000022501133 0.10000000007898309 0.10000000015462458 0.1000000005243858 0.10000000045384043 0.10000000036699673 0.10000000030937642 0.10000000012260515 1.0999923634983155 2.099988012939781 topic 6 | 92 0.1000000003926223 0.1000000004955964 0.10000000017207852 0.10000000021540231 0.10000000028940778 6.100008939650146 5.099999953423517 0.10000000032415984 0.10000000043319869 0.10000000040398568 
0.10000000031801565 8.099999948227703 0.10000000008525264 0.10000000052032255 0.10000000027427068 0.10000000027391982 0.10000000009625576 0.10000000018828213 0.10000000063824 0.10000000055247658 2.0999641218661806 0.1000000003767274 0.10000000014925463 0.10000000043319869 0.1000000003770028 topic 7 | 25 0.10000000447053763 0.10000000564303888 0.10000000195934661 0.10000000245264679 0.10000000329529947 0.10000000333913112 0.10000000421944119 0.10000000369099893 0.10000000493255537 0.10000000459992453 0.10000000362104004 0.10000000469013443 0.10000000097071651 0.10000000592313252 0.10000000312294278 0.10000000311894792 0.10000000109480751 0.10000000214329609 0.1000000072672295 0.10000000629069618 0.10000000508705013 0.10000000428835745 0.10000000169946556 0.10000000493255537 0.10000000429268854 topic 8 | 58 0.10000000053116166 0.10000000067047086 0.10000000023279756 0.10000000029140843 0.10000000039152722 0.10000000039685897 0.10000000050132803 0.10000000043854178 0.10000000058689941 0.10000000054653468 0.10000000043121514 0.1000000005572529 0.1000000001153346 0.10000634588594152 0.10000000037104888 5.099999965667074 0.10000000013018323 0.10000000025465332 6.099990297447763 0.10000000074742157 0.10000000060441193 4.099935894426184 0.1000000002019201 0.10000000058689941 0.10000000051081784 topic 9 | 109 0.10000000447053763 0.10000000564303888 0.10000000195934661 0.10000000245264679 0.10000000329529947 0.10000000333913112 0.10000000421944119 0.10000000369099893 0.10000000493255537 0.10000000459992453 0.10000000362104004 0.10000000469013443 0.10000000097071651 0.10000000592313252 0.10000000312294278 0.10000000311894792 0.10000000109480751 0.10000000214329609 0.1000000072672295 0.10000000629069618 0.10000000508705013 0.10000000428835745 0.10000000169946556 0.10000000493255537 0.10000000429268854 topic 10 | 132 0.1000000006947255 0.1000000008766686 0.10000000030463979 3.100020631978266 0.10000000051193815 0.10000000051874758 0.10000000065550729 0.10000000057341163 1.1000090738708639 0.10000000071474371 4.1000570677731965 0.10000000072863138 0.10000000015080479 0.10000000092018244 0.10000000048531513 0.10000000048454125 0.10000000017008284 0.10000000033296982 0.10000000113121894 0.10000000097733115 0.10000000079029378 0.10000000066621371 0.10000000026427903 1.1000090738708639 2.1000218909978687 topic 11 | 64 0.10001884268356509 1.2130488029016366 0.10000000035604667 0.10000000044568769 0.10001083159273699 0.10000000060677694 0.10000000076674422 0.10001281225018842 0.10000000089632895 0.10000323051447904 0.10000000065800438 0.10000000085227717 0.10000000017646501 8.09996845256624 0.10000195129897382 0.10000000056676593 0.1000000001989452 0.10000000038947345 0.10000653674416829 0.1000150754088154 0.10000668095550053 0.10000000077926748 0.10000000030882186 0.10000000089632895 0.10000000078005433 topic 12 | 15 8.099981108618941 0.10001794393543985 0.1000000001014875 0.1000000001270875 0.10000000017068529 0.10000000017295561 0.10000000021855267 0.10000000019118116 3.09999850912612 0.10000000023832109 1.0999428926867112 0.10000000024293296 0.10000000005027983 0.10000624793508826 0.10000000016183115 0.10000000016155089 0.10000000005670732 0.10000000011101547 7.100003087508312 9.099984856066916 0.10000000026349182 0.1000000002221223 0.10000000008802655 3.09999850912612 6.099990049498399 topic 13 | -------------------------------------------------------------------------------- /model/data/example/mapindex.txt: -------------------------------------------------------------------------------- 1 | 
Southern_Illinois_University 0 2 | 6760 1 3 | 6764 2 4 | Taiwan 3 5 | Tool 4 6 | George_Washington_University 5 7 | Firefox 6 8 | Wikipedia 7 9 | Turbine 8 10 | Film 9 11 | CBS_Sports_Network 10 12 | The_Sports_Network 11 13 | 5221 12 14 | Garbage_collection_(computer_science) 13 15 | Unix-like 14 16 | topic_19 15 17 | Beowulf 16 18 | 6763 17 19 | Candidate 18 20 | Force 19 21 | Television_network 20 22 | Software_development 21 23 | 2561 22 24 | Installation_(computer_programs) 23 25 | 7630 24 26 | topic_9 25 27 | 8520 26 28 | 1 27 29 | Electronics_manufacturing_services 28 30 | 2131 29 31 | 7632 30 32 | 3 31 33 | 2564 32 34 | topic_3 33 35 | Walter_Bright 34 36 | 6762 35 37 | Track_and_field 36 38 | University_of_Texas_at_Austin 37 39 | 5223 38 40 | Newton's_laws_of_motion 39 41 | Equal_opportunity 40 42 | Ethics 41 43 | Type_system 42 44 | Export 43 45 | Technician 44 46 | Backward_compatibility 45 47 | Science 46 48 | Standard_Telephones_and_Cables 47 49 | Net_Authority 48 50 | Structural_engineering 49 51 | Application_programming_interface 50 52 | 9070 51 53 | China 52 54 | 2134 53 55 | Procurement 54 56 | Home_improvement 55 57 | Machining 56 58 | Electricity_generation 57 59 | topic_12 58 60 | 2130 59 61 | 7631 60 62 | Digital_Mars 61 63 | Barnes-Jewish_Hospital 62 64 | Health_care 63 65 | topic_18 64 66 | topic_4 65 67 | Election 66 68 | 9074 67 69 | Computer_programming 68 70 | topic_0 69 71 | Automotive_engineering 70 72 | Power_station 71 73 | Auto_mechanic 72 74 | Application_software 73 75 | 9071 74 76 | Numerical_control 75 77 | Web_browser 76 78 | Texas_Tech_University 77 79 | Fastener 78 80 | 5222 79 81 | topic_6 80 82 | 2560 81 83 | 9072 82 84 | 2562 83 85 | Yahoo!_Directory 84 86 | Manufacturing 85 87 | 2133 86 88 | Booster_club 87 89 | 9073 88 90 | Opelousas,_Louisiana 89 91 | 8521 90 92 | Automotive_electronics 91 93 | topic_7 92 94 | Scripting_language 93 95 | DOS 94 96 | Problem_solving 95 97 | NLS_(computer_system) 96 98 | 2563 97 99 | 0 98 100 | Information 99 101 | Source_code 100 102 | Product_(business) 101 103 | Extensible_programming 102 104 | Database 103 105 | 5220 104 106 | 5224 105 107 | HowStuffWorks 106 108 | Mathematical_optimization 107 109 | topic_2 108 110 | topic_16 109 111 | Academy_Awards 110 112 | Skill 111 113 | Internet_access 112 114 | Jogging 113 115 | 8523 114 116 | Compiler 115 117 | 6761 116 118 | Web_search_engine 117 119 | Stamping_(metalworking) 118 120 | 8522 119 121 | Technical_drawing 120 122 | 2132 121 123 | 8524 122 124 | Republican_Party_(United_States) 123 125 | Electronics 124 126 | Google_Directory 125 127 | Knowledge 126 128 | Usenet 127 129 | 7634 128 130 | Computer_cluster 129 131 | Ruby_(programming_language) 130 132 | 2 131 133 | topic_17 132 134 | Thrust 133 135 | Product_design 134 136 | Data_management 135 137 | FAQ 136 138 | 4 137 139 | Yale_University 138 140 | Amplifier 139 141 | Cycling 140 142 | 7633 141 143 | Blog 142 144 | College_athletics 143 145 | Aardvark_(search_engine) 144 146 | -------------------------------------------------------------------------------- /model/data/example/test.map: -------------------------------------------------------------------------------- 1 | 74 2 | 81 3 | 116 4 | 114 5 | 17 6 | 32 7 | 119 8 | 30 -------------------------------------------------------------------------------- /model/data/example/train.map: -------------------------------------------------------------------------------- 1 | 38 2 | 24 3 | 29 4 | 97 5 | 121 6 | 88 7 | 27 8 | 12 9 | 98 10 | 104 11 | 59 12 | 22 13 | 
53 14 | 86 15 | 83 16 | 82 -------------------------------------------------------------------------------- /model/data/example/vali.map: -------------------------------------------------------------------------------- 1 | 31 2 | 51 3 | 35 4 | 67 5 | 2 6 | 90 7 | 137 8 | 131 9 | 141 10 | 1 11 | 122 12 | 60 13 | 105 14 | 26 15 | 79 16 | 128 -------------------------------------------------------------------------------- /tagMe.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | import time 4 | import json 5 | import requests 6 | from tqdm import tqdm 7 | from multiprocessing import Pool as ProcessPool 8 | time0 = time.time() 9 | 10 | dataset = 'example' 11 | 12 | 13 | def getEntityList(text): 14 | url = 'https://tagme.d4science.org/tagme/tag' 15 | # token = '14a339a1-6913-4e8a-bff3-da6ae4459381-843339462' 16 | token = 'fe4df7bf-ab75-4efb-aa1c-551afaa65cd3-843339462' 17 | para = {'lang': 'en', 18 | 'gcube-token': token, 19 | 'text': text, 20 | } 21 | headers = { 22 | 'Host': 'tagme.d4science.org', 23 | 'Connection': 'keep-alive', 24 | 'Cache-Control': 'max-age=0', 25 | 'Upgrade-Insecure-Requests': '1', 26 | 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36', 27 | 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8', 28 | 'Accept-Encoding': 'gzip, deflate, br', 29 | 'Accept-Language': 'zh-CN,zh;q=0.9', 30 | 'Cookie': '_ga=GA1.2.827175290.1544765315; _gid=GA1.2.121830695.1544765315', 31 | } 32 | response = requests.get(url+'?lang=en&gcube-token='+token+'&text='+text, headers=headers, timeout=10) 33 | try: 34 | return json.loads(response.text)['annotations'] 35 | except: 36 | print('.'*50) 37 | print(text) 38 | print(response.text) 39 | print('.'*50) 40 | 41 | def sentence2Link(sentence): 42 | return json.dumps(getEntityList(sentence)) 43 | 44 | 45 | def run(para, times = 0): 46 | ind = para[0] 47 | sentence = para[1] 48 | try: 49 | l = sentence2Link(sentence) 50 | except Exception as e: 51 | print(ind, e) 52 | print("Content: ", sentence) 53 | if times < 5: 54 | return run(para, times + 1) 55 | else: 56 | with open("error_info.txt", 'w+') as f: 57 | f.write("{}\t{}\n".format(ind, sentence)) 58 | return None 59 | print(ind) 60 | return str(ind)+'\t'+l 61 | 62 | 63 | def process_pool(data): 64 | cnt = 0 65 | p = ProcessPool(32) 66 | chunkSize = 128 67 | res = [] 68 | i = 0 69 | while i < int(len(data)/chunkSize): 70 | try: 71 | res += list(p.map(run, data[i*chunkSize: (i+1)*chunkSize])) 72 | print(str(round((i+1)*chunkSize/len(data)*100, 2))+'%', round(time.time()-time0, 2)) 73 | cnt += 1 74 | # fout = open("cache" + str(cnt).zfill(3) + '.txt', 'w', encoding='utf8') 75 | # fout.write('\n'.join(res)) 76 | # fout.close() 77 | i += 1 78 | time.sleep(1.0) 79 | except: # on failure (e.g. rate limiting), back off for up to 10 minutes and retry the same chunk; use j so the chunk counter i is not overwritten 80 | for j in range(60): 81 | time.sleep(10) 82 | print("\t{} / 600s".format(j*10)) 83 | 84 | res += list(p.map(run, data[(i)*chunkSize:])) 85 | p.close() 86 | p.join() 87 | return res 88 | 89 | 90 | 91 | 92 | if __name__ == "__main__": 93 | print('reading...') 94 | data = [] 95 | with open("./data/{0}/{0}.txt".format(dataset), 'r', encoding='utf8') as fin: 96 | for line in fin: 97 | ind, cate, content = line.split('\t') 98 | if int(ind) > -1 and int(ind) < 9e19: 99 | data.append([ind, content]) 100 | 101 | print('read done. 
Tagging...') 102 | outdata = process_pool(data) 103 | fout = open("./data/{0}/{0}2entity.txt".format(dataset), 'w', encoding='utf8') 104 | fout.write('\n'.join(outdata)) 105 | fout.close() -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | #!/user/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import numpy as np 5 | from nltk.tokenize import WordPunctTokenizer 6 | from collections import defaultdict 7 | from tqdm import tqdm 8 | 9 | 10 | def sample(datapath, DATASETS, resample = False, trainNumPerClass=20): 11 | if resample: 12 | X = [] 13 | Y = [] 14 | from sklearn.model_selection import train_test_split 15 | with open(datapath+'{}.txt'.format(DATASETS), 'r') as f: 16 | for line in f: 17 | ind, cat, title = line.strip('\n').split('\t') 18 | t = cat.lower() 19 | X.append([ind]) 20 | Y.append(t) 21 | 22 | cateset = list(set(Y)) 23 | catemap = dict() 24 | for i in range(len(cateset)): 25 | catemap[cateset[i]] = i 26 | Y = [catemap[i] for i in Y] 27 | X = np.array(X) 28 | Y = np.array(Y) 29 | 30 | trainNum = trainNumPerClass*len(catemap) 31 | ind_train, ind_test = train_test_split(X, 32 | train_size=trainNum, random_state=1, ) 33 | ind_vali, ind_test = train_test_split(ind_test, 34 | train_size=trainNum/(len(X)-trainNum), random_state=1, ) 35 | train = sum(ind_train.tolist(), []) 36 | vali = sum(ind_vali.tolist(), []) 37 | test = sum(ind_test.tolist(), []) 38 | 39 | print( len(train), len(vali), len(test) ) 40 | alltext = set(train + vali + test) 41 | print( "train: {}\nvali: {}\ntest: {}\nAllTexts: {}".format( len(train), len(vali), len(test), len(alltext)) ) 42 | 43 | with open(datapath+'train.list', 'w') as f: 44 | f.write( '\n'.join(map(str, train)) ) 45 | with open(datapath+'vali.list', 'w') as f: 46 | f.write( '\n'.join(map(str, vali)) ) 47 | with open(datapath+'test.list', 'w') as f: 48 | f.write( '\n'.join(map(str, test)) ) 49 | else: 50 | train = [] 51 | vali = [] 52 | test = [] 53 | with open(datapath + 'train.list', 'r') as f: 54 | for line in f: 55 | train.append(line.strip()) 56 | with open(datapath + 'vali.list', 'r') as f: 57 | for line in f: 58 | vali.append(line.strip()) 59 | with open(datapath + 'test.list', 'r') as f: 60 | for line in f: 61 | test.append(line.strip()) 62 | alltext = set(train + vali + test) 63 | 64 | return train, vali, test, alltext 65 | 66 | def tokenize(sen): 67 | return WordPunctTokenizer().tokenize(sen) 68 | 69 | 70 | def preprocess_corpus_notDropEntity(corpus, stopwords, involved_entity): 71 | corpus1 = [[word.lower() for word in tokenize(sentence)] for sentence in tqdm(corpus)] 72 | corpus2 = [[word for word in sentence if word.isalpha() if word not in stopwords] for sentence in tqdm(corpus1)] 73 | all_words = defaultdict(int) 74 | for c in tqdm(corpus2): 75 | for w in c: 76 | all_words[w] += 1 77 | low_freq = set(word for word in set(all_words) if all_words[word] < 5 and word not in involved_entity) 78 | text = [[word for word in sentence if word not in low_freq] for sentence in tqdm(corpus2)] 79 | ans = [' '.join(i) for i in text] 80 | return ans 81 | 82 | def load_stopwords(filepath='./data/stopwords_en.txt'): 83 | stopwords = set() 84 | with open(filepath, 'r') as f: 85 | for line in f: 86 | stopwords.add(line.strip()) 87 | print(len(stopwords)) 88 | return stopwords --------------------------------------------------------------------------------
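
----

For reference, a minimal sketch of how the root-level helpers in ./utils.py fit together is shown below. It is not part of the repository: the dataset name, paths, the toy corpus, and the entity word set are placeholder assumptions, and it presumes the train/vali/test .list files already exist under the data directory.

```
# Hypothetical usage sketch of ./utils.py (adapt the dataset name and paths to your data).
from utils import sample, load_stopwords, preprocess_corpus_notDropEntity

dataset = 'example'                       # placeholder dataset name
datapath = './data/{}/'.format(dataset)

# Read an existing train/vali/test split of document ids
# (pass resample=True to redraw trainNumPerClass=20 documents per class and rewrite the .list files).
train, vali, test, alltext = sample(datapath, dataset, resample=False)

# Clean raw texts: lower-case, keep alphabetic tokens, drop stopwords and
# low-frequency words, but never drop words that were linked to entities.
stopwords = load_stopwords('./data/stopwords_en.txt')
corpus = ["An example short text about machine learning ."]    # toy corpus (assumption)
involved_entity = {'machine', 'learning'}                       # toy entity words (assumption)
cleaned = preprocess_corpus_notDropEntity(corpus, stopwords, involved_entity)
```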