├── README.md
├── build_data.py
├── build_features.py
├── build_network.py
├── data
│   ├── example
│   │   └── example.txt
│   └── stopwords_en.txt
├── model
│   ├── code
│   │   ├── __init__.py
│   │   ├── layers.py
│   │   ├── models.py
│   │   ├── print_log.py
│   │   ├── train.py
│   │   ├── utils.py
│   │   └── utils_inductive.py
│   └── data
│       └── example
│           ├── example.cites
│           ├── example.content.entity
│           ├── example.content.text
│           ├── example.content.topic
│           ├── mapindex.txt
│           ├── test.map
│           ├── train.map
│           └── vali.map
├── tagMe.py
└── utils.py

/README.md:
--------------------------------------------------------------------------------
1 | An implementation of the EMNLP 2019 paper "[Heterogeneous Graph Attention Networks for Semi-supervised Short Text Classification](http://shichuan.org/doc/74.pdf)" and its extension "[HGAT: Heterogeneous Graph Attention Networks for Semi-supervised Short Text Classification](https://doi.org/10.1145/3450352)" (TOIS 2021).
2 |
3 | Thank you for your interest in our work! :smile:
4 |
5 |
6 |
7 | # Requirements
8 |
9 | - Anaconda3 (Python 3.6)
10 | - PyTorch 1.3.1
11 | - gensim 3.6.0
12 |
13 |
14 |
15 | # Easy Run
16 |
17 | ```
18 | cd ./model/code/
19 | python train.py
20 | ```
21 |
22 | You may change the dataset by modifying the variable "dataset = 'example'" at the top of "train.py", or by passing command-line arguments (see train.py).
23 |
24 | Our datasets can be downloaded from [Google Drive](https://drive.google.com/open?id=1pz1IMdJqkKidD7eEc3T_2-VkrUhkUKd4). PS: I accidentally deleted some files but have tried to restore them; hopefully they still run correctly.
25 |
26 |
27 |
28 | # Prepare for your own dataset
29 |
30 | The following files are required:
31 |
32 |     ./model/data/YourData/
33 |     ---- YourData.cites          // the adjacencies (edges)
34 |     ---- YourData.content.text   // the features of texts
35 |     ---- YourData.content.entity // the features of entities
36 |     ---- YourData.content.topic  // the features of topics
37 |     ---- train.map               // the indices of the training nodes
38 |     ---- vali.map                // the indices of the validation nodes
39 |     ---- test.map                // the indices of the test nodes
40 |
41 | The formats are as follows:
42 |
43 | - **YourData.cites**
44 |
45 |   Each line contains an edge: "idx1\tidx2\n". e.g. "98 13"
46 |
47 | - **YourData.content.text**
48 |
49 |   Each line contains a node: "idx\t[features]\t[category]\n". Note that [features] is a list of floats delimited by '\t'. e.g. "59 1.0 0.5 0.751 0.0 0.659 0.0 computers"
50 |   If used for multi-label classification, [category] must be a space-delimited binary (multi-hot) vector, e.g. "59 1.0 0.5 0.751 0.0 0.659 0.0 0 1 1 0 1 0".
51 |
52 | - **YourData.content.entity**
53 |
54 |   Similar to .text; just change [category] to "entity". e.g. "13 0.0 0.0 1.0 0.0 0.0 entity"
55 |
56 | - **YourData.content.topic**
57 |
58 |   Similar to .text; just change [category] to "topic". e.g. "64 0.10 1.21 8.09 0.10 topic"
59 |
60 | - ***.map**
61 |
62 |   Each line contains an index: "idx\n". e.g. "98"
63 |
64 | You can see an example in ./model/data/example/*
65 |
66 | ----
67 |
68 | A simple data preprocessing pipeline is provided. Running it successfully requires a [TagMe](https://sobigdata.d4science.org/web/tagme/tagme-help "TagMe") account token (my personal token is provided in tagMe.py, but it may become invalid in the future), [Wikipedia](https://dumps.wikimedia.org/ "WikiPedia")'s entity descriptions, and a word2vec model containing entity embeddings. You can prepare them yourself or obtain our files from [Google Drive](https://drive.google.com/open?id=1v9GD5ezHGbekoLDw5aAzh6-C-QUS-j93) and unzip them to ./data/.
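For reference, the sketch below shows how build_network.py consumes the word2vec model; the model path and the lowercased, underscore-joined entity lookup follow that script, while the entity names themselves are hypothetical placeholders.

```
import gensim

# build_network.py loads a Gensim Word2Vec model whose vocabulary also covers
# the (lowercased) entity titles linked by TagMe, e.g. "new_york".
model = gensim.models.Word2Vec.load('./data/word2vec/word2vec_gensim_5')

# Entity-entity edges are weighted by this embedding similarity; entity pairs
# missing from the vocabulary are simply skipped by build_network.py.
print(model.wv.similarity('new_york', 'united_states'))
```

Any Gensim word2vec model should work here, as long as the lowercased entity titles appear in its vocabulary.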
69 |
70 | Then, you should prepare a data file like ./data/example/example.txt, whose format is: "[idx]\t[category]\t[content]\n".
71 |
72 | Finally, modify the variable "dataset = 'example'" at the top of the following scripts and run:
73 |
74 | ```
75 | python tagMe.py
76 | python build_network.py
77 | python build_features.py
78 | python build_data.py
79 | ```
80 |
81 |
82 |
83 | # Use HGAT as GNN
84 |
85 | If you just want to use the HGAT model as a graph neural network, prepare some files following the format above:
86 |
87 |     ./model/data/YourData/
88 |     ---- YourData.cites       // the adjacencies (edges)
89 |     ---- YourData.content.*   // the features of *, namely node_type1, node_type2, ...
90 |     ---- train.map            // the indices of the training nodes
91 |     ---- vali.map             // the indices of the validation nodes
92 |     ---- test.map             // the indices of the test nodes
93 |
94 | Then change "load_data()" in ./model/code/utils.py:
95 |
96 | ```
97 | type_list = [node_type1, node_type2, ...]
98 | type_have_label = node_type
99 | ```
100 |
101 | See the code for more details.
102 |
103 |
104 |
105 | # Citation
106 |
107 | If you make use of the HGAT model in your research, please cite the following in your manuscript:
108 |
109 | ```
110 | @article{yang2021hgat,
111 |   author = {Yang, Tianchi and Hu, Linmei and Shi, Chuan and Ji, Houye and Li, Xiaoli and Nie, Liqiang},
112 |   title = {HGAT: Heterogeneous Graph Attention Networks for Semi-Supervised Short Text Classification},
113 |   year = {2021},
114 |   publisher = {Association for Computing Machinery},
115 |   volume = {39},
116 |   number = {3},
117 |   doi = {10.1145/3450352},
118 |   journal = {ACM Transactions on Information Systems},
119 |   month = may,
120 |   articleno = {32},
121 |   numpages = {29},
122 | }
123 |
124 | @inproceedings{linmei2019heterogeneous,
125 |   title={Heterogeneous graph attention networks for semi-supervised short text classification},
126 |   author={Linmei, Hu and Yang, Tianchi and Shi, Chuan and Ji, Houye and Li, Xiaoli},
127 |   booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
128 |   pages={4823--4832},
129 |   year={2019}
130 | }
131 | ```
132 |
--------------------------------------------------------------------------------
/build_data.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 |
4 | import networkx
5 | # import matplotlib.pyplot as plt
6 | import json, os
7 | import pickle
8 | import gensim
9 | import numpy as np
10 | from tqdm import tqdm
11 | from scipy import sparse as spr
12 |
13 | from utils import sample
14 |
15 | DATASETS = 'example'
16 |
17 |
18 |
19 | rootpath = './'
20 | outpath = rootpath + 'model/data/{}/'.format(DATASETS)
21 | datapath = rootpath + 'data/{}/'.format(DATASETS)
22 | SIM = 0.5
23 | LP = 0.75
24 | RHO = 0.3
25 | TopK_for_Topics = 2
26 |
27 |
28 | def cnt_nodes(g):
29 |     text_nodes, entity_nodes, topic_nodes = set(), set(), set()
30 |     for i in g.nodes():
31 |         if i.isdigit():
32 |             text_nodes.add(i)
33 |         elif i[:6] == 'topic_':
34 |             topic_nodes.add(i)
35 |         else:
36 |             entity_nodes.add(i)
37 |     print("# text_nodes: {} # entity_nodes: {} # topic_nodes: {}".format(
38 |         len(text_nodes), len(entity_nodes), len(topic_nodes)))
39 | 
return text_nodes, entity_nodes, topic_nodes 40 | 41 | with open(datapath+'model_network_sampled.pkl', 'rb') as f: 42 | g = pickle.load(f) 43 | text_nodes, entity_nodes, topic_nodes = cnt_nodes(g) 44 | 45 | with open(datapath+"features_BOW.pkl", 'rb') as f: 46 | features_BOW = pickle.load(f) 47 | with open(datapath+"features_TFIDF.pkl", 'rb') as f: 48 | features_TFIDF = pickle.load(f) 49 | with open(datapath+"features_index.pkl", 'rb') as f: 50 | features_index_BOWTFIDF = pickle.load(f) 51 | with open(datapath+"features_entity_descBOW.pkl", 'rb') as f: 52 | features_entity_BOW = pickle.load(f) 53 | with open(datapath+"features_entity_descTFIDF.pkl", 'rb') as f: 54 | features_entity_TFIDF = pickle.load(f) 55 | with open(datapath+"features_entity_index_desc.pkl", 'rb') as f: 56 | features_entity_index_desc = pickle.load(f) 57 | 58 | feature = features_TFIDF 59 | features_index = features_index_BOWTFIDF 60 | entityF = features_entity_TFIDF 61 | features_entity_index = features_entity_index_desc 62 | textShape = feature.shape 63 | entityShape = entityF.shape 64 | print("Shape of text feature:",textShape, 'Shape of entity feature:', entityShape) 65 | 66 | # 删掉没有特征的实体 67 | notinind = set() 68 | 69 | entitySet = set(entity_nodes) 70 | print(len(entitySet)) 71 | 72 | for i in entitySet: 73 | if i not in features_entity_index: 74 | notinind.add(i) 75 | print(len(g.nodes()), len(notinind)) 76 | g.remove_nodes_from(notinind) 77 | entitySet = entitySet - notinind 78 | print(len(entitySet), len(features_entity_index)) 79 | N = len(g.nodes()) 80 | print(len(g.nodes()), len(g.edges())) 81 | text_nodes, entity_nodes, topic_nodes = cnt_nodes(g) 82 | 83 | 84 | 85 | # 删掉一些边 86 | cnt = 0 87 | nodes = g.nodes() 88 | print(len(g.edges())) 89 | for node in tqdm(nodes): 90 | try: 91 | cache = [j for j in g[node] 92 | if ('sim' in g[node][j] and g[node][j]['sim'] < SIM) # 0.5 93 | or ('link_probability' in g[node][j] and g[node][j]['link_probability'] <= LP) 94 | or ('rho' in g[node][j] and g[node][j]['rho'] < RHO) 95 | ] 96 | if len(cache) != 0: 97 | g.remove_edges_from( [(node, i) for i in cache] ) 98 | cnt += len( cache ) 99 | except: 100 | print(g[node]) 101 | break 102 | print(len(g.edges()), cnt) 103 | 104 | # 删掉孤立点(实体) 105 | delete = [n for n in g.nodes() if len(g[n]) == 0 and n not in text_nodes] 106 | print("Num of 孤立点:", len(delete)) 107 | g.remove_nodes_from(delete) 108 | 109 | 110 | 111 | train, vali, test, alltext = sample(datapath, DATASETS) 112 | 113 | 114 | 115 | 116 | 117 | # topic 118 | with open(datapath + 'topic_word_distribution.pkl', 'rb') as f: 119 | topic_word = pickle.load(f) 120 | with open(datapath + 'doc_topic_distribution.pkl', 'rb') as f: 121 | doc_topic = pickle.load(f) 122 | with open(datapath + 'doc_index_LDA.pkl', 'rb') as f: 123 | doc_idx_list = pickle.load(f) 124 | 125 | topic_num = topic_word.shape[0] 126 | topics = [] 127 | for i in range(topic_num): 128 | topicName = 'topic_' + str(i) 129 | topics.append(topicName) 130 | 131 | def naive_arg_topK(matrix, K, axis=0): 132 | """ 133 | perform topK based on np.argsort 134 | :param matrix: to be sorted 135 | :param K: select and sort the top K items 136 | :param axis: dimension to be sorted. 
137 | :return: 138 | """ 139 | full_sort = np.argsort(-matrix, axis=axis) 140 | return full_sort.take(np.arange(K), axis=axis) 141 | 142 | topK_topics = naive_arg_topK(doc_topic, TopK_for_Topics, axis=1) 143 | for i in range(topK_topics.shape[0]): 144 | for j in range(TopK_for_Topics): 145 | g.add_edge(doc_idx_list[i], topics[topK_topics[i, j]]) 146 | 147 | print("gnodes:", len(g.nodes()), "gedges:", len(g.edges())) 148 | 149 | 150 | 151 | 152 | 153 | # build Edges data 154 | import collections 155 | cnt = 0 156 | nodes = g.nodes() 157 | graphdict = collections.defaultdict(list) 158 | for node in tqdm(nodes): 159 | try: 160 | cache = [j for j in g[node] 161 | if ('sim' in g[node][j] and g[node][j]['sim'] >= SIM) or ('sim' not in g[node][j]) # 0.5 162 | if ('link_probability' in g[node][j] and g[node][j]['link_probability'] > LP) or ('link_probability' not in g[node][j]) 163 | if ('rho' in g[node][j] and g[node][j]['rho'] > RHO) or ('rho' not in g[node][j]) 164 | ] 165 | if len(cache) != 0: 166 | graphdict[ node ] = cache 167 | cnt += len( cache ) 168 | except: 169 | print(g[node]) 170 | break 171 | print('edges: ', cnt) 172 | 173 | def normalizeF(mx): 174 | sup = np.absolute(mx).max() 175 | if sup == 0: 176 | return mx 177 | return mx / sup 178 | 179 | text_nodes, entity_nodes, topic_nodes = cnt_nodes(g) 180 | 181 | mapindex = dict() 182 | cnt = 0 183 | for i in text_nodes|entity_nodes|topic_nodes: 184 | mapindex[i] = cnt 185 | cnt += 1 186 | print(len(g.nodes()), len(mapindex)) 187 | 188 | if not os.path.exists(outpath): 189 | os.makedirs(outpath) 190 | 191 | 192 | # build feature data 193 | gnodes = set(g.nodes()) 194 | print(gnodes, mapindex) 195 | with open(outpath + 'train.map', 'w') as f: 196 | f.write('\n'.join([str(mapindex[i]) for i in train if i in gnodes])) 197 | with open(outpath + 'vali.map', 'w') as f: 198 | f.write('\n'.join([str(mapindex[i]) for i in vali if i in gnodes])) 199 | with open(outpath + 'test.map', 'w') as f: 200 | f.write('\n'.join([str(mapindex[i]) for i in test if i in gnodes])) 201 | 202 | flag_zero = False 203 | 204 | input_type = 'text&entity&topic2hgcn' 205 | if input_type == 'text&entity&topic2hgcn': 206 | node_with_feature = set() 207 | DEBUG = False 208 | 209 | # text node 210 | content = dict() 211 | for i in tqdm(range(textShape[0])): 212 | ind = features_index[i] 213 | if (ind) not in text_nodes: 214 | continue 215 | content[ind] = feature[i, :].toarray()[0].tolist() # 216 | if DEBUG: 217 | content[ind] = feature[i, :10].toarray()[0].tolist() 218 | if flag_zero: 219 | entityFlen = entityShape[1] 220 | content[ind] += [0] * (entityFlen + topic_word.shape[1]) 221 | with open(datapath + '{}.txt'.format(DATASETS), 'r') as f: 222 | for line in tqdm(f): 223 | ind, cat = line.strip('\n').split('\t')[:2] 224 | ind = (ind) 225 | if ind not in text_nodes: 226 | continue 227 | content[ind] += [cat] 228 | alllen = len(content[ind]) 229 | with open(outpath + '{}.content.text'.format(DATASETS), 'w') as f: 230 | for ind in tqdm(content): 231 | f.write(str(mapindex[ind]) + '\t' + '\t'.join(map(str, content[ind])) + '\n') 232 | node_with_feature.add(ind) 233 | cache = len(content) 234 | print("共{}个文本".format(len(content))) 235 | 236 | # entity node 237 | content = dict() 238 | for i in tqdm(range(entityShape[0])): 239 | name = features_entity_index[i] 240 | if name not in entity_nodes: 241 | continue 242 | content[name] = entityF[i, :].toarray()[0].tolist() + ['entity'] 243 | if flag_zero: 244 | content[name] = [0] * textShape[1] + content[name] + [0] * 
topic_word.shape[1] + ['entity'] 245 | with open(outpath + '{}.content.entity'.format(DATASETS), 'w') as f: 246 | for ind in tqdm(content): 247 | f.write(str(mapindex[ind]) + '\t' + '\t'.join(map(str, content[ind])) + '\n') 248 | node_with_feature.add(ind) 249 | cache += len(content) 250 | print("共{}个实体".format(len(content))) 251 | 252 | # topic node 253 | content = dict() 254 | for i in range(topic_num): 255 | # zero_num = textShape[1] + entityFlen - topic_num 256 | topicName = topics[i] 257 | if topicName not in topic_nodes: 258 | continue 259 | one_hot = [0] * topic_num 260 | one_hot[i] = 1 261 | content[topicName] = one_hot 262 | content[topicName] = topic_word[i].tolist() + ['topic'] 263 | if flag_zero: 264 | zero_num = textShape[1] + entityFlen 265 | content[topicName] = [0] * zero_num + content[topicName] + ['topic'] 266 | 267 | with open(outpath + '{}.content.topic'.format(DATASETS), 'w') as f: 268 | for ind in tqdm(content): 269 | f.write(str(mapindex[ind]) + '\t' + '\t'.join(map(str, content[ind])) + '\n') 270 | node_with_feature.add(ind) 271 | cache += len(content) 272 | print("共{}个主题".format(len(content))) 273 | 274 | print(cache, len(mapindex)) 275 | print("nodes with features:", len(node_with_feature)) 276 | 277 | 278 | 279 | # save mappings 280 | with open(outpath+'mapindex.txt', 'w') as f: 281 | for i in mapindex: 282 | f.write("{}\t{}\n".format(i, mapindex[i])) 283 | 284 | 285 | 286 | # save adj matrix 287 | with open(outpath+'{}.cites'.format(DATASETS), 'w') as f: 288 | doneSet = set() 289 | nodeSet = set() 290 | for node in graphdict: 291 | for i in graphdict[node]: 292 | if (node, i) not in doneSet: 293 | f.write( str(mapindex[node])+'\t'+str(mapindex[i])+'\n' ) 294 | doneSet.add( (i, node) ) 295 | doneSet.add( (node, i) ) 296 | nodeSet.add( node ) 297 | nodeSet.add( i ) 298 | for i in tqdm(range(len(mapindex))): 299 | f.write(str(i)+'\t'+str(i)+'\n') 300 | 301 | print('Num of nodes with edges: ', len(nodeSet)) -------------------------------------------------------------------------------- /build_features.py: -------------------------------------------------------------------------------- 1 | #!/user/bin/env python 2 | # -*- coding: utf-8 -*- 3 | import networkx 4 | import json 5 | from tqdm import tqdm 6 | from sklearn.feature_extraction.text import CountVectorizer 7 | from sklearn.feature_extraction.text import TfidfTransformer 8 | from sklearn.decomposition import LatentDirichletAllocation 9 | import pickle as pkl 10 | from nltk.tokenize import WordPunctTokenizer 11 | import os 12 | from utils import sample, preprocess_corpus_notDropEntity, load_stopwords 13 | 14 | import jieba 15 | DATASETS = 'example' 16 | 17 | 18 | 19 | rootpath = './' 20 | datapath = rootpath + 'data/{}/'.format(DATASETS) 21 | 22 | def tokenize(sen): 23 | return WordPunctTokenizer().tokenize(sen) 24 | # return jieba.cut(sen) 25 | 26 | 27 | def build_entity_feature_with_description(datapath, stopwords=list()): 28 | with open(datapath + 'model_network_sampled.pkl', 'rb') as f: 29 | g = pkl.load(f) 30 | nodesset = set(g.nodes()) 31 | entityIndex = [] 32 | corpus = [] 33 | cnt = 0 34 | for i in tqdm(range(40), desc="Read desc: "): 35 | filename = str(i).zfill(4) 36 | with open("./data/wikiAbstract/"+filename, 'r') as f: 37 | for line in f: 38 | ent, desc = line.strip('\n').split('\t') 39 | entity = ent.replace(" ", "_") 40 | if entity in nodesset: 41 | if entity not in entityIndex: 42 | entityIndex.append(entity) 43 | cnt += 1 44 | else: 45 | print('error') 46 | content = tokenize(desc) 47 | content 
= ' '.join([ word.lower() for word in content if word.isalpha() ]) 48 | corpus.append(content) 49 | 50 | print(len(corpus), len(entityIndex)) 51 | 52 | vectorizer = CountVectorizer(min_df = 10, stop_words=stopwords) 53 | X = vectorizer.fit_transform(corpus) 54 | print("Entity feature shape: ", X.shape) 55 | transformer = TfidfTransformer() 56 | tfidf = transformer.fit_transform(X) 57 | print("Caculated! Saving...") 58 | with open(datapath+"vectorizer_model.pkl", 'wb') as f: 59 | pkl.dump(vectorizer, f) 60 | with open(datapath+"transformer_model.pkl", 'wb') as f: 61 | pkl.dump(transformer, f) 62 | with open(datapath+"features_entity_descBOW.pkl", 'wb') as f: 63 | pkl.dump(X, f) 64 | with open(datapath+"features_entity_descTFIDF.pkl", 'wb') as f: 65 | pkl.dump(tfidf, f) 66 | with open(datapath+"features_entity_index_desc.pkl", 'wb') as f: 67 | pkl.dump(entityIndex, f) 68 | print("done!") 69 | 70 | def build_text_feature(datapath, DATASETS, rho=0.3, lp=0.5, stopwords=list()): 71 | train, vali, test, alltext = sample(datapath=datapath, DATASETS=DATASETS, resample=False) 72 | # 这里先把未替换的ind-content对存在字典中 73 | pre_replace = dict() 74 | index2ind = {} 75 | cnt = 0 76 | corpus = [] 77 | involved_entity = set() 78 | 79 | with open("{}{}.txt".format(datapath, DATASETS), 'r', encoding='utf8') as f: 80 | for line in f: 81 | ind, cate, content = line.strip('\n').split('\t') 82 | if ind not in alltext: 83 | continue 84 | pre_replace[ind] = content.lower() 85 | content = pre_replace[ind] 86 | corpus.append(content) 87 | index2ind[cnt] = ind 88 | cnt += 1 89 | print(len(pre_replace)) 90 | 91 | print("loading entities...") 92 | with open('{}{}2entity.txt'.format(datapath, DATASETS), 'r', encoding='utf8') as f: 93 | for line in tqdm(f): 94 | ind, entityList = line.strip('\n').split('\t') 95 | # ind = int(ind) 96 | if ind not in pre_replace: 97 | continue 98 | 99 | entityList = json.loads(entityList) 100 | for d in entityList: 101 | if d['rho'] < rho: 102 | continue 103 | if d['link_probability'] < lp: 104 | continue 105 | if 'title' not in d: 106 | print("An entity with no title, whose spot is: {}".format(d['spot'])) 107 | continue 108 | ent = d['title'].replace(" ", '') 109 | involved_entity.add(ent) 110 | ori = d['spot'].lower() 111 | content.replace(ori, ent) 112 | 113 | 114 | len(corpus) 115 | print("text preprocessing...") 116 | 117 | corpus = preprocess_corpus_notDropEntity(corpus, 118 | stopwords=stopwords, involved_entity=involved_entity) 119 | print("text feature transforming...") 120 | 121 | vectorizer = CountVectorizer(min_df=10 if DATASETS != "example" else 0, stop_words=stopwords) 122 | X = vectorizer.fit_transform(corpus) 123 | 124 | with open(datapath + 'TextBoW_model.pkl', 'wb') as f: 125 | pkl.dump(vectorizer, f) 126 | 127 | transformer = TfidfTransformer() 128 | tfidf = transformer.fit_transform(X) 129 | print("text feature transformed.") 130 | 131 | with open(datapath + "features_BOW.pkl", 'wb') as f: 132 | pkl.dump(X, f) 133 | with open(datapath + "features_TFIDF.pkl", 'wb') as f: 134 | pkl.dump(tfidf, f) 135 | with open(datapath + "features_index.pkl", 'wb') as f: 136 | pkl.dump(index2ind, f) 137 | print(X.shape) 138 | 139 | alllength = sum([len(sentence.split(' ')) for sentence in corpus]) 140 | avg_length = alllength / len(corpus) 141 | print('train: {}\tvali: {}\ttest: {}'.format(len(train), len(vali), len(test))) 142 | print('num of all corpus: {}'.format(len(train + vali + test))) 143 | print('avg of tokens: {:.1f}'.format(avg_length)) 144 | vocab = set() 145 | for s in corpus: 146 
| vocab.update(s.split(' ')) 147 | print('involved entities: {}'.format(len(involved_entity))) 148 | print('vocabulary size: {}'.format(len(vocab))) 149 | 150 | 151 | def build_topic_feature_sklearn(datapath, DATASETS, TopicNum=20, stopwords=list(), train=False): 152 | # sklearn-lda 153 | 154 | idxlist = [] 155 | corpus = [] 156 | catelist = [] 157 | with open('{}{}.txt'.format(datapath, DATASETS), 'r', encoding='utf8') as f: 158 | for line in f: 159 | ind, cate, content = line.strip().split('\t') 160 | idxlist.append(ind) 161 | corpus.append(content) 162 | catelist.append(cate) 163 | 164 | with open(datapath + 'doc_index_LDA.pkl', 'wb') as f: 165 | pkl.dump(idxlist, f) 166 | 167 | print("text feature transforming...") 168 | corpus = preprocess_corpus_notDropEntity(corpus,stopwords=stopwords, involved_entity=set()) 169 | 170 | with open(datapath + "features_BOW.pkl", 'rb') as f: 171 | X = pkl.load(f) 172 | # vocabulary_ 的对照关系,读上面那个bow的模型就可以了 173 | 174 | 175 | if train: 176 | alpha, beta = 0.1, 0.1 177 | lda = LatentDirichletAllocation(n_components=TopicNum, max_iter=1200, 178 | learning_method='batch', n_jobs=-1, 179 | doc_topic_prior=alpha, topic_word_prior=beta, 180 | verbose=1, 181 | ) 182 | lda_feature = lda.fit_transform(X) 183 | with open(datapath + 'lda_model.pkl', 'wb') as f: 184 | pkl.dump(lda, f) 185 | with open(datapath + 'topic_word_distribution.pkl', 'wb') as f: 186 | pkl.dump(lda.components_, f) 187 | else: 188 | with open(datapath + 'lda_model.pkl', 'rb') as f: 189 | lda = pkl.load(f) 190 | lda_feature = lda.transform(X) 191 | 192 | with open(datapath + 'doc_topic_distribution.pkl', 'wb') as f: 193 | pkl.dump(lda_feature, f) 194 | 195 | 196 | 197 | 198 | 199 | if __name__ == '__main__': 200 | stopwords = load_stopwords() 201 | 202 | build_entity_feature_with_description(datapath, stopwords=stopwords) 203 | build_text_feature(datapath, DATASETS, stopwords=stopwords) 204 | build_topic_feature_sklearn(datapath, DATASETS, stopwords=stopwords, train=True) -------------------------------------------------------------------------------- /build_network.py: -------------------------------------------------------------------------------- 1 | #!/user/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import networkx 5 | import json 6 | import pickle 7 | import gensim 8 | from tqdm import tqdm 9 | 10 | from utils import sample 11 | 12 | DATASETS = 'example' 13 | 14 | NumOfTrainTextPerClass = 2 15 | TOPK = 10 16 | SIM_MIN = 0.5 17 | 18 | rootpath = './' 19 | datapath = rootpath + 'data/{}/'.format(DATASETS) 20 | g = networkx.Graph() 21 | train, vali, test, alltext = sample(datapath=datapath, DATASETS=DATASETS, resample=True, 22 | trainNumPerClass=NumOfTrainTextPerClass) 23 | 24 | # load text-entity 25 | entitySet = set() 26 | rho = 0.1 27 | noEntity = set() 28 | with open(datapath+'{}2entity.txt'.format(DATASETS), 'r') as f: 29 | for line in tqdm(f, desc="text-ent: "): 30 | ind, entityList = line.strip('\n').split('\t') 31 | # ind = int(ind) 32 | if ind not in alltext: 33 | continue 34 | entityList = json.loads(entityList) 35 | entities = [(d['title'].replace(" ", '_'), d['rho'], d['link_probability']) 36 | for d in entityList if 'title' in d and float(d['rho']) > rho] 37 | entitySet.update([d['title'].replace(" ", '_') 38 | for d in entityList if 'title' in d and float(d['rho']) > rho]) 39 | g.add_edges_from([(ind, e[0], {'rho': e[1], 'link_probability': e[2]}) 40 | for e in entities]) 41 | if len(entities) == 0: 42 | noEntity.add(ind) 43 | g.add_node(ind) 44 | print("text-entity 
done.") 45 | 46 | # load labels 47 | with open(datapath+'{}.txt'.format(DATASETS), 'r', encoding='utf8') as f: 48 | for line in tqdm(f,desc="text label: "): 49 | ind, cate, title = line.strip('\n').split('\t') 50 | # ind = int(ind) 51 | if ind not in alltext: 52 | continue 53 | if ind not in g.nodes(): 54 | g.add_node(ind) 55 | g.node[ind]['type'] = cate 56 | 57 | 58 | # load similarities between entities 59 | print("loading Gensim.word2vec. ") 60 | model = gensim.models.Word2Vec.load(rootpath+'data/word2vec/word2vec_gensim_5') 61 | print("word2vec model done.") 62 | 63 | # topK + 阈值 64 | sim_min = SIM_MIN 65 | topK = TOPK 66 | el = list(entitySet) 67 | entity_edge = [] 68 | cnt_no = 0 69 | cnt_yes = 0 70 | cnt = 0 71 | for i in tqdm(range(len(el)), desc="ent-ent: "): 72 | simList = [] 73 | topKleft = topK 74 | for j in range(len(el)): 75 | if i == j: 76 | continue 77 | cnt += 1 78 | try: 79 | sim = model.wv.similarity(el[i].lower().strip(')'), el[j].lower().strip(')')) 80 | cnt_yes += 1 81 | if sim >= sim_min: 82 | entity_edge.append( (el[i], el[j], {'sim': sim}) ) 83 | topKleft -= 1 84 | else: 85 | simList.append( (sim, el[j]) ) 86 | except Exception as e: 87 | cnt_no += 1 88 | simList = sorted(simList, key=(lambda x: x[0]), reverse=True) 89 | for i in range(min(max(topKleft, 0), len(simList))): 90 | entity_edge.append( (el[i], simList[i][1], {'sim': simList[i][0]}) ) 91 | print(cnt_yes, cnt_no) 92 | 93 | g.add_edges_from(entity_edge) 94 | 95 | # save the network 96 | with open(datapath+'model_network_sampled.pkl', 'wb') as f: 97 | pickle.dump(g, f) -------------------------------------------------------------------------------- /data/example/example.txt: -------------------------------------------------------------------------------- 1 | 0 business manufacture manufacturer directory directory china taiwan products manufacturers directory- taiwan china products manufacturer direcory exporter directory supplier directory suppliers 2 | 1 business empmag electronics manufacturing procurement homepage electronics manufacturing procurement magazine procrement power products production essentials data management 3 | 2 business dfma truecost paper true cost overseas manufacture product design costs manufacturing products china manufacturing redesigned product china save 4 | 3 business thomasnet thomasnet cnc machining metal stamping gaskets fasteners searchable database information products services cad drawings 5 | 4 business crnano products nanotechnology products molecular manufacturing product structural design simplest products software-controlled extra cost manufacturing 6 | 2130 computers digitalmars intro programming language digital mars compiled garbage collected simpler replacement walter bright wrote dos compiler maximum similarity backward 7 | 2131 computers google top computers programming languages google directory computers programming languages programming languages weblog news discussion web access usenet newsgroups programming languages 8 | 2132 computers ruby lang ruby programming language interpreted dynamically typed pure object-oriented scripting language programming japan straightforward extensible 9 | 2133 computers sun java technology experienced programmers migrating java language java programming language sl- -se practice exam 10 | 2134 computers dir yahoo computers and internet programming and development languages computer programming languages yahoo directory links sites code samples tools tutorials articles resources focusing programming languages 11 | 2560 
culture-arts-entertainment oscar oscar com annual academy awards homepage nominee winners carpet photograph gallery press video clips voting ballot information 12 | 2561 culture-arts-entertainment oscars welcome academy motion picture arts sciences academy history information academy awards photographs events screenings press releases 13 | 2562 culture-arts-entertainment https oscar symplicity oscar online system clerkship application review users internet explorer firefox netscape internet browser access oscar 14 | 2563 culture-arts-entertainment wikipedia wiki academy awards academy award wikipedia encyclopedia academy awards oscars prominent oscar statuette academy award merit 15 | 2564 culture-arts-entertainment oscar openclustergroup oscar source cluster application resources oscar users regardless experience level nix environment install beowulf performance computing cluster 16 | 5220 education-science wikipedia wiki mechanics mechanics wikipedia encyclopedia mechanics greek μηχανική branch physics behaviour physical bodies subjected forces displacements 17 | 5221 education-science wikipedia wiki mechanic mechanic wikipedia encyclopedia mechanics specialised field auto mechanics boiler mechanics industrial maintenance mechanics millwrights 18 | 5222 education-science popularmechanics mechanics mechanics service magazine covering information home improvement automotive electronics computers telecommunications 19 | 5223 education-science carpros professional mechanics online car repair question professional mechanics online answer 20 | 5224 education-science bls gov oco ocos automotive service technicians mechanics opportunities automotive service technicians mechanics diagnostic problem-solving skills knowledge electronics 21 | 6760 engineering wikipedia wiki jet engine jet engine wikipedia encyclopedia jet engine engine discharges jet fluid generate thrust accordance newton law motion 22 | 6761 engineering aardvark pjet pulse jet engine information building pulsejet turbojet engine 23 | 6762 engineering aardvark pjet turbinenuts shtml crazy jet engine invention jet engine people crazier strapping 24 | 6763 engineering howstuffworks turbine howstuffworks gas turbine engines wonder jet engine cruising feet jets helicopters power plants class 25 | 6764 engineering inventors about library inventors blhowajetengineworks jet engine jet engine operates application sir isaac newton law physics 26 | 7630 health rcseng news newsitem hospital doctor interview president bernard ribeiro magazine hospital doctor interviewed bernard ribeiro president college feature interview touches 27 | 7631 health medicalnewstoday articles advocate south suburban hospital doctor tips jogging neighborhood biking trails inline skating chicago lakefront 28 | 7632 health doctorshospital doctors hospital life doctors hospital opelousas community healthcare provider dedicated serving healthcare landry parish surrounding 29 | 7633 health webserver bjc sfnet bjh physiciansearch barnes-jewish hospital physician search doctors listed referral directory medical staff barnes-jewish hospital doctors payment referral 30 | 7634 health gwhospital hospital doctor doctor george washington physician referral service help gw-docs 31 | 8520 politics-society congress congressorg congress org election candidates information candidate information accessible zip code national elections archives 32 | 8521 politics-society sos state elections candidates index shtml candidates candidate guide candidates ballot upcoming elections calendar 
election link texas ethics 33 | 8522 politics-society ieee organizations corporate candidates ieee ieee annual election candidates ieee annual election candidates listed positions candidates ieee annual election ballot 34 | 8523 politics-society stc candidatesfaq society technical communication candidates faq information candidates region director election stc annual meeting 35 | 8524 politics-society marketingpilgrim presidential election reputation presidential election candidate reputation study marketing declared presidential candidates negative search engine listings republicans 36 | 9070 sports yalebulldogs cstv yale university bulldogs athletic athletic yale university college sports network comprehensive coverage yale university athletics 37 | 9071 sports texastech cstv texas tech raiders athletic athletic texas tech university fansonly network comprehensive coverage raider athletics internet 38 | 9072 sports wikipedia wiki track and field athletics track field wikipedia encyclopedia athletics track field track field athletics collection sports events running throwing jumping 39 | 9073 sports siusalukis cstv southern illinois university athletic salukis athletic training spirit booster club team merchandise 40 | 9074 sports highschoolsports highschoolsports net school teams world highschoolsports net authority school schedules scores stats download sports schedules sign notified 41 | -------------------------------------------------------------------------------- /data/stopwords_en.txt: -------------------------------------------------------------------------------- 1 | a 2 | b 3 | c 4 | d 5 | e 6 | f 7 | g 8 | h 9 | i 10 | j 11 | k 12 | l 13 | m 14 | n 15 | o 16 | p 17 | q 18 | r 19 | s 20 | t 21 | u 22 | v 23 | w 24 | x 25 | y 26 | z 27 | / 28 | * 29 | & 30 | % 31 | # 32 | @ 33 | ! 34 | ~ 35 | + 36 | - 37 | �C 38 | ( 39 | ) 40 | ? 41 | : 42 | " 43 | ' 44 | \ 45 | = 46 | ` 47 | about 48 | above 49 | across 50 | after 51 | afterwards 52 | again 53 | against 54 | all 55 | almost 56 | alone 57 | along 58 | already 59 | also 60 | although 61 | always 62 | am 63 | among 64 | amongst 65 | amoungst 66 | amount 67 | an 68 | and 69 | another 70 | any 71 | anyhow 72 | anyone 73 | anything 74 | anyway 75 | anywhere 76 | are 77 | around 78 | as 79 | at 80 | bmp 81 | back 82 | be 83 | became 84 | because 85 | become 86 | becomes 87 | becoming 88 | been 89 | before 90 | beforehand 91 | behind 92 | being 93 | below 94 | beside 95 | besides 96 | between 97 | beyond 98 | bill 99 | both 100 | bottom 101 | but 102 | by 103 | call 104 | can 105 | cannot 106 | cant 107 | co 108 | computer 109 | com 110 | con 111 | could 112 | couldnt 113 | cry 114 | de 115 | describe 116 | detail 117 | do 118 | done 119 | down 120 | due 121 | during 122 | each 123 | eg 124 | e.g 125 | e.g. 
126 | eight 127 | either 128 | eleven 129 | else 130 | elsewhere 131 | empty 132 | enough 133 | etc 134 | e.t.c 135 | even 136 | ever 137 | every 138 | everyone 139 | everything 140 | everywhere 141 | except 142 | exactly 143 | few 144 | fifteen 145 | fifty 146 | fill 147 | find 148 | fire 149 | first 150 | five 151 | for 152 | former 153 | formerly 154 | forty 155 | found 156 | four 157 | from 158 | front 159 | full 160 | further 161 | get 162 | give 163 | go 164 | gt 165 | had 166 | has 167 | hasnt 168 | hasn't 169 | have 170 | he 171 | hence 172 | her 173 | here 174 | hereafter 175 | hereby 176 | herein 177 | hereupon 178 | hers 179 | herself 180 | him 181 | himself 182 | his 183 | how 184 | however 185 | http 186 | href 187 | i 188 | ie 189 | i.e 190 | id 191 | if 192 | in 193 | inc 194 | it 195 | image 196 | images 197 | indeed 198 | interest 199 | into 200 | is 201 | it 202 | its 203 | itself 204 | jpg 205 | j 206 | keep 207 | last 208 | latter 209 | latterly 210 | least 211 | less 212 | ltd 213 | lt 214 | made 215 | many 216 | may 217 | me 218 | meanwhile 219 | might 220 | mill 221 | mine 222 | more 223 | moreover 224 | most 225 | mostly 226 | move 227 | much 228 | must 229 | my 230 | myself 231 | name 232 | namely 233 | neither 234 | never 235 | nevertheless 236 | next 237 | nine 238 | no 239 | nobody 240 | none 241 | noone 242 | nor 243 | not 244 | nothing 245 | now 246 | nowhere 247 | of 248 | off 249 | often 250 | on 251 | once 252 | one 253 | only 254 | onto 255 | or 256 | other 257 | others 258 | otherwise 259 | our 260 | ours 261 | ourselves 262 | out 263 | over 264 | own 265 | part 266 | per 267 | perhaps 268 | please 269 | people 270 | person 271 | peoples 272 | persons 273 | profile 274 | profiles 275 | put 276 | rather 277 | re 278 | rt 279 | same 280 | see 281 | seem 282 | seemed 283 | seeming 284 | seems 285 | serious 286 | several 287 | she 288 | should 289 | show 290 | shall 291 | side 292 | since 293 | sincere 294 | six 295 | sixty 296 | so 297 | some 298 | somehow 299 | someone 300 | something 301 | sometime 302 | sometimes 303 | somewhere 304 | still 305 | such 306 | system 307 | take 308 | twitter 309 | tweet 310 | tweets 311 | ten 312 | than 313 | that 314 | the 315 | their 316 | them 317 | themselves 318 | then 319 | thence 320 | there 321 | thereafter 322 | thereby 323 | therefore 324 | therein 325 | thereupon 326 | these 327 | they 328 | thick 329 | thin 330 | third 331 | this 332 | those 333 | though 334 | three 335 | through 336 | throughout 337 | thru 338 | thus 339 | to 340 | together 341 | too 342 | top 343 | toward 344 | towards 345 | twelve 346 | twenty 347 | two 348 | un 349 | under 350 | until 351 | up 352 | upon 353 | us 354 | url 355 | user 356 | very 357 | via 358 | was 359 | we 360 | well 361 | were 362 | what 363 | whatever 364 | when 365 | whence 366 | whenever 367 | where 368 | whereafter 369 | whereas 370 | whereby 371 | wherein 372 | whereupon 373 | wherever 374 | whether 375 | which 376 | while 377 | wheather 378 | who 379 | www 380 | web 381 | whoever 382 | whole 383 | whom 384 | whose 385 | why 386 | will 387 | with 388 | within 389 | without 390 | would 391 | yet 392 | you 393 | your 394 | yours 395 | yourself 396 | yourselves 397 | reuters 398 | related 399 | story 400 | external 401 | link 402 | -------------------------------------------------------------------------------- /model/code/__init__.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | from 
__future__ import division 3 | -------------------------------------------------------------------------------- /model/code/layers.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch 3 | from torch.nn.parameter import Parameter 4 | from torch.nn.modules.module import Module 5 | import torch.nn.functional as F 6 | from torch import nn 7 | 8 | class GraphConvolution(Module): 9 | def __init__(self, in_features, out_features, bias=True): 10 | super(GraphConvolution, self).__init__() 11 | self.in_features = in_features 12 | self.out_features = out_features 13 | self.weight = Parameter(torch.FloatTensor(in_features, out_features)) 14 | if bias: 15 | self.bias = Parameter(torch.FloatTensor(out_features)) 16 | else: 17 | self.register_parameter('bias', None) 18 | self.reset_parameters() 19 | 20 | def reset_parameters(self): 21 | stdv = 1. / math.sqrt(self.weight.size(1)) 22 | self.weight.data.uniform_(-stdv, stdv) 23 | if self.bias is not None: 24 | self.bias.data.uniform_(-stdv, stdv) 25 | 26 | def forward(self, inputs, adj, global_W = None): 27 | if len(adj._values()) == 0: 28 | return torch.zeros(adj.shape[0], self.out_features, device=inputs.device) 29 | 30 | support = torch.spmm(inputs, self.weight) 31 | if global_W is not None: 32 | support = torch.spmm(support, global_W) 33 | output = torch.spmm(adj, support) 34 | if self.bias is not None: 35 | return output + self.bias 36 | else: 37 | return output 38 | 39 | def __repr__(self): 40 | return self.__class__.__name__ + ' (' \ 41 | + str(self.in_features) + ' -> ' \ 42 | + str(self.out_features) + ')' 43 | 44 | 45 | class SelfAttention_ori(Module): 46 | def __init__(self, in_features): 47 | super(SelfAttention, self).__init__() 48 | self.a = Parameter(torch.FloatTensor(2 * in_features, 1)) 49 | self.reset_parameters() 50 | 51 | def reset_parameters(self): 52 | stdv = 1. / math.sqrt(self.a.size(1)) 53 | self.a.data.uniform_(-stdv, stdv) 54 | 55 | def forward(self, inputs): 56 | x = inputs.transpose(0, 1) 57 | self.n = x.size()[0] 58 | x = torch.cat([x, torch.stack([x] * self.n, dim=0)], dim=2) 59 | U = torch.matmul(x, self.a).transpose(0, 1) 60 | U = F.leaky_relu(U) 61 | weights = F.softmax(U, dim=1) 62 | outputs = torch.matmul(weights.transpose(1, 2), inputs).squeeze(1) 63 | return outputs, weights 64 | 65 | 66 | class SelfAttention(Module): 67 | def __init__(self, in_features, idx, hidden_dim): 68 | super(SelfAttention, self).__init__() 69 | self.idx = idx 70 | self.linear = torch.nn.Linear(in_features, hidden_dim) 71 | self.a = Parameter(torch.FloatTensor(2 * hidden_dim, 1)) 72 | self.reset_parameters() 73 | 74 | def reset_parameters(self): 75 | stdv = 1. 
/ math.sqrt(self.a.size(1)) 76 | self.a.data.uniform_(-stdv, stdv) 77 | 78 | def forward(self, inputs): 79 | # inputs size: node_num * 3 * in_features 80 | x = self.linear(inputs).transpose(0, 1) 81 | self.n = x.size()[0] 82 | x = torch.cat([x, torch.stack([x[self.idx]] * self.n, dim=0)], dim=2) 83 | U = torch.matmul(x, self.a).transpose(0, 1) 84 | U = F.leaky_relu_(U) 85 | weights = F.softmax(U, dim=1) 86 | outputs = torch.matmul(weights.transpose(1, 2), inputs).squeeze(1) * 3 87 | return outputs, weights 88 | 89 | 90 | class GraphAttentionConvolution(Module): 91 | def __init__(self, in_features_list, out_features, bias=True, gamma = 0.1): 92 | super(GraphAttentionConvolution, self).__init__() 93 | self.ntype = len(in_features_list) 94 | self.in_features_list = in_features_list 95 | self.out_features = out_features 96 | self.weights = nn.ParameterList() 97 | for i in range(self.ntype): 98 | cache = Parameter(torch.FloatTensor(in_features_list[i], out_features)) 99 | nn.init.xavier_normal_(cache.data, gain=1.414) 100 | self.weights.append( cache ) 101 | if bias: 102 | self.bias = Parameter(torch.FloatTensor(out_features)) 103 | stdv = 1. / math.sqrt(out_features) 104 | self.bias.data.uniform_(-stdv, stdv) 105 | else: 106 | self.register_parameter('bias', None) 107 | 108 | self.att_list = nn.ModuleList() 109 | for i in range(self.ntype): 110 | self.att_list.append( Attention_NodeLevel(out_features, gamma) ) 111 | 112 | 113 | def forward(self, inputs_list, adj_list, global_W = None): 114 | h = [] 115 | for i in range(self.ntype): 116 | h.append( torch.spmm(inputs_list[i], self.weights[i]) ) 117 | if global_W is not None: 118 | for i in range(self.ntype): 119 | h[i] = ( torch.spmm(h[i], global_W) ) 120 | outputs = [] 121 | for t1 in range(self.ntype): 122 | x_t1 = [] 123 | for t2 in range(self.ntype): 124 | # adj has no non-zeros 125 | if len(adj_list[t1][t2]._values()) == 0: 126 | x_t1.append( torch.zeros(adj_list[t1][t2].shape[0], self.out_features, device=self.bias.device) ) 127 | continue 128 | 129 | if self.bias is not None: 130 | x_t1.append( self.att_list[t1](h[t1], h[t2], adj_list[t1][t2]) + self.bias ) 131 | else: 132 | x_t1.append( self.att_list[t1](h[t1], h[t2], adj_list[t1][t2]) ) 133 | outputs.append(x_t1) 134 | 135 | return outputs 136 | 137 | 138 | 139 | 140 | 141 | class Attention_NodeLevel(nn.Module): 142 | def __init__(self, dim_features, gamma = 0.1): 143 | super(Attention_NodeLevel, self).__init__() 144 | 145 | self.dim_features = dim_features 146 | 147 | self.a1 = nn.Parameter(torch.zeros(size=(dim_features, 1))) 148 | self.a2 = nn.Parameter(torch.zeros(size=(dim_features, 1))) 149 | nn.init.xavier_normal_(self.a1.data, gain=1.414) 150 | nn.init.xavier_normal_(self.a2.data, gain=1.414) 151 | 152 | self.leakyrelu = nn.LeakyReLU(0.2) 153 | self.gamma = gamma 154 | 155 | def forward(self, input1, input2, adj): 156 | h = input1 157 | g = input2 158 | N = h.size()[0] 159 | M = g.size()[0] 160 | 161 | e1 = torch.matmul(h, self.a1).repeat(1, M) 162 | e2 = torch.matmul(g, self.a2).repeat(1, N).t() 163 | e = e1 + e2 164 | e = self.leakyrelu(e) 165 | 166 | zero_vec = -9e15*torch.ones_like(e) 167 | if 'sparse' in adj.type(): 168 | adj_dense = adj.to_dense() 169 | attention = torch.where(adj_dense > 0, e, zero_vec) 170 | attention = F.softmax(attention, dim=1) 171 | attention = torch.mul(attention, adj_dense.sum(1).repeat(M, 1).t()) 172 | attention = torch.add(attention * self.gamma, adj_dense * (1 - self.gamma)) 173 | del(adj_dense) 174 | else: 175 | attention = torch.where(adj > 0, 
e, zero_vec) 176 | attention = F.softmax(attention, dim=1) 177 | attention = torch.mul(attention, adj.sum(1).repeat(M, 1).t()) 178 | attention = torch.add(attention * self.gamma, adj.to_dense() * (1 - self.gamma)) 179 | del(zero_vec) 180 | 181 | h_prime = torch.matmul(attention, g) 182 | 183 | return h_prime 184 | 185 | 186 | -------------------------------------------------------------------------------- /model/code/models.py: -------------------------------------------------------------------------------- 1 | import math 2 | import torch 3 | import torch.nn as nn 4 | import torch.nn.functional as F 5 | from layers import * 6 | from torch.nn.parameter import Parameter 7 | from functools import reduce 8 | from utils import dense_tensor_to_sparse 9 | 10 | 11 | class HGAT(nn.Module): 12 | def __init__(self, nfeat_list, nhid, nclass, dropout, 13 | type_attention=True, node_attention=True, 14 | gamma=0.1, sigmoid=False, orphan=True, 15 | write_emb=True 16 | ): 17 | super(HGAT, self).__init__() 18 | self.sigmoid = sigmoid 19 | self.type_attention = type_attention 20 | self.node_attention = node_attention 21 | 22 | self.write_emb = write_emb 23 | if self.write_emb: 24 | self.emb = None 25 | self.emb2 = None 26 | 27 | self.nonlinear = F.relu_ 28 | 29 | self.nclass = nclass 30 | self.ntype = len(nfeat_list) 31 | 32 | dim_1st = nhid 33 | dim_2nd = nclass 34 | if orphan: 35 | dim_2nd += self.ntype - 1 36 | 37 | self.gc2 = nn.ModuleList() 38 | if not self.node_attention: 39 | self.gc1 = nn.ModuleList() 40 | for t in range(self.ntype): 41 | self.gc1.append( GraphConvolution(nfeat_list[t], dim_1st, bias=False) ) 42 | self.bias1 = Parameter( torch.FloatTensor(dim_1st) ) 43 | stdv = 1. / math.sqrt(dim_1st) 44 | self.bias1.data.uniform_(-stdv, stdv) 45 | else: 46 | self.gc1 = GraphAttentionConvolution(nfeat_list, dim_1st, gamma=gamma) 47 | self.gc2.append( GraphConvolution(dim_1st, dim_2nd, bias=True) ) 48 | 49 | if self.type_attention: 50 | self.at1 = nn.ModuleList() 51 | self.at2 = nn.ModuleList() 52 | for t in range(self.ntype): 53 | self.at1.append( SelfAttention(dim_1st, t, 50) ) 54 | self.at2.append( SelfAttention(dim_2nd, t, 50) ) 55 | 56 | self.dropout = dropout 57 | 58 | def forward(self, x_list, adj_list, adj_all = None): 59 | x0 = x_list 60 | 61 | if not self.node_attention: 62 | x1 = [None for _ in range(self.ntype)] 63 | # First Layer 64 | for t1 in range(self.ntype): 65 | x_t1 = [] 66 | for t2 in range(self.ntype): 67 | idx = t2 68 | x_t1.append( self.gc1[idx](x0[t2], adj_list[t1][t2]) + self.bias1 ) 69 | if self.type_attention: 70 | x_t1, weights = self.at1[t1]( torch.stack(x_t1, dim=1) ) 71 | else: 72 | x_t1 = reduce(torch.add, x_t1) 73 | 74 | x_t1 = self.nonlinear(x_t1) 75 | x_t1 = F.dropout(x_t1, self.dropout, training=self.training) 76 | x1[t1] = x_t1 77 | else: 78 | x1 = [None for _ in range(self.ntype)] 79 | x1_in = self.gc1(x0, adj_list) 80 | for t1 in range(len(x1_in)): 81 | x_t1 = x1_in[t1] 82 | if self.type_attention: 83 | x_t1, weights = self.at1[t1]( torch.stack(x_t1, dim=1) ) 84 | else: 85 | x_t1 = reduce(torch.add, x_t1) 86 | x_t1 = self.nonlinear(x_t1) 87 | x_t1 = F.dropout(x_t1, self.dropout, training=self.training) 88 | x1[t1] = x_t1 89 | if self.write_emb: 90 | self.emb = x1[0] 91 | 92 | x2 = [None for _ in range(self.ntype)] 93 | # Second Layer 94 | for t1 in range(self.ntype): 95 | x_t1 = [] 96 | for t2 in range(self.ntype): 97 | if adj_list[t1][t2] is None: 98 | continue 99 | idx = 0 100 | x_t1.append( self.gc2[idx](x1[t2], adj_list[t1][t2]) ) 101 | if 
self.type_attention: 102 | x_t1, weights = self.at2[t1]( torch.stack(x_t1, dim=1) ) 103 | else: 104 | x_t1 = reduce(torch.add, x_t1) 105 | 106 | x2[t1] = x_t1 107 | if self.write_emb and t1 == 0: 108 | self.emb2 = x2[t1] 109 | 110 | # output layer 111 | if self.sigmoid: 112 | x2[t1] = torch.sigmoid(x_t1) 113 | else: 114 | x2[t1] = F.log_softmax(x_t1, dim=1) 115 | 116 | return x2 117 | 118 | -------------------------------------------------------------------------------- /model/code/print_log.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | 4 | 5 | class Logger(object): 6 | def __init__(self, filename="Default.log", remove=True): 7 | self.terminal = sys.stdout 8 | if remove and os.path.exists(filename): 9 | os.remove(filename) 10 | self.log = open(filename, "a") 11 | 12 | def write(self, message): 13 | self.terminal.write(message) 14 | self.log.write(message) 15 | 16 | def flush(self): 17 | pass 18 | 19 | def change_file(self, filename="Default.log"): 20 | self.log.close() 21 | self.log = open(filename, "a") 22 | 23 | 24 | if __name__ == '__main__': 25 | sys.stdout = Logger("yourlogfilename.txt") 26 | print('content.') -------------------------------------------------------------------------------- /model/code/train.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | from __future__ import print_function 3 | 4 | import time 5 | import argparse 6 | import numpy as np 7 | import pickle as pkl 8 | from copy import deepcopy 9 | from random import shuffle 10 | 11 | import torch 12 | import torch.nn as nn 13 | import torch.nn.functional as F 14 | import torch.optim as optim 15 | from sklearn.metrics import precision_recall_fscore_support 16 | from sklearn.metrics import accuracy_score 17 | from sklearn.metrics import classification_report 18 | 19 | from utils import load_data, accuracy, dense_tensor_to_sparse, resample, makedirs 20 | from utils_inductive import transform_dataset_by_idx 21 | from models import HGAT 22 | import os, gc, sys 23 | from print_log import Logger 24 | 25 | logdir = "log/" 26 | savedir = 'model/' 27 | embdir = 'embeddings/' 28 | makedirs([logdir, savedir, embdir]) 29 | 30 | os.environ["CUDA_VISIBLE_DEVICES"] = "2" 31 | 32 | write_embeddings = True 33 | HOP = 2 34 | 35 | dataset = 'example' 36 | 37 | LR = 0.01 if dataset == 'snippets' else 0.005 38 | DP = 0.95 if dataset in ['agnews', 'tagmynews'] else 0.8 39 | WD = 0 if dataset == 'snippets' else 5e-6 40 | LR = 0.05 if 'multi' in dataset else LR 41 | DP = 0.5 if 'multi' in dataset else DP 42 | WD = 0 if 'multi' in dataset else WD 43 | 44 | # Training settings 45 | parser = argparse.ArgumentParser() 46 | parser.add_argument('--no_cuda', action='store_true', default=False, 47 | help='Disables CUDA training.') 48 | parser.add_argument('--fastmode', action='store_true', default=False, 49 | help='Validate during training pass.') 50 | parser.add_argument('--seed', type=int, default=42, help='Random seed.') 51 | parser.add_argument('--epochs', type=int, default=300, 52 | help='Number of epochs to train.') 53 | parser.add_argument('--lr', type=float, default=LR, 54 | help='Initial learning rate.') 55 | parser.add_argument('--weight_decay', type=float, default=WD, 56 | help='Weight decay (L2 loss on parameters).') 57 | parser.add_argument('--hidden', type=int, default=512, 58 | help='Number of hidden units.') 59 | parser.add_argument('--dropout', type=float, default=DP, 60 | help='Dropout 
rate (1 - keep probability).') 61 | parser.add_argument('--inductive', type=bool, default=False, 62 | help='Whether use the transductive mode or inductive mode. ') 63 | parser.add_argument('--dataset', type=str, default=dataset, 64 | help='Dataset') 65 | parser.add_argument('--repeat', type=int, default=1, 66 | help='Number of repeated trials') 67 | parser.add_argument('--node', action='store_false', default=True, 68 | help='Use node-level attention or not. ') 69 | parser.add_argument('--type', action='store_false', default=True, 70 | help='Use type-level attention or not. ') 71 | 72 | args = parser.parse_args() 73 | dataset = args.dataset 74 | 75 | args.cuda = not args.no_cuda and torch.cuda.is_available() 76 | sys.stdout = Logger(logdir + "{}.log".format(dataset)) 77 | 78 | np.random.seed(args.seed) 79 | torch.manual_seed(args.seed) 80 | if args.cuda: 81 | torch.cuda.manual_seed(args.seed) 82 | 83 | loss_list = dict() 84 | 85 | def margin_loss(preds, y, weighted_sample=False): 86 | nclass = y.shape[1] 87 | preds = preds[:, :nclass] 88 | y = y.float() 89 | lam = 0.25 90 | m = nn.Threshold(0., 0.) 91 | loss = y * m(0.9 - preds) ** 2 + \ 92 | lam * (1.0 - y) * (m(preds - 0.1) ** 2) 93 | 94 | if weighted_sample: 95 | n, N = y.sum(dim=0, keepdim=True), y.shape[0] 96 | weight = torch.where(y == 1, n, torch.zeros_like(loss)) 97 | weight = torch.where(y != 1, N-n, weight) 98 | weight = N / weight / 2 99 | loss = torch.mul(loss, weight) 100 | 101 | loss = torch.mean(torch.sum(loss, dim=1)) 102 | return loss 103 | 104 | def nll_loss(preds, y): 105 | y = y.max(1)[1].type_as(labels) 106 | return F.nll_loss(preds, y) 107 | 108 | def evaluate(preds_list, y_list): 109 | nclass = y_list.shape[1] 110 | preds_list = preds_list[:, :nclass] 111 | if not preds_list.device == 'cpu': 112 | preds_list, y_list = preds_list.cpu(), y_list.cpu() 113 | 114 | threshold = 0.5 115 | multi_label = 'multi' in dataset 116 | if multi_label: 117 | y_list = y_list.numpy() 118 | preds_probs = preds_list.detach().numpy() 119 | preds = deepcopy(preds_probs) 120 | preds[np.arange(preds.shape[0]), preds.argmax(1)] = 1.0 121 | preds[np.where(preds >= threshold)] = 1.0 122 | preds[np.where(preds < threshold)] = 0.0 123 | [precision, recall, F1, support] = \ 124 | precision_recall_fscore_support(y_list[preds.sum(axis=1) != 0], preds[preds.sum(axis=1) != 0], 125 | average='micro') 126 | [precision_ma, recall_ma, F1_ma, support] = \ 127 | precision_recall_fscore_support(y_list[preds.sum(axis=1) != 0], preds[preds.sum(axis=1) != 0], 128 | average='macro') 129 | ER = accuracy_score(y_list, preds) * 100 130 | 131 | report = classification_report(y_list, preds, digits=5) 132 | 133 | print(' ER: %6.2f' % ER, 134 | 'mi-ma: P: %5.1f %5.1f' % (precision*100, precision_ma*100), 135 | 'R: %5.1f %5.1f' % (recall*100, recall_ma*100), 136 | 'F1: %5.1f %5.1f' % (F1*100, F1_ma*100), 137 | end="") 138 | return ER, report 139 | else: 140 | y_list = y_list.numpy() 141 | preds_probs = preds_list.detach().numpy() 142 | preds = deepcopy(preds_probs) 143 | preds[np.arange(preds.shape[0]), preds.argmax(1)] = 1.0 144 | preds[np.where(preds < 1)] = 0.0 145 | [precision, recall, F1, support] = \ 146 | precision_recall_fscore_support(y_list, preds, average='macro') 147 | ER = accuracy_score(y_list, preds) * 100 148 | print(' Ac: %6.2f' % ER, 149 | 'P: %5.1f' % (precision*100), 150 | 'R: %5.1f' % (recall*100), 151 | 'F1: %5.1f' % (F1*100), 152 | end="") 153 | return ER, F1 154 | 155 | LOSS = margin_loss if 'multi' in dataset else nll_loss 156 | 157 | 158 | def 
train(epoch, 159 | input_adj_train, input_features_train, idx_out_train, idx_train, 160 | input_adj_val, input_features_val, idx_out_val, idx_val): 161 | print('Epoch: {:04d}'.format(epoch+1), end='') 162 | t = time.time() 163 | model.train() 164 | optimizer.zero_grad() 165 | output = model(input_features_train, input_adj_train) 166 | 167 | if isinstance(output, list): 168 | O, L = output[0][idx_out_train], labels[idx_train] 169 | else: 170 | O, L = output[idx_out_train], labels[idx_train] 171 | loss_train = LOSS(O, L) 172 | print(' | loss: {:.4f}'.format(loss_train.item()), end='') 173 | acc_train, f1_train = evaluate(O, L) 174 | loss_train.backward() 175 | optimizer.step() 176 | 177 | model.eval() 178 | output = model(input_features_val, input_adj_val) 179 | if isinstance(output, list): 180 | loss_val = LOSS(output[0][idx_out_val], labels[idx_val]) 181 | print(' | loss: {:.4f}'.format(loss_val.item()), end='') 182 | results = evaluate(output[0][idx_out_val], labels[idx_val]) 183 | else: 184 | loss_val = LOSS(output[idx_out_val], labels[idx_val]) 185 | print(' | loss: {:.4f}'.format(loss_val.item()), end='') 186 | results = evaluate(output[idx_out_val], labels[idx_val]) 187 | print(' | time: {:.4f}s'.format(time.time() - t)) 188 | loss_list[epoch] = [loss_train.item()] 189 | 190 | if 'multi' in dataset: 191 | acc_val, res_line = results 192 | return float(acc_val.item()), res_line 193 | else: 194 | acc_val, f1_val = results 195 | return float(acc_val.item()), float(f1_val.item()) 196 | 197 | 198 | def test(epoch, input_adj_test, input_features_test, idx_out_test, idx_test): 199 | print(' '*90 if 'multi' in dataset else ' '*65, end='') 200 | t = time.time() 201 | model.eval() 202 | output = model(input_features_test, input_adj_test) 203 | 204 | if isinstance(output, list): 205 | loss_test = LOSS(output[0][idx_out_test], labels[idx_test]) 206 | print(' | loss: {:.4f}'.format(loss_test.item()), end='') 207 | results = evaluate(output[0][idx_out_test], labels[idx_test]) 208 | else: 209 | loss_test = LOSS(output[idx_out_test], labels[idx_test]) 210 | print(' | loss: {:.4f}'.format(loss_test.item()), end='') 211 | results = evaluate(output[idx_out_test], labels[idx_test]) 212 | print(' | time: {:.4f}s'.format(time.time() - t)) 213 | loss_list[epoch] += [loss_test.item()] 214 | 215 | if 'multi' in dataset: 216 | acc_test, res_line = results 217 | return float(acc_test.item()), res_line 218 | else: 219 | acc_test, f1_test = results 220 | return float(acc_test.item()), float(f1_test.item()) 221 | 222 | 223 | 224 | 225 | path = '../data/'+ dataset +'/' 226 | adj, features, labels, idx_train_ori, idx_val_ori, idx_test_ori, idx_map = load_data(path = path, dataset = dataset) 227 | N = len(adj) 228 | 229 | # Transductive的数据集变换 230 | if args.inductive: 231 | print("Transfer to be inductive.") 232 | 233 | # resample 234 | # 之前的数据集划分: 训练集20 * class 验证集1000 其他的测试集 235 | # 这里换成: 训练集不变, 验证集 -> 测试集, 236 | # 原本的测试集拆出和训练集一样多的样本作为验证集,其余作为无标注样本 237 | idx_train,idx_unlabeled,idx_val,idx_test = resample(idx_train_ori,idx_val_ori,idx_test_ori,path,idx_map) 238 | 239 | # if experimentType == 'unlabeled': 240 | # bias = int(idx_unlabeled.shape[0] * supPara) 241 | # idx_unlabeled = idx_unlabeled[: bias] 242 | # print("\n\tidx_train: {}, idx_unlabeled: {},\n\tidx_val: {}, idx_test: {}".format( 243 | # idx_train.shape[0], idx_unlabeled.shape[0], idx_val.shape[0], idx_test.shape[0])) 244 | 245 | input_adj_train, input_features_train, idx_related_train, idx_out_train = \ 246 | 
transform_dataset_by_idx(adj,features,torch.cat([idx_train, idx_unlabeled]),idx_train, hop=HOP) 247 | input_adj_val, input_features_val, idx_related_val, idx_out_val = \ 248 | transform_dataset_by_idx(adj,features,idx_val,idx_val, hop=HOP) 249 | input_adj_test, input_features_test, idx_related_test, idx_out_test = \ 250 | transform_dataset_by_idx(adj,features,idx_test,idx_test, hop=HOP) 251 | 252 | all_node_count = sum([_.shape[0] for _ in adj[0]]) 253 | all_input_idx, all_related_idx = set(), set() 254 | for input_idx, related_idx in [ [torch.cat([idx_train, idx_unlabeled]), idx_related_train], 255 | [idx_val, idx_related_val], 256 | [idx_test, idx_related_test] ]: 257 | print("# input_nodes: {}, # related_nodes: {} / {}".format( 258 | len(input_idx), len(related_idx), all_node_count)) 259 | all_input_idx.update(input_idx.numpy().tolist()) 260 | all_related_idx.update(related_idx.numpy().tolist()) 261 | print("Sum: # input_nodes: {}, # related_nodes: {} / {}\n".format( 262 | len(all_input_idx), len(all_related_idx), all_node_count)) 263 | else: 264 | print("Transfer to be transductive.") 265 | input_adj_train, input_features_train, idx_out_train = adj, features, idx_train_ori 266 | input_adj_val, input_features_val, idx_out_val = adj, features, idx_val_ori 267 | input_adj_test, input_features_test, idx_out_test = adj, features, idx_test_ori 268 | idx_train, idx_val, idx_test = idx_train_ori, idx_val_ori, idx_test_ori 269 | 270 | 271 | if args.cuda: 272 | N = len(features) 273 | for i in range(N): 274 | if input_features_train[i] is not None: 275 | input_features_train[i] = input_features_train[i].cuda() 276 | if input_features_val[i] is not None: 277 | input_features_val[i] = input_features_val[i].cuda() 278 | if input_features_test[i] is not None: 279 | input_features_test[i] = input_features_test[i].cuda() 280 | for i in range(N): 281 | for j in range(N): 282 | if input_adj_train[i][j] is not None: 283 | input_adj_train[i][j] = input_adj_train[i][j].cuda() 284 | if input_adj_val[i][j] is not None: 285 | input_adj_val[i][j] = input_adj_val[i][j].cuda() 286 | if input_adj_test[i][j] is not None: 287 | input_adj_test[i][j] = input_adj_test[i][j].cuda() 288 | labels = labels.cuda() 289 | idx_train, idx_out_train = idx_train.cuda(), idx_out_train.cuda() 290 | idx_val, idx_out_val = idx_val.cuda(), idx_out_val.cuda() 291 | idx_test, idx_out_test = idx_test.cuda(), idx_out_test.cuda() 292 | 293 | 294 | 295 | FINAL_RESULT = [] 296 | for i in range(args.repeat): 297 | # Model and optimizer 298 | print("\n\nNo. 
{} test.\n".format(i+1)) 299 | model = HGAT(nfeat_list=[i.shape[1] for i in features], 300 | type_attention=args.type, 301 | node_attention=args.node, 302 | nhid=args.hidden, 303 | nclass=labels.shape[1], 304 | dropout=args.dropout, 305 | gamma=0.1, 306 | orphan=True, 307 | ) 308 | 309 | 310 | # print(model) 311 | print(len(list(model.parameters()))) 312 | optimizer = optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.weight_decay) 313 | 314 | if args.cuda: 315 | model.cuda() 316 | 317 | print(len(list(model.parameters()))) 318 | print([i.size() for i in model.parameters()]) 319 | # Train model 320 | t_total = time.time() 321 | vali_max = [0, [0, 0], -1] 322 | 323 | 324 | for epoch in range(args.epochs): 325 | vali_acc, vali_f1 = train(epoch, 326 | input_adj_train, input_features_train, idx_out_train, idx_train, 327 | input_adj_val, input_features_val, idx_out_val, idx_val) 328 | test_acc, test_f1 = test(epoch, 329 | input_adj_test, input_features_test, idx_out_test, idx_test) 330 | if vali_acc > vali_max[0]: 331 | vali_max = [vali_acc, (test_acc, test_f1), epoch+1] 332 | with open(savedir + "{}.pkl".format(dataset), 'wb') as f: 333 | pkl.dump(model, f) 334 | 335 | if write_embeddings: 336 | makedirs([embdir]) 337 | with open(embdir + "{}.emb".format(dataset), 'w') as f: 338 | for i in model.emb.tolist(): 339 | f.write("{}\n".format(i)) 340 | with open(embdir + "{}.emb2".format(dataset), 'w') as f: 341 | for i in model.emb2.tolist(): 342 | f.write("{}\n".format(i)) 343 | 344 | print("Optimization Finished!") 345 | print("Total time elapsed: {:.4f}s".format(time.time() - t_total)) 346 | if 'multi' in dataset: 347 | print("The best result is ACC: {0:.4f}, where epoch is {2}\n{1}\n".format( 348 | vali_max[1][0], 349 | vali_max[1][1], 350 | vali_max[2])) 351 | else: 352 | print("The best result is: ACC: {0:.4f} F1: {1:.4f}, where epoch is {2}\n\n".format( 353 | vali_max[1][0], 354 | vali_max[1][1], 355 | vali_max[2])) 356 | FINAL_RESULT.append(list(vali_max)) 357 | 358 | print("\n") 359 | for i in range(len(FINAL_RESULT)): 360 | if 'multi' in dataset: 361 | print("{0}:\tvali: {1:.5f}\ttest: ACC: {2:.4f}, epoch={4}.\n{3}".format( 362 | i, 363 | FINAL_RESULT[i][0], 364 | FINAL_RESULT[i][1][0], 365 | FINAL_RESULT[i][1][1], 366 | FINAL_RESULT[i][2])) 367 | else: 368 | print("{}:\tvali: {:.5f}\ttest: ACC: {:.4f} F1: {:.4f}, epoch={}".format( 369 | i, 370 | FINAL_RESULT[i][0], 371 | FINAL_RESULT[i][1][0], 372 | FINAL_RESULT[i][1][1], 373 | FINAL_RESULT[i][2])) -------------------------------------------------------------------------------- /model/code/utils.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import scipy.sparse as sp 3 | from random import shuffle 4 | import torch 5 | from tqdm import tqdm 6 | import os 7 | 8 | 9 | def load_data(path="../data/citeseer/", dataset="citeseer"): 10 | print('Loading {} dataset...'.format(dataset)) 11 | features_block = False # concatenate the feature spaces or not 12 | 13 | MULTI_LABEL = 'multi' in dataset 14 | 15 | type_list = ['text', 'topic', 'entity'] 16 | type_have_label = 'text' 17 | 18 | features_list = [] 19 | idx_map_list = [] 20 | idx2type = {t: set() for t in type_list} 21 | 22 | for type_name in type_list: 23 | print('Loading {} content...'.format(type_name)) 24 | print(path) 25 | print(dataset) 26 | print(type_name) 27 | indexes, features, labels = [], [], [] 28 | with open("{}{}.content.{}".format(path, dataset, type_name)) as f: 29 | for line in tqdm(f): 30 | cache = 
line.strip().split('\t') 31 | indexes.append(np.array(cache[0], dtype=int)) 32 | features.append(np.array(cache[1:-1], dtype=np.float32)) 33 | labels.append(np.array([cache[-1]], dtype=str) ) 34 | features = np.stack(features) 35 | features = normalize(features) 36 | if not features_block: 37 | features = torch.FloatTensor(np.array(features)) 38 | features = dense_tensor_to_sparse(features) 39 | 40 | features_list.append(features) 41 | 42 | if type_name == type_have_label: 43 | labels = np.stack(labels) 44 | if not MULTI_LABEL: 45 | labels = encode_onehot(labels) 46 | else: 47 | labels = multi_label(labels) 48 | Labels = torch.LongTensor(labels) 49 | print("label matrix shape: {}".format(Labels.shape)) 50 | 51 | idx = np.stack(indexes) 52 | for i in idx: 53 | idx2type[type_name].add(i) 54 | idx_map = {j: i for i, j in enumerate(idx)} 55 | idx_map_list.append(idx_map) 56 | print('done.') 57 | 58 | len_list = [len(idx2type[t]) for t in type_list] 59 | type2len = {t: len(idx2type[t]) for t in type_list} 60 | len_all = sum(len_list) 61 | if features_block: 62 | flen = [i.shape[1] for i in features_list] 63 | features = sp.lil_matrix(np.zeros((len_all, sum(flen))), dtype=np.float32) 64 | bias = 0 65 | for i_l in range(len(len_list)): 66 | features[bias:bias+len_list[i_l], :flen[i_l]] = features_list[i_l] 67 | features_list[i_l] = features[bias:bias+len_list[i_l], :] 68 | bias += len_list[i_l] 69 | for fi in range(len(features_list)): 70 | features_list[fi] = torch.FloatTensor(np.array(features_list[fi].todense())) 71 | features_list[fi] = dense_tensor_to_sparse(features_list[fi]) 72 | 73 | print('Building graph...') 74 | adj_list = [[None for _ in range(len(type_list))] for __ in range(len(type_list))] 75 | # build graph 76 | edges_unordered = np.genfromtxt("{}{}.cites".format(path, dataset), 77 | dtype=np.int32) 78 | 79 | adj_all = sp.lil_matrix(np.zeros((len_all, len_all)), dtype=np.float32) 80 | 81 | for i1 in range(len(type_list)): 82 | for i2 in range(len(type_list)): 83 | t1, t2 = type_list[i1], type_list[i2] 84 | if i1 == i2: 85 | edges = [] 86 | for edge in edges_unordered: 87 | if (edge[0] in idx2type[t1] and edge[1] in idx2type[t2]): 88 | edges.append([idx_map_list[i1].get(edge[0]), idx_map_list[i2].get(edge[1])]) 89 | edges = np.array(edges) 90 | if len(edges) > 0: 91 | adj = sp.coo_matrix((np.ones(edges.shape[0]), (edges[:, 0], edges[:, 1])), 92 | shape=(type2len[t1], type2len[t2]), dtype=np.float32) 93 | else: 94 | adj = sp.coo_matrix((type2len[t1], type2len[t2]), dtype=np.float32) 95 | adj_all[sum(len_list[:i1]): sum(len_list[:i1 + 1]), 96 | sum(len_list[:i2]): sum(len_list[:i2 + 1])] = adj.tolil() 97 | 98 | elif i1 < i2: 99 | edges = [] 100 | for edge in edges_unordered: 101 | if (edge[0] in idx2type[t1] and edge[1] in idx2type[t2]): 102 | edges.append([idx_map_list[i1].get(edge[0]), idx_map_list[i2].get(edge[1])]) 103 | elif (edge[1] in idx2type[t1] and edge[0] in idx2type[t2]): 104 | edges.append([idx_map_list[i1].get(edge[1]), idx_map_list[i2].get(edge[0])]) 105 | edges = np.array(edges) 106 | if len(edges) > 0: 107 | adj = sp.coo_matrix((np.ones(edges.shape[0]), (edges[:, 0], edges[:, 1])), 108 | shape=(type2len[t1], type2len[t2]), dtype=np.float32) 109 | else: 110 | adj = sp.coo_matrix((type2len[t1], type2len[t2]), dtype=np.float32) 111 | 112 | adj_all[ 113 | sum(len_list[:i1]): sum(len_list[:i1 + 1]), 114 | sum(len_list[:i2]): sum(len_list[:i2 + 1])] = adj.tolil() 115 | adj_all[ 116 | sum(len_list[:i2]): sum(len_list[:i2 + 1]), 117 | sum(len_list[:i1]): sum(len_list[:i1 + 
1])] = adj.T.tolil() 118 | 119 | adj_all = adj_all + adj_all.T.multiply(adj_all.T > adj_all) - adj_all.multiply(adj_all.T > adj_all) 120 | adj_all = normalize_adj(adj_all + sp.eye(adj_all.shape[0])) 121 | 122 | for i1 in range(len(type_list)): 123 | for i2 in range(len(type_list)): 124 | adj_list[i1][i2] = sparse_mx_to_torch_sparse_tensor( 125 | adj_all[sum(len_list[:i1]): sum(len_list[:i1 + 1]), 126 | sum(len_list[:i2]): sum(len_list[:i2 + 1])] 127 | ) 128 | 129 | print("Num of edges: {}".format(len( adj_all.nonzero()[0] ))) 130 | idx_train, idx_val, idx_test = load_divide_idx(path, idx_map_list[0]) 131 | return adj_list, features_list, Labels, idx_train, idx_val, idx_test, idx_map_list[0] 132 | 133 | 134 | def multi_label(labels): 135 | def myfunction(x): 136 | return list(map(int, x[0].split())) 137 | return np.apply_along_axis(myfunction, axis=1, arr=labels) 138 | 139 | 140 | def encode_onehot(labels): 141 | classes = set(labels.T[0]) 142 | classes_dict = {c: np.identity(len(classes))[i, :] for i, c in 143 | enumerate(classes)} 144 | labels_onehot = np.array(list(map(classes_dict.get, labels.T[0])), 145 | dtype=np.int32) 146 | return labels_onehot 147 | 148 | def load_divide_idx(path, idx_map): 149 | idx_train = [] 150 | idx_val = [] 151 | idx_test = [] 152 | with open(path+'train.map', 'r') as f: 153 | for line in f: 154 | idx_train.append( idx_map.get(int(line.strip('\n'))) ) 155 | with open(path+'vali.map', 'r') as f: 156 | for line in f: 157 | idx_val.append( idx_map.get(int(line.strip('\n'))) ) 158 | with open(path+'test.map', 'r') as f: 159 | for line in f: 160 | idx_test.append( idx_map.get(int(line.strip('\n'))) ) 161 | 162 | print("train, vali, test: ", len(idx_train), len(idx_val), len(idx_test)) 163 | idx_train = torch.LongTensor(idx_train) 164 | idx_val = torch.LongTensor(idx_val) 165 | idx_test = torch.LongTensor(idx_test) 166 | return idx_train, idx_val, idx_test 167 | 168 | 169 | def resample(train, val, test : torch.LongTensor, path, idx_map, rewrite=True): 170 | if os.path.exists(path+'train_inductive.map'): 171 | rewrite = False 172 | filenames = ['train', 'unlabeled', 'vali', 'test'] 173 | ans = [] 174 | for file in filenames: 175 | with open(path+file+'_inductive.map', 'r') as f: 176 | cache = [] 177 | for line in f: 178 | cache.append(idx_map.get(int(line))) 179 | ans.append(torch.LongTensor(cache)) 180 | return ans 181 | 182 | idx_train = train 183 | idx_test = val 184 | cache = list(test.numpy()) 185 | shuffle(cache) 186 | idx_val = cache[: idx_train.shape[0]] 187 | idx_unlabeled = cache[idx_train.shape[0]: ] 188 | idx_val = torch.LongTensor(idx_val) 189 | idx_unlabeled = torch.LongTensor(idx_unlabeled) 190 | 191 | print("\n\ttrain: ", idx_train.shape[0], 192 | "\n\tunlabeled: ", idx_unlabeled.shape[0], 193 | "\n\tvali: ", idx_val.shape[0], 194 | "\n\ttest: ", idx_test.shape[0]) 195 | if rewrite: 196 | idx_map_reverse = dict(map(lambda t: (t[1], t[0]), idx_map.items())) 197 | filenames = ['train', 'unlabeled', 'vali', 'test'] 198 | ans = [idx_train, idx_unlabeled, idx_val, idx_test] 199 | for i in range(4): 200 | with open(path+filenames[i]+'_inductive.map', 'w') as f: 201 | f.write("\n".join(map(str, map(idx_map_reverse.get, ans[i].numpy())))) 202 | 203 | return idx_train, idx_unlabeled, idx_val, idx_test 204 | 205 | 206 | def normalize(mx): 207 | rowsum = np.array(mx.sum(1)) 208 | r_inv = np.power(rowsum, -1).flatten() 209 | r_inv[np.isinf(r_inv)] = 0. 
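# rows that sum to zero keep an inverse of 0 (avoids inf); the diagonal matrix of inverse row sums built next row-normalizes the feature matrix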
210 | r_mat_inv = sp.diags(r_inv) 211 | mx = r_mat_inv.dot(mx) 212 | return mx 213 | 214 | 215 | def normalize_adj(mx): 216 | rowsum = np.array(mx.sum(1)) 217 | r_inv_sqrt = np.power(rowsum, -0.5).flatten() 218 | r_inv_sqrt[np.isinf(r_inv_sqrt)] = 0. 219 | r_mat_inv_sqrt = sp.diags(r_inv_sqrt) 220 | return mx.dot(r_mat_inv_sqrt).transpose().dot(r_mat_inv_sqrt) 221 | 222 | 223 | def accuracy(output, labels): 224 | preds = output.max(1)[1].type_as(labels) 225 | correct = preds.eq(labels).double() 226 | correct = correct.sum() 227 | return correct / len(labels) 228 | 229 | 230 | def sparse_mx_to_torch_sparse_tensor(sparse_mx): 231 | """Convert a scipy sparse matrix to a torch sparse tensor.""" 232 | if len(sparse_mx.nonzero()[0]) == 0: 233 | # 空矩阵 234 | r, c = sparse_mx.shape 235 | return torch.sparse.FloatTensor(r, c) 236 | sparse_mx = sparse_mx.tocoo().astype(np.float32) 237 | indices = torch.from_numpy( 238 | np.vstack((sparse_mx.row, sparse_mx.col)).astype(np.int64)) 239 | values = torch.from_numpy(sparse_mx.data) 240 | shape = torch.Size(sparse_mx.shape) 241 | return torch.sparse.FloatTensor(indices, values, shape) 242 | 243 | 244 | def dense_tensor_to_sparse(dense_mx): 245 | return sparse_mx_to_torch_sparse_tensor( sp.coo.coo_matrix(dense_mx) ) 246 | 247 | 248 | def makedirs(dirs: list): 249 | for d in dirs: 250 | if not os.path.exists(d): 251 | os.makedirs(d) 252 | return -------------------------------------------------------------------------------- /model/code/utils_inductive.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import networkx 3 | import torch 4 | import os 5 | from tqdm import tqdm 6 | import pickle as pkl 7 | from utils import dense_tensor_to_sparse 8 | 9 | 10 | def get_related_nodes(adj_list, idx, k): 11 | def bfs_1_deep(adj, candidate): 12 | candidate = torch.LongTensor(list(candidate)) 13 | ans = torch.sum(adj[candidate], dim=0) 14 | ans = ans.nonzero().squeeze() 15 | return set(ans.numpy()) 16 | 17 | def bfs_k_deep(adj, candidate, k): 18 | candidate = torch.LongTensor(list(candidate)) 19 | ans = set(candidate.numpy()) 20 | next_candidate = candidate 21 | for i in range(k): 22 | next_candidate = bfs_1_deep(adj, next_candidate) 23 | next_candidate = next_candidate - ans 24 | ans.update(next_candidate) 25 | return ans 26 | 27 | adj = combine_adj_list(adj_list) 28 | involved_nodes = bfs_k_deep(adj, list(idx), k=k) 29 | ans = torch.LongTensor(list(involved_nodes), device=adj.device) 30 | return ans.sort()[0] 31 | 32 | 33 | def combine_adj_list(adj_list): 34 | ans = [] 35 | for i in adj_list: 36 | cache = [] 37 | for j in i: 38 | cache.append(j.to_dense()) 39 | ans.append(torch.cat(cache, dim=1)) 40 | return torch.cat(ans, dim=0) 41 | 42 | 43 | def transform_idx_for_adjList(adj_list, idx): 44 | N = len(adj_list) 45 | bias = 0 46 | ans = [] 47 | for adj in adj_list[0]: 48 | shape = adj.shape 49 | idx_r = idx[(idx >= bias) & (idx < bias + shape[1])] - bias 50 | # idx_c = idx[(idx>=bias_c) & (idx entity_set( entity_name ), edge_list( (doc_idx, entity_name) ) 86 | 87 | # path = "/home/ytc/GCN/整理的/HGAT/data/entity_recog/test.txt" 88 | rootpath = "../../data/entity_recog/" 89 | with open(rootpath + "test.txt", 'w') as f: 90 | f.write('\n'.join(sentences)) 91 | origin_path = os.getcwd() 92 | os.chdir(rootpath) 93 | os.system("python ER.py --test=test.txt --output=output_file.txt") 94 | os.chdir(origin_path) 95 | 96 | edges = [] 97 | entities = set() 98 | with open(rootpath + "output_file.txt", 'r') as f: 99 | lines 
= f.readlines() 100 | for i in range(0, len(lines), 3): 101 | l1, l2, l3 = lines[i:i + 3] 102 | entity_list = l2.strip('\n').split('\t') 103 | entities.update(set(entity_list)) 104 | edges.extend([("test_{}".format(i//3), e) for e in entity_list]) 105 | 106 | return edges 107 | 108 | 109 | def get_topic(sentences, datapath): # expects pre-tokenized, space-separated sentences 110 | # sent_list(content) => topic_set( topic_name: "topic_idx" ), edge_list( (doc_idx, topic_name) ) 111 | TopK_for_Topics = 2 112 | 113 | def naive_arg_topK(matrix, K, axis=0): 114 | """ 115 | perform topK based on np.argsort 116 | :param matrix: to be sorted 117 | :param K: select and sort the top K items 118 | :param axis: dimension to be sorted. 119 | :return: 120 | """ 121 | full_sort = np.argsort(-matrix, axis=axis) 122 | return full_sort.take(np.arange(K), axis=axis) 123 | 124 | with open(datapath + "vectorizer_model.pkl", 'rb') as f: 125 | vectorizer = pkl.load(f) 126 | with open(datapath + "lda_model.pkl", 'rb') as f: 127 | lda = pkl.load(f) 128 | 129 | X = vectorizer.transform(sentences) 130 | doc_topic = lda.transform(X) 131 | topK_topics = naive_arg_topK(doc_topic, TopK_for_Topics, axis=1) 132 | 133 | topics = [] 134 | for i in range(doc_topic.shape[1]): 135 | topicName = 'topic_' + str(i) 136 | topics.append(topicName) 137 | edges = [] 138 | for i in range(topK_topics.shape[0]): 139 | for j in range(TopK_for_Topics): 140 | edges.append(("test_{}".format(i), topics[topK_topics[i, j]])) 141 | 142 | return edges 143 | 144 | 145 | def build_subnetwork(DATASET, entity_edges, topic_edges): # called with the dataset name; resolve the graph path from it 146 | graph_path = '../../data/{}/model_network_sampled.pkl'.format(DATASET) 147 | with open(graph_path, 'rb') as f: 148 | g = pkl.load(f) 149 | # g.add_edges_from(entity_edges) 150 | ''' 151 | The goal here is to add the newly arrived texts to the original graph, so that the later intersection step does not drop them. 152 | Entity edges are not added because they may introduce new entities; such new entities have no corresponding features (nor parameter matrices) and should be removed. 153 | Topic edges can be added because no new topics appear: all topics already exist, so no new topic nodes are introduced. 154 | ''' 155 | g.add_edges_from(topic_edges) 156 | 157 | sub_g = networkx.Graph() 158 | sub_g.add_edges_from(entity_edges) 159 | sub_g.add_edges_from(topic_edges) 160 | return g, sub_g 161 | 162 | 163 | def judge_node_type(node): 164 | if node[:5] == 'test_': 165 | return 'text' 166 | if node[:6] == 'topic_': 167 | return 'topic' 168 | return 'entity' 169 | 170 | 171 | 172 | def release_feature(DATASET, node_list): 173 | # entity features are the most troublesome part: node names must be mapped back to their original indices via mapindex.txt 174 | mapindex = dict() 175 | with open('../data/{}/mapindex.txt'.format(DATASET), 'r') as f: 176 | for line in f: 177 | k, v = line.strip().split('\t') 178 | mapindex[k] = int(v) 179 | featuremap = dict() 180 | 181 | for node_type in ['entity', 'topic']: 182 | with open('../data/{0}/{0}.content.{1}'.format(DATASET, node_type), 'r') as f: 183 | for line in tqdm(f): 184 | cache = line.strip().split('\t') 185 | index = int(cache[0]) 186 | feature = np.array(cache[1:-1], dtype=np.float32) 187 | featuremap[index] = feature 188 | 189 | with open('../data/{0}/{0}_inductive.content.{1}'.format(DATASET, node_type), 'w') as f: 190 | for n in range(len(node_list)): 191 | if judge_node_type(node_list[n]) == node_type: 192 | f.write( 193 | str(n) + '\t' + '\t'.join(map(str, featuremap[mapindex[node_list[n]]])) 194 | + '\n') 195 | 196 | 197 | def preprocess_inductive_text(sentences, DATASET): 198 | datapath = '../../data/{}/'.format(DATASET) 199 | 200 | entity_edges = get_entity(sentences) 201 | 202 | def tokenize(sen): 203 | from nltk.tokenize import WordPunctTokenizer 204 | return WordPunctTokenizer().tokenize(sen) 205 | # return jieba.cut(sen) 206 | sentences = [[word.lower() for word in tokenize(sentence)] for sentence in sentences] 
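# tokenize with WordPunctTokenizer and lower-case every token; the token lists are re-joined into space-separated strings below before the fitted vectorizer and LDA model consume them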
207 | sentences = [' '.join(i) for i in sentences] 208 | topic_edges = get_topic(sentences, datapath) 209 | 210 | g, sub_g = build_subnetwork(DATASET, entity_edges, topic_edges) 211 | # 剔除不认识的实体 212 | sub_g.remove_nodes_from( set(sub_g.nodes()) - set(g.nodes()) ) 213 | del g 214 | g = sub_g 215 | node_list = list(g.nodes()) 216 | 217 | # build_feature 218 | release_feature(DATASET, node_list) 219 | with open(datapath + "vectorizer_model.pkl", 'rb') as f: 220 | vectorizer = pkl.load(f) 221 | with open(datapath + "transformer_model.pkl", 'rb') as f: 222 | tfidf_model = pkl.load(f) 223 | tfidf = tfidf_model.transform(vectorizer.transform(sentences)) 224 | 225 | with open('../data/{0}/{0}_inductive.content.{1}'.format(DATASET, 'text'), 'w') as f: 226 | for n in range(len(node_list)): 227 | if judge_node_type(node_list[n]) == 'text': 228 | idx = int(node_list[n].split('_')[1]) 229 | f.write( 230 | str(n) + '\t' + '\t'.join(map(str, tfidf[idx, :].toarray()[0].tolist() )) 231 | + '\n') 232 | 233 | # build_adj 234 | adj_dict = g.adj 235 | with open('../data/{0}/{0}_inductive.cites'.format(DATASET), 'w') as f: 236 | for i in adj_dict: 237 | for j in adj_dict[i]: 238 | f.write('{}\t{}\n'.format(i, j)) 239 | 240 | 241 | 242 | if __name__ == '__main__': 243 | sent = [ 244 | "VE子接口接入L3VPN转发,L3VE接口上下行均配置qos-profile模板,qos-profile中配置SQ加入GQ,然后进行主备倒换,查看配置恢复并且上下行流量按SQ参数进行队列调度和限速。查询和Reset SQ和GQ统计,检查统计和清除统计功能正常。", 245 | "整机启动时间48小时内、回退的文件不存在(大包文件不存在、或配置文件不在、或补丁文件不存在、或paf文件不存在),以上场景下,执行一键式版本回退,回退失败,提示对应Error信息", 246 | "全网1588时间同步;dot1q终结子接口终结多Q段接入VLL;IP FPM基本配置,丢包和时延使用默认值,测量周期1s;配置单向流,六元组匹配;MCP主备倒换3次", 247 | "集中式NAT444实例配置natserver,配置满规格日志服务器带不同私网vpn,反复配置去配置日志服务器,主备倒换设备无异常,检查反向建流和流表老化时,日志发送功能正常;修改部分日志服务器私网VPN与用户不一致,检查流表新建和老化时日志发送正常。", 248 | "2R native eth组网,配置基于物理口的PORT CC场景的1ag,使能eth-test测量。接收端设备ISSU升级过程中,发送端设备分别发送有bit错、CRC错、乱序的ETH-TEST报文,统计结果符合预期。配置周期为1s的,二层主接口的outward型的1ag,使能ETH_BN功能,测试仪模拟BNM报文,ISSU升级过程中,带宽通告可恢复,可产生。", 249 | "配置VPLS场景,接口类型为dot1q终结,设置检测接口为物理口,指定测试模式为丢包,手工触发平滑对账,查看测量结果符合规格预期", 250 | "NATIVE IP场景,测试仪模拟TWAMP LIGHT的发起端,设备作为TWAMP LIGHT的反射端,TWAMP LIGHT的时延统计功能正常,reset接口所在单板后,TWAMP LIGHT的时延统计功能能够恢复。", 251 | "接口上行配置qos-profile模板,模板中配置SQ,SQ关联的flow-queue模板中配置八个队列都为同种调度方式(lpq),打不同优先级的流量,八种队列按照配置的调度方式进行正确调度", 252 | "满规格配置IPFPM逐跳多路径检测,中间节点设备主备倒换3次后,功能正常", 253 | "通道口上建立TE隧道,隧道能正常建立并UP,ping和tracert该lsp都通,流量通过TE转发无丢包,ping可设置发送的报文长度,长度覆盖65、4000、8100(覆盖E3 Serial口)", 254 | "双归双活EVPN场景,AC侧接口为eth-trunk,evc封装dot1q,流动作为pop single接入evpn,公网使用te隧道,配置ac侧mac本地化功能,本地动态学习mac后ac侧本地化学习,动态mac清除、ac侧本地化mac清除", 255 | "配置两条静态vxlan隧道场景,Dot1q子接口配置远程端口镜像,切换镜像实例指定pw隧道测试恢复vxlan隧道镜像,再子卡插拔测试", 256 | "单跳检测bdif,交换机接口shut down后BFD会话的状态变为down,接口重新up则BFD会话可以协商UP" 257 | ] 258 | preprocess_inductive_text(sent, 'hw') -------------------------------------------------------------------------------- /model/data/example/example.cites: -------------------------------------------------------------------------------- 1 | 98 92 2 | 98 69 3 | 85 75 4 | 85 43 5 | 85 124 6 | 85 56 7 | 85 118 8 | 52 3 9 | 101 28 10 | 101 21 11 | 101 55 12 | 101 134 13 | 101 117 14 | 43 54 15 | 27 69 16 | 27 92 17 | 124 91 18 | 124 75 19 | 124 139 20 | 124 134 21 | 124 70 22 | 124 44 23 | 124 47 24 | 28 21 25 | 28 135 26 | 135 68 27 | 135 91 28 | 135 49 29 | 135 93 30 | 135 21 31 | 135 42 32 | 135 103 33 | 135 134 34 | 135 95 35 | 135 129 36 | 135 50 37 | 135 13 38 | 135 107 39 | 135 73 40 | 135 102 41 | 135 117 42 | 135 23 43 | 131 69 44 | 131 92 45 | 134 68 46 | 134 91 47 | 134 49 48 | 134 21 49 | 134 120 50 | 134 72 51 | 134 70 52 | 134 95 53 | 31 64 54 | 31 69 55 | 75 
91 56 | 75 93 57 | 75 21 58 | 75 4 59 | 75 129 60 | 75 56 61 | 75 118 62 | 75 23 63 | 56 78 64 | 56 118 65 | 118 78 66 | 103 115 67 | 103 7 68 | 103 136 69 | 103 93 70 | 103 4 71 | 103 42 72 | 103 100 73 | 103 129 74 | 103 50 75 | 103 13 76 | 103 99 77 | 103 73 78 | 103 84 79 | 103 14 80 | 103 117 81 | 103 23 82 | 103 76 83 | 120 49 84 | 120 72 85 | 120 70 86 | 120 44 87 | 137 69 88 | 137 92 89 | 49 70 90 | 107 68 91 | 107 21 92 | 107 42 93 | 107 95 94 | 59 61 95 | 59 34 96 | 59 58 97 | 59 33 98 | 61 144 99 | 61 96 100 | 61 34 101 | 115 68 102 | 115 93 103 | 115 42 104 | 115 100 105 | 115 129 106 | 115 45 107 | 115 13 108 | 115 73 109 | 115 102 110 | 115 130 111 | 115 14 112 | 115 23 113 | 115 76 114 | 13 93 115 | 13 21 116 | 13 42 117 | 13 100 118 | 13 129 119 | 13 50 120 | 13 45 121 | 13 73 122 | 13 102 123 | 13 130 124 | 13 14 125 | 13 23 126 | 13 76 127 | 34 144 128 | 34 96 129 | 34 102 130 | 94 14 131 | 45 93 132 | 45 42 133 | 45 100 134 | 45 129 135 | 45 50 136 | 45 6 137 | 45 73 138 | 45 102 139 | 45 130 140 | 45 14 141 | 45 23 142 | 45 76 143 | 29 125 144 | 29 127 145 | 29 33 146 | 29 92 147 | 142 136 148 | 142 106 149 | 142 127 150 | 142 84 151 | 142 117 152 | 112 21 153 | 112 129 154 | 112 50 155 | 112 73 156 | 112 117 157 | 112 23 158 | 112 76 159 | 127 7 160 | 127 136 161 | 127 84 162 | 127 117 163 | 127 76 164 | 121 58 165 | 121 33 166 | 130 68 167 | 130 93 168 | 130 21 169 | 130 42 170 | 130 100 171 | 130 129 172 | 130 50 173 | 130 6 174 | 130 73 175 | 130 102 176 | 130 14 177 | 130 23 178 | 130 76 179 | 42 68 180 | 42 93 181 | 42 21 182 | 42 100 183 | 42 95 184 | 42 129 185 | 42 50 186 | 42 73 187 | 42 102 188 | 42 14 189 | 42 23 190 | 42 76 191 | 93 68 192 | 93 21 193 | 93 4 194 | 93 100 195 | 93 95 196 | 93 129 197 | 93 50 198 | 93 6 199 | 93 96 200 | 93 73 201 | 93 84 202 | 93 102 203 | 93 14 204 | 93 117 205 | 93 23 206 | 93 76 207 | 102 68 208 | 102 21 209 | 102 100 210 | 102 129 211 | 102 50 212 | 102 96 213 | 102 73 214 | 102 14 215 | 102 23 216 | 86 58 217 | 86 33 218 | 68 100 219 | 68 129 220 | 68 96 221 | 68 73 222 | 68 14 223 | 68 23 224 | 68 76 225 | 53 84 226 | 53 33 227 | 53 92 228 | 21 100 229 | 21 95 230 | 21 129 231 | 21 50 232 | 21 96 233 | 21 73 234 | 21 14 235 | 21 117 236 | 21 23 237 | 21 76 238 | 84 7 239 | 84 136 240 | 84 50 241 | 84 73 242 | 84 14 243 | 84 117 244 | 84 76 245 | 100 7 246 | 100 136 247 | 100 129 248 | 100 50 249 | 100 6 250 | 100 73 251 | 100 14 252 | 100 117 253 | 100 23 254 | 100 76 255 | 4 95 256 | 4 126 257 | 4 50 258 | 4 99 259 | 4 73 260 | 81 15 261 | 81 64 262 | 110 9 263 | 22 15 264 | 22 64 265 | 46 41 266 | 46 126 267 | 99 126 268 | 83 15 269 | 83 109 270 | 96 129 271 | 50 129 272 | 50 6 273 | 50 73 274 | 50 14 275 | 50 117 276 | 50 23 277 | 50 76 278 | 6 73 279 | 6 14 280 | 6 76 281 | 76 7 282 | 76 129 283 | 76 73 284 | 76 14 285 | 76 117 286 | 76 23 287 | 97 15 288 | 97 132 289 | 7 136 290 | 7 117 291 | 32 16 292 | 32 15 293 | 32 109 294 | 73 129 295 | 73 14 296 | 73 117 297 | 73 23 298 | 14 129 299 | 14 117 300 | 14 23 301 | 23 91 302 | 23 129 303 | 23 117 304 | 129 117 305 | 104 15 306 | 104 58 307 | 19 133 308 | 12 15 309 | 12 132 310 | 72 44 311 | 79 58 312 | 79 64 313 | 38 58 314 | 38 15 315 | 105 58 316 | 105 15 317 | 40 41 318 | 40 63 319 | 111 126 320 | 1 39 321 | 1 80 322 | 1 15 323 | 57 8 324 | 57 71 325 | 133 8 326 | 39 2 327 | 116 80 328 | 116 64 329 | 35 80 330 | 35 25 331 | 17 106 332 | 17 80 333 | 17 25 334 | 8 71 335 | 2 80 336 | 2 25 337 | 24 92 338 | 24 25 339 | 60 92 340 | 60 25 341 | 113 140 342 | 30 92 
343 | 30 25 344 | 89 123 345 | 141 62 346 | 141 92 347 | 141 33 348 | 128 92 349 | 128 25 350 | 5 138 351 | 5 37 352 | 5 0 353 | 26 64 354 | 26 108 355 | 66 18 356 | 66 123 357 | 90 108 358 | 90 25 359 | 37 138 360 | 37 77 361 | 37 0 362 | 119 108 363 | 119 64 364 | 114 108 365 | 114 64 366 | 136 117 367 | 122 108 368 | 122 80 369 | 51 65 370 | 51 132 371 | 10 20 372 | 10 11 373 | 138 0 374 | 143 87 375 | 143 36 376 | 74 77 377 | 74 65 378 | 74 132 379 | 77 0 380 | 82 132 381 | 82 15 382 | 88 65 383 | 88 25 384 | 67 48 385 | 67 69 386 | 67 132 387 | 0 0 388 | 1 1 389 | 2 2 390 | 3 3 391 | 4 4 392 | 5 5 393 | 6 6 394 | 7 7 395 | 8 8 396 | 9 9 397 | 10 10 398 | 11 11 399 | 12 12 400 | 13 13 401 | 14 14 402 | 15 15 403 | 16 16 404 | 17 17 405 | 18 18 406 | 19 19 407 | 20 20 408 | 21 21 409 | 22 22 410 | 23 23 411 | 24 24 412 | 25 25 413 | 26 26 414 | 27 27 415 | 28 28 416 | 29 29 417 | 30 30 418 | 31 31 419 | 32 32 420 | 33 33 421 | 34 34 422 | 35 35 423 | 36 36 424 | 37 37 425 | 38 38 426 | 39 39 427 | 40 40 428 | 41 41 429 | 42 42 430 | 43 43 431 | 44 44 432 | 45 45 433 | 46 46 434 | 47 47 435 | 48 48 436 | 49 49 437 | 50 50 438 | 51 51 439 | 52 52 440 | 53 53 441 | 54 54 442 | 55 55 443 | 56 56 444 | 57 57 445 | 58 58 446 | 59 59 447 | 60 60 448 | 61 61 449 | 62 62 450 | 63 63 451 | 64 64 452 | 65 65 453 | 66 66 454 | 67 67 455 | 68 68 456 | 69 69 457 | 70 70 458 | 71 71 459 | 72 72 460 | 73 73 461 | 74 74 462 | 75 75 463 | 76 76 464 | 77 77 465 | 78 78 466 | 79 79 467 | 80 80 468 | 81 81 469 | 82 82 470 | 83 83 471 | 84 84 472 | 85 85 473 | 86 86 474 | 87 87 475 | 88 88 476 | 89 89 477 | 90 90 478 | 91 91 479 | 92 92 480 | 93 93 481 | 94 94 482 | 95 95 483 | 96 96 484 | 97 97 485 | 98 98 486 | 99 99 487 | 100 100 488 | 101 101 489 | 102 102 490 | 103 103 491 | 104 104 492 | 105 105 493 | 106 106 494 | 107 107 495 | 108 108 496 | 109 109 497 | 110 110 498 | 111 111 499 | 112 112 500 | 113 113 501 | 114 114 502 | 115 115 503 | 116 116 504 | 117 117 505 | 118 118 506 | 119 119 507 | 120 120 508 | 121 121 509 | 122 122 510 | 123 123 511 | 124 124 512 | 125 125 513 | 126 126 514 | 127 127 515 | 128 128 516 | 129 129 517 | 130 130 518 | 131 131 519 | 132 132 520 | 133 133 521 | 134 134 522 | 135 135 523 | 136 136 524 | 137 137 525 | 138 138 526 | 139 139 527 | 140 140 528 | 141 141 529 | 142 142 530 | 143 143 531 | 144 144 532 | -------------------------------------------------------------------------------- /model/data/example/example.content.entity: -------------------------------------------------------------------------------- 1 | 110 0.4128976190896328 0.42347086738257994 0.0 0.0 0.0 0.8063423470389967 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 2 | 16 0.0 0.7460770404765611 0.0 0.0 0.0 0.0 0.0 0.0 0.6658596321100535 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 3 | 45 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 4 | 68 0.0 0.16626884101293 0.0 0.0 0.0 0.0 0.29678358480919553 0.928576357788077 0.14839179240459777 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 5 | 52 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 entity 6 | 115 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.9162423370679809 0.0 0.3123917334001648 0.0 0.0 0.0 0.2508214198736721 0.0 0.0 entity 7 | 140 0.5036917525882165 0.5165900056102781 0.0 0.5165900056102781 0.0 0.0 0.0 0.0 0.4610467986894133 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 8 | 13 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 9 | 103 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 10 | 41 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 entity 11 | 66 0.0 0.0 0.0 0.0 0.0 0.0 0.9192504543328178 0.0 0.0 0.0 0.0 0.0 0.0 0.39367321754077705 0.0 0.0 entity 12 | 57 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 13 | 124 0.0 0.7392370411053496 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6734453185358168 0.0 0.0 0.0 0.0 0.0 entity 14 | 139 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 entity 15 | 19 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 16 | 136 0.0 0.9340966471734243 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.3570202427585404 0.0 0.0 entity 17 | 3 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 18 | 130 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 19 | 46 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 20 | 100 0.0 0.0 0.0 0.5148343604877473 0.0 0.0 0.0 0.4792068600200417 0.0 0.0 0.0 0.0 0.5148343604877473 0.0 0.4901550242852527 0.0 entity 21 | 4 0.0 0.0 0.0 0.0 0.0 0.0 0.5041490938189078 0.0 0.0 0.0 0.0 0.0 0.0 0.8636166343937418 0.0 0.0 entity 22 | 8 0.42920383141594415 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.3364927980278603 0.8381865353089701 0.0 entity 23 | 37 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 24 | 123 0.0 0.7460770404765611 0.0 0.0 0.0 0.0 0.0 0.0 0.6658596321100535 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 25 | 76 0.0 0.526361543744378 0.0 0.0 0.0 0.0 0.0 0.0 0.46976771145596224 0.5011296351945588 0.0 0.0 0.0 0.0 0.0 0.5011296351945588 entity 26 | 142 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5071213902423028 0.0 0.0 0.4255242692005045 0.5299797127627106 0.5299797127627106 entity 27 | 138 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 28 | 20 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.8903220484888332 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.4553313628278285 entity 29 | 133 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 30 | 85 0.0 0.582154700206958 0.0 0.582154700206958 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5676194236051949 0.0 0.0 0.0 0.0 entity 31 | 49 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 32 | 107 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 33 | 120 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 34 | 39 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 35 | 89 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 36 | 36 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 37 | 42 0.3537352132330117 0.0 0.3537352132330117 0.0 0.0 0.0 0.0 0.33768747802609306 0.0 0.0 0.33050502943729493 0.0 0.725586928313225 0.0 0.0 0.0 entity 38 | 6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 entity 39 | 71 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6839091107827836 0.0 0.0 0.0 0.0 0.0 0.0 0.7295672197873904 entity 40 | 101 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 41 | 126 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 42 | 21 0.0 0.0 0.0 0.0 0.13033856426006174 0.0 0.36654497008423925 0.12742730569713723 0.0 0.9123699498204323 0.0 0.0 0.0 0.0 0.0 0.0 entity 43 | 63 0.0 0.0 0.0 0.7242527273481703 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6895346161932243 0.0 entity 44 | 5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 45 | 77 0.0 0.0 0.0 0.0 0.0 0.7295672197873904 0.0 0.0 0.6839091107827836 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 46 | 112 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
0.0 0.626076108355491 0.0 0.7797619550519527 entity 47 | 43 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.8776026932008457 0.0 0.0 0.47938868664855044 0.0 0.0 0.0 0.0 entity 48 | 113 0.0 0.0 0.8898235312117321 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.45630481402444545 0.0 0.0 0.0 entity 49 | 70 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 50 | 11 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 51 | 111 0.0 0.0 0.0 0.0 0.33348940998145365 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7005612453635459 0.535521772064518 0.33348940998145365 0.0 entity 52 | 40 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 53 | 56 0.0 0.0 0.0 0.0 0.0 0.9054732821028685 0.4244032697775304 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 54 | 34 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 55 | 73 0.0 0.0 0.0 0.0 0.24253562503633297 0.0 0.0 0.0 0.0 0.9701425001453319 0.0 0.0 0.0 0.0 0.0 0.0 entity 56 | 106 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 entity 57 | 18 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 58 | 23 0.0 0.0 0.0 0.0 0.0 0.0 0.8045238736139533 0.0 0.0 0.42911714178898863 0.4106090785748029 0.0 0.0 0.0 0.0 0.0 entity 59 | 75 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 60 | 135 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 61 | 54 0.0 0.0 0.0 0.0 0.0 0.0 0.9049839500227976 0.0 0.0 0.0 0.0 0.0 0.3380030437607315 0.2583756811497849 0.0 0.0 entity 62 | 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 63 | 78 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 entity 64 | 55 0.0 0.0 0.0 0.0 0.0 0.0 0.6839091107827836 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7295672197873904 0.0 entity 65 | 134 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 66 | 44 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 67 | 102 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.9630189650151143 0.0 0.0 0.0 0.16813066375666905 0.0 0.13181326293179332 0.16417008448933054 0.0 entity 68 | 95 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 entity 69 | 125 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 70 | 87 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 71 | 96 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 72 | 72 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 entity 73 | 47 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 74 | 62 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 entity 75 | 28 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5844900431032921 0.0 0.0 0.6385529950927725 0.0 0.500621076235471 0.0 0.0 entity 76 | 84 0.0 0.0 0.0 0.5865708015535791 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5865708015535791 0.0 0.0 0.5584526743866337 entity 77 | 10 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 78 | 117 0.0 0.0 0.0 0.46577813790634043 0.0 0.0 0.0 0.0 0.4156981688554791 0.4434503833052836 0.0 0.0 0.46577813790634043 0.0 0.0 0.4434503833052836 entity 79 | 118 0.0 0.0 0.6696705123801245 0.0 0.0 0.32694769098327986 0.6129730024822695 0.0 0.0 0.0 0.0 0.0 0.0 0.2625084959332076 0.0 0.0 entity 80 | 7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 entity 81 | 61 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 82 | 48 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 83 | 127 0.5293979964997674 0.0 0.0 0.0 0.0 
0.5169272033543715 0.0 0.0 0.0 0.0 0.0 0.5293979964997674 0.0 0.41504432177334144 0.0 0.0 entity 84 | 129 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7071067811865475 0.0 0.0 0.0 0.0 0.7071067811865475 0.0 entity 85 | 99 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 86 | 94 0.0 0.0 0.0 0.0 0.4631219854333841 0.0 0.0 0.0 0.0 0.0 0.8862945484477721 0.0 0.0 0.0 0.0 0.0 entity 87 | 14 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6751907623427075 0.0 0.0 0.7376431620011626 0.0 0.0 0.0 0.0 entity 88 | 93 0.0 0.0 0.0 0.2851210859459021 0.27145339070541175 0.27145339070541175 0.0 0.5307803472643001 0.25446517061225893 0.27145339070541175 0.5194909071628946 0.0 0.2851210859459021 0.0 0.0 0.0 entity 89 | 9 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 entity 90 | 144 0.0 0.0 0.0 0.4854274117243385 0.0 0.4621577405149383 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7421391047699457 0.0 0.0 entity 91 | 91 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.9630392083318012 0.0 0.0 0.26936125039741216 0.0 0.0 entity 92 | 50 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6990759925117283 0.0 0.7150473807334323 0.0 0.0 0.0 0.0 0.0 0.0 entity 93 | 143 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 entity 94 | -------------------------------------------------------------------------------- /model/data/example/example.content.text: -------------------------------------------------------------------------------- 1 | 98 0.0 0.0 0.0 0.0 0.0 0.9358650031234387 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.35235876025546164 0.0 0.0 0.0 0.0 business 2 | 27 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.9156288122439306 0.0 0.0 0.4020247233551301 0.0 0.0 0.0 0.0 business 3 | 131 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.9156288122439306 0.0 0.0 0.4020247233551301 0.0 0.0 0.0 0.0 business 4 | 31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6695963650777833 0.0 0.0 0.0 0.0 0.0 0.0 0.7427251900094812 0.0 0.0 0.0 0.0 business 5 | 137 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6046697920338051 0.0 0.0 0.7964762661886384 0.0 0.0 0.0 0.0 business 6 | 59 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7514065815604376 0.0 0.0 0.0 0.0 0.0 0.6598394874419514 0.0 0.0 0.0 computers 7 | 29 0.0 0.0 0.0 0.0 0.0 0.16467742487504686 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7671089427046102 0.0 0.0 0.0 0.0 0.6200203349561516 0.0 0.0 0.0 computers 8 | 121 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7514065815604376 0.0 0.0 0.0 0.0 0.0 0.6598394874419514 0.0 0.0 0.0 computers 9 | 86 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.9156288122439306 0.0 0.0 0.0 0.0 0.0 0.4020247233551301 0.0 0.0 0.0 computers 10 | 53 0.0 0.0 0.0 0.0 0.0 0.21729028321966476 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7591446712281558 0.0 0.0 0.0 0.0 0.6135831654830609 0.0 0.0 0.0 computers 11 | 81 0.40461548377817164 0.40461548377817164 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.32032492957494885 0.0 0.0 0.0 0.0 0.0 0.7549599724930564 0.0 0.0 0.0 0.0 0.0 culture-arts-entertainment 12 | 22 0.9668995057997964 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2551574919223607 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 culture-arts-entertainment 13 | 83 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 culture-arts-entertainment 14 | 97 0.862719620468864 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.18939695437392945 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.2012153535427588 0.0 
0.0 0.0 0.18939695437392945 0.3787939087478589 culture-arts-entertainment 15 | 32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 culture-arts-entertainment 16 | 104 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.25819888974716115 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.7745966692414835 0.0 0.0 0.0 0.0 0.25819888974716115 0.5163977794943223 education-science 17 | 12 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.20614945054270384 0.0 0.25505467825685685 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.8245978021708154 0.0 0.0 0.0 0.0 0.20614945054270384 0.4122989010854077 education-science 18 | 79 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.41094830562394763 0.0 0.0 0.0 0.0 0.9116586477979609 0.0 0.0 0.0 0.0 0.0 0.0 education-science 19 | 38 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 education-science 20 | 105 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 education-science 21 | 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.16584805931160926 0.628392605698543 0.0 0.0 0.0 0.0 0.663392237246437 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.16584805931160926 0.3316961186232185 engineering 22 | 116 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.8151167991472721 0.0 0.0 0.0 0.38789492827902133 0.4302582112677438 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 engineering 23 | 35 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.687696055429443 0.0 0.0 0.0 0.0 0.7259987158024347 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 engineering 24 | 17 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.687696055429443 0.0 0.0 0.0 0.0 0.7259987158024347 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 engineering 25 | 2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.687696055429443 0.0 0.0 0.0 0.0 0.7259987158024347 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 engineering 26 | 24 0.0 0.0 0.0 0.0 0.0 0.0 0.7514065815604376 0.0 0.0 0.0 0.0 0.6598394874419514 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 health 27 | 60 0.0 0.0 0.0 0.0 0.0 0.0 0.7514065815604376 0.0 0.0 0.0 0.0 0.6598394874419514 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 health 28 | 30 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 health 29 | 141 0.0 0.0 0.0 0.0 0.0 0.46912084121140396 0.0 0.0 0.0 0.0 0.0 0.8831339854977299 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 health 30 | 128 0.0 0.0 0.0 0.0 0.0 0.0 0.9156288122439306 0.0 0.0 0.0 0.0 0.4020247233551301 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 health 31 | 26 0.0 0.0 0.0 0.0 0.4363902117064088 0.0 0.0 0.4363902117064088 0.0 0.0 0.0 0.0 0.0 0.7868463422128056 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 politics-society 32 | 90 0.0 0.0 0.0 0.0 0.9486832980505139 0.0 0.0 0.31622776601683794 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 politics-society 33 | 119 0.0 0.37273755096583017 0.0 0.0 0.4364205489016015 0.0 0.0 0.32731541167620115 0.0 0.0 0.0 0.0 0.7506453515979827 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 politics-society 34 | 114 0.0 0.4270855370625232 0.0 0.0 0.7500810048375636 0.0 0.0 0.3750405024187818 0.0 0.0 0.0 0.0 0.0 0.33811396268026966 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 politics-society 35 | 122 0.0 0.0 0.0 0.0 0.41178889441522537 0.0 0.0 0.8235777888304507 0.0 0.3900634976275425 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 politics-society 36 | 51 0.0 0.0 0.5345224838248488 0.2672612419124244 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.8017837257372731 0.0 0.0 sports 37 | 
74 0.0 0.0 0.8164965809277259 0.40824829046386296 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.40824829046386296 0.0 0.0 sports 38 | 82 0.0 0.0 0.0 0.5261283520506196 0.0 0.0 0.0 0.0 0.15400462309180898 0.0 0.7621577353583696 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.15400462309180898 0.30800924618361797 sports 39 | 88 0.0 0.0 0.8944271909999159 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.4472135954999579 0.0 0.0 sports 40 | 67 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 sports 41 | -------------------------------------------------------------------------------- /model/data/example/example.content.topic: -------------------------------------------------------------------------------- 1 | 69 0.10000000064493393 0.1000000008140825 0.10000000028266168 0.10000000035382671 0.10000000047539047 0.1000000004828411 0.10000000060871007 0.1000000005324753 0.100000000711586 0.10000000066359978 0.10000000052238267 0.10000000067661378 0.10000000014003871 0.10000310368917421 0.10000000045052576 0.10000000044994944 0.10000000015794049 6.099999976462049 0.1000000010483934 0.10000000090751568 6.100029142682792 0.10000000061865216 0.1000000002451704 0.100000000711586 0.10000000061927684 topic 2 | 108 0.10000000030248687 3.9869331920755253 0.1000000001324301 0.1000000001657717 11.099988383941328 0.10000000022568818 0.10000000028518735 8.099985059782904 0.10000000033338584 0.10000959756685197 0.10000000024474201 0.10000000031700097 5.099999989264985 0.10000958461921998 0.10000000021107626 0.10000000021080625 0.10000000007399686 0.10000000014486304 0.1000000004911838 0.10000000042576132 0.10000000034382804 0.10000000028984532 0.10000000011486503 0.10000000033338584 0.10000000029013799 topic 3 | 33 0.10000000050165696 0.10000000063322788 0.10000000021986624 0.10000000027522142 0.10000000036977887 2.099991023864151 0.10000000047348051 0.10000000041418187 0.10000000055350176 0.10000000051617604 0.10000000040633139 0.10000000052634735 0.10000000010892805 0.10000000066465843 0.10000000035043805 0.100000000350928 7.09999998794141 0.10000000024050797 0.10000000081548484 0.10000000070590417 0.10000000057101618 7.100064058850597 0.10000000019070393 0.10000000055350176 0.10000000048169978 topic 4 | 65 0.100000000601957 0.1000000007598339 6.099999978463166 2.099979341443886 0.10000000044371153 0.10000000044961346 0.10000000056814703 0.10000000049699237 0.10000000066421821 0.10000000061937901 0.10000000048760901 0.10000000063152578 0.10000000013070685 0.1000000007975486 0.10000000042050376 0.10000000041996585 0.1000000001474157 0.10000000028859453 0.10000000097853087 0.10000000084704092 0.10000000068497036 0.1000000005774266 5.099999981319692 0.10000000066421821 0.10000000057805047 topic 5 | 80 0.10000000032255908 0.10000000040710746 0.10000000014135384 0.10000000017696559 0.10000074886048957 0.10000000024089607 0.10000000030440456 0.10000208808596316 1.0999923634983155 12.099987122216671 0.1000000002613717 0.10000000033836194 0.10000000007003075 0.10000620317028404 10.099998014746577 0.10000000022501133 0.10000000007898309 0.10000000015462458 0.1000000005243858 0.10000000045384043 0.10000000036699673 0.10000000030937642 0.10000000012260515 1.0999923634983155 2.099988012939781 topic 6 | 92 0.1000000003926223 0.1000000004955964 0.10000000017207852 0.10000000021540231 0.10000000028940778 6.100008939650146 5.099999953423517 0.10000000032415984 0.10000000043319869 0.10000000040398568 
0.10000000031801565 8.099999948227703 0.10000000008525264 0.10000000052032255 0.10000000027427068 0.10000000027391982 0.10000000009625576 0.10000000018828213 0.10000000063824 0.10000000055247658 2.0999641218661806 0.1000000003767274 0.10000000014925463 0.10000000043319869 0.1000000003770028 topic 7 | 25 0.10000000447053763 0.10000000564303888 0.10000000195934661 0.10000000245264679 0.10000000329529947 0.10000000333913112 0.10000000421944119 0.10000000369099893 0.10000000493255537 0.10000000459992453 0.10000000362104004 0.10000000469013443 0.10000000097071651 0.10000000592313252 0.10000000312294278 0.10000000311894792 0.10000000109480751 0.10000000214329609 0.1000000072672295 0.10000000629069618 0.10000000508705013 0.10000000428835745 0.10000000169946556 0.10000000493255537 0.10000000429268854 topic 8 | 58 0.10000000053116166 0.10000000067047086 0.10000000023279756 0.10000000029140843 0.10000000039152722 0.10000000039685897 0.10000000050132803 0.10000000043854178 0.10000000058689941 0.10000000054653468 0.10000000043121514 0.1000000005572529 0.1000000001153346 0.10000634588594152 0.10000000037104888 5.099999965667074 0.10000000013018323 0.10000000025465332 6.099990297447763 0.10000000074742157 0.10000000060441193 4.099935894426184 0.1000000002019201 0.10000000058689941 0.10000000051081784 topic 9 | 109 0.10000000447053763 0.10000000564303888 0.10000000195934661 0.10000000245264679 0.10000000329529947 0.10000000333913112 0.10000000421944119 0.10000000369099893 0.10000000493255537 0.10000000459992453 0.10000000362104004 0.10000000469013443 0.10000000097071651 0.10000000592313252 0.10000000312294278 0.10000000311894792 0.10000000109480751 0.10000000214329609 0.1000000072672295 0.10000000629069618 0.10000000508705013 0.10000000428835745 0.10000000169946556 0.10000000493255537 0.10000000429268854 topic 10 | 132 0.1000000006947255 0.1000000008766686 0.10000000030463979 3.100020631978266 0.10000000051193815 0.10000000051874758 0.10000000065550729 0.10000000057341163 1.1000090738708639 0.10000000071474371 4.1000570677731965 0.10000000072863138 0.10000000015080479 0.10000000092018244 0.10000000048531513 0.10000000048454125 0.10000000017008284 0.10000000033296982 0.10000000113121894 0.10000000097733115 0.10000000079029378 0.10000000066621371 0.10000000026427903 1.1000090738708639 2.1000218909978687 topic 11 | 64 0.10001884268356509 1.2130488029016366 0.10000000035604667 0.10000000044568769 0.10001083159273699 0.10000000060677694 0.10000000076674422 0.10001281225018842 0.10000000089632895 0.10000323051447904 0.10000000065800438 0.10000000085227717 0.10000000017646501 8.09996845256624 0.10000195129897382 0.10000000056676593 0.1000000001989452 0.10000000038947345 0.10000653674416829 0.1000150754088154 0.10000668095550053 0.10000000077926748 0.10000000030882186 0.10000000089632895 0.10000000078005433 topic 12 | 15 8.099981108618941 0.10001794393543985 0.1000000001014875 0.1000000001270875 0.10000000017068529 0.10000000017295561 0.10000000021855267 0.10000000019118116 3.09999850912612 0.10000000023832109 1.0999428926867112 0.10000000024293296 0.10000000005027983 0.10000624793508826 0.10000000016183115 0.10000000016155089 0.10000000005670732 0.10000000011101547 7.100003087508312 9.099984856066916 0.10000000026349182 0.1000000002221223 0.10000000008802655 3.09999850912612 6.099990049498399 topic 13 | -------------------------------------------------------------------------------- /model/data/example/mapindex.txt: -------------------------------------------------------------------------------- 1 | 
Southern_Illinois_University 0 2 | 6760 1 3 | 6764 2 4 | Taiwan 3 5 | Tool 4 6 | George_Washington_University 5 7 | Firefox 6 8 | Wikipedia 7 9 | Turbine 8 10 | Film 9 11 | CBS_Sports_Network 10 12 | The_Sports_Network 11 13 | 5221 12 14 | Garbage_collection_(computer_science) 13 15 | Unix-like 14 16 | topic_19 15 17 | Beowulf 16 18 | 6763 17 19 | Candidate 18 20 | Force 19 21 | Television_network 20 22 | Software_development 21 23 | 2561 22 24 | Installation_(computer_programs) 23 25 | 7630 24 26 | topic_9 25 27 | 8520 26 28 | 1 27 29 | Electronics_manufacturing_services 28 30 | 2131 29 31 | 7632 30 32 | 3 31 33 | 2564 32 34 | topic_3 33 35 | Walter_Bright 34 36 | 6762 35 37 | Track_and_field 36 38 | University_of_Texas_at_Austin 37 39 | 5223 38 40 | Newton's_laws_of_motion 39 41 | Equal_opportunity 40 42 | Ethics 41 43 | Type_system 42 44 | Export 43 45 | Technician 44 46 | Backward_compatibility 45 47 | Science 46 48 | Standard_Telephones_and_Cables 47 49 | Net_Authority 48 50 | Structural_engineering 49 51 | Application_programming_interface 50 52 | 9070 51 53 | China 52 54 | 2134 53 55 | Procurement 54 56 | Home_improvement 55 57 | Machining 56 58 | Electricity_generation 57 59 | topic_12 58 60 | 2130 59 61 | 7631 60 62 | Digital_Mars 61 63 | Barnes-Jewish_Hospital 62 64 | Health_care 63 65 | topic_18 64 66 | topic_4 65 67 | Election 66 68 | 9074 67 69 | Computer_programming 68 70 | topic_0 69 71 | Automotive_engineering 70 72 | Power_station 71 73 | Auto_mechanic 72 74 | Application_software 73 75 | 9071 74 76 | Numerical_control 75 77 | Web_browser 76 78 | Texas_Tech_University 77 79 | Fastener 78 80 | 5222 79 81 | topic_6 80 82 | 2560 81 83 | 9072 82 84 | 2562 83 85 | Yahoo!_Directory 84 86 | Manufacturing 85 87 | 2133 86 88 | Booster_club 87 89 | 9073 88 90 | Opelousas,_Louisiana 89 91 | 8521 90 92 | Automotive_electronics 91 93 | topic_7 92 94 | Scripting_language 93 95 | DOS 94 96 | Problem_solving 95 97 | NLS_(computer_system) 96 98 | 2563 97 99 | 0 98 100 | Information 99 101 | Source_code 100 102 | Product_(business) 101 103 | Extensible_programming 102 104 | Database 103 105 | 5220 104 106 | 5224 105 107 | HowStuffWorks 106 108 | Mathematical_optimization 107 109 | topic_2 108 110 | topic_16 109 111 | Academy_Awards 110 112 | Skill 111 113 | Internet_access 112 114 | Jogging 113 115 | 8523 114 116 | Compiler 115 117 | 6761 116 118 | Web_search_engine 117 119 | Stamping_(metalworking) 118 120 | 8522 119 121 | Technical_drawing 120 122 | 2132 121 123 | 8524 122 124 | Republican_Party_(United_States) 123 125 | Electronics 124 126 | Google_Directory 125 127 | Knowledge 126 128 | Usenet 127 129 | 7634 128 130 | Computer_cluster 129 131 | Ruby_(programming_language) 130 132 | 2 131 133 | topic_17 132 134 | Thrust 133 135 | Product_design 134 136 | Data_management 135 137 | FAQ 136 138 | 4 137 139 | Yale_University 138 140 | Amplifier 139 141 | Cycling 140 142 | 7633 141 143 | Blog 142 144 | College_athletics 143 145 | Aardvark_(search_engine) 144 146 | -------------------------------------------------------------------------------- /model/data/example/test.map: -------------------------------------------------------------------------------- 1 | 74 2 | 81 3 | 116 4 | 114 5 | 17 6 | 32 7 | 119 8 | 30 -------------------------------------------------------------------------------- /model/data/example/train.map: -------------------------------------------------------------------------------- 1 | 38 2 | 24 3 | 29 4 | 97 5 | 121 6 | 88 7 | 27 8 | 12 9 | 98 10 | 104 11 | 59 12 | 22 13 | 
53 14 | 86 15 | 83 16 | 82 -------------------------------------------------------------------------------- /model/data/example/vali.map: -------------------------------------------------------------------------------- 1 | 31 2 | 51 3 | 35 4 | 67 5 | 2 6 | 90 7 | 137 8 | 131 9 | 141 10 | 1 11 | 122 12 | 60 13 | 105 14 | 26 15 | 79 16 | 128 -------------------------------------------------------------------------------- /tagMe.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | import time 4 | import json 5 | import requests 6 | from tqdm import tqdm 7 | from multiprocessing import Pool as ProcessPool 8 | time0 = time.time() 9 | 10 | dataset = 'example' 11 | 12 | 13 | def getEntityList(text): 14 | url = 'https://tagme.d4science.org/tagme/tag' 15 | # token = '14a339a1-6913-4e8a-bff3-da6ae4459381-843339462' 16 | token = 'fe4df7bf-ab75-4efb-aa1c-551afaa65cd3-843339462' 17 | para = {'lang': 'en', 18 | 'gcube-token': token, 19 | 'text': text, 20 | } 21 | headers = { 22 | 'Host': 'tagme.d4science.org', 23 | 'Connection': 'keep-alive', 24 | 'Cache-Control': 'max-age=0', 25 | 'Upgrade-Insecure-Requests': '1', 26 | 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36', 27 | 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8', 28 | 'Accept-Encoding': 'gzip, deflate, br', 29 | 'Accept-Language': 'zh-CN,zh;q=0.9', 30 | 'Cookie': '_ga=GA1.2.827175290.1544765315; _gid=GA1.2.121830695.1544765315', 31 | } 32 | response = requests.get(url+'?lang=en&gcube-token='+token+'&text='+text, headers=headers, timeout=10) 33 | try: 34 | return json.loads(response.text)['annotations'] 35 | except: 36 | print('.'*50) 37 | print(text) 38 | print(response.text) 39 | print('.'*50) 40 | 41 | def sentence2Link(sentence): 42 | return json.dumps(getEntityList(sentence)) 43 | 44 | 45 | def run(para, times = 0): 46 | ind = para[0] 47 | sentence = para[1] 48 | try: 49 | l = sentence2Link(sentence) 50 | except Exception as e: 51 | print(ind, e) 52 | print("Content: ", sentence) 53 | if times < 5: 54 | return run(para, times + 1) 55 | else: 56 | with open("error_info.txt", 'w+') as f: 57 | f.write("{}\t{}\n".format(ind, sentence)) 58 | return None 59 | print(ind) 60 | return str(ind)+'\t'+l 61 | 62 | 63 | def process_pool(data): 64 | cnt = 0 65 | p = ProcessPool(32) 66 | chunkSize = 128 67 | res = [] 68 | i = 0 69 | while i < int(len(data)/chunkSize): 70 | try: 71 | res += list(p.map(run, data[i*chunkSize: (i+1)*chunkSize])) 72 | print(str(round((i+1)*chunkSize/len(data)*100, 2))+'%', round(time.time()-time0, 2)) 73 | cnt += 1 74 | # fout = open("cache" + str(cnt).zfill(3) + '.txt', 'w', encoding='utf8') 75 | # fout.write('\n'.join(res)) 76 | # fout.close() 77 | i += 1 78 | time.sleep(1.0) 79 | except: # on failure (e.g. rate limiting), back off for up to 10 minutes and retry the same chunk; use j so the chunk counter i is not overwritten 80 | for j in range(60): 81 | time.sleep(10) 82 | print("\t{} / 600s".format(j*10)) 83 | 84 | res += list(p.map(run, data[(i)*chunkSize:])) 85 | p.close() 86 | p.join() 87 | return res 88 | 89 | 90 | 91 | 92 | if __name__ == "__main__": 93 | print('reading...') 94 | data = [] 95 | with open("./data/{0}/{0}.txt".format(dataset), 'r', encoding='utf8') as fin: 96 | for line in fin: 97 | ind, cate, content = line.split('\t') 98 | if int(ind) > -1 and int(ind) < 9e19: 99 | data.append([ind, content]) 100 | 101 | print('read done. 
Tagging...') 102 | outdata = process_pool(data) 103 | fout = open("./data/{0}/{0}2entity.txt".format(dataset), 'w', encoding='utf8') 104 | fout.write('\n'.join(outdata)) 105 | fout.close() -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | #!/user/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import numpy as np 5 | from nltk.tokenize import WordPunctTokenizer 6 | from collections import defaultdict 7 | from tqdm import tqdm 8 | 9 | 10 | def sample(datapath, DATASETS, resample = False, trainNumPerClass=20): 11 | if resample: 12 | X = [] 13 | Y = [] 14 | from sklearn.model_selection import train_test_split 15 | with open(datapath+'{}.txt'.format(DATASETS), 'r') as f: 16 | for line in f: 17 | ind, cat, title = line.strip('\n').split('\t') 18 | t = cat.lower() 19 | X.append([ind]) 20 | Y.append(t) 21 | 22 | cateset = list(set(Y)) 23 | catemap = dict() 24 | for i in range(len(cateset)): 25 | catemap[cateset[i]] = i 26 | Y = [catemap[i] for i in Y] 27 | X = np.array(X) 28 | Y = np.array(Y) 29 | 30 | trainNum = trainNumPerClass*len(catemap) 31 | ind_train, ind_test = train_test_split(X, 32 | train_size=trainNum, random_state=1, ) 33 | ind_vali, ind_test = train_test_split(ind_test, 34 | train_size=trainNum/(len(X)-trainNum), random_state=1, ) 35 | train = sum(ind_train.tolist(), []) 36 | vali = sum(ind_vali.tolist(), []) 37 | test = sum(ind_test.tolist(), []) 38 | 39 | print( len(train), len(vali), len(test) ) 40 | alltext = set(train + vali + test) 41 | print( "train: {}\nvali: {}\ntest: {}\nAllTexts: {}".format( len(train), len(vali), len(test), len(alltext)) ) 42 | 43 | with open(datapath+'train.list', 'w') as f: 44 | f.write( '\n'.join(map(str, train)) ) 45 | with open(datapath+'vali.list', 'w') as f: 46 | f.write( '\n'.join(map(str, vali)) ) 47 | with open(datapath+'test.list', 'w') as f: 48 | f.write( '\n'.join(map(str, test)) ) 49 | else: 50 | train = [] 51 | vali = [] 52 | test = [] 53 | with open(datapath + 'train.list', 'r') as f: 54 | for line in f: 55 | train.append(line.strip()) 56 | with open(datapath + 'vali.list', 'r') as f: 57 | for line in f: 58 | vali.append(line.strip()) 59 | with open(datapath + 'test.list', 'r') as f: 60 | for line in f: 61 | test.append(line.strip()) 62 | alltext = set(train + vali + test) 63 | 64 | return train, vali, test, alltext 65 | 66 | def tokenize(sen): 67 | return WordPunctTokenizer().tokenize(sen) 68 | 69 | 70 | def preprocess_corpus_notDropEntity(corpus, stopwords, involved_entity): 71 | corpus1 = [[word.lower() for word in tokenize(sentence)] for sentence in tqdm(corpus)] 72 | corpus2 = [[word for word in sentence if word.isalpha() if word not in stopwords] for sentence in tqdm(corpus1)] 73 | all_words = defaultdict(int) 74 | for c in tqdm(corpus2): 75 | for w in c: 76 | all_words[w] += 1 77 | low_freq = set(word for word in set(all_words) if all_words[word] < 5 and word not in involved_entity) 78 | text = [[word for word in sentence if word not in low_freq] for sentence in tqdm(corpus2)] 79 | ans = [' '.join(i) for i in text] 80 | return ans 81 | 82 | def load_stopwords(filepath='./data/stopwords_en.txt'): 83 | stopwords = set() 84 | with open(filepath, 'r') as f: 85 | for line in f: 86 | stopwords.add(line.strip()) 87 | print(len(stopwords)) 88 | return stopwords --------------------------------------------------------------------------------
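
----

For reference, a minimal sketch of how the root-level helpers in ./utils.py fit together is shown below. It is not part of the repository: the dataset name, paths, the toy corpus, and the entity word set are placeholder assumptions, and it presumes the train/vali/test .list files already exist under the data directory.

```
# Hypothetical usage sketch of ./utils.py (adapt the dataset name and paths to your data).
from utils import sample, load_stopwords, preprocess_corpus_notDropEntity

dataset = 'example'                       # placeholder dataset name
datapath = './data/{}/'.format(dataset)

# Read an existing train/vali/test split of document ids
# (pass resample=True to redraw trainNumPerClass=20 documents per class and rewrite the .list files).
train, vali, test, alltext = sample(datapath, dataset, resample=False)

# Clean raw texts: lower-case, keep alphabetic tokens, drop stopwords and
# low-frequency words, but never drop words that were linked to entities.
stopwords = load_stopwords('./data/stopwords_en.txt')
corpus = ["An example short text about machine learning ."]    # toy corpus (assumption)
involved_entity = {'machine', 'learning'}                       # toy entity words (assumption)
cleaned = preprocess_corpus_notDropEntity(corpus, stopwords, involved_entity)
```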