├── KeyExt.py ├── KeyExt ├── ClassicalApproaches │ ├── README.md │ └── main.py ├── EmbedRank │ ├── Dockerfile │ ├── LICENSE │ ├── README.md │ ├── benchmark.py │ ├── config.ini │ ├── extract_keys_from_embedrank.py │ ├── launch.py │ ├── launch.pyc │ ├── requirements.txt │ ├── setup.cfg │ ├── setup.py │ └── swisscom_ai │ │ ├── __init__.py │ │ └── research_keyphrase │ │ ├── __init__.py │ │ ├── embeddings │ │ ├── __init__.py │ │ ├── emb_distrib_interface.py │ │ └── emb_distrib_local.py │ │ ├── model │ │ ├── __init__.py │ │ ├── extractor.py │ │ ├── input_representation.py │ │ ├── method.py │ │ └── methods_embeddings.py │ │ ├── preprocessing │ │ ├── __init__.py │ │ ├── custom_stanford.py │ │ └── postagging.py │ │ └── util │ │ ├── __init__.py │ │ ├── fileIO.py │ │ └── solr_fields.py ├── KPRank │ ├── PositionRank.py │ ├── README.md │ ├── __init__.py │ ├── doc_candidates.py │ ├── evaluation.py │ ├── main.py │ ├── process_data.py │ ├── requirements.txt │ └── run_scibert_model.py ├── Key2Vec │ ├── README.md │ ├── key2vec.py │ ├── key2vec │ │ ├── __init__.py │ │ ├── cleaner.py │ │ ├── constants.json │ │ ├── constants.py │ │ ├── docs.py │ │ ├── glove.py │ │ ├── key2vec.py │ │ └── phrase_graph.py │ ├── requirements.txt │ ├── setup.py │ ├── test.py │ ├── test.txt │ └── tests │ │ ├── test_docs.py │ │ └── test_glove.py ├── KeyBERT │ ├── KeyBERT.py │ └── README.md ├── RVA │ ├── LICENSE │ ├── Makefile │ ├── README.md │ ├── RVA.py │ ├── build │ │ ├── common.o │ │ ├── cooccur │ │ ├── cooccur.o │ │ ├── glove │ │ ├── glove.o │ │ ├── shuffle │ │ ├── shuffle.o │ │ ├── vocab_count │ │ └── vocab_count.o │ ├── cooccurrence.bin │ ├── cooccurrence.shuf.bin │ ├── demo.sh │ ├── eval │ │ ├── matlab │ │ │ ├── WordLookup.m │ │ │ ├── evaluate_vectors.m │ │ │ └── read_and_evaluate.m │ │ ├── octave │ │ │ ├── WordLookup_octave.m │ │ │ ├── evaluate_vectors_octave.m │ │ │ └── read_and_evaluate_octave.m │ │ └── python │ │ │ ├── distance.py │ │ │ ├── evaluate.py │ │ │ └── word_analogy.py │ ├── randomization.test.sh │ └── src │ │ ├── README.md │ │ ├── common.c │ │ ├── common.h │ │ ├── cooccur.c │ │ ├── glove.c │ │ ├── shuffle.c │ │ └── vocab_count.c ├── SIFRank │ ├── README.md │ ├── auxiliary_data │ │ ├── __init__.py │ │ ├── duc2001_vocab.txt │ │ ├── elmo_2x4096_512_2048cnn_2xhighway_options.json │ │ ├── enwiki_vocab_min200.txt │ │ ├── inspec_vocab.txt │ │ └── semeval_vocab.txt │ ├── embeddings │ │ ├── __init__.py │ │ ├── sent_emb_sif.py │ │ ├── word_emb_bert.py │ │ └── word_emb_elmo.py │ ├── eval │ │ └── sifrank_eval.py │ ├── main.py │ ├── model │ │ ├── __init__.py │ │ ├── extractor.py │ │ ├── input_representation.py │ │ └── method.py │ ├── requirements.txt │ ├── test │ │ └── test.py │ └── util │ │ └── fileIO.py ├── __init__.py ├── config.py ├── experiments.py ├── metrics.py └── utils.py ├── LICENSE ├── README.md └── requirements.txt /KeyExt.py: -------------------------------------------------------------------------------- 1 | from KeyExt.config import datasets_path, output_dir 2 | from KeyExt.experiments import run_experiments 3 | 4 | 5 | def main(): 6 | for partial_match in [False, True]: 7 | for n in [5, 10]: 8 | run_experiments( 9 | datasets_path, output_dir, 10 | top_n = n, partial_match = partial_match 11 | ) 12 | 13 | 14 | if __name__=='__main__': main() 15 | -------------------------------------------------------------------------------- /KeyExt/ClassicalApproaches/README.md: -------------------------------------------------------------------------------- 1 | # Classical Approaches 2 | 3 | This directory contains classical 
unsupervised approaches, which do not utilize word embeddings. 4 | These include `YAKE!`, `KPMiner`, `MPRank`, `PositionRank`, `TopicalPageRank`, `SingleRank`, `TextRank` and `TopicRank`. 5 | 6 | ## Setup 7 | In order to run this script, you need to: 8 | ``` 9 | pip install pke 10 | pip install pytextrank 11 | pip install spacy 12 | pip install git+https://github.com/LIAAD/yake 13 | ``` 14 | The `en_core_web_sm` model for the respective `spacy` version needs to be installed, since it is used by [pytextrank](https://github.com/DerwenAI/pytextrank). 15 | `TopicalPageRank` and `KPMiner` use an `lda_model_file` and a `weights_file` respectively, which can be obtained from the [pke](https://github.com/boudinfl/pke) repo. 16 | After they are obtained, their respective paths and the `base_path` for the dataset directory should be set in `main.py`. 17 | -------------------------------------------------------------------------------- /KeyExt/ClassicalApproaches/main.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pke 3 | import time 4 | import yake 5 | import spacy 6 | import string 7 | import pathlib 8 | import functools 9 | import pytextrank 10 | 11 | def counter(func): 12 | """ 13 | Print the elapsed system time in seconds. 14 | """ 15 | @functools.wraps(func) 16 | def wrapper_counter(*args, **kwargs): 17 | start_time = time.perf_counter() 18 | result = func(*args, **kwargs) 19 | end_time = time.perf_counter() 20 | print(f'{func.__name__}: {end_time - start_time} secs') 21 | return result 22 | return wrapper_counter 23 | 24 | @counter 25 | def kpminer(text, top_n = 10): 26 | weights_file = r'..\pke\models\df-semeval2010.tsv.gz' 27 | extractor = pke.unsupervised.KPMiner() 28 | extractor.load_document(input = text, language = 'en') 29 | extractor.candidate_selection(lasf = 5, cutoff = 200) 30 | df = pke.load_document_frequency_file(input_file = weights_file) 31 | extractor.candidate_weighting(df = df, alpha = 2.3, sigma = 3.0) 32 | keyphrases = [key for key,_ in extractor.get_n_best(n = top_n)] 33 | return keyphrases 34 | 35 | @counter 36 | def mprank(text, top_n = 10): 37 | extractor = pke.unsupervised.MultipartiteRank() 38 | stoplist = list(string.punctuation) + list(pke.lang.stopwords.get('en')) 39 | extractor.load_document(input = text, stoplist = stoplist, language = 'en') 40 | extractor.candidate_selection(pos = {'NOUN', 'PROPN', 'ADJ'}) 41 | extractor.candidate_weighting(alpha = 1.1, threshold = 0.74, method = 'average') 42 | keyphrases = [key for key,_ in extractor.get_n_best(n = top_n)] 43 | return keyphrases 44 | 45 | @counter 46 | def positionrank(text, top_n = 10): 47 | extractor = pke.unsupervised.PositionRank() 48 | extractor.load_document(input = text, language = 'en', normalization = None) 49 | extractor.candidate_selection(grammar = "NP: {<ADJ>*<NOUN|PROPN>+}", maximum_word_number = 3) 50 | extractor.candidate_weighting(window = 10, pos = {'NOUN', 'PROPN', 'ADJ'}) 51 | keyphrases = [key for key,_ in extractor.get_n_best(n = top_n)] 52 | return keyphrases 53 | 54 | @counter 55 | def topicalpagerank(text, top_n = 10): 56 | lda_model_file = r'..\pke\models\lda-1000-semeval2010.py3.pickle.gz' 57 | extractor = pke.unsupervised.TopicalPageRank() 58 | extractor.load_document(input = text, language = 'en', normalization = None) 59 | extractor.candidate_selection(grammar = "NP: {<ADJ>*<NOUN|PROPN>+}") 60 | extractor.candidate_weighting(window = 10, pos = {'NOUN', 'PROPN', 'ADJ'}, lda_model = lda_model_file) 61 | keyphrases = [key for key,_ in extractor.get_n_best(n
= top_n)] 62 | return keyphrases 63 | 64 | @counter 65 | def singlerank(text, top_n = 10): 66 | extractor = pke.unsupervised.SingleRank() 67 | extractor.load_document(input = text, language = 'en', normalization = None) 68 | extractor.candidate_selection(pos = {'NOUN', 'PROPN', 'ADJ'}) 69 | extractor.candidate_weighting(window = 10, pos = {'NOUN', 'PROPN', 'ADJ'}) 70 | keyphrases = [key for key,_ in extractor.get_n_best(n = top_n)] 71 | return keyphrases 72 | 73 | @counter 74 | def textrank(text, top_n = 10): 75 | extractor = pke.unsupervised.TextRank() 76 | extractor.load_document(input = text, language = 'en', normalization = None) 77 | extractor.candidate_weighting(window = 2, pos = {'NOUN', 'PROPN', 'ADJ'}, top_percent = 0.33) 78 | keyphrases = [key for key,_ in extractor.get_n_best(n = top_n)] 79 | return keyphrases 80 | 81 | @counter 82 | def topicrank(text, top_n = 10): 83 | extractor = pke.unsupervised.TopicRank() 84 | stoplist = list(string.punctuation) + list(pke.lang.stopwords.get('en')) 85 | extractor.load_document(input = text, stoplist = stoplist, language = 'en') 86 | extractor.candidate_selection(pos = {'NOUN', 'PROPN', 'ADJ'}) 87 | extractor.candidate_weighting(threshold = 0.74, method = 'average') 88 | keyphrases = [key for key,_ in extractor.get_n_best(n = top_n)] 89 | return keyphrases 90 | 91 | 92 | @counter 93 | def py_textrank(nlp, text, top_n = 10): 94 | nlp.add_pipe('textrank') 95 | doc = nlp(text) 96 | nlp.remove_pipe('textrank') 97 | 98 | keyphrases = [ 99 | phrase.text for phrase in doc._.phrases 100 | ] 101 | return keyphrases[:top_n] 102 | 103 | @counter 104 | def py_positionrank(nlp, text, top_n = 10): 105 | nlp.add_pipe('positionrank') 106 | doc = nlp(text) 107 | nlp.remove_pipe('positionrank') 108 | 109 | keyphrases = [ 110 | phrase.text for phrase in doc._.phrases 111 | ] 112 | return keyphrases[:top_n] 113 | 114 | @counter 115 | def py_topicrank(nlp, text, top_n = 10): 116 | nlp.add_pipe('topicrank') 117 | doc = nlp(text) 118 | nlp.remove_pipe('topicrank') 119 | 120 | keyphrases = [ 121 | phrase.text for phrase in doc._.phrases 122 | ] 123 | return keyphrases[:top_n] 124 | 125 | @counter 126 | def yake_ke(text, top_n = 10): 127 | custom_kw_extractor = yake.KeywordExtractor(lan = "en", n = 3, dedupLim = 0.9, dedupFunc = 'seqm', windowsSize = 1, top = 10, features=None) 128 | keywords = [key for key,_ in custom_kw_extractor.extract_keywords(text)] 129 | return keywords 130 | 131 | 132 | def single_test(): 133 | text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types." 134 | 135 | # load a spaCy model, depending on language, scale, etc. 
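# The model must be downloaded first (python -m spacy download en_core_web_sm), as noted in the README of this directory.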
136 | nlp = spacy.load("en_core_web_sm") 137 | 138 | print(kpminer(text)) 139 | print(mprank(text)) 140 | print(topicalpagerank(text)) 141 | print(singlerank(text)) 142 | print('\n\n') 143 | 144 | print('\n\n') 145 | print(textrank(text)) 146 | print(py_textrank(nlp, text)) 147 | 148 | print('\n\n') 149 | print(positionrank(text)) 150 | print(py_positionrank(nlp, text)) 151 | 152 | print('\n\n') 153 | print(topicrank(text)) 154 | print(py_topicrank(nlp, text)) 155 | print(yake_ke(text)) 156 | return 157 | 158 | def main(): 159 | nlp = spacy.load('en_core_web_sm') 160 | method_name = 'textrank' 161 | method = { 162 | 'kpminer': lambda nlp, text: kpminer(text), 163 | 'mprank': lambda nlp, text: mprank(text), 164 | 'topicalpagerank': lambda nlp, text: topicalpagerank(text), 165 | 'singlerank': lambda nlp, text: singlerank(text), 166 | 'pytextrank': lambda nlp, text: py_textrank(nlp, text), 167 | 'textrank': lambda nlp, text: textrank(text), 168 | 'positionrank': lambda nlp, text: positionrank(text), 169 | 'pypositionrank': lambda nlp, text: py_positionrank(nlp, text), 170 | 'topicrank': lambda nlp, text: topicrank(text), 171 | 'pytopicrank': lambda nlp, text: py_topicrank(nlp, text), 172 | 'yake': lambda nlp, text: yake_ke(text) 173 | } 174 | 175 | base_path = r'..\datasets\Krapivin2009' 176 | input_dir = os.path.join(base_path, 'docsutf8') 177 | output_dir = os.path.join(base_path, f'extracted\{method_name}') 178 | print(os.getcwd()) 179 | 180 | # Set the current directory to the input dir 181 | os.chdir(os.path.join(os.getcwd(), input_dir)) 182 | 183 | # Get all file names and their absolute paths. 184 | docnames = sorted(os.listdir()) 185 | docpaths = list(map(os.path.abspath, docnames)) 186 | 187 | # Create the keys directory, after the names and paths are loaded. 188 | pathlib.Path(output_dir).mkdir(parents = True, exist_ok = True) 189 | 190 | for i, (docname, docpath) in enumerate(zip(docnames, docpaths)): 191 | 192 | #if i < 225: continue 193 | # keys shows up in docnames, erroneously. 194 | if docname == 'keys': 195 | continue 196 | 197 | print(f'Processing {i} out of {len(docnames)}...') 198 | 199 | # Save the output dir path 200 | output_dirpath = os.path.join(output_dir, docname.split('.')[0]+'.key') 201 | print(output_dirpath) 202 | 203 | with open(docpath, 'r', encoding = 'utf-8-sig', errors = 'ignore') as file, \ 204 | open(output_dirpath, 'w', encoding = 'utf-8-sig', errors = 'ignore') as out: 205 | 206 | # Read the file and remove the newlines. 207 | text = file.read().replace('\n', ' ') 208 | 209 | # Extract the top 10 keyphrases. 
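# method[method_name] dispatches to the selected extractor via the lambda table defined above; any extractor error is silently skipped, leaving that document's .key file empty.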
210 | try: 211 | ranked_list = method[method_name](nlp, text) 212 | keys = '\n'.join(map(str, ranked_list) or '') 213 | out.write(keys) 214 | except Exception: 215 | pass 216 | 217 | os.system('clear') 218 | 219 | 220 | if __name__ == '__main__': main() 221 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use a base image that comes with NumPy and SciPy pre-installed 2 | FROM publysher/alpine-scipy:1.0.0-numpy1.14.0-python3.6-alpine3.7 3 | # Because of the image, our versions differ from those in the requirements.txt: 4 | # numpy==1.14.0 (instead of 1.13.1) 5 | # scipy==1.0.0 (instead of 0.19.1) 6 | 7 | # Install Java for Stanford Tagger 8 | RUN apk --update add openjdk8-jre 9 | # Set environment 10 | ENV JAVA_HOME /opt/jdk 11 | ENV PATH ${PATH}:${JAVA_HOME}/bin 12 | 13 | # Download CoreNLP full Stanford Tagger for English 14 | RUN wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip && \ 15 | unzip stanford-corenlp-full-*.zip && \ 16 | rm stanford-corenlp-full-*.zip && \ 17 | mv stanford-corenlp-full-* stanford-corenlp 18 | 19 | # Install sent2vec 20 | RUN apk add --update git g++ make && \ 21 | git clone https://github.com/epfml/sent2vec && \ 22 | cd sent2vec && \ 23 | git checkout f827d014a473aa22b2fef28d9e29211d50808d48 && \ 24 | make && \ 25 | apk del git make && \ 26 | rm -rf /var/cache/apk/* && \ 27 | pip install cython && \ 28 | cd src && \ 29 | python setup.py build_ext && \ 30 | pip install . 31 | 32 | 33 | 34 | # Install requirements 35 | WORKDIR /app 36 | ADD requirements.txt . 37 | # Remove NumPy and SciPy from the requirements before installing the rest 38 | RUN cd /app && \ 39 | sed -i '/^numpy.*$/d' requirements.txt && \ 40 | sed -i '/^scipy.*$/d' requirements.txt && \ 41 | pip install -r requirements.txt 42 | 43 | # Download NLTK data 44 | RUN python -c "import nltk; nltk.download('punkt')" 45 | 46 | # Set the paths in config.ini 47 | ADD config.ini.template config.ini 48 | RUN sed -i '6 c\host = localhost' config.ini && \ 49 | sed -i '7 c\port = 9000' config.ini && \ 50 | sed -i '10 c\model_path = /sent2vec/pretrained_model.bin' config.ini 51 | 52 | # Add actual source code 53 | ADD swisscom_ai swisscom_ai/ 54 | ADD launch.py . 55 | 56 | ENTRYPOINT ["/bin/sh"] -------------------------------------------------------------------------------- /KeyExt/EmbedRank/README.md: -------------------------------------------------------------------------------- 1 | # EmbedRank 2 | 3 | This directory contains the modified code for the [EmbedRank](https://github.com/swisscom/ai-research-keyphrase-extraction) approach. 4 | 5 | ## Setup 6 | Follow the install instructions from the original repo. 7 | Afterwards replace the files with the modified ones. 8 | In `main.py`, `base_path` needs to be set for the dataset directory. 9 | In `benchmark.py`, `output_path` needs to be set to a local output path. 
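For reference, the snippet below is a minimal sketch of how the pipeline can be driven programmatically through `launch.py`, assuming `config.ini` points to a downloaded sent2vec model and a CoreNLP server is reachable at the configured host and port (the sample text is a placeholder):

```python
import launch

# Load the sent2vec model and connect to the CoreNLP POS tagger,
# using the model path, host and port defined in config.ini.
embedding_distributor = launch.load_local_embedding_distributor()
pos_tagger = launch.load_local_corenlp_pos_tagger()

text = 'Keyphrase extraction identifies the phrases that best summarize a document.'

# extract_keyphrases returns a (keyphrases, relevance_scores, aliases) tuple; keep the top 10.
keyphrases, scores, aliases = launch.extract_keyphrases(
    embedding_distributor, pos_tagger, text, 10, 'en'
)
print(keyphrases)
```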
10 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/config.ini: -------------------------------------------------------------------------------- 1 | [STANFORDTAGGER] 2 | jar_path = 3 | model_directory_path = 4 | 5 | [STANFORDCORENLPTAGGER] 6 | host = localhost 7 | port = 9000 8 | 9 | [SENT2VEC] 10 | model_path = ./wiki_bigrams.bin -------------------------------------------------------------------------------- /KeyExt/EmbedRank/extract_keys_from_embedrank.py: -------------------------------------------------------------------------------- 1 | import os 2 | import launch 3 | import pathlib 4 | 5 | base_path = '../datasets/DUC-2001/' 6 | input_dir = os.path.join(base_path, 'docsutf8') 7 | output_dir = os.path.join(base_path, 'extracted/embedrank') 8 | 9 | embedding_distributor = launch.load_local_embedding_distributor() 10 | pos_tagger = launch.load_local_corenlp_pos_tagger() 11 | 12 | # Set the current directory to the input dir 13 | os.chdir(os.path.join(os.getcwd(), input_dir)) 14 | 15 | # Get all file names and their absolute paths. 16 | docnames = sorted(os.listdir()) 17 | docpaths = list(map(os.path.abspath, docnames)) 18 | 19 | # Create the keys directory, after the names and paths are loaded. 20 | pathlib.Path(output_dir).mkdir(parents = True, exist_ok = True) 21 | 22 | for i, (docname, docpath) in enumerate(zip(docnames, docpaths)): 23 | 24 | # keys shows up in docnames, erroneously. 25 | if docname == 'keys': 26 | continue 27 | 28 | print(f'Processing {i} out of {len(docnames)}...') 29 | 30 | # Save the output dir path 31 | output_dirpath = os.path.join(output_dir, docname.split('.')[0]+'.key') 32 | print(output_dirpath) 33 | 34 | with open(docpath, 'r', encoding = 'utf-8-sig', errors = 'ignore') as file, \ 35 | open(output_dirpath, 'w', encoding = 'utf-8-sig', errors = 'ignore') as out: 36 | # Read the file and remove the newlines. 37 | text = file.read().replace('\n', ' ') 38 | # Extract the top 10 keyphrases. 
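# launch.extract_keyphrases returns a (keyphrases, scores, aliases) tuple, so kp1[0] below is the list of keyphrase strings.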
39 | try: 40 | kp1 = launch.extract_keyphrases(embedding_distributor, pos_tagger, text, 10, 'en') 41 | keys = "\n".join(kp1[0] or '') 42 | out.write(keys) 43 | except Exception: 44 | pass 45 | 46 | os.system('clear') -------------------------------------------------------------------------------- /KeyExt/EmbedRank/launch.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | from configparser import ConfigParser 3 | 4 | from swisscom_ai.research_keyphrase.embeddings.emb_distrib_local import EmbeddingDistributorLocal 5 | from swisscom_ai.research_keyphrase.model.input_representation import InputTextObj 6 | from swisscom_ai.research_keyphrase.model.method import MMRPhrase 7 | from swisscom_ai.research_keyphrase.preprocessing.postagging import PosTaggingCoreNLP 8 | from swisscom_ai.research_keyphrase.util.fileIO import read_file 9 | 10 | 11 | def extract_keyphrases(embedding_distrib, ptagger, raw_text, N, lang, beta=0.55, alias_threshold=0.7): 12 | """ 13 | Method that extracts a set of keyphrases 14 | 15 | :param embedding_distrib: An Embedding Distributor object see @EmbeddingDistributor 16 | :param ptagger: A Pos Tagger object see @PosTagger 17 | :param raw_text: A string containing the raw text to extract keyphrases from 18 | :param N: The number of keyphrases to extract 19 | :param lang: The language 20 | :param beta: beta factor for MMR (tradeoff informativeness/diversity) 21 | :param alias_threshold: threshold to group candidates as aliases 22 | :return: A tuple with 3 elements : 23 | 1)list of the top-N candidates (or less if there are not enough candidates) (list of string) 24 | 2)list of associated relevance scores (list of float) 25 | 3)list containing for each keyphrase a list of aliases (list of list of string) 26 | """ 27 | tagged = ptagger.pos_tag_raw_text(raw_text) 28 | text_obj = InputTextObj(tagged, lang) 29 | return MMRPhrase(embedding_distrib, text_obj, N=N, beta=beta, alias_threshold=alias_threshold) 30 | 31 | 32 | def load_local_embedding_distributor(): 33 | config_parser = ConfigParser() 34 | config_parser.read('config.ini') 35 | sent2vec_model_path = config_parser.get('SENT2VEC', 'model_path') 36 | return EmbeddingDistributorLocal(sent2vec_model_path) 37 | 38 | 39 | def load_local_corenlp_pos_tagger(host=None, port=None): 40 | config_parser = ConfigParser() 41 | config_parser.read('config.ini') 42 | host = host or config_parser.get('STANFORDCORENLPTAGGER', 'host') 43 | port = port or config_parser.get('STANFORDCORENLPTAGGER', 'port') 44 | return PosTaggingCoreNLP(host, port) 45 | 46 | 47 | if __name__ == '__main__': 48 | parser = argparse.ArgumentParser(description='Extract keyphrases from raw text') 49 | 50 | group = parser.add_mutually_exclusive_group(required=True) 51 | group.add_argument('-raw_text', help='raw text to process') 52 | group.add_argument('-text_file', help='file containing the raw text to process') 53 | 54 | 55 | parser.add_argument('-tagger_host', help='CoreNLP host', default='localhost') 56 | parser.add_argument('-tagger_port', help='CoreNLP port', default=9000) 57 | parser.add_argument('-N', help='number of keyphrases to extract', required=True, type=int) 58 | args = parser.parse_args() 59 | 60 | if args.text_file: 61 | raw_text = read_file(args.text_file) 62 | else: 63 | raw_text = args.raw_text 64 | 65 | embedding_distributor = load_local_embedding_distributor() 66 | pos_tagger = load_local_corenlp_pos_tagger(args.tagger_host, args.tagger_port) 67 | print(extract_keyphrases(embedding_distributor, pos_tagger, raw_text, args.N, 'en')) 68 |
-------------------------------------------------------------------------------- /KeyExt/EmbedRank/launch.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/EmbedRank/launch.pyc -------------------------------------------------------------------------------- /KeyExt/EmbedRank/requirements.txt: -------------------------------------------------------------------------------- 1 | langdetect==1.0.7 2 | nltk==3.4.1 3 | numpy==1.14.3 4 | scikit-learn==0.19.0 5 | scipy==0.19.1 6 | six==1.10.0 7 | requests==2.21.0 -------------------------------------------------------------------------------- /KeyExt/EmbedRank/setup.cfg: -------------------------------------------------------------------------------- 1 | [flake8] 2 | max-line-length = 120 3 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/setup.py: -------------------------------------------------------------------------------- 1 | """A setuptools based setup module. 2 | 3 | See: 4 | https://packaging.python.org/en/latest/distributing.html 5 | https://github.com/pypa/sampleproject 6 | """ 7 | from codecs import open 8 | 9 | from setuptools import setup, find_packages 10 | 11 | with open('requirements.txt') as f: 12 | required = f.read().splitlines() 13 | 14 | setup( 15 | name='swisscom_ai.research_keyphrase', 16 | 17 | # Versions should comply with PEP440. For a discussion on single-sourcing 18 | # the version across setup.py and the project code, see 19 | # https://packaging.python.org/en/latest/single_source_version.html 20 | version='0.9.5', 21 | 22 | description='Swisscom AI Research Keyphrase Extraction', 23 | url='https://github.com/swisscom/ai-research-keyphrase-extraction', 24 | 25 | author='Swisscom (Schweiz) AG', 26 | 27 | # See https://pypi.python.org/pypi?%3Aaction=list_classifiers 28 | classifiers=[ 29 | 'Programming Language :: Python :: 3.6', 30 | ], 31 | 32 | # You can just specify the packages manually here if your project is 33 | # simple. Or you can use find_packages(). 34 | packages=find_packages(exclude=['contrib', 'docs', 'tests']), 35 | 36 | package_data={'swisscom_ai.research_keyphrase': []}, 37 | include_package_data=True, 38 | 39 | # List run-time dependencies here. These will be installed by pip when 40 | # your project is installed. For an analysis of "install_requires" vs pip's 41 | # requirements files see: 42 | # https://packaging.python.org/en/latest/requirements.html 43 | install_requires=required, 44 | 45 | # List additional groups of dependencies here (e.g. development 46 | # dependencies). 
You can install these using the following syntax, 47 | # for example: 48 | # $ pip install -e .[dev,test] 49 | extras_require={ 50 | 'dev': [], 51 | 'test': [], 52 | }, 53 | ) 54 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/EmbedRank/swisscom_ai/__init__.py -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/EmbedRank/swisscom_ai/research_keyphrase/__init__.py -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/embeddings/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/EmbedRank/swisscom_ai/research_keyphrase/embeddings/__init__.py -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/embeddings/emb_distrib_interface.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | from abc import ABC, abstractmethod 7 | 8 | 9 | class Singleton(type): 10 | _instances = {} 11 | 12 | def __call__(cls, *args, **kwargs): 13 | if cls not in cls._instances: 14 | cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs) 15 | return cls._instances[cls] 16 | 17 | 18 | class EmbeddingDistributor(ABC): 19 | """ 20 | Abstract class in charge of providing the embeddings of piece of texts 21 | """ 22 | @abstractmethod 23 | def get_tokenized_sents_embeddings(self, sents): 24 | """ 25 | Generate a numpy ndarray with the embedding of each element of sent in each row 26 | :param sents: list of string (sentences/phrases) 27 | :return: ndarray with shape (len(sents), dimension of embeddings) 28 | """ 29 | pass 30 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/embeddings/emb_distrib_local.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 
3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | import numpy as np 7 | 8 | from swisscom_ai.research_keyphrase.embeddings.emb_distrib_interface import EmbeddingDistributor 9 | import sent2vec 10 | 11 | 12 | class EmbeddingDistributorLocal(EmbeddingDistributor): 13 | """ 14 | Concrete class of @EmbeddingDistributor using a local installation of sent2vec 15 | https://github.com/epfml/sent2vec 16 | 17 | """ 18 | 19 | def __init__(self, fasttext_model): 20 | self.model = sent2vec.Sent2vecModel() 21 | self.model.load_model(fasttext_model) 22 | 23 | def get_tokenized_sents_embeddings(self, sents): 24 | """ 25 | @see EmbeddingDistributor 26 | """ 27 | for sent in sents: 28 | if '\n' in sent: 29 | raise RuntimeError('New line is not allowed inside a sentence') 30 | 31 | return self.model.embed_sentences(sents) 32 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/model/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/EmbedRank/swisscom_ai/research_keyphrase/model/__init__.py -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/model/extractor.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | """Contain method that return list of candidate""" 7 | 8 | import re 9 | 10 | import nltk 11 | 12 | GRAMMAR_EN = """ NP: 13 | {*} # Adjective(s)(optional) + Noun(s)""" 14 | 15 | GRAMMAR_DE = """ 16 | NBAR: 17 | {*+} # [Adjective(s) or Article(s) or Posessive pronoun](optional) + Noun(s) 18 | {+*+} 19 | 20 | NP: 21 | {*}# Above, connected with APPR and APPART (beim vom) 22 | {+} 23 | """ 24 | 25 | GRAMMAR_FR = """ NP: 26 | {*+*} # Adjective(s)(optional) + Noun(s) + Adjective(s)(optional)""" 27 | 28 | 29 | def get_grammar(lang): 30 | if lang == 'en': 31 | grammar = GRAMMAR_EN 32 | elif lang == 'de': 33 | grammar = GRAMMAR_DE 34 | elif lang == 'fr': 35 | grammar = GRAMMAR_FR 36 | else: 37 | raise ValueError('Language not handled') 38 | return grammar 39 | 40 | 41 | def extract_candidates(text_obj, no_subset=False): 42 | """ 43 | Based on part of speech return a list of candidate phrases 44 | :param text_obj: Input text Representation see @InputTextObj 45 | :param no_subset: if true won't put a candidate which is the subset of an other candidate 46 | :param lang: language (currently en, fr and de are supported) 47 | :return: list of candidate phrases (string) 48 | """ 49 | 50 | keyphrase_candidate = set() 51 | 52 | np_parser = nltk.RegexpParser(get_grammar(text_obj.lang)) # Noun phrase parser 53 | trees = np_parser.parse_sents(text_obj.pos_tagged) # Generator with one tree per sentence 54 | 55 | for tree in trees: 56 | for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'): # For each nounphrase 57 | # Concatenate the token with a space 58 | keyphrase_candidate.add(' '.join(word for word, tag in subtree.leaves())) 59 | 60 | keyphrase_candidate = {kp for kp in keyphrase_candidate if len(kp.split()) <= 5} 61 | 62 | if no_subset: 63 | keyphrase_candidate = unique_ngram_candidates(keyphrase_candidate) 64 | else: 65 | keyphrase_candidate = list(keyphrase_candidate) 66 | 67 | return keyphrase_candidate 68 | 69 | 
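# Illustrative example (hypothetical input): for a sentence POS-tagged as [('convex', 'JJ'), ('optimization', 'NN'), ('improves', 'VBZ'), ('keyphrase', 'NN'), ('extraction', 'NN')], the English grammar above keeps only adjective/noun chunks, so extract_candidates returns the candidates 'convex optimization' and 'keyphrase extraction'; the verb splits the noun phrase in two.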
70 | def extract_sent_candidates(text_obj): 71 | """ 72 | 73 | :param text_obj: input Text Representation see @InputTextObj 74 | :return: list of tokenized sentence (string) , each token is separated by a space in the string 75 | """ 76 | return [(' '.join(word for word, tag in sent)) for sent in text_obj.pos_tagged] 77 | 78 | 79 | def unique_ngram_candidates(strings): 80 | """ 81 | ['machine learning', 'machine', 'backward induction', 'induction', 'start'] -> 82 | ['backward induction', 'start', 'machine learning'] 83 | :param strings: List of string 84 | :return: List of string where no string is fully contained inside another string 85 | """ 86 | results = [] 87 | for s in sorted(set(strings), key=len, reverse=True): 88 | if not any(re.search(r'\b{}\b'.format(re.escape(s)), r) for r in results): 89 | results.append(s) 90 | return results 91 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/model/input_representation.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | from nltk.stem import PorterStemmer 7 | 8 | 9 | class InputTextObj: 10 | """Represent the input text in which we want to extract keyphrases""" 11 | 12 | def __init__(self, pos_tagged, lang, stem=False, min_word_len=3): 13 | """ 14 | :param pos_tagged: List of list : Text pos_tagged as a list of sentences 15 | where each sentence is a list of tuple (word, TAG). 16 | :param stem: If we want to apply stemming on the text. 17 | """ 18 | self.min_word_len = min_word_len 19 | self.considered_tags = {'NN', 'NNS', 'NNP', 'NNPS', 'JJ'} 20 | self.pos_tagged = [] 21 | self.filtered_pos_tagged = [] 22 | self.isStemmed = stem 23 | self.lang = lang 24 | 25 | if stem: 26 | stemmer = PorterStemmer() 27 | self.pos_tagged = [[(stemmer.stem(t[0]), t[1]) for t in sent] for sent in pos_tagged] 28 | else: 29 | self.pos_tagged = [[(t[0].lower(), t[1]) for t in sent] for sent in pos_tagged] 30 | 31 | temp = [] 32 | for sent in self.pos_tagged: 33 | s = [] 34 | for elem in sent: 35 | if len(elem[0]) < min_word_len: 36 | s.append((elem[0], 'LESS')) 37 | else: 38 | s.append(elem) 39 | temp.append(s) 40 | 41 | self.pos_tagged = temp 42 | # Convert some language-specific tag (NC, NE to NN) or ADJA ->JJ see convert method. 
43 | if lang in ['fr', 'de']: 44 | self.pos_tagged = [[(tagged_token[0], convert(tagged_token[1])) for tagged_token in sentence] for sentence 45 | in 46 | self.pos_tagged] 47 | self.filtered_pos_tagged = [[(t[0].lower(), t[1]) for t in sent if self.is_candidate(t)] for sent in 48 | self.pos_tagged] 49 | 50 | def is_candidate(self, tagged_token): 51 | """ 52 | 53 | :param tagged_token: tuple (word, tag) 54 | :return: True if its a valid candidate word 55 | """ 56 | return tagged_token[1] in self.considered_tags 57 | 58 | def extract_candidates(self): 59 | """ 60 | :return: set of all candidates word 61 | """ 62 | return {tagged_token[0].lower() 63 | for sentence in self.pos_tagged 64 | for tagged_token in sentence 65 | if self.is_candidate(tagged_token) and len(tagged_token[0]) >= self.min_word_len 66 | } 67 | 68 | 69 | def convert(fr_or_de_tag): 70 | if fr_or_de_tag in {'NN', 'NNE', 'NE', 'N', 'NPP', 'NC', 'NOUN'}: 71 | return 'NN' 72 | elif fr_or_de_tag in {'ADJA', 'ADJ'}: 73 | return 'JJ' 74 | else: 75 | return fr_or_de_tag 76 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/model/method.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | import warnings 7 | 8 | import numpy as np 9 | from sklearn.metrics.pairwise import cosine_similarity 10 | 11 | from swisscom_ai.research_keyphrase.model.methods_embeddings import extract_candidates_embedding_for_doc, \ 12 | extract_doc_embedding, extract_sent_candidates_embedding_for_doc 13 | 14 | 15 | def _MMR(embdistrib, text_obj, candidates, X, beta, N, use_filtered, alias_threshold): 16 | """ 17 | Core method using Maximal Marginal Relevance in charge to return the top-N candidates 18 | 19 | :param embdistrib: embdistrib: embedding distributor see @EmbeddingDistributor 20 | :param text_obj: Input text representation see @InputTextObj 21 | :param candidates: list of candidates (string) 22 | :param X: numpy array with the embedding of each candidate in each row 23 | :param beta: hyperparameter beta for MMR (control tradeoff between informativeness and diversity) 24 | :param N: number of candidates to extract 25 | :param use_filtered: if true filter the text by keeping only candidate word before computing the doc embedding 26 | :return: A tuple with 3 elements : 27 | 1)list of the top-N candidates (or less if there are not enough candidates) (list of string) 28 | 2)list of associated relevance scores (list of float) 29 | 3)list containing for each keyphrase a list of alias (list of list of string) 30 | """ 31 | 32 | N = min(N, len(candidates)) 33 | doc_embedd = extract_doc_embedding(embdistrib, text_obj, use_filtered) # Extract doc embedding 34 | doc_sim = cosine_similarity(X, doc_embedd.reshape(1, -1)) 35 | 36 | doc_sim_norm = doc_sim/np.max(doc_sim) 37 | doc_sim_norm = 0.5 + (doc_sim_norm - np.average(doc_sim_norm)) / np.std(doc_sim_norm) 38 | 39 | sim_between = cosine_similarity(X) 40 | np.fill_diagonal(sim_between, np.NaN) 41 | 42 | sim_between_norm = sim_between/np.nanmax(sim_between, axis=0) 43 | sim_between_norm = \ 44 | 0.5 + (sim_between_norm - np.nanmean(sim_between_norm, axis=0)) / np.nanstd(sim_between_norm, axis=0) 45 | 46 | selected_candidates = [] 47 | unselected_candidates = [c for c in range(len(candidates))] 48 | 49 | j = np.argmax(doc_sim) 50 | selected_candidates.append(j) 
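# The candidate most similar to the document is selected first; each of the remaining N-1 slots below is filled with the unselected candidate maximizing beta * (normalized similarity to the document) - (1 - beta) * (max normalized similarity to the already selected keyphrases), i.e. the MMR trade-off between informativeness and diversity.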
51 | unselected_candidates.remove(j) 52 | 53 | for _ in range(N - 1): 54 | selec_array = np.array(selected_candidates) 55 | unselec_array = np.array(unselected_candidates) 56 | 57 | distance_to_doc = doc_sim_norm[unselec_array, :] 58 | dist_between = sim_between_norm[unselec_array][:, selec_array] 59 | if dist_between.ndim == 1: 60 | dist_between = dist_between[:, np.newaxis] 61 | j = np.argmax(beta * distance_to_doc - (1 - beta) * np.max(dist_between, axis=1).reshape(-1, 1)) 62 | item_idx = unselected_candidates[j] 63 | selected_candidates.append(item_idx) 64 | unselected_candidates.remove(item_idx) 65 | 66 | # Not using normalized version of doc_sim for computing relevance 67 | relevance_list = max_normalization(doc_sim[selected_candidates]).tolist() 68 | aliases_list = get_aliases(sim_between[selected_candidates, :], candidates, alias_threshold) 69 | 70 | return candidates[selected_candidates].tolist(), relevance_list, aliases_list 71 | 72 | 73 | def MMRPhrase(embdistrib, text_obj, beta=0.65, N=10, use_filtered=True, alias_threshold=0.8): 74 | """ 75 | Extract N keyphrases 76 | 77 | :param embdistrib: embedding distributor see @EmbeddingDistributor 78 | :param text_obj: Input text representation see @InputTextObj 79 | :param beta: hyperparameter beta for MMR (control tradeoff between informativeness and diversity) 80 | :param N: number of keyphrases to extract 81 | :param use_filtered: if true filter the text by keeping only candidate word before computing the doc embedding 82 | :return: A tuple with 3 elements : 83 | 1)list of the top-N candidates (or less if there are not enough candidates) (list of string) 84 | 2)list of associated relevance scores (list of float) 85 | 3)list containing for each keyphrase a list of alias (list of list of string) 86 | """ 87 | candidates, X = extract_candidates_embedding_for_doc(embdistrib, text_obj) 88 | 89 | if len(candidates) == 0: 90 | warnings.warn('No keyphrase extracted for this document') 91 | return None, None, None 92 | 93 | return _MMR(embdistrib, text_obj, candidates, X, beta, N, use_filtered, alias_threshold) 94 | 95 | 96 | def MMRSent(embdistrib, text_obj, beta=0.5, N=10, use_filtered=True): 97 | """ 98 | 99 | Extract N key sentences 100 | 101 | :param embdistrib: embedding distributor see @EmbeddingDistributor 102 | :param text_obj: Input text representation see @InputTextObj 103 | :param beta: hyperparameter beta for MMR (control tradeoff between informativeness and diversity) 104 | :param N: number of key sentences to extract 105 | :param use_filtered: if true filter the text by keeping only candidate word before computing the doc embedding 106 | :return: list of N key sentences (or less if there are not enough candidates) 107 | """ 108 | candidates, X = extract_sent_candidates_embedding_for_doc(embdistrib, text_obj) 109 | 110 | if len(candidates) == 0: 111 | warnings.warn('No keysentence extracted for this document') 112 | return [] 113 | 114 | return _MMR(embdistrib, text_obj, candidates, X, beta, N, use_filtered) 115 | 116 | 117 | def max_normalization(array): 118 | """ 119 | Compute maximum normalization (max is set to 1) of the array 120 | :param array: 1-d array 121 | :return: 1-d array max- normalized : each value is multiplied by 1/max value 122 | """ 123 | return 1/np.max(array) * array.squeeze(axis=1) 124 | 125 | 126 | def get_aliases(kp_sim_between, candidates, threshold): 127 | """ 128 | Find candidates which are very similar to the keyphrases (aliases) 129 | :param kp_sim_between: ndarray of shape (nb_kp , nb candidates) 
containing the similarity 130 | of each kp with all the candidates. Note that the similarity between the keyphrase and itself should be set to 131 | NaN or 0 132 | :param candidates: array of candidates (array of string) 133 | :return: list containing for each keyphrase a list that contain candidates which are aliases 134 | (very similar) (list of list of string) 135 | """ 136 | 137 | kp_sim_between = np.nan_to_num(kp_sim_between, 0) 138 | idx_sorted = np.flip(np.argsort(kp_sim_between), 1) 139 | aliases = [] 140 | for kp_idx, item in enumerate(idx_sorted): 141 | alias_for_item = [] 142 | for i in item: 143 | if kp_sim_between[kp_idx, i] >= threshold: 144 | alias_for_item.append(candidates[i]) 145 | else: 146 | break 147 | aliases.append(alias_for_item) 148 | 149 | return aliases 150 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/model/methods_embeddings.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | import numpy as np 7 | 8 | from swisscom_ai.research_keyphrase.model.extractor import extract_candidates, extract_sent_candidates 9 | 10 | 11 | def extract_doc_embedding(embedding_distrib, inp_rpr, use_filtered=False): 12 | """ 13 | Return the embedding of the full document 14 | 15 | :param embedding_distrib: embedding distributor see @EmbeddingDistributor 16 | :param inp_rpr: input text representation see @InputTextObj 17 | :param use_filtered: if true keep only candidate words in the raw text before computing the embedding 18 | :return: numpy array of shape (1, dimension of embeddings) that contains the document embedding 19 | """ 20 | if use_filtered: 21 | tagged = inp_rpr.filtered_pos_tagged 22 | else: 23 | tagged = inp_rpr.pos_tagged 24 | 25 | tokenized_doc_text = ' '.join(token[0].lower() for sent in tagged for token in sent) 26 | return embedding_distrib.get_tokenized_sents_embeddings([tokenized_doc_text]) 27 | 28 | 29 | def extract_candidates_embedding_for_doc(embedding_distrib, inp_rpr): 30 | """ 31 | 32 | Return the list of candidate phrases as well as the associated numpy array that contains their embeddings. 33 | Note that candidates phrases extracted by PosTag rules which are uknown (in term of embeddings) 34 | will be removed from the candidates. 35 | 36 | :param embedding_distrib: embedding distributor see @EmbeddingDistributor 37 | :param inp_rpr: input text representation see @InputTextObj 38 | :return: A tuple of two element containing 1) the list of candidate phrases 39 | 2) a numpy array of shape (number of candidate phrases, dimension of embeddings : 40 | each row is the embedding of one candidate phrase 41 | """ 42 | candidates = np.array(extract_candidates(inp_rpr)) # List of candidates based on PosTag rules 43 | if len(candidates) > 0: 44 | embeddings = np.array(embedding_distrib.get_tokenized_sents_embeddings(candidates)) # Associated embeddings 45 | valid_candidates_mask = ~np.all(embeddings == 0, axis=1) # Only candidates which are not unknown. 46 | return candidates[valid_candidates_mask], embeddings[valid_candidates_mask, :] 47 | else: 48 | return np.array([]), np.array([]) 49 | 50 | 51 | def extract_sent_candidates_embedding_for_doc(embedding_distrib, inp_rpr): 52 | """ 53 | Return the list of candidate senetences as well as the associated numpy array that contains their embeddings. 
54 | Note that candidates sentences which are uknown (in term of embeddings) will be removed from the candidates. 55 | 56 | :param embedding_distrib: embedding distributor see @EmbeddingDistributor 57 | :param inp_rpr: input text representation see @InputTextObj 58 | :return: A tuple of two element containing 1) the list of candidate sentences 59 | 2) a numpy array of shape (number of candidate sentences, dimension of embeddings : 60 | each row is the embedding of one candidate sentence 61 | """ 62 | candidates = np.array(extract_sent_candidates(inp_rpr)) 63 | embeddings = np.array(embedding_distrib.get_tokenized_sents_embeddings(candidates)) 64 | 65 | valid_candidates_mask = ~np.all(embeddings == 0, axis=1) 66 | return candidates[valid_candidates_mask], embeddings[valid_candidates_mask, :] 67 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/preprocessing/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/EmbedRank/swisscom_ai/research_keyphrase/preprocessing/__init__.py -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/preprocessing/custom_stanford.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | """Implementation of StanfordPOSTagger with tokenization in the specific language, s.t. the tag and tag_sent methods 7 | perform tokenization in the specific language. 8 | """ 9 | from nltk.tag import StanfordPOSTagger 10 | 11 | 12 | class EnglishStanfordPOSTagger(StanfordPOSTagger): 13 | 14 | @property 15 | def _cmd(self): 16 | return ['edu.stanford.nlp.tagger.maxent.MaxentTagger', 17 | '-model', self._stanford_model, '-textFile', self._input_file_path, 18 | '-outputFormatOptions', 'keepEmptySentences'] 19 | 20 | 21 | class FrenchStanfordPOSTagger(StanfordPOSTagger): 22 | """ 23 | Taken from github mhkuu/french-learner-corpus 24 | Extends the StanfordPosTagger with a custom command that calls the FrenchTokenizerFactory. 25 | """ 26 | 27 | @property 28 | def _cmd(self): 29 | return ['edu.stanford.nlp.tagger.maxent.MaxentTagger', 30 | '-model', self._stanford_model, '-textFile', 31 | self._input_file_path, '-tokenizerFactory', 32 | 'edu.stanford.nlp.international.french.process.FrenchTokenizer$FrenchTokenizerFactory', 33 | '-outputFormatOptions', 'keepEmptySentences'] 34 | 35 | 36 | class GermanStanfordPOSTagger(StanfordPOSTagger): 37 | """ Use english tokenizer for german """ 38 | 39 | @property 40 | def _cmd(self): 41 | return ['edu.stanford.nlp.tagger.maxent.MaxentTagger', 42 | '-model', self._stanford_model, '-textFile', self._input_file_path, 43 | '-outputFormatOptions', 'keepEmptySentences'] 44 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/preprocessing/postagging.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 
3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | import argparse 7 | import os 8 | import re 9 | import warnings 10 | from abc import ABC, abstractmethod 11 | 12 | # NLTK imports 13 | import nltk 14 | from nltk.tag.util import tuple2str 15 | from nltk.parse import CoreNLPParser 16 | 17 | import swisscom_ai.research_keyphrase.preprocessing.custom_stanford as custom_stanford 18 | from swisscom_ai.research_keyphrase.util.fileIO import read_file, write_string 19 | 20 | # If you want to use spacy , install it and uncomment the following import 21 | # import spacy 22 | 23 | 24 | class PosTagging(ABC): 25 | @abstractmethod 26 | def pos_tag_raw_text(self, text, as_tuple_list=True): 27 | """ 28 | Tokenize and POS tag a string 29 | Sentence level is kept in the result : 30 | Either we have a list of list (for each sentence a list of tuple (word,tag)) 31 | Or a separator [ENDSENT] if we are requesting a string by putting as_tuple_list = False 32 | 33 | Example : 34 | >>from sentkp.preprocessing import postagger as pt 35 | 36 | >>pt = postagger.PosTagger() 37 | 38 | >>pt.pos_tag_raw_text('Write your python code in a .py file. Thank you.') 39 | [ 40 | [('Write', 'VB'), ('your', 'PRP$'), ('python', 'NN'), 41 | ('code', 'NN'), ('in', 'IN'), ('a', 'DT'), ('.', '.'), ('py', 'NN'), ('file', 'NN'), ('.', '.') 42 | ], 43 | [('Thank', 'VB'), ('you', 'PRP'), ('.', '.')] 44 | ] 45 | 46 | >>pt.pos_tag_raw_text('Write your python code in a .py file. Thank you.', as_tuple_list=False) 47 | 48 | 'Write/VB your/PRP$ python/NN code/NN in/IN a/DT ./.[ENDSENT]py/NN file/NN ./.[ENDSENT]Thank/VB you/PRP ./.' 49 | 50 | 51 | >>pt = postagger.PosTagger(separator='_') 52 | >>pt.pos_tag_raw_text('Write your python code in a .py file. Thank you.', as_tuple_list=False) 53 | Write_VB your_PRP$ python_NN code_NN in_IN a_DT ._. py_NN file_NN ._. 54 | Thank_VB you_PRP ._. 55 | 56 | 57 | 58 | :param as_tuple_list: Return result as list of list (word,Pos_tag) 59 | :param text: String to POS tag 60 | :return: POS Tagged string or Tuple list 61 | """ 62 | 63 | pass 64 | 65 | def pos_tag_file(self, input_path, output_path=None): 66 | 67 | """ 68 | POS Tag a file. 69 | Either we have a list of list (for each sentence a list of tuple (word,tag)) 70 | Or a file with the POS tagged text 71 | 72 | Note : The jumpline is only for readibility purpose , when reading a tagged file we'll use again 73 | sent_tokenize to find the sentences boundaries. 74 | 75 | :param input_path: path of the source file 76 | :param output_path: If set write POS tagged text with separator (self.pos_tag_raw_text with as_tuple_list False) 77 | If not set, return list of list of tuple (self.post_tag_raw_text with as_tuple_list = True) 78 | 79 | :return: resulting POS tagged text as a list of list of tuple or nothing if output path is set. 80 | """ 81 | 82 | original_text = read_file(input_path) 83 | 84 | if output_path is not None: 85 | tagged_text = self.pos_tag_raw_text(original_text, as_tuple_list=False) 86 | # Write to the output the POS-Tagged text. 
87 | write_string(tagged_text, output_path) 88 | else: 89 | return self.pos_tag_raw_text(original_text, as_tuple_list=True) 90 | 91 | def pos_tag_and_write_corpora(self, list_of_path, suffix): 92 | """ 93 | POS tag a list of files 94 | It writes the resulting file in the same directory with the same name + suffix 95 | e.g 96 | pos_tag_and_write_corpora(['/Users/user1/text1', '/Users/user1/direct/text2'] , suffix = _POS) 97 | will create 98 | /Users/user1/text1_POS 99 | /Users/user1/direct/text2_POS 100 | 101 | :param list_of_path: list containing the path (as string) of each file to POS Tag 102 | :param suffix: suffix to append at the end of the original filename for the resulting pos_tagged file. 103 | 104 | """ 105 | for path in list_of_path: 106 | output_file_path = path + suffix 107 | if os.path.isfile(path): 108 | self.pos_tag_file(path, output_file_path) 109 | else: 110 | warnings.warn('file ' + output_file_path + 'does not exists') 111 | 112 | 113 | class PosTaggingStanford(PosTagging): 114 | """ 115 | Concrete class of PosTagging using StanfordPOSTokenizer and StanfordPOSTagger 116 | 117 | tokenizer contains the default nltk tokenizer (PhunktSentenceTokenizer). 118 | tagger contains the StanfordPOSTagger object (which also trigger word tokenization see : -tokenize option in Java). 119 | 120 | """ 121 | 122 | def __init__(self, jar_path, model_path_directory, separator='|', lang='en'): 123 | """ 124 | :param model_path_directory: path of the model directory 125 | :param jar_path: path of the jar for StanfordPOSTagger (override the configuration file) 126 | :param separator: Separator between a token and a tag in the resulting string (default : |) 127 | 128 | """ 129 | 130 | if lang == 'en': 131 | model_path = os.path.join(model_path_directory, 'english-left3words-distsim.tagger') 132 | self.sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle') 133 | self.tagger = custom_stanford.EnglishStanfordPOSTagger(model_path, jar_path, java_options='-mx2g') 134 | elif lang == 'de': 135 | model_path = os.path.join(model_path_directory, 'german-hgc.tagger') 136 | self.sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle') 137 | self.tagger = custom_stanford.GermanStanfordPOSTagger(model_path, jar_path, java_options='-mx2g') 138 | elif lang == 'fr': 139 | model_path = os.path.join(model_path_directory, 'french.tagger') 140 | self.sent_tokenizer = nltk.data.load('tokenizers/punkt/french.pickle') 141 | self.tagger = custom_stanford.FrenchStanfordPOSTagger(model_path, jar_path, java_options='-mx2g') 142 | else: 143 | raise ValueError('Language ' + lang + 'not handled') 144 | 145 | self.separator = separator 146 | 147 | def pos_tag_raw_text(self, text, as_tuple_list=True): 148 | """ 149 | Implementation of abstract method from PosTagging 150 | @see PosTagging 151 | """ 152 | tagged_text = self.tagger.tag_sents([self.sent_tokenizer.sentences_from_text(text)]) 153 | 154 | if as_tuple_list: 155 | return tagged_text 156 | return '[ENDSENT]'.join( 157 | [' '.join([tuple2str(tagged_token, self.separator) for tagged_token in sent]) for sent in tagged_text]) 158 | 159 | 160 | class PosTaggingSpacy(PosTagging): 161 | """ 162 | Concrete class of PosTagging using StanfordPOSTokenizer and StanfordPOSTagger 163 | """ 164 | 165 | def __init__(self, nlp=None, separator='|' ,lang='en'): 166 | if not nlp: 167 | print('Loading Spacy model') 168 | # self.nlp = spacy.load(lang, entity=False) 169 | print('Spacy model loaded ' + lang) 170 | else: 171 | self.nlp = nlp 172 | self.separator = 
separator 173 | 174 | def pos_tag_raw_text(self, text, as_tuple_list=True): 175 | """ 176 | Implementation of abstract method from PosTagging 177 | @see PosTagging 178 | """ 179 | 180 | # This step is not necessary int the stanford tokenizer. 181 | # This is used to avoid such tags : (' ', 'SP') 182 | text = re.sub('[ ]+', ' ', text).strip() # Convert multiple whitespaces into one 183 | 184 | doc = self.nlp(text) 185 | if as_tuple_list: 186 | return [[(token.text, token.tag_) for token in sent] for sent in doc.sents] 187 | return '[ENDSENT]'.join(' '.join(self.separator.join([token.text, token.tag_]) for token in sent) for sent in doc.sents) 188 | 189 | 190 | class PosTaggingCoreNLP(PosTagging): 191 | """ 192 | Concrete class of PosTagging using a CoreNLP server 193 | Provides a faster way to process several documents using since it doesn't require to load the model each time. 194 | """ 195 | 196 | def __init__(self, host='localhost' ,port=9000, separator='|'): 197 | self.parser = CoreNLPParser(url=f'http://{host}:{port}') 198 | self.separator = separator 199 | 200 | def pos_tag_raw_text(self, text, as_tuple_list=True): 201 | # Unfortunately for the moment there is no method to do sentence split + pos tagging in nltk.parse.corenlp 202 | # Ony raw_tag_sents is available but assumes a list of str (so it assumes the sentence are already split) 203 | # We create a small custom function highly inspired from raw_tag_sents to do both 204 | 205 | def raw_tag_text(): 206 | """ 207 | Perform tokenizing sentence splitting and PosTagging and keep the 208 | sentence splits structure 209 | """ 210 | properties = {'annotators':'tokenize,ssplit,pos'} 211 | tagged_data = self.parser.api_call(text, properties=properties) 212 | for tagged_sentence in tagged_data['sentences']: 213 | yield [(token['word'], token['pos']) for token in tagged_sentence['tokens']] 214 | 215 | tagged_text = list(raw_tag_text()) 216 | 217 | if as_tuple_list: 218 | return tagged_text 219 | return '[ENDSENT]'.join( 220 | [' '.join([tuple2str(tagged_token, self.separator) for tagged_token in sent]) for sent in tagged_text]) 221 | 222 | 223 | 224 | 225 | if __name__ == '__main__': 226 | parser = argparse.ArgumentParser(description='Write POS tagged files, the resulting file will be written' 227 | ' at the same location with _POS append at the end of the filename') 228 | 229 | parser.add_argument('tagger', help='which pos tagger to use [stanford, spacy, corenlp]') 230 | parser.add_argument('listing_file_path', help='path to a text file ' 231 | 'containing in each row a path to a file to POS tag') 232 | args = parser.parse_args() 233 | 234 | if args.tagger == 'stanford': 235 | pt = PosTaggingStanford() 236 | suffix = 'STANFORD' 237 | elif args.tagger == 'spacy': 238 | pt = PosTaggingSpacy() 239 | suffix = 'SPACY' 240 | elif args.tagger == 'corenlp': 241 | pt = PosTaggingCoreNLP() 242 | suffix = 'CoreNLP' 243 | 244 | list_of_path = read_file(args.listing_file_path).splitlines() 245 | print('POS Tagging and writing ', len(list_of_path), 'files') 246 | pt.pos_tag_and_write_corpora(list_of_path, suffix) 247 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/util/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/EmbedRank/swisscom_ai/research_keyphrase/util/__init__.py 
-------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/util/fileIO.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | import codecs 7 | 8 | codecs.register_error('replace_with_space', lambda e: (u' ', e.start + 1)) 9 | 10 | 11 | def write_string(s, output_path): 12 | with open(output_path, 'w') as output_file: 13 | output_file.write(s) 14 | 15 | 16 | def read_file(input_path): 17 | with open(input_path, 'r', errors='replace_with_space') as input_file: 18 | return input_file.read().strip() 19 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/util/solr_fields.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | """Module containing helper function to process results of a solr query""" 7 | 8 | 9 | def process_tagged_text(s): 10 | """ 11 | Return a tagged_text as a list of sentence where each sentence is list of tuple (word,tag) 12 | :param s: string tagged_text coming from solr word1|tag1 word2|tag2[ENDSENT]word3|tag3 ... 13 | :return: (list of list of tuple) list of sentences where each sentence is a list of tuple (word,tag) 14 | """ 15 | 16 | def str2tuple(tagged_token_text, sep='|'): 17 | loc = tagged_token_text.rfind(sep) 18 | if loc >= 0: 19 | return tagged_token_text[:loc], tagged_token_text[loc + len(sep):] 20 | else: 21 | raise RuntimeError('Problem when parsing tagged token '+tagged_token_text) 22 | 23 | result = [] 24 | for sent in s.split('[ENDSENT]'): 25 | sent = [str2tuple(tagged_token) for tagged_token in sent.split(' ')] 26 | result.append(sent) 27 | return result 28 | -------------------------------------------------------------------------------- /KeyExt/KPRank/PositionRank.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | from doc_candidates import LoadFile 5 | import networkx as nx 6 | from numpy import dot 7 | from numpy.linalg import norm 8 | import numpy as np 9 | from math import log10 10 | from collections import defaultdict 11 | import operator 12 | import unicodedata 13 | 14 | def normalize_text(text): 15 | if not isinstance(text, unicode): 16 | text = unicode(text, 'utf-8') 17 | text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore') 18 | return text 19 | 20 | class PositionRank(LoadFile): 21 | 22 | def __init__(self, input_text, window, phrase_type, emb_dim, embeddings): 23 | """ Redefining initializer for PositionRank. """ 24 | 25 | super(PositionRank, self).__init__(input_text=input_text) 26 | 27 | self.graph = nx.Graph() 28 | """ The word graph. 
""" 29 | self.window = window 30 | 31 | self.phrase_type = phrase_type 32 | self.emb_dim = emb_dim 33 | self.embeddings = embeddings#KeyedVectors.load_word2vec_format(emb_file, binary=True) 34 | self.random_embeddings = {} 35 | 36 | 37 | def get_cosine_dist(self, word1, word2): 38 | curr_embeddings1 = [] 39 | if word1.lower() in self.embeddings: 40 | curr_embeddings1 = self.embeddings[word1.lower()] 41 | elif word1.lower() in self.random_embeddings: 42 | curr_embeddings1 = self.random_embeddings[word1.lower()] 43 | else: 44 | curr_embeddings1 = np.random.rand(self.emb_dim) 45 | self.random_embeddings[word1.lower()] = curr_embeddings1 46 | 47 | curr_embeddings2 = [] 48 | if word2.lower() in self.embeddings: 49 | curr_embeddings2 = self.embeddings[word2.lower()] 50 | elif word2.lower() in self.random_embeddings: 51 | curr_embeddings2 = self.random_embeddings[word2.lower()] 52 | else: 53 | curr_embeddings2 = np.random.rand(self.emb_dim) 54 | self.random_embeddings[word2.lower()] = curr_embeddings2 55 | 56 | cos_sim = 0.0 57 | if (norm(curr_embeddings1)*norm(curr_embeddings2)) != 0: 58 | #print curr_embeddings1 59 | #print curr_embeddings2 60 | cos_sim = dot(curr_embeddings1, curr_embeddings2)/(norm(curr_embeddings1)*norm(curr_embeddings2)) 61 | semantic_val = 0.0 62 | if cos_sim != 1.0: 63 | semantic_val = 1.0 / (1.0 - cos_sim) 64 | 65 | return semantic_val 66 | 67 | 68 | def build_graph(self, window, pos=None): 69 | """ 70 | build the word graph 71 | 72 | :param window: the size of window to add edges in the graph 73 | :param pos: he part of speech tags used to select the graph's nodes 74 | :return: 75 | """ 76 | 77 | if pos is None: 78 | pos = ['NN', 'NNS', 'NNP', 'NNPS', 'JJ'] 79 | 80 | # container for the nodes 81 | seq = [] 82 | individual_count = {} # my addition 83 | stemmed_original_map = {} 84 | 85 | # select nodes to be added in the graph 86 | for el in self.words: 87 | if el.pos_pattern in pos: 88 | seq.append((el.stemmed_form, el.position, el.sentence_id)) 89 | self.graph.add_node(el.stemmed_form) 90 | if el.stemmed_form not in individual_count: 91 | individual_count[el.stemmed_form] = 0 92 | individual_count[el.stemmed_form] += 1 93 | if el.stemmed_form not in stemmed_original_map: 94 | stemmed_original_map[el.stemmed_form] = el.surface_form 95 | 96 | # add edges 97 | for i in range(0, len(seq)): 98 | for j in range(i+1, len(seq)): 99 | if seq[i][1] != seq[j][1] and abs(j-i) < window: 100 | if not self.graph.has_edge(seq[i][0], seq[j][0]): 101 | self.graph.add_edge(seq[i][0], seq[j][0], weight=1) 102 | else: 103 | self.graph[seq[i][0]][seq[j][0]]['weight'] += 1 104 | 105 | def candidate_selection(self, pos=None, phrase_type='n_grams'): 106 | """ 107 | the candidates selection for PositionRank 108 | :param pos: pos: the part of speech tags used to select candidates 109 | :return: 110 | """ 111 | 112 | if pos is None: 113 | pos = ['NN', 'NNS', 'NNP', 'NNPS', 'JJ'] 114 | 115 | # uncomment the line below if you wish to extract ngrams instead of the longest phrase 116 | if phrase_type=='n_grams': 117 | self.get_ngrams(n=4, good_pos=pos) 118 | else: 119 | # select the longest phrase as candidate keyphrases 120 | self.get_phrases(self, good_pos=pos) 121 | 122 | 123 | def candidate_scoring(self, pos=None, window=10, theme_mode = 'adj_noun_title' ,update_scoring_method=False): 124 | """ 125 | compute a score for each candidate based on PageRank algorithm 126 | :param pos: the part of speech tags 127 | :param window: window size 128 | :param update_scoring_method: if you want to update the 
scoring method based on my paper cited below: 129 | Florescu, Corina, and Cornelia Caragea. "A New Scheme for Scoring Phrases in Unsupervised Keyphrase Extraction." 130 | European Conference on Information Retrieval. Springer, Cham, 2017. 131 | 132 | :return: 133 | """ 134 | 135 | if pos is None: 136 | pos = ['NN', 'NNS', 'NNP', 'NNPS', 'JJ'] 137 | 138 | # build the word graph 139 | self.build_graph(window=window, pos=pos) 140 | 141 | # filter out canditates that unlikely to be keyphrases 142 | self.filter_candidates(max_phrase_length=4, min_word_length=3, valid_punctuation='-.') 143 | 144 | ######### get Theme scores ######## 145 | 146 | # get the theme vector 147 | theme_vec = np.array([0] * self.emb_dim) 148 | 149 | if theme_mode == 'adj_noun_title': 150 | tv_words = 0 151 | for w, p in self.sentences[0]: 152 | w = w.lower() 153 | if p in pos: 154 | if w in self.embeddings['words']: # Fix embeddings structure bug. 155 | curr_vec = np.array(self.embeddings['embeddings'][self.embeddings['words'].index(w)]) 156 | theme_vec = theme_vec + curr_vec 157 | tv_words += 1 158 | if tv_words > 0: 159 | theme_vec = theme_vec / tv_words 160 | 161 | elif theme_mode == 'adj_noun_all': 162 | tv_words = 0 163 | for sentence in self.sentences: 164 | for w, p in sentence: 165 | w = w.lower() 166 | if p in pos: 167 | if w in self.embeddings['words']: # Fix embeddings structure bug. 168 | curr_vec = np.array(self.embeddings['embeddings'][self.embeddings['words'].index(w)]) 169 | theme_vec = theme_vec + curr_vec 170 | tv_words += 1 171 | if tv_words > 0: 172 | theme_vec = theme_vec / tv_words 173 | 174 | elif theme_mode == 'cls_title': 175 | theme_vec = self.embeddings['cls_ttl'] 176 | elif theme_mode == 'cls_all': 177 | theme_vec = self.embeddings['cls_all'] 178 | elif theme_mode == 'mean_title': 179 | theme_vec = self.embeddings['mean_ttl'] 180 | elif theme_mode == 'mean_all': 181 | theme_vec = self.embeddings['mean_all'] 182 | 183 | # get the thematic scores 184 | personalization_k2v = {} 185 | for w in self.words: 186 | word = w.surface_form 187 | stem = w.stemmed_form 188 | curr_pos = w.pos_pattern 189 | word = word.lower() 190 | if curr_pos in pos: 191 | if stem not in personalization_k2v.keys(): 192 | curr_vec = [] 193 | if word in self.embeddings['words']: # Fix embeddings structure bug. 
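# Note: the embeddings object here is a dict in which 'words' holds the model's (sub)word tokens and 'embeddings' holds the aligned vectors, so a token's vector is retrieved by its index in 'words' (this is the structure pickled by run_scibert_model.py).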
194 | print(theme_mode + ': EMB-FOUND') 195 | curr_vec = self.embeddings['embeddings'][self.embeddings['words'].index(word)] 196 | elif word in self.random_embeddings: 197 | curr_vec = self.random_embeddings[word] 198 | else: 199 | curr_vec = np.random.rand(self.emb_dim) 200 | self.random_embeddings[word] = curr_vec 201 | print('EMB-NOT-FOUND') 202 | cos_sim = 0.000000001 203 | if (norm(curr_vec)*norm(theme_vec)) != 0.0: 204 | cos_sim = dot(curr_vec, theme_vec)/(norm(curr_vec)*norm(theme_vec)) 205 | personalization_k2v[stem] = cos_sim 206 | 207 | ######### get Positional scores ######## 208 | personalization_pr = {} 209 | for w in self.words: 210 | stem = w.stemmed_form 211 | poz = w.position 212 | curr_pos = w.pos_pattern # the word's POS tag (keep the allowed-tag list `pos` intact) 213 | 214 | if curr_pos in pos: 215 | if stem not in personalization_pr: 216 | personalization_pr[stem] = 1.0/poz 217 | else: 218 | personalization_pr[stem] = personalization_pr[stem]+1.0/poz 219 | 220 | ######## multiply both scores ####### 221 | ipdict=[personalization_k2v, personalization_pr] 222 | 223 | output=defaultdict(lambda:1) 224 | for d in ipdict: 225 | for item in d: 226 | output[item] *= d[item] 227 | 228 | personalization = dict(output) 229 | 230 | ######## normalize scores ######## 231 | factor = 1.0 / sum(personalization.values()) 232 | 233 | normalized_personalization = {k: v * factor for k, v in personalization.items()} 234 | 235 | # compute the word scores using a personalized random walk 236 | pagerank_weights = nx.pagerank_scipy(self.graph, personalization=normalized_personalization, weight='weight') 237 | #pagerank_weights = normalized_personalization 238 | 239 | 240 | # loop through the candidates 241 | if update_scoring_method: 242 | for c in self.candidates: 243 | if len(c.stemmed_form.split()) > 1: 244 | # for arithmetic mean 245 | #self.weights[c.stemmed_form] = [stem.stemmed_form for stem in self.candidates].count(c.stemmed_form) * \ 246 | #sum([pagerank_weights[t] for t in c.stemmed_form.split()]) \ 247 | #/ len(c.stemmed_form.split()) 248 | # for harmonic mean 249 | self.weights[c.stemmed_form] = [cand.stemmed_form for cand in self.candidates].count(c.stemmed_form) * \ 250 | len(c.stemmed_form.split()) / sum([1.0 / pagerank_weights[t] for t in c.stemmed_form.split()]) 251 | else: 252 | self.weights[c.stemmed_form] = pagerank_weights[c.stemmed_form] 253 | else: 254 | for c in self.candidates: 255 | self.weights[c.stemmed_form] = sum([pagerank_weights[t] for t in c.stemmed_form.split()]) 256 | 257 | 258 | 259 | -------------------------------------------------------------------------------- /KeyExt/KPRank/README.md: -------------------------------------------------------------------------------- 1 | # KPRank 2 | 3 | This directory hosts modified code for the `KPRank` approach, which can be found in its official [repo](https://github.com/PatelKrutarth/KPRank). 4 | 5 | ## Setup 6 | Follow the instructions from the original repo. 7 | Afterwards, replace the original files with the modified ones from this directory. 8 | The `dataset_dir` variable in `main.py` needs to be set to the dataset directory path. 9 | The `dsDir` and `model_version` variables in `run_scibert_model.py` need to be set as well (see the example below). 
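The snippet below is an illustrative sketch of those assignments, mirroring the defaults shipped in the modified files; the Windows-style paths are placeholders and should point to your local copies of the dataset and the SciBERT checkpoint.

```python
# In KPRank/main.py
dataset_dir = r'..\datasets\Krapivin2009'   # dataset root containing a docsutf8/ folder

# In KPRank/run_scibert_model.py
dsDir = r'..\datasets\Krapivin2009'                           # same dataset root
model_version = r'..\KPRank\KPRank\scibert_scivocab_uncased'  # local scibert_scivocab_uncased model
```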
10 | -------------------------------------------------------------------------------- /KeyExt/KPRank/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | __author__ = 'Krutarth Patel' 4 | __email__ = 'kipatel@ksu.edu' 5 | __version__ = '1.0' 6 | -------------------------------------------------------------------------------- /KeyExt/KPRank/evaluation.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | 5 | def firstRank(predicted, gold): 6 | """returns the the rank of the first correct predicted keyphrase""" 7 | firstRank = 0 8 | for i in range(0, len(predicted)): 9 | if predicted[i] in gold: 10 | firstRank = i 11 | break 12 | 13 | return firstRank 14 | 15 | 16 | def Rprecision(predicted, gold, k): 17 | 18 | hits = set(predicted).intersection(set(gold)) 19 | Rpr = 0.0 20 | if len(hits)>0 and len(predicted)>0: 21 | Rpr = len(hits)*1.0/k 22 | 23 | return Rpr 24 | 25 | def PRF(predicted, gold, k): 26 | 27 | predicted = predicted[:k] 28 | 29 | hits = set(predicted).intersection(set(gold)) 30 | P, R, F1 = 0.0, 0.0, 0.0 31 | 32 | if len(hits)>0 and len(predicted)>0: 33 | P = len(hits)/len(predicted) 34 | R = len(hits)/len(gold) 35 | F1 = 2*P*R/(P+R) 36 | 37 | return {'precision':P,'recall': R,'f1-score': F1} 38 | 39 | def PRF_range(predicted, gold, k): 40 | 41 | P = [] 42 | R = [] 43 | F1 = [] 44 | 45 | for i in range(0,k): 46 | predict = predicted[:i+1] 47 | 48 | hits = set(predict).intersection(set(gold)) 49 | pr = 0.0 50 | re = 0.0 51 | f1 = 0.0 52 | if len(hits)>0 and len(predict)>0: 53 | pr = len(hits)*1.0/len(predict) 54 | re = len(hits)*1.0/len(gold) 55 | if pr+re > 0: 56 | f1 = 2*pr*re/(pr+re) 57 | 58 | P.append(pr) 59 | R.append(re) 60 | F1.append(f1) 61 | 62 | return P,R,F1 63 | 64 | def Bpref (pred, gold): 65 | incorrect = 0 66 | correct = 0 67 | bpref = 0 68 | 69 | for kp in pred: 70 | if kp in gold: 71 | bpref += (1.0 - (incorrect*1.0/len(pred))) 72 | correct += 1 73 | else: 74 | incorrect +=1 75 | 76 | if correct >0: 77 | bpref = bpref*1.0/correct 78 | else: 79 | bpref = 0.0 80 | 81 | return bpref -------------------------------------------------------------------------------- /KeyExt/KPRank/main.py: -------------------------------------------------------------------------------- 1 | #from __future__ import division 2 | from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter 3 | import sys 4 | import PositionRank 5 | from gensim.models import KeyedVectors 6 | import evaluation 7 | import process_data 8 | import os 9 | from os.path import isfile, join 10 | import pathlib 11 | from nltk.stem.porter import PorterStemmer 12 | porter_stemmer = PorterStemmer() 13 | import pickle 14 | 15 | def ensure_dir(dirName): 16 | if not os.path.exists(dirName): 17 | print('making dir: ' + dirName) 18 | os.makedirs(dirName) 19 | 20 | def load_obj(filePath): 21 | with open(filePath, 'rb') as f: 22 | return pickle.load(f) 23 | 24 | def main(): 25 | # Initialize parameters. 26 | topK = 10 27 | window = 10 28 | phrase_type = 'ngrams' 29 | emb_dim = 768 30 | theme_mode = 'adj_noun_title' 31 | model_name = 'scibert' 32 | 33 | # Initialize paths. 
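# The dataset directory is expected to contain a 'docsutf8' folder with the input documents and a '{model_name}_emb_fulltext_title' folder with the pickled embeddings produced by run_scibert_model.py; extracted keyphrases are written under 'extracted/kprank'. The Windows-style raw-string paths below are examples and should be adapted to the local setup.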
34 | dataset_dir = r'..\datasets\Krapivin2009' 35 | input_dir = os.path.join(dataset_dir, 'docsutf8') 36 | output_dir = os.path.join(dataset_dir, 'extracted\kprank') 37 | emb_dir = os.path.join(dataset_dir, f'{model_name}_emb_fulltext_title') 38 | 39 | # Set the current directory to the input dir 40 | os.chdir(os.path.join(os.getcwd(), input_dir)) 41 | 42 | # Get all file names and their absolute paths. 43 | docnames = sorted(os.listdir()) 44 | docpaths = list(map(os.path.abspath, docnames)) 45 | 46 | # Create the keys directory, after the names and paths are loaded. 47 | pathlib.Path(output_dir).mkdir(parents = True, exist_ok = True) 48 | 49 | for i, (docname, docpath) in enumerate(zip(docnames, docpaths)): 50 | # keys shows up in docnames, erroneously. 51 | if docname == 'keys': 52 | continue 53 | 54 | #if i < 115: continue 55 | 56 | print(f'Processing {i} out of {len(docnames)}...') 57 | 58 | # Form the output path. 59 | output_path = os.path.join(output_dir, docname.split('.')[0]+'.key') 60 | print(output_path) 61 | 62 | # Process the data of the document. 63 | text = process_data.read_input_file(docpath) 64 | 65 | # Load the embeddings. 66 | emb_path = os.path.join(emb_dir, f'{docname}_fulltext.pkl') 67 | embeddings = load_obj(emb_path) 68 | model = PositionRank.PositionRank(text, window, phrase_type, emb_dim, embeddings) 69 | 70 | # Run the model. 71 | model.get_doc_words() 72 | model.candidate_selection() 73 | model.candidate_scoring(theme_mode = theme_mode, update_scoring_method = False) 74 | keyphrases = model.get_best_k(topK)[:10] 75 | 76 | # Write the keyphrases to a file. 77 | keys = '\n'.join(map(str, keyphrases) or '') 78 | with open(output_path, 'w', encoding = 'utf-8-sig', errors = 'ignore') as out: 79 | out.write(keys) 80 | 81 | os.system('clear') 82 | return 83 | 84 | 85 | if __name__ == "__main__": main() -------------------------------------------------------------------------------- /KeyExt/KPRank/process_data.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import codecs 5 | import itertools 6 | from nltk.stem.porter import PorterStemmer 7 | import os.path 8 | from nltk import word_tokenize 9 | from string import punctuation 10 | import re 11 | import unicodedata 12 | 13 | 14 | def read_input_file(this_file): 15 | # read the text of the file; if the file cannot be read then the file is excluded 16 | if os.path.exists(this_file): 17 | with codecs.open(this_file, "r", encoding='utf-8') as f: 18 | #text = f.read() 19 | lines = f.readlines() 20 | lines[0] = lines[0].strip() 21 | if not (lines[0].endswith(".") or lines[0].endswith("?") or lines[0].endswith("!")): 22 | lines[0] = lines[0]+'.' 
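# The first line is treated as the document title: forcing it to end with sentence-final punctuation keeps it as its own sentence once the lines are joined, which matters downstream where the first sentence is used to build the title-based theme vector.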
23 | text = ' '.join(lines) 24 | f.close() 25 | else: 26 | text = None 27 | 28 | return text 29 | 30 | 31 | def read_gold_file(this_gold): 32 | 33 | # read the gold file; if the file cannot be read (does not exist) the file is excluded 34 | if os.path.exists(this_gold): 35 | with codecs.open(this_gold, "r", encoding='utf-8') as f: 36 | gold_list = f.readlines() 37 | f.close() 38 | else: 39 | gold_list = None 40 | 41 | return gold_list 42 | 43 | def get_ascii(text): 44 | if not isinstance(text, unicode): 45 | text = unicode(text, "utf-8") 46 | text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore') 47 | return text 48 | 49 | 50 | def get_stemmed_words_and_stemmed_text(text): 51 | stemmer = PorterStemmer() 52 | text_words = text.split() 53 | text_words_stem = [] 54 | for word in text_words: 55 | text_words_stem.append(stemmer.stem(word)) 56 | text_stem = ' '.join(text_words_stem) 57 | return text_words_stem, text_stem 58 | 59 | 60 | def load_stemmed_gold_phrases(lines): 61 | punct_list = ['\'', '"', '\\', '!', '@', '#', '$', '%', 62 | '^', '&', '*', '(', ')', '_', '-', '+', '=','{', '}', '[', ']', 63 | '|', ':', ';', '<', '>', ',', '.', '?', '/', '`', '~'] 64 | 65 | punct_re = '|'.join(map(re.escape, punct_list)) 66 | 67 | gold_phrases = [] 68 | for line in lines: 69 | line = line.strip() 70 | line = line.lower() 71 | line = get_ascii(line) 72 | line = re.sub(punct_re, ' ', line) 73 | line = re.sub('\s+', ' ', line).strip() 74 | line_words_stem, line_stem = get_stemmed_words_and_stemmed_text(line) 75 | gold_phrases.append(line_stem) 76 | return gold_phrases 77 | 78 | def tokenize(text, encoding): 79 | """ tokenize text 80 | Args: 81 | text: tect to be tokenized 82 | """ 83 | return [token for token in word_tokenize(text.lower().decode(encoding))] 84 | 85 | 86 | def filter_candidates(tokens, stopwords_file=None, min_word_length=2, valid_punctuation='-'): 87 | """ discard candidates based on various criteria 88 | Args: 89 | tokens: tokens to be filtered out 90 | stopwords_file: if you want to load a file with stopwords 91 | min_word_length: filter words shorter than min_word_length 92 | valid_punctuation: filter words that contain other punctuation than valid_punctuation 93 | encoding='utf-8' 94 | """ 95 | 96 | # if a list of stopwords is not provided then load the stopwords'list from nltk 97 | stopwords_list = [] 98 | if stopwords_file is None: 99 | from nltk.corpus import stopwords 100 | stopwords_list = set(stopwords.words('english')) 101 | else: 102 | with codecs.open(stopwords_file, 'rb', encoding='utf-8') as f: 103 | f.readlines() 104 | f.close() 105 | # add the stopword from file in the stopwords_list container 106 | for line in f: 107 | stopwords_list.append(line) 108 | 109 | # keep indices to be deleted 110 | indices = [] 111 | 112 | for i, c in enumerate(tokens): 113 | 114 | # discard those candidates that contain stopwords 115 | if c in stopwords_list: 116 | indices.append(i) 117 | 118 | # discard candidates that contain words shorter that min_word_length 119 | elif len(c) < min_word_length: 120 | indices.append(i) 121 | 122 | elif c in ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']: 123 | indices.append(i) 124 | 125 | else: 126 | 127 | # discard candidates that contain other characters except letter, digits, and valid punctuation 128 | letters_set = set([u for u in c]) 129 | 130 | if letters_set.issubset(punctuation): 131 | indices.append(i) 132 | 133 | elif re.match(r'^[a-zA-Z0-9%s]*$' % valid_punctuation, c): 134 | pass 135 | else: 136 | indices.append(i) 137 
| 138 | dels = 0 139 | 140 | for index in indices: 141 | offset = index - dels 142 | del tokens[offset] 143 | dels += 1 144 | 145 | return tokens 146 | 147 | 148 | def stemming(text): 149 | """ stem tokens """ 150 | p_stemmer = PorterStemmer() 151 | return [p_stemmer.stem(i) for i in text] 152 | 153 | 154 | def iter_data(path_to_data, encoding): 155 | """Yield each article from the Medline """ 156 | files = [] 157 | #with open('/home/corina/Documents/Research/Projects/unsupervisedKE/data_analysis/medline_10000_1.txt','rb') as rf: 158 | #filenames = rf.readlines() 159 | #files = [file.strip() for file in filenames] 160 | #rf.close() 161 | #print files 162 | i=1 163 | #for filename in filenames: #os.listdir(path_to_data): 164 | for filename in os.listdir(path_to_data): 165 | #filename = filename.strip() 166 | 167 | i += 1 168 | with open(path_to_data + filename, 'rb') as f: 169 | text = f.read().strip() 170 | tokens = tokenize(text, encoding) 171 | tokens = filter_candidates(tokens) 172 | tokens = stemming(tokens) 173 | f.close() 174 | yield path_to_data + filename, text, tokens 175 | 176 | 177 | class MyCorpus(object): 178 | 179 | def __init__(self, path_to_data, dictionary, length=None, encoding='utf-8'): 180 | """ 181 | Parse the collection of documents from file path_to_data. 182 | Yield each document in turn, as a list of tokens. 183 | Args: 184 | path_to_data: the location of the collection 185 | dictionary: the mapping between word and ids 186 | length: the number of docs in the corpus 187 | """ 188 | self.path_to_data = path_to_data 189 | self.dictionary = dictionary 190 | self.length = length 191 | self.encoding = encoding 192 | self.index_filename = {} 193 | 194 | def __iter__(self): 195 | 196 | index = 0 197 | 198 | for filename, text, tokens in itertools.islice(iter_data(self.path_to_data, self.encoding), self.length): 199 | self.index_filename[index] = filename 200 | index += 1 201 | yield self.dictionary.doc2bow(tokens) 202 | 203 | def __len__(self): 204 | if self.length is None: 205 | self.length = sum(1 for doc in self) 206 | return self.length 207 | -------------------------------------------------------------------------------- /KeyExt/KPRank/requirements.txt: -------------------------------------------------------------------------------- 1 | backports.functools-lru-cache==1.5 2 | decorator==4.3.0 3 | networkx==2.2 4 | nltk==3.4 5 | nose==1.3.7 6 | numpy==1.15.4 7 | Pillow==5.3.0 8 | psutil==5.4.8 9 | pyparsing==2.3.0 10 | pytz==2018.7 11 | scipy==1.1.0 12 | singledispatch==3.4.0.3 13 | six==1.12.0 14 | subprocess32==3.5.3 15 | torch==1.10.0 16 | transformers==2.8.0 17 | gensim -------------------------------------------------------------------------------- /KeyExt/KPRank/run_scibert_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import codecs 3 | import pickle 4 | import re 5 | from datetime import datetime 6 | import torch 7 | from transformers import BertTokenizer, BertModel 8 | 9 | def ensure_dir(dirName): 10 | if not os.path.exists(dirName): 11 | print('making dir: ' + dirName) 12 | os.makedirs(dirName) 13 | 14 | def getText(filePath): 15 | text = None 16 | title = None 17 | if os.path.exists(filePath): 18 | with codecs.open(filePath, "r", encoding='utf-8') as f: 19 | lines = f.readlines() 20 | lines[0] = lines[0].strip() 21 | if not (lines[0].endswith(".") or lines[0].endswith(".") or lines[0].endswith("!")): 22 | lines[0] = lines[0]+'.' 
23 | text = ' '.join(lines) 24 | title = lines[0] 25 | f.close() 26 | 27 | return text, title 28 | 29 | def load_obj(filePath): 30 | with open(filePath, 'rb') as f: 31 | return pickle.load(f) 32 | 33 | def save_obj(obj, filePath): 34 | with open(filePath, 'wb') as output: 35 | pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL) 36 | 37 | def embed_text(text, model): 38 | input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0) # Batch size 1 39 | outputs = model(input_ids) 40 | last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple 41 | return last_hidden_states[0] 42 | 43 | def embed_tokens(tokens, model): 44 | input_ids = torch.tensor(tokens).unsqueeze(0) # Batch size 1, only 1 sentense 45 | outputs = model(input_ids) 46 | last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple 47 | return last_hidden_states[0] # num_tokens (or num_words+2) * 768 dimentioanal output 48 | 49 | def main(): 50 | """ 51 | Python 3.7 code 52 | Download SciBERT (scibert_scivocab_uncased) model from: https://github.com/allenai/scibert 53 | generates wordembeddings for each document name listed in overlap_test_bl.txt file in each dataset directory 54 | file structure expected: 55 | - dataset_name 56 | - abstracts : directory containing abstracts 57 | - overlap_test_bl.txt : file containing a list of test documents, 1 document name per line 58 | Generates word embeddings as directory structure below: 59 | - dataset_name 60 | - MODEL_MODE_emb_fulltext_title 61 | - FILE_NAME_fulltext.pkl: file contains words, corresponding tokens, and embeddings for title as an input to the model 62 | - FILE_NAME_fulltext.pkl: file contains words, corresponding tokens, and embeddings for title+abstract as an input to the model 63 | """ 64 | 65 | model_mode = 'scibert' # 'bert' 66 | dsDir = r'..\datasets\Krapivin2009' # directory containing the dataset 67 | 68 | do_lower_case = True 69 | model = None 70 | tokenizer = None 71 | ####### SciBERT model ######### 72 | if model_mode == 'scibert': 73 | # please change the path to a downloaded Scibert Model 74 | model_version = r'..\KPRank\KPRank\scibert_scivocab_uncased' 75 | model = BertModel.from_pretrained(model_version) 76 | tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=do_lower_case) 77 | 78 | elif model_mode == 'bert': 79 | tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') 80 | model = BertModel.from_pretrained("bert-base-uncased") 81 | 82 | #datasets = ['hulth', 'semeval'] 83 | #datasets = ['krapivin', 'nus'] 84 | #datasets = ['nus'] 85 | #datasets = ['acm'] 86 | 87 | ipDir = os.path.join(dsDir, 'docsutf8') 88 | opDir = os.path.join(dsDir, f'{model_mode}_emb_fulltext_title') 89 | ensure_dir(opDir) 90 | 91 | # opening a file containing a list of test documents, 1 document name per line 92 | ipList = sorted(os.listdir(ipDir)) 93 | 94 | for i, l in enumerate(ipList): 95 | 96 | print(f'Processing {i} out of {len(ipList)}...') 97 | 98 | #if i < 1761: continue 99 | 100 | l = l.strip() 101 | opFilePath_fulltext = os.path.join(opDir, f'{l}_fulltext.pkl') 102 | opFilePath_title = os.path.join(opDir, f'{l}_title.pkl') 103 | 104 | #print(l) 105 | file_path = os.path.join(ipDir, l) 106 | fulltext, title = getText(file_path) 107 | 108 | fulltext = re.sub('\s+', ' ', fulltext).strip() # remove extra spaces and new lines 109 | title = re.sub('\s+', ' ', title).strip() # remove extra spaces and new lines 110 | 111 | fulltext_words = tokenizer.tokenize(fulltext) 112 | 
title_words = tokenizer.tokenize(title) 113 | 114 | fulltext_en_tokens = tokenizer.convert_tokens_to_ids(['[CLS]'] + fulltext_words[:510] + ['[SEP]']) 115 | title_en_tokens = tokenizer.convert_tokens_to_ids(['[CLS]'] + title_words[:510] + ['[SEP]']) 116 | 117 | 118 | fulltext_em = embed_tokens(fulltext_en_tokens, model).detach().numpy() 119 | title_em = embed_tokens(title_en_tokens, model).detach().numpy() 120 | 121 | fulltext_dict = {} 122 | title_dict = {} 123 | 124 | fulltext_dict['words'] = fulltext_words[:510] 125 | fulltext_dict['tokens'] = fulltext_en_tokens 126 | fulltext_dict['embeddings'] = fulltext_em 127 | 128 | title_dict['words'] = title_words[:510] 129 | title_dict['tokens'] = title_en_tokens 130 | title_dict['embeddings'] = title_em 131 | 132 | save_obj(fulltext_dict, opFilePath_fulltext) 133 | save_obj(title_dict, opFilePath_title) 134 | 135 | os.system('clear') 136 | 137 | if __name__ == "__main__": 138 | main() -------------------------------------------------------------------------------- /KeyExt/Key2Vec/README.md: -------------------------------------------------------------------------------- 1 | # Key2Vec 2 | 3 | This directory hosts code to run the Key2Vec approach from this [repo](https://github.com/MarkSecada/key2vec). 4 | 5 | ## Setup 6 | Clone the aforementioned repository. 7 | Replace the files from this directory over the files of the cloned repository. 8 | Download the `glove.6B.50d.txt` from the [Glove](https://github.com/stanfordnlp/GloVe) repository and place it in the `data` subdirectory. 9 | In `main.py` set the `base_path` to the local dataset directory. 10 | -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pathlib 3 | import key2vec 4 | 5 | def main(): 6 | glove = key2vec.glove.Glove('./data/glove.6B.50d.txt') 7 | base_path = '../datasets/DUC-2001' 8 | input_dir = os.path.join(base_path, 'docsutf8') 9 | output_dir = os.path.join(base_path, 'extracted/key2vec') 10 | 11 | # Set the current directory to the input dir 12 | os.chdir(os.path.join(os.getcwd(), input_dir)) 13 | 14 | # Get all file names and their absolute paths. 15 | docnames = sorted(os.listdir()) 16 | docpaths = list(map(os.path.abspath, docnames)) 17 | 18 | # Create the keys directory, after the names and paths are loaded. 19 | pathlib.Path(output_dir).mkdir(parents = True, exist_ok = True) 20 | 21 | for i, (docname, docpath) in enumerate(zip(docnames, docpaths)): 22 | 23 | if i < 292: continue 24 | # keys shows up in docnames, erroneously. 25 | if docname == 'keys': 26 | continue 27 | 28 | print(f'Processing {i} out of {len(docnames)}...') 29 | 30 | # Save the output dir path 31 | output_dirpath = os.path.join(output_dir, docname.split('.')[0]+'.key') 32 | print(output_dirpath) 33 | 34 | with open(docpath, 'r', encoding = 'utf-8-sig', errors = 'ignore') as file, \ 35 | open(output_dirpath, 'w', encoding = 'utf-8-sig', errors = 'ignore') as out: 36 | 37 | # Read the file and remove the newlines. 38 | text = file.read().replace('\n', ' ') 39 | 40 | # Extract the top 10 keyphrases. 
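# Full Key2Vec pipeline: candidate extraction, theme weighting, candidate graph construction and PageRank-based ranking. Any document that raises an exception is silently skipped and gets no .key file.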
41 | try: 42 | m = key2vec.key2vec.Key2Vec(text, glove) 43 | m.extract_candidates() 44 | m.set_theme_weights() 45 | m.build_candidate_graph() 46 | ranked_list = m.page_rank_candidates(top_n = 10) 47 | 48 | keys = "\n".join(map(str, ranked_list) or '') 49 | out.write(keys) 50 | except: 51 | pass 52 | 53 | os.system('clear') 54 | 55 | 56 | if __name__ == "__main__": main() 57 | -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/__init__.py: -------------------------------------------------------------------------------- 1 | from . import cleaner 2 | from . import constants 3 | from . import docs 4 | from . import glove 5 | from . import key2vec 6 | from . import phrase_graph -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/cleaner.py: -------------------------------------------------------------------------------- 1 | from .constants import STOPWORDS, POS_BLACKLIST, DETERMINERS, PUNCT_SET 2 | 3 | class Cleaner(object): 4 | """Cleans candidate keyphrase""" 5 | 6 | def __init__(self, doc): 7 | self.doc = doc 8 | self.tokens = [token for token in doc] 9 | 10 | def transform_text(self): 11 | transformed_text = [] 12 | tokens_len = len(self.tokens) 13 | for i, token in enumerate(self.tokens): 14 | remove = False 15 | if (i == 0) or (i == tokens_len - 1): 16 | is_stop = token.text in STOPWORDS 17 | is_banned_pos = token.pos_ in POS_BLACKLIST 18 | is_determiner = token.text in DETERMINERS 19 | has_punct = not set(token.text).isdisjoint(PUNCT_SET) 20 | remove = (is_stop 21 | or is_banned_pos 22 | or is_determiner 23 | or has_punct) 24 | else: 25 | pass 26 | if not remove: 27 | transformed_text.append(token.text) 28 | 29 | if transformed_text == []: 30 | return '' 31 | elif '-' in transformed_text: 32 | dash_index = transformed_text.index('-') 33 | first_half = ' '.join(transformed_text[:dash_index]) 34 | sec_half = ' '.join(transformed_text[dash_index + 1:]) 35 | return ' '.join([first_half, sec_half]).lower() 36 | else: 37 | return ' '.join(transformed_text).lower() -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/constants.json: -------------------------------------------------------------------------------- 1 | { 2 | "punctuation": [ 3 | "\\", 4 | "]", 5 | ";", 6 | "%", 7 | "(", 8 | "_", 9 | "@", 10 | ",", 11 | "-", 12 | "–", 13 | "=", 14 | "!", 15 | ":", 16 | "[", 17 | "\"", 18 | ")", 19 | "?", 20 | "}", 21 | "&", 22 | "'", 23 | "|", 24 | "/", 25 | "#", 26 | "<", 27 | "$", 28 | "^", 29 | ".", 30 | "`", 31 | "*", 32 | "+", 33 | "~", 34 | "{", 35 | ">", 36 | "\n", 37 | "\t", 38 | ], 39 | "pos_blacklist": [ 40 | "INTJ", 41 | "AUX", 42 | "CCONJ", 43 | "ADP", 44 | "DET", 45 | "NUM", 46 | "PART", 47 | "PRON", 48 | "SCONJ", 49 | "PUNCT", 50 | "SYM", 51 | "X", 52 | ], 53 | "ents_to_ignore": [ 54 | "DATE", 55 | "TIME", 56 | "PERCENT", 57 | "MONEY", 58 | "QUANTITY", 59 | "ORDINAL", 60 | "CARDINAL", 61 | ], 62 | "determiners": [ 63 | "the", 64 | "a", 65 | "an", 66 | "this", 67 | "that", 68 | "these", 69 | "those", 70 | "my", 71 | "your", 72 | "his", 73 | "her", 74 | "its", 75 | "our", 76 | "their", 77 | "a few", 78 | "a little", 79 | "much", 80 | "many", 81 | "a lot of", 82 | "most", 83 | "some", 84 | "any", 85 | "enough", 86 | "one", 87 | "ten", 88 | "thirty", 89 | "all", 90 | "both", 91 | "either", 92 | "neither", 93 | "each", 94 | "every", 95 | "other", 96 | "another", 97 | "such", 98 | "what", 99 | "rather", 100 | "quite", 101 | ] 102 | } 
-------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/constants.py: -------------------------------------------------------------------------------- 1 | import string 2 | 3 | PUNCT_SET = list(set(string.punctuation)) 4 | PUNCT_SET.append(u'\u201c') 5 | PUNCT_SET.append(u'\u201d') 6 | PUNCT_SET.append(u'\u2018') 7 | PUNCT_SET.append(u'\u2019') 8 | PUNCT_SET.append(u'\u2014') 9 | PUNCT_SET.append(b'\xe2\x80\x9c') 10 | PUNCT_SET.append('\n') 11 | PUNCT_SET.append('\\') 12 | PUNCT_SET.append('\"') 13 | PUNCT_SET.append('\a') 14 | PUNCT_SET.append('\f') 15 | PUNCT_SET.append('\n') 16 | PUNCT_SET.append('\r') 17 | PUNCT_SET.append('\t') 18 | PUNCT_SET.append('\v') 19 | PUNCT_SET = set(PUNCT_SET) 20 | 21 | POS_BLACKLIST = ['INTJ', 'AUX', 'CCONJ', 22 | 'ADP', 'DET', 'NUM', 'PART', 23 | 'PRON', 'SCONJ', 'PUNCT', 24 | 'SYM', 'X'] 25 | 26 | ENTS_TO_IGNORE = ['DATE', 'TIME', 'PERCENT', 27 | 'MONEY', 'QUANTITY', 'ORDINAL', 28 | 'CARDINAL'] 29 | 30 | DETERMINERS = ['the', 'a', 'an', 'this', 'that', 'these', 'those', 31 | 'my', 'your', 'his', 'her', 'its', 'our', 'their', 32 | 'a few', 'a little', 'much', 'many', 'a lot of', 'most', 33 | 'some', 'any', 'enough', 'one', 'ten', 'thirty', 'all', 34 | 'both', 'either', 'neither', 'each', 'every', 'other', 35 | 'another', 'such', 'what', 'rather', 'quite'] 36 | 37 | STOPWORDS = ["word", 38 | "a", "a's", "able", "about", "above", "according", 39 | "accordingly", "across", "actually", "after", "afterwards", 40 | "again", "against", "ago", "aim", "ain't", "all", "allow", 41 | "allows", "almost", "alone", "along", "already", "also", 42 | "although", "always", "am", "among", "amongst", "an", "and", 43 | "another", "any", "anybody", "anyhow", "anyone", "anything", 44 | "anyway", "anyways", "anywhere", "apart", "appear", "appreciate", 45 | "approach", "appropriate", "are", "area", "areas", "aren't", 46 | "around", "as", "aside", "ask", "asked", "asking", "asks", 47 | "associated", "at", "available", "away", "awfully", "b", "back", 48 | "backed", "backing", "backs", "bad", "based", "be", "became", 49 | "because", "become", "becomes", "becoming", "been", "before", 50 | "beforehand", "began", "behind", "being", "beings", "believe", 51 | "below", "beside", "besides", "best", "better", "between", 52 | "beyond", "big", "bit", "both", "brief", "bring", "but", "by", 53 | "c", "c'mon", "c's", "came", "can", "can't", "cannot", "cant", 54 | "case", "cases", "cause", "causes", "certain", "certainly", 55 | "changes", "clear", "clearly", "co", "com", "come", "comes", 56 | "concerning", "consequently", "consider", "considering", 57 | "contain", "containing", "contains", "continue", "corresponding", 58 | "could", "couldn't", "course", "currently", "d", "definitely", 59 | "described", "despite", "did", "didn't", "differ", "different", 60 | "differently", "do", "does", "doesn't", "doing", "don't", "done", 61 | "down", "downed", "downing", "downs", "downwards", "dr", "during", 62 | "e", "each", "earlier", "early", "edu", "eg", "eight", "either", 63 | "else", "elsewhere", "end", "ended", "ending", "ends", "enough", 64 | "entirely", "especially", "et", "etc", "even", "evenly", "ever", 65 | "every", "everybody", "everyone", "everything", "everywhere", "ex", 66 | "exactly", "example", "except", "f", "face", "faces", "fact", 67 | "facts", "far", "felt", "few", "fifth", "find", "finds", "first", 68 | "five", "flawed", "focusing", "followed", "following", "follows", 69 | "for", "former", "formerly", "forth", "four", "from", "full", 70 | 
"fully", "fun", "further", "furthered", "furthering", 71 | "furthermore", "furthers", "g", "gave", "general", "generally", 72 | "get", "gets", "getting", "gigot", "give", "given", "gives", "go", 73 | "goes", "going", "gone", "good", "goods", "got", "gotten", "great", 74 | "greater", "greatest", "greetings", "group", "grouped", "grouping", 75 | "groups", "h", "had", "hadn't", "half", "happens", "hardly", "has", 76 | "hasn't", "have", "haven't", "having", "he", "he'd", "he'll", 77 | "he's", "held", "hello", "help", "hence", "her", "here", "here's", 78 | "hereafter", "hereby", "herein", "hereupon", "hers", "herself", 79 | "hi", "high", "higher", "highest", "him", "himself", "his", 80 | "hither", "hopefully", "how", "how's", "howbeit", "however", "i", 81 | "i'd", "i'll", "i'm", "i've", "ie", "if", "ignored", "ii", 82 | "immediate", "immediately", "important", "in", "inasmuch", "inc", 83 | "include", "including", "indeed", "indicate", "indicated", 84 | "indicates", "inevitable", "inner", "insofar", "instead", 85 | "interest", "interested", "interesting", "interests", "into", 86 | "involving", "inward", "is", "isn't", "issue", "it", "it'd", 87 | "it'll", "it's", "its", "itself", "ix", "j", "just", "k", "keep", 88 | "keeps", "kept", "kind", "knew", "know", "known", "knows", "l", 89 | "large", "largely", "last", "lately", "later", "latest", 90 | "latter", "latterly", "lead", "least", "led", "less", "lest", 91 | "let", "let's", "lets", "letting", "like", "liked", "likely", 92 | "likes", "line", "listen", "little", "long", "longer", "longest", 93 | "look", "looking", "looks", "lot", "ltd", "m", "m.d", "made", 94 | "mainly", "make", "makes", "making", "man", "many", "may", "maybe", 95 | "me", "mean", "meant", "meanwhile", "member", "members", "men", 96 | "merely", "messrs", "met", "might", "more", "moreover", "most", 97 | "mostly", "move", "mr", "mrs", "ms", "much", "must", "mustn't", 98 | "my", "myself", "n", "name", "namely", "nd", "near", "nearly", 99 | "necessary", "need", "needed", "needing", "needs", "neither", 100 | "never", "nevertheless", "new", "newer", "newest", "next", 101 | "nine", "no", "nobody", "non", "none", "nonetheless", "noone", 102 | "nor", "normally", "not", "nothing", "novel", "now", "nowhere", 103 | "number", "numbers", "o", "obviously", "of", "off", "often", 104 | "oh", "ok", "okay", "old", "older", "oldest", "on", "once", 105 | "one", "ones", "only", "onto", "open", "opened", "opening", 106 | "opens", "or", "order", "ordered", "ordering", "orders", 107 | "other", "others", "otherwise", "ought", "our", "ours", 108 | "ourselves", "out", "outside", "over", "overall", 109 | "overwhelming", "own", "p", "part", "parted", "particular", 110 | "particularly", "parting", "parts", "people", "per", "perhaps", 111 | "place", "placed", "places", "please", "plus", "point", "pointed", 112 | "pointing", "points", "possible", "prefer", "present", "presented", 113 | "presenting", "presents", "presumably", "probably", "problem", 114 | "problems", "prof", "provides", "put", "puts", "putting", "q", 115 | "que", "quite", "qv", "r", "rather", "rd", "re", "really", 116 | "reasonably", "recently", "regarding", "regardless", "regards", 117 | "relatively", "respectively", "right", "room", "rooms", "s", 118 | "said", "same", "saw", "say", "saying", "says", "sec", "second", 119 | "secondly", "seconds", "see", "seeing", "seem", "seemed", 120 | "seeming", "seemingly", "seems", "seen", "sees", "self", "selves", 121 | "sensible", "sent", "serious", "seriously", "set", "seven", 122 | "several", "shall", 
"shan't", "she", "she'd", "she'll", "she's", 123 | "shortly", "should", "shouldn't", "show", "showed", "showing", 124 | "shows", "side", "sides", "simply", "since", "six", "small", 125 | "smaller", "smallest", "so", "some", "somebody", "somehow", 126 | "someone", "something", "sometime", "sometimes", "somewhat", 127 | "somewhere", "soon", "sorry", "specified", "specify", "specifying", 128 | "st", "state", "states", "still", "sub", "such", "sup", "sure", 129 | "t", "t's", "take", "taken", "tell", "tends", "th", "than", 130 | "thank", "thanks", "thanx", "that", "that's", "thats", "the", 131 | "their", "theirs", "them", "themselves", "then", "thence", "there", 132 | "there's", "thereafter", "thereby", "therefore", "therein", 133 | "theres", "thereupon", "these", "they", "they'd", "they'll", 134 | "they're", "they've", "thing", "things", "think", "thinks", 135 | "third", "this", "thorough", "thoroughly", "those", "though", 136 | "thought", "thoughts", "three", "through", "throughout", "thru", 137 | "thus", "to", "today", "together", "told", "too", "took", "top", 138 | "toward", "towards", "tried", "tries", "truly", "try", "trying", 139 | "turn", "turned", "turning", "turns", "twice", "two", "u", "un", 140 | "under", "unfortunately", "unless", "unlike", "unlikely", "until", 141 | "unto", "up", "upon", "us", "use", "used", "useful", "uses", 142 | "using", "usually", "uucp", "v", "value", "various", "very", "via", 143 | "viz", "vs", "w", "want", "wanted", "wanting", "wants", "was", 144 | "wasn't", "watched", "way", "ways", "we", "we'd", "we'll", "we're", 145 | "we've", "welcome", "well", "wells", "went", "were", "weren't", 146 | "what", "what's", "whatever", "when", "when's", "whence", 147 | "whenever", "where", "where's", "whereafter", "whereas", "whereby", 148 | "wherein", "whereupon", "wherever", "whether", "which", "while", 149 | "whither", "who", "who's", "whoever", "whole", "whom", "whose", 150 | "why", "why's", "will", "willing", "wish", "with", "within", 151 | "without", "won't", "wonder", "work", "worked", "working", 152 | "works", "worst", "would", "wouldn't", "x", "y", "year", "years", 153 | "yes", "yet", "you", "you'd", "you'll", "you're", "you've", 154 | "young", "younger", "youngest", "your", "yours", "yourself", 155 | "yourselves", "z", "zero", "mr", "ms", "mrs", "mssrs", "mssr", 156 | "also", "said", "should", "could", "would", "week", "weeks", 157 | "month", "months", "year", "years"] -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/docs.py: -------------------------------------------------------------------------------- 1 | from nltk import sent_tokenize, wordpunct_tokenize 2 | from typing import Dict, List, Tuple 3 | from .constants import PUNCT_SET 4 | from .glove import Glove 5 | 6 | import numpy as np 7 | 8 | def cosine_similarity(a: np.float64, b: np.float64) -> float: 9 | norm_a = np.linalg.norm(a) 10 | norm_b = np.linalg.norm(b) 11 | if norm_a == 0 or norm_b == 0: 12 | return -1 13 | return np.dot(a, b) / (norm_a * norm_b) 14 | 15 | def _filter_words(text: str) -> List[str]: 16 | tokens = wordpunct_tokenize(text) 17 | words_filter = [word.lower() for word in tokens 18 | if set(word).isdisjoint(PUNCT_SET)] 19 | return words_filter 20 | 21 | class Document(object): 22 | """Document to be embedded. May be a word, a sentence, etc. 
23 | 24 | Parameters 25 | ---------- 26 | text : str, required 27 | The text to be embedded 28 | glove : Glove, required 29 | GloVe embeddings 30 | 31 | Attributes 32 | ---------- 33 | text : str 34 | dim : int 35 | Dimension of GloVe embeddings. 36 | embedding : np.float64 37 | Document embedding built from average of GloVe embeddings. 38 | """ 39 | 40 | def __init__(self, 41 | text: str, 42 | glove: Glove) -> None: 43 | self.text = text 44 | self.dim = glove.dim 45 | self.embedding = self.__embed_document(glove.embeddings) 46 | 47 | def __embed_document(self, 48 | embeddings: Dict[str, np.float64]) -> np.float64: 49 | words = wordpunct_tokenize(self.text.lower()) 50 | vector = np.zeros(self.dim) 51 | for i, word in enumerate(words): 52 | if embeddings.get(word, None) is None: 53 | vector += np.zeros(self.dim) 54 | else: 55 | vector += embeddings[word] 56 | return vector / len(words) 57 | 58 | def get_word_positions(self) -> Dict[str, List[int]]: 59 | words = _filter_words(self.text) 60 | word_positions = {} 61 | for i, word in enumerate(words): 62 | if word_positions.get(word) is None: 63 | word_positions[word] = [i] 64 | else: 65 | word_positions[word].append(i) 66 | return word_positions 67 | 68 | class Phrase(Document): 69 | """Phrase to be embedded. Inherits from Document object. 70 | 71 | Parameters 72 | ---------- 73 | text : str, required 74 | The text to be embedded 75 | glove : Glove, required 76 | GloVe embeddings 77 | parent : Document, required 78 | Document where the Phrase is from 79 | 80 | Attributes 81 | ---------- 82 | text : str 83 | dim : int 84 | embedding : np.float64 85 | parent : Document 86 | positions : List[Tuple[int]] 87 | List of indices where a given phrase is located. 88 | Each index is represented as a Tuple where the first 89 | element is the first index the phrase appears in 90 | and the second element is the second index the phrase 91 | appears in. If a phrase is a unigram, a position Tuple 92 | is (position, position). 93 | similarity : float 94 | Cosine similarity between the parent document and the phrase. 95 | score : float, None 96 | Min/Max scaling of the cosine similarity in relation to the 97 | other candidate keyphrases. 98 | rank : int, None 99 | Phrase ranking with respect to the score in descending order. 100 | """ 101 | 102 | def __init__(self, 103 | text: str, 104 | parent: Document, 105 | glove: Glove) -> None: 106 | super().__init__(text, glove) 107 | self.parent = parent 108 | self.positions = self.__get_positions() 109 | self.window = self.__expand_window() 110 | self.similarity = cosine_similarity(parent.embedding, 111 | self.embedding) 112 | self.theme_weight = None 113 | self.score = None 114 | self.rank = None 115 | 116 | def __str__(self) -> str: 117 | return self.text 118 | 119 | def set_theme_weight(self, 120 | min_: float, 121 | max_: float) -> None: 122 | # THIS SHOULD BE SET_THEME_EMBEDDING!!!!! 
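# Min/max scaling of this phrase's cosine similarity to the parent document into [0, 1]; the scaled value is stored as the theme weight used later by page_rank_candidates().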
123 | diff = max_ - min_ 124 | self.theme_weight = (self.similarity - min_) / diff 125 | 126 | def calc_pmi(self, phrase, candidates: int): 127 | """Calculates point-wise mutual information between 128 | one candidate phrase and another.""" 129 | prob_phrase_one = len(self.positions) / candidates 130 | prob_phrase_two = len(phrase.positions) / candidates 131 | cooccur = 0 132 | for pos in phrase.positions: 133 | if self.window.get(pos[0]) or self.window.get(pos[1]): 134 | cooccur += 1 135 | prob_cooccur = cooccur / candidates 136 | return np.log(prob_cooccur / (prob_phrase_one * prob_phrase_two)) 137 | 138 | def __get_positions(self) -> List[Tuple[int]]: 139 | """Gets positions a phrase is in.""" 140 | parent_word_positions = self.parent.get_word_positions() 141 | phrase_split = self.text.lower().split(' ') 142 | positions = [] 143 | if len(phrase_split) == 1: 144 | for word_pos in parent_word_positions[phrase_split[0]]: 145 | positions.append((word_pos, word_pos)) 146 | else: 147 | phrase = {word: parent_word_positions[word] 148 | for word in phrase_split} 149 | len_phrase = len(phrase_split) 150 | for position in phrase[phrase_split[0]]: 151 | for i, word in enumerate(phrase_split[1:]): 152 | pred_pos = position + i + 1 153 | end_of_phrase = i + 2 == len_phrase 154 | is_pred_pos = pred_pos in phrase[word] 155 | if is_pred_pos and end_of_phrase: 156 | positions.append((position, pred_pos)) 157 | return positions 158 | 159 | def __expand_window(self) -> Dict[int, int]: 160 | """Returns dictionary of positions in a phrase's 161 | adj. window.""" 162 | window = {} 163 | phrase_len = len(self.parent.text.split(' ')) 164 | for pos in self.positions: 165 | min_index = max(pos[0] - 5, 0) 166 | max_index = min(pos[1] + 6, phrase_len) 167 | indices = [i for i in range(min_index, max_index)] 168 | for i in indices: 169 | if window.get(i) is None: 170 | window[i] = i 171 | return window -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/glove.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from typing import Dict 3 | 4 | class Glove(object): 5 | """GloVe vectors. 6 | 7 | Parameters 8 | ---------- 9 | path : str, required 10 | Path to the GloVe embeddings 11 | 12 | Attributes 13 | ---------- 14 | embeddings : Dict[str, np.float64] 15 | Dictionary of GloVe embeddings 16 | dim : int 17 | Dimension of GloVe embeddings 18 | """ 19 | 20 | def __init__(self, path: str) -> None: 21 | self.embeddings = self.__read_glove(path) 22 | self.dim = self.__get_dim() 23 | 24 | def __read_glove(self, path: str) -> Dict[str, np.float64]: 25 | """Reads GloVe vectors into a dictionary, where 26 | the words are the keys, and the vectors are the values. 
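Each line of the file is expected to hold a word followed by its space-separated vector components, as in the plain-text GloVe format.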
27 | 28 | Returns 29 | ------- 30 | word_vectors : Dict[str, np.float64] 31 | """ 32 | with open(path, 'r') as f: 33 | data = f.readlines() 34 | word_vectors = {} 35 | for row in data: 36 | stripped_row = row.strip('\n') 37 | split_row = stripped_row.split(' ') 38 | word = split_row[0] 39 | vector = [] 40 | for el in split_row[1:]: 41 | vector.append(float(el)) 42 | word_vectors[word] = np.array(vector) 43 | return word_vectors 44 | 45 | def __get_dim(self) -> int: 46 | return len(self.embeddings[list(self.embeddings.keys())[0]]) -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/key2vec.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import spacy 3 | import string 4 | import en_core_web_sm 5 | import os 6 | 7 | from nltk import sent_tokenize, wordpunct_tokenize 8 | from typing import Dict, List 9 | from .cleaner import Cleaner 10 | from .constants import ENTS_TO_IGNORE, STOPWORDS, PUNCT_SET 11 | from .docs import Document, Phrase 12 | from .glove import Glove 13 | from .phrase_graph import PhraseNode, PhraseGraph 14 | 15 | NLP = en_core_web_sm.load() 16 | 17 | class Key2Vec(object): 18 | """Implementation of Key2Vec. 19 | 20 | Parameters 21 | ---------- 22 | text : str, required 23 | The text to extract the top keyphrases from. 24 | glove : Glove 25 | GloVe vectors. 26 | 27 | Attributes 28 | ---------- 29 | text : Document 30 | Document object of the `text` parameter. 31 | glove : Glove 32 | candidates : List[Phrase] 33 | List of candidate keyphrases. Initialized as an empty list. 34 | candidate_graph : PhraseGraph 35 | Bidrectional graph of all candidate phrases 36 | """ 37 | 38 | def __init__(self, 39 | text: str, 40 | glove: Glove) -> None: 41 | 42 | self.doc = Document(text, glove) 43 | self.glove = glove 44 | self.candidates = [] 45 | self.candidate_graph = None 46 | 47 | def extract_candidates(self): 48 | """Extracts candidate phrases from the text. Sets 49 | `candidates` attributes to a list of Phrase objects. 
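Candidates are gathered from three sources (individual tokens, named entities and noun chunks) and filtered for stopwords, punctuation and ignored entity types.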
50 | """ 51 | 52 | sentences = sent_tokenize(self.doc.text) 53 | candidates = {} 54 | for sentence in sentences: 55 | doc = NLP(sentence) 56 | candidates = self.__extract_tokens(doc, candidates) 57 | candidates = self.__extract_entities(doc, candidates) 58 | candidates = self.__extract_noun_chunks(doc, candidates) 59 | self.candidates = list(candidates.values()) 60 | 61 | def __extract_tokens(self, doc, candidates): 62 | for token in doc: 63 | text = token.text.lower() 64 | not_punct = set(text).isdisjoint(PUNCT_SET) 65 | is_stopword = text in STOPWORDS 66 | in_candidates = candidates.get(text) is not None 67 | not_empty = text != '' 68 | keep = (not_punct 69 | and not_empty 70 | and not (is_stopword or in_candidates)) 71 | if keep: 72 | try: 73 | candidates[text] = Phrase(text, self.doc, 74 | self.glove) 75 | except KeyError: 76 | next 77 | else: 78 | pass 79 | return candidates 80 | 81 | def __extract_entities(self, doc, candidates): 82 | for ent in doc.ents: 83 | cleaned_text = Cleaner(ent).transform_text() 84 | is_ent_to_ignore = ent.label_ in ENTS_TO_IGNORE 85 | in_candidates = candidates.get(cleaned_text) is not None 86 | not_empty = cleaned_text != '' 87 | if not (is_ent_to_ignore or in_candidates) and not_empty: 88 | try: 89 | candidates[cleaned_text] = Phrase(cleaned_text, self.doc, 90 | self.glove) 91 | except KeyError: 92 | next 93 | return candidates 94 | 95 | def __extract_noun_chunks(self, doc, candidates): 96 | for chunk in doc.noun_chunks: 97 | cleaned_text = Cleaner(chunk).transform_text() 98 | not_empty = cleaned_text != '' 99 | if candidates.get(cleaned_text) is None and not_empty: 100 | try: 101 | candidates[cleaned_text] = Phrase(cleaned_text, 102 | self.doc, self.glove) 103 | except KeyError: 104 | next 105 | return candidates 106 | 107 | def set_theme_weights(self) -> List[Phrase]: 108 | """Ranks candidate keyphrases. 109 | 110 | Parameters 111 | ---------- 112 | top_n : int, optional (int = 10) 113 | How many top keyphrases to return. 114 | 115 | Returns 116 | ------- 117 | sorted_candidates : List[Phrase] 118 | Sorted list of candidates in reverse order. Returns `top_n` 119 | Phrase objects. 
120 | """ 121 | max_ = max([c.similarity for c in self.candidates]) 122 | min_ = min([c.similarity for c in self.candidates]) 123 | 124 | for c in self.candidates: 125 | c.set_theme_weight(min_, max_) 126 | 127 | def build_candidate_graph(self) -> None: 128 | """Builds bidirectional graph of candidates.""" 129 | 130 | if self.candidates == []: 131 | return 132 | 133 | candidate_graph = PhraseGraph(self.candidates) 134 | for candidate in self.candidates: 135 | candidate_graph.add_node(candidate) 136 | 137 | nodes = len(self.candidates) 138 | 139 | for node in candidate_graph.nodes: 140 | for other in candidate_graph.nodes: 141 | if node != other: 142 | candidate_graph.nodes[node].add_neighbor( 143 | candidate_graph.nodes[other], nodes) 144 | self.candidate_graph = candidate_graph 145 | 146 | def page_rank_candidates(self, top_n: int=10) -> List[Phrase]: 147 | """Page Ranks candidate phrases.""" 148 | if self.candidate_graph is None: 149 | return 150 | 151 | for node in self.candidate_graph.nodes.values(): 152 | theme = node.phrase.theme_weight 153 | d = 0.85 154 | weights = [] 155 | neighbors = list(node.adj_nodes.keys()) 156 | for neighbor in neighbors: 157 | out = node.adj_nodes[neighbor].incoming_edges 158 | weights.append(node.adj_weights[neighbor] / out) 159 | score = theme * (1 - d) + d * sum(weights) 160 | node.phrase.score = score 161 | 162 | sorted_candidates = sorted(self.candidates, 163 | key=lambda x: x.score)[::-1] 164 | 165 | for i, c in enumerate(sorted_candidates): 166 | c.rank = i + 1 167 | 168 | return sorted_candidates[:top_n] -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/phrase_graph.py: -------------------------------------------------------------------------------- 1 | from .docs import Document, Phrase, cosine_similarity 2 | from typing import List 3 | 4 | class PhraseNode(object): 5 | """Node in Phrase Graph.""" 6 | 7 | def __init__(self, phrase: Phrase): 8 | self.key = phrase.text 9 | self.phrase = phrase 10 | self.incoming_edges = 0 11 | self.adj_nodes = {} 12 | self.adj_weights = {} 13 | 14 | def __repr__(self): 15 | return str(self.key) 16 | 17 | def __lt__(self, other): 18 | return self.key < other.key 19 | 20 | def add_neighbor(self, neighbor, candidates, weight=0): 21 | if neighbor is None or weight is None: 22 | raise TypeError('neighbor or weight cannot be None') 23 | if self.__in_window(neighbor): 24 | neighbor.incoming_edges += 1 25 | cosine_score = cosine_similarity(self.phrase.embedding, 26 | neighbor.phrase.embedding) 27 | # need to rewrite api to allow candidates to be calculated 28 | pmi = self.phrase.calc_pmi(neighbor.phrase, candidates) 29 | self.adj_weights[neighbor.key] = cosine_score * pmi 30 | self.adj_nodes[neighbor.key] = neighbor 31 | 32 | def __in_window(self, neighbor): 33 | window = self.phrase.window 34 | neighbor_pos = neighbor.phrase.positions 35 | for pos in neighbor_pos: 36 | pos0 = window.get(pos[0]) 37 | pos1 = window.get(pos[1]) 38 | if window.get(pos0) or window.get(pos1): 39 | return True 40 | return False 41 | 42 | class PhraseGraph(object): 43 | """Bi-directional G=graph of phrases""" 44 | 45 | def __init__(self, candidates: List[Phrase]): 46 | self.nodes = {} 47 | self.candidates = candidates 48 | 49 | def add_node(self, key): 50 | if key is None: 51 | raise TypeError('key cannot be None') 52 | if key not in self.nodes: 53 | self.nodes[key] = PhraseNode(key) 54 | return self.nodes[key] 55 | 56 | def add_edge(self, source_key, dest_key, weight=0): 57 | if source_key is 
None or dest_key is None: 58 | raise KeyError('Invalid key') 59 | if source_key not in self.nodes: 60 | self.add_node(dest_key) 61 | if dest_key not in self.nodes: 62 | self.add_node(dest_key) 63 | self.nodes[source_key].add_neighbor(self.nodes[dest_key], 64 | weight) -------------------------------------------------------------------------------- /KeyExt/Key2Vec/requirements.txt: -------------------------------------------------------------------------------- 1 | blis==0.4.1 2 | certifi==2019.9.11 3 | chardet==3.0.4 4 | cymem==2.0.2 5 | idna==2.8 6 | murmurhash==1.0.2 7 | nltk==3.4.5 8 | numpy==1.17.3 9 | plac==0.9.6 10 | preshed==3.0.2 11 | python-dotenv==0.10.3 12 | requests==2.22.0 13 | scipy==1.3.1 14 | six==1.12.0 15 | spacy==2.2.1 16 | srsly==0.1.0 17 | thinc==7.1.1 18 | tqdm==4.36.1 19 | urllib3==1.25.6 20 | wasabi==0.2.2 21 | -------------------------------------------------------------------------------- /KeyExt/Key2Vec/setup.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/Key2Vec/setup.py -------------------------------------------------------------------------------- /KeyExt/Key2Vec/test.py: -------------------------------------------------------------------------------- 1 | import key2vec 2 | 3 | path = './data/glove.6B.50d.txt' 4 | glove = key2vec.glove.Glove(path) 5 | with open('./test.txt', 'r') as f: 6 | test = f.read() 7 | m = key2vec.key2vec.Key2Vec(test, glove) 8 | m.extract_candidates() 9 | m.set_theme_weights() 10 | m.build_candidate_graph() 11 | ranked = m.page_rank_candidates() 12 | 13 | for row in ranked: 14 | print(f'{row.text}') 15 | -------------------------------------------------------------------------------- /KeyExt/Key2Vec/test.txt: -------------------------------------------------------------------------------- 1 | Optimal and safe ship control as a multi-step matrix game 2 | The paper describes the process of the safe ship control in a collision 3 | situation using a differential game model with j participants. As an 4 | approximated model of the manoeuvring process, a model of a multi-step 5 | matrix game is adopted here. RISKTRAJ computer program is designed in 6 | the Matlab language in order to determine the ship's trajectory as a 7 | certain sequence of manoeuvres executed by altering the course and 8 | speed, in the online navigator decision support system. These 9 | considerations are illustrated with examples of a computer simulation 10 | of the safe ship's trajectories in real situation at sea when passing 11 | twelve of the encountered objects 12 | -------------------------------------------------------------------------------- /KeyExt/Key2Vec/tests/test_docs.py: -------------------------------------------------------------------------------- 1 | # More things to test about both the Document object 2 | # and the Phrase object 3 | 4 | import pytest 5 | from key2vec.glove import Glove 6 | from key2vec.docs import Document, Phrase 7 | 8 | glove = Glove('../data/glove.6B/glove.6B.50d.txt') 9 | 10 | def test_document(): 11 | text = "Hello! My name is Mark Secada. I'm a Data Scientist." 12 | doc = Document(text, glove) 13 | assert doc.text == text 14 | assert doc.dim == 50 15 | assert doc.embedding is not None 16 | 17 | def test_phrase(): 18 | text = "Hello! My name is Mark Secada. I'm a Data Scientist." 
19 | doc = Document(text, glove) 20 | phrase = Phrase("Mark Secada", glove, doc) 21 | assert phrase.text == "Mark Secada" 22 | assert phrase.dim == 50 23 | assert phrase.embedding is not None 24 | assert phrase.parent.text == text 25 | assert phrase.parent.dim == phrase.dim 26 | assert phrase.parent.embedding is not None 27 | assert type(phrase.similarity) == float 28 | 29 | phrase = Phrase("Secada", glove, doc) 30 | assert phrase.similarity == -1 31 | 32 | -------------------------------------------------------------------------------- /KeyExt/Key2Vec/tests/test_glove.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | from key2vec.glove import Glove 3 | 4 | def test_glove(): 5 | path = '../data/glove.6B/glove.6B.50d.txt' 6 | glove = Glove(path) 7 | assert glove.dim == 50 8 | assert glove.embeddings.get('the', None) is not None -------------------------------------------------------------------------------- /KeyExt/KeyBERT/README.md: -------------------------------------------------------------------------------- 1 | # KeyBERT 2 | 3 | This directory hosts code to run and benchmark the [KeyBERT](https://github.com/MaartenGr/KeyBERT) approach. 4 | 5 | ## Setup 6 | In order to run this approach, you need to `pip install keybert` and modify the `base_path` in `KeyBERT.py`, which is used to access the dataset directory. 7 | If you wish to run the `benchmark()` function you need to set the `output_path`, as well. -------------------------------------------------------------------------------- /KeyExt/RVA/Makefile: -------------------------------------------------------------------------------- 1 | CC = gcc 2 | #For older gcc, use -O3 or -O2 instead of -Ofast 3 | # CFLAGS = -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result 4 | 5 | # Use -Ofast with caution. It speeds up training, but the checks for NaN will not work 6 | # (-Ofast turns on --fast-math, which turns on -ffinite-math-only, 7 | # which assumes everything is NOT NaN or +-Inf, so checks for NaN always return false 8 | # see https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) 9 | # CFLAGS = -lm -pthread -Ofast -march=native -funroll-loops -Wall -Wextra -Wpedantic 10 | 11 | CFLAGS = -lm -pthread -O3 -march=native -funroll-loops -Wall -Wextra -Wpedantic 12 | BUILDDIR := build 13 | SRCDIR := src 14 | OBJDIR := $(BUILDDIR) 15 | 16 | OBJ := $(OBJDIR)/vocab_count.o $(OBJDIR)/cooccur.o $(OBJDIR)/shuffle.o $(OBJDIR)/glove.o 17 | HEADERS := $(SRCDIR)/common.h 18 | MODULES := $(BUILDDIR)/vocab_count $(BUILDDIR)/cooccur $(BUILDDIR)/shuffle $(BUILDDIR)/glove 19 | 20 | 21 | all: dir $(OBJ) $(MODULES) 22 | dir : 23 | mkdir -p $(BUILDDIR) 24 | $(BUILDDIR)/glove : $(OBJDIR)/glove.o $(OBJDIR)/common.o 25 | $(CC) $^ -o $@ $(CFLAGS) 26 | $(BUILDDIR)/shuffle : $(OBJDIR)/shuffle.o $(OBJDIR)/common.o 27 | $(CC) $^ -o $@ $(CFLAGS) 28 | $(BUILDDIR)/cooccur : $(OBJDIR)/cooccur.o $(OBJDIR)/common.o 29 | $(CC) $^ -o $@ $(CFLAGS) 30 | $(BUILDDIR)/vocab_count : $(OBJDIR)/vocab_count.o $(OBJDIR)/common.o 31 | $(CC) $^ -o $@ $(CFLAGS) 32 | $(OBJDIR)/%.o : $(SRCDIR)/%.c $(HEADERS) 33 | $(CC) -c $< -o $@ $(CFLAGS) 34 | .PHONY: clean 35 | clean: 36 | rm -rf $(BUILDDIR) 37 | -------------------------------------------------------------------------------- /KeyExt/RVA/README.md: -------------------------------------------------------------------------------- 1 | # RVA 2 | 3 | This directory contains the modified code for the [RVA](https://github.com/epapagia/RVA) approach. 
4 | 5 | ## Setup 6 | Follow the instructions from the original repo. 7 | Afterwards replace the files with the modified ones. 8 | In `main.py`, `base_path` and the path in `subprocess.call` need to be set for the dataset directory and the `.sh` script respectively. 9 | -------------------------------------------------------------------------------- /KeyExt/RVA/build/common.o: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/common.o -------------------------------------------------------------------------------- /KeyExt/RVA/build/cooccur: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/cooccur -------------------------------------------------------------------------------- /KeyExt/RVA/build/cooccur.o: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/cooccur.o -------------------------------------------------------------------------------- /KeyExt/RVA/build/glove: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/glove -------------------------------------------------------------------------------- /KeyExt/RVA/build/glove.o: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/glove.o -------------------------------------------------------------------------------- /KeyExt/RVA/build/shuffle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/shuffle -------------------------------------------------------------------------------- /KeyExt/RVA/build/shuffle.o: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/shuffle.o -------------------------------------------------------------------------------- /KeyExt/RVA/build/vocab_count: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/vocab_count -------------------------------------------------------------------------------- /KeyExt/RVA/build/vocab_count.o: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/vocab_count.o -------------------------------------------------------------------------------- /KeyExt/RVA/cooccurrence.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/cooccurrence.bin -------------------------------------------------------------------------------- 
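Relating to the RVA setup notes above: the README asks for two paths to be edited in `main.py` — the dataset directory (`base_path`) and the location of the `.sh` script passed to `subprocess.call`. A minimal, hedged sketch of what that might look like (the variable names and script arguments here are placeholders, not the repository's actual values; `demo.sh` further below shows that the script expects a corpus file plus a filename suffix, a vector size and an iteration count):

```python
import subprocess

# Placeholder paths -- adjust to your environment.
base_path = '/path/to/datasets'               # dataset directory read by main.py
script_path = '/path/to/KeyExt/RVA/demo.sh'   # the .sh script referenced in subprocess.call

# Hypothetical call: corpus file, output suffix, vector size, max iterations.
subprocess.call(['bash', script_path, 'corpus.txt', '_doc0', '100', '15'])
```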
/KeyExt/RVA/cooccurrence.shuf.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/cooccurrence.shuf.bin -------------------------------------------------------------------------------- /KeyExt/RVA/demo.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -e 3 | 4 | # Makes programs, downloads sample data, trains a GloVe model, and then evaluates it. 5 | # One optional argument can specify the language used for eval script: matlab, octave or [default] python 6 | 7 | #make 8 | #if [ ! -e text8 ]; then 9 | # if hash wget 2>/dev/null; then 10 | # wget http://mattmahoney.net/dc/text8.zip 11 | # else 12 | # curl -O http://mattmahoney.net/dc/text8.zip 13 | # fi 14 | # unzip text8.zip 15 | # rm text8.zip 16 | #fi 17 | 18 | CORPUS=$1 19 | VOCAB_FILE="vocab.txt$2$3$4" 20 | COOCCURRENCE_FILE=/home/groot/Desktop/RVA/glove/cooccurrence.bin 21 | COOCCURRENCE_SHUF_FILE=/home/groot/Desktop/RVA/glove/cooccurrence.shuf.bin 22 | BUILDDIR=/home/groot/Desktop/RVA/glove/build 23 | SAVE_FILE="vectors$2$3$4" 24 | VERBOSE=2 25 | MEMORY=7.1 26 | VOCAB_MIN_COUNT=1 27 | VECTOR_SIZE=$3 28 | MAX_ITER=$4 29 | WINDOW_SIZE=10 30 | BINARY=2 31 | NUM_THREADS=8 32 | X_MAX=100 33 | 34 | echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE" 35 | $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE 36 | echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE" 37 | $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE 38 | echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE" 39 | $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE 40 | echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE" 41 | $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE 42 | if [ "$CORPUS" = 'text8' ]; then 43 | if [ "$1" = 'matlab' ]; then 44 | matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2 45 | elif [ "$1" = 'octave' ]; then 46 | octave < ./eval/octave/read_and_evaluate_octave.m 1>&2 47 | else 48 | echo "$ python eval/python/evaluate.py" 49 | python eval/python/evaluate.py 50 | fi 51 | fi 52 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/matlab/WordLookup.m: -------------------------------------------------------------------------------- 1 | function index = WordLookup(InputString) 2 | global wordMap 3 | if wordMap.isKey(InputString) 4 | index = wordMap(InputString); 5 | elseif wordMap.isKey('') 6 | index = wordMap(''); 7 | else 8 | index = 0; 9 | end 10 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/matlab/evaluate_vectors.m: -------------------------------------------------------------------------------- 1 | function [BB] = evaluate_vectors(W) 2 
| 3 | global wordMap 4 | 5 | filenames = {'capital-common-countries' 'capital-world' 'currency' 'city-in-state' 'family' 'gram1-adjective-to-adverb' ... 6 | 'gram2-opposite' 'gram3-comparative' 'gram4-superlative' 'gram5-present-participle' 'gram6-nationality-adjective' ... 7 | 'gram7-past-tense' 'gram8-plural' 'gram9-plural-verbs'}; 8 | path = './eval/question-data/'; 9 | 10 | split_size = 100; %to avoid memory overflow, could be increased/decreased depending on system and vocab size 11 | 12 | correct_sem = 0; %count correct semantic questions 13 | correct_syn = 0; %count correct syntactic questions 14 | correct_tot = 0; %count correct questions 15 | count_sem = 0; %count all semantic questions 16 | count_syn = 0; %count all syntactic questions 17 | count_tot = 0; %count all questions 18 | full_count = 0; %count all questions, including those with unknown words 19 | 20 | if wordMap.isKey('') 21 | unkkey = wordMap(''); 22 | else 23 | unkkey = 0; 24 | end 25 | 26 | for j=1:length(filenames); 27 | 28 | clear dist; 29 | 30 | fid=fopen([path filenames{j} '.txt']); 31 | temp=textscan(fid,'%s%s%s%s'); 32 | fclose(fid); 33 | ind1 = cellfun(@WordLookup,temp{1}); %indices of first word in analogy 34 | ind2 = cellfun(@WordLookup,temp{2}); %indices of second word in analogy 35 | ind3 = cellfun(@WordLookup,temp{3}); %indices of third word in analogy 36 | ind4 = cellfun(@WordLookup,temp{4}); %indices of answer word in analogy 37 | full_count = full_count + length(ind1); 38 | ind = (ind1 ~= unkkey) & (ind2 ~= unkkey) & (ind3 ~= unkkey) & (ind4 ~= unkkey); %only look at those questions which have no unknown words 39 | ind1 = ind1(ind); 40 | ind2 = ind2(ind); 41 | ind3 = ind3(ind); 42 | ind4 = ind4(ind); 43 | disp([filenames{j} ':']); 44 | mx = zeros(1,length(ind1)); 45 | num_iter = ceil(length(ind1)/split_size); 46 | for jj=1:num_iter 47 | range = (jj-1)*split_size+1:min(jj*split_size,length(ind1)); 48 | dist = full(W * (W(ind2(range),:)' - W(ind1(range),:)' + W(ind3(range),:)')); %cosine similarity if input W has been normalized 49 | for i=1:length(range) 50 | dist(ind1(range(i)),i) = -Inf; 51 | dist(ind2(range(i)),i) = -Inf; 52 | dist(ind3(range(i)),i) = -Inf; 53 | end 54 | [~, mx(range)] = max(dist); %predicted word index 55 | end 56 | 57 | val = (ind4 == mx'); %correct predictions 58 | count_tot = count_tot + length(ind1); 59 | correct_tot = correct_tot + sum(val); 60 | disp(['ACCURACY TOP1: ' num2str(mean(val)*100,'%-2.2f') '% (' num2str(sum(val)) '/' num2str(length(val)) ')']); 61 | if j < 6 62 | count_sem = count_sem + length(ind1); 63 | correct_sem = correct_sem + sum(val); 64 | else 65 | count_syn = count_syn + length(ind1); 66 | correct_syn = correct_syn + sum(val); 67 | end 68 | 69 | disp(['Total accuracy: ' num2str(100*correct_tot/count_tot,'%-2.2f') '% Semantic accuracy: ' num2str(100*correct_sem/count_sem,'%-2.2f') '% Syntactic accuracy: ' num2str(100*correct_syn/count_syn,'%-2.2f') '%']); 70 | 71 | end 72 | disp('________________________________________________________________________________'); 73 | disp(['Questions seen/total: ' num2str(100*count_tot/full_count,'%-2.2f') '% (' num2str(count_tot) '/' num2str(full_count) ')']); 74 | disp(['Semantic Accuracy: ' num2str(100*correct_sem/count_sem,'%-2.2f') '% (' num2str(correct_sem) '/' num2str(count_sem) ')']); 75 | disp(['Syntactic Accuracy: ' num2str(100*correct_syn/count_syn,'%-2.2f') '% (' num2str(correct_syn) '/' num2str(count_syn) ')']); 76 | disp(['Total Accuracy: ' num2str(100*correct_tot/count_tot,'%-2.2f') '% (' 
num2str(correct_tot) '/' num2str(count_tot) ')']); 77 | BB = [100*correct_sem/count_sem 100*correct_syn/count_syn 100*correct_tot/count_tot]; 78 | 79 | end 80 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/matlab/read_and_evaluate.m: -------------------------------------------------------------------------------- 1 | addpath('./eval/matlab'); 2 | if(~exist('vocab_file')) 3 | vocab_file = 'vocab.txt'; 4 | end 5 | if(~exist('vectors_file')) 6 | vectors_file = 'vectors.bin'; 7 | end 8 | 9 | fid = fopen(vocab_file, 'r'); 10 | words = textscan(fid, '%s %f'); 11 | fclose(fid); 12 | words = words{1}; 13 | vocab_size = length(words); 14 | global wordMap 15 | wordMap = containers.Map(words(1:vocab_size),1:vocab_size); 16 | 17 | fid = fopen(vectors_file,'r'); 18 | fseek(fid,0,'eof'); 19 | vector_size = ftell(fid)/16/vocab_size - 1; 20 | frewind(fid); 21 | WW = fread(fid, [vector_size+1 2*vocab_size], 'double')'; 22 | fclose(fid); 23 | 24 | W1 = WW(1:vocab_size, 1:vector_size); % word vectors 25 | W2 = WW(vocab_size+1:end, 1:vector_size); % context (tilde) word vectors 26 | 27 | W = W1 + W2; %Evaluate on sum of word vectors 28 | W = bsxfun(@rdivide,W,sqrt(sum(W.*W,2))); %normalize vectors before evaluation 29 | evaluate_vectors(W); 30 | exit 31 | 32 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/octave/WordLookup_octave.m: -------------------------------------------------------------------------------- 1 | function index = WordLookup_octave(InputString) 2 | global wordMap 3 | 4 | if isfield(wordMap, InputString) 5 | index = wordMap.(InputString); 6 | elseif isfield(wordMap, '') 7 | index = wordMap.(''); 8 | else 9 | index = 0; 10 | end 11 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/octave/evaluate_vectors_octave.m: -------------------------------------------------------------------------------- 1 | function [BB] = evaluate_vectors_octave(W) 2 | 3 | global wordMap 4 | 5 | filenames = {'capital-common-countries' 'capital-world' 'currency' 'city-in-state' 'family' 'gram1-adjective-to-adverb' ... 6 | 'gram2-opposite' 'gram3-comparative' 'gram4-superlative' 'gram5-present-participle' 'gram6-nationality-adjective' ... 
7 | 'gram7-past-tense' 'gram8-plural' 'gram9-plural-verbs'}; 8 | path = './eval/question-data/'; 9 | 10 | split_size = 100; %to avoid memory overflow, could be increased/decreased depending on system and vocab size 11 | 12 | correct_sem = 0; %count correct semantic questions 13 | correct_syn = 0; %count correct syntactic questions 14 | correct_tot = 0; %count correct questions 15 | count_sem = 0; %count all semantic questions 16 | count_syn = 0; %count all syntactic questions 17 | count_tot = 0; %count all questions 18 | full_count = 0; %count all questions, including those with unknown words 19 | 20 | 21 | if isfield(wordMap, '') 22 | unkkey = wordMap.(''); 23 | else 24 | unkkey = 0; 25 | end 26 | 27 | for j=1:length(filenames); 28 | 29 | clear dist; 30 | 31 | fid=fopen([path filenames{j} '.txt']); 32 | temp=textscan(fid,'%s%s%s%s'); 33 | fclose(fid); 34 | ind1 = cellfun(@WordLookup_octave,temp{1}); %indices of first word in analogy 35 | ind2 = cellfun(@WordLookup_octave,temp{2}); %indices of second word in analogy 36 | ind3 = cellfun(@WordLookup_octave,temp{3}); %indices of third word in analogy 37 | ind4 = cellfun(@WordLookup_octave,temp{4}); %indices of answer word in analogy 38 | full_count = full_count + length(ind1); 39 | ind = (ind1 ~= unkkey) & (ind2 ~= unkkey) & (ind3 ~= unkkey) & (ind4 ~= unkkey); %only look at those questions which have no unknown words 40 | ind1 = ind1(ind); 41 | ind2 = ind2(ind); 42 | ind3 = ind3(ind); 43 | ind4 = ind4(ind); 44 | disp([filenames{j} ':']); 45 | mx = zeros(1,length(ind1)); 46 | num_iter = ceil(length(ind1)/split_size); 47 | for jj=1:num_iter 48 | range = (jj-1)*split_size+1:min(jj*split_size,length(ind1)); 49 | dist = full(W * (W(ind2(range),:)' - W(ind1(range),:)' + W(ind3(range),:)')); %cosine similarity if input W has been normalized 50 | for i=1:length(range) 51 | dist(ind1(range(i)),i) = -Inf; 52 | dist(ind2(range(i)),i) = -Inf; 53 | dist(ind3(range(i)),i) = -Inf; 54 | end 55 | [~, mx(range)] = max(dist); %predicted word index 56 | end 57 | 58 | val = (ind4 == mx'); %correct predictions 59 | count_tot = count_tot + length(ind1); 60 | correct_tot = correct_tot + sum(val); 61 | disp(['ACCURACY TOP1: ' num2str(mean(val)*100,'%-2.2f') '% (' num2str(sum(val)) '/' num2str(length(val)) ')']); 62 | if j < 6 63 | count_sem = count_sem + length(ind1); 64 | correct_sem = correct_sem + sum(val); 65 | else 66 | count_syn = count_syn + length(ind1); 67 | correct_syn = correct_syn + sum(val); 68 | end 69 | 70 | disp(['Total accuracy: ' num2str(100*correct_tot/count_tot,'%-2.2f') '% Semantic accuracy: ' num2str(100*correct_sem/count_sem,'%-2.2f') '% Syntactic accuracy: ' num2str(100*correct_syn/count_syn,'%-2.2f') '%']); 71 | 72 | end 73 | disp('________________________________________________________________________________'); 74 | disp(['Questions seen/total: ' num2str(100*count_tot/full_count,'%-2.2f') '% (' num2str(count_tot) '/' num2str(full_count) ')']); 75 | disp(['Semantic Accuracy: ' num2str(100*correct_sem/count_sem,'%-2.2f') '% (' num2str(correct_sem) '/' num2str(count_sem) ')']); 76 | disp(['Syntactic Accuracy: ' num2str(100*correct_syn/count_syn,'%-2.2f') '% (' num2str(correct_syn) '/' num2str(count_syn) ')']); 77 | disp(['Total Accuracy: ' num2str(100*correct_tot/count_tot,'%-2.2f') '% (' num2str(correct_tot) '/' num2str(count_tot) ')']); 78 | BB = [100*correct_sem/count_sem 100*correct_syn/count_syn 100*correct_tot/count_tot]; 79 | 80 | end 81 | -------------------------------------------------------------------------------- 
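A note on the `read_and_evaluate` scripts (Matlab earlier, Octave next): the vector dimensionality is not passed in but inferred from the size of `vectors.bin`. The binary vectors file written by `glove` stores a word vector and a context vector for every vocabulary entry, each of length `vector_size + 1` (the extra column is the bias term, which these scripts discard), as 8-byte doubles. The file size is therefore `2 * vocab_size * (vector_size + 1) * 8` bytes, so

    vector_size = file_size / (16 * vocab_size) - 1

which is exactly what `ftell(fid)/16/vocab_size - 1` computes before the `fread` call.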
/KeyExt/RVA/eval/octave/read_and_evaluate_octave.m: -------------------------------------------------------------------------------- 1 | addpath('./eval/octave'); 2 | if(~exist('vocab_file')) 3 | vocab_file = 'vocab.txt'; 4 | end 5 | if(~exist('vectors_file')) 6 | vectors_file = 'vectors.bin'; 7 | end 8 | 9 | fid = fopen(vocab_file, 'r'); 10 | words = textscan(fid, '%s %f'); 11 | fclose(fid); 12 | words = words{1}; 13 | vocab_size = length(words); 14 | global wordMap 15 | 16 | wordMap = struct(); 17 | for i=1:numel(words) 18 | wordMap.(words{i}) = i; 19 | end 20 | 21 | fid = fopen(vectors_file,'r'); 22 | fseek(fid,0,'eof'); 23 | vector_size = ftell(fid)/16/vocab_size - 1; 24 | frewind(fid); 25 | WW = fread(fid, [vector_size+1 2*vocab_size], 'double')'; 26 | fclose(fid); 27 | 28 | W1 = WW(1:vocab_size, 1:vector_size); % word vectors 29 | W2 = WW(vocab_size+1:end, 1:vector_size); % context (tilde) word vectors 30 | 31 | W = W1 + W2; %Evaluate on sum of word vectors 32 | W = bsxfun(@rdivide,W,sqrt(sum(W.*W,2))); %normalize vectors before evaluation 33 | evaluate_vectors_octave(W); 34 | exit 35 | 36 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/python/distance.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import sys 4 | 5 | def generate(): 6 | parser = argparse.ArgumentParser() 7 | parser.add_argument('--vocab_file', default='vocab.txt', type=str) 8 | parser.add_argument('--vectors_file', default='vectors.txt', type=str) 9 | args = parser.parse_args() 10 | 11 | with open(args.vocab_file, 'r') as f: 12 | words = [x.rstrip().split(' ')[0] for x in f.readlines()] 13 | with open(args.vectors_file, 'r') as f: 14 | vectors = {} 15 | for line in f: 16 | vals = line.rstrip().split(' ') 17 | vectors[vals[0]] = [float(x) for x in vals[1:]] 18 | 19 | vocab_size = len(words) 20 | vocab = {w: idx for idx, w in enumerate(words)} 21 | ivocab = {idx: w for idx, w in enumerate(words)} 22 | 23 | vector_dim = len(vectors[ivocab[0]]) 24 | W = np.zeros((vocab_size, vector_dim)) 25 | for word, v in vectors.items(): 26 | if word == '': 27 | continue 28 | W[vocab[word], :] = v 29 | 30 | # normalize each word vector to unit variance 31 | W_norm = np.zeros(W.shape) 32 | d = (np.sum(W ** 2, 1) ** (0.5)) 33 | W_norm = (W.T / d).T 34 | return (W_norm, vocab, ivocab) 35 | 36 | 37 | def distance(W, vocab, ivocab, input_term): 38 | for idx, term in enumerate(input_term.split(' ')): 39 | if term in vocab: 40 | print('Word: %s Position in vocabulary: %i' % (term, vocab[term])) 41 | if idx == 0: 42 | vec_result = np.copy(W[vocab[term], :]) 43 | else: 44 | vec_result += W[vocab[term], :] 45 | else: 46 | print('Word: %s Out of dictionary!\n' % term) 47 | return 48 | 49 | vec_norm = np.zeros(vec_result.shape) 50 | d = (np.sum(vec_result ** 2,) ** (0.5)) 51 | vec_norm = (vec_result.T / d).T 52 | 53 | dist = np.dot(W, vec_norm.T) 54 | 55 | for term in input_term.split(' '): 56 | index = vocab[term] 57 | dist[index] = -np.Inf 58 | 59 | a = np.argsort(-dist)[:N] 60 | 61 | print("\n Word Cosine distance\n") 62 | print("---------------------------------------------------------\n") 63 | for x in a: 64 | print("%35s\t\t%f\n" % (ivocab[x], dist[x])) 65 | 66 | 67 | if __name__ == "__main__": 68 | N = 100 # number of closest words that will be shown 69 | W, vocab, ivocab = generate() 70 | while True: 71 | input_term = input("\nEnter word or sentence (EXIT to break): ") 72 | if input_term == 'EXIT': 73 | 
break 74 | else: 75 | distance(W, vocab, ivocab, input_term) 76 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/python/evaluate.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | 4 | def main(): 5 | parser = argparse.ArgumentParser() 6 | parser.add_argument('--vocab_file', default='vocab.txt', type=str) 7 | parser.add_argument('--vectors_file', default='vectors.txt', type=str) 8 | args = parser.parse_args() 9 | 10 | with open(args.vocab_file, 'r') as f: 11 | words = [x.rstrip().split(' ')[0] for x in f.readlines()] 12 | with open(args.vectors_file, 'r') as f: 13 | vectors = {} 14 | for line in f: 15 | vals = line.rstrip().split(' ') 16 | vectors[vals[0]] = [float(x) for x in vals[1:]] 17 | 18 | vocab_size = len(words) 19 | vocab = {w: idx for idx, w in enumerate(words)} 20 | ivocab = {idx: w for idx, w in enumerate(words)} 21 | 22 | vector_dim = len(vectors[ivocab[0]]) 23 | W = np.zeros((vocab_size, vector_dim)) 24 | for word, v in vectors.items(): 25 | if word == '': 26 | continue 27 | W[vocab[word], :] = v 28 | 29 | # normalize each word vector to unit length 30 | W_norm = np.zeros(W.shape) 31 | d = (np.sum(W ** 2, 1) ** (0.5)) 32 | W_norm = (W.T / d).T 33 | evaluate_vectors(W_norm, vocab) 34 | 35 | def evaluate_vectors(W, vocab): 36 | """Evaluate the trained word vectors on a variety of tasks""" 37 | 38 | filenames = [ 39 | 'capital-common-countries.txt', 'capital-world.txt', 'currency.txt', 40 | 'city-in-state.txt', 'family.txt', 'gram1-adjective-to-adverb.txt', 41 | 'gram2-opposite.txt', 'gram3-comparative.txt', 'gram4-superlative.txt', 42 | 'gram5-present-participle.txt', 'gram6-nationality-adjective.txt', 43 | 'gram7-past-tense.txt', 'gram8-plural.txt', 'gram9-plural-verbs.txt', 44 | ] 45 | prefix = './eval/question-data/' 46 | 47 | # to avoid memory overflow, could be increased/decreased 48 | # depending on system and vocab size 49 | split_size = 100 50 | 51 | correct_sem = 0; # count correct semantic questions 52 | correct_syn = 0; # count correct syntactic questions 53 | correct_tot = 0 # count correct questions 54 | count_sem = 0; # count all semantic questions 55 | count_syn = 0; # count all syntactic questions 56 | count_tot = 0 # count all questions 57 | full_count = 0 # count all questions, including those with unknown words 58 | 59 | for i in range(len(filenames)): 60 | with open('%s/%s' % (prefix, filenames[i]), 'r') as f: 61 | full_data = [line.rstrip().split(' ') for line in f] 62 | full_count += len(full_data) 63 | data = [x for x in full_data if all(word in vocab for word in x)] 64 | 65 | if len(data) == 0: 66 | print("ERROR: no lines of vocab kept for %s !" 
% filenames[i]) 67 | print("Example missing line:", full_data[0]) 68 | continue 69 | 70 | indices = np.array([[vocab[word] for word in row] for row in data]) 71 | ind1, ind2, ind3, ind4 = indices.T 72 | 73 | predictions = np.zeros((len(indices),)) 74 | num_iter = int(np.ceil(len(indices) / float(split_size))) 75 | for j in range(num_iter): 76 | subset = np.arange(j*split_size, min((j + 1)*split_size, len(ind1))) 77 | 78 | pred_vec = (W[ind2[subset], :] - W[ind1[subset], :] 79 | + W[ind3[subset], :]) 80 | #cosine similarity if input W has been normalized 81 | dist = np.dot(W, pred_vec.T) 82 | 83 | for k in range(len(subset)): 84 | dist[ind1[subset[k]], k] = -np.Inf 85 | dist[ind2[subset[k]], k] = -np.Inf 86 | dist[ind3[subset[k]], k] = -np.Inf 87 | 88 | # predicted word index 89 | predictions[subset] = np.argmax(dist, 0).flatten() 90 | 91 | val = (ind4 == predictions) # correct predictions 92 | count_tot = count_tot + len(ind1) 93 | correct_tot = correct_tot + sum(val) 94 | if i < 5: 95 | count_sem = count_sem + len(ind1) 96 | correct_sem = correct_sem + sum(val) 97 | else: 98 | count_syn = count_syn + len(ind1) 99 | correct_syn = correct_syn + sum(val) 100 | 101 | print("%s:" % filenames[i]) 102 | print('ACCURACY TOP1: %.2f%% (%d/%d)' % 103 | (np.mean(val) * 100, np.sum(val), len(val))) 104 | 105 | print('Questions seen/total: %.2f%% (%d/%d)' % 106 | (100 * count_tot / float(full_count), count_tot, full_count)) 107 | print('Semantic accuracy: %.2f%% (%i/%i)' % 108 | (100 * correct_sem / float(count_sem), correct_sem, count_sem)) 109 | print('Syntactic accuracy: %.2f%% (%i/%i)' % 110 | (100 * correct_syn / float(count_syn), correct_syn, count_syn)) 111 | print('Total accuracy: %.2f%% (%i/%i)' % (100 * correct_tot / float(count_tot), correct_tot, count_tot)) 112 | 113 | 114 | if __name__ == "__main__": 115 | main() 116 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/python/word_analogy.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | 4 | def generate(): 5 | parser = argparse.ArgumentParser() 6 | parser.add_argument('--vocab_file', default='vocab.txt', type=str) 7 | parser.add_argument('--vectors_file', default='vectors.txt', type=str) 8 | args = parser.parse_args() 9 | 10 | with open(args.vocab_file, 'r') as f: 11 | words = [x.rstrip().split(' ')[0] for x in f.readlines()] 12 | with open(args.vectors_file, 'r') as f: 13 | vectors = {} 14 | for line in f: 15 | vals = line.rstrip().split(' ') 16 | vectors[vals[0]] = [float(x) for x in vals[1:]] 17 | 18 | vocab_size = len(words) 19 | vocab = {w: idx for idx, w in enumerate(words)} 20 | ivocab = {idx: w for idx, w in enumerate(words)} 21 | 22 | vector_dim = len(vectors[ivocab[0]]) 23 | W = np.zeros((vocab_size, vector_dim)) 24 | for word, v in vectors.items(): 25 | if word == '': 26 | continue 27 | W[vocab[word], :] = v 28 | 29 | # normalize each word vector to unit variance 30 | W_norm = np.zeros(W.shape) 31 | d = (np.sum(W ** 2, 1) ** (0.5)) 32 | W_norm = (W.T / d).T 33 | return (W_norm, vocab, ivocab) 34 | 35 | 36 | def distance(W, vocab, ivocab, input_term): 37 | vecs = {} 38 | if len(input_term.split(' ')) < 3: 39 | print("Only %i words were entered.. 
three words are needed at the input to perform the calculation\n" % len(input_term.split(' '))) 40 | return 41 | else: 42 | for idx, term in enumerate(input_term.split(' ')): 43 | if term in vocab: 44 | print('Word: %s Position in vocabulary: %i' % (term, vocab[term])) 45 | vecs[idx] = W[vocab[term], :] 46 | else: 47 | print('Word: %s Out of dictionary!\n' % term) 48 | return 49 | 50 | vec_result = vecs[1] - vecs[0] + vecs[2] 51 | 52 | vec_norm = np.zeros(vec_result.shape) 53 | d = (np.sum(vec_result ** 2,) ** (0.5)) 54 | vec_norm = (vec_result.T / d).T 55 | 56 | dist = np.dot(W, vec_norm.T) 57 | 58 | for term in input_term.split(' '): 59 | index = vocab[term] 60 | dist[index] = -np.Inf 61 | 62 | a = np.argsort(-dist)[:N] 63 | 64 | print("\n Word Cosine distance\n") 65 | print("---------------------------------------------------------\n") 66 | for x in a: 67 | print("%35s\t\t%f\n" % (ivocab[x], dist[x])) 68 | 69 | 70 | if __name__ == "__main__": 71 | N = 100; # number of closest words that will be shown 72 | W, vocab, ivocab = generate() 73 | while True: 74 | input_term = input("\nEnter three words (EXIT to break): ") 75 | if input_term == 'EXIT': 76 | break 77 | else: 78 | distance(W, vocab, ivocab, input_term) 79 | 80 | -------------------------------------------------------------------------------- /KeyExt/RVA/randomization.test.sh: -------------------------------------------------------------------------------- 1 | # Tests for ensuring randomization is being controlled 2 | 3 | make 4 | 5 | if [ ! -e text8 ]; then 6 | if hash wget 2>/dev/null; then 7 | wget http://mattmahoney.net/dc/text8.zip 8 | else 9 | curl -O http://mattmahoney.net/dc/text8.zip 10 | fi 11 | unzip text8.zip 12 | rm text8.zip 13 | fi 14 | 15 | # Global constants 16 | CORPUS=text8 17 | VERBOSE=2 18 | BUILDDIR=build 19 | MEMORY=4.0 20 | VOCAB_MIN_COUNT=20 21 | 22 | # Re-used files 23 | VOCAB_FILE=$(mktemp vocab.test.txt.XXXXXX) 24 | COOCCURRENCE_FILE=$(mktemp cooccurrence.test.bin.XXXXXX) 25 | COOCCURRENCE_SHUF_FILE=$(mktemp cooccurrence_shuf.test.bin.XXXXXX) 26 | 27 | # Make vocab 28 | $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE 29 | 30 | # Make Coocurrences 31 | $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size 5 < $CORPUS > $COOCCURRENCE_FILE 32 | 33 | # Shuffle Coocurrences 34 | $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE -seed 1 < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE 35 | 36 | # Keep track of failure 37 | num_failed=0 38 | 39 | check_exit() { 40 | eval $2 41 | failed=$(( $1 != $? 
)) 42 | num_failed=$(( $num_failed + $failed )) 43 | if [[ $failed -eq 0 ]]; then 44 | echo PASSED 45 | else 46 | echo FAILED 47 | fi 48 | } 49 | 50 | # Test control of random seed in shuffle 51 | printf "\n\n--- TEST SET: Control of random seed in shuffle\n" 52 | TEST_FILE=$(mktemp cooc_shuf.test.bin.XXXXXX) 53 | 54 | printf "\n- TEST: Using the same seed should get the same shuffle\n" 55 | $BUILDDIR/shuffle -memory $MEMORY -verbose 0 -seed 1 < $COOCCURRENCE_FILE > $TEST_FILE 56 | check_exit 0 "cmp --quiet $COOCCURRENCE_SHUF_FILE $TEST_FILE" 57 | 58 | printf "\n- TEST: Changing the seed should change the shuffle\n" 59 | $BUILDDIR/shuffle -memory $MEMORY -verbose 0 -seed 2 < $COOCCURRENCE_FILE > $TEST_FILE 60 | check_exit 1 "cmp --quiet $COOCCURRENCE_SHUF_FILE $TEST_FILE" 61 | 62 | rm $TEST_FILE # Clean up 63 | # --- 64 | 65 | # Control randomization in GloVe 66 | printf "\n\n--- TEST SET: Control of random seed in glove\n" 67 | # Note "-threads" must equal 1 for these to pass, since order in which results come back from individual threads is uncontrolled 68 | BASE_PREFIX=$(mktemp base_vectors.XXXXXX) 69 | TEST_PREFIX=$(mktemp test_vectors.XXXXXX) 70 | 71 | printf "\n- TEST: Reusing seed should give the same vectors\n" 72 | $BUILDDIR/glove -save-file $BASE_PREFIX -threads 1 -input-file $COOCCURRENCE_SHUF_FILE -iter 3 -vector-size 10 -binary 1 -vocab-file $VOCAB_FILE -verbose 0 -seed 1 73 | $BUILDDIR/glove -save-file $TEST_PREFIX -threads 1 -input-file $COOCCURRENCE_SHUF_FILE -iter 3 -vector-size 10 -binary 1 -vocab-file $VOCAB_FILE -verbose 0 -seed 1 74 | check_exit 0 "cmp --quiet $BASE_PREFIX.bin $TEST_PREFIX.bin" 75 | 76 | printf "\n- TEST: Changing seed should change the learned vectors\n" 77 | $BUILDDIR/glove -save-file $TEST_PREFIX -threads 1 -input-file $COOCCURRENCE_SHUF_FILE -iter 3 -vector-size 10 -binary 1 -vocab-file $VOCAB_FILE -verbose 0 -seed 2 78 | check_exit 1 "cmp --quiet $BASE_PREFIX.bin $TEST_PREFIX.bin" 79 | 80 | printf "\n- TEST: Should be able to save/load initial parameters\n" 81 | $BUILDDIR/glove -save-file $BASE_PREFIX -threads 1 -input-file $COOCCURRENCE_SHUF_FILE -iter 3 -vector-size 10 -binary 1 -vocab-file $VOCAB_FILE -verbose 0 -save-init-param 1 82 | $BUILDDIR/glove -save-file $TEST_PREFIX -threads 1 -input-file $COOCCURRENCE_SHUF_FILE -iter 3 -vector-size 10 -binary 1 -vocab-file $VOCAB_FILE -verbose 0 -save-init-param 1 -load-init-param 1 -init-param-file "$BASE_PREFIX.000.bin" 83 | check_exit 0 "cmp --quiet $BASE_PREFIX.000.bin $TEST_PREFIX.000.bin && cmp --quiet $BASE_PREFIX.bin $TEST_PREFIX.bin" 84 | 85 | rm "$BASE_PREFIX.000.bin" "$TEST_PREFIX.000.bin" "$BASE_PREFIX.bin" "$TEST_PREFIX.bin" # Clean up 86 | rm $BASE_PREFIX $TEST_PREFIX 87 | 88 | # ---- 89 | 90 | printf "\n- TEST: Should be able to save/load initial parameters and gradsq\n" 91 | # note: the seed will be randomly assigned and should not matter 92 | $BUILDDIR/glove -save-file $BASE_PREFIX -gradsq-file $BASE_PREFIX.gradsq -threads 1 -input-file $COOCCURRENCE_SHUF_FILE -iter 6 -vector-size 10 -binary 1 -vocab-file $VOCAB_FILE -verbose 0 -checkpoint-every 2 93 | 94 | $BUILDDIR/glove -save-file $TEST_PREFIX -gradsq-file $TEST_PREFIX.gradsq -threads 1 -input-file $COOCCURRENCE_SHUF_FILE -iter 4 -vector-size 10 -binary 1 -vocab-file $VOCAB_FILE -verbose 0 -checkpoint-every 2 -load-init-param 1 -init-param-file "$BASE_PREFIX.002.bin" -load-init-gradsq 1 -init-gradsq-file "$BASE_PREFIX.gradsq.002.bin" 95 | 96 | echo "Compare vectors before & after load gradsq - 2 iterations" 97 | check_exit 0 "cmp 
--quiet $BASE_PREFIX.004.bin $TEST_PREFIX.002.bin" 98 | echo "Compare vectors before & after load gradsq - 4 iterations" 99 | check_exit 0 "cmp --quiet $BASE_PREFIX.006.bin $TEST_PREFIX.004.bin" 100 | echo "Compare vectors before & after load gradsq - final" 101 | check_exit 0 "cmp --quiet $BASE_PREFIX.bin $TEST_PREFIX.bin" 102 | 103 | echo "Compare gradsq before & after load gradsq - 2 iterations" 104 | check_exit 0 "cmp --quiet $BASE_PREFIX.gradsq.004.bin $TEST_PREFIX.gradsq.002.bin" 105 | echo "Compare gradsq before & after load gradsq - 4 iterations" 106 | check_exit 0 "cmp --quiet $BASE_PREFIX.gradsq.006.bin $TEST_PREFIX.gradsq.004.bin" 107 | echo "Compare gradsq before & after load gradsq - final" 108 | check_exit 0 "cmp --quiet $BASE_PREFIX.gradsq.bin $TEST_PREFIX.gradsq.bin" 109 | 110 | echo "Cleaning up files" 111 | check_exit 0 "rm $BASE_PREFIX.002.bin $BASE_PREFIX.004.bin $BASE_PREFIX.006.bin $BASE_PREFIX.bin" 112 | check_exit 0 "rm $BASE_PREFIX.gradsq.002.bin $BASE_PREFIX.gradsq.004.bin $BASE_PREFIX.gradsq.006.bin $BASE_PREFIX.gradsq.bin" 113 | check_exit 0 "rm $TEST_PREFIX.002.bin $TEST_PREFIX.004.bin $TEST_PREFIX.bin" 114 | check_exit 0 "rm $TEST_PREFIX.gradsq.002.bin $TEST_PREFIX.gradsq.004.bin $TEST_PREFIX.gradsq.bin" 115 | check_exit 0 "rm $VOCAB_FILE $COOCCURRENCE_FILE $COOCCURRENCE_SHUF_FILE" 116 | 117 | echo 118 | echo SUMMARY: 119 | if [[ $num_failed -gt 0 ]]; then 120 | echo $num_failed tests failed. 121 | exit 1 122 | else 123 | echo All tests passed. 124 | exit 0 125 | fi 126 | 127 | 128 | -------------------------------------------------------------------------------- /KeyExt/RVA/src/README.md: -------------------------------------------------------------------------------- 1 | ### Package Contents 2 | 3 | To train your own GloVe vectors, first you'll need to prepare your corpus as a single text file with all words separated by one or more spaces or tabs. If your corpus has multiple documents, the documents (only) should be separated by new line characters. Cooccurrence contexts for words do not extend past newline characters. Once you create your corpus, you can train GloVe vectors using the following 4 tools. An example is included in `demo.sh`, which you can modify as necessary. 4 | 5 | The four main tools in this package are: 6 | 7 | #### 1) vocab_count 8 | This tool requires an input corpus that should already consist of whitespace-separated tokens. Use something like the [Stanford Tokenizer](https://nlp.stanford.edu/software/tokenizer.html) first on raw text. From the corpus, it constructs unigram counts from a corpus, and optionally thresholds the resulting vocabulary based on total vocabulary size or minimum frequency count. 9 | 10 | #### 2) cooccur 11 | Constructs word-word cooccurrence statistics from a corpus. The user should supply a vocabulary file, as produced by `vocab_count`, and may specify a variety of parameters, as described by running `./build/cooccur`. 12 | 13 | #### 3) shuffle 14 | Shuffles the binary file of cooccurrence statistics produced by `cooccur`. For large files, the file is automatically split into chunks, each of which is shuffled and stored on disk before being merged and shuffled together. The user may specify a number of parameters, as described by running `./build/shuffle`. 15 | 16 | #### 4) glove 17 | Train the GloVe model on the specified cooccurrence data, which typically will be the output of the `shuffle` tool. 
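Taken together, the four tools form a pipeline; `demo.sh` gives a complete, configurable version of it. As a minimal sketch, with a placeholder `corpus.txt` and illustrative parameter values rather than recommended settings:

```sh
build/vocab_count -min-count 5 -verbose 2 < corpus.txt > vocab.txt
build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 10 < corpus.txt > cooccurrence.bin
build/shuffle -memory 4.0 -verbose 2 < cooccurrence.bin > cooccurrence.shuf.bin
build/glove -save-file vectors -threads 8 -input-file cooccurrence.shuf.bin \
    -x-max 100 -iter 15 -vector-size 100 -binary 2 -vocab-file vocab.txt -verbose 2
```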
The user should supply a vocabulary file, as given by `vocab_count`, and may specify a number of other parameters, which are described by running `./build/glove`. 18 | -------------------------------------------------------------------------------- /KeyExt/RVA/src/common.c: -------------------------------------------------------------------------------- 1 | // Common code for cooccur.c, vocab_count.c, 2 | // glove.c and shuffle.c 3 | // 4 | // GloVe: Global Vectors for Word Representation 5 | // Copyright (c) 2014 The Board of Trustees of 6 | // The Leland Stanford Junior University. All Rights Reserved. 7 | // 8 | // Licensed under the Apache License, Version 2.0 (the "License"); 9 | // you may not use this file except in compliance with the License. 10 | // You may obtain a copy of the License at 11 | // 12 | // http://www.apache.org/licenses/LICENSE-2.0 13 | // 14 | // Unless required by applicable law or agreed to in writing, software 15 | // distributed under the License is distributed on an "AS IS" BASIS, 16 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | // See the License for the specific language governing permissions and 18 | // limitations under the License. 19 | // 20 | // 21 | // For more information, bug reports, fixes, contact: 22 | // Jeffrey Pennington (jpennin@stanford.edu) 23 | // Christopher Manning (manning@cs.stanford.edu) 24 | // https://github.com/stanfordnlp/GloVe/ 25 | // GlobalVectors@googlegroups.com 26 | // http://nlp.stanford.edu/projects/glove/ 27 | 28 | #include 29 | #include 30 | #include 31 | #include "common.h" 32 | 33 | #ifdef _MSC_VER 34 | #define STRERROR(ERRNO, BUF, BUFSIZE) strerror_s((BUF), (BUFSIZE), (ERRNO)) 35 | #else 36 | #define STRERROR(ERRNO, BUF, BUFSIZE) strerror_r((ERRNO), (BUF), (BUFSIZE)) 37 | #endif 38 | 39 | /* Efficient string comparison */ 40 | int scmp( char *s1, char *s2 ) { 41 | while (*s1 != '\0' && *s1 == *s2) {s1++; s2++;} 42 | return (*s1 - *s2); 43 | } 44 | 45 | /* Move-to-front hashing and hash function from Hugh Williams, http://www.seg.rmit.edu.au/code/zwh-ipl/ */ 46 | 47 | /* Simple bitwise hash function */ 48 | unsigned int bitwisehash(char *word, int tsize, unsigned int seed) { 49 | char c; 50 | unsigned int h; 51 | h = seed; 52 | for ( ; (c = *word) != '\0'; word++) h ^= ((h << 5) + c + (h >> 2)); 53 | return (unsigned int)((h & 0x7fffffff) % tsize); 54 | } 55 | 56 | /* Create hash table, initialise pointers to NULL */ 57 | HASHREC ** inithashtable() { 58 | int i; 59 | HASHREC **ht; 60 | ht = (HASHREC **) malloc( sizeof(HASHREC *) * TSIZE ); 61 | for (i = 0; i < TSIZE; i++) ht[i] = (HASHREC *) NULL; 62 | return ht; 63 | } 64 | 65 | /* Read word from input stream. Return 1 when encounter '\n' or EOF (but separate from word), 0 otherwise. 66 | Words can be separated by space(s), tab(s), or newline(s). Carriage return characters are just ignored. 67 | (Okay for Windows, but not for Mac OS 9-. Ignored even if by themselves or in words.) 68 | A newline is taken as indicating a new document (contexts won't cross newline). 69 | Argument word array is assumed to be of size MAX_STRING_LENGTH. 70 | words will be truncated if too long. They are truncated with some care so that they 71 | cannot truncate in the middle of a utf-8 character, but 72 | still little to no harm will be done for other encodings like iso-8859-1. 73 | (This function appears identically copied in vocab_count.c and cooccur.c.) 
74 | */ 75 | int get_word(char *word, FILE *fin) { 76 | int i = 0, ch; 77 | for ( ; ; ) { 78 | ch = fgetc(fin); 79 | if (ch == '\r') continue; 80 | if (i == 0 && ((ch == '\n') || (ch == EOF))) { 81 | word[i] = 0; 82 | return 1; 83 | } 84 | if (i == 0 && ((ch == ' ') || (ch == '\t'))) continue; // skip leading space 85 | if ((ch == EOF) || (ch == ' ') || (ch == '\t') || (ch == '\n')) { 86 | if (ch == '\n') ungetc(ch, fin); // return the newline next time as document ender 87 | break; 88 | } 89 | if (i < MAX_STRING_LENGTH - 1) 90 | word[i++] = ch; // don't allow words to exceed MAX_STRING_LENGTH 91 | } 92 | word[i] = 0; //null terminate 93 | // avoid truncation destroying a multibyte UTF-8 char except if only thing on line (so the i > x tests won't overwrite word[0]) 94 | // see https://en.wikipedia.org/wiki/UTF-8#Description 95 | if (i == MAX_STRING_LENGTH - 1 && (word[i-1] & 0x80) == 0x80) { 96 | if ((word[i-1] & 0xC0) == 0xC0) { 97 | word[i-1] = '\0'; 98 | } else if (i > 2 && (word[i-2] & 0xE0) == 0xE0) { 99 | word[i-2] = '\0'; 100 | } else if (i > 3 && (word[i-3] & 0xF8) == 0xF0) { 101 | word[i-3] = '\0'; 102 | } 103 | } 104 | return 0; 105 | } 106 | 107 | int find_arg(char *str, int argc, char **argv) { 108 | int i; 109 | for (i = 1; i < argc; i++) { 110 | if (!scmp(str, argv[i])) { 111 | if (i == argc - 1) { 112 | printf("No argument given for %s\n", str); 113 | exit(1); 114 | } 115 | return i; 116 | } 117 | } 118 | return -1; 119 | } 120 | 121 | void free_table(HASHREC **ht) { 122 | int i; 123 | HASHREC* current; 124 | HASHREC* tmp; 125 | for (i = 0; i < TSIZE; i++) { 126 | current = ht[i]; 127 | while (current != NULL) { 128 | tmp = current; 129 | current = current->next; 130 | free(tmp->word); 131 | free(tmp); 132 | } 133 | } 134 | free(ht); 135 | } 136 | 137 | void free_fid(FILE **fid, const int num) { 138 | int i; 139 | for(i = 0; i < num; i++) { 140 | if(fid[i] != NULL) 141 | fclose(fid[i]); 142 | } 143 | free(fid); 144 | } 145 | 146 | 147 | int log_file_loading_error(char *file_description, char *file_name) { 148 | fprintf(stderr, "Unable to open %s %s.\n", file_description, file_name); 149 | fprintf(stderr, "Errno: %d\n", errno); 150 | char error[MAX_STRING_LENGTH]; 151 | STRERROR(errno, error, MAX_STRING_LENGTH); 152 | fprintf(stderr, "Error description: %s\n", error); 153 | return errno; 154 | } 155 | -------------------------------------------------------------------------------- /KeyExt/RVA/src/common.h: -------------------------------------------------------------------------------- 1 | #ifndef COMMON_H 2 | #define COMMON_H 3 | 4 | // Common code for cooccur.c, vocab_count.c, 5 | // glove.c and shuffle.c 6 | // 7 | // GloVe: Global Vectors for Word Representation 8 | // Copyright (c) 2014 The Board of Trustees of 9 | // The Leland Stanford Junior University. All Rights Reserved. 10 | // 11 | // Licensed under the Apache License, Version 2.0 (the "License"); 12 | // you may not use this file except in compliance with the License. 13 | // You may obtain a copy of the License at 14 | // 15 | // http://www.apache.org/licenses/LICENSE-2.0 16 | // 17 | // Unless required by applicable law or agreed to in writing, software 18 | // distributed under the License is distributed on an "AS IS" BASIS, 19 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 20 | // See the License for the specific language governing permissions and 21 | // limitations under the License. 
22 | // 23 | // 24 | // For more information, bug reports, fixes, contact: 25 | // Jeffrey Pennington (jpennin@stanford.edu) 26 | // Christopher Manning (manning@cs.stanford.edu) 27 | // https://github.com/stanfordnlp/GloVe/ 28 | // GlobalVectors@googlegroups.com 29 | // http://nlp.stanford.edu/projects/glove/ 30 | 31 | #include 32 | 33 | #define MAX_STRING_LENGTH 1000 34 | #define TSIZE 1048576 35 | #define SEED 1159241 36 | #define HASHFN bitwisehash 37 | 38 | typedef double real; 39 | typedef struct cooccur_rec { 40 | int word1; 41 | int word2; 42 | real val; 43 | } CREC; 44 | typedef struct hashrec { 45 | char *word; 46 | long long num; //count or id 47 | struct hashrec *next; 48 | } HASHREC; 49 | 50 | 51 | int scmp( char *s1, char *s2 ); 52 | unsigned int bitwisehash(char *word, int tsize, unsigned int seed); 53 | HASHREC **inithashtable(); 54 | int get_word(char *word, FILE *fin); 55 | void free_table(HASHREC **ht); 56 | int find_arg(char *str, int argc, char **argv); 57 | void free_fid(FILE **fid, const int num); 58 | 59 | // logs errors when loading files. call after a failed load 60 | int log_file_loading_error(char *file_description, char *file_name); 61 | 62 | #endif /* COMMON_H */ 63 | 64 | -------------------------------------------------------------------------------- /KeyExt/RVA/src/shuffle.c: -------------------------------------------------------------------------------- 1 | // Tool to shuffle entries of word-word cooccurrence files 2 | // 3 | // Copyright (c) 2014 The Board of Trustees of 4 | // The Leland Stanford Junior University. All Rights Reserved. 5 | // 6 | // Licensed under the Apache License, Version 2.0 (the "License"); 7 | // you may not use this file except in compliance with the License. 8 | // You may obtain a copy of the License at 9 | // 10 | // http://www.apache.org/licenses/LICENSE-2.0 11 | // 12 | // Unless required by applicable law or agreed to in writing, software 13 | // distributed under the License is distributed on an "AS IS" BASIS, 14 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 15 | // See the License for the specific language governing permissions and 16 | // limitations under the License. 
17 | // 18 | // 19 | // For more information, bug reports, fixes, contact: 20 | // Jeffrey Pennington (jpennin@stanford.edu) 21 | // GlobalVectors@googlegroups.com 22 | // http://nlp.stanford.edu/projects/glove/ 23 | 24 | #include 25 | #include 26 | #include 27 | #include 28 | #include "common.h" 29 | 30 | 31 | static const long LRAND_MAX = ((long) RAND_MAX + 2) * (long)RAND_MAX; 32 | 33 | int verbose = 2; // 0, 1, or 2 34 | int seed = 0; 35 | long long array_size = 2000000; // size of chunks to shuffle individually 36 | char *file_head; // temporary file string 37 | real memory_limit = 2.0; // soft limit, in gigabytes 38 | 39 | /* Generate uniformly distributed random long ints */ 40 | static long rand_long(long n) { 41 | long limit = LRAND_MAX - LRAND_MAX % n; 42 | long rnd; 43 | do { 44 | rnd = ((long)RAND_MAX + 1) * (long)rand() + (long)rand(); 45 | } while (rnd >= limit); 46 | return rnd % n; 47 | } 48 | 49 | /* Write contents of array to binary file */ 50 | int write_chunk(CREC *array, long size, FILE *fout) { 51 | long i = 0; 52 | for (i = 0; i < size; i++) fwrite(&array[i], sizeof(CREC), 1, fout); 53 | return 0; 54 | } 55 | 56 | /* Fisher-Yates shuffle */ 57 | void shuffle(CREC *array, long n) { 58 | long i, j; 59 | CREC tmp; 60 | for (i = n - 1; i > 0; i--) { 61 | j = rand_long(i + 1); 62 | tmp = array[j]; 63 | array[j] = array[i]; 64 | array[i] = tmp; 65 | } 66 | } 67 | 68 | /* Merge shuffled temporary files; doesn't necessarily produce a perfect shuffle, but good enough */ 69 | int shuffle_merge(int num) { 70 | long i, j, k, l = 0; 71 | int fidcounter = 0; 72 | CREC *array; 73 | char filename[MAX_STRING_LENGTH]; 74 | FILE **fid, *fout = stdout; 75 | 76 | array = malloc(sizeof(CREC) * array_size); 77 | fid = calloc(num, sizeof(FILE)); 78 | for (fidcounter = 0; fidcounter < num; fidcounter++) { //num = number of temporary files to merge 79 | sprintf(filename,"%s_%04d.bin",file_head, fidcounter); 80 | fid[fidcounter] = fopen(filename, "rb"); 81 | if (fid[fidcounter] == NULL) { 82 | log_file_loading_error("temp file", filename); 83 | free(array); 84 | free_fid(fid, num); 85 | return 1; 86 | } 87 | } 88 | if (verbose > 0) fprintf(stderr, "Merging temp files: processed %ld lines.", l); 89 | 90 | while (1) { //Loop until EOF in all files 91 | i = 0; 92 | //Read at most array_size values into array, roughly array_size/num from each temp file 93 | for (j = 0; j < num; j++) { 94 | if (feof(fid[j])) continue; 95 | for (k = 0; k < array_size / num; k++){ 96 | fread(&array[i], sizeof(CREC), 1, fid[j]); 97 | if (feof(fid[j])) break; 98 | i++; 99 | } 100 | } 101 | if (i == 0) break; 102 | l += i; 103 | shuffle(array, i-1); // Shuffles lines between temp files 104 | write_chunk(array,i,fout); 105 | if (verbose > 0) fprintf(stderr, "\033[31G%ld lines.", l); 106 | } 107 | fprintf(stderr, "\033[0GMerging temp files: processed %ld lines.", l); 108 | for (fidcounter = 0; fidcounter < num; fidcounter++) { 109 | fclose(fid[fidcounter]); 110 | sprintf(filename,"%s_%04d.bin",file_head, fidcounter); 111 | remove(filename); 112 | } 113 | fprintf(stderr, "\n\n"); 114 | free(array); 115 | free(fid); 116 | return 0; 117 | } 118 | 119 | /* Shuffle large input stream by splitting into chunks */ 120 | int shuffle_by_chunks() { 121 | if (seed == 0) { 122 | seed = time(0); 123 | } 124 | fprintf(stderr, "Using random seed %d\n", seed); 125 | srand(seed); 126 | long i = 0, l = 0; 127 | int fidcounter = 0; 128 | char filename[MAX_STRING_LENGTH]; 129 | CREC *array; 130 | FILE *fin = stdin, *fid; 131 | array = 
malloc(sizeof(CREC) * array_size); 132 | 133 | fprintf(stderr,"SHUFFLING COOCCURRENCES\n"); 134 | if (verbose > 0) fprintf(stderr,"array size: %lld\n", array_size); 135 | sprintf(filename,"%s_%04d.bin",file_head, fidcounter); 136 | fid = fopen(filename,"w"); 137 | if (fid == NULL) { 138 | log_file_loading_error("file", filename); 139 | free(array); 140 | return 1; 141 | } 142 | if (verbose > 1) fprintf(stderr, "Shuffling by chunks: processed 0 lines."); 143 | 144 | while (1) { //Continue until EOF 145 | if (i >= array_size) {// If array is full, shuffle it and save to temporary file 146 | shuffle(array, i-2); 147 | l += i; 148 | if (verbose > 1) fprintf(stderr, "\033[22Gprocessed %ld lines.", l); 149 | write_chunk(array,i,fid); 150 | fclose(fid); 151 | fidcounter++; 152 | sprintf(filename,"%s_%04d.bin",file_head, fidcounter); 153 | fid = fopen(filename,"w"); 154 | if (fid == NULL) { 155 | log_file_loading_error("file", filename); 156 | free(array); 157 | return 1; 158 | } 159 | i = 0; 160 | } 161 | fread(&array[i], sizeof(CREC), 1, fin); 162 | if (feof(fin)) break; 163 | i++; 164 | } 165 | shuffle(array, i-2); //Last chunk may be smaller than array_size 166 | write_chunk(array,i,fid); 167 | l += i; 168 | if (verbose > 1) fprintf(stderr, "\033[22Gprocessed %ld lines.\n", l); 169 | if (verbose > 1) fprintf(stderr, "Wrote %d temporary file(s).\n", fidcounter + 1); 170 | fclose(fid); 171 | free(array); 172 | return shuffle_merge(fidcounter + 1); // Merge and shuffle together temporary files 173 | } 174 | 175 | int main(int argc, char **argv) { 176 | int i; 177 | 178 | if (argc == 2 && 179 | (!scmp(argv[1], "-h") || !scmp(argv[1], "-help") || !scmp(argv[1], "--help"))) { 180 | printf("Tool to shuffle entries of word-word cooccurrence files\n"); 181 | printf("Author: Jeffrey Pennington (jpennin@stanford.edu)\n\n"); 182 | printf("Usage options:\n"); 183 | printf("\t-verbose \n"); 184 | printf("\t\tSet verbosity: 0, 1, or 2 (default)\n"); 185 | printf("\t-memory \n"); 186 | printf("\t\tSoft limit for memory consumption, in GB; default 4.0\n"); 187 | printf("\t-array-size \n"); 188 | printf("\t\tLimit to length the buffer which stores chunks of data to shuffle before writing to disk. \n\t\tThis value overrides that which is automatically produced by '-memory'.\n"); 189 | printf("\t-temp-file \n"); 190 | printf("\t\tFilename, excluding extension, for temporary files; default temp_shuffle\n"); 191 | printf("\t-seed \n"); 192 | printf("\t\tRandom seed to use. 
If not set, will be randomized using current time."); 193 | printf("\nExample usage: (assuming 'cooccurrence.bin' has been produced by 'coccur')\n"); 194 | printf("./shuffle -verbose 2 -memory 8.0 < cooccurrence.bin > cooccurrence.shuf.bin\n"); 195 | return 0; 196 | } 197 | 198 | file_head = malloc(sizeof(char) * MAX_STRING_LENGTH); 199 | if ((i = find_arg((char *)"-verbose", argc, argv)) > 0) verbose = atoi(argv[i + 1]); 200 | if ((i = find_arg((char *)"-temp-file", argc, argv)) > 0) strcpy(file_head, argv[i + 1]); 201 | else strcpy(file_head, (char *)"temp_shuffle"); 202 | if ((i = find_arg((char *)"-memory", argc, argv)) > 0) memory_limit = atof(argv[i + 1]); 203 | array_size = (long long) (0.95 * (real)memory_limit * 1073741824/(sizeof(CREC))); 204 | if ((i = find_arg((char *)"-array-size", argc, argv)) > 0) array_size = atoll(argv[i + 1]); 205 | if ((i = find_arg((char *)"-seed", argc, argv)) > 0) seed = atoi(argv[i + 1]); 206 | const int returned_value = shuffle_by_chunks(); 207 | free(file_head); 208 | return returned_value; 209 | } 210 | 211 | -------------------------------------------------------------------------------- /KeyExt/RVA/src/vocab_count.c: -------------------------------------------------------------------------------- 1 | // Tool to extract unigram counts 2 | // 3 | // GloVe: Global Vectors for Word Representation 4 | // Copyright (c) 2014 The Board of Trustees of 5 | // The Leland Stanford Junior University. All Rights Reserved. 6 | // 7 | // Licensed under the Apache License, Version 2.0 (the "License"); 8 | // you may not use this file except in compliance with the License. 9 | // You may obtain a copy of the License at 10 | // 11 | // http://www.apache.org/licenses/LICENSE-2.0 12 | // 13 | // Unless required by applicable law or agreed to in writing, software 14 | // distributed under the License is distributed on an "AS IS" BASIS, 15 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | // See the License for the specific language governing permissions and 17 | // limitations under the License. 18 | // 19 | // 20 | // For more information, bug reports, fixes, contact: 21 | // Jeffrey Pennington (jpennin@stanford.edu) 22 | // Christopher Manning (manning@cs.stanford.edu) 23 | // https://github.com/stanfordnlp/GloVe/ 24 | // GlobalVectors@googlegroups.com 25 | // http://nlp.stanford.edu/projects/glove/ 26 | 27 | #include 28 | #include 29 | #include 30 | #include "common.h" 31 | 32 | typedef struct vocabulary { 33 | char *word; 34 | long long count; 35 | } VOCAB; 36 | 37 | int verbose = 2; // 0, 1, or 2 38 | long long min_count = 1; // min occurrences for inclusion in vocab 39 | long long max_vocab = 0; // max_vocab = 0 for no limit 40 | 41 | 42 | /* Vocab frequency comparison; break ties alphabetically */ 43 | int CompareVocabTie(const void *a, const void *b) { 44 | long long c; 45 | if ( (c = ((VOCAB *) b)->count - ((VOCAB *) a)->count) != 0) return ( c > 0 ? 1 : -1 ); 46 | else return (scmp(((VOCAB *) a)->word,((VOCAB *) b)->word)); 47 | 48 | } 49 | 50 | /* Vocab frequency comparison; no tie-breaker */ 51 | int CompareVocab(const void *a, const void *b) { 52 | long long c; 53 | if ( (c = ((VOCAB *) b)->count - ((VOCAB *) a)->count) != 0) return ( c > 0 ? 
1 : -1 ); 54 | else return 0; 55 | } 56 | 57 | /* Search hash table for given string, insert if not found */ 58 | void hashinsert(HASHREC **ht, char *w) { 59 | HASHREC *htmp, *hprv; 60 | unsigned int hval = HASHFN(w, TSIZE, SEED); 61 | 62 | for (hprv = NULL, htmp = ht[hval]; htmp != NULL && scmp(htmp->word, w) != 0; hprv = htmp, htmp = htmp->next); 63 | if (htmp == NULL) { 64 | htmp = (HASHREC *) malloc( sizeof(HASHREC) ); 65 | htmp->word = (char *) malloc( strlen(w) + 1 ); 66 | strcpy(htmp->word, w); 67 | htmp->num = 1; 68 | htmp->next = NULL; 69 | if ( hprv==NULL ) 70 | ht[hval] = htmp; 71 | else 72 | hprv->next = htmp; 73 | } 74 | else { 75 | /* new records are not moved to front */ 76 | htmp->num++; 77 | if (hprv != NULL) { 78 | /* move to front on access */ 79 | hprv->next = htmp->next; 80 | htmp->next = ht[hval]; 81 | ht[hval] = htmp; 82 | } 83 | } 84 | return; 85 | } 86 | 87 | int get_counts() { 88 | long long i = 0, j = 0, vocab_size = 12500; 89 | // char format[20]; 90 | char str[MAX_STRING_LENGTH + 1]; 91 | HASHREC **vocab_hash = inithashtable(); 92 | HASHREC *htmp; 93 | VOCAB *vocab; 94 | FILE *fid = stdin; 95 | 96 | fprintf(stderr, "BUILDING VOCABULARY\n"); 97 | if (verbose > 1) fprintf(stderr, "Processed %lld tokens.", i); 98 | // sprintf(format,"%%%ds",MAX_STRING_LENGTH); 99 | while ( ! feof(fid)) { 100 | // Insert all tokens into hashtable 101 | int nl = get_word(str, fid); 102 | if (nl) continue; // just a newline marker or feof 103 | if (strcmp(str, "<unk>") == 0) { 104 | fprintf(stderr, "\nError, <unk> vector found in corpus.\nPlease remove <unk>s from your corpus (e.g. cat text8 | sed -e 's/<unk>/<raw_unk>/g' > text8.new)"); 105 | free_table(vocab_hash); 106 | return 1; 107 | } 108 | hashinsert(vocab_hash, str); 109 | if (((++i)%100000) == 0) if (verbose > 1) fprintf(stderr,"\033[11G%lld tokens.", i); 110 | } 111 | if (verbose > 1) fprintf(stderr, "\033[0GProcessed %lld tokens.\n", i); 112 | vocab = malloc(sizeof(VOCAB) * vocab_size); 113 | for (i = 0; i < TSIZE; i++) { // Migrate vocab to array 114 | htmp = vocab_hash[i]; 115 | while (htmp != NULL) { 116 | vocab[j].word = htmp->word; 117 | vocab[j].count = htmp->num; 118 | j++; 119 | if (j>=vocab_size) { 120 | vocab_size += 2500; 121 | vocab = (VOCAB *)realloc(vocab, sizeof(VOCAB) * vocab_size); 122 | } 123 | htmp = htmp->next; 124 | } 125 | } 126 | if (verbose > 1) fprintf(stderr, "Counted %lld unique words.\n", j); 127 | if (max_vocab > 0 && max_vocab < j) 128 | // If the vocabulary exceeds limit, first sort full vocab by frequency without alphabetical tie-breaks.
129 | // This results in pseudo-random ordering for words with same frequency, so that when truncated, the words span whole alphabet 130 | qsort(vocab, j, sizeof(VOCAB), CompareVocab); 131 | else max_vocab = j; 132 | qsort(vocab, max_vocab, sizeof(VOCAB), CompareVocabTie); //After (possibly) truncating, sort (possibly again), breaking ties alphabetically 133 | 134 | for (i = 0; i < max_vocab; i++) { 135 | if (vocab[i].count < min_count) { // If a minimum frequency cutoff exists, truncate vocabulary 136 | if (verbose > 0) fprintf(stderr, "Truncating vocabulary at min count %lld.\n",min_count); 137 | break; 138 | } 139 | printf("%s %lld\n",vocab[i].word,vocab[i].count); 140 | } 141 | 142 | if (i == max_vocab && max_vocab < j) if (verbose > 0) fprintf(stderr, "Truncating vocabulary at size %lld.\n", max_vocab); 143 | fprintf(stderr, "Using vocabulary of size %lld.\n\n", i); 144 | free_table(vocab_hash); 145 | free(vocab); 146 | return 0; 147 | } 148 | 149 | int main(int argc, char **argv) { 150 | if (argc == 2 && 151 | (!scmp(argv[1], "-h") || !scmp(argv[1], "-help") || !scmp(argv[1], "--help"))) { 152 | printf("Simple tool to extract unigram counts\n"); 153 | printf("Author: Jeffrey Pennington (jpennin@stanford.edu)\n\n"); 154 | printf("Usage options:\n"); 155 | printf("\t-verbose <int>\n"); 156 | printf("\t\tSet verbosity: 0, 1, or 2 (default)\n"); 157 | printf("\t-max-vocab <int>\n"); 158 | printf("\t\tUpper bound on vocabulary size, i.e. keep the <int> most frequent words. The minimum frequency words are randomly sampled so as to obtain an even distribution over the alphabet.\n"); 159 | printf("\t-min-count <int>\n"); 160 | printf("\t\tLower limit such that words which occur fewer than <int> times are discarded.\n"); 161 | printf("\nExample usage:\n"); 162 | printf("./vocab_count -verbose 2 -max-vocab 100000 -min-count 10 < corpus.txt > vocab.txt\n"); 163 | return 0; 164 | } 165 | 166 | int i; 167 | if ((i = find_arg((char *)"-verbose", argc, argv)) > 0) verbose = atoi(argv[i + 1]); 168 | if ((i = find_arg((char *)"-max-vocab", argc, argv)) > 0) max_vocab = atoll(argv[i + 1]); 169 | if ((i = find_arg((char *)"-min-count", argc, argv)) > 0) min_count = atoll(argv[i + 1]); 170 | return get_counts(); 171 | } 172 | 173 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/README.md: -------------------------------------------------------------------------------- 1 | # SIFRank 2 | 3 | This directory contains the modified code for the [SIFRank](https://github.com/sunyilgdx/SIFRank) approach. 4 | 5 | ## Modified files 6 | The following files were modified in place, so as to remove the hardcoded dataset paths, 7 | and to ensure that the approach runs in CPU mode. 8 | 9 | * main.py 10 | * embeddings.sent_emb_sif.py 11 | * embeddings.word_emb_elmo.py 12 | 13 | ## Setup 14 | Follow the instructions from the original repo and run `pip install -r requirements.txt`. 15 | Afterwards, replace the files with the modified ones. 16 | In `main.py`, `base_path` and `exec_path` need to be set to the dataset directory and the local project path, respectively. 17 | In `sent_emb_sif`, `weightfile_pretrain` and `weightfile_finetune` need to be set to the corresponding files under the local project path. 18 | In `word_emb_elmo`, `options_file` and `weight_file` need to be set similarly. 19 | If you wish to run the `benchmark()` function, you also need to set `output_path` in `main.py`.
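A minimal sketch of these settings is shown below. All paths are illustrative placeholders (not the values used in our runs) and must be adapted to your local setup; the ELMo weights file is downloaded separately from https://allennlp.org/elmo.

```python
# main.py (illustrative placeholder paths)
base_path = r'..\datasets'    # directory that holds the evaluation datasets
exec_path = r'C:\SIFRank'     # local project path
output_path = r'..\output'    # only needed for benchmark()

# embeddings/sent_emb_sif.py (example vocabulary files; adjust to your copies)
weightfile_pretrain = r'C:\SIFRank\auxiliary_data\enwiki_vocab_min200.txt'
weightfile_finetune = r'C:\SIFRank\auxiliary_data\inspec_vocab.txt'

# embeddings/word_emb_elmo.py (options file ships with the repo, weights are downloaded)
options_file = r'C:\SIFRank\auxiliary_data\elmo_2x4096_512_2048cnn_2xhighway_options.json'
weight_file = r'C:\SIFRank\auxiliary_data\elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5'
```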
20 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/auxiliary_data/__init__.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/12/19 5 | 6 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/auxiliary_data/elmo_2x4096_512_2048cnn_2xhighway_options.json: -------------------------------------------------------------------------------- 1 | {"lstm": {"use_skip_connections": true, "projection_dim": 512, "cell_clip": 3, "proj_clip": 3, "dim": 4096, "n_layers": 2}, "char_cnn": {"activation": "relu", "filters": [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]], "n_highway": 2, "embedding": {"dim": 16}, "n_characters": 262, "max_characters_per_token": 50}} 2 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/embeddings/__init__.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/12/19 5 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/embeddings/word_emb_bert.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/7/29 5 | 6 | from bert_serving.client import BertClient 7 | import numpy as np 8 | class WordEmbeddings(): 9 | """ 10 | Concrete class of @EmbeddingDistributor using ELMo 11 | https://allennlp.org/elmo 12 | 13 | """ 14 | 15 | def __init__(self,N=768): 16 | 17 | self.bert = BertClient() 18 | self.N = N 19 | 20 | def get_tokenized_words_embeddings(self, sents_tokened): 21 | """ 22 | @see EmbeddingDistributor 23 | :param tokenized_sents: list of tokenized words string (sentences/phrases) 24 | :return: ndarray with shape (len(sents), dimension of embeddings) 25 | """ 26 | bert_embeddings=[] 27 | for i in range(0, len(sents_tokened)): 28 | length = len(sents_tokened[i]) 29 | b_e = np.zeros((1, length, self.N)) 30 | b_e[0]=self.bert.encode(sents_tokened[i]) 31 | bert_embeddings.append(b_e) 32 | 33 | return np.array( bert_embeddings) 34 | 35 | 36 | if __name__ == '__main__': 37 | Bert=WordEmbeddings() 38 | sent_tokens=[['I',"love","Rock","and","R","!"],['I',"love","Rock","and","R","!"]] 39 | embs=Bert.get_tokenized_words_embeddings(sent_tokens) 40 | print(embs) 41 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/embeddings/word_emb_elmo.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/6/19 5 | from allennlp.commands.elmo import ElmoEmbedder 6 | 7 | class WordEmbeddings(): 8 | """ 9 | ELMo 10 | https://allennlp.org/elmo 11 | 12 | """ 13 | 14 | def __init__(self, 15 | options_file="../auxiliary_data/elmo_2x4096_512_2048cnn_2xhighway_options.json", 16 | weight_file="../auxiliary_data/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5", cuda_device=0): 17 | self.cuda_device=cuda_device 18 | self.elmo = ElmoEmbedder(options_file, weight_file,cuda_device=self.cuda_device) 19 | 20 | def get_tokenized_words_embeddings(self, sents_tokened): 21 | """ 22 | @see EmbeddingDistributor 23 | :param tokenized_sents: list of tokenized words string (sentences/phrases) 24 | :return: ndarray with shape (len(sents), dimension of embeddings) 25 | """ 26 | 27 | elmo_embedding, elmo_mask = self.elmo.batch_to_embeddings(sents_tokened) 28 | if(self.cuda_device>-2): 29 | return elmo_embedding.cpu(), elmo_mask.cpu() 30 | else: 31 | return elmo_embedding, elmo_mask 32 | 33 | 34 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/eval/sifrank_eval.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/6/25 5 | 6 | import nltk 7 | from embeddings import sent_emb_sif, word_emb_elmo 8 | from model.method import SIFRank, SIFRank_plus 9 | from util import fileIO 10 | from stanfordcorenlp import StanfordCoreNLP 11 | import time 12 | 13 | def get_PRF(num_c, num_e, num_s): 14 | F1 = 0.0 15 | P = float(num_c) / float(num_e) 16 | R = float(num_c) / float(num_s) 17 | if (P + R == 0.0): 18 | F1 = 0 19 | else: 20 | F1 = 2 * P * R / (P + R) 21 | return P, R, F1 22 | 23 | 24 | def print_PRF(P, R, F1, N): 25 | 26 | print("\nN=" + str(N), end="\n") 27 | print("P=" + str(P), end="\n") 28 | print("R=" + str(R), end="\n") 29 | print("F1=" + str(F1)) 30 | return 0 31 | 32 | 33 | time_start = time.time() 34 | 35 | P = R = F1 = 0.0 36 | num_c_5 = num_c_10 = num_c_15 = 0 37 | num_e_5 = num_e_10 = num_e_15 = 0 38 | num_s = 0 39 | lamda = 0.0 40 | 41 | database1 = "Inspec" 42 | database2 = "Duc2001" 43 | database3 = "Semeval2017" 44 | 45 | database = database1 46 | 47 | if(database == "Inspec"): 48 | data, labels = fileIO.get_inspec_data() 49 | lamda = 0.6 50 | elmo_layers_weight = [0.0, 1.0, 0.0] 51 | elif(database == "Duc2001"): 52 | data, labels = fileIO.get_duc2001_data() 53 | lamda = 1.0 54 | elmo_layers_weight = [1.0, 0.0, 0.0] 55 | else: 56 | data, labels = fileIO.get_semeval2017_data() 57 | lamda = 0.6 58 | elmo_layers_weight = [1.0, 0.0, 0.0] 59 | 60 | #download from https://allennlp.org/elmo 61 | options_file = "../auxiliary_data/elmo_2x4096_512_2048cnn_2xhighway_options.json" 62 | weight_file = "../auxiliary_data/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5" 63 | 64 | porter = nltk.PorterStemmer()#please download nltk 65 | ELMO = word_emb_elmo.WordEmbeddings(options_file, weight_file, cuda_device=0) 66 | SIF = sent_emb_sif.SentEmbeddings(ELMO, lamda=lamda, database=database) 67 | en_model = StanfordCoreNLP(r'E:\Python_Files\stanford-corenlp-full-2018-02-27',quiet=True)#download from https://stanfordnlp.github.io/CoreNLP/ 68 | 69 | try: 70 | for key, data in data.items(): 71 | 72 | lables = labels[key] 73 | lables_stemed = [] 74 | 75 | for lable in lables: 76 | tokens = lable.split() 77 | lables_stemed.append(' '.join(porter.stem(t) for t in tokens)) 78 
| 79 | print(key) 80 | 81 | dist_sorted = SIFRank(data, SIF, en_model, elmo_layers_weight=elmo_layers_weight,if_DS=True,if_EA=True) 82 | # dist_sorted = SIFRank_plus(data, SIF, en_model, elmo_layers_weight=elmo_layers_weight) 83 | 84 | j = 0 85 | for temp in dist_sorted[0:15]: 86 | tokens = temp[0].split() 87 | tt = ' '.join(porter.stem(t) for t in tokens) 88 | if (tt in lables_stemed or temp[0] in labels[key]): 89 | if (j < 5): 90 | num_c_5 += 1 91 | num_c_10 += 1 92 | num_c_15 += 1 93 | 94 | elif (j < 10 and j >= 5): 95 | num_c_10 += 1 96 | num_c_15 += 1 97 | 98 | elif (j < 15 and j >= 10): 99 | num_c_15 += 1 100 | j += 1 101 | 102 | if (len(dist_sorted[0:5]) == 5): 103 | num_e_5 += 5 104 | else: 105 | num_e_5 += len(dist_sorted[0:5]) 106 | 107 | if (len(dist_sorted[0:10]) == 10): 108 | num_e_10 += 10 109 | else: 110 | num_e_10 += len(dist_sorted[0:10]) 111 | 112 | if (len(dist_sorted[0:15]) == 15): 113 | num_e_15 += 15 114 | else: 115 | num_e_15 += len(dist_sorted[0:15]) 116 | 117 | num_s += len(labels[key]) 118 | 119 | en_model.close() 120 | p, r, f = get_PRF(num_c_5, num_e_5, num_s) 121 | print_PRF(p, r, f, 5) 122 | p, r, f = get_PRF(num_c_10, num_e_10, num_s) 123 | print_PRF(p, r, f, 10) 124 | p, r, f = get_PRF(num_c_15, num_e_15, num_s) 125 | print_PRF(p, r, f, 15) 126 | 127 | 128 | except ValueError: 129 | en_model.close() 130 | en_model.close() 131 | time_end = time.time() 132 | print('totally cost', time_end - time_start) 133 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/model/__init__.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/12/19 5 | 6 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/model/extractor.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/6/19 5 | import nltk 6 | from model import input_representation 7 | 8 | #GRAMMAR1 is the general way to extract NPs 9 | 10 | GRAMMAR1 = """ NP: 11 | {<NN.*|JJ>*<NN.*>} # Adjective(s)(optional) + Noun(s)""" 12 | 13 | GRAMMAR2 = """ NP: 14 | {<JJ|VBG>*<NN.*>{0,3}} # Adjective(s)(optional) + Noun(s)""" 15 | 16 | GRAMMAR3 = """ NP: 17 | {<NN.*|JJ|VBG|VBN>*<NN.*>} # Adjective(s)(optional) + Noun(s)""" 18 | 19 | 20 | def extract_candidates(tokens_tagged, no_subset=False): 21 | """ 22 | Based on part of speech return a list of candidate phrases 23 | :param text_obj: Input text Representation see @InputTextObj 24 | :param no_subset: if true won't put a candidate which is the subset of another candidate 25 | :return keyphrase_candidate: list of list of candidate phrases: [tuple(string,tuple(start_index,end_index))] 26 | """ 27 | np_parser = nltk.RegexpParser(GRAMMAR1) # Noun phrase parser 28 | keyphrase_candidate = [] 29 | np_pos_tag_tokens = np_parser.parse(tokens_tagged) 30 | count = 0 31 | for token in np_pos_tag_tokens: 32 | if (isinstance(token, nltk.tree.Tree) and token._label == "NP"): 33 | np = ' '.join(word for word, tag in token.leaves()) 34 | length = len(token.leaves()) 35 | start_end = (count, count + length) 36 | count += length 37 | keyphrase_candidate.append((np, start_end)) 38 | 39 | else: 40 | count += 1 41 | 42 | return keyphrase_candidate 43 | 44 | # if __name__ == '__main__': 45 | # #This is an example.
46 | # sent17 = "NuVox shows staying power with new cash, new market Who says you can't raise cash in today's telecom market? NuVox Communications positions itself for the long run with $78.5 million in funding and a new credit facility" 47 | # sent10 = "This paper deals with two questions: Does social capital determine innovation in manufacturing firms? If it is the case, to what extent? To deal with these questions, we review the literature on innovation in order to see how social capital came to be added to the other forms of capital as an explanatory variable of innovation. In doing so, we have been led to follow the dominating view of the literature on social capital and innovation which claims that social capital cannot be captured through a single indicator, but that it actually takes many different forms that must be accounted for. Therefore, to the traditional explanatory variables of innovation, we have added five forms of structural social capital (business network assets, information network assets, research network assets, participation assets, and relational assets) and one form of cognitive social capital (reciprocal trust). In a context where empirical investigations regarding the relations between social capital and innovation are still scanty, this paper makes contributions to the advancement of knowledge in providing new evidence regarding the impact and the extent of social capital on innovation at the two decisionmaking stages considered in this study" 48 | # 49 | # input=input_representation.InputTextObj(sent10,is_sectioned=True,database="Inspec") 50 | # keyphrase_candidate= extract_candidates(input) 51 | # for kc in keyphrase_candidate: 52 | # print(kc) -------------------------------------------------------------------------------- /KeyExt/SIFRank/model/input_representation.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/6/19 5 | 6 | from model import extractor 7 | from nltk.corpus import stopwords 8 | stopword_dict = set(stopwords.words('english')) 9 | # from stanfordcorenlp import StanfordCoreNLP 10 | # en_model = StanfordCoreNLP(r'E:\Python_Files\stanford-corenlp-full-2018-02-27',quiet=True) 11 | class InputTextObj: 12 | """Represent the input text in which we want to extract keyphrases""" 13 | 14 | def __init__(self, en_model, text=""): 15 | """ 16 | :param is_sectioned: If we want to section the text. 17 | :param en_model: the pipeline of tokenization and POS-tagger 18 | :param considered_tags: The POSs we want to keep 19 | """ 20 | self.considered_tags = {'NN', 'NNS', 'NNP', 'NNPS', 'JJ'} 21 | 22 | self.tokens = [] 23 | self.tokens_tagged = [] 24 | self.tokens = en_model.word_tokenize(text) 25 | self.tokens_tagged = en_model.pos_tag(text) 26 | assert len(self.tokens) == len(self.tokens_tagged) 27 | for i, token in enumerate(self.tokens): 28 | if token.lower() in stopword_dict: 29 | self.tokens_tagged[i] = (token, "IN") 30 | self.keyphrase_candidate = extractor.extract_candidates(self.tokens_tagged, en_model) 31 | 32 | # if __name__ == '__main__': 33 | # text = "Adaptive state feedback control for a class of linear systems with unknown bounds of uncertainties The problem of adaptive robust stabilization for a class of linear time-varying systems with disturbance and nonlinear uncertainties is considered. The bounds of the disturbance and uncertainties are assumed to be unknown, being even arbitrary. 
For such uncertain dynamical systems, the adaptive robust state feedback controller is obtained. And the resulting closed-loop systems are asymptotically stable in theory. Moreover, an adaptive robust state feedback control scheme is given. The scheme ensures the closed-loop systems exponentially practically stable and can be used in practical engineering. Finally, simulations show that the control scheme is effective" 34 | # ito = InputTextObj(en_model, text) 35 | # print("OK") -------------------------------------------------------------------------------- /KeyExt/SIFRank/model/method.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/6/19 5 | 6 | import numpy as np 7 | import nltk 8 | from nltk.corpus import stopwords 9 | from model import input_representation 10 | import torch 11 | 12 | wnl=nltk.WordNetLemmatizer() 13 | stop_words = set(stopwords.words("english")) 14 | 15 | def cos_sim_gpu(x,y): 16 | assert x.shape[0]==y.shape[0] 17 | zero_tensor = torch.zeros((1, x.shape[0])).cuda() 18 | # zero_list = [0] * len(x) 19 | if x == zero_tensor or y == zero_tensor: 20 | return float(1) if x == y else float(0) 21 | xx, yy, xy = 0.0, 0.0, 0.0 22 | for i in range(x.shape[0]): 23 | xx += x[i] * x[i] 24 | yy += y[i] * y[i] 25 | xy += x[i] * y[i] 26 | return 1.0 - xy / np.sqrt(xx * yy) 27 | 28 | def cos_sim(vector_a, vector_b): 29 | """ 30 | 计算两个向量之间的余弦相似度 31 | :param vector_a: 向量 a 32 | :param vector_b: 向量 b 33 | :return: sim 34 | """ 35 | vector_a = np.mat(vector_a) 36 | vector_b = np.mat(vector_b) 37 | num = float(vector_a * vector_b.T) 38 | denom = np.linalg.norm(vector_a) * np.linalg.norm(vector_b) 39 | if(denom==0.0): 40 | return 0.0 41 | else: 42 | cos = num / denom 43 | sim = 0.5 + 0.5 * cos 44 | return sim 45 | 46 | def cos_sim_transformer(vector_a, vector_b): 47 | """ 48 | 计算两个向量之间的余弦相似度 49 | :param vector_a: 向量 a 50 | :param vector_b: 向量 b 51 | :return: sim 52 | """ 53 | a = vector_a.detach().numpy() 54 | b = vector_b.detach().numpy() 55 | a=np.mat(a) 56 | b=np.mat(b) 57 | 58 | num = float(a * b.T) 59 | denom = np.linalg.norm(a) * np.linalg.norm(b) 60 | if(denom==0.0): 61 | return 0.0 62 | else: 63 | cos = num / denom 64 | sim = 0.5 + 0.5 * cos 65 | return sim 66 | 67 | def get_dist_cosine(emb1, emb2, sent_emb_method="elmo",elmo_layers_weight=[0.0,1.0,0.0]): 68 | sum = 0.0 69 | assert emb1.shape == emb2.shape 70 | if(sent_emb_method=="elmo"): 71 | 72 | for i in range(0, 3): 73 | a = emb1[i] 74 | b = emb2[i] 75 | sum += cos_sim(a, b) * elmo_layers_weight[i] 76 | return sum 77 | 78 | elif(sent_emb_method=="elmo_transformer"): 79 | sum = cos_sim_transformer(emb1, emb2) 80 | return sum 81 | 82 | elif(sent_emb_method=="doc2vec"): 83 | sum=cos_sim(emb1,emb2) 84 | return sum 85 | 86 | elif (sent_emb_method == "glove"): 87 | sum = cos_sim(emb1, emb2) 88 | return sum 89 | return sum 90 | 91 | def get_all_dist(candidate_embeddings_list, text_obj, dist_list): 92 | ''' 93 | :param candidate_embeddings_list: 94 | :param text_obj: 95 | :param dist_list: 96 | :return: dist_all 97 | ''' 98 | 99 | dist_all={} 100 | for i, emb in enumerate(candidate_embeddings_list): 101 | phrase = text_obj.keyphrase_candidate[i][0] 102 | phrase = phrase.lower() 103 | phrase = wnl.lemmatize(phrase) 104 | if(phrase in dist_all): 105 | #store the No. 
and distance 106 | dist_all[phrase].append(dist_list[i]) 107 | else: 108 | dist_all[phrase]=[] 109 | dist_all[phrase].append(dist_list[i]) 110 | return dist_all 111 | 112 | def get_final_dist(dist_all, method="average"): 113 | ''' 114 | :param dist_all: 115 | :param method: "average" 116 | :return: 117 | ''' 118 | 119 | final_dist={} 120 | 121 | if(method=="average"): 122 | 123 | for phrase, dist_list in dist_all.items(): 124 | sum_dist = 0.0 125 | for dist in dist_list: 126 | sum_dist += dist 127 | if (phrase in stop_words): 128 | sum_dist = 0.0 129 | final_dist[phrase] = sum_dist/float(len(dist_list)) 130 | return final_dist 131 | 132 | def softmax(x): 133 | # x = x - np.max(x) 134 | exp_x = np.exp(x) 135 | softmax_x = exp_x / np.sum(exp_x) 136 | return softmax_x 137 | 138 | 139 | def get_position_score(keyphrase_candidate_list, position_bias): 140 | length = len(keyphrase_candidate_list) 141 | position_score ={} 142 | for i,kc in enumerate(keyphrase_candidate_list): 143 | np = kc[0] 144 | p = kc[1][0] 145 | np = np.lower() 146 | np = wnl.lemmatize(np) 147 | if np in position_score: 148 | 149 | position_score[np] += 0.0 150 | else: 151 | position_score[np] = 1/(float(i)+1+position_bias) 152 | score_list=[] 153 | for np,score in position_score.items(): 154 | score_list.append(score) 155 | score_list = softmax(score_list) 156 | 157 | i=0 158 | for np, score in position_score.items(): 159 | position_score[np] = score_list[i] 160 | i+=1 161 | return position_score 162 | 163 | def SIFRank(text, SIF, en_model, method="average", N=15, 164 | sent_emb_method="elmo", elmo_layers_weight=[0.0, 1.0, 0.0], if_DS=True, if_EA=True): 165 | """ 166 | :param text_obj: 167 | :param sent_embeddings: 168 | :param candidate_embeddings_list: 169 | :param sents_weight_list: 170 | :param method: 171 | :param N: the top-N number of keyphrases 172 | :param sent_emb_method: 'elmo', 'glove' 173 | :param elmo_layers_weight: the weights of different layers of ELMo 174 | :param if_DS: if take document segmentation(DS) 175 | :param if_EA: if take embeddings alignment(EA) 176 | :return: 177 | """ 178 | text_obj = input_representation.InputTextObj(en_model, text) 179 | sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA) 180 | dist_list = [] 181 | for i, emb in enumerate(candidate_embeddings_list): 182 | dist = get_dist_cosine(sent_embeddings, emb, sent_emb_method, elmo_layers_weight=elmo_layers_weight) 183 | dist_list.append(dist) 184 | dist_all = get_all_dist(candidate_embeddings_list, text_obj, dist_list) 185 | dist_final = get_final_dist(dist_all, method='average') 186 | dist_sorted = sorted(dist_final.items(), key=lambda x: x[1], reverse=True) 187 | return dist_sorted[0:N] 188 | 189 | def SIFRank_plus(text, SIF, en_model, method="average", N=15, 190 | sent_emb_method="elmo", elmo_layers_weight=[0.0, 1.0, 0.0], if_DS=True, if_EA=True, position_bias = 3.4): 191 | """ 192 | :param text_obj: 193 | :param sent_embeddings: 194 | :param candidate_embeddings_list: 195 | :param sents_weight_list: 196 | :param method: 197 | :param N: the top-N number of keyphrases 198 | :param sent_emb_method: 'elmo', 'glove' 199 | :param elmo_layers_weight: the weights of different layers of ELMo 200 | :return: 201 | """ 202 | text_obj = input_representation.InputTextObj(en_model, text) 203 | sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA) 204 | position_score = get_position_score(text_obj.keyphrase_candidate, 
position_bias) 205 | average_score = sum(position_score.values()) / (float)(len(position_score))#Little change here 206 | dist_list = [] 207 | for i, emb in enumerate(candidate_embeddings_list): 208 | dist = get_dist_cosine(sent_embeddings, emb, sent_emb_method, elmo_layers_weight=elmo_layers_weight) 209 | dist_list.append(dist) 210 | dist_all = get_all_dist(candidate_embeddings_list, text_obj, dist_list) 211 | dist_final = get_final_dist(dist_all, method='average') 212 | for np,dist in dist_final.items(): 213 | if np in position_score: 214 | dist_final[np] = dist*position_score[np]/average_score#Little change here 215 | dist_sorted = sorted(dist_final.items(), key=lambda x: x[1], reverse=True) 216 | return dist_sorted[0:N] 217 | 218 | 219 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/requirements.txt: -------------------------------------------------------------------------------- 1 | nltk==3.4.3 2 | StanfordCoreNLP==3.9.1.1 3 | torch==1.7.1 4 | allennlp==0.8.4 5 | overrides==3.1.0 6 | scikit-learn==0.22.2.post1 -------------------------------------------------------------------------------- /KeyExt/SIFRank/test/test.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge_sy" 4 | # Date: 2020/2/21 5 | 6 | import nltk 7 | from embeddings import sent_emb_sif, word_emb_elmo 8 | from model.method import SIFRank, SIFRank_plus 9 | from stanfordcorenlp import StanfordCoreNLP 10 | import time 11 | 12 | #download from https://allennlp.org/elmo 13 | options_file = "../auxiliary_data/elmo_2x4096_512_2048cnn_2xhighway_options.json" 14 | weight_file = "../auxiliary_data/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5" 15 | 16 | porter = nltk.PorterStemmer() 17 | ELMO = word_emb_elmo.WordEmbeddings(options_file, weight_file, cuda_device=0) 18 | SIF = sent_emb_sif.SentEmbeddings(ELMO, lamda=1.0) 19 | en_model = StanfordCoreNLP(r'E:\Python_Files\stanford-corenlp-full-2018-02-27',quiet=True)#download from https://stanfordnlp.github.io/CoreNLP/ 20 | elmo_layers_weight = [0.0, 1.0, 0.0] 21 | 22 | text = "Discrete output feedback sliding mode control of second order systems - a moving switching line approach The sliding mode control systems (SMCS) for which the switching variable is designed independent of the initial conditions are known to be sensitive to parameter variations and extraneous disturbances during the reaching phase. For second order systems this drawback is eliminated by using the moving switching line technique where the switching line is initially designed to pass the initial conditions and is subsequently moved towards a predetermined switching line. In this paper, we make use of the above idea of moving switching line together with the reaching law approach to design a discrete output feedback sliding mode control. The main contributions of this work are such that we do not require to use system states as it makes use of only the output samples for designing the controller. and by using the moving switching line a low sensitivity system is obtained through shortening the reaching phase. 
Simulation results show that the fast output sampling feedback guarantees sliding motion similar to that obtained using state feedback" 23 | keyphrases = SIFRank(text, SIF, en_model, N=15,elmo_layers_weight=elmo_layers_weight) 24 | keyphrases_ = SIFRank_plus(text, SIF, en_model, N=15, elmo_layers_weight=elmo_layers_weight) 25 | print(keyphrases) 26 | print(keyphrases_) -------------------------------------------------------------------------------- /KeyExt/SIFRank/util/fileIO.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/6/21 5 | 6 | import string,re,os 7 | 8 | class Result: 9 | 10 | def __init__(self,N=15): 11 | self.database="" 12 | self.predict_keyphrases = [] 13 | self.true_keyphrases = [] 14 | self.file_names = [] 15 | self.lamda=0.0 16 | self.beta=0.0 17 | 18 | def update_result(self, file_name, pre_kp, true_kp): 19 | self.file_names.append(file_name) 20 | self.predict_keyphrases.append(pre_kp) 21 | self.true_keyphrases.append(true_kp) 22 | 23 | def get_parameters(self,database="",lamda=0.6,beta=0.0): 24 | self.database = database 25 | self.lamda = lamda 26 | self.beta = beta 27 | 28 | def write_results(self): 29 | return 0 30 | 31 | def write_string(s, output_path): 32 | with open(output_path, 'w') as output_file: 33 | output_file.write(s) 34 | 35 | 36 | def read_file(input_path): 37 | with open(input_path, 'r', errors='replace_with_space') as input_file: 38 | return input_file.read() 39 | 40 | def clean_text(text="",database="Inspec"): 41 | 42 | #Specially for Duc2001 Database 43 | if(database=="Duc2001" or database=="Semeval2017"): 44 | pattern2 = re.compile(r'[\s,]' + '[\n]{1}') 45 | while (True): 46 | if (pattern2.search(text) is not None): 47 | position = pattern2.search(text) 48 | start = position.start() 49 | end = position.end() 50 | # start = int(position[0]) 51 | text_new = text[:start] + "\n" + text[start + 2:] 52 | text = text_new 53 | else: 54 | break 55 | 56 | pattern2 = re.compile(r'[a-zA-Z0-9,\s]' + '[\n]{1}') 57 | while (True): 58 | if (pattern2.search(text) is not None): 59 | position = pattern2.search(text) 60 | start = position.start() 61 | end = position.end() 62 | # start = int(position[0]) 63 | text_new = text[:start + 1] + " " + text[start + 2:] 64 | text = text_new 65 | else: 66 | break 67 | 68 | pattern3 = re.compile(r'\s{2,}') 69 | while (True): 70 | if (pattern3.search(text) is not None): 71 | position = pattern3.search(text) 72 | start = position.start() 73 | end = position.end() 74 | # start = int(position[0]) 75 | text_new = text[:start + 1] + "" + text[start + 2:] 76 | text = text_new 77 | else: 78 | break 79 | 80 | pattern1 = re.compile(r'[<>[\]{}]') 81 | text = pattern1.sub(' ', text) 82 | text = text.replace("\t", " ") 83 | text = text.replace(' p ','\n') 84 | text = text.replace(' /p \n','\n') 85 | lines = text.splitlines() 86 | # delete blank line 87 | text_new="" 88 | for line in lines: 89 | if(line!='\n'): 90 | text_new+=line+'\n' 91 | 92 | return text_new 93 | 94 | def get_duc2001_data(file_path="../data/DUC2001"): 95 | pattern = re.compile(r'<TEXT>(.*?)</TEXT>', re.S) 96 | data = {} 97 | labels = {} 98 | for dirname, dirnames, filenames in os.walk(file_path): 99 | for fname in filenames: 100 | if (fname == "annotations.txt"): 101 | # left, right = fname.split('.') 102 | infile = os.path.join(dirname, fname) 103 | f = open(infile,'rb') 104 | text = f.read().decode('utf8') 105 | lines = text.splitlines() 106 | for line
in lines: 107 | left, right = line.split("@") 108 | d = right.split(";")[:-1] 109 | l = left 110 | labels[l] = d 111 | f.close() 112 | else: 113 | infile = os.path.join(dirname, fname) 114 | f = open(infile,'rb') 115 | text = f.read().decode('utf8') 116 | text = re.findall(pattern, text)[0] 117 | 118 | text = text.lower() 119 | text = clean_text(text,database="Duc2001") 120 | data[fname]=text.strip("\n") 121 | # data[fname] = text 122 | return data,labels 123 | 124 | def get_inspec_data(file_path="../data/Inspec"): 125 | 126 | data={} 127 | labels={} 128 | for dirname, dirnames, filenames in os.walk(file_path): 129 | for fname in filenames: 130 | left, right = fname.split('.') 131 | if (right == "abstr"): 132 | infile = os.path.join(dirname, fname) 133 | f=open(infile) 134 | text=f.read() 135 | text=clean_text(text) 136 | data[left]=text 137 | if (right == "uncontr"): 138 | infile = os.path.join(dirname, fname) 139 | f=open(infile) 140 | text=f.read() 141 | text=text.replace("\n",' ') 142 | text=clean_text(text,database="Inspec") 143 | text=text.lower() 144 | label=text.split("; ") 145 | labels[left]=label 146 | return data,labels 147 | 148 | def get_semeval2017_data(data_path="../data/SemEval2017/docsutf8",labels_path="../data/SemEval2017/keys"): 149 | 150 | data={} 151 | labels={} 152 | for dirname, dirnames, filenames in os.walk(data_path): 153 | for fname in filenames: 154 | left, right = fname.split('.') 155 | infile = os.path.join(dirname, fname) 156 | f = open(infile, 'rb') 157 | text = f.read().decode('utf8') 158 | text = clean_text(text,database="Semeval2017") 159 | data[left] = text.lower() 160 | f.close() 161 | for dirname, dirnames, filenames in os.walk(labels_path): 162 | for fname in filenames: 163 | left, right = fname.split('.') 164 | infile = os.path.join(dirname, fname) 165 | f = open(infile, 'rb') 166 | text = f.read().decode('utf8') 167 | text = text.strip() 168 | ls=text.splitlines() 169 | labels[left] = ls 170 | f.close() 171 | return data,labels 172 | 173 | 174 | # if __name__ == '__main__': 175 | # 176 | # data,labels=get_semeval2017_data() 177 | # print("OK") 178 | 179 | 180 | 181 | -------------------------------------------------------------------------------- /KeyExt/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /KeyExt/config.py: -------------------------------------------------------------------------------- 1 | # Config values. 2 | datasets_path = r'..\datasets' 3 | output_dir = r'..\output' 4 | -------------------------------------------------------------------------------- /KeyExt/experiments.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pandas as pd 3 | import KeyExt.metrics 4 | import KeyExt.utils 5 | 6 | def run_experiments(datasets_dir, output_dir, top_n = 10, partial_match = True): 7 | 8 | # Make a list of all subdirectories. 9 | directories = next(os.walk(datasets_dir))[1][0:] 10 | data = [] 11 | 12 | # Set the metric name and construct the output path for the xlsx. 13 | metric_name = f'pF1@{top_n}' if partial_match else f'F1@{top_n}' 14 | xlsx_path = os.path.join(output_dir, f'{metric_name}.xlsx') 15 | print(f'Calculating the {metric_name} score for all datasets...') 16 | 17 | for i, directory in enumerate(directories): 18 | print(f'Processing {i+1} in {len(directories)} datasets.') 19 | 20 | # Change current working directory to the dataset directory. 
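# (Descriptive note, inferred from the code below rather than stated elsewhere: each dataset
# directory is assumed to contain a 'keys' folder with one human-assigned keyphrase file per
# document, and an 'extracted' folder with one subfolder per method holding the corresponding
# extracted keyphrase files, aligned with the key files by sorted filename.)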
21 | dataset_path = os.path.join(datasets_dir, directory) 22 | os.chdir(dataset_path) 23 | 24 | # Find human assigned keyphrase files and paths. 25 | os.chdir(os.path.join(dataset_path, 'keys')) 26 | key_paths = list(map(os.path.abspath, sorted(os.listdir()))) 27 | 28 | # Find all methods (directories of keys) and their generated keyphrase files and paths. 29 | extracted_path = os.path.join(dataset_path, 'extracted') 30 | os.chdir(extracted_path) 31 | methods = sorted(next(os.walk('.'))[1]) 32 | 33 | # Initialize the macro (mean) metric vector. 34 | macro_metric_vec = [0.0] * len(methods) 35 | 36 | # Compare the extracted keys of each method with the human assigned keys. 37 | for j, method in enumerate(methods): 38 | 39 | print(f' * Evaluating {method} for {len(key_paths)} documents.') 40 | 41 | # Find all extracted keys of the method. 42 | os.chdir(os.path.join(extracted_path, method)) 43 | method_paths = list(map(os.path.abspath, sorted(os.listdir()))) 44 | 45 | for key_path, method_path in zip(key_paths, method_paths): 46 | with open(method_path, 'r', encoding = 'utf-8-sig', errors = 'ignore') as method_keys, \ 47 | open(key_path, 'r', encoding = 'utf-8-sig', errors = 'ignore') as human_keys: 48 | 49 | # Read the tags from file and then preprocess them, 50 | # so that they are lowercased, stripped of punctuation and stemmed. 51 | extracted = KeyExt.utils.preprocess(method_keys.read().split('\n')) 52 | assigned = KeyExt.utils.preprocess(human_keys.read().split('\n')) 53 | macro_metric_vec[j] += KeyExt.metrics.f1_metric_k ( 54 | assigned, extracted, k = top_n, partial_match = partial_match 55 | ) 56 | 57 | # The macro (mean) metric score is calculated for each method. 58 | macro_metric_vec = [ 59 | round(metric_sum / len(key_paths), 3) 60 | for metric_sum in macro_metric_vec 61 | ] 62 | 63 | # Append the macro metric score for each directory to the data list of lists, 64 | # each list has the dataset name prepended at the start of the row. 65 | data.append([directory] + macro_metric_vec) 66 | os.system('clear') 67 | 68 | 69 | # Construct the dataframe and then transpose it. 70 | df = pd.DataFrame(data, columns = [f'{metric_name}', *methods]).set_index(f'{metric_name}') 71 | df = df.transpose() 72 | 73 | # Save the dataframe to excel. 74 | df.to_excel(xlsx_path, engine = 'openpyxl') 75 | return 76 | -------------------------------------------------------------------------------- /KeyExt/metrics.py: -------------------------------------------------------------------------------- 1 | def exact_f1_k(assigned, extracted, k): 2 | """ 3 | Computes the exact match f1 measure at k. 4 | Arguments 5 | --------- 6 | assigned : A list of human assigned keyphrases. 7 | extracted : A list of extracted keyphrases. 8 | k : int 9 | The maximum number of extracted keyphrases. 10 | Returned value 11 | -------------- 12 | : double 13 | """ 14 | # Exit early, if one of the lists or both are empty. 15 | if not assigned or not extracted: 16 | return 0.0 17 | 18 | precision_k = len(set(assigned) & set(extracted)) / k 19 | recall_k = len(set(assigned) & set(extracted)) / len(assigned) 20 | return ( 21 | 2 * precision_k * recall_k / (precision_k + recall_k) 22 | if precision_k and recall_k else 0.0 23 | ) 24 | 25 | 26 | def partial_f1_k(assigned, extracted, k): 27 | """ 28 | Computes the partial match f1 measure at k. 29 | Arguments 30 | --------- 31 | assigned : A list of human assigned keyphrases. 32 | extracted : A list of extracted keyphrases. 33 | k : int 34 | The maximum number of extracted keyphrases.
35 | Returned value 36 | -------------- 37 | : double 38 | """ 39 | # Exit early, if one of the lists or both are empty. 40 | if not assigned or not extracted: 41 | return 0.0 42 | 43 | # Store the longest keyphrases first. 44 | assigned_sets = sorted([set(keyword.split()) for keyword in assigned], key = len, reverse = True) 45 | extracted_sets = sorted([set(keyword.split()) for keyword in extracted], key = len, reverse = True) 46 | 47 | # This list stores True, if the assigned keyphrase has been matched earlier. 48 | # To avoid counting duplicate matches. 49 | assigned_matches = [False for assigned_set in assigned_sets] 50 | 51 | # For each extracted keyphrase, find the closest match, 52 | # which is the assigned keyphrase it has the most words in common. 53 | for extracted_set in extracted_sets: 54 | all_matches = [(i, len(assigned_set & extracted_set)) for i, assigned_set in enumerate(assigned_sets)] 55 | closest_match = sorted(all_matches, key = lambda x: x[1], reverse = True)[0] 56 | assigned_matches[closest_match[0]] = True 57 | 58 | # Calculate the precision and recall metrics based on the partial matches. 59 | partial_matches = assigned_matches.count(True) 60 | precision_k = partial_matches / k 61 | recall_k = partial_matches / len(assigned) 62 | 63 | return ( 64 | 2 * precision_k * recall_k / (precision_k + recall_k) 65 | if precision_k and recall_k else 0.0 66 | ) 67 | 68 | 69 | def f1_metric_k(assigned, extracted, k, partial_match = True): 70 | """ 71 | Wrapper function that calculates either the exact 72 | or the partial match f1 metric. 73 | """ 74 | return ( 75 | partial_f1_k(assigned, extracted, k) 76 | if partial_match else exact_f1_k(assigned, extracted, k) 77 | ) 78 | -------------------------------------------------------------------------------- /KeyExt/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import platform 4 | import functools 5 | import KeyExt.config 6 | from string import punctuation 7 | from nltk.stem import SnowballStemmer 8 | 9 | 10 | # Initialize the English stemmer once. 11 | stemmer = SnowballStemmer('english') 12 | 13 | 14 | def preprocess(lis): 15 | """ 16 | Function which applies stemming to a 17 | lowercase version of each string of the list, 18 | which has all punctuation removed. 19 | """ 20 | return list(map(stemmer.stem, 21 | map(lambda s: s.translate(str.maketrans('', '', punctuation)), 22 | map(str.lower, lis)))) 23 | 24 | 25 | def rreplace(s, old, new, occurrence): 26 | """ 27 | Function which replaces a string occurence 28 | in a string from the end of the string. 29 | """ 30 | return new.join(s.rsplit(old, occurrence)) 31 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Keyword & Keyphrase Extraction Review 2 | 3 | This repository hosts code for the papers: 4 | * [A literature review of keyword and keyphrase extraction -]() - [Download]() 5 | * [A comparative assessment of state-of-the-art methods for multilingual unsupervised keyphrase extraction](https://link.springer.com/chapter/10.1007/978-3-030-79150-6_50) - [Download](https://github.com/NC0DER/KeyphraseExtraction/releases/tag/KeyphraseExtractionv1.0) 6 | 7 | ## Datasets 8 | Available in [this link]() 9 | 10 | ## Disclaimer 11 | This repository contains code for the evaluated approaches. 12 | The code for these approaches belongs to their respective authors. 
13 | Some code files were modified to enable the evaluation. 14 | These modifications include: 15 | * Removing hardcoded paths. 16 | * Setting `cpu-only` mode for approaches that require a lot of `GPU VRAM`. 17 | * Updating `Python 2` code to run on `Python 3`. 18 | * Amending errors related to old packages or functions with wrong parameters. 19 | * Disabling stemming performed early by certain approaches in their keyphrase extraction step, 20 | so as to use a common stemmer later in the evaluation process. 21 | 22 | ## Test Results 23 | To reproduce the results, configure `KeyExt\config.py` and run `KeyExt.py`. 24 | 25 | ## Installation 26 | * `Python 3` (min. version 3.7), `pip3` (& `py` launcher Windows-only). 27 | * Follow the install instructions in each subdirectory. 28 | 29 | ## Contributors 30 | * Nikolaos Giarelis (giarelis@ceid.upatras.gr) 31 | * Nikos Karacapilidis (karacap@upatras.gr) 32 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | click==8.1.3 2 | colorama==0.4.5 3 | et-xmlfile==1.1.0 4 | importlib-metadata==4.11.4 5 | joblib==1.1.0 6 | nltk==3.7 7 | numpy==1.21.6 8 | openpyxl==3.0.10 9 | pandas==1.3.5 10 | pip==22.1.2 11 | python-dateutil==2.8.2 12 | pytz==2022.1 13 | regex==2022.6.2 14 | setuptools==62.4.0 15 | six==1.16.0 16 | tqdm==4.64.0 17 | typing_extensions==4.2.0 18 | zipp==3.8.0 19 | --------------------------------------------------------------------------------