├── KeyExt.py ├── KeyExt ├── ClassicalApproaches │ ├── README.md │ └── main.py ├── EmbedRank │ ├── Dockerfile │ ├── LICENSE │ ├── README.md │ ├── benchmark.py │ ├── config.ini │ ├── extract_keys_from_embedrank.py │ ├── launch.py │ ├── launch.pyc │ ├── requirements.txt │ ├── setup.cfg │ ├── setup.py │ └── swisscom_ai │ │ ├── __init__.py │ │ └── research_keyphrase │ │ ├── __init__.py │ │ ├── embeddings │ │ ├── __init__.py │ │ ├── emb_distrib_interface.py │ │ └── emb_distrib_local.py │ │ ├── model │ │ ├── __init__.py │ │ ├── extractor.py │ │ ├── input_representation.py │ │ ├── method.py │ │ └── methods_embeddings.py │ │ ├── preprocessing │ │ ├── __init__.py │ │ ├── custom_stanford.py │ │ └── postagging.py │ │ └── util │ │ ├── __init__.py │ │ ├── fileIO.py │ │ └── solr_fields.py ├── KPRank │ ├── PositionRank.py │ ├── README.md │ ├── __init__.py │ ├── doc_candidates.py │ ├── evaluation.py │ ├── main.py │ ├── process_data.py │ ├── requirements.txt │ └── run_scibert_model.py ├── Key2Vec │ ├── README.md │ ├── key2vec.py │ ├── key2vec │ │ ├── __init__.py │ │ ├── cleaner.py │ │ ├── constants.json │ │ ├── constants.py │ │ ├── docs.py │ │ ├── glove.py │ │ ├── key2vec.py │ │ └── phrase_graph.py │ ├── requirements.txt │ ├── setup.py │ ├── test.py │ ├── test.txt │ └── tests │ │ ├── test_docs.py │ │ └── test_glove.py ├── KeyBERT │ ├── KeyBERT.py │ └── README.md ├── RVA │ ├── LICENSE │ ├── Makefile │ ├── README.md │ ├── RVA.py │ ├── build │ │ ├── common.o │ │ ├── cooccur │ │ ├── cooccur.o │ │ ├── glove │ │ ├── glove.o │ │ ├── shuffle │ │ ├── shuffle.o │ │ ├── vocab_count │ │ └── vocab_count.o │ ├── cooccurrence.bin │ ├── cooccurrence.shuf.bin │ ├── demo.sh │ ├── eval │ │ ├── matlab │ │ │ ├── WordLookup.m │ │ │ ├── evaluate_vectors.m │ │ │ └── read_and_evaluate.m │ │ ├── octave │ │ │ ├── WordLookup_octave.m │ │ │ ├── evaluate_vectors_octave.m │ │ │ └── read_and_evaluate_octave.m │ │ └── python │ │ │ ├── distance.py │ │ │ ├── evaluate.py │ │ │ └── word_analogy.py │ ├── randomization.test.sh │ └── src │ │ ├── README.md │ │ ├── common.c │ │ ├── common.h │ │ ├── cooccur.c │ │ ├── glove.c │ │ ├── shuffle.c │ │ └── vocab_count.c ├── SIFRank │ ├── README.md │ ├── auxiliary_data │ │ ├── __init__.py │ │ ├── duc2001_vocab.txt │ │ ├── elmo_2x4096_512_2048cnn_2xhighway_options.json │ │ ├── enwiki_vocab_min200.txt │ │ ├── inspec_vocab.txt │ │ └── semeval_vocab.txt │ ├── embeddings │ │ ├── __init__.py │ │ ├── sent_emb_sif.py │ │ ├── word_emb_bert.py │ │ └── word_emb_elmo.py │ ├── eval │ │ └── sifrank_eval.py │ ├── main.py │ ├── model │ │ ├── __init__.py │ │ ├── extractor.py │ │ ├── input_representation.py │ │ └── method.py │ ├── requirements.txt │ ├── test │ │ └── test.py │ └── util │ │ └── fileIO.py ├── __init__.py ├── config.py ├── experiments.py ├── metrics.py └── utils.py ├── LICENSE ├── README.md └── requirements.txt /KeyExt.py: -------------------------------------------------------------------------------- 1 | from KeyExt.config import datasets_path, output_dir 2 | from KeyExt.experiments import run_experiments 3 | 4 | 5 | def main(): 6 | for partial_match in [False, True]: 7 | for n in [5, 10]: 8 | run_experiments( 9 | datasets_path, output_dir, 10 | top_n = n, partial_match = partial_match 11 | ) 12 | 13 | 14 | if __name__=='__main__': main() 15 | -------------------------------------------------------------------------------- /KeyExt/ClassicalApproaches/README.md: -------------------------------------------------------------------------------- 1 | # Classical Approaches 2 | 3 | This directory contains classical 
unsupervised approaches, which do not utilize word embeddings. 4 | These include `YAKE!`, `KPMiner`, `MPRank`, `PositionRank`, `TopicalPageRank`, `SingleRank`, `TextRank` and `TopicRank`. 5 | 6 | ## Setup 7 | In order to run this script, you need to: 8 | ``` 9 | pip install pke 10 | pip install pytextrank 11 | pip install spacy 12 | pip install git+https://github.com/LIAAD/yake 13 | ``` 14 | The `en_core_web_sm` model for the respective `spacy` version needs to be installed, since it is used by [pytextrank](https://github.com/DerwenAI/pytextrank). 15 | `TopicalPageRank` and `KPMiner` use an `lda_model_file` and a `weights_file` respectively, which can be obtained from the [pke](https://github.com/boudinfl/pke) repo. 16 | After they are obtained, their respective paths and the `base_path` for the dataset directory should be set in `main.py`. 17 | -------------------------------------------------------------------------------- /KeyExt/ClassicalApproaches/main.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pke 3 | import time 4 | import yake 5 | import spacy 6 | import string 7 | import pathlib 8 | import functools 9 | import pytextrank 10 | 11 | def counter(func): 12 | """ 13 | Print the elapsed system time in seconds. 14 | """ 15 | @functools.wraps(func) 16 | def wrapper_counter(*args, **kwargs): 17 | start_time = time.perf_counter() 18 | result = func(*args, **kwargs) 19 | end_time = time.perf_counter() 20 | print(f'{func.__name__}: {end_time - start_time} secs') 21 | return result 22 | return wrapper_counter 23 | 24 | @counter 25 | def kpminer(text, top_n = 10): 26 | weights_file = r'..\pke\models\df-semeval2010.tsv.gz' 27 | extractor = pke.unsupervised.KPMiner() 28 | extractor.load_document(input = text, language = 'en') 29 | extractor.candidate_selection(lasf = 5, cutoff = 200) 30 | df = pke.load_document_frequency_file(input_file = weights_file) 31 | extractor.candidate_weighting(df = df, alpha = 2.3, sigma = 3.0) 32 | keyphrases = [key for key,_ in extractor.get_n_best(n = top_n)] 33 | return keyphrases 34 | 35 | @counter 36 | def mprank(text, top_n = 10): 37 | extractor = pke.unsupervised.MultipartiteRank() 38 | stoplist = list(string.punctuation) + list(pke.lang.stopwords.get('en')) 39 | extractor.load_document(input = text, stoplist = stoplist, language = 'en') 40 | extractor.candidate_selection(pos = {'NOUN', 'PROPN', 'ADJ'}) 41 | extractor.candidate_weighting(alpha = 1.1, threshold = 0.74, method = 'average') 42 | keyphrases = [key for key,_ in extractor.get_n_best(n = top_n)] 43 | return keyphrases 44 | 45 | @counter 46 | def positionrank(text, top_n = 10): 47 | extractor = pke.unsupervised.PositionRank() 48 | extractor.load_document(input = text, language = 'en', normalization = None) 49 | extractor.candidate_selection(grammar = "NP: {<ADJ>*<NOUN|PROPN>+}", maximum_word_number = 3) 50 | extractor.candidate_weighting(window = 10, pos = {'NOUN', 'PROPN', 'ADJ'}) 51 | keyphrases = [key for key,_ in extractor.get_n_best(n = top_n)] 52 | return keyphrases 53 | 54 | @counter 55 | def topicalpagerank(text, top_n = 10): 56 | lda_model_file = r'..\pke\models\lda-1000-semeval2010.py3.pickle.gz' 57 | extractor = pke.unsupervised.TopicalPageRank() 58 | extractor.load_document(input = text, language = 'en', normalization = None) 59 | extractor.candidate_selection(grammar = "NP: {<ADJ>*<NOUN|PROPN>+}") 60 | extractor.candidate_weighting(window = 10, pos = {'NOUN', 'PROPN', 'ADJ'}, lda_model = lda_model_file) 61 | keyphrases = [key for key,_ in extractor.get_n_best(n
= top_n)] 62 | return keyphrases 63 | 64 | @counter 65 | def singlerank(text, top_n = 10): 66 | extractor = pke.unsupervised.SingleRank() 67 | extractor.load_document(input = text, language = 'en', normalization = None) 68 | extractor.candidate_selection(pos = {'NOUN', 'PROPN', 'ADJ'}) 69 | extractor.candidate_weighting(window = 10, pos = {'NOUN', 'PROPN', 'ADJ'}) 70 | keyphrases = [key for key,_ in extractor.get_n_best(n = top_n)] 71 | return keyphrases 72 | 73 | @counter 74 | def textrank(text, top_n = 10): 75 | extractor = pke.unsupervised.TextRank() 76 | extractor.load_document(input = text, language = 'en', normalization = None) 77 | extractor.candidate_weighting(window = 2, pos = {'NOUN', 'PROPN', 'ADJ'}, top_percent = 0.33) 78 | keyphrases = [key for key,_ in extractor.get_n_best(n = top_n)] 79 | return keyphrases 80 | 81 | @counter 82 | def topicrank(text, top_n = 10): 83 | extractor = pke.unsupervised.TopicRank() 84 | stoplist = list(string.punctuation) + list(pke.lang.stopwords.get('en')) 85 | extractor.load_document(input = text, stoplist = stoplist, language = 'en') 86 | extractor.candidate_selection(pos = {'NOUN', 'PROPN', 'ADJ'}) 87 | extractor.candidate_weighting(threshold = 0.74, method = 'average') 88 | keyphrases = [key for key,_ in extractor.get_n_best(n = top_n)] 89 | return keyphrases 90 | 91 | 92 | @counter 93 | def py_textrank(nlp, text, top_n = 10): 94 | nlp.add_pipe('textrank') 95 | doc = nlp(text) 96 | nlp.remove_pipe('textrank') 97 | 98 | keyphrases = [ 99 | phrase.text for phrase in doc._.phrases 100 | ] 101 | return keyphrases[:top_n] 102 | 103 | @counter 104 | def py_positionrank(nlp, text, top_n = 10): 105 | nlp.add_pipe('positionrank') 106 | doc = nlp(text) 107 | nlp.remove_pipe('positionrank') 108 | 109 | keyphrases = [ 110 | phrase.text for phrase in doc._.phrases 111 | ] 112 | return keyphrases[:top_n] 113 | 114 | @counter 115 | def py_topicrank(nlp, text, top_n = 10): 116 | nlp.add_pipe('topicrank') 117 | doc = nlp(text) 118 | nlp.remove_pipe('topicrank') 119 | 120 | keyphrases = [ 121 | phrase.text for phrase in doc._.phrases 122 | ] 123 | return keyphrases[:top_n] 124 | 125 | @counter 126 | def yake_ke(text, top_n = 10): 127 | custom_kw_extractor = yake.KeywordExtractor(lan = "en", n = 3, dedupLim = 0.9, dedupFunc = 'seqm', windowsSize = 1, top = 10, features=None) 128 | keywords = [key for key,_ in custom_kw_extractor.extract_keywords(text)] 129 | return keywords 130 | 131 | 132 | def single_test(): 133 | text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types." 134 | 135 | # load a spaCy model, depending on language, scale, etc. 
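# The model must be downloaded first (python -m spacy download en_core_web_sm), as noted in the README of this directory.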
136 | nlp = spacy.load("en_core_web_sm") 137 | 138 | print(kpminer(text)) 139 | print(mprank(text)) 140 | print(topicalpagerank(text)) 141 | print(singlerank(text)) 142 | print('\n\n') 143 | 144 | print('\n\n') 145 | print(textrank(text)) 146 | print(py_textrank(nlp, text)) 147 | 148 | print('\n\n') 149 | print(positionrank(text)) 150 | print(py_positionrank(nlp, text)) 151 | 152 | print('\n\n') 153 | print(topicrank(text)) 154 | print(py_topicrank(nlp, text)) 155 | print(yake_ke(text)) 156 | return 157 | 158 | def main(): 159 | nlp = spacy.load('en_core_web_sm') 160 | method_name = 'textrank' 161 | method = { 162 | 'kpminer': lambda nlp, text: kpminer(text), 163 | 'mprank': lambda nlp, text: mprank(text), 164 | 'topicalpagerank': lambda nlp, text: topicalpagerank(text), 165 | 'singlerank': lambda nlp, text: singlerank(text), 166 | 'pytextrank': lambda nlp, text: py_textrank(nlp, text), 167 | 'textrank': lambda nlp, text: textrank(text), 168 | 'positionrank': lambda nlp, text: positionrank(text), 169 | 'pypositionrank': lambda nlp, text: py_positionrank(nlp, text), 170 | 'topicrank': lambda nlp, text: topicrank(text), 171 | 'pytopicrank': lambda nlp, text: py_topicrank(nlp, text), 172 | 'yake': lambda nlp, text: yake_ke(text) 173 | } 174 | 175 | base_path = r'..\datasets\Krapivin2009' 176 | input_dir = os.path.join(base_path, 'docsutf8') 177 | output_dir = os.path.join(base_path, f'extracted\{method_name}') 178 | print(os.getcwd()) 179 | 180 | # Set the current directory to the input dir 181 | os.chdir(os.path.join(os.getcwd(), input_dir)) 182 | 183 | # Get all file names and their absolute paths. 184 | docnames = sorted(os.listdir()) 185 | docpaths = list(map(os.path.abspath, docnames)) 186 | 187 | # Create the keys directory, after the names and paths are loaded. 188 | pathlib.Path(output_dir).mkdir(parents = True, exist_ok = True) 189 | 190 | for i, (docname, docpath) in enumerate(zip(docnames, docpaths)): 191 | 192 | #if i < 225: continue 193 | # keys shows up in docnames, erroneously. 194 | if docname == 'keys': 195 | continue 196 | 197 | print(f'Processing {i} out of {len(docnames)}...') 198 | 199 | # Save the output dir path 200 | output_dirpath = os.path.join(output_dir, docname.split('.')[0]+'.key') 201 | print(output_dirpath) 202 | 203 | with open(docpath, 'r', encoding = 'utf-8-sig', errors = 'ignore') as file, \ 204 | open(output_dirpath, 'w', encoding = 'utf-8-sig', errors = 'ignore') as out: 205 | 206 | # Read the file and remove the newlines. 207 | text = file.read().replace('\n', ' ') 208 | 209 | # Extract the top 10 keyphrases. 
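# method[method_name] dispatches to the selected extractor via the lambda table defined above; any extractor error is silently skipped, leaving that document's .key file empty.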
210 | try: 211 | ranked_list = method[method_name](nlp, text) 212 | keys = '\n'.join(map(str, ranked_list) or '') 213 | out.write(keys) 214 | except Exception: 215 | pass 216 | 217 | os.system('clear') 218 | 219 | 220 | if __name__ == '__main__': main() 221 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/Dockerfile: -------------------------------------------------------------------------------- 1 | # Use a base image that comes with NumPy and SciPy pre-installed 2 | FROM publysher/alpine-scipy:1.0.0-numpy1.14.0-python3.6-alpine3.7 3 | # Because of the image, our versions differ from those in the requirements.txt: 4 | # numpy==1.14.0 (instead of 1.13.1) 5 | # scipy==1.0.0 (instead of 0.19.1) 6 | 7 | # Install Java for Stanford Tagger 8 | RUN apk --update add openjdk8-jre 9 | # Set environment 10 | ENV JAVA_HOME /opt/jdk 11 | ENV PATH ${PATH}:${JAVA_HOME}/bin 12 | 13 | # Download CoreNLP full Stanford Tagger for English 14 | RUN wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip && \ 15 | unzip stanford-corenlp-full-*.zip && \ 16 | rm stanford-corenlp-full-*.zip && \ 17 | mv stanford-corenlp-full-* stanford-corenlp 18 | 19 | # Install sent2vec 20 | RUN apk add --update git g++ make && \ 21 | git clone https://github.com/epfml/sent2vec && \ 22 | cd sent2vec && \ 23 | git checkout f827d014a473aa22b2fef28d9e29211d50808d48 && \ 24 | make && \ 25 | apk del git make && \ 26 | rm -rf /var/cache/apk/* && \ 27 | pip install cython && \ 28 | cd src && \ 29 | python setup.py build_ext && \ 30 | pip install . 31 | 32 | 33 | 34 | # Install requirements 35 | WORKDIR /app 36 | ADD requirements.txt . 37 | # Remove NumPy and SciPy from the requirements before installing the rest 38 | RUN cd /app && \ 39 | sed -i '/^numpy.*$/d' requirements.txt && \ 40 | sed -i '/^scipy.*$/d' requirements.txt && \ 41 | pip install -r requirements.txt 42 | 43 | # Download NLTK data 44 | RUN python -c "import nltk; nltk.download('punkt')" 45 | 46 | # Set the paths in config.ini 47 | ADD config.ini.template config.ini 48 | RUN sed -i '6 c\host = localhost' config.ini && \ 49 | sed -i '7 c\port = 9000' config.ini && \ 50 | sed -i '10 c\model_path = /sent2vec/pretrained_model.bin' config.ini 51 | 52 | # Add actual source code 53 | ADD swisscom_ai swisscom_ai/ 54 | ADD launch.py . 55 | 56 | ENTRYPOINT ["/bin/sh"] -------------------------------------------------------------------------------- /KeyExt/EmbedRank/README.md: -------------------------------------------------------------------------------- 1 | # EmbedRank 2 | 3 | This directory contains the modified code for the [EmbedRank](https://github.com/swisscom/ai-research-keyphrase-extraction) approach. 4 | 5 | ## Setup 6 | Follow the install instructions from the original repo. 7 | Afterwards replace the files with the modified ones. 8 | In `main.py`, `base_path` needs to be set for the dataset directory. 9 | In `benchmark.py`, `output_path` needs to be set to a local output path. 
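For reference, the snippet below is a minimal sketch of how the pipeline can be driven programmatically through `launch.py`, assuming `config.ini` points to a downloaded sent2vec model and a CoreNLP server is reachable at the configured host and port (the sample text is a placeholder):

```python
import launch

# Load the sent2vec model and connect to the CoreNLP POS tagger,
# using the model path, host and port defined in config.ini.
embedding_distributor = launch.load_local_embedding_distributor()
pos_tagger = launch.load_local_corenlp_pos_tagger()

text = 'Keyphrase extraction identifies the phrases that best summarize a document.'

# extract_keyphrases returns a (keyphrases, relevance_scores, aliases) tuple; keep the top 10.
keyphrases, scores, aliases = launch.extract_keyphrases(
    embedding_distributor, pos_tagger, text, 10, 'en'
)
print(keyphrases)
```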
10 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/config.ini: -------------------------------------------------------------------------------- 1 | [STANFORDTAGGER] 2 | jar_path = 3 | model_directory_path = 4 | 5 | [STANFORDCORENLPTAGGER] 6 | host = localhost 7 | port = 9000 8 | 9 | [SENT2VEC] 10 | model_path = ./wiki_bigrams.bin -------------------------------------------------------------------------------- /KeyExt/EmbedRank/extract_keys_from_embedrank.py: -------------------------------------------------------------------------------- 1 | import os 2 | import launch 3 | import pathlib 4 | 5 | base_path = '../datasets/DUC-2001/' 6 | input_dir = os.path.join(base_path, 'docsutf8') 7 | output_dir = os.path.join(base_path, 'extracted/embedrank') 8 | 9 | embedding_distributor = launch.load_local_embedding_distributor() 10 | pos_tagger = launch.load_local_corenlp_pos_tagger() 11 | 12 | # Set the current directory to the input dir 13 | os.chdir(os.path.join(os.getcwd(), input_dir)) 14 | 15 | # Get all file names and their absolute paths. 16 | docnames = sorted(os.listdir()) 17 | docpaths = list(map(os.path.abspath, docnames)) 18 | 19 | # Create the keys directory, after the names and paths are loaded. 20 | pathlib.Path(output_dir).mkdir(parents = True, exist_ok = True) 21 | 22 | for i, (docname, docpath) in enumerate(zip(docnames, docpaths)): 23 | 24 | # keys shows up in docnames, erroneously. 25 | if docname == 'keys': 26 | continue 27 | 28 | print(f'Processing {i} out of {len(docnames)}...') 29 | 30 | # Save the output dir path 31 | output_dirpath = os.path.join(output_dir, docname.split('.')[0]+'.key') 32 | print(output_dirpath) 33 | 34 | with open(docpath, 'r', encoding = 'utf-8-sig', errors = 'ignore') as file, \ 35 | open(output_dirpath, 'w', encoding = 'utf-8-sig', errors = 'ignore') as out: 36 | # Read the file and remove the newlines. 37 | text = file.read().replace('\n', ' ') 38 | # Extract the top 10 keyphrases. 
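# launch.extract_keyphrases returns a (keyphrases, scores, aliases) tuple, so kp1[0] below is the list of keyphrase strings.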
39 | try: 40 | kp1 = launch.extract_keyphrases(embedding_distributor, pos_tagger, text, 10, 'en') 41 | keys = "\n".join(kp1[0] or '') 42 | out.write(keys) 43 | except Exception: 44 | pass 45 | 46 | os.system('clear') -------------------------------------------------------------------------------- /KeyExt/EmbedRank/launch.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | from configparser import ConfigParser 3 | 4 | from swisscom_ai.research_keyphrase.embeddings.emb_distrib_local import EmbeddingDistributorLocal 5 | from swisscom_ai.research_keyphrase.model.input_representation import InputTextObj 6 | from swisscom_ai.research_keyphrase.model.method import MMRPhrase 7 | from swisscom_ai.research_keyphrase.preprocessing.postagging import PosTaggingCoreNLP 8 | from swisscom_ai.research_keyphrase.util.fileIO import read_file 9 | 10 | 11 | def extract_keyphrases(embedding_distrib, ptagger, raw_text, N, lang, beta=0.55, alias_threshold=0.7): 12 | """ 13 | Method that extracts a set of keyphrases 14 | 15 | :param embedding_distrib: An Embedding Distributor object see @EmbeddingDistributor 16 | :param ptagger: A Pos Tagger object see @PosTagger 17 | :param raw_text: A string containing the raw text to extract keyphrases from 18 | :param N: The number of keyphrases to extract 19 | :param lang: The language 20 | :param beta: beta factor for MMR (tradeoff informativeness/diversity) 21 | :param alias_threshold: threshold to group candidates as aliases 22 | :return: A tuple with 3 elements : 23 | 1)list of the top-N candidates (or less if there are not enough candidates) (list of string) 24 | 2)list of associated relevance scores (list of float) 25 | 3)list containing for each keyphrase a list of aliases (list of list of string) 26 | """ 27 | tagged = ptagger.pos_tag_raw_text(raw_text) 28 | text_obj = InputTextObj(tagged, lang) 29 | return MMRPhrase(embedding_distrib, text_obj, N=N, beta=beta, alias_threshold=alias_threshold) 30 | 31 | 32 | def load_local_embedding_distributor(): 33 | config_parser = ConfigParser() 34 | config_parser.read('config.ini') 35 | sent2vec_model_path = config_parser.get('SENT2VEC', 'model_path') 36 | return EmbeddingDistributorLocal(sent2vec_model_path) 37 | 38 | 39 | def load_local_corenlp_pos_tagger(host=None, port=None): 40 | config_parser = ConfigParser() 41 | config_parser.read('config.ini') 42 | host = host or config_parser.get('STANFORDCORENLPTAGGER', 'host') 43 | port = port or config_parser.get('STANFORDCORENLPTAGGER', 'port') 44 | return PosTaggingCoreNLP(host, port) 45 | 46 | 47 | if __name__ == '__main__': 48 | parser = argparse.ArgumentParser(description='Extract keyphrases from raw text') 49 | 50 | group = parser.add_mutually_exclusive_group(required=True) 51 | group.add_argument('-raw_text', help='raw text to process') 52 | group.add_argument('-text_file', help='file containing the raw text to process') 53 | 54 | 55 | parser.add_argument('-tagger_host', help='CoreNLP host', default='localhost') 56 | parser.add_argument('-tagger_port', help='CoreNLP port', default=9000) 57 | parser.add_argument('-N', help='number of keyphrases to extract', required=True, type=int) 58 | args = parser.parse_args() 59 | 60 | if args.text_file: 61 | raw_text = read_file(args.text_file) 62 | else: 63 | raw_text = args.raw_text 64 | 65 | embedding_distributor = load_local_embedding_distributor() 66 | pos_tagger = load_local_corenlp_pos_tagger(args.tagger_host, args.tagger_port) 67 | print(extract_keyphrases(embedding_distributor, pos_tagger, raw_text, args.N, 'en')) 68 |
-------------------------------------------------------------------------------- /KeyExt/EmbedRank/launch.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/EmbedRank/launch.pyc -------------------------------------------------------------------------------- /KeyExt/EmbedRank/requirements.txt: -------------------------------------------------------------------------------- 1 | langdetect==1.0.7 2 | nltk==3.4.1 3 | numpy==1.14.3 4 | scikit-learn==0.19.0 5 | scipy==0.19.1 6 | six==1.10.0 7 | requests==2.21.0 -------------------------------------------------------------------------------- /KeyExt/EmbedRank/setup.cfg: -------------------------------------------------------------------------------- 1 | [flake8] 2 | max-line-length = 120 3 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/setup.py: -------------------------------------------------------------------------------- 1 | """A setuptools based setup module. 2 | 3 | See: 4 | https://packaging.python.org/en/latest/distributing.html 5 | https://github.com/pypa/sampleproject 6 | """ 7 | from codecs import open 8 | 9 | from setuptools import setup, find_packages 10 | 11 | with open('requirements.txt') as f: 12 | required = f.read().splitlines() 13 | 14 | setup( 15 | name='swisscom_ai.research_keyphrase', 16 | 17 | # Versions should comply with PEP440. For a discussion on single-sourcing 18 | # the version across setup.py and the project code, see 19 | # https://packaging.python.org/en/latest/single_source_version.html 20 | version='0.9.5', 21 | 22 | description='Swisscom AI Research Keyphrase Extraction', 23 | url='https://github.com/swisscom/ai-research-keyphrase-extraction', 24 | 25 | author='Swisscom (Schweiz) AG', 26 | 27 | # See https://pypi.python.org/pypi?%3Aaction=list_classifiers 28 | classifiers=[ 29 | 'Programming Language :: Python :: 3.6', 30 | ], 31 | 32 | # You can just specify the packages manually here if your project is 33 | # simple. Or you can use find_packages(). 34 | packages=find_packages(exclude=['contrib', 'docs', 'tests']), 35 | 36 | package_data={'swisscom_ai.research_keyphrase': []}, 37 | include_package_data=True, 38 | 39 | # List run-time dependencies here. These will be installed by pip when 40 | # your project is installed. For an analysis of "install_requires" vs pip's 41 | # requirements files see: 42 | # https://packaging.python.org/en/latest/requirements.html 43 | install_requires=required, 44 | 45 | # List additional groups of dependencies here (e.g. development 46 | # dependencies). 
You can install these using the following syntax, 47 | # for example: 48 | # $ pip install -e .[dev,test] 49 | extras_require={ 50 | 'dev': [], 51 | 'test': [], 52 | }, 53 | ) 54 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/EmbedRank/swisscom_ai/__init__.py -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/EmbedRank/swisscom_ai/research_keyphrase/__init__.py -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/embeddings/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/EmbedRank/swisscom_ai/research_keyphrase/embeddings/__init__.py -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/embeddings/emb_distrib_interface.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | from abc import ABC, abstractmethod 7 | 8 | 9 | class Singleton(type): 10 | _instances = {} 11 | 12 | def __call__(cls, *args, **kwargs): 13 | if cls not in cls._instances: 14 | cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs) 15 | return cls._instances[cls] 16 | 17 | 18 | class EmbeddingDistributor(ABC): 19 | """ 20 | Abstract class in charge of providing the embeddings of piece of texts 21 | """ 22 | @abstractmethod 23 | def get_tokenized_sents_embeddings(self, sents): 24 | """ 25 | Generate a numpy ndarray with the embedding of each element of sent in each row 26 | :param sents: list of string (sentences/phrases) 27 | :return: ndarray with shape (len(sents), dimension of embeddings) 28 | """ 29 | pass 30 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/embeddings/emb_distrib_local.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 
3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | import numpy as np 7 | 8 | from swisscom_ai.research_keyphrase.embeddings.emb_distrib_interface import EmbeddingDistributor 9 | import sent2vec 10 | 11 | 12 | class EmbeddingDistributorLocal(EmbeddingDistributor): 13 | """ 14 | Concrete class of @EmbeddingDistributor using a local installation of sent2vec 15 | https://github.com/epfml/sent2vec 16 | 17 | """ 18 | 19 | def __init__(self, fasttext_model): 20 | self.model = sent2vec.Sent2vecModel() 21 | self.model.load_model(fasttext_model) 22 | 23 | def get_tokenized_sents_embeddings(self, sents): 24 | """ 25 | @see EmbeddingDistributor 26 | """ 27 | for sent in sents: 28 | if '\n' in sent: 29 | raise RuntimeError('New line is not allowed inside a sentence') 30 | 31 | return self.model.embed_sentences(sents) 32 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/model/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/EmbedRank/swisscom_ai/research_keyphrase/model/__init__.py -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/model/extractor.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | """Contain method that return list of candidate""" 7 | 8 | import re 9 | 10 | import nltk 11 | 12 | GRAMMAR_EN = """ NP: 13 | {*} # Adjective(s)(optional) + Noun(s)""" 14 | 15 | GRAMMAR_DE = """ 16 | NBAR: 17 | {*+} # [Adjective(s) or Article(s) or Posessive pronoun](optional) + Noun(s) 18 | {+*+} 19 | 20 | NP: 21 | {*}# Above, connected with APPR and APPART (beim vom) 22 | {+} 23 | """ 24 | 25 | GRAMMAR_FR = """ NP: 26 | {*+*} # Adjective(s)(optional) + Noun(s) + Adjective(s)(optional)""" 27 | 28 | 29 | def get_grammar(lang): 30 | if lang == 'en': 31 | grammar = GRAMMAR_EN 32 | elif lang == 'de': 33 | grammar = GRAMMAR_DE 34 | elif lang == 'fr': 35 | grammar = GRAMMAR_FR 36 | else: 37 | raise ValueError('Language not handled') 38 | return grammar 39 | 40 | 41 | def extract_candidates(text_obj, no_subset=False): 42 | """ 43 | Based on part of speech return a list of candidate phrases 44 | :param text_obj: Input text Representation see @InputTextObj 45 | :param no_subset: if true won't put a candidate which is the subset of an other candidate 46 | :param lang: language (currently en, fr and de are supported) 47 | :return: list of candidate phrases (string) 48 | """ 49 | 50 | keyphrase_candidate = set() 51 | 52 | np_parser = nltk.RegexpParser(get_grammar(text_obj.lang)) # Noun phrase parser 53 | trees = np_parser.parse_sents(text_obj.pos_tagged) # Generator with one tree per sentence 54 | 55 | for tree in trees: 56 | for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'): # For each nounphrase 57 | # Concatenate the token with a space 58 | keyphrase_candidate.add(' '.join(word for word, tag in subtree.leaves())) 59 | 60 | keyphrase_candidate = {kp for kp in keyphrase_candidate if len(kp.split()) <= 5} 61 | 62 | if no_subset: 63 | keyphrase_candidate = unique_ngram_candidates(keyphrase_candidate) 64 | else: 65 | keyphrase_candidate = list(keyphrase_candidate) 66 | 67 | return keyphrase_candidate 68 | 69 | 
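# Illustrative example (hypothetical input): for a sentence POS-tagged as [('convex', 'JJ'), ('optimization', 'NN'), ('improves', 'VBZ'), ('keyphrase', 'NN'), ('extraction', 'NN')], the English grammar above keeps only adjective/noun chunks, so extract_candidates returns the candidates 'convex optimization' and 'keyphrase extraction'; the verb splits the noun phrase in two.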
70 | def extract_sent_candidates(text_obj): 71 | """ 72 | 73 | :param text_obj: input Text Representation see @InputTextObj 74 | :return: list of tokenized sentence (string) , each token is separated by a space in the string 75 | """ 76 | return [(' '.join(word for word, tag in sent)) for sent in text_obj.pos_tagged] 77 | 78 | 79 | def unique_ngram_candidates(strings): 80 | """ 81 | ['machine learning', 'machine', 'backward induction', 'induction', 'start'] -> 82 | ['backward induction', 'start', 'machine learning'] 83 | :param strings: List of string 84 | :return: List of string where no string is fully contained inside another string 85 | """ 86 | results = [] 87 | for s in sorted(set(strings), key=len, reverse=True): 88 | if not any(re.search(r'\b{}\b'.format(re.escape(s)), r) for r in results): 89 | results.append(s) 90 | return results 91 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/model/input_representation.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | from nltk.stem import PorterStemmer 7 | 8 | 9 | class InputTextObj: 10 | """Represent the input text in which we want to extract keyphrases""" 11 | 12 | def __init__(self, pos_tagged, lang, stem=False, min_word_len=3): 13 | """ 14 | :param pos_tagged: List of list : Text pos_tagged as a list of sentences 15 | where each sentence is a list of tuple (word, TAG). 16 | :param stem: If we want to apply stemming on the text. 17 | """ 18 | self.min_word_len = min_word_len 19 | self.considered_tags = {'NN', 'NNS', 'NNP', 'NNPS', 'JJ'} 20 | self.pos_tagged = [] 21 | self.filtered_pos_tagged = [] 22 | self.isStemmed = stem 23 | self.lang = lang 24 | 25 | if stem: 26 | stemmer = PorterStemmer() 27 | self.pos_tagged = [[(stemmer.stem(t[0]), t[1]) for t in sent] for sent in pos_tagged] 28 | else: 29 | self.pos_tagged = [[(t[0].lower(), t[1]) for t in sent] for sent in pos_tagged] 30 | 31 | temp = [] 32 | for sent in self.pos_tagged: 33 | s = [] 34 | for elem in sent: 35 | if len(elem[0]) < min_word_len: 36 | s.append((elem[0], 'LESS')) 37 | else: 38 | s.append(elem) 39 | temp.append(s) 40 | 41 | self.pos_tagged = temp 42 | # Convert some language-specific tag (NC, NE to NN) or ADJA ->JJ see convert method. 
43 | if lang in ['fr', 'de']: 44 | self.pos_tagged = [[(tagged_token[0], convert(tagged_token[1])) for tagged_token in sentence] for sentence 45 | in 46 | self.pos_tagged] 47 | self.filtered_pos_tagged = [[(t[0].lower(), t[1]) for t in sent if self.is_candidate(t)] for sent in 48 | self.pos_tagged] 49 | 50 | def is_candidate(self, tagged_token): 51 | """ 52 | 53 | :param tagged_token: tuple (word, tag) 54 | :return: True if its a valid candidate word 55 | """ 56 | return tagged_token[1] in self.considered_tags 57 | 58 | def extract_candidates(self): 59 | """ 60 | :return: set of all candidates word 61 | """ 62 | return {tagged_token[0].lower() 63 | for sentence in self.pos_tagged 64 | for tagged_token in sentence 65 | if self.is_candidate(tagged_token) and len(tagged_token[0]) >= self.min_word_len 66 | } 67 | 68 | 69 | def convert(fr_or_de_tag): 70 | if fr_or_de_tag in {'NN', 'NNE', 'NE', 'N', 'NPP', 'NC', 'NOUN'}: 71 | return 'NN' 72 | elif fr_or_de_tag in {'ADJA', 'ADJ'}: 73 | return 'JJ' 74 | else: 75 | return fr_or_de_tag 76 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/model/method.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | import warnings 7 | 8 | import numpy as np 9 | from sklearn.metrics.pairwise import cosine_similarity 10 | 11 | from swisscom_ai.research_keyphrase.model.methods_embeddings import extract_candidates_embedding_for_doc, \ 12 | extract_doc_embedding, extract_sent_candidates_embedding_for_doc 13 | 14 | 15 | def _MMR(embdistrib, text_obj, candidates, X, beta, N, use_filtered, alias_threshold): 16 | """ 17 | Core method using Maximal Marginal Relevance in charge to return the top-N candidates 18 | 19 | :param embdistrib: embdistrib: embedding distributor see @EmbeddingDistributor 20 | :param text_obj: Input text representation see @InputTextObj 21 | :param candidates: list of candidates (string) 22 | :param X: numpy array with the embedding of each candidate in each row 23 | :param beta: hyperparameter beta for MMR (control tradeoff between informativeness and diversity) 24 | :param N: number of candidates to extract 25 | :param use_filtered: if true filter the text by keeping only candidate word before computing the doc embedding 26 | :return: A tuple with 3 elements : 27 | 1)list of the top-N candidates (or less if there are not enough candidates) (list of string) 28 | 2)list of associated relevance scores (list of float) 29 | 3)list containing for each keyphrase a list of alias (list of list of string) 30 | """ 31 | 32 | N = min(N, len(candidates)) 33 | doc_embedd = extract_doc_embedding(embdistrib, text_obj, use_filtered) # Extract doc embedding 34 | doc_sim = cosine_similarity(X, doc_embedd.reshape(1, -1)) 35 | 36 | doc_sim_norm = doc_sim/np.max(doc_sim) 37 | doc_sim_norm = 0.5 + (doc_sim_norm - np.average(doc_sim_norm)) / np.std(doc_sim_norm) 38 | 39 | sim_between = cosine_similarity(X) 40 | np.fill_diagonal(sim_between, np.NaN) 41 | 42 | sim_between_norm = sim_between/np.nanmax(sim_between, axis=0) 43 | sim_between_norm = \ 44 | 0.5 + (sim_between_norm - np.nanmean(sim_between_norm, axis=0)) / np.nanstd(sim_between_norm, axis=0) 45 | 46 | selected_candidates = [] 47 | unselected_candidates = [c for c in range(len(candidates))] 48 | 49 | j = np.argmax(doc_sim) 50 | selected_candidates.append(j) 
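# The candidate most similar to the document is selected first; each of the remaining N-1 slots below is filled with the unselected candidate maximizing beta * (normalized similarity to the document) - (1 - beta) * (max normalized similarity to the already selected keyphrases), i.e. the MMR trade-off between informativeness and diversity.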
51 | unselected_candidates.remove(j) 52 | 53 | for _ in range(N - 1): 54 | selec_array = np.array(selected_candidates) 55 | unselec_array = np.array(unselected_candidates) 56 | 57 | distance_to_doc = doc_sim_norm[unselec_array, :] 58 | dist_between = sim_between_norm[unselec_array][:, selec_array] 59 | if dist_between.ndim == 1: 60 | dist_between = dist_between[:, np.newaxis] 61 | j = np.argmax(beta * distance_to_doc - (1 - beta) * np.max(dist_between, axis=1).reshape(-1, 1)) 62 | item_idx = unselected_candidates[j] 63 | selected_candidates.append(item_idx) 64 | unselected_candidates.remove(item_idx) 65 | 66 | # Not using normalized version of doc_sim for computing relevance 67 | relevance_list = max_normalization(doc_sim[selected_candidates]).tolist() 68 | aliases_list = get_aliases(sim_between[selected_candidates, :], candidates, alias_threshold) 69 | 70 | return candidates[selected_candidates].tolist(), relevance_list, aliases_list 71 | 72 | 73 | def MMRPhrase(embdistrib, text_obj, beta=0.65, N=10, use_filtered=True, alias_threshold=0.8): 74 | """ 75 | Extract N keyphrases 76 | 77 | :param embdistrib: embedding distributor see @EmbeddingDistributor 78 | :param text_obj: Input text representation see @InputTextObj 79 | :param beta: hyperparameter beta for MMR (control tradeoff between informativeness and diversity) 80 | :param N: number of keyphrases to extract 81 | :param use_filtered: if true filter the text by keeping only candidate word before computing the doc embedding 82 | :return: A tuple with 3 elements : 83 | 1)list of the top-N candidates (or less if there are not enough candidates) (list of string) 84 | 2)list of associated relevance scores (list of float) 85 | 3)list containing for each keyphrase a list of alias (list of list of string) 86 | """ 87 | candidates, X = extract_candidates_embedding_for_doc(embdistrib, text_obj) 88 | 89 | if len(candidates) == 0: 90 | warnings.warn('No keyphrase extracted for this document') 91 | return None, None, None 92 | 93 | return _MMR(embdistrib, text_obj, candidates, X, beta, N, use_filtered, alias_threshold) 94 | 95 | 96 | def MMRSent(embdistrib, text_obj, beta=0.5, N=10, use_filtered=True): 97 | """ 98 | 99 | Extract N key sentences 100 | 101 | :param embdistrib: embedding distributor see @EmbeddingDistributor 102 | :param text_obj: Input text representation see @InputTextObj 103 | :param beta: hyperparameter beta for MMR (control tradeoff between informativeness and diversity) 104 | :param N: number of key sentences to extract 105 | :param use_filtered: if true filter the text by keeping only candidate word before computing the doc embedding 106 | :return: list of N key sentences (or less if there are not enough candidates) 107 | """ 108 | candidates, X = extract_sent_candidates_embedding_for_doc(embdistrib, text_obj) 109 | 110 | if len(candidates) == 0: 111 | warnings.warn('No keysentence extracted for this document') 112 | return [] 113 | 114 | return _MMR(embdistrib, text_obj, candidates, X, beta, N, use_filtered) 115 | 116 | 117 | def max_normalization(array): 118 | """ 119 | Compute maximum normalization (max is set to 1) of the array 120 | :param array: 1-d array 121 | :return: 1-d array max- normalized : each value is multiplied by 1/max value 122 | """ 123 | return 1/np.max(array) * array.squeeze(axis=1) 124 | 125 | 126 | def get_aliases(kp_sim_between, candidates, threshold): 127 | """ 128 | Find candidates which are very similar to the keyphrases (aliases) 129 | :param kp_sim_between: ndarray of shape (nb_kp , nb candidates) 
containing the similarity 130 | of each kp with all the candidates. Note that the similarity between the keyphrase and itself should be set to 131 | NaN or 0 132 | :param candidates: array of candidates (array of string) 133 | :return: list containing for each keyphrase a list that contain candidates which are aliases 134 | (very similar) (list of list of string) 135 | """ 136 | 137 | kp_sim_between = np.nan_to_num(kp_sim_between, 0) 138 | idx_sorted = np.flip(np.argsort(kp_sim_between), 1) 139 | aliases = [] 140 | for kp_idx, item in enumerate(idx_sorted): 141 | alias_for_item = [] 142 | for i in item: 143 | if kp_sim_between[kp_idx, i] >= threshold: 144 | alias_for_item.append(candidates[i]) 145 | else: 146 | break 147 | aliases.append(alias_for_item) 148 | 149 | return aliases 150 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/model/methods_embeddings.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | import numpy as np 7 | 8 | from swisscom_ai.research_keyphrase.model.extractor import extract_candidates, extract_sent_candidates 9 | 10 | 11 | def extract_doc_embedding(embedding_distrib, inp_rpr, use_filtered=False): 12 | """ 13 | Return the embedding of the full document 14 | 15 | :param embedding_distrib: embedding distributor see @EmbeddingDistributor 16 | :param inp_rpr: input text representation see @InputTextObj 17 | :param use_filtered: if true keep only candidate words in the raw text before computing the embedding 18 | :return: numpy array of shape (1, dimension of embeddings) that contains the document embedding 19 | """ 20 | if use_filtered: 21 | tagged = inp_rpr.filtered_pos_tagged 22 | else: 23 | tagged = inp_rpr.pos_tagged 24 | 25 | tokenized_doc_text = ' '.join(token[0].lower() for sent in tagged for token in sent) 26 | return embedding_distrib.get_tokenized_sents_embeddings([tokenized_doc_text]) 27 | 28 | 29 | def extract_candidates_embedding_for_doc(embedding_distrib, inp_rpr): 30 | """ 31 | 32 | Return the list of candidate phrases as well as the associated numpy array that contains their embeddings. 33 | Note that candidates phrases extracted by PosTag rules which are uknown (in term of embeddings) 34 | will be removed from the candidates. 35 | 36 | :param embedding_distrib: embedding distributor see @EmbeddingDistributor 37 | :param inp_rpr: input text representation see @InputTextObj 38 | :return: A tuple of two element containing 1) the list of candidate phrases 39 | 2) a numpy array of shape (number of candidate phrases, dimension of embeddings : 40 | each row is the embedding of one candidate phrase 41 | """ 42 | candidates = np.array(extract_candidates(inp_rpr)) # List of candidates based on PosTag rules 43 | if len(candidates) > 0: 44 | embeddings = np.array(embedding_distrib.get_tokenized_sents_embeddings(candidates)) # Associated embeddings 45 | valid_candidates_mask = ~np.all(embeddings == 0, axis=1) # Only candidates which are not unknown. 46 | return candidates[valid_candidates_mask], embeddings[valid_candidates_mask, :] 47 | else: 48 | return np.array([]), np.array([]) 49 | 50 | 51 | def extract_sent_candidates_embedding_for_doc(embedding_distrib, inp_rpr): 52 | """ 53 | Return the list of candidate senetences as well as the associated numpy array that contains their embeddings. 
54 | Note that candidates sentences which are uknown (in term of embeddings) will be removed from the candidates. 55 | 56 | :param embedding_distrib: embedding distributor see @EmbeddingDistributor 57 | :param inp_rpr: input text representation see @InputTextObj 58 | :return: A tuple of two element containing 1) the list of candidate sentences 59 | 2) a numpy array of shape (number of candidate sentences, dimension of embeddings : 60 | each row is the embedding of one candidate sentence 61 | """ 62 | candidates = np.array(extract_sent_candidates(inp_rpr)) 63 | embeddings = np.array(embedding_distrib.get_tokenized_sents_embeddings(candidates)) 64 | 65 | valid_candidates_mask = ~np.all(embeddings == 0, axis=1) 66 | return candidates[valid_candidates_mask], embeddings[valid_candidates_mask, :] 67 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/preprocessing/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/EmbedRank/swisscom_ai/research_keyphrase/preprocessing/__init__.py -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/preprocessing/custom_stanford.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | """Implementation of StanfordPOSTagger with tokenization in the specific language, s.t. the tag and tag_sent methods 7 | perform tokenization in the specific language. 8 | """ 9 | from nltk.tag import StanfordPOSTagger 10 | 11 | 12 | class EnglishStanfordPOSTagger(StanfordPOSTagger): 13 | 14 | @property 15 | def _cmd(self): 16 | return ['edu.stanford.nlp.tagger.maxent.MaxentTagger', 17 | '-model', self._stanford_model, '-textFile', self._input_file_path, 18 | '-outputFormatOptions', 'keepEmptySentences'] 19 | 20 | 21 | class FrenchStanfordPOSTagger(StanfordPOSTagger): 22 | """ 23 | Taken from github mhkuu/french-learner-corpus 24 | Extends the StanfordPosTagger with a custom command that calls the FrenchTokenizerFactory. 25 | """ 26 | 27 | @property 28 | def _cmd(self): 29 | return ['edu.stanford.nlp.tagger.maxent.MaxentTagger', 30 | '-model', self._stanford_model, '-textFile', 31 | self._input_file_path, '-tokenizerFactory', 32 | 'edu.stanford.nlp.international.french.process.FrenchTokenizer$FrenchTokenizerFactory', 33 | '-outputFormatOptions', 'keepEmptySentences'] 34 | 35 | 36 | class GermanStanfordPOSTagger(StanfordPOSTagger): 37 | """ Use english tokenizer for german """ 38 | 39 | @property 40 | def _cmd(self): 41 | return ['edu.stanford.nlp.tagger.maxent.MaxentTagger', 42 | '-model', self._stanford_model, '-textFile', self._input_file_path, 43 | '-outputFormatOptions', 'keepEmptySentences'] 44 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/preprocessing/postagging.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 
3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | import argparse 7 | import os 8 | import re 9 | import warnings 10 | from abc import ABC, abstractmethod 11 | 12 | # NLTK imports 13 | import nltk 14 | from nltk.tag.util import tuple2str 15 | from nltk.parse import CoreNLPParser 16 | 17 | import swisscom_ai.research_keyphrase.preprocessing.custom_stanford as custom_stanford 18 | from swisscom_ai.research_keyphrase.util.fileIO import read_file, write_string 19 | 20 | # If you want to use spacy , install it and uncomment the following import 21 | # import spacy 22 | 23 | 24 | class PosTagging(ABC): 25 | @abstractmethod 26 | def pos_tag_raw_text(self, text, as_tuple_list=True): 27 | """ 28 | Tokenize and POS tag a string 29 | Sentence level is kept in the result : 30 | Either we have a list of list (for each sentence a list of tuple (word,tag)) 31 | Or a separator [ENDSENT] if we are requesting a string by putting as_tuple_list = False 32 | 33 | Example : 34 | >>from sentkp.preprocessing import postagger as pt 35 | 36 | >>pt = postagger.PosTagger() 37 | 38 | >>pt.pos_tag_raw_text('Write your python code in a .py file. Thank you.') 39 | [ 40 | [('Write', 'VB'), ('your', 'PRP$'), ('python', 'NN'), 41 | ('code', 'NN'), ('in', 'IN'), ('a', 'DT'), ('.', '.'), ('py', 'NN'), ('file', 'NN'), ('.', '.') 42 | ], 43 | [('Thank', 'VB'), ('you', 'PRP'), ('.', '.')] 44 | ] 45 | 46 | >>pt.pos_tag_raw_text('Write your python code in a .py file. Thank you.', as_tuple_list=False) 47 | 48 | 'Write/VB your/PRP$ python/NN code/NN in/IN a/DT ./.[ENDSENT]py/NN file/NN ./.[ENDSENT]Thank/VB you/PRP ./.' 49 | 50 | 51 | >>pt = postagger.PosTagger(separator='_') 52 | >>pt.pos_tag_raw_text('Write your python code in a .py file. Thank you.', as_tuple_list=False) 53 | Write_VB your_PRP$ python_NN code_NN in_IN a_DT ._. py_NN file_NN ._. 54 | Thank_VB you_PRP ._. 55 | 56 | 57 | 58 | :param as_tuple_list: Return result as list of list (word,Pos_tag) 59 | :param text: String to POS tag 60 | :return: POS Tagged string or Tuple list 61 | """ 62 | 63 | pass 64 | 65 | def pos_tag_file(self, input_path, output_path=None): 66 | 67 | """ 68 | POS Tag a file. 69 | Either we have a list of list (for each sentence a list of tuple (word,tag)) 70 | Or a file with the POS tagged text 71 | 72 | Note : The jumpline is only for readibility purpose , when reading a tagged file we'll use again 73 | sent_tokenize to find the sentences boundaries. 74 | 75 | :param input_path: path of the source file 76 | :param output_path: If set write POS tagged text with separator (self.pos_tag_raw_text with as_tuple_list False) 77 | If not set, return list of list of tuple (self.post_tag_raw_text with as_tuple_list = True) 78 | 79 | :return: resulting POS tagged text as a list of list of tuple or nothing if output path is set. 80 | """ 81 | 82 | original_text = read_file(input_path) 83 | 84 | if output_path is not None: 85 | tagged_text = self.pos_tag_raw_text(original_text, as_tuple_list=False) 86 | # Write to the output the POS-Tagged text. 
87 | write_string(tagged_text, output_path) 88 | else: 89 | return self.pos_tag_raw_text(original_text, as_tuple_list=True) 90 | 91 | def pos_tag_and_write_corpora(self, list_of_path, suffix): 92 | """ 93 | POS tag a list of files 94 | It writes the resulting file in the same directory with the same name + suffix 95 | e.g 96 | pos_tag_and_write_corpora(['/Users/user1/text1', '/Users/user1/direct/text2'] , suffix = _POS) 97 | will create 98 | /Users/user1/text1_POS 99 | /Users/user1/direct/text2_POS 100 | 101 | :param list_of_path: list containing the path (as string) of each file to POS Tag 102 | :param suffix: suffix to append at the end of the original filename for the resulting pos_tagged file. 103 | 104 | """ 105 | for path in list_of_path: 106 | output_file_path = path + suffix 107 | if os.path.isfile(path): 108 | self.pos_tag_file(path, output_file_path) 109 | else: 110 | warnings.warn('file ' + output_file_path + 'does not exists') 111 | 112 | 113 | class PosTaggingStanford(PosTagging): 114 | """ 115 | Concrete class of PosTagging using StanfordPOSTokenizer and StanfordPOSTagger 116 | 117 | tokenizer contains the default nltk tokenizer (PhunktSentenceTokenizer). 118 | tagger contains the StanfordPOSTagger object (which also trigger word tokenization see : -tokenize option in Java). 119 | 120 | """ 121 | 122 | def __init__(self, jar_path, model_path_directory, separator='|', lang='en'): 123 | """ 124 | :param model_path_directory: path of the model directory 125 | :param jar_path: path of the jar for StanfordPOSTagger (override the configuration file) 126 | :param separator: Separator between a token and a tag in the resulting string (default : |) 127 | 128 | """ 129 | 130 | if lang == 'en': 131 | model_path = os.path.join(model_path_directory, 'english-left3words-distsim.tagger') 132 | self.sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle') 133 | self.tagger = custom_stanford.EnglishStanfordPOSTagger(model_path, jar_path, java_options='-mx2g') 134 | elif lang == 'de': 135 | model_path = os.path.join(model_path_directory, 'german-hgc.tagger') 136 | self.sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle') 137 | self.tagger = custom_stanford.GermanStanfordPOSTagger(model_path, jar_path, java_options='-mx2g') 138 | elif lang == 'fr': 139 | model_path = os.path.join(model_path_directory, 'french.tagger') 140 | self.sent_tokenizer = nltk.data.load('tokenizers/punkt/french.pickle') 141 | self.tagger = custom_stanford.FrenchStanfordPOSTagger(model_path, jar_path, java_options='-mx2g') 142 | else: 143 | raise ValueError('Language ' + lang + 'not handled') 144 | 145 | self.separator = separator 146 | 147 | def pos_tag_raw_text(self, text, as_tuple_list=True): 148 | """ 149 | Implementation of abstract method from PosTagging 150 | @see PosTagging 151 | """ 152 | tagged_text = self.tagger.tag_sents([self.sent_tokenizer.sentences_from_text(text)]) 153 | 154 | if as_tuple_list: 155 | return tagged_text 156 | return '[ENDSENT]'.join( 157 | [' '.join([tuple2str(tagged_token, self.separator) for tagged_token in sent]) for sent in tagged_text]) 158 | 159 | 160 | class PosTaggingSpacy(PosTagging): 161 | """ 162 | Concrete class of PosTagging using StanfordPOSTokenizer and StanfordPOSTagger 163 | """ 164 | 165 | def __init__(self, nlp=None, separator='|' ,lang='en'): 166 | if not nlp: 167 | print('Loading Spacy model') 168 | # self.nlp = spacy.load(lang, entity=False) 169 | print('Spacy model loaded ' + lang) 170 | else: 171 | self.nlp = nlp 172 | self.separator = 
separator 173 | 174 | def pos_tag_raw_text(self, text, as_tuple_list=True): 175 | """ 176 | Implementation of abstract method from PosTagging 177 | @see PosTagging 178 | """ 179 | 180 | # This step is not necessary int the stanford tokenizer. 181 | # This is used to avoid such tags : (' ', 'SP') 182 | text = re.sub('[ ]+', ' ', text).strip() # Convert multiple whitespaces into one 183 | 184 | doc = self.nlp(text) 185 | if as_tuple_list: 186 | return [[(token.text, token.tag_) for token in sent] for sent in doc.sents] 187 | return '[ENDSENT]'.join(' '.join(self.separator.join([token.text, token.tag_]) for token in sent) for sent in doc.sents) 188 | 189 | 190 | class PosTaggingCoreNLP(PosTagging): 191 | """ 192 | Concrete class of PosTagging using a CoreNLP server 193 | Provides a faster way to process several documents using since it doesn't require to load the model each time. 194 | """ 195 | 196 | def __init__(self, host='localhost' ,port=9000, separator='|'): 197 | self.parser = CoreNLPParser(url=f'http://{host}:{port}') 198 | self.separator = separator 199 | 200 | def pos_tag_raw_text(self, text, as_tuple_list=True): 201 | # Unfortunately for the moment there is no method to do sentence split + pos tagging in nltk.parse.corenlp 202 | # Ony raw_tag_sents is available but assumes a list of str (so it assumes the sentence are already split) 203 | # We create a small custom function highly inspired from raw_tag_sents to do both 204 | 205 | def raw_tag_text(): 206 | """ 207 | Perform tokenizing sentence splitting and PosTagging and keep the 208 | sentence splits structure 209 | """ 210 | properties = {'annotators':'tokenize,ssplit,pos'} 211 | tagged_data = self.parser.api_call(text, properties=properties) 212 | for tagged_sentence in tagged_data['sentences']: 213 | yield [(token['word'], token['pos']) for token in tagged_sentence['tokens']] 214 | 215 | tagged_text = list(raw_tag_text()) 216 | 217 | if as_tuple_list: 218 | return tagged_text 219 | return '[ENDSENT]'.join( 220 | [' '.join([tuple2str(tagged_token, self.separator) for tagged_token in sent]) for sent in tagged_text]) 221 | 222 | 223 | 224 | 225 | if __name__ == '__main__': 226 | parser = argparse.ArgumentParser(description='Write POS tagged files, the resulting file will be written' 227 | ' at the same location with _POS append at the end of the filename') 228 | 229 | parser.add_argument('tagger', help='which pos tagger to use [stanford, spacy, corenlp]') 230 | parser.add_argument('listing_file_path', help='path to a text file ' 231 | 'containing in each row a path to a file to POS tag') 232 | args = parser.parse_args() 233 | 234 | if args.tagger == 'stanford': 235 | pt = PosTaggingStanford() 236 | suffix = 'STANFORD' 237 | elif args.tagger == 'spacy': 238 | pt = PosTaggingSpacy() 239 | suffix = 'SPACY' 240 | elif args.tagger == 'corenlp': 241 | pt = PosTaggingCoreNLP() 242 | suffix = 'CoreNLP' 243 | 244 | list_of_path = read_file(args.listing_file_path).splitlines() 245 | print('POS Tagging and writing ', len(list_of_path), 'files') 246 | pt.pos_tag_and_write_corpora(list_of_path, suffix) 247 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/util/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/EmbedRank/swisscom_ai/research_keyphrase/util/__init__.py 
-------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/util/fileIO.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | import codecs 7 | 8 | codecs.register_error('replace_with_space', lambda e: (u' ', e.start + 1)) 9 | 10 | 11 | def write_string(s, output_path): 12 | with open(output_path, 'w') as output_file: 13 | output_file.write(s) 14 | 15 | 16 | def read_file(input_path): 17 | with open(input_path, 'r', errors='replace_with_space') as input_file: 18 | return input_file.read().strip() 19 | -------------------------------------------------------------------------------- /KeyExt/EmbedRank/swisscom_ai/research_keyphrase/util/solr_fields.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) 2017-present, Swisscom (Schweiz) AG. 2 | # All rights reserved. 3 | # 4 | #Authors: Kamil Bennani-Smires, Yann Savary 5 | 6 | """Module containing helper function to process results of a solr query""" 7 | 8 | 9 | def process_tagged_text(s): 10 | """ 11 | Return a tagged_text as a list of sentence where each sentence is list of tuple (word,tag) 12 | :param s: string tagged_text coming from solr word1|tag1 word2|tag2[ENDSENT]word3|tag3 ... 13 | :return: (list of list of tuple) list of sentences where each sentence is a list of tuple (word,tag) 14 | """ 15 | 16 | def str2tuple(tagged_token_text, sep='|'): 17 | loc = tagged_token_text.rfind(sep) 18 | if loc >= 0: 19 | return tagged_token_text[:loc], tagged_token_text[loc + len(sep):] 20 | else: 21 | raise RuntimeError('Problem when parsing tagged token '+tagged_token_text) 22 | 23 | result = [] 24 | for sent in s.split('[ENDSENT]'): 25 | sent = [str2tuple(tagged_token) for tagged_token in sent.split(' ')] 26 | result.append(sent) 27 | return result 28 | -------------------------------------------------------------------------------- /KeyExt/KPRank/PositionRank.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | from doc_candidates import LoadFile 5 | import networkx as nx 6 | from numpy import dot 7 | from numpy.linalg import norm 8 | import numpy as np 9 | from math import log10 10 | from collections import defaultdict 11 | import operator 12 | import unicodedata 13 | 14 | def normalize_text(text): 15 | if not isinstance(text, unicode): 16 | text = unicode(text, 'utf-8') 17 | text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore') 18 | return text 19 | 20 | class PositionRank(LoadFile): 21 | 22 | def __init__(self, input_text, window, phrase_type, emb_dim, embeddings): 23 | """ Redefining initializer for PositionRank. """ 24 | 25 | super(PositionRank, self).__init__(input_text=input_text) 26 | 27 | self.graph = nx.Graph() 28 | """ The word graph. 
""" 29 | self.window = window 30 | 31 | self.phrase_type = phrase_type 32 | self.emb_dim = emb_dim 33 | self.embeddings = embeddings#KeyedVectors.load_word2vec_format(emb_file, binary=True) 34 | self.random_embeddings = {} 35 | 36 | 37 | def get_cosine_dist(self, word1, word2): 38 | curr_embeddings1 = [] 39 | if word1.lower() in self.embeddings: 40 | curr_embeddings1 = self.embeddings[word1.lower()] 41 | elif word1.lower() in self.random_embeddings: 42 | curr_embeddings1 = self.random_embeddings[word1.lower()] 43 | else: 44 | curr_embeddings1 = np.random.rand(self.emb_dim) 45 | self.random_embeddings[word1.lower()] = curr_embeddings1 46 | 47 | curr_embeddings2 = [] 48 | if word2.lower() in self.embeddings: 49 | curr_embeddings2 = self.embeddings[word2.lower()] 50 | elif word2.lower() in self.random_embeddings: 51 | curr_embeddings2 = self.random_embeddings[word2.lower()] 52 | else: 53 | curr_embeddings2 = np.random.rand(self.emb_dim) 54 | self.random_embeddings[word2.lower()] = curr_embeddings2 55 | 56 | cos_sim = 0.0 57 | if (norm(curr_embeddings1)*norm(curr_embeddings2)) != 0: 58 | #print curr_embeddings1 59 | #print curr_embeddings2 60 | cos_sim = dot(curr_embeddings1, curr_embeddings2)/(norm(curr_embeddings1)*norm(curr_embeddings2)) 61 | semantic_val = 0.0 62 | if cos_sim != 1.0: 63 | semantic_val = 1.0 / (1.0 - cos_sim) 64 | 65 | return semantic_val 66 | 67 | 68 | def build_graph(self, window, pos=None): 69 | """ 70 | build the word graph 71 | 72 | :param window: the size of window to add edges in the graph 73 | :param pos: he part of speech tags used to select the graph's nodes 74 | :return: 75 | """ 76 | 77 | if pos is None: 78 | pos = ['NN', 'NNS', 'NNP', 'NNPS', 'JJ'] 79 | 80 | # container for the nodes 81 | seq = [] 82 | individual_count = {} # my addition 83 | stemmed_original_map = {} 84 | 85 | # select nodes to be added in the graph 86 | for el in self.words: 87 | if el.pos_pattern in pos: 88 | seq.append((el.stemmed_form, el.position, el.sentence_id)) 89 | self.graph.add_node(el.stemmed_form) 90 | if el.stemmed_form not in individual_count: 91 | individual_count[el.stemmed_form] = 0 92 | individual_count[el.stemmed_form] += 1 93 | if el.stemmed_form not in stemmed_original_map: 94 | stemmed_original_map[el.stemmed_form] = el.surface_form 95 | 96 | # add edges 97 | for i in range(0, len(seq)): 98 | for j in range(i+1, len(seq)): 99 | if seq[i][1] != seq[j][1] and abs(j-i) < window: 100 | if not self.graph.has_edge(seq[i][0], seq[j][0]): 101 | self.graph.add_edge(seq[i][0], seq[j][0], weight=1) 102 | else: 103 | self.graph[seq[i][0]][seq[j][0]]['weight'] += 1 104 | 105 | def candidate_selection(self, pos=None, phrase_type='n_grams'): 106 | """ 107 | the candidates selection for PositionRank 108 | :param pos: pos: the part of speech tags used to select candidates 109 | :return: 110 | """ 111 | 112 | if pos is None: 113 | pos = ['NN', 'NNS', 'NNP', 'NNPS', 'JJ'] 114 | 115 | # uncomment the line below if you wish to extract ngrams instead of the longest phrase 116 | if phrase_type=='n_grams': 117 | self.get_ngrams(n=4, good_pos=pos) 118 | else: 119 | # select the longest phrase as candidate keyphrases 120 | self.get_phrases(self, good_pos=pos) 121 | 122 | 123 | def candidate_scoring(self, pos=None, window=10, theme_mode = 'adj_noun_title' ,update_scoring_method=False): 124 | """ 125 | compute a score for each candidate based on PageRank algorithm 126 | :param pos: the part of speech tags 127 | :param window: window size 128 | :param update_scoring_method: if you want to update the 
scoring method based on my paper cited below: 129 | Florescu, Corina, and Cornelia Caragea. "A New Scheme for Scoring Phrases in Unsupervised Keyphrase Extraction." 130 | European Conference on Information Retrieval. Springer, Cham, 2017. 131 | 132 | :return: 133 | """ 134 | 135 | if pos is None: 136 | pos = ['NN', 'NNS', 'NNP', 'NNPS', 'JJ'] 137 | 138 | # build the word graph 139 | self.build_graph(window=window, pos=pos) 140 | 141 | # filter out canditates that unlikely to be keyphrases 142 | self.filter_candidates(max_phrase_length=4, min_word_length=3, valid_punctuation='-.') 143 | 144 | ######### get Theme scores ######## 145 | 146 | # get the theme vector 147 | theme_vec = np.array([0] * self.emb_dim) 148 | 149 | if theme_mode == 'adj_noun_title': 150 | tv_words = 0 151 | for w, p in self.sentences[0]: 152 | w = w.lower() 153 | if p in pos: 154 | if w in self.embeddings['words']: # Fix embeddings structure bug. 155 | curr_vec = np.array(self.embeddings['embeddings'][self.embeddings['words'].index(w)]) 156 | theme_vec = theme_vec + curr_vec 157 | tv_words += 1 158 | if tv_words > 0: 159 | theme_vec = theme_vec / tv_words 160 | 161 | elif theme_mode == 'adj_noun_all': 162 | tv_words = 0 163 | for sentence in self.sentences: 164 | for w, p in sentence: 165 | w = w.lower() 166 | if p in pos: 167 | if w in self.embeddings['words']: # Fix embeddings structure bug. 168 | curr_vec = np.array(self.embeddings['embeddings'][self.embeddings['words'].index(w)]) 169 | theme_vec = theme_vec + curr_vec 170 | tv_words += 1 171 | if tv_words > 0: 172 | theme_vec = theme_vec / tv_words 173 | 174 | elif theme_mode == 'cls_title': 175 | theme_vec = self.embeddings['cls_ttl'] 176 | elif theme_mode == 'cls_all': 177 | theme_vec = self.embeddings['cls_all'] 178 | elif theme_mode == 'mean_title': 179 | theme_vec = self.embeddings['mean_ttl'] 180 | elif theme_mode == 'mean_all': 181 | theme_vec = self.embeddings['mean_all'] 182 | 183 | # get the thematic scores 184 | personalization_k2v = {} 185 | for w in self.words: 186 | word = w.surface_form 187 | stem = w.stemmed_form 188 | curr_pos = w.pos_pattern 189 | word = word.lower() 190 | if curr_pos in pos: 191 | if stem not in personalization_k2v.keys(): 192 | curr_vec = [] 193 | if word in self.embeddings['words']: # Fix embeddings structure bug. 
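# Note: the embeddings object here is a dict in which 'words' holds the model's (sub)word tokens and 'embeddings' holds the aligned vectors, so a token's vector is retrieved by its index in 'words' (this is the structure pickled by run_scibert_model.py).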
194 | print(theme_mode + ': EMB-FOUND') 195 | curr_vec = self.embeddings['embeddings'][self.embeddings['words'].index(word)] 196 | elif word in self.random_embeddings: 197 | curr_vec = self.random_embeddings[word] 198 | else: 199 | curr_vec = np.random.rand(self.emb_dim) 200 | self.random_embeddings[word] = curr_vec 201 | print('EMB-NOT-FOUND') 202 | cos_sim = 0.000000001 203 | if (norm(curr_vec)*norm(theme_vec)) != 0.0: 204 | cos_sim = dot(curr_vec, theme_vec)/(norm(curr_vec)*norm(theme_vec)) 205 | personalization_k2v[stem] = cos_sim 206 | 207 | ######### get Positional scores ######## 208 | personalization_pr = {} 209 | for w in self.words: 210 | stem = w.stemmed_form 211 | poz = w.position 212 | curr_pos = w.pos_pattern # the word's POS tag (keep the allowed-tag list `pos` intact) 213 | 214 | if curr_pos in pos: 215 | if stem not in personalization_pr: 216 | personalization_pr[stem] = 1.0/poz 217 | else: 218 | personalization_pr[stem] = personalization_pr[stem]+1.0/poz 219 | 220 | ######## multiply both scores ####### 221 | ipdict=[personalization_k2v, personalization_pr] 222 | 223 | output=defaultdict(lambda:1) 224 | for d in ipdict: 225 | for item in d: 226 | output[item] *= d[item] 227 | 228 | personalization = dict(output) 229 | 230 | ######## normalize scores ######## 231 | factor = 1.0 / sum(personalization.values()) 232 | 233 | normalized_personalization = {k: v * factor for k, v in personalization.items()} 234 | 235 | # compute the word scores using a personalized random walk 236 | pagerank_weights = nx.pagerank_scipy(self.graph, personalization=normalized_personalization, weight='weight') 237 | #pagerank_weights = normalized_personalization 238 | 239 | 240 | # loop through the candidates 241 | if update_scoring_method: 242 | for c in self.candidates: 243 | if len(c.stemmed_form.split()) > 1: 244 | # for arithmetic mean 245 | #self.weights[c.stemmed_form] = [stem.stemmed_form for stem in self.candidates].count(c.stemmed_form) * \ 246 | #sum([pagerank_weights[t] for t in c.stemmed_form.split()]) \ 247 | #/ len(c.stemmed_form.split()) 248 | # for harmonic mean 249 | self.weights[c.stemmed_form] = [cand.stemmed_form for cand in self.candidates].count(c.stemmed_form) * \ 250 | len(c.stemmed_form.split()) / sum([1.0 / pagerank_weights[t] for t in c.stemmed_form.split()]) 251 | else: 252 | self.weights[c.stemmed_form] = pagerank_weights[c.stemmed_form] 253 | else: 254 | for c in self.candidates: 255 | self.weights[c.stemmed_form] = sum([pagerank_weights[t] for t in c.stemmed_form.split()]) 256 | 257 | 258 | 259 | -------------------------------------------------------------------------------- /KeyExt/KPRank/README.md: -------------------------------------------------------------------------------- 1 | # KPRank 2 | 3 | This directory hosts modified code for the `KPRank` approach, which can be found in its official [repo](https://github.com/PatelKrutarth/KPRank). 4 | 5 | ## Setup 6 | Follow the instructions from the original repo. 7 | Afterwards, replace the original files with the modified ones from this directory. 8 | The `dataset_dir` variable in `main.py` needs to be set to the dataset directory path. 9 | The `dsDir` and `model_version` variables in `run_scibert_model.py` need to be set as well (see the example below). 
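The snippet below is an illustrative sketch of those assignments, mirroring the defaults shipped in the modified files; the Windows-style paths are placeholders and should point to your local copies of the dataset and the SciBERT checkpoint.

```python
# In KPRank/main.py
dataset_dir = r'..\datasets\Krapivin2009'   # dataset root containing a docsutf8/ folder

# In KPRank/run_scibert_model.py
dsDir = r'..\datasets\Krapivin2009'                           # same dataset root
model_version = r'..\KPRank\KPRank\scibert_scivocab_uncased'  # local scibert_scivocab_uncased model
```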
10 | -------------------------------------------------------------------------------- /KeyExt/KPRank/__init__.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | __author__ = 'Krutarth Patel' 4 | __email__ = 'kipatel@ksu.edu' 5 | __version__ = '1.0' 6 | -------------------------------------------------------------------------------- /KeyExt/KPRank/evaluation.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | 5 | def firstRank(predicted, gold): 6 | """returns the the rank of the first correct predicted keyphrase""" 7 | firstRank = 0 8 | for i in range(0, len(predicted)): 9 | if predicted[i] in gold: 10 | firstRank = i 11 | break 12 | 13 | return firstRank 14 | 15 | 16 | def Rprecision(predicted, gold, k): 17 | 18 | hits = set(predicted).intersection(set(gold)) 19 | Rpr = 0.0 20 | if len(hits)>0 and len(predicted)>0: 21 | Rpr = len(hits)*1.0/k 22 | 23 | return Rpr 24 | 25 | def PRF(predicted, gold, k): 26 | 27 | predicted = predicted[:k] 28 | 29 | hits = set(predicted).intersection(set(gold)) 30 | P, R, F1 = 0.0, 0.0, 0.0 31 | 32 | if len(hits)>0 and len(predicted)>0: 33 | P = len(hits)/len(predicted) 34 | R = len(hits)/len(gold) 35 | F1 = 2*P*R/(P+R) 36 | 37 | return {'precision':P,'recall': R,'f1-score': F1} 38 | 39 | def PRF_range(predicted, gold, k): 40 | 41 | P = [] 42 | R = [] 43 | F1 = [] 44 | 45 | for i in range(0,k): 46 | predict = predicted[:i+1] 47 | 48 | hits = set(predict).intersection(set(gold)) 49 | pr = 0.0 50 | re = 0.0 51 | f1 = 0.0 52 | if len(hits)>0 and len(predict)>0: 53 | pr = len(hits)*1.0/len(predict) 54 | re = len(hits)*1.0/len(gold) 55 | if pr+re > 0: 56 | f1 = 2*pr*re/(pr+re) 57 | 58 | P.append(pr) 59 | R.append(re) 60 | F1.append(f1) 61 | 62 | return P,R,F1 63 | 64 | def Bpref (pred, gold): 65 | incorrect = 0 66 | correct = 0 67 | bpref = 0 68 | 69 | for kp in pred: 70 | if kp in gold: 71 | bpref += (1.0 - (incorrect*1.0/len(pred))) 72 | correct += 1 73 | else: 74 | incorrect +=1 75 | 76 | if correct >0: 77 | bpref = bpref*1.0/correct 78 | else: 79 | bpref = 0.0 80 | 81 | return bpref -------------------------------------------------------------------------------- /KeyExt/KPRank/main.py: -------------------------------------------------------------------------------- 1 | #from __future__ import division 2 | from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter 3 | import sys 4 | import PositionRank 5 | from gensim.models import KeyedVectors 6 | import evaluation 7 | import process_data 8 | import os 9 | from os.path import isfile, join 10 | import pathlib 11 | from nltk.stem.porter import PorterStemmer 12 | porter_stemmer = PorterStemmer() 13 | import pickle 14 | 15 | def ensure_dir(dirName): 16 | if not os.path.exists(dirName): 17 | print('making dir: ' + dirName) 18 | os.makedirs(dirName) 19 | 20 | def load_obj(filePath): 21 | with open(filePath, 'rb') as f: 22 | return pickle.load(f) 23 | 24 | def main(): 25 | # Initialize parameters. 26 | topK = 10 27 | window = 10 28 | phrase_type = 'ngrams' 29 | emb_dim = 768 30 | theme_mode = 'adj_noun_title' 31 | model_name = 'scibert' 32 | 33 | # Initialize paths. 
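# The dataset directory is expected to contain a 'docsutf8' folder with the input documents and a '{model_name}_emb_fulltext_title' folder with the pickled embeddings produced by run_scibert_model.py; extracted keyphrases are written under 'extracted/kprank'. The Windows-style raw-string paths below are examples and should be adapted to the local setup.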
34 | dataset_dir = r'..\datasets\Krapivin2009' 35 | input_dir = os.path.join(dataset_dir, 'docsutf8') 36 | output_dir = os.path.join(dataset_dir, 'extracted\kprank') 37 | emb_dir = os.path.join(dataset_dir, f'{model_name}_emb_fulltext_title') 38 | 39 | # Set the current directory to the input dir 40 | os.chdir(os.path.join(os.getcwd(), input_dir)) 41 | 42 | # Get all file names and their absolute paths. 43 | docnames = sorted(os.listdir()) 44 | docpaths = list(map(os.path.abspath, docnames)) 45 | 46 | # Create the keys directory, after the names and paths are loaded. 47 | pathlib.Path(output_dir).mkdir(parents = True, exist_ok = True) 48 | 49 | for i, (docname, docpath) in enumerate(zip(docnames, docpaths)): 50 | # keys shows up in docnames, erroneously. 51 | if docname == 'keys': 52 | continue 53 | 54 | #if i < 115: continue 55 | 56 | print(f'Processing {i} out of {len(docnames)}...') 57 | 58 | # Form the output path. 59 | output_path = os.path.join(output_dir, docname.split('.')[0]+'.key') 60 | print(output_path) 61 | 62 | # Process the data of the document. 63 | text = process_data.read_input_file(docpath) 64 | 65 | # Load the embeddings. 66 | emb_path = os.path.join(emb_dir, f'{docname}_fulltext.pkl') 67 | embeddings = load_obj(emb_path) 68 | model = PositionRank.PositionRank(text, window, phrase_type, emb_dim, embeddings) 69 | 70 | # Run the model. 71 | model.get_doc_words() 72 | model.candidate_selection() 73 | model.candidate_scoring(theme_mode = theme_mode, update_scoring_method = False) 74 | keyphrases = model.get_best_k(topK)[:10] 75 | 76 | # Write the keyphrases to a file. 77 | keys = '\n'.join(map(str, keyphrases) or '') 78 | with open(output_path, 'w', encoding = 'utf-8-sig', errors = 'ignore') as out: 79 | out.write(keys) 80 | 81 | os.system('clear') 82 | return 83 | 84 | 85 | if __name__ == "__main__": main() -------------------------------------------------------------------------------- /KeyExt/KPRank/process_data.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import codecs 5 | import itertools 6 | from nltk.stem.porter import PorterStemmer 7 | import os.path 8 | from nltk import word_tokenize 9 | from string import punctuation 10 | import re 11 | import unicodedata 12 | 13 | 14 | def read_input_file(this_file): 15 | # read the text of the file; if the file cannot be read then the file is excluded 16 | if os.path.exists(this_file): 17 | with codecs.open(this_file, "r", encoding='utf-8') as f: 18 | #text = f.read() 19 | lines = f.readlines() 20 | lines[0] = lines[0].strip() 21 | if not (lines[0].endswith(".") or lines[0].endswith("?") or lines[0].endswith("!")): 22 | lines[0] = lines[0]+'.' 
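# The first line is treated as the document title: forcing it to end with sentence-final punctuation keeps it as its own sentence once the lines are joined, which matters downstream where the first sentence is used to build the title-based theme vector.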
23 | text = ' '.join(lines) 24 | f.close() 25 | else: 26 | text = None 27 | 28 | return text 29 | 30 | 31 | def read_gold_file(this_gold): 32 | 33 | # read the gold file; if the file cannot be read (does not exist) the file is excluded 34 | if os.path.exists(this_gold): 35 | with codecs.open(this_gold, "r", encoding='utf-8') as f: 36 | gold_list = f.readlines() 37 | f.close() 38 | else: 39 | gold_list = None 40 | 41 | return gold_list 42 | 43 | def get_ascii(text): 44 | if not isinstance(text, unicode): 45 | text = unicode(text, "utf-8") 46 | text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore') 47 | return text 48 | 49 | 50 | def get_stemmed_words_and_stemmed_text(text): 51 | stemmer = PorterStemmer() 52 | text_words = text.split() 53 | text_words_stem = [] 54 | for word in text_words: 55 | text_words_stem.append(stemmer.stem(word)) 56 | text_stem = ' '.join(text_words_stem) 57 | return text_words_stem, text_stem 58 | 59 | 60 | def load_stemmed_gold_phrases(lines): 61 | punct_list = ['\'', '"', '\\', '!', '@', '#', '$', '%', 62 | '^', '&', '*', '(', ')', '_', '-', '+', '=','{', '}', '[', ']', 63 | '|', ':', ';', '<', '>', ',', '.', '?', '/', '`', '~'] 64 | 65 | punct_re = '|'.join(map(re.escape, punct_list)) 66 | 67 | gold_phrases = [] 68 | for line in lines: 69 | line = line.strip() 70 | line = line.lower() 71 | line = get_ascii(line) 72 | line = re.sub(punct_re, ' ', line) 73 | line = re.sub('\s+', ' ', line).strip() 74 | line_words_stem, line_stem = get_stemmed_words_and_stemmed_text(line) 75 | gold_phrases.append(line_stem) 76 | return gold_phrases 77 | 78 | def tokenize(text, encoding): 79 | """ tokenize text 80 | Args: 81 | text: tect to be tokenized 82 | """ 83 | return [token for token in word_tokenize(text.lower().decode(encoding))] 84 | 85 | 86 | def filter_candidates(tokens, stopwords_file=None, min_word_length=2, valid_punctuation='-'): 87 | """ discard candidates based on various criteria 88 | Args: 89 | tokens: tokens to be filtered out 90 | stopwords_file: if you want to load a file with stopwords 91 | min_word_length: filter words shorter than min_word_length 92 | valid_punctuation: filter words that contain other punctuation than valid_punctuation 93 | encoding='utf-8' 94 | """ 95 | 96 | # if a list of stopwords is not provided then load the stopwords'list from nltk 97 | stopwords_list = [] 98 | if stopwords_file is None: 99 | from nltk.corpus import stopwords 100 | stopwords_list = set(stopwords.words('english')) 101 | else: 102 | with codecs.open(stopwords_file, 'rb', encoding='utf-8') as f: 103 | f.readlines() 104 | f.close() 105 | # add the stopword from file in the stopwords_list container 106 | for line in f: 107 | stopwords_list.append(line) 108 | 109 | # keep indices to be deleted 110 | indices = [] 111 | 112 | for i, c in enumerate(tokens): 113 | 114 | # discard those candidates that contain stopwords 115 | if c in stopwords_list: 116 | indices.append(i) 117 | 118 | # discard candidates that contain words shorter that min_word_length 119 | elif len(c) < min_word_length: 120 | indices.append(i) 121 | 122 | elif c in ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']: 123 | indices.append(i) 124 | 125 | else: 126 | 127 | # discard candidates that contain other characters except letter, digits, and valid punctuation 128 | letters_set = set([u for u in c]) 129 | 130 | if letters_set.issubset(punctuation): 131 | indices.append(i) 132 | 133 | elif re.match(r'^[a-zA-Z0-9%s]*$' % valid_punctuation, c): 134 | pass 135 | else: 136 | indices.append(i) 137 
| 138 | dels = 0 139 | 140 | for index in indices: 141 | offset = index - dels 142 | del tokens[offset] 143 | dels += 1 144 | 145 | return tokens 146 | 147 | 148 | def stemming(text): 149 | """ stem tokens """ 150 | p_stemmer = PorterStemmer() 151 | return [p_stemmer.stem(i) for i in text] 152 | 153 | 154 | def iter_data(path_to_data, encoding): 155 | """Yield each article from the Medline """ 156 | files = [] 157 | #with open('/home/corina/Documents/Research/Projects/unsupervisedKE/data_analysis/medline_10000_1.txt','rb') as rf: 158 | #filenames = rf.readlines() 159 | #files = [file.strip() for file in filenames] 160 | #rf.close() 161 | #print files 162 | i=1 163 | #for filename in filenames: #os.listdir(path_to_data): 164 | for filename in os.listdir(path_to_data): 165 | #filename = filename.strip() 166 | 167 | i += 1 168 | with open(path_to_data + filename, 'rb') as f: 169 | text = f.read().strip() 170 | tokens = tokenize(text, encoding) 171 | tokens = filter_candidates(tokens) 172 | tokens = stemming(tokens) 173 | f.close() 174 | yield path_to_data + filename, text, tokens 175 | 176 | 177 | class MyCorpus(object): 178 | 179 | def __init__(self, path_to_data, dictionary, length=None, encoding='utf-8'): 180 | """ 181 | Parse the collection of documents from file path_to_data. 182 | Yield each document in turn, as a list of tokens. 183 | Args: 184 | path_to_data: the location of the collection 185 | dictionary: the mapping between word and ids 186 | length: the number of docs in the corpus 187 | """ 188 | self.path_to_data = path_to_data 189 | self.dictionary = dictionary 190 | self.length = length 191 | self.encoding = encoding 192 | self.index_filename = {} 193 | 194 | def __iter__(self): 195 | 196 | index = 0 197 | 198 | for filename, text, tokens in itertools.islice(iter_data(self.path_to_data, self.encoding), self.length): 199 | self.index_filename[index] = filename 200 | index += 1 201 | yield self.dictionary.doc2bow(tokens) 202 | 203 | def __len__(self): 204 | if self.length is None: 205 | self.length = sum(1 for doc in self) 206 | return self.length 207 | -------------------------------------------------------------------------------- /KeyExt/KPRank/requirements.txt: -------------------------------------------------------------------------------- 1 | backports.functools-lru-cache==1.5 2 | decorator==4.3.0 3 | networkx==2.2 4 | nltk==3.4 5 | nose==1.3.7 6 | numpy==1.15.4 7 | Pillow==5.3.0 8 | psutil==5.4.8 9 | pyparsing==2.3.0 10 | pytz==2018.7 11 | scipy==1.1.0 12 | singledispatch==3.4.0.3 13 | six==1.12.0 14 | subprocess32==3.5.3 15 | torch==1.10.0 16 | transformers==2.8.0 17 | gensim -------------------------------------------------------------------------------- /KeyExt/KPRank/run_scibert_model.py: -------------------------------------------------------------------------------- 1 | import os 2 | import codecs 3 | import pickle 4 | import re 5 | from datetime import datetime 6 | import torch 7 | from transformers import BertTokenizer, BertModel 8 | 9 | def ensure_dir(dirName): 10 | if not os.path.exists(dirName): 11 | print('making dir: ' + dirName) 12 | os.makedirs(dirName) 13 | 14 | def getText(filePath): 15 | text = None 16 | title = None 17 | if os.path.exists(filePath): 18 | with codecs.open(filePath, "r", encoding='utf-8') as f: 19 | lines = f.readlines() 20 | lines[0] = lines[0].strip() 21 | if not (lines[0].endswith(".") or lines[0].endswith(".") or lines[0].endswith("!")): 22 | lines[0] = lines[0]+'.' 
23 | text = ' '.join(lines) 24 | title = lines[0] 25 | f.close() 26 | 27 | return text, title 28 | 29 | def load_obj(filePath): 30 | with open(filePath, 'rb') as f: 31 | return pickle.load(f) 32 | 33 | def save_obj(obj, filePath): 34 | with open(filePath, 'wb') as output: 35 | pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL) 36 | 37 | def embed_text(text, model): 38 | input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0) # Batch size 1 39 | outputs = model(input_ids) 40 | last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple 41 | return last_hidden_states[0] 42 | 43 | def embed_tokens(tokens, model): 44 | input_ids = torch.tensor(tokens).unsqueeze(0) # Batch size 1, only 1 sentense 45 | outputs = model(input_ids) 46 | last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple 47 | return last_hidden_states[0] # num_tokens (or num_words+2) * 768 dimentioanal output 48 | 49 | def main(): 50 | """ 51 | Python 3.7 code 52 | Download SciBERT (scibert_scivocab_uncased) model from: https://github.com/allenai/scibert 53 | generates wordembeddings for each document name listed in overlap_test_bl.txt file in each dataset directory 54 | file structure expected: 55 | - dataset_name 56 | - abstracts : directory containing abstracts 57 | - overlap_test_bl.txt : file containing a list of test documents, 1 document name per line 58 | Generates word embeddings as directory structure below: 59 | - dataset_name 60 | - MODEL_MODE_emb_fulltext_title 61 | - FILE_NAME_fulltext.pkl: file contains words, corresponding tokens, and embeddings for title as an input to the model 62 | - FILE_NAME_fulltext.pkl: file contains words, corresponding tokens, and embeddings for title+abstract as an input to the model 63 | """ 64 | 65 | model_mode = 'scibert' # 'bert' 66 | dsDir = r'..\datasets\Krapivin2009' # directory containing the dataset 67 | 68 | do_lower_case = True 69 | model = None 70 | tokenizer = None 71 | ####### SciBERT model ######### 72 | if model_mode == 'scibert': 73 | # please change the path to a downloaded Scibert Model 74 | model_version = r'..\KPRank\KPRank\scibert_scivocab_uncased' 75 | model = BertModel.from_pretrained(model_version) 76 | tokenizer = BertTokenizer.from_pretrained(model_version, do_lower_case=do_lower_case) 77 | 78 | elif model_mode == 'bert': 79 | tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') 80 | model = BertModel.from_pretrained("bert-base-uncased") 81 | 82 | #datasets = ['hulth', 'semeval'] 83 | #datasets = ['krapivin', 'nus'] 84 | #datasets = ['nus'] 85 | #datasets = ['acm'] 86 | 87 | ipDir = os.path.join(dsDir, 'docsutf8') 88 | opDir = os.path.join(dsDir, f'{model_mode}_emb_fulltext_title') 89 | ensure_dir(opDir) 90 | 91 | # opening a file containing a list of test documents, 1 document name per line 92 | ipList = sorted(os.listdir(ipDir)) 93 | 94 | for i, l in enumerate(ipList): 95 | 96 | print(f'Processing {i} out of {len(ipList)}...') 97 | 98 | #if i < 1761: continue 99 | 100 | l = l.strip() 101 | opFilePath_fulltext = os.path.join(opDir, f'{l}_fulltext.pkl') 102 | opFilePath_title = os.path.join(opDir, f'{l}_title.pkl') 103 | 104 | #print(l) 105 | file_path = os.path.join(ipDir, l) 106 | fulltext, title = getText(file_path) 107 | 108 | fulltext = re.sub('\s+', ' ', fulltext).strip() # remove extra spaces and new lines 109 | title = re.sub('\s+', ' ', title).strip() # remove extra spaces and new lines 110 | 111 | fulltext_words = tokenizer.tokenize(fulltext) 112 | 
title_words = tokenizer.tokenize(title) 113 | 114 | fulltext_en_tokens = tokenizer.convert_tokens_to_ids(['[CLS]'] + fulltext_words[:510] + ['[SEP]']) 115 | title_en_tokens = tokenizer.convert_tokens_to_ids(['[CLS]'] + title_words[:510] + ['[SEP]']) 116 | 117 | 118 | fulltext_em = embed_tokens(fulltext_en_tokens, model).detach().numpy() 119 | title_em = embed_tokens(title_en_tokens, model).detach().numpy() 120 | 121 | fulltext_dict = {} 122 | title_dict = {} 123 | 124 | fulltext_dict['words'] = fulltext_words[:510] 125 | fulltext_dict['tokens'] = fulltext_en_tokens 126 | fulltext_dict['embeddings'] = fulltext_em 127 | 128 | title_dict['words'] = title_words[:510] 129 | title_dict['tokens'] = title_en_tokens 130 | title_dict['embeddings'] = title_em 131 | 132 | save_obj(fulltext_dict, opFilePath_fulltext) 133 | save_obj(title_dict, opFilePath_title) 134 | 135 | os.system('clear') 136 | 137 | if __name__ == "__main__": 138 | main() -------------------------------------------------------------------------------- /KeyExt/Key2Vec/README.md: -------------------------------------------------------------------------------- 1 | # Key2Vec 2 | 3 | This directory hosts code to run the Key2Vec approach from this [repo](https://github.com/MarkSecada/key2vec). 4 | 5 | ## Setup 6 | Clone the aforementioned repository. 7 | Replace the files from this directory over the files of the cloned repository. 8 | Download the `glove.6B.50d.txt` from the [Glove](https://github.com/stanfordnlp/GloVe) repository and place it in the `data` subdirectory. 9 | In `main.py` set the `base_path` to the local dataset directory. 10 | -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pathlib 3 | import key2vec 4 | 5 | def main(): 6 | glove = key2vec.glove.Glove('./data/glove.6B.50d.txt') 7 | base_path = '../datasets/DUC-2001' 8 | input_dir = os.path.join(base_path, 'docsutf8') 9 | output_dir = os.path.join(base_path, 'extracted/key2vec') 10 | 11 | # Set the current directory to the input dir 12 | os.chdir(os.path.join(os.getcwd(), input_dir)) 13 | 14 | # Get all file names and their absolute paths. 15 | docnames = sorted(os.listdir()) 16 | docpaths = list(map(os.path.abspath, docnames)) 17 | 18 | # Create the keys directory, after the names and paths are loaded. 19 | pathlib.Path(output_dir).mkdir(parents = True, exist_ok = True) 20 | 21 | for i, (docname, docpath) in enumerate(zip(docnames, docpaths)): 22 | 23 | if i < 292: continue 24 | # keys shows up in docnames, erroneously. 25 | if docname == 'keys': 26 | continue 27 | 28 | print(f'Processing {i} out of {len(docnames)}...') 29 | 30 | # Save the output dir path 31 | output_dirpath = os.path.join(output_dir, docname.split('.')[0]+'.key') 32 | print(output_dirpath) 33 | 34 | with open(docpath, 'r', encoding = 'utf-8-sig', errors = 'ignore') as file, \ 35 | open(output_dirpath, 'w', encoding = 'utf-8-sig', errors = 'ignore') as out: 36 | 37 | # Read the file and remove the newlines. 38 | text = file.read().replace('\n', ' ') 39 | 40 | # Extract the top 10 keyphrases. 
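# Full Key2Vec pipeline: candidate extraction, theme weighting, candidate graph construction and PageRank-based ranking. Any document that raises an exception is silently skipped and gets no .key file.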
41 | try: 42 | m = key2vec.key2vec.Key2Vec(text, glove) 43 | m.extract_candidates() 44 | m.set_theme_weights() 45 | m.build_candidate_graph() 46 | ranked_list = m.page_rank_candidates(top_n = 10) 47 | 48 | keys = "\n".join(map(str, ranked_list) or '') 49 | out.write(keys) 50 | except: 51 | pass 52 | 53 | os.system('clear') 54 | 55 | 56 | if __name__ == "__main__": main() 57 | -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/__init__.py: -------------------------------------------------------------------------------- 1 | from . import cleaner 2 | from . import constants 3 | from . import docs 4 | from . import glove 5 | from . import key2vec 6 | from . import phrase_graph -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/cleaner.py: -------------------------------------------------------------------------------- 1 | from .constants import STOPWORDS, POS_BLACKLIST, DETERMINERS, PUNCT_SET 2 | 3 | class Cleaner(object): 4 | """Cleans candidate keyphrase""" 5 | 6 | def __init__(self, doc): 7 | self.doc = doc 8 | self.tokens = [token for token in doc] 9 | 10 | def transform_text(self): 11 | transformed_text = [] 12 | tokens_len = len(self.tokens) 13 | for i, token in enumerate(self.tokens): 14 | remove = False 15 | if (i == 0) or (i == tokens_len - 1): 16 | is_stop = token.text in STOPWORDS 17 | is_banned_pos = token.pos_ in POS_BLACKLIST 18 | is_determiner = token.text in DETERMINERS 19 | has_punct = not set(token.text).isdisjoint(PUNCT_SET) 20 | remove = (is_stop 21 | or is_banned_pos 22 | or is_determiner 23 | or has_punct) 24 | else: 25 | pass 26 | if not remove: 27 | transformed_text.append(token.text) 28 | 29 | if transformed_text == []: 30 | return '' 31 | elif '-' in transformed_text: 32 | dash_index = transformed_text.index('-') 33 | first_half = ' '.join(transformed_text[:dash_index]) 34 | sec_half = ' '.join(transformed_text[dash_index + 1:]) 35 | return ' '.join([first_half, sec_half]).lower() 36 | else: 37 | return ' '.join(transformed_text).lower() -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/constants.json: -------------------------------------------------------------------------------- 1 | { 2 | "punctuation": [ 3 | "\\", 4 | "]", 5 | ";", 6 | "%", 7 | "(", 8 | "_", 9 | "@", 10 | ",", 11 | "-", 12 | "–", 13 | "=", 14 | "!", 15 | ":", 16 | "[", 17 | "\"", 18 | ")", 19 | "?", 20 | "}", 21 | "&", 22 | "'", 23 | "|", 24 | "/", 25 | "#", 26 | "<", 27 | "$", 28 | "^", 29 | ".", 30 | "`", 31 | "*", 32 | "+", 33 | "~", 34 | "{", 35 | ">", 36 | "\n", 37 | "\t", 38 | ], 39 | "pos_blacklist": [ 40 | "INTJ", 41 | "AUX", 42 | "CCONJ", 43 | "ADP", 44 | "DET", 45 | "NUM", 46 | "PART", 47 | "PRON", 48 | "SCONJ", 49 | "PUNCT", 50 | "SYM", 51 | "X", 52 | ], 53 | "ents_to_ignore": [ 54 | "DATE", 55 | "TIME", 56 | "PERCENT", 57 | "MONEY", 58 | "QUANTITY", 59 | "ORDINAL", 60 | "CARDINAL", 61 | ], 62 | "determiners": [ 63 | "the", 64 | "a", 65 | "an", 66 | "this", 67 | "that", 68 | "these", 69 | "those", 70 | "my", 71 | "your", 72 | "his", 73 | "her", 74 | "its", 75 | "our", 76 | "their", 77 | "a few", 78 | "a little", 79 | "much", 80 | "many", 81 | "a lot of", 82 | "most", 83 | "some", 84 | "any", 85 | "enough", 86 | "one", 87 | "ten", 88 | "thirty", 89 | "all", 90 | "both", 91 | "either", 92 | "neither", 93 | "each", 94 | "every", 95 | "other", 96 | "another", 97 | "such", 98 | "what", 99 | "rather", 100 | "quite", 101 | ] 102 | } 
-------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/constants.py: -------------------------------------------------------------------------------- 1 | import string 2 | 3 | PUNCT_SET = list(set(string.punctuation)) 4 | PUNCT_SET.append(u'\u201c') 5 | PUNCT_SET.append(u'\u201d') 6 | PUNCT_SET.append(u'\u2018') 7 | PUNCT_SET.append(u'\u2019') 8 | PUNCT_SET.append(u'\u2014') 9 | PUNCT_SET.append(b'\xe2\x80\x9c') 10 | PUNCT_SET.append('\n') 11 | PUNCT_SET.append('\\') 12 | PUNCT_SET.append('\"') 13 | PUNCT_SET.append('\a') 14 | PUNCT_SET.append('\f') 15 | PUNCT_SET.append('\n') 16 | PUNCT_SET.append('\r') 17 | PUNCT_SET.append('\t') 18 | PUNCT_SET.append('\v') 19 | PUNCT_SET = set(PUNCT_SET) 20 | 21 | POS_BLACKLIST = ['INTJ', 'AUX', 'CCONJ', 22 | 'ADP', 'DET', 'NUM', 'PART', 23 | 'PRON', 'SCONJ', 'PUNCT', 24 | 'SYM', 'X'] 25 | 26 | ENTS_TO_IGNORE = ['DATE', 'TIME', 'PERCENT', 27 | 'MONEY', 'QUANTITY', 'ORDINAL', 28 | 'CARDINAL'] 29 | 30 | DETERMINERS = ['the', 'a', 'an', 'this', 'that', 'these', 'those', 31 | 'my', 'your', 'his', 'her', 'its', 'our', 'their', 32 | 'a few', 'a little', 'much', 'many', 'a lot of', 'most', 33 | 'some', 'any', 'enough', 'one', 'ten', 'thirty', 'all', 34 | 'both', 'either', 'neither', 'each', 'every', 'other', 35 | 'another', 'such', 'what', 'rather', 'quite'] 36 | 37 | STOPWORDS = ["word", 38 | "a", "a's", "able", "about", "above", "according", 39 | "accordingly", "across", "actually", "after", "afterwards", 40 | "again", "against", "ago", "aim", "ain't", "all", "allow", 41 | "allows", "almost", "alone", "along", "already", "also", 42 | "although", "always", "am", "among", "amongst", "an", "and", 43 | "another", "any", "anybody", "anyhow", "anyone", "anything", 44 | "anyway", "anyways", "anywhere", "apart", "appear", "appreciate", 45 | "approach", "appropriate", "are", "area", "areas", "aren't", 46 | "around", "as", "aside", "ask", "asked", "asking", "asks", 47 | "associated", "at", "available", "away", "awfully", "b", "back", 48 | "backed", "backing", "backs", "bad", "based", "be", "became", 49 | "because", "become", "becomes", "becoming", "been", "before", 50 | "beforehand", "began", "behind", "being", "beings", "believe", 51 | "below", "beside", "besides", "best", "better", "between", 52 | "beyond", "big", "bit", "both", "brief", "bring", "but", "by", 53 | "c", "c'mon", "c's", "came", "can", "can't", "cannot", "cant", 54 | "case", "cases", "cause", "causes", "certain", "certainly", 55 | "changes", "clear", "clearly", "co", "com", "come", "comes", 56 | "concerning", "consequently", "consider", "considering", 57 | "contain", "containing", "contains", "continue", "corresponding", 58 | "could", "couldn't", "course", "currently", "d", "definitely", 59 | "described", "despite", "did", "didn't", "differ", "different", 60 | "differently", "do", "does", "doesn't", "doing", "don't", "done", 61 | "down", "downed", "downing", "downs", "downwards", "dr", "during", 62 | "e", "each", "earlier", "early", "edu", "eg", "eight", "either", 63 | "else", "elsewhere", "end", "ended", "ending", "ends", "enough", 64 | "entirely", "especially", "et", "etc", "even", "evenly", "ever", 65 | "every", "everybody", "everyone", "everything", "everywhere", "ex", 66 | "exactly", "example", "except", "f", "face", "faces", "fact", 67 | "facts", "far", "felt", "few", "fifth", "find", "finds", "first", 68 | "five", "flawed", "focusing", "followed", "following", "follows", 69 | "for", "former", "formerly", "forth", "four", "from", "full", 70 | 
"fully", "fun", "further", "furthered", "furthering", 71 | "furthermore", "furthers", "g", "gave", "general", "generally", 72 | "get", "gets", "getting", "gigot", "give", "given", "gives", "go", 73 | "goes", "going", "gone", "good", "goods", "got", "gotten", "great", 74 | "greater", "greatest", "greetings", "group", "grouped", "grouping", 75 | "groups", "h", "had", "hadn't", "half", "happens", "hardly", "has", 76 | "hasn't", "have", "haven't", "having", "he", "he'd", "he'll", 77 | "he's", "held", "hello", "help", "hence", "her", "here", "here's", 78 | "hereafter", "hereby", "herein", "hereupon", "hers", "herself", 79 | "hi", "high", "higher", "highest", "him", "himself", "his", 80 | "hither", "hopefully", "how", "how's", "howbeit", "however", "i", 81 | "i'd", "i'll", "i'm", "i've", "ie", "if", "ignored", "ii", 82 | "immediate", "immediately", "important", "in", "inasmuch", "inc", 83 | "include", "including", "indeed", "indicate", "indicated", 84 | "indicates", "inevitable", "inner", "insofar", "instead", 85 | "interest", "interested", "interesting", "interests", "into", 86 | "involving", "inward", "is", "isn't", "issue", "it", "it'd", 87 | "it'll", "it's", "its", "itself", "ix", "j", "just", "k", "keep", 88 | "keeps", "kept", "kind", "knew", "know", "known", "knows", "l", 89 | "large", "largely", "last", "lately", "later", "latest", 90 | "latter", "latterly", "lead", "least", "led", "less", "lest", 91 | "let", "let's", "lets", "letting", "like", "liked", "likely", 92 | "likes", "line", "listen", "little", "long", "longer", "longest", 93 | "look", "looking", "looks", "lot", "ltd", "m", "m.d", "made", 94 | "mainly", "make", "makes", "making", "man", "many", "may", "maybe", 95 | "me", "mean", "meant", "meanwhile", "member", "members", "men", 96 | "merely", "messrs", "met", "might", "more", "moreover", "most", 97 | "mostly", "move", "mr", "mrs", "ms", "much", "must", "mustn't", 98 | "my", "myself", "n", "name", "namely", "nd", "near", "nearly", 99 | "necessary", "need", "needed", "needing", "needs", "neither", 100 | "never", "nevertheless", "new", "newer", "newest", "next", 101 | "nine", "no", "nobody", "non", "none", "nonetheless", "noone", 102 | "nor", "normally", "not", "nothing", "novel", "now", "nowhere", 103 | "number", "numbers", "o", "obviously", "of", "off", "often", 104 | "oh", "ok", "okay", "old", "older", "oldest", "on", "once", 105 | "one", "ones", "only", "onto", "open", "opened", "opening", 106 | "opens", "or", "order", "ordered", "ordering", "orders", 107 | "other", "others", "otherwise", "ought", "our", "ours", 108 | "ourselves", "out", "outside", "over", "overall", 109 | "overwhelming", "own", "p", "part", "parted", "particular", 110 | "particularly", "parting", "parts", "people", "per", "perhaps", 111 | "place", "placed", "places", "please", "plus", "point", "pointed", 112 | "pointing", "points", "possible", "prefer", "present", "presented", 113 | "presenting", "presents", "presumably", "probably", "problem", 114 | "problems", "prof", "provides", "put", "puts", "putting", "q", 115 | "que", "quite", "qv", "r", "rather", "rd", "re", "really", 116 | "reasonably", "recently", "regarding", "regardless", "regards", 117 | "relatively", "respectively", "right", "room", "rooms", "s", 118 | "said", "same", "saw", "say", "saying", "says", "sec", "second", 119 | "secondly", "seconds", "see", "seeing", "seem", "seemed", 120 | "seeming", "seemingly", "seems", "seen", "sees", "self", "selves", 121 | "sensible", "sent", "serious", "seriously", "set", "seven", 122 | "several", "shall", 
"shan't", "she", "she'd", "she'll", "she's", 123 | "shortly", "should", "shouldn't", "show", "showed", "showing", 124 | "shows", "side", "sides", "simply", "since", "six", "small", 125 | "smaller", "smallest", "so", "some", "somebody", "somehow", 126 | "someone", "something", "sometime", "sometimes", "somewhat", 127 | "somewhere", "soon", "sorry", "specified", "specify", "specifying", 128 | "st", "state", "states", "still", "sub", "such", "sup", "sure", 129 | "t", "t's", "take", "taken", "tell", "tends", "th", "than", 130 | "thank", "thanks", "thanx", "that", "that's", "thats", "the", 131 | "their", "theirs", "them", "themselves", "then", "thence", "there", 132 | "there's", "thereafter", "thereby", "therefore", "therein", 133 | "theres", "thereupon", "these", "they", "they'd", "they'll", 134 | "they're", "they've", "thing", "things", "think", "thinks", 135 | "third", "this", "thorough", "thoroughly", "those", "though", 136 | "thought", "thoughts", "three", "through", "throughout", "thru", 137 | "thus", "to", "today", "together", "told", "too", "took", "top", 138 | "toward", "towards", "tried", "tries", "truly", "try", "trying", 139 | "turn", "turned", "turning", "turns", "twice", "two", "u", "un", 140 | "under", "unfortunately", "unless", "unlike", "unlikely", "until", 141 | "unto", "up", "upon", "us", "use", "used", "useful", "uses", 142 | "using", "usually", "uucp", "v", "value", "various", "very", "via", 143 | "viz", "vs", "w", "want", "wanted", "wanting", "wants", "was", 144 | "wasn't", "watched", "way", "ways", "we", "we'd", "we'll", "we're", 145 | "we've", "welcome", "well", "wells", "went", "were", "weren't", 146 | "what", "what's", "whatever", "when", "when's", "whence", 147 | "whenever", "where", "where's", "whereafter", "whereas", "whereby", 148 | "wherein", "whereupon", "wherever", "whether", "which", "while", 149 | "whither", "who", "who's", "whoever", "whole", "whom", "whose", 150 | "why", "why's", "will", "willing", "wish", "with", "within", 151 | "without", "won't", "wonder", "work", "worked", "working", 152 | "works", "worst", "would", "wouldn't", "x", "y", "year", "years", 153 | "yes", "yet", "you", "you'd", "you'll", "you're", "you've", 154 | "young", "younger", "youngest", "your", "yours", "yourself", 155 | "yourselves", "z", "zero", "mr", "ms", "mrs", "mssrs", "mssr", 156 | "also", "said", "should", "could", "would", "week", "weeks", 157 | "month", "months", "year", "years"] -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/docs.py: -------------------------------------------------------------------------------- 1 | from nltk import sent_tokenize, wordpunct_tokenize 2 | from typing import Dict, List, Tuple 3 | from .constants import PUNCT_SET 4 | from .glove import Glove 5 | 6 | import numpy as np 7 | 8 | def cosine_similarity(a: np.float64, b: np.float64) -> float: 9 | norm_a = np.linalg.norm(a) 10 | norm_b = np.linalg.norm(b) 11 | if norm_a == 0 or norm_b == 0: 12 | return -1 13 | return np.dot(a, b) / (norm_a * norm_b) 14 | 15 | def _filter_words(text: str) -> List[str]: 16 | tokens = wordpunct_tokenize(text) 17 | words_filter = [word.lower() for word in tokens 18 | if set(word).isdisjoint(PUNCT_SET)] 19 | return words_filter 20 | 21 | class Document(object): 22 | """Document to be embedded. May be a word, a sentence, etc. 
23 | 24 | Parameters 25 | ---------- 26 | text : str, required 27 | The text to be embedded 28 | glove : Glove, required 29 | GloVe embeddings 30 | 31 | Attributes 32 | ---------- 33 | text : str 34 | dim : int 35 | Dimension of GloVe embeddings. 36 | embedding : np.float64 37 | Document embedding built from average of GloVe embeddings. 38 | """ 39 | 40 | def __init__(self, 41 | text: str, 42 | glove: Glove) -> None: 43 | self.text = text 44 | self.dim = glove.dim 45 | self.embedding = self.__embed_document(glove.embeddings) 46 | 47 | def __embed_document(self, 48 | embeddings: Dict[str, np.float64]) -> np.float64: 49 | words = wordpunct_tokenize(self.text.lower()) 50 | vector = np.zeros(self.dim) 51 | for i, word in enumerate(words): 52 | if embeddings.get(word, None) is None: 53 | vector += np.zeros(self.dim) 54 | else: 55 | vector += embeddings[word] 56 | return vector / len(words) 57 | 58 | def get_word_positions(self) -> Dict[str, List[int]]: 59 | words = _filter_words(self.text) 60 | word_positions = {} 61 | for i, word in enumerate(words): 62 | if word_positions.get(word) is None: 63 | word_positions[word] = [i] 64 | else: 65 | word_positions[word].append(i) 66 | return word_positions 67 | 68 | class Phrase(Document): 69 | """Phrase to be embedded. Inherits from Document object. 70 | 71 | Parameters 72 | ---------- 73 | text : str, required 74 | The text to be embedded 75 | glove : Glove, required 76 | GloVe embeddings 77 | parent : Document, required 78 | Document where the Phrase is from 79 | 80 | Attributes 81 | ---------- 82 | text : str 83 | dim : int 84 | embedding : np.float64 85 | parent : Document 86 | positions : List[Tuple[int]] 87 | List of indices where a given phrase is located. 88 | Each index is represented as a Tuple where the first 89 | element is the first index the phrase appears in 90 | and the second element is the second index the phrase 91 | appears in. If a phrase is a unigram, a position Tuple 92 | is (position, position). 93 | similarity : float 94 | Cosine similarity between the parent document and the phrase. 95 | score : float, None 96 | Min/Max scaling of the cosine similarity in relation to the 97 | other candidate keyphrases. 98 | rank : int, None 99 | Phrase ranking with respect to the score in descending order. 100 | """ 101 | 102 | def __init__(self, 103 | text: str, 104 | parent: Document, 105 | glove: Glove) -> None: 106 | super().__init__(text, glove) 107 | self.parent = parent 108 | self.positions = self.__get_positions() 109 | self.window = self.__expand_window() 110 | self.similarity = cosine_similarity(parent.embedding, 111 | self.embedding) 112 | self.theme_weight = None 113 | self.score = None 114 | self.rank = None 115 | 116 | def __str__(self) -> str: 117 | return self.text 118 | 119 | def set_theme_weight(self, 120 | min_: float, 121 | max_: float) -> None: 122 | # THIS SHOULD BE SET_THEME_EMBEDDING!!!!! 
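# Min/max scaling of this phrase's cosine similarity to the parent document into [0, 1]; the scaled value is stored as the theme weight used later by page_rank_candidates().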
123 | diff = max_ - min_ 124 | self.theme_weight = (self.similarity - min_) / diff 125 | 126 | def calc_pmi(self, phrase, candidates: int): 127 | """Calculates point-wise mutual information between 128 | one candidate phrase and another.""" 129 | prob_phrase_one = len(self.positions) / candidates 130 | prob_phrase_two = len(phrase.positions) / candidates 131 | cooccur = 0 132 | for pos in phrase.positions: 133 | if self.window.get(pos[0]) or self.window.get(pos[1]): 134 | cooccur += 1 135 | prob_cooccur = cooccur / candidates 136 | return np.log(prob_cooccur / (prob_phrase_one * prob_phrase_two)) 137 | 138 | def __get_positions(self) -> List[Tuple[int]]: 139 | """Gets positions a phrase is in.""" 140 | parent_word_positions = self.parent.get_word_positions() 141 | phrase_split = self.text.lower().split(' ') 142 | positions = [] 143 | if len(phrase_split) == 1: 144 | for word_pos in parent_word_positions[phrase_split[0]]: 145 | positions.append((word_pos, word_pos)) 146 | else: 147 | phrase = {word: parent_word_positions[word] 148 | for word in phrase_split} 149 | len_phrase = len(phrase_split) 150 | for position in phrase[phrase_split[0]]: 151 | for i, word in enumerate(phrase_split[1:]): 152 | pred_pos = position + i + 1 153 | end_of_phrase = i + 2 == len_phrase 154 | is_pred_pos = pred_pos in phrase[word] 155 | if is_pred_pos and end_of_phrase: 156 | positions.append((position, pred_pos)) 157 | return positions 158 | 159 | def __expand_window(self) -> Dict[int, int]: 160 | """Returns dictionary of positions in a phrase's 161 | adj. window.""" 162 | window = {} 163 | phrase_len = len(self.parent.text.split(' ')) 164 | for pos in self.positions: 165 | min_index = max(pos[0] - 5, 0) 166 | max_index = min(pos[1] + 6, phrase_len) 167 | indices = [i for i in range(min_index, max_index)] 168 | for i in indices: 169 | if window.get(i) is None: 170 | window[i] = i 171 | return window -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/glove.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from typing import Dict 3 | 4 | class Glove(object): 5 | """GloVe vectors. 6 | 7 | Parameters 8 | ---------- 9 | path : str, required 10 | Path to the GloVe embeddings 11 | 12 | Attributes 13 | ---------- 14 | embeddings : Dict[str, np.float64] 15 | Dictionary of GloVe embeddings 16 | dim : int 17 | Dimension of GloVe embeddings 18 | """ 19 | 20 | def __init__(self, path: str) -> None: 21 | self.embeddings = self.__read_glove(path) 22 | self.dim = self.__get_dim() 23 | 24 | def __read_glove(self, path: str) -> Dict[str, np.float64]: 25 | """Reads GloVe vectors into a dictionary, where 26 | the words are the keys, and the vectors are the values. 
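Each line of the file is expected to hold a word followed by its space-separated vector components, as in the plain-text GloVe format.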
27 | 28 | Returns 29 | ------- 30 | word_vectors : Dict[str, np.float64] 31 | """ 32 | with open(path, 'r') as f: 33 | data = f.readlines() 34 | word_vectors = {} 35 | for row in data: 36 | stripped_row = row.strip('\n') 37 | split_row = stripped_row.split(' ') 38 | word = split_row[0] 39 | vector = [] 40 | for el in split_row[1:]: 41 | vector.append(float(el)) 42 | word_vectors[word] = np.array(vector) 43 | return word_vectors 44 | 45 | def __get_dim(self) -> int: 46 | return len(self.embeddings[list(self.embeddings.keys())[0]]) -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/key2vec.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import spacy 3 | import string 4 | import en_core_web_sm 5 | import os 6 | 7 | from nltk import sent_tokenize, wordpunct_tokenize 8 | from typing import Dict, List 9 | from .cleaner import Cleaner 10 | from .constants import ENTS_TO_IGNORE, STOPWORDS, PUNCT_SET 11 | from .docs import Document, Phrase 12 | from .glove import Glove 13 | from .phrase_graph import PhraseNode, PhraseGraph 14 | 15 | NLP = en_core_web_sm.load() 16 | 17 | class Key2Vec(object): 18 | """Implementation of Key2Vec. 19 | 20 | Parameters 21 | ---------- 22 | text : str, required 23 | The text to extract the top keyphrases from. 24 | glove : Glove 25 | GloVe vectors. 26 | 27 | Attributes 28 | ---------- 29 | text : Document 30 | Document object of the `text` parameter. 31 | glove : Glove 32 | candidates : List[Phrase] 33 | List of candidate keyphrases. Initialized as an empty list. 34 | candidate_graph : PhraseGraph 35 | Bidrectional graph of all candidate phrases 36 | """ 37 | 38 | def __init__(self, 39 | text: str, 40 | glove: Glove) -> None: 41 | 42 | self.doc = Document(text, glove) 43 | self.glove = glove 44 | self.candidates = [] 45 | self.candidate_graph = None 46 | 47 | def extract_candidates(self): 48 | """Extracts candidate phrases from the text. Sets 49 | `candidates` attributes to a list of Phrase objects. 
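Candidates are gathered from three sources (individual tokens, named entities and noun chunks) and filtered for stopwords, punctuation and ignored entity types.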
50 | """ 51 | 52 | sentences = sent_tokenize(self.doc.text) 53 | candidates = {} 54 | for sentence in sentences: 55 | doc = NLP(sentence) 56 | candidates = self.__extract_tokens(doc, candidates) 57 | candidates = self.__extract_entities(doc, candidates) 58 | candidates = self.__extract_noun_chunks(doc, candidates) 59 | self.candidates = list(candidates.values()) 60 | 61 | def __extract_tokens(self, doc, candidates): 62 | for token in doc: 63 | text = token.text.lower() 64 | not_punct = set(text).isdisjoint(PUNCT_SET) 65 | is_stopword = text in STOPWORDS 66 | in_candidates = candidates.get(text) is not None 67 | not_empty = text != '' 68 | keep = (not_punct 69 | and not_empty 70 | and not (is_stopword or in_candidates)) 71 | if keep: 72 | try: 73 | candidates[text] = Phrase(text, self.doc, 74 | self.glove) 75 | except KeyError: 76 | next 77 | else: 78 | pass 79 | return candidates 80 | 81 | def __extract_entities(self, doc, candidates): 82 | for ent in doc.ents: 83 | cleaned_text = Cleaner(ent).transform_text() 84 | is_ent_to_ignore = ent.label_ in ENTS_TO_IGNORE 85 | in_candidates = candidates.get(cleaned_text) is not None 86 | not_empty = cleaned_text != '' 87 | if not (is_ent_to_ignore or in_candidates) and not_empty: 88 | try: 89 | candidates[cleaned_text] = Phrase(cleaned_text, self.doc, 90 | self.glove) 91 | except KeyError: 92 | next 93 | return candidates 94 | 95 | def __extract_noun_chunks(self, doc, candidates): 96 | for chunk in doc.noun_chunks: 97 | cleaned_text = Cleaner(chunk).transform_text() 98 | not_empty = cleaned_text != '' 99 | if candidates.get(cleaned_text) is None and not_empty: 100 | try: 101 | candidates[cleaned_text] = Phrase(cleaned_text, 102 | self.doc, self.glove) 103 | except KeyError: 104 | next 105 | return candidates 106 | 107 | def set_theme_weights(self) -> List[Phrase]: 108 | """Ranks candidate keyphrases. 109 | 110 | Parameters 111 | ---------- 112 | top_n : int, optional (int = 10) 113 | How many top keyphrases to return. 114 | 115 | Returns 116 | ------- 117 | sorted_candidates : List[Phrase] 118 | Sorted list of candidates in reverse order. Returns `top_n` 119 | Phrase objects. 
120 | """ 121 | max_ = max([c.similarity for c in self.candidates]) 122 | min_ = min([c.similarity for c in self.candidates]) 123 | 124 | for c in self.candidates: 125 | c.set_theme_weight(min_, max_) 126 | 127 | def build_candidate_graph(self) -> None: 128 | """Builds bidirectional graph of candidates.""" 129 | 130 | if self.candidates == []: 131 | return 132 | 133 | candidate_graph = PhraseGraph(self.candidates) 134 | for candidate in self.candidates: 135 | candidate_graph.add_node(candidate) 136 | 137 | nodes = len(self.candidates) 138 | 139 | for node in candidate_graph.nodes: 140 | for other in candidate_graph.nodes: 141 | if node != other: 142 | candidate_graph.nodes[node].add_neighbor( 143 | candidate_graph.nodes[other], nodes) 144 | self.candidate_graph = candidate_graph 145 | 146 | def page_rank_candidates(self, top_n: int=10) -> List[Phrase]: 147 | """Page Ranks candidate phrases.""" 148 | if self.candidate_graph is None: 149 | return 150 | 151 | for node in self.candidate_graph.nodes.values(): 152 | theme = node.phrase.theme_weight 153 | d = 0.85 154 | weights = [] 155 | neighbors = list(node.adj_nodes.keys()) 156 | for neighbor in neighbors: 157 | out = node.adj_nodes[neighbor].incoming_edges 158 | weights.append(node.adj_weights[neighbor] / out) 159 | score = theme * (1 - d) + d * sum(weights) 160 | node.phrase.score = score 161 | 162 | sorted_candidates = sorted(self.candidates, 163 | key=lambda x: x.score)[::-1] 164 | 165 | for i, c in enumerate(sorted_candidates): 166 | c.rank = i + 1 167 | 168 | return sorted_candidates[:top_n] -------------------------------------------------------------------------------- /KeyExt/Key2Vec/key2vec/phrase_graph.py: -------------------------------------------------------------------------------- 1 | from .docs import Document, Phrase, cosine_similarity 2 | from typing import List 3 | 4 | class PhraseNode(object): 5 | """Node in Phrase Graph.""" 6 | 7 | def __init__(self, phrase: Phrase): 8 | self.key = phrase.text 9 | self.phrase = phrase 10 | self.incoming_edges = 0 11 | self.adj_nodes = {} 12 | self.adj_weights = {} 13 | 14 | def __repr__(self): 15 | return str(self.key) 16 | 17 | def __lt__(self, other): 18 | return self.key < other.key 19 | 20 | def add_neighbor(self, neighbor, candidates, weight=0): 21 | if neighbor is None or weight is None: 22 | raise TypeError('neighbor or weight cannot be None') 23 | if self.__in_window(neighbor): 24 | neighbor.incoming_edges += 1 25 | cosine_score = cosine_similarity(self.phrase.embedding, 26 | neighbor.phrase.embedding) 27 | # need to rewrite api to allow candidates to be calculated 28 | pmi = self.phrase.calc_pmi(neighbor.phrase, candidates) 29 | self.adj_weights[neighbor.key] = cosine_score * pmi 30 | self.adj_nodes[neighbor.key] = neighbor 31 | 32 | def __in_window(self, neighbor): 33 | window = self.phrase.window 34 | neighbor_pos = neighbor.phrase.positions 35 | for pos in neighbor_pos: 36 | pos0 = window.get(pos[0]) 37 | pos1 = window.get(pos[1]) 38 | if window.get(pos0) or window.get(pos1): 39 | return True 40 | return False 41 | 42 | class PhraseGraph(object): 43 | """Bi-directional G=graph of phrases""" 44 | 45 | def __init__(self, candidates: List[Phrase]): 46 | self.nodes = {} 47 | self.candidates = candidates 48 | 49 | def add_node(self, key): 50 | if key is None: 51 | raise TypeError('key cannot be None') 52 | if key not in self.nodes: 53 | self.nodes[key] = PhraseNode(key) 54 | return self.nodes[key] 55 | 56 | def add_edge(self, source_key, dest_key, weight=0): 57 | if source_key is 
None or dest_key is None: 58 | raise KeyError('Invalid key') 59 | if source_key not in self.nodes: 60 | self.add_node(dest_key) 61 | if dest_key not in self.nodes: 62 | self.add_node(dest_key) 63 | self.nodes[source_key].add_neighbor(self.nodes[dest_key], 64 | weight) -------------------------------------------------------------------------------- /KeyExt/Key2Vec/requirements.txt: -------------------------------------------------------------------------------- 1 | blis==0.4.1 2 | certifi==2019.9.11 3 | chardet==3.0.4 4 | cymem==2.0.2 5 | idna==2.8 6 | murmurhash==1.0.2 7 | nltk==3.4.5 8 | numpy==1.17.3 9 | plac==0.9.6 10 | preshed==3.0.2 11 | python-dotenv==0.10.3 12 | requests==2.22.0 13 | scipy==1.3.1 14 | six==1.12.0 15 | spacy==2.2.1 16 | srsly==0.1.0 17 | thinc==7.1.1 18 | tqdm==4.36.1 19 | urllib3==1.25.6 20 | wasabi==0.2.2 21 | -------------------------------------------------------------------------------- /KeyExt/Key2Vec/setup.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/Key2Vec/setup.py -------------------------------------------------------------------------------- /KeyExt/Key2Vec/test.py: -------------------------------------------------------------------------------- 1 | import key2vec 2 | 3 | path = './data/glove.6B.50d.txt' 4 | glove = key2vec.glove.Glove(path) 5 | with open('./test.txt', 'r') as f: 6 | test = f.read() 7 | m = key2vec.key2vec.Key2Vec(test, glove) 8 | m.extract_candidates() 9 | m.set_theme_weights() 10 | m.build_candidate_graph() 11 | ranked = m.page_rank_candidates() 12 | 13 | for row in ranked: 14 | print(f'{row.text}') 15 | -------------------------------------------------------------------------------- /KeyExt/Key2Vec/test.txt: -------------------------------------------------------------------------------- 1 | Optimal and safe ship control as a multi-step matrix game 2 | The paper describes the process of the safe ship control in a collision 3 | situation using a differential game model with j participants. As an 4 | approximated model of the manoeuvring process, a model of a multi-step 5 | matrix game is adopted here. RISKTRAJ computer program is designed in 6 | the Matlab language in order to determine the ship's trajectory as a 7 | certain sequence of manoeuvres executed by altering the course and 8 | speed, in the online navigator decision support system. These 9 | considerations are illustrated with examples of a computer simulation 10 | of the safe ship's trajectories in real situation at sea when passing 11 | twelve of the encountered objects 12 | -------------------------------------------------------------------------------- /KeyExt/Key2Vec/tests/test_docs.py: -------------------------------------------------------------------------------- 1 | # More things to test about both the Document object 2 | # and the Phrase object 3 | 4 | import pytest 5 | from key2vec.glove import Glove 6 | from key2vec.docs import Document, Phrase 7 | 8 | glove = Glove('../data/glove.6B/glove.6B.50d.txt') 9 | 10 | def test_document(): 11 | text = "Hello! My name is Mark Secada. I'm a Data Scientist." 12 | doc = Document(text, glove) 13 | assert doc.text == text 14 | assert doc.dim == 50 15 | assert doc.embedding is not None 16 | 17 | def test_phrase(): 18 | text = "Hello! My name is Mark Secada. I'm a Data Scientist." 
19 | doc = Document(text, glove) 20 | phrase = Phrase("Mark Secada", glove, doc) 21 | assert phrase.text == "Mark Secada" 22 | assert phrase.dim == 50 23 | assert phrase.embedding is not None 24 | assert phrase.parent.text == text 25 | assert phrase.parent.dim == phrase.dim 26 | assert phrase.parent.embedding is not None 27 | assert type(phrase.similarity) == float 28 | 29 | phrase = Phrase("Secada", glove, doc) 30 | assert phrase.similarity == -1 31 | 32 | -------------------------------------------------------------------------------- /KeyExt/Key2Vec/tests/test_glove.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | from key2vec.glove import Glove 3 | 4 | def test_glove(): 5 | path = '../data/glove.6B/glove.6B.50d.txt' 6 | glove = Glove(path) 7 | assert glove.dim == 50 8 | assert glove.embeddings.get('the', None) is not None -------------------------------------------------------------------------------- /KeyExt/KeyBERT/README.md: -------------------------------------------------------------------------------- 1 | # KeyBERT 2 | 3 | This directory hosts code to run and benchmark the [KeyBERT](https://github.com/MaartenGr/KeyBERT) approach. 4 | 5 | ## Setup 6 | In order to run this approach, you need to `pip install keybert` and modify the `base_path` in `KeyBERT.py`, which is used to access the dataset directory. 7 | If you wish to run the `benchmark()` function you need to set the `output_path`, as well. -------------------------------------------------------------------------------- /KeyExt/RVA/Makefile: -------------------------------------------------------------------------------- 1 | CC = gcc 2 | #For older gcc, use -O3 or -O2 instead of -Ofast 3 | # CFLAGS = -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result 4 | 5 | # Use -Ofast with caution. It speeds up training, but the checks for NaN will not work 6 | # (-Ofast turns on --fast-math, which turns on -ffinite-math-only, 7 | # which assumes everything is NOT NaN or +-Inf, so checks for NaN always return false 8 | # see https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html) 9 | # CFLAGS = -lm -pthread -Ofast -march=native -funroll-loops -Wall -Wextra -Wpedantic 10 | 11 | CFLAGS = -lm -pthread -O3 -march=native -funroll-loops -Wall -Wextra -Wpedantic 12 | BUILDDIR := build 13 | SRCDIR := src 14 | OBJDIR := $(BUILDDIR) 15 | 16 | OBJ := $(OBJDIR)/vocab_count.o $(OBJDIR)/cooccur.o $(OBJDIR)/shuffle.o $(OBJDIR)/glove.o 17 | HEADERS := $(SRCDIR)/common.h 18 | MODULES := $(BUILDDIR)/vocab_count $(BUILDDIR)/cooccur $(BUILDDIR)/shuffle $(BUILDDIR)/glove 19 | 20 | 21 | all: dir $(OBJ) $(MODULES) 22 | dir : 23 | mkdir -p $(BUILDDIR) 24 | $(BUILDDIR)/glove : $(OBJDIR)/glove.o $(OBJDIR)/common.o 25 | $(CC) $^ -o $@ $(CFLAGS) 26 | $(BUILDDIR)/shuffle : $(OBJDIR)/shuffle.o $(OBJDIR)/common.o 27 | $(CC) $^ -o $@ $(CFLAGS) 28 | $(BUILDDIR)/cooccur : $(OBJDIR)/cooccur.o $(OBJDIR)/common.o 29 | $(CC) $^ -o $@ $(CFLAGS) 30 | $(BUILDDIR)/vocab_count : $(OBJDIR)/vocab_count.o $(OBJDIR)/common.o 31 | $(CC) $^ -o $@ $(CFLAGS) 32 | $(OBJDIR)/%.o : $(SRCDIR)/%.c $(HEADERS) 33 | $(CC) -c $< -o $@ $(CFLAGS) 34 | .PHONY: clean 35 | clean: 36 | rm -rf $(BUILDDIR) 37 | -------------------------------------------------------------------------------- /KeyExt/RVA/README.md: -------------------------------------------------------------------------------- 1 | # RVA 2 | 3 | This directory contains the modified code for the [RVA](https://github.com/epapagia/RVA) approach. 
4 | 5 | ## Setup 6 | Follow the instructions from the original repo. 7 | Afterwards replace the files with the modified ones. 8 | In `main.py`, `base_path` and the path in `subprocess.call` need to be set for the dataset directory and the `.sh` script respectively. 9 | -------------------------------------------------------------------------------- /KeyExt/RVA/build/common.o: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/common.o -------------------------------------------------------------------------------- /KeyExt/RVA/build/cooccur: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/cooccur -------------------------------------------------------------------------------- /KeyExt/RVA/build/cooccur.o: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/cooccur.o -------------------------------------------------------------------------------- /KeyExt/RVA/build/glove: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/glove -------------------------------------------------------------------------------- /KeyExt/RVA/build/glove.o: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/glove.o -------------------------------------------------------------------------------- /KeyExt/RVA/build/shuffle: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/shuffle -------------------------------------------------------------------------------- /KeyExt/RVA/build/shuffle.o: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/shuffle.o -------------------------------------------------------------------------------- /KeyExt/RVA/build/vocab_count: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/vocab_count -------------------------------------------------------------------------------- /KeyExt/RVA/build/vocab_count.o: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/build/vocab_count.o -------------------------------------------------------------------------------- /KeyExt/RVA/cooccurrence.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/cooccurrence.bin -------------------------------------------------------------------------------- 
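Relating to the RVA setup notes above: the README asks for two paths to be edited in `main.py` — the dataset directory (`base_path`) and the location of the `.sh` script passed to `subprocess.call`. A minimal, hedged sketch of what that might look like (the variable names and script arguments here are placeholders, not the repository's actual values; `demo.sh` further below shows that the script expects a corpus file plus a filename suffix, a vector size and an iteration count):

```python
import subprocess

# Placeholder paths -- adjust to your environment.
base_path = '/path/to/datasets'               # dataset directory read by main.py
script_path = '/path/to/KeyExt/RVA/demo.sh'   # the .sh script referenced in subprocess.call

# Hypothetical call: corpus file, output suffix, vector size, max iterations.
subprocess.call(['bash', script_path, 'corpus.txt', '_doc0', '100', '15'])
```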
/KeyExt/RVA/cooccurrence.shuf.bin: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/NC0DER/KeyphraseExtraction/b2e5736fca737a5c7f6ae9e4e58a02bcb2c4e130/KeyExt/RVA/cooccurrence.shuf.bin -------------------------------------------------------------------------------- /KeyExt/RVA/demo.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -e 3 | 4 | # Makes programs, downloads sample data, trains a GloVe model, and then evaluates it. 5 | # One optional argument can specify the language used for eval script: matlab, octave or [default] python 6 | 7 | #make 8 | #if [ ! -e text8 ]; then 9 | # if hash wget 2>/dev/null; then 10 | # wget http://mattmahoney.net/dc/text8.zip 11 | # else 12 | # curl -O http://mattmahoney.net/dc/text8.zip 13 | # fi 14 | # unzip text8.zip 15 | # rm text8.zip 16 | #fi 17 | 18 | CORPUS=$1 19 | VOCAB_FILE="vocab.txt$2$3$4" 20 | COOCCURRENCE_FILE=/home/groot/Desktop/RVA/glove/cooccurrence.bin 21 | COOCCURRENCE_SHUF_FILE=/home/groot/Desktop/RVA/glove/cooccurrence.shuf.bin 22 | BUILDDIR=/home/groot/Desktop/RVA/glove/build 23 | SAVE_FILE="vectors$2$3$4" 24 | VERBOSE=2 25 | MEMORY=7.1 26 | VOCAB_MIN_COUNT=1 27 | VECTOR_SIZE=$3 28 | MAX_ITER=$4 29 | WINDOW_SIZE=10 30 | BINARY=2 31 | NUM_THREADS=8 32 | X_MAX=100 33 | 34 | echo "$ $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE" 35 | $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE 36 | echo "$ $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE" 37 | $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE 38 | echo "$ $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE" 39 | $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE 40 | echo "$ $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE" 41 | $BUILDDIR/glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE 42 | if [ "$CORPUS" = 'text8' ]; then 43 | if [ "$1" = 'matlab' ]; then 44 | matlab -nodisplay -nodesktop -nojvm -nosplash < ./eval/matlab/read_and_evaluate.m 1>&2 45 | elif [ "$1" = 'octave' ]; then 46 | octave < ./eval/octave/read_and_evaluate_octave.m 1>&2 47 | else 48 | echo "$ python eval/python/evaluate.py" 49 | python eval/python/evaluate.py 50 | fi 51 | fi 52 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/matlab/WordLookup.m: -------------------------------------------------------------------------------- 1 | function index = WordLookup(InputString) 2 | global wordMap 3 | if wordMap.isKey(InputString) 4 | index = wordMap(InputString); 5 | elseif wordMap.isKey('') 6 | index = wordMap(''); 7 | else 8 | index = 0; 9 | end 10 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/matlab/evaluate_vectors.m: -------------------------------------------------------------------------------- 1 | function [BB] = evaluate_vectors(W) 2 
| 3 | global wordMap 4 | 5 | filenames = {'capital-common-countries' 'capital-world' 'currency' 'city-in-state' 'family' 'gram1-adjective-to-adverb' ... 6 | 'gram2-opposite' 'gram3-comparative' 'gram4-superlative' 'gram5-present-participle' 'gram6-nationality-adjective' ... 7 | 'gram7-past-tense' 'gram8-plural' 'gram9-plural-verbs'}; 8 | path = './eval/question-data/'; 9 | 10 | split_size = 100; %to avoid memory overflow, could be increased/decreased depending on system and vocab size 11 | 12 | correct_sem = 0; %count correct semantic questions 13 | correct_syn = 0; %count correct syntactic questions 14 | correct_tot = 0; %count correct questions 15 | count_sem = 0; %count all semantic questions 16 | count_syn = 0; %count all syntactic questions 17 | count_tot = 0; %count all questions 18 | full_count = 0; %count all questions, including those with unknown words 19 | 20 | if wordMap.isKey('') 21 | unkkey = wordMap(''); 22 | else 23 | unkkey = 0; 24 | end 25 | 26 | for j=1:length(filenames); 27 | 28 | clear dist; 29 | 30 | fid=fopen([path filenames{j} '.txt']); 31 | temp=textscan(fid,'%s%s%s%s'); 32 | fclose(fid); 33 | ind1 = cellfun(@WordLookup,temp{1}); %indices of first word in analogy 34 | ind2 = cellfun(@WordLookup,temp{2}); %indices of second word in analogy 35 | ind3 = cellfun(@WordLookup,temp{3}); %indices of third word in analogy 36 | ind4 = cellfun(@WordLookup,temp{4}); %indices of answer word in analogy 37 | full_count = full_count + length(ind1); 38 | ind = (ind1 ~= unkkey) & (ind2 ~= unkkey) & (ind3 ~= unkkey) & (ind4 ~= unkkey); %only look at those questions which have no unknown words 39 | ind1 = ind1(ind); 40 | ind2 = ind2(ind); 41 | ind3 = ind3(ind); 42 | ind4 = ind4(ind); 43 | disp([filenames{j} ':']); 44 | mx = zeros(1,length(ind1)); 45 | num_iter = ceil(length(ind1)/split_size); 46 | for jj=1:num_iter 47 | range = (jj-1)*split_size+1:min(jj*split_size,length(ind1)); 48 | dist = full(W * (W(ind2(range),:)' - W(ind1(range),:)' + W(ind3(range),:)')); %cosine similarity if input W has been normalized 49 | for i=1:length(range) 50 | dist(ind1(range(i)),i) = -Inf; 51 | dist(ind2(range(i)),i) = -Inf; 52 | dist(ind3(range(i)),i) = -Inf; 53 | end 54 | [~, mx(range)] = max(dist); %predicted word index 55 | end 56 | 57 | val = (ind4 == mx'); %correct predictions 58 | count_tot = count_tot + length(ind1); 59 | correct_tot = correct_tot + sum(val); 60 | disp(['ACCURACY TOP1: ' num2str(mean(val)*100,'%-2.2f') '% (' num2str(sum(val)) '/' num2str(length(val)) ')']); 61 | if j < 6 62 | count_sem = count_sem + length(ind1); 63 | correct_sem = correct_sem + sum(val); 64 | else 65 | count_syn = count_syn + length(ind1); 66 | correct_syn = correct_syn + sum(val); 67 | end 68 | 69 | disp(['Total accuracy: ' num2str(100*correct_tot/count_tot,'%-2.2f') '% Semantic accuracy: ' num2str(100*correct_sem/count_sem,'%-2.2f') '% Syntactic accuracy: ' num2str(100*correct_syn/count_syn,'%-2.2f') '%']); 70 | 71 | end 72 | disp('________________________________________________________________________________'); 73 | disp(['Questions seen/total: ' num2str(100*count_tot/full_count,'%-2.2f') '% (' num2str(count_tot) '/' num2str(full_count) ')']); 74 | disp(['Semantic Accuracy: ' num2str(100*correct_sem/count_sem,'%-2.2f') '% (' num2str(correct_sem) '/' num2str(count_sem) ')']); 75 | disp(['Syntactic Accuracy: ' num2str(100*correct_syn/count_syn,'%-2.2f') '% (' num2str(correct_syn) '/' num2str(count_syn) ')']); 76 | disp(['Total Accuracy: ' num2str(100*correct_tot/count_tot,'%-2.2f') '% (' 
num2str(correct_tot) '/' num2str(count_tot) ')']); 77 | BB = [100*correct_sem/count_sem 100*correct_syn/count_syn 100*correct_tot/count_tot]; 78 | 79 | end 80 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/matlab/read_and_evaluate.m: -------------------------------------------------------------------------------- 1 | addpath('./eval/matlab'); 2 | if(~exist('vocab_file')) 3 | vocab_file = 'vocab.txt'; 4 | end 5 | if(~exist('vectors_file')) 6 | vectors_file = 'vectors.bin'; 7 | end 8 | 9 | fid = fopen(vocab_file, 'r'); 10 | words = textscan(fid, '%s %f'); 11 | fclose(fid); 12 | words = words{1}; 13 | vocab_size = length(words); 14 | global wordMap 15 | wordMap = containers.Map(words(1:vocab_size),1:vocab_size); 16 | 17 | fid = fopen(vectors_file,'r'); 18 | fseek(fid,0,'eof'); 19 | vector_size = ftell(fid)/16/vocab_size - 1; 20 | frewind(fid); 21 | WW = fread(fid, [vector_size+1 2*vocab_size], 'double')'; 22 | fclose(fid); 23 | 24 | W1 = WW(1:vocab_size, 1:vector_size); % word vectors 25 | W2 = WW(vocab_size+1:end, 1:vector_size); % context (tilde) word vectors 26 | 27 | W = W1 + W2; %Evaluate on sum of word vectors 28 | W = bsxfun(@rdivide,W,sqrt(sum(W.*W,2))); %normalize vectors before evaluation 29 | evaluate_vectors(W); 30 | exit 31 | 32 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/octave/WordLookup_octave.m: -------------------------------------------------------------------------------- 1 | function index = WordLookup_octave(InputString) 2 | global wordMap 3 | 4 | if isfield(wordMap, InputString) 5 | index = wordMap.(InputString); 6 | elseif isfield(wordMap, '') 7 | index = wordMap.(''); 8 | else 9 | index = 0; 10 | end 11 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/octave/evaluate_vectors_octave.m: -------------------------------------------------------------------------------- 1 | function [BB] = evaluate_vectors_octave(W) 2 | 3 | global wordMap 4 | 5 | filenames = {'capital-common-countries' 'capital-world' 'currency' 'city-in-state' 'family' 'gram1-adjective-to-adverb' ... 6 | 'gram2-opposite' 'gram3-comparative' 'gram4-superlative' 'gram5-present-participle' 'gram6-nationality-adjective' ... 
7 | 'gram7-past-tense' 'gram8-plural' 'gram9-plural-verbs'}; 8 | path = './eval/question-data/'; 9 | 10 | split_size = 100; %to avoid memory overflow, could be increased/decreased depending on system and vocab size 11 | 12 | correct_sem = 0; %count correct semantic questions 13 | correct_syn = 0; %count correct syntactic questions 14 | correct_tot = 0; %count correct questions 15 | count_sem = 0; %count all semantic questions 16 | count_syn = 0; %count all syntactic questions 17 | count_tot = 0; %count all questions 18 | full_count = 0; %count all questions, including those with unknown words 19 | 20 | 21 | if isfield(wordMap, '') 22 | unkkey = wordMap.(''); 23 | else 24 | unkkey = 0; 25 | end 26 | 27 | for j=1:length(filenames); 28 | 29 | clear dist; 30 | 31 | fid=fopen([path filenames{j} '.txt']); 32 | temp=textscan(fid,'%s%s%s%s'); 33 | fclose(fid); 34 | ind1 = cellfun(@WordLookup_octave,temp{1}); %indices of first word in analogy 35 | ind2 = cellfun(@WordLookup_octave,temp{2}); %indices of second word in analogy 36 | ind3 = cellfun(@WordLookup_octave,temp{3}); %indices of third word in analogy 37 | ind4 = cellfun(@WordLookup_octave,temp{4}); %indices of answer word in analogy 38 | full_count = full_count + length(ind1); 39 | ind = (ind1 ~= unkkey) & (ind2 ~= unkkey) & (ind3 ~= unkkey) & (ind4 ~= unkkey); %only look at those questions which have no unknown words 40 | ind1 = ind1(ind); 41 | ind2 = ind2(ind); 42 | ind3 = ind3(ind); 43 | ind4 = ind4(ind); 44 | disp([filenames{j} ':']); 45 | mx = zeros(1,length(ind1)); 46 | num_iter = ceil(length(ind1)/split_size); 47 | for jj=1:num_iter 48 | range = (jj-1)*split_size+1:min(jj*split_size,length(ind1)); 49 | dist = full(W * (W(ind2(range),:)' - W(ind1(range),:)' + W(ind3(range),:)')); %cosine similarity if input W has been normalized 50 | for i=1:length(range) 51 | dist(ind1(range(i)),i) = -Inf; 52 | dist(ind2(range(i)),i) = -Inf; 53 | dist(ind3(range(i)),i) = -Inf; 54 | end 55 | [~, mx(range)] = max(dist); %predicted word index 56 | end 57 | 58 | val = (ind4 == mx'); %correct predictions 59 | count_tot = count_tot + length(ind1); 60 | correct_tot = correct_tot + sum(val); 61 | disp(['ACCURACY TOP1: ' num2str(mean(val)*100,'%-2.2f') '% (' num2str(sum(val)) '/' num2str(length(val)) ')']); 62 | if j < 6 63 | count_sem = count_sem + length(ind1); 64 | correct_sem = correct_sem + sum(val); 65 | else 66 | count_syn = count_syn + length(ind1); 67 | correct_syn = correct_syn + sum(val); 68 | end 69 | 70 | disp(['Total accuracy: ' num2str(100*correct_tot/count_tot,'%-2.2f') '% Semantic accuracy: ' num2str(100*correct_sem/count_sem,'%-2.2f') '% Syntactic accuracy: ' num2str(100*correct_syn/count_syn,'%-2.2f') '%']); 71 | 72 | end 73 | disp('________________________________________________________________________________'); 74 | disp(['Questions seen/total: ' num2str(100*count_tot/full_count,'%-2.2f') '% (' num2str(count_tot) '/' num2str(full_count) ')']); 75 | disp(['Semantic Accuracy: ' num2str(100*correct_sem/count_sem,'%-2.2f') '% (' num2str(correct_sem) '/' num2str(count_sem) ')']); 76 | disp(['Syntactic Accuracy: ' num2str(100*correct_syn/count_syn,'%-2.2f') '% (' num2str(correct_syn) '/' num2str(count_syn) ')']); 77 | disp(['Total Accuracy: ' num2str(100*correct_tot/count_tot,'%-2.2f') '% (' num2str(correct_tot) '/' num2str(count_tot) ')']); 78 | BB = [100*correct_sem/count_sem 100*correct_syn/count_syn 100*correct_tot/count_tot]; 79 | 80 | end 81 | -------------------------------------------------------------------------------- 
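A note on the `read_and_evaluate` scripts (Matlab earlier, Octave next): the vector dimensionality is not passed in but inferred from the size of `vectors.bin`. The binary vectors file written by `glove` stores a word vector and a context vector for every vocabulary entry, each of length `vector_size + 1` (the extra column is the bias term, which these scripts discard), as 8-byte doubles. The file size is therefore `2 * vocab_size * (vector_size + 1) * 8` bytes, so

    vector_size = file_size / (16 * vocab_size) - 1

which is exactly what `ftell(fid)/16/vocab_size - 1` computes before the `fread` call.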
/KeyExt/RVA/eval/octave/read_and_evaluate_octave.m: -------------------------------------------------------------------------------- 1 | addpath('./eval/octave'); 2 | if(~exist('vocab_file')) 3 | vocab_file = 'vocab.txt'; 4 | end 5 | if(~exist('vectors_file')) 6 | vectors_file = 'vectors.bin'; 7 | end 8 | 9 | fid = fopen(vocab_file, 'r'); 10 | words = textscan(fid, '%s %f'); 11 | fclose(fid); 12 | words = words{1}; 13 | vocab_size = length(words); 14 | global wordMap 15 | 16 | wordMap = struct(); 17 | for i=1:numel(words) 18 | wordMap.(words{i}) = i; 19 | end 20 | 21 | fid = fopen(vectors_file,'r'); 22 | fseek(fid,0,'eof'); 23 | vector_size = ftell(fid)/16/vocab_size - 1; 24 | frewind(fid); 25 | WW = fread(fid, [vector_size+1 2*vocab_size], 'double')'; 26 | fclose(fid); 27 | 28 | W1 = WW(1:vocab_size, 1:vector_size); % word vectors 29 | W2 = WW(vocab_size+1:end, 1:vector_size); % context (tilde) word vectors 30 | 31 | W = W1 + W2; %Evaluate on sum of word vectors 32 | W = bsxfun(@rdivide,W,sqrt(sum(W.*W,2))); %normalize vectors before evaluation 33 | evaluate_vectors_octave(W); 34 | exit 35 | 36 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/python/distance.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | import sys 4 | 5 | def generate(): 6 | parser = argparse.ArgumentParser() 7 | parser.add_argument('--vocab_file', default='vocab.txt', type=str) 8 | parser.add_argument('--vectors_file', default='vectors.txt', type=str) 9 | args = parser.parse_args() 10 | 11 | with open(args.vocab_file, 'r') as f: 12 | words = [x.rstrip().split(' ')[0] for x in f.readlines()] 13 | with open(args.vectors_file, 'r') as f: 14 | vectors = {} 15 | for line in f: 16 | vals = line.rstrip().split(' ') 17 | vectors[vals[0]] = [float(x) for x in vals[1:]] 18 | 19 | vocab_size = len(words) 20 | vocab = {w: idx for idx, w in enumerate(words)} 21 | ivocab = {idx: w for idx, w in enumerate(words)} 22 | 23 | vector_dim = len(vectors[ivocab[0]]) 24 | W = np.zeros((vocab_size, vector_dim)) 25 | for word, v in vectors.items(): 26 | if word == '': 27 | continue 28 | W[vocab[word], :] = v 29 | 30 | # normalize each word vector to unit variance 31 | W_norm = np.zeros(W.shape) 32 | d = (np.sum(W ** 2, 1) ** (0.5)) 33 | W_norm = (W.T / d).T 34 | return (W_norm, vocab, ivocab) 35 | 36 | 37 | def distance(W, vocab, ivocab, input_term): 38 | for idx, term in enumerate(input_term.split(' ')): 39 | if term in vocab: 40 | print('Word: %s Position in vocabulary: %i' % (term, vocab[term])) 41 | if idx == 0: 42 | vec_result = np.copy(W[vocab[term], :]) 43 | else: 44 | vec_result += W[vocab[term], :] 45 | else: 46 | print('Word: %s Out of dictionary!\n' % term) 47 | return 48 | 49 | vec_norm = np.zeros(vec_result.shape) 50 | d = (np.sum(vec_result ** 2,) ** (0.5)) 51 | vec_norm = (vec_result.T / d).T 52 | 53 | dist = np.dot(W, vec_norm.T) 54 | 55 | for term in input_term.split(' '): 56 | index = vocab[term] 57 | dist[index] = -np.Inf 58 | 59 | a = np.argsort(-dist)[:N] 60 | 61 | print("\n Word Cosine distance\n") 62 | print("---------------------------------------------------------\n") 63 | for x in a: 64 | print("%35s\t\t%f\n" % (ivocab[x], dist[x])) 65 | 66 | 67 | if __name__ == "__main__": 68 | N = 100 # number of closest words that will be shown 69 | W, vocab, ivocab = generate() 70 | while True: 71 | input_term = input("\nEnter word or sentence (EXIT to break): ") 72 | if input_term == 'EXIT': 73 | 
break 74 | else: 75 | distance(W, vocab, ivocab, input_term) 76 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/python/evaluate.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | 4 | def main(): 5 | parser = argparse.ArgumentParser() 6 | parser.add_argument('--vocab_file', default='vocab.txt', type=str) 7 | parser.add_argument('--vectors_file', default='vectors.txt', type=str) 8 | args = parser.parse_args() 9 | 10 | with open(args.vocab_file, 'r') as f: 11 | words = [x.rstrip().split(' ')[0] for x in f.readlines()] 12 | with open(args.vectors_file, 'r') as f: 13 | vectors = {} 14 | for line in f: 15 | vals = line.rstrip().split(' ') 16 | vectors[vals[0]] = [float(x) for x in vals[1:]] 17 | 18 | vocab_size = len(words) 19 | vocab = {w: idx for idx, w in enumerate(words)} 20 | ivocab = {idx: w for idx, w in enumerate(words)} 21 | 22 | vector_dim = len(vectors[ivocab[0]]) 23 | W = np.zeros((vocab_size, vector_dim)) 24 | for word, v in vectors.items(): 25 | if word == '': 26 | continue 27 | W[vocab[word], :] = v 28 | 29 | # normalize each word vector to unit length 30 | W_norm = np.zeros(W.shape) 31 | d = (np.sum(W ** 2, 1) ** (0.5)) 32 | W_norm = (W.T / d).T 33 | evaluate_vectors(W_norm, vocab) 34 | 35 | def evaluate_vectors(W, vocab): 36 | """Evaluate the trained word vectors on a variety of tasks""" 37 | 38 | filenames = [ 39 | 'capital-common-countries.txt', 'capital-world.txt', 'currency.txt', 40 | 'city-in-state.txt', 'family.txt', 'gram1-adjective-to-adverb.txt', 41 | 'gram2-opposite.txt', 'gram3-comparative.txt', 'gram4-superlative.txt', 42 | 'gram5-present-participle.txt', 'gram6-nationality-adjective.txt', 43 | 'gram7-past-tense.txt', 'gram8-plural.txt', 'gram9-plural-verbs.txt', 44 | ] 45 | prefix = './eval/question-data/' 46 | 47 | # to avoid memory overflow, could be increased/decreased 48 | # depending on system and vocab size 49 | split_size = 100 50 | 51 | correct_sem = 0; # count correct semantic questions 52 | correct_syn = 0; # count correct syntactic questions 53 | correct_tot = 0 # count correct questions 54 | count_sem = 0; # count all semantic questions 55 | count_syn = 0; # count all syntactic questions 56 | count_tot = 0 # count all questions 57 | full_count = 0 # count all questions, including those with unknown words 58 | 59 | for i in range(len(filenames)): 60 | with open('%s/%s' % (prefix, filenames[i]), 'r') as f: 61 | full_data = [line.rstrip().split(' ') for line in f] 62 | full_count += len(full_data) 63 | data = [x for x in full_data if all(word in vocab for word in x)] 64 | 65 | if len(data) == 0: 66 | print("ERROR: no lines of vocab kept for %s !" 
% filenames[i]) 67 | print("Example missing line:", full_data[0]) 68 | continue 69 | 70 | indices = np.array([[vocab[word] for word in row] for row in data]) 71 | ind1, ind2, ind3, ind4 = indices.T 72 | 73 | predictions = np.zeros((len(indices),)) 74 | num_iter = int(np.ceil(len(indices) / float(split_size))) 75 | for j in range(num_iter): 76 | subset = np.arange(j*split_size, min((j + 1)*split_size, len(ind1))) 77 | 78 | pred_vec = (W[ind2[subset], :] - W[ind1[subset], :] 79 | + W[ind3[subset], :]) 80 | #cosine similarity if input W has been normalized 81 | dist = np.dot(W, pred_vec.T) 82 | 83 | for k in range(len(subset)): 84 | dist[ind1[subset[k]], k] = -np.Inf 85 | dist[ind2[subset[k]], k] = -np.Inf 86 | dist[ind3[subset[k]], k] = -np.Inf 87 | 88 | # predicted word index 89 | predictions[subset] = np.argmax(dist, 0).flatten() 90 | 91 | val = (ind4 == predictions) # correct predictions 92 | count_tot = count_tot + len(ind1) 93 | correct_tot = correct_tot + sum(val) 94 | if i < 5: 95 | count_sem = count_sem + len(ind1) 96 | correct_sem = correct_sem + sum(val) 97 | else: 98 | count_syn = count_syn + len(ind1) 99 | correct_syn = correct_syn + sum(val) 100 | 101 | print("%s:" % filenames[i]) 102 | print('ACCURACY TOP1: %.2f%% (%d/%d)' % 103 | (np.mean(val) * 100, np.sum(val), len(val))) 104 | 105 | print('Questions seen/total: %.2f%% (%d/%d)' % 106 | (100 * count_tot / float(full_count), count_tot, full_count)) 107 | print('Semantic accuracy: %.2f%% (%i/%i)' % 108 | (100 * correct_sem / float(count_sem), correct_sem, count_sem)) 109 | print('Syntactic accuracy: %.2f%% (%i/%i)' % 110 | (100 * correct_syn / float(count_syn), correct_syn, count_syn)) 111 | print('Total accuracy: %.2f%% (%i/%i)' % (100 * correct_tot / float(count_tot), correct_tot, count_tot)) 112 | 113 | 114 | if __name__ == "__main__": 115 | main() 116 | -------------------------------------------------------------------------------- /KeyExt/RVA/eval/python/word_analogy.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import numpy as np 3 | 4 | def generate(): 5 | parser = argparse.ArgumentParser() 6 | parser.add_argument('--vocab_file', default='vocab.txt', type=str) 7 | parser.add_argument('--vectors_file', default='vectors.txt', type=str) 8 | args = parser.parse_args() 9 | 10 | with open(args.vocab_file, 'r') as f: 11 | words = [x.rstrip().split(' ')[0] for x in f.readlines()] 12 | with open(args.vectors_file, 'r') as f: 13 | vectors = {} 14 | for line in f: 15 | vals = line.rstrip().split(' ') 16 | vectors[vals[0]] = [float(x) for x in vals[1:]] 17 | 18 | vocab_size = len(words) 19 | vocab = {w: idx for idx, w in enumerate(words)} 20 | ivocab = {idx: w for idx, w in enumerate(words)} 21 | 22 | vector_dim = len(vectors[ivocab[0]]) 23 | W = np.zeros((vocab_size, vector_dim)) 24 | for word, v in vectors.items(): 25 | if word == '': 26 | continue 27 | W[vocab[word], :] = v 28 | 29 | # normalize each word vector to unit variance 30 | W_norm = np.zeros(W.shape) 31 | d = (np.sum(W ** 2, 1) ** (0.5)) 32 | W_norm = (W.T / d).T 33 | return (W_norm, vocab, ivocab) 34 | 35 | 36 | def distance(W, vocab, ivocab, input_term): 37 | vecs = {} 38 | if len(input_term.split(' ')) < 3: 39 | print("Only %i words were entered.. 
three words are needed at the input to perform the calculation\n" % len(input_term.split(' '))) 40 | return 41 | else: 42 | for idx, term in enumerate(input_term.split(' ')): 43 | if term in vocab: 44 | print('Word: %s Position in vocabulary: %i' % (term, vocab[term])) 45 | vecs[idx] = W[vocab[term], :] 46 | else: 47 | print('Word: %s Out of dictionary!\n' % term) 48 | return 49 | 50 | vec_result = vecs[1] - vecs[0] + vecs[2] 51 | 52 | vec_norm = np.zeros(vec_result.shape) 53 | d = (np.sum(vec_result ** 2,) ** (0.5)) 54 | vec_norm = (vec_result.T / d).T 55 | 56 | dist = np.dot(W, vec_norm.T) 57 | 58 | for term in input_term.split(' '): 59 | index = vocab[term] 60 | dist[index] = -np.Inf 61 | 62 | a = np.argsort(-dist)[:N] 63 | 64 | print("\n Word Cosine distance\n") 65 | print("---------------------------------------------------------\n") 66 | for x in a: 67 | print("%35s\t\t%f\n" % (ivocab[x], dist[x])) 68 | 69 | 70 | if __name__ == "__main__": 71 | N = 100; # number of closest words that will be shown 72 | W, vocab, ivocab = generate() 73 | while True: 74 | input_term = input("\nEnter three words (EXIT to break): ") 75 | if input_term == 'EXIT': 76 | break 77 | else: 78 | distance(W, vocab, ivocab, input_term) 79 | 80 | -------------------------------------------------------------------------------- /KeyExt/RVA/randomization.test.sh: -------------------------------------------------------------------------------- 1 | # Tests for ensuring randomization is being controlled 2 | 3 | make 4 | 5 | if [ ! -e text8 ]; then 6 | if hash wget 2>/dev/null; then 7 | wget http://mattmahoney.net/dc/text8.zip 8 | else 9 | curl -O http://mattmahoney.net/dc/text8.zip 10 | fi 11 | unzip text8.zip 12 | rm text8.zip 13 | fi 14 | 15 | # Global constants 16 | CORPUS=text8 17 | VERBOSE=2 18 | BUILDDIR=build 19 | MEMORY=4.0 20 | VOCAB_MIN_COUNT=20 21 | 22 | # Re-used files 23 | VOCAB_FILE=$(mktemp vocab.test.txt.XXXXXX) 24 | COOCCURRENCE_FILE=$(mktemp cooccurrence.test.bin.XXXXXX) 25 | COOCCURRENCE_SHUF_FILE=$(mktemp cooccurrence_shuf.test.bin.XXXXXX) 26 | 27 | # Make vocab 28 | $BUILDDIR/vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE 29 | 30 | # Make Coocurrences 31 | $BUILDDIR/cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size 5 < $CORPUS > $COOCCURRENCE_FILE 32 | 33 | # Shuffle Coocurrences 34 | $BUILDDIR/shuffle -memory $MEMORY -verbose $VERBOSE -seed 1 < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE 35 | 36 | # Keep track of failure 37 | num_failed=0 38 | 39 | check_exit() { 40 | eval $2 41 | failed=$(( $1 != $? 
)) 42 | num_failed=$(( $num_failed + $failed )) 43 | if [[ $failed -eq 0 ]]; then 44 | echo PASSED 45 | else 46 | echo FAILED 47 | fi 48 | } 49 | 50 | # Test control of random seed in shuffle 51 | printf "\n\n--- TEST SET: Control of random seed in shuffle\n" 52 | TEST_FILE=$(mktemp cooc_shuf.test.bin.XXXXXX) 53 | 54 | printf "\n- TEST: Using the same seed should get the same shuffle\n" 55 | $BUILDDIR/shuffle -memory $MEMORY -verbose 0 -seed 1 < $COOCCURRENCE_FILE > $TEST_FILE 56 | check_exit 0 "cmp --quiet $COOCCURRENCE_SHUF_FILE $TEST_FILE" 57 | 58 | printf "\n- TEST: Changing the seed should change the shuffle\n" 59 | $BUILDDIR/shuffle -memory $MEMORY -verbose 0 -seed 2 < $COOCCURRENCE_FILE > $TEST_FILE 60 | check_exit 1 "cmp --quiet $COOCCURRENCE_SHUF_FILE $TEST_FILE" 61 | 62 | rm $TEST_FILE # Clean up 63 | # --- 64 | 65 | # Control randomization in GloVe 66 | printf "\n\n--- TEST SET: Control of random seed in glove\n" 67 | # Note "-threads" must equal 1 for these to pass, since order in which results come back from individual threads is uncontrolled 68 | BASE_PREFIX=$(mktemp base_vectors.XXXXXX) 69 | TEST_PREFIX=$(mktemp test_vectors.XXXXXX) 70 | 71 | printf "\n- TEST: Reusing seed should give the same vectors\n" 72 | $BUILDDIR/glove -save-file $BASE_PREFIX -threads 1 -input-file $COOCCURRENCE_SHUF_FILE -iter 3 -vector-size 10 -binary 1 -vocab-file $VOCAB_FILE -verbose 0 -seed 1 73 | $BUILDDIR/glove -save-file $TEST_PREFIX -threads 1 -input-file $COOCCURRENCE_SHUF_FILE -iter 3 -vector-size 10 -binary 1 -vocab-file $VOCAB_FILE -verbose 0 -seed 1 74 | check_exit 0 "cmp --quiet $BASE_PREFIX.bin $TEST_PREFIX.bin" 75 | 76 | printf "\n- TEST: Changing seed should change the learned vectors\n" 77 | $BUILDDIR/glove -save-file $TEST_PREFIX -threads 1 -input-file $COOCCURRENCE_SHUF_FILE -iter 3 -vector-size 10 -binary 1 -vocab-file $VOCAB_FILE -verbose 0 -seed 2 78 | check_exit 1 "cmp --quiet $BASE_PREFIX.bin $TEST_PREFIX.bin" 79 | 80 | printf "\n- TEST: Should be able to save/load initial parameters\n" 81 | $BUILDDIR/glove -save-file $BASE_PREFIX -threads 1 -input-file $COOCCURRENCE_SHUF_FILE -iter 3 -vector-size 10 -binary 1 -vocab-file $VOCAB_FILE -verbose 0 -save-init-param 1 82 | $BUILDDIR/glove -save-file $TEST_PREFIX -threads 1 -input-file $COOCCURRENCE_SHUF_FILE -iter 3 -vector-size 10 -binary 1 -vocab-file $VOCAB_FILE -verbose 0 -save-init-param 1 -load-init-param 1 -init-param-file "$BASE_PREFIX.000.bin" 83 | check_exit 0 "cmp --quiet $BASE_PREFIX.000.bin $TEST_PREFIX.000.bin && cmp --quiet $BASE_PREFIX.bin $TEST_PREFIX.bin" 84 | 85 | rm "$BASE_PREFIX.000.bin" "$TEST_PREFIX.000.bin" "$BASE_PREFIX.bin" "$TEST_PREFIX.bin" # Clean up 86 | rm $BASE_PREFIX $TEST_PREFIX 87 | 88 | # ---- 89 | 90 | printf "\n- TEST: Should be able to save/load initial parameters and gradsq\n" 91 | # note: the seed will be randomly assigned and should not matter 92 | $BUILDDIR/glove -save-file $BASE_PREFIX -gradsq-file $BASE_PREFIX.gradsq -threads 1 -input-file $COOCCURRENCE_SHUF_FILE -iter 6 -vector-size 10 -binary 1 -vocab-file $VOCAB_FILE -verbose 0 -checkpoint-every 2 93 | 94 | $BUILDDIR/glove -save-file $TEST_PREFIX -gradsq-file $TEST_PREFIX.gradsq -threads 1 -input-file $COOCCURRENCE_SHUF_FILE -iter 4 -vector-size 10 -binary 1 -vocab-file $VOCAB_FILE -verbose 0 -checkpoint-every 2 -load-init-param 1 -init-param-file "$BASE_PREFIX.002.bin" -load-init-gradsq 1 -init-gradsq-file "$BASE_PREFIX.gradsq.002.bin" 95 | 96 | echo "Compare vectors before & after load gradsq - 2 iterations" 97 | check_exit 0 "cmp 
--quiet $BASE_PREFIX.004.bin $TEST_PREFIX.002.bin" 98 | echo "Compare vectors before & after load gradsq - 4 iterations" 99 | check_exit 0 "cmp --quiet $BASE_PREFIX.006.bin $TEST_PREFIX.004.bin" 100 | echo "Compare vectors before & after load gradsq - final" 101 | check_exit 0 "cmp --quiet $BASE_PREFIX.bin $TEST_PREFIX.bin" 102 | 103 | echo "Compare gradsq before & after load gradsq - 2 iterations" 104 | check_exit 0 "cmp --quiet $BASE_PREFIX.gradsq.004.bin $TEST_PREFIX.gradsq.002.bin" 105 | echo "Compare gradsq before & after load gradsq - 4 iterations" 106 | check_exit 0 "cmp --quiet $BASE_PREFIX.gradsq.006.bin $TEST_PREFIX.gradsq.004.bin" 107 | echo "Compare gradsq before & after load gradsq - final" 108 | check_exit 0 "cmp --quiet $BASE_PREFIX.gradsq.bin $TEST_PREFIX.gradsq.bin" 109 | 110 | echo "Cleaning up files" 111 | check_exit 0 "rm $BASE_PREFIX.002.bin $BASE_PREFIX.004.bin $BASE_PREFIX.006.bin $BASE_PREFIX.bin" 112 | check_exit 0 "rm $BASE_PREFIX.gradsq.002.bin $BASE_PREFIX.gradsq.004.bin $BASE_PREFIX.gradsq.006.bin $BASE_PREFIX.gradsq.bin" 113 | check_exit 0 "rm $TEST_PREFIX.002.bin $TEST_PREFIX.004.bin $TEST_PREFIX.bin" 114 | check_exit 0 "rm $TEST_PREFIX.gradsq.002.bin $TEST_PREFIX.gradsq.004.bin $TEST_PREFIX.gradsq.bin" 115 | check_exit 0 "rm $VOCAB_FILE $COOCCURRENCE_FILE $COOCCURRENCE_SHUF_FILE" 116 | 117 | echo 118 | echo SUMMARY: 119 | if [[ $num_failed -gt 0 ]]; then 120 | echo $num_failed tests failed. 121 | exit 1 122 | else 123 | echo All tests passed. 124 | exit 0 125 | fi 126 | 127 | 128 | -------------------------------------------------------------------------------- /KeyExt/RVA/src/README.md: -------------------------------------------------------------------------------- 1 | ### Package Contents 2 | 3 | To train your own GloVe vectors, first you'll need to prepare your corpus as a single text file with all words separated by one or more spaces or tabs. If your corpus has multiple documents, the documents (only) should be separated by new line characters. Cooccurrence contexts for words do not extend past newline characters. Once you create your corpus, you can train GloVe vectors using the following 4 tools. An example is included in `demo.sh`, which you can modify as necessary. 4 | 5 | The four main tools in this package are: 6 | 7 | #### 1) vocab_count 8 | This tool requires an input corpus that should already consist of whitespace-separated tokens. Use something like the [Stanford Tokenizer](https://nlp.stanford.edu/software/tokenizer.html) first on raw text. From the corpus, it constructs unigram counts from a corpus, and optionally thresholds the resulting vocabulary based on total vocabulary size or minimum frequency count. 9 | 10 | #### 2) cooccur 11 | Constructs word-word cooccurrence statistics from a corpus. The user should supply a vocabulary file, as produced by `vocab_count`, and may specify a variety of parameters, as described by running `./build/cooccur`. 12 | 13 | #### 3) shuffle 14 | Shuffles the binary file of cooccurrence statistics produced by `cooccur`. For large files, the file is automatically split into chunks, each of which is shuffled and stored on disk before being merged and shuffled together. The user may specify a number of parameters, as described by running `./build/shuffle`. 15 | 16 | #### 4) glove 17 | Train the GloVe model on the specified cooccurrence data, which typically will be the output of the `shuffle` tool. 
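Taken together, the four tools form a pipeline; `demo.sh` gives a complete, configurable version of it. As a minimal sketch, with a placeholder `corpus.txt` and illustrative parameter values rather than recommended settings:

```sh
build/vocab_count -min-count 5 -verbose 2 < corpus.txt > vocab.txt
build/cooccur -memory 4.0 -vocab-file vocab.txt -verbose 2 -window-size 10 < corpus.txt > cooccurrence.bin
build/shuffle -memory 4.0 -verbose 2 < cooccurrence.bin > cooccurrence.shuf.bin
build/glove -save-file vectors -threads 8 -input-file cooccurrence.shuf.bin \
    -x-max 100 -iter 15 -vector-size 100 -binary 2 -vocab-file vocab.txt -verbose 2
```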
The user should supply a vocabulary file, as given by `vocab_count`, and may specify a number of other parameters, which are described by running `./build/glove`. 18 | -------------------------------------------------------------------------------- /KeyExt/RVA/src/common.c: -------------------------------------------------------------------------------- 1 | // Common code for cooccur.c, vocab_count.c, 2 | // glove.c and shuffle.c 3 | // 4 | // GloVe: Global Vectors for Word Representation 5 | // Copyright (c) 2014 The Board of Trustees of 6 | // The Leland Stanford Junior University. All Rights Reserved. 7 | // 8 | // Licensed under the Apache License, Version 2.0 (the "License"); 9 | // you may not use this file except in compliance with the License. 10 | // You may obtain a copy of the License at 11 | // 12 | // http://www.apache.org/licenses/LICENSE-2.0 13 | // 14 | // Unless required by applicable law or agreed to in writing, software 15 | // distributed under the License is distributed on an "AS IS" BASIS, 16 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 17 | // See the License for the specific language governing permissions and 18 | // limitations under the License. 19 | // 20 | // 21 | // For more information, bug reports, fixes, contact: 22 | // Jeffrey Pennington (jpennin@stanford.edu) 23 | // Christopher Manning (manning@cs.stanford.edu) 24 | // https://github.com/stanfordnlp/GloVe/ 25 | // GlobalVectors@googlegroups.com 26 | // http://nlp.stanford.edu/projects/glove/ 27 | 28 | #include 29 | #include 30 | #include 31 | #include "common.h" 32 | 33 | #ifdef _MSC_VER 34 | #define STRERROR(ERRNO, BUF, BUFSIZE) strerror_s((BUF), (BUFSIZE), (ERRNO)) 35 | #else 36 | #define STRERROR(ERRNO, BUF, BUFSIZE) strerror_r((ERRNO), (BUF), (BUFSIZE)) 37 | #endif 38 | 39 | /* Efficient string comparison */ 40 | int scmp( char *s1, char *s2 ) { 41 | while (*s1 != '\0' && *s1 == *s2) {s1++; s2++;} 42 | return (*s1 - *s2); 43 | } 44 | 45 | /* Move-to-front hashing and hash function from Hugh Williams, http://www.seg.rmit.edu.au/code/zwh-ipl/ */ 46 | 47 | /* Simple bitwise hash function */ 48 | unsigned int bitwisehash(char *word, int tsize, unsigned int seed) { 49 | char c; 50 | unsigned int h; 51 | h = seed; 52 | for ( ; (c = *word) != '\0'; word++) h ^= ((h << 5) + c + (h >> 2)); 53 | return (unsigned int)((h & 0x7fffffff) % tsize); 54 | } 55 | 56 | /* Create hash table, initialise pointers to NULL */ 57 | HASHREC ** inithashtable() { 58 | int i; 59 | HASHREC **ht; 60 | ht = (HASHREC **) malloc( sizeof(HASHREC *) * TSIZE ); 61 | for (i = 0; i < TSIZE; i++) ht[i] = (HASHREC *) NULL; 62 | return ht; 63 | } 64 | 65 | /* Read word from input stream. Return 1 when encounter '\n' or EOF (but separate from word), 0 otherwise. 66 | Words can be separated by space(s), tab(s), or newline(s). Carriage return characters are just ignored. 67 | (Okay for Windows, but not for Mac OS 9-. Ignored even if by themselves or in words.) 68 | A newline is taken as indicating a new document (contexts won't cross newline). 69 | Argument word array is assumed to be of size MAX_STRING_LENGTH. 70 | words will be truncated if too long. They are truncated with some care so that they 71 | cannot truncate in the middle of a utf-8 character, but 72 | still little to no harm will be done for other encodings like iso-8859-1. 73 | (This function appears identically copied in vocab_count.c and cooccur.c.) 
74 | */ 75 | int get_word(char *word, FILE *fin) { 76 | int i = 0, ch; 77 | for ( ; ; ) { 78 | ch = fgetc(fin); 79 | if (ch == '\r') continue; 80 | if (i == 0 && ((ch == '\n') || (ch == EOF))) { 81 | word[i] = 0; 82 | return 1; 83 | } 84 | if (i == 0 && ((ch == ' ') || (ch == '\t'))) continue; // skip leading space 85 | if ((ch == EOF) || (ch == ' ') || (ch == '\t') || (ch == '\n')) { 86 | if (ch == '\n') ungetc(ch, fin); // return the newline next time as document ender 87 | break; 88 | } 89 | if (i < MAX_STRING_LENGTH - 1) 90 | word[i++] = ch; // don't allow words to exceed MAX_STRING_LENGTH 91 | } 92 | word[i] = 0; //null terminate 93 | // avoid truncation destroying a multibyte UTF-8 char except if only thing on line (so the i > x tests won't overwrite word[0]) 94 | // see https://en.wikipedia.org/wiki/UTF-8#Description 95 | if (i == MAX_STRING_LENGTH - 1 && (word[i-1] & 0x80) == 0x80) { 96 | if ((word[i-1] & 0xC0) == 0xC0) { 97 | word[i-1] = '\0'; 98 | } else if (i > 2 && (word[i-2] & 0xE0) == 0xE0) { 99 | word[i-2] = '\0'; 100 | } else if (i > 3 && (word[i-3] & 0xF8) == 0xF0) { 101 | word[i-3] = '\0'; 102 | } 103 | } 104 | return 0; 105 | } 106 | 107 | int find_arg(char *str, int argc, char **argv) { 108 | int i; 109 | for (i = 1; i < argc; i++) { 110 | if (!scmp(str, argv[i])) { 111 | if (i == argc - 1) { 112 | printf("No argument given for %s\n", str); 113 | exit(1); 114 | } 115 | return i; 116 | } 117 | } 118 | return -1; 119 | } 120 | 121 | void free_table(HASHREC **ht) { 122 | int i; 123 | HASHREC* current; 124 | HASHREC* tmp; 125 | for (i = 0; i < TSIZE; i++) { 126 | current = ht[i]; 127 | while (current != NULL) { 128 | tmp = current; 129 | current = current->next; 130 | free(tmp->word); 131 | free(tmp); 132 | } 133 | } 134 | free(ht); 135 | } 136 | 137 | void free_fid(FILE **fid, const int num) { 138 | int i; 139 | for(i = 0; i < num; i++) { 140 | if(fid[i] != NULL) 141 | fclose(fid[i]); 142 | } 143 | free(fid); 144 | } 145 | 146 | 147 | int log_file_loading_error(char *file_description, char *file_name) { 148 | fprintf(stderr, "Unable to open %s %s.\n", file_description, file_name); 149 | fprintf(stderr, "Errno: %d\n", errno); 150 | char error[MAX_STRING_LENGTH]; 151 | STRERROR(errno, error, MAX_STRING_LENGTH); 152 | fprintf(stderr, "Error description: %s\n", error); 153 | return errno; 154 | } 155 | -------------------------------------------------------------------------------- /KeyExt/RVA/src/common.h: -------------------------------------------------------------------------------- 1 | #ifndef COMMON_H 2 | #define COMMON_H 3 | 4 | // Common code for cooccur.c, vocab_count.c, 5 | // glove.c and shuffle.c 6 | // 7 | // GloVe: Global Vectors for Word Representation 8 | // Copyright (c) 2014 The Board of Trustees of 9 | // The Leland Stanford Junior University. All Rights Reserved. 10 | // 11 | // Licensed under the Apache License, Version 2.0 (the "License"); 12 | // you may not use this file except in compliance with the License. 13 | // You may obtain a copy of the License at 14 | // 15 | // http://www.apache.org/licenses/LICENSE-2.0 16 | // 17 | // Unless required by applicable law or agreed to in writing, software 18 | // distributed under the License is distributed on an "AS IS" BASIS, 19 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 20 | // See the License for the specific language governing permissions and 21 | // limitations under the License. 
22 | // 23 | // 24 | // For more information, bug reports, fixes, contact: 25 | // Jeffrey Pennington (jpennin@stanford.edu) 26 | // Christopher Manning (manning@cs.stanford.edu) 27 | // https://github.com/stanfordnlp/GloVe/ 28 | // GlobalVectors@googlegroups.com 29 | // http://nlp.stanford.edu/projects/glove/ 30 | 31 | #include 32 | 33 | #define MAX_STRING_LENGTH 1000 34 | #define TSIZE 1048576 35 | #define SEED 1159241 36 | #define HASHFN bitwisehash 37 | 38 | typedef double real; 39 | typedef struct cooccur_rec { 40 | int word1; 41 | int word2; 42 | real val; 43 | } CREC; 44 | typedef struct hashrec { 45 | char *word; 46 | long long num; //count or id 47 | struct hashrec *next; 48 | } HASHREC; 49 | 50 | 51 | int scmp( char *s1, char *s2 ); 52 | unsigned int bitwisehash(char *word, int tsize, unsigned int seed); 53 | HASHREC **inithashtable(); 54 | int get_word(char *word, FILE *fin); 55 | void free_table(HASHREC **ht); 56 | int find_arg(char *str, int argc, char **argv); 57 | void free_fid(FILE **fid, const int num); 58 | 59 | // logs errors when loading files. call after a failed load 60 | int log_file_loading_error(char *file_description, char *file_name); 61 | 62 | #endif /* COMMON_H */ 63 | 64 | -------------------------------------------------------------------------------- /KeyExt/RVA/src/shuffle.c: -------------------------------------------------------------------------------- 1 | // Tool to shuffle entries of word-word cooccurrence files 2 | // 3 | // Copyright (c) 2014 The Board of Trustees of 4 | // The Leland Stanford Junior University. All Rights Reserved. 5 | // 6 | // Licensed under the Apache License, Version 2.0 (the "License"); 7 | // you may not use this file except in compliance with the License. 8 | // You may obtain a copy of the License at 9 | // 10 | // http://www.apache.org/licenses/LICENSE-2.0 11 | // 12 | // Unless required by applicable law or agreed to in writing, software 13 | // distributed under the License is distributed on an "AS IS" BASIS, 14 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 15 | // See the License for the specific language governing permissions and 16 | // limitations under the License. 
17 | // 18 | // 19 | // For more information, bug reports, fixes, contact: 20 | // Jeffrey Pennington (jpennin@stanford.edu) 21 | // GlobalVectors@googlegroups.com 22 | // http://nlp.stanford.edu/projects/glove/ 23 | 24 | #include 25 | #include 26 | #include 27 | #include 28 | #include "common.h" 29 | 30 | 31 | static const long LRAND_MAX = ((long) RAND_MAX + 2) * (long)RAND_MAX; 32 | 33 | int verbose = 2; // 0, 1, or 2 34 | int seed = 0; 35 | long long array_size = 2000000; // size of chunks to shuffle individually 36 | char *file_head; // temporary file string 37 | real memory_limit = 2.0; // soft limit, in gigabytes 38 | 39 | /* Generate uniformly distributed random long ints */ 40 | static long rand_long(long n) { 41 | long limit = LRAND_MAX - LRAND_MAX % n; 42 | long rnd; 43 | do { 44 | rnd = ((long)RAND_MAX + 1) * (long)rand() + (long)rand(); 45 | } while (rnd >= limit); 46 | return rnd % n; 47 | } 48 | 49 | /* Write contents of array to binary file */ 50 | int write_chunk(CREC *array, long size, FILE *fout) { 51 | long i = 0; 52 | for (i = 0; i < size; i++) fwrite(&array[i], sizeof(CREC), 1, fout); 53 | return 0; 54 | } 55 | 56 | /* Fisher-Yates shuffle */ 57 | void shuffle(CREC *array, long n) { 58 | long i, j; 59 | CREC tmp; 60 | for (i = n - 1; i > 0; i--) { 61 | j = rand_long(i + 1); 62 | tmp = array[j]; 63 | array[j] = array[i]; 64 | array[i] = tmp; 65 | } 66 | } 67 | 68 | /* Merge shuffled temporary files; doesn't necessarily produce a perfect shuffle, but good enough */ 69 | int shuffle_merge(int num) { 70 | long i, j, k, l = 0; 71 | int fidcounter = 0; 72 | CREC *array; 73 | char filename[MAX_STRING_LENGTH]; 74 | FILE **fid, *fout = stdout; 75 | 76 | array = malloc(sizeof(CREC) * array_size); 77 | fid = calloc(num, sizeof(FILE)); 78 | for (fidcounter = 0; fidcounter < num; fidcounter++) { //num = number of temporary files to merge 79 | sprintf(filename,"%s_%04d.bin",file_head, fidcounter); 80 | fid[fidcounter] = fopen(filename, "rb"); 81 | if (fid[fidcounter] == NULL) { 82 | log_file_loading_error("temp file", filename); 83 | free(array); 84 | free_fid(fid, num); 85 | return 1; 86 | } 87 | } 88 | if (verbose > 0) fprintf(stderr, "Merging temp files: processed %ld lines.", l); 89 | 90 | while (1) { //Loop until EOF in all files 91 | i = 0; 92 | //Read at most array_size values into array, roughly array_size/num from each temp file 93 | for (j = 0; j < num; j++) { 94 | if (feof(fid[j])) continue; 95 | for (k = 0; k < array_size / num; k++){ 96 | fread(&array[i], sizeof(CREC), 1, fid[j]); 97 | if (feof(fid[j])) break; 98 | i++; 99 | } 100 | } 101 | if (i == 0) break; 102 | l += i; 103 | shuffle(array, i-1); // Shuffles lines between temp files 104 | write_chunk(array,i,fout); 105 | if (verbose > 0) fprintf(stderr, "\033[31G%ld lines.", l); 106 | } 107 | fprintf(stderr, "\033[0GMerging temp files: processed %ld lines.", l); 108 | for (fidcounter = 0; fidcounter < num; fidcounter++) { 109 | fclose(fid[fidcounter]); 110 | sprintf(filename,"%s_%04d.bin",file_head, fidcounter); 111 | remove(filename); 112 | } 113 | fprintf(stderr, "\n\n"); 114 | free(array); 115 | free(fid); 116 | return 0; 117 | } 118 | 119 | /* Shuffle large input stream by splitting into chunks */ 120 | int shuffle_by_chunks() { 121 | if (seed == 0) { 122 | seed = time(0); 123 | } 124 | fprintf(stderr, "Using random seed %d\n", seed); 125 | srand(seed); 126 | long i = 0, l = 0; 127 | int fidcounter = 0; 128 | char filename[MAX_STRING_LENGTH]; 129 | CREC *array; 130 | FILE *fin = stdin, *fid; 131 | array = 
malloc(sizeof(CREC) * array_size); 132 | 133 | fprintf(stderr,"SHUFFLING COOCCURRENCES\n"); 134 | if (verbose > 0) fprintf(stderr,"array size: %lld\n", array_size); 135 | sprintf(filename,"%s_%04d.bin",file_head, fidcounter); 136 | fid = fopen(filename,"w"); 137 | if (fid == NULL) { 138 | log_file_loading_error("file", filename); 139 | free(array); 140 | return 1; 141 | } 142 | if (verbose > 1) fprintf(stderr, "Shuffling by chunks: processed 0 lines."); 143 | 144 | while (1) { //Continue until EOF 145 | if (i >= array_size) {// If array is full, shuffle it and save to temporary file 146 | shuffle(array, i-2); 147 | l += i; 148 | if (verbose > 1) fprintf(stderr, "\033[22Gprocessed %ld lines.", l); 149 | write_chunk(array,i,fid); 150 | fclose(fid); 151 | fidcounter++; 152 | sprintf(filename,"%s_%04d.bin",file_head, fidcounter); 153 | fid = fopen(filename,"w"); 154 | if (fid == NULL) { 155 | log_file_loading_error("file", filename); 156 | free(array); 157 | return 1; 158 | } 159 | i = 0; 160 | } 161 | fread(&array[i], sizeof(CREC), 1, fin); 162 | if (feof(fin)) break; 163 | i++; 164 | } 165 | shuffle(array, i-2); //Last chunk may be smaller than array_size 166 | write_chunk(array,i,fid); 167 | l += i; 168 | if (verbose > 1) fprintf(stderr, "\033[22Gprocessed %ld lines.\n", l); 169 | if (verbose > 1) fprintf(stderr, "Wrote %d temporary file(s).\n", fidcounter + 1); 170 | fclose(fid); 171 | free(array); 172 | return shuffle_merge(fidcounter + 1); // Merge and shuffle together temporary files 173 | } 174 | 175 | int main(int argc, char **argv) { 176 | int i; 177 | 178 | if (argc == 2 && 179 | (!scmp(argv[1], "-h") || !scmp(argv[1], "-help") || !scmp(argv[1], "--help"))) { 180 | printf("Tool to shuffle entries of word-word cooccurrence files\n"); 181 | printf("Author: Jeffrey Pennington (jpennin@stanford.edu)\n\n"); 182 | printf("Usage options:\n"); 183 | printf("\t-verbose \n"); 184 | printf("\t\tSet verbosity: 0, 1, or 2 (default)\n"); 185 | printf("\t-memory \n"); 186 | printf("\t\tSoft limit for memory consumption, in GB; default 4.0\n"); 187 | printf("\t-array-size \n"); 188 | printf("\t\tLimit to length the buffer which stores chunks of data to shuffle before writing to disk. \n\t\tThis value overrides that which is automatically produced by '-memory'.\n"); 189 | printf("\t-temp-file \n"); 190 | printf("\t\tFilename, excluding extension, for temporary files; default temp_shuffle\n"); 191 | printf("\t-seed \n"); 192 | printf("\t\tRandom seed to use. 
If not set, will be randomized using current time."); 193 | printf("\nExample usage: (assuming 'cooccurrence.bin' has been produced by 'coccur')\n"); 194 | printf("./shuffle -verbose 2 -memory 8.0 < cooccurrence.bin > cooccurrence.shuf.bin\n"); 195 | return 0; 196 | } 197 | 198 | file_head = malloc(sizeof(char) * MAX_STRING_LENGTH); 199 | if ((i = find_arg((char *)"-verbose", argc, argv)) > 0) verbose = atoi(argv[i + 1]); 200 | if ((i = find_arg((char *)"-temp-file", argc, argv)) > 0) strcpy(file_head, argv[i + 1]); 201 | else strcpy(file_head, (char *)"temp_shuffle"); 202 | if ((i = find_arg((char *)"-memory", argc, argv)) > 0) memory_limit = atof(argv[i + 1]); 203 | array_size = (long long) (0.95 * (real)memory_limit * 1073741824/(sizeof(CREC))); 204 | if ((i = find_arg((char *)"-array-size", argc, argv)) > 0) array_size = atoll(argv[i + 1]); 205 | if ((i = find_arg((char *)"-seed", argc, argv)) > 0) seed = atoi(argv[i + 1]); 206 | const int returned_value = shuffle_by_chunks(); 207 | free(file_head); 208 | return returned_value; 209 | } 210 | 211 | -------------------------------------------------------------------------------- /KeyExt/RVA/src/vocab_count.c: -------------------------------------------------------------------------------- 1 | // Tool to extract unigram counts 2 | // 3 | // GloVe: Global Vectors for Word Representation 4 | // Copyright (c) 2014 The Board of Trustees of 5 | // The Leland Stanford Junior University. All Rights Reserved. 6 | // 7 | // Licensed under the Apache License, Version 2.0 (the "License"); 8 | // you may not use this file except in compliance with the License. 9 | // You may obtain a copy of the License at 10 | // 11 | // http://www.apache.org/licenses/LICENSE-2.0 12 | // 13 | // Unless required by applicable law or agreed to in writing, software 14 | // distributed under the License is distributed on an "AS IS" BASIS, 15 | // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 16 | // See the License for the specific language governing permissions and 17 | // limitations under the License. 18 | // 19 | // 20 | // For more information, bug reports, fixes, contact: 21 | // Jeffrey Pennington (jpennin@stanford.edu) 22 | // Christopher Manning (manning@cs.stanford.edu) 23 | // https://github.com/stanfordnlp/GloVe/ 24 | // GlobalVectors@googlegroups.com 25 | // http://nlp.stanford.edu/projects/glove/ 26 | 27 | #include 28 | #include 29 | #include 30 | #include "common.h" 31 | 32 | typedef struct vocabulary { 33 | char *word; 34 | long long count; 35 | } VOCAB; 36 | 37 | int verbose = 2; // 0, 1, or 2 38 | long long min_count = 1; // min occurrences for inclusion in vocab 39 | long long max_vocab = 0; // max_vocab = 0 for no limit 40 | 41 | 42 | /* Vocab frequency comparison; break ties alphabetically */ 43 | int CompareVocabTie(const void *a, const void *b) { 44 | long long c; 45 | if ( (c = ((VOCAB *) b)->count - ((VOCAB *) a)->count) != 0) return ( c > 0 ? 1 : -1 ); 46 | else return (scmp(((VOCAB *) a)->word,((VOCAB *) b)->word)); 47 | 48 | } 49 | 50 | /* Vocab frequency comparison; no tie-breaker */ 51 | int CompareVocab(const void *a, const void *b) { 52 | long long c; 53 | if ( (c = ((VOCAB *) b)->count - ((VOCAB *) a)->count) != 0) return ( c > 0 ? 
1 : -1 ); 54 | else return 0; 55 | } 56 | 57 | /* Search hash table for given string, insert if not found */ 58 | void hashinsert(HASHREC **ht, char *w) { 59 | HASHREC *htmp, *hprv; 60 | unsigned int hval = HASHFN(w, TSIZE, SEED); 61 | 62 | for (hprv = NULL, htmp = ht[hval]; htmp != NULL && scmp(htmp->word, w) != 0; hprv = htmp, htmp = htmp->next); 63 | if (htmp == NULL) { 64 | htmp = (HASHREC *) malloc( sizeof(HASHREC) ); 65 | htmp->word = (char *) malloc( strlen(w) + 1 ); 66 | strcpy(htmp->word, w); 67 | htmp->num = 1; 68 | htmp->next = NULL; 69 | if ( hprv==NULL ) 70 | ht[hval] = htmp; 71 | else 72 | hprv->next = htmp; 73 | } 74 | else { 75 | /* new records are not moved to front */ 76 | htmp->num++; 77 | if (hprv != NULL) { 78 | /* move to front on access */ 79 | hprv->next = htmp->next; 80 | htmp->next = ht[hval]; 81 | ht[hval] = htmp; 82 | } 83 | } 84 | return; 85 | } 86 | 87 | int get_counts() { 88 | long long i = 0, j = 0, vocab_size = 12500; 89 | // char format[20]; 90 | char str[MAX_STRING_LENGTH + 1]; 91 | HASHREC **vocab_hash = inithashtable(); 92 | HASHREC *htmp; 93 | VOCAB *vocab; 94 | FILE *fid = stdin; 95 | 96 | fprintf(stderr, "BUILDING VOCABULARY\n"); 97 | if (verbose > 1) fprintf(stderr, "Processed %lld tokens.", i); 98 | // sprintf(format,"%%%ds",MAX_STRING_LENGTH); 99 | while ( ! feof(fid)) { 100 | // Insert all tokens into hashtable 101 | int nl = get_word(str, fid); 102 | if (nl) continue; // just a newline marker or feof 103 | if (strcmp(str, "<unk>") == 0) { 104 | fprintf(stderr, "\nError, <unk> vector found in corpus.\nPlease remove <unk>s from your corpus (e.g. cat text8 | sed -e 's/<unk>/<raw_unk>/g' > text8.new)"); 105 | free_table(vocab_hash); 106 | return 1; 107 | } 108 | hashinsert(vocab_hash, str); 109 | if (((++i)%100000) == 0) if (verbose > 1) fprintf(stderr,"\033[11G%lld tokens.", i); 110 | } 111 | if (verbose > 1) fprintf(stderr, "\033[0GProcessed %lld tokens.\n", i); 112 | vocab = malloc(sizeof(VOCAB) * vocab_size); 113 | for (i = 0; i < TSIZE; i++) { // Migrate vocab to array 114 | htmp = vocab_hash[i]; 115 | while (htmp != NULL) { 116 | vocab[j].word = htmp->word; 117 | vocab[j].count = htmp->num; 118 | j++; 119 | if (j>=vocab_size) { 120 | vocab_size += 2500; 121 | vocab = (VOCAB *)realloc(vocab, sizeof(VOCAB) * vocab_size); 122 | } 123 | htmp = htmp->next; 124 | } 125 | } 126 | if (verbose > 1) fprintf(stderr, "Counted %lld unique words.\n", j); 127 | if (max_vocab > 0 && max_vocab < j) 128 | // If the vocabulary exceeds limit, first sort full vocab by frequency without alphabetical tie-breaks.
129 | // This results in pseudo-random ordering for words with same frequency, so that when truncated, the words span whole alphabet 130 | qsort(vocab, j, sizeof(VOCAB), CompareVocab); 131 | else max_vocab = j; 132 | qsort(vocab, max_vocab, sizeof(VOCAB), CompareVocabTie); //After (possibly) truncating, sort (possibly again), breaking ties alphabetically 133 | 134 | for (i = 0; i < max_vocab; i++) { 135 | if (vocab[i].count < min_count) { // If a minimum frequency cutoff exists, truncate vocabulary 136 | if (verbose > 0) fprintf(stderr, "Truncating vocabulary at min count %lld.\n",min_count); 137 | break; 138 | } 139 | printf("%s %lld\n",vocab[i].word,vocab[i].count); 140 | } 141 | 142 | if (i == max_vocab && max_vocab < j) if (verbose > 0) fprintf(stderr, "Truncating vocabulary at size %lld.\n", max_vocab); 143 | fprintf(stderr, "Using vocabulary of size %lld.\n\n", i); 144 | free_table(vocab_hash); 145 | free(vocab); 146 | return 0; 147 | } 148 | 149 | int main(int argc, char **argv) { 150 | if (argc == 2 && 151 | (!scmp(argv[1], "-h") || !scmp(argv[1], "-help") || !scmp(argv[1], "--help"))) { 152 | printf("Simple tool to extract unigram counts\n"); 153 | printf("Author: Jeffrey Pennington (jpennin@stanford.edu)\n\n"); 154 | printf("Usage options:\n"); 155 | printf("\t-verbose <int>\n"); 156 | printf("\t\tSet verbosity: 0, 1, or 2 (default)\n"); 157 | printf("\t-max-vocab <int>\n"); 158 | printf("\t\tUpper bound on vocabulary size, i.e. keep the <int> most frequent words. The minimum frequency words are randomly sampled so as to obtain an even distribution over the alphabet.\n"); 159 | printf("\t-min-count <int>\n"); 160 | printf("\t\tLower limit such that words which occur fewer than <int> times are discarded.\n"); 161 | printf("\nExample usage:\n"); 162 | printf("./vocab_count -verbose 2 -max-vocab 100000 -min-count 10 < corpus.txt > vocab.txt\n"); 163 | return 0; 164 | } 165 | 166 | int i; 167 | if ((i = find_arg((char *)"-verbose", argc, argv)) > 0) verbose = atoi(argv[i + 1]); 168 | if ((i = find_arg((char *)"-max-vocab", argc, argv)) > 0) max_vocab = atoll(argv[i + 1]); 169 | if ((i = find_arg((char *)"-min-count", argc, argv)) > 0) min_count = atoll(argv[i + 1]); 170 | return get_counts(); 171 | } 172 | 173 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/README.md: -------------------------------------------------------------------------------- 1 | # SIFRank 2 | 3 | This directory contains the modified code for the [SIFRank](https://github.com/sunyilgdx/SIFRank) approach. 4 | 5 | ## Modified files 6 | The following files were modified in place, so as to remove the hardcoded dataset paths, 7 | and to ensure that the approach runs in CPU mode. 8 | 9 | * main.py 10 | * embeddings.sent_emb_sif.py 11 | * embeddings.word_emb_elmo.py 12 | 13 | ## Setup 14 | Follow the instructions from the original repo and run `pip install -r requirements.txt`. 15 | Afterwards, replace the files with the modified ones. 16 | In `main.py`, `base_path` and `exec_path` need to be set to the dataset directory and the local project path, respectively. 17 | In `sent_emb_sif`, `weightfile_pretrain` and `weightfile_finetune` need to be set to the corresponding files under the local project path. 18 | In `word_emb_elmo`, `options_file` and `weight_file` need to be set similarly. 19 | If you wish to run the `benchmark()` function, you also need to set `output_path` in `main.py`.
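A minimal sketch of these settings is shown below. All paths are illustrative placeholders (not the values used in our runs) and must be adapted to your local setup; the ELMo weights file is downloaded separately from https://allennlp.org/elmo.

```python
# main.py (illustrative placeholder paths)
base_path = r'..\datasets'    # directory that holds the evaluation datasets
exec_path = r'C:\SIFRank'     # local project path
output_path = r'..\output'    # only needed for benchmark()

# embeddings/sent_emb_sif.py (example vocabulary files; adjust to your copies)
weightfile_pretrain = r'C:\SIFRank\auxiliary_data\enwiki_vocab_min200.txt'
weightfile_finetune = r'C:\SIFRank\auxiliary_data\inspec_vocab.txt'

# embeddings/word_emb_elmo.py (options file ships with the repo, weights are downloaded)
options_file = r'C:\SIFRank\auxiliary_data\elmo_2x4096_512_2048cnn_2xhighway_options.json'
weight_file = r'C:\SIFRank\auxiliary_data\elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5'
```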
20 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/auxiliary_data/__init__.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/12/19 5 | 6 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/auxiliary_data/elmo_2x4096_512_2048cnn_2xhighway_options.json: -------------------------------------------------------------------------------- 1 | {"lstm": {"use_skip_connections": true, "projection_dim": 512, "cell_clip": 3, "proj_clip": 3, "dim": 4096, "n_layers": 2}, "char_cnn": {"activation": "relu", "filters": [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]], "n_highway": 2, "embedding": {"dim": 16}, "n_characters": 262, "max_characters_per_token": 50}} 2 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/embeddings/__init__.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/12/19 5 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/embeddings/word_emb_bert.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/7/29 5 | 6 | from bert_serving.client import BertClient 7 | import numpy as np 8 | class WordEmbeddings(): 9 | """ 10 | Concrete class of @EmbeddingDistributor using ELMo 11 | https://allennlp.org/elmo 12 | 13 | """ 14 | 15 | def __init__(self,N=768): 16 | 17 | self.bert = BertClient() 18 | self.N = N 19 | 20 | def get_tokenized_words_embeddings(self, sents_tokened): 21 | """ 22 | @see EmbeddingDistributor 23 | :param tokenized_sents: list of tokenized words string (sentences/phrases) 24 | :return: ndarray with shape (len(sents), dimension of embeddings) 25 | """ 26 | bert_embeddings=[] 27 | for i in range(0, len(sents_tokened)): 28 | length = len(sents_tokened[i]) 29 | b_e = np.zeros((1, length, self.N)) 30 | b_e[0]=self.bert.encode(sents_tokened[i]) 31 | bert_embeddings.append(b_e) 32 | 33 | return np.array( bert_embeddings) 34 | 35 | 36 | if __name__ == '__main__': 37 | Bert=WordEmbeddings() 38 | sent_tokens=[['I',"love","Rock","and","R","!"],['I',"love","Rock","and","R","!"]] 39 | embs=Bert.get_tokenized_words_embeddings(sent_tokens) 40 | print(embs) 41 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/embeddings/word_emb_elmo.py: -------------------------------------------------------------------------------- 1 | #! 
/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/6/19 5 | from allennlp.commands.elmo import ElmoEmbedder 6 | 7 | class WordEmbeddings(): 8 | """ 9 | ELMo 10 | https://allennlp.org/elmo 11 | 12 | """ 13 | 14 | def __init__(self, 15 | options_file="../auxiliary_data/elmo_2x4096_512_2048cnn_2xhighway_options.json", 16 | weight_file="../auxiliary_data/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5", cuda_device=0): 17 | self.cuda_device=cuda_device 18 | self.elmo = ElmoEmbedder(options_file, weight_file,cuda_device=self.cuda_device) 19 | 20 | def get_tokenized_words_embeddings(self, sents_tokened): 21 | """ 22 | @see EmbeddingDistributor 23 | :param tokenized_sents: list of tokenized words string (sentences/phrases) 24 | :return: ndarray with shape (len(sents), dimension of embeddings) 25 | """ 26 | 27 | elmo_embedding, elmo_mask = self.elmo.batch_to_embeddings(sents_tokened) 28 | if(self.cuda_device>-2): 29 | return elmo_embedding.cpu(), elmo_mask.cpu() 30 | else: 31 | return elmo_embedding, elmo_mask 32 | 33 | 34 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/eval/sifrank_eval.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/6/25 5 | 6 | import nltk 7 | from embeddings import sent_emb_sif, word_emb_elmo 8 | from model.method import SIFRank, SIFRank_plus 9 | from util import fileIO 10 | from stanfordcorenlp import StanfordCoreNLP 11 | import time 12 | 13 | def get_PRF(num_c, num_e, num_s): 14 | F1 = 0.0 15 | P = float(num_c) / float(num_e) 16 | R = float(num_c) / float(num_s) 17 | if (P + R == 0.0): 18 | F1 = 0 19 | else: 20 | F1 = 2 * P * R / (P + R) 21 | return P, R, F1 22 | 23 | 24 | def print_PRF(P, R, F1, N): 25 | 26 | print("\nN=" + str(N), end="\n") 27 | print("P=" + str(P), end="\n") 28 | print("R=" + str(R), end="\n") 29 | print("F1=" + str(F1)) 30 | return 0 31 | 32 | 33 | time_start = time.time() 34 | 35 | P = R = F1 = 0.0 36 | num_c_5 = num_c_10 = num_c_15 = 0 37 | num_e_5 = num_e_10 = num_e_15 = 0 38 | num_s = 0 39 | lamda = 0.0 40 | 41 | database1 = "Inspec" 42 | database2 = "Duc2001" 43 | database3 = "Semeval2017" 44 | 45 | database = database1 46 | 47 | if(database == "Inspec"): 48 | data, labels = fileIO.get_inspec_data() 49 | lamda = 0.6 50 | elmo_layers_weight = [0.0, 1.0, 0.0] 51 | elif(database == "Duc2001"): 52 | data, labels = fileIO.get_duc2001_data() 53 | lamda = 1.0 54 | elmo_layers_weight = [1.0, 0.0, 0.0] 55 | else: 56 | data, labels = fileIO.get_semeval2017_data() 57 | lamda = 0.6 58 | elmo_layers_weight = [1.0, 0.0, 0.0] 59 | 60 | #download from https://allennlp.org/elmo 61 | options_file = "../auxiliary_data/elmo_2x4096_512_2048cnn_2xhighway_options.json" 62 | weight_file = "../auxiliary_data/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5" 63 | 64 | porter = nltk.PorterStemmer()#please download nltk 65 | ELMO = word_emb_elmo.WordEmbeddings(options_file, weight_file, cuda_device=0) 66 | SIF = sent_emb_sif.SentEmbeddings(ELMO, lamda=lamda, database=database) 67 | en_model = StanfordCoreNLP(r'E:\Python_Files\stanford-corenlp-full-2018-02-27',quiet=True)#download from https://stanfordnlp.github.io/CoreNLP/ 68 | 69 | try: 70 | for key, data in data.items(): 71 | 72 | lables = labels[key] 73 | lables_stemed = [] 74 | 75 | for lable in lables: 76 | tokens = lable.split() 77 | lables_stemed.append(' '.join(porter.stem(t) for t in tokens)) 78 
| 79 | print(key) 80 | 81 | dist_sorted = SIFRank(data, SIF, en_model, elmo_layers_weight=elmo_layers_weight,if_DS=True,if_EA=True) 82 | # dist_sorted = SIFRank_plus(data, SIF, en_model, elmo_layers_weight=elmo_layers_weight) 83 | 84 | j = 0 85 | for temp in dist_sorted[0:15]: 86 | tokens = temp[0].split() 87 | tt = ' '.join(porter.stem(t) for t in tokens) 88 | if (tt in lables_stemed or temp[0] in labels[key]): 89 | if (j < 5): 90 | num_c_5 += 1 91 | num_c_10 += 1 92 | num_c_15 += 1 93 | 94 | elif (j < 10 and j >= 5): 95 | num_c_10 += 1 96 | num_c_15 += 1 97 | 98 | elif (j < 15 and j >= 10): 99 | num_c_15 += 1 100 | j += 1 101 | 102 | if (len(dist_sorted[0:5]) == 5): 103 | num_e_5 += 5 104 | else: 105 | num_e_5 += len(dist_sorted[0:5]) 106 | 107 | if (len(dist_sorted[0:10]) == 10): 108 | num_e_10 += 10 109 | else: 110 | num_e_10 += len(dist_sorted[0:10]) 111 | 112 | if (len(dist_sorted[0:15]) == 15): 113 | num_e_15 += 15 114 | else: 115 | num_e_15 += len(dist_sorted[0:15]) 116 | 117 | num_s += len(labels[key]) 118 | 119 | en_model.close() 120 | p, r, f = get_PRF(num_c_5, num_e_5, num_s) 121 | print_PRF(p, r, f, 5) 122 | p, r, f = get_PRF(num_c_10, num_e_10, num_s) 123 | print_PRF(p, r, f, 10) 124 | p, r, f = get_PRF(num_c_15, num_e_15, num_s) 125 | print_PRF(p, r, f, 15) 126 | 127 | 128 | except ValueError: 129 | en_model.close() 130 | en_model.close() 131 | time_end = time.time() 132 | print('totally cost', time_end - time_start) 133 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/model/__init__.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/12/19 5 | 6 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/model/extractor.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/6/19 5 | import nltk 6 | from model import input_representation 7 | 8 | #GRAMMAR1 is the general way to extract NPs 9 | 10 | GRAMMAR1 = """ NP: 11 | {<NN.*|JJ>*<NN.*>} # Adjective(s)(optional) + Noun(s)""" 12 | 13 | GRAMMAR2 = """ NP: 14 | {<JJ|VBG>*<NN.*>{0,3}} # Adjective(s)(optional) + Noun(s)""" 15 | 16 | GRAMMAR3 = """ NP: 17 | {<NN.*|JJ|VBG|VBN>*<NN.*>} # Adjective(s)(optional) + Noun(s)""" 18 | 19 | 20 | def extract_candidates(tokens_tagged, no_subset=False): 21 | """ 22 | Based on part of speech return a list of candidate phrases 23 | :param text_obj: Input text Representation see @InputTextObj 24 | :param no_subset: if true won't put a candidate which is the subset of another candidate 25 | :return keyphrase_candidate: list of list of candidate phrases: [tuple(string,tuple(start_index,end_index))] 26 | """ 27 | np_parser = nltk.RegexpParser(GRAMMAR1) # Noun phrase parser 28 | keyphrase_candidate = [] 29 | np_pos_tag_tokens = np_parser.parse(tokens_tagged) 30 | count = 0 31 | for token in np_pos_tag_tokens: 32 | if (isinstance(token, nltk.tree.Tree) and token._label == "NP"): 33 | np = ' '.join(word for word, tag in token.leaves()) 34 | length = len(token.leaves()) 35 | start_end = (count, count + length) 36 | count += length 37 | keyphrase_candidate.append((np, start_end)) 38 | 39 | else: 40 | count += 1 41 | 42 | return keyphrase_candidate 43 | 44 | # if __name__ == '__main__': 45 | # #This is an example.
46 | # sent17 = "NuVox shows staying power with new cash, new market Who says you can't raise cash in today's telecom market? NuVox Communications positions itself for the long run with $78.5 million in funding and a new credit facility" 47 | # sent10 = "This paper deals with two questions: Does social capital determine innovation in manufacturing firms? If it is the case, to what extent? To deal with these questions, we review the literature on innovation in order to see how social capital came to be added to the other forms of capital as an explanatory variable of innovation. In doing so, we have been led to follow the dominating view of the literature on social capital and innovation which claims that social capital cannot be captured through a single indicator, but that it actually takes many different forms that must be accounted for. Therefore, to the traditional explanatory variables of innovation, we have added five forms of structural social capital (business network assets, information network assets, research network assets, participation assets, and relational assets) and one form of cognitive social capital (reciprocal trust). In a context where empirical investigations regarding the relations between social capital and innovation are still scanty, this paper makes contributions to the advancement of knowledge in providing new evidence regarding the impact and the extent of social capital on innovation at the two decisionmaking stages considered in this study" 48 | # 49 | # input=input_representation.InputTextObj(sent10,is_sectioned=True,database="Inspec") 50 | # keyphrase_candidate= extract_candidates(input) 51 | # for kc in keyphrase_candidate: 52 | # print(kc) -------------------------------------------------------------------------------- /KeyExt/SIFRank/model/input_representation.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/6/19 5 | 6 | from model import extractor 7 | from nltk.corpus import stopwords 8 | stopword_dict = set(stopwords.words('english')) 9 | # from stanfordcorenlp import StanfordCoreNLP 10 | # en_model = StanfordCoreNLP(r'E:\Python_Files\stanford-corenlp-full-2018-02-27',quiet=True) 11 | class InputTextObj: 12 | """Represent the input text in which we want to extract keyphrases""" 13 | 14 | def __init__(self, en_model, text=""): 15 | """ 16 | :param is_sectioned: If we want to section the text. 17 | :param en_model: the pipeline of tokenization and POS-tagger 18 | :param considered_tags: The POSs we want to keep 19 | """ 20 | self.considered_tags = {'NN', 'NNS', 'NNP', 'NNPS', 'JJ'} 21 | 22 | self.tokens = [] 23 | self.tokens_tagged = [] 24 | self.tokens = en_model.word_tokenize(text) 25 | self.tokens_tagged = en_model.pos_tag(text) 26 | assert len(self.tokens) == len(self.tokens_tagged) 27 | for i, token in enumerate(self.tokens): 28 | if token.lower() in stopword_dict: 29 | self.tokens_tagged[i] = (token, "IN") 30 | self.keyphrase_candidate = extractor.extract_candidates(self.tokens_tagged, en_model) 31 | 32 | # if __name__ == '__main__': 33 | # text = "Adaptive state feedback control for a class of linear systems with unknown bounds of uncertainties The problem of adaptive robust stabilization for a class of linear time-varying systems with disturbance and nonlinear uncertainties is considered. The bounds of the disturbance and uncertainties are assumed to be unknown, being even arbitrary. 
For such uncertain dynamical systems, the adaptive robust state feedback controller is obtained. And the resulting closed-loop systems are asymptotically stable in theory. Moreover, an adaptive robust state feedback control scheme is given. The scheme ensures the closed-loop systems exponentially practically stable and can be used in practical engineering. Finally, simulations show that the control scheme is effective" 34 | # ito = InputTextObj(en_model, text) 35 | # print("OK") -------------------------------------------------------------------------------- /KeyExt/SIFRank/model/method.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/6/19 5 | 6 | import numpy as np 7 | import nltk 8 | from nltk.corpus import stopwords 9 | from model import input_representation 10 | import torch 11 | 12 | wnl=nltk.WordNetLemmatizer() 13 | stop_words = set(stopwords.words("english")) 14 | 15 | def cos_sim_gpu(x,y): 16 | assert x.shape[0]==y.shape[0] 17 | zero_tensor = torch.zeros((1, x.shape[0])).cuda() 18 | # zero_list = [0] * len(x) 19 | if x == zero_tensor or y == zero_tensor: 20 | return float(1) if x == y else float(0) 21 | xx, yy, xy = 0.0, 0.0, 0.0 22 | for i in range(x.shape[0]): 23 | xx += x[i] * x[i] 24 | yy += y[i] * y[i] 25 | xy += x[i] * y[i] 26 | return 1.0 - xy / np.sqrt(xx * yy) 27 | 28 | def cos_sim(vector_a, vector_b): 29 | """ 30 | 计算两个向量之间的余弦相似度 31 | :param vector_a: 向量 a 32 | :param vector_b: 向量 b 33 | :return: sim 34 | """ 35 | vector_a = np.mat(vector_a) 36 | vector_b = np.mat(vector_b) 37 | num = float(vector_a * vector_b.T) 38 | denom = np.linalg.norm(vector_a) * np.linalg.norm(vector_b) 39 | if(denom==0.0): 40 | return 0.0 41 | else: 42 | cos = num / denom 43 | sim = 0.5 + 0.5 * cos 44 | return sim 45 | 46 | def cos_sim_transformer(vector_a, vector_b): 47 | """ 48 | 计算两个向量之间的余弦相似度 49 | :param vector_a: 向量 a 50 | :param vector_b: 向量 b 51 | :return: sim 52 | """ 53 | a = vector_a.detach().numpy() 54 | b = vector_b.detach().numpy() 55 | a=np.mat(a) 56 | b=np.mat(b) 57 | 58 | num = float(a * b.T) 59 | denom = np.linalg.norm(a) * np.linalg.norm(b) 60 | if(denom==0.0): 61 | return 0.0 62 | else: 63 | cos = num / denom 64 | sim = 0.5 + 0.5 * cos 65 | return sim 66 | 67 | def get_dist_cosine(emb1, emb2, sent_emb_method="elmo",elmo_layers_weight=[0.0,1.0,0.0]): 68 | sum = 0.0 69 | assert emb1.shape == emb2.shape 70 | if(sent_emb_method=="elmo"): 71 | 72 | for i in range(0, 3): 73 | a = emb1[i] 74 | b = emb2[i] 75 | sum += cos_sim(a, b) * elmo_layers_weight[i] 76 | return sum 77 | 78 | elif(sent_emb_method=="elmo_transformer"): 79 | sum = cos_sim_transformer(emb1, emb2) 80 | return sum 81 | 82 | elif(sent_emb_method=="doc2vec"): 83 | sum=cos_sim(emb1,emb2) 84 | return sum 85 | 86 | elif (sent_emb_method == "glove"): 87 | sum = cos_sim(emb1, emb2) 88 | return sum 89 | return sum 90 | 91 | def get_all_dist(candidate_embeddings_list, text_obj, dist_list): 92 | ''' 93 | :param candidate_embeddings_list: 94 | :param text_obj: 95 | :param dist_list: 96 | :return: dist_all 97 | ''' 98 | 99 | dist_all={} 100 | for i, emb in enumerate(candidate_embeddings_list): 101 | phrase = text_obj.keyphrase_candidate[i][0] 102 | phrase = phrase.lower() 103 | phrase = wnl.lemmatize(phrase) 104 | if(phrase in dist_all): 105 | #store the No. 
and distance 106 | dist_all[phrase].append(dist_list[i]) 107 | else: 108 | dist_all[phrase]=[] 109 | dist_all[phrase].append(dist_list[i]) 110 | return dist_all 111 | 112 | def get_final_dist(dist_all, method="average"): 113 | ''' 114 | :param dist_all: 115 | :param method: "average" 116 | :return: 117 | ''' 118 | 119 | final_dist={} 120 | 121 | if(method=="average"): 122 | 123 | for phrase, dist_list in dist_all.items(): 124 | sum_dist = 0.0 125 | for dist in dist_list: 126 | sum_dist += dist 127 | if (phrase in stop_words): 128 | sum_dist = 0.0 129 | final_dist[phrase] = sum_dist/float(len(dist_list)) 130 | return final_dist 131 | 132 | def softmax(x): 133 | # x = x - np.max(x) 134 | exp_x = np.exp(x) 135 | softmax_x = exp_x / np.sum(exp_x) 136 | return softmax_x 137 | 138 | 139 | def get_position_score(keyphrase_candidate_list, position_bias): 140 | length = len(keyphrase_candidate_list) 141 | position_score ={} 142 | for i,kc in enumerate(keyphrase_candidate_list): 143 | np = kc[0] 144 | p = kc[1][0] 145 | np = np.lower() 146 | np = wnl.lemmatize(np) 147 | if np in position_score: 148 | 149 | position_score[np] += 0.0 150 | else: 151 | position_score[np] = 1/(float(i)+1+position_bias) 152 | score_list=[] 153 | for np,score in position_score.items(): 154 | score_list.append(score) 155 | score_list = softmax(score_list) 156 | 157 | i=0 158 | for np, score in position_score.items(): 159 | position_score[np] = score_list[i] 160 | i+=1 161 | return position_score 162 | 163 | def SIFRank(text, SIF, en_model, method="average", N=15, 164 | sent_emb_method="elmo", elmo_layers_weight=[0.0, 1.0, 0.0], if_DS=True, if_EA=True): 165 | """ 166 | :param text_obj: 167 | :param sent_embeddings: 168 | :param candidate_embeddings_list: 169 | :param sents_weight_list: 170 | :param method: 171 | :param N: the top-N number of keyphrases 172 | :param sent_emb_method: 'elmo', 'glove' 173 | :param elmo_layers_weight: the weights of different layers of ELMo 174 | :param if_DS: if take document segmentation(DS) 175 | :param if_EA: if take embeddings alignment(EA) 176 | :return: 177 | """ 178 | text_obj = input_representation.InputTextObj(en_model, text) 179 | sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA) 180 | dist_list = [] 181 | for i, emb in enumerate(candidate_embeddings_list): 182 | dist = get_dist_cosine(sent_embeddings, emb, sent_emb_method, elmo_layers_weight=elmo_layers_weight) 183 | dist_list.append(dist) 184 | dist_all = get_all_dist(candidate_embeddings_list, text_obj, dist_list) 185 | dist_final = get_final_dist(dist_all, method='average') 186 | dist_sorted = sorted(dist_final.items(), key=lambda x: x[1], reverse=True) 187 | return dist_sorted[0:N] 188 | 189 | def SIFRank_plus(text, SIF, en_model, method="average", N=15, 190 | sent_emb_method="elmo", elmo_layers_weight=[0.0, 1.0, 0.0], if_DS=True, if_EA=True, position_bias = 3.4): 191 | """ 192 | :param text_obj: 193 | :param sent_embeddings: 194 | :param candidate_embeddings_list: 195 | :param sents_weight_list: 196 | :param method: 197 | :param N: the top-N number of keyphrases 198 | :param sent_emb_method: 'elmo', 'glove' 199 | :param elmo_layers_weight: the weights of different layers of ELMo 200 | :return: 201 | """ 202 | text_obj = input_representation.InputTextObj(en_model, text) 203 | sent_embeddings, candidate_embeddings_list = SIF.get_tokenized_sent_embeddings(text_obj,if_DS=if_DS,if_EA=if_EA) 204 | position_score = get_position_score(text_obj.keyphrase_candidate, 
position_bias) 205 | average_score = sum(position_score.values()) / (float)(len(position_score))#Little change here 206 | dist_list = [] 207 | for i, emb in enumerate(candidate_embeddings_list): 208 | dist = get_dist_cosine(sent_embeddings, emb, sent_emb_method, elmo_layers_weight=elmo_layers_weight) 209 | dist_list.append(dist) 210 | dist_all = get_all_dist(candidate_embeddings_list, text_obj, dist_list) 211 | dist_final = get_final_dist(dist_all, method='average') 212 | for np,dist in dist_final.items(): 213 | if np in position_score: 214 | dist_final[np] = dist*position_score[np]/average_score#Little change here 215 | dist_sorted = sorted(dist_final.items(), key=lambda x: x[1], reverse=True) 216 | return dist_sorted[0:N] 217 | 218 | 219 | -------------------------------------------------------------------------------- /KeyExt/SIFRank/requirements.txt: -------------------------------------------------------------------------------- 1 | nltk==3.4.3 2 | StanfordCoreNLP==3.9.1.1 3 | torch==1.7.1 4 | allennlp==0.8.4 5 | overrides==3.1.0 6 | scikit-learn==0.22.2.post1 -------------------------------------------------------------------------------- /KeyExt/SIFRank/test/test.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge_sy" 4 | # Date: 2020/2/21 5 | 6 | import nltk 7 | from embeddings import sent_emb_sif, word_emb_elmo 8 | from model.method import SIFRank, SIFRank_plus 9 | from stanfordcorenlp import StanfordCoreNLP 10 | import time 11 | 12 | #download from https://allennlp.org/elmo 13 | options_file = "../auxiliary_data/elmo_2x4096_512_2048cnn_2xhighway_options.json" 14 | weight_file = "../auxiliary_data/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5" 15 | 16 | porter = nltk.PorterStemmer() 17 | ELMO = word_emb_elmo.WordEmbeddings(options_file, weight_file, cuda_device=0) 18 | SIF = sent_emb_sif.SentEmbeddings(ELMO, lamda=1.0) 19 | en_model = StanfordCoreNLP(r'E:\Python_Files\stanford-corenlp-full-2018-02-27',quiet=True)#download from https://stanfordnlp.github.io/CoreNLP/ 20 | elmo_layers_weight = [0.0, 1.0, 0.0] 21 | 22 | text = "Discrete output feedback sliding mode control of second order systems - a moving switching line approach The sliding mode control systems (SMCS) for which the switching variable is designed independent of the initial conditions are known to be sensitive to parameter variations and extraneous disturbances during the reaching phase. For second order systems this drawback is eliminated by using the moving switching line technique where the switching line is initially designed to pass the initial conditions and is subsequently moved towards a predetermined switching line. In this paper, we make use of the above idea of moving switching line together with the reaching law approach to design a discrete output feedback sliding mode control. The main contributions of this work are such that we do not require to use system states as it makes use of only the output samples for designing the controller. and by using the moving switching line a low sensitivity system is obtained through shortening the reaching phase. 
Simulation results show that the fast output sampling feedback guarantees sliding motion similar to that obtained using state feedback" 23 | keyphrases = SIFRank(text, SIF, en_model, N=15,elmo_layers_weight=elmo_layers_weight) 24 | keyphrases_ = SIFRank_plus(text, SIF, en_model, N=15, elmo_layers_weight=elmo_layers_weight) 25 | print(keyphrases) 26 | print(keyphrases_) -------------------------------------------------------------------------------- /KeyExt/SIFRank/util/fileIO.py: -------------------------------------------------------------------------------- 1 | #! /usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # __author__ = "Sponge" 4 | # Date: 2019/6/21 5 | 6 | import string,re,os 7 | 8 | class Result: 9 | 10 | def __init__(self,N=15): 11 | self.database="" 12 | self.predict_keyphrases = [] 13 | self.true_keyphrases = [] 14 | self.file_names = [] 15 | self.lamda=0.0 16 | self.beta=0.0 17 | 18 | def update_result(self, file_name, pre_kp, true_kp): 19 | self.file_names.append(file_name) 20 | self.predict_keyphrases.append(pre_kp) 21 | self.true_keyphrases.append(true_kp) 22 | 23 | def get_parameters(self,database="",lamda=0.6,beta=0.0): 24 | self.database = database 25 | self.lamda = lamda 26 | self.beta = beta 27 | 28 | def write_results(self): 29 | return 0 30 | 31 | def write_string(s, output_path): 32 | with open(output_path, 'w') as output_file: 33 | output_file.write(s) 34 | 35 | 36 | def read_file(input_path): 37 | with open(input_path, 'r', errors='replace_with_space') as input_file: 38 | return input_file.read() 39 | 40 | def clean_text(text="",database="Inspec"): 41 | 42 | #Specially for Duc2001 Database 43 | if(database=="Duc2001" or database=="Semeval2017"): 44 | pattern2 = re.compile(r'[\s,]' + '[\n]{1}') 45 | while (True): 46 | if (pattern2.search(text) is not None): 47 | position = pattern2.search(text) 48 | start = position.start() 49 | end = position.end() 50 | # start = int(position[0]) 51 | text_new = text[:start] + "\n" + text[start + 2:] 52 | text = text_new 53 | else: 54 | break 55 | 56 | pattern2 = re.compile(r'[a-zA-Z0-9,\s]' + '[\n]{1}') 57 | while (True): 58 | if (pattern2.search(text) is not None): 59 | position = pattern2.search(text) 60 | start = position.start() 61 | end = position.end() 62 | # start = int(position[0]) 63 | text_new = text[:start + 1] + " " + text[start + 2:] 64 | text = text_new 65 | else: 66 | break 67 | 68 | pattern3 = re.compile(r'\s{2,}') 69 | while (True): 70 | if (pattern3.search(text) is not None): 71 | position = pattern3.search(text) 72 | start = position.start() 73 | end = position.end() 74 | # start = int(position[0]) 75 | text_new = text[:start + 1] + "" + text[start + 2:] 76 | text = text_new 77 | else: 78 | break 79 | 80 | pattern1 = re.compile(r'[<>[\]{}]') 81 | text = pattern1.sub(' ', text) 82 | text = text.replace("\t", " ") 83 | text = text.replace(' p ','\n') 84 | text = text.replace(' /p \n','\n') 85 | lines = text.splitlines() 86 | # delete blank line 87 | text_new="" 88 | for line in lines: 89 | if(line!='\n'): 90 | text_new+=line+'\n' 91 | 92 | return text_new 93 | 94 | def get_duc2001_data(file_path="../data/DUC2001"): 95 | pattern = re.compile(r'<TEXT>(.*?)</TEXT>', re.S) 96 | data = {} 97 | labels = {} 98 | for dirname, dirnames, filenames in os.walk(file_path): 99 | for fname in filenames: 100 | if (fname == "annotations.txt"): 101 | # left, right = fname.split('.') 102 | infile = os.path.join(dirname, fname) 103 | f = open(infile,'rb') 104 | text = f.read().decode('utf8') 105 | lines = text.splitlines() 106 | for line
in lines: 107 | left, right = line.split("@") 108 | d = right.split(";")[:-1] 109 | l = left 110 | labels[l] = d 111 | f.close() 112 | else: 113 | infile = os.path.join(dirname, fname) 114 | f = open(infile,'rb') 115 | text = f.read().decode('utf8') 116 | text = re.findall(pattern, text)[0] 117 | 118 | text = text.lower() 119 | text = clean_text(text,database="Duc2001") 120 | data[fname]=text.strip("\n") 121 | # data[fname] = text 122 | return data,labels 123 | 124 | def get_inspec_data(file_path="../data/Inspec"): 125 | 126 | data={} 127 | labels={} 128 | for dirname, dirnames, filenames in os.walk(file_path): 129 | for fname in filenames: 130 | left, right = fname.split('.') 131 | if (right == "abstr"): 132 | infile = os.path.join(dirname, fname) 133 | f=open(infile) 134 | text=f.read() 135 | text=clean_text(text) 136 | data[left]=text 137 | if (right == "uncontr"): 138 | infile = os.path.join(dirname, fname) 139 | f=open(infile) 140 | text=f.read() 141 | text=text.replace("\n",' ') 142 | text=clean_text(text,database="Inspec") 143 | text=text.lower() 144 | label=text.split("; ") 145 | labels[left]=label 146 | return data,labels 147 | 148 | def get_semeval2017_data(data_path="../data/SemEval2017/docsutf8",labels_path="../data/SemEval2017/keys"): 149 | 150 | data={} 151 | labels={} 152 | for dirname, dirnames, filenames in os.walk(data_path): 153 | for fname in filenames: 154 | left, right = fname.split('.') 155 | infile = os.path.join(dirname, fname) 156 | f = open(infile, 'rb') 157 | text = f.read().decode('utf8') 158 | text = clean_text(text,database="Semeval2017") 159 | data[left] = text.lower() 160 | f.close() 161 | for dirname, dirnames, filenames in os.walk(labels_path): 162 | for fname in filenames: 163 | left, right = fname.split('.') 164 | infile = os.path.join(dirname, fname) 165 | f = open(infile, 'rb') 166 | text = f.read().decode('utf8') 167 | text = text.strip() 168 | ls=text.splitlines() 169 | labels[left] = ls 170 | f.close() 171 | return data,labels 172 | 173 | 174 | # if __name__ == '__main__': 175 | # 176 | # data,labels=get_semeval2017_data() 177 | # print("OK") 178 | 179 | 180 | 181 | -------------------------------------------------------------------------------- /KeyExt/__init__.py: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /KeyExt/config.py: -------------------------------------------------------------------------------- 1 | # Config values. 2 | datasets_path = r'..\datasets' 3 | output_dir = r'..\output' 4 | -------------------------------------------------------------------------------- /KeyExt/experiments.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pandas as pd 3 | import KeyExt.metrics 4 | import KeyExt.utils 5 | 6 | def run_experiments(datasets_dir, output_dir, top_n = 10, partial_match = True): 7 | 8 | # Make a list of all subdirectories. 9 | directories = next(os.walk(datasets_dir))[1][0:] 10 | data = [] 11 | 12 | # Set the metric name and construct the output path for the xlsx. 13 | metric_name = f'pF1@{top_n}' if partial_match else f'F1@{top_n}' 14 | xlsx_path = os.path.join(output_dir, f'{metric_name}.xlsx') 15 | print(f'Calculating the {metric_name} score for all datasets...') 16 | 17 | for i, directory in enumerate(directories): 18 | print(f'Processing {i+1} in {len(directories)} datasets.') 19 | 20 | # Change current working directory to the dataset directory. 
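# (Descriptive note, inferred from the code below rather than stated elsewhere: each dataset
# directory is assumed to contain a 'keys' folder with one human-assigned keyphrase file per
# document, and an 'extracted' folder with one subfolder per method holding the corresponding
# extracted keyphrase files, aligned with the key files by sorted filename.)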
21 | dataset_path = os.path.join(datasets_dir, directory) 22 | os.chdir(dataset_path) 23 | 24 | # Find human assigned keyphrase files and paths. 25 | os.chdir(os.path.join(dataset_path, 'keys')) 26 | key_paths = list(map(os.path.abspath, sorted(os.listdir()))) 27 | 28 | # Find all methods (directories of keys) and their generated keyphrase files and paths. 29 | extracted_path = os.path.join(dataset_path, 'extracted') 30 | os.chdir(extracted_path) 31 | methods = sorted(next(os.walk('.'))[1]) 32 | 33 | # Initialize the macro (mean) metric vector. 34 | macro_metric_vec = [0.0] * len(methods) 35 | 36 | # Compare the extracted keys of each method with the human assigned keys. 37 | for j, method in enumerate(methods): 38 | 39 | print(f' * Evaluating {method} for {len(key_paths)} documents.') 40 | 41 | # Find all extracted keys of the method. 42 | os.chdir(os.path.join(extracted_path, method)) 43 | method_paths = list(map(os.path.abspath, sorted(os.listdir()))) 44 | 45 | for key_path, method_path in zip(key_paths, method_paths): 46 | with open(method_path, 'r', encoding = 'utf-8-sig', errors = 'ignore') as method_keys, \ 47 | open(key_path, 'r', encoding = 'utf-8-sig', errors = 'ignore') as human_keys: 48 | 49 | # Read the tags from file and then preprocess them, 50 | # so that they are lowercased, stripped of punctuation and stemmed. 51 | extracted = KeyExt.utils.preprocess(method_keys.read().split('\n')) 52 | assigned = KeyExt.utils.preprocess(human_keys.read().split('\n')) 53 | macro_metric_vec[j] += KeyExt.metrics.f1_metric_k ( 54 | assigned, extracted, k = top_n, partial_match = partial_match 55 | ) 56 | 57 | # The macro (mean) metric score is calculated for each method. 58 | macro_metric_vec = [ 59 | round(metric_sum / len(key_paths), 3) 60 | for metric_sum in macro_metric_vec 61 | ] 62 | 63 | # Append the macro metric score for each directory to the data list of lists, 64 | # each list has the dataset name prepended at the start of the row. 65 | data.append([directory] + macro_metric_vec) 66 | os.system('clear') 67 | 68 | 69 | # Construct the dataframe and then transpose it. 70 | df = pd.DataFrame(data, columns = [f'{metric_name}', *methods]).set_index(f'{metric_name}') 71 | df = df.transpose() 72 | 73 | # Save the dataframe to excel. 74 | df.to_excel(xlsx_path, engine = 'openpyxl') 75 | return 76 | -------------------------------------------------------------------------------- /KeyExt/metrics.py: -------------------------------------------------------------------------------- 1 | def exact_f1_k(assigned, extracted, k): 2 | """ 3 | Computes the exact match f1 measure at k. 4 | Arguments 5 | --------- 6 | assigned : A list of human assigned keyphrases. 7 | extracted : A list of extracted keyphrases. 8 | k : int 9 | The maximum number of extracted keyphrases. 10 | Returned value 11 | -------------- 12 | : double 13 | """ 14 | # Exit early, if one of the lists or both are empty. 15 | if not assigned or not extracted: 16 | return 0.0 17 | 18 | precision_k = len(set(assigned) & set(extracted)) / k 19 | recall_k = len(set(assigned) & set(extracted)) / len(assigned) 20 | return ( 21 | 2 * precision_k * recall_k / (precision_k + recall_k) 22 | if precision_k and recall_k else 0.0 23 | ) 24 | 25 | 26 | def partial_f1_k(assigned, extracted, k): 27 | """ 28 | Computes the partial match f1 measure at k. 29 | Arguments 30 | --------- 31 | assigned : A list of human assigned keyphrases. 32 | extracted : A list of extracted keyphrases. 33 | k : int 34 | The maximum number of extracted keyphrases.
35 | Returned value 36 | -------------- 37 | : double 38 | """ 39 | # Exit early, if one of the lists or both are empty. 40 | if not assigned or not extracted: 41 | return 0.0 42 | 43 | # Store the longest keyphrases first. 44 | assigned_sets = sorted([set(keyword.split()) for keyword in assigned], key = len, reverse = True) 45 | extracted_sets = sorted([set(keyword.split()) for keyword in extracted], key = len, reverse = True) 46 | 47 | # This list stores True, if the assigned keyphrase has been matched earlier. 48 | # To avoid counting duplicate matches. 49 | assigned_matches = [False for assigned_set in assigned_sets] 50 | 51 | # For each extracted keyphrase, find the closest match, 52 | # which is the assigned keyphrase it has the most words in common. 53 | for extracted_set in extracted_sets: 54 | all_matches = [(i, len(assigned_set & extracted_set)) for i, assigned_set in enumerate(assigned_sets)] 55 | closest_match = sorted(all_matches, key = lambda x: x[1], reverse = True)[0] 56 | assigned_matches[closest_match[0]] = True 57 | 58 | # Calculate the precision and recall metrics based on the partial matches. 59 | partial_matches = assigned_matches.count(True) 60 | precision_k = partial_matches / k 61 | recall_k = partial_matches / len(assigned) 62 | 63 | return ( 64 | 2 * precision_k * recall_k / (precision_k + recall_k) 65 | if precision_k and recall_k else 0.0 66 | ) 67 | 68 | 69 | def f1_metric_k(assigned, extracted, k, partial_match = True): 70 | """ 71 | Wrapper function that calculates either the exact 72 | or the partial match f1 metric. 73 | """ 74 | return ( 75 | partial_f1_k(assigned, extracted, k) 76 | if partial_match else exact_f1_k(assigned, extracted, k) 77 | ) 78 | -------------------------------------------------------------------------------- /KeyExt/utils.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import platform 4 | import functools 5 | import KeyExt.config 6 | from string import punctuation 7 | from nltk.stem import SnowballStemmer 8 | 9 | 10 | # Initialize the English stemmer once. 11 | stemmer = SnowballStemmer('english') 12 | 13 | 14 | def preprocess(lis): 15 | """ 16 | Function which applies stemming to a 17 | lowercase version of each string of the list, 18 | which has all punctuation removed. 19 | """ 20 | return list(map(stemmer.stem, 21 | map(lambda s: s.translate(str.maketrans('', '', punctuation)), 22 | map(str.lower, lis)))) 23 | 24 | 25 | def rreplace(s, old, new, occurrence): 26 | """ 27 | Function which replaces a string occurence 28 | in a string from the end of the string. 29 | """ 30 | return new.join(s.rsplit(old, occurrence)) 31 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Keyword & Keyphrase Extraction Review 2 | 3 | This repository hosts code for the papers: 4 | * [A literature review of keyword and keyphrase extraction -]() - [Download]() 5 | * [A comparative assessment of state-of-the-art methods for multilingual unsupervised keyphrase extraction](https://link.springer.com/chapter/10.1007/978-3-030-79150-6_50) - [Download](https://github.com/NC0DER/KeyphraseExtraction/releases/tag/KeyphraseExtractionv1.0) 6 | 7 | ## Datasets 8 | Available in [this link]() 9 | 10 | ## Disclaimer 11 | This repository contains code for the evaluated approaches. 12 | The code for these approaches belongs to their respective authors. 
13 | Some code files were modified to enable the evaluation. 14 | These modifications include: 15 | * Removing hardcoded paths. 16 | * Setting `cpu-only` mode for approaches that require a lot of `GPU VRAM`. 17 | * Updating `Python 2` code to run on `Python 3`. 18 | * Amending errors related to old packages or functions with wrong parameters. 19 | * Disabling stemming performed early by certain approaches in their keyphrase extraction step, 20 | so as to use a common stemmer later in the evaluation process. 21 | 22 | ## Test Results 23 | To reproduce the results, configure `KeyExt\config.py` and run `KeyExt.py`. 24 | 25 | ## Installation 26 | * `Python 3` (min. version 3.7), `pip3` (& `py` launcher Windows-only). 27 | * Follow the install instructions in each subdirectory. 28 | 29 | ## Contributors 30 | * Nikolaos Giarelis (giarelis@ceid.upatras.gr) 31 | * Nikos Karacapilidis (karacap@upatras.gr) 32 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | click==8.1.3 2 | colorama==0.4.5 3 | et-xmlfile==1.1.0 4 | importlib-metadata==4.11.4 5 | joblib==1.1.0 6 | nltk==3.7 7 | numpy==1.21.6 8 | openpyxl==3.0.10 9 | pandas==1.3.5 10 | pip==22.1.2 11 | python-dateutil==2.8.2 12 | pytz==2022.1 13 | regex==2022.6.2 14 | setuptools==62.4.0 15 | six==1.16.0 16 | tqdm==4.64.0 17 | typing_extensions==4.2.0 18 | zipp==3.8.0 19 | --------------------------------------------------------------------------------