├── README.md
├── _config.yml
├── bilm-tf
│   ├── LICENSE
│   ├── README.md
│   ├── bilm
│   │   ├── __init__.py
│   │   ├── data.py
│   │   ├── elmo.py
│   │   ├── model.py
│   │   └── training.py
│   └── setup.py
├── cache.py
├── frontend
│   ├── constants
│   │   └── constants.js
│   ├── index.html
│   ├── index_content.html
│   ├── info.js
│   ├── js
│   │   ├── analytics.js
│   │   ├── bootstrap.css
│   │   ├── bootstrap.min.js
│   │   ├── global.css
│   │   ├── jquery.js
│   │   └── logo.png
│   └── loading_icon.gif
├── install.sh
├── main.py
├── mapping
│   ├── README.md
│   ├── bbn.logic.mapping
│   ├── bbn.mapping
│   ├── figer.logic.mapping
│   ├── figer.mapping
│   ├── ontonotes.logic.mapping
│   └── ontonotes.mapping
├── requirements.txt
├── scripts.py
├── server.py
└── zoe_utils.py

/README.md:
--------------------------------------------------------------------------------
 1 | # ZOE (Zero-shot Open Entity Typing)
 2 | A state-of-the-art system for zero-shot fine-grained entity typing with minimal supervision
 3 | 
 4 | ## Introduction
 5 | 
 6 | This is a demo system for our paper "Zero-Shot Open Entity Typing as Type-Compatible Grounding",
 7 | which at the time of publication represented the state of the art in zero-shot entity typing.
 8 | 
 9 | The original experiments that produced all the results in the paper
10 | were done with a package written in Java. This is a re-written package intended solely for
11 | demoing the algorithm and validating key results.
12 | 
13 | The results may differ slightly from the published numbers, due to the randomness in the
14 | iteration order of Java's HashSet and Python's set. The difference should be negligible.
15 | 
16 | This system may take a long time if run on a large number of new sentences, due to ELMo processing.
17 | We have cached ELMo results for the provided experiments.
18 | 
19 | The package also contains an online demo; please refer to the [publication page](http://cogcomp.org/page/publication_view/845)
20 | for more details.
21 | 
22 | ## Usage
23 | 
24 | ### Install the system
25 | 
26 | #### Prerequisites
27 | 
28 | * At least 20 GB of available disk space and 16 GB of memory (strict requirement)
29 | * Python 3.x (mostly tested on 3.5)
30 | * A POSIX OS (Windows is not supported)
31 | * Java JDK and Maven
32 | * `virtualenv` if you are installing with the script
33 | * `wget` if you are installing with the script (use `brew` to install it on macOS)
34 | * `unzip` if you are installing with the script
35 | 
36 | #### Install using a one-line command
37 | 
38 | To make life easier, we provide a simple way to install with `sh install.sh`.
39 | 
40 | This script does everything mentioned in the next section, plus creating a virtualenv. Activate it with `source venv/bin/activate`.
41 | 
42 | #### Install manually
43 | 
44 | See the wiki page [manual-installation](https://github.com/CogComp/zoe/wiki/Manual-Installation)
45 | 
46 | ### Run the system
47 | 
48 | Currently you can do the following without changes to the code:
49 | * Run the experiment on the FIGER test set (randomly sampled as in the paper): `python3 main.py figer`
50 | * Run the experiment on the BBN test set: `python3 main.py bbn`
51 | * Run the experiment on the first 1000 Ontonotes_fine test set instances (due to size issues): `python3 main.py ontonotes`
52 | 
53 | Additionally, you can run the server mode that initializes the online demo with `python3 server.py`.
54 | However, this requires some additional files that are not yet provided for download.
55 | Please contact the authors directly.
56 | 
57 | Running on a large number of new sentences is generally an expensive operation, but you are welcome to do it.
58 | Please refer to `main.py` and [Engineering Details](https://github.com/CogComp/zoe/wiki/Engineering-Details)
59 | to see how you can test on your own data.
60 | 
61 | 
62 | ## Citation
63 | See the following paper:
64 | ```
65 | @inproceedings{ZKTR18,
66 |     author = {Ben Zhou and Daniel Khashabi and Chen-Tse Tsai and Dan Roth},
67 |     title = {Zero-Shot Open Entity Typing as Type-Compatible Grounding},
68 |     booktitle = {EMNLP},
69 |     year = {2018},
70 | }
71 | ```
72 | 
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-minimal
--------------------------------------------------------------------------------
/bilm-tf/LICENSE:
--------------------------------------------------------------------------------
 1 | Apache License
 2 | Version 2.0, January 2004
 3 | http://www.apache.org/licenses/
 4 | 
 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
 6 | 
 7 | 1. Definitions.
 8 | 
 9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 | 
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 | 
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 | 
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 | 
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 | 
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 | 
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 | 
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 | 
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner.
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {yyyy} {name of copyright owner} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /bilm-tf/README.md: -------------------------------------------------------------------------------- 1 | # bilm-tf 2 | Tensorflow implementation of the pretrained biLM used to compute ELMo 3 | representations from ["Deep contextualized word representations"](http://arxiv.org/abs/1802.05365). 4 | 5 | This repository supports both training biLMs and using pre-trained models for prediction. 6 | 7 | We also have a pytorch implementation available in [AllenNLP](http://allennlp.org/). 
 8 | 
 9 | You may also find it easier to use the version provided in [Tensorflow Hub](https://www.tensorflow.org/hub/modules/google/elmo/2) if you would just like to make predictions.
10 | 
11 | Citation:
12 | 
13 | ```
14 | @inproceedings{Peters:2018,
15 |   author={Peters, Matthew E. and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke},
16 |   title={Deep contextualized word representations},
17 |   booktitle={Proc. of NAACL},
18 |   year={2018}
19 | }
20 | ```
21 | 
22 | 
23 | ## Installing
24 | Install Python 3.5 or later, TensorFlow 1.2, and h5py:
25 | 
26 | ```
27 | pip install tensorflow-gpu==1.2 h5py
28 | python setup.py install
29 | ```
30 | 
31 | Ensure the tests pass in your environment by running:
32 | ```
33 | python -m unittest discover tests/
34 | ```
35 | 
36 | ## Installing with Docker
37 | 
38 | To run the image, you must use nvidia-docker, because this repository
39 | requires GPUs.
40 | ```
41 | sudo nvidia-docker run -t allennlp/bilm-tf:training-gpu
42 | ```
43 | 
44 | ## Using pre-trained models
45 | 
46 | We have several different English language pre-trained biLMs available for use.
47 | Each model is specified with two separate files, a JSON formatted "options"
48 | file with hyperparameters and an hdf5 formatted file with the model
49 | weights. Links to the pre-trained models are available [here](https://allennlp.org/elmo).
50 | 
51 | 
52 | There are three ways to integrate ELMo representations into a downstream task, depending on your use case.
53 | 
54 | 1. Compute representations on the fly from raw text using character input. This is the most general method and will handle any input text. It is also the most computationally expensive.
55 | 2. Precompute and cache the context independent token representations, then compute context dependent representations using the biLSTMs for input data. This method is less computationally expensive than #1, but is only applicable with a fixed, prescribed vocabulary.
56 | 3. Precompute the representations for your entire dataset and save to a file.
57 | 
58 | We have used all of these methods in the past for various use cases. #1 is necessary for evaluating at test time on unseen data (e.g. public SQuAD leaderboard). #2 is a good compromise for large datasets where the size of the file in #3 is infeasible (SNLI, SQuAD). #3 is a good choice for smaller datasets or in cases where you'd like to use ELMo in other frameworks.
59 | 
60 | In all cases, the process roughly follows the same steps.
61 | First, create a `Batcher` (or `TokenBatcher` for #2) to translate tokenized strings to numpy arrays of character (or token) ids.
62 | Then, load the pretrained ELMo model (class `BidirectionalLanguageModel`).
63 | Finally, for steps #1 and #2 use `weight_layers` to compute the final ELMo representations.
64 | For #3, use `BidirectionalLanguageModel` to write all the intermediate layers to a file. A short end-to-end sketch of this pipeline follows the shape conventions below.
65 | 
66 | #### Shape conventions
67 | Each tokenized sentence is a list of `str`, with a batch of sentences
68 | a list of tokenized sentences (`List[List[str]]`).
69 | 
70 | The `Batcher` packs these into a shape
71 | `(n_sentences, max_sentence_length + 2, 50)` numpy array of character
72 | ids, padding on the right with 0 ids for sentences shorter than the maximum
73 | length. The first and last tokens for each sentence are special
74 | begin and end of sentence ids added by the `Batcher`.
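To make the steps above concrete, here is a minimal character-input sketch in the spirit of `usage_character.py`. The file paths are placeholders for your own vocabulary file and a downloaded pre-trained model:

```python
import tensorflow as tf
from bilm import Batcher, BidirectionalLanguageModel, weight_layers

# Placeholder paths: point these at your vocabulary file and the
# "options"/"weights" files of one of the pre-trained models.
vocab_file = 'vocab.txt'
options_file = 'options.json'
weight_file = 'weights.hdf5'

# 1. The Batcher translates tokenized strings to arrays of character ids.
batcher = Batcher(vocab_file, 50)

# 2. Load the pretrained biLM; the input placeholder is (batch, time, 50).
character_ids = tf.placeholder('int32', shape=(None, None, 50))
bilm = BidirectionalLanguageModel(options_file, weight_file)
embeddings_op = bilm(character_ids)

# 3. Collapse the biLM layers into a single weighted ELMo representation.
elmo_output = weight_layers('input', embeddings_op, l2_coef=0.0)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    char_ids = batcher.batch_sentences(
        [['The', 'first', 'sentence', '.'], ['Second', '.']])
    # elmo_vecs has shape (n_sentences, max_sentence_length, 1024)
    elmo_vecs = sess.run(elmo_output['weighted_op'],
                         feed_dict={character_ids: char_ids})
```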
 75 | 
 76 | The input character id placeholder can be dimensioned `(None, None, 50)`,
 77 | with both the batch dimension (axis=0) and time dimension (axis=1) determined
 78 | for each batch, up to the maximum batch size specified in the
 79 | `BidirectionalLanguageModel` constructor.
 80 | 
 81 | After running inference with the batch, the returned biLM embeddings are
 82 | a numpy array with shape `(n_sentences, 3, max_sentence_length, 1024)`,
 83 | after removing the special begin/end tokens.
 84 | 
 85 | #### Vocabulary file
 86 | The `Batcher` takes a vocabulary file as input for efficiency. This is a
 87 | text file, with one token per line, separated by newlines (`\n`).
 88 | Each token in the vocabulary is cached as the appropriate 50 character id
 89 | sequence once. Since the model is completely character based, tokens not in
 90 | the vocabulary file are handled appropriately at run time, with a slight
 91 | run time penalty. It is recommended to always include the special
 92 | `<S>` and `</S>` tokens (case sensitive) in the vocabulary file.
 93 | 
 94 | ### ELMo with character input
 95 | 
 96 | See `usage_character.py` for a detailed usage example.
 97 | 
 98 | ### ELMo with pre-computed and cached context independent token representations
 99 | To speed up model inference with a fixed, specified vocabulary, it is
100 | possible to pre-compute the context independent token representations,
101 | write them to a file, and re-use them for inference. Note that we don't
102 | support falling back to character inputs for out-of-vocabulary words,
103 | so this should only be used when the biLM is used to compute embeddings
104 | for input with a fixed, defined vocabulary.
105 | 
106 | To use this option:
107 | 
108 | 1. First create a vocabulary file with all of the unique tokens in your
109 | dataset and add the special `<S>` and `</S>` tokens.
110 | 2. Run `dump_token_embeddings` with the full model to write the token
111 | embeddings to an hdf5 file.
112 | 3. Use `TokenBatcher` (instead of `Batcher`) with your vocabulary file,
113 | and pass `use_token_inputs=False` and the name of the output file from step
114 | 2 to the `BidirectionalLanguageModel` constructor.
115 | 
116 | See `usage_token.py` for a detailed usage example.
117 | 
118 | ### Dumping biLM embeddings for an entire dataset to a single file.
119 | 
120 | To take this option, create a text file with your tokenized dataset. Each line is one tokenized sentence (whitespace separated). Then use `dump_bilm_embeddings`.
121 | 
122 | The output file is `hdf5` format. Each sentence in the input data is stored as a dataset with key `str(sentence_id)` where `sentence_id` is the line number in the dataset file (indexed from 0).
123 | The embeddings for each sentence are a shape (3, n_tokens, 1024) array.
124 | 
125 | See `usage_cached.py` for a detailed example.
126 | 
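The whole dumping flow fits in a few lines — a minimal sketch, with placeholder paths standing in for your dataset and pre-trained model files:

```python
import h5py
from bilm import dump_bilm_embeddings

# Placeholder paths: a vocabulary file, a tokenized dataset
# (one whitespace-separated sentence per line), and the pre-trained
# options/weights files.
dump_bilm_embeddings('vocab.txt', 'dataset.txt',
                     'options.json', 'weights.hdf5',
                     'elmo_embeddings.hdf5')

# Each sentence is stored under str(line_number) with shape
# (3, n_tokens, 1024).
with h5py.File('elmo_embeddings.hdf5', 'r') as fin:
    first_sentence_embeddings = fin['0'][...]
```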
127 | ## Training a biLM on a new corpus
128 | 
129 | Broadly speaking, the process to train and use a new biLM is:
130 | 
131 | 1. Prepare input data and a vocabulary file.
132 | 2. Train the biLM.
133 | 3. Test (compute the perplexity of) the biLM on heldout data.
134 | 4. Write out the weights from the trained biLM to an hdf5 file.
135 | 5. See the instructions above for using the output from Step #4 in downstream models.
136 | 
137 | 
138 | #### 1. Prepare input data and a vocabulary file.
139 | To train and evaluate a biLM, you need to provide:
140 | 
141 | * a vocabulary file
142 | * a set of training files
143 | * a set of heldout files
144 | 
145 | The vocabulary file is a text file with one token per line. It must also include the special tokens `<S>`, `</S>` and `<UNK>` (case sensitive) in the file.
146 | 
147 | IMPORTANT: the vocabulary file should be sorted in descending order by token count in your training data. The first three lines should be the special tokens (`<S>`, `</S>` and `<UNK>`), then the most common token in the training data, ending with the least common token.
148 | 
149 | NOTE: the vocabulary file used in training may differ from the one used for prediction.
150 | 
151 | The training data should be randomly split into many training files,
152 | each containing one slice of the data. Each file contains pre-tokenized and
153 | white space separated text, one sentence per line.
154 | Don't include the `<S>` or `</S>` tokens in your training data.
155 | 
156 | All tokenization/normalization is done before training a model, so both
157 | the vocabulary file and training files should include normalized tokens.
158 | As the default settings use a fully character based token representation, in general we do not recommend any normalization other than tokenization.
159 | 
160 | Finally, reserve a small amount of the training data as heldout data for evaluating the trained biLM.
161 | 
162 | #### 2. Train the biLM.
163 | The hyperparameters used to train the ELMo model can be found in `bin/train_elmo.py`.
164 | 
165 | The ELMo model was trained on 3 GPUs.
166 | To train a new model with the same hyperparameters, first download the training data from the [1 Billion Word Benchmark](http://www.statmt.org/lm-benchmark/).
167 | Then download the [vocabulary file](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/vocab-2016-09-10.txt).
168 | Finally, run:
169 | 
170 | ```
171 | export CUDA_VISIBLE_DEVICES=0,1,2
172 | python bin/train_elmo.py \
173 |     --train_prefix='/path/to/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled/*' \
174 |     --vocab_file /path/to/vocab-2016-09-10.txt \
175 |     --save_dir /output_path/to/checkpoint
176 | ```
177 | 
178 | #### 3. Evaluate the trained model.
179 | 
180 | Use `bin/run_test.py` to evaluate a trained model, e.g.
181 | 
182 | ```
183 | export CUDA_VISIBLE_DEVICES=0
184 | python bin/run_test.py \
185 |     --test_prefix='/path/to/1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-000*' \
186 |     --vocab_file /path/to/vocab-2016-09-10.txt \
187 |     --save_dir /output_path/to/checkpoint
188 | ```
189 | 
190 | #### 4. Convert the tensorflow checkpoint to hdf5 for prediction with `bilm` or `allennlp`.
191 | 
192 | First, create an `options.json` file for the newly trained model. To do so,
193 | follow the template in an existing file (e.g. the [original `options.json`](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json)) and modify for your hyperparameters.
194 | 
195 | **Important**: always set `n_characters` to 262 after training (see below).
196 | 
197 | Then run:
198 | 
199 | ```
200 | python bin/dump_weights.py \
201 |     --save_dir /output_path/to/checkpoint \
202 |     --outfile /output_path/to/weights.hdf5
203 | ```
204 | 
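The `n_characters` edit above is easy to forget, so here is a minimal sketch of the change, assuming `options.json` is the copy you created for prediction:

```python
import json

# Load the prediction-time copy of the options file ('options.json' is a
# placeholder for your own path).
with open('options.json') as fin:
    options = json.load(fin)

# 261 character embeddings are used during training; prediction adds a
# special padding id, so the prediction-time value must be 262.
options['char_cnn']['n_characters'] = 262

with open('options.json', 'w') as fout:
    json.dump(options, fout)
```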
205 | ## Frequently asked questions and other warnings
206 | 
207 | #### Can you provide the tensorflow checkpoint from training?
208 | The tensorflow checkpoint is available by downloading these files:
209 | 
210 | * [vocabulary](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/vocab-2016-09-10.txt)
211 | * [checkpoint](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_tf_checkpoint/checkpoint)
212 | * [options](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_tf_checkpoint/options.json)
213 | * [1](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_tf_checkpoint/model.ckpt-935588.data-00000-of-00001)
214 | * [2](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_tf_checkpoint/model.ckpt-935588.index)
215 | * [3](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway_tf_checkpoint/model.ckpt-935588.meta)
216 | 
217 | 
218 | #### How do I fine tune a model on additional unlabeled data?
219 | 
220 | First download the checkpoint files above.
221 | Then prepare the dataset as described in the section "Training a biLM on a new corpus", with the exception that we will use the existing vocabulary file instead of creating a new one. Finally, use the script `bin/restart.py` to restart training with the existing checkpoint on the new dataset.
222 | For small datasets (e.g. < 10 million tokens) we only recommend tuning for a small number of epochs and monitoring the perplexity on a heldout set, otherwise the model will overfit the small dataset.
223 | 
224 | #### Are the softmax weights available?
225 | 
226 | They are available in the training checkpoint above.
227 | 
228 | #### Can you provide some more details about how the model was trained?
229 | The script `bin/train_elmo.py` has hyperparameters for training the model.
230 | The original model was trained on 3 GTX 1080 GPUs for 10 epochs, taking about
231 | two weeks.
232 | 
233 | For input processing, we used the raw 1 Billion Word Benchmark dataset
234 | [here](http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz),
235 | and the existing vocabulary of 793471 tokens, including `<S>`, `</S>` and `<UNK>`.
236 | You can find our vocabulary file [here](https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/vocab-2016-09-10.txt).
237 | At the model input, all text used the full character based representation,
238 | including tokens outside the vocab.
239 | For the softmax output we replaced OOV tokens with `<UNK>`.
240 | 
241 | The model was trained with a fixed size window of 20 tokens.
242 | The batches were constructed by padding sentences with `<S>` and `</S>`, then packing tokens from one or more sentences into each row to completely fill each batch.
243 | Partial sentences and the LSTM states were carried over from batch to batch so that the language model could use information across batches for context, but backpropagation was broken at each batch boundary.
244 | 
245 | #### Why do I get slightly different embeddings if I run the same text through the pre-trained model twice?
246 | As a result of the training method (see above), the LSTMs are stateful, and carry their state forward from batch to batch.
247 | Consequently, this introduces a small amount of non-determinism, especially
248 | for the first two batches.
249 | 
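For reference, the batch construction described above can be reproduced directly with the data classes in `bilm/data.py` — a minimal sketch, with placeholder paths for the vocabulary file and training shards:

```python
from bilm.data import BidirectionalLMDataset, UnicodeCharsVocabulary

# Placeholder paths: 'vocab.txt' must contain <S>, </S> and <UNK>;
# 'training_shards/*' is a glob over the pre-tokenized training files.
vocab = UnicodeCharsVocabulary('vocab.txt', 50)
data = BidirectionalLMDataset('training_shards/*', vocab,
                              test=False, shuffle_on_load=True)

# Each batch packs (possibly partial) sentences into fixed-size rows; the
# forward stream is under 'token_ids'/'tokens_characters'/'next_token_id'
# and the reverse stream under the corresponding '*_reverse' keys.
for batch in data.iter_batches(batch_size=128, num_steps=20):
    print(batch['token_ids'].shape)  # (128, 20)
    break
```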
250 | #### Why does training seem to take forever even with my small dataset?
251 | The number of gradient updates during training is determined by:
252 | 
253 | * the number of tokens in the training data (`n_train_tokens`)
254 | * the batch size (`batch_size`)
255 | * the number of epochs (`n_epochs`)
256 | 
257 | Be sure to set these values for your particular dataset in `bin/train_elmo.py`.
258 | 
259 | 
260 | #### What's the deal with `n_characters` and padding?
261 | During training, we fill each batch to exactly 20 tokens by adding `<S>` and `</S>` to each sentence, then packing tokens from one or more sentences into each row to completely fill each batch.
262 | As a result, we do not allocate space for a special padding token.
263 | The `UnicodeCharsVocabulary` that converts token strings to lists of character
264 | ids always uses a fixed number of character embeddings of `n_characters=261`, so always
265 | set `n_characters=261` during training.
266 | 
267 | However, for prediction, we ensure each sentence is fully contained in a single batch,
268 | and as a result pad sentences of different lengths with a special padding id.
269 | This occurs in the `Batcher` [see here](https://github.com/allenai/bilm-tf/blob/master/bilm/data.py#L220).
270 | As a result, set `n_characters=262` during prediction in the `options.json`.
271 | 
272 | #### How can I use ELMo to compute sentence representations?
273 | Simple methods like average and max pooling of the word level ELMo representations across sentences work well, often outperforming supervised methods on benchmark datasets.
274 | See "Evaluation of sentence embeddings in downstream and linguistic probing tasks", Perone et al, 2018 [arxiv link](https://arxiv.org/abs/1806.06259).
275 | 
276 | 
277 | 
--------------------------------------------------------------------------------
/bilm-tf/bilm/__init__.py:
--------------------------------------------------------------------------------
1 | 
2 | from .data import Batcher, TokenBatcher
3 | from .elmo import weight_layers
4 | from .model import BidirectionalLanguageModel, dump_token_embeddings, \
5 |     dump_bilm_embeddings, dump_bilm_embeddings_inner, initialize_sess
6 | 
--------------------------------------------------------------------------------
/bilm-tf/bilm/data.py:
--------------------------------------------------------------------------------
 1 | # originally based on https://github.com/tensorflow/models/tree/master/lm_1b
 2 | import glob
 3 | import random
 4 | from typing import List
 5 | 
 6 | import numpy as np
 7 | 
 8 | 
 9 | class Vocabulary(object):
10 |     '''
11 |     A token vocabulary. Holds a map from token to ids and provides
12 |     a method for encoding text to a sequence of ids.
13 |     '''
14 |     def __init__(self, filename, validate_file=False):
15 |         '''
16 |         filename = the vocabulary file. It is a flat text file with one
17 |         (normalized) token per line. In addition, the file should also
18 |         contain the special tokens <S>, </S>, <UNK> (case sensitive).
 19 |         '''
 20 |         self._id_to_word = []
 21 |         self._word_to_id = {}
 22 |         self._unk = -1
 23 |         self._bos = -1
 24 |         self._eos = -1
 25 | 
 26 |         with open(filename) as f:
 27 |             idx = 0
 28 |             for line in f:
 29 |                 word_name = line.strip()
 30 |                 if word_name == '<S>':
 31 |                     self._bos = idx
 32 |                 elif word_name == '</S>':
 33 |                     self._eos = idx
 34 |                 elif word_name == '<UNK>':
 35 |                     self._unk = idx
 36 |                 if word_name == '!!!MAXTERMID':
 37 |                     continue
 38 | 
 39 |                 self._id_to_word.append(word_name)
 40 |                 self._word_to_id[word_name] = idx
 41 |                 idx += 1
 42 | 
 43 |         # check to ensure file has special tokens
 44 |         if validate_file:
 45 |             if self._bos == -1 or self._eos == -1 or self._unk == -1:
 46 |                 raise ValueError("Ensure the vocabulary file has "
 47 |                                  "<S>, </S>, <UNK> tokens")
 48 | 
 49 |     @property
 50 |     def bos(self):
 51 |         return self._bos
 52 | 
 53 |     @property
 54 |     def eos(self):
 55 |         return self._eos
 56 | 
 57 |     @property
 58 |     def unk(self):
 59 |         return self._unk
 60 | 
 61 |     @property
 62 |     def size(self):
 63 |         return len(self._id_to_word)
 64 | 
 65 |     def word_to_id(self, word):
 66 |         if word in self._word_to_id:
 67 |             return self._word_to_id[word]
 68 |         return self.unk
 69 | 
 70 |     def id_to_word(self, cur_id):
 71 |         return self._id_to_word[cur_id]
 72 | 
 73 |     def decode(self, cur_ids):
 74 |         """Convert a list of ids to a sentence, with space inserted."""
 75 |         return ' '.join([self.id_to_word(cur_id) for cur_id in cur_ids])
 76 | 
 77 |     def encode(self, sentence, reverse=False, split=True):
 78 |         """Convert a sentence to a list of ids, with special tokens added.
 79 |         Sentence is a single string with tokens separated by whitespace.
 80 | 
 81 |         If reverse, then the sentence is assumed to be reversed, and
 82 |         this method will swap the BOS/EOS tokens appropriately."""
 83 | 
 84 |         if split:
 85 |             word_ids = [
 86 |                 self.word_to_id(cur_word) for cur_word in sentence.split()
 87 |             ]
 88 |         else:
 89 |             word_ids = [self.word_to_id(cur_word) for cur_word in sentence]
 90 | 
 91 |         if reverse:
 92 |             return np.array([self.eos] + word_ids + [self.bos], dtype=np.int32)
 93 |         else:
 94 |             return np.array([self.bos] + word_ids + [self.eos], dtype=np.int32)
 95 | 
 96 | 
 97 | class UnicodeCharsVocabulary(Vocabulary):
 98 |     """Vocabulary containing character-level and word level information.
 99 | 
100 |     Has a word vocabulary that is used to lookup word ids and
101 |     a character id that is used to map words to arrays of character ids.
102 | 
103 |     The character ids are defined by ord(c) for c in word.encode('utf-8')
104 |     This limits the total number of possible char ids to 256.
105 |     To this we add 5 additional special ids: begin sentence, end sentence,
106 |     begin word, end word and padding.
107 | 
108 |     WARNING: for prediction, we add +1 to the output ids from this
109 |     class to create a special padding id (=0). As a result, we suggest
110 |     you use the `Batcher`, `TokenBatcher`, and `LMDataset` classes instead
111 |     of this lower level class. If you are using this lower level class,
112 |     then be sure to add the +1 appropriately, otherwise embeddings computed
113 |     from the pre-trained model will be useless.
114 | """ 115 | def __init__(self, filename, max_word_length, **kwargs): 116 | super(UnicodeCharsVocabulary, self).__init__(filename, **kwargs) 117 | self._max_word_length = max_word_length 118 | 119 | # char ids 0-255 come from utf-8 encoding bytes 120 | # assign 256-300 to special chars 121 | self.bos_char = 256 # 122 | self.eos_char = 257 # 123 | self.bow_char = 258 # 124 | self.eow_char = 259 # 125 | self.pad_char = 260 # 126 | 127 | num_words = len(self._id_to_word) 128 | 129 | self._word_char_ids = np.zeros([num_words, max_word_length], 130 | dtype=np.int32) 131 | 132 | # the charcter representation of the begin/end of sentence characters 133 | def _make_bos_eos(c): 134 | r = np.zeros([self.max_word_length], dtype=np.int32) 135 | r[:] = self.pad_char 136 | r[0] = self.bow_char 137 | r[1] = c 138 | r[2] = self.eow_char 139 | return r 140 | self.bos_chars = _make_bos_eos(self.bos_char) 141 | self.eos_chars = _make_bos_eos(self.eos_char) 142 | 143 | for i, word in enumerate(self._id_to_word): 144 | self._word_char_ids[i] = self._convert_word_to_char_ids(word) 145 | 146 | self._word_char_ids[self.bos] = self.bos_chars 147 | self._word_char_ids[self.eos] = self.eos_chars 148 | # TODO: properly handle 149 | 150 | @property 151 | def word_char_ids(self): 152 | return self._word_char_ids 153 | 154 | @property 155 | def max_word_length(self): 156 | return self._max_word_length 157 | 158 | def _convert_word_to_char_ids(self, word): 159 | code = np.zeros([self.max_word_length], dtype=np.int32) 160 | code[:] = self.pad_char 161 | 162 | word_encoded = word.encode('utf-8', 'ignore')[:(self.max_word_length-2)] 163 | code[0] = self.bow_char 164 | for k, chr_id in enumerate(word_encoded, start=1): 165 | code[k] = chr_id 166 | code[k + 1] = self.eow_char 167 | 168 | return code 169 | 170 | def word_to_char_ids(self, word): 171 | if word in self._word_to_id: 172 | return self._word_char_ids[self._word_to_id[word]] 173 | else: 174 | return self._convert_word_to_char_ids(word) 175 | 176 | def encode_chars(self, sentence, reverse=False, split=True): 177 | ''' 178 | Encode the sentence as a white space delimited string of tokens. 179 | ''' 180 | if split: 181 | chars_ids = [self.word_to_char_ids(cur_word) 182 | for cur_word in sentence.split()] 183 | else: 184 | chars_ids = [self.word_to_char_ids(cur_word) 185 | for cur_word in sentence] 186 | if reverse: 187 | return np.vstack([self.eos_chars] + chars_ids + [self.bos_chars]) 188 | else: 189 | return np.vstack([self.bos_chars] + chars_ids + [self.eos_chars]) 190 | 191 | 192 | class Batcher(object): 193 | ''' 194 | Batch sentences of tokenized text into character id matrices. 195 | ''' 196 | def __init__(self, lm_vocab_file: str, max_token_length: int): 197 | ''' 198 | lm_vocab_file = the language model vocabulary file (one line per 199 | token) 200 | max_token_length = the maximum number of characters in each token 201 | ''' 202 | self._lm_vocab = UnicodeCharsVocabulary( 203 | lm_vocab_file, max_token_length 204 | ) 205 | self._max_token_length = max_token_length 206 | 207 | def batch_sentences(self, sentences: List[List[str]]): 208 | ''' 209 | Batch the sentences as character ids 210 | Each sentence is a list of tokens without or , e.g. 
211 |         [['The', 'first', 'sentence', '.'], ['Second', '.']]
212 |         '''
213 |         n_sentences = len(sentences)
214 |         max_length = max(len(sentence) for sentence in sentences) + 2
215 | 
216 |         X_char_ids = np.zeros(
217 |             (n_sentences, max_length, self._max_token_length),
218 |             dtype=np.int64
219 |         )
220 | 
221 |         for k, sent in enumerate(sentences):
222 |             length = len(sent) + 2
223 |             char_ids_without_mask = self._lm_vocab.encode_chars(
224 |                 sent, split=False)
225 |             # add one so that 0 is the mask value
226 |             X_char_ids[k, :length, :] = char_ids_without_mask + 1
227 | 
228 |         return X_char_ids
229 | 
230 | 
231 | class TokenBatcher(object):
232 |     '''
233 |     Batch sentences of tokenized text into token id matrices.
234 |     '''
235 |     def __init__(self, lm_vocab_file: str):
236 |         '''
237 |         lm_vocab_file = the language model vocabulary file (one line per
238 |         token)
239 |         '''
240 |         self._lm_vocab = Vocabulary(lm_vocab_file)
241 | 
242 |     def batch_sentences(self, sentences: List[List[str]]):
243 |         '''
244 |         Batch the sentences as token ids
245 |         Each sentence is a list of tokens without <S> or </S>, e.g.
246 |         [['The', 'first', 'sentence', '.'], ['Second', '.']]
247 |         '''
248 |         n_sentences = len(sentences)
249 |         max_length = max(len(sentence) for sentence in sentences) + 2
250 | 
251 |         X_ids = np.zeros((n_sentences, max_length), dtype=np.int64)
252 | 
253 |         for k, sent in enumerate(sentences):
254 |             length = len(sent) + 2
255 |             ids_without_mask = self._lm_vocab.encode(sent, split=False)
256 |             # add one so that 0 is the mask value
257 |             X_ids[k, :length] = ids_without_mask + 1
258 | 
259 |         return X_ids
260 | 
261 | 
262 | ##### for training
263 | def _get_batch(generator, batch_size, num_steps, max_word_length):
264 |     """Read batches of input."""
265 |     cur_stream = [None] * batch_size
266 | 
267 |     no_more_data = False
268 |     while True:
269 |         inputs = np.zeros([batch_size, num_steps], np.int32)
270 |         if max_word_length is not None:
271 |             char_inputs = np.zeros([batch_size, num_steps, max_word_length],
272 |                                    np.int32)
273 |         else:
274 |             char_inputs = None
275 |         targets = np.zeros([batch_size, num_steps], np.int32)
276 | 
277 |         for i in range(batch_size):
278 |             cur_pos = 0
279 | 
280 |             while cur_pos < num_steps:
281 |                 if cur_stream[i] is None or len(cur_stream[i][0]) <= 1:
282 |                     try:
283 |                         cur_stream[i] = list(next(generator))
284 |                     except StopIteration:
285 |                         # No more data, exhaust current streams and quit
286 |                         no_more_data = True
287 |                         break
288 | 
289 |                 how_many = min(len(cur_stream[i][0]) - 1, num_steps - cur_pos)
290 |                 next_pos = cur_pos + how_many
291 | 
292 |                 inputs[i, cur_pos:next_pos] = cur_stream[i][0][:how_many]
293 |                 if max_word_length is not None:
294 |                     char_inputs[i, cur_pos:next_pos] = cur_stream[i][1][
295 |                                                                     :how_many]
296 |                 targets[i, cur_pos:next_pos] = cur_stream[i][0][1:how_many+1]
297 | 
298 |                 cur_pos = next_pos
299 | 
300 |                 cur_stream[i][0] = cur_stream[i][0][how_many:]
301 |                 if max_word_length is not None:
302 |                     cur_stream[i][1] = cur_stream[i][1][how_many:]
303 | 
304 |         if no_more_data:
305 |             # There is no more data. Note: this will not return data
306 |             # for the incomplete batch
307 |             break
308 | 
309 |         X = {'token_ids': inputs, 'tokens_characters': char_inputs,
310 |              'next_token_id': targets}
311 | 
312 |         yield X
313 | 
314 | class LMDataset(object):
315 |     """
316 |     Hold a language model dataset.
317 | 
318 |     A dataset is a list of tokenized files. Each file contains one sentence
319 |     per line. Each sentence is pre-tokenized and white space joined.
320 | """ 321 | def __init__(self, filepattern, vocab, reverse=False, test=False, 322 | shuffle_on_load=False): 323 | ''' 324 | filepattern = a glob string that specifies the list of files. 325 | vocab = an instance of Vocabulary or UnicodeCharsVocabulary 326 | reverse = if True, then iterate over tokens in each sentence in reverse 327 | test = if True, then iterate through all data once then stop. 328 | Otherwise, iterate forever. 329 | shuffle_on_load = if True, then shuffle the sentences after loading. 330 | ''' 331 | self._vocab = vocab 332 | self._all_shards = glob.glob(filepattern) 333 | print('Found %d shards at %s' % (len(self._all_shards), filepattern)) 334 | self._shards_to_choose = [] 335 | 336 | self._reverse = reverse 337 | self._test = test 338 | self._shuffle_on_load = shuffle_on_load 339 | self._use_char_inputs = hasattr(vocab, 'encode_chars') 340 | 341 | self._ids = self._load_random_shard() 342 | 343 | def _choose_random_shard(self): 344 | if len(self._shards_to_choose) == 0: 345 | self._shards_to_choose = list(self._all_shards) 346 | random.shuffle(self._shards_to_choose) 347 | shard_name = self._shards_to_choose.pop() 348 | return shard_name 349 | 350 | def _load_random_shard(self): 351 | """Randomly select a file and read it.""" 352 | if self._test: 353 | if len(self._all_shards) == 0: 354 | # we've loaded all the data 355 | # this will propogate up to the generator in get_batch 356 | # and stop iterating 357 | raise StopIteration 358 | else: 359 | shard_name = self._all_shards.pop() 360 | else: 361 | # just pick a random shard 362 | shard_name = self._choose_random_shard() 363 | 364 | ids = self._load_shard(shard_name) 365 | self._i = 0 366 | self._nids = len(ids) 367 | return ids 368 | 369 | def _load_shard(self, shard_name): 370 | """Read one file and convert to ids. 371 | 372 | Args: 373 | shard_name: file path. 374 | 375 | Returns: 376 | list of (id, char_id) tuples. 377 | """ 378 | print('Loading data from: %s' % shard_name) 379 | with open(shard_name) as f: 380 | sentences_raw = f.readlines() 381 | 382 | if self._reverse: 383 | sentences = [] 384 | for sentence in sentences_raw: 385 | splitted = sentence.split() 386 | splitted.reverse() 387 | sentences.append(' '.join(splitted)) 388 | else: 389 | sentences = sentences_raw 390 | 391 | if self._shuffle_on_load: 392 | random.shuffle(sentences) 393 | 394 | ids = [self.vocab.encode(sentence, self._reverse) 395 | for sentence in sentences] 396 | if self._use_char_inputs: 397 | chars_ids = [self.vocab.encode_chars(sentence, self._reverse) 398 | for sentence in sentences] 399 | else: 400 | chars_ids = [None] * len(ids) 401 | 402 | print('Loaded %d sentences.' 
% len(ids)) 403 | print('Finished loading') 404 | return list(zip(ids, chars_ids)) 405 | 406 | def get_sentence(self): 407 | while True: 408 | if self._i == self._nids: 409 | self._ids = self._load_random_shard() 410 | ret = self._ids[self._i] 411 | self._i += 1 412 | yield ret 413 | 414 | @property 415 | def max_word_length(self): 416 | if self._use_char_inputs: 417 | return self._vocab.max_word_length 418 | else: 419 | return None 420 | 421 | def iter_batches(self, batch_size, num_steps): 422 | for X in _get_batch(self.get_sentence(), batch_size, num_steps, 423 | self.max_word_length): 424 | 425 | # token_ids = (batch_size, num_steps) 426 | # char_inputs = (batch_size, num_steps, 50) of character ids 427 | # targets = word ID of next word (batch_size, num_steps) 428 | yield X 429 | 430 | @property 431 | def vocab(self): 432 | return self._vocab 433 | 434 | class BidirectionalLMDataset(object): 435 | def __init__(self, filepattern, vocab, test=False, shuffle_on_load=False): 436 | ''' 437 | bidirectional version of LMDataset 438 | ''' 439 | self._data_forward = LMDataset( 440 | filepattern, vocab, reverse=False, test=test, 441 | shuffle_on_load=shuffle_on_load) 442 | self._data_reverse = LMDataset( 443 | filepattern, vocab, reverse=True, test=test, 444 | shuffle_on_load=shuffle_on_load) 445 | 446 | def iter_batches(self, batch_size, num_steps): 447 | max_word_length = self._data_forward.max_word_length 448 | 449 | for X, Xr in zip( 450 | _get_batch(self._data_forward.get_sentence(), batch_size, 451 | num_steps, max_word_length), 452 | _get_batch(self._data_reverse.get_sentence(), batch_size, 453 | num_steps, max_word_length) 454 | ): 455 | 456 | for k, v in Xr.items(): 457 | X[k + '_reverse'] = v 458 | 459 | yield X 460 | 461 | 462 | class InvalidNumberOfCharacters(Exception): 463 | pass 464 | 465 | -------------------------------------------------------------------------------- /bilm-tf/bilm/elmo.py: -------------------------------------------------------------------------------- 1 | 2 | import tensorflow as tf 3 | 4 | def weight_layers(name, bilm_ops, l2_coef=None, 5 | use_top_only=False, do_layer_norm=False): 6 | ''' 7 | Weight the layers of a biLM with trainable scalar weights to 8 | compute ELMo representations. 9 | 10 | For each output layer, this returns two ops. The first computes 11 | a layer specific weighted average of the biLM layers, and 12 | the second the l2 regularizer loss term. 13 | The regularization terms are also add to tf.GraphKeys.REGULARIZATION_LOSSES 14 | 15 | Input: 16 | name = a string prefix used for the trainable variable names 17 | bilm_ops = the tensorflow ops returned to compute internal 18 | representations from a biLM. This is the return value 19 | from BidirectionalLanguageModel(...)(ids_placeholder) 20 | l2_coef: the l2 regularization coefficient $\lambda$. 21 | Pass None or 0.0 for no regularization. 22 | use_top_only: if True, then only use the top layer. 
23 | do_layer_norm: if True, then apply layer normalization to each biLM 24 | layer before normalizing 25 | 26 | Output: 27 | { 28 | 'weighted_op': op to compute weighted average for output, 29 | 'regularization_op': op to compute regularization term 30 | } 31 | ''' 32 | def _l2_regularizer(weights): 33 | if l2_coef is not None: 34 | return l2_coef * tf.reduce_sum(tf.square(weights)) 35 | else: 36 | return 0.0 37 | 38 | # Get ops for computing LM embeddings and mask 39 | lm_embeddings = bilm_ops['lm_embeddings'] 40 | mask = bilm_ops['mask'] 41 | 42 | n_lm_layers = int(lm_embeddings.get_shape()[1]) 43 | lm_dim = int(lm_embeddings.get_shape()[3]) 44 | 45 | with tf.control_dependencies([lm_embeddings, mask]): 46 | # Cast the mask and broadcast for layer use. 47 | mask_float = tf.cast(mask, 'float32') 48 | broadcast_mask = tf.expand_dims(mask_float, axis=-1) 49 | 50 | def _do_ln(x): 51 | # do layer normalization excluding the mask 52 | x_masked = x * broadcast_mask 53 | N = tf.reduce_sum(mask_float) * lm_dim 54 | mean = tf.reduce_sum(x_masked) / N 55 | variance = tf.reduce_sum(((x_masked - mean) * broadcast_mask)**2 56 | ) / N 57 | return tf.nn.batch_normalization( 58 | x, mean, variance, None, None, 1E-12 59 | ) 60 | 61 | if use_top_only: 62 | layers = tf.split(lm_embeddings, n_lm_layers, axis=1) 63 | # just the top layer 64 | sum_pieces = tf.squeeze(layers[-1], squeeze_dims=1) 65 | # no regularization 66 | reg = 0.0 67 | else: 68 | W = tf.get_variable( 69 | '{}_ELMo_W'.format(name), 70 | shape=(n_lm_layers, ), 71 | initializer=tf.zeros_initializer, 72 | regularizer=_l2_regularizer, 73 | trainable=True, 74 | ) 75 | 76 | # normalize the weights 77 | normed_weights = tf.split( 78 | tf.nn.softmax(W + 1.0 / n_lm_layers), n_lm_layers 79 | ) 80 | # split LM layers 81 | layers = tf.split(lm_embeddings, n_lm_layers, axis=1) 82 | 83 | # compute the weighted, normalized LM activations 84 | pieces = [] 85 | for w, t in zip(normed_weights, layers): 86 | if do_layer_norm: 87 | pieces.append(w * _do_ln(tf.squeeze(t, squeeze_dims=1))) 88 | else: 89 | pieces.append(w * tf.squeeze(t, squeeze_dims=1)) 90 | sum_pieces = tf.add_n(pieces) 91 | 92 | # get the regularizer 93 | reg = [ 94 | r for r in tf.get_collection( 95 | tf.GraphKeys.REGULARIZATION_LOSSES) 96 | if r.name.find('{}_ELMo_W/'.format(name)) >= 0 97 | ] 98 | if len(reg) != 1: 99 | raise ValueError 100 | 101 | # scale the weighted sum by gamma 102 | gamma = tf.get_variable( 103 | '{}_ELMo_gamma'.format(name), 104 | shape=(1, ), 105 | initializer=tf.ones_initializer, 106 | regularizer=None, 107 | trainable=True, 108 | ) 109 | weighted_lm_layers = sum_pieces * gamma 110 | 111 | ret = {'weighted_op': weighted_lm_layers, 'regularization_op': reg} 112 | 113 | return ret 114 | 115 | -------------------------------------------------------------------------------- /bilm-tf/bilm/model.py: -------------------------------------------------------------------------------- 1 | 2 | import json 3 | 4 | import h5py 5 | import numpy as np 6 | import tensorflow as tf 7 | 8 | from .data import UnicodeCharsVocabulary, Batcher 9 | 10 | DTYPE = 'float32' 11 | DTYPE_INT = 'int64' 12 | 13 | 14 | class BidirectionalLanguageModel(object): 15 | def __init__( 16 | self, 17 | options_file: str, 18 | weight_file: str, 19 | use_character_inputs=True, 20 | embedding_weight_file=None, 21 | max_batch_size=128, 22 | ): 23 | ''' 24 | Creates the language model computational graph and loads weights 25 | 26 | Two options for input type: 27 | (1) To use character inputs (paired with Batcher) 
28 | pass use_character_inputs=True, and ids_placeholder 29 | of shape (None, None, max_characters_per_token) 30 | to __call__ 31 | (2) To use token ids as input (paired with TokenBatcher), 32 | pass use_character_inputs=False and ids_placeholder 33 | of shape (None, None) to __call__. 34 | In this case, embedding_weight_file is also required input 35 | 36 | options_file: location of the json formatted file with 37 | LM hyperparameters 38 | weight_file: location of the hdf5 file with LM weights 39 | use_character_inputs: if True, then use character ids as input, 40 | otherwise use token ids 41 | max_batch_size: the maximum allowable batch size 42 | ''' 43 | with open(options_file, 'r') as fin: 44 | options = json.load(fin) 45 | 46 | if not use_character_inputs: 47 | if embedding_weight_file is None: 48 | raise ValueError( 49 | "embedding_weight_file is required input with " 50 | "not use_character_inputs" 51 | ) 52 | 53 | self._options = options 54 | self._weight_file = weight_file 55 | self._embedding_weight_file = embedding_weight_file 56 | self._use_character_inputs = use_character_inputs 57 | self._max_batch_size = max_batch_size 58 | 59 | self._ops = {} 60 | self._graphs = {} 61 | 62 | def __call__(self, ids_placeholder): 63 | ''' 64 | Given the input character ids (or token ids), returns a dictionary 65 | with tensorflow ops: 66 | 67 | {'lm_embeddings': embedding_op, 68 | 'lengths': sequence_lengths_op, 69 | 'mask': op to compute mask} 70 | 71 | embedding_op computes the LM embeddings and is shape 72 | (None, 3, None, 1024) 73 | lengths_op computes the sequence lengths and is shape (None, ) 74 | mask computes the sequence mask and is shape (None, None) 75 | 76 | ids_placeholder: a tf.placeholder of type int32. 77 | If use_character_inputs=True, it is shape 78 | (None, None, max_characters_per_token) and holds the input 79 | character ids for a batch 80 | If use_character_input=False, it is shape (None, None) and 81 | holds the input token ids for a batch 82 | ''' 83 | if ids_placeholder in self._ops: 84 | # have already created ops for this placeholder, just return them 85 | ret = self._ops[ids_placeholder] 86 | 87 | else: 88 | # need to create the graph 89 | if len(self._ops) == 0: 90 | # first time creating the graph, don't reuse variables 91 | lm_graph = BidirectionalLanguageModelGraph( 92 | self._options, 93 | self._weight_file, 94 | ids_placeholder, 95 | embedding_weight_file=self._embedding_weight_file, 96 | use_character_inputs=self._use_character_inputs, 97 | max_batch_size=self._max_batch_size) 98 | else: 99 | with tf.variable_scope('', reuse=True): 100 | lm_graph = BidirectionalLanguageModelGraph( 101 | self._options, 102 | self._weight_file, 103 | ids_placeholder, 104 | embedding_weight_file=self._embedding_weight_file, 105 | use_character_inputs=self._use_character_inputs, 106 | max_batch_size=self._max_batch_size) 107 | 108 | ops = self._build_ops(lm_graph) 109 | self._ops[ids_placeholder] = ops 110 | self._graphs[ids_placeholder] = lm_graph 111 | ret = ops 112 | 113 | return ret 114 | 115 | def _build_ops(self, lm_graph): 116 | with tf.control_dependencies([lm_graph.update_state_op]): 117 | # get the LM embeddings 118 | token_embeddings = lm_graph.embedding 119 | layers = [ 120 | tf.concat([token_embeddings, token_embeddings], axis=2) 121 | ] 122 | 123 | n_lm_layers = len(lm_graph.lstm_outputs['forward']) 124 | for i in range(n_lm_layers): 125 | layers.append( 126 | tf.concat( 127 | [lm_graph.lstm_outputs['forward'][i], 128 | lm_graph.lstm_outputs['backward'][i]], 129 | 
axis=-1 130 | ) 131 | ) 132 | 133 | # The layers include the BOS/EOS tokens. Remove them 134 | sequence_length_wo_bos_eos = lm_graph.sequence_lengths - 2 135 | layers_without_bos_eos = [] 136 | for layer in layers: 137 | layer_wo_bos_eos = layer[:, 1:, :] 138 | layer_wo_bos_eos = tf.reverse_sequence( 139 | layer_wo_bos_eos, 140 | lm_graph.sequence_lengths - 1, 141 | seq_axis=1, 142 | batch_axis=0, 143 | ) 144 | layer_wo_bos_eos = layer_wo_bos_eos[:, 1:, :] 145 | layer_wo_bos_eos = tf.reverse_sequence( 146 | layer_wo_bos_eos, 147 | sequence_length_wo_bos_eos, 148 | seq_axis=1, 149 | batch_axis=0, 150 | ) 151 | layers_without_bos_eos.append(layer_wo_bos_eos) 152 | 153 | # concatenate the layers 154 | lm_embeddings = tf.concat( 155 | [tf.expand_dims(t, axis=1) for t in layers_without_bos_eos], 156 | axis=1 157 | ) 158 | 159 | # get the mask op without bos/eos. 160 | # tf doesn't support reversing boolean tensors, so cast 161 | # to int then back 162 | mask_wo_bos_eos = tf.cast(lm_graph.mask[:, 1:], 'int32') 163 | mask_wo_bos_eos = tf.reverse_sequence( 164 | mask_wo_bos_eos, 165 | lm_graph.sequence_lengths - 1, 166 | seq_axis=1, 167 | batch_axis=0, 168 | ) 169 | mask_wo_bos_eos = mask_wo_bos_eos[:, 1:] 170 | mask_wo_bos_eos = tf.reverse_sequence( 171 | mask_wo_bos_eos, 172 | sequence_length_wo_bos_eos, 173 | seq_axis=1, 174 | batch_axis=0, 175 | ) 176 | mask_wo_bos_eos = tf.cast(mask_wo_bos_eos, 'bool') 177 | 178 | return { 179 | 'lm_embeddings': lm_embeddings, 180 | 'lengths': sequence_length_wo_bos_eos, 181 | 'token_embeddings': lm_graph.embedding, 182 | 'mask': mask_wo_bos_eos, 183 | } 184 | 185 | 186 | def _pretrained_initializer(varname, weight_file, embedding_weight_file=None): 187 | ''' 188 | We'll stub out all the initializers in the pretrained LM with 189 | a function that loads the weights from the file 190 | ''' 191 | weight_name_map = {} 192 | for i in range(2): 193 | for j in range(8): # if we decide to add more layers 194 | root = 'RNN_{}/RNN/MultiRNNCell/Cell{}'.format(i, j) 195 | weight_name_map[root + '/rnn/lstm_cell/kernel'] = \ 196 | root + '/LSTMCell/W_0' 197 | weight_name_map[root + '/rnn/lstm_cell/bias'] = \ 198 | root + '/LSTMCell/B' 199 | weight_name_map[root + '/rnn/lstm_cell/projection/kernel'] = \ 200 | root + '/LSTMCell/W_P_0' 201 | 202 | # convert the graph name to that in the checkpoint 203 | varname_in_file = varname[5:] 204 | if varname_in_file.startswith('RNN'): 205 | varname_in_file = weight_name_map[varname_in_file] 206 | 207 | if varname_in_file == 'embedding': 208 | with h5py.File(embedding_weight_file, 'r') as fin: 209 | # Have added a special 0 index for padding not present 210 | # in the original model. 211 | embed_weights = fin[varname_in_file][...] 212 | weights = np.zeros( 213 | (embed_weights.shape[0] + 1, embed_weights.shape[1]), 214 | dtype=DTYPE 215 | ) 216 | weights[1:, :] = embed_weights 217 | else: 218 | with h5py.File(weight_file, 'r') as fin: 219 | if varname_in_file == 'char_embed': 220 | # Have added a special 0 index for padding not present 221 | # in the original model. 222 | char_embed_weights = fin[varname_in_file][...] 223 | weights = np.zeros( 224 | (char_embed_weights.shape[0] + 1, 225 | char_embed_weights.shape[1]), 226 | dtype=DTYPE 227 | ) 228 | weights[1:, :] = char_embed_weights 229 | else: 230 | weights = fin[varname_in_file][...] 
231 | 232 | # Tensorflow initializers are callables that accept a shape parameter 233 | # and some optional kwargs 234 | def ret(shape, **kwargs): 235 | if list(shape) != list(weights.shape): 236 | raise ValueError( 237 | "Invalid shape initializing {0}, got {1}, expected {2}".format( 238 | varname_in_file, shape, weights.shape) 239 | ) 240 | return weights 241 | 242 | return ret 243 | 244 | 245 | class BidirectionalLanguageModelGraph(object): 246 | ''' 247 | Creates the computational graph and holds the ops necessary for running 248 | a bidirectional language model 249 | ''' 250 | def __init__(self, options, weight_file, ids_placeholder, 251 | use_character_inputs=True, embedding_weight_file=None, 252 | max_batch_size=128): 253 | 254 | self.options = options 255 | self._max_batch_size = max_batch_size 256 | self.ids_placeholder = ids_placeholder 257 | self.use_character_inputs = use_character_inputs 258 | 259 | # this custom_getter will make all variables not trainable and 260 | # override the default initializer 261 | def custom_getter(getter, name, *args, **kwargs): 262 | kwargs['trainable'] = False 263 | for i in range(0, 3): 264 | try: 265 | kwargs['initializer'] = _pretrained_initializer( 266 | name, weight_file, embedding_weight_file 267 | ) 268 | except: 269 | continue 270 | else: 271 | break 272 | 273 | return getter(name, *args, **kwargs) 274 | 275 | if embedding_weight_file is not None: 276 | # get the vocab size 277 | with h5py.File(embedding_weight_file, 'r') as fin: 278 | # +1 for padding 279 | self._n_tokens_vocab = fin['embedding'].shape[0] + 1 280 | else: 281 | self._n_tokens_vocab = None 282 | 283 | with tf.variable_scope('bilm', custom_getter=custom_getter): 284 | self._build() 285 | 286 | def _build(self): 287 | if self.use_character_inputs: 288 | self._build_word_char_embeddings() 289 | else: 290 | self._build_word_embeddings() 291 | self._build_lstms() 292 | 293 | def _build_word_char_embeddings(self): 294 | ''' 295 | options contains key 'char_cnn': { 296 | 297 | 'n_characters': 262, 298 | 299 | # includes the start / end characters 300 | 'max_characters_per_token': 50, 301 | 302 | 'filters': [ 303 | [1, 32], 304 | [2, 32], 305 | [3, 64], 306 | [4, 128], 307 | [5, 256], 308 | [6, 512], 309 | [7, 512] 310 | ], 311 | 'activation': 'tanh', 312 | 313 | # for the character embedding 314 | 'embedding': {'dim': 16} 315 | 316 | # for highway layers 317 | # if omitted, then no highway layers 318 | 'n_highway': 2, 319 | } 320 | ''' 321 | projection_dim = self.options['lstm']['projection_dim'] 322 | 323 | cnn_options = self.options['char_cnn'] 324 | filters = cnn_options['filters'] 325 | n_filters = sum(f[1] for f in filters) 326 | max_chars = cnn_options['max_characters_per_token'] 327 | char_embed_dim = cnn_options['embedding']['dim'] 328 | n_chars = cnn_options['n_characters'] 329 | if n_chars != 262: 330 | raise InvalidNumberOfCharacters( 331 | "Set n_characters=262 after training; see the README.md" 332 | ) 333 | if cnn_options['activation'] == 'tanh': 334 | activation = tf.nn.tanh 335 | elif cnn_options['activation'] == 'relu': 336 | activation = tf.nn.relu 337 | 338 | # the character embeddings 339 | with tf.device("/gpu:0"): 340 | self.embedding_weights = tf.get_variable( 341 | "char_embed", [n_chars, char_embed_dim], 342 | dtype=DTYPE, 343 | initializer=tf.random_uniform_initializer(-1.0, 1.0) 344 | ) 345 | # shape (batch_size, unroll_steps, max_chars, embed_dim) 346 | self.char_embedding = tf.nn.embedding_lookup(self.embedding_weights, 347 | self.ids_placeholder) 348 |
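The `custom_getter` in `__init__` above is the standard TF 1.x hook for intercepting every `tf.get_variable` call inside a scope. A toy, self-contained sketch of the same pattern (hypothetical names; zeros stand in for the real pretrained weights):

```
import numpy as np
import tensorflow as tf

def frozen_getter(getter, name, *args, **kwargs):
    # force every variable in the scope to be non-trainable and to come from
    # our own initializer callable (shape -> ndarray), mirroring how
    # _pretrained_initializer is wired in above
    kwargs['trainable'] = False
    kwargs['initializer'] = lambda shape, **unused: np.zeros(shape, dtype=np.float32)
    return getter(name, *args, **kwargs)

with tf.variable_scope('demo', custom_getter=frozen_getter):
    w = tf.get_variable('w', shape=[2, 3], dtype=tf.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(w))               # all zeros, from the custom initializer
    print(tf.trainable_variables())  # 'demo/w' is absent
```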
349 | # the convolutions 350 | def make_convolutions(inp): 351 | with tf.variable_scope('CNN') as scope: 352 | convolutions = [] 353 | for i, (width, num) in enumerate(filters): 354 | if cnn_options['activation'] == 'relu': 355 | # He initialization for ReLU activation 356 | # with char embeddings init between -1 and 1 357 | #w_init = tf.random_normal_initializer( 358 | # mean=0.0, 359 | # stddev=np.sqrt(2.0 / (width * char_embed_dim)) 360 | #) 361 | 362 | # Kim et al 2015, +/- 0.05 363 | w_init = tf.random_uniform_initializer( 364 | minval=-0.05, maxval=0.05) 365 | elif cnn_options['activation'] == 'tanh': 366 | # glorot init 367 | w_init = tf.random_normal_initializer( 368 | mean=0.0, 369 | stddev=np.sqrt(1.0 / (width * char_embed_dim)) 370 | ) 371 | w = tf.get_variable( 372 | "W_cnn_%s" % i, 373 | [1, width, char_embed_dim, num], 374 | initializer=w_init, 375 | dtype=DTYPE) 376 | b = tf.get_variable( 377 | "b_cnn_%s" % i, [num], dtype=DTYPE, 378 | initializer=tf.constant_initializer(0.0)) 379 | 380 | conv = tf.nn.conv2d( 381 | inp, w, 382 | strides=[1, 1, 1, 1], 383 | padding="VALID") + b 384 | # now max pool 385 | conv = tf.nn.max_pool( 386 | conv, [1, 1, max_chars-width+1, 1], 387 | [1, 1, 1, 1], 'VALID') 388 | 389 | # activation 390 | conv = activation(conv) 391 | conv = tf.squeeze(conv, squeeze_dims=[2]) 392 | 393 | convolutions.append(conv) 394 | 395 | return tf.concat(convolutions, 2) 396 | 397 | embedding = make_convolutions(self.char_embedding) 398 | 399 | # for highway and projection layers 400 | n_highway = cnn_options.get('n_highway') 401 | use_highway = n_highway is not None and n_highway > 0 402 | use_proj = n_filters != projection_dim 403 | 404 | if use_highway or use_proj: 405 | # reshape from (batch_size, n_tokens, dim) to (-1, dim) 406 | batch_size_n_tokens = tf.shape(embedding)[0:2] 407 | embedding = tf.reshape(embedding, [-1, n_filters]) 408 | 409 | # set up weights for projection 410 | if use_proj: 411 | assert n_filters > projection_dim 412 | with tf.variable_scope('CNN_proj') as scope: 413 | W_proj_cnn = tf.get_variable( 414 | "W_proj", [n_filters, projection_dim], 415 | initializer=tf.random_normal_initializer( 416 | mean=0.0, stddev=np.sqrt(1.0 / n_filters)), 417 | dtype=DTYPE) 418 | b_proj_cnn = tf.get_variable( 419 | "b_proj", [projection_dim], 420 | initializer=tf.constant_initializer(0.0), 421 | dtype=DTYPE) 422 | 423 | # apply highway layers 424 | def high(x, ww_carry, bb_carry, ww_tr, bb_tr): 425 | carry_gate = tf.nn.sigmoid(tf.matmul(x, ww_carry) + bb_carry) 426 | transform_gate = tf.nn.relu(tf.matmul(x, ww_tr) + bb_tr) 427 | return carry_gate * transform_gate + (1.0 - carry_gate) * x 428 | 429 | if use_highway: 430 | highway_dim = n_filters 431 | 432 | for i in range(n_highway): 433 | with tf.variable_scope('CNN_high_%s' % i) as scope: 434 | W_carry = tf.get_variable( 435 | 'W_carry', [highway_dim, highway_dim], 436 | # glorot init 437 | initializer=tf.random_normal_initializer( 438 | mean=0.0, stddev=np.sqrt(1.0 / highway_dim)), 439 | dtype=DTYPE) 440 | b_carry = tf.get_variable( 441 | 'b_carry', [highway_dim], 442 | initializer=tf.constant_initializer(-2.0), 443 | dtype=DTYPE) 444 | W_transform = tf.get_variable( 445 | 'W_transform', [highway_dim, highway_dim], 446 | initializer=tf.random_normal_initializer( 447 | mean=0.0, stddev=np.sqrt(1.0 / highway_dim)), 448 | dtype=DTYPE) 449 | b_transform = tf.get_variable( 450 | 'b_transform', [highway_dim], 451 | initializer=tf.constant_initializer(0.0), 452 | dtype=DTYPE) 453 | 454 | embedding =
high(embedding, W_carry, b_carry, 455 | W_transform, b_transform) 456 | 457 | # finally project down if needed 458 | if use_proj: 459 | embedding = tf.matmul(embedding, W_proj_cnn) + b_proj_cnn 460 | 461 | # reshape back to (batch_size, tokens, dim) 462 | if use_highway or use_proj: 463 | shp = tf.concat([batch_size_n_tokens, [projection_dim]], axis=0) 464 | embedding = tf.reshape(embedding, shp) 465 | 466 | # at last assign attributes for remainder of the model 467 | self.embedding = embedding 468 | 469 | 470 | def _build_word_embeddings(self): 471 | projection_dim = self.options['lstm']['projection_dim'] 472 | 473 | # the word embeddings 474 | with tf.device("/gpu:0"): 475 | self.embedding_weights = tf.get_variable( 476 | "embedding", [self._n_tokens_vocab, projection_dim], 477 | dtype=DTYPE, 478 | ) 479 | self.embedding = tf.nn.embedding_lookup(self.embedding_weights, 480 | self.ids_placeholder) 481 | 482 | 483 | def _build_lstms(self): 484 | # now the LSTMs 485 | # these will collect the initial states for the forward 486 | # (and reverse LSTMs if we are doing bidirectional) 487 | 488 | # parse the options 489 | lstm_dim = self.options['lstm']['dim'] 490 | projection_dim = self.options['lstm']['projection_dim'] 491 | n_lstm_layers = self.options['lstm'].get('n_layers', 1) 492 | cell_clip = self.options['lstm'].get('cell_clip') 493 | proj_clip = self.options['lstm'].get('proj_clip') 494 | use_skip_connections = self.options['lstm']['use_skip_connections'] 495 | if use_skip_connections: 496 | print("USING SKIP CONNECTIONS") 497 | else: 498 | print("NOT USING SKIP CONNECTIONS") 499 | 500 | # the sequence lengths from input mask 501 | if self.use_character_inputs: 502 | mask = tf.reduce_any(self.ids_placeholder > 0, axis=2) 503 | else: 504 | mask = self.ids_placeholder > 0 505 | sequence_lengths = tf.reduce_sum(tf.cast(mask, tf.int32), axis=1) 506 | batch_size = tf.shape(sequence_lengths)[0] 507 | 508 | # for each direction, we'll store tensors for each layer 509 | self.lstm_outputs = {'forward': [], 'backward': []} 510 | self.lstm_state_sizes = {'forward': [], 'backward': []} 511 | self.lstm_init_states = {'forward': [], 'backward': []} 512 | self.lstm_final_states = {'forward': [], 'backward': []} 513 | 514 | update_ops = [] 515 | for direction in ['forward', 'backward']: 516 | if direction == 'forward': 517 | layer_input = self.embedding 518 | else: 519 | layer_input = tf.reverse_sequence( 520 | self.embedding, 521 | sequence_lengths, 522 | seq_axis=1, 523 | batch_axis=0 524 | ) 525 | 526 | for i in range(n_lstm_layers): 527 | if projection_dim < lstm_dim: 528 | # are projecting down output 529 | lstm_cell = tf.nn.rnn_cell.LSTMCell( 530 | lstm_dim, num_proj=projection_dim, 531 | cell_clip=cell_clip, proj_clip=proj_clip) 532 | else: 533 | lstm_cell = tf.nn.rnn_cell.LSTMCell( 534 | lstm_dim, 535 | cell_clip=cell_clip, proj_clip=proj_clip) 536 | 537 | if use_skip_connections: 538 | # ResidualWrapper adds inputs to outputs 539 | if i == 0: 540 | # don't add skip connection from token embedding to 541 | # 1st layer output 542 | pass 543 | else: 544 | # add a skip connection 545 | lstm_cell = tf.nn.rnn_cell.ResidualWrapper(lstm_cell) 546 | 547 | # collect the input state, run the dynamic rnn, collect 548 | # the output 549 | state_size = lstm_cell.state_size 550 | # the LSTMs are stateful. 
To support multiple batch sizes, 551 | # we'll allocate size for states up to max_batch_size, 552 | # then use the first batch_size entries for each batch 553 | init_states = [ 554 | tf.Variable( 555 | tf.zeros([self._max_batch_size, dim]), 556 | trainable=False 557 | ) 558 | for dim in lstm_cell.state_size 559 | ] 560 | batch_init_states = [ 561 | state[:batch_size, :] for state in init_states 562 | ] 563 | 564 | if direction == 'forward': 565 | i_direction = 0 566 | else: 567 | i_direction = 1 568 | variable_scope_name = 'RNN_{0}/RNN/MultiRNNCell/Cell{1}'.format( 569 | i_direction, i) 570 | with tf.variable_scope(variable_scope_name): 571 | layer_output, final_state = tf.nn.dynamic_rnn( 572 | lstm_cell, 573 | layer_input, 574 | sequence_length=sequence_lengths, 575 | initial_state=tf.nn.rnn_cell.LSTMStateTuple( 576 | *batch_init_states), 577 | ) 578 | 579 | self.lstm_state_sizes[direction].append(lstm_cell.state_size) 580 | self.lstm_init_states[direction].append(init_states) 581 | self.lstm_final_states[direction].append(final_state) 582 | if direction == 'forward': 583 | self.lstm_outputs[direction].append(layer_output) 584 | else: 585 | self.lstm_outputs[direction].append( 586 | tf.reverse_sequence( 587 | layer_output, 588 | sequence_lengths, 589 | seq_axis=1, 590 | batch_axis=0 591 | ) 592 | ) 593 | 594 | with tf.control_dependencies([layer_output]): 595 | # update the initial states 596 | for i in range(2): 597 | new_state = tf.concat( 598 | [final_state[i][:batch_size, :], 599 | init_states[i][batch_size:, :]], axis=0) 600 | state_update_op = tf.assign(init_states[i], new_state) 601 | update_ops.append(state_update_op) 602 | 603 | layer_input = layer_output 604 | 605 | self.mask = mask 606 | self.sequence_lengths = sequence_lengths 607 | self.update_state_op = tf.group(*update_ops) 608 | 609 | 610 | def dump_token_embeddings(vocab_file, options_file, weight_file, outfile): 611 | ''' 612 | Given an input vocabulary file, dump all the token embeddings to the 613 | outfile. The result can be used as the embedding_weight_file when 614 | constructing a BidirectionalLanguageModel. 
615 | ''' 616 | with open(options_file, 'r') as fin: 617 | options = json.load(fin) 618 | max_word_length = options['char_cnn']['max_characters_per_token'] 619 | 620 | vocab = UnicodeCharsVocabulary(vocab_file, max_word_length) 621 | batcher = Batcher(vocab_file, max_word_length) 622 | 623 | ids_placeholder = tf.placeholder('int32', 624 | shape=(None, None, max_word_length) 625 | ) 626 | model = BidirectionalLanguageModel(options_file, weight_file) 627 | embedding_op = model(ids_placeholder)['token_embeddings'] 628 | 629 | n_tokens = vocab.size 630 | embed_dim = int(embedding_op.shape[2]) 631 | 632 | embeddings = np.zeros((n_tokens, embed_dim), dtype=DTYPE) 633 | 634 | config = tf.ConfigProto(allow_soft_placement=True) 635 | with tf.Session(config=config) as sess: 636 | sess.run(tf.global_variables_initializer()) 637 | for k in range(n_tokens): 638 | token = vocab.id_to_word(k) 639 | char_ids = batcher.batch_sentences([[token]])[0, 1, :].reshape( 640 | 1, 1, -1) 641 | embeddings[k, :] = sess.run( 642 | embedding_op, feed_dict={ids_placeholder: char_ids} 643 | ) 644 | 645 | with h5py.File(outfile, 'w') as fout: 646 | ds = fout.create_dataset( 647 | 'embedding', embeddings.shape, dtype='float32', data=embeddings 648 | ) 649 | 650 | 651 | def dump_bilm_embeddings(vocab_file, sentences, options_file, 652 | weight_file): 653 | with open(options_file, 'r') as fin: 654 | options = json.load(fin) 655 | max_word_length = options['char_cnn']['max_characters_per_token'] 656 | 657 | batcher = Batcher(vocab_file, max_word_length) 658 | 659 | ids_placeholder = tf.placeholder('int32', 660 | shape=(None, None, max_word_length) 661 | ) 662 | model = BidirectionalLanguageModel(options_file, weight_file) 663 | ops = model(ids_placeholder) 664 | config = tf.ConfigProto(allow_soft_placement=True) 665 | config.gpu_options.allow_growth = True 666 | ret_map = {} 667 | with tf.Session(config=config) as sess: 668 | sess.run(tf.global_variables_initializer()) 669 | sentence_id = 0 670 | for sentence in sentences: 671 | tokens = sentence.strip().split() 672 | char_ids = batcher.batch_sentences([tokens]) 673 | embeddings = sess.run( 674 | ops['lm_embeddings'], feed_dict={ids_placeholder: char_ids} 675 | ) 676 | ret_map[sentence_id] = embeddings[0] 677 | sentence_id += 1 678 | return ret_map 679 | 680 | 681 | def dump_bilm_embeddings_inner(vocab_file, line, options_file, 682 | weight_file): 683 | with open(options_file, 'r') as fin: 684 | options = json.load(fin) 685 | max_word_length = options['char_cnn']['max_characters_per_token'] 686 | 687 | batcher = Batcher(vocab_file, max_word_length) 688 | 689 | ids_placeholder = tf.placeholder('int32', 690 | shape=(None, None, max_word_length) 691 | ) 692 | model = BidirectionalLanguageModel(options_file, weight_file) 693 | ops = model(ids_placeholder) 694 | config = tf.ConfigProto(allow_soft_placement=True) 695 | with tf.Session(config=config) as sess: 696 | sess.run(tf.global_variables_initializer()) 697 | sentence = line.strip().split() 698 | char_ids = batcher.batch_sentences([sentence]) 699 | embeddings = sess.run( 700 | ops['lm_embeddings'], feed_dict={ids_placeholder: char_ids} 701 | ) 702 | return embeddings[0] 703 | 704 | 705 | def initialize_sess(vocab_file, options_file, weight_file): 706 | with open(options_file, 'r') as fin: 707 | options = json.load(fin) 708 | max_word_length = options['char_cnn']['max_characters_per_token'] 709 | batcher = Batcher(vocab_file, max_word_length) 710 | ids_placeholder = tf.placeholder('int32', 711 | shape=(None, None, 
max_word_length) 712 | ) 713 | model = BidirectionalLanguageModel(options_file, weight_file) 714 | ops = model(ids_placeholder) 715 | config = tf.ConfigProto(allow_soft_placement=True) 716 | config.gpu_options.allow_growth = True 717 | sess = tf.Session(config=config) 718 | sess.run(tf.global_variables_initializer()) 719 | return batcher, ids_placeholder, ops, sess 720 | -------------------------------------------------------------------------------- /bilm-tf/setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import setuptools 3 | 4 | setuptools.setup( 5 | name='bilm', 6 | version='0.1', 7 | url='http://github.com/allenai/bilm-tf', 8 | packages=setuptools.find_packages(), 9 | tests_require=[], 10 | zip_safe=False, 11 | entry_points='', 12 | ) 13 | 14 | -------------------------------------------------------------------------------- /cache.py: -------------------------------------------------------------------------------- 1 | import hashlib 2 | import pickle 3 | import sqlite3 4 | import time 5 | 6 | from flask import g 7 | 8 | 9 | class ServerCache: 10 | 11 | CLEANUP_THRESHOLD = 10000 12 | 13 | def __init__(self): 14 | self.added_count = 0 15 | self.initialized = False 16 | 17 | @staticmethod 18 | def compute_sig(sentence): 19 | key_val = str(sentence.get_sent_str() + "|||" + sentence.get_mention_surface() + "|||" + sentence.inference_signature).encode('utf-8') 20 | return hashlib.sha224(key_val).hexdigest() 21 | 22 | @staticmethod 23 | def get_mem_db(): 24 | if 'mem_db' not in g: 25 | g.mem_db = sqlite3.connect("./shared_cache.db") 26 | return g.mem_db 27 | 28 | def initialize_cache(self): 29 | db = ServerCache.get_mem_db() 30 | cursor = db.cursor() 31 | cursor.execute("DROP TABLE IF EXISTS memcache") 32 | cursor.execute("CREATE TABLE IF NOT EXISTS memcache (key TEXT PRIMARY KEY, value BLOB, time INTEGER)") 33 | db.commit() 34 | self.added_count = 0 35 | self.initialized = True 36 | 37 | def query_cache(self, sentence): 38 | if not self.initialized: 39 | self.initialize_cache() 40 | db = ServerCache.get_mem_db() 41 | cursor = db.cursor() 42 | key = ServerCache.compute_sig(sentence) 43 | cursor.execute("SELECT value FROM memcache WHERE key=?", [key]) 44 | data = cursor.fetchone() 45 | if data is None: 46 | return None 47 | else: 48 | result_binary = data[0] 49 | return pickle.loads(result_binary) 50 | 51 | def insert_cache(self, sentence): 52 | if not self.initialized: 53 | self.initialize_cache() 54 | db = ServerCache.get_mem_db() 55 | cursor = db.cursor() 56 | key = ServerCache.compute_sig(sentence) 57 | current_timestamp = int(time.time()) 58 | data = pickle.dumps(sentence) 59 | cursor.execute("INSERT INTO memcache VALUES (?, ?, ?)", [key, data, current_timestamp]) 60 | db.commit() 61 | self.added_count += 1 62 | if self.added_count > self.CLEANUP_THRESHOLD: 63 | self.initialize_cache() 64 | 65 | 66 | class SurfaceCache: 67 | def __init__(self, cache_file, server_mode=True): 68 | self.cache_file = cache_file 69 | self.server_mode = server_mode 70 | if not self.server_mode: 71 | self.surface_db = sqlite3.connect(self.cache_file) 72 | 73 | def get_surface_db(self): 74 | if self.server_mode: 75 | if 'surface_db' not in g: 76 | g.surface_db = sqlite3.connect(self.cache_file) 77 | return g.surface_db 78 | else: 79 | return self.surface_db 80 | 81 | def initialize_cache(self): 82 | db = self.get_surface_db() 83 | cursor = db.cursor() 84 | cursor.execute("CREATE TABLE IF NOT EXISTS cache (surface TEXT PRIMARY KEY, types BLOB)") 
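        # the cache table maps a lowercased mention surface string to a pickled
        # {type: count} dict; query_cache below sorts that dict by count and
        # returns the top `limit` type strings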
85 | db.commit() 86 | 87 | def query_cache(self, surface, limit=5): 88 | self.initialize_cache() 89 | surface = str(surface).lower() 90 | db = self.get_surface_db() 91 | cursor = db.cursor() 92 | cursor.execute("SELECT types FROM cache WHERE surface=?", [surface]) 93 | data = cursor.fetchone() 94 | if data is None: 95 | return None 96 | else: 97 | result_binary = data[0] 98 | cache_dict = sorted((pickle.loads(result_binary)).items(), key=lambda x: x[1], reverse=True) 99 | ret = [] 100 | for i in range(0, min(limit, len(cache_dict))): 101 | ret.append(cache_dict[i][0]) 102 | return ret 103 | 104 | def insert_cache(self, sentence): 105 | self.initialize_cache() 106 | surface = sentence.get_mention_surface().lower() 107 | db = self.get_surface_db() 108 | cursor = db.cursor() 109 | cursor.execute("SELECT types FROM cache WHERE surface=?", [surface]) 110 | data = cursor.fetchone() 111 | if data is None: 112 | to_insert_cache = {} 113 | for t in sentence.predicted_types: 114 | to_insert_cache[t] = 1 115 | cursor.execute("INSERT INTO cache VALUES (?, ?)", [surface, pickle.dumps(to_insert_cache)]) 116 | db.commit() 117 | else: 118 | previous_cache = pickle.loads(data[0]) 119 | for t in sentence.predicted_types: 120 | if t in previous_cache: 121 | previous_cache[t] += 1 122 | else: 123 | previous_cache[t] = 1 124 | cursor.execute("UPDATE cache SET types=? WHERE surface=?", [pickle.dumps(previous_cache), surface]) 125 | db.commit() 126 | -------------------------------------------------------------------------------- /frontend/constants/constants.js: -------------------------------------------------------------------------------- 1 | const SERVER_API = "http://127.0.0.1:5000/"; -------------------------------------------------------------------------------- /frontend/index_content.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 11 | 12 | 13 | Zoe Online Demo 14 | 19 | 24 | 25 | 26 | 27 |
28 |
29 |
30 |
31 | Taxonomy 32 |
33 |
34 | 37 | 40 |
41 | 46 |
47 | 52 |
53 |
54 |
55 | 56 | Use this option when you want to type with one of the three preset taxonomies. 57 | Select the desired taxonomy from the drop-down. 58 |
59 |
60 |
61 | 63 | 66 | 67 | 72 | 73 |
74 |
75 |
76 | 77 | Select this option if you want to define your own taxonomy. 78 | To make things easier, instead of letting you choose from FreeBase types, 79 | we ask you to find a few examples that belong to the custom type you want. 80 | Enter a valid Wikipedia URL in the left input box, and the custom type name in the right input box. 81 | In theory, the more examples you provide, the more precise the mapping will be. 82 |
83 |
84 | 86 | 103 |
104 |
105 |
106 |
107 |
108 | Sentence 109 | 114 |
115 | 116 | 129 |
130 | 132 |
133 |
134 |
135 | 136 |
137 |
138 | 139 | Enter a sentence here. Click "Parse!" for the next step. 140 |
141 |
142 | 143 |
144 |
145 |
146 |
147 | 148 |
149 |
150 | 151 | Click on the tokens that constitute the mention you want to type. 152 | Note that overlapping or consecutive mention selections are not supported. 153 | You can always un-select and re-select. 154 | Hit "Annotate" when you have finished your selection. 155 |
156 |
157 | 158 |
159 |
160 |
161 |
162 |
163 | 164 | 165 |
166 | 167 | 168 | 169 | 170 |
171 | 172 | 173 |
174 | 175 | 176 | 177 | 178 | 181 | 184 | 187 | 188 | 189 | 568 | 641 | 686 | 687 | -------------------------------------------------------------------------------- /frontend/info.js: -------------------------------------------------------------------------------- 1 | var demoname = "Zoe Demo"; 2 | var demoexplanation = "This is an online demo of our recent paper Zero-Shot Open Entity Typing as Type-Compatible Grounding. Please use the question buttons when you are looking for instructions. If none of them solves your problem, please create an issue on our Github repo."; 3 | var citations = { 4 | "http://cogcomp.org/page/publication_view/845" : "Zero-Shot Open Entity Typing as Type-Compatible Grounding", 5 | }; 6 | var contact = "xzhou45@illinois.edu"; 7 | 8 | function initial_load() { 9 | document.getElementById("demo-name").innerHTML = demoname; 10 | document.getElementById("demo-explanation").innerHTML = "

" + demoexplanation + "

"; 11 | if (citations.length != 0) { 12 | citation_content = "If you wish to cite this work, please cite the following publication(s):"; 13 | var cid = 1; 14 | for (var key in citations) { 15 | citation_content += 16 | "" + 17 | "(" + cid.toString() + ")" + 18 | "" + citations[key] + "," + 19 | ""; 20 | cid ++; 21 | } 22 | document.getElementById("demo-citations").innerHTML = citation_content; 23 | document.getElementById("demo-contact").href = "mailto:" + contact; 24 | document.getElementById("demo-contact").innerHTML = contact; 25 | } 26 | } 27 | 28 | initial_load(); -------------------------------------------------------------------------------- /frontend/js/bootstrap.min.js: -------------------------------------------------------------------------------- 1 | /*! 2 | * Bootstrap v3.1.1 (http://getbootstrap.com) 3 | * Copyright 2011-2014 Twitter, Inc. 4 | * Licensed under MIT (https://github.com/twbs/bootstrap/blob/master/LICENSE) 5 | */ 6 | if("undefined"==typeof jQuery)throw new Error("Bootstrap's JavaScript requires jQuery");+function(a){"use strict";function b(){var a=document.createElement("bootstrap"),b={WebkitTransition:"webkitTransitionEnd",MozTransition:"transitionend",OTransition:"oTransitionEnd otransitionend",transition:"transitionend"};for(var c in b)if(void 0!==a.style[c])return{end:b[c]};return!1}a.fn.emulateTransitionEnd=function(b){var c=!1,d=this;a(this).one(a.support.transition.end,function(){c=!0});var e=function(){c||a(d).trigger(a.support.transition.end)};return setTimeout(e,b),this},a(function(){a.support.transition=b()})}(jQuery),+function(a){"use strict";var b='[data-dismiss="alert"]',c=function(c){a(c).on("click",b,this.close)};c.prototype.close=function(b){function c(){f.trigger("closed.bs.alert").remove()}var d=a(this),e=d.attr("data-target");e||(e=d.attr("href"),e=e&&e.replace(/.*(?=#[^\s]*$)/,""));var f=a(e);b&&b.preventDefault(),f.length||(f=d.hasClass("alert")?d:d.parent()),f.trigger(b=a.Event("close.bs.alert")),b.isDefaultPrevented()||(f.removeClass("in"),a.support.transition&&f.hasClass("fade")?f.one(a.support.transition.end,c).emulateTransitionEnd(150):c())};var d=a.fn.alert;a.fn.alert=function(b){return this.each(function(){var d=a(this),e=d.data("bs.alert");e||d.data("bs.alert",e=new c(this)),"string"==typeof b&&e[b].call(d)})},a.fn.alert.Constructor=c,a.fn.alert.noConflict=function(){return a.fn.alert=d,this},a(document).on("click.bs.alert.data-api",b,c.prototype.close)}(jQuery),+function(a){"use strict";var b=function(c,d){this.$element=a(c),this.options=a.extend({},b.DEFAULTS,d),this.isLoading=!1};b.DEFAULTS={loadingText:"loading..."},b.prototype.setState=function(b){var c="disabled",d=this.$element,e=d.is("input")?"val":"html",f=d.data();b+="Text",f.resetText||d.data("resetText",d[e]()),d[e](f[b]||this.options[b]),setTimeout(a.proxy(function(){"loadingText"==b?(this.isLoading=!0,d.addClass(c).attr(c,c)):this.isLoading&&(this.isLoading=!1,d.removeClass(c).removeAttr(c))},this),0)},b.prototype.toggle=function(){var a=!0,b=this.$element.closest('[data-toggle="buttons"]');if(b.length){var c=this.$element.find("input");"radio"==c.prop("type")&&(c.prop("checked")&&this.$element.hasClass("active")?a=!1:b.find(".active").removeClass("active")),a&&c.prop("checked",!this.$element.hasClass("active")).trigger("change")}a&&this.$element.toggleClass("active")};var c=a.fn.button;a.fn.button=function(c){return this.each(function(){var d=a(this),e=d.data("bs.button"),f="object"==typeof c&&c;e||d.data("bs.button",e=new 
b(this,f)),"toggle"==c?e.toggle():c&&e.setState(c)})},a.fn.button.Constructor=b,a.fn.button.noConflict=function(){return a.fn.button=c,this},a(document).on("click.bs.button.data-api","[data-toggle^=button]",function(b){var c=a(b.target);c.hasClass("btn")||(c=c.closest(".btn")),c.button("toggle"),b.preventDefault()})}(jQuery),+function(a){"use strict";var b=function(b,c){this.$element=a(b),this.$indicators=this.$element.find(".carousel-indicators"),this.options=c,this.paused=this.sliding=this.interval=this.$active=this.$items=null,"hover"==this.options.pause&&this.$element.on("mouseenter",a.proxy(this.pause,this)).on("mouseleave",a.proxy(this.cycle,this))};b.DEFAULTS={interval:5e3,pause:"hover",wrap:!0},b.prototype.cycle=function(b){return b||(this.paused=!1),this.interval&&clearInterval(this.interval),this.options.interval&&!this.paused&&(this.interval=setInterval(a.proxy(this.next,this),this.options.interval)),this},b.prototype.getActiveIndex=function(){return this.$active=this.$element.find(".item.active"),this.$items=this.$active.parent().children(),this.$items.index(this.$active)},b.prototype.to=function(b){var c=this,d=this.getActiveIndex();return b>this.$items.length-1||0>b?void 0:this.sliding?this.$element.one("slid.bs.carousel",function(){c.to(b)}):d==b?this.pause().cycle():this.slide(b>d?"next":"prev",a(this.$items[b]))},b.prototype.pause=function(b){return b||(this.paused=!0),this.$element.find(".next, .prev").length&&a.support.transition&&(this.$element.trigger(a.support.transition.end),this.cycle(!0)),this.interval=clearInterval(this.interval),this},b.prototype.next=function(){return this.sliding?void 0:this.slide("next")},b.prototype.prev=function(){return this.sliding?void 0:this.slide("prev")},b.prototype.slide=function(b,c){var d=this.$element.find(".item.active"),e=c||d[b](),f=this.interval,g="next"==b?"left":"right",h="next"==b?"first":"last",i=this;if(!e.length){if(!this.options.wrap)return;e=this.$element.find(".item")[h]()}if(e.hasClass("active"))return this.sliding=!1;var j=a.Event("slide.bs.carousel",{relatedTarget:e[0],direction:g});return this.$element.trigger(j),j.isDefaultPrevented()?void 0:(this.sliding=!0,f&&this.pause(),this.$indicators.length&&(this.$indicators.find(".active").removeClass("active"),this.$element.one("slid.bs.carousel",function(){var b=a(i.$indicators.children()[i.getActiveIndex()]);b&&b.addClass("active")})),a.support.transition&&this.$element.hasClass("slide")?(e.addClass(b),e[0].offsetWidth,d.addClass(g),e.addClass(g),d.one(a.support.transition.end,function(){e.removeClass([b,g].join(" ")).addClass("active"),d.removeClass(["active",g].join(" ")),i.sliding=!1,setTimeout(function(){i.$element.trigger("slid.bs.carousel")},0)}).emulateTransitionEnd(1e3*d.css("transition-duration").slice(0,-1))):(d.removeClass("active"),e.addClass("active"),this.sliding=!1,this.$element.trigger("slid.bs.carousel")),f&&this.cycle(),this)};var c=a.fn.carousel;a.fn.carousel=function(c){return this.each(function(){var d=a(this),e=d.data("bs.carousel"),f=a.extend({},b.DEFAULTS,d.data(),"object"==typeof c&&c),g="string"==typeof c?c:f.slide;e||d.data("bs.carousel",e=new b(this,f)),"number"==typeof c?e.to(c):g?e[g]():f.interval&&e.pause().cycle()})},a.fn.carousel.Constructor=b,a.fn.carousel.noConflict=function(){return a.fn.carousel=c,this},a(document).on("click.bs.carousel.data-api","[data-slide], [data-slide-to]",function(b){var 
c,d=a(this),e=a(d.attr("data-target")||(c=d.attr("href"))&&c.replace(/.*(?=#[^\s]+$)/,"")),f=a.extend({},e.data(),d.data()),g=d.attr("data-slide-to");g&&(f.interval=!1),e.carousel(f),(g=d.attr("data-slide-to"))&&e.data("bs.carousel").to(g),b.preventDefault()}),a(window).on("load",function(){a('[data-ride="carousel"]').each(function(){var b=a(this);b.carousel(b.data())})})}(jQuery),+function(a){"use strict";var b=function(c,d){this.$element=a(c),this.options=a.extend({},b.DEFAULTS,d),this.transitioning=null,this.options.parent&&(this.$parent=a(this.options.parent)),this.options.toggle&&this.toggle()};b.DEFAULTS={toggle:!0},b.prototype.dimension=function(){var a=this.$element.hasClass("width");return a?"width":"height"},b.prototype.show=function(){if(!this.transitioning&&!this.$element.hasClass("in")){var b=a.Event("show.bs.collapse");if(this.$element.trigger(b),!b.isDefaultPrevented()){var c=this.$parent&&this.$parent.find("> .panel > .in");if(c&&c.length){var d=c.data("bs.collapse");if(d&&d.transitioning)return;c.collapse("hide"),d||c.data("bs.collapse",null)}var e=this.dimension();this.$element.removeClass("collapse").addClass("collapsing")[e](0),this.transitioning=1;var f=function(){this.$element.removeClass("collapsing").addClass("collapse in")[e]("auto"),this.transitioning=0,this.$element.trigger("shown.bs.collapse")};if(!a.support.transition)return f.call(this);var g=a.camelCase(["scroll",e].join("-"));this.$element.one(a.support.transition.end,a.proxy(f,this)).emulateTransitionEnd(350)[e](this.$element[0][g])}}},b.prototype.hide=function(){if(!this.transitioning&&this.$element.hasClass("in")){var b=a.Event("hide.bs.collapse");if(this.$element.trigger(b),!b.isDefaultPrevented()){var c=this.dimension();this.$element[c](this.$element[c]())[0].offsetHeight,this.$element.addClass("collapsing").removeClass("collapse").removeClass("in"),this.transitioning=1;var d=function(){this.transitioning=0,this.$element.trigger("hidden.bs.collapse").removeClass("collapsing").addClass("collapse")};return a.support.transition?void this.$element[c](0).one(a.support.transition.end,a.proxy(d,this)).emulateTransitionEnd(350):d.call(this)}}},b.prototype.toggle=function(){this[this.$element.hasClass("in")?"hide":"show"]()};var c=a.fn.collapse;a.fn.collapse=function(c){return this.each(function(){var d=a(this),e=d.data("bs.collapse"),f=a.extend({},b.DEFAULTS,d.data(),"object"==typeof c&&c);!e&&f.toggle&&"show"==c&&(c=!c),e||d.data("bs.collapse",e=new b(this,f)),"string"==typeof c&&e[c]()})},a.fn.collapse.Constructor=b,a.fn.collapse.noConflict=function(){return a.fn.collapse=c,this},a(document).on("click.bs.collapse.data-api","[data-toggle=collapse]",function(b){var c,d=a(this),e=d.attr("data-target")||b.preventDefault()||(c=d.attr("href"))&&c.replace(/.*(?=#[^\s]+$)/,""),f=a(e),g=f.data("bs.collapse"),h=g?"toggle":d.data(),i=d.attr("data-parent"),j=i&&a(i);g&&g.transitioning||(j&&j.find('[data-toggle=collapse][data-parent="'+i+'"]').not(d).addClass("collapsed"),d[f.hasClass("in")?"addClass":"removeClass"]("collapsed")),f.collapse(h)})}(jQuery),+function(a){"use strict";function b(b){a(d).remove(),a(e).each(function(){var d=c(a(this)),e={relatedTarget:this};d.hasClass("open")&&(d.trigger(b=a.Event("hide.bs.dropdown",e)),b.isDefaultPrevented()||d.removeClass("open").trigger("hidden.bs.dropdown",e))})}function c(b){var c=b.attr("data-target");c||(c=b.attr("href"),c=c&&/#[A-Za-z]/.test(c)&&c.replace(/.*(?=#[^\s]*$)/,""));var d=c&&a(c);return d&&d.length?d:b.parent()}var 
d=".dropdown-backdrop",e="[data-toggle=dropdown]",f=function(b){a(b).on("click.bs.dropdown",this.toggle)};f.prototype.toggle=function(d){var e=a(this);if(!e.is(".disabled, :disabled")){var f=c(e),g=f.hasClass("open");if(b(),!g){"ontouchstart"in document.documentElement&&!f.closest(".navbar-nav").length&&a(''}),b.prototype=a.extend({},a.fn.tooltip.Constructor.prototype),b.prototype.constructor=b,b.prototype.getDefaults=function(){return b.DEFAULTS},b.prototype.setContent=function(){var a=this.tip(),b=this.getTitle(),c=this.getContent();a.find(".popover-title")[this.options.html?"html":"text"](b),a.find(".popover-content")[this.options.html?"string"==typeof c?"html":"append":"text"](c),a.removeClass("fade top bottom left right in"),a.find(".popover-title").html()||a.find(".popover-title").hide()},b.prototype.hasContent=function(){return this.getTitle()||this.getContent()},b.prototype.getContent=function(){var a=this.$element,b=this.options;return a.attr("data-content")||("function"==typeof b.content?b.content.call(a[0]):b.content)},b.prototype.arrow=function(){return this.$arrow=this.$arrow||this.tip().find(".arrow")},b.prototype.tip=function(){return this.$tip||(this.$tip=a(this.options.template)),this.$tip};var c=a.fn.popover;a.fn.popover=function(c){return this.each(function(){var d=a(this),e=d.data("bs.popover"),f="object"==typeof c&&c;(e||"destroy"!=c)&&(e||d.data("bs.popover",e=new b(this,f)),"string"==typeof c&&e[c]())})},a.fn.popover.Constructor=b,a.fn.popover.noConflict=function(){return a.fn.popover=c,this}}(jQuery),+function(a){"use strict";function b(c,d){var e,f=a.proxy(this.process,this);this.$element=a(a(c).is("body")?window:c),this.$body=a("body"),this.$scrollElement=this.$element.on("scroll.bs.scroll-spy.data-api",f),this.options=a.extend({},b.DEFAULTS,d),this.selector=(this.options.target||(e=a(c).attr("href"))&&e.replace(/.*(?=#[^\s]+$)/,"")||"")+" .nav li > a",this.offsets=a([]),this.targets=a([]),this.activeTarget=null,this.refresh(),this.process()}b.DEFAULTS={offset:10},b.prototype.refresh=function(){var b=this.$element[0]==window?"offset":"position";this.offsets=a([]),this.targets=a([]);{var c=this;this.$body.find(this.selector).map(function(){var d=a(this),e=d.data("target")||d.attr("href"),f=/^#./.test(e)&&a(e);return f&&f.length&&f.is(":visible")&&[[f[b]().top+(!a.isWindow(c.$scrollElement.get(0))&&c.$scrollElement.scrollTop()),e]]||null}).sort(function(a,b){return a[0]-b[0]}).each(function(){c.offsets.push(this[0]),c.targets.push(this[1])})}},b.prototype.process=function(){var a,b=this.$scrollElement.scrollTop()+this.options.offset,c=this.$scrollElement[0].scrollHeight||this.$body[0].scrollHeight,d=c-this.$scrollElement.height(),e=this.offsets,f=this.targets,g=this.activeTarget;if(b>=d)return g!=(a=f.last()[0])&&this.activate(a);if(g&&b<=e[0])return g!=(a=f[0])&&this.activate(a);for(a=e.length;a--;)g!=f[a]&&b>=e[a]&&(!e[a+1]||b<=e[a+1])&&this.activate(f[a])},b.prototype.activate=function(b){this.activeTarget=b,a(this.selector).parentsUntil(this.options.target,".active").removeClass("active");var c=this.selector+'[data-target="'+b+'"],'+this.selector+'[href="'+b+'"]',d=a(c).parents("li").addClass("active");d.parent(".dropdown-menu").length&&(d=d.closest("li.dropdown").addClass("active")),d.trigger("activate.bs.scrollspy")};var c=a.fn.scrollspy;a.fn.scrollspy=function(c){return this.each(function(){var d=a(this),e=d.data("bs.scrollspy"),f="object"==typeof c&&c;e||d.data("bs.scrollspy",e=new b(this,f)),"string"==typeof 
c&&e[c]()})},a.fn.scrollspy.Constructor=b,a.fn.scrollspy.noConflict=function(){return a.fn.scrollspy=c,this},a(window).on("load",function(){a('[data-spy="scroll"]').each(function(){var b=a(this);b.scrollspy(b.data())})})}(jQuery),+function(a){"use strict";var b=function(b){this.element=a(b)};b.prototype.show=function(){var b=this.element,c=b.closest("ul:not(.dropdown-menu)"),d=b.data("target");if(d||(d=b.attr("href"),d=d&&d.replace(/.*(?=#[^\s]*$)/,"")),!b.parent("li").hasClass("active")){var e=c.find(".active:last a")[0],f=a.Event("show.bs.tab",{relatedTarget:e});if(b.trigger(f),!f.isDefaultPrevented()){var g=a(d);this.activate(b.parent("li"),c),this.activate(g,g.parent(),function(){b.trigger({type:"shown.bs.tab",relatedTarget:e})})}}},b.prototype.activate=function(b,c,d){function e(){f.removeClass("active").find("> .dropdown-menu > .active").removeClass("active"),b.addClass("active"),g?(b[0].offsetWidth,b.addClass("in")):b.removeClass("fade"),b.parent(".dropdown-menu")&&b.closest("li.dropdown").addClass("active"),d&&d()}var f=c.find("> .active"),g=d&&a.support.transition&&f.hasClass("fade");g?f.one(a.support.transition.end,e).emulateTransitionEnd(150):e(),f.removeClass("in")};var c=a.fn.tab;a.fn.tab=function(c){return this.each(function(){var d=a(this),e=d.data("bs.tab");e||d.data("bs.tab",e=new b(this)),"string"==typeof c&&e[c]()})},a.fn.tab.Constructor=b,a.fn.tab.noConflict=function(){return a.fn.tab=c,this},a(document).on("click.bs.tab.data-api",'[data-toggle="tab"], [data-toggle="pill"]',function(b){b.preventDefault(),a(this).tab("show")})}(jQuery),+function(a){"use strict";var b=function(c,d){this.options=a.extend({},b.DEFAULTS,d),this.$window=a(window).on("scroll.bs.affix.data-api",a.proxy(this.checkPosition,this)).on("click.bs.affix.data-api",a.proxy(this.checkPositionWithEventLoop,this)),this.$element=a(c),this.affixed=this.unpin=this.pinnedOffset=null,this.checkPosition()};b.RESET="affix affix-top affix-bottom",b.DEFAULTS={offset:0},b.prototype.getPinnedOffset=function(){if(this.pinnedOffset)return this.pinnedOffset;this.$element.removeClass(b.RESET).addClass("affix");var a=this.$window.scrollTop(),c=this.$element.offset();return this.pinnedOffset=c.top-a},b.prototype.checkPositionWithEventLoop=function(){setTimeout(a.proxy(this.checkPosition,this),1)},b.prototype.checkPosition=function(){if(this.$element.is(":visible")){var c=a(document).height(),d=this.$window.scrollTop(),e=this.$element.offset(),f=this.options.offset,g=f.top,h=f.bottom;"top"==this.affixed&&(e.top+=d),"object"!=typeof f&&(h=g=f),"function"==typeof g&&(g=f.top(this.$element)),"function"==typeof h&&(h=f.bottom(this.$element));var i=null!=this.unpin&&d+this.unpin<=e.top?!1:null!=h&&e.top+this.$element.height()>=c-h?"bottom":null!=g&&g>=d?"top":!1;if(this.affixed!==i){this.unpin&&this.$element.css("top","");var j="affix"+(i?"-"+i:""),k=a.Event(j+".bs.affix");this.$element.trigger(k),k.isDefaultPrevented()||(this.affixed=i,this.unpin="bottom"==i?this.getPinnedOffset():null,this.$element.removeClass(b.RESET).addClass(j).trigger(a.Event(j.replace("affix","affixed"))),"bottom"==i&&this.$element.offset({top:c-h-this.$element.height()}))}}};var c=a.fn.affix;a.fn.affix=function(c){return this.each(function(){var d=a(this),e=d.data("bs.affix"),f="object"==typeof c&&c;e||d.data("bs.affix",e=new b(this,f)),"string"==typeof c&&e[c]()})},a.fn.affix.Constructor=b,a.fn.affix.noConflict=function(){return a.fn.affix=c,this},a(window).on("load",function(){a('[data-spy="affix"]').each(function(){var 
b=a(this),c=b.data();c.offset=c.offset||{},c.offsetBottom&&(c.offset.bottom=c.offsetBottom),c.offsetTop&&(c.offset.top=c.offsetTop),b.affix(c)})})}(jQuery); -------------------------------------------------------------------------------- /frontend/js/global.css: -------------------------------------------------------------------------------- 1 | p { 2 | font-size: 16px; 3 | } 4 | .breadcrumb { 5 | border: 1px solid lightgrey; 6 | clear: both; 7 | } 8 | 9 | hr { 10 | border-color: lightgrey; 11 | opacity: 100%; 12 | clear: both; 13 | } 14 | h2 { 15 | color: white !important; 16 | background: lightgray; 17 | background: -webkit-linear-gradient(left, #444 0%, #eee 100%); 18 | padding: 5px; 19 | padding-left: 15px; 20 | margin-top: 0px; 21 | margin-bottom: 20px; 22 | font-weight: normal; 23 | border-left: 5px solid #428bca; 24 | } 25 | h2 small { 26 | color: lightgrey; 27 | font-weight: 300; 28 | } 29 | .CCG, .CCG h1, .CCG h1 a { 30 | font-variant: small-caps; 31 | font-size: 1.44em; 32 | color: #2a6496; 33 | } 34 | .CCG h1 a { 35 | line-height: 0.1em; 36 | } 37 | .CCG { 38 | padding-left: 15px; 39 | padding-right: 10px; 40 | } 41 | .CCG small { 42 | font-size: 0.45em; 43 | text-transform: uppercase; 44 | color: #ddb482; 45 | font-weight: lighter; 46 | letter-spacing: 7.8px; 47 | display: block; 48 | line-height: 0.7em; 49 | } 50 | .CCG img { 51 | position: absolute; 52 | top: 0px; 53 | right: 0px; 54 | height:100px; 55 | } 56 | 57 | .popover{ 58 | max-width:800px; 59 | } 60 | 61 | #nav li .ell{ 62 | margin:0px; 63 | padding:0px; 64 | padding-top: 2px; 65 | padding-bottom:2px; 66 | padding-left:4px; 67 | padding-right:4px; 68 | margin:4px; 69 | } 70 | 71 | #nav{ 72 | margin-top:-10px; 73 | margin-left:-25px; 74 | } 75 | 76 | .lead{ 77 | padding:20px; 78 | } 79 | 80 | .ccg-sidebar{ 81 | width:110px; 82 | } 83 | 84 | #illinois-logo{ 85 | margin: 10px; 86 | padding: 10px; 87 | } 88 | 89 | .navigation-container{ 90 | margin-left: 50px; 91 | margin-right: 50px; 92 | } 93 | 94 | /* cloaking directive */ 95 | 96 | [ng\:cloak], [ng-cloak], [data-ng-cloak], [x-ng-cloak], .ng-cloak, .x-ng-cloak { 97 | display: none !important; 98 | } 99 | #problems { 100 | font-size: 0.8em; 101 | position: fixed; 102 | bottom: 0px; 103 | left: 17px; 104 | } 105 | #rollover { 106 | font-size: 0.8em; 107 | position: fixed; 108 | top: 17px; 109 | right: 17px; 110 | width: 200px; 111 | } 112 | .licensing { 113 | overflow-y: scroll; 114 | height: 300px; 115 | width: 100%; 116 | border: 1px solid #DDD; 117 | padding: 10px; 118 | background-color: white; 119 | } 120 | .holder { 121 | padding: 0 5%; 122 | } 123 | .header-container{ 124 | background-color: #FFFFFF; 125 | border-bottom: 20px solid #2a6496; 126 | } 127 | /* Navigation */ 128 | .nav-pills > li a { 129 | color: black; 130 | border-radius: 0px; 131 | } 132 | .nav-pills > li a:hover{ 133 | background-color: #2a6496; 134 | color: white; 135 | -webkit-transition: all 0.2s; 136 | transition: all 0.2s; 137 | } 138 | 139 | .research-project-card { 140 | min-height: 200px; 141 | } 142 | 143 | .research-project-card a h5{ 144 | background-color: #7F5425; 145 | padding: 1em 0.2em; 146 | color: white; 147 | } 148 | 149 | .inline { 150 | display: inline-block; 151 | vertical-align: middle; 152 | } 153 | 154 | /*.licensing{ 155 | overflow-y: scroll; 156 | height: 300px; 157 | width: 100%; 158 | border: 1px solid #DDD; 159 | padding: 10px; 160 | background-color: white; 161 | }*/ 162 | .list-group-item.past-project { 163 | min-height: 150px; 164 | } 165 | 
-------------------------------------------------------------------------------- /frontend/js/logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CogComp/zoe/75030cae103743e76420c2a0d74f52bab039f0a0/frontend/js/logo.png -------------------------------------------------------------------------------- /frontend/loading_icon.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/CogComp/zoe/75030cae103743e76420c2a0d74f52bab039f0a0/frontend/loading_icon.gif -------------------------------------------------------------------------------- /install.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if ! [ -x "$(command -v java)" ]; then 4 | echo 'Error: Java is not installed.' 5 | exit 1 6 | fi 7 | if ! [ -x "$(command -v mvn)" ]; then 8 | echo 'Error: Maven is not installed.' 9 | exit 1 10 | fi 11 | if ! [ -x "$(command -v python3)" ]; then 12 | echo 'Error: Python 3.x is not installed.' 13 | exit 1 14 | fi 15 | if ! [ -x "$(command -v virtualenv)" ]; then 16 | echo 'Error: virtualenv is not installed.' 17 | exit 1 18 | fi 19 | if ! [ -x "$(command -v wget)" ]; then 20 | echo 'Error: wget is not found. Either install it or find a replacement and modify this script.' 21 | exit 1 22 | fi 23 | if ! [ -x "$(command -v unzip)" ]; then 24 | echo 'Error: unzip is not found. Either install it or find a replacement and modify this script.' 25 | exit 1 26 | fi 27 | echo 'All dependencies satisfied. Moving on...' 28 | 29 | virtualenv -p python3 venv 30 | cd ./bilm-tf 31 | ../venv/bin/python3 setup.py install 32 | wget http://cogcomp.org/Data/ccgPapersData/xzhou45/zoe/model.zip 33 | unzip model.zip 34 | rm model.zip 35 | cd ../ 36 | venv/bin/pip3 install Cython 37 | venv/bin/pip3 install -r requirements.txt 38 | wget http://cogcomp.org/Data/ccgPapersData/xzhou45/zoe/data.zip 39 | unzip -n data.zip 40 | rm data.zip 41 | venv/bin/python3 -m ccg_nlpy download -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import sys 4 | 5 | from zoe_utils import DataReader 6 | from zoe_utils import ElmoProcessor 7 | from zoe_utils import EsaProcessor 8 | from zoe_utils import Evaluator 9 | from zoe_utils import InferenceProcessor 10 | 11 | 12 | class ZoeRunner: 13 | 14 | """ 15 | @allow_tensorflow sets whether the system will do run-time ELMo processing. 16 | It is set to False in the experiments because ELMo results are cached, 17 | but please leave it at the default (True) when running on new sentences.
18 | """ 19 | def __init__(self, allow_tensorflow=True): 20 | self.elmo_processor = ElmoProcessor(allow_tensorflow) 21 | self.esa_processor = EsaProcessor() 22 | self.inference_processor = InferenceProcessor("figer") 23 | self.evaluator = Evaluator() 24 | self.evaluated = [] 25 | 26 | """ 27 | Process a single sentence 28 | @sentence: a sentence in zoe_utils.Sentence structure 29 | @return: a sentence in zoe_utils that has predicted types set 30 | """ 31 | def process_sentence(self, sentence, inference_processor=None): 32 | esa_candidates = self.esa_processor.get_candidates(sentence) 33 | elmo_candidates = self.elmo_processor.rank_candidates(sentence, esa_candidates) 34 | if len(elmo_candidates) > 0 and elmo_candidates[0][0] == self.elmo_processor.stop_sign: 35 | return -1 36 | if inference_processor is None: 37 | inference_processor = self.inference_processor 38 | inference_processor.inference(sentence, elmo_candidates, esa_candidates) 39 | return sentence 40 | 41 | def process_sentence_vec(self, sentence, inference_processor=None): 42 | esa_candidates = self.esa_processor.get_candidates(sentence) 43 | elmo_candidates = self.elmo_processor.rank_candidates_vec(sentence, esa_candidates) 44 | if len(elmo_candidates) > 0 and elmo_candidates[0][0] == self.elmo_processor.stop_sign: 45 | return -1 46 | if inference_processor is None: 47 | inference_processor = self.inference_processor 48 | inference_processor.inference(sentence, elmo_candidates, esa_candidates) 49 | return sentence 50 | 51 | """ 52 | Helper function to evaluate on a dataset that has multiple sentences 53 | @file_name: A string indicating the data file. 54 | Note the format needs to be the common json format, see examples 55 | @mode: A string indicating the mode. This adjusts the inference mode, and set caches etc. 56 | @return: None 57 | """ 58 | def evaluate_dataset(self, file_name, mode, do_inference=True, use_prior=True, use_context=True, size=-1): 59 | if not os.path.isfile(file_name): 60 | print("[ERROR] Invalid input data file.") 61 | return 62 | self.inference_processor = InferenceProcessor(mode, do_inference, use_prior, use_context) 63 | dataset = DataReader(file_name, size) 64 | for sentence in dataset.sentences: 65 | processed = self.process_sentence(sentence) 66 | if processed == -1: 67 | continue 68 | self.evaluated.append(processed) 69 | processed.print_self() 70 | evaluator = Evaluator() 71 | evaluator.print_performance(self.evaluated) 72 | 73 | """ 74 | Helper function that saves the predicted sentences list to a file. 75 | @file_name: A string indicating the target file path. 
76 | Note that it will overwrite any existing content 77 | @return: None 78 | """ 79 | def save(self, file_name): 80 | with open(file_name, "wb") as handle: 81 | pickle.dump(self.evaluated, handle, pickle.HIGHEST_PROTOCOL) 82 | 83 | @staticmethod 84 | def evaluate_saved_runlog(log_name): 85 | with open(log_name, "rb") as handle: 86 | sentences = pickle.load(handle) 87 | evaluator = Evaluator() 88 | evaluator.print_performance(sentences) 89 | 90 | 91 | if __name__ == "__main__": 92 | if len(sys.argv) < 2: 93 | print("[ERROR] choose from 'figer', 'bbn', 'ontonotes' or 'eval'") 94 | exit(0) 95 | if sys.argv[1] == "figer": 96 | runner = ZoeRunner(allow_tensorflow=False) 97 | runner.elmo_processor.load_cached_embeddings("data/FIGER/target.min.embedding.pickle", "data/FIGER/wikilinks.min.embedding.pickle") 98 | runner.evaluate_dataset("data/FIGER/test_sampled.json", "figer") 99 | runner.save("data/log/runlog_figer.pickle") 100 | if sys.argv[1] == "bbn": 101 | runner = ZoeRunner(allow_tensorflow=False) 102 | runner.elmo_processor.load_cached_embeddings("data/BBN/target.min.embedding.pickle", "data/BBN/wikilinks.min.embedding.pickle") 103 | runner.evaluate_dataset("data/BBN/test.json", "bbn") 104 | runner.save("data/log/runlog_bbn.pickle") 105 | if sys.argv[1] == "ontonotes": 106 | runner = ZoeRunner(allow_tensorflow=False) 107 | runner.elmo_processor.load_cached_embeddings("data/ONTONOTES/target.min.embedding.pickle", "data/ONTONOTES/wikilinks.min.embedding.pickle") 108 | runner.evaluate_dataset("data/ONTONOTES/test.json", "ontonotes", size=1000) 109 | runner.save("data/log/runlog_ontonotes.pickle") 110 | if sys.argv[1] == "eval" and len(sys.argv) > 2: 111 | ZoeRunner.evaluate_saved_runlog(sys.argv[2]) 112 | -------------------------------------------------------------------------------- /mapping/README.md: -------------------------------------------------------------------------------- 1 | ## Mappings 2 | 3 | ### What are these 4 | 5 | Here we provide three mappings that map FreeBase types to different taxonomies. 6 | 7 | ### How to modify or create your own 8 | 9 | Each mapping is composed of two files: *.mapping and *.logic.mapping 10 | 11 | Each line in a *.mapping file contains a tab-separated pair, where the left is a FreeBase type, and the right is a target type. 12 | Note that each target type is automatically dissected into hierarchies by splitting on "/". This file works with 13 | "OR" logic; that is, a target type is assigned if *any* of the FreeBase mapping sources is found in an entity's FreeBase types. 14 | 15 | *.logic.mapping serves as a supplementary mapping for logic beyond "OR". 16 | Each line follows one of three patterns (a worked sketch follows this list): 17 | - "+\tA\tB" means if type A appears in the mapped types, type B will be added 18 | - "-\tA\tB" means if type A appears in the mapped types, type B (if present) will be removed 19 | - "=\tA\tB" means type A and type B are treated as equivalent for evaluation purposes.
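As a concrete illustration, here is a minimal Python sketch of how the two files combine (a hypothetical helper for illustration only, not the actual zoe_utils implementation; the "=" evaluation-equivalence rule is omitted):

```
def map_freebase_types(freebase_types, mapping_pairs, logic_rules):
    # "OR" logic: any matching FreeBase source contributes its target type
    mapped = {target for source, target in mapping_pairs
              if source in freebase_types}
    # supplementary "+"/"-" rules post-process the mapped set
    for op, a, b in logic_rules:
        if op == '+' and a in mapped:
            mapped.add(b)
        elif op == '-' and a in mapped:
            mapped.discard(b)
    return mapped

# toy rows in the spirit of bbn.mapping / bbn.logic.mapping
pairs = [('/people/person', '/PERSON'),
         ('/government/government', '/ORGANIZATION/GOVERNMENT'),
         ('/business/employer', '/ORGANIZATION/CORPORATION')]
rules = [('-', '/ORGANIZATION/GOVERNMENT', '/ORGANIZATION/CORPORATION')]

print(map_freebase_types({'/people/person'}, pairs, rules))
# {'/PERSON'}
print(map_freebase_types({'/government/government', '/business/employer'}, pairs, rules))
# {'/ORGANIZATION/GOVERNMENT'} -- CORPORATION removed by the '-' rule
```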
20 | 21 | -------------------------------------------------------------------------------- /mapping/bbn.logic.mapping: -------------------------------------------------------------------------------- 1 | - /ORGANIZATION /LAW 2 | - /LOCATION/RIVER /LOCATION/LAKE_SEA_OCEAN 3 | - /ORGANIZATION/GOVERNMENT /ORGANIZATION/CORPORATION 4 | - /PERSON ALL_OTHER 5 | - /GPE ALL_OTHER 6 | - /LOCATION/CONTINENT /LOCATION/REGION -------------------------------------------------------------------------------- /mapping/bbn.mapping: -------------------------------------------------------------------------------- 1 | /business/employer /ORGANIZATION/CORPORATION 2 | /organization/organization /ORGANIZATION 3 | /people/person /PERSON 4 | /location/continent /LOCATION/CONTINENT 5 | /base/locations/continents /LOCATION/CONTINENT 6 | /location/location /LOCATION 7 | /aviation/airport /FAC/AIRPORT 8 | /location/country /GPE/COUNTRY 9 | /location/citytown /GPE/CITY 10 | /location/location /LOCATION 11 | /aviation/aircraft_model /PRODUCT/VEHICLE 12 | /geography/river /LOCATION/RIVER 13 | /geography/body_of_water /LOCATION/LAKE_SEA_OCEAN 14 | /location/statistical_region /LOCATION/REGION 15 | /base/plants/plant /PLANT 16 | /architecture/building /BUILDING 17 | /travel/hotel /ORGANIZATION/HOTEL 18 | /time/event /EVENT 19 | /transportation/road /FAC/HIGHWAY_STREET 20 | /book/written_work /WORK_OF_ART/BOOK 21 | /medicine/disease /DISEASE 22 | /language/human_language /LANGUAGE 23 | /base/locations/states_and_provences /GPE/STATE_PROVINCE 24 | /location/cn_province /GPE/STATE_PROVINCE 25 | /music/composition /WORK_OF_ART/SONG 26 | /transportation/bridge /FAC/BRIDGE 27 | /military/war /EVENT/WAR 28 | /cvg/computer_videogame /GAME 29 | /automotive/model /PRODUCT/VEHICLE 30 | /visual_art/artwork /WORK_OF_ART/PAINTING 31 | /food/food /SUBSTANCE/FOOD 32 | /medicine/hospital /ORGANIZATION/HOSPITAL 33 | /government/political_party /ORGANIZATION/POLITICAL 34 | /law /LAW 35 | /theater/play /WORK_OF_ART/PLAY 36 | /government/government /ORGANIZATION/GOVERNMENT 37 | /meteorology/tropical_cyclone /EVENT/HURRICANE 38 | /biology/animal /ANIMAL 39 | /religion/religion /ORGANIZATION/RELIGIOUS 40 | /medicine/drug /SUBSTANCE/DRUG 41 | /chemistry/chemical_compound /SUBSTANCE/CHEMICAL 42 | /education/academic_institution /ORGANIZATION/EDUCATIONAL 43 | /education/educational_institution /ORGANIZATION/EDUCATIONAL 44 | /law/invention /PRODUCT/WEAPON 45 | /government/government_agency /ORGANIZATION/GOVERNMENT 46 | /government/governmental_body /ORGANIZATION/GOVERNMENT -------------------------------------------------------------------------------- /mapping/figer.logic.mapping: -------------------------------------------------------------------------------- 1 | + /building /location 2 | + /news_agency /organization 3 | + /news_agency /organization/company 4 | - /news_agency /written_work 5 | + /written_work /art 6 | - /organization/educational_institution /organization/company 7 | - /organization/sports_league /organization/company 8 | + /transportation/road /location 9 | + /livingthing /living_thing 10 | + /living_thing /livingthing 11 | = /living_thing /livingthing -------------------------------------------------------------------------------- /mapping/figer.mapping: -------------------------------------------------------------------------------- 1 | /base/terrorism/terrorist /person/terrorist 2 | /base/terrorism/terrorist_attack /event/terrorist_attack 3 | /base/terrorism/terrorist_organization /organization/terrorist_organization 4 | 
/people/person /person 5 | /location/location /location 6 | /location/citytown /location/city 7 | /sports/pro_athlete /person/athlete 8 | /biology/organism_classification /living_thing 9 | /organization/organization /organization 10 | /music/album /music 11 | /music/artist /person/artist 12 | /soccer/football_player /person/athlete 13 | /government/politician /person/politician 14 | /book/author /person/author 15 | /architecture/structure /building 16 | /film/film /art/film 17 | /time/event /event 18 | /business/business_operation /organization/company 19 | /geography/geographical_feature /location 20 | /film/actor /person/actor 21 | /book/written_work /written_work 22 | /tv/tv_actor /person/actor 23 | /education/educational_institution /organization/educational_institution 24 | /music/composition /music 25 | /architecture/building /building 26 | /geography/body_of_water /location/body_of_water 27 | /book/book /written_work 28 | /music/musical_group /person/musician 29 | /tv/tv_program /broadcast_program 30 | /sports/sports_team /organization/sports_team 31 | /geography/river /location/body_of_water 32 | /location/administrative_division /location 33 | /boats/ship /product/ship 34 | /education/school /organization/educational_institution 35 | /visual_art/visual_artist /person/artist 36 | /astronomy/celestial_object /astral_body 37 | /baseball/baseball_player /person/athlete 38 | /american_football/football_player /person/athlete 39 | /transportation/road /transportation/road 40 | /cvg/computer_videogame /game 41 | /astronomy/orbital_relationship /astral_body 42 | /book/periodical /written_work 43 | /education/university /organization/educational_institution 44 | /film/director /person/director 45 | /soccer/football_team /organization/sports_team 46 | /music/composer /person/musician 47 | /cricket/cricket_player /person/athlete 48 | /government/u_s_congressperson /person/politician 49 | /film/writer /person/author 50 | /astronomy/asteroid /astral_body 51 | /broadcast/artist /person/artist 52 | /aviation/airport /building/airport 53 | /military/military_conflict /event/military_conflict 54 | /government/political_party /government/political_party 55 | /geography/mountain /geography/mountain 56 | /geography/lake /location/body_of_water 57 | /military/military_unit /military 58 | /chemistry/chemical_compound /chemistry 59 | /time/recurring_event /event 60 | /geography/island /geography/island 61 | /media_common/adaptation /art/film 62 | /basketball/basketball_player /person/athlete 63 | /architecture/museum /building 64 | /tv/tv_personality /person/actor 65 | /location/uk_civil_parish /location 66 | /film/producer /person/artist 67 | /user/robert/data_nursery/railway_station /building 68 | /ice_hockey/hockey_player /person/athlete 69 | /music/producer /person/artist 70 | /computer/software /software 71 | /book/magazine /written_work 72 | /people/ethnicity /people/ethnicity 73 | /aviation/aircraft_model /product/airplane 74 | /government/election /event/election 75 | /location/census_designated_place /location 76 | /sports/tournament_event_competition /event/sports_event 77 | /music/record_label /organization/company 78 | /medicine/disease /disease 79 | /architecture/architect /person/architect 80 | /sports/sports_facility /building/sports_facility 81 | /base/crime/lawyer /person 82 | /language/human_language /language 83 | /location/neighborhood /location 84 | /music/guitarist /person/musician 85 | /sports/boxer /person/athlete 86 | /base/popstra/celebrity /person 87 | /theater/play /play 88 | 
/base/rugby/rugby_player /person/athlete 89 | /automotive/model /product/car 90 | /book/journal /written_work 91 | /aviation/airline /organization/airline 92 | /transportation/bridge /location/bridge 93 | /theater/theater_actor /person/actor 94 | /architecture/skyscraper /building 95 | /user/sprocketonline/economics/legislation /law 96 | /medicine/drug /medicine/drug 97 | /olympics/olympic_event_competition /event/sports_event 98 | /book/newspaper /written_work 99 | /award/award /award 100 | /sports/cyclist /person/athlete 101 | /royalty/noble_title /title 102 | /tv/tv_producer /person/artist 103 | /government/government_agency /government_agency 104 | /government/government_body /government_agency 105 | /government/general_election /event/election 106 | /tv/tv_writer /person/author 107 | /location/us_county /location/county 108 | /tennis/tennis_player /person/athlete 109 | /medicine/medical_treatment /medicine/medical_treatment 110 | /internet/website /internet/website 111 | /visual_art/artwork /art 112 | /royalty/monarch /person/monarch 113 | /tv/tv_director /person/director 114 | /location/australian_suburb /location 115 | /organization/non_profit_organization /organization 116 | /music/lyricist /person/author 117 | /medicine/anatomical_structure /body_part 118 | /sports/sports_league /organization/sports_league 119 | /medicine/hospital /building/hospital 120 | /religion/religious_leader /person/religious_leader 121 | /base/givennames/given_name /person 122 | /people/place_of_interment /location/cemetery 123 | /boats/ship_class /product/ship 124 | /user/akatenev/weapons/weapon /product/weapon 125 | /tv/tv_program_creator /person/artist 126 | /sports/australian_rules_footballer /person/athlete 127 | /travel/tourist_attraction /location 128 | /american_football/football_coach /person/coach 129 | /business/shopping_center /location 130 | /event/disaster /event/natural_disaster 131 | /food/dish /food 132 | /geography/mountain_range /geography/mountain 133 | /astronomy/star /astral_body 134 | /sports/golfer /person/athlete 135 | /film/cinematographer /person/artist 136 | /book/short_story /written_work 137 | /government/government_office_or_title /title 138 | /medicine/physician /person/doctor 139 | /music/conductor /person/musician 140 | /user/joshuamclark/default_domain/bird /livingthing/animal 141 | /architecture/house /building 142 | /games/game /game 143 | /people/family_name /person 144 | /film/editor /person/artist 145 | /basketball/basketball_coach /person/coach 146 | /user/patrick/default_domain/submarine /product/ship 147 | /rail/locomotive_class /train 148 | /music/songwriter /person/author 149 | /law/judge /person 150 | /cvg/cvg_developer /person/engineer 151 | /comic_books/comic_book_series /written_work 152 | /base/switzerland/ch_city /location/city 153 | /celebrities/celebrity /person 154 | /chess/chess_player /person/athlete 155 | /location/cemetery /location/cemetery 156 | /tv/tv_network /broadcast_network 157 | /opera/opera /play 158 | /dining/restaurant /building/restaurant 159 | /geography/glacier /geography/glacier 160 | /military/armed_force /military 161 | /cricket/cricket_bowler /person/athlete 162 | /rail/railway /rail/railway 163 | /basketball/basketball_team /organization/sports_team 164 | /martial_arts/martial_artist /person/artist 165 | /book/publishing_company /organization/company 166 | /music/instrument /product/instrument 167 | /food/ingredient /food 168 | /people/profession /title 169 | /metropolitan_transit/transit_line /metropolitan_transit/transit_line 
170 | /base/handball/handball_player /person/athlete 171 | /architecture/lighthouse /building 172 | /ice_hockey/hockey_team /organization/sports_team 173 | /book/poem /written_work 174 | /book/literary_series /written_work 175 | /location/cn_county /location/county 176 | /government/governmental_body /government/government 177 | /user/skud/legal/treaty /law 178 | /base/americancomedy/comedian /person/actor 179 | /business/consumer_product /product 180 | /theater/theater /building/theater 181 | /computer/programming_language /computer/programming_language 182 | /meteorology/tropical_cyclone /event/natural_disaster 183 | /medicine/icd_9_cm_classification /disease 184 | /base/infrastructure/power_station /building/power_station 185 | /base/crime/law_enforcement_authority /military 186 | /religion/deity /god 187 | /base/disaster2/attack /event/attack 188 | /cvg/cvg_publisher /organization/company 189 | /baseball/baseball_team /organization/sports_team 190 | /base/morelaw/canadian_lawyer /person 191 | /medicine/symptom /medicine/symptom 192 | /base/hotels/hotel /building/hotel 193 | /music/concert_tour /event 194 | /military/military_commander /person/soldier 195 | /business/job_title /title 196 | /food/food /food 197 | /tennis/tennis_tournament /event/sports_event 198 | /base/disaster2/death_causing_event /event/attack 199 | /base/foodrecipes/recipe_ingredient /food 200 | /sports/professional_sports_team /organization/sports_team 201 | /base/formula1/formula_1_grand_prix /event/sports_event 202 | /travel/accommodation /building/hotel 203 | /base/nascar/nascar_driver /person/athlete 204 | /geography/mountain_pass /geography/mountain 205 | /base/scubadiving/marine_creature /livingthing/animal 206 | /location/jp_city_town /location/city 207 | /astronomy/astronomer /person 208 | /cvg/cvg_designer /person/artist 209 | /film/film_festival /event 210 | /computer/computer_scientist /person 211 | /base/morelaw/canadian_judge /person 212 | /base/prison/prison /government_agency 213 | /user/robert/mobile_phones/mobile_phone /product/mobile_phone 214 | /spaceflight/spacecraft /product/spacecraft 215 | /base/fashionmodels/fashion_model /person 216 | /biology/protein /biology 217 | /user/tsegaran/random/formula_one_race /event/sports_event 218 | /conferences/conference_series /event 219 | /user/robert/data_nursery/aircraft_engine /product/engine_device 220 | /zoos/zoo /building 221 | /base/formula1/formula_1_driver /person/athlete 222 | /finance/currency /finance/currency 223 | /location/australian_local_government_area /location 224 | /engineering/engine /product/engine_device 225 | /aviation/airliner_accident /event/natural_disaster 226 | /music/festival /event 227 | /time/day_of_year /time 228 | /base/sportssandbox/sports_event /event/sports_event 229 | /geography/waterfall /location/body_of_water 230 | /film/production_company /organization/company 231 | /user/skud/boats/submarine /product/ship 232 | /location/in_district /location/county 233 | /user/skud/boats/vessel_class /product/ship 234 | /music/bassist /person/musician 235 | /user/robert/default_domain/given_name /person 236 | /base/athletics/track_and_field_athlete /person/athlete 237 | /location/de_city /location/city 238 | /cvg/musical_game_song /music 239 | /amusement_parks/park /park 240 | /location/country /location/country 241 | /location/jp_district /location/county 242 | /base/newsevents/news_reporting_organisation /news_agency 243 | /education/student_radio_station /broadcast_network 244 | /user/robert/earthquakes/earthquake 
/event/natural_disaster 245 | /astronomy/galaxy /astral_body 246 | /base/rugby/rugby_club /organization/sports_team 247 | /time/holiday /time 248 | /music/engineer /person/artist 249 | /religion/religion /religion/religion 250 | /film/film_production_designer /person/artist 251 | /user/robert/data_nursery/galaxy /astral_body 252 | /education/fraternity_sorority /organization/fraternity_sorority 253 | /education/department /education/department 254 | /wine/grape_variety /food 255 | /aviation/aircraft_manufacturer /organization/company 256 | /games/playing_card_game /game 257 | /opera/librettist /person/musician 258 | /base/rugby/rugby_coach /person/coach 259 | /venture_capital/venture_investor /organization/company 260 | /computer/software_developer /person/engineer 261 | /education/school_newspaper /newspaper 262 | /dining/chef /person 263 | /user/tsegaran/legal/act_of_congress /law 264 | /base/sails/sailing_ship_class /product/ship 265 | /geography/mountaineer /person 266 | /base/popstra/company /organization/company 267 | /location/nl_municipality /location/county 268 | /user/tsegaran/computer/algorithm /computer/algorithm 269 | /base/americancivilwar/battle /event/military_conflict 270 | /food/cheese /food 271 | /fashion/fashion_designer /person/artist 272 | /user/tsegaran/random/locomotive /train 273 | /music/orchestra /person/musician 274 | /soccer/football_league /organization/sports_league 275 | /user/robert/military/military_person /person/soldier 276 | /film/film_art_director /person/director 277 | /religion/religious_organization /organization 278 | /computer/computer /product/computer 279 | /religion/religious_leadership_title /title 280 | /film/film_company /organization/company 281 | /base/ovguide/country_musical_groups /person/musician 282 | /law/courthouse /building 283 | /baseball/baseball_manager /person/coach 284 | /base/ports/port_of_call /location 285 | /base/engineering/canal /location/body_of_water 286 | /law/court /government_agency 287 | /soccer/football_team_manager /person/coach 288 | /skiing/ski_area /location 289 | /user/skud/boats/warship_class /product/ship 290 | /boats/ship_type /product/ship 291 | /base/engineering/dam /building/dam 292 | /user/lindenb/default_domain/scientist /person 293 | /metropolitan_transit/transit_system /transit 294 | /geography/island_group /geography/island 295 | /base/ovguide/bollywood_films /art/film 296 | /astronomy/astronomical_observatory /building 297 | /education/educational_degree /education/educational_degree 298 | /book/illustrator /person/artist 299 | /base/usnationalparks/us_national_park /park 300 | /sports/school_sports_team /organization/sports_team 301 | /comic_strips/comic_strip_creator /person/artist 302 | /medicine/surgeon /person/doctor 303 | /cvg/cvg_platform /software 304 | /music/opera_singer /person/musician 305 | /location/ar_department /location/county 306 | /music/guitar /product/instrument 307 | /book/periodical_publisher /organization/company 308 | /user/robert/us_congress/us_senator /person/politician 309 | /baseball/baseball_coach /person/coach 310 | /base/wrestling/professional_wrestler /person/athlete 311 | /base/handball/handball_team /organization/sports_team 312 | /cvg/game_series /game 313 | /location/de_rural_district /location 314 | /cvg/game_voice_actor /person/artist 315 | /base/fires/explosion /event/natural_disaster 316 | /base/exoplanetology/exoplanet /astral_body 317 | /base/aptamer/chemical_compound /chemistry 318 | /user/skud/embassies_and_consulates/embassy /government_agency 319 | 
/base/casinos/casino /building 320 | /government/national_anthem /music 321 | /user/kconragan/graphic_design/graphic_designer /person/artist 322 | /user/carmenmfenn1/ballet/ballet /play 323 | /location/cn_prefecture_level_city /location/city 324 | /film/film_costumer_designer /person/artist 325 | /travel/transport_terminus /building 326 | /visual_art/color /visual_art/color 327 | /language/language_dialect /language 328 | /architecture/architecture_firm /organization/company 329 | /location/us_cbsa /location 330 | /user/skud/boats/cruise_ship /product/ship 331 | /digicams/digital_camera /product/camera 332 | /base/aptamer/nucleic_acid /biology 333 | /base/crime/police_department /government_agency 334 | /location/ca_census_division /location 335 | /opera/opera_company /organization/company 336 | /religion/religious_text /written_work 337 | /book/short_non_fiction /written_work 338 | /location/region /location 339 | /base/fight/protest /event/protest 340 | /location/cn_county_level_city /location/city 341 | /base/activism/organization /organization 342 | /base/popstra/location /location 343 | /music/drummer /person/musician 344 | /book/translated_work /written_work 345 | /library/public_library_system /building/library 346 | /base/fires/fires /event/natural_disaster 347 | /base/greece/gr_city /location/city 348 | /film/film_distributor /organization/company 349 | /user/arielb/israel/israeli_settlement /location 350 | /location/uk_non_metropolitan_district /location 351 | /base/disaster2/infectious_disease /disease 352 | /royalty/order_of_chivalry /title 353 | /medicine/infectious_disease /disease 354 | /base/morelaw/court /building 355 | /base/column/column_author /person/author 356 | /business/trade_union /organization 357 | /base/popstra/organization /organization 358 | /library/public_library /building/library 359 | /government/government /government/government 360 | /comic_books/comic_book_creator /person/artist 361 | /biology/animal /livingthing/animal 362 | /finance/stock_exchange /finance/stock_exchange 363 | /baseball/baseball_league /organization/sports_league 364 | /broadcast/tv_channel /broadcast/tv_channel 365 | /base/bioventurist/science_or_technology_company /organization/company 366 | /user/anandology/default_domain/railway_station /building 367 | /base/fairytales/fairy_tale /written_work 368 | /film/film_series /art/film 369 | /user/hangy/default_domain/at_municipality /location/city 370 | /user/maxim75/default_domain/transit_stop_connection /building 371 | /film/film_featured_song /music 372 | /base/fashion/fashion_designer /person/artist 373 | /film/film_festival_event /event 374 | /games/game_designer /person/artist 375 | /user/skud/boats/submarine_class /product/ship 376 | /user/carmenmfenn1/greco_roman_mythology/greek_deity /god 377 | /ice_hockey/hockey_coach /person/coach 378 | /cvg/musical_game /game 379 | /base/train/multiple_unit /train 380 | /base/magic/magician /person 381 | /user/carmenmfenn1/ballet/ballet_dancer /person/artist 382 | /rail/electric_locomotive_class /train 383 | /base/americancomedy/movie /art/film 384 | /rail/locomotive /train 385 | /base/train/electric_locomotive /train 386 | /base/tallships/tall_ship /product/ship 387 | /user/skud/boats/sailing_vessel /product/ship 388 | /base/classiccars/classic_car /product/car 389 | /user/carmenmfenn1/greco_roman_mythology/roman_deity /god 390 | /base/volleyball/beach_volleyball_player /person/athlete 391 | /cricket/cricket_team /organization/sports_team 392 | /american_football/football_team 
/organization/sports_team 393 | /base/bookstores/bookstore /building 394 | /architecture/tower /building 395 | /base/disaster2/tornado /event/natural_disaster 396 | /user/iubookgirl/default_domain/academic_library /building/library 397 | /base/crime/law_firm /organization/company 398 | /base/movietheatres/movie_theatre /building/theater 399 | /base/marchmadness/ncaa_basketball_team /organization/sports_team 400 | /base/weapons/weapon /product/weapon 401 | /book/publication /written_work 402 | /astronomy/comet /astral_body 403 | /base/peleton/cyclist /person/athlete 404 | /base/americancomedy/tv_show /broadcast_program 405 | /automotive/company /organization/company 406 | /base/surfing/surfer /person/athlete 407 | /base/disaster2/rail_accident /event/natural_disaster 408 | /user/robert/data_nursery/rail_accident /event/natural_disaster 409 | /base/fires/fire_department /government_agency 410 | /base/disaster2/flood /event/natural_disaster 411 | /user/robert/roman_empire/roman_emperor /person/monarch 412 | /user/techgnostic/default_domain/tv_series_serial /broadcast_program 413 | /base/infrastructure/nuclear_power_plant /building 414 | /base/crime/appellate_court /building 415 | /user/alecf/recreation/park /park 416 | /music/track /music 417 | /opera/opera_house /building 418 | /base/switzerland/ch_district /location 419 | /music/concert_film /art/film 420 | /location/us_indian_reservation /location 421 | /chemistry/chemical_element /chemistry 422 | /user/robert/data_nursery/chinese_emperor /person/monarch 423 | /base/sportssandbox/sports_recurring_event /event/sports_event 424 | /base/filmcameras/camera /product/camera 425 | /base/backpacking1/wilderness_area /location 426 | /base/athletics/athletics_marathon /event/sports_event 427 | /base/backpacking1/national_forest /location 428 | /education/student_organization /organization 429 | /award/award_ceremony /event 430 | /base/skateboarding/skateboarder /person/athlete 431 | /cricket/cricket_coach /person/coach 432 | /business/oil_field /location 433 | /base/peleton/road_bicycle_racing_event /event/sports_event 434 | /conferences/conference /event 435 | /base/charities/charity /organization 436 | /base/jsbach/bach_composition /music 437 | /tv/tv_theme_song /music 438 | /location/de_urban_district /location 439 | /base/snowboard/snowboarder /person/athlete 440 | /location/it_province /location/province 441 | /location/it_frazione /location 442 | /astronomy/nebula /astral_body 443 | /user/patrick/default_domain/submarine_class /product/ship 444 | /architecture/building_complex /building 445 | /base/bioventurist/organization /organization 446 | /base/smarthistory/visual_artist /person/artist 447 | /base/pgschools/school /organization/educational_institution 448 | /base/fight/political_rebellion_or_revolution /event/protest 449 | /base/engineering/mine /location 450 | /location/fr_department /location/county 451 | /food/beer_country_region /location 452 | /base/wrestling/championship_title /title 453 | /astronomy/star_system /astral_body 454 | /aviation/aircraft /product/airplane 455 | /radio/radio_program /broadcast_program 456 | /location/id_city /location/city 457 | /astronomy/constellation /astral_body 458 | /food/tea /food 459 | /food/beverage /food 460 | /location/pr_municipality /location/city 461 | /theater/theater_director /person/director 462 | /soccer/fifa /organization 463 | /location/kp_city /location/city 464 | /cricket/cricket_stadium /building/sports_facility 465 | /astronomy/supernova /astral_body 466 | /location/es_comarca 
/location/county 467 | /government/election_campaign /event/election 468 | /rail/diesel_locomotive_class /train 469 | /comic_books/comic_book_writer /person/artist 470 | /olympics/olympic_games /event/sports_event 471 | /language/conlang /language 472 | /location/uk_unitary_authority /location 473 | /location/vn_province /location/province 474 | /location/in_division /location/county 475 | /automotive/automotive_class /product/car 476 | /exhibitions/exhibition /event 477 | /automotive/make /organization/company 478 | /opera/opera_director /person/director 479 | /location/es_province /location/province 480 | /wine/wine_sub_region /location 481 | /location/us_state /location/province 482 | /film/film_casting_director /person/director 483 | /book/reviewed_work /written_work 484 | /food/beer /food 485 | /comic_books/comic_book_story /written_work 486 | /location/jp_prefecture /written_work 487 | /government/us_vice_president /person/politician 488 | /american_football/super_bowl /event/sports_event 489 | /location/id_subdistrict /location 490 | /location/in_city /location/city 491 | /government/us_president /person/politician 492 | /organization/australian_organization /organization 493 | /location/uk_metropolitan_borough /location 494 | /location/uk_non_metropolitan_county /location/county 495 | /location/id_province /location/province 496 | /martial_arts/martial_arts_organization /organization 497 | /location/cn_autonomous_county /location/county 498 | /education/academic_post_title /title 499 | /music/concert /event 500 | /location/uk_council_area /location 501 | /location/mx_state /location/province 502 | /location/de_regierungsbezirk /location/county 503 | /location/cn_autonomous_prefecture /location/county 504 | /law/constitutional_amendment /law 505 | /digicams/digital_camera_manufacturer /organization/company 506 | /wine/wine /food 507 | /soccer/football_world_cup /event/sports_event 508 | /location/in_state /location/province 509 | /location/tw_district /location/county 510 | /location/br_state /location/province 511 | /sports/golf_course /location 512 | /location/fr_region /location/province 513 | /location/ca_indian_reserve /location 514 | /royalty/chivalric_title /title 515 | /location/ua_oblast /location/province 516 | /location/jp_special_ward /location 517 | /location/cn_province /location/province 518 | /location/ar_province /location/province 519 | /book/book_binding /location/city 520 | /location/vn_provincial_city /location/city 521 | /book/translation /written_work 522 | /music/music_video_director /person/director 523 | /location/it_region /location 524 | /cricket/cricket_tournament_event /event/sports_event 525 | /biology/amino_acid /biology 526 | /location/de_borough /location 527 | /location/ru_republic /location/province 528 | /location/hk_district /location 529 | /education/school_magazine /written_work 530 | /location/jp_designated_city /location/city 531 | /location/us_territory /location 532 | /location/my_division /location/county 533 | /location/de_state /location/province 534 | /location/uk_overseas_territory /location 535 | /location/my_state /location/province 536 | /cricket/cricket_tournament /event/sports_event 537 | /location/nl_province /location/province 538 | /automotive/model_year /product/car 539 | /music/single /music 540 | /wine/vineyard /location 541 | /organization/organization_committee_title /title 542 | /location/province /location/province 543 | /location/mx_municipality /location 544 | /location/uk_region /location 545 | /location/kp_province 
/location/province 546 | /location/jp_subprefecture /location 547 | /location/cn_prefecture /location 548 | /music/live_album /music 549 | /location/ru_krai /location/province 550 | /law/judicial_title /title 551 | /cricket/cricket_league /organization/sports_league 552 | /astronomy/planet /location/province 553 | /location/in_union_territory /location/province 554 | /location/continent /location 555 | /book/short_non_fiction_variety /written_work 556 | /book/serialized_work /written_work 557 | /astronomy/galactic_super_cluster /astral_body 558 | /astronomy/galactic_group /astral_body 559 | /astronomy/galactic_cluster /astral_body 560 | /astronomy/celestial_object_with_coordinate_system /astral_body 561 | /location/vn_township /location/city 562 | /location/uk_metropolitan_county /location/county 563 | /location/ru_autonomous_okrug /location/province 564 | /location/kp_metropolitan_city /location/city 565 | /location/australian_state /location/province 566 | /biology/chromosome /biology -------------------------------------------------------------------------------- /mapping/ontonotes.logic.mapping: -------------------------------------------------------------------------------- 1 | + EMPTY /other -------------------------------------------------------------------------------- /mapping/ontonotes.mapping: -------------------------------------------------------------------------------- 1 | /people/person /person 2 | /book/author /person/artist/author 3 | /film/actor /person/artist/actor 4 | /music/artist /person/artist/music 5 | /sports/pro_athlete /person/athlete 6 | /medicine/physician /person/doctor 7 | /government/politician /person/political_figure 8 | /government/government_office_or_title /person/political_figure 9 | /base/crime/criminal_defence_attorney /person/legal 10 | /base/crime/lawyer_type /person/legal 11 | /law/judge /person/legal 12 | /fictional_universe/fictional_job_title /person/title 13 | /business/job_title /person/title 14 | /government/government_office_category /person/title 15 | /aviation/airport /location/structure/airport 16 | /architecture/building /location/structure 17 | /travel/hotel /location/structure/hotel 18 | /sports/sports_facility /location/structure/sports_facility 19 | /geography/body_of_water /location/geography/body_of_water 20 | /geography/mountain /location/geography/mountain 21 | /geography/* /location/geography 22 | /transportation/bridge /location/transit/bridge 23 | /rail/railway /location/transit/railway 24 | /transportation/road /location/transit/road 25 | /location/citytown /location/city 26 | /location/country /location/country 27 | /amusement_parks/park /location/park 28 | /base/usnationalparks/us_national_park /location/park 29 | /location/location /location 30 | /organization/organization_type /organization 31 | /organization/organization /organization 32 | /base/newsevents/news_reporting_organisation /organization/company/news 33 | /book/publishing_company /organization/company/news 34 | /broadcast/producer /organization/company/broadcast 35 | /business/employer /organization/company 36 | /education/academic_institution /organization/education 37 | /education/educational_institution /organization/education 38 | /government/government_agency /organization/government 39 | /government/government /organization/government 40 | /government/governmental_body /organization/government 41 | /base/crime/type_of_law_enforcement_agency /organization/government 42 | /military/military_unit /organization/military 43 | /government/political_party 
/organization/political_party 44 | /sports/sports_team /organization/sports_team 45 | /finance/stock_exchange /organization/stock_exchange 46 | /tv/tv_program /other/art/broadcast 47 | /film/film /other/art/film 48 | /music/album /other/art/music 49 | /music/composition /other/art/music 50 | /theater/play /other/art/stage 51 | /opera/opera /other/art/stage 52 | /book/written_work /other/art/writing 53 | /book/short_story /other/art/writing 54 | /book/poem /other/art/writing 55 | /book/literary_series /other/art/writing 56 | /book/publication /other/art/writing 57 | /time/event /other/event 58 | /time/holiday /other/event/holiday 59 | /military/military_conflict /other/event/violent_conflict 60 | /medicine/medical_treatment /other/health/treatment 61 | /award/award /other/award 62 | /medicine/anatomical_structure /other/body_part 63 | /finance/currency /other/currency 64 | /biology/animal /other/living_thing/animal 65 | /base/plants/plant /other/living_thing 66 | /law/invention /other/product/weapon 67 | /automotive/model /other/product/vehicle 68 | /aviation/aircraft_model /other/product/vehicle 69 | /computer/* /other/product/computer 70 | /computer/software /other/product/software 71 | /food/food /other/food 72 | /religion/religion /other/religion 73 | /people/ethnicity /other/heritage 74 | /base/morelaw/type_of_legal_subject /other/legal 75 | /user/sprocketonline/economics/legislation /other/legal 76 | /user/tsegaran/legal/act_of_congress /other/legal 77 | /user/skud/legal/treaty /other/legal 78 | /law/constitutional_amendment /other/legal 79 | EMPTY /other -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | h5py 2 | tensorflow 3 | numpy 4 | scipy 5 | regex 6 | Flask 7 | flask-cors 8 | ccg_nlpy 9 | gensim 10 | requests -------------------------------------------------------------------------------- /scripts.py: -------------------------------------------------------------------------------- 1 | import os 2 | import pickle 3 | import sqlite3 4 | import sys 5 | 6 | from ccg_nlpy import local_pipeline 7 | 8 | from cache import SurfaceCache 9 | from main import ZoeRunner 10 | from zoe_utils import DataReader 11 | from zoe_utils import ElmoProcessor 12 | from zoe_utils import Sentence 13 | 14 | 15 | def convert_esa_map(esa_file_name, freq_file_name, invcount_file_name): 16 | esa_map = {} 17 | with open(esa_file_name) as f: 18 | for line in f: 19 | line = line.strip() 20 | if len(line.split("\t")) <= 1: 21 | continue 22 | key = line.split("\t")[0] 23 | value = line.split("\t")[1] 24 | esa_map[key] = value 25 | with open('data/esa/esa.pickle', 'wb') as handle: 26 | pickle.dump(esa_map, handle, protocol=pickle.HIGHEST_PROTOCOL) 27 | 28 | freq_map = {} 29 | with open(freq_file_name) as f: 30 | for line in f: 31 | line = line.strip() 32 | if len(line.split("\t")) <= 1: 33 | continue 34 | key = line.split("\t")[0] 35 | value = int(line.split("\t")[1]) 36 | freq_map[key] = value 37 | with open('data/esa/freq.pickle', 'wb') as handle: 38 | pickle.dump(freq_map, handle, protocol=pickle.HIGHEST_PROTOCOL) 39 | 40 | invcount_map = {} 41 | with open(invcount_file_name) as f: 42 | for line in f: 43 | line = line.strip() 44 | if len(line.split("\t")) <= 1: 45 | continue 46 | key = line.split("\t")[0] 47 | value = int(line.split("\t")[1]) 48 | invcount_map[key] = value 49 | with open('data/esa/invcount.pickle', 'wb') as handle: 50 | pickle.dump(invcount_map, handle, 
protocol=pickle.HIGHEST_PROTOCOL) 51 | 52 | 53 | def convert_wikilinks_sent_examples(sent_example_file_name): 54 | sent_example_map = {} 55 | max_bytes = 2 ** 31 - 1 56 | with open(sent_example_file_name) as f: 57 | for line in f: 58 | line = line.strip() 59 | if len(line.split("\t")) <= 1: 60 | continue 61 | key = line.split("\t")[0] 62 | value = line.split("\t")[1] 63 | sent_example_map[key] = value 64 | bytes_out = pickle.dumps(sent_example_map, protocol=pickle.HIGHEST_PROTOCOL) 65 | with open('data/sent_example.pickle', 'wb') as handle: 66 | for idx in range(0, len(bytes_out), max_bytes): 67 | handle.write(bytes_out[idx:idx + max_bytes]) 68 | 69 | 70 | def convert_freebase(freebase_file_name, freebase_sup_file_name): 71 | ret_map = {} 72 | with open(freebase_file_name) as f: 73 | for line in f: 74 | line = line.strip() 75 | if len(line.split("\t")) <= 1: 76 | continue 77 | key = line.split("\t")[0] 78 | val = line.split("\t")[1] 79 | ret_map[key] = val 80 | with open(freebase_sup_file_name) as f: 81 | for line in f: 82 | line = line.strip() 83 | if len(line.split("\t")) <= 1: 84 | continue 85 | key = line.split("\t")[0] 86 | val = line.split("\t")[1] 87 | if key not in ret_map: 88 | ret_map[key] = val 89 | with open('data/title2freebase.pickle', 'wb') as handle: 90 | pickle.dump(ret_map, handle, protocol=pickle.HIGHEST_PROTOCOL) 91 | 92 | 93 | def convert_prob(prob_file_name, n2c_fil2_name): 94 | prob_map = {} 95 | n2c_map = {} 96 | with open(n2c_fil2_name) as f: 97 | for line in f: 98 | line = line.strip() 99 | n2c_map[line.split("\t")[0]] = line.split("\t")[1] 100 | with open(prob_file_name) as f: 101 | for line in f: 102 | line = line.strip() 103 | key = line.split("\t")[0] 104 | val = float(line.split("\t")[1]) 105 | surface = key.split("|")[0] 106 | title = key.split("|")[1] 107 | if title in n2c_map: 108 | title = n2c_map[title] 109 | if surface in prob_map: 110 | cur_highest = prob_map[surface][1] 111 | if val > cur_highest: 112 | prob_map[surface] = (title, val) 113 | else: 114 | prob_map[surface] = (title, val) 115 | with open('data/prior_prob.pickle', 'wb') as handle: 116 | pickle.dump(prob_map, handle, protocol=pickle.HIGHEST_PROTOCOL) 117 | 118 | 119 | def convert_cached_embeddings(raw_file_name, output_file_name): 120 | ret_map = {} 121 | max_bytes = 2 ** 31 - 1 122 | with open(raw_file_name) as f: 123 | for line in f: 124 | line = line.strip() 125 | token = line.split("\t")[0] 126 | if line.split("\t")[1] == "null": 127 | continue 128 | vals = line.split("\t")[1].split(",") 129 | vec = [] 130 | for val in vals: 131 | vec.append(float(val)) 132 | ret_map[token] = vec 133 | bytes_out = pickle.dumps(ret_map, protocol=pickle.HIGHEST_PROTOCOL) 134 | with open(output_file_name, 'wb') as handle: 135 | for idx in range(0, len(bytes_out), max_bytes): 136 | handle.write(bytes_out[idx:idx + max_bytes]) 137 | 138 | 139 | def reduce_cache_file_size(cache_pickle_file_name, title_file_name, out_file_name): 140 | with open(cache_pickle_file_name, "rb") as handle: 141 | cache_map = pickle.load(handle) 142 | title_set = set() 143 | with open(title_file_name, "r") as f: 144 | for line in f: 145 | line = line.strip() 146 | title_set.add(line) 147 | ret_map = {} 148 | for key in cache_map: 149 | if key in title_set: 150 | ret_map[key] = cache_map[key] 151 | with open(out_file_name, "wb") as handle: 152 | pickle.dump(ret_map, handle, protocol=pickle.HIGHEST_PROTOCOL) 153 | 154 | 155 | def check_data_file_integrity(mode=""): 156 | file_list = [ 157 | 'data/esa/esa.pickle', 158 | 
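# Shared resources required by every mode; mode-specific pickles are added via corpus_supplements below.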
'data/esa/freq.pickle', 159 | 'data/esa/invcount.pickle', 160 | 'data/prior_prob.pickle', 161 | 'data/sent_example.pickle', 162 | 'data/title2freebase.pickle', 163 | ] 164 | corpus_supplements = [] 165 | if mode == "figer": 166 | corpus_supplements = [ 167 | 'data/FIGER/target.embedding.pickle', 168 | 'data/FIGER/wikilinks.embedding.pickle', 169 | 'mapping/figer.mapping', 170 | 'mapping/figer.logic.mapping' 171 | ] 172 | passed = True 173 | for file in file_list + corpus_supplements: 174 | if not os.path.isfile(file): 175 | print("[ERROR]: Missing " + file) 176 | passed = False 177 | if not passed: 178 | print("You have one or more files missing. Please refer to the README for a solution.") 179 | else: 180 | print("All required or suggested files are here. Go ahead and run experiments!") 181 | 182 | 183 | def compare_runlogs(runlog_file_a, runlog_file_b): 184 | if not os.path.isfile(runlog_file_a) or not os.path.isfile(runlog_file_b): 185 | print("Invalid input file names"); return  # bail out instead of crashing on open() below 186 | with open(runlog_file_a, "rb") as handle: 187 | log_a = pickle.load(handle) 188 | with open(runlog_file_b, "rb") as handle: 189 | log_b = pickle.load(handle) 190 | for sentence in log_a: 191 | for compare_sentence in log_b: 192 | if sentence.get_sent_str() == compare_sentence.get_sent_str(): 193 | if sentence.get_mention_surface() == compare_sentence.get_mention_surface(): 194 | if sentence.predicted_types != compare_sentence.predicted_types: 195 | print(sentence.get_sent_str()) 196 | print(sentence.get_mention_surface()) 197 | print(sentence.gold_types) 198 | print("Log A prediction: " + str(sentence.predicted_types)) 199 | print("Log B prediction: " + str(compare_sentence.predicted_types)) 200 | 201 | 202 | def produce_cache(): 203 | elmo_processor = ElmoProcessor(allow_tensorflow=True) 204 | to_process = [] 205 | to_process_concepts = [] 206 | sorted_pairs = sorted(elmo_processor.sent_example_map.items()) 207 | cur_processing_file_num = ord(sorted_pairs[0][0][0]) 208 | sub_map_index = 0 209 | max_bytes = 2 ** 31 - 1 210 | for pair in sorted_pairs: 211 | concept = pair[0] 212 | file_num = ord(concept[0]) 213 | if file_num != cur_processing_file_num or len(to_process) > 10000: 214 | new_start = False 215 | if file_num != cur_processing_file_num: 216 | new_start = True 217 | out_file_name = "data/cache/batch_" + str(cur_processing_file_num) + "_" + str(sub_map_index) + ".pickle" 218 | if new_start: 219 | out_file_name = "data/cache/batch_" + str(cur_processing_file_num) + ".pickle" 220 | if not os.path.isfile(out_file_name) and cur_processing_file_num >= 65: 221 | print("Prepared to run ELMo on " + chr(cur_processing_file_num)) 222 | print("This batch contains " + str(len(to_process_concepts)) + " concepts, and " + str(len(to_process)) + " sentences.") 223 | elmo_map = elmo_processor.process_batch(to_process) 224 | batch_map = {} 225 | for processed_concept in to_process_concepts: 226 | if processed_concept in elmo_map: 227 | batch_map[processed_concept] = elmo_map[processed_concept] 228 | bytes_out = pickle.dumps(batch_map, protocol=pickle.HIGHEST_PROTOCOL) 229 | with open(out_file_name, "wb") as handle: 230 | for idx in range(0, len(bytes_out), max_bytes): 231 | handle.write(bytes_out[idx:idx + max_bytes]) 232 | print("Processed all concepts starting with " + chr(cur_processing_file_num)) 233 | print() 234 | to_process = [] 235 | to_process_concepts = [] 236 | if new_start: 237 | cur_processing_file_num = file_num 238 | sub_map_index = 0 239 | else: 240 | sub_map_index += 1 241 | example_sentences_str =
elmo_processor.sent_example_map[concept] 242 | example_sentences = example_sentences_str.split("|||") 243 | for i in range(0, min(len(example_sentences), 10)): 244 | to_process.append(example_sentences[i]) 245 | to_process_concepts.append(concept) 246 | 247 | 248 | def progress_bar(value, endvalue, bar_length=20): 249 | percent = float(value) / endvalue 250 | arrow = '-' * int(round(percent * bar_length) - 1) + '>' 251 | spaces = ' ' * (bar_length - len(arrow)) 252 | sys.stdout.write("\rProgress: [{0}] {1}%".format(arrow + spaces, round(percent * 100, 3))) 253 | sys.stdout.flush() 254 | 255 | 256 | def produce_surface_cache(db_name, cache_name): 257 | pipeline = local_pipeline.LocalPipeline() 258 | cache = SurfaceCache(db_name, server_mode=False) 259 | runner = ZoeRunner() 260 | runner.elmo_processor.load_sqlite_db(cache_name, server_mode=False) 261 | dataset = DataReader("data/large_text.json", size=-1, unique=True) 262 | counter = 0 263 | total = len(dataset.sentences) 264 | for sentence in dataset.sentences: 265 | ta = pipeline.doc([sentence.tokens], pretokenized=True) 266 | for chunk in ta.get_shallow_parse: 267 | new_sentence = Sentence(sentence.tokens, chunk['start'], chunk['end']) 268 | runner.process_sentence(new_sentence) 269 | cache.insert_cache(new_sentence) 270 | counter += 1; progress_bar(counter, total)  # one tick per source sentence 271 | 272 | 273 | def produce_magnitude_vec_file(db_name, out_file): 274 | conn = sqlite3.connect(db_name) 275 | cursor = conn.cursor() 276 | cursor.execute("SELECT * FROM data") 277 | with open(out_file, "w") as w: 278 | for row in cursor: 279 | key = row[0] 280 | val = row[1] 281 | val = val[1:-1].replace(",", "") 282 | w.write(key + " " + val + "\n") 283 | conn.close() 284 | 285 | if __name__ == '__main__': 286 | if len(sys.argv) < 2: 287 | print("[ERROR]: No command given.") 288 | exit(1) 289 | if sys.argv[1] == "CHECKFILE": 290 | if len(sys.argv) == 2: 291 | check_data_file_integrity() 292 | else: 293 | check_data_file_integrity(sys.argv[2]) 294 | if sys.argv[1] == "COMPARE": 295 | if len(sys.argv) < 4: 296 | print("Need two files for comparison."); exit(1) 297 | compare_runlogs(sys.argv[2], sys.argv[3]) 298 | if sys.argv[1] == "CACHE": 299 | produce_cache() 300 | if sys.argv[1] == "SURFACECACHE": 301 | produce_surface_cache("data/surface_cache.db", "/Volumes/Storage/Resources/wikilinks/elmo_cache_correct.db") 302 | if sys.argv[1] == "PRODUCE_VEC": 303 | produce_magnitude_vec_file("/Volumes/External/elmo_cache_correct.db", "/Volumes/External/elmo_cache.vec") 304 | -------------------------------------------------------------------------------- /server.py: -------------------------------------------------------------------------------- 1 | import json 2 | import signal 3 | import time 4 | import traceback 5 | 6 | import requests 7 | from ccg_nlpy import local_pipeline 8 | from flask import Flask 9 | from flask import request 10 | from flask import send_from_directory 11 | from flask_cors import CORS 12 | 13 | from cache import ServerCache 14 | from cache import SurfaceCache 15 | from main import ZoeRunner 16 | from zoe_utils import InferenceProcessor 17 | from zoe_utils import Sentence 18 | 19 | 20 | class CogCompLoggerClient: 21 | def __init__(self, demo_name, base_url="http://127.0.0.1:5000"): 22 | self.demo_name = demo_name 23 | self.base_url = base_url 24 | if self.base_url.endswith("/"): 25 | self.url = self.base_url + "log" 26 | else: 27 | self.url = self.base_url + "/log" 28 | 29 | def log(self, content=""): 30 | params = { 31 | 'entry_name': self.demo_name, 32 | 'content': content 33 | } 34 | result =
requests.post(url=self.url, params=params).json() 35 | if result['result'] == 'SUCCESS': 36 | return True 37 | return False 38 | 39 | def log_dict(self, d=None): 40 | if d is None: 41 | return self.log() 42 | else: 43 | return self.log(content=json.dumps(d)) 44 | 45 | 46 | class Server: 47 | 48 | """ 49 | Initialize the server with needed resources 50 | @sql_db_path: The path pointing to the ELMo caches sqlite file 51 | """ 52 | def __init__(self, sql_db_path, surface_cache_path): 53 | self.app = Flask(__name__) 54 | CORS(self.app) 55 | self.mem_cache = ServerCache() 56 | self.surface_cache = SurfaceCache(surface_cache_path) 57 | self.pipeline = local_pipeline.LocalPipeline() 58 | self.pipeline_initialize_helper(['.']) 59 | self.logger = CogCompLoggerClient("zoe", base_url="http://macniece.seas.upenn.edu:4005") 60 | self.runner = ZoeRunner(allow_tensorflow=True) 61 | status = self.runner.elmo_processor.load_sqlite_db(sql_db_path, server_mode=True) 62 | if not status: 63 | print("ELMo cache file is not found. Server mode is prohibited without it.") 64 | print("Please contact the author for this cache, or modify this code if you know what you are doing.") 65 | exit(1) 66 | self.runner.elmo_processor.rank_candidates_vec() 67 | signal.signal(signal.SIGINT, self.grace_end) 68 | 69 | @staticmethod 70 | def handle_root(path): 71 | return send_from_directory('./frontend', path) 72 | 73 | @staticmethod 74 | def handle_redirection(): 75 | return Server.handle_root("index.html") 76 | 77 | def parse_custom_rules(self, rules): 78 | type_to_titles = {} 79 | freebase_freq_total = {} 80 | for rule in rules: 81 | title = rule.split("|||")[0] 82 | freebase_types = [] 83 | if title in self.runner.inference_processor.freebase_map: 84 | freebase_types = self.runner.inference_processor.freebase_map[title].split(",") 85 | for ft in freebase_types: 86 | if ft in freebase_freq_total: 87 | freebase_freq_total[ft] += 1 88 | else: 89 | freebase_freq_total[ft] = 1 90 | custom_type = rule.split("|||")[1] 91 | if custom_type in type_to_titles: 92 | type_to_titles[custom_type].append(title) 93 | else: 94 | type_to_titles[custom_type] = [title] 95 | counter = 0 96 | ret = {} 97 | for custom_type in type_to_titles: 98 | freebase_freq = {} 99 | for title in type_to_titles[custom_type]: 100 | freebase_types = [] 101 | if title in self.runner.inference_processor.freebase_map: 102 | freebase_types = self.runner.inference_processor.freebase_map[title].split(",") 103 | counter += 1 104 | for freebase_type in freebase_types: 105 | if freebase_type in freebase_freq: 106 | freebase_freq[freebase_type] += 1 107 | else: 108 | freebase_freq[freebase_type] = 1 109 | for ft in freebase_freq: 110 | if float(freebase_freq[ft]) > float(counter) * 0.5 and freebase_freq[ft] == freebase_freq_total[ft]: 111 | ft = "/" + ft.replace(".", "/") 112 | ret[ft] = custom_type 113 | return ret 114 | 115 | """ 116 | Main request handler 117 | It requires the request to contain required information like tokens/mentions 118 | in the format of a json string 119 | 120 | @param_override: override API input with a pre-defined dictionary 121 | """ 122 | def handle_input(self, param_override=None): 123 | start_time = time.time() 124 | ret = {} 125 | r = request.get_json() 126 | if param_override is not None: 127 | r = param_override 128 | if "tokens" not in r or "mention_starts" not in r or "mention_ends" not in r or "index" not in r: 129 | ret["type"] = [["INVALID_INPUT"]] 130 | ret["index"] = -1 131 | ret["mentions"] = [] 132 | ret["candidates"] = [[]] 133 | 
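# Malformed request: one of tokens / mention_starts / mention_ends / index is missing, so return the sentinel payload built above without processing.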
return json.dumps(ret) 134 | sentences = [] 135 | for i in range(0, len(r["mention_starts"])): 136 | sentence = Sentence(r["tokens"], int(r["mention_starts"][i]), int(r["mention_ends"][i]), "") 137 | sentences.append(sentence) 138 | mode = r.get("mode", "figer")  # fall back to the preloaded FIGER processor instead of raising KeyError 139 | predicted_types = [] 140 | predicted_candidates = [] 141 | other_possible_types = [] 142 | selected_candidates = [] 143 | mentions = [] 144 | if mode != "figer": 145 | if mode != "custom": 146 | selected_inference_processor = InferenceProcessor(mode, resource_loader=self.runner.inference_processor) 147 | else: 148 | rules = r["taxonomy"] 149 | mappings = self.parse_custom_rules(rules) 150 | selected_inference_processor = InferenceProcessor(mode, custom_mapping=mappings) 151 | else: 152 | selected_inference_processor = self.runner.inference_processor 153 | 154 | for sentence in sentences: 155 | sentence.set_signature(selected_inference_processor.signature()) 156 | cached = self.mem_cache.query_cache(sentence) 157 | if cached is not None: 158 | sentence = cached 159 | else: 160 | self.runner.process_sentence(sentence, selected_inference_processor) 161 | try: 162 | self.mem_cache.insert_cache(sentence) 163 | self.surface_cache.insert_cache(sentence) 164 | except Exception: 165 | print("Cache insertion exception. Ignored.") 166 | predicted_types.append(list(sentence.predicted_types)) 167 | predicted_candidates.append(sentence.elmo_candidate_titles) 168 | mentions.append(sentence.get_mention_surface_raw()) 169 | selected_candidates.append(sentence.selected_title) 170 | other_possible_types.append(sentence.could_also_be_types) 171 | 172 | elapsed_time = time.time() - start_time 173 | print("Processed mention " + str([x.get_mention_surface() for x in sentences]) + " in mode " + mode + ". TIME: " + str(elapsed_time) + " seconds.") 174 | 175 | # Post logging request to Cogcomp Logger 176 | self.logger.log_dict(r) 177 | 178 | ret["type"] = predicted_types 179 | ret["candidates"] = predicted_candidates 180 | ret["mentions"] = mentions 181 | ret["index"] = r["index"] 182 | ret["selected_candidates"] = selected_candidates 183 | ret["other_possible_type"] = other_possible_types 184 | return json.dumps(ret) 185 | 186 | def pipeline_initialize_helper(self, tokens): 187 | doc = self.pipeline.doc([tokens], pretokenized=True) 188 | doc.get_shallow_parse 189 | doc.get_ner_conll 190 | doc.get_ner_ontonotes 191 | doc.get_view("MENTION") 192 | 193 | def handle_tokenizer_input(self): 194 | r = request.get_json() 195 | ret = {"tokens": []} 196 | if "sentence" not in r: 197 | return json.dumps(ret) 198 | doc = self.pipeline.doc(r["sentence"]) 199 | token_view = doc.get_tokens 200 | for cons in token_view: 201 | ret["tokens"].append(str(cons)) 202 | return json.dumps(ret) 203 | 204 | """ 205 | Handles requests for mention filling 206 | """ 207 | def handle_mention_input(self): 208 | r = request.get_json() 209 | ret = {'mention_spans': []} 210 | if "tokens" not in r: 211 | return json.dumps(ret) 212 | tokens = r["tokens"] 213 | doc = self.pipeline.doc([tokens], pretokenized=True) 214 | shallow_parse_view = doc.get_shallow_parse 215 | ner_conll_view = doc.get_ner_conll 216 | ner_ontonotes_view = doc.get_ner_ontonotes 217 | md_view = doc.get_view("MENTION") 218 | ret_set = set() 219 | ret_list = [] 220 | additions_views = [] 221 | if ner_ontonotes_view.cons_list is not None: 222 | additions_views.append(ner_ontonotes_view) 223 | if md_view.cons_list is not None: 224 | additions_views.append(md_view) 225 | if shallow_parse_view.cons_list is not None: 226 |
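# NER CoNLL spans are claimed first (in the try block below); the remaining views run in priority order (NER OntoNotes, mention detection, shallow-parse NP chunks) and only add spans that do not overlap tokens already claimed.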
additions_views.append(shallow_parse_view) 227 | try: 228 | if ner_conll_view.cons_list is not None: 229 | for ner_conll in ner_conll_view: 230 | for i in range(ner_conll['start'], ner_conll['end']): 231 | ret_set.add(i) 232 | ret_list.append((ner_conll['start'], ner_conll['end'])) 233 | for additions_view in additions_views: 234 | for cons in additions_view: 235 | add_to_list = True 236 | if additions_view.view_name != "MENTION": 237 | if additions_view.view_name == "SHALLOW_PARSE" and cons['label'] != "NP": 238 | continue 239 | start = int(cons['start']) 240 | end = int(cons['end']) 241 | else: 242 | start = int(cons['properties']['EntityHeadStartSpan']) 243 | end = int(cons['properties']['EntityHeadEndSpan']) 244 | for i in range(max(start - 1, 0), min(len(tokens), end + 1)): 245 | if i in ret_set: 246 | add_to_list = False 247 | break 248 | if add_to_list: 249 | for i in range(start, end): 250 | ret_set.add(i) 251 | ret_list.append((start, end)) 252 | except Exception as e: 253 | traceback.print_exc() 254 | print(e) 255 | ret['mention_spans'] = ret_list 256 | return json.dumps(ret) 257 | 258 | """ 259 | Handles surface form cached requests 260 | This is expected to return sooner than actual processing 261 | """ 262 | def handle_simple_input(self): 263 | ret = {} 264 | r = request.get_json() 265 | if "tokens" not in r or "mention_starts" not in r or "mention_ends" not in r or "index" not in r: 266 | ret["type"] = [["INVALID_INPUT"]] 267 | return json.dumps(ret) 268 | sentences = [] 269 | for i in range(0, len(r["mention_starts"])): 270 | sentence = Sentence(r["tokens"], int(r["mention_starts"][i]), int(r["mention_ends"][i]), "") 271 | sentences.append(sentence) 272 | types = [] 273 | for sentence in sentences: 274 | surface = sentence.get_mention_surface() 275 | cached_types = self.surface_cache.query_cache(surface) 276 | if cached_types is not None: 277 | distinct = set() 278 | for t in cached_types: 279 | distinct.add("/" + t.split("/")[1]) 280 | types.append(list(distinct)) 281 | else: 282 | types.append([]) 283 | ret["type"] = types 284 | ret["index"] = r["index"] 285 | return json.dumps(ret) 286 | 287 | def handle_word2vec_input(self): 288 | ret = {} 289 | r = request.get_json() 290 | if "tokens" not in r or "mention_starts" not in r or "mention_ends" not in r or "index" not in r: 291 | ret["type"] = [["INVALID_INPUT"]] 292 | return json.dumps(ret) 293 | sentences = [] 294 | for i in range(0, len(r["mention_starts"])): 295 | sentence = Sentence(r["tokens"], int(r["mention_starts"][i]), int(r["mention_ends"][i]), "") 296 | sentences.append(sentence) 297 | predicted_types = [] 298 | for sentence in sentences: 299 | self.runner.process_sentence_vec(sentence) 300 | predicted_types.append(list(sentence.predicted_types)) 301 | ret["type"] = predicted_types 302 | ret["index"] = r["index"] 303 | return json.dumps(ret) 304 | 305 | def handle_elmo_input(self): 306 | ret = {} 307 | results = [] 308 | r = request.get_json() 309 | if "sentence" not in r: 310 | ret["vectors"] = [] 311 | return json.dumps(ret) 312 | elmo_map = self.runner.elmo_processor.process_single_continuous(r["sentence"]) 313 | for token in r["sentence"].split(): 314 | results.append((token, str(elmo_map[token]))) 315 | ret["vectors"] = results 316 | return json.dumps(ret) 317 | 318 | def handle_logger_test(self): 319 | params = { 320 | "tokens": ["Iced", "Earth", "\\u2019", "s", "musical", "style", "is", "influenced", "by", "many", "traditional", "heavy", "metal", "groups", "such", "as", "Black", "Sabbath", "."], 321 | 
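# Fixed sample payload: the mention "Iced Earth" (tokens[0:2]) typed in FIGER mode, mirroring the /annotate request schema.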
"index": 0, 322 | "mention_starts": [0], 323 | "mention_ends": [2], 324 | "mode": "figer", 325 | "taxonomy": [], 326 | } 327 | self.handle_input(param_override=params) 328 | return "finished" 329 | 330 | """ 331 | Handler to start the Flask app 332 | @localhost: Whether the server lives only in localhost 333 | @port: A port number, default to 80 (Web) 334 | """ 335 | def start(self, localhost=False, port=80): 336 | self.app.add_url_rule("/", "", self.handle_redirection) 337 | self.app.add_url_rule("/", "", self.handle_root) 338 | self.app.add_url_rule("/annotate", "annotate", self.handle_input, methods=['POST']) 339 | self.app.add_url_rule("/annotate_token", "annotate_token", self.handle_tokenizer_input, methods=['POST']) 340 | self.app.add_url_rule("/annotate_mention", "annotate_mention", self.handle_mention_input, methods=['POST']) 341 | self.app.add_url_rule("/annotate_cache", "annotate_cache", self.handle_simple_input, methods=['POST']) 342 | self.app.add_url_rule("/annotate_vec", "annotate_vec", self.handle_word2vec_input, methods=['POST']) 343 | self.app.add_url_rule("/annotate_elmo", "annotate_elmo", self.handle_elmo_input, methods=['POST']) 344 | # Specifically saved for logger test 345 | self.app.add_url_rule("/test", "test", self.handle_logger_test, methods=['POST', 'GET']) 346 | if localhost: 347 | self.app.run() 348 | else: 349 | self.app.run(host='0.0.0.0', port=port) 350 | 351 | def grace_end(self, signum, frame): 352 | print("Gracefully Existing...") 353 | if self.runner.elmo_processor.db_loaded: 354 | self.runner.elmo_processor.db_conn.close() 355 | print("Resource Released. Existing.") 356 | exit(0) 357 | 358 | 359 | if __name__ == '__main__': 360 | # First argument is a placeholder. Please ask for the actual file. 361 | server = Server("elmo_cache_correct.db", "./data/surface_cache_new.db") 362 | server.start(localhost=True) 363 | 364 | --------------------------------------------------------------------------------