├── ELMo
│   ├── options.json
│   └── weights.hdf5
└── README.md

/ELMo/options.json:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:86cb4c1a7e2f25e3867d06810e33ae27fbbd511ac77ca377be5fb7405a7085b1
size 336
--------------------------------------------------------------------------------

/ELMo/weights.hdf5:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:32a7af920f480eaf231b567676745ba7ccfdfed3885938bcad9bdd42759f0fd4
size 374434776
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# ChemPatentEmbeddings

This repository contains an ELMo model and a word2vec model, both pre-trained on a chemical patent corpus of roughly 1 billion tokens.

## Dataset

The training corpus consists of 84,076 patent documents drawn from 7 different patent offices (POs). The dataset will **NOT** be made publicly available.

|PO   |# of Documents|# of Sentences|# of Tokens  |
|-----|--------------|--------------|-------------|
|AU   |7,743         |4,662,375     |156,137,670  |
|CA   |1,962         |463,123       |16,109,776   |
|EP   |19,274        |3,478,258     |117,992,191  |
|GB   |918           |182,627       |6,038,837    |
|IN   |1,913         |261,260       |9,015,238    |
|US   |41,131        |19,800,123    |628,256,609  |
|WO   |11,135        |4,830,708     |159,286,325  |
|Total|84,076        |33,687,474    |1,092,836,646|

## Word2Vec

We trained the word2vec model for 10 iterations with the same hyper-parameters as [Pyysalo et al. (2013)](http://bio.nlplab.org/pdf/pyysalo13literature.pdf). The word vectors file (.txt) can be loaded directly into your neural network framework; a loading sketch is given after the ELMo section below. Note that words **longer than 25 characters** were replaced by *long_token* during training.

Please click [here](https://chemu.eng.unimelb.edu.au/patent_w2v/) to download the pre-trained word vectors.

## ELMo

The default hyper-parameters of [Peters et al. (2018)](https://arxiv.org/abs/1802.05365) were used. Note that words **longer than 25 characters** were replaced by *long_token* during training (cf. the maximum character length is 50 under the default setting).

Please click [here](https://chemu.eng.unimelb.edu.au/ELMo/) to download the ELMo model.

* **Fine-tuning**: Load *weights.hdf5* and *options.json* into the original [ELMo implementation](https://github.com/allenai/bilm-tf) and train it further on your own datasets.

* **Representation**: You can also use the contextualized word representations generated by ELMo for downstream tasks by loading *weights.hdf5* and *options.json* into the [AllenNLP](https://allenai.github.io/allennlp-docs/) framework, as sketched below.
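The exact layout of the released vectors file is not documented here, so the snippet below is a minimal sketch: it assumes the standard textual word2vec format (a header line with vocabulary size and dimensionality, then one word and its vector per line) and a hypothetical local file name `patent_w2v.txt`; gensim is just one convenient loader.

```python
# Minimal sketch: loading the pre-trained patent word vectors with gensim.
# Assumptions: the downloaded file is saved locally as "patent_w2v.txt"
# (hypothetical name) and follows the standard textual word2vec format.
from gensim.models import KeyedVectors

# binary=False because the vectors are distributed as a plain-text file
vectors = KeyedVectors.load_word2vec_format("patent_w2v.txt", binary=False)


def normalize(token: str) -> str:
    """Mirror the corpus preprocessing: words longer than 25 characters
    were mapped to the placeholder token *long_token* before training."""
    return "long_token" if len(token) > 25 else token


query = normalize("ethanol")
if query in vectors:
    print(vectors[query].shape)              # vector dimensionality
    print(vectors.most_similar(query, topn=5))
else:
    print(f"{query} is out of vocabulary")
```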
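For the **Representation** use case, here is a minimal sketch using the `allennlp.modules.elmo` interface that shipped with the AllenNLP versions current when this model was released (0.x); that module was removed in later AllenNLP releases. The example sentence and the relative file paths are assumptions.

```python
# Minimal sketch: computing contextualized representations with AllenNLP.
# Assumes an AllenNLP version (0.x) that still ships allennlp.modules.elmo,
# and that options.json / weights.hdf5 sit in a local ELMo/ directory.
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "ELMo/options.json"
weight_file = "ELMo/weights.hdf5"

# num_output_representations=1 yields a single weighted average of the
# ELMo layers; dropout=0 since we only run inference here.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0)

# One pre-tokenized sentence; apply the same long-word preprocessing as
# during training before feeding tokens to the model.
sentence = ["The", "compound", "was", "dissolved", "in", "ethanol", "."]
sentence = [t if len(t) <= 25 else "long_token" for t in sentence]

character_ids = batch_to_ids([sentence])   # shape: (batch, tokens, 50)
output = elmo(character_ids)

# output["elmo_representations"] is a list with one tensor of shape
# (batch_size, num_tokens, embedding_dim)
print(output["elmo_representations"][0].shape)
```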
## Reference

If you find the word representations useful, please cite the following paper: *Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings*

```
@inproceedings{zhai2019improving,
  author    = {Zhai, Zenan and Nguyen, Dat Quoc and Akhondi, Saber A. and Thorne, Camilo and Druckenbrodt, Christian and Cohn, Trevor and Gregory, Michelle and Verspoor, Karin},
  title     = {{Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings}},
  booktitle = {{Proceedings of the BioNLP 2019 workshop}},
  pages     = {To appear},
  year      = {2019},
}
```
--------------------------------------------------------------------------------