├── ELMo
│   ├── options.json
│   └── weights.hdf5
└── README.md

/ELMo/options.json:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:86cb4c1a7e2f25e3867d06810e33ae27fbbd511ac77ca377be5fb7405a7085b1
size 336
--------------------------------------------------------------------------------

/ELMo/weights.hdf5:
--------------------------------------------------------------------------------
version https://git-lfs.github.com/spec/v1
oid sha256:32a7af920f480eaf231b567676745ba7ccfdfed3885938bcad9bdd42759f0fd4
size 374434776
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# ChemPatentEmbeddings

This repository contains an ELMo model and a word2vec model, both pre-trained on a chemical patent corpus of roughly 1 billion tokens.

## Dataset

The training corpus consists of 84,076 patent documents drawn from 7 different patent offices (POs). The dataset will **NOT** be made publicly available.

|PO   |# of Documents|# of Sentences|# of Tokens  |
|-----|--------------|--------------|-------------|
|AU   |7,743         |4,662,375     |156,137,670  |
|CA   |1,962         |463,123       |16,109,776   |
|EP   |19,274        |3,478,258     |117,992,191  |
|GB   |918           |182,627       |6,038,837    |
|IN   |1,913         |261,260       |9,015,238    |
|US   |41,131        |19,800,123    |628,256,609  |
|WO   |11,135        |4,830,708     |159,286,325  |
|Total|84,076        |33,687,474    |1,092,836,646|

## Word2Vec

We trained the word2vec model for 10 iterations with the same hyper-parameters as [Pyysalo et al. (2013)](http://bio.nlplab.org/pdf/pyysalo13literature.pdf). The word vectors file (.txt) can be loaded directly into your neural network framework; a loading sketch is given after the ELMo section below. Note that words **longer than 25 characters** were replaced by *long_token* during training.

Please click [here](https://chemu.eng.unimelb.edu.au/patent_w2v/) to download the pre-trained word vectors.

## ELMo

The default hyper-parameters of [Peters et al. (2018)](https://arxiv.org/abs/1802.05365) were used. Note that words **longer than 25 characters** were replaced by *long_token* during training (cf. the maximum character length is 50 under the default setting).

Please click [here](https://chemu.eng.unimelb.edu.au/ELMo/) to download the ELMo model.

* **Fine-tuning**: Load *weights.hdf5* and *options.json* into the original [ELMo implementation](https://github.com/allenai/bilm-tf) and train it further on your own datasets.

* **Representation**: You can also use the contextualized word representations generated by ELMo for downstream tasks by loading *weights.hdf5* and *options.json* into the [AllenNLP](https://allenai.github.io/allennlp-docs/) framework, as sketched below.
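The exact layout of the released vectors file is not documented here, so the snippet below is a minimal sketch: it assumes the standard textual word2vec format (a header line with vocabulary size and dimensionality, then one word and its vector per line) and a hypothetical local file name `patent_w2v.txt`; gensim is just one convenient loader.

```python
# Minimal sketch: loading the pre-trained patent word vectors with gensim.
# Assumptions: the downloaded file is saved locally as "patent_w2v.txt"
# (hypothetical name) and follows the standard textual word2vec format.
from gensim.models import KeyedVectors

# binary=False because the vectors are distributed as a plain-text file
vectors = KeyedVectors.load_word2vec_format("patent_w2v.txt", binary=False)


def normalize(token: str) -> str:
    """Mirror the corpus preprocessing: words longer than 25 characters
    were mapped to the placeholder token *long_token* before training."""
    return "long_token" if len(token) > 25 else token


query = normalize("ethanol")
if query in vectors:
    print(vectors[query].shape)              # vector dimensionality
    print(vectors.most_similar(query, topn=5))
else:
    print(f"{query} is out of vocabulary")
```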
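For the **Representation** use case, here is a minimal sketch using the `allennlp.modules.elmo` interface that shipped with the AllenNLP versions current when this model was released (0.x); that module was removed in later AllenNLP releases. The example sentence and the relative file paths are assumptions.

```python
# Minimal sketch: computing contextualized representations with AllenNLP.
# Assumes an AllenNLP version (0.x) that still ships allennlp.modules.elmo,
# and that options.json / weights.hdf5 sit in a local ELMo/ directory.
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "ELMo/options.json"
weight_file = "ELMo/weights.hdf5"

# num_output_representations=1 yields a single weighted average of the
# ELMo layers; dropout=0 since we only run inference here.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0)

# One pre-tokenized sentence; apply the same long-word preprocessing as
# during training before feeding tokens to the model.
sentence = ["The", "compound", "was", "dissolved", "in", "ethanol", "."]
sentence = [t if len(t) <= 25 else "long_token" for t in sentence]

character_ids = batch_to_ids([sentence])   # shape: (batch, tokens, 50)
output = elmo(character_ids)

# output["elmo_representations"] is a list with one tensor of shape
# (batch_size, num_tokens, embedding_dim)
print(output["elmo_representations"][0].shape)
```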
## Reference

If you find the word representations useful, please cite the following paper: *Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings*

```
@inproceedings{zhai2019improving,
  author    = {Zhai, Zenan and Nguyen, Dat Quoc and Akhondi, Saber A. and Thorne, Camilo and Druckenbrodt, Christian and Cohn, Trevor and Gregory, Michelle and Verspoor, Karin},
  title     = {{Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings}},
  booktitle = {{Proceedings of the BioNLP 2019 workshop}},
  pages     = {To appear},
  year      = {2019},
}
```
--------------------------------------------------------------------------------