# BioBERT Pre-trained Weights

This repository provides pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed specifically for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, and question answering.
Please refer to our paper [BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746) for more details.

## Downloading pre-trained weights
Go to the [releases](https://github.com/naver/biobert-pretrained/releases) section of this repository or click the links below to download the pre-trained weights of BioBERT.
We provide pre-trained weights for three corpus combinations: BioBERT (+ PubMed), BioBERT (+ PMC), and BioBERT (+ PubMed + PMC).
Pre-training was based on the [original BERT code](https://github.com/google-research/bert) provided by Google, and training details are described in our paper. The currently available versions of the pre-trained weights are as follows:

* **[BioBERT-Base v1.1 (+ PubMed 1M)](https://drive.google.com/file/d/1R84voFKHfWV9xjzeLzWBbmY1uOMYpnyD/view?usp=sharing)** - based on BERT-base-Cased (same vocabulary)
* **[BioBERT-Large v1.1 (+ PubMed 1M)](https://drive.google.com/file/d/1GJpGjQj6aZPV-EfbiQELpBkvlGtoKiyA/view?usp=sharing)** - based on BERT-large-Cased (custom 30k vocabulary), [NER/QA Results](https://github.com/dmis-lab/biobert/wiki/BioBERT-Large-Results)
* **[BioBERT-Base v1.0 (+ PubMed 200K)](https://drive.google.com/file/d/17j6pSKZt5TtJ8oQCDNIwlSZ0q5w7NNBg/view?usp=sharing)** - based on BERT-base-Cased (same vocabulary)
* **[BioBERT-Base v1.0 (+ PMC 270K)](https://drive.google.com/file/d/1LiAJklso-DCAJmBekRTVEvqUOfm0a9fX/view?usp=sharing)** - based on BERT-base-Cased (same vocabulary)
* **[BioBERT-Base v1.0 (+ PubMed 200K + PMC 270K)](https://drive.google.com/file/d/1jGUu2dWB1RaeXmezeJmdiPKQp3ZCmNb7/view?usp=sharing)** - based on BERT-base-Cased (same vocabulary)

Make sure to specify the version of the pre-trained weights used in your work.
If you have difficulty choosing which one to use, we recommend **BioBERT-Base v1.1 (+ PubMed 1M)** or **BioBERT-Large v1.1 (+ PubMed 1M)**, depending on your GPU resources.
Note that BioBERT-Base uses the WordPiece vocabulary (`vocab.txt`) provided by Google, since new words in the biomedical corpora can be represented with subwords (for instance, Leukemia => Leu + ##ke + ##mia).
More details are in the closed [issue #1](https://github.com/naver/biobert-pretrained/issues/1).

## Pre-training corpus
We do not provide pre-processed versions of the corpora. However, each pre-training corpus can be found at the following links:
* **`PubMed Abstracts1`**: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
* **`PubMed Abstracts2`**: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/
* **`PubMed Central Full Texts`**: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/

The estimated corpus sizes are 4.5 billion words for **`PubMed Abstracts1`** + **`PubMed Abstracts2`**, and 13.5 billion words for **`PubMed Central Full Texts`**.

## Fine-tuning BioBERT
To fine-tune BioBERT on biomedical text mining tasks using the provided pre-trained weights, refer to the [DMIS GitHub repository for BioBERT](https://github.com/dmis-lab/biobert).
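Because pre-training was done with the original BERT code, the released archives follow the standard BERT TensorFlow checkpoint layout (a `bert_config.json`, a `vocab.txt`, and `model.ckpt-*` files). As a minimal, unofficial sketch of how such a checkpoint can be inspected outside the official fine-tuning code, the snippet below converts it to PyTorch with Hugging Face `transformers` and tokenizes a sentence with the released cased WordPiece vocabulary. The extraction path and checkpoint step number are placeholders, and `transformers`, `torch`, and `tensorflow` are assumed to be installed; this snippet is not part of the release itself.

```python
# Unofficial sketch: load a downloaded BioBERT TensorFlow checkpoint into PyTorch
# using Hugging Face `transformers`. All local paths below are placeholders for
# wherever you extracted the release archive.
from transformers import (
    BertConfig,
    BertForPreTraining,
    BertTokenizer,
    load_tf_weights_in_bert,
)

CKPT_DIR = "./biobert_v1.1_pubmed"  # hypothetical extraction directory

# Build a BERT model from the released config and copy the TF weights into it.
config = BertConfig.from_json_file(f"{CKPT_DIR}/bert_config.json")
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, f"{CKPT_DIR}/model.ckpt-1000000")  # step suffix may differ per release
model.save_pretrained("./biobert_v1.1_pubmed_pytorch")

# Tokenize with the released cased WordPiece vocabulary; biomedical terms are
# split into subwords, e.g. "Leukemia" -> ["Leu", "##ke", "##mia"].
tokenizer = BertTokenizer(vocab_file=f"{CKPT_DIR}/vocab.txt", do_lower_case=False)
print(tokenizer.tokenize("Leukemia is a cancer of the blood cells."))
```

The converted PyTorch weights can then be fine-tuned with any BERT-compatible code; for the experiments reported in the paper, use the official fine-tuning repository linked above.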
## Citation
```
@article{10.1093/bioinformatics/btz682,
    author = {Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},
    title = "{BioBERT: a pre-trained biomedical language representation model for biomedical text mining}",
    journal = {Bioinformatics},
    year = {2019},
    month = {09},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btz682},
    url = {https://doi.org/10.1093/bioinformatics/btz682},
}
```

## Contact information
For help or issues using the pre-trained weights of BioBERT, please submit a GitHub issue. For other communication related to the pre-trained weights, please contact Jinhyuk Lee (`lee.jnhk@gmail.com`) or Sungdong Kim (`sungdong.kim@navercorp.com`).