Enter a sentence to query the index. Results include the journal title and the sentence that was found, ordered by similarity. More details are shown when a result is expanded.
136 | );
137 | }
138 |
--------------------------------------------------------------------------------
/frontend/README.md:
--------------------------------------------------------------------------------
1 | # CORD-19-ANN
2 |
3 | Below are instructions on how to set up the front-end for the search capabilities.
4 |
5 | ## Table of Contents:
6 | - [Table of Contents](#table-of-contents)
7 | - [Prerequisites](#prerequisites)
8 | - [Setup](#setup)
9 | - [Babel & Webpack Configuration](#babel--webpack-configuration)
10 | - [Configuration using Environment Variables](#configuration-using-environment-variables)
11 | - [Babel Configuration](#babel-configuration)
12 | - [Development and Production Build Modes](#development-and-production-build-modes)
13 | - [Development Builds](#development-builds)
14 | - [Production Build](#production-build)
15 | - [Docker Deployment](#docker-deployment)
16 |
17 | ## Prerequisites
18 | The following tools are required to set up or run this template:
19 | - [node](https://nodejs.org/) v12.4.0
20 | - [npm](https://www.npmjs.com/) v6.9.0 **or** [Yarn](https://yarnpkg.com/) v1.16.0
21 |
22 | ## Setup
23 | 1. Clone the repo
24 | 2. Navigate to the root directory of this new repo and install the dependencies with the command below (or `yarn install` if you prefer Yarn):
25 | ```shell
26 | npm install
27 | ```
28 | 3. You'll need to modify the `.env.defaults` file to point to the URL of the search index. We've assumed you've run `index_server.py` on the appropriate node as explained in the main README.
29 |
30 | A blank *.env* file is also created in the root directory (more on environment variables [here](#configuration-using-environment-variables)).
31 |
32 | ## Babel & Webpack Configuration
33 |
34 | ### Configuration using Environment Variables
35 | The *webpack.config.js* uses the [dotenv-webpack](https://www.npmjs.com/package/dotenv-webpack) plugin alongside [dotenv-defaults](https://www.npmjs.com/package/dotenv-defaults) to expose any environment variables set in the *.env* or *.env.defaults* file in the root directory. These variables are available within the webpack configuration itself and also anywhere within the application in the format `process.env.[VARIABLE]`.
36 |
37 | The root *.env.defaults* file must only contain non-sensitive configuration variables and should be considered safe to commit to any version control system.
38 |
39 | Any sensitive details, such as passwords or private keys, should be stored in the root *.env* file. This file should **never** be committed and accordingly is already listed within the root *.gitignore* file. The *.env* file also serves to overwrite any non-sensitive variables defined within the root *.env.defaults* file.
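
As an illustration, *.env.defaults* might look something like the sketch below; the variable name is hypothetical and should match whatever key your *webpack.config.js* actually reads for the search index URL.

```
# Hypothetical example only — use the key names your webpack.config.js expects
SEARCH_SERVER_URL=http://localhost:1337
```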
40 |
41 | ### Babel Configuration
42 | This project uses [Babel](https://babeljs.io/) to convert, transform and polyfill ECMAScript 2015+ code into a backwards-compatible version of JavaScript.
43 |
44 | As with *webpack.config.js*, the environment variables defined in *.env.defaults* and *.env* are available within *babel.config.js*, where Babel's configuration is programmatically created.
45 |
46 | ### Development and Production Build Modes
47 | In *webpack.config.js* there's a common configuration object for both `development` and `production` builds called `commonConfig` which mainly handles loading for various file types. Extend this object with any modules or plugins which apply to both build modes.
48 |
49 | Within a switch statement after the `commonConfig` object, individual properties for the `development` and `production` builds can be defined separately as needed.
50 |
51 | Webpack uses the `--mode` flag it receives when run to determine which build to bundle. This flag defaults to `development`.
52 |
53 | #### Development Builds
54 | A `development` build can be run in the following ways:
55 | ```shell
56 | # with npm
57 | npm run dev
58 | # or
59 | npm run dev:hot
60 |
61 | # with yarn
62 | yarn run dev
63 | # or
64 | yarn run dev:hot
65 | ```
66 | Both the `dev` and `dev:hot` scripts use [webpack-dev-server](https://webpack.js.org/configuration/dev-server/) to serve a `development` build locally. Some options are already configured in the root *.env.defaults* and can be overridden in the root *.env* file or within the root *webpack.config.js* itself as required.
67 |
68 | #### Production Build
69 | A `production` build can be run with either of the following commands:
70 | ```shell
71 | # with npm
72 | npm run build
73 |
74 | # with yarn
75 | yarn run build
76 | ```
77 | The `build` script writes an optimized and compressed build to the `build` directory. If a different directory is required, change the `clean` script in *package.json* as well as the `output.path` property in *webpack.config.js* accordingly.
78 |
79 |
80 | ## Docker Deployment
81 | To run the project locally with Docker, first start the backend by following the instructions in the [main README](../README.md).
82 | Once the backend is up and running, open a new terminal tab, navigate to the root of the project, and run `docker-compose up` (Docker must be installed on your machine first). If the command executes successfully, you should be able to reach the project in your browser at `localhost:8080`.
83 |
--------------------------------------------------------------------------------
/cord_ann/index.py:
--------------------------------------------------------------------------------
1 | import json
2 | from pathlib import Path
3 |
4 | import numpy as np
5 | import pandas as pd
6 |
7 |
8 | class Index:
9 | def __init__(self, index_path, index_type, articles_path, mapping, metadata, k, num_workers):
10 | self.index = self.load_index(index_path, index_type)
11 | self.index_type = index_type
12 | self.articles_path = articles_path
13 | self.mapping = mapping
14 | self.metadata = metadata
15 | self.k = k
16 | self.num_workers = num_workers
17 |
18 | def load_index(self, index_path, index_type):
19 | if index_type == 'nmslib':
20 | import nmslib
21 | index = nmslib.init(method='hnsw', space='cosinesimil')
22 | index.loadIndex(index_path)
23 | elif index_type == 'faiss':
24 | import faiss
25 | index = faiss.read_index(index_path)
26 | else:
27 | raise TypeError('Index type can only be faiss or nmslib.')
28 | return index
29 |
30 | def search_index(self, sentences, search_embeddings, return_batch_ids=False):
31 | if self.index_type == 'nmslib':
32 | batch = self.index.knnQueryBatch(search_embeddings,
33 | k=self.k,
34 | num_threads=self.num_workers)
35 | batch = np.array(batch)
36 | batch_ids = batch[:, 0].astype(np.int64)  # np.int is a deprecated/removed NumPy alias
37 | batch_distances = batch[:, 1].astype(np.float32)
38 | elif self.index_type == 'faiss':
39 | batch_distances, batch_ids = self.index.search(np.array(search_embeddings), k=self.k)
40 | else:
41 | raise TypeError('Index type can only be faiss or nmslib.')
42 |
43 | results = self._format_results(batch_ids=batch_ids,
44 | batch_distances=batch_distances,
45 | sentences=sentences,
46 | articles_path=self.articles_path,
47 | mapping=self.mapping)
48 | if return_batch_ids:
49 | return results, batch_ids
50 | return results
51 |
52 | def _load_article(self, articles_path, paper_id):
53 | json_path = Path(articles_path) / (paper_id + '.json')
54 | with json_path.open() as f:
55 | article = json.load(f)
56 | return article
57 |
58 | def _find_metadata(self, paper_id):
59 | metadata = self.metadata[self.metadata['sha'] == paper_id]
60 | if len(metadata) == 1:
61 | metadata = metadata.iloc[0].to_dict()
62 | return {
63 | 'doi': metadata['doi'] if not pd.isna(metadata['doi']) else 'N/A',
64 | 'url': metadata['url'] if not pd.isna(metadata['url']) else 'N/A',
65 | 'journal': metadata['journal'] if not pd.isna(metadata['journal']) else 'N/A',
66 | 'publish_time': metadata['publish_time'] if not pd.isna(metadata['publish_time']) else 'N/A',
67 | }
68 | else:
69 | return None # No metadata was found
70 |
71 | def _extract_k_hits(self, ids, distances, sentence, articles_path, sent_article_mapping):
72 | extracted = {
73 | "query": sentence,
74 | "hits": []
75 | }
76 |
77 | for id, distance in zip(ids, distances):
78 | mapping = sent_article_mapping[id]
79 | paragraph_idx = mapping["paragraph_idx"]
80 | sentence_idx = mapping["sentence_idx"]
81 | paper_id = mapping["paper_id"]
82 | article = self._load_article(articles_path=articles_path,
83 | paper_id=paper_id)
84 | hit = {
85 | 'title': article['metadata']['title'],
86 | 'authors': article['metadata']['authors'],
87 | 'paragraph': article['body_text'][paragraph_idx],
88 | 'sentence': article['body_text'][paragraph_idx]["sentences"][sentence_idx],
89 | 'abstract': article['abstract'],
90 | 'distance': float(distance),
91 | }
92 | metadata = self._find_metadata(paper_id)
93 | if metadata:
94 | hit['metadata'] = metadata
95 | extracted["hits"].append(hit)
96 | return extracted
97 |
98 | def _format_results(self, batch_ids, batch_distances, sentences, articles_path, mapping):
99 | return [self._extract_k_hits(ids=batch_ids[x],
100 | distances=batch_distances[x],
101 | sentence=query_sentence,
102 | articles_path=articles_path,
103 | sent_article_mapping=mapping) for x, query_sentence in enumerate(sentences)]
104 |
105 |
106 | def search_args(parser):
107 | parser.add_argument('--index_path', default="index",
108 | help='Path to the created index')
109 | parser.add_argument('--index_type', default="nmslib", type=str, choices=["nmslib", "faiss"],
110 | help='Type of index')
111 | parser.add_argument('--dataset_path', default="cord_19_dataset_formatted/",
112 | help='Path to the extracted dataset')
113 | parser.add_argument('--model_name_or_path', default='bert-base-nli-mean-tokens')
114 | parser.add_argument('--batch_size', default=8, type=int,
115 | help='Batch size for the transformer model encoding')
116 | parser.add_argument('--num_workers', default=8, type=int,
117 | help='Number of workers to use when parallelizing the index search')
118 | parser.add_argument('--k', default=10, type=int,
119 | help='The top K hits to return from the index')
120 | parser.add_argument('--device', default='cpu',
121 | help='Set to cuda to use the GPU')
122 | parser.add_argument('--silent', action="store_true",
123 | help='Turn off progress bar when searching')
124 | return parser
125 |
126 |
127 | def paths_from_dataset_path(dataset_path):
128 | """
129 | Creates paths to the files required for searching the index.
130 | :param dataset_path: The path to the extracted dataset.
131 | :return: Paths to various important files/folders for searching the index.
132 | """
133 | dataset_path = Path(dataset_path)
134 | articles_path = dataset_path / 'articles/'
135 | sentences_path = dataset_path / 'cord_19_sentences.txt'
136 | mapping_path = dataset_path / 'cord_19_sent_to_article_mapping.json'
137 | metadata_path = dataset_path / 'metadata.csv'
138 | return articles_path, sentences_path, mapping_path, metadata_path
139 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # CORD-19-ANN
2 |
3 | 
4 |
5 | [](https://colab.research.google.com/drive/137jbpY3yQJGSzlLHZGUYBk5F78bwuqKJ) [](https://medium.com/@seannaren/cord-19-ann-semantic-search-engine-using-s-bert-aebc5bcc5442?sk=92ea4a22df3cd1343c86a1e880b78f6f) [GitHub Pages](https://seannaren.github.io/CORD-19-ANN/)
6 |
7 | This repo contains the scripts and models to search [CORD-19](https://pages.semanticscholar.org/coronavirus-research) using [S-BERT](https://github.com/UKPLab/sentence-transformers) embeddings via [nmslib](https://github.com/nmslib/nmslib/blob/master/python_bindings/README.md) or [faiss](https://github.com/facebookresearch/faiss).
8 |
9 | Sentence embeddings are not perfect for searching (see [this issue](https://github.com/UKPLab/sentence-transformers/issues/174)), but they can provide insight into the data that basic search functionality cannot. There is still room to improve the retrieval of relevant documents.
10 |
11 | We're not versed in the medical field, so we strongly encourage any feedback or improvements in the form of issues/PRs!
12 |
13 | We've included pre-trained models and the FAISS index so you can start your own server; instructions are below.
14 |
15 | Finally we provide a front-end that can be used to search through the dataset and extract information via a UI. Instructions and installation for the front-end can be found [here](frontend/README.md).
16 |
17 | We are currently hosting the server on a GCP instance; if anyone can contribute a more permanent hosting solution, it would be appreciated.
18 |
19 | ## Installation
20 |
21 | ### Source
22 | We assume you have installed PyTorch and the necessary CUDA packages from [here](https://pytorch.org/). We suggest using Conda to make installation easier.
23 | ```
24 | # Install FAISS
25 | conda install faiss-cpu -c pytorch # Other instructions can be found at https://github.com/facebookresearch/faiss/blob/master/INSTALL.md
26 |
27 | git clone https://github.com/SeanNaren/CORD-19-ANN.git --recursive
28 | cd CORD-19-ANN/
29 | pip install -r requirements.txt
30 | pip install .
31 | ```
32 |
33 | ### Docker
34 |
35 | We also provide a docker container:
36 |
37 | ```
38 | docker pull seannaren/cord-19-ann
39 | sudo docker run -it --net=host --ipc=host --entrypoint=/bin/bash --rm seannaren/cord-19-ann
40 | ```
41 |
42 | ## Download Models
43 |
44 | We currently offer sentence models trained on [BlueBERT](https://github.com/ncbi-nlp/bluebert) (base uncased model) and [BioBERT](https://github.com/naver/biobert-pretrained) (base cased model) with the appropriate metadata/index. We currently serve S-BlueBERT; however, the models are interchangeable.
45 |
46 |
47 | ### Download S-BERT Models and Search Index
48 |
49 | Download the corresponding model and index file. We suggest using S-BioBERT and assume you have done so for the subsequent commands; the models are, however, interchangeable.
50 |
51 | | Model | Index | Test MedNLI Accuracy | Test STS Benchmark Cosine Pearson |
52 | |-----------------------------|--------------------------------|-----------------|------------------------------|
53 | | [S-BioBERT Base Cased](https://github.com/SeanNaren/CORD-19-ANN/releases/download/V1.0/s-biobert_base_cased_mli.tar.gz) | [BioBERT_faiss_PCAR128_SQ8](https://github.com/SeanNaren/CORD-19-ANN/releases/download/V1.0/biobert_mli_faiss_PCAR128_SQ8) | 0.7482 | 0.7122 |
54 | | [S-BlueBERT Base Uncased](https://github.com/SeanNaren/CORD-19-ANN/releases/download/V1.0/s-bluebert_base_uncased_mli.tar.gz) | [BlueBERT_faiss_PCAR128_SQ8](https://github.com/SeanNaren/CORD-19-ANN/releases/download/V1.0/bluebert_mli_faiss_PCAR128_SQ8) | 0.7525 | 0.6923 |
55 | | S-BERT Base Cased | | 0.5689 | 0.7265 |
56 |
57 |
58 | ### Download Metadata
59 | ```
60 | wget https://github.com/SeanNaren/CORD-19-ANN/releases/download/V1.0/cord_19_dataset_formatted_2020_03_27.tar.gz
61 | tar -xzvf cord_19_dataset_formatted_2020_03_27.tar.gz cord_19_dataset_formatted/
62 | ```
63 |
64 | ## Searching the Index
65 |
66 | We assume you've chosen the S-BioBERT model; it should be straightforward to swap in any other pre-trained model offered in this repo by modifying the paths below.
67 |
68 | We recommend using the server, but we also offer a simple script to search given a text file of sentences:
69 |
70 | ```
71 | echo "These RNA transcripts may be spliced to give rise to mRNAs encoding the envelope (Env) glycoproteins (Fig. 1a)" > sentences.txt
72 | python search_index.py --index_path biobert_mli_faiss_PCAR128_SQ8 --index_type faiss --model_name_or_path s-biobert_base_cased_mli/ --dataset_path cord_19_dataset_formatted/ --input_path sentences.txt --output_path output.json
73 | ```
74 |
75 | ### Using the server
76 |
77 | To start the server:
78 | ```
79 | YOUR_IP=0.0.0.0
80 | YOUR_PORT=1337
81 | python index_server.py --index_path biobert_mli_faiss_PCAR128_SQ8 --index_type faiss --model_name_or_path s-biobert_base_cased_mli/ --dataset_path cord_19_dataset_formatted/ --address $YOUR_IP --port $YOUR_PORT --silent
82 | ```
83 |
84 | To test the server:
85 | ```
86 | curl --header "Content-Type: application/json" \
87 | --request POST \
88 | --data '["These RNA transcripts may be spliced to give rise to mRNAs encoding the envelope (Env) glycoproteins (Fig. 1a)"]' \
89 | http://$YOUR_IP:$YOUR_PORT/query
90 | ```
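
If you prefer Python to curl, the snippet below is a minimal sketch using the `requests` library; it assumes the server started above is reachable on `localhost:1337` (i.e. `YOUR_PORT=1337`).

```python
# Minimal sketch: query the running index_server.py from Python instead of curl.
# Assumes the server started above is listening on localhost:1337.
import requests

sentences = [
    "These RNA transcripts may be spliced to give rise to mRNAs "
    "encoding the envelope (Env) glycoproteins (Fig. 1a)"
]
response = requests.post("http://localhost:1337/query", json=sentences)
response.raise_for_status()
results = response.json()  # structure described in the Output Format section below
print(len(results[0]["hits"]), "hits for:", results[0]["query"])
```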
91 |
92 | ### Output Format
93 |
94 | The output from the index is a JSON array with one entry per query sentence, each containing the top K hits from the index; an example of the API response is given below:
95 |
96 | ```
97 | [
98 | {
99 | "query": "These RNA transcripts may be spliced to give rise to mRNAs encoding the envelope (Env) glycoproteins (Fig. 1a)",
100 | "hits": [
101 | {
102 | "title": "Title",
103 | "authors": [
104 | "..."
105 | ],
106 | "abstract": [
107 | "..."
108 | ],
109 | "paragraph": "Paragraph that included the hit",
110 | "sentence": "The semantically similar sentence",
111 | "distance": 42,
112 | }
113 | ]
114 | }
115 | ]
116 | ```
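
As a sketch of consuming this structure, whether read from the `output.json` written by `search_index.py` or parsed from the server response, the fields can be walked as below; a `metadata` entry (DOI, URL, journal, publish time) is attached to a hit only when a matching row is found in `metadata.csv`.

```python
# Minimal sketch: walk the result structure documented above.
import json

with open("output.json") as f:
    results = json.load(f)

for result in results:
    print("query:", result["query"])
    for hit in result["hits"]:
        # Smaller distances generally indicate closer matches.
        print(f"  {hit['distance']:.4f}  {hit['title']}: {hit['sentence']}")
        if "metadata" in hit:
            print("      ", hit["metadata"].get("url"))
```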
117 |
118 | ## Creating the Index from scratch
119 |
120 | The process requires a GPU-enabled node, such as a GCP n8 node with an nvidia-tesla-v100 and at least 20GB of RAM, to generate the embeddings.
121 |
122 | ### Preparing the dataset
123 |
124 | Currently we tokenize at the sentence level using SciSpacy; however, future work may look into paragraph-level tokenization.
125 |
126 | ```
127 | mkdir datasets/
128 | python download_data.py
129 | python extract_sentences.py --num_workers 16
130 | ```
131 |
132 | ### Generating embeddings
133 |
134 | #### Using fine-tuned BioBERT/BlueBERT
135 |
136 | Using sentence-transformers we can fine-tune either model. BlueBERT offers only uncased models, whereas BioBERT offers a cased model. We've converted them into PyTorch format and included them in the releases; to download them:
137 |
138 | ```
139 | wget https://github.com/SeanNaren/CORD-19-ANN/releases/download/V1.0/s-biobert_base_cased_mli.tar.gz
140 | wget https://github.com/SeanNaren/CORD-19-ANN/releases/download/V1.0/s-bluebert_base_uncased_mli.tar.gz
141 | tar -xzvf s-biobert_base_cased_mli.tar.gz
142 | tar -xzvf s-bluebert_base_uncased_mli.tar.gz
143 | ```
144 |
145 | ##### Generating embeddings with the fine-tuned models
146 |
147 | ```
148 | python generate_embeddings.py --model_name_or_path s-biobert_base_cased_mli/ --embedding_path biobert_embeddings.npy --device cuda --batch_size 256 # If you want to use biobert
149 | python generate_embeddings.py --model_name_or_path s-bluebert_base_uncased_mli/ --embedding_path bluebert_embeddings.npy --device cuda --batch_size 256 # If you want to use bluebert
150 | ```
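
For reference, the core of this step with the sentence-transformers API boils down to something like the sketch below; `generate_embeddings.py` in this repo handles the dataset paths and CLI options, so treat this as illustrative only.

```python
# Illustrative sketch of the embedding step; the real logic lives in generate_embeddings.py.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("s-biobert_base_cased_mli/", device="cuda")

with open("cord_19_dataset_formatted/cord_19_sentences.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

embeddings = model.encode(sentences, batch_size=256, show_progress_bar=True)
np.save("biobert_embeddings.npy", np.asarray(embeddings, dtype=np.float32))
```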
151 |
152 | #### Using pre-trained S-BERT models
153 |
154 | You can also use the standard pre-trained model from the S-BERT repo as shown below; however, we suggest using the fine-tuned models offered in this repo.
155 |
156 | ```
157 | python generate_embeddings.py --model_name_or_path bert-base-nli-mean-tokens --embedding_path pretrained_embeddings.npy --device cuda --batch_size 256
158 | ```
159 |
160 | ##### Training the model from scratch
161 |
162 | This takes a few hours on a V100 GPU.
163 |
164 | If you'd like to include the MedNLI dataset during training, you'll need to download it from [here](https://physionet.org/content/mednli/1.0.0/). Getting the data requires credentialed access, which takes some effort and a waiting period of up to two weeks.
165 |
166 | Once trained, the model is saved to the `output/` folder by default. Inside you'll find checkpoints such as `output/training_nli/biobert-2020-03-30_10-51-49/` after training has finished. Use this as the model path when generating your embeddings.
167 |
168 | ```
169 | wget https://github.com/SeanNaren/CORD-19-ANN/releases/download/V1.0/biobert_cased_v1.1.tar.gz
170 | wget https://github.com/SeanNaren/CORD-19-ANN/releases/download/V1.0/bluebert_base_uncased.tar.gz
171 | tar -xzvf biobert_cased_v1.1.tar.gz
172 | tar -xzvf bluebert_base_uncased.tar.gz
173 |
174 | mkdir datasets/
175 | python sentence-transformers/examples/datasets/get_data.py --output_path datasets/
176 | python sentence-transformers/examples/training_nli_transformers.py --model_name_or_path biobert_cased_v1.1/
177 | python sentence-transformers/examples/training_nli_transformers.py --model_name_or_path bluebert_base_uncased/ --do_lower_case
178 |
179 | # Training with medNLI
180 | python sentence-transformers/examples/training_nli_transformers.py --model_name_or_path biobert_cased_v1.1/ --mli_dataset_path path/to/mednli/
181 | python sentence-transformers/examples/training_nli_transformers.py --model_name_or_path bluebert_base_uncased/ --mli_dataset_path path/to/mednli/ --do_lower_case
182 | ```
183 |
184 | To exclude MedNLI from training but still evaluate on the data (this still requires the MedNLI dataset), use the `--exclude_mli` flag.
185 |
186 | ### Create the Index
187 |
188 | Either faiss or nmslib can be used via the `--index_type` parameter shown below. We've exposed the FAISS config string for modifying the index; more details about selecting an index can be found [here](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index).
189 |
190 | ```
191 | python create_index.py --output_path index --embedding_path pretrained_embeddings.npy --index_type faiss # Swap to biobert_embeddings.npy or bluebert_embeddings.npy if using the fine-tuned embeddings
192 | ```
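
For context, building a FAISS index like the released `*_PCAR128_SQ8` ones roughly follows the pattern below; this is a sketch only, and the factory string is an assumption inferred from the released index names (`create_index.py` exposes it via its FAISS config string).

```python
# Rough sketch of FAISS index creation; create_index.py wraps this behind its CLI.
# The "PCAR128,SQ8" factory string is inferred from the released index names.
import faiss
import numpy as np

embeddings = np.load("pretrained_embeddings.npy").astype(np.float32)
dim = embeddings.shape[1]

index = faiss.index_factory(dim, "PCAR128,SQ8")  # PCA (with rotation) to 128 dims + 8-bit scalar quantizer
index.train(embeddings)  # PCA and SQ both need a training pass
index.add(embeddings)
faiss.write_index(index, "index")
```

The nmslib variant instead builds an HNSW index in cosine-similarity space (`nmslib.init(method='hnsw', space='cosinesimil')`, `addDataPointBatch`, `createIndex`, `saveIndex`), which matches how `cord_ann/index.py` loads it.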
193 |
194 | ### Clustering
195 |
196 | We also took the example clustering script from sentence-transformers and added it to this repository for use with the pre-trained models. An example is below:
197 |
198 | ```
199 | python cluster_sentences.py --input_path sentences.txt --model_name_or_path biobert_cased_v1.1/ --device cpu
200 | ```
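
Under the hood this is the standard sentence-transformers clustering recipe: encode the sentences, then run k-means over the embeddings. A minimal sketch is below (illustrative only; the cluster count is arbitrary, the model path assumes one of the S-BERT models downloaded earlier, and `cluster_sentences.py` adds the CLI and model handling).

```python
# Illustrative sketch of sentence clustering: embed the sentences, then k-means the vectors.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("s-biobert_base_cased_mli/", device="cpu")

with open("sentences.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

embeddings = model.encode(sentences)
labels = KMeans(n_clusters=5).fit_predict(embeddings)  # cluster count chosen arbitrarily here

for label, sentence in sorted(zip(labels, sentences)):
    print(label, sentence)
```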
201 |
202 | There is also a more interactive version available using the Google Colab demo: [](https://colab.research.google.com/drive/137jbpY3yQJGSzlLHZGUYBk5F78bwuqKJ)
203 |
204 | ## Acknowledgements
205 |
206 | Thanks to the authors of the various libraries that made this possible!
207 |
208 | - [sentence-transformers](https://github.com/UKPLab/sentence-transformers)
209 | - [cord-19](https://pages.semanticscholar.org/coronavirus-research)
210 | - [scibert](https://github.com/allenai/scibert)
211 | - [nmslib](https://github.com/nmslib/nmslib)
212 | - [FAISS](https://github.com/facebookresearch/faiss)
--------------------------------------------------------------------------------