├── 1-Accessing Dataset
│   ├── CORD-19(Data dictionary).docx
│   ├── codebook.ipynb
│   └── metadata_sample.pkl
├── 2-Data Pre-Processing
│   └── covid_qa_data_preprocessing.ipynb
├── 3-Exploratory Data Analysis
│   └── covid_aq_eda.ipynb
├── 4-Knowledge Base
│   ├── README.md
│   ├── es_populate.py
│   ├── requirements.docx
│   ├── requirements.txt
│   └── setup.sh
├── 5-QA Engine
│   ├── Dockerfile
│   ├── README.md
│   ├── main.py
│   ├── requirements.txt
│   └── setup.sh
├── 6-Front End with Streamlit
│   ├── Dockerfile
│   ├── Dockerfile(1)
│   ├── README(1).md
│   ├── README.md
│   ├── requirements(1).txt
│   ├── requirements.txt
│   ├── setup(1).sh
│   ├── setup.sh
│   ├── streamlit_qa(1).py
│   └── streamlit_qa.py
├── 7-Deployment
│   ├── Dockerfile_QA
│   ├── Dockerfile_UI
│   └── setup.sh
└── README.md
/1-Accessing Dataset/CORD-19(Data dictionary).docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yordanoswuletaw/covid19-qa-system/693929b1a3de68eae235e2b9a018f3c9422c76e4/1-Accessing Dataset/CORD-19(Data dictionary).docx
--------------------------------------------------------------------------------
/1-Accessing Dataset/codebook.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","id":"a6b53b6b","metadata":{"id":"a6b53b6b"},"source":["# Getting Started"]},{"cell_type":"code","source":["# connecting colab with google drive\n","from google.colab import drive\n","drive.mount('/content/drive')"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"OI6fyf4c_ykn","executionInfo":{"status":"ok","timestamp":1697893070199,"user_tz":-180,"elapsed":3653,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}},"outputId":"4ed016db-e9a2-440d-d4d4-03880363e7e5"},"id":"OI6fyf4c_ykn","execution_count":6,"outputs":[{"output_type":"stream","name":"stdout","text":["Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n"]}]},{"cell_type":"code","source":["%cd drive/MyDrive/Covid\\ Q\\&A\\ System/"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"I3i_JlghUE3-","executionInfo":{"status":"ok","timestamp":1697893145199,"user_tz":-180,"elapsed":694,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}},"outputId":"1cf9af69-037f-4aca-f077-66f4bae56f29"},"id":"I3i_JlghUE3-","execution_count":8,"outputs":[{"output_type":"stream","name":"stdout","text":["total 8.0K\n","drwx------ 5 root root 4.0K Oct 21 11:31 drive\n","drwxr-xr-x 1 root root 4.0K Oct 19 16:36 sample_data\n"]}]},{"cell_type":"markdown","id":"e3b260e6","metadata":{"id":"e3b260e6"},"source":["### Installing and importing Libraries"]},{"cell_type":"code","execution_count":null,"id":"771b0ff9","metadata":{"id":"771b0ff9","outputId":"7f84d4ac-a036-401a-ce6a-2749eb270d5b","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1697887955874,"user_tz":-180,"elapsed":6085,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["Collecting tqdm==4.62.0\n"," Downloading tqdm-4.62.0-py2.py3-none-any.whl (76 
kB)\n","Installing collected packages: tqdm\n"," Attempting uninstall: tqdm\n"," Found existing installation: tqdm 4.66.1\n"," Uninstalling tqdm-4.66.1:\n"," Successfully uninstalled tqdm-4.66.1\n","Successfully installed tqdm-4.62.0\n"]}],"source":["!pip3 install tqdm==4.62.0"]},{"cell_type":"code","execution_count":null,"id":"a7141ff0","metadata":{"id":"a7141ff0"},"outputs":[],"source":["import time\n","import json\n","import glob\n","import numpy as np\n","import pandas as pd\n","from tqdm import tqdm"]},{"cell_type":"markdown","id":"e9497679","metadata":{"id":"e9497679"},"source":["### Downloading the covid dataset"]},{"cell_type":"code","execution_count":null,"id":"307bc53e","metadata":{"id":"307bc53e"},"outputs":[],"source":["#downloading dataset...\n","!wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2021-11-15.tar.gz"]},{"cell_type":"code","execution_count":null,"id":"75e7b7fa","metadata":{"id":"75e7b7fa"},"outputs":[],"source":["#unzipping the dataset to google drive\n","!tar -xzf cord-19_2021-11-15.tar.gz -C /content/drive/MyDrive/Covid\\ Q\\&A\\ System\n","#unzipping the pdf and pmc files\n","%cd drive/MyDrive/Covid\\ Q\\&A\\ System/2021-11-15/\n","!tar xzf document_parses.tar.gz"]},{"cell_type":"markdown","id":"afcd56b3","metadata":{"id":"afcd56b3"},"source":["### Reading 
Dataset"]},{"cell_type":"code","execution_count":null,"id":"49050c24","metadata":{"id":"49050c24"},"outputs":[],"source":["%cd drive/MyDrive/Covid\\ Q\\&A\\ System/2021-11-15/\n","#metadata\n","metadata = pd.read_csv('metadata.csv', dtype={'pubmed_id': str, 'title': str, 'abstract': str})\n","metadata.head()"]},{"cell_type":"code","execution_count":null,"id":"57ca0abe","metadata":{"id":"57ca0abe"},"outputs":[],"source":["# Fetching Research Papers from PDF and PMC Json folder\n","pdf_json = glob.glob('document_parses/pdf_json/*.json', recursive=True)\n","pmc_json = glob.glob('document_parses/pmc_json/*.json', recursive=True)"]},{"cell_type":"code","execution_count":null,"id":"ee135541","metadata":{"id":"ee135541"},"outputs":[],"source":["print('PDF:', len(pdf_json), 'PMC:', len(pmc_json))"]},{"cell_type":"code","execution_count":null,"id":"e1bfd9ee","metadata":{"id":"e1bfd9ee"},"outputs":[],"source":["# FileReader class: extracts id, abstract and body from research papers\n","class FileReader:\n","    def __init__(self, file_path):\n","        with open(file_path) as f:\n","            content = json.load(f)\n","            self.paper_id = content['paper_id']\n","            self.abstract = '$$'.join([each['text'] for each in content.get('abstract', [])])\n","            self.body_text = '$$'.join([each['text'] for each in content.get('body_text', [])])\n","\n","    def __repr__(self):\n","        return f'{self.paper_id}\\tabstract: {self.abstract[:200]}\\tbody_text: {self.body_text}'"]},{"cell_type":"code","execution_count":null,"id":"d2539a02","metadata":{"id":"d2539a02"},"outputs":[],"source":["# A sample research paper from pdf json folder\n","pdf_file = FileReader(pdf_json[0])\n","print(pdf_file)"]},{"cell_type":"code","execution_count":null,"id":"e66e4bd7","metadata":{"id":"e66e4bd7","outputId":"146b7762-1063-4d7f-af4c-7b8213f6432c"},"outputs":[{"name":"stderr","output_type":"stream","text":["247236it [05:05, 807.98it/s] 
"]},{"name":"stdout","output_type":"stream","text":["306.0527939796448\n"]},{"name":"stderr","output_type":"stream","text":["\n"]}],"source":["#Create a dictionary of all research papers from pdf json\n","pdf_dict = {'paper_id': [], 'abstract': [], 'body_text': []}\n","t1 = time.time()\n","for idx, record in tqdm(enumerate(pdf_json)):\n"," content = FileReader(record)\n"," pdf_dict['paper_id'].append(content.paper_id)\n"," pdf_dict['abstract'].append(content.abstract)\n"," pdf_dict['body_text'].append(content.body_text)\n","print(time.time() - t1)"]},{"cell_type":"code","execution_count":null,"id":"50301fe1","metadata":{"id":"50301fe1","outputId":"9e3408a5-ea50-4f2a-8fb6-67480cdd13d1"},"outputs":[{"data":{"text/html":["
"],"text/plain":[" paper_id \\\n","0 206be0740f4d299003d4e09cd6f9a32e6e351130 \n","1 32356c8de8fcec7a46bc60793b557964e4e87f37 \n","2 e5447bc137727b3721de2313755d89b932e1eecc \n","3 8f1e56dded7f860a33ad291c06c773653270ee52 \n","4 c84b2484293b3aa59ec8aaecc7eadb93b2294dd7 \n","\n"," abstract \\\n","0 Heparanase (HPSE) is a multifunctional protein... \n","1 Objective The COVID-19 pandemic is currently o... \n","2 As the world navigates the COVID-19 health cri... \n","3 The sudden outbreak of coronavirus disease 201... \n","4 Spinal cord stimulation may enable recovery of... \n","\n"," body_text \n","0 Heparanase (HPSE) is an endo-β-d-endoglycosida... \n","1 World Health Organization (WHO) declared the o... \n","2 With the rise in positive COVID-19 cases and t... \n","3 In 2020, a new type of coronavirus, named coro... \n","4 Spinal cord injury (SCI) is a life-long condit... "]},"execution_count":24,"metadata":{},"output_type":"execute_result"}],"source":["#Creating a dataframe of all research papers from pdf json\n","pdf_df = pd.DataFrame(pdf_dict, columns=['paper_id', 'abstract', 'body_text'])\n","pdf_df.head()"]},{"cell_type":"code","execution_count":null,"id":"2199f9a4","metadata":{"id":"2199f9a4","outputId":"f642e0c3-a9bc-4177-f6a9-789ffacc0473"},"outputs":[{"name":"stderr","output_type":"stream","text":["189611it [04:54, 644.91it/s]"]},{"name":"stdout","output_type":"stream","text":["294.0136697292328\n"]},{"name":"stderr","output_type":"stream","text":["\n"]}],"source":["#Create a dictionary of all research papers from pmc json\n","pmc_dict = {'paper_id': [], 'body_text': []}\n","t1 = time.time()\n","for idx, record in tqdm(enumerate(pmc_json)):\n"," content = FileReader(record)\n"," pmc_dict['paper_id'].append(content.paper_id)\n"," pmc_dict['body_text'].append(content.body_text)\n","print(time.time() - 
t1)"]},{"cell_type":"code","execution_count":null,"id":"881f7057","metadata":{"id":"881f7057","outputId":"9fe448cd-349e-4be7-87f4-1936b6abf9e5"},"outputs":[{"data":{"text/html":["
"],"text/plain":[" paper_id body_text\n","0 PMC7550677 Previous research suggested that emotional str...\n","1 PMC7297029 First, areas with severe outbreaks have genera...\n","2 PMC8207685 Environment‐related illnesses, such as indoor ...\n","3 PMC7926729 Materials, useful matter, are used extensively...\n","4 PMC7471855 \\n[4]\\n$$$No"]},"execution_count":26,"metadata":{},"output_type":"execute_result"}],"source":["#Creating a dataframe of all research papers from pmc json\n","pmc_df = pd.DataFrame(pmc_dict, columns=['paper_id', 'body_text'])\n","pmc_df.head()"]},{"cell_type":"markdown","id":"5801f264","metadata":{"id":"5801f264"},"source":["### Sampling & Saving Course Dataset"]},{"cell_type":"raw","id":"9aeb860c","metadata":{"id":"9aeb860c"},"source":["Now we have 3 dataframes:\n","\n","\n","1. metadata - contains all the meta info of available research papers\n","2. pdf_df - contains the paper_id, abstract and body of all pdf research papers available\n","3. pmc_df - contains the paper_id and body of all the pmc research papers available"]},{"cell_type":"code","source":["%cd ..\n","%cd 1-Accessing\\ Dataset"],"metadata":{"id":"FtWkLAcZVcFs","executionInfo":{"status":"ok","timestamp":1697893475560,"user_tz":-180,"elapsed":422,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}}},"id":"FtWkLAcZVcFs","execution_count":9,"outputs":[]},{"cell_type":"code","execution_count":null,"id":"d1e5bdd2","metadata":{"id":"d1e5bdd2"},"outputs":[],"source":["#drop rows from the metadata where the corresponding research text doesn't exist in pdf json\n","metadata.dropna(subset=['pdf_json_files'], inplace=True)"]},{"cell_type":"code","execution_count":null,"id":"ebea6546","metadata":{"id":"ebea6546","outputId":"2ffdbdcd-0480-4533-b125-3f387604725e"},"outputs":[{"data":{"text/html":["
"],"text/plain":[" cord_uid sha source_x \\\n","0 ug7v899j d1aafb70c066a2068b02786f8929fd9c900897fb PMC \n","1 02tnwd4m 6b0567729c2143a66d737eb0a2f63f2dce2e5a7d PMC \n","2 ejv2xln0 06ced00a5fc04215949aa72528f2eeaae1d58927 PMC \n","\n"," title doi \\\n","0 Clinical features of culture-proven Mycoplasma... 10.1186/1471-2334-1-6 \n","1 Nitric oxide: a pro-inflammatory mediator in l... 10.1186/rr14 \n","2 Surfactant protein-D and pulmonary host defense 10.1186/rr19 \n","\n"," pmcid pubmed_id license \\\n","0 PMC35282 11472636 no-cc \n","1 PMC59543 11667967 no-cc \n","2 PMC59549 11667972 no-cc \n","\n"," abstract publish_time \\\n","0 OBJECTIVE: This retrospective chart review des... 2001-07-04 \n","1 Inflammatory diseases of the respiratory tract... 2000-08-15 \n","2 Surfactant protein-D (SP-D) participates in th... 2000-08-25 \n","\n"," authors journal mag_id \\\n","0 Madani, Tariq A; Al-Ghamdi, Aisha A BMC Infect Dis NaN \n","1 Vliet, Albert van der; Eiserich, Jason P; Cros... Respir Res NaN \n","2 Crouch, Erika C Respir Res NaN \n","\n"," who_covidence_id arxiv_id \\\n","0 NaN NaN \n","1 NaN NaN \n","2 NaN NaN \n","\n"," pdf_json_files \\\n","0 document_parses/pdf_json/d1aafb70c066a2068b027... \n","1 document_parses/pdf_json/6b0567729c2143a66d737... \n","2 document_parses/pdf_json/06ced00a5fc04215949aa... \n","\n"," pmc_json_files \\\n","0 document_parses/pmc_json/PMC35282.xml.json \n","1 document_parses/pmc_json/PMC59543.xml.json \n","2 document_parses/pmc_json/PMC59549.xml.json \n","\n"," url s2_id \n","0 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3... NaN \n","1 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5... NaN \n","2 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5... 
NaN "]},"execution_count":32,"metadata":{},"output_type":"execute_result"}],"source":["metadata.head(3)"]},{"cell_type":"code","execution_count":null,"id":"dc5e43e6","metadata":{"id":"dc5e43e6"},"outputs":[],"source":["#Taking 25000 randomly sampled records from metadata\n","sub_metadata = metadata.sample(25000)"]},{"cell_type":"code","execution_count":null,"id":"d000e2f2","metadata":{"id":"d000e2f2"},"outputs":[],"source":["#Sample both the pdf and pmc research paper tables based on the sampled metadata table\n","sub_pdf_df = pdf_df[pdf_df['paper_id'].isin(sub_metadata['sha'])]\n","sub_pmc_df = pmc_df[pmc_df['paper_id'].isin(sub_metadata['pmcid'])]"]},{"cell_type":"code","execution_count":null,"id":"13991f84","metadata":{"id":"13991f84"},"outputs":[],"source":["#storing the sample data\n","sub_metadata.to_pickle('metadata_sample.pkl')\n","sub_pdf_df.to_pickle('json_pdf_sample.pkl')\n","sub_pmc_df.to_pickle('json_pmc_sample.pkl')"]}],"metadata":{"colab":{"provenance":[]},"kernelspec":{"display_name":"Python 3 (ipykernel)","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.10"}},"nbformat":4,"nbformat_minor":5}
--------------------------------------------------------------------------------
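The `FileReader` step in codebook.ipynb joins the paragraphs of each parsed paper with `$$`. The following is a minimal standard-library sketch of that idea, not the notebook's own code: the function name `read_paper`, the constant `SEPARATOR`, and the demo record `demo123` are hypothetical and exist only for illustration.

```python
import json
import tempfile
from pathlib import Path

SEPARATOR = "$$"  # same paragraph delimiter the notebook's FileReader uses


def read_paper(path):
    """Return paper_id, abstract and body_text from one CORD-19-style JSON parse."""
    with open(path, encoding="utf-8") as f:
        content = json.load(f)
    return {
        "paper_id": content["paper_id"],
        # .get() tolerates a missing key, as the notebook's FileReader does
        "abstract": SEPARATOR.join(p["text"] for p in content.get("abstract", [])),
        "body_text": SEPARATOR.join(p["text"] for p in content.get("body_text", [])),
    }


# Hypothetical record shaped like the fields FileReader reads (illustration only).
sample = {
    "paper_id": "demo123",
    "abstract": [{"text": "First."}, {"text": "Second."}],
    "body_text": [{"text": "Intro."}],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo123.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    paper = read_paper(path)

# Splitting on the separator recovers the individual paragraphs.
paragraphs = paper["abstract"].split(SEPARATOR)
```

Returning a plain dict keeps each record ready to be appended to the per-column lists the notebook builds before creating its dataframes.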
/1-Accessing Dataset/metadata_sample.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yordanoswuletaw/covid19-qa-system/693929b1a3de68eae235e2b9a018f3c9422c76e4/1-Accessing Dataset/metadata_sample.pkl
--------------------------------------------------------------------------------
/2-Data Pre-Processing/covid_qa_data_preprocessing.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","metadata":{"id":"view-in-github"},"source":[""]},{"cell_type":"code","source":["# connecting colab with google drive\n","from google.colab import drive\n","drive.mount('/content/drive')"],"metadata":{"id":"xhZ0fcuKSVKe","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1697893278915,"user_tz":-180,"elapsed":40886,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}},"outputId":"9f2fe741-76f5-4b6b-c64b-5fd88a408bef"},"execution_count":11,"outputs":[{"output_type":"stream","name":"stdout","text":["Mounted at /content/drive\n"]}]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"taKuWoSOoXPs","outputId":"b7fdf869-f25a-419e-c6c2-6dc2407c09f4","executionInfo":{"status":"ok","timestamp":1697890838252,"user_tz":-180,"elapsed":17,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["Python 3.10.12\n"]}],"source":["#check python version\n","!python3 --version"]},{"cell_type":"markdown","source":["### Importing Libraries"],"metadata":{"id":"nElmnapAL5lD"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"FwXMHbcGoKit"},"outputs":[],"source":["#for python < 3.8\n","# !pip3 install pickle5\n","# import pickle5 as pickle"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"XO3nYJ8uz-22"},"outputs":[],"source":["import pickle\n","import numpy as np\n","import pandas as pd"]},{"cell_type":"markdown","metadata":{"id":"2RDPTgYZ0n-9"},"source":["**Accessing saved sample research papers and data**"]},{"cell_type":"code","source":["%cd drive/MyDrive/Covid\\ Q\\&A\\ System/1-Accessing\\ Dataset/"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"R5ncZDA3WtgF","executionInfo":{"status":"ok","timestamp":1697893875035,"user_tz":-180,"elapsed":724,"user":{"displayName":"Yordanos 
Wuletaw","userId":"06452499347519919653"}},"outputId":"7bf79107-e990-4051-efb6-77c1e129a6be"},"execution_count":47,"outputs":[{"output_type":"stream","name":"stdout","text":["/content/drive/MyDrive/Covid Q&A System/1-Accessing Dataset\n"]}]},{"cell_type":"code","execution_count":51,"metadata":{"id":"AwC1R5tHzz47","executionInfo":{"status":"ok","timestamp":1697893941921,"user_tz":-180,"elapsed":792,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}}},"outputs":[],"source":["with open('metadata_sample.pkl', 'rb') as f:\n"," df_metadata = pickle.load(f)\n","with open('json_pdf_sample.pkl', 'rb') as f:\n"," df_pdf = pickle.load(f)\n","with open('json_pmc_sample.pkl', 'rb') as f:\n"," df_pmc = pickle.load(f)"]},{"cell_type":"code","execution_count":52,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":521},"id":"fNKMWxKG0IMk","outputId":"eadd4404-35ed-43bc-f5ab-8468bbf29809","executionInfo":{"status":"ok","timestamp":1697894074993,"user_tz":-180,"elapsed":580,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" cord_uid sha \\\n","600880 n441sprx feedfe27a4eee49d8a1d09f50e8ecfe73057602a \n","721427 95libakm c2ca01f12643a88e059e81619fb971fa61de3971 \n","701284 bjl0bjg2 d0d1e05bb14f068d8323807ba042860887b7aa00 \n","603427 7ch1yln4 fb8c3a315b4a345dafc7e59be000e21bf965dea8 \n","771175 zegp4jq4 ed4c180c272b047c6db958508fb9a8edd40cb1cb \n","\n"," source_x \\\n","600880 Medline; PMC; WHO \n","721427 Elsevier; Medline; PMC \n","701284 Medline; PMC \n","603427 Medline; PMC \n","771175 Elsevier; Medline; PMC \n","\n"," title \\\n","600880 Experiences and effects of telerehabilitation ... \n","721427 SCOAT-Net: A novel network for segmenting COVI... \n","701284 A Comprehensive Review on Factors Influences B... \n","603427 Opposition to cannabis legalization on public ... 
\n","771175 An epigenetic signature to fight COVID-19 \n","\n"," doi pmcid pubmed_id license \\\n","600880 10.4102/sajp.v77i1.1528 PMC8252170 34230898.0 cc-by \n","721427 10.1016/j.patcog.2021.108109 PMC8189738 34127870.0 no-cc \n","701284 10.2147/ijn.s291956 PMC7898217 33628021.0 cc-by-nc \n","603427 10.1016/j.lanwpc.2021.100142 PMC8315435 34327442.0 cc-by-nc-nd \n","771175 10.1016/j.ebiom.2021.103385 PMC8116817 33993054.0 no-cc \n","\n"," abstract publish_time \\\n","600880 BACKGROUND: The announcement of a national loc... 2021-06-30 \n","721427 Automatic segmentation of lung opacification f... 2021-06-10 \n","701284 Exosomes are nanoscale-sized membrane vesicles... 2021-02-17 \n","603427 NaN 2021-04-13 \n","771175 NaN 2021-05-13 \n","\n"," authors \\\n","600880 Ebrahim, Humairaa; Pillay-Jayaraman, Prithi; L... \n","721427 Zhao, Shixuan; Li, Zhidan; Chen, Yang; Zhao, W... \n","701284 Gurunathan, Sangiliyandi; Kang, Min-Hee; Kim, ... \n","603427 Smyth, Bobby P; Christie, Grant IG \n","771175 Herbein, Georges \n","\n"," journal mag_id who_covidence_id arxiv_id \\\n","600880 S Afr J Physiother NaN NaN NaN \n","721427 Pattern Recognit NaN NaN NaN \n","701284 Int J Nanomedicine NaN NaN NaN \n","603427 Lancet Reg Health West Pac NaN NaN NaN \n","771175 EBioMedicine NaN NaN NaN \n","\n"," pdf_json_files \\\n","600880 document_parses/pdf_json/feedfe27a4eee49d8a1d0... \n","721427 document_parses/pdf_json/c2ca01f12643a88e059e8... \n","701284 document_parses/pdf_json/d0d1e05bb14f068d83238... \n","603427 document_parses/pdf_json/fb8c3a315b4a345dafc7e... \n","771175 document_parses/pdf_json/ed4c180c272b047c6db95... 
\n","\n"," pmc_json_files \\\n","600880 document_parses/pmc_json/PMC8252170.xml.json \n","721427 document_parses/pmc_json/PMC8189738.xml.json \n","701284 document_parses/pmc_json/PMC7898217.xml.json \n","603427 document_parses/pmc_json/PMC8315435.xml.json \n","771175 document_parses/pmc_json/PMC8116817.xml.json \n","\n"," url s2_id \n","600880 https://www.ncbi.nlm.nih.gov/pubmed/34230898/;... 235720984.0 \n","721427 https://www.ncbi.nlm.nih.gov/pubmed/34127870/;... 235382189.0 \n","701284 https://doi.org/10.2147/ijn.s291956; https://w... 232016676.0 \n","603427 https://doi.org/10.1016/j.lanwpc.2021.100142; ... 234824471.0 \n","771175 https://www.ncbi.nlm.nih.gov/pubmed/33993054/;... 234487502.0 "],"text/html":["\n","
\n"]},"metadata":{},"execution_count":52}],"source":["df_metadata.head()"]},{"cell_type":"markdown","metadata":{"id":"1dMkwvFBpT8Y"},"source":["**Data Merging (merging the data from metadata, json pdf and json pmc files for research papers)**"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"hI3Nm5iB1auL"},"outputs":[],"source":["#Merging the metadata and json pdf data\n","df_merged = pd.merge(df_metadata,df_pdf,left_on='sha',right_on='paper_id',how='left').drop('paper_id',axis=1)"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":267},"id":"RAjNPccl1iDh","outputId":"e2f13d3a-c47d-4cc9-9880-734a0235c671"},"outputs":[{"data":{"text/html":["
"],"text/plain":[" cord_uid ... body_text\n","0 bccth3yz ... On March 11, 2020, the World Health Organizati...\n","1 3udtsvga ... Since last December, a new coronavirus has bee...\n","\n","[2 rows x 21 columns]"]},"execution_count":15,"metadata":{},"output_type":"execute_result"}],"source":["df_merged.head(2)\n","#abstract_x is the abstract from metadata and abstract_y is the abstract from json_pdf"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"gxzG2kDI1l5V","outputId":"e56835c6-be08-4c87-9c7c-bfd347b16cd4"},"outputs":[{"data":{"text/plain":["(25000, 22)"]},"execution_count":16,"metadata":{},"output_type":"execute_result"}],"source":["#Lets merge the json_pmc data to the merged data too\n","df_merged = pd.merge(df_merged,df_pmc,left_on='pmcid',right_on='paper_id',how='left').drop('paper_id',axis=1)\n","df_merged.shape"]},{"cell_type":"markdown","metadata":{"id":"F3QtxcWFqrLd"},"source":["**Data Cleaning and Preprocessing**"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"LC50Eff91svG","outputId":"4b9165c9-a87c-40f3-a6ba-d0c8abc32b80"},"outputs":[{"data":{"text/plain":["(22649, 22)"]},"execution_count":17,"metadata":{},"output_type":"execute_result"}],"source":["df_merged[df_merged.abstract_x != df_merged.abstract_y].shape"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"8ay1bvRG1u1U","outputId":"766cbd10-c600-4264-be85-f2290cc52900"},"outputs":[{"data":{"text/plain":["(3738, 0)"]},"execution_count":18,"metadata":{},"output_type":"execute_result"}],"source":["# check metadata abstract column to see if null values exist\n","df_merged.abstract_x.isnull().sum(),(df_merged.abstract_x == 
'').sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"8KPvZLCG1xq4","outputId":"c22b21a7-3718-411b-b1c2-a137b8dd01ce"},"outputs":[{"data":{"text/plain":["(1461, 6937)"]},"execution_count":19,"metadata":{},"output_type":"execute_result"}],"source":["# Check the pdf_json abstract column for null and empty values\n","df_merged.abstract_y.isnull().sum(),(df_merged.abstract_y == '').sum()"]},{"cell_type":"markdown","metadata":{"id":"q76-CsdO19Pd"},"source":["Since abstract_x from the metadata is more reliable, we keep it and fill in the abstract_y text only where abstract_x is null.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"b_78rcIn1zUc"},"outputs":[],"source":["# Convert abstract_y to string, then replace abstracts of 50 characters or fewer with the placeholder 'na'\n","#df = df.astype(str)\n","df_merged[\"abstract_y\"] = df_merged[\"abstract_y\"].astype(str)\n","df_merged['abstract_y'] = np.where(df_merged['abstract_y'].map(len) > 50, df_merged['abstract_y'], \"na\")"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":1000},"id":"BzTGcVLU2ApK","outputId":"636104a5-a74e-4bb8-a73e-8b669f9e7195"},"outputs":[{"data":{"text/html":["
\n","\n","
\n"," \n","
\n","
\n","
cord_uid
\n","
sha
\n","
source_x
\n","
title
\n","
doi
\n","
pmcid
\n","
pubmed_id
\n","
license
\n","
abstract_x
\n","
publish_time
\n","
authors
\n","
journal
\n","
mag_id
\n","
who_covidence_id
\n","
arxiv_id
\n","
pdf_json_files
\n","
pmc_json_files
\n","
url
\n","
s2_id
\n","
abstract_y
\n","
body_text_x
\n","
body_text_y
\n","
\n"," \n"," \n","
\n","
2
\n","
va3ov8aj
\n","
7fd64a56bf3a761da96e6aad00789413813599ca
\n","
Elsevier; Medline; PMC
\n","
Asymptomatic SARS-CoV-2 infection
\n","
10.1016/s1473-3099(20)30460-6
\n","
PMC7292578
\n","
32539989
\n","
no-cc
\n","
NaN
\n","
2020-06-12
\n","
Ooi, Eng Eong; Low, Jenny G
\n","
Lancet Infect Dis
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/7fd64a56bf3a761da96e6...
\n","
document_parses/pmc_json/PMC7292578.xml.json
\n","
https://doi.org/10.1016/s1473-3099(20)30460-6;...
\n","
219603626.0
\n","
na
\n","
The pandemic spread of severe acute respirator...
\n","
The pandemic spread of severe acute respirator...
\n","
\n","
\n","
7
\n","
fpbz7143
\n","
f54ff316cc17e9bc6047fdda2ba37dba7f0b3d72
\n","
Medline; PMC
\n","
Pression‐induced facial ulcers by prone positi...
\n","
10.1111/dth.13748
\n","
PMC7300922
\n","
32495445
\n","
no-cc
\n","
NaN
\n","
2020-06-03
\n","
Ramondetta, Alice; Ribero, Simone; Costi, Soni...
\n","
Dermatol Ther
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/f54ff316cc17e9bc6047f...
\n","
NaN
\n","
https://doi.org/10.1111/dth.13748; https://www...
\n","
219313107.0
\n","
na
\n","
Dear Editor, Severe acute respiratory syndrome...
\n","
NaN
\n","
\n","
\n","
10
\n","
mbs0mddg
\n","
b4466c4bfc22fad10131dc138563ddf36712ca4f; 1b79...
\n","
Medline; PMC
\n","
Impact of the COVID-19 pandemic on the core fu...
\n","
10.1136/bmjopen-2020-039674
\n","
PMC7306272
\n","
32554730
\n","
cc-by-nc
\n","
OBJECTIVES: The current COVID-19 pandemic, as ...
\n","
2020-06-17
\n","
Verhoeven, Veronique; Tsakitzidis, Giannoula; ...
\n","
BMJ Open
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/b4466c4bfc22fad10131d...
\n","
document_parses/pmc_json/PMC7306272.xml.json
\n","
https://www.ncbi.nlm.nih.gov/pubmed/32554730/;...
\n","
219926577.0
\n","
na
\n","
NaN
\n","
The current COVID-19 pandemic puts a previousl...
\n","
\n","
\n","
11
\n","
3vvp6gch
\n","
2b5f5305fdb732f7323187062c881cf21c328837
\n","
Elsevier; PMC
\n","
Detection and screening of COVID-19 through ch...
\n","
10.1016/b978-0-12-824536-1.00039-3
\n","
PMC8137981
\n","
NaN
\n","
no-cc
\n","
December 2019 ended with a deadly virus outbre...
\n","
2021-05-21
\n","
Munir, Khushboo; Elahi, Hassan; Farooq, Muhamm...
\n","
Data Science for COVID-19
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/2b5f5305fdb732f732318...
\n","
document_parses/pmc_json/PMC8137981.xml.json
\n","
https://www.sciencedirect.com/science/article/...
\n","
234794717.0
\n","
na
\n","
Our World has witnessed three most deadly outb...
\n","
Our World has witnessed three most deadly outb...
\n","
\n","
\n","
18
\n","
xrzygcr3
\n","
cce7c2d31a3ba14d0e868a2c688555f681dceb2f
\n","
PMC
\n","
Lost Trust: Socio-biological Hazard—From AIDS ...
\n","
10.1007/978-4-431-55924-5_5
\n","
PMC7123902
\n","
NaN
\n","
no-cc
\n","
Iatrogenic HIV infection refers here to cases ...
\n","
2016-03-28
\n","
Atsuji, Shigeo
\n","
Unsafety
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/cce7c2d31a3ba14d0e868...
\n","
document_parses/pmc_json/PMC7123902.xml.json
\n","
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...
\n","
NaN
\n","
na
\n","
from a 'socio-biological perspective' at the p...
\n","
It is now over 30 years since the first AIDS c...
\n","
\n","
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
\n","
\n","
24988
\n","
0u3e1493
\n","
379fea33b0ffd1ea5b743ebf9c05ca7cd4498531
\n","
Elsevier; Medline; PMC
\n","
Research and control of parasitic diseases in ...
\n","
10.1016/j.pt.2007.02.011
\n","
PMC7106409
\n","
17350339
\n","
no-cc
\n","
Between 1950 and 1980, Japan eliminated severa...
\n","
2007-03-09
\n","
Kasai, Takeshi; Nakatani, Hiroki; Takeuchi, Ts...
\n","
Trends Parasitol
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/379fea33b0ffd1ea5b743...
\n","
document_parses/pmc_json/PMC7106409.xml.json
\n","
https://www.sciencedirect.com/science/article/...
\n","
42815596.0
\n","
na
\n","
Between 1950 and 1980, Japan eliminated severa...
\n","
Japan occupies a unique position with respect ...
\n","
\n","
\n","
24994
\n","
euh55i56
\n","
4d475f635e1e2ecee857165e5e98b6bc4f13700b
\n","
Elsevier; Medline; PMC
\n","
Avoiding Aerosol Generation During Tracheostom...
\n","
10.1016/j.jamcollsurg.2020.08.730
\n","
PMC7486050
\n","
32928626
\n","
no-cc
\n","
NaN
\n","
2020-09-11
\n","
Kapoor, Indu; Prabhakar, Hemanshu; Mahajan, Charu
\n","
J Am Coll Surg
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/4d475f635e1e2ecee8571...
\n","
document_parses/pmc_json/PMC7486050.xml.json
\n","
https://www.sciencedirect.com/science/article/...
\n","
221616263.0
\n","
na
\n","
We read with interest the article by Foster an...
\n","
We read with interest the article by Foster an...
\n","
\n","
\n","
24996
\n","
bhmzv574
\n","
dfda8e005370408f6e1a9eb9b4dadc181f45dfcc
\n","
Medline; PMC
\n","
SARS-CoV-2 within-host diversity and transmission
\n","
10.1126/science.abg0821
\n","
PMC8128293
\n","
33688063
\n","
cc-by
\n","
Extensive global sampling and sequencing of th...
\n","
2021-04-16
\n","
Lythgoe, Katrina A.; Hall, Matthew; Ferretti, ...
\n","
Science
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/dfda8e005370408f6e1a9...
\n","
document_parses/pmc_json/PMC8128293.xml.json
\n","
https://www.ncbi.nlm.nih.gov/pubmed/33688063/;...
\n","
232169666.0
\n","
na
\n","
INTRODUCTION: Genome sequencing at an unpreced...
\n","
Reliable estimation of variant frequencies req...
\n","
\n","
\n","
24998
\n","
3kgxnyxv
\n","
fff0a264abb537842ffceba9933f2f6f6cca19b4
\n","
Elsevier; Medline; PMC
\n","
Disease surveillance for the COVID-19 era: tim...
\n","
10.1016/s0140-6736(21)01096-5
\n","
PMC8121493
\n","
34000258
\n","
no-cc
\n","
NaN
\n","
2021-05-14
\n","
Morgan, Oliver W; Aguilera, Ximena; Ammon, And...
\n","
Lancet
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/fff0a264abb537842ffce...
\n","
document_parses/pmc_json/PMC8121493.xml.json
\n","
https://www.ncbi.nlm.nih.gov/pubmed/34000258/;...
\n","
234498002.0
\n","
na
\n","
The COVID-19 pandemic has exposed weaknesses i...
\n","
The COVID-19 pandemic has exposed weaknesses i...
\n","
\n","
\n","
24999
\n","
kq7mjq6d
\n","
8e5a6881a733deb1800fbb6edc37f4dc12501acf
\n","
Medline; PMC
\n","
Comprehensive Profiling of Inflammatory Factor...
\n","
10.3389/fimmu.2021.662465
\n","
PMC8320433
\n","
34335566
\n","
cc-by
\n","
To systematically explore potential biomarkers...
\n","
2021-07-15
\n","
Teng, Xiangyun; Zhang, Jiaqi; Shi, Yaling; Liu...
\n","
Front Immunol
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/8e5a6881a733deb1800fb...
\n","
document_parses/pmc_json/PMC8320433.xml.json
\n","
https://www.ncbi.nlm.nih.gov/pubmed/34335566/;...
\n","
235916128.0
\n","
na
\n","
Coronavirus disease 2019 (COVID-19) resulting ...
\n","
Coronavirus disease 2019 (COVID-19) resulting ...
\n","
\n"," \n","
\n","
8513 rows × 22 columns
\n","
"],"text/plain":[" cord_uid ... body_text_y\n","2 va3ov8aj ... The pandemic spread of severe acute respirator...\n","7 fpbz7143 ... NaN\n","10 mbs0mddg ... The current COVID-19 pandemic puts a previousl...\n","11 3vvp6gch ... Our World has witnessed three most deadly outb...\n","18 xrzygcr3 ... It is now over 30 years since the first AIDS c...\n","... ... ... ...\n","24988 0u3e1493 ... Japan occupies a unique position with respect ...\n","24994 euh55i56 ... We read with interest the article by Foster an...\n","24996 bhmzv574 ... Reliable estimation of variant frequencies req...\n","24998 3kgxnyxv ... The COVID-19 pandemic has exposed weaknesses i...\n","24999 kq7mjq6d ... Coronavirus disease 2019 (COVID-19) resulting ...\n","\n","[8513 rows x 22 columns]"]},"execution_count":22,"metadata":{},"output_type":"execute_result"}],"source":["df_merged[df_merged['abstract_y'].apply(lambda x: len(str(x)) <= 10)]"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"RFh2fs_g2eCg"},"outputs":[],"source":["# replace abstract_x (metadata column) with abstract_y (pdf_json) value where abstract_x is null\n","df_merged.loc[df_merged.abstract_x.isnull() & (df_merged.abstract_y != 'na'),'abstract_x'] = df_merged[df_merged.abstract_x.isnull() & (df_merged.abstract_y != 'na')].abstract_y"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"jvnT5rrs2nfP","outputId":"a1e737a7-b2fe-41a0-c625-dca6603d128c"},"outputs":[{"data":{"text/plain":["3081"]},"execution_count":24,"metadata":{},"output_type":"execute_result"}],"source":["# Now we do not have any null abstract values.\n","# The null values have reduced which is what we had expected.\n","df_merged.abstract_x.isnull().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Pmoo77Ef2pgk","outputId":"d9bfca0f-25bf-4acb-dd29-4caeae0c2246"},"outputs":[{"data":{"text/plain":["Index(['cord_uid', 'sha', 'source_x', 
'title', 'doi', 'pmcid', 'pubmed_id',\n"," 'license', 'abstract', 'publish_time', 'authors', 'journal', 'mag_id',\n"," 'who_covidence_id', 'arxiv_id', 'pdf_json_files', 'pmc_json_files',\n"," 'url', 's2_id', 'body_text_x', 'body_text_y'],\n"," dtype='object')"]},"execution_count":25,"metadata":{},"output_type":"execute_result"}],"source":["# Lets get rid of the pdf_json abstract column and rename the metadata abstract column\n","df_merged.rename(columns = {'abstract_x' : 'abstract'}, inplace = True)\n","df_merged.drop('abstract_y',axis=1,inplace = True)\n","df_merged.columns"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"VkG6PBXq2rUb","outputId":"5c16e439-66f3-4838-a4d2-295ec3b56ef2"},"outputs":[{"data":{"text/plain":["24999"]},"execution_count":26,"metadata":{},"output_type":"execute_result"}],"source":["# This is expected because body text comes from pdf and pmc folders\n","(df_merged.body_text_x != df_merged.body_text_y).sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"tgRLK4mo2vbR","outputId":"aa2db169-3ef2-4a9a-d1ed-eb2b8c507778"},"outputs":[{"data":{"text/plain":["(1461, 0)"]},"execution_count":27,"metadata":{},"output_type":"execute_result"}],"source":["df_merged.body_text_x.isnull().sum(),(df_merged.body_text_y == '').sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"dFGxq1452yc2","outputId":"034de805-8c6d-4502-9d4b-db756093a56c"},"outputs":[{"data":{"text/plain":["5736"]},"execution_count":28,"metadata":{},"output_type":"execute_result"}],"source":["# This is expected because there are less papers in 
json_pmc\n","df_merged.body_text_y.isnull().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"KGEK6g_520a1","outputId":"51abfaf0-794a-47b0-b660-7980dbd5f37f"},"outputs":[{"data":{"text/plain":["(1461, 5736)"]},"execution_count":29,"metadata":{},"output_type":"execute_result"}],"source":["# body_text_x is pdf_json. body_text_y comes from pmc_json\n","# Where available we use the text from pmc file trusting the statement quality\n","df_merged.body_text_x.isnull().sum(),(df_merged.body_text_y.isnull()).sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":1000},"id":"LDSMjs4n23sY","outputId":"e040fc3c-414c-446e-85d9-abab1beacec3"},"outputs":[{"data":{"text/html":["
\n","\n","
\n"," \n","
\n","
\n","
cord_uid
\n","
sha
\n","
source_x
\n","
title
\n","
doi
\n","
pmcid
\n","
pubmed_id
\n","
license
\n","
abstract
\n","
publish_time
\n","
authors
\n","
journal
\n","
mag_id
\n","
who_covidence_id
\n","
arxiv_id
\n","
pdf_json_files
\n","
pmc_json_files
\n","
url
\n","
s2_id
\n","
body_text_x
\n","
body_text_y
\n","
\n"," \n"," \n","
\n","
10
\n","
mbs0mddg
\n","
b4466c4bfc22fad10131dc138563ddf36712ca4f; 1b79...
\n","
Medline; PMC
\n","
Impact of the COVID-19 pandemic on the core fu...
\n","
10.1136/bmjopen-2020-039674
\n","
PMC7306272
\n","
32554730
\n","
cc-by-nc
\n","
OBJECTIVES: The current COVID-19 pandemic, as ...
\n","
2020-06-17
\n","
Verhoeven, Veronique; Tsakitzidis, Giannoula; ...
\n","
BMJ Open
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/b4466c4bfc22fad10131d...
\n","
document_parses/pmc_json/PMC7306272.xml.json
\n","
https://www.ncbi.nlm.nih.gov/pubmed/32554730/;...
\n","
219926577.0
\n","
NaN
\n","
The current COVID-19 pandemic puts a previousl...
\n","
\n","
\n","
21
\n","
di59froz
\n","
aed72ae7c8e4bd6cb9bdea4455743660f068c8fd; 3966...
\n","
ArXiv; Medline; PMC
\n","
Scenarios of future Indian electricity demand ...
\n","
10.1038/s41597-021-00951-6
\n","
PMC8282627
\n","
34267222
\n","
cc-by
\n","
India is expected to witness rapid growth in e...
\n","
2021-07-15
\n","
Barbar, Marc; Mallapragada, Dharik S.; Alsup, ...
\n","
Sci Data
\n","
NaN
\n","
NaN
\n","
2106.07588
\n","
document_parses/pdf_json/aed72ae7c8e4bd6cb9bde...
\n","
document_parses/pmc_json/PMC8282627.xml.json
\n","
https://www.ncbi.nlm.nih.gov/pubmed/34267222/;...
\n","
235422365.0
\n","
NaN
\n","
Many assessments of future electricity demand ...
\n","
\n","
\n","
34
\n","
bajkeabm
\n","
47487599e7389023b75c7b76d6017a61677afaf9; 7b20...
\n","
Medline; PMC
\n","
COVID-19 in pregnancy: the foetal perspective—...
\n","
10.1136/bmjpo-2020-000859
\n","
PMC7689539
\n","
34192182
\n","
cc-by-nc
\n","
OBJECTIVE: We aimed to conduct a systematic re...
\n","
2020-11-25
\n","
Dube, Rajani; Kar, Subhranshu Sekhar
\n","
BMJ Paediatr Open
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/47487599e7389023b75c7...
\n","
document_parses/pmc_json/PMC7689539.xml.json
\n","
https://doi.org/10.1136/bmjpo-2020-000859; htt...
\n","
227228904.0
\n","
NaN
\n","
Novel coronavirus infection and associated cor...
\n","
\n","
\n","
47
\n","
6jzkxubq
\n","
d0cb7fd7db94f89b02affed967bcd89cafbf29d3; 93cd...
\n","
Medline; PMC
\n","
Identification of a new human coronavirus
\n","
10.1038/nm1024
\n","
PMC7095789
\n","
15034574
\n","
no-cc
\n","
Three human coronaviruses are known to exist: ...
\n","
2004-03-21
\n","
van der Hoek, Lia; Pyrc, Krzysztof; Jebbink, M...
\n","
Nat Med
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/d0cb7fd7db94f89b02aff...
\n","
document_parses/pmc_json/PMC7095789.xml.json
\n","
https://www.ncbi.nlm.nih.gov/pubmed/15034574/
\n","
24428187.0
\n","
NaN
\n","
To date, there is still a variety of human dis...
\n","
\n","
\n","
52
\n","
3foexs3n
\n","
b3abb4d81aaff449e6cbfe444b6b2b80cb8d3cb2; aa80...
\n","
Medline; PMC
\n","
Lack of Efficacy of SGLT2-i in Severe Pneumoni...
\n","
10.1007/s13300-020-00844-8
\n","
PMC7244936
\n","
32447736
\n","
cc-by-nc
\n","
NaN
\n","
2020-05-23
\n","
Bossi, Antonio Carlo; Forloni, Franco; Colombe...
\n","
Diabetes Ther
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/b3abb4d81aaff449e6cbf...
\n","
document_parses/pmc_json/PMC7244936.xml.json
\n","
https://www.ncbi.nlm.nih.gov/pubmed/32447736/;...
\n","
218837093.0
\n","
NaN
\n","
Italy is facing a dramatic health emergency re...
\n","
\n","
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
...
\n","
\n","
\n","
24891
\n","
opvq9h5f
\n","
4e88dd5e44f6025f7dfd2c63cd3c9f7778182346; 62ef...
\n","
Medline; PMC
\n","
Experiences of pregnant mothers using a social...
\n","
10.1136/bmjopen-2020-040649
\n","
PMC7813413
\n","
33455927
\n","
cc-by-nc
\n","
OBJECTIVES: The COVID-19 pandemic has seen unp...
\n","
2021-01-17
\n","
Chatwin, John; Butler, Danielle; Jones, Jude; ...
\n","
BMJ Open
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/4e88dd5e44f6025f7dfd2...
\n","
document_parses/pmc_json/PMC7813413.xml.json
\n","
https://www.ncbi.nlm.nih.gov/pubmed/33455927/;...
\n","
231635948.0
\n","
NaN
\n","
The COVID-19 pandemic has seen unprecedented r...
\n","
\n","
\n","
24920
\n","
ejcs0mm1
\n","
e0be607e42c1e52efb075e7bf974ae0bca952a2b; e728...
\n","
Medline; PMC
\n","
Conformational Dynamics in the Interaction of ...
\n","
10.1021/acs.jpclett.1c00831
\n","
PMC8204754
\n","
34110168
\n","
no-cc
\n","
[Image: see text] Papain-like protease (PLpro)...
\n","
2021-06-10
\n","
Leite, Wellington C.; Weiss, Kevin L.; Phillip...
\n","
J Phys Chem Lett
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/e0be607e42c1e52efb075...
\n","
document_parses/pmc_json/PMC8204754.xml.json
\n","
https://doi.org/10.1021/acs.jpclett.1c00831; h...
\n","
235394198.0
\n","
NaN
\n","
Severe acute respiratory syndrome\\ncoronavirus...
\n","
\n","
\n","
24937
\n","
tiujpsm9
\n","
2be48d21caa44e9a401a518c1b91a13c68445846; 1875...
\n","
Elsevier; Medline; PMC
\n","
COVID-19 response and containment strategies i...
\n","
10.1016/j.ajem.2020.04.072
\n","
PMC7195105
\n","
32386807
\n","
no-cc
\n","
NaN
\n","
2020-04-26
\n","
Sen-Crowe, Brendon; McKenney, Mark; Elkbuli, Adel
\n","
Am J Emerg Med
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/2be48d21caa44e9a401a5...
\n","
document_parses/pmc_json/PMC7195105.xml.json
\n","
https://doi.org/10.1016/j.ajem.2020.04.072; ht...
\n","
218466657.0
\n","
NaN
\n","
COVID-19 confirmed fatalities in the United St...
\n","
\n","
\n","
24969
\n","
j28rgb41
\n","
66b2fb45d49bbd9c3fb969ddd6332ee1d548103e; 0978...
\n","
PMC
\n","
Co-circulation of both low and highly pathogen...
\n","
10.1093/ve/veaa037
\n","
PMC7326300
\n","
NaN
\n","
cc-by-nc
\n","
Highly pathogenic avian influenza (HPAI) A(H5)...
\n","
2020-06-30
\n","
Li, Yao-Tsun; Chen, Chen-Chih; Chang, Ai-Mei; ...
\n","
Virus Evol
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/66b2fb45d49bbd9c3fb96...
\n","
document_parses/pmc_json/PMC7326300.xml.json
\n","
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...
\n","
NaN
\n","
NaN
\n","
Highly pathogenic avian influenza virus (HPAI)...
\n","
\n","
\n","
24980
\n","
sld6vdml
\n","
680b13de30a0e9bcd8914849a1ef758c1391738b; 458c...
\n","
Medline; PMC
\n","
Sexual health and COVID-19: protocol for a sco...
\n","
10.1186/s13643-021-01591-y
\n","
PMC7825389
\n","
33485393
\n","
cc-by
\n","
BACKGROUND: Global responses to the COVID-19 p...
\n","
2021-01-23
\n","
Kumar, Navin; Janmohamed, Kamila; Nyhan, Kate;...
\n","
Syst Rev
\n","
NaN
\n","
NaN
\n","
NaN
\n","
document_parses/pdf_json/680b13de30a0e9bcd8914...
\n","
document_parses/pmc_json/PMC7825389.xml.json
\n","
https://www.ncbi.nlm.nih.gov/pubmed/33485393/;...
\n","
231690793.0
\n","
NaN
\n","
Global responses to the COVID-19 pandemic have...
\n","
\n"," \n","
\n","
1372 rows × 21 columns
\n","
"],"text/plain":[" cord_uid ... body_text_y\n","10 mbs0mddg ... The current COVID-19 pandemic puts a previousl...\n","21 di59froz ... Many assessments of future electricity demand ...\n","34 bajkeabm ... Novel coronavirus infection and associated cor...\n","47 6jzkxubq ... To date, there is still a variety of human dis...\n","52 3foexs3n ... Italy is facing a dramatic health emergency re...\n","... ... ... ...\n","24891 opvq9h5f ... The COVID-19 pandemic has seen unprecedented r...\n","24920 ejcs0mm1 ... Severe acute respiratory syndrome\\ncoronavirus...\n","24937 tiujpsm9 ... COVID-19 confirmed fatalities in the United St...\n","24969 j28rgb41 ... Highly pathogenic avian influenza virus (HPAI)...\n","24980 sld6vdml ... Global responses to the COVID-19 pandemic have...\n","\n","[1372 rows x 21 columns]"]},"execution_count":30,"metadata":{},"output_type":"execute_result"}],"source":["# There are ~13k rows where body_text_x is null but body_text_y is not null\n","df_merged.loc[df_merged.body_text_x.isnull() & df_merged.body_text_y.notnull()]"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"5NrCfQ7k25WX"},"outputs":[],"source":["# Lets trutst the text from pmc folder to be of higher quality as it contains full text.\n","# Hence we will replace with body_text_x with body_text_y where body_text_y exists\n","df_merged.loc[df_merged.body_text_y.notnull(),'body_text_x'] = df_merged.loc[df_merged.body_text_y.notnull(), 'body_text_y']"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"7a0p_Yb329FL","outputId":"62bd404f-797b-4d8a-9cee-2ff032690028"},"outputs":[{"data":{"text/plain":["Index(['cord_uid', 'sha', 'source_x', 'title', 'doi', 'pmcid', 'pubmed_id',\n"," 'license', 'abstract', 'publish_time', 'authors', 'journal', 'mag_id',\n"," 'who_covidence_id', 'arxiv_id', 'pdf_json_files', 'pmc_json_files',\n"," 'url', 's2_id', 'body_text'],\n"," 
dtype='object')"]},"execution_count":32,"metadata":{},"output_type":"execute_result"}],"source":["# Lets get rid of the pdf_pmc body text column and rename the body text column\n","df_merged.rename(columns = {'body_text_x' : 'body_text'}, inplace = True)\n","df_merged.drop('body_text_y',axis=1,inplace = True)\n","df_merged.columns"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"bbyVSGqh3BIs","outputId":"e54a485b-1d70-457b-a79a-8b176f702ac6"},"outputs":[{"data":{"text/plain":["89"]},"execution_count":33,"metadata":{},"output_type":"execute_result"}],"source":["# Body text null values have now decreased.\n","df_merged.body_text.isnull().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Tq3_goHV3DSG","outputId":"697e9d42-c34d-4fd7-b943-e9b22fa342be"},"outputs":[{"data":{"text/plain":["Index(['cord_uid', 'sha', 'source_x', 'title', 'doi', 'pmcid', 'pubmed_id',\n"," 'license', 'abstract', 'publish_time', 'authors', 'journal', 'mag_id',\n"," 'who_covidence_id', 'arxiv_id', 'pdf_json_files', 'pmc_json_files',\n"," 'url', 's2_id', 'body_text'],\n"," dtype='object')"]},"execution_count":34,"metadata":{},"output_type":"execute_result"}],"source":["df_merged.columns"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"IqkpfMY43FHO"},"outputs":[],"source":["df_final = df_merged[['sha', 'title', 'abstract', 'publish_time', 'authors', 'url', 'body_text']]"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":595},"id":"4iQ9-OhE3HiH","outputId":"51892a21-81fa-40a5-89e6-66544e6069a7"},"outputs":[{"data":{"text/html":["
\n","\n","
\n"," \n","
\n","
\n","
sha
\n","
title
\n","
abstract
\n","
publish_time
\n","
authors
\n","
url
\n","
body_text
\n","
\n"," \n"," \n","
\n","
0
\n","
41ed6b3014e89604b351c095571c45056ba12c37
\n","
A phantom study to optimise the automatic tube...
\n","
On March 11, 2020, the World Health Organizati...
\n","
2021-05-28
\n","
Gombolevskiy, Victor; Morozov, Sergey; Chernin...
\n","
https://doi.org/10.1186/s41747-021-00218-0; ht...
\n","
\\nWe obtained a density difference (ground-gla...
\n","
\n","
\n","
1
\n","
57221999055bdc0b1c4bb16c790fdebdac1f7ce7
\n","
Tunicamycin, an anticancer drug and inhibitor ...
\n","
SARS-CoV-2 outbreaks remains a medical and eco...
\n","
2020-10-20
\n","
Dawood, Ali Adel; Altobje, Mahmood Abduljabar
\n","
https://doi.org/10.1016/j.micpath.2020.104586;...
\n","
Tunicamycin is an antibiotic was produced by S...
\n","
\n","
\n","
2
\n","
7fd64a56bf3a761da96e6aad00789413813599ca
\n","
Asymptomatic SARS-CoV-2 infection
\n","
NaN
\n","
2020-06-12
\n","
Ooi, Eng Eong; Low, Jenny G
\n","
https://doi.org/10.1016/s1473-3099(20)30460-6;...
\n","
The pandemic spread of severe acute respirator...
\n","
\n","
\n","
3
\n","
bc21c4203bb2d33785bff735f39d0b8758692509
\n","
Electroencephalographic findings in COVID-19 p...
\n","
BACKGROUND: Growing evidence of neurologic inv...
\n","
2020-09-15
\n","
Roberto, Katrina T.; Espiritu, Adrian I.; Fern...
\n","
https://www.ncbi.nlm.nih.gov/pubmed/32957032/;...
\n","
The coronavirus disease of 2019 (COVID-19) inf...
\n","
\n","
\n","
4
\n","
fa5968c0d0290201f8eb1ca2eaf98181ae6a7a6f
\n","
Mechanical Ventilator Parameter Estimation for...
\n","
Patients whose lungs are compromised due to va...
\n","
2021-05-07
\n","
Oruganti Venkata, Sanjay Sarma; Koenig, Amie; ...
\n","
https://doi.org/10.3390/bioengineering8050060;...
\n","
In a healthy person, spontaneous breaths are n...
\n","
\n"," \n","
\n","
"],"text/plain":[" sha ... body_text\n","0 41ed6b3014e89604b351c095571c45056ba12c37 ... \\nWe obtained a density difference (ground-gla...\n","1 57221999055bdc0b1c4bb16c790fdebdac1f7ce7 ... Tunicamycin is an antibiotic was produced by S...\n","2 7fd64a56bf3a761da96e6aad00789413813599ca ... The pandemic spread of severe acute respirator...\n","3 bc21c4203bb2d33785bff735f39d0b8758692509 ... The coronavirus disease of 2019 (COVID-19) inf...\n","4 fa5968c0d0290201f8eb1ca2eaf98181ae6a7a6f ... In a healthy person, spontaneous breaths are n...\n","\n","[5 rows x 7 columns]"]},"execution_count":36,"metadata":{},"output_type":"execute_result"}],"source":["df_final.head()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"jb17usY03JQq","outputId":"42f093f0-a33e-48f3-d298-8ae85146de95"},"outputs":[{"data":{"text/plain":["sha 0\n","title 0\n","abstract 3067\n","publish_time 0\n","authors 218\n","url 0\n","body_text 0\n","dtype: int64"]},"execution_count":37,"metadata":{},"output_type":"execute_result"}],"source":["df_final = df_final.dropna(axis=0,subset=['body_text', 'title'])\n","df_final.isnull().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"oNILUEuR3LxI","outputId":"ad345898-b710-459c-b5cb-d0be017e31d8"},"outputs":[{"data":{"text/plain":["(24910, 7)"]},"execution_count":38,"metadata":{},"output_type":"execute_result"}],"source":["df_final.shape"]},{"cell_type":"code","source":["%cd ..\n","%cd 2-Data\\ Pre-Processing"],"metadata":{"id":"pP1wrBgsXpQf"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":null,"metadata":{"id":"aseuam6r3N7w"},"outputs":[],"source":["df_final.to_csv('FINAL_COVID_QA_DATASET.csv', index=False)"]}],"metadata":{"colab":{"provenance":[]},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":0}
--------------------------------------------------------------------------------
/4-Knowledge Base/README.md:
--------------------------------------------------------------------------------
1 | This component handles the knowledge base for the QA service.
2 |
3 | 1. It installs the required packages.
4 | 2. It pulls and starts the Elasticsearch Docker image.
5 | 3. It processes the data and loads it into Elasticsearch.
6 | 4. It optimizes the data storage.
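
Step 3 can be sketched in isolation as follows. This is a minimal illustration, assuming each row of the processed CSV has already been loaded as a plain dict; the helper name `to_haystack_dicts` and its `text_field` parameter are hypothetical and do not appear in `es_populate.py`. It splits every record into the `text` / `meta` layout that the document store expects, and it copies each record so the caller's data is left untouched.

```python
from typing import Any, Dict, List


def to_haystack_dicts(
    records: List[Dict[str, Any]], text_field: str = "body_text"
) -> List[Dict[str, Any]]:
    """Split each record into document text and metadata.

    Hypothetical helper: the field named by `text_field` becomes 'text',
    every other field is kept under 'meta'.
    """
    documents = []
    for record in records:
        meta = dict(record)          # shallow copy so the input is not mutated
        text = meta.pop(text_field)  # raises KeyError if the text field is missing
        documents.append({"text": text, "meta": meta})
    return documents
```

The resulting list could then be passed to `document_store.write_documents(...)`, as the populate script does with its own list.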
--------------------------------------------------------------------------------
/4-Knowledge Base/es_populate.py:
--------------------------------------------------------------------------------
1 |
2 | from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
3 | document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")
4 |
5 |
6 | # Let's first fetch some documents that we want to query
7 | import pandas as pd
8 | df = pd.read_csv('SAMPLE_COVID_QA_PROCESSED.csv')
9 |
10 | # Convert the DataFrame to a list of dicts, one per row (paper), keyed by column name.
11 | # No cleaning function is applied here; any text cleaning (e.g. removing footers)
12 | # has to happen before the CSV is written.
13 | dicts = df.to_dict('records')
14 |
15 | # We now have a list of dictionaries that we can write to our document store.
16 | # If your texts come from a different source (e.g. a DB), you can skip the CSV step and create the dictionaries yourself.
17 | # The format expected by the document store is:
18 | # {
19 | #     'text': "DOCUMENT_TEXT_HERE",
20 | #     'meta': {'name': "DOCUMENT_NAME_HERE", ...}
21 | # }
22 | # (Optionally, you can add more key-value pairs under 'meta'; they are indexed as fields in Elasticsearch and
23 | # can be used later for filtering or shown in the responses of the pipeline.)
24 |
25 | # Let's have a look at the first 3 entries:
26 | #print(dicts[:3])
27 |
28 | # Now, let's write the dicts containing documents to our DB.
29 |
30 |
31 | final_dicts = []
32 | for each in dicts:
33 |     tmp = {}
34 |     tmp['text'] = each.pop('body_text')  # body_text becomes the document text
35 |     tmp['meta'] = each  # the remaining columns become metadata
36 |     final_dicts.append(tmp)
37 |
38 | document_store.write_documents(final_dicts)
--------------------------------------------------------------------------------
/4-Knowledge Base/requirements.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yordanoswuletaw/covid19-qa-system/693929b1a3de68eae235e2b9a018f3c9422c76e4/4-Knowledge Base/requirements.docx
--------------------------------------------------------------------------------
/4-Knowledge Base/requirements.txt:
--------------------------------------------------------------------------------
1 | farm-haystack==0.9.0
2 | farm==0.8.0
--------------------------------------------------------------------------------
/4-Knowledge Base/setup.sh:
--------------------------------------------------------------------------------
1 | # Pull the Elasticsearch image from the Elastic Docker registry (docker.elastic.co)
2 | docker pull docker.elastic.co/elasticsearch/elasticsearch:7.14.2
3 | # Run Elasticsearch as a single-node cluster and publish container port 9200 on host port 9200
4 | docker run -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.14.2
--------------------------------------------------------------------------------
/5-QA Engine/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM python:3.8-slim
2 | EXPOSE 8080
3 | WORKDIR /qaapi
4 | COPY . .
5 | RUN pip3 install -r requirements.txt
6 | CMD uvicorn main:app --host 0.0.0.0 --port 8080
--------------------------------------------------------------------------------
/5-QA Engine/README.md:
--------------------------------------------------------------------------------
1 | This is the core component of the QA system.
2 |
3 | It receives the user request through its API endpoint, reads the data from the knowledge base, runs the QA pipeline, and sends the answer back to the user.
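
A client for this endpoint might look like the sketch below. It assumes the service is reachable at `http://localhost:8080/query` (port 8080 from the Dockerfile, route `/query` from `main.py`); the helper names `build_payload` and `ask`, the default values, and the timeout are illustrative choices, not part of this repository. The request body carries the three fields declared by the request model: `question`, `num_answers` and `num_docs`.

```python
import json
import urllib.request

# Assumption: the QA Engine container is running locally with port 8080 published.
QA_URL = "http://localhost:8080/query"


def build_payload(question: str, num_answers: int = 3, num_docs: int = 10) -> bytes:
    """Encode the JSON request body with the fields the endpoint expects."""
    body = {"question": question, "num_answers": num_answers, "num_docs": num_docs}
    return json.dumps(body).encode("utf-8")


def ask(question: str, url: str = QA_URL, timeout: float = 60.0) -> dict:
    """POST a question to the QA Engine and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=build_payload(question),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `ask("What are the symptoms of COVID-19?")` would return the JSON object produced by the endpoint, with the pipeline's prediction under the `answer` key. A generous timeout is used because the reader runs on CPU.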
--------------------------------------------------------------------------------
/5-QA Engine/main.py:
--------------------------------------------------------------------------------
1 | from fastapi import FastAPI
2 | from pydantic import BaseModel
3 |
4 | from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
5 | import os
6 | from haystack.pipeline import ExtractiveQAPipeline
7 | from haystack.reader.farm import FARMReader
8 | from haystack.retriever.sparse import ElasticsearchRetriever
9 |
10 | ELASTIC_SEARCH_HOST = os.environ.get('es_ip', 'localhost')
11 | ELASTIC_SEARCH_PORT = int(os.environ.get('es_port', 9200))  # environment values arrive as strings
12 |
13 | document_store = ElasticsearchDocumentStore(host=ELASTIC_SEARCH_HOST,
14 | port=ELASTIC_SEARCH_PORT,
15 | username="", password="",
16 | index="document")
17 |
18 | retriever = ElasticsearchRetriever(document_store=document_store)
19 | reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2-covid", use_gpu=False)
20 | pipe = ExtractiveQAPipeline(reader, retriever)
21 |
22 | app = FastAPI()
23 |
24 | class Queobj(BaseModel):
25 |     question: str
26 |     num_answers: int  # number of answers the reader should return
27 |     num_docs: int  # number of documents the retriever should fetch
28 |
29 |
30 | @app.post('/query')
31 | async def query(que_obj: Queobj):
32 |     question = que_obj.question
33 |     k_retriever = que_obj.num_docs
34 |     k_reader = que_obj.num_answers
35 |     prediction = pipe.run(query=question, top_k_retriever=k_retriever, top_k_reader=k_reader)
36 |     return {'answer': prediction}
--------------------------------------------------------------------------------
/5-QA Engine/requirements.txt:
--------------------------------------------------------------------------------
1 | fastapi==0.66.0
2 | farm==0.8.0
3 | farm-haystack==0.9.0
4 | uvicorn==0.14.0
5 | torch==1.8.1
6 | torchvision==0.9.1
7 | torchaudio==0.8.1
8 | pydantic==1.8.2
--------------------------------------------------------------------------------
/5-QA Engine/setup.sh:
--------------------------------------------------------------------------------
1 | version=`date "+%H-%M-%S_%d-%m-%y"`
2 | echo $version
3 | sudo docker build -t qa-api .
4 | sudo docker tag qa-api us.gcr.io/annular-mercury-318319/qa-api:$version
5 |
6 | sudo docker container prune -f
7 | sudo docker image prune -f
8 | sudo docker volume prune -f
9 | sudo docker network prune -f
10 |
11 | sudo docker push us.gcr.io/annular-mercury-318319/qa-api:$version
12 |
13 | # sudo docker run -p 85:8080 -e es_ip='host.docker.internal' -e es_port=9200 qa-api
--------------------------------------------------------------------------------
/6-Front End with Streamlit/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM python:3.8-slim
2 | EXPOSE 8080
3 | WORKDIR /SAM
4 | COPY ./requirements.txt ./requirements.txt
5 | RUN pip3 install -r requirements.txt
6 | COPY . .
7 | CMD streamlit run streamlit_qa.py --server.port 8080
--------------------------------------------------------------------------------
/6-Front End with Streamlit/README.md:
--------------------------------------------------------------------------------
1 | This is the UI component for QA.
2 |
3 | It gives the user an interface for trying out the QA demo.
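
The UI highlights each answer inside its surrounding context using the `context`, `offset_start`, and `offset_end` fields returned by the QA engine. A minimal sketch of that splitting step, assuming the offsets are character positions within `context`; the function name `highlight_answer` is hypothetical, while the `'ANS'` label and `'#faa'` colour mirror `streamlit_qa.py`.

```python
from typing import List, Tuple, Union

# A token is either plain text or a (text, label, colour) tuple,
# which is the input shape annotated_text accepts.
Token = Union[str, Tuple[str, str, str]]


def highlight_answer(context: str, start: int, end: int,
                     label: str = "ANS", color: str = "#faa") -> List[Token]:
    """Split context into text before, the highlighted answer, and text after."""
    if not 0 <= start <= end <= len(context):
        raise ValueError("offsets must satisfy 0 <= start <= end <= len(context)")
    return [context[:start], (context[start:end], label, color), context[end:]]
```

In the Streamlit app the result can be passed on as `annotated_text(*highlight_answer(context, start, end))`. Slicing the prefix up to `start` (not `start - 1`) keeps every character of the context.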
--------------------------------------------------------------------------------
/6-Front End with Streamlit/requirements.txt:
--------------------------------------------------------------------------------
1 | streamlit==0.84.0
2 | requests==2.25.1
3 | st-annotated-text==2.0.0
4 |
--------------------------------------------------------------------------------
/6-Front End with Streamlit/setup.sh:
--------------------------------------------------------------------------------
1 | version=`date "+%H-%M-%S_%d-%m-%y"`
2 | echo $version
3 |
4 | sudo docker container prune -f
5 | sudo docker image prune -f
6 | sudo docker volume prune -f
7 | sudo docker network prune -f
8 |
9 | sudo docker build -t qa-ui .
10 | sudo docker run -p 80:8080 qa-ui
--------------------------------------------------------------------------------
/6-Front End with Streamlit/streamlit_qa.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | import requests, json
3 | from annotated_text import annotated_text
4 | st.set_page_config(
5 | page_title="Naturalangue",
6 | page_icon=":shark:",
7 | layout="wide",
8 | initial_sidebar_state="expanded",
9 | )
10 |
11 | def remote_css(url):
12 |     st.markdown(f'<link href="{url}" rel="stylesheet">', unsafe_allow_html=True)
13 | def icon(icon_name):
14 |     st.markdown(f'', unsafe_allow_html=True)
15 |
16 | remote_css("https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css")
17 |
18 | st.markdown('Covid QA', unsafe_allow_html=True)
19 |
20 | st.sidebar.header("Options")
21 | top_k_reader = st.sidebar.slider("Max. number of answers", min_value=1, max_value=10, value=3, step=1)
22 | top_k_retriever = st.sidebar.slider("Max. number of documents from retriever", min_value=1, max_value=10, value=3, step=1)
23 |
24 |
25 | st.markdown('Question', unsafe_allow_html=True)
26 | question = st.text_input("Put your query", value="What are the symptoms of COVID")
27 | button = st.button('Get Answer')
28 | st.text("")
29 | st.text("")
30 |
31 | if button:
32 |     headers = {
33 |         'accept': 'application/json',
34 |         'Content-Type': 'application/json',
35 |     }
36 |     data = {
37 |         'question': question, 'num_answers': top_k_reader, 'num_docs': top_k_retriever
38 |     }
39 |     response = requests.post('http://host.docker.internal:85/query', headers=headers, data=json.dumps(data))
40 |     result = response.json()
41 |
42 |     print(result)
43 |     # Example response shape (values elided): {'answer': {'query': ..., 'no_ans_gap': ..., 'answers': [{'answer': ..., 'score': ..., 'probability': ..., 'context': ..., 'offset_start': ..., 'offset_end': ..., 'offset_start_in_doc': ..., 'offset_end_in_doc': ..., 'document_id': ..., 'meta': {'sha': ..., 'title': ..., 'publish_time': ..., 'authors': ..., 'url': ..., 'abstract': ...}}], 'node_id': 'Reader'}}
44 |
45 |     for each in result['answer']['answers']:
46 |         title = each['meta']['title']
47 |         url = each['meta']['url'].split(';')[0]
48 |         tokens = []
49 |         tokens.append(each['context'][:each['offset_start']])  # text before the answer span
50 |         tokens.append(
51 |             (each['context'][each['offset_start']:each['offset_end']], 'ANS', '#faa')
52 |         )
53 |         tokens.append(each['context'][each['offset_end']:])
54 |         col1,col2 = st.beta_columns([5,1])
55 |         col1.markdown(f'{title}', unsafe_allow_html=True)
56 |         col2.markdown(f'', unsafe_allow_html=True)
57 |         st.text("")
58 |         col1, col2 = st.beta_columns([2,4])
59 |         col1.markdown(f'Publish time: {each["meta"]["publish_time"]}\
60 | Authors: {each["meta"]["authors"]}', unsafe_allow_html=True)
61 |         annotated_text(*tokens)
62 |
63 |
--------------------------------------------------------------------------------
/7-Deployment/Dockerfile_QA:
--------------------------------------------------------------------------------
1 | FROM python:3.8-slim
2 | EXPOSE 8080
3 | WORKDIR /qaapi
4 | COPY . .
5 | RUN pip3 install -r requirements.txt
6 | CMD uvicorn main:app --host 0.0.0.0 --port 8080
--------------------------------------------------------------------------------
/7-Deployment/Dockerfile_UI:
--------------------------------------------------------------------------------
1 | FROM python:3.8-slim
2 | EXPOSE 8080
3 | WORKDIR /SAM
4 | COPY ./requirements.txt ./requirements.txt
5 | RUN pip3 install -r requirements.txt
6 | COPY . .
7 | CMD streamlit run streamlit_qa.py --server.port 8080
--------------------------------------------------------------------------------
/7-Deployment/setup.sh:
--------------------------------------------------------------------------------
1 | # Download the clean dataset
2 | wget https://storage.googleapis.com/haystack-qa/SAMPLE_COVID_QA_PROCESSED.csv
3 |
4 | # Pull the Elasticsearch image from the Elastic container registry (docker.elastic.co)
5 | docker pull docker.elastic.co/elasticsearch/elasticsearch:7.13.4
6 | # Run Elasticsearch as a single node and publish container port 9200 on host port 9200
7 | docker run -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.13.4
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # COVID-19 Q&A System
2 |
3 | ### About
4 | An automated COVID-19 question answering system built with Deep Learning and Natural Language Processing on more than 250,000 COVID-19 research papers and articles. It gives researchers, clinicians, and the public fast, user-friendly access to crucial information about COVID-19, making information retrieval more efficient and effective.
5 |
6 | ### Purpose
7 |
8 | 1. **Research Assistance:** Researchers can use this system to quickly access and extract relevant information from a vast pool of COVID-19 research. This can save them time and effort in finding the latest studies, statistics, and findings.
9 |
10 | 2. **Clinical Decision Support:** Healthcare professionals can use the system to help them make informed clinical decisions. The system can provide guidance based on published medical research and guidelines regarding COVID-19.
11 |
12 | 3. **Public Awareness:** The system is a valuable tool for the general public. It allows people to access reliable and easy-to-understand information about COVID-19. This can help people to learn more about the disease, take preventive measures, and make informed decisions.
13 |
14 | 4. **Educational Resource:** Educational institutions and instructors can use the system to provide students with accurate and timely information about the COVID-19 pandemic. This can help students to learn more about the disease and its impact.
15 |
16 | 5. **Policy Development:** Government agencies and policymakers can use the system to stay informed about COVID-19.
17 |
--------------------------------------------------------------------------------