├── 1-Accessing Dataset
    ├── CORD-19(Data dictionary).docx
    ├── codebook.ipynb
    └── metadata_sample.pkl
├── 2-Data Pre-Processing
    └── covid_qa_data_preprocessing.ipynb
├── 3-Exploratory Data Analysis
    └── covid_aq_eda.ipynb
├── 4-Knowledge Base
    ├── README.md
    ├── es_populate.py
    ├── requirements.docx
    ├── requirements.txt
    └── setup.sh
├── 5-QA Engine
    ├── Dockerfile
    ├── README.md
    ├── main.py
    ├── requirements.txt
    └── setup.sh
├── 6-Front End with Streamlit
    ├── Dockerfile
    ├── Dockerfile(1)
    ├── README(1).md
    ├── README.md
    ├── requirements(1).txt
    ├── requirements.txt
    ├── setup(1).sh
    ├── setup.sh
    ├── streamlit_qa(1).py
    └── streamlit_qa.py
├── 7-Deployment
    ├── Dockerfile_QA
    ├── Dockerfile_UI
    └── setup.sh
└── README.md


/1-Accessing Dataset/CORD-19(Data dictionary).docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yordanoswuletaw/covid19-qa-system/693929b1a3de68eae235e2b9a018f3c9422c76e4/1-Accessing Dataset/CORD-19(Data dictionary).docx


--------------------------------------------------------------------------------
/1-Accessing Dataset/codebook.ipynb:
--------------------------------------------------------------------------------
{"cells":[{"cell_type":"markdown","id":"a6b53b6b","metadata":{"id":"a6b53b6b"},"source":["# Getting Started"]},{"cell_type":"code","source":["# connecting colab with google drive\n","from google.colab import drive\n","drive.mount('/content/drive')"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"OI6fyf4c_ykn","executionInfo":{"status":"ok","timestamp":1697893070199,"user_tz":-180,"elapsed":3653,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}},"outputId":"4ed016db-e9a2-440d-d4d4-03880363e7e5"},"id":"OI6fyf4c_ykn","execution_count":6,"outputs":[{"output_type":"stream","name":"stdout","text":["Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\",
force_remount=True).\n"]}]},{"cell_type":"code","source":["%cd drive/MyDrive/Covid\\ Q\\&A\\ System/"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"I3i_JlghUE3-","executionInfo":{"status":"ok","timestamp":1697893145199,"user_tz":-180,"elapsed":694,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}},"outputId":"1cf9af69-037f-4aca-f077-66f4bae56f29"},"id":"I3i_JlghUE3-","execution_count":8,"outputs":[{"output_type":"stream","name":"stdout","text":["total 8.0K\n","drwx------ 5 root root 4.0K Oct 21 11:31 drive\n","drwxr-xr-x 1 root root 4.0K Oct 19 16:36 sample_data\n"]}]},{"cell_type":"markdown","id":"e3b260e6","metadata":{"id":"e3b260e6"},"source":["### Installing and importing Libraries"]},{"cell_type":"code","execution_count":null,"id":"771b0ff9","metadata":{"id":"771b0ff9","outputId":"7f84d4ac-a036-401a-ce6a-2749eb270d5b","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1697887955874,"user_tz":-180,"elapsed":6085,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["Collecting tqdm==4.62.0\n"," Downloading tqdm-4.62.0-py2.py3-none-any.whl (76 kB)\n","\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/76.0 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━\u001b[0m \u001b[32m71.7/76.0 kB\u001b[0m \u001b[31m2.4 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m\r\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m76.0/76.0 kB\u001b[0m \u001b[31m2.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hInstalling collected packages: tqdm\n"," Attempting uninstall: tqdm\n"," Found existing installation: tqdm 4.66.1\n"," Uninstalling tqdm-4.66.1:\n"," Successfully uninstalled tqdm-4.66.1\n","Successfully installed 
tqdm-4.62.0\n"]}],"source":["!pip3 install tqdm==4.62.0"]},{"cell_type":"code","execution_count":null,"id":"a7141ff0","metadata":{"id":"a7141ff0"},"outputs":[],"source":["import time\n","import json\n","import glob\n","import numpy as np\n","import pandas as pd\n","from tqdm import tqdm"]},{"cell_type":"markdown","id":"e9497679","metadata":{"id":"e9497679"},"source":["### Downloading the covid dataset"]},{"cell_type":"code","execution_count":null,"id":"307bc53e","metadata":{"id":"307bc53e"},"outputs":[],"source":["#downloading dataset...\n","!wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2021-11-15.tar.gz"]},{"cell_type":"code","execution_count":null,"id":"75e7b7fa","metadata":{"id":"75e7b7fa"},"outputs":[],"source":["#unzipping the dataset to google drive\n","!tar -xzf cord-19_2021-11-15.tar.gz -C /content/drive/MyDrive/Covid\\ Q\\&A\\ System\n","#unzipping the pdf and pmc files\n","%cd drive/MyDrive/Covid\\ Q\\&A\\ System/2021-11-15/\n","!tar xzf document_parses.tar.gz"]},{"cell_type":"markdown","id":"afcd56b3","metadata":{"id":"afcd56b3"},"source":["### Reading Dataset"]},{"cell_type":"code","execution_count":null,"id":"49050c24","metadata":{"id":"49050c24"},"outputs":[],"source":["%cd drive/MyDrive/Covid\\ Q\\&A\\ System/2021-11-15/\n","#metadata\n","metadata = pd.read_csv('metadata.csv', dtype={'pubmed_id': str, 'title': str, 'abstract': str})\n","metadata.head()"]},{"cell_type":"code","execution_count":null,"id":"57ca0abe","metadata":{"id":"57ca0abe"},"outputs":[],"source":["# Fetching Research Papers from PDF and PMC Json folder\n","pdf_json = glob.glob('document_parses/pdf_json/*.json', recursive=True)\n","pmc_json = glob.glob('document_parses/pmc_json/*.json', recursive=True)"]},{"cell_type":"code","execution_count":null,"id":"ee135541","metadata":{"id":"ee135541"},"outputs":[],"source":["print('PDF:', len(pdf_json), 'PMC:', 
len(pmc_json))"]},{"cell_type":"code","execution_count":null,"id":"e1bfd9ee","metadata":{"id":"e1bfd9ee"},"outputs":[],"source":["# FileReader class: extracts the id, abstract and body text from a research paper\n","class FileReader:\n","    def __init__(self, file_path):\n","        with open(file_path) as f:\n","            content = json.load(f)\n","            self.paper_id = content['paper_id']\n","            self.abstract = '$$'.join([each['text'] for each in content.get('abstract', [])])\n","            self.body_text = '$$'.join([each['text'] for each in content.get('body_text', [])])\n","\n","    def __repr__(self):\n","        return f'{self.paper_id}\\tabstract: {self.abstract[:200]}\\tbody_text: {self.body_text}'"]},{"cell_type":"code","execution_count":null,"id":"d2539a02","metadata":{"id":"d2539a02"},"outputs":[],"source":["# A sample research paper from pdf json folder\n","pdf_file = FileReader(pdf_json[0])\n","print(pdf_file)"]},{"cell_type":"code","execution_count":null,"id":"e66e4bd7","metadata":{"id":"e66e4bd7","outputId":"146b7762-1063-4d7f-af4c-7b8213f6432c"},"outputs":[{"name":"stderr","output_type":"stream","text":["247236it [05:05, 807.98it/s] "]},{"name":"stdout","output_type":"stream","text":["306.0527939796448\n"]},{"name":"stderr","output_type":"stream","text":["\n"]}],"source":["#Create a dictionary of all research papers from pdf json\n","pdf_dict = {'paper_id': [], 'abstract': [], 'body_text': []}\n","t1 = time.time()\n","for idx, record in tqdm(enumerate(pdf_json)):\n","    content = FileReader(record)\n","    pdf_dict['paper_id'].append(content.paper_id)\n","    pdf_dict['abstract'].append(content.abstract)\n","    pdf_dict['body_text'].append(content.body_text)\n","print(time.time() - t1)"]},{"cell_type":"code","execution_count":null,"id":"50301fe1","metadata":{"id":"50301fe1","outputId":"9e3408a5-ea50-4f2a-8fb6-67480cdd13d1"},"outputs":[{"data":{"text/html":["
"],"text/plain":[" paper_id \\\n","0 206be0740f4d299003d4e09cd6f9a32e6e351130 \n","1 32356c8de8fcec7a46bc60793b557964e4e87f37 \n","2 e5447bc137727b3721de2313755d89b932e1eecc \n","3 8f1e56dded7f860a33ad291c06c773653270ee52 \n","4 c84b2484293b3aa59ec8aaecc7eadb93b2294dd7 \n","\n"," abstract \\\n","0 Heparanase (HPSE) is a multifunctional protein... \n","1 Objective The COVID-19 pandemic is currently o... \n","2 As the world navigates the COVID-19 health cri... \n","3 The sudden outbreak of coronavirus disease 201... \n","4 Spinal cord stimulation may enable recovery of... \n","\n"," body_text \n","0 Heparanase (HPSE) is an endo-β-d-endoglycosida... \n","1 World Health Organization (WHO) declared the o... \n","2 With the rise in positive COVID-19 cases and t... \n","3 In 2020, a new type of coronavirus, named coro... \n","4 Spinal cord injury (SCI) is a life-long condit... "]},"execution_count":24,"metadata":{},"output_type":"execute_result"}],"source":["#Creating a dataframe of all research papers from pdf json\n","pdf_df = pd.DataFrame(pdf_dict, columns=['paper_id', 'abstract', 'body_text'])\n","pdf_df.head()"]},{"cell_type":"code","execution_count":null,"id":"2199f9a4","metadata":{"id":"2199f9a4","outputId":"f642e0c3-a9bc-4177-f6a9-789ffacc0473"},"outputs":[{"name":"stderr","output_type":"stream","text":["189611it [04:54, 644.91it/s]"]},{"name":"stdout","output_type":"stream","text":["294.0136697292328\n"]},{"name":"stderr","output_type":"stream","text":["\n"]}],"source":["#Create a dictionary of all research papers from pmc json\n","pmc_dict = {'paper_id': [], 'body_text': []}\n","t1 = time.time()\n","for idx, record in tqdm(enumerate(pmc_json)):\n","    content = FileReader(record)\n","    pmc_dict['paper_id'].append(content.paper_id)\n","    pmc_dict['body_text'].append(content.body_text)\n","print(time.time() -
t1)"]},{"cell_type":"code","execution_count":null,"id":"881f7057","metadata":{"id":"881f7057","outputId":"9fe448cd-349e-4be7-87f4-1936b6abf9e5"},"outputs":[{"data":{"text/html":["
"],"text/plain":[" paper_id body_text\n","0 PMC7550677 Previous research suggested that emotional str...\n","1 PMC7297029 First, areas with severe outbreaks have genera...\n","2 PMC8207685 Environment‐related illnesses, such as indoor ...\n","3 PMC7926729 Materials, useful matter, are used extensively...\n","4 PMC7471855 \\n[4]\\n$$$No"]},"execution_count":26,"metadata":{},"output_type":"execute_result"}],"source":["#Creating a dataframe of all research papers from pmc json\n","pmc_df = pd.DataFrame(pmc_dict, columns=['paper_id', 'body_text'])\n","pmc_df.head()"]},{"cell_type":"markdown","id":"5801f264","metadata":{"id":"5801f264"},"source":["### Sampling & Saving Course Dataset"]},{"cell_type":"raw","id":"9aeb860c","metadata":{"id":"9aeb860c"},"source":["Now we have 3 dataframes:\n","\n","\n","1. metadata - contains all the meta info of the available research papers\n","2. pdf_df - contains the paper_id, abstract and body of all pdf research papers available\n","3. pmc_df - contains the paper_id and body of all the pmc research papers available"]},{"cell_type":"code","source":["%cd ..\n","%cd 1-Accessing\\ Dataset"],"metadata":{"id":"FtWkLAcZVcFs","executionInfo":{"status":"ok","timestamp":1697893475560,"user_tz":-180,"elapsed":422,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}}},"id":"FtWkLAcZVcFs","execution_count":9,"outputs":[]},{"cell_type":"code","execution_count":null,"id":"d1e5bdd2","metadata":{"id":"d1e5bdd2"},"outputs":[],"source":["#drop rows from the metadata where the corresponding research text doesn't exist in pdf json\n","metadata.dropna(subset=['pdf_json_files'], inplace=True)"]},{"cell_type":"code","execution_count":null,"id":"ebea6546","metadata":{"id":"ebea6546","outputId":"2ffdbdcd-0480-4533-b125-3f387604725e"},"outputs":[{"data":{"text/html":["
"],"text/plain":[" cord_uid sha source_x \\\n","0 ug7v899j d1aafb70c066a2068b02786f8929fd9c900897fb PMC \n","1 02tnwd4m 6b0567729c2143a66d737eb0a2f63f2dce2e5a7d PMC \n","2 ejv2xln0 06ced00a5fc04215949aa72528f2eeaae1d58927 PMC \n","\n"," title doi \\\n","0 Clinical features of culture-proven Mycoplasma... 10.1186/1471-2334-1-6 \n","1 Nitric oxide: a pro-inflammatory mediator in l... 10.1186/rr14 \n","2 Surfactant protein-D and pulmonary host defense 10.1186/rr19 \n","\n"," pmcid pubmed_id license \\\n","0 PMC35282 11472636 no-cc \n","1 PMC59543 11667967 no-cc \n","2 PMC59549 11667972 no-cc \n","\n"," abstract publish_time \\\n","0 OBJECTIVE: This retrospective chart review des... 2001-07-04 \n","1 Inflammatory diseases of the respiratory tract... 2000-08-15 \n","2 Surfactant protein-D (SP-D) participates in th... 2000-08-25 \n","\n"," authors journal mag_id \\\n","0 Madani, Tariq A; Al-Ghamdi, Aisha A BMC Infect Dis NaN \n","1 Vliet, Albert van der; Eiserich, Jason P; Cros... Respir Res NaN \n","2 Crouch, Erika C Respir Res NaN \n","\n"," who_covidence_id arxiv_id \\\n","0 NaN NaN \n","1 NaN NaN \n","2 NaN NaN \n","\n"," pdf_json_files \\\n","0 document_parses/pdf_json/d1aafb70c066a2068b027... \n","1 document_parses/pdf_json/6b0567729c2143a66d737... \n","2 document_parses/pdf_json/06ced00a5fc04215949aa... \n","\n"," pmc_json_files \\\n","0 document_parses/pmc_json/PMC35282.xml.json \n","1 document_parses/pmc_json/PMC59543.xml.json \n","2 document_parses/pmc_json/PMC59549.xml.json \n","\n"," url s2_id \n","0 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3... NaN \n","1 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5... NaN \n","2 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5... 
NaN "]},"execution_count":32,"metadata":{},"output_type":"execute_result"}],"source":["metadata.head(3)"]},{"cell_type":"code","execution_count":null,"id":"dc5e43e6","metadata":{"id":"dc5e43e6"},"outputs":[],"source":["#Taking 25000 randomly sampled records from metadata\n","sub_metadata = metadata.sample(25000)"]},{"cell_type":"code","execution_count":null,"id":"d000e2f2","metadata":{"id":"d000e2f2"},"outputs":[],"source":["#Subset both the pdf and pmc research paper tables to the papers in the sampled metadata table\n","sub_pdf_df = pdf_df[pdf_df['paper_id'].isin(sub_metadata['sha'])]\n","sub_pmc_df = pmc_df[pmc_df['paper_id'].isin(sub_metadata['pmcid'])]"]},{"cell_type":"code","execution_count":null,"id":"13991f84","metadata":{"id":"13991f84"},"outputs":[],"source":["#storing the sample data (.pkl names match what the pre-processing notebook loads)\n","sub_metadata.to_pickle('metadata_sample.pkl')\n","sub_pdf_df.to_pickle('json_pdf_sample.pkl')\n","sub_pmc_df.to_pickle('json_pmc_sample.pkl')"]}],"metadata":{"colab":{"provenance":[]},"kernelspec":{"display_name":"Python 3 (ipykernel)","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.10"}},"nbformat":4,"nbformat_minor":5}


--------------------------------------------------------------------------------
/1-Accessing Dataset/metadata_sample.pkl:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/yordanoswuletaw/covid19-qa-system/693929b1a3de68eae235e2b9a018f3c9422c76e4/1-Accessing Dataset/metadata_sample.pkl


--------------------------------------------------------------------------------
/2-Data Pre-Processing/covid_qa_data_preprocessing.ipynb:
--------------------------------------------------------------------------------
{"cells":[{"cell_type":"code","source":["# connecting colab with google drive\n","from google.colab import drive\n","drive.mount('/content/drive')"],"metadata":{"id":"xhZ0fcuKSVKe","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1697893278915,"user_tz":-180,"elapsed":40886,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}},"outputId":"9f2fe741-76f5-4b6b-c64b-5fd88a408bef"},"execution_count":11,"outputs":[{"output_type":"stream","name":"stdout","text":["Mounted at /content/drive\n"]}]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"taKuWoSOoXPs","outputId":"b7fdf869-f25a-419e-c6c2-6dc2407c09f4","executionInfo":{"status":"ok","timestamp":1697890838252,"user_tz":-180,"elapsed":17,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}}},"outputs":[{"output_type":"stream","name":"stdout","text":["Python 3.10.12\n"]}],"source":["#check python version\n","!python3 --version"]},{"cell_type":"markdown","source":["### Importing Libraries"],"metadata":{"id":"nElmnapAL5lD"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"FwXMHbcGoKit"},"outputs":[],"source":["#for python < 3.8\n","# !pip3 install pickle5\n","# import pickle5 as pickle"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"XO3nYJ8uz-22"},"outputs":[],"source":["import pickle\n","import numpy as np\n","import pandas as pd"]},{"cell_type":"markdown","metadata":{"id":"2RDPTgYZ0n-9"},"source":["**Accessing saved sample research papers and data**"]},{"cell_type":"code","source":["%cd drive/MyDrive/Covid\\ Q\\&A\\ System/1-Accessing\\ Dataset/"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"R5ncZDA3WtgF","executionInfo":{"status":"ok","timestamp":1697893875035,"user_tz":-180,"elapsed":724,"user":{"displayName":"Yordanos
Wuletaw","userId":"06452499347519919653"}},"outputId":"7bf79107-e990-4051-efb6-77c1e129a6be"},"execution_count":47,"outputs":[{"output_type":"stream","name":"stdout","text":["/content/drive/MyDrive/Covid Q&A System/1-Accessing Dataset\n"]}]},{"cell_type":"code","execution_count":51,"metadata":{"id":"AwC1R5tHzz47","executionInfo":{"status":"ok","timestamp":1697893941921,"user_tz":-180,"elapsed":792,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}}},"outputs":[],"source":["with open('metadata_sample.pkl', 'rb') as f:\n","    df_metadata = pickle.load(f)\n","with open('json_pdf_sample.pkl', 'rb') as f:\n","    df_pdf = pickle.load(f)\n","with open('json_pmc_sample.pkl', 'rb') as f:\n","    df_pmc = pickle.load(f)"]},{"cell_type":"code","execution_count":52,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":521},"id":"fNKMWxKG0IMk","outputId":"eadd4404-35ed-43bc-f5ab-8468bbf29809","executionInfo":{"status":"ok","timestamp":1697894074993,"user_tz":-180,"elapsed":580,"user":{"displayName":"Yordanos Wuletaw","userId":"06452499347519919653"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" cord_uid sha \\\n","600880 n441sprx feedfe27a4eee49d8a1d09f50e8ecfe73057602a \n","721427 95libakm c2ca01f12643a88e059e81619fb971fa61de3971 \n","701284 bjl0bjg2 d0d1e05bb14f068d8323807ba042860887b7aa00 \n","603427 7ch1yln4 fb8c3a315b4a345dafc7e59be000e21bf965dea8 \n","771175 zegp4jq4 ed4c180c272b047c6db958508fb9a8edd40cb1cb \n","\n"," source_x \\\n","600880 Medline; PMC; WHO \n","721427 Elsevier; Medline; PMC \n","701284 Medline; PMC \n","603427 Medline; PMC \n","771175 Elsevier; Medline; PMC \n","\n"," title \\\n","600880 Experiences and effects of telerehabilitation ... \n","721427 SCOAT-Net: A novel network for segmenting COVI... \n","701284 A Comprehensive Review on Factors Influences B... \n","603427 Opposition to cannabis legalization on public ...
\n","771175 An epigenetic signature to fight COVID-19 \n","\n"," doi pmcid pubmed_id license \\\n","600880 10.4102/sajp.v77i1.1528 PMC8252170 34230898.0 cc-by \n","721427 10.1016/j.patcog.2021.108109 PMC8189738 34127870.0 no-cc \n","701284 10.2147/ijn.s291956 PMC7898217 33628021.0 cc-by-nc \n","603427 10.1016/j.lanwpc.2021.100142 PMC8315435 34327442.0 cc-by-nc-nd \n","771175 10.1016/j.ebiom.2021.103385 PMC8116817 33993054.0 no-cc \n","\n"," abstract publish_time \\\n","600880 BACKGROUND: The announcement of a national loc... 2021-06-30 \n","721427 Automatic segmentation of lung opacification f... 2021-06-10 \n","701284 Exosomes are nanoscale-sized membrane vesicles... 2021-02-17 \n","603427 NaN 2021-04-13 \n","771175 NaN 2021-05-13 \n","\n"," authors \\\n","600880 Ebrahim, Humairaa; Pillay-Jayaraman, Prithi; L... \n","721427 Zhao, Shixuan; Li, Zhidan; Chen, Yang; Zhao, W... \n","701284 Gurunathan, Sangiliyandi; Kang, Min-Hee; Kim, ... \n","603427 Smyth, Bobby P; Christie, Grant IG \n","771175 Herbein, Georges \n","\n"," journal mag_id who_covidence_id arxiv_id \\\n","600880 S Afr J Physiother NaN NaN NaN \n","721427 Pattern Recognit NaN NaN NaN \n","701284 Int J Nanomedicine NaN NaN NaN \n","603427 Lancet Reg Health West Pac NaN NaN NaN \n","771175 EBioMedicine NaN NaN NaN \n","\n"," pdf_json_files \\\n","600880 document_parses/pdf_json/feedfe27a4eee49d8a1d0... \n","721427 document_parses/pdf_json/c2ca01f12643a88e059e8... \n","701284 document_parses/pdf_json/d0d1e05bb14f068d83238... \n","603427 document_parses/pdf_json/fb8c3a315b4a345dafc7e... \n","771175 document_parses/pdf_json/ed4c180c272b047c6db95... 
\n","\n"," pmc_json_files \\\n","600880 document_parses/pmc_json/PMC8252170.xml.json \n","721427 document_parses/pmc_json/PMC8189738.xml.json \n","701284 document_parses/pmc_json/PMC7898217.xml.json \n","603427 document_parses/pmc_json/PMC8315435.xml.json \n","771175 document_parses/pmc_json/PMC8116817.xml.json \n","\n"," url s2_id \n","600880 https://www.ncbi.nlm.nih.gov/pubmed/34230898/;... 235720984.0 \n","721427 https://www.ncbi.nlm.nih.gov/pubmed/34127870/;... 235382189.0 \n","701284 https://doi.org/10.2147/ijn.s291956; https://w... 232016676.0 \n","603427 https://doi.org/10.1016/j.lanwpc.2021.100142; ... 234824471.0 \n","771175 https://www.ncbi.nlm.nih.gov/pubmed/33993054/;... 234487502.0 "],"text/html":["\n","
\n"]},"metadata":{},"execution_count":52}],"source":["df_metadata.head()"]},{"cell_type":"markdown","metadata":{"id":"1dMkwvFBpT8Y"},"source":["**Data Merging (merging the data from metadata, json pdf and json pmc files for research papers)**"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"hI3Nm5iB1auL"},"outputs":[],"source":["#Merging the metadata and json pdf data\n","df_merged = pd.merge(df_metadata,df_pdf,left_on='sha',right_on='paper_id',how='left').drop('paper_id',axis=1)"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":267},"id":"RAjNPccl1iDh","outputId":"e2f13d3a-c47d-4cc9-9880-734a0235c671"},"outputs":[{"data":{"text/html":["
| | cord_uid | sha | source_x | title | doi | pmcid | pubmed_id | license | abstract_x | publish_time | authors | journal | mag_id | who_covidence_id | arxiv_id | pdf_json_files | pmc_json_files | url | s2_id | abstract_y | body_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bccth3yz | 41ed6b3014e89604b351c095571c45056ba12c37 | Medline; PMC | A phantom study to optimise the automatic tube... | 10.1186/s41747-021-00218-0 | PMC8159722 | 34046737 | cc-by | On March 11, 2020, the World Health Organizati... | 2021-05-28 | Gombolevskiy, Victor; Morozov, Sergey; Chernin... | Eur Radiol Exp | NaN | NaN | NaN | document_parses/pdf_json/41ed6b3014e89604b351c... | document_parses/pmc_json/PMC8159722.xml.json | https://doi.org/10.1186/s41747-021-00218-0; ht... | 235221111.0 | On March 11, 2020, the World Health Organizati... | On March 11, 2020, the World Health Organizati... |
| 1 | 3udtsvga | 57221999055bdc0b1c4bb16c790fdebdac1f7ce7 | Elsevier; Medline; PMC | Tunicamycin, an anticancer drug and inhibitor ... | 10.1016/j.micpath.2020.104586 | PMC7573633 | 33091582 | no-cc | SARS-CoV-2 outbreaks remains a medical and eco... | 2020-10-20 | Dawood, Ali Adel; Altobje, Mahmood Abduljabar | Microb Pathog | NaN | NaN | NaN | document_parses/pdf_json/57221999055bdc0b1c4bb... | document_parses/pmc_json/PMC7573633.xml.json | https://doi.org/10.1016/j.micpath.2020.104586;... | 224769427.0 | SARS-CoV-2 outbreaks remains a medical and eco... | Since last December, a new coronavirus has bee... |
"],"text/plain":[" cord_uid ... body_text\n","0 bccth3yz ... On March 11, 2020, the World Health Organizati...\n","1 3udtsvga ... Since last December, a new coronavirus has bee...\n","\n","[2 rows x 21 columns]"]},"execution_count":15,"metadata":{},"output_type":"execute_result"}],"source":["df_merged.head(2)\n","#abstract_x is the abstract from metadata and abstract_y is the abstract from json_pdf"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"gxzG2kDI1l5V","outputId":"e56835c6-be08-4c87-9c7c-bfd347b16cd4"},"outputs":[{"data":{"text/plain":["(25000, 22)"]},"execution_count":16,"metadata":{},"output_type":"execute_result"}],"source":["#Let's merge the json_pmc data into the merged data too\n","df_merged = pd.merge(df_merged,df_pmc,left_on='pmcid',right_on='paper_id',how='left').drop('paper_id',axis=1)\n","df_merged.shape"]},{"cell_type":"markdown","metadata":{"id":"F3QtxcWFqrLd"},"source":["**Data Cleaning and Preprocessing**"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"LC50Eff91svG","outputId":"4b9165c9-a87c-40f3-a6ba-d0c8abc32b80"},"outputs":[{"data":{"text/plain":["(22649, 22)"]},"execution_count":17,"metadata":{},"output_type":"execute_result"}],"source":["df_merged[df_merged.abstract_x != df_merged.abstract_y].shape"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"8ay1bvRG1u1U","outputId":"766cbd10-c600-4264-be85-f2290cc52900"},"outputs":[{"data":{"text/plain":["(3738, 0)"]},"execution_count":18,"metadata":{},"output_type":"execute_result"}],"source":["# check metadata abstract column to see if null values exist\n","df_merged.abstract_x.isnull().sum(),(df_merged.abstract_x ==
'').sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"8KPvZLCG1xq4","outputId":"c22b21a7-3718-411b-b1c2-a137b8dd01ce"},"outputs":[{"data":{"text/plain":["(1461, 6937)"]},"execution_count":19,"metadata":{},"output_type":"execute_result"}],"source":["# Check the pdf_json abstract column for null values and empty strings\n","df_merged.abstract_y.isnull().sum(),(df_merged.abstract_y == '').sum()"]},{"cell_type":"markdown","metadata":{"id":"q76-CsdO19Pd"},"source":["Since abstract_x (from the metadata) is more reliable, we keep it and fall back to the abstract_y text only where abstract_x is null.\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"b_78rcIn1zUc"},"outputs":[],"source":["# Convert abstract_y to string (NaN becomes 'nan'), then mark every value of\n","# 50 characters or fewer, including empty strings and 'nan', as missing ('na')\n","df_merged[\"abstract_y\"] = df_merged[\"abstract_y\"].astype(str)\n","df_merged['abstract_y'] = np.where(df_merged['abstract_y'].map(len) > 50, df_merged['abstract_y'], \"na\")"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":1000},"id":"BzTGcVLU2ApK","outputId":"636104a5-a74e-4bb8-a73e-8b669f9e7195"},"outputs":[{"data":{"text/html":["
cord_uidshasource_xtitledoipmcidpubmed_idlicenseabstract_xpublish_timeauthorsjournalmag_idwho_covidence_idarxiv_idpdf_json_filespmc_json_filesurls2_idabstract_ybody_text_xbody_text_y
2va3ov8aj7fd64a56bf3a761da96e6aad00789413813599caElsevier; Medline; PMCAsymptomatic SARS-CoV-2 infection10.1016/s1473-3099(20)30460-6PMC729257832539989no-ccNaN2020-06-12Ooi, Eng Eong; Low, Jenny GLancet Infect DisNaNNaNNaNdocument_parses/pdf_json/7fd64a56bf3a761da96e6...document_parses/pmc_json/PMC7292578.xml.jsonhttps://doi.org/10.1016/s1473-3099(20)30460-6;...219603626.0naThe pandemic spread of severe acute respirator...The pandemic spread of severe acute respirator...
7fpbz7143f54ff316cc17e9bc6047fdda2ba37dba7f0b3d72Medline; PMCPression‐induced facial ulcers by prone positi...10.1111/dth.13748PMC730092232495445no-ccNaN2020-06-03Ramondetta, Alice; Ribero, Simone; Costi, Soni...Dermatol TherNaNNaNNaNdocument_parses/pdf_json/f54ff316cc17e9bc6047f...NaNhttps://doi.org/10.1111/dth.13748; https://www...219313107.0naDear Editor, Severe acute respiratory syndrome...NaN
10mbs0mddgb4466c4bfc22fad10131dc138563ddf36712ca4f; 1b79...Medline; PMCImpact of the COVID-19 pandemic on the core fu...10.1136/bmjopen-2020-039674PMC730627232554730cc-by-ncOBJECTIVES: The current COVID-19 pandemic, as ...2020-06-17Verhoeven, Veronique; Tsakitzidis, Giannoula; ...BMJ OpenNaNNaNNaNdocument_parses/pdf_json/b4466c4bfc22fad10131d...document_parses/pmc_json/PMC7306272.xml.jsonhttps://www.ncbi.nlm.nih.gov/pubmed/32554730/;...219926577.0naNaNThe current COVID-19 pandemic puts a previousl...
113vvp6gch2b5f5305fdb732f7323187062c881cf21c328837Elsevier; PMCDetection and screening of COVID-19 through ch...10.1016/b978-0-12-824536-1.00039-3PMC8137981NaNno-ccDecember 2019 ended with a deadly virus outbre...2021-05-21Munir, Khushboo; Elahi, Hassan; Farooq, Muhamm...Data Science for COVID-19NaNNaNNaNdocument_parses/pdf_json/2b5f5305fdb732f732318...document_parses/pmc_json/PMC8137981.xml.jsonhttps://www.sciencedirect.com/science/article/...234794717.0naOur World has witnessed three most deadly outb...Our World has witnessed three most deadly outb...
18xrzygcr3cce7c2d31a3ba14d0e868a2c688555f681dceb2fPMCLost Trust: Socio-biological Hazard—From AIDS ...10.1007/978-4-431-55924-5_5PMC7123902NaNno-ccIatrogenic HIV infection refers here to cases ...2016-03-28Atsuji, ShigeoUnsafetyNaNNaNNaNdocument_parses/pdf_json/cce7c2d31a3ba14d0e868...document_parses/pmc_json/PMC7123902.xml.jsonhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...NaNnafrom a 'socio-biological perspective' at the p...It is now over 30 years since the first AIDS c...
.....................................................................
249880u3e1493379fea33b0ffd1ea5b743ebf9c05ca7cd4498531Elsevier; Medline; PMCResearch and control of parasitic diseases in ...10.1016/j.pt.2007.02.011PMC710640917350339no-ccBetween 1950 and 1980, Japan eliminated severa...2007-03-09Kasai, Takeshi; Nakatani, Hiroki; Takeuchi, Ts...Trends ParasitolNaNNaNNaNdocument_parses/pdf_json/379fea33b0ffd1ea5b743...document_parses/pmc_json/PMC7106409.xml.jsonhttps://www.sciencedirect.com/science/article/...42815596.0naBetween 1950 and 1980, Japan eliminated severa...Japan occupies a unique position with respect ...
24994euh55i564d475f635e1e2ecee857165e5e98b6bc4f13700bElsevier; Medline; PMCAvoiding Aerosol Generation During Tracheostom...10.1016/j.jamcollsurg.2020.08.730PMC748605032928626no-ccNaN2020-09-11Kapoor, Indu; Prabhakar, Hemanshu; Mahajan, CharuJ Am Coll SurgNaNNaNNaNdocument_parses/pdf_json/4d475f635e1e2ecee8571...document_parses/pmc_json/PMC7486050.xml.jsonhttps://www.sciencedirect.com/science/article/...221616263.0naWe read with interest the article by Foster an...We read with interest the article by Foster an...
24996bhmzv574dfda8e005370408f6e1a9eb9b4dadc181f45dfccMedline; PMCSARS-CoV-2 within-host diversity and transmission10.1126/science.abg0821PMC812829333688063cc-byExtensive global sampling and sequencing of th...2021-04-16Lythgoe, Katrina A.; Hall, Matthew; Ferretti, ...ScienceNaNNaNNaNdocument_parses/pdf_json/dfda8e005370408f6e1a9...document_parses/pmc_json/PMC8128293.xml.jsonhttps://www.ncbi.nlm.nih.gov/pubmed/33688063/;...232169666.0naINTRODUCTION: Genome sequencing at an unpreced...Reliable estimation of variant frequencies req...
249983kgxnyxvfff0a264abb537842ffceba9933f2f6f6cca19b4Elsevier; Medline; PMCDisease surveillance for the COVID-19 era: tim...10.1016/s0140-6736(21)01096-5PMC812149334000258no-ccNaN2021-05-14Morgan, Oliver W; Aguilera, Ximena; Ammon, And...LancetNaNNaNNaNdocument_parses/pdf_json/fff0a264abb537842ffce...document_parses/pmc_json/PMC8121493.xml.jsonhttps://www.ncbi.nlm.nih.gov/pubmed/34000258/;...234498002.0naThe COVID-19 pandemic has exposed weaknesses i...The COVID-19 pandemic has exposed weaknesses i...
24999kq7mjq6d8e5a6881a733deb1800fbb6edc37f4dc12501acfMedline; PMCComprehensive Profiling of Inflammatory Factor...10.3389/fimmu.2021.662465PMC832043334335566cc-byTo systematically explore potential biomarkers...2021-07-15Teng, Xiangyun; Zhang, Jiaqi; Shi, Yaling; Liu...Front ImmunolNaNNaNNaNdocument_parses/pdf_json/8e5a6881a733deb1800fb...document_parses/pmc_json/PMC8320433.xml.jsonhttps://www.ncbi.nlm.nih.gov/pubmed/34335566/;...235916128.0naCoronavirus disease 2019 (COVID-19) resulting ...Coronavirus disease 2019 (COVID-19) resulting ...
\n","

8513 rows × 22 columns

\n","
"],"text/plain":[" cord_uid ... body_text_y\n","2 va3ov8aj ... The pandemic spread of severe acute respirator...\n","7 fpbz7143 ... NaN\n","10 mbs0mddg ... The current COVID-19 pandemic puts a previousl...\n","11 3vvp6gch ... Our World has witnessed three most deadly outb...\n","18 xrzygcr3 ... It is now over 30 years since the first AIDS c...\n","... ... ... ...\n","24988 0u3e1493 ... Japan occupies a unique position with respect ...\n","24994 euh55i56 ... We read with interest the article by Foster an...\n","24996 bhmzv574 ... Reliable estimation of variant frequencies req...\n","24998 3kgxnyxv ... The COVID-19 pandemic has exposed weaknesses i...\n","24999 kq7mjq6d ... Coronavirus disease 2019 (COVID-19) resulting ...\n","\n","[8513 rows x 22 columns]"]},"execution_count":22,"metadata":{},"output_type":"execute_result"}],"source":["df_merged[df_merged['abstract_y'].apply(lambda x: len(str(x)) <= 10)]"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"RFh2fs_g2eCg"},"outputs":[],"source":["# Fill abstract_x (metadata) with abstract_y (pdf_json) where abstract_x is null and abstract_y is usable (not 'na')\n","df_merged.loc[df_merged.abstract_x.isnull() & (df_merged.abstract_y != 'na'),'abstract_x'] = df_merged[df_merged.abstract_x.isnull() & (df_merged.abstract_y != 'na')].abstract_y"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"jvnT5rrs2nfP","outputId":"a1e737a7-b2fe-41a0-c625-dca6603d128c"},"outputs":[{"data":{"text/plain":["3081"]},"execution_count":24,"metadata":{},"output_type":"execute_result"}],"source":["# Some abstracts are still null, but the count has dropped (3738 -> 3081),\n","# which is what we expected.\n","df_merged.abstract_x.isnull().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Pmoo77Ef2pgk","outputId":"d9bfca0f-25bf-4acb-dd29-4caeae0c2246"},"outputs":[{"data":{"text/plain":["Index(['cord_uid', 'sha', 'source_x', 
'title', 'doi', 'pmcid', 'pubmed_id',\n"," 'license', 'abstract', 'publish_time', 'authors', 'journal', 'mag_id',\n"," 'who_covidence_id', 'arxiv_id', 'pdf_json_files', 'pmc_json_files',\n"," 'url', 's2_id', 'body_text_x', 'body_text_y'],\n"," dtype='object')"]},"execution_count":25,"metadata":{},"output_type":"execute_result"}],"source":["# Lets get rid of the pdf_json abstract column and rename the metadata abstract column\n","df_merged.rename(columns = {'abstract_x' : 'abstract'}, inplace = True)\n","df_merged.drop('abstract_y',axis=1,inplace = True)\n","df_merged.columns"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"VkG6PBXq2rUb","outputId":"5c16e439-66f3-4838-a4d2-295ec3b56ef2"},"outputs":[{"data":{"text/plain":["24999"]},"execution_count":26,"metadata":{},"output_type":"execute_result"}],"source":["# This is expected because body text comes from pdf and pmc folders\n","(df_merged.body_text_x != df_merged.body_text_y).sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"tgRLK4mo2vbR","outputId":"aa2db169-3ef2-4a9a-d1ed-eb2b8c507778"},"outputs":[{"data":{"text/plain":["(1461, 0)"]},"execution_count":27,"metadata":{},"output_type":"execute_result"}],"source":["df_merged.body_text_x.isnull().sum(),(df_merged.body_text_y == '').sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"dFGxq1452yc2","outputId":"034de805-8c6d-4502-9d4b-db756093a56c"},"outputs":[{"data":{"text/plain":["5736"]},"execution_count":28,"metadata":{},"output_type":"execute_result"}],"source":["# This is expected because there are less papers in 
json_pmc\n","df_merged.body_text_y.isnull().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"KGEK6g_520a1","outputId":"51abfaf0-794a-47b0-b660-7980dbd5f37f"},"outputs":[{"data":{"text/plain":["(1461, 5736)"]},"execution_count":29,"metadata":{},"output_type":"execute_result"}],"source":["# body_text_x is pdf_json. body_text_y comes from pmc_json\n","# Where available we use the text from pmc file trusting the statement quality\n","df_merged.body_text_x.isnull().sum(),(df_merged.body_text_y.isnull()).sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":1000},"id":"LDSMjs4n23sY","outputId":"e040fc3c-414c-446e-85d9-abab1beacec3"},"outputs":[{"data":{"text/html":["
cord_uidshasource_xtitledoipmcidpubmed_idlicenseabstractpublish_timeauthorsjournalmag_idwho_covidence_idarxiv_idpdf_json_filespmc_json_filesurls2_idbody_text_xbody_text_y
10mbs0mddgb4466c4bfc22fad10131dc138563ddf36712ca4f; 1b79...Medline; PMCImpact of the COVID-19 pandemic on the core fu...10.1136/bmjopen-2020-039674PMC730627232554730cc-by-ncOBJECTIVES: The current COVID-19 pandemic, as ...2020-06-17Verhoeven, Veronique; Tsakitzidis, Giannoula; ...BMJ OpenNaNNaNNaNdocument_parses/pdf_json/b4466c4bfc22fad10131d...document_parses/pmc_json/PMC7306272.xml.jsonhttps://www.ncbi.nlm.nih.gov/pubmed/32554730/;...219926577.0NaNThe current COVID-19 pandemic puts a previousl...
21di59frozaed72ae7c8e4bd6cb9bdea4455743660f068c8fd; 3966...ArXiv; Medline; PMCScenarios of future Indian electricity demand ...10.1038/s41597-021-00951-6PMC828262734267222cc-byIndia is expected to witness rapid growth in e...2021-07-15Barbar, Marc; Mallapragada, Dharik S.; Alsup, ...Sci DataNaNNaN2106.07588document_parses/pdf_json/aed72ae7c8e4bd6cb9bde...document_parses/pmc_json/PMC8282627.xml.jsonhttps://www.ncbi.nlm.nih.gov/pubmed/34267222/;...235422365.0NaNMany assessments of future electricity demand ...
34bajkeabm47487599e7389023b75c7b76d6017a61677afaf9; 7b20...Medline; PMCCOVID-19 in pregnancy: the foetal perspective—...10.1136/bmjpo-2020-000859PMC768953934192182cc-by-ncOBJECTIVE: We aimed to conduct a systematic re...2020-11-25Dube, Rajani; Kar, Subhranshu SekharBMJ Paediatr OpenNaNNaNNaNdocument_parses/pdf_json/47487599e7389023b75c7...document_parses/pmc_json/PMC7689539.xml.jsonhttps://doi.org/10.1136/bmjpo-2020-000859; htt...227228904.0NaNNovel coronavirus infection and associated cor...
476jzkxubqd0cb7fd7db94f89b02affed967bcd89cafbf29d3; 93cd...Medline; PMCIdentification of a new human coronavirus10.1038/nm1024PMC709578915034574no-ccThree human coronaviruses are known to exist: ...2004-03-21van der Hoek, Lia; Pyrc, Krzysztof; Jebbink, M...Nat MedNaNNaNNaNdocument_parses/pdf_json/d0cb7fd7db94f89b02aff...document_parses/pmc_json/PMC7095789.xml.jsonhttps://www.ncbi.nlm.nih.gov/pubmed/15034574/24428187.0NaNTo date, there is still a variety of human dis...
523foexs3nb3abb4d81aaff449e6cbfe444b6b2b80cb8d3cb2; aa80...Medline; PMCLack of Efficacy of SGLT2-i in Severe Pneumoni...10.1007/s13300-020-00844-8PMC724493632447736cc-by-ncNaN2020-05-23Bossi, Antonio Carlo; Forloni, Franco; Colombe...Diabetes TherNaNNaNNaNdocument_parses/pdf_json/b3abb4d81aaff449e6cbf...document_parses/pmc_json/PMC7244936.xml.jsonhttps://www.ncbi.nlm.nih.gov/pubmed/32447736/;...218837093.0NaNItaly is facing a dramatic health emergency re...
..................................................................
24891opvq9h5f4e88dd5e44f6025f7dfd2c63cd3c9f7778182346; 62ef...Medline; PMCExperiences of pregnant mothers using a social...10.1136/bmjopen-2020-040649PMC781341333455927cc-by-ncOBJECTIVES: The COVID-19 pandemic has seen unp...2021-01-17Chatwin, John; Butler, Danielle; Jones, Jude; ...BMJ OpenNaNNaNNaNdocument_parses/pdf_json/4e88dd5e44f6025f7dfd2...document_parses/pmc_json/PMC7813413.xml.jsonhttps://www.ncbi.nlm.nih.gov/pubmed/33455927/;...231635948.0NaNThe COVID-19 pandemic has seen unprecedented r...
24920ejcs0mm1e0be607e42c1e52efb075e7bf974ae0bca952a2b; e728...Medline; PMCConformational Dynamics in the Interaction of ...10.1021/acs.jpclett.1c00831PMC820475434110168no-cc[Image: see text] Papain-like protease (PLpro)...2021-06-10Leite, Wellington C.; Weiss, Kevin L.; Phillip...J Phys Chem LettNaNNaNNaNdocument_parses/pdf_json/e0be607e42c1e52efb075...document_parses/pmc_json/PMC8204754.xml.jsonhttps://doi.org/10.1021/acs.jpclett.1c00831; h...235394198.0NaNSevere acute respiratory syndrome\\ncoronavirus...
24937tiujpsm92be48d21caa44e9a401a518c1b91a13c68445846; 1875...Elsevier; Medline; PMCCOVID-19 response and containment strategies i...10.1016/j.ajem.2020.04.072PMC719510532386807no-ccNaN2020-04-26Sen-Crowe, Brendon; McKenney, Mark; Elkbuli, AdelAm J Emerg MedNaNNaNNaNdocument_parses/pdf_json/2be48d21caa44e9a401a5...document_parses/pmc_json/PMC7195105.xml.jsonhttps://doi.org/10.1016/j.ajem.2020.04.072; ht...218466657.0NaNCOVID-19 confirmed fatalities in the United St...
24969j28rgb4166b2fb45d49bbd9c3fb969ddd6332ee1d548103e; 0978...PMCCo-circulation of both low and highly pathogen...10.1093/ve/veaa037PMC7326300NaNcc-by-ncHighly pathogenic avian influenza (HPAI) A(H5)...2020-06-30Li, Yao-Tsun; Chen, Chen-Chih; Chang, Ai-Mei; ...Virus EvolNaNNaNNaNdocument_parses/pdf_json/66b2fb45d49bbd9c3fb96...document_parses/pmc_json/PMC7326300.xml.jsonhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...NaNNaNHighly pathogenic avian influenza virus (HPAI)...
24980sld6vdml680b13de30a0e9bcd8914849a1ef758c1391738b; 458c...Medline; PMCSexual health and COVID-19: protocol for a sco...10.1186/s13643-021-01591-yPMC782538933485393cc-byBACKGROUND: Global responses to the COVID-19 p...2021-01-23Kumar, Navin; Janmohamed, Kamila; Nyhan, Kate;...Syst RevNaNNaNNaNdocument_parses/pdf_json/680b13de30a0e9bcd8914...document_parses/pmc_json/PMC7825389.xml.jsonhttps://www.ncbi.nlm.nih.gov/pubmed/33485393/;...231690793.0NaNGlobal responses to the COVID-19 pandemic have...
\n","

1372 rows × 21 columns

\n","
"],"text/plain":[" cord_uid ... body_text_y\n","10 mbs0mddg ... The current COVID-19 pandemic puts a previousl...\n","21 di59froz ... Many assessments of future electricity demand ...\n","34 bajkeabm ... Novel coronavirus infection and associated cor...\n","47 6jzkxubq ... To date, there is still a variety of human dis...\n","52 3foexs3n ... Italy is facing a dramatic health emergency re...\n","... ... ... ...\n","24891 opvq9h5f ... The COVID-19 pandemic has seen unprecedented r...\n","24920 ejcs0mm1 ... Severe acute respiratory syndrome\\ncoronavirus...\n","24937 tiujpsm9 ... COVID-19 confirmed fatalities in the United St...\n","24969 j28rgb41 ... Highly pathogenic avian influenza virus (HPAI)...\n","24980 sld6vdml ... Global responses to the COVID-19 pandemic have...\n","\n","[1372 rows x 21 columns]"]},"execution_count":30,"metadata":{},"output_type":"execute_result"}],"source":["# There are 1372 rows where body_text_x is null but body_text_y is not null\n","df_merged.loc[df_merged.body_text_x.isnull() & df_merged.body_text_y.notnull()]"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"5NrCfQ7k25WX"},"outputs":[],"source":["# Let's trust the text from the pmc folder to be of higher quality, as it contains the full text.\n","# Hence we replace body_text_x with body_text_y wherever body_text_y exists\n","df_merged.loc[df_merged.body_text_y.notnull(),'body_text_x'] = df_merged.loc[df_merged.body_text_y.notnull(), 'body_text_y']"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"7a0p_Yb329FL","outputId":"62bd404f-797b-4d8a-9cee-2ff032690028"},"outputs":[{"data":{"text/plain":["Index(['cord_uid', 'sha', 'source_x', 'title', 'doi', 'pmcid', 'pubmed_id',\n"," 'license', 'abstract', 'publish_time', 'authors', 'journal', 'mag_id',\n"," 'who_covidence_id', 'arxiv_id', 'pdf_json_files', 'pmc_json_files',\n"," 'url', 's2_id', 'body_text'],\n"," 
dtype='object')"]},"execution_count":32,"metadata":{},"output_type":"execute_result"}],"source":["# Let's drop the pmc_json body text column (body_text_y) and rename body_text_x to body_text\n","df_merged.rename(columns = {'body_text_x' : 'body_text'}, inplace = True)\n","df_merged.drop('body_text_y',axis=1,inplace = True)\n","df_merged.columns"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"bbyVSGqh3BIs","outputId":"e54a485b-1d70-457b-a79a-8b176f702ac6"},"outputs":[{"data":{"text/plain":["89"]},"execution_count":33,"metadata":{},"output_type":"execute_result"}],"source":["# The number of null body texts has dropped (1461 -> 89).\n","df_merged.body_text.isnull().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"Tq3_goHV3DSG","outputId":"697e9d42-c34d-4fd7-b943-e9b22fa342be"},"outputs":[{"data":{"text/plain":["Index(['cord_uid', 'sha', 'source_x', 'title', 'doi', 'pmcid', 'pubmed_id',\n"," 'license', 'abstract', 'publish_time', 'authors', 'journal', 'mag_id',\n"," 'who_covidence_id', 'arxiv_id', 'pdf_json_files', 'pmc_json_files',\n"," 'url', 's2_id', 'body_text'],\n"," dtype='object')"]},"execution_count":34,"metadata":{},"output_type":"execute_result"}],"source":["df_merged.columns"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"IqkpfMY43FHO"},"outputs":[],"source":["df_final = df_merged[['sha', 'title', 'abstract', 'publish_time', 'authors', 'url', 'body_text']]"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":595},"id":"4iQ9-OhE3HiH","outputId":"51892a21-81fa-40a5-89e6-66544e6069a7"},"outputs":[{"data":{"text/html":["
shatitleabstractpublish_timeauthorsurlbody_text
041ed6b3014e89604b351c095571c45056ba12c37A phantom study to optimise the automatic tube...On March 11, 2020, the World Health Organizati...2021-05-28Gombolevskiy, Victor; Morozov, Sergey; Chernin...https://doi.org/10.1186/s41747-021-00218-0; ht...\\nWe obtained a density difference (ground-gla...
157221999055bdc0b1c4bb16c790fdebdac1f7ce7Tunicamycin, an anticancer drug and inhibitor ...SARS-CoV-2 outbreaks remains a medical and eco...2020-10-20Dawood, Ali Adel; Altobje, Mahmood Abduljabarhttps://doi.org/10.1016/j.micpath.2020.104586;...Tunicamycin is an antibiotic was produced by S...
27fd64a56bf3a761da96e6aad00789413813599caAsymptomatic SARS-CoV-2 infectionNaN2020-06-12Ooi, Eng Eong; Low, Jenny Ghttps://doi.org/10.1016/s1473-3099(20)30460-6;...The pandemic spread of severe acute respirator...
3bc21c4203bb2d33785bff735f39d0b8758692509Electroencephalographic findings in COVID-19 p...BACKGROUND: Growing evidence of neurologic inv...2020-09-15Roberto, Katrina T.; Espiritu, Adrian I.; Fern...https://www.ncbi.nlm.nih.gov/pubmed/32957032/;...The coronavirus disease of 2019 (COVID-19) inf...
4fa5968c0d0290201f8eb1ca2eaf98181ae6a7a6fMechanical Ventilator Parameter Estimation for...Patients whose lungs are compromised due to va...2021-05-07Oruganti Venkata, Sanjay Sarma; Koenig, Amie; ...https://doi.org/10.3390/bioengineering8050060;...In a healthy person, spontaneous breaths are n...
\n","
"],"text/plain":[" sha ... body_text\n","0 41ed6b3014e89604b351c095571c45056ba12c37 ... \\nWe obtained a density difference (ground-gla...\n","1 57221999055bdc0b1c4bb16c790fdebdac1f7ce7 ... Tunicamycin is an antibiotic was produced by S...\n","2 7fd64a56bf3a761da96e6aad00789413813599ca ... The pandemic spread of severe acute respirator...\n","3 bc21c4203bb2d33785bff735f39d0b8758692509 ... The coronavirus disease of 2019 (COVID-19) inf...\n","4 fa5968c0d0290201f8eb1ca2eaf98181ae6a7a6f ... In a healthy person, spontaneous breaths are n...\n","\n","[5 rows x 7 columns]"]},"execution_count":36,"metadata":{},"output_type":"execute_result"}],"source":["df_final.head()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"jb17usY03JQq","outputId":"42f093f0-a33e-48f3-d298-8ae85146de95"},"outputs":[{"data":{"text/plain":["sha 0\n","title 0\n","abstract 3067\n","publish_time 0\n","authors 218\n","url 0\n","body_text 0\n","dtype: int64"]},"execution_count":37,"metadata":{},"output_type":"execute_result"}],"source":["df_final = df_final.dropna(axis=0,subset=['body_text', 'title'])\n","df_final.isnull().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"oNILUEuR3LxI","outputId":"ad345898-b710-459c-b5cb-d0be017e31d8"},"outputs":[{"data":{"text/plain":["(24910, 7)"]},"execution_count":38,"metadata":{},"output_type":"execute_result"}],"source":["df_final.shape"]},{"cell_type":"code","source":["%cd ..\n","%cd 2-Data\\ Pre-Processing"],"metadata":{"id":"pP1wrBgsXpQf"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":null,"metadata":{"id":"aseuam6r3N7w"},"outputs":[],"source":["df_final.to_csv('FINAL_COVID_QA_DATASET.csv', index=False)"]}],"metadata":{"colab":{"provenance":[]},"kernelspec":{"display_name":"Python 3","name":"python3"},"language_info":{"name":"python"}},"nbformat":4,"nbformat_minor":0} 
-------------------------------------------------------------------------------- /4-Knowledge Base/README.md: -------------------------------------------------------------------------------- 1 | This component handles the knowledge base for the QA service. 2 | 3 | 1. It installs the required packages. 4 | 2. It installs and initializes the Elasticsearch Docker image. 5 | 3. It processes the data and loads it into Elasticsearch. 6 | 4. It optimizes the data storage. -------------------------------------------------------------------------------- /4-Knowledge Base/es_populate.py: -------------------------------------------------------------------------------- 1 | 2 | from haystack.document_store.elasticsearch import ElasticsearchDocumentStore 3 | document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document") 4 | 5 | 6 | # Let's first load the documents that we want to query 7 | import pandas as pd 8 | df = pd.read_csv('SAMPLE_COVID_QA_PROCESSED.csv') 9 | 10 | # Convert the DataFrame rows to dicts. 11 | # Each row (one paper) becomes one dict keyed by column name, 12 | # so 'body_text' and the metadata columns stay together. 13 | dicts = df.to_dict('records') 14 | 15 | # We now have a list of dictionaries that we can reshape and write to our document store. 16 | # Because our texts come from a CSV rather than from raw files, we skip convert_files_to_dicts() and create the dictionaries ourselves. 17 | # The default format here is: 18 | # { 19 | # 'text': "", 20 | # 'meta': {'name': "", ...} 21 | #} 22 | # (Optionally: you can also add more key-value-pairs here, that will be indexed as fields in Elasticsearch and 23 | # can be accessed later for filtering or shown in the responses of the Finder) 24 | 25 | # Let's have a look at the first 3 entries: 26 | #print(dicts[:3]) 27 | 28 | # Now, let's write the dicts containing documents to our DB. 
29 | 30 | 31 | final_dicts = [] 32 | for each in dicts: 33 | tmp = {} 34 | tmp['text'] = each.pop('body_text') 35 | tmp['meta'] = each 36 | final_dicts.append(tmp) 37 | 38 | document_store.write_documents(final_dicts) -------------------------------------------------------------------------------- /4-Knowledge Base/requirements.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yordanoswuletaw/covid19-qa-system/693929b1a3de68eae235e2b9a018f3c9422c76e4/4-Knowledge Base/requirements.docx -------------------------------------------------------------------------------- /4-Knowledge Base/requirements.txt: -------------------------------------------------------------------------------- 1 | farm-haystack==0.9.0 2 | farm==0.8.0 -------------------------------------------------------------------------------- /4-Knowledge Base/setup.sh: -------------------------------------------------------------------------------- 1 | # The first docker command pulls the elasticsearch image from the docker hub 2 | docker pull docker.elastic.co/elasticsearch/elasticsearch:7.14.2 3 | # We are running elasticsearch docker and mapping the 9200 port of local server with the host server 4 | docker run -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.14.2 -------------------------------------------------------------------------------- /5-QA Engine/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.8-slim 2 | EXPOSE 8080 3 | WORKDIR /qaapi 4 | COPY . . 5 | RUN pip3 install -r requirements.txt 6 | CMD uvicorn main:app --host 0.0.0.0 --port 8080 -------------------------------------------------------------------------------- /5-QA Engine/README.md: -------------------------------------------------------------------------------- 1 | This component is the core component for QA. 
2 | 3 | It receives the user's request through its API endpoint, reads the data from the knowledge base, runs the QA pipeline, and sends the answer back to the user. -------------------------------------------------------------------------------- /5-QA Engine/main.py: -------------------------------------------------------------------------------- 1 | from fastapi import FastAPI 2 | from pydantic import BaseModel 3 | 4 | from haystack.document_store.elasticsearch import ElasticsearchDocumentStore 5 | import os 6 | from haystack.pipeline import ExtractiveQAPipeline 7 | from haystack.reader.farm import FARMReader 8 | from haystack.retriever.sparse import ElasticsearchRetriever 9 | 10 | ELASTIC_SEARCH_HOST = os.environ.get('es_ip', 'localhost') 11 | ELASTIC_SEARCH_PORT = int(os.environ.get('es_port', 9200))  # environment values arrive as strings 12 | 13 | document_store = ElasticsearchDocumentStore(host=ELASTIC_SEARCH_HOST, 14 | port=ELASTIC_SEARCH_PORT, 15 | username="", password="", 16 | index="document") 17 | 18 | retriever = ElasticsearchRetriever(document_store=document_store) 19 | reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2-covid", use_gpu=False) 20 | pipe = ExtractiveQAPipeline(reader, retriever) 21 | 22 | app = FastAPI() 23 | 24 | class Queobj(BaseModel): 25 | question: str 26 | num_answers: int 27 | num_docs: int 28 | 29 | 30 | @app.post('/query') 31 | async def query(que_obj: Queobj): 32 | question = que_obj.question 33 | k_retriever = que_obj.num_docs 34 | k_reader = que_obj.num_answers 35 | prediction = pipe.run(query=question, top_k_retriever=k_retriever, top_k_reader=k_reader) 36 | return {'answer': prediction} -------------------------------------------------------------------------------- /5-QA Engine/requirements.txt: -------------------------------------------------------------------------------- 1 | fastapi==0.66.0 2 | farm==0.8.0 3 | farm-haystack==0.9.0 4 | uvicorn==0.14.0 5 | torch==1.8.1 6 | torchvision==0.9.1 7 | torchaudio==0.8.1 8 | pydantic==1.8.2 
--------------------------------------------------------------------------------
/5-QA Engine/setup.sh:
--------------------------------------------------------------------------------
1 | version=`date "+%H-%M-%S_%d-%m-%y"`
2 | echo $version
3 | sudo docker build -t qa-api .
4 | sudo docker tag qa-api us.gcr.io/annular-mercury-318319/qa-api:$version
5 |
6 | sudo docker container prune -f
7 | sudo docker image prune -f
8 | sudo docker volume prune -f
9 | sudo docker network prune -f
10 |
11 | sudo docker push us.gcr.io/annular-mercury-318319/qa-api:$version
12 |
13 | # sudo docker run -p 85:8080 -e es_ip='host.docker.internal' -e es_port=9200 qa-api

--------------------------------------------------------------------------------
/6-Front End with Streamlit/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM python:3.8-slim
2 | EXPOSE 8080
3 | WORKDIR /SAM
4 | COPY ./requirements.txt ./requirements.txt
5 | RUN pip3 install -r requirements.txt
6 | COPY . .
7 | CMD streamlit run streamlit_qa.py --server.port 8080
--------------------------------------------------------------------------------
/6-Front End with Streamlit/README.md:
--------------------------------------------------------------------------------
1 | This is the UI component for QA.
2 |
3 | It provides an interface for users to try the QA demo.

--------------------------------------------------------------------------------
/6-Front End with Streamlit/requirements.txt:
--------------------------------------------------------------------------------
1 | streamlit==0.84.0
2 | requests==2.25.1
3 | st-annotated-text==2.0.0
4 |

--------------------------------------------------------------------------------
/6-Front End with Streamlit/setup.sh:
--------------------------------------------------------------------------------
1 | version=`date "+%H-%M-%S_%d-%m-%y"`
2 | echo $version
3 |
4 | sudo docker container prune -f
5 | sudo docker image prune -f
6 | sudo docker volume prune -f
7 | sudo docker network prune -f
8 |
9 | sudo docker build -t qa-ui .
10 | sudo docker run -p 80:8080 qa-ui -------------------------------------------------------------------------------- /6-Front End with Streamlit/streamlit_qa.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | import requests, json 3 | from annotated_text import annotated_text 4 | st.set_page_config( 5 | page_title="Naturalangue", 6 | page_icon=":shark:", 7 | layout="wide", 8 | initial_sidebar_state="expanded", 9 | ) 10 | 11 | def remote_css(url): 12 | st.markdown(f'', unsafe_allow_html=True) 13 | def icon(icon_name): 14 | st.markdown(f'', unsafe_allow_html=True) 15 | 16 | remote_css("https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css") 17 | 18 | st.markdown('

Covid QA

', unsafe_allow_html=True) 19 | 20 | st.sidebar.header("Options") 21 | top_k_reader = st.sidebar.slider("Max. number of answers", min_value=1, max_value=10, value=3, step=1) 22 | top_k_retriever = st.sidebar.slider("Max. number of documents from retriever", min_value=1, max_value=10, value=3, step=1) 23 | 24 | 25 | st.markdown('

Question

', unsafe_allow_html=True) 26 | question = st.text_input("Put your query", value="What are the symptoms of COVID") 27 | button = st.button('Get Answer') 28 | st.text("") 29 | st.text("") 30 | 31 | if button: 32 | headers = { 33 | 'accept': 'application/json', 34 | 'Content-Type': 'application/json', 35 | } 36 | data = { 37 | 'question': question, 'num_answers': top_k_reader, 'num_docs': top_k_retriever 38 | } 39 | response = requests.post('http://host.docker.internal:85/query', headers=headers, data=json.dumps(data)) 40 | result = response.json() 41 | 42 | print(result) 43 | #result = {'answer': {'query': 'who is the father of arya', 'no_ans_gap': -9.128407001495361, 'answers': [{'answer': 'Dr Shanthi Viswanathan', 'score': -1.2638678550720215, 'probability': 0.006202994845807552, 'context': ' period with plans needed for an exit strategy in the future.$$$Dr Shanthi Viswanathan contributed to the study design, conceptualization, acquisition', 'offset_start': 64, 'offset_end': 86, 'offset_start_in_doc': 28757, 'offset_end_in_doc': 28779, 'document_id': 'ab8a910156406e85534af294ecc0b83d', 'meta': {'sha': 'dca07dbefc86802cb7ccdbbca740aea9e4ea0433', 'title': 'Management of Idiopathic CNS inflammatory diseases during the COVID-19 pandemic: Perspectives and strategies for continuity of care from a South East Asian Center with limited resources.', 'publish_time': '2020-07-03', 'authors': 'Viswanathan, S.', 'url': 'https://doi.org/10.1016/j.msard.2020.102353; https://www.sciencedirect.com/science/article/pii/S2211034820304284?v=s5; https://www.ncbi.nlm.nih.gov/pubmed/32653804/; https://api.elsevier.com/content/article/pii/S2211034820304284', 'abstract': "The Covid-19 pandemic poses a grave health management challenge globally of unprecedented nature. 
Management of idiopathic Central Nervous system inflammatory disorders (iCNSID) such as Multiple sclerosis, Neuromyelitis optica and its spectrum disorders and related conditions during this pandemic needs to be addressed with affirmative and sustainable strategies in order to prevent disease related risks, medication related complications and possible COVID-19 disease associated effects. Global international iCNSIDs agencies and recent publications are attempting to address this but such guidance is not available in South East Asia. Here we outline prospectively qualitatively and quantitatively novel strategies at a tertiary center in Malaysia catering for neuroimmunological disorders despite modest resources during this pandemic. In this retrospective study with longitudinal follow-up, we describe stratification of patients for face to face versus virtual visits in the absence of formal teleneurology, stratification of patients for treatment according to disease activity, rescheduling, deferring initiation or extending treatment intervals of certain disease modifying therapies(DMT's) or immunosuppressants(IS), especially those producing lymphocyte depletion in MS and the continuation of IS in patients with NMO/NMOSD. 
Furthermore, we highlight the use off-label treatments such as Intravenous immunoglobulins/rituximab,bridging interferons/Teriflunomide temporarily replacing more potent DMT choices,supply challenges of IS/DMT'sand tailoring blood watches and neuroimaging surveillance based on the current health needs to stave off the pandemic and prevent at risk patients with iCNSID/health care workers from possibly being exposed to the COVID-19."}}, {'answer': 'Guthy Jackson', 'score': -1.3159641027450562, 'probability': 0.010259977541863918, 'context': 'al MS Society(USA),the Association of British Neurologists(ABN), the Guthy Jackson Charitable Foundation Website (GJCFW) as well as current literature', 'offset_start': 69, 'offset_end': 82, 'offset_start_in_doc': 6089, 'offset_end_in_doc': 6102, 'document_id': 'ab8a910156406e85534af294ecc0b83d', 'meta': {'sha': 'dca07dbefc86802cb7ccdbbca740aea9e4ea0433', 'title': 'Management of Idiopathic CNS inflammatory diseases during the COVID-19 pandemic: Perspectives and strategies for continuity of care from a South East Asian Center with limited resources.', 'publish_time': '2020-07-03', 'authors': 'Viswanathan, S.', 'url': 'https://doi.org/10.1016/j.msard.2020.102353; https://www.sciencedirect.com/science/article/pii/S2211034820304284?v=s5; https://www.ncbi.nlm.nih.gov/pubmed/32653804/; https://api.elsevier.com/content/article/pii/S2211034820304284', 'abstract': "The Covid-19 pandemic poses a grave health management challenge globally of unprecedented nature. Management of idiopathic Central Nervous system inflammatory disorders (iCNSID) such as Multiple sclerosis, Neuromyelitis optica and its spectrum disorders and related conditions during this pandemic needs to be addressed with affirmative and sustainable strategies in order to prevent disease related risks, medication related complications and possible COVID-19 disease associated effects. 
Global international iCNSIDs agencies and recent publications are attempting to address this but such guidance is not available in South East Asia. Here we outline prospectively qualitatively and quantitatively novel strategies at a tertiary center in Malaysia catering for neuroimmunological disorders despite modest resources during this pandemic. In this retrospective study with longitudinal follow-up, we describe stratification of patients for face to face versus virtual visits in the absence of formal teleneurology, stratification of patients for treatment according to disease activity, rescheduling, deferring initiation or extending treatment intervals of certain disease modifying therapies(DMT's) or immunosuppressants(IS), especially those producing lymphocyte depletion in MS and the continuation of IS in patients with NMO/NMOSD. Furthermore, we highlight the use off-label treatments such as Intravenous immunoglobulins/rituximab,bridging interferons/Teriflunomide temporarily replacing more potent DMT choices,supply challenges of IS/DMT'sand tailoring blood watches and neuroimaging surveillance based on the current health needs to stave off the pandemic and prevent at risk patients with iCNSID/health care workers from possibly being exposed to the COVID-19."}}, {'answer': 'Abboud', 'score': -2.4730982780456543, 'probability': 0.009177026338875294, 'context': 'VIG and TPE was thought to be safer. 
9, 10, 11, 22 Brownlee W et al and Abboud H et al highlighted the importance of cautious IS continuation in patie', 'offset_start': 72, 'offset_end': 78, 'offset_start_in_doc': 23716, 'offset_end_in_doc': 23722, 'document_id': 'ab8a910156406e85534af294ecc0b83d', 'meta': {'sha': 'dca07dbefc86802cb7ccdbbca740aea9e4ea0433', 'title': 'Management of Idiopathic CNS inflammatory diseases during the COVID-19 pandemic: Perspectives and strategies for continuity of care from a South East Asian Center with limited resources.', 'publish_time': '2020-07-03', 'authors': 'Viswanathan, S.', 'url': 'https://doi.org/10.1016/j.msard.2020.102353; https://www.sciencedirect.com/science/article/pii/S2211034820304284?v=s5; https://www.ncbi.nlm.nih.gov/pubmed/32653804/; https://api.elsevier.com/content/article/pii/S2211034820304284', 'abstract': "The Covid-19 pandemic poses a grave health management challenge globally of unprecedented nature. Management of idiopathic Central Nervous system inflammatory disorders (iCNSID) such as Multiple sclerosis, Neuromyelitis optica and its spectrum disorders and related conditions during this pandemic needs to be addressed with affirmative and sustainable strategies in order to prevent disease related risks, medication related complications and possible COVID-19 disease associated effects. Global international iCNSIDs agencies and recent publications are attempting to address this but such guidance is not available in South East Asia. Here we outline prospectively qualitatively and quantitatively novel strategies at a tertiary center in Malaysia catering for neuroimmunological disorders despite modest resources during this pandemic. 
In this retrospective study with longitudinal follow-up, we describe stratification of patients for face to face versus virtual visits in the absence of formal teleneurology, stratification of patients for treatment according to disease activity, rescheduling, deferring initiation or extending treatment intervals of certain disease modifying therapies(DMT's) or immunosuppressants(IS), especially those producing lymphocyte depletion in MS and the continuation of IS in patients with NMO/NMOSD. Furthermore, we highlight the use off-label treatments such as Intravenous immunoglobulins/rituximab,bridging interferons/Teriflunomide temporarily replacing more potent DMT choices,supply challenges of IS/DMT'sand tailoring blood watches and neuroimaging surveillance based on the current health needs to stave off the pandemic and prevent at risk patients with iCNSID/health care workers from possibly being exposed to the COVID-19."}}], 'node_id': 'Reader'}}
44 |
45 |     for each in result['answer']['answers']:
46 |         title = each['meta']['title']
47 |         url = each['meta']['url'].split(';')[0]
48 |         tokens = []
49 |         tokens.append(each['context'][:each['offset_start']])  # text before the answer span (offsets are 0-based)
50 |         tokens.append(
51 |             (each['context'][each['offset_start']:each['offset_end']], 'ANS', '#faa')
52 |         )
53 |         tokens.append(each['context'][each['offset_end']:])
54 |         col1,col2 = st.beta_columns([5,1])
55 |         col1.markdown(f'{title}', unsafe_allow_html=True)
56 |         col2.markdown(f'', unsafe_allow_html=True)
57 |         st.text("")
58 |         col1, col2 = st.beta_columns([2,4])
59 |         col1.markdown(f'Publish time: {each["meta"]["publish_time"]}\
60 |
Authors: {each["meta"]["authors"]}
', unsafe_allow_html=True)
61 |         annotated_text(*tokens)
62 |
63 |

--------------------------------------------------------------------------------
/7-Deployment/Dockerfile_QA:
--------------------------------------------------------------------------------
1 | FROM python:3.8-slim
2 | EXPOSE 8080
3 | WORKDIR /qaapi
4 | COPY . .
5 | RUN pip3 install -r requirements.txt
6 | CMD uvicorn main:app --host 0.0.0.0 --port 8080

--------------------------------------------------------------------------------
/7-Deployment/Dockerfile_UI:
--------------------------------------------------------------------------------
1 | FROM python:3.8-slim
2 | EXPOSE 8080
3 | WORKDIR /SAM
4 | COPY ./requirements.txt ./requirements.txt
5 | RUN pip3 install -r requirements.txt
6 | COPY . .
7 | CMD streamlit run streamlit_qa.py --server.port 8080

--------------------------------------------------------------------------------
/7-Deployment/setup.sh:
--------------------------------------------------------------------------------
1 | # Download the clean dataset
2 | wget https://storage.googleapis.com/haystack-qa/SAMPLE_COVID_QA_PROCESSED.csv
3 |
4 | # Pull the Elasticsearch image from Elastic's Docker registry (docker.elastic.co)
5 | docker pull docker.elastic.co/elasticsearch/elasticsearch:7.13.4
6 | # Run Elasticsearch as a single node and publish container port 9200 on host port 9200
7 | docker run -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.13.4

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # COVID-19 Q&A System
2 |
3 | ### About
4 | A COVID-19 automated question answering system developed using deep learning, natural language processing, and over 250,000 COVID-19 research papers and articles.
This system helps researchers, clinicians, and the public find information about COVID-19 quickly and easily, making information retrieval more efficient and effective.
5 |
6 | ### Purpose
7 |
8 | 1. **Research Assistance:** Researchers can use this system to quickly access and extract relevant information from a vast pool of COVID-19 research. This can save them time and effort in finding the latest studies, statistics, and findings.
9 |
10 | 2. **Clinical Decision Support:** Healthcare professionals can use the system to help them make informed clinical decisions. The system can provide guidance based on 2022 medical research and guidelines on COVID-19.
11 |
12 | 3. **Public Awareness:** The system is a valuable tool for the general public. It gives people access to reliable, easy-to-understand information about COVID-19, which can help them learn more about the disease, take preventive measures, and make informed decisions.
13 |
14 | 4. **Educational Resource:** Educational institutions and instructors can use the system to provide students with accurate and timely information about the COVID-19 pandemic. This can help students learn more about the disease and its impact.
15 |
16 | 5. **Policy Development:** Government agencies and policymakers can use the system to stay informed about COVID-19.
17 |

--------------------------------------------------------------------------------
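As a rough illustration of how a client could talk to the QA Engine's `/query` endpoint and highlight an answer the way the Streamlit front end does, here is a minimal sketch. The helper names `build_query_payload` and `split_answer_context` are hypothetical and do not appear in the repository. The sketch assumes the reader's `offset_start` and `offset_end` are 0-based and end-exclusive, which is consistent with the sample response embedded in `streamlit_qa.py`. It makes no network call.

```python
import json


def build_query_payload(question, num_answers=3, num_docs=3):
    """Serialize the JSON body for POST /query (fields mirror the Queobj model)."""
    return json.dumps({
        "question": question,
        "num_answers": num_answers,  # forwarded to the pipeline as top_k_reader
        "num_docs": num_docs,        # forwarded to the pipeline as top_k_retriever
    })


def split_answer_context(context, offset_start, offset_end):
    """Split a context string into (before, answer, after).

    Assumes 0-based, end-exclusive offsets (an assumption, see the lead-in).
    """
    return (
        context[:offset_start],
        context[offset_start:offset_end],
        context[offset_end:],
    )


if __name__ == "__main__":
    print(build_query_payload("What are the symptoms of COVID"))

    # Made-up context, for illustration only.
    context = "Common symptoms include fever and cough."
    start = context.find("fever and cough")
    end = start + len("fever and cough")
    print(split_answer_context(context, start, end))
```

The serialized payload could then be sent with `requests.post(..., headers=headers, data=payload)`, as `streamlit_qa.py` does, and the three pieces returned by `split_answer_context` map onto the plain and annotated tokens passed to `annotated_text`.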