├── .gitignore ├── LICENSE ├── README.md ├── async_pubmed_scraper.py ├── example ├── article_data - natural language processing, cancer mRNA vaccines, sugar nanocapsules.csv ├── article_data - photonics, biophysics, biochemistry.csv ├── cli_usage_example.JPG └── data_example.JPG └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | . 2 | # Byte-compiled / optimized / DLL files 3 | __pycache__/ 4 | *.py[cod] 5 | *$py.class 6 | 7 | # C extensions 8 | *.so 9 | 10 | # Distribution / packaging 11 | .Python 12 | build/ 13 | develop-eggs/ 14 | dist/ 15 | downloads/ 16 | eggs/ 17 | .eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | wheels/ 24 | pip-wheel-metadata/ 25 | share/python-wheels/ 26 | *.egg-info/ 27 | .installed.cfg 28 | *.egg 29 | MANIFEST 30 | 31 | # PyInstaller 32 | # Usually these files are written by a python script from a template 33 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 34 | *.manifest 35 | *.spec 36 | 37 | # Installer logs 38 | pip-log.txt 39 | pip-delete-this-directory.txt 40 | 41 | # Unit test / coverage reports 42 | htmlcov/ 43 | .tox/ 44 | .nox/ 45 | .coverage 46 | .coverage.* 47 | .cache 48 | nosetests.xml 49 | coverage.xml 50 | *.cover 51 | *.py,cover 52 | .hypothesis/ 53 | .pytest_cache/ 54 | 55 | # Translations 56 | *.mo 57 | *.pot 58 | 59 | # Django stuff: 60 | *.log 61 | local_settings.py 62 | db.sqlite3 63 | db.sqlite3-journal 64 | 65 | # Flask stuff: 66 | instance/ 67 | .webassets-cache 68 | 69 | # Scrapy stuff: 70 | .scrapy 71 | 72 | # Sphinx documentation 73 | docs/_build/ 74 | 75 | # PyBuilder 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | .python-version 87 | 88 | # pipenv 89 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 90 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 91 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 92 | # install all needed dependencies. 93 | #Pipfile.lock 94 | 95 | # PEP 582; used by e.g. 
github.com/David-OConnor/pyflow 96 | __pypackages__/ 97 | 98 | # Celery stuff 99 | celerybeat-schedule 100 | celerybeat.pid 101 | 102 | # SageMath parsed files 103 | *.sage.py 104 | 105 | # Environments 106 | .env 107 | .venv 108 | env/ 109 | venv/ 110 | ENV/ 111 | env.bak/ 112 | venv.bak/ 113 | 114 | # Spyder project settings 115 | .spyderproject 116 | .spyproject 117 | 118 | # Rope project settings 119 | .ropeproject 120 | 121 | # mkdocs documentation 122 | /site 123 | 124 | # mypy 125 | .mypy_cache/ 126 | .dmypy.json 127 | dmypy.json 128 | 129 | # Pyre type checker 130 | .pyre/ 131 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Ilia Zenkov 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Asynchronous PubMed Scraper 2 | 3 | ## Quick Start 4 | Instructions for Windows. Make sure you have [python](https://www.python.org/downloads/) installed. Linux users: ```async_pubmed_scraper -h```
5 | 1) Open command prompt and change directory to the folder containing ```async_pubmed_scraper.py``` and ```keywords.txt``` 6 | 2) Create a [virtual environment](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/): ```python -m pip install --user virtualenv```, ```python -m venv scraper_env```, ```.\scraper_env\Scripts\activate```
7 | 3) Install dependencies: ```pip install -r requirements.txt```
8 | 4) Enter a list of keywords to scrape, one per line, in ```keywords.txt``` (see the sample below)
9 | 5) Enter ```python async_pubmed_scraper.py -h``` for usage instructions and you are good to go
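A sample ```keywords.txt``` - each line is treated as one PubMed search term (these three happen to be the keywords behind one of the bundled example CSVs in ```example/```):

```
natural language processing
cancer mRNA vaccines
sugar nanocapsules
```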

10 | Example: To scrape the first 10 pages of search results for your keywords from 2018 to 2020 and save the data to the file ```article_data.csv```: ```python async_pubmed_scraper.py --pages 10 --start 2018 --stop 2020 --output article_data```
11 | 12 | ## Example Usage and Data 13 |
Collects the following data at 13 articles/second: url, title, abstract, authors, affiliations, journal, keywords, date
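The CSV is intended for further processing - below is a minimal pandas loading sketch (it assumes the default ```articles.csv``` output name; ```NO_ABSTRACT``` is the sentinel the scraper writes when an article has no abstract):

```python
import pandas as pd

# Load the scraped article data; the first CSV column is the DataFrame index written by to_csv
df = pd.read_csv('articles.csv', index_col=0)

# Columns: title, abstract, affiliations, authors, journal, date, keywords, url
print(df.columns.tolist())

# Example: keep only articles that actually have an abstract before any downstream text processing
with_abstract = df[df['abstract'] != 'NO_ABSTRACT']
print(f'{len(with_abstract)} of {len(df)} articles have abstracts')
```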
14 | Screenshots: ```example/cli_usage_example.JPG``` (CLI usage) and ```example/data_example.JPG``` (preview of the scraped data).
15 | 16 | 17 | 18 | 19 | ## What it does 20 | 21 | This script asynchronously scrapes PubMed - an open-access database of scholarly research articles - 22 | and saves the data to a pandas DataFrame which is then written to a CSV intended for further processing. 23 | It asynchronously scrapes all results pages for a user-specified list of keywords. 24 | 25 | ## Why scrape when there's an API? Why asynchronous? 26 | PubMed provides an API - the NCBI Entrez API, also known as Entrez Programming Utilities or E-Utilities - 27 | which can be used to build datasets with their search engine - however, PubMed allows only 3 URL requests per second 28 | through E-Utilities (10/second with an API key). 29 | We're doing potentially thousands of URL requests asynchronously - depending on the number of articles returned by the command line options specified by the user. 30 | It's much faster to download articles from all URLs on all results pages in parallel than to download them page by page, article by article, waiting for each request to complete before starting the next. It's not unusual to see a 10x speedup with async scraping compared to regular scraping. 31 | Simply put, we're going to make our client send requests to all search queries and all resulting article URLs at the same time. 32 | Otherwise, our client will wait for the server to answer before sending the next request, so most of our script's execution time 33 | will be spent waiting for a response from the (PubMed) server. 34 | 35 | ## License 36 | 37 | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/IliaZenkov/async-pubmed-scraper/blob/master/LICENSE) 38 | 39 | 40 | -------------------------------------------------------------------------------- /async_pubmed_scraper.py: -------------------------------------------------------------------------------- 1 | """ 2 | Author: Ilia Zenkov 3 | Date: 9/26/2020 4 | 5 | This script asynchronously scrapes PubMed - an open-access database of scholarly research articles - 6 | and saves the data to a DataFrame which is then written to a CSV intended for further processing 7 | This script is capable of scraping a list of keywords asynchronously 8 | 9 | Contains the following functions: 10 | make_header: Makes an HTTP request header using a random user agent 11 | get_num_pages: Finds number of pubmed results pages returned by a keyword search 12 | extract_by_article: Extracts data from a single pubmed article to a DataFrame 13 | get_pmids: Gets PMIDs of all article URLs from a single page and builds URLs to pubmed articles specified by those PMIDs 14 | build_article_urls: Async wrapper for get_pmids, creates asyncio tasks for each page of results, page by page, 15 | and stores article urls in urls: List[string] 16 | get_article_data: Async wrapper for extract_by_article, creates asyncio tasks to scrape data from each article specified by urls[] 17 | 18 | requires: 19 | BeautifulSoup4 (bs4) 20 | pandas 21 | requests 22 | asyncio 23 | aiohttp 24 | nest_asyncio (OPTIONAL: Solves nested async calls in jupyter notebooks) 25 | """ 26 | 27 | import argparse 28 | import time 29 | from bs4 import BeautifulSoup 30 | import pandas as pd 31 | import random 32 | import requests 33 | import asyncio 34 | import aiohttp 35 | import socket 36 | import warnings; warnings.filterwarnings('ignore') # aiohttp produces deprecation warnings that don't concern us 37 | #import nest_asyncio; nest_asyncio.apply() # necessary to run nested async loops in
jupyter notebooks 38 | 39 | # Use a variety of agents for our ClientSession to reduce traffic per agent 40 | # This (attempts to) avoid a ban for high traffic from any single agent 41 | # We should really use proxybroker or similar to ensure no ban 42 | user_agents = [ 43 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1", 44 | "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0", 45 | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36", 46 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1", 47 | "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", 48 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", 49 | "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", 50 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", 51 | "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", 52 | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", 53 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", 54 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", 55 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", 56 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", 57 | "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", 58 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", 59 | "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", 60 | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", 61 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", 62 | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", 63 | "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24" 64 | ] 65 | 66 | def make_header(): 67 | ''' 68 | Chooses a random agent from user_agents with which to construct headers 69 | :return headers: dict: HTTP headers to use to get HTML from article URL 70 | ''' 71 | # Make a header for the ClientSession to use with one of our agents chosen at random 72 | headers = { 73 | 'User-Agent':random.choice(user_agents), 74 | } 75 | return headers 76 | 77 | async def extract_by_article(url): 78 | ''' 79 | Extracts all data from a single article 80 | :param url: string: URL to a single article (i.e. 
root pubmed URL + PMID) 81 | :return article_data: Dict: Contains all data from a single article 82 | ''' 83 | conn = aiohttp.TCPConnector(family=socket.AF_INET) 84 | headers = make_header() 85 | # Reference our articles DataFrame containing accumulated data for ALL scraped articles 86 | global articles_data 87 | async with aiohttp.ClientSession(headers=headers, connector=conn) as session: 88 | async with semaphore, session.get(url) as response: 89 | data = await response.text() 90 | soup = BeautifulSoup(data, "lxml") 91 | # Get article abstract if exists - sometimes abstracts are not available (not an error) 92 | try: 93 | abstract_raw = soup.find('div', {'class': 'abstract-content selected'}).find_all('p') 94 | # Some articles are in a split background/objectives/method/results style, we need to join these paragraphs 95 | abstract = ' '.join([paragraph.text.strip() for paragraph in abstract_raw]) 96 | except: 97 | abstract = 'NO_ABSTRACT' 98 | # Get author affiliations - sometimes affiliations are not available (not an error) 99 | affiliations = [] # list because it would be difficult to split since ',' exists within an affiliation 100 | try: 101 | all_affiliations = soup.find('ul', {'class':'item-list'}).find_all('li') 102 | for affiliation in all_affiliations: 103 | affiliations.append(affiliation.get_text().strip()) 104 | except: 105 | affiliations = 'NO_AFFILIATIONS' 106 | # Get article keywords - sometimes keywords are not available (not an error) 107 | try: 108 | # We need to check if the abstract section includes keywords or else we may get abstract text 109 | has_keywords = soup.find_all('strong',{'class':'sub-title'})[-1].text.strip() 110 | if has_keywords == 'Keywords:': 111 | # Taking last element in following line because occasionally this section includes text from abstract 112 | keywords = soup.find('div', {'class':'abstract' }).find_all('p')[-1].get_text() 113 | keywords = keywords.replace('Keywords:','\n').strip() # Clean it up 114 | else: 115 | keywords = 'NO_KEYWORDS' 116 | except: 117 | keywords = 'NO_KEYWORDS' 118 | try: 119 | title = soup.find('meta',{'name':'citation_title'})['content'].strip('[]') 120 | except: 121 | title = 'NO_TITLE' 122 | authors = '' # string because it's easy to split a string on ',' 123 | try: 124 | for author in soup.find('div',{'class':'authors-list'}).find_all('a',{'class':'full-name'}): 125 | authors += author.text + ', ' 126 | # alternative to get citation style authors (no first name e.g. I. 
Zenkov) 127 | # all_authors = soup.find('meta', {'name': 'citation_authors'})['content'] 128 | # [authors.append(author) for author in all_authors.split(';')] 129 | except: 130 | authors = ('NO_AUTHOR') 131 | try: 132 | journal = soup.find('meta',{'name':'citation_journal_title'})['content'] 133 | except: 134 | journal = 'NO_JOURNAL' 135 | try: 136 | date = soup.find('time', {'class': 'citation-year'}).text 137 | except: 138 | date = 'NO_DATE' 139 | 140 | # Format data as a dict to insert into a DataFrame 141 | article_data = { 142 | 'url': url, 143 | 'title': title, 144 | 'authors': authors, 145 | 'abstract': abstract, 146 | 'affiliations': affiliations, 147 | 'journal': journal, 148 | 'keywords': keywords, 149 | 'date': date 150 | } 151 | # Add dict containing one article's data to list of article dicts 152 | articles_data.append(article_data) 153 | 154 | async def get_pmids(page, keyword): 155 | """ 156 | Extracts PMIDs of all articles from a pubmed search result, page by page, 157 | builds a url to each article, and stores all article URLs in urls: List[string] 158 | :param page: int: value of current page of a search result for keyword 159 | :param keyword: string: current search keyword 160 | :return: None 161 | """ 162 | # URL to one unique page of results for a keyword search 163 | page_url = f'{pubmed_url}+{keyword}+&page={page}' 164 | headers = make_header() 165 | async with aiohttp.ClientSession(headers=headers) as session: 166 | async with session.get(page_url) as response: 167 | data = await response.text() 168 | # Parse the current page of search results from the response 169 | soup = BeautifulSoup(data, "lxml") 170 | # Find section which holds the PMIDs for all articles on a single page of search results 171 | pmids = soup.find('meta',{'name':'log_displayeduids'})['content'] 172 | # alternative to get pmids: page_content = soup.find_all('div', {'class': 'docsum-content'}) + for line in page_content: line.find('a').get('href') 173 | # Extract URLs by getting PMIDs for all pubmed articles on the results page (default 10 articles/page) 174 | for pmid in pmids.split(','): 175 | url = root_pubmed_url + '/' + pmid 176 | urls.append(url) 177 | 178 | def get_num_pages(keyword): 179 | ''' 180 | Gets total number of pages returned by search results for keyword 181 | :param keyword: string: search word used to search for results 182 | :return: num_pages: int: number of pages returned by search results for keyword 183 | ''' 184 | # Return user specified number of pages if option was supplied 185 | if args.pages != None: return args.pages 186 | 187 | # Get search result page and wait a second for it to load 188 | # URL to the first page of results for a keyword search 189 | headers=make_header() 190 | search_url = f'{pubmed_url}+{keyword}' 191 | with requests.get(search_url,headers=headers) as response: 192 | data = response.text 193 | soup = BeautifulSoup(data, "lxml") 194 | num_pages = int((soup.find('span', {'class': 'total-pages'}).get_text()).replace(',','')) 195 | return num_pages # Can hardcode this value (e.g. 10 pages) to limit # of articles scraped per keyword 196 | 197 | async def build_article_urls(keywords): 198 | """ 199 | PubMed uniquely identifies articles using a PMID 200 | e.g. 
https://pubmed.ncbi.nlm.nih.gov/32023415/ #shameless self plug :) 201 | Any and all articles can be identified with a single PMID 202 | 203 | Async wrapper for get_article_urls, page by page of results, for a single search keyword 204 | Creates an asyncio task for each page of search result for each keyword 205 | :param keyword: string: search word used to search for results 206 | :return: None 207 | """ 208 | tasks = [] 209 | for keyword in keywords: 210 | num_pages = get_num_pages(keyword) 211 | for page in range(1,num_pages+1): 212 | task = asyncio.create_task(get_pmids(page, keyword)) 213 | tasks.append(task) 214 | 215 | await asyncio.gather(*tasks) 216 | 217 | async def get_article_data(urls): 218 | """ 219 | Async wrapper for extract_by_article to scrape data from each article (url) 220 | :param urls: List[string]: list of all pubmed urls returned by the search keyword 221 | :return: None 222 | """ 223 | tasks = [] 224 | for url in urls: 225 | if url not in scraped_urls: 226 | task = asyncio.create_task(extract_by_article(url)) 227 | tasks.append(task) 228 | scraped_urls.append(url) 229 | 230 | await asyncio.gather(*tasks) 231 | 232 | 233 | if __name__ == "__main__": 234 | # Set options so user can choose number of pages and publication date range to scrape, and output file name 235 | parser = argparse.ArgumentParser(description='Asynchronous PubMed Scraper') 236 | parser.add_argument('--pages', type=int, default=None, help='Specify number of pages to scrape for EACH keyword. Each page of PubMed results contains 10 articles. \n Default = all pages returned for all keywords.') 237 | parser.add_argument('--start', type=int, default=2019, help='Specify start year for publication date range to scrape. Default = 2019') 238 | parser.add_argument('--stop', type=int, default=2020, help='Specify stop year for publication date range to scrape. Default = 2020') 239 | parser.add_argument('--output', type=str, default='articles.csv',help='Choose output file name. 
Default = "articles.csv".') 240 | args = parser.parse_args() 241 | if args.output[-4:] != '.csv': args.output += '.csv' # ensure we save a CSV if user forgot to include format in --output option 242 | start = time.time() 243 | # This pubmed link is hardcoded to search for articles from user specified date range, defaults to 2019-2020 244 | pubmed_url = f'https://pubmed.ncbi.nlm.nih.gov/?term={args.start}%3A{args.stop}%5Bdp%5D' 245 | # The root pubmed link is used to construct URLs to scrape after PMIDs are retrieved from user specified date range 246 | root_pubmed_url = 'https://pubmed.ncbi.nlm.nih.gov' 247 | # Construct our list of keywords from a user input file to search for and extract articles from 248 | search_keywords = [] 249 | with open('keywords.txt') as file: 250 | keywords = file.readlines() 251 | [search_keywords.append(keyword.strip()) for keyword in keywords] 252 | print(f'\nFinding PubMed article URLs for {len(keywords)} keywords found in keywords.txt\n') 253 | # Empty list to store all article data as List[dict]; each dict represents data from one article 254 | # This approach is considerably faster than appending article data article-by-article to a DataFrame 255 | articles_data = [] 256 | # Empty list to store all article URLs 257 | urls = [] 258 | # Empty list to store URLs already scraped 259 | scraped_urls = [] 260 | 261 | # We use asyncio's BoundedSemaphore method to limit the number of asynchronous requests 262 | # we make to PubMed at a time to avoid a ban (and to be nice to PubMed servers) 263 | # Higher value for BoundedSemaphore yields faster scraping, and a higher chance of ban. 100-500 seems to be OK. 264 | semaphore = asyncio.BoundedSemaphore(100) 265 | 266 | # Get and run the loop to build a list of all URLs 267 | loop = asyncio.get_event_loop() 268 | loop.run_until_complete(build_article_urls(search_keywords)) 269 | print(f'Scraping initiated for {len(urls)} article URLs found from {args.start} to {args.stop}\n') 270 | # Get and run the loop to get article data into a DataFrame from a list of all URLs 271 | loop = asyncio.get_event_loop() 272 | loop.run_until_complete(get_article_data(urls)) 273 | 274 | # Create DataFrame to store data from all articles 275 | articles_df = pd.DataFrame(articles_data, columns=['title','abstract','affiliations','authors','journal','date','keywords','url']) 276 | print('Preview of scraped article data:\n') 277 | print(articles_df.head(5)) 278 | # Save all extracted article data to CSV for further processing 279 | filename = args.output 280 | articles_df.to_csv(filename) 281 | print(f'It took {time.time() - start} seconds to find {len(urls)} articles; {len(scraped_urls)} unique articles were saved to {filename}') 282 | 283 | -------------------------------------------------------------------------------- /example/cli_usage_example.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IliaZenkov/async-pubmed-scraper/3b3f322d9d5c2828d181a7a370d06395cb063ab7/example/cli_usage_example.JPG -------------------------------------------------------------------------------- /example/data_example.JPG: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/IliaZenkov/async-pubmed-scraper/3b3f322d9d5c2828d181a7a370d06395cb063ab7/example/data_example.JPG -------------------------------------------------------------------------------- /requirements.txt: 
-------------------------------------------------------------------------------- 1 | aiohttp==3.6.2 2 | asyncio==3.4.3 3 | beautifulsoup4==4.9.1 4 | bs4==0.0.1 5 | lxml==4.5.2 6 | numpy==1.19.2 7 | pandas==1.1.2 8 | requests==2.24.0 9 | 10 | --------------------------------------------------------------------------------