├── .gitignore
├── requirements.txt
├── keywords.csv
├── nonkeywords.csv
├── README.md
└── app.py

/.gitignore:
--------------------------------------------------------------------------------
webenv/
.DS_Store
.ipynb_checkpoints/xml_parser-checkpoint.py
*.zip
*.xml
.ipynb_checkpoints/nonkeywords-checkpoint.csv
.ipynb_checkpoints/keywords-checkpoint.csv
.ipynb_checkpoints/app-checkpoint.py
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
aiohttp==3.7.4
beautifulsoup4==4.9.1
crontab==0.22.8
Flask==2.3.2
gunicorn==20.0.4
multidict==4.7.6
numpy==1.22.0
pandas==1.0.5
requests==2.24.0
retrying==1.3.3
six==1.15.0
slackclient==2.7.2
soupsieve==2.0.1
urllib3==1.26.5
--------------------------------------------------------------------------------
/keywords.csv:
--------------------------------------------------------------------------------
alloy
polymer
metal
ceramic
additive manufacturing
advanced manufacturing
nano
machine learning
artificial intelligence
manufacturing
antimicrobial material
antiviral material
antimicrobial surface
antiviral surface
solar
turbine
corrosion
perovskite
education
stem
curriculum
inclusion
diversity
--------------------------------------------------------------------------------
/nonkeywords.csv:
--------------------------------------------------------------------------------
bioengineer
cancer
vascular
monument
park
groundwater
compliance
toxicity
adult
child
clinical
india
china
social
citizen
fire
agricultural
serological
geographical
geography
alzheimer
melanoma
ocean
forensic
african
sbir
sedimentation
reservoir
ecosystem
fertility
russia
russian
chinese
iran
iranian
embassy
blm
health care
suicide
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# FOA Finder

This is an automated web scraper for finding funding opportunity announcements (FOAs) on grants.gov. Every day, the [grants.gov](https://www.grants.gov/) database is updated and exported to a zipped XML file, which is available for download [here](https://www.grants.gov/web/grants/xml-extract.html). This scraper downloads the latest database export, searches it for relevant keywords, and sends matches to a dedicated Slack channel.

[Information about the format of the XML file](https://www.grants.gov/help/html/help/index.htm?rhcsh=1&callingApp=custom#t=XMLExtract%2FXMLExtract.htm)



## File description

* **app.py**: main Python application
* **keywords.csv**: keywords to match when filtering FOA descriptions
* **nonkeywords.csv**: keywords to avoid when filtering FOA descriptions
* **requirements.txt**: requirements file for installing app dependencies with pip



## Setting environment variables

Environment variables are required for connecting to the Slack API.
To edit environment variables on macOS: `touch ~/.bash_profile; open ~/.bash_profile`

Add the line `export VAR_NAME="VAR_VALUE"`, where `VAR_NAME` and `VAR_VALUE` are the name and value of the variable.
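For this app, the variable read by `app.py` is `SLACK_WEBHOOK_URL`, and its value should be your Slack incoming-webhook URL. After editing, reload the profile with `source ~/.bash_profile`. A minimal way to confirm the variable is visible to Python (an illustrative check only, not part of the app):

```python
import os

# prints the webhook URL if it was exported correctly
print(os.environ.get('SLACK_WEBHOOK_URL', 'SLACK_WEBHOOK_URL is not set'))
```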
## Running the scraper on a schedule using crontab

If needed, install the `crontab` package using `pip install crontab` (it is also pinned in `requirements.txt`); the scheduling itself uses the system `crontab` utility.

Make the Python script executable using `chmod +x app.py`

Use a shebang line in the Python script (`#!/usr/bin/python3`, or, for a virtual environment, something like `#!/Users/emuckley/Documents/Github/foa-finder/webenv/bin/python`) to point to the Python executable.

Open the scheduler using `crontab -e` and paste in the cron schedule line, then type `:x` to save and exit. The schedule line should look something like this to run every 24 hours at the start of the hour (0) at 6 pm (18:00 hrs):
`0 18 * * * /Users/emuckley/Documents/GitHub/foa-finder/app.py >> ~/cron.log 2>&1`

To list running cron jobs, use `crontab -l`

To remove all cron jobs, use `crontab -r`



## Setup for development

Install / upgrade pip: `python3 -m pip install --user --upgrade pip`

Test the pip version: `python3 -m pip --version`

Install virtualenv: `python3 -m pip install --user virtualenv`

To create a virtual environment, navigate to the project folder and run `python3 -m venv <env-name>`, where `<env-name>` is the name of your new virtual environment.

Before installing packages inside the virtual environment, activate the environment: `source <env-name>/bin/activate`.

To deactivate the environment: `deactivate`

Once the environment is activated, use pip to install libraries in it.

To export the list of installed packages as a `requirements.txt` file: `pip freeze > requirements.txt`

To install packages from the requirements file: `pip install -r requirements.txt`
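To verify the Slack connection end to end before scheduling, you can post a one-off test message using the same webhook call that `app.py` makes (a minimal sketch; it assumes `SLACK_WEBHOOK_URL` is set as described above):

```python
import json
import os

import requests

# post a test message to the channel tied to the webhook
response = requests.post(
    os.environ['SLACK_WEBHOOK_URL'],
    data=json.dumps({'text': 'foa-finder test message'}),
    headers={'Content-Type': 'application/json'})
print(response.status_code, response.text)
```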
--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
#!/Users/emuckley/Documents/Github/foa-finder/webenv/bin/python
# -*- coding: utf-8 -*-


"""
What this script does:
(1) finds the latest FOA database on grants.gov
(2) downloads the database
(3) unzips the database to an xml file
(4) converts the xml database into a pandas dataframe
(5) filters the FOA dataframe by keywords
(6) sends the filtered FOA list to a dedicated channel on Slack

Python interpreter to use for crontab scheduling in a virtual environment:
/Users/emuckley/Documents/GitHub/foa-finder/webenv/bin/python

crontab line to run every 24 hours at 18:05 hrs:
5 18 * * * /Users/emuckley/Documents/GitHub/foa-finder/app.py >> ~/cron.log 2>&1
"""


import os
import json
import time
import zipfile
import requests
import pandas as pd
from datetime import datetime, timedelta
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

# suppress pandas chained-assignment warnings for the filtering below
pd.options.mode.chained_assignment = None


# %%%%%%%%%%%%%%%%%%%% find the database %%%%%%%%%%%%%%%%%%%%%%%%%%


def get_xml_url_and_filename():
    """Get the URL and filename of the most recent XML database file
    posted on grants.gov."""

    day_to_try = datetime.today()

    file_found = None
    while file_found is None:
        url = 'https://www.grants.gov/extract/GrantsDBExtract{}v2.zip'.format(
            day_to_try.strftime('%Y%m%d'))
        response = requests.get(url, stream=True)

        # look back in time if today's data is not posted yet
        if response.status_code == 200:
            file_found = url
        else:
            day_to_try = day_to_try - timedelta(days=1)

    filename = 'GrantsDBExtract{}v2.zip'.format(
        day_to_try.strftime('%Y%m%d'))

    print('Found database file {}'.format(filename))

    return url, filename


# get url and filename of the latest database available for extraction
url, filename = get_xml_url_and_filename()


# %%%%%%%%%%%%%%%%%%%% download the database %%%%%%%%%%%%%%%%%%%%%%%%%%


def download_file_from_url(url, filename):
    """Download a file from a URL"""
    # remove all previously-downloaded zip files
    for f in os.listdir():
        if f.endswith(".zip"):
            os.remove(f)
    # ping the dataset url
    response = requests.get(url, stream=True)
    # if the file url is found, stream the file to disk
    if response.status_code == 200:
        with open(filename, "wb") as handle:
            for chunk in response.iter_content(chunk_size=512):
                if chunk:  # filter out keep-alive new chunks
                    handle.write(chunk)
        time.sleep(3)
        print('Database successfully downloaded')
    # if the file url is not found
    else:
        print('URL does not exist')


# download the database zip file
download_file_from_url(url, filename)


# %%%%%%%%%%%%%%%%%%%%%%%%% unzip and parse file %%%%%%%%%%%%%%%%%%%%%


def unzip_and_soupify(filename, unzipped_dirname='unzipped'):
    """Unzip a zip file and parse it using beautiful soup"""

    # if unzipped directory doesn't exist, create it
    if not os.path.exists(unzipped_dirname):
        os.makedirs(unzipped_dirname)

    # remove all previously-extracted files
    for f in os.listdir(unzipped_dirname):
        os.remove(os.path.join(unzipped_dirname, f))

    # unzip raw file
    with zipfile.ZipFile(filename, "r") as z:
        z.extractall(unzipped_dirname)

    # get path of file in unzipped folder
    unzipped_filepath = os.path.join(
        unzipped_dirname,
        os.listdir(unzipped_dirname)[0])

    print('Unzipping {}'.format(unzipped_filepath))

    # parse as tree and convert to string
    tree = ET.parse(unzipped_filepath)
    root = tree.getroot()
    doc = str(ET.tostring(root, encoding='unicode', method='xml'))
    # initialize beautiful soup object (the 'lxml' parser requires the
    # lxml package, which is not pinned in requirements.txt)
    soup = BeautifulSoup(doc, 'lxml')
    print('Database unzipped')

    return soup


# get beautiful soup object from parsed zip file
soup = unzip_and_soupify(filename)
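# For reference, each FOA record in the extract is an XML element whose tag
# name contains 'OpportunitySynopsisDetail', with child tags such as
# OpportunityID, OpportunityNumber, OpportunityTitle, Description, PostDate,
# CloseDate, and LastUpdatedDate. (This sketch is inferred from the fields
# used below, not from the official schema documentation.)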
# %%%%%%%%%%%% populate dataframe with every xml tag %%%%%%%%%%%%%%%%%%%%


def soup_to_df(soup):
    """Convert beautifulsoup object from grants.gov XML into dataframe"""
    # list of bs4 FOA objects
    s = 'opportunitysynopsisdetail'
    foa_objs = [tag for tag in soup.find_all() if s in tag.name.lower()]

    # loop over each FOA in the database and save its details as a
    # dictionary, stripping the 'ns0:' namespace prefix from each tag name
    dic = {}
    for i, foa in enumerate(foa_objs):
        ch = foa.findChildren()
        dic[i] = {fd.name.split('ns0:')[1]: fd.text for fd in ch}

    # create dataframe from dictionary
    df = pd.DataFrame.from_dict(dic, orient='index')
    return df


# get full dataframe of all FOAs
dff = soup_to_df(soup)


# %%%%%%%%%%%%%%%%%% filter by dates and keywords %%%%%%%%%%%%%%%%%%%%%%%%

def to_date(date_str):
    """Convert date string from database into date object"""
    return datetime.strptime(date_str, '%m%d%Y').date()


def is_recent(date, days=14):
    """Check if date occurred within n days of today"""
    return (datetime.today().date() - to_date(date)).days <= days


def is_open(date):
    """Check if FOA is still open (closedate is missing or has not
    passed yet)"""
    if isinstance(date, float):  # missing close dates appear as NaN
        return True
    elif isinstance(date, str):
        return (datetime.today().date() - to_date(date)).days <= 0


def reformat_date(s):
    """Reformat the date string with hyphens so it's easier to read,
    e.g. '09152020' -> '2020-09-15'"""
    s = str(s)
    return s[4:]+'-'+s[:2]+'-'+s[2:4]


def sort_by_recent_updates(df):
    """Sort the dataframe by recently updated dates"""
    new_dates = [reformat_date(i) for i in df['lastupdateddate']]
    df.insert(1, 'updatedate', new_dates)
    df = df.sort_values(by=['updatedate'], ascending=False)
    print('Database sorted and filtered by date')
    return df


def filter_by_keywords(df):
    """Filter the dataframe by keywords and nonkeywords (words to avoid).
    The keywords and nonkeywords are set in external csv files called
    'keywords.csv' and 'nonkeywords.csv'"""
    # get keywords to filter dataframe
    keywords = list(pd.read_csv('keywords.csv', header=None)[0])
    keywords_str = '|'.join(keywords).lower()
    # get non-keywords to avoid
    nonkeywords = list(pd.read_csv('nonkeywords.csv', header=None)[0])
    nonkeywords_str = '|'.join(nonkeywords).lower()

    # filter by post date - the current year and previous year only
    # curr_yr = np.max([int(i[-4:]) for i in df['postdate'].values])
    # prev_yr = curr_yr - 1
    # df = df[df['postdate'].str.contains('-'+str(curr_yr), na=False)]

    # filter dataframe by keywords and nonkeywords
    # (case=False so the lowercased keywords match mixed-case descriptions)
    df = df[df['description'].str.contains(keywords_str, case=False, na=False)]
    df = df[~df['description'].str.contains(nonkeywords_str, case=False, na=False)]

    print('Database filtered by keywords')

    return df
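# Illustrative matching behavior: keywords_str joins the csv entries with
# '|' (e.g. 'alloy|polymer|...|solar'), and str.contains() treats it as a
# regex "contains any of these substrings" match, so with case=False a
# description mentioning 'Solar cells' or 'superalloys' passes the keyword
# filter; nonkeywords then remove rows the same way.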
# include only recently updated FOAs
df = dff[[is_recent(i) for i in dff['lastupdateddate']]]

# include only FOAs which are not closed
df = df[[is_open(i) for i in df['closedate']]]

# sort by newest FOAs at the top
df = sort_by_recent_updates(df)

# filter by keywords
df = filter_by_keywords(df)


# %%%%%%%%%%%%%%% format string message for Slack %%%%%%%%%%%%%%%%%%%%%%


def create_slack_text(filename, df, print_text=True):
    """Create text to send into Slack"""
    # get database date from the filename, e.g. '20200915' -> '2020-09-15'
    db_date = filename.split('GrantsDBExtract')[1].split('v2')[0]
    db_date = db_date[:4]+'-'+db_date[4:6]+'-'+db_date[6:]

    # create text
    slack_text = 'Showing {} recently updated FOAs from grants.gov, extracted {}:'.format(
        len(df), db_date)
    slack_text += '\n======================================='

    base_hyperlink = r'https://www.grants.gov/web/grants/search-grants.html?keywords='

    # loop over each FOA title and add to text
    for i in range(len(df)):

        hyperlink = base_hyperlink + df['opportunitynumber'].iloc[i]

        slack_text += '\n{}) Updated: {}, Closes: {}, Title: {}, {} ({}) \n{}'.format(
            i+1,
            df['updatedate'].iloc[i],
            reformat_date(df['closedate'].iloc[i]),
            df['opportunitytitle'].iloc[i].upper(),
            df['opportunitynumber'].iloc[i],
            df['opportunityid'].iloc[i],
            hyperlink)
        slack_text += '\n----------------------------------'

    slack_text += '\nFOAs filtered by date and keywords in {}'.format(
        'https://github.com/ericmuckley/foa-finder/blob/master/keywords.csv')

    if print_text:
        print(slack_text)
    else:
        print('Slack text generated')
    return slack_text


# create the text to send to slack
slack_text = create_slack_text(filename, df)


# %%%%%%%%%%%%%%%%%%% send message to Slack %%%%%%%%%%%%%%%%%%%%%%%


def send_to_slack(slack_text):
    """Send results to a channel on Slack"""
    print('Sending results to slack')
    try:
        response = requests.post(
            os.environ['SLACK_WEBHOOK_URL'],
            data=json.dumps({'text': slack_text}),
            headers={'Content-Type': 'application/json'})
        print('Slack response: ' + str(response.text))
    except Exception as e:
        print('Connection to Slack could not be established: {}'.format(e))


# send text to slack (left commented out in the source; uncomment to post)
# send_to_slack(slack_text)
--------------------------------------------------------------------------------