├── .gitignore
├── requirements.txt
├── keywords.csv
├── nonkeywords.csv
├── README.md
└── app.py

/.gitignore:
--------------------------------------------------------------------------------
webenv/
.DS_Store
.ipynb_checkpoints/xml_parser-checkpoint.py
*.zip
*.xml
.ipynb_checkpoints/nonkeywords-checkpoint.csv
.ipynb_checkpoints/keywords-checkpoint.csv
.ipynb_checkpoints/app-checkpoint.py
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
aiohttp==3.7.4
beautifulsoup4==4.9.1
crontab==0.22.8
Flask==2.3.2
gunicorn==20.0.4
multidict==4.7.6
numpy==1.22.0
pandas==1.0.5
requests==2.24.0
retrying==1.3.3
six==1.15.0
slackclient==2.7.2
soupsieve==2.0.1
urllib3==1.26.5
--------------------------------------------------------------------------------
/keywords.csv:
--------------------------------------------------------------------------------
alloy
polymer
metal
ceramic
additive manufacturing
advanced manufacturing
nano
machine learning
artificial intelligence
manufacturing
antimicrobial material
antiviral material
antimicrobial surface
antiviral surface
solar
turbine
corrosion
perovskite
education
stem
curriculum
inclusion
diversity
--------------------------------------------------------------------------------
/nonkeywords.csv:
--------------------------------------------------------------------------------
bioengineer
cancer
vascular
monument
park
groundwater
compliance
toxicity
adult
child
clinical
india
china
social
citizen
fire
agricultural
serological
geographical
geography
alzheimer
melanoma
ocean
forensic
african
sbir
sedimentation
reservoir
ecosystem
fertility
russia
russian
chinese
iran
iranian
embassy
blm
health care
suicide
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# FOA Finder

This is an automated web scraper for finding funding opportunity announcements (FOAs) on grants.gov. Every day, the [grants.gov](https://www.grants.gov/) database is updated and exported to a zipped XML file, which is available for download [here](https://www.grants.gov/web/grants/xml-extract.html). This scraper downloads the latest database export, searches it for relevant keywords, and sends matches to a dedicated Slack channel.

[Information about the format of the XML file](https://www.grants.gov/help/html/help/index.htm?rhcsh=1&callingApp=custom#t=XMLExtract%2FXMLExtract.htm)



## File description

* **app.py**: main Python application
* **keywords.csv**: keywords to match when filtering FOA descriptions
* **nonkeywords.csv**: keywords to avoid when filtering FOA descriptions
* **requirements.txt**: requirements file for installing app dependencies with pip



## Setting environment variables

Environment variables are required for connecting to the Slack API.
To edit environment variables on macOS: `touch ~/.bash_profile; open ~/.bash_profile`

Add the line `export VAR_NAME="VAR_VALUE"`, where `VAR_NAME` and `VAR_VALUE` are the name and value of the variable.
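For this app, the variable read by `app.py` is `SLACK_WEBHOOK_URL`, and its value should be your Slack incoming-webhook URL. After editing, reload the profile with `source ~/.bash_profile`. A minimal way to confirm the variable is visible to Python (an illustrative check only, not part of the app):

```python
import os

# prints the webhook URL if it was exported correctly
print(os.environ.get('SLACK_WEBHOOK_URL', 'SLACK_WEBHOOK_URL is not set'))
```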
## Running the scraper on a schedule using crontab

If needed, install the `crontab` package using `pip install crontab` (it is also pinned in `requirements.txt`); the scheduling itself uses the system `crontab` utility.

Make the Python script executable using `chmod +x app.py`

Use a shebang line in the Python script (`#!/usr/bin/python3`, or, for a virtual environment, something like `#!/Users/emuckley/Documents/Github/foa-finder/webenv/bin/python`) to point to the Python executable.

Open the scheduler using `crontab -e` and paste in the cron schedule line, then type `:x` to save and exit. The schedule line should look something like this to run every 24 hours at the start of the hour (0) at 6 pm (18:00 hrs):
`0 18 * * * /Users/emuckley/Documents/GitHub/foa-finder/app.py >> ~/cron.log 2>&1`

To list running cron jobs, use `crontab -l`

To remove all cron jobs, use `crontab -r`



## Setup for development

Install / upgrade pip: `python3 -m pip install --user --upgrade pip`

Test the pip version: `python3 -m pip --version`

Install virtualenv: `python3 -m pip install --user virtualenv`

To create a virtual environment, navigate to the project folder and run `python3 -m venv <env-name>`, where `<env-name>` is the name of your new virtual environment.

Before installing packages inside the virtual environment, activate the environment: `source <env-name>/bin/activate`.

To deactivate the environment: `deactivate`

Once the environment is activated, use pip to install libraries in it.

To export the list of installed packages as a `requirements.txt` file: `pip freeze > requirements.txt`

To install packages from the requirements file: `pip install -r requirements.txt`
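To verify the Slack connection end to end before scheduling, you can post a one-off test message using the same webhook call that `app.py` makes (a minimal sketch; it assumes `SLACK_WEBHOOK_URL` is set as described above):

```python
import json
import os

import requests

# post a test message to the channel tied to the webhook
response = requests.post(
    os.environ['SLACK_WEBHOOK_URL'],
    data=json.dumps({'text': 'foa-finder test message'}),
    headers={'Content-Type': 'application/json'})
print(response.status_code, response.text)
```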
--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
#!/Users/emuckley/Documents/Github/foa-finder/webenv/bin/python
# -*- coding: utf-8 -*-


"""
What this script does:
(1) finds the latest FOA database on grants.gov
(2) downloads the database
(3) unzips the database to an xml file
(4) converts the xml database into a pandas dataframe
(5) filters the FOA dataframe by keywords
(6) sends the filtered FOA list to a dedicated channel on Slack

Python interpreter to use for crontab scheduling in a virtual environment:
/Users/emuckley/Documents/GitHub/foa-finder/webenv/bin/python

crontab line to run every 24 hours at 18:05 hrs:
5 18 * * * /Users/emuckley/Documents/GitHub/foa-finder/app.py >> ~/cron.log 2>&1
"""


import os
import json
import time
import zipfile
import requests
import pandas as pd
from datetime import datetime, timedelta
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

# suppress pandas chained-assignment warnings for the filtering below
pd.options.mode.chained_assignment = None


# %%%%%%%%%%%%%%%%%%%% find the database %%%%%%%%%%%%%%%%%%%%%%%%%%


def get_xml_url_and_filename():
    """Get the URL and filename of the most recent XML database file
    posted on grants.gov."""

    day_to_try = datetime.today()

    file_found = None
    while file_found is None:
        url = 'https://www.grants.gov/extract/GrantsDBExtract{}v2.zip'.format(
            day_to_try.strftime('%Y%m%d'))
        response = requests.get(url, stream=True)

        # look back in time if today's data is not posted yet
        if response.status_code == 200:
            file_found = url
        else:
            day_to_try = day_to_try - timedelta(days=1)

    filename = 'GrantsDBExtract{}v2.zip'.format(
        day_to_try.strftime('%Y%m%d'))

    print('Found database file {}'.format(filename))

    return url, filename


# get url and filename of the latest database available for extraction
url, filename = get_xml_url_and_filename()


# %%%%%%%%%%%%%%%%%%%% download the database %%%%%%%%%%%%%%%%%%%%%%%%%%


def download_file_from_url(url, filename):
    """Download a file from a URL"""
    # remove all previously-downloaded zip files
    for f in os.listdir():
        if f.endswith(".zip"):
            os.remove(f)
    # ping the dataset url
    response = requests.get(url, stream=True)
    # if the file url is found, stream the file to disk
    if response.status_code == 200:
        with open(filename, "wb") as handle:
            for chunk in response.iter_content(chunk_size=512):
                if chunk:  # filter out keep-alive new chunks
                    handle.write(chunk)
        time.sleep(3)
        print('Database successfully downloaded')
    # if the file url is not found
    else:
        print('URL does not exist')


# download the database zip file
download_file_from_url(url, filename)


# %%%%%%%%%%%%%%%%%%%%%%%%% unzip and parse file %%%%%%%%%%%%%%%%%%%%%


def unzip_and_soupify(filename, unzipped_dirname='unzipped'):
    """Unzip a zip file and parse it using beautiful soup"""

    # if unzipped directory doesn't exist, create it
    if not os.path.exists(unzipped_dirname):
        os.makedirs(unzipped_dirname)

    # remove all previously-extracted files
    for f in os.listdir(unzipped_dirname):
        os.remove(os.path.join(unzipped_dirname, f))

    # unzip raw file
    with zipfile.ZipFile(filename, "r") as z:
        z.extractall(unzipped_dirname)

    # get path of file in unzipped folder
    unzipped_filepath = os.path.join(
        unzipped_dirname,
        os.listdir(unzipped_dirname)[0])

    print('Unzipping {}'.format(unzipped_filepath))

    # parse as tree and convert to string
    tree = ET.parse(unzipped_filepath)
    root = tree.getroot()
    doc = str(ET.tostring(root, encoding='unicode', method='xml'))
    # initialize beautiful soup object (the 'lxml' parser requires the
    # lxml package, which is not pinned in requirements.txt)
    soup = BeautifulSoup(doc, 'lxml')
    print('Database unzipped')

    return soup


# get beautiful soup object from parsed zip file
soup = unzip_and_soupify(filename)
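# For reference, each FOA record in the extract is an XML element whose tag
# name contains 'OpportunitySynopsisDetail', with child tags such as
# OpportunityID, OpportunityNumber, OpportunityTitle, Description, PostDate,
# CloseDate, and LastUpdatedDate. (This sketch is inferred from the fields
# used below, not from the official schema documentation.)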
# %%%%%%%%%%%% populate dataframe with every xml tag %%%%%%%%%%%%%%%%%%%%


def soup_to_df(soup):
    """Convert beautifulsoup object from grants.gov XML into dataframe"""
    # list of bs4 FOA objects
    s = 'opportunitysynopsisdetail'
    foa_objs = [tag for tag in soup.find_all() if s in tag.name.lower()]

    # loop over each FOA in the database and save its details as a
    # dictionary, stripping the 'ns0:' namespace prefix from each tag name
    dic = {}
    for i, foa in enumerate(foa_objs):
        ch = foa.findChildren()
        dic[i] = {fd.name.split('ns0:')[1]: fd.text for fd in ch}

    # create dataframe from dictionary
    df = pd.DataFrame.from_dict(dic, orient='index')
    return df


# get full dataframe of all FOAs
dff = soup_to_df(soup)


# %%%%%%%%%%%%%%%%%% filter by dates and keywords %%%%%%%%%%%%%%%%%%%%%%%%

def to_date(date_str):
    """Convert date string from database into date object"""
    return datetime.strptime(date_str, '%m%d%Y').date()


def is_recent(date, days=14):
    """Check if date occurred within n days of today"""
    return (datetime.today().date() - to_date(date)).days <= days


def is_open(date):
    """Check if FOA is still open (closedate is missing or has not
    passed yet)"""
    if isinstance(date, float):  # missing close dates appear as NaN
        return True
    elif isinstance(date, str):
        return (datetime.today().date() - to_date(date)).days <= 0


def reformat_date(s):
    """Reformat the date string with hyphens so it's easier to read,
    e.g. '09152020' -> '2020-09-15'"""
    s = str(s)
    return s[4:]+'-'+s[:2]+'-'+s[2:4]


def sort_by_recent_updates(df):
    """Sort the dataframe by recently updated dates"""
    new_dates = [reformat_date(i) for i in df['lastupdateddate']]
    df.insert(1, 'updatedate', new_dates)
    df = df.sort_values(by=['updatedate'], ascending=False)
    print('Database sorted and filtered by date')
    return df


def filter_by_keywords(df):
    """Filter the dataframe by keywords and nonkeywords (words to avoid).
    The keywords and nonkeywords are set in external csv files called
    'keywords.csv' and 'nonkeywords.csv'"""
    # get keywords to filter dataframe
    keywords = list(pd.read_csv('keywords.csv', header=None)[0])
    keywords_str = '|'.join(keywords).lower()
    # get non-keywords to avoid
    nonkeywords = list(pd.read_csv('nonkeywords.csv', header=None)[0])
    nonkeywords_str = '|'.join(nonkeywords).lower()

    # filter by post date - the current year and previous year only
    # curr_yr = np.max([int(i[-4:]) for i in df['postdate'].values])
    # prev_yr = curr_yr - 1
    # df = df[df['postdate'].str.contains('-'+str(curr_yr), na=False)]

    # filter dataframe by keywords and nonkeywords
    # (case=False so the lowercased keywords match mixed-case descriptions)
    df = df[df['description'].str.contains(keywords_str, case=False, na=False)]
    df = df[~df['description'].str.contains(nonkeywords_str, case=False, na=False)]

    print('Database filtered by keywords')

    return df
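# Illustrative matching behavior: keywords_str joins the csv entries with
# '|' (e.g. 'alloy|polymer|...|solar'), and str.contains() treats it as a
# regex "contains any of these substrings" match, so with case=False a
# description mentioning 'Solar cells' or 'superalloys' passes the keyword
# filter; nonkeywords then remove rows the same way.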
# include only recently updated FOAs
df = dff[[is_recent(i) for i in dff['lastupdateddate']]]

# include only FOAs which are not closed
df = df[[is_open(i) for i in df['closedate']]]

# sort by newest FOAs at the top
df = sort_by_recent_updates(df)

# filter by keywords
df = filter_by_keywords(df)


# %%%%%%%%%%%%%%% format string message for Slack %%%%%%%%%%%%%%%%%%%%%%


def create_slack_text(filename, df, print_text=True):
    """Create text to send into Slack"""
    # get database date from the filename, e.g. '20200915' -> '2020-09-15'
    db_date = filename.split('GrantsDBExtract')[1].split('v2')[0]
    db_date = db_date[:4]+'-'+db_date[4:6]+'-'+db_date[6:]

    # create text
    slack_text = 'Showing {} recently updated FOAs from grants.gov, extracted {}:'.format(
        len(df), db_date)
    slack_text += '\n======================================='

    base_hyperlink = r'https://www.grants.gov/web/grants/search-grants.html?keywords='

    # loop over each FOA title and add to text
    for i in range(len(df)):

        hyperlink = base_hyperlink + df['opportunitynumber'].iloc[i]

        slack_text += '\n{}) Updated: {}, Closes: {}, Title: {}, {} ({}) \n{}'.format(
            i+1,
            df['updatedate'].iloc[i],
            reformat_date(df['closedate'].iloc[i]),
            df['opportunitytitle'].iloc[i].upper(),
            df['opportunitynumber'].iloc[i],
            df['opportunityid'].iloc[i],
            hyperlink)
        slack_text += '\n----------------------------------'

    slack_text += '\nFOAs filtered by date and keywords in {}'.format(
        'https://github.com/ericmuckley/foa-finder/blob/master/keywords.csv')

    if print_text:
        print(slack_text)
    else:
        print('Slack text generated')
    return slack_text


# create the text to send to slack
slack_text = create_slack_text(filename, df)


# %%%%%%%%%%%%%%%%%%% send message to Slack %%%%%%%%%%%%%%%%%%%%%%%


def send_to_slack(slack_text):
    """Send results to a channel on Slack"""
    print('Sending results to slack')
    try:
        response = requests.post(
            os.environ['SLACK_WEBHOOK_URL'],
            data=json.dumps({'text': slack_text}),
            headers={'Content-Type': 'application/json'})
        print('Slack response: ' + str(response.text))
    except Exception as e:
        print('Connection to Slack could not be established: {}'.format(e))


# send text to slack (left commented out in the source; uncomment to post)
# send_to_slack(slack_text)
--------------------------------------------------------------------------------