├── AUTHORS.md ├── README.md ├── download_reports.py ├── fec_scraper_toolbox_sql_objects.sql ├── parse_reports.py └── update_master_files.py /AUTHORS.md: -------------------------------------------------------------------------------- 1 | # Authors 2 | Christopher Schnaars (http://www.chrisschnaars.org/, https://twitter.com/chrisschnaars) 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # FEC Scraper Toolbox 2 | The FEC Scraper Toolbox is a series of Python modules you can use to 3 | find and download electronically filed campaign finance reports housed 4 | on the Federal Election Commission website and load those reports into 5 | a database manager. 6 | 7 | Generally, the FEC Scraper Toolbox is meant to replace the [FEC Scraper](https://github.com/cschnaars/FEC-Scraper) 8 | repository. You might want to use the older repository, however, if 9 | you want to limit the scope of your database to include only specific 10 | committees. The default behavior of the FEC Scraper Toolbox is to 11 | download every available report, whereas FEC Scraper downloads reports 12 | only for the committees you specify. 13 | 14 | Presently, the Toolbox consists of three major modules. They are 15 | documented fully below, but in brief, they are: 16 | * __download_reports:__ Downloads daily compilations of electronically 17 | filed reports and consumes an RSS feed to find and download 18 | recently filed reports. 19 | * __parse_reports:__ Combines any number of reports into a single file 20 | for each form type (Schedule A, Schedule B and so on). Report header 21 | information is loaded into a database. 22 | * __update_master_files:__ Downloads daily and weekly master files 23 | housing detailed information about all candidates and committees, 24 | individual contributions and contributions from committees to 25 | candidates and other committees. 26 | 27 | The FEC Scraper Toolbox was developed under Python 2.7.4. I am presently 28 | running it under 2.7.6. 29 | 30 | ## Requirements 31 | The following modules are required to use FEC Scraper Toolbox. All of 32 | them except pyodbc, which must be installed separately, are included with a standard Python 2.7 installation: 33 | * csv 34 | * datetime 35 | * glob 36 | * linecache 37 | * multiprocessing 38 | * os 39 | * pickle 40 | * pyodbc 41 | * re 42 | * shutil 43 | * time 44 | * urllib 45 | * urllib2 46 | * zipfile 47 | 48 | ## User Settings 49 | You can add an optional usersettings.py file to the directory housing 50 | your Python modules to customize database connection strings and file 51 | locations. In each module, you'll see a try statement, where the module 52 | will attempt to load this file. Default values can be specified in the 53 | except portion of the try statement. 54 | 55 | You can copy and paste the text below into your usersettings.py file, 56 | then specify the values you want to use. 
57 | 58 | ```python 59 | ARCPROCDIR = '' # Directory to house archives that have been processed 60 | ARCSVDIR = '' # Directory to house archives that have been downloaded but not processed 61 | DBCONNSTR = '' # Database connection string 62 | MASTERDIR = '' # Master directory for weekly candidate and committee master files 63 | RPTERRDIR = '' # Directory to house error logs generated when a field can't be parsed 64 | RPTHOLDDIR = '' # Directory to house electronically filed reports that cannot be processed 65 | RPTOUTDIR = '' # Directory to house data files generated by parse_reports 66 | RPTPROCDIR = '' # Directory to house electronically filed reports that have been processed 67 | RPTRVWDIR = '' # Directory to house electronically filed reports that could not be imported and need to be reviewed 68 | RPTSVDIR = '' # Directory to house electronically filed reports that have been downloaded but not processed 69 | ``` 70 | 71 | ## download_reports Module 72 | This module tracks and downloads all electronically filed reports 73 | housed on the Federal Election Commission website. Specifically, it 74 | ensures all daily archives of reports (which go back to 2001) have been 75 | downloaded and extracted. It then consumes the FEC's RSS feed listing 76 | all reports filed within the past seven days to look for new reports. 77 | 78 | Electronic reports filed voluntarily by a handful of Senators presently 79 | are not included here. 80 | 81 | This module does not load any data or otherwise interact with a 82 | database manager (though I plan to add functionality to ping a database 83 | to build a list of previously downloaded reports rather than require 84 | the user to warehouse them). Its sole purpose is to track and download 85 | reports. 86 | 87 | If you don't want to download archives back to 2001 or otherwise want 88 | to manually control what is downloaded, you'll find commented out code 89 | below as well as in the module that you can use to modify the zipinfo.p 90 | pickle (which is described in the first bullet point below). 91 | 92 | This module goes through the following process in this order: 93 | * Uses the pickle module to attempt to load zipinfo.p, a dictionary 94 | housing the name of the most recent archive downloaded as well as a 95 | list of files not downloaded previously. Commented out code available 96 | below and in the module can be used to modify this pickle if you 97 | want to control which archives will be retrieved. 98 | * Calls build_prior_archive_list to construct a list of archives that 99 | already have been downloaded and saved to ARCPROCDIR or ARCSVDIR. 100 | __NOTE:__ I plan to deprecate this function. I added this feature for 101 | development and to test the implementation of the zipinfo.p pickle. 102 | Using the pickle saves a lot of time and disk space compared to 103 | warehousing all the archives. 104 | * Calls build_archive_download_list, which processes the zipinfo.p 105 | pickle to build a list of available archive files that have not 106 | been downloaded. 107 | * Uses multiprocessing and calls download_archive to download each 108 | archive file. These files are saved in the directory specified 109 | with the ARCSVDIR variable. After downloading an archive, the 110 | subroutine compares the length of the downloaded file with the length 111 | of the source file. If the lengths do not match, the file is deleted 112 | from the file system. The subroutine tries to download a file up to 113 | five times. 
114 | __NOTE:__ You can set the NUMPROC variable in the user variables section 115 | to specify the number of downloads that occur simultaneously. The 116 | default value is 10. 117 | * Uses multiprocessing and calls unzip_archive to extract any files in 118 | the archive that have not been downloaded previously. The second 119 | parameter is an overwrite flag; existing files are overwritten when 120 | this flag is set to 1. Default is 0. 121 | __NOTE:__ You can set the NUMPROC variable in the user variables section 122 | to specify the number of downloads that occur simultaneously. The 123 | default value is 10. 124 | * Again calls build_prior_archive_list to reconstruct the list of 125 | archives that already have been downloaded and saved to ARCPROCDIR or 126 | ARCSVDIR. 127 | __NOTE:__ As stated above, this feature is slated for deprecation. 128 | * Calls pickle_archives to rebuild zipinfo.p and save it to the same 129 | directory as this module. 130 | * Calls build_prior_report_list to build a list of reports housed in 131 | RPTHOLDDIR, RPTPROCDIR and RPTSVDIR. 132 | __NOTE:__ I plan to add a function that can build a list of previously 133 | processed files using a database call rather than combing the file 134 | system (though it will remain necessary to look in the RPTHOLDDIR and 135 | RPTSVDIR directories to find files that have not been loaded into the 136 | database). 137 | * Calls consume_rss, which uses a regular expression to scan an FEC RSS 138 | feed listing all electronically filed reports submitted within the 139 | past seven days. The function returns a list of these reports. 140 | * Calls verify_reports to test whether filings flagged for download by 141 | consume_rss already have been downloaded. If so, the function 142 | verifies the length of the downloaded file matches the length of 143 | the file posted on the FEC website. When the lengths do not match, 144 | the saved file is deleted and the report is retained in the download list. 145 | * Uses multiprocessing and calls download_report to download each 146 | report returned by verify_reports. After downloading a report, the 147 | subroutine compares the length of the downloaded file with the length 148 | of the source file. If the lengths do not match, the file is deleted 149 | from the file system. The subroutine tries to download a file up to 150 | five times. 151 | __NOTE:__ You can set the NUMPROC variable in the user variables section 152 | to specify the number of downloads that occur simultaneously. The 153 | default value is 10. 154 | 155 | ### Modifying the zipinfo Pickle 156 | Here is the commented-out code available in the download_reports module 157 | that you can use to manually control the zipinfo.p pickle if you don't 158 | want to download all available archives back to 2001: 159 | 160 | ```python 161 | # Set mostrecent to the last date you DON'T want, so if you want 162 | # everything since Jan. 1, 2013, set mostrecent to: '20121231.zip' 163 | zipinfo['mostrecent'] = '20121231.zip' # YYYYMMDD.zip 164 | zipinfo['badfiles'] = [] # You probably want to leave this blank 165 | ``` 166 | 167 | ## parse_reports Module 168 | This module is the workhorse of the Toolbox. It parses all downloaded 169 | reports in a specified directory and saves child rows for each subform 170 | type into a single data file for easy import into a database manager. 171 | 172 | One of the main challenges with parsing electronically filed reports is 173 | that each form (presently there are 56) has its own column 174 | headers. 
What's more, the layout of each form has been through as many 175 | as 13 iterations, each of which also can have its own headers. 176 | 177 | The parse_reports module handles this by examining the two header rows 178 | atop each electronically filed report to determine the form type and 179 | version of that report. (The current version is 8.1.) If the form 180 | type and version are supported by the parser, the columns in each data 181 | row are mapped to standardized columns. Generally speaking, the 182 | standardized column headings largely mimic a form's version 8.0 183 | headings, though a few legacy columns have been retained. 184 | 185 | You can modify the standardized columns at any time by manipulating the 186 | __outputhdrs__ variable. All data for any columns not included in this 187 | variable are dropped from the data stream and are not included in the 188 | output files. The headers for each version of each form type are housed 189 | in the __filehdrs__ variable in this format: 190 | 191 | ```python 192 | [[formtype-1, [[[versions], [headers]], [[versions], [headers]], [[versions], [headers]]]], 193 | [formtype-2, [[[versions], [headers]], [[versions], [headers]], [[versions], [headers]]]], 194 | ... 195 | [formtype-n, [[[versions], [headers]], [[versions], [headers]], [[versions], [headers]]]]] 196 | ``` 197 | 198 | While all child rows are parsed and saved to delimited text files, the 199 | two header rows for each file are loaded into a database manager. By 200 | default, the database manager is SQL Server, and the module calls a 201 | series of stored procedures to load the header rows into the database. 202 | However, you can easily modify this behavior to use a different 203 | database manager or save the header rows to their own text files. 204 | 205 | The main reason to use the module to load the headers is the database 206 | manager can verify each report is not already in the database. 207 | Eventually, I will add my database structure and stored procedure code 208 | to this repository to make it easier to port the functionality to 209 | interact with other database managers. 210 | 211 | The parser supports the following form types: 212 | * Form 1, Statement of Organization: Includes Form1S and Text records 213 | * Form 3, Report of Receipts and Disbursements: Filed by Congressional 214 | candidates; includes all schedules and Text records. 215 | * Form 3L, Report of Contributions Bundled by 216 | Lobbyists/Registrants: Includes all schedules and Text records. 217 | * Form 3P, Report of Receipts and Disbursements: Filed by presidential 218 | and vice-presidential candidates; includes all schedules and Text 219 | records. 220 | * Form 3X, Report of Receipts and Disbursements: Filed by all 221 | committees other than the principal campaign committees of 222 | Congressional, presidential and vice-presidential candidates; includes 223 | all schedules and Text records. 224 | 225 | Reports not recognized by parse_reports are moved to the directory 226 | specified by the RPTHOLDDIR variable. 227 | 228 | This module goes through the following process in this order: 229 | * Calls build_list_of_supported_report_types, which examines the list 230 | housed in the filehdrs variable to determine which types of 231 | electronically filed reports can be parsed by the module. 232 | * Calls create_file_timestamp, which creates a timestamp string that is 233 | affixed to the filename of each data file generated by the module. 
234 | * Creates an output file for each type of child row data (one for 235 | Schedule A, one for Schedule B, one for Text and so on) and 236 | writes the column headers to each file. These files are saved in 237 | the directory specified by RPTOUTDIR. The module also generates an 238 | "Other Data" file, where rows the module can't write to other data 239 | files are saved. 240 | * Appends various full name fields to header lists. These fields were 241 | used in older electronic filings until the FEC decided to split 242 | names across multiple fields. These extra headers are appended 243 | here because the module attempts to parse these names and does not 244 | write the full name fields to the output data files. If a name 245 | can't be parsed, it is saved to the appropriate last name field. 246 | 247 | From this point, the module iterates over each electronic filing saved 248 | in the directory specified by RPTSVDIR. For each file, the module: 249 | * Saves the six-digit filename as ImageID. This value is prepended to 250 | every child row so those rows can be mapped to the parent header 251 | row. 252 | * If the ImageID is contained in BADREPORTS, the module moves the file 253 | to the directory specified by RPTHOLDDIR, then proceeds to the next 254 | electronic filing. 255 | * Extracts the file header. The file header contains basic information 256 | about the file, such as the header version, the software used to 257 | generate the report, the delimiter used for names and a date format 258 | string. For most forms, this information is constrained to the 259 | first line of the file. But header versions 1 and 2 for all form 260 | types use multi-line file headers. When a multi-line file header 261 | is detected, the module scans from the top of the file until it 262 | finds and reads the entire header. 263 | * Extracts the report header, which is always contained on only one 264 | line, immediately below the file header. 265 | * Checks to see whether the default delimiter specified by DELIMITER 266 | can be found anywhere in the file header. If not, the default 267 | delimiter is changed to a comma. 268 | * Extracts the report type (e.g., F3PA, F3XN, F3T) and an abbreviated 269 | report type (e.g., F3P, F3X, F3) from the report header. The last 270 | letter of a Form 3 report type indicates whether the report is a 271 | New, Amended or Termination report. If the module does not support 272 | that report type, it moves the file to the RPTHOLDDIR and proceeds 273 | to the next electronic filing. 274 | * Extracts the version number from the file header. 275 | * Parses the file header. Custom code is used for header versions 1 and 276 | 2, while the parse_data_row function is used for all later, 277 | single-line headers. 278 | * Creates a dictionary to house data for the report header, adds a 279 | key for each column, then calls parse_data_row for all versions. 280 | * If for some reason the delimiter used for names is unknown, the 281 | module attempts to determine the delimiter. 282 | * Calls a custom function for each form type to validate the report 283 | header data, then calls load_report_header to load that data into a 284 | database manager. If the data can't be validated, the module will 285 | fail. If the data is valid but can't be loaded into the database 286 | (either because of an error or because the report already exists in 287 | the database), the module moves the file to the directory specified 288 | by RPTRVWDIR and proceeds to the next electronic filing. 
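
To make the header-detection steps above more concrete, here is a minimal sketch of how the two header rows of a filing can yield the version and the report type. This is not the parse_reports code itself: the peek_report name, the chr(28) field-separator assumption and the field positions are illustrative only, and filings with header versions 1 and 2 need the multi-line handling described above.

```python
# Illustrative sketch only -- not the parse_reports implementation.
# Assumes a single-line file header and a hypothetical default DELIMITER
# of chr(28); filings whose header lacks it fall back to commas.
DELIMITER = chr(28)

def peek_report(path):
    with open(path, 'r') as fec_file:
        file_header = fec_file.readline()    # first line: HDR record (format version, software, ...)
        report_header = fec_file.readline()  # second line: report header (form type, committee, ...)

    # Mirror the fallback described above: use a comma when the default
    # delimiter never appears in the file header.
    delim = DELIMITER if DELIMITER in file_header else ','

    version = file_header.split(delim)[2].strip().strip('"')    # e.g. '8.1', '6.4'
    rpttype = report_header.split(delim)[0].strip().strip('"')  # e.g. 'F3XN', 'F3PA', 'F3T'

    # For the Form 3 family, the trailing letter is the New/Amended/Termination flag.
    rptabbr = rpttype[:-1] if rpttype.startswith('F3') and rpttype[-1] in 'NAT' else rpttype

    return rpttype, rptabbr, version
```

In practice, once these values are known, the version-specific header maps in the filehdrs variable and the parse_data_row function described above do the real work of mapping each row to the standardized output columns.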
289 | 290 | As noted elsewhere, FEC Scraper Toolbox uses SQL Server as the default 291 | database manager and uses stored procedures to interact with the 292 | database. However, this is an issue only in terms of loading header 293 | data. All other data is saved to delimited text files you can load 294 | into any database manager. (The default delimiter is a tab, which you 295 | can change by setting the OUTPUTDELIMITER variable.) 296 | 297 | The main reason to load the headers from within the Python module is to 298 | verify an electronic report does not already exist in the database. 299 | It also ensures a valid parent-child relationship exists before any 300 | child rows are loaded. Saving all data, including file headers, to 301 | flat files so all the data can be imported in one set would be faster, 302 | but querying the database and adding a header one report at a time is a 303 | trade-off to ensure database viability. 304 | 305 | Once the header has been parsed and loaded into the database, the 306 | module creates a list to hold each type of child row (one file for 307 | Schedule A data, one for Schedule B data and so on). The module then 308 | iterates over the file, skipping the headers, and processes each child 309 | row as follows: 310 | * The module removes double spaces from the data. If OUTPUTDELIMITER is 311 | set to a tab, the module also converts tabs to spaces. 312 | * The data row is converted to a list. 313 | * The module looks at the first element of the list to determine the 314 | row's form type. If the type can't be determined, the row is 315 | written to the "Other Data" file. 316 | * The module calls populate_data_row_dict to build a dictionary to 317 | house the output data, mapping the version-specific headers of the 318 | data row to the headers used in the data output file. 319 | * The module calls a form type-specific function to validate and clean 320 | the data. 321 | * Full name fields, if any, are removed from the data row. 322 | * The module calls build_data_row to convert the dictionary to a list. 323 | * The list is appended to the list created to house that type of data. 324 | 325 | Once the module has finished iterating over the data for an electronic 326 | report, each line of each form type-specific list is converted to a 327 | delimited string and written to the appropriate data file before the 328 | module proceeds to the next file. 329 | 330 | At the end of the module, you'll see a call to a SQL Server stored 331 | procedure called usp_DeactivateOverlappingReports. (Again, I plan to 332 | post all my SQL Server code in this repository very soon.) Briefly, 333 | this stored procedure addresses the problem of reports with different 334 | but overlapping date ranges. Rather than delete amended reports, I use 335 | database triggers to flag the most recent report as Active and to 336 | deactivate all earlier reports whenever the date ranges are exactly the 337 | same. But there are dozens and perhaps hundreds of cases where reports 338 | with different date ranges overlap. This stored procedure addresses 339 | this problem by scrubbing all headers in the database each time this 340 | module is run. 341 | 342 | ## update_master_files Module 343 | This module can be used to download and extract the master files housed 344 | on the [FEC website](http://www.fec.gov/finance/disclosure/ftpdet.shtml). The 345 | FEC updates the candidate (cn), committee (cm) and candidate-committee 346 | linkage (ccl) files daily. 
The FEC updates the individual 347 | contributions (indiv), committee-to-committee transactions (oth) and 348 | committee-to-candidate transactions (pas2) files every Sunday evening. 349 | 350 | ### Warehousing Master Files 351 | By default, the update_master_files module archives the compressed 352 | master files (but not the extracted data files). This was done to 353 | preserve the source data during development and because the FEC 354 | overwrites the data files. 355 | 356 | To disable this behavior, set the value of the ARCHIVEFILES user 357 | variable (see the user variables section near the top of the module) to 358 | zero. When ARCHIVEFILES is set to any value other than one (1), the 359 | master files are not archived. 360 | 361 | The script assumes you will run it daily and late enough in the day to 362 | make sure the FEC has created the daily data files before you try to 363 | process them. At the time of this writing, the FEC updates the three 364 | daily files around 7:30 a.m. and the weekly files around 4 p.m. If 365 | you're scheduling a daily job to run this script, I recommend you 366 | schedule it for late evening to give yourself plenty of leeway. 367 | 368 | By default, the script will ignore the weekly files if the script is 369 | not run on a Sunday. To change this behavior and download the Sunday 370 | files on a different day of the week, set the OMITNONSUNDAYFILES 371 | variable near the top of the script to 0. 372 | 373 | ### How the update_master_files Module Works 374 | This module goes through the following process in this order: 375 | * Calls delete_data to remove all .txt and .zip files from the working 376 | directory specified by the MASTERDIR variable. (The .zip files from 377 | the previous execution of the script will be in this directory only if 378 | they were not archived when the script was last run.) 379 | * Uses the local date on the machine running the code to calculate the 380 | current election cycle and determine whether the files are being 381 | downloaded on a Sunday. If the weekday is not Sunday and 382 | OMITNONSUNDAYFILES is set to 1, the script will ignore all master files 383 | except for the candidate (cn), committee (cm) and candidate-committee 384 | linkage (ccl) files. 385 | * Uses multiprocessing and calls download_file to download each master 386 | file specified by the MASTERFILES user variable. (By default, all nine 387 | master files are downloaded.) These files are saved in the directory 388 | specified by the MASTERDIR variable. 389 | After downloading a file, the subroutine compares the length of the 390 | downloaded file with the length of the source file of the FEC 391 | website. If the lengths do not match, the file is deleted from the 392 | file system. The subroutine tries to download a file up to five times. 393 | __NOTE:__ You can set the NUMPROC variable in the user variables section 394 | to specify the number of downloads that occur simultaneously. The 395 | default value is 10. 396 | * Uses multiprocessing and calls unzip_master_file to extract the data 397 | files from each master file. If the extracted filename does not 398 | include a year reference, the subroutine appends a two-digit year. 399 | __NOTE:__ You can set the NUMPROC variable in the user variables section 400 | to specify the number of downloads that occur simultaneously. The 401 | default value is 10. 
402 | * When the ARCHIVEFILES user variable is set to 1, the module calls the 403 | archive_master_files subroutine, which creates a YYYYMMDD directory for 404 | the most recent Sunday date (if that directory does not already exist) 405 | and moves all .zip files in the MASTERDIR directory to the new 406 | directory. 407 | 408 | ### About the Master Files 409 | The FEC recreates three of the master files daily and the remaining 410 | master files every Sunday evening. Each time the files are generated, they 411 | overwrite the previously posted files. The archive filenames include a 412 | two-digit year to identify the election cycle, but the files housed in 413 | those archives often do not. For that reason, this module appends a 414 | two-digit election cycle to extracted filenames that do not include a year 415 | reference. 416 | 417 | There are nine compressed files for each election cycle. You can click 418 | the links below to view the data dictionary for a particular file: 419 | * __add:__ [New Individual Contributions](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionsbyIndividualsAdditions.shtml) 420 | Lists all contributions added to the master Individuals file in the 421 | past week. 422 | Files extracted from these archives are named 423 | addYYYY.txt. Generated every Sunday. 424 | * __ccl:__ [Candidate Committee Linkage](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCandCmteLinkage.shtml) 425 | Houses all links between candidates and committees that have 426 | been reported to the FEC. Strangely, this file does not include 427 | candidate ties to Leadership PACs, which are reported on Form 1. 428 | Files extracted from these archives are named ccl.txt. Generated 429 | daily. 430 | * __changes:__ [Individual Contribution Changes](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionsbyIndividualsChanges.shtml) 431 | Lists all transactions in the master Individuals file that have been 432 | changed during the past month. 433 | Files extracted from these archives are named chgYYYY.txt. Generated 434 | every Sunday. 435 | * __cm:__ [Committee Master File](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCommitteeMaster.shtml) 436 | Lists all committees registered with the FEC for a 437 | specific election cycle. Among other information, you can use this file 438 | to see a committee's FEC ID, name, address and treasurer. You can use 439 | the Committee Designation field (CMTE_DSGN) to find a specific 440 | committee type (such as principal campaign committees, joint 441 | fundraisers, lobbyist PACs and leadership PACs). Additionally, you can 442 | look for code O in the Committee Type field (CMTE_TP) to identify 443 | independent expenditure-only committees, commonly known as Super PACs. 444 | Files extracted from these archives are named cm.txt. Generated daily. 445 | * __cn:__ [Candidate Master File](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCandidateMaster.shtml) 446 | Lists all candidates registered with the FEC for a 447 | specific election cycle. You can use this file to see all candidates 448 | who have filed to run for a particular seat as well as information 449 | about their political parties, addresses, treasurers and FEC IDs. 450 | Files extracted from these archives are named cn.txt. Generated daily. 
451 | * __delete:__ [Deleted Individual Contributions](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionsbyIndividualsDeletes.shtml) 452 | Lists all contributions deleted from the master Individuals file in the 453 | past week. 454 | Files extracted from these archives are named delYYYY.txt. Generated 455 | every Sunday. 456 | * __indiv:__ [Individual Contributions](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionsbyIndividuals.shtml) 457 | For the most part, lists all itemized contributions of $200 or more 458 | made by INDIVIDUALS to any committee during the election cycle. Does 459 | not include most contributions from businesses, PACs and other 460 | organizations. 461 | Files extracted from these archives are named itcont.txt. Generated 462 | every Sunday. 463 | * __oth:__ [Committee-to-Committee Transactions](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCommitteetoCommittee.shtml) 464 | Lists contributions and independent expenditures made by one committee 465 | to another. 466 | Files extracted from these archives are named itoth.txt. Generated 467 | every Sunday. 468 | * __pas2:__ [Committee-to-Candidate Transactions](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionstoCandidates.shtml) 469 | Lists contributions made by a committee to a candidate. 470 | Files extracted from these archives are named itpas2.txt. Generated 471 | every Sunday. 472 | 473 | ### Using the indiv, add, changes and delete files 474 | The indiv file generated every Sunday for each election cycle is 475 | comprehensive, meaning it is a snapshot (as of that Sunday) of 476 | all individual contributions currently in the database. Many users 477 | simply drop and recreate this table every week. If you do this, you do 478 | not need the add, changes and delete files. The FEC generates these 479 | files to provide an alternative to rebuilding the indiv table each 480 | week simply because of the sheer volume of data it houses. 481 | 482 | I presently don't use any of these files and instead rely on the raw 483 | filings themselves, which are immediately available on the FEC website 484 | once they're filed and contain more data than the indiv files. But 485 | many journalists I know who do use these files say they just rebuild 486 | the indiv table each week because it's easier and less error-prone than 487 | trying to patch it. 488 | 489 | If you decide to use the add, changes and delete files rather than 490 | rebuild the indiv table each week, just be aware that if you ever miss 491 | a weekly download, you will have to rebuild the indiv table. 492 | 493 | ## Next Steps 494 | I house all of my campaign-finance data in a SQL Server database and 495 | tend to use SQL Server Integration Services packages to load the data 496 | generated/extracted by the FEC Scraper Toolbox. I do this because of 497 | the sheer volume of the data (the table housing all Schedule A data, 498 | for example, contains more than 120 million rows so far) and because I 499 | can use SSIS to bulk load the data rather than load it from Python a 500 | row at a time. 501 | 502 | The lone exception is the parse_reports module. That module attempts 503 | to load a report's header row into the database to test whether that 504 | report previously has been loaded. All child rows in new reports are 505 | parsed and moved to separate data files (one for Schedule A, one for 506 | Schedule B and so on). 
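
The Toolbox leaves the import of these child-row files to you. If you want a purely Pythonic starting point rather than SSIS, the sketch below shows one way to bulk-load a generated file; SQLite, the load_delimited_file name and the example filename are illustrative assumptions, not part of the Toolbox.

```python
# Illustrative sketch: bulk-load one of the delimited output files into a
# database. SQLite and the untyped table layout are stand-ins; swap in your
# own database manager, schema and typed columns for production use.
import csv
import sqlite3

def load_delimited_file(datafile, dbpath, table, delimiter='\t'):
    conn = sqlite3.connect(dbpath)
    with open(datafile, 'rb') as infile:  # 'rb' for the csv module under Python 2
        reader = csv.reader(infile, delimiter=delimiter)
        headers = next(reader)  # the first row of each output file holds the column headers
        cols = ', '.join('"%s"' % header for header in headers)
        params = ', '.join('?' * len(headers))
        conn.execute('CREATE TABLE IF NOT EXISTS %s (%s)' % (table, cols))
        sql = 'INSERT INTO %s (%s) VALUES (%s)' % (table, cols, params)
        conn.executemany(sql, (row for row in reader if len(row) == len(headers)))
    conn.commit()
    conn.close()

# Hypothetical usage (the output filenames include a timestamp):
# load_delimited_file('SchedA_20140407.txt', 'fec.db', 'schedule_a')
```

For production-scale loads (the Schedule A table alone runs to well over 100 million rows), a bulk-load utility native to your database manager will be far faster than row-at-a-time inserts, which is why I rely on SSIS as described above.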
507 | 508 | Understandably, the lack of code to load the data makes the Toolbox 509 | less attractive to some potential users, who must manually import the 510 | data files or develop their own Pythonic means of doing so. 511 | Nevertheless, the modules presently provide very fast and efficient 512 | means of downloading massive quantities of data, managing that data and 513 | preparing it for import. 514 | 515 | At some point, I plan to develop Python functions to handle the data 516 | imports, and of course I welcome any contributions from the open-source 517 | community. I also plan to open source my entire database design so 518 | others can recreate it in any database manager they choose. 519 | 520 | Stay tuned! 521 | -------------------------------------------------------------------------------- /download_reports.py: -------------------------------------------------------------------------------- 1 | # Download campaign finance reports 2 | # By Christopher Schnaars, USA TODAY 3 | # Developed with Python 2.7.4 4 | # See README.md for complete documentation 5 | 6 | # Import needed libraries 7 | import datetime 8 | import glob 9 | import multiprocessing 10 | import os 11 | import pickle 12 | import re 13 | import sys 14 | import urllib 15 | import urllib2 16 | import zipfile 17 | 18 | # Try to import user settings or set them explicitly 19 | try: 20 | import usersettings 21 | 22 | ARCPROCDIR = usersettings.ARCPROCDIR 23 | ARCSVDIR = usersettings.ARCSVDIR 24 | RPTHOLDDIR = usersettings.RPTHOLDDIR 25 | RPTPROCDIR = usersettings.RPTPROCDIR 26 | RPTSVDIR = usersettings.RPTSVDIR 27 | except: 28 | ARCPROCDIR = 'C:\\data\\FEC\\Archives\\Processed\\' 29 | ARCSVDIR = 'C:\\data\\FEC\\Archives\\Import\\' 30 | RPTHOLDDIR = 'C:\\data\\FEC\\Reports\\Hold\\' 31 | RPTPROCDIR = 'C:\\data\\FEC\\Reports\\Processed\\' 32 | RPTSVDIR = 'C:\\data\\FEC\\Reports\\Import\\' 33 | 34 | # Other user variables 35 | ARCFTP = 'https://cg-519a459a-0ea3-42c2-b7bc-fa1143481f74.s3-us-gov-west-1.amazonaws.com/bulk-downloads/electronic/' 36 | NUMPROC = 1 # Multiprocessing processes to run simultaneously 37 | RPTURL = 'http://docquery.fec.gov/dcdev/posted/' 38 | RSSURL = 'http://efilingapps.fec.gov/rss/generate?preDefinedFilingType=ALL' 39 | 40 | 41 | def build_archive_download_list(zipinfo, oldarchives): 42 | """ 43 | On 1/8/2018, the FEC shut down its FTP server and moved their 44 | bulk files to an Amazon S3 bucket. Rather than try to hack the 45 | JavaScript, this function now looks for files dated after the 46 | mostrecent element of the zipinfo.p pickle up to the current 47 | system date. I'm adding a try_again_later property to the 48 | pickle for .zip files that fail to download. 
49 | """ 50 | 51 | # Generate date range to look for new files 52 | start_date = datetime.datetime.strptime(zipinfo['mostrecent'].rstrip('.zip'), '%Y%m%d').date() 53 | add_day = datetime.timedelta(days=1) 54 | start_date += add_day 55 | end_date = datetime.datetime.now().date() 56 | 57 | # Create dictionary to house list of files to attempt to download 58 | downloads = [] 59 | 60 | # Add recent archive files 61 | while start_date < end_date: 62 | downloads.append(datetime.date.strftime(start_date, '%Y%m%d') + '.zip') 63 | start_date += add_day 64 | 65 | # Add try_again files 66 | for fec_file in zipinfo['try_again_later']: 67 | if fec_file not in downloads: 68 | downloads.append(fec_file) 69 | zipinfo['try_again_later'] = [] 70 | 71 | # Remove any bad files from the list 72 | for fec_file in zipinfo['badfiles']: 73 | if fec_file in downloads: 74 | downloads.remove(fec_file) 75 | 76 | # Remove previously downloaded archives 77 | downloads = [download for download in downloads if download not in oldarchives] 78 | 79 | return downloads 80 | 81 | 82 | def build_prior_archive_list(): 83 | """ 84 | Returns a list of archives that already have been downloaded and 85 | saved to ARCPROCDIR or ARCSVDIR. 86 | """ 87 | dirs = [ARCSVDIR, ARCPROCDIR] 88 | archives = [] 89 | 90 | for dir in dirs: 91 | for datafile in glob.glob(os.path.join(dir, '*.zip')): 92 | archives.append(datafile.replace(dir, '')) 93 | 94 | return archives 95 | 96 | 97 | def build_prior_report_list(): 98 | """ 99 | Returns a list of reports housed in the directories specified by 100 | RPTHOLDDIR, RPTPROCDIR and RPTSVDIR. 101 | """ 102 | dirs = [RPTHOLDDIR, RPTPROCDIR, RPTSVDIR] 103 | reports = [] 104 | 105 | for dir in dirs: 106 | for datafile in glob.glob(os.path.join(dir, '*.fec')): 107 | reports.append( 108 | datafile.replace(dir, '').replace('.fec', '')) 109 | 110 | return reports 111 | 112 | 113 | def consume_rss(): 114 | """ 115 | Returns a list of electronically filed reports included in an FEC 116 | RSS feed listing all reports submitted within the past seven days. 117 | """ 118 | regex = re.compile('http://docquery.fec.gov/dcdev/posted/' \ 119 | '([0-9]*)\.fec') 120 | opener = urllib2.build_opener() 121 | opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0')] 122 | rss = opener.open(RSSURL).read() 123 | matches = [] 124 | for match in re.findall(regex, rss): 125 | matches.append(match) 126 | 127 | return matches 128 | 129 | 130 | def download_archive(archive): 131 | """ 132 | Downloads a single archive file and saves it in the directory 133 | specified by the ARCSVDIR variable. After downloading an archive, 134 | this subroutine compares the length of the downloaded file with the 135 | length of the source file and will try to download a file up to 136 | five times when the lengths don't match. 
137 | """ 138 | src = ARCFTP + archive 139 | dest = ARCSVDIR + archive 140 | y = 0 141 | # I have added a header to my request 142 | try: 143 | # Add a header to the request 144 | request = urllib2.Request(src, headers={ 145 | 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'}) 146 | srclen = float(urllib2.urlopen(request).info().get('Content-Length')) 147 | except: 148 | y = 5 149 | 150 | while y < 5: 151 | try: 152 | # Add a header to the request 153 | urllib.URLopener.version = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0' 154 | urllib.urlretrieve(src, dest) 155 | 156 | destlen = os.path.getsize(dest) 157 | 158 | # Repeat download up to five times if files not same size 159 | if srclen != destlen: 160 | os.remove(dest) 161 | y += 1 162 | continue 163 | else: 164 | y = 6 165 | except: 166 | y += 1 167 | if y == 5: 168 | zipinfo['try_again_later'].append(archive) 169 | print(src + ' could not be downloaded.') 170 | 171 | 172 | def download_report(download): 173 | """ 174 | Downloads a single electronic report and saves it in the directory 175 | specified by the RPTSVDIR variable. After downloading a report, 176 | this subroutine compares the length of the downloaded file with the 177 | length of the source file and will try to download a file up to 178 | five times when the lengths don't match. 179 | """ 180 | # Construct file url and get length of file 181 | url = RPTURL + download + '.fec' 182 | y = 0 183 | 184 | try: 185 | # Add a header to the request 186 | request = urllib2.Request(url, headers={ 187 | 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'}) 188 | srclen = float(urllib2.urlopen(request).info().get('Content-Length')) 189 | except: 190 | y = 5 191 | 192 | filename = RPTSVDIR + download + '.fec' 193 | 194 | while y < 5: 195 | try: 196 | url_headers = {'ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 197 | 'ACCEPT_ENCODING': 'gzip, deflate, br', 198 | 'ACCEPT_LANGUAGE': 'en-US,en;q=0.5', 199 | 'USER-AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0'} 200 | request = urllib2.Request(url, headers=url_headers) 201 | response = urllib2.urlopen(request) 202 | with open(filename, 'wb') as f: 203 | f.write(response.read()) 204 | 205 | destlen = os.path.getsize(filename) 206 | 207 | # Repeat download up to five times if files not same size 208 | if srclen != destlen: 209 | os.remove(filename) 210 | y += 1 211 | continue 212 | else: 213 | y = 6 214 | except: 215 | y += 1 216 | 217 | if y == 5: 218 | print('Report ' + download + ' could not be downloaded.') 219 | sys.exit() 220 | 221 | 222 | def pickle_archives(zipinfo, archives): 223 | """ 224 | Rebuilds the zipinfo.p pickle and saves it in the same directory as 225 | this module. 226 | 227 | archives is a list of archive files available for download on the 228 | FEC website. The list is generated by the 229 | build_archive_download_list function. 
230 | """ 231 | 232 | # To calculate most recent download, omit files in try_again_later 233 | downloads = [fec_file for fec_file in archives if fec_file not in zipinfo['try_again_later']] 234 | if len(downloads) > 0: 235 | zipinfo['mostrecent'] = max(downloads) 236 | 237 | # Remove bad files older than most recent 238 | if len(zipinfo['badfiles']) > 0: 239 | most_recent_date = datetime.datetime.strptime(zipinfo['mostrecent'].rstrip('.zip'), '%Y%m%d').date() 240 | 241 | for bad_file in zipinfo['badfiles'][::-1]: 242 | bad_file_date = datetime.datetime.strptime(bad_file.rstrip('.zip'), '%Y%m%d').date() 243 | if bad_file_date < most_recent_date: 244 | zipinfo['badfiles'].remove(bad_file) 245 | 246 | pickle.dump(zipinfo, open('zipinfo.p', 'wb')) 247 | 248 | 249 | def unzip_archive(archive, overwrite=0): 250 | """ 251 | Extracts any files housed in a specific archive that have not been 252 | downloaded previously. 253 | 254 | Set the overwrite parameter to 1 if existing files should be 255 | overwritten. The default value is 0. 256 | """ 257 | destdirs = [RPTSVDIR, RPTPROCDIR, RPTHOLDDIR] 258 | try: 259 | zip = zipfile.ZipFile(ARCSVDIR + archive) 260 | for subfile in zip.namelist(): 261 | x = 1 262 | if overwrite != 1: 263 | for dir in destdirs: 264 | if x == 1: 265 | if os.path.exists(dir + subfile): 266 | x = 0 267 | if x == 1: 268 | zip.extract(subfile, destdirs[0]) 269 | 270 | zip.close() 271 | 272 | # If all files extracted correctly, move archive to Processed 273 | # directory 274 | os.rename(ARCSVDIR + archive, ARCPROCDIR + archive) 275 | 276 | except: 277 | print('Files contained in ' + archive + ' could not be ' 278 | 'extracted. The file has been deleted so it can be ' 279 | 'downloaded again later.\n') 280 | os.remove(ARCSVDIR + archive) 281 | 282 | 283 | def verify_reports(rpts, downloaded): 284 | """ 285 | Returns a list of individual reports to be downloaded. 286 | 287 | Specifically, this function compares a list of available reports 288 | that have been submitted to the FEC during the past seven days 289 | (rpts) with a list of previously downloaded reports (downloaded). 290 | 291 | For reports that already have been downloaded, the function verifies 292 | the length of the downloaded file matches the length of the file 293 | posted on the FEC website. When the lengths do not match, the saved 294 | file is deleted and retained in the download list. 
295 | """ 296 | downloads = [] 297 | for rpt in rpts: 298 | childdirs = [RPTSVDIR, RPTPROCDIR, RPTHOLDDIR] 299 | if rpt not in downloaded: 300 | downloads.append(rpt) 301 | else: 302 | try: 303 | # Add a header to the request 304 | request = urllib2.Request(RPTURL + rpt + '.fec', headers={ 305 | 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'}) 306 | srclen = float(urllib2.urlopen(request).info().get('Content-Length')) 307 | except urllib2.HTTPError: 308 | print(RPTURL + rpt + '.fec could not be downloaded.') 309 | continue 310 | 311 | for child in childdirs: 312 | try: 313 | destlen = os.path.getsize(child + rpt + '.fec') 314 | if srclen != destlen: 315 | downloads.append(rpt) 316 | os.remove(child + rpt + '.fec') 317 | except: 318 | pass 319 | 320 | return downloads 321 | 322 | 323 | if __name__ == '__main__': 324 | # Attempt to fetch data specifying missing .zip files and most 325 | # recent .zip file downloaded 326 | print('Attempting to retrieve information for previously ' 327 | 'downloaded archives...') 328 | try: 329 | zipinfo = pickle.load(open("zipinfo.p", "rb")) 330 | # Make sure new try_again_later key exists 331 | if 'try_again_later' not in zipinfo.keys(): 332 | zipinfo['try_again_later'] = [] 333 | print('Information retrieved successfully.\n') 334 | except: 335 | zipinfo = {'mostrecent': '20010403.zip', 336 | 'badfiles': ['20010408.zip', '20010428.zip', '20010429.zip', '20010505.zip', '20010506.zip', 337 | '20010512.zip', '20010526.zip', '20010527.zip', '20010528.zip', '20010624.zip', 338 | '20010812.zip', '20010826.zip', '20010829.zip', '20010902.zip', '20010915.zip', 339 | '20010929.zip', '20010930.zip', '20011013.zip', '20011014.zip', '20011028.zip', 340 | '20011123.zip', '20011124.zip', '20011125.zip', '20011201.zip', '20011202.zip', 341 | '20011215.zip', '20011223.zip', '20011229.zip', '20030823.zip', '20030907.zip', 342 | '20031102.zip', '20031129.zip', '20031225.zip', '20040728.zip', '20040809.zip', 343 | '20040921.zip', '20040922.zip', '20041127.zip', '20050115.zip', '20050130.zip', 344 | '20050306.zip', '20050814.zip', '20050904.zip', '20051106.zip', '20051225.zip', 345 | '20060210.zip', '20060318.zip', '20060319.zip', '20060320.zip', '20061224.zip', 346 | '20070507.zip', '20071028.zip', '20081225.zip', '20091226.zip', '20111203.zip', 347 | '20120701.zip', '20121215.zip', '20121225.zip', '20130703.zip', '20130802.zip', 348 | '20130825.zip', '20130914.zip', '20131109.zip', '20150207.zip', '20150525.zip'], 349 | 'try_again_later': ['20001015.zip', '20010201-20010403.zip']} 350 | print('zipinfo.p not found. 
Starting from scratch...\n') 351 | 352 | # Build a list of previously downloaded archives 353 | print('Building a list of previously downloaded archive files...') 354 | oldarchives = build_prior_archive_list() 355 | print('Done!\n') 356 | 357 | # Go to FEC site and fetch a list of .zip files available 358 | print('Compiling a list of archives available for download...') 359 | archives = build_archive_download_list(zipinfo, oldarchives) 360 | if len(archives) == 0: 361 | print('No new archives found.\n') 362 | # If any files returned, download them using multiprocessing 363 | else: 364 | print('Done!\n') 365 | print('Downloading ' + str(len(archives)) 366 | + ' new archive(s)...') 367 | pool = multiprocessing.Pool(processes=NUMPROC) 368 | for archive in archives: 369 | pool.apply_async(download_archive(archive)) 370 | pool.close() 371 | pool.join() 372 | print('Done!\n') 373 | 374 | # Open each archive and extract new reports 375 | print('Extracting files from archives...') 376 | pool = multiprocessing.Pool(processes=NUMPROC) 377 | for archive in archives: 378 | # Make sure archive was downloaded 379 | if os.path.isfile(ARCSVDIR + archive): 380 | pool.apply_async(unzip_archive(archive, 0)) 381 | pool.close() 382 | pool.join() 383 | print('Done!\n') 384 | 385 | # Rebuild zipinfo and save with pickle 386 | print('Repickling the archives. Adding salt and vinegar...') 387 | pickle_archives(zipinfo, archives) 388 | print('Done!\n') 389 | 390 | # Build list of previously downloaded reports 391 | print('Building a list of previously downloaded reports...') 392 | downloaded = build_prior_report_list() 393 | print('Done!\n') 394 | 395 | # Consume FEC's RSS feed to get list of files posted in the past 396 | # seven days 397 | print('Consuming FEC RSS feed to find new reports...') 398 | rpts = consume_rss() 399 | print('Done! ' + str(len(rpts)) + ' reports found.\n') 400 | 401 | # See whether each file flagged for download already has been 402 | # downloaded. If it has, verify the downloaded file is the correct 403 | # length. 404 | print('Compiling list of reports to download...') 405 | newrpts = verify_reports(rpts, downloaded) 406 | print('Done! ' + str(len(newrpts)) + ' reports flagged for ' 407 | 'download.\n') 408 | 409 | # Download each of these reports 410 | print('Downloading new reports...') 411 | pool = multiprocessing.Pool(processes=NUMPROC) 412 | for rpt in newrpts: 413 | # download_report(rpt) 414 | pool.apply_async(download_report(rpt)) 415 | pool.close() 416 | pool.join() 417 | print('Done!\n') 418 | print('Process completed.') 419 | -------------------------------------------------------------------------------- /fec_scraper_toolbox_sql_objects.sql: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cschnaars/FEC-Scraper-Toolbox/0eec4758150945ff1e4f05bc15b903731741b2fe/fec_scraper_toolbox_sql_objects.sql -------------------------------------------------------------------------------- /update_master_files.py: -------------------------------------------------------------------------------- 1 | # Download zipped FEC master files 2 | # By Christopher Schnaars, USA TODAY 3 | # Developed with Python 2.7.4 4 | # See README.md for complete documentation 5 | 6 | # WARNING: 7 | # -------- 8 | # If you automate the execution of this script, you should set it to 9 | # run in the late evening to make sure you don't download any master 10 | # files before the FEC has a chance to update them. 
At the time of this 11 | # writing, the FEC was updating the candidate, committee and 12 | # candidate-committee linkage files daily around 7:30 a.m. while 13 | # other weekly files were updated a little before 4 p.m. on Sundays. 14 | 15 | # Development Notes: 16 | # ------------------ 17 | # 4/7/2014: Updated code so files can be downloaded daily. The FEC began 18 | # publishing daily updates to the candidate, committee and 19 | # candidate-committee linkage files. Other master files continue 20 | # to be updated weekly on Sundays. 21 | 22 | # Import needed libraries 23 | from datetime import datetime, timedelta 24 | import glob 25 | import multiprocessing 26 | import os 27 | import urllib 28 | import urllib2 29 | import zipfile 30 | 31 | # Try to import user settings or set them explicitly 32 | try: 33 | import usersettings 34 | MASTERDIR = usersettings.MASTERDIR 35 | except: 36 | MASTERDIR = 'C:\\data\\FEC\\Master\\' 37 | 38 | # Other user variables 39 | ARCHIVEFILES = 1 # Set to 0 if you don't want to archive the master files each week. 40 | MASTERFTP = 'https://cg-519a459a-0ea3-42c2-b7bc-fa1143481f74.s3-us-gov-west-1.amazonaws.com/bulk-downloads/' 41 | MASTERFILES = ['ccl', 'cm', 'cn', 'indiv', 'oth', 'pas2', 'oppexp'] 42 | NUMPROC = 10 # Multiprocessing processes to run simultaneously 43 | STARTCYCLE = 2002 # Oldest election cycle for which you want to download master files 44 | OMITNONSUNDAYFILES = 1 # Set to 0 to download all files regardless of day of week 45 | 46 | 47 | def archive_master_files(): 48 | """ 49 | Moves current master files to archive directory. The 50 | archivedate parameter specifies the most recent Sunday date. If the 51 | archive directory does not exist, this subroutine creates it. 52 | """ 53 | # Create timestamp 54 | timestamp = datetime.now().strftime("%Y%m%d") 55 | 56 | # Create archive directory if it doesn't exist 57 | savedir = MASTERDIR + 'Archive\\' + timestamp + '\\' 58 | if not os.path.isdir(savedir): 59 | try: 60 | os.mkdir(savedir) 61 | except: 62 | pass 63 | 64 | # Move all the files 65 | for datafile in glob.glob(os.path.join(MASTERDIR, '*.zip')): 66 | os.rename(datafile, datafile.replace(MASTERDIR, savedir)) 67 | 68 | 69 | def create_timestamp(): 70 | filetime = datetime.datetime.now() 71 | return filetime.strftime('%Y%m%d') 72 | 73 | 74 | def delete_files(dir, ext): 75 | """ 76 | Deletes all files in the specified directory with the specified 77 | file extension. In this module, it is used to delete all text files 78 | extracted from the previous week's archives prior to downloading 79 | the new archives. These files are housed in the directory 80 | specified by MASTERDIR. 81 | 82 | When ARCHIVEFILES is set to 0, this subroutine also is used 83 | to delete all archive files from the MASTERDIR directory. 84 | """ 85 | # Remove asterisks and periods from specified extension 86 | ext = '*.' + ext.lstrip('*.') 87 | 88 | # Delete all files 89 | for datafile in glob.glob(os.path.join(dir, ext)): 90 | os.remove(datafile) 91 | 92 | 93 | def download_file(src, dest): 94 | """ 95 | Downloads a single master file (src) and saves it as dest. After 96 | downloading a file, this subroutine compares the length of the 97 | downloaded file with the length of the source file and will try to 98 | download a file up to five times when the lengths don't match. 99 | """ 100 | y = 0 101 | try: 102 | # Add a header to the request. 
103 | request = urllib2.Request(src, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 SE 2.X MetaSr 1.0'}) 104 | srclen = float(urllib2.urlopen(request).info().get('Content-Length')) 105 | except: 106 | y = 5 107 | while y < 5: 108 | try: 109 | urllib.URLopener.version = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0' 110 | urllib.urlretrieve(src, dest) 111 | destlen = os.path.getsize(dest) 112 | 113 | # Repeat download up to five times if files not same size 114 | if srclen != destlen: 115 | os.remove(dest) 116 | y += 1 117 | continue 118 | else: 119 | y = 6 120 | except: 121 | y += 1 122 | if y == 5: 123 | print(src + ' could not be downloaded.') 124 | 125 | 126 | def unzip_master_file(masterfile): 127 | """ 128 | Extracts the data file from a single weekly master file archive. 129 | If the extracted file does not include a year reference, this 130 | subroutine appends a two-digit year to the extracted filename. 131 | """ 132 | fileyear = masterfile[masterfile.find('.zip')-2:masterfile.find('.zip')] 133 | 134 | try: 135 | zip = zipfile.ZipFile(masterfile) 136 | for subfile in zip.namelist(): 137 | zip.extract(subfile, MASTERDIR) 138 | # Rename the file if it does not include the year 139 | if subfile.find(fileyear + '.txt') == -1: 140 | savefile = MASTERDIR + subfile 141 | os.rename(savefile, savefile.replace('.txt', fileyear + '.txt')) 142 | 143 | except: 144 | print('Files contained in ' + masterfile + ' could not be extracted.') 145 | 146 | 147 | if __name__=='__main__': 148 | 149 | # Delete text files extracted from an earlier archive 150 | print('Deleting old data...') 151 | delete_files(MASTERDIR, 'txt') 152 | 153 | # Delete old archives if they're still in the working 154 | # directory. These files are moved to another directory 155 | # (archived) below when ARCHIVEFILES is set to 1. 
156 | delete_files(MASTERDIR, 'zip') 157 | print('Done!\n') 158 | 159 | # Use multiprocessing to download master files 160 | print('Downloading master files...\n') 161 | pool = multiprocessing.Pool(processes=NUMPROC) 162 | 163 | # Determine whether today is Sunday 164 | sunday = False 165 | if datetime.now().weekday() == 6: 166 | sunday = True 167 | 168 | # Remove all files but cn, cm and ccl from MASTERFILES 169 | if sunday == False and OMITNONSUNDAYFILES == 1: 170 | files = [] 171 | for fecfile in ['ccl', 'cm', 'cn']: 172 | if fecfile in MASTERFILES: 173 | files.append(fecfile) 174 | MASTERFILES = files 175 | 176 | # Calculate current election cycle 177 | maxyear = datetime.now().year 178 | # Add one if it's not an even-numbered year 179 | if maxyear / 2 * 2 < maxyear: maxyear += 1 180 | 181 | # Create loop to iterate through FEC ftp directories 182 | for x in range(STARTCYCLE, maxyear + 2, 2): 183 | fecdir = MASTERFTP + str(x) + '/' 184 | 185 | for thisfile in MASTERFILES: 186 | currfile = thisfile + str(x)[2:] + '.zip' 187 | fecfile = fecdir + currfile 188 | savefile = MASTERDIR + currfile 189 | pool.apply_async(download_file(fecfile, savefile)) 190 | pool.close() 191 | pool.join() 192 | print('Done!\n') 193 | 194 | # Use multiprocessing to extract data files from the archives 195 | print('Unzipping files...') 196 | pool = multiprocessing.Pool(processes=NUMPROC) 197 | 198 | for fecfile in glob.glob(os.path.join(MASTERDIR, '*.zip')): 199 | pool.apply_async(unzip_master_file(fecfile)) 200 | pool.close() 201 | pool.join() 202 | print('Done!\n') 203 | 204 | # Archive files when ARCHIVEFILES == 1 205 | # Otherwise delete files 206 | if ARCHIVEFILES == 1: 207 | print('Archiving data files...') 208 | archive_master_files() 209 | print('Done!\n') 210 | 211 | print('Process complete.') 212 | 213 | --------------------------------------------------------------------------------