├── AUTHORS.md
├── README.md
├── download_reports.py
├── fec_scraper_toolbox_sql_objects.sql
├── parse_reports.py
└── update_master_files.py
/AUTHORS.md:
--------------------------------------------------------------------------------
1 | # Authors
2 | Christopher Schnaars (http://www.chrisschnaars.org/, https://twitter.com/chrisschnaars)
3 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # FEC Scraper Toolbox
2 | The FEC Scraper Toolbox is a series of Python modules you can use to
3 | find and download electronically filed campaign finance reports housed
4 | on the Federal Election Commission website and load those reports into
5 | a database manager.
6 |
7 | Generally, the FEC Scraper Toolbox is meant to replace the [FEC Scraper](https://github.com/cschnaars/FEC-Scraper)
8 | repository. You might want to use the older repository, however, if
9 | you want to limit the scope of your database to include only specific
10 | committees. The default behavior of the FEC Scraper Toolbox is to
11 | download every available report whereas FEC Scraper downloads reports
12 | only for the committees you specify.
13 |
14 | Presently, the Toolbox consists of three major modules, each documented
15 | fully below. In brief, they are:
16 | * __download_reports:__ Downloads daily compilations of electronically
17 | filed reports and consumes an RSS feed to find and download
18 | recently filed reports.
19 | * __parse_reports:__ Combines any number of reports into a single file
20 | for each form type (Schedule A, Schedule B and so on). Report header
21 | information is loaded into a database.
22 | * __update_master_files:__ Downloads daily and weekly master files
23 | housing detailed information about all candidates and committees,
24 | individual contributions and contributions from committees to
25 | candidates and other committees.
26 |
27 | FEC Scraper Toolbox was developed under Python 2.7.4. I presently am
28 | running it under 2.7.6.
29 |
30 | ## Requirements
31 | The following modules are required to use FEC Scraper Toolbox. All of
32 | them except pyodbc ship with a standard Python 2.7 installation:
33 | * csv
34 | * datetime
35 | * glob
36 | * linecache
37 | * multiprocessing
38 | * os
39 | * pickle
40 | * pyodbc
41 | * re
42 | * shutil
43 | * time
44 | * urllib
45 | * urllib2
46 | * zipfile
47 |
48 | ## User Settings
49 | You can add an optional usersettings.py file to the directory housing
50 | your Python modules to customize database connection strings and file
51 | locations. In each module, you'll see a try statement, where the module
52 | will attempt to load this file. Default values can be specified in the
53 | except portion of the try statement.
54 |
55 | You can copy and paste the text below into your usersettings.py file,
56 | then specify the values you want to use.
57 |
58 | ```python
59 | ARCPROCDIR = '' # Directory to house archives that have been processed
60 | ARCSVDIR = '' # Directory to house archives that have been downloaded but not processed
61 | DBCONNSTR = '' # Database connection string
62 | MASTERDIR = '' # Master directory for weekly candidate and committee master files
63 | RPTERRDIR = '' # Directory to house error logs generated when a field can't be parsed
64 | RPTHOLDDIR = '' # Directory to house electronically filed reports that cannot be processed
65 | RPTOUTDIR = '' # Directory to house data files generated by parse_reports
66 | RPTPROCDIR = '' # Directory to house electronically filed reports that have been processed
67 | RPTRVWDIR = '' # Directory to house electronically filed reports that could not be imported and need to be reviewed
68 | RPTSVDIR = '' # Directory to house electronically filed reports that have been downloaded but not processed
69 | ```
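
Each module wraps this import in a try/except and falls back to
hard-coded defaults when usersettings.py is absent. A minimal sketch of
that pattern as it appears at the top of the modules (here narrowed to
ImportError; the fallback paths are just examples):

```python
try:
    import usersettings

    MASTERDIR = usersettings.MASTERDIR
    RPTSVDIR = usersettings.RPTSVDIR
except ImportError:
    # Defaults used when no usersettings.py is found
    MASTERDIR = 'C:\\data\\FEC\\Master\\'
    RPTSVDIR = 'C:\\data\\FEC\\Reports\\Import\\'
```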
70 |
71 | ## download_reports Module
72 | This module tracks and downloads all electronically filed reports
73 | housed on the Federal Election Commission website. Specifically, it
74 | ensures all daily archives of reports (which go back to 2001) have been
75 | downloaded and extracted. It then consumes the FEC's RSS feed listing
76 | all reports filed within the past seven days to look for new reports.
77 |
78 | Electronic reports filed voluntarily by a handful of Senators presently
79 | are not included here.
80 |
81 | This module does not load any data or otherwise interact with a
82 | database manager (though I plan to add functionality to ping a database
83 | to build a list of previously downloaded reports rather than require
84 | the user to warehouse them). Its sole purpose is to track and download
85 | reports.
86 |
87 | If you don't want to download archives back to 2001 or otherwise want
88 | to manually control what is downloaded, you'll find commented out code
89 | below as well as in the module that you can use to modify the zipinfo.p
90 | pickle (which is described in the first bullet point below).
91 |
92 | This module goes through the following process in this order:
93 | * Uses the pickle module to attempt to load zipinfo.p, a dictionary
94 | housing the name of the most recent archive downloaded as well as a
95 | list of files not downloaded previously. Commented out code available
96 | below and in the module can be used to modify this pickle if you
97 | want to control which archives will be retrieved.
98 | * Calls build_prior_archive_list to construct a list of archives that
99 | already have been downloaded and saved to ARCPROCDIR or ARCSVDIR.
100 | __NOTE:__ I plan to deprecate this function. I added this feature for
101 | development and to test the implementation of the zipinfo.p pickle.
102 | Using the pickle saves a lot of time and disk space compared to
103 | warehousing all the archives.
104 | * Calls build_archive_download_list, which processes the zipinfo.p
105 | pickle to build a list of available archive files that have not
106 | been downloaded.
107 | * Uses multiprocessing and calls download_archive to download each
108 | archive file. These files are saved in the directory specified
109 | with the ARCSVDIR variable. After downloading an archive, the
110 | subroutine compares the length of the downloaded file with the length
111 | of the source file. If the lengths do not match, the file is deleted
112 | from the file system. The subroutine tries to download a file up to
113 | five times.
114 | __NOTE:__ You can set the NUMPROC variable in the user variables section
115 | to specify the number of downloads that occur simultaneously. The
116 | default value in this module is 1.
117 | * Uses multiprocessing and calls unzip_archive to extract any files in
118 | the archive that have not been downloaded previously. The second
119 | parameter is an overwrite flag; existing files are overwritten when
120 | this flag is set to 1. Default is 0.
121 | __NOTE:__ You can set the NUMPROC variable in the user variables section
122 | to specify the number of extractions that occur simultaneously. The
123 | default value in this module is 1.
124 | * Again calls build_prior_archive_list to reconstruct the list of
125 | archives that already have been downloaded and saved to ARCPROCDIR or
126 | ARCSVDIR.
127 | __NOTE:__ As stated above, this feature is slated for deprecation.
128 | * Calls pickle_archives to rebuild zipinfo.p and save it to the same
129 | directory as this module.
130 | * Calls build_prior_report_list to build a list of reports housed in
131 | RPTHOLDDIR, RPTPROCDIR and RPTSVDIR.
132 | __NOTE:__ I plan to add a function that can build a list of previously
133 | processed files using a database call rather than combing the file
134 | system (though it will remain necessary to look in the RPTHOLDDIR and
135 | RPTSVDIR directories to find files that have not been loaded into the
136 | database).
137 | * Calls consume_rss, which uses a regular expression to scan an FEC RSS
138 | feed listing all electronically filed reports submitted within the
139 | past seven days. The function returns a list of these reports.
140 | * Calls verify_reports to test whether filings flagged for download by
141 | consume_rss already have been downloaded. If so, the function
142 | verifies the length of the downloaded file matches the length of
143 | the file posted on the FEC website. When the lengths do not match,
144 | the saved file is deleted and the report is kept in the download list.
145 | * Uses multiprocessing and calls download_report to download each
146 | report returned by verify_reports. After downloading a report, the
147 | subroutine compares the length of the downloaded file with the length
148 | of the source file. If the lengths do not match, the file is deleted
149 | from the file system. The subroutine tries to download a file up to
150 | five times.
151 | __NOTE:__ You can set the NUMPROC variable in the user variables section
152 | to specify the number of downloads that occur simultaneously. The
153 | default value in this module is 1.
154 |
155 | ### Modifying the zipinfo Pickle
156 | Here is the commented-out code available in the download_reports module
157 | that you can use to manually control the zipinfo.p pickle if you don't
158 | want to download all available archives back to 2001:
159 |
160 | ```python
161 | # Set mostrecent to the last date you DON'T want, so if you want
162 | # everything since Jan. 1, 2013, set mostrecent to: '20121231.zip'
163 | zipinfo['mostrecent'] = '20121231.zip' # YYYYMMDD.zip
164 | zipinfo['badfiles'] = [] # You probably want to leave this blank
165 | ```
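
If you have already run the module at least once, you can make the same
change to the existing pickle with a short one-off script instead of
editing the module (a sketch; it assumes zipinfo.p sits in the same
directory as the module and keeps the try_again_later list used by
current versions of the code):

```python
import pickle

# Load the existing pickle, move the high-water mark, then save it back
zipinfo = pickle.load(open('zipinfo.p', 'rb'))
zipinfo['mostrecent'] = '20121231.zip'  # last date you DON'T want
zipinfo.setdefault('try_again_later', [])  # key expected by newer code
pickle.dump(zipinfo, open('zipinfo.p', 'wb'))
```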
166 |
167 | ## parse_reports Module
168 | This module is the workhorse of the Toolbox. It parses all downloaded
169 | reports in a specified directory and saves child rows for each subform
170 | type into a single data file for easy import into a database manager.
171 |
172 | One of the main challenges with parsing electronically filed reports is
173 | that each form — presently there are 56 — has its own column
174 | headers. What's more, the layout of each form has been through as many
175 | as 13 iterations, each of which also can have its own headers.
176 |
177 | The parse_reports module handles this by examining the two header rows
178 | atop each electronically filed report to determine the form type and
179 | version of that report. (The current version is 8.1.) If the form
180 | type and version are supported by the parser, the columns in each data
181 | row are mapped to standardized columns. Generally speaking, the
182 | standardized column headings largely mimic a form's version 8.0
183 | headings, though a few legacy columns have been retained.
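
For a sense of what that inspection involves, here is a much-simplified
sketch of peeking at those two rows. The assumptions are mine: chr(28)
as the field separator for recent filings, a comma for older ones, and
illustrative field positions; parse_reports also handles the multi-line
file headers described later in this section.

```python
DELIMITER = chr(28)  # assumed field separator for recent filings

with open('123456.fec', 'rb') as fec:  # hypothetical ImageID
    fileheader = fec.readline()
    reportheader = fec.readline()

# Fall back to the older comma-delimited layout if needed
if DELIMITER not in fileheader:
    DELIMITER = ','

version = fileheader.split(DELIMITER)[2].strip()    # e.g. '8.1'
rpttype = reportheader.split(DELIMITER)[0].strip()  # e.g. 'F3XN'
```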
184 |
185 | You can modify the standardized columns at any time by manipulating the
186 | __outputhdrs__ variable. All data for any columns not included in this
187 | variable are dropped from the data stream and are not included in the
188 | output files. The headers for each version of each form type are housed
189 | in the __filehdrs__ variable in this format:
190 |
191 | ```python
192 | [[formtype-1, [[[versions], [headers]], [[versions], [headers]], [[versions], [headers]]]],
193 | [formtype-2, [[[versions], [headers]], [[versions], [headers]], [[versions], [headers]]]],
194 | ...
195 | [formtype-n, [[[versions], [headers]], [[versions], [headers]], [[versions], [headers]]]]]
196 | ```
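
A hypothetical entry, just to make the nesting concrete (the form name,
version strings and headers below are placeholders, not the actual
mappings in parse_reports):

```python
['SA',  # one form type
 [[['6.4', '7.0'], ['ImageID', 'FormType', 'FilerID', 'Amount']],    # headers for these versions
  [['8.0', '8.1'], ['ImageID', 'FormType', 'FilerID', 'ContAmt']]]]  # headers for later versions
```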
197 |
198 | While all child rows are parsed and saved to delimited text files, the
199 | two header rows for each file are loaded into a database manager. By
200 | default, the database manager is SQL Server, and the module calls a
201 | series of stored procedures to load the header rows into the database.
202 | However, you can easily modify this behavior to use a different
203 | database manager or save the header rows to their own text files.
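
To give a flavor of that step, here is a heavily simplified pyodbc
sketch. The connection string comes from DBCONNSTR, but the procedure
name, parameters and values shown here are hypothetical and do not
match the actual calls in parse_reports:

```python
import pyodbc

# Hypothetical values; parse_reports builds these from each report
DBCONNSTR = 'DRIVER={SQL Server};SERVER=localhost;DATABASE=FEC;Trusted_Connection=yes'
imageid, formtype = 123456, 'F3XN'

conn = pyodbc.connect(DBCONNSTR)
cursor = conn.cursor()
cursor.execute('EXEC usp_AddReportHeader ?, ?', imageid, formtype)  # placeholder procedure
conn.commit()
conn.close()
```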
204 |
205 | The main reason to use the module to load the headers is that the
206 | database manager can verify each report is not already in the database.
207 | My database structure and stored procedure code are housed in
208 | fec_scraper_toolbox_sql_objects.sql in this repository to make it easier
209 | to port the functionality to other database managers.
210 |
211 | The parser supports the following form types:
212 | * Form 1, Statement of Organization: Includes Form1S and Text records
213 | * Form 3, Report of Receipts and Disbursements: Filed by Congressional
214 | candidates; includes all schedules and Text records.
215 | * Form 3L, Report of Contributions Bundled by
216 | Lobbyists/Registrants: Includes all schedules and Text records.
217 | * Form 3P, Report of Receipts and Disbursements: Filed by presidential
218 | and vice-presidential candidates; includes all schedules and Text
219 | records.
220 | * Form 3X, Report of Receipts and Disbursements: Filed by all
221 | committees other than the principal campaign committees of
222 | Congressional, presidential and vice-presidential candidates; includes
223 | all schedules and Text records.
224 |
225 | Reports not recognized by parse_reports are moved to the directory
226 | specified by the RPTHOLDDIR variable.
227 |
228 | This module goes through the following process in this order:
229 | * Calls build_list_of_supported_report_types, which examines the list
230 | housed in the filehdrs variable to determine which types of
231 | electronically filed reports can be parsed by the module.
232 | * Calls create_file_timestamp, which creates a timestamp string that is
233 | affixed to the filename of each data file generated by the module.
234 | * Creates an output file for each type of child row data — one for
235 | Schedule A, one for Schedule B, one for Text and so on — and
236 | writes the column headers to each file. These files are saved in
237 | the directory specified by RPTOUTDIR. The module also generates an
238 | "Other Data" file, where rows the module can't write to other data
239 | files are saved.
240 | * Appends various full name fields to header lists. These fields were
241 | used in older electronic filings until the FEC decided to split
242 | names across multiple fields. These extra headers are appended
243 | here because the module attempts to parse these names and does not
244 | write the full name fields to the output data files. If a name
245 | can't be parsed, it is saved to the appropriate last name field.
246 |
247 | From this point, the module iterates over each electronic filing saved
248 | in the directory specified by RPTSVDIR. For each file, the module:
249 | * Saves the six-digit filename as ImageID. This value is prepended to
250 | every child row so those rows can be mapped to the parent header
251 | row.
252 | * If the ImageID is contained in BADREPORTS, the module moves the file
253 | to the directory specified by RPTHOLDDIR, then proceeds to the next
254 | electronic filing.
255 | * Extracts the file header. The file header contains basic information
256 | about the file, such as the header version, the software used to
257 | generate the report, the delimiter used for names and a date format
258 | string. For most forms, this information is constrained to the
259 | first line of the file. But header versions 1 and 2 for all form
260 | types use multi-line file headers. When a multi-line file header
261 | is detected, the module scans from the top of the file until it
262 | finds and reads the entire header.
263 | * Extracts the report header, which is always contained on only one
264 | line, immediately below the file header.
265 | * Checks to see whether the default delimiter specified by DELIMITER
266 | can be found anywhere in the file header. If not, the default
267 | delimiter is changed to a comma.
268 | * Extracts the report type (e.g., F3PA, F3XN, F3T) and an abbreviated
269 | report type (e.g., F3P, F3X, F3) from the report header. The last
270 | letter of a Form 3 report type indicates whether the report is a
271 | New, Amended or Termination report. If the module does not support
272 | that report type, it moves the file to the RPTHOLDDIR and proceeds
273 | to the next electronic filing.
274 | * Extracts the version number from the file header.
275 | * Parses the file header. Custom code is used for header versions 1 and
276 | 2 while the parse_data_row function is used for all later,
277 | single-line headers.
278 | * Creates a dictionary to house data for the report header, adds a
279 | key for each column, then calls parse_data_row for all versions.
280 | * If for some reason the delimiter used for names is unknown, the
281 | module attempts to determine the delimiter.
282 | * Calls a custom function for each form type to validate the report
283 | header data, then calls load_report_header to load that data into a
284 | database manager. If the data can't be validated, the module will
285 | fail. If the data is valid but can't be loaded into the database
286 | (either because of an error or because the report already exists in
287 | the database), the module moves the file to the directory specified
288 | by RPTRVWDIR and proceeds to the next electronic filing.
289 |
290 | As noted elsewhere, FEC Scraper Toolbox uses SQL Server as the default
291 | database manager and uses stored procedures to interact with the
292 | database. However, this is an issue only in terms of loading header
293 | data. All other data is saved to delimited text files you can load
294 | into any database manager. (The default delimiter is a tab, which you
295 | can change by setting the OUTPUTDELIMITER variable.)
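
For example, one of the generated files can be read back with the csv
module before you load it elsewhere (the filename is hypothetical;
actual names carry the timestamp produced by create_file_timestamp):

```python
import csv

# Read a tab-delimited Schedule A data file produced by parse_reports
with open('ScheduleA_20140101120000.txt', 'rb') as datafile:  # hypothetical name
    reader = csv.reader(datafile, delimiter='\t')
    headers = next(reader)  # first row holds the standardized column headers
    for row in reader:
        record = dict(zip(headers, row))
```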
296 |
297 | The main reason to load the headers from within the Python module is to
298 | verify an electronic report does not already exist in the database.
299 | It also ensures a valid parent-child relationship exists before any
300 | child rows are loaded. Saving all data, including file headers, to
301 | flat files so everything could be bulk imported would be faster, but
302 | querying the database and adding a header one report at a time is a
303 | trade-off made to protect the integrity of the database.
304 |
305 | Once the header has been parsed and loaded into the database, the
306 | module creates a list to hold each type of child row (one file for
307 | Schedule A data, one for Schedule B data and so on). The module then
308 | iterates over the file, skipping the headers, and processes each child
309 | row as follows:
310 | * The module removes double spaces from the data. If OUTPUTDELIMITER is
311 | set to a tab, the module also converts tabs to spaces.
312 | * The data row is converted to a list.
313 | * The module looks at the first element of the list to determine the
314 | row's form type. If the type can't be determined, the row is
315 | written to the "Other Data" file.
316 | * The module calls populate_data_row_dict to build a dictionary to
317 | house the output data, mapping the version-specific headers of the
318 | data row to the headers used in the data output file.
319 | * The module calls a form type-specific function to validate and clean
320 | the data.
321 | * Full name fields, if any, are removed from the data row.
322 | * The module calls build_data_row to convert the dictionary to a list.
323 | * The list is appended to the list created to house that type of data.
324 |
325 | Once the module has finished iterating over the data for an electronic
326 | report, each line of each form type-specific list is converted to a
327 | delimited string and written to the appropriate data file before the
328 | module proceeds to the next file.
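
Conceptually, that write-out step boils down to something like the
snippet below (a simplification that assumes OUTPUTDELIMITER is the
default tab; the row values and filename are made up):

```python
OUTPUTDELIMITER = '\t'  # assumed default

# Hypothetical buffered Schedule A rows built while iterating one filing
schedule_a_rows = [['123456', 'SA11AI', 'SMITH', 'JOHN', '250.00']]

with open('ScheduleA_20140101120000.txt', 'ab') as outfile:  # hypothetical name
    for row in schedule_a_rows:
        outfile.write(OUTPUTDELIMITER.join(row) + '\n')
```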
329 |
330 | At the end of the module, you'll see a call to a SQL Server stored
331 | procedure called usp_DeactivateOverlappingReports. (My SQL Server code
332 | is housed in fec_scraper_toolbox_sql_objects.sql in this repository.) Briefly,
333 | this stored procedure addresses the problem of reports with different
334 | but overlapping date ranges. Rather than delete amended reports, I use
335 | database triggers to flag the most recent report as Active and to
336 | deactivate all earlier reports whenever the date ranges are exactly the
337 | same. But there are dozens and perhaps hundreds of cases where reports
338 | with different date ranges overlap. This stored procedure addresses
339 | this problem by scrubbing all headers in the database each time this
340 | module is run.
341 |
342 | ## update_master_files Module
343 | This module can be used to download and extract the master files housed
344 | on the [FEC website](http://www.fec.gov/finance/disclosure/ftpdet.shtml). The
345 | FEC updates the candidate (cn), committee (cm) and candidate-committee
346 | linkage (ccl) files daily. The FEC updates the individual
347 | contributions (indiv), committee-to-committee transactions (oth) and
348 | committee-to-candidate transactions (pas2) files every Sunday evening.
349 |
350 | ### Warehousing Master Files
351 | By default, the update_master_files module archives the compressed
352 | master files (but not the extracted data files). This was done to
353 | preserve the source data during development and because the FEC
354 | overwrites the data files.
355 |
356 | To disable this behavior, set the value of the ARCHIVEFILES user
357 | variable (see the user variables section near the top of the module) to
358 | zero. When ARCHIVEFILES is set to any value other than one (1), the
359 | master files are not archived.
360 |
361 | The script assumes you will run it daily and late enough in the day to
362 | make sure the FEC has created the daily data files before you try to
363 | process them. At the time of this writing, the FEC updates the three
364 | daily files around 7:30 a.m. and the weekly files around 4 p.m. If
365 | you're scheduling a daily job to run this script, I recommend you
366 | schedule it for late evening to give yourself plenty of leeway.
367 |
368 | By default, the script will ignore the weekly files if the script is
369 | not run on a Sunday. To change this behavior and download the Sunday
370 | files on a different day of the week, set the OMITNONSUNDAYFILES
371 | variable near the top of the script to 0.
372 |
373 | ### How the update_master_files Module Works
374 | This module goes through the following process in this order:
375 | * Calls delete_data to remove all .txt and .zip files from the working
376 | directory specified by the MASTERDIR variable. (The .zip files from
377 | the previous execution of the script will be in this directory only if
378 | they were not archived when the script was last run.)
379 | * Uses the local date on the machine running the code to calculate the
380 | current election cycle and determine whether the files are being
381 | downloaded on a Sunday. If the weekday is not Sunday and
382 | OMITNONSUNDAYFILES is set to 1, the script will ignore all master files
383 | except for the candidate (cn), committee (cm) and candidate-committee
384 | linkage (ccl) files.
385 | * Uses multiprocessing and calls download_file to download each master
386 | file specified by the MASTERFILES user variable. (The default set is
387 | defined near the top of the module.) These files are saved in the
388 | directory specified by MASTERDIR; see the URL sketch after this list.
389 | After downloading a file, the subroutine compares the length of the
390 | downloaded file with the length of the source file on the FEC
391 | website. If the lengths do not match, the file is deleted from the
392 | file system. The subroutine tries to download a file up to five times.
393 | __NOTE:__ You can set the NUMPROC variable in the user variables section
394 | to specify the number of downloads that occur simultaneously. The
395 | default value is 10.
396 | * Uses multiprocessing and calls unzip_master_file to extract the data
397 | files from each master file. If the extracted filename does not
398 | include a year reference, the subroutine appends a two-digit year.
399 | __NOTE:__ You can set the NUMPROC variable in the user variables section
400 | to specify the number of extractions that occur simultaneously. The
401 | default value is 10.
402 | * When the ARCHIVEFILES user variable is set to 1, the module calls the
403 | archive_master_files subroutine, which creates a YYYYMMDD directory
404 | named for the current date (if that directory does not already exist)
405 | and moves all .zip files in the MASTERDIR directory to the new
406 | directory.
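
As referenced in the download step above, the URLs the module requests
follow a simple per-cycle pattern. A condensed sketch of that loop,
mirroring the code in update_master_files and showing only the
candidate master (cn) file:

```python
from datetime import datetime

MASTERFTP = 'https://cg-519a459a-0ea3-42c2-b7bc-fa1143481f74.s3-us-gov-west-1.amazonaws.com/bulk-downloads/'
STARTCYCLE = 2002

# Round the current year up to the even-numbered election year
maxyear = datetime.now().year
if maxyear % 2 == 1:
    maxyear += 1

# One archive per cycle per master file, e.g. .../2014/cn14.zip
for cycle in range(STARTCYCLE, maxyear + 2, 2):
    print(MASTERFTP + str(cycle) + '/cn' + str(cycle)[2:] + '.zip')
```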
407 |
408 | ### About the Master Files
409 | The FEC recreates three of the master files daily and the remaining
410 | master files every Sunday evening. Each time the files are generated, they
411 | overwrite the previously posted files. The archive filenames include a
412 | two-digit year to identify the election cycle, but the files housed in
413 | those archives often do not. For that reason, this module appends a
414 | two-digit election-cycle year to extracted filenames that do not include
415 | a year reference (for example, itcont.txt extracted from indiv14.zip becomes itcont14.txt).
416 |
417 | There are nine compressed files for each election cycle. You can click
418 | the links below to view the data dictionary for a particular file:
419 | * __add:__ [New Individual Contributions](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionsbyIndividualsAdditions.shtml)
420 | Lists all contributions added to the master Individuals file in the
421 | past week.
422 | Files extracted from these archives are named
423 | addYYYY.txt. Generated every Sunday.
424 | * __ccl:__ [Candidate Committee Linkage](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCandCmteLinkage.shtml)
425 | Houses all links between candidates and committees that have
426 | been reported to the FEC. Strangely, this file does not include
427 | candidate ties to Leadership PACs, which are reported on Form 1.
428 | Files extracted from these archives are named ccl.txt. Generated
429 | daily.
430 | * __changes:__ [Individual Contribution Changes](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionsbyIndividualsChanges.shtml)
431 | Lists all transactions in the master Individuals file that have been
432 | changed during the past month.
433 | Files extracted from these archives are named chgYYYY.txt. Generated
434 | every Sunday.
435 | * __cm:__ [Committee Master File](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCommitteeMaster.shtml)
436 | Lists all committees registered with the FEC for a
437 | specific election cycle. Among other information, you can use this file
438 | to see a committee's FEC ID, name, address and treasurer. You can use
439 | the Committee Designation field (CMTE_DSGN) to find a specific
440 | committee type (such as principal campaign committees, joint
441 | fundraisers, lobbyist PACs and Leadership PACs). Additionally, you can
442 | look for code O in the Committee Type field (CMTE_TP) to identify
443 | independent expenditure-only committees, commonly known as Super PACs.
444 | Files extracted from these archives are named cm.txt. Generated daily.
445 | * __cn:__ [Candidate Master File](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCandidateMaster.shtml)
446 | Lists all candidates registered with the FEC for a
447 | specific election cycle. You can use this file to see all candidates
448 | who have filed to run for a particular seat as well as information
449 | about their political parties, addresses, treasurers and FEC IDs.
450 | Files extracted from these archives are named cn.txt. Generated daily.
451 | * __delete:__ [Deleted Individual Contributions](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionsbyIndividualsDeletes.shtml)
452 | Lists all contributions deleted from the master Individuals file in the
453 | past week.
454 | Files extracted from these archives are named delYYYY.txt. Generated
455 | every Sunday.
456 | * __indiv:__ [Individual Contributions](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionsbyIndividuals.shtml)
457 | For the most part, lists all itemized contributions of $200 or more
458 | made by INDIVIDUALS to any committee during the election cycle. Does
459 | not include most contributions from businesses, PACs and other
460 | organizations.
461 | Files extracted from these archives are named itcont.txt. Generated
462 | every Sunday.
463 | * __oth:__ [Committee-to-Committee Transactions](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCommitteetoCommittee.shtml)
464 | Lists contributions and independent expenditures made by one committee
465 | to another.
466 | Files extracted from these archives are named itoth.txt. Generated
467 | every Sunday.
468 | * __pas2:__ [Committee-to-Candidate Transactions](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionstoCandidates.shtml)
469 | Lists contributions made by a committee to a candidate.
470 | Files extracted from these archives are named itpas2.txt. Generated
471 | every Sunday.
472 |
473 | ### Using the indiv, add, changes and delete files
474 | The indiv file generated every Sunday for each election cycle is
475 | comprehensive, meaning it is a current (as of that Sunday) snapshot of
476 | all individual contributions currently in the database. Many users
477 | simply drop and recreate this table every week. If you do this, you do
478 | not need the add, changes and delete files. The FEC generates these
479 | files to provide an alternative to rebuilding the indiv table each
480 | week simply because of the sheer volume of data it houses.
481 |
482 | I presently don't use any of these files and instead rely on the raw
483 | filings themselves, which are immediately available on the FEC website
484 | once they're filed and contain more data than the indiv files. But
485 | many journalists I know who do use these files say they just rebuild
486 | the indiv table each week because it's easier and less error-prone than
487 | trying to patch it each week.
488 |
489 | If you decide to use the add, changes and delete files rather than
490 | rebuild the indiv table each week, just be aware that if you ever miss
491 | a weekly download, you will have to rebuild the indiv table.
492 |
493 | ## Next Steps
494 | I house all of my campaign-finance data in a SQL Server database and
495 | tend to use SQL Server Integration Services packages to load the data
496 | generated/extracted by the FEC Scraper Toolbox. I do this because of
497 | the sheer volume of the data (the table housing all Schedule A data,
498 | for example, contains more than 120 million rows so far) and because I
499 | can use SSIS to bulk load the data rather than load it from Python a
500 | row at a time.
501 |
502 | The lone exception is the parse_reports module. That module attempts
503 | to load a report's header row into the database to test whether that
504 | report previously has been loaded. All child rows in new reports are
505 | parsed and moved to separate data files (one for Schedule A, one for
506 | Schedule B and so on).
507 |
508 | Understandably, the lack of code to load the data makes the Toolbox
509 | less attractive to some potential users, who must manually import the
510 | data files or develop their own Pythonic means of doing so.
511 | Nevertheless, the modules presently provide very fast and efficient
512 | means of downloading massive quantities of data, managing that data and
513 | preparing it for import.
514 |
515 | At some point, I plan to develop Python functions to handle the data
516 | imports, and of course I welcome any contributions from the open-source
517 | community. I also plan to open source my entire database design so
518 | others can recreate it in any database manager they choose.
519 |
520 | Stay tuned!
521 |
--------------------------------------------------------------------------------
/download_reports.py:
--------------------------------------------------------------------------------
1 | # Download campaign finance reports
2 | # By Christopher Schnaars, USA TODAY
3 | # Developed with Python 2.7.4
4 | # See README.md for complete documentation
5 |
6 | # Import needed libraries
7 | import datetime
8 | import glob
9 | import multiprocessing
10 | import os
11 | import pickle
12 | import re
13 | import sys
14 | import urllib
15 | import urllib2
16 | import zipfile
17 |
18 | # Try to import user settings or set them explicitly
19 | try:
20 | import usersettings
21 |
22 | ARCPROCDIR = usersettings.ARCPROCDIR
23 | ARCSVDIR = usersettings.ARCSVDIR
24 | RPTHOLDDIR = usersettings.RPTHOLDDIR
25 | RPTPROCDIR = usersettings.RPTPROCDIR
26 | RPTSVDIR = usersettings.RPTSVDIR
27 | except ImportError:
28 | ARCPROCDIR = 'C:\\data\\FEC\\Archives\\Processed\\'
29 | ARCSVDIR = 'C:\\data\\FEC\\Archives\\Import\\'
30 | RPTHOLDDIR = 'C:\\data\\FEC\\Reports\\Hold\\'
31 | RPTPROCDIR = 'C:\\data\\FEC\\Reports\\Processed\\'
32 | RPTSVDIR = 'C:\\data\\FEC\\Reports\\Import\\'
33 |
34 | # Other user variables
35 | ARCFTP = 'https://cg-519a459a-0ea3-42c2-b7bc-fa1143481f74.s3-us-gov-west-1.amazonaws.com/bulk-downloads/electronic/'
36 | NUMPROC = 1 # Multiprocessing processes to run simultaneously
37 | RPTURL = 'http://docquery.fec.gov/dcdev/posted/'
38 | RSSURL = 'http://efilingapps.fec.gov/rss/generate?preDefinedFilingType=ALL'
39 |
40 |
41 | def build_archive_download_list(zipinfo, oldarchives):
42 | """
43 | On 1/8/2018, the FEC shut down its FTP server and moved their
44 | bulk files to an Amazon S3 bucket. Rather than try to hack the
45 | JavaScript, this function now looks for files dated after the
46 | mostrecent element of the zipinfo.p pickle up to the current
47 | system date. I'm adding a try_again_later property to the
48 | pickle for .zip files that fail to download.
49 | """
50 |
51 | # Generate date range to look for new files
52 | start_date = datetime.datetime.strptime(zipinfo['mostrecent'].rstrip('.zip'), '%Y%m%d').date()
53 | add_day = datetime.timedelta(days=1)
54 | start_date += add_day
55 | end_date = datetime.datetime.now().date()
56 |
57 | # Create a list of files to attempt to download
58 | downloads = []
59 |
60 | # Add recent archive files
61 | while start_date < end_date:
62 | downloads.append(datetime.date.strftime(start_date, '%Y%m%d') + '.zip')
63 | start_date += add_day
64 |
65 | # Add try_again files
66 | for fec_file in zipinfo['try_again_later']:
67 | if fec_file not in downloads:
68 | downloads.append(fec_file)
69 | zipinfo['try_again_later'] = []
70 |
71 | # Remove any bad files from the list
72 | for fec_file in zipinfo['badfiles']:
73 | if fec_file in downloads:
74 | downloads.remove(fec_file)
75 |
76 | # Remove previously downloaded archives
77 | downloads = [download for download in downloads if download not in oldarchives]
78 |
79 | return downloads
80 |
81 |
82 | def build_prior_archive_list():
83 | """
84 | Returns a list of archives that already have been downloaded and
85 | saved to ARCPROCDIR or ARCSVDIR.
86 | """
87 | dirs = [ARCSVDIR, ARCPROCDIR]
88 | archives = []
89 |
90 | for dir in dirs:
91 | for datafile in glob.glob(os.path.join(dir, '*.zip')):
92 | archives.append(datafile.replace(dir, ''))
93 |
94 | return archives
95 |
96 |
97 | def build_prior_report_list():
98 | """
99 | Returns a list of reports housed in the directories specified by
100 | RPTHOLDDIR, RPTPROCDIR and RPTSVDIR.
101 | """
102 | dirs = [RPTHOLDDIR, RPTPROCDIR, RPTSVDIR]
103 | reports = []
104 |
105 | for dir in dirs:
106 | for datafile in glob.glob(os.path.join(dir, '*.fec')):
107 | reports.append(
108 | datafile.replace(dir, '').replace('.fec', ''))
109 |
110 | return reports
111 |
112 |
113 | def consume_rss():
114 | """
115 | Returns a list of electronically filed reports included in an FEC
116 | RSS feed listing all reports submitted within the past seven days.
117 | """
118 | regex = re.compile('http://docquery.fec.gov/dcdev/posted/' \
119 | '([0-9]*)\.fec')
120 | opener = urllib2.build_opener()
121 | opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0')]
122 | rss = opener.open(RSSURL).read()
123 | matches = []
124 | for match in re.findall(regex, rss):
125 | matches.append(match)
126 |
127 | return matches
128 |
129 |
130 | def download_archive(archive):
131 | """
132 | Downloads a single archive file and saves it in the directory
133 | specified by the ARCSVDIR variable. After downloading an archive,
134 | this subroutine compares the length of the downloaded file with the
135 | length of the source file and will try to download a file up to
136 | five times when the lengths don't match.
137 | """
138 | src = ARCFTP + archive
139 | dest = ARCSVDIR + archive
140 | y = 0
141 | # Get the length of the source file so the download can be verified
142 | try:
143 | # Add a header to the request
144 | request = urllib2.Request(src, headers={
145 | 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'})
146 | srclen = float(urllib2.urlopen(request).info().get('Content-Length'))
147 | except:
148 | y = 5
149 |
150 | while y < 5:
151 | try:
152 | # Add a header to the request
153 | urllib.URLopener.version = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'
154 | urllib.urlretrieve(src, dest)
155 |
156 | destlen = os.path.getsize(dest)
157 |
158 | # Repeat download up to five times if files not same size
159 | if srclen != destlen:
160 | os.remove(dest)
161 | y += 1
162 | continue
163 | else:
164 | y = 6
165 | except:
166 | y += 1
167 | if y == 5:
168 | zipinfo['try_again_later'].append(archive)
169 | print(src + ' could not be downloaded.')
170 |
171 |
172 | def download_report(download):
173 | """
174 | Downloads a single electronic report and saves it in the directory
175 | specified by the RPTSVDIR variable. After downloading a report,
176 | this subroutine compares the length of the downloaded file with the
177 | length of the source file and will try to download a file up to
178 | five times when the lengths don't match.
179 | """
180 | # Construct file url and get length of file
181 | url = RPTURL + download + '.fec'
182 | y = 0
183 |
184 | try:
185 | # Add a header to the request
186 | request = urllib2.Request(url, headers={
187 | 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'})
188 | srclen = float(urllib2.urlopen(request).info().get('Content-Length'))
189 | except:
190 | y = 5
191 |
192 | filename = RPTSVDIR + download + '.fec'
193 |
194 | while y < 5:
195 | try:
196 | url_headers = {'ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
197 | 'ACCEPT_ENCODING': 'gzip, deflate, br',
198 | 'ACCEPT_LANGUAGE': 'en-US,en;q=0.5',
199 | 'USER-AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0'}
200 | request = urllib2.Request(url, headers=url_headers)
201 | response = urllib2.urlopen(request)
202 | with open(filename, 'wb') as f:
203 | f.write(response.read())
204 |
205 | destlen = os.path.getsize(filename)
206 |
207 | # Repeat download up to five times if files not same size
208 | if srclen != destlen:
209 | os.remove(filename)
210 | y += 1
211 | continue
212 | else:
213 | y = 6
214 | except:
215 | y += 1
216 |
217 | if y == 5:
218 | print('Report ' + download + ' could not be downloaded.')
219 | sys.exit()
220 |
221 |
222 | def pickle_archives(zipinfo, archives):
223 | """
224 | Rebuilds the zipinfo.p pickle and saves it in the same directory as
225 | this module.
226 |
227 | archives is a list of archive files available for download on the
228 | FEC website. The list is generated by the
229 | build_archive_download_list function.
230 | """
231 |
232 | # To calculate most recent download, omit files in try_again_later
233 | downloads = [fec_file for fec_file in archives if fec_file not in zipinfo['try_again_later']]
234 | if len(downloads) > 0:
235 | zipinfo['mostrecent'] = max(downloads)
236 |
237 | # Remove bad files older than most recent
238 | if len(zipinfo['badfiles']) > 0:
239 | most_recent_date = datetime.datetime.strptime(zipinfo['mostrecent'].rstrip('.zip'), '%Y%m%d').date()
240 |
241 | for bad_file in zipinfo['badfiles'][::-1]:
242 | bad_file_date = datetime.datetime.strptime(bad_file.rstrip('.zip'), '%Y%m%d').date()
243 | if bad_file_date < most_recent_date:
244 | zipinfo['badfiles'].remove(bad_file)
245 |
246 | pickle.dump(zipinfo, open('zipinfo.p', 'wb'))
247 |
248 |
249 | def unzip_archive(archive, overwrite=0):
250 | """
251 | Extracts any files housed in a specific archive that have not been
252 | downloaded previously.
253 |
254 | Set the overwrite parameter to 1 if existing files should be
255 | overwritten. The default value is 0.
256 | """
257 | destdirs = [RPTSVDIR, RPTPROCDIR, RPTHOLDDIR]
258 | try:
259 | zip = zipfile.ZipFile(ARCSVDIR + archive)
260 | for subfile in zip.namelist():
261 | x = 1
262 | if overwrite != 1:
263 | for dir in destdirs:
264 | if x == 1:
265 | if os.path.exists(dir + subfile):
266 | x = 0
267 | if x == 1:
268 | zip.extract(subfile, destdirs[0])
269 |
270 | zip.close()
271 |
272 | # If all files extracted correctly, move archive to Processed
273 | # directory
274 | os.rename(ARCSVDIR + archive, ARCPROCDIR + archive)
275 |
276 | except:
277 | print('Files contained in ' + archive + ' could not be '
278 | 'extracted. The file has been deleted so it can be '
279 | 'downloaded again later.\n')
280 | os.remove(ARCSVDIR + archive)
281 |
282 |
283 | def verify_reports(rpts, downloaded):
284 | """
285 | Returns a list of individual reports to be downloaded.
286 |
287 | Specifically, this function compares a list of available reports
288 | that have been submitted to the FEC during the past seven days
289 | (rpts) with a list of previously downloaded reports (downloaded).
290 |
291 | For reports that already have been downloaded, the function verifies
292 | the length of the downloaded file matches the length of the file
293 | posted on the FEC website. When the lengths do not match, the saved
294 | file is deleted and the report is kept in the download list.
295 | """
296 | downloads = []
297 | for rpt in rpts:
298 | childdirs = [RPTSVDIR, RPTPROCDIR, RPTHOLDDIR]
299 | if rpt not in downloaded:
300 | downloads.append(rpt)
301 | else:
302 | try:
303 | # Add a header to the request
304 | request = urllib2.Request(RPTURL + rpt + '.fec', headers={
305 | 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'})
306 | srclen = float(urllib2.urlopen(request).info().get('Content-Length'))
307 | except urllib2.HTTPError:
308 | print(RPTURL + rpt + '.fec could not be downloaded.')
309 | continue
310 |
311 | for child in childdirs:
312 | try:
313 | destlen = os.path.getsize(child + rpt + '.fec')
314 | if srclen != destlen:
315 | downloads.append(rpt)
316 | os.remove(child + rpt + '.fec')
317 | except:
318 | pass
319 |
320 | return downloads
321 |
322 |
323 | if __name__ == '__main__':
324 | # Attempt to fetch data specifying missing .zip files and most
325 | # recent .zip file downloaded
326 | print('Attempting to retrieve information for previously '
327 | 'downloaded archives...')
328 | try:
329 | zipinfo = pickle.load(open("zipinfo.p", "rb"))
330 | # Make sure new try_again_later key exists
331 | if 'try_again_later' not in zipinfo.keys():
332 | zipinfo['try_again_later'] = []
333 | print('Information retrieved successfully.\n')
334 | except:
335 | zipinfo = {'mostrecent': '20010403.zip',
336 | 'badfiles': ['20010408.zip', '20010428.zip', '20010429.zip', '20010505.zip', '20010506.zip',
337 | '20010512.zip', '20010526.zip', '20010527.zip', '20010528.zip', '20010624.zip',
338 | '20010812.zip', '20010826.zip', '20010829.zip', '20010902.zip', '20010915.zip',
339 | '20010929.zip', '20010930.zip', '20011013.zip', '20011014.zip', '20011028.zip',
340 | '20011123.zip', '20011124.zip', '20011125.zip', '20011201.zip', '20011202.zip',
341 | '20011215.zip', '20011223.zip', '20011229.zip', '20030823.zip', '20030907.zip',
342 | '20031102.zip', '20031129.zip', '20031225.zip', '20040728.zip', '20040809.zip',
343 | '20040921.zip', '20040922.zip', '20041127.zip', '20050115.zip', '20050130.zip',
344 | '20050306.zip', '20050814.zip', '20050904.zip', '20051106.zip', '20051225.zip',
345 | '20060210.zip', '20060318.zip', '20060319.zip', '20060320.zip', '20061224.zip',
346 | '20070507.zip', '20071028.zip', '20081225.zip', '20091226.zip', '20111203.zip',
347 | '20120701.zip', '20121215.zip', '20121225.zip', '20130703.zip', '20130802.zip',
348 | '20130825.zip', '20130914.zip', '20131109.zip', '20150207.zip', '20150525.zip'],
349 | 'try_again_later': ['20001015.zip', '20010201-20010403.zip']}
350 | print('zipinfo.p not found. Starting from scratch...\n')
351 |
352 | # Build a list of previously downloaded archives
353 | print('Building a list of previously downloaded archive files...')
354 | oldarchives = build_prior_archive_list()
355 | print('Done!\n')
356 |
357 | # Go to FEC site and fetch a list of .zip files available
358 | print('Compiling a list of archives available for download...')
359 | archives = build_archive_download_list(zipinfo, oldarchives)
360 | if len(archives) == 0:
361 | print('No new archives found.\n')
362 | # If any files returned, download them using multiprocessing
363 | else:
364 | print('Done!\n')
365 | print('Downloading ' + str(len(archives))
366 | + ' new archive(s)...')
367 | pool = multiprocessing.Pool(processes=NUMPROC)
368 | for archive in archives:
369 | pool.apply_async(download_archive(archive))  # note: this calls download_archive in the main process before queuing its result
370 | pool.close()
371 | pool.join()
372 | print('Done!\n')
373 |
374 | # Open each archive and extract new reports
375 | print('Extracting files from archives...')
376 | pool = multiprocessing.Pool(processes=NUMPROC)
377 | for archive in archives:
378 | # Make sure archive was downloaded
379 | if os.path.isfile(ARCSVDIR + archive):
380 | pool.apply_async(unzip_archive(archive, 0))
381 | pool.close()
382 | pool.join()
383 | print('Done!\n')
384 |
385 | # Rebuild zipinfo and save with pickle
386 | print('Repickling the archives. Adding salt and vinegar...')
387 | pickle_archives(zipinfo, archives)
388 | print('Done!\n')
389 |
390 | # Build list of previously downloaded reports
391 | print('Building a list of previously downloaded reports...')
392 | downloaded = build_prior_report_list()
393 | print('Done!\n')
394 |
395 | # Consume FEC's RSS feed to get list of files posted in the past
396 | # seven days
397 | print('Consuming FEC RSS feed to find new reports...')
398 | rpts = consume_rss()
399 | print('Done! ' + str(len(rpts)) + ' reports found.\n')
400 |
401 | # See whether each file flagged for download already has been
402 | # downloaded. If it has, verify the downloaded file is the correct
403 | # length.
404 | print('Compiling list of reports to download...')
405 | newrpts = verify_reports(rpts, downloaded)
406 | print('Done! ' + str(len(newrpts)) + ' reports flagged for '
407 | 'download.\n')
408 |
409 | # Download each of these reports
410 | print('Downloading new reports...')
411 | pool = multiprocessing.Pool(processes=NUMPROC)
412 | for rpt in newrpts:
413 | # download_report(rpt)
414 | pool.apply_async(download_report(rpt))
415 | pool.close()
416 | pool.join()
417 | print('Done!\n')
418 | print('Process completed.')
419 |
--------------------------------------------------------------------------------
/fec_scraper_toolbox_sql_objects.sql:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cschnaars/FEC-Scraper-Toolbox/0eec4758150945ff1e4f05bc15b903731741b2fe/fec_scraper_toolbox_sql_objects.sql
--------------------------------------------------------------------------------
/update_master_files.py:
--------------------------------------------------------------------------------
1 | # Download zipped FEC master files
2 | # By Christopher Schnaars, USA TODAY
3 | # Developed with Python 2.7.4
4 | # See README.md for complete documentation
5 |
6 | # WARNING:
7 | # --------
8 | # If you automate the execution of this script, you should set it to
9 | # run in the late evening to make sure you don't download any master
10 | # files before the FEC has a chance to update them. At the time of this
11 | # writing, the FEC was updating the candidate, committee and
12 | # candidate-committee linkage files daily around 7:30 a.m. while
13 | # other weekly files were updated a little before 4 p.m. on Sundays.
14 |
15 | # Development Notes:
16 | # ------------------
17 | # 4/7/2014: Updated code so files can be downloaded daily. The FEC began
18 | # publishing daily updates to the candidate, committee and
19 | # candidate-committee linkage files. Other master files continue
20 | # to be updated weekly on Sundays.
21 |
22 | # Import needed libraries
23 | from datetime import datetime, timedelta
24 | import glob
25 | import multiprocessing
26 | import os
27 | import urllib
28 | import urllib2
29 | import zipfile
30 |
31 | # Try to import user settings or set them explicitly
32 | try:
33 | import usersettings
34 | MASTERDIR = usersettings.MASTERDIR
35 | except ImportError:
36 | MASTERDIR = 'C:\\data\\FEC\\Master\\'
37 |
38 | # Other user variables
39 | ARCHIVEFILES = 1 # Set to 0 if you don't want to archive the master files each week.
40 | MASTERFTP = 'https://cg-519a459a-0ea3-42c2-b7bc-fa1143481f74.s3-us-gov-west-1.amazonaws.com/bulk-downloads/'
41 | MASTERFILES = ['ccl', 'cm', 'cn', 'indiv', 'oth', 'pas2', 'oppexp']
42 | NUMPROC = 10 # Multiprocessing processes to run simultaneously
43 | STARTCYCLE = 2002 # Oldest election cycle for which you want to download master files
44 | OMITNONSUNDAYFILES = 1 # Set to 0 to download all files regardless of day of week
45 |
46 |
47 | def archive_master_files():
48 | """
49 | Moves the current master files to an archive directory named for
50 | today's date (YYYYMMDD). If the archive directory does not exist,
51 | this subroutine creates it.
52 | """
53 | # Create timestamp
54 | timestamp = datetime.now().strftime("%Y%m%d")
55 |
56 | # Create archive directory if it doesn't exist
57 | savedir = MASTERDIR + 'Archive\\' + timestamp + '\\'
58 | if not os.path.isdir(savedir):
59 | try:
60 | os.makedirs(savedir)  # creates intermediate directories as needed
61 | except:
62 | pass
63 |
64 | # Move all the files
65 | for datafile in glob.glob(os.path.join(MASTERDIR, '*.zip')):
66 | os.rename(datafile, datafile.replace(MASTERDIR, savedir))
67 |
68 |
69 | def create_timestamp():
70 | filetime = datetime.now()
71 | return filetime.strftime('%Y%m%d')
72 |
73 |
74 | def delete_files(dir, ext):
75 | """
76 | Deletes all files in the specified directory with the specified
77 | file extension. In this module, it is used to delete all text files
78 | extracted from the previous week's archives prior to downloading
79 | the new archives. These files are housed in the directory
80 | specified by MASTERDIR.
81 |
82 | When ARCHIVEFILES is set to 0, this subroutine also is used
83 | to delete all archive files from the MASTERDIR directory.
84 | """
85 | # Remove asterisks and periods from specified extension
86 | ext = '*.' + ext.lstrip('*.')
87 |
88 | # Delete all files
89 | for datafile in glob.glob(os.path.join(dir, ext)):
90 | os.remove(datafile)
91 |
92 |
93 | def download_file(src, dest):
94 | """
95 | Downloads a single master file (src) and saves it as dest. After
96 | downloading a file, this subroutine compares the length of the
97 | downloaded file with the length of the source file and will try to
98 | download a file up to five times when the lengths don't match.
99 | """
100 | y = 0
101 | try:
102 | # Add a header to the request.
103 | request = urllib2.Request(src, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 SE 2.X MetaSr 1.0'})
104 | srclen = float(urllib2.urlopen(request).info().get('Content-Length'))
105 | except:
106 | y = 5
107 | while y < 5:
108 | try:
109 | urllib.URLopener.version = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'
110 | urllib.urlretrieve(src, dest)
111 | destlen = os.path.getsize(dest)
112 |
113 | # Repeat download up to five times if files not same size
114 | if srclen != destlen:
115 | os.remove(dest)
116 | y += 1
117 | continue
118 | else:
119 | y = 6
120 | except:
121 | y += 1
122 | if y == 5:
123 | print(src + ' could not be downloaded.')
124 |
125 |
126 | def unzip_master_file(masterfile):
127 | """
128 | Extracts the data file from a single weekly master file archive.
129 | If the extracted file does not include a year reference, this
130 | subroutine appends a two-digit year to the extracted filename.
131 | """
132 | fileyear = masterfile[masterfile.find('.zip')-2:masterfile.find('.zip')]
133 |
134 | try:
135 | zip = zipfile.ZipFile(masterfile)
136 | for subfile in zip.namelist():
137 | zip.extract(subfile, MASTERDIR)
138 | # Rename the file if it does not include the year
139 | if subfile.find(fileyear + '.txt') == -1:
140 | savefile = MASTERDIR + subfile
141 | os.rename(savefile, savefile.replace('.txt', fileyear + '.txt'))
142 |
143 | except:
144 | print('Files contained in ' + masterfile + ' could not be extracted.')
145 |
146 |
147 | if __name__=='__main__':
148 |
149 | # Delete text files extracted from an earlier archive
150 | print('Deleting old data...')
151 | delete_files(MASTERDIR, 'txt')
152 |
153 | # Delete old archives if they're still in the working
154 | # directory. These files are moved to another directory
155 | # (archived) below when ARCHIVEFILES is set to 1.
156 | delete_files(MASTERDIR, 'zip')
157 | print('Done!\n')
158 |
159 | # Use multiprocessing to download master files
160 | print('Downloading master files...\n')
161 | pool = multiprocessing.Pool(processes=NUMPROC)
162 |
163 | # Determine whether today is Sunday
164 | sunday = False
165 | if datetime.now().weekday() == 6:
166 | sunday = True
167 |
168 | # Remove all files but cn, cm and ccl from MASTERFILES
169 | if not sunday and OMITNONSUNDAYFILES == 1:
170 | files = []
171 | for fecfile in ['ccl', 'cm', 'cn']:
172 | if fecfile in MASTERFILES:
173 | files.append(fecfile)
174 | MASTERFILES = files
175 |
176 | # Calculate current election cycle
177 | maxyear = datetime.now().year
178 | # Add one if it's not an even-numbered year
179 | if maxyear / 2 * 2 < maxyear: maxyear += 1
180 |
181 | # Create loop to iterate through FEC ftp directories
182 | for x in range(STARTCYCLE, maxyear + 2, 2):
183 | fecdir = MASTERFTP + str(x) + '/'
184 |
185 | for thisfile in MASTERFILES:
186 | currfile = thisfile + str(x)[2:] + '.zip'
187 | fecfile = fecdir + currfile
188 | savefile = MASTERDIR + currfile
189 | pool.apply_async(download_file, (fecfile, savefile))  # pass args so downloads run in the pool
190 | pool.close()
191 | pool.join()
192 | print('Done!\n')
193 |
194 | # Use multiprocessing to extract data files from the archives
195 | print('Unzipping files...')
196 | pool = multiprocessing.Pool(processes=NUMPROC)
197 |
198 | for fecfile in glob.glob(os.path.join(MASTERDIR, '*.zip')):
199 | pool.apply_async(unzip_master_file, (fecfile,))  # pass args so extraction runs in the pool
200 | pool.close()
201 | pool.join()
202 | print('Done!\n')
203 |
204 | # Archive files when ARCHIVEFILES == 1
205 | # Otherwise delete files
206 | if ARCHIVEFILES == 1:
207 | print('Archiving data files...')
208 | archive_master_files()
209 | print('Done!\n')
210 |
211 | print('Process complete.')
212 |
213 |
--------------------------------------------------------------------------------