├── AUTHORS.md ├── README.md ├── download_reports.py ├── fec_scraper_toolbox_sql_objects.sql ├── parse_reports.py └── update_master_files.py /AUTHORS.md: -------------------------------------------------------------------------------- 1 | # Authors 2 | Christopher Schnaars (http://www.chrisschnaars.org/, https://twitter.com/chrisschnaars) 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # FEC Scraper Toolbox 2 | The FEC Scraper Toolbox is a series of Python modules you can use to 3 | find and download electronically filed campaign finance reports housed 4 | on the Federal Election Commission website and load those reports into 5 | a database manager. 6 | 7 | Generally, the FEC Scraper Toolbox is meant to replace the [FEC Scraper](https://github.com/cschnaars/FEC-Scraper) 8 | repository. You might want to use the older repository, however, if 9 | you want to limit the scope of your database to include only specific 10 | committees. The default behavior of the FEC Scraper Toolbox is to 11 | download every available report, whereas FEC Scraper downloads reports 12 | only for the committees you specify. 13 | 14 | Presently, the Toolbox consists of three major modules. They are 15 | documented fully below, but in brief, they are: 16 | * __download_reports:__ Downloads daily compilations of electronically 17 | filed reports and consumes an RSS feed to find and download 18 | recently filed reports. 19 | * __parse_reports:__ Combines any number of reports into a single file 20 | for each form type (Schedule A, Schedule B and so on). Report header 21 | information is loaded into a database. 22 | * __update_master_files:__ Downloads daily and weekly master files 23 | housing detailed information about all candidates and committees, 24 | individual contributions and contributions from committees to 25 | candidates and other committees. 26 | 27 | The FEC Scraper Toolbox was developed under Python 2.7.4. I am presently 28 | running it under 2.7.6. 29 | 30 | ## Requirements 31 | The following modules are required to use FEC Scraper Toolbox. All of 32 | them except pyodbc, which must be installed separately, are included with a standard Python 2.7 installation: 33 | * csv 34 | * datetime 35 | * glob 36 | * linecache 37 | * multiprocessing 38 | * os 39 | * pickle 40 | * pyodbc 41 | * re 42 | * shutil 43 | * time 44 | * urllib 45 | * urllib2 46 | * zipfile 47 | 48 | ## User Settings 49 | You can add an optional usersettings.py file to the directory housing 50 | your Python modules to customize database connection strings and file 51 | locations. In each module, you'll see a try statement, where the module 52 | will attempt to load this file. Default values can be specified in the 53 | except portion of the try statement. 54 | 55 | You can copy and paste the text below into your usersettings.py file, 56 | then specify the values you want to use. 
57 | 58 | ```python 59 | ARCPROCDIR = '' # Directory to house archives that have been processed 60 | ARCSVDIR = '' # Directory to house archives that have been downloaded but not processed 61 | DBCONNSTR = '' # Database connection string 62 | MASTERDIR = '' # Master directory for weekly candidate and committee master files 63 | RPTERRDIR = '' # Directory to house error logs generated when a field can't be parsed 64 | RPTHOLDDIR = '' # Directory to house electronically filed reports that cannot be processed 65 | RPTOUTDIR = '' # Directory to house data files generated by parse_reports 66 | RPTPROCDIR = '' # Directory to house electronically filed reports that have been processed 67 | RPTRVWDIR = '' # Directory to house electronically filed reports that could not be imported and need to be reviewed 68 | RPTSVDIR = '' # Directory to house electronically filed reports that have been downloaded but not processed 69 | ``` 70 | 71 | ## download_reports Module 72 | This module tracks and downloads all electronically filed reports 73 | housed on the Federal Election Commission website. Specifically, it 74 | ensures all daily archives of reports (which go back to 2001) have been 75 | downloaded and extracted. It then consumes the FEC's RSS feed listing 76 | all reports filed within the past seven days to look for new reports. 77 | 78 | Electronic reports filed voluntarily by a handful of Senators presently 79 | are not included here. 80 | 81 | This module does not load any data or otherwise interact with a 82 | database manager (though I plan to add functionality to ping a database 83 | to build a list of previously downloaded reports rather than require 84 | the user to warehouse them). Its sole purpose is to track and download 85 | reports. 86 | 87 | If you don't want to download archives back to 2001 or otherwise want 88 | to manually control what is downloaded, you'll find commented out code 89 | below as well as in the module that you can use to modify the zipinfo.p 90 | pickle (which is described in the first bullet point below). 91 | 92 | This module goes through the following process in this order: 93 | * Uses the pickle module to attempt to load zipinfo.p, a dictionary 94 | housing the name of the most recent archive downloaded as well as a 95 | list of files not downloaded previously. Commented out code available 96 | below and in the module can be used to modify this pickle if you 97 | want to control which archives will be retrieved. 98 | * Calls build_prior_archive_list to construct a list of archives that 99 | already have been downloaded and saved to ARCPROCDIR or ARCSVDIR. 100 | __NOTE:__ I plan to deprecate this function. I added this feature for 101 | development and to test the implementation of the zipinfo.p pickle. 102 | Using the pickle saves a lot of time and disk space compared to 103 | warehousing all the archives. 104 | * Calls build_archive_download_list, which processes the zipinfo.p 105 | pickle to build a list of available archive files that have not 106 | been downloaded. 107 | * Uses multiprocessing and calls download_archive to download each 108 | archive file. These files are saved in the directory specified 109 | with the ARCSVDIR variable. After downloading an archive, the 110 | subroutine compares the length of the downloaded file with the length 111 | of the source file. If the lengths do not match, the file is deleted 112 | from the file system. The subroutine tries to download a file up to 113 | five times. 
114 | __NOTE:__ You can set the NUMPROC variable in the user variables section 115 | to specify the number of downloads that occur simultaneously. The 116 | default value is 10. 117 | * Uses multiprocessing and calls unzip_archive to extract any files in 118 | the archive that have not been downloaded previously. The second 119 | parameter is an overwrite flag; existing files are overwritten when 120 | this flag is set to 1. Default is 0. 121 | __NOTE:__ You can set the NUMPROC variable in the user variables section 122 | to specify the number of downloads that occur simultaneously. The 123 | default value is 10. 124 | * Again calls build_prior_archive_list to reconstruct the list of 125 | archives that already have been downloaded and saved to ARCPROCDIR or 126 | ARCSVDIR. 127 | __NOTE:__ As stated above, this feature is slated for deprecation. 128 | * Calls pickle_archives to rebuild zipinfo.p and save it to the same 129 | directory as this module. 130 | * Calls build_prior_report_list to build a list of reports housed in 131 | RPTHOLDDIR, RPTPROCDIR and RPTSVDIR. 132 | __NOTE:__ I plan to add a function that can build a list of previously 133 | processed files using a database call rather than combing the file 134 | system (though it will remain necessary to look in the RPTHOLDDIR and 135 | RPTSVDIR directories to find files that have not been loaded into the 136 | database). 137 | * Calls consume_rss, which uses a regular expression to scan an FEC RSS 138 | feed listing all electronically filed reports submitted within the 139 | past seven days. The function returns a list of these reports. 140 | * Calls verify_reports to test whether filings flagged for download by 141 | consume_rss already have been downloaded. If so, the function 142 | verifies the length of the downloaded file matches the length of 143 | the file posted on the FEC website. When the lengths do not match, 144 | the saved file is deleted and the report is retained in the download list. 145 | * Uses multiprocessing and calls download_report to download each 146 | report returned by verify_reports. After downloading a report, the 147 | subroutine compares the length of the downloaded file with the length 148 | of the source file. If the lengths do not match, the file is deleted 149 | from the file system. The subroutine tries to download a file up to 150 | five times. 151 | __NOTE:__ You can set the NUMPROC variable in the user variables section 152 | to specify the number of downloads that occur simultaneously. The 153 | default value is 10. 154 | 155 | ### Modifying the zipinfo Pickle 156 | Here is the commented-out code available in the download_reports module 157 | that you can use to manually control the zipinfo.p pickle if you don't 158 | want to download all available archives back to 2001: 159 | 160 | ```python 161 | # Set mostrecent to the last date you DON'T want, so if you want 162 | # everything since Jan. 1, 2013, set mostrecent to: '20121231.zip' 163 | zipinfo['mostrecent'] = '20121231.zip' # YYYYMMDD.zip 164 | zipinfo['badfiles'] = [] # You probably want to leave this blank 165 | ``` 166 | 167 | ## parse_reports Module 168 | This module is the workhorse of the Toolbox. It parses all downloaded 169 | reports in a specified directory and saves child rows for each subform 170 | type into a single data file for easy import into a database manager. 171 | 172 | One of the main challenges with parsing electronically filed reports is 173 | that each form (presently there are 56) has its own column 174 | headers. 
What's more, the layout of each form has been through as many 175 | as 13 iterations, each of which also can have its own headers. 176 | 177 | The parse_reports module handles this by examining the two header rows 178 | atop each electronically filed report to determine the form type and 179 | version of that report. (The current version is 8.1.) If the form 180 | type and version are supported by the parser, the columns in each data 181 | row are mapped to standardized columns. Generally speaking, the 182 | standardized column headings largely mimic a form's version 8.0 183 | headings, though a few legacy columns have been retained. 184 | 185 | You can modify the standardized columns at any time by manipulating the 186 | __outputhdrs__ variable. All data for any columns not included in this 187 | variable are dropped from the data stream and are not included in the 188 | output files. The headers for each version of each form type are housed 189 | in the __filehdrs__ variable in this format: 190 | 191 | ```python 192 | [[formtype-1, [[[versions], [headers]], [[versions], [headers]], [[versions], [headers]]]], 193 | [formtype-2, [[[versions], [headers]], [[versions], [headers]], [[versions], [headers]]]], 194 | ... 195 | [formtype-n, [[[versions], [headers]], [[versions], [headers]], [[versions], [headers]]]]] 196 | ``` 197 | 198 | While all child rows are parsed and saved to delimited text files, the 199 | two header rows for each file are loaded into a database manager. By 200 | default, the database manager is SQL Server, and the module calls a 201 | series of stored procedures to load the header rows into the database. 202 | However, you can easily modify this behavior to use a different 203 | database manager or save the header rows to their own text files. 204 | 205 | The main reason to use the module to load the headers is the database 206 | manager can verify each report is not already in the database. 207 | Eventually, I will add my database structure and stored procedure code 208 | to this repository to make it easier to port the functionality to 209 | interact with other database managers. 210 | 211 | The parser supports the following form types: 212 | * Form 1, Statement of Organization: Includes Form1S and Text records 213 | * Form 3, Report of Receipts and Disbursements: Filed by Congressional 214 | candidates; includes all schedules and Text records. 215 | * Form 3L, Report of Contributions Bundled by 216 | Lobbyists/Registrants: Includes all schedules and Text records. 217 | * Form 3P, Report of Receipts and Disbursements: Filed by presidential 218 | and vice-presidential candidates; includes all schedules and Text 219 | records. 220 | * Form 3X, Report of Receipts and Disbursements: Filed by all 221 | committees other than the principal campaign committees of 222 | Congressional, presidential and vice-presidential candidates; includes 223 | all schedules and Text records. 224 | 225 | Reports not recognized by parse_reports are moved to the directory 226 | specified by the RPTHOLDDIR variable. 227 | 228 | This module goes through the following process in this order: 229 | * Calls build_list_of_supported_report_types, which examines the list 230 | housed in the filehdrs variable to determine which types of 231 | electronically filed reports can be parsed by the module. 232 | * Calls create_file_timestamp, which creates a timestamp string that is 233 | affixed to the filename of each data file generated by the module. 
234 | * Creates an output file for each type of child row data (one for 235 | Schedule A, one for Schedule B, one for Text and so on) and 236 | writes the column headers to each file. These files are saved in 237 | the directory specified by RPTOUTDIR. The module also generates an 238 | "Other Data" file, where rows the module can't write to other data 239 | files are saved. 240 | * Appends various full name fields to header lists. These fields were 241 | used in older electronic filings until the FEC decided to split 242 | names across multiple fields. These extra headers are appended 243 | here because the module attempts to parse these names and does not 244 | write the full name fields to the output data files. If a name 245 | can't be parsed, it is saved to the appropriate last name field. 246 | 247 | From this point, the module iterates over each electronic filing saved 248 | in the directory specified by RPTSVDIR. For each file, the module: 249 | * Saves the six-digit filename as ImageID. This value is prepended to 250 | every child row so those rows can be mapped to the parent header 251 | row. 252 | * If the ImageID is contained in BADREPORTS, the module moves the file 253 | to the directory specified by RPTHOLDDIR, then proceeds to the next 254 | electronic filing. 255 | * Extracts the file header. The file header contains basic information 256 | about the file, such as the header version, the software used to 257 | generate the report, the delimiter used for names and a date format 258 | string. For most forms, this information is constrained to the 259 | first line of the file. But header versions 1 and 2 for all form 260 | types use multi-line file headers. When a multi-line file header 261 | is detected, the module scans from the top of the file until it 262 | finds and reads the entire header. 263 | * Extracts the report header, which is always contained on only one 264 | line, immediately below the file header. 265 | * Checks to see whether the default delimiter specified by DELIMITER 266 | can be found anywhere in the file header. If not, the default 267 | delimiter is changed to a comma. 268 | * Extracts the report type (e.g., F3PA, F3XN, F3T) and an abbreviated 269 | report type (e.g., F3P, F3X, F3) from the report header. The last 270 | letter of a Form 3 report type indicates whether the report is a 271 | New, Amended or Termination report. If the module does not support 272 | that report type, it moves the file to the RPTHOLDDIR and proceeds 273 | to the next electronic filing. 274 | * Extracts the version number from the file header. 275 | * Parses the file header. Custom code is used for header versions 1 and 276 | 2, while the parse_data_row function is used for all later, 277 | single-line headers. 278 | * Creates a dictionary to house data for the report header, adds a 279 | key for each column, then calls parse_data_row for all versions. 280 | * If for some reason the delimiter used for names is unknown, the 281 | module attempts to determine the delimiter. 282 | * Calls a custom function for each form type to validate the report 283 | header data, then calls load_report_header to load that data into a 284 | database manager. If the data can't be validated, the module will 285 | fail. If the data is valid but can't be loaded into the database 286 | (either because of an error or because the report already exists in 287 | the database), the module moves the file to the directory specified 288 | by RPTRVWDIR and proceeds to the next electronic filing. 
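
To make the header-detection steps above more concrete, here is a minimal sketch of how the two header rows of a filing can yield the version and the report type. This is not the parse_reports code itself: the peek_report name, the chr(28) field-separator assumption and the field positions are illustrative only, and filings with header versions 1 and 2 need the multi-line handling described above.

```python
# Illustrative sketch only -- not the parse_reports implementation.
# Assumes a single-line file header and a hypothetical default DELIMITER
# of chr(28); filings whose header lacks it fall back to commas.
DELIMITER = chr(28)

def peek_report(path):
    with open(path, 'r') as fec_file:
        file_header = fec_file.readline()    # first line: HDR record (format version, software, ...)
        report_header = fec_file.readline()  # second line: report header (form type, committee, ...)

    # Mirror the fallback described above: use a comma when the default
    # delimiter never appears in the file header.
    delim = DELIMITER if DELIMITER in file_header else ','

    version = file_header.split(delim)[2].strip().strip('"')    # e.g. '8.1', '6.4'
    rpttype = report_header.split(delim)[0].strip().strip('"')  # e.g. 'F3XN', 'F3PA', 'F3T'

    # For the Form 3 family, the trailing letter is the New/Amended/Termination flag.
    rptabbr = rpttype[:-1] if rpttype.startswith('F3') and rpttype[-1] in 'NAT' else rpttype

    return rpttype, rptabbr, version
```

In practice, once these values are known, the version-specific header maps in the filehdrs variable and the parse_data_row function described above do the real work of mapping each row to the standardized output columns.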
289 | 290 | As noted elsewhere, FEC Scraper Toolbox uses SQL Server as the default 291 | database manager and uses stored procedures to interact with the 292 | database. However, this is an issue only in terms of loading header 293 | data. All other data is saved to delimited text files you can load 294 | into any database manager. (The default delimiter is a tab, which you 295 | can change by setting the OUTPUTDELIMITER variable.) 296 | 297 | The main reason to load the headers from within the Python module is to 298 | verify an electronic report does not already exist in the database. 299 | It also ensures a valid parent-child relationship exists before any 300 | child rows are loaded. Saving all data, including file headers, to 301 | flat files so all the data can be imported in one set would be faster, 302 | but querying the database and adding a header one report at a time is a 303 | trade-off to ensure database viability. 304 | 305 | Once the header has been parsed and loaded into the database, the 306 | module creates a list to hold each type of child row (one file for 307 | Schedule A data, one for Schedule B data and so on). The module then 308 | iterates over the file, skipping the headers, and processes each child 309 | row as follows: 310 | * The module removes double spaces from the data. If OUTPUTDELIMITER is 311 | set to a tab, the module also converts tabs to spaces. 312 | * The data row is converted to a list. 313 | * The module looks at the first element of the list to determine the 314 | row's form type. If the type can't be determined, the row is 315 | written to the "Other Data" file. 316 | * The module calls populate_data_row_dict to build a dictionary to 317 | house the output data, mapping the version-specific headers of the 318 | data row to the headers used in the data output file. 319 | * The module calls a form type-specific function to validate and clean 320 | the data. 321 | * Full name fields, if any, are removed from the data row. 322 | * The module calls build_data_row to convert the dictionary to a list. 323 | * The list is appended to the list created to house that type of data. 324 | 325 | Once the module has finished iterating over the data for an electronic 326 | report, each line of each form type-specific list is converted to a 327 | delimited string and written to the appropriate data file before the 328 | module proceeds to the next file. 329 | 330 | At the end of the module, you'll see a call to a SQL Server stored 331 | procedure called usp_DeactivateOverlappingReports. (Again, I plan to 332 | post all my SQL Server code in this repository very soon.) Briefly, 333 | this stored procedure addresses the problem of reports with different 334 | but overlapping date ranges. Rather than delete amended reports, I use 335 | database triggers to flag the most recent report as Active and to 336 | deactivate all earlier reports whenever the date ranges are exactly the 337 | same. But there are dozens and perhaps hundreds of cases where reports 338 | with different date ranges overlap. This stored procedure addresses 339 | this problem by scrubbing all headers in the database each time this 340 | module is run. 341 | 342 | ## update_master_files Module 343 | This module can be used to download and extract the master files housed 344 | on the [FEC website](http://www.fec.gov/finance/disclosure/ftpdet.shtml). The 345 | FEC updates the candidate (cn), committee (cm) and candidate-committee 346 | linkage (ccl) files daily. 
The FEC updates the individual 347 | contributions (indiv), committee-to-committee transactions (oth) and 348 | committee-to-candidate transactions (pas2) files every Sunday evening. 349 | 350 | ### Warehousing Master Files 351 | By default, the update_master_files module archives the compressed 352 | master files (but not the extracted data files). This was done to 353 | preserve the source data during development and because the FEC 354 | overwrites the data files. 355 | 356 | To disable this behavior, set the value of the ARCHIVEFILES user 357 | variable (see the user variables section near the top of the module) to 358 | zero. When ARCHIVEFILES is set to any value other than one (1), the 359 | master files are not archived. 360 | 361 | The script assumes you will run it daily and late enough in the day to 362 | make sure the FEC has created the daily data files before you try to 363 | process them. At the time of this writing, the FEC updates the three 364 | daily files around 7:30 a.m. and the weekly files around 4 p.m. If 365 | you're scheduling a daily job to run this script, I recommend you 366 | schedule it for late evening to give yourself plenty of leeway. 367 | 368 | By default, the script will ignore the weekly files if the script is 369 | not run on a Sunday. To change this behavior and download the Sunday 370 | files on a different day of the week, set the OMITNONSUNDAYFILES 371 | variable near the top of the script to 0. 372 | 373 | ### How the update_master_files Module Works 374 | This module goes through the following process in this order: 375 | * Calls delete_data to remove all .txt and .zip files from the working 376 | directory specified by the MASTERDIR variable. (The .zip files from 377 | the previous execution of the script will be in this directory only if 378 | they were not archived when the script was last run.) 379 | * Uses the local date on the machine running the code to calculate the 380 | current election cycle and determine whether the files are being 381 | downloaded on a Sunday. If the weekday is not Sunday and 382 | OMITNONSUNDAYFILES is set to 1, the script will ignore all master files 383 | except for the candidate (cn), committee (cm) and candidate-committee 384 | linkage (ccl) files. 385 | * Uses multiprocessing and calls download_file to download each master 386 | file specified by the MASTERFILES user variable. (By default, all nine 387 | master files are downloaded.) These files are saved in the directory 388 | specified by the MASTERDIR variable. 389 | After downloading a file, the subroutine compares the length of the 390 | downloaded file with the length of the source file of the FEC 391 | website. If the lengths do not match, the file is deleted from the 392 | file system. The subroutine tries to download a file up to five times. 393 | __NOTE:__ You can set the NUMPROC variable in the user variables section 394 | to specify the number of downloads that occur simultaneously. The 395 | default value is 10. 396 | * Uses multiprocessing and calls unzip_master_file to extract the data 397 | files from each master file. If the extracted filename does not 398 | include a year reference, the subroutine appends a two-digit year. 399 | __NOTE:__ You can set the NUMPROC variable in the user variables section 400 | to specify the number of downloads that occur simultaneously. The 401 | default value is 10. 
402 | * When the ARCHIVEFILES user variable is set to 1, the module calls the 403 | archive_master_files subroutine, which creates a YYYYMMDD directory for 404 | the most recent Sunday date (if that directory does not already exist) 405 | and moves all .zip files in the MASTERDIR directory to the new 406 | directory. 407 | 408 | ### About the Master Files 409 | The FEC recreates three of the master files daily and the remaining 410 | master files every Sunday evening. Each time the files are generated, they 411 | overwrite the previously posted files. The archive filenames include a 412 | two-digit year to identify the election cycle, but the files housed in 413 | those archives often do not. For that reason, this module appends a 414 | two-digit election cycle to extracted filenames that do not include a year 415 | reference. 416 | 417 | There are nine compressed files for each election cycle. You can click 418 | the links below to view the data dictionary for a particular file: 419 | * __add:__ [New Individual Contributions](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionsbyIndividualsAdditions.shtml) 420 | Lists all contributions added to the master Individuals file in the 421 | past week. 422 | Files extracted from these archives are named 423 | addYYYY.txt. Generated every Sunday. 424 | * __ccl:__ [Candidate Committee Linkage](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCandCmteLinkage.shtml) 425 | Houses all links between candidates and committees that have 426 | been reported to the FEC. Strangely, this file does not include 427 | candidate ties to Leadership PACs, which are reported on Form 1. 428 | Files extracted from these archives are named ccl.txt. Generated 429 | daily. 430 | * __changes:__ [Individual Contribution Changes](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionsbyIndividualsChanges.shtml) 431 | Lists all transactions in the master Individuals file that have been 432 | changed during the past month. 433 | Files extracted from these archives are named chgYYYY.txt. Generated 434 | every Sunday. 435 | * __cm:__ [Committee Master File](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCommitteeMaster.shtml) 436 | Lists all committees registered with the FEC for a 437 | specific election cycle. Among other information, you can use this file 438 | to see a committee's FEC ID, name, address and treasurer. You can use 439 | the Committee Designation field (CMTE_DSGN) to find a specific 440 | committee type (such as principal campaign committees, joint 441 | fundraisers, lobbyist PACs and leadership PACs). Additionally, you can 442 | look for code O in the Committee Type field (CMTE_TP) to identify 443 | independent expenditure-only committees, commonly known as Super PACs. 444 | Files extracted from these archives are named cm.txt. Generated daily. 445 | * __cn:__ [Candidate Master File](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCandidateMaster.shtml) 446 | Lists all candidates registered with the FEC for a 447 | specific election cycle. You can use this file to see all candidates 448 | who have filed to run for a particular seat as well as information 449 | about their political parties, addresses, treasurers and FEC IDs. 450 | Files extracted from these archives are named cn.txt. Generated daily. 
451 | * __delete:__ [Deleted Individual Contributions](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionsbyIndividualsDeletes.shtml) 452 | Lists all contributions deleted from the master Individuals file in the 453 | past week. 454 | Files extracted from these archives are named delYYYY.txt. Generated 455 | every Sunday. 456 | * __indiv:__ [Individual Contributions](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionsbyIndividuals.shtml) 457 | For the most part, lists all itemized contributions of $200 or more 458 | made by INDIVIDUALS to any committee during the election cycle. Does 459 | not include most contributions from businesses, PACs and other 460 | organizations. 461 | Files extracted from these archives are named itcont.txt. Generated 462 | every Sunday. 463 | * __oth:__ [Committee-to-Committee Transactions](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCommitteetoCommittee.shtml) 464 | Lists contributions and independent expenditures made by one committee 465 | to another. 466 | Files extracted from these archives are named itoth.txt. Generated 467 | every Sunday. 468 | * __pas2:__ [Committee-to-Candidate Transactions](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionstoCandidates.shtml) 469 | Lists contributions made by a committee to a candidate. 470 | Files extracted from these archives are named itpas2.txt. Generated 471 | every Sunday. 472 | 473 | ### Using the indiv, add, changes and delete files 474 | The indiv file generated every Sunday for each election cycle is 475 | comprehensive, meaning it is a snapshot (as of that Sunday) of 476 | all individual contributions currently in the database. Many users 477 | simply drop and recreate this table every week. If you do this, you do 478 | not need the add, changes and delete files. The FEC generates these 479 | files to provide an alternative to rebuilding the indiv table each 480 | week simply because of the sheer volume of data it houses. 481 | 482 | I presently don't use any of these files and instead rely on the raw 483 | filings themselves, which are immediately available on the FEC website 484 | once they're filed and contain more data than the indiv files. But 485 | many journalists I know who do use these files say they just rebuild 486 | the indiv table each week because it's easier and less error-prone than 487 | trying to patch it. 488 | 489 | If you decide to use the add, changes and delete files rather than 490 | rebuild the indiv table each week, just be aware that if you ever miss 491 | a weekly download, you will have to rebuild the indiv table. 492 | 493 | ## Next Steps 494 | I house all of my campaign-finance data in a SQL Server database and 495 | tend to use SQL Server Integration Services packages to load the data 496 | generated/extracted by the FEC Scraper Toolbox. I do this because of 497 | the sheer volume of the data (the table housing all Schedule A data, 498 | for example, contains more than 120 million rows so far) and because I 499 | can use SSIS to bulk load the data rather than load it from Python a 500 | row at a time. 501 | 502 | The lone exception is the parse_reports module. That module attempts 503 | to load a report's header row into the database to test whether that 504 | report previously has been loaded. All child rows in new reports are 505 | parsed and moved to separate data files (one for Schedule A, one for 506 | Schedule B and so on). 
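
The Toolbox leaves the import of these child-row files to you. If you want a purely Pythonic starting point rather than SSIS, the sketch below shows one way to bulk-load a generated file; SQLite, the load_delimited_file name and the example filename are illustrative assumptions, not part of the Toolbox.

```python
# Illustrative sketch: bulk-load one of the delimited output files into a
# database. SQLite and the untyped table layout are stand-ins; swap in your
# own database manager, schema and typed columns for production use.
import csv
import sqlite3

def load_delimited_file(datafile, dbpath, table, delimiter='\t'):
    conn = sqlite3.connect(dbpath)
    with open(datafile, 'rb') as infile:  # 'rb' for the csv module under Python 2
        reader = csv.reader(infile, delimiter=delimiter)
        headers = next(reader)  # the first row of each output file holds the column headers
        cols = ', '.join('"%s"' % header for header in headers)
        params = ', '.join('?' * len(headers))
        conn.execute('CREATE TABLE IF NOT EXISTS %s (%s)' % (table, cols))
        sql = 'INSERT INTO %s (%s) VALUES (%s)' % (table, cols, params)
        conn.executemany(sql, (row for row in reader if len(row) == len(headers)))
    conn.commit()
    conn.close()

# Hypothetical usage (the output filenames include a timestamp):
# load_delimited_file('SchedA_20140407.txt', 'fec.db', 'schedule_a')
```

For production-scale loads (the Schedule A table alone runs to well over 100 million rows), a bulk-load utility native to your database manager will be far faster than row-at-a-time inserts, which is why I rely on SSIS as described above.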
507 | 508 | Understandably, the lack of code to load the data makes the Toolbox 509 | less attractive to some potential users, who must manually import the 510 | data files or develop their own Pythonic means of doing so. 511 | Nevertheless, the modules presently provide very fast and efficient 512 | means of downloading massive quantities of data, managing that data and 513 | preparing it for import. 514 | 515 | At some point, I plan to develop Python functions to handle the data 516 | imports, and of course I welcome any contributions from the open-source 517 | community. I also plan to open source my entire database design so 518 | others can recreate it in any database manager they choose. 519 | 520 | Stay tuned! 521 | -------------------------------------------------------------------------------- /download_reports.py: -------------------------------------------------------------------------------- 1 | # Download campaign finance reports 2 | # By Christopher Schnaars, USA TODAY 3 | # Developed with Python 2.7.4 4 | # See README.md for complete documentation 5 | 6 | # Import needed libraries 7 | import datetime 8 | import glob 9 | import multiprocessing 10 | import os 11 | import pickle 12 | import re 13 | import sys 14 | import urllib 15 | import urllib2 16 | import zipfile 17 | 18 | # Try to import user settings or set them explicitly 19 | try: 20 | import usersettings 21 | 22 | ARCPROCDIR = usersettings.ARCPROCDIR 23 | ARCSVDIR = usersettings.ARCSVDIR 24 | RPTHOLDDIR = usersettings.RPTHOLDDIR 25 | RPTPROCDIR = usersettings.RPTPROCDIR 26 | RPTSVDIR = usersettings.RPTSVDIR 27 | except: 28 | ARCPROCDIR = 'C:\\data\\FEC\\Archives\\Processed\\' 29 | ARCSVDIR = 'C:\\data\\FEC\\Archives\\Import\\' 30 | RPTHOLDDIR = 'C:\\data\\FEC\\Reports\\Hold\\' 31 | RPTPROCDIR = 'C:\\data\\FEC\\Reports\\Processed\\' 32 | RPTSVDIR = 'C:\\data\\FEC\\Reports\\Import\\' 33 | 34 | # Other user variables 35 | ARCFTP = 'https://cg-519a459a-0ea3-42c2-b7bc-fa1143481f74.s3-us-gov-west-1.amazonaws.com/bulk-downloads/electronic/' 36 | NUMPROC = 1 # Multiprocessing processes to run simultaneously 37 | RPTURL = 'http://docquery.fec.gov/dcdev/posted/' 38 | RSSURL = 'http://efilingapps.fec.gov/rss/generate?preDefinedFilingType=ALL' 39 | 40 | 41 | def build_archive_download_list(zipinfo, oldarchives): 42 | """ 43 | On 1/8/2018, the FEC shut down its FTP server and moved their 44 | bulk files to an Amazon S3 bucket. Rather than try to hack the 45 | JavaScript, this function now looks for files dated after the 46 | mostrecent element of the zipinfo.p pickle up to the current 47 | system date. I'm adding a try_again_later property to the 48 | pickle for .zip files that fail to download. 
49 | """ 50 | 51 | # Generate date range to look for new files 52 | start_date = datetime.datetime.strptime(zipinfo['mostrecent'].rstrip('.zip'), '%Y%m%d').date() 53 | add_day = datetime.timedelta(days=1) 54 | start_date += add_day 55 | end_date = datetime.datetime.now().date() 56 | 57 | # Create dictionary to house list of files to attempt to download 58 | downloads = [] 59 | 60 | # Add recent archive files 61 | while start_date < end_date: 62 | downloads.append(datetime.date.strftime(start_date, '%Y%m%d') + '.zip') 63 | start_date += add_day 64 | 65 | # Add try_again files 66 | for fec_file in zipinfo['try_again_later']: 67 | if fec_file not in downloads: 68 | downloads.append(fec_file) 69 | zipinfo['try_again_later'] = [] 70 | 71 | # Remove any bad files from the list 72 | for fec_file in zipinfo['badfiles']: 73 | if fec_file in downloads: 74 | downloads.remove(fec_file) 75 | 76 | # Remove previously downloaded archives 77 | downloads = [download for download in downloads if download not in oldarchives] 78 | 79 | return downloads 80 | 81 | 82 | def build_prior_archive_list(): 83 | """ 84 | Returns a list of archives that already have been downloaded and 85 | saved to ARCPROCDIR or ARCSVDIR. 86 | """ 87 | dirs = [ARCSVDIR, ARCPROCDIR] 88 | archives = [] 89 | 90 | for dir in dirs: 91 | for datafile in glob.glob(os.path.join(dir, '*.zip')): 92 | archives.append(datafile.replace(dir, '')) 93 | 94 | return archives 95 | 96 | 97 | def build_prior_report_list(): 98 | """ 99 | Returns a list of reports housed in the directories specified by 100 | RPTHOLDDIR, RPTPROCDIR and RPTSVDIR. 101 | """ 102 | dirs = [RPTHOLDDIR, RPTPROCDIR, RPTSVDIR] 103 | reports = [] 104 | 105 | for dir in dirs: 106 | for datafile in glob.glob(os.path.join(dir, '*.fec')): 107 | reports.append( 108 | datafile.replace(dir, '').replace('.fec', '')) 109 | 110 | return reports 111 | 112 | 113 | def consume_rss(): 114 | """ 115 | Returns a list of electronically filed reports included in an FEC 116 | RSS feed listing all reports submitted within the past seven days. 117 | """ 118 | regex = re.compile('http://docquery.fec.gov/dcdev/posted/' \ 119 | '([0-9]*)\.fec') 120 | opener = urllib2.build_opener() 121 | opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0')] 122 | rss = opener.open(RSSURL).read() 123 | matches = [] 124 | for match in re.findall(regex, rss): 125 | matches.append(match) 126 | 127 | return matches 128 | 129 | 130 | def download_archive(archive): 131 | """ 132 | Downloads a single archive file and saves it in the directory 133 | specified by the ARCSVDIR variable. After downloading an archive, 134 | this subroutine compares the length of the downloaded file with the 135 | length of the source file and will try to download a file up to 136 | five times when the lengths don't match. 
137 | """ 138 | src = ARCFTP + archive 139 | dest = ARCSVDIR + archive 140 | y = 0 141 | # I have added a header to my request 142 | try: 143 | # Add a header to the request 144 | request = urllib2.Request(src, headers={ 145 | 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'}) 146 | srclen = float(urllib2.urlopen(request).info().get('Content-Length')) 147 | except: 148 | y = 5 149 | 150 | while y < 5: 151 | try: 152 | # Add a header to the request 153 | urllib.URLopener.version = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0' 154 | urllib.urlretrieve(src, dest) 155 | 156 | destlen = os.path.getsize(dest) 157 | 158 | # Repeat download up to five times if files not same size 159 | if srclen != destlen: 160 | os.remove(dest) 161 | y += 1 162 | continue 163 | else: 164 | y = 6 165 | except: 166 | y += 1 167 | if y == 5: 168 | zipinfo['try_again_later'].append(archive) 169 | print(src + ' could not be downloaded.') 170 | 171 | 172 | def download_report(download): 173 | """ 174 | Downloads a single electronic report and saves it in the directory 175 | specified by the RPTSVDIR variable. After downloading a report, 176 | this subroutine compares the length of the downloaded file with the 177 | length of the source file and will try to download a file up to 178 | five times when the lengths don't match. 179 | """ 180 | # Construct file url and get length of file 181 | url = RPTURL + download + '.fec' 182 | y = 0 183 | 184 | try: 185 | # Add a header to the request 186 | request = urllib2.Request(url, headers={ 187 | 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'}) 188 | srclen = float(urllib2.urlopen(request).info().get('Content-Length')) 189 | except: 190 | y = 5 191 | 192 | filename = RPTSVDIR + download + '.fec' 193 | 194 | while y < 5: 195 | try: 196 | url_headers = {'ACCEPT': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 197 | 'ACCEPT_ENCODING': 'gzip, deflate, br', 198 | 'ACCEPT_LANGUAGE': 'en-US,en;q=0.5', 199 | 'USER-AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0'} 200 | request = urllib2.Request(url, headers=url_headers) 201 | response = urllib2.urlopen(request) 202 | with open(filename, 'wb') as f: 203 | f.write(response.read()) 204 | 205 | destlen = os.path.getsize(filename) 206 | 207 | # Repeat download up to five times if files not same size 208 | if srclen != destlen: 209 | os.remove(filename) 210 | y += 1 211 | continue 212 | else: 213 | y = 6 214 | except: 215 | y += 1 216 | 217 | if y == 5: 218 | print('Report ' + download + ' could not be downloaded.') 219 | sys.exit() 220 | 221 | 222 | def pickle_archives(zipinfo, archives): 223 | """ 224 | Rebuilds the zipinfo.p pickle and saves it in the same directory as 225 | this module. 226 | 227 | archives is a list of archive files available for download on the 228 | FEC website. The list is generated by the 229 | build_archive_download_list function. 
230 | """ 231 | 232 | # To calculate most recent download, omit files in try_again_later 233 | downloads = [fec_file for fec_file in archives if fec_file not in zipinfo['try_again_later']] 234 | if len(downloads) > 0: 235 | zipinfo['mostrecent'] = max(downloads) 236 | 237 | # Remove bad files older than most recent 238 | if len(zipinfo['badfiles']) > 0: 239 | most_recent_date = datetime.datetime.strptime(zipinfo['mostrecent'].rstrip('.zip'), '%Y%m%d').date() 240 | 241 | for bad_file in zipinfo['badfiles'][::-1]: 242 | bad_file_date = datetime.datetime.strptime(bad_file.rstrip('.zip'), '%Y%m%d').date() 243 | if bad_file_date < most_recent_date: 244 | zipinfo['badfiles'].remove(bad_file) 245 | 246 | pickle.dump(zipinfo, open('zipinfo.p', 'wb')) 247 | 248 | 249 | def unzip_archive(archive, overwrite=0): 250 | """ 251 | Extracts any files housed in a specific archive that have not been 252 | downloaded previously. 253 | 254 | Set the overwrite parameter to 1 if existing files should be 255 | overwritten. The default value is 0. 256 | """ 257 | destdirs = [RPTSVDIR, RPTPROCDIR, RPTHOLDDIR] 258 | try: 259 | zip = zipfile.ZipFile(ARCSVDIR + archive) 260 | for subfile in zip.namelist(): 261 | x = 1 262 | if overwrite != 1: 263 | for dir in destdirs: 264 | if x == 1: 265 | if os.path.exists(dir + subfile): 266 | x = 0 267 | if x == 1: 268 | zip.extract(subfile, destdirs[0]) 269 | 270 | zip.close() 271 | 272 | # If all files extracted correctly, move archive to Processed 273 | # directory 274 | os.rename(ARCSVDIR + archive, ARCPROCDIR + archive) 275 | 276 | except: 277 | print('Files contained in ' + archive + ' could not be ' 278 | 'extracted. The file has been deleted so it can be ' 279 | 'downloaded again later.\n') 280 | os.remove(ARCSVDIR + archive) 281 | 282 | 283 | def verify_reports(rpts, downloaded): 284 | """ 285 | Returns a list of individual reports to be downloaded. 286 | 287 | Specifically, this function compares a list of available reports 288 | that have been submitted to the FEC during the past seven days 289 | (rpts) with a list of previously downloaded reports (downloaded). 290 | 291 | For reports that already have been downloaded, the function verifies 292 | the length of the downloaded file matches the length of the file 293 | posted on the FEC website. When the lengths do not match, the saved 294 | file is deleted and retained in the download list. 
295 | """ 296 | downloads = [] 297 | for rpt in rpts: 298 | childdirs = [RPTSVDIR, RPTPROCDIR, RPTHOLDDIR] 299 | if rpt not in downloaded: 300 | downloads.append(rpt) 301 | else: 302 | try: 303 | # Add a header to the request 304 | request = urllib2.Request(RPTURL + rpt + '.fec', headers={ 305 | 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'}) 306 | srclen = float(urllib2.urlopen(request).info().get('Content-Length')) 307 | except urllib2.HTTPError: 308 | print(RPTURL + rpt + '.fec could not be downloaded.') 309 | continue 310 | 311 | for child in childdirs: 312 | try: 313 | destlen = os.path.getsize(child + rpt + '.fec') 314 | if srclen != destlen: 315 | downloads.append(rpt) 316 | os.remove(child + rpt + '.fec') 317 | except: 318 | pass 319 | 320 | return downloads 321 | 322 | 323 | if __name__ == '__main__': 324 | # Attempt to fetch data specifying missing .zip files and most 325 | # recent .zip file downloaded 326 | print('Attempting to retrieve information for previously ' 327 | 'downloaded archives...') 328 | try: 329 | zipinfo = pickle.load(open("zipinfo.p", "rb")) 330 | # Make sure new try_again_later key exists 331 | if 'try_again_later' not in zipinfo.keys(): 332 | zipinfo['try_again_later'] = [] 333 | print('Information retrieved successfully.\n') 334 | except: 335 | zipinfo = {'mostrecent': '20010403.zip', 336 | 'badfiles': ['20010408.zip', '20010428.zip', '20010429.zip', '20010505.zip', '20010506.zip', 337 | '20010512.zip', '20010526.zip', '20010527.zip', '20010528.zip', '20010624.zip', 338 | '20010812.zip', '20010826.zip', '20010829.zip', '20010902.zip', '20010915.zip', 339 | '20010929.zip', '20010930.zip', '20011013.zip', '20011014.zip', '20011028.zip', 340 | '20011123.zip', '20011124.zip', '20011125.zip', '20011201.zip', '20011202.zip', 341 | '20011215.zip', '20011223.zip', '20011229.zip', '20030823.zip', '20030907.zip', 342 | '20031102.zip', '20031129.zip', '20031225.zip', '20040728.zip', '20040809.zip', 343 | '20040921.zip', '20040922.zip', '20041127.zip', '20050115.zip', '20050130.zip', 344 | '20050306.zip', '20050814.zip', '20050904.zip', '20051106.zip', '20051225.zip', 345 | '20060210.zip', '20060318.zip', '20060319.zip', '20060320.zip', '20061224.zip', 346 | '20070507.zip', '20071028.zip', '20081225.zip', '20091226.zip', '20111203.zip', 347 | '20120701.zip', '20121215.zip', '20121225.zip', '20130703.zip', '20130802.zip', 348 | '20130825.zip', '20130914.zip', '20131109.zip', '20150207.zip', '20150525.zip'], 349 | 'try_again_later': ['20001015.zip', '20010201-20010403.zip']} 350 | print('zipinfo.p not found. 
Starting from scratch...\n') 351 | 352 | # Build a list of previously downloaded archives 353 | print('Building a list of previously downloaded archive files...') 354 | oldarchives = build_prior_archive_list() 355 | print('Done!\n') 356 | 357 | # Go to FEC site and fetch a list of .zip files available 358 | print('Compiling a list of archives available for download...') 359 | archives = build_archive_download_list(zipinfo, oldarchives) 360 | if len(archives) == 0: 361 | print('No new archives found.\n') 362 | # If any files returned, download them using multiprocessing 363 | else: 364 | print('Done!\n') 365 | print('Downloading ' + str(len(archives)) 366 | + ' new archive(s)...') 367 | pool = multiprocessing.Pool(processes=NUMPROC) 368 | for archive in archives: 369 | pool.apply_async(download_archive(archive)) 370 | pool.close() 371 | pool.join() 372 | print('Done!\n') 373 | 374 | # Open each archive and extract new reports 375 | print('Extracting files from archives...') 376 | pool = multiprocessing.Pool(processes=NUMPROC) 377 | for archive in archives: 378 | # Make sure archive was downloaded 379 | if os.path.isfile(ARCSVDIR + archive): 380 | pool.apply_async(unzip_archive(archive, 0)) 381 | pool.close() 382 | pool.join() 383 | print('Done!\n') 384 | 385 | # Rebuild zipinfo and save with pickle 386 | print('Repickling the archives. Adding salt and vinegar...') 387 | pickle_archives(zipinfo, archives) 388 | print('Done!\n') 389 | 390 | # Build list of previously downloaded reports 391 | print('Building a list of previously downloaded reports...') 392 | downloaded = build_prior_report_list() 393 | print('Done!\n') 394 | 395 | # Consume FEC's RSS feed to get list of files posted in the past 396 | # seven days 397 | print('Consuming FEC RSS feed to find new reports...') 398 | rpts = consume_rss() 399 | print('Done! ' + str(len(rpts)) + ' reports found.\n') 400 | 401 | # See whether each file flagged for download already has been 402 | # downloaded. If it has, verify the downloaded file is the correct 403 | # length. 404 | print('Compiling list of reports to download...') 405 | newrpts = verify_reports(rpts, downloaded) 406 | print('Done! ' + str(len(newrpts)) + ' reports flagged for ' 407 | 'download.\n') 408 | 409 | # Download each of these reports 410 | print('Downloading new reports...') 411 | pool = multiprocessing.Pool(processes=NUMPROC) 412 | for rpt in newrpts: 413 | # download_report(rpt) 414 | pool.apply_async(download_report(rpt)) 415 | pool.close() 416 | pool.join() 417 | print('Done!\n') 418 | print('Process completed.') 419 | -------------------------------------------------------------------------------- /fec_scraper_toolbox_sql_objects.sql: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cschnaars/FEC-Scraper-Toolbox/0eec4758150945ff1e4f05bc15b903731741b2fe/fec_scraper_toolbox_sql_objects.sql -------------------------------------------------------------------------------- /update_master_files.py: -------------------------------------------------------------------------------- 1 | # Download zipped FEC master files 2 | # By Christopher Schnaars, USA TODAY 3 | # Developed with Python 2.7.4 4 | # See README.md for complete documentation 5 | 6 | # WARNING: 7 | # -------- 8 | # If you automate the execution of this script, you should set it to 9 | # run in the late evening to make sure you don't download any master 10 | # files before the FEC has a chance to update them. 
At the time of this 11 | # writing, the FEC was updating the candidate, committee and 12 | # candidate-committee linkage files daily around 7:30 a.m. while 13 | # other weekly files were updated a little before 4 p.m. on Sundays. 14 | 15 | # Development Notes: 16 | # ------------------ 17 | # 4/7/2014: Updated code so files can be downloaded daily. The FEC began 18 | # publishing daily updates to the candidate, committee and 19 | # candidate-committee linkage files. Other master files continue 20 | # to be updated weekly on Sundays. 21 | 22 | # Import needed libraries 23 | from datetime import datetime, timedelta 24 | import glob 25 | import multiprocessing 26 | import os 27 | import urllib 28 | import urllib2 29 | import zipfile 30 | 31 | # Try to import user settings or set them explicitly 32 | try: 33 | import usersettings 34 | MASTERDIR = usersettings.MASTERDIR 35 | except: 36 | MASTERDIR = 'C:\\data\\FEC\\Master\\' 37 | 38 | # Other user variables 39 | ARCHIVEFILES = 1 # Set to 0 if you don't want to archive the master files each week. 40 | MASTERFTP = 'https://cg-519a459a-0ea3-42c2-b7bc-fa1143481f74.s3-us-gov-west-1.amazonaws.com/bulk-downloads/' 41 | MASTERFILES = ['ccl', 'cm', 'cn', 'indiv', 'oth', 'pas2', 'oppexp'] 42 | NUMPROC = 10 # Multiprocessing processes to run simultaneously 43 | STARTCYCLE = 2002 # Oldest election cycle for which you want to download master files 44 | OMITNONSUNDAYFILES = 1 # Set to 0 to download all files regardless of day of week 45 | 46 | 47 | def archive_master_files(): 48 | """ 49 | Moves current master files to archive directory. The 50 | archivedate parameter specifies the most recent Sunday date. If the 51 | archive directory does not exist, this subroutine creates it. 52 | """ 53 | # Create timestamp 54 | timestamp = datetime.now().strftime("%Y%m%d") 55 | 56 | # Create archive directory if it doesn't exist 57 | savedir = MASTERDIR + 'Archive\\' + timestamp + '\\' 58 | if not os.path.isdir(savedir): 59 | try: 60 | os.mkdir(savedir) 61 | except: 62 | pass 63 | 64 | # Move all the files 65 | for datafile in glob.glob(os.path.join(MASTERDIR, '*.zip')): 66 | os.rename(datafile, datafile.replace(MASTERDIR, savedir)) 67 | 68 | 69 | def create_timestamp(): 70 | filetime = datetime.datetime.now() 71 | return filetime.strftime('%Y%m%d') 72 | 73 | 74 | def delete_files(dir, ext): 75 | """ 76 | Deletes all files in the specified directory with the specified 77 | file extension. In this module, it is used to delete all text files 78 | extracted from the previous week's archives prior to downloading 79 | the new archives. These files are housed in the directory 80 | specified by MASTERDIR. 81 | 82 | When ARCHIVEFILES is set to 0, this subroutine also is used 83 | to delete all archive files from the MASTERDIR directory. 84 | """ 85 | # Remove asterisks and periods from specified extension 86 | ext = '*.' + ext.lstrip('*.') 87 | 88 | # Delete all files 89 | for datafile in glob.glob(os.path.join(dir, ext)): 90 | os.remove(datafile) 91 | 92 | 93 | def download_file(src, dest): 94 | """ 95 | Downloads a single master file (src) and saves it as dest. After 96 | downloading a file, this subroutine compares the length of the 97 | downloaded file with the length of the source file and will try to 98 | download a file up to five times when the lengths don't match. 99 | """ 100 | y = 0 101 | try: 102 | # Add a header to the request. 
103 | request = urllib2.Request(src, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 SE 2.X MetaSr 1.0'}) 104 | srclen = float(urllib2.urlopen(request).info().get('Content-Length')) 105 | except: 106 | y = 5 107 | while y < 5: 108 | try: 109 | urllib.URLopener.version = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0' 110 | urllib.urlretrieve(src, dest) 111 | destlen = os.path.getsize(dest) 112 | 113 | # Repeat download up to five times if files not same size 114 | if srclen != destlen: 115 | os.remove(dest) 116 | y += 1 117 | continue 118 | else: 119 | y = 6 120 | except: 121 | y += 1 122 | if y == 5: 123 | print(src + ' could not be downloaded.') 124 | 125 | 126 | def unzip_master_file(masterfile): 127 | """ 128 | Extracts the data file from a single weekly master file archive. 129 | If the extracted file does not include a year reference, this 130 | subroutine appends a two-digit year to the extracted filename. 131 | """ 132 | fileyear = masterfile[masterfile.find('.zip')-2:masterfile.find('.zip')] 133 | 134 | try: 135 | zip = zipfile.ZipFile(masterfile) 136 | for subfile in zip.namelist(): 137 | zip.extract(subfile, MASTERDIR) 138 | # Rename the file if it does not include the year 139 | if subfile.find(fileyear + '.txt') == -1: 140 | savefile = MASTERDIR + subfile 141 | os.rename(savefile, savefile.replace('.txt', fileyear + '.txt')) 142 | 143 | except: 144 | print('Files contained in ' + masterfile + ' could not be extracted.') 145 | 146 | 147 | if __name__=='__main__': 148 | 149 | # Delete text files extracted from an earlier archive 150 | print('Deleting old data...') 151 | delete_files(MASTERDIR, 'txt') 152 | 153 | # Delete old archives if they're still in the working 154 | # directory. These files are moved to another directory 155 | # (archived) below when ARCHIVEFILES is set to 1. 
156 | delete_files(MASTERDIR, 'zip') 157 | print('Done!\n') 158 | 159 | # Use multiprocessing to download master files 160 | print('Downloading master files...\n') 161 | pool = multiprocessing.Pool(processes=NUMPROC) 162 | 163 | # Determine whether today is Sunday 164 | sunday = False 165 | if datetime.now().weekday() == 6: 166 | sunday = True 167 | 168 | # Remove all files but cn, cm and ccl from MASTERFILES 169 | if sunday == False and OMITNONSUNDAYFILES == 1: 170 | files = [] 171 | for fecfile in ['ccl', 'cm', 'cn']: 172 | if fecfile in MASTERFILES: 173 | files.append(fecfile) 174 | MASTERFILES = files 175 | 176 | # Calculate current election cycle 177 | maxyear = datetime.now().year 178 | # Add one if it's not an even-numbered year 179 | if maxyear / 2 * 2 < maxyear: maxyear += 1 180 | 181 | # Create loop to iterate through FEC ftp directories 182 | for x in range(STARTCYCLE, maxyear + 2, 2): 183 | fecdir = MASTERFTP + str(x) + '/' 184 | 185 | for thisfile in MASTERFILES: 186 | currfile = thisfile + str(x)[2:] + '.zip' 187 | fecfile = fecdir + currfile 188 | savefile = MASTERDIR + currfile 189 | pool.apply_async(download_file(fecfile, savefile)) 190 | pool.close() 191 | pool.join() 192 | print('Done!\n') 193 | 194 | # Use multiprocessing to extract data files from the archives 195 | print('Unzipping files...') 196 | pool = multiprocessing.Pool(processes=NUMPROC) 197 | 198 | for fecfile in glob.glob(os.path.join(MASTERDIR, '*.zip')): 199 | pool.apply_async(unzip_master_file(fecfile)) 200 | pool.close() 201 | pool.join() 202 | print('Done!\n') 203 | 204 | # Archive files when ARCHIVEFILES == 1 205 | # Otherwise delete files 206 | if ARCHIVEFILES == 1: 207 | print('Archiving data files...') 208 | archive_master_files() 209 | print('Done!\n') 210 | 211 | print('Process complete.') 212 | 213 | --------------------------------------------------------------------------------