├── LICENSE ├── README.md └── datasets ├── american_community_survey.md └── federal_election_commission.md /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2014 Ryan Pitts 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | The Journalist's Guide to Datasets 2 | ================================== 3 | 4 | A collection of introductions to various datasets, giving journalists some friendly background before they start doing analysis. Like "Hitchhiker's Guide to the Galaxy," only for journalists. And about datasets. 
5 | 6 | ## Datasets 7 | * [American Community Survey](/datasets/american_community_survey.md) 8 | * [Federal Election Commission](/datasets/federal_election_commission.md) 9 | 10 | ## Notes 11 | * [Notes from the Session](https://etherpad.mozilla.org/srccon-notes) 12 | 13 | ## Datasets to Claim 14 | * Uniform Crime Report 15 | * CDC Wonder DataSet 16 | * BLS TimeUse DataSet 17 | * NOAA Fisheries Database 18 | * OSHA 19 | * Texas Ethics Commission 20 | * PLUTO Dataset (NYC) 21 | * Federal Lobbying Register 22 | * FRED : Federal Reserve DataSet 23 | * CMS (Centers for Medicare and Medicaid Services) 24 | * FCC Filings 25 | * Elex 26 | -------------------------------------------------------------------------------- /datasets/american_community_survey.md: -------------------------------------------------------------------------------- 1 | American Community Survey 2 | ========================= 3 | 4 | The [American Community Survey](https://www.census.gov/acs/www/) is a product of the [United States Census Bureau](http://census.gov). This ongoing survey collects data from a small percentage of the population each year, covering a variety of topics beyond the basic demographic information contained in the decennial Census. In addition to age, sex, and race, the ACS gathers data in categories like family type, housing, income, education, occupation, transportation, and veteran status. 5 | 6 | Governmental agencies use ACS data to inform policy and decide how to allocate services. Because the survey asks questions about so many aspects of American life, it is also a rich source of stories for journalists. For the largest geographies, [ACS summary data](https://www.census.gov/acs/www/data_documentation/summary_file/) is available going back to 2005. Estimates from smaller places were added in 2007 and 2009. 7 | 8 | It is important to remember that the American Community Survey is, indeed, a survey, so the estimates in each release are derived from statistical sampling. 
Depending on the population of a place, or even the size of group for whom a specific survey question is relevant, the Census Bureau may need to compile more than one year's worth of responses in order to deliver statistically meaningful estimates. 9 | 10 | In This Guide 11 | ============= 12 | 13 | * [Releases: 1-year vs. 3-year vs. 5-year](#releases) 14 | * [Comparing ACS data from different releases](#comparing-acs-data-from-different-releases) 15 | * [Comparing ACS data from different geographies](#comparing-acs-data-from-different-geographies) 16 | * [Comparing ACS data over time](#comparing-acs-data-over-time) 17 | * [Table universes](#table-universes) 18 | * [Margins of error](#margins-of-error) 19 | * [Summary levels](#summary-levels) 20 | * [Finding ACS data](#finding-acs-data) 21 | * [Authors](#authors) 22 | 23 | Releases 24 | ======== 25 | 26 | The Census Bureau produces [three American Community Survey releases](https://www.census.gov/acs/www/guidance_for_data_users/estimates/) each year: the 1-year, the 3-year, and the 5-year. Choosing the appropriate release for your analysis depends on the size of place you are researching, as well as the precision of estimate you need. 27 | 28 | ### 1-year release 29 | The 1-year release provides data for all geographies of population 65,000 and above. Estimates are generated from 12 months of survey responses, so while results are more current than the other two releases, they also represent the smallest sample size. 30 | 31 | ### 3-year release 32 | The 3-year release provides data for all geographies of population 20,000 and above. Estimates are generated from 36 months of survey responses, so results are not as current as the 1-year release, but are more precise. 33 | 34 | ### 5-year release 35 | The 5-year release provides data for all geographies regardless of size. If you are looking for information that goes down to the tract level, for instance, this is likely the release you want. 
Estimates are generated from 60 months of survey responses, so results are the least current, but most precise. 36 | 37 | ### Comparing ACS data from different releases 38 | Any time you compare ACS data, it is important to keep the sampling method for each release in mind. The 2012 1-year release, for example, is derived from responses to the 2012 survey alone. The 2012 3-year release is derived from responses from 2010 through 2012, and the 2012 5-year release goes back further still, producing estimates based on data from 2008 through 2012. 39 | 40 | Because of this, you should [never compare data from different release levels](http://www.census.gov/acs/www/guidance_for_data_users/comparing_data/). 41 | 42 | ### Comparing ACS data from different geographies 43 | If you are comparing data from different places (even if they're the same type of geography), you'll need to get all your data from the release that contains the smallest place in your comparison. 44 | 45 | A common example of this is performing a comparison of all counties in a state. Most states contain at least some small counties, with populations that aren't big enough to be included in the 1-year release. So if you go looking for 1-year data, you'll find some counties, but not all. (Maybe you only *want* to compare large counties, in which case you're fine.) But because you can't compare 1-year data from large counties against 3- or 5-year data from the smaller counties, if you want to analyze the complete set, you'll need to choose a single level that includes all of them. 46 | 47 | ### Comparing ACS data over time 48 | The Census Bureau began collecting ACS data nearly a decade ago, providing a great opportunity to analyze the ways an area has changed over time. It is important, however, to keep the sampling methods for each release in mind. 
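The sample windows described above (the 2012 3-year release covering 2010 through 2012, for example) can be sketched in a few lines of Python. This is only an illustration: `sample_years` and `releases_overlap` are hypothetical helper names, not part of any Census Bureau tool, and the sketch assumes every release pools whole calendar years ending with the release year.

```python
def sample_years(end_year, release_length):
    """Return the set of survey years pooled into one ACS release."""
    if release_length not in (1, 3, 5):
        raise ValueError("ACS releases cover 1, 3, or 5 years")
    return set(range(end_year - release_length + 1, end_year + 1))


def releases_overlap(end_year_a, end_year_b, release_length):
    """True when two releases of the same length share any survey year."""
    shared = sample_years(end_year_a, release_length) & sample_years(
        end_year_b, release_length
    )
    return bool(shared)


print(sorted(sample_years(2012, 3)))    # [2010, 2011, 2012]
print(sorted(sample_years(2012, 5)))    # [2008, 2009, 2010, 2011, 2012]
print(releases_overlap(2012, 2011, 3))  # True: both include 2010 and 2011
print(releases_overlap(2012, 2009, 3))  # False: 2007-2009 vs. 2010-2012
```

A check like this is a quick way to confirm that two releases of the same length draw on separate survey years before writing about change between them.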
49 | 50 | Just as in comparisons of different geographies, when you are comparing data from different years, you should only compare within the same release level (e.g. 1-year vs. 1-year, 3-year vs. 3-year, and 5-year vs. 5-year). In addition, you should only compare *adjacent* releases, that is, releases in which the sampled years do not overlap. 51 | 52 | For example, the 2012 3-year release samples data from 2010 through 2012, and the 2011 3-year release samples data from 2009 through 2011. Because both contain data from the 2010 and 2011 ACS surveys (but not necessarily the same amount of data from either year), it is impossible to make significant statements about changes between these two releases. 53 | 54 | The 2009 3-year release, however, samples data from 2007 through 2009. This makes it adjacent to the 2010-2012 data in the 2012 3-year release, so it would be fair to compare those two releases. 55 | 56 | Thankfully, comparing 1-year releases is much simpler. Each 1-year release contains data from a single year, so looking at the 2007 through 2012 1-year release data would give you a nice 6-year timeline of change. Provided the geographies you are analyzing are big enough to exist in the 1-year datasets, of course. 57 | 58 | Comparing 5-year release data over time, however, is currently impossible. The first 5-year release is from 2009, and samples surveys back through 2005. So there won't be adjacent data available until the 2014 5-year release. 59 | 60 | Table universes 61 | =============== 62 | Each table of data in the American Community Survey is derived from what the Census Bureau calls a "universe": the population for whom a particular survey question is relevant. For a table like B01002 ("Median Age by Sex"), you'll see a universe of "Total Population," meaning the estimates in this table are derived from sampling everyone in the geography you are analyzing. 
63 | 64 | For a table like B24011 ("Occupation by Median Earnings for the Civilian Population"), however, the universe is "Civilian Employed Population 16 Years and Over With Earnings." Parsing that out, you can infer that the estimates in this table do not include data from military personnel, children under 16 years old, people who are unemployed, and volunteer workers. 65 | 66 | Many ACS tables have a universe that is *not* "Total Population," which is important to track before citing figures. For example, if you were writing a story about how many people in your town heat their homes with wood, you would need to note that the universe for Table B25040 ("House Heating Fuel") is "Occupied Housing Units," an estimate that doesn't even count people. So to derive a percentage figure, you would divide by the number of total occupied units, not total residents. 67 | 68 | These universes also mean that you may have to search across multiple ACS releases to find all data for a specific place. For example, consider a county with 70,000 residents. That's big enough for the "Age by Sex" table (with a universe of "Total Population") to exist in the 1-year release. But it's unlikely that all 70,000 residents are civilians over 16 with jobs, so to find data for "Occupation by Median Earnings for the Civilian Population," you'd have to check the 3- or 5-year release. 69 | 70 | Margins of error 71 | ================ 72 | Because ACS data for a given place is derived from a sample of its population, each estimate carries with it a margin of error. Depending on the place's population (or the size of a particular table universe), these margins of error can sometimes be quite high. 73 | 74 | You should keep this relative lack of precision in mind when choosing language to describe this data. You wouldn't, for example, want to say "According to the 2012 American Community Survey, there are 975,139 twentysomethings living in Washington state." 
That particular statistic has a margin of error of +/- 10,119.6. It *would*, however, be reasonable to say "about 14 percent of Washingtonians are in their 20s." 75 | 76 | Margins of error are also important to consider when comparing places against one another. One county may have a higher estimated number of people with bachelor's degrees than another, but beware of overlapping margins of error. It may not be entirely accurate to write that the first county is outpacing the second. 77 | 78 | The same goes for comparing groups of geographies. According to the estimates, County A may rank 4th in bachelor's degrees, but overlapping margins of error might mean that it *really* ranks anywhere between 3rd and 6th. 79 | 80 | Summary levels 81 | ============== 82 | The Census Bureau uses summary levels to classify [different types of geographies](https://www.census.gov/geo/reference/garm.html). These range in size from the entire nation (summary level 010) all the way down to census blocks (summary level 101). The smallest unit of geography in the ACS is the [block group](https://www.census.gov/geo/reference/gtc/gtc_bg.html) (summary level 150). 83 | 84 | Some, but not all, summary levels are entirely contained by a parent summary level. For example, all states exist within the United States (of course), and all counties are contained by a specific state. All census places (or cities) are also contained by a state, but *not* necessarily by a county. A [chart of these parent/child relationships](http://mcdc.missouri.edu/allabout/sumlevs/) might be helpful here. 85 | 86 | Summary levels can be [incredibly granular](http://factfinder2.census.gov/help/en/glossary/s/summary_level_code_list.htm), but here is a list of codes for the most common geographies you might want to analyze. 
87 | 88 | * 010: United States 89 | 90 | * 020: Region 91 | 92 | These [divide states into four groups](https://www.census.gov/geo/reference/gtc/gtc_census_divreg.html): Northeast, Midwest, South, and West 93 | 94 | * 030: Division 95 | 96 | These [further divide regions into nine groups](https://www.census.gov/geo/reference/gtc/gtc_census_divreg.html) 97 | 98 | * 040: State 99 | 100 | * 050: County 101 | 102 | * 060: County Subdivision 103 | 104 | Different states divide counties in different ways for different reasons, so whether or not this summary level matters to a journalist is dependent on the states being investigated. For example, in New Jersey, county subdivisions provide data for more "town-like" places than the "place" (160) summary level. 105 | 106 | * 140: Census Tract 107 | 108 | Fairly stable census-defined geographies with a target population of about 4000. Before every decennial census, the Census Bureau reviews the census tract map, and sometimes splits or joins tracts which have gone far from that target. For this reason, comparison of census tracts over time must be handled with caution. 109 | 110 | * 150: Block Group 111 | 112 | These are made up of Census blocks, and block groups are used to form Census tracts. As noted above, this is the smallest summary level for which the Census Bureau provides sample data. They generally have a population of 600 to 3,000. 113 | 114 | * 160: Place 115 | 116 | The Census Bureau term for what you'd commonly refer to as a city. Some non-incorporated areas, known as "Census designated places" are also included in this tabulation. 117 | 118 | * 250: American Indian Area/Alaska Native Area/Hawaiian Home Land 119 | 120 | * 310: Metropolitan/Micropolitan Statistical Area 121 | 122 | Metropolitan statistical areas represent an urbanized core with a population of at least 50,000. Micropolitan statistical areas have a population of at least 10,000 but less than 50,000. 
The general census term for both metropolitan and micropolitan SAs is "core-based statistical areas" or CBSAs. By definition, CBSAs comprise a specific list of counties, although which counties are included in a CBSA changes based on population and economic change. 123 | 124 | * 330: Combined Statistical Area 125 | 126 | These are [groupings of metropolitan or micropolitan statistical areas](https://www.census.gov/geo/reference/gtc/gtc_cbsa.html) that have "substantial employment interchange." 127 | 128 | * 500: Congressional District 129 | 130 | * 610: State House (Upper) 131 | 132 | * 620: State House (Lower) 133 | 134 | * 795: Public Use Microdata Area 135 | 136 | Commonly referred to as a [PUMA](https://www.census.gov/geo/reference/puma.html). These are geographically contiguous areas with a population of at least 100,000. 137 | 138 | * 860: 5-digit ZIP Code Tabulation Area 139 | 140 | Technically, ZIP Codes are not defined geographically. However, most ZIP Codes can be drawn on a map, and they are a more familiar reference than Census tracts, so by popular demand, the Census Bureau produces data that fits people's intuitive sense of how ZIP Codes work. 141 | 142 | * 950: School District (Elementary) 143 | 144 | Some states have school districts specifically for the elementary level. 145 | 146 | * 960: School District (Secondary) 147 | 148 | Some states have school districts specifically for the secondary level. 149 | 150 | * 970: School District (Unified) 151 | 152 | A school district that governs both elementary and secondary schools. 153 | 154 | Finding ACS data 155 | ================ 156 | The Census Bureau provides [a range of products](https://www.census.gov/data/data-tools.html) based on ACS data, the most comprehensive of which is [American FactFinder](http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml). Bulk data downloads are also available [directly](http://www2.census.gov/) or [via FTP](ftp://ftp2.census.gov/). 
American FactFinder isn't particularly easy to navigate, however, and bulk downloads require some complex joins before you can begin to look at the data. 157 | 158 | You may have an easier time using [Census Reporter](http://censusreporter.org), a website built specifically for reporters, editors and newsroom developers interested in using ACS data. Census Reporter is a Knight News Challenge-funded project, and includes profile pages for all census geographies, comparison tools to help you explore tables and analyze collections of places, and visualizations for a more useful first look at the data. 159 | 160 | Authors 161 | ======= 162 | 163 | Ryan Pitts is director of code for [Knight-Mozilla OpenNews](http://opennews.org/), and Joe Germuska is chief nerd for [Knight Lab at Northwestern University](http://knightlab.northwestern.edu/). After working with Census data for newsroom stories and projects like [this 2010 decennial Census explorer](http://data.spokesman.com/census/2010/washington/) and [census.ire.org](http://census.ire.org), they were part of the team who built [Census Reporter](http://censusreporter.org). 164 | -------------------------------------------------------------------------------- /datasets/federal_election_commission.md: -------------------------------------------------------------------------------- 1 | Federal Election Commission 2 | =========================== 3 | 4 | The [Federal Election Commission](https://www.fec.gov/) provides data about candidates for federal office (U.S. House, U.S. Senate and President), committees formed by those candidates and groups to spend money on federal elections, and filings reporting contributions and expenditures made by both candidates and committees. It also maintains summary data about campaign finance, data about campaign finance law enforcement and federal election results. 
5 | 6 | The FEC is not the only federal entity that has a role in the campaign finance system; Senate candidates and party committees file reports first to the Secretary of the Senate, which then forwards them to the FEC (although some candidates [voluntarily file electronic reports](http://docquery.fec.gov/senate/)). The Internal Revenue Service provides data on the finances of some political groups that raise and spend money on certain state elections. But the FEC is the main repository of data for federal campaign finance activity, and it has several different types of data for download and use in reporting and analysis. 7 | 8 | In This Guide 9 | ============= 10 | 11 | * [Overview](#overview) 12 | * [Electronic Filings](#electronic-filings) 13 | * [Parsing Electronic Filings](#parsing-electronic-filings) 14 | * [Headers](#headers) 15 | * [Amendments](#amendments) 16 | * [Bulk FTP Data](#bulk-ftp-data) 17 | * [Summary files](#summary-files) 18 | * [Detailed files](#detailed-files) 19 | * [Data Catalog](#data-catalog) 20 | * [The More You Know](#the-more-you-know) 21 | * [Authors](#authors) 22 | 23 | Overview 24 | ========== 25 | 26 | The Federal Election Commission offers three general types of downloadable campaign finance data: individual electronic filings in CSV format, covering most committees; pipe-delimited bulk itemized data and summary files, available via FTP, from all committees and covering two-year election cycles; and specialized CSV or XML files via its [Data Catalog](http://fec.gov/data/DataCatalog.do?format=html). Electronic filings are available in real time as they are filed, while the other two types are updated on a regular basis. 27 | 28 | The base element of federal campaign finance data is the committee. Committees are the recipients and the spenders of money, and there are a number of different kinds of committees. Although most FEC rules apply to all committees, some committees have different limits or rules to follow, and these matter. 
29 | 30 | The second element is the filing, a report by a committee to the FEC. There are many different kinds of filings, but in general they cover three different types of information: 31 | 32 | 1. The formation of committees and candidacies, and their details. 33 | 2. The raising of money for campaigns. 34 | 3. The spending of money on campaigns. 35 | 36 | There is a consistent schedule for filings covering the last two types, and filings of the first type can be made at any time. In general, the FEC recognizes a roughly two-year election cycle that corresponds to U.S. House elections (where all 435 seats are up for election every two years). In the case of senators, who are elected to six-year terms on a staggered basis, an election cycle may be considered to include the previous four years as well as the current two-year period. The election cycle is used not only as a logical boundary for calculating money raised and spent but also to calculate contribution limits for candidates and committees. 37 | 38 | Electronic Filings 39 | ======== 40 | 41 | Most committees registered with the FEC are required to file all of their reports electronically. There are two significant exceptions: committees of U.S. Senate candidates (and two senatorial party committees), which file on paper with the Secretary of the Senate, and [committees that raise or spend less than $50,000 in a calendar year](http://fec.gov/ans/answers_filing.shtml#Do_I_need_to_file_electronically), or expect to do so. Electronic filing applies to U.S. House and presidential candidate committees, non-candidate political committees (typically known as PACs), national, state and local political party committees registered with the FEC and individuals or organizations engaged in independent spending. Electronic filing was mandated in 2001. 
42 | 43 | Electronic filings can be found via the [FEC's search form](http://www.fec.gov/finance/disclosure/efile_search.shtml), which has a number of options for filtering the results to a particular committee, a particular date or more. The search results include the option to View or Download individual filings, which present the data in HTML and delimited formats, respectively. Individual filings contain all of the records for that filing, stacked on top of each other in varying delimited layouts ([A zip file containing the formats is on FEC.gov](http://www.fec.gov/elecfil/eFilingFormats.zip).) You can open a .fec file in a text editor or spreadsheet, but be aware of the variable layouts within a single file (and across time). This is precisely why we developed Fech to handle electronic filings. 44 | 45 | Even though committees file reports [on a regular schedule](http://www.fec.gov/info/report_dates.shtml), electronic filings occur nearly every day of the year. Some are amendments of previous filings, others are filed in advance of (or after) a deadline, and others are filed as changes warrant. Filings that are amendments are indicated in the data, and serve as complete replacements for the original filings. 46 | 47 | Parsing Electronic Filings 48 | =========== 49 | 50 | For electronic filings submitted to the FEC after Jan. 3, 2008, there are two versions of raw data for the filing. 51 | 52 | In one version, the fields in the filing are delimited by commas. In the other, the fields are delimited by ASCII 28 characters. (For filings submitted through Jan. 3, 2008, there is only one version -- the comma-delimited version.) 53 | 54 | In most situations, using the version with comma-separated values (CSV) would be preferable; such files are readable by most spreadsheet programs, and most programming languages have built-in support for CSV parsing. 
55 | 56 | But the FEC makes it easier to access the ASCII 28-delimited version than the CSV version, so in many cases the added ease of getting to the ASCII 28-delimited version outweighs the benefits of using the CSV version. 57 | 58 | Take filing No. 775739, for example. The ASCII 28-delimited version of this filing is available from the easily guessable URL [http://query.nictusa.com/dcdev/posted/775739.fec](http://query.nictusa.com/dcdev/posted/775739.fec) 59 | 60 | The CSV version, however, has the URL [http://query.nictusa.com/showcsv/nicweb127546/775739.fec](http://query.nictusa.com/showcsv/nicweb127546/775739.fec) -- that number before the final slash makes it harder to determine programmatically what the URL will be. 61 | 62 | If your intention is to work with these files using a programming language, it should be fairly easy to parse the file using a delimiter other than a comma. For example, in Ruby, you would do the following: 63 | 64 | >> require 'csv' 65 | => true 66 | >> csv = CSV.open('775739.fec', :col_sep=>"\034") 67 | => <#CSV io_type:File io_path:"775739.fec" encoding:ASCII-8BIT lineno:0 col_sep:"\x1C" row_sep:"\n" quote_char:"\""> 68 | 69 | In Python, you would do: 70 | 71 | >>> csv.reader(open('775739.fec', 'r'), delimiter="\034") 72 | <_csv.reader object at 0x1004e2210> 73 | >>> reader = csv.reader(open('775739.fec', 'r'), delimiter="\034") 74 | 75 | ("\034" is the octal representation of the ASCII 28 character.) 76 | 77 | Other programming languages should have similar facilities for setting the file's delimiter. 78 | 79 | (The FEC also provides tools for converting the ASCII 28-delimited files into CSVs [here](http://www.fec.gov/support/DataConversionTools.shtml), though I can't vouch for how useful these might be.) 
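Putting the pieces above together, here is a slightly fuller Python sketch that builds the ASCII 28-delimited URL for a filing ID and splits raw filing text into rows. The helper names `posted_url` and `parse_filing` are hypothetical, the URL pattern is assumed to hold for filing IDs other than 775739, and the sample text is made up rather than taken from a real filing.

```python
import csv
import io

FIELD_SEPARATOR = "\x1c"  # ASCII 28, the same character as "\034"


def posted_url(filing_id):
    """Build the ASCII 28-delimited download URL for a filing ID.

    Assumption: the pattern shown for filing 775739 applies to other IDs too.
    """
    return "http://query.nictusa.com/dcdev/posted/{}.fec".format(filing_id)


def parse_filing(text):
    """Split the raw text of a filing into rows (lists of strings)."""
    source = io.StringIO(text, newline="")
    return list(csv.reader(source, delimiter=FIELD_SEPARATOR))


# Made-up two-row sample; real filings have many more fields per row.
sample = (
    FIELD_SEPARATOR.join(["ROW_TYPE_A", "value one", "value, with comma"])
    + "\n"
    + FIELD_SEPARATOR.join(["ROW_TYPE_B", "12.50"])
    + "\n"
)

rows = parse_filing(sample)
print(posted_url(775739))
print(rows[0])
```

Because the delimiter is not a comma, commas inside a field need no special handling. Note that the `csv` module still treats double quotes as quote characters by default, which may or may not suit every filing.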
80 | 81 | If you know only the filing's ID and prefer to download the CSV, the URL for accessing the link to that version would look like [http://query.nictusa.com/cgi-bin/dcdev/forms/DL/775739/](http://query.nictusa.com/cgi-bin/dcdev/forms/DL/775739/) -- just substitute the ID of the filing you're interested in for 775739. 82 | 83 | Although it is possible to parse electronic filings with most languages' CSV libraries, there are several FEC-specific libraries, including [Fech](https://github.com/NYTimes/Fech) (Ruby), [read_FEC](https://github.com/jsfenfen/read_FEC) (Python) and [FEC Scraper Toolbox](https://github.com/cschnaars/FEC-Scraper-Toolbox) (Python). 84 | 85 | Headers 86 | ========= 87 | 88 | The first line in every electronic filing is the "header" row, which contains metadata about the filing. 89 | 90 | All the values in the header row are important to someone, but a few in particular are especially important to us: 91 | 92 | _The FEC version number_ 93 | This is the value in the third column of the header row, and it tells us which set of field names and positions to use for each row type. In fact, we can't even parse the header row itself correctly without knowing the FEC version number, because the row's fields changed between versions five and six. 94 | 95 | The FEC occasionally updates the version required for new filings and makes available the field definitions for the new version. But older filings that were submitted before the update will remain in the FEC's filing system, and we'll need to parse them based on the version they used. 96 | 97 | _The Report ID_ 98 | In filing versions 3 through 5, this was the value in the seventh column of the header row. Since version 6, the report ID has appeared in the sixth column. 99 | 100 | The report ID is an indication of whether the filing is an amendment -- that is, does it correct an error in an earlier filing. 101 | 102 | If the filing is not an amendment, the report ID field will be blank. 
103 | 104 | If the filing is an amendment, the report ID will look like this: *FEC-763780* 105 | 106 | The numbers after the hyphen in the report ID represent the ID of the original filing that this one amends, even if the original filing has been amended previously. 107 | 108 | _The Report Number_ 109 | In filing versions 3 through 5, this was the value in the eighth column of the header row. Since version 6, the report number has appeared in the seventh column. 110 | 111 | The report number is important only for amendments. If the filing is not an amendment, the field will be blank. 112 | 113 | In an amendment, the report number represents the number of times the original filing has been amended. The report number is *1-indexed*, meaning that the first amendment to a filing will have report number 1, the second amendment will have report number 2, and so on. 114 | 115 | 116 | Amendments 117 | =========== 118 | 119 | Any filing submitted to the FEC can be amended later by the filer. 120 | 121 | Once an amendment is filed, it replaces the original filing and any previous amendments to the original filing completely. 122 | 123 | ## Is this an amendment? 124 | 125 | In the section on [header rows](#headers), we learned that amendments will have a value in the Report Number field of the first line of the electronic filing. The value of the Report Number represents the number of times the original filing has been amended. The easiest way to determine whether any given filing is an amendment is to check whether the Report Number field is blank. If it is, we can move on, confident that this is an original filing. If the field contains a value, we know that the filing is an amendment. 126 | 127 | ## What's being amended? 
128 | 129 | The section on [header rows](#headers) also taught us that the Report ID field will contain a value on amended filings (and will be blank otherwise, just like the Report Number field): 130 | 131 | > If the filing is an amendment, the report ID will look like this: *FEC-763780* 132 | 133 | The Report ID tells us *which filing this one amends*. 134 | 135 | For example, filing No. 776795 has this value in the Report ID field: FEC-772249. This means that it amends filing No. 772249. If we have previously saved filing No. 772249 somewhere, we should delete it and instead use the new filing. 136 | 137 | Note that amendments themselves do not get amended; only the original filing is amended. If a filing is amended multiple times, when we save the latest amendment, we can discard both the original filing and any previous amendments. Previous amendments can be found by looking at the Report ID field; all amendments to the same original filing will have the same value in that field. 138 | 139 | 140 | Bulk FTP Data 141 | ======== 142 | 143 | The FEC has offered bulk data for years, and [its offerings](http://www.fec.gov/finance/disclosure/ftp_download.shtml) include summary and detailed files covering committees, candidates and contributions. The bulk files are updated weekly, late on Sunday nights, so depending on your timing and needs the bulk files may not be suitable for every task. The advantage of the bulk files is that they are vetted by the FEC, with some of the records standardized (by adding FEC-issued committee ids, for example) and others removed to prevent duplicate records from appearing. The summary and detailed files are pipe-delimited, a relatively recent change for the FEC, and some older files may still be in fixed-width format. 144 | 145 | Bulk data files are contained inside zip files stored on the FTP server, so retrieving them via a web application requires several steps. 
The FTP data is updated early Monday morning each week, and previous cycles are updated as well, since committees can amend filings from an earlier election cycle. 146 | 147 | Summary files 148 | ======== 149 | 150 | The [summary files](http://www.fec.gov/finance/disclosure/ftpsum.shtml) include canonical election-cycle data for candidates and committees - one record for each candidate or committee per two-year election cycle, depending on the file selected. There are two candidate summary files -- one for campaigns that have elections in the current cycle, and one for all candidates no matter if they face election in the cycle or not -- and they can differ in amounts and timeliness. 151 | 152 | The [current campaigns file](ftp://ftp.fec.gov/FEC/webl12.zip) may be more timely, but also contains a single total for PAC contributions (compared to totals for different kinds of PACs in the other file) and some of its totals may contain double-counted transactions. More at the [data dictionary](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryWEBL.shtml). 153 | 154 | The [all candidates file](ftp://ftp.fec.gov/FEC/weball12.zip) can be updated slightly less frequently than the current campaigns one, but it contains more detailed breakdowns of certain types of transactions as noted above. The possibility of double-counting some kinds of transactions - transfers to and from authorized committees of a candidate - also exists. More at the [data dictionary](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryWEBALL.shtml). 155 | 156 | The [PAC summary file](ftp://ftp.fec.gov/FEC/2012/webk12.zip) provides the latest summary information on political action and party committees, including independent expenditures. More at the [data dictionary](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryWEBK.shtml). 
157 | 158 | The FEC used to generate summary files at the end of the election cycle for candidates and PACs that included totals for different types of PACs, but discontinued these files after the 2005-06 cycle. Files are available from 1979-80 through 2005-06. Party committee-only summary files exist from the 1991-92 cycle through the 2003-04 cycle. One of the most useful files, which contained a record for each combination of candidate recipient and PAC contributor/independent spender, covers the 1991-92 cycle through the 2001-02 cycle. A similar file exists for candidate-party activities during the same time period. 159 | 160 | These summary files had been stored in fixed-width format but were converted to pipe-delimited format in late July 2012. [Fech-FTP](https://github.com/dwillis/fech-ftp) is a Ruby library that wraps many of the summary files and some of the detailed files described below. [fecmaster](https://github.com/lukerosiak/fecmaster) is a Python library for downloading and importing the bulk files. 161 | 162 | There's one more thing to be aware of: the way that the FEC used to store its data (detailed and summary) relied on the use of [an "overpunch" character](http://www.fec.gov/finance/disclosure/ftpsum.shtml#overpunch) to represent negative amounts. This [recently changed for the detailed contribution files](http://www.fec.gov/blog/disclosure/entry/indiv_oth_and_pas2_file) and for the candidate and committee files, but it's possible that older summary files still contain such characters, in which case the amount fields should be imported as text and then converted to numeric columns, accounting for negative amounts. There is a tutorial for [working with the FTP files using Microsoft Access](http://www.fec.gov/finance/disclosure/working_with_data_files.pdf). 
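
As a rough illustration of the import-as-text-then-convert step described above, the Python sketch below decodes an amount whose final character may be an overpunch. The mapping is an assumption: it uses the conventional signed-overpunch characters (`}` for 0 and `J` through `R` for 1 through 9), and the name `parse_amount` is hypothetical. Check the FEC's overpunch documentation linked above before relying on it, since the characters used in a given file may differ.

```python
# ASSUMPTION: conventional signed-overpunch mapping for a negative final digit.
# Verify against the FEC's overpunch documentation before using on real files.
NEGATIVE_OVERPUNCH = dict(zip("}JKLMNOPQR", "0123456789"))


def parse_amount(raw):
    """Convert an amount read as text into an int, or None if the field is blank."""
    text = raw.strip()
    if not text:
        return None
    # Upper-case the last character so a lower-case letter is also recognized.
    last = text[-1].upper()
    if last in NEGATIVE_OVERPUNCH:
        # Swap the overpunch character for its digit and negate the result.
        return -int(text[:-1] + NEGATIVE_OVERPUNCH[last])
    return int(text)
```

Keeping the raw column as text and applying a function like this afterward means a plain numeric import never silently drops or mangles the negative amounts.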
163 | 164 | Detailed files 165 | ======== 166 | 167 | The [detailed files](http://www.fec.gov/finance/disclosure/ftpdet.shtml) include cycle-specific data for committees, candidates, individual contributions and committee transactions to candidates and between committees. Each cycle, beginning with 1979-80, has five files with data representing that cycle. The current cycle and the two or three cycles immediately preceding it also have three smaller tables that represent additions, changes and deletions to the individual contributions file. The FEC changed the format of these files from fixed-width to pipe-delimited in July 2012. These files are updated weekly on late Sunday evenings/early Monday mornings, so information submitted during the week before should be reflected in the next update. One advantage of using these files is that they represent "official" data that has been checked by the FEC. Nonetheless, there have been some errors. 168 | 169 | Committees 170 | --------- 171 | 172 | Known as the [committee master file](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCommitteeMaster.shtml), this has a record for each committee registered with the FEC during a cycle. The ID number assigned by the FEC to each committee is unique within the cycle, but committees often exist for years or even decades. If a committee is a campaign committee for a candidate for the House, Senate or President, it will include that candidate's ID number. A committee will be one of [several types](http://www.fec.gov/finance/disclosure/metadata/CommitteeTypeCodes.shtml). Other committees may have values for party affiliation, frequency of filings and connected organization, although these values are not universally present for each committee. 
173 | 174 | Candidates 175 | --------- 176 | 177 | Known as the [candidate master file](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCandidateMaster.shtml), this file contains one row for each candidate who either registered with the FEC or "appeared on a ballot list prepared by a state elections office." Key fields in this file include whether a candidate is an incumbent, challenger or seeking an open seat (although this field is not consistently populated), and the state and district the candidate is running in. 178 | 179 | Candidate Committee Linkage 180 | --------- 181 | 182 | A new offering as of July 2012, [this file](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCandCmteLinkage.shtml) contains one row for each combination of candidate and committee during an election cycle. Candidates can have a primary committee, authorized committees, joint fundraising committees and other relationships. The linkage file helps keep track of all of those relationships. In particular, joint fundraising committees can benefit multiple candidates; previously the committee master file would only list one. Note: it does not include a link between candidates and their leadership committees. 183 | 184 | Committee Contributions To Candidates 185 | --------- 186 | 187 | [This file](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryContributionstoCandidates.shtml) contains one row for each contribution from a committee to a candidate and for each independent expenditure made by a committee about a candidate. It is a subset of the file containing all transactions between one committee and another. This file can be used to calculate things such as which candidate got the most PAC money and which candidates have received contributions from a specific committee. 
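
One way to approach the "which candidate got the most PAC money" question is sketched below in Python. It assumes the pipe-delimited layout with no header row, so column positions are passed in instead of being hard-coded. The function name, the simplified four-column sample rows and the transaction type codes `24K` and `24E` are illustrative assumptions; check the real positions and codes against the data dictionary and the FEC's list of transaction types. Filtering on transaction type matters because this file mixes contributions with independent expenditures.

```python
import csv
from collections import defaultdict
from decimal import Decimal


def total_by_candidate(lines, cand_col, amount_col, type_col, keep_types):
    """Sum amounts per candidate ID, counting only the chosen transaction types."""
    totals = defaultdict(Decimal)
    # QUOTE_NONE treats any quote characters inside fields as ordinary text.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[type_col] not in keep_types:
            continue
        totals[row[cand_col]] += Decimal(row[amount_col])
    return dict(totals)


# Made-up rows in a simplified layout: committee | type | candidate | amount
sample = [
    "C00000001|24K|H0XX00001|5000",
    "C00000002|24K|H0XX00001|2500",
    "C00000003|24E|H0XX00001|90000",  # hypothetical independent expenditure, excluded
    "C00000001|24K|S0XX00002|1000",
]
totals = total_by_candidate(
    sample, cand_col=2, amount_col=3, type_col=1, keep_types={"24K"}
)
top_candidate = max(totals, key=totals.get)
```

Passing an open file object in place of `sample` works the same way, since `csv.reader` accepts any iterable of lines.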
188 | 189 | Any Transaction from One Committee to Another 190 | --------- 191 | 192 | [This file](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryCommitteetoCommittee.shtml) contains a row for each transaction between two committees, regardless of whether either committee is a candidate committee or not. For example, this file includes contributions from a corporate PAC to a national party committee in addition to contributions to candidate committees. The list of [transaction types](http://www.fec.gov/finance/disclosure/metadata/DataDictionaryTransactionTypeCodes.shtml) describes the kind of transaction each row represents. The quirk here is that in many cases, both sides of the transaction get separate records: one for the committee making the contribution (usually beginning with 2X) and another for the committee receiving the contribution (usually beginning with 1X). Using this file for SQL queries means including transaction type in some manner in pretty much every query; which types you want depends on a couple of factors, such as timeliness (contributing committees can file before recipient committees) or completeness (a recipient committee usually is the definitive account of what it has gotten). 193 | 194 | 195 | Data Catalog 196 | ======== 197 | 198 | The data catalog is a collection of some of the summary files available via FTP as well as other files covering disbursements, independent expenditures and leadership PACs, among other subjects. The files are available in CSV or XML formats, and cover single cycles (mostly 2010 and 2012, although the summary files also include 2008 data). One advantage of the data catalog files is that they can be called directly from a web application without having to unzip them, but there are some drawbacks. [Independent Expenditures](http://fec.gov/data/IndependentExpenditure.do?format=html&election_yr=2014) include both original transactions and amendments, resulting in duplicate records in those cases. 
In another example, the [listing of leadership PACs](http://fec.gov/data/Leadership.do?format=html&election_yr=2014) contains an entry for the corporate PAC of Interactive Corp. Most of the data catalog files are updated daily, and they are the one place where it's possible to find [candidate disbursements](http://fec.gov/data/CandidateDisbursement.do?format=html&election_yr=2014) in statewide or district-level files. The files themselves are stored on the FEC's FTP server, so it's possible to grab them directly. The FEC also maintains [a blog about its data](http://fec.gov/blog/) that includes changes and additions to its data offerings. 199 | 200 | The More You Know 201 | ========== 202 | 203 | * The electronic filings are technically unofficial until the FEC processes them, and they will contain typos or incomplete information. Some are not filed properly, particularly when it comes to lump sum expenditures, which must report both a unitemized total and then most itemizations (think of a single credit card payment covering multiple transactions). The agency can take several weeks or more to process filings from significant filing dates, such as quarterly deadlines. Even after the FEC processes filings and places transactions into the bulk FTP data, there may be errors - the most common is misidentified committee IDs, as electronic filings are not required to include committee or candidate IDs. 204 | 205 | * The time lag for FEC processing is the biggest practical concern, but there are other issues with timing, particularly as it relates to filing schedules. Committees usually file either monthly or quarterly, with monthly deadlines on the 20th of each month and quarterly deadlines on April, July and October 15, followed by a year-end report due on Jan. 31. There also are reports due just before and after a primary, runoff, special or general election for any committee involved in that election. 
206 | 207 | * You'll also come across joint committees, which are almost required for most competitive races, especially Senate ones. These committees can solicit larger donations and then divide them among multiple recipients. Here's an example: a donor can give $40,000 to the "Ryan Pitts Victory Fund," of which $5,200 will go to Ryan's campaign and the remainder to the national or Senate committee of Ryan's party. The original donation and the ultimate receipts are *both* reported, so as a data user you will need to avoid double-counting. Depending on what you want to know, you should use either the original donation or the divided shares. 208 | 209 | * State and local party committees can have federal and state accounts, but the FEC filing represents the federal portion (although it also includes joint spending, in which both the federal and state accounts pay for some expenses). FEC rules apply to donations to federal accounts of state & local party committees, but states have different rules for state accounts. 210 | 211 | * Donors who are unaware of contribution limits and donate more money than is permitted will still have their names and contributions reported in FEC filings. Committees have to do this, and then they refund the money, usually in the same filing. Refunds should be classified as expenditures, but some committees list them as contributions with negative amounts. 212 | 213 | * The person at the FEC you want to talk to in case of data issues is Paul Clark (pcclark@fec.gov). The press office is typically friendly and responsive for general questions. 214 | 215 | Authors 216 | ========== 217 | 218 | Derek Willis has been using federal campaign finance data since 1996, when he helped report on soft money donors in Florida for The Palm Beach Post. 
He is the co-author of [an IRE book on covering campaign finance](http://www.amazon.com/Unstacking-Deck-Reporters-Campaign-Finance/dp/0976603756), researched [political non-profits](http://www.publicintegrity.org/2003/09/25/5564/silent-partners-special-report) for The Center for Public Integrity and developed The New York Times [Campaign Finance API](http://developer.nytimes.com/docs/campaign_finance_api/). Willis currently uses FEC data to write stories for [The Upshot](http://www.nytimes.com/upshot/). Reach him @derekwillis. 219 | 220 | Aaron Bycoffe is data editor at The Huffington Post, where he has used FEC data to [find donors who have exceeded legal limits](http://www.huffingtonpost.com/2013/07/22/campaign-contribution-limits_n_3607672.html) and to build interactives. He previously worked at The Sunlight Foundation, where he tracked [super PACs](https://sunlightfoundation.com/blog/2011/08/01/super-pacs-raise-combined-26-million-first-half-year/) and other campaign finance vehicles. He's at @bycoffe. 221 | --------------------------------------------------------------------------------