├── .circleci
    ├── README.md
    └── config.yml
├── README.md
├── convert_to_csv.py
├── lint.py
├── misconduct-instances.csv
├── misconduct.yaml
└── test.py


/.circleci/README.md:
--------------------------------------------------------------------------------
1 | This directory helps us perform automated validity tests on the data using CircleCI.com.
2 | 


--------------------------------------------------------------------------------
/.circleci/config.yml:
--------------------------------------------------------------------------------
 1 | version: 2
 2 | jobs:
 3 |    build:
 4 |      docker:
 5 |        - image: circleci/python:3
 6 |      steps:
 7 |        - checkout
 8 |        - run: pip install --user rtyaml
 9 |        - run: python test.py
10 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | Congressional Misconduct
  2 | ========================
  3 | 
  4 | This repository contains a single YAML file documenting instances of misconduct
  5 | and alleged misconduct by Members of the United States Congress, created by GovTrack.us,
  6 | covering 1789 to the present. A CSV file is also included for those wishing to import the data into a spreadsheet.
  7 | 
  8 | This database contains:
  9 | 
 10 | * All letters of reproval, censures, and expulsions from Congress from 1789 to the present.
 11 | * All investigations by the [House Office of Congressional Ethics (OCE)](https://oce.house.gov/), the [House Committee on Ethics (HCE)](https://ethics.house.gov/), and the [Senate Select Committee on Ethics (SSCE)](https://www.ethics.senate.gov/public/), and other investigations by a body of Congress that involved alleged personal misconduct from 1789 to the present, including all investigations by the Senate on whether to allow a senator-elect to be seated when it stemmed from allegations of personal misconduct.
 12 | * Failed votes on simple resolutions sanctioning a member without a formal investigation when we came across them.
 13 | * As many monetary settlements as we are aware of, e.g. those administered by Congress's [Office of Compliance](https://www.compliance.gov/) regarding sexual harassment claims, but many settlements are not known to the public.
 14 | * Resignations and announcements of an intention not to run for re-election that we believe to be likely relevant to an allegation of misconduct, because Members of Congress often resign to head off a Congressional investigation.
 15 | * Felony convictions and other cases of misconduct with national significance before and after the Member's time in Congress.
 16 | 
 17 | Investigations, settlements, and resignations **do not imply guilt**. Some investigations are motivated by politics or a personal grudge, settlements are often used when it would be less costly than defending a lawsuit, and Members of Congress often resign when they are likely to lose re-election (regardless of why). We include investigations even if the Member of Congress is exonerated, per the above rubric, because the investigation and exoneration are themselves important events not to be forgotten. Conversely, an investigation that ends without a guilty determination **does not imply innocence** --- Congress polices itself in many cases and Members of Congress are reluctant to punish their peers.
 18 | 
 19 | Unique to our database are tags describing the type of misconduct such as corruption, general ethical violations or sexual harassment and abuse; consequences including censure, expulsion, resignation and convictions; and case status of resolved/unresolved. Using these filters you can see for example that at least two still open cases have been extended through multiple Congresses with no end date in sight.
 20 | 
 21 | 
 22 | This database can be viewed at:
 23 | 
 24 | * https://www.govtrack.us/misconduct
 25 | 
 26 | Sources
 27 | -------
 28 | 
 29 | * The House’s [Office of Congressional Ethics (OCE)](https://oce.house.gov/)
 30 | * The [House Committee on Ethics (HCE)](https://ethics.house.gov/)
 31 | * The House's [Historical Summary of Conduct Cases in the House of Representatives, Committee on Standards of Official Conduct,1798-2004](https://ethics.house.gov/sites/ethics.house.gov/files/Historical_Chart_Final_Version%20in%20Word_0.pdf)
 32 | * The [Senate Select Committee on Ethics](https://www.ethics.senate.gov/public/)
 33 | * [Wikipedia’s list of convictions of American politicians](https://en.wikipedia.org/wiki/List_of_American_federal_politicians_convicted_of_crimes), as of Jan 23, 2018
 34 | * [The Washington Post’s list of indictments](https://www.washingtonpost.com/news/the-fix/wp/2015/07/29/more-than-two-dozen-members-of-congress-have-been-indicted-since-1980/)
 35 | * [United States Senate Election, Expulsion, and Censure Cases](https://babel.hathitrust.org/cgi/pt?id=umn.31951p00933065r;view=1up;seq=7)
 36 | * [Congress.gov](https://www.congress.gov/)
 37 | 
 38 | How Congress Deals With Misconduct
 39 | ----------------------------------
 40 | 
 41 | Congressional ethics investigations, censure, and expulsion have a complex process and purpose. See:
 42 | 
 43 | * [Expulsion and Censure Actions Taken by the Full Senate Against Members](https://www.everycrsreport.com/reports/93-875.html) (CRS report)
 44 | * [Enforcement of Congressional Rules of Conduct: A Historical Overview](https://www.everycrsreport.com/reports/RL30764.html) (CRS report)
 45 | 
 46 | Data Dictionary
 47 | ---------------
 48 | 
 49 | The data file [misconduct.yaml](misconduct.yaml) is a YAML-formatted text file.
 50 | 
 51 | The file is a list of instances of misconduct or alleged misconduct, i.e. each entry represents an allegation or a collection of related allegations. These entries have fields describing the misconduct or alleged misconduct, including which Member of Congress is accused, a textual summary of the allegation and consequences, a field for just the allegation, and detailed data on consequences.
 52 | 
 53 | We add new instances of misconduct or alleged misconduct to the top of the file,
 54 | but the file is not otherwise in any particular order. We recommend sorting based
 55 | on the date of the first consequence.
 56 | 
 57 | ### Instances of misconduct or alleged misconduct
 58 | 
 59 | Each entry in [misconduct.yaml](misconduct.yaml) has the following fields:
 60 | 
 61 | `person` is the numeric ID of the accused Member of Congress on GovTrack. See [https://github.com/unitedstates/congress-legislators/](https://github.com/unitedstates/congress-legislators/) for metadata on Members of Congress.
 62 | 
 63 | `name` is the name of the person for debugging purposes only.
 64 | 
 65 | `text` is a summary of the allegation, investigations, and corrective actions taken. It is written in [Markdown format](https://daringfireball.net/projects/markdown/) with light use of rich text like links.
 66 | 
 67 | `allegation` is a noun (e.g. `sexual harassment`) or gerund phrase (e.g. `asking staff members to carry his surrogate child`)
 68 | summarizing the misconduct, which completes the sentence "The member was accused of ...".
 69 | 
 70 | `consequences` is a list, in (forward) chronological order, of consequences that resulted
 71 | from the misconduct, including investigations, expulsion, resignation, conviction,
 72 | and other helpful notes that provide context. The data format of a consequence is documented next.
 73 | 
 74 | `tags` is a space-separated, alphabetically ordered list of tags. Tags for these records describe the nature of the allegations and are one or more of:
 75 | 
 76 | * `elections` - Elections and campaign-related allegations.
 77 | * `corruption` - Bribery, extortion, and other criminal corruption.
 78 | * `sexual-harassment-abuse` - Sexual harassment and abuse.
 79 | * `crime` - Tax evasion, murder, fraud, and other crimes (besides corruption and sexual harassment and abuse).
 80 | * `ethics` - Violations of congressional rules that are not crimes.
 81 | * `resolved` - Either the investigation has formally ended, the legal process has concluded, the member has left Congress, or the member has died. Every instance of misconduct or alleged misconduct is tagged with either `resolved` or `unresolved`.
 82 | * `unresolved` - There is an open or pending investigation or other ongoing legal process related to this instance of misconduct or alleged misconduct. Every instance of misconduct or alleged misconduct is tagged with either `resolved` or `unresolved`.
 83 | 
 84 | Note that consequences can also have tags but use a different set of tags.
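
The tag rules above can be checked with a few lines of Python. This is a minimal sketch, not part of the repository's scripts: the function name `parse_instance_tags` and the sample entry are hypothetical, and it assumes that "either `resolved` or `unresolved`" means exactly one of the two.

```python
# Allowed tags for an instance, as listed in the data dictionary above.
INSTANCE_TAGS = {
    "elections", "corruption", "sexual-harassment-abuse", "crime",
    "ethics", "resolved", "unresolved",
}

def parse_instance_tags(entry):
    """Split the space-separated `tags` string into a set and validate it.

    Hypothetical helper for illustration only.
    """
    tags = set(entry.get("tags", "").split())
    unknown = tags - INSTANCE_TAGS
    if unknown:
        raise ValueError("unknown tags: {}".format(sorted(unknown)))
    # Assumption: every entry carries exactly one status tag.
    if len(tags & {"resolved", "unresolved"}) != 1:
        raise ValueError("expected exactly one of 'resolved' or 'unresolved'")
    return tags

# Made-up entry, not taken from misconduct.yaml.
example = {"tags": "corruption ethics resolved"}
example_tags = parse_instance_tags(example)
```

Returning a set makes filtering by tag straightforward, e.g. `"corruption" in example_tags`.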
 85 | 
 86 | The date of the misconduct or alleged misconduct is only present in an unstructured way in
 87 | `text` and `allegation` because misconduct is often a set of events that don't
 88 | have a single precise date, and, further, the allegation may not have occurred.
 89 | Instead, each consequence is listed with a date. The date of the first (oldest) consequence,
 90 | which is typically about the start of an investigation, is the best date to use for sorting.
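
One way to sort by the date of the first consequence is sketched below. It assumes the file has been loaded into a list of dicts (for example with `rtyaml.load`), that full dates arrive as `datetime.date` objects, years as integers, and year-month values as strings, and that all years have four digits. The helper name `first_consequence_key` and the sample entries are hypothetical.

```python
from datetime import date

def first_consequence_key(entry):
    """Sort key: the first consequence's date rendered as an ISO-style string.

    str() gives "1798" for an int year, leaves "2018-01" unchanged, and
    gives "2018-01-23" for a datetime.date, so plain string comparison
    orders the three forms chronologically (coarser dates sort first).
    """
    return str(entry["consequences"][0]["date"])

# Made-up entries for illustration; not taken from misconduct.yaml.
entries = [
    {"name": "B", "consequences": [{"date": date(2018, 1, 23)}]},
    {"name": "A", "consequences": [{"date": 1798}]},
    {"name": "C", "consequences": [{"date": "2018-01"}]},
]
ordered = sorted(entries, key=first_consequence_key)
```

Comparing strings sidesteps the `TypeError` that Python raises when comparing an `int` or `str` with a `date` directly.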
 91 | 
 92 | ### Consequences
 93 | 
 94 | Each consequence has its own fields. There are two forms for a consequence.
 95 | 
 96 | #### Actions taken by governmental bodies
 97 | 
 98 | The first form has `date`, `body`, `action`, and `link` fields and represents an
 99 | action taken by a governmental body.
100 | 
101 | `body` is the name of a governmental body that took an action, such as the House Office of
102 | Congressional Ethics.
103 | 
104 | `action` is a sentence fragment --- a verb phrase --- that has the action the body took. It should complete the sentence that starts with `body`, so `action` normally starts with a lowercase letter, and it should not end with a period. (`action` may not contain Markdown.)
105 | 
106 | `link` is a URL to any supporting evidence, often a report or press release issued by
107 | `body` documenting the action they took. News articles or other links may also appear.
108 | In rare cases multiple links may be given in YAML list notation.
109 | 
110 | #### Other consequences and contextual notes
111 | 
112 | The second form has `date`, `text`, and `link` fields to represent other consequences
113 | not caused by actions of governmental bodies and other contextual notes.
114 | 
115 | `text` may be either a full sentence or a sentence fragment starting with the verb. In
116 | either case, `text` starts with a capital letter and ends with a period. (This `text`
117 | may not contain Markdown.)
118 | 
119 | `link` is a URL to any supporting evidence, such as a contemporary news article or a
120 | primary source government document. In rare cases multiple links may be given in YAML
121 | list notation.
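
The two forms can be turned into display sentences as sketched here. The function name `describe_consequence`, the sample records, and the `example.com` URLs are hypothetical; the sketch assumes each record has already been validated as one form or the other.

```python
def describe_consequence(cons):
    """Return (sentence, links) for a consequence in either form.

    Hypothetical helper for illustration only.
    """
    if "body" in cons:
        # First form: `action` completes the sentence started by `body`
        # and carries no trailing period, so one is added here.
        sentence = "{} {}.".format(cons["body"], cons["action"])
    else:
        # Second form: `text` is already capitalized and ends in a period.
        sentence = cons["text"]
    link = cons.get("link")
    if link is None:
        links = []
    elif isinstance(link, str):
        links = [link]
    else:
        # Rare case: multiple links given as a YAML list.
        links = list(link)
    return sentence, links

# Made-up records, not taken from misconduct.yaml.
first_form = {
    "date": 2018,
    "body": "House Committee on Ethics",
    "action": "opened an investigation",
    "link": "https://example.com/report",
}
second_form = {
    "date": "2018-01",
    "text": "Resigned.",
    "link": ["https://example.com/a", "https://example.com/b"],
}
```

Normalizing `link` to a list means callers never need to check whether one URL or several were given.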
122 | 
123 | #### Consequence dates
124 | 
125 | Because some dates are unknown or actions may have occurred over a period
126 | longer than a single day, `date` may be either a year alone `YYYY`, a year and month
127 | `YYYY-MM`, or a full date `YYYY-MM-DD`. Note that in YAML, a full date or a year
128 | alone do not require quotes (the former is parsed as a date value and the latter
129 | an integer) but a year and month do require being surrounded in quotes to be
130 | valid YAML.
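
Because a loaded `date` can therefore be a `datetime.date`, an `int`, or a `str`, code that consumes the file may want to normalize it first. The following is a minimal sketch under that assumption; the helper name `date_parts` is hypothetical.

```python
import re
from datetime import date

def date_parts(value):
    """Return (year, month, day), with None for parts that are not given.

    Hypothetical helper for illustration only.
    """
    if isinstance(value, date):
        return (value.year, value.month, value.day)
    # A year alone loads as an int and a year-month as a string;
    # str() lets one pattern handle both.
    m = re.fullmatch(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?", str(value))
    if not m:
        raise ValueError("unrecognized date: {!r}".format(value))
    return tuple(int(p) if p is not None else None for p in m.groups())
```

For example, `date_parts(1798)` gives `(1798, None, None)` and `date_parts("2018-01")` gives `(2018, 1, None)`.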
131 | 
132 | #### Tags
133 | 
134 | Consequences may also have `tags`. The `tags` field is a space-separated, alphabetical
135 | list of tags from the following sets. This first set marks that the legislator was found guilty of the misconduct (i.e. the allegation was upheld):
136 | 
137 | * `expulsion` - Expulsion by the Senate or House.
138 | * `censure` - Censure by the Senate or House. (Committee recommendations of censure are not tagged.)
139 | * `contempt` - Held in contempt of Congress by the Senate or House.
140 | * `reprimand` - Admonishment, reprimand, or letter of reproval.
141 | * `fined` - Fined by the Senate or House.
142 | * `exclusion` - A member-elect was prevented from being seated by the Senate or House.
143 | * `conviction` - Conviction in a court.
144 | * `plea` - Pleaded guilty or no contest in a court. (Sometimes we see this as a legislator paying a municipal fine.)
145 | * `confirmation` - The allegation is confirmed as true. This tag can be used in rare cases to indicate our assertion that the legislator is guilty when there was no actual adverse consequence.
146 | 
147 | In a small number of cases the consequence was reversed (e.g. on appeal), and in those cases the tag should _not_ be used so that the legislator is not incorrectly flagged as guilty.
148 | 
149 | These additional tags may be used but don't indicate guilt:
150 | 
151 | * `resignation` - Resignation from office because of the allegation.
152 | * `settlement` - Monetary settlement.
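
The split between the two tag sets can be used to flag entries where the allegation was upheld. This is a sketch only: the names `GUILT_TAGS` and `found_guilty` and the sample entries are hypothetical, and it relies on reversed consequences being left untagged as described above.

```python
# Consequence tags that mark a finding of guilt, per the first list above.
GUILT_TAGS = {
    "expulsion", "censure", "contempt", "reprimand", "fined",
    "exclusion", "conviction", "plea", "confirmation",
}

def found_guilty(entry):
    """True if any consequence carries a guilt-indicating tag.

    Hypothetical helper for illustration only.
    """
    for cons in entry["consequences"]:
        if set(cons.get("tags", "").split()) & GUILT_TAGS:
            return True
    return False

# Made-up entries, not taken from misconduct.yaml.
resigned_only = {"consequences": [{"date": 2018, "tags": "resignation"}]}
censured = {"consequences": [
    {"date": 2017},
    {"date": 2018, "tags": "censure resignation"},
]}
```

Here `found_guilty(resigned_only)` is `False` because `resignation` and `settlement` do not indicate guilt, while `found_guilty(censured)` is `True`.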
153 | 
154 | ## Public domain
155 | 
156 | This project is dedicated to the public domain. Copyright and related rights in the work worldwide are waived through the [CC0 1.0 Universal public domain dedication](http://creativecommons.org/publicdomain/zero/1.0/).
157 | 
158 | All contributions to this project must be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of your copyright interest.
159 | 


--------------------------------------------------------------------------------
/convert_to_csv.py:
--------------------------------------------------------------------------------
 1 | import rtyaml, csv, re
 2 | from functools import reduce
 3 | 
 4 | with open("misconduct.yaml") as f1:
 5 |   misconduct = rtyaml.load(f1)
 6 | 
 7 | def get_tags(instance, only=None):
 8 |   tags = []
 9 |   if only != "consequence":
10 |     tags.extend(re.findall(r"[\w-]+", instance.get("tags", "")))
11 |   if only != "instance":
12 |     for consequence in instance["consequences"]:
13 |       tags.extend(re.findall(r"[\w-]+", consequence.get("tags", "")))
14 |   if "" in tags: tags.remove("")
15 |   return set(tags)
16 | 
17 | all_instance_tags = reduce(lambda x,y:x|y, (get_tags(instance, only="instance") for instance in misconduct))
18 | all_consequence_tags = reduce(lambda x,y:x|y, (get_tags(instance, only="consequence") for instance in misconduct))
19 | 
20 | with open("misconduct-instances.csv", "w", newline="") as f2:
21 |   # Make the columns.
22 |   date_cols = ["first_date", "last_date", ]
23 |   metadata_cols = ["person", "name", "allegation", "text"]
24 |   tags_cols = sorted(all_instance_tags) + sorted(all_consequence_tags)
25 | 
26 |   # Write out.
27 |   w = csv.writer(f2)
28 |   w.writerow(date_cols + metadata_cols + tags_cols)
29 |   for instance in misconduct:
30 |     tags = get_tags(instance)
31 |     w.writerow(
32 |       [ instance["consequences"][0]["date"], instance["consequences"][-1]["date"] ] +
33 |       [ instance[field] for field in metadata_cols ] +
34 |       [ ("X" if tag in tags else "") for tag in tags_cols ]
35 |       )
36 | 
37 | 


--------------------------------------------------------------------------------
/lint.py:
--------------------------------------------------------------------------------
 1 | import rtyaml
 2 | 
 3 | fn = "misconduct.yaml"
 4 | 
 5 | with open(fn) as f:
 6 | 	M = rtyaml.load(f)
 7 | 
 8 | for record in M:
 9 | 	if "tags" in record:
10 | 		record["tags"] = " ".join(sorted(set(record["tags"].split(" "))))
11 | 	for cons in record["consequences"]:
12 | 		if "tags" in cons:
13 | 			cons["tags"] = " ".join(sorted(set(cons["tags"].split(" "))))
14 | 
15 | with open(fn, "w") as f:
16 | 	f.write(rtyaml.dump(M))
17 | 


--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/python3
  2 | 
  3 | from datetime import date
  4 | import re
  5 | import sys
  6 | import rtyaml
  7 | 
  8 | has_error = False
  9 | def error(*args):
 10 | 	global has_error
 11 | 	has_error = True
 12 | 	if len(args) == 1:
 13 | 		incident, consequence, message = None, None, args[0]
 14 | 	elif len(args) == 2:
 15 | 		incident, consequence, message = args[0], None, args[1]
 16 | 	elif len(args) == 3:
 17 | 		incident, consequence, message = args
 18 | 	else:
 19 | 		raise ValueError(args)
 20 | 	if incident: print("In <", rtyaml.dump(incident)[:64].replace("\n"," --- "), ">", file=sys.stderr)
 21 | 	if consequence: print("... <", rtyaml.dump(consequence)[:64].replace("\n"," --- "), ">", file=sys.stderr)
 22 | 	print(message, file=sys.stderr)
 23 | 	print(file=sys.stderr)
 24 | 
 25 | def remove_markdown_link_urls(s):
 26 | 	return re.sub(r"\(http.*?\)", "", s)
 27 | 
 28 | try:
 29 | 	misconduct = rtyaml.load(open("misconduct.yaml"))
 30 | except Exception as e:
 31 | 	error(str(e))
 32 | 	sys.exit(1)
 33 | 
 34 | if not isinstance(misconduct, list):
 35 | 	error("misconduct.yaml is not a list.")
 36 | 	sys.exit(1)
 37 | for incident in misconduct:
 38 | 	if not isinstance(incident, dict):
 39 | 		error(incident, "Incident is not a dict.")
 40 | 		continue
 41 | 	if not isinstance(incident.get("person"), int):
 42 | 		error(incident, "Incident is missing or has invalid 'person', should be an integer.")
 43 | 	# TODO: Check ID is a real GovTrack person ID.
 44 | 
 45 | 	if not isinstance(incident.get("name"), str):
 46 | 		error(incident, "Incident is missing or has invalid 'name', should be a string.")
 47 | 
 48 | 	if not isinstance(incident.get("text"), str):
 49 | 		error(incident, "Incident is missing or has invalid 'text', should be a string.")
 50 | 		# The length checks below index incident["text"], so skip this incident.
 51 | 		continue
 52 | 
 53 | 	if not isinstance(incident.get("allegation"), str):
 54 | 		error(incident, "Incident is missing or has invalid 'allegation', should be a string.")
 55 | 		continue
 56 | 	if not isinstance(incident.get("consequences"), list):
 57 | 		error(incident, "Incident is missing or has invalid 'consequences', should be a list.")
 58 | 		continue
 59 | 
 60 | 	if not isinstance(incident.get("tags"), str):
 61 | 		error(incident, "Incident is missing or has invalid 'tags', should be a string.")
 62 | 		continue
 63 | 	elif "tags" in incident:
 64 | 		tags = set(incident["tags"].split(" "))
 65 | 		bad_tags = tags - {
 66 | 			"elections", "corruption", "sexual-harassment-abuse", "crime",
 67 | 			"ethics", "resolved", "unresolved"}
 68 | 		if bad_tags:
 69 | 			error(incident, "Incident has invalid 'tags': {}".format(bad_tags))
 70 | 
 71 | 	for cons in incident["consequences"]:
 72 | 		if not isinstance(cons, dict):
 73 | 			error(incident, cons, "Consequence should be a dict.")
 74 | 			continue
 75 | 		if isinstance(cons.get("date"), date):
 76 | 			pass # good, a full date (a year alone loads as an int and is checked below)
 77 | 		elif not isinstance(cons.get("date"), (int, str)):
 78 | 			error(incident, cons, "Consequence is missing or has an invalid date.")
 79 | 		elif not re.match(r"(\d\d\d\d)(-(\d\d)(-(\d\d))?)?$", str(cons["date"])):
 80 | 			error(incident, cons, "Consequence has an invalid date.")
 81 | 
 82 | 		if "body" not in cons and "text" not in cons:
 83 | 			error(incident, cons, "Consequence should have either 'body' or 'text'.")
 84 | 		elif "body" in cons and "text" in cons:
 85 | 			error(incident, cons, "Consequence cannot have both 'body' and 'text'.")
 86 | 
 87 | 		elif "text" in cons:
 88 | 			if not isinstance(cons["text"], str):
 89 | 				error(incident, cons, "Consequence 'text' should be a string.")
 90 | 			elif not cons["text"] or cons["text"][0] == cons["text"][0].lower() or cons["text"][-1] != ".":
 91 | 				error(incident, cons, "Consequence text should be a full sentence starting with a capital letter and ending in a period.")
 92 | 
 93 | 		else:
 94 | 			if not isinstance(cons["body"], str):
 95 | 				error(incident, cons, "Consequence 'body' should be a string.")
 96 | 			if not isinstance(cons.get("action"), str):
 97 | 				error(incident, cons, "In consequence with body, 'action' should be a string.")
 98 | 
 99 | 		for field in ("text", "action"):
100 | 			if field in cons:
101 | 				if "](" in cons[field]:
102 | 					error(incident, cons, "Consequence looks like it has a Markdown link in {} that should be in the link field instead.".format(field))
103 | 
104 | 		if not isinstance(cons.get("link"), (type(None), str, list)):
105 | 			error(incident, cons, "Consequence has an invalid 'link' value.")
106 | 		if isinstance(cons.get("link"), list):
107 | 			for item in cons["link"]:
108 | 				if not isinstance(item, str):
109 | 					error(incident, cons, "Consequence has an invalid 'link' value.")
110 | 
111 | 		if "tags" in cons and not isinstance(cons["tags"], str):
112 | 			error(incident, cons, "Consequence has invalid 'tags', should be a string.")
113 | 			continue
114 | 		elif "tags" in cons:
115 | 			tags = set(cons["tags"].split(" "))
116 | 			bad_tags = tags - {
117 | 				"expulsion", "censure", "contempt", "reprimand", "fined", "resignation", "exclusion",
118 | 				"settlement", "conviction", "plea", "confirmation" }
119 | 			if bad_tags:
120 | 				error(incident, cons, "Consequence has invalid 'tags': {}.".format(bad_tags))
121 | 
122 | 	# Suggest incidents whose allegation or text fields probably could be shortened.
123 | 	if len(incident["allegation"]) > 700:
124 | 		error(incident, "'allegation' could probably be shorter.")
125 | 	if incident.get("person") != 456921:
126 | 		if len(incident["consequences"]) > 2 and len(remove_markdown_link_urls(incident["text"])) > 1200:
127 | 			error(incident, "'text' could probably be shorter.")
128 | 		elif len(incident["consequences"]) > 2 and len(incident["text"]) > 400 and len(remove_markdown_link_urls(incident["text"])) > .8 * (len(incident["allegation"]) + len(" ".join(remove_markdown_link_urls(str(cons)) for cons in incident["consequences"]))):
129 | 			error(incident, "'text' could probably be shorter.")
130 | 
131 | 
132 | if has_error:
133 | 	sys.exit(1)
134 | 


--------------------------------------------------------------------------------