├── README.md
├── generate_readme.py
├── scratchpad
├── high_city_ca_pay.py
└── more_scotus_laughs.py
└── scripts
├── 1.py
├── 10.py
├── 100.py
├── 101.py
├── 11.py
├── 12.py
├── 13.py
├── 14.py
├── 15.py
├── 16.py
├── 17.py
├── 18.py
├── 19.py
├── 2.py
├── 20.py
├── 21.py
├── 22.py
├── 23.py
├── 24.py
├── 25.py
├── 26.py
├── 27.py
├── 28.py
├── 29.py
├── 3.py
├── 30.py
├── 31.py
├── 32.py
├── 33.py
├── 34.py
├── 35.py
├── 36.py
├── 37.py
├── 38.py
├── 39.py
├── 4.py
├── 40.py
├── 41.py
├── 42.py
├── 43.py
├── 44.py
├── 45.py
├── 46.py
├── 47.py
├── 48.py
├── 49.py
├── 5.py
├── 50.py
├── 51.py
├── 52.py
├── 53.py
├── 54.py
├── 55.py
├── 56.py
├── 57.py
├── 58.py
├── 59.py
├── 6.py
├── 60.py
├── 61.py
├── 62.py
├── 63.py
├── 64.py
├── 65.py
├── 66.py
├── 68.py
├── 69.py
├── 7.py
├── 70.py
├── 71.py
├── 72.py
├── 73.py
├── 74.py
├── 75.py
├── 76.py
├── 77.py
├── 78.py
├── 79.py
├── 8.py
├── 80.py
├── 81.py
├── 82.py
├── 83.py
├── 84.py
├── 85.py
├── 86.py
├── 87.py
├── 88.py
├── 89.py
├── 9.py
├── 90.py
├── 91.py
├── 92.py
├── 93.py
├── 94.py
├── 95.py
├── 96.py
├── 97.py
├── 98.py
└── 99.py
/README.md:
--------------------------------------------------------------------------------
1 | ## Search-Script-Scrape: 101 webscraping and research tasks for the data journalist
2 |
3 | __Note:__ This exercise set is part of the [Stanford Computational Journalism Lab](http://cjlab.stanford.edu). I've also written [a blog post that gives a little more elaboration about the libraries used and a few of the exercises](http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/).
4 |
5 | -------------
6 |
7 | This repository contains [101 Web data-collection tasks](#the-tasks) in Python 3 that I assigned to my [Computational Journalism class in Spring 2015](http://www.compjour.org) to give them regular exercise in programming and conducting research, and to expose them to the variety of data published online.
8 |
9 | The hard part of many of these tasks is researching and finding the actual data source. The scripts need only concern themselves with fetching the data and printing the answer in the least painful way possible. Since the [Computational Journalism class](http://www.compjour.org) wasn't intended to be an actual programming class, adherence to idioms and best coding practices was not emphasized...(especially since I'm new to Python myself!)
10 |
11 | Some examples of the tasks:
12 |
13 | - [The California city whose city manager has the highest total wage per capita in 2012](https://github.com/compjour/search-script-scrape/blob/master/scripts/100.py) ([expanded version](scratchpad/high_city_ca_pay.py))
14 | - [In the most recently transcribed Supreme Court argument, the number of times laughter broke out](https://github.com/compjour/search-script-scrape/blob/master/scripts/50.py) ([expanded version](scratchpad/more_scotus_laughs.py))
15 | - [Number of days until Texas's next scheduled execution](scripts/29.py)
16 | - [The U.S. congressmember with the most Twitter followers](https://github.com/compjour/search-script-scrape/blob/master/scripts/90.py)
17 | - [The number of people who visited a U.S. government website using Internet Explorer 6.0 in the last 90 days](https://github.com/compjour/search-script-scrape/blob/master/scripts/3.py)
18 |
19 | ## Repo status
20 |
21 |
22 | The table below links to the available scripts. If there's not a link, it means I haven't committed the code. For some of them, I wanted to rethink a less verbose solution (or the target changed, as the Internet sometimes does), and now this repo has taken a backseat to many other data projects on my list. `¯\_(ツ)_/¯`
23 |
24 | Note: A lot of the code is not best practice. The tasks are a little repetitive so I got bored and [ignored PEP8](https://www.python.org/dev/peps/pep-0008/) and/or tried new libraries/conventions for fun.
25 |
26 |
27 | __Note:__ The "__related URL__" links either to the official source of the data or to a page with some background information. The second column of this table refers to the __line count__ of the script, __not__ the answer to the prompt.
28 |
29 | ## The tasks
30 |
31 |
32 | The repo currently contains scripts for __100__ of __101__ tasks:
33 |
34 | | Title | Line count |
35 | |-------------------------|-------------|
36 | | 1. Number of datasets currently listed on data.gov [related URL] [script] | 7 lines |
37 | | 2. The name of the most recently added dataset on data.gov [related URL] [script] | 7 lines |
38 | | 3. The number of people who visited a U.S. government website using Internet Explorer 6.0 in the last 90 days [related URL] [script] | 4 lines |
39 | | 4. The number of librarian-related job positions that the federal government is currently hiring for [related URL] [script] | 6 lines |
40 | | 5. The name of the company cited in the most recent consumer complaint involving student loans [related URL] [script] | 27 lines |
41 | | 6. From 2010 to 2013, the change in median cost of health, dental, and vision coverage for California city employees [related URL] [script] | 38 lines |
42 | | 7. The number of listed federal executive agency internet domains [related URL] [script] | 8 lines |
43 | | 8. The number of times when a New York heart surgeon's rate of patient deaths for all cardiac surgical procedures was "significantly higher" than the statewide rate, according to New York state's analysis. [related URL] [script] | 7 lines |
44 | | 9. The number of roll call votes that were rejected by a margin of less than 5 votes, in the first session of the U.S. Senate in the 114th Congress [related URL] [script] | 26 lines |
45 | | 10. The title of the highest paid California city government position in 2010 [related URL] [script] | 35 lines |
46 | | 11. How much did the state of California collect in property taxes, according to the U.S. Census 2013 Annual Survey of State Government Tax Collections? [related URL] [script] | 23 lines |
47 | | 12. In 2010, the year-over-year change in enplanements at America's busiest airport [related URL] [script] | 51 lines |
48 | | 13. The number of armored carrier bank robberies recorded by the FBI in 2014 [related URL] [script] | 15 lines |
49 | | 14. The number of workplace fatalities as reported to the federal and state OSHA in the latest fiscal year [related URL] [script] | 14 lines |
50 | | 15. Total number of wildlife strike incidents reported at San Francisco International Airport [related URL] [script] | 48 lines |
51 | | 16. The non-profit organization with the highest total revenue, according to the latest listing in ProPublica's Nonprofit Explorer [related URL] [script] | 11 lines |
52 | | 17. In the "Justice News" RSS feed maintained by the Justice Department, the number of items published on a Friday [related URL] [script] | 11 lines |
53 | | 18. The number of U.S. congressmembers who have Twitter accounts, according to Sunlight Foundation data [related URL] [script] | 9 lines |
54 | | 19. The total number of preliminary reports on aircraft safety incidents/accidents in the last 10 business days [related URL] [script] | 12 lines |
55 | | 20. The number of OSHA enforcement inspections involving Wal-Mart in California since 2014 [related URL] [script] | 25 lines |
56 | | 21. The current humidity level at Great Smoky Mountains National Park [related URL] [script] | 6 lines |
57 | | 22. The names of the committees that Sen. Barbara Boxer currently serves on [related URL] [script] | 7 lines |
58 | | 23. The name of the California school with the highest number of girls enrolled in kindergarten, according to the CA Dept. of Education's latest enrollment data file. [related URL] [script] | 21 lines |
59 | | 24. Percentage of NYPD stop-and-frisk reports in which the suspect was white in 2014 [related URL] [script] | 24 lines |
60 | | 25. Average frontal crash star rating for 2015 Honda Accords [related URL] [script] | 14 lines |
61 | | 26. The dropout rate for all of Santa Clara County high schools, according to the latest cohort data in CALPADS [related URL] [script] | 48 lines |
62 | | 27. The number of Class I Drug Recalls issued by the U.S. Food and Drug Administration since 2012 [related URL] [script] | 14 lines |
63 | | 28. Total number of clinical trials as recorded by the National Institutes of Health [related URL] [script] | 7 lines |
64 | | 29. Number of days until Texas's next scheduled execution [related URL] [script] | 24 lines |
65 | | 30. The total number of inmates executed by Florida since 1976 [related URL] [script] | 10 lines |
66 | | 31. The number of proposed U.S. federal regulations in which comments are due within the next 3 days [related URL] [script] | 29 lines |
67 | | 32. Number of Titles that have changed in the United States Code since its last release point [related URL] [script] | 6 lines |
68 | | 33. The number of FDA-approved, but now discontinued drug products that contain Fentanyl as an active ingredient [related URL] [script] | 14 lines |
69 | | 34. In the latest FDA Weekly Enforcement Report, the number of Class I and Class II recalls involving food [related URL] [script] | 10 lines |
70 | | 35. Most viewed data set on New York state's open data portal as of this month [related URL] [script] | 9 lines |
71 | | 36. Total number of visitors to the White House in 2012 [related URL] [script] | 27 lines |
72 | | 37. The last time the CIA's Leadership page has been updated [related URL] [script] | 6 lines |
73 | | 38. The domain of the most visited U.S. government website right now [related URL] [script] | 5 lines |
74 | | 39. Number of medical device recalls issued by the U.S. Food and Drug Administration in 2013 [related URL] [script] | 6 lines |
75 | | 40. Number of FOIA requests made to the Chicago Public Library [related URL] [script] | 6 lines |
76 | | 41. The number of currently open medical trials involving alcohol-related disorders [related URL] [script] | 5 lines |
77 | | 42. The name of the Supreme Court justice who delivered the opinion in the most recently announced decision [related URL] [script] | 31 lines |
78 | | 43. The number of citations that resulted from FDA inspections in fiscal year 2012 [related URL] [script] | 10 lines |
79 | | 44. Number of people visiting a U.S. government website right now [related URL] [script] | 6 lines |
80 | | 45. The number of security alerts issued by US-CERT in the current year [related URL] [script] | 6 lines |
81 | | 46. The number of Pinterest accounts maintained by U.S. State Department embassies and missions [related URL] [script] | 13 lines |
82 | | 47. The number of international travel alerts from the U.S. State Department currently in effect [related URL] [script] | 7 lines |
83 | | 48. The difference in total White House staffmember salaries in 2014 versus 2010 [related URL] [script] | 19 lines |
84 | | 49. Number of sponsored bills by Rep. Nancy Pelosi that were vetoed by the President [related URL] [script] | 11 lines |
85 | | 50. In the most recently transcribed Supreme Court argument, the number of times laughter broke out [related URL] [script] | 22 lines |
86 | | 51. The title of the most recent decision handed down by the U.S. Supreme Court [related URL] [script] | 6 lines |
87 | | 52. The average wage of optometrists according to the BLS's most recent National Occupational Employment and Wage Estimates report [related URL] [script] | 8 lines |
88 | | 53. The total number of on-campus hate crimes as reported to the U.S. Office of Postsecondary Education, in the most recent collection year [related URL] [script] | 45 lines |
89 | | 54. The number of people on FBI's Most Wanted List for white collar crimes [related URL] [script] | 6 lines |
90 | | 55. The number of Government Accountability Office reports and testimonies on the topic of veterans [related URL] [script] | 10 lines |
91 | | 56. Number of times Rep. Darrell Issa's remarks have made it onto the Congressional Record [related URL] [script] | 9 lines |
92 | | 57. The top 3 auto manufacturers, ranked by total number of recalls via NHTSA safety-related defect and compliance campaigns since 1967. [related URL] [script] | 24 lines |
93 | | 58. The number of published research papers from the NSA [related URL] [script] | 6 lines |
94 | | 59. The number of university-related datasets currently listed at data.gov [related URL] [script] | 7 lines |
95 | | 60. Number of chapters in Title 20 (Education) of the United States Code [related URL] [script] | 15 lines |
96 | | 61. The number of miles traveled by the current U.S. Secretary of State [related URL] [script] | 6 lines |
97 | | 62. For all of 2013, the number of potential signals of serious risks or new safety information that resulted from the FDA's FAERS [related URL] [script] | 14 lines |
98 | | 63. In the current dataset behind Medicare's Nursing Home Compare website, the total amount of fines received by penalized nursing homes [related URL] [script] | 35 lines |
99 | | 64. From March 1 to 7, 2015, the number of times in which designated FDA policy makers met with persons outside the U.S. federal executive branch [related URL] [script] | 5 lines |
100 | | 65. The number of failed votes in the roll calls 1 through 99, in the U.S. House of the 114th Congress [related URL] [script] | 12 lines |
101 | | 66. The highest minimum wage as mandated by state law. [related URL] [script] | 28 lines |
102 | | 67. For the most recently posted TSA.gov customer satisfaction survey, post the percentage of respondents who rated their "overall experience today" as "Excellent" [related URL] | |
103 | | 68. Number of FDA-approved prescription drugs with GlaxoSmithKline as the applicant holder [related URL] [script] | 11 lines |
104 | | 69. The average number of comments on the last 50 posts on NASA's official Instagram account [related URL] [script] | 40 lines |
105 | | 70. The highest salary possible for a White House staffmember in 2014 [related URL] [script] | 10 lines |
106 | | 71. The percent increase in number of babies named Archer nationwide in 2010 compared to 2000, according to the Social Security Administration [related URL] [script] | 32 lines |
107 | | 72. The number of magnitude 4.5+ earthquakes detected worldwide by the USGS [related URL] [script] | 8 lines |
108 | | 73. The total amount of contributions made by lobbyists to Congress according to the latest downloadable quarterly report [related URL] [script] | 34 lines |
109 | | 74. The description of the bill most recently signed into law by the governor of Georgia [related URL] [script] | 12 lines |
110 | | 75. Total number of officer-involved shooting incidents listed by the Philadelphia Police Department [related URL] [script] | 9 lines |
111 | | 76. The total number of publications produced by the U.S. Government Accountability Office [related URL] [script] | 9 lines |
112 | | 77. Number of Dallas officer-involved fatal shooting incidents in 2014 [related URL] [script] | 7 lines |
113 | | 78. Number of Cupertino, CA restaurants that have been shut down due to health violations in the last six months. [related URL] [script] | 6 lines |
114 | | 79. The change in total airline revenues from baggage fees, from 2013 to 2014 [related URL] [script] | 19 lines |
115 | | 80. The total number of babies named Odin born in Colorado according to the Social Security Administration [related URL] [script] | 20 lines |
116 | | 81. The latest release date for T-100 Domestic Market (U.S. Carriers) statistics report [related URL] [script] | 13 lines |
117 | | 82. In the most recent FDA Adverse Events Reports quarterly extract, the number of patient reactions mentioning "Death" [related URL] [script] | 47 lines |
118 | | 83. The sum of White House staffmember salaries in 2014 [related URL] [script] | 12 lines |
119 | | 84. The total number of notices published on the most recent date to the Federal Register [related URL] [script] | 6 lines |
120 | | 85. The number of iPhone units sold in the latest quarter, according to Apple Inc's most recent 10-Q report [related URL] [script] | 49 lines |
121 | | 86. Number of computer vulnerabilities in which IBM was the vendor in the latest Cyber Security Bulletin [related URL] [script] | 10 lines |
122 | | 87. Number of airports with existing construction related activity [related URL] [script] | 6 lines |
123 | | 88. The number of posts on TSA's Instagram account [related URL] [script] | 24 lines |
124 | | 89. In fiscal year 2013, the short description of the most frequently cited type of FDA's inspectional observations related to food products. [related URL] [script] | 32 lines |
125 | | 90. The currently serving U.S. congressmember with the most Twitter followers [related URL] [script] | 76 lines |
126 | | 91. Number of stop-and-frisk reports from the NYPD in 2014 [related URL] [script] | 22 lines |
127 | | 92. In 2012-Q4, the total amount paid by Rep. Aaron Schock to Lobair LLC, according to Congressional spending records, as compiled by the Sunlight Foundation [related URL] [script] | 14 lines |
128 | | 93. Number of Github repositories maintained by the GSA's 18F organization, as listed on Github.com [related URL] [script] | 5 lines |
129 | | 94. The New York City high school with the highest average math score in the latest SAT results [related URL] [script] | 96 lines |
130 | | 95. Since 2002, the most commonly occurring winning number in New York's Lottery Mega Millions [related URL] [script] | 9 lines |
131 | | 96. The number of scheduled arguments according to the most recent U.S. Supreme Court argument calendar [related URL] [script] | 11 lines |
132 | | 97. The New York school with the highest rate of religious exemptions to vaccinations [related URL] [script] | 10 lines |
133 | | 98. The latest estimated population percent change for Detroit, MI, according to the latest Census QuickFacts summary. [related URL] [script] | 8 lines |
134 | | 99. According to the Medill National Security Zone, the number of chambered guns confiscated at airports by the TSA [related URL] [script] | 11 lines |
135 | | 100. The California city whose city manager earns the most total wage per population of its city in 2012 [related URL] [script] | 23 lines |
136 | | 101. The number of women currently serving in the U.S. Congress, according to Sunlight Foundation data [related URL] [script] | 8 lines |
137 |
138 |
139 |
140 | ----
141 |
142 | ## How to run this stuff
143 |
144 | Each task is meant to be a self-contained script: you run it, and it prints the answer I'm looking for. The [scripts](/scripts) in this repo should "just work"...if you have all the dependencies installed that I had while writing them, and the web URLs they target haven't changed...so, basically, these may not work at all.
145 |
146 | To copy the scripts quickly via the command line (by default, a ./search-script-scrape directory will be created):
147 |
148 | $ git clone https://github.com/compjour/search-script-scrape.git
149 |
150 | To run a script:
151 |
152 | $ cd search-script-scrape
153 | $ python3 scripts/1.py
154 |
155 | I leave it to you and Google to figure out how to run Python 3 on your own system. FWIW, I was using the [Python 3.4.3 provided by the Anaconda 2.2.0 installer for OS X](http://continuum.io/downloads#py34). The most common third-party libraries used are [Requests](http://www.python-requests.org/en/latest/) for downloading the files and [lxml for HTML parsing](http://lxml.de/).
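For reference, here is the minimal requests-plus-lxml pattern that most of these scripts follow. This is essentially what [scripts/1.py](scripts/1.py) does; the URL and CSS selector below are just the ones that script happens to use:

    from lxml import html
    import requests

    # fetch the page and parse it into an element tree
    response = requests.get('http://www.data.gov/')
    doc = html.fromstring(response.text)
    # select the first matching element and print its text
    print(doc.cssselect('small a')[0].text)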
156 |
157 | ## Expanding on these scripts
158 |
159 | To reiterate: each of these scripts is meant to print out a single answer, and so they don't actually show the full potential of how programming can automate data collection. As you get better at programming and recognizing its patterns, you'll find out how easy it is to abstract what seemed like a narrow task into something much bigger.
160 |
161 | For example, [Script #50](scripts/50.py) prints out the number of times laughter broke out in the _most recently_ transcribed Supreme Court argument. Change two lines and that script will print out the laugh count in _every_ transcribed Supreme Court argument: ([demo here](scratchpad/more_scotus_laughs.py))
162 |
163 | The same kind of small code restructuring can be done to many of the tasks here. And you can also modify the _parameters_; why limit yourself to finding the [highest paid "City Manager" in California](https://github.com/compjour/search-script-scrape/blob/master/scripts/100.py) when you can extend the search to every kind of California employee, across every year of salary data? ([demo here](scratchpad/high_city_ca_pay.py))
164 |
165 | And of course, in real-world data projects, you aren't typically interested in just printing the answers to your Terminal. You generally want to send them to a spreadsheet and eventually to a web application (or other kind of publication). That's just a few more lines of programming, too...So while this repo contains a bunch of toy scripts, see if you can think of ways to turn them into bigger data explorations.
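As a minimal, hypothetical sketch of those "few more lines," here is how a script's results could be dumped to a CSV file instead of printed; the fieldnames are made up, and the single result row is borrowed from the scratchpad/high_city_ca_pay.py output:

    import csv

    # pretend this list of dicts came out of one of the scraping scripts
    results = [{'city': 'Vernon', 'pay_per_capita': 3572}]

    with open('results.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['city', 'pay_per_capita'])
        writer.writeheader()       # column names as the first row
        writer.writerows(results)  # one row per scraped result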
166 |
167 |
168 | ## Post-mortem
169 |
170 | The original requirement was that students finish all 100 scripts by the end of the quarter. That didn't quite work out, so I reduced the requirement to 50. It was a bad idea to make this an "oh, just turn it in at the end of the year" kind of assignment, as most people have the tendency to wait until finals week to do such work.
171 |
172 | Most of the tasks are pretty straightforward, in terms of the Python programming. The majority of the time is spent figuring out exactly what the hell I'm referring to, so next time I do this, I'll probably provide the URL of the target page rather than having people attempt to divine the Google Path I used to get to the data.
173 |
174 | - Class instructions for [Computational Journalism: Search-Script-Scrape](http://www.compjour.org/search-script-scrape)
175 | - [List of tasks as a Google Doc](https://docs.google.com/spreadsheets/d/1JbY_-g9MkGH78Rta0PnE6D8rG8T-wdKGsMa3kAC3bDs/edit?usp=sharing)
176 |
--------------------------------------------------------------------------------
/generate_readme.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python3
2 | """
3 | This script reads from the Google Doc of tasks and generates
4 | the titles of the tasks, linked to the appropriate file, and this
5 | (Markdown) text can be pasted into the README.md file
6 | """
7 | import csv
8 | import requests
9 | from os.path import exists
10 | GDOC_URL = 'https://docs.google.com/spreadsheets/d/1JbY_-g9MkGH78Rta0PnE6D8rG8T-wdKGsMa3kAC3bDs/export?format=csv&gid=0'
11 |
12 | txt = requests.get(GDOC_URL).text
13 | rows = csv.DictReader(txt.splitlines())
14 |
15 |
16 |
17 | done_count = 0
18 | tasks = []
19 | for row in sorted(rows, key = lambda r: int(r['Problem No.'])):
20 | task = {'num': row['Problem No.'], 'url': row['Related URL'],
21 | 'title': row['Title'], 'lines': ""}
22 |     task['link'] = "{title} [related URL]({url})".format(
23 | num=task['num'], title=task['title'], url=task['url']
24 | )
25 | task['path'] = "scripts/%s.py" % task['num']
26 | if exists(task['path']):
27 | lx = len(open(task['path'], encoding = 'utf-8').readlines())
28 | if lx > 3:
29 | task['lines'] = "%s lines" % lx
30 |             task['link'] += " [script](%s)" % (task['path'])
31 | done_count += 1
32 |
33 | tasks.append(task)
34 |
35 |
36 |
37 | #############
38 | # store the text to be added to file
39 | tasklines = []
40 | tasklines.append("The repo currently contains scripts for __%s__ of __%s__ tasks:" %
41 | (done_count, len(tasks)))
42 | tasklines.append(
43 | """
44 | | Title | Line count |
45 | |-------------------------|-------------|""")
46 |
47 | for task in tasks:
48 | tasklines.append("| {num}. {link} | {lines} |".format(**task))
49 |
50 |
51 |
52 |
53 |
54 | ## Get the README.md text
55 | lines = []
56 | with open('README.md', 'r') as inf:
57 | within_tasks = False
58 | for line in inf.readlines():
59 | if not within_tasks:
60 | if 'begintasks' in line:
61 | within_tasks = True
62 | lines.append(line)
63 | lines.extend([t + "\n" for t in tasklines])
64 | else:
65 | lines.append(line)
66 | elif within_tasks:
67 | if 'endtasks' in line:
68 | lines.append(line)
69 | within_tasks = False
70 |
71 | with open('README.md', 'w') as outf:
72 | outf.writelines(lines)
73 |
74 |
75 |
--------------------------------------------------------------------------------
/scratchpad/high_city_ca_pay.py:
--------------------------------------------------------------------------------
1 | # The top 100 California city employees by per-capita total wages
2 | # across 2009 to 2013 (the most recent year as of publish date) salary data
3 | # Modification to scripts/100.py
4 | import csv
5 | import requests
6 | from io import BytesIO
7 | from zipfile import ZipFile
8 | YEARS = range(2009,2014)
9 | def foosalary(row):
10 | return float(row['Total Wages']) / int(row['Entity Population'])
11 | rows = []
12 | for year in YEARS:
13 | url = 'http://publicpay.ca.gov/Reports/RawExport.aspx?file=%s_City.zip' % year
14 | print("Downloading:", url)
15 | resp = requests.get(url)
16 | with ZipFile(BytesIO(resp.content)) as zfile:
17 | fname = zfile.filelist[0].filename # 2012_City.csv
18 | print("\tUnzipping:", fname)
19 | # first 4 lines are Disclaimer lines
20 | # only print line[4] (i.e. the headers) if this is the first iteration
21 | xs = 4 if year == YEARS.start else 5
22 | rows.extend(zfile.read(fname).decode('latin-1').splitlines()[xs:])
23 | # This massive array shouldn't cause your (modern) computer to crash...
24 | print("Filtering %s rows..." % len(rows))
25 | # remove rows without 'Total Wages'
26 | employees = [r for r in csv.DictReader(rows) if r['Total Wages']]
27 | templine = "{year}:\t{city}, {dept}; {position}:\t${money}"
28 | for e in sorted(employees, key = foosalary, reverse = True)[0:100]: # show top 100
29 | line = templine.format(year = e['Year'], city = e['Entity Name'],
30 | dept = e["Department / Subdivision"], position = e['Position'],
31 | money = int(foosalary(e)))
32 | print(line)
33 |
34 |
35 | # If you're a fan of True Detective Season 2, the output might ring a bell:
36 | # http://www.latimes.com/local/california/la-me-vernon-true-detective-20150619-story.html
37 | #
38 | # 2010: Vernon, Finance; Finance Director: $3572
39 | # 2009: Vernon, Light & Power Administration; Director of Light & Power: $3405
40 | # 2012: Vernon, Fire; Fire Chief: $3312
41 | # 2009: Vernon, City Attorney; City Attorney: $3115
42 | # 2009: Vernon, Finance; Finance Director: $3086
43 | # 2009: Vernon, Office of Special Counsel; Special Counsel: $2912
44 | # 2009: Vernon, City Attorney; Assistant City Attorney III: $2688
45 | # 2010: Vernon, L&P Administration; Director of Light & Power Capital Projects: $2596
46 | # 2010: Vernon, City Attorney; Assistant City Attorney III: $2455
47 | # 2010: Vernon, Industrial Development; Assistant Director of Industrial Development: $2330
48 | # 2011: Vernon, Finance; Finance Director: $2321
49 | # 2012: Vernon, Light And Power Administration; Director Of Light & Power: $2318
50 | # 2013: Vernon, City Administration; City Administrator: $2224
51 | # 2013: Vernon, Light And Power Administration; Director Of Light & Power: $2218
52 | # 2009: Vernon, Industrial Development; Assistant Director of Industrial Development: $2189
53 | # 2009: Vernon, City Attorney; Chief Deputy City Attorney: $2137
54 | # 2013: Vernon, City Attorney; City Attorney: $2121
55 | # 2009: Vernon, Administrative, Engineering & Planning; Director of Community Services: $2078
56 | # 2013: Vernon, Fire; Fire Chief: $1982
57 | # 2011: Vernon, City Attorney; Chief Deputy City Attorney: $1979
58 | # 2010: Vernon, City Attorney; Chief Deputy City Attorney: $1969
59 | # 2013: Vernon, Administrative, Engineering & Planning; Director Of Community Services: $1945
60 | # 2010: Vernon, Administrative, Engineering & Planning; Director of Community Services: $1930
61 | # 2009: Vernon, Fire; Fire Chief: $1928
62 | # 2012: Vernon, City Attorney; Chief Deputy City Attorney: $1905
63 | # 2011: Vernon, Administrative, Engineering & Planning; Director Of Community Services & Water: $1903
64 | # 2009: Vernon, Police; Chief: $1875
65 | # 2011: Vernon, Fire; Fire Chief: $1857
66 | # 2013: Vernon, Light And Power Engineering; Engineering Manager: $1834
67 | # 2012: Vernon, Light And Power Engineering; Engineering Manager: $1783
68 | # 2009: Vernon, Health; Health Officer/Director Of Health & Environmental Control: $1782
69 | # 2013: Vernon, Fire; Battalion Chief: $1769
70 | # 2009: Vernon, Police; Sergeants: $1758
71 | # 2010: Vernon, Fire; Fire Chief: $1743
72 | # 2012: Vernon, Finance; Finance Director: $1737
73 | # 2013: Vernon, Police; Police Chief: $1728
74 | # 2011: Vernon, L&P Engineering; Engineering Manager: $1712
75 | # 2012: Vernon, Police; Police Chief: $1693
76 | # 2013: Vernon, Finance; Finance Director: $1691
77 | # 2009: Vernon, Fire; Assistant Fire Chief: $1687
78 | # 2009: Vernon, Fire; Battalion Chief: $1675
79 | # 2010: Vernon, Police; Sergeants: $1661
80 | # 2012: Vernon, Fire; Captain: $1648
81 | # 2011: Vernon, Health; Director Health & Environmental Control: $1646
82 | # 2010: Vernon, Health; Director Health & Environmental Control: $1642
83 | # 2012: Vernon, Administrative, Engineering & Planning; Director Of Community Services: $1630
84 | # 2013: Vernon, Health; Director Health & Environmental Control: $1623
85 | # 2011: Vernon, Fire; Assistant Fire Chief: $1623
86 | # 2013: Vernon, Fire; Battalion Chief: $1620
87 | # 2011: Vernon, Fire; Battalion Chief: $1619
88 | # 2013: Vernon, Human Resources; Director Of Human Resources: $1605
89 | # 2009: Vernon, Fire; Battalion Chief: $1592
90 | # 2011: Vernon, L&P Administration; Director Of Light & Power: $1589
91 | # 2012: Vernon, Fire; Battalion Chief: $1589
92 | # 2012: Vernon, Health; Director Health & Environmental Control: $1583
93 | # 2010: Vernon, Fire; Battalion Chief: $1577
94 | # 2013: Vernon, Fire; Battalion Chief: $1576
95 | # 2009: Vernon, Police; Captain: $1576
96 | # 2010: Vernon, Police; Police Chief: $1559
97 | # 2012: Vernon, Fire; Battalion Chief: $1556
98 | # 2012: Vernon, Fire; Battalion Chief: $1552
99 | # 2009: Vernon, Police; Captain: $1551
100 | # 2009: Vernon, Resource Planning; Electric Resources Planning And Development Manager: $1548
101 | # 2009: Vernon, System Dispatch; Transmission & Distribution Manager: $1547
102 | # 2013: Vernon, Resources Planning; Electric Resource Planning & Development Manager: $1539
103 | # 2009: Vernon, Fire; Captain: $1529
104 | # 2010: Vernon, Fire; Assistant Fire Chief: $1528
105 | # 2012: Vernon, Fire; Battalion Chief: $1527
106 | # 2013: Vernon, City Attorney; Chief Deputy City Attorney: $1520
107 | # 2010: Vernon, Police; Interim Police Chief: $1516
108 | # 2011: Vernon, Fire; Battalion Chief: $1504
109 | # 2009: Vernon, Fire; Fire Marshall: $1499
110 | # 2009: Vernon, Police; Police Officer: $1498
111 | # 2009: Vernon, Fire; Battalion Chief: $1497
112 | # 2009: Vernon, Fire; Regional Training Captain: $1494
113 | # 2009: Vernon, Fire; Captain: $1492
114 | # 2009: Vernon, Police; Sergeants: $1488
115 | # 2009: Vernon, Light & Power Engineering; Engineering Manager: $1488
116 | # 2009: Vernon, Fire; Captain: $1484
117 | # 2010: Vernon, Fire; Battalion Chief: $1470
118 | # 2009: Vernon, Police; Police Officer: $1457
119 | # 2013: Vernon, Fire; Captain: $1451
120 | # 2013: Vernon, Resources Planning; Resource Scheduler: $1450
121 | # 2012: Vernon, City Administration; Assistant To The City Administrator: $1450
122 | # 2011: Vernon, Resources Planning; Electric Resources Planning And Development Manager: $1449
123 | # 2010: Vernon, System Dispatch; Transmission & Distribution Manager: $1448
124 | # 2012: Vernon, Resources Planning; Electric Resource Planning & Development Manager: $1446
125 | # 2012: Vernon, Fire; Fire Marshall: $1445
126 | # 2012: Vernon, Fire; Engineer: $1440
127 | # 2010: Vernon, L&P Engineering; Engineering Manager: $1437
128 | # 2009: Vernon, Police; Sergeants: $1437
129 | # 2013: Vernon, Fire; Captain: $1435
130 | # 2013: Vernon, Fire; Captain: $1434
131 | # 2010: Vernon, Fire; Battalion Chief: $1433
132 | # 2012: Vernon, Fire; Engineer: $1427
133 | # 2009: Vernon, Fire; Captain: $1423
134 | # 2009: Vernon, Fire; Captain: $1420
135 | # 2009: Vernon, Fire; Captain: $1415
136 | # 2009: Vernon, Fire; Captain: $1414
137 | # 2013: Vernon, Fire; Captain: $1413
138 |
--------------------------------------------------------------------------------
/scratchpad/more_scotus_laughs.py:
--------------------------------------------------------------------------------
1 | # Modification of scripts/50.py to count all the laughs in the most recent term
2 | from lxml import html
3 | from subprocess import check_output
4 | from urllib.parse import urljoin
5 | import requests
6 | url = 'http://www.supremecourt.gov/oral_arguments/argument_transcript.aspx'
7 | doc = html.fromstring(requests.get(url).text)
8 | # get all the rulings
9 | for link in doc.cssselect('table.datatables tr a'):
10 | href = link.attrib['href']
11 | # let's store the title of the case from table cell
12 | casetitle = link.getnext().text_content()
13 | # download PDF
14 | pdf_url = urljoin(url, href)
15 | with open("/tmp/t.pdf", 'wb') as f:
16 | f.write(requests.get(pdf_url).content)
17 | # punt to shell and run pdftotext
18 | # http://www.foolabs.com/xpdf/download.html
19 | txt = check_output("pdftotext -layout /tmp/t.pdf -", shell = True).decode()
20 | print("%s laughs in: %s" % (txt.count("(Laughter.)"), casetitle))
21 |
22 |
23 |
24 |
25 |
26 |
--------------------------------------------------------------------------------
/scripts/1.py:
--------------------------------------------------------------------------------
1 | # Number of datasets currently listed on data.gov
2 | from lxml import html
3 | import requests
4 | response = requests.get('http://www.data.gov/')
5 | doc = html.fromstring(response.text)
6 | link = doc.cssselect('small a')[0]
7 | print(link.text)
8 |
--------------------------------------------------------------------------------
/scripts/10.py:
--------------------------------------------------------------------------------
1 | # The title of the highest paid California city government position in 2010
2 | # note, the code below makes it easy to extend "years" to include multiple years
3 | import csv
4 | import os.path
5 | import requests
6 | from shutil import unpack_archive
7 | LOCAL_DATADIR = "/tmp/capublicpay"
8 | YEARS = range(2010, 2011) # i.e. just 2010
9 | def foosalary(row):
10 | return float(row['Total Wages']) if row['Total Wages'] else 0
11 |
12 | for year in YEARS:
13 | bfname = '%s_City' % year
14 | url = 'http://publicpay.ca.gov/Reports/RawExport.aspx?file=%s.zip' % bfname
15 | zname = os.path.join("/tmp", bfname + '.zip')
16 | cname = os.path.join(LOCAL_DATADIR, bfname + '.csv')
17 |
18 | if not os.path.exists(zname):
19 | print("Downloading", url, 'to', zname)
20 | data = requests.get(url).content
21 | with open(zname, 'wb') as f:
22 | f.write(data)
23 | # done downloading, now unzip files
24 | print("Unzipping", zname, 'to', LOCAL_DATADIR)
25 | unpack_archive(zname, LOCAL_DATADIR, format = 'zip')
26 |
27 | with open(cname, encoding = 'latin-1') as f:
28 | # first four lines are:
29 | # Disclaimer
30 | #
31 | # The information presented is posted as submitted by the reporting entity. The State Controller's Office is not responsible for the accuracy of this information.
32 | data = list(csv.DictReader(f.readlines()[4:]))
33 | topitem = max(data, key = foosalary)
34 | print(topitem['Entity Name'], topitem['Department / Subdivision'],
35 | topitem['Position'], topitem['Total Wages'])
36 |
--------------------------------------------------------------------------------
/scripts/100.py:
--------------------------------------------------------------------------------
1 | # The California city whose city manager earns the most total wage per population of its city in 2012
2 | import csv
3 | import requests
4 | from io import BytesIO
5 | from zipfile import ZipFile
6 | YEAR = 2012
7 | def foosalary(row):
8 | return float(row['Total Wages']) / int(row['Entity Population'])
9 |
10 | url = 'http://publicpay.ca.gov/Reports/RawExport.aspx?file=%s_City.zip' % YEAR
11 | print("Downloading:", url)
12 | resp = requests.get(url)
13 |
14 | with ZipFile(BytesIO(resp.content)) as zfile:
15 | fname = zfile.filelist[0].filename # 2012_City.csv
16 | rows = zfile.read(fname).decode('latin-1').splitlines()
17 | # first 4 lines are Disclaimer lines
18 | managers = [r for r in csv.DictReader(rows[4:]) if r['Position'].lower() == 'city manager'
19 | and r['Total Wages']]
20 | topman = max(managers, key = foosalary)
21 | print("City: %s; Pay-per-Capita: $%s" % (topman['Entity Name'], int(foosalary(topman))))
22 | # City: Industry; Pay-per-Capita: $465
23 |
24 |
--------------------------------------------------------------------------------
/scripts/101.py:
--------------------------------------------------------------------------------
1 | import csv  # The number of women currently serving in the U.S. Congress, according to Sunlight Foundation data
2 | import requests
3 | from io import StringIO
4 | CSVURL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
5 | response = requests.get(CSVURL)
6 | data = csv.DictReader(StringIO(response.text))
7 | rows = list(data)
8 | print(len([i for i in rows if i['gender'] == 'F' and i['in_office'] == '1']))
9 |
--------------------------------------------------------------------------------
/scripts/11.py:
--------------------------------------------------------------------------------
1 | # How much did the state of California collect in property taxes, according to the U.S. Census 2013 Annual Survey of State Government Tax Collections?
2 | # landing page: http://www.census.gov/govs/statetax/historical_data.html
3 | # note: this exercise was one of the last to be done and is done in the most just-do-everything-in-one-line mode possible
4 | # ...don't actually follow it as good practice
5 | import requests
6 | from io import BytesIO
7 | from xlrd import open_workbook
8 | from zipfile import ZipFile
9 | ZIP_URL = 'http://www2.census.gov/govs/statetax/state_tax_collections.zip'
10 | XLS_FNAME = 'STC_Historical_DB.xls'
11 | print("Downloading:", ZIP_URL)
12 | resp = requests.get(ZIP_URL)
13 | with ZipFile(BytesIO(resp.content)) as zfile:
14 | with open("/tmp/state_tax_data.xls", "wb") as o:
15 | o.write(zfile.open(XLS_FNAME, 'r').read())
16 | book = open_workbook("/tmp/state_tax_data.xls")
17 | sheet = book.sheets()[0]
18 | # T01 refers to "Property Tax", get the index
19 | proptax_col_idx = next(idx for idx, c in enumerate(sheet.row_values(1)) if 'T01' in c)
20 | # state name is in column indexed 2
21 | # note that each state has more than one row, but the first one is the most recent
22 | cal_row = next(sheet.row_values(x) for x in range(sheet.nrows) if 'CA STATE' in sheet.row_values(x)[2])
23 | print("%s paid %s in the year %s" % (cal_row[2], cal_row[proptax_col_idx] * 1000, round(cal_row[0])))
24 |
--------------------------------------------------------------------------------
/scripts/12.py:
--------------------------------------------------------------------------------
1 | # In 2010, the year-over-year change in enplanements at America's busiest airport
2 | # The landing page for this data is:
3 | # http://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/
4 | # For each year, there's a separate page with a table of links, including
5 | # XLS format:
6 | # e.g. ./passenger/media/cy10_primary_enplanements.xls
7 | import csv
8 | import requests
9 | # we can't be sure that the XLS has the same naming convention year over year
10 | # so let's do a little HTML parsing
11 | from lxml import html
12 | from os.path import basename
13 | from urllib.parse import urljoin
14 | from xlrd import open_workbook
15 |
16 | BASE_URL = "http://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/"
17 | YEAR = 2010
18 | resp = requests.get(BASE_URL, params = {'year': YEAR})
19 | doc = html.fromstring(resp.text)
20 | # There are several spreadsheets and conventions over the years. I'm going to
21 | # be lazy and just pick the first spreadsheet with "enplanements" and assume it's the primary
22 | # doc
23 | xls_url = doc.xpath("//a[contains(@href, 'enplanements') and contains(@href, 'xls')]/@href")[0]
24 | print("Downloading", xls_url)
25 | xresp = requests.get(urljoin(BASE_URL, xls_url))
26 | # save to disk
27 | fn = "/tmp/" + basename(xls_url)
28 | with open(fn, "wb") as f:
29 | f.write(xresp.content)
30 | # open with xlrd
31 | book = open_workbook(fn)
32 | sheet = book.sheets()[0]
33 | # Format looks like this:
34 | # | Airport | CY 10 | CY 09 |
35 | # | Name | Enplanements | Enplanements |
36 | # |--------------------------------------------|--------------|--------------|
37 | # | Hartsfield - Jackson Atlanta International | 43,130,585 | 42,280,868 |
38 | # | Chicago O'Hare International | 32,171,831 | 31,135,732 |
39 | # | Los Angeles International | 28,857,755 | 27,439,897 |
40 |
41 | headers = sheet.row_values(0)
42 | # get all the data rows
43 | rows = [sheet.row_values(i) for i in range(1, sheet.nrows)]
44 | # make them into dicts
45 | drows = [dict(zip(headers, row)) for row in rows]
46 | # remove rows without 'CY 10 Enplanements' as a float
47 | drows = [d for d in drows if isinstance(d['CY 10 Enplanements'], float)]
48 | # get biggest airport
49 | airport = max(drows, key = lambda r: r['CY 10 Enplanements'])
50 | print("%s: %i" % (airport['Airport Name'], airport['CY 10 Enplanements'] - airport['CY 09 Enplanements']))
51 | # Hartsfield - Jackson Atlanta International: 849717
52 |
--------------------------------------------------------------------------------
/scripts/13.py:
--------------------------------------------------------------------------------
1 | from io import BytesIO
2 | from PyPDF2 import PdfFileReader
3 | import requests
4 | import re
5 | url = 'https://www.fbi.gov/stats-services/publications/bank-crime-statistics-2014/bank-crime-statistics-2014'
6 | pdfbytes = BytesIO(requests.get(url).content)
7 | pdf = PdfFileReader(pdfbytes)
8 | txt = pdf.getPage(0).extractText()
9 | # this is really ugly
10 | # U.S. DEPARTMENT OF JUSTICE FEDERAL BUREAU OF INVESTIGATION WASHINGTON, D.C. 20535-0001 BANK CRIME STATISTICS (BCS) FEDERALLY INSURED FINANCIAL INSTITUTIONS January 1, 2014 - December 31, 2014 I. VIOLATIONS OF THE FEDERAL BANK ROBBERY AND INCIDENTAL CRIMES STATUTE, TITLE 18, UNITED STATES CODE, SECTION 2113 Violations by Type of Institution Robberies Burglaries Larcenies Commercial Banks 3,430 61 5 Mutual Savings Banks 31 0 1 Savings and Loan Associations 93 1 0 Credit Unions 312 8 2 Armored Carrier Companies 13 1 3 Totals: 3,879 71 11 Grand Total - All Violations: 3,961 Number, Race, and Sex of Perpetrators The number of persons known to be involved in the 3,961 robberies, burglaries, and larcenies was 4,778. The following table shows a breakdown of the 4,778 persons by race and sex. In a small number of cases, the use of full disguise makes determination of race and sex impossible. White Black Hispanic Other Unknown Male 1770 2030 258 68 221 Female 150 160 17 10 12 Unknown Race/Sex: 82 Investigation to date has resulted in the identification of 2,617 (55 percent) of the 4,778 persons known to be involved. Of these 2,617 identified persons, 1,047 (40 percent) were determined to be users of narcotics, and 463 (18 percent) were found to have been previously convicted in either federal or state court for bank robbery, bank burglary, or bank larceny. Occurrences by Day of Week and Time of Day Monday - 696 6-9 a.m. - 106 Tuesday - 669 9-11 a.m. - 1,037 Wednesday - 670 11 a.m.-1 p.m. - 929 Thursday - 648 1-3 p.m. - 791 Friday - 803 3-6 p.m. - 946 Saturday - 339 6 p.m.-6 a.m. - 151 Sunday - 50 Not Determined - 1 Not Determined - 86 Total: 3,961 Total: 3,961
11 | # relevant line
12 | # Armored Carrier Companies 13 1 3
13 | print(re.search('Armored Carrier Companies +(\d+)', txt).groups()[0])
14 |
15 |
16 |
--------------------------------------------------------------------------------
/scripts/14.py:
--------------------------------------------------------------------------------
1 | # The number of workplace fatalities as reported to the federal and state OSHA in the latest fiscal year
2 | # landing page
3 | # https://www.osha.gov/dep/fatcat/dep_fatcat.html
4 | from lxml import html
5 | from urllib.parse import urljoin
6 | import csv
7 | import requests
8 | url = "https://www.osha.gov/dep/fatcat/dep_fatcat.html"
9 | doc = html.fromstring(requests.get(url).text)
10 | links = [a.attrib['href'] for a in doc.cssselect('a') if a.attrib.get('href')]
11 | # assume first CSV is the target csv
12 | csvurl = urljoin(url, [a for a in links if 'csv' in a][0])
13 | rows = list(csv.DictReader(requests.get(csvurl).text.splitlines()))
14 | print(len([r for r in rows if r['Fatality or Catastrophe'] == 'Fatality']))
15 |
--------------------------------------------------------------------------------
/scripts/15.py:
--------------------------------------------------------------------------------
1 | # Total number of wildlife strike incidents reported at San Francisco International Airport
2 | # landing page
3 | # http://wildlife.faa.gov/database.aspx
4 | import csv
5 | import os
6 | import requests
7 | from shutil import unpack_archive
8 | from subprocess import check_output
9 | AIRPORTCODE = 'KSFO'
10 | LOCAL_DATADIR = "/tmp/faawildlife"
11 | url = 'http://wildlife.faa.gov/downloads/wildlife.zip'
12 | zname = os.path.join(LOCAL_DATADIR, os.path.basename(url))
13 | dname = os.path.join(LOCAL_DATADIR, 'wildlife.accdb')
14 | os.makedirs(LOCAL_DATADIR, exist_ok = True)
15 |
16 | # Download the zip
17 | if not os.path.exists(zname):
18 | print("Downloading", url, 'to', zname)
19 | z = requests.get(url).content
20 | with open(zname, 'wb') as f:
21 | f.write(z)
22 |
23 | # unzip it
24 | print("Unzipping", zname, 'to', LOCAL_DATADIR)
25 | unpack_archive(zname, LOCAL_DATADIR)
26 |
27 | # Work with MS Access, using mdbtools
28 | # https://github.com/brianb/mdbtools
29 | #
30 | # Install with:
31 | # brew install mdbtools
32 | # Helpful post:
33 | # http://nialldonegan.me/2007/03/10/converting-microsoft-access-mdb-into-csv-or-mysql-in-linux/
34 |
35 | # $ mdb-tables wildlife.accdb
36 | # STRIKE_REPORTS (1990-1999) STRIKE_REPORTS (2000-2009) STRIKE_REPORTS (2010-Current) STRIKE_REPORTS_BASH (1990-Current)
37 | # hardcode the tablenames
38 | access_tablenames = ['STRIKE_REPORTS (1990-1999)', 'STRIKE_REPORTS (2000-2009)', 'STRIKE_REPORTS (2010-Current)', 'STRIKE_REPORTS_BASH (1990-Current)']
39 | hitcount = 0
40 | for tname in access_tablenames:
41 | txt = check_output("mdb-export %s '%s'" % (dname, tname), shell = True).decode()
42 | rows = list(csv.DictReader(txt.splitlines()))
43 | hits = len([r for r in rows if r['AIRPORT_ID'] == AIRPORTCODE])
44 | print(tname, " - ", hits)
45 | hitcount += hits
46 |
47 | print("Total:")
48 | print(hitcount)
49 |
--------------------------------------------------------------------------------
/scripts/16.py:
--------------------------------------------------------------------------------
1 | # The non-profit organization with the highest total revenue, according to the latest listing in ProPublica's Nonprofit Explorer
2 | # Note: "latest listing" is kind of broad...we'll just take that to mean
3 | # top revenue of whatever's currently listed on the site
4 | from lxml import html
5 | import requests
6 | url = 'https://projects.propublica.org/nonprofits/search?c_code%5Bid%5D=&ntee%5Bid%5D=&order=revenue&q=&sort_order=desc&state%5Bid%5D=&utf8=%E2%9C%93'
7 | doc = html.fromstring(requests.get(url).text)
8 | d = doc.xpath('//table/tbody/tr[1]/td/a/text()')
9 | print(d[0])
10 | # It's also possible to just use the API
11 | # https://projects.propublica.org/nonprofits/api
12 |
--------------------------------------------------------------------------------
/scripts/17.py:
--------------------------------------------------------------------------------
1 | # In the "Justice News" RSS feed maintained by the Justice Department, the number of items published on a Friday
2 | from datetime import datetime
3 | from lxml import etree
4 | import requests
5 | url = 'http://www.justice.gov/feeds/opa/justice-news.xml'
6 | doc = etree.fromstring(requests.get(url).content)
7 | items = doc.xpath('//channel/item')
8 | dates = [item.find('pubDate').text.strip() for item in items]
9 | ts = [datetime.strptime(d.split(' ')[0], '%Y-%m-%d') for d in dates]
10 | # for weekday(), 4 corresponds to Friday
11 | print(len([t for t in ts if t.weekday() == 4]))
12 |
--------------------------------------------------------------------------------
/scripts/18.py:
--------------------------------------------------------------------------------
1 | # The number of U.S. congressmembers who have Twitter accounts, according to Sunlight Foundation data
2 | # info https://sunlightlabs.github.io/congress/#legislator-spreadsheet
3 | import csv
4 | import requests
5 | url = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
6 | rows = list(csv.DictReader(requests.get(url).text.splitlines()))
7 | # note that spreadsheet includes non-sitting legislators, thus the use
8 | # of 'in_office' attribute to filter
9 | print(len([r for r in rows if r['twitter_id'] and r['in_office'] == '1']))
10 |
--------------------------------------------------------------------------------
/scripts/19.py:
--------------------------------------------------------------------------------
1 | # The total number of preliminary reports on aircraft safety incidents/accidents in the last 10 business days
2 | from lxml import html
3 | import re, requests
4 | url = 'http://www.asias.faa.gov/pls/apex/f?p=100:93:0::NO:::'
5 | doc = html.fromstring(requests.get(url).text)
6 | x = 0
7 | for tr in doc.cssselect('#uPageCols tr')[2:3]:
8 | for t in tr.cssselect('td')[1:]:
9 | v = re.search('\d+', t.text_content())
10 | if v:
11 | x += int(v.group())
12 | print(x)
13 |
--------------------------------------------------------------------------------
/scripts/2.py:
--------------------------------------------------------------------------------
1 | # The name of the most recently added dataset on data.gov
2 | from lxml import html
3 | import requests
4 | response = requests.get('http://catalog.data.gov/dataset?q=&sort=metadata_created+desc')
5 | doc = html.fromstring(response.text)
6 | title = doc.cssselect('h3.dataset-heading')[0].text_content()
7 | print(title.strip())
8 |
--------------------------------------------------------------------------------
/scripts/20.py:
--------------------------------------------------------------------------------
1 | # The number of OSHA enforcement inspections involving Wal-Mart in California since 2014
2 | from lxml import html
3 | import requests
4 | import re
5 | url = "https://www.osha.gov/pls/imis/establishment.search"
6 | atts = {'Office': 'all',
7 | 'State': 'CA',
8 | 'endday': '13',
9 | 'endmonth': '06',
10 | 'endyear': '2015',
11 | 'establishment': 'Wal-Mart',
12 | 'officetype': 'all',
13 | 'p_case': 'all',
14 | 'p_violations_exist': 'all',
15 | 'startday': '01',
16 | 'startmonth': '01',
17 | 'startyear': '2014'}
18 |
19 | doc = html.fromstring(requests.get(url, params = atts).text)
20 | # Looks like:
21 | #
22 | # Results 1 - 8 of 8
23 | #
24 | v = re.search('of (\d+)', doc.cssselect('.text-right')[1].text)
25 | print(int(v.groups()[0]))
26 |
--------------------------------------------------------------------------------
/scripts/21.py:
--------------------------------------------------------------------------------
1 | # The current humidity level at Great Smoky Mountains National Park
2 | from lxml import html
3 | import requests
4 | url = "http://www.nature.nps.gov/air/WebCams/parks/grsmcam/grsmcam.cfm"
5 | doc = html.fromstring(requests.get(url).text)
6 | print(doc.cssselect('#CollapsiblePanel6 div div div')[3].text_content())
7 |
--------------------------------------------------------------------------------
/scripts/22.py:
--------------------------------------------------------------------------------
1 | # The names of the committees that Sen. Barbara Boxer currently serves on
2 | import requests
3 | from lxml import html
4 | url="http://www.senate.gov/general/committee_assignments/assignments.htm"
5 | doc = html.fromstring(requests.get(url).text)
6 | row = next(tr for tr in doc.cssselect('tr') if 'Boxer, Barbara' in tr.text_content())
7 | print("\n".join(a.text_content() for a in row.cssselect('td')[1].cssselect('a')))
8 |
--------------------------------------------------------------------------------
/scripts/23.py:
--------------------------------------------------------------------------------
1 | # The name of the California school with the highest number of girls enrolled in kindergarten, according to the CA Dept. of Education's latest enrollment data file.
2 | import csv
3 | import requests
4 | from collections import defaultdict
5 | from operator import itemgetter
6 | url = 'http://dq.cde.ca.gov/dataquest/dlfile/dlfile.aspx?cLevel=School&cYear=2014-15&cCat=Enrollment&cPage=filesenr.asp'
7 |
8 | def foo(row):
9 | return int(row['KDGN']) if row['KDGN'] else 0
10 |
11 | lines = requests.get(url).text.splitlines()
12 | data = list(csv.DictReader(lines, delimiter = "\t"))
13 |
14 | codes = defaultdict(int)
15 | for d in data:
16 | if d['GENDER'] == 'F':
17 |         codes[d['CDS_CODE']] += foo(d)  # foo() guards against blank KDGN values
18 |
19 | cds, num = max(codes.items(), key = itemgetter(1))
20 | print([d['SCHOOL'] for d in data if d['CDS_CODE'] == cds][0])
21 |
22 |
--------------------------------------------------------------------------------
/scripts/24.py:
--------------------------------------------------------------------------------
1 | # Percentage of NYPD stop-and-frisk reports in which the suspect was white in 2014
2 | from shutil import unpack_archive
3 | import csv
4 | import os
5 | import requests
6 | DATADIR = '/tmp/nypd'
7 | os.makedirs(DATADIR, exist_ok = True)
8 | zipurl = 'http://www.nyc.gov/html/nypd/downloads/zip/analysis_and_planning/2014_sqf_csv.zip'
9 | zname = os.path.join(DATADIR, os.path.basename(zipurl))
10 | cname = os.path.join(DATADIR, '2014.csv')
11 | if not os.path.exists(zname):
12 | print("Downloading", zipurl, 'to', zname)
13 | z = requests.get(zipurl).content
14 | with open(zname, 'wb') as f:
15 | f.write(z)
16 | # unzip it
17 | print("Unzipping", zname, 'to', DATADIR)
18 | unpack_archive(zname, DATADIR)
19 |
20 | data = list(csv.DictReader(open(cname, encoding = 'latin-1')))
21 | whites = [d for d in data if d['race'] == 'W']
22 | print(len(whites) * 100 / len(data))
23 |
24 |
25 |
--------------------------------------------------------------------------------
/scripts/25.py:
--------------------------------------------------------------------------------
1 | # Average frontal crash star rating for 2015 Honda Accords
2 | import requests
3 | import re
4 | from lxml import html
5 |
6 | url = 'http://www.safercar.gov/Vehicle+Shoppers/5-Star+Safety+Ratings/2011-Newer+Vehicles/Search-Results'
7 | atts = {"searchtype":"model", "make": "HONDA", "model": "ACCORD", "year": 2015}
8 | doc = html.fromstring(requests.get(url, params = atts).text)
9 | trs = doc.cssselect("#dataarea tr")
10 | v = 0
11 | for tr in trs[1:-1]:
12 | t = tr.cssselect('td.b_right img.stars')[1].attrib['alt']
13 | v += int(re.search('\d+', t).group())
14 | print(v / len(trs[1:-1]))
15 |
--------------------------------------------------------------------------------
/scripts/26.py:
--------------------------------------------------------------------------------
1 | # The dropout rate for all of Santa Clara County high schools, according to the latest cohort data in CALPADS
2 | import csv
3 | from urllib.request import urlopen
4 | from io import TextIOWrapper as Tio
5 | from lxml import html
6 | COUNTY = 'Santa Clara'
7 | # This problem actually requires two datasets:
8 | # 1) The Dept. of Ed's list of county ID numbers to find Santa Clara
9 | SCHOOL_DB_URL = 'ftp://ftp.cde.ca.gov/demo/schlname/pubschls.txt'
10 | # 2) The latest cohort data file from CALPADS
11 | CALPADS_PAGE_URL = "http://www.cde.ca.gov/ds/sd/sd/filescohort.asp"
12 | # Obviously you could hardcode Santa Clara County's ID number but that
13 | # would be too easy. Doing a lookup of the ID lets us modify the script
14 | # to work with any county.
15 | # ...unfortunately, CDE considers the list of county IDs to be such boring info
16 | # that they don't put it anywhere easy to find. OK, so let's just download
17 | # their entire schools database just to get one number:
18 | with urlopen(SCHOOL_DB_URL) as schoolsdb:
19 | print("Downloading", SCHOOL_DB_URL)
20 | txt = Tio(schoolsdb, encoding = 'latin-1')
21 | rows = csv.DictReader(txt, delimiter = '\t')
22 | county_id = next(r['CDSCode'][0:2] for r in rows if r['County'] == COUNTY)
23 | print(COUNTY, 'ID is:', county_id)
24 | print("Downloading", CALPADS_PAGE_URL)
25 | doc = html.fromstring(urlopen(CALPADS_PAGE_URL).read())
26 | # um, I'm curious about how their ASP app here works...but whatever...
27 | urls = doc.xpath("//a[contains(@href, 'dlfile.aspx?cLevel=All')]/@href")
28 | # being lazy and assuming first item is the most recent
29 | calpads_url = urls[0]
30 | print("Downloading", calpads_url)
31 | dropouts, total = 0, 0
32 | with urlopen(calpads_url) as calpadsdb:
33 | print("Downloading", calpads_url)
34 | txt = Tio(calpadsdb, encoding = 'latin-1')
35 | for row in csv.DictReader(txt, delimiter = '\t'):
36 | # not every row is to be counted, as each school has a separate row
37 | # for each subgroup. So the filter condition is not just by county
38 | # but also by 'AggLevel' == 'S' and 'Subgroup' == 'All'
39 | if(row['CDS'][0:2] == county_id and row['AggLevel'] == 'S'
40 | and row['Subgroup'] == 'All'):
41 | try: # sooooo lazy...
42 | total += int(row['NumCohort'])
43 | dropouts += int(row['NumDropouts'])
44 | except:
45 | pass # not a number; some cells have '*'
46 |
47 | print(dropouts / total)
48 | # 0.09737916232841275
49 |
--------------------------------------------------------------------------------
/scripts/27.py:
--------------------------------------------------------------------------------
1 | # The number of Class I Drug Recalls issued by the U.S. Food and Drug Administration since 2012
2 | # Caveat: the FDA page says this:
3 | # NOTE: The recalls on the list are generally Class I,
4 | # which means there is a reasonable probability that the
5 | # use of or exposure to a violative product will cause
6 | # serious adverse health consequences or death.
7 | #
8 | # This script assumes the recalls are all Class I, for simplicity's sake
9 | from lxml import html
10 | import requests
11 | url = 'http://www.fda.gov/Drugs/DrugSafety/DrugRecalls/default.htm'
12 | doc = html.fromstring(requests.get(url).text)
13 | links = doc.cssselect('.col-md-6.col-md-push-3.middle-column linktitle')
14 | print(len(links))
15 |
--------------------------------------------------------------------------------
/scripts/28.py:
--------------------------------------------------------------------------------
1 | # Total number of clinical trials as recorded by the National Institutes of Health
2 | import requests
3 | from lxml import html
4 | url = 'https://clinicaltrials.gov/'
5 | doc = html.fromstring(requests.get(url).text)
6 | e = doc.cssselect('#trial-count > p > .highlight')[0]
7 | print(e.text_content())
8 |
--------------------------------------------------------------------------------
/scripts/29.py:
--------------------------------------------------------------------------------
1 | # Number of days until Texas's next scheduled execution
2 | from datetime import datetime
3 | from lxml import html
4 | import pytz
5 | import requests
6 | url = "http://www.tdcj.state.tx.us/death_row/dr_scheduled_executions.html"
7 | # fetch and parse the page
8 | doc = html.fromstring(requests.get(url).text)
9 | # Get our time in central texas time; http://stackoverflow.com/a/22109768/160863
10 | texas_time = pytz.timezone("US/Central")
11 | today = texas_time.localize(datetime(*datetime.now().timetuple()[0:3])) # whatever, too lazy to look up the idiom (see the sketch after this script)
12 | for row in doc.xpath('//table/tr')[1:]:
13 | # Even though this table is sorted in reverse-chronological order,
14 | # sometimes executions happen more quickly than updates to the
15 | # webpage, so we can't assume the first row is always the upcoming execution
16 | #
17 | # Each row looks like:
18 | # | 08/12/2015 | Info | Lopez | Daniel | 999555 | 09/15/1987 | H | 03/16/2010 | Nueces |
19 | col = row.cssselect('td')[0]
20 | exdate = datetime.strptime(col.text_content(), '%m/%d/%Y')
21 | exdate = texas_time.localize(exdate)
22 | if (exdate >= today):
23 | print((exdate - today).days, "days")
24 | break
25 |
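A minimal sketch of the idiom that the script's comment shrugs off, assuming the goal is simply "midnight today, localized to US/Central": datetime.combine does the job without unpacking timetuple().

from datetime import date, datetime, time
import pytz

central = pytz.timezone("US/Central")
# midnight at the start of today, localized -- equivalent to the timetuple() one-liner above
today = central.localize(datetime.combine(date.today(), time()))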
--------------------------------------------------------------------------------
/scripts/3.py:
--------------------------------------------------------------------------------
1 | # The number of people who visited a U.S. government website using Internet Explorer 6.0 in the last 90 days
2 | import requests
3 | r = requests.get("https://analytics.usa.gov/data/live/ie.json")
4 | print(r.json()['totals']['ie_version']['6.0'])
5 |
--------------------------------------------------------------------------------
/scripts/30.py:
--------------------------------------------------------------------------------
1 | # The total number of inmates executed by Florida since 1976
2 | import requests
3 | from lxml import html
4 | url = "http://www.dc.state.fl.us/oth/deathrow/execlist.html"
5 |
6 | doc = html.fromstring(requests.get(url).text)
7 | tables = doc.cssselect('table.dcCSStableLight')
8 | rows = tables[0].cssselect('tr')
9 | # the first row is just the header row
10 | print(len(rows) - 1)
11 |
--------------------------------------------------------------------------------
/scripts/31.py:
--------------------------------------------------------------------------------
1 | # The number of proposed U.S. federal regulations in which comments are due within the next 3 days
2 |
3 | # Note:
4 | # This exercise is a major snafu on my part, as I assigned it thinking you
5 | # could easily scrape it from the front page.
6 | # However, the HTML of the results is generated client-side, after an AJAX
7 | # request to whatever-the-f-ck this serialized data format is:
8 | # GET http://www.regulations.gov/dispatch/LoadRegulationsClosingSoon
9 | # Response:
10 | # //OK[21,20,19,3,18,17,16,3,15,14,13,3,12,11,10,3,9,8,7,3,6,5,4,3,6,2,1,["gov.egov.erule.regs.shared.dispatch.LoadRegulationsClosingSoonResult/4107109627","java.util.ArrayList/4159755760","gov.regulations.common.models.DimensionValueModel/244318028","41","Today","8075","83","3 Days","8096","194","7 Days","8076","436","15 Days","8097","766","30 Days","8077","1133","90 Days","8078"],0,7]
11 | #
12 | # Reverse engineering is not a fun type of challenge. For this particular exercise,
13 | # though, the answer can be found through a simple API call.
14 | #
15 | #
16 | # The API dev docs are here: http://regulationsgov.github.io/developers/
17 | #
18 | # Specifically, the documents.json endpoint described here:
19 | # http://regulationsgov.github.io/developers/console/#!/documents.json/documents_get_0
20 | #
21 | # This endpoint has 1 parameter necessary for this exercise:
22 | #
23 | # - cs: Comment Period Closing Soon; the value is an integer for number of days
24 | # until closing
25 | import requests
26 | BASE_URL = 'http://api.data.gov/regulations/v3/documents.json'
27 | my_params = {'api_key': 'DEMO_KEY', 'cs': 3}
28 | resp = requests.get(BASE_URL, params = my_params)
29 | print(resp.json()['totalNumRecords'])
30 |
--------------------------------------------------------------------------------
/scripts/32.py:
--------------------------------------------------------------------------------
1 | # Number of Titles that have changed in the United States Code since its last release point
2 | # Note: the div class of "usctitlechanged" is used to mark such titles
3 | import requests
4 | url = 'http://uscode.house.gov/download/download.shtml'
5 | txt = requests.get(url).text
6 | print(txt.count('class="usctitlechanged" id'))
7 |
--------------------------------------------------------------------------------
/scripts/33.py:
--------------------------------------------------------------------------------
1 | # The number of FDA-approved, but now discontinued drug products that contain Fentanyl as an active ingredient
2 | # landing page:
3 | # http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm
4 | # search page:
5 | # http://www.accessdata.fda.gov/scripts/cder/ob/docs/queryai.cfm
6 | import re
7 | import requests
8 |
9 | formurl = 'http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm'
10 | post_params = {'Generic_Name': 'Fentanyl', 'table1': 'OB_Disc'}
11 | resp = requests.post(formurl, data = post_params)
12 | # Displaying records 1 to 29 of 29
13 | m = re.search(r'(?<=Displaying records) *[\d,]+ *to *[\d,]+ *of *([\d,]+)', resp.text)
14 | print(m.groups()[0])
15 |
--------------------------------------------------------------------------------
/scripts/34.py:
--------------------------------------------------------------------------------
1 | # In the latest FDA Weekly Enforcement Report, the number of Class I and Class II recalls involving food
2 | import requests
3 | from lxml import html
4 | url = 'http://www.fda.gov/Safety/Recalls/EnforcementReports/default.htm'
5 | doc = html.fromstring(requests.get(url).text)
6 | reporturl = doc.xpath('//a[contains(text(), "Enforcement Report for ")]/@href')[0]
7 | # example weekly report:
8 | # http://www.accessdata.fda.gov/scripts/enforcement/enforce_rpt-Product-Tabs.cfm?action=Expand+Index&w=06102015&lang=eng
9 | report = html.fromstring(requests.get(reporturl).text)
10 | print(len(report.cssselect('tr.Food')))
11 |
--------------------------------------------------------------------------------
/scripts/35.py:
--------------------------------------------------------------------------------
1 | # Most viewed data set on New York state's open data portal as of this month
2 | import requests
3 | from lxml import html
4 | # There's probably a JSON endpoint for this...but what the heck, let's
5 | # do HTML parsing (a JSON-based sketch follows this script)
6 | url = 'https://data.ny.gov/browse?sortBy=most_accessed&sortPeriod=month'
7 | doc = html.fromstring(requests.get(url).text)
8 | t = doc.cssselect('tr.item .titleLine a')[0]
9 | print(t.text_content())
10 |
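For the JSON endpoint the script's comment guesses at, one option is Socrata's Discovery API. The endpoint, its page_views_last_month ordering, and the response shape are assumptions to verify against Socrata's docs, not a confirmed equivalent of the page scrape above.

import requests

# assumption: the Socrata Discovery API indexes data.ny.gov and supports this ordering
CATALOG_URL = 'https://api.us.socrata.com/api/catalog/v1'
params = {'domains': 'data.ny.gov', 'order': 'page_views_last_month', 'limit': 1}
results = requests.get(CATALOG_URL, params=params).json()['results']
print(results[0]['resource']['name'])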
--------------------------------------------------------------------------------
/scripts/36.py:
--------------------------------------------------------------------------------
1 | # Total number of visitors to the White House in 2012
2 | # landing page:
3 | # https://www.whitehouse.gov/briefing-room/disclosures/visitor-records
4 | import csv
5 | import os
6 | import requests
7 | from shutil import unpack_archive
8 | LOCAL_DATADIR = "/tmp/whvisitors"
9 | url = 'https://www.whitehouse.gov/sites/default/files/disclosures/whitehouse-waves-2012.csv_.zip'
10 | zname = os.path.join(LOCAL_DATADIR, os.path.basename(url))
11 | cname = os.path.join(LOCAL_DATADIR, 'WhiteHouse-WAVES-2012.csv')
12 | os.makedirs(LOCAL_DATADIR, exist_ok = True)
13 |
14 | # Download the zip
15 | if not os.path.exists(zname):
16 | print("Downloading", url, 'to', zname)
17 | z = requests.get(url).content
18 | with open(zname, 'wb') as f:
19 | f.write(z)
20 |
21 | # unzip it
22 | print("Unzipping", zname, 'to', LOCAL_DATADIR)
23 | unpack_archive(zname, LOCAL_DATADIR)
24 | # the file was zipped on a Mac, yet still uses Windows encoding...mkaaay
25 | rows = list(csv.DictReader(open(cname, encoding = 'ISO-8859-1')))
26 | print(len(rows))
27 | # 934872
28 |
--------------------------------------------------------------------------------
/scripts/37.py:
--------------------------------------------------------------------------------
1 | # The last time the CIA's Leadership page has been updated
2 | import requests
3 | import re
4 | url = "https://www.cia.gov/about-cia/leadership"
5 | txt = re.search('Last Updated:.+?(?=<)', requests.get(url).text).group()
6 | print(txt)
7 |
--------------------------------------------------------------------------------
/scripts/38.py:
--------------------------------------------------------------------------------
1 | # The domain of the most visited U.S. government website right now
2 | import requests
3 | url = 'https://analytics.usa.gov/data/live/top-pages-realtime.json'
4 | resp = requests.get(url).json()
5 | print(resp['data'][0]['page'])
6 |
--------------------------------------------------------------------------------
/scripts/39.py:
--------------------------------------------------------------------------------
1 | # Number of medical device recalls issued by the U.S. Food and Drug Administration in 2013
2 | from lxml import html
3 | import requests
4 | url = 'http://www.fda.gov/MedicalDevices/Safety/ListofRecalls/ucm384618.htm'
5 | doc = html.fromstring(requests.get(url).text)
6 | print(len(doc.cssselect('tbody tr')))
7 |
--------------------------------------------------------------------------------
/scripts/4.py:
--------------------------------------------------------------------------------
1 | # The number of librarian-related job positions that the federal government is currently hiring for
2 | import requests
3 | # via http://www.opm.gov/policy-data-oversight/classification-qualifications/general-schedule-qualification-standards/#url=List-by-Occupational-Series
4 | LIBSERIES = 1410
5 | resp = requests.get("https://data.usajobs.gov/api/jobs", params = {'series': LIBSERIES})
6 | print(resp.json()['TotalJobs'])
7 |
--------------------------------------------------------------------------------
/scripts/40.py:
--------------------------------------------------------------------------------
1 | # Number of FOIA requests made to the Chicago Public Library
2 | import csv
3 | import requests
4 | url = 'https://data.cityofchicago.org/api/views/n379-5uzu/rows.csv?accessType=DOWNLOAD'
5 | data = list(csv.DictReader(requests.get(url).text.splitlines()))
6 | print(len(data))
7 |
--------------------------------------------------------------------------------
/scripts/41.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import re
3 | url = "https://clinicaltrials.gov/ct2/results?recr=Open&cond=%22Alcohol-Related+Disorders%22"
4 | r = re.search(r'\d+(?= +studies found for)', requests.get(url).text)
5 | print(r.group())
6 |
--------------------------------------------------------------------------------
/scripts/42.py:
--------------------------------------------------------------------------------
1 | # The name of the Supreme Court justice who delivered the opinion in the most recently announced decision
2 | # depends on PyPDF2 https://pythonhosted.org/PyPDF2/PdfFileReader.html
3 | from io import BytesIO
4 | from lxml import html
5 | from PyPDF2 import PdfFileReader
6 | from urllib.parse import urljoin
7 | import requests
8 | import re
9 | # get the most recent ruling
10 | url = "http://www.supremecourt.gov/opinions/slipopinions.aspx"
11 | doc = html.fromstring(requests.get(url).text)
12 | a = doc.cssselect('#mainbody table')[0].cssselect('tr a')[0]
13 | # download PDF
14 | pdf_url = urljoin(url, a.attrib['href'])
15 | pdfbytes = BytesIO(requests.get(pdf_url).content)
16 | pdf = PdfFileReader(pdfbytes)
17 | # compile text of all the pages
18 | txt = ""
19 | for i in range(pdf.getNumPages()):
20 | txt += pdf.getPage(i).extractText() + "\n"
21 | # regex match...hopefully this is *always* the text...
22 | m = re.search(r"[A-Z]+, *(?:J\.|C\. J\.|JJ\.)(?=, delivered the opinion of the Court)", txt)
23 | print(m.group())
24 |
25 | # Sample relevant text:
26 | # KENNEDY, J., delivered the opinion of the Court, in which GINSBURG,
27 | # BREYER, SOTOMAYOR, and KAGAN, JJ., joined. BREYER, J., filed a concurring
28 | # opinion. THOMAS, J., filed an opinion concurring in the judgment
29 | # in part and dissenting in part. ROBERTS, C. J., filed a dissenting opinion,
30 | # in which ALITO, J., joined. SCALIA, J., filed a dissenting opinion, in
31 | # which ROBERTS, C. J., and ALITO, J., joined.
32 |
--------------------------------------------------------------------------------
/scripts/43.py:
--------------------------------------------------------------------------------
1 | # The number of citations that resulted from FDA inspections in fiscal year 2012
2 | import requests
3 | import csv
4 | # list of citations is here:
5 | # http://www.fda.gov/ICECI/Inspections/ucm346077.htm
6 | csv_url = 'http://www.fda.gov/downloads/ICECI/Inspections/UCM346093.csv'
7 | print("Downloading", csv_url)
8 | resp = requests.get(csv_url)
9 | rows = list(csv.DictReader(resp.text.splitlines()[2:]))
10 | print(len(rows))
11 |
--------------------------------------------------------------------------------
/scripts/44.py:
--------------------------------------------------------------------------------
1 | # Number of people visiting a U.S. government website right now
2 | # via: https://analytics.usa.gov/
3 | import requests
4 | url = 'https://analytics.usa.gov/data/live/realtime.json'
5 | j = requests.get(url).json()
6 | print(j['data'][0]['active_visitors'])
7 |
--------------------------------------------------------------------------------
/scripts/45.py:
--------------------------------------------------------------------------------
1 | # The number of security alerts issued by US-CERT in the current year
2 | import requests
3 | from lxml import html
4 | url = 'https://www.us-cert.gov/ncas/alerts'
5 | doc = html.fromstring(requests.get(url).text)
6 | print(len(doc.cssselect('.item-list li')))
7 |
--------------------------------------------------------------------------------
/scripts/46.py:
--------------------------------------------------------------------------------
1 | # The number of Pinterest accounts maintained by U.S. State Department embassies and missions
2 |
3 | # Note: You can extend this problem to include ALL Pinterest
4 | # accounts (e.g. those maintained by Consulates), because
5 | # the HTML structure here is too atrocious to cleanly separate them
6 | import requests
7 | from lxml import html
8 | url = 'http://www.state.gov/r/pa/ode/socialmedia/'
9 | doc = html.fromstring(requests.get(url).text)
10 | pinlinks = [a for a in doc.cssselect('a') if 'pinterest.com' in str(a.attrib.get('href'))]
11 | # we just need a count, so no need to do anything more
12 | # sophisticated
13 | print(len(pinlinks))
14 |
--------------------------------------------------------------------------------
/scripts/47.py:
--------------------------------------------------------------------------------
1 | # The number of international travel alerts from the U.S. State Department currently in effect
2 | import requests
3 | from lxml import html
4 | url = 'http://travel.state.gov/content/passports/english/alertswarnings.html'
5 | doc = html.fromstring(requests.get(url).text)
6 | print(len(doc.cssselect('td.alert')))
7 |
8 |
--------------------------------------------------------------------------------
/scripts/48.py:
--------------------------------------------------------------------------------
1 | # The difference in total White House staffmember salaries in 2014 versus 2010
2 | import csv
3 | import requests
4 | # info https://www.whitehouse.gov/briefing-room/disclosures/annual-records/2014
5 | url2010 = 'https://open.whitehouse.gov/api/views/rcp4-3y7g/rows.csv?accessType=DOWNLOAD'
6 | url2014 = 'https://open.whitehouse.gov/api/views/i9g8-9web/rows.csv?accessType=DOWNLOAD'
7 |
8 | d2010 = list(csv.DictReader(requests.get(url2010).text.splitlines()))
9 | d2014 = list(csv.DictReader(requests.get(url2014).text.splitlines()))
10 |
11 | s2010 = 0
12 | for d in d2010:
13 | s2010 += float(d['Salary'].replace('$', ''))
14 |
15 | s2014 = 0
16 | for d in d2014:
17 | s2014 += float(d['Salary'].replace('$', ''))
18 |
19 | print(s2014 - s2010)
20 |
--------------------------------------------------------------------------------
/scripts/49.py:
--------------------------------------------------------------------------------
1 | # Number of sponsored bills by Rep. Nancy Pelosi that were vetoed by the President
2 | from lxml import html
3 | import requests
4 | import re
5 | url = 'https://www.congress.gov/member/nancy-pelosi/P000197'
6 | atts = {'q': '{"sponsorship":"sponsored","bill-status":"veto"}'}
7 | doc = html.fromstring(requests.get(url, params = atts).text)
8 | t = doc.cssselect('.results-number')[0].text_content()
9 | # e.g. 1-25 of 4,897
10 | r = re.search(r'(?<=of) *[\d,]+', t).group().replace(',', '').strip()
11 | print(r)
12 |
--------------------------------------------------------------------------------
/scripts/5.py:
--------------------------------------------------------------------------------
1 | # The name of the company cited in the most recent consumer complaint involving student loans
2 | # note that this is a pre-made filter from:
3 | # https://data.consumerfinance.gov/dataset/Consumer-Complaints/x94z-ydhh
4 | import requests
5 | from operator import itemgetter
6 | url = "https://data.consumerfinance.gov/api/views/c8k9-ryca/rows.json?accessType=DOWNLOAD"
7 |
8 | # If you go the JSON route with Socrata, you have to
9 | # do an extra step of parsing metadata to get the
10 | # desired columns...or you could just hardcode their
11 | # positions for now (a SODA-based sketch that avoids this follows this script)
12 | data = requests.get(url).json()
13 | # use meta data to extract which column Company exists in
14 | cols = data['meta']['view']['columns']
15 | # fancier way of doing a for-loop and counter
16 | # http://stackoverflow.com/questions/2748235/in-python-how-can-i-find-the-index-of-the-first-item-in-a-list-that-is-not-some
17 |
18 | # get position of Date received column
19 | d_pos = next((i for i, c in enumerate(cols) if c['name'] == 'Date received'), -1)
20 |
21 | # get position of Company column
22 | c_pos = next((i for i, c in enumerate(cols) if c['name'] == 'Company'), -1)
23 |
24 | # It appears that Socrata returns the data in order of
25 | # Date received, but just in case, take the max by date
26 | row = max(data['data'], key = itemgetter(d_pos))
27 | print(row[c_pos])
28 |
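A sketch of the SODA route that skips the metadata juggling noted above: the resource endpoint returns rows keyed by field name, so the sort can happen server-side. The /resource/ URL reuses the view id from the script, but the field names (date_received, company) are assumptions to check against the dataset's API docs.

import requests

# assumption: the same view is exposed via SODA at /resource/<view-id>.json
SODA_URL = 'https://data.consumerfinance.gov/resource/c8k9-ryca.json'
# assumption: 'date_received' and 'company' are the SODA field names for these columns
params = {'$order': 'date_received DESC', '$limit': 1}
latest = requests.get(SODA_URL, params=params).json()[0]
print(latest.get('company'))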
--------------------------------------------------------------------------------
/scripts/50.py:
--------------------------------------------------------------------------------
1 | # In the most recently transcribed Supreme Court argument, the number of times laughter broke out
2 | from lxml import html
3 | from subprocess import check_output
4 | from urllib.parse import urljoin
5 | import requests
6 | url = 'http://www.supremecourt.gov/oral_arguments/argument_transcript.aspx'
7 | doc = html.fromstring(requests.get(url).text)
8 | # get the most recent ruling, e.g. the top of table
9 | href = doc.cssselect('table.datatables tr a')[0].attrib['href']
10 | # download PDF
11 | pdf_url = urljoin(url, href)
12 | with open("/tmp/t.pdf", 'wb') as f:
13 | f.write(requests.get(pdf_url).content)
14 | # punt to shell and run pdftotext
15 | # http://www.foolabs.com/xpdf/download.html
16 | txt = check_output("pdftotext -layout /tmp/t.pdf -", shell = True).decode()
17 | print(txt.count("(Laughter.)"))
18 |
19 |
20 |
21 |
22 |
23 |
--------------------------------------------------------------------------------
/scripts/51.py:
--------------------------------------------------------------------------------
1 | # The title of the most recent decision handed down by the U.S. Supreme Court
2 | import requests
3 | from lxml import html
4 | url = 'http://www.supremecourt.gov/opinions/slipopinions.aspx'
5 | doc = html.fromstring(requests.get(url).text)
6 | print(doc.cssselect("#mainbody table tr a")[0].text_content())
7 |
--------------------------------------------------------------------------------
/scripts/52.py:
--------------------------------------------------------------------------------
1 | # The average wage of optometrists according to the BLS's most recent National Occupational Employment and Wage Estimates report
2 | from lxml import html
3 | import requests
4 | url = 'http://www.bls.gov/oes/current/oes_nat.htm'
5 | doc = html.fromstring(requests.get(url).text)
6 | table = doc.cssselect('#bodytext table')[0]
7 | t = next(tr for tr in table.cssselect('tr') if 'Optometrists' in tr.text_content())
8 | print( t.cssselect('td')[-2].text_content())
9 |
--------------------------------------------------------------------------------
/scripts/53.py:
--------------------------------------------------------------------------------
1 | # The total number of on-campus hate crimes as reported to the U.S. Office of Postsecondary Education, in the most recent collection year
2 | # hardcode the url to 2014 file
3 | # this task is just a mess, dependent on how well you can read
4 | # documentation and deal with the messy arrangement of columns
5 | from glob import glob
6 | from shutil import unpack_archive
7 | from xlrd import open_workbook
8 | import os
9 | import requests
10 |
11 | LOCAL_FNAME = '/tmp/ope2014excel.zip'
12 | LOCAL_DATADIR = "/tmp/ope2014excel"
13 | url = 'http://ope.ed.gov/security/dataFiles/Crime2014EXCEL.zip'
14 | # this is such a massive file that we should cache the download
15 | if not os.path.exists(LOCAL_FNAME):
16 | print("Downloading", url, 'to', LOCAL_FNAME)
17 | with open(LOCAL_FNAME, 'wb') as f:
18 | f.write(requests.get(url).content)
19 |
20 | # unzip
21 | print("Unzipping", LOCAL_FNAME, 'to', LOCAL_DATADIR)
22 | unpack_archive(LOCAL_FNAME, LOCAL_DATADIR, format = 'zip')
23 | # get filename
24 | fname = [f for f in glob(LOCAL_DATADIR + '/*.xlsx') if 'oncampushate' in f][0]
25 | # open workbook
26 | print("Opening", fname)
27 | book = open_workbook(fname)
28 | sheet = book.sheets()[0]
29 | data = [sheet.row_values(i) for i in range(sheet.nrows)]
30 | # get all column indices that correspond to relevant columns, i.e.
31 | #
32 | # 266 LAR_T_RAC13 Num 8 Larceny 2013 By Bias Race
33 | # 267 LAR_T_REL13 Num 8 Larceny 2013 By Bias Religion
34 | # 268 LAR_T_SEX13 Num 8 Larceny 2013 By Bias Sexual Orientation
35 | # 269 LAR_T_GEN13 Num 8 Larceny 2013 By Bias Gender
36 | # 270 LAR_T_DIS13 Num 8 Larceny 2013 By Bias Disability
37 | # 271 LAR_T_ETH13 Num 8 Larceny 2013 By Bias Ethnicity
38 | wanted_heds = ['RAC13', 'REL13', 'SEX13', 'GEN13', 'DIS13', 'ETH13']
39 | indices = [i for i, c in enumerate(data[0]) if any(t in c for t in wanted_heds)]
40 | crime_count = 0
41 | for row in data[1:]:
42 | for i in indices:
43 | if row[i]:
44 | crime_count += int(row[i])
45 | print(crime_count)
46 |
--------------------------------------------------------------------------------
/scripts/54.py:
--------------------------------------------------------------------------------
1 | # The number of people on FBI's Most Wanted List for white collar crimes
2 | import requests
3 | from lxml import html
4 | url = 'http://www.fbi.gov/wanted/wcc/@@wanted-group-listing'
5 | doc = html.fromstring(requests.get(url).text)
6 | print(len(doc.cssselect('.contenttype-FBIPerson')))
7 |
--------------------------------------------------------------------------------
/scripts/55.py:
--------------------------------------------------------------------------------
1 | # The number of Government Accountability Office reports and testimonies on the topic of veterans
2 | import requests
3 | import re
4 | from lxml import html
5 | url = 'http://www.gao.gov/browse/topic/Veterans'
6 | doc = html.fromstring(requests.get(url).text)
7 | txt = doc.cssselect('h2.scannableTitle')[0].text_content().strip()
8 | # 'Veterans (1 - 10 of 1,170 items)'
9 | v = re.search(r'of (\d+)', txt.replace(',', '')).groups()[0]
10 | print(int(v))
11 |
--------------------------------------------------------------------------------
/scripts/56.py:
--------------------------------------------------------------------------------
1 | # Number of times Rep. Darrell Issa's remarks have made it onto the Congressional Record
2 | from lxml import html
3 | import requests
4 |
5 | baseurl = "https://www.congress.gov/search"
6 | atts = {"source":"congrecord","crHouseMemberRemarks":"Issa, Darrell E. [R-CA]"}
7 | doc = html.fromstring(requests.get(baseurl, params = atts).text)
8 | t = doc.cssselect(".results-number")[0].text_content()
9 | print(t.split('of')[-1].strip().replace(',', ''))
10 |
--------------------------------------------------------------------------------
/scripts/57.py:
--------------------------------------------------------------------------------
1 | # The top 3 auto manufacturers, ranked by total number of recalls via NHTSA safety-related defect and compliance campaigns since 1967.
2 | import csv
3 | import requests
4 |
5 | from collections import Counter
6 | from io import BytesIO, TextIOWrapper
7 | from zipfile import ZipFile
8 |
9 | ZIP_URL = 'http://www-odi.nhtsa.dot.gov/downloads/folders/Recalls/FLAT_RCL.zip'
10 | # Schema comes from http://www-odi.nhtsa.dot.gov/downloads/folders/Recalls/RCL.txt
11 | MFGNAME_FIELD_NUM = 7
12 | counter = Counter()
13 | print("Downloading", ZIP_URL)
14 | resp = requests.get(ZIP_URL)
15 | with ZipFile(BytesIO(resp.content)) as zfile:
16 | fname = zfile.filelist[0].filename
17 | print("Unzipping...", fname) # note: the unpacked zip is 120MB+
18 | with zfile.open(fname, 'rU') as zf:
19 | reader = csv.reader(TextIOWrapper(zf, encoding = 'latin-1'), delimiter = "\t")
20 | counter.update(row[MFGNAME_FIELD_NUM] for row in reader)
21 |
22 | for mfgname, count in counter.most_common(3):
23 | print("%s: %s" % (mfgname, count))
24 |
25 |
--------------------------------------------------------------------------------
/scripts/58.py:
--------------------------------------------------------------------------------
1 | # The number of published research papers from the NSA
2 | import requests
3 | from lxml import html
4 | url = 'https://www.nsa.gov/research/publications/index.shtml'
5 | doc = html.fromstring(requests.get(url).text)
6 | print(len(doc.cssselect('table.dataTable tr')[1:]))
7 |
--------------------------------------------------------------------------------
/scripts/59.py:
--------------------------------------------------------------------------------
1 | # The number of university-related datasets currently listed at data.gov
2 | import requests
3 | import re
4 | url = 'http://catalog.data.gov/dataset?'
5 | atts = {'organization_type': 'University', 'sort': 'metadata_created desc'}
6 | txt = requests.get(url, params = atts).text
7 | print(re.search(r"[0-9,]+(?= *datasets found)", txt).group().replace(',', ''))
8 |
--------------------------------------------------------------------------------
/scripts/6.py:
--------------------------------------------------------------------------------
1 | # From 2010 to 2013, the change in median cost of health, dental, and vision coverage for California city employees
2 | from shutil import unpack_archive
3 | from statistics import median
4 | import csv
5 | import os
6 | import requests
7 | LOCAL_DATADIR = "/tmp/capublicpay"
8 | BASE_URL = 'http://publicpay.ca.gov/Reports/RawExport.aspx?file='
9 | YEARS = (2010, 2013)
10 |
11 | medians = []
12 | for year in YEARS:
13 | basefname = '%s_City.zip' % year
14 | url = BASE_URL + basefname
15 | local_zname = "/tmp/" + basefname
16 | # this is such a massive file that we should cache the download
17 | if not os.path.exists(local_zname):
18 | print("Downloading", url, 'to', local_zname)
19 | data = requests.get(url).content
20 | with open(local_zname, 'wb') as f:
21 | f.write(data)
22 | # done downloading, now unzip files
23 | print("Unzipping", local_zname, 'to', LOCAL_DATADIR)
24 | unpack_archive(local_zname, LOCAL_DATADIR, format = 'zip')
25 | # each zip extracts a file named YEAR_City.csv
26 | csv_name = LOCAL_DATADIR + '/' + basefname.replace('zip', 'csv')
27 | # calculate median
28 | with open(csv_name, encoding = 'latin-1') as f:
29 | # first four lines are:
30 | # Disclaimer
31 | #
32 | # The information presented is posted as submitted by the reporting entity. The State Controller's Office is not responsible for the accuracy of this information.
33 | cx = list(csv.DictReader(f.readlines()[4:]))
34 | mx = median([float(row['Health Dental Vision']) for row in cx if row['Health Dental Vision']])
35 | print("Median for %s" % year, mx)
36 | medians.append(mx)
37 |
38 | print(medians[-1] - medians[0])
39 |
--------------------------------------------------------------------------------
/scripts/60.py:
--------------------------------------------------------------------------------
1 | # Number of chapters in Title 20 (Education) of the United States Code
2 | import requests
3 | import re
4 | from lxml import html
5 | # this URL downloads the WHOLE code for education
6 | url = 'http://uscode.house.gov/view.xhtml?path=/prelim@title20&edition=prelim'
7 | print("Downloading", url)
8 | txt = requests.get(url).text
9 | doc = html.fromstring(''.join(txt.splitlines()[1:])) # skipping xml declaration
10 | # interpretation of number of chapters can vary...I'm going to go with
11 | # "highest number"
12 | titles = [t.text_content().strip() for t in doc.cssselect('h3.chapter-head strong')]
13 | m = re.search(r"(?<=CHAPTER )\d+", titles[-1]).group()
14 | print(m)
15 |
16 |
--------------------------------------------------------------------------------
/scripts/61.py:
--------------------------------------------------------------------------------
1 | import requests
2 | from lxml import html
3 | url = "http://www.state.gov/secretary/travel/index.htm"
4 | resp = requests.get(url)
5 | x = html.fromstring(resp.text).cssselect('#total-mileage span')
6 | print(x[0].text_content())
7 |
--------------------------------------------------------------------------------
/scripts/62.py:
--------------------------------------------------------------------------------
1 | # For all of 2013, the number of potential signals of serious risks or new safety information that resulted from the FDA's FAERS
2 | import requests
3 | from urllib.parse import urljoin
4 | from lxml import html
5 | url = 'http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm082196.htm'
6 | doc = html.fromstring(requests.get(url).text)
7 | links = [a.attrib['href'] for a in doc.cssselect('li a') if '2013' in a.text_content()]
8 | x = 0
9 | for href in links:
10 | u = urljoin(url, href)
11 | d = html.fromstring(requests.get(u).text)
12 | els = d.cssselect("#content .middle-column table tr")[1:]
13 | x += len(els)
14 | print(x)
15 |
--------------------------------------------------------------------------------
/scripts/63.py:
--------------------------------------------------------------------------------
1 | # In the current dataset behind Medicare's Nursing Home Compare website, the total amount of fines received by penalized nursing homes
2 | # landing page:
3 | # https://data.medicare.gov/data/nursing-home-compare
4 | import csv
5 | import os
6 | import requests
7 | from lxml import html
8 | from shutil import unpack_archive
9 | from urllib.parse import urljoin, urlparse, parse_qs
10 | LOCAL_DATADIR = "/tmp/nursinghomes"
11 | CSV_NAME = os.path.join(LOCAL_DATADIR, 'Penalties_Download.csv')
12 | os.makedirs(LOCAL_DATADIR, exist_ok = True)
13 | #
14 | # The zip URL looks like this:
15 | # https://data.medicare.gov/views/bg9k-emty/files/AsD4-xSfJuwZKwb_gMosljIKMST...
16 | # TZ1PmBSoRGqivFmo?filename=DMG_CSV_DOWNLOAD20150501.zip&content_type=application%2Fzip%3B%20charset%3Dbinary
17 |
18 | # we assume that the zip file URL changes frequently and can't be hardcoded
19 | # so we go through the process of auto-magically determining that URL
20 | url = 'https://data.medicare.gov/data/nursing-home-compare'
21 | doc = html.fromstring(requests.get(url).text)
22 | zipurl = [a.attrib['href'] for a in doc.cssselect('a')
23 | if 'CSV_DOWNLOAD' in str(a.attrib.get('href'))][0]
24 | zipurl = urljoin(url, zipurl)
25 | bname = parse_qs(urlparse(zipurl).query)['filename'][0]
26 | zname = os.path.join(LOCAL_DATADIR, bname)
27 | if not os.path.exists(zname):
28 | print("Downloading", zipurl, 'to', zname)
29 | z = requests.get(zipurl).content
30 | with open(zname, 'wb') as f:
31 | f.write(z)
32 | print('Unzipping', zname, 'to', LOCAL_DATADIR)
33 | unpack_archive(zname, LOCAL_DATADIR)
34 | rows = list(csv.DictReader(open(CSV_NAME, encoding = 'ISO-8859-1')))
35 | print(sum([float(r['fine_amt']) for r in rows if r['fine_amt']]))
36 |
--------------------------------------------------------------------------------
/scripts/64.py:
--------------------------------------------------------------------------------
1 | # From March 1 to 7, 2015, the number of times in which designated FDA policy makers met with persons outside the U.S. federal executive branch
2 | # this is a hardcoded URL
3 | url = 'http://www.fda.gov/NewsEvents/MeetingsConferencesWorkshops/PastMeetingsWithFDAOfficials/ucm439318.htm'
4 | import requests
5 | print(requests.get(url).text.count('Event Date'))
6 |
--------------------------------------------------------------------------------
/scripts/65.py:
--------------------------------------------------------------------------------
1 | # The number of failed votes in the roll calls 1 through 99, in the U.S. House of the 114th Congress
2 | import requests
3 | from lxml import html
4 | # There are many places to get roll call info, including the House Clerk:
5 | # http://clerk.house.gov/evs/2015/index.asp
6 | # We could programmatically find the target page but it's not
7 | # worth it for this exercise (though a sketch of that step follows this script):
8 | URL = 'http://clerk.house.gov/evs/2015/ROLL_000.asp'
9 | doc = html.fromstring(requests.get(URL).text)
10 | # good ol' Xpath
11 | print(len(doc.xpath('//tr/td[5]/font[text()="F"]')))
12 | # 28
13 |
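A sketch of the "programmatically find the target page" step the comment waves off, built on the Clerk's index page mentioned above; it assumes the index links out to per-hundred pages named ROLL_NNN.asp, which should be eyeballed before relying on it.

import re
import requests

INDEX_URL = 'http://clerk.house.gov/evs/2015/index.asp'
# assumption: the index page links to pages named ROLL_NNN.asp
pages = sorted(set(re.findall(r'ROLL_\d+\.asp', requests.get(INDEX_URL).text)))
print(pages[0])  # presumably ROLL_000.asp, the page covering roll calls 1-100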
--------------------------------------------------------------------------------
/scripts/66.py:
--------------------------------------------------------------------------------
1 | # The highest minimum wage as mandated by state law.
2 | import requests
3 | import re
4 | from lxml import html
5 | # helper foo
6 | def foo(c):
7 | m = re.search('([A-Z]{2}).+?(\d+\.\d+)', c.text_content())
8 | if m:
9 | state, wage = m.groups()
10 | return (float(wage), state)
11 | else:
12 | return None
13 |
14 | url = 'http://www.dol.gov/whd/minwage/america.htm'
15 | doc = html.fromstring(requests.get(url).text)
16 |
17 | # easiest target is "Consolidated State Minimum Wage Update Table",
18 | # of which the first column is: "Greater than federal MW"
19 |
20 | # Love this elegant solution: find the text node, then search upwards with ancestor::
21 | # http://stackoverflow.com/a/3923863/160863
22 | xstr = "//text()[contains(., 'Greater than federal MW')]/ancestor::table[1]//tr/td[1]"
23 | cols = [foo(c) for c in doc.xpath(xstr) if foo(c)]
24 | topcol = max(cols)
25 |
26 | print(topcol[1], topcol[0])
27 | # DC 9.5
28 |
29 |
--------------------------------------------------------------------------------
/scripts/68.py:
--------------------------------------------------------------------------------
1 | # Number of FDA-approved prescription drugs with GlaxoSmithKline as the applicant holder
2 | # landing page:
3 | # http://www.accessdata.fda.gov/scripts/cder/ob/docs/queryah.cfm
4 | import re
5 | import requests
6 | formurl = 'http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempah.cfm'
7 | post_params = {'Sponsor': 'GlaxoSmithKline', 'table1': 'OB_Rx'}
8 | resp = requests.post(formurl, data = post_params)
9 | # Displaying records 1 to 88 of 88
10 | m = re.search(r'(?<=Displaying records) *[\d,]+ *to *[\d,]+ *of *([\d,]+)', resp.text)
11 | print(m.groups()[0])
12 |
--------------------------------------------------------------------------------
/scripts/69.py:
--------------------------------------------------------------------------------
1 | # The average number of comments on the last 50 posts on NASA's official Instagram account
2 | from urllib.parse import urljoin
3 | import os
4 | import requests
5 | DOMAIN = 'https://api.instagram.com/'
6 | USERNAME = 'nasa'
7 | ITEM_COUNT = 50
8 | # note: I've specified INSTAGRAM_TOKEN in my ~/.bash_profile
9 | atts = {'access_token': os.environ.get('INSTAGRAM_TOKEN')}
10 | # unless you know NASA's Instagram ID by memory, you'll
11 | # have to hit up the search endpoint to get it
12 | # docs: http://instagram.com/developer/endpoints/users/#get_users_search
13 | search_path = '/v1/users/search'
14 | search_url = urljoin(DOMAIN, search_path)
15 | searchatts = atts.copy()
16 | searchatts['q'] = USERNAME
17 | search_results = requests.get(search_url, params = searchatts).json()
18 | uid = search_results['data'][0]['id']
19 |
20 | # now we can retrieve media information
21 | # http://instagram.com/developer/endpoints/users/#get_users_media_recent
22 | media_path = '/v1/users/%s/media/recent' % uid
23 | media_url = urljoin(DOMAIN, media_path)
24 | mediaatts = atts.copy()
25 | mediaatts['count'] = ITEM_COUNT
26 | # for whatever reason, the count of returned items is
27 | # always less than the requested count...so keep going
28 | # until we reach ITEM_COUNT
29 | items = []
30 | while len(items) < ITEM_COUNT:
31 | resp = requests.get(media_url, params = mediaatts).json()
32 | data = resp['data']
33 | if len(data) > 0:
34 | items.extend(data)
35 | mediaatts['max_id'] = data[-1]['id']
36 | else:
37 | break
38 |
39 | ccount = sum([i['comments']['count'] for i in items[0:ITEM_COUNT]])
40 | print(ccount // len(items[0:ITEM_COUNT]))
41 |
--------------------------------------------------------------------------------
/scripts/7.py:
--------------------------------------------------------------------------------
1 | # The number of listed federal executive agency internet domains
2 | # landing page: https://inventory.data.gov/dataset/fe9eeb10-2e90-433e-a955-5c679f682502/resource/b626ef1f-9019-41c4-91aa-5ae3f7457328
3 | import csv
4 | import requests
5 | url = "https://inventory.data.gov/dataset/fe9eeb10-2e90-433e-a955-5c679f682502/resource/b626ef1f-9019-41c4-91aa-5ae3f7457328/download/federalexecagncyintntdomains03302015.csv"
6 | resp = requests.get(url)
7 | data = list(csv.DictReader(resp.text.splitlines()))
8 | print(len(data))
9 |
--------------------------------------------------------------------------------
/scripts/70.py:
--------------------------------------------------------------------------------
1 | # The highest salary possible for a White House staffmember in 2014
2 | import csv
3 | import requests
4 | url = 'https://open.whitehouse.gov/api/views/i9g8-9web/rows.csv?accessType=DOWNLOAD'
5 | data = list(csv.DictReader(requests.get(url).text.splitlines()))
6 |
7 | def foo(d):
8 | return float(d['Salary'].replace('$', ''))
9 |
10 | print(max(data, key = foo)['Salary'])
11 |
--------------------------------------------------------------------------------
/scripts/71.py:
--------------------------------------------------------------------------------
1 | # The percent increase in number of babies named Archer nationwide in 2010 compared to 2000, according to the Social Security Administration
2 | # landing page:
3 | # http://www.ssa.gov/oact/babynames/limits.html
4 | import csv
5 | import os
6 | import requests
7 | from shutil import unpack_archive
8 |
9 | LOCAL_DATADIR = "/tmp/babynames"
10 | os.makedirs(LOCAL_DATADIR, exist_ok = True)
11 | url = 'http://www.ssa.gov/oact/babynames/names.zip'
12 | zname = os.path.join(LOCAL_DATADIR, 'names.zip')
13 | # download the file
14 | if not os.path.exists(zname):
15 | print("Downloading", url, 'to', zname)
16 | z = requests.get(url).content
17 | with open(zname, 'wb') as f:
18 | f.write(z)
19 | # Unzip the data
20 | print('Unzipping', zname, 'to', LOCAL_DATADIR)
21 | unpack_archive(zname, LOCAL_DATADIR)
22 | d = {2010: 0, 2000: 0}
23 | for y in d.keys():
24 | fname = os.path.join(LOCAL_DATADIR, "yob%d.txt" % y)
25 | rows = list(csv.reader(open(fname)))
26 | # each row looks like this:
27 | # Pamela,F,258
28 | d[y] += sum([int(r[2]) for r in rows if r[0] == 'Archer'])
29 |
30 | print(100 * (d[2010] - d[2000]) / d[2000])
31 |
32 |
33 |
--------------------------------------------------------------------------------
/scripts/72.py:
--------------------------------------------------------------------------------
1 | # The number of magnitude 4.5+ earthquakes detected worldwide by the USGS
2 | # landing page:
3 | # http://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php
4 | import csv
5 | import requests
6 | csvurl = 'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_month.csv'
7 | rows = list(csv.DictReader(requests.get(csvurl).text.splitlines()))
8 | print(len(rows))
9 |
--------------------------------------------------------------------------------
/scripts/73.py:
--------------------------------------------------------------------------------
1 | # The total amount of contributions made by lobbyists to Congress according to the latest downloadable quarterly report
2 | from glob import glob
3 | from lxml import etree, html
4 | from shutil import unpack_archive
5 | import os
6 | import requests
7 | DATADIR = '/tmp/lobbying'
8 | os.makedirs(DATADIR, exist_ok = True)
9 | url = 'http://www.senate.gov/legislative/Public_Disclosure/contributions_download.htm'
10 | # get listing of databases
11 | doc = html.fromstring(requests.get(url).text)
12 | # assume the most recent file sorts last among the zip links
13 | zipurl = sorted(doc.xpath('//a[contains(@href, "zip")]/@href'))[-1]
14 | zname = os.path.join(DATADIR, os.path.basename(zipurl))
15 | # Download the zip of the latest quarterly report
16 | if not os.path.exists(zname):
17 | print("Downloading", zipurl, 'to', zname)
18 | z = requests.get(zipurl).content
19 | with open(zname, 'wb') as f:
20 | f.write(z)
21 | # unzip it
22 | print("Unzipping", zname, 'to', DATADIR)
23 | unpack_archive(zname, DATADIR)
24 |
25 | ctotal = 0
26 | # each zip contains multiple xml files
27 | for x in glob(os.path.join(DATADIR, '*.xml')):
28 | xtxt = '\n'.join(open(x, encoding = 'utf-16').readlines()[1:])
29 | xdoc = etree.fromstring(xtxt)
30 | ctotal += sum(float(c) for c in xdoc.xpath('//Contribution/@Amount'))
31 |
32 | # note: this is a naive summation, without regard to whether each
33 | # Contribution node is apples-to-apples, and if corrections are made later
34 | print(ctotal)
35 |
--------------------------------------------------------------------------------
/scripts/74.py:
--------------------------------------------------------------------------------
1 | # The description of the bill most recently signed into law by the governor of Georgia
2 | from lxml import html
3 | import requests
4 | import re
5 | url = 'https://gov.georgia.gov/bills-signed'
6 | txt = requests.get(url).text
7 | hrefs = re.findall(r'(?<=/bills-signed/)\d{4}', txt)
8 | yrurl = url + '/' + max(hrefs)
9 | # e.g. https://gov.georgia.gov/bills-signed/2015
10 | doc = html.fromstring(requests.get(yrurl).text)
11 | # most recent bill is at the top
12 | print(doc.xpath('//tr/td[2]/a')[0].text_content())
13 |
--------------------------------------------------------------------------------
/scripts/75.py:
--------------------------------------------------------------------------------
1 | # Total number of officer-involved shooting incidents listed by the Philadelphia Police Department
2 | import requests
3 | from lxml import html
4 | url = "https://www.phillypolice.com/ois/"
5 | doc = html.fromstring(requests.get(url).text)
6 | x = 0
7 | for table in doc.cssselect('.ois-table'):
8 | x += len(table.cssselect('tr')) - 1
9 | print(x)
10 |
--------------------------------------------------------------------------------
/scripts/76.py:
--------------------------------------------------------------------------------
1 | # The total number of publications produced by the U.S. Government Accountability Office
2 | import requests
3 | import re
4 | url = 'http://www.gao.gov/browse/date/custom'
5 | txt = requests.get(url).text
6 | # Browsing Publications by Date (1 - 10 of 53,004 items) in Custom Date Range
7 | mx = re.search('Browsing Publications by Date.+', txt).group()
8 | m = re.search(r'[,\d]+(?= +items)', mx).group()
9 | print(m)
10 |
--------------------------------------------------------------------------------
/scripts/77.py:
--------------------------------------------------------------------------------
1 | # Number of Dallas officer-involved fatal shooting incidents in 2014
2 | import requests
3 | url = 'https://www.dallasopendata.com/resource/4gmt-jyx2.json'
4 | data = requests.get(url).json()
5 | records = [r for r in data if ('2014' in r['date']
6 | and 'Deceased' in r['suspect_deceased_injured_or_shoot_and_miss'])]
7 | print(len(records))
8 |
--------------------------------------------------------------------------------
/scripts/78.py:
--------------------------------------------------------------------------------
1 | # Number of Cupertino, CA restaurants that have been shut down due to health violations in the last six months.
2 | import requests
3 | from lxml import html
4 | url = 'https://services.sccgov.org/facilityinspection/Closure/Index?sortField=sortbyEDate'
5 | doc = html.fromstring(requests.get(url).text)
6 | print(len([t for t in doc.cssselect('td') if 'CUPERTINO' in t.text_content()]))
7 |
--------------------------------------------------------------------------------
/scripts/79.py:
--------------------------------------------------------------------------------
1 | # The change in airline revenues from baggage fees, from 2013 to 2014
2 | import requests
3 | from lxml import html
4 | # Note that the BTS provides CSV versions of each year
5 | # So using HTML parsing is the dumb way to do this. oh well (a CSV-based sketch follows this script)
6 | BASE_URL = 'https://www.rita.dot.gov/bts/sites/rita.dot.gov.bts/files/subject_areas/airline_information/baggage_fees/html/%s.html'
7 | year_totes = {2013: 0, 2014: 0}
8 |
9 | for yr in year_totes.keys():
10 | url = BASE_URL % yr
11 | resp = requests.get(url)
12 | doc = html.fromstring(resp.text)
13 | # Incredibly sloppy way of getting the total value from
14 | # the bottom-right cell of the table. oh well
15 | tval = doc.cssselect('tr td')[-1].text_content().strip()
16 | year_totes[yr] = int(tval.replace(',', '')) * 1000 # it's in 000s
17 |
18 | print(year_totes[2014] - year_totes[2013])
19 | # 179236000
20 |
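Since the note in the script says BTS also publishes these tables as CSV, here is a sketch of that route. The URL pattern is purely hypothetical (guessed by swapping html/%s.html for csv/%s.csv) and the "total sits in the bottom-right cell, in $000s" assumption is carried over from the HTML version, so both need checking against the real files.

import csv
import requests

# hypothetical URL pattern -- verify the real CSV links on the BTS baggage-fee pages
CSV_URL = 'https://www.rita.dot.gov/bts/sites/rita.dot.gov.bts/files/subject_areas/airline_information/baggage_fees/csv/%s.csv'

def year_total(year):
    rows = list(csv.reader(requests.get(CSV_URL % year).text.splitlines()))
    cells = [c for c in rows[-1] if c.strip()]      # last row of the table
    return int(cells[-1].replace('$', '').replace(',', '')) * 1000  # figures are in $000s

print(year_total(2014) - year_total(2013))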
--------------------------------------------------------------------------------
/scripts/8.py:
--------------------------------------------------------------------------------
1 | # The number of times when a New York heart surgeon's rate of patient deaths for all cardiac surgical procedures was "significantly higher" than the statewide rate, according to New York state's analysis.
2 | import requests
3 | url = 'https://health.data.ny.gov/resource/dk4z-k3xb.json'
4 | xstr = 'Rate significantly higher than Statewide Rate'
5 | data = requests.get(url).json()
6 | records = [r for r in data if xstr in r['comparison_results']]
7 | print(len(records))
8 |
--------------------------------------------------------------------------------
/scripts/80.py:
--------------------------------------------------------------------------------
1 | # The total number of babies named Odin born in Colorado according to the Social Security Administration
2 | import shutil
3 | import requests
4 | url = 'http://www.ssa.gov/OACT/babynames/state/namesbystate.zip'
5 | # Downloading will take awhile...
6 | print("Downloading", url)
7 | resp = requests.get(url)
8 | # save to hard drive
9 | with open("/tmp/ssastates.zip", "wb") as f:
10 | f.write(resp.content)
11 | # unzip
12 | shutil.unpack_archive("/tmp/ssastates.zip", "/tmp")
13 | # open up the file
14 | rows = open("/tmp/CO.TXT").readlines()
15 | totes = 0
16 | for r in rows:
17 | if 'Odin' in r:
18 | totes += int(r.split(',')[4])
19 | print(totes)
20 |
21 |
--------------------------------------------------------------------------------
/scripts/81.py:
--------------------------------------------------------------------------------
1 | # The latest release date for T-100 Domestic Market (U.S. Carriers) statistics report
2 | from lxml import html
3 | from datetime import datetime
4 | import requests
5 | LANDING_PAGE_URL = 'http://www.transtats.bts.gov/releaseinfo.asp'
6 | doc = html.fromstring(requests.get(LANDING_PAGE_URL).text)
7 | a = doc.xpath("//a[contains(text(), 'T-100 Domestic Market (U.S. Carriers)')]")[0]
8 | tr = a.getparent().getparent()
9 | txt = tr.xpath("./td[3]/text()")[0] # e.g. ['8/13/2015:']
10 | # messy
11 | dt = datetime.strptime(txt, '%m/%d/%Y:')
12 | print(dt.strftime("%Y-%m-%d"))
13 | # 2015-08-13
14 |
--------------------------------------------------------------------------------
/scripts/82.py:
--------------------------------------------------------------------------------
1 | # In the most recent FDA Adverse Events Reports quarterly extract, the number of patient reactions mentioning "Death"
2 | # Note: I changed the original exercise to something a little more specific and challenging
3 | #
4 | # We *could* use the API:
5 | # https://open.fda.gov/api/reference/#query-syntax
6 | # After reading those docs, do you know how to make even a simple call
7 | # for events filtered by date? Neither do I, so let's just go with
8 | # good ol' bulk data downloads (though a rough API sketch follows this script):
9 | # http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm082193.htm
10 | import requests
11 | from io import BytesIO
12 | from zipfile import ZipFile
13 | from lxml import html
14 | from urllib.parse import urljoin
15 | from collections import Counter
16 | LANDING_PAGE_URL = "http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm082193.htm"
17 | doc = html.fromstring(requests.get(LANDING_PAGE_URL).text)
18 | # find the most recent FAERS ASCII zip file with good ol xpath:
19 | links = doc.xpath("//a[linktitle[contains(text(), 'ASCII')] and contains(@href, 'zip')]")
20 | # Presumably, they're listed in reverse chronological order:
21 | link = links[0]
22 | zipurl = urljoin(LANDING_PAGE_URL, link.attrib['href'])
23 | print("Downloading", link.text_content(), ":")
24 | print(zipurl)
25 | resp = requests.get(zipurl) # this is going to take awhile...
26 | with ZipFile(BytesIO(resp.content)) as zfile:
27 | # the zip contains many files...we want the one labeled REACYYQX.txt
28 | # e.g. ascii/REAC15Q1.txt
29 | fname = next(x.filename for x in zfile.filelist if
30 | "REAC" in x.filename and "txt" in x.filename.lower())
31 | print("Unzipping:", fname)
32 | data = zfile.read(fname).decode('latin-1').splitlines()
33 | # The data looks like this:
34 | # primaryid$caseid$pt$drug_rec_act
35 | # 100036412$10003641$Medication residue present$
36 | # 100038593$10003859$Blood count abnormal$
37 | # 100038593$10003859$Platelet count decreased$
38 | # 100038603$10003860$Abdominal pain$
39 |
40 | # Rather than programmatically locating the "reaction" column,
41 | # e.g. "pt", I'm just going to hardcode it as the
42 | # 3rd (2nd via 0-index) column delimited by a `$` sign (a DictReader sketch follows this script)
43 | reactions = [row.split('$')[2].lower() for row in data]
44 | deaths = [r for r in reactions if 'death' in r]
45 | print("Out of %s reactions, %s mention 'death'" % (len(reactions), len(deaths)))
46 | # sample output for 2015Q1
47 | # Out of 873190 reactions, 14188 mention 'death'
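For the openFDA route the comments at the top of this script wave at, here is a rough sketch of a drug-event query. Note that it counts reports whose reactions mention death, which is not the same unit as the per-reaction count above, and the field names and hardcoded date range (2015 Q1) are assumptions to check against the API reference.

import requests

API_URL = 'https://api.fda.gov/drug/event.json'
# assumptions: 'patient.reaction.reactionmeddrapt' and 'receivedate' are the right
# openFDA fields, and 20150101-20150331 approximates the quarter of interest
params = {
    'search': 'patient.reaction.reactionmeddrapt:"death" AND receivedate:[20150101 TO 20150331]',
    'limit': 1,
}
resp = requests.get(API_URL, params=params).json()
print(resp['meta']['results']['total'])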
48 |
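And a small sketch of locating the reaction column by header name instead of hardcoding position 2, using csv.DictReader with the $-delimited header; the sample rows here are just the ones quoted in the script's comments.

import csv

sample = [
    "primaryid$caseid$pt$drug_rec_act",
    "100036412$10003641$Medication residue present$",
    "100038593$10003859$Blood count abnormal$",
]
# DictReader reads the header row, so the reaction text is simply row['pt']
reactions = [row['pt'].lower() for row in csv.DictReader(sample, delimiter='$')]
print(sum('death' in r for r in reactions))  # 0 for this tiny sample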
--------------------------------------------------------------------------------
/scripts/83.py:
--------------------------------------------------------------------------------
1 | # The sum of White House staffmember salaries in 2014
2 | import requests
3 | import csv
4 | url = "https://open.whitehouse.gov/api/views/i9g8-9web/rows.csv?accessType=DOWNLOAD"
5 | txt = requests.get(url).text
6 | totes = 0
7 | for r in csv.DictReader(txt.splitlines()):
8 | # remove $ sign, convert to float
9 | salval = float(r['Salary'].replace('$', ''))
10 | totes += salval
11 | print(totes)
12 | # 37776925.0
13 |
--------------------------------------------------------------------------------
/scripts/84.py:
--------------------------------------------------------------------------------
1 | # The total number of notices published on the most recent date to the Federal Register
2 | import requests
3 | from lxml import html
4 | url = 'https://www.federalregister.gov/'
5 | doc = html.fromstring(requests.get(url).text)
6 | print(doc.cssselect('ul.statistics li a span')[0].text_content())
7 |
--------------------------------------------------------------------------------
/scripts/85.py:
--------------------------------------------------------------------------------
1 | # The number of iPhone units sold in the latest quarter, according to Apple Inc's most recent 10-Q report
2 | # This exercise is just mean...The intent is to lead students to SEC's EDGAR,
3 | # *not* to imply that scraping EDGAR is the ideal way to do this fact-finding
4 | import requests
5 | from lxml import html
6 | from urllib.parse import urljoin
7 | # The target URL looks like this:
8 | # http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&type=10-Q&owner=exclude&count=40
9 | BASE_URL = 'http://www.sec.gov/cgi-bin/browse-edgar'
10 | AAPL_CIK = "0000320193"
11 | url_params = {
12 | 'CIK': AAPL_CIK,
13 | 'action': 'getcompany',
14 | 'type': '10-Q',
15 | 'owner':'exclude',
16 | 'count': 40}
17 | # do initial search for Apple's 10-Q forms
18 | resp = requests.get(BASE_URL, params = url_params)
19 | doc = html.fromstring(resp.text)
20 | hrefs = doc.xpath("//a[@id='documentsbutton']/@href")
21 | xurl = urljoin(BASE_URL, hrefs[0])
22 | # fetch page for most recent 10-Q:
23 | xdoc = html.fromstring(requests.get(xurl).text)
24 | # this gets us a list of more documents. Select the URL
25 | # for the one with 10q in its name
26 | href10q = xdoc.xpath("//table[@class='tableFile']//a[contains(@href, '10q.htm')]/@href")[0]
27 | url10q = urljoin(BASE_URL, href10q)
28 | # one more request
29 | qdoc = html.fromstring(requests.get(url10q).text)
30 | # now for some truly convoluted parsing logic
31 | # First, an xpath trick: http://stackoverflow.com/questions/1457638/xpath-get-nodes-where-child-node-contains-an-attribute
32 | xtd = qdoc.xpath("//td[descendant::p[contains(text(), 'Unit Sales by Product:')]]")[0]
33 | # luckily there's only one such td.
34 | # Data looks like this:
35 | # | 3 months | | | | (9 months) | | |
36 | # | Unit | June 27 | June 28 | | | | |
37 | # | sales | 2015 | 2014 | Change | | | |
38 | # |----------|---------|---------|--------|------------|---------|------|
39 | # | iPhone | 47,534 | 35,203 | 35% | 183,172 | 129,947 | 41% |
40 | # | iPad | 10,931 | 13,276 | -18% | 44,973 | 55,661 | -19% |
41 | # | Mac | 4,796 | 4,413 | 9% | 14,878 | 13,386 | 11% |
42 |
43 | xtr = xtd.getparent() # i.e. the enclosing tr...we need to move to the next tr
44 | # find the first row that has "iPhone" in it
45 | iphone_row = next(tr for tr in xtr.itersiblings() if 'iPhone' in tr.text_content())
46 | # fourth column has the data, as cols 2 and 3 are padding:
47 | sales = int(iphone_row.xpath('td[@align="right"][1]/text()')[0].replace(',', ''))
48 | print(sales * 1000) # units are listed in thousands
49 | # 47534000 (for June 2015)
50 |
--------------------------------------------------------------------------------
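The XPath pattern in 85.py (selecting a `td` by what its descendants contain) is worth isolating; here it is on a toy fragment, just to show the mechanics:

```python
# Sketch: the "select a node by the text of its descendants" XPath pattern from 85.py.
from lxml import html

frag = html.fromstring(
    "<table><tr><td><p>Unit Sales by Product:</p></td><td>other</td></tr></table>")
tds = frag.xpath("//td[descendant::p[contains(text(), 'Unit Sales by Product:')]]")
print(len(tds))  # 1 -- only the td whose <p> contains the phrase
```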
/scripts/86.py:
--------------------------------------------------------------------------------
1 | # Number of computer vulnerabilities in which IBM was the vendor in the latest Cyber Security Bulletin
2 | import requests
3 | from lxml import html
4 | from urllib.parse import urljoin
5 | url = 'https://www.us-cert.gov/ncas/bulletins'
6 | doc = html.fromstring(requests.get(url).text)
7 | href = doc.xpath('//*[@class="document_title"]/a/@href')[0]
8 | bulletin = html.fromstring(requests.get(urljoin(url, href)).text)
9 | trs = bulletin.xpath('//tr/td[1][contains(text(), "ibm")]')
10 | print(len(trs))
11 |
--------------------------------------------------------------------------------
/scripts/87.py:
--------------------------------------------------------------------------------
1 | # Number of airports with existing construction related activity
2 | import requests
3 | import re
4 | resp = requests.get('https://nfdc.faa.gov/xwiki/bin/view/NFDC/Construction+Notices')
5 | # obviously not something you do in an actual scraping solution but it gets the answer!
6 | print(len(re.findall(r"Construction\+Notices/.+?\.pdf", resp.text)))
7 |
--------------------------------------------------------------------------------
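Since 87.py admits that regexing raw HTML isn't a real scraping solution, here's a sketch of the same count done with lxml, assuming the notices are ordinary `<a>` links to PDFs under `Construction+Notices/` (which is what the regex keys on):

```python
# Sketch: count the construction-notice PDF links with an HTML parser
# instead of a regex over the raw page source.
import requests
from lxml import html

url = 'https://nfdc.faa.gov/xwiki/bin/view/NFDC/Construction+Notices'
doc = html.fromstring(requests.get(url).text)
hrefs = doc.xpath("//a[contains(@href, 'Construction+Notices/') and contains(@href, '.pdf')]/@href")
print(len(hrefs))
```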
/scripts/88.py:
--------------------------------------------------------------------------------
1 | # The number of posts on TSA's Instagram account
2 | from urllib.parse import urljoin
3 | import os
4 | import requests
5 | DOMAIN = 'https://api.instagram.com/'
6 | USERNAME = 'tsa'
7 | # note: I've specified INSTAGRAM_TOKEN in my ~/.bash_profile
8 | atts = {'access_token': os.environ.get('INSTAGRAM_TOKEN')}
9 | # unless you know TSA's Instagram ID by memory, you'll
10 | # have to hit up the search endpoint to get it
11 | # docs: http://instagram.com/developer/endpoints/users/#get_users_search
12 | search_path = '/v1/users/search'
13 | searchatts = atts.copy()
14 | searchatts['q'] = USERNAME
15 | search_results = requests.get(urljoin(DOMAIN, search_path), params = searchatts).json()
16 | uid = search_results['data'][0]['id']
17 |
18 | # now we can retrieve profile information
19 | # https://instagram.com/developer/endpoints/users/#get_users
20 | user_path = '/v1/users/%s/' % uid
21 | profile = requests.get(urljoin(DOMAIN, user_path), params = atts).json()
22 | print(profile['data']['counts']['media'])
23 |
24 |
25 |
--------------------------------------------------------------------------------
/scripts/89.py:
--------------------------------------------------------------------------------
1 | # In fiscal year 2013, the short description of the most frequently cited type of FDA's inspectional observations related to food products.
2 | from collections import Counter
3 | from lxml import html
4 | from urllib.parse import urljoin
5 | from xlrd import open_workbook
6 | import requests
7 | import tempfile
8 | LANDING_PAGE_URL = 'http://www.fda.gov/ICECI/Inspections/ucm250720.htm'
9 | # The hardcoded URL for the Excel file is:
10 | # http://www.fda.gov/downloads/ICECI/Inspections/UCM381532.xls
11 | # But we'll programmatically find it
12 | doc = html.fromstring(requests.get(LANDING_PAGE_URL).text)
13 | # HTML looks like:
14 | # <a href="...UCM381532.xls">
15 | #   <linktitle>FY 2013 Excel File (XLS - 691KB)</linktitle>
16 | # </a>
17 |
18 | # i love xpath
19 | hrefs = doc.xpath("//a[linktitle[contains(text(), '2013')] and contains(@href, 'xls')]//@href")
20 | url = urljoin(LANDING_PAGE_URL, hrefs[0])
21 | # eh just make a temp file
22 | t = tempfile.TemporaryFile()
23 | t.write(requests.get(url).content)
24 | t.seek(0)
25 | wb = open_workbook(file_contents=t.read())
26 | # Each category has its own name, we need to find "Foods"
27 | sheet = wb.sheet_by_name('Foods')
28 | # find the column that contains "Short Description"
29 | col_idx = next(idx for idx, txt in enumerate(sheet.row_values(0)) if "Short Description" == txt)
30 | c = Counter(sheet.row_values(r)[col_idx] for r in range(1, sheet.nrows))  # skip the header row
31 | print(""""%s" for %s observations""" % c.most_common(1)[0])
32 |
33 |
--------------------------------------------------------------------------------
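One small simplification to 89.py: the temp file isn't strictly needed, since `xlrd.open_workbook` accepts raw bytes via `file_contents` (the same keyword the script already uses). A sketch:

```python
# Sketch: pass the downloaded spreadsheet straight to xlrd -- no temp file needed.
# `url` is the Excel URL resolved in 89.py above.
from xlrd import open_workbook
import requests

wb = open_workbook(file_contents=requests.get(url).content)
```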
/scripts/9.py:
--------------------------------------------------------------------------------
1 | # The number of roll call votes that were rejected by a margin of less than 5 votes, in the first session of the U.S. Senate in the 114th Congress
2 | # Note: this example shows how to scrape the Senate webpage, which is
3 | # the WRONG thing to do in practice. Use the XML instead:
4 | # http://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_114_1.xml
5 | # via https://twitter.com/octonion/status/611296541941321731
6 | from lxml import html
7 | import requests
8 | import re
9 | congress_num = 114
10 | session_num = 1
11 | url = ('http://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_%s_%s.htm'
12 | % (congress_num, session_num))
13 |
14 | doc = html.fromstring(requests.get(url).text)
15 | # unnecessarily convoluted xpath statement, which I do here
16 | # just so I can practice xpath statements
17 | # http://stackoverflow.com/questions/1457638/xpath-get-nodes-where-child-node-contains-an-attribute
18 | xstr = "//*[@id='contentArea']//table/tr[td[2][contains(text(), 'Rejected')]]"
19 | # i.e. find all tr elements that have a 2nd td child with text that contains "Rejected"
20 | xcount = 0
21 | for r in doc.xpath(xstr):
22 | yeas, nays = re.search(r'(\d+)-(\d+)', r.find('td').text_content()).groups()
23 | if (int(nays) - int(yeas) < 5):
24 | xcount += 1
25 |
26 | print(xcount)
27 |
--------------------------------------------------------------------------------
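As the comments in 9.py say, the XML feed is the right source. A sketch of that route follows; the element names (`vote`, `result`, `vote_tally`, `yeas`, `nays`) are my assumptions about the feed's schema, so check them against the actual XML before trusting the count:

```python
# Sketch of the XML approach recommended in 9.py's comments.
# Element names below are assumptions -- verify against the actual feed.
import requests
from lxml import etree

xml_url = 'http://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_114_1.xml'
doc = etree.fromstring(requests.get(xml_url).content)
count = 0
for vote in doc.iter('vote'):
    result = vote.findtext('result', default='')
    tally = vote.find('vote_tally')
    if 'Rejected' in result and tally is not None:
        yeas, nays = int(tally.findtext('yeas')), int(tally.findtext('nays'))
        if nays - yeas < 5:
            count += 1

print(count)
```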
/scripts/90.py:
--------------------------------------------------------------------------------
1 | # The currently serving U.S. congressmember with the most Twitter followers
2 | from math import ceil
3 | import csv
4 | import json
5 | import os
6 | import requests
7 | import tweepy
8 | # You need to have a Twitter account and register as a developer:
9 | # http://www.compjour.org/tutorials/getting-started-with-tweepy/
10 | # Your credentials JSON file should look like this:
11 | # {
12 | # "access_token": "AAAA",
13 | # "access_token_secret": "BBBB",
14 | # "consumer_secret": "CCCC",
15 | # "consumer_key": "DDDDD"
16 | # }
17 | # Twitter helper methods
18 | DEFAULT_TWITTER_CREDS_PATH = '~/.creds/me.json' # put your own path here
19 | def get_api(credsfile = DEFAULT_TWITTER_CREDS_PATH):
20 | """
21 | Takes care of the Twitter OAuth authentication process and
22 | creates an API-handler to execute commands on Twitter
23 |
24 | Arguments:
25 | - credsfile (str): the full path of the filename that contains a JSON
26 | file with credentials for Twitter
27 |
28 | Returns:
29 | A tweepy.api.API object
30 |
31 | """
32 | fn = os.path.expanduser(credsfile) # get the full path in case the ~ is used
33 | c = json.load(open(fn))
34 | # Get authentication token
35 | auth = tweepy.OAuthHandler(consumer_key = c['consumer_key'],
36 | consumer_secret = c['consumer_secret'])
37 | auth.set_access_token(c['access_token'], c['access_token_secret'])
38 | # create an API handler
39 | return tweepy.API(auth)
40 |
41 | # gets a whole bunch of profile information from a batch of screen_names
42 | BATCH_SIZE = 100
43 | def get_profiles_from_screen_names(snames):
44 | api = get_api()
45 | profiles = []
46 | for i in range(ceil(len(snames) / BATCH_SIZE)):
47 | s = i * BATCH_SIZE
48 | bnames = snames[s:(s + BATCH_SIZE)]
49 | for user in api.lookup_users(screen_names = bnames):
50 | profiles.append(user._json)
51 | return profiles
52 | # Step 1.
53 | # Basically, you have to rejigger 18.py:
54 | # (The number of U.S. congressmembers who have Twitter accounts, according to Sunlight Foundation data)
55 | # info https://sunlightlabs.github.io/congress/#legislator-spreadsheet
56 | csvurl = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv'
57 | rows = csv.DictReader(requests.get(csvurl).text.splitlines())
58 | # note that the spreadsheet includes non-sitting legislators, thus the use
59 | # of 'in_office' attribute to filter
60 | legislators = [r for r in rows if r['twitter_id'] and r['in_office'] == '1']
61 | # now call twitter
62 | twitter_profiles = get_profiles_from_screen_names([x['twitter_id'] for x in legislators])
63 | # match up legislators with profiles:
64 | for lx in legislators:
65 | ta = [t for t in twitter_profiles if lx['twitter_id'].lower() == t['screen_name'].lower()]
66 | lx['twitter_profile'] = ta[0] if ta else None
67 |
68 | def fooey(x):
69 | t = x['twitter_profile']
70 | return t['followers_count'] if t else 0
71 |
72 | q = max(legislators, key = fooey)
73 | print(q['title'], q['firstname'], q['middlename'], q['lastname'], q['state'])
74 | # Sen John S. McCain AZ
75 |
76 |
77 |
--------------------------------------------------------------------------------
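The batching in 90.py exists because Twitter's `users/lookup` endpoint accepts at most 100 screen names per call, hence `BATCH_SIZE = 100`. Pulled out on its own, the slicing pattern is just this:

```python
# Sketch: the slice-into-batches-of-100 pattern used in 90.py.
from math import ceil

def batches(items, size=100):
    for i in range(ceil(len(items) / size)):
        yield items[i * size:(i + 1) * size]

# e.g. [len(b) for b in batches(list(range(250)))] -> [100, 100, 50]
```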
/scripts/91.py:
--------------------------------------------------------------------------------
1 | # Number of stop-and-frisk reports from the NYPD in 2014
2 | from shutil import unpack_archive
3 | import csv
4 | import os
5 | import requests
6 | DATADIR = '/tmp/nypd'
7 | os.makedirs(DATADIR, exist_ok = True)
8 | zipurl = 'http://www.nyc.gov/html/nypd/downloads/zip/analysis_and_planning/2014_sqf_csv.zip'
9 | zname = os.path.join(DATADIR, os.path.basename(zipurl))
10 | cname = os.path.join(DATADIR, '2014.csv')
11 | if not os.path.exists(zname):
12 | print("Downloading", zipurl, 'to', zname)
13 | z = requests.get(zipurl).content
14 | with open(zname, 'wb') as f:
15 | f.write(z)
16 | # unzip it
17 | unpack_archive(zname, DATADIR)
18 |
19 | data = list(csv.DictReader(open(cname, encoding = 'latin-1')))
20 | print(len(data))
21 |
22 |
23 |
--------------------------------------------------------------------------------
/scripts/92.py:
--------------------------------------------------------------------------------
1 | # In 2012-Q4, the total amount paid by Rep. Aaron Schock to Lobair LLC, according to Congressional spending records, as compiled by the Sunlight Foundation
2 | # real-life reference: http://www.usatoday.com/story/news/politics/2015/02/19/schock-flights-charter-house-rules/23663247/
3 | import csv
4 | import requests
5 | DATA_URL = 'http://assets.sunlightfoundation.com.s3.amazonaws.com/expenditures/house/2012Q4-detail.csv'
6 | SCHOCK_ID = 'S001179' # http://bioguide.congress.gov/scripts/biodisplay.pl?index=s001179
7 | print("Downloading", DATA_URL)
8 | resp = requests.get(DATA_URL)
9 | totalamt = 0
10 | for row in csv.DictReader(resp.text.splitlines()):
11 | if row['BIOGUIDE_ID'] == SCHOCK_ID and 'LOBAIR LLC' in row['PAYEE'].upper():
12 | totalamt += float(row['AMOUNT'])
13 | print(totalamt)
14 | # 880.0
15 |
--------------------------------------------------------------------------------
/scripts/93.py:
--------------------------------------------------------------------------------
1 | # Number of public Github repositories maintained by the GSA's 18F organization, as listed on Github.com
2 | import requests
3 | url = 'https://api.github.com/orgs/18F'
4 | data = requests.get(url).json()
5 | print(data['public_repos'])
6 |
--------------------------------------------------------------------------------
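93.py works unauthenticated, but anonymous GitHub API calls are rate-limited (60 requests per hour at the time of writing). If you hit that limit, a personal access token helps; `GITHUB_TOKEN` below is a hypothetical environment variable you'd set yourself:

```python
# Sketch: the same lookup with an API token to avoid the anonymous rate limit.
# GITHUB_TOKEN is a hypothetical env var holding a personal access token.
import os
import requests

headers = {'Authorization': 'token %s' % os.environ['GITHUB_TOKEN']}
resp = requests.get('https://api.github.com/orgs/18F', headers=headers)
print(resp.json()['public_repos'])
```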
/scripts/94.py:
--------------------------------------------------------------------------------
1 | # The New York City high school with the highest average math score in the latest SAT results
2 |
3 | # Notes:
4 | # As this is one of the last exercises that I've written out, it includes code
5 | # that is both lazy and convoluted. For example, this is the first time
6 | # I've tried openpyxl as opposed to xlrd for reading Excel files
7 | # and it shows: http://openpyxl.readthedocs.org/en/latest/index.html
8 | #
9 | # You can check out other scraping/spreadsheet parsing examples
10 | # in the repo to find cleaner ways of doing this kind of task.
11 | #
12 | #
13 | #
14 | # Landing page:
15 | # http://schools.nyc.gov/Accountability/data/TestResults/default.htm
16 | #
17 | ## Relevant text from the webpage:
18 | # The most recent school level results for New York City on the SAT.
19 | # Results are available at the school level for the graduating
20 | # seniors of 2014. For a summary report of SAT, PSAT, and AP achievement
21 | # for 2014, please click here.
22 | #
23 | ## Target URL looks like this:
24 | # http://schools.nyc.gov/NR/rdonlyres/CE9139F0-9F3A-4C42-ACB8-74F2D014802F/
25 | # 171380/2014SATWebsite10214.xlsx
26 | #
27 | # It's just as likely that by next year, they'll redesign or restructure
28 | # the site. So this scraping code is unstable. But it works as of August 2015.
29 | import csv
30 | import requests
31 | from io import BytesIO
32 | from operator import itemgetter
33 | from urllib.parse import urljoin
34 | from lxml import html
35 | from openpyxl import load_workbook
36 | LANDING_PAGE_URL = 'http://schools.nyc.gov/Accountability/data/TestResults/default.htm'
37 |
38 | doc = html.fromstring(requests.get(LANDING_PAGE_URL).text)
39 | # instead of using xpath, let's just use a sloppy cssselect
40 | urls = [a.attrib.get('href') for a in doc.cssselect('a')]
41 | # that awkward `get` is because not all anchor tags have hrefs...and
42 | # this is why we use xpath...
43 | _xurl = next(url for url in urls if url and 'SAT' in url and 'xls' in url) # blargh
44 | xlsx_url = urljoin(LANDING_PAGE_URL, _xurl)
45 | print("Downloading", xlsx_url)
46 | # download the spreadsheet...instead of writing to disk
47 | # let's just keep it in memory and pass it directly to load_workbook()
48 | xlsx = BytesIO(requests.get(xlsx_url).content)
49 | wb = load_workbook(xlsx)
50 | # The above command will print out a warning:
51 | # /site-packages/openpyxl/workbook/names/named_range.py:121: UserWarning:
52 | # Discarded range with reserved name
53 | # warnings.warn("Discarded range with reserved name")
54 |
55 | ### Dealing with the worksheet structure
56 | # The 2014 edition contains two worksheets, the first being "Notes"
57 | # and the second being "2014 SAT Results"
58 | # Let's write an agnostic function as if we didn't know how each year's
59 | # spreadsheet was actually structured
60 | sheet = next(s for s in wb.worksheets if "results" in s.title.lower())
61 | # I don't understand openpyxl's API so I'm just going to
62 | # practice nested list comprehensions
63 | # Note that the first column is just an ID field which we don't care about
64 | rows = [[cell.value for cell in row[1:]] for row in sheet.iter_rows()]
65 | headers = rows[0]
66 | # make it into a list of dicts
67 | data = [dict(zip(headers, r)) for r in rows[1:]]
68 | # I think we can assume that the header will change every year/file
69 | # so let's write another agnostic iterating function to do a best guess
70 | mathheader = next(h for h in headers if 'math' in h.lower())
71 | # Not every school has a number for this column
72 | mathschools = [d for d in data if isinstance(d[mathheader], int)]
73 | topschool = max(mathschools, key = itemgetter(mathheader))
74 | # since we've done so much work to get here,
75 | # let's also calculate the average of the averages -- which requires
76 | # weighting the math score averages by the number of SAT takers
77 | # and include that in the printed answer
78 |
79 | # find the header that says '# of SAT Takers in 20XX':
80 | numheader = next(h for h in headers if 'takers' in h.lower())
81 | total_takers = sum(s[numheader] for s in mathschools)
82 | mathsums = sum(s[mathheader] * s[numheader] for s in mathschools)
83 | mathavg = mathsums // total_takers
84 | tmp_answer = """{name} had the highest average SAT math score: {top_score}
85 | This was {diff_score} points higher than the city average of {avg_score}
86 | """
87 | answer = tmp_answer.format(name = topschool['High School'],
88 | top_score = topschool[mathheader],
89 | diff_score = topschool[mathheader] - mathavg,
90 | avg_score = mathavg
91 | )
92 |
93 | print(answer)
94 | # Output for 2014:
95 | # STUYVESANT HIGH SCHOOL had the highest average SAT math score: 737
96 | # This was 272 points higher than the city average of 465
97 |
--------------------------------------------------------------------------------
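The city-wide figure in 94.py is a weighted average: each school's math average is weighted by its number of SAT takers before dividing by the total takers. On toy numbers, the calculation looks like this:

```python
# Sketch: the taker-weighted average from 94.py, on made-up numbers.
schools = [
    {'takers': 100, 'math_avg': 500},
    {'takers': 300, 'math_avg': 400},
]
total_takers = sum(s['takers'] for s in schools)
weighted_avg = sum(s['math_avg'] * s['takers'] for s in schools) // total_takers
print(weighted_avg)  # 425, not the unweighted 450
```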
/scripts/95.py:
--------------------------------------------------------------------------------
1 | # Since 2002, the most commonly occurring winning number in New York's Lottery Mega Millions
2 | from collections import Counter
3 | import requests
4 | c = Counter()
5 | data = requests.get('https://data.ny.gov/resource/5xaw-6ayf.json', params = {'$limit': 5000}).json()  # Socrata caps results at 1,000 rows by default
6 | for d in data:
7 | c.update(d['winning_numbers'].split(' '))
8 |
9 | print(c.most_common()[0][0])
10 |
--------------------------------------------------------------------------------
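Related to the `$limit` note in 95.py: Socrata endpoints cap responses at 1,000 rows by default, so another option is to page through with `$limit`/`$offset` rather than requesting one big batch. A sketch:

```python
# Sketch: paging through the Socrata endpoint from 95.py with $limit/$offset.
from collections import Counter
import requests

url = 'https://data.ny.gov/resource/5xaw-6ayf.json'
c = Counter()
offset = 0
while True:
    batch = requests.get(url, params={'$limit': 1000, '$offset': offset}).json()
    if not batch:
        break
    for d in batch:
        c.update(d['winning_numbers'].split(' '))
    offset += 1000

print(c.most_common(1)[0][0])
```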
/scripts/96.py:
--------------------------------------------------------------------------------
1 | # The number of scheduled arguments according to the most recent U.S. Supreme Court argument calendar
2 | from lxml import html
3 | from urllib.parse import urljoin
4 | import requests
5 | url = 'http://www.supremecourt.gov/oral_arguments/argument_calendars.aspx'
6 | index = html.fromstring(requests.get(url).text)
7 | # calendar is sorted chronologically, with latest in the last link
8 | href = index.xpath('//a[contains(text(), "HTML")]/@href')[-1]
9 | cal = html.fromstring(requests.get(urljoin(url, href)).text)
10 | pdfs = cal.xpath("//table//a[contains(@href, 'qp.pdf')]/@href")
11 | print(len(pdfs))
12 |
--------------------------------------------------------------------------------
/scripts/97.py:
--------------------------------------------------------------------------------
1 | # The New York school with the highest rate of religious exemptions to vaccinations
2 | import requests
3 | url = 'https://health.data.ny.gov/resource/5pme-xbs5.json'
4 | data = requests.get(url).json()
5 |
6 | def foo(d):
7 | return float(d['percentreligiousexemptions'])
8 |
9 | school = max([r for r in data if '2014' in r['report_period']], key = foo)
10 | print(school['schoolname'])
11 |
--------------------------------------------------------------------------------
/scripts/98.py:
--------------------------------------------------------------------------------
1 | # The latest estimated population percent change for Detroit, MI, according to the latest Census QuickFacts summary.
2 | import requests
3 | from lxml import html
4 | url = 'http://quickfacts.census.gov/qfd/states/26/2622000.html'
5 | doc = html.fromstring(requests.get(url).text)
6 | # this is sloppy but quick
7 | col = doc.xpath('//td[contains(text(), "Population, percent change")]/following-sibling::td')[0]
8 | print(col.text_content())
9 |
--------------------------------------------------------------------------------
/scripts/99.py:
--------------------------------------------------------------------------------
1 | # According to the Medill National Security Zone, the number of chambered guns confiscated at airports by the TSA in 2014
2 | # http://nationalsecurityzone.org
3 | import csv
4 | import requests
5 | gdoc_url = 'https://docs.google.com/spreadsheets/d/1a65n2HIcBYG7VyZYfVnBXGDmEdR8NYSOF43dzkDIuwA/'
6 | txt = requests.get(gdoc_url + 'export', params = {'format': 'csv', 'gid': 0}).text
7 | # skip the first two lines, which are:
8 | # Mandatory credit, with link: TSA data compiled by Medill National Security Journalism Initiative.
9 | # Data is preliminary and extracted from the TSA Blog. TSA'S year-end totals may very slightly.
10 | rows = list(csv.DictReader(txt.splitlines()[2:]))
11 | print(len([r for r in rows if r['CHAMBERED?'] == 'Y']))
12 |
--------------------------------------------------------------------------------