├── README.md ├── generate_readme.py ├── scratchpad ├── high_city_ca_pay.py └── more_scotus_laughs.py └── scripts ├── 1.py ├── 10.py ├── 100.py ├── 101.py ├── 11.py ├── 12.py ├── 13.py ├── 14.py ├── 15.py ├── 16.py ├── 17.py ├── 18.py ├── 19.py ├── 2.py ├── 20.py ├── 21.py ├── 22.py ├── 23.py ├── 24.py ├── 25.py ├── 26.py ├── 27.py ├── 28.py ├── 29.py ├── 3.py ├── 30.py ├── 31.py ├── 32.py ├── 33.py ├── 34.py ├── 35.py ├── 36.py ├── 37.py ├── 38.py ├── 39.py ├── 4.py ├── 40.py ├── 41.py ├── 42.py ├── 43.py ├── 44.py ├── 45.py ├── 46.py ├── 47.py ├── 48.py ├── 49.py ├── 5.py ├── 50.py ├── 51.py ├── 52.py ├── 53.py ├── 54.py ├── 55.py ├── 56.py ├── 57.py ├── 58.py ├── 59.py ├── 6.py ├── 60.py ├── 61.py ├── 62.py ├── 63.py ├── 64.py ├── 65.py ├── 66.py ├── 68.py ├── 69.py ├── 7.py ├── 70.py ├── 71.py ├── 72.py ├── 73.py ├── 74.py ├── 75.py ├── 76.py ├── 77.py ├── 78.py ├── 79.py ├── 8.py ├── 80.py ├── 81.py ├── 82.py ├── 83.py ├── 84.py ├── 85.py ├── 86.py ├── 87.py ├── 88.py ├── 89.py ├── 9.py ├── 90.py ├── 91.py ├── 92.py ├── 93.py ├── 94.py ├── 95.py ├── 96.py ├── 97.py ├── 98.py └── 99.py /README.md: -------------------------------------------------------------------------------- 1 | ## Search-Script-Scrape: 101 webscraping and research tasks for the data journalist 2 | 3 | __Note:__ This exercise set is part of the [Stanford Computational Journalism Lab](http://cjlab.stanford.edu). I've also written [a blog post that gives a little more elaboration about the libraries used and a few of the exercises](http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/). 4 | 5 | ------------- 6 | 7 | This repository contains [101 Web data-collection tasks](#the-tasks) in Python 3 that I assigned to my [Computational Journalism class in Spring 2015](http://www.compjour.org) to give them regular exercise in programming and conducting research, and to expose them to the variety of data published online. 8 | 9 | The hard part of many of these tasks is researching and finding the actual data source. The scripts need only concern themselves with fetching the data and printing the answer in the least painful way possible. Since the [Computational Journalism class](http://www.compjour.org) wasn't intended to be an actual programming class, adherence to idioms and best code practices was not emphasized...(especially since I'm new to Python myself!) 10 | 11 | Some examples of the tasks: 12 | 13 | - [The California city whose city manager has the highest total wage per capita in 2012](https://github.com/compjour/search-script-scrape/blob/master/scripts/100.py) ([expanded version](scratchpad/high_city_ca_pay.py)) 14 | - [In the most recently transcribed Supreme Court argument, the number of times laughter broke out](https://github.com/compjour/search-script-scrape/blob/master/scripts/50.py) ([expanded version](scratchpad/more_scotus_laughs.py)) 15 | - [Number of days until Texas's next scheduled execution](scripts/29.py) 16 | - [The U.S. congressmember with the most Twitter followers](https://github.com/compjour/search-script-scrape/blob/master/scripts/90.py) 17 | - [The number of people who visited a U.S. government website using Internet Explorer 6.0 in the last 90 days](https://github.com/compjour/search-script-scrape/blob/master/scripts/3.py) 18 | 19 | ## Repo status 20 | 21 | 22 | The table below links to the available scripts. If there's not a link, it means I haven't committed the code. 
For some of them, I had to rethink a less verbose solution (or the target changed, as the Internet sometimes does), and now this repo has taken a backseat to many other data projects on my list. `¯\_(ツ)_/¯` 23 | 24 | Note: A lot of the code is not best practice. The tasks are a little repetitive, so I got bored and [ignored PEP8](https://www.python.org/dev/peps/pep-0008/) and/or tried new libraries/conventions for fun. 25 | 26 | 27 | __Note:__ The "__related URL__" links to either the official source of the data, or at least a page with some background information. The second column of this table refers to the __line count__ of the script, __not__ the answer to the prompt. 28 | 29 | ## The tasks 30 | 31 | 32 | The repo currently contains scripts for __100__ of __101__ tasks: 33 | 34 | | Title | Line count | 35 | |-------------------------|-------------| 36 | | 1. Number of datasets currently listed on data.gov
[related URL] [script] | 7 lines | 37 | | 2. The name of the most recently added dataset on data.gov
[related URL] [script] | 7 lines | 38 | | 3. The number of people who visited a U.S. government website using Internet Explorer 6.0 in the last 90 days
[related URL] [script] | 4 lines | 39 | | 4. The number of librarian-related job positions that the federal government is currently hiring for
[related URL] [script] | 6 lines | 40 | | 5. The name of the company cited in the most recent consumer complaint involving student loans
[related URL] [script] | 27 lines | 41 | | 6. From 2010 to 2013, the change in median cost of health, dental, and vision coverage for California city employees
[related URL] [script] | 38 lines | 42 | | 7. The number of listed federal executive agency internet domains
[related URL] [script] | 8 lines | 43 | | 8. The number of times when a New York heart surgeon's rate of patient deaths for all cardiac surgical procedures was "significantly higher" than the statewide rate, according to New York state's analysis.
[related URL] [script] | 7 lines | 44 | | 9. The number of roll call votes that were rejected by a margin of less than 5 votes, in the first session of the U.S. Senate in the 114th Congress
[related URL] [script] | 26 lines | 45 | | 10. The title of the highest paid California city government position in 2010
[related URL] [script] | 35 lines | 46 | | 11. How much did the state of California collect in property taxes, according to the U.S. Census 2013 Annual Survey of State Government Tax Collections?
[related URL] [script] | 23 lines | 47 | | 12. In 2010, the year-over-year change in enplanements at America's busiest airport
[related URL] [script] | 51 lines | 48 | | 13. The number of armored carrier bank robberies recorded by the FBI in 2014
[related URL] [script] | 15 lines | 49 | | 14. The number of workplace fatalities as reported to the federal and state OSHA in the latest fiscal year
[related URL] [script] | 14 lines | 50 | | 15. Total number of wildlife strike incidents reported at San Francisco International Airport
[related URL] [script] | 48 lines | 51 | | 16. The non-profit organization with the highest total revenue, according to the latest listing in ProPublica's Nonprofit Explorer
[related URL] [script] | 11 lines | 52 | | 17. In the "Justice News" RSS feed maintained by the Justice Department, the number of items published on a Friday
[related URL] [script] | 11 lines | 53 | | 18. The number of U.S. congressmembers who have Twitter accounts, according to Sunlight Foundation data
[related URL] [script] | 9 lines | 54 | | 19. The total number of preliminary reports on aircraft safety incidents/accidents in the last 10 business days
[related URL] [script] | 12 lines | 55 | | 20. The number of OSHA enforcement inspections involving Wal-Mart in California since 2014
[related URL] [script] | 25 lines | 56 | | 21. The current humidity level at Great Smoky Mountains National Park
[related URL] [script] | 6 lines | 57 | | 22. The names of the committees that Sen. Barbara Boxer currently serves on
[related URL] [script] | 7 lines | 58 | | 23. The name of the California school with the highest number of girls enrolled in kindergarten, according to the CA Dept. of Education's latest enrollment data file.
[related URL] [script] | 21 lines | 59 | | 24. Percentage of NYPD stop-and-frisk reports in which the suspect was white in 2014
[related URL] [script] | 24 lines | 60 | | 25. Average frontal crash star rating for 2015 Honda Accords
[related URL] [script] | 14 lines | 61 | | 26. The dropout rate for all of Santa Clara County high schools, according to the latest cohort data in CALPADS
[related URL] [script] | 48 lines | 62 | | 27. The number of Class I Drug Recalls issued by the U.S. Food and Drug Administration since 2012
[related URL] [script] | 14 lines | 63 | | 28. Total number of clinical trials as recorded by the National Institutes of Health
[related URL] [script] | 7 lines | 64 | | 29. Number of days until Texas's next scheduled execution
[related URL] [script] | 24 lines | 65 | | 30. The total number of inmates executed by Florida since 1976
[related URL] [script] | 10 lines | 66 | | 31. The number of proposed U.S. federal regulations in which comments are due within the next 3 days
[related URL] [script] | 29 lines | 67 | | 32. Number of Titles that have changed in the United States Code since its last release point
[related URL] [script] | 6 lines | 68 | | 33. The number of FDA-approved, but now discontinued drug products that contain Fentanyl as an active ingredient
[related URL] [script] | 14 lines | 69 | | 34. In the latest FDA Weekly Enforcement Report, the number of Class I and Class II recalls involving food
[related URL] [script] | 10 lines | 70 | | 35. Most viewed data set on New York state's open data portal as of this month
[related URL] [script] | 9 lines | 71 | | 36. Total number of visitors to the White House in 2012
[related URL] [script] | 27 lines | 72 | | 37. The last time the CIA's Leadership page has been updated
[related URL] [script] | 6 lines | 73 | | 38. The domain of the most visited U.S. government website right now
[related URL] [script] | 5 lines | 74 | | 39. Number of medical device recalls issued by the U.S. Food and Drug Administration in 2013
[related URL] [script] | 6 lines | 75 | | 40. Number of FOIA requests made to the Chicago Public Library
[related URL] [script] | 6 lines | 76 | | 41. The number of currently open medical trials involving alcohol-related disorders
[related URL] [script] | 5 lines | 77 | | 42. The name of the Supreme Court justice who delivered the opinion in the most recently announced decision
[related URL] [script] | 31 lines | 78 | | 43. The number of citations that resulted from FDA inspections in fiscal year 2012
[related URL] [script] | 10 lines | 79 | | 44. Number of people visiting a U.S. government website right now
[related URL] [script] | 6 lines | 80 | | 45. The number of security alerts issued by US-CERT in the current year
[related URL] [script] | 6 lines | 81 | | 46. The number of Pinterest accounts maintained by U.S. State Department embassies and missions
[related URL] [script] | 13 lines | 82 | | 47. The number of international travel alerts from the U.S. State Department currently in effect
[related URL] [script] | 7 lines | 83 | | 48. The difference in total White House staffmember salaries in 2014 versus 2010
[related URL] [script] | 19 lines | 84 | | 49. Number of sponsored bills by Rep. Nancy Pelosi that were vetoed by the President
[related URL] [script] | 11 lines | 85 | | 50. In the most recently transcribed Supreme Court argument, the number of times laughter broke out
[related URL] [script] | 22 lines | 86 | | 51. The title of the most recent decision handed down by the U.S. Supreme Court
[related URL] [script] | 6 lines | 87 | | 52. The average wage of optometrists according to the BLS's most recent National Occupational Employment and Wage Estimates report
[related URL] [script] | 8 lines | 88 | | 53. The total number of on-campus hate crimes as reported to the U.S. Office of Postsecondary Education, in the most recent collection year
[related URL] [script] | 45 lines | 89 | | 54. The number of people on FBI's Most Wanted List for white collar crimes
[related URL] [script] | 6 lines | 90 | | 55. The number of Government Accountability Office reports and testimonies on the topic of veterans
[related URL] [script] | 10 lines | 91 | | 56. Number of times Rep. Darrell Issa's remarks have made it onto the Congressional Record
[related URL] [script] | 9 lines | 92 | | 57. The top 3 auto manufacturers, ranked by total number of recalls via NHTSA safety-related defect and compliance campaigns since 1967.
[related URL] [script] | 24 lines | 93 | | 58. The number of published research papers from the NSA
[related URL] [script] | 6 lines | 94 | | 59. The number of university-related datasets currently listed at data.gov
[related URL] [script] | 7 lines | 95 | | 60. Number of chapters in Title 20 (Education) of the United States Code
[related URL] [script] | 15 lines | 96 | | 61. The number of miles traveled by the current U.S. Secretary of State
[related URL] [script] | 6 lines | 97 | | 62. For all of 2013, the number of potential signals of serious risks or new safety information that resulted from the FDA's FAERS
[related URL] [script] | 14 lines | 98 | | 63. In the current dataset behind Medicare's Nursing Home Compare website, the total amount of fines received by penalized nursing homes
[related URL] [script] | 35 lines | 99 | | 64. From March 1 to 7, 2015, the number of times in which designated FDA policy makers met with persons outside the U.S. federal executive branch
[related URL] [script] | 5 lines | 100 | | 65. The number of failed votes in the roll calls 1 through 99, in the U.S. House of the 114th Congress
[related URL] [script] | 12 lines | 101 | | 66. The highest minimum wage as mandated by state law.
[related URL] [script] | 28 lines | 102 | | 67. For the most recently posted TSA.gov customer satisfaction survey, post the percentage of respondents who rated their "overall experience today" as "Excellent"
[related URL] | | 103 | | 68. Number of FDA-approved prescription drugs with GlaxoSmithKline as the applicant holder
[related URL] [script] | 11 lines | 104 | | 69. The average number of comments on the last 50 posts on NASA's official Instagram account
[related URL] [script] | 40 lines | 105 | | 70. The highest salary possible for a White House staffmember in 2014
[related URL] [script] | 10 lines | 106 | | 71. The percent increase in number of babies named Archer nationwide in 2010 compared to 2000, according to the Social Security Administration
[related URL] [script] | 32 lines | 107 | | 72. The number of magnitude 4.5+ earthquakes detected worldwide by the USGS
[related URL] [script] | 8 lines | 108 | | 73. The total amount of contributions made by lobbyists to Congress according to the latest downloadable quarterly report
[related URL] [script] | 34 lines | 109 | | 74. The description of the bill most recently signed into law by the governor of Georgia
[related URL] [script] | 12 lines | 110 | | 75. Total number of officer-involved shooting incidents listed by the Philadelphia Police Department
[related URL] [script] | 9 lines | 111 | | 76. The total number of publications produced by the U.S. Government Accountability Office
[related URL] [script] | 9 lines | 112 | | 77. Number of Dallas officer-involved fatal shooting incidents in 2014
[related URL] [script] | 7 lines | 113 | | 78. Number of Cupertino, CA restaurants that have been shut down due to health violations in the last six months.
[related URL] [script] | 6 lines | 114 | | 79. The change in total airline revenues from baggage fees, from 2013 to 2014
[related URL] [script] | 19 lines | 115 | | 80. The total number of babies named Odin born in Colorado according to the Social Security Administration
[related URL] [script] | 20 lines | 116 | | 81. The latest release date for T-100 Domestic Market (U.S. Carriers) statistics report
[related URL] [script] | 13 lines | 117 | | 82. In the most recent FDA Adverse Events Reports quarterly extract, the number of patient reactions mentioning "Death"
[related URL] [script] | 47 lines | 118 | | 83. The sum of White House staffmember salaries in 2014
[related URL] [script] | 12 lines | 119 | | 84. The total number of notices published on the most recent date to the Federal Register
[related URL] [script] | 6 lines | 120 | | 85. The number of iPhone units sold in the latest quarter, according to Apple Inc's most recent 10-Q report
[related URL] [script] | 49 lines | 121 | | 86. Number of computer vulnerabilities in which IBM was the vendor in the latest Cyber Security Bulletin
[related URL] [script] | 10 lines | 122 | | 87. Number of airports with existing construction related activity
[related URL] [script] | 6 lines | 123 | | 88. The number of posts on TSA's Instagram account
[related URL] [script] | 24 lines | 124 | | 89. In fiscal year 2013, the short description of the most frequently cited type of FDA's inspectional observations related to food products.
[related URL] [script] | 32 lines | 125 | | 90. The currently serving U.S. congressmember with the most Twitter followers
[related URL] [script] | 76 lines | 126 | | 91. Number of stop-and-frisk reports from the NYPD in 2014
[related URL] [script] | 22 lines | 127 | | 92. In 2012-Q4, the total amount paid by Rep. Aaron Schock to Lobair LLC, according to Congressional spending records, as compiled by the Sunlight Foundation
[related URL] [script] | 14 lines | 128 | | 93. Number of Github repositories maintained by the GSA's 18F organization, as listed on Github.com
[related URL] [script] | 5 lines | 129 | | 94. The New York City high school with the highest average math score in the latest SAT results
[related URL] [script] | 96 lines | 130 | | 95. Since 2002, the most commonly occurring winning number in New York's Lottery Mega Millions
[related URL] [script] | 9 lines | 131 | | 96. The number of scheduled arguments according to the most recent U.S. Supreme Court argument calendar
[related URL] [script] | 11 lines | 132 | | 97. The New York school with the highest rate of religious exemptions to vaccinations
[related URL] [script] | 10 lines | 133 | | 98. The latest estimated population percent change for Detroit, MI, according to the latest Census QuickFacts summary.
[related URL] [script] | 8 lines | 134 | | 99. According to the Medill National Security Zone, the number of chambered guns confiscated at airports by the TSA
[related URL] [script] | 11 lines | 135 | | 100. The California city whose city manager earns the most total wage per population of its city in 2012
[related URL] [script] | 23 lines | 136 | | 101. The number of women currently serving in the U.S. Congress, according to Sunlight Foundation data
[related URL] [script] | 8 lines | 137 | 138 | 139 | 140 | ---- 141 | 142 | ## How to run this stuff 143 | 144 | Each task is meant to be a self-contained script: you run it, and it prints the answer I'm looking for. The [scripts](/scripts) in this repo should "just work"...if you have all the dependencies installed that I had while writing them, and the web URLs they target haven't changed...so, basically, these may not work at all. 145 | 146 | To copy the scripts quickly via the command line (by default, a ./search-script-scrape directory will be created): 147 | 148 | $ git clone https://github.com/compjour/search-script-scrape.git 149 | 150 | To run a script: 151 | 152 | $ cd search-script-scrape 153 | $ python3 scripts/1.py 154 | 155 | I leave it to you and Google to figure out how to run Python 3 on your own system. FWIW, I was using the [Python 3.4.3 provided by the Anaconda 2.2.0 installer for OS X](http://continuum.io/downloads#py34). The most common third-party libraries used are [Requests](http://www.python-requests.org/en/latest/) for downloading the files and [lxml for HTML parsing](http://lxml.de/). 156 | 157 | ## Expanding on these scripts 158 | 159 | To reiterate: these scripts are each meant to print out a single answer, and so they don't actually show the full potential of how programming can automate data collection. As you get better at programming and recognizing its patterns, you'll find out how easy it is to abstract what seemed like a narrow task into something much bigger. 160 | 161 | For example, [Script #50](scripts/50.py) prints out the number of times laughter broke out in the _most recently_ transcribed Supreme Court argument. Change two lines and that script will print out the laugh count in _every_ transcribed Supreme Court argument ([demo here](scratchpad/more_scotus_laughs.py)). 162 | 163 | The same kind of small code restructuring can be done to many of the tasks here. And you can also modify the _parameters_: why limit yourself to finding the [highest paid "City Manager" in California](https://github.com/compjour/search-script-scrape/blob/master/scripts/100.py) when you can extend the search to every kind of California employee, across every year of salary data? ([demo here](scratchpad/high_city_ca_pay.py)) 164 | 165 | And of course, in real-world data projects, you aren't typically interested in just printing the answer to your Terminal. You generally want to send the results to a spreadsheet or database, and eventually to a web application (or other kind of publication). That's just a few more lines of programming, too...So while this repo contains a bunch of toy scripts, see if you can think of ways to turn them into bigger data explorations. 166 | 167 | 168 | ## Post-mortem 169 | 170 | The original requirement was that students finish all 100 scripts by the end of the quarter. That didn't quite work out, so I reduced the requirement to 50. It was a bad idea to make this an "oh, just turn it in at the end of the year" assignment, as most people tend to wait until finals week to do such work. 171 | 172 | Most of the tasks are pretty straightforward in terms of the Python programming. The majority of the time is spent figuring out exactly what the hell I'm referring to, so next time I do this, I'll probably provide the URL of the target page rather than having people attempt to divine the Google Path I used to get to the data. 
173 | 174 | - Class instructions for [Computational Journalism: Search-Script-Scrape](http://www.compjour.org/search-script-scrape) 175 | - [List of tasks as a Google Doc](https://docs.google.com/spreadsheets/d/1JbY_-g9MkGH78Rta0PnE6D8rG8T-wdKGsMa3kAC3bDs/edit?usp=sharing) 176 | -------------------------------------------------------------------------------- /generate_readme.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | """ 3 | This script reads from the Google Doc of tasks and generates 4 | the titles of the tasks, linked to the appropriate file, and this 5 | (Markdown) text can be pasted into the README.md file 6 | """ 7 | import csv 8 | import requests 9 | from os.path import exists 10 | GDOC_URL = 'https://docs.google.com/spreadsheets/d/1JbY_-g9MkGH78Rta0PnE6D8rG8T-wdKGsMa3kAC3bDs/export?format=csv&gid=0' 11 | 12 | txt = requests.get(GDOC_URL).text 13 | rows = csv.DictReader(txt.splitlines()) 14 | 15 | 16 | 17 | done_count = 0; 18 | tasks = [] 19 | for row in sorted(rows, key = lambda r: int(r['Problem No.'])): 20 | task = {'num': row['Problem No.'], 'url': row['Related URL'], 21 | 'title': row['Title'], 'lines': ""} 22 | task['link'] = "{title}
[related URL]".format( 23 | num=task['num'], title=task['title'], url=task['url'] 24 | ) 25 | task['path'] = "scripts/%s.py" % task['num'] 26 | if exists(task['path']): 27 | lx = len(open(task['path'], encoding = 'utf-8').readlines()) 28 | if lx > 3: 29 | task['lines'] = "%s lines" % lx 30 | task['link'] += " [script]" % (task['path']) 31 | done_count += 1 32 | 33 | tasks.append(task) 34 | 35 | 36 | 37 | ############# 38 | # store the text to be added to file 39 | tasklines = [] 40 | tasklines.append("The repo currently contains scripts for __%s__ of __%s__ tasks:" % 41 | (done_count, len(tasks))) 42 | tasklines.append( 43 | """ 44 | | Title | Line count | 45 | |-------------------------|-------------|""") 46 | 47 | for task in tasks: 48 | tasklines.append("| {num}. {link} | {lines} |".format(**task)) 49 | 50 | 51 | 52 | 53 | 54 | ## Get the README.md text 55 | lines = [] 56 | with open('README.md', 'r') as inf: 57 | within_tasks = False 58 | for line in inf.readlines(): 59 | if not within_tasks: 60 | if 'begintasks' in line: 61 | within_tasks = True 62 | lines.append(line) 63 | lines.extend([t + "\n" for t in tasklines]) 64 | else: 65 | lines.append(line) 66 | elif within_tasks: 67 | if 'endtasks' in line: 68 | lines.append(line) 69 | within_tasks = False 70 | 71 | with open('README.md', 'w') as outf: 72 | outf.writelines(lines) 73 | 74 | 75 | -------------------------------------------------------------------------------- /scratchpad/high_city_ca_pay.py: -------------------------------------------------------------------------------- 1 | # The top 100 California city employees by per-capita total wages 2 | # across 2009 to 2013 (the most recent year as of publish date) salary data 3 | # Modification to scripts/100.py 4 | import csv 5 | import requests 6 | from io import BytesIO 7 | from zipfile import ZipFile 8 | YEARS = range(2009,2014) 9 | def foosalary(row): 10 | return float(row['Total Wages']) / int(row['Entity Population']) 11 | rows = [] 12 | for year in YEARS: 13 | url = 'http://publicpay.ca.gov/Reports/RawExport.aspx?file=%s_City.zip' % year 14 | print("Downloading:", url) 15 | resp = requests.get(url) 16 | with ZipFile(BytesIO(resp.content)) as zfile: 17 | fname = zfile.filelist[0].filename # 2012_City.csv 18 | print("\tUnzipping:", fname) 19 | # first 4 lines are Disclaimer lines 20 | # only print line[4] (i.e. the headers) if this is the first iteration 21 | xs = 4 if year == YEARS.start else 5 22 | rows.extend(zfile.read(fname).decode('latin-1').splitlines()[xs:]) 23 | # This massive array shouldn't cause your (modern) computer to crash... 24 | print("Filtering %s rows..." 
% len(rows)) 25 | # remove rows without 'Total Wages' 26 | employees = [r for r in csv.DictReader(rows) if r['Total Wages']] 27 | templine = "{year}:\t{city}, {dept}; {position}:\t${money}" 28 | for e in sorted(employees, key = foosalary, reverse = True)[0:100]: # show top 100 29 | line = templine.format(year = e['Year'], city = e['Entity Name'], 30 | dept = e["Department / Subdivision"], position = e['Position'], 31 | money = int(foosalary(e))) 32 | print(line) 33 | 34 | 35 | # If you're a fan of True Detective Season 2, the output might ring a bell: 36 | # http://www.latimes.com/local/california/la-me-vernon-true-detective-20150619-story.html 37 | # 38 | # 2010: Vernon, Finance; Finance Director: $3572 39 | # 2009: Vernon, Light & Power Administration; Director of Light & Power: $3405 40 | # 2012: Vernon, Fire; Fire Chief: $3312 41 | # 2009: Vernon, City Attorney; City Attorney: $3115 42 | # 2009: Vernon, Finance; Finance Director: $3086 43 | # 2009: Vernon, Office of Special Counsel; Special Counsel: $2912 44 | # 2009: Vernon, City Attorney; Assistant City Attorney III: $2688 45 | # 2010: Vernon, L&P Administration; Director of Light & Power Capital Projects: $2596 46 | # 2010: Vernon, City Attorney; Assistant City Attorney III: $2455 47 | # 2010: Vernon, Industrial Development; Assistant Director of Industrial Development: $2330 48 | # 2011: Vernon, Finance; Finance Director: $2321 49 | # 2012: Vernon, Light And Power Administration; Director Of Light & Power: $2318 50 | # 2013: Vernon, City Administration; City Administrator: $2224 51 | # 2013: Vernon, Light And Power Administration; Director Of Light & Power: $2218 52 | # 2009: Vernon, Industrial Development; Assistant Director of Industrial Development: $2189 53 | # 2009: Vernon, City Attorney; Chief Deputy City Attorney: $2137 54 | # 2013: Vernon, City Attorney; City Attorney: $2121 55 | # 2009: Vernon, Administrative, Engineering & Planning; Director of Community Services: $2078 56 | # 2013: Vernon, Fire; Fire Chief: $1982 57 | # 2011: Vernon, City Attorney; Chief Deputy City Attorney: $1979 58 | # 2010: Vernon, City Attorney; Chief Deputy City Attorney: $1969 59 | # 2013: Vernon, Administrative, Engineering & Planning; Director Of Community Services: $1945 60 | # 2010: Vernon, Administrative, Engineering & Planning; Director of Community Services: $1930 61 | # 2009: Vernon, Fire; Fire Chief: $1928 62 | # 2012: Vernon, City Attorney; Chief Deputy City Attorney: $1905 63 | # 2011: Vernon, Administrative, Engineering & Planning; Director Of Community Services & Water: $1903 64 | # 2009: Vernon, Police; Chief: $1875 65 | # 2011: Vernon, Fire; Fire Chief: $1857 66 | # 2013: Vernon, Light And Power Engineering; Engineering Manager: $1834 67 | # 2012: Vernon, Light And Power Engineering; Engineering Manager: $1783 68 | # 2009: Vernon, Health; Health Officer/Director Of Health & Environmental Control: $1782 69 | # 2013: Vernon, Fire; Battalion Chief: $1769 70 | # 2009: Vernon, Police; Sergeants: $1758 71 | # 2010: Vernon, Fire; Fire Chief: $1743 72 | # 2012: Vernon, Finance; Finance Director: $1737 73 | # 2013: Vernon, Police; Police Chief: $1728 74 | # 2011: Vernon, L&P Engineering; Engineering Manager: $1712 75 | # 2012: Vernon, Police; Police Chief: $1693 76 | # 2013: Vernon, Finance; Finance Director: $1691 77 | # 2009: Vernon, Fire; Assistant Fire Chief: $1687 78 | # 2009: Vernon, Fire; Battalion Chief: $1675 79 | # 2010: Vernon, Police; Sergeants: $1661 80 | # 2012: Vernon, Fire; Captain: $1648 81 | # 2011: Vernon, Health; Director 
Health & Environmental Control: $1646 82 | # 2010: Vernon, Health; Director Health & Environmental Control: $1642 83 | # 2012: Vernon, Administrative, Engineering & Planning; Director Of Community Services: $1630 84 | # 2013: Vernon, Health; Director Health & Environmental Control: $1623 85 | # 2011: Vernon, Fire; Assistant Fire Chief: $1623 86 | # 2013: Vernon, Fire; Battalion Chief: $1620 87 | # 2011: Vernon, Fire; Battalion Chief: $1619 88 | # 2013: Vernon, Human Resources; Director Of Human Resources: $1605 89 | # 2009: Vernon, Fire; Battalion Chief: $1592 90 | # 2011: Vernon, L&P Administration; Director Of Light & Power: $1589 91 | # 2012: Vernon, Fire; Battalion Chief: $1589 92 | # 2012: Vernon, Health; Director Health & Environmental Control: $1583 93 | # 2010: Vernon, Fire; Battalion Chief: $1577 94 | # 2013: Vernon, Fire; Battalion Chief: $1576 95 | # 2009: Vernon, Police; Captain: $1576 96 | # 2010: Vernon, Police; Police Chief: $1559 97 | # 2012: Vernon, Fire; Battalion Chief: $1556 98 | # 2012: Vernon, Fire; Battalion Chief: $1552 99 | # 2009: Vernon, Police; Captain: $1551 100 | # 2009: Vernon, Resource Planning; Electric Resources Planning And Development Manager: $1548 101 | # 2009: Vernon, System Dispatch; Transmission & Distribution Manager: $1547 102 | # 2013: Vernon, Resources Planning; Electric Resource Planning & Development Manager: $1539 103 | # 2009: Vernon, Fire; Captain: $1529 104 | # 2010: Vernon, Fire; Assistant Fire Chief: $1528 105 | # 2012: Vernon, Fire; Battalion Chief: $1527 106 | # 2013: Vernon, City Attorney; Chief Deputy City Attorney: $1520 107 | # 2010: Vernon, Police; Interim Police Chief: $1516 108 | # 2011: Vernon, Fire; Battalion Chief: $1504 109 | # 2009: Vernon, Fire; Fire Marshall: $1499 110 | # 2009: Vernon, Police; Police Officer: $1498 111 | # 2009: Vernon, Fire; Battalion Chief: $1497 112 | # 2009: Vernon, Fire; Regional Training Captain: $1494 113 | # 2009: Vernon, Fire; Captain: $1492 114 | # 2009: Vernon, Police; Sergeants: $1488 115 | # 2009: Vernon, Light & Power Engineering; Engineering Manager: $1488 116 | # 2009: Vernon, Fire; Captain: $1484 117 | # 2010: Vernon, Fire; Battalion Chief: $1470 118 | # 2009: Vernon, Police; Police Officer: $1457 119 | # 2013: Vernon, Fire; Captain: $1451 120 | # 2013: Vernon, Resources Planning; Resource Scheduler: $1450 121 | # 2012: Vernon, City Administration; Assistant To The City Administrator: $1450 122 | # 2011: Vernon, Resources Planning; Electric Resources Planning And Development Manager: $1449 123 | # 2010: Vernon, System Dispatch; Transmission & Distribution Manager: $1448 124 | # 2012: Vernon, Resources Planning; Electric Resource Planning & Development Manager: $1446 125 | # 2012: Vernon, Fire; Fire Marshall: $1445 126 | # 2012: Vernon, Fire; Engineer: $1440 127 | # 2010: Vernon, L&P Engineering; Engineering Manager: $1437 128 | # 2009: Vernon, Police; Sergeants: $1437 129 | # 2013: Vernon, Fire; Captain: $1435 130 | # 2013: Vernon, Fire; Captain: $1434 131 | # 2010: Vernon, Fire; Battalion Chief: $1433 132 | # 2012: Vernon, Fire; Engineer: $1427 133 | # 2009: Vernon, Fire; Captain: $1423 134 | # 2009: Vernon, Fire; Captain: $1420 135 | # 2009: Vernon, Fire; Captain: $1415 136 | # 2009: Vernon, Fire; Captain: $1414 137 | # 2013: Vernon, Fire; Captain: $1413 138 | -------------------------------------------------------------------------------- /scratchpad/more_scotus_laughs.py: -------------------------------------------------------------------------------- 1 | # Modification of 
scripts/50.py to count all the laughs in the most recent term 2 | from lxml import html 3 | from subprocess import check_output 4 | from urllib.parse import urljoin 5 | import requests 6 | url = 'http://www.supremecourt.gov/oral_arguments/argument_transcript.aspx' 7 | doc = html.fromstring(requests.get(url).text) 8 | # get all the rulings 9 | for link in doc.cssselect('table.datatables tr a'): 10 | href = link.attrib['href'] 11 | # let's store the title of the case from table cell 12 | casetitle = link.getnext().text_content() 13 | # download PDF 14 | pdf_url = urljoin(url, href) 15 | with open("/tmp/t.pdf", 'wb') as f: 16 | f.write(requests.get(pdf_url).content) 17 | # punt to shell and run pdftotext 18 | # http://www.foolabs.com/xpdf/download.html 19 | txt = check_output("pdftotext -layout /tmp/t.pdf -", shell = True).decode() 20 | print("%s laughs in: %s" % (txt.count("(Laughter.)"), casetitle)) 21 | 22 | 23 | 24 | 25 | 26 | -------------------------------------------------------------------------------- /scripts/1.py: -------------------------------------------------------------------------------- 1 | # Number of datasets currently listed on data.gov 2 | from lxml import html 3 | import requests 4 | response = requests.get('http://www.data.gov/') 5 | doc = html.fromstring(response.text) 6 | link = doc.cssselect('small a')[0] 7 | print(link.text) 8 | -------------------------------------------------------------------------------- /scripts/10.py: -------------------------------------------------------------------------------- 1 | # The title of the highest paid California city government position in 2010 2 | # note, the code below makes it easy to extend "years" to include multiple years 3 | import csv 4 | import os.path 5 | import requests 6 | from shutil import unpack_archive 7 | LOCAL_DATADIR = "/tmp/capublicpay" 8 | YEARS = range(2010, 2011) # i.e. just 2010 9 | def foosalary(row): 10 | return float(row['Total Wages']) if row['Total Wages'] else 0 11 | 12 | for year in YEARS: 13 | bfname = '%s_City' % year 14 | url = 'http://publicpay.ca.gov/Reports/RawExport.aspx?file=%s.zip' % bfname 15 | zname = os.path.join("/tmp", bfname + '.zip') 16 | cname = os.path.join(LOCAL_DATADIR, bfname + '.csv') 17 | 18 | if not os.path.exists(zname): 19 | print("Downloading", url, 'to', zname) 20 | data = requests.get(url).content 21 | with open(zname, 'wb') as f: 22 | f.write(data) 23 | # done downloading, now unzip files 24 | print("Unzipping", zname, 'to', LOCAL_DATADIR) 25 | unpack_archive(zname, LOCAL_DATADIR, format = 'zip') 26 | 27 | with open(cname, encoding = 'latin-1') as f: 28 | # first four lines are: 29 | # “Disclaimer 30 | # 31 | # The information presented is posted as submitted by the reporting entity. 
The State Controller's Office is not responsible for the accuracy of this information.” 32 | data = list(csv.DictReader(f.readlines()[4:])) 33 | topitem = max(data, key = foosalary) 34 | print(topitem['Entity Name'], topitem['Department / Subdivision'], 35 | topitem['Position'], topitem['Total Wages']) 36 | -------------------------------------------------------------------------------- /scripts/100.py: -------------------------------------------------------------------------------- 1 | # The California city whose city manager earns the most total wage per population of its city in 2012 2 | import csv 3 | import requests 4 | from io import BytesIO 5 | from zipfile import ZipFile 6 | YEAR = 2012 7 | def foosalary(row): 8 | return float(row['Total Wages']) / int(row['Entity Population']) 9 | 10 | url = 'http://publicpay.ca.gov/Reports/RawExport.aspx?file=%s_City.zip' % YEAR 11 | print("Downloading:", url) 12 | resp = requests.get(url) 13 | 14 | with ZipFile(BytesIO(resp.content)) as zfile: 15 | fname = zfile.filelist[0].filename # 2012_City.csv 16 | rows = zfile.read(fname).decode('latin-1').splitlines() 17 | # first 4 lines are Disclaimer lines 18 | managers = [r for r in csv.DictReader(rows[4:]) if r['Position'].lower() == 'city manager' 19 | and r['Total Wages']] 20 | topman = max(managers, key = foosalary) 21 | print("City: %s; Pay-per-Capita: $%s" % (topman['Entity Name'], int(foosalary(topman)))) 22 | # City: Industry; Pay-per-Capita: $465 23 | 24 | -------------------------------------------------------------------------------- /scripts/101.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import requests 3 | from io import StringIO 4 | CSVURL = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv' 5 | response = requests.get(CSVURL) 6 | data = csv.DictReader(StringIO(response.text)) 7 | rows = list(data) 8 | len([i for i in rows if i['gender'] == 'F' and i['in_office'] == '1']) 9 | -------------------------------------------------------------------------------- /scripts/11.py: -------------------------------------------------------------------------------- 1 | # How much did the state of California collect in property taxes, according to the U.S. Census 2013 Annual Survey of State Government Tax Collections? 
2 | # landing page: http://www.census.gov/govs/statetax/historical_data.html 3 | # note: this exercise was one of the last to be done and is done in the most just-do-everything-in-one-line mode possible 4 | # ...don't actually follow it as good practice 5 | import requests 6 | from io import BytesIO 7 | from xlrd import open_workbook 8 | from zipfile import ZipFile 9 | ZIP_URL = 'http://www2.census.gov/govs/statetax/state_tax_collections.zip' 10 | XLS_FNAME = 'STC_Historical_DB.xls' 11 | print("Downloading:", ZIP_URL) 12 | resp = requests.get(ZIP_URL) 13 | with ZipFile(BytesIO(resp.content)) as zfile: 14 | with open("/tmp/state_tax_data.xls", "wb") as o: 15 | o.write(zfile.open(XLS_FNAME, 'r').read()) 16 | book = open_workbook("/tmp/state_tax_data.xls") 17 | sheet = book.sheets()[0] 18 | # T01 refers to "Property Tax", get the index 19 | proptax_col_idx = next(idx for idx, c in enumerate(sheet.row_values(1)) if 'T01' in c) 20 | # state name is in column indexed 2 21 | # note that each state has more than one row, but the first one is the most recent 22 | cal_row = next(sheet.row_values(x) for x in range(sheet.nrows) if 'CA STATE' in sheet.row_values(x)[2]) 23 | print("%s paid %s in the year %s" % (cal_row[2], cal_row[proptax_col_idx] * 1000, round(cal_row[0]))) 24 | -------------------------------------------------------------------------------- /scripts/12.py: -------------------------------------------------------------------------------- 1 | # In 2010, the year-over-year change in enplanements at America's busiest airport 2 | # The landing page for this data is: 3 | # http://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/ 4 | # For each year, there's a separate page with a table of links, including 5 | # XLS format: 6 | # e.g. ./passenger/media/cy10_primary_enplanements.xls 7 | import csv 8 | import requests 9 | # we can't be sure that the XLS has the same naming convention year over year 10 | # so let's do a little HTML parsing 11 | from lxml import html 12 | from os.path import basename 13 | from urllib.parse import urljoin 14 | from xlrd import open_workbook 15 | 16 | BASE_URL = "http://www.faa.gov/airports/planning_capacity/passenger_allcargo_stats/passenger/" 17 | YEAR = 2010 18 | resp = requests.get(BASE_URL, params = {'year': YEAR}) 19 | doc = html.fromstring(resp.text) 20 | # There are several spreadsheets and conventions over the years. 
I'm going to 21 | # be lazy and just pick the first spreadsheet with "enplanements" and assume it's the primary 22 | # doc 23 | xls_url = doc.xpath("//a[contains(@href, 'enplanements') and contains(@href, 'xls')]/@href")[0] 24 | print("Downloading", xls_url) 25 | xresp = requests.get(urljoin(BASE_URL, xls_url)) 26 | # save to disk 27 | fn = "/tmp/" + basename(xls_url) 28 | with open(fn, "wb") as f: 29 | f.write(xresp.content) 30 | # open with xlrd 31 | book = open_workbook(fn) 32 | sheet = book.sheets()[0] 33 | # Format looks like this: 34 | # | Airport | CY 10 | CY 09 | 35 | # | Name | Enplanements | Enplanements | 36 | # |--------------------------------------------|--------------|--------------| 37 | # | Hartsfield - Jackson Atlanta International | 43,130,585 | 42,280,868 | 38 | # | Chicago O'Hare International | 32,171,831 | 31,135,732 | 39 | # | Los Angeles International | 28,857,755 | 27,439,897 | 40 | 41 | headers = sheet.row_values(0) 42 | # get all the data rows 43 | rows = [sheet.row_values(i) for i in range(1, sheet.nrows)] 44 | # make them into dicts 45 | drows = [dict(zip(headers, row)) for row in rows] 46 | # remove rows without 'CY 10 Enplanements' as a float 47 | drows = [d for d in drows if isinstance(d['CY 10 Enplanements'], float)] 48 | # get biggest airport 49 | airport = max(drows, key = lambda r: r['CY 10 Enplanements']) 50 | print("%s: %i" % (airport['Airport Name'], airport['CY 10 Enplanements'] - airport['CY 09 Enplanements'])) 51 | # Hartsfield - Jackson Atlanta International: 849717 52 | -------------------------------------------------------------------------------- /scripts/13.py: -------------------------------------------------------------------------------- 1 | from io import BytesIO 2 | from PyPDF2 import PdfFileReader 3 | import requests 4 | import re 5 | url = 'https://www.fbi.gov/stats-services/publications/bank-crime-statistics-2014/bank-crime-statistics-2014' 6 | pdfbytes = BytesIO(requests.get(url).content) 7 | pdf = PdfFileReader(pdfbytes) 8 | txt = pdf.getPage(0).extractText() 9 | # this is really ugly 10 | # U.S. DEPARTMENT OF JUSTICE FEDERAL BUREAU OF INVESTIGATION WASHINGTON, D.C. 20535-0001 BANK CRIME STATISTICS (BCS) FEDERALLY INSURED FINANCIAL INSTITUTIONS January 1, 2014 - December 31, 2014 I. VIOLATIONS OF THE FEDERAL BANK ROBBERY AND INCIDENTAL CRIMES STATUTE, TITLE 18, UNITED STATES CODE, SECTION 2113 Violations by Type of Institution Robberies Burglaries Larcenies Commercial Banks 3,430 61 5 Mutual Savings Banks 31 0 1 Savings and Loan Associations 93 1 0 Credit Unions 312 8 2 Armored Carrier Companies 13 1 3 Totals: 3,879 71 11 Grand Total - All Violations: 3,961 Number, Race, and Sex of Perpetrators The number of persons known to be involved in the 3,961 robberies, burglaries, and larcenies was 4,778. The following table shows a breakdown of the 4,778 persons by race and sex. In a small number of cases, the use of full disguise makes determination of race and sex impossible. White Black Hispanic Other Unknown Male 1770 2030 258 68 221 Female 150 160 17 10 12 Unknown Race/Sex: 82 Investigation to date has resulted in the identification of 2,617 (55 percent) of the 4,778 persons known to be involved. Of these 2,617 identified persons, 1,047 (40 percent) were determined to be users of narcotics, and 463 (18 percent) were found to have been previously convicted in either federal or state court for bank robbery, bank burglary, or bank larceny. Occurrences by Day of Week and Time of Day Monday - 696 6-9 a.m. - 106 Tuesday - 669 9-11 a.m. 
- 1,037 Wednesday - 670 11 a.m.-1 p.m. - 929 Thursday - 648 1-3 p.m. - 791 Friday - 803 3-6 p.m. - 946 Saturday - 339 6 p.m.-6 a.m. - 151 Sunday - 50 Not Determined - 1 Not Determined - 86 Total: 3,961 Total: 3,961 11 | # relevant line 12 | # Armored Carrier Companies 13 1 3 13 | print(re.search('Armored Carrier Companies +(\d+)', txt).groups()[0]) 14 | 15 | 16 | -------------------------------------------------------------------------------- /scripts/14.py: -------------------------------------------------------------------------------- 1 | # The number of workplace fatalities at reported to the federal and state OSHA in the latest fiscal year 2 | # landing page 3 | # https://www.osha.gov/dep/fatcat/dep_fatcat.html 4 | from lxml import html 5 | from urllib.parse import urljoin 6 | import csv 7 | import requests 8 | url = "https://www.osha.gov/dep/fatcat/dep_fatcat.html" 9 | doc = html.fromstring(requests.get(url).text) 10 | links = [a.attrib['href'] for a in doc.cssselect('a') if a.attrib.get('href')] 11 | # assume first CSV is the target csv 12 | csvurl = urljoin(url, [a for a in links if 'csv' in a][0]) 13 | rows = list(csv.DictReader(requests.get(csvurl).text.splitlines())) 14 | print(len([r for r in rows if r['Fatality or Catastrophe'] == 'Fatality'])) 15 | -------------------------------------------------------------------------------- /scripts/15.py: -------------------------------------------------------------------------------- 1 | # Total number of wildlife strike incidents reported at San Francisco International Airport 2 | # landing page 3 | # http://wildlife.faa.gov/database.aspx 4 | import csv 5 | import os 6 | import requests 7 | from shutil import unpack_archive 8 | from subprocess import check_output 9 | AIRPORTCODE = 'KSFO' 10 | LOCAL_DATADIR = "/tmp/faawildlife" 11 | url = 'http://wildlife.faa.gov/downloads/wildlife.zip' 12 | zname = os.path.join(LOCAL_DATADIR, os.path.basename(url)) 13 | dname = os.path.join(LOCAL_DATADIR, 'wildlife.accdb') 14 | os.makedirs(LOCAL_DATADIR, exist_ok = True) 15 | 16 | # Download the zip 17 | if not os.path.exists(zname): 18 | print("Downloading", url, 'to', zname) 19 | z = requests.get(url).content 20 | with open(zname, 'wb') as f: 21 | f.write(z) 22 | 23 | # unzip it 24 | print("Unzipping", zname, 'to', LOCAL_DATADIR) 25 | unpack_archive(zname, LOCAL_DATADIR) 26 | 27 | # Work with MS Access, using mdbtools 28 | # https://github.com/brianb/mdbtools 29 | # 30 | # Install with: 31 | # brew install mdbtools 32 | # Helpful post: 33 | # http://nialldonegan.me/2007/03/10/converting-microsoft-access-mdb-into-csv-or-mysql-in-linux/ 34 | 35 | # $ mdb-tables wildlife.accdb 36 | # STRIKE_REPORTS (1990-1999) STRIKE_REPORTS (2000-2009) STRIKE_REPORTS (2010-Current) STRIKE_REPORTS_BASH (1990-Current) 37 | # hardcode the tablenames 38 | access_tablenames = ['STRIKE_REPORTS (1990-1999)', 'STRIKE_REPORTS (2000-2009)', 'STRIKE_REPORTS (2010-Current)', 'STRIKE_REPORTS_BASH (1990-Current)'] 39 | hitcount = 0 40 | for tname in access_tablenames: 41 | txt = check_output("mdb-export %s '%s'" % (dname, tname), shell = True).decode() 42 | rows = list(csv.DictReader(txt.splitlines())) 43 | hits = len([r for r in rows if r['AIRPORT_ID'] == AIRPORTCODE]) 44 | print(tname, " - ", hits) 45 | hitcount += hits 46 | 47 | print("Total:") 48 | print(hitcount) 49 | -------------------------------------------------------------------------------- /scripts/16.py: -------------------------------------------------------------------------------- 1 | # The non-profit organization 
with the highest total revenue, according to the latest listing in ProPublica's Nonprofit Explorer 2 | # Note: "latest listing" is kind of broad...we'll just take that to mean 3 | # top revenue of whatever's currently listed on the site 4 | from lxml import html 5 | import requests 6 | url = 'https://projects.propublica.org/nonprofits/search?c_code%5Bid%5D=&ntee%5Bid%5D=&order=revenue&q=&sort_order=desc&state%5Bid%5D=&utf8=%E2%9C%93' 7 | doc = html.fromstring(requests.get(url).text) 8 | d = doc.xpath('//table/tbody/tr[1]/td/a/text()') 9 | print(d[0]) 10 | # It's also possible to just use the API 11 | # https://projects.propublica.org/nonprofits/api 12 | -------------------------------------------------------------------------------- /scripts/17.py: -------------------------------------------------------------------------------- 1 | # In the "Justice News" RSS feed maintained by the Justice Department, the number of items published on a Friday 2 | from datetime import datetime 3 | from lxml import etree 4 | import requests 5 | url = 'http://www.justice.gov/feeds/opa/justice-news.xml' 6 | doc = etree.fromstring(requests.get(url).content) 7 | items = doc.xpath('//channel/item') 8 | dates = [item.find('pubDate').text.strip() for item in items] 9 | ts = [datetime.strptime(d.split(' ')[0], '%Y-%m-%d') for d in dates] 10 | # for weekday(), 4 correspond to Friday 11 | print(len([t for t in ts if t.weekday() == 4])) 12 | -------------------------------------------------------------------------------- /scripts/18.py: -------------------------------------------------------------------------------- 1 | # The number of U.S. congressmembers who have Twitter accounts, according to Sunlight Foundation data 2 | # info https://sunlightlabs.github.io/congress/#legislator-spreadsheet 3 | import csv 4 | import requests 5 | url = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv' 6 | rows = list(csv.DictReader(requests.get(url).text.splitlines())) 7 | # note that spreadsheet includes non-sitting legislators, thus the use 8 | # of 'in_office' attribute to filter 9 | print(len([r for r in rows if r['twitter_id'] and r['in_office'] == '1'])) 10 | -------------------------------------------------------------------------------- /scripts/19.py: -------------------------------------------------------------------------------- 1 | # The total number of preliminary reports on aircraft safety incidents/accidents in the last 10 business days 2 | from lxml import html 3 | import requests 4 | url = 'http://www.asias.faa.gov/pls/apex/f?p=100:93:0::NO:::' 5 | doc = html.fromstring(requests.get(url).text) 6 | x = 0 7 | for tr in doc.cssselect('#uPageCols tr')[2:3]: 8 | for t in tr.cssselect('td')[1:]: 9 | v = re.search('\d+', t.text_content()) 10 | if v: 11 | x += int(v.group()) 12 | print(x) 13 | -------------------------------------------------------------------------------- /scripts/2.py: -------------------------------------------------------------------------------- 1 | # The name of the most recently added dataset on data.gov 2 | from lxml import html 3 | import requests 4 | response = requests.get('http://catalog.data.gov/dataset?q=&sort=metadata_created+desc') 5 | doc = html.fromstring(response.text) 6 | title = doc.cssselect('h3.dataset-heading')[0].text_content() 7 | print(title.strip()) 8 | -------------------------------------------------------------------------------- /scripts/20.py: -------------------------------------------------------------------------------- 1 | # The number of OSHA 
enforcement inspections involving Wal-Mart in California since 2014 2 | from lxml import html 3 | import requests 4 | import re 5 | url = "https://www.osha.gov/pls/imis/establishment.search" 6 | atts = {'Office': 'all', 7 | 'State': 'CA', 8 | 'endday': '13', 9 | 'endmonth': '06', 10 | 'endyear': '2015', 11 | 'establishment': 'Wal-Mart', 12 | 'officetype': 'all', 13 | 'p_case': 'all', 14 | 'p_violations_exist': 'all', 15 | 'startday': '01', 16 | 'startmonth': '01', 17 | 'startyear': '2014'} 18 | 19 | doc = html.fromstring(requests.get(url, params = atts).text) 20 | # Looks like: 21 | #
22 | # Results 1 - 8 of 8 23 | #
24 | v = re.search('of (\d+)', doc.cssselect('.text-right')[1].text) 25 | print(int(v.groups()[0])) 26 | -------------------------------------------------------------------------------- /scripts/21.py: -------------------------------------------------------------------------------- 1 | # The current humidity level at Great Smoky Mountains National Park 2 | from lxml import html 3 | import requests 4 | url = "http://www.nature.nps.gov/air/WebCams/parks/grsmcam/grsmcam.cfm" 5 | doc = html.fromstring(requests.get(url).text) 6 | print(doc.cssselect('#CollapsiblePanel6 div div div')[3].text_content()) 7 | -------------------------------------------------------------------------------- /scripts/22.py: -------------------------------------------------------------------------------- 1 | # The names of the committees that Sen. Barbara Boxer currently serves on 2 | import requests 3 | from lxml import html 4 | url="http://www.senate.gov/general/committee_assignments/assignments.htm" 5 | doc = html.fromstring(requests.get(url).text) 6 | row = next(tr for tr in doc.cssselect('tr') if 'Boxer, Barbara' in tr.text_content()) 7 | print(len(row.cssselect('td')[1].cssselect('a'))) 8 | -------------------------------------------------------------------------------- /scripts/23.py: -------------------------------------------------------------------------------- 1 | # The name of the California school with the highest number of girls enrolled in kindergarten, according to the CA Dept. of Education's latest enrollment data file. 2 | import csv 3 | import requests 4 | from collections import defaultdict 5 | from operator import itemgetter 6 | url = 'http://dq.cde.ca.gov/dataquest/dlfile/dlfile.aspx?cLevel=School&cYear=2014-15&cCat=Enrollment&cPage=filesenr.asp' 7 | 8 | def foo(row): 9 | return int(row['KDGN']) if row['KDGN'] else 0 10 | 11 | lines = requests.get(url).text.splitlines() 12 | data = list(csv.DictReader(lines, delimiter = "\t")) 13 | 14 | codes = defaultdict(int) 15 | for d in data: 16 | if d['GENDER'] == 'F': 17 | codes[d['CDS_CODE']] += int(d['KDGN']) 18 | 19 | cds, num = max(codes.items(), key = itemgetter(1)) 20 | print([d['SCHOOL'] for d in data if d['CDS_CODE'] == cds][0]) 21 | 22 | -------------------------------------------------------------------------------- /scripts/24.py: -------------------------------------------------------------------------------- 1 | # Percentage of NYPD stop-and-frisk reports in which the suspect was white in 2014 2 | from shutil import unpack_archive 3 | import csv 4 | import os 5 | import requests 6 | DATADIR = '/tmp/nypd' 7 | os.makedirs(DATADIR, exist_ok = True) 8 | zipurl = 'http://www.nyc.gov/html/nypd/downloads/zip/analysis_and_planning/2014_sqf_csv.zip' 9 | zname = os.path.join(DATADIR, os.path.basename(zipurl)) 10 | cname = os.path.join(DATADIR, '2014.csv') 11 | if not os.path.exists(zname): 12 | print("Downloading", zipurl, 'to', zname) 13 | z = requests.get(zipurl).content 14 | with open(zname, 'wb') as f: 15 | f.write(z) 16 | # unzip it 17 | print("Unzipping", zname, 'to', DATADIR) 18 | unpack_archive(zname, DATADIR) 19 | 20 | data = list(csv.DictReader(open(cname, encoding = 'latin-1'))) 21 | whites = [d for d in data if d['race'] == 'W'] 22 | print(len(whites) * 100 / len(data)) 23 | 24 | 25 | -------------------------------------------------------------------------------- /scripts/25.py: -------------------------------------------------------------------------------- 1 | # Average frontal crash star rating for 2015 Honda Accords 2 | import requests 3 
| import re 4 | from lxml import html 5 | 6 | url = 'http://www.safercar.gov/Vehicle+Shoppers/5-Star+Safety+Ratings/2011-Newer+Vehicles/Search-Results' 7 | atts = {"searchtype":"model", "make": "HONDA", "model": "ACCORD", "year": 2015} 8 | doc = html.fromstring(requests.get(url, params = atts).text) 9 | trs = doc.cssselect("#dataarea tr") 10 | v = 0 11 | for tr in trs[1:-1]: 12 | t = tr.cssselect('td.b_right img.stars')[1].attrib['alt'] 13 | v += int(re.search('\d+', t).group()) 14 | print(v / len(trs[1:-1])) 15 | -------------------------------------------------------------------------------- /scripts/26.py: -------------------------------------------------------------------------------- 1 | # The dropout rate for all of Santa Clara County high schools, according to the latest cohort data in CALPADS 2 | import csv 3 | from urllib.request import urlopen 4 | from io import TextIOWrapper as Tio 5 | from lxml import html 6 | COUNTY = 'Santa Clara' 7 | # This problem actually requires two datasets: 8 | # 1) The Dept. of Ed's list of county ID numbers to find Santa Clara 9 | SCHOOL_DB_URL = 'ftp://ftp.cde.ca.gov/demo/schlname/pubschls.txt' 10 | # 2) The latest cohort data file from CALPADS 11 | CALPADS_PAGE_URL = "http://www.cde.ca.gov/ds/sd/sd/filescohort.asp" 12 | # Obviously you could hardcode Santa Clara County's ID number but that 13 | # would be too easy. Doing a lookup of the ID let's us modify the script 14 | # to work with any county. 15 | # ...unfortunately, CDE has the list of county IDs to be such boring info 16 | # that they don't put it an easy to find way. OK, so let's just download 17 | # their entire schools database just to get one number: 18 | with urlopen(SCHOOL_DB_URL) as schoolsdb: 19 | print("Downloading", SCHOOL_DB_URL) 20 | txt = Tio(schoolsdb, encoding = 'latin-1') 21 | rows = csv.DictReader(txt, delimiter = '\t') 22 | county_id = next(r['CDSCode'][0:2] for r in rows if r['County'] == COUNTY) 23 | print(COUNTY, 'ID is:', county_id) 24 | print("Downloading", CALPADS_PAGE_URL) 25 | doc = html.fromstring(urlopen(CALPADS_PAGE_URL).read()) 26 | # um, I'm curious about howtheir ASP app here works...but whatever... 27 | urls = doc.xpath("//a[contains(@href, 'dlfile.aspx?cLevel=All')]/@href") 28 | # being lazy and assuming first item is the most recent 29 | calpads_url = urls[0] 30 | print("Downloading", calpads_url) 31 | dropouts, total = 0, 0 32 | with urlopen(calpads_url) as calpadsdb: 33 | print("Downloading", calpads_url) 34 | txt = Tio(calpadsdb, encoding = 'latin-1') 35 | for row in csv.DictReader(txt, delimiter = '\t'): 36 | # not every row is to be counted, as each school has a separate row 37 | # for each subgroup. So the filter condition is not just by county 38 | # but also by 'AggLevel' == 'S' and 'Subgroup' == 'All' 39 | if(row['CDS'][0:2] == county_id and row['AggLevel'] == 'S' 40 | and row['Subgroup'] == 'All'): 41 | try: # sooooo lazy... 42 | total += int(row['NumCohort']) 43 | dropouts += int(row['NumDropouts']) 44 | except: 45 | pass # not a number; some cells have '*' 46 | 47 | print(dropouts / total) 48 | # 0.09737916232841275 49 | -------------------------------------------------------------------------------- /scripts/27.py: -------------------------------------------------------------------------------- 1 | # The number of Class I Drug Recalls issued by the U.S. 
Food and Drug Administration since 2012 2 | # Caveat: the FDA page says this: 3 | # NOTE: The recalls on the list are generally Class I., 4 | # which means there is a reasonable probability that the 5 | # use of or exposure to a violative product will cause 6 | # serious adverse health consequences or death. 7 | # 8 | # This script assumes the recalls are all Class I, for simplicity sake 9 | from lxml import html 10 | import requests 11 | url = 'http://www.fda.gov/Drugs/DrugSafety/DrugRecalls/default.htm' 12 | doc = html.fromstring(requests.get(url).text) 13 | links = doc.cssselect('.col-md-6.col-md-push-3.middle-column linktitle') 14 | print(len(links)) 15 | -------------------------------------------------------------------------------- /scripts/28.py: -------------------------------------------------------------------------------- 1 | # Total number of clinical trials as recorded by the National Institutes of Health 2 | import requests 3 | from lxml import html 4 | url = 'https://clinicaltrials.gov/' 5 | doc = html.fromstring(requests.get(url).text) 6 | e = doc.cssselect('#trial-count > p > .highlight')[0] 7 | print(e.text_content()) 8 | -------------------------------------------------------------------------------- /scripts/29.py: -------------------------------------------------------------------------------- 1 | # Number of days until Texas's next scheduled execution 2 | from datetime import datetime 3 | from lxml import html 4 | import pytz 5 | import requests 6 | url = "http://www.tdcj.state.tx.us/death_row/dr_scheduled_executions.html" 7 | # fetch and parse the page 8 | doc = html.fromstring(requests.get(url).text) 9 | # Get our time in central texas time; http://stackoverflow.com/a/22109768/160863 10 | texas_time = pytz.timezone("US/Central") 11 | today = texas_time.localize(datetime(*datetime.now().timetuple()[0:3])) # whatever, too lazy to look up the idiom 12 | for row in doc.xpath('//table/tr')[1:]: 13 | # Even though this table is sorted in reverse-chronological order, 14 | # sometimes the executions happen more quickly than the updates to the 15 | # webpage, can't assume the first row is always the upcoming execution 16 | # 17 | # Each row looks like: 18 | # | 08/12/2015 | Info | Lopez | Daniel | 999555 | 09/15/1987 | H | 03/16/2010 | Nueces | 19 | col = row.cssselect('td')[0] 20 | exdate = datetime.strptime(col.text_content(), '%m/%d/%Y') 21 | exdate = texas_time.localize(exdate) 22 | if (exdate >= today): 23 | print((exdate - today).days, "days") 24 | break 25 | -------------------------------------------------------------------------------- /scripts/3.py: -------------------------------------------------------------------------------- 1 | # The number of people who visited a U.S. 
government website using Internet Explorer 6.0 in the last 90 days 2 | import requests 3 | r = requests.get("https://analytics.usa.gov/data/live/ie.json") 4 | print(r.json()['totals']['ie_version']['6.0']) 5 | -------------------------------------------------------------------------------- /scripts/30.py: -------------------------------------------------------------------------------- 1 | # The total number of inmates executed by Florida since 1976 2 | import requests 3 | from lxml import html 4 | url = "http://www.dc.state.fl.us/oth/deathrow/execlist.html" 5 | 6 | doc = html.fromstring(requests.get(url).text) 7 | tables = doc.cssselect('table.dcCSStableLight') 8 | rows = tables[0].cssselect('tr') 9 | # the first row is just the header row 10 | print(len(rows) - 1) 11 | -------------------------------------------------------------------------------- /scripts/31.py: -------------------------------------------------------------------------------- 1 | # The number of proposed U.S. federal regulations in which comments are due within the next 3 days 2 | 3 | # Note: 4 | # This exercise is a major snafu on my part, as I assigned it thinking you 5 | # could easily scrape it from the front page. 6 | # However, the HTML of the results is generated client side, after an AJAX 7 | # request to whatever-the-f-ck this serialzed data format is: 8 | # GET http://www.regulations.gov/dispatch/LoadRegulationsClosingSoon 9 | # Response: 10 | # //OK[21,20,19,3,18,17,16,3,15,14,13,3,12,11,10,3,9,8,7,3,6,5,4,3,6,2,1,["gov.egov.erule.regs.shared.dispatch.LoadRegulationsClosingSoonResult/4107109627","java.util.ArrayList/4159755760","gov.regulations.common.models.DimensionValueModel/244318028","41","Today","8075","83","3 Days","8096","194","7 Days","8076","436","15 Days","8097","766","30 Days","8077","1133","90 Days","8078"],0,7] 11 | # 12 | # Reverse engineering is not a fun-type of challenge. For this particular exercise, 13 | # though, the answer can be found through a simple API call. 
14 | # 15 | # 16 | # The API dev docs are here: http://regulationsgov.github.io/developers/ 17 | # 18 | # Specifically, the documents.json endpoint described here: 19 | # http://regulationsgov.github.io/developers/console/#!/documents.json/documents_get_0 20 | # 21 | # This endpoint has 1 parameter necessary for this exercise: 22 | # 23 | # - cs: Comment Period Closing Soon; the value is an integer for number of days 24 | # until closing 25 | import requests 26 | BASE_URL = 'http://api.data.gov/regulations/v3/documents.json' 27 | my_params = {'api_key': 'DEMO_KEY', 'cs': 3} 28 | resp = requests.get(BASE_URL, params = my_params) 29 | print(resp.json()['totalNumRecords']) 30 | -------------------------------------------------------------------------------- /scripts/32.py: -------------------------------------------------------------------------------- 1 | # Number of Titles that have changed in the United States Code since its last release point 2 | # Note: the div class of "usctitlechanged" is used to mark such titles 3 | import requests 4 | url = 'http://uscode.house.gov/download/download.shtml' 5 | txt = requests.get(url).text 6 | print(txt.count('class="usctitlechanged" id')) 7 | -------------------------------------------------------------------------------- /scripts/33.py: -------------------------------------------------------------------------------- 1 | # The number of FDA-approved, but now discontinued drug products that contain Fentanyl as an active ingredient 2 | # landing page: 3 | # http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm 4 | # search page: 5 | # http://www.accessdata.fda.gov/scripts/cder/ob/docs/queryai.cfm 6 | import re 7 | import requests 8 | 9 | formurl = 'http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempai.cfm' 10 | post_params = {'Generic_Name': 'Fentanyl', 'table1': 'OB_Disc'} 11 | resp = requests.post(formurl, data = post_params) 12 | # Displaying records 1 to 29 of 29 13 | m = re.search('(?<=Displaying records) *[\d,]+ *to *[\d,]+ *of *([\d,]+)', resp.text) 14 | print(m.groups()[0]) 15 | -------------------------------------------------------------------------------- /scripts/34.py: -------------------------------------------------------------------------------- 1 | # In the latest FDA Weekly Enforcement Report, the number of Class I and Class II recalls involving food 2 | import requests 3 | from lxml import html 4 | url = 'http://www.fda.gov/Safety/Recalls/EnforcementReports/default.htm' 5 | doc = html.fromstring(requests.get(url).text) 6 | reporturl = doc.xpath('//a[contains(text(), "Enforcement Report for ")]/@href')[0] 7 | # example weekly report: 8 | # http://www.accessdata.fda.gov/scripts/enforcement/enforce_rpt-Product-Tabs.cfm?action=Expand+Index&w=06102015&lang=eng 9 | report = html.fromstring(requests.get(reporturl).text) 10 | print(len(report.cssselect('tr.Food'))) 11 | -------------------------------------------------------------------------------- /scripts/35.py: -------------------------------------------------------------------------------- 1 | # Most viewed data set on New York state's open data portal as of this month 2 | import requests 3 | from lxml import html 4 | # There's probably a JSON endpoint for this...but what the heck, let's 5 | # do HTML parsing 6 | url = 'https://data.ny.gov/browse?sortBy=most_accessed&sortPeriod=month' 7 | doc = html.fromstring(requests.get(url).text) 8 | t = doc.cssselect('tr.item .titleLine a')[0] 9 | print(t.text_content()) 10 | 
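# The comment above concedes there is probably a JSON endpoint for this. Below is a
# minimal commented-out sketch of that route, assuming the Socrata Discovery API at
# api.us.socrata.com/api/catalog/v1 and its 'page_views_last_month' sort order; the
# endpoint, parameter names, and response shape are assumptions here, not something
# verified against data.ny.gov, so treat it as a starting point rather than a drop-in.
# cat_url = 'https://api.us.socrata.com/api/catalog/v1'
# cat_params = {'domains': 'data.ny.gov', 'order': 'page_views_last_month', 'limit': 1}
# results = requests.get(cat_url, params = cat_params).json()['results']
# print(results[0]['resource']['name'])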
-------------------------------------------------------------------------------- /scripts/36.py: -------------------------------------------------------------------------------- 1 | # Total number of visitors to the White House in 2012 2 | # landing page: 3 | # https://www.whitehouse.gov/briefing-room/disclosures/visitor-records 4 | import csv 5 | import os 6 | import requests 7 | from shutil import unpack_archive 8 | LOCAL_DATADIR = "/tmp/whvisitors" 9 | url = 'https://www.whitehouse.gov/sites/default/files/disclosures/whitehouse-waves-2012.csv_.zip' 10 | zname = os.path.join(LOCAL_DATADIR, os.path.basename(url)) 11 | cname = os.path.join(LOCAL_DATADIR, 'WhiteHouse-WAVES-2012.csv') 12 | os.makedirs(LOCAL_DATADIR, exist_ok = True) 13 | 14 | # Download the zip 15 | if not os.path.exists(zname): 16 | print("Downloading", url, 'to', zname) 17 | z = requests.get(url).content 18 | with open(zname, 'wb') as f: 19 | f.write(z) 20 | 21 | # unzip it 22 | print("Unzipping", zname, 'to', LOCAL_DATADIR) 23 | unpack_archive(zname, LOCAL_DATADIR) 24 | # the file was zipped on a Mac, yet still uses Windows encoding...mkaaay 25 | rows = list(csv.DictReader(open(cname, encoding = 'ISO-8859-1'))) 26 | print(len(rows)) 27 | # 934872 28 | -------------------------------------------------------------------------------- /scripts/37.py: -------------------------------------------------------------------------------- 1 | # The last time the CIA's Leadership page has been updated 2 | import requests 3 | import re 4 | url = "https://www.cia.gov/about-cia/leadership" 5 | txt = re.search('Last Updated:.+?(?=
)', requests.get(url).text).group() 6 | print(txt) 7 | -------------------------------------------------------------------------------- /scripts/38.py: -------------------------------------------------------------------------------- 1 | # The domain of the most visited U.S. government website right now 2 | import requests 3 | url = 'https://analytics.usa.gov/data/live/top-pages-realtime.json' 4 | resp = requests.get(url).json() 5 | print(resp['data'][0]['page']) 6 | -------------------------------------------------------------------------------- /scripts/39.py: -------------------------------------------------------------------------------- 1 | # Number of medical device recalls issued by the U.S. Food and Drug Administration in 2013 2 | from lxml import html 3 | import requests 4 | url = 'http://www.fda.gov/MedicalDevices/Safety/ListofRecalls/ucm384618.htm' 5 | doc = html.fromstring(requests.get(url).text) 6 | print(len(doc.cssselect('tbody tr'))) 7 | -------------------------------------------------------------------------------- /scripts/4.py: -------------------------------------------------------------------------------- 1 | # The number of librarian-related job positions that the federal government is currently hiring for 2 | import requests 3 | # via http://www.opm.gov/policy-data-oversight/classification-qualifications/general-schedule-qualification-standards/#url=List-by-Occupational-Series 4 | LIBSERIES = 1410 5 | resp = requests.get("https://data.usajobs.gov/api/jobs", params = {'series': LIBSERIES}) 6 | print(resp.json()['TotalJobs']) 7 | -------------------------------------------------------------------------------- /scripts/40.py: -------------------------------------------------------------------------------- 1 | # Number of FOIA requests made to the Chicago Public Library 2 | import csv 3 | import requests 4 | url = 'https://data.cityofchicago.org/api/views/n379-5uzu/rows.csv?accessType=DOWNLOAD' 5 | data = list(csv.DictReader(requests.get(url).text.splitlines())) 6 | print(len(data)) 7 | -------------------------------------------------------------------------------- /scripts/41.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import re 3 | url = "https://clinicaltrials.gov/ct2/results?recr=Open&cond=%22Alcohol-Related+Disorders%22" 4 | r = re.search('(?<=)\d+(?= +studies found for)', requests.get(url).text) 5 | print(r.group()) 6 | -------------------------------------------------------------------------------- /scripts/42.py: -------------------------------------------------------------------------------- 1 | # The name of the Supreme Court justice who delivered the opinion in the most recently announced decision 2 | # depends on PyPDF2 https://pythonhosted.org/PyPDF2/PdfFileReader.html 3 | from io import BytesIO 4 | from lxml import html 5 | from PyPDF2 import PdfFileReader 6 | from urllib.parse import urljoin 7 | import requests 8 | import re 9 | # get the most recent ruling 10 | url = "http://www.supremecourt.gov/opinions/slipopinions.aspx" 11 | doc = html.fromstring(requests.get(url).text) 12 | a = doc.cssselect('#mainbody table')[0].cssselect('tr a')[0] 13 | # download PDF 14 | pdf_url = urljoin(url, a.attrib['href']) 15 | pdfbytes = BytesIO(requests.get(pdf_url).content) 16 | pdf = PdfFileReader(pdfbytes) 17 | # compile text of all the pages 18 | txt = "" 19 | for i in range(pdf.getNumPages()): 20 | txt += pdf.getPage(i).extractText() + "\n" 21 | # regex match...hopefully this is *always* the text... 
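# As a quick sanity check, here is that pattern run against the first line of the
# sample text quoted at the bottom of this script (an illustrative snippet only --
# the wording in other opinions may vary):
# >>> s = "KENNEDY, J., delivered the opinion of the Court, in which GINSBURG,"
# >>> re.search("[A-Z]+, *(?:J\.|C\. J\.|JJ\.)(?=, delivered the opinion of the Court)", s).group()
# 'KENNEDY, J.'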
22 | m = re.search("[A-Z]+, *(?:J\.|C\. J\.|JJ\.)(?=, delivered the opinion of the Court)", txt) 23 | print(m.group()) 24 | 25 | # Sample relevant text: 26 | # KENNEDY, J., delivered the opinion of the Court, in which GINSBURG, 27 | # BREYER, SOTOMAYOR, and KAGAN, JJ., joined. BREYER, J., filed a concurring 28 | # opinion. THOMAS, J., filed an opinion concurring in the judgment 29 | # in part and dissenting in part. ROBERTS, C. J., filed a dissenting opinion, 30 | # in which ALITO, J., joined. SCALIA, J., filed a dissenting opinion, in 31 | # which ROBERTS, C. J., and ALITO, J., joined. 32 | -------------------------------------------------------------------------------- /scripts/43.py: -------------------------------------------------------------------------------- 1 | # The number of citations that resulted from FDA inspections in fiscal year 2012 2 | import requests 3 | import csv 4 | # list of citations is here: 5 | # http://www.fda.gov/ICECI/Inspections/ucm346077.htm 6 | csv_url = 'http://www.fda.gov/downloads/ICECI/Inspections/UCM346093.csv' 7 | print("Downloading", csv_url) 8 | resp = requests.get(csv_url) 9 | rows = list(csv.DictReader(resp.text.splitlines()[2:])) 10 | print(len(rows)) 11 | -------------------------------------------------------------------------------- /scripts/44.py: -------------------------------------------------------------------------------- 1 | # Number of people visiting a U.S. government website right now 2 | # via: https://analytics.usa.gov/ 3 | import requests 4 | url = 'https://analytics.usa.gov/data/live/realtime.json' 5 | j = requests.get(url).json() 6 | print(j['data'][0]['active_visitors']) 7 | -------------------------------------------------------------------------------- /scripts/45.py: -------------------------------------------------------------------------------- 1 | # The number of security alerts issued by US-CERT in the current year 2 | import requests 3 | from lxml import html 4 | url = 'https://www.us-cert.gov/ncas/alerts' 5 | doc = html.fromstring(requests.get(url).text) 6 | print(len(doc.cssselect('.item-list li'))) 7 | -------------------------------------------------------------------------------- /scripts/46.py: -------------------------------------------------------------------------------- 1 | # The number of Pinterest accounts maintained by U.S. State Department embassies and missions 2 | 3 | # Note: You can extend this problem to include ALL Pinterest 4 | # accounts (e.g. maintained by Consulates) because 5 | # the HTML structure here is atrocious 6 | import requests 7 | from lxml import html 8 | url = 'http://www.state.gov/r/pa/ode/socialmedia/' 9 | doc = html.fromstring(requests.get(url).text) 10 | pinlinks = [a for a in doc.cssselect('a') if 'pinterest.com' in str(a.attrib.get('href'))] 11 | # we just need a count, so no need to do anything more 12 | # sophisticated 13 | print(len(pinlinks)) 14 | -------------------------------------------------------------------------------- /scripts/47.py: -------------------------------------------------------------------------------- 1 | # The number of international travel alerts from the U.S. 
State Department currently in effect 2 | import requests 3 | from lxml import html 4 | url = 'http://travel.state.gov/content/passports/english/alertswarnings.html' 5 | doc = html.fromstring(requests.get(url).text) 6 | print(len(doc.cssselect('td.alert'))) 7 | 8 | -------------------------------------------------------------------------------- /scripts/48.py: -------------------------------------------------------------------------------- 1 | # The difference in total White House staffmember salaries in 2014 versus 2010 2 | import csv 3 | import requests 4 | # info https://www.whitehouse.gov/briefing-room/disclosures/annual-records/2014 5 | url2010 = 'https://open.whitehouse.gov/api/views/rcp4-3y7g/rows.csv?accessType=DOWNLOAD' 6 | url2014 = 'https://open.whitehouse.gov/api/views/i9g8-9web/rows.csv?accessType=DOWNLOAD' 7 | 8 | d2010 = list(csv.DictReader(requests.get(url2010).text.splitlines())) 9 | d2014 = list(csv.DictReader(requests.get(url2014).text.splitlines())) 10 | 11 | s2010 = 0 12 | for d in d2010: 13 | s2010 += float(d['Salary'].replace('$', '')) 14 | 15 | s2014 = 0 16 | for d in d2014: 17 | s2014 += float(d['Salary'].replace('$', '')) 18 | 19 | print(s2014 - s2010) 20 | -------------------------------------------------------------------------------- /scripts/49.py: -------------------------------------------------------------------------------- 1 | # Number of sponsored bills by Rep. Nancy Pelosi that were vetoed by the President 2 | from lxml import html 3 | import requests 4 | import re 5 | url = 'https://www.congress.gov/member/nancy-pelosi/P000197' 6 | atts = {'q': '{"sponsorship":"sponsored","bill-status":"veto"}'} 7 | doc = html.fromstring(requests.get(url, params = atts).text) 8 | t = doc.cssselect('.results-number')[0].text_content() 9 | # e.g. 
1-25 of 4,897 10 | r = re.search('(?<=of) *[\d,]+', t).group().replace(',', '').strip() 11 | print(r) 12 | -------------------------------------------------------------------------------- /scripts/5.py: -------------------------------------------------------------------------------- 1 | # The name of the company cited in the most recent consumer complaint involving student loans 2 | # note that this is a pre-made filter from: 3 | # https://data.consumerfinance.gov/dataset/Consumer-Complaints/x94z-ydhh 4 | import requests 5 | from operator import itemgetter 6 | url = "https://data.consumerfinance.gov/api/views/c8k9-ryca/rows.json?accessType=DOWNLOAD" 7 | 8 | # If you go the JSON route with Socrata, you have to 9 | # do an extra step of parsing metadata to get the 10 | # desired columns...or you could just hardcode their 11 | # positions for now 12 | data = requests.get(url).json() 13 | # use meta data to extract which column Company exists in 14 | cols = data['meta']['view']['columns'] 15 | # fancier way of doing a for-loop and counter 16 | # http://stackoverflow.com/questions/2748235/in-python-how-can-i-find-the-index-of-the-first-item-in-a-list-that-is-not-some 17 | 18 | # get position of Date received column 19 | d_pos = next((i for i, c in enumerate(cols) if c['name'] == 'Date received'), -1) 20 | 21 | # get position of Company column 22 | c_pos = next((i for i, c in enumerate(cols) if c['name'] == 'Company'), -1) 23 | 24 | # It appears that Socrata returns the data in order of 25 | # Date received but just in case, here's a sort 26 | row = max(data['data'], key = itemgetter(d_pos)) 27 | print(row[c_pos]) 28 | -------------------------------------------------------------------------------- /scripts/50.py: -------------------------------------------------------------------------------- 1 | # In the most recently transcribed Supreme Court argument, the number of times laughter broke out 2 | from lxml import html 3 | from subprocess import check_output 4 | from urllib.parse import urljoin 5 | import requests 6 | url = 'http://www.supremecourt.gov/oral_arguments/argument_transcript.aspx' 7 | doc = html.fromstring(requests.get(url).text) 8 | # get the most recent ruling, e.g. the top of table 9 | href = doc.cssselect('table.datatables tr a')[0].attrib['href'] 10 | # download PDF 11 | pdf_url = urljoin(url, href) 12 | with open("/tmp/t.pdf", 'wb') as f: 13 | f.write(requests.get(pdf_url).content) 14 | # punt to shell and run pdftotext 15 | # http://www.foolabs.com/xpdf/download.html 16 | txt = check_output("pdftotext -layout /tmp/t.pdf -", shell = True).decode() 17 | print(txt.count("(Laughter.)")) 18 | 19 | 20 | 21 | 22 | 23 | -------------------------------------------------------------------------------- /scripts/51.py: -------------------------------------------------------------------------------- 1 | # The title of the most recent decision handed down by the U.S. 
Supreme Court 2 | import requests 3 | from lxml import html 4 | url = 'http://www.supremecourt.gov/opinions/slipopinions.aspx' 5 | doc = html.fromstring(requests.get(url).text) 6 | print(doc.cssselect("#mainbody table tr a")[0].text_content()) 7 | -------------------------------------------------------------------------------- /scripts/52.py: -------------------------------------------------------------------------------- 1 | # The average wage of optometrists according to the BLS's most recent National Occupational Employment and Wage Estimates report 2 | from lxml import html 3 | import requests 4 | url = 'http://www.bls.gov/oes/current/oes_nat.htm' 5 | doc = html.fromstring(requests.get(url).text) 6 | table = doc.cssselect('#bodytext table')[0] 7 | t = next(tr for tr in table.cssselect('tr') if 'Optometrists' in tr.text_content()) 8 | print( t.cssselect('td')[-2].text_content()) 9 | -------------------------------------------------------------------------------- /scripts/53.py: -------------------------------------------------------------------------------- 1 | # The total number of on-campus hate crimes as reported to the U.S. Office of Postsecondary Education, in the most recent collection year 2 | # hardcode the url to 2014 file 3 | # this task is just a mess, dependent on how well you can read 4 | # documentation and deal with the messy arrangement of columns 5 | from glob import glob 6 | from shutil import unpack_archive 7 | from xlrd import open_workbook 8 | import os 9 | import requests 10 | 11 | LOCAL_FNAME = '/tmp/ope2014excel.zip' 12 | LOCAL_DATADIR = "/tmp/ope2014excel" 13 | url = 'http://ope.ed.gov/security/dataFiles/Crime2014EXCEL.zip' 14 | # this is such a massive file that we should cache the download 15 | if not os.path.exists(LOCAL_FNAME): 16 | print("Downloading", url, 'to', LOCAL_FNAME) 17 | with open(LOCAL_FNAME, 'wb') as f: 18 | f.write(requests.get(url).content) 19 | 20 | # unzip 21 | print("Unzipping", LOCAL_FNAME, 'to', LOCAL_DATADIR) 22 | unpack_archive(LOCAL_FNAME, LOCAL_DATADIR, format = 'zip') 23 | # get filename 24 | fname = [f for f in glob(LOCAL_DATADIR + '/*.xlsx') if 'oncampushate' in f][0] 25 | # open workbook 26 | print("Opening", fname) 27 | book = open_workbook(fname) 28 | sheet = book.sheets()[0] 29 | data = [sheet.row_values(i) for i in range(sheet.nrows)] 30 | # get all column indices that correspond to relevant columns, i.e. 
31 | # 32 | # 266 LAR_T_RAC13 Num 8 Larceny 2013 By Bias Race 33 | # 267 LAR_T_REL13 Num 8 Larceny 2013 By Bias Religion 34 | # 268 LAR_T_SEX13 Num 8 Larceny 2013 By Bias Sexual Orientation 35 | # 269 LAR_T_GEN13 Num 8 Larceny 2013 By Bias Gender 36 | # 270 LAR_T_DIS13 Num 8 Larceny 2013 By Bias Disability 37 | # 271 LAR_T_ETH13 Num 8 Larceny 2013 By Bias Ethnicity 38 | wanted_heds = ['RAC13', 'REL13', 'SEX13', 'GEN13', 'DIS13', 'ETH13'] 39 | indices = [i for i, c in enumerate(data[0]) if any(t in c for t in wanted_heds)] 40 | crime_count = 0 41 | for row in data[1:]: 42 | for i in indices: 43 | if row[i]: 44 | crime_count += int(row[i]) 45 | print(crime_count) 46 | -------------------------------------------------------------------------------- /scripts/54.py: -------------------------------------------------------------------------------- 1 | # The number of people on FBI's Most Wanted List for white collar crimes 2 | import requests 3 | from lxml import html 4 | url = 'http://www.fbi.gov/wanted/wcc/@@wanted-group-listing' 5 | doc = html.fromstring(requests.get(url).text) 6 | print(len(doc.cssselect('.contenttype-FBIPerson'))) 7 | -------------------------------------------------------------------------------- /scripts/55.py: -------------------------------------------------------------------------------- 1 | # The number of Government Accountability Office reports and testimonies on the topic of veterans 2 | import requests 3 | import re 4 | from lxml import html 5 | url = 'http://www.gao.gov/browse/topic/Veterans' 6 | doc = html.fromstring(requests.get(url).text) 7 | txt = doc.cssselect('h2.scannableTitle')[0].text_content().strip() 8 | # 'Veterans (1 - 10 of 1,170 items)' 9 | v = re.search('of (\d+)', txt.replace(',', '')).groups()[0] 10 | print(int(v)) 11 | -------------------------------------------------------------------------------- /scripts/56.py: -------------------------------------------------------------------------------- 1 | # Number of times Rep. Darrell Issa's remarks have made it onto the Congressional Record 2 | from lxml import html 3 | import requests 4 | 5 | baseurl = "https://www.congress.gov/search" 6 | atts = {"source":"congrecord","crHouseMemberRemarks":"Issa, Darrell E. [R-CA]"} 7 | doc = html.fromstring(requests.get(baseurl, params = atts).text) 8 | t = doc.cssselect(".results-number")[0].text_content() 9 | print(t.split('of')[-1].strip().replace(',', '')) 10 | -------------------------------------------------------------------------------- /scripts/57.py: -------------------------------------------------------------------------------- 1 | # The top 3 auto manufacturers, ranked by total number of recalls via NHTSA safety-related defect and compliance campaigns since 1967. 
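# The counting itself is just collections.Counter: feed it one manufacturer name per
# recall record, then ask for the top three. A tiny, purely hypothetical example of
# that pattern (the names below are made up, not NHTSA data):
# >>> from collections import Counter
# >>> Counter(['Make A', 'Make B', 'Make A', 'Make C', 'Make A', 'Make B']).most_common(3)
# [('Make A', 3), ('Make B', 2), ('Make C', 1)]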
2 | import csv 3 | import requests 4 | 5 | from collections import Counter 6 | from io import BytesIO, TextIOWrapper 7 | from zipfile import ZipFile 8 | 9 | ZIP_URL = 'http://www-odi.nhtsa.dot.gov/downloads/folders/Recalls/FLAT_RCL.zip' 10 | # Schema comes from http://www-odi.nhtsa.dot.gov/downloads/folders/Recalls/RCL.txt 11 | MFGNAME_FIELD_NUM = 7 12 | counter = Counter() 13 | print("Downloading", ZIP_URL) 14 | resp = requests.get(ZIP_URL) 15 | with ZipFile(BytesIO(resp.content)) as zfile: 16 | fname = zfile.filelist[0].filename 17 | print("Unzipping...", fname) # note: the unpacked zip is 120MB+ 18 | with zfile.open(fname, 'rU') as zf: 19 | reader = csv.reader(TextIOWrapper(zf, encoding = 'latin-1'), delimiter = "\t") 20 | counter.update(row[MFGNAME_FIELD_NUM] for row in reader) 21 | 22 | for mfgname, count in counter.most_common(3): 23 | print("%s: %s" % (mfgname, count)) 24 | 25 | -------------------------------------------------------------------------------- /scripts/58.py: -------------------------------------------------------------------------------- 1 | # The number of published research papers from the NSA 2 | import requests 3 | from lxml import html 4 | url = 'https://www.nsa.gov/research/publications/index.shtml' 5 | doc = html.fromstring(requests.get(url).text) 6 | print(len(doc.cssselect('table.dataTable tr')[1:])) 7 | -------------------------------------------------------------------------------- /scripts/59.py: -------------------------------------------------------------------------------- 1 | # The number of university-related datasets currently listed at data.gov 2 | import requests 3 | import re 4 | url = 'http://catalog.data.gov/dataset?' 5 | atts = {'organization_type': 'University', 'sort': 'metadata_created desc'} 6 | txt = requests.get(url, params = atts).text 7 | print(re.search("[0-9,]+(?= *datasets found)", txt).group().replace(',', '')) 8 | -------------------------------------------------------------------------------- /scripts/6.py: -------------------------------------------------------------------------------- 1 | # From 2010 to 2013, the change in median cost of health, dental, and vision coverage for California city employees 2 | from shutil import unpack_archive 3 | from statistics import median 4 | import csv 5 | import os 6 | import requests 7 | LOCAL_DATADIR = "/tmp/capublicpay" 8 | BASE_URL = 'http://publicpay.ca.gov/Reports/RawExport.aspx?file=' 9 | YEARS = (2010, 2013) 10 | 11 | medians = [] 12 | for year in YEARS: 13 | basefname = '%s_City.zip' % year 14 | url = BASE_URL + basefname 15 | local_zname = "/tmp/" + basefname 16 | # this is such a massive file that we should cache the download 17 | if not os.path.exists(local_zname): 18 | print("Downloading", url, 'to', local_zname) 19 | data = requests.get(url).content 20 | with open(local_zname, 'wb') as f: 21 | f.write(data) 22 | # done downloading, now unzip files 23 | print("Unzipping", local_zname, 'to', LOCAL_DATADIR) 24 | unpack_archive(local_zname, LOCAL_DATADIR, format = 'zip') 25 | # each zip extracts a file named YEAR_City.csv 26 | csv_name = LOCAL_DATADIR + '/' + basefname.replace('zip', 'csv') 27 | # calculate median 28 | with open(csv_name, encoding = 'latin-1') as f: 29 | # first four lines are: 30 | # “Disclaimer 31 | # 32 | # The information presented is posted as submitted by the reporting entity. 
The State Controller's Office is not responsible for the accuracy of this information.” 33 | cx = list(csv.DictReader(f.readlines()[4:])) 34 | mx = median([float(row['Health Dental Vision']) for row in cx if row['Health Dental Vision']]) 35 | print("Median for %s" % year, mx) 36 | medians.append(mx) 37 | 38 | print(medians[-1] - medians[0]) 39 | -------------------------------------------------------------------------------- /scripts/60.py: -------------------------------------------------------------------------------- 1 | # Number of chapters in Title 20 (Education) of the United States Code 2 | import requests 3 | import re 4 | from lxml import html 5 | # this URL downloads the WHOLE code for education 6 | url = 'http://uscode.house.gov/view.xhtml?path=/prelim@title20&edition=prelim' 7 | print("Downloading", url) 8 | txt = requests.get(url).text 9 | doc = html.fromstring(''.join(txt.splitlines()[1:])) # skipping xml declaration 10 | # interpretation of number of chapters can vary...I'm going to go with 11 | # "highest number" 12 | titles = [t.text_content().strip() for t in doc.cssselect('h3.chapter-head strong')] 13 | m = re.search("(?<=CHAPTER )\d+", titles[-1]).group() 14 | print(m) 15 | 16 | -------------------------------------------------------------------------------- /scripts/61.py: -------------------------------------------------------------------------------- 1 | import requests 2 | from lxml import html 3 | url = "http://www.state.gov/secretary/travel/index.htm" 4 | resp = requests.get(url) 5 | x = html.fromstring(resp.text).cssselect('#total-mileage span') 6 | print(x[0].text_content()) 7 | -------------------------------------------------------------------------------- /scripts/62.py: -------------------------------------------------------------------------------- 1 | # For all of 2013, the number of potential signals of serious risks or new safety information that resulted from the FDA's FAERS 2 | import requests 3 | from urllib.parse import urljoin 4 | from lxml import html 5 | url = 'http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm082196.htm' 6 | doc = html.fromstring(requests.get(url).text) 7 | links = [a.attrib['href'] for a in doc.cssselect('li a') if '2013' in a.text_content()] 8 | x = 0 9 | for href in links: 10 | u = urljoin(url, href) 11 | d = html.fromstring(requests.get(u).text) 12 | els = d.cssselect("#content .middle-column table tr")[1:] 13 | x += len(els) 14 | print(x) 15 | -------------------------------------------------------------------------------- /scripts/63.py: -------------------------------------------------------------------------------- 1 | # In the current dataset behind Medicare's Nursing Home Compare website, the total amount of fines received by penalized nursing homes 2 | # landing page: 3 | # https://data.medicare.gov/data/nursing-home-compare 4 | import csv 5 | import os 6 | import requests 7 | from lxml import html 8 | from shutil import unpack_archive 9 | from urllib.parse import urljoin, urlparse, parse_qs 10 | LOCAL_DATADIR = "/tmp/nursinghomes" 11 | CSV_NAME = os.path.join(LOCAL_DATADIR, 'Penalties_Download.csv') 12 | os.makedirs(LOCAL_DATADIR, exist_ok = True) 13 | # 14 | # The zip URL looks like this: 15 | # https://data.medicare.gov/views/bg9k-emty/files/AsD4-xSfJuwZKwb_gMosljIKMST... 
16 | # TZ1PmBSoRGqivFmo?filename=DMG_CSV_DOWNLOAD20150501.zip&content_type=application%2Fzip%3B%20charset%3Dbinary 17 | 18 | # we assume that the zip file URL changes frequently and can't be hardcoded 19 | # so we go through the process of auto-magically determining that URL 20 | url = 'https://data.medicare.gov/data/nursing-home-compare' 21 | doc = html.fromstring(requests.get(url).text) 22 | zipurl = [a.attrib['href'] for a in doc.cssselect('a') 23 | if 'CSV_DOWNLOAD' in str(a.attrib.get('href'))][0] 24 | zipurl = urljoin(url, zipurl) 25 | bname = parse_qs(urlparse(zipurl).query)['filename'][0] 26 | zname = os.path.join(LOCAL_DATADIR, bname) 27 | if not os.path.exists(zname): 28 | print("Downloading", zipurl, 'to', zname) 29 | z = requests.get(zipurl).content 30 | with open(zname, 'wb') as f: 31 | f.write(z) 32 | print('Unzipping', zname, 'to', LOCAL_DATADIR) 33 | unpack_archive(zname, LOCAL_DATADIR) 34 | rows = list(csv.DictReader(open(CSV_NAME, encoding = 'ISO-8859-1'))) 35 | print(sum([float(r['fine_amt']) for r in rows if r['fine_amt']])) 36 | -------------------------------------------------------------------------------- /scripts/64.py: -------------------------------------------------------------------------------- 1 | # From March 1 to 7, 2015, the number of times in which designated FDA policy makers met with persons outside the U.S. federal executive branch 2 | # this is a hardcoded URL 3 | url = 'http://www.fda.gov/NewsEvents/MeetingsConferencesWorkshops/PastMeetingsWithFDAOfficials/ucm439318.htm' 4 | import requests 5 | print(requests.get(url).text.count('Event Date')) 6 | -------------------------------------------------------------------------------- /scripts/65.py: -------------------------------------------------------------------------------- 1 | # The number of failed votes in the roll calls 1 through 99, in the U.S. House of the 114th Congress 2 | import requests 3 | from lxml import html 4 | # There are many places to get roll call info, including the House Clerk: 5 | # http://clerk.house.gov/evs/2015/index.asp 6 | # We could programmatically find the target page but it's not 7 | # worth it for this exercise: 8 | URL = 'http://clerk.house.gov/evs/2015/ROLL_000.asp' 9 | doc = html.fromstring(requests.get(URL).text) 10 | # good ol' Xpath 11 | print(len(doc.xpath('//tr/td[5]/font[text()="F"]'))) 12 | # 28 13 | -------------------------------------------------------------------------------- /scripts/66.py: -------------------------------------------------------------------------------- 1 | # The highest minimum wage as mandated by state law. 
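# The regex helper below pulls a two-letter state code and a decimal wage out of each
# table cell. A doctest-style illustration with a made-up cell string (the real cell
# text on the DOL page may be formatted differently):
# >>> re.search('([A-Z]{2}).+?(\d+\.\d+)', 'AK - $9.75 per hour').groups()
# ('AK', '9.75')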
2 | import requests 3 | import re 4 | from lxml import html 5 | # helper foo 6 | def foo(c): 7 | m = re.search('([A-Z]{2}).+?(\d+\.\d+)', c.text_content()) 8 | if m: 9 | state, wage = m.groups() 10 | return (float(wage), state) 11 | else: 12 | return None 13 | 14 | url = 'http://www.dol.gov/whd/minwage/america.htm' 15 | doc = html.fromstring(requests.get(url).text) 16 | 17 | # easiest target is "Consolidated State Minimum Wage Update Table", 18 | # of which the first column is: "Greater than federal MW" 19 | 20 | # Love this elegant solution: find the text node, then search upwards with ancestor:: 21 | # http://stackoverflow.com/a/3923863/160863 22 | xstr = "//text()[contains(., 'Greater than federal MW')]/ancestor::table[1]//tr/td[1]" 23 | cols = [foo(c) for c in doc.xpath(xstr) if foo(c)] 24 | topcol = max(cols) 25 | 26 | print(topcol[1], topcol[0]) 27 | # DC 9.5 28 | 29 | -------------------------------------------------------------------------------- /scripts/68.py: -------------------------------------------------------------------------------- 1 | # Number of FDA-approved prescription drugs with GlaxoSmithKline as the applicant holder 2 | # landing page: 3 | # http://www.accessdata.fda.gov/scripts/cder/ob/docs/queryah.cfm 4 | import re 5 | import requests 6 | formurl = 'http://www.accessdata.fda.gov/scripts/cder/ob/docs/tempah.cfm' 7 | post_params = {'Sponsor': 'GlaxoSmithKline', 'table1': 'OB_Rx'} 8 | resp = requests.post(formurl, data = post_params) 9 | # Displaying records 1 to 88 of 88 10 | m = re.search('(?<=Displaying records) *[\d,]+ *to *[\d,]+ *of *([\d,]+)', resp.text) 11 | print(m.groups()[0]) 12 | -------------------------------------------------------------------------------- /scripts/69.py: -------------------------------------------------------------------------------- 1 | # The average number of comments on the last 50 posts on NASA's official Instagram account 2 | from urllib.parse import urljoin 3 | import os 4 | import requests 5 | DOMAIN = 'https://api.instagram.com/' 6 | USERNAME = 'nasa' 7 | ITEM_COUNT = 50 8 | # note: I've specified INSTAGRAM_TOKEN in my ~/.bash_profile 9 | atts = {'access_token': os.environ.get('INSTAGRAM_TOKEN')} 10 | # unless you know NASA's Instagram ID by memory, you'll 11 | # have to hit up the search endpoint to get it 12 | # docs: http://instagram.com/developer/endpoints/users/#get_users_search 13 | search_path = '/v1/users/search' 14 | search_url = urljoin(DOMAIN, search_path) 15 | searchatts = atts.copy() 16 | searchatts['q'] = USERNAME 17 | search_results = requests.get(search_url, params = searchatts).json() 18 | uid = search_results['data'][0]['id'] 19 | 20 | # now we can retrieve media information 21 | # http://instagram.com/developer/endpoints/users/#get_users_media_recent 22 | media_path = '/v1/users/%s/media/recent' % uid 23 | media_url = urljoin(DOMAIN, media_path) 24 | mediaatts = atts.copy() 25 | mediaatts['count'] = ITEM_COUNT 26 | # for whatever reason, the count of returned items is 27 | # always less than the requested count...so keep going 28 | # until we reach ITEM_COUNT 29 | items = [] 30 | while len(items) < 50: 31 | resp = requests.get(media_url, params = mediaatts).json() 32 | data = resp['data'] 33 | if len(data) > 0: 34 | items.extend(data) 35 | mediaatts['max_id'] = data[-1]['id'] 36 | else: 37 | break 38 | 39 | ccount = sum([i['comments']['count'] for i in items[0:ITEM_COUNT]]) 40 | print(ccount // len(items)) 41 | -------------------------------------------------------------------------------- /scripts/7.py: 
-------------------------------------------------------------------------------- 1 | # The number of listed federal executive agency internet domains 2 | # landing page: https://inventory.data.gov/dataset/fe9eeb10-2e90-433e-a955-5c679f682502/resource/b626ef1f-9019-41c4-91aa-5ae3f7457328 3 | import csv 4 | import requests 5 | url = "https://inventory.data.gov/dataset/fe9eeb10-2e90-433e-a955-5c679f682502/resource/b626ef1f-9019-41c4-91aa-5ae3f7457328/download/federalexecagncyintntdomains03302015.csv" 6 | resp = requests.get(url) 7 | data = list(csv.DictReader(resp.text.splitlines())) 8 | print(len(data)) 9 | -------------------------------------------------------------------------------- /scripts/70.py: -------------------------------------------------------------------------------- 1 | # The highest salary possible for a White House staffmember in 2014 2 | import csv 3 | import requests 4 | url = 'https://open.whitehouse.gov/api/views/i9g8-9web/rows.csv?accessType=DOWNLOAD' 5 | data = list(csv.DictReader(requests.get(url).text.splitlines())) 6 | 7 | def foo(d): 8 | return float(d['Salary'].replace('$', '')) 9 | 10 | print(max(data, key = foo)['Salary']) 11 | -------------------------------------------------------------------------------- /scripts/71.py: -------------------------------------------------------------------------------- 1 | # The percent increase in number of babies named Archer nationwide in 2010 compared to 2000, according to the Social Security Administration 2 | # landing page: 3 | # http://www.ssa.gov/oact/babynames/limits.html 4 | import csv 5 | import os 6 | import requests 7 | from shutil import unpack_archive 8 | 9 | LOCAL_DATADIR = "/tmp/babynames" 10 | os.makedirs(LOCAL_DATADIR, exist_ok = True) 11 | url = 'http://www.ssa.gov/oact/babynames/names.zip' 12 | zname = os.path.join(LOCAL_DATADIR, 'names.zip') 13 | # download the file 14 | if not os.path.exists(zname): 15 | print("Downloading", url, 'to', zname) 16 | z = requests.get(url).content 17 | with open(zname, 'wb') as f: 18 | f.write(z) 19 | # Unzip the data 20 | print('Unzipping', zname, 'to', LOCAL_DATADIR) 21 | unpack_archive(zname, LOCAL_DATADIR) 22 | d = {2010: 0, 2000: 0} 23 | for y in d.keys(): 24 | fname = os.path.join(LOCAL_DATADIR, "yob%d.txt" % y) 25 | rows = list(csv.reader(open(fname))) 26 | # each row looks like this: 27 | # Pamela,F,258 28 | d[y] += sum([int(r[2]) for r in rows if r[0] == 'Archer']) 29 | 30 | print(100 * (d[2010] - d[2000]) / d[2000]) 31 | 32 | 33 | -------------------------------------------------------------------------------- /scripts/72.py: -------------------------------------------------------------------------------- 1 | # The number of magnitude 4.5+ earthquakes detected worldwide by the USGS 2 | # landing page: 3 | # http://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php 4 | import csv 5 | import requests 6 | csvurl = 'http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_month.csv' 7 | rows = list(csv.DictReader(requests.get(csvurl).text.splitlines())) 8 | print(len(rows)) 9 | -------------------------------------------------------------------------------- /scripts/73.py: -------------------------------------------------------------------------------- 1 | # The total amount of contributions made by lobbyists to Congress according to the latest downloadable quarterly report 2 | from glob import glob 3 | from lxml import etree, html 4 | from shutil import unpack_archive 5 | import os 6 | import requests 7 | DATADIR = '/tmp/lobbying' 8 | os.makedirs(DATADIR, exist_ok 
= True) 9 | url = 'http://www.senate.gov/legislative/Public_Disclosure/contributions_download.htm' 10 | # get listing of databases 11 | doc = html.fromstring(requests.get(url).text) 12 | # assuming most recent file is at the top of the page 13 | zipurl = sorted(doc.xpath('//a[contains(@href, "zip")]/@href'))[-1] 14 | zname = os.path.join(DATADIR, os.path.basename(zipurl)) 15 | # Download the zip of the latest quarterly report 16 | if not os.path.exists(zname): 17 | print("Downloading", zipurl, 'to', zname) 18 | z = requests.get(zipurl).content 19 | with open(zname, 'wb') as f: 20 | f.write(z) 21 | # unzip it 22 | print("Unzipping", zname, 'to', DATADIR) 23 | unpack_archive(zname, DATADIR) 24 | 25 | ctotal = 0 26 | # each zip contains multiple xml files 27 | for x in glob(os.path.join(DATADIR, '*.xml')): 28 | xtxt = '\n'.join(open(x, encoding = 'utf-16').readlines()[1:]) 29 | xdoc = etree.fromstring(xtxt) 30 | ctotal += sum(float(c) for c in xdoc.xpath('//Contribution/@Amount')) 31 | 32 | # note: this is a naive summation, without regard to whether each 33 | # Contribution node is apples-to-apples, and if corrections are made later 34 | print(ctotal) 35 | -------------------------------------------------------------------------------- /scripts/74.py: -------------------------------------------------------------------------------- 1 | # The description of the bill most recently signed into law by the governor of Georgia 2 | from lxml import html 3 | import requests 4 | import re 5 | url = 'https://gov.georgia.gov/bills-signed' 6 | txt = requests.get(url).text 7 | hrefs = re.findall('(?<=/bills-signed/)\d{4}', txt) 8 | yrurl = url + '/' + max(hrefs) 9 | # e.g. https://gov.georgia.gov/bills-signed/2015 10 | doc = html.fromstring(requests.get(yrurl).text) 11 | # most recent bill is at the top 12 | print(doc.xpath('//tr/td[2]/a')[0].text_content()) 13 | -------------------------------------------------------------------------------- /scripts/75.py: -------------------------------------------------------------------------------- 1 | # Total number of officer-involved shooting incidents listed by the Philadelphia Police Department 2 | import requests 3 | from lxml import html 4 | url = "https://www.phillypolice.com/ois/" 5 | doc = html.fromstring(requests.get(url).text) 6 | x = 0 7 | for table in doc.cssselect('.ois-table'): 8 | x += len(table.cssselect('tr')) - 1 9 | print(x) 10 | -------------------------------------------------------------------------------- /scripts/76.py: -------------------------------------------------------------------------------- 1 | # The total number of publications produced by the U.S. 
Government Accountability Office 2 | import requests 3 | import re 4 | url = 'http://www.gao.gov/browse/date/custom' 5 | txt = requests.get(url).text 6 | # Browsing Publications by Date (1 - 10 of 53,004 items) in Custom Date Range 7 | mx = re.search('Browsing Publications by Date.+', txt).group() 8 | m = re.search('[,\d]+(?= +items)', mx).group() 9 | print(m) 10 | -------------------------------------------------------------------------------- /scripts/77.py: -------------------------------------------------------------------------------- 1 | # Number of Dallas officer-involved fatal shooting incidents in 2014 2 | import requests 3 | url = 'https://www.dallasopendata.com/resource/4gmt-jyx2.json' 4 | data = requests.get(url).json() 5 | records = [r for r in data if ('2014' in r['date'] 6 | and 'Deceased' in r['suspect_deceased_injured_or_shoot_and_miss'])] 7 | print(len(records)) 8 | -------------------------------------------------------------------------------- /scripts/78.py: -------------------------------------------------------------------------------- 1 | # Number of Cupertino, CA restaurants that have been shut down due to health violations in the last six months. 2 | import requests 3 | from lxml import html 4 | url = 'https://services.sccgov.org/facilityinspection/Closure/Index?sortField=sortbyEDate' 5 | doc = html.fromstring(requests.get(url).text) 6 | print(len([t for t in doc.cssselect('td') if 'CUPERTINO' in t.text_content()])) 7 | -------------------------------------------------------------------------------- /scripts/79.py: -------------------------------------------------------------------------------- 1 | # The change in airline revenues from baggage fees, from 2013 to 2014 2 | import requests 3 | from lxml import html 4 | # Note that the BTS provides CSV versions of each year 5 | # So using HTML parsing is the dumb way to do this. oh well 6 | BASE_URL = 'https://www.rita.dot.gov/bts/sites/rita.dot.gov.bts/files/subject_areas/airline_information/baggage_fees/html/%s.html' 7 | year_totes = {2013: 0, 2014: 0} 8 | 9 | for yr in year_totes.keys(): 10 | url = BASE_URL % yr 11 | resp = requests.get(url) 12 | doc = html.fromstring(resp.text) 13 | # Incredibly sloppy way of getting the total value from 14 | # the bottom-right cell of the table. oh well 15 | tval = doc.cssselect('tr td')[-1].text_content().strip() 16 | year_totes[yr] = int(tval.replace(',', '')) * 1000 # it's in 000s 17 | 18 | print(year_totes[2014] - year_totes[2013]) 19 | # 179236000 20 | -------------------------------------------------------------------------------- /scripts/8.py: -------------------------------------------------------------------------------- 1 | # The number of times when a New York heart surgeon's rate of patient deaths for all cardiac surgical procedures was "significantly higher" than the statewide rate, according to New York state's analysis. 
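# One caveat worth noting: Socrata JSON endpoints like the one below cap how many rows
# come back per request (historically 1,000 by default), so counting matches in a single
# response can undercount on large datasets. A hedged variation would pass SODA's $limit
# parameter explicitly, e.g. requests.get(url, params = {'$limit': 50000}).json(),
# with 50000 being an arbitrary ceiling picked for illustration.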
2 | import requests 3 | url = 'https://health.data.ny.gov/resource/dk4z-k3xb.json' 4 | xstr = 'Rate significantly higher than Statewide Rate' 5 | data = requests.get(url).json() 6 | records = [r for r in data if xstr in r['comparison_results']] 7 | print(len(records)) 8 | -------------------------------------------------------------------------------- /scripts/80.py: -------------------------------------------------------------------------------- 1 | # The total number of babies named Odin born in Colorado according to the Social Security Administration 2 | import shutil 3 | import requests 4 | url = 'http://www.ssa.gov/OACT/babynames/state/namesbystate.zip' 5 | # Downloading will take awhile... 6 | print("Downloading", url) 7 | resp = requests.get(url) 8 | # save to hard drive 9 | with open("/tmp/ssastates.zip", "wb") as f: 10 | f.write(resp.content) 11 | # unzip 12 | shutil.unpack_archive("/tmp/ssastates.zip", "/tmp") 13 | # open up the file 14 | rows = open("/tmp/CO.TXT").readlines() 15 | totes = 0 16 | for r in rows: 17 | if 'Odin' in r: 18 | totes += int(r.split(',')[4]) 19 | print(totes) 20 | 21 | -------------------------------------------------------------------------------- /scripts/81.py: -------------------------------------------------------------------------------- 1 | # The latest release date for T-100 Domestic Market (U.S. Carriers) statistics report 2 | from lxml import html 3 | from datetime import datetime 4 | import requests 5 | LANDING_PAGE_URL = 'http://www.transtats.bts.gov/releaseinfo.asp' 6 | doc = html.fromstring(requests.get(LANDING_PAGE_URL).text) 7 | a = doc.xpath("//a[contains(text(), 'T-100 Domestic Market (U.S. Carriers)')]")[0] 8 | tr = a.getparent().getparent() 9 | txt = tr.xpath("./td[3]/text()")[0] # e.g. ['8/13/2015:'] 10 | # messy 11 | dt = datetime.strptime(txt, '%m/%d/%Y:') 12 | print(dt.strftime("%Y-%m-%d")) 13 | # 2015-08-13 14 | -------------------------------------------------------------------------------- /scripts/82.py: -------------------------------------------------------------------------------- 1 | # In the most recent FDA Adverse Events Reports quarterly extract, the number of patient reactions mentioning "Death" 2 | # Note: I changed the original exercise to something a little more specific and challenging 3 | # 4 | # We *could* use the API: 5 | # https://open.fda.gov/api/reference/#query-syntax 6 | # After reading those docs, do you have no idea how to make even a simple call 7 | # for events and filter by date? Neither do I, so let's just go with 8 | # good ol' bulk data downloads: 9 | # http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm082193.htm 10 | import requests 11 | from io import BytesIO 12 | from zipfile import ZipFile 13 | from lxml import html 14 | from urllib.parse import urljoin 15 | from collections import Counter 16 | LANDING_PAGE_URL = "http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm082193.htm" 17 | doc = html.fromstring(requests.get(LANDING_PAGE_URL).text) 18 | # find the most recent FAERS ASCII zip file with good ol xpath: 19 | links = doc.xpath("//a[linktitle[contains(text(), 'ASCII')] and contains(@href, 'zip')]") 20 | # Presumably, they're listed in reverse chronological order: 21 | link = links[0] 22 | zipurl = urljoin(LANDING_PAGE_URL, link.attrib['href']) 23 | print("Downloading", link.text_content(), ":") 24 | print(zipurl) 25 | resp = requests.get(zipurl) # this is going to take awhile... 
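# A possible refinement, following the caching pattern used elsewhere in this repo
# (e.g. the os.path.exists() checks in 24.py and 36.py): save resp.content to a local
# zip so a re-run could read it from disk instead of re-downloading. A rough commented
# sketch, with a hypothetical path and an `import os` that would need to be added up top:
# zname = '/tmp/faers.zip'
# if not os.path.exists(zname):
#     with open(zname, 'wb') as f:
#         f.write(resp.content)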
26 | with ZipFile(BytesIO(resp.content)) as zfile: 27 | # the zip contains many files...we want the one labeled REACYYQX.txt 28 | # e.g. ascii/REAC15Q1.txt 29 | fname = next(x.filename for x in zfile.filelist if 30 | "REAC" in x.filename and "txt" in x.filename.lower()) 31 | print("Unzipping:", fname) 32 | data = zfile.read(fname).decode('latin-1').splitlines() 33 | # The data looks like this: 34 | # primaryid$caseid$pt$drug_rec_act 35 | # 100036412$10003641$Medication residue present$ 36 | # 100038593$10003859$Blood count abnormal$ 37 | # 100038593$10003859$Platelet count decreased$ 38 | # 100038603$10003860$Abdominal pain$ 39 | 40 | # Rather than programatically locating the "reaction" column, 41 | # e.g. "pt", I'm just going to hardcode it as the 42 | # 3rd (2nd via 0-index) column delimited by a `$` sign 43 | reactions = [row.split('$')[2].lower() for row in data] 44 | deaths = [r for r in reactions if 'death' in r] 45 | print("Out of %s reactions, %s mention 'death'" % (len(reactions), len(deaths))) 46 | # sample output for 2015Q1 47 | # Out of 873190 reactions, 14188 mention 'death' 48 | -------------------------------------------------------------------------------- /scripts/83.py: -------------------------------------------------------------------------------- 1 | # The sum of White House staffermember salaries in 2014 2 | import requests 3 | import csv 4 | url = "https://open.whitehouse.gov/api/views/i9g8-9web/rows.csv?accessType=DOWNLOAD" 5 | txt = requests.get(url).text 6 | totes = 0 7 | for r in csv.DictReader(txt.splitlines()): 8 | # remove $ sign, convert to float 9 | salval = float(r['Salary'].replace('$', '')) 10 | totes += salval 11 | print(totes) 12 | # 37776925.0 13 | -------------------------------------------------------------------------------- /scripts/84.py: -------------------------------------------------------------------------------- 1 | # The total number of notices published on the most recent date to the Federal Register 2 | import requests 3 | from lxml import html 4 | url = 'https://www.federalregister.gov/' 5 | doc = html.fromstring(requests.get(url).text) 6 | print(doc.cssselect('ul.statistics li a span')[0].text_content()) 7 | -------------------------------------------------------------------------------- /scripts/85.py: -------------------------------------------------------------------------------- 1 | # The number of iPhone units sold in the latest quarter, according to Apple Inc's most recent 10-Q report 2 | # This exercise is just mean...The intent is to lead students to SEC's EDGAR, 3 | # *not* to imply that scraping EDGAR is the ideal way to do this fact-finding 4 | import requests 5 | from lxml import html 6 | from urllib.parse import urljoin 7 | # The target URL looks like this: 8 | # http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000320193&type=10-Q&owner=exclude&count=40 9 | BASE_URL = 'http://www.sec.gov/cgi-bin/browse-edgar' 10 | AAPL_CIK = "0000320193" 11 | url_params = { 12 | 'CIK': AAPL_CIK, 13 | 'action': 'getcompany', 14 | 'type': '10-Q', 15 | 'owner':'exclude', 16 | 'count': 40} 17 | # do initial search for Apple's 10-Q forms 18 | resp = requests.get(BASE_URL, params = url_params) 19 | doc = html.fromstring(resp.text) 20 | hrefs = doc.xpath("//a[@id='documentsbutton']/@href") 21 | xurl = urljoin(BASE_URL, hrefs[0]) 22 | # fetch page for most recent 10-Q: 23 | xdoc = html.fromstring(requests.get(xurl).text) 24 | # this gets us a list of more documents. 
Select the URL 25 | # for the one with 10q in its name 26 | href10q = xdoc.xpath("//table[@class='tableFile']//a[contains(@href, '10q.htm')]/@href")[0] 27 | url10q = urljoin(BASE_URL, href10q) 28 | # one more request 29 | qdoc = html.fromstring(requests.get(url10q).text) 30 | # now for some truly convoluted parsing logic 31 | # First, an xpath trick: http://stackoverflow.com/questions/1457638/xpath-get-nodes-where-child-node-contains-an-attribute 32 | xtd = qdoc.xpath("//td[descendant::p[contains(text(), 'Unit Sales by Product:')]]")[0] 33 | # luckily there's only one such . 34 | # Data looks like this: 35 | # | 3 months | | | | (9 months) | | | 36 | # | Unit | June 27 | June 28 | | | | | 37 | # | sales | 2015 | 2015 | Change | | | | 38 | # |----------|---------|---------|--------|------------|---------|------| 39 | # | iPhone | 47,534 | 35,203 | 35% | 183,172 | 129,947 | 41% | 40 | # | iPad | 10,931 | 13,276 | -18% | 44,973 | 55,661 | -19% | 41 | # | Mac | 4,796 | 4,413 | 9% | 14,878 | 13,386 | 11% | 42 | 43 | xtr = xtd.getparent() # i.e. the enclosing tr...we need to move to the next tr 44 | # find the first row that has "iPhone" in it 45 | iphone_row = next(tr for tr in xtr.itersiblings() if 'iPhone' in tr.text_content()) 46 | # fourth column has the data, as cols 2 and 3 are padding: 47 | sales = int(iphone_row.xpath('td[@align="right"][1]/text()')[0].replace(',', '')) 48 | print(sales * 1000) # units are listed in thousands 49 | # 47534000 (for June 2015) 50 | -------------------------------------------------------------------------------- /scripts/86.py: -------------------------------------------------------------------------------- 1 | # Number of computer vulnerabilities in which IBM was the vendor in the latest Cyber Security Bulletin 2 | import requests 3 | from lxml import html 4 | from urllib.parse import urljoin 5 | url = 'https://www.us-cert.gov/ncas/bulletins' 6 | doc = html.fromstring(requests.get(url).text) 7 | href = doc.xpath('//*[@class="document_title"]/a/@href')[0] 8 | bulletin = html.fromstring(requests.get(urljoin(url, href)).text) 9 | trs = bulletin.xpath('//tr/td[1][contains(text(), "ibm")]') 10 | print(len(trs)) 11 | -------------------------------------------------------------------------------- /scripts/87.py: -------------------------------------------------------------------------------- 1 | # Number of airports with existing construction related activity 2 | import requests 3 | import re 4 | resp = requests.get('https://nfdc.faa.gov/xwiki/bin/view/NFDC/Construction+Notices') 5 | # obviously not something you do in an actual scraping solution but it gets the answer! 
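# A slightly sturdier variation on the same idea, parsing the page with lxml instead of
# regexing raw HTML -- assuming the notices really are plain <a> links whose hrefs end
# in .pdf (an assumption, untested against the live page):
# from lxml import html
# doc = html.fromstring(resp.text)
# print(len(doc.xpath('//a[contains(@href, ".pdf")]/@href')))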
6 | print(len(re.findall("Construction\+Notices/.+?\.pdf", resp.text))) 7 | -------------------------------------------------------------------------------- /scripts/88.py: -------------------------------------------------------------------------------- 1 | # The number of posts on TSA's Instagram account 2 | from urllib.parse import urljoin 3 | import os 4 | import requests 5 | DOMAIN = 'https://api.instagram.com/' 6 | USERNAME = 'tsa' 7 | # note: I've specified INSTAGRAM_TOKEN in my ~/.bash_profile 8 | atts = {'access_token': os.environ.get('INSTAGRAM_TOKEN')} 9 | # unless you know TSA's Instagram ID by memory, you'll 10 | # have to hit up the search endpoint to get it 11 | # docs: http://instagram.com/developer/endpoints/users/#get_users_search 12 | search_path = '/v1/users/search' 13 | searchatts = atts.copy() 14 | searchatts['q'] = USERNAME 15 | search_results = requests.get(urljoin(DOMAIN, search_path), params = searchatts).json() 16 | uid = search_results['data'][0]['id'] 17 | 18 | # now we can retrieve profile information 19 | # https://instagram.com/developer/endpoints/users/#get_users 20 | user_path = '/v1/users/%s/' % uid 21 | profile = requests.get(urljoin(DOMAIN, user_path), params = atts).json() 22 | print(profile['data']['counts']['media']) 23 | 24 | 25 | -------------------------------------------------------------------------------- /scripts/89.py: -------------------------------------------------------------------------------- 1 | # In fiscal year 2013, the short description of the most frequently cited type of FDA's inspectional observations related to food products. 2 | from collections import Counter 3 | from lxml import html 4 | from urllib.parse import urljoin 5 | from xlrd import open_workbook 6 | import requests 7 | import tempfile 8 | LANDING_PAGE_URL = 'http://www.fda.gov/ICECI/Inspections/ucm250720.htm' 9 | # The hardcoded URL for the Excel file is: 10 | # http://www.fda.gov/downloads/ICECI/Inspections/UCM381532.xls 11 | # But we'll programmatically find it 12 | doc = html.fromstring(requests.get(LANDING_PAGE_URL).text) 13 | # HTML looks like: 14 | # <a href="/downloads/ICECI/Inspections/UCM381532.xls"> 15 | # <linktitle>FY 2013 Excel File (XLS - 691KB)</linktitle> 16 | # </a> 17 | 18 | # i love xpath 19 | hrefs = doc.xpath("//a[linktitle[contains(text(), '2013')] and contains(@href, 'xls')]//@href") 20 | url = urljoin(LANDING_PAGE_URL, hrefs[0]) 21 | # eh just make a temp file 22 | t = tempfile.TemporaryFile() 23 | t.write(requests.get(url).content) 24 | t.seek(0) 25 | wb = open_workbook(file_contents=t.read()) 26 | # Each category has its own sheet; we need to find the one named "Foods" 27 | sheet = wb.sheet_by_name('Foods') 28 | # find the column that contains "Short Description" 29 | col_idx = next(idx for idx, txt in enumerate(sheet.row_values(0)) if "Short Description" == txt) 30 | c = Counter(sheet.row_values(r)[col_idx] for r in range(1, sheet.nrows)) # start at 1 to skip the header row 31 | print(""""%s" for %s observations""" % c.most_common(1)[0]) 32 | 33 | -------------------------------------------------------------------------------- /scripts/9.py: -------------------------------------------------------------------------------- 1 | # The number of roll call votes that were rejected by a margin of less than 5 votes, in the first session of the U.S. Senate in the 114th Congress 2 | # Note: this example shows how to scrape the Senate webpage, which is 3 | # the WRONG thing to do in practice.
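# (A hedged sketch of the XML-based alternative pointed to just below -- the
#  <vote>, <result>, <yeas> and <nays> tag names are assumptions to check against
#  the actual feed before using this:
#
#      import requests
#      from lxml import etree
#      XML_URL = 'http://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_114_1.xml'
#      root = etree.fromstring(requests.get(XML_URL).content)
#      close_rejections = 0
#      for vote in root.iter('vote'):
#          if vote.findtext('result') == 'Rejected':
#              yeas = int(vote.findtext('vote_tally/yeas'))
#              nays = int(vote.findtext('vote_tally/nays'))
#              if nays - yeas < 5:
#                  close_rejections += 1
#      print(close_rejections)
#  )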
Use the XML instead: 4 | # http://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_114_1.xml 5 | # via https://twitter.com/octonion/status/611296541941321731 6 | from lxml import html 7 | import requests 8 | import re 9 | congress_num = 114 10 | session_num = 1 11 | url = ('http://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_%s_%s.htm' 12 | % (congress_num, session_num)) 13 | 14 | doc = html.fromstring(requests.get(url).text) 15 | # unnecessarily convoluted xpath statement, which I do here 16 | # just so I can practice xpath statements 17 | # http://stackoverflow.com/questions/1457638/xpath-get-nodes-where-child-node-contains-an-attribute 18 | xstr = "//*[@id='contentArea']//table/tr[td[2][contains(text(), 'Rejected')]]" 19 | # i.e. find all tr elements that have a 2nd td child with text that contains "Rejected" 20 | xcount = 0 21 | for r in doc.xpath(xstr): 22 | yeas, nays = re.search('(\d+)-(\d+)', r.find('td').text_content()).groups() 23 | if (int(nays) - int(yeas) < 5): 24 | xcount += 1 25 | 26 | print(xcount) 27 | -------------------------------------------------------------------------------- /scripts/90.py: -------------------------------------------------------------------------------- 1 | # The currently serving U.S. congressmember with the most Twitter followers 2 | from math import ceil 3 | import csv 4 | import json 5 | import os 6 | import requests 7 | import tweepy 8 | # You need to have a Twitter account and register as a developer: 9 | # http://www.compjour.org/tutorials/getting-started-with-tweepy/ 10 | # Your credentials JSON file should look like this: 11 | # { 12 | # "access_token": "AAAA", 13 | # "access_token_secret": "BBBB", 14 | # "consumer_secret": "CCCC", 15 | # "consumer_key": "DDDDD" 16 | # } 17 | # Twitter helper methods 18 | DEFAULT_TWITTER_CREDS_PATH = '~/.creds/me.json' # put your own path here 19 | def get_api(credsfile = DEFAULT_TWITTER_CREDS_PATH): 20 | """ 21 | Takes care of the Twitter OAuth authentication process and 22 | creates an API-handler to execute commands on Twitter 23 | 24 | Arguments: 25 | - credsfile (str): the full path of the filename that contains a JSON 26 | file with credentials for Twitter 27 | 28 | Returns: 29 | A tweepy.api.API object 30 | 31 | """ 32 | fn = os.path.expanduser(credsfile) # get the full path in case the ~ is used 33 | c = json.load(open(fn)) 34 | # Get authentication token 35 | auth = tweepy.OAuthHandler(consumer_key = c['consumer_key'], 36 | consumer_secret = c['consumer_secret']) 37 | auth.set_access_token(c['access_token'], c['access_token_secret']) 38 | # create an API handler 39 | return tweepy.API(auth) 40 | 41 | # gets a whole bunch of profile information from a batch of screen_names 42 | BATCH_SIZE = 100 43 | def get_profiles_from_screen_names(snames): 44 | api = get_api() 45 | profiles = [] 46 | for i in range(ceil(len(snames) / BATCH_SIZE)): 47 | s = i * BATCH_SIZE 48 | bnames = snames[s:(s + BATCH_SIZE)] 49 | for user in api.lookup_users(screen_names = bnames): 50 | profiles.append(user._json) 51 | return profiles 52 | # Step 1. 53 | # Basically, you have to rejigger 18.py: 54 | # (The number of U.S. 
congressmembers who have Twitter accounts, according to Sunlight Foundation data) 55 | # info https://sunlightlabs.github.io/congress/#legislator-spreadsheet 56 | csvurl = 'http://unitedstates.sunlightfoundation.com/legislators/legislators.csv' 57 | rows = csv.DictReader(requests.get(csvurl).text.splitlines()) 58 | # note that spreadsheet includes non-sitting legislators, thus the use 59 | # of 'in_office' attribute to filter 60 | legislators = [r for r in rows if r['twitter_id'] and r['in_office'] == '1'] 61 | # now call twitter 62 | twitter_profiles = get_profiles_from_screen_names([x['twitter_id'] for x in legislators]) 63 | # match up legislators with profiles: 64 | for lx in legislators: 65 | ta = [t for t in twitter_profiles if lx['twitter_id'].lower() == t['screen_name'].lower()] 66 | lx['twitter_profile'] = ta[0] if ta else None 67 | 68 | def fooey(x): 69 | t = x['twitter_profile'] 70 | return t['followers_count'] if t else 0 71 | 72 | q = max(legislators, key = fooey) 73 | print(q['title'], q['firstname'], q['middlename'], q['lastname'], q['state']) 74 | # Sen John S. McCain AZ 75 | 76 | 77 | -------------------------------------------------------------------------------- /scripts/91.py: -------------------------------------------------------------------------------- 1 | # Number of stop-and-frisk reports from the NYPD in 2014 2 | from shutil import unpack_archive 3 | import csv 4 | import os 5 | import requests 6 | DATADIR = '/tmp/nypd' 7 | os.makedirs(DATADIR, exist_ok = True) 8 | zipurl = 'http://www.nyc.gov/html/nypd/downloads/zip/analysis_and_planning/2014_sqf_csv.zip' 9 | zname = os.path.join(DATADIR, os.path.basename(zipurl)) 10 | cname = os.path.join(DATADIR, '2014.csv') 11 | if not os.path.exists(zname): 12 | print("Downloading", zipurl, 'to', zname) 13 | z = requests.get(zipurl).content 14 | with open(zname, 'wb') as f: 15 | f.write(z) 16 | # unzip it 17 | unpack_archive(zname, DATADIR) 18 | 19 | data = list(csv.DictReader(open(cname, encoding = 'latin-1'))) 20 | print(len(data)) 21 | 22 | 23 | -------------------------------------------------------------------------------- /scripts/92.py: -------------------------------------------------------------------------------- 1 | # In 2012-Q4, the total amount paid by Rep. 
Aaron Schock to Lobair LLC, according to Congressional spending records, as compiled by the Sunlight Foundation 2 | # real-life reference: http://www.usatoday.com/story/news/politics/2015/02/19/schock-flights-charter-house-rules/23663247/ 3 | import csv 4 | import requests 5 | DATA_URL = 'http://assets.sunlightfoundation.com.s3.amazonaws.com/expenditures/house/2012Q4-detail.csv' 6 | SCHOCK_ID = 'S001179' # http://bioguide.congress.gov/scripts/biodisplay.pl?index=s001179 7 | print("Downloading", DATA_URL) 8 | resp = requests.get(DATA_URL) 9 | totalamt = 0 10 | for row in csv.DictReader(resp.text.splitlines()): 11 | if row['BIOGUIDE_ID'] == SCHOCK_ID and 'LOBAIR LLC' in row['PAYEE'].upper(): 12 | totalamt += float(row['AMOUNT']) 13 | print(totalamt) 14 | # 880.0 15 | -------------------------------------------------------------------------------- /scripts/93.py: -------------------------------------------------------------------------------- 1 | # Number of public Github repositories maintained by the GSA's 18F organization, as listed on Github.com 2 | import requests 3 | url = 'https://api.github.com/orgs/18F' 4 | data = requests.get(url).json() 5 | print(data['public_repos']) 6 | -------------------------------------------------------------------------------- /scripts/94.py: -------------------------------------------------------------------------------- 1 | # The New York City high school with the highest average math score in the latest SAT results 2 | 3 | # Notes: 4 | # As this is one of the last exercises that I've written out, it includes code 5 | # that is both lazy and convoluted. For example, this is the first time 6 | # I've tried openpyxl as opposed to xlrd for reading Excel files 7 | # and it shows: http://openpyxl.readthedocs.org/en/latest/index.html 8 | # 9 | # You can check out other scraping/spreadsheet parsing examples 10 | # in the repo to find cleaner ways of doing this kind of task. 11 | # 12 | # 13 | # 14 | # Landing page: 15 | # http://schools.nyc.gov/Accountability/data/TestResults/default.htm 16 | # 17 | ## Relevant text from the webpage: 18 | # The most recent school level results for New York City on the SAT. 19 | # Results are available at the school level for the graduating 20 | # seniors of 2014. For a summary report of SAT, PSAT, and AP achievement 21 | # for 2014, please click here. 22 | # 23 | ## Target URL looks like this: 24 | # http://schools.nyc.gov/NR/rdonlyres/CE9139F0-9F3A-4C42-ACB8-74F2D014802F/ 25 | # 171380/2014SATWebsite10214.xlsx 26 | # 27 | # It's just as likely that by next year, they'll redesign or restructure 28 | # the site. So this scraping code is unstable. But it works as of August 2015. 29 | import csv 30 | import requests 31 | from io import BytesIO 32 | from operator import itemgetter 33 | from urllib.parse import urljoin 34 | from lxml import html 35 | from openpyxl import load_workbook 36 | LANDING_PAGE_URL = 'http://schools.nyc.gov/Accountability/data/TestResults/default.htm' 37 | 38 | doc = html.fromstring(requests.get(LANDING_PAGE_URL).text) 39 | # instead of using xpath, let's just use a sloppy cssselect 40 | urls = [a.attrib.get('href') for a in doc.cssselect('a')] 41 | # that awkward `get` is because not all anchor tags have hrefs...and 42 | # this is why we use xpath...
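# (A hedged aside, kept as a comment so the cssselect version below stays as-is:
#  the same link could probably be grabbed in a single xpath step, e.g.
#
#      hrefs = doc.xpath("//a[contains(@href, 'SAT') and contains(@href, 'xls')]/@href")
#      _xurl = hrefs[0]
#
#  which assumes -- exactly like the generator expression below -- that the
#  spreadsheet's href contains both 'SAT' and 'xls'.)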
43 | _xurl = next(url for url in urls if url and 'SAT' in url and 'xls' in url) # blargh 44 | xlsx_url = urljoin(LANDING_PAGE_URL, _xurl) 45 | print("Downloading", xlsx_url) 46 | # download the spreadsheet...instead of writing to disk 47 | # let's just keep it in memory and pass it directly to load_workbook() 48 | xlsx = BytesIO(requests.get(xlsx_url).content) 49 | wb = load_workbook(xlsx) 50 | # The above command will print out a warning: 51 | # /site-packages/openpyxl/workbook/names/named_range.py:121: UserWarning: 52 | # Discarded range with reserved name 53 | # warnings.warn("Discarded range with reserved name") 54 | 55 | ### Dealing with the worksheet structure 56 | # The 2014 edition contains two worksheets, the first being "Notes" 57 | # and the second being "2014 SAT Results" 58 | # Let's write an agnostic function as if we didn't know how each year's 59 | # spreadsheet was actually structured 60 | sheet = next(s for s in wb.worksheets if "results" in s.title.lower()) 61 | # I don't understand openpyxl's API so I'm just going to 62 | # practice nested list comprehensions 63 | # Note that the first column is just an ID field which we don't care about 64 | rows = [[cell.value for cell in row[1:]] for row in sheet.iter_rows()] 65 | headers = rows[0] 66 | # make it into a list of dicts 67 | data = [dict(zip(headers, r)) for r in rows[1:]] 68 | # I think we can assume that the header will change every year/file 69 | # so let's write another agnostic iterating function to do a best guess 70 | mathheader = next(h for h in headers if 'math' in h.lower()) 71 | # Not every school has a number for this column 72 | mathschools = [d for d in data if isinstance(d[mathheader], int)] 73 | topschool = max(mathschools, key = itemgetter(mathheader)) 74 | # since we've done so much work to get here, 75 | # let's calculate the average of the averages -- which requires a 76 | # weighting of math score averages against the number of SAT takers 77 | # and include that in the printed answer 78 | 79 | # find the header that says '# of SAT Takers in 20XX': 80 | numheader = next(h for h in headers if 'takers' in h.lower()) 81 | total_takers = sum(s[numheader] for s in mathschools) 82 | mathsums = sum(s[mathheader] * s[numheader] for s in mathschools) 83 | mathavg = mathsums // total_takers 84 | tmp_answer = """{name} had the highest average SAT math score: {top_score} 85 | This was {diff_score} points higher than the city average of {avg_score} 86 | """ 87 | answer = tmp_answer.format(name = topschool['High School'], 88 | top_score = topschool[mathheader], 89 | diff_score = topschool[mathheader] - mathavg, 90 | avg_score = mathavg 91 | ) 92 | 93 | print(answer) 94 | # Output for 2014: 95 | # STUYVESANT HIGH SCHOOL had the highest average SAT math score: 737 96 | # This was 272 points higher than the city average of 465 97 | -------------------------------------------------------------------------------- /scripts/95.py: -------------------------------------------------------------------------------- 1 | # Since 2002, the most commonly occurring winning number in New York's Lottery Mega Millions 2 | from collections import Counter 3 | import requests 4 | c = Counter() 5 | data = requests.get('https://data.ny.gov/resource/5xaw-6ayf.json', params = {'$limit': 5000}).json() # the Socrata endpoint returns only 1,000 rows by default, so ask for enough to cover every drawing since 2002 6 | for d in data: 7 | c.update(d['winning_numbers'].split(' ')) 8 | 9 | print(c.most_common()[0][0]) 10 | -------------------------------------------------------------------------------- /scripts/96.py:
-------------------------------------------------------------------------------- 1 | # The number of scheduled arguments according to the most recent U.S. Supreme Court argument calendar 2 | from lxml import html 3 | from urllib.parse import urljoin 4 | import requests 5 | url = 'http://www.supremecourt.gov/oral_arguments/argument_calendars.aspx' 6 | index = html.fromstring(requests.get(url).text) 7 | # calendar is sorted chronologically, with latest in the last link 8 | href = index.xpath('//a[contains(text(), "HTML")]/@href')[-1] 9 | cal = html.fromstring(requests.get(urljoin(url, href)).text) 10 | pdfs = cal.xpath("//table//a[contains(@href, 'qp.pdf')]/@href") 11 | print(len(pdfs)) 12 | -------------------------------------------------------------------------------- /scripts/97.py: -------------------------------------------------------------------------------- 1 | # The New York school with the highest rate of religious exemptions to vaccinations 2 | import requests 3 | url = 'https://health.data.ny.gov/resource/5pme-xbs5.json' 4 | data = requests.get(url).json() 5 | 6 | def foo(d): 7 | return float(d['percentreligiousexemptions']) 8 | 9 | school = max([r for r in data if '2014' in r['report_period']], key = foo) 10 | print(school['schoolname']) 11 | -------------------------------------------------------------------------------- /scripts/98.py: -------------------------------------------------------------------------------- 1 | # The latest estimated population percent change for Detroit, MI, according to the latest Census QuickFacts summary. 2 | import requests 3 | from lxml import html 4 | url = 'http://quickfacts.census.gov/qfd/states/26/2622000.html' 5 | doc = html.fromstring(requests.get(url).text) 6 | # this is sloppy but quick 7 | col = doc.xpath('//td[contains(text(), "Population, percent change")]/following-sibling::td')[0] 8 | print(col.text_content()) 9 | -------------------------------------------------------------------------------- /scripts/99.py: -------------------------------------------------------------------------------- 1 | # According to the Medill National Security Zone, the number of chambered guns confiscated at airports by the TSA in 2014 2 | # http://nationalsecurityzone.org 3 | import csv 4 | import requests 5 | gdoc_url = 'https://docs.google.com/spreadsheets/d/1a65n2HIcBYG7VyZYfVnBXGDmEdR8NYSOF43dzkDIuwA/' 6 | txt = requests.get(gdoc_url + 'export', params = {'format': 'csv', 'gid': 0}).text 7 | # skip the first two lines, which are: 8 | # Mandatory credit, with link: TSA data compiled by Medill National Security Journalism Initiative. 9 | # Data is preliminary and extracted from the TSA Blog. TSA'S year-end totals may very slightly. 10 | rows = list(csv.DictReader(txt.splitlines()[2:])) 11 | print(len([r for r in rows if r['CHAMBERED?'] == 'Y'])) 12 | --------------------------------------------------------------------------------