├── README.md ├── SystemsReportingProfessors.xlsx ├── arrest.csv ├── class_01.md ├── class_02.md ├── class_03.md ├── class_04.md ├── class_05.md ├── class_06.md ├── class_07.md ├── class_08.md ├── class_09.md ├── class_10.md ├── class_11.md ├── class_12.md ├── class_13.md ├── class_14.md ├── class_15.md ├── class_16.md ├── class_17.md ├── class_18.md ├── class_19.md ├── class_20.md ├── class_21.md ├── outline.md ├── peer-evaluation-form-smpa3193.docx ├── self-evaluation-form-smpa3193.docx └── syllabus.md /README.md: -------------------------------------------------------------------------------- 1 | # Systems for Reporting 2 | Course materials for SMPA 3193, Systems for Reporting 3 | -------------------------------------------------------------------------------- /SystemsReportingProfessors.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dwillis/systems-for-reporting/39ee8040c18d2898c9431dafb5c85be40e6436d6/SystemsReportingProfessors.xlsx -------------------------------------------------------------------------------- /class_01.md: -------------------------------------------------------------------------------- 1 | ## Course introduction, Goals and How We'll Work 2 | 3 | #### Introductions 4 | 5 | * Who you are and why you're here 6 | * Who I am and why I'm here 7 | 8 | #### Goals 9 | 10 | * For each of you 11 | * For the class 12 | 13 | #### How We'll Work 14 | 15 | * In the open 16 | * Version control ([GitHub](https://www.github.com/)) 17 | * On the command line 18 | * In groups 19 | * Moving toward automation 20 | 21 | #### Things We Looked At 22 | 23 | * [FEC Itemizer](https://projects.propublica.org/itemizer/) 24 | * [John Snow's Map](https://www1.udel.edu/johnmack/frec682/cholera/) 25 | * [The Terminal](http://blog.teamtreehouse.com/introduction-to-the-mac-os-x-command-line) (on the Mac) 26 | 27 | #### Assignments for Jan. 
19 28 | 29 | * Read: [A fundamental way newspaper sites need to change](http://www.holovaty.com/writing/fundamental-change/) & be prepared to discuss. 30 | * Spreadsheet: Take 3 professors from [here](https://smpa.gwu.edu/faculty-directory) and, using the information on their detail pages, make a spreadsheet in which each row is one faculty member and the columns are whatever information you think is useful or helpful to know. Try to have one piece of information per column. There's no _right_ answer to this; I just want to see how you approach it. Email the spreadsheet (or provide a link if it's a Google Sheets file) to me before 11:10 a.m. on 1/19. 31 | * Make sure you have administrative rights on your laptop so you can install software. Install [Git](https://git-scm.com/), [Python](https://www.python.org/downloads/) (only if you have a Windows laptop - Mac OS comes with it) and sign up for a free account at [PythonAnywhere](https://www.pythonanywhere.com/). 32 | -------------------------------------------------------------------------------- /class_02.md: -------------------------------------------------------------------------------- 1 | 2 | #### Introduction 3 | 4 | * Terminal redux 5 | * Discussion on SMPA professor data 6 | * Discussion on [How Newspaper Websites Need to Change](http://www.holovaty.com/writing/fundamental-change/) 7 | 8 | #### Exercise 9 | 10 | * Using this [Senate committee assignment roster](http://www.senate.gov/general/committee_assignments/assignments.htm), can you reorganize a few rows of this in Excel so that it is more useful, so that you can ask specific questions of it? We'll upload it to GitHub, too. Then let's take a look at [this](http://www.senate.gov/general/committee_membership/committee_memberships_SSAF.xml). 
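The committee XML feed is the structured version of the roster page; as a minimal sketch, here is how Python's built-in `xml.etree.ElementTree` can turn it into rows. The snippet below is invented — the real feed's element names may differ:

```python
import xml.etree.ElementTree as ET

# A made-up snippet standing in for the Senate committee XML;
# the actual feed's element names and attributes may differ.
SAMPLE = """
<committee_membership>
  <committee code="SSAF">Agriculture, Nutrition, and Forestry</committee>
  <members>
    <member><name>Pat Roberts</name><state>KS</state></member>
    <member><name>Debbie Stabenow</name><state>MI</state></member>
  </members>
</committee_membership>
"""

def members_from_xml(xml_text):
    """Turn the XML into a list of (name, state) rows - one row per member."""
    root = ET.fromstring(xml_text)
    rows = []
    for member in root.iter("member"):
        rows.append((member.findtext("name"), member.findtext("state")))
    return rows

for name, state in members_from_xml(SAMPLE):
    print(name, state)
```

This is the payoff of structured data: what took manual reorganizing in Excel falls out of the XML in a few lines.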
11 | 12 | #### Now You Try 13 | 14 | * Using the [Web Frequency Indexer](http://www.lextutor.ca/freq/eng/) and Excel, calculate the frequency of words used in the ["I Have a Dream" speech](http://www.americanrhetoric.com/speeches/mlkihaveadream.htm). Upload the Excel file to GitHub. 15 | * Let's visualize that using [in-cell bar graphs](http://infosthetics.com/archives/2006/08/excel_in_cell_graphing.html). 16 | 17 | #### Assignments for Jan. 24 18 | 19 | * Read [Finding Stories in the Structure of Data](https://source.opennews.org/en-US/learning/finding-stories-structure-data/) 20 | * Read [Data State of Mind](https://mjwebster.github.io/DataJ/Other/DataStateofMind/DataStateofMind.pdf) 21 | * Using the [Web Frequency Indexer](http://www.lextutor.ca/freq/eng/) and Excel, calculate the frequency of words used in this [2009 Senate speech by Sen. Jim Inhofe](http://www.congress.gov/congressional-record/2009/09/22/senate-section/article/s9648-2/). Copy the results into Excel/Google Sheets. Then do an in-cell bar chart like we did in class, using the REPT() function, and calculate the percent of total words that each word represents. Email the Excel file to dwillis@gwu.edu. 22 | -------------------------------------------------------------------------------- /class_03.md: -------------------------------------------------------------------------------- 1 | 2 | #### Introduction 3 | 4 | * Quiz! 5 | * Review of GitHub process 6 | 7 | #### Working with Data in Excel 8 | 9 | * Sorting & Filtering with [NC GOP expenses](https://projects.propublica.org/itemizer/filing/1131333/schedule/sb) 10 | 11 | #### Assignments for Jan. 26 12 | 13 | * Sorting and filtering using [analytics.usa.gov](https://analytics.usa.gov/data/) data. Download the CSV file of "[Visits to all domains over 30 days](https://analytics.usa.gov/data/live/all-domains-30-days.csv)" and open it in Excel. 
Sort the data so that you can find which domain had the highest value for "avg_session_duration" and which domains had more than 50 "pageviews_per_session". Use a filter to find out how many NASA domains saw at least 100,000 visits. Put the answers to all of these into a text file (answers.txt) and upload it and the saved CSV file to your Github repository that you created today. Email dwillis@gwu.edu with the link to your GitHub repository. If something goes wrong, email the files directly to me. 14 | * Reading: [Automating Transparency](https://source.opennews.org/en-US/learning/automating-transparency/). We'll talk about this. 15 | -------------------------------------------------------------------------------- /class_04.md: -------------------------------------------------------------------------------- 1 | 2 | #### Reading/Discussion 3 | 4 | * [Automating Transparency](https://source.opennews.org/en-US/learning/automating-transparency/) 5 | 6 | #### Excel 7 | 8 | * [Pivot Tables](http://www.techonthenet.com/excel/pivottbls/create2011.php) 9 | * Using the [Sunday talk show guest data](https://raw.githubusercontent.com/TheUpshot/Sunday-Shows/master/guests.csv), find how many times Bernie Sanders appears. Then use a [pivot table](http://www.gcflearnfree.org/office2013/excel2013/27) to calculate the total number of appearances by each person on the list. Save the file *as an Excel file* and add it to your Github repository. 10 | 11 | #### Assignments for Jan. 31 12 | 13 | * Using the [Sunday talk show guest data](https://raw.githubusercontent.com/TheUpshot/Sunday-Shows/master/guests.csv) and a pivot table, calculate how many total male and female guests each show had. Upload the Excel file (be sure to save it as an Excel file) to your class repository. 14 | * Read [Finding Data](http://mjwebster.github.io/DataJ/Other/FindingData.pdf) 15 | * Read [Ideas by Beat](http://mjwebster.github.io/DataJ/Other/Ideasbybeat.html) 16 | * Install or upgrade Firefox. 
Install this [SQLite Add-on](https://addons.mozilla.org/en-US/firefox/addon/5817) 17 | -------------------------------------------------------------------------------- /class_05.md: -------------------------------------------------------------------------------- 1 | #### Reading 2 | 3 | * [Git cheat sheet](https://www.git-tower.com/blog/git-cheat-sheet/) 4 | 5 | #### Finding Data 6 | 7 | * [The Census Bureau](http://www.census.gov/) 8 | * [IPEDS](https://nces.ed.gov/ipeds/) 9 | * [FedStats](https://fedstats.sites.usa.gov/) 10 | * [MMWR](https://www.cdc.gov/mmwr/index.html) 11 | * [Senate votes](https://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_115_1.htm) 12 | 13 | #### Assignments for March 2 14 | 15 | * Read [Understanding Households and Relationships in Census Data](https://source.opennews.org/en-US/learning/understanding-households-and-relationships-census-/) 16 | * Excel: Using pivot tables and some percent of total formulas, complete this [New Yorker cartoons exercise](http://mjwebster.github.io/DataJ/spreadsheets/NewYorkerCartoonsExercise.docx) using [this data](http://mjwebster.github.io/DataJ/spreadsheets/new_yorker_cartoons.xlsx). Push your answers in a text file called `newyorker.txt` to GitHub. 17 | -------------------------------------------------------------------------------- /class_06.md: -------------------------------------------------------------------------------- 1 | #### Discussion 2 | 3 | * [Understanding Households and Relationships in Census Data](https://source.opennews.org/en-US/learning/understanding-households-and-relationships-census-/) 4 | * [Census Reporter](https://censusreporter.org/) 5 | 6 | #### More Finding Data 7 | 8 | * Importing Fixed Width files. [Fairfax arrests](http://www.fairfaxcounty.gov/police/crime/arrest.txt) 9 | 10 | 11 | #### Assignments for Feb.
7 12 | 13 | * Read [How Netflix Reverse Engineered Hollywood](http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/) 14 | 15 | * Excel: Download [this CSV of West Virginia 2012 general election results](http://services.sos.wv.gov/apps/elections/results/readfile.aspx?eid=13&type=StateCountyTotals&format=csv) and copy OFFICIAL county-level results (the file contains both official and unofficial results, so look for a column that indicates that) for President Obama and Joe Manchin (U.S. Senate race) into a separate sheet, so that each county is a row and the columns contain the vote totals and percentages for both candidates (you'll have columns with names like County, ObamaVotes, TotalVotes and ObamaPct, and ManchinVotes, TotalSenateVotes and ManchinPct). Be sure to save your file as an XLS file, not a CSV, then answer these questions: 16 | 17 | 1. In which counties did President Obama receive his highest and lowest percentages? 18 | 2. In how many counties did President Obama receive less than 50% of the vote? 19 | 3. What was the largest difference in the number of votes between Manchin and Obama? Which county? 20 | 4. What was the largest percentage difference between Manchin and Obama? Which county? 21 | 5. Did President Obama receive a larger percentage of the vote than Manchin in any counties? Which ones? 22 | 23 | Put the saved Excel file and your answers (in a file called wv_answers.txt) in your Github repository.
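The percentage columns in the West Virginia exercise are just a candidate's votes divided by the total. A quick sketch of that arithmetic in Python, using invented county numbers (not real results):

```python
# Hypothetical county rows: (county, obama_votes, total_votes).
# These are invented numbers, not real West Virginia results.
counties = [
    ("Kanawha", 40000, 80000),
    ("Berkeley", 15000, 45000),
]

def pct(votes, total):
    """Percent of the total, rounded to one decimal place."""
    return round(100.0 * votes / total, 1)

for county, obama, total in counties:
    print(county, pct(obama, total))
```

In Excel the equivalent is a formula like `=ObamaVotes/TotalVotes`, formatted as a percentage.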
24 | -------------------------------------------------------------------------------- /class_07.md: -------------------------------------------------------------------------------- 1 | #### Discussion 2 | 3 | * [How Netflix Reverse Engineered Hollywood](http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/) 4 | 5 | #### Databases 6 | 7 | * [Firefox SQLite Add-on](https://addons.mozilla.org/en-US/firefox/addon/5817) 8 | * [SQLite Tutorial](https://github.com/tthibo/SQL-Tutorial/blob/master/NICAR2015/part1_steps.textile) 9 | 10 | #### Assignments for Feb. 9 11 | 12 | * Resume Part 1 of the Tutorial at [this point](https://github.com/tthibo/SQL-Tutorial/blob/master/tutorial_files/part1.textile#importing-data-from-a-file) and continue through [Part 2 of the SQLite Tutorial](https://github.com/tthibo/SQL-Tutorial/blob/master/tutorial_files/part2.textile). Upload your `campaign_finance.sqlite` file to GitHub. 13 | -------------------------------------------------------------------------------- /class_08.md: -------------------------------------------------------------------------------- 1 | 2 | #### More SQL 3 | 4 | * Imports 5 | * Aggregates 6 | * Group by 7 | 8 | #### Assignments for Feb. 14 9 | 10 | * Reading: [Introducing Treasury.io](https://source.opennews.org/articles/introducing-treasuryio/) 11 | 12 | * SQL: Using your existing campaign_finance.sqlite file, download [this CSV file of expenditures](https://www.strongspace.com/shared/k2avxajk0l) by the Senate campaign of Josh Mandel. Import it, creating a table called mandel. Look at the CSV file in Excel before you define the fields in SQLite – be sure to define the zip field as VARCHAR, not INTEGER. Define the amount, month, day and year fields as INTEGER. 13 | 14 | Once you've done that, write queries to do the following, using wildcards (but not always) and GROUP BY: 15 | 16 | 1. Show the total amount of money spent in each state. 17 | 2. 
Show the total amount for each purpose, with the largest amount first. 18 | 3. Show the total amount of any expenditures related to direct mail. 19 | 4. Show the total amount spent for each month and year, with the largest amount first 20 | 5. Show the recipients and total amounts for Payroll expenses, but not payroll taxes or fees. 21 | 22 | Save your queries in a file called mandel.txt and upload that and your .sqlite file to GitHub. 23 | -------------------------------------------------------------------------------- /class_09.md: -------------------------------------------------------------------------------- 1 | 2 | ### Python and the command line 3 | 4 | * Log into [PythonAnywhere](https://www.pythonanywhere.com) and start a bash shell. 5 | * [Getting Started](https://ireapps.github.io/pycar/pycar_intro.html#/) 6 | * [An Informal Introduction to Python](https://docs.python.org/2/tutorial/introduction.html) 7 | * Create exercises repository on GitHub & clone into PythonAnywhere 8 | 9 | 10 | ### Assignments 11 | 12 | * Python/Command line: In your PythonAnywhere bash shell in the exercises directory, copy and run the following commands: 13 | 14 | * `pip install virtualenvwrapper` 15 | * `mkvirtualenv exercises` 16 | * `pip install jupyter` 17 | * `pip install agate` 18 | * `pip install urllib3[secure] pyopenssl ndg-httpsclient pyasn1 requests` 19 | 20 | Then share your console with my email address using the "Share with others" button. You can close your browser tab now, but don't kill the console listed under "Your consoles" on the "Consoles" tab of PythonAnywhere. 21 | 22 | * Read [Introducing Agate](https://source.opennews.org/articles/introducing-agate/) 23 | * Read [Scooped by Code](http://www.niemanlab.org/2013/12/scooped-by-code/) 24 | * Send me, by email, 2 or 3 topics/subjects you're interested in with a data component (or a possible one). 
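As we move from SQLite's GUI into Python, note that the GROUP BY queries from the Mandel assignment can also be run from a script with the standard library's `sqlite3` module. A sketch with an invented, in-memory version of the table (the column names mirror the assignment; the rows are made up):

```python
import sqlite3

# An in-memory stand-in for the 'mandel' table in campaign_finance.sqlite;
# the columns mirror the assignment, the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mandel (recipient TEXT, state TEXT, purpose TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO mandel VALUES (?, ?, ?, ?)",
    [
        ("Acme Mail Co", "OH", "Direct Mail", 5000),
        ("Acme Mail Co", "OH", "Direct Mail Postage", 1200),
        ("Jane Smith", "DC", "Payroll", 3000),
    ],
)

# Total spent in each state, largest first - the same GROUP BY idea
# as question 1 in the Mandel assignment.
by_state = conn.execute(
    "SELECT state, SUM(amount) FROM mandel GROUP BY state ORDER BY SUM(amount) DESC"
).fetchall()
print(by_state)

# A wildcard with LIKE, as in the direct-mail question.
mail_total = conn.execute(
    "SELECT SUM(amount) FROM mandel WHERE purpose LIKE '%Mail%'"
).fetchone()[0]
print(mail_total)
```

Running SQL from Python is a first step toward automating an analysis instead of re-clicking through a GUI.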
25 | -------------------------------------------------------------------------------- /class_10.md: -------------------------------------------------------------------------------- 1 | 2 | ### Reading CSV files 3 | 4 | * Start [here](https://github.com/dwillis/smpa3193-exercises/blob/master/01-reading-csv-files.md) 5 | 6 | ### Agate Walkthrough 7 | 8 | * Start [here](https://github.com/wireservice/agate/blob/master/tutorial.ipynb) 9 | 10 | ### Assignments for Feb. 21 11 | 12 | * In your local Terminal or PythonAnywhere console, finish the exercise we started today, picking up [here](https://github.com/dwillis/smpa3193-exercises/blob/master/01-reading-csv-files.md#working-with-an-online-csv-file). Run the `pip` command from the bash shell, then start a Python session. After you run through it the first time, change the commands so that you filter on the name "ELIJAH" like we did with Irene. 13 | * Try to walk through the exercise described [here](https://github.com/dwillis/smpa3193-exercises/blob/master/02-agate-exercise.md). It will look different because when you run `jupyter notebook` it will launch a browser locally and you'll type your commands (and run them) in that rather than the Terminal. Don't worry too much if you run into issues; email me! 14 | * Reading: [Mockingjay](https://source.opennews.org/articles/mockingjay/) 15 | * Project ideas: Remember the ideas you sent me? Try to think of a potential data source about one or both of them that you could analyze, and email that to me along with what you think the focus could be. For example: "I really like cricket, and I was thinking about using the [Cricsheet](http://cricsheet.org/) data to see which batting partnership has been the most productive in Test matches." Try to be more specific.
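The filtering step in the CSV exercise boils down to reading rows and keeping the ones that match. A minimal sketch with Python's built-in `csv` module, using a made-up sample in place of the exercise file:

```python
import csv
import io

# A tiny made-up stand-in for the CSV used in the exercise;
# the real file has more columns and many more rows.
SAMPLE = """name,sex,count
IRENE,F,312
ELIJAH,M,958
MARIA,F,701
"""

def rows_matching(csv_text, name):
    """Return the rows whose 'name' column matches, as dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["name"] == name]

print(rows_matching(SAMPLE, "ELIJAH"))
```

Swapping "IRENE" for "ELIJAH" in a filter like this is exactly the kind of one-line change the assignment asks for.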
16 | -------------------------------------------------------------------------------- /class_11.md: -------------------------------------------------------------------------------- 1 | ### Agate 2 | 3 | * [https://github.com/wireservice/agate](https://github.com/wireservice/agate) 4 | * [https://github.com/dwillis/smpa3193-exercises/blob/master/03-agate-exercise.md](https://github.com/dwillis/smpa3193-exercises/blob/master/03-agate-exercise.md) 5 | 6 | ### Assignments for Feb. 28 7 | 8 | * Python: Finish [Agate exercise](https://github.com/dwillis/smpa3193-exercises/blob/master/03-agate-exercise.md), saving your jupyter notebook and pushing it to GitHub. The [Agate cookbook](http://agate.readthedocs.io/en/1.5.5/cookbook.html) might help. 9 | * Reading: [How Bernie Sanders Raises All That Money](http://www.buzzfeed.com/johntemplon/how-bernie-sanders-raises-all-that-money) and the [code](https://github.com/BuzzFeedNews/2016-04-bernie-sanders-donors) behind the analysis. Check out the [Python notebook](https://github.com/BuzzFeedNews/2016-04-bernie-sanders-donors/tree/master/notebooks), too. 10 | -------------------------------------------------------------------------------- /class_12.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ### Buzzfeed on Bernie 4 | 5 | 6 | ### Agate Exercise 7 | 8 | * [County population estimates](https://github.com/dwillis/smpa3193-exercises/blob/master/04-agate-exercise.md) 9 | 10 | ### Assignment for March 2 11 | 12 | * Background on [Adam Harris](https://www.clippings.me/adamharris), our guest speaker who will talk about using social platforms for reporting. 13 | 14 | ### Assignments for March 7 15 | 16 | * Picking up with the county population estimates file above, finish the exercise and commit it to Github. 17 | * Meet with your group to talk about potential ideas for your project. 
Remember, it could be a bot like CongressEdits or Mockingjay, or it could be a process for automating the analysis of data like we've done with Agate & Jupyter notebook. Or it could be something else entirely, as long as it is a system of some kind that automates a task. Each group will be listed [here](https://github.com/orgs/smpa3193-projects/teams). Create a repository in your group and call it `group-{number}-ideas`, and add a text file summarizing your discussions so far. 18 | -------------------------------------------------------------------------------- /class_13.md: -------------------------------------------------------------------------------- 1 | ### Reading 2 | 3 | * [Getting Data from the Web](http://datajournalismhandbook.org/1.0/en/getting_data_3.html) 4 | 5 | ### Scraping! 6 | 7 | * Fork [this repository](https://github.com/SMPA3193/first-web-scraper) to your GitHub account. 8 | * Clone it locally on your Desktop, open Terminal, navigate to your desktop and then `cd first-web-scraper` 9 | * Next, create a new virtual environment: `mkvirtualenv scraper` 10 | * `pip install requests` 11 | * `pip install BeautifulSoup` 12 | 13 | We'll be scraping from [this page](https://pressgallery.house.gov/member-data/demographics/women) 14 | 15 | ### Assignments 16 | 17 | * Scraping: using the `first-web-scraper` repository we worked on in class, copy the congress/scrape.py file into the crime folder. Open it in TextWrangler and replace the House url with [this one](http://www.tdcj.state.tx.us/death_row/dr_scheduled_executions.html) and then adjust the script to extract the upcoming Texas executions into a new CSV file named `executions.csv` (you'll need to replace the filename in the `scrape.py` file and change the headers, too, and you'll have to find the right HTML table to pull from). Not sure where to start? Try running the crime/scrape.py script. If you get an error, make a change to the line the error occurs on. 
Hint: Print out the row you're working with to see it. Push your finished crime scraper to Github. 18 | * Final Projects: Give me an update on what data source(s) you'd like to use in your project. Haven't decided? Let me know the candidates or any other issues. Put it all in a text file called `update_032317.txt` and push it to your Github group. 19 | * [Public Info Doesn't Always Want to be Free](https://source.opennews.org/en-US/learning/public-info-doesnt-always-want-be-free/) 20 | -------------------------------------------------------------------------------- /class_14.md: -------------------------------------------------------------------------------- 1 | 2 | ### More Scraping 3 | 4 | * Picking up where we left off with our congress scraper - let's fix that party/state thing. 5 | 6 | ### Discussion 7 | 8 | * NYT: [Second Avenue Subway Relieves Crowding on Neighboring Lines](https://www.nytimes.com/2017/02/01/nyregion/second-avenue-subway-relieves-crowding-on-neighboring-lines.html) 9 | 10 | ### Assignments for March 28 11 | 12 | * Scraping: [Columbian College faculty research grants](https://columbian.gwu.edu/2015-2016). In your first-web-scraper/scrapers directory, create a new folder called `college` and, inside it, a file called `scrape.py`. Let's adapt our previous Congress or crime scraper to pull the information from the research grants table and write out a CSV file, but with a twist: let's not just do 2015-2016, but back to 2010-2011 as well, and add a column for the academic year. Remember, we're doing the same thing for each of those pages, so try not to repeat blocks of code. 13 | * Project Updates: Due by Tuesday, pushed to Github. Tell me what your group has done/decided this week and what it plans to do next week. If you have a pressing question or issue, email me. 14 | * Install [http://tabula.technology/](http://tabula.technology/) on your laptops. 
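The multi-year grants scraper above is a loop over one per-year function rather than repeated blocks of code. A structural sketch — `parse_grants` is a hypothetical stand-in for the real requests + BeautifulSoup work, and the URL pattern for years before 2015-2016 is an assumption:

```python
# One function handles a single academic year; a loop calls it for each
# year, so no block of code is repeated.
# parse_grants() is a hypothetical stand-in for the real requests +
# BeautifulSoup fetching and parsing; here it just returns a canned row.
def parse_grants(url):
    return [("Jane Doe", "Political Science", "$5,000")]

def scrape_years(first, last):
    rows = []
    for start in range(first, last):
        year = "{}-{}".format(start, start + 1)
        # The 2015-2016 URL comes from the assignment; assuming earlier
        # years follow the same pattern.
        url = "https://columbian.gwu.edu/{}".format(year)
        for name, dept, amount in parse_grants(url):
            # Add the academic year as an extra column, per the assignment.
            rows.append((year, name, dept, amount))
    return rows

rows = scrape_years(2010, 2016)
print(len(rows), rows[0][0], rows[-1][0])
```

The same loop body then feeds a single `csv.writer`, so all six years land in one output file.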
15 | -------------------------------------------------------------------------------- /class_15.md: -------------------------------------------------------------------------------- 1 | 2 | ### PDFs 3 | 4 | * [GWU Crime Log](https://police.gwu.edu/crime-log) 5 | * [Tabula](http://tabula.technology/) 6 | 7 | ### Assignments 8 | 9 | * PDFs: take [this file](http://files.peacecorps.gov/university-rankings/2017/topschools2017.pdf) of top Peace Corps volunteer-producing schools and, using [Tabula](http://tabula.technology/) (installing it if you haven't already), draw boxes around the schools, their rankings and number of alumni volunteers so that you can create a single CSV file with the following headers: category (values of large, medium or small), rank, college, volunteers. Save that CSV file in your exercises repository and push it to Github. Need help? Check the "How to Use Tabula" section [here](http://tabula.technology/). Then, send me by email a story idea based on this data, one in which the data plays a big part. For reference, see [other stories](https://www.google.com/search?aq=2&oq=peace+corps+college&sourceid=chrome&ie=UTF-8&q=peace+corps+college+rankings#q=peace+corps+college+rankings&tbm=nws&*) about these rankings. 10 | * Reading (again, but we didn't talk about it): [Public Info Doesn't Always Want to be Free](https://source.opennews.org/articles/public-info-doesnt-always-want-be-free/). 11 | -------------------------------------------------------------------------------- /class_16.md: -------------------------------------------------------------------------------- 1 | 2 | ### Peace Corps discussion 3 | 4 | ### Public Information & Privacy 5 | 6 | * [Public Info Doesn't Always Want to be Free](https://source.opennews.org/articles/public-info-doesnt-always-want-be-free/) 7 | 8 | ### Assignments for April 4 9 | 10 | * Scraping: in your first-web-scraper/scrapers directory, create a new folder called `sports` and inside it create a new `scrape.py` file. 
Adapt one of your earlier scrapers to grab [the list of Washington Nationals transactions](http://m.mlb.com/was/roster/transactions/). The output file should be called `transactions.csv` and should have the following headers: date, url and text. That means you'll need to find the link inside each row and extract it. Google is your friend, and specific googling like "scraping urls in beautiful soup" is even better. Push your scrape.py file and your transactions.csv file to your repository, and email me a couple of sentences describing any issues you see with the output (or you can try to solve them in the script, too). 11 | * Project Updates: Due by Tuesday, pushed to Github. Tell me what your group has done/decided this week and what it plans to do next week. If you have a pressing question or issue, email me. 12 | * Reading: [John Snow's data journalism](https://www.theguardian.com/news/datablog/2013/mar/15/john-snow-cholera-map) 13 | -------------------------------------------------------------------------------- /class_17.md: -------------------------------------------------------------------------------- 1 | ### Mapping 2 | 3 | * [John Snow's Map](http://maps.grammata.com/imageviewer/SnowMap.html) 4 | * [Mapping Homicides](http://web.archive.org/web/20170106233206/https://blog.apps.chicagotribune.com/2010/03/04/quickly-visualize-and-map-a-data-set-using-google-fusion-tables/) 5 | 6 | ### Assignments 7 | 8 | * Mapping: make sure that you have a Google account that can sign into and use [Fusion Tables](https://fusiontables.google.com). If you can, take this [CSV file of Chicago homicides](https://raw.githubusercontent.com/dwillis/smpa3193-exercises/master/allhomicides.csv), download it and then upload it into Fusion Tables.
Using the [exercise](http://web.archive.org/web/20170106233206/https://blog.apps.chicagotribune.com/2010/03/04/quickly-visualize-and-map-a-data-set-using-google-fusion-tables/), start with the "Add a geocodable column to the spreadsheet" part and try to finish it. Using the Share button on the top right, change the access rules so that Anyone who has the link can access, then send the link to me via email. 9 | -------------------------------------------------------------------------------- /class_18.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | ### Mapping 4 | 5 | * [Mapping Homicides](http://web.archive.org/web/20170106233206/https://blog.apps.chicagotribune.com/2010/03/04/quickly-visualize-and-map-a-data-set-using-google-fusion-tables/) 6 | 7 | ### PaperBot 8 | 9 | * [https://github.com/edsu/paperbot](https://github.com/edsu/paperbot) 10 | 11 | 12 | ### Assignments for April 11 13 | 14 | * Mapping: Take the Fairfax Arrests file (which is in your `agate_exercises` directory, or [here](https://github.com/SMPA3193/agate_exercises/blob/master/arrest.csv)), and upload it to Fusion Tables. Make sure that the Address field is a "Location" field (use the Edit -> Change Columns drop down menu to check) and then geocode the file using that Address field. Visualize the results on a map and then click the "Share" button at the top right and enter my email (dwillis@gwu.edu) in the Add People box. 15 | * Group updates: commit/push or email them by April 11.
16 | * Read about Jeremy Bowers' projects: [The Roberts Court's Surprising Move Leftward](http://www.nytimes.com/interactive/2015/06/23/upshot/the-roberts-courts-surprising-move-leftward.html), [NYT Docket](https://github.com/newsdev/nyt-docket) & [NFL Stats](https://github.com/jeremyjbowers/nfl-stats), and [Rachel Shorey](http://rachelshorey.com/)'s work [here](https://www.nytimes.com/2016/10/28/us/politics/money-flows-down-ballot-as-donald-trump-is-abandoned-by-big-donors-even-himself.html) and [here](https://twitter.com/trashtalkbot). 17 | -------------------------------------------------------------------------------- /class_19.md: -------------------------------------------------------------------------------- 1 | 2 | ### Project Updates 3 | 4 | ### Useful Python Libraries 5 | 6 | * [Arrow](http://arrow.readthedocs.io/en/latest/) for dates and times 7 | * [DateFinder](http://datefinder.readthedocs.io/en/latest/) finds dates in text 8 | * [ftfy](http://ftfy.readthedocs.io/en/latest/) fixes weird Unicode text representations 9 | * [Geocoder](http://geocoder.readthedocs.io/) 10 | * [sqlite3](https://docs.python.org/3.4/library/sqlite3.html) for reading/writing SQLite databases 11 | * [csvkit](https://github.com/wireservice/csvkit) for working with CSV files 12 | * [Libextract](https://github.com/datalib/libextract) extracts text from websites 13 | * [Counting Frequencies](http://programminghistorian.org/lessons/counting-frequencies) 14 | 15 | ### Some Python projects 16 | 17 | * [Prince George's County Circuit Court Docket Scraper](https://github.com/SMPA3193/team_1) 18 | * [WMATA TwitterBot](https://github.com/SMPA3193/team_2/blob/master/FINAL%20ASSIGNMENT%20FILES/wmataFINAL.py) 19 | * [Pizza, Twitter and APIs](http://nealcaren.web.unc.edu/pizza-twitter-and-apis/) - uses earlier version of Twitter API, but useful for process.
20 | * [Looking for Journalists in Twitter](https://github.com/edsu/journos) 21 | * [Nebraska Heat Waves](https://github.com/mattwaite/JOUR407-Data-Journalism/blob/master/Examples/FebruaryHeatWave.ipynb) 22 | 23 | ### Assignments for April 20 24 | 25 | * Reading: [Why Cops Shoot](http://www.tampabay.com/projects/2017/investigations/florida-police-shootings/if-youre-black/) and [You Draw It: Just How Bad Is the Drug Overdose Epidemic?](https://www.nytimes.com/interactive/2017/04/14/upshot/drug-overdose-epidemic-you-draw-it.html) 26 | -------------------------------------------------------------------------------- /class_20.md: -------------------------------------------------------------------------------- 1 | ### Readings 2 | 3 | * Reading: [Why Cops Shoot](http://www.tampabay.com/projects/2017/investigations/florida-police-shootings/if-youre-black/) and [You Draw It: Just How Bad Is the Drug Overdose Epidemic?](https://www.nytimes.com/interactive/2017/04/14/upshot/drug-overdose-epidemic-you-draw-it.html) 4 | 5 | * [The Stack: UCLA research funding](http://stack.dailybruin.com/2017/02/23/research-funding/) 6 | 7 | * [Using Python's collections to count things](http://stackoverflow.com/questions/20510768/python-count-frequency-of-words-in-a-list) 8 | 9 | * [Loading CSV data into a SQLite database from Python](http://stackoverflow.com/questions/5942402/python-csv-to-sqlite) 10 | 11 | * Remember Jupyter notebooks? Use them for your scripts! 
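The word-counting links above boil down to `collections.Counter` from the standard library. A minimal sketch on a tiny sample of text:

```python
from collections import Counter

# Word frequencies with collections.Counter - the same idea as the
# counting links above, on a tiny sample of text.
text = "free at last free at last thank god almighty we are free at last"
counts = Counter(text.split())

# most_common() sorts by frequency, highest first.
for word, count in counts.most_common(3):
    print(word, count)
```

This replaces the Web Frequency Indexer step from earlier in the semester with three lines of Python.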
12 | 13 | ### Project Questions 14 | 15 | ### Course Evaluations 16 | -------------------------------------------------------------------------------- /class_21.md: -------------------------------------------------------------------------------- 1 | ### Housekeeping 2 | 3 | * Grading update 4 | * Group evaluations ([Peer](https://github.com/dwillis/systems-for-reporting/raw/master/peer-evaluation-form-smpa3193.docx) and [Self](https://github.com/dwillis/systems-for-reporting/raw/master/self-evaluation-form-smpa3193.docx)) 5 | 6 | ### Project Questions 7 | 8 | 9 | ### Stupid Computer Tricks 10 | 11 | * Wget ([Windows](https://eternallybored.org/misc/wget/) and via [Homebrew](https://brew.sh/) on Mac) 12 | * [Cloud computing](https://aws.amazon.com/) 13 | * [Github Pages](https://pages.github.com/) 14 | * [Building Tables with JavaScript](http://bl.ocks.org/ndarville/7075823) 15 | -------------------------------------------------------------------------------- /outline.md: -------------------------------------------------------------------------------- 1 | ## Course Outline for SMPA 3193: Systems for Reporting 2 | 3 | This outline is just that - an outline, and subject to change depending on our progress and interests. When it does, you'll know what changed and why, thanks to version control. 4 | 5 | * Jan. 17 - Course introduction, goals and how we'll work 6 | * Jan. 19 - Data, structured and unstructured 7 | * Jan. 24 - Summarizing data using spreadsheets 8 | * Jan. 26 - Summarizing data using spreadsheets 9 | * Jan. 31 - Finding Data 10 | * Feb. 2 - Importing and data formats 11 | * Feb. 7 - Databases and SQL 12 | * Feb. 9 - More SQL 13 | * Feb. 14 - The command line and Python 14 | * Feb. 16 - The command line and Python 15 | * Feb. 21 - The command line and Python 16 | * Feb. 23 - Converting & cleaning data and Agate 17 | * Feb. 
28 - Converting & cleaning data and Agate 18 | * March 2 - NICAR CONFERENCE - Guest Speaker TBD 19 | * March 7 - Let's make a bot 20 | * March 9 - Let's make a bot 21 | * March 14 - NO CLASS (SPRING BREAK) 22 | * March 16 - NO CLASS (SPRING BREAK) 23 | * March 21 - Scraping 24 | * March 23 - Scraping 25 | * March 28 - Working with PDFs 26 | * March 30 - Working with PDFs 27 | * April 4 - Mapping 28 | * April 6 - Mapping 29 | * April 11 - Guest Speaker: How the Internet Works 30 | * April 13 - Guest Speaker: Rachel Shorey 31 | * April 18 - In-class project time 32 | * April 20 - In-class project time 33 | * April 25 - In-class project time 34 | * April 27 - Presentations 35 | -------------------------------------------------------------------------------- /peer-evaluation-form-smpa3193.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dwillis/systems-for-reporting/39ee8040c18d2898c9431dafb5c85be40e6436d6/peer-evaluation-form-smpa3193.docx -------------------------------------------------------------------------------- /self-evaluation-form-smpa3193.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dwillis/systems-for-reporting/39ee8040c18d2898c9431dafb5c85be40e6436d6/self-evaluation-form-smpa3193.docx -------------------------------------------------------------------------------- /syllabus.md: -------------------------------------------------------------------------------- 1 | # SMPA 3193: Systems for Reporting 2 | ## Spring 2017 3 | #### School of Media and Public Affairs, George Washington University 4 | #### Tuesdays and Thursdays, 11:10 a.m. - 12:25 p.m., SMPA B01 5 | 6 | ### Instructor 7 | #### Derek Willis 8 | #### dwillis@gwu.edu 9 | 10 | #### Office Hours & Communication 11 | 12 | I plan to be at SMPA by 10:30 a.m. most Tuesdays & Thursdays and will be available immediately before and after class, and by appointment. 
The very best way to reach me is by email; if you have a question, doubt, concern or cat picture to share, email me. I will make every attempt to notify students should our schedule change. 13 | 14 | #### About the Instructor 15 | 16 | I have worked as a journalist since 1995, when I secured [an internship offer from The Palm Beach Post](http://dwillis.net/image/42295553170) against a dozen rejections from [some](http://dwillis.net/image/42289385513) [of](http://dwillis.net/image/42283024010) [the](http://dwillis.net/image/42282077315) [finest](http://dwillis.net/image/42281381971) news organizations in the country. Somehow, I went on to work for some of them: Congressional Quarterly, [The Center for Public Integrity](http://www.publicintegrity.org/authors/derek-willis), [The Washington Post](http://www.washingtonpost.com/wp-dyn/content/article/2005/05/04/AR2005050402393.html), [The New York Times](http://topics.nytimes.com/top/reference/timestopics/people/w/derek_willis/index.html) and now [ProPublica](https://www.propublica.org/site/author/derek_willis). Although it may appear that I have trouble keeping a job, I mostly have moved to chase new opportunities, especially those involving data and the Internet. This is why I teach: because as many issues as journalism has, it has so many opportunities. My skill is not as a writer or interviewer, but I love finding stories in data, and making that process easier. My Internet claim to fame is that I was named "Wanker of the Day" by liberal blogger Atrios for [Jan. 17, 2006](http://www.eschatonblog.com/2006/01/wanker-of-day_17.html). I am not a [University of Kentucky basketball player](http://www.ukathletics.com/sport/m-baskbl/2015/roster/559d907be4b01c7eefd5ec97), but I do like basketball (and have season tickets to Maryland women's basketball). If you'd like somebody else's take on me, [here's one](http://radar.oreilly.com/2012/03/profile-of-the-data-journalist-1.html). 
17 | 18 | ### Course Information 19 | 20 | ### Course Description and Goals/Learning Outcomes 21 | 22 | This course will teach students how to use data and technology to craft a systematic approach to beat reporting, or to build what you could call a reporter’s exoskeleton. Such a system would make it easier for a journalist to place news in context or spot interesting and potentially newsworthy events. 23 | 24 | As an example of this, consider the Supreme Court: each ruling is in itself a potential story and also one part of the life of a changing institution that draws on previous cases and is influenced by judges' past and present actions. Being able to place a ruling in context -- is this out of character for this court, or for any Supreme Court in history? -- is a valuable skill for a journalist. Collecting the information needed to provide this context can also help the journalist produce better questions and ideas. That collection effort is rarely simple, but advances in technology make it possible in a variety of circumstances. 25 | 26 | Students will work with some of the tools for building systems for reporting: spreadsheets, databases, pattern matching and some programming, including web scraping and building useful but simple sites for reporting. Students will work in small teams to choose a beat, research data sources around it and develop a web-based system to surface useful and unusual aspects of the data. You will be doing some work that qualifies as "computer programming" in this course, but it is not a programming course. This is a course about using computers and software to make you a better journalist. 27 | 28 | Students will learn to create and deploy simple tools and systems that could be evaluated and used by journalists at the university and in the professional world.
These could range from tools that capture and broadcast key information from government data to simple Web sites that make it easier to compare changes over time or data from different geographic areas. Some examples: 29 | 30 | * [Mentions of ProPublica in Congressional Record](http://dwillis.github.io/paper-of-record/?apikey=474754a0d230cc71b31b8bf6d313b70c&notify=false) 31 | * [Itemizer](https://projects.propublica.org/itemizer/) 32 | * [Parse Congressional Record](https://github.com/dwillis/hulse) 33 | * [Extract Text from a URL](http://extractor.herokuapp.com/) 34 | * [Find Tweets about a news article](http://tweetrewrite.herokuapp.com/) 35 | * [SCOTUS_Servo](https://twitter.com/SCOTUS_servo) 36 | 37 | The other, more important, goal is that students will leave the course with an idea of how to use technology to approach a beat in a systematic, rather than an ad-hoc, manner. Students will learn when and how to automate repetitive reporting tasks. 38 | 39 | The course will proceed on two tracks: in class we'll learn about and practice using techniques and tools, and outside class you will work in small groups on projects, with regular updates to the class. Your homework mainly will consist of readings and project work, with some additional exercises. 40 | 41 | #### Prerequisites 42 | 43 | You must have some experience with a spreadsheet. You will need to be comfortable installing software on your own computer (we'll do this together), trying new things and making mistakes. Our goal is to make [creators, not just users](http://thescoop.org/archives/2013/10/01/the-natives-arent-restless-enough/). Above all, you will need to ask questions. 44 | 45 | ### Learning Outcomes 46 | 47 | As a result of successfully completing this course, students will be able to: 48 | 49 | 1. Ask questions of data using specific tools such as spreadsheets and programming languages 50 | 2. Acquire data through web scraping and other techniques 51 | 3.
Automate data acquisition and reporting tasks 52 | 4. Create maps from data 53 | 5. Design and build a simple web application like a Twitter bot 54 | 55 | #### Textbooks and Readings 56 | 57 | There are no textbooks for this course. We will have regular readings chosen from professional work, government documents and other sources. You will be expected to read materials before class and to discuss them in class. Post any questions you might have about a reading to the class’s GitHub project as an issue and flag it appropriately so it can be resolved in advance or discussed in class. 58 | 59 | #### Technology 60 | 61 | In addition to working on computers in the lab, you will need to have a computer (Mac, Linux or Windows, preferably a laptop). You must have the ability to install software on this computer. The tools we will use will include: Excel/Google Spreadsheets, SQLite, Python and JavaScript. 62 | 63 | #### Average minimum amount of out-of-class or independent learning expected per week 64 | 65 | Over 15 weeks, students will spend 2.5 hours per week in class. Required readings, assignments and group project work are expected to take up, on average, 5 hours (300 minutes) a week. Over the course of the semester, students will spend 37.5 hours in class and 75 hours on class work. 66 | 67 | #### Schedule of topics 68 | 69 | Please see the [class outline](outline.md), which is subject to change depending on our progress. 70 | 71 | ### Assignments and Grading 72 | 73 | Grading will be based on the following activities. 74 | 75 | *Attendance & Participation - 15%* 76 | 77 | You are expected to contribute to in-class discussions, as well as group work in class. Journalism isn't a passive activity, and thinking of questions is literally a core skill for any journalist. If you aren't questioning things or exploring things, this might not be the profession for you.
78 | 79 | *Quizzes - 10%* 80 | 81 | There will be 6 in-class quizzes based on readings, which means we won't have them every week, and I won't tell you in advance (because it's a _quiz_). Each student's lowest grade from these quizzes will be dropped in determining the final grade; you may want to preserve this option for emergencies. The remaining quizzes will be averaged for this portion of the final grade. 82 | 83 | *Homework - 15%* 84 | There will be regular take-home skills-based assignments. These are individual assignments, meaning you cannot ask your colleagues for help on work done outside class (but you can ask me). Again, the lowest grade from these will be dropped when determining your grade; you may want to preserve this option for emergencies. The remaining assignments will be averaged for this portion of the final grade. 85 | 86 | *Biweekly project assessments - 20%* 87 | 88 | Beginning on March 7, I will grade your group project's progress every two weeks, based on materials groups will submit on their GitHub page by the beginning of class on Tuesdays. These include background research, prototype work, code/data and design. In general, if you are documenting your efforts - whether they succeed or not - members will receive full credit for these updates. 89 | 90 | *Final project - 30%* 91 | 92 | Your final project, which should be a working, deployed application for the Web (or a Web-enabled service like Twitter), is due at the end of the course. It represents the entirety of your project work. 93 | 94 | *Peer assessments - 10%* 95 | 96 | Twice during the semester, your fellow group members will be asked to provide an assessment of your contributions to the group's progress. The first evaluation, which will be around March 23, will not be graded but you will receive their comments anonymously to help you know how you are doing. The second, due at the end of the course, will be graded and will account for 10 percent of your grade.
On both occasions, you also will provide a self-evaluation. 97 | 98 | There is *no final exam* for this course; groups will present their final projects. 99 | 100 | ### Course Policies 101 | 102 | #### Attendance 103 | 104 | Journalism is not a passive activity and requires focus, inquisitiveness and involvement. Every class, we will be discussing our own work (and others') and we will be building a set of skills to use, and I expect your comments, questions and other contributions to our class. None of this can happen if you don’t show up, and your participation grade will suffer as a result. 105 | 106 | If you are ill and will miss class, let me know before that class, if possible. Unless you have my approval in advance, course work cannot be made up if you do not submit it on time. There is no extra credit. 107 | 108 | #### Late Assignments 109 | 110 | Deadlines are important in journalism. Any assignment that is submitted after the deadline will receive a score of 0, which will not help your average. The deadline for in-class assignments is the end of class, unless otherwise instructed, and for outside assignments, the start of class. Turning in an incomplete or imperfect assignment is much, much better than turning in a late assignment. 111 | 112 | #### Etiquette 113 | 114 | *Respect:* You should treat your classmates with respect. I'll expect this both in your verbal communication with them and in your non-verbal communication. This means: pay attention and be empathetic. 115 | 116 | *Computers:* When we are using computers, I expect that you will be using the program that we are talking about and not surfing the Web or checking email. I recommend that you take notes by hand, which [can be more effective than on a computer](http://pss.sagepub.com/content/25/6/1159).
117 | 118 | *Teamwork:* You will be assigned to a team of 4-5 students to work on your final project, not only to enable a broad range of perspectives but to help each other understand the concepts introduced in readings and in-class work. Each member of the group will be required to submit a self-evaluation and to evaluate the other members of the group; these evaluations will influence each student's final grade. 119 | 120 | ### University Policies 121 | 122 | #### Religious Holidays 123 | 124 | In accordance with [University policy](http://students.gwu.edu/accommodations-religious-holidays), students should notify faculty during the first week of the semester of their intention to be absent from class on their day(s) of religious observance. 125 | 126 | #### Academic integrity code 127 | 128 | Academic dishonesty is defined as cheating of any kind, including misrepresenting one's own work, taking credit for the work of others without crediting them and without appropriate authorization, and the fabrication of information. [Details and complete code](http://studentconduct.gwu.edu/code-academic-integrity). 129 | 130 | #### Safety and security 131 | 132 | 

In the case of an emergency, if at all possible, the class should shelter in place. If the building that the class is in is affected, follow the evacuation procedures for the building. After evacuation, seek shelter at a predetermined rendezvous location. 133 | 134 | ### Support for students outside the classroom 135 | 136 | #### Disability Support Services (DSS) 137 | 138 | Any student who may need an accommodation based on the potential impact of a disability should contact the Disability Support Services office at 202-994-8250 in Rome Hall, Suite 102, to establish eligibility and to coordinate reasonable accommodations. [Additional information](http://disabilitysupport.gwu.edu). 139 | 140 | 141 | ### Mental Health Services 202-994-5300 142 | 143 | The University's Mental Health Services offers 24/7 assistance and referral to address students' personal, social, career, and study skills problems. Services for students include: crisis and emergency mental health consultations, confidential assessment, counseling services (individual and small group), and referrals. [Additional information](http://counselingcenter.gwu.edu). 144 | --------------------------------------------------------------------------------