├── README.md
└── archive.org-getter.rb

/README.md:
--------------------------------------------------------------------------------
archive.org-getter
==================

Ruby script to download bulk results from Archive.org's TV News database of closed captions

*Authors*
Rahul Bhargava and Matt Stempeck
_MIT Center for Civic Media_

Requirements
------------

You need Ruby and an internet connection.

Running the Script
------------------

1. Open the script in a text editor, change 'Your Query' on line 11 to your preferred search term(s), and save it
2. Go to the command line (Terminal on a Mac, Command Prompt or Cygwin on Windows)
3. Navigate to the folder that contains the script
4. Type `ruby archive.org-getter.rb` and hit enter

Results
-------

Your results will show up in the same directory as the script itself, saved as JSON, an open data format. You can adjust how many results to fetch at once by changing the ROWS variable in the script, but *go easy on Archive.org's servers*: you'll get your results faster (nearly instantly) in smaller batches of 200 or so.

Using the Data
--------------

Once you have your data, you can combine, clean, and parse it with [Google Refine](http://code.google.com/p/google-refine/). I found [ProPublica's guide to cleaning messy data really helpful](http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning). You may also want to de-duplicate, because Archive.org records TV news broadcasts on both the east and west coasts (see the sketch at the end of this README).

What you can do with it
-----------------------

*Analyze a story:* You could search for a specific story, like the recent controversial Steubenville rape case, and quickly get a sense of which news companies are covering the case and which words they use to talk about it. You can also share links to specific clips with your friends and colleagues.

You could also investigate our professional media's treatment of a broader topic. You could trace the spread of the phrase "Obamacare" or watch the many breathless news segments covering "technology."

*Visualize TV news data:* You'll also have the data you need to visualize the lifespan of a story on televised news broadcasts. Archive.org renders a small line graph in your search results, but the JSON data will let you do much more (see the counting sketch at the end of this README).

For example, in the Trayvon Martin case study, we ended up normalizing the data against the number of Trayvon mentions in the printed press, the blogosphere, Twitter, and other channels to determine when interest began and peaked.
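Example: combining and de-duplicating batches
---------------------------------------------

Here is a minimal sketch, separate from the downloader itself, showing one way to merge the saved batch files and drop duplicates in Ruby. It assumes each batch file is standard JSON with a top-level `items` array whose records carry an `identifier` field; those names are guesses, not confirmed details of Archive.org's response, so inspect one downloaded file and adjust to match what you actually get back.

```ruby
require 'json'

# Load every batch file the downloader saved into one array of records.
records = Dir.glob("TVNews_results*.json").flat_map do |path|
  data = JSON.parse(File.read(path))
  # ASSUMPTION: results live in a top-level "items" array;
  # inspect a downloaded file and adjust if the structure differs.
  data.fetch("items", [])
end

# Drop duplicates (e.g. the same broadcast recorded on both coasts),
# keying on an assumed "identifier" field.
unique = records.uniq { |record| record["identifier"] }

puts "#{records.length} records, #{unique.length} after de-duplication"
File.write("TVNews_combined.json", JSON.pretty_generate(unique))
```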
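Example: counting mentions over time
------------------------------------

To chart a story's lifespan you first need a count of matching broadcasts per day. Another hedged sketch, building on the combined file above; it assumes each record has a `date` field that starts with a YYYY-MM-DD timestamp, which again you should verify against a real result before relying on it.

```ruby
require 'json'

records = JSON.parse(File.read("TVNews_combined.json"))

# Tally broadcasts per calendar day, assuming a "date" field
# whose first ten characters are YYYY-MM-DD.
per_day = Hash.new(0)
records.each do |record|
  day = record["date"].to_s[0, 10]
  per_day[day] += 1 unless day.empty?
end

# Print CSV-style "day,count" lines, ready for a spreadsheet or chart.
per_day.sort.each { |day, count| puts "#{day},#{count}" }
```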
--------------------------------------------------------------------------------
/archive.org-getter.rb:
--------------------------------------------------------------------------------
# URL opener
require 'open-uri'

# set variables
start = 0
ROWS = 200

# incrementally download JSON files
while start < 1000
  # don't forget to turn numbers into strings with .to_s
  url = "http://archive.org/details/tv?q=%22Your%20Query%22&start=" + start.to_s + "&rows=" + ROWS.to_s + "&output=json"
  print "fetching from " + start.to_s + "\n"
  print "  " + url + "\n"

  file = URI.open(url)
  results = file.read
  print "  got " + results.length.to_s + " characters\n"

  # if the search engine responded, save this batch to disk (named after
  # its start row) and move on; otherwise wait a minute and retry
  if results.index("Our Search Engine was not responsive.") == nil
    aFile = File.new("TVNews_results" + start.to_s + ".json", "w")
    aFile.write(results)
    aFile.close
    start = start + ROWS
  else
    print "request failed, retrying in a minute\n"
    sleep(60)
  end
end
--------------------------------------------------------------------------------