├── README.md
├── google_news.R
└── google_news.py

/README.md:
--------------------------------------------------------------------------------
google-news
===========

**Repository of scripts that scrape news headlines from Google News, prepare them for readability analysis, and visualize the results aggregated by news outlet. The scripts and their output are described in [this blog post](http://datapie.wordpress.com/2014/06/11/headline-readability-varies-by-news-outlet/).**

**google_news.py** scrapes news headlines and the names of their outlets from the Google News homepage on a set schedule. Sample data are found in *google_news.csv*. After all the scheduled jobs have run, the data are cleaned: badly formed text, nonsensical results, and duplicate records are reformatted or removed. The [readability](http://en.wikipedia.org/wiki/Readability) of the headlines is assessed with the [Flesch-Kincaid Grade Level](http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch.E2.80.93Kincaid_Grade_Level) test, which requires the readability functions found [here](https://github.com/mmautner/readability).

Finally, the cleaned data are aggregated at the level of news outlets, and **google_news.R** is called to create a visualization of the results using the [plotly R API](https://plot.ly/r/).
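For reference, the Flesch-Kincaid Grade Level is a linear function of average sentence length (words per sentence) and average word length (syllables per word). The sketch below shows only the formula; it is *not* the mmautner/readability implementation that google_news.py imports, and its syllable counter is a crude vowel-group heuristic:

```python
import re

def count_syllables(word):
    # Crude heuristic: count runs of vowels (real implementations do better)
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def flesch_kincaid_grade(text):
    # FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    sentences = max(1, len(re.findall(r'[.!?]+', text)))
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59
```

For a typical one-sentence headline, the sentence count is 1, so the grade level is driven almost entirely by the number of words and their syllable density.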
--------------------------------------------------------------------------------
/google_news.R:
--------------------------------------------------------------------------------
# Load packages
library(ggplot2)
library(devtools)
library(plotly)

# Load data
file.name = '/Users/jmcontreras/GitHub/google-news/google_news_aggregate.csv'
col.types = c('NULL', 'character', rep('numeric', 3))
data = read.csv(file=file.name, colClasses=col.types)

# Order outlets by mean Flesch-Kincaid grade level
sort.order = data$outlet[order(data$mean, decreasing=T)]
data$outlet = factor(data$outlet, levels=sort.order)

# Keep only outlets with at least 100 headlines
data = data[data$n_headlines >= 100, ]

# Create a ggplot object
ggnews = ggplot(data=data, aes(x=outlet, y=mean, color=outlet)) +
  # Plot one point per outlet
  geom_point(shape=19, size=20) +
  # Add error bars (mean +/- SEM)
  geom_errorbar(aes(ymax=mean + sem, ymin=mean - sem), width=0.2) +
  # Title the graph
  ggtitle('Average Headline Readability by News Outlet') +
  # Label the axes
  ylab('Grade Level') + xlab('News Outlet') +
  # Format the theme
  theme(plot.title = element_text(face='bold', size=25),
        axis.title = element_text(face='bold', size=20),
        axis.text.y = element_text(size=16),
        panel.grid.minor=element_blank(),
        panel.grid.major=element_blank(),
        axis.ticks.x=element_blank())

# Create a plotly connection object (legacy plotly R API)
py = plotly(username='your_username', key='your_key')

# Send the ggplot object to plotly to draw the graph
py$ggplotly(ggnews)
--------------------------------------------------------------------------------
/google_news.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Wed Apr 2 21:17:38 2014

google_news.py scrapes news headlines and the names of their outlets from the
Google News homepage on a set schedule. The readability of the headlines is
assessed with the Flesch-Kincaid Grade Level test.

After all the scheduled jobs have run, the data are cleaned: badly formed text,
nonsensical results, and duplicate records are reformatted or removed. The
cleaned data are analyzed at the level of news outlets.

Finally, google_news.R is called to create a visualization of the results.

@author: Juan Manuel Contreras (juan.manuel.contreras.87@gmail.com)
"""

# Import modules
from time import sleep
from numpy import mean
from time import strftime
from os.path import isfile
from scipy.stats import sem
from subprocess import call
from mechanize import Browser
from logging import basicConfig
from lxml.html import fromstring
from pandas import DataFrame, isnull
from apscheduler.scheduler import Scheduler
from re import finditer, search, IGNORECASE
from readability.readability import Readability

def assess_readability(text):

    '''Assess the readability of text with the Flesch-Kincaid Grade Level test,
    as implemented in Python here: https://github.com/mmautner/readability'''

    # Assess grade level
    return Readability(text).FleschKincaidGradeLevel()

def scrape(file_name):

    '''Scrape headlines and news outlet names from the Google News homepage,
    assess their readability, and write or append the results to a CSV file'''

    # Get the current date and time
    date = strftime('%m/%d/%y')
    time = strftime('%H:%M:%S')

    # Construct browser object
    browser = Browser()

    # Do not observe rules from robots.txt
    browser.set_handle_robots(False)

    # Create HTML document
    html = fromstring(browser.open('https://news.google.com/').read())

    # Extract the outlet and title elements of the lead articles
    outlets = html.xpath('.//*[@class="esc-lead-article-outlet-wrapper"]')
    titles = html.xpath('.//*[@class="esc-lead-article-title"]')

    # Number of items
    n_items = len(titles)

    # Initialize empty Pandas data frame
    empty_list = [None] * n_items
    df = DataFrame({'outlet': empty_list,
                    'title': empty_list,
                    'flesch': empty_list,
                    'date_time': [date + ' ' + time] * n_items})

    # Iterate through outlets and titles
    for i in xrange(n_items):

        # Declare raw outlet name
        raw_outlet = outlets[i].text_content()

        # Find the last meaningful character in the raw_outlet string
        try:
            last_char = raw_outlet.index('-') - 1
        except ValueError as e:
            if e.message == 'substring not found':
                last_char = search('\d', raw_outlet).start()

        # Slice the raw outlet name into something usable
        this_outlet = raw_outlet[:last_char]
        this_title = titles[i].text_content().encode('utf8')

        # Input results into data frame
        df.outlet[i] = this_outlet
        df.title[i] = this_title
        df.flesch[i] = assess_readability(this_title)

    # If a file exists, then append results to it; otherwise, create a file
    if isfile(file_name):
        df.to_csv(file_name, header=False, index=False, mode='a')
        report_str = 'Appended '
    else:
        df.to_csv(file_name, index=False)
        report_str = 'Created %s with ' % (file_name)

    # Report progress
    print '%s%s headlines on %s at %s' % (report_str, n_items, date, time)

def schedule(file_name, n_jobs, frequency):

    '''Schedule the scraper to run every `frequency` minutes and shut it down
    after roughly `n_jobs` scheduled runs have completed'''

    # Create a default logger
    basicConfig()

    # Run the first job
    scrape(file_name)

    # Instantiate the scheduler
    sched = Scheduler()

    # Start it
    sched.start()

    # Schedule the function
    sched.add_interval_job(scrape, args=[file_name], minutes=frequency,
                           misfire_grace_time=60)

    # Wait long enough for n_jobs scheduled runs (frequency is in minutes)
    sleep(n_jobs * frequency * 60)

    # Shut down the scheduler
    sched.shutdown()
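# NOTE: schedule() is written against the APScheduler 2.x API
# (apscheduler.scheduler.Scheduler and add_interval_job), which is what this
# project uses. On APScheduler 3.x the rough equivalent (an untested sketch,
# not part of this script) would be:
#
#     from apscheduler.schedulers.blocking import BlockingScheduler
#
#     sched = BlockingScheduler()
#     sched.add_job(scrape, 'interval', minutes=frequency, args=[file_name],
#                   misfire_grace_time=60)
#     sched.start()  # blocks; stop with Ctrl-C or a shutdown() from elsewhere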
def clean(file_name):

    '''Clean the collected data, including removing duplicate headlines, and
    save the results to a different CSV file'''

    # Read CSV file
    df = DataFrame.from_csv(path=file_name, index_col=False)

    # Declare patterns found in dirty news outlet and headline records,
    # including improper encoding
    s_pattern = '\d+ minute|\d+ hour| \(\w+tion\)| \(blog\)'
    t_pattern = '\[video\]| \(\+video\)'
    e_pattern = '\x89\xdb\xd2 '

    # Initialize empty list of records to be removed
    remove = []

    # Iterate through the records
    for i in xrange(df.shape[0]):
        # Define the news outlet and headline of the iteration
        s = df.outlet[i]
        t = df.title[i].lower()
        # Tag records if the outlet is empty or nonsensical
        if isnull(s) or s == '(multiple names)':
            remove.append(i)
            continue
        # Clean the outlet, if necessary
        if search(s_pattern, s):
            df.outlet[i] = s[:[m.start() for m in finditer(s_pattern, s)][0]]
        # Clean titles with an encoding error, keeping the cleaned text
        if search(e_pattern, t):
            t = t.replace(e_pattern, '')
            df.title[i] = t
            df.flesch[i] = assess_readability(t)
        # Remove extra letters from Christian Science Monitor name
        if s.endswith('MonitorApr'):
            df.outlet[i] = s[:-3]
        # In titles with plus signs instead of whitespaces, perform replacement
        if t.count(' ') == 0 and t.count('+') > 0:
            df.title[i] = t.replace('+', ' ')
            df.flesch[i] = assess_readability(df.title[i])
        # Remove video references from headlines
        if search(t_pattern, t, IGNORECASE):
            span = [m.span() for m in finditer(t_pattern, t, IGNORECASE)][0]
            df.title[i] = t[:span[0]] + t[span[1]:]
            df.flesch[i] = assess_readability(df.title[i])
        # Remove information about the time when the article was posted
        # (the title was lower-cased above, so match a lower-case pattern)
        if search(' \- hours', t):
            df.title[i] = t[:[m.start() for m in finditer(' \- hours', t)][0]]
            df.flesch[i] = assess_readability(df.title[i])
        # Tag records with metadata instead of headlines (title is lower-cased)
        if t.startswith('written by '):
            remove.append(i)

    # Remove unsalvageable records
    df = df.drop(df.index[remove])

    # Drop duplicate records
    df['title_lower'] = [t.lower() for t in df.title]
    df.drop_duplicates(cols=['title_lower', 'outlet'], inplace=True)
    df.drop(labels='title_lower', axis=1, inplace=True)

    # Save clean data to new file
    df.to_csv(path_or_buf=file_name.replace('.', '_clean.'), index=False)

    # Return the DataFrame
    return df

def grades2schools(df):

    '''Determine the school (elementary, middle, high, or college+) a reader
    needs to attend to parse the headline'''

    # Transform each grade to its school equivalent
    df['elem'] = (df.flesch >= 1) & (df.flesch < 6)
    df['middle'] = (df.flesch >= 6) & (df.flesch < 9)
    df['high'] = (df.flesch >= 9) & (df.flesch < 13)
    df['college'] = df.flesch >= 13

    # Return the DataFrame
    return df
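# Illustration of the bands defined above: a headline scoring flesch == 7.4
# falls only in the middle-school band (6 <= grade < 9), one scoring 13.2 only
# in the college band, and scores below 1 fall into none of the four bands.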
def print_stats(data, stats):

    '''Print some relevant statistics'''

    def print_percent(x):
        # Mean of a boolean Series is the fraction of True values
        return round(x.mean() * 100, 2)

    print '\nFLESCH STATISTICS'
    print 'Mean = %s' % (round(data.flesch.mean(), 2))
    print 'SD = %s' % (round(data.flesch.std(), 2))
    print 'Min = %s' % (data.flesch.min())
    print 'Max = %s' % (data.flesch.max())

    print '\nSCHOOL PERCENTAGES'
    print 'Elementary = %s%%' % (print_percent(data.elem))
    print 'Middle = %s%%' % (print_percent(data.middle))
    print 'High = %s%%' % (print_percent(data.high))
    print 'SomeCollege = %s%%' % (print_percent(data.college))

    print '\nHEADLINES BY OUTLET'
    print 'Mean = %s' % (round(stats.n_headlines.mean()))
    print 'SD = %s' % (round(stats.n_headlines.std()))
    print 'Min = %s' % (stats.n_headlines.min())
    print 'Max = %s' % (stats.n_headlines.max())

def main():

    # Name the CSV file
    file_name = 'google_news.csv'

    # Schedule and run the scraper (every 30 minutes, roughly 10 runs)
    schedule(file_name=file_name, n_jobs=10, frequency=30)

    # Clean data and save them to a different file
    df = clean(file_name)

    # Transform grades to schools (elementary, middle, high, and some college)
    df = grades2schools(df)

    # Group items by outlet
    grouped = df.groupby(by='outlet', as_index=False)

    # Compute aggregate Flesch statistics
    flesch_dict = {'mean': mean, 'sem': sem, 'n_headlines': len}
    flesch_stats = grouped['flesch'].agg(flesch_dict)

    # Restrict news outlets to those with at least 100 headlines
    flesch_stats = flesch_stats[flesch_stats.n_headlines >= 100]

    # Apply the same restriction to the non-aggregated data
    df = df[df.outlet.isin(list(set(flesch_stats.outlet)))]

    # Print statistics
    print_stats(data=df, stats=flesch_stats)

    # Save aggregated data
    flesch_stats.to_csv(file_name.replace('.', '_aggregate.'))

    # Plot results
    Rscript = '/Library/Frameworks/R.framework/Versions/3.0/Resources/Rscript'
    call([Rscript, '/Users/jmcontreras/GitHub/google-news/google_news.R'])

if __name__ == '__main__':

    main()
--------------------------------------------------------------------------------