├── README.md
├── google_news.R
└── google_news.py

/README.md:
--------------------------------------------------------------------------------
google-news
===========

**Repository of scripts that scrape news headlines from Google News, prepare them for readability analysis, and visualize the results aggregated by news outlet. The scripts and their output are described in [this blog post](http://datapie.wordpress.com/2014/06/11/headline-readability-varies-by-news-outlet/).**

**google_news.py** scrapes news headlines and the names of their outlets from the Google News homepage on a set schedule. Sample data are found in *google_news.csv*. After all the scheduled jobs have run, the data are cleaned: badly formed text, nonsensical results, and duplicate records are reformatted or removed. The [readability](http://en.wikipedia.org/wiki/Readability) of the headlines is assessed with the [Flesch-Kincaid Grade Level](http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch.E2.80.93Kincaid_Grade_Level) test, which requires the readability functions found [here](https://github.com/mmautner/readability).

Finally, the cleaned data are aggregated at the level of news outlets, and **google_news.R** is called to create a visualization of the results using the [plotly R API](https://plot.ly/r/).
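For reference, the Flesch-Kincaid Grade Level is a linear function of average sentence length (words per sentence) and average word length (syllables per word). The sketch below shows only the formula; it is *not* the mmautner/readability implementation that google_news.py imports, and its syllable counter is a crude vowel-group heuristic:

```python
import re

def count_syllables(word):
    # Crude heuristic: count runs of vowels (real implementations do better)
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def flesch_kincaid_grade(text):
    # FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    sentences = max(1, len(re.findall(r'[.!?]+', text)))
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59
```

For a typical one-sentence headline, the sentence count is 1, so the grade level is driven almost entirely by the number of words and their syllable density.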
--------------------------------------------------------------------------------
/google_news.R:
--------------------------------------------------------------------------------
# Load packages
library(ggplot2)
library(devtools)
library(plotly)

# Load data
file.name = '/Users/jmcontreras/GitHub/google-news/google_news_aggregate.csv'
col.types = c('NULL', 'character', rep('numeric', 3))
data = read.csv(file=file.name, colClasses=col.types)

# Order outlets by mean Flesch-Kincaid grade level
sort.order = data$outlet[order(data$mean, decreasing=T)]
data$outlet = factor(data$outlet, levels=sort.order)

# Keep only outlets with at least 100 headlines
data = data[data$n_headlines >= 100, ]

# Create a ggplot object
ggnews = ggplot(data=data, aes(x=outlet, y=mean, color=outlet)) +
  # Plot one point per outlet
  geom_point(shape=19, size=20) +
  # Add error bars (mean +/- SEM)
  geom_errorbar(aes(ymax=mean + sem, ymin=mean - sem), width=0.2) +
  # Title the graph
  ggtitle('Average Headline Readability by News Outlet') +
  # Label the axes
  ylab('Grade Level') + xlab('News Outlet') +
  # Format the theme
  theme(plot.title = element_text(face='bold', size=25),
        axis.title = element_text(face='bold', size=20),
        axis.text.y = element_text(size=16),
        panel.grid.minor=element_blank(),
        panel.grid.major=element_blank(),
        axis.ticks.x=element_blank())

# Create a plotly connection object (legacy plotly R API)
py = plotly(username='your_username', key='your_key')

# Send the ggplot object to plotly to draw the graph
py$ggplotly(ggnews)
--------------------------------------------------------------------------------
/google_news.py:
--------------------------------------------------------------------------------
# -*- coding: utf-8 -*-
"""
Created on Wed Apr 2 21:17:38 2014

google_news.py scrapes news headlines and the names of their outlets from the
Google News homepage on a set schedule. The readability of the headlines is
assessed with the Flesch-Kincaid Grade Level test.

After all the scheduled jobs have run, the data are cleaned: badly formed text,
nonsensical results, and duplicate records are reformatted or removed. The
cleaned data are analyzed at the level of news outlets.

Finally, google_news.R is called to create a visualization of the results.

@author: Juan Manuel Contreras (juan.manuel.contreras.87@gmail.com)
"""

# Import modules
from time import sleep
from numpy import mean
from time import strftime
from os.path import isfile
from scipy.stats import sem
from subprocess import call
from mechanize import Browser
from logging import basicConfig
from lxml.html import fromstring
from pandas import DataFrame, isnull
from apscheduler.scheduler import Scheduler
from re import finditer, search, IGNORECASE
from readability.readability import Readability

def assess_readability(text):

    '''Assess the readability of text with the Flesch-Kincaid Grade Level test,
    as implemented in Python here: https://github.com/mmautner/readability'''

    # Assess grade level
    return Readability(text).FleschKincaidGradeLevel()

def scrape(file_name):

    '''Scrape headlines and news outlet names from the Google News homepage,
    assess their readability, and write or append the results to a CSV file'''

    # Get the current date and time
    date = strftime('%m/%d/%y')
    time = strftime('%H:%M:%S')

    # Construct browser object
    browser = Browser()

    # Do not observe rules from robots.txt
    browser.set_handle_robots(False)

    # Create HTML document
    html = fromstring(browser.open('https://news.google.com/').read())

    # Extract the outlet and title elements of the lead articles
    outlets = html.xpath('.//*[@class="esc-lead-article-outlet-wrapper"]')
    titles = html.xpath('.//*[@class="esc-lead-article-title"]')

    # Number of items
    n_items = len(titles)

    # Initialize empty Pandas data frame
    empty_list = [None] * n_items
    df = DataFrame({'outlet': empty_list,
                    'title': empty_list,
                    'flesch': empty_list,
                    'date_time': [date + ' ' + time] * n_items})

    # Iterate through outlets and titles
    for i in xrange(n_items):

        # Declare raw outlet name
        raw_outlet = outlets[i].text_content()

        # Find the last meaningful character in the raw_outlet string
        try:
            last_char = raw_outlet.index('-') - 1
        except ValueError as e:
            if e.message == 'substring not found':
                last_char = search('\d', raw_outlet).start()

        # Slice the raw outlet name into something usable
        this_outlet = raw_outlet[:last_char]
        this_title = titles[i].text_content().encode('utf8')

        # Input results into data frame
        df.outlet[i] = this_outlet
        df.title[i] = this_title
        df.flesch[i] = assess_readability(this_title)

    # If a file exists, then append results to it; otherwise, create a file
    if isfile(file_name):
        df.to_csv(file_name, header=False, index=False, mode='a')
        report_str = 'Appended '
    else:
        df.to_csv(file_name, index=False)
        report_str = 'Created %s with ' % (file_name)

    # Report progress
    print '%s%s headlines on %s at %s' % (report_str, n_items, date, time)

def schedule(file_name, n_jobs, frequency):

    '''Schedule the scraper to run every `frequency` minutes and shut it down
    after roughly `n_jobs` scheduled runs have completed'''

    # Create a default logger
    basicConfig()

    # Run the first job
    scrape(file_name)

    # Instantiate the scheduler
    sched = Scheduler()

    # Start it
    sched.start()

    # Schedule the function
    sched.add_interval_job(scrape, args=[file_name], minutes=frequency,
                           misfire_grace_time=60)

    # Wait long enough for n_jobs scheduled runs (frequency is in minutes)
    sleep(n_jobs * frequency * 60)

    # Shut down the scheduler
    sched.shutdown()
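# NOTE: schedule() is written against the APScheduler 2.x API
# (apscheduler.scheduler.Scheduler and add_interval_job), which is what this
# project uses. On APScheduler 3.x the rough equivalent (an untested sketch,
# not part of this script) would be:
#
#     from apscheduler.schedulers.blocking import BlockingScheduler
#
#     sched = BlockingScheduler()
#     sched.add_job(scrape, 'interval', minutes=frequency, args=[file_name],
#                   misfire_grace_time=60)
#     sched.start()  # blocks; stop with Ctrl-C or a shutdown() from elsewhere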
def clean(file_name):

    '''Clean the collected data, including removing duplicate headlines, and
    save the results to a different CSV file'''

    # Read CSV file
    df = DataFrame.from_csv(path=file_name, index_col=False)

    # Declare patterns found in dirty news outlet and headline records,
    # including improper encoding
    s_pattern = '\d+ minute|\d+ hour| \(\w+tion\)| \(blog\)'
    t_pattern = '\[video\]| \(\+video\)'
    e_pattern = '\x89\xdb\xd2 '

    # Initialize empty list of records to be removed
    remove = []

    # Iterate through the records
    for i in xrange(df.shape[0]):
        # Define the news outlet and headline of the iteration
        s = df.outlet[i]
        t = df.title[i].lower()
        # Tag records if the outlet is empty or nonsensical
        if isnull(s) or s == '(multiple names)':
            remove.append(i)
            continue
        # Clean the outlet, if necessary
        if search(s_pattern, s):
            df.outlet[i] = s[:[m.start() for m in finditer(s_pattern, s)][0]]
        # Clean titles with an encoding error, keeping the cleaned text
        if search(e_pattern, t):
            t = t.replace(e_pattern, '')
            df.title[i] = t
            df.flesch[i] = assess_readability(t)
        # Remove extra letters from Christian Science Monitor name
        if s.endswith('MonitorApr'):
            df.outlet[i] = s[:-3]
        # In titles with plus signs instead of whitespaces, perform replacement
        if t.count(' ') == 0 and t.count('+') > 0:
            df.title[i] = t.replace('+', ' ')
            df.flesch[i] = assess_readability(df.title[i])
        # Remove video references from headlines
        if search(t_pattern, t, IGNORECASE):
            span = [m.span() for m in finditer(t_pattern, t, IGNORECASE)][0]
            df.title[i] = t[:span[0]] + t[span[1]:]
            df.flesch[i] = assess_readability(df.title[i])
        # Remove information about the time when the article was posted
        # (the title was lower-cased above, so match a lower-case pattern)
        if search(' \- hours', t):
            df.title[i] = t[:[m.start() for m in finditer(' \- hours', t)][0]]
            df.flesch[i] = assess_readability(df.title[i])
        # Tag records with metadata instead of headlines (title is lower-cased)
        if t.startswith('written by '):
            remove.append(i)

    # Remove unsalvageable records
    df = df.drop(df.index[remove])

    # Drop duplicate records
    df['title_lower'] = [t.lower() for t in df.title]
    df.drop_duplicates(cols=['title_lower', 'outlet'], inplace=True)
    df.drop(labels='title_lower', axis=1, inplace=True)

    # Save clean data to new file
    df.to_csv(path_or_buf=file_name.replace('.', '_clean.'), index=False)

    # Return the DataFrame
    return df

def grades2schools(df):

    '''Determine the school (elementary, middle, high, or college+) a reader
    needs to attend to parse the headline'''

    # Transform each grade to its school equivalent
    df['elem'] = (df.flesch >= 1) & (df.flesch < 6)
    df['middle'] = (df.flesch >= 6) & (df.flesch < 9)
    df['high'] = (df.flesch >= 9) & (df.flesch < 13)
    df['college'] = df.flesch >= 13

    # Return the DataFrame
    return df
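# Illustration of the bands defined above: a headline scoring flesch == 7.4
# falls only in the middle-school band (6 <= grade < 9), one scoring 13.2 only
# in the college band, and scores below 1 fall into none of the four bands.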
def print_stats(data, stats):

    '''Print some relevant statistics'''

    def print_percent(x):
        # Mean of a boolean Series is the fraction of True values
        return round(x.mean() * 100, 2)

    print '\nFLESCH STATISTICS'
    print 'Mean = %s' % (round(data.flesch.mean(), 2))
    print 'SD = %s' % (round(data.flesch.std(), 2))
    print 'Min = %s' % (data.flesch.min())
    print 'Max = %s' % (data.flesch.max())

    print '\nSCHOOL PERCENTAGES'
    print 'Elementary = %s%%' % (print_percent(data.elem))
    print 'Middle = %s%%' % (print_percent(data.middle))
    print 'High = %s%%' % (print_percent(data.high))
    print 'SomeCollege = %s%%' % (print_percent(data.college))

    print '\nHEADLINES BY OUTLET'
    print 'Mean = %s' % (round(stats.n_headlines.mean()))
    print 'SD = %s' % (round(stats.n_headlines.std()))
    print 'Min = %s' % (stats.n_headlines.min())
    print 'Max = %s' % (stats.n_headlines.max())

def main():

    # Name the CSV file
    file_name = 'google_news.csv'

    # Schedule and run the scraper (every 30 minutes, roughly 10 runs)
    schedule(file_name=file_name, n_jobs=10, frequency=30)

    # Clean data and save them to a different file
    df = clean(file_name)

    # Transform grades to schools (elementary, middle, high, and some college)
    df = grades2schools(df)

    # Group items by outlet
    grouped = df.groupby(by='outlet', as_index=False)

    # Compute aggregate Flesch statistics
    flesch_dict = {'mean': mean, 'sem': sem, 'n_headlines': len}
    flesch_stats = grouped['flesch'].agg(flesch_dict)

    # Restrict news outlets to those with at least 100 headlines
    flesch_stats = flesch_stats[flesch_stats.n_headlines >= 100]

    # Apply the same restriction to the non-aggregated data
    df = df[df.outlet.isin(list(set(flesch_stats.outlet)))]

    # Print statistics
    print_stats(data=df, stats=flesch_stats)

    # Save aggregated data
    flesch_stats.to_csv(file_name.replace('.', '_aggregate.'))

    # Plot results
    Rscript = '/Library/Frameworks/R.framework/Versions/3.0/Resources/Rscript'
    call([Rscript, '/Users/jmcontreras/GitHub/google-news/google_news.R'])

if __name__ == '__main__':

    main()
--------------------------------------------------------------------------------