├── README.md ├── examples ├── Rstart.R ├── entity.png ├── fb_scraper_data.png ├── how_to_build_facebook_scraper.ipynb ├── reaction-example-1.png ├── reaction-example-2.png ├── reaction-example-3.png └── reaction_count_data_analysis_example.ipynb ├── get_fb_comments_from_fb.py ├── get_fb_posts_fb_group.py └── get_fb_posts_fb_page.py /README.md: -------------------------------------------------------------------------------- 1 | # Facebook Page Post Scraper 2 | 3 | **UPDATE December 2017: Due to a [bug on Facebook's end](https://developers.facebook.com/bugs/1838195226492053/), using this scraper will only return a very small subset of posts (5-10% of posts) over a limited timeframe. Since Facebook now owns [CrowdTangle](http://www.crowdtangle.com), the (paid) canonical source of historical Facebook data, Facebook doesn't have an incentive to fix the linked bug.** 4 | 5 | **On December 12th, a Facebook engineer commented that they are developing a new endpoint for scraping posts chronologically. I will refactor this script once that happens. Until then, there likely will not be any PRs accepted.** 6 | 7 | ![](/examples/fb_scraper_data.png) 8 | 9 | A tool for gathering *all* the posts and comments of a Facebook Page (or Open Facebook Group) and related metadata, including post message, post links, and counts of each reaction on the post. All this data is exported as a CSV, able to be imported into any data analysis program like Excel. 10 | 11 | The purpose of the script is to gather Facebook data for semantic analysis, which is greatly helped by the presence of high-quality Reaction data. Here's quick examples of a potential Facebook Reaction data visualization using data from [CNN's Facebook page](https://www.facebook.com/cnn/): 12 | 13 | ![](/examples/reaction-example-2.png) 14 | 15 | ## Usage 16 | 17 | ### Scrape Posts From Public Page 18 | 19 | The Page data scraper is implemented as a Python 2/3 script in `get_fb_posts_fb_page.py`; fill in the App ID and App Secret of a Facebook app you control (I strongly recommend creating an app just for this purpose) and the Page ID of the Facebook Page you want to scrape at the beginning of the file. Then run the script by `cd` into the directory containing the script, then running `python get_fb_posts_fb_page.py` or `python3 get_fb_posts_fb_page.py`. 20 | 21 | ### Scrape Posts from Open Group 22 | 23 | To get data from an Open Group, use the `get_fb_posts_fb_group.py` script with the App ID and App Secret filled in the same way. However, the `group_id` is a *numeric ID*. For groups without a custom username, the ID will be in the address bar; for groups with custom usernames, to get the ID, do a View Source on the Group Page, search for the phrase `"entity_id"`, and use the number to the right of that field. For example, the `group_id` of [Hackathon Hackers](https://www.facebook.com/groups/hackathonhackers/) is 759985267390294. 24 | 25 | ![](/examples/entity.png) 26 | 27 | ### Scrape Comments From Page/Group Posts 28 | 29 | To scrape all the user comments from the posts, create a CSV using either of the above scripts, then run the `get_fb_comments_from_fb.py` script, specifying the Page/Group as the `file_id`. The output includes the original `status_id` where the comment is located so you can map the comment to the original Post with a `JOIN` or `VLOOKUP`, and also a `parent_id` if the comment is a reply to another comment. 30 | 31 | Keep in mind that large pages such as CNN have *millions* of comments, so be careful! 
(scraping throughput is approximately 87k comments/hour) 32 | 33 | ## Privacy 34 | 35 | This scraper can only scrape public Facebook data that is available to anyone, even those who are not logged into Facebook. No personally identifiable data is collected in the Page variant; the Group variant does collect the name of the author of the post, but that data is also public to non-logged-in users. Additionally, the script only uses officially documented Facebook API endpoints without circumventing any rate limits. 36 | 37 | Note that this script, and any variant of it, *cannot* be used to scrape data from user profiles (the Facebook API specifically disallows this use case). 38 | 39 | ## Known Issues 40 | 41 | * UTF-16 (CJK) text sometimes fails to export correctly. 42 | * GIFs in comments will not appear when using an App access_token (they require a User access_token, for no apparent reason). 43 | 44 | ## Maintainer 45 | 46 | Max Woolf ([@minimaxir](http://minimaxir.com)) 47 | 48 | *Max's open-source projects are supported by his [Patreon](https://www.patreon.com/minimaxir). If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.* 49 | 50 | For more information on how the script was originally created, and some tips on how to create similar scrapers yourself, see my blog post [How to Scrape Data From Facebook Page Posts for Statistical Analysis](http://minimaxir.com/2015/07/facebook-scraper/). 51 | 52 | ## Credits 53 | 54 | [Peeter Tintis](https://github.com/Digitaalhumanitaaria), whose [fork](https://github.com/Digitaalhumanitaaria/facebook-page-post-scraper/blob/master/get_fb_posts_fb_page.py) of this repo implements code for finding separate reaction counts per [this Stack Overflow answer](http://stackoverflow.com/a/37239851). 55 | 56 | [Marco Goldin](https://github.com/marcogoldin) for the Python 3.5 fork. 57 | 58 | ## License 59 | 60 | MIT 61 | 62 | If you do find this script useful, a link back to this repository would be appreciated. Thanks! 
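As a quick reference for the Usage sections above, here is a minimal sketch of the configuration block each scraper expects near the top of the file. The variable names below follow `get_fb_comments_from_fb.py` and the accompanying notebook; the post scrapers use a Page or Group identifier rather than `file_id`, so check each script for its exact names.

```python
# Hypothetical placeholder values -- substitute your own app credentials
# and the Page you want to scrape.
app_id = "YOUR_APP_ID"          # from your Facebook app's dashboard
app_secret = "YOUR_APP_SECRET"  # DO NOT SHARE WITH ANYONE!
page_id = "nytimes"             # Page to scrape (the Group script takes a numeric group ID)

# The scripts authenticate with an app access token of the form "<app_id>|<app_secret>".
access_token = app_id + "|" + app_secret
```

With those values filled in, running `python get_fb_posts_fb_page.py` (or `python3`) writes a CSV named after the page (e.g. `nytimes_facebook_statuses.csv`) to the working directory.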
63 | -------------------------------------------------------------------------------- /examples/Rstart.R: -------------------------------------------------------------------------------- 1 | library(readr) 2 | library(dplyr) 3 | library(ggplot2) 4 | library(extrafont) 5 | library(scales) 6 | library(grid) 7 | library(RColorBrewer) 8 | library(digest) 9 | library(readr) 10 | library(stringr) 11 | 12 | 13 | fontFamily <- "Source Sans Pro" 14 | fontTitle <- "Source Sans Pro Semibold" 15 | 16 | color_palette = c("#16a085","#27ae60","#2980b9","#8e44ad","#f39c12","#c0392b","#1abc9c", "#2ecc71", "#3498db", "#9b59b6", "#f1c40f","#e74c3c") 17 | 18 | neutral_colors = function(number) { 19 | return (brewer.pal(11, "RdYlBu")[-c(5:7)][(number %% 8) + 1]) 20 | } 21 | 22 | set1_colors = function(number) { 23 | return (brewer.pal(9, "Set1")[c(-6,-8)][(number %% 7) + 1]) 24 | } 25 | 26 | theme_custom <- function() {theme_bw(base_size = 8) + 27 | theme(panel.background = element_rect(fill="#eaeaea"), 28 | plot.background = element_rect(fill="white"), 29 | panel.grid.minor = element_blank(), 30 | panel.grid.major = element_line(color="#dddddd"), 31 | axis.ticks.x = element_blank(), 32 | axis.ticks.y = element_blank(), 33 | axis.title.x = element_text(family=fontTitle, size=8, vjust=-.3), 34 | axis.title.y = element_text(family=fontTitle, size=8, vjust=1.5), 35 | panel.border = element_rect(color="#cccccc"), 36 | text = element_text(color = "#1a1a1a", family=fontFamily), 37 | plot.margin = unit(c(0.25,0.1,0.1,0.35), "cm"), 38 | plot.title = element_text(family=fontTitle, size=9, vjust=1)) 39 | } 40 | 41 | create_watermark <- function(source = '', filename = '', dark=F) { 42 | 43 | bg_white = "#FFFFFF" 44 | bg_text = '#969696' 45 | 46 | if (dark) { 47 | bg_white = "#000000" 48 | bg_text = '#666666' 49 | } 50 | 51 | watermark <- ggplot(aes(x,y), data=data.frame(x=c(0.5), y=c(0.5))) + geom_point(color = "transparent") + 52 | geom_text(x=0, y=1.25, label="By Max Woolf — minimaxir.com", family="Source Sans Pro", color=bg_text, size=1.75, hjust=0) + 53 | 54 | geom_text(x=5, y=1.25, label="Made using R and ggplot2", family="Source Sans Pro", color=bg_text, size=1.75) + 55 | scale_x_continuous(limits=c(0,10)) + 56 | scale_y_continuous(limits=c(0.5,1.5)) + 57 | annotate("segment", x = 0, xend = 10, y=1.5, yend=1.5, color=bg_text, size=0.1) + 58 | theme_bw() + 59 | theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = "none", 60 | panel.border = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank(), 61 | axis.ticks = element_blank(), plot.margin = unit(c(0.0,0,-0.4,0), "cm")) + 62 | theme(plot.background=element_rect(fill=bg_white, color=bg_white),panel.background=element_rect(fill=bg_white, color=bg_white)) + 63 | scale_color_manual(values=bg_text) 64 | 65 | if (nchar(source) > 0) {watermark <- watermark + geom_text(x=10, y=1.25, label=paste("Data via",source), family="Source Sans Pro", color=bg_text, size=1.75, hjust=1)} 66 | 67 | return (watermark) 68 | } 69 | 70 | web_Layout <- grid.layout(nrow = 2, ncol = 1, heights = unit(c(2, 71 | 0.125), c("null", "null")), ) 72 | tallweb_Layout <- grid.layout(nrow = 2, ncol = 1, heights = unit(c(3.5, 73 | 0.125), c("null", "null")), ) 74 | video_Layout <- grid.layout(nrow = 1, ncol = 2, widths = unit(c(2, 75 | 1), c("null", "null")), ) 76 | 77 | #grid.show.layout(Layout) 78 | vplayout <- function(...) 
{ 79 | grid.newpage() 80 | pushViewport(viewport(layout = web_Layout)) 81 | } 82 | 83 | talllayout <- function(...) { 84 | grid.newpage() 85 | pushViewport(viewport(layout = tallweb_Layout)) 86 | } 87 | 88 | vidlayout <- function(...) { 89 | grid.newpage() 90 | pushViewport(viewport(layout = video_Layout)) 91 | } 92 | 93 | subplot <- function(x, y) viewport(layout.pos.row = x, 94 | layout.pos.col = y) 95 | 96 | web_plot <- function(a, b) { 97 | vplayout() 98 | print(a, vp = subplot(1, 1)) 99 | print(b, vp = subplot(2, 1)) 100 | } 101 | 102 | tallweb_plot <- function(a, b) { 103 | talllayout() 104 | print(a, vp = subplot(1, 1)) 105 | print(b, vp = subplot(2, 1)) 106 | } 107 | 108 | video_plot <- function(a, b) { 109 | vidlayout() 110 | print(a, vp = subplot(1, 1)) 111 | print(b, vp = subplot(1, 2)) 112 | } 113 | 114 | max_save <- function(plot1, filename, source = '', pdf = FALSE, w=4, h=3, tall=F, dark=F, bg_overide=NA) { 115 | png(paste(filename,"png",sep="."),res=300,units="in",width=w,height=h) 116 | plot.new() 117 | #if (!is.na(bg_overide)) {par(bg = bg_overide)} 118 | ifelse(tall,tallweb_plot(plot1,create_watermark(source, filename, dark)),web_plot(plot1,create_watermark(source, filename, dark))) 119 | dev.off() 120 | 121 | if (pdf) { 122 | quartz(width=w,height=h,dpi=144) 123 | #if (!is.na(bg_overide)) {par(bg = bg_overide)} 124 | web_plot(plot1,create_watermark(source, filename, dark)) 125 | quartz.save(paste(filename,"pdf",sep="."), type = "pdf", device = dev.cur()) 126 | } 127 | } 128 | 129 | video_save <- function(plot1, plot2, filename) { 130 | png(paste(filename,"png",sep="."),res=300,units="in",width=1920/300,height=1080/300) 131 | video_plot(plot1,plot2) 132 | dev.off() 133 | 134 | } 135 | 136 | fte_theme <- function (palate_color = "Greys") { 137 | 138 | #display.brewer.all(n=9,type="seq",exact.n=TRUE) 139 | palate <- brewer.pal(palate_color, n=9) 140 | color.background = palate[1] 141 | color.grid.minor = palate[3] 142 | color.grid.major = palate[3] 143 | color.axis.text = palate[6] 144 | color.axis.title = palate[7] 145 | color.title = palate[9] 146 | #color.title = "#2c3e50" 147 | 148 | font.title <- "Source Sans Pro" 149 | font.axis <- "Open Sans Condensed Bold" 150 | #font.axis <- "M+ 1m regular" 151 | #font.title <- "Arial" 152 | #font.axis <- "Arial" 153 | 154 | 155 | theme_bw(base_size=9) + 156 | # Set the entire chart region to a light gray color 157 | theme(panel.background=element_rect(fill=color.background, color=color.background)) + 158 | theme(plot.background=element_rect(fill=color.background, color=color.background)) + 159 | theme(panel.border=element_rect(color=color.background)) + 160 | # Format the grid 161 | theme(panel.grid.major=element_line(color=color.grid.major,size=.25)) + 162 | theme(panel.grid.minor=element_blank()) + 163 | #scale_x_continuous(minor_breaks=0,breaks=seq(0,100,10),limits=c(0,100)) + 164 | #scale_y_continuous(minor_breaks=0,breaks=seq(0,26,4),limits=c(0,25)) + 165 | theme(axis.ticks=element_blank()) + 166 | # Dispose of the legend 167 | theme(legend.position="none") + 168 | theme(legend.background = element_rect(fill=color.background)) + 169 | theme(legend.text = element_text(size=7,colour=color.axis.title,family=font.axis)) + 170 | # Set title and axis labels, and format these and tick marks 171 | theme(plot.title=element_text(colour=color.title,family=font.title, size=9, vjust=1.25, lineheight=0.1)) + 172 | theme(axis.text.x=element_text(size=7,colour=color.axis.text,family=font.axis)) + 173 | 
theme(axis.text.y=element_text(size=7,colour=color.axis.text,family=font.axis)) + 174 | theme(axis.title.y=element_text(size=7,colour=color.axis.title,family=font.title, vjust=1.25)) + 175 | theme(axis.title.x=element_text(size=7,colour=color.axis.title,family=font.title, vjust=0)) + 176 | 177 | # Big bold line at y=0 178 | #geom_hline(yintercept=0,size=0.75,colour=palate[9]) + 179 | # Plot margins and finally line annotations 180 | theme(plot.margin = unit(c(0.35, 0.2, 0.15, 0.4), "cm")) + 181 | 182 | theme(strip.background = element_rect(fill=color.background, color=color.background),strip.text=element_text(size=7,colour=color.axis.title,family=font.title)) 183 | 184 | } 185 | -------------------------------------------------------------------------------- /examples/entity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/minimaxir/facebook-page-post-scraper/275711ffaec6a959a1802d9ac3df710e33920a77/examples/entity.png -------------------------------------------------------------------------------- /examples/fb_scraper_data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/minimaxir/facebook-page-post-scraper/275711ffaec6a959a1802d9ac3df710e33920a77/examples/fb_scraper_data.png -------------------------------------------------------------------------------- /examples/how_to_build_facebook_scraper.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# How to Scrape Data From Facebook Page Posts for Statistical Analysis\n", 8 | "\n", 9 | "By [Max Woolf (@minimaxir)](http://minimaxir.com/)\n", 10 | "\n", 11 | "This notebook describes how to build a Facebook Scraper using the latest version of Facebook's Graph API (v2.4). This is the accompanyment to my blog post [How to Scrape Data From Facebook Page Posts for Statistical Analysis](http://minimaxir.com/2015/07/facebook-scraper/)." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": { 18 | "collapsed": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "# import some Python dependencies\n", 23 | "\n", 24 | "import urllib2\n", 25 | "import json\n", 26 | "import datetime\n", 27 | "import csv\n", 28 | "import time" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "Accessing Facebook page data requires an access token.\n", 36 | "\n", 37 | "Since the user access token expires within an hour, we need to create a dummy application *for the sole purpose of scraping* and use the app ID and app secret generated there [as described here](https://developers.facebook.com/docs/facebook-login/access-tokens#apptokens), both of which never expire." 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": { 44 | "collapsed": false 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "# Since the code output in this notebook leaks the app_secret,\n", 49 | "# it has been reset by the time you read this.\n", 50 | "\n", 51 | "app_id = \"272535582777707\"\n", 52 | "app_secret = \"59e7ab31b01d3a5a90ec15a7a45a5e3b\" # DO NOT SHARE WITH ANYONE!\n", 53 | "\n", 54 | "access_token = app_id + \"|\" + app_secret" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "Now we can access public Facebook data without limit. 
Let's do our analysis on the [New York Times Facebook page](https://www.facebook.com/nytimes), which is popular enough to yield good data." 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 3, 67 | "metadata": { 68 | "collapsed": true 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "page_id = 'nytimes'" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "Let's write a quick program to ping NYT's Facebook page to verify that the `access_token` works and the `page_id` is valid." 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 4, 85 | "metadata": { 86 | "collapsed": false 87 | }, 88 | "outputs": [ 89 | { 90 | "name": "stdout", 91 | "output_type": "stream", 92 | "text": [ 93 | "{\n", 94 | " \"id\": \"5281959998\", \n", 95 | " \"name\": \"The New York Times\"\n", 96 | "}\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "def testFacebookPageData(page_id, access_token):\n", 102 | " \n", 103 | " # construct the URL string\n", 104 | " base = \"https://graph.facebook.com/v2.4\"\n", 105 | " node = \"/\" + page_id\n", 106 | " parameters = \"/?access_token=%s\" % access_token\n", 107 | " url = base + node + parameters\n", 108 | " \n", 109 | " # retrieve data\n", 110 | " req = urllib2.Request(url)\n", 111 | " response = urllib2.urlopen(req)\n", 112 | " data = json.loads(response.read())\n", 113 | " \n", 114 | " print json.dumps(data, indent=4, sort_keys=True)\n", 115 | " \n", 116 | "\n", 117 | "testFacebookPageData(page_id, access_token)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "When scraping large amounts of data from public APIs, there's a high probability that you'll hit an [HTTP Error 500 (Internal Error)](http://www.checkupdown.com/status/E500.html) at some point. There is no way to avoid that on our end. \n", 125 | "\n", 126 | "Instead, we'll use a helper function to catch the error and try again after a few seconds, which usually works. This helper function also consolidates the data retrival code, so it kills two birds with one stone." 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 5, 132 | "metadata": { 133 | "collapsed": true 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "def request_until_succeed(url):\n", 138 | " req = urllib2.Request(url)\n", 139 | " success = False\n", 140 | " while success is False:\n", 141 | " try: \n", 142 | " response = urllib2.urlopen(req)\n", 143 | " if response.getcode() == 200:\n", 144 | " success = True\n", 145 | " except Exception, e:\n", 146 | " print e\n", 147 | " time.sleep(5)\n", 148 | " \n", 149 | " print \"Error for URL %s: %s\" % (url, datetime.datetime.now())\n", 150 | "\n", 151 | " return response.read()" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "The data is the Facebook Page metadata however; we need to change the endpoint to the /feed endpoint." 
159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 6, 164 | "metadata": { 165 | "collapsed": false 166 | }, 167 | "outputs": [ 168 | { 169 | "name": "stdout", 170 | "output_type": "stream", 171 | "text": [ 172 | "{\n", 173 | " \"data\": [\n", 174 | " {\n", 175 | " \"created_time\": \"2015-07-20T01:25:01+0000\", \n", 176 | " \"id\": \"5281959998_10150628157724999\", \n", 177 | " \"message\": \"The planned megalopolis, a metropolitan area that would be about 6 times the size of New York\\u2019s, is meant to revamp northern China\\u2019s economy and become a laboratory for modern urban growth.\"\n", 178 | " }, \n", 179 | " {\n", 180 | " \"created_time\": \"2015-07-19T22:55:01+0000\", \n", 181 | " \"id\": \"5281959998_10150628161129999\", \n", 182 | " \"message\": \"\\\"It\\u2019s safe to say that federal agencies are not where we want them to be across the board,\\\" said President Barack Obama's top cybersecurity adviser. \\\"We clearly need to be moving faster.\\\"\"\n", 183 | " }, \n", 184 | " {\n", 185 | " \"created_time\": \"2015-07-19T22:25:01+0000\", \n", 186 | " \"id\": \"5281959998_10150626434639999\", \n", 187 | " \"message\": \"Showcase your summer tomatoes in this elegant crostata.\"\n", 188 | " }, \n", 189 | " {\n", 190 | " \"created_time\": \"2015-07-19T21:55:08+0000\", \n", 191 | " \"id\": \"5281959998_10150628170209999\", \n", 192 | " \"message\": \"The task: Create a technologically sophisticated barbecue smoker that could outperform the best product on the market and be sold for less than $1,500.\"\n", 193 | " }, \n", 194 | " {\n", 195 | " \"created_time\": \"2015-07-19T21:25:00+0000\", \n", 196 | " \"id\": \"5281959998_10150626449129999\", \n", 197 | " \"message\": \"Achieving pastel hair can be time-consuming and toxic \\u2014 but for some, so very worth it.\"\n", 198 | " }, \n", 199 | " {\n", 200 | " \"created_time\": \"2015-07-19T20:53:05+0000\", \n", 201 | " \"id\": \"5281959998_10150626425084999\", \n", 202 | " \"message\": \"Attention, meat lovers: This simple barbecue sauce goes beautifully with pork and chicken.\"\n", 203 | " }, \n", 204 | " {\n", 205 | " \"created_time\": \"2015-07-19T20:25:07+0000\", \n", 206 | " \"id\": \"5281959998_10150628132119999\", \n", 207 | " \"message\": \"He passed the police officer exam in 2011. He went through orientation and started undergoing the required background checks in 2013. Then, the process stopped cold. No emails. No calls. No explanations. 
Silence.\"\n", 208 | " }, \n", 209 | " {\n", 210 | " \"created_time\": \"2015-07-19T19:55:32+0000\", \n", 211 | " \"id\": \"5281959998_10150628116259999\", \n", 212 | " \"message\": \"The election is 16 months away, but knowing what we know now, what should we expect the economic backdrop to be when Americans choose their next president?\"\n", 213 | " }, \n", 214 | " {\n", 215 | " \"created_time\": \"2015-07-19T19:25:07+0000\", \n", 216 | " \"id\": \"5281959998_10150628097394999\", \n", 217 | " \"message\": \"\\\"By focusing so intently on physical fitness, the corps is avoiding the real barrier to integration \\u2014 the hypermasculine culture at its heart.\\\" Read on in The New York Times Opinion.\"\n", 218 | " }, \n", 219 | " {\n", 220 | " \"created_time\": \"2015-07-19T19:05:01+0000\", \n", 221 | " \"id\": \"5281959998_10150628071729999\", \n", 222 | " \"message\": \"U2's \\u201cInnocence and Experience\\u201d tour merges past and present, peace and war, audience and band, punk and statesman, grass-roots activism and corporate philanthropy.\"\n", 223 | " }, \n", 224 | " {\n", 225 | " \"created_time\": \"2015-07-19T18:55:05+0000\", \n", 226 | " \"id\": \"5281959998_10150628073894999\", \n", 227 | " \"message\": \"\\\"I always believe in apologizing if you\\u2019ve done something wrong, but if you read my statement, you\\u2019ll see I said nothing wrong,\\\" Donald J. Trump said in an interview.\"\n", 228 | " }, \n", 229 | " {\n", 230 | " \"created_time\": \"2015-07-19T18:25:21+0000\", \n", 231 | " \"id\": \"5281959998_10150628056964999\", \n", 232 | " \"message\": \"Booing, like opera, can be divided into several genres.\"\n", 233 | " }, \n", 234 | " {\n", 235 | " \"created_time\": \"2015-07-19T17:55:08+0000\", \n", 236 | " \"id\": \"5281959998_10150628040459999\", \n", 237 | " \"message\": \"\\\"Nearly at once, [the Confederate flag and Atticus Finch] have fallen from grace in ways that were unimaginable just months ago. They are forcing a reckoning with ourselves and our history, a reassessment of who we were and of what we might become.\\\" Read on in The New York Times Opinion.\"\n", 238 | " }, \n", 239 | " {\n", 240 | " \"created_time\": \"2015-07-19T17:25:00+0000\", \n", 241 | " \"id\": \"5281959998_10150627982469999\", \n", 242 | " \"message\": \"It's National Ice Cream Day. How about cooling off with a treat?\", \n", 243 | " \"story\": \"The New York Times added 4 new photos.\"\n", 244 | " }, \n", 245 | " {\n", 246 | " \"created_time\": \"2015-07-19T16:55:07+0000\", \n", 247 | " \"id\": \"5281959998_10150628000024999\", \n", 248 | " \"message\": \"Bystanders watched people wave flags celebrating Pan-Africanism, the Confederacy and the Nazi Party. And they watched as black demonstrators raised clenched fists, and white demonstrators performed Nazi salutes.\"\n", 249 | " }, \n", 250 | " {\n", 251 | " \"created_time\": \"2015-07-19T16:25:08+0000\", \n", 252 | " \"id\": \"5281959998_10150627989069999\", \n", 253 | " \"message\": \"\\\"Because in the sunset of his presidency, Barack Obama's bolder side is rising. He\\u2019s a lame duck who doesn\\u2019t give a damn.\\\" Read on in The New York Times Opinion.\"\n", 254 | " }, \n", 255 | " {\n", 256 | " \"created_time\": \"2015-07-19T15:55:06+0000\", \n", 257 | " \"id\": \"5281959998_10150627979424999\", \n", 258 | " \"message\": \"The flyby of Pluto was a triumph of human ingenuity and the capstone of a mission that unfolded nearly flawlessly. 
Yet it almost didn't happen.\"\n", 259 | " }, \n", 260 | " {\n", 261 | " \"created_time\": \"2015-07-19T15:25:04+0000\", \n", 262 | " \"id\": \"5281959998_10150627970394999\", \n", 263 | " \"message\": \"After 6 months apart, Caroline Dove planned to reunite with her boyfriend of more than 2 years. But before she could make the trip, there came a final, portentous message.\"\n", 264 | " }, \n", 265 | " {\n", 266 | " \"created_time\": \"2015-07-19T14:55:08+0000\", \n", 267 | " \"id\": \"5281959998_10150627962014999\", \n", 268 | " \"message\": \"Hillary Clinton has made the struggles of her mother a central part of her 2016 campaign\\u2019s message. But her father, whom she rarely talks about publicly, exerted an equally powerful, if sometimes bruising, influence on the woman who wants to become the first female president.\"\n", 269 | " }, \n", 270 | " {\n", 271 | " \"created_time\": \"2015-07-19T14:25:09+0000\", \n", 272 | " \"id\": \"5281959998_10150627952769999\", \n", 273 | " \"message\": \"Quotation of the Day: \\\"When your contract is over, they send you home, saying they\\u2019ve transferred the money. You get home, and there is nothing there.\\\" \\u2014 Yuriy Cheng, a Ukrainian seaman, describing the owner of the Dona Liberta, a ship that is a case study of misconduct at sea.\"\n", 274 | " }, \n", 275 | " {\n", 276 | " \"created_time\": \"2015-07-19T12:55:01+0000\", \n", 277 | " \"id\": \"5281959998_10150626434214999\", \n", 278 | " \"message\": \"Summer on a stick. (via The New York Times Food)\"\n", 279 | " }, \n", 280 | " {\n", 281 | " \"created_time\": \"2015-07-19T09:55:00+0000\", \n", 282 | " \"id\": \"5281959998_10150627665974999\", \n", 283 | " \"message\": \"The surge of migrants into Europe from war-ravaged and impoverished parts of the Middle East, Afghanistan and Africa has shifted in recent months. Migrants are now pushing by land across the western Balkans, in numbers roughly equal to those entering the Continent through Italy.\"\n", 284 | " }, \n", 285 | " {\n", 286 | " \"created_time\": \"2015-07-19T03:55:00+0000\", \n", 287 | " \"id\": \"5281959998_10150626450789999\", \n", 288 | " \"message\": \"When your big toe isn't your biggest toe.\"\n", 289 | " }, \n", 290 | " {\n", 291 | " \"created_time\": \"2015-07-19T02:55:00+0000\", \n", 292 | " \"id\": \"5281959998_10150626440069999\", \n", 293 | " \"message\": \"\\\"Progress is occurring, as courts accept that in marriage and other matters, gender can't be reduced to chromosomes or surgeries,\\\" writes J. 
Courtney Sullivan in The New York Times Opinion.\"\n", 294 | " }, \n", 295 | " {\n", 296 | " \"created_time\": \"2015-07-19T01:55:01+0000\", \n", 297 | " \"id\": \"5281959998_10150627562209999\", \n", 298 | " \"message\": \"Experimenting with neon lavender, sea-foam green and soft periwinkle.\"\n", 299 | " }\n", 300 | " ], \n", 301 | " \"paging\": {\n", 302 | " \"next\": \"https://graph.facebook.com/v2.4/5281959998/feed?access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&limit=25&until=1437270901&__paging_token=enc_AdB73LgZAUngYJIdoZCGUgWvKdL9zs23TBqdfeK90PnPs9MqO7xeze7ANGK2zMxZAveZAvwa1nHzTObmzuKiHY7MVVow\", \n", 303 | " \"previous\": \"https://graph.facebook.com/v2.4/5281959998/feed?since=1437355501&access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&limit=25&__paging_token=enc_AdC4YOxNofFbJWmap6PZC6S0iyiWG8A1FpsYTMrBG62tmT6HfNuhc6rcxL6fMk8ZAxx0EQcFy52SJ2fJ1TbIL47EQx&__previous=1\"\n", 304 | " }\n", 305 | "}\n" 306 | ] 307 | } 308 | ], 309 | "source": [ 310 | "def testFacebookPageFeedData(page_id, access_token):\n", 311 | " \n", 312 | " # construct the URL string\n", 313 | " base = \"https://graph.facebook.com/v2.4\"\n", 314 | " node = \"/\" + page_id + \"/feed\" # changed\n", 315 | " parameters = \"/?access_token=%s\" % access_token\n", 316 | " url = base + node + parameters\n", 317 | " \n", 318 | " # retrieve data\n", 319 | " data = json.loads(request_until_succeed(url))\n", 320 | " \n", 321 | " print json.dumps(data, indent=4, sort_keys=True)\n", 322 | " \n", 323 | "\n", 324 | "testFacebookPageFeedData(page_id, access_token)" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "In v2.4, the default behavior is to return very, very little metadata for statuses in order to reduce bandwidth, with the expectation the user will request the necessary fields.\n", 332 | "\n", 333 | "We don't need data on every NYT status. Yet. Let's reduce the requested fields to exactly what we need, and the number of stories returned to 1 so we can process it." 
334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": 7, 339 | "metadata": { 340 | "collapsed": false, 341 | "scrolled": true 342 | }, 343 | "outputs": [ 344 | { 345 | "name": "stdout", 346 | "output_type": "stream", 347 | "text": [ 348 | "{\n", 349 | " \"comments\": {\n", 350 | " \"data\": [\n", 351 | " {\n", 352 | " \"can_remove\": false, \n", 353 | " \"created_time\": \"2015-07-20T01:28:02+0000\", \n", 354 | " \"from\": {\n", 355 | " \"id\": \"859569687424896\", \n", 356 | " \"name\": \"Chris Gagne\"\n", 357 | " }, \n", 358 | " \"id\": \"10150628157724999_10150628249759999\", \n", 359 | " \"like_count\": 9, \n", 360 | " \"message\": \"Aaaaaaaand there goes the rest of Beijing's clean air, whatever was left of it.\", \n", 361 | " \"user_likes\": false\n", 362 | " }\n", 363 | " ], \n", 364 | " \"paging\": {\n", 365 | " \"cursors\": {\n", 366 | " \"after\": \"MzE=\", \n", 367 | " \"before\": \"MzE=\"\n", 368 | " }, \n", 369 | " \"next\": \"https://graph.facebook.com/v2.0/5281959998_10150628157724999/comments?order=chronological&limit=1&summary=true&access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&after=MzE%3D\"\n", 370 | " }, \n", 371 | " \"summary\": {\n", 372 | " \"order\": \"ranked\", \n", 373 | " \"total_count\": 31\n", 374 | " }\n", 375 | " }, \n", 376 | " \"created_time\": \"2015-07-20T01:25:01+0000\", \n", 377 | " \"id\": \"5281959998_10150628157724999\", \n", 378 | " \"likes\": {\n", 379 | " \"data\": [\n", 380 | " {\n", 381 | " \"id\": \"1001217933243627\", \n", 382 | " \"name\": \"Josh Smith\"\n", 383 | " }\n", 384 | " ], \n", 385 | " \"paging\": {\n", 386 | " \"cursors\": {\n", 387 | " \"after\": \"MTAwMTIxNzkzMzI0MzYyNw==\", \n", 388 | " \"before\": \"MTAwMTIxNzkzMzI0MzYyNw==\"\n", 389 | " }, \n", 390 | " \"next\": \"https://graph.facebook.com/v2.0/5281959998_10150628157724999/likes?limit=1&summary=true&access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&after=MTAwMTIxNzkzMzI0MzYyNw%3D%3D\"\n", 391 | " }, \n", 392 | " \"summary\": {\n", 393 | " \"total_count\": 278\n", 394 | " }\n", 395 | " }, \n", 396 | " \"link\": \"http://nyti.ms/1Jr6LhU\", \n", 397 | " \"message\": \"The planned megalopolis, a metropolitan area that would be about 6 times the size of New York\\u2019s, is meant to revamp northern China\\u2019s economy and become a laboratory for modern urban growth.\", \n", 398 | " \"name\": \"China Molds a Supercity Around Beijing, Promising to Change Lives\", \n", 399 | " \"shares\": {\n", 400 | " \"count\": 50\n", 401 | " }, \n", 402 | " \"type\": \"link\"\n", 403 | "}\n" 404 | ] 405 | } 406 | ], 407 | "source": [ 408 | "def getFacebookPageFeedData(page_id, access_token, num_statuses):\n", 409 | " \n", 410 | " # construct the URL string\n", 411 | " base = \"https://graph.facebook.com\"\n", 412 | " node = \"/\" + page_id + \"/feed\" \n", 413 | " parameters = \"/?fields=message,link,created_time,type,name,id,likes.limit(1).summary(true),comments.limit(1).summary(true),shares&limit=%s&access_token=%s\" % (num_statuses, access_token) # changed\n", 414 | " url = base + node + parameters\n", 415 | " \n", 416 | " # retrieve data\n", 417 | " data = json.loads(request_until_succeed(url))\n", 418 | " \n", 419 | " return data\n", 420 | " \n", 421 | "\n", 422 | "test_status = getFacebookPageFeedData(page_id, access_token, 1)[\"data\"][0]\n", 423 | "print json.dumps(test_status, indent=4, sort_keys=True)" 424 | ] 425 | }, 426 | { 427 | "cell_type": "markdown", 428 | "metadata": {}, 429 | "source": [ 430 | "Now that we have a sample 
Facebook page status, we can write a function to process each field individually." 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 8, 436 | "metadata": { 437 | "collapsed": false 438 | }, 439 | "outputs": [ 440 | { 441 | "name": "stdout", 442 | "output_type": "stream", 443 | "text": [ 444 | "(u'5281959998_10150628157724999', 'The planned megalopolis, a metropolitan area that would be about 6 times the size of New York\\xe2\\x80\\x99s, is meant to revamp northern China\\xe2\\x80\\x99s economy and become a laboratory for modern urban growth.', 'China Molds a Supercity Around Beijing, Promising to Change Lives', u'link', u'http://nyti.ms/1Jr6LhU', '2015-07-19 20:25:01', 278, 31, 50)\n" 445 | ] 446 | } 447 | ], 448 | "source": [ 449 | "def processFacebookPageFeedStatus(status):\n", 450 | " \n", 451 | " # The status is now a Python dictionary, so for top-level items,\n", 452 | " # we can simply call the key.\n", 453 | " \n", 454 | " # Additionally, some items may not always exist,\n", 455 | " # so must check for existence first\n", 456 | " \n", 457 | " status_id = status['id']\n", 458 | " status_message = '' if 'message' not in status.keys() else status['message'].encode('utf-8')\n", 459 | " link_name = '' if 'name' not in status.keys() else status['name'].encode('utf-8')\n", 460 | " status_type = status['type']\n", 461 | " status_link = '' if 'link' not in status.keys() else status['link']\n", 462 | " \n", 463 | " \n", 464 | " # Time needs special care since a) it's in UTC and\n", 465 | " # b) it's not easy to use in statistical programs.\n", 466 | " \n", 467 | " status_published = datetime.datetime.strptime(status['created_time'],'%Y-%m-%dT%H:%M:%S+0000')\n", 468 | " status_published = status_published + datetime.timedelta(hours=-5) # EST\n", 469 | " status_published = status_published.strftime('%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs\n", 470 | " \n", 471 | " # Nested items require chaining dictionary keys.\n", 472 | " \n", 473 | " num_likes = 0 if 'likes' not in status.keys() else status['likes']['summary']['total_count']\n", 474 | " num_comments = 0 if 'comments' not in status.keys() else status['comments']['summary']['total_count']\n", 475 | " num_shares = 0 if 'shares' not in status.keys() else status['shares']['count']\n", 476 | " \n", 477 | " # return a tuple of all processed data\n", 478 | " return (status_id, status_message, link_name, status_type, status_link,\n", 479 | " status_published, num_likes, num_comments, num_shares)\n", 480 | "\n", 481 | "processed_test_status = processFacebookPageFeedStatus(test_status)\n", 482 | "print processed_test_status" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "Surprisingly, we're almost done! Now we just need to:\n", 490 | "\n", 491 | "1. Query each page of Facebook Page Statuses (100 statuses per page) using `getFacebookPageFeedData`.\n", 492 | "2. Process all statuses on that page using `processFacebookPageFeedStatus` and writing the output to a CSV file.\n", 493 | "3. Navigate to the next page, and repeat until no more statuses\n", 494 | "\n", 495 | "This block implements both the writing to CSV and page navigation." 
496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 9, 501 | "metadata": { 502 | "collapsed": false, 503 | "scrolled": true 504 | }, 505 | "outputs": [ 506 | { 507 | "name": "stdout", 508 | "output_type": "stream", 509 | "text": [ 510 | "Scraping nytimes Facebook Page: 2015-07-19 18:36:33.051000\n", 511 | "\n", 512 | "1000 Statuses Processed: 2015-07-19 18:36:59.366000\n", 513 | "2000 Statuses Processed: 2015-07-19 18:37:28.289000\n", 514 | "3000 Statuses Processed: 2015-07-19 18:37:56.487000\n", 515 | "4000 Statuses Processed: 2015-07-19 18:38:30.355000\n", 516 | "5000 Statuses Processed: 2015-07-19 18:38:58.661000\n", 517 | "6000 Statuses Processed: 2015-07-19 18:39:26.990000\n", 518 | "7000 Statuses Processed: 2015-07-19 18:39:55.906000\n", 519 | "8000 Statuses Processed: 2015-07-19 18:40:20.628000\n", 520 | "9000 Statuses Processed: 2015-07-19 18:40:44.801000\n", 521 | "10000 Statuses Processed: 2015-07-19 18:41:11.759000\n", 522 | "11000 Statuses Processed: 2015-07-19 18:41:38.739000\n", 523 | "12000 Statuses Processed: 2015-07-19 18:42:05.562000\n", 524 | "13000 Statuses Processed: 2015-07-19 18:42:32.696000\n", 525 | "14000 Statuses Processed: 2015-07-19 18:42:59.939000\n", 526 | "15000 Statuses Processed: 2015-07-19 18:43:26.889000\n", 527 | "16000 Statuses Processed: 2015-07-19 18:43:53.106000\n", 528 | "17000 Statuses Processed: 2015-07-19 18:44:19.457000\n", 529 | "18000 Statuses Processed: 2015-07-19 18:44:45.637000\n", 530 | "19000 Statuses Processed: 2015-07-19 18:45:11.255000\n", 531 | "20000 Statuses Processed: 2015-07-19 18:45:34.447000\n", 532 | "21000 Statuses Processed: 2015-07-19 18:45:58.425000\n", 533 | "22000 Statuses Processed: 2015-07-19 18:46:23.920000\n", 534 | "23000 Statuses Processed: 2015-07-19 18:46:49.274000\n", 535 | "24000 Statuses Processed: 2015-07-19 18:47:15.616000\n", 536 | "25000 Statuses Processed: 2015-07-19 18:47:39.930000\n", 537 | "26000 Statuses Processed: 2015-07-19 18:48:08.076000\n", 538 | "HTTP Error 502: Error parsing server response\n", 539 | "Error for URL https://graph.facebook.com/v2.0/5281959998/feed?fields=message,link,created_time,type,name,id,likes.limit%281%29.summary%28true%29,comments.limit%281%29.summary%28true%29,shares&limit=100&__paging_token=enc_AdBLHCQ9lOKXuEx1TEXyLWs7FEQ8RN7yGjUH0LXbw5iUpDXvcZCUIXJa2ZC2s6sBHC8EyrGl6Oafb9OqZBgBFzmuRZB9&access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&until=1340213557: 2015-07-19 18:48:23.256000\n", 540 | "27000 Statuses Processed: 2015-07-19 18:48:38.748000\n", 541 | "28000 Statuses Processed: 2015-07-19 18:49:03.033000\n", 542 | "29000 Statuses Processed: 2015-07-19 18:49:26.957000\n", 543 | "30000 Statuses Processed: 2015-07-19 18:49:51.405000\n", 544 | "31000 Statuses Processed: 2015-07-19 18:50:15.830000\n", 545 | "32000 Statuses Processed: 2015-07-19 18:50:37.641000\n", 546 | "33000 Statuses Processed: 2015-07-19 18:50:57.574000\n", 547 | "\n", 548 | "Done!\n", 549 | "33296 Statuses Processed in 0:14:28.200000\n" 550 | ] 551 | } 552 | ], 553 | "source": [ 554 | "def scrapeFacebookPageFeedStatus(page_id, access_token):\n", 555 | " with open('%s_facebook_statuses.csv' % page_id, 'wb') as file:\n", 556 | " w = csv.writer(file)\n", 557 | " w.writerow([\"status_id\", \"status_message\", \"link_name\", \"status_type\", \"status_link\",\n", 558 | " \"status_published\", \"num_likes\", \"num_comments\", \"num_shares\"])\n", 559 | " \n", 560 | " has_next_page = True\n", 561 | " num_processed = 0 # keep a count on how many we've processed\n", 562 | " 
scrape_starttime = datetime.datetime.now()\n", 563 | " \n", 564 | " print \"Scraping %s Facebook Page: %s\\n\" % (page_id, scrape_starttime)\n", 565 | " \n", 566 | " statuses = getFacebookPageFeedData(page_id, access_token, 100)\n", 567 | " \n", 568 | " while has_next_page:\n", 569 | " for status in statuses['data']:\n", 570 | " w.writerow(processFacebookPageFeedStatus(status))\n", 571 | " \n", 572 | " # output progress occasionally to make sure code is not stalling\n", 573 | " num_processed += 1\n", 574 | " if num_processed % 1000 == 0:\n", 575 | " print \"%s Statuses Processed: %s\" % (num_processed, datetime.datetime.now())\n", 576 | " \n", 577 | " # if there is no next page, we're done.\n", 578 | " if 'paging' in statuses.keys():\n", 579 | " statuses = json.loads(request_until_succeed(statuses['paging']['next']))\n", 580 | " else:\n", 581 | " has_next_page = False\n", 582 | " \n", 583 | " \n", 584 | " print \"\\nDone!\\n%s Statuses Processed in %s\" % (num_processed, datetime.datetime.now() - scrape_starttime)\n", 585 | "\n", 586 | "\n", 587 | "scrapeFacebookPageFeedStatus(page_id, access_token)" 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "The CSV can be opened in all major statistical programs. Have fun! :)\n", 595 | "\n", 596 | "You can download the [NYTimes data here](https://dl.dropboxusercontent.com/u/2017402/nytimes_facebook_statuses.zip). [4.6MB]" 597 | ] 598 | } 599 | ], 600 | "metadata": { 601 | "kernelspec": { 602 | "display_name": "Python 2", 603 | "language": "python", 604 | "name": "python2" 605 | }, 606 | "language_info": { 607 | "codemirror_mode": { 608 | "name": "ipython", 609 | "version": 2 610 | }, 611 | "file_extension": ".py", 612 | "mimetype": "text/x-python", 613 | "name": "python", 614 | "nbconvert_exporter": "python", 615 | "pygments_lexer": "ipython2", 616 | "version": "2.7.8" 617 | } 618 | }, 619 | "nbformat": 4, 620 | "nbformat_minor": 0 621 | } 622 | -------------------------------------------------------------------------------- /examples/reaction-example-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/minimaxir/facebook-page-post-scraper/275711ffaec6a959a1802d9ac3df710e33920a77/examples/reaction-example-1.png -------------------------------------------------------------------------------- /examples/reaction-example-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/minimaxir/facebook-page-post-scraper/275711ffaec6a959a1802d9ac3df710e33920a77/examples/reaction-example-2.png -------------------------------------------------------------------------------- /examples/reaction-example-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/minimaxir/facebook-page-post-scraper/275711ffaec6a959a1802d9ac3df710e33920a77/examples/reaction-example-3.png -------------------------------------------------------------------------------- /examples/reaction_count_data_analysis_example.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Example of Processing Facebook Reaction Data\n", 8 | "\n", 9 | "by Max Woolf (@minimaxir)\n", 10 | "\n", 11 | "*This notebook is licensed under the MIT License. 
If you use the code or data visualization designs contained within this notebook, it would be greatly appreciated if proper attribution is given back to this notebook and/or myself. Thanks! :)*" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 34, 17 | "metadata": { 18 | "collapsed": false 19 | }, 20 | "outputs": [ 21 | { 22 | "data": { 23 | "text/plain": [ 24 | "R version 3.3.0 (2016-05-03)\n", 25 | "Platform: x86_64-apple-darwin13.4.0 (64-bit)\n", 26 | "Running under: OS X 10.11.4 (El Capitan)\n", 27 | "\n", 28 | "locale:\n", 29 | "[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8\n", 30 | "\n", 31 | "attached base packages:\n", 32 | "[1] grid stats graphics grDevices utils datasets methods \n", 33 | "[8] base \n", 34 | "\n", 35 | "other attached packages:\n", 36 | " [1] viridis_0.3.4 tidyr_0.4.1 stringr_1.0.0 digest_0.6.9 \n", 37 | " [5] RColorBrewer_1.1-2 scales_0.4.0 extrafont_0.17 ggplot2_2.1.0 \n", 38 | " [9] dplyr_0.4.3 readr_0.2.2 \n", 39 | "\n", 40 | "loaded via a namespace (and not attached):\n", 41 | " [1] Rcpp_0.12.4 Rttf2pt1_1.3.3 magrittr_1.5 munsell_0.4.3 \n", 42 | " [5] uuid_0.1-2 colorspace_1.2-6 R6_2.1.2 plyr_1.8.3 \n", 43 | " [9] tools_3.3.0 parallel_3.3.0 gtable_0.2.0 DBI_0.4 \n", 44 | "[13] extrafontdb_1.0 lazyeval_0.1.10 assertthat_0.1 gridExtra_2.2.1 \n", 45 | "[17] IRdisplay_0.3 repr_0.4 base64enc_0.1-3 IRkernel_0.5 \n", 46 | "[21] evaluate_0.9 rzmq_0.7.7 stringi_1.0-1 jsonlite_0.9.19 " 47 | ] 48 | }, 49 | "execution_count": 34, 50 | "metadata": {}, 51 | "output_type": "execute_result" 52 | } 53 | ], 54 | "source": [ 55 | "source(\"Rstart.R\")\n", 56 | "\n", 57 | "library(tidyr)\n", 58 | "library(viridis)\n", 59 | "\n", 60 | "sessionInfo()" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 3, 66 | "metadata": { 67 | "collapsed": false 68 | }, 69 | "outputs": [ 70 | { 71 | "name": "stdout", 72 | "output_type": "stream", 73 | "text": [ 74 | "Source: local data frame [6 x 15]\n", 75 | "\n", 76 | " status_id\n", 77 | " (chr)\n", 78 | "1 5550296508_10154919083226509\n", 79 | "2 5550296508_10154919005411509\n", 80 | "3 5550296508_10154918925156509\n", 81 | "4 5550296508_10154918906011509\n", 82 | "5 5550296508_10154918844706509\n", 83 | "6 5550296508_10154918803531509\n", 84 | "Variables not shown: status_message (chr), link_name (chr), status_type (chr),\n", 85 | " status_link (chr), status_published (time), num_reactions (int), num_comments\n", 86 | " (int), num_shares (int), num_likes (int), num_loves (int), num_wows (int),\n", 87 | " num_hahas (int), num_sads (int), num_angrys (int)\n" 88 | ] 89 | }, 90 | { 91 | "data": { 92 | "text/html": [ 93 | "4258" 94 | ], 95 | "text/latex": [ 96 | "4258" 97 | ], 98 | "text/markdown": [ 99 | "4258" 100 | ], 101 | "text/plain": [ 102 | "[1] 4258" 103 | ] 104 | }, 105 | "execution_count": 3, 106 | "metadata": {}, 107 | "output_type": "execute_result" 108 | } 109 | ], 110 | "source": [ 111 | "df <- read_csv(\"cnn_facebook_statuses.csv\") %>% filter(status_published > '2016-02-24 00:00:00')\n", 112 | "\n", 113 | "print(head(df))\n", 114 | "nrow(df)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 31, 120 | "metadata": { 121 | "collapsed": false 122 | }, 123 | "outputs": [ 124 | { 125 | "name": "stdout", 126 | "output_type": "stream", 127 | "text": [ 128 | "Source: local data frame [6 x 7]\n", 129 | "\n", 130 | " date total_likes total_loves total_wows total_hahas total_sads\n", 131 | " (date) (int) (int) (int) (int) (int)\n", 132 | "1 
2016-02-24 215784 12366 9699 6670 2699\n", 133 | "2 2016-02-25 183785 8280 4879 12300 2049\n", 134 | "3 2016-02-26 191436 6445 6141 14510 1874\n", 135 | "4 2016-02-27 144926 8828 2300 1004 1984\n", 136 | "5 2016-02-28 140882 6593 1627 3657 3654\n", 137 | "6 2016-02-29 286802 13716 4404 5899 4410\n", 138 | "Variables not shown: total_angrys (int)\n" 139 | ] 140 | } 141 | ], 142 | "source": [ 143 | "df_agg <- df %>% group_by(date = as.Date(substr(status_published, 1, 10))) %>%\n", 144 | " summarize(total_likes=sum(num_likes),\n", 145 | " total_loves=sum(num_loves),\n", 146 | " total_wows=sum(num_wows),\n", 147 | " total_hahas=sum(num_hahas),\n", 148 | " total_sads=sum(num_sads),\n", 149 | " total_angrys=sum(num_angrys)) %>%\n", 150 | " arrange(date)\n", 151 | "\n", 152 | "print(head(df_agg))" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "For ggplot, data must be converted to long format." 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 62, 165 | "metadata": { 166 | "collapsed": false 167 | }, 168 | "outputs": [ 169 | { 170 | "name": "stdout", 171 | "output_type": "stream", 172 | "text": [ 173 | "Source: local data frame [20 x 3]\n", 174 | "\n", 175 | " date reaction count\n", 176 | " (date) (fctr) (int)\n", 177 | "1 2016-02-24 total_likes 215784\n", 178 | "2 2016-02-25 total_likes 183785\n", 179 | "3 2016-02-26 total_likes 191436\n", 180 | "4 2016-02-27 total_likes 144926\n", 181 | "5 2016-02-28 total_likes 140882\n", 182 | "6 2016-02-29 total_likes 286802\n", 183 | "7 2016-03-01 total_likes 197091\n", 184 | "8 2016-03-02 total_likes 204942\n", 185 | "9 2016-03-03 total_likes 198320\n", 186 | "10 2016-03-04 total_likes 113997\n", 187 | "11 2016-03-05 total_likes 154004\n", 188 | "12 2016-03-06 total_likes 219300\n", 189 | "13 2016-03-07 total_likes 140551\n", 190 | "14 2016-03-08 total_likes 161067\n", 191 | "15 2016-03-09 total_likes 104399\n", 192 | "16 2016-03-10 total_likes 158898\n", 193 | "17 2016-03-11 total_likes 212756\n", 194 | "18 2016-03-12 total_likes 98536\n", 195 | "19 2016-03-13 total_likes 91079\n", 196 | "20 2016-03-14 total_likes 155147\n" 197 | ] 198 | } 199 | ], 200 | "source": [ 201 | "df_agg_long <- df_agg %>% gather(key=reaction, value=count, total_likes:total_angrys) %>%\n", 202 | " mutate(reaction=factor(reaction))\n", 203 | "\n", 204 | "print(head(df_agg_long,20))" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "Create a stacked area chart. 
(filled to 100%)" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 64, 217 | "metadata": { 218 | "collapsed": false 219 | }, 220 | "outputs": [], 221 | "source": [ 222 | "plot <- ggplot(df_agg_long, aes(x=date, y=count, color=reaction, fill=reaction)) +\n", 223 | " geom_bar(size=0.25, position=\"fill\", stat=\"identity\") +\n", 224 | " fte_theme() +\n", 225 | " scale_x_date(breaks = date_breaks(\"1 month\"), labels = date_format(\"%b %Y\")) +\n", 226 | " scale_y_continuous(labels=percent) +\n", 227 | " theme(legend.title = element_blank(),\n", 228 | " legend.position=\"top\",\n", 229 | " legend.direction=\"horizontal\",\n", 230 | " legend.key.width=unit(0.5, \"cm\"),\n", 231 | " legend.key.height=unit(0.25, \"cm\"),\n", 232 | " legend.margin=unit(0,\"cm\")) +\n", 233 | " scale_color_viridis(discrete=T) +\n", 234 | " scale_fill_viridis(discrete=T) +\n", 235 | " labs(title=\"Daily Breakdown of Facebook Reactions on CNN's FB Posts\",\n", 236 | " x=\"Date Status Posted\",\n", 237 | " y=\"% Reaction Marketshare\")\n", 238 | "\n", 239 | "max_save(plot, \"reaction-example-1\", \"Facebook\")" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "![](reaction-example-1.png)\n", 247 | "\n", 248 | "The Likes reaction skews things. Run plot without it." 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 65, 254 | "metadata": { 255 | "collapsed": false 256 | }, 257 | "outputs": [], 258 | "source": [ 259 | "plot <- ggplot(df_agg_long %>% filter(reaction!=\"total_likes\"), aes(x=date, y=count, color=reaction, fill=reaction)) +\n", 260 | " geom_bar(size=0.25, position=\"fill\", stat=\"identity\") +\n", 261 | " fte_theme() +\n", 262 | " scale_x_date(breaks = date_breaks(\"1 month\"), labels = date_format(\"%b %Y\")) +\n", 263 | " scale_y_continuous(labels=percent) +\n", 264 | " theme(legend.title = element_blank(),\n", 265 | " legend.position=\"top\",\n", 266 | " legend.direction=\"horizontal\",\n", 267 | " legend.key.width=unit(0.5, \"cm\"),\n", 268 | " legend.key.height=unit(0.25, \"cm\"),\n", 269 | " legend.margin=unit(0,\"cm\")) +\n", 270 | " scale_color_viridis(discrete=T) +\n", 271 | " scale_fill_viridis(discrete=T) +\n", 272 | " labs(title=\"Daily Breakdown of Facebook Reactions on CNN's FB Posts\",\n", 273 | " x=\"Date Status Posted\",\n", 274 | " y=\"% Reaction Marketshare\")\n", 275 | "\n", 276 | "max_save(plot, \"reaction-example-2\", \"Facebook\")" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "![](reaction-example-2.png)" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "That visualization might be too crowded: use percent-wise calculations instead, and switch data to NYTimes for comparison." 
291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 76, 296 | "metadata": { 297 | "collapsed": false, 298 | "scrolled": false 299 | }, 300 | "outputs": [ 301 | { 302 | "name": "stdout", 303 | "output_type": "stream", 304 | "text": [ 305 | "Source: local data frame [6 x 6]\n", 306 | "\n", 307 | " date perc_loves perc_wows perc_hahas perc_sads perc_angrys\n", 308 | " (date) (dbl) (dbl) (dbl) (dbl) (dbl)\n", 309 | "1 2016-02-24 0.3930676 0.17360566 0.08621367 0.09740770 0.24970542\n", 310 | "2 2016-02-25 0.1919722 0.08666052 0.29210694 0.09332671 0.33593362\n", 311 | "3 2016-02-26 0.1435334 0.18946182 0.10831220 0.17396450 0.38472809\n", 312 | "4 2016-02-27 0.2736496 0.13627639 0.06443652 0.27570606 0.24993145\n", 313 | "5 2016-02-28 0.7713515 0.08522014 0.04054117 0.03737970 0.06550746\n", 314 | "6 2016-02-29 0.3399680 0.08842370 0.12708762 0.11256005 0.33196065\n" 315 | ] 316 | } 317 | ], 318 | "source": [ 319 | "df <- read_csv(\"nytimes_facebook_statuses.csv\") %>% filter(status_published > '2016-02-24 00:00:00')\n", 320 | "\n", 321 | "df_agg <- df %>% group_by(date = as.Date(substr(status_published, 1, 10))) %>%\n", 322 | " summarize(total_reactions=sum(num_loves)+sum(num_wows)+sum(num_hahas)+sum(num_sads)+sum(num_angrys),\n", 323 | " perc_loves=sum(num_loves)/total_reactions,\n", 324 | " perc_wows=sum(num_wows)/total_reactions,\n", 325 | " perc_hahas=sum(num_hahas)/total_reactions,\n", 326 | " perc_sads=sum(num_sads)/total_reactions,\n", 327 | " perc_angrys=sum(num_angrys)/total_reactions) %>%\n", 328 | " select(-total_reactions) %>%\n", 329 | " arrange(date)\n", 330 | "\n", 331 | "print(head(df_agg))" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 77, 337 | "metadata": { 338 | "collapsed": false 339 | }, 340 | "outputs": [ 341 | { 342 | "name": "stdout", 343 | "output_type": "stream", 344 | "text": [ 345 | "Source: local data frame [20 x 3]\n", 346 | "\n", 347 | " date reaction count\n", 348 | " (date) (fctr) (dbl)\n", 349 | "1 2016-02-24 perc_loves 0.39306756\n", 350 | "2 2016-02-25 perc_loves 0.19197220\n", 351 | "3 2016-02-26 perc_loves 0.14353339\n", 352 | "4 2016-02-27 perc_loves 0.27364957\n", 353 | "5 2016-02-28 perc_loves 0.77135153\n", 354 | "6 2016-02-29 perc_loves 0.33996797\n", 355 | "7 2016-03-01 perc_loves 0.34061714\n", 356 | "8 2016-03-02 perc_loves 0.24681208\n", 357 | "9 2016-03-03 perc_loves 0.35172992\n", 358 | "10 2016-03-04 perc_loves 0.19499779\n", 359 | "11 2016-03-05 perc_loves 0.14512737\n", 360 | "12 2016-03-06 perc_loves 0.40097144\n", 361 | "13 2016-03-07 perc_loves 0.30259557\n", 362 | "14 2016-03-08 perc_loves 0.36623147\n", 363 | "15 2016-03-09 perc_loves 0.21422640\n", 364 | "16 2016-03-10 perc_loves 0.31396083\n", 365 | "17 2016-03-11 perc_loves 0.33173516\n", 366 | "18 2016-03-12 perc_loves 0.06377902\n", 367 | "19 2016-03-13 perc_loves 0.25712914\n", 368 | "20 2016-03-14 perc_loves 0.33751152\n" 369 | ] 370 | } 371 | ], 372 | "source": [ 373 | "df_agg_long <- df_agg %>% gather(key=reaction, value=count, perc_loves:perc_angrys) %>%\n", 374 | " mutate(reaction=factor(reaction))\n", 375 | "\n", 376 | "print(head(df_agg_long,20))" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 78, 382 | "metadata": { 383 | "collapsed": false 384 | }, 385 | "outputs": [], 386 | "source": [ 387 | "plot <- ggplot(df_agg_long, aes(x=date, y=count, color=reaction)) +\n", 388 | " geom_line(size=0.5, stat=\"identity\") +\n", 389 | " fte_theme() +\n", 390 | " scale_x_date(breaks = date_breaks(\"1 
month\"), labels = date_format(\"%b %Y\")) +\n", 391 | " scale_y_continuous(labels=percent) +\n", 392 | " theme(legend.title = element_blank(),\n", 393 | " legend.position=\"top\",\n", 394 | " legend.direction=\"horizontal\",\n", 395 | " legend.key.width=unit(0.5, \"cm\"),\n", 396 | " legend.key.height=unit(0.25, \"cm\"),\n", 397 | " legend.margin=unit(0,\"cm\")) +\n", 398 | " scale_color_viridis(discrete=T) +\n", 399 | " scale_fill_viridis(discrete=T) +\n", 400 | " labs(title=\"Daily Breakdown of Facebook Reactions on NYTimes's FB Posts\",\n", 401 | " x=\"Date Status Posted\",\n", 402 | " y=\"% Reaction Marketshare\")\n", 403 | "\n", 404 | "max_save(plot, \"reaction-example-3\", \"Facebook\")" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "metadata": {}, 410 | "source": [ 411 | "![](reaction-example-3.png)" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "# The MIT License (MIT)\n", 419 | "\n", 420 | "Copyright (c) 2016 Max Woolf\n", 421 | "\n", 422 | "Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n", 423 | "\n", 424 | "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n", 425 | "\n", 426 | "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE." 427 | ] 428 | } 429 | ], 430 | "metadata": { 431 | "kernelspec": { 432 | "display_name": "R", 433 | "language": "R", 434 | "name": "ir" 435 | }, 436 | "language_info": { 437 | "codemirror_mode": "r", 438 | "file_extension": ".r", 439 | "mimetype": "text/x-r-source", 440 | "name": "R", 441 | "pygments_lexer": "r", 442 | "version": "3.3.0" 443 | } 444 | }, 445 | "nbformat": 4, 446 | "nbformat_minor": 0 447 | } 448 | -------------------------------------------------------------------------------- /get_fb_comments_from_fb.py: -------------------------------------------------------------------------------- 1 | import json 2 | import datetime 3 | import csv 4 | import time 5 | try: 6 | from urllib.request import urlopen, Request 7 | except ImportError: 8 | from urllib2 import urlopen, Request 9 | 10 | app_id = "" 11 | app_secret = "" # DO NOT SHARE WITH ANYONE! 
12 | file_id = "cnn" 13 | 14 | access_token = app_id + "|" + app_secret 15 | 16 | 17 | def request_until_succeed(url): 18 | req = Request(url) 19 | success = False 20 | while success is False: 21 | try: 22 | response = urlopen(req) 23 | if response.getcode() == 200: 24 | success = True 25 | except Exception as e: 26 | print(e) 27 | time.sleep(5) 28 | 29 | print("Error for URL {}: {}".format(url, datetime.datetime.now())) 30 | print("Retrying.") 31 | 32 | return response.read() 33 | 34 | # Needed to write tricky unicode correctly to csv 35 | 36 | 37 | def unicode_decode(text): 38 | try: 39 | return text.encode('utf-8').decode() 40 | except UnicodeDecodeError: 41 | return text.encode('utf-8') 42 | 43 | 44 | def getFacebookCommentFeedUrl(base_url): 45 | 46 | # Construct the URL string 47 | fields = "&fields=id,message,reactions.limit(0).summary(true)" + \ 48 | ",created_time,comments,from,attachment" 49 | url = base_url + fields 50 | 51 | return url 52 | 53 | 54 | def getReactionsForComments(base_url): 55 | 56 | reaction_types = ['like', 'love', 'wow', 'haha', 'sad', 'angry'] 57 | reactions_dict = {} # dict of {status_id: tuple<6>} 58 | 59 | for reaction_type in reaction_types: 60 | fields = "&fields=reactions.type({}).limit(0).summary(total_count)".format( 61 | reaction_type.upper()) 62 | 63 | url = base_url + fields 64 | 65 | data = json.loads(request_until_succeed(url))['data'] 66 | 67 | data_processed = set() # set() removes rare duplicates in statuses 68 | for status in data: 69 | id = status['id'] 70 | count = status['reactions']['summary']['total_count'] 71 | data_processed.add((id, count)) 72 | 73 | for id, count in data_processed: 74 | if id in reactions_dict: 75 | reactions_dict[id] = reactions_dict[id] + (count,) 76 | else: 77 | reactions_dict[id] = (count,) 78 | 79 | return reactions_dict 80 | 81 | 82 | def processFacebookComment(comment, status_id, parent_id=''): 83 | 84 | # The status is now a Python dictionary, so for top-level items, 85 | # we can simply call the key. 86 | 87 | # Additionally, some items may not always exist, 88 | # so must check for existence first 89 | 90 | comment_id = comment['id'] 91 | comment_message = '' if 'message' not in comment or comment['message'] \ 92 | is '' else unicode_decode(comment['message']) 93 | comment_author = unicode_decode(comment['from']['name']) 94 | num_reactions = 0 if 'reactions' not in comment else \ 95 | comment['reactions']['summary']['total_count'] 96 | 97 | if 'attachment' in comment: 98 | attachment_type = comment['attachment']['type'] 99 | attachment_type = 'gif' if attachment_type == 'animated_image_share' \ 100 | else attachment_type 101 | attach_tag = "[[{}]]".format(attachment_type.upper()) 102 | comment_message = attach_tag if comment_message is '' else \ 103 | comment_message + " " + attach_tag 104 | 105 | # Time needs special care since a) it's in UTC and 106 | # b) it's not easy to use in statistical programs. 
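    # (Facebook returns created_time as an ISO 8601 string such as
    # '2016-02-24T12:34:56+0000'; the fixed -5 hour shift applied below is a
    # rough EST conversion and does not account for daylight saving time.)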
107 | 108 | comment_published = datetime.datetime.strptime( 109 | comment['created_time'], '%Y-%m-%dT%H:%M:%S+0000') 110 | comment_published = comment_published + datetime.timedelta(hours=-5) # EST 111 | comment_published = comment_published.strftime( 112 | '%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs 113 | 114 | # Return a tuple of all processed data 115 | 116 | return (comment_id, status_id, parent_id, comment_message, comment_author, 117 | comment_published, num_reactions) 118 | 119 | 120 | def scrapeFacebookPageFeedComments(page_id, access_token): 121 | with open('{}_facebook_comments.csv'.format(file_id), 'w') as file: 122 | w = csv.writer(file) 123 | w.writerow(["comment_id", "status_id", "parent_id", "comment_message", 124 | "comment_author", "comment_published", "num_reactions", 125 | "num_likes", "num_loves", "num_wows", "num_hahas", 126 | "num_sads", "num_angrys", "num_special"]) 127 | 128 | num_processed = 0 129 | scrape_starttime = datetime.datetime.now() 130 | after = '' 131 | base = "https://graph.facebook.com/v2.9" 132 | parameters = "/?limit={}&access_token={}".format( 133 | 100, access_token) 134 | 135 | print("Scraping {} Comments From Posts: {}\n".format( 136 | file_id, scrape_starttime)) 137 | 138 | with open('{}_facebook_statuses.csv'.format(file_id), 'r') as csvfile: 139 | reader = csv.DictReader(csvfile) 140 | 141 | # Uncomment below line to scrape comments for a specific status_id 142 | # reader = [dict(status_id='5550296508_10154352768246509')] 143 | 144 | for status in reader: 145 | has_next_page = True 146 | 147 | while has_next_page: 148 | 149 | node = "/{}/comments".format(status['status_id']) 150 | after = '' if after is '' else "&after={}".format(after) 151 | base_url = base + node + parameters + after 152 | 153 | url = getFacebookCommentFeedUrl(base_url) 154 | # print(url) 155 | comments = json.loads(request_until_succeed(url)) 156 | reactions = getReactionsForComments(base_url) 157 | 158 | for comment in comments['data']: 159 | comment_data = processFacebookComment( 160 | comment, status['status_id']) 161 | reactions_data = reactions[comment_data[0]] 162 | 163 | # calculate thankful/pride through algebra 164 | num_special = comment_data[6] - sum(reactions_data) 165 | w.writerow(comment_data + reactions_data + 166 | (num_special, )) 167 | 168 | if 'comments' in comment: 169 | has_next_subpage = True 170 | sub_after = '' 171 | 172 | while has_next_subpage: 173 | sub_node = "/{}/comments".format(comment['id']) 174 | sub_after = '' if sub_after is '' else "&after={}".format( 175 | sub_after) 176 | sub_base_url = base + sub_node + parameters + sub_after 177 | 178 | sub_url = getFacebookCommentFeedUrl( 179 | sub_base_url) 180 | sub_comments = json.loads( 181 | request_until_succeed(sub_url)) 182 | sub_reactions = getReactionsForComments( 183 | sub_base_url) 184 | 185 | for sub_comment in sub_comments['data']: 186 | sub_comment_data = processFacebookComment( 187 | sub_comment, status['status_id'], comment['id']) 188 | sub_reactions_data = sub_reactions[ 189 | sub_comment_data[0]] 190 | 191 | num_sub_special = sub_comment_data[ 192 | 6] - sum(sub_reactions_data) 193 | 194 | w.writerow(sub_comment_data + 195 | sub_reactions_data + (num_sub_special,)) 196 | 197 | num_processed += 1 198 | if num_processed % 100 == 0: 199 | print("{} Comments Processed: {}".format( 200 | num_processed, 201 | datetime.datetime.now())) 202 | 203 | if 'paging' in sub_comments: 204 | if 'next' in sub_comments['paging']: 205 | sub_after = sub_comments[ 206 | 
'paging']['cursors']['after'] 207 | else: 208 | has_next_subpage = False 209 | else: 210 | has_next_subpage = False 211 | 212 | # output progress occasionally to make sure code is not 213 | # stalling 214 | num_processed += 1 215 | if num_processed % 100 == 0: 216 | print("{} Comments Processed: {}".format( 217 | num_processed, datetime.datetime.now())) 218 | 219 | if 'paging' in comments: 220 | if 'next' in comments['paging']: 221 | after = comments['paging']['cursors']['after'] 222 | else: 223 | has_next_page = False 224 | else: 225 | has_next_page = False 226 | 227 | print("\nDone!\n{} Comments Processed in {}".format( 228 | num_processed, datetime.datetime.now() - scrape_starttime)) 229 | 230 | 231 | if __name__ == '__main__': 232 | scrapeFacebookPageFeedComments(file_id, access_token) 233 | 234 | 235 | # The CSV can be opened in all major statistical programs. Have fun! :) 236 | -------------------------------------------------------------------------------- /get_fb_posts_fb_group.py: -------------------------------------------------------------------------------- 1 | import json 2 | import datetime 3 | import csv 4 | import time 5 | import re 6 | try: 7 | from urllib.request import urlopen, Request 8 | except ImportError: 9 | from urllib2 import urlopen, Request 10 | 11 | app_id = "" 12 | app_secret = "" # DO NOT SHARE WITH ANYONE! 13 | group_id = "759985267390294" 14 | 15 | # input date formatted as YYYY-MM-DD 16 | since_date = "" 17 | until_date = "" 18 | 19 | access_token = app_id + "|" + app_secret 20 | 21 | 22 | def request_until_succeed(url): 23 | req = Request(url) 24 | success = False 25 | while success is False: 26 | try: 27 | response = urlopen(req) 28 | if response.getcode() == 200: 29 | success = True 30 | except Exception as e: 31 | print(e) 32 | time.sleep(5) 33 | 34 | print("Error for URL {}: {}".format(url, datetime.datetime.now())) 35 | print("Retrying.") 36 | 37 | return response.read() 38 | 39 | # Needed to write tricky unicode correctly to csv 40 | 41 | 42 | def unicode_decode(text): 43 | try: 44 | return text.encode('utf-8').decode() 45 | except UnicodeDecodeError: 46 | return text.encode('utf-8') 47 | 48 | 49 | def getFacebookPageFeedUrl(base_url): 50 | 51 | # Construct the URL string; see http://stackoverflow.com/a/37239851 for 52 | # Reactions parameters 53 | fields = "&fields=message,link,created_time,type,name,id," + \ 54 | "comments.limit(0).summary(true),shares,reactions" + \ 55 | ".limit(0).summary(true),from" 56 | url = base_url + fields 57 | 58 | return url 59 | 60 | 61 | def getReactionsForStatuses(base_url): 62 | 63 | reaction_types = ['like', 'love', 'wow', 'haha', 'sad', 'angry'] 64 | reactions_dict = {} # dict of {status_id: tuple<6>} 65 | 66 | for reaction_type in reaction_types: 67 | fields = "&fields=reactions.type({}).limit(0).summary(total_count)".format( 68 | reaction_type.upper()) 69 | 70 | url = base_url + fields 71 | 72 | data = json.loads(request_until_succeed(url))['data'] 73 | 74 | data_processed = set() # set() removes rare duplicates in statuses 75 | for status in data: 76 | id = status['id'] 77 | count = status['reactions']['summary']['total_count'] 78 | data_processed.add((id, count)) 79 | 80 | for id, count in data_processed: 81 | if id in reactions_dict: 82 | reactions_dict[id] = reactions_dict[id] + (count,) 83 | else: 84 | reactions_dict[id] = (count,) 85 | 86 | return reactions_dict 87 | 88 | 89 | def processFacebookPageFeedStatus(status): 90 | 91 | # The status is now a Python dictionary, so for top-level items, 92 | # we can 
simply call the key. 93 | 94 | # Additionally, some items may not always exist, 95 | # so must check for existence first 96 | 97 | status_id = status['id'] 98 | status_type = status['type'] 99 | 100 | status_message = '' if 'message' not in status else \ 101 | unicode_decode(status['message']) 102 | link_name = '' if 'name' not in status else \ 103 | unicode_decode(status['name']) 104 | status_link = '' if 'link' not in status else \ 105 | unicode_decode(status['link']) 106 | 107 | # Time needs special care since a) it's in UTC and 108 | # b) it's not easy to use in statistical programs. 109 | 110 | status_published = datetime.datetime.strptime( 111 | status['created_time'], '%Y-%m-%dT%H:%M:%S+0000') 112 | status_published = status_published + \ 113 | datetime.timedelta(hours=-5) # EST 114 | status_published = status_published.strftime( 115 | '%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs 116 | status_author = unicode_decode(status['from']['name']) 117 | 118 | # Nested items require chaining dictionary keys. 119 | 120 | num_reactions = 0 if 'reactions' not in status else \ 121 | status['reactions']['summary']['total_count'] 122 | num_comments = 0 if 'comments' not in status else \ 123 | status['comments']['summary']['total_count'] 124 | num_shares = 0 if 'shares' not in status else status['shares']['count'] 125 | 126 | return (status_id, status_message, status_author, link_name, status_type, 127 | status_link, status_published, num_reactions, num_comments, num_shares) 128 | 129 | 130 | def scrapeFacebookPageFeedStatus(group_id, access_token, since_date, until_date): 131 | with open('{}_facebook_statuses.csv'.format(group_id), 'w') as file: 132 | w = csv.writer(file) 133 | w.writerow(["status_id", "status_message", "status_author", "link_name", 134 | "status_type", "status_link", "status_published", 135 | "num_reactions", "num_comments", "num_shares", "num_likes", 136 | "num_loves", "num_wows", "num_hahas", "num_sads", "num_angrys", 137 | "num_special"]) 138 | 139 | has_next_page = True 140 | num_processed = 0 # keep a count on how many we've processed 141 | scrape_starttime = datetime.datetime.now() 142 | 143 | # /feed endpoint pagenates througn an `until` and `paging` parameters 144 | until = '' 145 | paging = '' 146 | base = "https://graph.facebook.com/v2.9" 147 | node = "/{}/feed".format(group_id) 148 | parameters = "/?limit={}&access_token={}".format(100, access_token) 149 | since = "&since={}".format(since_date) if since_date \ 150 | is not '' else '' 151 | until = "&until={}".format(until_date) if until_date \ 152 | is not '' else '' 153 | 154 | print("Scraping {} Facebook Group: {}\n".format( 155 | group_id, scrape_starttime)) 156 | 157 | while has_next_page: 158 | until = '' if until is '' else "&until={}".format(until) 159 | paging = '' if until is '' else "&__paging_token={}".format(paging) 160 | base_url = base + node + parameters + since + until + paging 161 | 162 | url = getFacebookPageFeedUrl(base_url) 163 | statuses = json.loads(request_until_succeed(url)) 164 | reactions = getReactionsForStatuses(base_url) 165 | 166 | for status in statuses['data']: 167 | 168 | # Ensure it is a status with the expected metadata 169 | if 'reactions' in status: 170 | status_data = processFacebookPageFeedStatus(status) 171 | reactions_data = reactions[status_data[0]] 172 | 173 | # calculate thankful/pride through algebra 174 | num_special = status_data[7] - sum(reactions_data) 175 | w.writerow(status_data + reactions_data + (num_special,)) 176 | 177 | # output progress 
occasionally to make sure code is not 178 | # stalling 179 | num_processed += 1 180 | if num_processed % 100 == 0: 181 | print("{} Statuses Processed: {}".format 182 | (num_processed, datetime.datetime.now())) 183 | 184 | # if there is no next page, we're done. 185 | if 'paging' in statuses: 186 | next_url = statuses['paging']['next'] 187 | until = re.search('until=([0-9]*?)(&|$)', next_url).group(1) 188 | paging = re.search( 189 | '__paging_token=(.*?)(&|$)', next_url).group(1) 190 | 191 | else: 192 | has_next_page = False 193 | 194 | print("\nDone!\n{} Statuses Processed in {}".format( 195 | num_processed, datetime.datetime.now() - scrape_starttime)) 196 | 197 | 198 | if __name__ == '__main__': 199 | scrapeFacebookPageFeedStatus(group_id, access_token, since_date, until_date) 200 | 201 | 202 | # The CSV can be opened in all major statistical programs. Have fun! :) 203 | -------------------------------------------------------------------------------- /get_fb_posts_fb_page.py: -------------------------------------------------------------------------------- 1 | import json 2 | import datetime 3 | import csv 4 | import time 5 | try: 6 | from urllib.request import urlopen, Request 7 | except ImportError: 8 | from urllib2 import urlopen, Request 9 | 10 | app_id = "" 11 | app_secret = "" # DO NOT SHARE WITH ANYONE! 12 | page_id = "cnn" 13 | 14 | # input date formatted as YYYY-MM-DD 15 | since_date = "" 16 | until_date = "" 17 | 18 | access_token = app_id + "|" + app_secret 19 | 20 | 21 | def request_until_succeed(url): 22 | req = Request(url) 23 | success = False 24 | while success is False: 25 | try: 26 | response = urlopen(req) 27 | if response.getcode() == 200: 28 | success = True 29 | except Exception as e: 30 | print(e) 31 | time.sleep(5) 32 | 33 | print("Error for URL {}: {}".format(url, datetime.datetime.now())) 34 | print("Retrying.") 35 | 36 | return response.read() 37 | 38 | 39 | # Needed to write tricky unicode correctly to csv 40 | def unicode_decode(text): 41 | try: 42 | return text.encode('utf-8').decode() 43 | except UnicodeDecodeError: 44 | return text.encode('utf-8') 45 | 46 | 47 | def getFacebookPageFeedUrl(base_url): 48 | 49 | # Construct the URL string; see http://stackoverflow.com/a/37239851 for 50 | # Reactions parameters 51 | fields = "&fields=message,link,created_time,type,name,id," + \ 52 | "comments.limit(0).summary(true),shares,reactions" + \ 53 | ".limit(0).summary(true)" 54 | 55 | return base_url + fields 56 | 57 | 58 | def getReactionsForStatuses(base_url): 59 | 60 | reaction_types = ['like', 'love', 'wow', 'haha', 'sad', 'angry'] 61 | reactions_dict = {} # dict of {status_id: tuple<6>} 62 | 63 | for reaction_type in reaction_types: 64 | fields = "&fields=reactions.type({}).limit(0).summary(total_count)".format( 65 | reaction_type.upper()) 66 | 67 | url = base_url + fields 68 | 69 | data = json.loads(request_until_succeed(url))['data'] 70 | 71 | data_processed = set() # set() removes rare duplicates in statuses 72 | for status in data: 73 | id = status['id'] 74 | count = status['reactions']['summary']['total_count'] 75 | data_processed.add((id, count)) 76 | 77 | for id, count in data_processed: 78 | if id in reactions_dict: 79 | reactions_dict[id] = reactions_dict[id] + (count,) 80 | else: 81 | reactions_dict[id] = (count,) 82 | 83 | return reactions_dict 84 | 85 | 86 | def processFacebookPageFeedStatus(status): 87 | 88 | # The status is now a Python dictionary, so for top-level items, 89 | # we can simply call the key. 
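    # (e.g. status['id'] is read directly below, while optional fields such as
    # 'message' and 'link' fall back to '' when absent.)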
90 | 91 | # Additionally, some items may not always exist, 92 | # so must check for existence first 93 | 94 | status_id = status['id'] 95 | status_type = status['type'] 96 | 97 | status_message = '' if 'message' not in status else \ 98 | unicode_decode(status['message']) 99 | link_name = '' if 'name' not in status else \ 100 | unicode_decode(status['name']) 101 | status_link = '' if 'link' not in status else \ 102 | unicode_decode(status['link']) 103 | 104 | # Time needs special care since a) it's in UTC and 105 | # b) it's not easy to use in statistical programs. 106 | 107 | status_published = datetime.datetime.strptime( 108 | status['created_time'], '%Y-%m-%dT%H:%M:%S+0000') 109 | status_published = status_published + \ 110 | datetime.timedelta(hours=-5) # EST 111 | status_published = status_published.strftime( 112 | '%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs 113 | 114 | # Nested items require chaining dictionary keys. 115 | 116 | num_reactions = 0 if 'reactions' not in status else \ 117 | status['reactions']['summary']['total_count'] 118 | num_comments = 0 if 'comments' not in status else \ 119 | status['comments']['summary']['total_count'] 120 | num_shares = 0 if 'shares' not in status else status['shares']['count'] 121 | 122 | return (status_id, status_message, link_name, status_type, status_link, 123 | status_published, num_reactions, num_comments, num_shares) 124 | 125 | 126 | def scrapeFacebookPageFeedStatus(page_id, access_token, since_date, until_date): 127 | with open('{}_facebook_statuses.csv'.format(page_id), 'w') as file: 128 | w = csv.writer(file) 129 | w.writerow(["status_id", "status_message", "link_name", "status_type", 130 | "status_link", "status_published", "num_reactions", 131 | "num_comments", "num_shares", "num_likes", "num_loves", 132 | "num_wows", "num_hahas", "num_sads", "num_angrys", 133 | "num_special"]) 134 | 135 | has_next_page = True 136 | num_processed = 0 137 | scrape_starttime = datetime.datetime.now() 138 | after = '' 139 | base = "https://graph.facebook.com/v2.9" 140 | node = "/{}/posts".format(page_id) 141 | parameters = "/?limit={}&access_token={}".format(100, access_token) 142 | since = "&since={}".format(since_date) if since_date \ 143 | is not '' else '' 144 | until = "&until={}".format(until_date) if until_date \ 145 | is not '' else '' 146 | 147 | print("Scraping {} Facebook Page: {}\n".format(page_id, scrape_starttime)) 148 | 149 | while has_next_page: 150 | after = '' if after is '' else "&after={}".format(after) 151 | base_url = base + node + parameters + after + since + until 152 | 153 | url = getFacebookPageFeedUrl(base_url) 154 | statuses = json.loads(request_until_succeed(url)) 155 | reactions = getReactionsForStatuses(base_url) 156 | 157 | for status in statuses['data']: 158 | 159 | # Ensure it is a status with the expected metadata 160 | if 'reactions' in status: 161 | status_data = processFacebookPageFeedStatus(status) 162 | reactions_data = reactions[status_data[0]] 163 | 164 | # calculate thankful/pride through algebra 165 | num_special = status_data[6] - sum(reactions_data) 166 | w.writerow(status_data + reactions_data + (num_special,)) 167 | 168 | num_processed += 1 169 | if num_processed % 100 == 0: 170 | print("{} Statuses Processed: {}".format 171 | (num_processed, datetime.datetime.now())) 172 | 173 | # if there is no next page, we're done. 
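            # (Pagination here is cursor-based: each response includes
            # paging.cursors.after, which is passed back through the &after=
            # parameter to request the next batch of up to 100 posts.)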
174 | if 'paging' in statuses: 175 | after = statuses['paging']['cursors']['after'] 176 | else: 177 | has_next_page = False 178 | 179 | print("\nDone!\n{} Statuses Processed in {}".format( 180 | num_processed, datetime.datetime.now() - scrape_starttime)) 181 | 182 | 183 | if __name__ == '__main__': 184 | scrapeFacebookPageFeedStatus(page_id, access_token, since_date, until_date) 185 | --------------------------------------------------------------------------------
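A minimal sketch of exploring the exported data with pandas (assuming pandas is installed and that get_fb_posts_fb_page.py has already produced cnn_facebook_statuses.csv), mirroring the daily Reaction-share aggregation done in R in the example notebook above:

import pandas as pd

# Load the statuses CSV written by get_fb_posts_fb_page.py
df = pd.read_csv("cnn_facebook_statuses.csv", parse_dates=["status_published"])

# Sum each Reaction type per day, then express it as a share of all Reactions
reaction_cols = ["num_loves", "num_wows", "num_hahas", "num_sads", "num_angrys"]
daily_counts = df.groupby(df["status_published"].dt.date)[reaction_cols].sum()
daily_share = daily_counts.div(daily_counts.sum(axis=1), axis=0)

print(daily_share.head())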