├── .gitignore ├── README.md ├── credentials.txt ├── examples ├── Rstart.R ├── entity.png ├── fb_scraper_data.png ├── how_to_build_facebook_scraper.ipynb ├── reaction-example-1.png ├── reaction-example-2.png ├── reaction-example-3.png └── reaction_count_data_analysis_example.ipynb ├── facebook_scrape.py └── run.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.csv 2 | 3 | # Vim 4 | [._]*.s[a-w][a-z] 5 | [._]s[a-w][a-z] 6 | # session 7 | Session.vim 8 | .netrwhist 9 | *~ 10 | tags 11 | 12 | # Byte-compiled / optimized / DLL files 13 | __pycache__/ 14 | *.py[cod] 15 | *$py.class 16 | 17 | # C extensions 18 | *.so 19 | 20 | # Distribution / packaging 21 | .Python 22 | env/ 23 | build/ 24 | develop-eggs/ 25 | dist/ 26 | downloads/ 27 | eggs/ 28 | .eggs/ 29 | lib/ 30 | lib64/ 31 | parts/ 32 | sdist/ 33 | var/ 34 | *.egg-info/ 35 | .installed.cfg 36 | *.egg 37 | 38 | # PyInstaller 39 | # Usually these files are written by a python script from a template 40 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 41 | *.manifest 42 | *.spec 43 | 44 | # Installer logs 45 | pip-log.txt 46 | pip-delete-this-directory.txt 47 | 48 | # Unit test / coverage reports 49 | htmlcov/ 50 | .tox/ 51 | .coverage 52 | .coverage.* 53 | .cache 54 | nosetests.xml 55 | coverage.xml 56 | *,cover 57 | .hypothesis/ 58 | 59 | # Translations 60 | *.mo 61 | *.pot 62 | 63 | # Django stuff: 64 | *.log 65 | local_settings.py 66 | 67 | # Flask stuff: 68 | instance/ 69 | .webassets-cache 70 | 71 | # Scrapy stuff: 72 | .scrapy 73 | 74 | # Sphinx documentation 75 | docs/_build/ 76 | 77 | # PyBuilder 78 | target/ 79 | 80 | # IPython Notebook 81 | .ipynb_checkpoints 82 | 83 | # pyenv 84 | .python-version 85 | 86 | # celery beat schedule file 87 | celerybeat-schedule 88 | 89 | # dotenv 90 | .env 91 | 92 | # virtualenv 93 | venv/ 94 | ENV/ 95 | 96 | # Spyder project settings 97 | .spyderproject 98 | 99 | # Rope project settings 100 | .ropeproject 101 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Facebook Page Post Scraper 2 | 3 | This is a fork of Max Woolf's [facebook-page-post-scraper](https://github.com/minimaxir/facebook-page-post-scraper). 4 | 5 | It only works on Python 3. 6 | 7 | This version allows you to specify the page/group you wish to scrape and where you want CSV files to be stored through command-line arguments. 8 | 9 | It also separates your App ID and App secret from the code; now, you have to store these credentials in a separate file. 10 | 11 | ![](/examples/fb_scraper_data.png) 12 | 13 | A tool for gathering *all* the posts and comments of a Facebook Page (or Open Facebook Group) and related metadata, including post message, post links, and counts of each reaction on the post. All this data is exported as a CSV, able to be imported into any data analysis program like Excel. 14 | 15 | The purpose of the script is to gather Facebook data for semantic analysis, which is greatly helped by the presence of high-quality Reaction data. 
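Once a scrape finishes, the posts CSV can be pulled into any analysis tool. As a quick illustration, here is a minimal [pandas](https://pandas.pydata.org/) sketch (pandas is not a dependency of this project, just one convenient option; the file name below is an example, and the column names follow the fields this scraper writes, such as `status_published`, `status_message`, `num_reactions`, and `num_angrys`):

```python
import pandas as pd

# Example file name; use whatever CSV you generated with --posts-output (see Usage below).
df = pd.read_csv("cnn_facebook_statuses.csv")

# Posts that drew the largest share of "angry" reactions.
df["angry_share"] = df["num_angrys"] / df["num_reactions"].clip(lower=1)

cols = ["status_published", "status_message", "angry_share"]
print(df.sort_values("angry_share", ascending=False)[cols].head(10))
```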
Here is a quick example of a potential Facebook Reaction data visualization using data from [CNN's Facebook page](https://www.facebook.com/cnn/):
16 | 
17 | ![](/examples/reaction-example-2.png)
18 | 
19 | ## Usage
20 | 
21 | To scrape posts from a page:
22 | 
23 | `python3 run.py --page <page ID> --cred <credentials file> --posts-output <posts CSV file>`
24 | 
25 | To scrape both posts and comments:
26 | 
27 | ```
28 | python3 run.py --page <page ID> --cred <credentials file> --posts-output <posts CSV file> \
29 |     --scrape-comments --comments-output <comments CSV file>
30 | ```
31 | 
32 | To scrape from a group, change `--page` to `--group`.
33 | 
34 | To skip downloading statuses and instead retrieve comments using an existing posts CSV file, use the `--use-existing-posts-csv` flag:
35 | 
36 | ```
37 | python3 run.py --page <page ID> --cred <credentials file> --posts-output <existing posts CSV file> \
38 |     --scrape-comments --comments-output <comments CSV file> --use-existing-posts-csv
39 | ```
40 | 
41 | 
42 | ### Credential file format
43 | 
44 | The `--cred` command-line argument specifies where your credential file is located.
45 | 
46 | **Do not share this file with anyone.**
47 | 
48 | It should look something like this:
49 | 
50 | ```
51 | app_id = "111111111111111"
52 | app_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
53 | ```
54 | 
55 | You need the App ID and App Secret of a Facebook app you control (I strongly recommend creating an app just for this purpose) and the Page ID of the Facebook Page you want to scrape.
56 | 
57 | Example CSVs for CNN, NYTimes, and BuzzFeed data are not included in this repository due to size, but you can download [CNN data here](https://dl.dropboxusercontent.com/u/2017402/cnn_facebook_statuses.csv.zip) [2.7MB ZIP], [NYTimes data here](https://dl.dropboxusercontent.com/u/2017402/nytimes_facebook_statuses.csv.zip) [4.9MB ZIP], and [BuzzFeed data here](https://dl.dropboxusercontent.com/u/2017402/buzzfeed_facebook_statuses.csv.zip) [2.1MB ZIP].
58 | 
59 | ### Getting the numeric group ID
60 | 
61 | For groups without a custom username, the ID will be in the address bar; for groups with custom usernames, do a View Source on the group page, search for "entity_id", and use the number to the right of that field. For example, the `group_id` of [Hackathon Hackers](https://www.facebook.com/groups/hackathonhackers/) is 759985267390294.
62 | 
63 | ![](/examples/entity.png)
64 | 
65 | You can download example data for [Hackathon Hackers here](https://dl.dropboxusercontent.com/u/2017402/759985267390294_facebook_statuses.csv.zip) [4.7MB ZIP].
66 | 
67 | Keep in mind that large pages such as CNN have *millions* of comments, so be careful! Scraping throughput is approximately 87k comments per hour, so a page with five million comments would take roughly 57 hours to scrape.
68 | 
69 | ## Privacy
70 | 
71 | This scraper can only scrape public Facebook data, which is available to anyone, even those who are not logged into Facebook. No personally identifiable data is collected in the Page variant; the Group variant does collect the name of each post's author, but that data is also public to non-logged-in users. Additionally, the script only uses officially documented Facebook API endpoints and does not circumvent any rate limits.
72 | 
73 | Note that this script, and any variant of it, *cannot* be used to scrape data from user profiles (the Facebook API specifically disallows this use case).
74 | 
75 | ## Maintainer
76 | 
77 | * Koh Wei Jie
78 | 
79 | ## Credits
80 | 
81 | This is a fork of Max Woolf's code at https://github.com/minimaxir/facebook-page-post-scraper.
82 | 
83 | Parts of this README were copied verbatim.
84 | 
85 | ## License
86 | 
87 | Be aware that this is a fork of Max Woolf's MIT-licensed code.
88 | -------------------------------------------------------------------------------- /credentials.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /examples/Rstart.R: -------------------------------------------------------------------------------- 1 | library(readr) 2 | library(dplyr) 3 | library(ggplot2) 4 | library(extrafont) 5 | library(scales) 6 | library(grid) 7 | library(RColorBrewer) 8 | library(digest) 9 | library(readr) 10 | library(stringr) 11 | 12 | 13 | fontFamily <- "Source Sans Pro" 14 | fontTitle <- "Source Sans Pro Semibold" 15 | 16 | color_palette = c("#16a085","#27ae60","#2980b9","#8e44ad","#f39c12","#c0392b","#1abc9c", "#2ecc71", "#3498db", "#9b59b6", "#f1c40f","#e74c3c") 17 | 18 | neutral_colors = function(number) { 19 | return (brewer.pal(11, "RdYlBu")[-c(5:7)][(number %% 8) + 1]) 20 | } 21 | 22 | set1_colors = function(number) { 23 | return (brewer.pal(9, "Set1")[c(-6,-8)][(number %% 7) + 1]) 24 | } 25 | 26 | theme_custom <- function() {theme_bw(base_size = 8) + 27 | theme(panel.background = element_rect(fill="#eaeaea"), 28 | plot.background = element_rect(fill="white"), 29 | panel.grid.minor = element_blank(), 30 | panel.grid.major = element_line(color="#dddddd"), 31 | axis.ticks.x = element_blank(), 32 | axis.ticks.y = element_blank(), 33 | axis.title.x = element_text(family=fontTitle, size=8, vjust=-.3), 34 | axis.title.y = element_text(family=fontTitle, size=8, vjust=1.5), 35 | panel.border = element_rect(color="#cccccc"), 36 | text = element_text(color = "#1a1a1a", family=fontFamily), 37 | plot.margin = unit(c(0.25,0.1,0.1,0.35), "cm"), 38 | plot.title = element_text(family=fontTitle, size=9, vjust=1)) 39 | } 40 | 41 | create_watermark <- function(source = '', filename = '', dark=F) { 42 | 43 | bg_white = "#FFFFFF" 44 | bg_text = '#969696' 45 | 46 | if (dark) { 47 | bg_white = "#000000" 48 | bg_text = '#666666' 49 | } 50 | 51 | watermark <- ggplot(aes(x,y), data=data.frame(x=c(0.5), y=c(0.5))) + geom_point(color = "transparent") + 52 | geom_text(x=0, y=1.25, label="By Max Woolf — minimaxir.com", family="Source Sans Pro", color=bg_text, size=1.75, hjust=0) + 53 | 54 | geom_text(x=5, y=1.25, label="Made using R and ggplot2", family="Source Sans Pro", color=bg_text, size=1.75) + 55 | scale_x_continuous(limits=c(0,10)) + 56 | scale_y_continuous(limits=c(0.5,1.5)) + 57 | annotate("segment", x = 0, xend = 10, y=1.5, yend=1.5, color=bg_text, size=0.1) + 58 | theme_bw() + 59 | theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = "none", 60 | panel.border = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank(), 61 | axis.ticks = element_blank(), plot.margin = unit(c(0.0,0,-0.4,0), "cm")) + 62 | theme(plot.background=element_rect(fill=bg_white, color=bg_white),panel.background=element_rect(fill=bg_white, color=bg_white)) + 63 | scale_color_manual(values=bg_text) 64 | 65 | if (nchar(source) > 0) {watermark <- watermark + geom_text(x=10, y=1.25, label=paste("Data via",source), family="Source Sans Pro", color=bg_text, size=1.75, hjust=1)} 66 | 67 | return (watermark) 68 | } 69 | 70 | web_Layout <- grid.layout(nrow = 2, ncol = 1, heights = unit(c(2, 71 | 0.125), c("null", "null")), ) 72 | tallweb_Layout <- grid.layout(nrow = 2, ncol = 1, heights = unit(c(3.5, 73 | 0.125), c("null", "null")), ) 74 | video_Layout <- 
grid.layout(nrow = 1, ncol = 2, widths = unit(c(2, 75 | 1), c("null", "null")), ) 76 | 77 | #grid.show.layout(Layout) 78 | vplayout <- function(...) { 79 | grid.newpage() 80 | pushViewport(viewport(layout = web_Layout)) 81 | } 82 | 83 | talllayout <- function(...) { 84 | grid.newpage() 85 | pushViewport(viewport(layout = tallweb_Layout)) 86 | } 87 | 88 | vidlayout <- function(...) { 89 | grid.newpage() 90 | pushViewport(viewport(layout = video_Layout)) 91 | } 92 | 93 | subplot <- function(x, y) viewport(layout.pos.row = x, 94 | layout.pos.col = y) 95 | 96 | web_plot <- function(a, b) { 97 | vplayout() 98 | print(a, vp = subplot(1, 1)) 99 | print(b, vp = subplot(2, 1)) 100 | } 101 | 102 | tallweb_plot <- function(a, b) { 103 | talllayout() 104 | print(a, vp = subplot(1, 1)) 105 | print(b, vp = subplot(2, 1)) 106 | } 107 | 108 | video_plot <- function(a, b) { 109 | vidlayout() 110 | print(a, vp = subplot(1, 1)) 111 | print(b, vp = subplot(1, 2)) 112 | } 113 | 114 | max_save <- function(plot1, filename, source = '', pdf = FALSE, w=4, h=3, tall=F, dark=F, bg_overide=NA) { 115 | png(paste(filename,"png",sep="."),res=300,units="in",width=w,height=h) 116 | plot.new() 117 | #if (!is.na(bg_overide)) {par(bg = bg_overide)} 118 | ifelse(tall,tallweb_plot(plot1,create_watermark(source, filename, dark)),web_plot(plot1,create_watermark(source, filename, dark))) 119 | dev.off() 120 | 121 | if (pdf) { 122 | quartz(width=w,height=h,dpi=144) 123 | #if (!is.na(bg_overide)) {par(bg = bg_overide)} 124 | web_plot(plot1,create_watermark(source, filename, dark)) 125 | quartz.save(paste(filename,"pdf",sep="."), type = "pdf", device = dev.cur()) 126 | } 127 | } 128 | 129 | video_save <- function(plot1, plot2, filename) { 130 | png(paste(filename,"png",sep="."),res=300,units="in",width=1920/300,height=1080/300) 131 | video_plot(plot1,plot2) 132 | dev.off() 133 | 134 | } 135 | 136 | fte_theme <- function (palate_color = "Greys") { 137 | 138 | #display.brewer.all(n=9,type="seq",exact.n=TRUE) 139 | palate <- brewer.pal(palate_color, n=9) 140 | color.background = palate[1] 141 | color.grid.minor = palate[3] 142 | color.grid.major = palate[3] 143 | color.axis.text = palate[6] 144 | color.axis.title = palate[7] 145 | color.title = palate[9] 146 | #color.title = "#2c3e50" 147 | 148 | font.title <- "Source Sans Pro" 149 | font.axis <- "Open Sans Condensed Bold" 150 | #font.axis <- "M+ 1m regular" 151 | #font.title <- "Arial" 152 | #font.axis <- "Arial" 153 | 154 | 155 | theme_bw(base_size=9) + 156 | # Set the entire chart region to a light gray color 157 | theme(panel.background=element_rect(fill=color.background, color=color.background)) + 158 | theme(plot.background=element_rect(fill=color.background, color=color.background)) + 159 | theme(panel.border=element_rect(color=color.background)) + 160 | # Format the grid 161 | theme(panel.grid.major=element_line(color=color.grid.major,size=.25)) + 162 | theme(panel.grid.minor=element_blank()) + 163 | #scale_x_continuous(minor_breaks=0,breaks=seq(0,100,10),limits=c(0,100)) + 164 | #scale_y_continuous(minor_breaks=0,breaks=seq(0,26,4),limits=c(0,25)) + 165 | theme(axis.ticks=element_blank()) + 166 | # Dispose of the legend 167 | theme(legend.position="none") + 168 | theme(legend.background = element_rect(fill=color.background)) + 169 | theme(legend.text = element_text(size=7,colour=color.axis.title,family=font.axis)) + 170 | # Set title and axis labels, and format these and tick marks 171 | theme(plot.title=element_text(colour=color.title,family=font.title, size=9, vjust=1.25, 
lineheight=0.1)) + 172 | theme(axis.text.x=element_text(size=7,colour=color.axis.text,family=font.axis)) + 173 | theme(axis.text.y=element_text(size=7,colour=color.axis.text,family=font.axis)) + 174 | theme(axis.title.y=element_text(size=7,colour=color.axis.title,family=font.title, vjust=1.25)) + 175 | theme(axis.title.x=element_text(size=7,colour=color.axis.title,family=font.title, vjust=0)) + 176 | 177 | # Big bold line at y=0 178 | #geom_hline(yintercept=0,size=0.75,colour=palate[9]) + 179 | # Plot margins and finally line annotations 180 | theme(plot.margin = unit(c(0.35, 0.2, 0.15, 0.4), "cm")) + 181 | 182 | theme(strip.background = element_rect(fill=color.background, color=color.background),strip.text=element_text(size=7,colour=color.axis.title,family=font.title)) 183 | 184 | } 185 | -------------------------------------------------------------------------------- /examples/entity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/weijiekoh/scrape-fb-posts-and-comments/159675efe537b2080e77e231a6457807d4e81e84/examples/entity.png -------------------------------------------------------------------------------- /examples/fb_scraper_data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/weijiekoh/scrape-fb-posts-and-comments/159675efe537b2080e77e231a6457807d4e81e84/examples/fb_scraper_data.png -------------------------------------------------------------------------------- /examples/how_to_build_facebook_scraper.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# How to Scrape Data From Facebook Page Posts for Statistical Analysis\n", 8 | "\n", 9 | "By [Max Woolf (@minimaxir)](http://minimaxir.com/)\n", 10 | "\n", 11 | "This notebook describes how to build a Facebook Scraper using the latest version of Facebook's Graph API (v2.4). This is the accompanyment to my blog post [How to Scrape Data From Facebook Page Posts for Statistical Analysis](http://minimaxir.com/2015/07/facebook-scraper/)." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": { 18 | "collapsed": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "# import some Python dependencies\n", 23 | "\n", 24 | "import urllib2\n", 25 | "import json\n", 26 | "import datetime\n", 27 | "import csv\n", 28 | "import time" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "Accessing Facebook page data requires an access token.\n", 36 | "\n", 37 | "Since the user access token expires within an hour, we need to create a dummy application *for the sole purpose of scraping* and use the app ID and app secret generated there [as described here](https://developers.facebook.com/docs/facebook-login/access-tokens#apptokens), both of which never expire." 
38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": { 44 | "collapsed": false 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "# Since the code output in this notebook leaks the app_secret,\n", 49 | "# it has been reset by the time you read this.\n", 50 | "\n", 51 | "app_id = \"272535582777707\"\n", 52 | "app_secret = \"59e7ab31b01d3a5a90ec15a7a45a5e3b\" # DO NOT SHARE WITH ANYONE!\n", 53 | "\n", 54 | "access_token = app_id + \"|\" + app_secret" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "Now we can access public Facebook data without limit. Let's do our analysis on the [New York Times Facebook page](https://www.facebook.com/nytimes), which is popular enough to yield good data." 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 3, 67 | "metadata": { 68 | "collapsed": true 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "page_id = 'nytimes'" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "Let's write a quick program to ping NYT's Facebook page to verify that the `access_token` works and the `page_id` is valid." 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 4, 85 | "metadata": { 86 | "collapsed": false 87 | }, 88 | "outputs": [ 89 | { 90 | "name": "stdout", 91 | "output_type": "stream", 92 | "text": [ 93 | "{\n", 94 | " \"id\": \"5281959998\", \n", 95 | " \"name\": \"The New York Times\"\n", 96 | "}\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "def testFacebookPageData(page_id, access_token):\n", 102 | " \n", 103 | " # construct the URL string\n", 104 | " base = \"https://graph.facebook.com/v2.4\"\n", 105 | " node = \"/\" + page_id\n", 106 | " parameters = \"/?access_token=%s\" % access_token\n", 107 | " url = base + node + parameters\n", 108 | " \n", 109 | " # retrieve data\n", 110 | " req = urllib2.Request(url)\n", 111 | " response = urllib2.urlopen(req)\n", 112 | " data = json.loads(response.read())\n", 113 | " \n", 114 | " print json.dumps(data, indent=4, sort_keys=True)\n", 115 | " \n", 116 | "\n", 117 | "testFacebookPageData(page_id, access_token)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "When scraping large amounts of data from public APIs, there's a high probability that you'll hit an [HTTP Error 500 (Internal Error)](http://www.checkupdown.com/status/E500.html) at some point. There is no way to avoid that on our end. \n", 125 | "\n", 126 | "Instead, we'll use a helper function to catch the error and try again after a few seconds, which usually works. This helper function also consolidates the data retrival code, so it kills two birds with one stone." 
127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 5, 132 | "metadata": { 133 | "collapsed": true 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "def request_until_succeed(url):\n", 138 | " req = urllib2.Request(url)\n", 139 | " success = False\n", 140 | " while success is False:\n", 141 | " try: \n", 142 | " response = urllib2.urlopen(req)\n", 143 | " if response.getcode() == 200:\n", 144 | " success = True\n", 145 | " except Exception, e:\n", 146 | " print e\n", 147 | " time.sleep(5)\n", 148 | " \n", 149 | " print \"Error for URL %s: %s\" % (url, datetime.datetime.now())\n", 150 | "\n", 151 | " return response.read()" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "The data is the Facebook Page metadata however; we need to change the endpoint to the /feed endpoint." 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 6, 164 | "metadata": { 165 | "collapsed": false 166 | }, 167 | "outputs": [ 168 | { 169 | "name": "stdout", 170 | "output_type": "stream", 171 | "text": [ 172 | "{\n", 173 | " \"data\": [\n", 174 | " {\n", 175 | " \"created_time\": \"2015-07-20T01:25:01+0000\", \n", 176 | " \"id\": \"5281959998_10150628157724999\", \n", 177 | " \"message\": \"The planned megalopolis, a metropolitan area that would be about 6 times the size of New York\\u2019s, is meant to revamp northern China\\u2019s economy and become a laboratory for modern urban growth.\"\n", 178 | " }, \n", 179 | " {\n", 180 | " \"created_time\": \"2015-07-19T22:55:01+0000\", \n", 181 | " \"id\": \"5281959998_10150628161129999\", \n", 182 | " \"message\": \"\\\"It\\u2019s safe to say that federal agencies are not where we want them to be across the board,\\\" said President Barack Obama's top cybersecurity adviser. \\\"We clearly need to be moving faster.\\\"\"\n", 183 | " }, \n", 184 | " {\n", 185 | " \"created_time\": \"2015-07-19T22:25:01+0000\", \n", 186 | " \"id\": \"5281959998_10150626434639999\", \n", 187 | " \"message\": \"Showcase your summer tomatoes in this elegant crostata.\"\n", 188 | " }, \n", 189 | " {\n", 190 | " \"created_time\": \"2015-07-19T21:55:08+0000\", \n", 191 | " \"id\": \"5281959998_10150628170209999\", \n", 192 | " \"message\": \"The task: Create a technologically sophisticated barbecue smoker that could outperform the best product on the market and be sold for less than $1,500.\"\n", 193 | " }, \n", 194 | " {\n", 195 | " \"created_time\": \"2015-07-19T21:25:00+0000\", \n", 196 | " \"id\": \"5281959998_10150626449129999\", \n", 197 | " \"message\": \"Achieving pastel hair can be time-consuming and toxic \\u2014 but for some, so very worth it.\"\n", 198 | " }, \n", 199 | " {\n", 200 | " \"created_time\": \"2015-07-19T20:53:05+0000\", \n", 201 | " \"id\": \"5281959998_10150626425084999\", \n", 202 | " \"message\": \"Attention, meat lovers: This simple barbecue sauce goes beautifully with pork and chicken.\"\n", 203 | " }, \n", 204 | " {\n", 205 | " \"created_time\": \"2015-07-19T20:25:07+0000\", \n", 206 | " \"id\": \"5281959998_10150628132119999\", \n", 207 | " \"message\": \"He passed the police officer exam in 2011. He went through orientation and started undergoing the required background checks in 2013. Then, the process stopped cold. No emails. No calls. No explanations. 
Silence.\"\n", 208 | " }, \n", 209 | " {\n", 210 | " \"created_time\": \"2015-07-19T19:55:32+0000\", \n", 211 | " \"id\": \"5281959998_10150628116259999\", \n", 212 | " \"message\": \"The election is 16 months away, but knowing what we know now, what should we expect the economic backdrop to be when Americans choose their next president?\"\n", 213 | " }, \n", 214 | " {\n", 215 | " \"created_time\": \"2015-07-19T19:25:07+0000\", \n", 216 | " \"id\": \"5281959998_10150628097394999\", \n", 217 | " \"message\": \"\\\"By focusing so intently on physical fitness, the corps is avoiding the real barrier to integration \\u2014 the hypermasculine culture at its heart.\\\" Read on in The New York Times Opinion.\"\n", 218 | " }, \n", 219 | " {\n", 220 | " \"created_time\": \"2015-07-19T19:05:01+0000\", \n", 221 | " \"id\": \"5281959998_10150628071729999\", \n", 222 | " \"message\": \"U2's \\u201cInnocence and Experience\\u201d tour merges past and present, peace and war, audience and band, punk and statesman, grass-roots activism and corporate philanthropy.\"\n", 223 | " }, \n", 224 | " {\n", 225 | " \"created_time\": \"2015-07-19T18:55:05+0000\", \n", 226 | " \"id\": \"5281959998_10150628073894999\", \n", 227 | " \"message\": \"\\\"I always believe in apologizing if you\\u2019ve done something wrong, but if you read my statement, you\\u2019ll see I said nothing wrong,\\\" Donald J. Trump said in an interview.\"\n", 228 | " }, \n", 229 | " {\n", 230 | " \"created_time\": \"2015-07-19T18:25:21+0000\", \n", 231 | " \"id\": \"5281959998_10150628056964999\", \n", 232 | " \"message\": \"Booing, like opera, can be divided into several genres.\"\n", 233 | " }, \n", 234 | " {\n", 235 | " \"created_time\": \"2015-07-19T17:55:08+0000\", \n", 236 | " \"id\": \"5281959998_10150628040459999\", \n", 237 | " \"message\": \"\\\"Nearly at once, [the Confederate flag and Atticus Finch] have fallen from grace in ways that were unimaginable just months ago. They are forcing a reckoning with ourselves and our history, a reassessment of who we were and of what we might become.\\\" Read on in The New York Times Opinion.\"\n", 238 | " }, \n", 239 | " {\n", 240 | " \"created_time\": \"2015-07-19T17:25:00+0000\", \n", 241 | " \"id\": \"5281959998_10150627982469999\", \n", 242 | " \"message\": \"It's National Ice Cream Day. How about cooling off with a treat?\", \n", 243 | " \"story\": \"The New York Times added 4 new photos.\"\n", 244 | " }, \n", 245 | " {\n", 246 | " \"created_time\": \"2015-07-19T16:55:07+0000\", \n", 247 | " \"id\": \"5281959998_10150628000024999\", \n", 248 | " \"message\": \"Bystanders watched people wave flags celebrating Pan-Africanism, the Confederacy and the Nazi Party. And they watched as black demonstrators raised clenched fists, and white demonstrators performed Nazi salutes.\"\n", 249 | " }, \n", 250 | " {\n", 251 | " \"created_time\": \"2015-07-19T16:25:08+0000\", \n", 252 | " \"id\": \"5281959998_10150627989069999\", \n", 253 | " \"message\": \"\\\"Because in the sunset of his presidency, Barack Obama's bolder side is rising. He\\u2019s a lame duck who doesn\\u2019t give a damn.\\\" Read on in The New York Times Opinion.\"\n", 254 | " }, \n", 255 | " {\n", 256 | " \"created_time\": \"2015-07-19T15:55:06+0000\", \n", 257 | " \"id\": \"5281959998_10150627979424999\", \n", 258 | " \"message\": \"The flyby of Pluto was a triumph of human ingenuity and the capstone of a mission that unfolded nearly flawlessly. 
Yet it almost didn't happen.\"\n", 259 | " }, \n", 260 | " {\n", 261 | " \"created_time\": \"2015-07-19T15:25:04+0000\", \n", 262 | " \"id\": \"5281959998_10150627970394999\", \n", 263 | " \"message\": \"After 6 months apart, Caroline Dove planned to reunite with her boyfriend of more than 2 years. But before she could make the trip, there came a final, portentous message.\"\n", 264 | " }, \n", 265 | " {\n", 266 | " \"created_time\": \"2015-07-19T14:55:08+0000\", \n", 267 | " \"id\": \"5281959998_10150627962014999\", \n", 268 | " \"message\": \"Hillary Clinton has made the struggles of her mother a central part of her 2016 campaign\\u2019s message. But her father, whom she rarely talks about publicly, exerted an equally powerful, if sometimes bruising, influence on the woman who wants to become the first female president.\"\n", 269 | " }, \n", 270 | " {\n", 271 | " \"created_time\": \"2015-07-19T14:25:09+0000\", \n", 272 | " \"id\": \"5281959998_10150627952769999\", \n", 273 | " \"message\": \"Quotation of the Day: \\\"When your contract is over, they send you home, saying they\\u2019ve transferred the money. You get home, and there is nothing there.\\\" \\u2014 Yuriy Cheng, a Ukrainian seaman, describing the owner of the Dona Liberta, a ship that is a case study of misconduct at sea.\"\n", 274 | " }, \n", 275 | " {\n", 276 | " \"created_time\": \"2015-07-19T12:55:01+0000\", \n", 277 | " \"id\": \"5281959998_10150626434214999\", \n", 278 | " \"message\": \"Summer on a stick. (via The New York Times Food)\"\n", 279 | " }, \n", 280 | " {\n", 281 | " \"created_time\": \"2015-07-19T09:55:00+0000\", \n", 282 | " \"id\": \"5281959998_10150627665974999\", \n", 283 | " \"message\": \"The surge of migrants into Europe from war-ravaged and impoverished parts of the Middle East, Afghanistan and Africa has shifted in recent months. Migrants are now pushing by land across the western Balkans, in numbers roughly equal to those entering the Continent through Italy.\"\n", 284 | " }, \n", 285 | " {\n", 286 | " \"created_time\": \"2015-07-19T03:55:00+0000\", \n", 287 | " \"id\": \"5281959998_10150626450789999\", \n", 288 | " \"message\": \"When your big toe isn't your biggest toe.\"\n", 289 | " }, \n", 290 | " {\n", 291 | " \"created_time\": \"2015-07-19T02:55:00+0000\", \n", 292 | " \"id\": \"5281959998_10150626440069999\", \n", 293 | " \"message\": \"\\\"Progress is occurring, as courts accept that in marriage and other matters, gender can't be reduced to chromosomes or surgeries,\\\" writes J. 
Courtney Sullivan in The New York Times Opinion.\"\n", 294 | " }, \n", 295 | " {\n", 296 | " \"created_time\": \"2015-07-19T01:55:01+0000\", \n", 297 | " \"id\": \"5281959998_10150627562209999\", \n", 298 | " \"message\": \"Experimenting with neon lavender, sea-foam green and soft periwinkle.\"\n", 299 | " }\n", 300 | " ], \n", 301 | " \"paging\": {\n", 302 | " \"next\": \"https://graph.facebook.com/v2.4/5281959998/feed?access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&limit=25&until=1437270901&__paging_token=enc_AdB73LgZAUngYJIdoZCGUgWvKdL9zs23TBqdfeK90PnPs9MqO7xeze7ANGK2zMxZAveZAvwa1nHzTObmzuKiHY7MVVow\", \n", 303 | " \"previous\": \"https://graph.facebook.com/v2.4/5281959998/feed?since=1437355501&access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&limit=25&__paging_token=enc_AdC4YOxNofFbJWmap6PZC6S0iyiWG8A1FpsYTMrBG62tmT6HfNuhc6rcxL6fMk8ZAxx0EQcFy52SJ2fJ1TbIL47EQx&__previous=1\"\n", 304 | " }\n", 305 | "}\n" 306 | ] 307 | } 308 | ], 309 | "source": [ 310 | "def testFacebookPageFeedData(page_id, access_token):\n", 311 | " \n", 312 | " # construct the URL string\n", 313 | " base = \"https://graph.facebook.com/v2.4\"\n", 314 | " node = \"/\" + page_id + \"/feed\" # changed\n", 315 | " parameters = \"/?access_token=%s\" % access_token\n", 316 | " url = base + node + parameters\n", 317 | " \n", 318 | " # retrieve data\n", 319 | " data = json.loads(request_until_succeed(url))\n", 320 | " \n", 321 | " print json.dumps(data, indent=4, sort_keys=True)\n", 322 | " \n", 323 | "\n", 324 | "testFacebookPageFeedData(page_id, access_token)" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "In v2.4, the default behavior is to return very, very little metadata for statuses in order to reduce bandwidth, with the expectation the user will request the necessary fields.\n", 332 | "\n", 333 | "We don't need data on every NYT status. Yet. Let's reduce the requested fields to exactly what we need, and the number of stories returned to 1 so we can process it." 
334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": 7, 339 | "metadata": { 340 | "collapsed": false, 341 | "scrolled": true 342 | }, 343 | "outputs": [ 344 | { 345 | "name": "stdout", 346 | "output_type": "stream", 347 | "text": [ 348 | "{\n", 349 | " \"comments\": {\n", 350 | " \"data\": [\n", 351 | " {\n", 352 | " \"can_remove\": false, \n", 353 | " \"created_time\": \"2015-07-20T01:28:02+0000\", \n", 354 | " \"from\": {\n", 355 | " \"id\": \"859569687424896\", \n", 356 | " \"name\": \"Chris Gagne\"\n", 357 | " }, \n", 358 | " \"id\": \"10150628157724999_10150628249759999\", \n", 359 | " \"like_count\": 9, \n", 360 | " \"message\": \"Aaaaaaaand there goes the rest of Beijing's clean air, whatever was left of it.\", \n", 361 | " \"user_likes\": false\n", 362 | " }\n", 363 | " ], \n", 364 | " \"paging\": {\n", 365 | " \"cursors\": {\n", 366 | " \"after\": \"MzE=\", \n", 367 | " \"before\": \"MzE=\"\n", 368 | " }, \n", 369 | " \"next\": \"https://graph.facebook.com/v2.0/5281959998_10150628157724999/comments?order=chronological&limit=1&summary=true&access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&after=MzE%3D\"\n", 370 | " }, \n", 371 | " \"summary\": {\n", 372 | " \"order\": \"ranked\", \n", 373 | " \"total_count\": 31\n", 374 | " }\n", 375 | " }, \n", 376 | " \"created_time\": \"2015-07-20T01:25:01+0000\", \n", 377 | " \"id\": \"5281959998_10150628157724999\", \n", 378 | " \"likes\": {\n", 379 | " \"data\": [\n", 380 | " {\n", 381 | " \"id\": \"1001217933243627\", \n", 382 | " \"name\": \"Josh Smith\"\n", 383 | " }\n", 384 | " ], \n", 385 | " \"paging\": {\n", 386 | " \"cursors\": {\n", 387 | " \"after\": \"MTAwMTIxNzkzMzI0MzYyNw==\", \n", 388 | " \"before\": \"MTAwMTIxNzkzMzI0MzYyNw==\"\n", 389 | " }, \n", 390 | " \"next\": \"https://graph.facebook.com/v2.0/5281959998_10150628157724999/likes?limit=1&summary=true&access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&after=MTAwMTIxNzkzMzI0MzYyNw%3D%3D\"\n", 391 | " }, \n", 392 | " \"summary\": {\n", 393 | " \"total_count\": 278\n", 394 | " }\n", 395 | " }, \n", 396 | " \"link\": \"http://nyti.ms/1Jr6LhU\", \n", 397 | " \"message\": \"The planned megalopolis, a metropolitan area that would be about 6 times the size of New York\\u2019s, is meant to revamp northern China\\u2019s economy and become a laboratory for modern urban growth.\", \n", 398 | " \"name\": \"China Molds a Supercity Around Beijing, Promising to Change Lives\", \n", 399 | " \"shares\": {\n", 400 | " \"count\": 50\n", 401 | " }, \n", 402 | " \"type\": \"link\"\n", 403 | "}\n" 404 | ] 405 | } 406 | ], 407 | "source": [ 408 | "def getFacebookPageFeedData(page_id, access_token, num_statuses):\n", 409 | " \n", 410 | " # construct the URL string\n", 411 | " base = \"https://graph.facebook.com\"\n", 412 | " node = \"/\" + page_id + \"/feed\" \n", 413 | " parameters = \"/?fields=message,link,created_time,type,name,id,likes.limit(1).summary(true),comments.limit(1).summary(true),shares&limit=%s&access_token=%s\" % (num_statuses, access_token) # changed\n", 414 | " url = base + node + parameters\n", 415 | " \n", 416 | " # retrieve data\n", 417 | " data = json.loads(request_until_succeed(url))\n", 418 | " \n", 419 | " return data\n", 420 | " \n", 421 | "\n", 422 | "test_status = getFacebookPageFeedData(page_id, access_token, 1)[\"data\"][0]\n", 423 | "print json.dumps(test_status, indent=4, sort_keys=True)" 424 | ] 425 | }, 426 | { 427 | "cell_type": "markdown", 428 | "metadata": {}, 429 | "source": [ 430 | "Now that we have a sample 
Facebook page status, we can write a function to process each field individually." 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 8, 436 | "metadata": { 437 | "collapsed": false 438 | }, 439 | "outputs": [ 440 | { 441 | "name": "stdout", 442 | "output_type": "stream", 443 | "text": [ 444 | "(u'5281959998_10150628157724999', 'The planned megalopolis, a metropolitan area that would be about 6 times the size of New York\\xe2\\x80\\x99s, is meant to revamp northern China\\xe2\\x80\\x99s economy and become a laboratory for modern urban growth.', 'China Molds a Supercity Around Beijing, Promising to Change Lives', u'link', u'http://nyti.ms/1Jr6LhU', '2015-07-19 20:25:01', 278, 31, 50)\n" 445 | ] 446 | } 447 | ], 448 | "source": [ 449 | "def processFacebookPageFeedStatus(status):\n", 450 | " \n", 451 | " # The status is now a Python dictionary, so for top-level items,\n", 452 | " # we can simply call the key.\n", 453 | " \n", 454 | " # Additionally, some items may not always exist,\n", 455 | " # so must check for existence first\n", 456 | " \n", 457 | " status_id = status['id']\n", 458 | " status_message = '' if 'message' not in status.keys() else status['message'].encode('utf-8')\n", 459 | " link_name = '' if 'name' not in status.keys() else status['name'].encode('utf-8')\n", 460 | " status_type = status['type']\n", 461 | " status_link = '' if 'link' not in status.keys() else status['link']\n", 462 | " \n", 463 | " \n", 464 | " # Time needs special care since a) it's in UTC and\n", 465 | " # b) it's not easy to use in statistical programs.\n", 466 | " \n", 467 | " status_published = datetime.datetime.strptime(status['created_time'],'%Y-%m-%dT%H:%M:%S+0000')\n", 468 | " status_published = status_published + datetime.timedelta(hours=-5) # EST\n", 469 | " status_published = status_published.strftime('%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs\n", 470 | " \n", 471 | " # Nested items require chaining dictionary keys.\n", 472 | " \n", 473 | " num_likes = 0 if 'likes' not in status.keys() else status['likes']['summary']['total_count']\n", 474 | " num_comments = 0 if 'comments' not in status.keys() else status['comments']['summary']['total_count']\n", 475 | " num_shares = 0 if 'shares' not in status.keys() else status['shares']['count']\n", 476 | " \n", 477 | " # return a tuple of all processed data\n", 478 | " return (status_id, status_message, link_name, status_type, status_link,\n", 479 | " status_published, num_likes, num_comments, num_shares)\n", 480 | "\n", 481 | "processed_test_status = processFacebookPageFeedStatus(test_status)\n", 482 | "print processed_test_status" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "Surprisingly, we're almost done! Now we just need to:\n", 490 | "\n", 491 | "1. Query each page of Facebook Page Statuses (100 statuses per page) using `getFacebookPageFeedData`.\n", 492 | "2. Process all statuses on that page using `processFacebookPageFeedStatus` and writing the output to a CSV file.\n", 493 | "3. Navigate to the next page, and repeat until no more statuses\n", 494 | "\n", 495 | "This block implements both the writing to CSV and page navigation." 
496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 9, 501 | "metadata": { 502 | "collapsed": false, 503 | "scrolled": true 504 | }, 505 | "outputs": [ 506 | { 507 | "name": "stdout", 508 | "output_type": "stream", 509 | "text": [ 510 | "Scraping nytimes Facebook Page: 2015-07-19 18:36:33.051000\n", 511 | "\n", 512 | "1000 Statuses Processed: 2015-07-19 18:36:59.366000\n", 513 | "2000 Statuses Processed: 2015-07-19 18:37:28.289000\n", 514 | "3000 Statuses Processed: 2015-07-19 18:37:56.487000\n", 515 | "4000 Statuses Processed: 2015-07-19 18:38:30.355000\n", 516 | "5000 Statuses Processed: 2015-07-19 18:38:58.661000\n", 517 | "6000 Statuses Processed: 2015-07-19 18:39:26.990000\n", 518 | "7000 Statuses Processed: 2015-07-19 18:39:55.906000\n", 519 | "8000 Statuses Processed: 2015-07-19 18:40:20.628000\n", 520 | "9000 Statuses Processed: 2015-07-19 18:40:44.801000\n", 521 | "10000 Statuses Processed: 2015-07-19 18:41:11.759000\n", 522 | "11000 Statuses Processed: 2015-07-19 18:41:38.739000\n", 523 | "12000 Statuses Processed: 2015-07-19 18:42:05.562000\n", 524 | "13000 Statuses Processed: 2015-07-19 18:42:32.696000\n", 525 | "14000 Statuses Processed: 2015-07-19 18:42:59.939000\n", 526 | "15000 Statuses Processed: 2015-07-19 18:43:26.889000\n", 527 | "16000 Statuses Processed: 2015-07-19 18:43:53.106000\n", 528 | "17000 Statuses Processed: 2015-07-19 18:44:19.457000\n", 529 | "18000 Statuses Processed: 2015-07-19 18:44:45.637000\n", 530 | "19000 Statuses Processed: 2015-07-19 18:45:11.255000\n", 531 | "20000 Statuses Processed: 2015-07-19 18:45:34.447000\n", 532 | "21000 Statuses Processed: 2015-07-19 18:45:58.425000\n", 533 | "22000 Statuses Processed: 2015-07-19 18:46:23.920000\n", 534 | "23000 Statuses Processed: 2015-07-19 18:46:49.274000\n", 535 | "24000 Statuses Processed: 2015-07-19 18:47:15.616000\n", 536 | "25000 Statuses Processed: 2015-07-19 18:47:39.930000\n", 537 | "26000 Statuses Processed: 2015-07-19 18:48:08.076000\n", 538 | "HTTP Error 502: Error parsing server response\n", 539 | "Error for URL https://graph.facebook.com/v2.0/5281959998/feed?fields=message,link,created_time,type,name,id,likes.limit%281%29.summary%28true%29,comments.limit%281%29.summary%28true%29,shares&limit=100&__paging_token=enc_AdBLHCQ9lOKXuEx1TEXyLWs7FEQ8RN7yGjUH0LXbw5iUpDXvcZCUIXJa2ZC2s6sBHC8EyrGl6Oafb9OqZBgBFzmuRZB9&access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&until=1340213557: 2015-07-19 18:48:23.256000\n", 540 | "27000 Statuses Processed: 2015-07-19 18:48:38.748000\n", 541 | "28000 Statuses Processed: 2015-07-19 18:49:03.033000\n", 542 | "29000 Statuses Processed: 2015-07-19 18:49:26.957000\n", 543 | "30000 Statuses Processed: 2015-07-19 18:49:51.405000\n", 544 | "31000 Statuses Processed: 2015-07-19 18:50:15.830000\n", 545 | "32000 Statuses Processed: 2015-07-19 18:50:37.641000\n", 546 | "33000 Statuses Processed: 2015-07-19 18:50:57.574000\n", 547 | "\n", 548 | "Done!\n", 549 | "33296 Statuses Processed in 0:14:28.200000\n" 550 | ] 551 | } 552 | ], 553 | "source": [ 554 | "def scrapeFacebookPageFeedStatus(page_id, access_token):\n", 555 | " with open('%s_facebook_statuses.csv' % page_id, 'wb') as file:\n", 556 | " w = csv.writer(file)\n", 557 | " w.writerow([\"status_id\", \"status_message\", \"link_name\", \"status_type\", \"status_link\",\n", 558 | " \"status_published\", \"num_likes\", \"num_comments\", \"num_shares\"])\n", 559 | " \n", 560 | " has_next_page = True\n", 561 | " num_processed = 0 # keep a count on how many we've processed\n", 562 | " 
scrape_starttime = datetime.datetime.now()\n", 563 | " \n", 564 | " print \"Scraping %s Facebook Page: %s\\n\" % (page_id, scrape_starttime)\n", 565 | " \n", 566 | " statuses = getFacebookPageFeedData(page_id, access_token, 100)\n", 567 | " \n", 568 | " while has_next_page:\n", 569 | " for status in statuses['data']:\n", 570 | " w.writerow(processFacebookPageFeedStatus(status))\n", 571 | " \n", 572 | " # output progress occasionally to make sure code is not stalling\n", 573 | " num_processed += 1\n", 574 | " if num_processed % 1000 == 0:\n", 575 | " print \"%s Statuses Processed: %s\" % (num_processed, datetime.datetime.now())\n", 576 | " \n", 577 | " # if there is no next page, we're done.\n", 578 | " if 'paging' in statuses.keys():\n", 579 | " statuses = json.loads(request_until_succeed(statuses['paging']['next']))\n", 580 | " else:\n", 581 | " has_next_page = False\n", 582 | " \n", 583 | " \n", 584 | " print \"\\nDone!\\n%s Statuses Processed in %s\" % (num_processed, datetime.datetime.now() - scrape_starttime)\n", 585 | "\n", 586 | "\n", 587 | "scrapeFacebookPageFeedStatus(page_id, access_token)" 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "The CSV can be opened in all major statistical programs. Have fun! :)\n", 595 | "\n", 596 | "You can download the [NYTimes data here](https://dl.dropboxusercontent.com/u/2017402/nytimes_facebook_statuses.zip). [4.6MB]" 597 | ] 598 | } 599 | ], 600 | "metadata": { 601 | "kernelspec": { 602 | "display_name": "Python 2", 603 | "language": "python", 604 | "name": "python2" 605 | }, 606 | "language_info": { 607 | "codemirror_mode": { 608 | "name": "ipython", 609 | "version": 2 610 | }, 611 | "file_extension": ".py", 612 | "mimetype": "text/x-python", 613 | "name": "python", 614 | "nbconvert_exporter": "python", 615 | "pygments_lexer": "ipython2", 616 | "version": "2.7.8" 617 | } 618 | }, 619 | "nbformat": 4, 620 | "nbformat_minor": 0 621 | } 622 | -------------------------------------------------------------------------------- /examples/reaction-example-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/weijiekoh/scrape-fb-posts-and-comments/159675efe537b2080e77e231a6457807d4e81e84/examples/reaction-example-1.png -------------------------------------------------------------------------------- /examples/reaction-example-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/weijiekoh/scrape-fb-posts-and-comments/159675efe537b2080e77e231a6457807d4e81e84/examples/reaction-example-2.png -------------------------------------------------------------------------------- /examples/reaction-example-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/weijiekoh/scrape-fb-posts-and-comments/159675efe537b2080e77e231a6457807d4e81e84/examples/reaction-example-3.png -------------------------------------------------------------------------------- /examples/reaction_count_data_analysis_example.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Example of Processing Facebook Reaction Data\n", 8 | "\n", 9 | "by Max Woolf (@minimaxir)\n", 10 | "\n", 11 | "*This notebook is licensed under the MIT License. 
If you use the code or data visualization designs contained within this notebook, it would be greatly appreciated if proper attribution is given back to this notebook and/or myself. Thanks! :)*" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 34, 17 | "metadata": { 18 | "collapsed": false 19 | }, 20 | "outputs": [ 21 | { 22 | "data": { 23 | "text/plain": [ 24 | "R version 3.3.0 (2016-05-03)\n", 25 | "Platform: x86_64-apple-darwin13.4.0 (64-bit)\n", 26 | "Running under: OS X 10.11.4 (El Capitan)\n", 27 | "\n", 28 | "locale:\n", 29 | "[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8\n", 30 | "\n", 31 | "attached base packages:\n", 32 | "[1] grid stats graphics grDevices utils datasets methods \n", 33 | "[8] base \n", 34 | "\n", 35 | "other attached packages:\n", 36 | " [1] viridis_0.3.4 tidyr_0.4.1 stringr_1.0.0 digest_0.6.9 \n", 37 | " [5] RColorBrewer_1.1-2 scales_0.4.0 extrafont_0.17 ggplot2_2.1.0 \n", 38 | " [9] dplyr_0.4.3 readr_0.2.2 \n", 39 | "\n", 40 | "loaded via a namespace (and not attached):\n", 41 | " [1] Rcpp_0.12.4 Rttf2pt1_1.3.3 magrittr_1.5 munsell_0.4.3 \n", 42 | " [5] uuid_0.1-2 colorspace_1.2-6 R6_2.1.2 plyr_1.8.3 \n", 43 | " [9] tools_3.3.0 parallel_3.3.0 gtable_0.2.0 DBI_0.4 \n", 44 | "[13] extrafontdb_1.0 lazyeval_0.1.10 assertthat_0.1 gridExtra_2.2.1 \n", 45 | "[17] IRdisplay_0.3 repr_0.4 base64enc_0.1-3 IRkernel_0.5 \n", 46 | "[21] evaluate_0.9 rzmq_0.7.7 stringi_1.0-1 jsonlite_0.9.19 " 47 | ] 48 | }, 49 | "execution_count": 34, 50 | "metadata": {}, 51 | "output_type": "execute_result" 52 | } 53 | ], 54 | "source": [ 55 | "source(\"Rstart.R\")\n", 56 | "\n", 57 | "library(tidyr)\n", 58 | "library(viridis)\n", 59 | "\n", 60 | "sessionInfo()" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 3, 66 | "metadata": { 67 | "collapsed": false 68 | }, 69 | "outputs": [ 70 | { 71 | "name": "stdout", 72 | "output_type": "stream", 73 | "text": [ 74 | "Source: local data frame [6 x 15]\n", 75 | "\n", 76 | " status_id\n", 77 | " (chr)\n", 78 | "1 5550296508_10154919083226509\n", 79 | "2 5550296508_10154919005411509\n", 80 | "3 5550296508_10154918925156509\n", 81 | "4 5550296508_10154918906011509\n", 82 | "5 5550296508_10154918844706509\n", 83 | "6 5550296508_10154918803531509\n", 84 | "Variables not shown: status_message (chr), link_name (chr), status_type (chr),\n", 85 | " status_link (chr), status_published (time), num_reactions (int), num_comments\n", 86 | " (int), num_shares (int), num_likes (int), num_loves (int), num_wows (int),\n", 87 | " num_hahas (int), num_sads (int), num_angrys (int)\n" 88 | ] 89 | }, 90 | { 91 | "data": { 92 | "text/html": [ 93 | "4258" 94 | ], 95 | "text/latex": [ 96 | "4258" 97 | ], 98 | "text/markdown": [ 99 | "4258" 100 | ], 101 | "text/plain": [ 102 | "[1] 4258" 103 | ] 104 | }, 105 | "execution_count": 3, 106 | "metadata": {}, 107 | "output_type": "execute_result" 108 | } 109 | ], 110 | "source": [ 111 | "df <- read_csv(\"cnn_facebook_statuses.csv\") %>% filter(status_published > '2016-02-24 00:00:00')\n", 112 | "\n", 113 | "print(head(df))\n", 114 | "nrow(df)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 31, 120 | "metadata": { 121 | "collapsed": false 122 | }, 123 | "outputs": [ 124 | { 125 | "name": "stdout", 126 | "output_type": "stream", 127 | "text": [ 128 | "Source: local data frame [6 x 7]\n", 129 | "\n", 130 | " date total_likes total_loves total_wows total_hahas total_sads\n", 131 | " (date) (int) (int) (int) (int) (int)\n", 132 | "1 
2016-02-24 215784 12366 9699 6670 2699\n", 133 | "2 2016-02-25 183785 8280 4879 12300 2049\n", 134 | "3 2016-02-26 191436 6445 6141 14510 1874\n", 135 | "4 2016-02-27 144926 8828 2300 1004 1984\n", 136 | "5 2016-02-28 140882 6593 1627 3657 3654\n", 137 | "6 2016-02-29 286802 13716 4404 5899 4410\n", 138 | "Variables not shown: total_angrys (int)\n" 139 | ] 140 | } 141 | ], 142 | "source": [ 143 | "df_agg <- df %>% group_by(date = as.Date(substr(status_published, 1, 10))) %>%\n", 144 | " summarize(total_likes=sum(num_likes),\n", 145 | " total_loves=sum(num_loves),\n", 146 | " total_wows=sum(num_wows),\n", 147 | " total_hahas=sum(num_hahas),\n", 148 | " total_sads=sum(num_sads),\n", 149 | " total_angrys=sum(num_angrys)) %>%\n", 150 | " arrange(date)\n", 151 | "\n", 152 | "print(head(df_agg))" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "For ggplot, data must be converted to long format." 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 62, 165 | "metadata": { 166 | "collapsed": false 167 | }, 168 | "outputs": [ 169 | { 170 | "name": "stdout", 171 | "output_type": "stream", 172 | "text": [ 173 | "Source: local data frame [20 x 3]\n", 174 | "\n", 175 | " date reaction count\n", 176 | " (date) (fctr) (int)\n", 177 | "1 2016-02-24 total_likes 215784\n", 178 | "2 2016-02-25 total_likes 183785\n", 179 | "3 2016-02-26 total_likes 191436\n", 180 | "4 2016-02-27 total_likes 144926\n", 181 | "5 2016-02-28 total_likes 140882\n", 182 | "6 2016-02-29 total_likes 286802\n", 183 | "7 2016-03-01 total_likes 197091\n", 184 | "8 2016-03-02 total_likes 204942\n", 185 | "9 2016-03-03 total_likes 198320\n", 186 | "10 2016-03-04 total_likes 113997\n", 187 | "11 2016-03-05 total_likes 154004\n", 188 | "12 2016-03-06 total_likes 219300\n", 189 | "13 2016-03-07 total_likes 140551\n", 190 | "14 2016-03-08 total_likes 161067\n", 191 | "15 2016-03-09 total_likes 104399\n", 192 | "16 2016-03-10 total_likes 158898\n", 193 | "17 2016-03-11 total_likes 212756\n", 194 | "18 2016-03-12 total_likes 98536\n", 195 | "19 2016-03-13 total_likes 91079\n", 196 | "20 2016-03-14 total_likes 155147\n" 197 | ] 198 | } 199 | ], 200 | "source": [ 201 | "df_agg_long <- df_agg %>% gather(key=reaction, value=count, total_likes:total_angrys) %>%\n", 202 | " mutate(reaction=factor(reaction))\n", 203 | "\n", 204 | "print(head(df_agg_long,20))" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "Create a stacked area chart. 
(filled to 100%)" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 64, 217 | "metadata": { 218 | "collapsed": false 219 | }, 220 | "outputs": [], 221 | "source": [ 222 | "plot <- ggplot(df_agg_long, aes(x=date, y=count, color=reaction, fill=reaction)) +\n", 223 | " geom_bar(size=0.25, position=\"fill\", stat=\"identity\") +\n", 224 | " fte_theme() +\n", 225 | " scale_x_date(breaks = date_breaks(\"1 month\"), labels = date_format(\"%b %Y\")) +\n", 226 | " scale_y_continuous(labels=percent) +\n", 227 | " theme(legend.title = element_blank(),\n", 228 | " legend.position=\"top\",\n", 229 | " legend.direction=\"horizontal\",\n", 230 | " legend.key.width=unit(0.5, \"cm\"),\n", 231 | " legend.key.height=unit(0.25, \"cm\"),\n", 232 | " legend.margin=unit(0,\"cm\")) +\n", 233 | " scale_color_viridis(discrete=T) +\n", 234 | " scale_fill_viridis(discrete=T) +\n", 235 | " labs(title=\"Daily Breakdown of Facebook Reactions on CNN's FB Posts\",\n", 236 | " x=\"Date Status Posted\",\n", 237 | " y=\"% Reaction Marketshare\")\n", 238 | "\n", 239 | "max_save(plot, \"reaction-example-1\", \"Facebook\")" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "![](reaction-example-1.png)\n", 247 | "\n", 248 | "The Likes reaction skews things. Run plot without it." 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 65, 254 | "metadata": { 255 | "collapsed": false 256 | }, 257 | "outputs": [], 258 | "source": [ 259 | "plot <- ggplot(df_agg_long %>% filter(reaction!=\"total_likes\"), aes(x=date, y=count, color=reaction, fill=reaction)) +\n", 260 | " geom_bar(size=0.25, position=\"fill\", stat=\"identity\") +\n", 261 | " fte_theme() +\n", 262 | " scale_x_date(breaks = date_breaks(\"1 month\"), labels = date_format(\"%b %Y\")) +\n", 263 | " scale_y_continuous(labels=percent) +\n", 264 | " theme(legend.title = element_blank(),\n", 265 | " legend.position=\"top\",\n", 266 | " legend.direction=\"horizontal\",\n", 267 | " legend.key.width=unit(0.5, \"cm\"),\n", 268 | " legend.key.height=unit(0.25, \"cm\"),\n", 269 | " legend.margin=unit(0,\"cm\")) +\n", 270 | " scale_color_viridis(discrete=T) +\n", 271 | " scale_fill_viridis(discrete=T) +\n", 272 | " labs(title=\"Daily Breakdown of Facebook Reactions on CNN's FB Posts\",\n", 273 | " x=\"Date Status Posted\",\n", 274 | " y=\"% Reaction Marketshare\")\n", 275 | "\n", 276 | "max_save(plot, \"reaction-example-2\", \"Facebook\")" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "![](reaction-example-2.png)" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "That visualization might be too crowded: use percent-wise calculations instead, and switch data to NYTimes for comparison." 
291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 76, 296 | "metadata": { 297 | "collapsed": false, 298 | "scrolled": false 299 | }, 300 | "outputs": [ 301 | { 302 | "name": "stdout", 303 | "output_type": "stream", 304 | "text": [ 305 | "Source: local data frame [6 x 6]\n", 306 | "\n", 307 | " date perc_loves perc_wows perc_hahas perc_sads perc_angrys\n", 308 | " (date) (dbl) (dbl) (dbl) (dbl) (dbl)\n", 309 | "1 2016-02-24 0.3930676 0.17360566 0.08621367 0.09740770 0.24970542\n", 310 | "2 2016-02-25 0.1919722 0.08666052 0.29210694 0.09332671 0.33593362\n", 311 | "3 2016-02-26 0.1435334 0.18946182 0.10831220 0.17396450 0.38472809\n", 312 | "4 2016-02-27 0.2736496 0.13627639 0.06443652 0.27570606 0.24993145\n", 313 | "5 2016-02-28 0.7713515 0.08522014 0.04054117 0.03737970 0.06550746\n", 314 | "6 2016-02-29 0.3399680 0.08842370 0.12708762 0.11256005 0.33196065\n" 315 | ] 316 | } 317 | ], 318 | "source": [ 319 | "df <- read_csv(\"nytimes_facebook_statuses.csv\") %>% filter(status_published > '2016-02-24 00:00:00')\n", 320 | "\n", 321 | "df_agg <- df %>% group_by(date = as.Date(substr(status_published, 1, 10))) %>%\n", 322 | " summarize(total_reactions=sum(num_loves)+sum(num_wows)+sum(num_hahas)+sum(num_sads)+sum(num_angrys),\n", 323 | " perc_loves=sum(num_loves)/total_reactions,\n", 324 | " perc_wows=sum(num_wows)/total_reactions,\n", 325 | " perc_hahas=sum(num_hahas)/total_reactions,\n", 326 | " perc_sads=sum(num_sads)/total_reactions,\n", 327 | " perc_angrys=sum(num_angrys)/total_reactions) %>%\n", 328 | " select(-total_reactions) %>%\n", 329 | " arrange(date)\n", 330 | "\n", 331 | "print(head(df_agg))" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 77, 337 | "metadata": { 338 | "collapsed": false 339 | }, 340 | "outputs": [ 341 | { 342 | "name": "stdout", 343 | "output_type": "stream", 344 | "text": [ 345 | "Source: local data frame [20 x 3]\n", 346 | "\n", 347 | " date reaction count\n", 348 | " (date) (fctr) (dbl)\n", 349 | "1 2016-02-24 perc_loves 0.39306756\n", 350 | "2 2016-02-25 perc_loves 0.19197220\n", 351 | "3 2016-02-26 perc_loves 0.14353339\n", 352 | "4 2016-02-27 perc_loves 0.27364957\n", 353 | "5 2016-02-28 perc_loves 0.77135153\n", 354 | "6 2016-02-29 perc_loves 0.33996797\n", 355 | "7 2016-03-01 perc_loves 0.34061714\n", 356 | "8 2016-03-02 perc_loves 0.24681208\n", 357 | "9 2016-03-03 perc_loves 0.35172992\n", 358 | "10 2016-03-04 perc_loves 0.19499779\n", 359 | "11 2016-03-05 perc_loves 0.14512737\n", 360 | "12 2016-03-06 perc_loves 0.40097144\n", 361 | "13 2016-03-07 perc_loves 0.30259557\n", 362 | "14 2016-03-08 perc_loves 0.36623147\n", 363 | "15 2016-03-09 perc_loves 0.21422640\n", 364 | "16 2016-03-10 perc_loves 0.31396083\n", 365 | "17 2016-03-11 perc_loves 0.33173516\n", 366 | "18 2016-03-12 perc_loves 0.06377902\n", 367 | "19 2016-03-13 perc_loves 0.25712914\n", 368 | "20 2016-03-14 perc_loves 0.33751152\n" 369 | ] 370 | } 371 | ], 372 | "source": [ 373 | "df_agg_long <- df_agg %>% gather(key=reaction, value=count, perc_loves:perc_angrys) %>%\n", 374 | " mutate(reaction=factor(reaction))\n", 375 | "\n", 376 | "print(head(df_agg_long,20))" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 78, 382 | "metadata": { 383 | "collapsed": false 384 | }, 385 | "outputs": [], 386 | "source": [ 387 | "plot <- ggplot(df_agg_long, aes(x=date, y=count, color=reaction)) +\n", 388 | " geom_line(size=0.5, stat=\"identity\") +\n", 389 | " fte_theme() +\n", 390 | " scale_x_date(breaks = date_breaks(\"1 
month\"), labels = date_format(\"%b %Y\")) +\n", 391 | " scale_y_continuous(labels=percent) +\n", 392 | " theme(legend.title = element_blank(),\n", 393 | " legend.position=\"top\",\n", 394 | " legend.direction=\"horizontal\",\n", 395 | " legend.key.width=unit(0.5, \"cm\"),\n", 396 | " legend.key.height=unit(0.25, \"cm\"),\n", 397 | " legend.margin=unit(0,\"cm\")) +\n", 398 | " scale_color_viridis(discrete=T) +\n", 399 | " scale_fill_viridis(discrete=T) +\n", 400 | " labs(title=\"Daily Breakdown of Facebook Reactions on NYTimes's FB Posts\",\n", 401 | " x=\"Date Status Posted\",\n", 402 | " y=\"% Reaction Marketshare\")\n", 403 | "\n", 404 | "max_save(plot, \"reaction-example-3\", \"Facebook\")" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "metadata": {}, 410 | "source": [ 411 | "![](reaction-example-3.png)" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "# The MIT License (MIT)\n", 419 | "\n", 420 | "Copyright (c) 2016 Max Woolf\n", 421 | "\n", 422 | "Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n", 423 | "\n", 424 | "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n", 425 | "\n", 426 | "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE." 
427 | ] 428 | } 429 | ], 430 | "metadata": { 431 | "kernelspec": { 432 | "display_name": "R", 433 | "language": "R", 434 | "name": "ir" 435 | }, 436 | "language_info": { 437 | "codemirror_mode": "r", 438 | "file_extension": ".r", 439 | "mimetype": "text/x-r-source", 440 | "name": "R", 441 | "pygments_lexer": "r", 442 | "version": "3.3.0" 443 | } 444 | }, 445 | "nbformat": 4, 446 | "nbformat_minor": 0 447 | } 448 | -------------------------------------------------------------------------------- /facebook_scrape.py: -------------------------------------------------------------------------------- 1 | import urllib.request, urllib.error, urllib.parse 2 | import json 3 | import datetime 4 | import csv 5 | import time 6 | 7 | 8 | def request_until_succeed(url, return_none_if_400=False): 9 | req = urllib.request.Request(url) 10 | success = False 11 | while success is False: 12 | try: 13 | response = urllib.request.urlopen(req) 14 | if response.getcode() == 200: 15 | success = True 16 | except Exception as e: 17 | print(e) 18 | time.sleep(5) 19 | 20 | print("Error for URL %s: %s" % (url, datetime.datetime.now())) 21 | print("Retrying...") 22 | 23 | if return_none_if_400: 24 | if '400' in str(e): 25 | return None; 26 | 27 | return response.read().decode() 28 | 29 | 30 | def unicode_normalize(text): 31 | # Convert fancy quote chars and non-breaking spaces 32 | return text.translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 33 | 0xa0:0x20 }).encode('utf-8') 34 | 35 | 36 | def get_comment_feed_data(status_id, access_token, num_comments): 37 | # Construct the URL string 38 | base = "https://graph.facebook.com/v2.6" 39 | node = "/%s/comments" % status_id 40 | fields = "?fields=id,message,like_count,created_time,comments,from,attachment" 41 | parameters = "&order=chronological&limit=%s&access_token=%s" % \ 42 | (num_comments, access_token) 43 | url = base + node + fields + parameters 44 | 45 | # retrieve data 46 | data = request_until_succeed(url, return_none_if_400=True) 47 | if data is None: 48 | return None 49 | else: 50 | return json.loads(data) 51 | 52 | def process_comment(comment, status_id, scrape_author_id, parent_id = ''): 53 | # The status is now a Python dictionary, so for top-level items, 54 | # we can simply call the key. 55 | 56 | # Additionally, some items may not always exist, 57 | # so must check for existence first 58 | 59 | comment_id = comment['id'] 60 | comment_message = '' if 'message' not in comment else \ 61 | unicode_normalize(comment['message']) 62 | comment_author = unicode_normalize(comment['from']['name']) 63 | 64 | comment_author_id = "None" 65 | if "id" in comment["from"]: 66 | comment_author_id = unicode_normalize(comment['from']['id']) 67 | 68 | comment_likes = 0 if 'like_count' not in comment else \ 69 | comment['like_count'] 70 | 71 | if 'attachment' in comment: 72 | attach_tag = "[[%s]]" % comment['attachment']['type'].upper() 73 | comment_message = attach_tag if comment_message is '' else \ 74 | (comment_message.decode("utf-8") + " " + \ 75 | attach_tag).encode("utf-8") 76 | 77 | # Time needs special care since a) it's in UTC and 78 | # b) it's not easy to use in statistical programs. 
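    # Note: the Graph API returns ISO-8601 UTC timestamps such as
    # '2016-02-24T18:30:00+0000'; the fixed -5 hour shift below approximates
    # US Eastern time but ignores daylight saving. For example:
    #   datetime.datetime.strptime('2016-02-24T18:30:00+0000',
    #                              '%Y-%m-%dT%H:%M:%S+0000')
    #   + datetime.timedelta(hours=-5)        # -> 2016-02-24 13:30:00
    # A DST-aware sketch (assuming Python 3.9+ with the stdlib zoneinfo
    # module) would instead do something like:
    #   utc_dt.replace(tzinfo=datetime.timezone.utc) \
    #         .astimezone(zoneinfo.ZoneInfo('America/New_York'))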
79 | 80 | comment_published = datetime.datetime.strptime( 81 | comment['created_time'],'%Y-%m-%dT%H:%M:%S+0000') 82 | comment_published = comment_published + datetime.timedelta(hours=-5) # EST 83 | comment_published = comment_published.strftime( 84 | '%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs 85 | 86 | # Return a tuple of all processed data 87 | 88 | if scrape_author_id: 89 | return (comment_id, status_id, parent_id, comment_message, 90 | comment_author, comment_author_id, 91 | comment_published, comment_likes) 92 | else: 93 | return (comment_id, status_id, parent_id, comment_message, 94 | comment_author, 95 | comment_published, comment_likes) 96 | 97 | def scrape_comments(page_or_group_id, app_id, app_secret, 98 | posts_input_file, output_filename, scrape_author_id): 99 | 100 | access_token = app_id + "|" + app_secret 101 | 102 | with open(output_filename, 'w') as file: 103 | w = csv.writer(file) 104 | if scrape_author_id: 105 | w.writerow(["comment_id", "status_id", "parent_id", "comment_message", 106 | "comment_author", "comment_author_id", 107 | "comment_published", "comment_likes"]) 108 | else: 109 | w.writerow(["comment_id", "status_id", "parent_id", "comment_message", 110 | "comment_author", 111 | "comment_published", "comment_likes"]) 112 | 113 | num_processed = 0 # keep a count on how many we've processed 114 | scrape_starttime = datetime.datetime.now() 115 | 116 | print("Scraping %s Comments From Posts: %s\n" % \ 117 | (posts_input_file, scrape_starttime)) 118 | 119 | with open(posts_input_file, 'r') as csvfile: 120 | reader = csv.DictReader(csvfile) 121 | 122 | for status in reader: 123 | has_next_page = True 124 | 125 | comments = get_comment_feed_data(status['status_id'], 126 | access_token, 100) 127 | 128 | while has_next_page and comments is not None: 129 | for comment in comments['data']: 130 | w.writerow(process_comment(comment, 131 | status['status_id'], scrape_author_id)) 132 | 133 | if 'comments' in comment: 134 | has_next_subpage = True 135 | 136 | subcomments = get_comment_feed_data( 137 | comment['id'], access_token, 100) 138 | 139 | while has_next_subpage: 140 | for subcomment in subcomments['data']: 141 | w.writerow(process_comment( subcomment, 142 | status['status_id'], 143 | scrape_author_id, 144 | comment['id'])) 145 | 146 | num_processed += 1 147 | if num_processed % 1000 == 0: 148 | print("%s Comments Processed: %s" % \ 149 | (num_processed, 150 | datetime.datetime.now())) 151 | 152 | if 'paging' in subcomments: 153 | if 'next' in subcomments['paging']: 154 | subcomments = json.loads( 155 | request_until_succeed(\ 156 | subcomments\ 157 | ['paging']['next'], 158 | return_none_if_400=True)) 159 | else: 160 | has_next_subpage = False 161 | else: 162 | has_next_subpage = False 163 | 164 | # output progress occasionally to make sure code is not 165 | # stalling 166 | num_processed += 1 167 | if num_processed % 1000 == 0: 168 | print("%s Comments Processed: %s" % \ 169 | (num_processed, datetime.datetime.now())) 170 | 171 | if 'paging' in comments: 172 | if 'next' in comments['paging']: 173 | comments = json.loads(request_until_succeed(\ 174 | comments['paging']['next'], 175 | return_none_if_400=True)) 176 | else: 177 | has_next_page = False 178 | else: 179 | has_next_page = False 180 | 181 | 182 | print("\nDone!\n%s Comments Processed in %s" % \ 183 | (num_processed, datetime.datetime.now() - scrape_starttime)) 184 | 185 | 186 | def get_status_reactions(status_id, access_token): 187 | # See http://stackoverflow.com/a/37239851 for Reactions 
parameters
188 |     # Reactions are only accessible at a single-post endpoint
189 | 
190 |     base = "https://graph.facebook.com/v2.6"
191 |     node = "/%s" % status_id
192 |     reactions = "/?fields=" \
193 |         "reactions.type(LIKE).limit(0).summary(total_count).as(like)" \
194 |         ",reactions.type(LOVE).limit(0).summary(total_count).as(love)" \
195 |         ",reactions.type(WOW).limit(0).summary(total_count).as(wow)" \
196 |         ",reactions.type(HAHA).limit(0).summary(total_count).as(haha)" \
197 |         ",reactions.type(SAD).limit(0).summary(total_count).as(sad)" \
198 |         ",reactions.type(ANGRY).limit(0).summary(total_count).as(angry)"
199 |     parameters = "&access_token=%s" % access_token
200 |     url = base + node + reactions + parameters
201 | 
202 |     # retrieve data
203 |     data = json.loads(request_until_succeed(url))
204 | 
205 |     return data
206 | 
207 | 
208 | def process_post(status, type_pg, access_token):
209 |     # The status is now a Python dictionary, so for top-level items,
210 |     # we can simply call the key.
211 | 
212 |     # Additionally, some items may not always exist,
213 |     # so we must check for existence first
214 | 
215 |     status_id = status['id']
216 |     status_message = '' if 'message' not in list(status.keys()) else \
217 |         unicode_normalize(status['message'])
218 |     link_name = '' if 'name' not in list(status.keys()) else \
219 |         unicode_normalize(status['name'])
220 |     status_type = status['type']
221 |     status_link = '' if 'link' not in list(status.keys()) else \
222 |         unicode_normalize(status['link'])
223 | 
224 |     status_author = None
225 |     if type_pg == "group":
226 |         status_author = unicode_normalize(status['from']['name'])
227 | 
228 |     # Time needs special care since a) it's in UTC and
229 |     # b) it's not easy to use in statistical programs.
230 | 
231 |     status_published = datetime.datetime.strptime(
232 |         status['created_time'],'%Y-%m-%dT%H:%M:%S+0000')
233 |     status_published = status_published + \
234 |         datetime.timedelta(hours=-5) # EST
235 |     # best time format for spreadsheet programs
236 |     status_published = status_published.strftime('%Y-%m-%d %H:%M:%S')
237 | 
238 |     # Nested items require chaining dictionary keys.
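    # Note: with the "limit(0).summary(true)" field modifiers requested in
    # get_feed_data(), each status carries nested summary objects roughly of
    # the form:
    #   {'reactions': {'data': [], 'summary': {'total_count': 123}},
    #    'comments':  {'data': [], 'summary': {'total_count': 45}},
    #    'shares':    {'count': 6}}
    # so the counts below are read by chaining keys, defaulting to 0 whenever
    # a field is absent (e.g. posts with no shares omit 'shares' entirely).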
239 | 240 | num_reactions = 0 if 'reactions' not in status else \ 241 | status['reactions']['summary']['total_count'] 242 | num_comments = 0 if 'comments' not in status else \ 243 | status['comments']['summary']['total_count'] 244 | num_shares = 0 if 'shares' not in status else \ 245 | status['shares']['count'] 246 | 247 | # Counts of each reaction separately; good for sentiment 248 | # Only check for reactions if past date of implementation: 249 | # http://newsroom.fb.com/news/2016/02/reactions-now-available-globally/ 250 | 251 | reactions = get_status_reactions(status_id, access_token) if \ 252 | status_published > '2016-02-24 00:00:00' else {} 253 | 254 | num_likes = 0 if 'like' not in reactions else \ 255 | reactions['like']['summary']['total_count'] 256 | 257 | # Special case: Set number of Likes to Number of reactions for pre-reaction 258 | # statuses 259 | 260 | num_likes = num_reactions if status_published < '2016-02-24 00:00:00' else \ 261 | num_likes 262 | 263 | def get_num_total_reactions(reaction_type, reactions): 264 | if reaction_type not in reactions: 265 | return 0 266 | else: 267 | return reactions[reaction_type]['summary']['total_count'] 268 | 269 | num_loves = get_num_total_reactions('love', reactions) 270 | num_wows = get_num_total_reactions('wow', reactions) 271 | num_hahas = get_num_total_reactions('haha', reactions) 272 | num_sads = get_num_total_reactions('sad', reactions) 273 | num_angrys = get_num_total_reactions('angry', reactions) 274 | 275 | # Return a tuple of all processed data. 276 | if type_pg == "group": 277 | # status_author only applies for groups 278 | return (status_id, status_message, status_author, 279 | link_name, status_type, status_link, status_published, 280 | num_reactions, num_comments, num_shares, num_likes, num_loves, 281 | num_wows, num_hahas, num_sads, num_angrys) 282 | elif type_pg == "page": 283 | return (status_id, status_message, 284 | link_name, status_type, status_link, status_published, 285 | num_reactions, num_comments, num_shares, num_likes, num_loves, 286 | num_wows, num_hahas, num_sads, num_angrys) 287 | 288 | 289 | def get_feed_data(page_or_group_id, type_pg, access_token, num_statuses): 290 | # Construct the URL string; see http://stackoverflow.com/a/37239851 for 291 | # Reactions parameters 292 | 293 | # the node field varies depending on whether we're scraping a page or a 294 | # group. 295 | posts_or_feed = str() 296 | 297 | base = "https://graph.facebook.com/v2.6" 298 | 299 | node = None 300 | fields = None 301 | if type_pg == "page": 302 | node = "/%s/posts" % page_or_group_id 303 | fields = "/?fields=message,link,created_time,type,name,id," + \ 304 | "comments.limit(0).summary(true),shares,reactions" + \ 305 | ".limit(0).summary(true)" 306 | elif type_pg == "group": 307 | node = "/%s/feed" % page_or_group_id 308 | fields = "/?fields=message,link,created_time,type,name,id," + \ 309 | "comments.limit(0).summary(true),shares,reactions." 
+ \ 310 | "limit(0).summary(true),from" 311 | 312 | parameters = "&limit=%s&access_token=%s" % (num_statuses, access_token) 313 | url = base + node + fields + parameters 314 | 315 | # retrieve data 316 | data = json.loads(request_until_succeed(url)) 317 | 318 | return data 319 | 320 | 321 | def scrape_posts(page_or_group_id, type_pg, app_id, app_secret, output_filename): 322 | # Make sure that the type_pg argument is either "page" or "group 323 | is_page = type_pg == "page" 324 | is_group = type_pg == "group" 325 | 326 | assert (is_group or is_page), "type_pg must be either 'page' or 'group'" 327 | 328 | access_token = app_id + "|" + app_secret 329 | 330 | with open(output_filename, 'w') as file: 331 | w = csv.writer(file) 332 | if type_pg == "page": 333 | w.writerow(["status_id", "status_message", 334 | "link_name", "status_type", "status_link", "status_published", 335 | "num_reactions", "num_comments", "num_shares", "num_likes", 336 | "num_loves", "num_wows", "num_hahas", "num_sads", 337 | "num_angrys"]) 338 | elif type_pg == "group": 339 | # status_author only applies for groups 340 | w.writerow(["status_id", "status_message", "status_author", 341 | "link_name", "status_type", "status_link", "status_published", 342 | "num_reactions", "num_comments", "num_shares", "num_likes", 343 | "num_loves", "num_wows", "num_hahas", "num_sads", 344 | "num_angrys"]) 345 | 346 | has_next_page = True 347 | num_processed = 0 # keep a count on how many we've processed 348 | scrape_starttime = datetime.datetime.now() 349 | 350 | print("Scraping %s Facebook %s: %s\n" % (page_or_group_id, type_pg, scrape_starttime)) 351 | 352 | statuses = get_feed_data(page_or_group_id, type_pg, access_token, 100) 353 | 354 | while has_next_page: 355 | for status in statuses['data']: 356 | 357 | # Ensure it is a status with the expected metadata 358 | if 'reactions' in status: 359 | w.writerow(process_post(status, type_pg, access_token)) 360 | 361 | # output progress occasionally to make sure code is not 362 | # stalling 363 | num_processed += 1 364 | if num_processed % 100 == 0: 365 | print("%s Statuses Processed: %s" % \ 366 | (num_processed, datetime.datetime.now())) 367 | 368 | # if there is no next page, we're done. 
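        # Note: Graph API feed responses are cursor-paginated; each page of
        # results looks roughly like:
        #   {'data': [ ...status objects... ],
        #    'paging': {'next': 'https://graph.facebook.com/v2.6/...', ...}}
        # The final page simply omits 'paging' (or its 'next' URL), which is
        # what the checks below rely on to terminate the loop.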
369 | if 'paging' in list(statuses.keys()): 370 | 371 | if not 'next' in statuses['paging']: 372 | has_next_page = False 373 | else: 374 | statuses = json.loads(request_until_succeed(statuses['paging']['next'])) 375 | else: 376 | has_next_page = False 377 | 378 | 379 | print("\nDone!\n%s Statuses Processed in %s" % \ 380 | (num_processed, datetime.datetime.now() - scrape_starttime)) 381 | -------------------------------------------------------------------------------- /run.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import argparse 4 | import re 5 | import sys 6 | 7 | import facebook_scrape 8 | 9 | 10 | def scrape_group_posts(group_id, app_id, app_secret, output_filename): 11 | facebook_scrape.scrape_posts(group_id, "group", app_id, app_secret, 12 | output_filename) 13 | 14 | 15 | def scrape_page_posts(page_id, app_id, app_secret, output_filename): 16 | facebook_scrape.scrape_posts(page_id, "page", app_id, app_secret, 17 | output_filename) 18 | 19 | 20 | def scrape_comments(page_id, app_id, app_secret, input_filename, 21 | output_filename, scrape_author_id): 22 | facebook_scrape.scrape_comments(page_id, app_id, app_secret, 23 | input_filename, output_filename, 24 | scrape_author_id) 25 | 26 | 27 | if __name__ == "__main__": 28 | parser = argparse.ArgumentParser(description="Scraper for *all* " + \ 29 | "posts, reactions, and (optionally) comments on a public " + \ 30 | "Facebook group or page.") 31 | 32 | group = parser.add_mutually_exclusive_group(required=True) 33 | 34 | group.add_argument("--group", metavar="Public group ID", type=str, 35 | help="The ID of the open/public Facebook group you want to " + \ 36 | "scrape.") 37 | 38 | group.add_argument("--page", metavar="Public page ID", type=str, 39 | help="The ID of the Facebook page you want to scrape.") 40 | 41 | parser.add_argument("--cred", metavar="Credential file", type=str, 42 | required=True, 43 | help="Path to a secret credentials file containing your app " + \ 44 | "ID and app secret. See README.md for the " + \ 45 | "credential file format.") 46 | 47 | parser.add_argument("--posts-output", metavar="Output CSV file for posts", 48 | type=str, required=True, 49 | help="Path to where you want the output CSV file to be") 50 | 51 | parser.add_argument("--scrape-comments", action="store_true", 52 | required=False, help="Scrape comments as well as posts.") 53 | 54 | parser.add_argument("--comments-output", metavar="Output CSV file for " + \ 55 | "comments", 56 | type=str, required=False, 57 | help="Path to where you want the output CSV file for comments " + \ 58 | "to be") 59 | 60 | parser.add_argument("--scrape-author-id", action="store_true", 61 | required=False, help="Scrape comment authors' Facebook IDs") 62 | 63 | parser.add_argument("--use-existing-posts-csv", action="store_true", 64 | required=False, help="Scrape comments from an existing " + \ 65 | "status/post CSV. 
Specify it using the --posts-output argument.")
66 | 
67 |     args = parser.parse_args()
68 | 
69 |     if args.scrape_comments and args.comments_output is None:
70 |         parser.error("Please specify an output CSV file for comments")
71 | 
72 |     # get credentials
73 |     app_id = app_secret = ""
74 |     with open(args.cred) as cred_file:
75 |         def _get_v(s):
76 |             pattern = r"^.+?\s*=\s*[\"']?(.+?)[\"']?$"
77 |             found = re.findall(pattern, s.strip())
78 |             return found[0] if found else ""
79 | 
80 |         for line in cred_file:
81 |             if line.startswith("app_id"):
82 |                 app_id = _get_v(line)
83 |             elif line.startswith("app_secret"):
84 |                 app_secret = _get_v(line)
85 | 
86 |     if not (app_id and app_secret):
87 |         print("Error: incorrect credential file format.")
88 |         print()
89 |         print("Please provide a credential file in the correct format.")
90 |         print("It should look something like this:")
91 |         print()
92 |         print("app_id = \"111111111111111\"")
93 |         print("app_secret = \"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\"")
94 |         print()
95 |         sys.exit(1)
96 | 
97 |     if args.group: # if user wants to scrape a group
98 |         if not args.use_existing_posts_csv:
99 |             scrape_group_posts(args.group, app_id, app_secret, args.posts_output)
100 |         if args.scrape_comments: # if user wants to scrape comments too
101 |             scrape_comments(args.group, app_id, app_secret, args.posts_output,
102 |                             args.comments_output, args.scrape_author_id)
103 | 
104 |     elif args.page: # if user wants to scrape a page
105 |         if not args.use_existing_posts_csv:
106 |             scrape_page_posts(args.page, app_id, app_secret, args.posts_output)
107 |         if args.scrape_comments: # if user wants to scrape comments too
108 |             scrape_comments(args.page, app_id, app_secret, args.posts_output,
109 |                             args.comments_output, args.scrape_author_id)
110 | 
--------------------------------------------------------------------------------