├── .gitignore ├── README.md ├── credentials.txt ├── examples ├── Rstart.R ├── entity.png ├── fb_scraper_data.png ├── how_to_build_facebook_scraper.ipynb ├── reaction-example-1.png ├── reaction-example-2.png ├── reaction-example-3.png └── reaction_count_data_analysis_example.ipynb ├── facebook_scrape.py └── run.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.csv 2 | 3 | # Vim 4 | [._]*.s[a-w][a-z] 5 | [._]s[a-w][a-z] 6 | # session 7 | Session.vim 8 | .netrwhist 9 | *~ 10 | tags 11 | 12 | # Byte-compiled / optimized / DLL files 13 | __pycache__/ 14 | *.py[cod] 15 | *$py.class 16 | 17 | # C extensions 18 | *.so 19 | 20 | # Distribution / packaging 21 | .Python 22 | env/ 23 | build/ 24 | develop-eggs/ 25 | dist/ 26 | downloads/ 27 | eggs/ 28 | .eggs/ 29 | lib/ 30 | lib64/ 31 | parts/ 32 | sdist/ 33 | var/ 34 | *.egg-info/ 35 | .installed.cfg 36 | *.egg 37 | 38 | # PyInstaller 39 | # Usually these files are written by a python script from a template 40 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 41 | *.manifest 42 | *.spec 43 | 44 | # Installer logs 45 | pip-log.txt 46 | pip-delete-this-directory.txt 47 | 48 | # Unit test / coverage reports 49 | htmlcov/ 50 | .tox/ 51 | .coverage 52 | .coverage.* 53 | .cache 54 | nosetests.xml 55 | coverage.xml 56 | *,cover 57 | .hypothesis/ 58 | 59 | # Translations 60 | *.mo 61 | *.pot 62 | 63 | # Django stuff: 64 | *.log 65 | local_settings.py 66 | 67 | # Flask stuff: 68 | instance/ 69 | .webassets-cache 70 | 71 | # Scrapy stuff: 72 | .scrapy 73 | 74 | # Sphinx documentation 75 | docs/_build/ 76 | 77 | # PyBuilder 78 | target/ 79 | 80 | # IPython Notebook 81 | .ipynb_checkpoints 82 | 83 | # pyenv 84 | .python-version 85 | 86 | # celery beat schedule file 87 | celerybeat-schedule 88 | 89 | # dotenv 90 | .env 91 | 92 | # virtualenv 93 | venv/ 94 | ENV/ 95 | 96 | # Spyder project settings 97 | .spyderproject 98 | 99 | # Rope project settings 100 | .ropeproject 101 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Facebook Page Post Scraper 2 | 3 | This is a fork of Max Woolf's [facebook-page-post-scraper](https://github.com/minimaxir/facebook-page-post-scraper). 4 | 5 | It only works on Python 3. 6 | 7 | This version allows you to specify the page/group you wish to scrape and where you want CSV files to be stored through command-line arguments. 8 | 9 | It also separates your App ID and App secret from the code; now, you have to store these credentials in a separate file. 10 | 11 | ![](/examples/fb_scraper_data.png) 12 | 13 | A tool for gathering *all* the posts and comments of a Facebook Page (or Open Facebook Group) and related metadata, including post message, post links, and counts of each reaction on the post. All this data is exported as a CSV, able to be imported into any data analysis program like Excel. 14 | 15 | The purpose of the script is to gather Facebook data for semantic analysis, which is greatly helped by the presence of high-quality Reaction data. 
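Once a scrape finishes, the posts CSV can be pulled into any analysis tool. As a quick illustration, here is a minimal [pandas](https://pandas.pydata.org/) sketch (pandas is not a dependency of this project, just one convenient option; the file name below is an example, and the column names follow the fields this scraper writes, such as `status_published`, `status_message`, `num_reactions`, and `num_angrys`):

```python
import pandas as pd

# Example file name; use whatever CSV you generated with --posts-output (see Usage below).
df = pd.read_csv("cnn_facebook_statuses.csv")

# Posts that drew the largest share of "angry" reactions.
df["angry_share"] = df["num_angrys"] / df["num_reactions"].clip(lower=1)

cols = ["status_published", "status_message", "angry_share"]
print(df.sort_values("angry_share", ascending=False)[cols].head(10))
```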
Here is a quick example of a potential Facebook Reaction data visualization using data from [CNN's Facebook page](https://www.facebook.com/cnn/):
16 | 
17 | ![](/examples/reaction-example-2.png)
18 | 
19 | ## Usage
20 | 
21 | To scrape posts from a page:
22 | 
23 | `python3 run.py --page <page ID> --cred <credentials file> --posts-output <posts CSV file>`
24 | 
25 | To scrape both posts and comments:
26 | 
27 | ```
28 | python3 run.py --page <page ID> --cred <credentials file> --posts-output <posts CSV file> \
29 |     --scrape-comments --comments-output <comments CSV file>
30 | ```
31 | 
32 | To scrape from a group, change `--page` to `--group`.
33 | 
34 | To skip downloading statuses and instead retrieve comments using an existing posts CSV file, use the `--use-existing-posts-csv` flag:
35 | 
36 | ```
37 | python3 run.py --page <page ID> --cred <credentials file> --posts-output <existing posts CSV file> \
38 |     --scrape-comments --comments-output <comments CSV file> --use-existing-posts-csv
39 | ```
40 | 
41 | 
42 | ### Credential file format
43 | 
44 | The `--cred` command-line argument specifies where your credential file is located.
45 | 
46 | **Do not share this file with anyone.**
47 | 
48 | It should look something like this:
49 | 
50 | ```
51 | app_id = "111111111111111"
52 | app_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
53 | ```
54 | 
55 | You need the App ID and App Secret of a Facebook app you control (I strongly recommend creating an app just for this purpose) and the Page ID of the Facebook Page you want to scrape.
56 | 
57 | Example CSVs for CNN, NYTimes, and BuzzFeed data are not included in this repository due to size, but you can download [CNN data here](https://dl.dropboxusercontent.com/u/2017402/cnn_facebook_statuses.csv.zip) [2.7MB ZIP], [NYTimes data here](https://dl.dropboxusercontent.com/u/2017402/nytimes_facebook_statuses.csv.zip) [4.9MB ZIP], and [BuzzFeed data here](https://dl.dropboxusercontent.com/u/2017402/buzzfeed_facebook_statuses.csv.zip) [2.1MB ZIP].
58 | 
59 | ### Getting the numeric group ID
60 | 
61 | For groups without a custom username, the ID will be in the address bar; for groups with custom usernames, do a View Source on the group page, search for "entity_id", and use the number to the right of that field. For example, the `group_id` of [Hackathon Hackers](https://www.facebook.com/groups/hackathonhackers/) is 759985267390294.
62 | 
63 | ![](/examples/entity.png)
64 | 
65 | You can download example data for [Hackathon Hackers here](https://dl.dropboxusercontent.com/u/2017402/759985267390294_facebook_statuses.csv.zip) [4.7MB ZIP].
66 | 
67 | Keep in mind that large pages such as CNN have *millions* of comments, so be careful! Scraping throughput is approximately 87k comments per hour, so a page with five million comments would take roughly 57 hours to scrape.
68 | 
69 | ## Privacy
70 | 
71 | This scraper can only scrape public Facebook data, which is available to anyone, even those who are not logged into Facebook. No personally identifiable data is collected in the Page variant; the Group variant does collect the name of each post's author, but that data is also public to non-logged-in users. Additionally, the script only uses officially documented Facebook API endpoints and does not circumvent any rate limits.
72 | 
73 | Note that this script, and any variant of it, *cannot* be used to scrape data from user profiles (the Facebook API specifically disallows this use case).
74 | 
75 | ## Maintainer
76 | 
77 | * Koh Wei Jie
78 | 
79 | ## Credits
80 | 
81 | This is a fork of Max Woolf's code at https://github.com/minimaxir/facebook-page-post-scraper.
82 | 
83 | Parts of this README were copied verbatim.
84 | 
85 | ## License
86 | 
87 | Be aware that this is a fork of Max Woolf's MIT-licensed code.
88 | -------------------------------------------------------------------------------- /credentials.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /examples/Rstart.R: -------------------------------------------------------------------------------- 1 | library(readr) 2 | library(dplyr) 3 | library(ggplot2) 4 | library(extrafont) 5 | library(scales) 6 | library(grid) 7 | library(RColorBrewer) 8 | library(digest) 9 | library(readr) 10 | library(stringr) 11 | 12 | 13 | fontFamily <- "Source Sans Pro" 14 | fontTitle <- "Source Sans Pro Semibold" 15 | 16 | color_palette = c("#16a085","#27ae60","#2980b9","#8e44ad","#f39c12","#c0392b","#1abc9c", "#2ecc71", "#3498db", "#9b59b6", "#f1c40f","#e74c3c") 17 | 18 | neutral_colors = function(number) { 19 | return (brewer.pal(11, "RdYlBu")[-c(5:7)][(number %% 8) + 1]) 20 | } 21 | 22 | set1_colors = function(number) { 23 | return (brewer.pal(9, "Set1")[c(-6,-8)][(number %% 7) + 1]) 24 | } 25 | 26 | theme_custom <- function() {theme_bw(base_size = 8) + 27 | theme(panel.background = element_rect(fill="#eaeaea"), 28 | plot.background = element_rect(fill="white"), 29 | panel.grid.minor = element_blank(), 30 | panel.grid.major = element_line(color="#dddddd"), 31 | axis.ticks.x = element_blank(), 32 | axis.ticks.y = element_blank(), 33 | axis.title.x = element_text(family=fontTitle, size=8, vjust=-.3), 34 | axis.title.y = element_text(family=fontTitle, size=8, vjust=1.5), 35 | panel.border = element_rect(color="#cccccc"), 36 | text = element_text(color = "#1a1a1a", family=fontFamily), 37 | plot.margin = unit(c(0.25,0.1,0.1,0.35), "cm"), 38 | plot.title = element_text(family=fontTitle, size=9, vjust=1)) 39 | } 40 | 41 | create_watermark <- function(source = '', filename = '', dark=F) { 42 | 43 | bg_white = "#FFFFFF" 44 | bg_text = '#969696' 45 | 46 | if (dark) { 47 | bg_white = "#000000" 48 | bg_text = '#666666' 49 | } 50 | 51 | watermark <- ggplot(aes(x,y), data=data.frame(x=c(0.5), y=c(0.5))) + geom_point(color = "transparent") + 52 | geom_text(x=0, y=1.25, label="By Max Woolf — minimaxir.com", family="Source Sans Pro", color=bg_text, size=1.75, hjust=0) + 53 | 54 | geom_text(x=5, y=1.25, label="Made using R and ggplot2", family="Source Sans Pro", color=bg_text, size=1.75) + 55 | scale_x_continuous(limits=c(0,10)) + 56 | scale_y_continuous(limits=c(0.5,1.5)) + 57 | annotate("segment", x = 0, xend = 10, y=1.5, yend=1.5, color=bg_text, size=0.1) + 58 | theme_bw() + 59 | theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), legend.position = "none", 60 | panel.border = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank(), 61 | axis.ticks = element_blank(), plot.margin = unit(c(0.0,0,-0.4,0), "cm")) + 62 | theme(plot.background=element_rect(fill=bg_white, color=bg_white),panel.background=element_rect(fill=bg_white, color=bg_white)) + 63 | scale_color_manual(values=bg_text) 64 | 65 | if (nchar(source) > 0) {watermark <- watermark + geom_text(x=10, y=1.25, label=paste("Data via",source), family="Source Sans Pro", color=bg_text, size=1.75, hjust=1)} 66 | 67 | return (watermark) 68 | } 69 | 70 | web_Layout <- grid.layout(nrow = 2, ncol = 1, heights = unit(c(2, 71 | 0.125), c("null", "null")), ) 72 | tallweb_Layout <- grid.layout(nrow = 2, ncol = 1, heights = unit(c(3.5, 73 | 0.125), c("null", "null")), ) 74 | video_Layout <- 
grid.layout(nrow = 1, ncol = 2, widths = unit(c(2, 75 | 1), c("null", "null")), ) 76 | 77 | #grid.show.layout(Layout) 78 | vplayout <- function(...) { 79 | grid.newpage() 80 | pushViewport(viewport(layout = web_Layout)) 81 | } 82 | 83 | talllayout <- function(...) { 84 | grid.newpage() 85 | pushViewport(viewport(layout = tallweb_Layout)) 86 | } 87 | 88 | vidlayout <- function(...) { 89 | grid.newpage() 90 | pushViewport(viewport(layout = video_Layout)) 91 | } 92 | 93 | subplot <- function(x, y) viewport(layout.pos.row = x, 94 | layout.pos.col = y) 95 | 96 | web_plot <- function(a, b) { 97 | vplayout() 98 | print(a, vp = subplot(1, 1)) 99 | print(b, vp = subplot(2, 1)) 100 | } 101 | 102 | tallweb_plot <- function(a, b) { 103 | talllayout() 104 | print(a, vp = subplot(1, 1)) 105 | print(b, vp = subplot(2, 1)) 106 | } 107 | 108 | video_plot <- function(a, b) { 109 | vidlayout() 110 | print(a, vp = subplot(1, 1)) 111 | print(b, vp = subplot(1, 2)) 112 | } 113 | 114 | max_save <- function(plot1, filename, source = '', pdf = FALSE, w=4, h=3, tall=F, dark=F, bg_overide=NA) { 115 | png(paste(filename,"png",sep="."),res=300,units="in",width=w,height=h) 116 | plot.new() 117 | #if (!is.na(bg_overide)) {par(bg = bg_overide)} 118 | ifelse(tall,tallweb_plot(plot1,create_watermark(source, filename, dark)),web_plot(plot1,create_watermark(source, filename, dark))) 119 | dev.off() 120 | 121 | if (pdf) { 122 | quartz(width=w,height=h,dpi=144) 123 | #if (!is.na(bg_overide)) {par(bg = bg_overide)} 124 | web_plot(plot1,create_watermark(source, filename, dark)) 125 | quartz.save(paste(filename,"pdf",sep="."), type = "pdf", device = dev.cur()) 126 | } 127 | } 128 | 129 | video_save <- function(plot1, plot2, filename) { 130 | png(paste(filename,"png",sep="."),res=300,units="in",width=1920/300,height=1080/300) 131 | video_plot(plot1,plot2) 132 | dev.off() 133 | 134 | } 135 | 136 | fte_theme <- function (palate_color = "Greys") { 137 | 138 | #display.brewer.all(n=9,type="seq",exact.n=TRUE) 139 | palate <- brewer.pal(palate_color, n=9) 140 | color.background = palate[1] 141 | color.grid.minor = palate[3] 142 | color.grid.major = palate[3] 143 | color.axis.text = palate[6] 144 | color.axis.title = palate[7] 145 | color.title = palate[9] 146 | #color.title = "#2c3e50" 147 | 148 | font.title <- "Source Sans Pro" 149 | font.axis <- "Open Sans Condensed Bold" 150 | #font.axis <- "M+ 1m regular" 151 | #font.title <- "Arial" 152 | #font.axis <- "Arial" 153 | 154 | 155 | theme_bw(base_size=9) + 156 | # Set the entire chart region to a light gray color 157 | theme(panel.background=element_rect(fill=color.background, color=color.background)) + 158 | theme(plot.background=element_rect(fill=color.background, color=color.background)) + 159 | theme(panel.border=element_rect(color=color.background)) + 160 | # Format the grid 161 | theme(panel.grid.major=element_line(color=color.grid.major,size=.25)) + 162 | theme(panel.grid.minor=element_blank()) + 163 | #scale_x_continuous(minor_breaks=0,breaks=seq(0,100,10),limits=c(0,100)) + 164 | #scale_y_continuous(minor_breaks=0,breaks=seq(0,26,4),limits=c(0,25)) + 165 | theme(axis.ticks=element_blank()) + 166 | # Dispose of the legend 167 | theme(legend.position="none") + 168 | theme(legend.background = element_rect(fill=color.background)) + 169 | theme(legend.text = element_text(size=7,colour=color.axis.title,family=font.axis)) + 170 | # Set title and axis labels, and format these and tick marks 171 | theme(plot.title=element_text(colour=color.title,family=font.title, size=9, vjust=1.25, 
lineheight=0.1)) + 172 | theme(axis.text.x=element_text(size=7,colour=color.axis.text,family=font.axis)) + 173 | theme(axis.text.y=element_text(size=7,colour=color.axis.text,family=font.axis)) + 174 | theme(axis.title.y=element_text(size=7,colour=color.axis.title,family=font.title, vjust=1.25)) + 175 | theme(axis.title.x=element_text(size=7,colour=color.axis.title,family=font.title, vjust=0)) + 176 | 177 | # Big bold line at y=0 178 | #geom_hline(yintercept=0,size=0.75,colour=palate[9]) + 179 | # Plot margins and finally line annotations 180 | theme(plot.margin = unit(c(0.35, 0.2, 0.15, 0.4), "cm")) + 181 | 182 | theme(strip.background = element_rect(fill=color.background, color=color.background),strip.text=element_text(size=7,colour=color.axis.title,family=font.title)) 183 | 184 | } 185 | -------------------------------------------------------------------------------- /examples/entity.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/weijiekoh/scrape-fb-posts-and-comments/159675efe537b2080e77e231a6457807d4e81e84/examples/entity.png -------------------------------------------------------------------------------- /examples/fb_scraper_data.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/weijiekoh/scrape-fb-posts-and-comments/159675efe537b2080e77e231a6457807d4e81e84/examples/fb_scraper_data.png -------------------------------------------------------------------------------- /examples/how_to_build_facebook_scraper.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# How to Scrape Data From Facebook Page Posts for Statistical Analysis\n", 8 | "\n", 9 | "By [Max Woolf (@minimaxir)](http://minimaxir.com/)\n", 10 | "\n", 11 | "This notebook describes how to build a Facebook Scraper using the latest version of Facebook's Graph API (v2.4). This is the accompanyment to my blog post [How to Scrape Data From Facebook Page Posts for Statistical Analysis](http://minimaxir.com/2015/07/facebook-scraper/)." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": { 18 | "collapsed": true 19 | }, 20 | "outputs": [], 21 | "source": [ 22 | "# import some Python dependencies\n", 23 | "\n", 24 | "import urllib2\n", 25 | "import json\n", 26 | "import datetime\n", 27 | "import csv\n", 28 | "import time" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "Accessing Facebook page data requires an access token.\n", 36 | "\n", 37 | "Since the user access token expires within an hour, we need to create a dummy application *for the sole purpose of scraping* and use the app ID and app secret generated there [as described here](https://developers.facebook.com/docs/facebook-login/access-tokens#apptokens), both of which never expire." 
38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": { 44 | "collapsed": false 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "# Since the code output in this notebook leaks the app_secret,\n", 49 | "# it has been reset by the time you read this.\n", 50 | "\n", 51 | "app_id = \"272535582777707\"\n", 52 | "app_secret = \"59e7ab31b01d3a5a90ec15a7a45a5e3b\" # DO NOT SHARE WITH ANYONE!\n", 53 | "\n", 54 | "access_token = app_id + \"|\" + app_secret" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "Now we can access public Facebook data without limit. Let's do our analysis on the [New York Times Facebook page](https://www.facebook.com/nytimes), which is popular enough to yield good data." 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 3, 67 | "metadata": { 68 | "collapsed": true 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "page_id = 'nytimes'" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "Let's write a quick program to ping NYT's Facebook page to verify that the `access_token` works and the `page_id` is valid." 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 4, 85 | "metadata": { 86 | "collapsed": false 87 | }, 88 | "outputs": [ 89 | { 90 | "name": "stdout", 91 | "output_type": "stream", 92 | "text": [ 93 | "{\n", 94 | " \"id\": \"5281959998\", \n", 95 | " \"name\": \"The New York Times\"\n", 96 | "}\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "def testFacebookPageData(page_id, access_token):\n", 102 | " \n", 103 | " # construct the URL string\n", 104 | " base = \"https://graph.facebook.com/v2.4\"\n", 105 | " node = \"/\" + page_id\n", 106 | " parameters = \"/?access_token=%s\" % access_token\n", 107 | " url = base + node + parameters\n", 108 | " \n", 109 | " # retrieve data\n", 110 | " req = urllib2.Request(url)\n", 111 | " response = urllib2.urlopen(req)\n", 112 | " data = json.loads(response.read())\n", 113 | " \n", 114 | " print json.dumps(data, indent=4, sort_keys=True)\n", 115 | " \n", 116 | "\n", 117 | "testFacebookPageData(page_id, access_token)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "When scraping large amounts of data from public APIs, there's a high probability that you'll hit an [HTTP Error 500 (Internal Error)](http://www.checkupdown.com/status/E500.html) at some point. There is no way to avoid that on our end. \n", 125 | "\n", 126 | "Instead, we'll use a helper function to catch the error and try again after a few seconds, which usually works. This helper function also consolidates the data retrival code, so it kills two birds with one stone." 
127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 5, 132 | "metadata": { 133 | "collapsed": true 134 | }, 135 | "outputs": [], 136 | "source": [ 137 | "def request_until_succeed(url):\n", 138 | " req = urllib2.Request(url)\n", 139 | " success = False\n", 140 | " while success is False:\n", 141 | " try: \n", 142 | " response = urllib2.urlopen(req)\n", 143 | " if response.getcode() == 200:\n", 144 | " success = True\n", 145 | " except Exception, e:\n", 146 | " print e\n", 147 | " time.sleep(5)\n", 148 | " \n", 149 | " print \"Error for URL %s: %s\" % (url, datetime.datetime.now())\n", 150 | "\n", 151 | " return response.read()" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "The data is the Facebook Page metadata however; we need to change the endpoint to the /feed endpoint." 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 6, 164 | "metadata": { 165 | "collapsed": false 166 | }, 167 | "outputs": [ 168 | { 169 | "name": "stdout", 170 | "output_type": "stream", 171 | "text": [ 172 | "{\n", 173 | " \"data\": [\n", 174 | " {\n", 175 | " \"created_time\": \"2015-07-20T01:25:01+0000\", \n", 176 | " \"id\": \"5281959998_10150628157724999\", \n", 177 | " \"message\": \"The planned megalopolis, a metropolitan area that would be about 6 times the size of New York\\u2019s, is meant to revamp northern China\\u2019s economy and become a laboratory for modern urban growth.\"\n", 178 | " }, \n", 179 | " {\n", 180 | " \"created_time\": \"2015-07-19T22:55:01+0000\", \n", 181 | " \"id\": \"5281959998_10150628161129999\", \n", 182 | " \"message\": \"\\\"It\\u2019s safe to say that federal agencies are not where we want them to be across the board,\\\" said President Barack Obama's top cybersecurity adviser. \\\"We clearly need to be moving faster.\\\"\"\n", 183 | " }, \n", 184 | " {\n", 185 | " \"created_time\": \"2015-07-19T22:25:01+0000\", \n", 186 | " \"id\": \"5281959998_10150626434639999\", \n", 187 | " \"message\": \"Showcase your summer tomatoes in this elegant crostata.\"\n", 188 | " }, \n", 189 | " {\n", 190 | " \"created_time\": \"2015-07-19T21:55:08+0000\", \n", 191 | " \"id\": \"5281959998_10150628170209999\", \n", 192 | " \"message\": \"The task: Create a technologically sophisticated barbecue smoker that could outperform the best product on the market and be sold for less than $1,500.\"\n", 193 | " }, \n", 194 | " {\n", 195 | " \"created_time\": \"2015-07-19T21:25:00+0000\", \n", 196 | " \"id\": \"5281959998_10150626449129999\", \n", 197 | " \"message\": \"Achieving pastel hair can be time-consuming and toxic \\u2014 but for some, so very worth it.\"\n", 198 | " }, \n", 199 | " {\n", 200 | " \"created_time\": \"2015-07-19T20:53:05+0000\", \n", 201 | " \"id\": \"5281959998_10150626425084999\", \n", 202 | " \"message\": \"Attention, meat lovers: This simple barbecue sauce goes beautifully with pork and chicken.\"\n", 203 | " }, \n", 204 | " {\n", 205 | " \"created_time\": \"2015-07-19T20:25:07+0000\", \n", 206 | " \"id\": \"5281959998_10150628132119999\", \n", 207 | " \"message\": \"He passed the police officer exam in 2011. He went through orientation and started undergoing the required background checks in 2013. Then, the process stopped cold. No emails. No calls. No explanations. 
Silence.\"\n", 208 | " }, \n", 209 | " {\n", 210 | " \"created_time\": \"2015-07-19T19:55:32+0000\", \n", 211 | " \"id\": \"5281959998_10150628116259999\", \n", 212 | " \"message\": \"The election is 16 months away, but knowing what we know now, what should we expect the economic backdrop to be when Americans choose their next president?\"\n", 213 | " }, \n", 214 | " {\n", 215 | " \"created_time\": \"2015-07-19T19:25:07+0000\", \n", 216 | " \"id\": \"5281959998_10150628097394999\", \n", 217 | " \"message\": \"\\\"By focusing so intently on physical fitness, the corps is avoiding the real barrier to integration \\u2014 the hypermasculine culture at its heart.\\\" Read on in The New York Times Opinion.\"\n", 218 | " }, \n", 219 | " {\n", 220 | " \"created_time\": \"2015-07-19T19:05:01+0000\", \n", 221 | " \"id\": \"5281959998_10150628071729999\", \n", 222 | " \"message\": \"U2's \\u201cInnocence and Experience\\u201d tour merges past and present, peace and war, audience and band, punk and statesman, grass-roots activism and corporate philanthropy.\"\n", 223 | " }, \n", 224 | " {\n", 225 | " \"created_time\": \"2015-07-19T18:55:05+0000\", \n", 226 | " \"id\": \"5281959998_10150628073894999\", \n", 227 | " \"message\": \"\\\"I always believe in apologizing if you\\u2019ve done something wrong, but if you read my statement, you\\u2019ll see I said nothing wrong,\\\" Donald J. Trump said in an interview.\"\n", 228 | " }, \n", 229 | " {\n", 230 | " \"created_time\": \"2015-07-19T18:25:21+0000\", \n", 231 | " \"id\": \"5281959998_10150628056964999\", \n", 232 | " \"message\": \"Booing, like opera, can be divided into several genres.\"\n", 233 | " }, \n", 234 | " {\n", 235 | " \"created_time\": \"2015-07-19T17:55:08+0000\", \n", 236 | " \"id\": \"5281959998_10150628040459999\", \n", 237 | " \"message\": \"\\\"Nearly at once, [the Confederate flag and Atticus Finch] have fallen from grace in ways that were unimaginable just months ago. They are forcing a reckoning with ourselves and our history, a reassessment of who we were and of what we might become.\\\" Read on in The New York Times Opinion.\"\n", 238 | " }, \n", 239 | " {\n", 240 | " \"created_time\": \"2015-07-19T17:25:00+0000\", \n", 241 | " \"id\": \"5281959998_10150627982469999\", \n", 242 | " \"message\": \"It's National Ice Cream Day. How about cooling off with a treat?\", \n", 243 | " \"story\": \"The New York Times added 4 new photos.\"\n", 244 | " }, \n", 245 | " {\n", 246 | " \"created_time\": \"2015-07-19T16:55:07+0000\", \n", 247 | " \"id\": \"5281959998_10150628000024999\", \n", 248 | " \"message\": \"Bystanders watched people wave flags celebrating Pan-Africanism, the Confederacy and the Nazi Party. And they watched as black demonstrators raised clenched fists, and white demonstrators performed Nazi salutes.\"\n", 249 | " }, \n", 250 | " {\n", 251 | " \"created_time\": \"2015-07-19T16:25:08+0000\", \n", 252 | " \"id\": \"5281959998_10150627989069999\", \n", 253 | " \"message\": \"\\\"Because in the sunset of his presidency, Barack Obama's bolder side is rising. He\\u2019s a lame duck who doesn\\u2019t give a damn.\\\" Read on in The New York Times Opinion.\"\n", 254 | " }, \n", 255 | " {\n", 256 | " \"created_time\": \"2015-07-19T15:55:06+0000\", \n", 257 | " \"id\": \"5281959998_10150627979424999\", \n", 258 | " \"message\": \"The flyby of Pluto was a triumph of human ingenuity and the capstone of a mission that unfolded nearly flawlessly. 
Yet it almost didn't happen.\"\n", 259 | " }, \n", 260 | " {\n", 261 | " \"created_time\": \"2015-07-19T15:25:04+0000\", \n", 262 | " \"id\": \"5281959998_10150627970394999\", \n", 263 | " \"message\": \"After 6 months apart, Caroline Dove planned to reunite with her boyfriend of more than 2 years. But before she could make the trip, there came a final, portentous message.\"\n", 264 | " }, \n", 265 | " {\n", 266 | " \"created_time\": \"2015-07-19T14:55:08+0000\", \n", 267 | " \"id\": \"5281959998_10150627962014999\", \n", 268 | " \"message\": \"Hillary Clinton has made the struggles of her mother a central part of her 2016 campaign\\u2019s message. But her father, whom she rarely talks about publicly, exerted an equally powerful, if sometimes bruising, influence on the woman who wants to become the first female president.\"\n", 269 | " }, \n", 270 | " {\n", 271 | " \"created_time\": \"2015-07-19T14:25:09+0000\", \n", 272 | " \"id\": \"5281959998_10150627952769999\", \n", 273 | " \"message\": \"Quotation of the Day: \\\"When your contract is over, they send you home, saying they\\u2019ve transferred the money. You get home, and there is nothing there.\\\" \\u2014 Yuriy Cheng, a Ukrainian seaman, describing the owner of the Dona Liberta, a ship that is a case study of misconduct at sea.\"\n", 274 | " }, \n", 275 | " {\n", 276 | " \"created_time\": \"2015-07-19T12:55:01+0000\", \n", 277 | " \"id\": \"5281959998_10150626434214999\", \n", 278 | " \"message\": \"Summer on a stick. (via The New York Times Food)\"\n", 279 | " }, \n", 280 | " {\n", 281 | " \"created_time\": \"2015-07-19T09:55:00+0000\", \n", 282 | " \"id\": \"5281959998_10150627665974999\", \n", 283 | " \"message\": \"The surge of migrants into Europe from war-ravaged and impoverished parts of the Middle East, Afghanistan and Africa has shifted in recent months. Migrants are now pushing by land across the western Balkans, in numbers roughly equal to those entering the Continent through Italy.\"\n", 284 | " }, \n", 285 | " {\n", 286 | " \"created_time\": \"2015-07-19T03:55:00+0000\", \n", 287 | " \"id\": \"5281959998_10150626450789999\", \n", 288 | " \"message\": \"When your big toe isn't your biggest toe.\"\n", 289 | " }, \n", 290 | " {\n", 291 | " \"created_time\": \"2015-07-19T02:55:00+0000\", \n", 292 | " \"id\": \"5281959998_10150626440069999\", \n", 293 | " \"message\": \"\\\"Progress is occurring, as courts accept that in marriage and other matters, gender can't be reduced to chromosomes or surgeries,\\\" writes J. 
Courtney Sullivan in The New York Times Opinion.\"\n", 294 | " }, \n", 295 | " {\n", 296 | " \"created_time\": \"2015-07-19T01:55:01+0000\", \n", 297 | " \"id\": \"5281959998_10150627562209999\", \n", 298 | " \"message\": \"Experimenting with neon lavender, sea-foam green and soft periwinkle.\"\n", 299 | " }\n", 300 | " ], \n", 301 | " \"paging\": {\n", 302 | " \"next\": \"https://graph.facebook.com/v2.4/5281959998/feed?access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&limit=25&until=1437270901&__paging_token=enc_AdB73LgZAUngYJIdoZCGUgWvKdL9zs23TBqdfeK90PnPs9MqO7xeze7ANGK2zMxZAveZAvwa1nHzTObmzuKiHY7MVVow\", \n", 303 | " \"previous\": \"https://graph.facebook.com/v2.4/5281959998/feed?since=1437355501&access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&limit=25&__paging_token=enc_AdC4YOxNofFbJWmap6PZC6S0iyiWG8A1FpsYTMrBG62tmT6HfNuhc6rcxL6fMk8ZAxx0EQcFy52SJ2fJ1TbIL47EQx&__previous=1\"\n", 304 | " }\n", 305 | "}\n" 306 | ] 307 | } 308 | ], 309 | "source": [ 310 | "def testFacebookPageFeedData(page_id, access_token):\n", 311 | " \n", 312 | " # construct the URL string\n", 313 | " base = \"https://graph.facebook.com/v2.4\"\n", 314 | " node = \"/\" + page_id + \"/feed\" # changed\n", 315 | " parameters = \"/?access_token=%s\" % access_token\n", 316 | " url = base + node + parameters\n", 317 | " \n", 318 | " # retrieve data\n", 319 | " data = json.loads(request_until_succeed(url))\n", 320 | " \n", 321 | " print json.dumps(data, indent=4, sort_keys=True)\n", 322 | " \n", 323 | "\n", 324 | "testFacebookPageFeedData(page_id, access_token)" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "In v2.4, the default behavior is to return very, very little metadata for statuses in order to reduce bandwidth, with the expectation the user will request the necessary fields.\n", 332 | "\n", 333 | "We don't need data on every NYT status. Yet. Let's reduce the requested fields to exactly what we need, and the number of stories returned to 1 so we can process it." 
334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": 7, 339 | "metadata": { 340 | "collapsed": false, 341 | "scrolled": true 342 | }, 343 | "outputs": [ 344 | { 345 | "name": "stdout", 346 | "output_type": "stream", 347 | "text": [ 348 | "{\n", 349 | " \"comments\": {\n", 350 | " \"data\": [\n", 351 | " {\n", 352 | " \"can_remove\": false, \n", 353 | " \"created_time\": \"2015-07-20T01:28:02+0000\", \n", 354 | " \"from\": {\n", 355 | " \"id\": \"859569687424896\", \n", 356 | " \"name\": \"Chris Gagne\"\n", 357 | " }, \n", 358 | " \"id\": \"10150628157724999_10150628249759999\", \n", 359 | " \"like_count\": 9, \n", 360 | " \"message\": \"Aaaaaaaand there goes the rest of Beijing's clean air, whatever was left of it.\", \n", 361 | " \"user_likes\": false\n", 362 | " }\n", 363 | " ], \n", 364 | " \"paging\": {\n", 365 | " \"cursors\": {\n", 366 | " \"after\": \"MzE=\", \n", 367 | " \"before\": \"MzE=\"\n", 368 | " }, \n", 369 | " \"next\": \"https://graph.facebook.com/v2.0/5281959998_10150628157724999/comments?order=chronological&limit=1&summary=true&access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&after=MzE%3D\"\n", 370 | " }, \n", 371 | " \"summary\": {\n", 372 | " \"order\": \"ranked\", \n", 373 | " \"total_count\": 31\n", 374 | " }\n", 375 | " }, \n", 376 | " \"created_time\": \"2015-07-20T01:25:01+0000\", \n", 377 | " \"id\": \"5281959998_10150628157724999\", \n", 378 | " \"likes\": {\n", 379 | " \"data\": [\n", 380 | " {\n", 381 | " \"id\": \"1001217933243627\", \n", 382 | " \"name\": \"Josh Smith\"\n", 383 | " }\n", 384 | " ], \n", 385 | " \"paging\": {\n", 386 | " \"cursors\": {\n", 387 | " \"after\": \"MTAwMTIxNzkzMzI0MzYyNw==\", \n", 388 | " \"before\": \"MTAwMTIxNzkzMzI0MzYyNw==\"\n", 389 | " }, \n", 390 | " \"next\": \"https://graph.facebook.com/v2.0/5281959998_10150628157724999/likes?limit=1&summary=true&access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&after=MTAwMTIxNzkzMzI0MzYyNw%3D%3D\"\n", 391 | " }, \n", 392 | " \"summary\": {\n", 393 | " \"total_count\": 278\n", 394 | " }\n", 395 | " }, \n", 396 | " \"link\": \"http://nyti.ms/1Jr6LhU\", \n", 397 | " \"message\": \"The planned megalopolis, a metropolitan area that would be about 6 times the size of New York\\u2019s, is meant to revamp northern China\\u2019s economy and become a laboratory for modern urban growth.\", \n", 398 | " \"name\": \"China Molds a Supercity Around Beijing, Promising to Change Lives\", \n", 399 | " \"shares\": {\n", 400 | " \"count\": 50\n", 401 | " }, \n", 402 | " \"type\": \"link\"\n", 403 | "}\n" 404 | ] 405 | } 406 | ], 407 | "source": [ 408 | "def getFacebookPageFeedData(page_id, access_token, num_statuses):\n", 409 | " \n", 410 | " # construct the URL string\n", 411 | " base = \"https://graph.facebook.com\"\n", 412 | " node = \"/\" + page_id + \"/feed\" \n", 413 | " parameters = \"/?fields=message,link,created_time,type,name,id,likes.limit(1).summary(true),comments.limit(1).summary(true),shares&limit=%s&access_token=%s\" % (num_statuses, access_token) # changed\n", 414 | " url = base + node + parameters\n", 415 | " \n", 416 | " # retrieve data\n", 417 | " data = json.loads(request_until_succeed(url))\n", 418 | " \n", 419 | " return data\n", 420 | " \n", 421 | "\n", 422 | "test_status = getFacebookPageFeedData(page_id, access_token, 1)[\"data\"][0]\n", 423 | "print json.dumps(test_status, indent=4, sort_keys=True)" 424 | ] 425 | }, 426 | { 427 | "cell_type": "markdown", 428 | "metadata": {}, 429 | "source": [ 430 | "Now that we have a sample 
Facebook page status, we can write a function to process each field individually." 431 | ] 432 | }, 433 | { 434 | "cell_type": "code", 435 | "execution_count": 8, 436 | "metadata": { 437 | "collapsed": false 438 | }, 439 | "outputs": [ 440 | { 441 | "name": "stdout", 442 | "output_type": "stream", 443 | "text": [ 444 | "(u'5281959998_10150628157724999', 'The planned megalopolis, a metropolitan area that would be about 6 times the size of New York\\xe2\\x80\\x99s, is meant to revamp northern China\\xe2\\x80\\x99s economy and become a laboratory for modern urban growth.', 'China Molds a Supercity Around Beijing, Promising to Change Lives', u'link', u'http://nyti.ms/1Jr6LhU', '2015-07-19 20:25:01', 278, 31, 50)\n" 445 | ] 446 | } 447 | ], 448 | "source": [ 449 | "def processFacebookPageFeedStatus(status):\n", 450 | " \n", 451 | " # The status is now a Python dictionary, so for top-level items,\n", 452 | " # we can simply call the key.\n", 453 | " \n", 454 | " # Additionally, some items may not always exist,\n", 455 | " # so must check for existence first\n", 456 | " \n", 457 | " status_id = status['id']\n", 458 | " status_message = '' if 'message' not in status.keys() else status['message'].encode('utf-8')\n", 459 | " link_name = '' if 'name' not in status.keys() else status['name'].encode('utf-8')\n", 460 | " status_type = status['type']\n", 461 | " status_link = '' if 'link' not in status.keys() else status['link']\n", 462 | " \n", 463 | " \n", 464 | " # Time needs special care since a) it's in UTC and\n", 465 | " # b) it's not easy to use in statistical programs.\n", 466 | " \n", 467 | " status_published = datetime.datetime.strptime(status['created_time'],'%Y-%m-%dT%H:%M:%S+0000')\n", 468 | " status_published = status_published + datetime.timedelta(hours=-5) # EST\n", 469 | " status_published = status_published.strftime('%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs\n", 470 | " \n", 471 | " # Nested items require chaining dictionary keys.\n", 472 | " \n", 473 | " num_likes = 0 if 'likes' not in status.keys() else status['likes']['summary']['total_count']\n", 474 | " num_comments = 0 if 'comments' not in status.keys() else status['comments']['summary']['total_count']\n", 475 | " num_shares = 0 if 'shares' not in status.keys() else status['shares']['count']\n", 476 | " \n", 477 | " # return a tuple of all processed data\n", 478 | " return (status_id, status_message, link_name, status_type, status_link,\n", 479 | " status_published, num_likes, num_comments, num_shares)\n", 480 | "\n", 481 | "processed_test_status = processFacebookPageFeedStatus(test_status)\n", 482 | "print processed_test_status" 483 | ] 484 | }, 485 | { 486 | "cell_type": "markdown", 487 | "metadata": {}, 488 | "source": [ 489 | "Surprisingly, we're almost done! Now we just need to:\n", 490 | "\n", 491 | "1. Query each page of Facebook Page Statuses (100 statuses per page) using `getFacebookPageFeedData`.\n", 492 | "2. Process all statuses on that page using `processFacebookPageFeedStatus` and writing the output to a CSV file.\n", 493 | "3. Navigate to the next page, and repeat until no more statuses\n", 494 | "\n", 495 | "This block implements both the writing to CSV and page navigation." 
496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 9, 501 | "metadata": { 502 | "collapsed": false, 503 | "scrolled": true 504 | }, 505 | "outputs": [ 506 | { 507 | "name": "stdout", 508 | "output_type": "stream", 509 | "text": [ 510 | "Scraping nytimes Facebook Page: 2015-07-19 18:36:33.051000\n", 511 | "\n", 512 | "1000 Statuses Processed: 2015-07-19 18:36:59.366000\n", 513 | "2000 Statuses Processed: 2015-07-19 18:37:28.289000\n", 514 | "3000 Statuses Processed: 2015-07-19 18:37:56.487000\n", 515 | "4000 Statuses Processed: 2015-07-19 18:38:30.355000\n", 516 | "5000 Statuses Processed: 2015-07-19 18:38:58.661000\n", 517 | "6000 Statuses Processed: 2015-07-19 18:39:26.990000\n", 518 | "7000 Statuses Processed: 2015-07-19 18:39:55.906000\n", 519 | "8000 Statuses Processed: 2015-07-19 18:40:20.628000\n", 520 | "9000 Statuses Processed: 2015-07-19 18:40:44.801000\n", 521 | "10000 Statuses Processed: 2015-07-19 18:41:11.759000\n", 522 | "11000 Statuses Processed: 2015-07-19 18:41:38.739000\n", 523 | "12000 Statuses Processed: 2015-07-19 18:42:05.562000\n", 524 | "13000 Statuses Processed: 2015-07-19 18:42:32.696000\n", 525 | "14000 Statuses Processed: 2015-07-19 18:42:59.939000\n", 526 | "15000 Statuses Processed: 2015-07-19 18:43:26.889000\n", 527 | "16000 Statuses Processed: 2015-07-19 18:43:53.106000\n", 528 | "17000 Statuses Processed: 2015-07-19 18:44:19.457000\n", 529 | "18000 Statuses Processed: 2015-07-19 18:44:45.637000\n", 530 | "19000 Statuses Processed: 2015-07-19 18:45:11.255000\n", 531 | "20000 Statuses Processed: 2015-07-19 18:45:34.447000\n", 532 | "21000 Statuses Processed: 2015-07-19 18:45:58.425000\n", 533 | "22000 Statuses Processed: 2015-07-19 18:46:23.920000\n", 534 | "23000 Statuses Processed: 2015-07-19 18:46:49.274000\n", 535 | "24000 Statuses Processed: 2015-07-19 18:47:15.616000\n", 536 | "25000 Statuses Processed: 2015-07-19 18:47:39.930000\n", 537 | "26000 Statuses Processed: 2015-07-19 18:48:08.076000\n", 538 | "HTTP Error 502: Error parsing server response\n", 539 | "Error for URL https://graph.facebook.com/v2.0/5281959998/feed?fields=message,link,created_time,type,name,id,likes.limit%281%29.summary%28true%29,comments.limit%281%29.summary%28true%29,shares&limit=100&__paging_token=enc_AdBLHCQ9lOKXuEx1TEXyLWs7FEQ8RN7yGjUH0LXbw5iUpDXvcZCUIXJa2ZC2s6sBHC8EyrGl6Oafb9OqZBgBFzmuRZB9&access_token=272535582777707|59e7ab31b01d3a5a90ec15a7a45a5e3b&until=1340213557: 2015-07-19 18:48:23.256000\n", 540 | "27000 Statuses Processed: 2015-07-19 18:48:38.748000\n", 541 | "28000 Statuses Processed: 2015-07-19 18:49:03.033000\n", 542 | "29000 Statuses Processed: 2015-07-19 18:49:26.957000\n", 543 | "30000 Statuses Processed: 2015-07-19 18:49:51.405000\n", 544 | "31000 Statuses Processed: 2015-07-19 18:50:15.830000\n", 545 | "32000 Statuses Processed: 2015-07-19 18:50:37.641000\n", 546 | "33000 Statuses Processed: 2015-07-19 18:50:57.574000\n", 547 | "\n", 548 | "Done!\n", 549 | "33296 Statuses Processed in 0:14:28.200000\n" 550 | ] 551 | } 552 | ], 553 | "source": [ 554 | "def scrapeFacebookPageFeedStatus(page_id, access_token):\n", 555 | " with open('%s_facebook_statuses.csv' % page_id, 'wb') as file:\n", 556 | " w = csv.writer(file)\n", 557 | " w.writerow([\"status_id\", \"status_message\", \"link_name\", \"status_type\", \"status_link\",\n", 558 | " \"status_published\", \"num_likes\", \"num_comments\", \"num_shares\"])\n", 559 | " \n", 560 | " has_next_page = True\n", 561 | " num_processed = 0 # keep a count on how many we've processed\n", 562 | " 
scrape_starttime = datetime.datetime.now()\n", 563 | " \n", 564 | " print \"Scraping %s Facebook Page: %s\\n\" % (page_id, scrape_starttime)\n", 565 | " \n", 566 | " statuses = getFacebookPageFeedData(page_id, access_token, 100)\n", 567 | " \n", 568 | " while has_next_page:\n", 569 | " for status in statuses['data']:\n", 570 | " w.writerow(processFacebookPageFeedStatus(status))\n", 571 | " \n", 572 | " # output progress occasionally to make sure code is not stalling\n", 573 | " num_processed += 1\n", 574 | " if num_processed % 1000 == 0:\n", 575 | " print \"%s Statuses Processed: %s\" % (num_processed, datetime.datetime.now())\n", 576 | " \n", 577 | " # if there is no next page, we're done.\n", 578 | " if 'paging' in statuses.keys():\n", 579 | " statuses = json.loads(request_until_succeed(statuses['paging']['next']))\n", 580 | " else:\n", 581 | " has_next_page = False\n", 582 | " \n", 583 | " \n", 584 | " print \"\\nDone!\\n%s Statuses Processed in %s\" % (num_processed, datetime.datetime.now() - scrape_starttime)\n", 585 | "\n", 586 | "\n", 587 | "scrapeFacebookPageFeedStatus(page_id, access_token)" 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "The CSV can be opened in all major statistical programs. Have fun! :)\n", 595 | "\n", 596 | "You can download the [NYTimes data here](https://dl.dropboxusercontent.com/u/2017402/nytimes_facebook_statuses.zip). [4.6MB]" 597 | ] 598 | } 599 | ], 600 | "metadata": { 601 | "kernelspec": { 602 | "display_name": "Python 2", 603 | "language": "python", 604 | "name": "python2" 605 | }, 606 | "language_info": { 607 | "codemirror_mode": { 608 | "name": "ipython", 609 | "version": 2 610 | }, 611 | "file_extension": ".py", 612 | "mimetype": "text/x-python", 613 | "name": "python", 614 | "nbconvert_exporter": "python", 615 | "pygments_lexer": "ipython2", 616 | "version": "2.7.8" 617 | } 618 | }, 619 | "nbformat": 4, 620 | "nbformat_minor": 0 621 | } 622 | -------------------------------------------------------------------------------- /examples/reaction-example-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/weijiekoh/scrape-fb-posts-and-comments/159675efe537b2080e77e231a6457807d4e81e84/examples/reaction-example-1.png -------------------------------------------------------------------------------- /examples/reaction-example-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/weijiekoh/scrape-fb-posts-and-comments/159675efe537b2080e77e231a6457807d4e81e84/examples/reaction-example-2.png -------------------------------------------------------------------------------- /examples/reaction-example-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/weijiekoh/scrape-fb-posts-and-comments/159675efe537b2080e77e231a6457807d4e81e84/examples/reaction-example-3.png -------------------------------------------------------------------------------- /examples/reaction_count_data_analysis_example.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Example of Processing Facebook Reaction Data\n", 8 | "\n", 9 | "by Max Woolf (@minimaxir)\n", 10 | "\n", 11 | "*This notebook is licensed under the MIT License. 
If you use the code or data visualization designs contained within this notebook, it would be greatly appreciated if proper attribution is given back to this notebook and/or myself. Thanks! :)*" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 34, 17 | "metadata": { 18 | "collapsed": false 19 | }, 20 | "outputs": [ 21 | { 22 | "data": { 23 | "text/plain": [ 24 | "R version 3.3.0 (2016-05-03)\n", 25 | "Platform: x86_64-apple-darwin13.4.0 (64-bit)\n", 26 | "Running under: OS X 10.11.4 (El Capitan)\n", 27 | "\n", 28 | "locale:\n", 29 | "[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8\n", 30 | "\n", 31 | "attached base packages:\n", 32 | "[1] grid stats graphics grDevices utils datasets methods \n", 33 | "[8] base \n", 34 | "\n", 35 | "other attached packages:\n", 36 | " [1] viridis_0.3.4 tidyr_0.4.1 stringr_1.0.0 digest_0.6.9 \n", 37 | " [5] RColorBrewer_1.1-2 scales_0.4.0 extrafont_0.17 ggplot2_2.1.0 \n", 38 | " [9] dplyr_0.4.3 readr_0.2.2 \n", 39 | "\n", 40 | "loaded via a namespace (and not attached):\n", 41 | " [1] Rcpp_0.12.4 Rttf2pt1_1.3.3 magrittr_1.5 munsell_0.4.3 \n", 42 | " [5] uuid_0.1-2 colorspace_1.2-6 R6_2.1.2 plyr_1.8.3 \n", 43 | " [9] tools_3.3.0 parallel_3.3.0 gtable_0.2.0 DBI_0.4 \n", 44 | "[13] extrafontdb_1.0 lazyeval_0.1.10 assertthat_0.1 gridExtra_2.2.1 \n", 45 | "[17] IRdisplay_0.3 repr_0.4 base64enc_0.1-3 IRkernel_0.5 \n", 46 | "[21] evaluate_0.9 rzmq_0.7.7 stringi_1.0-1 jsonlite_0.9.19 " 47 | ] 48 | }, 49 | "execution_count": 34, 50 | "metadata": {}, 51 | "output_type": "execute_result" 52 | } 53 | ], 54 | "source": [ 55 | "source(\"Rstart.R\")\n", 56 | "\n", 57 | "library(tidyr)\n", 58 | "library(viridis)\n", 59 | "\n", 60 | "sessionInfo()" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 3, 66 | "metadata": { 67 | "collapsed": false 68 | }, 69 | "outputs": [ 70 | { 71 | "name": "stdout", 72 | "output_type": "stream", 73 | "text": [ 74 | "Source: local data frame [6 x 15]\n", 75 | "\n", 76 | " status_id\n", 77 | " (chr)\n", 78 | "1 5550296508_10154919083226509\n", 79 | "2 5550296508_10154919005411509\n", 80 | "3 5550296508_10154918925156509\n", 81 | "4 5550296508_10154918906011509\n", 82 | "5 5550296508_10154918844706509\n", 83 | "6 5550296508_10154918803531509\n", 84 | "Variables not shown: status_message (chr), link_name (chr), status_type (chr),\n", 85 | " status_link (chr), status_published (time), num_reactions (int), num_comments\n", 86 | " (int), num_shares (int), num_likes (int), num_loves (int), num_wows (int),\n", 87 | " num_hahas (int), num_sads (int), num_angrys (int)\n" 88 | ] 89 | }, 90 | { 91 | "data": { 92 | "text/html": [ 93 | "4258" 94 | ], 95 | "text/latex": [ 96 | "4258" 97 | ], 98 | "text/markdown": [ 99 | "4258" 100 | ], 101 | "text/plain": [ 102 | "[1] 4258" 103 | ] 104 | }, 105 | "execution_count": 3, 106 | "metadata": {}, 107 | "output_type": "execute_result" 108 | } 109 | ], 110 | "source": [ 111 | "df <- read_csv(\"cnn_facebook_statuses.csv\") %>% filter(status_published > '2016-02-24 00:00:00')\n", 112 | "\n", 113 | "print(head(df))\n", 114 | "nrow(df)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 31, 120 | "metadata": { 121 | "collapsed": false 122 | }, 123 | "outputs": [ 124 | { 125 | "name": "stdout", 126 | "output_type": "stream", 127 | "text": [ 128 | "Source: local data frame [6 x 7]\n", 129 | "\n", 130 | " date total_likes total_loves total_wows total_hahas total_sads\n", 131 | " (date) (int) (int) (int) (int) (int)\n", 132 | "1 
2016-02-24 215784 12366 9699 6670 2699\n", 133 | "2 2016-02-25 183785 8280 4879 12300 2049\n", 134 | "3 2016-02-26 191436 6445 6141 14510 1874\n", 135 | "4 2016-02-27 144926 8828 2300 1004 1984\n", 136 | "5 2016-02-28 140882 6593 1627 3657 3654\n", 137 | "6 2016-02-29 286802 13716 4404 5899 4410\n", 138 | "Variables not shown: total_angrys (int)\n" 139 | ] 140 | } 141 | ], 142 | "source": [ 143 | "df_agg <- df %>% group_by(date = as.Date(substr(status_published, 1, 10))) %>%\n", 144 | " summarize(total_likes=sum(num_likes),\n", 145 | " total_loves=sum(num_loves),\n", 146 | " total_wows=sum(num_wows),\n", 147 | " total_hahas=sum(num_hahas),\n", 148 | " total_sads=sum(num_sads),\n", 149 | " total_angrys=sum(num_angrys)) %>%\n", 150 | " arrange(date)\n", 151 | "\n", 152 | "print(head(df_agg))" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "For ggplot, data must be converted to long format." 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 62, 165 | "metadata": { 166 | "collapsed": false 167 | }, 168 | "outputs": [ 169 | { 170 | "name": "stdout", 171 | "output_type": "stream", 172 | "text": [ 173 | "Source: local data frame [20 x 3]\n", 174 | "\n", 175 | " date reaction count\n", 176 | " (date) (fctr) (int)\n", 177 | "1 2016-02-24 total_likes 215784\n", 178 | "2 2016-02-25 total_likes 183785\n", 179 | "3 2016-02-26 total_likes 191436\n", 180 | "4 2016-02-27 total_likes 144926\n", 181 | "5 2016-02-28 total_likes 140882\n", 182 | "6 2016-02-29 total_likes 286802\n", 183 | "7 2016-03-01 total_likes 197091\n", 184 | "8 2016-03-02 total_likes 204942\n", 185 | "9 2016-03-03 total_likes 198320\n", 186 | "10 2016-03-04 total_likes 113997\n", 187 | "11 2016-03-05 total_likes 154004\n", 188 | "12 2016-03-06 total_likes 219300\n", 189 | "13 2016-03-07 total_likes 140551\n", 190 | "14 2016-03-08 total_likes 161067\n", 191 | "15 2016-03-09 total_likes 104399\n", 192 | "16 2016-03-10 total_likes 158898\n", 193 | "17 2016-03-11 total_likes 212756\n", 194 | "18 2016-03-12 total_likes 98536\n", 195 | "19 2016-03-13 total_likes 91079\n", 196 | "20 2016-03-14 total_likes 155147\n" 197 | ] 198 | } 199 | ], 200 | "source": [ 201 | "df_agg_long <- df_agg %>% gather(key=reaction, value=count, total_likes:total_angrys) %>%\n", 202 | " mutate(reaction=factor(reaction))\n", 203 | "\n", 204 | "print(head(df_agg_long,20))" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "Create a stacked area chart. 
(filled to 100%)" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 64, 217 | "metadata": { 218 | "collapsed": false 219 | }, 220 | "outputs": [], 221 | "source": [ 222 | "plot <- ggplot(df_agg_long, aes(x=date, y=count, color=reaction, fill=reaction)) +\n", 223 | " geom_bar(size=0.25, position=\"fill\", stat=\"identity\") +\n", 224 | " fte_theme() +\n", 225 | " scale_x_date(breaks = date_breaks(\"1 month\"), labels = date_format(\"%b %Y\")) +\n", 226 | " scale_y_continuous(labels=percent) +\n", 227 | " theme(legend.title = element_blank(),\n", 228 | " legend.position=\"top\",\n", 229 | " legend.direction=\"horizontal\",\n", 230 | " legend.key.width=unit(0.5, \"cm\"),\n", 231 | " legend.key.height=unit(0.25, \"cm\"),\n", 232 | " legend.margin=unit(0,\"cm\")) +\n", 233 | " scale_color_viridis(discrete=T) +\n", 234 | " scale_fill_viridis(discrete=T) +\n", 235 | " labs(title=\"Daily Breakdown of Facebook Reactions on CNN's FB Posts\",\n", 236 | " x=\"Date Status Posted\",\n", 237 | " y=\"% Reaction Marketshare\")\n", 238 | "\n", 239 | "max_save(plot, \"reaction-example-1\", \"Facebook\")" 240 | ] 241 | }, 242 | { 243 | "cell_type": "markdown", 244 | "metadata": {}, 245 | "source": [ 246 | "![](reaction-example-1.png)\n", 247 | "\n", 248 | "The Likes reaction skews things. Run plot without it." 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 65, 254 | "metadata": { 255 | "collapsed": false 256 | }, 257 | "outputs": [], 258 | "source": [ 259 | "plot <- ggplot(df_agg_long %>% filter(reaction!=\"total_likes\"), aes(x=date, y=count, color=reaction, fill=reaction)) +\n", 260 | " geom_bar(size=0.25, position=\"fill\", stat=\"identity\") +\n", 261 | " fte_theme() +\n", 262 | " scale_x_date(breaks = date_breaks(\"1 month\"), labels = date_format(\"%b %Y\")) +\n", 263 | " scale_y_continuous(labels=percent) +\n", 264 | " theme(legend.title = element_blank(),\n", 265 | " legend.position=\"top\",\n", 266 | " legend.direction=\"horizontal\",\n", 267 | " legend.key.width=unit(0.5, \"cm\"),\n", 268 | " legend.key.height=unit(0.25, \"cm\"),\n", 269 | " legend.margin=unit(0,\"cm\")) +\n", 270 | " scale_color_viridis(discrete=T) +\n", 271 | " scale_fill_viridis(discrete=T) +\n", 272 | " labs(title=\"Daily Breakdown of Facebook Reactions on CNN's FB Posts\",\n", 273 | " x=\"Date Status Posted\",\n", 274 | " y=\"% Reaction Marketshare\")\n", 275 | "\n", 276 | "max_save(plot, \"reaction-example-2\", \"Facebook\")" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "![](reaction-example-2.png)" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "That visualization might be too crowded: use percent-wise calculations instead, and switch data to NYTimes for comparison." 
291 | ] 292 | }, 293 | { 294 | "cell_type": "code", 295 | "execution_count": 76, 296 | "metadata": { 297 | "collapsed": false, 298 | "scrolled": false 299 | }, 300 | "outputs": [ 301 | { 302 | "name": "stdout", 303 | "output_type": "stream", 304 | "text": [ 305 | "Source: local data frame [6 x 6]\n", 306 | "\n", 307 | " date perc_loves perc_wows perc_hahas perc_sads perc_angrys\n", 308 | " (date) (dbl) (dbl) (dbl) (dbl) (dbl)\n", 309 | "1 2016-02-24 0.3930676 0.17360566 0.08621367 0.09740770 0.24970542\n", 310 | "2 2016-02-25 0.1919722 0.08666052 0.29210694 0.09332671 0.33593362\n", 311 | "3 2016-02-26 0.1435334 0.18946182 0.10831220 0.17396450 0.38472809\n", 312 | "4 2016-02-27 0.2736496 0.13627639 0.06443652 0.27570606 0.24993145\n", 313 | "5 2016-02-28 0.7713515 0.08522014 0.04054117 0.03737970 0.06550746\n", 314 | "6 2016-02-29 0.3399680 0.08842370 0.12708762 0.11256005 0.33196065\n" 315 | ] 316 | } 317 | ], 318 | "source": [ 319 | "df <- read_csv(\"nytimes_facebook_statuses.csv\") %>% filter(status_published > '2016-02-24 00:00:00')\n", 320 | "\n", 321 | "df_agg <- df %>% group_by(date = as.Date(substr(status_published, 1, 10))) %>%\n", 322 | " summarize(total_reactions=sum(num_loves)+sum(num_wows)+sum(num_hahas)+sum(num_sads)+sum(num_angrys),\n", 323 | " perc_loves=sum(num_loves)/total_reactions,\n", 324 | " perc_wows=sum(num_wows)/total_reactions,\n", 325 | " perc_hahas=sum(num_hahas)/total_reactions,\n", 326 | " perc_sads=sum(num_sads)/total_reactions,\n", 327 | " perc_angrys=sum(num_angrys)/total_reactions) %>%\n", 328 | " select(-total_reactions) %>%\n", 329 | " arrange(date)\n", 330 | "\n", 331 | "print(head(df_agg))" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 77, 337 | "metadata": { 338 | "collapsed": false 339 | }, 340 | "outputs": [ 341 | { 342 | "name": "stdout", 343 | "output_type": "stream", 344 | "text": [ 345 | "Source: local data frame [20 x 3]\n", 346 | "\n", 347 | " date reaction count\n", 348 | " (date) (fctr) (dbl)\n", 349 | "1 2016-02-24 perc_loves 0.39306756\n", 350 | "2 2016-02-25 perc_loves 0.19197220\n", 351 | "3 2016-02-26 perc_loves 0.14353339\n", 352 | "4 2016-02-27 perc_loves 0.27364957\n", 353 | "5 2016-02-28 perc_loves 0.77135153\n", 354 | "6 2016-02-29 perc_loves 0.33996797\n", 355 | "7 2016-03-01 perc_loves 0.34061714\n", 356 | "8 2016-03-02 perc_loves 0.24681208\n", 357 | "9 2016-03-03 perc_loves 0.35172992\n", 358 | "10 2016-03-04 perc_loves 0.19499779\n", 359 | "11 2016-03-05 perc_loves 0.14512737\n", 360 | "12 2016-03-06 perc_loves 0.40097144\n", 361 | "13 2016-03-07 perc_loves 0.30259557\n", 362 | "14 2016-03-08 perc_loves 0.36623147\n", 363 | "15 2016-03-09 perc_loves 0.21422640\n", 364 | "16 2016-03-10 perc_loves 0.31396083\n", 365 | "17 2016-03-11 perc_loves 0.33173516\n", 366 | "18 2016-03-12 perc_loves 0.06377902\n", 367 | "19 2016-03-13 perc_loves 0.25712914\n", 368 | "20 2016-03-14 perc_loves 0.33751152\n" 369 | ] 370 | } 371 | ], 372 | "source": [ 373 | "df_agg_long <- df_agg %>% gather(key=reaction, value=count, perc_loves:perc_angrys) %>%\n", 374 | " mutate(reaction=factor(reaction))\n", 375 | "\n", 376 | "print(head(df_agg_long,20))" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 78, 382 | "metadata": { 383 | "collapsed": false 384 | }, 385 | "outputs": [], 386 | "source": [ 387 | "plot <- ggplot(df_agg_long, aes(x=date, y=count, color=reaction)) +\n", 388 | " geom_line(size=0.5, stat=\"identity\") +\n", 389 | " fte_theme() +\n", 390 | " scale_x_date(breaks = date_breaks(\"1 
month\"), labels = date_format(\"%b %Y\")) +\n", 391 | " scale_y_continuous(labels=percent) +\n", 392 | " theme(legend.title = element_blank(),\n", 393 | " legend.position=\"top\",\n", 394 | " legend.direction=\"horizontal\",\n", 395 | " legend.key.width=unit(0.5, \"cm\"),\n", 396 | " legend.key.height=unit(0.25, \"cm\"),\n", 397 | " legend.margin=unit(0,\"cm\")) +\n", 398 | " scale_color_viridis(discrete=T) +\n", 399 | " scale_fill_viridis(discrete=T) +\n", 400 | " labs(title=\"Daily Breakdown of Facebook Reactions on NYTimes's FB Posts\",\n", 401 | " x=\"Date Status Posted\",\n", 402 | " y=\"% Reaction Marketshare\")\n", 403 | "\n", 404 | "max_save(plot, \"reaction-example-3\", \"Facebook\")" 405 | ] 406 | }, 407 | { 408 | "cell_type": "markdown", 409 | "metadata": {}, 410 | "source": [ 411 | "![](reaction-example-3.png)" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "# The MIT License (MIT)\n", 419 | "\n", 420 | "Copyright (c) 2016 Max Woolf\n", 421 | "\n", 422 | "Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n", 423 | "\n", 424 | "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n", 425 | "\n", 426 | "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE." 
427 | ] 428 | } 429 | ], 430 | "metadata": { 431 | "kernelspec": { 432 | "display_name": "R", 433 | "language": "R", 434 | "name": "ir" 435 | }, 436 | "language_info": { 437 | "codemirror_mode": "r", 438 | "file_extension": ".r", 439 | "mimetype": "text/x-r-source", 440 | "name": "R", 441 | "pygments_lexer": "r", 442 | "version": "3.3.0" 443 | } 444 | }, 445 | "nbformat": 4, 446 | "nbformat_minor": 0 447 | } 448 | -------------------------------------------------------------------------------- /facebook_scrape.py: -------------------------------------------------------------------------------- 1 | import urllib.request, urllib.error, urllib.parse 2 | import json 3 | import datetime 4 | import csv 5 | import time 6 | 7 | 8 | def request_until_succeed(url, return_none_if_400=False): 9 | req = urllib.request.Request(url) 10 | success = False 11 | while success is False: 12 | try: 13 | response = urllib.request.urlopen(req) 14 | if response.getcode() == 200: 15 | success = True 16 | except Exception as e: 17 | print(e) 18 | time.sleep(5) 19 | 20 | print("Error for URL %s: %s" % (url, datetime.datetime.now())) 21 | print("Retrying...") 22 | 23 | if return_none_if_400: 24 | if '400' in str(e): 25 | return None; 26 | 27 | return response.read().decode() 28 | 29 | 30 | def unicode_normalize(text): 31 | # Convert fancy quote chars and non-breaking spaces 32 | return text.translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 33 | 0xa0:0x20 }).encode('utf-8') 34 | 35 | 36 | def get_comment_feed_data(status_id, access_token, num_comments): 37 | # Construct the URL string 38 | base = "https://graph.facebook.com/v2.6" 39 | node = "/%s/comments" % status_id 40 | fields = "?fields=id,message,like_count,created_time,comments,from,attachment" 41 | parameters = "&order=chronological&limit=%s&access_token=%s" % \ 42 | (num_comments, access_token) 43 | url = base + node + fields + parameters 44 | 45 | # retrieve data 46 | data = request_until_succeed(url, return_none_if_400=True) 47 | if data is None: 48 | return None 49 | else: 50 | return json.loads(data) 51 | 52 | def process_comment(comment, status_id, scrape_author_id, parent_id = ''): 53 | # The status is now a Python dictionary, so for top-level items, 54 | # we can simply call the key. 55 | 56 | # Additionally, some items may not always exist, 57 | # so must check for existence first 58 | 59 | comment_id = comment['id'] 60 | comment_message = '' if 'message' not in comment else \ 61 | unicode_normalize(comment['message']) 62 | comment_author = unicode_normalize(comment['from']['name']) 63 | 64 | comment_author_id = "None" 65 | if "id" in comment["from"]: 66 | comment_author_id = unicode_normalize(comment['from']['id']) 67 | 68 | comment_likes = 0 if 'like_count' not in comment else \ 69 | comment['like_count'] 70 | 71 | if 'attachment' in comment: 72 | attach_tag = "[[%s]]" % comment['attachment']['type'].upper() 73 | comment_message = attach_tag if comment_message is '' else \ 74 | (comment_message.decode("utf-8") + " " + \ 75 | attach_tag).encode("utf-8") 76 | 77 | # Time needs special care since a) it's in UTC and 78 | # b) it's not easy to use in statistical programs. 
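    # Note: the Graph API returns ISO-8601 UTC timestamps such as
    # '2016-02-24T18:30:00+0000'; the fixed -5 hour shift below approximates
    # US Eastern time but ignores daylight saving. For example:
    #   datetime.datetime.strptime('2016-02-24T18:30:00+0000',
    #                              '%Y-%m-%dT%H:%M:%S+0000')
    #   + datetime.timedelta(hours=-5)        # -> 2016-02-24 13:30:00
    # A DST-aware sketch (assuming Python 3.9+ with the stdlib zoneinfo
    # module) would instead do something like:
    #   utc_dt.replace(tzinfo=datetime.timezone.utc) \
    #         .astimezone(zoneinfo.ZoneInfo('America/New_York'))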
79 | 80 | comment_published = datetime.datetime.strptime( 81 | comment['created_time'],'%Y-%m-%dT%H:%M:%S+0000') 82 | comment_published = comment_published + datetime.timedelta(hours=-5) # EST 83 | comment_published = comment_published.strftime( 84 | '%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs 85 | 86 | # Return a tuple of all processed data 87 | 88 | if scrape_author_id: 89 | return (comment_id, status_id, parent_id, comment_message, 90 | comment_author, comment_author_id, 91 | comment_published, comment_likes) 92 | else: 93 | return (comment_id, status_id, parent_id, comment_message, 94 | comment_author, 95 | comment_published, comment_likes) 96 | 97 | def scrape_comments(page_or_group_id, app_id, app_secret, 98 | posts_input_file, output_filename, scrape_author_id): 99 | 100 | access_token = app_id + "|" + app_secret 101 | 102 | with open(output_filename, 'w') as file: 103 | w = csv.writer(file) 104 | if scrape_author_id: 105 | w.writerow(["comment_id", "status_id", "parent_id", "comment_message", 106 | "comment_author", "comment_author_id", 107 | "comment_published", "comment_likes"]) 108 | else: 109 | w.writerow(["comment_id", "status_id", "parent_id", "comment_message", 110 | "comment_author", 111 | "comment_published", "comment_likes"]) 112 | 113 | num_processed = 0 # keep a count on how many we've processed 114 | scrape_starttime = datetime.datetime.now() 115 | 116 | print("Scraping %s Comments From Posts: %s\n" % \ 117 | (posts_input_file, scrape_starttime)) 118 | 119 | with open(posts_input_file, 'r') as csvfile: 120 | reader = csv.DictReader(csvfile) 121 | 122 | for status in reader: 123 | has_next_page = True 124 | 125 | comments = get_comment_feed_data(status['status_id'], 126 | access_token, 100) 127 | 128 | while has_next_page and comments is not None: 129 | for comment in comments['data']: 130 | w.writerow(process_comment(comment, 131 | status['status_id'], scrape_author_id)) 132 | 133 | if 'comments' in comment: 134 | has_next_subpage = True 135 | 136 | subcomments = get_comment_feed_data( 137 | comment['id'], access_token, 100) 138 | 139 | while has_next_subpage: 140 | for subcomment in subcomments['data']: 141 | w.writerow(process_comment( subcomment, 142 | status['status_id'], 143 | scrape_author_id, 144 | comment['id'])) 145 | 146 | num_processed += 1 147 | if num_processed % 1000 == 0: 148 | print("%s Comments Processed: %s" % \ 149 | (num_processed, 150 | datetime.datetime.now())) 151 | 152 | if 'paging' in subcomments: 153 | if 'next' in subcomments['paging']: 154 | subcomments = json.loads( 155 | request_until_succeed(\ 156 | subcomments\ 157 | ['paging']['next'], 158 | return_none_if_400=True)) 159 | else: 160 | has_next_subpage = False 161 | else: 162 | has_next_subpage = False 163 | 164 | # output progress occasionally to make sure code is not 165 | # stalling 166 | num_processed += 1 167 | if num_processed % 1000 == 0: 168 | print("%s Comments Processed: %s" % \ 169 | (num_processed, datetime.datetime.now())) 170 | 171 | if 'paging' in comments: 172 | if 'next' in comments['paging']: 173 | comments = json.loads(request_until_succeed(\ 174 | comments['paging']['next'], 175 | return_none_if_400=True)) 176 | else: 177 | has_next_page = False 178 | else: 179 | has_next_page = False 180 | 181 | 182 | print("\nDone!\n%s Comments Processed in %s" % \ 183 | (num_processed, datetime.datetime.now() - scrape_starttime)) 184 | 185 | 186 | def get_status_reactions(status_id, access_token): 187 | # See http://stackoverflow.com/a/37239851 for Reactions 
parameters
188 |     # Reactions are only accessible at a single-post endpoint
189 | 
190 |     base = "https://graph.facebook.com/v2.6"
191 |     node = "/%s" % status_id
192 |     reactions = "/?fields=" \
193 |         "reactions.type(LIKE).limit(0).summary(total_count).as(like)" \
194 |         ",reactions.type(LOVE).limit(0).summary(total_count).as(love)" \
195 |         ",reactions.type(WOW).limit(0).summary(total_count).as(wow)" \
196 |         ",reactions.type(HAHA).limit(0).summary(total_count).as(haha)" \
197 |         ",reactions.type(SAD).limit(0).summary(total_count).as(sad)" \
198 |         ",reactions.type(ANGRY).limit(0).summary(total_count).as(angry)"
199 |     parameters = "&access_token=%s" % access_token
200 |     url = base + node + reactions + parameters
201 | 
202 |     # retrieve data
203 |     data = json.loads(request_until_succeed(url))
204 | 
205 |     return data
206 | 
207 | 
208 | def process_post(status, type_pg, access_token):
209 |     # The status is now a Python dictionary, so for top-level items,
210 |     # we can simply call the key.
211 | 
212 |     # Additionally, some items may not always exist,
213 |     # so we must check for existence first
214 | 
215 |     status_id = status['id']
216 |     status_message = '' if 'message' not in list(status.keys()) else \
217 |         unicode_normalize(status['message'])
218 |     link_name = '' if 'name' not in list(status.keys()) else \
219 |         unicode_normalize(status['name'])
220 |     status_type = status['type']
221 |     status_link = '' if 'link' not in list(status.keys()) else \
222 |         unicode_normalize(status['link'])
223 | 
224 |     status_author = None
225 |     if type_pg == "group":
226 |         status_author = unicode_normalize(status['from']['name'])
227 | 
228 |     # Time needs special care since a) it's in UTC and
229 |     # b) it's not easy to use in statistical programs.
230 | 
231 |     status_published = datetime.datetime.strptime(
232 |         status['created_time'],'%Y-%m-%dT%H:%M:%S+0000')
233 |     status_published = status_published + \
234 |         datetime.timedelta(hours=-5) # EST
235 |     # best time format for spreadsheet programs
236 |     status_published = status_published.strftime('%Y-%m-%d %H:%M:%S')
237 | 
238 |     # Nested items require chaining dictionary keys.
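    # Note: with the "limit(0).summary(true)" field modifiers requested in
    # get_feed_data(), each status carries nested summary objects roughly of
    # the form:
    #   {'reactions': {'data': [], 'summary': {'total_count': 123}},
    #    'comments':  {'data': [], 'summary': {'total_count': 45}},
    #    'shares':    {'count': 6}}
    # so the counts below are read by chaining keys, defaulting to 0 whenever
    # a field is absent (e.g. posts with no shares omit 'shares' entirely).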
239 | 240 | num_reactions = 0 if 'reactions' not in status else \ 241 | status['reactions']['summary']['total_count'] 242 | num_comments = 0 if 'comments' not in status else \ 243 | status['comments']['summary']['total_count'] 244 | num_shares = 0 if 'shares' not in status else \ 245 | status['shares']['count'] 246 | 247 | # Counts of each reaction separately; good for sentiment 248 | # Only check for reactions if past date of implementation: 249 | # http://newsroom.fb.com/news/2016/02/reactions-now-available-globally/ 250 | 251 | reactions = get_status_reactions(status_id, access_token) if \ 252 | status_published > '2016-02-24 00:00:00' else {} 253 | 254 | num_likes = 0 if 'like' not in reactions else \ 255 | reactions['like']['summary']['total_count'] 256 | 257 | # Special case: Set number of Likes to Number of reactions for pre-reaction 258 | # statuses 259 | 260 | num_likes = num_reactions if status_published < '2016-02-24 00:00:00' else \ 261 | num_likes 262 | 263 | def get_num_total_reactions(reaction_type, reactions): 264 | if reaction_type not in reactions: 265 | return 0 266 | else: 267 | return reactions[reaction_type]['summary']['total_count'] 268 | 269 | num_loves = get_num_total_reactions('love', reactions) 270 | num_wows = get_num_total_reactions('wow', reactions) 271 | num_hahas = get_num_total_reactions('haha', reactions) 272 | num_sads = get_num_total_reactions('sad', reactions) 273 | num_angrys = get_num_total_reactions('angry', reactions) 274 | 275 | # Return a tuple of all processed data. 276 | if type_pg == "group": 277 | # status_author only applies for groups 278 | return (status_id, status_message, status_author, 279 | link_name, status_type, status_link, status_published, 280 | num_reactions, num_comments, num_shares, num_likes, num_loves, 281 | num_wows, num_hahas, num_sads, num_angrys) 282 | elif type_pg == "page": 283 | return (status_id, status_message, 284 | link_name, status_type, status_link, status_published, 285 | num_reactions, num_comments, num_shares, num_likes, num_loves, 286 | num_wows, num_hahas, num_sads, num_angrys) 287 | 288 | 289 | def get_feed_data(page_or_group_id, type_pg, access_token, num_statuses): 290 | # Construct the URL string; see http://stackoverflow.com/a/37239851 for 291 | # Reactions parameters 292 | 293 | # the node field varies depending on whether we're scraping a page or a 294 | # group. 295 | posts_or_feed = str() 296 | 297 | base = "https://graph.facebook.com/v2.6" 298 | 299 | node = None 300 | fields = None 301 | if type_pg == "page": 302 | node = "/%s/posts" % page_or_group_id 303 | fields = "/?fields=message,link,created_time,type,name,id," + \ 304 | "comments.limit(0).summary(true),shares,reactions" + \ 305 | ".limit(0).summary(true)" 306 | elif type_pg == "group": 307 | node = "/%s/feed" % page_or_group_id 308 | fields = "/?fields=message,link,created_time,type,name,id," + \ 309 | "comments.limit(0).summary(true),shares,reactions." 
+ \ 310 | "limit(0).summary(true),from" 311 | 312 | parameters = "&limit=%s&access_token=%s" % (num_statuses, access_token) 313 | url = base + node + fields + parameters 314 | 315 | # retrieve data 316 | data = json.loads(request_until_succeed(url)) 317 | 318 | return data 319 | 320 | 321 | def scrape_posts(page_or_group_id, type_pg, app_id, app_secret, output_filename): 322 | # Make sure that the type_pg argument is either "page" or "group 323 | is_page = type_pg == "page" 324 | is_group = type_pg == "group" 325 | 326 | assert (is_group or is_page), "type_pg must be either 'page' or 'group'" 327 | 328 | access_token = app_id + "|" + app_secret 329 | 330 | with open(output_filename, 'w') as file: 331 | w = csv.writer(file) 332 | if type_pg == "page": 333 | w.writerow(["status_id", "status_message", 334 | "link_name", "status_type", "status_link", "status_published", 335 | "num_reactions", "num_comments", "num_shares", "num_likes", 336 | "num_loves", "num_wows", "num_hahas", "num_sads", 337 | "num_angrys"]) 338 | elif type_pg == "group": 339 | # status_author only applies for groups 340 | w.writerow(["status_id", "status_message", "status_author", 341 | "link_name", "status_type", "status_link", "status_published", 342 | "num_reactions", "num_comments", "num_shares", "num_likes", 343 | "num_loves", "num_wows", "num_hahas", "num_sads", 344 | "num_angrys"]) 345 | 346 | has_next_page = True 347 | num_processed = 0 # keep a count on how many we've processed 348 | scrape_starttime = datetime.datetime.now() 349 | 350 | print("Scraping %s Facebook %s: %s\n" % (page_or_group_id, type_pg, scrape_starttime)) 351 | 352 | statuses = get_feed_data(page_or_group_id, type_pg, access_token, 100) 353 | 354 | while has_next_page: 355 | for status in statuses['data']: 356 | 357 | # Ensure it is a status with the expected metadata 358 | if 'reactions' in status: 359 | w.writerow(process_post(status, type_pg, access_token)) 360 | 361 | # output progress occasionally to make sure code is not 362 | # stalling 363 | num_processed += 1 364 | if num_processed % 100 == 0: 365 | print("%s Statuses Processed: %s" % \ 366 | (num_processed, datetime.datetime.now())) 367 | 368 | # if there is no next page, we're done. 
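        # Note: Graph API feed responses are cursor-paginated; each page of
        # results looks roughly like:
        #   {'data': [ ...status objects... ],
        #    'paging': {'next': 'https://graph.facebook.com/v2.6/...', ...}}
        # The final page simply omits 'paging' (or its 'next' URL), which is
        # what the checks below rely on to terminate the loop.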
369 | if 'paging' in list(statuses.keys()): 370 | 371 | if not 'next' in statuses['paging']: 372 | has_next_page = False 373 | else: 374 | statuses = json.loads(request_until_succeed(statuses['paging']['next'])) 375 | else: 376 | has_next_page = False 377 | 378 | 379 | print("\nDone!\n%s Statuses Processed in %s" % \ 380 | (num_processed, datetime.datetime.now() - scrape_starttime)) 381 | -------------------------------------------------------------------------------- /run.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | import argparse 4 | import re 5 | import sys 6 | 7 | import facebook_scrape 8 | 9 | 10 | def scrape_group_posts(group_id, app_id, app_secret, output_filename): 11 | facebook_scrape.scrape_posts(group_id, "group", app_id, app_secret, 12 | output_filename) 13 | 14 | 15 | def scrape_page_posts(page_id, app_id, app_secret, output_filename): 16 | facebook_scrape.scrape_posts(page_id, "page", app_id, app_secret, 17 | output_filename) 18 | 19 | 20 | def scrape_comments(page_id, app_id, app_secret, input_filename, 21 | output_filename, scrape_author_id): 22 | facebook_scrape.scrape_comments(page_id, app_id, app_secret, 23 | input_filename, output_filename, 24 | scrape_author_id) 25 | 26 | 27 | if __name__ == "__main__": 28 | parser = argparse.ArgumentParser(description="Scraper for *all* " + \ 29 | "posts, reactions, and (optionally) comments on a public " + \ 30 | "Facebook group or page.") 31 | 32 | group = parser.add_mutually_exclusive_group(required=True) 33 | 34 | group.add_argument("--group", metavar="Public group ID", type=str, 35 | help="The ID of the open/public Facebook group you want to " + \ 36 | "scrape.") 37 | 38 | group.add_argument("--page", metavar="Public page ID", type=str, 39 | help="The ID of the Facebook page you want to scrape.") 40 | 41 | parser.add_argument("--cred", metavar="Credential file", type=str, 42 | required=True, 43 | help="Path to a secret credentials file containing your app " + \ 44 | "ID and app secret. See README.md for the " + \ 45 | "credential file format.") 46 | 47 | parser.add_argument("--posts-output", metavar="Output CSV file for posts", 48 | type=str, required=True, 49 | help="Path to where you want the output CSV file to be") 50 | 51 | parser.add_argument("--scrape-comments", action="store_true", 52 | required=False, help="Scrape comments as well as posts.") 53 | 54 | parser.add_argument("--comments-output", metavar="Output CSV file for " + \ 55 | "comments", 56 | type=str, required=False, 57 | help="Path to where you want the output CSV file for comments " + \ 58 | "to be") 59 | 60 | parser.add_argument("--scrape-author-id", action="store_true", 61 | required=False, help="Scrape comment authors' Facebook IDs") 62 | 63 | parser.add_argument("--use-existing-posts-csv", action="store_true", 64 | required=False, help="Scrape comments from an existing " + \ 65 | "status/post CSV. 
Specify it using the --posts-output argument.")
66 | 
67 |     args = parser.parse_args()
68 | 
69 |     if args.scrape_comments and args.comments_output is None:
70 |         parser.error("Please specify an output CSV file for comments")
71 | 
72 |     # get credentials
73 |     app_id = app_secret = ""
74 |     with open(args.cred) as cred_file:
75 |         def _get_v(s):
76 |             pattern = r"^.+?\s*=\s*[\"']?(.+?)[\"']?$"
77 |             found = re.findall(pattern, s.strip())
78 |             return found[0] if found else ""
79 | 
80 |         for line in cred_file:
81 |             if line.startswith("app_id"):
82 |                 app_id = _get_v(line)
83 |             elif line.startswith("app_secret"):
84 |                 app_secret = _get_v(line)
85 | 
86 |     if not (app_id and app_secret):
87 |         print("Error: incorrect credential file format.")
88 |         print()
89 |         print("Please provide a credential file in the correct format.")
90 |         print("It should look something like this:")
91 |         print()
92 |         print("app_id = \"111111111111111\"")
93 |         print("app_secret = \"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\"")
94 |         print()
95 |         sys.exit(1)
96 | 
97 |     if args.group: # if user wants to scrape a group
98 |         if not args.use_existing_posts_csv:
99 |             scrape_group_posts(args.group, app_id, app_secret, args.posts_output)
100 |         if args.scrape_comments: # if user wants to scrape comments too
101 |             scrape_comments(args.group, app_id, app_secret, args.posts_output,
102 |                             args.comments_output, args.scrape_author_id)
103 | 
104 |     elif args.page: # if user wants to scrape a page
105 |         if not args.use_existing_posts_csv:
106 |             scrape_page_posts(args.page, app_id, app_secret, args.posts_output)
107 |         if args.scrape_comments: # if user wants to scrape comments too
108 |             scrape_comments(args.page, app_id, app_secret, args.posts_output,
109 |                             args.comments_output, args.scrape_author_id)
110 | 
--------------------------------------------------------------------------------