├── README.md
├── facebook_output
└── .gitignore
├── get_fb_data.py
├── res
├── page_handle_location.png
├── sample_output_owned_posts.png
├── sample_run.gif
└── the_matrix.png
└── social_elastic.py
/README.md:
--------------------------------------------------------------------------------
1 | # Facebook Scraper For Multiple Pages
2 |
3 | Scrape **multiple** public Facebook pages en masse to gather social analytics. Automatically fetch more detailed insights for owned pages' posts and videos to see what content resonates.
4 | Go all the way back in time or specify two dates and watch the goodness happen at lightning speed via multi-threading!
5 |
6 | 
7 |
8 | ## Distinguishing Features
9 |
10 | - **Multi-threaded** for rapid data collection from **multiple pages _simultaneously_** (as many as you want!)
11 | - Collect detailed performance metrics on multiple owned business pages **automatically** via the [insights](https://developers.facebook.com/docs/graph-api/reference/v2.11/insights) and [video insights](https://developers.facebook.com/docs/graph-api/reference/video/video_insights/) endpoints
12 | - Retrieve the number of public shares of a link by *anyone* across Facebook via the [URL Object](https://developers.facebook.com/docs/graph-api/reference/v2.11/url/)
13 | - **Custom** metrics computed:
14 | - *Impression Rate Non-Likers (%)*: explore **virality** of your posts outside your typical audience
15 | - *Engagement Rate*: (Shares + Reactions + Comments) / Total Unique Impressions
16 | - *Adjusted Engagement Rate (%)* and *Adjusted CTR (%)*: normalise rates across pages with different audience sizes and account for the uncertainty of small samples, e.g. a 5/10 CTR ranks below a 100/200 CTR, as detailed by [Evan Miller](http://www.evanmiller.org/how-not-to-sort-by-average-rating.html) (see the sketch below)
17 | - Proper timezone handling
18 |
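To illustrate the adjusted metrics, here is a minimal sketch of the Wilson score lower bound that `ci_lower_bound` in `get_fb_data.py` implements (the numbers are illustrative):

```python
import math
from scipy.stats import norm

def ci_lower_bound(pos, n, confidence=0.95):
    # Lower bound of the Wilson score interval for pos successes out of n trials
    if n == 0 or pos > n:
        return 0
    z = norm.ppf(1 - (1 - confidence) / 2)
    phat = float(pos) / n
    return (phat + z*z/(2*n) - z*math.sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)

print(ci_lower_bound(5, 10))     # ~0.24 -> Adjusted CTR ~24%
print(ci_lower_bound(100, 200))  # ~0.43 -> Adjusted CTR ~43%, ranked higher despite the same raw CTR
```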
19 | 
20 |
21 | ## What can be collected from public page posts?
22 |
23 | Post ID, Publish Date, Post Type, Headline, Shares, Reactions, Comments, Caption, Link
24 |
25 | ... and optionally with a performance cost:
26 | Public Shares, Likes, Loves, Wows, Hahas, Sads, Angrys
27 |
28 | ## What is *additionally* collected from owned pages?
29 |
30 | **Posts**
31 |
32 | Video Views, Unique Impressions, Impression Rate Non-Likers (%), Unique Link Clicks, CTR (%), Adjusted CTR (%), Engagement Rate (%), Adjusted Engagement Rate (%), Hide Rate (%), Hide Clicks, Hide All Clicks, Paid Unique Impressions, Organic Unique Impressions
33 |
34 | **Videos**
35 |
36 | Live Video, Crossposted Video, 3s Views, 10s Views, Complete Views, Total Paid Views, 10s/3s Views (%), Complete/3s Views (%), Impressions, Impression Rate Non-Likers (%), Avg View Time
37 |
38 |
39 | ## Setup
40 |
41 | **1)** Add the page names you want to scrape inside `PAGE_IDS_TO_SCRAPE`
42 |
43 | Grab the page's @handle or the name in its URL (e.g. 'vicenews' below).
44 |
45 |
46 | 
47 |
48 | **2)** Grab your own *temporary* user token [here](https://developers.facebook.com/tools/explorer) and place inside `OWNED_PAGES_TOKENS`:
49 | **Get Token -> Get User Token -> Get Access Token**
50 |
51 | `OWNED_PAGES_TOKENS` is the dictionary of token(s) used to scrape public data. If a token is a [**permanent token**](https://stackoverflow.com/a/28418469) for a business page you own, it is also used to scrape that page's private insights, provided the page is listed in `PAGE_IDS_TO_SCRAPE` and its key in this dictionary matches the page handle.
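For example, in `get_fb_data.py` (the page name and token below are placeholders):

```python
OWNED_PAGES_TOKENS = {
    'MyPage1': os.environ['MY_TOKEN'],  # export MY_TOKEN='abc-my-token'
    # 'MyPage2': 'my-hardcoded-token'
}
```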
52 |
53 | **3)** Install Python dependencies with `pip install requests scipy pandas python-dateutil`
54 |
55 | **N.B.** macOS users should first install [Homebrew](https://brew.sh/) and then Python with `brew install python`
56 |
57 |
58 |
59 | ## Execution
60 | Specify the number of days back from the present:
61 |
62 | `python get_fb_data.py post 5` (Public & owned pages)
63 | `python get_fb_data.py video 5` (Owned pages only for video-specific data)
64 |
65 | Specify two dates (inclusive) in yyyy-mm-dd format:
66 |
67 | `python get_fb_data.py post yyyy-mm-dd yyyy-mm-dd`
68 | `python get_fb_data.py video yyyy-mm-dd yyyy-mm-dd`
69 |
70 | The CSV file is written to the `facebook_output` folder by default (e.g. `posts_17-11-01_09.00.00.csv`)
71 |
72 | 
73 |
74 | ## Credit
75 | Thanks to minimaxir and his [project](https://github.com/minimaxir/facebook-page-post-scraper) for showing me the ropes
76 |
77 | ## FYI
78 | The additional `social_elastic.py` script scrapes data **and** pushes it to Elasticsearch instance(s) via the [bulk API](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html)
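A usage sketch, based on how `social_elastic.py` parses its arguments (the date, an inclusive start date, is illustrative):

`python social_elastic.py fb 2017-11-01` (Facebook; use `ig` instead of `fb` for Instagram via the companion `get_insta_data.py`)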
79 |
--------------------------------------------------------------------------------
/facebook_output/.gitignore:
--------------------------------------------------------------------------------
1 | *
2 | !.gitignore
3 |
4 | # Images #
5 | ##########
6 | *.png
7 |
8 | # Slideshows #
9 | #############
10 | *.pptx
11 | *.ppt
12 |
13 | # Spreadsheets #
14 | ###############
15 | *.csv
16 |
17 |
--------------------------------------------------------------------------------
/get_fb_data.py:
--------------------------------------------------------------------------------
1 | import datetime
2 | import calendar
3 | import requests
4 | import time
5 | import sys
6 | import os
7 | import threading
8 | import Queue
9 | import math
10 | from dateutil import tz
11 |
12 | import pandas as pd
13 | from scipy.stats import norm
14 |
15 |
16 | # At the cost of performance, set these to true for more precise data: reaction type breakdown and public shares across Facebook
17 | # Overall reactions per post are already pulled
18 | GET_SPECIFIC_REACTIONS_BOOL = False
19 | GET_PUBLIC_SHARES_BOOL = False
20 |
21 | # Page IDs to be scraped, defined by page's Facebook handle.
22 | PAGE_IDS_TO_SCRAPE = [
23 | 'nytimes',
24 | 'vicenews',
25 | 'bbcnews',
26 | 'TheSkimm',
27 | 'cnn',
28 | 'NBCNews',
29 | 'financialtimes',
30 | 'washingtonpost',
31 | 'theguardian',
32 | 'timesandsundaytimes',
33 | 'msnbc',
34 | 'CBSNews',
35 | 'TheIndependentOnline',
36 | 'ABCNews'
37 | ]
38 |
39 | # Additional private metrics are pulled for owned pages whose keys below also exist in PAGE_IDS_TO_SCRAPE
40 | # Temporary token: https://developers.facebook.com/tools/explorer
41 | # Permanent/Business Page token: https://stackoverflow.com/questions/17197970/facebook-permanent-page-access-token/28418469#28418469
42 | OWNED_PAGES_TOKENS = {
43 | 'jpryda': os.environ['MY_TOKEN'], # Token as an environment variable: export MY_TOKEN='abc-my-token'
44 | # 'MyPage1': 'my-hardcoded-token' # Hardcoded token
45 | }
46 |
47 | TIMEZONE = 'America/New_York'
48 | API_VERSION = '2.7'
49 |
50 |
51 | # Set display precision when printing Pandas dataframes
52 | pd.set_option('precision',1)
53 | # Deal with scientific notation
54 | pd.options.display.float_format = '{:20,.0f}'.format
55 | # Don't wrap dataframe when printing to console
56 | pd.set_option('display.expand_frame_repr', False)
57 |
58 |
59 | def request_until_succeed(url):
60 | max_attempts = 3
61 | attempts = 0
62 | success = False
63 | while success == False and attempts < max_attempts:
64 | attempts = attempts + 1
65 | try:
66 | response = requests.get(url)
67 | if response.status_code == 200:
68 | success = True
69 | except Exception as e:
70 | print e
71 | print 'Error for URL {} | {} | attempt {} of {}'.format(url, datetime.datetime.now(), attempts, max_attempts)
72 | if attempts == max_attempts:
73 | raise Exception('Failed after {} attempts | {}'.format(attempts, url))
74 | time.sleep(3)
75 | return response
76 |
77 |
78 | # Handle non-ASCII characters when writing to csv
79 | def unicode_normalize(text):
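# Map curly quotes to straight ASCII quotes and non-breaking spaces to regular spaces, then encode as UTF-8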
80 | return text.translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0xa0:0x20 }).encode('utf-8')
81 |
82 |
83 | def get_fb_page_video_data(page_id, access_token, num_posts=100, until=''):
84 | base = 'https://graph.facebook.com/v{}'.format(API_VERSION)
85 | node = '/{}/videos'.format(page_id)
86 | fields = '/?fields=title,description,created_time,id,comments.limit(0).summary(true),likes.limit(0).summary(true),reactions.limit(0).summary(true),permalink_url,live_status,status'
87 | parameters = '&limit={}&access_token={}&until={}'.format(num_posts, access_token, until)
88 | url = base + node + fields + parameters
89 |
90 | data = request_until_succeed(url).json()
91 | return data
92 |
93 |
94 | def get_fb_page_post_data(page_id, access_token, num_posts=100, until=''):
95 | # Shares on videos must be grabbed from the /posts endpoint; unavailable from the /videos endpoint
96 | base = 'https://graph.facebook.com/v{}'.format(API_VERSION)
97 | node = '/{}/posts'.format(page_id)
98 | fields = '/?fields=message,link,created_time,type,name,id,comments.limit(0).summary(true),shares,reactions.limit(0).summary(true)'
99 | parameters = '&limit={}&access_token={}&until={}'.format(num_posts, access_token, until)
100 | url = base + node + fields + parameters
101 |
102 | data = request_until_succeed(url).json()
103 | return data
104 |
105 |
106 | def get_specific_reactions_for_post(status_id, access_token):
107 | # Reaction types are only accessible at an individual post's endpoint
108 | base = 'https://graph.facebook.com/v{}'.format(API_VERSION)
109 | node = '/{}'.format(status_id)
110 | reactions = '/?fields=' \
111 | 'reactions.type(LIKE).limit(0).summary(total_count).as(like)'\
112 | ',reactions.type(LOVE).limit(0).summary(total_count).as(love)'\
113 | ',reactions.type(WOW).limit(0).summary(total_count).as(wow)'\
114 | ',reactions.type(HAHA).limit(0).summary(total_count).as(haha)'\
115 | ',reactions.type(SAD).limit(0).summary(total_count).as(sad)'\
116 | ',reactions.type(ANGRY).limit(0).summary(total_count).as(angry)'
117 | parameters = '&access_token={}'.format(access_token)
118 | url = base + node + reactions + parameters
119 |
120 | data = request_until_succeed(url).json()
121 | return data
122 |
123 |
124 | def get_insights_for_post(object_id, access_token, fields, period='', since=''):
125 | base = 'https://graph.facebook.com/v{}'.format(API_VERSION)
126 | node = '/{}/insights/'.format(object_id)
127 | parameters = '?access_token={}&period={}&since={}&date_format=U'.format(access_token, period, since)
128 | url = base + node + fields + parameters
129 |
130 | data = request_until_succeed(url)
131 | if data is not None:
132 | return data.json()
133 | else:
134 | raise Exception('No Post Insights Data')
135 |
136 |
137 | def get_insights_for_video(video_id, access_token, period='lifetime'):
138 | base = 'https://graph.facebook.com/v{}'.format(API_VERSION)
139 | node = '/{}/video_insights'.format(video_id)
140 | fields = ''
141 | parameters = '?access_token={}&period={}'.format(access_token, period)
142 | url = base + node + fields + parameters
143 |
144 | data = request_until_succeed(url).json()
145 | return data
146 |
147 |
148 | def get_fb_url_shares_comments(access_token, url):
149 | # Remove pound signs from URL which mess up FB API
150 | url = url.replace('#','')
151 | base = 'https://graph.facebook.com/v{}'.format(API_VERSION)
152 | node = ''
153 | fields = '/?id={}'.format(url)
154 | parameters = '&access_token={}'.format(access_token)
155 | url = base + node + fields + parameters
156 |
157 | data = request_until_succeed(url).json()
158 | return data
159 |
160 |
161 | def get_insights_for_page(access_token, metrics, page_id, period, start_date, excl_end_date):
162 | base = 'https://graph.facebook.com/v{}'.format(API_VERSION)
163 | node = '/{}/insights'.format(page_id)
164 | fields = '/{}'.format(metrics)
165 | period_string = 'period={}&since={}&until={}'.format(period, start_date, excl_end_date)
166 | parameters = '?{}&access_token={}'.format(period_string, access_token)
167 |
168 | url = base + node + fields + parameters
169 | data = request_until_succeed(url).json()
170 |
171 | return data
172 |
173 |
174 | # def posix_to_timezone(posix_int, to_timezone):
175 | # utc_datetime = datetime.utcfromtimestamp(posix_int)
176 | # from_zone = tz.gettz('UTC')
177 | # to_zone = tz.gettz(to_timezone)
178 | # to_datetime = utc_datetime.replace(tzinfo=from_zone).astimezone(to_zone)
179 | # return to_datetime.replace(tzinfo=None) #Remove timezone component to allow for comparison with local time
180 |
181 |
182 | # def posix_to_iso(posix_int):
183 | # return datetime.datetime.utcfromtimestamp(posix_int).strftime('%Y-%m-%dT%H:%M:%S+0000')
184 |
185 |
186 | def utc_to_timezone(utc_datetime_string, to_timezone):
187 | utc_datetime = datetime.datetime.strptime(utc_datetime_string,'%Y-%m-%dT%H:%M:%S+0000')
188 | from_zone = tz.gettz('UTC')
189 | to_zone = tz.gettz(to_timezone)
190 | est_datetime = utc_datetime.replace(tzinfo=from_zone).astimezone(to_zone)
191 | return est_datetime.replace(tzinfo=None) #Remove timezone component to allow for comparison with local time
192 |
193 | # Not used right now
194 | def utc_to_local(utc_datetime_string):
195 | utc_datetime = datetime.datetime.strptime(utc_datetime_string,'%Y-%m-%dT%H:%M:%S+0000')
196 | from_zone = tz.gettz('UTC')
197 | to_zone = tz.tzlocal()
198 | local_datetime = utc_datetime.replace(tzinfo=from_zone).astimezone(to_zone)
199 | return local_datetime.replace(tzinfo=None) #Remove timezone component to allow for comparison with local time
200 |
201 | # For specification of 'until' parameter at commandline
202 | def local_to_utc(local_date):
203 | from_zone = tz.tzlocal()
204 | to_zone = tz.gettz('UTC')
205 | utc_datetime = local_date.replace(tzinfo=from_zone).astimezone(to_zone)
206 | return utc_datetime.replace(tzinfo=None) #Remove timezone component to allow for comparison with local time
207 |
208 | '''
209 | Calculate the confidence interval lower bound as a scoring system to balance the proportion of successes (e.g. clicks) against the uncertainty of a small sample
210 | i.e. ci_lower_bound(5, 10, 0.95) < ci_lower_bound(100, 200, 0.95). For more info see http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
211 | '''
212 | def ci_lower_bound(pos, n, confidence):
213 | if n == 0:
214 | return 0
215 | elif pos <= n:
216 | z = norm.ppf((1-(1-confidence)/2), loc=0, scale=1)
217 | phat = float(pos)/n
218 | return (phat + z*z/(2*n) - z * math.sqrt((phat*(1-phat)+z*z/(4*n))/n)) / (1+z*z/n)
219 | else:
220 | return 0
221 |
222 |
223 | def process_fb_page_video(video, access_token, page_id):
224 | if video.get('status').get('video_status') == 'expired':
225 | return None
226 |
227 | timestamp = datetime.datetime.utcnow().replace(microsecond=0).isoformat() + '+0000'
228 | video_id = video['id']
229 | utc_video_published = video['created_time']
230 |
231 | video_title = None if 'title' not in video.keys() else unicode_normalize(video['title']).decode('utf-8','ignore').encode('utf-8')
232 | video_description = None if 'description' not in video.keys() else unicode_normalize(video['description']).decode('utf-8','ignore').encode('utf-8')
233 | video_permalink = video['permalink_url']
234 |
235 | num_likes = 0 if 'likes' not in video else video['likes']['summary']['total_count']
236 | num_reactions = 0 if 'reactions' not in video else video['reactions']['summary']['total_count']
237 | num_comments = 0 if 'comments' not in video or video.get('comments').get('summary').get('total_count') is None else video['comments']['summary']['total_count']
238 |
239 | live_boolean = False if video.get('live_status') is None else True
240 |
241 | # Set Insights default values if a competitor or a Facebook Live Video
242 | total_3s_views = None
243 | total_10s_views = None
244 | total_complete_views = None
245 | total_video_impressions = None
246 | total_video_avg_time_watched = None
247 | ten_three_s_ratio = None
248 | complete_three_s_ratio = None
249 | total_video_impressions_fan = None
250 | total_non_fan_impressions_rate = None
251 | total_video_views_paid = None
252 |
253 | # Get insights for videos iff they are on a page we OWN (Live videos may have little or no insights data)
254 | if page_id.lower() in [x.lower() for x in OWNED_PAGES_TOKENS.keys()]:
255 | video_insights = get_insights_for_video(video_id, access_token, 'lifetime')
256 |
257 | if len(video_insights['data']) > 0:
258 | for metric_result in video_insights['data']:
259 | if metric_result['name'] == 'total_video_views':
260 | total_3s_views = metric_result['values'][0]['value']
261 | if metric_result['name'] == 'total_video_10s_views':
262 | total_10s_views = metric_result['values'][0]['value']
263 | if metric_result['name'] == 'total_video_complete_views':
264 | total_complete_views = metric_result['values'][0]['value']
265 | if metric_result['name'] == 'total_video_avg_time_watched':
266 | total_video_avg_time_watched = float(metric_result['values'][0]['value'])/1000
267 | if metric_result['name'] == 'total_video_impressions':
268 | total_video_impressions = metric_result['values'][0]['value']
269 | if metric_result['name'] == 'total_video_impressions_fan':
270 | total_video_impressions_fan = metric_result['values'][0]['value']
271 | if metric_result['name'] == 'total_video_views_paid':
272 | total_video_views_paid = metric_result['values'][0]['value']
273 |
274 | total_non_fan_impressions = total_video_impressions - total_video_impressions_fan
275 | total_non_fan_impressions_rate = None if total_video_impressions == 0 else float(total_non_fan_impressions)/float(total_video_impressions) * 100
276 | ten_three_s_ratio = None if total_3s_views == 0 else float(total_10s_views)/float(total_3s_views) * 100
277 | complete_three_s_ratio = None if total_3s_views == 0 else float(total_complete_views)/float(total_3s_views) * 100
278 | engagement_rate = None if total_3s_views == 0 else float(num_reactions + num_comments)/float(total_3s_views) * 100 # Video endpoint doesn't have shares
279 |
280 | crossposted_boolean = True if total_3s_views is None and live_boolean is False else False
281 |
282 | scraped_row = {
283 | 'Page': page_id,
284 | 'Video ID': video_id,
285 | 'Published': utc_video_published,
286 | 'Live Video': live_boolean,
287 | 'Crossposted Video': crossposted_boolean,
288 | 'Headline': video_title,
289 | 'Caption': video_description,
290 | 'Num Likes': num_likes,
291 | 'Num Reactions': num_reactions,
292 | 'Num Comments': num_comments,
293 | '3s Views': total_3s_views,
294 | '10s Views': total_10s_views,
295 | 'Complete Views': total_complete_views,
296 | 'Total Paid Views': total_video_views_paid,
297 | '10s/3s Views (%)': ten_three_s_ratio,
298 | 'Complete/3s Views (%)': complete_three_s_ratio,
299 | 'Impressions': total_video_impressions,
300 | 'Impression Rate Non-Likers (%)': total_non_fan_impressions_rate,
301 | 'Avg View Time': total_video_avg_time_watched,
302 | 'Link': video_permalink,
303 | 'Timestamp': timestamp
304 | }
305 | return scraped_row
306 |
307 |
308 | def process_fb_page_video_all_metrics(video, access_token, page_id):
309 | timestamp = datetime.datetime.utcnow().replace(microsecond=0).isoformat() + '+0000'
310 | video_id = video['id']
311 | video_title = None if 'title' not in video.keys() else unicode_normalize(video['title']).decode('utf-8','ignore').encode('utf-8')
312 | video_description = None if 'description' not in video.keys() else unicode_normalize(video['description']).decode('utf-8','ignore').encode('utf-8')
313 | utc_video_published = video['created_time']
314 | video_permalink = video['permalink_url']
315 |
316 | num_likes = 0 if 'likes' not in video else video['likes']['summary']['total_count']
317 | num_reactions = 0 if 'reactions' not in video else video['reactions']['summary']['total_count']
318 | num_comments = 0 if 'comments' not in video or video.get('comments').get('summary').get('total_count') is None else video['comments']['summary']['total_count']
319 |
320 | live_boolean = False if video.get('live_status') is None else True
321 |
322 | scraped_row = {
323 | 'Page': page_id,
324 | 'Video ID': video_id,
325 | 'Published': utc_video_published,
326 | 'Live Video': live_boolean,
327 | 'Headline': video_title,
328 | 'Caption': video_description,
329 | 'Num Likes': num_likes,
330 | 'Num Reactions': num_reactions,
331 | 'Num Comments': num_comments,
332 | 'Link': video_permalink,
333 | 'Timestamp': timestamp
334 | }
335 |
336 | if page_id.lower() in [x.lower() for x in OWNED_PAGES_TOKENS.keys()]:
337 | video_insights = get_insights_for_video(video_id, access_token, 'lifetime')
338 |
339 | if len(video_insights['data']) > 0:
340 |
341 | for metric in video_insights['data']:
342 |
343 | # Define metric name and add to scraped_row
344 | metric_name = metric['name'].replace('.','')
345 | metric_value = metric['values'][0]['value']
346 | # Elasticsearch doesn't accept periods within keys
347 | if isinstance(metric_value, dict):
348 | metric_value = { x.replace('.', ''): metric_value[x] for x in metric_value.keys() }
349 | scraped_row[metric_name] = metric_value
350 |
351 | # Unpack dicts of important metrics. !Actually Kibana unpacks these for us so unnecessary!
352 | scraped_row['total_video_views_by_crossposted'] = scraped_row['total_video_views_by_distribution_type'].get('crossposted')
353 | scraped_row['total_video_views_by_page_owned'] = scraped_row['total_video_views_by_distribution_type'].get('page_owned')
354 | scraped_row['total_video_views_by_page_shared'] = scraped_row['total_video_views_by_distribution_type'].get('shared')
355 | #del scraped_row['total_video_views_by_distribution_type']
356 |
357 | scraped_row['total_video_impressions_non_fan'] = scraped_row['total_video_impressions'] - scraped_row['total_video_impressions_fan']
358 | scraped_row['total_non_fan_impressions_rate'] = None if scraped_row['total_video_impressions'] == 0 else float(scraped_row['total_video_impressions_non_fan'])/float(scraped_row['total_video_impressions']) * 100
359 | scraped_row['ten_three_s_ratio'] = None if scraped_row['total_video_views'] == 0 else float(scraped_row['total_video_10s_views'])/float(scraped_row['total_video_views']) * 100
360 | scraped_row['complete_three_s_ratio'] = None if scraped_row['total_video_views'] == 0 else float(scraped_row['total_video_complete_views'])/float(scraped_row['total_video_views']) * 100
361 |
362 | scraped_row['Crossposted Video'] = True if scraped_row.get('total_video_views') is None and live_boolean is False else False
363 | if scraped_row.get('total_video_views') is not None:
364 | scraped_row['Video Views'] = scraped_row['total_video_views']
365 | #del scraped_row['total_video_views']
366 | return scraped_row
367 |
368 |
369 | def process_fb_page_post(status, access_token, page_id):
370 | timestamp = datetime.datetime.utcnow().replace(microsecond=0).isoformat() + '+0000'
371 | status_id = status['id']
372 | status_message = None if 'message' not in status.keys() else unicode_normalize(status['message']).decode('utf-8','ignore').encode('utf-8')
373 | post_title = None if 'name' not in status.keys() else unicode_normalize(status['name']).decode('utf-8','ignore').encode('utf-8')
374 | status_type = status['type']
375 | status_link = None if 'link' not in status.keys() else unicode_normalize(status['link'])
376 |
377 | # Time needs special care since it's in UTC
378 | utc_status_published = status['created_time']
379 |
380 | num_reactions = None if 'reactions' not in status else status['reactions']['summary']['total_count']
381 | num_comments = None if 'comments' not in status or status.get('comments').get('summary').get('total_count') is None else status['comments']['summary']['total_count']
382 | num_shares = None if 'shares' not in status else status['shares']['count']
383 |
384 |
385 | num_likes = num_loves = num_wows = num_hahas = num_sads = num_angrys = None
386 | unique_link_clicks = None
387 | total_unique_impressions = None
388 | ctr = None
389 | post_video_views = None
390 | paid_unique_impressions = None
391 | non_fan_unique_impressions_rate = None
392 | hide_clicks = None
393 | hide_all_clicks = None
394 | hide_rate = None
395 | public_num_shares = None
396 | ctr_lb_confidence = None
397 | engagement_rate = None
398 | engage_lb_confidence = None
399 | organic_unique_impressions = None
400 | public_num_shares = None
401 |
402 | if (GET_PUBLIC_SHARES_BOOL):
403 | # Get number of shares across all of Facebook
404 | if status_link is not None:
405 | public_num_shares_comments = get_fb_url_shares_comments(access_token, status_link)
406 | if 'share' in public_num_shares_comments:
407 | public_num_shares = public_num_shares_comments.get('share').get('share_count')
408 |
409 |
410 | if (GET_SPECIFIC_REACTIONS_BOOL):
411 | # Reactions only exists after implementation date: http://newsroom.fb.com/news/2016/02/reactions-now-available-globally/
412 | reactions = get_specific_reactions_for_post(status_id, access_token) if utc_status_published > '2016-02-24 00:00:00' else {}
413 | num_likes = 0 if 'like' not in reactions else reactions['like']['summary']['total_count']
414 | # Special case: Set number of Likes to Number of reactions for pre-reaction statuses
415 | num_likes = num_reactions if utc_status_published < '2016-02-24 00:00:00' else num_likes
416 |
417 | num_loves = 0 if 'love' not in reactions else reactions['love']['summary']['total_count']
418 | num_wows = 0 if 'wow' not in reactions else reactions['wow']['summary']['total_count']
419 | num_hahas = 0 if 'haha' not in reactions else reactions['haha']['summary']['total_count']
420 | num_sads = 0 if 'sad' not in reactions else reactions['sad']['summary']['total_count']
421 | num_angrys = 0 if 'angry' not in reactions else reactions['angry']['summary']['total_count']
422 |
423 |
424 | # If not one of our own pages or a pesky cover photo
425 | if (page_id.lower() not in [x.lower() for x in OWNED_PAGES_TOKENS.keys()]) or (post_title is not None and 'cover photo' in post_title and status_type=='photo'):
426 |
427 | scraped_row = {
428 | 'Page': page_id,
429 | 'Published': utc_status_published,
430 | 'Num Shares': num_shares,
431 | 'Num Reactions': num_reactions,
432 | 'Type': status_type,
433 | 'Headline': post_title,
434 | 'Caption': status_message,
435 | 'Link': status_link,
436 | 'Num Likes': num_likes,
437 | 'Num Comments': num_comments,
438 | 'Num Loves': num_loves,
439 | 'Num Wows': num_wows,
440 | 'Num Hahas': num_hahas,
441 | 'Num Sads': num_sads,
442 | 'Num Angrys': num_angrys,
443 | 'Lifetime Public Num Shares': public_num_shares,
444 | 'Post ID': status_id,
445 | 'Timestamp': timestamp
446 | }
447 | return scraped_row
448 |
449 | # Iff one of our own pages, read insights too
450 | elif page_id.lower() in [x.lower() for x in OWNED_PAGES_TOKENS.keys()]:
451 |
452 | fields = 'post_consumptions_by_type_unique'\
453 | ',post_impressions_by_paid_non_paid_unique'\
454 | ',post_video_views'\
455 | ',post_impressions_fan_unique'\
456 | ',post_negative_feedback_by_type_unique'
457 |
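# NOTE: the indexed access into insights['data'] below assumes the Graph API returns these five metrics in the order requested above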
458 | try:
459 | insights = get_insights_for_post(status_id, access_token, fields, 'lifetime')
460 |
461 | unique_link_clicks = 0 if 'link clicks' not in insights['data'][0]['values'][0]['value'] else insights['data'][0]['values'][0]['value'].get('link clicks')
462 | total_unique_impressions = insights['data'][1]['values'][0]['value'].get('total')
463 | ctr = None if total_unique_impressions == 0 else (float(unique_link_clicks)/float(total_unique_impressions)) * 100
464 | ctr_lb_confidence = None if status_type != 'link' else ci_lower_bound(unique_link_clicks, total_unique_impressions, 0.95) * 100
465 |
466 | paid_unique_impressions = insights['data'][1]['values'][0]['value'].get('paid')
467 | organic_unique_impressions = insights['data'][1]['values'][0]['value'].get('unpaid')
468 | post_video_views = insights['data'][2]['values'][0]['value']
469 | fan_unique_impressions = insights['data'][3]['values'][0]['value']
470 | non_fan_unique_impressions = total_unique_impressions - fan_unique_impressions
471 | non_fan_unique_impressions_rate = None if total_unique_impressions == 0 else (float(non_fan_unique_impressions)/float(total_unique_impressions)) * 100
472 | hide_clicks = 0 if 'hide_clicks' not in insights['data'][4]['values'][0]['value'] else insights['data'][4]['values'][0]['value'].get('hide_clicks')
473 | hide_all_clicks = 0 if 'hide_all_clicks' not in insights['data'][4]['values'][0]['value'] else insights['data'][4]['values'][0]['value'].get('hide_all_clicks')
474 | hide_rate = None if total_unique_impressions == 0 else (float(hide_clicks + hide_all_clicks)/float(total_unique_impressions)) * 100
475 |
476 | # Engagement Rate
477 | if num_shares is not None and num_reactions is not None and num_comments is not None:
478 | total_engagement = num_shares + num_reactions + num_comments
479 | if status_type != 'video':
480 | engagement_rate = None if total_unique_impressions == 0 else float(total_engagement)/float(total_unique_impressions) * 100
481 | engage_lb_confidence = ci_lower_bound(total_engagement, total_unique_impressions, 0.95) * 100
482 | if status_type == 'video':
483 | engagement_rate = None if post_video_views == 0 else float(total_engagement)/float(post_video_views) * 100
484 | engage_lb_confidence = ci_lower_bound(total_engagement, post_video_views, 0.95) * 100
485 |
486 | ## Counts of each reaction separately. Can comment out for speed's sake
487 |
488 |
489 | except Exception as e:
490 | print e
491 |
492 | scraped_row = {
493 | 'Page': page_id,
494 | 'Published': utc_status_published,
495 | 'Unique Impressions': total_unique_impressions,
496 | 'Paid Unique Impressions': paid_unique_impressions,
497 | 'Impression Rate Non-Likers (%)': non_fan_unique_impressions_rate,
498 | 'Unique Link Clicks': unique_link_clicks,
499 | 'CTR (%)': ctr,
500 | 'Adjusted CTR (%)': ctr_lb_confidence,
501 | 'Num Shares': num_shares,
502 | 'Num Reactions': num_reactions,
503 | 'Hide Rate (%)': hide_rate,
504 | 'Hide Clicks': hide_clicks,
505 | 'Hide All Clicks': hide_all_clicks,
506 | 'Type': status_type,
507 | 'Engagement Rate (%)': engagement_rate,
508 | 'Adjusted Engagement Rate (%)': engage_lb_confidence,
509 | 'Video Views': post_video_views,
510 | 'Headline': post_title.decode('utf-8','ignore').encode('utf-8') if post_title is not None else None,
511 | 'Caption': status_message.decode('utf-8','ignore').encode('utf-8') if status_message is not None else None,
512 | 'Link': status_link,
513 | 'Num Likes': num_likes,
514 | 'Num Comments': num_comments,
515 | 'Num Loves': num_loves,
516 | 'Num Wows': num_wows,
517 | 'Num Hahas': num_hahas,
518 | 'Num Sads': num_sads,
519 | 'Num Angrys': num_angrys,
520 | 'Lifetime Public Num Shares': public_num_shares,
521 | 'Post ID': status_id,
522 | 'Organic Unique Impressions': organic_unique_impressions,
523 | 'Timestamp': timestamp
524 | }
525 | return scraped_row
526 |
527 |
528 | def scrape_single_fb_page_items(page_id, from_date, until_date, access_token, scrape_function, process_item_function):
529 | num_processed = 0 # keep a count on how many we've processed
530 | scraped_rows_list = []
531 |
532 | scrape_starttime = datetime.datetime.now()
533 |
534 | items = scrape_function(page_id, access_token, 100, until_date)
535 | if 'error' in items:
536 | print items['error']
537 | return scraped_rows_list
538 |
539 | needs_next_page = True
540 |
541 | while needs_next_page:
542 | for item in items['data']:
543 |
544 | item_published = utc_to_timezone(item['created_time'], TIMEZONE)
545 | if item_published >= from_date:
546 |
547 | processed_item = process_item_function(item, access_token, page_id)
548 | if processed_item is not None:
549 | scraped_rows_list.append(processed_item)
550 | # output progress occasionally to make sure code is not stalling
551 | num_processed += 1
552 | if num_processed % 10 == 0:
553 | print '{} {} items Processed | {}'.format(num_processed, page_id, item_published.strftime('%Y-%m-%d %H:%M:%S'))
554 | else:
555 | needs_next_page = False
556 | # Else avoid processing items that fall before from_date in a single 'items run'
557 | break
558 |
559 | if needs_next_page and 'paging' in items.keys():
560 | if 'next' in items['paging']:
561 | items = request_until_succeed(items['paging']['next']).json()
562 | else:
563 | needs_next_page = False
564 | else:
565 | needs_next_page = False
566 |
567 | print 'Finished Processing {} {} items! | {}'.format(num_processed, page_id, datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
568 | return scraped_rows_list
569 |
570 |
571 | def scrape_fb_pages_items(page_ids, from_date, until_date, scrape_function, process_item_function):
572 | # Define the length of results up-front for indexed access, so page results are stored in order of specification rather than in the order threads finish
573 | results = [None] * len(page_ids)
574 |
575 | # Create FIFO queue
576 | queue_page_ids = Queue.Queue()
577 |
578 | # Set number of threads to the number of pages to be scraped
579 | num_threads = len(page_ids)
580 |
581 | # Add items with their ordinal number to queue
582 | for idx, page_id in enumerate(page_ids):
583 | queue_page_ids.put((idx, page_id))
584 |
585 | # Wrapper function to scrape_single_fb_page_items which pulls from queue and is able to assign return output to a variable in this scope
586 | def grab_page_from_queue(queue):
587 | while not queue.empty():
588 | idx, page_id = queue.get()
589 |
590 | # Select appropriate access token based on page. Include some logic handling FB page capitalisations
591 | access_token = OWNED_PAGES_TOKENS.get(page_id) if OWNED_PAGES_TOKENS.get(page_id.lower()) is None else OWNED_PAGES_TOKENS.get(page_id.lower())
592 | if access_token is None:
593 | # For competitors set default access token to use as arbitrary token in owned dict
594 | access_token = OWNED_PAGES_TOKENS.itervalues().next()
595 |
596 | results[idx] = scrape_single_fb_page_items(page_id, from_date, until_date, access_token, scrape_function, process_item_function)
597 | queue.task_done()
598 |
599 |
600 | t0 = datetime.datetime.now()
601 |
602 | # To avoid a strptime multithreading bug (strptime isn't fully imported by the first thread before another thread calls it), call it once here first
603 | dummy = datetime.datetime.strptime(t0.strftime('%Y-%m-%d'), '%Y-%m-%d')
604 |
605 | for n in range(num_threads):
606 | # Configure thread action
607 | t_i = threading.Thread(target=grab_page_from_queue, args=[queue_page_ids])
608 | # Must start threads in daemon mode to enable hard-kill
609 | t_i.setDaemon(True)
610 | t_i.start()
611 |
612 | '''
613 | join() (on thread and queue objects) blocks the main thread until an item is returned or task_done()
614 | thread.join(arg) takes a timeout argument whereas queue.join() does not and so no KEYBOARDINTERRUPTS allowed!
615 | Wrap Queue's join (no timeout argument) in designated terminator thread which HAS a timeout argument.
616 | Ctrl+C can then end Terminator and thus MainThread whereupon the Python Interpreter hard-kills all spawned 'daemon' threads
617 | '''
618 | term = threading.Thread(target=queue_page_ids.join)
619 | term.setDaemon(True)
620 | term.start()
621 | # Terminator thread only stays alive when Queue's join() is running i.e. until natural completion once all queue elements have been processed
622 | while term.isAlive():
623 | # Any large timeout number crucial
624 | term.join(timeout=360000000)
625 |
626 | t1 = datetime.datetime.now()
627 |
628 | if type(until_date) is datetime.datetime:
629 | end_date = until_date.strftime('%Y-%m-%d %H:%M:%S')
630 | else:
631 | end_date = datetime.datetime.fromtimestamp(until_date)
632 |
633 | print '\nDone!\n{} Facebook page(s) processed between {} and {} in {} second(s)'.format(len(page_ids), from_date.strftime('%Y-%m-%d %H:%M:%S'), end_date, (t1 - t0).seconds)
634 |
635 | scraped_rows_list = [item for sublist in results for item in sublist]
636 | return scraped_rows_list
637 |
638 |
639 | def scrape_posts_to_csv(page_ids, from_date, until_date, scrape_function, process_item_function):
640 | scraped_rows_list = scrape_fb_pages_items(page_ids, from_date, until_date, scrape_function, process_item_function)
641 | scraped_rows_df = pd.DataFrame(scraped_rows_list)
642 |
643 | # Convert UTC datetimes to EST
644 | scraped_rows_df['Published (EST)'] = [utc_to_timezone(x, TIMEZONE).strftime('%Y-%m-%d %H:%M:%S') for x in scraped_rows_df['Published']]
645 |
646 | csvColumns = ['Page', 'Published (EST)', 'Type', 'Headline', 'Unique Impressions', 'Impression Rate Non-Likers (%)', 'Unique Link Clicks', 'CTR (%)', 'Adjusted CTR (%)',
647 | 'Num Shares', 'Engagement Rate (%)', 'Adjusted Engagement Rate (%)', 'Lifetime Public Num Shares', 'Num Reactions', 'Video Views', 'Caption', 'Link', 'Num Likes',
648 | 'Num Comments', 'Num Loves', 'Num Wows', 'Num Hahas', 'Num Sads', 'Num Angrys', 'Hide Rate (%)', 'Hide Clicks', 'Hide All Clicks',
649 | 'Paid Unique Impressions', 'Organic Unique Impressions', 'Post ID']
650 |
651 | scraped_rows_df = scraped_rows_df.round(1)
652 | csv_filename = './facebook_output/{}_{}.csv'.format('posts', datetime.datetime.now().strftime('%y-%m-%d_%H.%M.%S'))
653 | scraped_rows_df.to_csv(csv_filename, index=False, columns=csvColumns, encoding='utf-8')
654 | print csv_filename + ' written'
655 |
656 | # Output Summary to Terminal
657 | print '\nMedians:\n'
658 | print scraped_rows_df.ix[:,['Page', 'Num Shares', 'Num Reactions', 'Num Comments', 'Video Views', 'Impression Rate Non-Likers (%)', 'CTR (%)']].groupby('Page').median()
659 | # .sort_values(by='Num Shares', ascending=False)
660 | print '\nTotals:\n'
661 | print scraped_rows_df.ix[:,['Page', 'Num Shares', 'Num Reactions', 'Num Comments', 'Video Views']].groupby('Page').sum()
662 | # .sort_values(by='Num Shares', ascending=False)
663 | print '\n'
664 |
665 | # If called by daily/weekly insights OR Elasticsearch script
666 | if __name__ != '__main__':
667 | return scraped_rows_list
668 |
669 |
670 | def scrape_videos_to_csv(page_ids, from_date, until_date, scrape_function, process_item_function):
671 | scraped_rows_list = scrape_fb_pages_items(page_ids, from_date, until_date, scrape_function, process_item_function)
672 | scraped_rows_df = pd.DataFrame(scraped_rows_list)
673 |
674 | # Convert UTC datetimes to EST
675 | scraped_rows_df['Published (EST)'] = [utc_to_timezone(x, TIMEZONE).strftime('%Y-%m-%d %H:%M:%S') for x in scraped_rows_df['Published']]
676 |
677 | print '\nAverages:\n'
678 | print scraped_rows_df.ix[:,['Page', 'Num Reactions', 'Complete/3s Views (%)', '3s Views', 'Impression Rate Non-Likers (%)']].groupby('Page').describe(percentiles=[.5]).sort_values(by='Num Reactions', ascending=False)
679 | print '\nTotals:\n'
680 | print scraped_rows_df.ix[:,['Page', '3s Views', 'Num Reactions']].groupby('Page').sum().sort_values(by='Num Reactions', ascending=False)
681 | print '\n'
682 |
683 | # We set ordering of csv columns here
684 | csvColumns = ['Page', 'Video ID', 'Published (EST)', 'Live Video', 'Crossposted Video', 'Headline', 'Caption', 'Num Likes', 'Num Reactions', 'Num Comments', '3s Views',
685 | '10s Views', 'Complete Views', 'Total Paid Views', '10s/3s Views (%)', 'Complete/3s Views (%)', 'Impressions',
686 | 'Impression Rate Non-Likers (%)', 'Avg View Time', 'Link']
687 |
688 | scraped_rows_df = scraped_rows_df.round(1)
689 | csv_filename = './facebook_output/{}_{}.csv'.format('videos', datetime.datetime.now().strftime('%y-%m-%d_%H.%M.%S'))
690 | scraped_rows_df.to_csv(csv_filename, index=False, columns=csvColumns, encoding='utf-8')
691 | print csv_filename + ' written'
692 |
693 | if __name__ != '__main__':
694 | return scraped_rows_list
695 |
696 |
697 | def print_usage():
698 | print '\nUsage:\n python {0} [post|video] [num_days_back | from_date to_date]\n e.g. for posts since yesterday midnight:'\
699 | ' python {0} post 1\n'\
700 | ' python {0} post yyyy-mm-dd yyyy-mm-dd where dates are inclusive and in format yyyy-mm-dd'\
701 | '\nCtrl+C to cancel\n'.format(sys.argv[0])
702 |
703 |
704 | def is_date_string(date_string):
705 | try:
706 | date_object = datetime.datetime.strptime(date_string, '%Y-%m-%d')
707 | return True
708 | except ValueError as e:
709 | return False
710 |
711 |
712 | if __name__ == '__main__':
713 |
714 | if len(sys.argv) == 3:
715 | # Option 1: Simply specify number of days back and scrape until now:
716 | if sys.argv[2].isdigit():
717 | num_days_back = int(sys.argv[2])
718 | local_now = datetime.datetime.now()
719 | today = datetime.datetime(year=local_now.year, month=local_now.month, day=local_now.day, hour=0, minute=0, second=0)
720 | local_from_date = today + datetime.timedelta(days=-num_days_back)
721 | # Facebook's until parameter takes POSIX to include time component
722 | utc_now = datetime.datetime.utcnow()
723 | utc_posix_until_date = calendar.timegm(utc_now.timetuple())
724 | else:
725 | print_usage()
726 | sys.exit()
727 | elif len(sys.argv) == 4:
728 | # Option 2: Specify two inclusive dates in format YYYY-mm-dd
729 | if is_date_string(sys.argv[2]) and is_date_string(sys.argv[3]):
730 | local_from_date = datetime.datetime.strptime(sys.argv[2], '%Y-%m-%d')
731 | local_until_date = datetime.datetime.strptime(sys.argv[3], '%Y-%m-%d')
732 | # Add a day so Facebook includes whole day itself and transform to POSIX to ensure time component is included (normalized EST is NOT normalized UTC)
733 | utc_until_date = local_to_utc(local_until_date + datetime.timedelta(days = 1))
734 | utc_posix_until_date = calendar.timegm(utc_until_date.timetuple())
735 | if local_from_date > local_until_date:
736 | print '\n Start date is AFTER the end date'
737 | print_usage()
738 | sys.exit()
739 | else:
740 | print_usage()
741 | sys.exit()
742 | # Until date is a POSIX timestamp (used in the API call). From date is a datetime object used to check paging
743 | if sys.argv[1] == 'post':
744 | scrape_posts_to_csv(PAGE_IDS_TO_SCRAPE, local_from_date, utc_posix_until_date, get_fb_page_post_data, process_fb_page_post)
745 | # Scrape OUR OWN crossposted videos using the /videos endpoint. These don't include shares, but video POSTS do include shares!
746 | elif sys.argv[1] == 'video':
747 | scrape_videos_to_csv(OWNED_PAGES_TOKENS.keys(), local_from_date, utc_posix_until_date, get_fb_page_video_data, process_fb_page_video)
748 | else:
749 | print_usage()
750 | sys.exit()
--------------------------------------------------------------------------------
/res/page_handle_location.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jpryda/facebook-multi-scraper/3b77aceae3015cd36df8f53f59d6050650618be2/res/page_handle_location.png
--------------------------------------------------------------------------------
/res/sample_output_owned_posts.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jpryda/facebook-multi-scraper/3b77aceae3015cd36df8f53f59d6050650618be2/res/sample_output_owned_posts.png
--------------------------------------------------------------------------------
/res/sample_run.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jpryda/facebook-multi-scraper/3b77aceae3015cd36df8f53f59d6050650618be2/res/sample_run.gif
--------------------------------------------------------------------------------
/res/the_matrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jpryda/facebook-multi-scraper/3b77aceae3015cd36df8f53f59d6050650618be2/res/the_matrix.png
--------------------------------------------------------------------------------
/social_elastic.py:
--------------------------------------------------------------------------------
1 | import os
2 | import sys
3 | import json
4 | import datetime
5 | import calendar
6 | import time
7 |
8 | from elasticsearch import Elasticsearch, TransportError, ConnectionError, ConnectionTimeout
9 | import get_fb_data
10 | import get_insta_data
11 |
12 | # Facebook Globals
13 | OWNED_PAGES_TOKENS = {
14 | "MyPage1": os.environ['PAGE1_FB_PERM_TOKEN'],
15 | "MyPage2": os.environ['PAGE2_FB_PERM_TOKEN'],
16 | "MyPage3": os.environ['PAGE3_FB_PERM_TOKEN']
17 | }
18 |
19 | # Instagram Globals
20 | MY_INSTA_TOKEN = os.environ['MY_INSTA_TOKEN']
21 | MY_INSTA_USER_ID = MY_INSTA_TOKEN.split('.')[0]
22 | #ELASTIC_HOSTS = [os.environ['ELASTIC_HOST_DEV'], os.environ['ELASTIC_HOST_PROD'], os.environ['ELASTIC_HOST_PROD2']]
23 | ELASTIC_HOSTS = [os.environ['ELASTIC_HOST_PROD2']]
24 | #ELASTIC_HOSTS = [os.environ['ELASTIC_HOST_DEV']]
25 |
26 |
27 | def create_bulk_req_elastic(json_data, index, doc_type, id_field):
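# Build an Elasticsearch bulk-API body (NDJSON): for each document, an 'index' action line followed by the document itself, each newline-terminated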
28 | action_data_string = ""
29 | for i, json_post in enumerate(json_data):
30 | index_action = {"index":{"_index":index, "_type":doc_type, "_id":json_post[id_field]}}
31 | action_data_string += json.dumps(index_action, separators=(',', ':')) + '\n' + json.dumps(json_post, separators=(',', ':')) + '\n'
32 | return action_data_string
33 |
34 |
35 | def insert_bulk_elastic(action_data_string, hosts):
36 | return_ack_list = []
37 | for host in hosts:
38 | es = Elasticsearch(host)
39 | success = False
40 | while success == False:
41 | try:
42 | return_ack_list.append(es.bulk(body=action_data_string))
43 | success = True
44 | except (ConnectionError, ConnectionTimeout, TransportError) as e:
45 | print e
46 | print "\nRetrying in 3 seconds"
47 | time.sleep(3)
48 | return return_ack_list
49 |
50 |
51 | def update_alias(source_index, alias_index, hosts):
52 | return_ack_list = []
53 |
54 | for host in hosts:
55 | es = Elasticsearch(host)
56 | assert(es.indices.exists(index=source_index))
57 | # Delete existing alias index if it exists
58 | if es.indices.exists_alias(name=alias_index) == True:
59 | es.indices.delete_alias(index='_all', name=alias_index)
60 |
61 | return_ack_list.append(es.indices.put_alias(index=source_index, name=alias_index))
62 | return return_ack_list
63 |
64 |
65 | def insert_ig_followers(user_id, access_token, index, doc_type):
66 | return_ack_list = []
67 |
68 | num_followers = get_insta_data.get_followers(user_id, access_token)
69 | followers_insert_timestamp = datetime.datetime.utcnow().replace(microsecond=0).isoformat() + 'Z'
70 |
71 | for host in ELASTIC_HOSTS:
72 | es = Elasticsearch(host)
73 | return_ack_list.append(
74 | es.index(op_type='index', index=index, doc_type=doc_type,\
75 | body={"Type": "instagram", "Followers": num_followers, "Timestamp": followers_insert_timestamp}))
76 | return return_ack_list
77 |
78 |
79 | def put_fb_template(template_name, template_pattern, raw_fields_pattern, hosts):
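# Index template applied to indices matching template_pattern: the field named by raw_fields_pattern (e.g. 'Headline') gets a not_analyzed 'raw' sub-field for exact aggregations and an English-stemmed 'stemmed' sub-field for search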
80 | return_ack_list = []
81 | template_body = {
82 | "template" : template_pattern,
83 | "mappings" : {
84 | "_default_" : {
85 | "_all" : {"enabled" : True, "omit_norms" : True},
86 | "properties": {
87 | raw_fields_pattern: {
88 | "type": "string",
89 | "fielddata" : { "format" : "paged_bytes" },
90 | "fields": {
91 | "raw": {
92 | "type": "string",
93 | "index": "not_analyzed"
94 | },
95 | "stemmed": {
96 | "type": "string",
97 | "fielddata" : { "format" : "paged_bytes" },
98 | "analyzer": "english"
99 | }
100 | }
101 | }
102 | }
103 | }
104 | }
105 | }
106 | for host in hosts:
107 | es = Elasticsearch(host)
108 | return_ack_list.append(es.indices.put_template(name=template_name, body=template_body, create=False))
109 | return return_ack_list
110 |
111 |
112 | def is_date_string(date_string):
113 | try:
114 | date_object = datetime.datetime.strptime(date_string, '%Y-%m-%d')
115 | return True
116 | except ValueError as e:
117 | return False
118 |
119 |
120 | def ig_main(local_from_date):
121 | instagram_doc_type = 'instagram-media-endpoint'
122 |
123 | local_now = datetime.datetime.now()
124 | index_suffix = local_now.strftime('%Y%m%d-%H%M')
125 | instagram_index = 'instagram-' + index_suffix
126 | instagram_index_alias = 'instagram'
127 |
128 | api_scraped_rows = get_insta_data.scrape_insta_items(MY_INSTA_USER_ID, local_from_date, MY_INSTA_TOKEN)
129 | posts_with_views = get_insta_data.append_views(api_scraped_rows)
130 | posts_with_views_impressions = get_insta_data.append_social_analytics(posts_with_views)
131 |
132 | # Create request
133 | action_data_string = create_bulk_req_elastic(posts_with_views_impressions, instagram_index, instagram_doc_type, 'Post ID')
134 | print "\nInserting {} documents into Elasticsearch at {}".format(str(len(posts_with_views_impressions)), ELASTIC_HOSTS)
135 |
136 | # Insert documents via Bulk API
137 | insert_acks = insert_bulk_elastic(action_data_string, ELASTIC_HOSTS)
138 | if all(response.get('errors') == False for response in insert_acks):
139 | print "Success"
140 | else:
141 | print "Errors occurred with new index {}".format(instagram_index_alias)
142 | for host_el in insert_acks:
143 | for el in host_el['items']:
144 | if el.get('index').get('error') is not None:
145 | print "_id: " + el.get('index').get('_id')
146 | print el.get('index').get('error')
147 | #sys.exit()
148 |
149 | # Redirect alias so Kibana picks up latest snapshot
150 | print "\nUpdating Instagram alias"
151 | update_alias_acks = update_alias(instagram_index, instagram_index_alias, ELASTIC_HOSTS)
152 | if all(response.get('acknowledged') == True for response in update_alias_acks):
153 | print "Success. {} points to {}".format(instagram_index_alias, instagram_index)
154 | else:
155 | print "\nFailed to update Instagram alias"
156 |
157 | # Also push in Instagram followers
158 | # Instagram Followers
159 | print "\nGetting Instagram followers"
160 | followers_index = 'followers'
161 | followers_doctype_ig = 'instagram'
162 | insert_followers_acks = insert_ig_followers(MY_INSTA_USER_ID, MY_INSTA_TOKEN, followers_index, followers_doctype_ig)
163 |
164 | if all(response.get('created') == True for response in insert_followers_acks):
165 | print "Success. Inserted followers into Elasticsearch at {}".format(ELASTIC_HOSTS)
166 | else:
167 | print "\nFailed to insert followers into Elasticsearch at {}".format(ELASTIC_HOSTS)
168 |
169 |
170 | def fb_main(local_from_date):
171 | facebook_video_doctype = 'facebook-video-endpoint'
172 | facebook_post_doctype = 'facebook-post-endpoint'
173 |
174 | utc_now = datetime.datetime.utcnow()
175 | utc_posix_until_date = calendar.timegm(utc_now.timetuple())
176 |
177 | index_suffix = datetime.datetime.now().strftime('%Y%m%d-%H%M')
178 | facebook_index = 'facebook-' + index_suffix
179 | facebook_index_alias = 'facebook'
180 |
181 | print "Processing Videos"
182 | fb_video_data = get_fb_data.scrape_fb_pages_items(OWNED_PAGES_TOKENS.keys(), local_from_date, utc_posix_until_date, get_fb_data.get_fb_page_video_data, get_fb_data.process_fb_page_video_all_metrics)
183 | print "\nProcessing Posts"
184 | fb_post_data = get_fb_data.scrape_fb_pages_items(OWNED_PAGES_TOKENS.keys(), local_from_date, utc_posix_until_date, get_fb_data.get_fb_page_post_data, get_fb_data.process_fb_page_post)
185 |
186 | print "\nInserting {} post documents and {} video documents into Elasticsearch at {}".format(str(len(fb_post_data)), str(len(fb_video_data)), ELASTIC_HOSTS)
187 | for host in ELASTIC_HOSTS:
188 | action_data_string_video = create_bulk_req_elastic(fb_video_data, facebook_index, facebook_video_doctype, 'Video ID')
189 | action_data_string_post = create_bulk_req_elastic(fb_post_data, facebook_index, facebook_post_doctype, 'Post ID')
190 |
191 | # Insert video documents via Bulk API
192 | insert_acks_video = insert_bulk_elastic(action_data_string_video, ELASTIC_HOSTS)
193 | # Insert post documents via Bulk API
194 | insert_acks_post = insert_bulk_elastic(action_data_string_post, ELASTIC_HOSTS)
195 |
196 | if all(response.get('errors') == False for response in insert_acks_video + insert_acks_post):
197 | print "Success"
198 | else:
199 | print "Errors occurred for new index {}".format(facebook_index_alias)
200 | for host_el in insert_acks_video + insert_acks_post:
201 | for el in host_el['items']:
202 | if el.get('index').get('error') is not None:
203 | print "_id: " + el.get('index').get('_id')
204 | print el.get('index').get('error')
205 | #sys.exit()
206 |
207 | print "\nUpdating Facebook alias"
208 | update_alias_acks = update_alias(facebook_index, facebook_index_alias, ELASTIC_HOSTS)
209 | if all(response.get('acknowledged') == True for response in update_alias_acks):
210 | print "Success. {} points to {}".format(facebook_index_alias, facebook_index)
211 | else:
212 | print "\nFailed to update Facebook alias"
213 |
214 |
215 | if __name__ == '__main__':
216 |
217 | if len(sys.argv) != 3 or not is_date_string(sys.argv[2]):
218 | print "Usage: python {} [fb|ig] [from-date yyyy-mm-dd]".format(sys.argv[0])
219 | sys.exit()
220 | else:
221 | local_from_date = datetime.datetime.strptime(sys.argv[2], '%Y-%m-%d')
222 |
223 | # Verify ES clusters are reachable
224 | for host in ELASTIC_HOSTS:
225 | es = Elasticsearch(host)
226 |
227 | try:
228 | if es.ping() == False:
229 | print "{} is not reachable".format(host)
230 | sys.exit()
231 | except ConnectionError:
232 | print "{} is not reachable".format(host)
233 | sys.exit()
234 |
235 | # Only need to put a template in once, but little harm in overwriting
236 | put_fb_template('facebook_template', 'facebook-*', 'Headline', ELASTIC_HOSTS)
237 |
238 | if sys.argv[1] == 'fb':
239 | fb_main(local_from_date)
240 | elif sys.argv[1] == 'ig':
241 | ig_main(local_from_date)
--------------------------------------------------------------------------------