├── README.md
├── facebook_output
│   └── .gitignore
├── get_fb_data.py
├── res
│   ├── page_handle_location.png
│   ├── sample_output_owned_posts.png
│   ├── sample_run.gif
│   └── the_matrix.png
└── social_elastic.py

/README.md:
--------------------------------------------------------------------------------
1 | # Facebook Scraper For Multiple Pages
2 | 
3 | Scrape **multiple** public Facebook pages en masse to generate social analytics. Automatically fetch more detailed insights for owned pages' posts and videos to see which content resonates.
4 | Go all the way back in time, or specify two dates, and watch the goodness happen at lightning speed via multi-threading!
5 | 
6 |

7 | 
8 | ## Distinguishing Features
9 | 
10 | - **Multi-threaded** for rapid data collection from **multiple pages _simultaneously_** (as many as you want!)
11 | - Collect detailed performance metrics on multiple owned business pages **automatically** via the [insights](https://developers.facebook.com/docs/graph-api/reference/v2.11/insights) and [video insights](https://developers.facebook.com/docs/graph-api/reference/video/video_insights/) endpoints
12 | - Retrieve the number of public shares of a link by *anyone* across Facebook via the [URL Object](https://developers.facebook.com/docs/graph-api/reference/v2.11/url/)
13 | - **Custom** metrics computed:
14 |   - *Impression Rate Non-Likers (%)*: explore the **virality** of your posts outside your typical audience
15 |   - *Engagement Rate*: (Shares + Reactions + Comments) / Total Unique Impressions
16 |   - *Adjusted Engagement Rate (%)* and *Adjusted CTR (%)*: normalise rates across pages of different audience sizes and account for the uncertainty of small samples (e.g. a 5/10 CTR ranks below a 100/200 CTR), as detailed by [Evan Miller](http://www.evanmiller.org/how-not-to-sort-by-average-rating.html)
17 | - Proper timezone handling
18 | 
19 | ![Sample Output](/res/sample_output_owned_posts.png?raw=true "Sample Output")
20 | 
21 | ## What can be collected from public page posts?
22 | 
23 | Post ID, Publish Date, Post Type, Headline, Shares, Reactions, Comments, Caption, Link
24 | 
25 | ... and optionally, at a performance cost:
26 | Public Shares, Likes, Loves, Wows, Hahas, Sads, Angrys
27 | 
28 | ## What is *additionally* collected from owned page posts?
29 | 
30 | **Posts**
31 | 
32 | Video Views, Unique Impressions, Impression Rate Non-Likers (%), Unique Link Clicks, CTR (%), Adjusted CTR (%), Engagement Rate (%), Adjusted Engagement Rate (%), Hide Rate (%), Hide Clicks, Hide All Clicks, Paid Unique Impressions, Organic Unique Impressions
33 | 
34 | **Videos**
35 | 
36 | Live Video, Crossposted Video, 3s Views, 10s Views, Complete Views, Total Paid Views, 10s/3s Views (%), Complete/3s Views (%), Impressions, Impression Rate Non-Likers (%), Avg View Time
37 | 
38 | 
39 | ## Setup
40 | 
41 | **1)** Add the page names you want to scrape inside `PAGE_IDS_TO_SCRAPE`
42 | 
43 | Use each page's @handle as it appears in its URL (e.g. 'vicenews' below); a minimal example follows this section.
44 | 
45 | 
46 |

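For example, the list near the top of `get_fb_data.py` might look like the following sketch (the handles shown are among the defaults shipped with the script; swap in your own):

```python
# In get_fb_data.py: handles exactly as they appear in each page's URL
PAGE_IDS_TO_SCRAPE = [
    'vicenews',
    'nytimes',
    'bbcnews',
    # ...add as many pages as you like; one scraping thread is spawned per page
]
```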
47 | 
48 | **2)** Grab your own *temporary* user token [here](https://developers.facebook.com/tools/explorer) and place it inside `OWNED_PAGES_TOKENS`:
49 | **Get Token -> Get User Token -> Get Access Token**
50 | 
51 | `OWNED_PAGES_TOKENS` is the dictionary that stores the token(s) needed to scrape public data. If a token is a [**permanent token**](https://stackoverflow.com/a/28418469) for a business page, it is also used to scrape that page's private data, provided the page appears in `PAGE_IDS_TO_SCRAPE` and its key in this dictionary is spelt identically.
52 | 
53 | **3)** Install the Python dependencies with `pip install requests scipy pandas`
54 | 
55 | **N.B.** OS X users should first install [Homebrew](https://brew.sh/), then Python via `brew install python`
56 | 
57 | 
58 | 
59 | ## Execution
60 | Specify the number of days back from the present:
61 | 
62 | `python get_fb_data.py post 5` (public & owned pages)
63 | `python get_fb_data.py video 5` (owned pages only, for video-specific data)
64 | 
65 | Or specify two dates (inclusive) in yyyy-mm-dd format:
66 | 
67 | `python get_fb_data.py post yyyy-mm-dd yyyy-mm-dd`
68 | `python get_fb_data.py video yyyy-mm-dd yyyy-mm-dd`
69 | 
70 | A timestamped CSV file is written to the `facebook_output` folder by default; an example of loading it follows below.
71 | 
72 |
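As a rough sketch (the `posts_*.csv` filename pattern and the column names match what `scrape_posts_to_csv` writes; everything else here is just one way to do it), the latest export can be pulled back into pandas for further analysis:

```python
import glob
import pandas as pd

# Pick the most recently written posts export, e.g. facebook_output/posts_18-01-31_09.30.00.csv
latest_csv = sorted(glob.glob('facebook_output/posts_*.csv'))[-1]
posts_df = pd.read_csv(latest_csv)

# Median shares and reactions per page, mirroring the summary printed to the terminal
print(posts_df.groupby('Page')[['Num Shares', 'Num Reactions']].median())
```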

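The scraper can also be imported and driven from other code; this is how `social_elastic.py` (see FYI below) consumes it. A minimal sketch, assuming the tokens and page list are already configured and that you run it from the repo root:

```python
import calendar
import datetime

import get_fb_data

# Posts from roughly the last day, returned as a list of dicts rather than written to CSV
local_from_date = datetime.datetime.now() - datetime.timedelta(days=1)
utc_posix_until_date = calendar.timegm(datetime.datetime.utcnow().timetuple())

rows = get_fb_data.scrape_fb_pages_items(
    get_fb_data.PAGE_IDS_TO_SCRAPE, local_from_date, utc_posix_until_date,
    get_fb_data.get_fb_page_post_data, get_fb_data.process_fb_page_post)
```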
73 | 74 | ## Credit 75 | Thanks to minimaxir and his [project](https://github.com/minimaxir/facebook-page-post-scraper) for showing me the ropes 76 | 77 | ## FYI 78 | Additional `social_elastic.py` used to scrape data **and** push to Elastic instance(s) via their [bulk api](https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html) 79 | -------------------------------------------------------------------------------- /facebook_output/.gitignore: -------------------------------------------------------------------------------- 1 | * 2 | !.gitignore 3 | 4 | # Images # 5 | ########## 6 | *.png 7 | 8 | # Slideshows # 9 | ############# 10 | *.pptx 11 | *.ppt 12 | 13 | # Spreadsheets # 14 | ############### 15 | *.csv 16 | 17 | -------------------------------------------------------------------------------- /get_fb_data.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import calendar 3 | import requests 4 | import time 5 | import sys 6 | import os 7 | import threading 8 | import Queue 9 | import math 10 | from dateutil import tz 11 | 12 | import pandas as pd 13 | from scipy.stats import norm 14 | 15 | 16 | # At the cost of performance, set these to true for more precise data: reaction type breakdown and public shares across Facebook 17 | # Overall reactions per post are already pulled 18 | GET_SPECIFIC_REACTIONS_BOOL = False 19 | GET_PUBLIC_SHARES_BOOL = False 20 | 21 | # Page IDs to be scraped, defined by page's Facebook handle. 22 | PAGE_IDS_TO_SCRAPE = [ 23 | 'nytimes', 24 | 'vicenews', 25 | 'bbcnews', 26 | 'TheSkimm', 27 | 'cnn', 28 | 'NBCNews', 29 | 'financialtimes', 30 | 'washingtonpost', 31 | 'theguardian', 32 | 'timesandsundaytimes', 33 | 'msnbc', 34 | 'CBSNews', 35 | 'TheIndependentOnline', 36 | 'ABCNews' 37 | ] 38 | 39 | # Additional personal metrics are pulled for owned pages in keys below who also exist in PAGE_IDS_TO_SCRAPE 40 | # Temporary token: https://developers.facebook.com/tools/explorer 41 | # Permanent/Business Page token: https://stackoverflow.com/questions/17197970/facebook-permanent-page-access-token/28418469#28418469 42 | OWNED_PAGES_TOKENS = { 43 | 'jpryda': os.environ['MY_TOKEN'], # Token as an environmental variable: export MY_TOKEN = 'abc-my-token' 44 | # 'MyPage1': 'my-hardcoded-token' # Hardcoded token 45 | } 46 | 47 | TIMEZONE = 'America/New_York' 48 | API_VERSION = '2.7' 49 | 50 | 51 | # Set display precision when printing Pandas dataframes 52 | pd.set_option('precision',1) 53 | # Deal with scientific notation 54 | pd.options.display.float_format = '{:20,.0f}'.format 55 | # Don't wrap dataframe when printing to console 56 | pd.set_option('display.expand_frame_repr', False) 57 | 58 | 59 | def request_until_succeed(url): 60 | max_attempts = 3 61 | attempts = 0 62 | success = False 63 | while success == False and attempts < max_attempts: 64 | attempts = attempts + 1 65 | try: 66 | response = requests.get(url) 67 | if response.status_code == 200: 68 | success = True 69 | except Exception as e: 70 | print e 71 | print 'Error for URL {} | {} | attempt {} of {}'.format(url, datetime.datetime.now(), attempts, max_attempts) 72 | if attempts == max_attempts: 73 | raise Exception('Failed after {} attempts | {}'.format(attempts, url)) 74 | time.sleep(3) 75 | return response 76 | 77 | 78 | # Handle non-ASCII characters when writing to csv 79 | def unicode_normalize(text): 80 | return text.translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0xa0:0x20 }).encode('utf-8') 81 | 82 | 83 | def 
get_fb_page_video_data(page_id, access_token, num_posts=100, until=''): 84 | base = 'https://graph.facebook.com/v{}'.format(API_VERSION) 85 | node = '/{}/videos'.format(page_id) 86 | fields = '/?fields=title,description,created_time,id,comments.limit(0).summary(true),likes.limit(0).summary(true),reactions.limit(0).summary(true),permalink_url,live_status,status' 87 | parameters = '&limit={}&access_token={}&until={}'.format(num_posts, access_token, until) 88 | url = base + node + fields + parameters 89 | 90 | data = request_until_succeed(url).json() 91 | return data 92 | 93 | 94 | def get_fb_page_post_data(page_id, access_token, num_posts=100, until=''): 95 | # Shares on videos must be grabbed from the /posts endpoint; unavailable from the /videos endpoint 96 | base = 'https://graph.facebook.com/v{}'.format(API_VERSION) 97 | node = '/{}/posts'.format(page_id) 98 | fields = '/?fields=message,link,created_time,type,name,id,comments.limit(0).summary(true),shares,reactions.limit(0).summary(true)' 99 | parameters = '&limit={}&access_token={}&until={}'.format(num_posts, access_token, until) 100 | url = base + node + fields + parameters 101 | 102 | data = request_until_succeed(url).json() 103 | return data 104 | 105 | 106 | def get_specific_reactions_for_post(status_id, access_token): 107 | # Reaction types are only accessible at an individual post's endpoint 108 | base = 'https://graph.facebook.com/v{}'.format(API_VERSION) 109 | node = '/{}'.format(status_id) 110 | reactions = '/?fields=' \ 111 | 'reactions.type(LIKE).limit(0).summary(total_count).as(like)'\ 112 | ',reactions.type(LOVE).limit(0).summary(total_count).as(love)'\ 113 | ',reactions.type(WOW).limit(0).summary(total_count).as(wow)'\ 114 | ',reactions.type(HAHA).limit(0).summary(total_count).as(haha)'\ 115 | ',reactions.type(SAD).limit(0).summary(total_count).as(sad)'\ 116 | ',reactions.type(ANGRY).limit(0).summary(total_count).as(angry)' 117 | parameters = '&access_token={}'.format(access_token) 118 | url = base + node + reactions + parameters 119 | 120 | data = request_until_succeed(url).json() 121 | return data 122 | 123 | 124 | def get_insights_for_post(object_id, access_token, fields, period='', since=''): 125 | base = 'https://graph.facebook.com/v{}'.format(API_VERSION) 126 | node = '/{}/insights/'.format(object_id) 127 | parameters = '?access_token={}&period={}&since={}&date_format=U'.format(access_token, period, since) 128 | url = base + node + fields + parameters 129 | 130 | data = request_until_succeed(url) 131 | if data is not None: 132 | return data.json() 133 | else: 134 | raise Exception('No Post Insights Data') 135 | 136 | 137 | def get_insights_for_video(video_id, access_token, period='lifetime'): 138 | base = 'https://graph.facebook.com/v{}'.format(API_VERSION) 139 | node = '/{}/video_insights'.format(video_id) 140 | fields = '' 141 | parameters = '?access_token={}&period={}'.format(access_token, period) 142 | url = base + node + fields + parameters 143 | 144 | data = request_until_succeed(url).json() 145 | return data 146 | 147 | 148 | def get_fb_url_shares_comments(access_token, url): 149 | # Remove pound signs from URL which mess up FB API 150 | url = url.replace('#','') 151 | base = 'https://graph.facebook.com/v{}'.format(API_VERSION) 152 | node = '' 153 | fields = '/?id={}'.format(url) 154 | parameters = '&access_token={}'.format(access_token) 155 | url = base + node + fields + parameters 156 | 157 | data = request_until_succeed(url).json() 158 | return data 159 | 160 | 161 | def get_insights_for_page(access_token, 
metrics, page_id, period, start_date, excl_end_date): 162 | base = 'https://graph.facebook.com/v{}'.format(FB_API_VERSION) 163 | node = '/{}/insights'.format(page_id) 164 | fields = '/{}'.format(metrics) 165 | period_string = 'period={}&since={}&until={}'.format(period, start_date, excl_end_date) 166 | parameters = '?{}&access_token={}'.format(period_string, access_token) 167 | 168 | url = base + node + fields + parameters 169 | data = request_until_succeed(url).json() 170 | 171 | return data 172 | 173 | 174 | # def posix_to_timezone(posix_int, to_timezone): 175 | # utc_datetime = datetime.utcfromtimestamp(posix_int) 176 | # from_zone = tz.gettz('UTC') 177 | # to_zone = tz.gettz(to_timezone) 178 | # to_datetime = utc_datetime.replace(tzinfo=from_zone).astimezone(to_zone) 179 | # return to_datetime.replace(tzinfo=None) #Remove timezone component to allow for comparison with local time 180 | 181 | 182 | # def posix_to_iso(posix_int): 183 | # return datetime.datetime.utcfromtimestamp(posix_int).strftime('%Y-%m-%dT%H:%M:%S+0000') 184 | 185 | 186 | def utc_to_timezone(utc_datetime_string, to_timezone): 187 | utc_datetime = datetime.datetime.strptime(utc_datetime_string,'%Y-%m-%dT%H:%M:%S+0000') 188 | from_zone = tz.gettz('UTC') 189 | to_zone = tz.gettz(to_timezone) 190 | est_datetime = utc_datetime.replace(tzinfo=from_zone).astimezone(to_zone) 191 | return est_datetime.replace(tzinfo=None) #Remove timezone component to allow for comparison with local time 192 | 193 | # Not used right now 194 | def utc_to_local(utc_datetime_string): 195 | utc_datetime = datetime.datetime.strptime(utc_datetime_string,'%Y-%m-%dT%H:%M:%S+0000') 196 | from_zone = tz.gettz('UTC') 197 | to_zone = tz.tzlocal() 198 | local_datetime = utc_datetime.replace(tzinfo=from_zone).astimezone(to_zone) 199 | return local_datetime.replace(tzinfo=None) #Remove timezone component to allow for comparison with local time 200 | 201 | # For specification of 'until' parameter at commandline 202 | def local_to_utc(local_date): 203 | from_zone = tz.tzlocal() 204 | to_zone = tz.gettz('UTC') 205 | utc_datetime = local_date.replace(tzinfo=from_zone).astimezone(to_zone) 206 | return utc_datetime.replace(tzinfo=None) #Remove timezone component to allow for comparison with local time 207 | 208 | ''' 209 | Calculate confidence interval lower bound as scoring system to balance balance proportion of successes (e.g. clicks) with the uncertainty of a small number 210 | i.e. ci_lower_bound(5, 10, 0.95) < ci_lower_bound(100, 200, 0.95). 
For more info see http://www.evanmiller.org/how-not-to-sort-by-average-rating.html 211 | ''' 212 | def ci_lower_bound(pos, n, confidence): 213 | if n == 0: 214 | return 0 215 | elif n > pos: 216 | z = norm.ppf((1-(1-confidence)/2), loc=0, scale=1) 217 | phat = float(pos)/n 218 | return (phat + z*z/(2*n) - z * math.sqrt((phat*(1-phat)+z*z/(4*n))/n)) / (1+z*z/n) 219 | else: 220 | return 0 221 | 222 | 223 | def process_fb_page_video(video, access_token, page_id): 224 | if video.get('status').get('video_status') == 'expired': 225 | return None 226 | 227 | timestamp = datetime.datetime.utcnow().replace(microsecond=0).isoformat() + '+0000' 228 | video_id = video['id'] 229 | utc_video_published = video['created_time'] 230 | 231 | video_title = None if 'title' not in video.keys() else unicode_normalize(video['title']).decode('utf-8','ignore').encode('utf-8') 232 | video_description = None if 'description' not in video.keys() else unicode_normalize(video['description']).decode('utf-8','ignore').encode('utf-8') 233 | video_permalink = video['permalink_url'] 234 | 235 | num_likes = 0 if 'likes' not in video else video['likes']['summary']['total_count'] 236 | num_reactions = 0 if 'reactions' not in video else video['reactions']['summary']['total_count'] 237 | num_comments = 0 if 'comments' not in video or video.get('comments').get('summary').get('total_count') is None else video['comments']['summary']['total_count'] 238 | 239 | live_boolean = False if video.get('live_status') is None else True 240 | 241 | # Set Insights default values if a competitor or a Facebook Live Video 242 | total_3s_views = None 243 | total_10s_views = None 244 | total_complete_views = None 245 | total_video_impressions = None 246 | total_video_avg_time_watched = None 247 | ten_three_s_ratio = None 248 | complete_three_s_ratio = None 249 | total_video_impressions_fan = None 250 | total_non_fan_impressions_rate = None 251 | total_video_views_paid = None 252 | 253 | # Get insights for videos iff they are our OWN and also NOT Live videos which have no data 254 | if page_id.lower() in [x.lower() for x in OWNED_PAGES_TOKENS.keys()]: 255 | video_insights = get_insights_for_video(video_id, access_token, 'lifetime') 256 | 257 | if len(video_insights['data']) > 0: 258 | for metric_result in video_insights['data']: 259 | if metric_result['name'] == 'total_video_views': 260 | total_3s_views = metric_result['values'][0]['value'] 261 | if metric_result['name'] == 'total_video_10s_views': 262 | total_10s_views = metric_result['values'][0]['value'] 263 | if metric_result['name'] == 'total_video_complete_views': 264 | total_complete_views = metric_result['values'][0]['value'] 265 | if metric_result['name'] == 'total_video_avg_time_watched': 266 | total_video_avg_time_watched = float(metric_result['values'][0]['value'])/1000 267 | if metric_result['name'] == 'total_video_impressions': 268 | total_video_impressions = metric_result['values'][0]['value'] 269 | if metric_result['name'] == 'total_video_impressions_fan': 270 | total_video_impressions_fan = metric_result['values'][0]['value'] 271 | if metric_result['name'] == 'total_video_views_paid': 272 | total_video_views_paid = metric_result['values'][0]['value'] 273 | 274 | total_non_fan_impressions = total_video_impressions - total_video_impressions_fan 275 | total_non_fan_impressions_rate = None if total_video_impressions == 0 else float(total_non_fan_impressions)/float(total_video_impressions) * 100 276 | ten_three_s_ratio = None if total_3s_views == 0 else 
float(total_10s_views)/float(total_3s_views) * 100 277 | complete_three_s_ratio = None if total_3s_views == 0 else float(total_complete_views)/float(total_3s_views) * 100 278 | engagement_rate = None if total_3s_views == 0 else float(num_reactions + num_comments)/float(total_3s_views) * 100 # Video endpoint doesn't have shares 279 | 280 | crossposted_boolean = True if total_3s_views is None and live_boolean is False else False 281 | 282 | scraped_row = { 283 | 'Page': page_id, 284 | 'Video ID': video_id, 285 | 'Published': utc_video_published, 286 | 'Live Video': live_boolean, 287 | 'Crossposted Video': crossposted_boolean, 288 | 'Headline': video_title, 289 | 'Caption': video_description, 290 | 'Num Likes': num_likes, 291 | 'Num Reactions': num_reactions, 292 | 'Num Comments': num_comments, 293 | '3s Views': total_3s_views, 294 | '10s Views': total_10s_views, 295 | 'Complete Views': total_complete_views, 296 | 'Total Paid Views': total_video_views_paid, 297 | '10s/3s Views (%)': ten_three_s_ratio, 298 | 'Complete/3s Views (%)': complete_three_s_ratio, 299 | 'Impressions': total_video_impressions, 300 | 'Impression Rate Non-Likers (%)': total_non_fan_impressions_rate, 301 | 'Avg View Time': total_video_avg_time_watched, 302 | 'Link': video_permalink, 303 | 'Timestamp': timestamp 304 | } 305 | return scraped_row 306 | 307 | 308 | def process_fb_page_video_all_metrics(video, access_token, page_id): 309 | timestamp = datetime.datetime.utcnow().replace(microsecond=0).isoformat() + '+0000' 310 | video_id = video['id'] 311 | video_title = None if 'title' not in video.keys() else unicode_normalize(video['title']).decode('utf-8','ignore').encode('utf-8') 312 | video_description = None if 'description' not in video.keys() else unicode_normalize(video['description']).decode('utf-8','ignore').encode('utf-8') 313 | utc_video_published = video['created_time'] 314 | video_permalink = video['permalink_url'] 315 | 316 | num_likes = 0 if 'likes' not in video else video['likes']['summary']['total_count'] 317 | num_reactions = 0 if 'reactions' not in video else video['reactions']['summary']['total_count'] 318 | num_comments = 0 if 'comments' not in video or video.get('comments').get('summary').get('total_count') is None else video['comments']['summary']['total_count'] 319 | 320 | live_boolean = False if video.get('live_status') is None else True 321 | 322 | scraped_row = { 323 | 'Page': page_id, 324 | 'Video ID': video_id, 325 | 'Published': utc_video_published, 326 | 'Live Video': live_boolean, 327 | 'Headline': video_title, 328 | 'Caption': video_description, 329 | 'Num Likes': num_likes, 330 | 'Num Reactions': num_reactions, 331 | 'Num Comments': num_comments, 332 | 'Link': video_permalink, 333 | 'Timestamp': timestamp 334 | } 335 | 336 | if page_id.lower() in [x.lower() for x in OWNED_PAGES_TOKENS.keys()]: 337 | video_insights = get_insights_for_video(video_id, access_token, 'lifetime') 338 | 339 | if len(video_insights['data']) > 0: 340 | 341 | for metric in video_insights['data']: 342 | 343 | # Define metric name and add to scraped_row 344 | metric_name = metric['name'].replace('.','') 345 | metric_value = metric['values'][0]['value'] 346 | # Elasticsearch doesn't accept periods within keys 347 | if isinstance(metric_value, dict): 348 | metric_value = { x.replace('.', ''): metric_value[x] for x in metric_value.keys() } 349 | scraped_row[metric_name] = metric_value 350 | 351 | # Unpack dicts of important metrics. !Actually Kibana unpacks these for us so unnecessary! 
352 | scraped_row['total_video_views_by_crossposted'] = scraped_row['total_video_views_by_distribution_type'].get('crossposted') 353 | scraped_row['total_video_views_by_page_owned'] = scraped_row['total_video_views_by_distribution_type'].get('page_owned') 354 | scraped_row['total_video_views_by_page_shared'] = scraped_row['total_video_views_by_distribution_type'].get('shared') 355 | #del scraped_row['total_video_views_by_distribution_type'] 356 | 357 | scraped_row['total_video_impressions_non_fan'] = scraped_row['total_video_impressions'] - scraped_row['total_video_impressions_fan'] 358 | scraped_row['total_non_fan_impressions_rate'] = None if scraped_row['total_video_impressions'] == 0 else float(scraped_row['total_video_impressions_non_fan'])/float(scraped_row['total_video_impressions']) * 100 359 | scraped_row['ten_three_s_ratio'] = None if scraped_row['total_video_views'] == 0 else float(scraped_row['total_video_10s_views'])/float(scraped_row['total_video_views']) * 100 360 | scraped_row['complete_three_s_ratio'] = None if scraped_row['total_video_views'] == 0 else float(scraped_row['total_video_complete_views'])/float(scraped_row['total_video_views']) * 100 361 | 362 | scraped_row['Crossposted Video'] = True if scraped_row.get('total_video_views') is None and live_boolean is False else False 363 | if scraped_row.get('total_video_views') is not None: 364 | scraped_row['Video Views'] = scraped_row['total_video_views'] 365 | #del scraped_row['total_video_views'] 366 | return scraped_row 367 | 368 | 369 | def process_fb_page_post(status, access_token, page_id): 370 | timestamp = datetime.datetime.utcnow().replace(microsecond=0).isoformat() + '+0000' 371 | status_id = status['id'] 372 | status_message = None if 'message' not in status.keys() else unicode_normalize(status['message']).decode('utf-8','ignore').encode('utf-8') 373 | post_title = None if 'name' not in status.keys() else unicode_normalize(status['name']).decode('utf-8','ignore').encode('utf-8') 374 | status_type = status['type'] 375 | status_link = None if 'link' not in status.keys() else unicode_normalize(status['link']) 376 | 377 | # Time needs special care since it's in UTC 378 | utc_status_published = status['created_time'] 379 | 380 | num_reactions = None if 'reactions' not in status else status['reactions']['summary']['total_count'] 381 | num_comments = None if 'comments' not in status or status.get('comments').get('summary').get('total_count') is None else status['comments']['summary']['total_count'] 382 | num_shares = None if 'shares' not in status else status['shares']['count'] 383 | 384 | 385 | num_likes = num_loves = num_wows = num_hahas = num_sads = num_angrys = None 386 | unique_link_clicks = None 387 | total_unique_impressions = None 388 | ctr = None 389 | post_video_views = None 390 | paid_unique_impressions = None 391 | non_fan_unique_impressions_rate = None 392 | hide_clicks = None 393 | hide_all_clicks = None 394 | hide_rate = None 395 | public_num_shares = None 396 | ctr_lb_confidence = None 397 | engagement_rate = None 398 | engage_lb_confidence = None 399 | organic_unique_impressions = None 400 | public_num_shares = None 401 | 402 | if (GET_PUBLIC_SHARES_BOOL): 403 | # Get number of shares across all of Facebook 404 | if status_link is not None: 405 | public_num_shares_comments = get_fb_url_shares_comments(access_token, status_link) 406 | if 'share' in public_num_shares_comments: 407 | public_num_shares = public_num_shares_comments.get('share').get('share_count') 408 | 409 | 410 | if 
(GET_SPECIFIC_REACTIONS_BOOL): 411 | # Reactions only exists after implementation date: http://newsroom.fb.com/news/2016/02/reactions-now-available-globally/ 412 | reactions = get_specific_reactions_for_post(status_id, access_token) if utc_status_published > '2016-02-24 00:00:00' else {} 413 | num_likes = 0 if 'like' not in reactions else reactions['like']['summary']['total_count'] 414 | # Special case: Set number of Likes to Number of reactions for pre-reaction statuses 415 | num_likes = num_reactions if utc_status_published < '2016-02-24 00:00:00' else num_likes 416 | 417 | num_loves = 0 if 'love' not in reactions else reactions['love']['summary']['total_count'] 418 | num_wows = 0 if 'wow' not in reactions else reactions['wow']['summary']['total_count'] 419 | num_hahas = 0 if 'haha' not in reactions else reactions['haha']['summary']['total_count'] 420 | num_sads = 0 if 'sad' not in reactions else reactions['sad']['summary']['total_count'] 421 | num_angrys = 0 if 'angry' not in reactions else reactions['angry']['summary']['total_count'] 422 | 423 | 424 | # If not one of our own pages or a pesky cover photo 425 | if (page_id.lower() not in [x.lower() for x in OWNED_PAGES_TOKENS.keys()]) or (post_title is not None and 'cover photo' in post_title and status_type=='photo'): 426 | 427 | scraped_row = { 428 | 'Page': page_id, 429 | 'Published': utc_status_published, 430 | 'Num Shares': num_shares, 431 | 'Num Reactions': num_reactions, 432 | 'Type': status_type, 433 | 'Headline': post_title, 434 | 'Caption': status_message, 435 | 'Link': status_link, 436 | 'Num Likes': num_likes, 437 | 'Num Comments': num_comments, 438 | 'Num Loves': num_loves, 439 | 'Num Wows': num_wows, 440 | 'Num Hahas': num_hahas, 441 | 'Num Sads': num_sads, 442 | 'Num Angrys': num_angrys, 443 | 'Lifetime Public Num Shares': public_num_shares, 444 | 'Post ID': status_id, 445 | 'Timestamp': timestamp 446 | } 447 | return scraped_row 448 | 449 | # Iff one of our own pages, read insights too 450 | elif page_id.lower() in [x.lower() for x in OWNED_PAGES_TOKENS.keys()]: 451 | 452 | fields = 'post_consumptions_by_type_unique'\ 453 | ',post_impressions_by_paid_non_paid_unique'\ 454 | ',post_video_views'\ 455 | ',post_impressions_fan_unique'\ 456 | ',post_negative_feedback_by_type_unique' 457 | 458 | try: 459 | insights = get_insights_for_post(status_id, access_token, fields, 'lifetime') 460 | 461 | unique_link_clicks = 0 if 'link clicks' not in insights['data'][0]['values'][0]['value'] else insights['data'][0]['values'][0]['value'].get('link clicks') 462 | total_unique_impressions = insights['data'][1]['values'][0]['value'].get('total') 463 | ctr = None if total_unique_impressions == 0 else (float(unique_link_clicks)/float(total_unique_impressions)) * 100 464 | ctr_lb_confidence = None if status_type != 'link' else ci_lower_bound(unique_link_clicks, total_unique_impressions, 0.95) * 100 465 | 466 | paid_unique_impressions = insights['data'][1]['values'][0]['value'].get('paid') 467 | organic_unique_impressions = insights['data'][1]['values'][0]['value'].get('unpaid') 468 | post_video_views = insights['data'][2]['values'][0]['value'] 469 | fan_unique_impressions = insights['data'][3]['values'][0]['value'] 470 | non_fan_unique_impressions = total_unique_impressions - fan_unique_impressions 471 | non_fan_unique_impressions_rate = None if total_unique_impressions == 0 else (float(non_fan_unique_impressions)/float(total_unique_impressions)) * 100 472 | hide_clicks = 0 if 'hide_clicks' not in insights['data'][4]['values'][0]['value'] else 
insights['data'][4]['values'][0]['value'].get('hide_clicks') 473 | hide_all_clicks = 0 if 'hide_all_clicks' not in insights['data'][4]['values'][0]['value'] else insights['data'][4]['values'][0]['value'].get('hide_all_clicks') 474 | hide_rate = None if total_unique_impressions == 0 else (float(hide_clicks + hide_all_clicks)/float(total_unique_impressions)) * 100 475 | 476 | # Engagement Rate 477 | if num_shares is not None and num_reactions is not None and num_comments is not None: 478 | total_engagement = num_shares + num_reactions + num_comments 479 | if status_type != 'video': 480 | engagement_rate = None if total_unique_impressions == 0 else float(total_engagement)/float(total_unique_impressions) * 100 481 | engage_lb_confidence = ci_lower_bound(total_engagement, total_unique_impressions, 0.95) * 100 482 | if status_type == 'video': 483 | engagement_rate = None if post_video_views == 0 else float(total_engagement)/float(post_video_views) * 100 484 | engage_lb_confidence = ci_lower_bound(total_engagement, post_video_views, 0.95) * 100 485 | 486 | ## Counts of each reaction separately. Can comment out for speed's sake 487 | 488 | 489 | except Exception as e: 490 | print e 491 | 492 | scraped_row = { 493 | 'Page': page_id, 494 | 'Published': utc_status_published, 495 | 'Unique Impressions': total_unique_impressions, 496 | 'Paid Unique Impressions': paid_unique_impressions, 497 | 'Impression Rate Non-Likers (%)': non_fan_unique_impressions_rate, 498 | 'Unique Link Clicks': unique_link_clicks, 499 | 'CTR (%)': ctr, 500 | 'Adjusted CTR (%)': ctr_lb_confidence, 501 | 'Num Shares': num_shares, 502 | 'Num Reactions': num_reactions, 503 | 'Hide Rate (%)': hide_rate, 504 | 'Hide Clicks': hide_clicks, 505 | 'Hide All Clicks': hide_all_clicks, 506 | 'Type': status_type, 507 | 'Engagement Rate (%)': engagement_rate, 508 | 'Adjusted Engagement Rate (%)': engage_lb_confidence, 509 | 'Video Views': post_video_views, 510 | 'Headline': post_title.decode('utf-8','ignore').encode('utf-8') if post_title is not None else None, 511 | 'Caption': status_message.decode('utf-8','ignore').encode('utf-8') if status_message is not None else None, 512 | 'Link': status_link, 513 | 'Num Likes': num_likes, 514 | 'Num Comments': num_comments, 515 | 'Num Loves': num_loves, 516 | 'Num Wows': num_wows, 517 | 'Num Hahas': num_hahas, 518 | 'Num Sads': num_sads, 519 | 'Num Angrys': num_angrys, 520 | 'Lifetime Public Num Shares': public_num_shares, 521 | 'Post ID': status_id, 522 | 'Organic Unique Impressions': organic_unique_impressions, 523 | 'Timestamp': timestamp 524 | } 525 | return scraped_row 526 | 527 | 528 | def scrape_single_fb_page_items(page_id, from_date, until_date, access_token, scrape_function, process_item_function): 529 | num_processed = 0 # keep a count on how many we've processed 530 | scraped_rows_list = [] 531 | 532 | scrape_starttime = datetime.datetime.now() 533 | 534 | items = scrape_function(page_id, access_token, 100, until_date) 535 | if 'error' in items: 536 | print items['error'] 537 | return scraped_rows_list 538 | 539 | needs_next_page = True 540 | 541 | while needs_next_page: 542 | for item in items['data']: 543 | 544 | item_published = utc_to_timezone(item['created_time'], TIMEZONE) 545 | if item_published >= from_date: 546 | 547 | processed_item = process_item_function(item, access_token, page_id) 548 | if processed_item is not None: 549 | scraped_rows_list.append(processed_item) 550 | # output progress occasionally to make sure code is not stalling 551 | num_processed += 1 552 | if 
num_processed % 10 == 0: 553 | print '{} {} items Processed | {}'.format(num_processed, page_id, item_published.strftime('%Y-%m-%d %H:%M:%S')) 554 | else: 555 | needs_next_page = False 556 | # Else avoid processing items that fall before from_date in a single 'items run' 557 | break 558 | 559 | if needs_next_page and 'paging' in items.keys(): 560 | if 'next' in items['paging']: 561 | items = request_until_succeed(items['paging']['next']).json() 562 | else: 563 | needs_next_page = False 564 | else: 565 | needs_next_page = False 566 | 567 | print 'Finished Processing {} {} items! | {}'.format(num_processed, page_id, datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')) 568 | return scraped_rows_list 569 | 570 | 571 | def scrape_fb_pages_items(page_ids, from_date, until_date, scrape_function, process_item_function): 572 | # Define length of results for indexed access store page results in order of specification, rather than appending result from first thread to finish 573 | results = [None] * len(page_ids) 574 | 575 | # Create FIFO queue 576 | queue_page_ids = Queue.Queue() 577 | 578 | # Set number of threads to the number of pages to be scraped 579 | num_threads = len(page_ids) 580 | 581 | # Add items with their ordinal number to queue 582 | for idx, page_id in enumerate(page_ids): 583 | queue_page_ids.put((idx, page_id)) 584 | 585 | # Wrapper function to scrape_single_fb_page_items which pulls from queue and is able to assign return output to a variable in this scope 586 | def grab_page_from_queue(queue): 587 | while not queue.empty(): 588 | idx, page_id = queue.get() 589 | 590 | # Select appropriate access token based on page. Include some logic handling FB page capitalisations 591 | access_token = OWNED_PAGES_TOKENS.get(page_id) if OWNED_PAGES_TOKENS.get(page_id.lower()) is None else OWNED_PAGES_TOKENS.get(page_id.lower()) 592 | if access_token is None: 593 | # For competitors set default access token to use as arbitrary token in owned dict 594 | access_token = OWNED_PAGES_TOKENS.itervalues().next() 595 | 596 | results[idx] = scrape_single_fb_page_items(page_id, from_date, until_date, access_token, scrape_function, process_item_function) 597 | queue.task_done() 598 | 599 | 600 | t0 = datetime.datetime.now() 601 | 602 | # To avoid strptime multithreading bug where strptime isn't loaded completely by first thread but called by another thread; call it first here 603 | dummy = datetime.datetime.strptime(t0.strftime('%Y-%m-%d'), '%Y-%m-%d') 604 | 605 | for n in range(num_threads): 606 | # Configure thread action 607 | t_i = threading.Thread(target=grab_page_from_queue, args=[queue_page_ids]) 608 | # Must start threads in daemon mode to enable hard-kill 609 | t_i.setDaemon(True) 610 | t_i.start() 611 | 612 | ''' 613 | join() function (thread and queue objects) blocks main thread until and item is returned or task_done() 614 | thread.join(arg) takes a timeout argument whereas queue.join() does not and so no KEYBOARDINTERRUPTS allowed! 615 | Wrap Queue's join (no timeout argument) in designated terminator thread which HAS a timeout argument. 616 | Ctrl+C can then end Terminator and thus MainThread whereupon the Python Interpreter hard-kills all spawned 'daemon' threads 617 | ''' 618 | term = threading.Thread(target=queue_page_ids.join) 619 | term.setDaemon(True) 620 | term.start() 621 | # Terminator thread only stays alive when Queue's join() is running i.e. 
until natural completion once all queue elements have been processed 622 | while term.isAlive(): 623 | # Any large timeout number crucial 624 | term.join(timeout=360000000) 625 | 626 | t1 = datetime.datetime.now() 627 | 628 | if type(until_date) is datetime.datetime: 629 | end_date = until_date.strftime('%Y-%m-%d %H:%M:%S') 630 | else: 631 | end_date = datetime.datetime.fromtimestamp(until_date) 632 | 633 | print '\nDone!\n{} Facebook page(s) processed between {} and {} in {} second(s)'.format(len(page_ids), from_date.strftime('%Y-%m-%d %H:%M:%S'), end_date, (t1 - t0).seconds) 634 | 635 | scraped_rows_list = [item for sublist in results for item in sublist] 636 | return scraped_rows_list 637 | 638 | 639 | def scrape_posts_to_csv(page_ids, from_date, until_date, scrape_function, process_item_function): 640 | scraped_rows_list = scrape_fb_pages_items(page_ids, from_date, until_date, scrape_function, process_item_function) 641 | scraped_rows_df = pd.DataFrame(scraped_rows_list) 642 | 643 | # Convert UTC datetimes to EST 644 | scraped_rows_df['Published (EST)'] = [utc_to_timezone(x, TIMEZONE).strftime('%Y-%m-%d %H:%M:%S') for x in scraped_rows_df['Published']] 645 | 646 | csvColumns = ['Page', 'Published (EST)', 'Type', 'Headline', 'Unique Impressions', 'Impression Rate Non-Likers (%)', 'Unique Link Clicks', 'CTR (%)', 'Adjusted CTR (%)', 647 | 'Num Shares', 'Engagement Rate (%)', 'Adjusted Engagement Rate (%)', 'Lifetime Public Num Shares', 'Num Reactions', 'Video Views', 'Caption', 'Link', 'Num Likes', 648 | 'Num Comments', 'Num Loves', 'Num Wows', 'Num Hahas', 'Num Sads', 'Num Angrys', 'Hide Rate (%)', 'Hide Clicks', 'Hide All Clicks', 649 | 'Paid Unique Impressions', 'Organic Unique Impressions', 'Post ID'] 650 | 651 | scraped_rows_df = scraped_rows_df.round(1) 652 | csv_filename = './facebook_output/{}_{}.csv'.format('posts', datetime.datetime.now().strftime('%y-%m-%d_%H.%M.%S')) 653 | scraped_rows_df.to_csv(csv_filename, index=False, columns=csvColumns, encoding='utf-8') 654 | print csv_filename + ' written' 655 | 656 | # Output Summary to Terminal 657 | print '\nMedians:\n' 658 | print scraped_rows_df.ix[:,['Page', 'Num Shares', 'Num Reactions', 'Num Comments', 'Video Views', 'Impression Rate Non-Likers (%)', 'CTR (%)']].groupby('Page').median() 659 | # .sort_values(by='Num Shares', ascending=False) 660 | print '\nTotals:\n' 661 | print scraped_rows_df.ix[:,['Page', 'Num Shares', 'Num Reactions', 'Num Comments', 'Video Views']].groupby('Page').sum() 662 | # .sort_values(by='Num Shares', ascending=False) 663 | print '\n' 664 | 665 | # If called by daily/weekly insights OR Elasticsearch script 666 | if __name__ != '__main__': 667 | return scraped_rows_list 668 | 669 | 670 | def scrape_videos_to_csv(page_ids, from_date, until_date, scrape_function, process_item_function): 671 | scraped_rows_list = scrape_fb_pages_items(page_ids, from_date, until_date, scrape_function, process_item_function) 672 | scraped_rows_df = pd.DataFrame(scraped_rows_list) 673 | 674 | # Convert UTC datetimes to EST 675 | scraped_rows_df['Published (EST)'] = [utc_to_timezone(x, TIMEZONE).strftime('%Y-%m-%d %H:%M:%S') for x in scraped_rows_df['Published']] 676 | 677 | print '\nAverages:\n' 678 | print scraped_rows_df.ix[:,['Page', 'Num Reactions', 'Complete/3s Views (%)', '3s Views', 'Impression Rate Non-Likers (%)']].groupby('Page').describe(percentiles=[.5]).sort_values(by='Num Reactions', ascending=False) 679 | print '\nTotals:\n' 680 | print scraped_rows_df.ix[:,['Page', '3s Views', 'Num 
Reactions']].groupby('Page').sum().sort_values(by='Num Reactions', ascending=False) 681 | print '\n' 682 | 683 | # We set ordering of csv columns here 684 | csvColumns = ['Page', 'Video ID', 'Published (EST)', 'Live Video', 'Crossposted Video', 'Headline', 'Caption', 'Num Likes', 'Num Reactions', 'Num Comments', '3s Views', 685 | '10s Views', 'Complete Views', 'Total Paid Views', '10s/3s Views (%)', 'Complete/3s Views (%)', 'Impressions', 686 | 'Impression Rate Non-Likers (%)', 'Avg View Time', 'Link'] 687 | 688 | scraped_rows_df = scraped_rows_df.round(1) 689 | csv_filename = './facebook_output/{}_{}.csv'.format('videos', datetime.datetime.now().strftime('%y-%m-%d_%H.%M.%S')) 690 | scraped_rows_df.to_csv(csv_filename, index=False, columns=csvColumns, encoding='utf-8') 691 | print csv_filename + ' written' 692 | 693 | if __name__ != '__main__': 694 | return scraped_rows_list 695 | 696 | 697 | def print_usage(): 698 | print '\nUsage:\n python {0} \n e.g. for posts since yesterday midnight:'\ 699 | ' python {0} post 1\n'\ 700 | ' python {0} where dates are inclusive and in format yyyy-mm-dd'\ 701 | '\nCtrl+C to cancel\n'.format(sys.argv[0]) 702 | 703 | 704 | def is_date_string(date_string): 705 | try: 706 | date_object = datetime.datetime.strptime(date_string, '%Y-%m-%d') 707 | return True 708 | except ValueError as e: 709 | return False 710 | 711 | 712 | if __name__ == '__main__': 713 | 714 | if len(sys.argv) == 3: 715 | # Option 1: Simply specify number of days back and scrape until now: 716 | if sys.argv[2].isdigit(): 717 | num_days_back = int(sys.argv[2]) 718 | local_now = datetime.datetime.now() 719 | today = datetime.datetime(year=local_now.year, month=local_now.month, day=local_now.day, hour=0, minute=0, second=0) 720 | local_from_date = today + datetime.timedelta(days=-num_days_back) 721 | # Facebook's until parameter takes POSIX to include time component 722 | utc_now = datetime.datetime.utcnow() 723 | utc_posix_until_date = calendar.timegm(utc_now.timetuple()) 724 | else: 725 | print_usage() 726 | sys.exit() 727 | elif len(sys.argv) == 4: 728 | # Option 2: Specify two inclusive dates in format YYYY-mm-dd 729 | if is_date_string(sys.argv[2]) and is_date_string(sys.argv[3]): 730 | local_from_date = datetime.datetime.strptime(sys.argv[2], '%Y-%m-%d') 731 | local_until_date = datetime.datetime.strptime(sys.argv[3], '%Y-%m-%d') 732 | # Add a day so Facebook includes whole day itself and transform to POSIX to ensure time component is included (normalized EST is NOT normalized UTC) 733 | utc_until_date = local_to_utc(local_until_date + datetime.timedelta(days = 1)) 734 | utc_posix_until_date = calendar.timegm(utc_until_date.timetuple()) 735 | if local_from_date > local_until_date: 736 | print '\n Start date is AFTER the end date' 737 | print_usage() 738 | sys.exit() 739 | else: 740 | print_usage() 741 | sys.exit() 742 | # Until date is a string (used in API call). From date is datetime object used to check paging 743 | if sys.argv[1] == 'post': 744 | scrape_posts_to_csv(PAGE_IDS_TO_SCRAPE, local_from_date, utc_posix_until_date, get_fb_page_post_data, process_fb_page_post) 745 | # Scrape OUR OWN crossposted videos using the /videos endpoint. These don't include shares, but video POSTS do include shares! 
746 | elif sys.argv[1] == 'video': 747 | scrape_videos_to_csv(OWNED_PAGES_TOKENS.keys(), local_from_date, utc_posix_until_date, get_fb_page_video_data, process_fb_page_video) 748 | else: 749 | print_usage() 750 | sys.exit() -------------------------------------------------------------------------------- /res/page_handle_location.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jpryda/facebook-multi-scraper/3b77aceae3015cd36df8f53f59d6050650618be2/res/page_handle_location.png -------------------------------------------------------------------------------- /res/sample_output_owned_posts.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jpryda/facebook-multi-scraper/3b77aceae3015cd36df8f53f59d6050650618be2/res/sample_output_owned_posts.png -------------------------------------------------------------------------------- /res/sample_run.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jpryda/facebook-multi-scraper/3b77aceae3015cd36df8f53f59d6050650618be2/res/sample_run.gif -------------------------------------------------------------------------------- /res/the_matrix.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jpryda/facebook-multi-scraper/3b77aceae3015cd36df8f53f59d6050650618be2/res/the_matrix.png -------------------------------------------------------------------------------- /social_elastic.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import json 4 | import datetime 5 | import calendar 6 | import time 7 | 8 | from elasticsearch import Elasticsearch, TransportError, ConnectionError, ConnectionTimeout 9 | import get_fb_data 10 | import get_insta_data 11 | 12 | # Facebook Globals 13 | OWNED_PAGES_TOKENS = { 14 | "MyPage1": os.environ['PAGE1_FB_PERM_TOKEN'], 15 | "MyPage2": os.environ['PAGE2_FB_PERM_TOKEN'], 16 | "MyPage3": os.environ['PAGE3_FB_PERM_TOKEN'] 17 | } 18 | 19 | # Instagram Globals 20 | MY_INSTA_TOKEN = os.environ['MY_INSTA_TOKEN'] 21 | MY_INSTA_USER_ID = MY_INSTA_TOKEN.split('.')[0] 22 | #ELASTIC_HOSTS = [os.environ['ELASTIC_HOST_DEV'], os.environ['ELASTIC_HOST_PROD'], os.environ['ELASTIC_HOST_PROD2']] 23 | ELASTIC_HOSTS = [os.environ['ELASTIC_HOST_PROD2']] 24 | #ELASTIC_HOSTS = [os.environ['ELASTIC_HOST_DEV']] 25 | 26 | 27 | def create_bulk_req_elastic(json_data, index, doc_type, id_field): 28 | action_data_string = "" 29 | for i, json_post in enumerate(json_data): 30 | index_action = {"index":{"_index":index, "_type":doc_type, "_id":json_post[id_field]}} 31 | action_data_string += json.dumps(index_action, separators=(',', ':')) + '\n' + json.dumps(json_post, separators=(',', ':')) + '\n' 32 | return action_data_string 33 | 34 | 35 | def insert_bulk_elastic(action_data_string, hosts): 36 | return_ack_list = [] 37 | for host in hosts: 38 | es = Elasticsearch(host) 39 | success = False 40 | while success == False: 41 | try: 42 | return_ack_list.append(es.bulk(body=action_data_string)) 43 | success = True 44 | except (ConnectionError, ConnectionTimeout, TransportError) as e: 45 | print e 46 | print "\nRetrying in 3 seconds" 47 | time.sleep(3) 48 | return return_ack_list 49 | 50 | 51 | def update_alias(source_index, alias_index, hosts): 52 | return_ack_list = [] 53 | 54 | for host in hosts: 55 | es = Elasticsearch(host) 56 | 
assert(es.indices.exists(index=source_index)) 57 | # Delete existing alias index if it exists 58 | if es.indices.exists_alias(name=alias_index) == True: 59 | es.indices.delete_alias(index='_all', name=alias_index) 60 | 61 | return_ack_list.append(es.indices.put_alias(index=source_index, name=alias_index)) 62 | return return_ack_list 63 | 64 | 65 | def insert_ig_followers(user_id, access_token, index, doc_type): 66 | return_ack_list = [] 67 | 68 | num_followers = get_insta_data.get_followers(user_id, access_token) 69 | followers_insert_timestamp = datetime.datetime.utcnow().replace(microsecond=0).isoformat() + 'Z' 70 | 71 | for host in ELASTIC_HOSTS: 72 | es = Elasticsearch(host) 73 | return_ack_list.append( 74 | es.index(op_type='index', index='followers', doc_type='instagram',\ 75 | body={"Type": "instagram", "Followers": num_followers, "Timestamp": followers_insert_timestamp})) 76 | return return_ack_list 77 | 78 | 79 | def put_fb_template(template_name, template_pattern, raw_fields_pattern, hosts): 80 | return_ack_list = [] 81 | template_body = { 82 | "template" : template_pattern, 83 | "mappings" : { 84 | "_default_" : { 85 | "_all" : {"enabled" : True, "omit_norms" : True}, 86 | "properties": { 87 | raw_fields_pattern: { 88 | "type": "string", 89 | "fielddata" : { "format" : "paged_bytes" }, 90 | "fields": { 91 | "raw": { 92 | "type": "string", 93 | "index": "not_analyzed" 94 | }, 95 | "stemmed": { 96 | "type": "string", 97 | "fielddata" : { "format" : "paged_bytes" }, 98 | "analyzer": "english" 99 | } 100 | } 101 | } 102 | } 103 | } 104 | } 105 | } 106 | for host in hosts: 107 | es = Elasticsearch(host) 108 | return_ack_list.append(es.indices.put_template(name=template_name, body=template_body, create=False)) 109 | return return_ack_list 110 | 111 | 112 | def is_date_string(date_string): 113 | try: 114 | date_object = datetime.datetime.strptime(date_string, '%Y-%m-%d') 115 | return True 116 | except ValueError as e: 117 | return False 118 | 119 | 120 | def ig_main(local_from_date): 121 | instagram_doc_type = 'instagram-media-endpoint' 122 | 123 | local_now = datetime.datetime.now() 124 | index_suffix = local_now.strftime('%Y%m%d-%H%M') 125 | instagram_index = 'instagram-' + index_suffix 126 | instagram_index_alias = 'instagram' 127 | 128 | api_scraped_rows = get_insta_data.scrape_insta_items(MY_INSTA_USER_ID, local_from_date, MY_INSTA_TOKEN) 129 | posts_with_views = get_insta_data.append_views(api_scraped_rows) 130 | posts_with_views_impressions = get_insta_data.append_social_analytics(posts_with_views) 131 | 132 | # Create request 133 | action_data_string = create_bulk_req_elastic(posts_with_views_impressions, instagram_index, instagram_doc_type, 'Post ID') 134 | print "\nInserting {} documents into Elasticsearch at {}".format(str(len(posts_with_views_impressions)), ELASTIC_HOSTS) 135 | 136 | # Insert documents via Bulk API 137 | insert_acks = insert_bulk_elastic(action_data_string, ELASTIC_HOSTS) 138 | if all(response.get('errors') == False for response in insert_acks): 139 | print "Success" 140 | else: 141 | print "Errors occured with new index {}".format(instagram_index_alias) 142 | for host_el in insert_acks: 143 | for el in host_el['items']: 144 | if el.get('index').get('error') is not None: 145 | print "_id: " + el.get('index').get('_id') 146 | print el.get('index').get('error') 147 | #sys.exit() 148 | 149 | # Redirect alias so Kibana picks up latest snapshot 150 | print "\nUpdating Instagram alias" 151 | update_alias_acks = update_alias(instagram_index, instagram_index_alias, 
ELASTIC_HOSTS) 152 | if all(response.get('acknowledged') == True for response in update_alias_acks): 153 | print "Success. {} points to {}".format(instagram_index_alias, instagram_index) 154 | else: 155 | print "\nFailed to update Instagram alias" 156 | 157 | # Also push in Instagram followers 158 | # Instagram Followers 159 | print "\nGetting Instagram followers" 160 | followers_index = 'follwers' 161 | followers_doctype_ig = 'instagram' 162 | insert_followers_acks = insert_ig_followers(MY_INSTA_USER_ID, MY_INSTA_TOKEN, followers_index, followers_doctype_ig) 163 | 164 | if all(response.get('acknowledged') == True for response in update_alias_acks): 165 | print "Success. Inserted followers into Elasticsearch at {}".format(ELASTIC_HOSTS) 166 | else: 167 | print "\nFailed to insert followers into Elasticsearch at {}".format(ELASTIC_HOSTS) 168 | 169 | 170 | def fb_main(local_from_date): 171 | facebook_video_doctype = 'facebook-video-endpoint' 172 | facebook_post_doctype = 'facebook-post-endpoint' 173 | 174 | utc_now = datetime.datetime.utcnow() 175 | utc_posix_until_date = calendar.timegm(utc_now.timetuple()) 176 | 177 | index_suffix = datetime.datetime.now().strftime('%Y%m%d-%H%M') 178 | facebook_index = 'facebook-' + index_suffix 179 | facebook_index_alias = 'facebook' 180 | 181 | print "Processing Videos" 182 | fb_video_data = get_fb_data.scrape_fb_pages_items(OWNED_PAGES_TOKENS.keys(), local_from_date, utc_posix_until_date, get_fb_data.get_fb_page_video_data, get_fb_data.process_fb_page_video_all_metrics) 183 | print "\nProcessing Posts" 184 | fb_post_data = get_fb_data.scrape_fb_pages_items(OWNED_PAGES_TOKENS.keys(), local_from_date, utc_posix_until_date, get_fb_data.get_fb_page_post_data, get_fb_data.process_fb_page_post) 185 | 186 | print "\nInserting {} post documents and {} video documents into Elasticsearch at {}".format(str(len(fb_post_data)), str(len(fb_video_data)), ELASTIC_HOSTS) 187 | for host in ELASTIC_HOSTS: 188 | action_data_string_video = create_bulk_req_elastic(fb_video_data, facebook_index, facebook_video_doctype, 'Video ID') 189 | action_data_string_post = create_bulk_req_elastic(fb_post_data, facebook_index, facebook_post_doctype, 'Post ID') 190 | 191 | # Insert video documents via Bulk API 192 | insert_acks_video = insert_bulk_elastic(action_data_string_video, ELASTIC_HOSTS) 193 | # Insert post documents via Bulk API 194 | insert_acks_post = insert_bulk_elastic(action_data_string_post, ELASTIC_HOSTS) 195 | 196 | if all(response.get('errors') == False for response in insert_acks_video + insert_acks_post): 197 | print "Success" 198 | else: 199 | print "Errors occured for new index {}".format(facebook_index_alias) 200 | for host_el in insert_acks_video + insert_acks_post: 201 | for el in host_el['items']: 202 | if el.get('index').get('error') is not None: 203 | print "_id: " + el.get('index').get('_id') 204 | print el.get('index').get('error') 205 | #sys.exit() 206 | 207 | print "\nUpdating Facebook alias" 208 | update_alias_acks = update_alias(facebook_index, facebook_index_alias, ELASTIC_HOSTS) 209 | if all(response.get('acknowledged') == True for response in update_alias_acks): 210 | print "Success. 
{} points to {}".format(facebook_index_alias, facebook_index)
211 |     else:
212 |         print "\nFailed to update Facebook alias"
213 | 
214 | 
215 | if __name__ == '__main__':
216 | 
217 |     if len(sys.argv) != 3 or not is_date_string(sys.argv[2]):
218 |         print "Usage: python {} <fb|ig> <yyyy-mm-dd>".format(sys.argv[0])
219 |         sys.exit()
220 |     else:
221 |         local_from_date = datetime.datetime.strptime(sys.argv[2], '%Y-%m-%d')
222 | 
223 |     # Verify ES clusters are reachable
224 |     for host in ELASTIC_HOSTS:
225 |         es = Elasticsearch(host)
226 | 
227 |         try:
228 |             if es.ping() == False:
229 |                 print "{} is not reachable".format(host)
230 |                 sys.exit()
231 |         except ConnectionError:
232 |             print "{} is not reachable".format(host)
233 |             sys.exit()
234 | 
235 |     # Only need to put a template in once, but little harm in overwriting
236 |     put_fb_template('facebook_template', 'facebook-*', 'Headline', ELASTIC_HOSTS)
237 | 
238 |     if sys.argv[1] == 'fb':
239 |         fb_main(local_from_date)
240 |     elif sys.argv[1] == 'ig':
241 |         ig_main(local_from_date)
--------------------------------------------------------------------------------