├── README.md ├── extract-posts-mac.scpt └── fbscraper.py /README.md: -------------------------------------------------------------------------------- 1 | # FBScraper 2 | 3 | A simple script that allows extracting, viewing and converting posts on Facebook pages, groups or search quiries to CSV for further analysis. Requires no API access. 4 | 5 | ## Description 6 | 7 | It is currently difficult to extract data from Facebook since API access is limited and prevents access to key information such as authors' names, reactions, etc. The data does exist there on the various pages but it is quite time consuming to extract the data manually. So this script does the job for you. All you need to do is download the www.facebook.com.har file containing the data you need to analyze using Inspect (details below) and let the script do the rest. 8 | 9 | [Optional] In addition to the script, there is another Apple Script script that can automate the process of extracting the relevant post entries from facebook and run the python script automatically afterwords. Note that this would only work on a Mac OS computer. 10 | 11 | ## Dependencies 12 | 13 | 1) Python 3+ 14 | 2) [optional] appleScript if you wish to automate the page scrolling and HAR file downloading process 15 | 16 | You have the option of having this as a web service or run it via the command line 17 | 18 | ## Installing 19 | 20 | - Chrome browser 21 | - Ensure you install python if it is not installed yet. Here is the official link: (https://www.python.org/downloads)[https://www.python.org/downloads/] 22 | - Then download the file *fbscraper.py* and if you wish to try the AppleScript, download the file *extract-posts-mac.scpt* as well. 23 | 24 | ## Executing script 25 | 26 | - First you need to go to your Facebook page, group or search page you wish to extract data from 27 | - Then you need to open the Developer Tools (View->Developer->Developer Tools) 28 | - In Developer tools, Go to the Network tab and ensure that cache is disabled and that recording is on and has the settings shown below. 29 | ![Developer Tools](https://user-images.githubusercontent.com/81685/216130319-be70ba73-3265-4f82-8339-37443521af67.png) 30 | 31 | - On the main FB page you have open, first reload the page and it is recommended to do a [hard refresh](https://www.howtogeek.com/672607/how-to-hard-refresh-your-web-browser-to-bypass-your-cache/) if necessary to reload data from the source 32 | - Scroll down to load all the data you need to get (you must keep the Developer Tools window open as you do so) 33 | - Once you have scrolled down to include all the posts you wish to load, go to the Developer Tools window and click on the download button (shown above) to download the HAR file 34 | - Usually, it will allow you to download to the default download folder, but you can easily change it so that it downloads the file to where the fbscraper.py script is 35 | - Open the Terminal window and go to the location where the file fbscraper.py is located 36 | - Then simply run the command prompt in the terminal window while in the same directory: 37 | ``` 38 | python fbscraper.py 39 | ``` 40 | If www.facebook.com.har has a different name other than www.facebook.com.har or is not in the same directory, make sure to rename it to www.facebook.com.har and copy it to the same directory where fbscraper.py is or include the full path of the file as an argument to the script file, e.g., *python fbscraper.py /Users/myname/Downloads/new.har* . 41 | 42 | If successful, you will soon find a CSV file (X.csv) where X is the name assigned by the script depending on the title of the page you extracted data from. For example, if you opened https://facebook.com/cnn, the file would be cnn.csv 43 | 44 | ## Automated method with AppleScript (Mac OS only) 45 | 46 | Sometimes, it may be that you wish to download thousands of posts and it may be a pain to have to wait until all pages are loaded manually. In this case, you can opt for using the automation script that simulates scrolling for you. You simply need to open the file *extract-posts-mac.scpt" and filling in the required details. 47 | 48 | ### Required field 49 | THE only field that has to be filled is URLs since it contains the list of Facebook links you wish to extract posts from. In the script file it is marked with *[REQUIRED]*: 50 | 51 | ``` 52 | #[REQUIRED]# 53 | set URLs to {""} 54 | ``` 55 | 56 | The above field takes the list of the URLs on Facebook https://facebook.com that you are trying to extract posts from. It could including pages, groups, searches, etc. examples: https://www.facebook.com/cnn for the CNN page, https://www.facebook.com/search/top?q=crypto for searching for the keyword 'crypto', https://www.facebook.com/groups/565383300477194 for the group with the ID (565383300477194). If using multiple values, use commas to separate them, e.g., {"https://www.facebook.com/cnn","https://www.facebook.com/search/top?q=crypto","https://www.facebook.com/groups/565383300477194"} 57 | 58 | ### Additional fields 59 | 60 | There are also additional fields that are marked with *[OPTIONAL]* as follows: 61 | 62 | *page_flips*: the number of page flips required to scroll down to load all the posts requiring extraction. The default is 10 but it could go up in the hundreds of even thousands provided your browser can handle that. As you load more files, the browser will get slower and even crash. So choose this number wisely. Additionally, the delays in downloading and running the python file may have to be increased if required. 63 | 64 | *pythonPath*: the name of the python command (the default 'python' should be fine, but you may change it to something else like 'python3' depending on your system) 65 | 66 | *BrowserName*: by default "Google Chrome" but it can have a different name if needed 67 | 68 | *DownloadFolder*: by default, the script uses the default Downloads folder for the home user, i.e., /Users/username/Downloads. It can change if required as it depends on the Google Chrome settings 69 | 70 | *page_loading_time:* the number of seconds the script waits before scrolling to the next page 71 | 72 | 73 | ## Author 74 | 75 | The principal author is Walid Al-Saqaf, a developer and senior lecturer at Södertörn University in Stockholm 76 | 77 | Reach out to the developer by emailing to walid.al-saqaf@sh.se 78 | 79 | ## Version History 80 | 81 | * 2.0 82 | * Major updated version 83 | 84 | ## License 85 | 86 | This project is licensed under GNU General Public License (GPL) 87 | -------------------------------------------------------------------------------- /extract-posts-mac.scpt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/wsaqaf/fbscraper/eddaca2f2f74bd72501bad2b8724e544d9e403c6/extract-posts-mac.scpt -------------------------------------------------------------------------------- /fbscraper.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | import json 4 | import csv 5 | import re 6 | from datetime import datetime 7 | from bs4 import BeautifulSoup 8 | import html 9 | import logging 10 | 11 | ####################### functions ########################### 12 | 13 | def get_json_element_by_key(obj, search_key): 14 | try: 15 | if isinstance(obj, dict): 16 | for key, value in obj.items(): 17 | if key == search_key: 18 | return value 19 | else: 20 | result = get_json_element_by_key(value, search_key) 21 | if result is not False: 22 | return result 23 | elif isinstance(obj, list): 24 | for item in obj: 25 | result = get_json_element_by_key(item, search_key) 26 | if result is not False: 27 | return result 28 | return False 29 | except: 30 | return False 31 | 32 | def find_value_by_key(json_string, search_key): 33 | try: 34 | data = json.loads(json_string) 35 | return get_json_element_by_key(data, search_key) 36 | except json.JSONDecodeError: 37 | return False 38 | 39 | def load_post(p,i): 40 | fbpost={} 41 | for r in labels_str: fbpost[r]="" 42 | for r in labels_num: fbpost[r]=0 43 | fbpost["post_id"]=p['post_id'] 44 | #### post time & url #### 45 | fbpost["post_time"]=datetime.fromtimestamp(p['comet_sections']['context_layout']['story']['comet_sections']['metadata'][0]['story']['creation_time']).isoformat() 46 | fbpost["post_url"]=p['comet_sections']['context_layout']['story']['comet_sections']['metadata'][0]['story']['url'] 47 | #### post text #### 48 | try: 49 | fbpost["post_text"]=p['comet_sections']['content']['story']['message']['text'].replace('\n', '') 50 | except: 51 | fbpost["post_text"]="" 52 | 53 | #### user info #### 54 | fbpost["user_id"]=p['comet_sections']['context_layout']['story']['comet_sections']['title']['story']['actors'][0]['id'] 55 | fbpost["user_name"]=p['comet_sections']['context_layout']['story']['comet_sections']['title']['story']['actors'][0]['name'] 56 | fbpost["user_webpage"]=p['comet_sections']['context_layout']['story']['comet_sections']['title']['story']['actors'][0]['url'] 57 | fbpost["user_profile"]=p['comet_sections']['context_layout']['story']['comet_sections']['actor_photo']['story']['actors'][0]['profile_url'] 58 | fbpost["user_profile_pic"]=p['comet_sections']['context_layout']['story']['comet_sections']['actor_photo']['story']['actors'][0]['profile_picture']['uri'] 59 | 60 | #### post content #### 61 | try: fbpost["post_text"]=p['comet_sections']['content']['story']['message']['text'].replace('\n', '') 62 | except: fbpost["post_text"]="" 63 | 64 | for attachment in p['comet_sections']['content']['story']['attachments']: 65 | attachment_type=attachment['style_list'][0] 66 | #### Web preview (if any) #### 67 | if (attachment_type=='share'): 68 | if (attachment['styles']['attachment']['story_attachment_link_renderer']['attachment']['web_link']['__typename']=='ExternalWebLink'): 69 | fbpost["weblink_url"]=attachment['styles']['attachment']['story_attachment_link_renderer']['attachment']['web_link']['url'] 70 | else: 71 | try: 72 | fbpost["weblink_url"]=attachment['styles']['attachment']['url'] 73 | fbpost["weblink_title"]=attachment['styles']['attachment']['source']['text'] 74 | except: pass 75 | try: fbpost["weblink_pic"]=attachment['styles']['attachment']['media']['large_share_image']['uri'] 76 | except: pass 77 | try: fbpost["weblink_preview"]=attachment['styles']['attachment']['title_with_entities']['text'].replace('\n', '') 78 | except: pass 79 | #### Photo (if any) #### 80 | elif (attachment_type=='photo'): 81 | fbpost["photo_url"]=attachment['styles']['attachment']['media']['photo_image']['uri'] 82 | elif (attachment_type=='album'): 83 | for photo in attachment['styles']['attachment']['all_subattachments']['nodes']: 84 | fbpost["photo_url"]=fbpost["photo_url"]+photo['media']['image']['uri']+"\n" 85 | fbpost["photo_url"]=fbpost["photo_url"].strip() 86 | #### Video (if any) #### 87 | elif (attachment_type.startswith('video')): 88 | fbpost["video_url"]=attachment['styles']['attachment']['media']['url'] 89 | fbpost["video_permalink"]=attachment['styles']['attachment']['media']['permalink_url'] 90 | fbpost["video_duration"]=attachment['styles']['attachment']['media']['playable_duration_in_ms'] 91 | fbpost["video_thumbnail"]=attachment['styles']['attachment']['media']['preferred_thumbnail']['image']['uri'] 92 | break 93 | #### Reactions #### 94 | 95 | try: 96 | feedback=p['comet_sections']['feedback']['story']['feedback_context']['feedback_target_with_context']['ufi_renderer']['feedback']['comet_ufi_summary_and_actions_renderer']['feedback'] 97 | fbpost["shares"]=feedback['share_count']['count'] 98 | 99 | try: fbpost["comments"]=feedback['comments_count_summary_renderer']['feedback']['comment_count']['total_count'] 100 | except: fbpost["comments"]=feedback['comments_count_summary_renderer']['feedback']['total_comment_count'] 101 | 102 | try: fbpost["video_view_count"]=feedback['video_view_count'] 103 | except: pass 104 | for reaction in feedback['cannot_see_top_custom_reactions']['top_reactions']['edges']: 105 | if (reaction['node']['id']=="1635855486666999"): fbpost["like"]=reaction['reaction_count'] 106 | elif (reaction['node']['id']=="1678524932434102"): fbpost["love"]=reaction['reaction_count'] 107 | elif (reaction['node']['id']=="478547315650144"): fbpost["wow"]=reaction['reaction_count'] 108 | elif (reaction['node']['id']=="115940658764963"): fbpost["haha"]=reaction['reaction_count'] 109 | elif (reaction['node']['id']=="908563459236466"): fbpost["sad"]=reaction['reaction_count'] 110 | elif (reaction['node']['id']=="444813342392137"): fbpost["angry"]=reaction['reaction_count'] 111 | fbpost["reactions"]=fbpost["like"]+fbpost["love"]+fbpost["wow"]+fbpost["haha"]+fbpost["sad"]+fbpost["angry"] 112 | except: pass 113 | #### add post to collection using post_id as key #### 114 | fbpost_list=[] 115 | for item in fbposts_list[0]: fbpost_list.append(fbpost[item]) 116 | return fbpost_list 117 | 118 | #check if key exists 119 | def key_exists(element, *keys): 120 | if not isinstance(element, dict): 121 | raise AttributeError('keys_exists() expects dict as first argument.') 122 | if len(keys) == 0: 123 | raise AttributeError('keys_exists() expects at least two arguments, one given.') 124 | 125 | _element = element 126 | for key in keys: 127 | try: 128 | _element = _element[key] 129 | except KeyError: 130 | return False 131 | return True 132 | 133 | #for debugging purposes only 134 | def debug_content(content): 135 | text_file = open("temp.txt", "w") 136 | text_file.write(content) 137 | text_file.close() 138 | 139 | ####################### end functions ########################### 140 | 141 | file_name="www.facebook.com.har" 142 | if len(sys.argv)>1: 143 | if not sys.argv[1].startswith('-'): file_name=sys.argv[1] 144 | 145 | if not os.path.isfile(file_name): exit(1) 146 | 147 | with open(file_name, 'r', encoding="utf-8") as f: 148 | contents=f.read() 149 | dtime=datetime.now().strftime("%Y%m%d%H%M") 150 | with open(file_name.replace('.har','')+'-'+dtime+'.har', 'w', encoding="utf-8") as f2: 151 | f2.write(contents) 152 | print("Copied har file to: "+file_name.replace('.har','')+'-'+dtime+'.har') 153 | 154 | content_j=json.loads(contents) 155 | m="" 156 | 157 | try: 158 | m=re.search('https://www\.facebook\.com/(.+[^/])/?',content_j['log']['pages'][0]['title']) 159 | except: 160 | try: 161 | m=re.search('https://www\.facebook\.com/(.+[^/])/?',content_j['log']['entries'][0]['url']) 162 | except: pass 163 | 164 | if m: 165 | file_name=m.group(1) 166 | pattern=re.compile('[\W_]+') 167 | file_name=pattern.sub('_', file_name) 168 | if len(file_name)>50: file_name=file_name[:50] 169 | print("Found: "+file_name) 170 | else: 171 | print ("Failed to find reference") 172 | exit(1) 173 | 174 | content="[" 175 | top_posts=[] 176 | for entry in content_j['log']['entries']: 177 | if entry['request']['url']=='https://www.facebook.com/api/graphql/': 178 | try: 179 | content=content+entry['response']['content']['text'].replace('\\"','')+"," 180 | except: 181 | pass 182 | elif entry['_resourceType']=="document": 183 | try: 184 | if ('text' in entry['response']['content']): 185 | soup = BeautifulSoup(entry['response']['content']['text'], "html.parser") 186 | script_tags = soup.find_all('script')#, type='application/json') 187 | content_list = [tag.text.strip() for tag in script_tags] 188 | for tag in script_tags: 189 | m = re.search("(\{\"define\"\:\[\[.+?)\)\;", str(tag)) 190 | if m: 191 | tag_text = m.group(1) 192 | json_el=json.loads(tag_text) 193 | try: 194 | top_post=get_json_element_by_key(json_el,"timeline_list_feed_units") 195 | if (top_post): 196 | top_posts.append(top_post['edges'][0]) 197 | except: 198 | try: 199 | if (key_exists(item[3][1],'__bbox','result','data','serpResponse','results','edges')): 200 | temp_j=item[3][1]['__bbox']['result']['data']['serpResponse']['results']['edges'][0] 201 | if (key_exists(temp_j,'relay_rendering_strategy','view_model','click_model')): 202 | top_posts.append(item[3][1]['__bbox']['result']) 203 | except: pass 204 | except (AttributeError, KeyError) as ex: logging.exception("error") 205 | if (not file_name): 206 | print ("Found no page or group. Exiting...") 207 | exit(0) 208 | 209 | content=content.rstrip(",")+"]" 210 | content=re.compile('}\s*{').sub('},{', content) 211 | 212 | data = json.loads(content) 213 | 214 | if (top_posts): data=top_posts+data 215 | 216 | labels_str=["post_id","post_time","post_url","user_id","user_name","user_webpage","user_profile","user_profile_pic","post_text","weblink_url","weblink_title","weblink_pic","weblink_preview","photo_url","video_url","video_duration"] 217 | labels_num=["video_view_count","shares","comments","reactions",'like','love','wow','haha','sad','angry'] 218 | fbposts_list=[] 219 | fbposts_list.append(labels_str+labels_num) 220 | 221 | i=-1 222 | for post in data: 223 | 224 | try: post=post['data'] 225 | except: pass 226 | i=i+1 227 | try: 228 | if (post['node']['__typename']=='Story'): 229 | new_post=load_post(post['node'],i) 230 | fbposts_list.append(new_post) 231 | elif (post['node']['__typename']=='User' or post['node']['__typename']=='Page' or post['node']['__typename']=='Group'): 232 | if (post['node']['__typename']=='User'): 233 | posts=post['node']['timeline_list_feed_units']['edges'] 234 | elif (post['node']['__typename']=='Page'): 235 | posts=post['node']['timeline_feed_units']['edges'] 236 | elif (post['node']['__typename']=='Group'): 237 | posts=post['node']['group_feed']['edges'] 238 | for p in posts: 239 | new_post=load_post(p['node'],i) 240 | fbposts_list.append(new_post) 241 | else: 242 | new_post=load_post(post['node'],i) 243 | fbposts_list.append(new_post) 244 | 245 | except: 246 | try: 247 | posts=post['serpResponse']['results']['edges'] 248 | for p in posts: 249 | new_post=load_post(p['relay_rendering_strategy']['view_model']['click_model']['story'],i) 250 | fbposts_list.append(new_post) 251 | continue 252 | except: 253 | pass 254 | 255 | if (len(fbposts_list)>1): 256 | with open(file_name+'-'+dtime+'.csv', 'w', encoding="utf-8") as f: 257 | writer = csv.writer(f) 258 | writer.writerows(fbposts_list) 259 | print("Exported CSV data to: "+file_name+'-'+dtime+'.csv') 260 | --------------------------------------------------------------------------------