├── .github └── workflows │ └── pythonpublish.yml ├── .gitignore ├── LICENSE ├── README.md ├── images ├── ecgo.png ├── quickstart_fanspage.png ├── quickstart_group.png ├── 屏幕截图 2021-06-19 112621.png └── 屏幕截图 2021-06-19 152222.png ├── main.py ├── paser.py ├── requester.py ├── requirements.txt ├── sample ├── 20221013_Sample.ipynb ├── FansPages.ipynb ├── Group.ipynb └── data │ ├── PyConTaiwan.parquet │ └── corollacrossclub.parquet ├── setup.py ├── tests ├── test_facebook_crawler.py ├── test_page_parser.py ├── test_post_parser.py ├── test_requester.py └── test_utils.py └── utils.py /.github/workflows/pythonpublish.yml: -------------------------------------------------------------------------------- 1 | name: Upload Python Package 2 | 3 | on: 4 | release: 5 | types: [created] 6 | 7 | jobs: 8 | deploy: 9 | runs-on: ubuntu-latest 10 | steps: 11 | - uses: actions/checkout@v1 12 | - name: Set up Python 13 | uses: actions/setup-python@v1 14 | with: 15 | python-version: '3.x' 16 | - name: Install dependencies 17 | run: | 18 | python -m pip install --upgrade pip 19 | pip install setuptools wheel twine 20 | - name: Build and publish 21 | env: 22 | TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }} 23 | TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }} 24 | run: | 25 | python setup.py sdist bdist_wheel 26 | twine upload dist/* 27 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | develop/ 2 | .ipynb_checkpoints/ 3 | .vscode/ 4 | *egg-info/ 5 | .idea/ 6 | venv2/ 7 | build/ 8 | dist/ 9 | __pycache__/ 10 | data/ -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021 tlyu0419 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Facebook_Crawler 2 | [![Downloads](https://pepy.tech/badge/facebook-crawler)](https://pepy.tech/project/facebook-crawler) 3 | [![Downloads](https://pepy.tech/badge/facebook-crawler/month)](https://pepy.tech/project/facebook-crawler) 4 | [![Downloads](https://pepy.tech/badge/facebook-crawler/week)](https://pepy.tech/project/facebook-crawler) 5 | 6 | ## What's this? 
7 | 8 | This python package aims to help people who need to collect and analyze the public Fanspages or Groups data from Facebook with ease and efficiency. 9 | 10 | Here are the three big points of this project: 11 | 1. Private: You don't need to log in to your account. 12 | 2. Easy: Just key in the link of Fanspage or group and the target date, and it will work. 13 | 3. Efficient: It collects the data through the requests package directly instead of opening another browser. 14 | 15 | 16 | 這個 Python 套件旨在幫助使用者輕鬆且快速的收集 Facebook 公開粉絲頁和公開社團的資料,藉以進行後續的分析。 17 | 18 | 以下是本專案的 3 個重點: 19 | 1. 隱私: 不需要登入你個人的帳號密碼 20 | 2. 簡單: 僅需輸入粉絲頁/社團的網址和停止的日期就可以開始執行程式 21 | 3. 高效: 透過 requests 直接向伺服器請求資料,不需另外開啟一個新的瀏覽器 22 | 23 | ## Quickstart 24 | ### Install 25 | ```pip 26 | pip install -U facebook-crawler 27 | ``` 28 | 29 | ### Usage 30 | - Facebook Fanspage 31 | ```python 32 | import facebook_crawler 33 | pageurl= 'https://www.facebook.com/diudiu333' 34 | facebook_crawler.Crawl_PagePosts(pageurl=pageurl, until_date='2021-01-01') 35 | ``` 36 | ![quickstart_fanspage.png](https://raw.githubusercontent.com/TLYu0419/facebook_crawler/main/images/quickstart_fanspage.png) 37 | 38 | - Group 39 | ```python 40 | import facebook_crawler 41 | groupurl = 'https://www.facebook.com/groups/pythontw' 42 | facebook_crawler.Crawl_GroupPosts(groupurl, until_date='2021-01-01') 43 | ``` 44 | ![quickstart_group.png](https://raw.githubusercontent.com/TLYu0419/facebook_crawler/main/images/quickstart_group.png) 45 | 46 | ## FAQ 47 | - **How to get the comments or replies to the posts?** 48 | > Please write an Email to me and tell me your project goal. Thanks! 49 | 50 | - **How can I find out the post's link through the data?** 51 | > You can add the string 'https://www.facebook.com' in front of the POSTID, and it's just its post link. So, for example, if the POSTID is 123456789, and its link is 'https://www.facebook.com/12345679'. 52 | 53 | - **Can I directly collect the data in the specific time period?** 54 | > No! This is the same as the behavior when we use Facebook. We need to collect the data from the newest posts to the older posts. 55 | 56 | ## License 57 | [MIT License](https://github.com/TLYu0419/facebook_crawler/blob/main/LICENSE) 58 | 59 | ## Contribution 60 | 61 | [![ecgo.png](https://raw.githubusercontent.com/TLYu0419/facebook_crawler/main/images/ecgo.png)](https://payment.ecpay.com.tw/QuickCollect/PayData?GcM4iJGUeCvhY%2fdFqqQ%2bFAyf3uA10KRo%2fqzP4DWtVcw%3d) 62 | 63 | A donation is not the limitation to utilizing this package, but it would be great to have your support. Either donate, star or fork are good methods to support me keep maintaining and developing this project. 64 | 65 | Thanks to these donors' help, due to their kind help, this project could keep maintained and developed. 66 | 67 | **贊助不是使用這個套件的必要條件**,但如能獲得你的支持我將會非常感謝。不論是贊助、給予星星或分享都是很好的支持方式,幫助我繼續維護和開發這個專案 68 | 69 | 由於這些捐助者的幫助,由於他們的慷慨的幫助,這個項目才得以持續維護和發展 70 | - Universities 71 | - [Department of Social Work. The Chinese University of Hong Kong(香港中文大學社會工作學系)](https://web.swk.cuhk.edu.hk/zh-tw/) 72 | - [Education, Graduate School of Curriculum and Instructional Communications Technology. National Taipei University.(國立台北教育大學課程與教學傳播科技研究所)](https://cict.ntue.edu.tw/?locale=zh_tw) 73 | - [Department of Dusiness Administration. 
Chung Hua University.(中華大學企業管理學系)](https://ba.chu.edu.tw/?Lang=en) 74 | 75 | ## Contact Info 76 | - Author: TENG-LIN YU 77 | - Email: tlyu0419@gmail.com 78 | - Facebook: https://www.facebook.com/tlyu0419 79 | - PYPI: https://pypi.org/project/facebook-crawler/ 80 | - Github: https://github.com/TLYu0419/facebook_crawler 81 | 82 | ## Log 83 | - 0.028: Modularized the crawler function. 84 | - 0.0.26 85 | 1. Auto changes the cookie after it's expired to keep crawling data without changing IP. 86 | -------------------------------------------------------------------------------- /images/ecgo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlyu0419/facebook_crawler/82ec3ec46aa0a324252bbb6274b57cbf29c27e6e/images/ecgo.png -------------------------------------------------------------------------------- /images/quickstart_fanspage.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlyu0419/facebook_crawler/82ec3ec46aa0a324252bbb6274b57cbf29c27e6e/images/quickstart_fanspage.png -------------------------------------------------------------------------------- /images/quickstart_group.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlyu0419/facebook_crawler/82ec3ec46aa0a324252bbb6274b57cbf29c27e6e/images/quickstart_group.png -------------------------------------------------------------------------------- /images/屏幕截图 2021-06-19 112621.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlyu0419/facebook_crawler/82ec3ec46aa0a324252bbb6274b57cbf29c27e6e/images/屏幕截图 2021-06-19 112621.png -------------------------------------------------------------------------------- /images/屏幕截图 2021-06-19 152222.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlyu0419/facebook_crawler/82ec3ec46aa0a324252bbb6274b57cbf29c27e6e/images/屏幕截图 2021-06-19 152222.png -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | from paser import _parse_category, _parse_pagename, _parse_creation_time, _parse_pagetype, _parse_likes, _parse_docid, _parse_pageurl 2 | from paser import _parse_entryPoint, _parse_identifier, _parse_docid, _parse_composite_nojs, _parse_composite_graphql, _parse_relatedpages, _parse_pageinfo 3 | from requester import _get_homepage, _get_posts, _get_headers 4 | from utils import _init_request_vars 5 | from bs4 import BeautifulSoup 6 | import os 7 | import re 8 | 9 | import json 10 | import time 11 | import tqdm 12 | 13 | import pandas as pd 14 | import pickle 15 | 16 | import datetime 17 | import warnings 18 | warnings.filterwarnings("ignore") 19 | 20 | 21 | def Crawl_PagePosts(pageurl, until_date='2018-01-01', cursor=''): 22 | # initial request variables 23 | df, cursor, max_date, break_times = _init_request_vars(cursor) 24 | 25 | # get headers 26 | headers = _get_headers(pageurl) 27 | 28 | # Get pageid, postid and entryPoint from homepage_response 29 | homepage_response = _get_homepage(pageurl, headers) 30 | entryPoint = _parse_entryPoint(homepage_response) 31 | identifier = _parse_identifier(entryPoint, homepage_response) 32 | docid = _parse_docid(entryPoint, homepage_response) 33 | 34 | # Keep crawling post until reach the until_date 35 | while 
max_date >= until_date: 36 | try: 37 | # Get posts by identifier, docid and entryPoint 38 | resp = _get_posts(headers, identifier, entryPoint, docid, cursor) 39 | if entryPoint == 'nojs': 40 | ndf, max_date, cursor = _parse_composite_nojs(resp) 41 | df.append(ndf) 42 | else: 43 | ndf, max_date, cursor = _parse_composite_graphql(resp) 44 | df.append(ndf) 45 | # Test 46 | # print(ndf.shape[0]) 47 | break_times = 0 48 | except: 49 | # print(resp.json()[:3000]) 50 | try: 51 | if resp.json()['data']['node']['timeline_feed_units']['page_info']['has_next_page'] == False: 52 | print('The posts of the page has run over!') 53 | break 54 | except: 55 | pass 56 | print('Break Times {}: Something went wrong with this request. Sleep 20 seconds and send request again.'.format( 57 | break_times)) 58 | print('REQUEST LOG >> pageid: {}, docid: {}, cursor: {}'.format( 59 | identifier, docid, cursor)) 60 | print('RESPONSE LOG: ', resp.text[:3000]) 61 | print('================================================') 62 | break_times += 1 63 | 64 | if break_times > 15: 65 | print('Please check your target page/group has up to date.') 66 | print('If so, you can ignore this break time message, if not, please change your Internet IP and run this crawler again.') 67 | break 68 | 69 | time.sleep(20) 70 | # Get new headers 71 | headers = _get_headers(pageurl) 72 | 73 | # Concat all dataframes 74 | df = pd.concat(df, ignore_index=True) 75 | df['UPDATETIME'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") 76 | return df 77 | 78 | 79 | def Crawl_GroupPosts(pageurl, until_date='2022-01-01'): 80 | df = Crawl_PagePosts(pageurl, until_date) 81 | return df 82 | 83 | 84 | def Crawl_RelatedPages(seedpages, rounds): 85 | # init 86 | df = pd.DataFrame(data=[], columns=['SOURCE', 'TARGET', 'ROUND']) 87 | pageurls = list(set(seedpages)) 88 | crawled_list = list(set(df['SOURCE'])) 89 | headers = _get_headers(pageurls[0]) 90 | for i in range(rounds): 91 | print('Round {} started at: {}!'.format( 92 | i, datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))) 93 | for pageurl in tqdm(pageurls): 94 | if pageurl not in crawled_list: 95 | try: 96 | homepage_response = _get_homepage( 97 | pageurl=pageurl, headers=headers) 98 | if 'Sorry, something went wrong.' 
not in homepage_response.text: 99 | entryPoint = _parse_entryPoint(homepage_response) 100 | identifier = _parse_identifier( 101 | entryPoint, homepage_response) 102 | relatedpages = _parse_relatedpages( 103 | homepage_response, entryPoint, identifier) 104 | ndf = pd.DataFrame({'SOURCE': homepage_response.url, 105 | 'TARGET': relatedpages, 106 | 'ROUND': i}) 107 | df = pd.concat([df, ndf], ignore_index=True) 108 | except: 109 | pass 110 | # print('ERROE: {}'.format(pageurl)) 111 | pageurls = list(set(df['TARGET'])) 112 | crawled_list = list(set(df['SOURCE'])) 113 | return df 114 | 115 | 116 | def Crawl_PageInfo(pagenum, pageurl): 117 | break_times = 0 118 | global headers 119 | while True: 120 | try: 121 | homepage_response = _get_homepage(pageurl, headers) 122 | pageinfo = _parse_pageinfo(homepage_response) 123 | with open('data/pageinfo/' + str(pagenum) + '.pickle', "wb") as fp: 124 | pickle.dump(pageinfo, fp) 125 | break 126 | except: 127 | break_times = break_times + 1 128 | if break_times >= 5: 129 | break 130 | time.sleep(5) 131 | headers = _get_headers(pageurl=pageurl) 132 | 133 | 134 | if __name__ == '__main__': 135 | 136 | os.makedirs('data/', exist_ok=True) 137 | # ===== fb_api_req_friendly_name: ProfileCometTimelineFeedRefetchQuery ==== 138 | pageurl = 'https://www.facebook.com/Gooaye' # 股癌 Gooaye: 30.4萬追蹤, 139 | pageurl = 'https://www.facebook.com/StockOldBull' # 股海老牛: 16萬 140 | pageurl = 'https://www.facebook.com/twherohan' 141 | pageurl = 'https://www.facebook.com/diudiu333' 142 | pageurl = 'https://www.facebook.com/chengwentsan' 143 | pageurl = 'https://www.facebook.com/MaYingjeou' 144 | pageurl = 'https://www.facebook.com/roberttawikofficial' 145 | pageurl = 'https://www.facebook.com/NizamAbTitingan' 146 | pageurl = 'https://www.facebook.com/joebiden' 147 | 148 | # ==== fb_api_req_friendly_name: CometModernPageFeedPaginationQuery ==== 149 | pageurl = 'https://www.facebook.com/ebcmoney/' # 東森財經: 81萬追蹤 150 | pageurl = 'https://www.facebook.com/moneyweekly.tw/' # 理財周刊: 36.3萬 151 | pageurl = 'https://www.facebook.com/cmoneyapp/' # CMoney 理財寶: 84.2萬 152 | pageurl = 'https://www.facebook.com/emily0806' # 艾蜜莉-自由之路: 20.9萬追蹤 153 | pageurl = 'https://www.facebook.com/imoney889/' # 林恩如-飆股女王: 10.2萬 154 | pageurl = 'https://www.facebook.com/wealth1974/' # 財訊: 17.5萬 155 | pageurl = 'https://www.facebook.com/smart16888/' # 郭莉芳理財講堂: 1.6萬 156 | pageurl = 'https://www.facebook.com/smartmonthly/' # Smart 智富月刊: 52.6萬 157 | pageurl = 'https://www.facebook.com/ezmoney.tw/' # 統一投信: 1.5萬 158 | pageurl = 'https://www.facebook.com/MoneyMoneyMeg/' # Money錢: 20.7萬 159 | pageurl = 'https://www.facebook.com/imoneymagazine/' # iMoney 智富雜誌: 38萬 160 | pageurl = 'https://www.facebook.com/edigest/' # 經濟一週 EDigest: 36.2萬 161 | pageurl = 'https://www.facebook.com/BToday/' # 今周刊:107萬 162 | pageurl = 'https://www.facebook.com/GreenHornFans/' # 綠角財經筆記: 25萬 163 | pageurl = 'https://www.facebook.com/ec.ltn.tw/' # 自由時報財經頻道 42,656人在追蹤 164 | pageurl = 'https://www.facebook.com/MoneyDJ' # MoneyDJ理財資訊 141,302人在追蹤 165 | pageurl = 'https://www.facebook.com/YahooTWFinance/' # Yahoo奇摩股市理財 149,624人在追蹤 166 | pageurl = 'https://www.facebook.com/win3105' 167 | pageurl = 'https://www.facebook.com/Diss%E7%BA%8F%E7%B6%BF-111182238148502/' 168 | 169 | # fb_api_req_friendly_name: CometUFICommentsProviderQuery 170 | pageurl = 'https://www.facebook.com/anuetw/' # Anue鉅亨網財經新聞: 31.2萬追蹤 171 | pageurl = 'https://www.facebook.com/wealtholic/' # 投資癮 Wealtholic: 2.萬 172 | 173 | # fb_api_req_friendly_name: 
PresenceStatusProviderSubscription_ContactProfilesQuery 174 | # fb_api_req_friendly_name: GroupsCometFeedRegularStoriesPaginationQuery 175 | pageurl = 'https://www.facebook.com/groups/pythontw' 176 | pageurl = 'https://www.facebook.com/groups/corollacrossclub/' 177 | 178 | df = Crawl_PagePosts(pageurl, until_date='2022-08-10') 179 | # df = Crawl_RelatedPages(seedpages=pageurls, rounds=10) 180 | 181 | df = pd.read_csv( 182 | './data/relatedpages_edgetable.csv')[['SOURCE', 'TARGET', 'ROUND']] 183 | 184 | headers = _get_headers(pageurl=pageurl) 185 | 186 | for pagenum in tqdm(df['index']): 187 | try: 188 | Crawl_PageInfo(pagenum=pagenum, pageurl=df['pageurl'][pagenum]) 189 | except: 190 | pass 191 | 192 | homepage_response = _get_homepage(pageurl, headers) 193 | pageinfo = _parse_pageinfo(homepage_response) 194 | 195 | # 196 | import pandas as pd 197 | from main import Crawl_PagePosts 198 | pageurl = 'https://www.facebook.com/hatendhu' 199 | df = Crawl_PagePosts(pageurl, until_date='2014-11-01') 200 | df 201 | df.to_pickle('./data/20220926_hatendhu.pkl') 202 | -------------------------------------------------------------------------------- /paser.py: -------------------------------------------------------------------------------- 1 | import re 2 | import json 3 | import pandas as pd 4 | from bs4 import BeautifulSoup 5 | from utils import _extract_id, _init_request_vars 6 | import datetime 7 | from requester import _get_pageabout, _get_pagetransparency, _get_homepage, _get_posts, _get_headers 8 | import requests 9 | # Post-Paser 10 | 11 | 12 | def _parse_edgelist(resp): 13 | ''' 14 | Take edges from the response by graphql api 15 | ''' 16 | edges = [] 17 | try: 18 | edges = resp.json()['data']['node']['timeline_feed_units']['edges'] 19 | except: 20 | for data in resp.text.split('\r\n', -1): 21 | try: 22 | edges.append(json.loads(data)[ 23 | 'data']['node']['timeline_list_feed_units']['edges'][0]) 24 | except: 25 | edges.append(json.loads(data)['data']) 26 | return edges 27 | 28 | 29 | def _parse_edge(edge): 30 | ''' 31 | Parse edge to take informations, such as post name, id, message..., etc. 
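Returns a list in the order: [name, pageid, postid, creation_time, message, reaction_count, comment_count, toplevel_comment_count, share_count, top_reactions, attachment_title, attachment_description, attachments_photos, cursor, actor_url, post_url].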
32 | ''' 33 | comet_sections = edge['node']['comet_sections'] 34 | # name 35 | name = comet_sections['context_layout']['story']['comet_sections']['actor_photo']['story']['actors'][0]['name'] 36 | 37 | # creation_time 38 | creation_time = comet_sections['context_layout']['story']['comet_sections']['metadata'][0]['story']['creation_time'] 39 | 40 | # message 41 | try: 42 | message = comet_sections['content']['story']['comet_sections']['message']['story']['message']['text'] 43 | except: 44 | try: 45 | message = comet_sections['content']['story']['comet_sections']['message_container']['story']['message']['text'] 46 | except: 47 | message = comet_sections['content']['story']['comet_sections']['message_container'] 48 | # postid 49 | postid = comet_sections['feedback']['story']['feedback_context'][ 50 | 'feedback_target_with_context']['ufi_renderer']['feedback']['subscription_target_id'] 51 | 52 | # actorid 53 | pageid = comet_sections['context_layout']['story']['comet_sections']['actor_photo']['story']['actors'][0]['id'] 54 | 55 | # comment_count 56 | comment_count = comet_sections['feedback']['story']['feedback_context'][ 57 | 'feedback_target_with_context']['ufi_renderer']['feedback']['comment_count']['total_count'] 58 | 59 | # reaction_count 60 | reaction_count = comet_sections['feedback']['story']['feedback_context']['feedback_target_with_context'][ 61 | 'ufi_renderer']['feedback']['comet_ufi_summary_and_actions_renderer']['feedback']['reaction_count']['count'] 62 | 63 | # share_count 64 | share_count = comet_sections['feedback']['story']['feedback_context']['feedback_target_with_context'][ 65 | 'ufi_renderer']['feedback']['comet_ufi_summary_and_actions_renderer']['feedback']['share_count']['count'] 66 | 67 | # toplevel_comment_count 68 | toplevel_comment_count = comet_sections['feedback']['story']['feedback_context'][ 69 | 'feedback_target_with_context']['ufi_renderer']['feedback']['toplevel_comment_count']['count'] 70 | 71 | # top_reactions 72 | top_reactions = comet_sections['feedback']['story']['feedback_context']['feedback_target_with_context']['ufi_renderer'][ 73 | 'feedback']['comet_ufi_summary_and_actions_renderer']['feedback']['cannot_see_top_custom_reactions']['top_reactions']['edges'] 74 | 75 | # comet_footer_renderer for link 76 | try: 77 | comet_footer_renderer = comet_sections['content']['story']['attachments'][0]['comet_footer_renderer'] 78 | # attachment_title 79 | attachment_title = comet_footer_renderer['attachment']['title_with_entities']['text'] 80 | # attachment_description 81 | attachment_description = comet_footer_renderer['attachment']['description']['text'] 82 | except: 83 | attachment_title = '' 84 | attachment_description = '' 85 | 86 | # all_subattachments for photos 87 | try: 88 | try: 89 | media = comet_sections['content']['story']['attachments'][0]['styles']['attachment']['all_subattachments']['nodes'] 90 | attachments_photos = ', '.join( 91 | [image['media']['viewer_image']['uri'] for image in media]) 92 | except: 93 | media = comet_sections['content']['story']['attachments'][0]['styles']['attachment'] 94 | attachments_photos = media['media']['photo_image']['uri'] 95 | except: 96 | attachments_photos = '' 97 | 98 | # cursor 99 | cursor = edge['cursor'] 100 | 101 | # actor url 102 | actor_url = comet_sections['context_layout']['story']['comet_sections']['actor_photo']['story']['actors'][0]['url'] 103 | 104 | # post url 105 | post_url = comet_sections['content']['story']['wwwURL'] 106 | 107 | return [name, pageid, postid, creation_time, message, reaction_count, 
comment_count, toplevel_comment_count, share_count, top_reactions, attachment_title, attachment_description, attachments_photos, cursor, actor_url, post_url] 108 | 109 | 110 | def _parse_domops(resp): 111 | ''' 112 | Take name, data id, time , message and page link from domops 113 | ''' 114 | data = re.sub(r'for \(;;\);', '', resp.text) 115 | data = json.loads(data) 116 | domops = data['domops'][0][3]['__html'] 117 | cursor = re.findall( 118 | 'timeline_cursor%22%3A%22(.*?)%22%2C%22timeline_section_cursor', domops)[0] 119 | content_list = [] 120 | soup = BeautifulSoup(domops, 'lxml') 121 | 122 | for content in soup.findAll('div', {'class': 'userContentWrapper'}): 123 | # name 124 | name = content.find('img')['aria-label'] 125 | # id 126 | dataid = content.find('div', {'data-testid': 'story-subtitle'})['id'] 127 | # actorid 128 | pageid = _extract_id(dataid, 0) 129 | # postid 130 | postid = _extract_id(dataid, 1) 131 | # time 132 | time = content.find('abbr')['data-utime'] 133 | # message 134 | message = content.find('div', {'data-testid': 'post_message'}) 135 | if message == None: 136 | message = '' 137 | else: 138 | if len(message.findAll('p')) >= 1: 139 | message = ''.join(p.text for p in message.findAll('p')) 140 | elif len(message.select('span > span')) >= 2: 141 | message = message.find('span').text 142 | 143 | # attachment_title 144 | try: 145 | attachment_title = content.find( 146 | 'a', {'data-lynx-mode': 'hover'})['aria-label'] 147 | except: 148 | attachment_title = '' 149 | # attachment_description 150 | try: 151 | attachment_description = content.find( 152 | 'a', {'data-lynx-mode': 'hover'}).text 153 | except: 154 | attachment_description = '' 155 | # actor_url 156 | actor_url = content.find('a')['href'].split('?')[0] 157 | 158 | # post_url 159 | post_url = 'https://www.facebook.com/' + postid 160 | content_list.append([name, pageid, postid, time, message, attachment_title, 161 | attachment_description, cursor, actor_url, post_url]) 162 | return content_list, cursor 163 | 164 | 165 | def _parse_jsmods(resp): 166 | ''' 167 | Take postid, pageid, comment count , reaction count, sharecount, reactions and display_comments_count from jsmods 168 | ''' 169 | data = re.sub(r'for \(;;\);', '', resp.text) 170 | data = json.loads(data) 171 | jsmods = data['jsmods'] 172 | 173 | requires_list = [] 174 | for requires in jsmods['pre_display_requires']: 175 | try: 176 | feedback = requires[3][1]['__bbox']['result']['data']['feedback'] 177 | # subscription_target_id ==> postid 178 | subscription_target_id = feedback['subscription_target_id'] 179 | # owning_profile_id ==> pageid 180 | owning_profile_id = feedback['owning_profile']['id'] 181 | # comment_count 182 | comment_count = feedback['comment_count']['total_count'] 183 | # reaction_count 184 | reaction_count = feedback['reaction_count']['count'] 185 | # share_count 186 | share_count = feedback['share_count']['count'] 187 | # top_reactions 188 | top_reactions = feedback['top_reactions']['edges'] 189 | # display_comments_count 190 | display_comments_count = feedback['display_comments_count']['count'] 191 | 192 | # append data to list 193 | requires_list.append([subscription_target_id, owning_profile_id, comment_count, 194 | reaction_count, share_count, top_reactions, display_comments_count]) 195 | except: 196 | pass 197 | 198 | # reactions--video posts 199 | for requires in jsmods['require']: 200 | try: 201 | # entidentifier ==> postid 202 | entidentifier = requires[3][2]['feedbacktarget']['entidentifier'] 203 | # pageid 204 | actorid = 
requires[3][2]['feedbacktarget']['actorid'] 205 | # comment count 206 | commentcount = requires[3][2]['feedbacktarget']['commentcount'] 207 | # reaction count 208 | likecount = requires[3][2]['feedbacktarget']['likecount'] 209 | # sharecount 210 | sharecount = requires[3][2]['feedbacktarget']['sharecount'] 211 | # reactions 212 | reactions = [] 213 | # display_comments_count 214 | commentcount = requires[3][2]['feedbacktarget']['commentcount'] 215 | 216 | # append data to list 217 | requires_list.append( 218 | [entidentifier, actorid, commentcount, likecount, sharecount, reactions, commentcount]) 219 | except: 220 | pass 221 | return requires_list 222 | 223 | 224 | def _parse_composite_graphql(resp): 225 | edges = _parse_edgelist(resp) 226 | df = [] 227 | for edge in edges: 228 | try: 229 | ndf = _parse_edge(edge) 230 | df.append(ndf) 231 | except: 232 | pass 233 | df = pd.DataFrame(df, columns=['NAME', 'PAGEID', 'POSTID', 'TIME', 'MESSAGE', 'REACTIONCOUNT', 'COMMENTCOUNT', 'DISPLAYCOMMENTCOUNT', 234 | 'SHARECOUNT', 'REACTIONS', 'ATTACHMENT_TITLE', 'ATTACHMENT_DESCRIPTION', 'ATTACHMENT_PHOTOS', 'CURSOR', 'ACTOR_URL', 'POST_URL']) 235 | df = df[['NAME', 'PAGEID', 'POSTID', 'TIME', 'MESSAGE', 'ATTACHMENT_TITLE', 'ATTACHMENT_DESCRIPTION', 'ATTACHMENT_PHOTOS', 'REACTIONCOUNT', 236 | 'COMMENTCOUNT', 'DISPLAYCOMMENTCOUNT', 'SHARECOUNT', 'REACTIONS', 'CURSOR', 'ACTOR_URL', 'POST_URL']] 237 | cursor = df['CURSOR'].to_list()[-1] 238 | df['TIME'] = df['TIME'].apply(lambda x: datetime.datetime.fromtimestamp( 239 | int(x)).strftime("%Y-%m-%d %H:%M:%S")) 240 | max_date = df['TIME'].max() 241 | print('The maximum date of these posts is: {}, keep crawling...'.format(max_date)) 242 | return df, max_date, cursor 243 | 244 | 245 | def _parse_composite_nojs(resp): 246 | domops, cursor = _parse_domops(resp) 247 | domops = pd.DataFrame(domops, columns=['NAME', 'PAGEID', 'POSTID', 'TIME', 'MESSAGE', 248 | 'ATTACHMENT_TITLE', 'ATTACHMENT_DESCRIPTION', 'CURSOR', 'ACTOR_URL', 'POST_URL']) 249 | domops['TIME'] = domops['TIME'].apply( 250 | lambda x: datetime.datetime.fromtimestamp(int(x)).strftime("%Y-%m-%d %H:%M:%S")) 251 | 252 | jsmods = _parse_jsmods(resp) 253 | jsmods = pd.DataFrame(jsmods, columns=[ 254 | 'POSTID', 'PAGEID', 'COMMENTCOUNT', 'REACTIONCOUNT', 'SHARECOUNT', 'REACTIONS', 'DISPLAYCOMMENTCOUNT']) 255 | 256 | df = pd.merge(left=domops, 257 | right=jsmods, 258 | how='inner', 259 | on=['PAGEID', 'POSTID']) 260 | 261 | df = df[['NAME', 'PAGEID', 'POSTID', 'TIME', 'MESSAGE', 'ATTACHMENT_TITLE', 'ATTACHMENT_DESCRIPTION', 262 | 'REACTIONCOUNT', 'COMMENTCOUNT', 'DISPLAYCOMMENTCOUNT', 'SHARECOUNT', 'REACTIONS', 'CURSOR', 263 | 'ACTOR_URL', 'POST_URL']] 264 | max_date = df['TIME'].max() 265 | print('The maximum date of these posts is: {}, keep crawling...'.format(max_date)) 266 | return df, max_date, cursor 267 | 268 | # Page paser 269 | 270 | 271 | def _parse_pagetype(homepage_response): 272 | if '/groups/' in homepage_response.url: 273 | pagetype = 'Group' 274 | else: 275 | pagetype = 'Fanspage' 276 | return pagetype 277 | 278 | 279 | def _parse_pagename(homepage_response): 280 | raw_json = homepage_response.text.encode('utf-8').decode('unicode_escape') 281 | # pattern1 282 | if len(re.findall(r'{"page":{"name":"(.*?)",', raw_json)) >= 1: 283 | pagename = re.findall(r'{"page":{"name":"(.*?)",', raw_json)[0] 284 | pagename = re.sub(r'\s\|\sFacebook', '', pagename) 285 | return pagename 286 | # pattern2 287 | if len(re.findall('","name":"(.*?)","', raw_json)) >= 1: 288 | pagename = 
re.findall('","name":"(.*?)","', raw_json)[0] 289 | pagename = re.sub(r'\s\|\sFacebook', '', pagename) 290 | return pagename 291 | 292 | 293 | def _parse_entryPoint(homepage_response): 294 | try: 295 | entryPoint = re.findall( 296 | '"entryPoint":{"__dr":"(.*?)"}}', homepage_response.text)[0] 297 | except: 298 | entryPoint = 'nojs' 299 | return entryPoint 300 | 301 | 302 | def _parse_identifier(entryPoint, homepage_response): 303 | if entryPoint in ['ProfilePlusCometLoggedOutRouteRoot.entrypoint', 'CometGroupDiscussionRoot.entrypoint']: 304 | # pattern 1 305 | if len(re.findall('"identifier":"{0,1}([0-9]{5,})"{0,1},', homepage_response.text)) >= 1: 306 | identifier = re.findall( 307 | '"identifier":"{0,1}([0-9]{5,})"{0,1},', homepage_response.text)[0] 308 | 309 | # pattern 2 310 | elif len(re.findall('fb://profile/(.*?)"', homepage_response.text)) >= 1: 311 | identifier = re.findall( 312 | 'fb://profile/(.*?)"', homepage_response.text)[0] 313 | 314 | # pattern 3 315 | elif len(re.findall('content="fb://group/([0-9]{1,})" />', homepage_response.text)) >= 1: 316 | identifier = re.findall( 317 | 'content="fb://group/([0-9]{1,})" />', homepage_response.text)[0] 318 | 319 | elif entryPoint in ['CometSinglePageHomeRoot.entrypoint', 'nojs']: 320 | # pattern 1 321 | if len(re.findall('"pageID":"{0,1}([0-9]{5,})"{0,1},', homepage_response.text)) >= 1: 322 | identifier = re.findall( 323 | '"pageID":"{0,1}([0-9]{5,})"{0,1},', homepage_response.text)[0] 324 | 325 | return identifier 326 | 327 | 328 | def _parse_docid(entryPoint, homepage_response): 329 | soup = BeautifulSoup(homepage_response.text, 'lxml') 330 | if entryPoint == 'nojs': 331 | docid = 'NoDocid' 332 | else: 333 | for link in soup.findAll('link', {'rel': 'preload'}): 334 | resp = requests.get(link['href']) 335 | for line in resp.text.split('\n', -1): 336 | if 'ProfileCometTimelineFeedRefetchQuery_' in line: 337 | docid = re.findall('e.exports="([0-9]{1,})"', line)[0] 338 | break 339 | 340 | if 'CometModernPageFeedPaginationQuery_' in line: 341 | docid = re.findall('e.exports="([0-9]{1,})"', line)[0] 342 | break 343 | 344 | if 'CometUFICommentsProviderQuery_' in line: 345 | docid = re.findall('e.exports="([0-9]{1,})"', line)[0] 346 | break 347 | 348 | if 'GroupsCometFeedRegularStoriesPaginationQuery' in line: 349 | docid = re.findall('e.exports="([0-9]{1,})"', line)[0] 350 | break 351 | if 'docid' in locals(): 352 | break 353 | return docid 354 | 355 | 356 | def _parse_likes(homepage_response, entryPoint, headers): 357 | if entryPoint in ['CometGroupDiscussionRoot.entrypoint']: 358 | pageabout = _get_pageabout(homepage_response, entryPoint, headers) 359 | members = re.findall( 360 | ',"group_total_members_info_text":"(.*?) 
total members","', pageabout.text)[0] 361 | members = re.sub(',', '', members) 362 | return members 363 | else: 364 | # pattern 1 365 | data = re.findall( 366 | '"page_likers":{"global_likers_count":([0-9]{1,})},"', homepage_response.text) 367 | if len(data) >= 1: 368 | likes = data[0] 369 | return likes 370 | # pattern 2 371 | data = re.findall( 372 | ' ([0-9]{0,},{0,}[0-9]{0,},{0,}[0-9]{0,},{0,}[0-9]{0,},{0,}[0-9]{0,},{0,}) likes', homepage_response.text) 373 | if len(data) >= 1: 374 | likes = data[0] 375 | likes = re.sub(',', '', likes) 376 | return likes 377 | 378 | 379 | def _parse_creation_time(homepage_response, entryPoint, headers): 380 | try: 381 | if entryPoint in ['ProfilePlusCometLoggedOutRouteRoot.entrypoint']: 382 | transparency_response = _get_pagetransparency( 383 | homepage_response, entryPoint, headers) 384 | transparency_info = re.findall( 385 | '"field_section_type":"transparency","profile_fields":{"nodes":\[{"title":(.*?}),"field_type":"creation_date",', transparency_response.text)[0] 386 | creation_time = json.loads(transparency_info)['text'] 387 | 388 | elif entryPoint in ['CometSinglePageHomeRoot.entrypoint']: 389 | creation_time = re.findall( 390 | ',"page_creation_date":{"text":"Page created - (.*?)"},', homepage_response.text)[0] 391 | 392 | elif entryPoint in ['nojs']: 393 | if len(re.findall('Page created - (.*?)', homepage_response.text)) >= 1: 394 | creation_time = re.findall( 395 | 'Page created - (.*?)', homepage_response.text)[0] 396 | else: 397 | creation_time = re.findall( 398 | ',"foundingDate":"(.*?)"}', homepage_response.text)[0][:10] 399 | 400 | elif entryPoint in ['CometGroupDiscussionRoot.entrypoint']: 401 | pageabout = _get_pageabout(homepage_response, entryPoint, headers) 402 | creation_time = re.findall( 403 | '"group_history_summary":{"text":"Group created on (.*?)"}},', pageabout.text)[0] 404 | 405 | try: 406 | creation_time = datetime.datetime.strptime( 407 | creation_time, '%B %d, %Y') 408 | except: 409 | creation_time = creation_time + ', ' + datetime.datetime.now().year 410 | creation_time = datetime.datetime.strptime( 411 | creation_time, '%B %d, %Y') 412 | creation_time = creation_time.strftime('%Y-%m-%d') 413 | except: 414 | creation_time = 'NotAvailable' 415 | return creation_time 416 | 417 | 418 | def _parse_category(homepage_response, entryPoint, headers): 419 | pageabout = _get_pageabout(homepage_response, entryPoint, headers) 420 | if entryPoint in ['ProfilePlusCometLoggedOutRouteRoot.entrypoint']: 421 | if 'Page \\u00b7 Politician' in pageabout.text: 422 | category = 'Politician' 423 | if len(re.findall(r'"text":"Page \\u00b7 (.*?)"}', homepage_response.text)) >= 1: 424 | category = re.findall( 425 | r'"text":"Page \\u00b7 (.*?)"}', homepage_response.text)[0] 426 | else: 427 | soup = BeautifulSoup(pageabout.text) 428 | for script in soup.findAll('script', {'type': 'application/ld+json'}): 429 | if 'BreadcrumbList' in script.text: 430 | data = script.text.encode('utf-8').decode('unicode_escape') 431 | category = json.loads(data)['itemListElement'] 432 | category = ' / '.join([cate['name'] for cate in category]) 433 | elif entryPoint in ['CometSinglePageHomeRoot.entrypoint', 'nojs']: 434 | if len(re.findall('","category_name":"(.*?)","', homepage_response.text)) >= 1: 435 | category = re.findall( 436 | '","category_name":"(.*?)","', homepage_response.text) 437 | category = ' / '.join([cate for cate in category]) 438 | else: 439 | soup = BeautifulSoup(homepage_response.text) 440 | if len(soup.findAll('span', {'itemprop': 
'itemListElement'})) >= 1: 441 | category = [span.text for span in soup.findAll( 442 | 'span', {'itemprop': 'itemListElement'})] 443 | category = ' / '.join(category) 444 | else: 445 | for script in soup.findAll('script', {'type': 'application/ld+json'}): 446 | if 'BreadcrumbList' in script.text: 447 | data = script.text.encode( 448 | 'utf-8').decode('unicode_escape') 449 | category = json.loads(data)['itemListElement'] 450 | category = ' / '.join([cate['name'] 451 | for cate in category]) 452 | elif entryPoint in ['PagesCometAdminSelfViewAboutContainerRoot.entrypoint']: 453 | category = eval(re.findall( 454 | '"page_categories":(.*?),"addressEditable', homepage_response.text)[0]) 455 | category = ' / '.join([cate['text'] for cate in category]) 456 | elif entryPoint in ['CometGroupDiscussionRoot.entrypoint']: 457 | category = 'Group' 458 | try: 459 | category = re.sub(r'\\/', '/', category) 460 | except: 461 | category = '' 462 | return category 463 | 464 | 465 | def _parse_pageurl(homepage_response): 466 | pageurl = homepage_response.url 467 | pageurl = re.sub('/$', '', pageurl) 468 | return pageurl 469 | 470 | 471 | def _parse_relatedpages(homepage_response, entryPoint, identifier): 472 | relatedpages = [] 473 | if entryPoint in ['CometSinglePageHomeRoot.entrypoint']: 474 | try: 475 | data = re.findall( 476 | r'"related_pages":\[(.*?)\],"view_signature"', homepage_response.text)[0] 477 | data = re.sub('},{', '},,,,{', data) 478 | for pages in data.split(',,,,', -1): 479 | # print('id:', json.loads(pages)['id']) 480 | # print('category_name:', json.loads(pages)['category_name']) 481 | # print('name:', json.loads(pages)['name']) 482 | url = json.loads(pages)['url'] 483 | url = url.split('?', -1)[0] 484 | url = re.sub(r'/$', '', url) 485 | # print('url:', url) 486 | # print('========') 487 | relatedpages.append(url) 488 | except: 489 | pass 490 | 491 | elif entryPoint in ['nojs']: 492 | soup = BeautifulSoup(homepage_response.text, 'lxml') 493 | soup = soup.find( 494 | 'div', {'id': 'PageRelatedPagesSecondaryPagelet_{}'.format(identifier)}) 495 | for page in soup.select('ul > li > div'): 496 | # print('name: ', page.find('img')['aria-label']) 497 | url = page.find('a')['href'] 498 | url = url.split('?', -1)[0] 499 | url = re.sub(r'/$', '', url) 500 | # print('url:', url) 501 | # print('===========') 502 | relatedpages.append(url) 503 | 504 | elif entryPoint in ['ProfilePlusCometLoggedOutRouteRoot.entrypoint', 'CometGroupDiscussionRoot.entrypoint']: 505 | pass 506 | # print('There\'s no related pages recommend.') 507 | return relatedpages 508 | 509 | 510 | def _parse_pageinfo(homepage_response): 511 | ''' 512 | Parse the homepage response to get the page information, including id, docid and api_name. 
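Returns a list in the order: [pagetype, pagename, identifier, likes, creation_time, category, pageurl].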
513 | ''' 514 | # pagetype 515 | pagetype = _parse_pagetype(homepage_response) 516 | 517 | # pagename 518 | pagename = _parse_pagename(homepage_response) 519 | 520 | # entryPoint 521 | entryPoint = _parse_entryPoint(homepage_response) 522 | 523 | # identifier 524 | identifier = _parse_identifier(entryPoint, homepage_response) 525 | 526 | # docid 527 | docid = _parse_docid(entryPoint, homepage_response) 528 | 529 | # likes / members 530 | likes = _parse_likes(homepage_response, entryPoint, headers) 531 | 532 | # creation time 533 | creation_time = _parse_creation_time( 534 | homepage_response, entryPoint, headers) 535 | 536 | # category 537 | category = _parse_category(homepage_response, entryPoint, headers) 538 | 539 | # pageurl 540 | pageurl = _parse_pageurl(homepage_response) 541 | 542 | return [pagetype, pagename, identifier, likes, creation_time, category, pageurl] 543 | 544 | 545 | if __name__ == '__main__': 546 | # pageurls 547 | pageurl = 'https://www.facebook.com/mohw.gov.tw' 548 | pageurl = 'https://www.facebook.com/groups/pythontw' 549 | pageurl = 'https://www.facebook.com/Gooaye' 550 | pageurl = 'https://www.facebook.com/emily0806' 551 | pageurl = 'https://www.facebook.com/anuetw/' 552 | pageurl = 'https://www.facebook.com/wealtholic/' 553 | pageurl = 'https://www.facebook.com/hatendhu' 554 | 555 | headers = _get_headers(pageurl) 556 | headers['Referer'] = 'https://www.facebook.com/hatendhu' 557 | headers['Origin'] = 'https://www.facebook.com' 558 | headers['Cookie'] = 'dpr=1.5; datr=rzIwY5yARwMzcR9H2GyqId_l' 559 | 560 | homepage_response = _get_homepage(pageurl=pageurl, headers=headers) 561 | 562 | entryPoint = _parse_entryPoint(homepage_response) 563 | print(entryPoint) 564 | 565 | identifier = _parse_identifier(entryPoint, homepage_response) 566 | 567 | docid = _parse_docid(entryPoint, homepage_response) 568 | 569 | df, cursor, max_date, break_times = _init_request_vars(cursor='') 570 | cursor = 'AQHRlIMW9sczmHGnME47XeSdDNj6Jk9EcBOMlyxBdMNbZHM7dwd0rn8wsaxQxeXUsuhKVaMgVwPHb9YS9468INvb5yw2osoEmXd_sMXvj8rLhmBxeaJucMSPIDux_JuiHToC' 571 | cursor = 'AQHRxSZTqUvlLpkXCnrOjdX0gZeyn-Q1cuJzn4SPJuZ5rkYi7nZFByE5pwy4AsBoUOtcmF28lNfXR_rqv7oO7545iURm_mx46aZLBDiYfPmgI2mjscHUTiVi5vv1vj5EXiF4' 572 | resp = _get_posts(headers=headers, identifier=identifier, 573 | entryPoint=entryPoint, docid=docid, cursor=cursor) 574 | 575 | # graphql 576 | edges = _parse_edgelist(resp) 577 | print(len(edges)) 578 | _parse_edge(edges[0]) 579 | edges[0].keys() 580 | edges[0]['node'].keys() 581 | edges[0]['node']['comet_sections'].keys() 582 | edges[0]['node']['comet_sections'] 583 | df, max_date, cursor = _parse_composite_graphql(resp) 584 | df 585 | # nojs 586 | content_list, cursor = _parse_domops(resp) 587 | 588 | df, max_date, cursor = _parse_composite_nojs(resp) 589 | 590 | # page paser 591 | 592 | pagename = _parse_pagename(homepage_response).encode('utf-8').decode() 593 | likes = _parse_likes(homepage_response, entryPoint, headers) 594 | creation_time = _parse_creation_time( 595 | homepage_response=homepage_response, entryPoint=entryPoint, headers=headers) 596 | category = _parse_category(homepage_response, entryPoint, headers) 597 | pageurl = _parse_pageurl(homepage_response) 598 | -------------------------------------------------------------------------------- /requester.py: -------------------------------------------------------------------------------- 1 | import re 2 | import requests 3 | import time 4 | from utils import _init_request_vars 5 | 6 | 7 | def _get_headers(pageurl): 8 | ''' 9 | Send a 
request to get cookieid as headers. 10 | ''' 11 | pageurl = re.sub('www', 'm', pageurl) 12 | resp = requests.get(pageurl) 13 | headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9', 14 | 'accept-language': 'en'} 15 | headers['cookie'] = '; '.join(['{}={}'.format(cookieid, resp.cookies.get_dict()[ 16 | cookieid]) for cookieid in resp.cookies.get_dict()]) 17 | # headers['cookie'] = headers['cookie'] + '; locale=en_US' 18 | return headers 19 | 20 | 21 | def _get_homepage(pageurl, headers): 22 | ''' 23 | Send a request to get the homepage response 24 | ''' 25 | pageurl = re.sub('/$', '', pageurl) 26 | timeout_cnt = 0 27 | while True: 28 | try: 29 | homepage_response = requests.get( 30 | pageurl, headers=headers, timeout=3) 31 | return homepage_response 32 | except: 33 | time.sleep(5) 34 | timeout_cnt = timeout_cnt + 1 35 | if timeout_cnt > 20: 36 | class homepage_response(): 37 | text = 'Sorry, something went wrong.' 38 | return homepage_response 39 | 40 | 41 | def _get_pageabout(homepage_response, entryPoint, headers): 42 | ''' 43 | Send a request to get the about page response 44 | ''' 45 | pageurl = re.sub('/$', '', homepage_response.url) 46 | pageabout = requests.get(pageurl + '/about', headers=headers) 47 | return pageabout 48 | 49 | 50 | def _get_pagetransparency(homepage_response, entryPoint, headers): 51 | ''' 52 | Send a request to get the transparency page response 53 | ''' 54 | pageurl = re.sub('/$', '', homepage_response.url) 55 | if entryPoint in ['ProfilePlusCometLoggedOutRouteRoot.entrypoint']: 56 | transparency_response = requests.get( 57 | pageurl + '/about_profile_transparency', headers=headers) 58 | return transparency_response 59 | 60 | 61 | def _get_posts(headers, identifier, entryPoint, docid, cursor): 62 | ''' 63 | Send a request to get new posts from fanspage/group. 
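Pages with the 'nojs' entry point are fetched with a GET to /pages_reaction_units/more/; all other entry points POST the cursor, identifier and doc_id to /api/graphql/.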
64 | ''' 65 | if entryPoint in ['nojs']: 66 | params = {'page_id': identifier, 67 | 'cursor': str({"timeline_cursor": cursor, 68 | "timeline_section_cursor": '{}', 69 | "has_next_page": 'true'}), 70 | 'surface': 'www_pages_posts', 71 | 'unit_count': 10, 72 | '__a': '1'} 73 | resp = requests.get(url='https://www.facebook.com/pages_reaction_units/more/', 74 | params=params) 75 | 76 | else: # entryPoint in ['CometSinglePageHomeRoot.entrypoint', 'ProfilePlusCometLoggedOutRouteRoot.entrypoint', 'CometGroupDiscussionRoot.entrypoint'] 77 | data = {'variables': str({'cursor': cursor, 78 | 'id': identifier, 79 | 'count': 3}), 80 | 'doc_id': docid} 81 | resp = requests.post(url='https://www.facebook.com/api/graphql/', 82 | data=data, 83 | headers=headers) 84 | return resp 85 | 86 | 87 | if __name__ == '__main__': 88 | pageurl = 'https://www.facebook.com/ec.ltn.tw/' 89 | pageurl = 'https://www.facebook.com/Gooaye' 90 | pageurl = 'https://www.facebook.com/groups/pythontw' 91 | pageurl = 'https://www.facebook.com/hatendhu' 92 | headers = _get_headers(pageurl) 93 | homepage_response = _get_homepage(pageurl=pageurl, headers=headers) 94 | 95 | df, cursor, max_date, break_times = _init_request_vars() 96 | cursor = 'AQHRlIMW9sczmHGnME47XeSdDNj6Jk9EcBOMlyxBdMNbZHM7dwd0rn8wsaxQxeXUsuhKVaMgVwPHb9YS9468INvb5yw2osoEmXd_sMXvj8rLhmBxeaJucMSPIDux_JuiHToC' 97 | cursor = 'AQHRixL5fPMA_nM-78jGg4LohG3M4a2-YQR6WSaWOTiqPRJ1dOGchYRzp1wdDtusNd-5FkCPXwByL_kZM2iyLIz1XHB8WIEzHYXTU3vQzviOI9GexNv__RPn1xnFJZddnjX3' 98 | 99 | from paser import _parse_entryPoint, _parse_identifier, _parse_docid, _parse_composite_graphql 100 | entryPoint = _parse_entryPoint(homepage_response) 101 | identifier = _parse_identifier(entryPoint, homepage_response) 102 | docid = _parse_docid(entryPoint, homepage_response) 103 | df, cursor, max_date, break_times = _init_request_vars(cursor='') 104 | 105 | resp = _get_posts(headers=headers, identifier=identifier, 106 | entryPoint=entryPoint, docid=docid, cursor=cursor) 107 | ndf, max_date, cursor = _parse_composite_graphql(resp) 108 | resp.json() 109 | ndf 110 | max_date 111 | cursor 112 | # print(len(resp.text)) 113 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | requests==2.24.0 2 | bs4==0.0.1 3 | pandas==1.2.4 4 | numpy==1.20.3 5 | dicttoxml==1.7.4 6 | lxml==4.9.2 -------------------------------------------------------------------------------- /sample/20221013_Sample.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 3, 6 | "id": "7946c9a9-0a27-4de2-a32e-8de574f8d7fb", 7 | "metadata": { 8 | "execution": { 9 | "iopub.execute_input": "2022-10-13T14:01:50.402181Z", 10 | "iopub.status.busy": "2022-10-13T14:01:50.402181Z", 11 | "iopub.status.idle": "2022-10-13T14:02:32.391460Z", 12 | "shell.execute_reply": "2022-10-13T14:02:32.391460Z", 13 | "shell.execute_reply.started": "2022-10-13T14:01:50.402181Z" 14 | }, 15 | "tags": [] 16 | }, 17 | "outputs": [ 18 | { 19 | "name": "stdout", 20 | "output_type": "stream", 21 | "text": [ 22 | "The maximum date of these posts is: 2022-10-13 01:06:00, keep crawling...\n", 23 | "The maximum date of these posts is: 2022-10-12 19:47:59, keep crawling...\n", 24 | "The maximum date of these posts is: 2022-10-12 19:47:37, keep crawling...\n", 25 | "The maximum date of these posts is: 2022-10-12 19:47:19, keep crawling...\n", 26 | 
"The maximum date of these posts is: 2022-10-12 02:37:08, keep crawling...\n", 27 | "The maximum date of these posts is: 2022-10-12 02:36:56, keep crawling...\n", 28 | "The maximum date of these posts is: 2022-10-12 02:36:50, keep crawling...\n", 29 | "The maximum date of these posts is: 2022-10-10 23:07:48, keep crawling...\n", 30 | "The maximum date of these posts is: 2022-10-10 18:18:13, keep crawling...\n", 31 | "The maximum date of these posts is: 2022-10-09 22:47:33, keep crawling...\n", 32 | "The maximum date of these posts is: 2022-10-08 23:59:35, keep crawling...\n", 33 | "The maximum date of these posts is: 2022-10-08 16:27:26, keep crawling...\n", 34 | "The maximum date of these posts is: 2022-10-07 21:05:48, keep crawling...\n", 35 | "The maximum date of these posts is: 2022-10-07 21:05:34, keep crawling...\n", 36 | "The maximum date of these posts is: 2022-10-07 21:05:14, keep crawling...\n", 37 | "The maximum date of these posts is: 2022-10-07 21:04:46, keep crawling...\n", 38 | "The maximum date of these posts is: 2022-10-07 21:04:17, keep crawling...\n", 39 | "The maximum date of these posts is: 2022-10-07 00:58:39, keep crawling...\n", 40 | "The maximum date of these posts is: 2022-10-06 23:56:05, keep crawling...\n", 41 | "The maximum date of these posts is: 2022-10-06 22:25:05, keep crawling...\n", 42 | "The maximum date of these posts is: 2022-10-06 22:24:57, keep crawling...\n", 43 | "The maximum date of these posts is: 2022-10-06 22:24:48, keep crawling...\n", 44 | "The maximum date of these posts is: 2022-10-06 22:24:39, keep crawling...\n", 45 | "The maximum date of these posts is: 2022-10-06 22:24:31, keep crawling...\n", 46 | "The maximum date of these posts is: 2022-10-06 22:24:19, keep crawling...\n", 47 | "The maximum date of these posts is: 2022-10-05 21:33:20, keep crawling...\n", 48 | "The maximum date of these posts is: 2022-10-05 21:33:12, keep crawling...\n", 49 | "The maximum date of these posts is: 2022-10-05 21:33:05, keep crawling...\n", 50 | "The maximum date of these posts is: 2022-10-05 21:32:59, keep crawling...\n", 51 | "The maximum date of these posts is: 2022-10-04 23:21:34, keep crawling...\n", 52 | "The maximum date of these posts is: 2022-10-04 23:21:14, keep crawling...\n", 53 | "The maximum date of these posts is: 2022-10-04 23:20:37, keep crawling...\n", 54 | "The maximum date of these posts is: 2022-10-04 23:19:57, keep crawling...\n", 55 | "The maximum date of these posts is: 2022-10-04 23:19:33, keep crawling...\n", 56 | "The maximum date of these posts is: 2022-10-03 22:40:18, keep crawling...\n", 57 | "The maximum date of these posts is: 2022-10-01 06:49:51, keep crawling...\n", 58 | "The maximum date of these posts is: 2022-10-01 06:49:39, keep crawling...\n", 59 | "The maximum date of these posts is: 2022-10-01 06:49:32, keep crawling...\n", 60 | "The maximum date of these posts is: 2022-09-30 00:28:37, keep crawling...\n" 61 | ] 62 | }, 63 | { 64 | "data": { 65 | "text/html": [ 66 | "
\n", 67 | "\n", 80 | "\n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | "
NAMEPAGEIDPOSTIDTIMEMESSAGEATTACHMENT_TITLEATTACHMENT_DESCRIPTIONATTACHMENT_PHOTOSREACTIONCOUNTCOMMENTCOUNTDISPLAYCOMMENTCOUNTSHARECOUNTREACTIONSCURSORACTOR_URLPOST_URLUPDATETIME
0黑特東華 NDHU Hate1000647085076914767388344930632022-10-13 01:06:00#52635\\n\\n志學街一堆店家把東華學生當搖錢樹,大家應該聯合抵制一下,盤子店通通給他倒...16625152[{'reaction_count': 138, 'node': {'id': '16358...AQHRWkzldCEuXabhI1tJPZnVEn7FKGxga7nPHhIBgGzMmB...https://www.facebook.com/people/%E9%BB%91%E7%8...https://www.facebook.com/permalink.php?story_f...2022-10-13 22:02:32
1黑特東華 NDHU Hate1000647085076914767387778264022022-10-13 01:05:58#52634\\n\\n電音社第二次社課\\n\\n提醒各位這禮拜日為電音社第二次上課,不管是沒有參...https://scontent.ftpe10-1.fna.fbcdn.net/v/t39....3000[{'reaction_count': 2, 'node': {'id': '1635855...AQHRRN-x6Hvn-AZ35SOZ2CNtJpaM3_9yYj9j9dTx5LxCWy...https://www.facebook.com/people/%E9%BB%91%E7%8...https://www.facebook.com/permalink.php?story_f...2022-10-13 22:02:32
2黑特東華 NDHU Hate1000647085076914765337611802372022-10-12 19:48:11#52633\\n\\n10/12晚上7.左右在理工停車場靠學活那側撿到BKS1(安全帽藍芽耳機...https://scontent.ftpe10-1.fna.fbcdn.net/v/t39....3000[{'reaction_count': 3, 'node': {'id': '1635855...AQHRD6AVoPeV1kHsvTjT0XPsC42w8-nrRawALu9lBRlhH5...https://www.facebook.com/people/%E9%BB%91%E7%8...https://www.facebook.com/permalink.php?story_f...2022-10-13 22:02:32
3黑特東華 NDHU Hate1000647085076914765335678469232022-10-12 19:47:59#52632\\n\\n#非黑特\\nThis is 頌啦!浪‧人‧鬆‧餅👏👏👏\\n東華巡迴場10...https://scontent.ftpe10-1.fna.fbcdn.net/v/t39....16001[{'reaction_count': 16, 'node': {'id': '163585...AQHRMe5o8KJIMTVYawF1o8Vd0cJzDc1dF1eum3VLMIYTpL...https://www.facebook.com/people/%E9%BB%91%E7%8...https://www.facebook.com/permalink.php?story_f...2022-10-13 22:02:32
4黑特東華 NDHU Hate1000647085076914765333178469482022-10-12 19:47:47#52631\\n\\n想問下 禮拜一晚上有上拳擊課的學生 \\n因為連假所以補課 想問是什麼時候...9940[{'reaction_count': 9, 'node': {'id': '1635855...AQHRppObbMGsrzRdDW6mI7HAOUiedMntD77Xe_-pnyteE3...https://www.facebook.com/people/%E9%BB%91%E7%8...https://www.facebook.com/permalink.php?story_f...2022-10-13 22:02:32
......................................................
112黑特東華 NDHU Hate1000647085076914668896388113162022-10-01 06:49:29#52524\\n\\n29號接近午夜時分在外環拉轉按喇叭,後來一路騎進學人宿舍往舊宿去、還繼續...https://scontent.ftpe10-1.fna.fbcdn.net/v/t39....47521[{'reaction_count': 34, 'node': {'id': '163585...AQHRsblILurBMfS7TsLUVyW3GVjCkGIVSJLXMn3lb0_a0P...https://www.facebook.com/people/%E9%BB%91%E7%8...https://www.facebook.com/permalink.php?story_f...2022-10-13 22:02:32
113黑特東華 NDHU Hate1000647085076914658497255819742022-09-30 00:28:39#52523\\n\\n向晴裝有買車停在宿舍旁的車主可不可以管好自己的車,常常半夜一直逼逼逼逼讓...6000[{'reaction_count': 6, 'node': {'id': '1635855...AQHRXXwiAQgdKEmnvEo3mUXyzuH6f-wXNQDeyzxZr2kEky...https://www.facebook.com/people/%E9%BB%91%E7%8...https://www.facebook.com/permalink.php?story_f...2022-10-13 22:02:32
114黑特東華 NDHU Hate1000647085076914658497055819762022-09-30 00:28:37#52522\\n\\n宿舍都知道要公告說晚上十點過後要小聲要安靜\\n然後擷雲宿委還可以快十一點...27650[{'reaction_count': 21, 'node': {'id': '163585...AQHRB_nYO_BnR2qBBs4LXzM6CPeLfOlTMiCE3ZR6Ed9uZB...https://www.facebook.com/people/%E9%BB%91%E7%8...https://www.facebook.com/permalink.php?story_f...2022-10-13 22:02:32
115黑特東華 NDHU Hate1000647085076914658496822486452022-09-30 00:28:35#52521\\n\\n*非黑特*\\n誠徵*日領* 兼職、打工、工讀, 10/3(一)3位Pm1...8110[{'reaction_count': 8, 'node': {'id': '1635855...AQHRQ11A6xsDyh3EgTdpIoAWmxFJQkoSXUrlVoeHXSDylA...https://www.facebook.com/people/%E9%BB%91%E7%8...https://www.facebook.com/permalink.php?story_f...2022-10-13 22:02:32
116黑特東華 NDHU Hate1000647085076914658496655819802022-09-30 00:28:33#52520\\n\\n同學你的便當🍱留在統冠(志學)\\n由於一直等不到你來拿\\n先幫你冰起來\\...https://scontent.ftpe10-1.fna.fbcdn.net/v/t39....11000[{'reaction_count': 8, 'node': {'id': '1635855...AQHR5AKAESdSG7AYj3DfxnH6Gb8piuyeh9hP-f4Y9IFOdv...https://www.facebook.com/people/%E9%BB%91%E7%8...https://www.facebook.com/permalink.php?story_f...2022-10-13 22:02:32
\n", 326 | "

117 rows × 17 columns

\n", 327 | "
" 328 | ], 329 | "text/plain": [ 330 | " NAME PAGEID POSTID TIME \\\n", 331 | "0 黑特東華 NDHU Hate 100064708507691 476738834493063 2022-10-13 01:06:00 \n", 332 | "1 黑特東華 NDHU Hate 100064708507691 476738777826402 2022-10-13 01:05:58 \n", 333 | "2 黑特東華 NDHU Hate 100064708507691 476533761180237 2022-10-12 19:48:11 \n", 334 | "3 黑特東華 NDHU Hate 100064708507691 476533567846923 2022-10-12 19:47:59 \n", 335 | "4 黑特東華 NDHU Hate 100064708507691 476533317846948 2022-10-12 19:47:47 \n", 336 | ".. ... ... ... ... \n", 337 | "112 黑特東華 NDHU Hate 100064708507691 466889638811316 2022-10-01 06:49:29 \n", 338 | "113 黑特東華 NDHU Hate 100064708507691 465849725581974 2022-09-30 00:28:39 \n", 339 | "114 黑特東華 NDHU Hate 100064708507691 465849705581976 2022-09-30 00:28:37 \n", 340 | "115 黑特東華 NDHU Hate 100064708507691 465849682248645 2022-09-30 00:28:35 \n", 341 | "116 黑特東華 NDHU Hate 100064708507691 465849665581980 2022-09-30 00:28:33 \n", 342 | "\n", 343 | " MESSAGE ATTACHMENT_TITLE \\\n", 344 | "0 #52635\\n\\n志學街一堆店家把東華學生當搖錢樹,大家應該聯合抵制一下,盤子店通通給他倒... \n", 345 | "1 #52634\\n\\n電音社第二次社課\\n\\n提醒各位這禮拜日為電音社第二次上課,不管是沒有參... \n", 346 | "2 #52633\\n\\n10/12晚上7.左右在理工停車場靠學活那側撿到BKS1(安全帽藍芽耳機... \n", 347 | "3 #52632\\n\\n#非黑特\\nThis is 頌啦!浪‧人‧鬆‧餅👏👏👏\\n東華巡迴場10... \n", 348 | "4 #52631\\n\\n想問下 禮拜一晚上有上拳擊課的學生 \\n因為連假所以補課 想問是什麼時候... \n", 349 | ".. ... ... \n", 350 | "112 #52524\\n\\n29號接近午夜時分在外環拉轉按喇叭,後來一路騎進學人宿舍往舊宿去、還繼續... \n", 351 | "113 #52523\\n\\n向晴裝有買車停在宿舍旁的車主可不可以管好自己的車,常常半夜一直逼逼逼逼讓... \n", 352 | "114 #52522\\n\\n宿舍都知道要公告說晚上十點過後要小聲要安靜\\n然後擷雲宿委還可以快十一點... \n", 353 | "115 #52521\\n\\n*非黑特*\\n誠徵*日領* 兼職、打工、工讀, 10/3(一)3位Pm1... \n", 354 | "116 #52520\\n\\n同學你的便當🍱留在統冠(志學)\\n由於一直等不到你來拿\\n先幫你冰起來\\... \n", 355 | "\n", 356 | " ATTACHMENT_DESCRIPTION ATTACHMENT_PHOTOS \\\n", 357 | "0 \n", 358 | "1 https://scontent.ftpe10-1.fna.fbcdn.net/v/t39.... \n", 359 | "2 https://scontent.ftpe10-1.fna.fbcdn.net/v/t39.... \n", 360 | "3 https://scontent.ftpe10-1.fna.fbcdn.net/v/t39.... \n", 361 | "4 \n", 362 | ".. ... ... \n", 363 | "112 https://scontent.ftpe10-1.fna.fbcdn.net/v/t39.... \n", 364 | "113 \n", 365 | "114 \n", 366 | "115 \n", 367 | "116 https://scontent.ftpe10-1.fna.fbcdn.net/v/t39.... \n", 368 | "\n", 369 | " REACTIONCOUNT COMMENTCOUNT DISPLAYCOMMENTCOUNT SHARECOUNT \\\n", 370 | "0 166 25 15 2 \n", 371 | "1 3 0 0 0 \n", 372 | "2 3 0 0 0 \n", 373 | "3 16 0 0 1 \n", 374 | "4 9 9 4 0 \n", 375 | ".. ... ... ... ... \n", 376 | "112 47 5 2 1 \n", 377 | "113 6 0 0 0 \n", 378 | "114 27 6 5 0 \n", 379 | "115 8 1 1 0 \n", 380 | "116 11 0 0 0 \n", 381 | "\n", 382 | " REACTIONS \\\n", 383 | "0 [{'reaction_count': 138, 'node': {'id': '16358... \n", 384 | "1 [{'reaction_count': 2, 'node': {'id': '1635855... \n", 385 | "2 [{'reaction_count': 3, 'node': {'id': '1635855... \n", 386 | "3 [{'reaction_count': 16, 'node': {'id': '163585... \n", 387 | "4 [{'reaction_count': 9, 'node': {'id': '1635855... \n", 388 | ".. ... \n", 389 | "112 [{'reaction_count': 34, 'node': {'id': '163585... \n", 390 | "113 [{'reaction_count': 6, 'node': {'id': '1635855... \n", 391 | "114 [{'reaction_count': 21, 'node': {'id': '163585... \n", 392 | "115 [{'reaction_count': 8, 'node': {'id': '1635855... \n", 393 | "116 [{'reaction_count': 8, 'node': {'id': '1635855... \n", 394 | "\n", 395 | " CURSOR \\\n", 396 | "0 AQHRWkzldCEuXabhI1tJPZnVEn7FKGxga7nPHhIBgGzMmB... \n", 397 | "1 AQHRRN-x6Hvn-AZ35SOZ2CNtJpaM3_9yYj9j9dTx5LxCWy... \n", 398 | "2 AQHRD6AVoPeV1kHsvTjT0XPsC42w8-nrRawALu9lBRlhH5... \n", 399 | "3 AQHRMe5o8KJIMTVYawF1o8Vd0cJzDc1dF1eum3VLMIYTpL... 
\n", 400 | "4 AQHRppObbMGsrzRdDW6mI7HAOUiedMntD77Xe_-pnyteE3... \n", 401 | ".. ... \n", 402 | "112 AQHRsblILurBMfS7TsLUVyW3GVjCkGIVSJLXMn3lb0_a0P... \n", 403 | "113 AQHRXXwiAQgdKEmnvEo3mUXyzuH6f-wXNQDeyzxZr2kEky... \n", 404 | "114 AQHRB_nYO_BnR2qBBs4LXzM6CPeLfOlTMiCE3ZR6Ed9uZB... \n", 405 | "115 AQHRQ11A6xsDyh3EgTdpIoAWmxFJQkoSXUrlVoeHXSDylA... \n", 406 | "116 AQHR5AKAESdSG7AYj3DfxnH6Gb8piuyeh9hP-f4Y9IFOdv... \n", 407 | "\n", 408 | " ACTOR_URL \\\n", 409 | "0 https://www.facebook.com/people/%E9%BB%91%E7%8... \n", 410 | "1 https://www.facebook.com/people/%E9%BB%91%E7%8... \n", 411 | "2 https://www.facebook.com/people/%E9%BB%91%E7%8... \n", 412 | "3 https://www.facebook.com/people/%E9%BB%91%E7%8... \n", 413 | "4 https://www.facebook.com/people/%E9%BB%91%E7%8... \n", 414 | ".. ... \n", 415 | "112 https://www.facebook.com/people/%E9%BB%91%E7%8... \n", 416 | "113 https://www.facebook.com/people/%E9%BB%91%E7%8... \n", 417 | "114 https://www.facebook.com/people/%E9%BB%91%E7%8... \n", 418 | "115 https://www.facebook.com/people/%E9%BB%91%E7%8... \n", 419 | "116 https://www.facebook.com/people/%E9%BB%91%E7%8... \n", 420 | "\n", 421 | " POST_URL UPDATETIME \n", 422 | "0 https://www.facebook.com/permalink.php?story_f... 2022-10-13 22:02:32 \n", 423 | "1 https://www.facebook.com/permalink.php?story_f... 2022-10-13 22:02:32 \n", 424 | "2 https://www.facebook.com/permalink.php?story_f... 2022-10-13 22:02:32 \n", 425 | "3 https://www.facebook.com/permalink.php?story_f... 2022-10-13 22:02:32 \n", 426 | "4 https://www.facebook.com/permalink.php?story_f... 2022-10-13 22:02:32 \n", 427 | ".. ... ... \n", 428 | "112 https://www.facebook.com/permalink.php?story_f... 2022-10-13 22:02:32 \n", 429 | "113 https://www.facebook.com/permalink.php?story_f... 2022-10-13 22:02:32 \n", 430 | "114 https://www.facebook.com/permalink.php?story_f... 2022-10-13 22:02:32 \n", 431 | "115 https://www.facebook.com/permalink.php?story_f... 2022-10-13 22:02:32 \n", 432 | "116 https://www.facebook.com/permalink.php?story_f... 
2022-10-13 22:02:32 \n", 433 | "\n", 434 | "[117 rows x 17 columns]" 435 | ] 436 | }, 437 | "execution_count": 3, 438 | "metadata": {}, 439 | "output_type": "execute_result" 440 | } 441 | ], 442 | "source": [ 443 | "import pandas as pd\n", 444 | "from facebook_crawler import Crawl_PagePosts\n", 445 | "pageurl = 'https://www.facebook.com/hatendhu'\n", 446 | "df = Crawl_PagePosts(pageurl, until_date='2022-10-01')\n", 447 | "df" 448 | ] 449 | }, 450 | { 451 | "cell_type": "code", 452 | "execution_count": 4, 453 | "id": "a7e18396-bfa1-4e9e-a507-162b2ae7b2d8", 454 | "metadata": { 455 | "execution": { 456 | "iopub.execute_input": "2022-10-13T14:02:32.394376Z", 457 | "iopub.status.busy": "2022-10-13T14:02:32.393376Z", 458 | "iopub.status.idle": "2022-10-13T14:02:32.507986Z", 459 | "shell.execute_reply": "2022-10-13T14:02:32.507986Z", 460 | "shell.execute_reply.started": "2022-10-13T14:02:32.394376Z" 461 | } 462 | }, 463 | "outputs": [], 464 | "source": [ 465 | "df.to_excel('./20221013_hatendhu.xlsx', index=False)" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": null, 471 | "id": "63e56913-ba96-4a21-b548-25ce27095430", 472 | "metadata": {}, 473 | "outputs": [], 474 | "source": [] 475 | } 476 | ], 477 | "metadata": { 478 | "kernelspec": { 479 | "display_name": "Python 3 (ipykernel)", 480 | "language": "python", 481 | "name": "python3" 482 | }, 483 | "language_info": { 484 | "codemirror_mode": { 485 | "name": "ipython", 486 | "version": 3 487 | }, 488 | "file_extension": ".py", 489 | "mimetype": "text/x-python", 490 | "name": "python", 491 | "nbconvert_exporter": "python", 492 | "pygments_lexer": "ipython3", 493 | "version": "3.10.5" 494 | } 495 | }, 496 | "nbformat": 4, 497 | "nbformat_minor": 5 498 | } 499 | -------------------------------------------------------------------------------- /sample/FansPages.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "id": "37db8006-7e78-4d00-bfd2-862997b366c9", 7 | "metadata": {}, 8 | "outputs": [ 9 | { 10 | "name": "stdout", 11 | "output_type": "stream", 12 | "text": [ 13 | "Collecting facebook_crawler\n", 14 | " Using cached facebook_crawler-0.0.25-py3-none-any.whl (7.0 kB)\n", 15 | "Requirement already satisfied: pandas in /home/tlyu0419/github/envs/env/lib/python3.8/site-packages (from facebook_crawler) (1.3.4)\n", 16 | "Requirement already satisfied: bs4 in /home/tlyu0419/github/envs/env/lib/python3.8/site-packages (from facebook_crawler) (0.0.1)\n", 17 | "Requirement already satisfied: lxml in /home/tlyu0419/github/envs/env/lib/python3.8/site-packages (from facebook_crawler) (4.6.4)\n", 18 | "Requirement already satisfied: requests in /home/tlyu0419/github/envs/env/lib/python3.8/site-packages (from facebook_crawler) (2.26.0)\n", 19 | "Requirement already satisfied: numpy in /home/tlyu0419/github/envs/env/lib/python3.8/site-packages (from facebook_crawler) (1.21.4)\n", 20 | "Requirement already satisfied: pytz>=2017.3 in /home/tlyu0419/github/envs/env/lib/python3.8/site-packages (from pandas->facebook_crawler) (2021.3)\n", 21 | "Requirement already satisfied: python-dateutil>=2.7.3 in /home/tlyu0419/github/envs/env/lib/python3.8/site-packages (from pandas->facebook_crawler) (2.8.2)\n", 22 | "Requirement already satisfied: beautifulsoup4 in /home/tlyu0419/github/envs/env/lib/python3.8/site-packages (from bs4->facebook_crawler) (4.10.0)\n", 23 | "Requirement already satisfied: urllib3<1.27,>=1.21.1 in 
/home/tlyu0419/github/envs/env/lib/python3.8/site-packages (from requests->facebook_crawler) (1.26.7)\n", 24 | "Requirement already satisfied: idna<4,>=2.5; python_version >= \"3\" in /home/tlyu0419/github/envs/env/lib/python3.8/site-packages (from requests->facebook_crawler) (3.3)\n", 25 | "Requirement already satisfied: certifi>=2017.4.17 in /home/tlyu0419/github/envs/env/lib/python3.8/site-packages (from requests->facebook_crawler) (2021.10.8)\n", 26 | "Requirement already satisfied: charset-normalizer~=2.0.0; python_version >= \"3\" in /home/tlyu0419/github/envs/env/lib/python3.8/site-packages (from requests->facebook_crawler) (2.0.9)\n", 27 | "Requirement already satisfied: six>=1.5 in /home/tlyu0419/github/envs/env/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas->facebook_crawler) (1.16.0)\n", 28 | "Requirement already satisfied: soupsieve>1.2 in /home/tlyu0419/github/envs/env/lib/python3.8/site-packages (from beautifulsoup4->bs4->facebook_crawler) (2.3.1)\n", 29 | "Installing collected packages: facebook-crawler\n", 30 | "Successfully installed facebook-crawler-0.0.25\n" 31 | ] 32 | } 33 | ], 34 | "source": [ 35 | "!pip install facebook_crawler" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 3, 41 | "id": "f421b324-f6a8-479a-8b63-c80d6543e12f", 42 | "metadata": {}, 43 | "outputs": [ 44 | { 45 | "name": "stdout", 46 | "output_type": "stream", 47 | "text": [ 48 | "TimeStamp: 2021-12-03.\n", 49 | "TimeStamp: 2021-08-03.\n", 50 | "TimeStamp: 2021-01-16.\n", 51 | "TimeStamp: 2020-09-09.\n" 52 | ] 53 | }, 54 | { 55 | "data": { 56 | "text/html": [ 57 | "
\n", 58 | "\n", 71 | "\n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | " \n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | " \n", 125 | " \n", 126 | " \n", 127 | " \n", 128 | " \n", 129 | " \n", 130 | " \n", 131 | " \n", 132 | " \n", 133 | " \n", 134 | " \n", 135 | " \n", 136 | " \n", 137 | " \n", 138 | " \n", 139 | " \n", 140 | " \n", 141 | " \n", 142 | " \n", 143 | " \n", 144 | " \n", 145 | " \n", 146 | " \n", 147 | " \n", 148 | " \n", 149 | " \n", 150 | " \n", 151 | " \n", 152 | " \n", 153 | " \n", 154 | " \n", 155 | " \n", 156 | " \n", 157 | " \n", 158 | " \n", 159 | " \n", 160 | " \n", 161 | " \n", 162 | " \n", 163 | " \n", 164 | " \n", 165 | " \n", 166 | " \n", 167 | " \n", 168 | " \n", 169 | " \n", 170 | " \n", 171 | " \n", 172 | " \n", 173 | " \n", 174 | " \n", 175 | " \n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | "
NAMETIMEMESSAGELINKPAGEIDPOSTIDCOMMENTCOUNTREACTIONCOUNTSHARECOUNTDISPLAYCOMMENTCOUNTANGERHAHALIKELOVESORRYSUPPORTWOWUPDATETIME
0丟丟妹2021-12-03 02:36:02怎麼分不清 到底是誰帶壞誰XD 丟丟妹 和 Alizabeth 娘娘 根本姐妹淘🤣 互相「曉...https://www.facebook.com/diudiu33317237140343275894964611303571163704.021797.0546.0572.06.05798.015841.0108.06.017.021.02022-01-03 12:42:18
1丟丟妹2021-11-28 21:11:13我喜歡這樣的陽光☀️ 開放留言 想要丟丟賣什麼? 遇到瓶頸了⋯⋯⋯😢 ❤️ #留起來https://www.facebook.com/diudiu333172371403432758949491865451136391675.012674.022.01142.01.07.012510.0125.00.025.06.02022-01-03 12:42:18
2丟丟妹2021-11-19 18:02:56人客啊 客倌們😆歡迎留言+1 2A頂級大閘蟹 (含繩6兩) 特價3088免運 (6隻一箱)最...https://www.facebook.com/diudiu33317237140343275894918431521522475719.032242.065.0543.01.031.031884.0290.03.018.015.02022-01-03 12:42:18
3丟丟妹2021-11-12 22:35:08丟丟妹 終於不用離婚🤣 這次直播半路認走 #乾哥哥:#郭丟丟! 叫買能力根本 #隔空遺傳?!...https://www.facebook.com/diudiu33317237140343275894894343770597917477.018072.0656.0431.023.03495.014351.0125.017.029.032.02022-01-03 12:42:18
4丟丟妹2021-11-10 21:00:52各位親愛的帥哥美女們 由於訊息無限爆炸中 (已經快馬加鞭) 有留訊息就是 已接結單唷 造成...https://www.facebook.com/diudiu33317237140343275894887230927975868352.09598.017.0220.01.019.09476.059.00.034.09.02022-01-03 12:42:18
.........................................................
75丟丟妹2020-05-03 16:52:40猜猜我在哪裡 陳冠霖牛仔部落格 連靜雯joanne lienhttps://www.facebook.com/diudiu33317237140343275891849098925737192921.012355.0851.02921.00.00.00.00.00.00.00.02022-01-03 12:42:18
76丟丟妹2020-04-30 13:36:40踢爆娃娃機...https://www.facebook.com/diudiu33317237140343275893173673282664983946.022357.01035.0856.076.02661.019275.0161.025.071.088.02022-01-03 12:42:18
77丟丟妹2020-04-29 21:51:13今日直播暫停乙次 丟丟身體不舒服了😭 明天下午見 開播 要想我喲 嗚嗚😢https://www.facebook.com/diudiu33317237140343275893171969702835341167.04123.08.0164.00.03.03992.096.07.05.020.02022-01-03 12:42:18
78丟丟妹2020-04-17 19:46:50帥哥美女今天星期五,丟妹偷懶去 明天下午二點 準時見❤️❤️😄有人會報到嗎 刷起來 😂https://www.facebook.com/diudiu33317237140343275893141849955847316116.04804.020.0114.00.06.04744.048.01.00.05.02022-01-03 12:42:18
79丟丟妹2020-04-15 20:48:08丟丟 今天太累累了 😢 直播暫停乙次 王董想要替代丟丟當家 睡覺去 ~~ 可以 (到來...https://www.facebook.com/diudiu33317237140343275893136987929666852145.06943.09.0131.00.023.06847.059.00.00.014.02022-01-03 12:42:18
\n", 329 | "

80 rows × 18 columns

\n", 330 | "
" 331 | ], 332 | "text/plain": [ 333 | " NAME TIME \\\n", 334 | "0 丟丟妹 2021-12-03 02:36:02 \n", 335 | "1 丟丟妹 2021-11-28 21:11:13 \n", 336 | "2 丟丟妹 2021-11-19 18:02:56 \n", 337 | "3 丟丟妹 2021-11-12 22:35:08 \n", 338 | "4 丟丟妹 2021-11-10 21:00:52 \n", 339 | ".. ... ... \n", 340 | "75 丟丟妹 2020-05-03 16:52:40 \n", 341 | "76 丟丟妹 2020-04-30 13:36:40 \n", 342 | "77 丟丟妹 2020-04-29 21:51:13 \n", 343 | "78 丟丟妹 2020-04-17 19:46:50 \n", 344 | "79 丟丟妹 2020-04-15 20:48:08 \n", 345 | "\n", 346 | " MESSAGE \\\n", 347 | "0 怎麼分不清 到底是誰帶壞誰XD 丟丟妹 和 Alizabeth 娘娘 根本姐妹淘🤣 互相「曉... \n", 348 | "1 我喜歡這樣的陽光☀️ 開放留言 想要丟丟賣什麼? 遇到瓶頸了⋯⋯⋯😢 ❤️ #留起來 \n", 349 | "2 人客啊 客倌們😆歡迎留言+1 2A頂級大閘蟹 (含繩6兩) 特價3088免運 (6隻一箱)最... \n", 350 | "3 丟丟妹 終於不用離婚🤣 這次直播半路認走 #乾哥哥:#郭丟丟! 叫買能力根本 #隔空遺傳?!... \n", 351 | "4 各位親愛的帥哥美女們 由於訊息無限爆炸中 (已經快馬加鞭) 有留訊息就是 已接結單唷 造成... \n", 352 | ".. ... \n", 353 | "75 猜猜我在哪裡 陳冠霖牛仔部落格 連靜雯joanne lien \n", 354 | "76 踢爆娃娃機... \n", 355 | "77 今日直播暫停乙次 丟丟身體不舒服了😭 明天下午見 開播 要想我喲 嗚嗚😢 \n", 356 | "78 帥哥美女今天星期五,丟妹偷懶去 明天下午二點 準時見❤️❤️😄有人會報到嗎 刷起來 😂 \n", 357 | "79 丟丟 今天太累累了 😢 直播暫停乙次 王董想要替代丟丟當家 睡覺去 ~~ 可以 (到來... \n", 358 | "\n", 359 | " LINK PAGEID POSTID \\\n", 360 | "0 https://www.facebook.com/diudiu333 1723714034327589 4964611303571163 \n", 361 | "1 https://www.facebook.com/diudiu333 1723714034327589 4949186545113639 \n", 362 | "2 https://www.facebook.com/diudiu333 1723714034327589 4918431521522475 \n", 363 | "3 https://www.facebook.com/diudiu333 1723714034327589 4894343770597917 \n", 364 | "4 https://www.facebook.com/diudiu333 1723714034327589 4887230927975868 \n", 365 | ".. ... ... ... \n", 366 | "75 https://www.facebook.com/diudiu333 1723714034327589 184909892573719 \n", 367 | "76 https://www.facebook.com/diudiu333 1723714034327589 3173673282664983 \n", 368 | "77 https://www.facebook.com/diudiu333 1723714034327589 3171969702835341 \n", 369 | "78 https://www.facebook.com/diudiu333 1723714034327589 3141849955847316 \n", 370 | "79 https://www.facebook.com/diudiu333 1723714034327589 3136987929666852 \n", 371 | "\n", 372 | " COMMENTCOUNT REACTIONCOUNT SHARECOUNT DISPLAYCOMMENTCOUNT ANGER \\\n", 373 | "0 704.0 21797.0 546.0 572.0 6.0 \n", 374 | "1 1675.0 12674.0 22.0 1142.0 1.0 \n", 375 | "2 719.0 32242.0 65.0 543.0 1.0 \n", 376 | "3 477.0 18072.0 656.0 431.0 23.0 \n", 377 | "4 352.0 9598.0 17.0 220.0 1.0 \n", 378 | ".. ... ... ... ... ... \n", 379 | "75 2921.0 12355.0 851.0 2921.0 0.0 \n", 380 | "76 946.0 22357.0 1035.0 856.0 76.0 \n", 381 | "77 167.0 4123.0 8.0 164.0 0.0 \n", 382 | "78 116.0 4804.0 20.0 114.0 0.0 \n", 383 | "79 145.0 6943.0 9.0 131.0 0.0 \n", 384 | "\n", 385 | " HAHA LIKE LOVE SORRY SUPPORT WOW UPDATETIME \n", 386 | "0 5798.0 15841.0 108.0 6.0 17.0 21.0 2022-01-03 12:42:18 \n", 387 | "1 7.0 12510.0 125.0 0.0 25.0 6.0 2022-01-03 12:42:18 \n", 388 | "2 31.0 31884.0 290.0 3.0 18.0 15.0 2022-01-03 12:42:18 \n", 389 | "3 3495.0 14351.0 125.0 17.0 29.0 32.0 2022-01-03 12:42:18 \n", 390 | "4 19.0 9476.0 59.0 0.0 34.0 9.0 2022-01-03 12:42:18 \n", 391 | ".. ... ... ... ... ... ... ... 
\n", 392 | "75 0.0 0.0 0.0 0.0 0.0 0.0 2022-01-03 12:42:18 \n", 393 | "76 2661.0 19275.0 161.0 25.0 71.0 88.0 2022-01-03 12:42:18 \n", 394 | "77 3.0 3992.0 96.0 7.0 5.0 20.0 2022-01-03 12:42:18 \n", 395 | "78 6.0 4744.0 48.0 1.0 0.0 5.0 2022-01-03 12:42:18 \n", 396 | "79 23.0 6847.0 59.0 0.0 0.0 14.0 2022-01-03 12:42:18 \n", 397 | "\n", 398 | "[80 rows x 18 columns]" 399 | ] 400 | }, 401 | "execution_count": 3, 402 | "metadata": {}, 403 | "output_type": "execute_result" 404 | } 405 | ], 406 | "source": [ 407 | "import facebook_crawler\n", 408 | "pageurl= 'https://www.facebook.com/diudiu333'\n", 409 | "facebook_crawler.Crawl_PagePosts(pageurl=pageurl, until_date='2021-01-01')" 410 | ] 411 | }, 412 | { 413 | "cell_type": "code", 414 | "execution_count": 2, 415 | "id": "0045bec9-db09-43c2-afd7-13439ab0c19d", 416 | "metadata": {}, 417 | "outputs": [ 418 | { 419 | "name": "stdout", 420 | "output_type": "stream", 421 | "text": [ 422 | "TimeStamp: 2022-01-01.\n", 423 | "TimeStamp: 2021-06-04.\n", 424 | "TimeStamp: 2021-03-08.\n", 425 | "TimeStamp: 2020-12-11.\n", 426 | "TimeStamp: 2020-10-07.\n", 427 | "TimeStamp: 2020-07-19.\n", 428 | "TimeStamp: 2020-05-10.\n", 429 | "TimeStamp: 2020-02-14.\n", 430 | "TimeStamp: 2019-12-25.\n", 431 | "TimeStamp: 2019-11-16.\n", 432 | "TimeStamp: 2019-08-23.\n", 433 | "TimeStamp: 2019-06-06.\n", 434 | "TimeStamp: 2019-03-20.\n", 435 | "TimeStamp: 2018-12-27.\n", 436 | "TimeStamp: 2018-09-01.\n", 437 | "TimeStamp: 2017-12-29.\n", 438 | "TimeStamp: 2017-06-26.\n", 439 | "TimeStamp: 2016-12-11.\n", 440 | "TimeStamp: 2016-05-15.\n", 441 | "TimeStamp: 2016-04-05.\n", 442 | "TimeStamp: 2016-03-12.\n", 443 | "TimeStamp: 2016-02-06.\n", 444 | "TimeStamp: 2016-01-03.\n", 445 | "TimeStamp: 2015-11-28.\n" 446 | ] 447 | }, 448 | { 449 | "data": { 450 | "text/html": [ 451 | "
\n", 452 | "\n", 465 | "\n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 
737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | "
NAMETIMEMESSAGELINKPAGEIDPOSTIDCOMMENTCOUNTREACTIONCOUNTSHARECOUNTDISPLAYCOMMENTCOUNTANGERDOROTHYHAHALIKELOVESORRYSUPPORTTOTOWOWUPDATETIME
0馬英九2022-01-01 08:00:56今天是中華民國一百一十一年元旦,清晨的氣溫很低,看著國旗在空中飄揚,心中交集著感謝與感慨之情...https://www.facebook.com/MaYingjeou/11825050490375749362866297667631243.038906.0403.01022.05.0NaN22.038254.0458.04.0157.0NaN6.02022-01-05 23:16:30
1馬英九2021-12-21 17:05:14冬至天氣冷颼颼,來一碗熱熱的燒湯圓 🤤 大家喜歡吃芝麻湯圓還是花生湯圓呢? #冬至 #吃湯圓...https://www.facebook.com/MaYingjeou/11825050490375748938422306778701047.028432.0179.0844.015.0NaN64.027961.0325.02.055.0NaN10.02022-01-05 23:16:30
2馬英九2021-12-17 17:00:00公民投票是憲法賦予人民的權利 1218,穿得暖暖,出門投票! #四個都同意人民最有利 #返鄉...https://www.facebook.com/MaYingjeou/11825050490375748746269625993971883.045037.0519.01372.031.0NaN63.043941.0336.016.0634.0NaN16.02022-01-05 23:16:30
3馬英九2021-12-11 09:00:00公投前最後一個周末,你收到投票通知單了嗎? 萊劑不是保健品 公投重新綁大選 三接興建傷藻礁⋯...https://www.facebook.com/MaYingjeou/11825050490375748464913254129613254.039639.0809.01742.060.0NaN78.038702.0196.011.0568.0NaN24.02022-01-05 23:16:30
4馬英九2021-11-14 13:52:11鯉魚潭鐵三初體驗🏊‍♂️🚴‍♂️🏃‍♂️ 順利完賽💪💪💪 #2021年花蓮太平洋鐵人三項錦標...https://www.facebook.com/MaYingjeou/11825050490375747647072402580372409.073538.0501.02017.015.0NaN41.072407.0557.04.0292.0NaN222.02022-01-05 23:16:30
...............................................................
475馬英九2015-10-19 18:10:33「六堆忠義祠」源於清康熙60年,六堆團練協助平定朱一貴起事,清廷興建了「西勢忠義亭」奉祀在戰...https://www.facebook.com/MaYingjeou/11825050490375710507315383223111176.021803.0316.0705.00.0NaN0.021801.02.00.00.0NaN0.02022-01-05 23:16:30
476馬英九2015-10-12 18:07:34https://www.facebook.com/MaYingjeou/1182505049037571047937641935034NaNNaNNaNNaN0.0NaN0.00.00.00.00.0NaN0.02022-01-05 23:16:30
477馬英九2015-10-11 13:40:25https://www.facebook.com/MaYingjeou/1182505049037571047435965318535NaNNaNNaNNaN0.0NaN0.00.00.00.00.0NaN0.02022-01-05 23:16:30
478馬英九2015-10-10 14:08:46今天是民國104年國慶日,讓我們一起祝中華民國生日快樂! 今年的國慶日,還有特別的意義。今年...https://www.facebook.com/MaYingjeou/11825050490375710470313853589932219.036765.0679.01357.03.0NaN1.036761.00.00.00.0NaN0.02022-01-05 23:16:30
479馬英九2015-10-09 19:58:31故宮博物院於民國14年的國慶日成立,今年是它的90歲生日。故宮成立後幾經波折,對日抗戰爆發後...https://www.facebook.com/MaYingjeou/1182505049037571046725805389551993.014761.0306.0590.03.0NaN0.014757.01.00.00.0NaN0.02022-01-05 23:16:30
\n", 747 | "

480 rows × 20 columns

\n", 748 | "
" 749 | ], 750 | "text/plain": [ 751 | " NAME TIME \\\n", 752 | "0 馬英九 2022-01-01 08:00:56 \n", 753 | "1 馬英九 2021-12-21 17:05:14 \n", 754 | "2 馬英九 2021-12-17 17:00:00 \n", 755 | "3 馬英九 2021-12-11 09:00:00 \n", 756 | "4 馬英九 2021-11-14 13:52:11 \n", 757 | ".. ... ... \n", 758 | "475 馬英九 2015-10-19 18:10:33 \n", 759 | "476 馬英九 2015-10-12 18:07:34 \n", 760 | "477 馬英九 2015-10-11 13:40:25 \n", 761 | "478 馬英九 2015-10-10 14:08:46 \n", 762 | "479 馬英九 2015-10-09 19:58:31 \n", 763 | "\n", 764 | " MESSAGE \\\n", 765 | "0 今天是中華民國一百一十一年元旦,清晨的氣溫很低,看著國旗在空中飄揚,心中交集著感謝與感慨之情... \n", 766 | "1 冬至天氣冷颼颼,來一碗熱熱的燒湯圓 🤤 大家喜歡吃芝麻湯圓還是花生湯圓呢? #冬至 #吃湯圓... \n", 767 | "2 公民投票是憲法賦予人民的權利 1218,穿得暖暖,出門投票! #四個都同意人民最有利 #返鄉... \n", 768 | "3 公投前最後一個周末,你收到投票通知單了嗎? 萊劑不是保健品 公投重新綁大選 三接興建傷藻礁⋯... \n", 769 | "4 鯉魚潭鐵三初體驗🏊‍♂️🚴‍♂️🏃‍♂️ 順利完賽💪💪💪 #2021年花蓮太平洋鐵人三項錦標... \n", 770 | ".. ... \n", 771 | "475 「六堆忠義祠」源於清康熙60年,六堆團練協助平定朱一貴起事,清廷興建了「西勢忠義亭」奉祀在戰... \n", 772 | "476 \n", 773 | "477 \n", 774 | "478 今天是民國104年國慶日,讓我們一起祝中華民國生日快樂! 今年的國慶日,還有特別的意義。今年... \n", 775 | "479 故宮博物院於民國14年的國慶日成立,今年是它的90歲生日。故宮成立後幾經波折,對日抗戰爆發後... \n", 776 | "\n", 777 | " LINK PAGEID POSTID \\\n", 778 | "0 https://www.facebook.com/MaYingjeou/ 118250504903757 4936286629766763 \n", 779 | "1 https://www.facebook.com/MaYingjeou/ 118250504903757 4893842230677870 \n", 780 | "2 https://www.facebook.com/MaYingjeou/ 118250504903757 4874626962599397 \n", 781 | "3 https://www.facebook.com/MaYingjeou/ 118250504903757 4846491325412961 \n", 782 | "4 https://www.facebook.com/MaYingjeou/ 118250504903757 4764707240258037 \n", 783 | ".. ... ... ... \n", 784 | "475 https://www.facebook.com/MaYingjeou/ 118250504903757 1050731538322311 \n", 785 | "476 https://www.facebook.com/MaYingjeou/ 118250504903757 1047937641935034 \n", 786 | "477 https://www.facebook.com/MaYingjeou/ 118250504903757 1047435965318535 \n", 787 | "478 https://www.facebook.com/MaYingjeou/ 118250504903757 1047031385358993 \n", 788 | "479 https://www.facebook.com/MaYingjeou/ 118250504903757 1046725805389551 \n", 789 | "\n", 790 | " COMMENTCOUNT REACTIONCOUNT SHARECOUNT DISPLAYCOMMENTCOUNT ANGER \\\n", 791 | "0 1243.0 38906.0 403.0 1022.0 5.0 \n", 792 | "1 1047.0 28432.0 179.0 844.0 15.0 \n", 793 | "2 1883.0 45037.0 519.0 1372.0 31.0 \n", 794 | "3 3254.0 39639.0 809.0 1742.0 60.0 \n", 795 | "4 2409.0 73538.0 501.0 2017.0 15.0 \n", 796 | ".. ... ... ... ... ... \n", 797 | "475 1176.0 21803.0 316.0 705.0 0.0 \n", 798 | "476 NaN NaN NaN NaN 0.0 \n", 799 | "477 NaN NaN NaN NaN 0.0 \n", 800 | "478 2219.0 36765.0 679.0 1357.0 3.0 \n", 801 | "479 993.0 14761.0 306.0 590.0 3.0 \n", 802 | "\n", 803 | " DOROTHY HAHA LIKE LOVE SORRY SUPPORT TOTO WOW \\\n", 804 | "0 NaN 22.0 38254.0 458.0 4.0 157.0 NaN 6.0 \n", 805 | "1 NaN 64.0 27961.0 325.0 2.0 55.0 NaN 10.0 \n", 806 | "2 NaN 63.0 43941.0 336.0 16.0 634.0 NaN 16.0 \n", 807 | "3 NaN 78.0 38702.0 196.0 11.0 568.0 NaN 24.0 \n", 808 | "4 NaN 41.0 72407.0 557.0 4.0 292.0 NaN 222.0 \n", 809 | ".. ... ... ... ... ... ... ... ... \n", 810 | "475 NaN 0.0 21801.0 2.0 0.0 0.0 NaN 0.0 \n", 811 | "476 NaN 0.0 0.0 0.0 0.0 0.0 NaN 0.0 \n", 812 | "477 NaN 0.0 0.0 0.0 0.0 0.0 NaN 0.0 \n", 813 | "478 NaN 1.0 36761.0 0.0 0.0 0.0 NaN 0.0 \n", 814 | "479 NaN 0.0 14757.0 1.0 0.0 0.0 NaN 0.0 \n", 815 | "\n", 816 | " UPDATETIME \n", 817 | "0 2022-01-05 23:16:30 \n", 818 | "1 2022-01-05 23:16:30 \n", 819 | "2 2022-01-05 23:16:30 \n", 820 | "3 2022-01-05 23:16:30 \n", 821 | "4 2022-01-05 23:16:30 \n", 822 | ".. ... 
\n", 823 | "475 2022-01-05 23:16:30 \n", 824 | "476 2022-01-05 23:16:30 \n", 825 | "477 2022-01-05 23:16:30 \n", 826 | "478 2022-01-05 23:16:30 \n", 827 | "479 2022-01-05 23:16:30 \n", 828 | "\n", 829 | "[480 rows x 20 columns]" 830 | ] 831 | }, 832 | "execution_count": 2, 833 | "metadata": {}, 834 | "output_type": "execute_result" 835 | } 836 | ], 837 | "source": [ 838 | "import facebook_crawler\n", 839 | "pageurl= 'https://www.facebook.com/MaYingjeou'\n", 840 | "df = facebook_crawler.Crawl_PagePosts(pageurl=pageurl, until_date='2016-01-01')\n", 841 | "df" 842 | ] 843 | }, 844 | { 845 | "cell_type": "code", 846 | "execution_count": null, 847 | "id": "5696cdfc-8f3a-47cd-a936-97112f077db4", 848 | "metadata": {}, 849 | "outputs": [], 850 | "source": [] 851 | } 852 | ], 853 | "metadata": { 854 | "kernelspec": { 855 | "display_name": "Python 3 (ipykernel)", 856 | "language": "python", 857 | "name": "python3" 858 | }, 859 | "language_info": { 860 | "codemirror_mode": { 861 | "name": "ipython", 862 | "version": 3 863 | }, 864 | "file_extension": ".py", 865 | "mimetype": "text/x-python", 866 | "name": "python", 867 | "nbconvert_exporter": "python", 868 | "pygments_lexer": "ipython3", 869 | "version": "3.8.10" 870 | } 871 | }, 872 | "nbformat": 4, 873 | "nbformat_minor": 5 874 | } 875 | -------------------------------------------------------------------------------- /sample/Group.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "id": "f31b5034-e3d4-40f4-b52a-8e4addba342d", 7 | "metadata": {}, 8 | "outputs": [ 9 | { 10 | "name": "stdout", 11 | "output_type": "stream", 12 | "text": [ 13 | "TimeStamp: 2022-01-05.\n", 14 | "TimeStamp: 2022-01-02.\n", 15 | "TimeStamp: 2021-12-28.\n", 16 | "TimeStamp: 2021-12-26.\n", 17 | "TimeStamp: 2021-12-21.\n", 18 | "TimeStamp: 2021-12-17.\n", 19 | "TimeStamp: 2021-12-14.\n", 20 | "TimeStamp: 2021-12-12.\n", 21 | "TimeStamp: 2021-12-08.\n", 22 | "TimeStamp: 2021-12-07.\n", 23 | "TimeStamp: 2021-12-04.\n", 24 | "TimeStamp: 2021-11-30.\n", 25 | "TimeStamp: 2021-11-28.\n", 26 | "TimeStamp: 2021-11-25.\n", 27 | "TimeStamp: 2021-11-22.\n", 28 | "TimeStamp: 2021-11-19.\n", 29 | "TimeStamp: 2021-11-18.\n", 30 | "TimeStamp: 2021-11-15.\n", 31 | "TimeStamp: 2021-11-14.\n", 32 | "TimeStamp: 2021-11-11.\n", 33 | "TimeStamp: 2021-11-09.\n", 34 | "TimeStamp: 2021-11-07.\n", 35 | "TimeStamp: 2021-11-05.\n", 36 | "TimeStamp: 2021-11-03.\n", 37 | "TimeStamp: 2021-11-02.\n", 38 | "TimeStamp: 2021-10-31.\n", 39 | "TimeStamp: 2021-10-28.\n", 40 | "TimeStamp: 2021-10-26.\n", 41 | "TimeStamp: 2021-10-24.\n", 42 | "TimeStamp: 2021-10-22.\n", 43 | "TimeStamp: 2021-10-20.\n", 44 | "TimeStamp: 2021-10-18.\n", 45 | "TimeStamp: 2021-10-17.\n", 46 | "TimeStamp: 2021-10-15.\n", 47 | "TimeStamp: 2021-10-13.\n", 48 | "TimeStamp: 2021-10-11.\n", 49 | "TimeStamp: 2021-10-09.\n", 50 | "TimeStamp: 2021-10-07.\n", 51 | "TimeStamp: 2021-10-05.\n", 52 | "TimeStamp: 2021-10-03.\n", 53 | "TimeStamp: 2021-10-01.\n", 54 | "TimeStamp: 2021-09-30.\n", 55 | "TimeStamp: 2021-09-28.\n", 56 | "TimeStamp: 2021-09-25.\n", 57 | "TimeStamp: 2021-09-23.\n", 58 | "TimeStamp: 2021-09-21.\n", 59 | "TimeStamp: 2021-09-19.\n", 60 | "TimeStamp: 2021-09-18.\n", 61 | "TimeStamp: 2021-09-16.\n", 62 | "TimeStamp: 2021-09-15.\n", 63 | "TimeStamp: 2021-09-13.\n", 64 | "TimeStamp: 2021-09-11.\n", 65 | "TimeStamp: 2021-09-09.\n", 66 | "TimeStamp: 2021-09-08.\n", 67 | "TimeStamp: 2021-09-06.\n", 68 | "TimeStamp: 
2021-09-05.\n", 69 | "TimeStamp: 2021-09-03.\n", 70 | "TimeStamp: 2021-09-02.\n", 71 | "TimeStamp: 2021-09-01.\n", 72 | "TimeStamp: 2021-08-29.\n", 73 | "TimeStamp: 2021-08-27.\n", 74 | "TimeStamp: 2021-08-25.\n", 75 | "TimeStamp: 2021-08-23.\n", 76 | "TimeStamp: 2021-08-22.\n", 77 | "TimeStamp: 2021-08-20.\n", 78 | "TimeStamp: 2021-08-18.\n", 79 | "TimeStamp: 2021-08-17.\n", 80 | "TimeStamp: 2021-08-15.\n", 81 | "TimeStamp: 2021-08-14.\n", 82 | "TimeStamp: 2021-08-12.\n", 83 | "TimeStamp: 2021-08-11.\n", 84 | "TimeStamp: 2021-08-10.\n", 85 | "TimeStamp: 2021-08-08.\n", 86 | "TimeStamp: 2021-08-06.\n", 87 | "TimeStamp: 2021-08-04.\n", 88 | "TimeStamp: 2021-08-02.\n", 89 | "TimeStamp: 2021-07-31.\n", 90 | "TimeStamp: 2021-07-29.\n", 91 | "TimeStamp: 2021-07-28.\n", 92 | "TimeStamp: 2021-07-26.\n", 93 | "TimeStamp: 2021-07-24.\n", 94 | "TimeStamp: 2021-07-22.\n", 95 | "TimeStamp: 2021-07-21.\n", 96 | "TimeStamp: 2021-07-19.\n", 97 | "TimeStamp: 2021-07-18.\n", 98 | "TimeStamp: 2021-07-17.\n", 99 | "TimeStamp: 2021-07-16.\n", 100 | "TimeStamp: 2021-07-14.\n", 101 | "TimeStamp: 2021-07-12.\n", 102 | "TimeStamp: 2021-07-10.\n", 103 | "TimeStamp: 2021-07-09.\n", 104 | "TimeStamp: 2021-07-07.\n", 105 | "TimeStamp: 2021-07-04.\n", 106 | "TimeStamp: 2021-07-03.\n", 107 | "TimeStamp: 2021-07-02.\n", 108 | "TimeStamp: 2021-06-30.\n", 109 | "TimeStamp: 2021-06-28.\n", 110 | "TimeStamp: 2021-06-26.\n", 111 | "TimeStamp: 2021-06-25.\n", 112 | "TimeStamp: 2021-06-22.\n", 113 | "TimeStamp: 2021-06-21.\n", 114 | "TimeStamp: 2021-06-19.\n", 115 | "TimeStamp: 2021-06-18.\n", 116 | "TimeStamp: 2021-06-16.\n", 117 | "TimeStamp: 2021-06-14.\n", 118 | "TimeStamp: 2021-06-12.\n", 119 | "TimeStamp: 2021-06-11.\n", 120 | "TimeStamp: 2021-06-10.\n", 121 | "TimeStamp: 2021-06-08.\n", 122 | "TimeStamp: 2021-06-06.\n", 123 | "TimeStamp: 2021-06-06.\n", 124 | "TimeStamp: 2021-06-04.\n", 125 | "TimeStamp: 2021-06-02.\n", 126 | "TimeStamp: 2021-05-31.\n", 127 | "TimeStamp: 2021-05-29.\n", 128 | "TimeStamp: 2021-05-26.\n", 129 | "TimeStamp: 2021-05-25.\n", 130 | "TimeStamp: 2021-05-23.\n", 131 | "TimeStamp: 2021-05-21.\n", 132 | "TimeStamp: 2021-05-19.\n", 133 | "TimeStamp: 2021-05-17.\n", 134 | "TimeStamp: 2021-05-13.\n", 135 | "TimeStamp: 2021-05-11.\n", 136 | "TimeStamp: 2021-05-09.\n", 137 | "TimeStamp: 2021-05-07.\n", 138 | "TimeStamp: 2021-05-05.\n", 139 | "TimeStamp: 2021-05-04.\n", 140 | "TimeStamp: 2021-05-03.\n", 141 | "TimeStamp: 2021-04-30.\n", 142 | "TimeStamp: 2021-04-29.\n", 143 | "TimeStamp: 2021-04-28.\n", 144 | "TimeStamp: 2021-04-27.\n", 145 | "TimeStamp: 2021-04-26.\n", 146 | "TimeStamp: 2021-04-24.\n", 147 | "TimeStamp: 2021-04-22.\n", 148 | "TimeStamp: 2021-04-20.\n", 149 | "TimeStamp: 2021-04-19.\n", 150 | "TimeStamp: 2021-04-17.\n", 151 | "TimeStamp: 2021-04-16.\n", 152 | "TimeStamp: 2021-04-14.\n", 153 | "TimeStamp: 2021-04-12.\n", 154 | "TimeStamp: 2021-04-11.\n", 155 | "TimeStamp: 2021-04-08.\n", 156 | "TimeStamp: 2021-04-07.\n", 157 | "TimeStamp: 2021-04-06.\n", 158 | "TimeStamp: 2021-04-04.\n", 159 | "TimeStamp: 2021-04-03.\n", 160 | "TimeStamp: 2021-04-01.\n", 161 | "TimeStamp: 2021-03-30.\n", 162 | "TimeStamp: 2021-03-29.\n", 163 | "TimeStamp: 2021-03-27.\n", 164 | "TimeStamp: 2021-03-26.\n", 165 | "TimeStamp: 2021-03-25.\n", 166 | "TimeStamp: 2021-03-24.\n", 167 | "TimeStamp: 2021-03-21.\n", 168 | "TimeStamp: 2021-03-19.\n", 169 | "TimeStamp: 2021-03-18.\n", 170 | "TimeStamp: 2021-03-16.\n", 171 | "TimeStamp: 2021-03-14.\n", 172 | "TimeStamp: 2021-03-12.\n", 173 | "TimeStamp: 2021-03-11.\n", 
174 | "TimeStamp: 2021-03-09.\n", 175 | "TimeStamp: 2021-03-08.\n", 176 | "TimeStamp: 2021-03-06.\n", 177 | "TimeStamp: 2021-03-05.\n", 178 | "TimeStamp: 2021-03-04.\n", 179 | "TimeStamp: 2021-03-02.\n", 180 | "TimeStamp: 2021-02-28.\n", 181 | "TimeStamp: 2021-02-27.\n", 182 | "TimeStamp: 2021-02-25.\n", 183 | "TimeStamp: 2021-02-23.\n", 184 | "TimeStamp: 2021-02-20.\n", 185 | "TimeStamp: 2021-02-18.\n", 186 | "TimeStamp: 2021-02-16.\n", 187 | "TimeStamp: 2021-02-14.\n", 188 | "TimeStamp: 2021-02-11.\n", 189 | "TimeStamp: 2021-02-09.\n", 190 | "TimeStamp: 2021-02-07.\n", 191 | "TimeStamp: 2021-02-05.\n", 192 | "TimeStamp: 2021-02-03.\n", 193 | "TimeStamp: 2021-02-01.\n", 194 | "TimeStamp: 2021-01-31.\n", 195 | "TimeStamp: 2021-01-29.\n", 196 | "TimeStamp: 2021-01-27.\n", 197 | "TimeStamp: 2021-01-25.\n", 198 | "TimeStamp: 2021-01-23.\n", 199 | "TimeStamp: 2021-01-21.\n", 200 | "TimeStamp: 2021-01-20.\n", 201 | "TimeStamp: 2021-01-18.\n", 202 | "TimeStamp: 2021-01-15.\n", 203 | "TimeStamp: 2021-01-13.\n", 204 | "TimeStamp: 2021-01-11.\n", 205 | "TimeStamp: 2021-01-09.\n", 206 | "TimeStamp: 2021-01-07.\n", 207 | "TimeStamp: 2021-01-06.\n", 208 | "TimeStamp: 2021-01-03.\n", 209 | "TimeStamp: 2020-12-30.\n" 210 | ] 211 | }, 212 | { 213 | "data": { 214 | "text/html": [ 215 | "
\n", 216 | "\n", 229 | "\n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | " \n", 326 | " \n", 327 | " \n", 328 | " \n", 329 | " \n", 330 | " \n", 331 | " \n", 332 | " \n", 333 | " \n", 334 | " \n", 335 | " \n", 336 | " \n", 337 | " \n", 338 | " \n", 339 | " \n", 340 | " \n", 341 | " \n", 342 | " \n", 343 | " \n", 344 | " \n", 345 | " \n", 346 | " \n", 347 | " \n", 348 | " \n", 349 | " \n", 350 | " \n", 351 | " \n", 352 | " \n", 353 | " \n", 354 | " \n", 355 | " \n", 356 | " \n", 357 | " \n", 358 | " \n", 359 | " \n", 360 | " \n", 361 | " \n", 362 | " \n", 363 | " \n", 364 | " \n", 365 | " \n", 366 | " \n", 367 | " \n", 368 | " \n", 369 | " \n", 370 | " \n", 371 | " \n", 372 | " \n", 373 | " \n", 374 | " \n", 375 | " \n", 376 | " \n", 377 | " \n", 378 | " \n", 379 | " \n", 380 | " \n", 381 | " \n", 382 | " \n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | "
ACTORIDNAMEGROUPIDPOSTIDTIMECONTENTCOMMENTCOUNTSHARECOUNTLIKECOUNTUPDATETIME
0\"100008360976636\"蔣忠祐197223143437101618251499234382022-01-04 19:28:20小弟想問一下,因為本身只寫過python也只會ML其他的都不太熟悉。 如果我今天想要寫一個簡...497422022-01-05 22:47:28
11372450133陳柏勳197223143437101618250762584382022-01-04 18:16:42各位高手大大 小弟剛自學 python 我先在 Jupyter 寫一個檔案 並存檔為Hell...140102022-01-05 22:47:28
2\"100029978773864\"沈宗叡197223143437101618186036784382022-01-01 14:46:35在練習網路上的題目: https://tioj.ck.tp.edu.tw/problems/...15332022-01-05 22:47:28
3696780981MaoYang Chien197223143437101618267182434382022-01-05 11:29:06這個開源工具使用 python、flask 等技術開發 如果你用「專案管理」的方法來完成一堂...09242022-01-05 22:47:28
4\"100006223006386\"劉奕德197223143437101618167808334382021-12-31 23:20:43想詢問版上各位大大 近期在做廠房的耗電量改善 想到python可以應用在建立模型及預測 請問...372292022-01-05 22:47:28
.................................
3935\"105673814305452\"用圖片高效學程式105673814305452101608370212734382020-12-27 13:00:34【圖解演算法教學】Bubble Sort 大隊接力賽 這次首次嘗試以「動畫」形式,來演示Bu...29142022-01-05 22:47:28
3936\"100051655785472\"劉昶林197223143437101608149441634382020-12-19 23:56:39嗨各位, 我現在有個問題,我想要將yolov4(darknet) 跟 Arduino做結合,...3413722022-01-05 22:47:28
3937\"100002220077374\"Hugh LI197223143437101608405058084382020-12-28 20:01:25想問一下各位一個蠻基礎的問題 (先撇開其他條件) c=s[a:b] c[::-1] 為甚麼...101122022-01-05 22:47:28
3938504805657Roy Hsu197223143437101608379615334382020-12-27 22:27:50我想實現 用python 去更換 資料夾的icon ( windows ) 網路上看到這個 ...6132022-01-05 22:47:28
3939\"100001849455014\"Vivi Chen197223143437101608379103484382020-12-27 22:05:17Test 的 SayHello 函式在初始化後, 就不用再輸入'Amy' 最終目的就是只要在...6192022-01-05 22:47:28
\n", 391 | "

3940 rows × 10 columns

\n", 392 | "
" 393 | ], 394 | "text/plain": [ 395 | " ACTORID NAME GROUPID POSTID \\\n", 396 | "0 \"100008360976636\" 蔣忠祐 197223143437 10161825149923438 \n", 397 | "1 1372450133 陳柏勳 197223143437 10161825076258438 \n", 398 | "2 \"100029978773864\" 沈宗叡 197223143437 10161818603678438 \n", 399 | "3 696780981 MaoYang Chien 197223143437 10161826718243438 \n", 400 | "4 \"100006223006386\" 劉奕德 197223143437 10161816780833438 \n", 401 | "... ... ... ... ... \n", 402 | "3935 \"105673814305452\" 用圖片高效學程式 105673814305452 10160837021273438 \n", 403 | "3936 \"100051655785472\" 劉昶林 197223143437 10160814944163438 \n", 404 | "3937 \"100002220077374\" Hugh LI 197223143437 10160840505808438 \n", 405 | "3938 504805657 Roy Hsu 197223143437 10160837961533438 \n", 406 | "3939 \"100001849455014\" Vivi Chen 197223143437 10160837910348438 \n", 407 | "\n", 408 | " TIME CONTENT \\\n", 409 | "0 2022-01-04 19:28:20 小弟想問一下,因為本身只寫過python也只會ML其他的都不太熟悉。 如果我今天想要寫一個簡... \n", 410 | "1 2022-01-04 18:16:42 各位高手大大 小弟剛自學 python 我先在 Jupyter 寫一個檔案 並存檔為Hell... \n", 411 | "2 2022-01-01 14:46:35 在練習網路上的題目: https://tioj.ck.tp.edu.tw/problems/... \n", 412 | "3 2022-01-05 11:29:06 這個開源工具使用 python、flask 等技術開發 如果你用「專案管理」的方法來完成一堂... \n", 413 | "4 2021-12-31 23:20:43 想詢問版上各位大大 近期在做廠房的耗電量改善 想到python可以應用在建立模型及預測 請問... \n", 414 | "... ... ... \n", 415 | "3935 2020-12-27 13:00:34 【圖解演算法教學】Bubble Sort 大隊接力賽 這次首次嘗試以「動畫」形式,來演示Bu... \n", 416 | "3936 2020-12-19 23:56:39 嗨各位, 我現在有個問題,我想要將yolov4(darknet) 跟 Arduino做結合,... \n", 417 | "3937 2020-12-28 20:01:25 想問一下各位一個蠻基礎的問題 (先撇開其他條件) c=s[a:b] c[::-1] 為甚麼... \n", 418 | "3938 2020-12-27 22:27:50 我想實現 用python 去更換 資料夾的icon ( windows ) 網路上看到這個 ... \n", 419 | "3939 2020-12-27 22:05:17 Test 的 SayHello 函式在初始化後, 就不用再輸入'Amy' 最終目的就是只要在... \n", 420 | "\n", 421 | " COMMENTCOUNT SHARECOUNT LIKECOUNT UPDATETIME \n", 422 | "0 49 7 42 2022-01-05 22:47:28 \n", 423 | "1 14 0 10 2022-01-05 22:47:28 \n", 424 | "2 15 3 3 2022-01-05 22:47:28 \n", 425 | "3 0 9 24 2022-01-05 22:47:28 \n", 426 | "4 37 2 29 2022-01-05 22:47:28 \n", 427 | "... ... ... ... ... 
\n", 428 | "3935 2 9 14 2022-01-05 22:47:28 \n", 429 | "3936 34 13 72 2022-01-05 22:47:28 \n", 430 | "3937 10 1 12 2022-01-05 22:47:28 \n", 431 | "3938 6 1 3 2022-01-05 22:47:28 \n", 432 | "3939 6 1 9 2022-01-05 22:47:28 \n", 433 | "\n", 434 | "[3940 rows x 10 columns]" 435 | ] 436 | }, 437 | "execution_count": 2, 438 | "metadata": {}, 439 | "output_type": "execute_result" 440 | } 441 | ], 442 | "source": [ 443 | "import facebook_crawler\n", 444 | "groupurl = 'https://www.facebook.com/groups/pythontw'\n", 445 | "facebook_crawler.Crawl_GroupPosts(groupurl, until_date='2021-01-01')" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": null, 451 | "id": "df297081-8a01-4a6c-9719-9b4dfa87410c", 452 | "metadata": {}, 453 | "outputs": [], 454 | "source": [] 455 | } 456 | ], 457 | "metadata": { 458 | "kernelspec": { 459 | "display_name": "Python 3 (ipykernel)", 460 | "language": "python", 461 | "name": "python3" 462 | }, 463 | "language_info": { 464 | "codemirror_mode": { 465 | "name": "ipython", 466 | "version": 3 467 | }, 468 | "file_extension": ".py", 469 | "mimetype": "text/x-python", 470 | "name": "python", 471 | "nbconvert_exporter": "python", 472 | "pygments_lexer": "ipython3", 473 | "version": "3.8.10" 474 | } 475 | }, 476 | "nbformat": 4, 477 | "nbformat_minor": 5 478 | } 479 | -------------------------------------------------------------------------------- /sample/data/PyConTaiwan.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlyu0419/facebook_crawler/82ec3ec46aa0a324252bbb6274b57cbf29c27e6e/sample/data/PyConTaiwan.parquet -------------------------------------------------------------------------------- /sample/data/corollacrossclub.parquet: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/tlyu0419/facebook_crawler/82ec3ec46aa0a324252bbb6274b57cbf29c27e6e/sample/data/corollacrossclub.parquet -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | import setuptools 2 | 3 | with open("README.md", "r") as fh: 4 | long_description = fh.read() 5 | 6 | setuptools.setup( 7 | name="facebook_crawler", 8 | version="0.0.28", 9 | author="TENG-LIN YU", 10 | author_email="tlyu0419@gmail.com", 11 | description="Facebook crawler package can help you crawl the posts on public fanspages and groups from Facebook.", 12 | long_description=long_description, 13 | long_description_content_type="text/markdown", 14 | url="https://github.com/TLYu0419/facebook_crawler", 15 | packages=setuptools.find_packages(), 16 | py_modules=['facebook_crawler'], 17 | classifiers=[ 18 | "Programming Language :: Python :: 3", 19 | "License :: OSI Approved :: Apache Software License", 20 | "Operating System :: OS Independent", 21 | ], 22 | python_requires=">=3.6", 23 | install_requires=[ 24 | "requests", 25 | "bs4", 26 | "numpy", 27 | "pandas", 28 | "lxml" 29 | ], 30 | ) -------------------------------------------------------------------------------- /tests/test_facebook_crawler.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | 4 | @pytest.fixture 5 | def pageurl(): 6 | return 'https://www.facebook.com/groups/pythontw' 7 | 8 | 9 | @pytest.fixture 10 | def until_date(): 11 | from datetime import datetime 12 | from datetime import timedelta 13 | return str(datetime.now() - timedelta(days=1)) 
14 | 15 | 16 | def test_Crawl_PagePosts(pageurl, until_date): 17 | from main import Crawl_PagePosts 18 | df = Crawl_PagePosts(pageurl, until_date=until_date) 19 | assert df 20 | 21 | 22 | def test_Crawl_GroupPosts(pageurl, until_date): 23 | from main import Crawl_GroupPosts 24 | df = Crawl_GroupPosts(pageurl, until_date=until_date) 25 | assert df 26 | -------------------------------------------------------------------------------- /tests/test_page_parser.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | from utils import _init_request_vars 4 | 5 | 6 | @pytest.fixture 7 | def pageurl(): 8 | return 'https://www.facebook.com/groups/pythontw' 9 | 10 | 11 | @pytest.fixture 12 | def headers(pageurl): 13 | from requester import _get_headers 14 | return _get_headers(pageurl) 15 | 16 | 17 | @pytest.fixture 18 | def homepage_response(pageurl, headers): 19 | from requester import _get_homepage 20 | return _get_homepage(pageurl=pageurl, headers=headers) 21 | 22 | 23 | @pytest.fixture 24 | def entryPoint(homepage_response): 25 | from page_paser import _parse_entryPoint 26 | return _parse_entryPoint(homepage_response) 27 | 28 | 29 | @pytest.fixture 30 | def identifier(entryPoint, homepage_response): 31 | from page_paser import _parse_identifier 32 | return _parse_identifier(entryPoint, homepage_response) 33 | 34 | 35 | @pytest.fixture 36 | def docid(entryPoint, homepage_response): 37 | from page_paser import _parse_docid 38 | return _parse_docid(entryPoint, homepage_response) 39 | 40 | 41 | @pytest.fixture 42 | def cursor(): 43 | _, cursor, _, _ = _init_request_vars() 44 | return cursor 45 | 46 | 47 | def test_parse_pagetype(homepage_response): 48 | from page_paser import _parse_pagetype 49 | page_type = _parse_pagetype(homepage_response) 50 | assert page_type.lower() in ['group', 'fanspage'] 51 | 52 | 53 | def test_parse_pagename(homepage_response): 54 | from page_paser import _parse_pagename 55 | page_name = _parse_pagename(homepage_response) 56 | assert page_name 57 | 58 | 59 | def test_parse_entryPoint(homepage_response): 60 | from page_paser import _parse_entryPoint 61 | entryPoint = _parse_entryPoint(homepage_response) 62 | assert entryPoint 63 | 64 | 65 | def test_parse_identifier(entryPoint, homepage_response): 66 | from page_paser import _parse_identifier 67 | identifier = _parse_identifier(entryPoint, homepage_response) 68 | assert identifier 69 | 70 | 71 | def test_parse_docid(entryPoint, homepage_response): 72 | from page_paser import _parse_docid 73 | docid = _parse_docid(entryPoint, homepage_response) 74 | assert docid 75 | 76 | 77 | def test_parse_likes(homepage_response, entryPoint, headers): 78 | from page_paser import _parse_likes 79 | likes = _parse_likes(homepage_response, entryPoint, headers) 80 | assert likes 81 | 82 | 83 | def test_parse_creation_time(homepage_response, entryPoint, headers): 84 | from page_paser import _parse_creation_time 85 | creation_time = _parse_creation_time(homepage_response, entryPoint, headers) 86 | assert creation_time 87 | 88 | 89 | def test_parse_category(homepage_response, entryPoint, headers): 90 | from page_paser import _parse_category 91 | category = _parse_category(homepage_response, entryPoint, headers) 92 | assert category 93 | 94 | 95 | def test_parse_pageurl(homepage_response): 96 | from page_paser import _parse_pageurl 97 | pageurl = _parse_pageurl(homepage_response) 98 | assert pageurl 99 | -------------------------------------------------------------------------------- 
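The fixtures in tests/test_page_parser.py above mirror the crawler's internal request pipeline: resolve headers, fetch the homepage, parse the entryPoint, identifier and docid, then request posts with a cursor. The sketch below strings that same chain together outside pytest; the module names, function names and the example group URL are taken verbatim from the fixtures, while everything else (printing the status code, how you would loop with the returned cursor) is illustrative only.

```python
# A sketch of the request pipeline exercised by the fixtures above.
# Module and function names are copied from the test fixtures; paging with
# the returned cursor is not shown here and is left as an assumption.
from requester import _get_headers, _get_homepage, _get_posts
from page_paser import _parse_entryPoint, _parse_identifier, _parse_docid
from utils import _init_request_vars

pageurl = 'https://www.facebook.com/groups/pythontw'

# 1. Resolve request headers and fetch the group's homepage.
headers = _get_headers(pageurl)
homepage_response = _get_homepage(pageurl=pageurl, headers=headers)

# 2. Parse the tokens needed for the posts request.
entryPoint = _parse_entryPoint(homepage_response)
identifier = _parse_identifier(entryPoint, homepage_response)
docid = _parse_docid(entryPoint, homepage_response)

# 3. Start from an empty cursor and request the first batch of posts.
_, cursor, _, _ = _init_request_vars()
posts_response = _get_posts(headers=headers,
                            identifier=identifier,
                            entryPoint=entryPoint,
                            docid=docid,
                            cursor=cursor)
print(posts_response.status_code)  # the tests expect 200 here
```

tests/test_post_parser.py (next) builds on exactly this chain, feeding posts_response into _parse_edgelist and the other post parsers.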
/tests/test_post_parser.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | from post_paser import _parse_edgelist, _parse_edge, _parse_domops, _parse_jsmods, _parse_composite_graphql 4 | from utils import _init_request_vars 5 | 6 | 7 | @pytest.fixture 8 | def pageurl(): 9 | return 'https://www.facebook.com/groups/pythontw' 10 | 11 | 12 | @pytest.fixture 13 | def headers(pageurl): 14 | from requester import _get_headers 15 | return _get_headers(pageurl) 16 | 17 | 18 | @pytest.fixture 19 | def homepage_response(pageurl, headers): 20 | from requester import _get_homepage 21 | return _get_homepage(pageurl=pageurl, headers=headers) 22 | 23 | 24 | @pytest.fixture 25 | def entryPoint(homepage_response): 26 | from page_paser import _parse_entryPoint 27 | return _parse_entryPoint(homepage_response) 28 | 29 | 30 | @pytest.fixture 31 | def identifier(entryPoint, homepage_response): 32 | from page_paser import _parse_identifier 33 | return _parse_identifier(entryPoint, homepage_response) 34 | 35 | 36 | @pytest.fixture 37 | def docid(entryPoint, homepage_response): 38 | from page_paser import _parse_docid 39 | return _parse_docid(entryPoint, homepage_response) 40 | 41 | 42 | @pytest.fixture 43 | def cursor(): 44 | _, cursor, _, _ = _init_request_vars() 45 | return cursor 46 | 47 | 48 | @pytest.fixture 49 | def posts_response(headers, identifier, entryPoint, docid, cursor): 50 | from requester import _get_posts 51 | return _get_posts(headers=headers, 52 | identifier=identifier, 53 | entryPoint=entryPoint, 54 | docid=docid, 55 | cursor=cursor) 56 | 57 | 58 | @pytest.fixture() 59 | def edges(posts_response): 60 | return _parse_edgelist(posts_response) 61 | 62 | 63 | def test_parse_edgelist(posts_response): 64 | edges = _parse_edgelist(posts_response) 65 | assert len(edges) > 0 66 | 67 | 68 | def test_parse_edge(edges): 69 | for edge in edges: 70 | result = _parse_edge(edge) 71 | assert result 72 | 73 | 74 | def test_parse_domops(posts_response): 75 | content_list, cursor = _parse_domops(posts_response) 76 | assert content_list is not None 77 | assert cursor is not None 78 | 79 | 80 | def test_parse_jsmods(posts_response): 81 | _parse_jsmods(posts_response) 82 | 83 | 84 | def test_parse_composite_graphql(posts_response): 85 | df, max_date, cursor = _parse_composite_graphql(posts_response) 86 | assert df is not None 87 | assert max_date is not None 88 | assert cursor is not None 89 | 90 | 91 | def test_parse_composite_nojs(posts_response): 92 | df, max_date, cursor = _parse_composite_graphql(posts_response) 93 | assert df is not None 94 | assert max_date is not None 95 | assert cursor is not None 96 | -------------------------------------------------------------------------------- /tests/test_requester.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | from utils import _init_request_vars 4 | 5 | 6 | @pytest.fixture 7 | def pageurl(): 8 | return 'https://www.facebook.com/groups/pythontw' 9 | 10 | 11 | @pytest.fixture 12 | def headers(pageurl): 13 | from requester import _get_headers 14 | return _get_headers(pageurl) 15 | 16 | 17 | @pytest.fixture 18 | def homepage_response(pageurl, headers): 19 | from requester import _get_homepage 20 | return _get_homepage(pageurl=pageurl, headers=headers) 21 | 22 | 23 | @pytest.fixture 24 | def entryPoint(homepage_response): 25 | from page_paser import _parse_entryPoint 26 | return _parse_entryPoint(homepage_response) 27 | 28 | 29 | @pytest.fixture 30 | 
def identifier(entryPoint, homepage_response): 31 | from page_paser import _parse_identifier 32 | return _parse_identifier(entryPoint, homepage_response) 33 | 34 | 35 | @pytest.fixture 36 | def docid(entryPoint, homepage_response): 37 | from page_paser import _parse_docid 38 | return _parse_docid(entryPoint, homepage_response) 39 | 40 | 41 | @pytest.fixture 42 | def cursor(): 43 | _, cursor, _, _ = _init_request_vars() 44 | return cursor 45 | 46 | 47 | def test_get_homepage(pageurl, headers): 48 | from requester import _get_homepage 49 | homepage_response = _get_homepage(pageurl=pageurl, headers=headers) 50 | assert homepage_response.status_code == 200 51 | assert homepage_response.url == pageurl 52 | 53 | 54 | def test_get_pageabout(homepage_response, entryPoint, headers): 55 | from requester import _get_pageabout 56 | pageabout = _get_pageabout(homepage_response, entryPoint, headers) 57 | assert pageabout.status_code == 200 58 | 59 | 60 | def test_get_posts(headers, identifier, entryPoint, docid, cursor): 61 | from requester import _get_posts 62 | resp = _get_posts(headers=headers, 63 | identifier=identifier, 64 | entryPoint=entryPoint, 65 | docid=docid, 66 | cursor=cursor) 67 | assert resp.status_code == 200 68 | -------------------------------------------------------------------------------- /tests/test_utils.py: -------------------------------------------------------------------------------- 1 | import pytest 2 | 3 | from utils import _extract_id, _extract_reactions 4 | 5 | 6 | @pytest.fixture 7 | def dataid(): 8 | return 'feed_subtitle_107207125979624;5762739400426340;;9' 9 | 10 | 11 | def test_extract_pageid(dataid): 12 | pageid = _extract_id(dataid, 0) 13 | assert pageid == '107207125979624' 14 | 15 | 16 | def test_extract_postid(dataid): 17 | postid = _extract_id(dataid, 1) 18 | assert postid == '5762739400426340' 19 | 20 | 21 | # def test_extract_reactions(reactions, reaction_type): 22 | # reactions = _extract_reactions(reactions, reaction_type) 23 | # assert reactions is not None 24 | -------------------------------------------------------------------------------- /utils.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | import re 3 | import datetime 4 | import requests 5 | 6 | 7 | def _connect_db(): 8 | conn = sqlite3.connect('./facebook_crawler.db', check_same_thread=False) 9 | return conn 10 | 11 | 12 | def _extract_id(string, num): 13 | ''' 14 | Extract page and post id from the feed_subtitle infroamtion 15 | ''' 16 | try: 17 | return re.findall('[0-9]{5,}', string)[num] 18 | except: 19 | print('ERROR from extract {}'.froamt(string)) 20 | return string 21 | 22 | 23 | def _extract_reactions(reactions, reaction_type): 24 | ''' 25 | Extract reaction_type from reactions. 
26 |     reaction_type should be one of ['LIKE', 'HAHA', 'WOW', 'LOVE', 'SUPPORT', 'SORRY', 'ANGER'] 27 |     ''' 28 |     for reaction in reactions: 29 |         if reaction['node']['localized_name'].upper() == reaction_type.upper(): 30 |             return reaction['reaction_count'] 31 |     return 0 32 | 33 | 34 | def _init_request_vars(cursor=''): 35 |     # init parameters 36 |     # cursor = '' 37 |     df = [] 38 |     max_date = datetime.datetime.now().strftime('%Y-%m-%d') 39 |     break_times = 0 40 |     return df, cursor, max_date, break_times 41 | 42 | 43 | def _download_images(link): 44 |     filename = link.split(r'?')[0].split(r'/')[-1] 45 |     resp = requests.get(link) 46 |     with open(f'data/photos/{filename}', 'wb') as f: 47 |         f.write(resp.content) 48 | 49 | 50 | def parse_raw_json(string): 51 |     dic = {} 52 |     for line in string.split('\n'): 53 |         if line != '': 54 |             values = line.split(': ', 1)  # split on the first ': ' only, so values may contain colons 55 |             dic[values[0]] = values[1] 56 |     return dic 57 | 58 | if __name__ == '__main__': 59 | 60 |     df, cursor, max_date, break_times = _init_request_vars(cursor='aa') 61 |     cursor 62 | --------------------------------------------------------------------------------
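The REACTIONS payloads shown in the sample notebooks are lists of dicts carrying a reaction_count and a node, which is what _extract_reactions walks through, and tests/test_utils.py leaves its test_extract_reactions commented out. Below is a minimal, hypothetical sketch of how such a test could look: the payload is hand-built from the two fields the function actually reads (reaction_count and node['localized_name']), not a captured Facebook response, and the parse_raw_json case assumes the simple 'key: value' lines that function expects.

```python
# Hypothetical test sketch for utils._extract_reactions and utils.parse_raw_json.
# The reactions payload is hand-built from the fields _extract_reactions reads;
# it is an illustration, not a real Facebook response.
from utils import _extract_reactions, parse_raw_json


def test_extract_reactions_by_type():
    reactions = [
        {'reaction_count': 138, 'node': {'localized_name': 'Like'}},
        {'reaction_count': 21, 'node': {'localized_name': 'Haha'}},
    ]
    assert _extract_reactions(reactions, 'LIKE') == 138
    assert _extract_reactions(reactions, 'haha') == 21  # matching is case-insensitive
    assert _extract_reactions(reactions, 'ANGER') == 0  # missing types fall back to 0


def test_parse_raw_json_splits_key_value_lines():
    raw = 'NAME: PyConTaiwan\nLIKES: 1000'
    assert parse_raw_json(raw) == {'NAME': 'PyConTaiwan', 'LIKES': '1000'}
```

Because _extract_reactions returns 0 for reaction types that are absent, the per-reaction columns (LIKE, HAHA, WOW, and so on) in the crawled DataFrames above always hold a numeric count.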