├── .gitignore ├── README.md ├── assets ├── nodemaven-test.jpg └── token-business.png ├── devtools-data.xlsx ├── graph-api ├── README.md ├── data │ ├── KTXDHQGConfessions.jsonl │ └── devoiminhdidauthe.jsonl └── scraper.py ├── stealth-csr-puppeteer ├── README.md ├── example.js ├── helpers.js ├── index.js ├── package.json ├── scraper.js ├── test.js └── wrapper.js ├── stealth-csr-selenium ├── README.md ├── browser.py ├── crawler.py ├── data │ ├── KTXDHQGConfessions.json │ ├── KTXDHQGConfessions.jsonl │ └── data.json ├── img │ ├── filter.png │ ├── proxy.png │ ├── rate_limit_exceeded.png │ └── result.png ├── page.py ├── requirements.txt └── tor │ ├── linux │ ├── PluggableTransports │ │ └── obfs4proxy │ ├── libcrypto.so.1.1 │ ├── libevent-2.1.so.7 │ ├── libssl.so.1.1 │ ├── libstdc++ │ │ └── libstdc++.so.6 │ └── tor │ ├── mac │ ├── PluggableTransports │ │ └── obfs4proxy │ ├── libevent-2.1.7.dylib │ └── tor.real │ └── windows │ ├── PluggableTransports │ └── obfs4proxy.exe │ ├── libcrypto-1_1-x64.dll │ ├── libevent-2-1-7.dll │ ├── libevent_core-2-1-7.dll │ ├── libevent_extra-2-1-7.dll │ ├── libgcc_s_seh-1.dll │ ├── libssl-1_1-x64.dll │ ├── libssp-0.dll │ ├── libwinpthread-1.dll │ ├── tor.exe │ └── zlib1.dll ├── stealth-ssr-puppeteer └── README.md └── stealth-ssr-scrapy ├── README.md ├── fbscraper ├── __init__.py ├── items.py ├── middlewares.py ├── pipelines.py ├── settings.py └── spiders │ └── __init__.py └── scrapy.cfg /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | node_modules/ 3 | tmp/ 4 | .env 5 | package-lock.json -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Summary of Facebook data extraction approaches 2 | 3 | > I'm finalizing everything to accommodate the latest major changes 4 | 5 | ## Overview 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 |
| Approach | Sign-in required from the start | Risk when sign-in (*) | Risk when not sign-in | Difficulty | Speed |
| -------- | ------------------------------- | --------------------- | --------------------- | ---------- | ----- |
| 1️⃣  Graph API + Full-permission Token | | Access Token leaked + Rate Limits | Not working | Easy | Fast |
| 2️⃣  SSR - Server-side Rendering | | Checkpoint but less loading more failure | | Hard | Medium |
| 3️⃣  CSR - Client-side Rendering | When access private content | | Safest | | Slow |
| 4️⃣  DevTools Console | | Can be banned if overused | | | Medium |
42 | 43 | ### I. My general conclusion after many tries with different approaches 44 | 45 | When run at **not sign-in** state, Facebook usually redirects to the login page or prevent you from loading more comments/replies. 46 | 47 | **(*)** For safety when testing with **sign-in** state, I recommend create a **fake account** (you can use a [Temporary Email Address](https://temp-mail.org/en/) to create one) and use it for the extraction, because: 48 | - No matter which approach you use, any fast or irregular activity continuously in **sign-in** state for a long time can be likely to get blocked at any time. 49 | 50 | - Authenticating via services with lack of encryption such as proxies using **HTTP** protocol can have potential security risks, especially if sensitive data is being transmitted: 51 | - Therefore, if you are experimenting with your own account, it's advisable to use **HTTPS** proxies or other more secure methods like `VPNs`. 52 | - I won't implement these types of risky authentication into the sign-in process for approaches in this repo, but you can do it yourself if you want. 53 | 54 | ### II. DISCLAIMER 55 | 56 | All information provided in this repo and related articles are for educational purposes only. So use at your own risk, I will not guarantee & not be responsible for any situations including: 57 | 58 | - Whether your Facebook account may get Checkpoint due to repeatedly or rapid actions. 59 | - Problems that may occur or for any abuse of the information or the code provided. 60 | - Problems about your privacy while using [IP hiding techniques](#i-ip-hiding-techniques) or any malicious scripts. 61 | 62 | ## APPROACH 1. Graph API with Full-permission Token 63 | 64 | 👉 Check out my implementation for this approach with [Python](./graph-api/). 65 | 66 | You will query [Facebook Graph API](https://developers.facebook.com/docs/graph-api) using your own Token with **full permission** for fetching data. This is the **MOST EFFECTIVE** approach. 67 | 68 | > The knowledge and the way to get **Access Token** below are translated from these 2 Vietnamese blogs: 69 | > 70 | > - https://ahachat.com/help/blog/cach-lay-token-facebook 71 | > - https://alotoi.com/get-token-full-quyen 72 | 73 | ### I. What is Facebook Token? 74 | 75 | A Facebook **Access Token** is a randomly generated code that contains data linked to a Facebook account. It contains the permissions to perform an action on the library (API) provided by Facebook. Each Facebook account will have different **Access Tokens**, and there can be ≥ 1 Tokens on the same account. 76 | 77 | Depending on the limitations of each Token's permissions, which are generated for use with corresponding features, either many or few, they can be used for various purposes, but the main goal is to automate all manual operations. Some common applications include: 78 | 79 | - Increasing likes, subscriptions on Facebook. 80 | - Automatically posting on Facebook. 81 | - Automatically commenting and sharing posts. 82 | - Automatically interacting in groups and Pages. 83 | - ... 84 | 85 | There are 2 types of Facebook Tokens: **App-based Token** and **Personal Account-based Token**. The Facebook **Token by App** is the safest one, as it will have a limited lifetime and only has some basic permissions to interact with `Pages` and `Groups`. Our main focus will on the Facebook **Personal Account-based Token**. 86 | 87 | ### II. 
Personal Account-based Access Token 88 | 89 | This is a **full permissions** Token represented by a string of characters starting with `EAA...`. The purpose of this Token is to act on behalf of your Facebook account to perform actions you can do on Facebook, such as sending messages, liking pages, and posting in groups through `API`. 90 | 91 | Compared to an **App-based Token**, this type of Token has a longer lifespan and more permissions. Simply put, whatever an **App-based Token** can do, a **Personal Account-based Token** can do as well, but not vice versa. 92 | 93 | An example of using this Facebook Token is when you want to simultaneously post to many `Groups` and `Pages`. To do this, you cannot simply log into each `Group` or `Page` to post, which is very time-consuming. Instead, you just need to fill in a list of `Group` and `Page` IDs, and then call an `API` to post to all in this list. Or, as you can often see on the Internet, there are tools to increase fake likes and comments also using this technique. 94 | 95 | Note that using Facebook Token can save you time, but you should not reveal this Token to others as they can misuse it for malicious purposes: 96 | 97 | - Do not download extensions to get Tokens or login with your phone number and password on websites that support Token retrieval, as your information will be compromised. 98 | - And if you suspect your Token has been compromised, immediately change your Facebook password and delete the extensions installed in the browser. 99 | - If you wanna be more careful, you can turn on **two-factor authentication** (2FA). 100 | 101 | 👉 To ensure safety when using the Facebook Token for personal purposes and saving time as mentioned above, you should obtain the Token directly from Facebook following the steps below. 102 | 103 | ### III. Get Access Token with full permissions 104 | 105 | Before, obtaining Facebook Tokens was very simple, but now many Facebook services are developing and getting Facebook Tokens has become more difficult. Facebook also limits Full permission Tokens to prevent Spam and excessive abuse regarding user behavior. 106 | 107 | It's possible to obtain a Token, but it might be limited by basic permissions that we do not use. This is not a big issue compared to sometimes having accounts locked (identity verification) on Facebook. 108 | 109 | Currently, this is the most used method, but it may require you to authenticate with 2FA (via app or SMS Code). With these following steps, you can get an **almost full permission** Token: 110 | 111 | - Go to https://business.facebook.com/business_locations. 112 | - Press `Ctrl + U`, then `Ctrl + F` to find the code that contains `EAAG`. Copy the highlighted text, that's the Token you want to obtain. 113 | 114 | ![](./assets/token-business.png) 115 | 116 | - You can go to this [facebook link](https://developers.facebook.com/tools/debug/accesstoken) to check the permissions of the above Token. 117 | ![](https://lh4.googleusercontent.com/0S64t2sjFXjkX8HUjo2GeEW8hyKL88G4lMXkpNF7RgtFCRm0oVPRT--vnoM1rkMyhrRvvHufW9J0ZeP8tPxfo4j5vYityQFM0m06NTI2hq4zk1JMp59W9voHXHYtOjE7zqDGMlhh) 118 | 119 | **Note**: I only share how to get **Access Token** from Facebook itself. Revealing Tokens can seriously affect your Facebook account. Please don't get Tokens from unknown sources! 120 | 121 | 122 | ## APPROACH 2. 
SSR - Server-side Rendering 123 | 124 | 👉 Check out my implementation using 2 [scraping tools](#scraping-tools) for this approach: [Scrapy](./stealth-ssr-scrapy/) (Implementing) and [Puppeteer](./stealth-ssr-puppeteer/) (Implementing). 125 | 126 | ### I. What is Server-side Rendering? 127 | 128 | This is a popular technique for rendering a normally client-side only single-page application (`SPA`) on the **Server** and then sending a fully rendered page to the client. The client's `JavaScript` bundle can then take over and the `SPA` can operate as normal: 129 | 130 | ```mermaid 131 | %%{init: {'theme': 'default', 'themeVariables': { 'primaryColor': '#333', 'lineColor': '#666', 'textColor': '#333', }}}%% 132 | sequenceDiagram 133 | participant U as 🌐 User's Browser 134 | participant S as 🔧 Server 135 | participant SS as 📜 Server-side Scripts 136 | participant D as 📚 Database 137 | participant B as 🖥️ Browser Engine 138 | participant H as 💧 Hydration (SPA) 139 | 140 | rect rgb(235, 248, 255) 141 | U->>S: 1. Request 🌐 (URL/Link) 142 | Note right of S: Server processes the request 143 | S->>SS: 2. Processing 🔄 (PHP, Node.js, Python, Ruby, Java) 144 | SS->>D: Execute Business Logic & Query DB 145 | D-->>SS: Return Data 146 | SS-->>S: 3. Rendering 📄 (Generate HTML) 147 | S-->>U: 4. Response 📤 (HTML Page) 148 | end 149 | 150 | rect rgb(255, 243, 235) 151 | U->>B: 5. Display 🖥️ 152 | Note right of B: Browser parses HTML, CSS, JS 153 | B-->>U: Page Displayed to User 154 | end 155 | 156 | alt 6. Hydration (if SPA) 157 | rect rgb(235, 255, 235) 158 | U->>H: Hydration 💧 159 | Note right of H: Attach event listeners\nMake page interactive 160 | H-->>U: Page now reactive 161 | end 162 | end 163 | ``` 164 | 165 | 1. The **user's browser** requests a page. 166 | 2. The **Server** receives and processes request. This involves running necessary **Server-side scripts**, which can be written in languages like *PHP*, *Node.js*, *Python*, ... 167 | 3. These **Server-side scripts** dynamically generate the `HTML` content of the page. This may include executing business logic or querying a database. 168 | 4. The **Server** responds by sending the **fully-rendered HTML** page back to **user's browser**. This response also includes `CSS` and the `JS`, which will be process once the `HTML` is loaded. 169 | 5. The **user's browser** receives the `HTML` response and renders the page. The browser's rendering engine parses the `HTML`, `CSS`, and execute `JS` to display the page. 170 | 6. (Optional) If the application is a `SPA` using a framework like *React*, *Vue*, or *Angular*, an additional process called **Hydration** may occur to attach event listeners to the existing **Server-rendered HTML**: 171 | - This is where the client-side `JS` takes over and `binds` event handlers to the **Server-rendered HTML**, effectively turning a static page into a dynamic one. 172 | - This allows the application to handle user interactions, manage `state`, and potentially update the `DOM` without the need to render a new page from scratch or return to the **Server** for every action. 173 | 174 | | Pros | Cons | 175 | | --------------------------- | --------------------------- | 176 | | - Improved initial load time as users see a **fully-rendered page** sooner, which is important for experience, particularly on slow connections | - More **Server** resources are used to generate the **fully-rendered HTML**. | 177 | | - Improved SEO as search engine crawlers can see the **fully-rendered page**. 
| - Complex to implement as compared to [CSR](#approach-3-csr---client-side-rendering), especially for dynamic sites where content changes frequently. | 178 | 179 | ### II. [Mbasic Facebook](https://mbasic.facebook.com) - A Facebook SSR version 180 | 181 | This Facebook version is made for mobile browsers on slow internet connection by using [SSR](#i-what-is-server-side-rendering) to focus on delivering content in raw `HTML` format. You can access it without a modern smartphones. With modern devices, it will improves the page loading time & the contents will be mainly rendered using raw `HTML` rather than relying heavily on `JS`: 182 | 183 | https://github.com/18520339/facebook-data-extraction/assets/50880271/ae2635ff-3f2a-4b84-a5b3-c126102a0118 184 | 185 | - You can leverage the power of many web scraping frameworks like [scrapy](https://scrapy.org) not just automation tools like [puppeteer](https://github.com/puppeteer/puppeteer) or [selenium](https://github.com/seleniumhq/selenium) and it will become even more powerful when used with [IP hiding techniques](#i-ip-hiding-techniques). 186 | - You can get each part of the contents through different URLs, not only through the page scrolling ➔ You can do something like using proxy for each request or [AutoThrottle](https://docs.scrapy.org/en/latest/topics/autothrottle.html) (a built-in [scrapy](https://scrapy.org) extension), ... 187 | 188 | Updating... 189 | 190 | ## APPROACH 3. CSR - Client-side Rendering 191 | 192 | 👉 Check out my implementation using 2 [scraping tools](#scraping-tools) for this approach: [Selenium](./stealth-csr-selenium/) (Deprecated) and [Puppeteer](./stealth-csr-puppeteer/) (Implementing). 193 | 194 | Updating... 195 | 196 | 197 | ## APPROACH 4. DevTools Console 198 | 199 | This is the most simple way, which is to directly write & run JS code in the [DevTools Console](https://developer.chrome.com/docs/devtools/open) of your browser, so it's quite convenient, not required to setup anything. 200 | 201 | - You can take a look at this [extremely useful project](https://github.com/jayremnt/facebook-scripts-dom-manipulation) which includes many automation scripts (not just about data extraction) with no Access Token needed for Facebook users by directly manipulating the DOM. 202 | 203 | - Here's my example script to collect comments on **a Facebook page when not sign-in**: 204 | 205 | ```js 206 | // Go to the page you want to collect, wait until it finishes loading. 207 | // Open the DevTools Console on the Browser and run the following code 208 | let csvContents = [['UserId', 'Name', 'Comment']]; 209 | let cmtsSelector = '.userContentWrapper .commentable_item'; 210 | 211 | // 1. Click see more comments 212 | // If you want more, just wait until the loading finishes and run this again 213 | moreCmts = document.querySelectorAll(cmtsSelector + ' ._4sxc._42ft'); 214 | moreCmts.forEach(btnMore => btnMore.click()); 215 | 216 | // 2. Collect all comments 217 | comments = document.querySelectorAll(cmtsSelector + ' ._72vr'); 218 | comments.forEach(cmt => { 219 | let info = cmt.querySelector('._6qw4'); 220 | let userId = info.getAttribute('href')?.substring(1); 221 | let content = cmt.querySelector('._3l3x>span')?.innerText; 222 | csvContents.push([userId, info.innerText, content]); 223 | }); 224 | csvContents.map(cmt => cmt.join('\t')).join('\n'); 225 | ``` 226 | 227 |
228 | 229 | 230 | Example result for the script above 231 | 232 |
233 | 234 | | UserId | Name | Comment | 235 | | -------------- | -------------- | ---------------------------------- | 236 | | freedomabcxyz | Freedom | Sau khi dùng | 237 | | baodendepzai123 | Bảo Huy Nguyễn | nhưng mà thua | 238 | | tukieu.2001 | Tú Kiều | đang xem hài ai rãnh xem quãng cáo | 239 | | ABCDE2k4 | Maa Vănn Kenn | Lê Minh Nhất | 240 | | buikhanhtoanpro | Bùi Khánh Toàn | Haha | 241 | 242 |
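If you want to keep what you collected instead of copying it out of the Console, a small follow-up snippet can be run in the same Console session to turn the `csvContents` array from the script above into a downloadable file. This is only a minimal sketch using standard browser APIs (`Blob` plus a temporary `<a>` element); the `devtools-data.tsv` file name is an arbitrary choice:

```js
// Run this after the collection script above, in the same DevTools Console session.
// It joins csvContents into tab-separated text (same as the script's last line) and triggers a download.
const tsv = csvContents.map(cmt => cmt.join('\t')).join('\n');
const blob = new Blob([tsv], { type: 'text/tab-separated-values;charset=utf-8' });
const link = document.createElement('a');
link.href = URL.createObjectURL(blob);
link.download = 'devtools-data.tsv'; // arbitrary file name
document.body.appendChild(link);
link.click();
link.remove();
URL.revokeObjectURL(link.href);
```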
243 | 244 | ## Scraping Tools 245 | 246 | Updating... 247 | 248 | 249 | ## Bypassing Bot Detection (When not sign-in) 250 | 251 | Updating... 252 | 253 | 👉 Highly recommend: https://github.com/niespodd/browser-fingerprinting 254 | 255 | ### I. IP hiding techniques 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | 300 | 301 | 302 | 303 | 304 | 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 | 321 | 322 | 323 | 324 |
| Technique | Speed | Cost | Scale | Anonymity | Other Risks | Additional Notes |
| --------- | ----- | ---- | ----- | --------- | ----------- | ---------------- |
| VPN Service<br>⭐⭐⭐⭐ | Fast, offers a balance of anonymity and speed | Usually paid | - Good for small-scale operations.<br>- May not be suitable for high-volume scraping due to potential IP blacklisting. | Provides good anonymity and can bypass geo-restrictions. | - Potential for IP blacklisting/blocks if the VPN's IP range is known to the target site.<br>- Service reliability varies.<br>- Possible activity logs. | Choose a reputable provider to avoid security risks. |
| TOR Network<br>⭐⭐ | Very slow due to onion routing | Free | - Fine for small-scale, impractical for time-sensitive/high-volume scraping due to very slow speed.<br>- Consider only for research purposes, not scalable data collection. | Offers excellent privacy. | Tor exit nodes can be blocked or malicious, with potential for eavesdropping. | Slowest choice. |
| Public Wi-Fi | Varies | Free | Fine for small-scale. | Potential for being banned by target sites if scraping is detected. | Potentially unsecured networks. | Inconvenient: you have to physically go to the Wi-Fi location. |
| Mobile Network<br>⭐⭐ | Relatively fast, but slower speeds on some networks | Paid, potential for additional costs. | Using mobile IPs can be effective for small-scale scraping, impractical for large-scale. | Mobile IPs can change, but not an anonymous option since they are tied to your personal account. | Using your own data. | |
| Private/Dedicated Proxies<br>⭐⭐⭐⭐⭐<br>(Best) | Fast | Paid | Best for large-scale operations and professional scraping projects. | Offer better performance and reliability with a lower risk of blacklisting. | Vary in quality. | - Rotating Proxies are popular choices for scraping as they can offer better speed and a variety of IPs.<br>- You can use this [proxy checker tool](https://addons.mozilla.org/en-US/firefox/addon/proxy-checker/) to assess your proxy quality. |
| Shared Proxies<br>⭐⭐⭐ (Free)<br>⭐⭐⭐⭐ (Paid) | Slow to Moderate | Usually free, or cost-effective for low-volume scraping. | Good for basic, small-scale, or non-critical scraping tasks. | Can be overloaded, blacklisted, or already banned. | Potentially unreliable/insecure proxies, especially free ones. | |
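As a concrete illustration of the TOR row above, the sketch below launches a browser through Tor's local SOCKS proxy with [puppeteer](https://github.com/puppeteer/puppeteer). It assumes a Tor client (for example, the binaries bundled in [./stealth-csr-selenium/tor](./stealth-csr-selenium/tor/)) is already running and listening on the default `127.0.0.1:9050` port; treat it as an example of the technique, not as part of the scripts in this repo:

```js
// Minimal sketch: route a Puppeteer browser through a locally running Tor client.
// Assumes Tor is already listening on 127.0.0.1:9050 (its default SOCKS port).
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
        args: ['--proxy-server=socks5://127.0.0.1:9050'],
    });
    const page = await browser.newPage();
    await page.goto('https://check.torproject.org', { waitUntil: 'networkidle2', timeout: 0 });
    console.log(await page.title()); // The page title tells you whether the request went through Tor
    await browser.close();
})();
```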
325 | 326 | **IMPORTANT**: Nothing above is absolutely safe and secure. _Caution is never superfluous_. You will need to research more about them if you want to enhance the security of your data and privacy. 327 | 328 | ### II. Private/Dedicated Proxies (Most effective IP hiding technique) 329 | 330 | As you can conclude from the table above, **Rotating Private/Dedicated Proxies** is the most effective IP hiding technique for **undetectable** and **large-scale** scraping. Below are 2 popular ways to effectively integrate this technique into your scraping process: 331 | 332 | 333 | 334 | 335 | 336 | 337 | 338 | 339 | 340 | 341 | 342 | 343 | 344 | 345 | 346 | 347 | 348 | 349 | 350 | 351 | 352 | 353 | 354 | 355 | 356 | 357 | 358 | 359 | 360 | 361 |
| Technique | Speed | Cost | Scale | Anonymity | Additional Notes |
| --------- | ----- | ---- | ----- | --------- | ---------------- |
| Residential Rotating Proxies<br>⭐⭐⭐⭐⭐<br>(Best) | Fast | Paid | Ideal for high-end, large-scale scraping tasks. | - Mimics real user IPs and auto-rotates IPs when using proxy gateways, making detection harder.<br>- Provides high anonymity and low risk of blacklisting/blocks due to legitimate residential IPs. | Consider proxy quality, location targeting, and rotation speed. |
| Datacenter Rotating Proxies<br>⭐⭐⭐⭐ | Faster than Residential Proxies | More affordable than Residential Proxies | Good for cost-effective, large-scale scraping. | - Less anonymous than Residential Proxies.<br>- Higher risk of being blocked.<br>- Easily detectable due to their datacenter IP ranges. | Consider the reputation of the provider and the frequency of IP rotation. |
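For reference, the sketch below shows roughly how a rotating-proxy gateway is wired into a Puppeteer launch, mirroring what [./stealth-csr-puppeteer/wrapper.js](./stealth-csr-puppeteer/) already does: the gateway endpoint (assumed to be in the `http://username:password@host:port` form used for `PROXY_ENDPOINT` elsewhere in this repo) is split into the proxy address and its credentials, and the provider rotates the exit IP behind that single gateway. A minimal sketch:

```js
// Minimal sketch: use a rotating proxy gateway with Puppeteer.
// PROXY_ENDPOINT is assumed to be in the http://username:password@host:port form.
const puppeteer = require('puppeteer');

(async () => {
    const { origin, username, password } = new URL(process.env.PROXY_ENDPOINT);
    const browser = await puppeteer.launch({ args: [`--proxy-server=${origin}`] });
    const page = await browser.newPage();
    await page.authenticate({ username, password }); // proxy credentials
    await page.goto('https://ipinfo.io/json', { waitUntil: 'networkidle2', timeout: 0 });
    console.log(await page.$eval('body', el => el.innerText)); // shows the current exit IP
    await browser.close();
})();
```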
362 | 363 | Recently, I experimented my web scraping [npm package](https://www.npmjs.com/package/puppeteer-ecommerce-scraper) with [NodeMaven](https://nodemaven.com/?a_aid=quandang), a **Residential proxy provider** with a focus on IP quality as well as stability, and I think it worked quite well. Below is the proxy quality result that I tested using the [proxy checker tool](https://addons.mozilla.org/en-US/firefox/addon/proxy-checker/) I mentioned [above](#i-ip-hiding-techniques): 364 | 365 | ![](./assets/nodemaven-test.jpg) 366 | 367 | And this is the performance measure of my actual run that I tested with my [scraping package](https://www.npmjs.com/package/puppeteer-ecommerce-scraper): 368 | 369 | - **Successful scrape runs**: 96% (over 100 attempts). This result is a quite good. 370 | - [NodeMaven](https://nodemaven.com/?a_aid=quandang) already reduced the likelihood of encountering banned or blacklisted IPs through the `IP Quality Filtering` feature. 371 | - Another reason is that it can access to over `5M residential IPs` across 150+ countries, a broad range of geo-targeting options. 372 | - I also used their `IP Rotation` feature to rotate IPs within a single gateway endpoint, which simplified my scraping setup and provided consistent anonymity. 373 | - **Average scrape time**: around 1-2 mins/10 pages for complex dynamic loading website (highly dependent on website complexity). While the proxy speeds were generally consistent, there were occasional fluctuations, which is expected in any proxy service. 374 | - **Sticky Sessions**: 24h+ session durations allowed me to maintain connections and complete scrapes efficiently. 375 | - **IP block rate** / **Redirect** / **Blank page** in the first run: <4%. 376 | 377 | Overall, throughout many runs, the proxies proved to be reliable with minimal downtime or issues. For those interested in trying [NodeMaven](https://nodemaven.com/?a_aid=quandang), you can apply the code `QD2` for an additional 2GB of traffic free with your trial or package purchase. 378 | 379 | ### II. Browser Settings & Plugins 380 | 381 | Updating... 382 | 383 | -------------------------------------------------------------------------------- /assets/nodemaven-test.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/assets/nodemaven-test.jpg -------------------------------------------------------------------------------- /assets/token-business.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/assets/token-business.png -------------------------------------------------------------------------------- /devtools-data.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/devtools-data.xlsx -------------------------------------------------------------------------------- /graph-api/README.md: -------------------------------------------------------------------------------- 1 | # Graph API with Full-permission Token Approach 2 | 3 | I wrote a [simple script](./scraper.py) to get data of posts from any **Page**/**Group** by querying [Facebook Graph API](https://developers.facebook.com/docs/graph-api) with Full-permission Token. 
My implementation for this approach only needs *130 lines* of code (100 if not including comments) with some built-in Python functions. 4 | 5 | 👉 Demo: https://www.youtube.com/watch?v=Q4oAsz__e_M 6 | 7 | ### I. Usage 8 | 9 | python scraper.py 10 | 11 | 1. **COOKIE** (Most important setup): 12 | 13 | This [script](./scraper.py) needs your **COOKIE** to work. You can get it by following these steps: 14 | - Go to https://business.facebook.com/business_locations and login (It may require `2FA`). 15 | - Press `F12` and go to the `Network` Panel. 16 | - Select the first request and copy the **Cookie** value in the **Request Headers**. 17 | 18 | **Note**: Don't use `document.cookie` as this will only extract cookies that are accessible via JavaScript and are not marked as `HttpOnly`. 19 | 20 | 2. **LIMIT** and **MAX_POSTS**: 21 | 22 | For **Page**, you can only read [a maximum of 100 feed posts](https://developers.facebook.com/docs/graph-api/reference/page/feed/#limitations) with the `limit` field: 23 | - If you try to read more than that you will get an error message to not exceed **100**. 24 | - The API will return approximately **600** ranked, published posts per year. 25 | 26 | For **Group**, there is no limited number mentioned in the document. As I experimented: 27 | - For simple query (such as `fields=message`): I can request up to **1850** posts (`LIMIT=1850`) . 28 | - For complex query (like the below `fields`): `LIMIT=300` works fine. Larger numbers sometimes work but most of the times are errors. 29 | - Therefore, I recommend just querying up to **300** posts at a time for Group. 30 | 31 | **Note**: If the data retrieved is too large, you can receive this error message: *"Please reduce the amount of data you're asking for, then retry your request"*. 32 | 33 | 3. **POST_FIELDS** and **COMMENT_FIELDS**: 34 | 35 | You can customize the fields you want to get from the `Post` or even `Comment` objects of **Page** and **Group**: 36 | - https://developers.facebook.com/docs/graph-api/reference/page/feed 37 | - https://developers.facebook.com/docs/graph-api/reference/group/feed 38 | - https://developers.facebook.com/docs/graph-api/reference/post 39 | 40 | **Note**: A **User** or **Page** can only query their own reactions. Other **Users**' or **Pages**' reactions are unavailable due to privacy concerns. 41 | 42 | 4. Other settings: 43 | 44 | - **SLEEP**: The time (in seconds) to wait between each request to get **LIMIT** posts. 45 | - **PAGE_OR_GROUP_URL**: The URL of the **Page** or **Group** you want to crawl. 46 | 47 | **Note**: The resulting file will contain each post separated [line by line](./data/KTXDHQGConfessions.jsonl). 48 | 49 | ### II. Recommendation 50 | 51 | I have learned a lot from this [repo](https://github.com/HoangTran0410/FBMediaDownloader). It's a NodeJs tool for auto downloading Facebook media with various features: 52 | 53 | - View album information (name, number of photos, link, ...) 54 | - Download **timeline album** of a FB page: this kind of album is hidden, containing all the photos so far in a FB page, like [this album](https://www.facebook.com/groups/j2team.community/posts/1377217242610392/). 55 | - Download any kind of albums: `user`'s, `group`'s, or `page`'s. 56 | - Download all photos/videos on the wall of an object (`user`/`group`/`page`). 57 | - It also provided [scripts](https://github.com/HoangTran0410/FBMediaDownloader/blob/master/scripts/bookmarks.js) to extract `album_id` / `user_id` / `group_id` / `page_id`. 
58 | 59 | The only disadvantage is that the description and instructions of this [repo](https://github.com/HoangTran0410/FBMediaDownloader) are in Vietnamese, _my language_. But I think you can use the translation feature of your browser to read, or you can watch its [instruction video](https://www.youtube.com/watch?v=g4zh9p-QfAQ) for more information. Hopefully, in the future, the author will update the description as well as the instructions in English. 60 | -------------------------------------------------------------------------------- /graph-api/scraper.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | import time 4 | import re 5 | 6 | 7 | ''' For Cookie 8 | 1. Go to https://business.facebook.com/business_locations and login (It may require 2FA) 9 | 2. Press F12 and go to the Network Panel 10 | 3. Select the first request and copy the Cookie value in the Request Headers 11 | *Note: Don't use document.cookie as this will only extract cookies that are accessible via JavaScript and are not marked as `HttpOnly` 12 | ''' 13 | COOKIE = "" 14 | 15 | '''Maximum Posts 16 | - For Page, you can only read a maximum of 100 feed posts with the limit field: 17 | - If you try to read more than that you will get an error message to not exceed 100. 18 | - The API will return approximately 600 ranked, published posts per year # https://developers.facebook.com/docs/graph-api/reference/page/feed/#limitations 19 | - For Group, there is no limited number mentioned in the document. As I experimented: 20 | - For simple query (such as `fields=message`): I can request up to 1850 posts (`LIMIT=1850`) . 21 | - For complex query (like the below `fields`): `LIMIT=300` works fine. Larger numbers sometimes work but most of the times are errors. 22 | - Therefore, I recommend just querying up to 300 posts at a time for Group. 23 | *Note: If the data retrieved is too large, you can receive this error message: "Please reduce the amount of data you're asking for, then retry your request" 24 | ''' 25 | LIMIT = 100 26 | MAX_POSTS = 375 # Maximum posts to scrape 27 | 28 | ''' Endpoint for posts in Page and Group 29 | - https://developers.facebook.com/docs/graph-api/reference/page/feed 30 | - https://developers.facebook.com/docs/graph-api/reference/group/feed 31 | - https://developers.facebook.com/docs/graph-api/reference/post 32 | *Note: A User or Page can only query their own reactions. Other Users' or Pages' reactions are unavailable due to privacy concerns. 
33 | ''' 34 | POST_FIELDS = 'id,parent_id,created_time,permalink_url,full_picture,shares,reactions.summary(total_count),attachments{subattachments.limit(20)},message' 35 | COMMENT_FIELDS = 'comments.order(chronological).summary(total_count){id,created_time,reactions.summary(total_count),message,comment_count,comments}' 36 | 37 | SLEEP = 2 # Waiting time between each request to get {LIMIT} posts 38 | PAGE_OR_GROUP_URL = 'https://www.facebook.com/groups/devoiminhdidauthe' 39 | SESSION = requests.Session() 40 | 41 | 42 | def get_node_id(): 43 | node_type, node_name = PAGE_OR_GROUP_URL.split('/')[-2:] 44 | if node_type != 'groups': 45 | return node_name # Page doesn't need to have an id as number 46 | 47 | id_in_url = re.search('(?<=\/groups\/)(.\d+?)($|(?=\/)|(?=&))', PAGE_OR_GROUP_URL) 48 | if id_in_url and id_in_url.group(1): 49 | return id_in_url.group(1) 50 | 51 | print('Getting Group ID ...') 52 | response = SESSION.get(PAGE_OR_GROUP_URL) 53 | search_group_id = re.search('(?<=\/group\/\?id=)(.\d+)', response.text) 54 | 55 | if search_group_id and search_group_id.group(1): 56 | group_id = search_group_id.group(1) 57 | print(f'Group ID for {node_name} is {group_id} !!') 58 | return group_id 59 | 60 | print('Cannot find any Node ID for', PAGE_OR_GROUP_URL) 61 | return None 62 | 63 | 64 | def get_access_token(): 65 | print('Getting access token ...') 66 | response = SESSION.get('https://business.facebook.com/business_locations', headers={'cookie': COOKIE}) 67 | 68 | if response.status_code == 200: 69 | search_token = re.search('(EAAG\w+)', response.text) 70 | if search_token and search_token.group(1): 71 | return search_token.group(1) 72 | 73 | print('Cannot find access token. Maybe your cookie invalid !!') 74 | return None 75 | 76 | 77 | def init_params(): 78 | node_id = get_node_id() 79 | access_token = get_access_token() 80 | fields = POST_FIELDS + ',' + COMMENT_FIELDS 81 | endpoint = f'https://graph.facebook.com/v18.0/{node_id}/feed?limit={LIMIT}&fields={fields}&access_token={access_token}' 82 | return endpoint, access_token 83 | 84 | 85 | def get_data_and_next_endpoint(endpoint, access_token): 86 | if access_token is None: return {}, None 87 | response = SESSION.get(endpoint, headers={'cookie': COOKIE}) 88 | response = json.loads(response.text) 89 | 90 | try: data = response['data'] 91 | except: 92 | print('\n', response['error']['message']) 93 | data = [] 94 | 95 | try: 96 | next_endpoint = response['paging']['next'] 97 | time.sleep(SLEEP) 98 | except: 99 | print('\n', 'Cannot find next endpoint') 100 | next_endpoint = None 101 | 102 | if not next_endpoint.split('/feed?')[-1].startswith(f'limit={LIMIT}&'): # Group paging doesn't contain limit field 103 | next_endpoint = next_endpoint.replace('/feed?', f'/feed?limit={LIMIT}&') 104 | return data, next_endpoint 105 | 106 | 107 | def remove_paging(obj): # Remove all paging keys to make the result concise and safe as the access token is in them 108 | if isinstance(obj, dict): 109 | return {k: remove_paging(v) for k, v in obj.items() if k != 'paging'} 110 | elif isinstance(obj, list): 111 | return [remove_paging(item) for item in obj] 112 | return obj 113 | 114 | 115 | endpoint, access_token = init_params() 116 | file_name, count = PAGE_OR_GROUP_URL.split('/')[-1], 0 117 | print(f'Fetching {MAX_POSTS} posts sorted by RECENT_ACTIVITY from {PAGE_OR_GROUP_URL} ...') 118 | 119 | with open(f'{file_name}.jsonl', 'w', encoding='utf-8') as file: 120 | while endpoint is not None and access_token is not None and count < MAX_POSTS: 121 | print(f'=> 
Number of posts now: {count} ...', end='\r', flush=True) 122 | data, endpoint = get_data_and_next_endpoint(endpoint, access_token) 123 | posts = [json.dumps(remove_paging(post), ensure_ascii=False) for post in data] 124 | count += len(posts) 125 | 126 | if LIMIT > MAX_POSTS - count: # If remaining posts < LIMIT, => LIMIT = the remaining number 127 | endpoint = endpoint.replace(f'/feed?limit={LIMIT}&', f'/feed?limit={MAX_POSTS - count}&') 128 | file.write('\n'.join(posts) + '\n') 129 | 130 | print(f'\n=> Finish fetching {count} posts into {file_name}.jsonl !!') 131 | SESSION.close() -------------------------------------------------------------------------------- /stealth-csr-puppeteer/README.md: -------------------------------------------------------------------------------- 1 | # CSR Approach using Puppeteer -------------------------------------------------------------------------------- /stealth-csr-puppeteer/example.js: -------------------------------------------------------------------------------- 1 | require('dotenv').config(); 2 | const { scrapeWithPagination, clusterWrapper } = require('puppeteer-ecommerce-scraper'); 3 | 4 | async function extractShopee(page, queueData) { 5 | const { products } = await scrapeWithPagination({ 6 | page, // Puppeteer page object 7 | scrollConfig: { scrollDelay: 500, scrollStep: 500, numOfScroll: 2, direction: 'both' }, 8 | scrapingConfig: { 9 | url: `https://shopee.vn/search?keyword=${queueData}`, 10 | productSelector: '.shopee-search-item-result__item', 11 | filePath: `./data/shopee-${queueData}.csv`, 12 | fileHeader: 'title,price,imgUrl\n', 13 | }, 14 | paginationConfig: { 15 | nextPageSelector: '.shopee-icon-button--right', 16 | disabledSelector: '.shopee-icon-button--right .shopee-icon-button--disabled', 17 | sleep: 1000, // in milliseconds 18 | maxPages: 3, // 0 for unlimited 19 | }, 20 | extractFunc: async productDOM => { 21 | const title = productDOM.querySelector('div[data-sqe="name"] > div:nth-child(1) > div')?.textContent; 22 | const priceParent = productDOM.querySelector('span[aria-label="current price"]')?.parentElement; 23 | const price = priceParent?.querySelectorAll('span')[2]?.textContent; 24 | const imgUrl = productDOM.querySelector('img[style="object-fit: contain"]')?.getAttribute('src'); 25 | return [title?.replaceAll(',', '_'), price, imgUrl]; 26 | }, 27 | }); 28 | console.log(`[DONE] Fetched ${products.length} ${queueData} products from Shopee`); 29 | } 30 | 31 | (async () => { 32 | await clusterWrapper({ 33 | func: extractShopee, 34 | queueEntries: ['android', 'iphone'], 35 | proxyEndpoint: process.env.PROXY_ENDPOINT, // Must be in the form of http://username:password@host:port 36 | monitor: false, 37 | useProfile: false, // After solving Captcha, save your profile, so you may avoid doing it next time 38 | }); 39 | })(); 40 | -------------------------------------------------------------------------------- /stealth-csr-puppeteer/helpers.js: -------------------------------------------------------------------------------- 1 | const fs = require('fs'); 2 | const os = require('os'); 3 | 4 | function isFileExists(installedPath) { 5 | try { 6 | fs.accessSync(installedPath, fs.constants.F_OK); 7 | return true; 8 | } catch (e) { 9 | return false; 10 | } 11 | } 12 | 13 | // Cannot use the same profile for multiple browsers => Not working with CONCURRENCY_BROWSER 14 | function getChromeProfilePath() { 15 | const homePath = os.homedir(); 16 | switch (os.platform()) { 17 | case 'win32': // Windows 18 | return 
`${homePath}\\AppData\\Local\\Google\\Chrome\\User Data\\Default`; 19 | case 'darwin': // macOS 20 | return `${homePath}/Library/Application Support/Google/Chrome/Default`; 21 | case 'linux': // Linux 22 | return `${homePath}/.config/google-chrome/Default`; 23 | default: 24 | throw new Error('Unsupported platform'); 25 | } 26 | } 27 | 28 | function getChromeExecutablePath() { 29 | switch (os.platform()) { 30 | case 'win32': // Windows 31 | for (let installedPath of [ 32 | 'C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe', 33 | 'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe', 34 | ]) 35 | if (isFileExists(installedPath)) return installedPath; 36 | throw new Error('Chrome executable not found in expected locations on Windows'); 37 | case 'darwin': // macOS 38 | return '/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'; 39 | case 'linux': // Linux 40 | return '/usr/bin/google-chrome'; 41 | default: 42 | throw new Error('Unsupported platform'); 43 | } 44 | } 45 | 46 | async function loadNotBlankPage(page, url, proxyUsername = '', proxyPassword = '') { 47 | const browser = page.browser(); 48 | const context = browser.defaultBrowserContext(); 49 | await context.overridePermissions(url, []); 50 | 51 | const pagesArray = await browser.pages(); 52 | const notBlankPage = pagesArray[0]; 53 | await page.close(pagesArray[1]); 54 | 55 | if (proxyUsername && proxyPassword) 56 | await notBlankPage.authenticate({ username: proxyUsername, password: proxyPassword }); 57 | await notBlankPage.goto(url, { waitUntil: 'networkidle2', timeout: 0 }); 58 | return notBlankPage; 59 | } 60 | 61 | module.exports = { isFileExists, getChromeProfilePath, getChromeExecutablePath, loadNotBlankPage }; 62 | -------------------------------------------------------------------------------- /stealth-csr-puppeteer/index.js: -------------------------------------------------------------------------------- 1 | require('dotenv').config(); 2 | const { clusterWrapper } = require('./wrapper'); 3 | const { scrapeWithPagination } = require('./scraper'); 4 | 5 | async function extractLazada(page, queueData) { 6 | const { products } = await scrapeWithPagination({ 7 | page, // Puppeteer page object 8 | scrollConfig: { scrollDelay: 500, scrollStep: 500, numOfScroll: 2, direction: 'both' }, 9 | scrapingConfig: { 10 | url: `https://www.lazada.vn/catalog/?q=${queueData}`, 11 | productSelector: '[data-qa-locator="product-item"]', 12 | filePath: `./data/lazada-${queueData}.csv`, 13 | fileHeader: 'title,price,imgUrl\n', 14 | }, 15 | paginationConfig: { 16 | nextPageSelector: '.ant-pagination-next button', 17 | disabledSelector: '.ant-pagination-next button[disabled]', 18 | sleep: 1000, // in milliseconds 19 | maxPages: 3, // 0 for unlimited 20 | }, 21 | extractFunc: async productDOM => { 22 | const parent = '[data-qa-locator="product-item"] > div > div'; 23 | const imgUrl = productDOM.querySelector(`${parent} img[type="product"]`)?.getAttribute('src').split('_')[0]; 24 | return [ 25 | productDOM.querySelector(`${parent} > div:nth-child(2) a`)?.textContent.replaceAll(',', ''), 26 | productDOM 27 | .querySelector(`${parent} > div:nth-child(2) > div:nth-child(3) > span`) 28 | ?.textContent.replaceAll('₫', ''), 29 | imgUrl.match(/\.(jpeg|jpg|gif|png|bmp|webp)$/) ? 
imgUrl : '', 30 | ]; 31 | }, 32 | }); 33 | console.log(`[DONE] Fetched ${products.length} ${queueData} products from Lazada`); 34 | } 35 | 36 | (async () => { 37 | await clusterWrapper({ 38 | func: extractLazada, 39 | queueEntries: ['android', 'iphone'], 40 | proxyEndpoint: process.env.PROXY_ENDPOINT, // Must be in the form of http://username:password@host:port 41 | monitor: false, 42 | useProfile: true, // After solving Captcha, save uour profile, so you may avoid doing it next time 43 | }); 44 | })(); 45 | -------------------------------------------------------------------------------- /stealth-csr-puppeteer/package.json: -------------------------------------------------------------------------------- 1 | { 2 | "dependencies": { 3 | "dotenv": "^16.3.2", 4 | "puppeteer": "^21.7.0", 5 | "puppeteer-cluster": "^0.23.0", 6 | "puppeteer-extra": "^3.3.6", 7 | "puppeteer-extra-plugin-stealth": "^2.11.2", 8 | "puppeteer-with-fingerprints": "^1.4.4" 9 | } 10 | } -------------------------------------------------------------------------------- /stealth-csr-puppeteer/scraper.js: -------------------------------------------------------------------------------- 1 | const path = require('path'); 2 | const fs = require('fs'); 3 | 4 | async function scrapeWithPagination({ 5 | page, // Puppeteer page object 6 | extractFunc, // Function to extract product info from product DOM 7 | scrapingConfig = { url: '', productSelector: '', filePath: '', fileHeader: '' }, 8 | paginationConfig = { nextPageSelector: '', disabledSelector: '', sleep: 1000, maxPages: 0 }, 9 | scrollConfig = { scrollDelay: NaN, scrollStep: NaN, numOfScroll: 1, direction: 'both' }, 10 | }) { 11 | const { url, productSelector } = scrapingConfig; 12 | const products = []; 13 | let totalPages = 1; 14 | let notLastPage = !!productSelector; 15 | 16 | if (!scrapingConfig.filePath) { 17 | const domainAndAfter = url.split('.').slice(1).join('_'); 18 | scrapingConfig.filePath = './data/' + domainAndAfter.replace(/[^a-zA-Z0-9]/g, '_'); 19 | } 20 | createFile(scrapingConfig.filePath, scrapingConfig.fileHeader); 21 | await page.goto(url, { waitUntil: 'networkidle2', timeout: 0 }); 22 | 23 | while (notLastPage && (paginationConfig.maxPages === 0 || totalPages <= paginationConfig.maxPages)) { 24 | await page.waitForSelector(productSelector, { visible: true, hidden: false, timeout: 0 }); 25 | 26 | const { scrollDelay, scrollStep, numOfScroll, direction } = scrollConfig; 27 | if (scrollDelay && scrollStep && numOfScroll > 0) { 28 | await page.evaluate(autoScroll.toString()); 29 | const actionsInBrowser = // scroll for fully rendering 30 | direction == 'both' 31 | ? 
`autoScroll(${scrollDelay}, ${scrollStep}, 'bottom').then(() => autoScroll(${scrollDelay}, ${scrollStep}, 'top'))` 32 | : `autoScroll(${scrollDelay}, ${scrollStep}, '${direction}')`; 33 | for (let i = 0; i < numOfScroll; i++) await page.evaluate(actionsInBrowser); 34 | } 35 | 36 | const productNodes = await page.$$(productSelector); 37 | for (const node of productNodes) { 38 | // Code inside `evaluate` runs in the context of the browser 39 | const productInfo = await page.evaluate(extractFunc, node); 40 | saveProduct(products, productInfo, scrapingConfig.filePath); 41 | } 42 | 43 | console.log( 44 | `${scrapingConfig.filePath}\t`, 45 | `| Total products now: ${products.length}\t`, 46 | `| Page: ${totalPages}/${paginationConfig.maxPages || '\u221E'}\t`, 47 | `| URL: ${url}` 48 | ); 49 | notLastPage = await navigatePage({ page, ...paginationConfig }); 50 | totalPages += notLastPage; 51 | } 52 | return { products, totalPages, scrapingConfig, paginationConfig, scrollConfig }; 53 | } 54 | 55 | function autoScroll(delay, scrollStep, direction) { 56 | return new Promise((resolve, reject) => { 57 | if (direction === 'bottom') window.scrollTo({ top: 0, behavior: 'smooth' }); 58 | else { 59 | window.scrollTo({ top: document.body.scrollHeight, behavior: 'smooth' }); 60 | scrollStep = -scrollStep; 61 | } 62 | console.log('Loading items by scrolling to', direction); 63 | 64 | const scrollId = setInterval(() => { 65 | let currentHeight = window.scrollY; 66 | if ( 67 | (direction === 'bottom' && currentHeight + window.innerHeight < document.body.scrollHeight) || 68 | (direction === 'top' && currentHeight > 0) 69 | ) 70 | window.scrollBy(0, scrollStep); 71 | else { 72 | clearInterval(scrollId); 73 | resolve(); 74 | } 75 | }, delay); 76 | }); 77 | } 78 | 79 | function createFile(filePath, header = '') { 80 | const dir = path.dirname(filePath); 81 | fs.mkdirSync(dir, { recursive: true }); 82 | fs.writeFile(filePath, header, 'utf-8', err => { 83 | if (err) throw err; 84 | console.log(`${filePath} created`); 85 | }); 86 | } 87 | 88 | function saveProduct(products, productInfo, filePath) { 89 | if (!productInfo.some(value => !value)) { 90 | products.push(productInfo); 91 | fs.appendFile(filePath, productInfo + '\n', 'utf-8', err => { 92 | if (err) throw err; 93 | // console.log(productInfo.toString()); 94 | }); 95 | } else console.log(`Cannot write to ${filePath} as this item has empty value:`, productInfo); 96 | } 97 | 98 | async function navigatePage({ page, nextPageSelector, disabledSelector, sleep = 1000 }) { 99 | if (!(nextPageSelector && disabledSelector)) return false; 100 | const notLastPage = (await page.$(disabledSelector)) === null; 101 | 102 | // https://github.com/puppeteer/puppeteer/issues/1412#issuecomment-402725036 103 | // const navigationPromise = page.waitForNavigation({ waitUntil: 'networkidle2', timeout: 0 }); 104 | // if (notLastPage) { 105 | // await page.click(nextPageSelector); 106 | // await navigationPromise; 107 | // } 108 | 109 | if (notLastPage) 110 | await Promise.all([page.waitForNavigation({ waitUntil: 'networkidle2', timeout: 0 }), page.click(nextPageSelector)]); 111 | await new Promise(resolve => setTimeout(resolve, sleep)); 112 | return notLastPage; 113 | } 114 | 115 | module.exports = { scrapeWithPagination, autoScroll, createFile, saveProduct, navigatePage }; 116 | -------------------------------------------------------------------------------- /stealth-csr-puppeteer/test.js: -------------------------------------------------------------------------------- 1 | 
require('dotenv').config(); 2 | const { clusterWrapper } = require('./wrapper'); 3 | 4 | function getWebName(url) { 5 | const parsedUrl = new URL(url); 6 | const hostnameParts = parsedUrl.hostname.split('.'); 7 | return hostnameParts[hostnameParts.length - 1].length === 2 8 | ? hostnameParts[hostnameParts.length - 3] 9 | : hostnameParts[hostnameParts.length - 2]; 10 | } 11 | 12 | function url2FileName(url) { 13 | const parsedUrl = new URL(url); 14 | const fileName = parsedUrl.hostname.replace(/^www\./, '') + parsedUrl.pathname + parsedUrl.search; 15 | return fileName.replace(/[^a-zA-Z0-9]/g, '_'); 16 | } 17 | 18 | (async () => { 19 | await clusterWrapper({ 20 | func: async (page, queueData) => { 21 | const webName = url2FileName(queueData); 22 | await page.goto(queueData, { waitUntil: 'networkidle2' }); 23 | await page.screenshot({ path: `./tmp/${webName}.png`, fullPage: true }); 24 | console.log(`Check ./tmp/${webName}.png for anti-bot testing result on ${queueData}`); 25 | }, 26 | queueEntries: [ 27 | 'https://bot.sannysoft.com', 28 | 'https://browserleaks.com/webrtc', 29 | 'https://browserleaks.com/javascript', 30 | ], 31 | proxyEndpoint: process.env.PROXY_ENDPOINT, // Must be in the form of http://username:password@host:port 32 | monitor: false, 33 | useProfile: true, // After solving Captcha, save uour profile, so you may avoid doing it next time 34 | }); 35 | })(); 36 | -------------------------------------------------------------------------------- /stealth-csr-puppeteer/wrapper.js: -------------------------------------------------------------------------------- 1 | const { loadNotBlankPage, getChromeExecutablePath } = require('./helpers'); 2 | const { Cluster } = require('puppeteer-cluster'); 3 | const puppeteer = require('puppeteer-extra'); 4 | const StealthPlugin = require('puppeteer-extra-plugin-stealth'); 5 | puppeteer.use(StealthPlugin()); 6 | 7 | async function clusterWrapper({ 8 | func, 9 | queueEntries, 10 | proxyEndpoint = '', 11 | monitor = false, 12 | useProfile = false, // After solving Captcha, save uour profile, so you may avoid doing it next time 13 | otherConfigs = {}, 14 | }) { 15 | if (!Array.isArray(queueEntries) && (typeof queueEntries !== 'object' || queueEntries === null)) 16 | throw new Error('queueEntries must be an array or an object'); 17 | 18 | try { 19 | var { origin, username, password } = new URL(proxyEndpoint); 20 | } catch (_) { 21 | console.log('Proxy disabled => To use Proxy, provide an endpoint in the form of http://username:password@host:port'); 22 | origin = username = password = null; 23 | } 24 | 25 | const maxConcurrency = Math.min(Object.keys(queueEntries).length, 5); 26 | const perBrowserOptions = [...Array(maxConcurrency).keys()].map(i => { 27 | const puppeteerOptions = { 28 | ...{ 29 | headless: false, 30 | defaultViewport: false, 31 | executablePath: getChromeExecutablePath(), // Avoid Bot detection 32 | }, 33 | ...otherConfigs, 34 | }; 35 | if (useProfile) puppeteerOptions.userDataDir = `./tmp/profile${i + 1}`; // Must use different profile for each browser 36 | if (proxyEndpoint) puppeteerOptions.args = [`--proxy-server=${origin}`]; 37 | return puppeteerOptions; 38 | }); 39 | console.log(`Configuration for ${maxConcurrency} browsers in Cluster:`, perBrowserOptions); 40 | 41 | const cluster = await Cluster.launch({ 42 | concurrency: Cluster.CONCURRENCY_BROWSER, 43 | maxConcurrency, 44 | perBrowserOptions, 45 | puppeteer, 46 | monitor, 47 | timeout: 1e7, 48 | }); 49 | cluster.on('taskerror', (err, data) => { 50 | console.log(err.message, 
data); 51 | }); 52 | 53 | await cluster.task(async ({ page, data: queueData }) => { 54 | const notBlankPage = await loadNotBlankPage(page, 'https://ipinfo.io/json', username, password); 55 | const content = await notBlankPage.$eval('body', el => el.innerText); 56 | console.log(`IP Information for scraping ${queueData}: ${content}`); 57 | 58 | if (typeof func === 'function') await func(notBlankPage, queueData); 59 | else console.log('Function not found.'); 60 | }); 61 | 62 | for (const queueData of queueEntries) await cluster.queue(queueData); 63 | await cluster.idle(); 64 | await cluster.close(); 65 | } 66 | 67 | module.exports = { clusterWrapper }; 68 | -------------------------------------------------------------------------------- /stealth-csr-selenium/README.md: -------------------------------------------------------------------------------- 1 | # CSR Approach using Selenium (DEPRECATED) 2 | 3 | > My scripts for this approach were made in 2020, so it's now deprecated with the new Facebook UI. But you can use it as a reference for other similar implementations with Selenium. 4 | 5 | In this approach, I will write example scripts to extract id, user info, content, date, comments, and replies of posts. 6 | 7 | 👉 Demo: https://www.youtube.com/watch?v=Fx0UWOzYsig 8 | 9 | **Note**: 10 | 11 | - These scripts just working for **a Facebook page when not sign-in**, not group or any other object. 12 | - Maybe you will need to edit some of the CSS Selectors in the scripts, as Facebook might have changed them at the time of your use. 13 | 14 | ## Overview the scripts 15 | 16 | ### I. Features 17 | 18 | 1. Getting information of posts. 19 | 2. Filtering comments. 20 | 3. Checking redirect. 21 | 4. Can be run with Incognito window. 22 | 5. Simplifying browser to minimize time complexity. 23 | 6. Delay with random intervals every _loading more_ times to simulate human behavior. 24 | 7. Not required sign-in to **prevent Checkpoint**. 25 | 8. Hiding IP address to **prevent from banning** by: 26 | - Collecting Proxies and filtering the slowest ones from: 27 | - http://proxyfor.eu/geo.php 28 | - http://free-proxy-list.net 29 | - http://rebro.weebly.com/proxy-list.html 30 | - http://www.samair.ru/proxy/time-01.htm 31 | - https://www.sslproxies.org 32 | - [Tor Relays](./tor/) which used in [Tor Browser](https://www.torproject.org/), a network is comprised of thousands of volunteer-run servers. 33 | 34 | ### II. Weaknesses 35 | 36 | - Unable to detect some failed responses. Example: **Rate limit exceeded** (Facebook prevents from loading more). 37 | 38 | ![](./img/rate_limit_exceeded.png?raw=true) 39 | 40 | ➔ Have to run with `HEADLESS = False` to detect manually. 41 | 42 | - Quite slow when running with a large number of _loading more_ or when using [IP hiding techniques](https://github.com/18520339/facebook-data-extraction/tree/master/#i-ip-hiding-techniques). 43 | 44 | ### III. Result 45 | 46 | - Each post will be separated [line by line](./data/KTXDHQGConfessions.jsonl). 47 | - Most of my successful tests were on **Firefox** with [HTTP Request Randomizer](https://github.com/pgaref/HTTP_Request_Randomizer) proxy server. 48 | - My latest run on **Firefox** with **Incognito** windows using [HTTP Request Randomizer](https://github.com/pgaref/HTTP_Request_Randomizer): 49 | 50 | ![](./img/result.png?raw=true) 51 | 52 |
53 | 54 | Example data fields for a post 55 |
56 | 57 | ```json 58 | { 59 | "url": "https://www.facebook.com/KTXDHQGConfessions/videos/352525915858361/", 60 | "id": "352525915858361", 61 | "utime": "1603770573", 62 | "text": "Diễn tập PCCC tại KTX khu B tòa E1. ----------- #ktx_cfs Nguồn : Trường Vũ", 63 | "reactions": ["308 Like", "119 Haha", "28 Wow"], 64 | "total_shares": "26 Shares", 65 | "total_cmts": "169 Comments", 66 | "crawled_cmts": [ 67 | { 68 | "id": "Y29tbWVudDozNDM0NDI0OTk5OTcxMDgyXzM0MzQ0MzIyMTY2MzcwMjc%3D", 69 | "utime": "1603770714", 70 | "user_url": "https://www.facebook.com/KTXDHQGConfessions/", 71 | "user_id": "KTXDHQGConfessions", 72 | "user_name": "KTX ĐHQG Confessions", 73 | "text": "Toà t á bây :) #Lép", 74 | "replies": [ 75 | { 76 | "id": "Y29tbWVudDozNDM0NDI0OTk5OTcxMDgyXzM0MzQ0OTc5MDk5NjM3OTE%3D", 77 | "utime": "1603772990", 78 | "user_url": "https://www.facebook.com/KTXDHQGConfessions/", 79 | "user_id": "KTXDHQGConfessions", 80 | "user_name": "KTX ĐHQG Confessions", 81 | "text": "Nguyễn Hoàng Đạt thật đáng tự hào :) #Lép" 82 | } 83 | ] 84 | } 85 | ] 86 | } 87 | ``` 88 |
89 | 90 | ## Usage 91 | 92 | ### I. Install libraries 93 | 94 | pip install -r requirements.txt 95 | 96 | - [Helium](https://github.com/mherrmann/selenium-python-helium): a wrapper around [Selenium](https://selenium-python.readthedocs.io/) with more high-level API for web automation. 97 | - [HTTP Request Randomizer](https://github.com/pgaref/HTTP_Request_Randomizer): used for collecting free proxies. 98 | 99 | ### II. Customize CONFIG VARIABLES in [crawler.py](./crawler.py) 100 | 101 | 1. **Running the Browser**: 102 | 103 | - **PAGE_URL**: URL of Facebook page. 104 | - **TOR_PATH**: use Proxy with Tor for `WINDOWS` / `MAC` / `LINUX` / `NONE`: 105 | - **BROWSER_OPTIONS**: run scripts using `CHROME` / `FIREFOX`. 106 | - **PRIVATE**: run with private mode or not: 107 | - Prevent from **Selenium** detection ➔ **navigator.driver** must be _undefined_ (check in Dev Tools). 108 | - Start browser with **Incognito** / **Private Window**. 109 | - **USE_PROXY**: run with proxy or not. If **True** ➔ check: 110 | - IF **TOR_PATH** ≠ `NONE` ➔ Use **Tor's SOCKS** proxy server. 111 | - ELSE ➔ Randomize proxies with [HTTP Request Randomizer](https://github.com/pgaref/HTTP_Request_Randomizer). 112 | - **HEADLESS**: run with headless browser or not. 113 | - **SPEED_UP**: simplify browser for minimizing loading time or not. If **True** ➔ use following settings: 114 | 115 | - With **Chrome** : 116 | 117 | ```python 118 | # Disable loading image, CSS, ... 119 | browser_options.add_experimental_option('prefs', { 120 | "profile.managed_default_content_settings.images": 2, 121 | "profile.managed_default_content_settings.stylesheets": 2, 122 | "profile.managed_default_content_settings.cookies": 2, 123 | "profile.managed_default_content_settings.geolocation": 2, 124 | "profile.managed_default_content_settings.media_stream": 2, 125 | "profile.managed_default_content_settings.plugins": 1, 126 | "profile.default_content_setting_values.notifications": 2, 127 | }) 128 | ``` 129 | 130 | - With **Firefox** : 131 | 132 | ```python 133 | # Disable loading image, CSS, Flash 134 | browser_options.set_preference('permissions.default.image', 2) 135 | browser_options.set_preference('permissions.default.stylesheet', 2) 136 | browser_options.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false') 137 | ``` 138 | 139 | 2. **Loading the Page**: 140 | 141 | - **SCROLL_DOWN**: number of times to scroll for **view more posts**. 142 | - **FILTER_CMTS_BY**: filter comments by `MOST_RELEVANT` / `NEWEST` / `ALL_COMMENTS`. 143 | ![](./img/filter.png?raw=true) 144 | - **VIEW_MORE_CMTS**: number of times to click **view more comments**. 145 | - **VIEW_MORE_REPLIES**: number of times to click **view more replies**. 146 | 147 | ### III. Start running 148 | 149 | python crawler.py 150 | 151 | - Run at sign out state, cause some CSS Selectors will be different as sign in. 152 | - With some Proxies, it might be quite slow or required to sign in (redirected). 153 | - **To achieve higher speed**: 154 | - If this is first time using these scripts, you can **run without Tor & Proxies** until Facebook requires to sign in. 155 | - Use some popular **VPN services** (also **run without Tor & Proxies**): [NordVPN](https://ref.nordvpn.com/dnaEbnXnysg), [ExpressVPN](https://www.expressvpn.com), ... 156 | 157 | ## Test proxy server 158 | 159 | 1. 
With [HTTP Request Randomizer](https://github.com/pgaref/HTTP_Request_Randomizer): 160 | 161 | ```python 162 | from browser import * 163 | page_url = 'http://check.torproject.org' 164 | proxy_server = random.choice(proxies).get_address() 165 | browser_options = BROWSER_OPTIONS.FIREFOX 166 | 167 | setup_free_proxy(page_url, proxy_server, browser_options) 168 | # kill_browser() 169 | ``` 170 | 171 | 2. With [Tor Relays](./tor): 172 | 173 | ```python 174 | from browser import * 175 | page_url = 'http://check.torproject.org' 176 | tor_path = TOR_PATH.WINDOWS 177 | browser_options = BROWSER_OPTIONS.FIREFOX 178 | 179 | setup_tor_proxy(page_url, tor_path, browser_options) 180 | # kill_browser() 181 | ``` 182 | 183 | ![](./img/proxy.png?raw=true) 184 | -------------------------------------------------------------------------------- /stealth-csr-selenium/browser.py: -------------------------------------------------------------------------------- 1 | from helium import * 2 | from selenium.webdriver import ChromeOptions 3 | from selenium.webdriver import FirefoxOptions 4 | from http_request_randomizer.requests.proxy.requestProxy import RequestProxy 5 | 6 | import os 7 | import psutil 8 | import shutil 9 | import random 10 | 11 | TOR_FOLDER = os.path.join(os.getcwd(), 'tor') 12 | TOR_PATH = type('Enum', (), { 13 | 'WINDOWS': os.path.join(TOR_FOLDER, 'windows', 'tor.exe'), 14 | 'MAC': os.path.join(TOR_FOLDER, 'mac', 'tor.real'), 15 | 'LINUX': os.path.join(TOR_FOLDER, 'linux', 'tor'), 16 | 'NONE': '' 17 | }) 18 | 19 | BROWSER_OPTIONS = type('Enum', (), { 20 | 'CHROME': ChromeOptions(), 21 | 'FIREFOX': FirefoxOptions() 22 | }) 23 | 24 | request_proxy = RequestProxy() 25 | request_proxy.set_logger_level(40) 26 | proxies = request_proxy.get_proxy_list() 27 | 28 | def hidden(browser_options=BROWSER_OPTIONS.FIREFOX): 29 | if type(browser_options) == ChromeOptions: 30 | browser_options.add_argument('--incognito') 31 | browser_options.add_argument('--disable-blink-features=AutomationControlled') 32 | elif type(browser_options) == FirefoxOptions: 33 | browser_options.add_argument('--private') 34 | browser_options.set_preference('dom.webdriver.enabled', False) 35 | browser_options.set_preference('useAutomationExtension', False) 36 | return browser_options 37 | 38 | def simplify(browser_options=BROWSER_OPTIONS.FIREFOX): 39 | if type(browser_options) == ChromeOptions: 40 | browser_options.add_experimental_option('prefs', { 41 | 'profile.managed_default_content_settings.images': 2, 42 | 'profile.managed_default_content_settings.stylesheets': 2, 43 | 'profile.managed_default_content_settings.cookies': 2, 44 | 'profile.managed_default_content_settings.geolocation': 2, 45 | 'profile.managed_default_content_settings.media_stream': 2, 46 | 'profile.managed_default_content_settings.plugins': 1, 47 | 'profile.default_content_setting_values.notifications': 2, 48 | }) 49 | elif type(browser_options) == FirefoxOptions: 50 | browser_options.set_preference('permissions.default.image', 2) 51 | browser_options.set_preference('permissions.default.stylesheet', 2) 52 | browser_options.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false') 53 | return browser_options 54 | 55 | def setup_free_proxy(page_url, proxy_server, browser_options=BROWSER_OPTIONS.FIREFOX, headless=False): 56 | print('Current proxy server:', proxy_server) 57 | host = proxy_server.split(':')[0] 58 | port = int(proxy_server.split(':')[1]) 59 | print('Go to page', page_url) 60 | 61 | if type(browser_options) == ChromeOptions: 62 | 
browser_options.add_argument(f'--proxy-server={proxy_server}') 63 | return start_chrome(page_url, headless=headless, options=browser_options) 64 | elif type(browser_options) == FirefoxOptions: 65 | browser_options.set_preference('network.proxy.type', 1) 66 | browser_options.set_preference('network.proxy.http', host) 67 | browser_options.set_preference('network.proxy.http_port', port) 68 | browser_options.set_preference('network.proxy.ssl', host) 69 | browser_options.set_preference('network.proxy.ssl_port', port) 70 | return start_firefox(page_url, headless=headless, options=browser_options) 71 | 72 | def setup_tor_proxy(page_url, tor_path=TOR_PATH.WINDOWS, browser_options=BROWSER_OPTIONS.FIREFOX, headless=False): 73 | torBrowser = os.popen(tor_path) 74 | print('Go to page', page_url) 75 | 76 | if type(browser_options) == ChromeOptions: 77 | browser_options.add_argument('--proxy-server=socks5://127.0.0.1:9050') 78 | return start_chrome(page_url, headless=headless, options=browser_options) 79 | elif type(browser_options) == FirefoxOptions: 80 | browser_options.set_preference('network.proxy.type', 1) 81 | browser_options.set_preference('network.proxy.socks', '127.0.0.1') 82 | browser_options.set_preference('network.proxy.socks_port', 9050) 83 | browser_options.set_preference('network.proxy.socks_remote_dns', False) 84 | return start_firefox(page_url, headless=headless, options=browser_options) 85 | 86 | def setup_driver(page_url, tor_path=TOR_PATH.WINDOWS, browser_options=BROWSER_OPTIONS.FIREFOX, use_proxy=False, private=False, speed_up=False, headless=False): 87 | if private: browser_options = hidden(browser_options) 88 | if speed_up: browser_options = simplify(browser_options) 89 | 90 | if not use_proxy: 91 | print('Go to page', page_url) 92 | if type(browser_options) == ChromeOptions: 93 | return start_chrome(page_url, headless=headless, options=browser_options) 94 | elif type(browser_options) == FirefoxOptions: 95 | return start_firefox(page_url, headless=headless, options=browser_options) 96 | 97 | if not os.path.isfile(tor_path): 98 | print('Use HTTP Request Randomizer proxy server') 99 | while True: 100 | try: 101 | rand_proxy = random.choice(proxies) 102 | proxy_server = rand_proxy.get_address() 103 | return setup_free_proxy(page_url, proxy_server, browser_options, headless) 104 | except Exception as e: 105 | proxies.remove(rand_proxy) 106 | print('=> Try another proxy.', e) 107 | close() 108 | 109 | print("Use Tor's SOCKS proxy server") 110 | return setup_tor_proxy(page_url, tor_path, browser_options, headless) 111 | 112 | def close(): 113 | kill_browser() 114 | if os.path.exists('__pycache__'): shutil.rmtree('__pycache__') 115 | for proc in psutil.process_iter(): 116 | if proc.name()[:3] == 'tor': proc.kill() -------------------------------------------------------------------------------- /stealth-csr-selenium/crawler.py: -------------------------------------------------------------------------------- 1 | import browser 2 | import page 3 | import re 4 | import json 5 | 6 | PAGE_URL = 'https://www.facebook.com/KTXDHQGConfessions/' 7 | TOR_PATH = browser.TOR_PATH.NONE 8 | BROWSER_OPTIONS = browser.BROWSER_OPTIONS.FIREFOX 9 | 10 | USE_PROXY = True 11 | PRIVATE = True 12 | SPEED_UP = True 13 | HEADLESS = False 14 | 15 | SCROLL_DOWN = 7 16 | FILTER_CMTS_BY = page.FILTER_CMTS.MOST_RELEVANT 17 | VIEW_MORE_CMTS = 2 18 | VIEW_MORE_REPLIES = 2 19 | 20 | def get_child_attribute(element, selector, attr): 21 | try: 22 | element = element.find_element_by_css_selector(selector) 23 | return 
str(element.get_attribute(attr)) 24 | except: return '' 25 | 26 | def get_comment_info(comment): 27 | cmt_url = get_child_attribute(comment, '._3mf5', 'href') 28 | utime = get_child_attribute(comment, 'abbr', 'data-utime') 29 | text = get_child_attribute(comment, '._3l3x ', 'textContent') 30 | cmt_id = cmt_url.split('=')[-1] 31 | 32 | if not cmt_id: # No comment URL found => get the id from the data-ft attribute instead 33 | cmt_id = comment.get_attribute('data-ft').split(':"')[-1][:-2] 34 | user_url = user_id = user_name = 'Acc clone' 35 | else: 36 | user_url = cmt_url.split('?')[0] 37 | user_id = user_url.split('https://www.facebook.com/')[-1].replace('/', '') 38 | user_name = get_child_attribute(comment, '._6qw4', 'innerText') 39 | return { 40 | 'id': cmt_id, 41 | 'utime': utime, 42 | 'user_url': user_url, 43 | 'user_id': user_id, 44 | 'user_name': user_name, 45 | 'text': text, 46 | } 47 | 48 | while True: # Relaunch the browser until the page loads without being redirected 49 | driver = browser.setup_driver(PAGE_URL, TOR_PATH, BROWSER_OPTIONS, USE_PROXY, PRIVATE, SPEED_UP, HEADLESS) 50 | if driver.current_url in PAGE_URL: 51 | if page.load(driver, PAGE_URL, SCROLL_DOWN, FILTER_CMTS_BY, VIEW_MORE_CMTS, VIEW_MORE_REPLIES): break 52 | else: print(f"Redirect detected => {'Rerun' if USE_PROXY else 'Please use proxy'}\n") 53 | driver.close() 54 | 55 | 56 | html_posts = driver.find_elements_by_css_selector(page.POSTS_SELECTOR) 57 | file_name = re.findall(r'\.com/(.*)', PAGE_URL)[0].split('/')[0] 58 | total = 0 59 | 60 | print('Start crawling', len(html_posts), 'posts...') 61 | with open(f'data/{file_name}.json', 'w', encoding='utf-8') as f: 62 | for post_index, post in enumerate(html_posts): 63 | post_url = get_child_attribute(post, '._5pcq', 'href').split('?')[0] 64 | post_id = re.findall(r'\d+', post_url)[-1] 65 | utime = get_child_attribute(post, 'abbr', 'data-utime') 66 | post_text = get_child_attribute(post, '.userContent', 'textContent') 67 | total_shares = get_child_attribute(post, '[data-testid="UFI2SharesCount/root"]', 'innerText') 68 | total_cmts = get_child_attribute(post, '._3hg-', 'innerText') 69 | 70 | json_cmts = [] 71 | html_cmts = post.find_elements_by_css_selector('._7a9a>li') 72 | 73 | num_of_cmts = len(html_cmts) 74 | total += num_of_cmts 75 | 76 | if num_of_cmts > 0: 77 | print(f'{post_index}. 
Crawling {num_of_cmts} comments of post {post_id}') 78 | for comment in html_cmts: 79 | comment_owner = comment.find_elements_by_css_selector('._7a9b') 80 | comment_info = get_comment_info(comment_owner[0]) 81 | 82 | json_replies = [] 83 | html_replies = comment.find_elements_by_css_selector('._7a9g') 84 | 85 | num_of_replies = len(html_replies) 86 | total += num_of_replies 87 | 88 | if num_of_replies > 0: 89 | print(f"|-- Crawling {num_of_replies} replies of {comment_info['user_name']}'s comment") 90 | for reply in html_replies: 91 | reply_info = get_comment_info(reply) 92 | json_replies.append(reply_info) 93 | 94 | comment_info.update({'replies': json_replies}) 95 | json_cmts.append(comment_info) 96 | 97 | json_reacts = [] 98 | html_reacts = post.find_elements_by_css_selector('._1n9l') 99 | 100 | for react in html_reacts: 101 | react_text = react.get_attribute('aria-label') 102 | json_reacts.append(react_text) 103 | 104 | json.dump({ 105 | 'url': post_url, 106 | 'id': post_id, 107 | 'utime': utime, 108 | 'text': post_text, 109 | 'reactions': json_reacts, 110 | 'total_shares': total_shares, 111 | 'total_cmts': total_cmts, 112 | 'crawled_cmts': json_cmts, 113 | }, f, ensure_ascii=False) 114 | 115 | del json_cmts 116 | f.write('\n') 117 | 118 | del html_posts 119 | print('Total comments and replies crawled:', total) 120 | browser.close() 121 | -------------------------------------------------------------------------------- /stealth-csr-selenium/img/filter.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/img/filter.png -------------------------------------------------------------------------------- /stealth-csr-selenium/img/proxy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/img/proxy.png -------------------------------------------------------------------------------- /stealth-csr-selenium/img/rate_limit_exceeded.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/img/rate_limit_exceeded.png -------------------------------------------------------------------------------- /stealth-csr-selenium/img/result.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/img/result.png -------------------------------------------------------------------------------- /stealth-csr-selenium/page.py: -------------------------------------------------------------------------------- 1 | from browser import * 2 | import time 3 | 4 | POSTS_SELECTOR = '[class="_427x"] .userContentWrapper' 5 | COMMENTABLE_SELECTOR = f'{POSTS_SELECTOR} .commentable_item' 6 | FILTER_CMTS = type('Enum', (), { 7 | 'MOST_RELEVANT': 'RANKED_THREADED', 8 | 'NEWEST': 'RECENT_ACTIVITY', 9 | 'ALL_COMMENTS': 'RANKED_UNFILTERED' 10 | }) 11 | 12 | def timer(func): 13 | def wrapper(*args, **kwargs): 14 | start = time.time() 15 | func(*args, **kwargs) 16 | end = time.time() 17 | print('=> Loading time:', end - start) 18 | return wrapper 19 | 20 | def click_popup(selector): 21 | btn = 
find_all(S(selector)) 22 | if btn != []: click(btn[0]) 23 | 24 | def failed_to_load(driver, page_url): 25 | if driver.current_url not in page_url: 26 | print('Redirect detected => Rerun\n') 27 | return True 28 | elif find_all(S('#main-frame-error')) != []: 29 | print('Cannot load page => Rerun\n') 30 | return True 31 | return False 32 | 33 | @timer 34 | def load_more_posts(driver): 35 | driver.execute_script('window.scrollTo(0, document.body.scrollHeight)') 36 | while find_all(S('.async_saving [role="progressbar"]')) != []: pass 37 | time.sleep(random.randint(3, 7)) 38 | 39 | @timer 40 | def click_multiple_buttons(driver, selector): 41 | for button in driver.find_elements_by_css_selector(selector): 42 | driver.execute_script('arguments[0].click()', button) 43 | while find_all(S(f'{COMMENTABLE_SELECTOR} [role="progressbar"]')) != []: pass 44 | time.sleep(random.randint(3, 7)) 45 | 46 | def filter_comments(driver, by): 47 | if by == FILTER_CMTS.MOST_RELEVANT: return 48 | click_multiple_buttons(driver, '[data-ordering="RANKED_THREADED"]') 49 | click_multiple_buttons(driver, f'[data-ordering="{by}"]') 50 | 51 | def load(driver, page_url, scroll_down=0, filter_cmts_by=FILTER_CMTS.MOST_RELEVANT, view_more_cmts=0, view_more_replies=0): 52 | print('Click Accept Cookies button') 53 | click_popup('[title="Accept All"]') 54 | 55 | for i in range(min(scroll_down, 3)): 56 | print(f'Load more posts times {i + 1}/{scroll_down}') 57 | load_more_posts(driver) 58 | if failed_to_load(driver, page_url): return False 59 | 60 | print('Click Not Now button') 61 | click_popup('#expanding_cta_close_button') 62 | 63 | for i in range(scroll_down - 3): 64 | print(f'Load more posts times {i + 4}/{scroll_down}') 65 | load_more_posts(driver) 66 | if failed_to_load(driver, page_url): return False 67 | 68 | print('Filter comments by', filter_cmts_by) 69 | filter_comments(driver, filter_cmts_by) 70 | 71 | for i in range(view_more_cmts): 72 | print(f'Click View more comments buttons times {i + 1}/{view_more_cmts}') 73 | click_multiple_buttons(driver, f'{COMMENTABLE_SELECTOR} ._7a94 ._4sxc') 74 | if failed_to_load(driver, page_url): return False 75 | 76 | for i in range(view_more_replies): 77 | print(f'Click Replies buttons times {i + 1}/{view_more_replies}') 78 | click_multiple_buttons(driver, f'{COMMENTABLE_SELECTOR} ._7a9h ._4sxc') 79 | if failed_to_load(driver, page_url): return False 80 | 81 | print('Click See more buttons of comments') 82 | click_multiple_buttons(driver, f'{COMMENTABLE_SELECTOR} .fss') 83 | if failed_to_load(driver, page_url): return False 84 | return True 85 | -------------------------------------------------------------------------------- /stealth-csr-selenium/requirements.txt: -------------------------------------------------------------------------------- 1 | helium 2 | http-request-randomizer -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/linux/PluggableTransports/obfs4proxy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/linux/PluggableTransports/obfs4proxy -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/linux/libcrypto.so.1.1: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/linux/libcrypto.so.1.1 -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/linux/libevent-2.1.so.7: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/linux/libevent-2.1.so.7 -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/linux/libssl.so.1.1: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/linux/libssl.so.1.1 -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/linux/libstdc++/libstdc++.so.6: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/linux/libstdc++/libstdc++.so.6 -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/linux/tor: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/linux/tor -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/mac/PluggableTransports/obfs4proxy: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/mac/PluggableTransports/obfs4proxy -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/mac/libevent-2.1.7.dylib: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/mac/libevent-2.1.7.dylib -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/mac/tor.real: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/mac/tor.real -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/windows/PluggableTransports/obfs4proxy.exe: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/windows/PluggableTransports/obfs4proxy.exe -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/windows/libcrypto-1_1-x64.dll: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/windows/libcrypto-1_1-x64.dll 
-------------------------------------------------------------------------------- /stealth-csr-selenium/tor/windows/libevent-2-1-7.dll: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/windows/libevent-2-1-7.dll -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/windows/libevent_core-2-1-7.dll: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/windows/libevent_core-2-1-7.dll -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/windows/libevent_extra-2-1-7.dll: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/windows/libevent_extra-2-1-7.dll -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/windows/libgcc_s_seh-1.dll: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/windows/libgcc_s_seh-1.dll -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/windows/libssl-1_1-x64.dll: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/windows/libssl-1_1-x64.dll -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/windows/libssp-0.dll: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/windows/libssp-0.dll -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/windows/libwinpthread-1.dll: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/windows/libwinpthread-1.dll -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/windows/tor.exe: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/windows/tor.exe -------------------------------------------------------------------------------- /stealth-csr-selenium/tor/windows/zlib1.dll: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-csr-selenium/tor/windows/zlib1.dll -------------------------------------------------------------------------------- /stealth-ssr-puppeteer/README.md: 
-------------------------------------------------------------------------------- 1 | # SSR Approach using Puppeteer -------------------------------------------------------------------------------- /stealth-ssr-scrapy/README.md: -------------------------------------------------------------------------------- 1 | # SSR Approach using Scrapy -------------------------------------------------------------------------------- /stealth-ssr-scrapy/fbscraper/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/18520339/facebook-data-extraction/39363bebed637417862a069bc14ce291b347a6cb/stealth-ssr-scrapy/fbscraper/__init__.py -------------------------------------------------------------------------------- /stealth-ssr-scrapy/fbscraper/items.py: -------------------------------------------------------------------------------- 1 | # Define here the models for your scraped items 2 | # 3 | # See documentation in: 4 | # https://docs.scrapy.org/en/latest/topics/items.html 5 | 6 | import scrapy 7 | 8 | 9 | class FbscraperItem(scrapy.Item): 10 | # define the fields for your item here like: 11 | # name = scrapy.Field() 12 | pass 13 | -------------------------------------------------------------------------------- /stealth-ssr-scrapy/fbscraper/middlewares.py: -------------------------------------------------------------------------------- 1 | # Define here the models for your spider middleware 2 | # 3 | # See documentation in: 4 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 5 | 6 | from scrapy import signals 7 | 8 | # useful for handling different item types with a single interface 9 | from itemadapter import is_item, ItemAdapter 10 | 11 | 12 | class FbscraperSpiderMiddleware: 13 | # Not all methods need to be defined. If a method is not defined, 14 | # scrapy acts as if the spider middleware does not modify the 15 | # passed objects. 16 | 17 | @classmethod 18 | def from_crawler(cls, crawler): 19 | # This method is used by Scrapy to create your spiders. 20 | s = cls() 21 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 22 | return s 23 | 24 | def process_spider_input(self, response, spider): 25 | # Called for each response that goes through the spider 26 | # middleware and into the spider. 27 | 28 | # Should return None or raise an exception. 29 | return None 30 | 31 | def process_spider_output(self, response, result, spider): 32 | # Called with the results returned from the Spider, after 33 | # it has processed the response. 34 | 35 | # Must return an iterable of Request, or item objects. 36 | for i in result: 37 | yield i 38 | 39 | def process_spider_exception(self, response, exception, spider): 40 | # Called when a spider or process_spider_input() method 41 | # (from other spider middleware) raises an exception. 42 | 43 | # Should return either None or an iterable of Request or item objects. 44 | pass 45 | 46 | def process_start_requests(self, start_requests, spider): 47 | # Called with the start requests of the spider, and works 48 | # similarly to the process_spider_output() method, except 49 | # that it doesn’t have a response associated. 50 | 51 | # Must return only requests (not items). 52 | for r in start_requests: 53 | yield r 54 | 55 | def spider_opened(self, spider): 56 | spider.logger.info("Spider opened: %s" % spider.name) 57 | 58 | 59 | class FbscraperDownloaderMiddleware: 60 | # Not all methods need to be defined. 
If a method is not defined, 61 | # scrapy acts as if the downloader middleware does not modify the 62 | # passed objects. 63 | 64 | @classmethod 65 | def from_crawler(cls, crawler): 66 | # This method is used by Scrapy to create your spiders. 67 | s = cls() 68 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 69 | return s 70 | 71 | def process_request(self, request, spider): 72 | # Called for each request that goes through the downloader 73 | # middleware. 74 | 75 | # Must either: 76 | # - return None: continue processing this request 77 | # - or return a Response object 78 | # - or return a Request object 79 | # - or raise IgnoreRequest: process_exception() methods of 80 | # installed downloader middleware will be called 81 | return None 82 | 83 | def process_response(self, request, response, spider): 84 | # Called with the response returned from the downloader. 85 | 86 | # Must either; 87 | # - return a Response object 88 | # - return a Request object 89 | # - or raise IgnoreRequest 90 | return response 91 | 92 | def process_exception(self, request, exception, spider): 93 | # Called when a download handler or a process_request() 94 | # (from other downloader middleware) raises an exception. 95 | 96 | # Must either: 97 | # - return None: continue processing this exception 98 | # - return a Response object: stops process_exception() chain 99 | # - return a Request object: stops process_exception() chain 100 | pass 101 | 102 | def spider_opened(self, spider): 103 | spider.logger.info("Spider opened: %s" % spider.name) 104 | -------------------------------------------------------------------------------- /stealth-ssr-scrapy/fbscraper/pipelines.py: -------------------------------------------------------------------------------- 1 | # Define your item pipelines here 2 | # 3 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 4 | # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html 5 | 6 | 7 | # useful for handling different item types with a single interface 8 | from itemadapter import ItemAdapter 9 | 10 | 11 | class FbscraperPipeline: 12 | def process_item(self, item, spider): 13 | return item 14 | -------------------------------------------------------------------------------- /stealth-ssr-scrapy/fbscraper/settings.py: -------------------------------------------------------------------------------- 1 | # Scrapy settings for fbscraper project 2 | # 3 | # For simplicity, this file contains only settings considered important or 4 | # commonly used. 
You can find more settings consulting the documentation: 5 | # 6 | # https://docs.scrapy.org/en/latest/topics/settings.html 7 | # https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 8 | # https://docs.scrapy.org/en/latest/topics/spider-middleware.html 9 | 10 | BOT_NAME = "fbscraper" 11 | 12 | SPIDER_MODULES = ["fbscraper.spiders"] 13 | NEWSPIDER_MODULE = "fbscraper.spiders" 14 | 15 | 16 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 17 | #USER_AGENT = "fbscraper (+http://www.yourdomain.com)" 18 | 19 | # Obey robots.txt rules 20 | ROBOTSTXT_OBEY = True 21 | 22 | # Configure maximum concurrent requests performed by Scrapy (default: 16) 23 | #CONCURRENT_REQUESTS = 32 24 | 25 | # Configure a delay for requests for the same website (default: 0) 26 | # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay 27 | # See also autothrottle settings and docs 28 | #DOWNLOAD_DELAY = 3 29 | # The download delay setting will honor only one of: 30 | #CONCURRENT_REQUESTS_PER_DOMAIN = 16 31 | #CONCURRENT_REQUESTS_PER_IP = 16 32 | 33 | # Disable cookies (enabled by default) 34 | #COOKIES_ENABLED = False 35 | 36 | # Disable Telnet Console (enabled by default) 37 | #TELNETCONSOLE_ENABLED = False 38 | 39 | # Override the default request headers: 40 | #DEFAULT_REQUEST_HEADERS = { 41 | # "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 42 | # "Accept-Language": "en", 43 | #} 44 | 45 | # Enable or disable spider middlewares 46 | # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html 47 | #SPIDER_MIDDLEWARES = { 48 | # "fbscraper.middlewares.FbscraperSpiderMiddleware": 543, 49 | #} 50 | 51 | # Enable or disable downloader middlewares 52 | # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html 53 | #DOWNLOADER_MIDDLEWARES = { 54 | # "fbscraper.middlewares.FbscraperDownloaderMiddleware": 543, 55 | #} 56 | 57 | # Enable or disable extensions 58 | # See https://docs.scrapy.org/en/latest/topics/extensions.html 59 | #EXTENSIONS = { 60 | # "scrapy.extensions.telnet.TelnetConsole": None, 61 | #} 62 | 63 | # Configure item pipelines 64 | # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html 65 | #ITEM_PIPELINES = { 66 | # "fbscraper.pipelines.FbscraperPipeline": 300, 67 | #} 68 | 69 | # Enable and configure the AutoThrottle extension (disabled by default) 70 | # See https://docs.scrapy.org/en/latest/topics/autothrottle.html 71 | #AUTOTHROTTLE_ENABLED = True 72 | # The initial download delay 73 | #AUTOTHROTTLE_START_DELAY = 5 74 | # The maximum download delay to be set in case of high latencies 75 | #AUTOTHROTTLE_MAX_DELAY = 60 76 | # The average number of requests Scrapy should be sending in parallel to 77 | # each remote server 78 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 79 | # Enable showing throttling stats for every response received: 80 | #AUTOTHROTTLE_DEBUG = False 81 | 82 | # Enable and configure HTTP caching (disabled by default) 83 | # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 84 | #HTTPCACHE_ENABLED = True 85 | #HTTPCACHE_EXPIRATION_SECS = 0 86 | #HTTPCACHE_DIR = "httpcache" 87 | #HTTPCACHE_IGNORE_HTTP_CODES = [] 88 | #HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage" 89 | 90 | # Set settings whose default value is deprecated to a future-proof value 91 | REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7" 92 | TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor" 93 | 
FEED_EXPORT_ENCODING = "utf-8" 94 | -------------------------------------------------------------------------------- /stealth-ssr-scrapy/fbscraper/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 5 | -------------------------------------------------------------------------------- /stealth-ssr-scrapy/scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.io/en/latest/deploy.html 5 | 6 | [settings] 7 | default = fbscraper.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = fbscraper 12 | --------------------------------------------------------------------------------