├── .gitignore
├── README.md
├── assets
    ├── nodemaven-test.jpg
    └── token-business.png
├── devtools-data.xlsx
├── graph-api
    ├── README.md
    ├── data
    │   ├── KTXDHQGConfessions.jsonl
    │   └── devoiminhdidauthe.jsonl
    └── scraper.py
├── stealth-csr-puppeteer
    ├── README.md
    ├── example.js
    ├── helpers.js
    ├── index.js
    ├── package.json
    ├── scraper.js
    ├── test.js
    └── wrapper.js
├── stealth-csr-selenium
    ├── README.md
    ├── browser.py
    ├── crawler.py
    ├── data
    │   ├── KTXDHQGConfessions.json
    │   ├── KTXDHQGConfessions.jsonl
    │   └── data.json
    ├── img
    │   ├── filter.png
    │   ├── proxy.png
    │   ├── rate_limit_exceeded.png
    │   └── result.png
    ├── page.py
    ├── requirements.txt
    └── tor
    │   ├── linux
    │       ├── PluggableTransports
    │       │   └── obfs4proxy
    │       ├── libcrypto.so.1.1
    │       ├── libevent-2.1.so.7
    │       ├── libssl.so.1.1
    │       ├── libstdc++
    │       │   └── libstdc++.so.6
    │       └── tor
    │   ├── mac
    │       ├── PluggableTransports
    │       │   └── obfs4proxy
    │       ├── libevent-2.1.7.dylib
    │       └── tor.real
    │   └── windows
    │       ├── PluggableTransports
    │           └── obfs4proxy.exe
    │       ├── libcrypto-1_1-x64.dll
    │       ├── libevent-2-1-7.dll
    │       ├── libevent_core-2-1-7.dll
    │       ├── libevent_extra-2-1-7.dll
    │       ├── libgcc_s_seh-1.dll
    │       ├── libssl-1_1-x64.dll
    │       ├── libssp-0.dll
    │       ├── libwinpthread-1.dll
    │       ├── tor.exe
    │       └── zlib1.dll
├── stealth-ssr-puppeteer
    └── README.md
└── stealth-ssr-scrapy
    ├── README.md
    ├── fbscraper
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders
        │   └── __init__.py
    └── scrapy.cfg

/.gitignore:
--------------------------------------------------------------------------------
__pycache__
node_modules/
tmp/
.env
package-lock.json

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Summary of Facebook data extraction approaches

> I'm finalizing everything to accommodate the latest major changes

## Overview

| Approach | Sign-in required from the start | Risk when sign-in (*) | Risk when not sign-in | Difficulty | Speed |
|---|---|---|---|---|---|
| 1️⃣ Graph API + Full-permission Token | ✅ | Access Token leaked + Rate Limits | Not working | Easy | Fast |
| 2️⃣ SSR - Server-side Rendering | Checkpoint but less loading more failure | Hard | Medium | | |
| 3️⃣ CSR - Client-side Rendering | When access private content | Safest | Slow | | |
| 4️⃣ DevTools Console | Can be banned if overused | Medium | | | |
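
The Graph API row above lists a leaked access token and rate limits as its main risks. The sketch below shows one way to keep the token out of source code by reading it from the environment (the repository's `.gitignore` already excludes `.env`) and to build a request URL for a page's posts. The variable name `FB_ACCESS_TOKEN`, the helper names, the field list, and the `v19.0` version string are illustrative assumptions, not taken from `graph-api/scraper.py`.

```python
import os
from urllib.parse import urlencode

GRAPH_HOST = "https://graph.facebook.com"


def load_token(env=os.environ, name="FB_ACCESS_TOKEN"):
    """Read the access token from the environment instead of hardcoding it.

    `FB_ACCESS_TOKEN` is a hypothetical variable name chosen for this sketch.
    """
    token = env.get(name)
    if not token:
        raise RuntimeError(f"{name} is not set")
    return token


def build_posts_url(page_id, token, version="v19.0",
                    fields=("id", "message", "created_time"), limit=25):
    """Build a URL for the `/{page-id}/posts` edge.

    The version and field list are placeholders; adjust them to whatever
    your app and token permissions actually support.
    """
    query = urlencode({
        "fields": ",".join(fields),
        "limit": limit,
        "access_token": token,
    })
    return f"{GRAPH_HOST}/{version}/{page_id}/posts?{query}"
```

Because the token ends up in the query string, avoid printing or logging the full URL; that is one of the easier ways for a token to leak.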
| Technique | Speed | Cost | Scale | Anonymity | Other Risks | Additional Notes |
|---|---|---|---|---|---|---|
| VPN Service ⭐⭐ ⭐⭐ | Fast, offers a balance of anonymity and speed | Usually paid | - Good for small-scale operations.<br>- May not be suitable for high-volume scraping due to potential IP blacklisting. | - Provides good anonymity and can bypass geo-restrictions.<br>- Potential for IP blacklisting/blocks if the VPN's IP range is known to the target site. | - Service reliability varies.<br>- Possible activity logs. | Choose a reputable provider to avoid security risks. |
| TOR Network ⭐⭐ | Very slow due to onion routing | Free | - Fine for small-scale, but impractical for time-sensitive or high-volume scraping due to very slow speed.<br>- Consider only for research purposes, not scalable data collection. | - Offers excellent privacy.<br>- Tor exit nodes can be blocked or malicious (for example, eavesdropping). | - | Slowest choice |
| Public Wi-Fi ⭐ | Varies | Free | Fine for small-scale. | Potential for being banned by target sites if scraping is detected. | Potentially unsecured networks | Long distance way solution. |
| Mobile Network ⭐⭐ | Relatively fast, but slower on some networks | Paid, potential for additional costs. | Mobile IPs can be effective for small-scale scraping, but are impractical for large-scale. | Mobile IPs can change, but this is not an anonymous option since the connection is tied to your personal account. | - | Uses your own data |
| Private/Dedicated Proxies ⭐⭐⭐ ⭐⭐ (Best) | Fast | Paid | - Best for large-scale operations and professional scraping projects. | Offer better performance and reliability with lower risk of blacklisting. | Vary in quality | - Rotating Proxies are popular choices for scraping as they can offer better speed and a variety of IPs.<br>- You can use this proxy checker tool to assess your proxy quality. |
| Shared Proxies ⭐⭐⭐ (Free) ⭐⭐ ⭐⭐ (Paid) | Slow to Moderate | Usually free or cost-effective for low-volume scraping. | Good for basic, small-scale, or non-critical scraping tasks. | Can be overloaded or blacklisted, or you may encounter already banned IPs. | Potentially unreliable/insecure proxies, especially free ones. | |
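
Since the table notes that proxies vary in quality and that shared ones may already be banned or overloaded, a client usually needs some way to skip proxies that have failed. The following is a minimal sketch of a round-robin pool that does this; the `ProxyPool` class and the proxy URLs in the usage note are hypothetical and not part of this repository.

```python
import itertools


class ProxyPool:
    """Round-robin pool that skips proxies marked as bad (hypothetical helper)."""

    def __init__(self, proxy_urls):
        self._proxies = list(proxy_urls)
        self._bad = set()
        self._cycle = itertools.cycle(self._proxies)

    def mark_bad(self, proxy_url):
        """Record a proxy that was banned, timed out, or returned a block page."""
        self._bad.add(proxy_url)

    def get(self):
        """Return the next healthy proxy, trying each proxy at most once."""
        for _ in range(len(self._proxies)):
            candidate = next(self._cycle)
            if candidate not in self._bad:
                return candidate
        raise RuntimeError("no healthy proxies left")
```

A caller would typically do `proxy = pool.get()`, attempt the request through it, and call `pool.mark_bad(proxy)` when the response indicates a ban or the connection fails.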
| Technique | Speed | Cost | Scale | Anonymity | Additional Notes |
|---|---|---|---|---|---|
| Residential Rotating Proxies ⭐⭐⭐ ⭐⭐ (Best) | Fast | Paid | Ideal for high-end, large-scale scraping tasks. | - Mimic real user IPs and auto-rotate IPs when using proxy gateways, making detection harder.<br>- Provide high anonymity and low risk of blacklisting/blocks due to legitimate residential IPs. | Consider proxy quality, location targeting, and rotation speed. |
| Datacenter Rotating Proxies ⭐⭐ ⭐⭐ | Faster than Residential Proxies | More affordable than Residential Proxies | Good for cost-effective, large-scale scraping. | - Less anonymous than Residential Proxies.<br>- Higher risk of being blocked.<br>- Easily detectable due to their datacenter IP ranges. | Consider reputation of the provider and frequency of IP rotation. |
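
With a gateway-style rotating proxy, the client usually points at a single endpoint and the provider changes the exit IP behind it. As a rough sketch, the helper below builds the `proxies` mapping in the shape the `requests` library accepts. The host, port, and credentials are placeholders, and `build_gateway_proxies` is a hypothetical name; real providers document their own gateway address and authentication format.

```python
from urllib.parse import quote


def build_gateway_proxies(username, password, host, port):
    """Return a requests-style proxies mapping for one gateway endpoint.

    Credentials are percent-encoded so characters such as '@' or ':'
    do not break the proxy URL.
    """
    user = quote(username, safe="")
    secret = quote(password, safe="")
    proxy_url = f"http://{user}:{secret}@{host}:{port}"
    # The same HTTP proxy endpoint is used for both URL schemes.
    return {"http": proxy_url, "https": proxy_url}
```

The result can be passed as `requests.get(url, proxies=build_gateway_proxies(...))`. Keep the credentials in environment variables or an ignored `.env` file rather than in the script.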