├── README.md ├── devtools_detectors.md ├── disguising_your_scraper.md ├── finding_video_links.md ├── images ├── browser_request.jpg ├── scraper_request.jpg └── sed_regex.png ├── patch_firefox_old.md ├── starting.md └── using_apis.md /README.md: -------------------------------------------------------------------------------- 1 | ## Requests based scraping tutorial 2 | 3 | 4 | You want to start scraping? Well this guide will teach you, and not some baby selenium scraping. This guide only uses raw requests and has examples in both python and kotlin. Only basic programming knowlege in one of those languages is required to follow along in the guide. 5 | 6 | If you find any aspect of this guide confusing please open an issue about it and I will try to improve things. 7 | 8 | If you do not know programming at all then this guide will __not__ help you, learn programming first! Real scraping cannot be done by copy pasting with a vauge understanding. 9 | 10 | 11 | 0. [Starting scraping from zero](https://github.com/Blatzar/scraping-tutorial/blob/master/starting.md) 12 | 1. [Properly scraping JSON apis often found on sites](https://github.com/Blatzar/scraping-tutorial/blob/master/using_apis.md) 13 | 2. [Evading developer tools detection when scraping](https://github.com/Blatzar/scraping-tutorial/blob/master/devtools_detectors.md) 14 | 3. [Why your requests fail and how to fix them](https://github.com/Blatzar/scraping-tutorial/blob/master/disguising_your_scraper.md) 15 | 4. [Finding links and scraping videos](https://github.com/Blatzar/scraping-tutorial/blob/master/finding_video_links.md) 16 | 17 | Once you've read and understood the concepts behind scraping take a look at [a provider in CloudStream](https://github.com/LagradOst/CloudStream-3/blob/3a78f41aad93dc5755ce9e105db9ab19287b912a/app/src/main/java/com/lagradost/cloudstream3/movieproviders/VidEmbedProvider.kt). I added tons of comments to make every aspect of writing CloudStream providers clear. Even if you're not planning on contributing to Cloudstream looking at the code may help. 18 | 19 | Take a look at [Thenos](https://github.com/LagradOst/CloudStream-3/blob/3a78f41aad93dc5755ce9e105db9ab19287b912a/app/src/main/java/com/lagradost/cloudstream3/movieproviders/ThenosProvider.kt) for an example of json based scraping in kotlin. 20 | -------------------------------------------------------------------------------- /devtools_detectors.md: -------------------------------------------------------------------------------- 1 | **TL;DR**: You are going to get fucked by sites detecting your devtools. You need to know what techniques are used to bypass them. 2 | 3 | Many sites use some sort of debugger detection to prevent you from looking at the important requests made by the browser. 4 | 5 | You can test the devtools detector here: https://blog.aepkill.com/demos/devtools-detector/ *(does not feature source mapping detection)* 6 | Code for the detector found here: https://github.com/AEPKILL/devtools-detector 7 | 8 | # How are they detecting the tools? 9 | 10 | One or more of the following methods are used to prevent devtools in the majority of cases (if not all): 11 | 12 | **1.** 13 | Calling `debugger` in an endless loop. 14 | This is very easy to bypass. You can either right click the offending line (in chrome) and disable all debugger calls from that line or you can disable the whole debugger. 15 | 16 | **2.** 17 | Attaching a custom `.toString()` function to an expression and printing it with `console.log()`. 
18 | When devtools are open (even while not in console) all `console.log()` calls will be resloved and the custom `.toString()` function will be called. Functions can also be triggered by how dates, regex and functions are formatted in the console. 19 | 20 | This lets the site know the millisecond you bring up devtools. Doing `const console = null` and other js hacks have not worked for me (the console function gets cached by the detector). 21 | 22 | If you can find the offending js responsible for the detection you can bypass it by redifining the function in violentmonkey, but I recommend against it since it's often hidden and obfuscated. The best way to bypass this issue is to re-compile firefox or chrome with a switch to disable the console. 23 | 24 | **3.** 25 | Invoking the debugger as a constructor? Looks something like this in the wild: 26 | ```js 27 | function _0x39426c(e) { 28 | function t(e) { 29 | if ("string" == typeof e) 30 | return function(e) {} 31 | .constructor("while (true) {}").apply("counter"); 32 | 1 !== ("" + e / e).length || e % 20 == 0 ? function() { 33 | return !0; 34 | } 35 | .constructor("debugger").call("action") : function() { 36 | return !1; 37 | } 38 | .constructor("debugger").apply("stateObject"), 39 | t(++e); 40 | } 41 | try { 42 | if (e) 43 | return t; 44 | t(0); 45 | } catch (e) {} 46 | } 47 | setInterval(function() { 48 | _0x39426c(); 49 | }, 4e3); 50 | ``` 51 | This function can be tracked down to this [script](https://github.com/javascript-obfuscator/javascript-obfuscator/blob/6de7c41c3f10f10c618da7cd96596e5c9362a25f/src/custom-code-helpers/debug-protection/templates/debug-protection-function/DebuggerTemplate.ts) 52 | 53 | This instantly freezes the webpage in firefox and makes it very unresponsive in chrome and does not rely on `console.log()`. You could bypass this by doing `const _0x39426c = null` in violentmonkey, but this bypass is not doable with heavily obfuscated js. 54 | 55 | Cutting out all the unnessecary stuff the remaining function is the following: 56 | ```js 57 | setInterval(() => { 58 | for (let i = 0; i < 100_00; i++) { 59 | _ = function() {}.constructor("debugger").call(); // also works with apply 60 | } 61 | }, 1e2); 62 | ``` 63 | Basically running `.constructor("debugger").call();` as much as possible without using while(true) (that locks up everything regardless). 64 | This is very likely a bug in the browser. 65 | 66 | **4.** 67 | Detecting window size. As you open developer tools your window size will change in a way that can be detected. 68 | This is both impossible to truly circumvent and simultaneously easily sidestepped. 69 | To bypass this what you need to do is open the devtools and click settings in top right corner and then select separate window. 70 | If the devtools are in a separate window they cannot be detected by this technique. 71 | 72 | **5.** 73 | Using source maps to detect the devtools making requests when opened. See https://weizmangal.com/page-js-anti-debug-1/ for further details. 74 | 75 | # How to bypass the detection? 76 | 77 | I have contributed patches to Librewolf to bypass some detection techniques. 78 | Use Librewolf or compile Firefox yourself with [my patches](https://github.com/Blatzar/scraping-tutorial/blob/master/patch_firefox_old.md). 79 | 80 | 1. Get librewolf at https://librewolf.net/ 81 | 2. Go to `about:config` 82 | 3. Set `librewolf.console.logging_disabled` to true to disable **method 2** 83 | 4. Set `librewolf.debugger.force_detach` to true to disable **method 1** and **method 3** 84 | 5. 
Make devtools open in a separate window to disable **method 4** 85 | 6. Disable source maps in [developer tools settings](https://github.com/Blatzar/scraping-tutorial/assets/46196380/f0ff2f24-6b8d-419c-86ac-9f47d98db749) to disable **method 5** 86 | 7. Now you have completely undetectable devtools! 87 | 88 | --- 89 | 90 | ### Next up: [Why your requests fail](https://github.com/Blatzar/scraping-tutorial/blob/master/disguising_your_scraper.md) 91 | -------------------------------------------------------------------------------- /disguising_your_scraper.md: -------------------------------------------------------------------------------- 1 |

# Disguising your scrapers

2 | 3 |

4 | If you're writing a Selenium scraper, be aware that your skill level doesn't match the minimum requirements for this page. 5 |

6 | 7 |

## Why is scraping not appreciated?

8 | 9 | - It obliterates ads and hence cuts into the site's revenue. 10 | - It is more often than not used to spam the content-serving networks, degrading server performance. 11 | - It is also frequently used to steal content from one site and serve it on another. 12 | - Competent scrapers usually look for exploits on a site. Among these, open-source scrapers may leak site exploits to a wider audience. 13 | 14 |

## Why do you need to disguise your scraper?

15 | 16 | As the points above suggest, scraping is not a welcome act. There are mechanisms to actively kill scrapers and only let the humans in, so you will need to make your scraper's identity as close to a browser's identity as possible. 17 | 18 | Some sites check the client using headers and on-site javascript challenges. Failing these checks results in invalid responses, usually with status codes in the 400-499 range. 19 | 20 | *Keep in mind that there are sites that produce misleading responses without giving out the appropriate status codes.* 21 | 22 |
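Since the status code alone can lie, it helps to sanity-check the response body as well. A minimal sketch of such a check — the URL and marker strings are just examples you would adapt to the site at hand:

```py
import requests

response = requests.get("https://example.com/")

# Hypothetical checks; pick markers that fit the site you are scraping.
blocked = (
    response.status_code >= 400
    or "captcha" in response.text.lower()
    or "access denied" in response.text.lower()
)
print("blocked" if blocked else "looks fine")
```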

## Custom Headers

23 | 24 | Here are some headers you need to check for: 25 | 26 | | Header | What's the purpose of this? | What should I change this to? | 27 | | --- | --- | --- | 28 | | `User-Agent` | Specifies your client's name along with its version. | Probably the user-agent used by your browser. | 29 | | `Referer` | Specifies which site referred the current site. | The url from which **you** obtained the scraping url. | 30 | | `X-Requested-With` | Specifies what caused the request to that site. This is prominent in a site's AJAX / API requests. | Usually `XMLHttpRequest`, but it may vary based on the site's JS. | 31 | | `Cookie` | Cookie required to access the site. | Whatever the cookie was when you accessed the site in your normal browser. | 32 | | `Authorization` | Authorization tokens / credentials required for site access. | Correct authorization tokens or credentials for site access. | 33 | 34 | Using the correct headers will give you access to the site content, provided you can also access it through your web browser. 35 | 36 | **Keep in mind that this is only a fraction of the possible headers.** 37 | 38 |
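To make the table concrete, here is a minimal sketch of sending those headers with `requests`; every value and URL below is a placeholder you would replace with what your own browser actually sends:

```py
import requests

# All header values below are hypothetical; copy the real ones from your browser's network tab.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:115.0) Gecko/20100101 Firefox/115.0",
    "Referer": "https://example.com/search?q=some+show",  # the page you found the link on
    "X-Requested-With": "XMLHttpRequest",                 # only if the site's own JS sends it
    "Cookie": "session=abc123",                           # whatever your browser sent
}

response = requests.get("https://example.com/ajax/episodes", headers=headers)
print(response.status_code)
```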

## Appropriate Libraries

39 | 40 | In Python, `requests` and `httpx` behave differently. 41 | 42 | ```py 43 | >>> import requests, httpx 44 | >>> requests.get("http://www.crunchyroll.com/", headers={"User-Agent": "justfoolingaround/1", "Referer": "https://example.com/"}) 45 | <Response [403]> 46 | >>> httpx.get("http://www.crunchyroll.com/", headers={"User-Agent": "justfoolingaround/1", "Referer": "https://example.com/"}) 47 | <Response [200 OK]> 48 | ``` 49 | 50 | As we can see, the former response is a 403, a forbidden response that generally means access to the content is denied. The latter however is a 200, OK response, and the content is available. 51 | 52 | This is the result of varying internal mechanisms. 53 | 54 | The only con to `httpx` in this case might be the fact that it fully encodes headers, whilst `requests` does not. This means header keys consisting of non-ASCII characters may not be able to bypass some sites. 55 | 56 |
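Whichever library you pick, it is convenient to set your disguise once on a shared client instead of repeating it on every call. A small sketch with `httpx` (the header values are again placeholders), which also leads into the session classes discussed next:

```py
import httpx

# Hypothetical defaults; swap in the values your own browser sends.
client = httpx.Client(
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
        "Accept-Language": "en-US,en;q=0.5",
    },
    follow_redirects=True,
)

response = client.get("https://example.com/")
print(response.status_code)
```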

## Response handling algorithms

57 | 58 | A session class is an object available in many libraries. This thing is like a house for your outgoing requests and incoming responses. A well written library has a session class that even accounts for appropriate cookie handling. Meaning, if you ever send a request to a site you need not need to worry about the cookie of that site for the next site you visit. 59 | 60 | No matter how cool session classes may be, at the end of the day, they are mere objects. That means, you, as a user can easily change what is within it. (This may require a high understanding of the library and the language.) 61 | 62 | This is done through inheritance. You inherit a session class and modify whats within. 63 | 64 | For example: 65 | 66 | ```py 67 | class QuiteANoise(httpx.Client): 68 | 69 | def request(self, *args, **kwargs): 70 | print("Ooh, I got a request with arguments: {!r}, and keyword arguments: {!r}.".format(args, kwargs)) 71 | response = super().request(*args, **kwargs) 72 | print("That request has a {!r}!".format(response)) 73 | return response 74 | ``` 75 | 76 | In the above inherited session, what we do is *quite noisy*. We announced a request that is about to be sent and a response that was just recieved. 77 | 78 | `super`, in Python, allows you to get the class that the current class inherits. 79 | 80 | Do not forget to return your `response`, else your program will be dumbfounded since nothing ever gets out of your request! 81 | 82 | So, we're going to abuse this fancy technique to effectively bypass some hinderances. 83 | 84 | Namely `hCaptcha`, `reCaptcha` and `Cloudflare`. 85 | 86 | ```py 87 | """ 88 | This code is completely hypothetical, you probably 89 | do not have a hCaptcha, reCaptcha and a Cloudflare 90 | bypass. 91 | 92 | This code is a mere reference and may not suffice 93 | your need. 94 | """ 95 | from . import hcaptcha 96 | from . import grecaptcha 97 | 98 | import httpx 99 | 100 | class YourScraperSession(httpx.Client): 101 | 102 | def request(self, *args, **kwargs): 103 | 104 | response = super().request(*args, **kwargs) 105 | 106 | if response.status_code >= 400: 107 | 108 | if hcaptcha.has_cloudflare(response): 109 | cloudflare_cookie = hcaptcha.cloudflare_clearance_jar(self, response, *args, **kwargs) 110 | self.cookies.update(cloudflare_cookie) 111 | return self.request(self, *args, **kwargs) 112 | 113 | # Further methods to bypass something else. 114 | return self.request(self, *args, **kwargs) # psssssst. RECURSIVE HELL, `return response` is safer 115 | 116 | 117 | hcaptcha_sk, type_of = hcaptcha.deduce_sitekey(self, response) 118 | 119 | if hcaptcha_sk: 120 | if type_of == 'hsw': 121 | token = hcaptcha.get_hsw_token(self, response, hcaptcha_sk) 122 | else: 123 | token = hcaptcha.get_hsl_token(self, response, hcaptcha_sk) 124 | 125 | setattr(response, 'hcaptcha_token', token) 126 | 127 | recaptcha_sk, type_of = grecaptcha.sitekey_on_site(self, response) 128 | 129 | if recaptcha_sk: 130 | if isinstance(type_of, int): 131 | token = grecaptcha.recaptcha_solve(self, response, recaptcha_sk, v=type_of) 132 | else: 133 | token = type_of 134 | 135 | setattr(response, 'grecaptcha_token', token) 136 | 137 | return response 138 | ``` 139 | 140 | So, let's see what happens here. 141 | 142 | Firstly, we check whether the response has a error or not. This is done by checking if the response's status code is **greater than or equal to** 400. 
143 | 144 | After this, we check if the site has Cloudflare, if the site has Cloudflare, we let the hypothetical function do its magic and give us the bypass cookies. Then after, we update our session class' cookie. Cookie vary across sites but in this case, our hypothetical function will take the session and make it so that the cookie only applies to that site url within and with the correct headers. 145 | 146 | After a magical cloudflare bypass (people wish they have this, you will too, probably.), we call the overridden function `.request` again to ensure the response following this will be bypassed to. This is recursion. 147 | 148 | If anything else is required, you should add your own code to execute bypasses so that your responses will be crisp and never error-filled. 149 | 150 | Else, we just return the fresh `.request`. 151 | 152 | Keep in mind that if you cannot bypass the 400~ error, your responses might end up in a permanent recursive hell, at least in the code above. 153 | 154 | To not make your responses never return, you might want to return the non-bypassed response. 155 | 156 | The next part mainly focuses on CAPTCHA bypasses and what we do is quite simple. A completed CAPTCHA *usually* returns a token. 157 | 158 | Returning this token with the response is not a good idea as the entire return type will change. We use a sneaky little function here. Namely `setattr`. What this does is, it sets an attribute of an object. 159 | 160 | The algorithm in easier terms is: 161 | 162 | Task: Bypass a donkey check with your human. 163 | 164 | - Yell "hee~haw". (Prove that you're a donkey, this is how the hypothetical functions work.) 165 | - Be handed the ribbon. (In our case, this is the token.) 166 | 167 | Now the problem is, the ribbon is not a human but still needs to come back. How does a normal human do this? Wear the ribbon. 168 | 169 | Wearing the ribbon is `setattr`. We can wear the ribbon everywhere. Leg, foot, butt.. you name it. No matter where you put it, you get the ribbon, so just be a bit reasonable with it. Like a decent developer and a decent human, wear the ribbon on the left side of your chest. In the code above, this reasonable place is `_token`. 170 | 171 | Let's get out of this donkey business. 172 | 173 | After this reasonable token placement, we get the response back. 174 | 175 | This token can now, always be accessed in reasonable places, reasonably. 176 | 177 | 178 | ```py 179 | client = YourScraperSession() 180 | 181 | bypassed_response = client.get("https://kwik.cx/f/2oHQioeCvHtx") 182 | print(bypassed_response.hcaptcha_token) 183 | ``` 184 | 185 | Keep in mind that if there is no ribbon/token, there is no way of reasonably accessing it. 186 | 187 | In any case, this is how you, as a decent developer, handle the response properly. 188 | 189 | ### Next up: [Finding video links](https://github.com/Blatzar/scraping-tutorial/blob/master/finding_video_links.md) 190 | -------------------------------------------------------------------------------- /finding_video_links.md: -------------------------------------------------------------------------------- 1 | # Finding video links 2 | 3 | Now you know the basics, enough to scrape most stuff from most sites, but not streaming sites. 4 | Because of the high costs of video hosting the video providers really don't want anyone scraping the video and bypassing the ads. 5 | This is why they often obfuscate, encrypt and hide their links which makes scraping really hard. 
6 | Some sites even put V3 Google Captcha on their links to prevent scraping while the majority IP/time/referer lock the video links to prevent sharing. 7 | You will almost never find a plain \ element with a mp4 link. 8 | 9 | **This is why you should always scrape the video first when trying to scrape a video hosting site. Sometimes getting the video link can be too hard.** 10 | 11 | I will therefore explain how to do more advanced scraping, how to get these video links. 12 | 13 | What you want to do is: 14 | 15 | 1. Find the iFrame/Video host.* 16 | 2. Open the iFrame in a separate tab to ease clutter.* 17 | 3. Find the video link. 18 | 4. Work backwards from the video link to find the source. 19 | 20 | * *Step 1 and 2 is not applicable to all sites.* 21 | 22 | Let's explain further: 23 | **Step 1**: Most sites use an iFrame system to show their videos. This is essentially loading a separate page within the page. 24 | This is most evident in [Gogoanime](https://gogoanime.gg/yakusoku-no-neverland-episode-1), link gets updated often, google the name and find their page if link isn't found. 25 | The easiest way of spotting these iframes is looking at the network tab trying to find requests not from the original site. I recommend using the HTML filter. 26 | 27 | ![finding](https://user-images.githubusercontent.com/46196380/149821806-7426ca0f-133f-4722-8e7f-ebae26ea2ef1.png) 28 | 29 | Once you have found the iFrame, in this case a fembed-hd link open it in another tab and work from there. (**Step 2**) 30 | If you only have the iFrame it is much easier to find the necessary stuff to generate the link since a lot of useless stuff from the original site is filtered out. 31 | 32 | **Step 3**: Find the video link. This is often quite easy, either filter all media requests or simply look for a request ending in .m3u8 or .mp4 33 | What this allows you to do is limit exclude many requests (only look at the requests before the video link) and start looking for the link origin (**Step 4**). 34 | 35 | ![video_link](https://user-images.githubusercontent.com/46196380/149821919-f65e2f72-b413-4151-a4a3-db7012e2ed18.png) 36 | 37 | I usually search for stuff in the video link and see if any text/headers from the preceding requests contain it. 38 | In this case fvs.io redirected to the mp4 link, now do the same steps for the fvs.io link to follow the request backwards to the origin. Like images are showing. 39 | 40 | 41 | ![fvs](https://user-images.githubusercontent.com/46196380/149821967-00c01103-5b4a-48dd-be18-e1fdfb967e4c.png) 42 | 43 | 44 | 45 | ![fvs_redirector](https://user-images.githubusercontent.com/46196380/149821984-0720addd-40a7-4a9e-a429-fec45ec28901.png) 46 | 47 | 48 | 49 | ![complete](https://user-images.githubusercontent.com/46196380/149821989-49b2ba8c-36b1-49a7-a41b-3c69df278a9f.png) 50 | 51 | 52 | 53 | **NOTE: Some sites use encrypted JS to generate the video links. You need to use the browser debugger to step by step find how the links are generated in that case** 54 | 55 | ## **What to do when the site uses a captcha?** 56 | 57 | You pretty much only have 3 options when that happens: 58 | 59 | 1. Try to use a fake / no captcha token. Some sites actually doesn't check that the captcha token is valid. 60 | 2. Use Webview or some kind of browser in the background to load the site in your stead. 61 | 3. 
Pray it's a captcha without payload, then it's possible to get the captcha key without a browser: 62 | 63 | Before showing a code example, I'll explain some of the logic so it's easier to visualize what's happening. Our end goal is to make a request to `https://www.google.com/recaptcha/api2/anchor` with some parameters that we can hardcode, since they're not bound to change, but we also need to pass 3 parameters that are dynamic. These include: `k` (stands for key), `co` and `v` (stands for vtoken). 64 | 65 | Here is a proof of concept code example of how you can get a captcha token programmatically (this can vary for various websites): 66 | ```sh 67 | key=$(curl -s "$main_page" | sed -nE "s@.*recaptcha_site_key = '(.*)'.*@\1@p") # the main_page variable in this example is the home page for our website, for example https://zoro.to 68 | co=$(printf "%s:443" "$main_page" | base64 | tr "=" ".") # here, we would be base64 encoding the following url: https://zoro.to:443 => aHR0cHM6Ly96b3JvLnRvOjQ0Mzo0NDM. 69 | vtoken=$(curl -s "https://www.google.com/recaptcha/api.js?render=$key" | sed -nE "s_.*po\.src=.*releases/(.*)/recaptcha.*_\1_p") 70 | recaptcha_token=$(curl -s "https://www.google.com/recaptcha/api2/anchor?ar=1&hl=en\ 71 | &size=invisible&cb=cs3&k=${key}&co=${co}&v=${vtoken}" | 72 | sed -nE 's_.*id="recaptcha-token" value="([^"]*)".*_\1_p') 73 | curl -s "$main_page/some_url_requiring_token?token=${recaptcha_token}" # now we can use the recaptcha token to pass the verification on the site 74 | ``` 75 | -------------------------------------------------------------------------------- /images/browser_request.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Blatzar/scraping-tutorial/0498a7799c5d253c6690041e9a72d231a820c591/images/browser_request.jpg -------------------------------------------------------------------------------- /images/scraper_request.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Blatzar/scraping-tutorial/0498a7799c5d253c6690041e9a72d231a820c591/images/scraper_request.jpg -------------------------------------------------------------------------------- /images/sed_regex.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Blatzar/scraping-tutorial/0498a7799c5d253c6690041e9a72d231a820c591/images/sed_regex.png -------------------------------------------------------------------------------- /patch_firefox_old.md: -------------------------------------------------------------------------------- 1 | ## NOTE: This section is old and unncessecary since these patches are already contributed to Librewolf. 2 | 3 | I tracked down the functions making devtools detection possible in the firefox source code and compiled a version which is undetectable by any of these tools. 4 | 5 | **Linux build**: https://mega.nz/file/YSAESJzb#x036cCtphjj9kB-kP_EXReTTkF7L7xN8nKw6sQN7gig 6 | 7 | **Windows build**: https://mega.nz/file/ZWAURAyA#qCrJ1BBxTLONHSTdE_boXMhvId-r0rk_kuPJWrPDiwg 8 | 9 | **Mac build**: https://mega.nz/file/Df5CRJQS#azO61dpP0_xgR8k-MmHaU_ufBvbl8_DlYky46SNSI0s 10 | 11 | about:config `devtools.console.bypass` disables the console which invalidates **method 2**. 12 | 13 | about:config `devtools.debugger.bypass` completely disables the debugger, useful to bypass **method 3**. 
14 | 15 | If you want to compile firefox yourself with these bypasses you can, using the line changes below in the described files. 16 | 17 | **BUILD: 101.0a1 (2022-04-19)** 18 | `./devtools/server/actors/thread.js` 19 | At line 390 20 | ```js 21 | attach(options) { 22 | let devtoolsBypass = Services.prefs.getBoolPref("devtools.debugger.bypass", true); 23 | if (devtoolsBypass) 24 | return; 25 | ``` 26 | 27 | `./devtools/server/actors/webconsole/listeners/console-api.js` 28 | At line 92 29 | ```js 30 | observe(message, topic) { 31 | let devtoolsBypass = Services.prefs.getBoolPref("devtools.console.bypass", true); 32 | if (!this.handler || devtoolsBypass) { 33 | return; 34 | } 35 | ``` 36 | `./browser/app/profile/firefox.js` 37 | At line 23 38 | 39 | ```js 40 | // Bypasses 41 | pref("devtools.console.bypass", true); 42 | pref("devtools.debugger.bypass", true); 43 | ``` 44 | -------------------------------------------------------------------------------- /starting.md: -------------------------------------------------------------------------------- 1 | # Introduction 2 | 3 | All webpages work by doing requests back and forth with the website server. A request is a way to ask the server for some piece of content, like asking a person what their name is. The question can be though of as the url and can look like this: 4 | 5 | *GET* https://example.com/name 6 | 7 | **which translates to:** 8 | 9 | *Question*: what is your name? 10 | 11 | ---- 12 | 13 | Scraping is just downloading a webpage and getting the wanted information from it. 14 | Usually the information is not directly available on the page you want, meaning you need to do multiple requests. It may sound confusing, but try to understand these two images by the end: 15 | 16 | Your browser works like this: 17 | ![Browser request](/images/browser_request.jpg) 18 | 19 | 20 | When you scrape you want to replicate what the browser does like this: 21 | ![Scraper request](/images/scraper_request.jpg) 22 | 23 | Every time you visit a website you make a request to the server, the server responds to the request with a response, containing what the browser asked for. This is like asking your friend to give you a copy of his lecture notes for studying. 24 | 25 | 26 | Scraping is usually more advanced than that, one request leads to the other in complicated ways you cannot debug. To get the information you want you need to use the information found on one page to visit to the next. To continue the anology with lecture notes, this would be like asking your friend for his lecture notes and in the notes you find the email to the teacher. With the email to the teacher you then ask them a question about the course. 27 | 28 | Scrapers are all about writing code to make that process happen automatically, and it can be done in different ways. You can operate an invisible web browser with code using something like selenium. This is the equivalent of simulating an entire brain to read the lecture notes to find the email, which is very slow. You can also take the content, parse it with a regex to instantly find any emails. This is like making a robot do ctrl+f in the lecture notes to find any emails, much more efficent, but requiring more fine tuning to get what you want. 29 | 30 | ------- 31 | 32 | To demonstrate how requests work we will be scraping the Readme. 33 | 34 | I'll use khttp for the kotlin implementation because of the ease of use, if you want something company-tier I'd recommend OkHttp. 
35 | 36 | (**Update**: I have made an okhttp wrapper **for android apps**, check out [NiceHttp](https://github.com/Blatzar/NiceHttp)) 37 | 38 | 39 | # **1. Scraping the Readme** 40 | 41 | This basically does what the first image does, it asks the web server with a GET request to give the content for the url. The .get(url) implies it is a GET request, and is basically the equivalent to saying "Give me the content for this url". This is used to get stuff like html, images and videos. 42 | 43 | There are multiple different request types, but this one is the most important, with the second most important being POST requests which we will look at later. 44 | 45 | **Python** 46 | ```python 47 | import requests 48 | url = "https://raw.githubusercontent.com/Blatzar/scraping-tutorial/master/README.md" 49 | response = requests.get(url) 50 | print(response.text) # Prints the readme 51 | ``` 52 | 53 | **Kotlin** 54 | 55 | In build.gradle: 56 | ``` 57 | repositories { 58 | mavenCentral() 59 | jcenter() 60 | maven { url 'https://jitpack.io' } 61 | } 62 | 63 | dependencies { 64 | // Other dependencies above 65 | compile group: 'khttp', name: 'khttp', version: '1.0.0' 66 | } 67 | ``` 68 | In main.kt 69 | ```java 70 | fun main() { 71 | val url = "https://raw.githubusercontent.com/Blatzar/scraping-tutorial/master/README.md" 72 | val response = khttp.get(url) 73 | println(response.text) 74 | } 75 | ``` 76 | 77 | **Shell** 78 | ```sh 79 | curl "https://raw.githubusercontent.com/Blatzar/scraping-tutorial/master/README.md" 80 | ``` 81 | 82 | 83 | # **2. Getting the github project description** 84 | Scraping is all about getting what you want in a good format you can use to automate stuff. 85 | 86 | Start by opening up the developer tools, using 87 | 88 | Ctrl + Shift + I 89 | 90 | or 91 | 92 | f12 93 | 94 | or 95 | 96 | Right click and press *Inspect* 97 | 98 | In here you can look at all the network requests the browser is making and much more, but the important part currently is the HTML displayed. You need to find the HTML responsible for showing the project description, but how? 99 | 100 | Either click the small mouse in the top left of the developer tools or press 101 | 102 | Ctrl + Shift + C 103 | 104 | This makes your mouse highlight any element you hover over. Press the description to highlight up the element responsible for showing it. 105 | 106 | Your HTML will now be focused on something like: 107 | 108 | 109 | ```cshtml 110 |
<p class="f4 mt-3">
111 | Work in progress tutorial for scraping streaming sites 112 |
</p>
113 | ``` 114 | 115 | Now there's multiple ways to get the text, but the 2 methods I always use is Regex and CSS selectors. Regex is basically a ctrl+f on steroids, you can search for anything. CSS selectors is a way to parse the HTML like a browser and select an element in it. 116 | 117 | ## CSS Selectors 118 | 119 | The element is a paragraph tag, eg `
<p>
`, which can be found using the CSS selector: "p". 120 | 121 | classes helps to narrow down the CSS selector search, in this case: `class="f4 mt-3"` 122 | 123 | This can be represented with 124 | ```css 125 | p.f4.mt-3 126 | ``` 127 | a dot for every class ([full list of CSS selectors found here](https://www.w3schools.com/cssref/css_selectors.asp)) 128 | 129 | You can test if this CSS selector works by opening the console tab and typing: 130 | 131 | ```js 132 | document.querySelectorAll("p.f4.mt-3"); 133 | ``` 134 | 135 | This prints: 136 | ```java 137 | NodeList [p.f4.mt-3] 138 | ``` 139 | 140 | ### **NOTE**: You may not get the same results when scraping from command line, classes and elements are sometimes created by javascript on the site. 141 | 142 | 143 | **Python** 144 | 145 | ```python 146 | import requests 147 | from bs4 import BeautifulSoup # Full documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/ 148 | 149 | url = "https://github.com/Blatzar/scraping-tutorial" 150 | response = requests.get(url) 151 | soup = BeautifulSoup(response.text, 'lxml') 152 | element = soup.select("p.f4.mt-3") # Using the CSS selector 153 | print(element[0].text.strip()) # Selects the first element, gets the text and strips it (removes starting and ending spaces) 154 | ``` 155 | 156 | **Kotlin** 157 | 158 | In build.gradle: 159 | ``` 160 | repositories { 161 | mavenCentral() 162 | jcenter() 163 | maven { url 'https://jitpack.io' } 164 | } 165 | 166 | dependencies { 167 | // Other dependencies above 168 | implementation "org.jsoup:jsoup:1.11.3" 169 | compile group: 'khttp', name: 'khttp', version: '1.0.0' 170 | } 171 | ``` 172 | In main.kt 173 | ```java 174 | fun main() { 175 | val url = "https://github.com/Blatzar/scraping-tutorial" 176 | val response = khttp.get(url) 177 | val soup = Jsoup.parse(response.text) 178 | val element = soup.select("p.f4.mt-3") // Using the CSS selector 179 | println(element.text().trim()) // Gets the text and strips it (removes starting and ending spaces) 180 | } 181 | ``` 182 | 183 | **Shell** 184 | In order to avoid premature heart attacks, the shell scraping example which relies on regex can be found in the regex section. 185 | *Note*: 186 | Although there are external libraries which can be used to parse html in shellscripts, such as htmlq and pup, these are often slower at parsing than sed (a built-in stream editor command on unix systems). 187 | This is why using sed with the extended regex flag `-E` is a preferrable way of parsing scraped data when writing shellscripts. 188 | 189 | 190 | ## **Regex:** 191 | 192 | When working with Regex I highly recommend using https://regex101.com/ (using the python flavor) 193 | 194 | Press Ctrl + U 195 | 196 | to get the whole site document as text and copy everything 197 | 198 | Paste it in the test string in regex101 and try to write an expression to only capture the text you want. 199 | 200 | In this case the elements is 201 | 202 | ```cshtml 203 |
<p class="f4 mt-3">
204 | Work in progress tutorial for scraping streaming sites 205 |
</p>
206 | ``` 207 | 208 | Maybe we can search for `
<p class=\"f4 mt-3\">
` (backslashes for ") 209 | 210 | ```regex 211 |
<p class=\"f4 mt-3\">
212 | ``` 213 | 214 | Gives a match, so let's expand the match to grab all the characters between the opening tag and the next `<`, along with the surrounding whitespace, ending up with: ```regex <p class=\"f4 mt-3\">\s*(.*)?\s*< 228 | ``` 229 | **Explained**: 230 | 231 | Any text exactly matching `
<p class=\"f4 mt-3\">
` 232 | 233 | then any number of whitespaces 234 | 235 | then any number of any characters (which will be stored in group 1) 236 | 237 | then any number of whitespaces 238 | 239 | then the text `<` 240 | 241 | 242 | In code: 243 | 244 | **Python** 245 | 246 | ```python 247 | import requests 248 | import re # regex 249 | 250 | url = "https://github.com/Blatzar/scraping-tutorial" 251 | response = requests.get(url) 252 | description_regex = r"
<p class=\"f4 mt-3\">
\s*(.*)?\s*<" # r"" stands for raw, which makes blackslashes work better, used for regexes 253 | description = re.search(description_regex, response.text).groups()[0] 254 | print(description) 255 | ``` 256 | 257 | **Kotlin** 258 | In main.kt 259 | ```java 260 | fun main() { 261 | val url = "https://github.com/Blatzar/scraping-tutorial" 262 | val response = khttp.get(url) 263 | val descriptionRegex = Regex("""
<p class=\"f4 mt-3\">
\s*(.*)?\s*<""") 264 | val description = descriptionRegex.find(response.text)?.groups?.get(1)?.value 265 | println(description) 266 | } 267 | ``` 268 | 269 | **Shell** 270 | Here is an example of how html data can be parsed using sed with the extended regex flag: 271 | ```sh 272 | printf 'some html data then data-id="123" other data and title here: title="Foo Bar" and more html\n' | 273 | sed -nE "s/.*data-id=\"([0-9]*)\".*title=\"([^\"]*)\".*/Title: \2\nID: \1/p" # note that we use .* at the beginning and end of the pattern in order to avoid printing everything that preceeds and follows the actual patterns we are matching 274 | ``` 275 | 276 | # Closing words 277 | 278 | Make sure you understand everything here before moving on, this is the absolute fundamentals when it comes to scraping. 279 | Some people come this far, but they do not quite understand how powerful this technique is. 280 | Let's say you have a website which when you click a button opens another website. You want to do what the button press does, but when you look at the html the url to the other site is nowhere to be found. How would you solve it? 281 | 282 | If you know a bit of how websites work you might figure out that the is because the button link gets generated by JavaScript. The obvious solution would then be to run the JavaScript and generate the button link. This is both impractical, inefficent and not what was used so far in this guide. 283 | 284 | What you should do instead is inspect the button link url and check for something unique. For example if the button url is "https://example.com/click/487a162?key=748" then I would look through the webpage for any instances of "487a162" and "748" and figure out a way to get those strings automatically, because that's everything required to make the link. 285 | 286 | 287 | The secret to scraping is: You have all information required to make anything your browser does, you just need to figure out how. You almost never need to run some website JavaScript to get what you want. It is like a puzzle on how to get to the next request url, you have all the pieces, you just need to figure out how they fit. 288 | 289 | ### Next up: [Properly scraping JSON apis](https://github.com/Blatzar/scraping-tutorial/blob/master/using_apis.md) 290 | -------------------------------------------------------------------------------- /using_apis.md: -------------------------------------------------------------------------------- 1 | ### About 2 | Whilst scraping a site is always a nice option, using it's API is way better.
3 | And sometimes it's the only way `(eg: the site uses its API to load the content, so scraping doesn't work)`. 4 | 5 | Anyways, this guide won't teach the same concepts over and over again,
6 | so if you can't even make requests to an API then this will not tell you how to do that. 7 | 8 | Refer to [starting.md](./starting.md) on how to make http/https requests. 9 | And yes, this guide expects you to have basic knowledge on both Python and Kotlin. 10 | 11 | ### Using an API (and parsing json) 12 | So, the API I will use is the [SWAPI](https://swapi.dev/).
13 | 14 | To parse that json data in python you would do: 15 | ```python 16 | import requests 17 | 18 | url = "https://swapi.dev/api/planets/1/" 19 | json = requests.get(url).json() 20 | 21 | """ What the variable json looks like 22 | { 23 | "name": "Tatooine", 24 | "rotation_period": "23", 25 | "orbital_period": "304", 26 | "diameter": "10465", 27 | "climate": "arid", 28 | "gravity": "1 standard", 29 | "terrain": "desert", 30 | "surface_water": "1", 31 | "population": "200000", 32 | "residents": [ 33 | "https://swapi.dev/api/people/1/" 34 | ], 35 | "films": [ 36 | "https://swapi.dev/api/films/1/" 37 | ], 38 | "created": "2014-12-09T13:50:49.641000Z", 39 | "edited": "2014-12-20T20:58:18.411000Z", 40 | "url": "https://swapi.dev/api/planets/1/" 41 | } 42 | """ 43 | ``` 44 | Now, that is way too simple in python, sadly I am here to get your hopes down, and say that its not as simple in kotlin.
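Before moving on to Kotlin, note that the parsed json is just a regular Python dict, so you can index into it and follow the linked endpoints from the example above directly:

```python
import requests

planet = requests.get("https://swapi.dev/api/planets/1/").json()
print(planet["name"], planet["climate"])  # Tatooine arid

# The API links related resources by url, so you can follow them with further requests.
for resident_url in planet["residents"]:
    resident = requests.get(resident_url).json()
    print(resident["name"])
```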
45 | 46 | First of all, we are going to use a library named Jackson by FasterXML.
47 | In build.gradle: 48 | ``` 49 | repositories { 50 | mavenCentral() 51 | jcenter() 52 | maven { url 'https://jitpack.io' } 53 | } 54 | 55 | dependencies { 56 | ... 57 | ... 58 | implementation "com.fasterxml.jackson.module:jackson-module-kotlin:2.11.3" 59 | compile group: 'khttp', name: 'khttp', version: '1.0.0' 60 | } 61 | ``` 62 | After we have installed the dependencies needed, we have to define a schema for the json.
63 | Essentially, we are going to write the structure of the json in order for jackson to parse our json.
64 | This is an advantage for us, since it also means that we get the nice IDE autocomplete/suggestions and typehints!

65 | 66 | Getting the json data: 67 | ```kotlin 68 | val jsonString = khttp.get("https://swapi.dev/api/planets/1/").text 69 | ``` 70 | 71 | First step is to build a mapper that reads the json string, in order to do that we need to import some things first. 72 | 73 | ```kotlin 74 | import com.fasterxml.jackson.databind.DeserializationFeature 75 | import com.fasterxml.jackson.module.kotlin.KotlinModule 76 | import com.fasterxml.jackson.databind.json.JsonMapper 77 | import com.fasterxml.jackson.module.kotlin.readValue 78 | ``` 79 | After that we initialize the mapper: 80 | ```kotlin 81 | val mapper: JsonMapper = JsonMapper.builder().addModule(KotlinModule()) 82 | .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false).build() 83 | ``` 84 | 85 | The next step is to...write down the structure of our json! 86 | This is the boring part for some, but it can be automated by using websites like [json2kt](https://www.json2kt.com/) or [quicktype](https://app.quicktype.io/) to generate the entire code for you. 87 |

88 | 89 | First step to declaring the structure for a json is to import the JsonProperty annotation. 90 | ```kotlin 91 | import com.fasterxml.jackson.annotation.JsonProperty 92 | ``` 93 | Second step is to write down a data class that represents said json. 94 | ```kotlin 95 | // example json = {"cat": "meow", "dog": ["w", "o", "o", "f"]} 96 | 97 | data class Example ( 98 | @JsonProperty("cat") val cat: String, 99 | @JsonProperty("dog") val dog: List<String> 100 | ) 101 | ``` 102 | This is as simple as it gets.

103 | 104 | Enough of the examples, this is the representation of `https://swapi.dev/api/planets/1/` in kotlin: 105 | ```kotlin 106 | data class Planet ( 107 | @JsonProperty("name") val name: String, 108 | @JsonProperty("rotation_period") val rotationPeriod: String, 109 | @JsonProperty("orbital_period") val orbitalPeriod: String, 110 | @JsonProperty("diameter") val diameter: String, 111 | @JsonProperty("climate") val climate: String, 112 | @JsonProperty("gravity") val gravity: String, 113 | @JsonProperty("terrain") val terrain: String, 114 | @JsonProperty("surface_water") val surfaceWater: String, 115 | @JsonProperty("population") val population: String, 116 | @JsonProperty("residents") val residents: List<String>, 117 | @JsonProperty("films") val films: List<String>, 118 | @JsonProperty("created") val created: String, 119 | @JsonProperty("edited") val edited: String, 120 | @JsonProperty("url") val url: String 121 | ) 122 | ``` 123 | **For json objects that don't necessarily contain a key, or whose value can be either the expected type or null, you need to declare that type as nullable in the representation of that json.**
124 | Example of the above situation: 125 | ```json 126 | [ 127 | { 128 | "cat":"meow" 129 | }, 130 | { 131 | "dog":"woof", 132 | "cat":"meow" 133 | }, 134 | { 135 | "fish":"meow", 136 | "cat":"f" 137 | } 138 | ] 139 | ``` 140 | It's representation would be: 141 | ```kotlin 142 | data class Example ( 143 | @JsonProperty("cat") val cat: String, 144 | @JsonProperty("dog") val dog: String?, 145 | @JsonProperty("fish") val fish: String? 146 | ) 147 | ``` 148 | As you can see, `dog` and `fish` are nullable because they are properties that are missing in an item.
149 | Whilst `cat` is not nullable because it is available in all of the items.
150 | Basic nullable detection is implemented in [json2kt](https://www.json2kt.com/) so it's recommended to use that.
151 | But it is very likely that it might fail to detect some nullable types, so it's up to us to validate the generated code. 152 | 153 | Second step to parsing json is...to just call our `mapper` instance. 154 | ```kotlin 155 | val json = mapper.readValue<Planet>(jsonString) 156 | ``` 157 | And voila!
158 | We have successfully parsed our json within kotlin.
159 | One thing to note is that you don't need to add all of the json key/value pairs to the structure, you can just have what you need. 160 | 161 | **Shell** 162 | Here is how you could extract the values for the different keys in the json, using sed and tr: 163 | 164 | 1) Extract the `climate` value: 165 | ```sh 166 | curl "https://swapi.dev/api/planets/1/" | sed -nE "s/.*\"climate\":\"([^\"]*)\".*/\1/p" # note that we are using the [^\"]* pattern as a replacement for greedy matching in standard regex, as posix sed does not support greedy matching; we are also escaping the quotation mark 167 | ``` 168 | The regex pattern above can be visualized as such: 169 | ![Regex-visualizer](/images/sed_regex.png) 170 | 171 | 2) More advanced example: 172 | Extract all values for the `films` key: 173 | ```sh 174 | curl "https://swapi.dev/api/planets/1/" | sed -nE "s/.*\"films\":\[([^]]*)\].*/\1/p" | sed "s/,/\n/g;s/\"//g" # the first sed pattern has the same logic as in the previous example. for the second one the semicolon character is used for separating 2 sed commands, meaning that this sample command can be translated to human language as: transform all (/g is the global flag, which means it'll perform for all instances on a single line and not just the first one) the commas into new lines, and also delete all quotation marks from the input 175 | ``` 176 | 177 | Additionally, a pattern I recommend using when parsing json without `jq`, only using posix shell commands, is `tr ',' '\n'`. This makes the json easier to parse using sed. 178 | 179 | ### Note 180 | Even though we set `DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES` as `false` it will still error on missing properties.
181 | If a json may or may not include some info, make those properties nullable in the structure you build. 182 | 183 | ### Next up: [Evading developer tools detection](https://github.com/Blatzar/scraping-tutorial/blob/master/devtools_detectors.md) 184 | --------------------------------------------------------------------------------