├── README.md ├── devtools_detectors.md ├── disguising_your_scraper.md ├── finding_video_links.md ├── images │ ├── browser_request.jpg │ ├── scraper_request.jpg │ └── sed_regex.png ├── patch_firefox_old.md ├── starting.md └── using_apis.md /README.md: -------------------------------------------------------------------------------- 1 | ## Requests-based scraping tutorial 2 | 3 | 4 | You want to start scraping? Well, this guide will teach you, and not some baby Selenium scraping. This guide only uses raw requests and has examples in both Python and Kotlin. Only basic programming knowledge in one of those languages is required to follow along. 5 | 6 | If you find any aspect of this guide confusing, please open an issue about it and I will try to improve things. 7 | 8 | If you do not know programming at all then this guide will __not__ help you; learn programming first! Real scraping cannot be done by copy-pasting with a vague understanding. 9 | 10 | 11 | 0. [Starting scraping from zero](https://github.com/Blatzar/scraping-tutorial/blob/master/starting.md) 12 | 1. [Properly scraping JSON APIs often found on sites](https://github.com/Blatzar/scraping-tutorial/blob/master/using_apis.md) 13 | 2. [Evading developer tools detection when scraping](https://github.com/Blatzar/scraping-tutorial/blob/master/devtools_detectors.md) 14 | 3. [Why your requests fail and how to fix them](https://github.com/Blatzar/scraping-tutorial/blob/master/disguising_your_scraper.md) 15 | 4. [Finding links and scraping videos](https://github.com/Blatzar/scraping-tutorial/blob/master/finding_video_links.md) 16 | 17 | Once you've read and understood the concepts behind scraping, take a look at [a provider in CloudStream](https://github.com/LagradOst/CloudStream-3/blob/3a78f41aad93dc5755ce9e105db9ab19287b912a/app/src/main/java/com/lagradost/cloudstream3/movieproviders/VidEmbedProvider.kt). I added tons of comments to make every aspect of writing CloudStream providers clear. 
Even if you're not planning on contributing to CloudStream, looking at the code may help. 18 | 19 | Take a look at [Thenos](https://github.com/LagradOst/CloudStream-3/blob/3a78f41aad93dc5755ce9e105db9ab19287b912a/app/src/main/java/com/lagradost/cloudstream3/movieproviders/ThenosProvider.kt) for an example of JSON-based scraping in Kotlin. 20 | -------------------------------------------------------------------------------- /devtools_detectors.md: -------------------------------------------------------------------------------- 1 | **TL;DR**: You are going to get fucked by sites detecting your devtools. You need to know what techniques are used to bypass them. 2 | 3 | Many sites use some sort of debugger detection to prevent you from looking at the important requests made by the browser. 4 | 5 | You can test the devtools detector here: https://blog.aepkill.com/demos/devtools-detector/ *(does not feature source mapping detection)* 6 | Code for the detector can be found here: https://github.com/AEPKILL/devtools-detector 7 | 8 | # How are they detecting the tools? 9 | 10 | One or more of the following methods are used to prevent devtools in the majority of cases (if not all): 11 | 12 | **1.** 13 | Calling `debugger` in an endless loop. 14 | This is very easy to bypass. You can either right-click the offending line (in Chrome) and disable all debugger calls from that line, or you can disable the whole debugger. 15 | 16 | **2.** 17 | Attaching a custom `.toString()` function to an expression and printing it with `console.log()`. 18 | When devtools are open (even while not in the console tab) all `console.log()` calls will be resolved and the custom `.toString()` function will be called. Functions can also be triggered by how dates, regexes and functions are formatted in the console. 19 | 20 | This lets the site know the millisecond you bring up devtools. Doing `const console = null` and other js hacks has not worked for me (the console function gets cached by the detector). 
21 | 22 | If you can find the offending js responsible for the detection, you can bypass it by redefining the function in Violentmonkey, but I recommend against it since it's often hidden and obfuscated. The best way to bypass this issue is to re-compile Firefox or Chrome with a switch to disable the console. 23 | 24 | **3.** 25 | Invoking the debugger as a constructor. It looks something like this in the wild: 26 | ```js 27 | function _0x39426c(e) { 28 | function t(e) { 29 | if ("string" == typeof e) 30 | return function(e) {} 31 | .constructor("while (true) {}").apply("counter"); 32 | 1 !== ("" + e / e).length || e % 20 == 0 ? function() { 33 | return !0; 34 | } 35 | .constructor("debugger").call("action") : function() { 36 | return !1; 37 | } 38 | .constructor("debugger").apply("stateObject"), 39 | t(++e); 40 | } 41 | try { 42 | if (e) 43 | return t; 44 | t(0); 45 | } catch (e) {} 46 | } 47 | setInterval(function() { 48 | _0x39426c(); 49 | }, 4e3); 50 | ``` 51 | This function can be tracked down to this [script](https://github.com/javascript-obfuscator/javascript-obfuscator/blob/6de7c41c3f10f10c618da7cd96596e5c9362a25f/src/custom-code-helpers/debug-protection/templates/debug-protection-function/DebuggerTemplate.ts). 52 | 53 | This instantly freezes the webpage in Firefox, makes it very unresponsive in Chrome, and does not rely on `console.log()`. You could bypass this by doing `const _0x39426c = null` in Violentmonkey, but this bypass is not doable with heavily obfuscated js. 54 | 55 | Cutting out all the unnecessary stuff, the remaining function is the following: 56 | ```js 57 | setInterval(() => { 58 | for (let i = 0; i < 100_00; i++) { 59 | _ = function() {}.constructor("debugger").call(); // also works with apply 60 | } 61 | }, 1e2); 62 | ``` 63 | Basically running `.constructor("debugger").call();` as much as possible without using `while (true)` (which locks up everything regardless). 64 | This is very likely a bug in the browser. 
65 | 66 | **4.** 67 | Detecting window size. As you open developer tools, your window size changes in a way that can be detected. 68 | This cannot truly be prevented, but it is easily sidestepped. 69 | To bypass it, open the devtools, click the settings icon in the top-right corner, and select the separate-window option. 70 | If the devtools are in a separate window, they cannot be detected by this technique. 71 | 72 | **5.** 73 | Using source maps to detect the devtools making requests when opened. See https://weizmangal.com/page-js-anti-debug-1/ for further details. 74 | 75 | # How to bypass the detection? 76 | 77 | I have contributed patches to Librewolf to bypass some detection techniques. 78 | Use Librewolf or compile Firefox yourself with [my patches](https://github.com/Blatzar/scraping-tutorial/blob/master/patch_firefox_old.md). 79 | 80 | 1. Get Librewolf at https://librewolf.net/ 81 | 2. Go to `about:config` 82 | 3. Set `librewolf.console.logging_disabled` to true to disable **method 2** 83 | 4. Set `librewolf.debugger.force_detach` to true to disable **method 1** and **method 3** 84 | 5. Make devtools open in a separate window to disable **method 4** 85 | 6. Disable source maps in [developer tools settings](https://github.com/Blatzar/scraping-tutorial/assets/46196380/f0ff2f24-6b8d-419c-86ac-9f47d98db749) to disable **method 5** 86 | 7. Now you have completely undetectable devtools! 87 | 88 | --- 89 | 90 | ### Next up: [Why your requests fail](https://github.com/Blatzar/scraping-tutorial/blob/master/disguising_your_scraper.md) 91 | -------------------------------------------------------------------------------- /disguising_your_scraper.md: -------------------------------------------------------------------------------- 1 |
4 | If you're writing a Selenium scraper, be aware that your skill level doesn't match the minimum requirements for this page. 5 |
<p class="f4 mt-3"> 111 | Work in progress tutorial for scraping streaming sites 112 | </p>
The text sits in a `<p>` element, which can be found using the CSS selector: "p". 120 | 121 | Classes help narrow down the CSS selector search, in this case: `class="f4 mt-3"` 122 | 123 | This can be represented with 124 | ```css 125 | p.f4.mt-3 126 | ``` 127 | a dot for every class ([full list of CSS selectors found here](https://www.w3schools.com/cssref/css_selectors.asp)) 128 | 129 | You can test if this CSS selector works by opening the console tab and typing: 130 | 131 | ```js 132 | document.querySelectorAll("p.f4.mt-3"); 133 | ``` 134 | 135 | This prints: 136 | ```js 137 | NodeList [p.f4.mt-3] 138 | ``` 139 | 140 | ### **NOTE**: You may not get the same results when scraping from the command line; classes and elements are sometimes created by JavaScript on the site. 141 | 142 | 143 | **Python** 144 | 145 | ```python 146 | import requests 147 | from bs4 import BeautifulSoup # Full documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/ 148 | 149 | url = "https://github.com/Blatzar/scraping-tutorial" 150 | response = requests.get(url) 151 | soup = BeautifulSoup(response.text, 'lxml') 152 | element = soup.select("p.f4.mt-3") # Using the CSS selector 153 | print(element[0].text.strip()) # Selects the first element, gets the text and strips it (removes starting and ending spaces) 154 | ``` 155 | 156 | **Kotlin** 157 | 158 | In build.gradle: 159 | ``` 160 | repositories { 161 | mavenCentral() 162 | jcenter() 163 | maven { url 'https://jitpack.io' } 164 | } 165 | 166 | dependencies { 167 | // Other dependencies above 168 | implementation "org.jsoup:jsoup:1.11.3" 169 | compile group: 'khttp', name: 'khttp', version: '1.0.0' 170 | } 171 | ``` 172 | In main.kt 173 | ```kotlin import org.jsoup.Jsoup 174 | fun main() { 175 | val url = "https://github.com/Blatzar/scraping-tutorial" 176 | val response = khttp.get(url) 177 | val soup = Jsoup.parse(response.text) 178 | val element = soup.select("p.f4.mt-3") // Using the CSS selector 179 | println(element.text().trim()) // Gets the text and strips it (removes
starting and ending spaces) 180 | } 181 | ``` 182 | 183 | **Shell** 184 | In order to avoid premature heart attacks, the shell scraping example, which relies on regex, can be found in the regex section. 185 | *Note*: 186 | Although there are external libraries which can be used to parse HTML in shell scripts, such as htmlq and pup, these are often slower at parsing than sed (a standard stream-editor utility on Unix systems). 187 | This is why using sed with the extended regex flag `-E` is a preferable way of parsing scraped data when writing shell scripts. 188 | 189 | 190 | ## **Regex:** 191 | 192 | When working with regex I highly recommend using https://regex101.com/ (using the Python flavor) 193 | 194 | Press Ctrl + U 195 | 196 | to get the whole site document as text and copy everything 197 | 198 | Paste it in the test string in regex101 and try to write an expression to only capture the text you want. 199 | 200 | In this case the element is 201 | 202 | ```cshtml 203 | <p class="f4 mt-3">
204 | Work in progress tutorial for scraping streaming sites 205 | </p> 206 | ``` 207 | 208 | Something unique to search for here would be `<p class=\"f4 mt-3\">` (backslashes for ") 209 | 210 | ```regex 211 | <p class=\"f4 mt-3\">
212 | ``` 213 | 214 | Gives a match, so let's expand the match to all the characters between the two brackets ( p>....< ) 215 | 216 | Some important tokens for that would be: 217 | 218 | `.*?` to indicate any character except a newline, any number of times, taking as little as possible 219 | 220 | `\s*` to indicate any number of whitespace characters 221 | 222 | `(expression inside)` to indicate a group 223 | 224 | Which gives: 225 | 226 | ```regex 227 | <p class=\"f4 mt-3\">
\s*(.*)?\s*< 228 | ``` 229 | **Explained**: 230 | 231 | Any text exactly matching `<p class=\"f4 mt-3\">
` 232 | 233 | then any number of whitespaces 234 | 235 | then any number of any characters (which will be stored in group 1) 236 | 237 | then any number of whitespaces 238 | 239 | then the text `<` 240 | 241 | 242 | In code: 243 | 244 | **Python** 245 | 246 | ```python 247 | import requests 248 | import re # regex 249 | 250 | url = "https://github.com/Blatzar/scraping-tutorial" 251 | response = requests.get(url) 252 | description_regex = r"<p class=\"f4 mt-3\">
\s*(.*)?\s*<" # r"" stands for raw, which makes backslashes work better, used for regexes 253 | description = re.search(description_regex, response.text).groups()[0] 254 | print(description) 255 | ``` 256 | 257 | **Kotlin** 258 | In main.kt 259 | ```kotlin 260 | fun main() { 261 | val url = "https://github.com/Blatzar/scraping-tutorial" 262 | val response = khttp.get(url) 263 | val descriptionRegex = Regex("""<p class=\"f4 mt-3\">
\s*(.*)?\s*<""") 264 | val description = descriptionRegex.find(response.text)?.groups?.get(1)?.value 265 | println(description) 266 | } 267 | ``` 268 | 269 | **Shell** 270 | Here is an example of how HTML data can be parsed using sed with the extended regex flag: 271 | ```sh 272 | printf 'some html data then data-id="123" other data and title here: title="Foo Bar" and more html\n' | 273 | sed -nE "s/.*data-id=\"([0-9]*)\".*title=\"([^\"]*)\".*/Title: \2\nID: \1/p" # note that we use .* at the beginning and end of the pattern in order to avoid printing everything that precedes and follows the actual patterns we are matching 274 | ``` 275 | 276 | # Closing words 277 | 278 | Make sure you understand everything here before moving on; these are the absolute fundamentals of scraping. 279 | Some people come this far, but they do not quite understand how powerful this technique is. 280 | Let's say you have a website which, when you click a button, opens another website. You want to do what the button press does, but when you look at the html the url to the other site is nowhere to be found. How would you solve it? 281 | 282 | If you know a bit of how websites work you might figure out that this is because the button link gets generated by JavaScript. The obvious solution would then be to run the JavaScript and generate the button link. This is impractical, inefficient, and not what has been used so far in this guide. 283 | 284 | What you should do instead is inspect the button link url and check for something unique. For example, if the button url is "https://example.com/click/487a162?key=748" then I would look through the webpage for any instances of "487a162" and "748" and figure out a way to get those strings automatically, because that's everything required to make the link. 285 | 286 | 287 | The secret to scraping is: You have all information required to make anything your browser does, you just need to figure out how. 
You almost never need to run some website JavaScript to get what you want. It is like a puzzle on how to get to the next request url: you have all the pieces, you just need to figure out how they fit. 288 | 289 | ### Next up: [Properly scraping JSON APIs](https://github.com/Blatzar/scraping-tutorial/blob/master/using_apis.md) 290 | -------------------------------------------------------------------------------- /using_apis.md: -------------------------------------------------------------------------------- 1 | ### About 2 | Whilst scraping a site is always a nice option, using its API is way better. 3 | And sometimes it's the only way (e.g. the site uses its API to load the content, so scraping doesn't work). 4 | 5 | Anyways, this guide won't teach the same concepts over and over again, 6 | so if you can't even make requests to an API then this will not tell you how to do that. 7 | 8 | Refer to [starting.md](./starting.md) on how to make http/https requests. 9 | And yes, this guide expects you to have basic knowledge of both Python and Kotlin. 10 | 11 | ### Using an API (and parsing json) 12 | So, the API I will use is the [SWAPI](https://swapi.dev/). 
13 | 14 | To parse that json data in Python you would do: 15 | ```python 16 | import requests 17 | 18 | url = "https://swapi.dev/api/planets/1/" 19 | json = requests.get(url).json() 20 | 21 | """ What the variable json looks like 22 | { 23 | "name": "Tatooine", 24 | "rotation_period": "23", 25 | "orbital_period": "304", 26 | "diameter": "10465", 27 | "climate": "arid", 28 | "gravity": "1 standard", 29 | "terrain": "desert", 30 | "surface_water": "1", 31 | "population": "200000", 32 | "residents": [ 33 | "https://swapi.dev/api/people/1/" 34 | ], 35 | "films": [ 36 | "https://swapi.dev/api/films/1/" 37 | ], 38 | "created": "2014-12-09T13:50:49.641000Z", 39 | "edited": "2014-12-20T20:58:18.411000Z", 40 | "url": "https://swapi.dev/api/planets/1/" 41 | } 42 | """ 43 | ``` 44 | Now, that is way too simple in Python; sadly, I am here to get your hopes down and say that it's not as simple in Kotlin. 45 | 46 | First of all, we are going to use a library named Jackson by FasterXML. 47 | In build.gradle: 48 | ``` 49 | repositories { 50 | mavenCentral() 51 | jcenter() 52 | maven { url 'https://jitpack.io' } 53 | } 54 | 55 | dependencies { 56 | ... 57 | ... 58 | implementation "com.fasterxml.jackson.module:jackson-module-kotlin:2.11.3" 59 | compile group: 'khttp', name: 'khttp', version: '1.0.0' 60 | } 61 | ``` 62 | After we have installed the needed dependencies, we have to define a schema for the json. 63 | Essentially, we are going to write the structure of the json in order for Jackson to parse our json. 64 | This is an advantage for us, since it also means that we get the nice IDE autocomplete/suggestions and type hints! 65 | 66 | Getting the json data: 67 | ```kotlin 68 | val jsonString = khttp.get("https://swapi.dev/api/planets/1/").text 69 | ``` 70 | 71 | The first step is to build a mapper that reads the json string; in order to do that, we need to import some things first. 
72 | 73 | ```kotlin 74 | import com.fasterxml.jackson.databind.DeserializationFeature 75 | import com.fasterxml.jackson.module.kotlin.KotlinModule 76 | import com.fasterxml.jackson.databind.json.JsonMapper 77 | import com.fasterxml.jackson.module.kotlin.readValue 78 | ``` 79 | After that we initialize the mapper: 80 | ```kotlin 81 | val mapper: JsonMapper = JsonMapper.builder().addModule(KotlinModule()) 82 | .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false).build() 83 | ``` 84 | 85 | The next step is to...write down the structure of our json! 86 | This is the boring part for some, but it can be automated by using websites like [json2kt](https://www.json2kt.com/) or [quicktype](https://app.quicktype.io/) to generate the entire code for you. 87 | 88 | 89 | The first step to declaring the structure for a json is to import the JsonProperty annotation. 90 | ```kotlin 91 | import com.fasterxml.jackson.annotation.JsonProperty 92 | ``` 93 | The second step is to write down a data class that represents said json. 94 | ```kotlin 95 | // example json = {"cat": "meow", "dog": ["w", "o", "o", "f"]} 96 | 97 | data class Example ( 98 | @JsonProperty("cat") val cat: String, 99 | @JsonProperty("dog") val dog: List<String> 100 | ) 101 | ``` 102 | This is as simple as it gets. 
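For comparison, the same typed-schema idea can be sketched on the Python side with a stdlib `dataclass` (an illustration only; unlike Jackson, nothing maps the dict automatically, so we unpack it ourselves):

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    cat: str
    dog: List[str]

# Same example json as the Kotlin data class above
raw = json.loads('{"cat": "meow", "dog": ["w", "o", "o", "f"]}')
example = Example(cat=raw["cat"], dog=raw["dog"])
print(example.cat)           # meow
print("".join(example.dog))  # woof
```

You still get IDE autocomplete and type hints on `example`, which is the same benefit the Kotlin schema buys you.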
103 | 104 | Enough of the examples, this is the representation of `https://swapi.dev/api/planets/1/` in Kotlin: 105 | ```kotlin 106 | data class Planet ( 107 | @JsonProperty("name") val name: String, 108 | @JsonProperty("rotation_period") val rotationPeriod: String, 109 | @JsonProperty("orbital_period") val orbitalPeriod: String, 110 | @JsonProperty("diameter") val diameter: String, 111 | @JsonProperty("climate") val climate: String, 112 | @JsonProperty("gravity") val gravity: String, 113 | @JsonProperty("terrain") val terrain: String, 114 | @JsonProperty("surface_water") val surfaceWater: String, 115 | @JsonProperty("population") val population: String, 116 | @JsonProperty("residents") val residents: List<String>, 117 | @JsonProperty("films") val films: List<String>, 118 | @JsonProperty("created") val created: String, 119 | @JsonProperty("edited") val edited: String, 120 | @JsonProperty("url") val url: String 121 | ) 122 | ``` 123 | **For json objects that don't necessarily contain a key, or whose value can be either the expected type or null, you need to mark that type as nullable in the representation of that json.** 124 | Example of the above situation: 125 | ```json 126 | [ 127 | { 128 | "cat":"meow" 129 | }, 130 | { 131 | "dog":"woof", 132 | "cat":"meow" 133 | }, 134 | { 135 | "fish":"meow", 136 | "cat":"f" 137 | } 138 | ] 139 | ``` 140 | Its representation would be: 141 | ```kotlin 142 | data class Example ( 143 | @JsonProperty("cat") val cat: String, 144 | @JsonProperty("dog") val dog: String?, 145 | @JsonProperty("fish") val fish: String? 146 | ) 147 | ``` 148 | As you can see, `dog` and `fish` are nullable because they are properties that are missing in an item. 149 | Whilst `cat` is not nullable because it is available in all of the items. 150 | Basic nullable detection is implemented in [json2kt](https://www.json2kt.com/) so it's recommended to use that. 151 | But it is very likely that it might fail to detect some nullable types, so it's up to us to validate the generated code. 
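For contrast, the same optional-key situation needs no schema on the Python side (a sketch): `dict.get` returns `None` for a missing key, which is the moral equivalent of the nullable Kotlin properties above:

```python
import json

# Same example json as the nullable Kotlin data class above
items = json.loads('[{"cat":"meow"},{"dog":"woof","cat":"meow"},{"fish":"meow","cat":"f"}]')
for item in items:
    cat = item["cat"]      # present in every object, safe to index directly
    dog = item.get("dog")  # None when the key is missing, like `val dog: String?`
    fish = item.get("fish")
    print(cat, dog, fish)
```

Indexing with `item["dog"]` instead would raise a `KeyError` on the first object, which is the Python analogue of Jackson erroring on a missing non-nullable property.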
152 | 153 | The second step to parsing json is...to just call our `mapper` instance. 154 | ```kotlin 155 | val json = mapper.readValue<Planet>(jsonString) 156 | ``` 157 | And voilà! 158 | We have successfully parsed our json within Kotlin. 159 | One thing to note is that you don't need to add all of the json key/value pairs to the structure, you can just have what you need. 160 | 161 | **Shell** 162 | Here is how you could extract the values for the different keys in the json, using sed and tr: 163 | 164 | 1) Extract the `climate` value: 165 | ```sh 166 | curl "https://swapi.dev/api/planets/1/" | sed -nE "s/.*\"climate\":\"([^\"]*)\".*/\1/p" # note that we are using the [^\"]* pattern as a replacement for lazy (non-greedy) matching, as POSIX sed only supports greedy matching; we are also escaping the quotation marks 167 | ``` 168 | The regex pattern above can be visualized as such: 169 | ![sed regex pattern](./images/sed_regex.png) 170 | 171 | 2) More advanced example: 172 | Extract all values for the `films` key: 173 | ```sh 174 | curl "https://swapi.dev/api/planets/1/" | sed -nE "s/.*\"films\":\[([^]]*)\].*/\1/p" | sed "s/,/\n/g;s/\"//g" # the first sed pattern has the same logic as in the previous example. for the second one the semicolon character is used for separating 2 sed commands, meaning that this sample command can be translated to human language as: transform all (/g is the global flag, which means it'll perform for all instances on a single line and not just the first one) the commas into new lines, and also delete all quotation marks from the input 175 | ``` 176 | 177 | Additionally, a pattern I recommend using when parsing json without `jq`, only using POSIX shell commands, is `tr ',' '\n'`. This makes the json easier to parse using sed. 178 | 179 | ### Note 180 | Even though we set `DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES` to `false`, it will still error on missing properties. 181 | If a json may or may not include some info, make those properties nullable in the structure you build. 
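The shell extractions above can also be mirrored in Python with `re`, if you ever need the same trick outside a shell (a sketch; a trimmed-down inline string stands in for the live response body, and for real use the json module is still the better tool):

```python
import re

# Hypothetical trimmed-down stand-in for the body of https://swapi.dev/api/planets/1/
body = '{"name":"Tatooine","climate":"arid","films":["https://swapi.dev/api/films/1/"]}'

# Same idea as the first sed pattern: [^"]* grabs everything up to the closing quote
climate = re.search(r'"climate":"([^"]*)"', body).group(1)
print(climate)  # arid

# Same idea as the films extraction: capture inside the brackets, then clean up
films = re.search(r'"films":\[([^\]]*)\]', body).group(1).replace('"', '').split(',')
print(films)  # ['https://swapi.dev/api/films/1/']
```

Like the sed version, this is fragile against reformatted json (extra whitespace, reordered keys), which is exactly why the parsed-schema approaches above are preferred when available.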
182 | 183 | ### Next up: [Evading developer tools detection](https://github.com/Blatzar/scraping-tutorial/blob/master/devtools_detectors.md) 184 | --------------------------------------------------------------------------------