├── .gitignore ├── 01-02.Course Notes └── Course Notes - Web Scraping and API Fundamentals in Python.pdf ├── 03.Working with APIs ├── Currency Exchange API │ ├── Section 3 - Additional API functionalities.ipynb │ ├── Section 3 - Creating a simple currency converter.ipynb │ ├── Section 3 - Exchange rates API GETting a JSON reply.ipynb │ ├── Section 3 - Incorporating parameters in a GET request.ipynb │ ├── additional_API_functionalities.py │ ├── currency_converter.py │ ├── exchange_rate_API.py │ └── exchange_rate_API_with_paremeters.py ├── EDAMAM API │ ├── EDAMAM_API.py │ ├── RoastedChicken_nutrients.csv │ ├── Section 3 - Downloading files with requests.ipynb │ ├── Section 3 - EDAMAM API - Initial setup and registration.ipynb │ └── Section 3 - EDAMAM API - Sending a POST request.ipynb ├── GitHub API │ ├── Section 3 - GitHub API - Pagination.ipynb │ └── github_API.py └── iTune API │ ├── Section 3 - iTunes API - Exercise Solution.ipynb │ ├── Section 3 - iTunes API - Exrecise Setup.ipynb │ ├── Section 3 - iTunes API - Structuring and exporting the data.ipynb │ ├── Section 3 - iTunes API.ipynb │ ├── iTunes_API.py │ ├── iTunes_API_structuring_exporting.py │ ├── songs_info.csv │ └── songs_info.xlsx ├── 04.HTML Overview ├── Section 4 - CSS and JavaScript.html ├── Section 4 - CSS style tag.html ├── Section 4 - Character encoding - Euro sign.html └── Section 4 - My First Webpage.html ├── 05.Web Scraping with Beautiful Soup ├── Section 5 - Extracting data from nested HTML tags.ipynb ├── Section 5 - Extracting data from the HTML tree.ipynb ├── Section 5 - Extracting text from an HTML tag.ipynb ├── Section 5 - Practical example - Exercise Setup-MyWork.ipynb ├── Section 5 - Practical example - Exercise Setup.ipynb ├── Section 5 - Practical example - Exercise Solution.ipynb ├── Section 5 - Practical example - dealing with links.ipynb ├── Section 5 - Scraping multiple pages automatically.ipynb ├── Section 5 - Searching and navigating the HTML tree.ipynb ├── Section 5 - Searching the HTML tree by attributes.ipynb ├── Section 5 - Setting up your first scraper.ipynb ├── scraper.py ├── scraper2_extracting_data.py ├── scraper3_extracting_text.py ├── scraper4_dealing_links.py ├── scraper5_extracting_nestedHTML.py ├── scraper6_scraping_multiple_pages.py └── wiki_music.html ├── 06.Project Scraping - Rotten Tomatoes ├── Rotten_tomatoes_page_2_HTML_Parser.html ├── Rotten_tomatoes_page_2_LXML_Parser.html ├── Scraper_RottenTomatoes.ipynb ├── Section 6 - Dealing with the cast.ipynb ├── Section 6 - Extracting the rest of the information - Exercise - Setup.ipynb ├── Section 6 - Extracting the rest of the information.ipynb ├── Section 6 - Extracting the score - Setup.ipynb ├── Section 6 - Extracting the score - Solution.ipynb ├── Section 6 - Extracting the title and year of each movie.ipynb ├── Section 6 - Setting up your scraper.ipynb ├── Section 6 - Storing the data in a structured form.ipynb ├── Section 6 -Extracting the rest of the information - Exercise - Solution.ipynb ├── movies_info.csv └── movies_info.xlsx ├── 07.Scraping HTML Tables with Pandas ├── Scraper_HTMLtables.ipynb └── Section 7 - Scraping HTML Tables with the help of Pandas.ipynb ├── 08.Scraping Steam Project ├── New_Trending_Games_Info.csv ├── Scraper Steam - My Work.ipynb ├── Section 8 - Scraping Steam - Setup.ipynb ├── Top_Rated_Games.info.csv ├── Top_Sellers_Games_info.csv ├── Trending_Games_info.csv └── steam.html ├── 08.Scraping Youtube Project ├── Scraper YouTube - MyWork.ipynb ├── Section 8 - Scraping YouTube - Setup.ipynb ├── 
searched_video.html ├── stairway_to_heaven.html └── youtube.html ├── 09.Common roadblocks when Web Scraping ├── RequestHeaders.ipynb ├── Section 9 - Sample HTML login Form.html ├── Section 9 - Sample login code.ipynb ├── Section 9 - Scraping multiple pages automatically - rate limitting.ipynb └── Sessions.ipynb ├── 10.The Requests-HTML Package ├── Scraper_CSS_Selectors.ipynb ├── Scraper_JavaScript.ipynb ├── Scraper_withRequestsHTML.ipynb ├── Section 10 - CSS Selectors.ipynb ├── Section 10 - Exploring the package capabilities.ipynb ├── Section 10 - Scraping JavaScript.ipynb └── Section 10 - Searching for text.ipynb ├── 11.Scraping JavaScript - SoundCloud Project ├── Scraper SoundCloud - My Work.ipynb └── Section 10 - Scraping SoundCloud - Setup.ipynb ├── LICENSE └── readme.md /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | # Created by https://www.gitignore.io/api/python,vagrant,virtualenv,jupyternotebooks 3 | # Edit at https://www.gitignore.io/?templates=python,vagrant,virtualenv,jupyternotebooks 4 | 5 | ### JupyterNotebooks ### 6 | # gitignore template for Jupyter Notebooks 7 | # website: http://jupyter.org/ 8 | 9 | .ipynb_checkpoints 10 | */.ipynb_checkpoints/* 11 | 12 | # IPython 13 | profile_default/ 14 | ipython_config.py 15 | 16 | # Remove previous ipynb_checkpoints 17 | # git rm -r .ipynb_checkpoints/ 18 | 19 | ### Python ### 20 | # Byte-compiled / optimized / DLL files 21 | __pycache__/ 22 | *.py[cod] 23 | *$py.class 24 | 25 | # C extensions 26 | *.so 27 | 28 | # Distribution / packaging 29 | .Python 30 | build/ 31 | develop-eggs/ 32 | dist/ 33 | downloads/ 34 | eggs/ 35 | .eggs/ 36 | lib/ 37 | lib64/ 38 | parts/ 39 | sdist/ 40 | var/ 41 | wheels/ 42 | pip-wheel-metadata/ 43 | share/python-wheels/ 44 | *.egg-info/ 45 | .installed.cfg 46 | *.egg 47 | MANIFEST 48 | 49 | # PyInstaller 50 | # Usually these files are written by a python script from a template 51 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 52 | *.manifest 53 | *.spec 54 | 55 | # Installer logs 56 | pip-log.txt 57 | pip-delete-this-directory.txt 58 | 59 | # Unit test / coverage reports 60 | htmlcov/ 61 | .tox/ 62 | .nox/ 63 | .coverage 64 | .coverage.* 65 | .cache 66 | nosetests.xml 67 | coverage.xml 68 | *.cover 69 | .hypothesis/ 70 | .pytest_cache/ 71 | 72 | # Translations 73 | *.mo 74 | *.pot 75 | 76 | # Scrapy stuff: 77 | .scrapy 78 | 79 | # Sphinx documentation 80 | docs/_build/ 81 | 82 | # PyBuilder 83 | target/ 84 | 85 | # pyenv 86 | .python-version 87 | 88 | # pipenv 89 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 90 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 91 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 92 | # install all needed dependencies. 
93 | #Pipfile.lock 94 | 95 | # celery beat schedule file 96 | celerybeat-schedule 97 | 98 | # SageMath parsed files 99 | *.sage.py 100 | 101 | # Spyder project settings 102 | .spyderproject 103 | .spyproject 104 | 105 | # Rope project settings 106 | .ropeproject 107 | 108 | # Mr Developer 109 | .mr.developer.cfg 110 | .project 111 | .pydevproject 112 | 113 | # mkdocs documentation 114 | /site 115 | 116 | # mypy 117 | .mypy_cache/ 118 | .dmypy.json 119 | dmypy.json 120 | 121 | # Pyre type checker 122 | .pyre/ 123 | 124 | ### Vagrant ### 125 | # General 126 | .vagrant/* 127 | 128 | # Log files (if you are creating logs in debug mode, uncomment this) 129 | # *.log 130 | 131 | ### Vagrant Patch ### 132 | *.box 133 | 134 | ### VirtualEnv ### 135 | # Virtualenv 136 | # http://iamzed.com/2009/05/07/a-primer-on-virtualenv/ 137 | pyvenv.cfg 138 | .env 139 | .venv 140 | env/ 141 | venv/ 142 | ENV/ 143 | env.bak/ 144 | venv.bak/ 145 | pip-selfcheck.json 146 | 147 | # End of https://www.gitignore.io/api/python,vagrant,virtualenv,jupyternotebooks 148 | 149 | __pycache__ 150 | *.pyc 151 | .vagrant 152 | -------------------------------------------------------------------------------- /01-02.Course Notes/Course Notes - Web Scraping and API Fundamentals in Python.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ptyadana/Web-Scraping-and-API-in-Python/9595bc418866642143eaf4a1f700dd646d81d427/01-02.Course Notes/Course Notes - Web Scraping and API Fundamentals in Python.pdf -------------------------------------------------------------------------------- /03.Working with APIs/Currency Exchange API/Section 3 - Exchange rates API GETting a JSON reply.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Pulling data from public APIs (without registration) - GET request" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "# loading the packages\n", 17 | "# requests provides us with the capabilities of sending an HTTP request to a server\n", 18 | "import requests" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "## Extracting data on currency exchange rates" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "# We will use an API containing currency exchange rates as published by the European Central Bank\n", 35 | "# Documentation at https://exchangeratesapi.io" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "### Sending a GET request" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 3, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "# Define the base URL\n", 52 | "# Base URL: the part of the URL common to all requests, not containing the parameters\n", 53 | "base_url = \"https://api.exchangeratesapi.io/latest\"" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 4, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "# We can make a GET request to this API endpoint with requests.get\n", 63 | "response = requests.get(base_url)\n", 64 | "\n", 65 | "# This method returns the response from the server\n", 66 | "# We store this response in a variable for future processing" 67 | ] 68 | }, 
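A more defensive version of this GET request — a minimal sketch added for reference, not a cell from the original notebook — would cap the wait time and turn HTTP error statuses into exceptions; both the timeout argument and raise_for_status() are standard requests features:

import requests

base_url = "https://api.exchangeratesapi.io/latest"

# Give up after 10 seconds instead of waiting indefinitely on an unresponsive server
response = requests.get(base_url, timeout=10)

# Raise requests.exceptions.HTTPError for any 4xx/5xx status code
response.raise_for_status()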
69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "### Investigating the response" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 5, 79 | "metadata": {}, 80 | "outputs": [ 81 | { 82 | "data": { 83 | "text/plain": [ 84 | "True" 85 | ] 86 | }, 87 | "execution_count": 5, 88 | "metadata": {}, 89 | "output_type": "execute_result" 90 | } 91 | ], 92 | "source": [ 93 | "# Checking if the request went through ok\n", 94 | "response.ok" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 6, 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "data": { 104 | "text/plain": [ 105 | "200" 106 | ] 107 | }, 108 | "execution_count": 6, 109 | "metadata": {}, 110 | "output_type": "execute_result" 111 | } 112 | ], 113 | "source": [ 114 | "# Checking the status code of the response\n", 115 | "response.status_code" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 7, 121 | "metadata": {}, 122 | "outputs": [ 123 | { 124 | "data": { 125 | "text/plain": [ 126 | "'{\"rates\":{\"CAD\":1.5613,\"HKD\":8.9041,\"ISK\":145.0,\"PHP\":58.013,\"DKK\":7.4695,\"HUF\":336.25,\"CZK\":25.504,\"AUD\":1.733,\"RON\":4.8175,\"SEK\":10.7203,\"IDR\":16488.05,\"INR\":84.96,\"BRL\":5.4418,\"RUB\":85.1553,\"HRK\":7.55,\"JPY\":117.12,\"THB\":36.081,\"CHF\":1.0594,\"SGD\":1.5841,\"PLN\":4.3132,\"BGN\":1.9558,\"TRY\":7.0002,\"CNY\":7.96,\"NOK\":10.89,\"NZD\":1.8021,\"ZAR\":18.2898,\"USD\":1.1456,\"MXN\":24.3268,\"ILS\":4.0275,\"GBP\":0.87383,\"KRW\":1374.71,\"MYR\":4.8304},\"base\":\"EUR\",\"date\":\"2020-03-09\"}'" 127 | ] 128 | }, 129 | "execution_count": 7, 130 | "metadata": {}, 131 | "output_type": "execute_result" 132 | } 133 | ], 134 | "source": [ 135 | "# Inspecting the content body of the response (as a regular 'string')\n", 136 | "response.text" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 8, 142 | "metadata": {}, 143 | "outputs": [ 144 | { 145 | "data": { 146 | "text/plain": [ 147 | "b'{\"rates\":{\"CAD\":1.5613,\"HKD\":8.9041,\"ISK\":145.0,\"PHP\":58.013,\"DKK\":7.4695,\"HUF\":336.25,\"CZK\":25.504,\"AUD\":1.733,\"RON\":4.8175,\"SEK\":10.7203,\"IDR\":16488.05,\"INR\":84.96,\"BRL\":5.4418,\"RUB\":85.1553,\"HRK\":7.55,\"JPY\":117.12,\"THB\":36.081,\"CHF\":1.0594,\"SGD\":1.5841,\"PLN\":4.3132,\"BGN\":1.9558,\"TRY\":7.0002,\"CNY\":7.96,\"NOK\":10.89,\"NZD\":1.8021,\"ZAR\":18.2898,\"USD\":1.1456,\"MXN\":24.3268,\"ILS\":4.0275,\"GBP\":0.87383,\"KRW\":1374.71,\"MYR\":4.8304},\"base\":\"EUR\",\"date\":\"2020-03-09\"}'" 148 | ] 149 | }, 150 | "execution_count": 8, 151 | "metadata": {}, 152 | "output_type": "execute_result" 153 | } 154 | ], 155 | "source": [ 156 | "# Inspecting the content of the response (in 'bytes' format)\n", 157 | "response.content" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 9, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [ 166 | "# The data is presented in JSON format" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "### Handling the JSON" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 10, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "data": { 183 | "text/plain": [ 184 | "{'rates': {'CAD': 1.5613,\n", 185 | " 'HKD': 8.9041,\n", 186 | " 'ISK': 145.0,\n", 187 | " 'PHP': 58.013,\n", 188 | " 'DKK': 7.4695,\n", 189 | " 'HUF': 336.25,\n", 190 | " 'CZK': 25.504,\n", 191 | " 'AUD': 1.733,\n", 192 | " 'RON': 4.8175,\n", 193 | " 'SEK': 10.7203,\n", 194 | 
" 'IDR': 16488.05,\n", 195 | " 'INR': 84.96,\n", 196 | " 'BRL': 5.4418,\n", 197 | " 'RUB': 85.1553,\n", 198 | " 'HRK': 7.55,\n", 199 | " 'JPY': 117.12,\n", 200 | " 'THB': 36.081,\n", 201 | " 'CHF': 1.0594,\n", 202 | " 'SGD': 1.5841,\n", 203 | " 'PLN': 4.3132,\n", 204 | " 'BGN': 1.9558,\n", 205 | " 'TRY': 7.0002,\n", 206 | " 'CNY': 7.96,\n", 207 | " 'NOK': 10.89,\n", 208 | " 'NZD': 1.8021,\n", 209 | " 'ZAR': 18.2898,\n", 210 | " 'USD': 1.1456,\n", 211 | " 'MXN': 24.3268,\n", 212 | " 'ILS': 4.0275,\n", 213 | " 'GBP': 0.87383,\n", 214 | " 'KRW': 1374.71,\n", 215 | " 'MYR': 4.8304},\n", 216 | " 'base': 'EUR',\n", 217 | " 'date': '2020-03-09'}" 218 | ] 219 | }, 220 | "execution_count": 10, 221 | "metadata": {}, 222 | "output_type": "execute_result" 223 | } 224 | ], 225 | "source": [ 226 | "# Requests has in-build method to directly convert the response to JSON format\n", 227 | "response.json()" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 11, 233 | "metadata": {}, 234 | "outputs": [ 235 | { 236 | "data": { 237 | "text/plain": [ 238 | "dict" 239 | ] 240 | }, 241 | "execution_count": 11, 242 | "metadata": {}, 243 | "output_type": "execute_result" 244 | } 245 | ], 246 | "source": [ 247 | "# In Python, this JSON is stored as a dictionary\n", 248 | "type(response.json())" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 12, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "# A useful library for JSON manipulation and pretty print\n", 258 | "import json\n", 259 | "\n", 260 | "# It has two main methods:\n", 261 | "# .loads(), which creates a Python dictionary from a JSON format string (just as response.json() does)\n", 262 | "# .dumps(), which creates a JSON format string out of a Python dictionary " 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 13, 268 | "metadata": {}, 269 | "outputs": [ 270 | { 271 | "data": { 272 | "text/plain": [ 273 | "'{\\n \"rates\": {\\n \"CAD\": 1.5613,\\n \"HKD\": 8.9041,\\n \"ISK\": 145.0,\\n \"PHP\": 58.013,\\n \"DKK\": 7.4695,\\n \"HUF\": 336.25,\\n \"CZK\": 25.504,\\n \"AUD\": 1.733,\\n \"RON\": 4.8175,\\n \"SEK\": 10.7203,\\n \"IDR\": 16488.05,\\n \"INR\": 84.96,\\n \"BRL\": 5.4418,\\n \"RUB\": 85.1553,\\n \"HRK\": 7.55,\\n \"JPY\": 117.12,\\n \"THB\": 36.081,\\n \"CHF\": 1.0594,\\n \"SGD\": 1.5841,\\n \"PLN\": 4.3132,\\n \"BGN\": 1.9558,\\n \"TRY\": 7.0002,\\n \"CNY\": 7.96,\\n \"NOK\": 10.89,\\n \"NZD\": 1.8021,\\n \"ZAR\": 18.2898,\\n \"USD\": 1.1456,\\n \"MXN\": 24.3268,\\n \"ILS\": 4.0275,\\n \"GBP\": 0.87383,\\n \"KRW\": 1374.71,\\n \"MYR\": 4.8304\\n },\\n \"base\": \"EUR\",\\n \"date\": \"2020-03-09\"\\n}'" 274 | ] 275 | }, 276 | "execution_count": 13, 277 | "metadata": {}, 278 | "output_type": "execute_result" 279 | } 280 | ], 281 | "source": [ 282 | "# .dumps() has options to make the string 'prettier', more readable\n", 283 | "# We can choose the number of spaces to be used as indentation\n", 284 | "json.dumps(response.json(), indent=4)" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": 14, 290 | "metadata": {}, 291 | "outputs": [ 292 | { 293 | "name": "stdout", 294 | "output_type": "stream", 295 | "text": [ 296 | "{\n", 297 | " \"rates\": {\n", 298 | " \"CAD\": 1.5613,\n", 299 | " \"HKD\": 8.9041,\n", 300 | " \"ISK\": 145.0,\n", 301 | " \"PHP\": 58.013,\n", 302 | " \"DKK\": 7.4695,\n", 303 | " \"HUF\": 336.25,\n", 304 | " \"CZK\": 25.504,\n", 305 | " \"AUD\": 1.733,\n", 306 | " \"RON\": 4.8175,\n", 307 | " \"SEK\": 
10.7203,\n", 308 | " \"IDR\": 16488.05,\n", 309 | " \"INR\": 84.96,\n", 310 | " \"BRL\": 5.4418,\n", 311 | " \"RUB\": 85.1553,\n", 312 | " \"HRK\": 7.55,\n", 313 | " \"JPY\": 117.12,\n", 314 | " \"THB\": 36.081,\n", 315 | " \"CHF\": 1.0594,\n", 316 | " \"SGD\": 1.5841,\n", 317 | " \"PLN\": 4.3132,\n", 318 | " \"BGN\": 1.9558,\n", 319 | " \"TRY\": 7.0002,\n", 320 | " \"CNY\": 7.96,\n", 321 | " \"NOK\": 10.89,\n", 322 | " \"NZD\": 1.8021,\n", 323 | " \"ZAR\": 18.2898,\n", 324 | " \"USD\": 1.1456,\n", 325 | " \"MXN\": 24.3268,\n", 326 | " \"ILS\": 4.0275,\n", 327 | " \"GBP\": 0.87383,\n", 328 | " \"KRW\": 1374.71,\n", 329 | " \"MYR\": 4.8304\n", 330 | " },\n", 331 | " \"base\": \"EUR\",\n", 332 | " \"date\": \"2020-03-09\"\n", 333 | "}\n" 334 | ] 335 | } 336 | ], 337 | "source": [ 338 | "# In order to visualize these changes, we need to print the string\n", 339 | "print(json.dumps(response.json(), indent=4))" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 15, 345 | "metadata": {}, 346 | "outputs": [ 347 | { 348 | "data": { 349 | "text/plain": [ 350 | "dict_keys(['rates', 'base', 'date'])" 351 | ] 352 | }, 353 | "execution_count": 15, 354 | "metadata": {}, 355 | "output_type": "execute_result" 356 | } 357 | ], 358 | "source": [ 359 | "# It contains 3 keys; the value for the 'rates' key is another dictionary\n", 360 | "response.json().keys()" 361 | ] 362 | } 363 | ], 364 | "metadata": { 365 | "kernelspec": { 366 | "display_name": "Python 3", 367 | "language": "python", 368 | "name": "python3" 369 | }, 370 | "language_info": { 371 | "codemirror_mode": { 372 | "name": "ipython", 373 | "version": 3 374 | }, 375 | "file_extension": ".py", 376 | "mimetype": "text/x-python", 377 | "name": "python", 378 | "nbconvert_exporter": "python", 379 | "pygments_lexer": "ipython3", 380 | "version": "3.7.3" 381 | } 382 | }, 383 | "nbformat": 4, 384 | "nbformat_minor": 2 385 | } 386 | -------------------------------------------------------------------------------- /03.Working with APIs/Currency Exchange API/Section 3 - Incorporating parameters in a GET request.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Pulling data from public APIs (without registration) - GET request" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "# loading the packages\n", 17 | "# requests provides us with the capabilities of sending an HTTP request to a server\n", 18 | "import requests" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "## Extracting data on currency exchange rates" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "# We will use an API containing currency exchange rates as published by the European Central Bank\n", 35 | "# Documentation at https://exchangeratesapi.io" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "### Sending a GET request" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 3, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "# Define the base URL\n", 52 | "# Base URL: the part of the URL common to all requests, not containing the parameters\n", 53 | "base_url = \"https://api.exchangeratesapi.io/latest\"" 54 | ] 55 | }, 56 | { 57 | 
"cell_type": "code", 58 | "execution_count": 4, 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "# We can make a GET request to this API endpoint with requests.get\n", 63 | "response = requests.get(base_url)\n", 64 | "\n", 65 | "# This method returns the response from the server\n", 66 | "# We store this response in a variable for future processing" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "### Investigating the response" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 5, 79 | "metadata": {}, 80 | "outputs": [ 81 | { 82 | "data": { 83 | "text/plain": [ 84 | "True" 85 | ] 86 | }, 87 | "execution_count": 5, 88 | "metadata": {}, 89 | "output_type": "execute_result" 90 | } 91 | ], 92 | "source": [ 93 | "# Checking if the request went through ok\n", 94 | "response.ok" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 6, 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "data": { 104 | "text/plain": [ 105 | "200" 106 | ] 107 | }, 108 | "execution_count": 6, 109 | "metadata": {}, 110 | "output_type": "execute_result" 111 | } 112 | ], 113 | "source": [ 114 | "# Checking the status code of the response\n", 115 | "response.status_code" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 7, 121 | "metadata": {}, 122 | "outputs": [ 123 | { 124 | "data": { 125 | "text/plain": [ 126 | "'{\"rates\":{\"CAD\":1.5613,\"HKD\":8.9041,\"ISK\":145.0,\"PHP\":58.013,\"DKK\":7.4695,\"HUF\":336.25,\"CZK\":25.504,\"AUD\":1.733,\"RON\":4.8175,\"SEK\":10.7203,\"IDR\":16488.05,\"INR\":84.96,\"BRL\":5.4418,\"RUB\":85.1553,\"HRK\":7.55,\"JPY\":117.12,\"THB\":36.081,\"CHF\":1.0594,\"SGD\":1.5841,\"PLN\":4.3132,\"BGN\":1.9558,\"TRY\":7.0002,\"CNY\":7.96,\"NOK\":10.89,\"NZD\":1.8021,\"ZAR\":18.2898,\"USD\":1.1456,\"MXN\":24.3268,\"ILS\":4.0275,\"GBP\":0.87383,\"KRW\":1374.71,\"MYR\":4.8304},\"base\":\"EUR\",\"date\":\"2020-03-09\"}'" 127 | ] 128 | }, 129 | "execution_count": 7, 130 | "metadata": {}, 131 | "output_type": "execute_result" 132 | } 133 | ], 134 | "source": [ 135 | "# Inspecting the content body of the response (as a regular 'string')\n", 136 | "response.text" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 8, 142 | "metadata": {}, 143 | "outputs": [ 144 | { 145 | "data": { 146 | "text/plain": [ 147 | "b'{\"rates\":{\"CAD\":1.5613,\"HKD\":8.9041,\"ISK\":145.0,\"PHP\":58.013,\"DKK\":7.4695,\"HUF\":336.25,\"CZK\":25.504,\"AUD\":1.733,\"RON\":4.8175,\"SEK\":10.7203,\"IDR\":16488.05,\"INR\":84.96,\"BRL\":5.4418,\"RUB\":85.1553,\"HRK\":7.55,\"JPY\":117.12,\"THB\":36.081,\"CHF\":1.0594,\"SGD\":1.5841,\"PLN\":4.3132,\"BGN\":1.9558,\"TRY\":7.0002,\"CNY\":7.96,\"NOK\":10.89,\"NZD\":1.8021,\"ZAR\":18.2898,\"USD\":1.1456,\"MXN\":24.3268,\"ILS\":4.0275,\"GBP\":0.87383,\"KRW\":1374.71,\"MYR\":4.8304},\"base\":\"EUR\",\"date\":\"2020-03-09\"}'" 148 | ] 149 | }, 150 | "execution_count": 8, 151 | "metadata": {}, 152 | "output_type": "execute_result" 153 | } 154 | ], 155 | "source": [ 156 | "# Inspecting the content of the response (in 'bytes' format)\n", 157 | "response.content" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": 9, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [ 166 | "# The data is presented in JSON format" 167 | ] 168 | }, 169 | { 170 | "cell_type": "markdown", 171 | "metadata": {}, 172 | "source": [ 173 | "### Handling the JSON" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | 
"execution_count": 10, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "data": { 183 | "text/plain": [ 184 | "{'rates': {'CAD': 1.5613,\n", 185 | " 'HKD': 8.9041,\n", 186 | " 'ISK': 145.0,\n", 187 | " 'PHP': 58.013,\n", 188 | " 'DKK': 7.4695,\n", 189 | " 'HUF': 336.25,\n", 190 | " 'CZK': 25.504,\n", 191 | " 'AUD': 1.733,\n", 192 | " 'RON': 4.8175,\n", 193 | " 'SEK': 10.7203,\n", 194 | " 'IDR': 16488.05,\n", 195 | " 'INR': 84.96,\n", 196 | " 'BRL': 5.4418,\n", 197 | " 'RUB': 85.1553,\n", 198 | " 'HRK': 7.55,\n", 199 | " 'JPY': 117.12,\n", 200 | " 'THB': 36.081,\n", 201 | " 'CHF': 1.0594,\n", 202 | " 'SGD': 1.5841,\n", 203 | " 'PLN': 4.3132,\n", 204 | " 'BGN': 1.9558,\n", 205 | " 'TRY': 7.0002,\n", 206 | " 'CNY': 7.96,\n", 207 | " 'NOK': 10.89,\n", 208 | " 'NZD': 1.8021,\n", 209 | " 'ZAR': 18.2898,\n", 210 | " 'USD': 1.1456,\n", 211 | " 'MXN': 24.3268,\n", 212 | " 'ILS': 4.0275,\n", 213 | " 'GBP': 0.87383,\n", 214 | " 'KRW': 1374.71,\n", 215 | " 'MYR': 4.8304},\n", 216 | " 'base': 'EUR',\n", 217 | " 'date': '2020-03-09'}" 218 | ] 219 | }, 220 | "execution_count": 10, 221 | "metadata": {}, 222 | "output_type": "execute_result" 223 | } 224 | ], 225 | "source": [ 226 | "# Requests has in-build method to directly convert the response to JSON format\n", 227 | "response.json()" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 11, 233 | "metadata": {}, 234 | "outputs": [ 235 | { 236 | "data": { 237 | "text/plain": [ 238 | "dict" 239 | ] 240 | }, 241 | "execution_count": 11, 242 | "metadata": {}, 243 | "output_type": "execute_result" 244 | } 245 | ], 246 | "source": [ 247 | "# In Python, this JSON is stored as a dictionary\n", 248 | "type(response.json())" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 12, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "# A useful library for JSON manipulation and pretty print\n", 258 | "import json\n", 259 | "\n", 260 | "# It has two main methods:\n", 261 | "# .loads(), which creates a Python dictionary from a JSON format string (just as response.json() does)\n", 262 | "# .dumps(), which creates a JSON format string out of a Python dictionary " 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 13, 268 | "metadata": {}, 269 | "outputs": [ 270 | { 271 | "data": { 272 | "text/plain": [ 273 | "'{\\n \"rates\": {\\n \"CAD\": 1.5613,\\n \"HKD\": 8.9041,\\n \"ISK\": 145.0,\\n \"PHP\": 58.013,\\n \"DKK\": 7.4695,\\n \"HUF\": 336.25,\\n \"CZK\": 25.504,\\n \"AUD\": 1.733,\\n \"RON\": 4.8175,\\n \"SEK\": 10.7203,\\n \"IDR\": 16488.05,\\n \"INR\": 84.96,\\n \"BRL\": 5.4418,\\n \"RUB\": 85.1553,\\n \"HRK\": 7.55,\\n \"JPY\": 117.12,\\n \"THB\": 36.081,\\n \"CHF\": 1.0594,\\n \"SGD\": 1.5841,\\n \"PLN\": 4.3132,\\n \"BGN\": 1.9558,\\n \"TRY\": 7.0002,\\n \"CNY\": 7.96,\\n \"NOK\": 10.89,\\n \"NZD\": 1.8021,\\n \"ZAR\": 18.2898,\\n \"USD\": 1.1456,\\n \"MXN\": 24.3268,\\n \"ILS\": 4.0275,\\n \"GBP\": 0.87383,\\n \"KRW\": 1374.71,\\n \"MYR\": 4.8304\\n },\\n \"base\": \"EUR\",\\n \"date\": \"2020-03-09\"\\n}'" 274 | ] 275 | }, 276 | "execution_count": 13, 277 | "metadata": {}, 278 | "output_type": "execute_result" 279 | } 280 | ], 281 | "source": [ 282 | "# .dumps() has options to make the string 'prettier', more readable\n", 283 | "# We can choose the number of spaces to be used as indentation\n", 284 | "json.dumps(response.json(), indent=4)" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": 14, 290 | "metadata": {}, 291 | "outputs": [ 292 | 
{ 293 | "name": "stdout", 294 | "output_type": "stream", 295 | "text": [ 296 | "{\n", 297 | " \"rates\": {\n", 298 | " \"CAD\": 1.5613,\n", 299 | " \"HKD\": 8.9041,\n", 300 | " \"ISK\": 145.0,\n", 301 | " \"PHP\": 58.013,\n", 302 | " \"DKK\": 7.4695,\n", 303 | " \"HUF\": 336.25,\n", 304 | " \"CZK\": 25.504,\n", 305 | " \"AUD\": 1.733,\n", 306 | " \"RON\": 4.8175,\n", 307 | " \"SEK\": 10.7203,\n", 308 | " \"IDR\": 16488.05,\n", 309 | " \"INR\": 84.96,\n", 310 | " \"BRL\": 5.4418,\n", 311 | " \"RUB\": 85.1553,\n", 312 | " \"HRK\": 7.55,\n", 313 | " \"JPY\": 117.12,\n", 314 | " \"THB\": 36.081,\n", 315 | " \"CHF\": 1.0594,\n", 316 | " \"SGD\": 1.5841,\n", 317 | " \"PLN\": 4.3132,\n", 318 | " \"BGN\": 1.9558,\n", 319 | " \"TRY\": 7.0002,\n", 320 | " \"CNY\": 7.96,\n", 321 | " \"NOK\": 10.89,\n", 322 | " \"NZD\": 1.8021,\n", 323 | " \"ZAR\": 18.2898,\n", 324 | " \"USD\": 1.1456,\n", 325 | " \"MXN\": 24.3268,\n", 326 | " \"ILS\": 4.0275,\n", 327 | " \"GBP\": 0.87383,\n", 328 | " \"KRW\": 1374.71,\n", 329 | " \"MYR\": 4.8304\n", 330 | " },\n", 331 | " \"base\": \"EUR\",\n", 332 | " \"date\": \"2020-03-09\"\n", 333 | "}\n" 334 | ] 335 | } 336 | ], 337 | "source": [ 338 | "# In order to visualize these changes, we need to print the string\n", 339 | "print(json.dumps(response.json(), indent=4))" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 15, 345 | "metadata": {}, 346 | "outputs": [ 347 | { 348 | "data": { 349 | "text/plain": [ 350 | "dict_keys(['rates', 'base', 'date'])" 351 | ] 352 | }, 353 | "execution_count": 15, 354 | "metadata": {}, 355 | "output_type": "execute_result" 356 | } 357 | ], 358 | "source": [ 359 | "# It contains 3 keys; the value for the 'rates' key is another dictionary\n", 360 | "response.json().keys()" 361 | ] 362 | }, 363 | { 364 | "cell_type": "markdown", 365 | "metadata": {}, 366 | "source": [ 367 | "### Incorporating parameters in the GET request" 368 | ] 369 | }, 370 | { 371 | "cell_type": "code", 372 | "execution_count": 16, 373 | "metadata": {}, 374 | "outputs": [ 375 | { 376 | "data": { 377 | "text/plain": [ 378 | "'https://api.exchangeratesapi.io/latest?symbols=USD,GBP'" 379 | ] 380 | }, 381 | "execution_count": 16, 382 | "metadata": {}, 383 | "output_type": "execute_result" 384 | } 385 | ], 386 | "source": [ 387 | "# Request parameters are added to the URL after a question mark '?'\n", 388 | "# In this case, we request for the exchange rates of the US Dollar (USD) and Pound Sterling (GBP) only\n", 389 | "param_url = base_url + \"?symbols=USD,GBP\"\n", 390 | "param_url" 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": 17, 396 | "metadata": {}, 397 | "outputs": [ 398 | { 399 | "data": { 400 | "text/plain": [ 401 | "200" 402 | ] 403 | }, 404 | "execution_count": 17, 405 | "metadata": {}, 406 | "output_type": "execute_result" 407 | } 408 | ], 409 | "source": [ 410 | "# Making a request to the server with the new URL, containing the parameters\n", 411 | "response = requests.get(param_url)\n", 412 | "response.status_code" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": 18, 418 | "metadata": {}, 419 | "outputs": [ 420 | { 421 | "data": { 422 | "text/plain": [ 423 | "{'rates': {'USD': 1.1456, 'GBP': 0.87383}, 'base': 'EUR', 'date': '2020-03-09'}" 424 | ] 425 | }, 426 | "execution_count": 18, 427 | "metadata": {}, 428 | "output_type": "execute_result" 429 | } 430 | ], 431 | "source": [ 432 | "# Saving the response data\n", 433 | "data = response.json()\n", 434 | "data" 435 | ] 436 | }, 
437 | { 438 | "cell_type": "code", 439 | "execution_count": 19, 440 | "metadata": {}, 441 | "outputs": [ 442 | { 443 | "data": { 444 | "text/plain": [ 445 | "'EUR'" 446 | ] 447 | }, 448 | "execution_count": 19, 449 | "metadata": {}, 450 | "output_type": "execute_result" 451 | } 452 | ], 453 | "source": [ 454 | "# 'data' is a dictionary\n", 455 | "data['base']" 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": 20, 461 | "metadata": {}, 462 | "outputs": [ 463 | { 464 | "data": { 465 | "text/plain": [ 466 | "'2020-03-09'" 467 | ] 468 | }, 469 | "execution_count": 20, 470 | "metadata": {}, 471 | "output_type": "execute_result" 472 | } 473 | ], 474 | "source": [ 475 | "data['date']" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": 21, 481 | "metadata": {}, 482 | "outputs": [ 483 | { 484 | "data": { 485 | "text/plain": [ 486 | "{'USD': 1.1456, 'GBP': 0.87383}" 487 | ] 488 | }, 489 | "execution_count": 21, 490 | "metadata": {}, 491 | "output_type": "execute_result" 492 | } 493 | ], 494 | "source": [ 495 | "data['rates']" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 22, 501 | "metadata": {}, 502 | "outputs": [], 503 | "source": [ 504 | "# As per the documentation of this API, we can change the base with the parameter 'base'\n", 505 | "param_url = base_url + \"?symbols=GBP&base=USD\"" 506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "execution_count": 23, 511 | "metadata": {}, 512 | "outputs": [ 513 | { 514 | "data": { 515 | "text/plain": [ 516 | "{'rates': {'GBP': 0.7627706006}, 'base': 'USD', 'date': '2020-03-09'}" 517 | ] 518 | }, 519 | "execution_count": 23, 520 | "metadata": {}, 521 | "output_type": "execute_result" 522 | } 523 | ], 524 | "source": [ 525 | "# Sending a request and saving the response JSON, all at once\n", 526 | "data = requests.get(param_url).json()\n", 527 | "data" 528 | ] 529 | }, 530 | { 531 | "cell_type": "code", 532 | "execution_count": 24, 533 | "metadata": {}, 534 | "outputs": [ 535 | { 536 | "data": { 537 | "text/plain": [ 538 | "0.7627706006" 539 | ] 540 | }, 541 | "execution_count": 24, 542 | "metadata": {}, 543 | "output_type": "execute_result" 544 | } 545 | ], 546 | "source": [ 547 | "usd_to_gbp = data['rates']['GBP']\n", 548 | "usd_to_gbp" 549 | ] 550 | } 551 | ], 552 | "metadata": { 553 | "kernelspec": { 554 | "display_name": "Python 3", 555 | "language": "python", 556 | "name": "python3" 557 | }, 558 | "language_info": { 559 | "codemirror_mode": { 560 | "name": "ipython", 561 | "version": 3 562 | }, 563 | "file_extension": ".py", 564 | "mimetype": "text/x-python", 565 | "name": "python", 566 | "nbconvert_exporter": "python", 567 | "pygments_lexer": "ipython3", 568 | "version": "3.7.3" 569 | } 570 | }, 571 | "nbformat": 4, 572 | "nbformat_minor": 2 573 | } 574 | -------------------------------------------------------------------------------- /03.Working with APIs/Currency Exchange API/additional_API_functionalities.py: -------------------------------------------------------------------------------- 1 | #Obtaining Historical Exchange Rates 2 | import requests 3 | import json 4 | 5 | base_url = "https://api.exchangeratesapi.io" 6 | 7 | historical_date_url = base_url + "/2020-04-12" 8 | 9 | response = requests.get(historical_date_url) 10 | data = response.json() 11 | 12 | # data = {'rates': {'CAD': 1.5265, 'HKD': 8.4259, 'ISK': 155.9, 'PHP': 54.939, 'DKK': 7.4657, 'HUF': 354.76, 'CZK': 26.909, 'AUD': 1.7444, 'RON': 4.833, 'SEK': 10.9455, 'IDR': 17243.21, 'INR': 82.9275, 
'BRL': 5.5956, 'RUB': 80.69, 'HRK': 7.6175, 'JPY': 118.33, 'THB': 35.665, 'CHF': 1.0558, 'SGD': 1.5479, 'PLN': 4.5586, 'BGN': 1.9558, 'TRY': 7.3233, 'CNY': 7.6709, 'NOK': 11.2143, 'NZD': 1.8128, 'ZAR': 19.6383, 'USD': 1.0867, 'MXN': 26.0321, 'ILS': 3.8919, 'GBP': 0.87565, 'KRW': 1322.49, 'MYR': 4.7136}, 'base': 'EUR', 'date': '2020-04-09'} 13 | 14 | print(json.dumps(data, indent=4, sort_keys=True)) 15 | 16 | # Invalid URL 17 | invalid_url = base_url + "/2019-12-01" + "?symbols=USB" 18 | response = requests.get(invalid_url) 19 | 20 | print(response.status_code) 21 | print(response.json()) 22 | # 400 for bad request 23 | #invalid response = {'error': "Symbols 'USB' are invalid for date 2019-12-01."} -------------------------------------------------------------------------------- /03.Working with APIs/Currency Exchange API/currency_converter.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | 4 | base_url = "https://api.exchangeratesapi.io/" 5 | 6 | print("***** Welcome to Currency Converter *****") 7 | date = input("Please enter the date (in the format 'yyyy-mm-dd') OR type 'latest' : ") 8 | base_currency = input("Currency converted from (example: 'USD') : ") 9 | to_currency = input("Currency converted to (example: 'JPY') : ") 10 | amount = input(f"How much {base_currency} do you want to convert? : ") 11 | 12 | if date and base_currency and to_currency and amount: 13 | 14 | param_url = base_url + date + "?symbols=" + base_currency + "," + to_currency 15 | 16 | if date == 'latest': 17 | param_url = base_url + "latest?symbols=" + base_currency + "," + to_currency 18 | 19 | response = requests.get(param_url) 20 | 21 | if not response.ok: 22 | print(f"Oops! Seems like there was an error {response.status_code}. Please try again.") 23 | print(f"{response.json()['error']}") 24 | 25 | else: 26 | data = response.json() 27 | 28 | #testing 29 | # base_currency = 'USD' 30 | # to_currency = 'JPY' 31 | # amount = 100 32 | # data = {'rates': {'JPY': 117.55, 'USD': 1.0936}, 'base': 'EUR', 'date': '2020-04-01'} 33 | 34 | converted_amount = (float(amount) / float(data['rates'][base_currency])) * float(data['rates'][to_currency]) 35 | converted_amount = round(converted_amount, 2) 36 | 37 | print(f"The amount equivalent to {base_currency} {amount} is {to_currency} {converted_amount}") 38 | 39 | else: 40 | print("You have provided invalid information. Please try again.") 41 | -------------------------------------------------------------------------------- /03.Working with APIs/Currency Exchange API/exchange_rate_API.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | 4 | base_url = 'https://api.exchangeratesapi.io/latest' 5 | 6 | #request to the API 7 | response = requests.get(base_url) 8 | 9 | #investigating response 10 | print(response.ok) 11 | print(response.status_code) 12 | print(response.text) 13 | 14 | #handling JSON 15 | json_response = response.json() 16 | 17 | #to avoid calling the API so many times, just for testing purposes 18 | # json_response = {'rates': {'CAD': 1.5265, 'HKD': 8.4259, 'ISK': 155.9, 'PHP': 54.939, 'DKK': 7.4657, 'HUF': 354.76, 'CZK': 26.909, 'AUD': 1.7444, 'RON': 4.833, 'SEK': 10.9455, 'IDR': 17243.21, 'INR': 82.9275, 'BRL': 5.5956, 'RUB': 80.69, 'HRK': 7.6175, 'JPY': 118.33, 'THB': 35.665, 'CHF': 1.0558, 'SGD': 1.5479, 'PLN': 4.5586, 'BGN': 1.9558, 'TRY': 7.3233, 'CNY': 7.6709, 'NOK': 11.2143, 'NZD': 1.8128, 'ZAR': 19.6383, 'USD': 1.0867, 'MXN': 26.0321, 'ILS': 3.8919, 'GBP': 0.87565, 'KRW': 1322.49, 'MYR': 4.7136}, 'base': 'EUR', 'date': '2020-04-09'} 19 | 20 | #Python built-in package json 21 | #loads(string): converts a JSON formatted string to a Python object 22 | #dumps(obj): converts a Python object to a regular string, with options to make the string prettier 23 | print(json.dumps(json_response, indent=4)) -------------------------------------------------------------------------------- /03.Working with APIs/Currency Exchange API/exchange_rate_API_with_paremeters.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | 4 | base_url = 'https://api.exchangeratesapi.io/latest' 5 | 6 | param_url = base_url + '?symbols=USD,GBP' 7 | 8 | response = requests.get(param_url) 9 | data = response.json() 10 | 11 | # data = {'rates': {'USD': 1.0867, 'GBP': 0.87565}, 'base': 'EUR', 'date': '2020-04-09'} 12 | 13 | print(type(data)) 14 | print(data) 15 | 16 | rates = data['rates']['USD'] 17 | print(rates) 18 | -------------------------------------------------------------------------------- /03.Working with APIs/EDAMAM API/EDAMAM_API.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | import pandas as pd 4 | 5 | api_endpoint = "https://api.edamam.com/api/nutrition-details" 6 | 7 | app_id = "d5df2415" 8 | app_key = "b87fbe096f386ba8d6b2ad10dcc672d5" 9 | url = api_endpoint + "?app_id=" + app_id + "&app_key=" + app_key 10 | 11 | #Preparing POST request 12 | headers = { 13 | "Content-Type": "application/json" 14 | } 15 | 16 | recipe = { 17 | "title" : "roasted chicken", 18 | "ingr" : ["1 (5 to 6 pound) roasting chicken", "Kosher salt", "Freshly ground black pepper"] 19 | } 20 | 21 | #Sending POST request 22 | response = requests.post(url, headers=headers, json=recipe) 23 | print(response.status_code) 24 | 25 | info = response.json() 26 | print(info.keys()) 27 | 28 | # data frame using pandas 29 | nutrients = pd.DataFrame(info['totalNutrients']).transpose() 30 | print(nutrients) 31 | 32 | # export to csv 33 | nutrients.to_csv("RoastedChicken_nutrients.csv") -------------------------------------------------------------------------------- /03.Working with APIs/EDAMAM API/RoastedChicken_nutrients.csv: -------------------------------------------------------------------------------- 1 | ,label,quantity,unit 2 | 
ENERC_KCAL,Energy,3897.8847966250505,kcal 3 | FAT,Fat,281.7973896498531,g 4 | FASAT,Saturated,80.41792651629662,g 5 | FATRN,Trans,0.0,g 6 | FAMS,Monounsaturated,116.42828684428095,g 7 | FAPU,Polyunsaturated,60.71976612838291,g 8 | CHOCDF,Carbs,6.425249319142501,g 9 | FIBTG,Fiber,1.8935213485650002,g 10 | SUGAR,Sugars,0.047899354272000004,g 11 | PROCNT,Protein,312.01614425200455,g 12 | CHOLE,Cholesterol,1566.2090943730002,mg 13 | NA,Sodium,5818.914444977497,mg 14 | CA,Calcium,218.09684626793904,mg 15 | MG,Magnesium,358.9387221502079,mg 16 | K,Potassium,3669.9071911427136,mg 17 | FE,Iron,25.715630535762603,mg 18 | ZN,Zinc,23.411849338505288,mg 19 | P,Phosphorus,3016.7612062434005,mg 20 | VITA_RAE,Vitamin A,4609.589368849851,µg 21 | VITC,Vitamin C,43.7081607732,mg 22 | THIA,Thiamin (B1),1.0825753017079,mg 23 | RIBF,Riboflavin (B2),3.1276781484795007,mg 24 | NIA,Niacin (B3),117.13235745691864,mg 25 | VITB6A,Vitamin B6,5.84953400740555,mg 26 | FOLDFE,Folate equivalent (total),474.77740164085003,µg 27 | FOLFD,Folate (food),474.77740164085003,µg 28 | FOLAC,Folic acid,0.0,µg 29 | VITB12,Vitamin B12,18.029616318945003,µg 30 | VITD,Vitamin D,0.0,µg 31 | TOCPHA,Vitamin E,0.07783645069200001,mg 32 | VITK1,Vitamin K,12.251756709885,µg 33 | WATER,Water,1202.5662619386048,g 34 | -------------------------------------------------------------------------------- /03.Working with APIs/EDAMAM API/Section 3 - Downloading files with requests.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Downloading Files with Requests" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "# The requests package can also be used to download files from the web.\n", 17 | "import requests" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "## Naive downloading" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": 2, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "# One way to 'download' a file is to send a request to it.\n", 34 | "# Then, export the content of the response to a local file" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 3, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [ 43 | "# Let's use an image from wikipedia for this purpose\n", 44 | "file_url = \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Collage_of_Nine_Dogs.jpg/1024px-Collage_of_Nine_Dogs.jpg\"" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 4, 50 | "metadata": {}, 51 | "outputs": [ 52 | { 53 | "data": { 54 | "text/plain": [ 55 | "200" 56 | ] 57 | }, 58 | "execution_count": 4, 59 | "metadata": {}, 60 | "output_type": "execute_result" 61 | } 62 | ], 63 | "source": [ 64 | "response = requests.get(file_url)\n", 65 | "response.status_code" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 5, 71 | "metadata": {}, 72 | "outputs": [ 73 | { 74 | "data": { 75 | "text/plain": [ 76 | "b'\\xff\\xd8\\xff\\xe0\\x00\\x10JFIF\\x00\\x01\\x01\\x01\\x00H\\x00H\\x00\\x00\\xff\\xfe\\x00OFile source: https://commons.wikimedia.org/wiki/File:Collage_of_Nine_Dogs.jpg\\xff\\xe2\\x02\\x1cICC_PROFILE\\x00\\x01\\x01\\x00\\x00\\x02\\x0clcms\\x02\\x10\\x00\\x00mntrRGB XYZ 
\\x07\\xdc\\x00\\x01\\x00\\x19\\x00\\x03\\x00)\\x009acspAPPL\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xf6\\xd6\\x00\\x01\\x00\\x00\\x00\\x00\\xd3-lcms\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\ndesc\\x00\\x00\\x00\\xfc\\x00\\x00\\x00^cprt\\x00\\x00\\x01\\\\\\x00\\x00\\x00\\x0bwtpt\\x00\\x00\\x01h\\x00\\x00\\x00\\x14bkpt\\x00\\x00\\x01|\\x00\\x00\\x00\\x14rXYZ\\x00\\x00\\x01\\x90\\x00\\x00\\x00\\x14gXYZ\\x00\\x00\\x01\\xa4\\x00\\x00\\x00\\x14bXYZ\\x00\\x00\\x01\\xb8\\x00\\x00\\x00\\x14rTRC\\x00\\x00\\x01\\xcc\\x00\\x00\\x00@gTRC\\x00\\x00\\x01\\xcc\\x00\\x00\\x00@bTRC\\x00\\x00\\x01\\xcc\\x00\\x00\\x00@desc\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x03c2\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00text\\x00\\x00\\x00\\x00FB\\x00\\x00XYZ \\x00\\x00\\x00\\x00\\x00\\x00\\xf6\\xd6\\x00\\x01\\x00\\x00\\x00\\x00\\xd3-X'" 77 | ] 78 | }, 79 | "execution_count": 5, 80 | "metadata": {}, 81 | "output_type": "execute_result" 82 | } 83 | ], 84 | "source": [ 85 | "# Printing out the begining of the content of the response\n", 86 | "# It is in a binary-encoded format, thus it looks like gibberish\n", 87 | "response.content[:500]" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 6, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "# We need to export this to an image file (jpg, png, gif...)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "### Writing to a file" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 7, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "# We open/create a file with the function 'open()'\n", 113 | "file = open(\"dog_image.jpg\", \"wb\")\n", 114 | "\n", 115 | "# Then, write to it\n", 116 | "file.write(response.content)\n", 117 | "\n", 118 | "# And close the file after finishing\n", 119 | "file.close()" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 8, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "# The two parameters in the function open() are:\n", 129 | "# - the name of the file (along with a path to it if it is not in the same directory as our program)\n", 130 | "# - the mode in wich we want to edit the file\n", 131 | "\n", 132 | "# Some popular modes are:\n", 133 | "# - 'r' : Opens the file in read-only mode;\n", 134 | "# - 'rb' : Opens the file as read-only in binary format;\n", 135 | "# - 'w' : Creates a file in write-only mode. 
If the file already exists, it will overwrite it;\n", 136 | "# - 'wb': Write-only mode in binary format;\n", 137 | "# - 'a' : Opens the file for appending new information to the end;\n", 138 | "# - 'w+' : Opens the file for writing and reading;\n", 139 | "\n", 140 | "# We have used 'wb' in this example, since we want to export the data to a file (thus, write to it)\n", 141 | "# and response.content is in bytes\n", 142 | "\n", 143 | "# Never forget to close the file!" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 9, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "# To ensure the file will always be closed, use the 'with' statement\n", 153 | "# This automatically calls file.close() at the end" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 10, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "with open(\"dog_image_2.jpg\", \"wb\") as file:\n", 163 | " file.write(response.content)" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 11, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "# Here, we first receive the whole file and store it in the RAM, then export it to the hard disk\n", 180 | "# This method is really inefficient, especially for bigger files\n", 181 | "# In effect we download the file to the RAM\n", 182 | "\n", 183 | "# We can fix that with a couple of small changes to our code" 184 | ] 185 | }, 186 | { 187 | "cell_type": "markdown", 188 | "metadata": {}, 189 | "source": [ 190 | "## Streaming the download to a file" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 12, 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "# Instead of reading the whole response immidiatelly, \n", 200 | "# we can signal the program to only read part of the response when we tell it to.\n", 201 | "\n", 202 | "# This is achieved with the 'stream' parameter" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 13, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "# I will use test video files provided by file-examples.com\n", 212 | "url = \"https://file-examples.com/wp-content/uploads/2017/04/file_example_MP4_480_1_5MG.mp4\"" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 14, 218 | "metadata": {}, 219 | "outputs": [], 220 | "source": [ 221 | "r = requests.get(url, stream = True)\n", 222 | "\n", 223 | "with open(\"Sample_video_1,5_MB.mp4\", \"wb\") as f:\n", 224 | " \n", 225 | " # Now we iterate over the response in chunks\n", 226 | " for chunk in r.iter_content(chunk_size = 16*1024):\n", 227 | " f.write(chunk)" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": 15, 233 | "metadata": {}, 234 | "outputs": [], 235 | "source": [ 236 | "# You can change the chunk size to optimize the fastest download speed for your system" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 16, 242 | "metadata": {}, 243 | "outputs": [], 244 | "source": [ 245 | "# However, when using 'stream=True' requests will not close the connection to the server until all data has been read\n", 246 | "# Thus, sometimes the connection needs to be closed manually\n", 247 | "\n", 248 | "# Again, that is best done using the 'with' statement" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | 
"execution_count": 17, 254 | "metadata": {}, 255 | "outputs": [], 256 | "source": [ 257 | "# So, the final code for file download is\n", 258 | "url = \"https://file-examples.com/wp-content/uploads/2017/04/file_example_MP4_1920_18MG.mp4\"\n", 259 | "\n", 260 | "with requests.get(url, stream = True) as r:\n", 261 | " with open(\"Sample_video_18_MB.mp4\", \"wb\") as f:\n", 262 | " for chunk in r.iter_content(chunk_size = 16*1024):\n", 263 | " f.write(chunk)\n" 264 | ] 265 | } 266 | ], 267 | "metadata": { 268 | "kernelspec": { 269 | "display_name": "Python 3", 270 | "language": "python", 271 | "name": "python3" 272 | }, 273 | "language_info": { 274 | "codemirror_mode": { 275 | "name": "ipython", 276 | "version": 3 277 | }, 278 | "file_extension": ".py", 279 | "mimetype": "text/x-python", 280 | "name": "python", 281 | "nbconvert_exporter": "python", 282 | "pygments_lexer": "ipython3", 283 | "version": "3.7.3" 284 | } 285 | }, 286 | "nbformat": 4, 287 | "nbformat_minor": 2 288 | } 289 | -------------------------------------------------------------------------------- /03.Working with APIs/EDAMAM API/Section 3 - EDAMAM API - Initial setup and registration.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# API requiring registration - POST request" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Registering to the API" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "# We will use a nutritional analysis API\n", 24 | "# It requires registration (we need an API key to validate ourselves)\n", 25 | "# Many APIs require this kind of registration" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "# You can sign-up for the Developer (Free) edition here: \n", 35 | "# https://developer.edamam.com/edamam-nutrition-api\n", 36 | "\n", 37 | "# API documentation: \n", 38 | "# https://developer.edamam.com/edamam-docs-nutrition-api" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "### Initial Setup" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 3, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "# loading the packages\n", 55 | "import requests\n", 56 | "import json" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 4, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "# Store the ID and Key in variables\n", 66 | "\n", 67 | "#APP_ID = \"your_API_ID_here\"\n", 68 | "#APP_KEY = \"your_API_key_here\"\n", 69 | "\n", 70 | "# Note: Those are not real ID and Key,\n", 71 | "# Replace the string with your own ones that you recieved upon registration" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 5, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "# Setting up the request URL\n", 81 | "api_endpoint = \"https://api.edamam.com/api/nutrition-details\"\n", 82 | "\n", 83 | "url = api_endpoint + \"?app_id=\" + APP_ID + \"&app_key=\" + APP_KEY" 84 | ] 85 | } 86 | ], 87 | "metadata": { 88 | "kernelspec": { 89 | "display_name": "Python 3", 90 | "language": "python", 91 | "name": "python3" 92 | }, 93 | "language_info": { 94 | "codemirror_mode": { 95 | "name": "ipython", 96 | "version": 3 97 | }, 98 | 
"file_extension": ".py", 99 | "mimetype": "text/x-python", 100 | "name": "python", 101 | "nbconvert_exporter": "python", 102 | "pygments_lexer": "ipython3", 103 | "version": "3.7.3" 104 | } 105 | }, 106 | "nbformat": 4, 107 | "nbformat_minor": 2 108 | } 109 | -------------------------------------------------------------------------------- /03.Working with APIs/GitHub API/github_API.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | 4 | base_url = "https://jobs.github.com/positions.json" 5 | 6 | #Extracting results from multiple pages 7 | results = [] 8 | 9 | for index in range(10): 10 | response = requests.get(base_url, params= {"description":"python", "location":"new york","page": index+1}) 11 | 12 | print(response.url) 13 | # print(response.json()) 14 | if len(response.json()) == 0: 15 | break 16 | 17 | results.extend(response.json()) 18 | 19 | print(len(results)) 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | data = response.json() 29 | data = json.dumps(data, indent=4) 30 | 31 | -------------------------------------------------------------------------------- /03.Working with APIs/iTune API/iTunes_API.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | 4 | base_site = "https://itunes.apple.com/search" 5 | 6 | response = requests.get(base_site,params={"term":"fifth harmony", "country":"us","limit": 200}) 7 | 8 | print(response.url) 9 | print(response.status_code) 10 | 11 | info = response.json() 12 | print(json.dumps(info, indent=4)) 13 | 14 | #name and release dates of the songs 15 | for result in info['results']: 16 | print(result['trackName']) 17 | print(result['releaseDate']) 18 | 19 | -------------------------------------------------------------------------------- /03.Working with APIs/iTune API/iTunes_API_structuring_exporting.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | import pandas as pd 4 | 5 | base_site = "https://itunes.apple.com/search" 6 | 7 | response = requests.get(base_site,params={"term":"fifth harmony", "country":"us","limit": 200}) 8 | 9 | info = response.json() 10 | 11 | #dataframe with pandas 12 | songs_df = pd.DataFrame(info['results']) 13 | print(songs_df) 14 | 15 | #export to csv or excel 16 | songs_df.to_csv('songs_info.csv') 17 | 18 | songs_df.to_excel('songs_info.xlsx') 19 | 20 | 21 | 22 | 23 | -------------------------------------------------------------------------------- /03.Working with APIs/iTune API/songs_info.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ptyadana/Web-Scraping-and-API-in-Python/9595bc418866642143eaf4a1f700dd646d81d427/03.Working with APIs/iTune API/songs_info.xlsx -------------------------------------------------------------------------------- /04.HTML Overview/Section 4 - CSS and JavaScript.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | CSS and JavaScript 6 | 7 | 8 | 9 | 10 | 11 |

12 | Come to the dark side, we have cookies! 13 |

14 | 15 | 18 | 19 | 20 | 23 | 24 | 25 | 26 | 27 | -------------------------------------------------------------------------------- /04.HTML Overview/Section 4 - CSS style tag.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 21 | 22 | 23 | 24 |

This is a heading

25 |

This is a paragraph.

26 |

I am different

27 | 28 | -------------------------------------------------------------------------------- /04.HTML Overview/Section 4 - Character encoding - Euro sign.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | Character encoding in HTML 6 | 7 | 8 | 9 | 10 |

This is the Euro sign: € (method 1)

11 |

This is the Euro sign: € (method 2)

12 |

This is the Euro sign: € (method 3)

13 | 14 | 15 | 16 | 17 | -------------------------------------------------------------------------------- /04.HTML Overview/Section 4 - My First Webpage.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | My First Webpage 6 | 7 | 8 | 9 |

This is not the web page you are looking for. Move along, move along!

10 | 11 | 12 | 13 | Click here for high-quality music. 14 | 15 | 16 | 17 | Click here for high-quality music in a new tab. 18 | 19 | 20 | 21 | 22 | 23 | -------------------------------------------------------------------------------- /05.Web Scraping with Beautiful Soup/Section 5 - Practical example - Exercise Setup.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Importing the packages" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": null, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "# Load the packages\n", 17 | "import requests\n", 18 | "from bs4 import BeautifulSoup" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "### Making a get request" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "# Defining the url of the site\n", 35 | "base_site = \"https://en.wikipedia.org/wiki/Music\"\n", 36 | "\n", 37 | "# Making a get request\n", 38 | "response = requests.get(base_site)\n", 39 | "response" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "# Extracting the HTML\n", 49 | "html = response.content" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "metadata": {}, 55 | "source": [ 56 | "### Making the soup" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "# Convert HTML to a BeautifulSoup object. This will allow us to parse out content from the HTML more easily.\n", 66 | "# Using the default parser as it is included in Python\n", 67 | "soup = BeautifulSoup(html, \"html.parser\")" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "### 1. Extract all existing titles of links" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": { 81 | "scrolled": true 82 | }, 83 | "outputs": [], 84 | "source": [ 85 | "# Find all links on the page \n", 86 | "links = soup.find_all('a')\n", 87 | "links" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "# Dropping the links without 'href' attribute" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "# Getting all titles" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "# Removing the 'None' titles" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "### 2. Extract all heading 2 strings." 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "# Inspect all h2 tags" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "# Get the text" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "metadata": {}, 145 | "source": [ 146 | "### 3. Print the whole footer text." 
147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": null, 152 | "metadata": { 153 | "scrolled": true 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "# By inspection: we see that the footer is contained inside a ..." 158 | ] 159 | } 160 | ], 161 | "metadata": { 162 | "kernelspec": { 163 | "display_name": "Python 3", 164 | "language": "python", 165 | "name": "python3" 166 | }, 167 | "language_info": { 168 | "codemirror_mode": { 169 | "name": "ipython", 170 | "version": 3 171 | }, 172 | "file_extension": ".py", 173 | "mimetype": "text/x-python", 174 | "name": "python", 175 | "nbconvert_exporter": "python", 176 | "pygments_lexer": "ipython3", 177 | "version": "3.7.3" 178 | } 179 | }, 180 | "nbformat": 4, 181 | "nbformat_minor": 2 182 | } 183 | -------------------------------------------------------------------------------- /05.Web Scraping with Beautiful Soup/Section 5 - Setting up your first scraper.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Set-up and Workflow" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "### Importing the packages" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": 1, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "# Load the packages\n", 24 | "import requests\n", 25 | "from bs4 import BeautifulSoup" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "### Making a GET request" 33 | ] 34 | }, 35 | { 36 | "cell_type": "code", 37 | "execution_count": 2, 38 | "metadata": {}, 39 | "outputs": [ 40 | { 41 | "data": { 42 | "text/plain": [ 43 | "200" 44 | ] 45 | }, 46 | "execution_count": 2, 47 | "metadata": {}, 48 | "output_type": "execute_result" 49 | } 50 | ], 51 | "source": [ 52 | "# Defining the url of the site\n", 53 | "base_site = \"https://en.wikipedia.org/wiki/Music\"\n", 54 | "\n", 55 | "# Making a get request\n", 56 | "response = requests.get(base_site)\n", 57 | "response.status_code" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 3, 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "data": { 67 | "text/plain": [ 68 | "b'\\n\\n\\n\\n13 Assassins (2011) 95%,\n", 92 | "

Full Contact (1992) 88%,\n", 93 | "
Indiana Jones and the Last Crusade (1989) 88%,\n", 94 | "
Kung Fu Hustle (2005) 90%,\n", 95 | "
A Better Tomorrow (2010) 93%,\n", 96 | "
Iron Man (2008) 94%,\n", 97 | "
The Night Comes For Us (2018) 90%,\n", 98 | "
Logan (2017) 93%,\n", 99 | "
Goldfinger (1964) 97%,\n", 100 | "
Assault on Precinct 13 (1976) 98%,\n", 101 | "
Wonder Woman (2017) 93%,\n", 102 | "
Fist of Fury (Jing wu men) (1972) 92%,\n", 103 | "
Captain America: The Winter Soldier (2014) 90%,\n", 104 | "
Oldboy (2005) 82%,\n", 105 | "
The French Connection (1971) 98%,\n", 106 | "
Furious 7 (2015) 81%,\n", 107 | "
La Femme Nikita (Nikita) (1990) 88%,\n", 108 | "
Supercop (1996) 96%,\n", 109 | "
Dirty Harry (1971) 91%,\n", 110 | "
Live Die Repeat: Edge of Tomorrow (2014) 90%,\n", 111 | "
X2: X-Men United (2003) 85%,\n", 112 | "
The Fugitive (1993) 96%,\n", 113 | "
Black Panther (2018) 97%,\n", 114 | "
Inception (2010) 87%,\n", 115 | "
Braveheart (1995) 77%,\n", 116 | "
Minority Report (2002) 90%,\n", 117 | "
Avengers: Endgame (2019) 94%,\n", 118 | "
Dredd (2012) 79%,\n", 119 | "
The Bourne Identity (2002) 83%,\n", 120 | "
Ip Man (2010) 85%,\n", 121 | "
Face/Off (1997) 92%,\n", 122 | "
To Live and Die in L.A. (1985) 91%,\n", 123 | "
The Dark Knight (2008) 94%,\n", 124 | "
Mission: Impossible Ghost Protocol (2011) 93%,\n", 125 | "
Fast Five (2011) 77%,\n", 126 | "
Lethal Weapon (1987) 82%,\n", 127 | "
The Rock (1996) 66%,\n", 128 | "
RoboCop (1987) 89%,\n", 129 | "
John Wick: Chapter 2 (2017) 89%,\n", 130 | "
Casino Royale (2006) 95%,\n", 131 | "
Baby Driver (2017) 93%,\n", 132 | "
Fist of Legend (Jing wu ying xiong) (1994) 100%,\n", 133 | "
The Killer (1989) 98%,\n", 134 | "
The Raid 2 (2014) 80%,\n", 135 | "
Enter the Dragon (1973) 94%,\n", 136 | "
Commando (1985) 70%,\n", 137 | "
First Blood (1982) 87%,\n", 138 | "
Mission: Impossible Rogue Nation (2015) 93%,\n", 139 | "
The Terminator (1984) 100%,\n", 140 | "
Gladiator (2000) 76%,\n", 141 | "
Kill Bill: Volume 1 (2003) 85%,\n", 142 | "
Léon: The Professional (1994) 73%,\n", 143 | "
Speed (1994) 94%,\n", 144 | "
The Legend of Drunken Master (Jui kuen II) (Drunken Fist II) (1994) 83%,\n", 145 | "
John Wick (2014) 86%,\n", 146 | "
Crouching Tiger, Hidden Dragon (2001) 97%,\n", 147 | "
Predator (1987) 81%,\n", 148 | "
The Bourne Ultimatum (2007) 92%,\n", 149 | "
Total Recall (1990) 82%,\n", 150 | "
Mad Max 2: The Road Warrior (1982) 95%,\n", 151 | "
Heat (1995) 86%,\n", 152 | "
The Raid: Redemption (2012) 86%,\n", 153 | "
Mission: Impossible - Fallout (2018) 97%,\n", 154 | "
Raiders of the Lost Ark (1981) 95%,\n", 155 | "
Aliens (1986) 99%,\n", 156 | "
Lat sau san taam (Hard-Boiled) (1992) 94%,\n", 157 | "
The Matrix (1999) 88%,\n", 158 | "
Terminator 2: Judgment Day (1991) 93%,\n", 159 | "
Die Hard (1988) 93%,\n", 160 | "
Mad Max: Fury Road (2015) 97%
]" 161 | ] 162 | }, 163 | "execution_count": 7, 164 | "metadata": {}, 165 | "output_type": "execute_result" 166 | } 167 | ], 168 | "source": [ 169 | "# Extracting all 'h2' tags\n", 170 | "headings = [div.find(\"h2\") for div in divs]\n", 171 | "headings" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "## Extracting the scores" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 8, 184 | "metadata": { 185 | "scrolled": true 186 | }, 187 | "outputs": [ 188 | { 189 | "data": { 190 | "text/plain": [ 191 | "[95%,\n", 192 | " 88%,\n", 193 | " 88%,\n", 194 | " 90%,\n", 195 | " 93%,\n", 196 | " 94%,\n", 197 | " 90%,\n", 198 | " 93%,\n", 199 | " 97%,\n", 200 | " 98%,\n", 201 | " 93%,\n", 202 | " 92%,\n", 203 | " 90%,\n", 204 | " 82%,\n", 205 | " 98%,\n", 206 | " 81%,\n", 207 | " 88%,\n", 208 | " 96%,\n", 209 | " 91%,\n", 210 | " 90%,\n", 211 | " 85%,\n", 212 | " 96%,\n", 213 | " 97%,\n", 214 | " 87%,\n", 215 | " 77%,\n", 216 | " 90%,\n", 217 | " 94%,\n", 218 | " 79%,\n", 219 | " 83%,\n", 220 | " 85%,\n", 221 | " 92%,\n", 222 | " 91%,\n", 223 | " 94%,\n", 224 | " 93%,\n", 225 | " 77%,\n", 226 | " 82%,\n", 227 | " 66%,\n", 228 | " 89%,\n", 229 | " 89%,\n", 230 | " 95%,\n", 231 | " 93%,\n", 232 | " 100%,\n", 233 | " 98%,\n", 234 | " 80%,\n", 235 | " 94%,\n", 236 | " 70%,\n", 237 | " 87%,\n", 238 | " 93%,\n", 239 | " 100%,\n", 240 | " 76%,\n", 241 | " 85%,\n", 242 | " 73%,\n", 243 | " 94%,\n", 244 | " 83%,\n", 245 | " 86%,\n", 246 | " 97%,\n", 247 | " 81%,\n", 248 | " 92%,\n", 249 | " 82%,\n", 250 | " 95%,\n", 251 | " 86%,\n", 252 | " 86%,\n", 253 | " 97%,\n", 254 | " 95%,\n", 255 | " 99%,\n", 256 | " 94%,\n", 257 | " 88%,\n", 258 | " 93%,\n", 259 | " 93%,\n", 260 | " 97%]" 261 | ] 262 | }, 263 | "execution_count": 8, 264 | "metadata": {}, 265 | "output_type": "execute_result" 266 | } 267 | ], 268 | "source": [ 269 | "# Filtering only the spans containing the score\n", 270 | "[heading.find(\"span\", class_ = 'tMeterScore') for heading in headings]" 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": 9, 276 | "metadata": { 277 | "scrolled": true 278 | }, 279 | "outputs": [ 280 | { 281 | "data": { 282 | "text/plain": [ 283 | "['95%',\n", 284 | " '88%',\n", 285 | " '88%',\n", 286 | " '90%',\n", 287 | " '93%',\n", 288 | " '94%',\n", 289 | " '90%',\n", 290 | " '93%',\n", 291 | " '97%',\n", 292 | " '98%',\n", 293 | " '93%',\n", 294 | " '92%',\n", 295 | " '90%',\n", 296 | " '82%',\n", 297 | " '98%',\n", 298 | " '81%',\n", 299 | " '88%',\n", 300 | " '96%',\n", 301 | " '91%',\n", 302 | " '90%',\n", 303 | " '85%',\n", 304 | " '96%',\n", 305 | " '97%',\n", 306 | " '87%',\n", 307 | " '77%',\n", 308 | " '90%',\n", 309 | " '94%',\n", 310 | " '79%',\n", 311 | " '83%',\n", 312 | " '85%',\n", 313 | " '92%',\n", 314 | " '91%',\n", 315 | " '94%',\n", 316 | " '93%',\n", 317 | " '77%',\n", 318 | " '82%',\n", 319 | " '66%',\n", 320 | " '89%',\n", 321 | " '89%',\n", 322 | " '95%',\n", 323 | " '93%',\n", 324 | " '100%',\n", 325 | " '98%',\n", 326 | " '80%',\n", 327 | " '94%',\n", 328 | " '70%',\n", 329 | " '87%',\n", 330 | " '93%',\n", 331 | " '100%',\n", 332 | " '76%',\n", 333 | " '85%',\n", 334 | " '73%',\n", 335 | " '94%',\n", 336 | " '83%',\n", 337 | " '86%',\n", 338 | " '97%',\n", 339 | " '81%',\n", 340 | " '92%',\n", 341 | " '82%',\n", 342 | " '95%',\n", 343 | " '86%',\n", 344 | " '86%',\n", 345 | " '97%',\n", 346 | " '95%',\n", 347 | " '99%',\n", 348 | " '94%',\n", 349 | " '88%',\n", 350 | " '93%',\n", 351 | " 
'93%',\n", 352 | " '97%']" 353 | ] 354 | }, 355 | "execution_count": 9, 356 | "metadata": {}, 357 | "output_type": "execute_result" 358 | } 359 | ], 360 | "source": [ 361 | "# Extracting the score string\n", 362 | "scores = [heading.find(\"span\", class_ = 'tMeterScore').string for heading in headings]\n", 363 | "scores" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 10, 369 | "metadata": { 370 | "scrolled": true 371 | }, 372 | "outputs": [ 373 | { 374 | "data": { 375 | "text/plain": [ 376 | "['95',\n", 377 | " '88',\n", 378 | " '88',\n", 379 | " '90',\n", 380 | " '93',\n", 381 | " '94',\n", 382 | " '90',\n", 383 | " '93',\n", 384 | " '97',\n", 385 | " '98',\n", 386 | " '93',\n", 387 | " '92',\n", 388 | " '90',\n", 389 | " '82',\n", 390 | " '98',\n", 391 | " '81',\n", 392 | " '88',\n", 393 | " '96',\n", 394 | " '91',\n", 395 | " '90',\n", 396 | " '85',\n", 397 | " '96',\n", 398 | " '97',\n", 399 | " '87',\n", 400 | " '77',\n", 401 | " '90',\n", 402 | " '94',\n", 403 | " '79',\n", 404 | " '83',\n", 405 | " '85',\n", 406 | " '92',\n", 407 | " '91',\n", 408 | " '94',\n", 409 | " '93',\n", 410 | " '77',\n", 411 | " '82',\n", 412 | " '66',\n", 413 | " '89',\n", 414 | " '89',\n", 415 | " '95',\n", 416 | " '93',\n", 417 | " '100',\n", 418 | " '98',\n", 419 | " '80',\n", 420 | " '94',\n", 421 | " '70',\n", 422 | " '87',\n", 423 | " '93',\n", 424 | " '100',\n", 425 | " '76',\n", 426 | " '85',\n", 427 | " '73',\n", 428 | " '94',\n", 429 | " '83',\n", 430 | " '86',\n", 431 | " '97',\n", 432 | " '81',\n", 433 | " '92',\n", 434 | " '82',\n", 435 | " '95',\n", 436 | " '86',\n", 437 | " '86',\n", 438 | " '97',\n", 439 | " '95',\n", 440 | " '99',\n", 441 | " '94',\n", 442 | " '88',\n", 443 | " '93',\n", 444 | " '93',\n", 445 | " '97']" 446 | ] 447 | }, 448 | "execution_count": 10, 449 | "metadata": {}, 450 | "output_type": "execute_result" 451 | } 452 | ], 453 | "source": [ 454 | "# Removing the '%' sign\n", 455 | "scores = [s.strip('%') for s in scores]\n", 456 | "scores" 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "execution_count": 11, 462 | "metadata": { 463 | "scrolled": true 464 | }, 465 | "outputs": [ 466 | { 467 | "data": { 468 | "text/plain": [ 469 | "[95,\n", 470 | " 88,\n", 471 | " 88,\n", 472 | " 90,\n", 473 | " 93,\n", 474 | " 94,\n", 475 | " 90,\n", 476 | " 93,\n", 477 | " 97,\n", 478 | " 98,\n", 479 | " 93,\n", 480 | " 92,\n", 481 | " 90,\n", 482 | " 82,\n", 483 | " 98,\n", 484 | " 81,\n", 485 | " 88,\n", 486 | " 96,\n", 487 | " 91,\n", 488 | " 90,\n", 489 | " 85,\n", 490 | " 96,\n", 491 | " 97,\n", 492 | " 87,\n", 493 | " 77,\n", 494 | " 90,\n", 495 | " 94,\n", 496 | " 79,\n", 497 | " 83,\n", 498 | " 85,\n", 499 | " 92,\n", 500 | " 91,\n", 501 | " 94,\n", 502 | " 93,\n", 503 | " 77,\n", 504 | " 82,\n", 505 | " 66,\n", 506 | " 89,\n", 507 | " 89,\n", 508 | " 95,\n", 509 | " 93,\n", 510 | " 100,\n", 511 | " 98,\n", 512 | " 80,\n", 513 | " 94,\n", 514 | " 70,\n", 515 | " 87,\n", 516 | " 93,\n", 517 | " 100,\n", 518 | " 76,\n", 519 | " 85,\n", 520 | " 73,\n", 521 | " 94,\n", 522 | " 83,\n", 523 | " 86,\n", 524 | " 97,\n", 525 | " 81,\n", 526 | " 92,\n", 527 | " 82,\n", 528 | " 95,\n", 529 | " 86,\n", 530 | " 86,\n", 531 | " 97,\n", 532 | " 95,\n", 533 | " 99,\n", 534 | " 94,\n", 535 | " 88,\n", 536 | " 93,\n", 537 | " 93,\n", 538 | " 97]" 539 | ] 540 | }, 541 | "execution_count": 11, 542 | "metadata": {}, 543 | "output_type": "execute_result" 544 | } 545 | ], 546 | "source": [ 547 | "# Converting each score to an integer\n", 548 | "scores = [int(s) 
for s in scores]\n", 549 | "scores" 550 | ] 551 | } 552 | ], 553 | "metadata": { 554 | "kernelspec": { 555 | "display_name": "Python 3", 556 | "language": "python", 557 | "name": "python3" 558 | }, 559 | "language_info": { 560 | "codemirror_mode": { 561 | "name": "ipython", 562 | "version": 3 563 | }, 564 | "file_extension": ".py", 565 | "mimetype": "text/x-python", 566 | "name": "python", 567 | "nbconvert_exporter": "python", 568 | "pygments_lexer": "ipython3", 569 | "version": "3.7.3" 570 | } 571 | }, 572 | "nbformat": 4, 573 | "nbformat_minor": 2 574 | } 575 | -------------------------------------------------------------------------------- /06.Project Scraping - Rotten Tomatoes/Section 6 - Setting up your scraper.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Set-up" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "# load packages\n", 17 | "import requests\n", 18 | "from bs4 import BeautifulSoup" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "# Define the URL of the site\n", 28 | "base_site = \"https://editorial.rottentomatoes.com/guide/140-essential-action-movies-to-watch-now/2/\"" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 3, 34 | "metadata": {}, 35 | "outputs": [ 36 | { 37 | "data": { 38 | "text/plain": [ 39 | "200" 40 | ] 41 | }, 42 | "execution_count": 3, 43 | "metadata": {}, 44 | "output_type": "execute_result" 45 | } 46 | ], 47 | "source": [ 48 | "# sending a request to the webpage\n", 49 | "response = requests.get(base_site)\n", 50 | "response.status_code" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": 4, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "# get the HTML from the webpage\n", 60 | "html = response.content" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "## Choosing a parser" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "### html.parser" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 5, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "# convert the HTML to a Beautiful Soup object\n", 84 | "soup = BeautifulSoup(html, 'html.parser')" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 6, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "# Exporting the HTML to a file\n", 94 | "with open('Rotten_tomatoes_page_2_HTML_Parser.html', 'wb') as file:\n", 95 | " file.write(soup.prettify('utf-8'))" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 7, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "# When inspecting the file, we see that the HTML element is closed at the beginning -- it parsed incorrectly!\n", 105 | "# Let's check another parser" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "### lxml" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 8, 118 | "metadata": {}, 119 | "outputs": [], 120 | "source": [ 121 | "# convert the HTML to a BeautifulSoup object\n", 122 | "soup = BeautifulSoup(html, 'lxml')" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 9, 128 | "metadata": {}, 
129 | "outputs": [], 130 | "source": [ 131 | "# Exporting the HTML to a file\n", 132 | "with open('Rotten_tomatoes_page_2_LXML_Parser.html', 'wb') as file:\n", 133 | " file.write(soup.prettify('utf-8'))" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 10, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "# By first accounts of inspecting the file everything seems fine" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "### A word of caution" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 11, 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "# Beautiful Soup ranks the lxml parser as the best one.\n", 159 | "\n", 160 | "# If a parser is not explicitly stated in the Beautiful Soup constructor,\n", 161 | "# the best one available on the current machine is chosen.\n", 162 | "\n", 163 | "# This means that the same piece of code can give different results on different computers." 164 | ] 165 | } 166 | ], 167 | "metadata": { 168 | "kernelspec": { 169 | "display_name": "Python 3", 170 | "language": "python", 171 | "name": "python3" 172 | }, 173 | "language_info": { 174 | "codemirror_mode": { 175 | "name": "ipython", 176 | "version": 3 177 | }, 178 | "file_extension": ".py", 179 | "mimetype": "text/x-python", 180 | "name": "python", 181 | "nbconvert_exporter": "python", 182 | "pygments_lexer": "ipython3", 183 | "version": "3.7.3" 184 | } 185 | }, 186 | "nbformat": 4, 187 | "nbformat_minor": 2 188 | } 189 | -------------------------------------------------------------------------------- /06.Project Scraping - Rotten Tomatoes/movies_info.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ptyadana/Web-Scraping-and-API-in-Python/9595bc418866642143eaf4a1f700dd646d81d427/06.Project Scraping - Rotten Tomatoes/movies_info.xlsx -------------------------------------------------------------------------------- /08.Scraping Steam Project/New_Trending_Games_Info.csv: -------------------------------------------------------------------------------- 1 | Title,Price,Tags 2 | Dreamscaper: Prologue,Free,"Action, Indie, RPG, Free to Play" 3 | RESIDENT EVIL 3,$59.99,"Action, Zombies, Horror, Survival Horror" 4 | ONE PIECE: PIRATE WARRIORS 4,$59.99,"Action, Anime, Co-op, Online Co-Op" 5 | Eternal Radiance,$16.19,"Action, Adventure, RPG, Anime" 6 | Deadside,$19.99,"Massively Multiplayer, Action, Adventure, Indie" 7 | Conqueror's Blade,Free to Play,"Strategy, Massively Multiplayer, Action, Simulation" 8 | Borderlands 3,$59.99,"RPG, Action, Online Co-Op, Looter Shooter" 9 | Granblue Fantasy: Versus,$59.99,"Action, Anime, Fighting, 2D Fighter" 10 | Receiver 2,$17.99,"Simulation, Indie, Action, Shooter" 11 | Rakion Chaos Force,Free,"Action, RPG, Free to Play, Strategy" 12 | Mount & Blade II: Bannerlord,$49.99,"Early Access, Medieval, Strategy, Open World" 13 | Half-Life: Alyx,$59.99,"Masterpiece, Action, VR, Adventure" 14 | Last Oasis,$29.99,"Massively Multiplayer, Survival, Action, Adventure" 15 | DOOM Eternal,$59.99,"Action, Masterpiece, Great Soundtrack, FPS" 16 | Disaster Report 4: Summer Memories,$59.99,"Adventure, Action, Survival, VR" 17 | -------------------------------------------------------------------------------- /08.Scraping Steam Project/Section 8 - Scraping Steam - Setup.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | 
"cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Extracting data from Steam " 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Initial Setup" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "from bs4 import BeautifulSoup\n", 24 | "import requests" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "## Connect to Steam webpage" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "r = requests.get(\"https://store.steampowered.com/tags/en/Action/\")\n", 41 | "r.status_code" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "html = r.content" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "soup = BeautifulSoup(html, \"lxml\")" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "metadata": {}, 72 | "source": [ 73 | "## What can we scrape from this webpage?\n", 74 | "## 1) Try extracting the names of the top games from this page.\n", 75 | "## 2) What tags contain the prices? Can you extract the price information?\n", 76 | "## 3) Get all of the header tags on the page\n", 77 | "## 4) Can you get the text from each span tag with class equal to \"top_tag\"?\n", 78 | "## 5) Under the \"Narrow by Tag\" section, there are a collection of tags (e.g. \"Indie\", \"Adventure\", etc.). Write code to return these tags.\n", 79 | "## 6) What else can be scraped from this webpage or others on the site?" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "## Now is your turn!" 
87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": null, 92 | "metadata": {}, 93 | "outputs": [], 94 | "source": [] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": null, 99 | "metadata": {}, 100 | "outputs": [], 101 | "source": [] 102 | }, 103 | { 104 | "cell_type": "code", 105 | "execution_count": null, 106 | "metadata": {}, 107 | "outputs": [], 108 | "source": [] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [] 123 | } 124 | ], 125 | "metadata": { 126 | "kernelspec": { 127 | "display_name": "Python 3", 128 | "language": "python", 129 | "name": "python3" 130 | }, 131 | "language_info": { 132 | "codemirror_mode": { 133 | "name": "ipython", 134 | "version": 3 135 | }, 136 | "file_extension": ".py", 137 | "mimetype": "text/x-python", 138 | "name": "python", 139 | "nbconvert_exporter": "python", 140 | "pygments_lexer": "ipython3", 141 | "version": "3.7.3" 142 | } 143 | }, 144 | "nbformat": 4, 145 | "nbformat_minor": 2 146 | } 147 | -------------------------------------------------------------------------------- /08.Scraping Steam Project/Top_Rated_Games.info.csv: -------------------------------------------------------------------------------- 1 | Title,Price,Tags 2 | Counter-Strike: Global Offensive,Free to Play,"FPS, Shooter, Multiplayer, Competitive" 3 | Tom Clancy's Rainbow Six® Siege,$19.99,"FPS, Hero Shooter, Multiplayer, Tactical" 4 | Warframe,Free to Play,"Looter Shooter, Free to Play, Action, Co-op" 5 | Left 4 Dead 2,$9.99,"Zombies, Co-op, FPS, Multiplayer" 6 | Counter-Strike,$9.99,"Action, FPS, Multiplayer, Shooter" 7 | Borderlands 2,$19.99,"Loot, Shooter, Action, Multiplayer" 8 | Tomb Raider,$19.99,"Adventure, Action, Female Protagonist, Third Person" 9 | PAYDAY 2,$9.99,"Co-op, Action, FPS, Heist" 10 | Counter-Strike: Source,$9.99,"Shooter, Action, FPS, Multiplayer" 11 | Destiny 2,Free To Play,"Free to Play, Looter Shooter, FPS, Multiplayer" 12 | Half-Life 2,$9.99,"FPS, Action, Sci-fi, Classic" 13 | BioShock Infinite,$29.99,"FPS, Story Rich, Action, Singleplayer" 14 | Mount & Blade: Warband,$19.99,"Medieval, RPG, Open World, Strategy" 15 | Risk of Rain 2,$19.99,"Third-Person Shooter, Action Roguelike, Action, Multiplayer" 16 | MONSTER HUNTER: WORLD,$29.99,"Co-op, Multiplayer, Action, Open World" 17 | -------------------------------------------------------------------------------- /08.Scraping Steam Project/Top_Sellers_Games_info.csv: -------------------------------------------------------------------------------- 1 | Title,Price,Tags 2 | Counter-Strike: Global Offensive,Free to Play,"FPS, Shooter, Multiplayer, Competitive" 3 | Tom Clancy's Rainbow Six® Siege,$19.99,"FPS, Hero Shooter, Multiplayer, Tactical" 4 | Warframe,Free to Play,"Looter Shooter, Free to Play, Action, Co-op" 5 | Left 4 Dead 2,$9.99,"Zombies, Co-op, FPS, Multiplayer" 6 | Counter-Strike,$9.99,"Action, FPS, Multiplayer, Shooter" 7 | Borderlands 2,$19.99,"Loot, Shooter, Action, Multiplayer" 8 | Tomb Raider,$19.99,"Adventure, Action, Female Protagonist, Third Person" 9 | PAYDAY 2,$9.99,"Co-op, Action, FPS, Heist" 10 | Counter-Strike: Source,$9.99,"Shooter, Action, FPS, Multiplayer" 11 | Destiny 2,Free To Play,"Free to Play, Looter Shooter, FPS, Multiplayer" 12 | Half-Life 2,$9.99,"FPS, Action, Sci-fi, Classic" 13 | BioShock Infinite,$29.99,"FPS, Story Rich, 
Action, Singleplayer" 14 | Mount & Blade: Warband,$19.99,"Medieval, RPG, Open World, Strategy" 15 | Risk of Rain 2,$19.99,"Third-Person Shooter, Action Roguelike, Action, Multiplayer" 16 | MONSTER HUNTER: WORLD,$29.99,"Co-op, Multiplayer, Action, Open World" 17 | -------------------------------------------------------------------------------- /08.Scraping Steam Project/Trending_Games_info.csv: -------------------------------------------------------------------------------- 1 | Title,Price,Tags 2 | Counter-Strike: Global Offensive,Free to Play,"FPS, Shooter, Multiplayer, Competitive" 3 | Tom Clancy's Rainbow Six® Siege,$19.99,"FPS, Hero Shooter, Multiplayer, Tactical" 4 | Warframe,Free to Play,"Looter Shooter, Free to Play, Action, Co-op" 5 | Left 4 Dead 2,$9.99,"Zombies, Co-op, FPS, Multiplayer" 6 | Counter-Strike,$9.99,"Action, FPS, Multiplayer, Shooter" 7 | Borderlands 2,$19.99,"Loot, Shooter, Action, Multiplayer" 8 | Tomb Raider,$19.99,"Adventure, Action, Female Protagonist, Third Person" 9 | PAYDAY 2,$9.99,"Co-op, Action, FPS, Heist" 10 | Counter-Strike: Source,$9.99,"Shooter, Action, FPS, Multiplayer" 11 | Destiny 2,Free To Play,"Free to Play, Looter Shooter, FPS, Multiplayer" 12 | Half-Life 2,$9.99,"FPS, Action, Sci-fi, Classic" 13 | BioShock Infinite,$29.99,"FPS, Story Rich, Action, Singleplayer" 14 | Mount & Blade: Warband,$19.99,"Medieval, RPG, Open World, Strategy" 15 | Risk of Rain 2,$19.99,"Third-Person Shooter, Action Roguelike, Action, Multiplayer" 16 | MONSTER HUNTER: WORLD,$29.99,"Co-op, Multiplayer, Action, Open World" 17 | -------------------------------------------------------------------------------- /08.Scraping Youtube Project/Section 8 - Scraping YouTube - Setup.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Scraping YouTube" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Initial Setup" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "from bs4 import BeautifulSoup\n", 24 | "import requests" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "## Connect to webpage" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "r = requests.get(\"https://www.youtube.com/\")\n", 41 | "r.status_code" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": null, 47 | "metadata": {}, 48 | "outputs": [], 49 | "source": [ 50 | "# get HTML\n", 51 | "html = resp.content" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "# convert HTML to BeautifulSoup object\n", 61 | "soup = BeautifulSoup(html)" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": null, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "## 1) Scrape the text from each span tag\n", 76 | "## 2) How many images are on YouTube'e homepage?\n", 77 | "## 3) Can you find the URL of the link with title = \"Movies\"? Music? 
Sports?\n", 78 | "## 4) Now, try connecting to and scraping https://www.youtube.com/results?search_query=stairway+to+heaven\n", 79 | "## a) Can you get the names of the first few videos in the search results?\n", 80 | "## b) Next, connect to one of the search result videos - https://www.youtube.com/watch?v=qHFxncb1gRY\n", 81 | "## c) Can you find the \"related\" videos? What are their titles? Durations? URLs? Number of views?\n", 82 | "## d) Try finding (and scraping) the Twitter description of the video." 83 | ] 84 | }, 85 | { 86 | "cell_type": "code", 87 | "execution_count": null, 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": {}, 96 | "outputs": [], 97 | "source": [] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": null, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [] 119 | } 120 | ], 121 | "metadata": { 122 | "kernelspec": { 123 | "display_name": "Python 3", 124 | "language": "python", 125 | "name": "python3" 126 | }, 127 | "language_info": { 128 | "codemirror_mode": { 129 | "name": "ipython", 130 | "version": 3 131 | }, 132 | "file_extension": ".py", 133 | "mimetype": "text/x-python", 134 | "name": "python", 135 | "nbconvert_exporter": "python", 136 | "pygments_lexer": "ipython3", 137 | "version": "3.7.3" 138 | } 139 | }, 140 | "nbformat": 4, 141 | "nbformat_minor": 2 142 | } 143 | -------------------------------------------------------------------------------- /09.Common roadblocks when Web Scraping/RequestHeaders.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import requests" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 2, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "headers = {\"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36\"}" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 4, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "r = requests.get('https://www.youtube.com', headers = headers)" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 5, 33 | "metadata": {}, 34 | "outputs": [ 35 | { 36 | "data": { 37 | "text/plain": [ 38 | "200" 39 | ] 40 | }, 41 | "execution_count": 5, 42 | "metadata": {}, 43 | "output_type": "execute_result" 44 | } 45 | ], 46 | "source": [ 47 | "r.status_code" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [] 56 | } 57 | ], 58 | "metadata": { 59 | "kernelspec": { 60 | "display_name": "Python 3", 61 | "language": "python", 62 | "name": "python3" 63 | }, 64 | "language_info": { 65 | "codemirror_mode": { 66 | "name": "ipython", 67 | "version": 3 68 | }, 69 | "file_extension": ".py", 70 | "mimetype": "text/x-python", 71 | "name": "python", 72 | "nbconvert_exporter": "python", 73 | "pygments_lexer": "ipython3", 74 | "version": "3.7.6" 75 | } 76 | }, 77 | "nbformat": 4, 78 | "nbformat_minor": 4 79 | } 80 | 
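A quick aside on the technique in RequestHeaders.ipynb above: many sites inspect the User-Agent header and may block, or serve stripped-down pages to, the default python-requests identifier, which is why the notebook supplies a browser-like string. Below is a minimal sketch of how to verify what is actually being sent; note that httpbin.org is used here purely as an echo service and is not part of the course materials.

import requests

# The default User-Agent announces the script, e.g. "python-requests/2.22.0"
r = requests.get("https://httpbin.org/headers")
print(r.json()["headers"]["User-Agent"])

# A browser-like User-Agent makes the request look like regular browser traffic
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"}
r = requests.get("https://httpbin.org/headers", headers=headers)
print(r.json()["headers"]["User-Agent"])

# The headers that were actually sent are also stored on the response object
print(r.request.headers)

Printing r.request.headers is often the quickest way to debug a scraper that works in a browser but gets a 403 from requests.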
-------------------------------------------------------------------------------- /09.Common roadblocks when Web Scraping/Section 9 - Sample HTML login Form.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | HTML Form 6 | 7 | 8 | 9 | 10 |
11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
24 | 25 | 26 | 27 | -------------------------------------------------------------------------------- /09.Common roadblocks when Web Scraping/Section 9 - Sample login code.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import requests" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 2, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "# URL of the POST request - need to inspect the HTML or use devtools to obtain\n", 19 | "url = \"target_url_of_post_request\"" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 3, 25 | "metadata": {}, 26 | "outputs": [], 27 | "source": [ 28 | "# Define parameters sent with the POST request\n", 29 | "# (if there are additional ones, define them as well)\n", 30 | "user = \"Your username goes here\"\n", 31 | "password = \"Your password goes here\"" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 4, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "# Arrange all parameters in a dictionary format with the right names\n", 41 | "payload = {\n", 42 | " \"user[email]\": user,\n", 43 | " \"user[password]\": password\n", 44 | "}" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 5, 50 | "metadata": {}, 51 | "outputs": [], 52 | "source": [ 53 | "# Create a session so that we have consistent cookies\n", 54 | "s = requests.Session()" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 6, 60 | "metadata": {}, 61 | "outputs": [ 62 | { 63 | "data": { 64 | "text/plain": [ 65 | "200" 66 | ] 67 | }, 68 | "execution_count": 6, 69 | "metadata": {}, 70 | "output_type": "execute_result" 71 | } 72 | ], 73 | "source": [ 74 | "# Submit the POST request through the session\n", 75 | "p = s.post(url, data = payload)\n", 76 | "p.status_code" 77 | ] 78 | }, 79 | { 80 | "cell_type": "code", 81 | "execution_count": 7, 82 | "metadata": {}, 83 | "outputs": [], 84 | "source": [ 85 | "# You are now logged in and can proceed with scraping the data\n", 86 | "# .\n", 87 | "# .\n", 88 | "# .\n", 89 | "\n", 90 | "# Don't forget to close the session when you are done\n", 91 | "s.close()" 92 | ] 93 | } 94 | ], 95 | "metadata": { 96 | "kernelspec": { 97 | "display_name": "Python 3", 98 | "language": "python", 99 | "name": "python3" 100 | }, 101 | "language_info": { 102 | "codemirror_mode": { 103 | "name": "ipython", 104 | "version": 3 105 | }, 106 | "file_extension": ".py", 107 | "mimetype": "text/x-python", 108 | "name": "python", 109 | "nbconvert_exporter": "python", 110 | "pygments_lexer": "ipython3", 111 | "version": "3.7.6" 112 | } 113 | }, 114 | "nbformat": 4, 115 | "nbformat_minor": 2 116 | } 117 | -------------------------------------------------------------------------------- /09.Common roadblocks when Web Scraping/Sessions.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import requests" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "#initialize a session\n", 19 | "s = requests.Session()\n", 20 | "\n", 21 | "#request made through that session\n", 22 | "#related cookies are handled through each session\n", 23 | "r1 = 
s.post(url1, data = payload)\n", 24 | "\n", 25 | "#request made through that session\n", 26 | "r2 = s.get(url2)\n", 27 | "\n", 28 | "s.close()" 29 | ] 30 | } 31 | ], 32 | "metadata": { 33 | "kernelspec": { 34 | "display_name": "Python 3", 35 | "language": "python", 36 | "name": "python3" 37 | }, 38 | "language_info": { 39 | "codemirror_mode": { 40 | "name": "ipython", 41 | "version": 3 42 | }, 43 | "file_extension": ".py", 44 | "mimetype": "text/x-python", 45 | "name": "python", 46 | "nbconvert_exporter": "python", 47 | "pygments_lexer": "ipython3", 48 | "version": "3.7.6" 49 | } 50 | }, 51 | "nbformat": 4, 52 | "nbformat_minor": 4 53 | } 54 | -------------------------------------------------------------------------------- /10.The Requests-HTML Package/Scraper_JavaScript.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Set up\n" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 2, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "from requests_html import AsyncHTMLSession" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 3, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "session = AsyncHTMLSession()" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 4, 31 | "metadata": {}, 32 | "outputs": [ 33 | { 34 | "data": { 35 | "text/plain": [ 36 | "200" 37 | ] 38 | }, 39 | "execution_count": 4, 40 | "metadata": {}, 41 | "output_type": "execute_result" 42 | } 43 | ], 44 | "source": [ 45 | "r = await session.get('https://www.reddit.com')\n", 46 | "r.status_code" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 5, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "divs = r.html.find('div')\n", 56 | "links = r.html.find('a')\n", 57 | "urls = r.html.absolute_links" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "## Need to render the JavaScript, as the HTML is generated dynamically with JS" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "This will install Chromium on the PC; it acts like a web browser, but is only used by the program" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 10, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "import pyppdf.patch_pyppeteer  # patches pyppeteer's Chromium download (see note above)\n", 81 | "await r.html.arender()  # async render: executes the page's JavaScript in headless Chromium" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 11, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "new_divs = r.html.find('div')\n", 91 | "new_links = r.html.find('a')\n", 92 | "new_urls = r.html.absolute_links" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": 12, 98 | "metadata": {}, 99 | "outputs": [ 100 | { 101 | "data": { 102 | "text/plain": [ 103 | "(504, 1649)" 104 | ] 105 | }, 106 | "execution_count": 12, 107 | "metadata": {}, 108 | "output_type": "execute_result" 109 | } 110 | ], 111 | "source": [ 112 | "len(divs) , len(new_divs)" 113 | ] 114 | }, 115 | { 116 | "cell_type": "code", 117 | "execution_count": 13, 118 | "metadata": {}, 119 | "outputs": [ 120 | { 121 | "data": { 122 | "text/plain": [ 123 | "(80, 661)" 124 | ] 125 | }, 126 | "execution_count": 13, 127 | "metadata": {}, 128 | "output_type": "execute_result" 129 | } 130 | ], 131 | "source": [ 132 | "len(links), len(new_links)" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 
14, 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/plain": [ 143 | "(57, 627)" 144 | ] 145 | }, 146 | "execution_count": 14, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "len(urls), len(new_urls)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "### Check the difference between first html and rendered version html" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 15, 165 | "metadata": {}, 166 | "outputs": [ 167 | { 168 | "data": { 169 | "text/plain": [ 170 | "{'https://www.reddit.com/r/1200isplenty/',\n", 171 | " 'https://www.reddit.com/r/2007scape/',\n", 172 | " 'https://www.reddit.com/r/49ers/',\n", 173 | " 'https://www.reddit.com/r/90DayFiance/',\n", 174 | " 'https://www.reddit.com/r/ACMilan/',\n", 175 | " 'https://www.reddit.com/r/Adelaide/',\n", 176 | " 'https://www.reddit.com/r/Amd/',\n", 177 | " 'https://www.reddit.com/r/Android/',\n", 178 | " 'https://www.reddit.com/r/Animesuggest/',\n", 179 | " 'https://www.reddit.com/r/AnthemTheGame/',\n", 180 | " 'https://www.reddit.com/r/AskCulinary/',\n", 181 | " 'https://www.reddit.com/r/AskMen/',\n", 182 | " 'https://www.reddit.com/r/AskNYC/',\n", 183 | " 'https://www.reddit.com/r/AskReddit/',\n", 184 | " 'https://www.reddit.com/r/AskWomen/',\n", 185 | " 'https://www.reddit.com/r/Astros/',\n", 186 | " 'https://www.reddit.com/r/Atlanta/',\n", 187 | " 'https://www.reddit.com/r/AtlantaUnited/',\n", 188 | " 'https://www.reddit.com/r/Austria/',\n", 189 | " 'https://www.reddit.com/r/Barca/',\n", 190 | " 'https://www.reddit.com/r/BattlefieldV/',\n", 191 | " 'https://www.reddit.com/r/BeautyBoxes/',\n", 192 | " 'https://www.reddit.com/r/BeautyGuruChatter/',\n", 193 | " 'https://www.reddit.com/r/Berserk/',\n", 194 | " 'https://www.reddit.com/r/BigBrother/',\n", 195 | " 'https://www.reddit.com/r/BlackClover/',\n", 196 | " 'https://www.reddit.com/r/Blackops4/',\n", 197 | " 'https://www.reddit.com/r/BoJackHorseman/',\n", 198 | " 'https://www.reddit.com/r/BokuNoHeroAcademia/',\n", 199 | " 'https://www.reddit.com/r/Boruto/',\n", 200 | " 'https://www.reddit.com/r/BostonBruins/',\n", 201 | " 'https://www.reddit.com/r/Boxing/',\n", 202 | " 'https://www.reddit.com/r/Braves/',\n", 203 | " 'https://www.reddit.com/r/BravoRealHousewives/',\n", 204 | " 'https://www.reddit.com/r/Brawlstars/',\n", 205 | " 'https://www.reddit.com/r/Breath_of_the_Wild/',\n", 206 | " 'https://www.reddit.com/r/Brogress/',\n", 207 | " 'https://www.reddit.com/r/Browns/',\n", 208 | " 'https://www.reddit.com/r/C25K/',\n", 209 | " 'https://www.reddit.com/r/CFB/',\n", 210 | " 'https://www.reddit.com/r/CHIBears/',\n", 211 | " 'https://www.reddit.com/r/CHICubs/',\n", 212 | " 'https://www.reddit.com/r/Calgary/',\n", 213 | " 'https://www.reddit.com/r/CampingGear/',\n", 214 | " 'https://www.reddit.com/r/CampingandHiking/',\n", 215 | " 'https://www.reddit.com/r/Cardinals/',\n", 216 | " 'https://www.reddit.com/r/CasualUK/',\n", 217 | " 'https://www.reddit.com/r/Charlotte/',\n", 218 | " 'https://www.reddit.com/r/China/',\n", 219 | " 'https://www.reddit.com/r/ClashOfClans/',\n", 220 | " 'https://www.reddit.com/r/ClashRoyale/',\n", 221 | " 'https://www.reddit.com/r/CoDCompetitive/',\n", 222 | " 'https://www.reddit.com/r/CoachellaValley/',\n", 223 | " 'https://www.reddit.com/r/CollegeBasketball/',\n", 224 | " 'https://www.reddit.com/r/Columbus/',\n", 225 | " 'https://www.reddit.com/r/Competitiveoverwatch/',\n", 226 | " 
'https://www.reddit.com/r/Cooking/',\n", 227 | " 'https://www.reddit.com/r/Cricket/',\n", 228 | " 'https://www.reddit.com/r/CrohnsDisease/',\n", 229 | " 'https://www.reddit.com/r/CrusaderKings/',\n", 230 | " 'https://www.reddit.com/r/DBZDokkanBattle/',\n", 231 | " 'https://www.reddit.com/r/DMAcademy/',\n", 232 | " 'https://www.reddit.com/r/Dallas/',\n", 233 | " 'https://www.reddit.com/r/DanLeBatardShow/',\n", 234 | " 'https://www.reddit.com/r/DaysGone/',\n", 235 | " 'https://www.reddit.com/r/Denmark/',\n", 236 | " 'https://www.reddit.com/r/Denver/',\n", 237 | " 'https://www.reddit.com/r/Destiny/',\n", 238 | " 'https://www.reddit.com/r/DestinyTheGame/',\n", 239 | " 'https://www.reddit.com/r/Detroit/',\n", 240 | " 'https://www.reddit.com/r/Disneyland/',\n", 241 | " 'https://www.reddit.com/r/DnD/',\n", 242 | " 'https://www.reddit.com/r/Dodgers/',\n", 243 | " 'https://www.reddit.com/r/DotA2/',\n", 244 | " 'https://www.reddit.com/r/DuelLinks/',\n", 245 | " 'https://www.reddit.com/r/DunderMifflin/',\n", 246 | " 'https://www.reddit.com/r/DynastyFF/',\n", 247 | " 'https://www.reddit.com/r/EDH/',\n", 248 | " 'https://www.reddit.com/r/EDanonymemes/',\n", 249 | " 'https://www.reddit.com/r/EOOD/',\n", 250 | " 'https://www.reddit.com/r/EatCheapAndHealthy/',\n", 251 | " 'https://www.reddit.com/r/Edmonton/',\n", 252 | " 'https://www.reddit.com/r/EliteDangerous/',\n", 253 | " 'https://www.reddit.com/r/EscapefromTarkov/',\n", 254 | " 'https://www.reddit.com/r/Eve/',\n", 255 | " 'https://www.reddit.com/r/FFBraveExvius/',\n", 256 | " 'https://www.reddit.com/r/FIFA/',\n", 257 | " 'https://www.reddit.com/r/FORTnITE/',\n", 258 | " 'https://www.reddit.com/r/FUTMobile/',\n", 259 | " 'https://www.reddit.com/r/Fallout/',\n", 260 | " 'https://www.reddit.com/r/FantasyPL/',\n", 261 | " 'https://www.reddit.com/r/FireEmblemHeroes/',\n", 262 | " 'https://www.reddit.com/r/Fishing/',\n", 263 | " 'https://www.reddit.com/r/Fitness/',\n", 264 | " 'https://www.reddit.com/r/FixedGearBicycle/',\n", 265 | " 'https://www.reddit.com/r/FlashTV/',\n", 266 | " 'https://www.reddit.com/r/FortNiteBR/',\n", 267 | " 'https://www.reddit.com/r/FortniteCompetitive/',\n", 268 | " 'https://www.reddit.com/r/Frugal/',\n", 269 | " 'https://www.reddit.com/r/GameOfThronesMemes/',\n", 270 | " 'https://www.reddit.com/r/Gamingcirclejerk/',\n", 271 | " 'https://www.reddit.com/r/GetMotivated/',\n", 272 | " 'https://www.reddit.com/r/Glitch_in_the_Matrix/',\n", 273 | " 'https://www.reddit.com/r/GlobalOffensive/',\n", 274 | " 'https://www.reddit.com/r/GlobalOffensiveTrade/',\n", 275 | " 'https://www.reddit.com/r/GooglePixel/',\n", 276 | " 'https://www.reddit.com/r/GreenBayPackers/',\n", 277 | " 'https://www.reddit.com/r/Grimdank/',\n", 278 | " 'https://www.reddit.com/r/Guildwars2/',\n", 279 | " 'https://www.reddit.com/r/Gundam/',\n", 280 | " 'https://www.reddit.com/r/HBOGameofThrones/',\n", 281 | " 'https://www.reddit.com/r/Hair/',\n", 282 | " 'https://www.reddit.com/r/HealthyFood/',\n", 283 | " 'https://www.reddit.com/r/HomeImprovement/',\n", 284 | " 'https://www.reddit.com/r/IASIP/',\n", 285 | " 'https://www.reddit.com/r/IAmA/',\n", 286 | " 'https://www.reddit.com/r/IWantOut/',\n", 287 | " 'https://www.reddit.com/r/ImaginaryWesteros/',\n", 288 | " 'https://www.reddit.com/r/Indiemakeupandmore/',\n", 289 | " 'https://www.reddit.com/r/Instagram/',\n", 290 | " 'https://www.reddit.com/r/Israel/',\n", 291 | " 'https://www.reddit.com/r/JapanTravel/',\n", 292 | " 'https://www.reddit.com/r/Jeopardy/',\n", 293 | " 'https://www.reddit.com/r/Kitsap/',\n", 294 | " 
'https://www.reddit.com/r/Konosuba/',\n", 295 | " 'https://www.reddit.com/r/LearnJapanese/',\n", 296 | " 'https://www.reddit.com/r/LegendsOfTomorrow/',\n", 297 | " 'https://www.reddit.com/r/LifeProTips/',\n", 298 | " 'https://www.reddit.com/r/LigaMX/',\n", 299 | " 'https://www.reddit.com/r/LiverpoolFC/',\n", 300 | " 'https://www.reddit.com/r/LivestreamFail/',\n", 301 | " 'https://www.reddit.com/r/LosAngelesRams/',\n", 302 | " 'https://www.reddit.com/r/LushCosmetics/',\n", 303 | " 'https://www.reddit.com/r/MCFC/',\n", 304 | " 'https://www.reddit.com/r/MLBTheShow/',\n", 305 | " 'https://www.reddit.com/r/MLS/',\n", 306 | " 'https://www.reddit.com/r/MMA/',\n", 307 | " 'https://www.reddit.com/r/MTB/',\n", 308 | " 'https://www.reddit.com/r/MUAontheCheap/',\n", 309 | " 'https://www.reddit.com/r/MagicArena/',\n", 310 | " 'https://www.reddit.com/r/Makeup/',\n", 311 | " 'https://www.reddit.com/r/MakeupAddiction/',\n", 312 | " 'https://www.reddit.com/r/MakingaMurderer/',\n", 313 | " 'https://www.reddit.com/r/Market76/',\n", 314 | " 'https://www.reddit.com/r/MarvelStrikeForce/',\n", 315 | " 'https://www.reddit.com/r/Mavericks/',\n", 316 | " 'https://www.reddit.com/r/Minecraft/',\n", 317 | " 'https://www.reddit.com/r/Minneapolis/',\n", 318 | " 'https://www.reddit.com/r/MkeBucks/',\n", 319 | " 'https://www.reddit.com/r/ModernMagic/',\n", 320 | " 'https://www.reddit.com/r/MonsterHunterWorld/',\n", 321 | " 'https://www.reddit.com/r/Mordhau/',\n", 322 | " 'https://www.reddit.com/r/MortalKombat/',\n", 323 | " 'https://www.reddit.com/r/MtvChallenge/',\n", 324 | " 'https://www.reddit.com/r/Music/',\n", 325 | " 'https://www.reddit.com/r/NBA2k/',\n", 326 | " 'https://www.reddit.com/r/NBASpurs/',\n", 327 | " 'https://www.reddit.com/r/NFA/',\n", 328 | " 'https://www.reddit.com/r/NHLHUT/',\n", 329 | " 'https://www.reddit.com/r/NYKnicks/',\n", 330 | " 'https://www.reddit.com/r/NYYankees/',\n", 331 | " 'https://www.reddit.com/r/Naruto/',\n", 332 | " 'https://www.reddit.com/r/Nationals/',\n", 333 | " 'https://www.reddit.com/r/Nerf/',\n", 334 | " 'https://www.reddit.com/r/NetflixBestOf/',\n", 335 | " 'https://www.reddit.com/r/NewOrleans/',\n", 336 | " 'https://www.reddit.com/r/NewSkaters/',\n", 337 | " 'https://www.reddit.com/r/NewYorkMets/',\n", 338 | " 'https://www.reddit.com/r/NintendoSwitch/',\n", 339 | " 'https://www.reddit.com/r/NoMansSkyTheGame/',\n", 340 | " 'https://www.reddit.com/r/NoStupidQuestions/',\n", 341 | " 'https://www.reddit.com/r/OnePiece/',\n", 342 | " 'https://www.reddit.com/r/OutOfTheLoop/',\n", 343 | " 'https://www.reddit.com/r/Overwatch/',\n", 344 | " 'https://www.reddit.com/r/PS4/',\n", 345 | " 'https://www.reddit.com/r/PSVR/',\n", 346 | " 'https://www.reddit.com/r/PUBATTLEGROUNDS/',\n", 347 | " 'https://www.reddit.com/r/PUBGMobile/',\n", 348 | " 'https://www.reddit.com/r/Paladins/',\n", 349 | " 'https://www.reddit.com/r/PanPorn/',\n", 350 | " 'https://www.reddit.com/r/PandR/',\n", 351 | " 'https://www.reddit.com/r/Patriots/',\n", 352 | " 'https://www.reddit.com/r/Persona5/',\n", 353 | " 'https://www.reddit.com/r/Philippines/',\n", 354 | " 'https://www.reddit.com/r/Planetside/',\n", 355 | " 'https://www.reddit.com/r/Polska/',\n", 356 | " 'https://www.reddit.com/r/Portland/',\n", 357 | " 'https://www.reddit.com/r/Quebec/',\n", 358 | " 'https://www.reddit.com/r/RWBY/',\n", 359 | " 'https://www.reddit.com/r/Rainbow6/',\n", 360 | " 'https://www.reddit.com/r/RedDeadOnline/',\n", 361 | " 'https://www.reddit.com/r/RedditLaqueristas/',\n", 362 | " 'https://www.reddit.com/r/RepLadiesBST/',\n", 363 | " 
'https://www.reddit.com/r/Repsneakers/',\n", 364 | " 'https://www.reddit.com/r/RimWorld/',\n", 365 | " 'https://www.reddit.com/r/RocketLeague/',\n", 366 | " 'https://www.reddit.com/r/RocketLeagueExchange/',\n", 367 | " 'https://www.reddit.com/r/Romania/',\n", 368 | " 'https://www.reddit.com/r/Rowing/',\n", 369 | " 'https://www.reddit.com/r/SFGiants/',\n", 370 | " 'https://www.reddit.com/r/SWGalaxyOfHeroes/',\n", 371 | " 'https://www.reddit.com/r/Sacramento/',\n", 372 | " 'https://www.reddit.com/r/SaltLakeCity/',\n", 373 | " 'https://www.reddit.com/r/SanJoseSharks/',\n", 374 | " 'https://www.reddit.com/r/SarahSnark/',\n", 375 | " 'https://www.reddit.com/r/Scotland/',\n", 376 | " 'https://www.reddit.com/r/Seaofthieves/',\n", 377 | " 'https://www.reddit.com/r/Seattle/',\n", 378 | " 'https://www.reddit.com/r/SequelMemes/',\n", 379 | " 'https://www.reddit.com/r/ShingekiNoKyojin/',\n", 380 | " 'https://www.reddit.com/r/Shoestring/',\n", 381 | " 'https://www.reddit.com/r/Showerthoughts/',\n", 382 | " 'https://www.reddit.com/r/Smite/',\n", 383 | " 'https://www.reddit.com/r/Sneakers/',\n", 384 | " 'https://www.reddit.com/r/Spiderman/',\n", 385 | " 'https://www.reddit.com/r/SpoiledDragRace/',\n", 386 | " 'https://www.reddit.com/r/SquaredCircle/',\n", 387 | " 'https://www.reddit.com/r/StLouis/',\n", 388 | " 'https://www.reddit.com/r/StarVStheForcesofEvil/',\n", 389 | " 'https://www.reddit.com/r/StarWarsBattlefront/',\n", 390 | " 'https://www.reddit.com/r/StardewValley/',\n", 391 | " 'https://www.reddit.com/r/Steam/',\n", 392 | " 'https://www.reddit.com/r/Stellaris/',\n", 393 | " 'https://www.reddit.com/r/StrangerThings/',\n", 394 | " 'https://www.reddit.com/r/Stronglifts5x5/',\n", 395 | " 'https://www.reddit.com/r/Suomi/',\n", 396 | " 'https://www.reddit.com/r/Supplements/',\n", 397 | " 'https://www.reddit.com/r/TeenMomOGandTeenMom2/',\n", 398 | " 'https://www.reddit.com/r/Terraria/',\n", 399 | " 'https://www.reddit.com/r/TheAmazingRace/',\n", 400 | " 'https://www.reddit.com/r/TheBlackList/',\n", 401 | " 'https://www.reddit.com/r/TheDickShow/',\n", 402 | " 'https://www.reddit.com/r/TheHandmaidsTale/',\n", 403 | " 'https://www.reddit.com/r/TheLastAirbender/',\n", 404 | " 'https://www.reddit.com/r/TheSimpsons/',\n", 405 | " 'https://www.reddit.com/r/Tinder/',\n", 406 | " 'https://www.reddit.com/r/Torontobluejays/',\n", 407 | " 'https://www.reddit.com/r/Turkey/',\n", 408 | " 'https://www.reddit.com/r/TurkeyJerky/',\n", 409 | " 'https://www.reddit.com/r/Twitch/',\n", 410 | " 'https://www.reddit.com/r/TwoBestFriendsPlay/',\n", 411 | " 'https://www.reddit.com/r/VictoriaBC/',\n", 412 | " 'https://www.reddit.com/r/WWE/',\n", 413 | " 'https://www.reddit.com/r/WWEGames/',\n", 414 | " 'https://www.reddit.com/r/WaltDisneyWorld/',\n", 415 | " 'https://www.reddit.com/r/Warframe/',\n", 416 | " 'https://www.reddit.com/r/Warhammer40k/',\n", 417 | " 'https://www.reddit.com/r/Warthunder/',\n", 418 | " 'https://www.reddit.com/r/Watches/',\n", 419 | " 'https://www.reddit.com/r/Watchexchange/',\n", 420 | " 'https://www.reddit.com/r/Wellington/',\n", 421 | " 'https://www.reddit.com/r/Wetshaving/',\n", 422 | " 'https://www.reddit.com/r/Windows10/',\n", 423 | " 'https://www.reddit.com/r/Winnipeg/',\n", 424 | " 'https://www.reddit.com/r/WorldOfWarships/',\n", 425 | " 'https://www.reddit.com/r/WorldofTanks/',\n", 426 | " 'https://www.reddit.com/r/Youniqueamua/',\n", 427 | " 'https://www.reddit.com/r/aSongOfMemesAndRage/',\n", 428 | " 'https://www.reddit.com/r/acne/',\n", 429 | " 'https://www.reddit.com/r/adventuretime/',\n", 
430 | " 'https://www.reddit.com/r/airsoft/',\n", 431 | " 'https://www.reddit.com/r/amateur_boxing/',\n", 432 | " 'https://www.reddit.com/r/anime/',\n", 433 | " 'https://www.reddit.com/r/anime_irl/',\n", 434 | " 'https://www.reddit.com/r/antelopevalley/',\n", 435 | " 'https://www.reddit.com/r/apple/',\n", 436 | " 'https://www.reddit.com/r/argentina/',\n", 437 | " 'https://www.reddit.com/r/arrow/',\n", 438 | " 'https://www.reddit.com/r/askTO/',\n", 439 | " 'https://www.reddit.com/r/askscience/',\n", 440 | " 'https://www.reddit.com/r/asoiaf/',\n", 441 | " 'https://www.reddit.com/r/australia/',\n", 442 | " 'https://www.reddit.com/r/awardtravel/',\n", 443 | " 'https://www.reddit.com/r/backpacking/',\n", 444 | " 'https://www.reddit.com/r/balisong/',\n", 445 | " 'https://www.reddit.com/r/barstoolsports/',\n", 446 | " 'https://www.reddit.com/r/baseball/',\n", 447 | " 'https://www.reddit.com/r/batman/',\n", 448 | " 'https://www.reddit.com/r/battlestations/',\n", 449 | " 'https://www.reddit.com/r/bayarea/',\n", 450 | " 'https://www.reddit.com/r/beards/',\n", 451 | " 'https://www.reddit.com/r/beauty/',\n", 452 | " 'https://www.reddit.com/r/berkeley/',\n", 453 | " 'https://www.reddit.com/r/bicycling/',\n", 454 | " 'https://www.reddit.com/r/bikecommuting/',\n", 455 | " 'https://www.reddit.com/r/bikewrench/',\n", 456 | " 'https://www.reddit.com/r/bjj/',\n", 457 | " 'https://www.reddit.com/r/blackmirror/',\n", 458 | " 'https://www.reddit.com/r/bleach/',\n", 459 | " 'https://www.reddit.com/r/boardgames/',\n", 460 | " 'https://www.reddit.com/r/bodybuilding/',\n", 461 | " 'https://www.reddit.com/r/bodyweightfitness/',\n", 462 | " 'https://www.reddit.com/r/books/',\n", 463 | " 'https://www.reddit.com/r/boostedboards/',\n", 464 | " 'https://www.reddit.com/r/bostonceltics/',\n", 465 | " 'https://www.reddit.com/r/brasil/',\n", 466 | " 'https://www.reddit.com/r/brasilivre/',\n", 467 | " 'https://www.reddit.com/r/breakingbad/',\n", 468 | " 'https://www.reddit.com/r/brisbane/',\n", 469 | " 'https://www.reddit.com/r/brooklynninenine/',\n", 470 | " 'https://www.reddit.com/r/buildapc/',\n", 471 | " 'https://www.reddit.com/r/burlington/',\n", 472 | " 'https://www.reddit.com/r/camping/',\n", 473 | " 'https://www.reddit.com/r/canada/',\n", 474 | " 'https://www.reddit.com/r/canucks/',\n", 475 | " 'https://www.reddit.com/r/cars/',\n", 476 | " 'https://www.reddit.com/r/chelseafc/',\n", 477 | " 'https://www.reddit.com/r/chile/',\n", 478 | " 'https://www.reddit.com/r/cirkeltrek/',\n", 479 | " 'https://www.reddit.com/r/classicwow/',\n", 480 | " 'https://www.reddit.com/r/climbing/',\n", 481 | " 'https://www.reddit.com/r/community/',\n", 482 | " 'https://www.reddit.com/r/confession/',\n", 483 | " 'https://www.reddit.com/r/cordcutters/',\n", 484 | " 'https://www.reddit.com/r/cowboys/',\n", 485 | " 'https://www.reddit.com/r/coys/',\n", 486 | " 'https://www.reddit.com/r/criterion/',\n", 487 | " 'https://www.reddit.com/r/croatia/',\n", 488 | " 'https://www.reddit.com/r/crossfit/',\n", 489 | " 'https://www.reddit.com/r/cscareerquestions/',\n", 490 | " 'https://www.reddit.com/r/curlyhair/',\n", 491 | " 'https://www.reddit.com/r/cycling/',\n", 492 | " 'https://www.reddit.com/r/danganronpa/',\n", 493 | " 'https://www.reddit.com/r/dauntless/',\n", 494 | " 'https://www.reddit.com/r/dbz/',\n", 495 | " 'https://www.reddit.com/r/de/',\n", 496 | " 'https://www.reddit.com/r/deadbydaylight/',\n", 497 | " 'https://www.reddit.com/r/denvernuggets/',\n", 498 | " 'https://www.reddit.com/r/destiny2/',\n", 499 | " 
'https://www.reddit.com/r/detroitlions/',\n", 500 | " 'https://www.reddit.com/r/diabetes/',\n", 501 | " 'https://www.reddit.com/r/diabetes_t1/',\n", 502 | " 'https://www.reddit.com/r/discgolf/',\n", 503 | " 'https://www.reddit.com/r/discordapp/',\n", 504 | " 'https://www.reddit.com/r/disney/',\n", 505 | " 'https://www.reddit.com/r/dndmemes/',\n", 506 | " 'https://www.reddit.com/r/dndnext/',\n", 507 | " 'https://www.reddit.com/r/doctorwho/',\n", 508 | " 'https://www.reddit.com/r/dubai/',\n", 509 | " 'https://www.reddit.com/r/eagles/',\n", 510 | " 'https://www.reddit.com/r/ehlersdanlos/',\n", 511 | " 'https://www.reddit.com/r/elderscrollsonline/',\n", 512 | " 'https://www.reddit.com/r/eu4/',\n", 513 | " 'https://www.reddit.com/r/europe/',\n", 514 | " 'https://www.reddit.com/r/explainlikeimfive/',\n", 515 | " 'https://www.reddit.com/r/fairytail/',\n", 516 | " 'https://www.reddit.com/r/fantasybaseball/',\n", 517 | " 'https://www.reddit.com/r/fantasyfootball/',\n", 518 | " 'https://www.reddit.com/r/fasting/',\n", 519 | " 'https://www.reddit.com/r/femalefashionadvice/',\n", 520 | " 'https://www.reddit.com/r/femalehairadvice/',\n", 521 | " 'https://www.reddit.com/r/ffxiv/',\n", 522 | " 'https://www.reddit.com/r/findfashion/',\n", 523 | " 'https://www.reddit.com/r/fireemblem/',\n", 524 | " 'https://www.reddit.com/r/fivenightsatfreddys/',\n", 525 | " 'https://www.reddit.com/r/flexibility/',\n", 526 | " 'https://www.reddit.com/r/flightsim/',\n", 527 | " 'https://www.reddit.com/r/flyfishing/',\n", 528 | " 'https://www.reddit.com/r/fo76/',\n", 529 | " 'https://www.reddit.com/r/footballmanagergames/',\n", 530 | " 'https://www.reddit.com/r/forhonor/',\n", 531 | " 'https://www.reddit.com/r/formula1/',\n", 532 | " 'https://www.reddit.com/r/fragrance/',\n", 533 | " 'https://www.reddit.com/r/france/',\n", 534 | " 'https://www.reddit.com/r/freefolk/',\n", 535 | " 'https://www.reddit.com/r/frugalmalefashion/',\n", 536 | " 'https://www.reddit.com/r/futurama/',\n", 537 | " 'https://www.reddit.com/r/future_fight/',\n", 538 | " 'https://www.reddit.com/r/gainit/',\n", 539 | " 'https://www.reddit.com/r/gameofthrones/',\n", 540 | " 'https://www.reddit.com/r/germany/',\n", 541 | " 'https://www.reddit.com/r/girlsfrontline/',\n", 542 | " 'https://www.reddit.com/r/golf/',\n", 543 | " 'https://www.reddit.com/r/goodyearwelt/',\n", 544 | " 'https://www.reddit.com/r/grandorder/',\n", 545 | " 'https://www.reddit.com/r/greece/',\n", 546 | " 'https://www.reddit.com/r/greysanatomy/',\n", 547 | " 'https://www.reddit.com/r/gtaonline/',\n", 548 | " 'https://www.reddit.com/r/halifax/',\n", 549 | " 'https://www.reddit.com/r/halo/',\n", 550 | " 'https://www.reddit.com/r/headphones/',\n", 551 | " 'https://www.reddit.com/r/hearthstone/',\n", 552 | " 'https://www.reddit.com/r/heroesofthestorm/',\n", 553 | " 'https://www.reddit.com/r/hiking/',\n", 554 | " 'https://www.reddit.com/r/hockey/',\n", 555 | " 'https://www.reddit.com/r/hockeyjerseys/',\n", 556 | " 'https://www.reddit.com/r/hockeyplayers/',\n", 557 | " 'https://www.reddit.com/r/houston/',\n", 558 | " 'https://www.reddit.com/r/howardstern/',\n", 559 | " 'https://www.reddit.com/r/hungary/',\n", 560 | " 'https://www.reddit.com/r/india/',\n", 561 | " 'https://www.reddit.com/r/indonesia/',\n", 562 | " 'https://www.reddit.com/r/intermittentfasting/',\n", 563 | " 'https://www.reddit.com/r/iphone/',\n", 564 | " 'https://www.reddit.com/r/ireland/',\n", 565 | " 'https://www.reddit.com/r/italy/',\n", 566 | " 'https://www.reddit.com/r/jailbreak/',\n", 567 | " 
'https://www.reddit.com/r/japanesestreetwear/',\n", 568 | " 'https://www.reddit.com/r/japanlife/',\n", 569 | " 'https://www.reddit.com/r/jobs/',\n", 570 | " 'https://www.reddit.com/r/kansascity/',\n", 571 | " 'https://www.reddit.com/r/keto/',\n", 572 | " 'https://www.reddit.com/r/korea/',\n", 573 | " 'https://www.reddit.com/r/lakers/',\n", 574 | " 'https://www.reddit.com/r/leafs/',\n", 575 | " 'https://www.reddit.com/r/leagueoflegends/',\n", 576 | " 'https://www.reddit.com/r/leangains/',\n", 577 | " 'https://www.reddit.com/r/learnprogramming/',\n", 578 | " 'https://www.reddit.com/r/learnpython/',\n", 579 | " 'https://www.reddit.com/r/legaladvice/',\n", 580 | " 'https://www.reddit.com/r/longboarding/',\n", 581 | " 'https://www.reddit.com/r/loseit/',\n", 582 | " 'https://www.reddit.com/r/lucifer/',\n", 583 | " 'https://www.reddit.com/r/makeupexchange/',\n", 584 | " 'https://www.reddit.com/r/malaysia/',\n", 585 | " 'https://www.reddit.com/r/malefashion/',\n", 586 | " 'https://www.reddit.com/r/malefashionadvice/',\n", 587 | " 'https://www.reddit.com/r/malehairadvice/',\n", 588 | " 'https://www.reddit.com/r/malelivingspace/',\n", 589 | " 'https://www.reddit.com/r/marvelmemes/',\n", 590 | " 'https://www.reddit.com/r/marvelstudios/',\n", 591 | " 'https://www.reddit.com/r/medical_advice/',\n", 592 | " 'https://www.reddit.com/r/melbourne/',\n", 593 | " 'https://www.reddit.com/r/memes/',\n", 594 | " 'https://www.reddit.com/r/mexico/',\n", 595 | " 'https://www.reddit.com/r/migraine/',\n", 596 | " 'https://www.reddit.com/r/minnesotatwins/',\n", 597 | " 'https://www.reddit.com/r/minnesotavikings/',\n", 598 | " 'https://www.reddit.com/r/mw4/',\n", 599 | " 'https://www.reddit.com/r/mylittlepony/',\n", 600 | " 'https://www.reddit.com/r/nashville/',\n", 601 | " 'https://www.reddit.com/r/nattyorjuice/',\n", 602 | " 'https://www.reddit.com/r/nba/',\n", 603 | " 'https://www.reddit.com/r/nbadiscussion/',\n", 604 | " 'https://www.reddit.com/r/netflix/',\n", 605 | " 'https://www.reddit.com/r/newsokur/',\n", 606 | " 'https://www.reddit.com/r/newzealand/',\n", 607 | " 'https://www.reddit.com/r/nfl/',\n", 608 | " 'https://www.reddit.com/r/nhl/',\n", 609 | " 'https://www.reddit.com/r/norge/',\n", 610 | " 'https://www.reddit.com/r/nosleep/',\n", 611 | " 'https://www.reddit.com/r/nova/',\n", 612 | " 'https://www.reddit.com/r/nrl/',\n", 613 | " 'https://www.reddit.com/r/nunavut/',\n", 614 | " 'https://www.reddit.com/r/nutrition/',\n", 615 | " 'https://www.reddit.com/r/nvidia/',\n", 616 | " 'https://www.reddit.com/r/nyjets/',\n", 617 | " 'https://www.reddit.com/r/omad/',\n", 618 | " 'https://www.reddit.com/r/orangecounty/',\n", 619 | " 'https://www.reddit.com/r/orangetheory/',\n", 620 | " 'https://www.reddit.com/r/osugame/',\n", 621 | " 'https://www.reddit.com/r/ottawa/',\n", 622 | " 'https://www.reddit.com/r/overlord/',\n", 623 | " 'https://www.reddit.com/r/pathofexile/',\n", 624 | " 'https://www.reddit.com/r/pcmasterrace/',\n", 625 | " 'https://www.reddit.com/r/peloton/',\n", 626 | " 'https://www.reddit.com/r/pesmobile/',\n", 627 | " 'https://www.reddit.com/r/philadelphia/',\n", 628 | " 'https://www.reddit.com/r/phillies/',\n", 629 | " 'https://www.reddit.com/r/phoenix/',\n", 630 | " 'https://www.reddit.com/r/pics/',\n", 631 | " 'https://www.reddit.com/r/pics/?f=flair_name%3A%22Politics%22',\n", 632 | " 'https://www.reddit.com/r/pics/comments/g1k7qr/well_america_this_explains_it/',\n", 633 | " 'https://www.reddit.com/r/piercing/',\n", 634 | " 'https://www.reddit.com/r/pittsburgh/',\n", 635 | " 
'https://www.reddit.com/r/playrust/',\n", 636 | " 'https://www.reddit.com/r/podemos/',\n", 637 | " 'https://www.reddit.com/r/pokemon/',\n", 638 | " 'https://www.reddit.com/r/pokemongo/',\n", 639 | " 'https://www.reddit.com/r/pokemontrades/',\n", 640 | " 'https://www.reddit.com/r/portugal/',\n", 641 | " 'https://www.reddit.com/r/poshmark/',\n", 642 | " 'https://www.reddit.com/r/powerlifting/',\n", 643 | " 'https://www.reddit.com/r/progresspics/',\n", 644 | " 'https://www.reddit.com/r/raleigh/',\n", 645 | " 'https://www.reddit.com/r/ravens/',\n", 646 | " 'https://www.reddit.com/r/rawdenim/',\n", 647 | " 'https://www.reddit.com/r/realmadrid/',\n", 648 | " 'https://www.reddit.com/r/reddeadredemption/',\n", 649 | " 'https://www.reddit.com/r/reddevils/',\n", 650 | " 'https://www.reddit.com/r/redsox/',\n", 651 | " 'https://www.reddit.com/r/relationship_advice/',\n", 652 | " 'https://www.reddit.com/r/rickandmorty/',\n", 653 | " 'https://www.reddit.com/r/ripcity/',\n", 654 | " 'https://www.reddit.com/r/riverdale/',\n", 655 | " 'https://www.reddit.com/r/roadtrip/',\n", 656 | " 'https://www.reddit.com/r/rolex/',\n", 657 | " 'https://www.reddit.com/r/rollercoasters/',\n", 658 | " 'https://www.reddit.com/r/rpdrcringe/',\n", 659 | " 'https://www.reddit.com/r/rugbyunion/',\n", 660 | " 'https://www.reddit.com/r/runescape/',\n", 661 | " 'https://www.reddit.com/r/running/',\n", 662 | " 'https://www.reddit.com/r/rupaulsdragrace/',\n", 663 | " 'https://www.reddit.com/r/rva/',\n", 664 | " 'https://www.reddit.com/r/sanantonio/',\n", 665 | " 'https://www.reddit.com/r/sandiego/',\n", 666 | " 'https://www.reddit.com/r/sanfrancisco/',\n", 667 | " 'https://www.reddit.com/r/saskatoon/',\n", 668 | " 'https://www.reddit.com/r/scifi/',\n", 669 | " 'https://www.reddit.com/r/seinfeld/',\n", 670 | " 'https://www.reddit.com/r/serbia/',\n", 671 | " 'https://www.reddit.com/r/shield/',\n", 672 | " 'https://www.reddit.com/r/singapore/',\n", 673 | " 'https://www.reddit.com/r/sixers/',\n", 674 | " 'https://www.reddit.com/r/skiing/',\n", 675 | " 'https://www.reddit.com/r/skyrim/',\n", 676 | " 'https://www.reddit.com/r/smashbros/',\n", 677 | " 'https://www.reddit.com/r/sneakermarket/',\n", 678 | " 'https://www.reddit.com/r/snowboarding/',\n", 679 | " 'https://www.reddit.com/r/soccer/',\n", 680 | " 'https://www.reddit.com/r/solotravel/',\n", 681 | " 'https://www.reddit.com/r/southpark/',\n", 682 | " 'https://www.reddit.com/r/sports/',\n", 683 | " 'https://www.reddit.com/r/sportsbook/',\n", 684 | " 'https://www.reddit.com/r/starbucks/',\n", 685 | " 'https://www.reddit.com/r/starcitizen/',\n", 686 | " 'https://www.reddit.com/r/startrek/',\n", 687 | " 'https://www.reddit.com/r/steelers/',\n", 688 | " 'https://www.reddit.com/r/stevenuniverse/',\n", 689 | " 'https://www.reddit.com/r/stlouisblues/',\n", 690 | " 'https://www.reddit.com/r/streetwearstartup/',\n", 691 | " 'https://www.reddit.com/r/summonerswar/',\n", 692 | " 'https://www.reddit.com/r/suns/',\n", 693 | " 'https://www.reddit.com/r/survivor/',\n", 694 | " 'https://www.reddit.com/r/sweden/',\n", 695 | " 'https://www.reddit.com/r/swoleacceptance/',\n", 696 | " 'https://www.reddit.com/r/sydney/',\n", 697 | " 'https://www.reddit.com/r/sysadmin/',\n", 698 | " 'https://www.reddit.com/r/tampabayrays/',\n", 699 | " 'https://www.reddit.com/r/tattoos/',\n", 700 | " 'https://www.reddit.com/r/techsupport/',\n", 701 | " 'https://www.reddit.com/r/tennis/',\n", 702 | " 'https://www.reddit.com/r/tf2/',\n", 703 | " 'https://www.reddit.com/r/the100/',\n", 704 | " 
'https://www.reddit.com/r/thebachelor/',\n", 705 | " 'https://www.reddit.com/r/thedivision/',\n", 706 | " 'https://www.reddit.com/r/thenetherlands/',\n", 707 | " 'https://www.reddit.com/r/thesims/',\n", 708 | " 'https://www.reddit.com/r/thesopranos/',\n", 709 | " 'https://www.reddit.com/r/thewalkingdead/',\n", 710 | " 'https://www.reddit.com/r/tipofmytongue/',\n", 711 | " 'https://www.reddit.com/r/titanfolk/',\n", 712 | " 'https://www.reddit.com/r/todayilearned/',\n", 713 | " 'https://www.reddit.com/r/torontoraptors/',\n", 714 | " 'https://www.reddit.com/r/totalwar/',\n", 715 | " 'https://www.reddit.com/r/touhou/',\n", 716 | " 'https://www.reddit.com/r/trailerparkboys/',\n", 717 | " 'https://www.reddit.com/r/translator/',\n", 718 | " 'https://www.reddit.com/r/travel/',\n", 719 | " 'https://www.reddit.com/r/vagabond/',\n", 720 | " 'https://www.reddit.com/r/vancouver/',\n", 721 | " 'https://www.reddit.com/r/vanderpumprules/',\n", 722 | " 'https://www.reddit.com/r/vegan/',\n", 723 | " 'https://www.reddit.com/r/videos/',\n", 724 | " 'https://www.reddit.com/r/vzla/',\n", 725 | " 'https://www.reddit.com/r/warriors/',\n", 726 | " 'https://www.reddit.com/r/weightroom/',\n", 727 | " 'https://www.reddit.com/r/westworld/',\n", 728 | " 'https://www.reddit.com/r/wicked_edge/',\n", 729 | " 'https://www.reddit.com/r/wow/',\n", 730 | " 'https://www.reddit.com/r/xboxone/',\n", 731 | " 'https://www.reddit.com/r/xxfitness/',\n", 732 | " 'https://www.reddit.com/r/yeezys/',\n", 733 | " 'https://www.reddit.com/r/yoga/',\n", 734 | " 'https://www.reddit.com/r/yugioh/',\n", 735 | " 'https://www.reddit.com/r/zerocarb/',\n", 736 | " 'https://www.reddit.com/rpan/',\n", 737 | " 'https://www.reddit.com/subreddits/leaderboard/up-and-coming',\n", 738 | " 'https://www.reddit.com/user/Barknuckle/',\n", 739 | " 'https://www.reddit.com/user/Frocharocha/',\n", 740 | " 'https://www.reddit.com/user/Magistrex/',\n", 741 | " 'https://www.reddit.com/user/PoliticsModeratorBot/',\n", 742 | " 'https://www.reddit.com/user/Ra75b/',\n", 743 | " 'https://www.reddit.com/user/TheVirginVibes/',\n", 744 | " 'https://www.reddit.com/user/frozenHelen/'}" 745 | ] 746 | }, 747 | "execution_count": 15, 748 | "metadata": {}, 749 | "output_type": "execute_result" 750 | } 751 | ], 752 | "source": [ 753 | "new_urls.difference(urls)" 754 | ] 755 | }, 756 | { 757 | "cell_type": "code", 758 | "execution_count": 16, 759 | "metadata": {}, 760 | "outputs": [ 761 | { 762 | "data": { 763 | "text/plain": [ 764 | "" 765 | ] 766 | }, 767 | "execution_count": 16, 768 | "metadata": {}, 769 | "output_type": "execute_result" 770 | } 771 | ], 772 | "source": [ 773 | "session.close()" 774 | ] 775 | } 776 | ], 777 | "metadata": { 778 | "kernelspec": { 779 | "display_name": "Python 3", 780 | "language": "python", 781 | "name": "python3" 782 | }, 783 | "language_info": { 784 | "codemirror_mode": { 785 | "name": "ipython", 786 | "version": 3 787 | }, 788 | "file_extension": ".py", 789 | "mimetype": "text/x-python", 790 | "name": "python", 791 | "nbconvert_exporter": "python", 792 | "pygments_lexer": "ipython3", 793 | "version": "3.7.6" 794 | } 795 | }, 796 | "nbformat": 4, 797 | "nbformat_minor": 4 798 | } 799 | -------------------------------------------------------------------------------- /10.The Requests-HTML Package/Section 10 - Scraping JavaScript.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Scraping data generated by 
JavaScript" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 1, 13 | "metadata": {}, 14 | "outputs": [], 15 | "source": [ 16 | "# When coding in Jupyter and Spyder, we need to use the class AsyncHTMLSession to make JavaScript work\n", 17 | "# In other environments you can use the normal HTMLSession\n", 18 | "from requests_html import AsyncHTMLSession" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "# establish a new asynchronous session\n", 28 | "session = AsyncHTMLSession()\n", 29 | "\n", 30 | "# The only difference we will experience between the regular HTML Session and the asynchronous one,\n", 31 | "# is the need to write the keyword 'await' in front of some statements" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 3, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "# In this example we're going to use Nike's homepage: https://www.reddit.com/\n", 41 | "# Several of the links on this page, as well as other elements, are generated by JavaScript\n", 42 | "# We will compare the result of scraping those before and after running the JavaScript code" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 4, 48 | "metadata": {}, 49 | "outputs": [ 50 | { 51 | "data": { 52 | "text/plain": [ 53 | "200" 54 | ] 55 | }, 56 | "execution_count": 4, 57 | "metadata": {}, 58 | "output_type": "execute_result" 59 | } 60 | ], 61 | "source": [ 62 | "# Since we used async session, we need to use the keyword 'await'\n", 63 | "# If you use the regular HTMLSession, there is no need for 'await'\n", 64 | "r = await session.get(\"https://www.reddit.com/\")\n", 65 | "r.status_code" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 5, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "# So far, nothing different from our previous example has happened\n", 75 | "# The JavaScript code has not yet been executed" 76 | ] 77 | }, 78 | { 79 | "cell_type": "code", 80 | "execution_count": 6, 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "# Here are some tags obtained before rendering the JavaScript code, i.e. extarcted from the raw HTML\n", 85 | "divs = r.html.find(\"div\")\n", 86 | "links = r.html.find(\"a\")\n", 87 | "urls = r.html.absolute_links" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 7, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "# Now, we need to execute the JavaScript code that will generate additional tags" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 8, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "# The requests-html package provides a very simple interface for that - just use the 'render()' method\n", 106 | "# ('arender()' when using async session)\n", 107 | "# It runs the JavaScript code which updates the HTML. 
This may take a bit\n", 108 | "# The updated HTML is stored in the old variable 'r.html' - you do not need to assign the result to a new variable\n", 109 | "# As before, the 'await' keyword is supplied only because of the async session\n", 110 | "await r.html.arender()" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 9, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "# NOTE: The first time you run 'render()' (or 'arender()'), Chromium will be downloaded and installed on your computer" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 10, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [ 128 | "# Now the HTML is updated and we can search for the same tags again\n", 129 | "new_divs = r.html.find(\"div\")\n", 130 | "new_links = r.html.find(\"a\")\n", 131 | "new_urls = r.html.absolute_links" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 11, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "# We can see the difference in the number of found elements before and after the JavaScript was executed" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 12, 146 | "metadata": {}, 147 | "outputs": [ 148 | { 149 | "data": { 150 | "text/plain": [ 151 | "(543, 1728)" 152 | ] 153 | }, 154 | "execution_count": 12, 155 | "metadata": {}, 156 | "output_type": "execute_result" 157 | } 158 | ], 159 | "source": [ 160 | "len(divs), len(new_divs)" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 13, 166 | "metadata": {}, 167 | "outputs": [ 168 | { 169 | "data": { 170 | "text/plain": [ 171 | "(87, 681)" 172 | ] 173 | }, 174 | "execution_count": 13, 175 | "metadata": {}, 176 | "output_type": "execute_result" 177 | } 178 | ], 179 | "source": [ 180 | "len(links), len(new_links)" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 14, 186 | "metadata": {}, 187 | "outputs": [ 188 | { 189 | "data": { 190 | "text/plain": [ 191 | "(58, 640)" 192 | ] 193 | }, 194 | "execution_count": 14, 195 | "metadata": {}, 196 | "output_type": "execute_result" 197 | } 198 | ], 199 | "source": [ 200 | "len(urls), len(new_urls)" 201 | ] 202 | }, 203 | { 204 | "cell_type": "code", 205 | "execution_count": 15, 206 | "metadata": {}, 207 | "outputs": [], 208 | "source": [ 209 | "# Remember that 'urls' is a set, and not a list?\n", 210 | "# Well, there is a useful feature of sets that we will now take advantage of\n", 211 | "# It takes two sets and selects only those items from the first set that are not present in the second one" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 16, 217 | "metadata": { 218 | "scrolled": true 219 | }, 220 | "outputs": [ 221 | { 222 | "data": { 223 | "text/plain": [ 224 | "{'https://i.imgur.com/nMhodgS.gifv',\n", 225 | " 'https://www.reddit.com/r/1200isplenty/',\n", 226 | " 'https://www.reddit.com/r/2007scape/',\n", 227 | " 'https://www.reddit.com/r/49ers/',\n", 228 | " 'https://www.reddit.com/r/90DayFiance/',\n", 229 | " 'https://www.reddit.com/r/ACMilan/',\n", 230 | " 'https://www.reddit.com/r/Adelaide/',\n", 231 | " 'https://www.reddit.com/r/Amd/',\n", 232 | " 'https://www.reddit.com/r/Android/',\n", 233 | " 'https://www.reddit.com/r/Animesuggest/',\n", 234 | " 'https://www.reddit.com/r/AnthemTheGame/',\n", 235 | " 'https://www.reddit.com/r/AskCulinary/',\n", 236 | " 'https://www.reddit.com/r/AskMen/',\n", 237 | " 'https://www.reddit.com/r/AskNYC/',\n", 238 | " 
'https://www.reddit.com/r/AskReddit/',\n", 239 | " 'https://www.reddit.com/r/AskWomen/',\n", 240 | " 'https://www.reddit.com/r/Astros/',\n", 241 | " 'https://www.reddit.com/r/Atlanta/',\n", 242 | " 'https://www.reddit.com/r/AtlantaUnited/',\n", 243 | " 'https://www.reddit.com/r/Augusta/',\n", 244 | " 'https://www.reddit.com/r/Austria/',\n", 245 | " 'https://www.reddit.com/r/Barca/',\n", 246 | " 'https://www.reddit.com/r/BattlefieldV/',\n", 247 | " 'https://www.reddit.com/r/BeautyBoxes/',\n", 248 | " 'https://www.reddit.com/r/BeautyGuruChatter/',\n", 249 | " 'https://www.reddit.com/r/Bend/',\n", 250 | " 'https://www.reddit.com/r/Berserk/',\n", 251 | " 'https://www.reddit.com/r/BigBrother/',\n", 252 | " 'https://www.reddit.com/r/BlackClover/',\n", 253 | " 'https://www.reddit.com/r/Blackops4/',\n", 254 | " 'https://www.reddit.com/r/BoJackHorseman/',\n", 255 | " 'https://www.reddit.com/r/BokuNoHeroAcademia/',\n", 256 | " 'https://www.reddit.com/r/Boruto/',\n", 257 | " 'https://www.reddit.com/r/BostonBruins/',\n", 258 | " 'https://www.reddit.com/r/Boxing/',\n", 259 | " 'https://www.reddit.com/r/Braves/',\n", 260 | " 'https://www.reddit.com/r/BravoRealHousewives/',\n", 261 | " 'https://www.reddit.com/r/Brawlstars/',\n", 262 | " 'https://www.reddit.com/r/Breath_of_the_Wild/',\n", 263 | " 'https://www.reddit.com/r/Brogress/',\n", 264 | " 'https://www.reddit.com/r/Browns/',\n", 265 | " 'https://www.reddit.com/r/C25K/',\n", 266 | " 'https://www.reddit.com/r/CFB/',\n", 267 | " 'https://www.reddit.com/r/CHIBears/',\n", 268 | " 'https://www.reddit.com/r/CHICubs/',\n", 269 | " 'https://www.reddit.com/r/Calgary/',\n", 270 | " 'https://www.reddit.com/r/CampingGear/',\n", 271 | " 'https://www.reddit.com/r/CampingandHiking/',\n", 272 | " 'https://www.reddit.com/r/Cardinals/',\n", 273 | " 'https://www.reddit.com/r/CasualUK/',\n", 274 | " 'https://www.reddit.com/r/Charlotte/',\n", 275 | " 'https://www.reddit.com/r/China/',\n", 276 | " 'https://www.reddit.com/r/ClashOfClans/',\n", 277 | " 'https://www.reddit.com/r/ClashRoyale/',\n", 278 | " 'https://www.reddit.com/r/CoDCompetitive/',\n", 279 | " 'https://www.reddit.com/r/CollegeBasketball/',\n", 280 | " 'https://www.reddit.com/r/Columbus/',\n", 281 | " 'https://www.reddit.com/r/Competitiveoverwatch/',\n", 282 | " 'https://www.reddit.com/r/Cooking/',\n", 283 | " 'https://www.reddit.com/r/Cricket/',\n", 284 | " 'https://www.reddit.com/r/CrohnsDisease/',\n", 285 | " 'https://www.reddit.com/r/CrusaderKings/',\n", 286 | " 'https://www.reddit.com/r/DBZDokkanBattle/',\n", 287 | " 'https://www.reddit.com/r/DMAcademy/',\n", 288 | " 'https://www.reddit.com/r/Dallas/',\n", 289 | " 'https://www.reddit.com/r/DanLeBatardShow/',\n", 290 | " 'https://www.reddit.com/r/DaysGone/',\n", 291 | " 'https://www.reddit.com/r/Denmark/',\n", 292 | " 'https://www.reddit.com/r/Denver/',\n", 293 | " 'https://www.reddit.com/r/Destiny/',\n", 294 | " 'https://www.reddit.com/r/DestinyTheGame/',\n", 295 | " 'https://www.reddit.com/r/Detroit/',\n", 296 | " 'https://www.reddit.com/r/Disneyland/',\n", 297 | " 'https://www.reddit.com/r/DnD/',\n", 298 | " 'https://www.reddit.com/r/Dodgers/',\n", 299 | " 'https://www.reddit.com/r/DotA2/',\n", 300 | " 'https://www.reddit.com/r/DuelLinks/',\n", 301 | " 'https://www.reddit.com/r/DunderMifflin/',\n", 302 | " 'https://www.reddit.com/r/DynastyFF/',\n", 303 | " 'https://www.reddit.com/r/EDH/',\n", 304 | " 'https://www.reddit.com/r/EDanonymemes/',\n", 305 | " 'https://www.reddit.com/r/EOOD/',\n", 306 | " 'https://www.reddit.com/r/EatCheapAndHealthy/',\n", 
307 | " 'https://www.reddit.com/r/Edmonton/',\n", 308 | " 'https://www.reddit.com/r/EliteDangerous/',\n", 309 | " 'https://www.reddit.com/r/EscapefromTarkov/',\n", 310 | " 'https://www.reddit.com/r/Eve/',\n", 311 | " 'https://www.reddit.com/r/FFBraveExvius/',\n", 312 | " 'https://www.reddit.com/r/FIFA/',\n", 313 | " 'https://www.reddit.com/r/FORTnITE/',\n", 314 | " 'https://www.reddit.com/r/FUTMobile/',\n", 315 | " 'https://www.reddit.com/r/Fallout/',\n", 316 | " 'https://www.reddit.com/r/FantasyPL/',\n", 317 | " 'https://www.reddit.com/r/FireEmblemHeroes/',\n", 318 | " 'https://www.reddit.com/r/Fishing/',\n", 319 | " 'https://www.reddit.com/r/Fitness/',\n", 320 | " 'https://www.reddit.com/r/FixedGearBicycle/',\n", 321 | " 'https://www.reddit.com/r/FlashTV/',\n", 322 | " 'https://www.reddit.com/r/FortNiteBR/',\n", 323 | " 'https://www.reddit.com/r/FortniteCompetitive/',\n", 324 | " 'https://www.reddit.com/r/Frugal/',\n", 325 | " 'https://www.reddit.com/r/GameOfThronesMemes/',\n", 326 | " 'https://www.reddit.com/r/Gamingcirclejerk/',\n", 327 | " 'https://www.reddit.com/r/GetMotivated/',\n", 328 | " 'https://www.reddit.com/r/Glitch_in_the_Matrix/',\n", 329 | " 'https://www.reddit.com/r/GlobalOffensive/',\n", 330 | " 'https://www.reddit.com/r/GlobalOffensiveTrade/',\n", 331 | " 'https://www.reddit.com/r/GooglePixel/',\n", 332 | " 'https://www.reddit.com/r/GreenBayPackers/',\n", 333 | " 'https://www.reddit.com/r/Grimdank/',\n", 334 | " 'https://www.reddit.com/r/Guildwars2/',\n", 335 | " 'https://www.reddit.com/r/Gundam/',\n", 336 | " 'https://www.reddit.com/r/HBOGameofThrones/',\n", 337 | " 'https://www.reddit.com/r/Hair/',\n", 338 | " 'https://www.reddit.com/r/HealthyFood/',\n", 339 | " 'https://www.reddit.com/r/HomeImprovement/',\n", 340 | " 'https://www.reddit.com/r/IASIP/',\n", 341 | " 'https://www.reddit.com/r/IAmA/',\n", 342 | " 'https://www.reddit.com/r/IWantOut/',\n", 343 | " 'https://www.reddit.com/r/ImaginaryWesteros/',\n", 344 | " 'https://www.reddit.com/r/Indiemakeupandmore/',\n", 345 | " 'https://www.reddit.com/r/Instagram/',\n", 346 | " 'https://www.reddit.com/r/Israel/',\n", 347 | " 'https://www.reddit.com/r/JapanTravel/',\n", 348 | " 'https://www.reddit.com/r/Jeopardy/',\n", 349 | " 'https://www.reddit.com/r/JoshuaTree/',\n", 350 | " 'https://www.reddit.com/r/Konosuba/',\n", 351 | " 'https://www.reddit.com/r/LearnJapanese/',\n", 352 | " 'https://www.reddit.com/r/LegendsOfTomorrow/',\n", 353 | " 'https://www.reddit.com/r/LifeProTips/',\n", 354 | " 'https://www.reddit.com/r/LigaMX/',\n", 355 | " 'https://www.reddit.com/r/LiverpoolFC/',\n", 356 | " 'https://www.reddit.com/r/LivestreamFail/',\n", 357 | " 'https://www.reddit.com/r/LosAngelesRams/',\n", 358 | " 'https://www.reddit.com/r/LushCosmetics/',\n", 359 | " 'https://www.reddit.com/r/MCFC/',\n", 360 | " 'https://www.reddit.com/r/MLBTheShow/',\n", 361 | " 'https://www.reddit.com/r/MLS/',\n", 362 | " 'https://www.reddit.com/r/MMA/',\n", 363 | " 'https://www.reddit.com/r/MTB/',\n", 364 | " 'https://www.reddit.com/r/MUAontheCheap/',\n", 365 | " 'https://www.reddit.com/r/MagicArena/',\n", 366 | " 'https://www.reddit.com/r/Makeup/',\n", 367 | " 'https://www.reddit.com/r/MakeupAddiction/',\n", 368 | " 'https://www.reddit.com/r/MakingaMurderer/',\n", 369 | " 'https://www.reddit.com/r/Market76/',\n", 370 | " 'https://www.reddit.com/r/MarvelStrikeForce/',\n", 371 | " 'https://www.reddit.com/r/Mavericks/',\n", 372 | " 'https://www.reddit.com/r/Minecraft/',\n", 373 | " 'https://www.reddit.com/r/Minneapolis/',\n", 374 | " 
'https://www.reddit.com/r/MkeBucks/',\n", 375 | " 'https://www.reddit.com/r/ModernMagic/',\n", 376 | " 'https://www.reddit.com/r/MonsterHunterWorld/',\n", 377 | " 'https://www.reddit.com/r/Mordhau/',\n", 378 | " 'https://www.reddit.com/r/MortalKombat/',\n", 379 | " 'https://www.reddit.com/r/MtvChallenge/',\n", 380 | " 'https://www.reddit.com/r/Music/',\n", 381 | " 'https://www.reddit.com/r/NBA2k/',\n", 382 | " 'https://www.reddit.com/r/NBASpurs/',\n", 383 | " 'https://www.reddit.com/r/NFA/',\n", 384 | " 'https://www.reddit.com/r/NHLHUT/',\n", 385 | " 'https://www.reddit.com/r/NYKnicks/',\n", 386 | " 'https://www.reddit.com/r/NYYankees/',\n", 387 | " 'https://www.reddit.com/r/Naruto/',\n", 388 | " 'https://www.reddit.com/r/Nationals/',\n", 389 | " 'https://www.reddit.com/r/Nerf/',\n", 390 | " 'https://www.reddit.com/r/NetflixBestOf/',\n", 391 | " 'https://www.reddit.com/r/NewOrleans/',\n", 392 | " 'https://www.reddit.com/r/NewSkaters/',\n", 393 | " 'https://www.reddit.com/r/NewYorkMets/',\n", 394 | " 'https://www.reddit.com/r/NintendoSwitch/',\n", 395 | " 'https://www.reddit.com/r/NoMansSkyTheGame/',\n", 396 | " 'https://www.reddit.com/r/NoStupidQuestions/',\n", 397 | " 'https://www.reddit.com/r/OnePiece/',\n", 398 | " 'https://www.reddit.com/r/OutOfTheLoop/',\n", 399 | " 'https://www.reddit.com/r/Overwatch/',\n", 400 | " 'https://www.reddit.com/r/PS4/',\n", 401 | " 'https://www.reddit.com/r/PSVR/',\n", 402 | " 'https://www.reddit.com/r/PUBATTLEGROUNDS/',\n", 403 | " 'https://www.reddit.com/r/PUBGMobile/',\n", 404 | " 'https://www.reddit.com/r/Paladins/',\n", 405 | " 'https://www.reddit.com/r/PanPorn/',\n", 406 | " 'https://www.reddit.com/r/PandR/',\n", 407 | " 'https://www.reddit.com/r/Patriots/',\n", 408 | " 'https://www.reddit.com/r/Persona5/',\n", 409 | " 'https://www.reddit.com/r/Philippines/',\n", 410 | " 'https://www.reddit.com/r/Planetside/',\n", 411 | " 'https://www.reddit.com/r/Polska/',\n", 412 | " 'https://www.reddit.com/r/Portland/',\n", 413 | " 'https://www.reddit.com/r/Quebec/',\n", 414 | " 'https://www.reddit.com/r/RWBY/',\n", 415 | " 'https://www.reddit.com/r/Rainbow6/',\n", 416 | " 'https://www.reddit.com/r/RedDeadOnline/',\n", 417 | " 'https://www.reddit.com/r/RedditLaqueristas/',\n", 418 | " 'https://www.reddit.com/r/RepLadiesBST/',\n", 419 | " 'https://www.reddit.com/r/Repsneakers/',\n", 420 | " 'https://www.reddit.com/r/RimWorld/',\n", 421 | " 'https://www.reddit.com/r/RocketLeague/',\n", 422 | " 'https://www.reddit.com/r/RocketLeagueExchange/',\n", 423 | " 'https://www.reddit.com/r/Romania/',\n", 424 | " 'https://www.reddit.com/r/Rowing/',\n", 425 | " 'https://www.reddit.com/r/SFGiants/',\n", 426 | " 'https://www.reddit.com/r/SWGalaxyOfHeroes/',\n", 427 | " 'https://www.reddit.com/r/Sacramento/',\n", 428 | " 'https://www.reddit.com/r/SaltLakeCity/',\n", 429 | " 'https://www.reddit.com/r/SanJoseSharks/',\n", 430 | " 'https://www.reddit.com/r/SantaFe/',\n", 431 | " 'https://www.reddit.com/r/SarahSnark/',\n", 432 | " 'https://www.reddit.com/r/Scotland/',\n", 433 | " 'https://www.reddit.com/r/Scottsdale/',\n", 434 | " 'https://www.reddit.com/r/Seaofthieves/',\n", 435 | " 'https://www.reddit.com/r/Seattle/',\n", 436 | " 'https://www.reddit.com/r/SequelMemes/',\n", 437 | " 'https://www.reddit.com/r/ShingekiNoKyojin/',\n", 438 | " 'https://www.reddit.com/r/Shoestring/',\n", 439 | " 'https://www.reddit.com/r/Showerthoughts/',\n", 440 | " 'https://www.reddit.com/r/Smite/',\n", 441 | " 'https://www.reddit.com/r/Sneakers/',\n", 442 | " 'https://www.reddit.com/r/Spiderman/',\n", 
443 | " 'https://www.reddit.com/r/SpoiledDragRace/',\n", 444 | " 'https://www.reddit.com/r/SquaredCircle/',\n", 445 | " 'https://www.reddit.com/r/StLouis/',\n", 446 | " 'https://www.reddit.com/r/StarVStheForcesofEvil/',\n", 447 | " 'https://www.reddit.com/r/StarWarsBattlefront/',\n", 448 | " 'https://www.reddit.com/r/StardewValley/',\n", 449 | " 'https://www.reddit.com/r/Steam/',\n", 450 | " 'https://www.reddit.com/r/Stellaris/',\n", 451 | " 'https://www.reddit.com/r/StrangerThings/',\n", 452 | " 'https://www.reddit.com/r/Stronglifts5x5/',\n", 453 | " 'https://www.reddit.com/r/Suomi/',\n", 454 | " 'https://www.reddit.com/r/Supplements/',\n", 455 | " 'https://www.reddit.com/r/TeenMomOGandTeenMom2/',\n", 456 | " 'https://www.reddit.com/r/Terraria/',\n", 457 | " 'https://www.reddit.com/r/TheAmazingRace/',\n", 458 | " 'https://www.reddit.com/r/TheBlackList/',\n", 459 | " 'https://www.reddit.com/r/TheDickShow/',\n", 460 | " 'https://www.reddit.com/r/TheHandmaidsTale/',\n", 461 | " 'https://www.reddit.com/r/TheLastAirbender/',\n", 462 | " 'https://www.reddit.com/r/TheSimpsons/',\n", 463 | " 'https://www.reddit.com/r/Tinder/',\n", 464 | " 'https://www.reddit.com/r/Torontobluejays/',\n", 465 | " 'https://www.reddit.com/r/Turkey/',\n", 466 | " 'https://www.reddit.com/r/TurkeyJerky/',\n", 467 | " 'https://www.reddit.com/r/Twitch/',\n", 468 | " 'https://www.reddit.com/r/TwoBestFriendsPlay/',\n", 469 | " 'https://www.reddit.com/r/VictoriaBC/',\n", 470 | " 'https://www.reddit.com/r/WWE/',\n", 471 | " 'https://www.reddit.com/r/WWEGames/',\n", 472 | " 'https://www.reddit.com/r/WaltDisneyWorld/',\n", 473 | " 'https://www.reddit.com/r/Warframe/',\n", 474 | " 'https://www.reddit.com/r/Warhammer40k/',\n", 475 | " 'https://www.reddit.com/r/Warthunder/',\n", 476 | " 'https://www.reddit.com/r/Watches/',\n", 477 | " 'https://www.reddit.com/r/Watchexchange/',\n", 478 | " 'https://www.reddit.com/r/Wellington/',\n", 479 | " 'https://www.reddit.com/r/Wetshaving/',\n", 480 | " 'https://www.reddit.com/r/Windows10/',\n", 481 | " 'https://www.reddit.com/r/Winnipeg/',\n", 482 | " 'https://www.reddit.com/r/WorldOfWarships/',\n", 483 | " 'https://www.reddit.com/r/WorldofTanks/',\n", 484 | " 'https://www.reddit.com/r/Youniqueamua/',\n", 485 | " 'https://www.reddit.com/r/aSongOfMemesAndRage/',\n", 486 | " 'https://www.reddit.com/r/acne/',\n", 487 | " 'https://www.reddit.com/r/adventuretime/',\n", 488 | " 'https://www.reddit.com/r/airsoft/',\n", 489 | " 'https://www.reddit.com/r/amateur_boxing/',\n", 490 | " 'https://www.reddit.com/r/anime/',\n", 491 | " 'https://www.reddit.com/r/anime_irl/',\n", 492 | " 'https://www.reddit.com/r/apple/',\n", 493 | " 'https://www.reddit.com/r/argentina/',\n", 494 | " 'https://www.reddit.com/r/arrow/',\n", 495 | " 'https://www.reddit.com/r/askTO/',\n", 496 | " 'https://www.reddit.com/r/askscience/',\n", 497 | " 'https://www.reddit.com/r/asoiaf/',\n", 498 | " 'https://www.reddit.com/r/australia/',\n", 499 | " 'https://www.reddit.com/r/awardtravel/',\n", 500 | " 'https://www.reddit.com/r/backpacking/',\n", 501 | " 'https://www.reddit.com/r/balisong/',\n", 502 | " 'https://www.reddit.com/r/barstoolsports/',\n", 503 | " 'https://www.reddit.com/r/baseball/',\n", 504 | " 'https://www.reddit.com/r/batman/',\n", 505 | " 'https://www.reddit.com/r/battlestations/',\n", 506 | " 'https://www.reddit.com/r/bayarea/',\n", 507 | " 'https://www.reddit.com/r/beards/',\n", 508 | " 'https://www.reddit.com/r/beauty/',\n", 509 | " 'https://www.reddit.com/r/berkeley/',\n", 510 | " 
'https://www.reddit.com/r/bicycling/',\n", 511 | " 'https://www.reddit.com/r/bikecommuting/',\n", 512 | " 'https://www.reddit.com/r/bikewrench/',\n", 513 | " 'https://www.reddit.com/r/bjj/',\n", 514 | " 'https://www.reddit.com/r/blackmirror/',\n", 515 | " 'https://www.reddit.com/r/bleach/',\n", 516 | " 'https://www.reddit.com/r/boardgames/',\n", 517 | " 'https://www.reddit.com/r/bodybuilding/',\n", 518 | " 'https://www.reddit.com/r/bodyweightfitness/',\n", 519 | " 'https://www.reddit.com/r/books/',\n", 520 | " 'https://www.reddit.com/r/boostedboards/',\n", 521 | " 'https://www.reddit.com/r/bostonceltics/',\n", 522 | " 'https://www.reddit.com/r/brasil/',\n", 523 | " 'https://www.reddit.com/r/brasilivre/',\n", 524 | " 'https://www.reddit.com/r/breakingbad/',\n", 525 | " 'https://www.reddit.com/r/brisbane/',\n", 526 | " 'https://www.reddit.com/r/brooklynninenine/',\n", 527 | " 'https://www.reddit.com/r/buildapc/',\n", 528 | " 'https://www.reddit.com/r/camping/',\n", 529 | " 'https://www.reddit.com/r/canada/',\n", 530 | " 'https://www.reddit.com/r/canucks/',\n", 531 | " 'https://www.reddit.com/r/cars/',\n", 532 | " 'https://www.reddit.com/r/chelseafc/',\n", 533 | " 'https://www.reddit.com/r/chile/',\n", 534 | " 'https://www.reddit.com/r/cirkeltrek/',\n", 535 | " 'https://www.reddit.com/r/classicwow/',\n", 536 | " 'https://www.reddit.com/r/climbing/',\n", 537 | " 'https://www.reddit.com/r/community/',\n", 538 | " 'https://www.reddit.com/r/confession/',\n", 539 | " 'https://www.reddit.com/r/cordcutters/',\n", 540 | " 'https://www.reddit.com/r/cowboys/',\n", 541 | " 'https://www.reddit.com/r/coys/',\n", 542 | " 'https://www.reddit.com/r/criterion/',\n", 543 | " 'https://www.reddit.com/r/croatia/',\n", 544 | " 'https://www.reddit.com/r/crossfit/',\n", 545 | " 'https://www.reddit.com/r/cscareerquestions/',\n", 546 | " 'https://www.reddit.com/r/curlyhair/',\n", 547 | " 'https://www.reddit.com/r/cycling/',\n", 548 | " 'https://www.reddit.com/r/danganronpa/',\n", 549 | " 'https://www.reddit.com/r/dataisbeautiful/',\n", 550 | " 'https://www.reddit.com/r/dataisbeautiful/?f=flair_name%3A%22OC%22',\n", 551 | " 'https://www.reddit.com/r/dataisbeautiful/comments/g0o65a/oc_a_full_year_of_income_and_expenses_through_my/',\n", 552 | " 'https://www.reddit.com/r/dauntless/',\n", 553 | " 'https://www.reddit.com/r/dbz/',\n", 554 | " 'https://www.reddit.com/r/de/',\n", 555 | " 'https://www.reddit.com/r/deadbydaylight/',\n", 556 | " 'https://www.reddit.com/r/denvernuggets/',\n", 557 | " 'https://www.reddit.com/r/destiny2/',\n", 558 | " 'https://www.reddit.com/r/detroitlions/',\n", 559 | " 'https://www.reddit.com/r/diabetes/',\n", 560 | " 'https://www.reddit.com/r/diabetes_t1/',\n", 561 | " 'https://www.reddit.com/r/discgolf/',\n", 562 | " 'https://www.reddit.com/r/discordapp/',\n", 563 | " 'https://www.reddit.com/r/disney/',\n", 564 | " 'https://www.reddit.com/r/dndmemes/',\n", 565 | " 'https://www.reddit.com/r/dndnext/',\n", 566 | " 'https://www.reddit.com/r/doctorwho/',\n", 567 | " 'https://www.reddit.com/r/dubai/',\n", 568 | " 'https://www.reddit.com/r/eagles/',\n", 569 | " 'https://www.reddit.com/r/ehlersdanlos/',\n", 570 | " 'https://www.reddit.com/r/elderscrollsonline/',\n", 571 | " 'https://www.reddit.com/r/eu4/',\n", 572 | " 'https://www.reddit.com/r/europe/',\n", 573 | " 'https://www.reddit.com/r/explainlikeimfive/',\n", 574 | " 'https://www.reddit.com/r/fairytail/',\n", 575 | " 'https://www.reddit.com/r/fantasybaseball/',\n", 576 | " 'https://www.reddit.com/r/fantasyfootball/',\n", 577 | " 
'https://www.reddit.com/r/fasting/',\n", 578 | " 'https://www.reddit.com/r/femalefashionadvice/',\n", 579 | " 'https://www.reddit.com/r/femalehairadvice/',\n", 580 | " 'https://www.reddit.com/r/ffxiv/',\n", 581 | " 'https://www.reddit.com/r/findfashion/',\n", 582 | " 'https://www.reddit.com/r/fireemblem/',\n", 583 | " 'https://www.reddit.com/r/fivenightsatfreddys/',\n", 584 | " 'https://www.reddit.com/r/flexibility/',\n", 585 | " 'https://www.reddit.com/r/flightsim/',\n", 586 | " 'https://www.reddit.com/r/flyfishing/',\n", 587 | " 'https://www.reddit.com/r/fo76/',\n", 588 | " 'https://www.reddit.com/r/footballmanagergames/',\n", 589 | " 'https://www.reddit.com/r/forhonor/',\n", 590 | " 'https://www.reddit.com/r/formula1/',\n", 591 | " 'https://www.reddit.com/r/fragrance/',\n", 592 | " 'https://www.reddit.com/r/france/',\n", 593 | " 'https://www.reddit.com/r/freefolk/',\n", 594 | " 'https://www.reddit.com/r/frugalmalefashion/',\n", 595 | " 'https://www.reddit.com/r/futurama/',\n", 596 | " 'https://www.reddit.com/r/future_fight/',\n", 597 | " 'https://www.reddit.com/r/gainit/',\n", 598 | " 'https://www.reddit.com/r/gameofthrones/',\n", 599 | " 'https://www.reddit.com/r/germany/',\n", 600 | " 'https://www.reddit.com/r/gifs/',\n", 601 | " 'https://www.reddit.com/r/gifs/comments/g0tzwn/disney_tried_editing_out_darryl_hannahs_butt_by/',\n", 602 | " 'https://www.reddit.com/r/girlsfrontline/',\n", 603 | " 'https://www.reddit.com/r/golf/',\n", 604 | " 'https://www.reddit.com/r/goodyearwelt/',\n", 605 | " 'https://www.reddit.com/r/grandorder/',\n", 606 | " 'https://www.reddit.com/r/greece/',\n", 607 | " 'https://www.reddit.com/r/greysanatomy/',\n", 608 | " 'https://www.reddit.com/r/gtaonline/',\n", 609 | " 'https://www.reddit.com/r/halifax/',\n", 610 | " 'https://www.reddit.com/r/halo/',\n", 611 | " 'https://www.reddit.com/r/headphones/',\n", 612 | " 'https://www.reddit.com/r/hearthstone/',\n", 613 | " 'https://www.reddit.com/r/heroesofthestorm/',\n", 614 | " 'https://www.reddit.com/r/hiking/',\n", 615 | " 'https://www.reddit.com/r/hockey/',\n", 616 | " 'https://www.reddit.com/r/hockeyjerseys/',\n", 617 | " 'https://www.reddit.com/r/hockeyplayers/',\n", 618 | " 'https://www.reddit.com/r/houston/',\n", 619 | " 'https://www.reddit.com/r/howardstern/',\n", 620 | " 'https://www.reddit.com/r/hungary/',\n", 621 | " 'https://www.reddit.com/r/india/',\n", 622 | " 'https://www.reddit.com/r/indonesia/',\n", 623 | " 'https://www.reddit.com/r/intermittentfasting/',\n", 624 | " 'https://www.reddit.com/r/iphone/',\n", 625 | " 'https://www.reddit.com/r/ireland/',\n", 626 | " 'https://www.reddit.com/r/italy/',\n", 627 | " 'https://www.reddit.com/r/jailbreak/',\n", 628 | " 'https://www.reddit.com/r/japanesestreetwear/',\n", 629 | " 'https://www.reddit.com/r/japanlife/',\n", 630 | " 'https://www.reddit.com/r/jobs/',\n", 631 | " 'https://www.reddit.com/r/kansascity/',\n", 632 | " 'https://www.reddit.com/r/keto/',\n", 633 | " 'https://www.reddit.com/r/korea/',\n", 634 | " 'https://www.reddit.com/r/lakers/',\n", 635 | " 'https://www.reddit.com/r/leafs/',\n", 636 | " 'https://www.reddit.com/r/leagueoflegends/',\n", 637 | " 'https://www.reddit.com/r/leangains/',\n", 638 | " 'https://www.reddit.com/r/learnprogramming/',\n", 639 | " 'https://www.reddit.com/r/learnpython/',\n", 640 | " 'https://www.reddit.com/r/legaladvice/',\n", 641 | " 'https://www.reddit.com/r/longboarding/',\n", 642 | " 'https://www.reddit.com/r/loseit/',\n", 643 | " 'https://www.reddit.com/r/lucifer/',\n", 644 | " 
'https://www.reddit.com/r/makeupexchange/',\n", 645 | " 'https://www.reddit.com/r/malaysia/',\n", 646 | " 'https://www.reddit.com/r/malefashion/',\n", 647 | " 'https://www.reddit.com/r/malefashionadvice/',\n", 648 | " 'https://www.reddit.com/r/malehairadvice/',\n", 649 | " 'https://www.reddit.com/r/malelivingspace/',\n", 650 | " 'https://www.reddit.com/r/marvelmemes/',\n", 651 | " 'https://www.reddit.com/r/marvelstudios/',\n", 652 | " 'https://www.reddit.com/r/medical_advice/',\n", 653 | " 'https://www.reddit.com/r/melbourne/',\n", 654 | " 'https://www.reddit.com/r/memes/',\n", 655 | " 'https://www.reddit.com/r/mexico/',\n", 656 | " 'https://www.reddit.com/r/migraine/',\n", 657 | " 'https://www.reddit.com/r/minnesotatwins/',\n", 658 | " 'https://www.reddit.com/r/minnesotavikings/',\n", 659 | " 'https://www.reddit.com/r/mw4/',\n", 660 | " 'https://www.reddit.com/r/mylittlepony/',\n", 661 | " 'https://www.reddit.com/r/nashville/',\n", 662 | " 'https://www.reddit.com/r/nattyorjuice/',\n", 663 | " 'https://www.reddit.com/r/nba/',\n", 664 | " 'https://www.reddit.com/r/nbadiscussion/',\n", 665 | " 'https://www.reddit.com/r/netflix/',\n", 666 | " 'https://www.reddit.com/r/newsokur/',\n", 667 | " 'https://www.reddit.com/r/newzealand/',\n", 668 | " 'https://www.reddit.com/r/nfl/',\n", 669 | " 'https://www.reddit.com/r/nhl/',\n", 670 | " 'https://www.reddit.com/r/norge/',\n", 671 | " 'https://www.reddit.com/r/nosleep/',\n", 672 | " 'https://www.reddit.com/r/nova/',\n", 673 | " 'https://www.reddit.com/r/nrl/',\n", 674 | " 'https://www.reddit.com/r/nutrition/',\n", 675 | " 'https://www.reddit.com/r/nvidia/',\n", 676 | " 'https://www.reddit.com/r/nyjets/',\n", 677 | " 'https://www.reddit.com/r/omad/',\n", 678 | " 'https://www.reddit.com/r/orangecounty/',\n", 679 | " 'https://www.reddit.com/r/orangetheory/',\n", 680 | " 'https://www.reddit.com/r/osugame/',\n", 681 | " 'https://www.reddit.com/r/ottawa/',\n", 682 | " 'https://www.reddit.com/r/overlord/',\n", 683 | " 'https://www.reddit.com/r/pathofexile/',\n", 684 | " 'https://www.reddit.com/r/pcmasterrace/',\n", 685 | " 'https://www.reddit.com/r/peloton/',\n", 686 | " 'https://www.reddit.com/r/pesmobile/',\n", 687 | " 'https://www.reddit.com/r/philadelphia/',\n", 688 | " 'https://www.reddit.com/r/phillies/',\n", 689 | " 'https://www.reddit.com/r/phoenix/',\n", 690 | " 'https://www.reddit.com/r/pics/',\n", 691 | " 'https://www.reddit.com/r/piercing/',\n", 692 | " 'https://www.reddit.com/r/pittsburgh/',\n", 693 | " 'https://www.reddit.com/r/playrust/',\n", 694 | " 'https://www.reddit.com/r/podemos/',\n", 695 | " 'https://www.reddit.com/r/pokemon/',\n", 696 | " 'https://www.reddit.com/r/pokemongo/',\n", 697 | " 'https://www.reddit.com/r/pokemontrades/',\n", 698 | " 'https://www.reddit.com/r/portugal/',\n", 699 | " 'https://www.reddit.com/r/poshmark/',\n", 700 | " 'https://www.reddit.com/r/powerlifting/',\n", 701 | " 'https://www.reddit.com/r/progresspics/',\n", 702 | " 'https://www.reddit.com/r/raleigh/',\n", 703 | " 'https://www.reddit.com/r/ravens/',\n", 704 | " 'https://www.reddit.com/r/rawdenim/',\n", 705 | " 'https://www.reddit.com/r/realmadrid/',\n", 706 | " 'https://www.reddit.com/r/reddeadredemption/',\n", 707 | " 'https://www.reddit.com/r/reddevils/',\n", 708 | " 'https://www.reddit.com/r/redsox/',\n", 709 | " 'https://www.reddit.com/r/relationship_advice/',\n", 710 | " 'https://www.reddit.com/r/rickandmorty/',\n", 711 | " 'https://www.reddit.com/r/ripcity/',\n", 712 | " 'https://www.reddit.com/r/riverdale/',\n", 713 | " 
'https://www.reddit.com/r/roadtrip/',\n", 714 | " 'https://www.reddit.com/r/rolex/',\n", 715 | " 'https://www.reddit.com/r/rollercoasters/',\n", 716 | " 'https://www.reddit.com/r/rpdrcringe/',\n", 717 | " 'https://www.reddit.com/r/rugbyunion/',\n", 718 | " 'https://www.reddit.com/r/runescape/',\n", 719 | " 'https://www.reddit.com/r/running/',\n", 720 | " 'https://www.reddit.com/r/rupaulsdragrace/',\n", 721 | " 'https://www.reddit.com/r/rva/',\n", 722 | " 'https://www.reddit.com/r/sanantonio/',\n", 723 | " 'https://www.reddit.com/r/sandiego/',\n", 724 | " 'https://www.reddit.com/r/sanfrancisco/',\n", 725 | " 'https://www.reddit.com/r/saskatoon/',\n", 726 | " 'https://www.reddit.com/r/scifi/',\n", 727 | " 'https://www.reddit.com/r/seinfeld/',\n", 728 | " 'https://www.reddit.com/r/serbia/',\n", 729 | " 'https://www.reddit.com/r/shield/',\n", 730 | " 'https://www.reddit.com/r/singapore/',\n", 731 | " 'https://www.reddit.com/r/sixers/',\n", 732 | " 'https://www.reddit.com/r/skiing/',\n", 733 | " 'https://www.reddit.com/r/skyrim/',\n", 734 | " 'https://www.reddit.com/r/smashbros/',\n", 735 | " 'https://www.reddit.com/r/sneakermarket/',\n", 736 | " 'https://www.reddit.com/r/snowboarding/',\n", 737 | " 'https://www.reddit.com/r/soccer/',\n", 738 | " 'https://www.reddit.com/r/solotravel/',\n", 739 | " 'https://www.reddit.com/r/southpark/',\n", 740 | " 'https://www.reddit.com/r/sports/',\n", 741 | " 'https://www.reddit.com/r/sportsbook/',\n", 742 | " 'https://www.reddit.com/r/starbucks/',\n", 743 | " 'https://www.reddit.com/r/starcitizen/',\n", 744 | " 'https://www.reddit.com/r/startrek/',\n", 745 | " 'https://www.reddit.com/r/steelers/',\n", 746 | " 'https://www.reddit.com/r/stevenuniverse/',\n", 747 | " 'https://www.reddit.com/r/stlouisblues/',\n", 748 | " 'https://www.reddit.com/r/streetwearstartup/',\n", 749 | " 'https://www.reddit.com/r/summonerswar/',\n", 750 | " 'https://www.reddit.com/r/suns/',\n", 751 | " 'https://www.reddit.com/r/survivor/',\n", 752 | " 'https://www.reddit.com/r/sweden/',\n", 753 | " 'https://www.reddit.com/r/swoleacceptance/',\n", 754 | " 'https://www.reddit.com/r/sydney/',\n", 755 | " 'https://www.reddit.com/r/sysadmin/',\n", 756 | " 'https://www.reddit.com/r/tampabayrays/',\n", 757 | " 'https://www.reddit.com/r/tattoos/',\n", 758 | " 'https://www.reddit.com/r/techsupport/',\n", 759 | " 'https://www.reddit.com/r/tennis/',\n", 760 | " 'https://www.reddit.com/r/tf2/',\n", 761 | " 'https://www.reddit.com/r/the100/',\n", 762 | " 'https://www.reddit.com/r/thebachelor/',\n", 763 | " 'https://www.reddit.com/r/thedivision/',\n", 764 | " 'https://www.reddit.com/r/thenetherlands/',\n", 765 | " 'https://www.reddit.com/r/thesims/',\n", 766 | " 'https://www.reddit.com/r/thesopranos/',\n", 767 | " 'https://www.reddit.com/r/thewalkingdead/',\n", 768 | " 'https://www.reddit.com/r/tipofmytongue/',\n", 769 | " 'https://www.reddit.com/r/titanfolk/',\n", 770 | " 'https://www.reddit.com/r/todayilearned/',\n", 771 | " 'https://www.reddit.com/r/torontoraptors/',\n", 772 | " 'https://www.reddit.com/r/totalwar/',\n", 773 | " 'https://www.reddit.com/r/touhou/',\n", 774 | " 'https://www.reddit.com/r/trailerparkboys/',\n", 775 | " 'https://www.reddit.com/r/translator/',\n", 776 | " 'https://www.reddit.com/r/travel/',\n", 777 | " 'https://www.reddit.com/r/vagabond/',\n", 778 | " 'https://www.reddit.com/r/vancouver/',\n", 779 | " 'https://www.reddit.com/r/vanderpumprules/',\n", 780 | " 'https://www.reddit.com/r/vegan/',\n", 781 | " 'https://www.reddit.com/r/videos/',\n", 782 | " 
'https://www.reddit.com/r/vzla/',\n", 783 | " 'https://www.reddit.com/r/warriors/',\n", 784 | " 'https://www.reddit.com/r/weightroom/',\n", 785 | " 'https://www.reddit.com/r/westworld/',\n", 786 | " 'https://www.reddit.com/r/wicked_edge/',\n", 787 | " 'https://www.reddit.com/r/worldnews/',\n", 788 | " 'https://www.reddit.com/r/wow/',\n", 789 | " 'https://www.reddit.com/r/xboxone/',\n", 790 | " 'https://www.reddit.com/r/xxfitness/',\n", 791 | " 'https://www.reddit.com/r/yeezys/',\n", 792 | " 'https://www.reddit.com/r/yoga/',\n", 793 | " 'https://www.reddit.com/r/yugioh/',\n", 794 | " 'https://www.reddit.com/r/zerocarb/',\n", 795 | " 'https://www.reddit.com/search?q=dune&source=trending',\n", 796 | " 'https://www.reddit.com/search?q=fauci&source=trending',\n", 797 | " 'https://www.reddit.com/search?q=kyle%20larson&source=trending',\n", 798 | " 'https://www.reddit.com/search?q=nascar&source=trending',\n", 799 | " 'https://www.reddit.com/search?q=rick%20may&source=trending',\n", 800 | " 'https://www.reddit.com/search?q=tornado&source=trending',\n", 801 | " 'https://www.reddit.com/subreddits/leaderboard/up-and-coming',\n", 802 | " 'https://www.reddit.com/user/ItsBOOM/',\n", 803 | " 'https://www.reddit.com/user/SPM8/',\n", 804 | " 'https://www.reddit.com/user/con_commenter/',\n", 805 | " 'https://www.reddit.com/user/jesq/',\n", 806 | " 'https://www.reddit.com/user/memezzer/',\n", 807 | " 'https://www.reddit.com/user/mtlgrems/',\n", 808 | " 'https://www.reddit.com/user/notsure500/',\n", 809 | " 'https://www.reddit.com/user/skinkbaa/',\n", 810 | " 'https://www.reddit.com/user/steven5it/'}" 811 | ] 812 | }, 813 | "execution_count": 16, 814 | "metadata": {}, 815 | "output_type": "execute_result" 816 | } 817 | ], 818 | "source": [ 819 | "# Take only the new items in the first set\n", 820 | "new_urls.difference(urls)" 821 | ] 822 | }, 823 | { 824 | "cell_type": "code", 825 | "execution_count": 17, 826 | "metadata": {}, 827 | "outputs": [ 828 | { 829 | "data": { 830 | "text/plain": [ 831 | "" 832 | ] 833 | }, 834 | "execution_count": 17, 835 | "metadata": {}, 836 | "output_type": "execute_result" 837 | } 838 | ], 839 | "source": [ 840 | "# Finally, close the session\n", 841 | "session.close()" 842 | ] 843 | }, 844 | { 845 | "cell_type": "code", 846 | "execution_count": 18, 847 | "metadata": {}, 848 | "outputs": [ 849 | { 850 | "name": "stdout", 851 | "output_type": "stream", 852 | "text": [ 853 | "Reloads the response in Chromium, and replaces HTML content\n", 854 | " with an updated version, with JavaScript executed.\n", 855 | "\n", 856 | " :param retries: The number of times to retry loading the page in Chromium.\n", 857 | " :param script: JavaScript to execute upon page load (optional).\n", 858 | " :param wait: The number of seconds to wait before loading the page, preventing timeouts (optional).\n", 859 | " :param scrolldown: Integer, if provided, of how many times to page down.\n", 860 | " :param sleep: Integer, if provided, of how many long to sleep after initial render.\n", 861 | " :param reload: If ``False``, content will not be loaded from the browser, but will be provided from memory.\n", 862 | " :param keep_page: If ``True`` will allow you to interact with the browser page through ``r.html.page``.\n", 863 | "\n", 864 | " If ``scrolldown`` is specified, the page will scrolldown the specified\n", 865 | " number of times, after sleeping the specified amount of time\n", 866 | " (e.g. 
``scrolldown=10, sleep=1``.\n", 867 | "\n", 868 | " If just ``sleep`` is provided, the rendering will wait *n* seconds, before\n", 869 | " returning.\n", 870 | "\n", 871 | " If ``script`` is specified, it will execute the provided JavaScript at\n", 872 | " runtime. Example:\n", 873 | "\n", 874 | " .. code-block:: python\n", 875 | "\n", 876 | " script = \"\"\"\n", 877 | " () => {\n", 878 | " return {\n", 879 | " width: document.documentElement.clientWidth,\n", 880 | " height: document.documentElement.clientHeight,\n", 881 | " deviceScaleFactor: window.devicePixelRatio,\n", 882 | " }\n", 883 | " }\n", 884 | " \"\"\"\n", 885 | "\n", 886 | " Returns the return value of the executed ``script``, if any is provided:\n", 887 | "\n", 888 | " .. code-block:: python\n", 889 | "\n", 890 | " >>> r.html.render(script=script)\n", 891 | " {'width': 800, 'height': 600, 'deviceScaleFactor': 1}\n", 892 | "\n", 893 | " Warning: the first time you run this method, it will download\n", 894 | " Chromium into your home directory (``~/.pyppeteer``).\n", 895 | " \n" 896 | ] 897 | } 898 | ], 899 | "source": [ 900 | "# You can check the documentation directly inside Jupyter\n", 901 | "print(r.html.render.__doc__)" 902 | ] 903 | }, 904 | { 905 | "cell_type": "code", 906 | "execution_count": null, 907 | "metadata": {}, 908 | "outputs": [], 909 | "source": [] 910 | } 911 | ], 912 | "metadata": { 913 | "kernelspec": { 914 | "display_name": "Python 3", 915 | "language": "python", 916 | "name": "python3" 917 | }, 918 | "language_info": { 919 | "codemirror_mode": { 920 | "name": "ipython", 921 | "version": 3 922 | }, 923 | "file_extension": ".py", 924 | "mimetype": "text/x-python", 925 | "name": "python", 926 | "nbconvert_exporter": "python", 927 | "pygments_lexer": "ipython3", 928 | "version": "3.7.3" 929 | } 930 | }, 931 | "nbformat": 4, 932 | "nbformat_minor": 2 933 | } 934 | -------------------------------------------------------------------------------- /11.Scraping JavaScript - SoundCloud Project/Section 10 - Scraping SoundCloud - Setup.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Scraping SoundCloud" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Initial Setup" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "# import packages\n", 24 | "import requests\n", 25 | "from bs4 import BeautifulSoup" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": null, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "from requests_html import AsyncHTMLSession" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": null, 40 | "metadata": {}, 41 | "outputs": [], 42 | "source": [] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "## Connect to SoundCloud" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": null, 54 | "metadata": {}, 55 | "outputs": [], 56 | "source": [ 57 | "# make connection to webpage\n", 58 | "resp = requests.get(\"https://soundcloud.com/discover\")" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "# get HTML from response object\n", 68 | "html = resp.content" 69 | ] 70 | }, 71 | { 72 | "cell_type": "code", 73 | "execution_count": null, 74 
| "metadata": {}, 75 | "outputs": [], 76 | "source": [ 77 | "# convert HTML to BeautifulSoup object\n", 78 | "soup = BeautifulSoup(html, \"lxml\")" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": null, 84 | "metadata": {}, 85 | "outputs": [], 86 | "source": [] 87 | }, 88 | { 89 | "cell_type": "markdown", 90 | "metadata": {}, 91 | "source": [ 92 | "## Get links on the webpage. Notice how this doesn't extract all the links visible on the webpage...what can we do about that?" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "soup.find_all(\"a\")" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": {}, 108 | "outputs": [], 109 | "source": [] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "## 1) Use requests-html to extract other links on the page by executing JavaScript. How many links do you see now?\n", 116 | "## 2) After you complete 1), get the text of the new paragraphs now visible in the HTML.\n", 117 | "## 3) Try out a few other tags - what else appears after executing the JavaScript?\n", 118 | "## 4) Using a CSS selector, extract the meta tag with name = \"keywords\". Can you get this tag's attributes?\n", 119 | "## 5) Links that automatically open to a new a tab are identified by having a \"target\" attribute equal to \"_blank\". Try extracting these links and their URLs." 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": null, 125 | "metadata": {}, 126 | "outputs": [], 127 | "source": [] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": null, 132 | "metadata": {}, 133 | "outputs": [], 134 | "source": [] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "metadata": {}, 154 | "outputs": [], 155 | "source": [] 156 | } 157 | ], 158 | "metadata": { 159 | "kernelspec": { 160 | "display_name": "Python 3", 161 | "language": "python", 162 | "name": "python3" 163 | }, 164 | "language_info": { 165 | "codemirror_mode": { 166 | "name": "ipython", 167 | "version": 3 168 | }, 169 | "file_extension": ".py", 170 | "mimetype": "text/x-python", 171 | "name": "python", 172 | "nbconvert_exporter": "python", 173 | "pygments_lexer": "ipython3", 174 | "version": "3.7.3" 175 | } 176 | }, 177 | "nbformat": 4, 178 | "nbformat_minor": 2 179 | } 180 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Phone Thiri Yadana 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions 
of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /readme.md: -------------------------------------------------------------------------------- 1 | # Web Scraping and API in Python 2 | 3 | Python project for integrations with different APIs and for web scraping with the Beautiful Soup and requests-HTML libraries, covering multiple scraping projects such as YouTube, the dynamically generated JavaScript pages of SoundCloud, and many more. 4 | 5 | ## Built With 6 | * [Python 3](https://www.python.org/) 7 | * [requests](https://requests.readthedocs.io/en/master/) - Requests is an elegant and simple HTTP library for Python. 8 | * [pandas](https://pandas.pydata.org/) - fast, powerful, flexible and easy to use open source data analysis and manipulation tool 9 | * [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) - library for pulling data out of HTML and XML files (screen scraping) 10 | * [requests-HTML](https://requests.readthedocs.io/projects/requests-html/en/latest/) - makes parsing HTML as simple and intuitive as possible, with full JavaScript support 11 | * [python html.parser](https://docs.python.org/3/library/html.parser.html) - HTML parser from the Python standard library 12 | * [lxml parser](https://lxml.de/parsing.html) - fast, feature-rich XML and HTML parser 13 | * [html5lib parser](https://github.com/html5lib/html5lib-python) - pure-Python HTML parser that parses pages the way modern web browsers do 14 | * [urllib](https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse) - URL handling module 15 | 16 | ## API Projects 17 | * [Currency Exchange Rate API](https://exchangeratesapi.io/) 18 | * [iTunes API](https://developer.apple.com/library/archive/documentation/AudioVideo/Conceptual/iTuneSearchAPI/Searching.html#//apple_ref/doc/uid/TP40017632-CH5-SW1) 19 | * [GitHub Jobs API](https://jobs.github.com/api) 20 | * [Official Joke API](https://github.com/15Dkatz/official_joke_api) 21 | * [Joke API](https://sv443.net/jokeapi) 22 | 23 | ## Web Scraping Projects 24 | * [Rotten Tomatoes](https://www.rottentomatoes.com/) 25 | * [Steam](https://store.steampowered.com/games/) 26 | * [YouTube](https://www.youtube.com/) 27 | * [SoundCloud](https://soundcloud.com/) 28 | 29 | ## License 30 | 31 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. 32 | 33 | ## References 34 | 35 | * The challenges are part of the [Web Scraping and API Fundamentals in Python course](https://365datascience.com/courses/web-scraping-and-api-fundamentals-in-python/) by 365 Data Science. 36 | --------------------------------------------------------------------------------
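The `render()` docstring captured in the notebook above documents the `scrolldown`, `sleep`, and `script` parameters, but not how they combine in a standalone script. Below is a minimal sketch of that usage, assuming only that requests-HTML is installed; the URL is just an illustrative JavaScript-heavy page from this repo's projects, and the first call downloads Chromium into `~/.pyppeteer`, as the docstring warns. The synchronous `HTMLSession` shown here works in a plain script; inside Jupyter, which already runs an event loop, the course notebooks import `AsyncHTMLSession` instead.

```python
# Minimal sketch: HTMLSession.render() with scrolldown, sleep, and script.
# Assumes requests-html is installed; the first run downloads Chromium.
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://soundcloud.com/discover")  # illustrative JS-heavy page

# JavaScript evaluated inside the headless browser; its return value is
# handed back to Python (here, a dict describing the viewport).
script = """
() => {
    return {
        width: document.documentElement.clientWidth,
        height: document.documentElement.clientHeight,
        deviceScaleFactor: window.devicePixelRatio,
    }
}
"""

# scrolldown/sleep give lazily loaded content time to appear before
# the rendered DOM is captured.
result = r.html.render(script=script, scrolldown=10, sleep=1)
print(result)  # e.g. {'width': 800, 'height': 600, 'deviceScaleFactor': 1}
```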
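For the numbered SoundCloud exercises in the setup notebook, a hedged solution sketch follows. It uses `AsyncHTMLSession` (already imported in that notebook) because Jupyter's running event loop blocks the synchronous session; the CSS selectors come straight from the exercise text, while the printed counts depend on whatever the live page returns and are not verified output.

```python
# Sketch of the exercise flow: render the page's JavaScript, then re-query the DOM.
from requests_html import AsyncHTMLSession

session = AsyncHTMLSession()

async def get_rendered_page():
    r = await session.get("https://soundcloud.com/discover")
    await r.html.arender(sleep=2)  # execute the JavaScript before parsing
    return r

r = session.run(get_rendered_page)[0]

# 1) Far more <a> tags should be present than plain requests + BeautifulSoup saw
print(len(r.html.find("a")))

# 2) Paragraph text that only exists in the rendered DOM
print([p.text for p in r.html.find("p")][:5])

# 4) CSS selector for the keywords meta tag; .attrs exposes its attributes
meta = r.html.find('meta[name="keywords"]', first=True)
print(meta.attrs if meta is not None else "no keywords meta tag found")

# 5) Links that open in a new tab carry target="_blank"
for link in r.html.find('a[target="_blank"]'):
    print(link.attrs.get("href"))
```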