├── .gitignore
├── 01-02.Course Notes
│   └── Course Notes - Web Scraping and API Fundamentals in Python.pdf
├── 03.Working with APIs
│   ├── Currency Exchange API
│   │   ├── Section 3 - Additional API functionalities.ipynb
│   │   ├── Section 3 - Creating a simple currency converter.ipynb
│   │   ├── Section 3 - Exchange rates API GETting a JSON reply.ipynb
│   │   ├── Section 3 - Incorporating parameters in a GET request.ipynb
│   │   ├── additional_API_functionalities.py
│   │   ├── currency_converter.py
│   │   ├── exchange_rate_API.py
│   │   └── exchange_rate_API_with_paremeters.py
│   ├── EDAMAM API
│   │   ├── EDAMAM_API.py
│   │   ├── RoastedChicken_nutrients.csv
│   │   ├── Section 3 - Downloading files with requests.ipynb
│   │   ├── Section 3 - EDAMAM API - Initial setup and registration.ipynb
│   │   └── Section 3 - EDAMAM API - Sending a POST request.ipynb
│   ├── GitHub API
│   │   ├── Section 3 - GitHub API - Pagination.ipynb
│   │   └── github_API.py
│   └── iTune API
│       ├── Section 3 - iTunes API - Exercise Solution.ipynb
│       ├── Section 3 - iTunes API - Exrecise Setup.ipynb
│       ├── Section 3 - iTunes API - Structuring and exporting the data.ipynb
│       ├── Section 3 - iTunes API.ipynb
│       ├── iTunes_API.py
│       ├── iTunes_API_structuring_exporting.py
│       ├── songs_info.csv
│       └── songs_info.xlsx
├── 04.HTML Overview
│   ├── Section 4 - CSS and JavaScript.html
│   ├── Section 4 - CSS style tag.html
│   ├── Section 4 - Character encoding - Euro sign.html
│   └── Section 4 - My First Webpage.html
├── 05.Web Scraping with Beautiful Soup
│   ├── Section 5 - Extracting data from nested HTML tags.ipynb
│   ├── Section 5 - Extracting data from the HTML tree.ipynb
│   ├── Section 5 - Extracting text from an HTML tag.ipynb
│   ├── Section 5 - Practical example - Exercise Setup-MyWork.ipynb
│   ├── Section 5 - Practical example - Exercise Setup.ipynb
│   ├── Section 5 - Practical example - Exercise Solution.ipynb
│   ├── Section 5 - Practical example - dealing with links.ipynb
│   ├── Section 5 - Scraping multiple pages automatically.ipynb
│   ├── Section 5 - Searching and navigating the HTML tree.ipynb
│   ├── Section 5 - Searching the HTML tree by attributes.ipynb
│   ├── Section 5 - Setting up your first scraper.ipynb
│   ├── scraper.py
│   ├── scraper2_extracting_data.py
│   ├── scraper3_extracting_text.py
│   ├── scraper4_dealing_links.py
│   ├── scraper5_extracting_nestedHTML.py
│   ├── scraper6_scraping_multiple_pages.py
│   └── wiki_music.html
├── 06.Project Scraping - Rotten Tomatoes
│   ├── Rotten_tomatoes_page_2_HTML_Parser.html
│   ├── Rotten_tomatoes_page_2_LXML_Parser.html
│   ├── Scraper_RottenTomatoes.ipynb
│   ├── Section 6 - Dealing with the cast.ipynb
│   ├── Section 6 - Extracting the rest of the information - Exercise - Setup.ipynb
│   ├── Section 6 - Extracting the rest of the information.ipynb
│   ├── Section 6 - Extracting the score - Setup.ipynb
│   ├── Section 6 - Extracting the score - Solution.ipynb
│   ├── Section 6 - Extracting the title and year of each movie.ipynb
│   ├── Section 6 - Setting up your scraper.ipynb
│   ├── Section 6 - Storing the data in a structured form.ipynb
│   ├── Section 6 -Extracting the rest of the information - Exercise - Solution.ipynb
│   ├── movies_info.csv
│   └── movies_info.xlsx
├── 07.Scraping HTML Tables with Pandas
│   ├── Scraper_HTMLtables.ipynb
│   └── Section 7 - Scraping HTML Tables with the help of Pandas.ipynb
├── 08.Scraping Steam Project
│   ├── New_Trending_Games_Info.csv
│   ├── Scraper Steam - My Work.ipynb
│   ├── Section 8 - Scraping Steam - Setup.ipynb
│   ├── Top_Rated_Games.info.csv
│   ├── Top_Sellers_Games_info.csv
│   ├── Trending_Games_info.csv
│   └── steam.html
├── 08.Scraping Youtube Project
│   ├── Scraper YouTube - MyWork.ipynb
│   ├── Section 8 - Scraping YouTube - Setup.ipynb
│   ├── searched_video.html
│   ├── stairway_to_heaven.html
│   └── youtube.html
├── 09.Common roadblocks when Web Scraping
│   ├── RequestHeaders.ipynb
│   ├── Section 9 - Sample HTML login Form.html
│   ├── Section 9 - Sample login code.ipynb
│   ├── Section 9 - Scraping multiple pages automatically - rate limitting.ipynb
│   └── Sessions.ipynb
├── 10.The Requests-HTML Package
│   ├── Scraper_CSS_Selectors.ipynb
│   ├── Scraper_JavaScript.ipynb
│   ├── Scraper_withRequestsHTML.ipynb
│   ├── Section 10 - CSS Selectors.ipynb
│   ├── Section 10 - Exploring the package capabilities.ipynb
│   ├── Section 10 - Scraping JavaScript.ipynb
│   └── Section 10 - Searching for text.ipynb
├── 11.Scraping JavaScript - SoundCloud Project
│   ├── Scraper SoundCloud - My Work.ipynb
│   └── Section 10 - Scraping SoundCloud - Setup.ipynb
├── LICENSE
└── readme.md
/.gitignore:
--------------------------------------------------------------------------------
1 |
2 | # Created by https://www.gitignore.io/api/python,vagrant,virtualenv,jupyternotebooks
3 | # Edit at https://www.gitignore.io/?templates=python,vagrant,virtualenv,jupyternotebooks
4 |
5 | ### JupyterNotebooks ###
6 | # gitignore template for Jupyter Notebooks
7 | # website: http://jupyter.org/
8 |
9 | .ipynb_checkpoints
10 | */.ipynb_checkpoints/*
11 |
12 | # IPython
13 | profile_default/
14 | ipython_config.py
15 |
16 | # Remove previous ipynb_checkpoints
17 | # git rm -r .ipynb_checkpoints/
18 |
19 | ### Python ###
20 | # Byte-compiled / optimized / DLL files
21 | __pycache__/
22 | *.py[cod]
23 | *$py.class
24 |
25 | # C extensions
26 | *.so
27 |
28 | # Distribution / packaging
29 | .Python
30 | build/
31 | develop-eggs/
32 | dist/
33 | downloads/
34 | eggs/
35 | .eggs/
36 | lib/
37 | lib64/
38 | parts/
39 | sdist/
40 | var/
41 | wheels/
42 | pip-wheel-metadata/
43 | share/python-wheels/
44 | *.egg-info/
45 | .installed.cfg
46 | *.egg
47 | MANIFEST
48 |
49 | # PyInstaller
50 | # Usually these files are written by a python script from a template
51 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
52 | *.manifest
53 | *.spec
54 |
55 | # Installer logs
56 | pip-log.txt
57 | pip-delete-this-directory.txt
58 |
59 | # Unit test / coverage reports
60 | htmlcov/
61 | .tox/
62 | .nox/
63 | .coverage
64 | .coverage.*
65 | .cache
66 | nosetests.xml
67 | coverage.xml
68 | *.cover
69 | .hypothesis/
70 | .pytest_cache/
71 |
72 | # Translations
73 | *.mo
74 | *.pot
75 |
76 | # Scrapy stuff:
77 | .scrapy
78 |
79 | # Sphinx documentation
80 | docs/_build/
81 |
82 | # PyBuilder
83 | target/
84 |
85 | # pyenv
86 | .python-version
87 |
88 | # pipenv
89 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
90 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
91 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
92 | # install all needed dependencies.
93 | #Pipfile.lock
94 |
95 | # celery beat schedule file
96 | celerybeat-schedule
97 |
98 | # SageMath parsed files
99 | *.sage.py
100 |
101 | # Spyder project settings
102 | .spyderproject
103 | .spyproject
104 |
105 | # Rope project settings
106 | .ropeproject
107 |
108 | # Mr Developer
109 | .mr.developer.cfg
110 | .project
111 | .pydevproject
112 |
113 | # mkdocs documentation
114 | /site
115 |
116 | # mypy
117 | .mypy_cache/
118 | .dmypy.json
119 | dmypy.json
120 |
121 | # Pyre type checker
122 | .pyre/
123 |
124 | ### Vagrant ###
125 | # General
126 | .vagrant/*
127 |
128 | # Log files (if you are creating logs in debug mode, uncomment this)
129 | # *.log
130 |
131 | ### Vagrant Patch ###
132 | *.box
133 |
134 | ### VirtualEnv ###
135 | # Virtualenv
136 | # http://iamzed.com/2009/05/07/a-primer-on-virtualenv/
137 | pyvenv.cfg
138 | .env
139 | .venv
140 | env/
141 | venv/
142 | ENV/
143 | env.bak/
144 | venv.bak/
145 | pip-selfcheck.json
146 |
147 | # End of https://www.gitignore.io/api/python,vagrant,virtualenv,jupyternotebooks
148 |
149 | __pycache__
150 | *.pyc
151 | .vagrant
152 |
--------------------------------------------------------------------------------
/01-02.Course Notes/Course Notes - Web Scraping and API Fundamentals in Python.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ptyadana/Web-Scraping-and-API-in-Python/9595bc418866642143eaf4a1f700dd646d81d427/01-02.Course Notes/Course Notes - Web Scraping and API Fundamentals in Python.pdf
--------------------------------------------------------------------------------
/03.Working with APIs/Currency Exchange API/Section 3 - Exchange rates API GETting a JSON reply.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Pulling data from public APIs (without registration) - GET request"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "# loading the packages\n",
17 |     "# requests provides us with the capability of sending HTTP requests to a server\n",
18 | "import requests"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "## Extracting data on currency exchange rates"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 2,
31 | "metadata": {},
32 | "outputs": [],
33 | "source": [
34 | "# We will use an API containing currency exchange rates as published by the European Central Bank\n",
35 | "# Documentation at https://exchangeratesapi.io"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "### Sending a GET request"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 3,
48 | "metadata": {},
49 | "outputs": [],
50 | "source": [
51 | "# Define the base URL\n",
52 | "# Base URL: the part of the URL common to all requests, not containing the parameters\n",
53 | "base_url = \"https://api.exchangeratesapi.io/latest\""
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 4,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "# We can make a GET request to this API endpoint with requests.get\n",
63 | "response = requests.get(base_url)\n",
64 | "\n",
65 | "# This method returns the response from the server\n",
66 | "# We store this response in a variable for future processing"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "### Investigating the response"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 5,
79 | "metadata": {},
80 | "outputs": [
81 | {
82 | "data": {
83 | "text/plain": [
84 | "True"
85 | ]
86 | },
87 | "execution_count": 5,
88 | "metadata": {},
89 | "output_type": "execute_result"
90 | }
91 | ],
92 | "source": [
93 | "# Checking if the request went through ok\n",
94 | "response.ok"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 6,
100 | "metadata": {},
101 | "outputs": [
102 | {
103 | "data": {
104 | "text/plain": [
105 | "200"
106 | ]
107 | },
108 | "execution_count": 6,
109 | "metadata": {},
110 | "output_type": "execute_result"
111 | }
112 | ],
113 | "source": [
114 | "# Checking the status code of the response\n",
115 | "response.status_code"
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "execution_count": 7,
121 | "metadata": {},
122 | "outputs": [
123 | {
124 | "data": {
125 | "text/plain": [
126 | "'{\"rates\":{\"CAD\":1.5613,\"HKD\":8.9041,\"ISK\":145.0,\"PHP\":58.013,\"DKK\":7.4695,\"HUF\":336.25,\"CZK\":25.504,\"AUD\":1.733,\"RON\":4.8175,\"SEK\":10.7203,\"IDR\":16488.05,\"INR\":84.96,\"BRL\":5.4418,\"RUB\":85.1553,\"HRK\":7.55,\"JPY\":117.12,\"THB\":36.081,\"CHF\":1.0594,\"SGD\":1.5841,\"PLN\":4.3132,\"BGN\":1.9558,\"TRY\":7.0002,\"CNY\":7.96,\"NOK\":10.89,\"NZD\":1.8021,\"ZAR\":18.2898,\"USD\":1.1456,\"MXN\":24.3268,\"ILS\":4.0275,\"GBP\":0.87383,\"KRW\":1374.71,\"MYR\":4.8304},\"base\":\"EUR\",\"date\":\"2020-03-09\"}'"
127 | ]
128 | },
129 | "execution_count": 7,
130 | "metadata": {},
131 | "output_type": "execute_result"
132 | }
133 | ],
134 | "source": [
135 | "# Inspecting the content body of the response (as a regular 'string')\n",
136 | "response.text"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": 8,
142 | "metadata": {},
143 | "outputs": [
144 | {
145 | "data": {
146 | "text/plain": [
147 | "b'{\"rates\":{\"CAD\":1.5613,\"HKD\":8.9041,\"ISK\":145.0,\"PHP\":58.013,\"DKK\":7.4695,\"HUF\":336.25,\"CZK\":25.504,\"AUD\":1.733,\"RON\":4.8175,\"SEK\":10.7203,\"IDR\":16488.05,\"INR\":84.96,\"BRL\":5.4418,\"RUB\":85.1553,\"HRK\":7.55,\"JPY\":117.12,\"THB\":36.081,\"CHF\":1.0594,\"SGD\":1.5841,\"PLN\":4.3132,\"BGN\":1.9558,\"TRY\":7.0002,\"CNY\":7.96,\"NOK\":10.89,\"NZD\":1.8021,\"ZAR\":18.2898,\"USD\":1.1456,\"MXN\":24.3268,\"ILS\":4.0275,\"GBP\":0.87383,\"KRW\":1374.71,\"MYR\":4.8304},\"base\":\"EUR\",\"date\":\"2020-03-09\"}'"
148 | ]
149 | },
150 | "execution_count": 8,
151 | "metadata": {},
152 | "output_type": "execute_result"
153 | }
154 | ],
155 | "source": [
156 | "# Inspecting the content of the response (in 'bytes' format)\n",
157 | "response.content"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 9,
163 | "metadata": {},
164 | "outputs": [],
165 | "source": [
166 | "# The data is presented in JSON format"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {},
172 | "source": [
173 | "### Handling the JSON"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": 10,
179 | "metadata": {},
180 | "outputs": [
181 | {
182 | "data": {
183 | "text/plain": [
184 | "{'rates': {'CAD': 1.5613,\n",
185 | " 'HKD': 8.9041,\n",
186 | " 'ISK': 145.0,\n",
187 | " 'PHP': 58.013,\n",
188 | " 'DKK': 7.4695,\n",
189 | " 'HUF': 336.25,\n",
190 | " 'CZK': 25.504,\n",
191 | " 'AUD': 1.733,\n",
192 | " 'RON': 4.8175,\n",
193 | " 'SEK': 10.7203,\n",
194 | " 'IDR': 16488.05,\n",
195 | " 'INR': 84.96,\n",
196 | " 'BRL': 5.4418,\n",
197 | " 'RUB': 85.1553,\n",
198 | " 'HRK': 7.55,\n",
199 | " 'JPY': 117.12,\n",
200 | " 'THB': 36.081,\n",
201 | " 'CHF': 1.0594,\n",
202 | " 'SGD': 1.5841,\n",
203 | " 'PLN': 4.3132,\n",
204 | " 'BGN': 1.9558,\n",
205 | " 'TRY': 7.0002,\n",
206 | " 'CNY': 7.96,\n",
207 | " 'NOK': 10.89,\n",
208 | " 'NZD': 1.8021,\n",
209 | " 'ZAR': 18.2898,\n",
210 | " 'USD': 1.1456,\n",
211 | " 'MXN': 24.3268,\n",
212 | " 'ILS': 4.0275,\n",
213 | " 'GBP': 0.87383,\n",
214 | " 'KRW': 1374.71,\n",
215 | " 'MYR': 4.8304},\n",
216 | " 'base': 'EUR',\n",
217 | " 'date': '2020-03-09'}"
218 | ]
219 | },
220 | "execution_count": 10,
221 | "metadata": {},
222 | "output_type": "execute_result"
223 | }
224 | ],
225 | "source": [
226 |     "# Requests has an in-built method to directly convert the response to JSON format\n",
227 | "response.json()"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": 11,
233 | "metadata": {},
234 | "outputs": [
235 | {
236 | "data": {
237 | "text/plain": [
238 | "dict"
239 | ]
240 | },
241 | "execution_count": 11,
242 | "metadata": {},
243 | "output_type": "execute_result"
244 | }
245 | ],
246 | "source": [
247 | "# In Python, this JSON is stored as a dictionary\n",
248 | "type(response.json())"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": 12,
254 | "metadata": {},
255 | "outputs": [],
256 | "source": [
257 | "# A useful library for JSON manipulation and pretty print\n",
258 | "import json\n",
259 | "\n",
260 | "# It has two main methods:\n",
261 | "# .loads(), which creates a Python dictionary from a JSON format string (just as response.json() does)\n",
262 | "# .dumps(), which creates a JSON format string out of a Python dictionary "
263 | ]
264 | },
265 | {
266 | "cell_type": "code",
267 | "execution_count": 13,
268 | "metadata": {},
269 | "outputs": [
270 | {
271 | "data": {
272 | "text/plain": [
273 | "'{\\n \"rates\": {\\n \"CAD\": 1.5613,\\n \"HKD\": 8.9041,\\n \"ISK\": 145.0,\\n \"PHP\": 58.013,\\n \"DKK\": 7.4695,\\n \"HUF\": 336.25,\\n \"CZK\": 25.504,\\n \"AUD\": 1.733,\\n \"RON\": 4.8175,\\n \"SEK\": 10.7203,\\n \"IDR\": 16488.05,\\n \"INR\": 84.96,\\n \"BRL\": 5.4418,\\n \"RUB\": 85.1553,\\n \"HRK\": 7.55,\\n \"JPY\": 117.12,\\n \"THB\": 36.081,\\n \"CHF\": 1.0594,\\n \"SGD\": 1.5841,\\n \"PLN\": 4.3132,\\n \"BGN\": 1.9558,\\n \"TRY\": 7.0002,\\n \"CNY\": 7.96,\\n \"NOK\": 10.89,\\n \"NZD\": 1.8021,\\n \"ZAR\": 18.2898,\\n \"USD\": 1.1456,\\n \"MXN\": 24.3268,\\n \"ILS\": 4.0275,\\n \"GBP\": 0.87383,\\n \"KRW\": 1374.71,\\n \"MYR\": 4.8304\\n },\\n \"base\": \"EUR\",\\n \"date\": \"2020-03-09\"\\n}'"
274 | ]
275 | },
276 | "execution_count": 13,
277 | "metadata": {},
278 | "output_type": "execute_result"
279 | }
280 | ],
281 | "source": [
282 | "# .dumps() has options to make the string 'prettier', more readable\n",
283 | "# We can choose the number of spaces to be used as indentation\n",
284 | "json.dumps(response.json(), indent=4)"
285 | ]
286 | },
287 | {
288 | "cell_type": "code",
289 | "execution_count": 14,
290 | "metadata": {},
291 | "outputs": [
292 | {
293 | "name": "stdout",
294 | "output_type": "stream",
295 | "text": [
296 | "{\n",
297 | " \"rates\": {\n",
298 | " \"CAD\": 1.5613,\n",
299 | " \"HKD\": 8.9041,\n",
300 | " \"ISK\": 145.0,\n",
301 | " \"PHP\": 58.013,\n",
302 | " \"DKK\": 7.4695,\n",
303 | " \"HUF\": 336.25,\n",
304 | " \"CZK\": 25.504,\n",
305 | " \"AUD\": 1.733,\n",
306 | " \"RON\": 4.8175,\n",
307 | " \"SEK\": 10.7203,\n",
308 | " \"IDR\": 16488.05,\n",
309 | " \"INR\": 84.96,\n",
310 | " \"BRL\": 5.4418,\n",
311 | " \"RUB\": 85.1553,\n",
312 | " \"HRK\": 7.55,\n",
313 | " \"JPY\": 117.12,\n",
314 | " \"THB\": 36.081,\n",
315 | " \"CHF\": 1.0594,\n",
316 | " \"SGD\": 1.5841,\n",
317 | " \"PLN\": 4.3132,\n",
318 | " \"BGN\": 1.9558,\n",
319 | " \"TRY\": 7.0002,\n",
320 | " \"CNY\": 7.96,\n",
321 | " \"NOK\": 10.89,\n",
322 | " \"NZD\": 1.8021,\n",
323 | " \"ZAR\": 18.2898,\n",
324 | " \"USD\": 1.1456,\n",
325 | " \"MXN\": 24.3268,\n",
326 | " \"ILS\": 4.0275,\n",
327 | " \"GBP\": 0.87383,\n",
328 | " \"KRW\": 1374.71,\n",
329 | " \"MYR\": 4.8304\n",
330 | " },\n",
331 | " \"base\": \"EUR\",\n",
332 | " \"date\": \"2020-03-09\"\n",
333 | "}\n"
334 | ]
335 | }
336 | ],
337 | "source": [
338 | "# In order to visualize these changes, we need to print the string\n",
339 | "print(json.dumps(response.json(), indent=4))"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": 15,
345 | "metadata": {},
346 | "outputs": [
347 | {
348 | "data": {
349 | "text/plain": [
350 | "dict_keys(['rates', 'base', 'date'])"
351 | ]
352 | },
353 | "execution_count": 15,
354 | "metadata": {},
355 | "output_type": "execute_result"
356 | }
357 | ],
358 | "source": [
359 | "# It contains 3 keys; the value for the 'rates' key is another dictionary\n",
360 | "response.json().keys()"
361 | ]
362 | }
363 | ],
364 | "metadata": {
365 | "kernelspec": {
366 | "display_name": "Python 3",
367 | "language": "python",
368 | "name": "python3"
369 | },
370 | "language_info": {
371 | "codemirror_mode": {
372 | "name": "ipython",
373 | "version": 3
374 | },
375 | "file_extension": ".py",
376 | "mimetype": "text/x-python",
377 | "name": "python",
378 | "nbconvert_exporter": "python",
379 | "pygments_lexer": "ipython3",
380 | "version": "3.7.3"
381 | }
382 | },
383 | "nbformat": 4,
384 | "nbformat_minor": 2
385 | }
386 |
--------------------------------------------------------------------------------
/03.Working with APIs/Currency Exchange API/Section 3 - Incorporating parameters in a GET request.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Pulling data from public APIs (without registration) - GET request"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "# loading the packages\n",
17 |     "# requests provides us with the capability of sending HTTP requests to a server\n",
18 | "import requests"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "## Extracting data on currency exchange rates"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 2,
31 | "metadata": {},
32 | "outputs": [],
33 | "source": [
34 | "# We will use an API containing currency exchange rates as published by the European Central Bank\n",
35 | "# Documentation at https://exchangeratesapi.io"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "### Sending a GET request"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 3,
48 | "metadata": {},
49 | "outputs": [],
50 | "source": [
51 | "# Define the base URL\n",
52 | "# Base URL: the part of the URL common to all requests, not containing the parameters\n",
53 | "base_url = \"https://api.exchangeratesapi.io/latest\""
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 4,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "# We can make a GET request to this API endpoint with requests.get\n",
63 | "response = requests.get(base_url)\n",
64 | "\n",
65 | "# This method returns the response from the server\n",
66 | "# We store this response in a variable for future processing"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "### Investigating the response"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 5,
79 | "metadata": {},
80 | "outputs": [
81 | {
82 | "data": {
83 | "text/plain": [
84 | "True"
85 | ]
86 | },
87 | "execution_count": 5,
88 | "metadata": {},
89 | "output_type": "execute_result"
90 | }
91 | ],
92 | "source": [
93 | "# Checking if the request went through ok\n",
94 | "response.ok"
95 | ]
96 | },
97 | {
98 | "cell_type": "code",
99 | "execution_count": 6,
100 | "metadata": {},
101 | "outputs": [
102 | {
103 | "data": {
104 | "text/plain": [
105 | "200"
106 | ]
107 | },
108 | "execution_count": 6,
109 | "metadata": {},
110 | "output_type": "execute_result"
111 | }
112 | ],
113 | "source": [
114 | "# Checking the status code of the response\n",
115 | "response.status_code"
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "execution_count": 7,
121 | "metadata": {},
122 | "outputs": [
123 | {
124 | "data": {
125 | "text/plain": [
126 | "'{\"rates\":{\"CAD\":1.5613,\"HKD\":8.9041,\"ISK\":145.0,\"PHP\":58.013,\"DKK\":7.4695,\"HUF\":336.25,\"CZK\":25.504,\"AUD\":1.733,\"RON\":4.8175,\"SEK\":10.7203,\"IDR\":16488.05,\"INR\":84.96,\"BRL\":5.4418,\"RUB\":85.1553,\"HRK\":7.55,\"JPY\":117.12,\"THB\":36.081,\"CHF\":1.0594,\"SGD\":1.5841,\"PLN\":4.3132,\"BGN\":1.9558,\"TRY\":7.0002,\"CNY\":7.96,\"NOK\":10.89,\"NZD\":1.8021,\"ZAR\":18.2898,\"USD\":1.1456,\"MXN\":24.3268,\"ILS\":4.0275,\"GBP\":0.87383,\"KRW\":1374.71,\"MYR\":4.8304},\"base\":\"EUR\",\"date\":\"2020-03-09\"}'"
127 | ]
128 | },
129 | "execution_count": 7,
130 | "metadata": {},
131 | "output_type": "execute_result"
132 | }
133 | ],
134 | "source": [
135 | "# Inspecting the content body of the response (as a regular 'string')\n",
136 | "response.text"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": 8,
142 | "metadata": {},
143 | "outputs": [
144 | {
145 | "data": {
146 | "text/plain": [
147 | "b'{\"rates\":{\"CAD\":1.5613,\"HKD\":8.9041,\"ISK\":145.0,\"PHP\":58.013,\"DKK\":7.4695,\"HUF\":336.25,\"CZK\":25.504,\"AUD\":1.733,\"RON\":4.8175,\"SEK\":10.7203,\"IDR\":16488.05,\"INR\":84.96,\"BRL\":5.4418,\"RUB\":85.1553,\"HRK\":7.55,\"JPY\":117.12,\"THB\":36.081,\"CHF\":1.0594,\"SGD\":1.5841,\"PLN\":4.3132,\"BGN\":1.9558,\"TRY\":7.0002,\"CNY\":7.96,\"NOK\":10.89,\"NZD\":1.8021,\"ZAR\":18.2898,\"USD\":1.1456,\"MXN\":24.3268,\"ILS\":4.0275,\"GBP\":0.87383,\"KRW\":1374.71,\"MYR\":4.8304},\"base\":\"EUR\",\"date\":\"2020-03-09\"}'"
148 | ]
149 | },
150 | "execution_count": 8,
151 | "metadata": {},
152 | "output_type": "execute_result"
153 | }
154 | ],
155 | "source": [
156 | "# Inspecting the content of the response (in 'bytes' format)\n",
157 | "response.content"
158 | ]
159 | },
160 | {
161 | "cell_type": "code",
162 | "execution_count": 9,
163 | "metadata": {},
164 | "outputs": [],
165 | "source": [
166 | "# The data is presented in JSON format"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {},
172 | "source": [
173 | "### Handling the JSON"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": 10,
179 | "metadata": {},
180 | "outputs": [
181 | {
182 | "data": {
183 | "text/plain": [
184 | "{'rates': {'CAD': 1.5613,\n",
185 | " 'HKD': 8.9041,\n",
186 | " 'ISK': 145.0,\n",
187 | " 'PHP': 58.013,\n",
188 | " 'DKK': 7.4695,\n",
189 | " 'HUF': 336.25,\n",
190 | " 'CZK': 25.504,\n",
191 | " 'AUD': 1.733,\n",
192 | " 'RON': 4.8175,\n",
193 | " 'SEK': 10.7203,\n",
194 | " 'IDR': 16488.05,\n",
195 | " 'INR': 84.96,\n",
196 | " 'BRL': 5.4418,\n",
197 | " 'RUB': 85.1553,\n",
198 | " 'HRK': 7.55,\n",
199 | " 'JPY': 117.12,\n",
200 | " 'THB': 36.081,\n",
201 | " 'CHF': 1.0594,\n",
202 | " 'SGD': 1.5841,\n",
203 | " 'PLN': 4.3132,\n",
204 | " 'BGN': 1.9558,\n",
205 | " 'TRY': 7.0002,\n",
206 | " 'CNY': 7.96,\n",
207 | " 'NOK': 10.89,\n",
208 | " 'NZD': 1.8021,\n",
209 | " 'ZAR': 18.2898,\n",
210 | " 'USD': 1.1456,\n",
211 | " 'MXN': 24.3268,\n",
212 | " 'ILS': 4.0275,\n",
213 | " 'GBP': 0.87383,\n",
214 | " 'KRW': 1374.71,\n",
215 | " 'MYR': 4.8304},\n",
216 | " 'base': 'EUR',\n",
217 | " 'date': '2020-03-09'}"
218 | ]
219 | },
220 | "execution_count": 10,
221 | "metadata": {},
222 | "output_type": "execute_result"
223 | }
224 | ],
225 | "source": [
226 |     "# Requests has an in-built method to directly convert the response to JSON format\n",
227 | "response.json()"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": 11,
233 | "metadata": {},
234 | "outputs": [
235 | {
236 | "data": {
237 | "text/plain": [
238 | "dict"
239 | ]
240 | },
241 | "execution_count": 11,
242 | "metadata": {},
243 | "output_type": "execute_result"
244 | }
245 | ],
246 | "source": [
247 | "# In Python, this JSON is stored as a dictionary\n",
248 | "type(response.json())"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": 12,
254 | "metadata": {},
255 | "outputs": [],
256 | "source": [
257 | "# A useful library for JSON manipulation and pretty print\n",
258 | "import json\n",
259 | "\n",
260 | "# It has two main methods:\n",
261 | "# .loads(), which creates a Python dictionary from a JSON format string (just as response.json() does)\n",
262 | "# .dumps(), which creates a JSON format string out of a Python dictionary "
263 | ]
264 | },
265 | {
266 | "cell_type": "code",
267 | "execution_count": 13,
268 | "metadata": {},
269 | "outputs": [
270 | {
271 | "data": {
272 | "text/plain": [
273 | "'{\\n \"rates\": {\\n \"CAD\": 1.5613,\\n \"HKD\": 8.9041,\\n \"ISK\": 145.0,\\n \"PHP\": 58.013,\\n \"DKK\": 7.4695,\\n \"HUF\": 336.25,\\n \"CZK\": 25.504,\\n \"AUD\": 1.733,\\n \"RON\": 4.8175,\\n \"SEK\": 10.7203,\\n \"IDR\": 16488.05,\\n \"INR\": 84.96,\\n \"BRL\": 5.4418,\\n \"RUB\": 85.1553,\\n \"HRK\": 7.55,\\n \"JPY\": 117.12,\\n \"THB\": 36.081,\\n \"CHF\": 1.0594,\\n \"SGD\": 1.5841,\\n \"PLN\": 4.3132,\\n \"BGN\": 1.9558,\\n \"TRY\": 7.0002,\\n \"CNY\": 7.96,\\n \"NOK\": 10.89,\\n \"NZD\": 1.8021,\\n \"ZAR\": 18.2898,\\n \"USD\": 1.1456,\\n \"MXN\": 24.3268,\\n \"ILS\": 4.0275,\\n \"GBP\": 0.87383,\\n \"KRW\": 1374.71,\\n \"MYR\": 4.8304\\n },\\n \"base\": \"EUR\",\\n \"date\": \"2020-03-09\"\\n}'"
274 | ]
275 | },
276 | "execution_count": 13,
277 | "metadata": {},
278 | "output_type": "execute_result"
279 | }
280 | ],
281 | "source": [
282 | "# .dumps() has options to make the string 'prettier', more readable\n",
283 | "# We can choose the number of spaces to be used as indentation\n",
284 | "json.dumps(response.json(), indent=4)"
285 | ]
286 | },
287 | {
288 | "cell_type": "code",
289 | "execution_count": 14,
290 | "metadata": {},
291 | "outputs": [
292 | {
293 | "name": "stdout",
294 | "output_type": "stream",
295 | "text": [
296 | "{\n",
297 | " \"rates\": {\n",
298 | " \"CAD\": 1.5613,\n",
299 | " \"HKD\": 8.9041,\n",
300 | " \"ISK\": 145.0,\n",
301 | " \"PHP\": 58.013,\n",
302 | " \"DKK\": 7.4695,\n",
303 | " \"HUF\": 336.25,\n",
304 | " \"CZK\": 25.504,\n",
305 | " \"AUD\": 1.733,\n",
306 | " \"RON\": 4.8175,\n",
307 | " \"SEK\": 10.7203,\n",
308 | " \"IDR\": 16488.05,\n",
309 | " \"INR\": 84.96,\n",
310 | " \"BRL\": 5.4418,\n",
311 | " \"RUB\": 85.1553,\n",
312 | " \"HRK\": 7.55,\n",
313 | " \"JPY\": 117.12,\n",
314 | " \"THB\": 36.081,\n",
315 | " \"CHF\": 1.0594,\n",
316 | " \"SGD\": 1.5841,\n",
317 | " \"PLN\": 4.3132,\n",
318 | " \"BGN\": 1.9558,\n",
319 | " \"TRY\": 7.0002,\n",
320 | " \"CNY\": 7.96,\n",
321 | " \"NOK\": 10.89,\n",
322 | " \"NZD\": 1.8021,\n",
323 | " \"ZAR\": 18.2898,\n",
324 | " \"USD\": 1.1456,\n",
325 | " \"MXN\": 24.3268,\n",
326 | " \"ILS\": 4.0275,\n",
327 | " \"GBP\": 0.87383,\n",
328 | " \"KRW\": 1374.71,\n",
329 | " \"MYR\": 4.8304\n",
330 | " },\n",
331 | " \"base\": \"EUR\",\n",
332 | " \"date\": \"2020-03-09\"\n",
333 | "}\n"
334 | ]
335 | }
336 | ],
337 | "source": [
338 | "# In order to visualize these changes, we need to print the string\n",
339 | "print(json.dumps(response.json(), indent=4))"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": 15,
345 | "metadata": {},
346 | "outputs": [
347 | {
348 | "data": {
349 | "text/plain": [
350 | "dict_keys(['rates', 'base', 'date'])"
351 | ]
352 | },
353 | "execution_count": 15,
354 | "metadata": {},
355 | "output_type": "execute_result"
356 | }
357 | ],
358 | "source": [
359 | "# It contains 3 keys; the value for the 'rates' key is another dictionary\n",
360 | "response.json().keys()"
361 | ]
362 | },
363 | {
364 | "cell_type": "markdown",
365 | "metadata": {},
366 | "source": [
367 | "### Incorporating parameters in the GET request"
368 | ]
369 | },
370 | {
371 | "cell_type": "code",
372 | "execution_count": 16,
373 | "metadata": {},
374 | "outputs": [
375 | {
376 | "data": {
377 | "text/plain": [
378 | "'https://api.exchangeratesapi.io/latest?symbols=USD,GBP'"
379 | ]
380 | },
381 | "execution_count": 16,
382 | "metadata": {},
383 | "output_type": "execute_result"
384 | }
385 | ],
386 | "source": [
387 | "# Request parameters are added to the URL after a question mark '?'\n",
388 |     "# In this case, we request the exchange rates of the US Dollar (USD) and the Pound Sterling (GBP) only\n",
389 | "param_url = base_url + \"?symbols=USD,GBP\"\n",
390 | "param_url"
391 | ]
392 | },
393 | {
394 | "cell_type": "code",
395 | "execution_count": 17,
396 | "metadata": {},
397 | "outputs": [
398 | {
399 | "data": {
400 | "text/plain": [
401 | "200"
402 | ]
403 | },
404 | "execution_count": 17,
405 | "metadata": {},
406 | "output_type": "execute_result"
407 | }
408 | ],
409 | "source": [
410 | "# Making a request to the server with the new URL, containing the parameters\n",
411 | "response = requests.get(param_url)\n",
412 | "response.status_code"
413 | ]
414 | },
415 | {
416 | "cell_type": "code",
417 | "execution_count": 18,
418 | "metadata": {},
419 | "outputs": [
420 | {
421 | "data": {
422 | "text/plain": [
423 | "{'rates': {'USD': 1.1456, 'GBP': 0.87383}, 'base': 'EUR', 'date': '2020-03-09'}"
424 | ]
425 | },
426 | "execution_count": 18,
427 | "metadata": {},
428 | "output_type": "execute_result"
429 | }
430 | ],
431 | "source": [
432 | "# Saving the response data\n",
433 | "data = response.json()\n",
434 | "data"
435 | ]
436 | },
437 | {
438 | "cell_type": "code",
439 | "execution_count": 19,
440 | "metadata": {},
441 | "outputs": [
442 | {
443 | "data": {
444 | "text/plain": [
445 | "'EUR'"
446 | ]
447 | },
448 | "execution_count": 19,
449 | "metadata": {},
450 | "output_type": "execute_result"
451 | }
452 | ],
453 | "source": [
454 | "# 'data' is a dictionary\n",
455 | "data['base']"
456 | ]
457 | },
458 | {
459 | "cell_type": "code",
460 | "execution_count": 20,
461 | "metadata": {},
462 | "outputs": [
463 | {
464 | "data": {
465 | "text/plain": [
466 | "'2020-03-09'"
467 | ]
468 | },
469 | "execution_count": 20,
470 | "metadata": {},
471 | "output_type": "execute_result"
472 | }
473 | ],
474 | "source": [
475 | "data['date']"
476 | ]
477 | },
478 | {
479 | "cell_type": "code",
480 | "execution_count": 21,
481 | "metadata": {},
482 | "outputs": [
483 | {
484 | "data": {
485 | "text/plain": [
486 | "{'USD': 1.1456, 'GBP': 0.87383}"
487 | ]
488 | },
489 | "execution_count": 21,
490 | "metadata": {},
491 | "output_type": "execute_result"
492 | }
493 | ],
494 | "source": [
495 | "data['rates']"
496 | ]
497 | },
498 | {
499 | "cell_type": "code",
500 | "execution_count": 22,
501 | "metadata": {},
502 | "outputs": [],
503 | "source": [
504 |     "# As per the documentation of this API, we can change the base currency with the 'base' parameter\n",
505 | "param_url = base_url + \"?symbols=GBP&base=USD\""
506 | ]
507 | },
508 | {
509 | "cell_type": "code",
510 | "execution_count": 23,
511 | "metadata": {},
512 | "outputs": [
513 | {
514 | "data": {
515 | "text/plain": [
516 | "{'rates': {'GBP': 0.7627706006}, 'base': 'USD', 'date': '2020-03-09'}"
517 | ]
518 | },
519 | "execution_count": 23,
520 | "metadata": {},
521 | "output_type": "execute_result"
522 | }
523 | ],
524 | "source": [
525 | "# Sending a request and saving the response JSON, all at once\n",
526 | "data = requests.get(param_url).json()\n",
527 | "data"
528 | ]
529 | },
530 | {
531 | "cell_type": "code",
532 | "execution_count": 24,
533 | "metadata": {},
534 | "outputs": [
535 | {
536 | "data": {
537 | "text/plain": [
538 | "0.7627706006"
539 | ]
540 | },
541 | "execution_count": 24,
542 | "metadata": {},
543 | "output_type": "execute_result"
544 | }
545 | ],
546 | "source": [
547 | "usd_to_gbp = data['rates']['GBP']\n",
548 | "usd_to_gbp"
549 | ]
550 | }
551 | ],
552 | "metadata": {
553 | "kernelspec": {
554 | "display_name": "Python 3",
555 | "language": "python",
556 | "name": "python3"
557 | },
558 | "language_info": {
559 | "codemirror_mode": {
560 | "name": "ipython",
561 | "version": 3
562 | },
563 | "file_extension": ".py",
564 | "mimetype": "text/x-python",
565 | "name": "python",
566 | "nbconvert_exporter": "python",
567 | "pygments_lexer": "ipython3",
568 | "version": "3.7.3"
569 | }
570 | },
571 | "nbformat": 4,
572 | "nbformat_minor": 2
573 | }
574 |
--------------------------------------------------------------------------------
/03.Working with APIs/Currency Exchange API/additional_API_functionalities.py:
--------------------------------------------------------------------------------
1 | #Obtaining Historical Exchange Rates
2 | import requests
3 | import json
4 |
5 | base_url = "https://api.exchangeratesapi.io"
6 |
7 | historical_date_url = base_url + "/2020-04-12"
8 |
9 | response = requests.get(historical_date_url)
10 | data = response.json()
11 |
12 | # data = {'rates': {'CAD': 1.5265, 'HKD': 8.4259, 'ISK': 155.9, 'PHP': 54.939, 'DKK': 7.4657, 'HUF': 354.76, 'CZK': 26.909, 'AUD': 1.7444, 'RON': 4.833, 'SEK': 10.9455, 'IDR': 17243.21, 'INR': 82.9275, 'BRL': 5.5956, 'RUB': 80.69, 'HRK': 7.6175, 'JPY': 118.33, 'THB': 35.665, 'CHF': 1.0558, 'SGD': 1.5479, 'PLN': 4.5586, 'BGN': 1.9558, 'TRY': 7.3233, 'CNY': 7.6709, 'NOK': 11.2143, 'NZD': 1.8128, 'ZAR': 19.6383, 'USD': 1.0867, 'MXN': 26.0321, 'ILS': 3.8919, 'GBP': 0.87565, 'KRW': 1322.49, 'MYR': 4.7136}, 'base': 'EUR', 'date': '2020-04-09'}
13 |
14 | print(json.dumps(data, indent=4, sort_keys=True))
15 |
16 | # Invalid URL
17 | invalid_url = base_url + "/2019-12-01" + "?symbols=USB"
18 | response = requests.get(invalid_url)
19 |
20 | print(response.status_code)
21 | print(response.json())
22 | # 400 for bad request
23 | #invalid response = {'error': "Symbols 'USB' are invalid for date 2019-12-01."}
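24 |
25 | # A defensive sketch: raise_for_status() turns 4xx/5xx responses into
26 | # exceptions, so errors are not silently parsed as data
27 | try:
28 |     response.raise_for_status()
29 | except requests.exceptions.HTTPError as error:
30 |     print(f"Request failed: {error}")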
--------------------------------------------------------------------------------
/03.Working with APIs/Currency Exchange API/currency_converter.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 |
4 | base_url = "https://api.exchangeratesapi.io/"
5 |
6 | print("***** Welcome to Currency Converter ****")
7 | date = input("Please enter the date (in the format 'yyyy-mm-dd') OR type 'latest' : ")
8 | base_currency = input("Currency converted from (example: 'USD') : ")
9 | to_currency = input("Currency converted to (example: 'JPY') : ")
10 | amount = input(f"How much {base_currency} do you want to convert? : ")
11 |
12 | if date and base_currency and to_currency and amount:
13 |
14 | param_url = base_url + date + "?symbols=" + base_currency + "," + to_currency
15 |
16 | if date == 'latest':
17 | param_url = base_url + "latest?symbols=" + base_currency + "," + to_currency
18 |
19 | response = requests.get(param_url)
20 |
21 |     if not response.ok:
22 |         print(f"Oops! Seems like there was an error {response.status_code}. Please try again.")
23 | print(f"{response.json()['error']}")
24 |
25 | else:
26 | data = response.json()
27 |
28 | #testing
29 | # base_currency = 'USD'
30 | # to_currency = 'JPY'
31 | # amount = 100
32 | # data = {'rates': {'JPY': 117.55, 'USD': 1.0936}, 'base': 'EUR', 'date': '2020-04-01'}
33 |
34 | converted_amount = (float(amount) / float(data['rates'][base_currency])) * float(data['rates'][to_currency])
35 | converted_amount = round(converted_amount,2)
36 |
37 |         print(f"The amount equivalent to {base_currency} {amount} is {to_currency} {converted_amount}")
38 |
39 | else:
40 | print("You have provided invalid information. Please try again.")
41 |
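42 | # An alternative sketch: the API can do the conversion maths itself when the
43 | # 'base' parameter is set (as documented for this API), so no division is needed:
44 | #   rate = requests.get(base_url + "latest",
45 | #                       params={"base": base_currency, "symbols": to_currency}).json()['rates'][to_currency]
46 | #   converted_amount = round(float(amount) * rate, 2)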
--------------------------------------------------------------------------------
/03.Working with APIs/Currency Exchange API/exchange_rate_API.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 |
4 | base_url = 'https://api.exchangeratesapi.io/latest'
5 |
6 | # request to the API
7 | response = requests.get(base_url)
8 |
9 | #investigating response
10 | print(response.ok)
11 | print(response.status_code)
12 | print(response.text)
13 |
14 | #handling JSON
15 | json_response = response.json()
16 |
17 | # to avoid calling the API too many times (just for testing purposes)
18 | # json_response = {'rates': {'CAD': 1.5265, 'HKD': 8.4259, 'ISK': 155.9, 'PHP': 54.939, 'DKK': 7.4657, 'HUF': 354.76, 'CZK': 26.909, 'AUD': 1.7444, 'RON': 4.833, 'SEK': 10.9455, 'IDR': 17243.21, 'INR': 82.9275, 'BRL': 5.5956, 'RUB': 80.69, 'HRK': 7.6175, 'JPY': 118.33, 'THB': 35.665, 'CHF': 1.0558, 'SGD': 1.5479, 'PLN': 4.5586, 'BGN': 1.9558, 'TRY': 7.3233, 'CNY': 7.6709, 'NOK': 11.2143, 'NZD': 1.8128, 'ZAR': 19.6383, 'USD': 1.0867, 'MXN': 26.0321, 'ILS': 3.8919, 'GBP': 0.87565, 'KRW': 1322.49, 'MYR': 4.7136}, 'base': 'EUR', 'date': '2020-04-09'}
19 |
20 | #Python Built in package json
21 | #loads(string): converts a JSON formatted string to a Python Object
22 | #dumps(obj): converts a Python Object to a regular string, with options to make the string prettier
23 | print(json.dumps(json_response, indent=4))
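24 |
25 | # A quick sketch of json.loads: it parses a JSON-formatted string into a
26 | # Python dictionary, which is what response.json() does under the hood
27 | parsed = json.loads(response.text)
28 | print(parsed == json_response)  # True: both hold the same dictionary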
--------------------------------------------------------------------------------
/03.Working with APIs/Currency Exchange API/exchange_rate_API_with_paremeters.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 |
4 | base_url = 'https://api.exchangeratesapi.io/latest'
5 |
6 | param_url = base_url + '?symbols=USD,GBP'
7 |
8 | response = requests.get(param_url)
9 | data = response.json()
10 |
11 | # data = {'rates': {'USD': 1.0867, 'GBP': 0.87565}, 'base': 'EUR', 'date': '2020-04-09'}
12 |
13 | print(type(data))
14 | print(data)
15 |
16 | usd_rate = data['rates']['USD']
17 | print(usd_rate)
18 |
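19 | # An equivalent sketch: instead of concatenating the query string by hand,
20 | # requests can build (and URL-encode) it from a dictionary via 'params'
21 | response = requests.get(base_url, params={"symbols": "USD,GBP"})
22 | print(response.url)  # the same query, with requests handling the encoding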
--------------------------------------------------------------------------------
/03.Working with APIs/EDAMAM API/EDAMAM_API.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 | import pandas as pd
4 |
5 | api_endpoint = "https://api.edamam.com/api/nutrition-details"
6 |
7 | app_id = "d5df2415"
8 | app_key = "b87fbe096f386ba8d6b2ad10dcc672d5"
9 | url = api_endpoint + "?app_id=" + app_id + "&app_key=" + app_key
10 |
11 | #Preparing POST request
12 | headers = {
13 | "Content-Type": "application/json"
14 | }
15 |
16 | recipe = {
17 | "title" : "roasted chicken",
18 | "ingr" : ["1 (5 to 6 pound) roasting chicken", "Kosher salt", "Freshly ground black pepper"]
19 | }
20 |
21 | #Sending POST request
22 | response = requests.post(url, headers=headers, json=recipe)
23 | print(response.status_code)
24 |
25 | info = response.json()
26 | print(info.keys())
27 |
28 | # data frame using pandas
29 | nutrients = pd.DataFrame(info['totalNutrients']).transpose()
30 | print(nutrients)
31 |
32 | # export to csv
33 | nutrients.to_csv("RoastedChicken_nutrients.csv")
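34 |
35 | # Credentials sketch: rather than hardcoding app_id/app_key as above, they can
36 | # be read from environment variables (the variable names here are illustrative)
37 | # import os
38 | # app_id = os.environ.get("EDAMAM_APP_ID", app_id)
39 | # app_key = os.environ.get("EDAMAM_APP_KEY", app_key)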
--------------------------------------------------------------------------------
/03.Working with APIs/EDAMAM API/RoastedChicken_nutrients.csv:
--------------------------------------------------------------------------------
1 | ,label,quantity,unit
2 | ENERC_KCAL,Energy,3897.8847966250505,kcal
3 | FAT,Fat,281.7973896498531,g
4 | FASAT,Saturated,80.41792651629662,g
5 | FATRN,Trans,0.0,g
6 | FAMS,Monounsaturated,116.42828684428095,g
7 | FAPU,Polyunsaturated,60.71976612838291,g
8 | CHOCDF,Carbs,6.425249319142501,g
9 | FIBTG,Fiber,1.8935213485650002,g
10 | SUGAR,Sugars,0.047899354272000004,g
11 | PROCNT,Protein,312.01614425200455,g
12 | CHOLE,Cholesterol,1566.2090943730002,mg
13 | NA,Sodium,5818.914444977497,mg
14 | CA,Calcium,218.09684626793904,mg
15 | MG,Magnesium,358.9387221502079,mg
16 | K,Potassium,3669.9071911427136,mg
17 | FE,Iron,25.715630535762603,mg
18 | ZN,Zinc,23.411849338505288,mg
19 | P,Phosphorus,3016.7612062434005,mg
20 | VITA_RAE,Vitamin A,4609.589368849851,µg
21 | VITC,Vitamin C,43.7081607732,mg
22 | THIA,Thiamin (B1),1.0825753017079,mg
23 | RIBF,Riboflavin (B2),3.1276781484795007,mg
24 | NIA,Niacin (B3),117.13235745691864,mg
25 | VITB6A,Vitamin B6,5.84953400740555,mg
26 | FOLDFE,Folate equivalent (total),474.77740164085003,µg
27 | FOLFD,Folate (food),474.77740164085003,µg
28 | FOLAC,Folic acid,0.0,µg
29 | VITB12,Vitamin B12,18.029616318945003,µg
30 | VITD,Vitamin D,0.0,µg
31 | TOCPHA,Vitamin E,0.07783645069200001,mg
32 | VITK1,Vitamin K,12.251756709885,µg
33 | WATER,Water,1202.5662619386048,g
34 |
--------------------------------------------------------------------------------
/03.Working with APIs/EDAMAM API/Section 3 - Downloading files with requests.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Downloading Files with Requests"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "# The requests package can also be used to download files from the web.\n",
17 | "import requests"
18 | ]
19 | },
20 | {
21 | "cell_type": "markdown",
22 | "metadata": {},
23 | "source": [
24 | "## Naive downloading"
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": 2,
30 | "metadata": {},
31 | "outputs": [],
32 | "source": [
33 | "# One way to 'download' a file is to send a request to it.\n",
34 | "# Then, export the content of the response to a local file"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": 3,
40 | "metadata": {},
41 | "outputs": [],
42 | "source": [
43 |     "# Let's use an image from Wikipedia for this purpose\n",
44 | "file_url = \"https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Collage_of_Nine_Dogs.jpg/1024px-Collage_of_Nine_Dogs.jpg\""
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 4,
50 | "metadata": {},
51 | "outputs": [
52 | {
53 | "data": {
54 | "text/plain": [
55 | "200"
56 | ]
57 | },
58 | "execution_count": 4,
59 | "metadata": {},
60 | "output_type": "execute_result"
61 | }
62 | ],
63 | "source": [
64 | "response = requests.get(file_url)\n",
65 | "response.status_code"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 5,
71 | "metadata": {},
72 | "outputs": [
73 | {
74 | "data": {
75 | "text/plain": [
76 | "b'\\xff\\xd8\\xff\\xe0\\x00\\x10JFIF\\x00\\x01\\x01\\x01\\x00H\\x00H\\x00\\x00\\xff\\xfe\\x00OFile source: https://commons.wikimedia.org/wiki/File:Collage_of_Nine_Dogs.jpg\\xff\\xe2\\x02\\x1cICC_PROFILE\\x00\\x01\\x01\\x00\\x00\\x02\\x0clcms\\x02\\x10\\x00\\x00mntrRGB XYZ \\x07\\xdc\\x00\\x01\\x00\\x19\\x00\\x03\\x00)\\x009acspAPPL\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xf6\\xd6\\x00\\x01\\x00\\x00\\x00\\x00\\xd3-lcms\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\ndesc\\x00\\x00\\x00\\xfc\\x00\\x00\\x00^cprt\\x00\\x00\\x01\\\\\\x00\\x00\\x00\\x0bwtpt\\x00\\x00\\x01h\\x00\\x00\\x00\\x14bkpt\\x00\\x00\\x01|\\x00\\x00\\x00\\x14rXYZ\\x00\\x00\\x01\\x90\\x00\\x00\\x00\\x14gXYZ\\x00\\x00\\x01\\xa4\\x00\\x00\\x00\\x14bXYZ\\x00\\x00\\x01\\xb8\\x00\\x00\\x00\\x14rTRC\\x00\\x00\\x01\\xcc\\x00\\x00\\x00@gTRC\\x00\\x00\\x01\\xcc\\x00\\x00\\x00@bTRC\\x00\\x00\\x01\\xcc\\x00\\x00\\x00@desc\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x03c2\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00text\\x00\\x00\\x00\\x00FB\\x00\\x00XYZ \\x00\\x00\\x00\\x00\\x00\\x00\\xf6\\xd6\\x00\\x01\\x00\\x00\\x00\\x00\\xd3-X'"
77 | ]
78 | },
79 | "execution_count": 5,
80 | "metadata": {},
81 | "output_type": "execute_result"
82 | }
83 | ],
84 | "source": [
85 |     "# Printing out the beginning of the content of the response\n",
86 | "# It is in a binary-encoded format, thus it looks like gibberish\n",
87 | "response.content[:500]"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": 6,
93 | "metadata": {},
94 | "outputs": [],
95 | "source": [
96 | "# We need to export this to an image file (jpg, png, gif...)"
97 | ]
98 | },
99 | {
100 | "cell_type": "markdown",
101 | "metadata": {},
102 | "source": [
103 | "### Writing to a file"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": 7,
109 | "metadata": {},
110 | "outputs": [],
111 | "source": [
112 | "# We open/create a file with the function 'open()'\n",
113 | "file = open(\"dog_image.jpg\", \"wb\")\n",
114 | "\n",
115 | "# Then, write to it\n",
116 | "file.write(response.content)\n",
117 | "\n",
118 | "# And close the file after finishing\n",
119 | "file.close()"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": 8,
125 | "metadata": {},
126 | "outputs": [],
127 | "source": [
128 | "# The two parameters in the function open() are:\n",
129 | "# - the name of the file (along with a path to it if it is not in the same directory as our program)\n",
130 |     "# - the mode in which we want to open the file\n",
131 | "\n",
132 | "# Some popular modes are:\n",
133 | "# - 'r' : Opens the file in read-only mode;\n",
134 | "# - 'rb' : Opens the file as read-only in binary format;\n",
135 | "# - 'w' : Creates a file in write-only mode. If the file already exists, it will overwrite it;\n",
136 | "# - 'wb': Write-only mode in binary format;\n",
137 | "# - 'a' : Opens the file for appending new information to the end;\n",
138 | "# - 'w+' : Opens the file for writing and reading;\n",
139 | "\n",
140 | "# We have used 'wb' in this example, since we want to export the data to a file (thus, write to it)\n",
141 | "# and response.content is in bytes\n",
142 | "\n",
143 | "# Never forget to close the file!"
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": 9,
149 | "metadata": {},
150 | "outputs": [],
151 | "source": [
152 | "# To ensure the file will always be closed, use the 'with' statement\n",
153 | "# This automatically calls file.close() at the end"
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": 10,
159 | "metadata": {},
160 | "outputs": [],
161 | "source": [
162 | "with open(\"dog_image_2.jpg\", \"wb\") as file:\n",
163 | " file.write(response.content)"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": null,
169 | "metadata": {},
170 | "outputs": [],
171 | "source": []
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": 11,
176 | "metadata": {},
177 | "outputs": [],
178 | "source": [
179 |     "# Here, we first receive the whole file and store it in RAM, then export it to the hard disk\n",
180 |     "# This method is quite inefficient, especially for bigger files:\n",
181 |     "# in effect, we download the whole file to RAM before writing any of it to disk\n",
182 | "\n",
183 | "# We can fix that with a couple of small changes to our code"
184 | ]
185 | },
186 | {
187 | "cell_type": "markdown",
188 | "metadata": {},
189 | "source": [
190 | "## Streaming the download to a file"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": 12,
196 | "metadata": {},
197 | "outputs": [],
198 | "source": [
199 |     "# Instead of reading the whole response immediately,\n",
200 | "# we can signal the program to only read part of the response when we tell it to.\n",
201 | "\n",
202 | "# This is achieved with the 'stream' parameter"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": 13,
208 | "metadata": {},
209 | "outputs": [],
210 | "source": [
211 | "# I will use test video files provided by file-examples.com\n",
212 | "url = \"https://file-examples.com/wp-content/uploads/2017/04/file_example_MP4_480_1_5MG.mp4\""
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": 14,
218 | "metadata": {},
219 | "outputs": [],
220 | "source": [
221 | "r = requests.get(url, stream = True)\n",
222 | "\n",
223 | "with open(\"Sample_video_1,5_MB.mp4\", \"wb\") as f:\n",
224 | " \n",
225 | " # Now we iterate over the response in chunks\n",
226 | " for chunk in r.iter_content(chunk_size = 16*1024):\n",
227 | " f.write(chunk)"
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": 15,
233 | "metadata": {},
234 | "outputs": [],
235 | "source": [
236 |     "# You can adjust the chunk size to find the best download speed for your system"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 16,
242 | "metadata": {},
243 | "outputs": [],
244 | "source": [
245 |     "# However, when using 'stream=True', requests will not close the connection to the server until all data has been read\n",
246 | "# Thus, sometimes the connection needs to be closed manually\n",
247 | "\n",
248 | "# Again, that is best done using the 'with' statement"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": 17,
254 | "metadata": {},
255 | "outputs": [],
256 | "source": [
257 | "# So, the final code for file download is\n",
258 | "url = \"https://file-examples.com/wp-content/uploads/2017/04/file_example_MP4_1920_18MG.mp4\"\n",
259 | "\n",
260 | "with requests.get(url, stream = True) as r:\n",
261 | " with open(\"Sample_video_18_MB.mp4\", \"wb\") as f:\n",
262 | " for chunk in r.iter_content(chunk_size = 16*1024):\n",
263 | " f.write(chunk)\n"
264 | ]
265 | }
266 | ],
267 | "metadata": {
268 | "kernelspec": {
269 | "display_name": "Python 3",
270 | "language": "python",
271 | "name": "python3"
272 | },
273 | "language_info": {
274 | "codemirror_mode": {
275 | "name": "ipython",
276 | "version": 3
277 | },
278 | "file_extension": ".py",
279 | "mimetype": "text/x-python",
280 | "name": "python",
281 | "nbconvert_exporter": "python",
282 | "pygments_lexer": "ipython3",
283 | "version": "3.7.3"
284 | }
285 | },
286 | "nbformat": 4,
287 | "nbformat_minor": 2
288 | }
289 |
--------------------------------------------------------------------------------
/03.Working with APIs/EDAMAM API/Section 3 - EDAMAM API - Initial setup and registration.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# API requiring registration - POST request"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### Registering to the API"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 1,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "# We will use a nutritional analysis API\n",
24 | "# It requires registration (we need an API key to validate ourselves)\n",
25 | "# Many APIs require this kind of registration"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 2,
31 | "metadata": {},
32 | "outputs": [],
33 | "source": [
34 | "# You can sign-up for the Developer (Free) edition here: \n",
35 | "# https://developer.edamam.com/edamam-nutrition-api\n",
36 | "\n",
37 | "# API documentation: \n",
38 | "# https://developer.edamam.com/edamam-docs-nutrition-api"
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "### Initial Setup"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": 3,
51 | "metadata": {},
52 | "outputs": [],
53 | "source": [
54 | "# loading the packages\n",
55 | "import requests\n",
56 | "import json"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 4,
62 | "metadata": {},
63 | "outputs": [],
64 | "source": [
65 | "# Store the ID and Key in variables\n",
66 | "\n",
67 |     "APP_ID = \"your_API_ID_here\"\n",
68 |     "APP_KEY = \"your_API_key_here\"\n",
69 |     "\n",
70 |     "# Note: these are placeholder values, not a real ID and Key;\n",
71 |     "# replace them with the ones you received upon registration"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 5,
77 | "metadata": {},
78 | "outputs": [],
79 | "source": [
80 | "# Setting up the request URL\n",
81 | "api_endpoint = \"https://api.edamam.com/api/nutrition-details\"\n",
82 | "\n",
83 | "url = api_endpoint + \"?app_id=\" + APP_ID + \"&app_key=\" + APP_KEY"
84 | ]
85 | }
86 | ],
87 | "metadata": {
88 | "kernelspec": {
89 | "display_name": "Python 3",
90 | "language": "python",
91 | "name": "python3"
92 | },
93 | "language_info": {
94 | "codemirror_mode": {
95 | "name": "ipython",
96 | "version": 3
97 | },
98 | "file_extension": ".py",
99 | "mimetype": "text/x-python",
100 | "name": "python",
101 | "nbconvert_exporter": "python",
102 | "pygments_lexer": "ipython3",
103 | "version": "3.7.3"
104 | }
105 | },
106 | "nbformat": 4,
107 | "nbformat_minor": 2
108 | }
109 |
--------------------------------------------------------------------------------
/03.Working with APIs/GitHub API/github_API.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 |
4 | base_url = "https://jobs.github.com/positions.json"
5 |
6 | #Extracting results from multiple pages
7 | results = []
8 |
9 | for index in range(10):
10 | response = requests.get(base_url, params= {"description":"python", "location":"new york","page": index+1})
11 |
12 | print(response.url)
13 | # print(response.json())
14 | if len(response.json()) == 0:
15 | break
16 |
17 | results.extend(response.json())
18 |
19 | print(len(results))
20 |
21 | # pretty-print the last page fetched, for inspection
22 | data = response.json()
23 | print(json.dumps(data, indent=4))
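24 |
25 | # A session-based sketch: requests.Session reuses the underlying connection,
26 | # which is friendlier to the server when fetching many pages from one host
27 | with requests.Session() as session:
28 |     first_page = session.get(base_url, params={"description": "python", "page": 1})
29 |     print(first_page.status_code)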
--------------------------------------------------------------------------------
/03.Working with APIs/iTune API/iTunes_API.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 |
4 | base_site = "https://itunes.apple.com/search"
5 |
6 | response = requests.get(base_site,params={"term":"fifth harmony", "country":"us","limit": 200})
7 |
8 | print(response.url)
9 | print(response.status_code)
10 |
11 | info = response.json()
12 | print(json.dumps(info, indent=4))
13 |
14 | # names and release dates of the songs
15 | for result in info['results']:
16 | print(result['trackName'])
17 | print(result['releaseDate'])
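18 |
19 | # A defensive variant (sketch): results that are not songs may lack these
20 | # keys, so dict.get avoids a potential KeyError
21 | for result in info['results']:
22 |     print(result.get('trackName', 'N/A'), result.get('releaseDate', 'N/A'))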
--------------------------------------------------------------------------------
/03.Working with APIs/iTune API/iTunes_API_structuring_exporting.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import json
3 | import pandas as pd
4 |
5 | base_site = "https://itunes.apple.com/search"
6 |
7 | response = requests.get(base_site,params={"term":"fifth harmony", "country":"us","limit": 200})
8 |
9 | info = response.json()
10 |
11 | #dataframe with pandas
12 | songs_df = pd.DataFrame(info['results'])
13 | print(songs_df)
14 |
15 | #export to csv or excel
16 | songs_df.to_csv('songs_info.csv')
17 |
18 | songs_df.to_excel('songs_info.xlsx')
19 |
20 |
21 |
22 |
23 |
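Two small refinements worth knowing here: to_csv and to_excel write the DataFrame index as an extra first column unless told otherwise, and to_excel relies on an external engine (such as openpyxl) being installed. A sketch, continuing from songs_df above:

# Drop the index column from the exported files
songs_df.to_csv('songs_info.csv', index=False)

# Requires an Excel engine, e.g.: pip install openpyxl
songs_df.to_excel('songs_info.xlsx', index=False)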
--------------------------------------------------------------------------------
/03.Working with APIs/iTune API/songs_info.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ptyadana/Web-Scraping-and-API-in-Python/9595bc418866642143eaf4a1f700dd646d81d427/03.Working with APIs/iTune API/songs_info.xlsx
--------------------------------------------------------------------------------
/04.HTML Overview/Section 4 - CSS and JavaScript.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | CSS and JavaScript
6 |
7 |
8 |
9 |
10 |
11 |
12 | Come to the dark side, we have cookies!
13 |
14 |
15 |
18 |
19 |
20 |
23 |
24 |
25 |
26 |
27 |
--------------------------------------------------------------------------------
/04.HTML Overview/Section 4 - CSS style tag.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
21 |
22 |
23 |
24 | This is a heading
25 | This is a paragraph.
26 | I am different
27 |
28 |
--------------------------------------------------------------------------------
/04.HTML Overview/Section 4 - Character encoding - Euro sign.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | Character encoding in HTML
6 |
7 |
8 |
9 |
10 | This is the Euro sign: € (method 1)
11 | This is the Euro sign: € (method 2)
12 | This is the Euro sign: € (method 3)
13 |
14 |
15 |
16 |
17 |
--------------------------------------------------------------------------------
/04.HTML Overview/Section 4 - My First Webpage.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | My First Webpage
6 |
7 |
8 |
9 | This is not the web page you are looking for. Move along, move along!
10 |
11 |
12 |
13 | Click here for high-quality music.
14 |
15 |
16 |
17 | Click here for high-quality music in a new tab.
18 |
19 |
20 |
21 |
22 |
23 |
--------------------------------------------------------------------------------
/05.Web Scraping with Beautiful Soup/Section 5 - Practical example - Exercise Setup.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "### Importing the packages"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": null,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "# Load the packages\n",
17 | "import requests\n",
18 | "from bs4 import BeautifulSoup"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "### Making a get request"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": null,
31 | "metadata": {},
32 | "outputs": [],
33 | "source": [
34 | "# Defining the url of the site\n",
35 | "base_site = \"https://en.wikipedia.org/wiki/Music\"\n",
36 | "\n",
37 | "# Making a get request\n",
38 | "response = requests.get(base_site)\n",
39 | "response"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": null,
45 | "metadata": {},
46 | "outputs": [],
47 | "source": [
48 | "# Extracting the HTML\n",
49 | "html = response.content"
50 | ]
51 | },
52 | {
53 | "cell_type": "markdown",
54 | "metadata": {},
55 | "source": [
56 | "### Making the soup"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": null,
62 | "metadata": {},
63 | "outputs": [],
64 | "source": [
65 | "# Convert HTML to a BeautifulSoup object. This will allow us to parse out content from the HTML more easily.\n",
66 | "# Using the default parser as it is included in Python\n",
67 | "soup = BeautifulSoup(html, \"html.parser\")"
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "### 1. Extract all existing titles of links"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": null,
80 | "metadata": {
81 | "scrolled": true
82 | },
83 | "outputs": [],
84 | "source": [
85 | "# Find all links on the page \n",
86 | "links = soup.find_all('a')\n",
87 | "links"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": null,
93 | "metadata": {},
94 | "outputs": [],
95 | "source": [
96 | "# Dropping the links without 'href' attribute"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": null,
102 | "metadata": {},
103 | "outputs": [],
104 | "source": [
105 | "# Getting all titles"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": null,
111 | "metadata": {},
112 | "outputs": [],
113 | "source": [
114 | "# Removing the 'None' titles"
115 | ]
116 | },
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "### 2. Extract all heading 2 strings."
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "metadata": {},
128 | "outputs": [],
129 | "source": [
130 | "# Inspect all h2 tags"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": null,
136 | "metadata": {},
137 | "outputs": [],
138 | "source": [
139 | "# Get the text"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "### 3. Print the whole footer text."
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": null,
152 | "metadata": {
153 | "scrolled": true
154 | },
155 | "outputs": [],
156 | "source": [
157 | "# By inspection: we see that the footer is contained inside a ..."
158 | ]
159 | }
160 | ],
161 | "metadata": {
162 | "kernelspec": {
163 | "display_name": "Python 3",
164 | "language": "python",
165 | "name": "python3"
166 | },
167 | "language_info": {
168 | "codemirror_mode": {
169 | "name": "ipython",
170 | "version": 3
171 | },
172 | "file_extension": ".py",
173 | "mimetype": "text/x-python",
174 | "name": "python",
175 | "nbconvert_exporter": "python",
176 | "pygments_lexer": "ipython3",
177 | "version": "3.7.3"
178 | }
179 | },
180 | "nbformat": 4,
181 | "nbformat_minor": 2
182 | }
183 |
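For reference, one possible solution sketch for the three exercises above. The filtering logic follows the hints in the empty cells; the footer selector is an assumption you should confirm by inspecting the page yourself:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/Music").content, "html.parser")

# 1. Titles of links: keep only <a> tags that have an href, then collect their titles
links = soup.find_all('a')
links_with_href = [link for link in links if link.get('href') is not None]
titles = [link.get('title') for link in links_with_href]
titles = [t for t in titles if t is not None]   # drop the 'None' titles

# 2. Heading 2 strings
h2_texts = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]

# 3. Footer text -- assuming the footer sits in a <footer> tag;
# on some Wikipedia skins it is a <div id="footer"> instead
footer = soup.find('footer') or soup.find(id='footer')
if footer is not None:
    print(footer.get_text())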
--------------------------------------------------------------------------------
/05.Web Scraping with Beautiful Soup/Section 5 - Setting up your first scraper.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Set-up and Workflow"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "### Importing the packages"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": 1,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "# Load the packages\n",
24 | "import requests\n",
25 | "from bs4 import BeautifulSoup"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "### Making a GET request"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 2,
38 | "metadata": {},
39 | "outputs": [
40 | {
41 | "data": {
42 | "text/plain": [
43 | "200"
44 | ]
45 | },
46 | "execution_count": 2,
47 | "metadata": {},
48 | "output_type": "execute_result"
49 | }
50 | ],
51 | "source": [
52 | "# Defining the url of the site\n",
53 | "base_site = \"https://en.wikipedia.org/wiki/Music\"\n",
54 | "\n",
55 | "# Making a get request\n",
56 | "response = requests.get(base_site)\n",
57 | "response.status_code"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": 3,
63 | "metadata": {},
64 | "outputs": [
65 | {
66 | "data": {
67 | "text/plain": [
68 | "b'\\n\\n\\n\\n13 Assassins (2011) 95%,\n",
92 | " ,\n",
93 | " ,\n",
94 | " ,\n",
95 | " ,\n",
96 | " ,\n",
97 | " ,\n",
98 | " Logan (2017) 93%
,\n",
99 | " ,\n",
100 | " ,\n",
101 | " ,\n",
102 | " ,\n",
103 | " ,\n",
104 | " ,\n",
105 | " ,\n",
106 | " ,\n",
107 | " ,\n",
108 | " ,\n",
109 | " ,\n",
110 | " ,\n",
111 | " ,\n",
112 | " ,\n",
113 | " ,\n",
114 | " ,\n",
115 | " ,\n",
116 | " ,\n",
117 | " ,\n",
118 | " Dredd (2012) 79%
,\n",
119 | " ,\n",
120 | " ,\n",
121 | " ,\n",
122 | " ,\n",
123 | " ,\n",
124 | " ,\n",
125 | " ,\n",
126 | " ,\n",
127 | " ,\n",
128 | " ,\n",
129 | " ,\n",
130 | " ,\n",
131 | " ,\n",
132 | " ,\n",
133 | " ,\n",
134 | " ,\n",
135 | " ,\n",
136 | " ,\n",
137 | " ,\n",
138 | " ,\n",
139 | " ,\n",
140 | " ,\n",
141 | " ,\n",
142 | " ,\n",
143 | " Speed (1994) 94%
,\n",
144 | " ,\n",
145 | " ,\n",
146 | " ,\n",
147 | " ,\n",
148 | " ,\n",
149 | " ,\n",
150 | " ,\n",
151 | " Heat (1995) 86%
,\n",
152 | " ,\n",
153 | " ,\n",
154 | " ,\n",
155 | " ,\n",
156 | " ,\n",
157 | " ,\n",
158 | " ,\n",
159 | " ,\n",
160 | " ]"
161 | ]
162 | },
163 | "execution_count": 7,
164 | "metadata": {},
165 | "output_type": "execute_result"
166 | }
167 | ],
168 | "source": [
169 | "# Extracting all 'h2' tags\n",
170 | "headings = [div.find(\"h2\") for div in divs]\n",
171 | "headings"
172 | ]
173 | },
174 | {
175 | "cell_type": "markdown",
176 | "metadata": {},
177 | "source": [
178 | "## Extracting the scores"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": 8,
184 | "metadata": {
185 | "scrolled": true
186 | },
187 | "outputs": [
188 | {
189 | "data": {
190 | "text/plain": [
191 | "[95%,\n",
192 | " 88%,\n",
193 | " 88%,\n",
194 | " 90%,\n",
195 | " 93%,\n",
196 | " 94%,\n",
197 | " 90%,\n",
198 | " 93%,\n",
199 | " 97%,\n",
200 | " 98%,\n",
201 | " 93%,\n",
202 | " 92%,\n",
203 | " 90%,\n",
204 | " 82%,\n",
205 | " 98%,\n",
206 | " 81%,\n",
207 | " 88%,\n",
208 | " 96%,\n",
209 | " 91%,\n",
210 | " 90%,\n",
211 | " 85%,\n",
212 | " 96%,\n",
213 | " 97%,\n",
214 | " 87%,\n",
215 | " 77%,\n",
216 | " 90%,\n",
217 | " 94%,\n",
218 | " 79%,\n",
219 | " 83%,\n",
220 | " 85%,\n",
221 | " 92%,\n",
222 | " 91%,\n",
223 | " 94%,\n",
224 | " 93%,\n",
225 | " 77%,\n",
226 | " 82%,\n",
227 | " 66%,\n",
228 | " 89%,\n",
229 | " 89%,\n",
230 | " 95%,\n",
231 | " 93%,\n",
232 | " 100%,\n",
233 | " 98%,\n",
234 | " 80%,\n",
235 | " 94%,\n",
236 | " 70%,\n",
237 | " 87%,\n",
238 | " 93%,\n",
239 | " 100%,\n",
240 | " 76%,\n",
241 | " 85%,\n",
242 | " 73%,\n",
243 | " 94%,\n",
244 | " 83%,\n",
245 | " 86%,\n",
246 | " 97%,\n",
247 | " 81%,\n",
248 | " 92%,\n",
249 | " 82%,\n",
250 | " 95%,\n",
251 | " 86%,\n",
252 | " 86%,\n",
253 | " 97%,\n",
254 | " 95%,\n",
255 | " 99%,\n",
256 | " 94%,\n",
257 | " 88%,\n",
258 | " 93%,\n",
259 | " 93%,\n",
260 | " 97%]"
261 | ]
262 | },
263 | "execution_count": 8,
264 | "metadata": {},
265 | "output_type": "execute_result"
266 | }
267 | ],
268 | "source": [
269 | "# Filtering only the spans containing the score\n",
270 | "[heading.find(\"span\", class_ = 'tMeterScore') for heading in headings]"
271 | ]
272 | },
273 | {
274 | "cell_type": "code",
275 | "execution_count": 9,
276 | "metadata": {
277 | "scrolled": true
278 | },
279 | "outputs": [
280 | {
281 | "data": {
282 | "text/plain": [
283 | "['95%',\n",
284 | " '88%',\n",
285 | " '88%',\n",
286 | " '90%',\n",
287 | " '93%',\n",
288 | " '94%',\n",
289 | " '90%',\n",
290 | " '93%',\n",
291 | " '97%',\n",
292 | " '98%',\n",
293 | " '93%',\n",
294 | " '92%',\n",
295 | " '90%',\n",
296 | " '82%',\n",
297 | " '98%',\n",
298 | " '81%',\n",
299 | " '88%',\n",
300 | " '96%',\n",
301 | " '91%',\n",
302 | " '90%',\n",
303 | " '85%',\n",
304 | " '96%',\n",
305 | " '97%',\n",
306 | " '87%',\n",
307 | " '77%',\n",
308 | " '90%',\n",
309 | " '94%',\n",
310 | " '79%',\n",
311 | " '83%',\n",
312 | " '85%',\n",
313 | " '92%',\n",
314 | " '91%',\n",
315 | " '94%',\n",
316 | " '93%',\n",
317 | " '77%',\n",
318 | " '82%',\n",
319 | " '66%',\n",
320 | " '89%',\n",
321 | " '89%',\n",
322 | " '95%',\n",
323 | " '93%',\n",
324 | " '100%',\n",
325 | " '98%',\n",
326 | " '80%',\n",
327 | " '94%',\n",
328 | " '70%',\n",
329 | " '87%',\n",
330 | " '93%',\n",
331 | " '100%',\n",
332 | " '76%',\n",
333 | " '85%',\n",
334 | " '73%',\n",
335 | " '94%',\n",
336 | " '83%',\n",
337 | " '86%',\n",
338 | " '97%',\n",
339 | " '81%',\n",
340 | " '92%',\n",
341 | " '82%',\n",
342 | " '95%',\n",
343 | " '86%',\n",
344 | " '86%',\n",
345 | " '97%',\n",
346 | " '95%',\n",
347 | " '99%',\n",
348 | " '94%',\n",
349 | " '88%',\n",
350 | " '93%',\n",
351 | " '93%',\n",
352 | " '97%']"
353 | ]
354 | },
355 | "execution_count": 9,
356 | "metadata": {},
357 | "output_type": "execute_result"
358 | }
359 | ],
360 | "source": [
361 | "# Extracting the score string\n",
362 | "scores = [heading.find(\"span\", class_ = 'tMeterScore').string for heading in headings]\n",
363 | "scores"
364 | ]
365 | },
366 | {
367 | "cell_type": "code",
368 | "execution_count": 10,
369 | "metadata": {
370 | "scrolled": true
371 | },
372 | "outputs": [
373 | {
374 | "data": {
375 | "text/plain": [
376 | "['95',\n",
377 | " '88',\n",
378 | " '88',\n",
379 | " '90',\n",
380 | " '93',\n",
381 | " '94',\n",
382 | " '90',\n",
383 | " '93',\n",
384 | " '97',\n",
385 | " '98',\n",
386 | " '93',\n",
387 | " '92',\n",
388 | " '90',\n",
389 | " '82',\n",
390 | " '98',\n",
391 | " '81',\n",
392 | " '88',\n",
393 | " '96',\n",
394 | " '91',\n",
395 | " '90',\n",
396 | " '85',\n",
397 | " '96',\n",
398 | " '97',\n",
399 | " '87',\n",
400 | " '77',\n",
401 | " '90',\n",
402 | " '94',\n",
403 | " '79',\n",
404 | " '83',\n",
405 | " '85',\n",
406 | " '92',\n",
407 | " '91',\n",
408 | " '94',\n",
409 | " '93',\n",
410 | " '77',\n",
411 | " '82',\n",
412 | " '66',\n",
413 | " '89',\n",
414 | " '89',\n",
415 | " '95',\n",
416 | " '93',\n",
417 | " '100',\n",
418 | " '98',\n",
419 | " '80',\n",
420 | " '94',\n",
421 | " '70',\n",
422 | " '87',\n",
423 | " '93',\n",
424 | " '100',\n",
425 | " '76',\n",
426 | " '85',\n",
427 | " '73',\n",
428 | " '94',\n",
429 | " '83',\n",
430 | " '86',\n",
431 | " '97',\n",
432 | " '81',\n",
433 | " '92',\n",
434 | " '82',\n",
435 | " '95',\n",
436 | " '86',\n",
437 | " '86',\n",
438 | " '97',\n",
439 | " '95',\n",
440 | " '99',\n",
441 | " '94',\n",
442 | " '88',\n",
443 | " '93',\n",
444 | " '93',\n",
445 | " '97']"
446 | ]
447 | },
448 | "execution_count": 10,
449 | "metadata": {},
450 | "output_type": "execute_result"
451 | }
452 | ],
453 | "source": [
454 | "# Removing the '%' sign\n",
455 | "scores = [s.strip('%') for s in scores]\n",
456 | "scores"
457 | ]
458 | },
459 | {
460 | "cell_type": "code",
461 | "execution_count": 11,
462 | "metadata": {
463 | "scrolled": true
464 | },
465 | "outputs": [
466 | {
467 | "data": {
468 | "text/plain": [
469 | "[95,\n",
470 | " 88,\n",
471 | " 88,\n",
472 | " 90,\n",
473 | " 93,\n",
474 | " 94,\n",
475 | " 90,\n",
476 | " 93,\n",
477 | " 97,\n",
478 | " 98,\n",
479 | " 93,\n",
480 | " 92,\n",
481 | " 90,\n",
482 | " 82,\n",
483 | " 98,\n",
484 | " 81,\n",
485 | " 88,\n",
486 | " 96,\n",
487 | " 91,\n",
488 | " 90,\n",
489 | " 85,\n",
490 | " 96,\n",
491 | " 97,\n",
492 | " 87,\n",
493 | " 77,\n",
494 | " 90,\n",
495 | " 94,\n",
496 | " 79,\n",
497 | " 83,\n",
498 | " 85,\n",
499 | " 92,\n",
500 | " 91,\n",
501 | " 94,\n",
502 | " 93,\n",
503 | " 77,\n",
504 | " 82,\n",
505 | " 66,\n",
506 | " 89,\n",
507 | " 89,\n",
508 | " 95,\n",
509 | " 93,\n",
510 | " 100,\n",
511 | " 98,\n",
512 | " 80,\n",
513 | " 94,\n",
514 | " 70,\n",
515 | " 87,\n",
516 | " 93,\n",
517 | " 100,\n",
518 | " 76,\n",
519 | " 85,\n",
520 | " 73,\n",
521 | " 94,\n",
522 | " 83,\n",
523 | " 86,\n",
524 | " 97,\n",
525 | " 81,\n",
526 | " 92,\n",
527 | " 82,\n",
528 | " 95,\n",
529 | " 86,\n",
530 | " 86,\n",
531 | " 97,\n",
532 | " 95,\n",
533 | " 99,\n",
534 | " 94,\n",
535 | " 88,\n",
536 | " 93,\n",
537 | " 93,\n",
538 | " 97]"
539 | ]
540 | },
541 | "execution_count": 11,
542 | "metadata": {},
543 | "output_type": "execute_result"
544 | }
545 | ],
546 | "source": [
547 | "# Converting each score to an integer\n",
548 | "scores = [int(s) for s in scores]\n",
549 | "scores"
550 | ]
551 | }
552 | ],
553 | "metadata": {
554 | "kernelspec": {
555 | "display_name": "Python 3",
556 | "language": "python",
557 | "name": "python3"
558 | },
559 | "language_info": {
560 | "codemirror_mode": {
561 | "name": "ipython",
562 | "version": 3
563 | },
564 | "file_extension": ".py",
565 | "mimetype": "text/x-python",
566 | "name": "python",
567 | "nbconvert_exporter": "python",
568 | "pygments_lexer": "ipython3",
569 | "version": "3.7.3"
570 | }
571 | },
572 | "nbformat": 4,
573 | "nbformat_minor": 2
574 | }
575 |
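The three steps above (find the span, strip the '%', convert to int) can be collapsed into a single pass. A sketch, continuing from the headings list defined earlier, that also guards against movies where the score span is missing, which would otherwise raise an AttributeError:

scores = []
for heading in headings:
    span = heading.find("span", class_="tMeterScore") if heading is not None else None
    if span is not None and span.string is not None:
        scores.append(int(span.string.strip('%')))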
--------------------------------------------------------------------------------
/06.Project Scraping - Rotten Tomatoes/Section 6 - Setting up your scraper.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Set-up"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "# load packages\n",
17 | "import requests\n",
18 | "from bs4 import BeautifulSoup"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "# Define the URL of the site\n",
28 | "base_site = \"https://editorial.rottentomatoes.com/guide/140-essential-action-movies-to-watch-now/2/\""
29 | ]
30 | },
31 | {
32 | "cell_type": "code",
33 | "execution_count": 3,
34 | "metadata": {},
35 | "outputs": [
36 | {
37 | "data": {
38 | "text/plain": [
39 | "200"
40 | ]
41 | },
42 | "execution_count": 3,
43 | "metadata": {},
44 | "output_type": "execute_result"
45 | }
46 | ],
47 | "source": [
48 | "# sending a request to the webpage\n",
49 | "response = requests.get(base_site)\n",
50 | "response.status_code"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": 4,
56 | "metadata": {},
57 | "outputs": [],
58 | "source": [
59 | "# get the HTML from the webpage\n",
60 | "html = response.content"
61 | ]
62 | },
63 | {
64 | "cell_type": "markdown",
65 | "metadata": {},
66 | "source": [
67 | "## Choosing a parser"
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "### html.parser"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 5,
80 | "metadata": {},
81 | "outputs": [],
82 | "source": [
83 | "# convert the HTML to a Beautiful Soup object\n",
84 | "soup = BeautifulSoup(html, 'html.parser')"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": 6,
90 | "metadata": {},
91 | "outputs": [],
92 | "source": [
93 | "# Exporting the HTML to a file\n",
94 | "with open('Rotten_tomatoes_page_2_HTML_Parser.html', 'wb') as file:\n",
95 | " file.write(soup.prettify('utf-8'))"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 7,
101 | "metadata": {},
102 | "outputs": [],
103 | "source": [
104 | "# When inspecting the file we see that HTML element is closed at the begining -- it parsed incorrectly!\n",
105 | "# Let's check another parser"
106 | ]
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "### lxml"
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": 8,
118 | "metadata": {},
119 | "outputs": [],
120 | "source": [
121 | "# convert the HTML to a BeatifulSoup object\n",
122 | "soup = BeautifulSoup(html, 'lxml')"
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": 9,
128 | "metadata": {},
129 | "outputs": [],
130 | "source": [
131 | "# Exporting the HTML to a file\n",
132 | "with open('Rotten_tomatoes_page_2_LXML_Parser.html', 'wb') as file:\n",
133 | " file.write(soup.prettify('utf-8'))"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": 10,
139 | "metadata": {},
140 | "outputs": [],
141 | "source": [
142 | "# By first accounts of inspecting the file everything seems fine"
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "### A word of caution"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": 11,
155 | "metadata": {},
156 | "outputs": [],
157 | "source": [
158 | "# Beautiful Soup ranks the lxml parser as the best one.\n",
159 | "\n",
160 | "# If a parser is not explicitly stated in the Beautiful Soup constructor,\n",
161 | "# the best one available on the current machine is chosen.\n",
162 | "\n",
163 | "# This means that the same piece of code can give different results on different computers."
164 | ]
165 | }
166 | ],
167 | "metadata": {
168 | "kernelspec": {
169 | "display_name": "Python 3",
170 | "language": "python",
171 | "name": "python3"
172 | },
173 | "language_info": {
174 | "codemirror_mode": {
175 | "name": "ipython",
176 | "version": 3
177 | },
178 | "file_extension": ".py",
179 | "mimetype": "text/x-python",
180 | "name": "python",
181 | "nbconvert_exporter": "python",
182 | "pygments_lexer": "ipython3",
183 | "version": "3.7.3"
184 | }
185 | },
186 | "nbformat": 4,
187 | "nbformat_minor": 2
188 | }
189 |
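To make the caution above concrete: always pin the parser, since the default depends on what happens to be installed on the machine running the code. A minimal sketch (assumes lxml is installed):

from bs4 import BeautifulSoup

html = "<p>Hello"  # deliberately malformed HTML

# Implicit parser: Beautiful Soup guesses the "best" installed one (and warns),
# so the same code can produce different trees on different machines
soup_default = BeautifulSoup(html)

# Explicit parser: reproducible wherever lxml is installed
soup_pinned = BeautifulSoup(html, "lxml")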
--------------------------------------------------------------------------------
/06.Project Scraping - Rotten Tomatoes/movies_info.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ptyadana/Web-Scraping-and-API-in-Python/9595bc418866642143eaf4a1f700dd646d81d427/06.Project Scraping - Rotten Tomatoes/movies_info.xlsx
--------------------------------------------------------------------------------
/08.Scraping Steam Project/New_Trending_Games_Info.csv:
--------------------------------------------------------------------------------
1 | Title,Price,Tags
2 | Dreamscaper: Prologue,Free,"Action, Indie, RPG, Free to Play"
3 | RESIDENT EVIL 3,$59.99,"Action, Zombies, Horror, Survival Horror"
4 | ONE PIECE: PIRATE WARRIORS 4,$59.99,"Action, Anime, Co-op, Online Co-Op"
5 | Eternal Radiance,$16.19,"Action, Adventure, RPG, Anime"
6 | Deadside,$19.99,"Massively Multiplayer, Action, Adventure, Indie"
7 | Conqueror's Blade,Free to Play,"Strategy, Massively Multiplayer, Action, Simulation"
8 | Borderlands 3,$59.99,"RPG, Action, Online Co-Op, Looter Shooter"
9 | Granblue Fantasy: Versus,$59.99,"Action, Anime, Fighting, 2D Fighter"
10 | Receiver 2,$17.99,"Simulation, Indie, Action, Shooter"
11 | Rakion Chaos Force,Free,"Action, RPG, Free to Play, Strategy"
12 | Mount & Blade II: Bannerlord,$49.99,"Early Access, Medieval, Strategy, Open World"
13 | Half-Life: Alyx,$59.99,"Masterpiece, Action, VR, Adventure"
14 | Last Oasis,$29.99,"Massively Multiplayer, Survival, Action, Adventure"
15 | DOOM Eternal,$59.99,"Action, Masterpiece, Great Soundtrack, FPS"
16 | Disaster Report 4: Summer Memories,$59.99,"Adventure, Action, Survival, VR"
17 |
--------------------------------------------------------------------------------
/08.Scraping Steam Project/Section 8 - Scraping Steam - Setup.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Extracting data from Steam "
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Initial Setup"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": null,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "from bs4 import BeautifulSoup\n",
24 | "import requests"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "## Connect to Steam webpage"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": null,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "r = requests.get(\"https://store.steampowered.com/tags/en/Action/\")\n",
41 | "r.status_code"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": null,
47 | "metadata": {},
48 | "outputs": [],
49 | "source": [
50 | "html = r.content"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": null,
56 | "metadata": {},
57 | "outputs": [],
58 | "source": [
59 | "soup = BeautifulSoup(html, \"lxml\")"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": null,
65 | "metadata": {},
66 | "outputs": [],
67 | "source": []
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "## What can we scrape from this webpage?\n",
74 | "## 1) Try extracting the names of the top games from this page.\n",
75 | "## 2) What tags contain the prices? Can you extract the price information?\n",
76 | "## 3) Get all of the header tags on the page\n",
77 | "## 4) Can you get the text from each span tag with class equal to \"top_tag\"?\n",
78 | "## 5) Under the \"Narrow by Tag\" section, there are a collection of tags (e.g. \"Indie\", \"Adventure\", etc.). Write code to return these tags.\n",
79 | "## 6) What else can be scraped from this webpage or others on the site?"
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "metadata": {},
85 | "source": [
86 | "## Now is your turn!"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": null,
92 | "metadata": {},
93 | "outputs": [],
94 | "source": []
95 | },
96 | {
97 | "cell_type": "code",
98 | "execution_count": null,
99 | "metadata": {},
100 | "outputs": [],
101 | "source": []
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": null,
106 | "metadata": {},
107 | "outputs": [],
108 | "source": []
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": []
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": null,
120 | "metadata": {},
121 | "outputs": [],
122 | "source": []
123 | }
124 | ],
125 | "metadata": {
126 | "kernelspec": {
127 | "display_name": "Python 3",
128 | "language": "python",
129 | "name": "python3"
130 | },
131 | "language_info": {
132 | "codemirror_mode": {
133 | "name": "ipython",
134 | "version": 3
135 | },
136 | "file_extension": ".py",
137 | "mimetype": "text/x-python",
138 | "name": "python",
139 | "nbconvert_exporter": "python",
140 | "pygments_lexer": "ipython3",
141 | "version": "3.7.3"
142 | }
143 | },
144 | "nbformat": 4,
145 | "nbformat_minor": 2
146 | }
147 |
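A possible starting point for the exercises above. The class names tab_item_name and discount_final_price are assumptions based on inspecting the Steam tag page at the time of writing and may have changed; span.top_tag comes from the exercise text itself:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://store.steampowered.com/tags/en/Action/").content, "lxml")

# 1) Names of the top games (assumed class name)
names = [div.get_text(strip=True)
         for div in soup.find_all('div', class_='tab_item_name')]

# 2) Prices (assumed class name)
prices = [div.get_text(strip=True)
          for div in soup.find_all('div', class_='discount_final_price')]

# 3) All header tags on the page
headers = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

# 4) Text from each span with class "top_tag"
top_tags = [span.get_text(strip=True)
            for span in soup.find_all('span', class_='top_tag')]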
--------------------------------------------------------------------------------
/08.Scraping Steam Project/Top_Rated_Games.info.csv:
--------------------------------------------------------------------------------
1 | Title,Price,Tags
2 | Counter-Strike: Global Offensive,Free to Play,"FPS, Shooter, Multiplayer, Competitive"
3 | Tom Clancy's Rainbow Six® Siege,$19.99,"FPS, Hero Shooter, Multiplayer, Tactical"
4 | Warframe,Free to Play,"Looter Shooter, Free to Play, Action, Co-op"
5 | Left 4 Dead 2,$9.99,"Zombies, Co-op, FPS, Multiplayer"
6 | Counter-Strike,$9.99,"Action, FPS, Multiplayer, Shooter"
7 | Borderlands 2,$19.99,"Loot, Shooter, Action, Multiplayer"
8 | Tomb Raider,$19.99,"Adventure, Action, Female Protagonist, Third Person"
9 | PAYDAY 2,$9.99,"Co-op, Action, FPS, Heist"
10 | Counter-Strike: Source,$9.99,"Shooter, Action, FPS, Multiplayer"
11 | Destiny 2,Free To Play,"Free to Play, Looter Shooter, FPS, Multiplayer"
12 | Half-Life 2,$9.99,"FPS, Action, Sci-fi, Classic"
13 | BioShock Infinite,$29.99,"FPS, Story Rich, Action, Singleplayer"
14 | Mount & Blade: Warband,$19.99,"Medieval, RPG, Open World, Strategy"
15 | Risk of Rain 2,$19.99,"Third-Person Shooter, Action Roguelike, Action, Multiplayer"
16 | MONSTER HUNTER: WORLD,$29.99,"Co-op, Multiplayer, Action, Open World"
17 |
--------------------------------------------------------------------------------
/08.Scraping Steam Project/Top_Sellers_Games_info.csv:
--------------------------------------------------------------------------------
1 | Title,Price,Tags
2 | Counter-Strike: Global Offensive,Free to Play,"FPS, Shooter, Multiplayer, Competitive"
3 | Tom Clancy's Rainbow Six® Siege,$19.99,"FPS, Hero Shooter, Multiplayer, Tactical"
4 | Warframe,Free to Play,"Looter Shooter, Free to Play, Action, Co-op"
5 | Left 4 Dead 2,$9.99,"Zombies, Co-op, FPS, Multiplayer"
6 | Counter-Strike,$9.99,"Action, FPS, Multiplayer, Shooter"
7 | Borderlands 2,$19.99,"Loot, Shooter, Action, Multiplayer"
8 | Tomb Raider,$19.99,"Adventure, Action, Female Protagonist, Third Person"
9 | PAYDAY 2,$9.99,"Co-op, Action, FPS, Heist"
10 | Counter-Strike: Source,$9.99,"Shooter, Action, FPS, Multiplayer"
11 | Destiny 2,Free To Play,"Free to Play, Looter Shooter, FPS, Multiplayer"
12 | Half-Life 2,$9.99,"FPS, Action, Sci-fi, Classic"
13 | BioShock Infinite,$29.99,"FPS, Story Rich, Action, Singleplayer"
14 | Mount & Blade: Warband,$19.99,"Medieval, RPG, Open World, Strategy"
15 | Risk of Rain 2,$19.99,"Third-Person Shooter, Action Roguelike, Action, Multiplayer"
16 | MONSTER HUNTER: WORLD,$29.99,"Co-op, Multiplayer, Action, Open World"
17 |
--------------------------------------------------------------------------------
/08.Scraping Steam Project/Trending_Games_info.csv:
--------------------------------------------------------------------------------
1 | Title,Price,Tags
2 | Counter-Strike: Global Offensive,Free to Play,"FPS, Shooter, Multiplayer, Competitive"
3 | Tom Clancy's Rainbow Six® Siege,$19.99,"FPS, Hero Shooter, Multiplayer, Tactical"
4 | Warframe,Free to Play,"Looter Shooter, Free to Play, Action, Co-op"
5 | Left 4 Dead 2,$9.99,"Zombies, Co-op, FPS, Multiplayer"
6 | Counter-Strike,$9.99,"Action, FPS, Multiplayer, Shooter"
7 | Borderlands 2,$19.99,"Loot, Shooter, Action, Multiplayer"
8 | Tomb Raider,$19.99,"Adventure, Action, Female Protagonist, Third Person"
9 | PAYDAY 2,$9.99,"Co-op, Action, FPS, Heist"
10 | Counter-Strike: Source,$9.99,"Shooter, Action, FPS, Multiplayer"
11 | Destiny 2,Free To Play,"Free to Play, Looter Shooter, FPS, Multiplayer"
12 | Half-Life 2,$9.99,"FPS, Action, Sci-fi, Classic"
13 | BioShock Infinite,$29.99,"FPS, Story Rich, Action, Singleplayer"
14 | Mount & Blade: Warband,$19.99,"Medieval, RPG, Open World, Strategy"
15 | Risk of Rain 2,$19.99,"Third-Person Shooter, Action Roguelike, Action, Multiplayer"
16 | MONSTER HUNTER: WORLD,$29.99,"Co-op, Multiplayer, Action, Open World"
17 |
--------------------------------------------------------------------------------
/08.Scraping Youtube Project/Section 8 - Scraping YouTube - Setup.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Scraping YouTube"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Initial Setup"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": null,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "from bs4 import BeautifulSoup\n",
24 | "import requests"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "## Connect to webpage"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": null,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "r = requests.get(\"https://www.youtube.com/\")\n",
41 | "r.status_code"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": null,
47 | "metadata": {},
48 | "outputs": [],
49 | "source": [
50 | "# get HTML\n",
51 | "html = resp.content"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": null,
57 | "metadata": {},
58 | "outputs": [],
59 | "source": [
60 | "# convert HTML to BeautifulSoup object\n",
61 | "soup = BeautifulSoup(html)"
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": null,
67 | "metadata": {},
68 | "outputs": [],
69 | "source": []
70 | },
71 | {
72 | "cell_type": "markdown",
73 | "metadata": {},
74 | "source": [
75 | "## 1) Scrape the text from each span tag\n",
76 | "## 2) How many images are on YouTube'e homepage?\n",
77 | "## 3) Can you find the URL of the link with title = \"Movies\"? Music? Sports?\n",
78 | "## 4) Now, try connecting to and scraping https://www.youtube.com/results?search_query=stairway+to+heaven\n",
79 | "## a) Can you get the names of the first few videos in the search results?\n",
80 | "## b) Next, connect to one of the search result videos - https://www.youtube.com/watch?v=qHFxncb1gRY\n",
81 | "## c) Can you find the \"related\" videos? What are their titles? Durations? URLs? Number of views?\n",
82 | "## d) Try finding (and scraping) the Twitter description of the video."
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": null,
88 | "metadata": {},
89 | "outputs": [],
90 | "source": []
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": null,
95 | "metadata": {},
96 | "outputs": [],
97 | "source": []
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": null,
102 | "metadata": {},
103 | "outputs": [],
104 | "source": []
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": null,
109 | "metadata": {},
110 | "outputs": [],
111 | "source": []
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": null,
116 | "metadata": {},
117 | "outputs": [],
118 | "source": []
119 | }
120 | ],
121 | "metadata": {
122 | "kernelspec": {
123 | "display_name": "Python 3",
124 | "language": "python",
125 | "name": "python3"
126 | },
127 | "language_info": {
128 | "codemirror_mode": {
129 | "name": "ipython",
130 | "version": 3
131 | },
132 | "file_extension": ".py",
133 | "mimetype": "text/x-python",
134 | "name": "python",
135 | "nbconvert_exporter": "python",
136 | "pygments_lexer": "ipython3",
137 | "version": "3.7.3"
138 | }
139 | },
140 | "nbformat": 4,
141 | "nbformat_minor": 2
142 | }
143 |
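A sketch for the first three exercises. Note that YouTube renders most of its homepage with JavaScript, so a plain requests fetch may see far fewer elements than a browser does (the requests-html section later in the course addresses this):

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.youtube.com/")
soup = BeautifulSoup(r.content, "html.parser")

# 1) Text from each span tag
span_texts = [span.get_text(strip=True) for span in soup.find_all('span')]

# 2) Number of images on the homepage (as served to requests)
num_images = len(soup.find_all('img'))

# 3) URL of the link with title="Movies" -- may be None on the JS-rendered page
movies_link = soup.find('a', title='Movies')
if movies_link is not None:
    print(movies_link.get('href'))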
--------------------------------------------------------------------------------
/09.Common roadblocks when Web Scraping/RequestHeaders.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import requests"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 2,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "headers = {\"User-Agent\": \"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36\"}"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 4,
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "r = requests.get('https://www.youtube.com', headers = headers)"
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 5,
33 | "metadata": {},
34 | "outputs": [
35 | {
36 | "data": {
37 | "text/plain": [
38 | "200"
39 | ]
40 | },
41 | "execution_count": 5,
42 | "metadata": {},
43 | "output_type": "execute_result"
44 | }
45 | ],
46 | "source": [
47 | "r.status_code"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": null,
53 | "metadata": {},
54 | "outputs": [],
55 | "source": []
56 | }
57 | ],
58 | "metadata": {
59 | "kernelspec": {
60 | "display_name": "Python 3",
61 | "language": "python",
62 | "name": "python3"
63 | },
64 | "language_info": {
65 | "codemirror_mode": {
66 | "name": "ipython",
67 | "version": 3
68 | },
69 | "file_extension": ".py",
70 | "mimetype": "text/x-python",
71 | "name": "python",
72 | "nbconvert_exporter": "python",
73 | "pygments_lexer": "ipython3",
74 | "version": "3.7.6"
75 | }
76 | },
77 | "nbformat": 4,
78 | "nbformat_minor": 4
79 | }
80 |
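To see why the custom User-Agent matters, compare what requests sends by default with what the override sends; r.request.headers holds the headers of the prepared request that actually went out. Reusing the headers dictionary defined in the cells above:

r_default = requests.get('https://www.youtube.com')
print(r_default.request.headers['User-Agent'])   # e.g. 'python-requests/2.x' -- easy for sites to flag

r_browser = requests.get('https://www.youtube.com', headers=headers)
print(r_browser.request.headers['User-Agent'])   # the browser string defined above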
--------------------------------------------------------------------------------
/09.Common roadblocks when Web Scraping/Section 9 - Sample HTML login Form.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | HTML Form
6 |
7 |
8 |
9 |
10 |
24 |
25 |
26 |
27 |
--------------------------------------------------------------------------------
/09.Common roadblocks when Web Scraping/Section 9 - Sample login code.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import requests"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 2,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "# URL of the POST request - need to inspect the HTML or use devtools to obtain\n",
19 | "url = \"target_url_of_post_request\""
20 | ]
21 | },
22 | {
23 | "cell_type": "code",
24 | "execution_count": 3,
25 | "metadata": {},
26 | "outputs": [],
27 | "source": [
28 | "# Define parameters sent with the POST request\n",
29 | "# (if there are additional ones, define them as well)\n",
30 | "user = \"Your username goes here\"\n",
31 | "password = \"Your password goes here\""
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 4,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "# Arrange all parameters in a dictionary format with the right names\n",
41 | "payload = {\n",
42 | " \"user[email]\": user,\n",
43 | " \"user[password]\": password\n",
44 | "}"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 5,
50 | "metadata": {},
51 | "outputs": [],
52 | "source": [
53 | "# Create a session so that we have consistent cookies\n",
54 | "s = requests.Session()"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 6,
60 | "metadata": {},
61 | "outputs": [
62 | {
63 | "data": {
64 | "text/plain": [
65 | "200"
66 | ]
67 | },
68 | "execution_count": 6,
69 | "metadata": {},
70 | "output_type": "execute_result"
71 | }
72 | ],
73 | "source": [
74 | "# Submit the POST request through the session\n",
75 | "p = s.post(url, data = payload)\n",
76 | "p.status_code"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 7,
82 | "metadata": {},
83 | "outputs": [],
84 | "source": [
85 | "# You are now logged in and can proceed with scraping the data\n",
86 | "# .\n",
87 | "# .\n",
88 | "# .\n",
89 | "\n",
90 | "# Don't forget to close the session when you are done\n",
91 | "s.close()"
92 | ]
93 | }
94 | ],
95 | "metadata": {
96 | "kernelspec": {
97 | "display_name": "Python 3",
98 | "language": "python",
99 | "name": "python3"
100 | },
101 | "language_info": {
102 | "codemirror_mode": {
103 | "name": "ipython",
104 | "version": 3
105 | },
106 | "file_extension": ".py",
107 | "mimetype": "text/x-python",
108 | "name": "python",
109 | "nbconvert_exporter": "python",
110 | "pygments_lexer": "ipython3",
111 | "version": "3.7.6"
112 | }
113 | },
114 | "nbformat": 4,
115 | "nbformat_minor": 2
116 | }
117 |
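The same login flow can also be written with the session as a context manager, so it is closed even if a request raises. A sketch reusing url and payload from the cells above; the protected-page URL is a placeholder:

with requests.Session() as s:
    p = s.post(url, data=payload)
    p.raise_for_status()

    # subsequent requests reuse the login cookies automatically
    page = s.get("url_of_a_page_behind_the_login")  # placeholder URL
# the session is closed here; no explicit s.close() needed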
--------------------------------------------------------------------------------
/09.Common roadblocks when Web Scraping/Sessions.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import requests"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": null,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "#initialize a session\n",
19 | "s = requests.Session()\n",
20 | "\n",
21 | "#request made through that session\n",
22 | "#related cookies are handled through each session\n",
23 | "r1 = s.post(url1, data = payload)\n",
24 | "\n",
25 | "#request made through that session\n",
26 | "r2 = s.get(url2)\n",
27 | "\n",
28 | "s.close()"
29 | ]
30 | }
31 | ],
32 | "metadata": {
33 | "kernelspec": {
34 | "display_name": "Python 3",
35 | "language": "python",
36 | "name": "python3"
37 | },
38 | "language_info": {
39 | "codemirror_mode": {
40 | "name": "ipython",
41 | "version": 3
42 | },
43 | "file_extension": ".py",
44 | "mimetype": "text/x-python",
45 | "name": "python",
46 | "nbconvert_exporter": "python",
47 | "pygments_lexer": "ipython3",
48 | "version": "3.7.6"
49 | }
50 | },
51 | "nbformat": 4,
52 | "nbformat_minor": 4
53 | }
54 |
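A self-contained way to verify that a Session really carries cookies across requests, using the httpbin.org test service:

import requests

with requests.Session() as s:
    # httpbin sets a cookie and redirects to /cookies
    s.get('https://httpbin.org/cookies/set/token/abc123')

    # the session sends the stored cookie back automatically
    r = s.get('https://httpbin.org/cookies')
    print(r.json())   # {'cookies': {'token': 'abc123'}}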
--------------------------------------------------------------------------------
/10.The Requests-HTML Package/Scraper_JavaScript.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Set up\n"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 2,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "from requests_html import AsyncHTMLSession"
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": 3,
22 | "metadata": {},
23 | "outputs": [],
24 | "source": [
25 | "session = AsyncHTMLSession()"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 4,
31 | "metadata": {},
32 | "outputs": [
33 | {
34 | "data": {
35 | "text/plain": [
36 | "200"
37 | ]
38 | },
39 | "execution_count": 4,
40 | "metadata": {},
41 | "output_type": "execute_result"
42 | }
43 | ],
44 | "source": [
45 | "r = await session.get('https://www.reddit.com')\n",
46 | "r.status_code"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 5,
52 | "metadata": {},
53 | "outputs": [],
54 | "source": [
55 | "divs = r.html.find('div')\n",
56 | "links = r.html.find('a')\n",
57 | "urls = r.html.absolute_links"
58 | ]
59 | },
60 | {
61 | "cell_type": "markdown",
62 | "metadata": {},
63 | "source": [
64 | "## need to render the javascript as html is genereted dynamically with JS"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {},
70 | "source": [
71 | "this will install chromium in PC which acts like web browser, but only used by program"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 10,
77 | "metadata": {},
78 | "outputs": [],
79 | "source": [
80 | "import pyppdf.patch_pyppeteer\n",
81 | "await r.html.arender()"
82 | ]
83 | },
84 | {
85 | "cell_type": "code",
86 | "execution_count": 11,
87 | "metadata": {},
88 | "outputs": [],
89 | "source": [
90 | "new_divs = r.html.find('div')\n",
91 | "new_links = r.html.find('a')\n",
92 | "new_urls = r.html.absolute_links"
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": 12,
98 | "metadata": {},
99 | "outputs": [
100 | {
101 | "data": {
102 | "text/plain": [
103 | "(504, 1649)"
104 | ]
105 | },
106 | "execution_count": 12,
107 | "metadata": {},
108 | "output_type": "execute_result"
109 | }
110 | ],
111 | "source": [
112 | "len(divs) , len(new_divs)"
113 | ]
114 | },
115 | {
116 | "cell_type": "code",
117 | "execution_count": 13,
118 | "metadata": {},
119 | "outputs": [
120 | {
121 | "data": {
122 | "text/plain": [
123 | "(80, 661)"
124 | ]
125 | },
126 | "execution_count": 13,
127 | "metadata": {},
128 | "output_type": "execute_result"
129 | }
130 | ],
131 | "source": [
132 | "len(links), len(new_links)"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": 14,
138 | "metadata": {},
139 | "outputs": [
140 | {
141 | "data": {
142 | "text/plain": [
143 | "(57, 627)"
144 | ]
145 | },
146 | "execution_count": 14,
147 | "metadata": {},
148 | "output_type": "execute_result"
149 | }
150 | ],
151 | "source": [
152 | "len(urls), len(new_urls)"
153 | ]
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 | "### Check the difference between first html and rendered version html"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": 15,
165 | "metadata": {},
166 | "outputs": [
167 | {
168 | "data": {
169 | "text/plain": [
170 | "{'https://www.reddit.com/r/1200isplenty/',\n",
171 | " 'https://www.reddit.com/r/2007scape/',\n",
172 | " 'https://www.reddit.com/r/49ers/',\n",
173 | " 'https://www.reddit.com/r/90DayFiance/',\n",
174 | " 'https://www.reddit.com/r/ACMilan/',\n",
175 | " 'https://www.reddit.com/r/Adelaide/',\n",
176 | " 'https://www.reddit.com/r/Amd/',\n",
177 | " 'https://www.reddit.com/r/Android/',\n",
178 | " 'https://www.reddit.com/r/Animesuggest/',\n",
179 | " 'https://www.reddit.com/r/AnthemTheGame/',\n",
180 | " 'https://www.reddit.com/r/AskCulinary/',\n",
181 | " 'https://www.reddit.com/r/AskMen/',\n",
182 | " 'https://www.reddit.com/r/AskNYC/',\n",
183 | " 'https://www.reddit.com/r/AskReddit/',\n",
184 | " 'https://www.reddit.com/r/AskWomen/',\n",
185 | " 'https://www.reddit.com/r/Astros/',\n",
186 | " 'https://www.reddit.com/r/Atlanta/',\n",
187 | " 'https://www.reddit.com/r/AtlantaUnited/',\n",
188 | " 'https://www.reddit.com/r/Austria/',\n",
189 | " 'https://www.reddit.com/r/Barca/',\n",
190 | " 'https://www.reddit.com/r/BattlefieldV/',\n",
191 | " 'https://www.reddit.com/r/BeautyBoxes/',\n",
192 | " 'https://www.reddit.com/r/BeautyGuruChatter/',\n",
193 | " 'https://www.reddit.com/r/Berserk/',\n",
194 | " 'https://www.reddit.com/r/BigBrother/',\n",
195 | " 'https://www.reddit.com/r/BlackClover/',\n",
196 | " 'https://www.reddit.com/r/Blackops4/',\n",
197 | " 'https://www.reddit.com/r/BoJackHorseman/',\n",
198 | " 'https://www.reddit.com/r/BokuNoHeroAcademia/',\n",
199 | " 'https://www.reddit.com/r/Boruto/',\n",
200 | " 'https://www.reddit.com/r/BostonBruins/',\n",
201 | " 'https://www.reddit.com/r/Boxing/',\n",
202 | " 'https://www.reddit.com/r/Braves/',\n",
203 | " 'https://www.reddit.com/r/BravoRealHousewives/',\n",
204 | " 'https://www.reddit.com/r/Brawlstars/',\n",
205 | " 'https://www.reddit.com/r/Breath_of_the_Wild/',\n",
206 | " 'https://www.reddit.com/r/Brogress/',\n",
207 | " 'https://www.reddit.com/r/Browns/',\n",
208 | " 'https://www.reddit.com/r/C25K/',\n",
209 | " 'https://www.reddit.com/r/CFB/',\n",
210 | " 'https://www.reddit.com/r/CHIBears/',\n",
211 | " 'https://www.reddit.com/r/CHICubs/',\n",
212 | " 'https://www.reddit.com/r/Calgary/',\n",
213 | " 'https://www.reddit.com/r/CampingGear/',\n",
214 | " 'https://www.reddit.com/r/CampingandHiking/',\n",
215 | " 'https://www.reddit.com/r/Cardinals/',\n",
216 | " 'https://www.reddit.com/r/CasualUK/',\n",
217 | " 'https://www.reddit.com/r/Charlotte/',\n",
218 | " 'https://www.reddit.com/r/China/',\n",
219 | " 'https://www.reddit.com/r/ClashOfClans/',\n",
220 | " 'https://www.reddit.com/r/ClashRoyale/',\n",
221 | " 'https://www.reddit.com/r/CoDCompetitive/',\n",
222 | " 'https://www.reddit.com/r/CoachellaValley/',\n",
223 | " 'https://www.reddit.com/r/CollegeBasketball/',\n",
224 | " 'https://www.reddit.com/r/Columbus/',\n",
225 | " 'https://www.reddit.com/r/Competitiveoverwatch/',\n",
226 | " 'https://www.reddit.com/r/Cooking/',\n",
227 | " 'https://www.reddit.com/r/Cricket/',\n",
228 | " 'https://www.reddit.com/r/CrohnsDisease/',\n",
229 | " 'https://www.reddit.com/r/CrusaderKings/',\n",
230 | " 'https://www.reddit.com/r/DBZDokkanBattle/',\n",
231 | " 'https://www.reddit.com/r/DMAcademy/',\n",
232 | " 'https://www.reddit.com/r/Dallas/',\n",
233 | " 'https://www.reddit.com/r/DanLeBatardShow/',\n",
234 | " 'https://www.reddit.com/r/DaysGone/',\n",
235 | " 'https://www.reddit.com/r/Denmark/',\n",
236 | " 'https://www.reddit.com/r/Denver/',\n",
237 | " 'https://www.reddit.com/r/Destiny/',\n",
238 | " 'https://www.reddit.com/r/DestinyTheGame/',\n",
239 | " 'https://www.reddit.com/r/Detroit/',\n",
240 | " 'https://www.reddit.com/r/Disneyland/',\n",
241 | " 'https://www.reddit.com/r/DnD/',\n",
242 | " 'https://www.reddit.com/r/Dodgers/',\n",
243 | " 'https://www.reddit.com/r/DotA2/',\n",
244 | " 'https://www.reddit.com/r/DuelLinks/',\n",
245 | " 'https://www.reddit.com/r/DunderMifflin/',\n",
246 | " 'https://www.reddit.com/r/DynastyFF/',\n",
247 | " 'https://www.reddit.com/r/EDH/',\n",
248 | " 'https://www.reddit.com/r/EDanonymemes/',\n",
249 | " 'https://www.reddit.com/r/EOOD/',\n",
250 | " 'https://www.reddit.com/r/EatCheapAndHealthy/',\n",
251 | " 'https://www.reddit.com/r/Edmonton/',\n",
252 | " 'https://www.reddit.com/r/EliteDangerous/',\n",
253 | " 'https://www.reddit.com/r/EscapefromTarkov/',\n",
254 | " 'https://www.reddit.com/r/Eve/',\n",
255 | " 'https://www.reddit.com/r/FFBraveExvius/',\n",
256 | " 'https://www.reddit.com/r/FIFA/',\n",
257 | " 'https://www.reddit.com/r/FORTnITE/',\n",
258 | " 'https://www.reddit.com/r/FUTMobile/',\n",
259 | " 'https://www.reddit.com/r/Fallout/',\n",
260 | " 'https://www.reddit.com/r/FantasyPL/',\n",
261 | " 'https://www.reddit.com/r/FireEmblemHeroes/',\n",
262 | " 'https://www.reddit.com/r/Fishing/',\n",
263 | " 'https://www.reddit.com/r/Fitness/',\n",
264 | " 'https://www.reddit.com/r/FixedGearBicycle/',\n",
265 | " 'https://www.reddit.com/r/FlashTV/',\n",
266 | " 'https://www.reddit.com/r/FortNiteBR/',\n",
267 | " 'https://www.reddit.com/r/FortniteCompetitive/',\n",
268 | " 'https://www.reddit.com/r/Frugal/',\n",
269 | " 'https://www.reddit.com/r/GameOfThronesMemes/',\n",
270 | " 'https://www.reddit.com/r/Gamingcirclejerk/',\n",
271 | " 'https://www.reddit.com/r/GetMotivated/',\n",
272 | " 'https://www.reddit.com/r/Glitch_in_the_Matrix/',\n",
273 | " 'https://www.reddit.com/r/GlobalOffensive/',\n",
274 | " 'https://www.reddit.com/r/GlobalOffensiveTrade/',\n",
275 | " 'https://www.reddit.com/r/GooglePixel/',\n",
276 | " 'https://www.reddit.com/r/GreenBayPackers/',\n",
277 | " 'https://www.reddit.com/r/Grimdank/',\n",
278 | " 'https://www.reddit.com/r/Guildwars2/',\n",
279 | " 'https://www.reddit.com/r/Gundam/',\n",
280 | " 'https://www.reddit.com/r/HBOGameofThrones/',\n",
281 | " 'https://www.reddit.com/r/Hair/',\n",
282 | " 'https://www.reddit.com/r/HealthyFood/',\n",
283 | " 'https://www.reddit.com/r/HomeImprovement/',\n",
284 | " 'https://www.reddit.com/r/IASIP/',\n",
285 | " 'https://www.reddit.com/r/IAmA/',\n",
286 | " 'https://www.reddit.com/r/IWantOut/',\n",
287 | " 'https://www.reddit.com/r/ImaginaryWesteros/',\n",
288 | " 'https://www.reddit.com/r/Indiemakeupandmore/',\n",
289 | " 'https://www.reddit.com/r/Instagram/',\n",
290 | " 'https://www.reddit.com/r/Israel/',\n",
291 | " 'https://www.reddit.com/r/JapanTravel/',\n",
292 | " 'https://www.reddit.com/r/Jeopardy/',\n",
293 | " 'https://www.reddit.com/r/Kitsap/',\n",
294 | " 'https://www.reddit.com/r/Konosuba/',\n",
295 | " 'https://www.reddit.com/r/LearnJapanese/',\n",
296 | " 'https://www.reddit.com/r/LegendsOfTomorrow/',\n",
297 | " 'https://www.reddit.com/r/LifeProTips/',\n",
298 | " 'https://www.reddit.com/r/LigaMX/',\n",
299 | " 'https://www.reddit.com/r/LiverpoolFC/',\n",
300 | " 'https://www.reddit.com/r/LivestreamFail/',\n",
301 | " 'https://www.reddit.com/r/LosAngelesRams/',\n",
302 | " 'https://www.reddit.com/r/LushCosmetics/',\n",
303 | " 'https://www.reddit.com/r/MCFC/',\n",
304 | " 'https://www.reddit.com/r/MLBTheShow/',\n",
305 | " 'https://www.reddit.com/r/MLS/',\n",
306 | " 'https://www.reddit.com/r/MMA/',\n",
307 | " 'https://www.reddit.com/r/MTB/',\n",
308 | " 'https://www.reddit.com/r/MUAontheCheap/',\n",
309 | " 'https://www.reddit.com/r/MagicArena/',\n",
310 | " 'https://www.reddit.com/r/Makeup/',\n",
311 | " 'https://www.reddit.com/r/MakeupAddiction/',\n",
312 | " 'https://www.reddit.com/r/MakingaMurderer/',\n",
313 | " 'https://www.reddit.com/r/Market76/',\n",
314 | " 'https://www.reddit.com/r/MarvelStrikeForce/',\n",
315 | " 'https://www.reddit.com/r/Mavericks/',\n",
316 | " 'https://www.reddit.com/r/Minecraft/',\n",
317 | " 'https://www.reddit.com/r/Minneapolis/',\n",
318 | " 'https://www.reddit.com/r/MkeBucks/',\n",
319 | " 'https://www.reddit.com/r/ModernMagic/',\n",
320 | " 'https://www.reddit.com/r/MonsterHunterWorld/',\n",
321 | " 'https://www.reddit.com/r/Mordhau/',\n",
322 | " 'https://www.reddit.com/r/MortalKombat/',\n",
323 | " 'https://www.reddit.com/r/MtvChallenge/',\n",
324 | " 'https://www.reddit.com/r/Music/',\n",
325 | " 'https://www.reddit.com/r/NBA2k/',\n",
326 | " 'https://www.reddit.com/r/NBASpurs/',\n",
327 | " 'https://www.reddit.com/r/NFA/',\n",
328 | " 'https://www.reddit.com/r/NHLHUT/',\n",
329 | " 'https://www.reddit.com/r/NYKnicks/',\n",
330 | " 'https://www.reddit.com/r/NYYankees/',\n",
331 | " 'https://www.reddit.com/r/Naruto/',\n",
332 | " 'https://www.reddit.com/r/Nationals/',\n",
333 | " 'https://www.reddit.com/r/Nerf/',\n",
334 | " 'https://www.reddit.com/r/NetflixBestOf/',\n",
335 | " 'https://www.reddit.com/r/NewOrleans/',\n",
336 | " 'https://www.reddit.com/r/NewSkaters/',\n",
337 | " 'https://www.reddit.com/r/NewYorkMets/',\n",
338 | " 'https://www.reddit.com/r/NintendoSwitch/',\n",
339 | " 'https://www.reddit.com/r/NoMansSkyTheGame/',\n",
340 | " 'https://www.reddit.com/r/NoStupidQuestions/',\n",
341 | " 'https://www.reddit.com/r/OnePiece/',\n",
342 | " 'https://www.reddit.com/r/OutOfTheLoop/',\n",
343 | " 'https://www.reddit.com/r/Overwatch/',\n",
344 | " 'https://www.reddit.com/r/PS4/',\n",
345 | " 'https://www.reddit.com/r/PSVR/',\n",
346 | " 'https://www.reddit.com/r/PUBATTLEGROUNDS/',\n",
347 | " 'https://www.reddit.com/r/PUBGMobile/',\n",
348 | " 'https://www.reddit.com/r/Paladins/',\n",
349 | " 'https://www.reddit.com/r/PanPorn/',\n",
350 | " 'https://www.reddit.com/r/PandR/',\n",
351 | " 'https://www.reddit.com/r/Patriots/',\n",
352 | " 'https://www.reddit.com/r/Persona5/',\n",
353 | " 'https://www.reddit.com/r/Philippines/',\n",
354 | " 'https://www.reddit.com/r/Planetside/',\n",
355 | " 'https://www.reddit.com/r/Polska/',\n",
356 | " 'https://www.reddit.com/r/Portland/',\n",
357 | " 'https://www.reddit.com/r/Quebec/',\n",
358 | " 'https://www.reddit.com/r/RWBY/',\n",
359 | " 'https://www.reddit.com/r/Rainbow6/',\n",
360 | " 'https://www.reddit.com/r/RedDeadOnline/',\n",
361 | " 'https://www.reddit.com/r/RedditLaqueristas/',\n",
362 | " 'https://www.reddit.com/r/RepLadiesBST/',\n",
363 | " 'https://www.reddit.com/r/Repsneakers/',\n",
364 | " 'https://www.reddit.com/r/RimWorld/',\n",
365 | " 'https://www.reddit.com/r/RocketLeague/',\n",
366 | " 'https://www.reddit.com/r/RocketLeagueExchange/',\n",
367 | " 'https://www.reddit.com/r/Romania/',\n",
368 | " 'https://www.reddit.com/r/Rowing/',\n",
369 | " 'https://www.reddit.com/r/SFGiants/',\n",
370 | " 'https://www.reddit.com/r/SWGalaxyOfHeroes/',\n",
371 | " 'https://www.reddit.com/r/Sacramento/',\n",
372 | " 'https://www.reddit.com/r/SaltLakeCity/',\n",
373 | " 'https://www.reddit.com/r/SanJoseSharks/',\n",
374 | " 'https://www.reddit.com/r/SarahSnark/',\n",
375 | " 'https://www.reddit.com/r/Scotland/',\n",
376 | " 'https://www.reddit.com/r/Seaofthieves/',\n",
377 | " 'https://www.reddit.com/r/Seattle/',\n",
378 | " 'https://www.reddit.com/r/SequelMemes/',\n",
379 | " 'https://www.reddit.com/r/ShingekiNoKyojin/',\n",
380 | " 'https://www.reddit.com/r/Shoestring/',\n",
381 | " 'https://www.reddit.com/r/Showerthoughts/',\n",
382 | " 'https://www.reddit.com/r/Smite/',\n",
383 | " 'https://www.reddit.com/r/Sneakers/',\n",
384 | " 'https://www.reddit.com/r/Spiderman/',\n",
385 | " 'https://www.reddit.com/r/SpoiledDragRace/',\n",
386 | " 'https://www.reddit.com/r/SquaredCircle/',\n",
387 | " 'https://www.reddit.com/r/StLouis/',\n",
388 | " 'https://www.reddit.com/r/StarVStheForcesofEvil/',\n",
389 | " 'https://www.reddit.com/r/StarWarsBattlefront/',\n",
390 | " 'https://www.reddit.com/r/StardewValley/',\n",
391 | " 'https://www.reddit.com/r/Steam/',\n",
392 | " 'https://www.reddit.com/r/Stellaris/',\n",
393 | " 'https://www.reddit.com/r/StrangerThings/',\n",
394 | " 'https://www.reddit.com/r/Stronglifts5x5/',\n",
395 | " 'https://www.reddit.com/r/Suomi/',\n",
396 | " 'https://www.reddit.com/r/Supplements/',\n",
397 | " 'https://www.reddit.com/r/TeenMomOGandTeenMom2/',\n",
398 | " 'https://www.reddit.com/r/Terraria/',\n",
399 | " 'https://www.reddit.com/r/TheAmazingRace/',\n",
400 | " 'https://www.reddit.com/r/TheBlackList/',\n",
401 | " 'https://www.reddit.com/r/TheDickShow/',\n",
402 | " 'https://www.reddit.com/r/TheHandmaidsTale/',\n",
403 | " 'https://www.reddit.com/r/TheLastAirbender/',\n",
404 | " 'https://www.reddit.com/r/TheSimpsons/',\n",
405 | " 'https://www.reddit.com/r/Tinder/',\n",
406 | " 'https://www.reddit.com/r/Torontobluejays/',\n",
407 | " 'https://www.reddit.com/r/Turkey/',\n",
408 | " 'https://www.reddit.com/r/TurkeyJerky/',\n",
409 | " 'https://www.reddit.com/r/Twitch/',\n",
410 | " 'https://www.reddit.com/r/TwoBestFriendsPlay/',\n",
411 | " 'https://www.reddit.com/r/VictoriaBC/',\n",
412 | " 'https://www.reddit.com/r/WWE/',\n",
413 | " 'https://www.reddit.com/r/WWEGames/',\n",
414 | " 'https://www.reddit.com/r/WaltDisneyWorld/',\n",
415 | " 'https://www.reddit.com/r/Warframe/',\n",
416 | " 'https://www.reddit.com/r/Warhammer40k/',\n",
417 | " 'https://www.reddit.com/r/Warthunder/',\n",
418 | " 'https://www.reddit.com/r/Watches/',\n",
419 | " 'https://www.reddit.com/r/Watchexchange/',\n",
420 | " 'https://www.reddit.com/r/Wellington/',\n",
421 | " 'https://www.reddit.com/r/Wetshaving/',\n",
422 | " 'https://www.reddit.com/r/Windows10/',\n",
423 | " 'https://www.reddit.com/r/Winnipeg/',\n",
424 | " 'https://www.reddit.com/r/WorldOfWarships/',\n",
425 | " 'https://www.reddit.com/r/WorldofTanks/',\n",
426 | " 'https://www.reddit.com/r/Youniqueamua/',\n",
427 | " 'https://www.reddit.com/r/aSongOfMemesAndRage/',\n",
428 | " 'https://www.reddit.com/r/acne/',\n",
429 | " 'https://www.reddit.com/r/adventuretime/',\n",
430 | " 'https://www.reddit.com/r/airsoft/',\n",
431 | " 'https://www.reddit.com/r/amateur_boxing/',\n",
432 | " 'https://www.reddit.com/r/anime/',\n",
433 | " 'https://www.reddit.com/r/anime_irl/',\n",
434 | " 'https://www.reddit.com/r/antelopevalley/',\n",
435 | " 'https://www.reddit.com/r/apple/',\n",
436 | " 'https://www.reddit.com/r/argentina/',\n",
437 | " 'https://www.reddit.com/r/arrow/',\n",
438 | " 'https://www.reddit.com/r/askTO/',\n",
439 | " 'https://www.reddit.com/r/askscience/',\n",
440 | " 'https://www.reddit.com/r/asoiaf/',\n",
441 | " 'https://www.reddit.com/r/australia/',\n",
442 | " 'https://www.reddit.com/r/awardtravel/',\n",
443 | " 'https://www.reddit.com/r/backpacking/',\n",
444 | " 'https://www.reddit.com/r/balisong/',\n",
445 | " 'https://www.reddit.com/r/barstoolsports/',\n",
446 | " 'https://www.reddit.com/r/baseball/',\n",
447 | " 'https://www.reddit.com/r/batman/',\n",
448 | " 'https://www.reddit.com/r/battlestations/',\n",
449 | " 'https://www.reddit.com/r/bayarea/',\n",
450 | " 'https://www.reddit.com/r/beards/',\n",
451 | " 'https://www.reddit.com/r/beauty/',\n",
452 | " 'https://www.reddit.com/r/berkeley/',\n",
453 | " 'https://www.reddit.com/r/bicycling/',\n",
454 | " 'https://www.reddit.com/r/bikecommuting/',\n",
455 | " 'https://www.reddit.com/r/bikewrench/',\n",
456 | " 'https://www.reddit.com/r/bjj/',\n",
457 | " 'https://www.reddit.com/r/blackmirror/',\n",
458 | " 'https://www.reddit.com/r/bleach/',\n",
459 | " 'https://www.reddit.com/r/boardgames/',\n",
460 | " 'https://www.reddit.com/r/bodybuilding/',\n",
461 | " 'https://www.reddit.com/r/bodyweightfitness/',\n",
462 | " 'https://www.reddit.com/r/books/',\n",
463 | " 'https://www.reddit.com/r/boostedboards/',\n",
464 | " 'https://www.reddit.com/r/bostonceltics/',\n",
465 | " 'https://www.reddit.com/r/brasil/',\n",
466 | " 'https://www.reddit.com/r/brasilivre/',\n",
467 | " 'https://www.reddit.com/r/breakingbad/',\n",
468 | " 'https://www.reddit.com/r/brisbane/',\n",
469 | " 'https://www.reddit.com/r/brooklynninenine/',\n",
470 | " 'https://www.reddit.com/r/buildapc/',\n",
471 | " 'https://www.reddit.com/r/burlington/',\n",
472 | " 'https://www.reddit.com/r/camping/',\n",
473 | " 'https://www.reddit.com/r/canada/',\n",
474 | " 'https://www.reddit.com/r/canucks/',\n",
475 | " 'https://www.reddit.com/r/cars/',\n",
476 | " 'https://www.reddit.com/r/chelseafc/',\n",
477 | " 'https://www.reddit.com/r/chile/',\n",
478 | " 'https://www.reddit.com/r/cirkeltrek/',\n",
479 | " 'https://www.reddit.com/r/classicwow/',\n",
480 | " 'https://www.reddit.com/r/climbing/',\n",
481 | " 'https://www.reddit.com/r/community/',\n",
482 | " 'https://www.reddit.com/r/confession/',\n",
483 | " 'https://www.reddit.com/r/cordcutters/',\n",
484 | " 'https://www.reddit.com/r/cowboys/',\n",
485 | " 'https://www.reddit.com/r/coys/',\n",
486 | " 'https://www.reddit.com/r/criterion/',\n",
487 | " 'https://www.reddit.com/r/croatia/',\n",
488 | " 'https://www.reddit.com/r/crossfit/',\n",
489 | " 'https://www.reddit.com/r/cscareerquestions/',\n",
490 | " 'https://www.reddit.com/r/curlyhair/',\n",
491 | " 'https://www.reddit.com/r/cycling/',\n",
492 | " 'https://www.reddit.com/r/danganronpa/',\n",
493 | " 'https://www.reddit.com/r/dauntless/',\n",
494 | " 'https://www.reddit.com/r/dbz/',\n",
495 | " 'https://www.reddit.com/r/de/',\n",
496 | " 'https://www.reddit.com/r/deadbydaylight/',\n",
497 | " 'https://www.reddit.com/r/denvernuggets/',\n",
498 | " 'https://www.reddit.com/r/destiny2/',\n",
499 | " 'https://www.reddit.com/r/detroitlions/',\n",
500 | " 'https://www.reddit.com/r/diabetes/',\n",
501 | " 'https://www.reddit.com/r/diabetes_t1/',\n",
502 | " 'https://www.reddit.com/r/discgolf/',\n",
503 | " 'https://www.reddit.com/r/discordapp/',\n",
504 | " 'https://www.reddit.com/r/disney/',\n",
505 | " 'https://www.reddit.com/r/dndmemes/',\n",
506 | " 'https://www.reddit.com/r/dndnext/',\n",
507 | " 'https://www.reddit.com/r/doctorwho/',\n",
508 | " 'https://www.reddit.com/r/dubai/',\n",
509 | " 'https://www.reddit.com/r/eagles/',\n",
510 | " 'https://www.reddit.com/r/ehlersdanlos/',\n",
511 | " 'https://www.reddit.com/r/elderscrollsonline/',\n",
512 | " 'https://www.reddit.com/r/eu4/',\n",
513 | " 'https://www.reddit.com/r/europe/',\n",
514 | " 'https://www.reddit.com/r/explainlikeimfive/',\n",
515 | " 'https://www.reddit.com/r/fairytail/',\n",
516 | " 'https://www.reddit.com/r/fantasybaseball/',\n",
517 | " 'https://www.reddit.com/r/fantasyfootball/',\n",
518 | " 'https://www.reddit.com/r/fasting/',\n",
519 | " 'https://www.reddit.com/r/femalefashionadvice/',\n",
520 | " 'https://www.reddit.com/r/femalehairadvice/',\n",
521 | " 'https://www.reddit.com/r/ffxiv/',\n",
522 | " 'https://www.reddit.com/r/findfashion/',\n",
523 | " 'https://www.reddit.com/r/fireemblem/',\n",
524 | " 'https://www.reddit.com/r/fivenightsatfreddys/',\n",
525 | " 'https://www.reddit.com/r/flexibility/',\n",
526 | " 'https://www.reddit.com/r/flightsim/',\n",
527 | " 'https://www.reddit.com/r/flyfishing/',\n",
528 | " 'https://www.reddit.com/r/fo76/',\n",
529 | " 'https://www.reddit.com/r/footballmanagergames/',\n",
530 | " 'https://www.reddit.com/r/forhonor/',\n",
531 | " 'https://www.reddit.com/r/formula1/',\n",
532 | " 'https://www.reddit.com/r/fragrance/',\n",
533 | " 'https://www.reddit.com/r/france/',\n",
534 | " 'https://www.reddit.com/r/freefolk/',\n",
535 | " 'https://www.reddit.com/r/frugalmalefashion/',\n",
536 | " 'https://www.reddit.com/r/futurama/',\n",
537 | " 'https://www.reddit.com/r/future_fight/',\n",
538 | " 'https://www.reddit.com/r/gainit/',\n",
539 | " 'https://www.reddit.com/r/gameofthrones/',\n",
540 | " 'https://www.reddit.com/r/germany/',\n",
541 | " 'https://www.reddit.com/r/girlsfrontline/',\n",
542 | " 'https://www.reddit.com/r/golf/',\n",
543 | " 'https://www.reddit.com/r/goodyearwelt/',\n",
544 | " 'https://www.reddit.com/r/grandorder/',\n",
545 | " 'https://www.reddit.com/r/greece/',\n",
546 | " 'https://www.reddit.com/r/greysanatomy/',\n",
547 | " 'https://www.reddit.com/r/gtaonline/',\n",
548 | " 'https://www.reddit.com/r/halifax/',\n",
549 | " 'https://www.reddit.com/r/halo/',\n",
550 | " 'https://www.reddit.com/r/headphones/',\n",
551 | " 'https://www.reddit.com/r/hearthstone/',\n",
552 | " 'https://www.reddit.com/r/heroesofthestorm/',\n",
553 | " 'https://www.reddit.com/r/hiking/',\n",
554 | " 'https://www.reddit.com/r/hockey/',\n",
555 | " 'https://www.reddit.com/r/hockeyjerseys/',\n",
556 | " 'https://www.reddit.com/r/hockeyplayers/',\n",
557 | " 'https://www.reddit.com/r/houston/',\n",
558 | " 'https://www.reddit.com/r/howardstern/',\n",
559 | " 'https://www.reddit.com/r/hungary/',\n",
560 | " 'https://www.reddit.com/r/india/',\n",
561 | " 'https://www.reddit.com/r/indonesia/',\n",
562 | " 'https://www.reddit.com/r/intermittentfasting/',\n",
563 | " 'https://www.reddit.com/r/iphone/',\n",
564 | " 'https://www.reddit.com/r/ireland/',\n",
565 | " 'https://www.reddit.com/r/italy/',\n",
566 | " 'https://www.reddit.com/r/jailbreak/',\n",
567 | " 'https://www.reddit.com/r/japanesestreetwear/',\n",
568 | " 'https://www.reddit.com/r/japanlife/',\n",
569 | " 'https://www.reddit.com/r/jobs/',\n",
570 | " 'https://www.reddit.com/r/kansascity/',\n",
571 | " 'https://www.reddit.com/r/keto/',\n",
572 | " 'https://www.reddit.com/r/korea/',\n",
573 | " 'https://www.reddit.com/r/lakers/',\n",
574 | " 'https://www.reddit.com/r/leafs/',\n",
575 | " 'https://www.reddit.com/r/leagueoflegends/',\n",
576 | " 'https://www.reddit.com/r/leangains/',\n",
577 | " 'https://www.reddit.com/r/learnprogramming/',\n",
578 | " 'https://www.reddit.com/r/learnpython/',\n",
579 | " 'https://www.reddit.com/r/legaladvice/',\n",
580 | " 'https://www.reddit.com/r/longboarding/',\n",
581 | " 'https://www.reddit.com/r/loseit/',\n",
582 | " 'https://www.reddit.com/r/lucifer/',\n",
583 | " 'https://www.reddit.com/r/makeupexchange/',\n",
584 | " 'https://www.reddit.com/r/malaysia/',\n",
585 | " 'https://www.reddit.com/r/malefashion/',\n",
586 | " 'https://www.reddit.com/r/malefashionadvice/',\n",
587 | " 'https://www.reddit.com/r/malehairadvice/',\n",
588 | " 'https://www.reddit.com/r/malelivingspace/',\n",
589 | " 'https://www.reddit.com/r/marvelmemes/',\n",
590 | " 'https://www.reddit.com/r/marvelstudios/',\n",
591 | " 'https://www.reddit.com/r/medical_advice/',\n",
592 | " 'https://www.reddit.com/r/melbourne/',\n",
593 | " 'https://www.reddit.com/r/memes/',\n",
594 | " 'https://www.reddit.com/r/mexico/',\n",
595 | " 'https://www.reddit.com/r/migraine/',\n",
596 | " 'https://www.reddit.com/r/minnesotatwins/',\n",
597 | " 'https://www.reddit.com/r/minnesotavikings/',\n",
598 | " 'https://www.reddit.com/r/mw4/',\n",
599 | " 'https://www.reddit.com/r/mylittlepony/',\n",
600 | " 'https://www.reddit.com/r/nashville/',\n",
601 | " 'https://www.reddit.com/r/nattyorjuice/',\n",
602 | " 'https://www.reddit.com/r/nba/',\n",
603 | " 'https://www.reddit.com/r/nbadiscussion/',\n",
604 | " 'https://www.reddit.com/r/netflix/',\n",
605 | " 'https://www.reddit.com/r/newsokur/',\n",
606 | " 'https://www.reddit.com/r/newzealand/',\n",
607 | " 'https://www.reddit.com/r/nfl/',\n",
608 | " 'https://www.reddit.com/r/nhl/',\n",
609 | " 'https://www.reddit.com/r/norge/',\n",
610 | " 'https://www.reddit.com/r/nosleep/',\n",
611 | " 'https://www.reddit.com/r/nova/',\n",
612 | " 'https://www.reddit.com/r/nrl/',\n",
613 | " 'https://www.reddit.com/r/nunavut/',\n",
614 | " 'https://www.reddit.com/r/nutrition/',\n",
615 | " 'https://www.reddit.com/r/nvidia/',\n",
616 | " 'https://www.reddit.com/r/nyjets/',\n",
617 | " 'https://www.reddit.com/r/omad/',\n",
618 | " 'https://www.reddit.com/r/orangecounty/',\n",
619 | " 'https://www.reddit.com/r/orangetheory/',\n",
620 | " 'https://www.reddit.com/r/osugame/',\n",
621 | " 'https://www.reddit.com/r/ottawa/',\n",
622 | " 'https://www.reddit.com/r/overlord/',\n",
623 | " 'https://www.reddit.com/r/pathofexile/',\n",
624 | " 'https://www.reddit.com/r/pcmasterrace/',\n",
625 | " 'https://www.reddit.com/r/peloton/',\n",
626 | " 'https://www.reddit.com/r/pesmobile/',\n",
627 | " 'https://www.reddit.com/r/philadelphia/',\n",
628 | " 'https://www.reddit.com/r/phillies/',\n",
629 | " 'https://www.reddit.com/r/phoenix/',\n",
630 | " 'https://www.reddit.com/r/pics/',\n",
631 | " 'https://www.reddit.com/r/pics/?f=flair_name%3A%22Politics%22',\n",
632 | " 'https://www.reddit.com/r/pics/comments/g1k7qr/well_america_this_explains_it/',\n",
633 | " 'https://www.reddit.com/r/piercing/',\n",
634 | " 'https://www.reddit.com/r/pittsburgh/',\n",
635 | " 'https://www.reddit.com/r/playrust/',\n",
636 | " 'https://www.reddit.com/r/podemos/',\n",
637 | " 'https://www.reddit.com/r/pokemon/',\n",
638 | " 'https://www.reddit.com/r/pokemongo/',\n",
639 | " 'https://www.reddit.com/r/pokemontrades/',\n",
640 | " 'https://www.reddit.com/r/portugal/',\n",
641 | " 'https://www.reddit.com/r/poshmark/',\n",
642 | " 'https://www.reddit.com/r/powerlifting/',\n",
643 | " 'https://www.reddit.com/r/progresspics/',\n",
644 | " 'https://www.reddit.com/r/raleigh/',\n",
645 | " 'https://www.reddit.com/r/ravens/',\n",
646 | " 'https://www.reddit.com/r/rawdenim/',\n",
647 | " 'https://www.reddit.com/r/realmadrid/',\n",
648 | " 'https://www.reddit.com/r/reddeadredemption/',\n",
649 | " 'https://www.reddit.com/r/reddevils/',\n",
650 | " 'https://www.reddit.com/r/redsox/',\n",
651 | " 'https://www.reddit.com/r/relationship_advice/',\n",
652 | " 'https://www.reddit.com/r/rickandmorty/',\n",
653 | " 'https://www.reddit.com/r/ripcity/',\n",
654 | " 'https://www.reddit.com/r/riverdale/',\n",
655 | " 'https://www.reddit.com/r/roadtrip/',\n",
656 | " 'https://www.reddit.com/r/rolex/',\n",
657 | " 'https://www.reddit.com/r/rollercoasters/',\n",
658 | " 'https://www.reddit.com/r/rpdrcringe/',\n",
659 | " 'https://www.reddit.com/r/rugbyunion/',\n",
660 | " 'https://www.reddit.com/r/runescape/',\n",
661 | " 'https://www.reddit.com/r/running/',\n",
662 | " 'https://www.reddit.com/r/rupaulsdragrace/',\n",
663 | " 'https://www.reddit.com/r/rva/',\n",
664 | " 'https://www.reddit.com/r/sanantonio/',\n",
665 | " 'https://www.reddit.com/r/sandiego/',\n",
666 | " 'https://www.reddit.com/r/sanfrancisco/',\n",
667 | " 'https://www.reddit.com/r/saskatoon/',\n",
668 | " 'https://www.reddit.com/r/scifi/',\n",
669 | " 'https://www.reddit.com/r/seinfeld/',\n",
670 | " 'https://www.reddit.com/r/serbia/',\n",
671 | " 'https://www.reddit.com/r/shield/',\n",
672 | " 'https://www.reddit.com/r/singapore/',\n",
673 | " 'https://www.reddit.com/r/sixers/',\n",
674 | " 'https://www.reddit.com/r/skiing/',\n",
675 | " 'https://www.reddit.com/r/skyrim/',\n",
676 | " 'https://www.reddit.com/r/smashbros/',\n",
677 | " 'https://www.reddit.com/r/sneakermarket/',\n",
678 | " 'https://www.reddit.com/r/snowboarding/',\n",
679 | " 'https://www.reddit.com/r/soccer/',\n",
680 | " 'https://www.reddit.com/r/solotravel/',\n",
681 | " 'https://www.reddit.com/r/southpark/',\n",
682 | " 'https://www.reddit.com/r/sports/',\n",
683 | " 'https://www.reddit.com/r/sportsbook/',\n",
684 | " 'https://www.reddit.com/r/starbucks/',\n",
685 | " 'https://www.reddit.com/r/starcitizen/',\n",
686 | " 'https://www.reddit.com/r/startrek/',\n",
687 | " 'https://www.reddit.com/r/steelers/',\n",
688 | " 'https://www.reddit.com/r/stevenuniverse/',\n",
689 | " 'https://www.reddit.com/r/stlouisblues/',\n",
690 | " 'https://www.reddit.com/r/streetwearstartup/',\n",
691 | " 'https://www.reddit.com/r/summonerswar/',\n",
692 | " 'https://www.reddit.com/r/suns/',\n",
693 | " 'https://www.reddit.com/r/survivor/',\n",
694 | " 'https://www.reddit.com/r/sweden/',\n",
695 | " 'https://www.reddit.com/r/swoleacceptance/',\n",
696 | " 'https://www.reddit.com/r/sydney/',\n",
697 | " 'https://www.reddit.com/r/sysadmin/',\n",
698 | " 'https://www.reddit.com/r/tampabayrays/',\n",
699 | " 'https://www.reddit.com/r/tattoos/',\n",
700 | " 'https://www.reddit.com/r/techsupport/',\n",
701 | " 'https://www.reddit.com/r/tennis/',\n",
702 | " 'https://www.reddit.com/r/tf2/',\n",
703 | " 'https://www.reddit.com/r/the100/',\n",
704 | " 'https://www.reddit.com/r/thebachelor/',\n",
705 | " 'https://www.reddit.com/r/thedivision/',\n",
706 | " 'https://www.reddit.com/r/thenetherlands/',\n",
707 | " 'https://www.reddit.com/r/thesims/',\n",
708 | " 'https://www.reddit.com/r/thesopranos/',\n",
709 | " 'https://www.reddit.com/r/thewalkingdead/',\n",
710 | " 'https://www.reddit.com/r/tipofmytongue/',\n",
711 | " 'https://www.reddit.com/r/titanfolk/',\n",
712 | " 'https://www.reddit.com/r/todayilearned/',\n",
713 | " 'https://www.reddit.com/r/torontoraptors/',\n",
714 | " 'https://www.reddit.com/r/totalwar/',\n",
715 | " 'https://www.reddit.com/r/touhou/',\n",
716 | " 'https://www.reddit.com/r/trailerparkboys/',\n",
717 | " 'https://www.reddit.com/r/translator/',\n",
718 | " 'https://www.reddit.com/r/travel/',\n",
719 | " 'https://www.reddit.com/r/vagabond/',\n",
720 | " 'https://www.reddit.com/r/vancouver/',\n",
721 | " 'https://www.reddit.com/r/vanderpumprules/',\n",
722 | " 'https://www.reddit.com/r/vegan/',\n",
723 | " 'https://www.reddit.com/r/videos/',\n",
724 | " 'https://www.reddit.com/r/vzla/',\n",
725 | " 'https://www.reddit.com/r/warriors/',\n",
726 | " 'https://www.reddit.com/r/weightroom/',\n",
727 | " 'https://www.reddit.com/r/westworld/',\n",
728 | " 'https://www.reddit.com/r/wicked_edge/',\n",
729 | " 'https://www.reddit.com/r/wow/',\n",
730 | " 'https://www.reddit.com/r/xboxone/',\n",
731 | " 'https://www.reddit.com/r/xxfitness/',\n",
732 | " 'https://www.reddit.com/r/yeezys/',\n",
733 | " 'https://www.reddit.com/r/yoga/',\n",
734 | " 'https://www.reddit.com/r/yugioh/',\n",
735 | " 'https://www.reddit.com/r/zerocarb/',\n",
736 | " 'https://www.reddit.com/rpan/',\n",
737 | " 'https://www.reddit.com/subreddits/leaderboard/up-and-coming',\n",
738 | " 'https://www.reddit.com/user/Barknuckle/',\n",
739 | " 'https://www.reddit.com/user/Frocharocha/',\n",
740 | " 'https://www.reddit.com/user/Magistrex/',\n",
741 | " 'https://www.reddit.com/user/PoliticsModeratorBot/',\n",
742 | " 'https://www.reddit.com/user/Ra75b/',\n",
743 | " 'https://www.reddit.com/user/TheVirginVibes/',\n",
744 | " 'https://www.reddit.com/user/frozenHelen/'}"
745 | ]
746 | },
747 | "execution_count": 15,
748 | "metadata": {},
749 | "output_type": "execute_result"
750 | }
751 | ],
752 | "source": [
753 | "new_urls.difference(urls)"
754 | ]
755 | },
756 | {
757 | "cell_type": "code",
758 | "execution_count": 16,
759 | "metadata": {},
760 | "outputs": [
761 | {
762 | "data": {
763 | "text/plain": [
764 | ""
765 | ]
766 | },
767 | "execution_count": 16,
768 | "metadata": {},
769 | "output_type": "execute_result"
770 | }
771 | ],
772 | "source": [
773 | "session.close()"
774 | ]
775 | }
776 | ],
777 | "metadata": {
778 | "kernelspec": {
779 | "display_name": "Python 3",
780 | "language": "python",
781 | "name": "python3"
782 | },
783 | "language_info": {
784 | "codemirror_mode": {
785 | "name": "ipython",
786 | "version": 3
787 | },
788 | "file_extension": ".py",
789 | "mimetype": "text/x-python",
790 | "name": "python",
791 | "nbconvert_exporter": "python",
792 | "pygments_lexer": "ipython3",
793 | "version": "3.7.6"
794 | }
795 | },
796 | "nbformat": 4,
797 | "nbformat_minor": 4
798 | }
799 |
--------------------------------------------------------------------------------
/10.The Requests-HTML Package/Section 10 - Scraping JavaScript.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Scraping data generated by JavaScript"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [],
15 | "source": [
16 | "# When coding in Jupyter and Spyder, we need to use the class AsyncHTMLSession to make JavaScript work\n",
17 | "# In other environments you can use the normal HTMLSession\n",
18 | "from requests_html import AsyncHTMLSession"
19 | ]
20 | },
21 | {
22 | "cell_type": "code",
23 | "execution_count": 2,
24 | "metadata": {},
25 | "outputs": [],
26 | "source": [
27 | "# establish a new asynchronous session\n",
28 | "session = AsyncHTMLSession()\n",
29 | "\n",
30 | "# The only difference we will experience between the regular HTML Session and the asynchronous one,\n",
31 | "# is the need to write the keyword 'await' in front of some statements"
32 | ]
33 | },
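For reference, a minimal sketch of the same flow in a plain Python script, where the
synchronous HTMLSession works without 'await' (assumes requests-html is installed):

    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get("https://www.reddit.com/")  # a regular, blocking GET request
    r.html.render()                             # run the JavaScript; downloads Chromium on first use
    print(len(r.html.absolute_links))           # links available after rendering
    session.close()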
34 | {
35 | "cell_type": "code",
36 | "execution_count": 3,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "# In this example we're going to use Nike's homepage: https://www.reddit.com/\n",
41 | "# Several of the links on this page, as well as other elements, are generated by JavaScript\n",
42 | "# We will compare the result of scraping those before and after running the JavaScript code"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 4,
48 | "metadata": {},
49 | "outputs": [
50 | {
51 | "data": {
52 | "text/plain": [
53 | "200"
54 | ]
55 | },
56 | "execution_count": 4,
57 | "metadata": {},
58 | "output_type": "execute_result"
59 | }
60 | ],
61 | "source": [
62 | "# Since we used async session, we need to use the keyword 'await'\n",
63 | "# If you use the regular HTMLSession, there is no need for 'await'\n",
64 | "r = await session.get(\"https://www.reddit.com/\")\n",
65 | "r.status_code"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 5,
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "# So far, nothing different from our previous example has happened\n",
75 | "# The JavaScript code has not yet been executed"
76 | ]
77 | },
78 | {
79 | "cell_type": "code",
80 | "execution_count": 6,
81 | "metadata": {},
82 | "outputs": [],
83 | "source": [
84 | "# Here are some tags obtained before rendering the JavaScript code, i.e. extarcted from the raw HTML\n",
85 | "divs = r.html.find(\"div\")\n",
86 | "links = r.html.find(\"a\")\n",
87 | "urls = r.html.absolute_links"
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": 7,
93 | "metadata": {},
94 | "outputs": [],
95 | "source": [
96 | "# Now, we need to execute the JavaScript code that will generate additional tags"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": 8,
102 | "metadata": {},
103 | "outputs": [],
104 | "source": [
105 | "# The requests-html package provides a very simple interface for that - just use the 'render()' method\n",
106 | "# ('arender()' when using async session)\n",
107 | "# It runs the JavaScript code which updates the HTML. This may take a bit\n",
108 | "# The updated HTML is stored in the old variable 'r.html' - you do not need to assign a new variable to the method\n",
109 | "# As before, the 'await' keyword is supplied only because of the Async session\n",
110 | "await r.html.arender()"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": 9,
116 | "metadata": {},
117 | "outputs": [],
118 | "source": [
119 | "# NOTE: The first time you run 'a/render()' Chromium will be downloaded and installed on your computer"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": 10,
125 | "metadata": {},
126 | "outputs": [],
127 | "source": [
128 | "# Now the HTML is updated and we can search for the same tags again\n",
129 | "new_divs = r.html.find(\"div\")\n",
130 | "new_links = r.html.find(\"a\")\n",
131 | "new_urls = r.html.absolute_links"
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": 11,
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "# We can see the difference in the number of found elements before and after the JavaScript executed"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 12,
146 | "metadata": {},
147 | "outputs": [
148 | {
149 | "data": {
150 | "text/plain": [
151 | "(543, 1728)"
152 | ]
153 | },
154 | "execution_count": 12,
155 | "metadata": {},
156 | "output_type": "execute_result"
157 | }
158 | ],
159 | "source": [
160 | "len(divs), len(new_divs)"
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": 13,
166 | "metadata": {},
167 | "outputs": [
168 | {
169 | "data": {
170 | "text/plain": [
171 | "(87, 681)"
172 | ]
173 | },
174 | "execution_count": 13,
175 | "metadata": {},
176 | "output_type": "execute_result"
177 | }
178 | ],
179 | "source": [
180 | "len(links), len(new_links)"
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": 14,
186 | "metadata": {},
187 | "outputs": [
188 | {
189 | "data": {
190 | "text/plain": [
191 | "(58, 640)"
192 | ]
193 | },
194 | "execution_count": 14,
195 | "metadata": {},
196 | "output_type": "execute_result"
197 | }
198 | ],
199 | "source": [
200 | "len(urls), len(new_urls)"
201 | ]
202 | },
203 | {
204 | "cell_type": "code",
205 | "execution_count": 15,
206 | "metadata": {},
207 | "outputs": [],
208 | "source": [
209 | "# Remember that 'urls' is a set, and not a list?\n",
210 | "# Well, there is a useful feature of sets that we will now take advantage of\n",
211 | "# It takes two sets and selects only those items from the first set that are not present in the second one"
212 | ]
213 | },
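A quick illustration of 'difference()' on toy data (a sketch; any two sets behave the same way):

    before = {"a", "b", "c"}
    after = {"b", "c", "d"}
    print(after.difference(before))  # {'d'} - items of 'after' that are absent from 'before'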
214 | {
215 | "cell_type": "code",
216 | "execution_count": 16,
217 | "metadata": {
218 | "scrolled": true
219 | },
220 | "outputs": [
221 | {
222 | "data": {
223 | "text/plain": [
224 | "{'https://i.imgur.com/nMhodgS.gifv',\n",
225 | " 'https://www.reddit.com/r/1200isplenty/',\n",
226 | " 'https://www.reddit.com/r/2007scape/',\n",
227 | " 'https://www.reddit.com/r/49ers/',\n",
228 | " 'https://www.reddit.com/r/90DayFiance/',\n",
229 | " 'https://www.reddit.com/r/ACMilan/',\n",
230 | " 'https://www.reddit.com/r/Adelaide/',\n",
231 | " 'https://www.reddit.com/r/Amd/',\n",
232 | " 'https://www.reddit.com/r/Android/',\n",
233 | " 'https://www.reddit.com/r/Animesuggest/',\n",
234 | " 'https://www.reddit.com/r/AnthemTheGame/',\n",
235 | " 'https://www.reddit.com/r/AskCulinary/',\n",
236 | " 'https://www.reddit.com/r/AskMen/',\n",
237 | " 'https://www.reddit.com/r/AskNYC/',\n",
238 | " 'https://www.reddit.com/r/AskReddit/',\n",
239 | " 'https://www.reddit.com/r/AskWomen/',\n",
240 | " 'https://www.reddit.com/r/Astros/',\n",
241 | " 'https://www.reddit.com/r/Atlanta/',\n",
242 | " 'https://www.reddit.com/r/AtlantaUnited/',\n",
243 | " 'https://www.reddit.com/r/Augusta/',\n",
244 | " 'https://www.reddit.com/r/Austria/',\n",
245 | " 'https://www.reddit.com/r/Barca/',\n",
246 | " 'https://www.reddit.com/r/BattlefieldV/',\n",
247 | " 'https://www.reddit.com/r/BeautyBoxes/',\n",
248 | " 'https://www.reddit.com/r/BeautyGuruChatter/',\n",
249 | " 'https://www.reddit.com/r/Bend/',\n",
250 | " 'https://www.reddit.com/r/Berserk/',\n",
251 | " 'https://www.reddit.com/r/BigBrother/',\n",
252 | " 'https://www.reddit.com/r/BlackClover/',\n",
253 | " 'https://www.reddit.com/r/Blackops4/',\n",
254 | " 'https://www.reddit.com/r/BoJackHorseman/',\n",
255 | " 'https://www.reddit.com/r/BokuNoHeroAcademia/',\n",
256 | " 'https://www.reddit.com/r/Boruto/',\n",
257 | " 'https://www.reddit.com/r/BostonBruins/',\n",
258 | " 'https://www.reddit.com/r/Boxing/',\n",
259 | " 'https://www.reddit.com/r/Braves/',\n",
260 | " 'https://www.reddit.com/r/BravoRealHousewives/',\n",
261 | " 'https://www.reddit.com/r/Brawlstars/',\n",
262 | " 'https://www.reddit.com/r/Breath_of_the_Wild/',\n",
263 | " 'https://www.reddit.com/r/Brogress/',\n",
264 | " 'https://www.reddit.com/r/Browns/',\n",
265 | " 'https://www.reddit.com/r/C25K/',\n",
266 | " 'https://www.reddit.com/r/CFB/',\n",
267 | " 'https://www.reddit.com/r/CHIBears/',\n",
268 | " 'https://www.reddit.com/r/CHICubs/',\n",
269 | " 'https://www.reddit.com/r/Calgary/',\n",
270 | " 'https://www.reddit.com/r/CampingGear/',\n",
271 | " 'https://www.reddit.com/r/CampingandHiking/',\n",
272 | " 'https://www.reddit.com/r/Cardinals/',\n",
273 | " 'https://www.reddit.com/r/CasualUK/',\n",
274 | " 'https://www.reddit.com/r/Charlotte/',\n",
275 | " 'https://www.reddit.com/r/China/',\n",
276 | " 'https://www.reddit.com/r/ClashOfClans/',\n",
277 | " 'https://www.reddit.com/r/ClashRoyale/',\n",
278 | " 'https://www.reddit.com/r/CoDCompetitive/',\n",
279 | " 'https://www.reddit.com/r/CollegeBasketball/',\n",
280 | " 'https://www.reddit.com/r/Columbus/',\n",
281 | " 'https://www.reddit.com/r/Competitiveoverwatch/',\n",
282 | " 'https://www.reddit.com/r/Cooking/',\n",
283 | " 'https://www.reddit.com/r/Cricket/',\n",
284 | " 'https://www.reddit.com/r/CrohnsDisease/',\n",
285 | " 'https://www.reddit.com/r/CrusaderKings/',\n",
286 | " 'https://www.reddit.com/r/DBZDokkanBattle/',\n",
287 | " 'https://www.reddit.com/r/DMAcademy/',\n",
288 | " 'https://www.reddit.com/r/Dallas/',\n",
289 | " 'https://www.reddit.com/r/DanLeBatardShow/',\n",
290 | " 'https://www.reddit.com/r/DaysGone/',\n",
291 | " 'https://www.reddit.com/r/Denmark/',\n",
292 | " 'https://www.reddit.com/r/Denver/',\n",
293 | " 'https://www.reddit.com/r/Destiny/',\n",
294 | " 'https://www.reddit.com/r/DestinyTheGame/',\n",
295 | " 'https://www.reddit.com/r/Detroit/',\n",
296 | " 'https://www.reddit.com/r/Disneyland/',\n",
297 | " 'https://www.reddit.com/r/DnD/',\n",
298 | " 'https://www.reddit.com/r/Dodgers/',\n",
299 | " 'https://www.reddit.com/r/DotA2/',\n",
300 | " 'https://www.reddit.com/r/DuelLinks/',\n",
301 | " 'https://www.reddit.com/r/DunderMifflin/',\n",
302 | " 'https://www.reddit.com/r/DynastyFF/',\n",
303 | " 'https://www.reddit.com/r/EDH/',\n",
304 | " 'https://www.reddit.com/r/EDanonymemes/',\n",
305 | " 'https://www.reddit.com/r/EOOD/',\n",
306 | " 'https://www.reddit.com/r/EatCheapAndHealthy/',\n",
307 | " 'https://www.reddit.com/r/Edmonton/',\n",
308 | " 'https://www.reddit.com/r/EliteDangerous/',\n",
309 | " 'https://www.reddit.com/r/EscapefromTarkov/',\n",
310 | " 'https://www.reddit.com/r/Eve/',\n",
311 | " 'https://www.reddit.com/r/FFBraveExvius/',\n",
312 | " 'https://www.reddit.com/r/FIFA/',\n",
313 | " 'https://www.reddit.com/r/FORTnITE/',\n",
314 | " 'https://www.reddit.com/r/FUTMobile/',\n",
315 | " 'https://www.reddit.com/r/Fallout/',\n",
316 | " 'https://www.reddit.com/r/FantasyPL/',\n",
317 | " 'https://www.reddit.com/r/FireEmblemHeroes/',\n",
318 | " 'https://www.reddit.com/r/Fishing/',\n",
319 | " 'https://www.reddit.com/r/Fitness/',\n",
320 | " 'https://www.reddit.com/r/FixedGearBicycle/',\n",
321 | " 'https://www.reddit.com/r/FlashTV/',\n",
322 | " 'https://www.reddit.com/r/FortNiteBR/',\n",
323 | " 'https://www.reddit.com/r/FortniteCompetitive/',\n",
324 | " 'https://www.reddit.com/r/Frugal/',\n",
325 | " 'https://www.reddit.com/r/GameOfThronesMemes/',\n",
326 | " 'https://www.reddit.com/r/Gamingcirclejerk/',\n",
327 | " 'https://www.reddit.com/r/GetMotivated/',\n",
328 | " 'https://www.reddit.com/r/Glitch_in_the_Matrix/',\n",
329 | " 'https://www.reddit.com/r/GlobalOffensive/',\n",
330 | " 'https://www.reddit.com/r/GlobalOffensiveTrade/',\n",
331 | " 'https://www.reddit.com/r/GooglePixel/',\n",
332 | " 'https://www.reddit.com/r/GreenBayPackers/',\n",
333 | " 'https://www.reddit.com/r/Grimdank/',\n",
334 | " 'https://www.reddit.com/r/Guildwars2/',\n",
335 | " 'https://www.reddit.com/r/Gundam/',\n",
336 | " 'https://www.reddit.com/r/HBOGameofThrones/',\n",
337 | " 'https://www.reddit.com/r/Hair/',\n",
338 | " 'https://www.reddit.com/r/HealthyFood/',\n",
339 | " 'https://www.reddit.com/r/HomeImprovement/',\n",
340 | " 'https://www.reddit.com/r/IASIP/',\n",
341 | " 'https://www.reddit.com/r/IAmA/',\n",
342 | " 'https://www.reddit.com/r/IWantOut/',\n",
343 | " 'https://www.reddit.com/r/ImaginaryWesteros/',\n",
344 | " 'https://www.reddit.com/r/Indiemakeupandmore/',\n",
345 | " 'https://www.reddit.com/r/Instagram/',\n",
346 | " 'https://www.reddit.com/r/Israel/',\n",
347 | " 'https://www.reddit.com/r/JapanTravel/',\n",
348 | " 'https://www.reddit.com/r/Jeopardy/',\n",
349 | " 'https://www.reddit.com/r/JoshuaTree/',\n",
350 | " 'https://www.reddit.com/r/Konosuba/',\n",
351 | " 'https://www.reddit.com/r/LearnJapanese/',\n",
352 | " 'https://www.reddit.com/r/LegendsOfTomorrow/',\n",
353 | " 'https://www.reddit.com/r/LifeProTips/',\n",
354 | " 'https://www.reddit.com/r/LigaMX/',\n",
355 | " 'https://www.reddit.com/r/LiverpoolFC/',\n",
356 | " 'https://www.reddit.com/r/LivestreamFail/',\n",
357 | " 'https://www.reddit.com/r/LosAngelesRams/',\n",
358 | " 'https://www.reddit.com/r/LushCosmetics/',\n",
359 | " 'https://www.reddit.com/r/MCFC/',\n",
360 | " 'https://www.reddit.com/r/MLBTheShow/',\n",
361 | " 'https://www.reddit.com/r/MLS/',\n",
362 | " 'https://www.reddit.com/r/MMA/',\n",
363 | " 'https://www.reddit.com/r/MTB/',\n",
364 | " 'https://www.reddit.com/r/MUAontheCheap/',\n",
365 | " 'https://www.reddit.com/r/MagicArena/',\n",
366 | " 'https://www.reddit.com/r/Makeup/',\n",
367 | " 'https://www.reddit.com/r/MakeupAddiction/',\n",
368 | " 'https://www.reddit.com/r/MakingaMurderer/',\n",
369 | " 'https://www.reddit.com/r/Market76/',\n",
370 | " 'https://www.reddit.com/r/MarvelStrikeForce/',\n",
371 | " 'https://www.reddit.com/r/Mavericks/',\n",
372 | " 'https://www.reddit.com/r/Minecraft/',\n",
373 | " 'https://www.reddit.com/r/Minneapolis/',\n",
374 | " 'https://www.reddit.com/r/MkeBucks/',\n",
375 | " 'https://www.reddit.com/r/ModernMagic/',\n",
376 | " 'https://www.reddit.com/r/MonsterHunterWorld/',\n",
377 | " 'https://www.reddit.com/r/Mordhau/',\n",
378 | " 'https://www.reddit.com/r/MortalKombat/',\n",
379 | " 'https://www.reddit.com/r/MtvChallenge/',\n",
380 | " 'https://www.reddit.com/r/Music/',\n",
381 | " 'https://www.reddit.com/r/NBA2k/',\n",
382 | " 'https://www.reddit.com/r/NBASpurs/',\n",
383 | " 'https://www.reddit.com/r/NFA/',\n",
384 | " 'https://www.reddit.com/r/NHLHUT/',\n",
385 | " 'https://www.reddit.com/r/NYKnicks/',\n",
386 | " 'https://www.reddit.com/r/NYYankees/',\n",
387 | " 'https://www.reddit.com/r/Naruto/',\n",
388 | " 'https://www.reddit.com/r/Nationals/',\n",
389 | " 'https://www.reddit.com/r/Nerf/',\n",
390 | " 'https://www.reddit.com/r/NetflixBestOf/',\n",
391 | " 'https://www.reddit.com/r/NewOrleans/',\n",
392 | " 'https://www.reddit.com/r/NewSkaters/',\n",
393 | " 'https://www.reddit.com/r/NewYorkMets/',\n",
394 | " 'https://www.reddit.com/r/NintendoSwitch/',\n",
395 | " 'https://www.reddit.com/r/NoMansSkyTheGame/',\n",
396 | " 'https://www.reddit.com/r/NoStupidQuestions/',\n",
397 | " 'https://www.reddit.com/r/OnePiece/',\n",
398 | " 'https://www.reddit.com/r/OutOfTheLoop/',\n",
399 | " 'https://www.reddit.com/r/Overwatch/',\n",
400 | " 'https://www.reddit.com/r/PS4/',\n",
401 | " 'https://www.reddit.com/r/PSVR/',\n",
402 | " 'https://www.reddit.com/r/PUBATTLEGROUNDS/',\n",
403 | " 'https://www.reddit.com/r/PUBGMobile/',\n",
404 | " 'https://www.reddit.com/r/Paladins/',\n",
405 | " 'https://www.reddit.com/r/PanPorn/',\n",
406 | " 'https://www.reddit.com/r/PandR/',\n",
407 | " 'https://www.reddit.com/r/Patriots/',\n",
408 | " 'https://www.reddit.com/r/Persona5/',\n",
409 | " 'https://www.reddit.com/r/Philippines/',\n",
410 | " 'https://www.reddit.com/r/Planetside/',\n",
411 | " 'https://www.reddit.com/r/Polska/',\n",
412 | " 'https://www.reddit.com/r/Portland/',\n",
413 | " 'https://www.reddit.com/r/Quebec/',\n",
414 | " 'https://www.reddit.com/r/RWBY/',\n",
415 | " 'https://www.reddit.com/r/Rainbow6/',\n",
416 | " 'https://www.reddit.com/r/RedDeadOnline/',\n",
417 | " 'https://www.reddit.com/r/RedditLaqueristas/',\n",
418 | " 'https://www.reddit.com/r/RepLadiesBST/',\n",
419 | " 'https://www.reddit.com/r/Repsneakers/',\n",
420 | " 'https://www.reddit.com/r/RimWorld/',\n",
421 | " 'https://www.reddit.com/r/RocketLeague/',\n",
422 | " 'https://www.reddit.com/r/RocketLeagueExchange/',\n",
423 | " 'https://www.reddit.com/r/Romania/',\n",
424 | " 'https://www.reddit.com/r/Rowing/',\n",
425 | " 'https://www.reddit.com/r/SFGiants/',\n",
426 | " 'https://www.reddit.com/r/SWGalaxyOfHeroes/',\n",
427 | " 'https://www.reddit.com/r/Sacramento/',\n",
428 | " 'https://www.reddit.com/r/SaltLakeCity/',\n",
429 | " 'https://www.reddit.com/r/SanJoseSharks/',\n",
430 | " 'https://www.reddit.com/r/SantaFe/',\n",
431 | " 'https://www.reddit.com/r/SarahSnark/',\n",
432 | " 'https://www.reddit.com/r/Scotland/',\n",
433 | " 'https://www.reddit.com/r/Scottsdale/',\n",
434 | " 'https://www.reddit.com/r/Seaofthieves/',\n",
435 | " 'https://www.reddit.com/r/Seattle/',\n",
436 | " 'https://www.reddit.com/r/SequelMemes/',\n",
437 | " 'https://www.reddit.com/r/ShingekiNoKyojin/',\n",
438 | " 'https://www.reddit.com/r/Shoestring/',\n",
439 | " 'https://www.reddit.com/r/Showerthoughts/',\n",
440 | " 'https://www.reddit.com/r/Smite/',\n",
441 | " 'https://www.reddit.com/r/Sneakers/',\n",
442 | " 'https://www.reddit.com/r/Spiderman/',\n",
443 | " 'https://www.reddit.com/r/SpoiledDragRace/',\n",
444 | " 'https://www.reddit.com/r/SquaredCircle/',\n",
445 | " 'https://www.reddit.com/r/StLouis/',\n",
446 | " 'https://www.reddit.com/r/StarVStheForcesofEvil/',\n",
447 | " 'https://www.reddit.com/r/StarWarsBattlefront/',\n",
448 | " 'https://www.reddit.com/r/StardewValley/',\n",
449 | " 'https://www.reddit.com/r/Steam/',\n",
450 | " 'https://www.reddit.com/r/Stellaris/',\n",
451 | " 'https://www.reddit.com/r/StrangerThings/',\n",
452 | " 'https://www.reddit.com/r/Stronglifts5x5/',\n",
453 | " 'https://www.reddit.com/r/Suomi/',\n",
454 | " 'https://www.reddit.com/r/Supplements/',\n",
455 | " 'https://www.reddit.com/r/TeenMomOGandTeenMom2/',\n",
456 | " 'https://www.reddit.com/r/Terraria/',\n",
457 | " 'https://www.reddit.com/r/TheAmazingRace/',\n",
458 | " 'https://www.reddit.com/r/TheBlackList/',\n",
459 | " 'https://www.reddit.com/r/TheDickShow/',\n",
460 | " 'https://www.reddit.com/r/TheHandmaidsTale/',\n",
461 | " 'https://www.reddit.com/r/TheLastAirbender/',\n",
462 | " 'https://www.reddit.com/r/TheSimpsons/',\n",
463 | " 'https://www.reddit.com/r/Tinder/',\n",
464 | " 'https://www.reddit.com/r/Torontobluejays/',\n",
465 | " 'https://www.reddit.com/r/Turkey/',\n",
466 | " 'https://www.reddit.com/r/TurkeyJerky/',\n",
467 | " 'https://www.reddit.com/r/Twitch/',\n",
468 | " 'https://www.reddit.com/r/TwoBestFriendsPlay/',\n",
469 | " 'https://www.reddit.com/r/VictoriaBC/',\n",
470 | " 'https://www.reddit.com/r/WWE/',\n",
471 | " 'https://www.reddit.com/r/WWEGames/',\n",
472 | " 'https://www.reddit.com/r/WaltDisneyWorld/',\n",
473 | " 'https://www.reddit.com/r/Warframe/',\n",
474 | " 'https://www.reddit.com/r/Warhammer40k/',\n",
475 | " 'https://www.reddit.com/r/Warthunder/',\n",
476 | " 'https://www.reddit.com/r/Watches/',\n",
477 | " 'https://www.reddit.com/r/Watchexchange/',\n",
478 | " 'https://www.reddit.com/r/Wellington/',\n",
479 | " 'https://www.reddit.com/r/Wetshaving/',\n",
480 | " 'https://www.reddit.com/r/Windows10/',\n",
481 | " 'https://www.reddit.com/r/Winnipeg/',\n",
482 | " 'https://www.reddit.com/r/WorldOfWarships/',\n",
483 | " 'https://www.reddit.com/r/WorldofTanks/',\n",
484 | " 'https://www.reddit.com/r/Youniqueamua/',\n",
485 | " 'https://www.reddit.com/r/aSongOfMemesAndRage/',\n",
486 | " 'https://www.reddit.com/r/acne/',\n",
487 | " 'https://www.reddit.com/r/adventuretime/',\n",
488 | " 'https://www.reddit.com/r/airsoft/',\n",
489 | " 'https://www.reddit.com/r/amateur_boxing/',\n",
490 | " 'https://www.reddit.com/r/anime/',\n",
491 | " 'https://www.reddit.com/r/anime_irl/',\n",
492 | " 'https://www.reddit.com/r/apple/',\n",
493 | " 'https://www.reddit.com/r/argentina/',\n",
494 | " 'https://www.reddit.com/r/arrow/',\n",
495 | " 'https://www.reddit.com/r/askTO/',\n",
496 | " 'https://www.reddit.com/r/askscience/',\n",
497 | " 'https://www.reddit.com/r/asoiaf/',\n",
498 | " 'https://www.reddit.com/r/australia/',\n",
499 | " 'https://www.reddit.com/r/awardtravel/',\n",
500 | " 'https://www.reddit.com/r/backpacking/',\n",
501 | " 'https://www.reddit.com/r/balisong/',\n",
502 | " 'https://www.reddit.com/r/barstoolsports/',\n",
503 | " 'https://www.reddit.com/r/baseball/',\n",
504 | " 'https://www.reddit.com/r/batman/',\n",
505 | " 'https://www.reddit.com/r/battlestations/',\n",
506 | " 'https://www.reddit.com/r/bayarea/',\n",
507 | " 'https://www.reddit.com/r/beards/',\n",
508 | " 'https://www.reddit.com/r/beauty/',\n",
509 | " 'https://www.reddit.com/r/berkeley/',\n",
510 | " 'https://www.reddit.com/r/bicycling/',\n",
511 | " 'https://www.reddit.com/r/bikecommuting/',\n",
512 | " 'https://www.reddit.com/r/bikewrench/',\n",
513 | " 'https://www.reddit.com/r/bjj/',\n",
514 | " 'https://www.reddit.com/r/blackmirror/',\n",
515 | " 'https://www.reddit.com/r/bleach/',\n",
516 | " 'https://www.reddit.com/r/boardgames/',\n",
517 | " 'https://www.reddit.com/r/bodybuilding/',\n",
518 | " 'https://www.reddit.com/r/bodyweightfitness/',\n",
519 | " 'https://www.reddit.com/r/books/',\n",
520 | " 'https://www.reddit.com/r/boostedboards/',\n",
521 | " 'https://www.reddit.com/r/bostonceltics/',\n",
522 | " 'https://www.reddit.com/r/brasil/',\n",
523 | " 'https://www.reddit.com/r/brasilivre/',\n",
524 | " 'https://www.reddit.com/r/breakingbad/',\n",
525 | " 'https://www.reddit.com/r/brisbane/',\n",
526 | " 'https://www.reddit.com/r/brooklynninenine/',\n",
527 | " 'https://www.reddit.com/r/buildapc/',\n",
528 | " 'https://www.reddit.com/r/camping/',\n",
529 | " 'https://www.reddit.com/r/canada/',\n",
530 | " 'https://www.reddit.com/r/canucks/',\n",
531 | " 'https://www.reddit.com/r/cars/',\n",
532 | " 'https://www.reddit.com/r/chelseafc/',\n",
533 | " 'https://www.reddit.com/r/chile/',\n",
534 | " 'https://www.reddit.com/r/cirkeltrek/',\n",
535 | " 'https://www.reddit.com/r/classicwow/',\n",
536 | " 'https://www.reddit.com/r/climbing/',\n",
537 | " 'https://www.reddit.com/r/community/',\n",
538 | " 'https://www.reddit.com/r/confession/',\n",
539 | " 'https://www.reddit.com/r/cordcutters/',\n",
540 | " 'https://www.reddit.com/r/cowboys/',\n",
541 | " 'https://www.reddit.com/r/coys/',\n",
542 | " 'https://www.reddit.com/r/criterion/',\n",
543 | " 'https://www.reddit.com/r/croatia/',\n",
544 | " 'https://www.reddit.com/r/crossfit/',\n",
545 | " 'https://www.reddit.com/r/cscareerquestions/',\n",
546 | " 'https://www.reddit.com/r/curlyhair/',\n",
547 | " 'https://www.reddit.com/r/cycling/',\n",
548 | " 'https://www.reddit.com/r/danganronpa/',\n",
549 | " 'https://www.reddit.com/r/dataisbeautiful/',\n",
550 | " 'https://www.reddit.com/r/dataisbeautiful/?f=flair_name%3A%22OC%22',\n",
551 | " 'https://www.reddit.com/r/dataisbeautiful/comments/g0o65a/oc_a_full_year_of_income_and_expenses_through_my/',\n",
552 | " 'https://www.reddit.com/r/dauntless/',\n",
553 | " 'https://www.reddit.com/r/dbz/',\n",
554 | " 'https://www.reddit.com/r/de/',\n",
555 | " 'https://www.reddit.com/r/deadbydaylight/',\n",
556 | " 'https://www.reddit.com/r/denvernuggets/',\n",
557 | " 'https://www.reddit.com/r/destiny2/',\n",
558 | " 'https://www.reddit.com/r/detroitlions/',\n",
559 | " 'https://www.reddit.com/r/diabetes/',\n",
560 | " 'https://www.reddit.com/r/diabetes_t1/',\n",
561 | " 'https://www.reddit.com/r/discgolf/',\n",
562 | " 'https://www.reddit.com/r/discordapp/',\n",
563 | " 'https://www.reddit.com/r/disney/',\n",
564 | " 'https://www.reddit.com/r/dndmemes/',\n",
565 | " 'https://www.reddit.com/r/dndnext/',\n",
566 | " 'https://www.reddit.com/r/doctorwho/',\n",
567 | " 'https://www.reddit.com/r/dubai/',\n",
568 | " 'https://www.reddit.com/r/eagles/',\n",
569 | " 'https://www.reddit.com/r/ehlersdanlos/',\n",
570 | " 'https://www.reddit.com/r/elderscrollsonline/',\n",
571 | " 'https://www.reddit.com/r/eu4/',\n",
572 | " 'https://www.reddit.com/r/europe/',\n",
573 | " 'https://www.reddit.com/r/explainlikeimfive/',\n",
574 | " 'https://www.reddit.com/r/fairytail/',\n",
575 | " 'https://www.reddit.com/r/fantasybaseball/',\n",
576 | " 'https://www.reddit.com/r/fantasyfootball/',\n",
577 | " 'https://www.reddit.com/r/fasting/',\n",
578 | " 'https://www.reddit.com/r/femalefashionadvice/',\n",
579 | " 'https://www.reddit.com/r/femalehairadvice/',\n",
580 | " 'https://www.reddit.com/r/ffxiv/',\n",
581 | " 'https://www.reddit.com/r/findfashion/',\n",
582 | " 'https://www.reddit.com/r/fireemblem/',\n",
583 | " 'https://www.reddit.com/r/fivenightsatfreddys/',\n",
584 | " 'https://www.reddit.com/r/flexibility/',\n",
585 | " 'https://www.reddit.com/r/flightsim/',\n",
586 | " 'https://www.reddit.com/r/flyfishing/',\n",
587 | " 'https://www.reddit.com/r/fo76/',\n",
588 | " 'https://www.reddit.com/r/footballmanagergames/',\n",
589 | " 'https://www.reddit.com/r/forhonor/',\n",
590 | " 'https://www.reddit.com/r/formula1/',\n",
591 | " 'https://www.reddit.com/r/fragrance/',\n",
592 | " 'https://www.reddit.com/r/france/',\n",
593 | " 'https://www.reddit.com/r/freefolk/',\n",
594 | " 'https://www.reddit.com/r/frugalmalefashion/',\n",
595 | " 'https://www.reddit.com/r/futurama/',\n",
596 | " 'https://www.reddit.com/r/future_fight/',\n",
597 | " 'https://www.reddit.com/r/gainit/',\n",
598 | " 'https://www.reddit.com/r/gameofthrones/',\n",
599 | " 'https://www.reddit.com/r/germany/',\n",
600 | " 'https://www.reddit.com/r/gifs/',\n",
601 | " 'https://www.reddit.com/r/gifs/comments/g0tzwn/disney_tried_editing_out_darryl_hannahs_butt_by/',\n",
602 | " 'https://www.reddit.com/r/girlsfrontline/',\n",
603 | " 'https://www.reddit.com/r/golf/',\n",
604 | " 'https://www.reddit.com/r/goodyearwelt/',\n",
605 | " 'https://www.reddit.com/r/grandorder/',\n",
606 | " 'https://www.reddit.com/r/greece/',\n",
607 | " 'https://www.reddit.com/r/greysanatomy/',\n",
608 | " 'https://www.reddit.com/r/gtaonline/',\n",
609 | " 'https://www.reddit.com/r/halifax/',\n",
610 | " 'https://www.reddit.com/r/halo/',\n",
611 | " 'https://www.reddit.com/r/headphones/',\n",
612 | " 'https://www.reddit.com/r/hearthstone/',\n",
613 | " 'https://www.reddit.com/r/heroesofthestorm/',\n",
614 | " 'https://www.reddit.com/r/hiking/',\n",
615 | " 'https://www.reddit.com/r/hockey/',\n",
616 | " 'https://www.reddit.com/r/hockeyjerseys/',\n",
617 | " 'https://www.reddit.com/r/hockeyplayers/',\n",
618 | " 'https://www.reddit.com/r/houston/',\n",
619 | " 'https://www.reddit.com/r/howardstern/',\n",
620 | " 'https://www.reddit.com/r/hungary/',\n",
621 | " 'https://www.reddit.com/r/india/',\n",
622 | " 'https://www.reddit.com/r/indonesia/',\n",
623 | " 'https://www.reddit.com/r/intermittentfasting/',\n",
624 | " 'https://www.reddit.com/r/iphone/',\n",
625 | " 'https://www.reddit.com/r/ireland/',\n",
626 | " 'https://www.reddit.com/r/italy/',\n",
627 | " 'https://www.reddit.com/r/jailbreak/',\n",
628 | " 'https://www.reddit.com/r/japanesestreetwear/',\n",
629 | " 'https://www.reddit.com/r/japanlife/',\n",
630 | " 'https://www.reddit.com/r/jobs/',\n",
631 | " 'https://www.reddit.com/r/kansascity/',\n",
632 | " 'https://www.reddit.com/r/keto/',\n",
633 | " 'https://www.reddit.com/r/korea/',\n",
634 | " 'https://www.reddit.com/r/lakers/',\n",
635 | " 'https://www.reddit.com/r/leafs/',\n",
636 | " 'https://www.reddit.com/r/leagueoflegends/',\n",
637 | " 'https://www.reddit.com/r/leangains/',\n",
638 | " 'https://www.reddit.com/r/learnprogramming/',\n",
639 | " 'https://www.reddit.com/r/learnpython/',\n",
640 | " 'https://www.reddit.com/r/legaladvice/',\n",
641 | " 'https://www.reddit.com/r/longboarding/',\n",
642 | " 'https://www.reddit.com/r/loseit/',\n",
643 | " 'https://www.reddit.com/r/lucifer/',\n",
644 | " 'https://www.reddit.com/r/makeupexchange/',\n",
645 | " 'https://www.reddit.com/r/malaysia/',\n",
646 | " 'https://www.reddit.com/r/malefashion/',\n",
647 | " 'https://www.reddit.com/r/malefashionadvice/',\n",
648 | " 'https://www.reddit.com/r/malehairadvice/',\n",
649 | " 'https://www.reddit.com/r/malelivingspace/',\n",
650 | " 'https://www.reddit.com/r/marvelmemes/',\n",
651 | " 'https://www.reddit.com/r/marvelstudios/',\n",
652 | " 'https://www.reddit.com/r/medical_advice/',\n",
653 | " 'https://www.reddit.com/r/melbourne/',\n",
654 | " 'https://www.reddit.com/r/memes/',\n",
655 | " 'https://www.reddit.com/r/mexico/',\n",
656 | " 'https://www.reddit.com/r/migraine/',\n",
657 | " 'https://www.reddit.com/r/minnesotatwins/',\n",
658 | " 'https://www.reddit.com/r/minnesotavikings/',\n",
659 | " 'https://www.reddit.com/r/mw4/',\n",
660 | " 'https://www.reddit.com/r/mylittlepony/',\n",
661 | " 'https://www.reddit.com/r/nashville/',\n",
662 | " 'https://www.reddit.com/r/nattyorjuice/',\n",
663 | " 'https://www.reddit.com/r/nba/',\n",
664 | " 'https://www.reddit.com/r/nbadiscussion/',\n",
665 | " 'https://www.reddit.com/r/netflix/',\n",
666 | " 'https://www.reddit.com/r/newsokur/',\n",
667 | " 'https://www.reddit.com/r/newzealand/',\n",
668 | " 'https://www.reddit.com/r/nfl/',\n",
669 | " 'https://www.reddit.com/r/nhl/',\n",
670 | " 'https://www.reddit.com/r/norge/',\n",
671 | " 'https://www.reddit.com/r/nosleep/',\n",
672 | " 'https://www.reddit.com/r/nova/',\n",
673 | " 'https://www.reddit.com/r/nrl/',\n",
674 | " 'https://www.reddit.com/r/nutrition/',\n",
675 | " 'https://www.reddit.com/r/nvidia/',\n",
676 | " 'https://www.reddit.com/r/nyjets/',\n",
677 | " 'https://www.reddit.com/r/omad/',\n",
678 | " 'https://www.reddit.com/r/orangecounty/',\n",
679 | " 'https://www.reddit.com/r/orangetheory/',\n",
680 | " 'https://www.reddit.com/r/osugame/',\n",
681 | " 'https://www.reddit.com/r/ottawa/',\n",
682 | " 'https://www.reddit.com/r/overlord/',\n",
683 | " 'https://www.reddit.com/r/pathofexile/',\n",
684 | " 'https://www.reddit.com/r/pcmasterrace/',\n",
685 | " 'https://www.reddit.com/r/peloton/',\n",
686 | " 'https://www.reddit.com/r/pesmobile/',\n",
687 | " 'https://www.reddit.com/r/philadelphia/',\n",
688 | " 'https://www.reddit.com/r/phillies/',\n",
689 | " 'https://www.reddit.com/r/phoenix/',\n",
690 | " 'https://www.reddit.com/r/pics/',\n",
691 | " 'https://www.reddit.com/r/piercing/',\n",
692 | " 'https://www.reddit.com/r/pittsburgh/',\n",
693 | " 'https://www.reddit.com/r/playrust/',\n",
694 | " 'https://www.reddit.com/r/podemos/',\n",
695 | " 'https://www.reddit.com/r/pokemon/',\n",
696 | " 'https://www.reddit.com/r/pokemongo/',\n",
697 | " 'https://www.reddit.com/r/pokemontrades/',\n",
698 | " 'https://www.reddit.com/r/portugal/',\n",
699 | " 'https://www.reddit.com/r/poshmark/',\n",
700 | " 'https://www.reddit.com/r/powerlifting/',\n",
701 | " 'https://www.reddit.com/r/progresspics/',\n",
702 | " 'https://www.reddit.com/r/raleigh/',\n",
703 | " 'https://www.reddit.com/r/ravens/',\n",
704 | " 'https://www.reddit.com/r/rawdenim/',\n",
705 | " 'https://www.reddit.com/r/realmadrid/',\n",
706 | " 'https://www.reddit.com/r/reddeadredemption/',\n",
707 | " 'https://www.reddit.com/r/reddevils/',\n",
708 | " 'https://www.reddit.com/r/redsox/',\n",
709 | " 'https://www.reddit.com/r/relationship_advice/',\n",
710 | " 'https://www.reddit.com/r/rickandmorty/',\n",
711 | " 'https://www.reddit.com/r/ripcity/',\n",
712 | " 'https://www.reddit.com/r/riverdale/',\n",
713 | " 'https://www.reddit.com/r/roadtrip/',\n",
714 | " 'https://www.reddit.com/r/rolex/',\n",
715 | " 'https://www.reddit.com/r/rollercoasters/',\n",
716 | " 'https://www.reddit.com/r/rpdrcringe/',\n",
717 | " 'https://www.reddit.com/r/rugbyunion/',\n",
718 | " 'https://www.reddit.com/r/runescape/',\n",
719 | " 'https://www.reddit.com/r/running/',\n",
720 | " 'https://www.reddit.com/r/rupaulsdragrace/',\n",
721 | " 'https://www.reddit.com/r/rva/',\n",
722 | " 'https://www.reddit.com/r/sanantonio/',\n",
723 | " 'https://www.reddit.com/r/sandiego/',\n",
724 | " 'https://www.reddit.com/r/sanfrancisco/',\n",
725 | " 'https://www.reddit.com/r/saskatoon/',\n",
726 | " 'https://www.reddit.com/r/scifi/',\n",
727 | " 'https://www.reddit.com/r/seinfeld/',\n",
728 | " 'https://www.reddit.com/r/serbia/',\n",
729 | " 'https://www.reddit.com/r/shield/',\n",
730 | " 'https://www.reddit.com/r/singapore/',\n",
731 | " 'https://www.reddit.com/r/sixers/',\n",
732 | " 'https://www.reddit.com/r/skiing/',\n",
733 | " 'https://www.reddit.com/r/skyrim/',\n",
734 | " 'https://www.reddit.com/r/smashbros/',\n",
735 | " 'https://www.reddit.com/r/sneakermarket/',\n",
736 | " 'https://www.reddit.com/r/snowboarding/',\n",
737 | " 'https://www.reddit.com/r/soccer/',\n",
738 | " 'https://www.reddit.com/r/solotravel/',\n",
739 | " 'https://www.reddit.com/r/southpark/',\n",
740 | " 'https://www.reddit.com/r/sports/',\n",
741 | " 'https://www.reddit.com/r/sportsbook/',\n",
742 | " 'https://www.reddit.com/r/starbucks/',\n",
743 | " 'https://www.reddit.com/r/starcitizen/',\n",
744 | " 'https://www.reddit.com/r/startrek/',\n",
745 | " 'https://www.reddit.com/r/steelers/',\n",
746 | " 'https://www.reddit.com/r/stevenuniverse/',\n",
747 | " 'https://www.reddit.com/r/stlouisblues/',\n",
748 | " 'https://www.reddit.com/r/streetwearstartup/',\n",
749 | " 'https://www.reddit.com/r/summonerswar/',\n",
750 | " 'https://www.reddit.com/r/suns/',\n",
751 | " 'https://www.reddit.com/r/survivor/',\n",
752 | " 'https://www.reddit.com/r/sweden/',\n",
753 | " 'https://www.reddit.com/r/swoleacceptance/',\n",
754 | " 'https://www.reddit.com/r/sydney/',\n",
755 | " 'https://www.reddit.com/r/sysadmin/',\n",
756 | " 'https://www.reddit.com/r/tampabayrays/',\n",
757 | " 'https://www.reddit.com/r/tattoos/',\n",
758 | " 'https://www.reddit.com/r/techsupport/',\n",
759 | " 'https://www.reddit.com/r/tennis/',\n",
760 | " 'https://www.reddit.com/r/tf2/',\n",
761 | " 'https://www.reddit.com/r/the100/',\n",
762 | " 'https://www.reddit.com/r/thebachelor/',\n",
763 | " 'https://www.reddit.com/r/thedivision/',\n",
764 | " 'https://www.reddit.com/r/thenetherlands/',\n",
765 | " 'https://www.reddit.com/r/thesims/',\n",
766 | " 'https://www.reddit.com/r/thesopranos/',\n",
767 | " 'https://www.reddit.com/r/thewalkingdead/',\n",
768 | " 'https://www.reddit.com/r/tipofmytongue/',\n",
769 | " 'https://www.reddit.com/r/titanfolk/',\n",
770 | " 'https://www.reddit.com/r/todayilearned/',\n",
771 | " 'https://www.reddit.com/r/torontoraptors/',\n",
772 | " 'https://www.reddit.com/r/totalwar/',\n",
773 | " 'https://www.reddit.com/r/touhou/',\n",
774 | " 'https://www.reddit.com/r/trailerparkboys/',\n",
775 | " 'https://www.reddit.com/r/translator/',\n",
776 | " 'https://www.reddit.com/r/travel/',\n",
777 | " 'https://www.reddit.com/r/vagabond/',\n",
778 | " 'https://www.reddit.com/r/vancouver/',\n",
779 | " 'https://www.reddit.com/r/vanderpumprules/',\n",
780 | " 'https://www.reddit.com/r/vegan/',\n",
781 | " 'https://www.reddit.com/r/videos/',\n",
782 | " 'https://www.reddit.com/r/vzla/',\n",
783 | " 'https://www.reddit.com/r/warriors/',\n",
784 | " 'https://www.reddit.com/r/weightroom/',\n",
785 | " 'https://www.reddit.com/r/westworld/',\n",
786 | " 'https://www.reddit.com/r/wicked_edge/',\n",
787 | " 'https://www.reddit.com/r/worldnews/',\n",
788 | " 'https://www.reddit.com/r/wow/',\n",
789 | " 'https://www.reddit.com/r/xboxone/',\n",
790 | " 'https://www.reddit.com/r/xxfitness/',\n",
791 | " 'https://www.reddit.com/r/yeezys/',\n",
792 | " 'https://www.reddit.com/r/yoga/',\n",
793 | " 'https://www.reddit.com/r/yugioh/',\n",
794 | " 'https://www.reddit.com/r/zerocarb/',\n",
795 | " 'https://www.reddit.com/search?q=dune&source=trending',\n",
796 | " 'https://www.reddit.com/search?q=fauci&source=trending',\n",
797 | " 'https://www.reddit.com/search?q=kyle%20larson&source=trending',\n",
798 | " 'https://www.reddit.com/search?q=nascar&source=trending',\n",
799 | " 'https://www.reddit.com/search?q=rick%20may&source=trending',\n",
800 | " 'https://www.reddit.com/search?q=tornado&source=trending',\n",
801 | " 'https://www.reddit.com/subreddits/leaderboard/up-and-coming',\n",
802 | " 'https://www.reddit.com/user/ItsBOOM/',\n",
803 | " 'https://www.reddit.com/user/SPM8/',\n",
804 | " 'https://www.reddit.com/user/con_commenter/',\n",
805 | " 'https://www.reddit.com/user/jesq/',\n",
806 | " 'https://www.reddit.com/user/memezzer/',\n",
807 | " 'https://www.reddit.com/user/mtlgrems/',\n",
808 | " 'https://www.reddit.com/user/notsure500/',\n",
809 | " 'https://www.reddit.com/user/skinkbaa/',\n",
810 | " 'https://www.reddit.com/user/steven5it/'}"
811 | ]
812 | },
813 | "execution_count": 16,
814 | "metadata": {},
815 | "output_type": "execute_result"
816 | }
817 | ],
818 | "source": [
819 | "# Take only the new items in the first set\n",
820 | "new_urls.difference(urls)"
821 | ]
822 | },
823 | {
824 | "cell_type": "code",
825 | "execution_count": 17,
826 | "metadata": {},
827 | "outputs": [
828 | {
829 | "data": {
830 | "text/plain": [
831 | ""
832 | ]
833 | },
834 | "execution_count": 17,
835 | "metadata": {},
836 | "output_type": "execute_result"
837 | }
838 | ],
839 | "source": [
840 | "# Finally, close the session\n",
841 | "session.close()"
842 | ]
843 | },
844 | {
845 | "cell_type": "code",
846 | "execution_count": 18,
847 | "metadata": {},
848 | "outputs": [
849 | {
850 | "name": "stdout",
851 | "output_type": "stream",
852 | "text": [
853 | "Reloads the response in Chromium, and replaces HTML content\n",
854 | " with an updated version, with JavaScript executed.\n",
855 | "\n",
856 | " :param retries: The number of times to retry loading the page in Chromium.\n",
857 | " :param script: JavaScript to execute upon page load (optional).\n",
858 | " :param wait: The number of seconds to wait before loading the page, preventing timeouts (optional).\n",
859 | " :param scrolldown: Integer, if provided, of how many times to page down.\n",
860 | " :param sleep: Integer, if provided, of how many long to sleep after initial render.\n",
861 | " :param reload: If ``False``, content will not be loaded from the browser, but will be provided from memory.\n",
862 | " :param keep_page: If ``True`` will allow you to interact with the browser page through ``r.html.page``.\n",
863 | "\n",
864 | " If ``scrolldown`` is specified, the page will scrolldown the specified\n",
865 | " number of times, after sleeping the specified amount of time\n",
866 | " (e.g. ``scrolldown=10, sleep=1``).\n",
867 | "\n",
868 | " If just ``sleep`` is provided, the rendering will wait *n* seconds, before\n",
869 | " returning.\n",
870 | "\n",
871 | " If ``script`` is specified, it will execute the provided JavaScript at\n",
872 | " runtime. Example:\n",
873 | "\n",
874 | " .. code-block:: python\n",
875 | "\n",
876 | " script = \"\"\"\n",
877 | " () => {\n",
878 | " return {\n",
879 | " width: document.documentElement.clientWidth,\n",
880 | " height: document.documentElement.clientHeight,\n",
881 | " deviceScaleFactor: window.devicePixelRatio,\n",
882 | " }\n",
883 | " }\n",
884 | " \"\"\"\n",
885 | "\n",
886 | " Returns the return value of the executed ``script``, if any is provided:\n",
887 | "\n",
888 | " .. code-block:: python\n",
889 | "\n",
890 | " >>> r.html.render(script=script)\n",
891 | " {'width': 800, 'height': 600, 'deviceScaleFactor': 1}\n",
892 | "\n",
893 | " Warning: the first time you run this method, it will download\n",
894 | " Chromium into your home directory (``~/.pyppeteer``).\n",
895 | " \n"
896 | ]
897 | }
898 | ],
899 | "source": [
900 | "# You can check the documentation directly inside Jupyter\n",
901 | "print(r.html.render.__doc__)"
902 | ]
903 | },
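{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sketch of how the `scrolldown` and `sleep` parameters documented above could be combined. This is illustrative only, and it opens a fresh session, since the one above was closed:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch based on the render() docstring above\n",
"from requests_html import HTMLSession\n",
"\n",
"s = HTMLSession()\n",
"r = s.get('https://www.reddit.com/')\n",
"# page down 5 times, sleeping 1 second after the initial render\n",
"r.html.render(scrolldown=5, sleep=1)\n",
"print(len(r.html.absolute_links))\n",
"s.close()"
]
},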
904 | {
905 | "cell_type": "code",
906 | "execution_count": null,
907 | "metadata": {},
908 | "outputs": [],
909 | "source": []
910 | }
911 | ],
912 | "metadata": {
913 | "kernelspec": {
914 | "display_name": "Python 3",
915 | "language": "python",
916 | "name": "python3"
917 | },
918 | "language_info": {
919 | "codemirror_mode": {
920 | "name": "ipython",
921 | "version": 3
922 | },
923 | "file_extension": ".py",
924 | "mimetype": "text/x-python",
925 | "name": "python",
926 | "nbconvert_exporter": "python",
927 | "pygments_lexer": "ipython3",
928 | "version": "3.7.3"
929 | }
930 | },
931 | "nbformat": 4,
932 | "nbformat_minor": 2
933 | }
934 |
--------------------------------------------------------------------------------
/11.Scraping JavaScript - SoundCloud Project/Section 10 - Scraping SoundCloud - Setup.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Scraping SoundCoud"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Initial Setup"
15 | ]
16 | },
17 | {
18 | "cell_type": "code",
19 | "execution_count": null,
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "# import packages\n",
24 | "import requests\n",
25 | "from bs4 import BeautifulSoup"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": null,
31 | "metadata": {},
32 | "outputs": [],
33 | "source": [
34 | "from requests_html import AsyncHTMLSession"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": null,
40 | "metadata": {},
41 | "outputs": [],
42 | "source": []
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {},
47 | "source": [
48 | "## Connect to SoundCloud"
49 | ]
50 | },
51 | {
52 | "cell_type": "code",
53 | "execution_count": null,
54 | "metadata": {},
55 | "outputs": [],
56 | "source": [
57 | "# make connection to webpage\n",
58 | "resp = requests.get(\"https://soundcloud.com/discover\")"
59 | ]
60 | },
61 | {
62 | "cell_type": "code",
63 | "execution_count": null,
64 | "metadata": {},
65 | "outputs": [],
66 | "source": [
67 | "# get HTML from response object\n",
68 | "html = resp.content"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": null,
74 | "metadata": {},
75 | "outputs": [],
76 | "source": [
77 | "# convert HTML to BeautifulSoup object\n",
78 | "soup = BeautifulSoup(html, \"lxml\")"
79 | ]
80 | },
81 | {
82 | "cell_type": "code",
83 | "execution_count": null,
84 | "metadata": {},
85 | "outputs": [],
86 | "source": []
87 | },
88 | {
89 | "cell_type": "markdown",
90 | "metadata": {},
91 | "source": [
92 | "## Get links on the webpage. Notice how this doesn't extract all the links visible on the webpage...what can we do about that?"
93 | ]
94 | },
95 | {
96 | "cell_type": "code",
97 | "execution_count": null,
98 | "metadata": {},
99 | "outputs": [],
100 | "source": [
101 | "soup.find_all(\"a\")"
102 | ]
103 | },
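{
"cell_type": "markdown",
"metadata": {},
"source": [
"Counting the anchors makes the gap concrete. The sketch below (illustrative) prints how many links the static, non-rendered HTML contains, for comparison with the rendered page later:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Number of links in the static HTML returned by plain requests\n",
"print(len(soup.find_all(\"a\")))"
]
},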
104 | {
105 | "cell_type": "code",
106 | "execution_count": null,
107 | "metadata": {},
108 | "outputs": [],
109 | "source": []
110 | },
111 | {
112 | "cell_type": "markdown",
113 | "metadata": {},
114 | "source": [
115 | "## 1) Use requests-html to extract other links on the page by executing JavaScript. How many links do you see now?\n",
116 | "## 2) After you complete 1), get the text of the new paragraphs now visible in the HTML.\n",
117 | "## 3) Try out a few other tags - what else appears after executing the JavaScript?\n",
118 | "## 4) Using a CSS selector, extract the meta tag with name = \"keywords\". Can you get this tag's attributes?\n",
119 | "## 5) Links that automatically open to a new a tab are identified by having a \"target\" attribute equal to \"_blank\". Try extracting these links and their URLs."
120 | ]
121 | },
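{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of one possible approach to 1), using the `AsyncHTMLSession` imported above (Jupyter supports top-level `await`). Everything here is illustrative rather than the official solution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch for 1): render the JavaScript, then recount the links\n",
"asession = AsyncHTMLSession()\n",
"r = await asession.get(\"https://soundcloud.com/discover\")\n",
"await r.html.arender()  # executes the page's JavaScript in Chromium\n",
"print(len(r.html.find(\"a\")))  # typically far more links than in the static HTML\n",
"await asession.close()"
]
},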
122 | {
123 | "cell_type": "code",
124 | "execution_count": null,
125 | "metadata": {},
126 | "outputs": [],
127 | "source": []
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": null,
132 | "metadata": {},
133 | "outputs": [],
134 | "source": []
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": null,
139 | "metadata": {},
140 | "outputs": [],
141 | "source": []
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": null,
146 | "metadata": {},
147 | "outputs": [],
148 | "source": []
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": null,
153 | "metadata": {},
154 | "outputs": [],
155 | "source": []
156 | }
157 | ],
158 | "metadata": {
159 | "kernelspec": {
160 | "display_name": "Python 3",
161 | "language": "python",
162 | "name": "python3"
163 | },
164 | "language_info": {
165 | "codemirror_mode": {
166 | "name": "ipython",
167 | "version": 3
168 | },
169 | "file_extension": ".py",
170 | "mimetype": "text/x-python",
171 | "name": "python",
172 | "nbconvert_exporter": "python",
173 | "pygments_lexer": "ipython3",
174 | "version": "3.7.3"
175 | }
176 | },
177 | "nbformat": 4,
178 | "nbformat_minor": 2
179 | }
180 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 Phone Thiri Yadana
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
1 | # Web Scraping and API in Python
2 |
3 | Python project for integrating with different APIs and for web scraping with the BeautifulSoup and requests-HTML libraries, covering multiple scraping projects such as YouTube, the dynamically generated JavaScript content of SoundCloud, and more.
4 |
5 | ## Built With
6 | * [python 3](https://www.python.org/)
7 | * [requests](https://requests.readthedocs.io/en/master/) - Requests is an elegant and simple HTTP library for Python.
8 | * [pandas](https://pandas.pydata.org/) - a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool
9 | * [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) - a screen-scraping library for pulling data out of HTML and XML files
10 | * [requests-HTML](https://requests.readthedocs.io/projects/requests-html/en/latest/) - makes parsing HTML as simple and intuitive as possible, with full JavaScript support
11 | * [python html.parser](https://docs.python.org/3/library/html.parser.html) - the HTML parser from the Python standard library
12 | * [lxml parser](https://lxml.de/parsing.html) - a fast HTML and XML parser
13 | * [html5lib parser](https://github.com/html5lib/html5lib-python) - a pure-Python HTML parser that parses pages the way web browsers do
14 | * [urllib](https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse) - URL handling module
15 |
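## Example

A minimal sketch of the request-parse-extract pattern used throughout the scraping projects (the URL and tag choices here are illustrative):

```python
import requests
from bs4 import BeautifulSoup

# fetch the page and parse the returned HTML with lxml
resp = requests.get("https://www.rottentomatoes.com/")
soup = BeautifulSoup(resp.content, "lxml")

# extract every anchor tag and count them
links = soup.find_all("a")
print(len(links))
```
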
16 | ## API Projects
17 | * [Currency Exchange Rate API](https://exchangeratesapi.io/)
18 | * [iTunes Search API](https://developer.apple.com/library/archive/documentation/AudioVideo/Conceptual/iTuneSearchAPI/Searching.html#//apple_ref/doc/uid/TP40017632-CH5-SW1)
19 | * [GitHub Jobs API](https://jobs.github.com/api)
20 | * [Official Joke API](https://github.com/15Dkatz/official_joke_api)
21 | * [Joke API](https://sv443.net/jokeapi)
22 |
23 | ## Web Scraping Projects
24 | * [Rotten Tomatoes](https://www.rottentomatoes.com/)
25 | * [Steam](https://store.steampowered.com/games/)
26 | * [YouTube](https://www.youtube.com/)
27 | * [SoundCloud](https://soundcloud.com/)
28 |
29 | ## License
30 |
31 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
32 |
33 | ## References
34 |
35 | * The challenges are part of the [Web Scraping and API Fundamentals in Python course](https://365datascience.com/courses/web-scraping-and-api-fundamentals-in-python/) by 365 Data Science.
36 |
--------------------------------------------------------------------------------