├── web_scraping_maestría.pdf
├── LICENSE
├── README.md
└── Selenium_2023.ipynb
/web_scraping_maestría.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GEJ1/web-scraping-python/HEAD/web_scraping_maestría.pdf
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 Gustavo Juantorena
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Web scraping with Python ⛏️
2 | Basic resources for getting started with web scraping using Python.
3 |
4 |
5 | Class taught by Gustavo Juantorena ([github](https://github.com/GEJ1) / [Linkedin](https://www.linkedin.com/in/gustavo-juantorena/)) as a guest lecturer for the Text Mining course of the [Maestría en Explotación de Datos y Descubrimiento de Conocimiento](http://datamining.dc.uba.ar/datamining/) at the Universidad de Buenos Aires.
6 | The content presented here goes beyond what was covered in class, but it should serve as a useful reference for digging deeper into the topics.
7 | * [Class slides](https://docs.google.com/presentation/d/10-lc2Y6kMVHp7FO9v8ReZdY1MPwUlgxWIsSDePY0afg/edit?usp=sharing)
8 | * [First class notebook (APIs + Beautiful Soup)](https://github.com/GEJ1/web-scraping-python/blob/main/web_scraping_maestria.ipynb)
9 | * [Second class notebook (Selenium)](https://github.com/GEJ1/web-scraping-python/blob/main/Selenium_2023.ipynb)
10 |
11 | **Update 2023**: I created a course for [freeCodeCamp](https://www.freecodecamp.org/espanol/news/aprende-web-scraping-con-python-y-beautiful-soup-en-espanol-curso-desde-cero/) that covers most of this class's content; you can [find it here](https://github.com/GEJ1/web_scraping_freecodecamp).
12 |
13 | # Resources
14 |
15 | * [General web](#General-web)
16 | * [General web scraping](#General-web-scraping)
17 | * [Beautiful Soup](#Beautiful-Soup)
18 | * [Selenium](#Selenium)
19 | * [Scrapy](#Scrapy)
20 | * [Books](#Books)
21 |
22 | ## General web
23 | Almost everything here comes from https://developer.mozilla.org/es/, one of the best places to look up references on web technologies.
24 |
25 | * [List of HTML elements](https://developer.mozilla.org/es/docs/Web/HTML/Element)
26 | * [CSS material and property reference](https://developer.mozilla.org/es/docs/Web/CSS)
27 | * [JavaScript material](https://developer.mozilla.org/es/docs/Web/JavaScript)
28 | * [Overview of the HTTP protocol](https://developer.mozilla.org/es/docs/Web/HTTP/Overview)
29 | * [HTTP status codes](https://gabicuesta.blogspot.com/2019/01/http-status-codes.html)
30 | * [Introduction to the DOM](https://developer.mozilla.org/es/docs/Web/API/Document_Object_Model/Introduction)
31 |
32 | ## General web scraping
33 |
34 | * [Content of the Scraping con Python course](https://github.com/institutohumai/cursos-python/tree/master/Scraping)
35 |   * Course taught by [Mathias Gatti](https://github.com/mathigatti), [Matías Grinberg](https://github.com/Cerebrock), and [Gustavo Juantorena](https://github.com/gej1) at [humai](https://www.ihum.ai/).
36 |   * Check out [the videos](https://www.youtube.com/playlist?list=PLISuMnTdVU-xOHf3jEtiK1B_g5HFgXCb-) if you want to go deeper into the topics covered!
37 | * [Classes 18 and 19 of the Laboratorio de datos course (FCEyN, UBA, 1st semester 2021)](http://materias.df.uba.ar/lda2021c1/171-2/)
38 |   * Class videos, slides, and notebooks for working with APIs (class 18) and web scraping (class 19).
39 |   * Instructors: Enzo Tagliazucchi, Sebastián Pinto, Tomás Cicchini, and Ariel Berardino.
40 |
41 | ## Beautiful Soup
42 |
43 | * [Official documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
44 | * [Web Scraping con Python - Curso con Beautiful Soup](https://youtu.be/yKi9-BfbfzQ?si=ubomrRwjz5ziQrKq)
45 | * [Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/)
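46 | 
47 | If you want a quick taste before diving into the resources above, here is a minimal sketch (quotes.toscrape.com is a public scraping sandbox; the tag and class used are specific to that site):
48 | 
49 | ```python
50 | import requests
51 | from bs4 import BeautifulSoup
52 | 
53 | # Fetch the page and parse its HTML
54 | response = requests.get("https://quotes.toscrape.com/")
55 | soup = BeautifulSoup(response.text, "html.parser")
56 | 
57 | # Print the text of every quote on the page
58 | for quote in soup.find_all("span", class_="text"):
59 |     print(quote.get_text())
60 | ```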
46 |
47 |
48 | ## Selenium
49 |
50 | * [Unofficial but useful documentation](https://selenium-python.readthedocs.io/)
51 | * [Selenium FULL COURSE - Learn Selenium by creating a bot in 3 hours [2021]](https://youtu.be/6gxhcvrf2Jk)
52 | * [Web Scraping using Selenium and Python](https://www.scrapingbee.com/blog/selenium-python/)
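53 | 
54 | For a quick local start (the Colab setup in Selenium_2023.ipynb is more involved), a minimal sketch, assuming Selenium 4.6+ (which downloads a matching driver automatically via Selenium Manager):
55 | 
56 | ```python
57 | from selenium import webdriver
58 | from selenium.webdriver.common.by import By
59 | 
60 | # Headless Chrome: no visible browser window
61 | options = webdriver.ChromeOptions()
62 | options.add_argument("--headless=new")
63 | driver = webdriver.Chrome(options=options)
64 | 
65 | driver.get("https://www.wikipedia.org")
66 | print(driver.find_element(By.CSS_SELECTOR, "h1").text)
67 | driver.quit()
68 | ```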
53 |
54 | ## Scrapy
55 |
56 | * [Official documentation](https://docs.scrapy.org/en/latest/)
57 | * [Intro To Web Crawlers & Scraping With Scrapy](https://youtu.be/ALizgnSFTwQ)
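58 | 
59 | For orientation, a minimal spider sketch (quotes.toscrape.com is a public scraping sandbox; the names here are illustrative):
60 | 
61 | ```python
62 | import scrapy
63 | 
64 | class QuotesSpider(scrapy.Spider):
65 |     name = "quotes"
66 |     start_urls = ["https://quotes.toscrape.com/"]
67 | 
68 |     # parse() receives the response for each URL in start_urls
69 |     def parse(self, response):
70 |         for quote in response.css("div.quote"):
71 |             yield {
72 |                 "text": quote.css("span.text::text").get(),
73 |                 "author": quote.css("small.author::text").get(),
74 |             }
75 | ```
76 | 
77 | You can run it with `scrapy runspider spider.py -o quotes.json` (assuming the file is saved as spider.py).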
58 |
59 |
60 | ## Books
61 |
62 | * [Web Scraping with Python, 2nd Edition](https://www.oreilly.com/library/view/web-scraping-with/9781491985564/)
63 |   * [Companion code for the book](https://github.com/REMitchell/python-scraping)
64 | * [Hands-On Web Scraping with Python](https://www.amazon.com/Hands-Web-Scraping-Python-operations-ebook/dp/B07VFFYPGK)
65 |
66 |
67 |
--------------------------------------------------------------------------------
/Selenium_2023.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "provenance": [],
7 | "authorship_tag": "ABX9TyO6mh0aVFJlNLXBMgAL8KKs",
8 | "include_colab_link": true
9 | },
10 | "kernelspec": {
11 | "name": "python3",
12 | "display_name": "Python 3"
13 | },
14 | "language_info": {
15 | "name": "python"
16 | }
17 | },
18 | "cells": [
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {
22 | "id": "view-in-github",
23 | "colab_type": "text"
24 | },
25 | "source": [
26 |         "<a href=\"https://colab.research.google.com/github/GEJ1/web-scraping-python/blob/main/Selenium_2023.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
27 |       ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "source": [
32 |         "# **Advanced web scraping with Selenium**\n",
33 |         "\n",
34 |         "Selenium is a browser automation tool:\n",
35 |         "* ### Selenium lets us browse the web with a headless browser, allowing us to click, scroll, etc.\n",
36 |         "\n",
37 |         "* ### Using Selenium inside Google Colab is not the most common setup, but it is useful for teaching purposes. I'm not sure how scalable it is.\n",
38 |         "\n",
39 |         "* Docs: https://selenium-python.readthedocs.io/\n",
40 |         "\n",
41 |         "\n",
42 |         "## We have to install and configure it in Colab (locally it's easier)"
44 | ],
45 | "metadata": {
46 | "id": "XPDa5GqBHOUe"
47 | }
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": null,
52 | "metadata": {
53 | "id": "fFqff2VjofiQ"
54 | },
55 | "outputs": [],
56 | "source": [
57 |         "%%shell\n",
58 |         "# Source: https://github.com/googlecolab/colabtools/issues/3347\n",
59 |         "sudo apt -y update\n",
60 |         "sudo apt install -y wget curl unzip\n",
61 |         "wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb\n",
62 |         "dpkg -i libu2f-udev_1.1.4-1_all.deb\n",
63 |         "wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb\n",
64 |         "dpkg -i google-chrome-stable_current_amd64.deb\n",
65 |         "# Download the Chrome for Testing driver that matches the Chrome version installed above\n",
66 |         "wget -N -P /tmp/ https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/116.0.5845.96/linux64/chromedriver-linux64.zip\n",
67 |         "unzip -o /tmp/chromedriver-linux64.zip -d /tmp/\n",
68 |         "chmod +x /tmp/chromedriver-linux64/chromedriver\n",
69 |         "mv /tmp/chromedriver-linux64/chromedriver /usr/local/bin/chromedriver"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "source": [
75 | "!pip install selenium -q"
76 | ],
77 | "metadata": {
78 | "id": "rHwnvqpypFu9"
79 | },
80 | "execution_count": null,
81 | "outputs": []
82 | },
83 | {
84 | "cell_type": "code",
85 | "source": [
86 | "from selenium import webdriver\n",
87 | "from selenium.webdriver.chrome.options import Options\n",
88 | "from selenium.webdriver.common.by import By\n",
89 | "from selenium.webdriver.chrome.service import Service"
90 | ],
91 | "metadata": {
92 | "id": "1_7WNZZF3KWM"
93 | },
94 | "execution_count": null,
95 | "outputs": []
96 | },
97 | {
98 | "cell_type": "code",
99 | "source": [
100 | "options = webdriver.ChromeOptions()\n",
101 | "options.add_argument(\"--headless\")\n",
102 | "options.add_argument('--disable-dev-shm-usage')\n",
103 | "options.add_argument(\"--no-sandbox\")\n",
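104 |         "# (--headless runs Chrome without a GUI; the other two flags avoid common /dev/shm and sandbox issues in containers like Colab)\n",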
104 | "\n",
105 |         "# Instantiate the driver\n",
106 |         "wd = webdriver.Chrome(options=options)\n",
107 |         "\n",
108 |         "# Implicit wait: wait up to 20 s for elements (if found earlier, it doesn't wait)\n",
109 | "wd.implicitly_wait(20)\n",
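110 |         "# (An alternative is an explicit wait via selenium.webdriver.support.ui.WebDriverWait)\n",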
110 | "\n",
111 |         "# Target URL\n",
112 |         "url = \"https://www.wikipedia.com\"\n",
113 |         "# HTTP request\n",
114 |         "wd.get(url)\n",
115 |         "h1 = wd.find_element(By.CSS_SELECTOR, \"h1\")\n",
116 |         "print(f'h1 extracted from wikipedia: \\n\\n{h1.text}')"
117 | ],
118 | "metadata": {
119 | "id": "_YvExg2YomO2"
120 | },
121 | "execution_count": null,
122 | "outputs": []
123 | },
124 | {
125 | "cell_type": "code",
126 | "source": [
127 | "search_input = wd.find_element(By.ID,'searchInput')\n",
128 | "\n",
129 |         "# Send the text we want typed into the search form\n",
130 | "search_input.send_keys('Natural language processing')\n",
131 | "wd.save_screenshot(\"1.png\")\n",
132 | "\n",
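133 |         "# Locate the search button via XPath and click it\n",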
133 | "search_button = wd.find_element(By.XPATH,'//*[@id=\"search-form\"]/fieldset/button')\n",
134 | "search_button.click()\n",
135 | "wd.save_screenshot(\"2.png\")"
136 | ],
137 | "metadata": {
138 | "id": "TFURhZIE-1B_"
139 | },
140 | "execution_count": null,
141 | "outputs": []
142 | },
143 | {
144 | "cell_type": "code",
145 | "source": [
146 |         "# Print the heading and body content of the page we landed on\n",
147 | "heading = wd.find_element(By.ID,\"firstHeading\")\n",
148 | "body_content = wd.find_element(By.ID, \"bodyContent\")\n",
149 | "print(f'Heading: \\n{heading.text}')\n",
150 | "print(f'Content: \\n{body_content.text}')"
151 | ],
152 | "metadata": {
153 | "id": "_XTVLqhg_G2p"
154 | },
155 | "execution_count": null,
156 | "outputs": []
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "source": [
161 |         "## **Use case #1: Infinite scroll**\n",
162 |         "\n",
163 |         "Some pages don't show all of their content unless we scroll to the bottom. Scrolling fires a JavaScript event that renders more HTML, so new content appears.\n",
164 |         "\n",
165 |         "We can emulate that action with Selenium."
166 | ],
167 | "metadata": {
168 | "id": "YctCEElGxjyd"
169 | }
170 | },
171 | {
172 | "cell_type": "code",
173 | "source": [
174 |         "# Set up the web driver\n",
175 | "driver = webdriver.Chrome(options=options)\n",
176 | "\n",
177 |         "# Request the URL\n",
178 | "url = \"https://infinite-scroll.com/demo/full-page/\"\n",
179 | "driver.get(url)\n",
180 | "\n",
181 |         "# Find all the h2 elements (note the subtlety: find_elements, plural)\n",
182 |         "h2_list = driver.find_elements(By.CSS_SELECTOR, \"h2\")\n",
183 |         "for h2 in h2_list:\n",
184 |         "    print(h2.text)"
185 | ],
186 | "metadata": {
187 | "id": "eaf8YGzPxlZh"
188 | },
189 | "execution_count": null,
190 | "outputs": []
191 | },
192 | {
193 | "cell_type": "code",
194 | "source": [
195 |         "# Take a screenshot\n",
196 |         "driver.save_screenshot('infinite_page.screenshot.png')\n",
197 |         "\n",
198 |         "# Same as before, but iterate 5 times: scroll to the bottom each time, take a screenshot, and list the h2 elements\n",
199 |         "for i in range(5):\n",
200 |         "    print(f'Iteration number {i}\\n\\n')\n",
201 |         "    # execute_script lets us run JavaScript; here we jump to the bottom of the page\n",
202 |         "    driver.execute_script(\"window.scrollTo(0, document.body.scrollHeight);\")\n",
203 |         "    driver.save_screenshot(f'infinite_page_{i}.screenshot.png')\n",
204 |         "    h2_list = driver.find_elements(By.CSS_SELECTOR, \"h2\")\n",
205 |         "    for h2 in h2_list:\n",
206 |         "        print(h2.text)\n",
207 |         "    print('\\n\\n')"
208 | ],
209 | "metadata": {
210 | "id": "NQo_b719xl0b"
211 | },
212 | "execution_count": null,
213 | "outputs": []
214 | },
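215 |     {
216 |       "cell_type": "markdown",
217 |       "source": [
218 |         "A common variant (a sketch, not part of the original class): instead of a fixed number of iterations, scroll until `document.body.scrollHeight` stops growing."
219 |       ],
220 |       "metadata": {
221 |         "id": "scroll-until-stable-note"
222 |       }
223 |     },
224 |     {
225 |       "cell_type": "code",
226 |       "source": [
227 |         "# Sketch: scroll until the page height stops changing\n",
228 |         "import time\n",
229 |         "\n",
230 |         "last_height = driver.execute_script(\"return document.body.scrollHeight\")\n",
231 |         "while True:\n",
232 |         "    driver.execute_script(\"window.scrollTo(0, document.body.scrollHeight);\")\n",
233 |         "    time.sleep(2)  # give the page time to render new content\n",
234 |         "    new_height = driver.execute_script(\"return document.body.scrollHeight\")\n",
235 |         "    if new_height == last_height:\n",
236 |         "        break\n",
237 |         "    last_height = new_height"
238 |       ],
239 |       "metadata": {
240 |         "id": "scroll-until-stable-sketch"
241 |       },
242 |       "execution_count": null,
243 |       "outputs": []
244 |     },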
215 | {
216 | "cell_type": "markdown",
217 | "source": [
218 |         "## **Use case #2: Pages that use JavaScript to render content asynchronously**\n",
219 |         "\n",
220 |         "* There are pages where a request to the URL doesn't return what we expect, but rather a fair amount of JavaScript code (between `