├── web_scraping_maestría.pdf
├── LICENSE
├── README.md
└── Selenium_2023.ipynb

/web_scraping_maestría.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/GEJ1/web-scraping-python/HEAD/web_scraping_maestría.pdf
--------------------------------------------------------------------------------

/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2021 Gustavo Juantorena

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# Web scraping with Python ⛏️
Basic resources for getting started with web scraping using Python.


Class taught by Gustavo Juantorena ([GitHub](https://github.com/GEJ1) / [LinkedIn](https://www.linkedin.com/in/gustavo-juantorena/)) as a guest lecturer for the Text Mining course of the [Maestría en Explotación de Datos y Descubrimiento de Conocimiento](http://datamining.dc.uba.ar/datamining/) at the Universidad de Buenos Aires.
The content collected here goes beyond what was covered in class, but it should serve as a useful reference for digging deeper into the topics.
* [Class slides](https://docs.google.com/presentation/d/10-lc2Y6kMVHp7FO9v8ReZdY1MPwUlgxWIsSDePY0afg/edit?usp=sharing)
* [First class notebook (APIs + Beautiful Soup)](https://github.com/GEJ1/web-scraping-python/blob/main/web_scraping_maestria.ipynb)
* [Second class notebook (Selenium)](https://github.com/GEJ1/web-scraping-python/blob/main/Selenium_2023.ipynb)

**Update 2023**: I created a course for [freeCodeCamp](https://www.freecodecamp.org/espanol/news/aprende-web-scraping-con-python-y-beautiful-soup-en-espanol-curso-desde-cero/) that covers most of the content of this class; you can [find it here](https://github.com/GEJ1/web_scraping_freecodecamp).

# Resources

* [General web](#general-web)
* [General web scraping](#general-web-scraping)
* [Beautiful Soup](#beautiful-soup)
* [Selenium](#selenium)
* [Scrapy](#scrapy)
* [Books](#books)

## General web
Almost all of this comes from https://developer.mozilla.org/es/, one of the best places to look up references on web technologies.

* [List of HTML elements](https://developer.mozilla.org/es/docs/Web/HTML/Element)
* [CSS material and list of properties](https://developer.mozilla.org/es/docs/Web/CSS)
* [JavaScript material](https://developer.mozilla.org/es/docs/Web/JavaScript)
* [Overview of the HTTP protocol](https://developer.mozilla.org/es/docs/Web/HTTP/Overview)
* [HTTP status codes](https://gabicuesta.blogspot.com/2019/01/http-status-codes.html) (see the sketch after this list)
* [Introduction to the DOM](https://developer.mozilla.org/es/docs/Web/API/Document_Object_Model/Introduction)
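
For a quick, concrete look at the HTTP concepts above, here is a minimal sketch (assuming the `requests` library is installed; the URL is just a placeholder example):

```python
# Minimal sketch: make a GET request and inspect the response
# (assumes `pip install requests`; example.com is a placeholder URL)
import requests

response = requests.get("https://example.com")
print(response.status_code)                   # e.g. 200 = OK, 404 = Not Found
print(response.headers.get("Content-Type"))   # media type of the response body
```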
## General web scraping

* [Contents of the Scraping con Python course](https://github.com/institutohumai/cursos-python/tree/master/Scraping)
  * Course taught by [Mathias Gatti](https://github.com/mathigatti), [Matías Grinberg](https://github.com/Cerebrock) and [Gustavo Juantorena](https://github.com/gej1) at [humai](https://www.ihum.ai/).
  * Check out [the videos](https://www.youtube.com/playlist?list=PLISuMnTdVU-xOHf3jEtiK1B_g5HFgXCb-) if you want to go deeper into the topics covered!
* [Classes 18 and 19 of the Laboratorio de datos course (FCEyN, UBA, first term of 2021)](http://materias.df.uba.ar/lda2021c1/171-2/)
  * Class videos, slides and notebooks on interacting with APIs (class 18) and web scraping (class 19).
  * Instructors: Enzo Tagliazucchi, Sebastián Pinto, Tomás Cicchini and Ariel Berardino.

## Beautiful Soup

* [Official documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) (a minimal usage sketch follows this list)
* [Web Scraping con Python - Curso con Beautiful Soup](https://youtu.be/yKi9-BfbfzQ?si=ubomrRwjz5ziQrKq)
* [Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/)
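
For a first taste before diving into these resources, here is a minimal sketch (assuming `beautifulsoup4` and `requests` are installed; the URL is just a placeholder example):

```python
# Minimal sketch: parse a page and extract the title and all links
# (assumes `pip install beautifulsoup4 requests`; example.com is a placeholder URL)
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")

print(soup.title.text)               # text inside the <title> tag
for link in soup.find_all("a"):      # every <a> element on the page
    print(link.get("href"), link.get_text(strip=True))
```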
## Selenium

* [Unofficial, but useful, documentation](https://selenium-python.readthedocs.io/)
* [Selenium FULL COURSE - Learn Selenium by creating a bot in 3 hours [2021]](https://youtu.be/6gxhcvrf2Jk)
* [Web Scraping using Selenium and Python](https://www.scrapingbee.com/blog/selenium-python/)

## Scrapy

* [Official documentation](https://docs.scrapy.org/en/latest/)
* [Intro To Web Crawlers & Scraping With Scrapy](https://youtu.be/ALizgnSFTwQ)

## Books

* [Web Scraping with Python, 2nd Edition](https://www.oreilly.com/library/view/web-scraping-with/9781491985564/)
  * [Code accompanying the book](https://github.com/REMitchell/python-scraping)
* [Hands-On Web Scraping with Python](https://www.amazon.com/Hands-Web-Scraping-Python-operations-ebook/dp/B07VFFYPGK)
--------------------------------------------------------------------------------

/Selenium_2023.ipynb:
--------------------------------------------------------------------------------
{
 "nbformat": 4,
 "nbformat_minor": 0,
 "metadata": {
  "colab": {
   "provenance": [],
   "authorship_tag": "ABX9TyO6mh0aVFJlNLXBMgAL8KKs",
   "include_colab_link": true
  },
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "view-in-github",
    "colab_type": "text"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/GEJ1/web-scraping-python/blob/main/Selenium_2023.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "source": [
    "# **Advanced web scraping with Selenium**\n",
    "\n",
    "* ### Selenium lets us browse the web with a browser that has no graphical interface, allowing us to click, scroll, and so on.\n",
    "\n",
    "* ### Using Selenium inside Google Colab is not the most common setup, but it is useful for teaching purposes. How well it scales is unclear.\n",
    "\n",
    "* Docs: https://selenium-python.readthedocs.io/\n",
    "\n",
    "\n",
    "\n",
    "## We have to install and configure it in Colab (locally it is easier)"
   ],
   "metadata": {
    "id": "XPDa5GqBHOUe"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "fFqff2VjofiQ"
   },
   "outputs": [],
   "source": [
    "%%shell\n",
    "# Source: https://github.com/googlecolab/colabtools/issues/3347\n",
    "sudo apt -y update\n",
    "sudo apt install -y wget curl unzip\n",
    "# libu2f-udev is a dependency of the Google Chrome package\n",
    "wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb\n",
    "dpkg -i libu2f-udev_1.1.4-1_all.deb\n",
    "# Install Google Chrome\n",
    "wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb\n",
    "dpkg -i google-chrome-stable_current_amd64.deb\n",
    "# Download chromedriver into /tmp; its version should match the installed Chrome version\n",
    "wget -N https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/116.0.5845.96/linux64/chromedriver-linux64.zip -P /tmp\n",
    "unzip -o /tmp/chromedriver-linux64.zip -d /tmp/\n",
    "chmod +x /tmp/chromedriver-linux64/chromedriver\n",
    "mv /tmp/chromedriver-linux64/chromedriver /usr/local/bin/chromedriver"
   ]
  },
  {
   "cell_type": "code",
   "source": [
    "!pip install selenium -q"
   ],
   "metadata": {
    "id": "rHwnvqpypFu9"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    "from selenium import webdriver\n",
    "from selenium.webdriver.chrome.options import Options\n",
    "from selenium.webdriver.common.by import By\n",
    "from selenium.webdriver.chrome.service import Service"
   ],
   "metadata": {
    "id": "1_7WNZZF3KWM"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    "options = webdriver.ChromeOptions()\n",
    "options.add_argument(\"--headless\")\n",
    "options.add_argument(\"--disable-dev-shm-usage\")\n",
    "options.add_argument(\"--no-sandbox\")\n",
    "\n",
    "# Instantiate the driver\n",
    "wd = webdriver.Chrome(options=options)\n",
    "\n",
    "# Implicit wait of up to 20 seconds when locating elements (if found sooner, it does not wait)\n",
    "wd.implicitly_wait(20)\n",
    "\n",
    "# Request the URL\n",
    "url = \"https://www.wikipedia.org\"\n",
    "# HTTP request\n",
    "wd.get(url)\n",
    "h1 = wd.find_element(By.CSS_SELECTOR, \"h1\")\n",
    "print(f'h1 extracted from Wikipedia: \\n\\n{h1.text}')"
   ],
   "metadata": {
    "id": "_YvExg2YomO2"
   },
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "code",
   "source": [
    "search_input = wd.find_element(By.ID, 'searchInput')\n",
    "\n",
    "# Send the text we want typed into the form\n",
    "search_input.send_keys('Natural language processing')\n",
    "wd.save_screenshot(\"1.png\")\n",
    "\n",
    "search_button = wd.find_element(By.XPATH, '//*[@id=\"search-form\"]/fieldset/button')\n",
    "search_button.click()\n",
    "wd.save_screenshot(\"2.png\")"
   ],
   "metadata": {
    "id": "TFURhZIE-1B_"
   },
   "execution_count": null,
   "outputs": []
  },
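  {
   "cell_type": "markdown",
   "source": [
    "Besides the implicit wait configured above, Selenium also supports *explicit* waits for a specific condition. The next cell is a minimal sketch: it blocks until the element with id `firstHeading` (the id Wikipedia uses for the article title, also used in the cell below) is present, or raises `TimeoutException` after 10 seconds."
   ],
   "metadata": {
    "id": "explicit_wait_md"
   }
  },
  {
   "cell_type": "code",
   "source": [
    "# Minimal sketch of an explicit wait: block until the element with id\n",
    "# 'firstHeading' is present in the DOM, or fail after 10 seconds\n",
    "from selenium.webdriver.support.ui import WebDriverWait\n",
    "from selenium.webdriver.support import expected_conditions as EC\n",
    "\n",
    "WebDriverWait(wd, 10).until(\n",
    "    EC.presence_of_element_located((By.ID, 'firstHeading'))\n",
    ")"
   ],
   "metadata": {
    "id": "explicit_wait_code"
   },
   "execution_count": null,
   "outputs": []
  },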
"execution_count": null, 141 | "outputs": [] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "source": [ 146 | "# Imprimo el título de la página a la que se accedió\n", 147 | "heading = wd.find_element(By.ID,\"firstHeading\")\n", 148 | "body_content = wd.find_element(By.ID, \"bodyContent\")\n", 149 | "print(f'Heading: \\n{heading.text}')\n", 150 | "print(f'Content: \\n{body_content.text}')" 151 | ], 152 | "metadata": { 153 | "id": "_XTVLqhg_G2p" 154 | }, 155 | "execution_count": null, 156 | "outputs": [] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "source": [ 161 | "## **Caso de uso Nº 1: Scroll infinito**\n", 162 | "\n", 163 | "Existen páginas que no muestran todo el contenido a menos que vayamos hasta abajo (*scroll*). Esta acción dispara un evento de javascript que renderiza más HTML y por lo tanto vemos contenido nuevo.\n", 164 | "\n", 165 | "Podemos emular la acción de mediante Selenium." 166 | ], 167 | "metadata": { 168 | "id": "YctCEElGxjyd" 169 | } 170 | }, 171 | { 172 | "cell_type": "code", 173 | "source": [ 174 | "# Configuramos el web driver\n", 175 | "driver = webdriver.Chrome(options=options)\n", 176 | "\n", 177 | "# Hacemos el pedido a la URL\n", 178 | "url = \"https://infinite-scroll.com/demo/full-page/\"\n", 179 | "driver.get(url)\n", 180 | "\n", 181 | "# Busco todos los h2 (notar la sutileza del metodo elements en plural)\n", 182 | "h2_list = driver.find_elements(By.CSS_SELECTOR, \"h2\")\n", 183 | "for h2 in h2_list:\n", 184 | " print(h2.text)" 185 | ], 186 | "metadata": { 187 | "id": "eaf8YGzPxlZh" 188 | }, 189 | "execution_count": null, 190 | "outputs": [] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "source": [ 195 | "# Tomo un screenshot\n", 196 | "driver.save_screenshot(f'infinite_page.screenshot.png')\n", 197 | "\n", 198 | "# Hago lo mismo que antes pero iterando 5 veces y pidiendole que scrollee hasta el final cada vez y saque un screenshot\n", 199 | "for i in range(5):\n", 200 | " print(f'Iteracion numero {i}\\n\\n')\n", 201 | " # el metodo execute_script me permite ejecutar codigo de javascript, en este caso para ir al final de la pagina\n", 202 | " driver.execute_script(\"window.scrollTo(0, document.body.scrollHeight);\")\n", 203 | " driver.save_screenshot(f'infinite_page_{i}.screenshot.png')\n", 204 | " h2_list = driver.find_elements(By.CSS_SELECTOR, \"h2\")\n", 205 | " for h2 in h2_list:\n", 206 | " print(h2.text)\n", 207 | " print('\\n\\n')" 208 | ], 209 | "metadata": { 210 | "id": "NQo_b719xl0b" 211 | }, 212 | "execution_count": null, 213 | "outputs": [] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "source": [ 218 | "## **Caso de uso Nº 2: Páginas que usan JavaScript para mostrar el contenido de manera asíncrona**\n", 219 | "\n", 220 | "* Hay páginas que cuando hacemos un request a su URL no nos devuelve lo que esperamos. Sino bastante código de JavaScript (entre etiquetas `