├── .gitignore
├── LICENSE
├── README.md
├── diputados
    ├── README.md
    ├── diputados_basico.py
    └── diputados_scrapy.py
└── resources
    ├── diputados.png
    ├── diputados_basico_ejecucion.png
    ├── install_scrapy_pip.png
    ├── listado_diputados.png
    └── virtualenv_test.png


/.gitignore:
--------------------------------------------------------------------------------
 1 | # Byte-compiled / optimized / DLL files
 2 | __pycache__/
 3 | *.py[cod]
 4 | *$py.class
 5 | 
 6 | # C extensions
 7 | *.so
 8 | 
 9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | *.egg-info/
24 | .installed.cfg
25 | *.egg
26 | 
27 | # PyInstaller
28 | #  Usually these files are written by a python script from a template
29 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
30 | *.manifest
31 | *.spec
32 | 
33 | # Installer logs
34 | pip-log.txt
35 | pip-delete-this-directory.txt
36 | 
37 | # Unit test / coverage reports
38 | htmlcov/
39 | .tox/
40 | .coverage
41 | .coverage.*
42 | .cache
43 | nosetests.xml
44 | coverage.xml
45 | *,cover
46 | .hypothesis/
47 | 
48 | # Translations
49 | *.mo
50 | *.pot
51 | 
52 | # Django stuff:
53 | *.log
54 | local_settings.py
55 | 
56 | # Flask stuff:
57 | instance/
58 | .webassets-cache
59 | 
60 | # Scrapy stuff:
61 | .scrapy
62 | 
63 | # Sphinx documentation
64 | docs/_build/
65 | 
66 | # PyBuilder
67 | target/
68 | 
69 | # IPython Notebook
70 | .ipynb_checkpoints
71 | 
72 | # pyenv
73 | .python-version
74 | 
75 | # celery beat schedule file
76 | celerybeat-schedule
77 | 
78 | # dotenv
79 | .env
80 | 
81 | # virtualenv
82 | venv/
83 | ENV/
84 | 
85 | # Spyder project settings
86 | .spyderproject
87 | 
88 | # Rope project settings
89 | .ropeproject
90 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2016 python-madrid-learn
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # scrapping-python
 2 | Repository with resources for learning web scraping in Python.
 3 | 
 4 | # The Congress of Deputies
 5 | In the Python learning group, someone suggested scraping the deputies' pages to see whether they actually work or just slack off. Politics aside, we want to do this first exercise in the simplest way possible. As discussed, the plan is to use *python-requests* + *bs4*. More information about this exercise is in the **diputados** folder.
 6 | 
 7 | # Useful links
 8 | 
 9 | Description | Link | Comment
10 | ------------|--------|-------------
11 | Python course | [Sololearn](https://www.sololearn.com/Play/Python) | Basic, interactive Python course
12 | Style guide | [Pep8](https://www.python.org/dev/peps/pep-0008/) | Code style guide for Python. Follow it whenever possible.
13 | Python course | [Datacamp](https://hourofpython.com/una-introduccion-visual-a-python/index.html) | A visual introduction to programming in Python using turtles
14 | Python course | [PythonArgentina](http://docs.python.org.ar/tutorial/3/index.html) | Very detailed Python tutorial for version 3.5.1
15 | Python course | [Codecademy](https://www.codecademy.com/learn/python) | Python tutorial on Codecademy.
16 | Django course | [djangogirls](https://tutorial.djangogirls.org/es/) | Django tutorial
17 | Video tutorial | [JesúsConde](https://www.youtube.com/playlist?list=PLEtcGQaT56cj70Vl_C1qfUinyMELunL-N) | YouTube video tutorial that explains Python in simple terms.
18 | Resources | [AwesomePython](http://awesome-python.com/) | List of resources that can be used with Python
19 | Resources | [Aprender Python Argentina](https://argentinaenpython.com/quiero-aprender-python/) | List of resources for learning Python, put together by some very capable Argentinian developers.
20 | Resources | [GitHub Markdown guide](https://guides.github.com/features/mastering-markdown/) | Not a Python resource, but it is the guide to writing Markdown on GitHub.
21 | Blog | [RicardoMoya](http://jarroba.com/scraping-python-beautifulsoup-ejemplos/) | Interesting blog post about scraping in Python with BeautifulSoup
22 | 
23 | If you have a reference to add, send a PR.
24 | 


--------------------------------------------------------------------------------
/diputados/README.md:
--------------------------------------------------------------------------------
 1 | # Scraper for the Congress of Deputies
 2 | The idea of this exercise is to scrape the deputies' pages.
 3 | * Go to the [deputies](http://www.congreso.es/portal/page/portal/Congreso/Congreso/Diputados) page
 4 | * We want to retrieve a list of deputies ![Principal](https://github.com/python-madrid-learn/scrapping-python/blob/master/resources/diputados.png). To do so, we need to follow the *Listado completo de la composición de la cámara* link ("Full list of the composition of the chamber").
 5 | * Once we have the list of deputies, we iterate over all of them to retrieve their information ![Listado](https://github.com/python-madrid-learn/scrapping-python/blob/master/resources/listado_diputados.png).
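
The link-following step described above can be tried offline first. The sketch below is only an illustration: it uses the standard library's `html.parser` instead of `lxml`, and the markup and `example.org` URLs are made up, only loosely modelled on a listing page. It collects the links inside list items and resolves them against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found inside an <li> element."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0  # number of <li> elements currently open
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a" and self.li_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1


def extract_profile_urls(html, page_url):
    """Return absolute URLs for the links listed in `html`."""
    collector = LinkCollector()
    collector.feed(html)
    return [urljoin(page_url, href) for href in collector.hrefs]


# Hypothetical markup and URLs, not taken from the real site.
SAMPLE_HTML = """
<div class="listado_1"><ul>
  <li><a href="/diputados/ficha?id=1">Apellido Uno, Nombre</a></li>
  <li><a href="/diputados/ficha?id=2">Apellido Dos, Nombre</a></li>
</ul></div>
<a href="/diputados/lista?pagina=2">Página Siguiente</a>
"""

urls = extract_profile_urls(SAMPLE_HTML, "http://example.org/diputados/lista")
for url in urls:
    print(url)
```

The pagination link sits outside the list, so it is not collected here; the scripts in this folder handle it separately with an XPath on the link text.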
 6 | 
 7 | @juanriaza has proposed two approaches to solve this exercise:
 8 | 
 9 | * [python+lxml](https://gist.github.com/juanriaza/13117965405bff2226d55097f29cb5cc)
10 | * [python+scrapy](https://gist.github.com/juanriaza/e9213fc1d6d017c3b750234588638875)
11 | 
12 | 
13 | # Running the first solution
14 | * The solution proposed by @juanriaza has been copied into this repo. If all libraries and dependencies are installed, `python diputados_basico.py` is enough to run it.
15 | * The problems you are most likely to run into are missing dependencies or the wrong Python version:
16 |   * The example uses Python 3.x
17 |   * The example uses the `lxml` and `requests` libraries, plus `urllib` from the standard library
18 |     * `pip3 install lxml`
19 |     * `pip3 install requests`
20 | * The example is deliberately basic and runs sequentially; it could be parallelized. If everything is correct, a run should look like this:
21 | ![Principal](https://github.com/python-madrid-learn/scrapping-python/blob/master/resources/diputados_basico_ejecucion.png)
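
One possible way to parallelize the sequential fetches is a thread pool from the standard library. This is a minimal sketch, not part of the original solution: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real call such as `requests.get(url)` so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL using a pool of threads.

    Results are returned in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for something like requests.get(url); no network access here.
    return "content of " + url


pages = fetch_all(["http://example.org/a", "http://example.org/b"], fake_fetch)
print(pages)
```

Threads suit this job because the time is spent waiting on the network rather than on computation. Keep `max_workers` modest so the target site is not flooded with requests.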
22 | 
23 | # Running the second solution (using scrapy)
24 | This solution is more advanced, because scrapy has to be installed first. To do that cleanly, a `virtualenv` is recommended: installing scrapy directly on the system pulls in a lot of dependencies, which can break other libraries that depend on them. On Ubuntu, there is a very useful Ask Ubuntu thread: [Virtualenv for Python on Ubuntu](http://askubuntu.com/questions/244641/how-to-set-up-and-use-a-virtual-python-environment-in-ubuntu). In short, the steps are:
25 | 
26 | ## Installing a virtualenv
27 | 
28 | * `sudo apt-get install python3-pip` Install pip for Python 3.
29 | * `pip3 completion --bash >> ~/.bashrc` Enable tab completion for pip3.
30 | * `source ~/.bashrc` Reload the shell configuration so the completion takes effect.
31 | * `pip3 install --user virtualenvwrapper` Install *virtualenvwrapper*, which provides simple commands for managing virtual environments.
32 | * `echo "export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3" >> ~/.bashrc` Set *VIRTUALENVWRAPPER_PYTHON* in the shell so that virtualenvwrapper uses Python 3.
33 | * `echo "source ~/.local/bin/virtualenvwrapper.sh" >> ~/.bashrc` Load virtualenvwrapper in every new shell.
34 | * `export WORKON_HOME=~/.virtualenvs` Define an environment variable pointing to where the virtual environments will be created.
35 | * `mkdir $WORKON_HOME` Create the directory for the virtual environments.
36 | * `echo "export WORKON_HOME=$WORKON_HOME" >> ~/.bashrc` Add the variable to `~/.bashrc` so that it is always available.
37 | * `echo "export PIP_VIRTUALENV_BASE=$WORKON_HOME" >> ~/.bashrc` Trick to tell Python that virtual environments must be created in *$WORKON_HOME*.
38 | * `source ~/.bashrc` Reload the shell configuration to apply the changes above.
39 | 
40 | 
41 | Now let's test what we have configured:
42 | * `mkvirtualenv -p python3 test` This failed with an error. To fix it, I had to install *virtualenv* through Ubuntu's package manager:
43 |   * `sudo apt install virtualenvwrapper` Now everything seems to work:
44 | ![virtualenv](https://github.com/python-madrid-learn/scrapping-python/blob/master/resources/virtualenv_test.png)
45 | 
46 | ## Running the example with scrapy
47 | 
48 | * `pip3 install scrapy` Installs scrapy into the currently active virtual environment. You should see output similar to the following:
49 | ![Principal](https://github.com/python-madrid-learn/scrapping-python/blob/master/resources/install_scrapy_pip.png)
50 | * `scrapy runspider diputados_scrapy.py` Runs the spider and prints the output to the console.
51 | * `scrapy runspider diputados_scrapy.py -o diputados.csv` Same run, but writes the results to a CSV file.
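
To get a feel for what the CSV output might contain, the sketch below writes items of the same shape as the ones the spider yields (`nombre` and `url`) using the standard library's `csv` module. It is an approximation for illustration only: the sample items are made up, and Scrapy's own exporter may format the file differently.

```python
import csv
import io


def write_diputados_csv(diputados, output):
    """Write dicts with 'nombre' and 'url' keys as CSV rows with a header."""
    writer = csv.DictWriter(output, fieldnames=["nombre", "url"])
    writer.writeheader()
    writer.writerows(diputados)


# Hypothetical items; real ones would come from the spider.
sample = [
    {"nombre": "Apellido Uno, Nombre", "url": "http://example.org/ficha?id=1"},
    {"nombre": "Apellido Dos, Nombre", "url": "http://example.org/ficha?id=2"},
]

buffer = io.StringIO()
write_diputados_csv(sample, buffer)
print(buffer.getvalue())
```

Names that contain a comma are quoted automatically by the `csv` module, so they stay in a single column.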
52 | 
53 | Possible dependency problems:
54 | * `sudo apt-get install libssl-dev python3-dev` Install the development packages for Python and for the cryptography libraries.
55 | 


--------------------------------------------------------------------------------
/diputados/diputados_basico.py:
--------------------------------------------------------------------------------
 1 | import requests
 2 | 
 3 | from urllib.parse import urljoin
 4 | from lxml.html import fromstring
 5 | 
 6 | 
 7 | def parse_diputado(response):
 8 |     tree = fromstring(response.content)
 9 |     nombre = tree.xpath('//div[@class="nombre_dip"]/text()')[0]
10 |     print(nombre)
11 | 
12 | 
13 | def parse_lista_diputados(response):
14 |     tree = fromstring(response.content)
15 | 
16 |     # list of deputies: fetch and parse each profile page
17 |     diputados = tree.xpath('//div[@class="listado_1"]/ul/li/a/@href')
18 |     for diputado in diputados:
19 |         diputado_url = urljoin(response.url, diputado)
20 |         # do not reassign `response`: it must keep pointing at the list page
21 |         parse_diputado(requests.get(diputado_url))
22 | 
23 |     # next page: follow the "Página Siguiente" link, if there is one
24 |     pagina_siguiente = tree.xpath('//a[contains(., "Página Siguiente")]/@href')
25 |     if pagina_siguiente:
26 |         # resolve against the current page in case the href is relative
27 |         pagina_siguiente_url = urljoin(response.url, pagina_siguiente[0])
28 |         parse_lista_diputados(requests.get(pagina_siguiente_url))
29 | 
30 | # entry point: the main page links to the full list of deputies
31 | response = requests.get(
32 |     'http://www.congreso.es/portal/page/portal/Congreso/Congreso/Diputados')
33 | tree = fromstring(response.content)
34 | lista_diputados_url = urljoin(
35 |     response.url, tree.xpath('//div[@id="btn_mas"]/a/@href')[0])
36 | parse_lista_diputados(requests.get(lista_diputados_url))
36 | 


--------------------------------------------------------------------------------
/diputados/diputados_scrapy.py:
--------------------------------------------------------------------------------
 1 | import scrapy
 2 | 
 3 | 
 4 | class DiputadosSpider(scrapy.Spider):
 5 |     name = 'diputados'
 6 |     start_urls = ['http://www.congreso.es/portal/page/portal/Congreso/Congreso/Diputados']
 7 | 
 8 |     def parse(self, response):
 9 |         lista_diputados_url = response.xpath(
10 |             '//div[@id="btn_mas"]/a/@href').extract_first()
11 |         request = scrapy.Request(
12 |             lista_diputados_url,
13 |             callback=self.parse_lista_diputados)
14 |         yield request
15 | 
16 |     def parse_lista_diputados(self, response):
17 |         # list of deputies
18 |         diputados = response.xpath(
19 |             '//div[@class="listado_1"]/ul/li/a/@href').extract()
20 |         for diputado in diputados:
21 |             request = scrapy.Request(
22 |                 response.urljoin(diputado),
23 |                 callback=self.parse_diputado)
24 |             yield request
25 | 
26 |         # next page
27 |         pagina_siguiente = response.xpath(
28 |             '//a[contains(., "Página Siguiente")]/@href').extract_first()
29 |         if pagina_siguiente:
30 |             request = scrapy.Request(
31 |                 response.urljoin(pagina_siguiente),
32 |                 callback=self.parse_lista_diputados)
33 |             yield request
34 | 
35 |     def parse_diputado(self, response):
36 |         nombre = response.xpath(
37 |             '//div[@class="nombre_dip"]/text()').extract_first()
38 |         diputado = {
39 |             'nombre': nombre,
40 |             'url': response.url}
41 |         yield diputado
42 | 


--------------------------------------------------------------------------------
/resources/diputados.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/python-madrid/scraping-python/e2111ff3792fc83047b0e766c2c450677eef86f8/resources/diputados.png


--------------------------------------------------------------------------------
/resources/diputados_basico_ejecucion.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/python-madrid/scraping-python/e2111ff3792fc83047b0e766c2c450677eef86f8/resources/diputados_basico_ejecucion.png


--------------------------------------------------------------------------------
/resources/install_scrapy_pip.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/python-madrid/scraping-python/e2111ff3792fc83047b0e766c2c450677eef86f8/resources/install_scrapy_pip.png


--------------------------------------------------------------------------------
/resources/listado_diputados.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/python-madrid/scraping-python/e2111ff3792fc83047b0e766c2c450677eef86f8/resources/listado_diputados.png


--------------------------------------------------------------------------------
/resources/virtualenv_test.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/python-madrid/scraping-python/e2111ff3792fc83047b0e766c2c450677eef86f8/resources/virtualenv_test.png


--------------------------------------------------------------------------------