├── 00 - Slides
    ├── 01 - Introdução ao HTML.pdf
    └── 02 - Módulo Requests.pdf
├── 01 - Introdução ao HTML
    ├── index.html
    └── web-scraping-com-python.png
├── 02 - Módulo Requests
    └── requisicoes.py
├── 03 - BeautifulSoup I
    └── news.py
├── 04 - BeautifulSoup II
    ├── news.py
    └── noticias.xlsx
├── 05 - Exemplo - Mercado Livre
    └── mercado_livre.py
├── 06 - Selenium
    └── web_scraping_selenium.py
├── 07 - Selenium - airbnb
    ├── scraping.py
    └── tempCodeRunnerFile.py
├── Imagens
    └── web-scraping-com-python.png
└── README.md


/00 - Slides/01 - Introdução ao HTML.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/walissonsilva/web-scraping-python/c53d1f7369b007f47e92256e1c25b735f4342c36/00 - Slides/01 - Introdução ao HTML.pdf
--------------------------------------------------------------------------------
/00 - Slides/02 - Módulo Requests.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/walissonsilva/web-scraping-python/c53d1f7369b007f47e92256e1c25b735f4342c36/00 - Slides/02 - Módulo Requests.pdf
--------------------------------------------------------------------------------
/01 - Introdução ao HTML/index.html:
--------------------------------------------------------------------------------
My first site

Heading 1
Heading 2
Heading 3
Heading 4
Heading 5
Heading 6

This is a paragraph.

Click here to visit my personal site.

First item
Second item
Third item

First item
Second item
Third item

Cover of the Web Scraping Series
--------------------------------------------------------------------------------
/01 - Introdução ao HTML/web-scraping-com-python.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/walissonsilva/web-scraping-python/c53d1f7369b007f47e92256e1c25b735f4342c36/01 - Introdução ao HTML/web-scraping-com-python.png
--------------------------------------------------------------------------------
/02 - Módulo Requests/requisicoes.py:
--------------------------------------------------------------------------------
import requests

# Send a GET request to the target page
response = requests.get('https://www.walissonsilva.com/')

print('Status code:', response.status_code)
print('↓↓ Header ↓↓')
print(response.headers)

print('\n↓↓ Content ↓↓')
print(response.content)  # raw bytes of the response body
--------------------------------------------------------------------------------
/03 - BeautifulSoup I/news.py:
--------------------------------------------------------------------------------
import requests
from bs4 import BeautifulSoup

response = requests.get('https://g1.globo.com/')

content = response.content

site = BeautifulSoup(content, 'html.parser')

# HTML of the first news item on the page
noticia = site.find('div', attrs={'class': 'feed-post-body'})

# Title
titulo = noticia.find('a', attrs={'class': 'feed-post-link'})

print(titulo.text)

# Subtitle: div class="feed-post-body-resumo"
subtitulo = noticia.find('div', attrs={'class': 'feed-post-body-resumo'})

# Not every post has a subtitle; find() returns None when it is missing
if subtitulo:
    print(subtitulo.text)
--------------------------------------------------------------------------------
/04 - BeautifulSoup II/news.py:
--------------------------------------------------------------------------------
import requests
from bs4 import BeautifulSoup
import pandas as pd

lista_noticias = []

response = requests.get('https://g1.globo.com/')

content = response.content

site = BeautifulSoup(content, 'html.parser')

# HTML of every news item (find_all is the current name of findAll)
noticias = site.find_all('div', attrs={'class': 'feed-post-body'})

for noticia in noticias:
    # Title
    titulo = noticia.find('a', attrs={'class': 'feed-post-link'})

    # print(titulo.text)
    # print(titulo['href'])  # link to the news item

    # Subtitle: div class="feed-post-body-resumo"
    subtitulo = noticia.find('div', attrs={'class': 'feed-post-body-resumo'})

    if subtitulo:
        # print(subtitulo.text)
        lista_noticias.append([titulo.text, subtitulo.text, titulo['href']])
    else:
        # No subtitle: store an empty string so every row has three columns
        lista_noticias.append([titulo.text, '', titulo['href']])


news = pd.DataFrame(lista_noticias, columns=['Título', 'Subtítulo', 'Link'])

# Writing .xlsx files requires an Excel engine such as openpyxl
news.to_excel('noticias.xlsx', index=False)

# print(news)
--------------------------------------------------------------------------------
/04 - BeautifulSoup II/noticias.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/walissonsilva/web-scraping-python/c53d1f7369b007f47e92256e1c25b735f4342c36/04 - BeautifulSoup II/noticias.xlsx
--------------------------------------------------------------------------------
/05 - Exemplo - Mercado Livre/mercado_livre.py:
--------------------------------------------------------------------------------

# > EXAMPLE
# - Fetching products from Mercado Livre based on a search entered by the user

import requests
from bs4 import BeautifulSoup

url_base = 'https://lista.mercadolivre.com.br/'

produto_nome = input('Which product are you looking for? ')

response = requests.get(url_base + produto_nome)

site = BeautifulSoup(response.text, 'html.parser')

# A class value containing spaces is matched against the exact class attribute string
produtos = site.find_all('div', attrs={'class': 'andes-card andes-card--flat andes-card--default ui-search-result ui-search-result--core andes-card--padding-default'})

for produto in produtos:
    titulo = produto.find('h2', attrs={'class': 'ui-search-item__title'})
    link = produto.find('a', attrs={'class': 'ui-search-link'})

    # The price is split into two elements: whole reais and (optionally) cents
    real = produto.find('span', attrs={'class': 'price-tag-fraction'})
    centavos = produto.find('span', attrs={'class': 'price-tag-cents'})

    print(produto.prettify())  # debug aid: dump the product's HTML
    print('Product title:', titulo.text)
    print('Product link:', link['href'])

    if centavos:
        print('Product price: R$', real.text + ',' + centavos.text)
    else:
        print('Product price: R$', real.text)

    print('\n\n')
    break  # stops after the first product; remove this line to list every result
--------------------------------------------------------------------------------
/06 - Selenium/web_scraping_selenium.py:
--------------------------------------------------------------------------------
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

navegador = webdriver.Chrome()

navegador.get('https://www.walissonsilva.com/blog')

sleep(3)  # crude fixed wait for the page to load

# The find_element_by_* helpers were removed in Selenium 4; use find_element(By...)
elemento = navegador.find_element(By.TAG_NAME, 'input')

elemento.send_keys('data')
--------------------------------------------------------------------------------
/07 - Selenium - airbnb/scraping.py:
--------------------------------------------------------------------------------
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from time import sleep

options = Options()
# options.add_argument('--headless')
options.add_argument('window-size=400,800')

navegador = webdriver.Chrome(options=options)

navegador.get('https://www.airbnb.com')

sleep(2)

# Type the destination into the first input on the page and submit the form
input_place = navegador.find_element(By.TAG_NAME, 'input')
input_place.send_keys('São Paulo')
input_place.submit()

sleep(0.5)

button_stay = navegador.find_element(By.CSS_SELECTOR, 'button > img')
button_stay.click()

sleep(0.5)

# The last button on the page advances to the next step
nextButton = navegador.find_elements(By.TAG_NAME, 'button')[-1]
nextButton.click()

sleep(0.5)

# Set two adults: click the "+" button (identified by its SVG path) twice
adultButton = navegador.find_elements(By.CSS_SELECTOR, 'button > span > svg > path[d="m2 16h28m-14-14v28"]')[0]
adultButton.click()
sleep(1)
adultButton.click()
sleep(1)


searchButton = navegador.find_elements(By.TAG_NAME, 'button')[-1]
searchButton.click()

sleep(4)

# Hand the rendered page over to BeautifulSoup for parsing
page_content = navegador.page_source

site = BeautifulSoup(page_content, 'html.parser')

print(site.prettify())
--------------------------------------------------------------------------------
/07 - Selenium - airbnb/tempCodeRunnerFile.py:
--------------------------------------------------------------------------------
adultButton.click()
# sleep(1)
# adultButton.click()
# sleep(1)
--------------------------------------------------------------------------------
/Imagens/web-scraping-com-python.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/walissonsilva/web-scraping-python/c53d1f7369b007f47e92256e1c25b735f4342c36/Imagens/web-scraping-com-python.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Web Scraping with Python

This repository contains files (slides, code, examples) developed during a video series on my YouTube channel, which you can watch [at this link](https://www.youtube.com/watch?v=42sTntMEn6o&list=PLg3ZPsW_sghSkRacynznQeEs-vminyTQk).

![Cover image](Imagens/web-scraping-com-python.png)

## Contents

The video series covers the following topics:

1. Introduction to HTML
2. Inspecting websites
3. The HTTP protocol and requests
4. The `requests` module
5. The `BeautifulSoup` module
6. Examples
7. Project
--------------------------------------------------------------------------------
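As a small offline companion to the HTML and parsing topics listed above, here is a minimal sketch of pulling the text out of every occurrence of one tag. It swaps BeautifulSoup for Python's standard-library `html.parser`, so it runs with no installation or network access. `SAMPLE_HTML`, `TagTextCollector`, and `find_all_text` are hypothetical names made up for this illustration, and the sample page is only loosely modeled on `index.html`. It does not handle nested occurrences of the same tag.

```python
from html.parser import HTMLParser

# Hypothetical sample page, loosely modeled on the repository's index.html.
SAMPLE_HTML = """
<html>
  <body>
    <h1>Heading 1</h1>
    <p>This is a paragraph.</p>
    <ul>
      <li>First item</li>
      <li>Second item</li>
      <li>Third item</li>
    </ul>
  </body>
</html>
"""


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0    # greater than 0 while inside the target tag
        self.texts = []   # one entry per occurrence of the tag

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that appears while the target tag is open
        if self.depth > 0:
            self.texts[-1] += data


def find_all_text(html, tag):
    """Return the stripped text of every `tag` element in `html`."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    collector.close()
    return [text.strip() for text in collector.texts]


items = find_all_text(SAMPLE_HTML, "li")
print(items)  # ['First item', 'Second item', 'Third item']
```

The scripts in this repository do the same job with `site.find(...)` and `site.findAll(...)`, which also let you filter by attributes such as `class`. The standard-library version is used here only to keep the sketch dependency-free.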