` com a classe `tags` que por sua vez contém elementos `` com a classe `tag` que representa respectivamente cada **tag** única 16 | 17 | Esse é o primeiro passo em nossa jornada, identificar a localização dos dados que desejamos na página web. Agora vejamos como podemos utilizar o framework Scrapy para fazermos a requisição da página e a extração dos dados. 18 | 19 | Caso queira saber mais detalhes sobre o framework: https://scrapy.org/ 20 | 21 | Para fazermos a instalação, podemos usar `pip`: 22 | 23 | ``` 24 | pip install scrapy 25 | ``` 26 | 27 | ## Scrapy Shell 28 | 29 | Scrapy Shell é uma ferramenta que nos possibilita experimentar as diferentes possibilidades de como podemos extrair os dados que desejamos 30 | 31 | ``` 32 | scrapy shell http://quotes.toscrape.com/random 33 | ``` 34 | 35 | Scrapy Shell irá fazer o download da página da URL que passamos e nos fornecerá um objeto *response* que nós podemos utilizar para extrair dados da página. Por exemplo: 36 | 37 | ```python 38 | >>> print(response.text) # Nos fornece todo o conteúdo da página 39 | ``` 40 | 41 | Podemos utilizar o método **response.css()** para selecionarmos a parte da página que desejamos, passamos um **seletor CSS** como parâmetro e ele retornará os elementos correspondentes com o seletor 42 | 43 | ```python 44 | >>> response.css('small.author') 45 | ``` 46 | 47 | Como você pode perceber, o método retornou uma lista de objetos contendo os objetos selecionados. Podemos utilizar o método **extract()** para obtermos os dados HTML que realmente queremos 48 | 49 | ```python 50 | >>> response.css('small.author').extract() 51 | ``` 52 | 53 | Como podemos ver, recebemos uma lista de *strings* como resultado, entretando é necessário mais dois passos a fazermos, primeiramente queremos nos livrar das tags HTML envolvendo o nome do autor, para isso devemos mudar um pouco o nosso **seletor** 54 | 55 | ```python 56 | >>> response.css('small.author::text').extract() 57 | ``` 58 | 59 | Veja que agora ele nos retorna uma lista com o **nome do autor**, para obtermos uma *string* é muito simples, basta acessarmos o primeiro elemento da lista ou executar o método **extract_first()** 60 | 61 | ```python 62 | >>> response.css('small.author::text')[0].extract() 63 | ``` 64 | 65 | ```python 66 | >>> response.css('small.author::text').extract_first() 67 | ``` 68 | 69 | Agora que conseguimos obter o **nome do autor** com sucesso, voltamos à página web e vamos buscar o **texto** do *quote* 70 | 71 | ```python 72 | >>> response.css('span.text::text').extract_first() 73 | ``` 74 | 75 | Agora vamos em busca das **tags** 76 | 77 | ```python 78 | >>> response.css('a.tag::text').extract_first() 79 | ``` 80 | 81 | ## Criando Nosso Primeiro Spider 82 | 83 | Em nossa linha de comando, vamos utilizar o comando `scrapy genspider quotes toscrape.com` para gerarmos o esqueleto de nosso Spider. Veja que o primeiro parâmetro é o nome do nosso Spider e o segundo é o domínio do website. 84 | 85 | Nos será retornado e um arquivo `quotes.py` será criado 86 | 87 | ``` 88 | Created spider 'quotes' using template 'basic' 89 | ``` 90 | 91 | Executando nosso spider 92 | 93 | ```python 94 | >>> scrapy runspider quotes.py 95 | ``` 96 | 97 | Podemos salvar nossos resultados em um arquivo, vejamos um exemplo utilizando a extensão `.json` 98 | 99 | ```python 100 | >>> scrapy runspider quotes.py -o items.json 101 | ``` 102 | 103 | ## Extraindo Múltiplos Items de uma Página 104 | 105 | Vamos agora utilizar a página http://quotes.toscrape.com/ que lista 10 *quotes* de uma vez 106 | 107 | Começaremos executando `scrapy shell 'http://quotes.toscrape.com/'` 108 | 109 | Vamos aplicar a mesma lógica de seleção que utilizamos antes para selecionar o **autor** 110 | 111 | ```python 112 | >>> response.css('small.author::text').extract_first() 113 | ``` 114 | 115 | Veja que nos retornou apenas um autor, para obtermos todos devemos usar o método **extract()** 116 | 117 | ```python 118 | >>> response.css('small.author::text').extract() 119 | ``` 120 | 121 | Vamos selecionar agora os *quotes* 122 | 123 | ```python 124 | >>> response.css('span.text::text').extract() 125 | ``` 126 | 127 | E as **tags** 128 | 129 | ```python 130 | >>> response.css('a.tag::text').extract() 131 | ``` 132 | 133 | Perceba que há um problema, recebemos uma gigante lista de **tags** e não sabemos com qual texto essas **tags** correspondem. Para isso teremos que extrair os dados de um *quote* por vez 134 | 135 | Observando a estrutura da página, podemos percerbar que cada *quote* está contido em uma `

` com a classe `quote`. Vamos tentar trabalhar com esses elementos em nossa shell 136 | 137 | Para selecionarmos todos os *quotes* podemos utilizar o comando 138 | 139 | ```python 140 | >>> response.css('div.quote') 141 | ``` 142 | 143 | Nos será retornado uma lista de objetos seletores, vamos selecionar o primeiro *quote* e armazenar em uma variável chamada **quote** 144 | 145 | ```python 146 | >>> quote = response.css('div.quote')[0] 147 | >>> print(quote) 148 | # 149 | ``` 150 | 151 | Agora temos o objeto seletor para o primeiro *quote*, a partir deles vamos tentar selecionar o **autor**, o **texto** e as **tags** 152 | 153 | ```python 154 | >>> quote.css('small.author::text').extract_first() 155 | >>> quote.css('span.text::text').extract_first() 156 | >>> quote.css('a.tag::text').extract_first() 157 | ``` 158 | 159 | Agora que vimos como extrair de cada elemento, vamos extender essa funcionalidade para buscarmos todos eles, para essa tarefa utilizaremos o **for** loop 160 | 161 | ```python 162 | for quote in response.css('div.quote'): 163 | item = { 164 | 'author_name': quote.css('small.author::text').extract_first(), 165 | 'text': quote.css('span.text::text').extract_first(), 166 | 'tags': quote.css('a.tag::text').extract(), 167 | } 168 | print(item) 169 | ``` 170 | 171 | Veja que nos será todos os *quotes* da primeira página. 172 | 173 | ## Seguindo os Links de Paginação com Scrapy 174 | 175 | Vamos iniciar novamente nosso `scrapy shell 'http://quotes.toscrape.com/'` 176 | 177 | Agora vamos construir nosso seletor, estamos buscando por todos os items contidos no elemento `

Web Scraping

\n", 102 | "

Estrutura Básica HTML

\n", 103 | "

\n", 104 | "

Aprendendo Web Scraping com Python,\n", 105 | "\tRequests-HTML,\n", 106 | "\tBeautiful Soup e\n", 107 | "\tScrapy\n", 108 | "

\n", 109 | "

“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein

\n", 110 | "

Linguagens de Programação

\n", 111 | "

Python
Perl
PHP

\n", 116 | "

Grandes Matemáticos

\n", 117 | "\n", 118 | "\n", 119 | "\n", 120 | "\n", 121 | "\n", 122 | "\n", 123 | "\n", 124 | "\n", 125 | "\n", 126 | "\n", 127 | "\n", 128 | "\n", 129 | "\n", 130 | "\n", 131 | "\n", 132 | "\n", 133 | "\n", 134 | "\n", 135 | "\n", 136 | "\n", 137 | "\n", 138 | "

Nome	Sobrenome	Email
Alan	Turing	alan@turing.com
John	von Neumann	john@voneumann.com
Blaise	Pascal	blaise@pascal.com

\n", 139 | "\n", 140 | "\n", 141 | "\n" 142 | ] 143 | } 144 | ], 145 | "source": [ 146 | "soup = BeautifulSoup(html, 'html.parser')\n", 147 | "print(soup)" 148 | ] 149 | }, 150 | { 151 | "cell_type": "markdown", 152 | "metadata": {}, 153 | "source": [ 154 | "Podemos usar o método `prettify()` para imprimir o conteúdo de nosso documento de uma forma legível" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 6, 160 | "metadata": {}, 161 | "outputs": [ 162 | { 163 | "data": { 164 | "text/plain": [ 165 | "'\\n\\n \\n \\n \\n Web Scraping Tutorial com Python\\n \\n \\n \\n \\n \\n

\\n Web Scraping\\n

\\n

\\n Estrutura Básica HTML\\n

\\n

\\n Aprendendo Web Scraping com\\n \\n Python\\n \\n ,\\n \\n Requests-HTML\\n \\n ,\\n \\n Beautiful Soup\\n \\n e\\n \\n Scrapy\\n \\n

\\n

\\n “Logic will get you from A to Z; imagination will get you everywhere.”\\n \\n Albert Einstein\\n \\n

\\n

\\n Linguagens de Programação\\n

\\n

\\n Python\\n
\\n Perl\\n
\\n PHP\\n

\\n

\\n Grandes Matemáticos\\n

\\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n \\n

\\n Nome\\n	\\n Sobrenome\\n	\\n Email\\n
\\n Alan\\n	\\n Turing\\n	\\n alan@turing.com\\n
\\n John\\n	\\n von Neumann\\n	\\n john@voneumann.com\\n
\\n Blaise\\n	\\n Pascal\\n	\\n blaise@pascal.com\\n

\\n \\n\\n'" 166 | ] 167 | }, 168 | "execution_count": 6, 169 | "metadata": {}, 170 | "output_type": "execute_result" 171 | } 172 | ], 173 | "source": [ 174 | "soup.prettify()" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "O atributo **title** nos traz o elemento `` da página" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 7, 187 | "metadata": {}, 188 | "outputs": [ 189 | { 190 | "data": { 191 | "text/plain": [ 192 | "<title>Web Scraping Tutorial com Python" 193 | ] 194 | }, 195 | "execution_count": 7, 196 | "metadata": {}, 197 | "output_type": "execute_result" 198 | } 199 | ], 200 | "source": [ 201 | "soup.title" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "O atributo **title.name** nos traz o nome do elemento `` da página" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": 8, 214 | "metadata": {}, 215 | "outputs": [ 216 | { 217 | "data": { 218 | "text/plain": [ 219 | "'title'" 220 | ] 221 | }, 222 | "execution_count": 8, 223 | "metadata": {}, 224 | "output_type": "execute_result" 225 | } 226 | ], 227 | "source": [ 228 | "soup.title.name" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "O atributo **title.string** nos traz o conteúdo do elemento `<title>` da página" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 9, 241 | "metadata": {}, 242 | "outputs": [ 243 | { 244 | "data": { 245 | "text/plain": [ 246 | "'Web Scraping Tutorial com Python'" 247 | ] 248 | }, 249 | "execution_count": 9, 250 | "metadata": {}, 251 | "output_type": "execute_result" 252 | } 253 | ], 254 | "source": [ 255 | "soup.title.string" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "O atributo **title.parent.name** traz o nome do elemento pai do elemento `<title>`, neste caso `<head>`" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 10, 268 | "metadata": {}, 269 | "outputs": [ 270 | { 271 | "data": { 272 | "text/plain": [ 273 | "'head'" 274 | ] 275 | }, 276 | "execution_count": 10, 277 | "metadata": {}, 278 | "output_type": "execute_result" 279 | } 280 | ], 281 | "source": [ 282 | "soup.title.parent.name" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "O atributo **p** traz o primeiro parágrafo (`<p>`) da página" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 11, 295 | "metadata": {}, 296 | "outputs": [ 297 | { 298 | "data": { 299 | "text/plain": [ 300 | "<p>Aprendendo Web Scraping com <a href=\"https://www.python.org/\">Python</a>,\n", 301 | "\t<a href=\"https://github.com/psf/requests-html\">Requests-HTML</a>,\n", 302 | "\t<a href=\"https://www.crummy.com/software/BeautifulSoup/bs4/doc/\">Beautiful Soup</a> e\n", 303 | "\t<a href=\"https://scrapy.org/\">Scrapy</a>\n", 304 | "</p>" 305 | ] 306 | }, 307 | "execution_count": 11, 308 | "metadata": {}, 309 | "output_type": "execute_result" 310 | } 311 | ], 312 | "source": [ 313 | "soup.p" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "#### Buscando Elementos\n", 321 | "\n", 322 | "- O método `find_all()` é capaz de buscar elementos.\n", 323 | "- Passamos como argumento **'p'** e ele nos traz todos os parágrafos da página" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": 12, 329 | "metadata": {}, 330 | "outputs": [ 331 | { 332 | "data": { 333 | "text/plain": [ 334 | "[<p>Aprendendo Web Scraping com <a href=\"https://www.python.org/\">Python</a>,\n", 335 | " \t<a href=\"https://github.com/psf/requests-html\">Requests-HTML</a>,\n", 336 | " \t<a href=\"https://www.crummy.com/software/BeautifulSoup/bs4/doc/\">Beautiful Soup</a> e\n", 337 | " \t<a href=\"https://scrapy.org/\">Scrapy</a>\n", 338 | " </p>,\n", 339 | " <p>“Logic will get you from A to Z; imagination will get you everywhere.” <b>Albert Einstein</b></p>]" 340 | ] 341 | }, 342 | "execution_count": 12, 343 | "metadata": {}, 344 | "output_type": "execute_result" 345 | } 346 | ], 347 | "source": [ 348 | "soup.find_all('p')" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": {}, 354 | "source": [ 355 | "Passamos como argumento **'a'** e ele nos retorna todos os links da página" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 13, 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "data": { 365 | "text/plain": [ 366 | "[<a href=\"https://www.python.org/\">Python</a>,\n", 367 | " <a href=\"https://github.com/psf/requests-html\">Requests-HTML</a>,\n", 368 | " <a href=\"https://www.crummy.com/software/BeautifulSoup/bs4/doc/\">Beautiful Soup</a>,\n", 369 | " <a href=\"https://scrapy.org/\">Scrapy</a>]" 370 | ] 371 | }, 372 | "execution_count": 13, 373 | "metadata": {}, 374 | "output_type": "execute_result" 375 | } 376 | ], 377 | "source": [ 378 | "soup.find_all('a')" 379 | ] 380 | }, 381 | { 382 | "cell_type": "markdown", 383 | "metadata": {}, 384 | "source": [ 385 | "Ao passarmos **'li'** como argumento, nos serão trazidos todos itens de lista" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": 14, 391 | "metadata": {}, 392 | "outputs": [ 393 | { 394 | "data": { 395 | "text/plain": [ 396 | "[<li class=\"python\">Python</li>, <li>Perl</li>, <li>PHP</li>]" 397 | ] 398 | }, 399 | "execution_count": 14, 400 | "metadata": {}, 401 | "output_type": "execute_result" 402 | } 403 | ], 404 | "source": [ 405 | "soup.find_all('li')" 406 | ] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": {}, 411 | "source": [ 412 | "Através de um **for loop** podemos percorrer a lista e extrair apenas o texto" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": 65, 418 | "metadata": {}, 419 | "outputs": [ 420 | { 421 | "name": "stdout", 422 | "output_type": "stream", 423 | "text": [ 424 | "Python\n", 425 | "Perl\n", 426 | "PHP\n" 427 | ] 428 | } 429 | ], 430 | "source": [ 431 | "for l in soup.find_all('li'):\n", 432 | " print(l.text)" 433 | ] 434 | }, 435 | { 436 | "cell_type": "markdown", 437 | "metadata": {}, 438 | "source": [ 439 | "Também podemos buscar um elemento por sua **classe**" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": 15, 445 | "metadata": {}, 446 | "outputs": [ 447 | { 448 | "data": { 449 | "text/plain": [ 450 | "[<li class=\"python\">Python</li>]" 451 | ] 452 | }, 453 | "execution_count": 15, 454 | "metadata": {}, 455 | "output_type": "execute_result" 456 | } 457 | ], 458 | "source": [ 459 | "soup.find_all('li', class_='python') " 460 | ] 461 | }, 462 | { 463 | "cell_type": "markdown", 464 | "metadata": {}, 465 | "source": [ 466 | "Até mesmo podemos buscá-lo por seu **id**" 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": 16, 472 | "metadata": {}, 473 | "outputs": [ 474 | { 475 | "data": { 476 | "text/plain": [ 477 | "<h3 id=\"titulo\">Linguagens de Programação</h3>" 478 | ] 479 | }, 480 | "execution_count": 16, 481 | "metadata": {}, 482 | "output_type": "execute_result" 483 | } 484 | ], 485 | "source": [ 486 | "soup.find(id='titulo')" 487 | ] 488 | }, 489 | { 490 | "cell_type": "markdown", 491 | "metadata": {}, 492 | "source": [ 493 | "É possível também selecionarmos um elemento específico" 494 | ] 495 | }, 496 | { 497 | "cell_type": "code", 498 | "execution_count": 17, 499 | "metadata": {}, 500 | "outputs": [ 501 | { 502 | "data": { 503 | "text/plain": [ 504 | "<h3 id=\"titulo\">Linguagens de Programação</h3>" 505 | ] 506 | }, 507 | "execution_count": 17, 508 | "metadata": {}, 509 | "output_type": "execute_result" 510 | } 511 | ], 512 | "source": [ 513 | "soup.find(\"h3\", {\"id\": \"titulo\"})" 514 | ] 515 | }, 516 | { 517 | "cell_type": "markdown", 518 | "metadata": {}, 519 | "source": [ 520 | "Podemos buscar o conteúdo src de um elemento `<img>`" 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": 18, 526 | "metadata": {}, 527 | "outputs": [ 528 | { 529 | "data": { 530 | "text/plain": [ 531 | "'https://www.crummy.com/software/BeautifulSoup/bs4/doc/_images/6.1.jpg'" 532 | ] 533 | }, 534 | "execution_count": 18, 535 | "metadata": {}, 536 | "output_type": "execute_result" 537 | } 538 | ], 539 | "source": [ 540 | "soup.find('img')['src']" 541 | ] 542 | }, 543 | { 544 | "cell_type": "markdown", 545 | "metadata": {}, 546 | "source": [ 547 | "O método `get_text()` nos retorna uma string que representa a página" 548 | ] 549 | }, 550 | { 551 | "cell_type": "code", 552 | "execution_count": 19, 553 | "metadata": {}, 554 | "outputs": [ 555 | { 556 | "data": { 557 | "text/plain": [ 558 | "'\\n\\n\\n\\nWeb Scraping Tutorial com Python\\n\\n\\n\\n\\nWeb Scraping\\nEstrutura Básica HTML\\n\\nAprendendo Web Scraping com Python,\\n\\tRequests-HTML,\\n\\tBeautiful Soup e\\n\\tScrapy\\n\\n“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein\\nLinguagens de Programação\\n\\nPython\\nPerl\\nPHP\\n\\nGrandes Matemáticos\\n\\n\\nNome\\nSobrenome\\nEmail\\n\\n\\nAlan\\nTuring\\nalan@turing.com\\n\\n\\nJohn\\nvon Neumann\\njohn@voneumann.com\\n\\n\\nBlaise\\nPascal\\nblaise@pascal.com\\n\\n\\n\\n\\n'" 559 | ] 560 | }, 561 | "execution_count": 19, 562 | "metadata": {}, 563 | "output_type": "execute_result" 564 | } 565 | ], 566 | "source": [ 567 | "soup.get_text()" 568 | ] 569 | }, 570 | { 571 | "cell_type": "markdown", 572 | "metadata": {}, 573 | "source": [ 574 | "Podemos utilizar um **for loop** para extrair apenas os links da página" 575 | ] 576 | }, 577 | { 578 | "cell_type": "code", 579 | "execution_count": 70, 580 | "metadata": {}, 581 | "outputs": [ 582 | { 583 | "name": "stdout", 584 | "output_type": "stream", 585 | "text": [ 586 | "https://www.python.org/\n", 587 | "https://github.com/psf/requests-html\n", 588 | "https://www.crummy.com/software/BeautifulSoup/bs4/doc/\n", 589 | "https://scrapy.org/\n" 590 | ] 591 | } 592 | ], 593 | "source": [ 594 | "for link in soup.find_all('a'):\n", 595 | " print(link.get('href'))" 596 | ] 597 | }, 598 | { 599 | "cell_type": "markdown", 600 | "metadata": {}, 601 | "source": [ 602 | "Podemos selecionar todos os links contidos no elemento `<body>`" 603 | ] 604 | }, 605 | { 606 | "cell_type": "code", 607 | "execution_count": 20, 608 | "metadata": {}, 609 | "outputs": [ 610 | { 611 | "data": { 612 | "text/plain": [ 613 | "[<a href=\"https://www.python.org/\">Python</a>,\n", 614 | " <a href=\"https://github.com/psf/requests-html\">Requests-HTML</a>,\n", 615 | " <a href=\"https://www.crummy.com/software/BeautifulSoup/bs4/doc/\">Beautiful Soup</a>,\n", 616 | " <a href=\"https://scrapy.org/\">Scrapy</a>]" 617 | ] 618 | }, 619 | "execution_count": 20, 620 | "metadata": {}, 621 | "output_type": "execute_result" 622 | } 623 | ], 624 | "source": [ 625 | "soup.select(\"body a\")" 626 | ] 627 | }, 628 | { 629 | "cell_type": "markdown", 630 | "metadata": {}, 631 | "source": [ 632 | "Podemos selecionar todos os links que possuam o atributo **'href'**" 633 | ] 634 | }, 635 | { 636 | "cell_type": "code", 637 | "execution_count": 21, 638 | "metadata": {}, 639 | "outputs": [ 640 | { 641 | "data": { 642 | "text/plain": [ 643 | "[<a href=\"https://www.python.org/\">Python</a>,\n", 644 | " <a href=\"https://github.com/psf/requests-html\">Requests-HTML</a>,\n", 645 | " <a href=\"https://www.crummy.com/software/BeautifulSoup/bs4/doc/\">Beautiful Soup</a>,\n", 646 | " <a href=\"https://scrapy.org/\">Scrapy</a>]" 647 | ] 648 | }, 649 | "execution_count": 21, 650 | "metadata": {}, 651 | "output_type": "execute_result" 652 | } 653 | ], 654 | "source": [ 655 | "soup.select('a[href]')" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "Podemos selecionar todos os elementos que possuam o id **titulo**" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": 22, 668 | "metadata": {}, 669 | "outputs": [ 670 | { 671 | "data": { 672 | "text/plain": [ 673 | "[<h3 id=\"titulo\">Linguagens de Programação</h3>,\n", 674 | " <h3 id=\"titulo\">Grandes Matemáticos</h3>]" 675 | ] 676 | }, 677 | "execution_count": 22, 678 | "metadata": {}, 679 | "output_type": "execute_result" 680 | } 681 | ], 682 | "source": [ 683 | "soup.select(\"#titulo\")" 684 | ] 685 | }, 686 | { 687 | "cell_type": "markdown", 688 | "metadata": {}, 689 | "source": [ 690 | "Podemos selecionar todos os elementos que possuam a classe **python**" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": 23, 696 | "metadata": {}, 697 | "outputs": [ 698 | { 699 | "data": { 700 | "text/plain": [ 701 | "[<li class=\"python\">Python</li>]" 702 | ] 703 | }, 704 | "execution_count": 23, 705 | "metadata": {}, 706 | "output_type": "execute_result" 707 | } 708 | ], 709 | "source": [ 710 | "soup.select('.python')" 711 | ] 712 | }, 713 | { 714 | "cell_type": "markdown", 715 | "metadata": {}, 716 | "source": [ 717 | "Podemos selecionar todos os dados de uma tabela" 718 | ] 719 | }, 720 | { 721 | "cell_type": "code", 722 | "execution_count": 24, 723 | "metadata": {}, 724 | "outputs": [ 725 | { 726 | "data": { 727 | "text/plain": [ 728 | "[<td>Alan</td>,\n", 729 | " <td>Turing</td>,\n", 730 | " <td>alan@turing.com</td>,\n", 731 | " <td>John</td>,\n", 732 | " <td>von Neumann</td>,\n", 733 | " <td>john@voneumann.com</td>,\n", 734 | " <td>Blaise</td>,\n", 735 | " <td>Pascal</td>,\n", 736 | " <td>blaise@pascal.com</td>]" 737 | ] 738 | }, 739 | "execution_count": 24, 740 | "metadata": {}, 741 | "output_type": "execute_result" 742 | } 743 | ], 744 | "source": [ 745 | "soup.select('table > tr > td')" 746 | ] 747 | }, 748 | { 749 | "cell_type": "markdown", 750 | "metadata": {}, 751 | "source": [ 752 | "É possível usarmos um **for loop** para obtermos uma representação de nossa página como diversas strings" 753 | ] 754 | }, 755 | { 756 | "cell_type": "code", 757 | "execution_count": 25, 758 | "metadata": {}, 759 | "outputs": [ 760 | { 761 | "name": "stdout", 762 | "output_type": "stream", 763 | "text": [ 764 | "'\\n'\n", 765 | "'\\n'\n", 766 | "'\\n'\n", 767 | "'\\n'\n", 768 | "'Web Scraping Tutorial com Python'\n", 769 | "'\\n'\n", 770 | "'\\n'\n", 771 | "'\\n'\n", 772 | "'\\n'\n", 773 | "'\\n'\n", 774 | "'Web Scraping'\n", 775 | "'\\n'\n", 776 | "'Estrutura Básica HTML'\n", 777 | "'\\n'\n", 778 | "'\\n'\n", 779 | "'Aprendendo Web Scraping com '\n", 780 | "'Python'\n", 781 | "',\\n\\t'\n", 782 | "'Requests-HTML'\n", 783 | "',\\n\\t'\n", 784 | "'Beautiful Soup'\n", 785 | "' e\\n\\t'\n", 786 | "'Scrapy'\n", 787 | "'\\n'\n", 788 | "'\\n'\n", 789 | "'“Logic will get you from A to Z; imagination will get you everywhere.” '\n", 790 | "'Albert Einstein'\n", 791 | "'\\n'\n", 792 | "'Linguagens de Programação'\n", 793 | "'\\n'\n", 794 | "'\\n'\n", 795 | "'Python'\n", 796 | "'\\n'\n", 797 | "'Perl'\n", 798 | "'\\n'\n", 799 | "'PHP'\n", 800 | "'\\n'\n", 801 | "'\\n'\n", 802 | "'Grandes Matemáticos'\n", 803 | "'\\n'\n", 804 | "'\\n'\n", 805 | "'\\n'\n", 806 | "'Nome'\n", 807 | "'\\n'\n", 808 | "'Sobrenome'\n", 809 | "'\\n'\n", 810 | "'Email'\n", 811 | "'\\n'\n", 812 | "'\\n'\n", 813 | "'\\n'\n", 814 | "'Alan'\n", 815 | "'\\n'\n", 816 | "'Turing'\n", 817 | "'\\n'\n", 818 | "'alan@turing.com'\n", 819 | "'\\n'\n", 820 | "'\\n'\n", 821 | "'\\n'\n", 822 | "'John'\n", 823 | "'\\n'\n", 824 | "'von Neumann'\n", 825 | "'\\n'\n", 826 | "'john@voneumann.com'\n", 827 | "'\\n'\n", 828 | "'\\n'\n", 829 | "'\\n'\n", 830 | "'Blaise'\n", 831 | "'\\n'\n", 832 | "'Pascal'\n", 833 | "'\\n'\n", 834 | "'blaise@pascal.com'\n", 835 | "'\\n'\n", 836 | "'\\n'\n", 837 | "'\\n'\n", 838 | "'\\n'\n", 839 | "'\\n'\n" 840 | ] 841 | } 842 | ], 843 | "source": [ 844 | "for string in soup.strings:\n", 845 | " print(repr(string))" 846 | ] 847 | } 848 | ], 849 | "metadata": { 850 | "kernelspec": { 851 | "display_name": "Python 3", 852 | "language": "python", 853 | "name": "python3" 854 | }, 855 | "language_info": { 856 | "codemirror_mode": { 857 | "name": "ipython", 858 | "version": 3 859 | }, 860 | "file_extension": ".py", 861 | "mimetype": "text/x-python", 862 | "name": "python", 863 | "nbconvert_exporter": "python", 864 | "pygments_lexer": "ipython3", 865 | "version": "3.7.7" 866 | } 867 | }, 868 | "nbformat": 4, 869 | "nbformat_minor": 4 870 | } 871 | -------------------------------------------------------------------------------- /notebooks/Fundamentos.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Web Scraping com Python\n", 8 | "\n", 9 | "## Introdução\n", 10 | "\n", 11 | "Iniciaremos explorando o Website: **https://pythonwebscraping.netlify.app**\n", 12 | "\n", 13 | "Começamos importando as bibliotecas essenciais para trabalharmos com **Web Scraping**\n", 14 | "\n", 15 | "- A Biblioteca [requests](https://requests.readthedocs.io/en/master/) nos permite fazer requisições HTTP de forma a obtermos o conteúdo HTML do Website\n", 16 | "- A Biblioteca [re](https://docs.python.org/3/library/re.html) nos permite trabalharmos com expressões regulares para que possamos buscar padrões nos textos extraídos do Website" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 2, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "import requests \n", 26 | "import re" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "### Requisição HTTP\n", 34 | "\n", 35 | "Começamos executando uma requisição GET para obtermos o conteúdo HTML do Website" 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 17, 41 | "metadata": {}, 42 | "outputs": [ 43 | { 44 | "name": "stdout", 45 | "output_type": "stream", 46 | "text": [ 47 | "<class 'requests.models.Response'>\n" 48 | ] 49 | } 50 | ], 51 | "source": [ 52 | "r = requests.get('https://pythonwebscraping.netlify.app')\n", 53 | "print(type(r))" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "O método `get()` nos retorna um objeto `requests.models.Response`, podemos inspecioná-lo para visualizarmos os atributos e métodos que estão disponíveis para trabalharmos" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 18, 66 | "metadata": {}, 67 | "outputs": [ 68 | { 69 | "name": "stdout", 70 | "output_type": "stream", 71 | "text": [ 72 | "['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']\n" 73 | ] 74 | } 75 | ], 76 | "source": [ 77 | "print(dir(r))" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "Vamos agora verificar o conteúdo HTML que nos é retornado, para isso vamos acessar o atributo `content`" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 19, 90 | "metadata": {}, 91 | "outputs": [ 92 | { 93 | "name": "stdout", 94 | "output_type": "stream", 95 | "text": [ 96 | "b'<!DOCTYPE html>\\n<html>\\n<head>\\n\\t<meta charset=\"UTF-8\">\\n\\t<title>Web Scraping Tutorial com Python\\n\\t\\n\\t\\n\\n\\n\\t

Web Scraping

\\n\\t

Estrutura B\\xc3\\xa1sica HTML

\\n\\t

Aprendendo Web Scraping com Python,\\n\\tRequests-HTML,\\n\\tBeautiful Soup e\\n\\tScrapy\\n\\t

\\n\\n\\t

\\xe2\\x80\\x9cLogic will get you from A to Z; imagination will get you everywhere.\\xe2\\x80\\x9d Albert Einstein

\\n\\t\\n\\t

Linguagens de Programa\\xc3\\xa7\\xc3\\xa3o

\\n\\t

Python
Perl
PHP

\\n\\t\\n\\t

Grandes Matem\\xc3\\xa1ticos

\\n\\t\\n\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\n\\t

Nome	Sobrenome	Email
Alan	Turing	alan@turing.com
John	von Neumann	john@voneumann.com
Blaise	Pascal	blaise@pascal.com

\\n\\n\\n'\n" 97 | ] 98 | } 99 | ], 100 | "source": [ 101 | "print(r.content)" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "Podemos também verificar o tipo de codificação do arquivo acessando o atributo `encoding`" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": 20, 114 | "metadata": {}, 115 | "outputs": [ 116 | { 117 | "data": { 118 | "text/plain": [ 119 | "'UTF-8'" 120 | ] 121 | }, 122 | "execution_count": 20, 123 | "metadata": {}, 124 | "output_type": "execute_result" 125 | } 126 | ], 127 | "source": [ 128 | "r.encoding" 129 | ] 130 | }, 131 | { 132 | "cell_type": "markdown", 133 | "metadata": {}, 134 | "source": [ 135 | "O atributo `headers` nos traz detalhes sobre os headers" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 21, 141 | "metadata": {}, 142 | "outputs": [ 143 | { 144 | "name": "stdout", 145 | "output_type": "stream", 146 | "text": [ 147 | "{'Cache-Control': 'public, max-age=0, must-revalidate', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Tue, 10 Nov 2020 23:47:14 GMT', 'Etag': '\"17afc31fb1188dd828f61941a34ae8ad-ssl-df\"', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Content-Encoding': 'gzip', 'Content-Length': '743', 'Age': '351', 'Connection': 'keep-alive', 'Server': 'Netlify', 'Vary': 'Accept-Encoding', 'X-NF-Request-ID': '6c6e39a3-537f-4934-adaa-9234afd2e07c-12269260'}\n" 148 | ] 149 | } 150 | ], 151 | "source": [ 152 | "print(r.headers)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "O atributo `status_code` nos permite verificar o código de status retornado, 200 significa que a requisição ocorreu com sucesso" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 22, 165 | "metadata": {}, 166 | "outputs": [ 167 | { 168 | "data": { 169 | "text/plain": [ 170 | "200" 171 | ] 172 | }, 173 | "execution_count": 22, 174 | "metadata": {}, 175 | "output_type": "execute_result" 176 | } 177 | ], 178 | "source": [ 179 | "r.status_code" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "O atributo `url` nos possibilita verificar a url que foi utilizada na requisição" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 23, 192 | "metadata": {}, 193 | "outputs": [ 194 | { 195 | "data": { 196 | "text/plain": [ 197 | "'https://pythonwebscraping.netlify.app/'" 198 | ] 199 | }, 200 | "execution_count": 23, 201 | "metadata": {}, 202 | "output_type": "execute_result" 203 | } 204 | ], 205 | "source": [ 206 | "r.url" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "Por fim o atributo `text` nos traz uma string que representa o documento html que desejamos executar o Web Scraping" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": 24, 219 | "metadata": {}, 220 | "outputs": [ 221 | { 222 | "name": "stdout", 223 | "output_type": "stream", 224 | "text": [ 225 | "\n", 226 | "\n", 227 | "\n", 228 | "\t\n", 229 | "\tWeb Scraping Tutorial com Python\n", 230 | "\t\n", 231 | "\t\n", 247 | "\n", 248 | "\n", 249 | "\t

Web Scraping

\n", 250 | "\t

Estrutura Básica HTML

\n", 251 | "\t

\n", 252 | "\t

Aprendendo Web Scraping com Python,\n", 253 | "\tRequests-HTML,\n", 254 | "\tBeautiful Soup e\n", 255 | "\tScrapy\n", 256 | "\t

\n", 257 | "\n", 258 | "\t

“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein

\n", 259 | "\t\n", 260 | "\t

Linguagens de Programação

\n", 261 | "\t

Python
Perl
PHP

\n", 266 | "\t\n", 267 | "\t

Grandes Matemáticos

\n", 268 | "\t\n", 269 | "\t\t\n", 270 | "\t\t\t\n", 271 | "\t\t\t\n", 272 | "\t\t\t\n", 273 | "\t\t\n", 274 | "\t\t\n", 275 | "\t\t\t\n", 276 | "\t\t\t\n", 277 | "\t\t\t\n", 278 | "\t\t\n", 279 | "\t\t\n", 280 | "\t\t\t\n", 281 | "\t\t\t\n", 282 | "\t\t\t\n", 283 | "\t\t\n", 284 | "\t\t\n", 285 | "\t\t\t\n", 286 | "\t\t\t\n", 287 | "\t\t\t\n", 288 | "\t\t\n", 289 | "\t

Nome	Sobrenome	Email
Alan	Turing	alan@turing.com
John	von Neumann	john@voneumann.com
Blaise	Pascal	blaise@pascal.com

\n", 290 | "\n", 291 | "\n", 292 | "\n" 293 | ] 294 | } 295 | ], 296 | "source": [ 297 | "print(r.text)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "metadata": {}, 303 | "source": [ 304 | "Vamos agora armazenar o conteúdo HTML em uma variável para conveniência e uso futuro desses dados" 305 | ] 306 | }, 307 | { 308 | "cell_type": "code", 309 | "execution_count": 25, 310 | "metadata": {}, 311 | "outputs": [], 312 | "source": [ 313 | "html = r.text" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "### Buscando Dados com Expressões Regulares\n", 321 | "\n", 322 | "Usaremos o poder das expressões regulares para extrair dados de nosso Website" 323 | ] 324 | }, 325 | { 326 | "cell_type": "markdown", 327 | "metadata": {}, 328 | "source": [ 329 | "#### Título da Página" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": 26, 335 | "metadata": {}, 336 | "outputs": [], 337 | "source": [ 338 | "title = re.findall(r\"(.+?)\", html)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 41, 344 | "metadata": {}, 345 | "outputs": [ 346 | { 347 | "name": "stdout", 348 | "output_type": "stream", 349 | "text": [ 350 | "['Web Scraping Tutorial com Python']\n" 351 | ] 352 | } 353 | ], 354 | "source": [ 355 | "print(title)" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "#### Parágrafos da Página" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": 31, 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "p = re.findall(r\"

(.*?)

\", html, flags=re.DOTALL) " 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 40, 377 | "metadata": {}, 378 | "outputs": [ 379 | { 380 | "name": "stdout", 381 | "output_type": "stream", 382 | "text": [ 383 | "['Aprendendo Web Scraping com Python,\\n\\tRequests-HTML,\\n\\tBeautiful Soup e\\n\\tScrapy\\n\\t', '“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein']\n" 384 | ] 385 | } 386 | ], 387 | "source": [ 388 | "print(p)" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": {}, 394 | "source": [ 395 | "#### Links da Página" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": 35, 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "a = re.findall(r'href=[\\'\"]?([^\\'\" >]+)', html)" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": 37, 410 | "metadata": {}, 411 | "outputs": [ 412 | { 413 | "data": { 414 | "text/plain": [ 415 | "['https://i.imgur.com/QOVnf5D.png',\n", 416 | " 'https://www.python.org/',\n", 417 | " 'https://github.com/psf/requests-html',\n", 418 | " 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/',\n", 419 | " 'https://scrapy.org/']" 420 | ] 421 | }, 422 | "execution_count": 37, 423 | "metadata": {}, 424 | "output_type": "execute_result" 425 | } 426 | ], 427 | "source": [ 428 | "a" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "#### Emails da Página" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": 38, 441 | "metadata": {}, 442 | "outputs": [], 443 | "source": [ 444 | "emails = re.findall(r'([\\d\\w\\.]+@[\\d\\w\\.\\-]+\\.\\w+)', html)" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": 39, 450 | "metadata": {}, 451 | "outputs": [ 452 | { 453 | "data": { 454 | "text/plain": [ 455 | "['alan@turing.com', 'john@voneumann.com', 'blaise@pascal.com']" 456 | ] 457 | }, 458 | "execution_count": 39, 459 | "metadata": {}, 460 | "output_type": "execute_result" 461 | } 462 | ], 463 | "source": [ 464 | "emails" 465 | ] 466 | } 467 | ], 468 | "metadata": { 469 | "kernelspec": { 470 | "display_name": "Python 3", 471 | "language": "python", 472 | "name": "python3" 473 | }, 474 | "language_info": { 475 | "codemirror_mode": { 476 | "name": "ipython", 477 | "version": 3 478 | }, 479 | "file_extension": ".py", 480 | "mimetype": "text/x-python", 481 | "name": "python", 482 | "nbconvert_exporter": "python", 483 | "pygments_lexer": "ipython3", 484 | "version": "3.7.7" 485 | } 486 | }, 487 | "nbformat": 4, 488 | "nbformat_minor": 4 489 | } 490 | -------------------------------------------------------------------------------- /notebooks/PyQuery.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# A Biblioteca PyQuery\n", 8 | "\n", 9 | "PyQuery nos permite executarmos consultas **jQuery** em documentos XML. A API é bastante similar à biblioteca **[jQuery](https://jquery.com/)**.\n", 10 | "\n", 11 | "PyQuery utiliza **[lxml](https://lxml.de/)** para manipulação rápida de documentos XML e HTML.\n", 12 | "\n", 13 | "Você pode conhecer mais detalhes sobre PyQuery em sua **[Documentação](https://pythonhosted.org/pyquery/)**\n", 14 | "\n", 15 | "Vamos agora executar alguns experimentos utilizando nossa página de testes: **[pythonwebscraping.netlify.app](https://pythonwebscraping.netlify.app)**" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "Iniciamos importando a biblioteca PyQuery com a abreviatura **pq**" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 2, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "from pyquery import PyQuery as pq" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "Fazemos então a requisição de nosso documento HTML" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 3, 44 | "metadata": {}, 45 | "outputs": [ 46 | { 47 | "name": "stdout", 48 | "output_type": "stream", 49 | "text": [ 50 | "\n", 51 | "\n", 52 | "\t\n", 53 | "\tWeb Scraping Tutorial com Python\n", 54 | "\t\n", 55 | "\t\n", 71 | "\n", 72 | "\n", 73 | "\t

Web Scraping

\n", 74 | "\t

Estrutura Básica HTML

\n", 75 | "\t

\n", 76 | "\t

Aprendendo Web Scraping com Python,\n", 77 | "\tRequests-HTML,\n", 78 | "\tBeautiful Soup e\n", 79 | "\tScrapy\n", 80 | "\t

\n", 81 | "\n", 82 | "\t

“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein

\n", 83 | "\t\n", 84 | "\t

Linguagens de Programação

\n", 85 | "\t

Python
Perl
PHP

\n", 90 | "\t\n", 91 | "\t

Grandes Matemáticos

\n", 92 | "\t\n", 93 | "\t\t\n", 94 | "\t\t\t\n", 95 | "\t\t\t\n", 96 | "\t\t\t\n", 97 | "\t\t\n", 98 | "\t\t\n", 99 | "\t\t\t\n", 100 | "\t\t\t\n", 101 | "\t\t\t\n", 102 | "\t\t\n", 103 | "\t\t\n", 104 | "\t\t\t\n", 105 | "\t\t\t\n", 106 | "\t\t\t\n", 107 | "\t\t\n", 108 | "\t\t\n", 109 | "\t\t\t\n", 110 | "\t\t\t\n", 111 | "\t\t\t\n", 112 | "\t\t\n", 113 | "\t

Nome	Sobrenome	Email
Alan	Turing	alan@turing.com
John	von Neumann	john@voneumann.com
Blaise	Pascal	blaise@pascal.com

\n", 114 | "\n", 115 | "\n" 116 | ] 117 | } 118 | ], 119 | "source": [ 120 | "d = pq(url='https://pythonwebscraping.netlify.app')\n", 121 | "print(d)" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "Selecionando elementos com o id **titulo**" 129 | ] 130 | }, 131 | { 132 | "cell_type": "code", 133 | "execution_count": 4, 134 | "metadata": {}, 135 | "outputs": [ 136 | { 137 | "data": { 138 | "text/plain": [ 139 | "[, ]" 140 | ] 141 | }, 142 | "execution_count": 4, 143 | "metadata": {}, 144 | "output_type": "execute_result" 145 | } 146 | ], 147 | "source": [ 148 | "d('#titulo')" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "Nos é retornado uma lista com dois elementos, utilizando um **for loop** podemos imprimir o texto dos elementos de id **titulo**" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 5, 161 | "metadata": {}, 162 | "outputs": [ 163 | { 164 | "name": "stdout", 165 | "output_type": "stream", 166 | "text": [ 167 | "Linguagens de Programação\n", 168 | "Grandes Matemáticos\n" 169 | ] 170 | } 171 | ], 172 | "source": [ 173 | "for t in d('#titulo'):\n", 174 | " print(t.text)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "Selecionando elementos que possuam a classe **python**" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 6, 187 | "metadata": {}, 188 | "outputs": [ 189 | { 190 | "data": { 191 | "text/plain": [ 192 | "[]" 193 | ] 194 | }, 195 | "execution_count": 6, 196 | "metadata": {}, 197 | "output_type": "execute_result" 198 | } 199 | ], 200 | "source": [ 201 | "d('.python')" 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "Obtendo o texto contido no elemento de classe **python**" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": 7, 214 | "metadata": {}, 215 | "outputs": [ 216 | { 217 | "data": { 218 | "text/plain": [ 219 | "'Python'" 220 | ] 221 | }, 222 | "execution_count": 7, 223 | "metadata": {}, 224 | "output_type": "execute_result" 225 | } 226 | ], 227 | "source": [ 228 | "d('.python').text()" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "Selecionando elementos `

` do documento HTML" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 9, 241 | "metadata": {}, 242 | "outputs": [ 243 | { 244 | "data": { 245 | "text/plain": [ 246 | "[,

]" 247 | ] 248 | }, 249 | "execution_count": 9, 250 | "metadata": {}, 251 | "output_type": "execute_result" 252 | } 253 | ], 254 | "source": [ 255 | "d('li')" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "Nos é retornado uma lista com três elementos, utilizamos um for loop para imprimir o texto dos elementos `

`" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": 10, 268 | "metadata": {}, 269 | "outputs": [ 270 | { 271 | "name": "stdout", 272 | "output_type": "stream", 273 | "text": [ 274 | "Python\n", 275 | "Perl\n", 276 | "PHP\n" 277 | ] 278 | } 279 | ], 280 | "source": [ 281 | "for li in d('li'):\n", 282 | " print(li.text)" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "Selecionandos todos os elementos `` do documento" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 11, 295 | "metadata": {}, 296 | "outputs": [ 297 | { 298 | "data": { 299 | "text/plain": [ 300 | "[, , , , , , , , ]" 301 | ] 302 | }, 303 | "execution_count": 11, 304 | "metadata": {}, 305 | "output_type": "execute_result" 306 | } 307 | ], 308 | "source": [ 309 | "d('td')" 310 | ] 311 | }, 312 | { 313 | "cell_type": "markdown", 314 | "metadata": {}, 315 | "source": [ 316 | "Obtendo o texto de cada elemento ``" 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": 12, 322 | "metadata": {}, 323 | "outputs": [ 324 | { 325 | "name": "stdout", 326 | "output_type": "stream", 327 | "text": [ 328 | "Alan\n", 329 | "Turing\n", 330 | "alan@turing.com\n", 331 | "John\n", 332 | "von Neumann\n", 333 | "john@voneumann.com\n", 334 | "Blaise\n", 335 | "Pascal\n", 336 | "blaise@pascal.com\n" 337 | ] 338 | } 339 | ], 340 | "source": [ 341 | "for td in d('td'):\n", 342 | " print(td.text)" 343 | ] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": {}, 348 | "source": [ 349 | "Podemos selecionar apenas o primeiro link do documento" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": 13, 355 | "metadata": {}, 356 | "outputs": [ 357 | { 358 | "data": { 359 | "text/plain": [ 360 | "[]" 361 | ] 362 | }, 363 | "execution_count": 13, 364 | "metadata": {}, 365 | "output_type": "execute_result" 366 | } 367 | ], 368 | "source": [ 369 | "d('a:first')" 370 | ] 371 | }, 372 | { 373 | "cell_type": "markdown", 374 | "metadata": {}, 375 | "source": [ 376 | "Podemos selecionar apenas o último link do documento e extrair o conteúdo do atributo **'href'**" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 14, 382 | "metadata": {}, 383 | "outputs": [ 384 | { 385 | "data": { 386 | "text/plain": [ 387 | "'https://scrapy.org/'" 388 | ] 389 | }, 390 | "execution_count": 14, 391 | "metadata": {}, 392 | "output_type": "execute_result" 393 | } 394 | ], 395 | "source": [ 396 | "d('a:last').attr('href')" 397 | ] 398 | }, 399 | { 400 | "cell_type": "markdown", 401 | "metadata": {}, 402 | "source": [ 403 | "Podemos extrair todos os links da página" 404 | ] 405 | }, 406 | { 407 | "cell_type": "code", 408 | "execution_count": 15, 409 | "metadata": {}, 410 | "outputs": [ 411 | { 412 | "name": "stdout", 413 | "output_type": "stream", 414 | "text": [ 415 | "https://www.python.org/\n", 416 | "https://github.com/psf/requests-html\n", 417 | "https://www.crummy.com/software/BeautifulSoup/bs4/doc/\n", 418 | "https://scrapy.org/\n" 419 | ] 420 | } 421 | ], 422 | "source": [ 423 | "for links in d('a'):\n", 424 | " print(links.attrib['href'])" 425 | ] 426 | }, 427 | { 428 | "cell_type": "markdown", 429 | "metadata": {}, 430 | "source": [ 431 | "Selecionando as tags filhas do elemento `

`" 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": 16, 437 | "metadata": {}, 438 | "outputs": [ 439 | { 440 | "data": { 441 | "text/plain": [ 442 | "[, , , , ]" 443 | ] 444 | }, 445 | "execution_count": 16, 446 | "metadata": {}, 447 | "output_type": "execute_result" 448 | } 449 | ], 450 | "source": [ 451 | "d('p').children() " 452 | ] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "metadata": {}, 457 | "source": [ 458 | "Selecionando todos os elementos `

`" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": 58, 464 | "metadata": {}, 465 | "outputs": [ 466 | { 467 | "data": { 468 | "text/plain": [ 469 | "[

]" 470 | ] 471 | }, 472 | "execution_count": 58, 473 | "metadata": {}, 474 | "output_type": "execute_result" 475 | } 476 | ], 477 | "source": [ 478 | "d('p') " 479 | ] 480 | }, 481 | { 482 | "cell_type": "markdown", 483 | "metadata": {}, 484 | "source": [ 485 | "Pesquisando a existência de elementos `` dentro de `

`" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": 18, 491 | "metadata": {}, 492 | "outputs": [ 493 | { 494 | "data": { 495 | "text/plain": [ 496 | "[, , , ]" 497 | ] 498 | }, 499 | "execution_count": 18, 500 | "metadata": {}, 501 | "output_type": "execute_result" 502 | } 503 | ], 504 | "source": [ 505 | "d('p').find('a') " 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": {}, 511 | "source": [ 512 | "Novamento, podemos obter os links com um **for loop**" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": 19, 518 | "metadata": {}, 519 | "outputs": [ 520 | { 521 | "name": "stdout", 522 | "output_type": "stream", 523 | "text": [ 524 | "https://www.python.org/\n", 525 | "https://github.com/psf/requests-html\n", 526 | "https://www.crummy.com/software/BeautifulSoup/bs4/doc/\n", 527 | "https://scrapy.org/\n" 528 | ] 529 | } 530 | ], 531 | "source": [ 532 | "for l in d('p').find('a'):\n", 533 | " print(l.attrib['href'])" 534 | ] 535 | }, 536 | { 537 | "cell_type": "markdown", 538 | "metadata": {}, 539 | "source": [ 540 | "Verificando a presença de determinada classe no documento" 541 | ] 542 | }, 543 | { 544 | "cell_type": "code", 545 | "execution_count": 68, 546 | "metadata": {}, 547 | "outputs": [ 548 | { 549 | "data": { 550 | "text/plain": [ 551 | "True" 552 | ] 553 | }, 554 | "execution_count": 68, 555 | "metadata": {}, 556 | "output_type": "execute_result" 557 | } 558 | ], 559 | "source": [ 560 | " d('li').hasClass('python')" 561 | ] 562 | }, 563 | { 564 | "cell_type": "markdown", 565 | "metadata": {}, 566 | "source": [ 567 | "É possível selecionarmos um elemento e assim obter sua representação HTML com o método `outerHtml()`" 568 | ] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "execution_count": 20, 573 | "metadata": {}, 574 | "outputs": [ 575 | { 576 | "data": { 577 | "text/plain": [ 578 | "'\\n\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\n\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\\n\\t
Nome Sobrenome Email
Alan Turing alan@turing.com
John von Neumann john@voneumann.com
Blaise Pascal blaise@pascal.com
'" 579 | ] 580 | }, 581 | "execution_count": 20, 582 | "metadata": {}, 583 | "output_type": "execute_result" 584 | } 585 | ], 586 | "source": [ 587 | "d('table').outerHtml()" 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "Selecionando todos os elementos `

` do documento" 595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "execution_count": 21, 600 | "metadata": {}, 601 | "outputs": [ 602 | { 603 | "data": { 604 | "text/plain": [ 605 | "[

]" 606 | ] 607 | }, 608 | "execution_count": 21, 609 | "metadata": {}, 610 | "output_type": "execute_result" 611 | } 612 | ], 613 | "source": [ 614 | "d('p')" 615 | ] 616 | }, 617 | { 618 | "cell_type": "markdown", 619 | "metadata": {}, 620 | "source": [ 621 | "Filtrando elementos por posição" 622 | ] 623 | }, 624 | { 625 | "cell_type": "code", 626 | "execution_count": 22, 627 | "metadata": {}, 628 | "outputs": [ 629 | { 630 | "data": { 631 | "text/plain": [ 632 | "'Aprendendo Web Scraping com Python,\\n\\tRequests-HTML,\\n\\tBeautiful Soup e\\n\\tScrapy\\n\\t'" 633 | ] 634 | }, 635 | "execution_count": 22, 636 | "metadata": {}, 637 | "output_type": "execute_result" 638 | } 639 | ], 640 | "source": [ 641 | "d('p').filter(lambda i: i == 0).html()" 642 | ] 643 | }, 644 | { 645 | "cell_type": "markdown", 646 | "metadata": {}, 647 | "source": [ 648 | "Filtrando elementos por posição" 649 | ] 650 | }, 651 | { 652 | "cell_type": "code", 653 | "execution_count": 23, 654 | "metadata": {}, 655 | "outputs": [ 656 | { 657 | "data": { 658 | "text/plain": [ 659 | "'“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein'" 660 | ] 661 | }, 662 | "execution_count": 23, 663 | "metadata": {}, 664 | "output_type": "execute_result" 665 | } 666 | ], 667 | "source": [ 668 | "d('p').filter(lambda i: i == 1).html()" 669 | ] 670 | }, 671 | { 672 | "cell_type": "markdown", 673 | "metadata": {}, 674 | "source": [ 675 | "Filtrando elementos por seu texto" 676 | ] 677 | }, 678 | { 679 | "cell_type": "code", 680 | "execution_count": 24, 681 | "metadata": {}, 682 | "outputs": [ 683 | { 684 | "data": { 685 | "text/plain": [ 686 | "'“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein'" 687 | ] 688 | }, 689 | "execution_count": 24, 690 | "metadata": {}, 691 | "output_type": "execute_result" 692 | } 693 | ], 694 | "source": [ 695 | "d('p').filter(lambda i: pq(this).text() == '“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein').html()" 696 | ] 697 | }, 698 | { 699 | "cell_type": "markdown", 700 | "metadata": {}, 701 | "source": [ 702 | "Filtrando elementos por seu texto" 703 | ] 704 | }, 705 | { 706 | "cell_type": "code", 707 | "execution_count": 25, 708 | "metadata": {}, 709 | "outputs": [ 710 | { 711 | "data": { 712 | "text/plain": [ 713 | "'Aprendendo Web Scraping com Python,\\n\\tRequests-HTML,\\n\\tBeautiful Soup e\\n\\tScrapy\\n\\t'" 714 | ] 715 | }, 716 | "execution_count": 25, 717 | "metadata": {}, 718 | "output_type": "execute_result" 719 | } 720 | ], 721 | "source": [ 722 | "d('p').filter(lambda i: pq(this).text() == 'Aprendendo Web Scraping com Python, Requests-HTML, Beautiful Soup e Scrapy').html()" 723 | ] 724 | } 725 | ], 726 | "metadata": { 727 | "kernelspec": { 728 | "display_name": "Python 3", 729 | "language": "python", 730 | "name": "python3" 731 | }, 732 | "language_info": { 733 | "codemirror_mode": { 734 | "name": "ipython", 735 | "version": 3 736 | }, 737 | "file_extension": ".py", 738 | "mimetype": "text/x-python", 739 | "name": "python", 740 | "nbconvert_exporter": "python", 741 | "pygments_lexer": "ipython3", 742 | "version": "3.7.7" 743 | } 744 | }, 745 | "nbformat": 4, 746 | "nbformat_minor": 4 747 | } 748 | -------------------------------------------------------------------------------- /notebooks/Requests-HTML.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# A Biblioteca Requests-HTML\n", 8 | "\n", 9 | "**[Requests-HTML](https://requests.readthedocs.io/projects/requests-html/en/latest/)** é uma biblioteca que tem como objetivo tornar a análise de HTML (por exemplo: **Web Scraping**) o mais simples e intuitiva possível.\n", 10 | "\n", 11 | "Para fazermos a instalação dela é muito simples, basta executarmos o comando:\n", 12 | "\n", 13 | "`pip install requests-html`\n", 14 | "\n", 15 | "Uma vez instalada, já podemos importá-la para começarmos nossos experimentos" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 4, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "from requests_html import HTMLSession" 25 | ] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": {}, 30 | "source": [ 31 | "Iniciamos construindo o objeto `HTMLSession`" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 5, 37 | "metadata": {}, 38 | "outputs": [], 39 | "source": [ 40 | "session = HTMLSession()" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "Verificamos o tipo dele e confirmamos que se trata de um objeto `requests_html.HTMLSession`" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 6, 53 | "metadata": {}, 54 | "outputs": [ 55 | { 56 | "data": { 57 | "text/plain": [ 58 | "requests_html.HTMLSession" 59 | ] 60 | }, 61 | "execution_count": 6, 62 | "metadata": {}, 63 | "output_type": "execute_result" 64 | } 65 | ], 66 | "source": [ 67 | "type(session)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "markdown", 72 | "metadata": {}, 73 | "source": [ 74 | "Com o método `dir()` investigamos os atributos e métodos disponíveis para trabalharmos" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": 7, 80 | "metadata": {}, 81 | "outputs": [ 82 | { 83 | "name": "stdout", 84 | "output_type": "stream", 85 | "text": [ 86 | "['_BaseSession__browser_args', '__attrs__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'adapters', 'auth', 'browser', 'cert', 'close', 'cookies', 'delete', 'get', 'get_adapter', 'get_redirect_target', 'head', 'headers', 'hooks', 'max_redirects', 'merge_environment_settings', 'mount', 'options', 'params', 'patch', 'post', 'prepare_request', 'proxies', 'put', 'rebuild_auth', 'rebuild_method', 'rebuild_proxies', 'request', 'resolve_redirects', 'response_hook', 'send', 'should_strip_auth', 'stream', 'trust_env', 'verify']\n" 87 | ] 88 | } 89 | ], 90 | "source": [ 91 | "print(dir(session))" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "Executando uma requisição GET para obtermos o conteúdo do Website: `pythonwebscraping.netlify.com`" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 9, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [ 107 | "r = session.get('https://pythonwebscraping.netlify.app')" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "`r.html.links` representa um conjunto (**set**) com todos os links da página" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 10, 120 | "metadata": {}, 121 | "outputs": [ 122 | { 123 | "name": "stdout", 124 | "output_type": "stream", 125 | "text": [ 126 | "\n" 127 | ] 128 | }, 129 | { 130 | "data": { 131 | "text/plain": [ 132 | "{'https://github.com/psf/requests-html',\n", 133 | " 'https://scrapy.org/',\n", 134 | " 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/',\n", 135 | " 'https://www.python.org/'}" 136 | ] 137 | }, 138 | "execution_count": 10, 139 | "metadata": {}, 140 | "output_type": "execute_result" 141 | } 142 | ], 143 | "source": [ 144 | "print(type(r.html.links))\n", 145 | "r.html.links" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "`r.html.absolute_links` representa um conjunto (**set**) com todos os links absolutos da página\n", 153 | "\n", 154 | "Através do **for loop** podemos extrair somente os links" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": 11, 160 | "metadata": {}, 161 | "outputs": [ 162 | { 163 | "name": "stdout", 164 | "output_type": "stream", 165 | "text": [ 166 | "https://github.com/psf/requests-html\n", 167 | "https://www.crummy.com/software/BeautifulSoup/bs4/doc/\n", 168 | "https://scrapy.org/\n", 169 | "https://www.python.org/\n" 170 | ] 171 | } 172 | ], 173 | "source": [ 174 | "for links in r.html.absolute_links:\n", 175 | " print(links)" 176 | ] 177 | }, 178 | { 179 | "cell_type": "markdown", 180 | "metadata": {}, 181 | "source": [ 182 | "Podemos buscar um elemento através do seu **id**" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 12, 188 | "metadata": {}, 189 | "outputs": [ 190 | { 191 | "name": "stdout", 192 | "output_type": "stream", 193 | "text": [ 194 | "[, ]\n" 195 | ] 196 | } 197 | ], 198 | "source": [ 199 | "titulo = r.html.find('#titulo')\n", 200 | "print(titulo)" 201 | ] 202 | }, 203 | { 204 | "cell_type": "markdown", 205 | "metadata": {}, 206 | "source": [ 207 | "Obtemos apenas o texto ao acessarmos o atributo **text**" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": 13, 213 | "metadata": {}, 214 | "outputs": [ 215 | { 216 | "name": "stdout", 217 | "output_type": "stream", 218 | "text": [ 219 | "Linguagens de Programação\n", 220 | "Grandes Matemáticos\n" 221 | ] 222 | } 223 | ], 224 | "source": [ 225 | "for t in titulo:\n", 226 | " print(t.text)" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "Podemos buscar um elemento através de sua **classe**" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 14, 239 | "metadata": {}, 240 | "outputs": [ 241 | { 242 | "name": "stdout", 243 | "output_type": "stream", 244 | "text": [ 245 | "Python\n" 246 | ] 247 | } 248 | ], 249 | "source": [ 250 | "python = r.html.find('.python')\n", 251 | "for p in python:\n", 252 | " print(p.text)" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "Podemos buscar um elemento pelo seu **nome**, neste caso `

`" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 15, 265 | "metadata": {}, 266 | "outputs": [ 267 | { 268 | "name": "stdout", 269 | "output_type": "stream", 270 | "text": [ 271 | "Python\n", 272 | "Perl\n", 273 | "PHP\n" 274 | ] 275 | } 276 | ], 277 | "source": [ 278 | "lista = r.html.find('li')\n", 279 | "for l in lista:\n", 280 | " print(l.text)" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "Podemos buscar um elemento `` e obter somente o conteúdo do atributo **'src'**" 288 | ] 289 | }, 290 | { 291 | "cell_type": "code", 292 | "execution_count": 16, 293 | "metadata": {}, 294 | "outputs": [ 295 | { 296 | "data": { 297 | "text/plain": [ 298 | "'https://www.crummy.com/software/BeautifulSoup/bs4/doc/_images/6.1.jpg'" 299 | ] 300 | }, 301 | "execution_count": 16, 302 | "metadata": {}, 303 | "output_type": "execute_result" 304 | } 305 | ], 306 | "source": [ 307 | "r.html.find('img', first=True).attrs['src']" 308 | ] 309 | }, 310 | { 311 | "cell_type": "markdown", 312 | "metadata": {}, 313 | "source": [ 314 | "Buscando todos os elementos `` de nossa página web" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 17, 320 | "metadata": {}, 321 | "outputs": [ 322 | { 323 | "name": "stdout", 324 | "output_type": "stream", 325 | "text": [ 326 | "Alan\n", 327 | "Turing\n", 328 | "alan@turing.com\n", 329 | "John\n", 330 | "von Neumann\n", 331 | "john@voneumann.com\n", 332 | "Blaise\n", 333 | "Pascal\n", 334 | "blaise@pascal.com\n" 335 | ] 336 | } 337 | ], 338 | "source": [ 339 | "td = r.html.find('td')\n", 340 | "for t in td:\n", 341 | " print(t.text)" 342 | ] 343 | } 344 | ], 345 | "metadata": { 346 | "kernelspec": { 347 | "display_name": "Python 3", 348 | "language": "python", 349 | "name": "python3" 350 | }, 351 | "language_info": { 352 | "codemirror_mode": { 353 | "name": "ipython", 354 | "version": 3 355 | }, 356 | "file_extension": ".py", 357 | "mimetype": "text/x-python", 358 | "name": "python", 359 | "nbconvert_exporter": "python", 360 | "pygments_lexer": "ipython3", 361 | "version": "3.7.7" 362 | } 363 | }, 364 | "nbformat": 4, 365 | "nbformat_minor": 4 366 | } 367 | -------------------------------------------------------------------------------- /notebooks/Web Crawler.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Web Crawler\n", 8 | "\n", 9 | "Nosso Web Crawler irá navegar pelas páginas do website **http://quotes.toscrape.com**\n", 10 | "\n", 11 | "Esta aplicação foi desenvolvida especificamente para praticarmos nossos conhecimentos sobre **Web Scraping** e nos servirá de grande auxílio.\n", 12 | "\n", 13 | "Para a construção de nosso Crawler vamos utilizar as bibliotecas **[Requests](https://requests.kennethreitz.org/en/master/)** e **[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)**\n", 14 | "\n", 15 | "Iniciaremos importando as bibliotecas necessárias" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "from bs4 import BeautifulSoup\n", 25 | "import requests" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "Vamos definir uma função chamada **spider()** ao qual:\n", 33 | "\n", 34 | "- Navegará pelo número de páginas máximo especificado por nós via argumento\n", 35 | "- Para cada página, vamos extrair o código HTML\n", 36 | "- Através do nosso objeto soup buscaremos elementos:\n", 37 | " - Representando o autor do quote\n", 38 | " - Representando o texto do quote\n", 39 | "- Por fim incrementamos nossa variável page até alcançarmos o limite máximo de páginas" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 2, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "def spider(max_pages):\n", 49 | " page = 1\n", 50 | " while page < (max_pages + 1):\n", 51 | " url = f'http://quotes.toscrape.com/page/{str(page)}/'\n", 52 | " source_code = requests.get(url)\n", 53 | " plain_text = source_code.text\n", 54 | " soup = BeautifulSoup(plain_text, 'lxml')\n", 55 | " for autor in soup.find_all('small', class_='author'):\n", 56 | " print(autor.text)\n", 57 | " for quote in soup.find_all('span', class_='text'):\n", 58 | " print(quote.text)\n", 59 | " page += 1" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "Executamos nossa função passando como argumento o valor **2** \n", 67 | "\n", 68 | "- O spider irá navegar pelas páginas **http://quotes.toscrape.com/page/1/** e **http://quotes.toscrape.com/page/2/**\n", 69 | "- Serão extraídos todos os quotes e seus respectivos autores das páginas que navegamos" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 3, 75 | "metadata": {}, 76 | "outputs": [ 77 | { 78 | "name": "stdout", 79 | "output_type": "stream", 80 | "text": [ 81 | "Albert Einstein\n", 82 | "J.K. Rowling\n", 83 | "Albert Einstein\n", 84 | "Jane Austen\n", 85 | "Marilyn Monroe\n", 86 | "Albert Einstein\n", 87 | "André Gide\n", 88 | "Thomas A. Edison\n", 89 | "Eleanor Roosevelt\n", 90 | "Steve Martin\n", 91 | "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”\n", 92 | "“It is our choices, Harry, that show what we truly are, far more than our abilities.”\n", 93 | "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”\n", 94 | "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”\n", 95 | "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”\n", 96 | "“Try not to become a man of success. Rather become a man of value.”\n", 97 | "“It is better to be hated for what you are than to be loved for what you are not.”\n", 98 | "“I have not failed. I've just found 10,000 ways that won't work.”\n", 99 | "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”\n", 100 | "“A day without sunshine is like, you know, night.”\n", 101 | "Marilyn Monroe\n", 102 | "J.K. Rowling\n", 103 | "Albert Einstein\n", 104 | "Bob Marley\n", 105 | "Dr. Seuss\n", 106 | "Douglas Adams\n", 107 | "Elie Wiesel\n", 108 | "Friedrich Nietzsche\n", 109 | "Mark Twain\n", 110 | "Allen Saunders\n", 111 | "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”\n", 112 | "“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”\n", 113 | "“If you can't explain it to a six year old, you don't understand it yourself.”\n", 114 | "“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”\n", 115 | "“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”\n", 116 | "“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”\n", 117 | "“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”\n", 118 | "“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”\n", 119 | "“Good friends, good books, and a sleepy conscience: this is the ideal life.”\n", 120 | "“Life is what happens to us while we are making other plans.”\n" 121 | ] 122 | } 123 | ], 124 | "source": [ 125 | "spider(2)" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "### Aperfeiçoando nosso Web Crawler\n", 133 | "\n", 134 | "Acredito que se transformarmos nossa função **spider()** em uma função geradora, podemos guardar nossos dados em um dicionário onde o **nome do autor** representará a **chave** e o **quote** representará o **valor**.\n", 135 | "\n", 136 | "Para isso, vamos criar duas listas, uma para guardarmos os **autores** e outra para guardarmos os **quotes**.\n", 137 | "\n", 138 | "Por fim, utilizamos a palavra-chave **yield** de forma a modificarmos nossa função para que ela se torne um gerador, nos retornando um dicionário com nossos dados mapeados como **chave-valor** através da função **zip()**.\n", 139 | "\n", 140 | "**Importante**: Dicionários aceitam apenas chaves únicas, sendo assim, teremos apenas um Quote de cada Autor" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 12, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "def spider(max_pages):\n", 150 | " page = 1\n", 151 | " autores = []\n", 152 | " quotes = []\n", 153 | " while page < (max_pages + 1):\n", 154 | " url = f'http://quotes.toscrape.com/page/{str(page)}/'\n", 155 | " source_code = requests.get(url)\n", 156 | " plain_text = source_code.text\n", 157 | " soup = BeautifulSoup(plain_text, 'lxml')\n", 158 | " for autor in soup.find_all('small', class_='author'):\n", 159 | " autores.append(autor.text)\n", 160 | " for quote in soup.find_all('span', class_='text'):\n", 161 | " quotes.append(quote.text)\n", 162 | " page += 1\n", 163 | " yield dict(zip(autores, quotes))" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "Obtendo o objeto gerador" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 13, 176 | "metadata": {}, 177 | "outputs": [ 178 | { 179 | "name": "stdout", 180 | "output_type": "stream", 181 | "text": [ 182 | "\n" 183 | ] 184 | } 185 | ], 186 | "source": [ 187 | "crawler = spider(7)\n", 188 | "print(crawler)" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "Através do **for loop**, podemos percorrer os valores de nosso gerador" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 14, 201 | "metadata": {}, 202 | "outputs": [ 203 | { 204 | "name": "stdout", 205 | "output_type": "stream", 206 | "text": [ 207 | "{'Albert Einstein': '“If I were not a physicist, I would probably be a musician. I often think in music. I live my daydreams in music. I see my life in terms of music.”', 'J.K. Rowling': '“Do not pity the dead, Harry. Pity the living, and, above all those who live without love.”', 'Jane Austen': '“There is nothing I would not do for those who are really my friends. I have no notion of loving people by halves, it is not my nature.”', 'Marilyn Monroe': '“I am good, but not an angel. I do sin, but I am not the devil. I am just a small girl in a big world trying to find someone to love.”', 'André Gide': '“It is better to be hated for what you are than to be loved for what you are not.”', 'Thomas A. Edison': \"“I have not failed. I've just found 10,000 ways that won't work.”\", 'Eleanor Roosevelt': '“Do one thing every day that scares you.”', 'Steve Martin': '“A day without sunshine is like, you know, night.”', 'Bob Marley': '“The truth is, everyone is going to hurt you. You just got to find the ones worth suffering for.”', 'Dr. Seuss': \"“I have heard there are troubles of more than one kind. Some come from ahead and some come from behind. But I've bought a big bat. I'm all ready you see. Now my troubles are going to have troubles with me!”\", 'Douglas Adams': '“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”', 'Elie Wiesel': \"“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”\", 'Friedrich Nietzsche': '“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”', 'Mark Twain': '“I have never let my schooling interfere with my education.”', 'Allen Saunders': '“Life is what happens to us while we are making other plans.”', 'Pablo Neruda': '“I love you without knowing how, or when, or from where. I love you simply, without problems or pride: I love you in this way because I do not know any other way of loving but this, in which there is no I or you, so intimate that your hand upon my chest is my hand, so intimate that when I fall asleep your eyes close.”', 'Ralph Waldo Emerson': '“Finish each day and be done with it. You have done what you could. Some blunders and absurdities no doubt crept in; forget them as soon as you can. Tomorrow is a new day. You shall begin it serenely and with too high a spirit to be encumbered with your old nonsense.”', 'Mother Teresa': '“Not all of us can do great things. But we can do small things with great love.”', 'Garrison Keillor': '“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”', 'Jim Henson': '“Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”', 'Charles M. Schulz': \"“All you need is love. But a little chocolate now and then doesn't hurt.”\", 'William Nicholson': \"“We read to know we're not alone.”\", 'Jorge Luis Borges': '“I have always imagined that Paradise will be a kind of library.”', 'George Eliot': '“It is never too late to be what you might have been.”', 'George R.R. Martin': '“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”', 'C.S. Lewis': '“To love at all is to be vulnerable. Love anything and your heart will be wrung and possibly broken. If you want to make sure of keeping it intact you must give it to no one, not even an animal. Wrap it carefully round with hobbies and little luxuries; avoid all entanglements. Lock it up safe in the casket or coffin of your selfishness. But in that casket, safe, dark, motionless, airless, it will change. It will not be broken; it will become unbreakable, impenetrable, irredeemable. To love is to be vulnerable.”', 'Martin Luther King Jr.': '“Only in the darkness can you see the stars.”', 'James Baldwin': '“Love does not begin and end the way we seem to think it does. Love is a battle, love is a war; love is a growing up.”', 'Haruki Murakami': '“If you only read the books that everyone else is reading, you can only think what everyone else is thinking.”', 'Alexandre Dumas fils': '“The difference between genius and stupidity is: genius has its limits.”', 'Stephenie Meyer': \"“He's like a drug for you, Bella.”\", 'Ernest Hemingway': '“There is nothing to writing. All you do is sit down at a typewriter and bleed.”', 'Helen Keller': '“When one door of happiness closes, another opens; but often we look so long at the closed door that we do not see the one which has been opened for us.”', 'George Bernard Shaw': \"“Life isn't about finding yourself. Life is about creating yourself.”\", 'Charles Bukowski': \"“That's the problem with drinking, I thought, as I poured myself a drink. If something bad happens you drink in an attempt to forget; if something good happens you drink in order to celebrate; and if nothing happens you drink to make something happen.”\", 'Suzanne Collins': \"“Remember, we're madly in love, so it's all right to kiss me anytime you feel like it.”\", 'J.R.R. Tolkien': '“Not all those who wander are lost.”'}\n" 208 | ] 209 | } 210 | ], 211 | "source": [ 212 | "for c in crawler:\n", 213 | " print(c)" 214 | ] 215 | }, 216 | { 217 | "cell_type": "markdown", 218 | "metadata": {}, 219 | "source": [ 220 | "### Buscando Quote por Autor" 221 | ] 222 | }, 223 | { 224 | "cell_type": "code", 225 | "execution_count": 15, 226 | "metadata": {}, 227 | "outputs": [ 228 | { 229 | "name": "stdout", 230 | "output_type": "stream", 231 | "text": [ 232 | "“Try not to become a man of success. Rather become a man of value.”\n" 233 | ] 234 | } 235 | ], 236 | "source": [ 237 | "crawler = spider(1)\n", 238 | "\n", 239 | "for c in crawler:\n", 240 | " print(c['Albert Einstein'])" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "### Convertendo nossos resultados para JSON\n", 248 | "\n", 249 | "Para isso será necessário fazermos o import da biblioteca **json** que já vem acoplada na linguagem Python por padrão\n", 250 | "\n", 251 | "Novamente vamos então obter o objeto gerador" 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": 16, 257 | "metadata": {}, 258 | "outputs": [], 259 | "source": [ 260 | "crawler = spider(3)" 261 | ] 262 | }, 263 | { 264 | "cell_type": "markdown", 265 | "metadata": {}, 266 | "source": [ 267 | "Agora vamos percorrer os valores de nosso gerador com o **for loop** e imprimir os respectivos dados no formato JSON" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 17, 273 | "metadata": {}, 274 | "outputs": [ 275 | { 276 | "name": "stdout", 277 | "output_type": "stream", 278 | "text": [ 279 | "{\n", 280 | " \"Albert Einstein\": \"“Logic will get you from A to Z; imagination will get you everywhere.”\",\n", 281 | " \"Allen Saunders\": \"“Life is what happens to us while we are making other plans.”\",\n", 282 | " \"André Gide\": \"“It is better to be hated for what you are than to be loved for what you are not.”\",\n", 283 | " \"Bob Marley\": \"“One good thing about music, when it hits you, you feel no pain.”\",\n", 284 | " \"Douglas Adams\": \"“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”\",\n", 285 | " \"Dr. Seuss\": \"“Today you are You, that is truer than true. There is no one alive who is Youer than You.”\",\n", 286 | " \"Eleanor Roosevelt\": \"“A woman is like a tea bag; you never know how strong it is until it's in hot water.”\",\n", 287 | " \"Elie Wiesel\": \"“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”\",\n", 288 | " \"Friedrich Nietzsche\": \"“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”\",\n", 289 | " \"Garrison Keillor\": \"“Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.”\",\n", 290 | " \"J.K. Rowling\": \"“It is impossible to live without failing at something, unless you live so cautiously that you might as well not have lived at all - in which case, you fail by default.”\",\n", 291 | " \"Jane Austen\": \"“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”\",\n", 292 | " \"Jim Henson\": \"“Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.”\",\n", 293 | " \"Marilyn Monroe\": \"“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”\",\n", 294 | " \"Mark Twain\": \"“Good friends, good books, and a sleepy conscience: this is the ideal life.”\",\n", 295 | " \"Mother Teresa\": \"“If you judge people, you have no time to love them.”\",\n", 296 | " \"Pablo Neruda\": \"“I love you without knowing how, or when, or from where. I love you simply, without problems or pride: I love you in this way because I do not know any other way of loving but this, in which there is no I or you, so intimate that your hand upon my chest is my hand, so intimate that when I fall asleep your eyes close.”\",\n", 297 | " \"Ralph Waldo Emerson\": \"“For every minute you are angry you lose sixty seconds of happiness.”\",\n", 298 | " \"Steve Martin\": \"“A day without sunshine is like, you know, night.”\",\n", 299 | " \"Thomas A. Edison\": \"“I have not failed. I've just found 10,000 ways that won't work.”\"\n", 300 | "}\n" 301 | ] 302 | } 303 | ], 304 | "source": [ 305 | "import json\n", 306 | "\n", 307 | "for c in crawler:\n", 308 | " print(json.dumps(c, sort_keys=True, indent=4, ensure_ascii=False))" 309 | ] 310 | } 311 | ], 312 | "metadata": { 313 | "kernelspec": { 314 | "display_name": "Python 3", 315 | "language": "python", 316 | "name": "python3" 317 | }, 318 | "language_info": { 319 | "codemirror_mode": { 320 | "name": "ipython", 321 | "version": 3 322 | }, 323 | "file_extension": ".py", 324 | "mimetype": "text/x-python", 325 | "name": "python", 326 | "nbconvert_exporter": "python", 327 | "pygments_lexer": "ipython3", 328 | "version": "3.7.7" 329 | } 330 | }, 331 | "nbformat": 4, 332 | "nbformat_minor": 4 333 | } 334 | -------------------------------------------------------------------------------- /notebooks/XML Parsing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# XML Parsing\n", 8 | "\n", 9 | "Neste tutorial, vamos executar alguns experimentos com um arquivo **XML** chamado `livros.xml`\n", 10 | "\n", 11 | "Para tais tarefas, utilizaremos as bibliotecas:\n", 12 | "\n", 13 | "- **Requests:** De forma a obtermos o **arquivo xml** localizado em **https://pythonwebscraping.netlify.app/livros.xml**\n", 14 | "- **Beautiful Soup:** Para executarmos buscas em nosso arquivo\n", 15 | "- **xml:** Para executarmos consultas e manipular os elementos de nosso arquivo" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 338, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "import xml.etree.ElementTree as et \n", 25 | "from bs4 import BeautifulSoup\n", 26 | "import requests" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 339, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "r = requests.get('https://pythonwebscraping.netlify.app/livros.xml')" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "Obtendo a string que representa nosso arquivo" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 340, 48 | "metadata": {}, 49 | "outputs": [ 50 | { 51 | "data": { 52 | "text/plain": [ 53 | "'\\n\\t\\n\\t\\tNeuromancer\\n\\t\\tWilliam Gibson\\n\\t\\t1984\\n\\t\\n\\t\\n\\t\\tNineteen Eighty-Four: A Novel\\n\\t\\tGeorge Orwell\\n\\t\\t1949\\n\\t\\n\\t\\n\\t\\tHow to Think Like a Computer Scientist\\n\\t\\tPeter Wentworth, Jeffrey Elkner, Allen B. Downey, and Chris Meyers\\n\\t\\t2012\\n\\t\\n\\t\\n\\t\\tMaking Games with Python and Pygame\\n\\t\\tAI Sweigart\\n\\t\\t2012\\n\\t\\n\\n'" 54 | ] 55 | }, 56 | "execution_count": 340, 57 | "metadata": {}, 58 | "output_type": "execute_result" 59 | } 60 | ], 61 | "source": [ 62 | "r.text" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "Vamos então guardar a string em uma variável que chamaremos de **xml**" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 341, 75 | "metadata": {}, 76 | "outputs": [ 77 | { 78 | "name": "stdout", 79 | "output_type": "stream", 80 | "text": [ 81 | "\n", 82 | "\t\n", 83 | "\t\tNeuromancer\n", 84 | "\t\tWilliam Gibson\n", 85 | "\t\t1984\n", 86 | "\t\n", 87 | "\t\n", 88 | "\t\tNineteen Eighty-Four: A Novel\n", 89 | "\t\tGeorge Orwell\n", 90 | "\t\t1949\n", 91 | "\t\n", 92 | "\t\n", 93 | "\t\tHow to Think Like a Computer Scientist\n", 94 | "\t\tPeter Wentworth, Jeffrey Elkner, Allen B. Downey, and Chris Meyers\n", 95 | "\t\t2012\n", 96 | "\t\n", 97 | "\t\n", 98 | "\t\tMaking Games with Python and Pygame\n", 99 | "\t\tAI Sweigart\n", 100 | "\t\t2012\n", 101 | "\t\n", 102 | "\n", 103 | "\n" 104 | ] 105 | } 106 | ], 107 | "source": [ 108 | "xml = r.text\n", 109 | "print(xml)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "Vamos passar essa string para o construtor `BeautifulSoup()`, utilizaremos neste caso específico o parser **'xml'**" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 342, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "soup = BeautifulSoup(xml, \"xml\")" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "Podemos também usar o método `prettify()` para imprimir o nosso arquivo" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 343, 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/plain": [ 143 | "'\\n\\n \\n \\n Neuromancer\\n \\n \\n William Gibson\\n \\n \\n 1984\\n \\n \\n \\n \\n Nineteen Eighty-Four: A Novel\\n \\n \\n George Orwell\\n \\n \\n 1949\\n \\n \\n \\n \\n How to Think Like a Computer Scientist\\n \\n \\n Peter Wentworth, Jeffrey Elkner, Allen B. Downey, and Chris Meyers\\n \\n \\n 2012\\n \\n \\n \\n \\n Making Games with Python and Pygame\\n \\n \\n AI Sweigart\\n \\n \\n 2012\\n \\n \\n'" 144 | ] 145 | }, 146 | "execution_count": 343, 147 | "metadata": {}, 148 | "output_type": "execute_result" 149 | } 150 | ], 151 | "source": [ 152 | "soup.prettify()" 153 | ] 154 | }, 155 | { 156 | "cell_type": "markdown", 157 | "metadata": {}, 158 | "source": [ 159 | "Selecionando o conteúdo do elemento ``" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": 344, 165 | "metadata": {}, 166 | "outputs": [ 167 | { 168 | "data": { 169 | "text/plain": [ 170 | "[\n", 171 | " Neuromancer\n", 172 | " William Gibson\n", 173 | " 1984\n", 174 | " ,\n", 175 | " \n", 176 | " Nineteen Eighty-Four: A Novel\n", 177 | " George Orwell\n", 178 | " 1949\n", 179 | " ,\n", 180 | " \n", 181 | " How to Think Like a Computer Scientist\n", 182 | " Peter Wentworth, Jeffrey Elkner, Allen B. Downey, and Chris Meyers\n", 183 | " 2012\n", 184 | " ,\n", 185 | " \n", 186 | " Making Games with Python and Pygame\n", 187 | " AI Sweigart\n", 188 | " 2012\n", 189 | " ]" 190 | ] 191 | }, 192 | "execution_count": 344, 193 | "metadata": {}, 194 | "output_type": "execute_result" 195 | } 196 | ], 197 | "source": [ 198 | "soup.select(\"livro\")" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "Selecionando os elementos ``" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 345, 211 | "metadata": {}, 212 | "outputs": [ 213 | { 214 | "data": { 215 | "text/plain": [ 216 | "[Neuromancer,\n", 217 | " Nineteen Eighty-Four: A Novel,\n", 218 | " How to Think Like a Computer Scientist,\n", 219 | " Making Games with Python and Pygame]" 220 | ] 221 | }, 222 | "execution_count": 345, 223 | "metadata": {}, 224 | "output_type": "execute_result" 225 | } 226 | ], 227 | "source": [ 228 | "soup.select(\"titulo\")" 229 | ] 230 | }, 231 | { 232 | "cell_type": "markdown", 233 | "metadata": {}, 234 | "source": [ 235 | "Podemos utilizar um **for loop** para obtermos somente o texto" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 346, 241 | "metadata": {}, 242 | "outputs": [ 243 | { 244 | "name": "stdout", 245 | "output_type": "stream", 246 | "text": [ 247 | "Neuromancer\n", 248 | "Nineteen Eighty-Four: A Novel\n", 249 | "How to Think Like a Computer Scientist\n", 250 | "Making Games with Python and Pygame\n" 251 | ] 252 | } 253 | ], 254 | "source": [ 255 | "for titulo in soup.select(\"titulo\"):\n", 256 | " print(titulo.text)" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "metadata": {}, 262 | "source": [ 263 | "Podemos buscar pelo atributo **'categoria'**" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 347, 269 | "metadata": {}, 270 | "outputs": [ 271 | { 272 | "data": { 273 | "text/plain": [ 274 | "[\n", 275 | " Neuromancer\n", 276 | " William Gibson\n", 277 | " 1984\n", 278 | " ]" 279 | ] 280 | }, 281 | "execution_count": 347, 282 | "metadata": {}, 283 | "output_type": "execute_result" 284 | } 285 | ], 286 | "source": [ 287 | "soup.find_all(\"livro\", {\"categoria\" : \"Cyber Punk\"})" 288 | ] 289 | }, 290 | { 291 | "cell_type": "markdown", 292 | "metadata": {}, 293 | "source": [ 294 | "Podemos também buscar pelo atributo **'lang'**" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 348, 300 | "metadata": {}, 301 | "outputs": [ 302 | { 303 | "data": { 304 | "text/plain": [ 305 | "[Neuromancer,\n", 306 | " Nineteen Eighty-Four: A Novel,\n", 307 | " How to Think Like a Computer Scientist,\n", 308 | " Making Games with Python and Pygame]" 309 | ] 310 | }, 311 | "execution_count": 348, 312 | "metadata": {}, 313 | "output_type": "execute_result" 314 | } 315 | ], 316 | "source": [ 317 | "soup.find_all(\"titulo\", {\"lang\" : \"en\"})" 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "metadata": {}, 323 | "source": [ 324 | "Agora vamos construir um elemento `xml.etree.ElementTree.Element` através do método `fromstring()`\n", 325 | "\n", 326 | "- Passaremos nossa string xml como argumento\n", 327 | "- A variável será chamada de tree, uma vez que ela representará uma árvore XML" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 349, 333 | "metadata": {}, 334 | "outputs": [ 335 | { 336 | "data": { 337 | "text/plain": [ 338 | "xml.etree.ElementTree.Element" 339 | ] 340 | }, 341 | "execution_count": 349, 342 | "metadata": {}, 343 | "output_type": "execute_result" 344 | } 345 | ], 346 | "source": [ 347 | "tree = et.fromstring(xml)\n", 348 | "type(tree)" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": {}, 354 | "source": [ 355 | "Vejamos os atributos e métodos disponíveis" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 350, 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "name": "stdout", 365 | "output_type": "stream", 366 | "text": [ 367 | "['__class__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'attrib', 'clear', 'extend', 'find', 'findall', 'findtext', 'get', 'getchildren', 'getiterator', 'insert', 'items', 'iter', 'iterfind', 'itertext', 'keys', 'makeelement', 'remove', 'set', 'tag', 'tail', 'text']\n" 368 | ] 369 | } 370 | ], 371 | "source": [ 372 | "print(dir(tree))" 373 | ] 374 | }, 375 | { 376 | "cell_type": "markdown", 377 | "metadata": {}, 378 | "source": [ 379 | "O atributo **tag** nos traz o elemento raíz da árvore (root)" 380 | ] 381 | }, 382 | { 383 | "cell_type": "code", 384 | "execution_count": 351, 385 | "metadata": {}, 386 | "outputs": [ 387 | { 388 | "data": { 389 | "text/plain": [ 390 | "'biblioteca'" 391 | ] 392 | }, 393 | "execution_count": 351, 394 | "metadata": {}, 395 | "output_type": "execute_result" 396 | } 397 | ], 398 | "source": [ 399 | "tree.tag" 400 | ] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "metadata": {}, 405 | "source": [ 406 | "Através do **for loop** podemos buscar as tags e atributos de nossa biblioteca" 407 | ] 408 | }, 409 | { 410 | "cell_type": "code", 411 | "execution_count": 352, 412 | "metadata": {}, 413 | "outputs": [ 414 | { 415 | "name": "stdout", 416 | "output_type": "stream", 417 | "text": [ 418 | "livro {'categoria': 'Cyber Punk'}\n", 419 | "livro {'categoria': 'Distopia'}\n", 420 | "livro {'categoria': 'Ciência da Computação'}\n", 421 | "livro {'categoria': 'Programação'}\n" 422 | ] 423 | } 424 | ], 425 | "source": [ 426 | "for livro in tree.findall(\"livro\"):\n", 427 | " print(livro.tag, livro.attrib)" 428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "metadata": {}, 433 | "source": [ 434 | "Com o **for loop**, podemos acessar somente o conteúdo dos elementos, neste caso:\n", 435 | "\n", 436 | "- Título\n", 437 | "- Ano" 438 | ] 439 | }, 440 | { 441 | "cell_type": "code", 442 | "execution_count": 353, 443 | "metadata": {}, 444 | "outputs": [ 445 | { 446 | "name": "stdout", 447 | "output_type": "stream", 448 | "text": [ 449 | "Neuromancer | 1984\n", 450 | "Nineteen Eighty-Four: A Novel | 1949\n", 451 | "How to Think Like a Computer Scientist | 2012\n", 452 | "Making Games with Python and Pygame | 2012\n" 453 | ] 454 | } 455 | ], 456 | "source": [ 457 | "for livro in tree.findall(\"livro\"):\n", 458 | " titulo = livro.find('titulo').text\n", 459 | " ano = livro.find('ano').text\n", 460 | " print(f'{titulo} | {ano}')" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "Com o for loop, podemos acessar somente o conteúdo dos elementos, neste caso:\n", 468 | "\n", 469 | "- Autor" 470 | ] 471 | }, 472 | { 473 | "cell_type": "code", 474 | "execution_count": 354, 475 | "metadata": {}, 476 | "outputs": [ 477 | { 478 | "name": "stdout", 479 | "output_type": "stream", 480 | "text": [ 481 | "William Gibson\n", 482 | "George Orwell\n", 483 | "Peter Wentworth, Jeffrey Elkner, Allen B. Downey, and Chris Meyers\n", 484 | "AI Sweigart\n" 485 | ] 486 | } 487 | ], 488 | "source": [ 489 | "for livro in tree.findall('livro'):\n", 490 | " autor = livro.find('autor').text\n", 491 | " print(autor)" 492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "metadata": {}, 497 | "source": [ 498 | "### Inserindo um Novo Elemento na Árvore" 499 | ] 500 | }, 501 | { 502 | "cell_type": "markdown", 503 | "metadata": {}, 504 | "source": [ 505 | "Criando um novo sub-elemento livro com o atributo **categoria='Dystopia'**" 506 | ] 507 | }, 508 | { 509 | "cell_type": "code", 510 | "execution_count": 355, 511 | "metadata": {}, 512 | "outputs": [], 513 | "source": [ 514 | "novo_livro = et.SubElement(tree, 'livro', attrib={'categoria':'Dystopia'})" 515 | ] 516 | }, 517 | { 518 | "cell_type": "markdown", 519 | "metadata": {}, 520 | "source": [ 521 | "Criando um novo sub-elemento **titulo**" 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": 356, 527 | "metadata": {}, 528 | "outputs": [], 529 | "source": [ 530 | "novo_livro_titulo = et.SubElement(novo_livro, 'titulo')" 531 | ] 532 | }, 533 | { 534 | "cell_type": "markdown", 535 | "metadata": {}, 536 | "source": [ 537 | "Criando um novo sub-elemento **autor**" 538 | ] 539 | }, 540 | { 541 | "cell_type": "code", 542 | "execution_count": 357, 543 | "metadata": {}, 544 | "outputs": [], 545 | "source": [ 546 | "novo_livro_autor = et.SubElement(novo_livro, 'autor')" 547 | ] 548 | }, 549 | { 550 | "cell_type": "markdown", 551 | "metadata": {}, 552 | "source": [ 553 | "Criando um novo sub-elemento **ano**" 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": 358, 559 | "metadata": {}, 560 | "outputs": [], 561 | "source": [ 562 | "novo_livro_ano = et.SubElement(novo_livro, 'ano')" 563 | ] 564 | }, 565 | { 566 | "cell_type": "markdown", 567 | "metadata": {}, 568 | "source": [ 569 | "Inserindo os respectivos textos em cada elemento" 570 | ] 571 | }, 572 | { 573 | "cell_type": "code", 574 | "execution_count": 359, 575 | "metadata": {}, 576 | "outputs": [], 577 | "source": [ 578 | "novo_livro_titulo.text = 'Brave New World'\n", 579 | "novo_livro_autor.text = 'Aldous Huxley'\n", 580 | "novo_livro_ano.text = '1931'" 581 | ] 582 | }, 583 | { 584 | "cell_type": "markdown", 585 | "metadata": {}, 586 | "source": [ 587 | "Através de um **for loop** vamos percorrer nossa árvore atualizada" 588 | ] 589 | }, 590 | { 591 | "cell_type": "code", 592 | "execution_count": 360, 593 | "metadata": {}, 594 | "outputs": [ 595 | { 596 | "name": "stdout", 597 | "output_type": "stream", 598 | "text": [ 599 | "Neuromancer William Gibson 1984\n", 600 | "Nineteen Eighty-Four: A Novel George Orwell 1949\n", 601 | "How to Think Like a Computer Scientist Peter Wentworth, Jeffrey Elkner, Allen B. Downey, and Chris Meyers 2012\n", 602 | "Making Games with Python and Pygame AI Sweigart 2012\n", 603 | "Brave New World Aldous Huxley 1931\n" 604 | ] 605 | } 606 | ], 607 | "source": [ 608 | "for livro in tree.findall('livro'):\n", 609 | " titulo = livro.find('titulo').text\n", 610 | " autor = livro.find('autor').text\n", 611 | " ano = livro.find('ano').text\n", 612 | " print(f'{titulo} {autor} {ano}')" 613 | ] 614 | }, 615 | { 616 | "cell_type": "markdown", 617 | "metadata": {}, 618 | "source": [ 619 | "Para escrevermos em um arquivo, será necessário construirmos uma `ElementTree()`, para isso vamos passar **'tree'** como argumento para o construtor" 620 | ] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": 361, 625 | "metadata": {}, 626 | "outputs": [], 627 | "source": [ 628 | "root = et.ElementTree(tree)" 629 | ] 630 | }, 631 | { 632 | "cell_type": "markdown", 633 | "metadata": {}, 634 | "source": [ 635 | "Observe que agora temos o método `write()` disponível" 636 | ] 637 | }, 638 | { 639 | "cell_type": "code", 640 | "execution_count": 362, 641 | "metadata": {}, 642 | "outputs": [ 643 | { 644 | "name": "stdout", 645 | "output_type": "stream", 646 | "text": [ 647 | "['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_root', '_setroot', 'find', 'findall', 'findtext', 'getiterator', 'getroot', 'iter', 'iterfind', 'parse', 'write', 'write_c14n']\n" 648 | ] 649 | } 650 | ], 651 | "source": [ 652 | "print(dir(root))" 653 | ] 654 | }, 655 | { 656 | "cell_type": "markdown", 657 | "metadata": {}, 658 | "source": [ 659 | "Salvando os dados em um arquivo de nome `novos_livros.xml`" 660 | ] 661 | }, 662 | { 663 | "cell_type": "code", 664 | "execution_count": 363, 665 | "metadata": {}, 666 | "outputs": [], 667 | "source": [ 668 | "root.write('novos_livros.xml', encoding=\"utf-8\")" 669 | ] 670 | }, 671 | { 672 | "cell_type": "markdown", 673 | "metadata": {}, 674 | "source": [ 675 | "Perceba que o novo arquivo que salvamos não está no formato que desejamos\n", 676 | "\n", 677 | "Podemos solucionar este problema com o método `prettify()` da biblioteca **Beautiful Soup**" 678 | ] 679 | }, 680 | { 681 | "cell_type": "code", 682 | "execution_count": 364, 683 | "metadata": {}, 684 | "outputs": [ 685 | { 686 | "name": "stdout", 687 | "output_type": "stream", 688 | "text": [ 689 | "\n", 690 | "\n", 691 | " \n", 692 | " \n", 693 | " Neuromancer\n", 694 | " \n", 695 | " \n", 696 | " William Gibson\n", 697 | " \n", 698 | " \n", 699 | " 1984\n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " Nineteen Eighty-Four: A Novel\n", 705 | " \n", 706 | " \n", 707 | " George Orwell\n", 708 | " \n", 709 | " \n", 710 | " 1949\n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " How to Think Like a Computer Scientist\n", 716 | " \n", 717 | " \n", 718 | " Peter Wentworth, Jeffrey Elkner, Allen B. Downey, and Chris Meyers\n", 719 | " \n", 720 | " \n", 721 | " 2012\n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " Making Games with Python and Pygame\n", 727 | " \n", 728 | " \n", 729 | " AI Sweigart\n", 730 | " \n", 731 | " \n", 732 | " 2012\n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " Brave New World\n", 738 | " \n", 739 | " \n", 740 | " Aldous Huxley\n", 741 | " \n", 742 | " \n", 743 | " 1931\n", 744 | " \n", 745 | " \n", 746 | "\n" 747 | ] 748 | } 749 | ], 750 | "source": [ 751 | "with open('novos_livros.xml', 'r') as file:\n", 752 | " f = file.read()\n", 753 | "\n", 754 | "soup = BeautifulSoup(f, \"xml\")\n", 755 | "prettify = soup.prettify()\n", 756 | "print(prettify)" 757 | ] 758 | }, 759 | { 760 | "cell_type": "markdown", 761 | "metadata": {}, 762 | "source": [ 763 | "Podemos também usar o método `toprettyxml()` da biblioteca **xml**" 764 | ] 765 | }, 766 | { 767 | "cell_type": "code", 768 | "execution_count": 365, 769 | "metadata": {}, 770 | "outputs": [ 771 | { 772 | "name": "stdout", 773 | "output_type": "stream", 774 | "text": [ 775 | "\n", 776 | "\n", 777 | " \n", 778 | " Neuromancer\n", 779 | " William Gibson\n", 780 | " 1984\n", 781 | " \n", 782 | " \n", 783 | " Nineteen Eighty-Four: A Novel\n", 784 | " George Orwell\n", 785 | " 1949\n", 786 | " \n", 787 | " \n", 788 | " How to Think Like a Computer Scientist\n", 789 | " Peter Wentworth, Jeffrey Elkner, Allen B. Downey, and Chris Meyers\n", 790 | " 2012\n", 791 | " \n", 792 | " \n", 793 | " Making Games with Python and Pygame\n", 794 | " AI Sweigart\n", 795 | " 2012\n", 796 | " \n", 797 | " \n", 798 | " Brave New World\n", 799 | " Aldous Huxley\n", 800 | " 1931\n", 801 | " \n", 802 | "\n" 803 | ] 804 | } 805 | ], 806 | "source": [ 807 | "import xml.dom.minidom\n", 808 | "\n", 809 | "with open('novos_livros.xml', 'r') as file:\n", 810 | " f = file.read()\n", 811 | " pp = lambda data: '\\n'.join([line for line in xml.dom.minidom.parseString(f).toprettyxml(indent=' '*4).split('\\n') if line.strip()])\n", 812 | " print(pp(f))" 813 | ] 814 | }, 815 | { 816 | "cell_type": "markdown", 817 | "metadata": {}, 818 | "source": [ 819 | "E agora finalmente salvamos a versão final de nosso arquivo" 820 | ] 821 | }, 822 | { 823 | "cell_type": "code", 824 | "execution_count": 366, 825 | "metadata": {}, 826 | "outputs": [], 827 | "source": [ 828 | "with open('novos_livros.xml', 'w') as file:\n", 829 | " file.write(pp(f))" 830 | ] 831 | } 832 | ], 833 | "metadata": { 834 | "kernelspec": { 835 | "display_name": "Python 3", 836 | "language": "python", 837 | "name": "python3" 838 | }, 839 | "language_info": { 840 | "codemirror_mode": { 841 | "name": "ipython", 842 | "version": 3 843 | }, 844 | "file_extension": ".py", 845 | "mimetype": "text/x-python", 846 | "name": "python", 847 | "nbconvert_exporter": "python", 848 | "pygments_lexer": "ipython3", 849 | "version": "3.7.7" 850 | } 851 | }, 852 | "nbformat": 4, 853 | "nbformat_minor": 4 854 | } 855 | -------------------------------------------------------------------------------- /notebooks/novos_livros.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Neuromancer 5 | William Gibson 6 | 1984 7 | 8 | 9 | Nineteen Eighty-Four: A Novel 10 | George Orwell 11 | 1949 12 | 13 | 14 | How to Think Like a Computer Scientist 15 | Peter Wentworth, Jeffrey Elkner, Allen B. Downey, and Chris Meyers 16 | 2012 17 | 18 | 19 | Making Games with Python and Pygame 20 | AI Sweigart 21 | 2012 22 | 23 | 24 | Brave New World 25 | Aldous Huxley 26 | 1931 27 | 28 | --------------------------------------------------------------------------------

Linguagens de Programação

Grandes Matemáticos

\\n Web Scraping\\n

\\n Estrutura Básica HTML\\n

\\n Linguagens de Programação\\n

\\n Grandes Matemáticos\\n

Web Scraping

Estrutura B\\xc3\\xa1sica HTML

Linguagens de Programa\\xc3\\xa7\\xc3\\xa3o

Grandes Matem\\xc3\\xa1ticos

Web Scraping

Estrutura Básica HTML

Linguagens de Programação

Grandes Matemáticos

Web Scraping

Estrutura Básica HTML

Linguagens de Programação

Grandes Matemáticos

Python Web Scraping

4 |
5 |

Web Scraping

Estrutura Básica HTML

Python Web Scraping

4 | 5 |

Web Scraping

Estrutura Básica HTML

Linguagens de Programação

Grandes Matemáticos

\\n Web Scraping\\n

\\n Estrutura Básica HTML\\n

\\n Linguagens de Programação\\n

\\n Grandes Matemáticos\\n

Web Scraping

Estrutura B\\xc3\\xa1sica HTML

Linguagens de Programa\\xc3\\xa7\\xc3\\xa3o

Grandes Matem\\xc3\\xa1ticos

Web Scraping

Estrutura Básica HTML

Linguagens de Programação

Grandes Matemáticos

Web Scraping

Estrutura Básica HTML

Linguagens de Programação

Grandes Matemáticos

4 |
5 |