├── LICENSE ├── README.md ├── articles ├── Introdução.md └── Scrapy.md ├── images ├── Avatar.png ├── Scraper.png ├── Spider.png └── WebScraping.png └── notebooks ├── Beautiful Soup.ipynb ├── Fundamentos.ipynb ├── PyQuery.ipynb ├── Requests-HTML.ipynb ├── Web Crawler.ipynb ├── XML Parsing.ipynb └── novos_livros.xml /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Gabriel Felippe 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |
8 | Explorando Técnicas de Web Scraping com Python 9 |
10 | 11 | ## Conteúdo 12 | 13 | 01. [Introdução](https://github.com/the-akira/Python-Web-Scraping/blob/master/articles/Introdu%C3%A7%C3%A3o.md) 14 | 02. [Fundamentos de Web Scraping com Python](https://nbviewer.org/github/the-akira/Python-Web-Scraping/blob/master/notebooks/Fundamentos.ipynb) 15 | 03. [Experimentos com Beautiful Soup](https://nbviewer.org/github/the-akira/Python-Web-Scraping/blob/master/notebooks/Beautiful%20Soup.ipynb) 16 | 04. [Parsing de Documentos XML](https://nbviewer.org/github/the-akira/Python-Web-Scraping/blob/master/notebooks/XML%20Parsing.ipynb) 17 | 05. [A Biblioteca Requests-HTML](https://nbviewer.org/github/the-akira/Python-Web-Scraping/blob/master/notebooks/Requests-HTML.ipynb) 18 | 06. [Construindo um Simples Web Crawler](https://nbviewer.org/github/the-akira/Python-Web-Scraping/blob/master/notebooks/Web%20Crawler.ipynb) 19 | 07. [Analisando Documentos HTML com PyQuery](https://nbviewer.org/github/the-akira/Python-Web-Scraping/blob/master/notebooks/PyQuery.ipynb) 20 | 08. [O Framework Scrapy](https://github.com/the-akira/Python-Web-Scraping/blob/master/articles/Scrapy.md) -------------------------------------------------------------------------------- /articles/Introdução.md: -------------------------------------------------------------------------------- 1 | # Web Scraping 2 | 3 | ## Conteúdo 4 | 5 | 1. [Introdução](#introdução) 6 | 2. [Modus Operandi](#Modus-Operandi) 7 | 3. [Utilidade](#utilidade) 8 | 4. [História](#história) 9 | 5. [Técnicas](#técnicas) 10 | 6. [Python Web Scraping](#Python-Web-Scraping) 11 | 12 | ## Introdução 13 | 14 | **Web Scraping**, **Web Harvesting**, ou **Web Data Extraction** são nomes dados às tècnicas de raspagem de dados utilizadas para extrair dados de sites. O software de scraping da Web pode acessar a **[World Wide Web (WWW)](https://www.w3.org/WWW/)** diretamente usando o **[Hypertext Transfer Protocol (HTTP)](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview)** através de uma linguagem de programação, ou até mesmo através de um navegador da Web. 15 | 16 | Embora **Web Scraping** possa ser feito manualmente por um usuário de software, o termo geralmente se refere a processos automatizados implementados usando um **bot** ou **[web crawler](https://en.wikipedia.org/wiki/Web_crawler)**. É uma forma de cópia, na qual dados específicos são coletados e copiados da Web, geralmente em um banco de dados ou planilha, para recuperação ou análise posterior. 17 | 18 | ## Modus Operandi 19 | 20 |  21 | 22 | O **Web Scraping** de uma página web envolve buscá-la e extraí-la. Buscar é a ação de fazer download de uma página (o que um navegador faz quando você visualiza a página). O **web crawling** é um componente principal dentro do contexto de **Web Scraping**, para buscar páginas para processamento posterior. Uma vez obtida a página, a extração dos dados pode ocorrer. O conteúdo de uma página pode ser analisado, pesquisado, reformatado, seus dados copiados para uma planilha e assim por diante. 23 | 24 | #### O Web Crawler 25 | 26 |Aprendendo Web Scraping com Python,\n", 105 | "\tRequests-HTML,\n", 106 | "\tBeautiful Soup e\n", 107 | "\tScrapy\n", 108 | "
\n", 109 | "“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein
\n", 110 | "Nome | \n", 120 | "Sobrenome | \n", 121 | "|
---|---|---|
Alan | \n", 125 | "Turing | \n", 126 | "alan@turing.com | \n", 127 | "
John | \n", 130 | "von Neumann | \n", 131 | "john@voneumann.com | \n", 132 | "
Blaise | \n", 135 | "Pascal | \n", 136 | "blaise@pascal.com | \n", 137 | "
\\n Aprendendo Web Scraping com\\n \\n Python\\n \\n ,\\n \\n Requests-HTML\\n \\n ,\\n \\n Beautiful Soup\\n \\n e\\n \\n Scrapy\\n \\n
\\n\\n “Logic will get you from A to Z; imagination will get you everywhere.”\\n \\n Albert Einstein\\n \\n
\\n\\n Nome\\n | \\n\\n Sobrenome\\n | \\n\\n Email\\n | \\n
---|---|---|
\\n Alan\\n | \\n\\n Turing\\n | \\n\\n alan@turing.com\\n | \\n
\\n John\\n | \\n\\n von Neumann\\n | \\n\\n john@voneumann.com\\n | \\n
\\n Blaise\\n | \\n\\n Pascal\\n | \\n\\n blaise@pascal.com\\n | \\n
`) da página" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 11, 295 | "metadata": {}, 296 | "outputs": [ 297 | { 298 | "data": { 299 | "text/plain": [ 300 | "
Aprendendo Web Scraping com Python,\n", 301 | "\tRequests-HTML,\n", 302 | "\tBeautiful Soup e\n", 303 | "\tScrapy\n", 304 | "
" 305 | ] 306 | }, 307 | "execution_count": 11, 308 | "metadata": {}, 309 | "output_type": "execute_result" 310 | } 311 | ], 312 | "source": [ 313 | "soup.p" 314 | ] 315 | }, 316 | { 317 | "cell_type": "markdown", 318 | "metadata": {}, 319 | "source": [ 320 | "#### Buscando Elementos\n", 321 | "\n", 322 | "- O método `find_all()` é capaz de buscar elementos.\n", 323 | "- Passamos como argumento **'p'** e ele nos traz todos os parágrafos da página" 324 | ] 325 | }, 326 | { 327 | "cell_type": "code", 328 | "execution_count": 12, 329 | "metadata": {}, 330 | "outputs": [ 331 | { 332 | "data": { 333 | "text/plain": [ 334 | "[Aprendendo Web Scraping com Python,\n", 335 | " \tRequests-HTML,\n", 336 | " \tBeautiful Soup e\n", 337 | " \tScrapy\n", 338 | "
,\n", 339 | "“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein
]" 340 | ] 341 | }, 342 | "execution_count": 12, 343 | "metadata": {}, 344 | "output_type": "execute_result" 345 | } 346 | ], 347 | "source": [ 348 | "soup.find_all('p')" 349 | ] 350 | }, 351 | { 352 | "cell_type": "markdown", 353 | "metadata": {}, 354 | "source": [ 355 | "Passamos como argumento **'a'** e ele nos retorna todos os links da página" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 13, 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "data": { 365 | "text/plain": [ 366 | "[Python,\n", 367 | " Requests-HTML,\n", 368 | " Beautiful Soup,\n", 369 | " Scrapy]" 370 | ] 371 | }, 372 | "execution_count": 13, 373 | "metadata": {}, 374 | "output_type": "execute_result" 375 | } 376 | ], 377 | "source": [ 378 | "soup.find_all('a')" 379 | ] 380 | }, 381 | { 382 | "cell_type": "markdown", 383 | "metadata": {}, 384 | "source": [ 385 | "Ao passarmos **'li'** como argumento, nos serão trazidos todos itens de lista" 386 | ] 387 | }, 388 | { 389 | "cell_type": "code", 390 | "execution_count": 14, 391 | "metadata": {}, 392 | "outputs": [ 393 | { 394 | "data": { 395 | "text/plain": [ 396 | "[Aprendendo Web Scraping com Python,\\n\\tRequests-HTML,\\n\\tBeautiful Soup e\\n\\tScrapy\\n\\t
\\n\\n\\t\\xe2\\x80\\x9cLogic will get you from A to Z; imagination will get you everywhere.\\xe2\\x80\\x9d Albert Einstein
\\n\\t\\n\\tNome | \\n\\t\\t\\tSobrenome | \\n\\t\\t\\t|
---|---|---|
Alan | \\n\\t\\t\\tTuring | \\n\\t\\t\\talan@turing.com | \\n\\t\\t
John | \\n\\t\\t\\tvon Neumann | \\n\\t\\t\\tjohn@voneumann.com | \\n\\t\\t
Blaise | \\n\\t\\t\\tPascal | \\n\\t\\t\\tblaise@pascal.com | \\n\\t\\t
Aprendendo Web Scraping com Python,\n", 253 | "\tRequests-HTML,\n", 254 | "\tBeautiful Soup e\n", 255 | "\tScrapy\n", 256 | "\t
\n", 257 | "\n", 258 | "\t“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein
\n", 259 | "\t\n", 260 | "\tNome | \n", 271 | "\t\t\tSobrenome | \n", 272 | "\t\t\t|
---|---|---|
Alan | \n", 276 | "\t\t\tTuring | \n", 277 | "\t\t\talan@turing.com | \n", 278 | "\t\t
John | \n", 281 | "\t\t\tvon Neumann | \n", 282 | "\t\t\tjohn@voneumann.com | \n", 283 | "\t\t
Blaise | \n", 286 | "\t\t\tPascal | \n", 287 | "\t\t\tblaise@pascal.com | \n", 288 | "\t\t
<p.*?>(.*?)</p>
\", html, flags=re.DOTALL) " 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 40, 377 | "metadata": {}, 378 | "outputs": [ 379 | { 380 | "name": "stdout", 381 | "output_type": "stream", 382 | "text": [ 383 | "['Aprendendo Web Scraping com Python,\\n\\tRequests-HTML,\\n\\tBeautiful Soup e\\n\\tScrapy\\n\\t', '“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein']\n" 384 | ] 385 | } 386 | ], 387 | "source": [ 388 | "print(p)" 389 | ] 390 | }, 391 | { 392 | "cell_type": "markdown", 393 | "metadata": {}, 394 | "source": [ 395 | "#### Links da Página" 396 | ] 397 | }, 398 | { 399 | "cell_type": "code", 400 | "execution_count": 35, 401 | "metadata": {}, 402 | "outputs": [], 403 | "source": [ 404 | "a = re.findall(r'href=[\\'\"]?([^\\'\" >]+)', html)" 405 | ] 406 | }, 407 | { 408 | "cell_type": "code", 409 | "execution_count": 37, 410 | "metadata": {}, 411 | "outputs": [ 412 | { 413 | "data": { 414 | "text/plain": [ 415 | "['https://i.imgur.com/QOVnf5D.png',\n", 416 | " 'https://www.python.org/',\n", 417 | " 'https://github.com/psf/requests-html',\n", 418 | " 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/',\n", 419 | " 'https://scrapy.org/']" 420 | ] 421 | }, 422 | "execution_count": 37, 423 | "metadata": {}, 424 | "output_type": "execute_result" 425 | } 426 | ], 427 | "source": [ 428 | "a" 429 | ] 430 | }, 431 | { 432 | "cell_type": "markdown", 433 | "metadata": {}, 434 | "source": [ 435 | "#### Emails da Página" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": 38, 441 | "metadata": {}, 442 | "outputs": [], 443 | "source": [ 444 | "emails = re.findall(r'([\\d\\w\\.]+@[\\d\\w\\.\\-]+\\.\\w+)', html)" 445 | ] 446 | }, 447 | { 448 | "cell_type": "code", 449 | "execution_count": 39, 450 | "metadata": {}, 451 | "outputs": [ 452 | { 453 | "data": { 454 | "text/plain": [ 455 | "['alan@turing.com', 'john@voneumann.com', 'blaise@pascal.com']" 456 | ] 457 | }, 458 | "execution_count": 39, 459 | "metadata": {}, 460 | "output_type": "execute_result" 461 | } 462 | ], 463 | "source": [ 464 | "emails" 465 | ] 466 | } 467 | ], 468 | "metadata": { 469 | "kernelspec": { 470 | "display_name": "Python 3", 471 | "language": "python", 472 | "name": "python3" 473 | }, 474 | "language_info": { 475 | "codemirror_mode": { 476 | "name": "ipython", 477 | "version": 3 478 | }, 479 | "file_extension": ".py", 480 | "mimetype": "text/x-python", 481 | "name": "python", 482 | "nbconvert_exporter": "python", 483 | "pygments_lexer": "ipython3", 484 | "version": "3.7.7" 485 | } 486 | }, 487 | "nbformat": 4, 488 | "nbformat_minor": 4 489 | } 490 | -------------------------------------------------------------------------------- /notebooks/PyQuery.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# A Biblioteca PyQuery\n", 8 | "\n", 9 | "PyQuery nos permite executarmos consultas **jQuery** em documentos XML. 
A API é bastante similar à biblioteca **[jQuery](https://jquery.com/)**.\n", 10 | "\n", 11 | "PyQuery utiliza **[lxml](https://lxml.de/)** para manipulação rápida de documentos XML e HTML.\n", 12 | "\n", 13 | "Você pode conhecer mais detalhes sobre PyQuery em sua **[Documentação](https://pythonhosted.org/pyquery/)**\n", 14 | "\n", 15 | "Vamos agora executar alguns experimentos utilizando nossa página de testes: **[pythonwebscraping.netlify.app](https://pythonwebscraping.netlify.app)**" 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "Iniciamos importando a biblioteca PyQuery com a abreviatura **pq**" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 2, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "from pyquery import PyQuery as pq" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "Fazemos então a requisição de nosso documento HTML" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 3, 44 | "metadata": {}, 45 | "outputs": [ 46 | { 47 | "name": "stdout", 48 | "output_type": "stream", 49 | "text": [ 50 | "\n", 51 | "\n", 52 | "\t\n", 53 | "\tAprendendo Web Scraping com Python,\n", 77 | "\tRequests-HTML,\n", 78 | "\tBeautiful Soup e\n", 79 | "\tScrapy\n", 80 | "\t
\n", 81 | "\n", 82 | "\t“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein
\n", 83 | "\t\n", 84 | "\tNome | \n", 95 | "\t\t\tSobrenome | \n", 96 | "\t\t\t|
---|---|---|
Alan | \n", 100 | "\t\t\tTuring | \n", 101 | "\t\t\talan@turing.com | \n", 102 | "\t\t
John | \n", 105 | "\t\t\tvon Neumann | \n", 106 | "\t\t\tjohn@voneumann.com | \n", 107 | "\t\t
Blaise | \n", 110 | "\t\t\tPascal | \n", 111 | "\t\t\tblaise@pascal.com | \n", 112 | "\t\t
`"
432 | ]
433 | },
434 | {
435 | "cell_type": "code",
436 | "execution_count": 16,
437 | "metadata": {},
438 | "outputs": [
439 | {
440 | "data": {
441 | "text/plain": [
442 | "[, , , , ]"
443 | ]
444 | },
445 | "execution_count": 16,
446 | "metadata": {},
447 | "output_type": "execute_result"
448 | }
449 | ],
450 | "source": [
451 | "d('p').children() "
452 | ]
453 | },
454 | {
455 | "cell_type": "markdown",
456 | "metadata": {},
457 | "source": [
458 | "Selecionando todos os elementos ` `"
459 | ]
460 | },
461 | {
462 | "cell_type": "code",
463 | "execution_count": 58,
464 | "metadata": {},
465 | "outputs": [
466 | {
467 | "data": {
468 | "text/plain": [
469 | "[ , ]"
470 | ]
471 | },
472 | "execution_count": 58,
473 | "metadata": {},
474 | "output_type": "execute_result"
475 | }
476 | ],
477 | "source": [
478 | "d('p') "
479 | ]
480 | },
481 | {
482 | "cell_type": "markdown",
483 | "metadata": {},
484 | "source": [
485 | "Pesquisando a existência de elementos `` dentro de ` `"
486 | ]
487 | },
488 | {
489 | "cell_type": "code",
490 | "execution_count": 18,
491 | "metadata": {},
492 | "outputs": [
493 | {
494 | "data": {
495 | "text/plain": [
496 | "[, , , ]"
497 | ]
498 | },
499 | "execution_count": 18,
500 | "metadata": {},
501 | "output_type": "execute_result"
502 | }
503 | ],
504 | "source": [
505 | "d('p').find('a') "
506 | ]
507 | },
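Vale notar a diferença entre `children()` e `find()`: `children()` retorna apenas os filhos diretos do elemento, enquanto `find()` percorre todos os descendentes que casam com o seletor. Um esboço mínimo, usando um HTML hipotético apenas para ilustrar a diferença:

```python
from pyquery import PyQuery as pq

# HTML hipotético: o <a> é neto (e não filho direto) do <p>
doc = pq('<p><span>Feito com <a href="https://www.python.org/">Python</a></span></p>')

print(doc('p').children())  # apenas o <span>, que é filho direto
print(doc('p').find('a'))   # encontra o <a>, mesmo aninhado dentro do <span>
```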
508 | {
509 | "cell_type": "markdown",
510 | "metadata": {},
511 | "source": [
512 | "Novamento, podemos obter os links com um **for loop**"
513 | ]
514 | },
515 | {
516 | "cell_type": "code",
517 | "execution_count": 19,
518 | "metadata": {},
519 | "outputs": [
520 | {
521 | "name": "stdout",
522 | "output_type": "stream",
523 | "text": [
524 | "https://www.python.org/\n",
525 | "https://github.com/psf/requests-html\n",
526 | "https://www.crummy.com/software/BeautifulSoup/bs4/doc/\n",
527 | "https://scrapy.org/\n"
528 | ]
529 | }
530 | ],
531 | "source": [
532 | "for l in d('p').find('a'):\n",
533 | " print(l.attrib['href'])"
534 | ]
535 | },
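Se quisermos o texto e o atributo `href` de cada link ao mesmo tempo, uma alternativa ao loop acima é o método `items()`, que itera retornando objetos PyQuery em vez de elementos lxml. Um esboço, assumindo o mesmo documento `d` construído nas células anteriores:

```python
# 'd' é o documento PyQuery já carregado nas células anteriores
links = [(a.text(), a.attr('href')) for a in d('p').find('a').items()]
for texto, url in links:
    print(f'{texto}: {url}')
```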
536 | {
537 | "cell_type": "markdown",
538 | "metadata": {},
539 | "source": [
540 | "Verificando a presença de determinada classe no documento"
541 | ]
542 | },
543 | {
544 | "cell_type": "code",
545 | "execution_count": 68,
546 | "metadata": {},
547 | "outputs": [
548 | {
549 | "data": {
550 | "text/plain": [
551 | "True"
552 | ]
553 | },
554 | "execution_count": 68,
555 | "metadata": {},
556 | "output_type": "execute_result"
557 | }
558 | ],
559 | "source": [
560 | " d('li').hasClass('python')"
561 | ]
562 | },
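Como as consultas aceitam seletores CSS, também é possível selecionar diretamente por classe, em vez de apenas verificar com `hasClass()`. Um esboço, assumindo que algum `<li>` da página realmente usa a classe `python` (como a célula acima sugere):

```python
# Seleção direta por seletor CSS, reutilizando o documento 'd' das células anteriores
print(d('li.python'))       # itens de lista que possuem a classe 'python'
print(d('.python').text())  # texto de qualquer elemento com a classe 'python'
```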
563 | {
564 | "cell_type": "markdown",
565 | "metadata": {},
566 | "source": [
567 | "É possível selecionarmos um elemento e assim obter sua representação HTML com o método `outerHtml()`"
568 | ]
569 | },
570 | {
571 | "cell_type": "code",
572 | "execution_count": 20,
573 | "metadata": {},
574 | "outputs": [
575 | {
576 | "data": {
577 | "text/plain": [
578 | "'
Nome | \\n\\t\\t\\tSobrenome | \\n\\t\\t\\t|
---|---|---|
Alan | \\n\\t\\t\\tTuring | \\n\\t\\t\\talan@turing.com | \\n\\t\\t
John | \\n\\t\\t\\tvon Neumann | \\n\\t\\t\\tjohn@voneumann.com | \\n\\t\\t
Blaise | \\n\\t\\t\\tPascal | \\n\\t\\t\\tblaise@pascal.com | \\n\\t\\t
` do documento" 595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "execution_count": 21, 600 | "metadata": {}, 601 | "outputs": [ 602 | { 603 | "data": { 604 | "text/plain": [ 605 | "[
,
]"
606 | ]
607 | },
608 | "execution_count": 21,
609 | "metadata": {},
610 | "output_type": "execute_result"
611 | }
612 | ],
613 | "source": [
614 | "d('p')"
615 | ]
616 | },
617 | {
618 | "cell_type": "markdown",
619 | "metadata": {},
620 | "source": [
621 | "Filtrando elementos por posição"
622 | ]
623 | },
624 | {
625 | "cell_type": "code",
626 | "execution_count": 22,
627 | "metadata": {},
628 | "outputs": [
629 | {
630 | "data": {
631 | "text/plain": [
632 | "'Aprendendo Web Scraping com Python,\\n\\tRequests-HTML,\\n\\tBeautiful Soup e\\n\\tScrapy\\n\\t'"
633 | ]
634 | },
635 | "execution_count": 22,
636 | "metadata": {},
637 | "output_type": "execute_result"
638 | }
639 | ],
640 | "source": [
641 | "d('p').filter(lambda i: i == 0).html()"
642 | ]
643 | },
644 | {
645 | "cell_type": "markdown",
646 | "metadata": {},
647 | "source": [
648 | "Filtrando elementos por posição"
649 | ]
650 | },
651 | {
652 | "cell_type": "code",
653 | "execution_count": 23,
654 | "metadata": {},
655 | "outputs": [
656 | {
657 | "data": {
658 | "text/plain": [
659 | "'“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein'"
660 | ]
661 | },
662 | "execution_count": 23,
663 | "metadata": {},
664 | "output_type": "execute_result"
665 | }
666 | ],
667 | "source": [
668 | "d('p').filter(lambda i: i == 1).html()"
669 | ]
670 | },
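Para selecionar por posição, o método `eq()` costuma ser mais direto do que `filter()` com um lambda. Um esboço equivalente às duas células anteriores, assumindo o mesmo documento `d`:

```python
# eq(n) retorna o elemento de índice n do conjunto selecionado
primeiro = d('p').eq(0).html()  # mesmo resultado de filter(lambda i: i == 0)
segundo = d('p').eq(1).html()   # mesmo resultado de filter(lambda i: i == 1)
print(primeiro)
print(segundo)
```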
671 | {
672 | "cell_type": "markdown",
673 | "metadata": {},
674 | "source": [
675 | "Filtrando elementos por seu texto"
676 | ]
677 | },
678 | {
679 | "cell_type": "code",
680 | "execution_count": 24,
681 | "metadata": {},
682 | "outputs": [
683 | {
684 | "data": {
685 | "text/plain": [
686 | "'“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein'"
687 | ]
688 | },
689 | "execution_count": 24,
690 | "metadata": {},
691 | "output_type": "execute_result"
692 | }
693 | ],
694 | "source": [
695 | "d('p').filter(lambda i: pq(this).text() == '“Logic will get you from A to Z; imagination will get you everywhere.” Albert Einstein').html()"
696 | ]
697 | },
698 | {
699 | "cell_type": "markdown",
700 | "metadata": {},
701 | "source": [
702 | "Filtrando elementos por seu texto"
703 | ]
704 | },
705 | {
706 | "cell_type": "code",
707 | "execution_count": 25,
708 | "metadata": {},
709 | "outputs": [
710 | {
711 | "data": {
712 | "text/plain": [
713 | "'Aprendendo Web Scraping com Python,\\n\\tRequests-HTML,\\n\\tBeautiful Soup e\\n\\tScrapy\\n\\t'"
714 | ]
715 | },
716 | "execution_count": 25,
717 | "metadata": {},
718 | "output_type": "execute_result"
719 | }
720 | ],
721 | "source": [
722 | "d('p').filter(lambda i: pq(this).text() == 'Aprendendo Web Scraping com Python, Requests-HTML, Beautiful Soup e Scrapy').html()"
723 | ]
724 | }
725 | ],
726 | "metadata": {
727 | "kernelspec": {
728 | "display_name": "Python 3",
729 | "language": "python",
730 | "name": "python3"
731 | },
732 | "language_info": {
733 | "codemirror_mode": {
734 | "name": "ipython",
735 | "version": 3
736 | },
737 | "file_extension": ".py",
738 | "mimetype": "text/x-python",
739 | "name": "python",
740 | "nbconvert_exporter": "python",
741 | "pygments_lexer": "ipython3",
742 | "version": "3.7.7"
743 | }
744 | },
745 | "nbformat": 4,
746 | "nbformat_minor": 4
747 | }
748 |
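Para fechar, um esboço de como os recursos vistos no notebook poderiam ser combinados para transformar a tabela da página de testes em uma lista de dicionários — os nomes de variáveis são apenas ilustrativos:

```python
import requests
from pyquery import PyQuery as pq

html = requests.get('https://pythonwebscraping.netlify.app').text
d = pq(html)

pessoas = []
for linha in d('table tr').items():
    celulas = [td.text() for td in linha.find('td').items()]
    if len(celulas) == 3:  # considera apenas as linhas com 3 células de dados
        nome, sobrenome, email = celulas
        pessoas.append({'nome': nome, 'sobrenome': sobrenome, 'email': email})

print(pessoas)
```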
--------------------------------------------------------------------------------
/notebooks/Requests-HTML.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# A Biblioteca Requests-HTML\n",
8 | "\n",
9 | "**[Requests-HTML](https://requests.readthedocs.io/projects/requests-html/en/latest/)** é uma biblioteca que tem como objetivo tornar a análise de HTML (por exemplo: **Web Scraping**) o mais simples e intuitiva possível.\n",
10 | "\n",
11 | "Para fazermos a instalação dela é muito simples, basta executarmos o comando:\n",
12 | "\n",
13 | "`pip install requests-html`\n",
14 | "\n",
15 | "Uma vez instalada, já podemos importá-la para começarmos nossos experimentos"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "execution_count": 4,
21 | "metadata": {},
22 | "outputs": [],
23 | "source": [
24 | "from requests_html import HTMLSession"
25 | ]
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {},
30 | "source": [
31 | "Iniciamos construindo o objeto `HTMLSession`"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 5,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "session = HTMLSession()"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "Verificamos o tipo dele e confirmamos que se trata de um objeto `requests_html.HTMLSession`"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": 6,
53 | "metadata": {},
54 | "outputs": [
55 | {
56 | "data": {
57 | "text/plain": [
58 | "requests_html.HTMLSession"
59 | ]
60 | },
61 | "execution_count": 6,
62 | "metadata": {},
63 | "output_type": "execute_result"
64 | }
65 | ],
66 | "source": [
67 | "type(session)"
68 | ]
69 | },
70 | {
71 | "cell_type": "markdown",
72 | "metadata": {},
73 | "source": [
74 | "Com o método `dir()` investigamos os atributos e métodos disponíveis para trabalharmos"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": 7,
80 | "metadata": {},
81 | "outputs": [
82 | {
83 | "name": "stdout",
84 | "output_type": "stream",
85 | "text": [
86 | "['_BaseSession__browser_args', '__attrs__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'adapters', 'auth', 'browser', 'cert', 'close', 'cookies', 'delete', 'get', 'get_adapter', 'get_redirect_target', 'head', 'headers', 'hooks', 'max_redirects', 'merge_environment_settings', 'mount', 'options', 'params', 'patch', 'post', 'prepare_request', 'proxies', 'put', 'rebuild_auth', 'rebuild_method', 'rebuild_proxies', 'request', 'resolve_redirects', 'response_hook', 'send', 'should_strip_auth', 'stream', 'trust_env', 'verify']\n"
87 | ]
88 | }
89 | ],
90 | "source": [
91 | "print(dir(session))"
92 | ]
93 | },
94 | {
95 | "cell_type": "markdown",
96 | "metadata": {},
97 | "source": [
98 | "Executando uma requisição GET para obtermos o conteúdo do Website: `pythonwebscraping.netlify.com`"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": 9,
104 | "metadata": {},
105 | "outputs": [],
106 | "source": [
107 | "r = session.get('https://pythonwebscraping.netlify.app')"
108 | ]
109 | },
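Antes de extrair qualquer coisa, pode valer a pena conferir se a requisição deu certo. Um esboço simples, usando o mesmo objeto de resposta `r` criado acima:

```python
# A resposta herda a interface do Requests, então podemos checar o status
print(r.status_code)  # 200 indica sucesso
r.raise_for_status()  # lança uma exceção se o status indicar erro (4xx/5xx)
```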
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "`r.html.links` representa um conjunto (**set**) com todos os links da página"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": 10,
120 | "metadata": {},
121 | "outputs": [
122 | {
123 | "name": "stdout",
124 | "output_type": "stream",
125 | "text": [
126 | "` e obter somente o conteúdo do atributo **'src'**"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": 16,
293 | "metadata": {},
294 | "outputs": [
295 | {
296 | "data": {
297 | "text/plain": [
298 | "'https://www.crummy.com/software/BeautifulSoup/bs4/doc/_images/6.1.jpg'"
299 | ]
300 | },
301 | "execution_count": 16,
302 | "metadata": {},
303 | "output_type": "execute_result"
304 | }
305 | ],
306 | "source": [
307 | "r.html.find('img', first=True).attrs['src']"
308 | ]
309 | },
310 | {
311 | "cell_type": "markdown",
312 | "metadata": {},
313 | "source": [
314 | "Buscando todos os elementos `
<td>` de nossa página web"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": 17,
320 | "metadata": {},
321 | "outputs": [
322 | {
323 | "name": "stdout",
324 | "output_type": "stream",
325 | "text": [
326 | "Alan\n",
327 | "Turing\n",
328 | "alan@turing.com\n",
329 | "John\n",
330 | "von Neumann\n",
331 | "john@voneumann.com\n",
332 | "Blaise\n",
333 | "Pascal\n",
334 | "blaise@pascal.com\n"
335 | ]
336 | }
337 | ],
338 | "source": [
339 | "td = r.html.find('td')\n",
340 | "for t in td:\n",
341 | " print(t.text)"
342 | ]
343 | }
344 | ],
345 | "metadata": {
346 | "kernelspec": {
347 | "display_name": "Python 3",
348 | "language": "python",
349 | "name": "python3"
350 | },
351 | "language_info": {
352 | "codemirror_mode": {
353 | "name": "ipython",
354 | "version": 3
355 | },
356 | "file_extension": ".py",
357 | "mimetype": "text/x-python",
358 | "name": "python",
359 | "nbconvert_exporter": "python",
360 | "pygments_lexer": "ipython3",
361 | "version": "3.7.7"
362 | }
363 | },
364 | "nbformat": 4,
365 | "nbformat_minor": 4
366 | }
367 |
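Alguns outros recursos da biblioteca complementam o que foi visto acima. Um esboço, assumindo a mesma página de testes usada no notebook:

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://pythonwebscraping.netlify.app')

# Links já resolvidos para URLs absolutas
print(r.html.absolute_links)

# find() aceita o parâmetro 'containing' para filtrar pelo texto do elemento
paragrafo = r.html.find('p', containing='Logic', first=True)
print(paragrafo.text)

# Também é possível usar expressões XPath
print(r.html.xpath('//title/text()'))
```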
--------------------------------------------------------------------------------
/notebooks/Web Crawler.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Web Crawler\n",
8 | "\n",
9 | "Nosso Web Crawler irá navegar pelas páginas do website **http://quotes.toscrape.com**\n",
10 | "\n",
11 | "Esta aplicação foi desenvolvida especificamente para praticarmos nossos conhecimentos sobre **Web Scraping** e nos servirá de grande auxílio.\n",
12 | "\n",
13 | "Para a construção de nosso Crawler vamos utilizar as bibliotecas **[Requests](https://requests.kennethreitz.org/en/master/)** e **[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)**\n",
14 | "\n",
15 | "Iniciaremos importando as bibliotecas necessárias"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "execution_count": 1,
21 | "metadata": {},
22 | "outputs": [],
23 | "source": [
24 | "from bs4 import BeautifulSoup\n",
25 | "import requests"
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "Vamos definir uma função chamada **spider()** ao qual:\n",
33 | "\n",
34 | "- Navegará pelo número de páginas máximo especificado por nós via argumento\n",
35 | "- Para cada página, vamos extrair o código HTML\n",
36 | "- Através do nosso objeto soup buscaremos elementos:\n",
37 | " - Representando o autor do quote\n",
38 | " - Representando o texto do quote\n",
39 | "- Por fim incrementamos nossa variável page até alcançarmos o limite máximo de páginas"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 2,
45 | "metadata": {},
46 | "outputs": [],
47 | "source": [
48 | "def spider(max_pages):\n",
49 | " page = 1\n",
50 | " while page < (max_pages + 1):\n",
51 | " url = f'http://quotes.toscrape.com/page/{str(page)}/'\n",
52 | " source_code = requests.get(url)\n",
53 | " plain_text = source_code.text\n",
54 | " soup = BeautifulSoup(plain_text, 'lxml')\n",
55 | " for autor in soup.find_all('small', class_='author'):\n",
56 | " print(autor.text)\n",
57 | " for quote in soup.find_all('span', class_='text'):\n",
58 | " print(quote.text)\n",
59 | " page += 1"
60 | ]
61 | },
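Em um crawler real, é prudente verificar o status de cada resposta e aguardar um pequeno intervalo entre as requisições, para não sobrecarregar o servidor. Um esboço dessa variação, apenas ilustrativo:

```python
import time

import requests
from bs4 import BeautifulSoup

def spider_educado(max_pages, intervalo=1.0):
    for page in range(1, max_pages + 1):
        url = f'http://quotes.toscrape.com/page/{page}/'
        resposta = requests.get(url)
        resposta.raise_for_status()  # interrompe se o servidor retornar erro
        soup = BeautifulSoup(resposta.text, 'lxml')
        for autor in soup.find_all('small', class_='author'):
            print(autor.text)
        for quote in soup.find_all('span', class_='text'):
            print(quote.text)
        time.sleep(intervalo)  # pausa entre uma página e outra
```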
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "Executamos nossa função passando como argumento o valor **2** \n",
67 | "\n",
68 | "- O spider irá navegar pelas páginas **http://quotes.toscrape.com/page/1/** e **http://quotes.toscrape.com/page/2/**\n",
69 | "- Serão extraídos todos os quotes e seus respectivos autores das páginas que navegamos"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": 3,
75 | "metadata": {},
76 | "outputs": [
77 | {
78 | "name": "stdout",
79 | "output_type": "stream",
80 | "text": [
81 | "Albert Einstein\n",
82 | "J.K. Rowling\n",
83 | "Albert Einstein\n",
84 | "Jane Austen\n",
85 | "Marilyn Monroe\n",
86 | "Albert Einstein\n",
87 | "André Gide\n",
88 | "Thomas A. Edison\n",
89 | "Eleanor Roosevelt\n",
90 | "Steve Martin\n",
91 | "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”\n",
92 | "“It is our choices, Harry, that show what we truly are, far more than our abilities.”\n",
93 | "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”\n",
94 | "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”\n",
95 | "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”\n",
96 | "“Try not to become a man of success. Rather become a man of value.”\n",
97 | "“It is better to be hated for what you are than to be loved for what you are not.”\n",
98 | "“I have not failed. I've just found 10,000 ways that won't work.”\n",
99 | "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”\n",
100 | "“A day without sunshine is like, you know, night.”\n",
101 | "Marilyn Monroe\n",
102 | "J.K. Rowling\n",
103 | "Albert Einstein\n",
104 | "Bob Marley\n",
105 | "Dr. Seuss\n",
106 | "Douglas Adams\n",
107 | "Elie Wiesel\n",
108 | "Friedrich Nietzsche\n",
109 | "Mark Twain\n",
110 | "Allen Saunders\n",
111 | "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”\n",
112 | "“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”\n",
113 | "“If you can't explain it to a six year old, you don't understand it yourself.”\n",
114 | "“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”\n",
115 | "“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”\n",
116 | "“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”\n",
117 | "“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”\n",
118 | "“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”\n",
119 | "“Good friends, good books, and a sleepy conscience: this is the ideal life.”\n",
120 | "“Life is what happens to us while we are making other plans.”\n"
121 | ]
122 | }
123 | ],
124 | "source": [
125 | "spider(2)"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "### Aperfeiçoando nosso Web Crawler\n",
133 | "\n",
134 | "Acredito que se transformarmos nossa função **spider()** em uma função geradora, podemos guardar nossos dados em um dicionário onde o **nome do autor** representará a **chave** e o **quote** representará o **valor**.\n",
135 | "\n",
136 | "Para isso, vamos criar duas listas, uma para guardarmos os **autores** e outra para guardarmos os **quotes**.\n",
137 | "\n",
138 | "Por fim, utilizamos a palavra-chave **yield** de forma a modificarmos nossa função para que ela se torne um gerador, nos retornando um dicionário com nossos dados mapeados como **chave-valor** através da função **zip()**.\n",
139 | "\n",
140 | "**Importante**: Dicionários aceitam apenas chaves únicas, sendo assim, teremos apenas um Quote de cada Autor"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 12,
146 | "metadata": {},
147 | "outputs": [],
148 | "source": [
149 | "def spider(max_pages):\n",
150 | " page = 1\n",
151 | " autores = []\n",
152 | " quotes = []\n",
153 | " while page < (max_pages + 1):\n",
154 | " url = f'http://quotes.toscrape.com/page/{str(page)}/'\n",
155 | " source_code = requests.get(url)\n",
156 | " plain_text = source_code.text\n",
157 | " soup = BeautifulSoup(plain_text, 'lxml')\n",
158 | " for autor in soup.find_all('small', class_='author'):\n",
159 | " autores.append(autor.text)\n",
160 | " for quote in soup.find_all('span', class_='text'):\n",
161 | " quotes.append(quote.text)\n",
162 | " page += 1\n",
163 | " yield dict(zip(autores, quotes))"
164 | ]
165 | },
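Como observado acima, o `dict` mantém apenas um quote por autor, já que as chaves são únicas. Um esboço de como contornar essa limitação agrupando todos os quotes de cada autor com `defaultdict`, assumindo que cada citação do site fica dentro de um `div` com a classe `quote` (o que pode ser confirmado inspecionando a página):

```python
from collections import defaultdict

import requests
from bs4 import BeautifulSoup

def spider_agrupado(max_pages):
    quotes_por_autor = defaultdict(list)
    for page in range(1, max_pages + 1):
        url = f'http://quotes.toscrape.com/page/{page}/'
        soup = BeautifulSoup(requests.get(url).text, 'lxml')
        # Cada bloco de citação contém o texto e o autor correspondente
        for bloco in soup.find_all('div', class_='quote'):
            autor = bloco.find('small', class_='author').text
            texto = bloco.find('span', class_='text').text
            quotes_por_autor[autor].append(texto)
    return quotes_por_autor
```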
166 | {
167 | "cell_type": "markdown",
168 | "metadata": {},
169 | "source": [
170 | "Obtendo o objeto gerador"
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": 13,
176 | "metadata": {},
177 | "outputs": [
178 | {
179 | "name": "stdout",
180 | "output_type": "stream",
181 | "text": [
182 | "