├── Companies.csv
├── Images
│   ├── 01-Intro_Image.jpg
│   ├── 02-Title_Image.png
│   ├── 03-tbody.png
│   ├── 04-tr.png
│   ├── 05-company-name.png
│   ├── 06-Link_format.png
│   └── README.md
├── LICENSE
├── README.md
├── Web_Scraping-Notebook.ipynb
└── requirements.txt

--------------------------------------------------------------------------------
/Images/01-Intro_Image.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rachittshah/Webscraping-with-AutoGPT/fa0ab8d96b7665084f1525d0884a3561a5e23058/Images/01-Intro_Image.jpg
--------------------------------------------------------------------------------
/Images/02-Title_Image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rachittshah/Webscraping-with-AutoGPT/fa0ab8d96b7665084f1525d0884a3561a5e23058/Images/02-Title_Image.png
--------------------------------------------------------------------------------
/Images/03-tbody.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rachittshah/Webscraping-with-AutoGPT/fa0ab8d96b7665084f1525d0884a3561a5e23058/Images/03-tbody.png
--------------------------------------------------------------------------------
/Images/04-tr.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rachittshah/Webscraping-with-AutoGPT/fa0ab8d96b7665084f1525d0884a3561a5e23058/Images/04-tr.png
--------------------------------------------------------------------------------
/Images/05-company-name.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rachittshah/Webscraping-with-AutoGPT/fa0ab8d96b7665084f1525d0884a3561a5e23058/Images/05-company-name.png
--------------------------------------------------------------------------------
/Images/06-Link_format.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rachittshah/Webscraping-with-AutoGPT/fa0ab8d96b7665084f1525d0884a3561a5e23058/Images/06-Link_format.png
--------------------------------------------------------------------------------
/Images/README.md:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2022 Konstantinos Orfanakis

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Web Scraping With Python

**Table of contents:**

* [Objective](#objective)
* [Motivation](#motivation)
* [How Does It Work?](#how-does-it-work)
* [HTML (Optional)](#html-optional)
* [Web Scraping Workflow](#web-scraping-workflow)
* [Ethics of Scraping](#ethics-of-scraping)
* [References](#references)
* [Feedback](#feedback)
## Objective

This tutorial shows how to use the Python programming language to scrape data from a website. Specifically, we will use the `requests` and `Beautiful Soup` libraries to scrape and parse data from [companiesmarketcap.com](https://companiesmarketcap.com/) and retrieve the “*Largest Companies by Market Cap*”.

We will learn how to scale the web scraping process by first retrieving the first company/row of the table, then all companies on the website’s first page, and finally, all 6024 companies from multiple pages. Once the scraping process is complete, we will preprocess the dataset and transform it into a more readable format before using `matplotlib` to visualise the most important information.
## Motivation

Web scraping is a technique for extracting large amounts of data from the Web using intelligent automation. Nowadays, web scraping is an essential tool for data scientists, as it can be used to source millions or even billions of data points from the Internet’s seemingly endless frontier.
## How Does It Work?

The first step in performing web scraping is understanding what a web page is. Simply put, a web page is a text document provided by a website and displayed to a user in a web browser. Such documents are written in the [HTML language](https://html.com/).

HTML (which stands for HyperText Markup Language) is the most fundamental building block of the World Wide Web. It is the underlying source code of all web pages (along with CSS and JavaScript), as it encodes the displayed content and the overall structure of a web page. HTML documents are files that end with a .html or .htm extension. We can easily access the HTML source code using our browser’s ‘view page source’ or ‘inspect’ tools.

Finally, a web scraper is a computer program that can understand an HTML document, parse it, and extract useful information. To build a successful web scraper, we need at least a basic understanding of HTML. If you are already familiar with HTML, you can skip the following section.
## HTML (Optional)

An HTML file is made of so-called elements. An element represents semantics or meaning. Typically, an element consists of an opening tag enclosed in angle brackets (`<tag>`) and a closing tag enclosed in angle brackets but with a forward slash preceding the tag name (`</tag>`). The content of an HTML element is the information between its opening and closing tags.

```
<tag>Content</tag>
```

Elements in HTML can have attributes, i.e. values added to the opening tag to define additional characteristics or properties of the element. HTML attributes consist of a name and a value using the following syntax: `name="value"`.

```
<tag name="value">Content</tag>
```

Everything written in HTML is either an element or contained in an element (or both). In other words, HTML is organised into a family tree structure, since HTML elements can have parents, grandparents, siblings, children, grandchildren, etc.

```
<body>
  <div>
    <h1>
      It's div's child and body's grandchild
    </h1>
    <h2>
      It's h1's sibling
    </h2>
  </div>
</body>
```

Note that the code is formatted such that the indentation level of the text increases once for each level of nesting.

Every HTML file must have one grand/root element called `<html>`. This element contains all other elements of the document. It is necessary that `<html>` always has two child elements:

- The `<head>` element: a container for metadata, i.e. information about an HTML page that is not displayed on the web page itself.
- The `<body>` element: represents the main content of an HTML document displayed on the browser.

```
<html>
  <head>
    ...
  </head>
  <body>
    ...
  </body>
</html>
```

For more information, please use the links in the References section of the notebook.
## Web Scraping Workflow

The workflow for web scraping with Python can be divided into the following three steps:

1. **Obtaining the HTML**: First, we send an HTTP request to the server of the web page we want to scrape. If the request is successful, the server responds with the HTML content of the page.

2. **Parsing the HTML**: Most of the obtained HTML data is nested, making it difficult to extract information using standard string processing techniques. Instead, we need a parser, i.e. an algorithm designed to parse the HTML and create a parse/syntax tree of the HTML data.

3. **Extracting the Data**: Once the syntax tree is created, we navigate it and retrieve the information we are interested in.

To complete these steps, we need two third-party Python libraries:

1. **[Requests](https://docs.python-requests.org/en/master/)**: a simple but powerful library for sending all kinds of HTTP requests to a web server.

2. **[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)**: a library for parsing HTML and XML documents. It works with a user-selected parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
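The three steps above can be sketched as follows. Note that the HTML snippet, the `name-td` class, and the company names below are illustrative stand-ins, not the real markup of companiesmarketcap.com; for a live page, you would replace the hard-coded string with the response of `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Step 1: obtain the HTML. We use a hard-coded snippet here so the example
# runs offline; on a real site this would be: html = requests.get(url).text
html = """
<html>
  <body>
    <table>
      <tbody>
        <tr><td class="name-td">Apple</td><td>2.1 T</td></tr>
        <tr><td class="name-td">Microsoft</td><td>1.9 T</td></tr>
      </tbody>
    </table>
  </body>
</html>
"""

# Step 2: parse the HTML into a syntax tree using the built-in parser.
soup = BeautifulSoup(html, "html.parser")

# Step 3: navigate the tree and extract the data we are interested in.
rows = soup.find("tbody").find_all("tr")
names = [row.find("td", class_="name-td").get_text() for row in rows]
print(names)  # ['Apple', 'Microsoft']
```

The same `find`/`find_all` pattern scales from one row to every row on a page: `find` returns the first matching element, while `find_all` returns a list of all matches.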
## Ethics of Scraping

Web scraping is a powerful tool that should be used responsibly. Below is a list of things to consider before we start scraping data:

1. Always check the robots.txt file of the site you are about to scrape, as it contains guidelines on how bots should behave on the website.

2. Do not spam the website with many requests in a short amount of time, as that may overload its server(s) and/or be classified as a DDoS attack. If you need to scrape multiple pages, you can artificially limit the rate of requests, as we will show in the code.

3. Do not engage in piracy or other unauthorised commercial use of the data you extract.

Additionally, some companies and websites may provide the data we are interested in in a clean and concise way through an API. If a public API that provides the information we are looking for is available, web scraping should be avoided altogether.

If you would like to know more, I suggest reading James Densmore’s article on the [Ethics in Web Scraping](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01).
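Points 1 and 2 can be sketched with the standard library alone. The robots.txt rules and URLs below are made up for illustration; against a real site, you would load the file with `rp.set_url(...)` followed by `rp.read()` instead of `rp.parse(...)`.

```python
import time
from urllib.robotparser import RobotFileParser

# Point 1: respect robots.txt. We parse an example file inline here;
# for a live site: rp.set_url("https://example.com/robots.txt"); rp.read()
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

allowed = rp.can_fetch("*", "https://example.com/page-1")
blocked = rp.can_fetch("*", "https://example.com/private/data")
print(allowed, blocked)  # True False

# Point 2: rate-limit politely by pausing between consecutive requests.
for page in range(1, 4):
    # ... fetch and parse page here ...
    time.sleep(1)  # wait before requesting the next page
```

A fixed `time.sleep` between requests is the simplest form of rate limiting; the appropriate delay depends on the site, and some robots.txt files suggest one via a `Crawl-delay` directive.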
## References

The complete list of references for this tutorial is included at the end of the notebook.
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
pandas==1.4.1
plotly==5.6.0
requests==2.27.1
numpy==1.22.2
seaborn==0.11.2
watermark==2.3.0
matplotlib==3.5.1
beautifulsoup4==4.10.0
--------------------------------------------------------------------------------