├── Companies.csv
├── Images
│   ├── 01-Intro_Image.jpg
│   ├── 02-Title_Image.png
│   ├── 03-tbody.png
│   ├── 04-tr.png
│   ├── 05-company-name.png
│   ├── 06-Link_format.png
│   └── README.md
├── LICENSE
├── README.md
├── Web_Scraping-Notebook.ipynb
└── requirements.txt
/Images/01-Intro_Image.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rachittshah/Webscraping-with-AutoGPT/fa0ab8d96b7665084f1525d0884a3561a5e23058/Images/01-Intro_Image.jpg
--------------------------------------------------------------------------------
/Images/02-Title_Image.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rachittshah/Webscraping-with-AutoGPT/fa0ab8d96b7665084f1525d0884a3561a5e23058/Images/02-Title_Image.png
--------------------------------------------------------------------------------
/Images/03-tbody.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rachittshah/Webscraping-with-AutoGPT/fa0ab8d96b7665084f1525d0884a3561a5e23058/Images/03-tbody.png
--------------------------------------------------------------------------------
/Images/04-tr.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rachittshah/Webscraping-with-AutoGPT/fa0ab8d96b7665084f1525d0884a3561a5e23058/Images/04-tr.png
--------------------------------------------------------------------------------
/Images/05-company-name.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rachittshah/Webscraping-with-AutoGPT/fa0ab8d96b7665084f1525d0884a3561a5e23058/Images/05-company-name.png
--------------------------------------------------------------------------------
/Images/06-Link_format.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/rachittshah/Webscraping-with-AutoGPT/fa0ab8d96b7665084f1525d0884a3561a5e23058/Images/06-Link_format.png
--------------------------------------------------------------------------------
/Images/README.md:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2022 Konstantinos Orfanakis
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Web Scraping With Python
2 |
3 |
4 |
5 | **Table of contents:**
6 |
7 |
8 | * [Objective](#objective)
9 | * [Motivation](#motivation)
10 | * [How Does It Work?](#how-does-it-work)
11 | * [HTML (Optional)](#html-optional)
12 | * [Web Scraping Workflow](#web-scraping-workflow)
13 | * [Ethics of Scraping](#ethics-of-scraping)
14 | * [References](#references)
15 | * [Feedback](#feedback)
16 |
17 |
18 |
19 |
20 | ## Objective
21 |
22 | This tutorial shows how to scrape a website with the Python programming language. Specifically, we will use the `requests` and `Beautiful Soup` libraries to scrape and parse data from [companiesmarketcap.com](https://companiesmarketcap.com/) and retrieve the “*Largest Companies by Market Cap*”.
23 |
24 | We will learn how to scale the web scraping process by first retrieving the first company/row of the table, then all companies on the website’s first page, and finally, all 6024 companies from multiple pages. Once the scraping process is complete, we will preprocess the dataset and transform it into a more readable format before using `matplotlib` to visualise the most important information.
25 |
26 |
27 |
28 | ## Motivation
29 |
30 | Web scraping is a technique employed to extract large amounts of data from the Web using intelligent automation. Nowadays, web scraping is an essential tool for data scientists as it can be used to potentially source hundreds, millions, or even billions of data points from the Internet’s seemingly endless frontier.
31 |
32 |
33 |
34 | ## How Does It Work?
35 |
36 | The first step in web scraping is understanding what a web page is. Simply put, a web page is a text document provided by a website and displayed to a user in a web browser. Such documents are written in [HTML](https://html.com/).
37 |
38 | HTML (which stands for HyperText Markup Language) is the most fundamental building block of the World Wide Web. It is the underlying source code of all web pages (along with CSS and JavaScript) as it encodes the displayed content and the overall structure of a web page. HTML documents are files that end with a .html or .htm extension. We can easily access the HTML source code using our browser's ‘view page source’ or ‘inspect’ tools.
39 |
40 | Finally, a web scraper is a computer program that can understand an HTML document, parse it and extract useful information. To build a successful web scraper, we need at least a basic knowledge/understanding of HTML. If you are already familiar with HTML, you can skip the following section.
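
To make the idea of "parse it and extract useful information" concrete, here is a minimal sketch of a scraper's parsing step. It is not the approach used later in this tutorial: it swaps Beautiful Soup for the standard library's `html.parser` so that it runs without any installs, and it parses an inline HTML string instead of downloading a page. The `CellExtractor` class, the `extract_rows` helper, and the sample company names are all hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class CellExtractor(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row (<tr>)."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # cells of the row currently being read
        self._in_cell = False   # True while we are inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._in_cell = False
            self._row[-1] = self._row[-1].strip()
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


# Hypothetical sample document; a real scraper would download this text.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>1</td><td>Example Corp</td></tr>
    <tr><td>2</td><td>Sample Inc</td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Parse an HTML string and return its table rows as lists of strings."""
    parser = CellExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_rows(SAMPLE_HTML)
print(rows)
```

The parser reacts to opening tags, text, and closing tags as it reads the document, which is why a basic understanding of HTML structure matters before writing a scraper.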
41 |
42 |
43 |
44 | ## HTML (Optional)
45 |
46 | An HTML file is made up of so-called elements. An element represents semantics or meaning. Typically, an element includes an opening tag enclosed in angle brackets (\