├── webscraping_5lines.py
├── web_scraping_toc.csv
├── python_toc.csv
├── wiki_toc.py
└── README.md

/webscraping_5lines.py:
--------------------------------------------------------------------------------
import requests
from bs4 import BeautifulSoup
response = requests.get("https://en.wikipedia.org/wiki/Web_scraping")
bs = BeautifulSoup(response.text, "lxml")
print(bs.find("p").text)
--------------------------------------------------------------------------------
/web_scraping_toc.csv:
--------------------------------------------------------------------------------
heading_number,heading_text
1,History
2,Techniques
2.1,Human copy-and-paste
2.2,Text pattern matching
2.3,HTTP programming
2.4,HTML parsing
2.5,DOM parsing
2.6,Vertical aggregation
2.7,Semantic annotation recognizing
2.8,Computer vision web-page analysis
3,Software
4,Legal issues
4.1,United States
4.2,The EU
4.3,Australia
4.4,India
5,Methods to prevent web scraping
6,See also
7,References
--------------------------------------------------------------------------------
/python_toc.csv:
--------------------------------------------------------------------------------
heading_number,heading_text
1,History
2,Design philosophy and features
3,Syntax and semantics
3.1,Indentation
3.2,Statements and control flow
3.3,Expressions
3.4,Methods
3.5,Typing
3.6,Arithmetic operations
4,Programming examples
5,Libraries
6,Development environments
7,Implementations
7.1,Reference implementation
7.2,Other implementations
7.3,Unsupported implementations
7.4,Cross-compilers to other languages
7.5,Performance
8,Development
9,API documentation generators
10,Naming
11,Uses
12,Languages influenced by Python
13,See also
14,References
14.1,Sources
15,Further reading
16,External links
--------------------------------------------------------------------------------
/wiki_toc.py:
--------------------------------------------------------------------------------
import csv
import requests
from bs4 import BeautifulSoup


def get_data(url):
    # Download the page and collect every table-of-contents entry as a dict.
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    table_of_contents = soup.find("div", id="toc")
    headings = table_of_contents.find_all("li")
    data = []
    for heading in headings:
        heading_text = heading.find("span", class_="toctext").text
        heading_number = heading.find("span", class_="tocnumber").text
        data.append({
            'heading_number': heading_number,
            'heading_text': heading_text,
        })
    return data


def export_data(data, file_name):
    # Write the extracted headings to a CSV file.
    with open(file_name, "w", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=['heading_number', 'heading_text'])
        writer.writeheader()
        writer.writerows(data)


def main():
    url_to_parse = "https://en.wikipedia.org/wiki/Python_(programming_language)"
    file_name = "python_toc.csv"
    data = get_data(url_to_parse)
    export_data(data, file_name)

    url_to_parse = "https://en.wikipedia.org/wiki/Web_scraping"
    file_name = "web_scraping_toc.csv"
    data = get_data(url_to_parse)
    export_data(data, file_name)

    print('Done')


if __name__ == '__main__':
    main()
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Python Web Scraping Tutorial: Step-By-Step

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=python-web-scraping-tutorial-github&transaction_id=102f49063ab94276ae8f116d224b67)

[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs)

## Table of Contents

- [Web Scraping in 5 Lines of Code](#web-scraping-in-5-lines-of-code)
- [Components of a Web Scraping with Python Code](#components-of-a-web-scraping-with-python-code)
- [Python Libraries](#python-libraries)
- [Python Web Scraping: Working with Requests](#python-web-scraping-working-with-requests)
- [BeautifulSoup](#beautifulsoup)
- [Find Methods in BeautifulSoup4](#find-methods-in-beautifulsoup4)
- [Finding Multiple Elements](#finding-multiple-elements)
- [Finding Nested Elements](#finding-nested-elements)
- [Exporting the data](#exporting-the-data)
- [Other Tools](#other-tools)

In this Python web scraping tutorial, we will outline everything needed to get started with web scraping. We will begin with simple examples and move on to relatively more complex ones.

Python is arguably the most suitable programming language for web scraping because of its ease of use and its plethora of open-source libraries. Some libraries make it easy to extract the data and transform it into any format needed, be it a simple CSV, more programmer-friendly JSON, or even a database.

Web scraping with Python is so easy that it can be done in as little as 5 lines of code.

## Web Scraping in 5 Lines of Code

Write these five lines in any text editor, save them as a `.py` file, and run the file with Python. Note that this code assumes that you have the libraries installed. More on this later.

```python
import requests
from bs4 import BeautifulSoup
response = requests.get("https://en.wikipedia.org/wiki/Web_scraping")
bs = BeautifulSoup(response.text, "lxml")
print(bs.find("p").text)
```

This will go to the Wikipedia page for web scraping and print its first paragraph on the terminal. This code shows the simplicity and power of Python. You will find it in the `webscraping_5lines.py` file.

## Components of a Web Scraping with Python Code

The main building blocks of any web scraping code are:

1. Get the HTML
2. Parse the HTML into a Python object
3. Save the extracted data

In most cases, there is no need to use a browser to get the HTML. While the HTML contains the data, the other files that a browser loads, such as images, CSS, and JavaScript, just make the website pretty and functional. Web scraping is focused on data, so in most cases there is no need to download these helper files.

There will be some cases when you do need to open a browser. Python makes that easy too.
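
To see how these three building blocks fit together before we look at each one in detail, here is a minimal end-to-end sketch. It reuses the URL from the five-line example above; the output file name `output.csv` is just a placeholder, and each step is explained in the sections that follow.

```python
import csv

import requests
from bs4 import BeautifulSoup

# 1. Get the HTML
response = requests.get("https://en.wikipedia.org/wiki/Web_scraping")

# 2. Parse the HTML into a Python object
soup = BeautifulSoup(response.text, "lxml")
first_paragraph = soup.find("p").text

# 3. Save the extracted data
with open("output.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["first_paragraph"])
    writer.writerow([first_paragraph])
```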

## Python Libraries

Web scraping with Python is easy due to the many useful libraries available.

A barebones installation of Python isn't enough for web scraping. One of the [Python advantages](https://oxy.yt/RrXa) is a large selection of libraries for web scraping. For this Python web scraping tutorial, we'll be using three important libraries – Requests, BeautifulSoup, and csv – along with `lxml`, the parser that BeautifulSoup will use.

- The [Requests](https://docs.python-requests.org/en/master/) library is used to get the HTML files, bypassing the need to use a browser.
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is used to convert the raw HTML into a Python object, a step also called parsing. We will be working with version 4 of this library, also known as `bs4` or `BeautifulSoup4`.
- The [csv](https://docs.python.org/3/library/csv.html) library is part of the standard Python installation. No separate installation is required.
- Typically, a [virtual environment](https://docs.python.org/3/tutorial/venv.html) is used to install these libraries. If you are not familiar with virtual environments, you can install these libraries in the user folder instead.

To install these libraries, start the terminal or command prompt of your OS and type in:

```sh
pip install requests beautifulsoup4 lxml
```

Depending on your OS and settings, you may need to use `pip3` instead of `pip`. You may also need to add the `--user` switch, depending on your setup.

## Python Web Scraping: Working with Requests

The Requests library eliminates the need to launch a browser, which would load the web page along with all the supporting files that make the website pretty. The data that we need to extract is in the HTML. The Requests library allows us to send a request to a web page and get its response HTML.

Open a text editor of your choice: Visual Studio Code, PyCharm, Sublime Text, Jupyter Notebook, or even Notepad. Use the one you are familiar with.

Type in these lines:

```python
import requests

url_to_parse = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url_to_parse)
print(response)
```

Save this file as a Python file with the `.py` extension and run it from your terminal. The output should be something like this:

```
<Response [200]>
```

It means that the response has been received and the status code is 200. An HTTP response code of 200 means a successful response, while response codes in the 400 and 500 ranges indicate errors. You can read more about response codes [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).

To get the HTML from the response object, we can simply use the `.text` attribute.

```python
print(response.text)
```

This will print the HTML on the terminal. The first few characters will be something like this:

```html
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Python (programming language) - Wikipedia</title>
...
```

## BeautifulSoup

The raw HTML string by itself is hard to work with. It needs to be parsed into a Python object that can be searched, and that is exactly what BeautifulSoup does. Create the soup object by passing the HTML and the name of the parser to the `BeautifulSoup` constructor, the same way the complete script `wiki_toc.py` does:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'lxml')
print(soup.title.text)  # Python (programming language) - Wikipedia
```

This `soup` object is what the find methods in the following sections are called on.

## Find Methods in BeautifulSoup4

Perhaps the most commonly used methods are `find()` and `find_all()`. Let's open the Wikipedia page and get its table of contents.

The signature of `find()` looks something like this:

```python
find(name=None, attrs={}, recursive=True, text=None, **kwargs)
```

As is evident, the `find()` method can be used to find elements based on `name`, `attrs`, or `text`. This should cover most scenarios. For the remaining ones, such as finding by `class`, the `**kwargs` parameter accepts additional filters.
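
As a quick illustration of these filters (assuming the `soup` object created in the previous section), the calls below show the same method matching by tag name, by attributes, and by text. The specific elements queried here are only examples:

```python
soup.find("h1")                     # filter by tag name
soup.find(attrs={"id": "toc"})      # filter by attributes only
soup.find("span", text="History")   # filter by text (newer bs4 releases prefer string=)
```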

Moving on to the Wikipedia example, the first step is to look at the HTML markup of the table of contents that we want to extract. Right-click the div that contains the table of contents and inspect its markup. It is clear that the whole table of contents is in a div tag with its `id` and `class` attributes set to `toc`:

```html
<div id="toc" class="toc">
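  <!-- Simplified illustration of the markup inside this div (attributes trimmed):
       each heading is an <li> whose link wraps a "tocnumber" span and a "toctext" span. -->
  <li class="toclevel-1">
    <a href="#History">
      <span class="tocnumber">1</span>
      <span class="toctext">History</span>
    </a>
  </li>
  <!-- ... more list items, one per heading ... -->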
```

If we simply run `soup.find("div")`, it will return the first div it finds, which is the same as writing `soup.div`. This needs filtering, since we need one specific div. We are lucky in this case, as it has an `id` attribute. The following line of code can extract the div element:

```python
soup.find("div", id="toc")
```

Note the second argument here: `id="toc"`. The `find()` method does not define a named parameter `id`, but this still works because the filter is implemented through `**kwargs`.

Be careful with CSS classes, though. `class` is a reserved keyword in Python, so it cannot be used as a parameter name directly. There are two workarounds: the first is to simply use `class_` instead of `class`; the second is to use a dictionary as the second argument.

This means that the following two statements are the same:

```python
soup.find("div", class_="toc")  # note the underscore
soup.find("div", {"class": "toc"})
```

The advantage of using a dictionary is that more than one attribute can be specified. For example, if you need to specify both the class and the id, you can use the `find()` method in the following manner:

```python
soup.find("div", {"class": "toc", "id": "toc"})
```

What if we need to find multiple elements?

## Finding Multiple Elements

Consider this scenario: the objective is to create a CSV file with two columns. The first column contains the heading number, and the second column contains the heading text.

To find multiple elements, we can use the `find_all()` method.

This method works the same way as `find()`, except that instead of one element it returns a list of all the elements that match the criteria. If we look at the source code, we can see that all the heading text is inside `span` tags with `toctext` as the class. We can use `find_all()` to extract all of them:

```python
soup.find_all("span", class_="toctext")
```

This will return a list of elements:

```shell
[<span class="toctext">History</span>,
 <span class="toctext">Design philosophy and features</span>,
 <span class="toctext">Syntax and semantics</span>,
 <span class="toctext">Indentation</span>,
 .....]
```

Similarly, the heading numbers can be extracted using this statement:

```python
soup.find_all("span", class_="tocnumber")
```

This will return a list of elements:

```shell
[<span class="tocnumber">1</span>,
 <span class="tocnumber">2</span>,
 <span class="tocnumber">3</span>,
 <span class="tocnumber">3.1</span>,
 ...]
```

However, we need one list containing both the number and the text.

## Finding Nested Elements

We need to take one step back and look at the markup. The whole table of contents can be selected with this statement:

```python
table_of_contents = soup.find("div", id="toc")
```

If we look at the markup, we can see that each heading number and heading text is inside an `li` tag.

One of the great features of BeautifulSoup is that the `find` and `find_all` methods can be called on `Tag` objects too, not just on the soup itself. In the example above, `table_of_contents` is an instance of `Tag`, so we can find all the `li` tags inside it:

```python
headings = table_of_contents.find_all("li")
```

Now we have a list of elements, and each of these elements contains both the heading text and the heading number. A simple for loop can be used to create a dictionary for each heading and append it to a list.

```python
data = []
for heading in headings:
    heading_text = heading.find("span", class_="toctext").text
    heading_number = heading.find("span", class_="tocnumber").text
    data.append({
        'heading_number': heading_number,
        'heading_text': heading_text,
    })
```

If this data is printed, it is a list of dictionaries:

```shell
[{'heading_number': '1', 'heading_text': 'History'},
 {'heading_number': '2', 'heading_text': 'Design philosophy and features'},
 {'heading_number': '3', 'heading_text': 'Syntax and semantics'},
 {'heading_number': '3.1', 'heading_text': 'Indentation'},
 {'heading_number': '3.2', 'heading_text': 'Statements and control flow'},
 .....]
```

This data can now be exported easily using the csv module.

## Exporting the data

The data can be exported to a CSV file using the csv module, so remember to add `import csv` at the top of your script. The first step is to open a file in write mode. Note that the `newline` parameter should be set to an empty string; if this is not done, you will see unwanted blank lines between the rows of your CSV file.

```python
file = open("toc.csv", "w", newline="")
```

After that, create an instance of `DictWriter`. This needs a list of field names, which in our case are simply the dictionary keys in `data`:

```python
writer = csv.DictWriter(file, fieldnames=['heading_number', 'heading_text'])
```

Optionally, write the header row, and then write the data. To write one row, use the `writerow()` method; to write all rows at once, use `writerows()`:

```python
writer.writeheader()
writer.writerows(data)
```

That's it! We have the data ready in a CSV. Remember to close the file with `file.close()` afterwards, or open it in a `with` block as the complete script does, so it is closed automatically.

You can find the complete code in the `wiki_toc.py` file.

Also, check out this tutorial on [PyPI](https://pypi.org/project/python-web-scraping-tutorial-step-by-step/).

## Other Tools

Some websites do not have the data in the HTML itself; instead, it is loaded from other files using JavaScript. In such cases, you need a solution that uses a browser, and the typical example is Selenium. We have a [detailed guide on Selenium here](https://en.wikipedia.org/wiki/Web_scraping).
--------------------------------------------------------------------------------