├── README.md
├── images
│   ├── load_more_button.png
│   ├── next_button_example.png
│   ├── next_button_example_page2.png
│   ├── next_button_example_page3.png
│   ├── next_button_locate.png
│   ├── pager_without_next.png
│   ├── scroll_html_response.png
│   ├── scroll_json_response.png
│   └── scroll_json_response_has_next.png
├── infinite_scroll_html.py
├── infinite_scroll_json.py
├── load_more_json.py
├── next_button.py
└── no_next_button.py
/README.md:
--------------------------------------------------------------------------------
1 | # Dealing With Pagination Via Python
2 | 
3 | [![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)
4 | 
5 | [![](https://dcbadge.vercel.app/api/server/eWsVUJrnG5)](https://discord.gg/GbxmdGhZjq)
6 | 
7 | 
8 | 
9 | This article covers everything you need to know about dealing with pagination using Python. By the end of this article, you will be able to handle various kinds of pagination in your web scraping projects.
10 | 
11 | ## Table of Contents
12 | 
13 | - [Introduction](#introduction)
14 | - [Pagination With Next button](#pagination-with-next-button)
15 |   - [Analyzing the Website](#analyzing-the-website)
16 |   - [Python Code to Handle Pagination](#python-code-to-handle-pagination)
17 | - [Pagination Without Next Button](#pagination-without-next-button)
18 | - [Pagination With Infinite Scroll](#pagination-with-infinite-scroll)
19 |   - [Handling Sites with JSON Response](#handling-sites-with-json-response)
20 |   - [Handling Sites with HTML Response](#handling-sites-with-html-response)
21 | - [Pagination With Load More Button](#pagination-with-load-more-button)
22 | 
23 | ## Introduction
24 | 
25 | Pagination is a common feature on most websites. There is often more content to display than can fit on one page. This is true for product listings, blogs, photos, videos, directories, etc.
26 | 
27 | Each website has its own way of implementing pagination. The common types of pagination are as follows:
28 | 
29 | - With Next Button
30 | - Page Numbers without Next button
31 | - Pagination with Infinite Scroll
32 | - Pagination with Load More
33 | 
34 | In this article, we will examine these cases and explore ways to handle these websites.
35 | 
36 | ## Pagination With Next button
37 | 
38 | Let's start with a simple example. Head over to [this page](http://books.toscrape.com/catalogue/category/books/fantasy_19/index.html) and see the pagination.
39 | 
40 | ### Analyzing the Website
41 | 
42 | ![Next Button Example](images/next_button_example.png)
43 | 
44 | This site has a next button. If this button is clicked, the website goes to the next page.
45 | 
46 | ![Page 2](images/next_button_example_page2.png)
47 | 
48 | Now the site displays a previous button along with the next button. If we keep clicking next until we reach the last page, this is what it looks like:
49 | 
50 | ![Last Page](images/next_button_example_page3.png)
51 | 
52 | Moreover, with every click, the URL changes:
53 | 
54 | - Page 1 - `http://books.toscrape.com/catalogue/category/books/fantasy_19/index.html`
55 | - Page 2 - `http://books.toscrape.com/catalogue/category/books/fantasy_19/page-2.html`
56 | - Page 3 - `http://books.toscrape.com/catalogue/category/books/fantasy_19/page-3.html`
57 | 
58 | The next step is to press F12 and examine the HTML markup of the next button.
59 | 
60 | ![Next Button Markup](images/next_button_locate.png)
61 | 
62 | Now that we know that the website is not dynamic and that the next button is not a button but an anchor element, we can find the URL of the next page by locating this anchor tag.
63 | 
64 | Let's write some Python code.
65 | 
66 | ### Python Code to Handle Pagination
67 | 
68 | Let's start with basic code to get the first page using the `requests` module. If you do not have it installed, install it in a virtual environment.
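For example, on macOS or Linux, a virtual environment can be created and activated like this (on Windows, run `venv\Scripts\activate` instead of the `source` line):

```shell
# Create a virtual environment in the folder "venv" and activate it
python3 -m venv venv
source venv/bin/activate
```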
You may also want to install `BeautifulSoup4`. We will be using `BeautifulSoup4` (or `bs4`) to locate the next button. `bs4` also needs a parser; in this article, we are using `lxml`.
69 | 
70 | ```shell
71 | pip install requests beautifulsoup4 lxml
72 | ```
73 | 
74 | Let's start by writing a simple script that fetches the first page and prints the footer. Note that we print the footer so that we can keep track of the current page. In a real-world application, this is where you would write the code that scrapes the data.
75 | 
76 | ```python
77 | import requests
78 | from bs4 import BeautifulSoup
79 | 
80 | url = 'http://books.toscrape.com/catalogue/category/books/fantasy_19/index.html'
81 | 
82 | response = requests.get(url)
83 | soup = BeautifulSoup(response.text, "lxml")
84 | footer_element = soup.select_one('li.current')
85 | print(footer_element.text.strip())
86 | # Other code to extract data
87 | ```
88 | 
89 | The output of this code is simply the footer of the first page:
90 | 
91 | ```shell
92 | Page 1 of 3
93 | ```
94 | 
95 | A few points to note here:
96 | 
97 | - The `requests` library sends a GET request to the specified URL.
98 | - The `soup` object is queried using a CSS selector. This selector is website-specific.
99 | 
100 | Let's now modify this code to handle pagination.
101 | 
102 | First, we need to locate the next button:
103 | 
104 | ```python
105 | next_page_element = soup.select_one('li.next > a')
106 | ```
107 | 
108 | If this element is found, we can get the value of its `href` attribute. One important thing to note here is that the `href` will often be a relative URL. In such cases, we can use the `urljoin` function from the `urllib.parse` module. This whole block of code then needs to be wrapped in a `while True` loop.
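For example, `urljoin` resolves a relative `href` against the URL of the current page:

```python
from urllib.parse import urljoin

base = 'http://books.toscrape.com/catalogue/category/books/fantasy_19/index.html'
# A relative href such as 'page-2.html' is resolved against the base URL
print(urljoin(base, 'page-2.html'))
# http://books.toscrape.com/catalogue/category/books/fantasy_19/page-2.html
```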
109 | 
110 | ```python
111 | url = 'http://books.toscrape.com/catalogue/category/books/fantasy_19/index.html'
112 | while True:
113 |     response = requests.get(url)
114 |     soup = BeautifulSoup(response.text, "lxml")
115 |     footer_element = soup.select_one('li.current')
116 |     print(footer_element.text.strip())
117 |     # Pagination
118 |     next_page_element = soup.select_one('li.next > a')
119 |     if next_page_element:
120 |         next_page_url = next_page_element.get('href')
121 |         url = urljoin(url, next_page_url)
122 |     else:
123 |         break
124 | ```
125 | 
126 | The output of this code is simply the footers of all three pages:
127 | 
128 | ```shell
129 | Page 1 of 3
130 | Page 2 of 3
131 | Page 3 of 3
132 | ```
133 | 
134 | You can find this code in the `next_button.py` file in this repository.
135 | 
136 | ## Pagination Without Next Button
137 | 
138 | Some websites do not show a next button, but only page numbers. One such page is `https://www.gosc.pl/doc/791526.Zaloz-zbroje`. Here is an example of its pagination:
139 | 
140 | ![Pager without next](images/pager_without_next.png)
141 | 
142 | If we examine the HTML markup of this page, something interesting can be seen: the page numbers are plain anchor tags inside a container with the class `pgr_nrs` (shown schematically here, with the `href` values omitted):
143 | 
144 | ```html
145 | <span class="pgr_nrs">
146 |     <a href="…">1</a>
147 |     <a href="…">2</a>
148 |     <a href="…">3</a>
149 |     <a href="…">4</a>
150 | </span>
151 | ```
152 | 
153 | This makes visiting all these pages easy. The first step is to retrieve the first page. Next, we can use BeautifulSoup to extract all the links to the other pages.
Finally, we can write a `for` loop that iterates over all these links:
154 | 
155 | ```python
156 | url = 'https://www.gosc.pl/doc/791526.Zaloz-zbroje'
157 | response = requests.get(url)
158 | soup = BeautifulSoup(response.text, 'lxml')
159 | page_link_el = soup.select('.pgr_nrs a')
160 | # process first page
161 | for link_el in page_link_el:
162 |     link = urljoin(url, link_el.get('href'))
163 |     response = requests.get(link)
164 |     soup = BeautifulSoup(response.text, 'lxml')
165 |     print(response.url)
166 |     # process remaining pages
167 | ```
168 | 
169 | You can find the complete code in the file `no_next_button.py`.
170 | 
171 | ## Pagination With Infinite Scroll
172 | 
173 | This kind of pagination does not show page numbers or a next button.
174 | 
175 | Let's take [this site](https://techinstr.myshopify.com/collections/all) as an example. This site shows 8 products on page load. As you scroll down, it dynamically loads more items, 8 at a time. Another important thing to note here is that the URL does not change as more pages are loaded.
176 | 
177 | In such cases, websites use an asynchronous call to an API to get more content and show this content on the page using JavaScript. The actual data returned by the API can be HTML or JSON.
178 | 
179 | ### Handling Sites With JSON Response
180 | 
181 | Before you load the site, press `F12` to open Developer Tools, head over to the Network tab, and select XHR. Now go to `http://quotes.toscrape.com/scroll` and monitor the traffic. Scroll down to load more content.
182 | 
183 | You will notice that as you scroll down, more requests are sent to `quotes?page=x`, where `x` is the page number.
184 | 
185 | 
186 | 
187 | ![](images/scroll_json_response.png)
188 | 
189 | Remember that the most important step is **figuring out when to stop**. This is where `has_next` in the response is going to be useful.
190 | 
191 | We can write a while loop as we did in the previous section.
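Based on what the Network tab shows, each of these responses is a JSON object roughly shaped like this (a sketch only; the quote objects are abbreviated here, and `has_next` is the field we will key off):

```python
# Hypothetical sketch of one page's response body, as observed in DevTools
page_response = {
    "has_next": True,  # False on the last page
    "page": 1,
    "quotes": [
        {"author": {"name": "..."}, "text": "..."},
    ],
}

print(page_response["has_next"])  # True
```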
This time, there is no need for BeautifulSoup because the response is JSON.
192 | 
193 | ```python
194 | url = 'http://quotes.toscrape.com/api/quotes?page={}'
195 | page_numer = 1
196 | while True:
197 |     response = requests.get(url.format(page_numer))
198 |     data = response.json()
199 |     # Process data here
200 |     # ...
201 |     print(response.url)  # only for debug
202 |     if data.get('has_next'):
203 |         page_numer += 1
204 |     else:
205 |         break
206 | ```
207 | 
208 | Once we figure out how the site works, it's quite easy.
209 | 
210 | Now let's look at one more example.
211 | 
212 | ### Handling Sites With HTML Response
213 | 
214 | In the previous section, we saw that figuring out when to stop is important. That example was easier: all we needed to do was examine an attribute in the JSON.
215 | 
216 | Some websites make it harder. Let's see one such example.
217 | 
218 | Open Developer Tools by pressing F12 in your browser, go to the Network tab, and then select XHR. Navigate to `https://techinstr.myshopify.com/collections/all`. You will notice that initially 8 products are loaded.
219 | 
220 | If we scroll down, the next 8 products are loaded. Also, notice the following:
221 | 
222 | - The total number of products (132) is shown on the first page.
223 | - The URL of the index page is different from the remaining pages.
224 | - The response is HTML, with no clear way to identify when to stop.
225 | 
226 | ![Infinite Scroll](images/scroll_html_response.png)
227 | 
228 | To handle pagination for this site, we will first load the index page and extract the number of products. We have already observed that 8 products are loaded in one request. Now we can calculate the number of pages as follows:
229 | 
230 | Number of pages = 132 / 8 = 16.5
231 | 
232 | Now we can use the `math.ceil` function to get the last page number, which gives us 17. Note that if you use the `round` function, you may end up missing one page in some cases. Using `ceil` ensures that the page count is always rounded *up*.
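This difference is easy to verify. Note that Python's `round` uses banker's rounding, so 16.5 rounds down to 16:

```python
import math

count = 132   # total products, taken from the first page
per_page = 8  # products loaded per request

print(math.ceil(count / per_page))  # 17, so the last page is included
print(round(count / per_page))      # 16, which would miss the last page
```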
233 | 
234 | One more change that we need to make is to use a session. The complete code is available in the `infinite_scroll_html.py` file. Here is the important part of the code:
235 | 
236 | ```python
237 | index_page = 'https://techinstr.myshopify.com/collections/all'
238 | session = requests.session()
239 | 
240 | response = session.get(index_page)
241 | soup = BeautifulSoup(response.text, "lxml")
242 | count_element = soup.select_one('.filters-toolbar__product-count')
243 | count_str = count_element.text.replace('products', '')
244 | count = int(count_str)
245 | # Process page 1 data here
246 | page_count = math.ceil(count/8)
247 | url = 'https://techinstr.myshopify.com/collections/all?page={}'
248 | for page_numer in range(2, page_count+1):
249 |     response = session.get(url.format(page_numer))
250 |     soup = BeautifulSoup(response.text, "lxml")
251 |     # process page 2 onwards data here
252 | 
253 | ```
254 | 
255 | 
256 | 
257 | ## Pagination With Load More Button
258 | 
259 | Pagination using a Load More button works very similarly to infinite scroll. The only difference is how loading the next page is triggered in the browser.
260 | 
261 | As we are working directly with the web page, without a browser, these two scenarios are handled the same way.
262 | 
263 | Let's look at one example. Open `https://smarthistory.org/americas-before-1900/` with Developer Tools open and click `Load More`.
264 | 
265 | You will see that the response is in JSON format with an attribute `remaining`. The key observations are as follows:
266 | 
267 | - Each request gets 12 results.
268 | - The value of `remaining` decreases by 12 with every click of Load More.
269 | - If we set the value of `page` to 1 in the API URL, it gets the first page of the results - `https://smarthistory.org/wp-json/smthstapi/v1/objects?tag=938&page=1`
270 | 
271 | ![](images/load_more_button.png)
272 | 
273 | In this particular case, the user agent also needs to be set for this to work correctly.
The complete code is in the `load_more_json.py` file. Here is the important part of the code:
274 | 
275 | ```python
276 | url = 'https://smarthistory.org/wp-json/smthstapi/v1/objects?tag=938&page={}'
277 | headers = {
278 |     'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
279 | }
280 | page_numer = 1
281 | while True:
282 |     response = requests.get(url.format(page_numer), headers=headers)
283 |     data = response.json()
284 |     # Process data
285 |     # ...
286 |     print(response.url)  # only for debug
287 |     if data.get('remaining') and int(data.get('remaining')) > 0:
288 |         page_numer += 1
289 |     else:
290 |         break
291 | 
292 | ```
293 | 
294 | 
--------------------------------------------------------------------------------
/images/load_more_button.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/Pagination-With-Python/a3cdf5938d2e3a75ce219c4a6fce75d9a7f5815e/images/load_more_button.png
--------------------------------------------------------------------------------
/images/next_button_example.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/Pagination-With-Python/a3cdf5938d2e3a75ce219c4a6fce75d9a7f5815e/images/next_button_example.png
--------------------------------------------------------------------------------
/images/next_button_example_page2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/Pagination-With-Python/a3cdf5938d2e3a75ce219c4a6fce75d9a7f5815e/images/next_button_example_page2.png
--------------------------------------------------------------------------------
/images/next_button_example_page3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/oxylabs/Pagination-With-Python/a3cdf5938d2e3a75ce219c4a6fce75d9a7f5815e/images/next_button_example_page3.png -------------------------------------------------------------------------------- /images/next_button_locate.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oxylabs/Pagination-With-Python/a3cdf5938d2e3a75ce219c4a6fce75d9a7f5815e/images/next_button_locate.png -------------------------------------------------------------------------------- /images/pager_without_next.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oxylabs/Pagination-With-Python/a3cdf5938d2e3a75ce219c4a6fce75d9a7f5815e/images/pager_without_next.png -------------------------------------------------------------------------------- /images/scroll_html_response.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oxylabs/Pagination-With-Python/a3cdf5938d2e3a75ce219c4a6fce75d9a7f5815e/images/scroll_html_response.png -------------------------------------------------------------------------------- /images/scroll_json_response.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oxylabs/Pagination-With-Python/a3cdf5938d2e3a75ce219c4a6fce75d9a7f5815e/images/scroll_json_response.png -------------------------------------------------------------------------------- /images/scroll_json_response_has_next.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/oxylabs/Pagination-With-Python/a3cdf5938d2e3a75ce219c4a6fce75d9a7f5815e/images/scroll_json_response_has_next.png -------------------------------------------------------------------------------- /infinite_scroll_html.py: 
--------------------------------------------------------------------------------
1 | # Handling infinite scroll pages with HTML response
2 | import requests
3 | from bs4 import BeautifulSoup
4 | import math
5 | 
6 | 
7 | def process_pages():
8 |     index_page = 'https://techinstr.myshopify.com/collections/all'
9 |     url = 'https://techinstr.myshopify.com/collections/all?page={}'
10 | 
11 |     session = requests.session()
12 |     response = session.get(index_page)
13 |     soup = BeautifulSoup(response.text, "lxml")
14 |     count_element = soup.select_one('.filters-toolbar__product-count')
15 |     count_str = count_element.text.replace('products', '')
16 |     count = int(count_str)
17 |     # Process page 1 data here
18 |     page_count = math.ceil(count/8)
19 |     for page_numer in range(2, page_count+1):
20 |         response = session.get(url.format(page_numer))
21 |         soup = BeautifulSoup(response.text, "lxml")
22 |         first_product = soup.select_one('.product-card:nth-child(1) > a > span')
23 |         print(first_product.text.strip())
24 | 
25 | 
26 | if __name__ == '__main__':
27 |     process_pages()
28 | 
--------------------------------------------------------------------------------
/infinite_scroll_json.py:
--------------------------------------------------------------------------------
1 | # Handling infinite scroll pages with JSON response
2 | import requests
3 | 
4 | 
5 | def process_pages():
6 |     url = 'http://quotes.toscrape.com/api/quotes?page={}'
7 |     page_numer = 1
8 |     while True:
9 |         response = requests.get(url.format(page_numer))
10 |         data = response.json()
11 |         # Process data
12 |         # ...
13 |         print(response.url)  # only for debug
14 |         if data.get('has_next'):
15 |             page_numer += 1
16 |         else:
17 |             break
18 | 
19 | 
20 | if __name__ == '__main__':
21 |     process_pages()
22 | 
--------------------------------------------------------------------------------
/load_more_json.py:
--------------------------------------------------------------------------------
1 | # Handling pages with load more button with JSON response
2 | import requests
3 | 
4 | 
5 | def process_pages():
6 |     url = 'https://smarthistory.org/wp-json/smthstapi/v1/objects?tag=938&page={}'
7 |     headers = {
8 |         'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
9 |     }
10 |     page_numer = 1
11 |     while True:
12 |         response = requests.get(url.format(page_numer), headers=headers)
13 |         data = response.json()
14 |         # Process data
15 |         # ...
16 |         print(response.url)  # only for debug
17 |         if data.get('remaining') and int(data.get('remaining')) > 0:
18 |             page_numer += 1
19 |         else:
20 |             break
21 | 
22 | 
23 | if __name__ == '__main__':
24 |     process_pages()
25 | 
--------------------------------------------------------------------------------
/next_button.py:
--------------------------------------------------------------------------------
1 | # Handling pages with Next button
2 | import requests
3 | from bs4 import BeautifulSoup
4 | from urllib.parse import urljoin
5 | 
6 | 
7 | def process_pages():
8 |     url = 'http://books.toscrape.com/catalogue/category/books/fantasy_19/index.html'
9 | 
10 |     while True:
11 |         response = requests.get(url)
12 |         soup = BeautifulSoup(response.text, "lxml")
13 | 
14 |         footer_element = soup.select_one('li.current')
15 |         print(footer_element.text.strip())
16 | 
17 |         # Pagination
18 |         next_page_element = soup.select_one('li.next > a')
19 |         if next_page_element:
20 |             next_page_url = next_page_element.get('href')
21 |             url = urljoin(url, next_page_url)
22 |         else:
23 |             break
24 | 
25 | 
26 | if __name__ == '__main__':
27 |     process_pages()
28 | 
--------------------------------------------------------------------------------
/no_next_button.py:
--------------------------------------------------------------------------------
1 | # Handling pages without a Next button
2 | import requests
3 | from bs4 import BeautifulSoup
4 | from urllib.parse import urljoin
5 | 
6 | 
7 | def process_pages():
8 |     url = 'https://www.gosc.pl/doc/791526.Zaloz-zbroje'
9 |     response = requests.get(url)
10 |     soup = BeautifulSoup(response.text, 'lxml')
11 |     page_link_el = soup.select('.pgr_nrs a')
12 |     # process first page
13 |     for link_el in page_link_el:
14 |         link = urljoin(url, link_el.get('href'))
15 |         response = requests.get(link)
16 |         soup = BeautifulSoup(response.text, 'lxml')
17 |         print(response.url)
18 |         # process remaining pages
19 | 
20 | 
21 | if __name__ == '__main__':
22 |     process_pages()
23 | 
--------------------------------------------------------------------------------