├── README.md
├── no_proxy.py
├── requirements.txt
├── rotating_multiple_proxies.py
├── rotating_multiple_proxies_async.py
└── single_proxy.py

/README.md:
--------------------------------------------------------------------------------
# Rotating Proxies With Python
[](https://github.com/topics/python) [](https://github.com/topics/web-scraping) [](https://github.com/topics/rotating-proxies)

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)

[![](https://dcbadge.vercel.app/api/server/eWsVUJrnG5)](https://discord.gg/GbxmdGhZjq)

## Table of Contents

- [Prerequisites](#prerequisites)
- [Finding Your Current IP Address](#finding-your-current-ip-address)
- [Using A Single Proxy](#using-a-single-proxy)
- [Rotating Multiple Proxies](#rotating-multiple-proxies)
- [Rotating Multiple Proxies Using Async](#rotating-multiple-proxies-using-async)

## Prerequisites

This article uses the Python `requests` module. To install it, you can use `virtualenv`, a tool for creating isolated Python environments.

Start by creating a virtual environment in your project folder by running
```bash
$ virtualenv venv
```
This will install Python, pip, and common libraries in your project folder.

Next, invoke the `source` command to activate the environment.
```bash
$ source venv/bin/activate
```

Lastly, install the `requests` module in the current virtual environment
```bash
$ pip install requests
```

Alternatively, you can install the dependencies from the included [requirements.txt](requirements.txt) file by running

```bash
$ pip install -r requirements.txt
```

Congratulations, you have successfully installed the `requests` module. Now, it's time to find out your current IP address!

## Finding Your Current IP Address

Create a file with the `.py` extension with the following contents (or just copy [no_proxy.py](no_proxy.py)):

```python
import requests

response = requests.get('https://ip.oxylabs.io/location')
print(response.text)
```

Now, run it from a terminal

```bash
$ python no_proxy.py

128.90.50.100
```
The output of this script shows your current IP address, which uniquely identifies you on the network. Instead of exposing it directly when requesting pages, we will use a proxy server.

Let's start by using a single proxy.

## Using A Single Proxy

Your first step is to [find a free proxy server](https://www.google.com/search?q=free+proxy+server+list).

**Important Note**: free proxies are unreliable, slow, and can collect data about the pages you access. If you're looking for a reliable paid option, we highly recommend using [oxylabs.io](https://oxy.yt/GrVD)

To use a proxy, you will need its:
* scheme (e.g. `http`)
* IP address (e.g. `2.56.215.247`)
* port (e.g. `3128`)
* username and password used to connect to the proxy (optional)

Once you have these details, set the proxy up in the following format
```
SCHEME://USERNAME:PASSWORD@YOUR_PROXY_IP:YOUR_PROXY_PORT
```

Here are a few examples of the proxy formats you may encounter:
```text
http://2.56.215.247:3128
https://2.56.215.247:8091
https://my-user:aegi1Ohz@2.56.215.247:8044
```
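
If your proxy requires the optional username and password, `requests` accepts the credentials embedded directly in the proxy URL. Here is a minimal sketch, reusing the placeholder credentials from the last example above (substitute the values your own provider gives you):

```python
import requests

# Placeholder credentials and endpoint taken from the format examples above.
AUTHENTICATED_PROXY = 'https://my-user:aegi1Ohz@2.56.215.247:8044'

response = requests.get(
    'https://ip.oxylabs.io/location',
    proxies={'https': AUTHENTICATED_PROXY},
    timeout=10,
)
print(response.text)
```

If the credentials are wrong or missing, most proxies reply with `407 Proxy Authentication Required`, so a quick test request like this is a good sanity check before wiring the proxy into a larger script.
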
Once you have the proxy information, assign it to a constant.

```python
PROXY = 'http://2.56.215.247:3128'
```

Next, define a timeout in seconds, as it is always a good idea to avoid waiting indefinitely for a response that may never arrive (due to network issues, server issues, or problems with the proxy server)
```python
TIMEOUT_IN_SECONDS = 10
```

The requests module [needs to know](https://docs.python-requests.org/en/master/user/advanced/#proxies) when to actually use the proxy.
For that, consider the website you are attempting to access. Does it use HTTP or HTTPS?
Since we're trying to access **https**://ip.oxylabs.io/location, we can define this configuration as follows
```python
scheme_proxy_map = {
    'https': PROXY,
}
```

**Note**: you can specify multiple protocols, and even define specific domains for which a different proxy will be used

```python
scheme_proxy_map = {
    'http': PROXY1,
    'https': PROXY2,
    'https://example.org': PROXY3,
}
```

Finally, we make the request by calling `requests.get` and passing all the variables we defined earlier. We also handle the exceptions (imported from `requests.exceptions`) and show the error when a network issue occurs.

```python
from requests.exceptions import ProxyError, ReadTimeout, ConnectTimeout

try:
    response = requests.get('https://ip.oxylabs.io/location', proxies=scheme_proxy_map, timeout=TIMEOUT_IN_SECONDS)
except (ProxyError, ReadTimeout, ConnectTimeout) as error:
    print('Unable to connect to the proxy: ', error)
else:
    print(response.text)
```

The output of this script should show you the IP address of your proxy:

```bash
$ python single_proxy.py

2.56.215.247
```

You are now hidden behind a proxy when making your requests through the Python script.
You can find the complete code in the file [single_proxy.py](single_proxy.py).

Now we're ready to rotate through a list of proxies, instead of using a single one!

## Rotating Multiple Proxies

If you're using unreliable proxies, it can be beneficial to save a list of them in a CSV file and run a loop to determine whether they are still available.

For that purpose, first create a file `proxies.csv` with the following content:
```text
http://2.56.215.247:3128
https://88.198.24.108:8080
http://50.206.25.108:80
http://68.188.59.198:80
... any other proxy servers, each of them on a separate line
```

Then, create a Python file and define both the filename and how long you are willing to wait for a single proxy to respond:

```python
TIMEOUT_IN_SECONDS = 10
CSV_FILENAME = 'proxies.csv'
```

Next, write the code that opens the CSV file, reads every proxy server line by line into a `csv_row` variable, and builds the `scheme_proxy_map` configuration needed by the requests module.

```python
import csv

with open(CSV_FILENAME) as open_file:
    reader = csv.reader(open_file)
    for csv_row in reader:
        scheme_proxy_map = {
            'https': csv_row[0],
        }
```

And finally, we use the same scraping code from the previous section to access the website via the proxy

```python
with open(CSV_FILENAME) as open_file:
    reader = csv.reader(open_file)
    for csv_row in reader:
        scheme_proxy_map = {
            'https': csv_row[0],
        }

        # Access the website via proxy
        try:
            response = requests.get('https://ip.oxylabs.io/location', proxies=scheme_proxy_map, timeout=TIMEOUT_IN_SECONDS)
        except (ProxyError, ReadTimeout, ConnectTimeout) as error:
            pass
        else:
            print(response.text)
```

**Note**: if you are only interested in scraping the content using *any* working proxy from the list, add a `break` after the `print` to stop going through the remaining proxies in the CSV file

```python
try:
    response = requests.get('https://ip.oxylabs.io/location', proxies=scheme_proxy_map, timeout=TIMEOUT_IN_SECONDS)
except (ProxyError, ReadTimeout, ConnectTimeout) as error:
    pass
else:
    print(response.text)
    break  # notice the break here
```

This complete code is available in [rotating_multiple_proxies.py](rotating_multiple_proxies.py)
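
Checking which proxies respond is only half of the picture. If you also want to rotate your actual requests across the proxies that passed the check, one possible approach (a sketch, not part of the repository scripts) is to collect the working proxies into a list and pick one at random for each subsequent request:

```python
import csv
import random

import requests
from requests.exceptions import ProxyError, ReadTimeout, ConnectTimeout

TIMEOUT_IN_SECONDS = 10
CSV_FILENAME = 'proxies.csv'

# Collect every proxy from the CSV file that responds successfully.
working_proxies = []
with open(CSV_FILENAME) as open_file:
    reader = csv.reader(open_file)
    for csv_row in reader:
        try:
            requests.get(
                'https://ip.oxylabs.io/location',
                proxies={'https': csv_row[0]},
                timeout=TIMEOUT_IN_SECONDS,
            )
        except (ProxyError, ReadTimeout, ConnectTimeout):
            continue
        working_proxies.append(csv_row[0])

# Rotate: pick a random working proxy for each request you make afterwards.
if working_proxies:
    proxy = random.choice(working_proxies)
    response = requests.get(
        'https://ip.oxylabs.io/location',
        proxies={'https': proxy},
        timeout=TIMEOUT_IN_SECONDS,
    )
    print(response.text)
```

Picking a proxy at random spreads requests across the whole pool; cycling through `working_proxies` with `itertools.cycle` would work just as well.
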
The only thing that is preventing us from reaching our full potential is speed.
It's time to tackle that in the next section!

## Rotating Multiple Proxies Using Async

Checking all the proxies in the list one by one may be an option for some, but it has one significant downside - this approach is painfully slow, because it is synchronous. We tackle requests one at a time and only move to the next once the previous one is completed.

A better option would be to make requests and wait for responses in a non-blocking way - this would speed up the script significantly.

To do that, we use the `aiohttp` module. You can install it using the following CLI command:

```bash
$ pip install aiohttp
```

Then, create a Python file where you define:
* the CSV filename that contains the proxy list
* the URL that you wish to use to check the proxies
* how long you are willing to wait for each proxy - the timeout setting

```python
CSV_FILENAME = 'proxies.csv'
URL_TO_CHECK = 'https://ip.oxylabs.io/location'
TIMEOUT_IN_SECONDS = 10
```

Next, we define an async function that we will later run using the asyncio module.
It accepts two parameters:
* the URL it needs to request
* the proxy to use to access it

We then print the response. If the script receives an error when attempting to access the URL via the proxy, it prints that as well.

```python
import asyncio
import csv

import aiohttp


async def check_proxy(url, proxy):
    try:
        session_timeout = aiohttp.ClientTimeout(total=None,
                                                sock_connect=TIMEOUT_IN_SECONDS,
                                                sock_read=TIMEOUT_IN_SECONDS)
        async with aiohttp.ClientSession(timeout=session_timeout) as session:
            async with session.get(url, proxy=proxy, timeout=TIMEOUT_IN_SECONDS) as resp:
                print(await resp.text())
    except Exception as error:
        # you can comment out this line to only see valid proxies printed out in the command line
        print('Proxy responded with an error: ', error)
        return
```

Then, we define a main function that reads the CSV file and creates an asynchronous task to check the proxy for every single record in the file.

```python
async def main():
    tasks = []
    with open(CSV_FILENAME) as open_file:
        reader = csv.reader(open_file)
        for csv_row in reader:
            task = asyncio.create_task(check_proxy(URL_TO_CHECK, csv_row[0]))
            tasks.append(task)

    await asyncio.gather(*tasks)
```

Finally, we run the main function and wait until all the async tasks complete
```python
asyncio.run(main())
```

This complete code is available in [rotating_multiple_proxies_async.py](rotating_multiple_proxies_async.py)

This code now runs exceptionally fast!
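
One caveat: `main()` launches a task for every row at once, so a very large `proxies.csv` can open hundreds of connections at the same time. A minimal sketch of one way to cap concurrency - assuming the `check_proxy` coroutine and the constants defined above, and a hypothetical limit of 20 - is to guard each check with an `asyncio.Semaphore`:

```python
import asyncio
import csv

MAX_CONCURRENT_CHECKS = 20  # hypothetical limit, tune it to your needs


async def check_proxy_limited(url, proxy, semaphore):
    # Only MAX_CONCURRENT_CHECKS checks may run at the same time.
    async with semaphore:
        await check_proxy(url, proxy)


async def main():
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_CHECKS)
    tasks = []
    with open(CSV_FILENAME) as open_file:
        reader = csv.reader(open_file)
        for csv_row in reader:
            tasks.append(asyncio.create_task(
                check_proxy_limited(URL_TO_CHECK, csv_row[0], semaphore)))

    await asyncio.gather(*tasks)
```

Run it with `asyncio.run(main())` as before; the script stays non-blocking while keeping an upper bound on how many proxies are being checked at any given moment.
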
# We are open to contributions!

Be sure to play around with it and create a pull request with any improvements you may find.
Also, check out this [best rotating proxy services](https://medium.com/@oxylabs.io/10-best-rotating-proxy-services-for-2024-853d840af1a4) list.

Happy coding!

--------------------------------------------------------------------------------
/no_proxy.py:
--------------------------------------------------------------------------------
import requests

response = requests.get('https://ip.oxylabs.io/location')
print(response.text)

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
aiohttp==3.8.1
requests==2.27.1

--------------------------------------------------------------------------------
/rotating_multiple_proxies.py:
--------------------------------------------------------------------------------
import csv

import requests
from requests.exceptions import ProxyError, ReadTimeout, ConnectTimeout

TIMEOUT_IN_SECONDS = 10
CSV_FILENAME = 'proxies.csv'

with open(CSV_FILENAME) as open_file:
    reader = csv.reader(open_file)
    for csv_row in reader:
        # Build the proxy configuration for the current row of the CSV file.
        scheme_proxy_map = {
            'https': csv_row[0],
        }

        try:
            response = requests.get(
                'https://ip.oxylabs.io/location',
                proxies=scheme_proxy_map,
                timeout=TIMEOUT_IN_SECONDS,
            )
        except (ProxyError, ReadTimeout, ConnectTimeout) as error:
            # Skip proxies that fail to connect or time out.
            pass
        else:
            print(response.text)

--------------------------------------------------------------------------------
/rotating_multiple_proxies_async.py:
--------------------------------------------------------------------------------
import csv
import aiohttp
import asyncio

CSV_FILENAME = 'proxies.csv'
URL_TO_CHECK = 'https://ip.oxylabs.io/location'
TIMEOUT_IN_SECONDS = 10


async def check_proxy(url, proxy):
    try:
        session_timeout = aiohttp.ClientTimeout(
            total=None, sock_connect=TIMEOUT_IN_SECONDS, sock_read=TIMEOUT_IN_SECONDS
        )
        async with aiohttp.ClientSession(timeout=session_timeout) as session:
            async with session.get(
                url, proxy=proxy, timeout=TIMEOUT_IN_SECONDS
            ) as resp:
                print(await resp.text())
    except Exception as error:
        print('Proxy responded with an error: ', error)
        return


async def main():
    tasks = []
    with open(CSV_FILENAME) as open_file:
        reader = csv.reader(open_file)
        for csv_row in reader:
            task = asyncio.create_task(check_proxy(URL_TO_CHECK, csv_row[0]))
            tasks.append(task)

    await asyncio.gather(*tasks)


asyncio.run(main())

--------------------------------------------------------------------------------
/single_proxy.py:
--------------------------------------------------------------------------------
import requests
from requests.exceptions import ProxyError, ReadTimeout, ConnectTimeout

PROXY = 'http://2.56.215.247:3128'
TIMEOUT_IN_SECONDS = 10

scheme_proxy_map = {
    'https': PROXY,
}
try:
    response = requests.get(
        'https://ip.oxylabs.io/location', proxies=scheme_proxy_map, timeout=TIMEOUT_IN_SECONDS
    )
except (ProxyError, ReadTimeout, ConnectTimeout) as error:
    print('Unable to connect to the proxy: ', error)
else:
    print(response.text)

--------------------------------------------------------------------------------