├── README.md
├── no_proxy.py
├── requirements.txt
├── rotating_multiple_proxies.py
├── rotating_multiple_proxies_async.py
└── single_proxy.py
/README.md:
--------------------------------------------------------------------------------
1 | # Rotating Proxies With Python
2 | [python](https://github.com/topics/python) [web-scraping](https://github.com/topics/web-scraping) [rotating-proxies](https://github.com/topics/rotating-proxies)
3 |
4 | [Oxylabs](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)
5 |
6 | [Join our Discord](https://discord.gg/GbxmdGhZjq)
7 |
8 | ## Table of Contents
9 |
10 | - [Finding Current IP Address](#finding-your-current-ip-address)
11 | - [Using A Single Proxy](#using-a-single-proxy)
12 | - [Rotating Multiple Proxies](#rotating-multiple-proxies)
13 | - [Rotating Multiple Proxies Using Async](#rotating-multiple-proxies-using-async)
14 |
15 | ## Prerequisites
16 |
17 | This article uses the Python `requests` module. To install it in an isolated environment, you can use `virtualenv`, a tool for creating self-contained Python environments.
18 |
19 | Start by creating a virtual environment in your project folder by running
20 | ```bash
21 | $ virtualenv venv
22 | ```
23 | This will install Python, `pip`, and common libraries into the `venv` folder inside your project.
24 |
25 | Next, invoke the source command to activate the environment.
26 | ```bash
27 | $ source venv/bin/activate
28 | ```
29 |
30 | Lastly, install the `requests` module in the current virtual environment:
31 | ```bash
32 | $ pip install requests
33 | ```
34 |
35 | Alternatively, you can install the dependencies from the included [requirements.txt](requirements.txt) file by running
36 |
37 | ```bash
38 | $ pip install -r requirements.txt
39 | ```
40 |
41 | Congratulations, you have successfully installed the `requests` module. Now it's time to find out your current IP address!
42 |
43 | ## Finding Your Current IP Address
44 |
45 | Create a file with the `.py` extension with the following contents (or just copy [no_proxy.py](no_proxy.py)):
46 |
47 | ```python
48 | import requests
49 |
50 | response = requests.get('https://ip.oxylabs.io/location')
51 | print(response.text)
52 | ```
53 |
54 | Now, run it from a terminal:
55 |
56 | ```bash
57 | $ python no_proxy.py
58 |
59 | 128.90.50.100
60 | ```
61 | The output of this script shows your current public IP address, which identifies you to every website you request. Instead of exposing it directly when requesting pages, we will route requests through a proxy server.
62 |
63 | Let's start by using a single proxy.
64 |
65 | ## Using A Single Proxy
66 |
67 | Your first step is to [find a free proxy server](https://www.google.com/search?q=free+proxy+server+list).
68 |
69 | **Important Note**: free proxies are unreliable, slow, and may collect data about the pages you access. If you're looking for a reliable paid option, we highly recommend using [oxylabs.io](https://oxy.yt/GrVD).
70 |
71 | To use a proxy, you will need its:
72 | * scheme (e.g. `http`)
73 | * IP address (e.g. `2.56.215.247`)
74 | * port (e.g. `3128`)
75 | * username and password used to connect to the proxy (optional)
76 |
77 | Once you have these details, combine them in the following format:
78 | ```
79 | SCHEME://USERNAME:PASSWORD@YOUR_PROXY_IP:YOUR_PROXY_PORT
80 | ```
81 |
82 | Here are a few examples of the proxy formats you may encounter:
83 | ```text
84 | http://2.56.215.247:3128
85 | https://2.56.215.247:8091
86 | https://my-user:aegi1Ohz@2.56.215.247:8044
87 | ```
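Going the other way, the individual pieces can be recovered from a proxy URL with Python's standard library, which is handy for checking that an entry is well-formed (a small sketch, not part of the repository's code):

```python
from urllib.parse import urlsplit

proxy = 'https://my-user:aegi1Ohz@2.56.215.247:8044'
parts = urlsplit(proxy)

# Each component of the SCHEME://USERNAME:PASSWORD@IP:PORT format
# is exposed as an attribute of the split result.
print(parts.scheme)    # 'https'
print(parts.username)  # 'my-user'
print(parts.password)  # 'aegi1Ohz'
print(parts.hostname)  # '2.56.215.247'
print(parts.port)      # 8044
```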
88 |
89 | Once you have the proxy information, assign it to a constant.
90 |
91 | ```python
92 | PROXY = 'http://2.56.215.247:3128'
93 | ```
94 |
95 | Next, define a timeout in seconds. It is always a good idea to avoid waiting indefinitely for a response that may never arrive (due to network issues, server issues, or problems with the proxy server):
96 | ```python
97 | TIMEOUT_IN_SECONDS = 10
98 | ```
99 |
100 | The `requests` module [needs to know](https://docs.python-requests.org/en/master/user/advanced/#proxies) when to actually use the proxy.
101 | For that, consider the website you are attempting to access: does it use HTTP or HTTPS?
102 | Since we're trying to access **https**://ip.oxylabs.io/location, we can define this configuration as follows:
103 | ```python
104 | scheme_proxy_map = {
105 | 'https': PROXY,
106 | }
107 | ```
108 |
109 | **Note**: you can specify multiple protocols, and even define specific domains for which a different proxy will be used:
110 |
111 | ```python
112 | scheme_proxy_map = {
113 | 'http': PROXY1,
114 | 'https': PROXY2,
115 | 'https://example.org': PROXY3,
116 | }
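If you plan to make many requests, you can also attach the mapping to a `requests.Session` once instead of passing `proxies=` on every call. A brief sketch, reusing the placeholder proxy address from above:

```python
import requests

session = requests.Session()
# Every HTTPS request made through this session now uses the proxy.
session.proxies.update({'https': 'http://2.56.215.247:3128'})

# session.get('https://ip.oxylabs.io/location', timeout=10)
```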
117 | ```
118 |
119 | Finally, we make the request by calling `requests.get` and passing in all the variables defined earlier. We also handle exceptions and show an error message when a network issue occurs. Note that the exception classes must be imported from `requests.exceptions` first.
120 | 
121 | ```python
122 | from requests.exceptions import ProxyError, ReadTimeout, ConnectTimeout
123 | 
124 | try:
125 |     response = requests.get('https://ip.oxylabs.io/location', proxies=scheme_proxy_map, timeout=TIMEOUT_IN_SECONDS)
126 | except (ProxyError, ReadTimeout, ConnectTimeout) as error:
127 |     print('Unable to connect to the proxy: ', error)
128 | else:
129 |     print(response.text)
130 | ```
129 |
130 | The output of this script should show the IP address of your proxy:
131 |
132 | ```bash
133 | $ python single_proxy.py
134 |
135 | 2.56.215.247
136 | ```
137 |
138 | You are now hidden behind a proxy when making requests through your Python script.
139 | You can find the complete code in the file [single_proxy.py](single_proxy.py).
140 |
141 | Now we're ready to rotate through a list of proxies, instead of using a single one!
142 |
143 | ## Rotating Multiple Proxies
144 |
145 | If you're using unreliable proxies, it can be beneficial to save a bunch of them into a CSV file and run a loop to determine whether each is still available.
146 |
147 | For that purpose, first create a file `proxies.csv` with the following content:
148 | ```text
149 | http://2.56.215.247:3128
150 | https://88.198.24.108:8080
151 | http://50.206.25.108:80
152 | http://68.188.59.198:80
153 | ... any other proxy servers, each of them on a separate line
154 | ```
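Since free proxy lists often contain malformed entries, it can help to filter the file before looping over it. A small sketch using only the standard library (the helper name `read_valid_proxies` is illustrative, not part of the repository):

```python
import csv
import io
from urllib.parse import urlsplit

def read_valid_proxies(file_obj):
    """Yield proxy URLs that have at least a scheme and a host."""
    for row in csv.reader(file_obj):
        if not row:
            continue  # skip blank lines
        parts = urlsplit(row[0])
        if parts.scheme in ('http', 'https') and parts.hostname:
            yield row[0]

# Demonstrated on an in-memory file; pass open('proxies.csv') in practice.
sample = io.StringIO('http://2.56.215.247:3128\nnot-a-proxy\n')
valid = list(read_valid_proxies(sample))
print(valid)  # ['http://2.56.215.247:3128']
```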
155 |
156 | Then, create a Python file and define both the filename and how long you are willing to wait for a single proxy to respond:
157 |
158 | ```python
159 | TIMEOUT_IN_SECONDS = 10
160 | CSV_FILENAME = 'proxies.csv'
161 | ```
162 |
163 | Next, write the code that opens the CSV file, reads every proxy server line by line into a `csv_row` variable, and builds the `scheme_proxy_map` configuration needed by the `requests` module.
164 |
165 | ```python
166 | with open(CSV_FILENAME) as open_file:
167 | reader = csv.reader(open_file)
168 | for csv_row in reader:
169 | scheme_proxy_map = {
170 | 'https': csv_row[0],
171 | }
172 | ```
173 |
174 | Finally, we use the same scraping code from the previous section to access the website via the proxy (with the imports it needs at the top):
175 | 
176 | ```python
177 | import csv
178 | 
179 | import requests
180 | from requests.exceptions import ProxyError, ReadTimeout, ConnectTimeout
181 | 
182 | with open(CSV_FILENAME) as open_file:
183 |     reader = csv.reader(open_file)
184 |     for csv_row in reader:
185 |         scheme_proxy_map = {
186 |             'https': csv_row[0],
187 |         }
188 | 
189 |         # Access the website via proxy
190 |         try:
191 |             response = requests.get('https://ip.oxylabs.io/location', proxies=scheme_proxy_map, timeout=TIMEOUT_IN_SECONDS)
192 |         except (ProxyError, ReadTimeout, ConnectTimeout) as error:
193 |             pass
194 |         else:
195 |             print(response.text)
196 | ```
192 |
193 | **Note**: if you are only interested in scraping the content using *any* working proxy from the list, add a `break` after the `print` call to stop going through the proxies in the CSV file:
194 |
195 | ```python
196 | try:
197 | response = requests.get('https://ip.oxylabs.io/location', proxies=scheme_proxy_map, timeout=TIMEOUT_IN_SECONDS)
198 | except (ProxyError, ReadTimeout, ConnectTimeout) as error:
199 | pass
200 | else:
201 | print(response.text)
202 | break # notice the break here
203 | ```
204 |
205 | The complete code is available in [rotating_multiple_proxies.py](rotating_multiple_proxies.py).
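Note that this script checks each proxy once; to actually rotate proxies across an ongoing stream of requests, a common pattern (sketched below, not part of the repository's code) is a round-robin iterator built with `itertools.cycle`:

```python
from itertools import cycle

# Placeholder addresses from proxies.csv; load them however you prefer.
proxy_pool = cycle([
    'http://2.56.215.247:3128',
    'https://88.198.24.108:8080',
    'http://50.206.25.108:80',
])

# Each request takes the next proxy in round-robin order; after the
# last proxy, the cycle wraps around to the first one again.
assigned = [next(proxy_pool) for _ in range(4)]
# Each entry would then be used as:
# requests.get(url, proxies={'https': proxy}, timeout=TIMEOUT_IN_SECONDS)
```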
206 |
207 | The only thing preventing us from reaching our full potential is speed.
208 | It's time to tackle that in the next section!
209 |
210 | ## Rotating Multiple Proxies Using Async
211 |
212 | Checking all the proxies in the list one by one may be an option for some, but it has one significant downside: this approach is painfully slow. That's because it is synchronous - we make one request at a time and only move to the next once the previous one completes.
213 |
214 | A better option would be to make requests and wait for responses in a non-blocking way - this would speed up the script significantly.
215 |
216 | To do that, we use the `aiohttp` module. You can install it using the following CLI command:
217 |
218 | ```bash
219 | $ pip install aiohttp
220 | ```
221 |
222 | Then, create a Python file where you define:
223 | * the CSV filename that contains the proxy list
224 | * the URL to use for checking the proxies
225 | * how long you are willing to wait for each proxy - the timeout setting
226 |
227 | ```python
228 | CSV_FILENAME = 'proxies.csv'
229 | URL_TO_CHECK = 'https://ip.oxylabs.io/location'
230 | TIMEOUT_IN_SECONDS = 10
231 | ```
232 |
233 | Next, we define an async function that checks a single proxy; we will schedule and run it with the `asyncio` module later.
234 | It accepts two parameters:
235 | * the URL it needs to request
236 | * the proxy to use to access it
237 |
238 | We then print the response text. If the script receives an error when attempting to access the URL via the proxy, it prints that as well.
239 |
240 | ```python
241 |
242 | async def check_proxy(url, proxy):
243 | try:
244 | session_timeout = aiohttp.ClientTimeout(total=None,
245 | sock_connect=TIMEOUT_IN_SECONDS,
246 | sock_read=TIMEOUT_IN_SECONDS)
247 | async with aiohttp.ClientSession(timeout=session_timeout) as session:
248 | async with session.get(url, proxy=proxy, timeout=TIMEOUT_IN_SECONDS) as resp:
249 | print(await resp.text())
250 | except Exception as error:
251 | # you can comment out this line to only see valid proxies printed out in the command line
252 | print('Proxy responded with an error: ', error)
253 | return
254 | ```
255 |
256 | Then, we define a `main` function that reads the CSV file and creates an asynchronous `check_proxy` task for every record in it.
257 |
258 | ```python
259 |
260 | async def main():
261 | tasks = []
262 | with open(CSV_FILENAME) as open_file:
263 | reader = csv.reader(open_file)
264 | for csv_row in reader:
265 | task = asyncio.create_task(check_proxy(URL_TO_CHECK, csv_row[0]))
266 | tasks.append(task)
267 |
268 | await asyncio.gather(*tasks)
269 | ```
270 |
271 | Finally, we run the `main` function and wait until all the async tasks complete:
272 | ```python
273 | asyncio.run(main())
274 | ```
275 |
276 | The complete code is available in [rotating_multiple_proxies_async.py](rotating_multiple_proxies_async.py).
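One caveat: `main` above launches a task for every proxy at once, which can overwhelm your machine or network with a very large list. A common refinement is to cap concurrency with `asyncio.Semaphore`. The sketch below simulates the check with `asyncio.sleep` instead of a real `aiohttp` request, and the names and limit are illustrative:

```python
import asyncio

MAX_CONCURRENT_CHECKS = 10  # assumed limit; tune it to your needs

async def check_proxy_limited(semaphore, proxy):
    # Only MAX_CONCURRENT_CHECKS coroutines run this section at a time.
    async with semaphore:
        await asyncio.sleep(0.01)  # stand-in for the real network call
        return proxy

async def run_checks(proxies):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_CHECKS)
    tasks = [asyncio.create_task(check_proxy_limited(semaphore, p))
             for p in proxies]
    # gather returns results in the same order the tasks were created
    return await asyncio.gather(*tasks)

proxies = ['http://10.0.0.%d:3128' % i for i in range(25)]
results = asyncio.run(run_checks(proxies))
```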
277 |
278 | Because all the proxy checks now run concurrently, the script completes much faster!
279 |
280 | ## We are open to contributions!
281 |
282 | Be sure to play around with it and create a pull request with any improvements you may find.
283 | Also, check this [Best rotating proxy service](https://medium.com/@oxylabs.io/10-best-rotating-proxy-services-for-2024-853d840af1a4) list.
284 |
285 | Happy coding!
286 |
--------------------------------------------------------------------------------
/no_proxy.py:
--------------------------------------------------------------------------------
1 | import requests
2 |
3 | response = requests.get('https://ip.oxylabs.io/location')
4 | print(response.text)
5 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | aiohttp==3.8.1
2 | requests==2.27.1
3 |
--------------------------------------------------------------------------------
/rotating_multiple_proxies.py:
--------------------------------------------------------------------------------
1 | import csv
2 |
3 | import requests
4 | from requests.exceptions import ProxyError, ReadTimeout, ConnectTimeout
5 |
6 | TIMEOUT_IN_SECONDS = 10
7 | CSV_FILENAME = 'proxies.csv'
8 |
9 | with open(CSV_FILENAME) as open_file:
10 | reader = csv.reader(open_file)
11 | for csv_row in reader:
12 | scheme_proxy_map = {
13 | 'https': csv_row[0],
14 | }
15 |
16 | try:
17 | response = requests.get(
18 | 'https://ip.oxylabs.io/location',
19 | proxies=scheme_proxy_map,
20 | timeout=TIMEOUT_IN_SECONDS,
21 | )
22 | except (ProxyError, ReadTimeout, ConnectTimeout) as error:
23 | pass
24 | else:
25 | print(response.text)
26 |
--------------------------------------------------------------------------------
/rotating_multiple_proxies_async.py:
--------------------------------------------------------------------------------
1 | import csv
2 | import aiohttp
3 | import asyncio
4 |
5 | CSV_FILENAME = 'proxies.csv'
6 | URL_TO_CHECK = 'https://ip.oxylabs.io/location'
7 | TIMEOUT_IN_SECONDS = 10
8 |
9 |
10 | async def check_proxy(url, proxy):
11 | try:
12 | session_timeout = aiohttp.ClientTimeout(
13 | total=None, sock_connect=TIMEOUT_IN_SECONDS, sock_read=TIMEOUT_IN_SECONDS
14 | )
15 | async with aiohttp.ClientSession(timeout=session_timeout) as session:
16 | async with session.get(
17 | url, proxy=proxy, timeout=TIMEOUT_IN_SECONDS
18 | ) as resp:
19 | print(await resp.text())
20 | except Exception as error:
21 | print('Proxy responded with an error: ', error)
22 | return
23 |
24 |
25 | async def main():
26 | tasks = []
27 | with open(CSV_FILENAME) as open_file:
28 | reader = csv.reader(open_file)
29 | for csv_row in reader:
30 | task = asyncio.create_task(check_proxy(URL_TO_CHECK, csv_row[0]))
31 | tasks.append(task)
32 |
33 | await asyncio.gather(*tasks)
34 |
35 |
36 | asyncio.run(main())
37 |
--------------------------------------------------------------------------------
/single_proxy.py:
--------------------------------------------------------------------------------
1 | import requests
2 | from requests.exceptions import ProxyError, ReadTimeout, ConnectTimeout
3 |
4 | PROXY = 'http://2.56.215.247:3128'
5 | TIMEOUT_IN_SECONDS = 10
6 |
7 | scheme_proxy_map = {
8 | 'https': PROXY,
9 | }
10 | try:
11 | response = requests.get(
12 | 'https://ip.oxylabs.io/location', proxies=scheme_proxy_map, timeout=TIMEOUT_IN_SECONDS
13 | )
14 | except (ProxyError, ReadTimeout, ConnectTimeout) as error:
15 | print('Unable to connect to the proxy: ', error)
16 | else:
17 | print(response.text)
18 |
--------------------------------------------------------------------------------