├── .github
│   ├── ISSUE_TEMPLATE
│   │   ├── bug-report.md
│   │   ├── feature_request.md
│   │   └── question.md
│   └── dependabot.yml
├── .gitignore
├── GeoLite2-City.mmdb
├── LICENSE
├── README.md
├── config.py
├── main.py
├── requirements.txt
└── screenshot.png
/.github/ISSUE_TEMPLATE/bug-report.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Bug report
3 | about: Create a report to help us improve
4 | title: ''
5 | labels: ''
6 | assignees: ''
7 |
8 | ---
9 |
10 | **Describe the bug**
11 | A clear and concise description of what the bug is.
12 |
13 | **Technical information**
14 | - OS:
15 | - Python version:
16 | - Output of `pip freeze`:
17 | ```
18 | Put the output here.
19 | ```
20 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature_request.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Feature request
3 | about: Suggest an idea for this project
4 | title: ''
5 | labels: ''
6 | assignees: ''
7 |
8 | ---
9 |
10 | **Is your feature request related to a problem? Please describe.**
11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
12 |
13 | **Describe the solution you'd like**
14 | A clear and concise description of what you want to happen.
15 |
16 | **Describe alternatives you've considered**
17 | A clear and concise description of any alternative solutions or features you've considered.
18 |
19 | **Additional context**
20 | Add any other context or screenshots about the feature request here.
21 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/question.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Question
3 | about: Ask a question
4 | title: ''
5 | labels: ''
6 | assignees: ''
7 |
8 | ---
9 |
10 |
11 |
--------------------------------------------------------------------------------
/.github/dependabot.yml:
--------------------------------------------------------------------------------
1 | version: 2
2 | updates:
3 | - package-ecosystem: pip
4 | directory: "/"
5 | schedule:
6 | interval: daily
7 | time: "15:00"
8 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__/
2 | proxies/
3 | proxies_anonymous/
4 | proxies_geolocation/
5 | proxies_geolocation_anonymous/
6 |
--------------------------------------------------------------------------------
/GeoLite2-City.mmdb:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hoemotion/proxy-scraper-checker/d4aec8a1121b92bdebc4c43e1f3a74c31ba43330/GeoLite2-City.mmdb
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 monosans
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # proxy-scraper-checker
2 |
3 |
4 | Scrape and check free anonymous HTTP, SOCKS4 and SOCKS5 proxies from multiple sources. Optionally determines the exit node's geolocation for each proxy.
5 |
6 | > **Note:** This is not my own repo; I only added a few proxy sources.
7 |
8 | **For a version that uses Python's built-in `logging` instead of [rich](https://github.com/willmcgugan/rich), see the [simple-output](https://github.com/monosans/proxy-scraper-checker/tree/simple-output) branch.**
9 |
10 | ## Usage
11 |
12 | - Make sure your Python version is 3.7 or higher.
13 | - Install dependencies from `requirements.txt` (`pip install -r requirements.txt`).
14 | - Edit `config.py` according to your preference.
15 | - Run `main.py`.
16 |
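The version requirement in the first step can be checked from the interpreter itself; a minimal sketch (the exact failure message is illustrative, not part of the repo):

```python
import sys

# main.py relies on asyncio.run(), which was added in Python 3.7.
if sys.version_info < (3, 7):
    raise SystemExit("Python 3.7+ is required, found " + sys.version.split()[0])
print("Python version OK:", sys.version.split()[0])
```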
17 | ## Folders description
18 |
19 | When the script finishes running, the following folders will be created:
20 |
21 | - `proxies` - proxies with any anonymity level.
22 |
23 | - `proxies_anonymous` - anonymous proxies.
24 |
25 | - `proxies_geolocation` - same as `proxies`, but including exit-node's geolocation.
26 |
27 | - `proxies_geolocation_anonymous` - same as `proxies_anonymous`, but including exit-node's geolocation.
28 |
29 | Geolocation format is `ip:port::Country::Region::City`.
30 |
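Lines in this format can be split back into their fields for downstream tooling; a minimal sketch (the `ProxyInfo` / `parse_line` names are illustrative, not part of the repo; the checker writes the literal string `None` for unknown fields):

```python
from typing import NamedTuple, Optional


class ProxyInfo(NamedTuple):
    host: str
    port: int
    country: Optional[str]
    region: Optional[str]
    city: Optional[str]


def parse_line(line: str) -> ProxyInfo:
    """Parse an 'ip:port::Country::Region::City' output line."""
    addr, country, region, city = line.strip().split("::")
    host, port = addr.split(":")
    # Map the literal string "None" back to a real None value.
    fields = [None if f == "None" else f for f in (country, region, city)]
    return ProxyInfo(host, int(port), *fields)


print(parse_line("1.2.3.4:8080::Germany::Hessen::Frankfurt"))
```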
31 | ## Buy me a coffee
32 |
33 | Ask for details in [Telegram](https://t.me/monosans) or [VK](https://vk.com/id607137534).
34 |
35 | ## License
36 |
37 | [MIT](LICENSE)
38 |
39 | This product includes GeoLite2 data created by MaxMind, available from http://www.maxmind.com.
40 |
--------------------------------------------------------------------------------
/config.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # How many seconds to wait for the proxy to make a connection.
4 | # The higher this number, the longer the check will take
5 | # and the more proxies you will receive.
6 | TIMEOUT = 10
7 |
8 | # Maximum concurrent connections.
9 | # Don't set higher than 950, please.
10 | MAX_CONNECTIONS = 950
11 |
12 | # Add geolocation info for each proxy (True or False).
13 | # Output format is ip:port::Country::Region::City
14 | GEOLOCATION = True
15 |
16 | # Service for checking the IP address.
17 | IP_SERVICE = "https://checkip.amazonaws.com"
18 |
19 | # PROTOCOL - whether to enable checking certain protocol proxies (True or False).
20 | # PROTOCOL_SOURCES - proxy lists URLs.
21 | HTTP = True
22 | HTTP_SOURCES = (
23 | "https://api.proxyscrape.com/v2/?request=getproxies&protocol=http",
24 | "https://raw.githubusercontent.com/chipsed/proxies/main/proxies.txt",
25 | "https://raw.githubusercontent.com/clarketm/proxy-list/master/proxy-list-raw.txt",
26 | "https://raw.githubusercontent.com/hendrikbgr/Free-Proxy-Repo/master/proxy_list.txt",
27 | "https://raw.githubusercontent.com/jetkai/proxy-list/main/online-proxies/txt/proxies-http%2Bhttps.txt",
28 | "https://raw.githubusercontent.com/mmpx12/proxy-list/master/http.txt",
29 | "https://raw.githubusercontent.com/mmpx12/proxy-list/master/https.txt",
30 | "https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/http.txt",
31 | "https://raw.githubusercontent.com/proxiesmaster/Free-Proxy-List/main/proxies.txt",
32 | "https://raw.githubusercontent.com/roma8ok/proxy-list/main/proxy-list-http.txt",
33 | "https://raw.githubusercontent.com/roma8ok/proxy-list/main/proxy-list-https.txt",
34 | "https://raw.githubusercontent.com/roosterkid/openproxylist/main/HTTPS_RAW.txt",
35 | "https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/http.txt",
36 | "https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/https.txt",
37 | "https://raw.githubusercontent.com/sunny9577/proxy-scraper/master/proxies.txt",
38 | "https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/http.txt",
39 | "https://raw.githubusercontent.com/UserR3X/proxy-list/main/all.txt",
40 | "https://raw.githubusercontent.com/Volodichev/proxy-list/main/http.txt",
41 | "https://www.proxy-list.download/api/v1/get?type=http",
42 | "https://www.proxy-list.download/api/v1/get?type=https",
43 | "https://www.proxyscan.io/download?type=http",
44 | )
45 | SOCKS4 = True
46 | SOCKS4_SOURCES = (
47 | "https://api.proxyscrape.com/v2/?request=getproxies&protocol=socks4",
48 | "https://raw.githubusercontent.com/jetkai/proxy-list/main/online-proxies/txt/proxies-socks4.txt",
49 | "https://raw.githubusercontent.com/mmpx12/proxy-list/master/socks4.txt",
50 | "https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/socks4.txt",
51 | "https://raw.githubusercontent.com/roosterkid/openproxylist/main/SOCKS4_RAW.txt",
52 | "https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/socks4.txt",
53 | "https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/socks4.txt",
54 | "https://www.proxy-list.download/api/v1/get?type=socks4",
55 | "https://www.proxyscan.io/download?type=socks4",
56 | )
57 | SOCKS5 = True
58 | SOCKS5_SOURCES = (
59 | "https://api.proxyscrape.com/v2/?request=getproxies&protocol=socks5",
60 | "https://raw.githubusercontent.com/hookzof/socks5_list/master/proxy.txt",
61 | "https://raw.githubusercontent.com/jetkai/proxy-list/main/online-proxies/txt/proxies-socks5.txt",
62 | "https://raw.githubusercontent.com/mmpx12/proxy-list/master/socks5.txt",
63 | "https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/socks5.txt",
64 | "https://raw.githubusercontent.com/roma8ok/proxy-list/main/proxy-list-socks5.txt",
65 | "https://raw.githubusercontent.com/roosterkid/openproxylist/main/SOCKS5_RAW.txt",
66 | "https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/socks5.txt",
67 | "https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/socks5.txt",
68 | "https://www.proxy-list.download/api/v1/get?type=socks5",
69 | "https://www.proxyscan.io/download?type=socks5",
70 | )
71 |
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | import asyncio
4 | from ipaddress import IPv4Address
5 | from os import mkdir
6 | from random import shuffle
7 | from shutil import rmtree
8 | from typing import Any, Dict, Iterable, Optional, Tuple
9 |
10 | from aiohttp import ClientSession
11 | from aiohttp_socks import ProxyConnector
12 | from maxminddb import open_database
13 | from maxminddb.reader import Reader
14 | from rich.console import Console
15 | from rich.progress import (
16 | BarColumn,
17 | Progress,
18 | TaskID,
19 | TextColumn,
20 | TimeRemainingColumn,
21 | )
22 | from rich.table import Table
23 |
24 | import config
25 |
26 |
27 | class ProxyScraperChecker:
28 | def __init__(
29 | self,
30 | *,
31 | max_connections: int = 950,
32 | timeout: float = 5,
33 | geolite2_city_mmdb: Optional[str] = None,
34 | ip_service: str = "https://checkip.amazonaws.com",
35 | http_sources: Optional[Iterable[str]] = None,
36 | socks4_sources: Optional[Iterable[str]] = None,
37 | socks5_sources: Optional[Iterable[str]] = None,
38 | console: Optional[Console] = None,
39 | ) -> None:
40 | """Scrape and check proxies from sources and save them to files.
41 |
42 | Args:
43 | max_connections (int): Maximum concurrent connections.
44 | timeout (float): How many seconds to wait for the connection.
45 | geolite2_city_mmdb (str): Path to the GeoLite2-City.mmdb if you
46 | want to add location info for each proxy.
47 | ip_service (str): Service for getting your IP address and checking
48 | if proxies are valid.
49 | """
50 | self.sem = asyncio.Semaphore(max_connections)
51 | self.IP_SERVICE = ip_service.strip()
52 | self.TIMEOUT = timeout
53 | self.MMDB = geolite2_city_mmdb
54 | self.SOURCES = {
55 | proto: (sources,)
56 | if isinstance(sources, str)
57 | else frozenset(sources)
58 | for proto, sources in (
59 | ("http", http_sources),
60 | ("socks4", socks4_sources),
61 | ("socks5", socks5_sources),
62 | )
63 | if sources
64 | }
65 | self.proxies: Dict[str, Dict[str, Optional[str]]] = {
66 | proto: {} for proto in self.SOURCES
67 | }
68 | self.proxies_count = {proto: 0 for proto in self.SOURCES}
69 | self.c = console or Console()
70 |
71 | @staticmethod
72 | def append_to_file(file_path: str, content: str) -> None:
73 | with open(file_path, "a", encoding="utf-8") as f:
74 | f.write(f"{content}\n")
75 |
76 | @staticmethod
77 | def get_geolocation(ip: Optional[str], reader: Reader) -> str:
78 | """Get proxy's geolocation.
79 |
80 | Args:
81 | ip (str): Proxy's ip.
82 | reader (Reader): mmdb Reader instance.
83 |
84 | Returns:
85 | str: ::Country Name::Region::City
86 | """
87 | if not ip:
88 | return "::None::None::None"
89 | geolocation = reader.get(ip)
90 | if not isinstance(geolocation, dict):
91 | return "::None::None::None"
92 | country = geolocation.get("country")
93 | if country:
94 | country = country["names"]["en"]
95 | else:
96 | country = geolocation.get("continent")
97 | if country:
98 | country = country["names"]["en"]
99 | region = geolocation.get("subdivisions")
100 | if region:
101 | region = region[0]["names"]["en"]
102 | city = geolocation.get("city")
103 | if city:
104 | city = city["names"]["en"]
105 | return f"::{country}::{region}::{city}"
106 |
107 | async def fetch_source(
108 | self,
109 | session: ClientSession,
110 | source: str,
111 | proto: str,
112 | progress: Progress,
113 | task: TaskID,
114 | ) -> None:
115 | """Get proxies from source.
116 |
117 | Args:
118 | source (str): Proxy list URL.
119 | proto (str): http/socks4/socks5.
120 | """
121 | try:
122 | async with session.get(source.strip(), timeout=15) as r:
123 | status = r.status
124 | text = await r.text(encoding="utf-8")
125 | except Exception as e:
126 | self.c.print(f"{source}: {e}")
127 | else:
128 | if status == 200:
129 | for proxy in text.splitlines():
130 | proxy = (
131 | proxy.replace(f"{proto}://", "")
132 | .replace("https://", "")
133 | .strip()
134 | )
135 | try:
136 | IPv4Address(proxy.split(":")[0])
137 | except Exception:
138 | continue
139 | self.proxies[proto][proxy] = None
140 | else:
141 | self.c.print(f"{source} status code: {status}")
142 | progress.update(task, advance=1)
143 |
144 | async def check_proxy(
145 | self, proxy: str, proto: str, progress: Progress, task: TaskID
146 | ) -> None:
147 | """Check proxy validity.
148 |
149 | Args:
150 | proxy (str): ip:port.
151 | proto (str): http/socks4/socks5.
152 | """
153 | try:
154 | async with self.sem:
155 | async with ClientSession(
156 | connector=ProxyConnector.from_url(f"{proto}://{proxy}")
157 | ) as session:
158 | async with session.get(
159 | self.IP_SERVICE, timeout=self.TIMEOUT
160 | ) as r:
161 | exit_node = await r.text(encoding="utf-8")
162 | exit_node = exit_node.strip()
163 | IPv4Address(exit_node)
164 | except Exception as e:
165 |
166 | # Too many open files
167 | if isinstance(e, OSError) and e.errno == 24:
168 | self.c.print(
169 | "[red]Please, set MAX_CONNECTIONS to lower value.[/red]"
170 | )
171 |
172 | self.proxies[proto].pop(proxy)
173 | else:
174 | self.proxies[proto][proxy] = exit_node
175 | progress.update(task, advance=1)
176 |
177 | async def fetch_all_sources(self) -> None:
178 | """Get proxies from sources."""
179 | with self._get_progress() as progress:
180 | tasks = {
181 | proto: progress.add_task(
182 | "[yellow]Scraper[/yellow] [red]::[/red]"
183 | + f" [green]{proto.upper()}[/green]",
184 | total=len(sources),
185 | )
186 | for proto, sources in self.SOURCES.items()
187 | }
188 | async with ClientSession() as session:
189 | coroutines = (
190 | self.fetch_source(
191 | session, source, proto, progress, tasks[proto]
192 | )
193 | for proto, sources in self.SOURCES.items()
194 | for source in sources
195 | )
196 | await asyncio.gather(*coroutines)
197 | for proto, proxies in self.proxies.items():
198 | self.proxies_count[proto] = len(proxies)
199 |
200 | async def check_all_proxies(self) -> None:
201 | with self._get_progress() as progress:
202 | tasks = {
203 | proto: progress.add_task(
204 | "[yellow]Checker[/yellow] [red]::[/red]"
205 | + f" [green]{proto.upper()}[/green]",
206 | total=len(proxies),
207 | )
208 | for proto, proxies in self.proxies.items()
209 | }
210 | coroutines = [
211 | self.check_proxy(proxy, proto, progress, tasks[proto])
212 | for proto, proxies in self.proxies.items()
213 | for proxy in proxies
214 | ]
215 | shuffle(coroutines)
216 | await asyncio.gather(*coroutines)
217 |
218 | def sort_proxies(self) -> None:
219 | self.proxies = {
220 | proto: dict(sorted(proxies.items(), key=self._get_sorting_key))
221 | for proto, proxies in self.proxies.items()
222 | }
223 |
224 | def save_proxies(self) -> None:
225 | """Delete old proxies and save new ones."""
226 | dirs_to_delete = (
227 | "proxies",
228 | "proxies_anonymous",
229 | "proxies_geolocation",
230 | "proxies_geolocation_anonymous",
231 | )
232 | for directory in dirs_to_delete:
233 | try:
234 | rmtree(directory)
235 | except FileNotFoundError:
236 | pass
237 | dirs_to_create = (
238 | dirs_to_delete if self.MMDB else ("proxies", "proxies_anonymous")
239 | )
240 | for directory in dirs_to_create:
241 | mkdir(directory)
242 |
243 | # proxies and proxies_anonymous folders
244 | for proto, proxies in self.proxies.items():
245 | path = f"proxies/{proto}.txt"
246 | path_anonymous = f"proxies_anonymous/{proto}.txt"
247 | for proxy, exit_node in proxies.items():
248 | self.append_to_file(path, proxy)
249 | if exit_node != proxy.split(":")[0]:
250 | self.append_to_file(path_anonymous, proxy)
251 |
252 | # proxies_geolocation and proxies_geolocation_anonymous folders
253 | if self.MMDB:
254 | with open_database(self.MMDB) as reader:
255 | for proto, proxies in self.proxies.items():
256 | path = f"proxies_geolocation/{proto}.txt"
257 | path_anonymous = (
258 | f"proxies_geolocation_anonymous/{proto}.txt"
259 | )
260 | for proxy, exit_node in proxies.items():
261 | line = proxy + self.get_geolocation(exit_node, reader)
262 | self.append_to_file(path, line)
263 | if exit_node != proxy.split(":")[0]:
264 | self.append_to_file(path_anonymous, line)
265 |
266 | async def main(self) -> None:
267 | await self.fetch_all_sources()
268 | await self.check_all_proxies()
269 |
270 | table = Table()
271 | table.add_column("Protocol", style="cyan")
272 | table.add_column("Working", style="magenta")
273 | table.add_column("Total", style="green")
274 | for proto, proxies in self.proxies.items():
275 | working = len(proxies)
276 | total = self.proxies_count[proto]
277 | percentage = working / total * 100 if total else 0.0
278 | table.add_row(
279 | proto.upper(), f"{working} ({percentage:.1f}%)", str(total)
280 | )
281 | self.c.print(table)
282 |
283 | self.sort_proxies()
284 | self.save_proxies()
285 |
286 | self.c.print(
287 | "[green]Proxy folders have been created in the current directory."
288 | + "\nThank you for using proxy-scraper-checker :)[/green]"
289 | )
290 |
291 | @staticmethod
292 | def _get_sorting_key(x: Tuple[str, Any]) -> Tuple[int, ...]:
293 | octets = x[0].replace(":", ".").split(".")
294 | return tuple(map(int, octets))
295 |
296 | def _get_progress(self) -> Progress:
297 | return Progress(
298 | TextColumn("[progress.description]{task.description}"),
299 | BarColumn(),
300 | TextColumn("[progress.percentage]{task.percentage:3.0f}%"),
301 | TextColumn("[blue][{task.completed}/{task.total}][/blue]"),
302 | TimeRemainingColumn(),
303 | console=self.c,
304 | )
305 |
306 |
307 | async def main() -> None:
308 | await ProxyScraperChecker(
309 | max_connections=config.MAX_CONNECTIONS,
310 | timeout=config.TIMEOUT,
311 | geolite2_city_mmdb="GeoLite2-City.mmdb"
312 | if config.GEOLOCATION
313 | else None,
314 | ip_service=config.IP_SERVICE,
315 | http_sources=config.HTTP_SOURCES if config.HTTP else None,
316 | socks4_sources=config.SOCKS4_SOURCES if config.SOCKS4 else None,
317 | socks5_sources=config.SOCKS5_SOURCES if config.SOCKS5 else None,
318 | ).main()
319 |
320 |
321 | if __name__ == "__main__":
322 | asyncio.run(main())
323 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | aiohttp-socks<1.0.0
2 | aiohttp[speedups]<4.0.0
3 | maxminddb<3.0.0
4 | rich<11.0.0
5 |
--------------------------------------------------------------------------------
/screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hoemotion/proxy-scraper-checker/d4aec8a1121b92bdebc4c43e1f3a74c31ba43330/screenshot.png
--------------------------------------------------------------------------------