├── .github
│   ├── ISSUE_TEMPLATE
│   │   ├── bug-report.md
│   │   ├── feature_request.md
│   │   └── question.md
│   └── dependabot.yml
├── .gitignore
├── GeoLite2-City.mmdb
├── LICENSE
├── README.md
├── config.py
├── main.py
├── requirements.txt
└── screenshot.png

/.github/ISSUE_TEMPLATE/bug-report.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Bug report
3 | about: Create a report to help us improve
4 | title: ''
5 | labels: ''
6 | assignees: ''
7 | 
8 | ---
9 | 
10 | **Describe the bug**
11 | A clear and concise description of what the bug is.
12 | 
13 | **Technical information**
14 | - OS:
15 | - Python version:
16 | - Output of `pip freeze`:
17 | ```
18 | Put the output here.
19 | ```
20 | 
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature_request.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Feature request
3 | about: Suggest an idea for this project
4 | title: ''
5 | labels: ''
6 | assignees: ''
7 | 
8 | ---
9 | 
10 | **Is your feature request related to a problem? Please describe.**
11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
12 | 
13 | **Describe the solution you'd like**
14 | A clear and concise description of what you want to happen.
15 | 
16 | **Describe alternatives you've considered**
17 | A clear and concise description of any alternative solutions or features you've considered.
18 | 
19 | **Additional context**
20 | Add any other context or screenshots about the feature request here.
21 | 
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/question.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Question
3 | about: Ask a question
4 | title: ''
5 | labels: ''
6 | assignees: ''
7 | 
8 | ---
9 | 
10 | 
11 | 
--------------------------------------------------------------------------------
/.github/dependabot.yml:
--------------------------------------------------------------------------------
1 | version: 2
2 | updates:
3 |   - package-ecosystem: pip
4 |     directory: "/"
5 |     schedule:
6 |       interval: daily
7 |       time: "15:00"
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | __pycache__/
2 | proxies/
3 | proxies_anonymous/
4 | proxies_geolocation/
5 | proxies_geolocation_anonymous/
--------------------------------------------------------------------------------
/GeoLite2-City.mmdb:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hoemotion/proxy-scraper-checker/d4aec8a1121b92bdebc4c43e1f3a74c31ba43330/GeoLite2-City.mmdb
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2021 monosans
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # proxy-scraper-checker
2 | ![Screenshot](screenshot.png)
3 | 
4 | Check free anonymous HTTP, SOCKS4 and SOCKS5 proxies from different sources. Supports determining the exit-node's geolocation for each proxy.
5 | 
6 | > **Note:** This isn't my own repo; I only added a few proxy sources.
7 | 
8 | **For a version that uses Python's built-in `logging` instead of [rich](https://github.com/willmcgugan/rich), see the [simple-output](https://github.com/monosans/proxy-scraper-checker/tree/simple-output) branch.**
9 | 
10 | ## Usage
11 | 
12 | - Make sure your `Python` version is 3.7 or higher.
13 | - Install the dependencies from `requirements.txt` (`pip install -r requirements.txt`).
14 | - Edit `config.py` according to your preferences.
15 | - Run `main.py`.
16 | 
17 | ## Folder descriptions
18 | 
19 | When the script finishes running, the following folders will be created:
20 | 
21 | - `proxies` - proxies with any anonymity level.
22 | 
23 | - `proxies_anonymous` - anonymous proxies.
24 | 
25 | - `proxies_geolocation` - same as `proxies`, but including the exit-node's geolocation.
26 | 
27 | - `proxies_geolocation_anonymous` - same as `proxies_anonymous`, but including the exit-node's geolocation.
28 | 
29 | The geolocation format is `ip:port::Country::Region::City`.
30 | 
31 | ## Buy me a coffee
32 | 
33 | Ask for details in [Telegram](https://t.me/monosans) or [VK](https://vk.com/id607137534).
34 | 
35 | ## License
36 | 
37 | [MIT](LICENSE)
38 | 
39 | This product includes GeoLite2 data created by MaxMind, available from http://www.maxmind.com.
40 | 
--------------------------------------------------------------------------------
/config.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | 
3 | # How many seconds to wait for the proxy to make a connection.
4 | # The higher this number, the longer the check will take
5 | # and the more proxies you will receive.
6 | TIMEOUT = 10
7 | 
8 | # Maximum number of concurrent connections.
9 | # Please don't set this higher than 950.
10 | MAX_CONNECTIONS = 950
11 | 
12 | # Add geolocation info for each proxy (True or False).
13 | # The output format is ip:port::Country::Region::City.
14 | GEOLOCATION = True
15 | 
16 | # Service for checking the IP address.
17 | IP_SERVICE = "https://checkip.amazonaws.com"
18 | 
19 | # PROTOCOL - whether to check proxies of a given protocol (True or False).
20 | # PROTOCOL_SOURCES - URLs of the proxy lists to scrape.
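# For example, adding your own list is just appending one more URL string to
# the relevant tuple below. A hypothetical sketch (example.com is a
# placeholder, not a bundled source):
#
#     HTTP_SOURCES = HTTP_SOURCES + ("https://example.com/my-proxies.txt",)
#
# Each source should serve plain text with one "ip:port" proxy per line.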
21 | HTTP = True
22 | HTTP_SOURCES = (
23 |     "https://api.proxyscrape.com/v2/?request=getproxies&protocol=http",
24 |     "https://raw.githubusercontent.com/chipsed/proxies/main/proxies.txt",
25 |     "https://raw.githubusercontent.com/clarketm/proxy-list/master/proxy-list-raw.txt",
26 |     "https://raw.githubusercontent.com/hendrikbgr/Free-Proxy-Repo/master/proxy_list.txt",
27 |     "https://raw.githubusercontent.com/jetkai/proxy-list/main/online-proxies/txt/proxies-http%2Bhttps.txt",
28 |     "https://raw.githubusercontent.com/mmpx12/proxy-list/master/http.txt",
29 |     "https://raw.githubusercontent.com/mmpx12/proxy-list/master/https.txt",
30 |     "https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/http.txt",
31 |     "https://raw.githubusercontent.com/proxiesmaster/Free-Proxy-List/main/proxies.txt",
32 |     "https://raw.githubusercontent.com/roma8ok/proxy-list/main/proxy-list-http.txt",
33 |     "https://raw.githubusercontent.com/roma8ok/proxy-list/main/proxy-list-https.txt",
34 |     "https://raw.githubusercontent.com/roosterkid/openproxylist/main/HTTPS_RAW.txt",
35 |     "https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/http.txt",
36 |     "https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/https.txt",
37 |     "https://raw.githubusercontent.com/sunny9577/proxy-scraper/master/proxies.txt",
38 |     "https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/http.txt",
39 |     "https://raw.githubusercontent.com/UserR3X/proxy-list/main/all.txt",
40 |     "https://raw.githubusercontent.com/Volodichev/proxy-list/main/http.txt",
41 |     "https://www.proxy-list.download/api/v1/get?type=http",
42 |     "https://www.proxy-list.download/api/v1/get?type=https",
43 |     "https://www.proxyscan.io/download?type=http",
44 | )
45 | SOCKS4 = True
46 | SOCKS4_SOURCES = (
47 |     "https://api.proxyscrape.com/v2/?request=getproxies&protocol=socks4",
48 |     "https://raw.githubusercontent.com/jetkai/proxy-list/main/online-proxies/txt/proxies-socks4.txt",
49 |     "https://raw.githubusercontent.com/mmpx12/proxy-list/master/socks4.txt",
50 |     "https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/socks4.txt",
51 |     "https://raw.githubusercontent.com/roosterkid/openproxylist/main/SOCKS4_RAW.txt",
52 |     "https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/socks4.txt",
53 |     "https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/socks4.txt",
54 |     "https://www.proxy-list.download/api/v1/get?type=socks4",
55 |     "https://www.proxyscan.io/download?type=socks4",
56 | )
57 | SOCKS5 = True
58 | SOCKS5_SOURCES = (
59 |     "https://api.proxyscrape.com/v2/?request=getproxies&protocol=socks5",
60 |     "https://raw.githubusercontent.com/hookzof/socks5_list/master/proxy.txt",
61 |     "https://raw.githubusercontent.com/jetkai/proxy-list/main/online-proxies/txt/proxies-socks5.txt",
62 |     "https://raw.githubusercontent.com/mmpx12/proxy-list/master/socks5.txt",
63 |     "https://raw.githubusercontent.com/monosans/proxy-list/main/proxies/socks5.txt",
64 |     "https://raw.githubusercontent.com/roma8ok/proxy-list/main/proxy-list-socks5.txt",
65 |     "https://raw.githubusercontent.com/roosterkid/openproxylist/main/SOCKS5_RAW.txt",
66 |     "https://raw.githubusercontent.com/ShiftyTR/Proxy-List/master/socks5.txt",
67 |     "https://raw.githubusercontent.com/TheSpeedX/PROXY-List/master/socks5.txt",
68 |     "https://www.proxy-list.download/api/v1/get?type=socks5",
69 |     "https://www.proxyscan.io/download?type=socks5",
70 | )
71 | 
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | # -*- coding: utf-8 -*-
3 | import asyncio
4 | from ipaddress import IPv4Address
5 | from os import mkdir
6 | from random import shuffle
7 | from shutil import rmtree
8 | from typing import Any, Dict, Iterable, Optional, Tuple
9 | 
10 | from aiohttp import ClientSession
11 | from aiohttp_socks import ProxyConnector
12 | from maxminddb import open_database
13 | from maxminddb.reader import Reader
14 | from rich.console import Console
15 | from rich.progress import (
16 |     BarColumn,
17 |     Progress,
18 |     TaskID,
19 |     TextColumn,
20 |     TimeRemainingColumn,
21 | )
22 | from rich.table import Table
23 | 
24 | import config
25 | 
26 | 
27 | class ProxyScraperChecker:
28 |     def __init__(
29 |         self,
30 |         *,
31 |         max_connections: int = 950,
32 |         timeout: float = 5,
33 |         geolite2_city_mmdb: Optional[str] = None,
34 |         ip_service: str = "https://checkip.amazonaws.com",
35 |         http_sources: Optional[Iterable[str]] = None,
36 |         socks4_sources: Optional[Iterable[str]] = None,
37 |         socks5_sources: Optional[Iterable[str]] = None,
38 |         console: Optional[Console] = None,
39 |     ) -> None:
40 |         """Scrape and check proxies from sources and save them to files.
41 | 
42 |         Args:
43 |             max_connections (int): Maximum concurrent connections.
44 |             timeout (float): How many seconds to wait for the connection.
45 |             geolite2_city_mmdb (str): Path to the GeoLite2-City.mmdb if you
46 |                 want to add location info for each proxy.
47 |             ip_service (str): Service for getting your IP address and checking
48 |                 if proxies are valid.
49 |         """
50 |         self.sem = asyncio.Semaphore(max_connections)
51 |         self.IP_SERVICE = ip_service.strip()
52 |         self.TIMEOUT = timeout
53 |         self.MMDB = geolite2_city_mmdb
54 |         self.SOURCES = {
55 |             proto: (sources,)
56 |             if isinstance(sources, str)
57 |             else frozenset(sources)
58 |             for proto, sources in (
59 |                 ("http", http_sources),
60 |                 ("socks4", socks4_sources),
61 |                 ("socks5", socks5_sources),
62 |             )
63 |             if sources
64 |         }
65 |         self.proxies: Dict[str, Dict[str, Optional[str]]] = {
66 |             proto: {} for proto in self.SOURCES
67 |         }
68 |         self.proxies_count = {proto: 0 for proto in self.SOURCES}
69 |         self.c = console or Console()
70 | 
71 |     @staticmethod
72 |     def append_to_file(file_path: str, content: str) -> None:
73 |         with open(file_path, "a", encoding="utf-8") as f:
74 |             f.write(f"{content}\n")
75 | 
76 |     @staticmethod
77 |     def get_geolocation(ip: Optional[str], reader: Reader) -> str:
78 |         """Get proxy's geolocation.
79 | 
80 |         Args:
81 |             ip (str): Proxy's ip.
82 |             reader (Reader): mmdb Reader instance.
83 | 
84 |         Returns:
85 |             str: ::Country Name::Region::City
86 |         """
87 |         if not ip:
88 |             return "::None::None::None"
89 |         geolocation = reader.get(ip)
90 |         if not isinstance(geolocation, dict):
91 |             return "::None::None::None"
92 |         country = geolocation.get("country")
93 |         if country:
94 |             country = country["names"]["en"]
95 |         else:
96 |             country = geolocation.get("continent")
97 |             if country:
98 |                 country = country["names"]["en"]
99 |         region = geolocation.get("subdivisions")
100 |         if region:
101 |             region = region[0]["names"]["en"]
102 |         city = geolocation.get("city")
103 |         if city:
104 |             city = city["names"]["en"]
105 |         return f"::{country}::{region}::{city}"
106 | 
107 |     async def fetch_source(
108 |         self,
109 |         session: ClientSession,
110 |         source: str,
111 |         proto: str,
112 |         progress: Progress,
113 |         task: TaskID,
114 |     ) -> None:
115 |         """Get proxies from source.
116 | 
117 |         Args:
118 |             source (str): Proxy list URL.
119 |             proto (str): http/socks4/socks5.
120 |         """
121 |         try:
122 |             async with session.get(source.strip(), timeout=15) as r:
123 |                 status = r.status
124 |                 text = await r.text(encoding="utf-8")
125 |         except Exception as e:
126 |             self.c.print(f"{source}: {e}")
127 |         else:
128 |             if status == 200:
129 |                 for proxy in text.splitlines():
130 |                     proxy = (
131 |                         proxy.replace(f"{proto}://", "")
132 |                         .replace("https://", "")
133 |                         .strip()
134 |                     )
135 |                     try:
136 |                         IPv4Address(proxy.split(":")[0])
137 |                     except Exception:
138 |                         continue
139 |                     self.proxies[proto][proxy] = None
140 |             else:
141 |                 self.c.print(f"{source} status code: {status}")
142 |         progress.update(task, advance=1)
143 | 
144 |     async def check_proxy(
145 |         self, proxy: str, proto: str, progress: Progress, task: TaskID
146 |     ) -> None:
147 |         """Check proxy validity.
148 | 
149 |         Args:
150 |             proxy (str): ip:port.
151 |             proto (str): http/socks4/socks5.
152 |         """
153 |         try:
154 |             async with self.sem:
155 |                 async with ClientSession(
156 |                     connector=ProxyConnector.from_url(f"{proto}://{proxy}")
157 |                 ) as session:
158 |                     async with session.get(
159 |                         self.IP_SERVICE, timeout=self.TIMEOUT
160 |                     ) as r:
161 |                         exit_node = await r.text(encoding="utf-8")
162 |                         exit_node = exit_node.strip()
163 |                         IPv4Address(exit_node)
164 |         except Exception as e:
165 | 
166 |             # Too many open files
167 |             if isinstance(e, OSError) and e.errno == 24:
168 |                 self.c.print(
169 |                     "[red]Please set MAX_CONNECTIONS to a lower value.[/red]"
170 |                 )
171 | 
172 |             self.proxies[proto].pop(proxy)
173 |         else:
174 |             self.proxies[proto][proxy] = exit_node
175 |         progress.update(task, advance=1)
176 | 
177 |     async def fetch_all_sources(self) -> None:
178 |         """Get proxies from sources."""
179 |         with self._get_progress() as progress:
180 |             tasks = {
181 |                 proto: progress.add_task(
182 |                     "[yellow]Scraper[/yellow] [red]::[/red]"
183 |                     + f" [green]{proto.upper()}[/green]",
184 |                     total=len(sources),
185 |                 )
186 |                 for proto, sources in self.SOURCES.items()
187 |             }
188 |             async with ClientSession() as session:
189 |                 coroutines = (
190 |                     self.fetch_source(
191 |                         session, source, proto, progress, tasks[proto]
192 |                     )
193 |                     for proto, sources in self.SOURCES.items()
194 |                     for source in sources
195 |                 )
196 |                 await asyncio.gather(*coroutines)
197 |         for proto, proxies in self.proxies.items():
198 |             self.proxies_count[proto] = len(proxies)
199 | 
200 |     async def check_all_proxies(self) -> None:
201 |         with self._get_progress() as progress:
202 |             tasks = {
203 |                 proto: progress.add_task(
204 |                     "[yellow]Checker[/yellow] [red]::[/red]"
205 |                     + f" [green]{proto.upper()}[/green]",
206 |                     total=len(proxies),
207 |                 )
208 |                 for proto, proxies in self.proxies.items()
209 |             }
210 |             coroutines = [
211 |                 self.check_proxy(proxy, proto, progress, tasks[proto])
212 |                 for proto, proxies in self.proxies.items()
213 |                 for proxy in proxies
214 |             ]
215 |             shuffle(coroutines)  # check all protocols in mixed order
216 |             await asyncio.gather(*coroutines)
217 | 
218 |     def sort_proxies(self) -> None:
219 |         self.proxies = {
220 |             proto: dict(sorted(proxies.items(), key=self._get_sorting_key))
221 |             for proto, proxies in self.proxies.items()
222 |         }
223 | 
224 |     def save_proxies(self) -> None:
225 |         """Delete old proxies and save new ones."""
226 |         dirs_to_delete = (
227 |             "proxies",
228 |             "proxies_anonymous",
229 |             "proxies_geolocation",
230 |             "proxies_geolocation_anonymous",
231 |         )
232 |         for dir in dirs_to_delete:
233 |             try:
234 |                 rmtree(dir)
235 |             except FileNotFoundError:
236 |                 pass
237 |         dirs_to_create = (
238 |             dirs_to_delete if self.MMDB else ("proxies", "proxies_anonymous")
239 |         )
240 |         for dir in dirs_to_create:
241 |             mkdir(dir)
242 | 
243 |         # proxies and proxies_anonymous folders
244 |         for proto, proxies in self.proxies.items():
245 |             path = f"proxies/{proto}.txt"
246 |             path_anonymous = f"proxies_anonymous/{proto}.txt"
247 |             for proxy, exit_node in proxies.items():
248 |                 self.append_to_file(path, proxy)
249 |                 # Anonymous if the exit node differs from the proxy's own IP.
250 |                 if exit_node != proxy.split(":")[0]:
251 |                     self.append_to_file(path_anonymous, proxy)
252 | 
253 |         # proxies_geolocation and proxies_geolocation_anonymous folders
254 |         if self.MMDB:
255 |             with open_database(self.MMDB) as reader:
256 |                 for proto, proxies in self.proxies.items():
257 |                     path = f"proxies_geolocation/{proto}.txt"
258 |                     path_anonymous = (
259 |                         f"proxies_geolocation_anonymous/{proto}.txt"
260 |                     )
261 |                     for proxy, exit_node in proxies.items():
262 |                         line = proxy + self.get_geolocation(exit_node, reader)
263 |                         self.append_to_file(path, line)
264 |                         if exit_node != proxy.split(":")[0]:
265 |                             self.append_to_file(path_anonymous, line)
266 | 
267 |     async def main(self) -> None:
268 |         await self.fetch_all_sources()
269 |         await self.check_all_proxies()
270 | 
271 |         table = Table()
272 |         table.add_column("Protocol", style="cyan")
273 |         table.add_column("Working", style="magenta")
274 |         table.add_column("Total", style="green")
275 |         for proto, proxies in self.proxies.items():
276 |             working = len(proxies)
277 |             total = self.proxies_count[proto]
278 |             percentage = working / total * 100 if total else 0.0
279 |             table.add_row(
280 |                 proto.upper(), f"{working} ({percentage:.1f}%)", str(total)
281 |             )
282 |         self.c.print(table)
283 | 
284 |         self.sort_proxies()
285 |         self.save_proxies()
286 | 
287 |         self.c.print(
288 |             "[green]Proxy folders have been created in the current directory."
289 |             + "\nThank you for using proxy-scraper-checker :)[/green]"
290 |         )
291 | 
292 |     @staticmethod
293 |     def _get_sorting_key(x: Tuple[str, Any]) -> Tuple[int, ...]:
294 |         # Turn "ip:port" into a numeric (octet, octet, octet, octet, port) key.
295 |         octets = x[0].replace(":", ".").split(".")
296 |         return tuple(map(int, octets))
297 | 
298 |     def _get_progress(self) -> Progress:
299 |         return Progress(
300 |             TextColumn("[progress.description]{task.description}"),
301 |             BarColumn(),
302 |             TextColumn("[progress.percentage]{task.percentage:3.0f}%"),
303 |             TextColumn("[blue][{task.completed}/{task.total}][/blue]"),
304 |             TimeRemainingColumn(),
305 |             console=self.c,
306 |         )
307 | 
308 | 
309 | async def main() -> None:
310 |     await ProxyScraperChecker(
311 |         max_connections=config.MAX_CONNECTIONS,
312 |         timeout=config.TIMEOUT,
313 |         geolite2_city_mmdb="GeoLite2-City.mmdb"
314 |         if config.GEOLOCATION
315 |         else None,
316 |         ip_service=config.IP_SERVICE,
317 |         http_sources=config.HTTP_SOURCES if config.HTTP else None,
318 |         socks4_sources=config.SOCKS4_SOURCES if config.SOCKS4 else None,
319 |         socks5_sources=config.SOCKS5_SOURCES if config.SOCKS5 else None,
320 |     ).main()
321 | 
322 | 
323 | if __name__ == "__main__":
324 |     asyncio.run(main())
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | aiohttp-socks<1.0.0
2 | aiohttp[speedups]<4.0.0
3 | maxminddb<3.0.0
4 | rich<11.0.0
--------------------------------------------------------------------------------
/screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hoemotion/proxy-scraper-checker/d4aec8a1121b92bdebc4c43e1f3a74c31ba43330/screenshot.png
--------------------------------------------------------------------------------
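For reference, here is a minimal sketch of driving the checker programmatically instead of editing `config.py` and running `main.py`. It assumes the repository layout above; the source URL is a hypothetical placeholder for any plain-text proxy list:

```python
import asyncio

from main import ProxyScraperChecker


async def run() -> None:
    # Check HTTP proxies only, without geolocation
    # (pass geolite2_city_mmdb="GeoLite2-City.mmdb" to enable it).
    checker = ProxyScraperChecker(
        max_connections=512,
        timeout=10,
        http_sources=("https://example.com/my-proxies.txt",),  # placeholder
    )
    # Scrapes, checks, prints the summary table and writes the proxies* folders.
    await checker.main()


if __name__ == "__main__":
    asyncio.run(run())
```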