├── README.md
└── proxy-scraper.py
/README.md:
--------------------------------------------------------------------------------
# Proxy Scraper & Checker & Free List

An easy proxy scraper and checker, plus a publicly available proxy list.

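Requirements: Python 3 with the requests package; maxminddb-geolite2 is only needed for the `--country` option. Assuming pip is available for your Python 3 installation, both can be installed with:
```
pip3 install requests maxminddb-geolite2
```

Full usage:
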
```
usage: proxy-scraper.py [-h] [-c] -o OUTPUT [-t THREADS] [--timeout TIMEOUT]
                        [--http] [--check-with-website CHECK_WITH_WEBSITE]
                        [--country] [--connection-time] [-f] [-i]

optional arguments:
  -h, --help            show this help message and exit
  -c, --check           Check the scraped proxies
  -o OUTPUT, --output OUTPUT
                        Output file
  -t THREADS, --threads THREADS
                        Checker threads count (default: 20)
  --timeout TIMEOUT     Checker timeout in seconds (default: 5)
  --http                Check proxies for HTTP instead of HTTPS
  --check-with-website CHECK_WITH_WEBSITE
                        Website to connect with proxy (default:
                        httpbin.org/ip). If it doesn't return HTTP 200, it's
                        dead
  --country             Locate and print country (requires maxminddb-geolite2)
  --connection-time     Print connection time information
  -f, --write-immediately
                        Force flush the output file every time
  -i, --extra-information
                        Print last updated time, and configuration description
```
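
For a quick local run that scrapes the sources, checks the proxies over HTTPS and writes the live ones to a file (the output file name here is just an example), something like this should work:
```
python3 proxy-scraper.py --check -o proxies.txt
```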

This runs on my server:
```
python3 /root/proxy-scraper.py --check -t 300 --timeout 5 --check-with-website httpbin.org/ip --country --connection-time --extra-information --output /home/admin/web/cagriari.com/public_html/fresh_proxy.txt
```

~~Hourly~~ Daily updated & checked proxy list: https://cagriari.com/fresh_proxy.txt

~~I'm no longer running it on my server as it doesn't cover the server costs.~~

I changed the interval to every 6 hours instead of hourly to decrease server load.
--------------------------------------------------------------------------------
/proxy-scraper.py:
--------------------------------------------------------------------------------
#!/usr/bin/python3

import sys

if sys.version_info[0] < 3:
    print("This script needs Python 3")
    sys.exit(1)

import requests, re, queue, threading, traceback, datetime, time, argparse

parser = argparse.ArgumentParser()
parser.add_argument('-c', '--check', help="Check the scraped proxies", action='store_true')
parser.add_argument('-o', '--output', help="Output file", required=True)
parser.add_argument('-t', '--threads', type=int, default=20, help="Checker threads count")
parser.add_argument('--timeout', type=int, default=5, help="Checker timeout in seconds")
parser.add_argument('--http', help="Check proxies for HTTP instead of HTTPS", action='store_true')
parser.add_argument('--check-with-website', help="Website to connect with proxy. If it doesn't return HTTP 200, it's dead", default="httpbin.org/ip")
parser.add_argument('--country', help="Locate and print country (requires maxminddb-geolite2)", action='store_true')
parser.add_argument('--connection-time', help="Print connection time information", action='store_true')
parser.add_argument('-f', '--write-immediately', help="Force flush the output file every time", action='store_true')
parser.add_argument('-i', '--extra-information', help="Print last updated time, and configuration description", action='store_true')
parserx = parser.parse_args()
threads = parserx.threads
https = not parserx.http
timeout = parserx.timeout
reader = None
if parserx.country:
    try:
        from geolite2 import geolite2
        reader = geolite2.reader()
    except ImportError:
        print("Error: maxminddb-geolite2 is not installed. Please try without the --country option or install this package.")
        sys.exit(1)
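
# Illustrative note on the geolite2 reader set up above (usage not shown in this
# excerpt): reader.get("8.8.8.8") returns a dict (or None); the English country
# name, when present, is under ["country"]["names"]["en"].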

proxies = []


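# fetchAndParseProxies downloads one proxy source, turns the regex template into a
# concrete pattern, appends every matched "ip:port" string to the global `proxies`
# list, and prints how many proxies were found for that source.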
def fetchAndParseProxies(url, custom_regex):
    global proxies
    n = 0
    proxylist = requests.get(url, timeout=5).text
    proxylist = proxylist.replace('null', '"N/A"')
    custom_regex = custom_regex.replace('%ip%', '([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3})')
    custom_regex = custom_regex.replace('%port%', '([0-9]{1,5})')
    for proxy in re.findall(re.compile(custom_regex), proxylist):
        proxies.append(proxy[0] + ":" + proxy[1])
        n += 1
    sys.stdout.write("{0: >5} proxies fetched from {1}\n".format(n, url))


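# Proxy sources: each entry is [source URL, regex template]; the %ip% and %port%
# placeholders are expanded into IP-address and port capture groups by
# fetchAndParseProxies().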
proxysources = [
    ["http://spys.me/proxy.txt", "%ip%:%port% "],
    ["http://www.httptunnel.ge/ProxyListForFree.aspx", " target=\"_new\">%ip%:%port%"],
    ["https://raw.githubusercontent.com/sunny9577/proxy-scraper/master/proxies.json", "\"ip\":\"%ip%\",\"port\":\"%port%\","],
    ["https://raw.githubusercontent.com/fate0/proxylist/master/proxy.list", '"host": "%ip%".*?"country": "(.*?){2}",.*?"port": %port%'],
    ["https://raw.githubusercontent.com/clarketm/proxy-list/master/proxy-list.txt", '%ip%:%port% (.*?){2}-.-S \\+'],
    ["https://raw.githubusercontent.com/opsxcq/proxy-list/master/list.txt", '%ip%", "type": "http", "port": %port%'],
    ["https://www.us-proxy.org/", "