├── README.md
├── data_processor.py
├── data_processor_excel.py
├── demo.txt
└── requirement.txt

/README.md:
--------------------------------------------------------------------------------
# Introduction

This is a Python toolkit for domain and IP address processing, named DomainIPProcessor. It provides a set of functions for parsing, sorting, and processing data that mixes domain names and IP addresses, and it suits network analysis, security auditing, and any other task that needs accurate handling of network address data.

## Key features

- Internationalized domain handling: converts Chinese domain names to ASCII (Punycode), for use with internationalized domain names (IDN).
- IP sorting and analysis: extracts and sorts IP addresses, and derives IP segments in CIDR notation.
- Advanced URL/IP handling: classifies URLs containing IPs or domains, with or without a scheme prefix.
- Deduplication and consolidation: reads URL data from a file, deduplicates it automatically, and sorts it into categories.
- Result output: results are saved to files, and detailed log information is printed to the console so the run can be traced.
- Easy to integrate and use: works as a standalone command-line tool and embeds easily in other Python projects or scripts.

The tool is aimed at developers and network administrators: it speeds up the analysis and processing of network data and improves the accuracy of data management. It also works well for network research and teaching, since it covers fundamental operations such as domain parsing and IP handling.

## Use cases

- Network security: analyze and audit IP addresses and domain names from various sources to identify potential threats.
- Data cleaning: clean and prepare data from network logs in big-data projects.
- Education and research: teach the basics of network address parsing and how to process such data in Python.
- API development: build background tasks for network services, such as automatically updating DNS records or validating network configuration.

# Installation

```
pip install DomainIPProcessor

# Usage example
python3 data_processor.py url.txt
```
As a module:
```
# Usage example
from DomainIPProcessor import DomainIPProcessor

# Create an instance
processor = DomainIPProcessor()

# Process the URLs and IPs in a given file
processor.process_file('path_to_your_file.txt')
```
# Output

See `demo.txt` for a sample data source. The project writes 14 output files:

| File name | Description |
| --- | --- |
| demo--All_Data_Quchong.txt | All data after deduplication, in its original form |
| demo--All_Domains_No_Schemes.txt | All entries without a scheme |
| demo--All_Domains_Schemes.txt | All entries with a scheme |
| demo--All_Err.txt | All entries that could not be processed or were malformed |
| demo--Domains_Chinese.txt | Chinese domain names as found in the input |
| demo--Domains_Chinese_Ascii.txt | Chinese domain names converted to ASCII |
| demo--Domains_No_Schemes.txt | Domain entries without a scheme |
| demo--Domains_Root.txt | Root domains extracted from the domain entries |
| demo--Domains_Schemes.txt | Domain entries with a scheme |
| demo--IPs.txt | IP addresses, deduplicated and sorted |
| demo--IP_Domains_No_Schemes.txt | IP entries without a scheme |
| demo--IP_Domains_Schemes.txt | IP entries with a scheme |
| demo--IP_Ports_Sorted_List.txt | Sorted IP:PORT pairs; originally written to tidy up fscan's jumbled IP:port output |
| demo--IP_Segment.txt | IP segments; only the /24 (class C) segment is derived |
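
For reference, the IP:PORT list is ordered numerically rather than lexicographically: each address is packed into its 32-bit binary form before comparison. A minimal standalone sketch of the idea (the sample addresses are made up for illustration):

```
import socket

def parse_ip_port(s):
    ip, _, port = s.partition(':')
    # inet_aton packs the IP into 4 bytes that compare in numeric octet order
    return (socket.inet_aton(ip), int(port) if port else 0)

pairs = ["10.0.0.10:8080", "10.0.0.2:8080", "10.0.0.2:80"]
print(sorted(pairs, key=parse_ip_port))
# ['10.0.0.2:80', '10.0.0.2:8080', '10.0.0.10:8080']
```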

# Contributing and support

If this project is useful to you, please star it as encouragement.

Whether it is adding new features, improving the code, fixing bugs, or contributing documentation: please submit your contribution through GitHub Issues and Pull Requests, and I will respond and update as soon as I can.

--------------------------------------------------------------------------------
/data_processor.py:
--------------------------------------------------------------------------------
'''
Author           : S1g0day
Version          : 0.3.6
Create time      : 2024/2/10 18:00
Modification time: 2024/7/22 16:00
Introduce        : Processes domain/IP data and exports each category to its own file
'''

import re
import os
import time
import idna
import socket
import argparse
import tldextract
from tabulate import tabulate

class DomainIPProcessor:
    def __init__(self):
        pass

    # Convert a Chinese (IDN) domain name to ASCII (Punycode)
    def convert_to_ascii(self, domain_name):
        if "://" in domain_name:
            domains = domain_name.split("://")
            domain = domains[1]
        else:
            domain = domain_name
        try:
            Ascii_Domain = idna.encode(domain).decode('ascii')
            return Ascii_Domain
        except Exception as e:
            print("Conversion failed:", e)
            return None

    def is_chinese_domain(self, domain):
        # Any character above code point 127 is non-ASCII and may be a Chinese character
        return any(ord(char) > 127 for char in domain)

    def parse_ip_port(self, s):
        ip, _, port = s.partition(':')
        # inet_aton packs the IP into 4 bytes that compare in numeric octet order,
        # so "10.0.0.2" sorts before "10.0.0.10" (plain string comparison would not)
        return (socket.inet_aton(ip), int(port) if port else 0)

    # Sort IP addresses
    def sort_IPs(self, ip_addresses):
        # Deduplicate with set()
        unique_IPs = set(ip_addresses)

        # socket.inet_aton converts each IP address to its 32-bit binary form
        ip_integers = [socket.inet_aton(ip) for ip in unique_IPs]
        ip_integers.sort()

        # socket.inet_ntoa converts the binary form back to dotted-quad notation
        sorted_IPs = [socket.inet_ntoa(ip) for ip in ip_integers]
        return sorted_IPs
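
    # Illustrative example (for documentation only, not executed):
    #   sort_IPs(["10.0.0.10", "10.0.0.2", "10.0.0.2"]) -> ["10.0.0.2", "10.0.0.10"]
    # The packed 32-bit form sorts numerically; plain string sorting would put
    # "10.0.0.10" before "10.0.0.2".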

    # Sort URLs that contain IPs
    def ip_url_output(self, IPs_list, IP_valid):
        ip_domains = []
        Schemes_IP_Domains = []
        NO_Schemes_IP_Domains = []
        ip_ports = []

        for i in IPs_list:
            for j in IP_valid:
                ipList = re.findall(r'[0-9]+(?:\.[0-9]+){3}', j)
                # Guard against entries where no IP could be extracted
                if ipList and i == ipList[0]:
                    if j and j not in ip_domains:
                        ip_domains.append(j)

                        # Split off the scheme
                        if "://" in j:
                            Schemes_IP_Domains.append(j)
                        else:
                            NO_Schemes_IP_Domains.append(j)
        # Extract IP:PORT pairs
        for IP_PORT in NO_Schemes_IP_Domains:
            pattern = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}):(\d+)'
            match = re.search(pattern, IP_PORT)
            if match:
                IP_PORT = f"{match.group(1)}:{match.group(2)}"
                if IP_PORT and IP_PORT not in ip_ports:
                    ip_ports.append(IP_PORT)
        IP_Ports_Sorted_List = sorted(ip_ports, key=self.parse_ip_port)
        return ip_domains, Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List

    # Extract the /24 segment of an IP address
    def extract_ip_segment(self, ip):
        parts = ip.split('.')
        if len(parts) == 4:
            return '.'.join(parts[:3]) + '.0/24'
        return None
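
    # Illustrative example (for documentation only, not executed):
    #   extract_ip_segment("192.168.1.77") -> "192.168.1.0/24"
    # Only the /24 (class C) segment is derived; other prefix lengths are not handled.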

    # Extract bare IP addresses
    def process_ips(self, ip_list):
        pure_ips = set()
        ip_segments = set()
        ips_err = []
        p = re.compile(r'^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$')
        for ip in ip_list:
            ip_match = re.findall(r'[0-9]+(?:\.[0-9]+){3}', ip)
            if ip_match and p.match(ip_match[0]):
                pure_ip = ip_match[0]
                pure_ips.add(pure_ip)
                ip_segments.add(self.extract_ip_segment(pure_ip))
            else:
                ips_err.append(ip)

        sorted_ips = self.sort_IPs(list(pure_ips))
        sorted_segments = sorted(list(ip_segments))
        return sorted_ips, sorted_segments, ips_err

    def handle_Domain_Valid(self, Domain_Valid):
        '''
        Process the Domain_Valid data
        '''
        Root_Domains = []
        Schemes_Domains = []
        NO_Schemes_Domains = []
        for url in Domain_Valid:
            # Extract the root domain
            extracted = tldextract.extract(url)
            root_domain = extracted.domain + '.' + extracted.suffix
            if root_domain and root_domain not in Root_Domains:
                Root_Domains.append(root_domain)
            # Split off the scheme
            if "://" in url:
                Schemes_Domains.append(url)
            else:
                NO_Schemes_Domains.append(url)
        return Root_Domains, Schemes_Domains, NO_Schemes_Domains

    # Distinguish IPs from domains
    def classify_urls(self, urls):
        IP_valid = []
        Domain_Valid = []
        Root_Domains = []
        Chinese_Domain = []
        Ascii_Domain = []
        Url_Err = []

        ip_pattern = re.compile(r'(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)')
        domain_pattern = re.compile(r'([a-zA-Z0-9.-]+?\.[a-zA-Z]{2,6})')
        for url in urls:
            # Does the entry contain an IP address?
            ip_match = ip_pattern.search(url)
            # Does the entry contain a domain name?
            domain_match = domain_pattern.search(url)
            if ip_match and domain_match:
                Url_Err.append(url)
            elif ip_match:
                IP_valid.append(url)
            elif domain_match:
                Domain_Valid.append(url)
            # Is it a Chinese domain name?
            elif self.is_chinese_domain(url):
                # Convert it to ASCII
                domain_ascii = self.convert_to_ascii(url)
                if domain_ascii:
                    Chinese_Domain.append(url)
                    Ascii_Domain.append(domain_ascii)
                else:
                    # Conversion failed: keep the entry in the error list
                    Url_Err.append(url)
            else:
                Url_Err.append(url)

        Root_Domains, Schemes_Domains, NO_Schemes_Domains = self.handle_Domain_Valid(Domain_Valid)
        return IP_valid, Domain_Valid, Chinese_Domain, Ascii_Domain, Root_Domains, Schemes_Domains, NO_Schemes_Domains, Url_Err

    # Deduplicate URLs
    def deduplicate_urls(self, file_name):
        non_scheme_parts = []
        texts = []
        with open(file_name, 'r', encoding='utf-8') as files:
            filelist = files.readlines()
        for fileurl in filelist:
            url = fileurl.strip()
            if url:
                if "://" in url:
                    url_parts = url.split("://")
                    non_scheme_part = re.sub(r'/+', '/', url_parts[1])
                    if non_scheme_part and non_scheme_part not in non_scheme_parts:
                        non_scheme_parts.append(non_scheme_part)
                        texts.append(f"{url_parts[0]}://{non_scheme_part}")
                else:
                    url = re.sub(r'/+', '/', url)
                    if url and url not in non_scheme_parts:
                        non_scheme_parts.append(url)
                        texts.append(url)
        return texts
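
    # Illustrative example (for documentation only, not executed): repeated slashes
    # in the non-scheme part are collapsed before comparison, so
    #   "http://example.com//a///b" and "http://example.com/a/b"
    # deduplicate to the single entry "http://example.com/a/b".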

    def save_results(self, output, data):
        savedata = []
        with open(output, 'w', encoding='utf-8') as fs:
            for i in data:
                if i and i not in savedata:
                    # Remember written entries so duplicates are skipped
                    savedata.append(i)
                    fs.write(i + '\n')

    def save_and_print_results(self, files, Root_Domains, Schemes_Domains, NO_Schemes_Domains, IPs, IP_Segment,
                               Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List, Chinese_Domain,
                               Ascii_Domain, All_Schemes_Domains, All_No_Schemes_Domains, All_Data, All_Err):

        # Build the output file name
        filename = ''.join(files.split('/')[-1].split('.')[:-1])
        timenow = str(time.time()).split(".")[0]
        outfilename = f'{filename}-'
        # outfilename = f'{filename}-{timenow}'

        # Directory of this script
        script_dir = os.path.dirname(__file__)

        # Create the log directory
        log_dir = os.path.join(script_dir, 'log')
        os.makedirs(log_dir, exist_ok=True)

        outputs = {
            'Domains_Root': Root_Domains,
            'Domains_Schemes': Schemes_Domains,
            'Domains_No_Schemes': NO_Schemes_Domains,
            'Domains_Chinese': Chinese_Domain,
            'Domains_Chinese_Ascii': Ascii_Domain,
            'IPs': IPs,
            'IP_Segment': IP_Segment,
            'IP_Domains_Schemes': Schemes_IP_Domains,
            'IP_Domains_No_Schemes': NO_Schemes_IP_Domains,
            'IP_Ports_Sorted_List': IP_Ports_Sorted_List,
            'All_Domains_Schemes': All_Schemes_Domains,
            'All_Domains_No_Schemes': All_No_Schemes_Domains,
            'All_Err': All_Err,
            'All_Data_Quchong': All_Data
        }

        # Write each category under log/ next to this script, so the paths work
        # regardless of the current working directory
        table_rows = []
        for key, data in outputs.items():
            output_file = os.path.join(log_dir, f'{outfilename}-{key}.txt')
            self.save_results(output_file, data)
            table_rows.append([key, output_file])

        table = tabulate(table_rows, headers=['Item', 'Output file'], tablefmt='grid')
        print(table)

    def process_file(self, files):
        # Deduplicate
        texts = self.deduplicate_urls(files)

        # Extract domains and IPs
        IP_valid, Domain_Valid, Chinese_Domain, Ascii_Domain, Root_Domains, Schemes_Domains, NO_Schemes_Domains, Url_Err = self.classify_urls(texts)

        # Extract bare IPs and IP segments, then sort
        IPs, IP_Segment, IPs_Err = self.process_ips(IP_valid)
        # Process URLs that contain IPs
        ip_domains, Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List = self.ip_url_output(IPs, IP_valid)
        # Merge
        All_Schemes_Domains = Schemes_Domains + Schemes_IP_Domains
        All_No_Schemes_Domains = NO_Schemes_Domains + NO_Schemes_IP_Domains
        All_Err = Url_Err + IPs_Err
        All_Data = Domain_Valid + ip_domains
        # Save and print the collected results
        self.save_and_print_results(files, Root_Domains, Schemes_Domains, NO_Schemes_Domains, IPs, IP_Segment,
                                    Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List, Chinese_Domain,
                                    Ascii_Domain, All_Schemes_Domains, All_No_Schemes_Domains, All_Data, All_Err)

if __name__ == '__main__':

    parser = argparse.ArgumentParser(description="Process URLs and IPs from a file.")
    parser.add_argument('file', type=str, help='The path to the file containing URLs and IPs.')
    args = parser.parse_args()

    processor = DomainIPProcessor()
    processor.process_file(args.file)
--------------------------------------------------------------------------------
/data_processor_excel.py:
--------------------------------------------------------------------------------
'''
Author           : S1g0day
Version          : 0.3.7
Create time      : 2024/2/10 18:00
Modification time: 2024/7/24 10:00
Introduce        : Processes domain/IP data and exports the results to Excel
'''

import re
import os
import time
import idna
import socket
import argparse
import pandas as pd
import tldextract

class DomainIPProcessor:
    def __init__(self):
        pass

    # Convert a Chinese (IDN) domain name to ASCII (Punycode)
    def convert_to_ascii(self, domain_name):
        if "://" in domain_name:
            domains = domain_name.split("://")
            domain = domains[1]
        else:
            domain = domain_name
        try:
            Ascii_Domain = idna.encode(domain).decode('ascii')
            return Ascii_Domain
        except Exception as e:
            print("Conversion failed:", e)
            return None

    def is_chinese_domain(self, domain):
        # Any character above code point 127 is non-ASCII and may be a Chinese character
        return any(ord(char) > 127 for char in domain)

    def parse_ip_port(self, s):
        ip, _, port = s.partition(':')
        # inet_aton packs the IP into 4 bytes that compare in numeric octet order
        return (socket.inet_aton(ip), int(port) if port else 0)

    # Sort IP addresses
    def sort_IPs(self, ip_addresses):
        # Deduplicate with set()
        unique_IPs = set(ip_addresses)

        # socket.inet_aton converts each IP address to its 32-bit binary form
        ip_integers = [socket.inet_aton(ip) for ip in unique_IPs]
        ip_integers.sort()

        # socket.inet_ntoa converts the binary form back to dotted-quad notation
        sorted_IPs = [socket.inet_ntoa(ip) for ip in ip_integers]
        return sorted_IPs

    # Sort URLs that contain IPs
    def ip_url_output(self, IPs_list, IP_valid):
        ip_domains = []
        Schemes_IP_Domains = []
        NO_Schemes_IP_Domains = []
        ip_ports = []

        for i in IPs_list:
            for j in IP_valid:
                ipList = re.findall(r'[0-9]+(?:\.[0-9]+){3}', j)
                # Guard against entries where no IP could be extracted
                if ipList and i == ipList[0]:
                    if j and j not in ip_domains:
                        ip_domains.append(j)

                        # Split off the scheme
                        if "://" in j:
                            Schemes_IP_Domains.append(j)
                        else:
                            NO_Schemes_IP_Domains.append(j)
        # Extract IP:PORT pairs
        for IP_PORT in NO_Schemes_IP_Domains:
            pattern = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}):(\d+)'
            match = re.search(pattern, IP_PORT)
            if match:
                IP_PORT = f"{match.group(1)}:{match.group(2)}"
                if IP_PORT and IP_PORT not in ip_ports:
                    ip_ports.append(IP_PORT)
        IP_Ports_Sorted_List = sorted(ip_ports, key=self.parse_ip_port)
        return ip_domains, Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List

    # Extract the /24 segment of an IP address
    def extract_ip_segment(self, ip):
        parts = ip.split('.')
        if len(parts) == 4:
            return '.'.join(parts[:3]) + '.0/24'
        return None

    # Extract bare IP addresses
    def process_ips(self, ip_list):
        pure_ips = set()
        ip_segments = set()
        ips_err = []
        p = re.compile(r'^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$')
        for ip in ip_list:
            ip_match = re.findall(r'[0-9]+(?:\.[0-9]+){3}', ip)
            if ip_match and p.match(ip_match[0]):
                pure_ip = ip_match[0]
                pure_ips.add(pure_ip)
                ip_segments.add(self.extract_ip_segment(pure_ip))
            else:
                ips_err.append(ip)

        sorted_ips = self.sort_IPs(list(pure_ips))
        sorted_segments = sorted(list(ip_segments))
        return sorted_ips, sorted_segments, ips_err

    def handle_Domain_Valid(self, Domain_Valid):
        '''
        Process the Domain_Valid data
        '''
        Root_Domains = []
        Schemes_Domains = []
        NO_Schemes_Domains = []
        for url in Domain_Valid:
            # Extract the root domain
            extracted = tldextract.extract(url)
            root_domain = extracted.domain + '.' + extracted.suffix
            if root_domain and root_domain not in Root_Domains:
                Root_Domains.append(root_domain)
            # Split off the scheme
            if "://" in url:
                Schemes_Domains.append(url)
            else:
                NO_Schemes_Domains.append(url)
        return Root_Domains, Schemes_Domains, NO_Schemes_Domains

    # Distinguish IPs from domains
    def classify_urls(self, urls):
        IP_valid = []
        Domain_Valid = []
        Root_Domains = []
        Chinese_Domain = []
        Ascii_Domain = []
        Url_Err = []

        ip_pattern = re.compile(r'(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)')
        domain_pattern = re.compile(r'([a-zA-Z0-9.-]+?\.[a-zA-Z]{2,6})')
        for url in urls:
            # Does the entry contain an IP address?
            ip_match = ip_pattern.search(url)
            # Does the entry contain a domain name?
            domain_match = domain_pattern.search(url)
            if ip_match and domain_match:
                Url_Err.append(url)
            elif ip_match:
                IP_valid.append(url)
            elif domain_match:
                Domain_Valid.append(url)
            # Is it a Chinese domain name?
            elif self.is_chinese_domain(url):
                # Convert it to ASCII
                domain_ascii = self.convert_to_ascii(url)
                if domain_ascii:
                    Chinese_Domain.append(url)
                    Ascii_Domain.append(domain_ascii)
                else:
                    # Conversion failed: keep the entry in the error list
                    Url_Err.append(url)
            else:
                Url_Err.append(url)

        Root_Domains, Schemes_Domains, NO_Schemes_Domains = self.handle_Domain_Valid(Domain_Valid)
        return IP_valid, Domain_Valid, Chinese_Domain, Ascii_Domain, Root_Domains, Schemes_Domains, NO_Schemes_Domains, Url_Err

    # Deduplicate URLs
    def deduplicate_urls(self, file_name):
        non_scheme_parts = []
        texts = []
        with open(file_name, 'r', encoding='utf-8') as files:
            filelist = files.readlines()
        for fileurl in filelist:
            url = fileurl.strip()
            if url:
                if "://" in url:
                    url_parts = url.split("://")
                    non_scheme_part = re.sub(r'/+', '/', url_parts[1])
                    if non_scheme_part and non_scheme_part not in non_scheme_parts:
                        non_scheme_parts.append(non_scheme_part)
                        texts.append(f"{url_parts[0]}://{non_scheme_part}")
                else:
                    url = re.sub(r'/+', '/', url)
                    if url and url not in non_scheme_parts:
                        non_scheme_parts.append(url)
                        texts.append(url)
        return texts

    def save_results(self, output, data):
        savedata = []
        with open(output, 'w', encoding='utf-8') as fs:
            for i in data:
                if i and i not in savedata:
                    # Remember written entries so duplicates are skipped
                    savedata.append(i)
                    fs.write(i + '\n')

    def save_and_print_results(self, files, Root_Domains, Schemes_Domains, NO_Schemes_Domains, IPs, IP_Segment,
                               Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List, Chinese_Domain,
                               Ascii_Domain, All_Schemes_Domains, All_No_Schemes_Domains, All_Data, All_Err):

        # Build the output file name
        filename = ''.join(files.split('/')[-1].split('.')[:-1])
        timenow = str(time.time()).split(".")[0]
        # outfilename = f'{filename}-'
        outfilename = f'{filename}-{timenow}'

        # Directory of this script
        script_dir = os.path.dirname(__file__)

        # Create the log directory
        log_dir = os.path.join(script_dir, 'log')
        os.makedirs(log_dir, exist_ok=True)

        outputs = {
            'Domains_Root': Root_Domains,
            'Domains_Schemes': Schemes_Domains,
            'Domains_No_Schemes': NO_Schemes_Domains,
            'Domains_Chinese': Chinese_Domain,
            'Domains_Chinese_Ascii': Ascii_Domain,
            'IPs': IPs,
            'IP_Segment': IP_Segment,
            'IP_Domains_Schemes': Schemes_IP_Domains,
            'IP_Domains_No_Schemes': NO_Schemes_IP_Domains,
            'IP_Ports_Sorted_List': IP_Ports_Sorted_List,
            'All_Domains_Schemes': All_Schemes_Domains,
            'All_Domains_No_Schemes': All_No_Schemes_Domains,
            'All_Err': All_Err,
            'All_Data_Quchong': All_Data
        }

        output_file = os.path.join(log_dir, f'{outfilename}.xlsx')
        writer = pd.ExcelWriter(output_file)

        # Write the outputs dict in reverse order
        for key, data in reversed(list(outputs.items())):
            df = pd.DataFrame(data)
            df.to_excel(writer, sheet_name=key, index=False, header=False)

        writer.close()
        print(f"Output written to: {output_file}")
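
    # Note (added for documentation): pd.ExcelWriter infers the engine from the
    # .xlsx suffix and, in recent pandas releases, writes through openpyxl by
    # default, so openpyxl must be installed alongside pandas (see requirement.txt).
    # Writing reversed(outputs) puts the aggregate "All_*" sheets first in the workbook.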

    def process_file(self, files):
        # Deduplicate
        texts = self.deduplicate_urls(files)

        # Extract domains and IPs
        IP_valid, Domain_Valid, Chinese_Domain, Ascii_Domain, Root_Domains, Schemes_Domains, NO_Schemes_Domains, Url_Err = self.classify_urls(texts)

        # Extract bare IPs and IP segments, then sort
        IPs, IP_Segment, IPs_Err = self.process_ips(IP_valid)
        # Process URLs that contain IPs
        ip_domains, Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List = self.ip_url_output(IPs, IP_valid)
        # Merge
        All_Schemes_Domains = Schemes_Domains + Schemes_IP_Domains
        All_No_Schemes_Domains = NO_Schemes_Domains + NO_Schemes_IP_Domains
        All_Err = Url_Err + IPs_Err
        All_Data = Domain_Valid + ip_domains
        # Save and print the collected results
        self.save_and_print_results(files, Root_Domains, Schemes_Domains, NO_Schemes_Domains, IPs, IP_Segment,
                                    Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List, Chinese_Domain,
                                    Ascii_Domain, All_Schemes_Domains, All_No_Schemes_Domains, All_Data, All_Err)

if __name__ == '__main__':

    parser = argparse.ArgumentParser(description="Process URLs and IPs from a file.")
    parser.add_argument('file', type=str, help='The path to the file containing URLs and IPs.')
    args = parser.parse_args()

    processor = DomainIPProcessor()
    processor.process_file(args.file)
--------------------------------------------------------------------------------
/demo.txt:
--------------------------------------------------------------------------------
192.168.1.100
192.168.1.10
192.168.1.130
192.168.1.130
192.168.1.101
192.168.2.10
192.168.3.130
172.16.1.101
192.168.1.100:8080
192.168.1.101:8081/path
192.168.1.100:808
192.168.1.1001
1192.168.1.100

<>
12313
example.com
A.example.com
example.com:808
example.com:8081/path
example.com:8080
http://exam1ple.com
https://example.com
http://A.example.com
http://example.com:8080
https://example.com:8080
http://example.com/path
https://example.com/path
http://192.168.1.100
https://192.168.1.100
http://192.168.1.100:8080
https://192.168.1.100:8080
http://192.168.1.100:8080/path
https://192.168.1.100:8080/path
http://192.168.1.100/path
https://192.168.1.100/path
http://192.168.1.100:8080/path
https://192.168.1.100:8080/path
https://1192.168.1.100:8080/path
例1子.域名
http://例子.域名
https://域名.例子
http://192.168.1.100:8080/path/path
http://192.168.1.100:8080//path/path
http://192.168.1.100:8080//path//path
http://192.168.1.100:8080//path////path
http://192.168.1.100:8080/path//path///path?ad=../.././../
http://example.com/?query=123
http://example.com/#fragment
http://example.com/#fragment
http://122.128.1.100.examp3le.com/#fragment
--------------------------------------------------------------------------------
/requirement.txt:
--------------------------------------------------------------------------------
idna
tabulate
tldextract
pandas    # used by data_processor_excel.py
openpyxl  # Excel engine pandas uses to write .xlsx files
--------------------------------------------------------------------------------