├── README.md
├── data_processor.py
├── data_processor_excel.py
├── demo.txt
└── requirement.txt

/README.md:
--------------------------------------------------------------------------------
# Introduction

This is a Python toolkit for domain and IP address processing, named DomainIPProcessor. It provides a set of functions for parsing, sorting, and processing data that mixes domain names and IP addresses, and it suits network analysis, security auditing, and any other task that needs accurate handling of network address data.

## Key features

- Internationalized domain handling: converts Chinese domain names to ASCII (Punycode), for use with internationalized domain names (IDN).
- IP sorting and analysis: extracts and sorts IP addresses, and derives IP segments in CIDR notation.
- Advanced URL/IP handling: classifies URLs containing IPs or domains, with or without a scheme prefix.
- Deduplication and consolidation: reads URL data from a file, deduplicates it automatically, and sorts it into categories.
- Result output: results are saved to files, and detailed log information is printed to the console so the run can be traced.
- Easy to integrate and use: works as a standalone command-line tool and embeds easily in other Python projects or scripts.

The tool is aimed at developers and network administrators: it speeds up the analysis and processing of network data and improves the accuracy of data management. It also works well for network research and teaching, since it covers fundamental operations such as domain parsing and IP handling.

## Use cases

- Network security: analyze and audit IP addresses and domain names from various sources to identify potential threats.
- Data cleaning: clean and prepare data from network logs in big-data projects.
- Education and research: teach the basics of network address parsing and how to process such data in Python.
- API development: build background tasks for network services, such as automatically updating DNS records or validating network configuration.

# Installation

```
pip install DomainIPProcessor

# Usage example
python3 data_processor.py url.txt
```
As a module:
```
# Usage example
from DomainIPProcessor import DomainIPProcessor

# Create an instance
processor = DomainIPProcessor()

# Process the URLs and IPs in a given file
processor.process_file('path_to_your_file.txt')
```
# Output

See `demo.txt` for a sample data source. The project writes 14 output files:

| File name | Description |
| --- | --- |
| demo--All_Data_Quchong.txt | All data after deduplication, in its original form |
| demo--All_Domains_No_Schemes.txt | All entries without a scheme |
| demo--All_Domains_Schemes.txt | All entries with a scheme |
| demo--All_Err.txt | All entries that could not be processed or were malformed |
| demo--Domains_Chinese.txt | Chinese domain names as found in the input |
| demo--Domains_Chinese_Ascii.txt | Chinese domain names converted to ASCII |
| demo--Domains_No_Schemes.txt | Domain entries without a scheme |
| demo--Domains_Root.txt | Root domains extracted from the domain entries |
| demo--Domains_Schemes.txt | Domain entries with a scheme |
| demo--IPs.txt | IP addresses, deduplicated and sorted |
| demo--IP_Domains_No_Schemes.txt | IP entries without a scheme |
| demo--IP_Domains_Schemes.txt | IP entries with a scheme |
| demo--IP_Ports_Sorted_List.txt | Sorted IP:PORT pairs; originally written to tidy up fscan's jumbled IP:port output |
| demo--IP_Segment.txt | IP segments; only the /24 (class C) segment is derived |
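
For reference, the IP:PORT list is ordered numerically rather than lexicographically: each address is packed into its 32-bit binary form before comparison. A minimal standalone sketch of the idea (the sample addresses are made up for illustration):

```
import socket

def parse_ip_port(s):
    ip, _, port = s.partition(':')
    # inet_aton packs the IP into 4 bytes that compare in numeric octet order
    return (socket.inet_aton(ip), int(port) if port else 0)

pairs = ["10.0.0.10:8080", "10.0.0.2:8080", "10.0.0.2:80"]
print(sorted(pairs, key=parse_ip_port))
# ['10.0.0.2:80', '10.0.0.2:8080', '10.0.0.10:8080']
```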

# Contributing and support

If this project is useful to you, please star it as encouragement.

Whether it is adding new features, improving the code, fixing bugs, or contributing documentation: please submit your contribution through GitHub Issues and Pull Requests, and I will respond and update as soon as I can.

--------------------------------------------------------------------------------
/data_processor.py:
--------------------------------------------------------------------------------
'''
Author           : S1g0day
Version          : 0.3.6
Create time      : 2024/2/10 18:00
Modification time: 2024/7/22 16:00
Introduce        : Processes domain/IP data and exports each category to its own file
'''

import re
import os
import time
import idna
import socket
import argparse
import tldextract
from tabulate import tabulate

class DomainIPProcessor:
    def __init__(self):
        pass

    # Convert a Chinese (IDN) domain name to ASCII (Punycode)
    def convert_to_ascii(self, domain_name):
        if "://" in domain_name:
            domains = domain_name.split("://")
            domain = domains[1]
        else:
            domain = domain_name
        try:
            Ascii_Domain = idna.encode(domain).decode('ascii')
            return Ascii_Domain
        except Exception as e:
            print("Conversion failed:", e)
            return None

    def is_chinese_domain(self, domain):
        # Any character above code point 127 is non-ASCII and may be a Chinese character
        return any(ord(char) > 127 for char in domain)

    def parse_ip_port(self, s):
        ip, _, port = s.partition(':')
        # inet_aton packs the IP into 4 bytes that compare in numeric octet order,
        # so "10.0.0.2" sorts before "10.0.0.10" (plain string comparison would not)
        return (socket.inet_aton(ip), int(port) if port else 0)

    # Sort IP addresses
    def sort_IPs(self, ip_addresses):
        # Deduplicate with set()
        unique_IPs = set(ip_addresses)

        # socket.inet_aton converts each IP address to its 32-bit binary form
        ip_integers = [socket.inet_aton(ip) for ip in unique_IPs]
        ip_integers.sort()

        # socket.inet_ntoa converts the binary form back to dotted-quad notation
        sorted_IPs = [socket.inet_ntoa(ip) for ip in ip_integers]
        return sorted_IPs
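
    # Illustrative example (for documentation only, not executed):
    #   sort_IPs(["10.0.0.10", "10.0.0.2", "10.0.0.2"]) -> ["10.0.0.2", "10.0.0.10"]
    # The packed 32-bit form sorts numerically; plain string sorting would put
    # "10.0.0.10" before "10.0.0.2".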

    # Sort URLs that contain IPs
    def ip_url_output(self, IPs_list, IP_valid):
        ip_domains = []
        Schemes_IP_Domains = []
        NO_Schemes_IP_Domains = []
        ip_ports = []

        for i in IPs_list:
            for j in IP_valid:
                ipList = re.findall(r'[0-9]+(?:\.[0-9]+){3}', j)
                # Guard against entries where no IP could be extracted
                if ipList and i == ipList[0]:
                    if j and j not in ip_domains:
                        ip_domains.append(j)

                        # Split off the scheme
                        if "://" in j:
                            Schemes_IP_Domains.append(j)
                        else:
                            NO_Schemes_IP_Domains.append(j)
        # Extract IP:PORT pairs
        for IP_PORT in NO_Schemes_IP_Domains:
            pattern = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}):(\d+)'
            match = re.search(pattern, IP_PORT)
            if match:
                IP_PORT = f"{match.group(1)}:{match.group(2)}"
                if IP_PORT and IP_PORT not in ip_ports:
                    ip_ports.append(IP_PORT)
        IP_Ports_Sorted_List = sorted(ip_ports, key=self.parse_ip_port)
        return ip_domains, Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List

    # Extract the /24 segment of an IP address
    def extract_ip_segment(self, ip):
        parts = ip.split('.')
        if len(parts) == 4:
            return '.'.join(parts[:3]) + '.0/24'
        return None
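
    # Illustrative example (for documentation only, not executed):
    #   extract_ip_segment("192.168.1.77") -> "192.168.1.0/24"
    # Only the /24 (class C) segment is derived; other prefix lengths are not handled.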

    # Extract bare IP addresses
    def process_ips(self, ip_list):
        pure_ips = set()
        ip_segments = set()
        ips_err = []
        p = re.compile(r'^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$')
        for ip in ip_list:
            ip_match = re.findall(r'[0-9]+(?:\.[0-9]+){3}', ip)
            if ip_match and p.match(ip_match[0]):
                pure_ip = ip_match[0]
                pure_ips.add(pure_ip)
                ip_segments.add(self.extract_ip_segment(pure_ip))
            else:
                ips_err.append(ip)

        sorted_ips = self.sort_IPs(list(pure_ips))
        sorted_segments = sorted(list(ip_segments))
        return sorted_ips, sorted_segments, ips_err

    def handle_Domain_Valid(self, Domain_Valid):
        '''
        Process the Domain_Valid data
        '''
        Root_Domains = []
        Schemes_Domains = []
        NO_Schemes_Domains = []
        for url in Domain_Valid:
            # Extract the root domain
            extracted = tldextract.extract(url)
            root_domain = extracted.domain + '.' + extracted.suffix
            if root_domain and root_domain not in Root_Domains:
                Root_Domains.append(root_domain)
            # Split off the scheme
            if "://" in url:
                Schemes_Domains.append(url)
            else:
                NO_Schemes_Domains.append(url)
        return Root_Domains, Schemes_Domains, NO_Schemes_Domains

    # Distinguish IPs from domains
    def classify_urls(self, urls):
        IP_valid = []
        Domain_Valid = []
        Root_Domains = []
        Chinese_Domain = []
        Ascii_Domain = []
        Url_Err = []

        ip_pattern = re.compile(r'(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)')
        domain_pattern = re.compile(r'([a-zA-Z0-9.-]+?\.[a-zA-Z]{2,6})')
        for url in urls:
            # Does the entry contain an IP address?
            ip_match = ip_pattern.search(url)
            # Does the entry contain a domain name?
            domain_match = domain_pattern.search(url)
            if ip_match and domain_match:
                Url_Err.append(url)
            elif ip_match:
                IP_valid.append(url)
            elif domain_match:
                Domain_Valid.append(url)
            # Is it a Chinese domain name?
            elif self.is_chinese_domain(url):
                # Convert it to ASCII
                domain_ascii = self.convert_to_ascii(url)
                if domain_ascii:
                    Chinese_Domain.append(url)
                    Ascii_Domain.append(domain_ascii)
                else:
                    # Conversion failed: keep the entry in the error list
                    Url_Err.append(url)
            else:
                Url_Err.append(url)

        Root_Domains, Schemes_Domains, NO_Schemes_Domains = self.handle_Domain_Valid(Domain_Valid)
        return IP_valid, Domain_Valid, Chinese_Domain, Ascii_Domain, Root_Domains, Schemes_Domains, NO_Schemes_Domains, Url_Err

    # Deduplicate URLs
    def deduplicate_urls(self, file_name):
        non_scheme_parts = []
        texts = []
        with open(file_name, 'r', encoding='utf-8') as files:
            filelist = files.readlines()
        for fileurl in filelist:
            url = fileurl.strip()
            if url:
                if "://" in url:
                    url_parts = url.split("://")
                    non_scheme_part = re.sub(r'/+', '/', url_parts[1])
                    if non_scheme_part and non_scheme_part not in non_scheme_parts:
                        non_scheme_parts.append(non_scheme_part)
                        texts.append(f"{url_parts[0]}://{non_scheme_part}")
                else:
                    url = re.sub(r'/+', '/', url)
                    if url and url not in non_scheme_parts:
                        non_scheme_parts.append(url)
                        texts.append(url)
        return texts
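
    # Illustrative example (for documentation only, not executed): repeated slashes
    # in the non-scheme part are collapsed before comparison, so
    #   "http://example.com//a///b" and "http://example.com/a/b"
    # deduplicate to the single entry "http://example.com/a/b".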

    def save_results(self, output, data):
        savedata = []
        with open(output, 'w', encoding='utf-8') as fs:
            for i in data:
                if i and i not in savedata:
                    # Remember written entries so duplicates are skipped
                    savedata.append(i)
                    fs.write(i + '\n')

    def save_and_print_results(self, files, Root_Domains, Schemes_Domains, NO_Schemes_Domains, IPs, IP_Segment,
                               Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List, Chinese_Domain,
                               Ascii_Domain, All_Schemes_Domains, All_No_Schemes_Domains, All_Data, All_Err):

        # Build the output file name
        filename = ''.join(files.split('/')[-1].split('.')[:-1])
        timenow = str(time.time()).split(".")[0]
        outfilename = f'{filename}-'
        # outfilename = f'{filename}-{timenow}'

        # Directory of this script
        script_dir = os.path.dirname(__file__)

        # Create the log directory
        log_dir = os.path.join(script_dir, 'log')
        os.makedirs(log_dir, exist_ok=True)

        outputs = {
            'Domains_Root': Root_Domains,
            'Domains_Schemes': Schemes_Domains,
            'Domains_No_Schemes': NO_Schemes_Domains,
            'Domains_Chinese': Chinese_Domain,
            'Domains_Chinese_Ascii': Ascii_Domain,
            'IPs': IPs,
            'IP_Segment': IP_Segment,
            'IP_Domains_Schemes': Schemes_IP_Domains,
            'IP_Domains_No_Schemes': NO_Schemes_IP_Domains,
            'IP_Ports_Sorted_List': IP_Ports_Sorted_List,
            'All_Domains_Schemes': All_Schemes_Domains,
            'All_Domains_No_Schemes': All_No_Schemes_Domains,
            'All_Err': All_Err,
            'All_Data_Quchong': All_Data
        }

        # Write each category under log/ next to this script, so the paths work
        # regardless of the current working directory
        table_rows = []
        for key, data in outputs.items():
            output_file = os.path.join(log_dir, f'{outfilename}-{key}.txt')
            self.save_results(output_file, data)
            table_rows.append([key, output_file])

        table = tabulate(table_rows, headers=['Item', 'Output file'], tablefmt='grid')
        print(table)

    def process_file(self, files):
        # Deduplicate
        texts = self.deduplicate_urls(files)

        # Extract domains and IPs
        IP_valid, Domain_Valid, Chinese_Domain, Ascii_Domain, Root_Domains, Schemes_Domains, NO_Schemes_Domains, Url_Err = self.classify_urls(texts)

        # Extract bare IPs and IP segments, then sort
        IPs, IP_Segment, IPs_Err = self.process_ips(IP_valid)
        # Process URLs that contain IPs
        ip_domains, Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List = self.ip_url_output(IPs, IP_valid)
        # Merge
        All_Schemes_Domains = Schemes_Domains + Schemes_IP_Domains
        All_No_Schemes_Domains = NO_Schemes_Domains + NO_Schemes_IP_Domains
        All_Err = Url_Err + IPs_Err
        All_Data = Domain_Valid + ip_domains
        # Save and print the collected results
        self.save_and_print_results(files, Root_Domains, Schemes_Domains, NO_Schemes_Domains, IPs, IP_Segment,
                                    Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List, Chinese_Domain,
                                    Ascii_Domain, All_Schemes_Domains, All_No_Schemes_Domains, All_Data, All_Err)

if __name__ == '__main__':

    parser = argparse.ArgumentParser(description="Process URLs and IPs from a file.")
    parser.add_argument('file', type=str, help='The path to the file containing URLs and IPs.')
    args = parser.parse_args()

    processor = DomainIPProcessor()
    processor.process_file(args.file)
--------------------------------------------------------------------------------
/data_processor_excel.py:
--------------------------------------------------------------------------------
'''
Author           : S1g0day
Version          : 0.3.7
Create time      : 2024/2/10 18:00
Modification time: 2024/7/24 10:00
Introduce        : Processes domain/IP data and exports the results to Excel
'''

import re
import os
import time
import idna
import socket
import argparse
import pandas as pd
import tldextract

class DomainIPProcessor:
    def __init__(self):
        pass

    # Convert a Chinese (IDN) domain name to ASCII (Punycode)
    def convert_to_ascii(self, domain_name):
        if "://" in domain_name:
            domains = domain_name.split("://")
            domain = domains[1]
        else:
            domain = domain_name
        try:
            Ascii_Domain = idna.encode(domain).decode('ascii')
            return Ascii_Domain
        except Exception as e:
            print("Conversion failed:", e)
            return None

    def is_chinese_domain(self, domain):
        # Any character above code point 127 is non-ASCII and may be a Chinese character
        return any(ord(char) > 127 for char in domain)

    def parse_ip_port(self, s):
        ip, _, port = s.partition(':')
        # inet_aton packs the IP into 4 bytes that compare in numeric octet order
        return (socket.inet_aton(ip), int(port) if port else 0)

    # Sort IP addresses
    def sort_IPs(self, ip_addresses):
        # Deduplicate with set()
        unique_IPs = set(ip_addresses)

        # socket.inet_aton converts each IP address to its 32-bit binary form
        ip_integers = [socket.inet_aton(ip) for ip in unique_IPs]
        ip_integers.sort()

        # socket.inet_ntoa converts the binary form back to dotted-quad notation
        sorted_IPs = [socket.inet_ntoa(ip) for ip in ip_integers]
        return sorted_IPs

    # Sort URLs that contain IPs
    def ip_url_output(self, IPs_list, IP_valid):
        ip_domains = []
        Schemes_IP_Domains = []
        NO_Schemes_IP_Domains = []
        ip_ports = []

        for i in IPs_list:
            for j in IP_valid:
                ipList = re.findall(r'[0-9]+(?:\.[0-9]+){3}', j)
                # Guard against entries where no IP could be extracted
                if ipList and i == ipList[0]:
                    if j and j not in ip_domains:
                        ip_domains.append(j)

                        # Split off the scheme
                        if "://" in j:
                            Schemes_IP_Domains.append(j)
                        else:
                            NO_Schemes_IP_Domains.append(j)
        # Extract IP:PORT pairs
        for IP_PORT in NO_Schemes_IP_Domains:
            pattern = r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}):(\d+)'
            match = re.search(pattern, IP_PORT)
            if match:
                IP_PORT = f"{match.group(1)}:{match.group(2)}"
                if IP_PORT and IP_PORT not in ip_ports:
                    ip_ports.append(IP_PORT)
        IP_Ports_Sorted_List = sorted(ip_ports, key=self.parse_ip_port)
        return ip_domains, Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List

    # Extract the /24 segment of an IP address
    def extract_ip_segment(self, ip):
        parts = ip.split('.')
        if len(parts) == 4:
            return '.'.join(parts[:3]) + '.0/24'
        return None

    # Extract bare IP addresses
    def process_ips(self, ip_list):
        pure_ips = set()
        ip_segments = set()
        ips_err = []
        p = re.compile(r'^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$')
        for ip in ip_list:
            ip_match = re.findall(r'[0-9]+(?:\.[0-9]+){3}', ip)
            if ip_match and p.match(ip_match[0]):
                pure_ip = ip_match[0]
                pure_ips.add(pure_ip)
                ip_segments.add(self.extract_ip_segment(pure_ip))
            else:
                ips_err.append(ip)

        sorted_ips = self.sort_IPs(list(pure_ips))
        sorted_segments = sorted(list(ip_segments))
        return sorted_ips, sorted_segments, ips_err

    def handle_Domain_Valid(self, Domain_Valid):
        '''
        Process the Domain_Valid data
        '''
        Root_Domains = []
        Schemes_Domains = []
        NO_Schemes_Domains = []
        for url in Domain_Valid:
            # Extract the root domain
            extracted = tldextract.extract(url)
            root_domain = extracted.domain + '.' + extracted.suffix
            if root_domain and root_domain not in Root_Domains:
                Root_Domains.append(root_domain)
            # Split off the scheme
            if "://" in url:
                Schemes_Domains.append(url)
            else:
                NO_Schemes_Domains.append(url)
        return Root_Domains, Schemes_Domains, NO_Schemes_Domains

    # Distinguish IPs from domains
    def classify_urls(self, urls):
        IP_valid = []
        Domain_Valid = []
        Root_Domains = []
        Chinese_Domain = []
        Ascii_Domain = []
        Url_Err = []

        ip_pattern = re.compile(r'(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)')
        domain_pattern = re.compile(r'([a-zA-Z0-9.-]+?\.[a-zA-Z]{2,6})')
        for url in urls:
            # Does the entry contain an IP address?
            ip_match = ip_pattern.search(url)
            # Does the entry contain a domain name?
            domain_match = domain_pattern.search(url)
            if ip_match and domain_match:
                Url_Err.append(url)
            elif ip_match:
                IP_valid.append(url)
            elif domain_match:
                Domain_Valid.append(url)
            # Is it a Chinese domain name?
            elif self.is_chinese_domain(url):
                # Convert it to ASCII
                domain_ascii = self.convert_to_ascii(url)
                if domain_ascii:
                    Chinese_Domain.append(url)
                    Ascii_Domain.append(domain_ascii)
                else:
                    # Conversion failed: keep the entry in the error list
                    Url_Err.append(url)
            else:
                Url_Err.append(url)

        Root_Domains, Schemes_Domains, NO_Schemes_Domains = self.handle_Domain_Valid(Domain_Valid)
        return IP_valid, Domain_Valid, Chinese_Domain, Ascii_Domain, Root_Domains, Schemes_Domains, NO_Schemes_Domains, Url_Err

    # Deduplicate URLs
    def deduplicate_urls(self, file_name):
        non_scheme_parts = []
        texts = []
        with open(file_name, 'r', encoding='utf-8') as files:
            filelist = files.readlines()
        for fileurl in filelist:
            url = fileurl.strip()
            if url:
                if "://" in url:
                    url_parts = url.split("://")
                    non_scheme_part = re.sub(r'/+', '/', url_parts[1])
                    if non_scheme_part and non_scheme_part not in non_scheme_parts:
                        non_scheme_parts.append(non_scheme_part)
                        texts.append(f"{url_parts[0]}://{non_scheme_part}")
                else:
                    url = re.sub(r'/+', '/', url)
                    if url and url not in non_scheme_parts:
                        non_scheme_parts.append(url)
                        texts.append(url)
        return texts

    def save_results(self, output, data):
        savedata = []
        with open(output, 'w', encoding='utf-8') as fs:
            for i in data:
                if i and i not in savedata:
                    # Remember written entries so duplicates are skipped
                    savedata.append(i)
                    fs.write(i + '\n')

    def save_and_print_results(self, files, Root_Domains, Schemes_Domains, NO_Schemes_Domains, IPs, IP_Segment,
                               Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List, Chinese_Domain,
                               Ascii_Domain, All_Schemes_Domains, All_No_Schemes_Domains, All_Data, All_Err):

        # Build the output file name
        filename = ''.join(files.split('/')[-1].split('.')[:-1])
        timenow = str(time.time()).split(".")[0]
        # outfilename = f'{filename}-'
        outfilename = f'{filename}-{timenow}'

        # Directory of this script
        script_dir = os.path.dirname(__file__)

        # Create the log directory
        log_dir = os.path.join(script_dir, 'log')
        os.makedirs(log_dir, exist_ok=True)

        outputs = {
            'Domains_Root': Root_Domains,
            'Domains_Schemes': Schemes_Domains,
            'Domains_No_Schemes': NO_Schemes_Domains,
            'Domains_Chinese': Chinese_Domain,
            'Domains_Chinese_Ascii': Ascii_Domain,
            'IPs': IPs,
            'IP_Segment': IP_Segment,
            'IP_Domains_Schemes': Schemes_IP_Domains,
            'IP_Domains_No_Schemes': NO_Schemes_IP_Domains,
            'IP_Ports_Sorted_List': IP_Ports_Sorted_List,
            'All_Domains_Schemes': All_Schemes_Domains,
            'All_Domains_No_Schemes': All_No_Schemes_Domains,
            'All_Err': All_Err,
            'All_Data_Quchong': All_Data
        }

        output_file = os.path.join(log_dir, f'{outfilename}.xlsx')
        writer = pd.ExcelWriter(output_file)

        # Write the outputs dict in reverse order
        for key, data in reversed(list(outputs.items())):
            df = pd.DataFrame(data)
            df.to_excel(writer, sheet_name=key, index=False, header=False)

        writer.close()
        print(f"Output written to: {output_file}")
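
    # Note (added for documentation): pd.ExcelWriter infers the engine from the
    # .xlsx suffix and, in recent pandas releases, writes through openpyxl by
    # default, so openpyxl must be installed alongside pandas (see requirement.txt).
    # Writing reversed(outputs) puts the aggregate "All_*" sheets first in the workbook.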

    def process_file(self, files):
        # Deduplicate
        texts = self.deduplicate_urls(files)

        # Extract domains and IPs
        IP_valid, Domain_Valid, Chinese_Domain, Ascii_Domain, Root_Domains, Schemes_Domains, NO_Schemes_Domains, Url_Err = self.classify_urls(texts)

        # Extract bare IPs and IP segments, then sort
        IPs, IP_Segment, IPs_Err = self.process_ips(IP_valid)
        # Process URLs that contain IPs
        ip_domains, Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List = self.ip_url_output(IPs, IP_valid)
        # Merge
        All_Schemes_Domains = Schemes_Domains + Schemes_IP_Domains
        All_No_Schemes_Domains = NO_Schemes_Domains + NO_Schemes_IP_Domains
        All_Err = Url_Err + IPs_Err
        All_Data = Domain_Valid + ip_domains
        # Save and print the collected results
        self.save_and_print_results(files, Root_Domains, Schemes_Domains, NO_Schemes_Domains, IPs, IP_Segment,
                                    Schemes_IP_Domains, NO_Schemes_IP_Domains, IP_Ports_Sorted_List, Chinese_Domain,
                                    Ascii_Domain, All_Schemes_Domains, All_No_Schemes_Domains, All_Data, All_Err)

if __name__ == '__main__':

    parser = argparse.ArgumentParser(description="Process URLs and IPs from a file.")
    parser.add_argument('file', type=str, help='The path to the file containing URLs and IPs.')
    args = parser.parse_args()

    processor = DomainIPProcessor()
    processor.process_file(args.file)
--------------------------------------------------------------------------------
/demo.txt:
--------------------------------------------------------------------------------
192.168.1.100
192.168.1.10
192.168.1.130
192.168.1.130
192.168.1.101
192.168.2.10
192.168.3.130
172.16.1.101
192.168.1.100:8080
192.168.1.101:8081/path
192.168.1.100:808
192.168.1.1001
1192.168.1.100

<>
12313
example.com
A.example.com
example.com:808
example.com:8081/path
example.com:8080
http://exam1ple.com
https://example.com
http://A.example.com
http://example.com:8080
https://example.com:8080
http://example.com/path
https://example.com/path
http://192.168.1.100
https://192.168.1.100
http://192.168.1.100:8080
https://192.168.1.100:8080
http://192.168.1.100:8080/path
https://192.168.1.100:8080/path
http://192.168.1.100/path
https://192.168.1.100/path
http://192.168.1.100:8080/path
https://192.168.1.100:8080/path
https://1192.168.1.100:8080/path
例1子.域名
http://例子.域名
https://域名.例子
http://192.168.1.100:8080/path/path
http://192.168.1.100:8080//path/path
http://192.168.1.100:8080//path//path
http://192.168.1.100:8080//path////path
http://192.168.1.100:8080/path//path///path?ad=../.././../
http://example.com/?query=123
http://example.com/#fragment
http://example.com/#fragment
http://122.128.1.100.examp3le.com/#fragment
--------------------------------------------------------------------------------
/requirement.txt:
--------------------------------------------------------------------------------
idna
tabulate
tldextract
pandas    # used by data_processor_excel.py
openpyxl  # Excel engine pandas uses to write .xlsx files
--------------------------------------------------------------------------------