├── README.md
├── README_CN.md
├── example.py
├── fast_threshold_clustering.py
└── requirements.txt


/README.md:
--------------------------------------------------------------------------------
  1 | # FastThresholdClustering
  2 | ### [中文文档](README_CN.md)
  3 | 
  4 | ## Introduction
  5 | `FastThresholdClustering` is an efficient vector clustering algorithm based on FAISS, particularly suitable for large-scale vector data clustering tasks. The algorithm uses cosine similarity as the distance metric and supports GPU acceleration.
  6 | 
  7 | ## Key Features
  8 | - GPU acceleration support
  9 | - Automatic parameter optimization
 10 | - Memory usage optimization
 11 | - Performance monitoring and logging
 12 | - Batch processing for large-scale data
 13 | - Noise point detection
 14 | 
 15 | ## Quick Start
 16 | 
 17 | ```python
 18 | from fast_threshold_clustering import fast_cluster_embeddings
 19 | 
 20 | # Use the convenience function for clustering
 21 | labels = fast_cluster_embeddings(
 22 |     embeddings,
 23 |     similarity_threshold=0.8,
 24 |     min_samples=5,
 25 |     use_gpu=True
 26 | )
 27 | ```
 28 | 
 29 | # FastThresholdClustering Parameter Details
 30 | 
 31 | ## Core Parameters
 32 | 
 33 | ### similarity_threshold
 34 | | Property | Description |
 35 | |----------|-------------|
 36 | | Type | float |
 37 | | Default | 0.8 |
 38 | | Range | [0, 1] |
 39 | | Function | Similarity threshold for determining if two vectors belong to the same cluster |
 40 | 
 41 | **Detailed Description**:
 42 | - Higher values lead to stricter clustering and more clusters
 43 | - Lower values result in looser clustering and fewer clusters
 44 | - Recommended values:
 45 |   - 0.7-0.8: Suitable for general text vectors
 46 |   - 0.8-0.9: Suitable for high-precision matching scenarios
 47 |   - >0.9: Extremely strict matching requirements
 48 | - Performance impact: a higher threshold lets fewer neighbor pairs through to the merge step, so clustering finishes faster
 49 | 
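As a rough illustration of what the threshold compares, the sketch below computes cosine similarity with plain NumPy and checks it against `similarity_threshold`. The helper names `cosine_similarity` and `passes_threshold` are made up for this example and are not part of the library, which applies the comparison to FAISS search results instead.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two 1-D vectors: normalize, then take the dot product."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))


def passes_threshold(a: np.ndarray, b: np.ndarray, similarity_threshold: float = 0.8) -> bool:
    """True when the pair is similar enough to be considered for the same cluster."""
    return cosine_similarity(a, b) >= similarity_threshold


u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])  # 45 degrees away from u, cosine is about 0.707
w = np.array([2.0, 0.1])  # almost parallel to u

print(passes_threshold(u, w))       # nearly parallel pair passes at 0.8
print(passes_threshold(u, v))       # 0.707 is below the default 0.8
print(passes_threshold(u, v, 0.7))  # the same pair passes a looser threshold
```

Raising the threshold from 0.7 to 0.8 is enough to separate `u` and `v` here, which is why stricter thresholds tend to produce more, smaller clusters.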
 50 | ### min_samples
 51 | | Property | Description |
 52 | |----------|-------------|
 53 | | Type | int |
 54 | | Default | 5 |
 55 | | Range | >= 2 |
 56 | | Function | Minimum number of samples required to form a valid cluster |
 57 | 
 58 | **Detailed Description**:
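 59 | - Clusters with fewer than `min_samples` members are marked as noise (label -1), as are points with fewer than `min_samples` retrieved neighbors at or above `similarity_threshold`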
 60 | - Parameter setting recommendations:
 61 |   - Small datasets (<1000): 2-5
 62 |   - Medium datasets (1000-10000): 5-10
 63 |   - Large datasets (>10000): 10-20
 64 | - Key parameter affecting noise point determination
 65 | - Higher values result in more noise points and higher cluster quality
 66 | 
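The cluster-size rule can be sketched with NumPy as follows. `mark_small_clusters_as_noise` is a hypothetical helper written for this README, not a library function, and it only covers relabeling by cluster size.

```python
import numpy as np


def mark_small_clusters_as_noise(labels: np.ndarray, min_samples: int) -> np.ndarray:
    """Return a copy of labels where clusters smaller than min_samples become -1."""
    labels = labels.copy()
    unique_labels, sizes = np.unique(labels, return_counts=True)
    small = unique_labels[sizes < min_samples]
    labels[np.isin(labels, small)] = -1
    return labels


raw = np.array([0, 0, 0, 1, 1, 2])
cleaned = mark_small_clusters_as_noise(raw, min_samples=3)
print(cleaned.tolist())
```

With `min_samples=3`, only cluster 0 keeps its label; clusters 1 and 2 are too small and end up as noise.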
 67 | ## Performance Parameters
 68 | 
 69 | ### use_gpu
 70 | | Property | Description |
 71 | |----------|-------------|
 72 | | Type | bool |
 73 | | Default | True |
 74 | | Function | Whether to use GPU acceleration |
 75 | 
 76 | **Detailed Description**:
 77 | - True: Use GPU acceleration
 78 | - False: Use CPU computation
 79 | - Performance impact:
 80 |   - GPU mode: Suitable for large-scale data (>100k entries)
 81 |   - CPU mode: Suitable for small-scale data (<100k entries)
 82 | - Memory usage:
 83 |   - GPU mode limited by VRAM
 84 |   - CPU mode limited by RAM
 85 | - Recommendation: Prefer GPU mode when available
 86 | 
 87 | ### nprobe
 88 | | Property | Description |
 89 | |----------|-------------|
 90 | | Type | int |
 91 | | Default | 8 |
 92 | | Range | [1, nlist] |
 93 | | Function | Number of cluster units to visit during FAISS index search |
 94 | 
 95 | **Detailed Description**:
 96 | - Balances search accuracy and speed
 97 | - Recommended values:
 98 |   - Small datasets (<10k): 4-8
 99 |   - Medium datasets (10k-100k): 8-16
100 |   - Large datasets (>100k): 16-32
101 | - Higher values:
102 |   - Pros: More accurate search results
103 |   - Cons: Slower search speed
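104 | - Chosen from the dataset size when `fast_cluster_embeddings` is called with `nprobe=None`; the `FastThresholdClustering` constructor itself defaults to 8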
105 | 
106 | ### batch_size
107 | | Property | Description |
108 | |----------|-------------|
109 | | Type | int |
110 | | Default | 1000 |
111 | | Range | [100, dataset size] |
112 | | Function | Batch size affecting memory usage and computation speed |
113 | 
114 | **Detailed Description**:
115 | - Recommended values:
116 |   - GPU mode: 500-2000
117 |   - CPU mode: 200-1000
118 | - Memory impact:
119 |   - Larger values use more memory
120 |   - Smaller values increase computation time
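121 | - Auto-adjustment (when `fast_cluster_embeddings` is called with `batch_size=None`):
122 |   - Small datasets: Smaller batch_size (one tenth of the dataset)
123 |   - Large datasets: Larger batch_size, capped at 1000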
124 | - Adjust based on available memory
125 | 
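To make the memory trade-off concrete, here is a minimal sketch of slicing an array into batches of `batch_size` rows. `iter_batches` is an illustrative name, not a library function; only one slice needs to be handed to the index at a time.

```python
import numpy as np


def iter_batches(data: np.ndarray, batch_size: int):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


data = np.zeros((2500, 8), dtype=np.float32)
batch_rows = [batch.shape[0] for batch in iter_batches(data, batch_size=1000)]
print(batch_rows)
```

A larger `batch_size` means fewer iterations with bigger slices, and a smaller one means more iterations with a lower peak footprint per step.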
126 | ### n_workers
127 | | Property | Description |
128 | |----------|-------------|
129 | | Type | int |
130 | | Default | None |
131 | | Range | [1, CPU cores] |
132 | | Function | Number of parallel processing worker threads |
133 | 
134 | **Detailed Description**:
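135 | - When None, automatically set to min(CPU cores, 8). The current `fit` implementation stores this value but runs its loops serially, so it does not yet change runtime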
136 | - Recommended values:
137 |   - Small datasets: 2-4 threads
138 |   - Medium datasets: 4-8 threads
139 |   - Large datasets: 8-16 threads
140 | - Considerations:
141 |   - Too many threads may cause resource contention
142 |   - Consider resource needs of other system processes
143 |   - Less impact in GPU mode
144 | 
145 | ## Parameter Combination Recommendations
146 | 
147 | ### Small Dataset Optimization (<10k samples)
148 | ```python
149 | FastThresholdClustering(
150 |     similarity_threshold=0.75,
151 |     min_samples=3,
152 |     use_gpu=False,
153 |     nprobe=4,
154 |     batch_size=500,
155 |     n_workers=4
156 | )
157 | ```
158 | 
159 | ### Large Dataset Optimization (>100k samples)
160 | ```python
161 | FastThresholdClustering(
162 |     similarity_threshold=0.85,
163 |     min_samples=10,
164 |     use_gpu=True,
165 |     nprobe=32,
166 |     batch_size=2000,
167 |     n_workers=8
168 | )
169 | ```
170 | 
171 | ### High Precision Scenarios
172 | ```python
173 | FastThresholdClustering(
174 |     similarity_threshold=0.9,
175 |     min_samples=5,
176 |     use_gpu=True,
177 |     nprobe=64,
178 |     batch_size=1000,
179 |     n_workers=8
180 | )
181 | ```
182 | 
183 | ### Main Methods
184 | 
185 | ```python
186 | def fit(self, embeddings: np.ndarray) -> "FastThresholdClustering":
187 |     """
188 |     Perform clustering on input vectors
189 |     
190 |     Parameters:
191 |         embeddings: numpy array with shape (n_samples, n_features)
192 |         
193 |     Returns:
194 |         self: Returns the clustering instance
195 |     """
196 | ```
197 | 
198 | ### Convenience Function
199 | 
200 | ```python
201 | def fast_cluster_embeddings(
202 |     embeddings: np.ndarray,
203 |     similarity_threshold: float = 0.8,
204 |     min_samples: int = 5,
205 |     use_gpu: bool = True,
206 |     nprobe: int = None,
207 |     batch_size: int = None,
208 |     n_workers: int = None
209 | ) -> np.ndarray:
210 |     """
211 |     Quick clustering interface function
212 |     
213 |     Parameters:
214 |         embeddings: Input vectors with shape (n_samples, n_features)
215 |         similarity_threshold: Clustering similarity threshold
216 |         min_samples: Minimum sample count
217 |         use_gpu: Whether to use GPU
218 |         nprobe: FAISS index nprobe parameter (optional)
219 |         batch_size: Batch size (optional)
220 |         n_workers: Number of worker threads (optional)
221 |     
222 |     Returns:
223 |         labels: Clustering label array with shape (n_samples,)
224 |     """
225 | ```
226 | 
227 | ## Return Value Description
228 | - Clustering results stored in `labels_` attribute
229 | - Label -1 indicates noise points
230 | - Other labels numbered consecutively from 0
231 | 
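Given that convention, a label array can be summarized with a few lines of NumPy. `summarize_labels` below is a hypothetical helper for illustration only; it assumes noise is marked with -1 and cluster labels are consecutive integers starting at 0, as described above.

```python
import numpy as np


def summarize_labels(labels) -> dict:
    """Count clusters, noise points, and per-cluster sizes from a label array."""
    labels = np.asarray(labels)
    clustered = labels[labels != -1]
    sizes = np.bincount(clustered) if clustered.size else np.array([], dtype=int)
    return {
        "n_clusters": int(len(np.unique(clustered))),
        "n_noise": int(np.sum(labels == -1)),
        "sizes": sizes.tolist(),
    }


summary = summarize_labels([0, 0, 1, -1, 1, 1, -1])
print(summary)
```

In real use the input would be `clusterer.labels_` or the array returned by `fast_cluster_embeddings`.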
232 | ## Performance Monitoring
233 | The algorithm includes a built-in performance monitoring system that automatically records:
234 | - Time spent in each phase
235 | - Memory usage
236 | - Clustering progress
237 | - Final clustering statistics
238 | 
239 | ## Usage Example
240 | 
241 | ```python
242 | import numpy as np
243 | from fast_threshold_clustering import FastThresholdClustering
244 | 
245 | # Prepare data (FAISS works on float32 vectors)
246 | embeddings = np.random.random((10000, 768)).astype(np.float32)
247 | 
248 | # Create clusterer
249 | clusterer = FastThresholdClustering(
250 |     similarity_threshold=0.8,
251 |     min_samples=5,
252 |     use_gpu=True
253 | )
254 | 
255 | # Perform clustering
256 | clusterer.fit(embeddings)
257 | 
258 | # Get clustering results
259 | labels = clusterer.labels_
260 | ```
261 | 
262 | See **example.py** for a more detailed usage example.
263 | 
264 | ## Notes
265 | 1. Input vectors are automatically L2 normalized
266 | 2. GPU acceleration recommended for large-scale datasets
267 | 3. `fast_cluster_embeddings` picks `nprobe`, `batch_size`, and `n_workers` from the data scale when they are left as `None`
268 | 4. Memory usage increases with data scale
269 | 
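Note 1 is what makes an inner-product index behave like a cosine-similarity index. The following is a small NumPy sketch of that idea, not the library's code; `l2_normalize` and its `eps` guard against all-zero rows are assumptions made for the example.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length; eps avoids dividing by zero for all-zero rows."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)


x = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
unit = l2_normalize(x)

# For unit-length rows, the matrix of inner products is the matrix of cosine similarities.
similarities = unit @ unit.T
print(np.round(similarities, 3))
```

After normalization every row has length 1, so the diagonal of `similarities` is 1 and the off-diagonal entries are directly comparable to `similarity_threshold`.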


--------------------------------------------------------------------------------
/README_CN.md:
--------------------------------------------------------------------------------
  1 | # FastThresholdClustering
  2 | 
  3 | 
  4 | ## 简介
  5 | `FastThresholdClustering` 是一个基于 FAISS 的高效向量聚类算法，特别适用于大规模向量数据的聚类任务。该算法使用余弦相似度作为距离度量，并支持 GPU 加速。
  6 | 
  7 | ## 主要特点
  8 | - 支持 GPU 加速
  9 | - 自动参数优化
 10 | - 内存使用优化
 11 | - 性能监控和日志记录
 12 | - 批处理处理大规模数据
 13 | - 噪声点检测
 14 | 
 15 | ## 快速开始
 16 | 
 17 | ```python
 18 | from fast_threshold_clustering import fast_cluster_embeddings
 19 | 
 20 | # 使用便捷函数进行聚类
 21 | labels = fast_cluster_embeddings(
 22 |     embeddings,
 23 |     similarity_threshold=0.8,
 24 |     min_samples=5,
 25 |     use_gpu=True
 26 | )
 27 | ```
 28 | 
 29 | 
 30 | # FastThresholdClustering 参数详解
 31 | 
 32 | ## 核心参数
 33 | 
 34 | ### similarity_threshold
 35 | | 属性 | 说明 |
 36 | |------|------|
 37 | | 类型 | float |
 38 | | 默认值 | 0.8 |
 39 | | 取值范围 | [0, 1] |
 40 | | 功能 | 判定两个向量是否属于同一簇的相似度阈值 |
 41 | 
 42 | **详细说明**：
 43 | - 值越大，聚类标准越严格，形成的簇越多
 44 | - 值越小，聚类越宽松，簇的数量越少
 45 | - 建议取值：
 46 |   - 0.7-0.8：适用于一般文本向量
 47 |   - 0.8-0.9：适用于需要高精度匹配的场景
 48 |   - >0.9：极其严格的匹配要求
 49 | - 对性能影响：阈值越高，计算速度越快
 50 | 
 51 | ### min_samples
 52 | | 属性 | 说明 |
 53 | |------|------|
 54 | | 类型 | int |
 55 | | 默认值 | 5 |
 56 | | 取值范围 | >= 2 |
 57 | | 功能 | 形成一个有效簇所需的最小样本数量 |
 58 | 
 59 | **详细说明**：
 60 | - 小于此数量的簇会被标记为噪声点（标签为-1）
 61 | - 参数设置建议：
 62 |   - 小数据集（<1000）：2-5
 63 |   - 中等数据集（1000-10000）：5-10
 64 |   - 大数据集（>10000）：10-20
 65 | - 影响噪声点判定的关键参数
 66 | - 值越大，噪声点越多，簇的质量越高
 67 | 
 68 | ## 性能相关参数
 69 | 
 70 | ### use_gpu
 71 | | 属性 | 说明 |
 72 | |------|------|
 73 | | 类型 | bool |
 74 | | 默认值 | True |
 75 | | 功能 | 是否使用GPU加速计算 |
 76 | 
 77 | **详细说明**：
 78 | - True：使用GPU加速计算
 79 | - False：使用CPU计算
 80 | - 性能影响：
 81 |   - GPU模式：适合大规模数据（>10万条）
 82 |   - CPU模式：适合小规模数据（<10万条）
 83 | - 内存使用：
 84 |   - GPU模式受显存限制
 85 |   - CPU模式受内存限制
 86 | - 建议：有GPU时优先使用GPU模式
 87 | 
 88 | ### nprobe
 89 | | 属性 | 说明 |
 90 | |------|------|
 91 | | 类型 | int |
 92 | | 默认值 | 8 |
 93 | | 取值范围 | [1, nlist] |
 94 | | 功能 | FAISS索引搜索时访问的聚类单元数量 |
 95 | 
 96 | **详细说明**：
 97 | - 影响搜索精度和速度的平衡参数
 98 | - 建议取值：
 99 |   - 小数据集（<10k）：4-8
100 |   - 中等数据集（10k-100k）：8-16
101 |   - 大数据集（>100k）：16-32
102 | - 值越大：
103 |   - 优点：搜索结果越准确
104 |   - 缺点：搜索速度越慢
105 | - 会根据数据规模自动调整
106 | 
107 | ### batch_size
108 | | 属性 | 说明 |
109 | |------|------|
110 | | 类型 | int |
111 | | 默认值 | 1000 |
112 | | 取值范围 | [100, 数据集大小] |
113 | | 功能 | 批处理大小，影响内存使用和计算速度 |
114 | 
115 | **详细说明**：
116 | - 建议取值：
117 |   - GPU模式：500-2000
118 |   - CPU模式：200-1000
119 | - 内存影响：
120 |   - 值越大，内存使用越多
121 |   - 值越小，计算时间越长
122 | - 自动调整：
123 |   - 小数据集：较小batch_size
124 |   - 大数据集：较大batch_size
125 | - 需要根据可用内存调整
126 | 
127 | ### n_workers
128 | | 属性 | 说明 |
129 | |------|------|
130 | | 类型 | int |
131 | | 默认值 | None |
132 | | 取值范围 | [1, CPU核心数] |
133 | | 功能 | 并行处理的工作线程数 |
134 | 
135 | **详细说明**：
136 | - None时自动设置为 min(CPU核心数, 8)
137 | - 建议取值：
138 |   - 小数据集：2-4线程
139 |   - 中等数据集：4-8线程
140 |   - 大数据集：8-16线程
141 | - 注意事项：
142 |   - 线程数过多可能导致资源竞争
143 |   - 需要考虑系统其他进程的资源需求
144 |   - GPU模式下影响较小
145 | 
146 | ## 参数组合建议
147 | 
148 | ### 小数据集优化（<10k样本）
149 | ```python
150 | FastThresholdClustering(
151 |     similarity_threshold=0.75,
152 |     min_samples=3,
153 |     use_gpu=False,
154 |     nprobe=4,
155 |     batch_size=500,
156 |     n_workers=4
157 | )
158 | ```
159 | 
160 | ### 大数据集优化（>100k样本）
161 | ```python
162 | FastThresholdClustering(
163 |     similarity_threshold=0.85,
164 |     min_samples=10,
165 |     use_gpu=True,
166 |     nprobe=32,
167 |     batch_size=2000,
168 |     n_workers=8
169 | )
170 | ```
171 | 
172 | ### 高精度要求场景
173 | ```python
174 | FastThresholdClustering(
175 |     similarity_threshold=0.9,
176 |     min_samples=5,
177 |     use_gpu=True,
178 |     nprobe=64,
179 |     batch_size=1000,
180 |     n_workers=8
181 | )
182 | ```
183 | 
184 | #### 主要方法
185 | 
186 | ```python
187 | def fit(self, embeddings: np.ndarray) -> FastThresholdClustering:
188 |     """
189 |     对输入的向量进行聚类
190 |     
191 |     参数:
192 |         embeddings: shape为(n_samples, n_features)的numpy数组
193 |         
194 |     返回:
195 |         self: 返回聚类器实例
196 |     """
197 | ```
198 | 
199 | ### 便捷函数
200 | 
201 | ```python
202 | def fast_cluster_embeddings(
203 |     embeddings: np.ndarray,
204 |     similarity_threshold: float = 0.8,
205 |     min_samples: int = 5,
206 |     use_gpu: bool = True,
207 |     nprobe: int = None,
208 |     batch_size: int = None,
209 |     n_workers: int = None
210 | ) -> np.ndarray:
211 |     """
212 |     快速聚类接口函数
213 |     
214 |     参数:
215 |         embeddings: 输入向量，shape为(n_samples, n_features)
216 |         similarity_threshold: 聚类相似度阈值
217 |         min_samples: 最小样本数
218 |         use_gpu: 是否使用GPU
219 |         nprobe: FAISS索引的nprobe参数（可选）
220 |         batch_size: 批处理大小（可选）
221 |         n_workers: 工作线程数（可选）
222 |     
223 |     返回:
224 |         labels: 聚类标签数组，shape为(n_samples,)
225 |     """
226 | ```
227 | 
228 | ## 返回值说明
229 | - 聚类结果存储在 `labels_` 属性中
230 | - 标签为 -1 表示噪声点
231 | - 其他标签从 0 开始连续编号
232 | 
233 | ## 性能监控
234 | 算法内置了性能监控系统，会自动记录：
235 | - 各阶段耗时
236 | - 内存使用情况
237 | - 聚类进度
238 | - 最终聚类统计信息
239 | 
240 | ## 使用示例
241 | 
242 | ```python
243 | import numpy as np
244 | from fast_threshold_clustering import FastThresholdClustering
245 | 
246 | # 准备数据
247 | embeddings = np.random.random((10000, 768))
248 | 
249 | # 创建聚类器
250 | clusterer = FastThresholdClustering(
251 |     similarity_threshold=0.8,
252 |     min_samples=5,
253 |     use_gpu=True
254 | )
255 | 
256 | # 执行聚类
257 | clusterer.fit(embeddings)
258 | 
259 | # 获取聚类结果
260 | labels = clusterer.labels_
261 | ```
262 | 
263 | ## 详细使用用例见**example.py**
264 | 
265 | ## 注意事项
266 | 1. 输入向量会自动进行L2归一化
267 | 2. 大规模数据集建议启用GPU加速
268 | 3. 参数会根据数据规模自动优化
269 | 4. 内存使用会随数据规模增长
270 | 


--------------------------------------------------------------------------------
/example.py:
--------------------------------------------------------------------------------
  1 | import pandas as pd
  2 | from sentence_transformers import SentenceTransformer
  3 | from fast_threshold_clustering import (
  4 |     fast_cluster_embeddings,
  5 |     get_recommended_nprobe,
  6 | )
  7 | 
  8 | 
  9 | # Model for computing sentence embeddings.
 10 | model = SentenceTransformer("./m3e_nli_triple_large")
 11 | 
 12 | # df looks like this (a single column named "问题", i.e. "question"):
 13 | #
 14 | #     问题
 15 | # 0   请详细介绍许家小母马的来历、特点及其在小说中的地位。
 16 | # 1   许七安是如何获得小母马的？这匹马在许家有什么特殊意义？
 17 | # 2   作者是如何通过小母马这个非人物角色来丰富小说内容的？这种写作手法有什么效果？
 18 | # 3   请详细描述平阳郡主的身份、外貌特征以及性格特点，并说明她的遭遇。
 19 | #
 20 | df = pd.read_excel("test1.xlsx")
 21 | df.drop_duplicates(subset=["问题"], inplace=True)
 22 | 
 23 | print("Encode the corpus. This might take a while")
 24 | corpus_sentences = df["问题"].tolist()
 25 | corpus_embeddings = model.encode(corpus_sentences, batch_size=64, show_progress_bar=True, convert_to_tensor=False)
 26 | 
 27 | 
 28 | recommended_nprobe = get_recommended_nprobe(n_samples=corpus_embeddings.shape[0])
 29 | print(f"Recommended Nprobe: {recommended_nprobe}")
 30 | 
 31 | labels = fast_cluster_embeddings(
 32 |     corpus_embeddings,
 33 |     similarity_threshold=0.60,
 34 |     min_samples=25,
 35 |     use_gpu=False,
 36 |     nprobe=32,
 37 |     batch_size=16,
 38 |     n_workers=8
 39 | )
 40 | cluster_df = pd.DataFrame({"sentence": corpus_sentences, "label": labels})
 41 | cluster_df = cluster_df[cluster_df["label"] != -1]  # drop noise points
 42 | 
 43 | # Group sentences by cluster label
 44 | grouped_df = cluster_df.groupby('label')['sentence'].agg(list).reset_index()
 45 | 
 46 | # Sort by label
 47 | grouped_df = grouped_df.sort_values('label')
 48 | # Print the result
 49 | for _, row in grouped_df.iterrows():
 50 |     print(f"Label {row['label']}:")
 51 |     for sentence in row['sentence']:
 52 |         print(f"  - {sentence}")
 53 |     print()
 54 | """
 55 | Results:
 56 | Label 0:
 57 |   - 请详细介绍许家小母马的来历、特点及其在小说中的地位。
 58 |   - 许七安是如何获得小母马的？这匹马在许家有什么特殊意义？
 59 |   - 小母马在《大奉打更人》中的身份和重要性如何？它与许七安和许平志的关系是怎样的？
 60 |   - 小母马的外形特征和在故事中的地位如何？它为什么会被读者称为'女主角'？
 61 |   - 小母马从许平志转赠给许七安的过程有何意义？这种传承如何影响故事发展和人物关系？
 62 |   - 小母马在《大奉打更人》中的角色定位和特点是什么？
 63 |   - 小母马如何影响了《大奉打更人》中的人物关系和情节发展？
 64 |   - 小母马在《大奉打更人》中被读者称为'女主角'的原因是什么？这种称呼反映了什么？
 65 |   - 《大奉打更人》在起点中文网创造了哪些记录？同时，请介绍小说中'小母马'这个特殊角色的来历和地位。
 66 |   - 小母马在《大奉打更人》中的所有权变化和特殊地位是怎样的？
 67 |   - 小母马作为《大奉打更人》中的特殊'物品'，它有哪些独特之处？
 68 |   - 小母马在《大奉打更人》的故事发展和角色塑造中扮演了什么角色？
 69 |   - 请详细介绍小母马的来历、特征及其在小说中的地位。
 70 |   - 小母马在许七安和许平志之间扮演了什么角色？它在小说中可能有什么象征意义？
 71 |   - 读者将小母马称为'女主角'这一现象反映了什么？这对理解作者的写作风格有何启示？
 72 |   - 许平志与许七安、小母马之间有什么关系？这些关系反映了许平志的什么特点？
 73 |   - 请详细介绍《大奉打更人》中小母马的特点、来历和在小说中的地位。
 74 |   - 小母马在许七安和许平志之间扮演了什么角色？它作为坐骑有什么特别之处？
 75 |   - 为什么读者会将小母马戏称为'女主角'？这种称呼反映了小说创作的哪些特点？
 76 |   - 请详细介绍许家小母马的来历、特征及在小说中的地位。
 77 |   - 许七安是如何获得小母马的？这对许家成员关系有何影响？
 78 |   - 请详细介绍《大奉打更人》中小母马的特征、来历和在故事中的地位。
 79 |   - 小母马如何影响了《大奉打更人》中许七安和许平志的关系，以及它在故事发展中可能扮演的角色？
 80 |   - 许七安的坐骑小母马有什么来历？
 81 |   - 许七安的小母马有什么特点?
 82 |   - 许家小母马有什么特点？
 83 |   - 小母马原来是谁的坐骑？
 84 |   - 许七安和小母马是什么关系?
 85 |   - 许家的小母马有什么特别之处吗？
 86 |   - 许家的小母马有什么来历？
 87 |   - 为什么许七安的小母马被网友戏称为女主
 88 |   - 在云州案中许七安的成长与他的坐骑小母马有什么关联?
 89 |   - 小母马和许平志是什么关系？
 90 | ...
 91 | 
 92 | Label 1:
 93 |   - 请详细描述平阳郡主的身份、外貌特征以及性格特点，并说明她的遭遇。
 94 |   - 恒慧与恒远、平阳郡主之间的关系如何？这些关系对他的人生轨迹有何影响？
 95 |   - 请详细描述平阳郡主的外貌特征和她的悲剧遭遇，并分析这些因素如何塑造了她的形象。
 96 |   - 请详细描述恒慧和平阳郡主的爱情故事，以及他们的悲剧结局是如何发生的？
 97 |   - 平阳郡主的身份背景是什么？她和恒慧的私奔事件背后有什么政治因素？
 98 |   - 恒远、恒慧和平阳郡主之间有什么关系？他们各自的命运如何？
 99 |   - 平阳郡主的遭遇反映了什么样的社会问题？请结合她的身份背景和具体经历分析。
100 |   - 请详细介绍平阳郡主的身份背景、恋情以及与其他人物的关系。
101 |   - 平阳郡主的悲剧经历了哪些关键事件，最终结局如何？
102 |   - 从平阳郡主的行为和结局中，我们可以看出她具有怎样的性格特点？
103 |   - 请详细介绍平阳郡主的身份背景、恋情经历以及最终的悲剧结局。
104 |   - 平阳郡主与恒慧的恋情涉及哪些复杂的人物关系和政治因素？这段关系最终如何影响了他们的命运？
105 |   - 请详细描述恒慧和平阳郡主的爱情故事及其悲剧性结局。
106 |   - 平阳郡主的身份背景如何，她与恒慧的私奔事件背后有什么政治因素？
107 |   - 恒慧和平阳郡主私奔后经历了哪些事情，最终结局如何？
108 |   - 请详细介绍平阳郡主的身份背景、恋情经历及其悲剧结局。
109 |   - 结合平阳郡主的穿着、行为和结局，分析她的性格特点和人物形象。
110 |   - 临安与王妃有什么关系?
111 |   - 平阳郡主与谁相恋？最后的结局如何？
112 |   - 平阳郡主与恒慧私奔是怎么回事？
113 |   - 平阳郡主和恒慧是什么样的情感关系？
114 |   - 平阳郡主之死是怎么被查明真相的？
115 |   - 平阳郡主私奔和誉王之死有什么联系？
116 |   - 平阳郡主与长公主之间是什么关系？
117 |   - 平阳郡主在与恒慧相恋时有哪些表现？
118 |   - 平阳郡主的最终命运是什么
119 |   - 平阳郡主有哪些特点？
120 |   - 平阳郡主是怎么死亡的?
121 |   - 平阳郡主的感情经历是怎样的？
122 |   - 平阳郡主的身份有哪些特殊之处？
123 |   - 第十四集中临安为什么要向王妃隐瞒平阳失踪一事？
124 |   - 永兴帝与临安公主是什么关系？
125 |   - 平阳郡主和恒慧相恋后发生了什么?
126 |   - 平阳郡主有什么遭遇?
127 | ...
128 | 
129 | Label 2:
130 |   - 临安公主的性格特点是什么？这些特点如何在她的行为中体现出来？
131 |   - 请对比分析平阳郡主和临安公主的性格特点及其在故事中的表现。
132 |   - 临安公主的性格特点和外貌特征如何体现她的皇室身份和个人魅力？
133 |   - 比较临安公主与平阳郡主的特点，并分析临安公主的性格特征可能带来的成长潜力和挑战？
134 |   - 请比较平阳郡主和临安的性格特点，并分析她们的皇室背景对其性格形成的影响。
135 |   - 临安的哪些行为体现了她的性格特点？这些特点如何影响她与其他人物的互动？
136 |   - 请详细描述临安的外貌特征和性格特点，并说明她在皇室中的地位。
137 |   - 请比较平阳郡主和临安的外貌特征，并说明临安的家庭背景。
138 |   - 请详细描述并比较丽娜和临安元景帝次女的外貌特征。
139 |   - 请比较平阳郡主、闻人倩柔和姬谦的外貌特征，并分析他们的形象给人的印象。
140 | ...
141 | 
142 | 
143 | ..
144 | """
145 | 
146 | 


--------------------------------------------------------------------------------
/fast_threshold_clustering.py:
--------------------------------------------------------------------------------
  1 | import os
  2 | import gc
  3 | import time
  4 | import psutil
  5 | import faiss
  6 | import logging
  7 | import numpy as np
  8 | from datetime import datetime
  9 | from tqdm.auto import tqdm
 10 | from typing import Optional
 11 | from contextlib import contextmanager
 12 | from concurrent.futures import ThreadPoolExecutor
 13 | from functools import partial
 14 | 
 15 | 
 16 | # Configure logging
 17 | logging.basicConfig(
 18 |     level=logging.INFO,
 19 |     format='%(asctime)s - %(levelname)s - %(message)s',
 20 |     datefmt='%Y-%m-%d %H:%M:%S'
 21 | )
 22 | 
 23 | class Timer:
 24 |     """Timer that records the elapsed time of each phase."""
 25 |     def __init__(self):
 26 |         self.times = {}
 27 |         self.start_times = {}
 28 | 
 29 |     def start(self, name):
 30 |         self.start_times[name] = time.time()
 31 | 
 32 |     def stop(self, name):
 33 |         if name in self.start_times:
 34 |             elapsed = time.time() - self.start_times[name]
 35 |             self.times[name] = elapsed
 36 |             del self.start_times[name]
 37 |             return elapsed
 38 |         return 0
 39 | 
 40 |     def get_time(self, name):
 41 |         return self.times.get(name, 0)
 42 | 
 43 |     def summary(self):
 44 |         logging.info("\n=== Performance summary ===")
 45 |         total_time = sum(self.times.values())
 46 |         for name, elapsed in sorted(self.times.items(), key=lambda x: x[1], reverse=True):
 47 |             percentage = (elapsed / total_time) * 100 if total_time > 0 else 0.0
 48 |             logging.info(f"{name}: {elapsed:.2f}s ({percentage:.1f}%)")
 49 |         logging.info(f"Sum of recorded phases: {total_time:.2f}s")
 50 | 
 51 | @contextmanager
 52 | def timer_context(timer, name):
 53 |     try:
 54 |         start_time = time.time()
 55 |         yield
 56 |     finally:
 57 |         elapsed = time.time() - start_time
 58 |         timer.times[name] = elapsed
 59 | 
 60 | def get_memory_usage():
 61 |     process = psutil.Process(os.getpid())
 62 |     return process.memory_info().rss / 1024 / 1024
 63 | 
 64 | def log_step(message: str, timer: Timer):
 65 |     memory_usage = get_memory_usage()
 66 |     timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
 67 |     logging.info(f"[{timestamp}] {message}")
 68 |     logging.info(f"Memory usage: {memory_usage:.2f} MB")
 69 | 
 70 | def get_recommended_nprobe(n_samples: int) -> int:
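 71 |     """Recommend an nprobe value based on the dataset size."""
 72 |     if n_samples < 10000:
 73 |         return 4  # small dataset: favor speed
 74 |     elif n_samples < 100000:
 75 |         return 8  # medium dataset: balance speed and accuracy
 76 |     elif n_samples < 1000000:
 77 |         return 16  # large dataset: lean toward accuracy
 78 |     else:
 79 |         return 32  # very large dataset: prioritize accuracy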
 80 | 
 81 | def get_recommended_params(n_samples: int, d: int):
 82 |     """Return recommended parameters; every value is clamped to at least 1."""
 83 |     params = {
 84 |         'nlist': max(1, min(int(np.sqrt(n_samples) * 2), n_samples // 20)),
 85 |         'nprobe': get_recommended_nprobe(n_samples),
 86 |         'batch_size': max(1, min(1000, n_samples // 10)),
 87 |         'n_workers': min(os.cpu_count() or 1, 8)  # cpu_count() may return None
 88 |     }
 89 |     return params
 90 | 
 91 | class FastThresholdClustering:
 92 |     def __init__(
 93 |         self,
 94 |         similarity_threshold: float = 0.8,
 95 |         min_samples: int = 5,
 96 |         use_gpu: bool = True,
 97 |         nprobe: int = 8,
 98 |         batch_size: int = 1000,
 99 |         n_workers: int = None
100 |     ):
101 |         self.timer = Timer()
102 |         with timer_context(self.timer, "init"):
103 |             log_step("Initializing clusterer", self.timer)
104 |             self.similarity_threshold = similarity_threshold
105 |             self.min_samples = min_samples
106 |             self.use_gpu = use_gpu
107 |             self.nprobe = nprobe
108 |             self.batch_size = batch_size
109 |             self.n_workers = n_workers or min(os.cpu_count() or 1, 8)
110 |             self.labels_ = None
111 |         
112 |     def _build_index(self, embeddings: np.ndarray):
113 |         """Build the FAISS index."""
114 |         with timer_context(self.timer, "build FAISS index"):
115 |             log_step("Building FAISS index", self.timer)
116 |             d = embeddings.shape[1]
117 |             n = embeddings.shape[0]
118 | 
119 |             # Use the recommended number of IVF cells
120 |             params = get_recommended_params(n, d)
121 |             nlist = params['nlist']
122 | 
123 |             quantizer = faiss.IndexFlatIP(d)
124 |             index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
125 | 
126 |             if self.use_gpu:
127 |                 res = faiss.StandardGpuResources()
128 |                 index = faiss.index_cpu_to_gpu(res, 0, index)
129 | 
130 |             with timer_context(self.timer, "train index"):
131 |                 log_step("Training FAISS index", self.timer)
132 |                 index.train(embeddings)
133 |                 index.nprobe = self.nprobe
134 | 
135 |             # Add vectors in batches
136 |             with timer_context(self.timer, "add vectors"):
137 |                 log_step("Adding vectors to index", self.timer)
138 |                 for i in tqdm(range(0, n, self.batch_size), desc="Adding vectors"):
139 |                     batch = embeddings[i:i+self.batch_size]
140 |                     index.add(batch)
141 | 
142 |             return index
143 | 
144 |     def _process_small_cluster(self, label, embeddings, labels, min_samples, index, k):
145 |         """Handle a single small cluster."""
146 |         if np.sum(labels == label) < min_samples:
147 |             mask = labels == label
148 |             if not np.any(mask):
149 |                 return None
150 |             
151 |             cluster_samples = embeddings[mask]
152 |             mean_vector = np.mean(cluster_samples, axis=0, keepdims=True)
153 |             D, I = index.search(mean_vector, k)
154 |             
155 |             for idx in I[0]:
156 |                 target_label = labels[idx]
157 |                 if target_label != label and np.sum(labels == target_label) >= min_samples:
158 |                     return (mask, target_label)
159 |         return None
160 | 
161 |     def fit(self, embeddings: np.ndarray):
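162 |         with timer_context(self.timer, "total"):
163 |             log_step("Starting clustering", self.timer)
164 |             n_samples = len(embeddings)
165 | 
166 |             # L2-normalize in float32 (the dtype FAISS works on); zero vectors stay zero
167 |             with timer_context(self.timer, "L2 normalization"):
168 |                 log_step("Running L2 normalization", self.timer)
169 |                 norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
170 |                 embeddings = np.ascontiguousarray(embeddings / np.maximum(norms, 1e-12), dtype=np.float32)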
171 |             
172 |             # Build the index
173 |             index = self._build_index(embeddings)
174 | 
175 |             # Start with every point in its own cluster
176 |             with timer_context(self.timer, "init labels"):
177 |                 log_step("Initializing labels", self.timer)
178 |                 self.labels_ = np.arange(n_samples)
179 | 
180 |             # Batched k-nearest-neighbor search
181 |             with timer_context(self.timer, "kNN search"):
182 |                 log_step("Starting kNN search", self.timer)
183 |                 k = min(100, n_samples)
184 |                 D, I = [], []
185 | 
186 |                 for i in tqdm(range(0, n_samples, self.batch_size), desc="kNN search"):
187 |                     batch = embeddings[i:i+self.batch_size]
188 |                     D_batch, I_batch = index.search(batch, k)
189 |                     D.append(D_batch)
190 |                     I.append(I_batch)
191 | 
192 |                 D = np.vstack(D)
193 |                 I = np.vstack(I)
194 |             
195 |             # Build the similarity graph
196 |             with timer_context(self.timer, "build similarity graph"):
197 |                 log_step("Building similarity graph", self.timer)
198 |                 # Keep (similarity, i, j) for neighbor pairs at or above the threshold
199 |                 mask = D >= self.similarity_threshold
200 |                 rows, cols = np.where(mask)
201 |                 neighbors = I[rows, cols]
202 |                 keep = rows < neighbors  # avoid duplicate (i, j) / (j, i) pairs
203 |                 similar_pairs = list(zip(D[rows, cols][keep].tolist(),
204 |                                          rows[keep].tolist(), neighbors[keep].tolist()))
205 | 
206 |                 log_step(f"Similarity graph built with {len(similar_pairs)} similar pairs", self.timer)
207 |             
208 |             # Merge clusters, most similar pairs first
209 |             with timer_context(self.timer, "merge clusters"):
210 |                 log_step("Merging clusters", self.timer)
211 |                 similar_pairs.sort(reverse=True)
212 |                 for sim, i, j in tqdm(similar_pairs, desc="Merging clusters"):
213 |                     if self.labels_[i] != self.labels_[j]:
214 |                         cluster1 = self.labels_[i]
215 |                         cluster2 = self.labels_[j]
216 |                         
217 |                         size1 = np.sum(self.labels_ == cluster1)
218 |                         size2 = np.sum(self.labels_ == cluster2)
219 |                         
220 |                         if size1 >= self.min_samples and size2 >= self.min_samples:
221 |                             continue
222 |                         
223 |                         old_label = max(cluster1, cluster2)
224 |                         new_label = min(cluster1, cluster2)
225 |                         self.labels_[self.labels_ == old_label] = new_label
226 | 
227 |             # Detect noise points
228 |             with timer_context(self.timer, "detect noise"):
229 |                 log_step("Detecting noise points", self.timer)
230 | 
231 |                 # Count each point's neighbors
232 |                 neighbor_counts = np.zeros(n_samples)
233 |                 for i in range(n_samples):
234 |                     # Neighbors at or above the similarity threshold
235 |                     neighbor_counts[i] = np.sum(D[i] >= self.similarity_threshold)
236 | 
237 |                 # Flag points with too few neighbors as noise
238 |                 noise_mask = neighbor_counts < self.min_samples
239 | 
240 |                 # Get the size of every cluster
241 |                 unique_labels, cluster_sizes = np.unique(self.labels_, return_counts=True)
242 |                 small_clusters = unique_labels[cluster_sizes < self.min_samples]
243 | 
244 |                 # Points in undersized clusters are noise as well
245 |                 for label in small_clusters:
246 |                     noise_mask |= (self.labels_ == label)
247 | 
248 |                 # Set the label of noise points to -1
249 |                 self.labels_[noise_mask] = -1
250 | 
251 |                 log_step(f"Detected {np.sum(noise_mask)} noise points", self.timer)
252 |             
253 |             # Renumber clusters consecutively from 0
254 |             with timer_context(self.timer, "relabel clusters"):
255 |                 log_step("Relabeling clusters", self.timer)
256 |                 unique_labels = np.unique(self.labels_)
257 |                 # Exclude the noise label -1
258 |                 unique_labels = unique_labels[unique_labels != -1]
259 |                 label_map = {old: new for new, old in enumerate(unique_labels)}
260 |                 # Noise points keep the label -1
261 |                 self.labels_ = np.array([label_map.get(x, -1) for x in self.labels_])
262 | 
263 |             # Free memory
264 |             del D, I
265 |             gc.collect()
266 | 
267 |             log_step(f"Clustering finished: {len(np.unique(self.labels_[self.labels_ != -1]))} clusters, {np.sum(self.labels_ == -1)} noise points", self.timer)
268 | 
269 |             # Print the performance summary
270 |             self.timer.summary()
271 |             
272 |             return self
273 | 
274 | def fast_cluster_embeddings(
275 |     embeddings: np.ndarray,
276 |     similarity_threshold: float = 0.8,
277 |     min_samples: int = 5,
278 |     use_gpu: bool = True,
279 |     nprobe: int = None,
280 |     batch_size: int = None,
281 |     n_workers: int = None
282 | ) -> np.ndarray:
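283 |     """Convenience wrapper: cluster the embeddings and return the label array."""
284 |     # Get the recommended parameters
285 |     params = get_recommended_params(len(embeddings), embeddings.shape[1])
286 | 
287 |     # Use the caller's values where given, otherwise the recommended ones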
288 |     nprobe = nprobe or params['nprobe']
289 |     batch_size = batch_size or params['batch_size']
290 |     n_workers = n_workers or params['n_workers']
291 |     
292 |     logging.info(f"Recommended Nprobe: {nprobe}")
293 |     logging.info(f"Batch size: {batch_size}")
294 |     logging.info(f"Workers: {n_workers}")
295 |     
296 |     clusterer = FastThresholdClustering(
297 |         similarity_threshold=similarity_threshold,
298 |         min_samples=min_samples,
299 |         use_gpu=use_gpu,
300 |         nprobe=nprobe,
301 |         batch_size=batch_size,
302 |         n_workers=n_workers
303 |     )
304 |     return clusterer.fit(embeddings).labels_
305 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy>=1.19.0
2 | faiss-gpu>=1.7.0  # GPU build
3 | # faiss-cpu>=1.7.0  # or use the CPU build instead
4 | tqdm>=4.45.0
5 | psutil>=5.7.0
6 | 


--------------------------------------------------------------------------------