├── README.md
├── combine-clusters.cpp
├── fast-combine-ssdeep-clus.cpp
├── fast-ssdeep-clus.cpp
└── sort-clusters.cpp

/README.md:
--------------------------------------------------------------------------------
fast-ssdeep-clus : Parallel ssdeep clustering kit
==================================================


What is this?
--------------

This is a simple and efficient ssdeep-based clustering program.
I use this program to cluster my malware collection
(based on a certain threshold value).


What is ssdeep clustering mode?
--------------------------------

ssdeep 2.9 or later implements a simple clustering mode.

* If digest `A` and digest `B` are similar enough,
  `A` and `B` are in the same cluster.
* If `A` and `B` are similar enough and `B` and `C` also are,
  `A`, `B` and `C` are all in the same cluster
  (even if `A` and `C` are NOT similar enough).

Given a threshold value `t`, two digests are treated as "similar"
if their similarity score is greater than `t`.

If you are familiar with clustering, you can see that this is just
single-linkage clustering with a fixed distance threshold.


Requirements
-------------

* ffuzzy++ 4.0


There's no Makefile because this entire program is a hack.
Please build it yourself.


File Format : Fuzzy hash list
------------------------------

This is a simple list of fuzzy hashes, one per line.

    3::
    24:VGXGP7L5e/Ixt3af/WKPPaYpzg4m3XWMCsXNCRs0:kYDxcfPZpelCs9Cm0
    24:3G24hjRrxvF61IJG1XNgf+ojDOja9/JTjw1UBr2mk:W5FhnVtajaRG1gr2r
    (... continues without empty lines ...)
    (EOF)


File Format : Cluster list
---------------------------

It indicates which fuzzy hashes are connected.
If fuzzy hashes `A` and `B` are connected and `C`, `D`, `E` are connected,
the cluster list should look like this:

    A
    B
    (empty line)
    C
    D
    E
    (empty line)
    (EOF)

There is always an empty line after each cluster to indicate
the end of the cluster.


Program : fast-ssdeep-clus
---------------------------

* Input: fuzzy hash list (as first argument)
* Output: cluster list (to `stdout`)

This program clusters fuzzy hashes and makes a cluster list.


Program : fast-combine-ssdeep-clus
-----------------------------------

* Input 1: fuzzy hash list (as first argument)
* Input 2: fuzzy hash list (as second argument)
* Output: cluster list (to `stdout`)

This program tries to make clusters between elements in Input 1 and
elements in Input 2. This program is useful for making clusters
gradually.


Program : combine-clusters
---------------------------

* Input: cluster list (from `stdin`)
* Output: cluster list (to `stdout`)

This program combines clusters. For instance, given input like this:

    A
    B
    (empty line)
    C
    D
    E
    (empty line)
    B
    C
    (empty line)
    (EOF)

the output should look like this:

    A
    B
    C
    D
    E
    (empty line)
    (EOF)

Because all five elements are connected (indirectly), this program
combines all connected clusters into a single cluster.


Program : sort-clusters
------------------------

* Input: cluster list (from `stdin`)
* Output: cluster list (to `stdout`)

This program normalizes output by sorting the cluster list in a certain order.
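
As a rough illustration of what combining followed by normalization
amounts to, here is a minimal sketch. It is not the code used by
`combine-clusters` or `sort-clusters`: the function names
`find_root` and `combine_and_sort` are hypothetical, clusters are held
in memory as vectors of strings, and the ordering (members sorted
lexicographically, then clusters sorted by their members) is an
assumption, since this README does not say which order
`sort-clusters` uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Cluster = std::vector<std::string>;

// Hypothetical helper: disjoint-set "find" with path halving.
static std::size_t find_root(std::vector<std::size_t>& parent, std::size_t x)
{
    while (parent[x] != x)
    {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

// Hypothetical sketch: merge clusters that share a digest, then sort.
// The sort order is an assumption, not the order sort-clusters uses.
std::vector<Cluster> combine_and_sort(const std::vector<Cluster>& input)
{
    std::unordered_map<std::string, std::size_t> id_of; // digest -> id
    std::vector<std::string> digests;                   // id -> digest
    std::vector<std::size_t> parent;                    // disjoint-set forest
    for (const Cluster& cluster : input)
    {
        bool have_first = false;
        std::size_t first = 0;
        for (const std::string& digest : cluster)
        {
            std::size_t id;
            auto it = id_of.find(digest);
            if (it == id_of.end())
            {
                // New digest: it starts as its own one-element set.
                id = digests.size();
                id_of.emplace(digest, id);
                digests.push_back(digest);
                parent.push_back(id);
            }
            else
            {
                id = it->second;
            }
            if (!have_first)
            {
                first = id;
                have_first = true;
            }
            else
            {
                // Connect this digest to the first one of the cluster.
                std::size_t root_id = find_root(parent, id);
                std::size_t root_first = find_root(parent, first);
                parent[root_id] = root_first;
            }
        }
    }
    // Group digests by the root of their set.
    std::unordered_map<std::size_t, Cluster> groups;
    for (std::size_t id = 0; id < digests.size(); id++)
        groups[find_root(parent, id)].push_back(digests[id]);
    // Normalize: sort members, then sort the clusters themselves.
    std::vector<Cluster> output;
    for (auto& entry : groups)
    {
        std::sort(entry.second.begin(), entry.second.end());
        output.push_back(std::move(entry.second));
    }
    std::sort(output.begin(), output.end());
    return output;
}
```

Feeding it the three clusters from the `combine-clusters` example
(`A B`, `C D E`, `B C`) gives one cluster holding all five elements.
Because the output is sorted, two runs over the same data produce
identical lists regardless of input order, which makes cluster files
easy to compare.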


Usage
------

```
# threshold = 79, number of threads = 8
# (the input file names come first; options follow them)

# Make cluster list:
$ fast-ssdeep-clus all.ssdeep -t 79 -n 8 | sort-clusters >all.clusters.79.txt

# Make clusters gradually:
# assuming that all.ssdeep and delta.ssdeep do not intersect (have no common elements)
$ fast-ssdeep-clus delta.ssdeep -t 79 -n 8 >delta.clusters.79.0
$ fast-combine-ssdeep-clus delta.ssdeep all.ssdeep -t 79 -n 8 >delta.clusters.79.1
$ cat all.clusters.79.txt delta.clusters.79.0 delta.clusters.79.1 \
    | combine-clusters | sort-clusters >all.clusters.79.txt.new
$ mv all.clusters.79.txt.new all.clusters.79.txt
$ cat delta.ssdeep >>all.ssdeep
```


License
--------

This program itself is released under the terms of the ISC License.
However, ffuzzy++ is released under the terms of the GNU General Public
License (GNU GPL). See the license files of ffuzzy++ for details.

--------------------------------------------------------------------------------
/combine-clusters.cpp:
--------------------------------------------------------------------------------
1 | /* 2 | 3 | fast-ssdeep-clus : Parallel ssdeep clustering kit 4 | 5 | combine-clusters.cpp 6 | Cluster combining program 7 | 8 | Copyright (C) 2017 Tsukasa OI 9 | 10 | 11 | Permission to use, copy, modify, and/or distribute this software for 12 | any purpose with or without fee is hereby granted, provided that the 13 | above copyright notice and this permission notice appear in all copies. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES 16 | WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF 17 | MERCHANTABILITY AND FITNESS.
IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR 18 | ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES 19 | WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN 20 | ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF 21 | OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. 22 | 23 | */ 24 | /* 25 | Note: 26 | This program skips certain error checks. 27 | So, this program is not suited to accept arbitrary file. 28 | */ 29 | 30 | #include 31 | #include 32 | #include 33 | 34 | #include 35 | #include 36 | #include 37 | #include 38 | #include 39 | #include 40 | 41 | using namespace std; 42 | 43 | 44 | struct pstr_hash_type 45 | { 46 | size_t operator()(const std::string* str) const 47 | { 48 | return std::hash()(*str); 49 | } 50 | }; 51 | 52 | struct pstr_pred_type 53 | { 54 | bool operator()(const std::string* str1, const std::string* str2) const 55 | { 56 | return *str1 == *str2; 57 | } 58 | }; 59 | 60 | 61 | int main(int argc, char** argv) 62 | { 63 | // We are no longer interested to the details of digest string. 64 | // This program handles digest by simple string. 
65 | bool print_progress = true; 66 | unsigned long long interval = 1000; 67 | unsigned long long cluster_count = 0; 68 | const char* comment = nullptr; 69 | auto t_start = chrono::system_clock::now(); 70 | // Read arguments 71 | try 72 | { 73 | for (int i = 1; i < argc; i++) 74 | { 75 | size_t idx; 76 | char* arg; 77 | if (!strcmp(argv[i], "-i")) 78 | { 79 | if (++i == argc) 80 | { 81 | fprintf(stderr, "error: specify actual interval.\n"); 82 | return 1; 83 | } 84 | arg = argv[i]; 85 | interval = stoull(arg, &idx, 10); 86 | if (arg[idx]) 87 | throw invalid_argument("cannot parse interval value"); 88 | if (interval < 1) 89 | throw out_of_range("check interval is out of range"); 90 | } 91 | else if (!strcmp(argv[i], "-c")) 92 | { 93 | if (++i == argc) 94 | { 95 | fprintf(stderr, "error: specify actual comment.\n"); 96 | return 1; 97 | } 98 | comment = argv[i]; 99 | } 100 | else if (!strcmp(argv[i], "-np")) 101 | { 102 | print_progress = false; 103 | } 104 | else 105 | { 106 | fprintf(stderr, "error: invalid option is given.\n"); 107 | return 1; 108 | } 109 | } 110 | if (!comment) 111 | comment = "combining"; 112 | } 113 | catch (invalid_argument) 114 | { 115 | fprintf(stderr, "error: invalid argument is given.\n"); 116 | return 1; 117 | } 118 | catch (out_of_range) 119 | { 120 | fprintf(stderr, "error: out of range argument is given.\n"); 121 | return 1; 122 | } 123 | 124 | unordered_set all_values; 125 | unordered_map*> cluster_map; 126 | unordered_set*> clusters; 127 | unordered_set* current = new unordered_set(); 128 | clusters.insert(current); 129 | 130 | // Read clusters 131 | string ln; 132 | while (getline(cin, ln)) 133 | { 134 | if (!ln.size()) 135 | { 136 | // End of the cluster 137 | if (current->size()) 138 | { 139 | current = new unordered_set(); 140 | clusters.insert(current); 141 | } 142 | if (print_progress && (++cluster_count % interval == 0)) 143 | { 144 | auto t_current = chrono::system_clock::now(); 145 | auto t_delta = 
chrono::duration_cast(t_current - t_start); 146 | unsigned long long t_count = static_cast(t_delta.count()); 147 | fprintf(stderr, 148 | "%5llu:%02llu:%02llu %12llu [%s]\n", 149 | (unsigned long long)(t_count / 3600u), 150 | (unsigned long long)((t_count / 60u) % 60u), 151 | (unsigned long long)(t_count % 60u), 152 | cluster_count, comment); 153 | } 154 | } 155 | else 156 | { 157 | auto p = all_values.find(&ln); 158 | if (p == all_values.end()) 159 | { 160 | // New digest (does not combine) 161 | string* newstr = new string(ln); 162 | all_values.insert(newstr); 163 | current->insert(newstr); 164 | cluster_map[newstr] = current; 165 | } 166 | else 167 | { 168 | // Existing digest (combine) 169 | auto cluster_to_merge = cluster_map[*p]; 170 | if (cluster_to_merge != current) 171 | { 172 | if (current->size() > cluster_to_merge->size()) 173 | { 174 | // Preserve bigger cluster for efficiency 175 | swap(current, cluster_to_merge); 176 | } 177 | for (auto q : *current) 178 | { 179 | cluster_to_merge->insert(q); 180 | cluster_map[q] = cluster_to_merge; 181 | } 182 | clusters.erase(current); 183 | delete current; 184 | current = cluster_to_merge; 185 | } 186 | } 187 | } 188 | } 189 | 190 | for (auto p : clusters) 191 | { 192 | if (!p->size()) 193 | continue; 194 | for (auto q : *p) 195 | puts(q->c_str()); 196 | puts(""); 197 | } 198 | 199 | if (print_progress) 200 | { 201 | auto t_current = chrono::system_clock::now(); 202 | auto t_delta = chrono::duration_cast(t_current - t_start); 203 | unsigned long long t_count = static_cast(t_delta.count()); 204 | fprintf(stderr, 205 | "%5llu:%02llu:%02llu %12llu [%s]\n", 206 | (unsigned long long)(t_count / 3600u), 207 | (unsigned long long)((t_count / 60u) % 60u), 208 | (unsigned long long)(t_count % 60u), 209 | cluster_count, comment); 210 | } 211 | 212 | return 0; 213 | } 214 | -------------------------------------------------------------------------------- /fast-combine-ssdeep-clus.cpp: 
-------------------------------------------------------------------------------- 1 | /* 2 | 3 | fast-ssdeep-clus : Parallel ssdeep clustering kit 4 | 5 | fast-combine-ssdeep-clus.cpp 6 | Simple combining clustering program 7 | 8 | Copyright (C) 2017 Tsukasa OI 9 | 10 | 11 | Permission to use, copy, modify, and/or distribute this software for 12 | any purpose with or without fee is hereby granted, provided that the 13 | above copyright notice and this permission notice appear in all copies. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES 16 | WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF 17 | MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR 18 | ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES 19 | WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN 20 | ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF 21 | OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. 22 | 23 | */ 24 | /* 25 | Note: 26 | This program skips certain error checks. 27 | So, this program is not suited to accept arbitrary file. 
28 | */ 29 | 30 | #include 31 | #include 32 | #include 33 | 34 | #include 35 | #include 36 | #include 37 | #include 38 | #include 39 | #include 40 | #include 41 | #include 42 | #include 43 | #include 44 | #include 45 | 46 | //#define FFUZZYPP_DEBUG 47 | #include 48 | 49 | using namespace std; 50 | using namespace ffuzzy; 51 | 52 | 53 | static constexpr const comparison_version COMPARISON_VERSION = comparison_version::latest; 54 | 55 | #define SSDEEP_THRESHOLD 79 56 | #define SSDEEP_THREADS 1 57 | #define SSDEEP_PROGINTV 1 58 | 59 | 60 | struct filesig_t 61 | { 62 | digest_ra_t ndigest; 63 | digest_ra_unorm_t udigest; 64 | atomic_size_t cluster_no; 65 | public: 66 | filesig_t(void) : ndigest(), udigest(), cluster_no(0) {} 67 | filesig_t& operator=(const filesig_t& other) 68 | { 69 | ndigest = other.ndigest; 70 | udigest = other.udigest; 71 | cluster_no.store(other.cluster_no.load()); 72 | return *this; 73 | } 74 | filesig_t(const filesig_t& other) 75 | { 76 | *this = other; 77 | } 78 | }; 79 | 80 | static inline bool sort_filesig_cluster(const filesig_t& a, const filesig_t& b) 81 | { 82 | return a.cluster_no < b.cluster_no; 83 | } 84 | 85 | 86 | static_assert( 87 | digest_blocksize::number_of_blockhashes < numeric_limits::max(), 88 | "digest_blocksize::number_of_blockhashes must be less than max(size_t)."); 89 | static constexpr const size_t blocksize_upper = size_t(digest_blocksize::number_of_blockhashes) + 1; 90 | 91 | static size_t nthreads = SSDEEP_THREADS; 92 | static filesig_t* filesigs; 93 | static size_t filesigs_size1; 94 | static size_t filesigs_size2; 95 | static size_t filesigs_index2[blocksize_upper]; 96 | static atomic_flag filesigs_wspin = ATOMIC_FLAG_INIT; 97 | static int threshold = SSDEEP_THRESHOLD; 98 | 99 | static atomic_size_t cluster_to_allocate; 100 | static atomic_size_t progress_next; 101 | static atomic_size_t progress_finished; 102 | 103 | 104 | static void usage(const char *prog) 105 | { 106 | fprintf(stderr, 107 | "Usage: %s 
SSDEEP_TO_ADD SSDEEP_ORIGINAL [-t THRESHOLD] [-n THREADS] [-c COMMENT]\n", 108 | prog ? prog : "fast-combine-ssdeep-clus"); 109 | exit(1); 110 | } 111 | 112 | static void cluster_main(void); 113 | 114 | static bool read_digests(set& udigests, const char* filename) 115 | { 116 | try 117 | { 118 | ifstream fin; 119 | string str; 120 | fin.open(filename, ios::in); 121 | if (fin.fail()) 122 | { 123 | fprintf(stderr, "error: failed to open file.\n"); 124 | return false; 125 | } 126 | while (getline(fin, str)) 127 | { 128 | digest_ra_unorm_t digest(str); 129 | if (!digest.is_natural()) 130 | { 131 | fprintf(stderr, "error: parsed digest is not natural.\n"); 132 | return false; 133 | } 134 | udigests.insert(digest); 135 | } 136 | } 137 | catch (digest_parse_error) 138 | { 139 | fprintf(stderr, "error: cannot parse digest.\n"); 140 | return false; 141 | } 142 | return true; 143 | } 144 | 145 | static void construct_blocksize_index(size_t* index, size_t i0, size_t i1) 146 | { 147 | if (i0 == i1) 148 | { 149 | for (size_t k = 0; k < blocksize_upper; k++) 150 | index[k] = i1; 151 | return; 152 | } 153 | unsigned p = digest_blocksize::natural_to_index(filesigs[i0].ndigest.blocksize()); 154 | for (unsigned k = 0; k <= p; k++) 155 | index[k] = i0; 156 | for (size_t i = i0 + 1; i < i1; i++) 157 | { 158 | unsigned q = digest_blocksize::natural_to_index(filesigs[i].ndigest.blocksize()); 159 | if (p != q) 160 | { 161 | for (unsigned k = p + 1; k <= q; k++) 162 | index[k] = i; 163 | p = q; 164 | } 165 | } 166 | for (unsigned k = p + 1; k <= blocksize_upper; k++) 167 | index[k] = i1; 168 | } 169 | 170 | int main(int argc, char** argv) 171 | { 172 | bool print_progress = true; 173 | int interval = SSDEEP_PROGINTV; 174 | char* comment = nullptr; 175 | auto t_start = chrono::system_clock::now(); 176 | // Read arguments 177 | if (argc < 3) 178 | usage(argv[0]); 179 | try 180 | { 181 | for (int i = 3; i < argc; i++) 182 | { 183 | size_t idx; 184 | char* arg; 185 | if (!strcmp(argv[i], 
"-t")) 186 | { 187 | if (++i == argc) 188 | usage(argv[0]); 189 | arg = argv[i]; 190 | threshold = stoi(arg, &idx, 10); 191 | if (arg[idx]) 192 | usage(argv[0]); 193 | if (threshold > 99 || threshold < 0) 194 | throw out_of_range("threshold is out of range"); 195 | } 196 | else if (!strcmp(argv[i], "-n")) 197 | { 198 | if (++i == argc) 199 | usage(argv[0]); 200 | arg = argv[i]; 201 | int n_threads = stoi(arg, &idx, 10); 202 | if (arg[idx]) 203 | usage(argv[0]); 204 | if (n_threads < 1 || n_threads > numeric_limits::max()) 205 | throw out_of_range("number of threads is out of range"); 206 | nthreads = n_threads; 207 | } 208 | else if (!strcmp(argv[i], "-i")) 209 | { 210 | if (++i == argc) 211 | usage(argv[0]); 212 | arg = argv[i]; 213 | interval = stoi(arg, &idx, 10); 214 | if (arg[idx]) 215 | usage(argv[0]); 216 | if (interval < 1) 217 | throw out_of_range("check interval is out of range"); 218 | } 219 | else if (!strcmp(argv[i], "-c")) 220 | { 221 | if (++i == argc) 222 | usage(argv[0]); 223 | comment = argv[i]; 224 | } 225 | else if (!strcmp(argv[i], "-np")) 226 | { 227 | print_progress = false; 228 | } 229 | else 230 | { 231 | usage(argv[0]); 232 | } 233 | } 234 | } 235 | catch (invalid_argument) 236 | { 237 | fprintf(stderr, "error: invalid argument is given.\n"); 238 | return 1; 239 | } 240 | catch (out_of_range) 241 | { 242 | fprintf(stderr, "error: out of range argument is given.\n"); 243 | return 1; 244 | } 245 | // Read file and preprocess 246 | { 247 | set udigests1; 248 | set udigests2; 249 | // Read digest file 250 | { 251 | set udigests1_tmp; 252 | if (!read_digests(udigests1_tmp, argv[1])) 253 | return 1; 254 | if (!read_digests(udigests2, argv[2])) 255 | return 1; 256 | // udigests1 = udigests1_tmp - udigests2 257 | set_difference( 258 | udigests1_tmp.begin(), udigests1_tmp.end(), 259 | udigests2.begin(), udigests2.end(), 260 | inserter(udigests1, udigests1.end())); 261 | } 262 | // Preprocess digest and prepare for clustering 263 | if 
(!udigests1.size()) 264 | return 0; // no clusters to make 265 | filesigs_size1 = udigests1.size(); 266 | filesigs_size2 = udigests2.size(); 267 | // Check size overflow 268 | if (size_t(filesigs_size1 + filesigs_size2) < filesigs_size1) 269 | { 270 | // filesigs_size1 + filesigs_size2 (original) must not overflow. 271 | fprintf(stderr, "error: too much signatures to match.\n"); 272 | return 1; 273 | } 274 | filesigs_size2 += filesigs_size1; 275 | if (size_t(filesigs_size2 + nthreads) < filesigs_size2) 276 | { 277 | // filesigs_size2 (combined with size1) + nthreads must not overflow. 278 | fprintf(stderr, "error: too much signatures or threads.\n"); 279 | return 1; 280 | } 281 | // Construct signature database 282 | filesigs = new filesig_t[filesigs_size2]; 283 | { 284 | size_t i = 0; 285 | for (auto p = udigests1.begin(); p != udigests1.end(); p++, i++) 286 | { 287 | filesigs[i].udigest = *p; 288 | digest_ra_t::normalize(filesigs[i].ndigest, *p); 289 | filesigs[i].cluster_no.store(0); 290 | } 291 | for (auto p = udigests2.begin(); p != udigests2.end(); p++, i++) 292 | { 293 | filesigs[i].udigest = *p; 294 | digest_ra_t::normalize(filesigs[i].ndigest, *p); 295 | filesigs[i].cluster_no.store(0); 296 | } 297 | } 298 | // Construct block size index 299 | construct_blocksize_index(filesigs_index2, filesigs_size1, filesigs_size2); 300 | } 301 | // Initialize multi-threading environment 302 | cluster_to_allocate = 1; 303 | progress_next = 0; 304 | progress_finished = 0; 305 | vector threads; 306 | for (size_t i = 0; i < nthreads; i++) 307 | threads.push_back(thread(cluster_main)); 308 | // Wait for completion 309 | while (true) 310 | { 311 | size_t progress = progress_finished.load(); 312 | if (print_progress) 313 | { 314 | auto t_current = chrono::system_clock::now(); 315 | auto t_delta = chrono::duration_cast(t_current - t_start); 316 | unsigned long long t_count = static_cast(t_delta.count()); 317 | if (comment) 318 | { 319 | fprintf(stderr, 320 | 
"%5llu:%02llu:%02llu %12llu [(threshold=%d) %s]\n", 321 | (unsigned long long)(t_count / 3600u), 322 | (unsigned long long)((t_count / 60u) % 60u), 323 | (unsigned long long)(t_count % 60u), 324 | (unsigned long long)progress, 325 | threshold, comment); 326 | } 327 | else 328 | { 329 | fprintf(stderr, 330 | "%5llu:%02llu:%02llu %12llu [(threshold=%d) %s -> %s]\n", 331 | (unsigned long long)(t_count / 3600u), 332 | (unsigned long long)((t_count / 60u) % 60u), 333 | (unsigned long long)(t_count % 60u), 334 | (unsigned long long)progress, 335 | threshold, argv[1], argv[2]); 336 | } 337 | } 338 | if (progress == filesigs_size1) 339 | break; 340 | this_thread::sleep_for(chrono::seconds(interval)); 341 | } 342 | // Join all threads 343 | for (size_t i = 0; i < nthreads; i++) 344 | threads[i].join(); 345 | // Finalization 346 | sort(filesigs, filesigs + filesigs_size2, sort_filesig_cluster); 347 | for (size_t i = 0, c0 = 0; i < filesigs_size2; i++) 348 | { 349 | char buf[digest_ra_unorm_t::max_natural_chars]; 350 | size_t c1 = filesigs[i].cluster_no.load(); 351 | if (!c1) 352 | continue; 353 | if (c0 != c1 && c0) 354 | puts(""); 355 | c0 = c1; 356 | filesigs[i].udigest.pretty_unsafe(buf); 357 | puts(buf); 358 | if (i == filesigs_size2 - 1) 359 | puts(""); 360 | } 361 | if (print_progress) 362 | { 363 | auto t_current = chrono::system_clock::now(); 364 | auto t_delta = chrono::duration_cast(t_current - t_start); 365 | unsigned long long t_count = static_cast(t_delta.count()); 366 | if (comment) 367 | { 368 | fprintf(stderr, 369 | "%5llu:%02llu:%02llu %12llu [(threshold=%d) %s]\n", 370 | (unsigned long long)(t_count / 3600u), 371 | (unsigned long long)((t_count / 60u) % 60u), 372 | (unsigned long long)(t_count % 60u), 373 | (unsigned long long)filesigs_size1, 374 | threshold, comment); 375 | } 376 | else 377 | { 378 | fprintf(stderr, 379 | "%5llu:%02llu:%02llu %12llu [(threshold=%d) %s -> %s]\n", 380 | (unsigned long long)(t_count / 3600u), 381 | (unsigned long 
long)((t_count / 60u) % 60u), 382 | (unsigned long long)(t_count % 60u), 383 | (unsigned long long)filesigs_size1, 384 | threshold, argv[1], argv[2]); 385 | } 386 | } 387 | // Quit 388 | return 0; 389 | } 390 | 391 | static inline void lock_spin(atomic_flag& spinlock) 392 | { 393 | while (spinlock.test_and_set(memory_order_acquire)) {} 394 | } 395 | 396 | static inline void unlock_spin(atomic_flag& spinlock) 397 | { 398 | spinlock.clear(memory_order_release); 399 | } 400 | 401 | static void cluster(size_t idxA, size_t idxB) 402 | { 403 | lock_spin(filesigs_wspin); 404 | filesig_t* sigA = &filesigs[idxA]; 405 | filesig_t* sigB = &filesigs[idxB]; 406 | size_t clusA = sigA->cluster_no.load(); 407 | size_t clusB = sigB->cluster_no.load(); 408 | if (!clusA && !clusB) 409 | { 410 | size_t clusC = cluster_to_allocate++; 411 | sigA->cluster_no.store(clusC); 412 | sigB->cluster_no.store(clusC); 413 | } 414 | else if ( clusA && !clusB) 415 | { 416 | sigB->cluster_no.store(clusA); 417 | } 418 | else if (!clusA && clusB) 419 | { 420 | sigA->cluster_no.store(clusB); 421 | } 422 | else if (clusA != clusB) 423 | { 424 | for (size_t k = 0; k < filesigs_size2; k++) 425 | { 426 | if (filesigs[k].cluster_no.load() == clusA) 427 | filesigs[k].cluster_no.store(clusB); 428 | } 429 | } 430 | unlock_spin(filesigs_wspin); 431 | } 432 | 433 | static void cluster_main(void) 434 | { 435 | // cluster 436 | for (size_t idxA; (idxA = progress_next++) < filesigs_size1;) 437 | { 438 | filesig_t* sigA = &filesigs[idxA]; 439 | digest_blocksize_t blocksizeA = sigA->ndigest.blocksize(); 440 | unsigned bindexA = digest_blocksize::natural_to_index(blocksizeA); 441 | digest_position_array_t digestA(sigA->ndigest); 442 | size_t indexB0 = 0; 443 | size_t indexB1 = filesigs_index2[bindexA]; 444 | size_t indexB2 = filesigs_index2[bindexA + 1]; 445 | size_t indexB3 = 0; 446 | if (bindexA != 0) 447 | { 448 | indexB0 = filesigs_index2[bindexA - 1]; 449 | for (size_t idxB = indexB0; idxB < indexB1; idxB++) 450 | 
{ 451 | filesig_t* sigB = &filesigs[idxB]; 452 | // cluster 453 | size_t clusA = sigA->cluster_no.load(); 454 | size_t clusB = sigB->cluster_no.load(); 455 | if (clusB && clusA == clusB) 456 | continue; 457 | auto score = digestA.compare_near_gt(sigB->ndigest); 458 | if (score > threshold) 459 | cluster(idxA, idxB); 460 | } 461 | } 462 | if (true) 463 | { 464 | for (size_t idxB = indexB1; idxB < indexB2; idxB++) 465 | { 466 | filesig_t* sigB = &filesigs[idxB]; 467 | // cluster 468 | size_t clusA = sigA->cluster_no.load(); 469 | size_t clusB = sigB->cluster_no.load(); 470 | if (clusB && clusA == clusB) 471 | continue; 472 | auto score = digestA.compare_near_eq(sigB->ndigest); 473 | if (score > threshold) 474 | cluster(idxA, idxB); 475 | } 476 | } 477 | if (bindexA != digest_blocksize::number_of_blockhashes - 1) 478 | { 479 | indexB3 = filesigs_index2[bindexA + 2]; 480 | for (size_t idxB = indexB2; idxB < indexB3; idxB++) 481 | { 482 | filesig_t* sigB = &filesigs[idxB]; 483 | // cluster 484 | size_t clusA = sigA->cluster_no.load(); 485 | size_t clusB = sigB->cluster_no.load(); 486 | if (clusB && clusA == clusB) 487 | continue; 488 | auto score = digestA.compare_near_lt(sigB->ndigest); 489 | if (score > threshold) 490 | cluster(idxA, idxB); 491 | } 492 | } 493 | ++progress_finished; 494 | } 495 | } 496 | -------------------------------------------------------------------------------- /fast-ssdeep-clus.cpp: -------------------------------------------------------------------------------- 1 | /* 2 | 3 | fast-ssdeep-clus : Parallel ssdeep clustering kit 4 | 5 | fast-ssdeep-clus.cpp 6 | Simple clustering program 7 | 8 | Copyright (C) 2017 Tsukasa OI 9 | 10 | 11 | Permission to use, copy, modify, and/or distribute this software for 12 | any purpose with or without fee is hereby granted, provided that the 13 | above copyright notice and this permission notice appear in all copies. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES 16 | WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF 17 | MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR 18 | ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES 19 | WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN 20 | ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF 21 | OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. 22 | 23 | */ 24 | /* 25 | Note: 26 | This program skips certain error checks. 27 | So, this program is not suited to accept arbitrary file. 28 | */ 29 | 30 | #include 31 | #include 32 | #include 33 | 34 | #include 35 | #include 36 | #include 37 | #include 38 | #include 39 | #include 40 | #include 41 | #include 42 | 43 | //#define FFUZZYPP_DEBUG 44 | #include 45 | 46 | using namespace std; 47 | using namespace ffuzzy; 48 | 49 | 50 | static constexpr const comparison_version COMPARISON_VERSION = comparison_version::latest; 51 | 52 | #define SSDEEP_THRESHOLD 79 53 | #define SSDEEP_THREADS 1 54 | #define SSDEEP_PROGINTV 1 55 | 56 | 57 | struct filesig_t 58 | { 59 | digest_ra_t ndigest; 60 | digest_ra_unorm_t udigest; 61 | atomic_size_t cluster_no; 62 | public: 63 | filesig_t(void) : ndigest(), udigest(), cluster_no(0) {} 64 | filesig_t& operator=(const filesig_t& other) 65 | { 66 | ndigest = other.ndigest; 67 | udigest = other.udigest; 68 | // atomic is not copy copy constructible by default. 
69 | cluster_no.store(other.cluster_no.load()); 70 | return *this; 71 | } 72 | filesig_t(const filesig_t& other) 73 | { 74 | *this = other; 75 | } 76 | }; 77 | 78 | static inline bool sort_filesig_cluster(const filesig_t& a, const filesig_t& b) 79 | { 80 | return a.cluster_no < b.cluster_no; 81 | } 82 | 83 | 84 | static size_t nthreads = SSDEEP_THREADS; 85 | static filesig_t* filesigs; 86 | static size_t filesigs_size; 87 | static atomic_flag filesigs_wspin = ATOMIC_FLAG_INIT; 88 | static int threshold = SSDEEP_THRESHOLD; 89 | 90 | static atomic_size_t cluster_to_allocate; 91 | static atomic_size_t progress_next; 92 | static atomic_size_t progress_finished; 93 | 94 | 95 | static void usage(const char *prog) 96 | { 97 | fprintf(stderr, 98 | "Usage: %s SSDEEP_LIST [-t THRESHOLD] [-n THREADS] [-c COMMENT]\n", 99 | prog ? prog : "fast-ssdeep-clus"); 100 | exit(1); 101 | } 102 | 103 | static void cluster_main(void); 104 | 105 | static bool read_digests(set& udigests, const char* filename) 106 | { 107 | try 108 | { 109 | ifstream fin; 110 | string str; 111 | fin.open(filename, ios::in); 112 | if (fin.fail()) 113 | { 114 | fprintf(stderr, "error: failed to open file.\n"); 115 | return false; 116 | } 117 | while (getline(fin, str)) 118 | { 119 | digest_ra_unorm_t digest(str); 120 | if (!digest.is_natural()) 121 | { 122 | fprintf(stderr, "error: parsed digest is not natural.\n"); 123 | return false; 124 | } 125 | udigests.insert(digest); 126 | } 127 | } 128 | catch (digest_parse_error) 129 | { 130 | fprintf(stderr, "error: cannot parse digest.\n"); 131 | return false; 132 | } 133 | return true; 134 | } 135 | 136 | int main(int argc, char** argv) 137 | { 138 | bool print_progress = true; 139 | int interval = SSDEEP_PROGINTV; 140 | char* comment; 141 | auto t_start = chrono::system_clock::now(); 142 | // Read arguments 143 | if (argc < 2) 144 | usage(argv[0]); 145 | comment = argv[1]; 146 | try 147 | { 148 | for (int i = 2; i < argc; i++) 149 | { 150 | size_t idx; 151 | char* 
arg; 152 | if (!strcmp(argv[i], "-t")) 153 | { 154 | if (++i == argc) 155 | usage(argv[0]); 156 | arg = argv[i]; 157 | threshold = stoi(arg, &idx, 10); 158 | if (arg[idx]) 159 | usage(argv[0]); 160 | if (threshold > 99 || threshold < 0) 161 | throw out_of_range("threshold is out of range"); 162 | } 163 | else if (!strcmp(argv[i], "-n")) 164 | { 165 | if (++i == argc) 166 | usage(argv[0]); 167 | arg = argv[i]; 168 | int n_threads = stoi(arg, &idx, 10); 169 | if (arg[idx]) 170 | usage(argv[0]); 171 | if (n_threads < 1 || n_threads > numeric_limits::max()) 172 | throw out_of_range("number of threads is out of range"); 173 | nthreads = n_threads; 174 | } 175 | else if (!strcmp(argv[i], "-i")) 176 | { 177 | if (++i == argc) 178 | usage(argv[0]); 179 | arg = argv[i]; 180 | interval = stoi(arg, &idx, 10); 181 | if (arg[idx]) 182 | usage(argv[0]); 183 | if (interval < 1) 184 | throw out_of_range("check interval is out of range"); 185 | } 186 | else if (!strcmp(argv[i], "-c")) 187 | { 188 | if (++i == argc) 189 | usage(argv[0]); 190 | comment = argv[i]; 191 | } 192 | else if (!strcmp(argv[i], "-np")) 193 | { 194 | print_progress = false; 195 | } 196 | else 197 | { 198 | usage(argv[0]); 199 | } 200 | } 201 | } 202 | catch (invalid_argument) 203 | { 204 | fprintf(stderr, "error: invalid argument is given.\n"); 205 | return 1; 206 | } 207 | catch (out_of_range) 208 | { 209 | fprintf(stderr, "error: out of range argument is given.\n"); 210 | return 1; 211 | } 212 | // Read file and preprocess 213 | { 214 | set udigests; 215 | // Read digest file 216 | if (!read_digests(udigests, argv[1])) 217 | return 1; 218 | // Preprocess digest and prepare for clustering 219 | if (!udigests.size()) 220 | return 0; // no clusters to make 221 | filesigs_size = udigests.size(); 222 | if (size_t(filesigs_size + nthreads) < filesigs_size) 223 | { 224 | // filesigs_size + nthreads must not overflow. 
225 | fprintf(stderr, "error: too much signatures or threads.\n"); 226 | return 1; 227 | } 228 | filesigs = new filesig_t[filesigs_size]; 229 | size_t i = 0; 230 | for (auto p = udigests.begin(); p != udigests.end(); p++, i++) 231 | { 232 | filesigs[i].udigest = *p; 233 | digest_ra_t::normalize(filesigs[i].ndigest, *p); 234 | filesigs[i].cluster_no.store(0); 235 | } 236 | } 237 | // Initialize multi-threading environment 238 | cluster_to_allocate = 1; 239 | progress_next = 0; 240 | progress_finished = 0; 241 | vector threads; 242 | for (size_t i = 0; i < nthreads; i++) 243 | threads.push_back(thread(cluster_main)); 244 | // Wait for completion 245 | while (true) 246 | { 247 | size_t progress = progress_finished.load(); 248 | if (print_progress) 249 | { 250 | auto t_current = chrono::system_clock::now(); 251 | auto t_delta = chrono::duration_cast(t_current - t_start); 252 | unsigned long long t_count = static_cast(t_delta.count()); 253 | fprintf(stderr, 254 | "%5llu:%02llu:%02llu %12llu [(threshold=%d) %s]\n", 255 | (unsigned long long)(t_count / 3600u), 256 | (unsigned long long)((t_count / 60u) % 60u), 257 | (unsigned long long)(t_count % 60u), 258 | (unsigned long long)progress, 259 | threshold, comment); 260 | } 261 | if (progress == filesigs_size) 262 | break; 263 | this_thread::sleep_for(chrono::seconds(interval)); 264 | } 265 | // Join all threads 266 | for (size_t i = 0; i < nthreads; i++) 267 | threads[i].join(); 268 | // Finalization 269 | sort(filesigs, filesigs + filesigs_size, sort_filesig_cluster); 270 | for (size_t i = 0, c0 = 0; i < filesigs_size; i++) 271 | { 272 | char buf[digest_ra_unorm_t::max_natural_chars]; 273 | size_t c1 = filesigs[i].cluster_no.load(); 274 | if (!c1) 275 | continue; 276 | if (c0 != c1 && c0) 277 | puts(""); 278 | c0 = c1; 279 | filesigs[i].udigest.pretty_unsafe(buf); 280 | puts(buf); 281 | if (i == filesigs_size - 1) 282 | puts(""); 283 | } 284 | if (print_progress) 285 | { 286 | auto t_current = chrono::system_clock::now(); 
        auto t_delta = chrono::duration_cast<chrono::seconds>(t_current - t_start);
        unsigned long long t_count = static_cast<unsigned long long>(t_delta.count());
        fprintf(stderr,
            "%5llu:%02llu:%02llu %12llu [(threshold=%d) %s]\n",
            (unsigned long long)(t_count / 3600u),
            (unsigned long long)((t_count / 60u) % 60u),
            (unsigned long long)(t_count % 60u),
            (unsigned long long)filesigs_size,
            threshold, comment);
    }
    // Quit
    return 0;
}

static inline void lock_spin(atomic_flag& spinlock)
{
    while (spinlock.test_and_set(memory_order_acquire)) {}
}

static inline void unlock_spin(atomic_flag& spinlock)
{
    spinlock.clear(memory_order_release);
}

static void cluster(size_t idxA, size_t idxB)
{
    lock_spin(filesigs_wspin);
    filesig_t* sigA = &filesigs[idxA];
    filesig_t* sigB = &filesigs[idxB];
    size_t clusA = sigA->cluster_no.load();
    size_t clusB = sigB->cluster_no.load();
    if (!clusA && !clusB)
    {
        size_t clusC = cluster_to_allocate++;
        sigA->cluster_no.store(clusC);
        sigB->cluster_no.store(clusC);
    }
    else if ( clusA && !clusB)
    {
        sigB->cluster_no.store(clusA);
    }
    else if (!clusA &&  clusB)
    {
        sigA->cluster_no.store(clusB);
    }
    else if (clusA != clusB)
    {
        for (size_t k = 0; k < filesigs_size; k++)
        {
            if (filesigs[k].cluster_no.load() == clusA)
                filesigs[k].cluster_no.store(clusB);
        }
    }
    unlock_spin(filesigs_wspin);
}

static void cluster_main(void)
{
    // cluster
    for (size_t idxA; (idxA = progress_next++) < filesigs_size;)
    {
        filesig_t* sigA = &filesigs[idxA];
        digest_blocksize_t blocksizeA = sigA->ndigest.blocksize();
        digest_position_array_t digestA(sigA->ndigest);
        size_t idxB = idxA + 1;
        for (; idxB < filesigs_size; idxB++)
        {
            filesig_t* sigB =
                &filesigs[idxB];
            if (blocksizeA != sigB->ndigest.blocksize())
                break;
            // cluster
            size_t clusA = sigA->cluster_no.load();
            size_t clusB = sigB->cluster_no.load();
            if (clusB && clusA == clusB)
                continue;
            auto score = digestA.compare_near_eq(sigB->ndigest);
            if (score > threshold)
                cluster(idxA, idxB);
        }
        if (!digest_blocksize::is_safe_to_double(blocksizeA))
        {
            // Count this entry as finished before skipping the second pass;
            // otherwise the wait loop in main() never sees
            // progress_finished reach filesigs_size.
            ++progress_finished;
            continue;
        }
        digest_blocksize_t blocksizeB = blocksizeA * 2;
        for (; idxB < filesigs_size; idxB++)
        {
            filesig_t* sigB = &filesigs[idxB];
            if (blocksizeB != sigB->ndigest.blocksize())
                break;
            // cluster
            size_t clusA = sigA->cluster_no.load();
            size_t clusB = sigB->cluster_no.load();
            if (clusB && clusA == clusB)
                continue;
            auto score = digestA.compare_near_lt(sigB->ndigest);
            if (score > threshold)
                cluster(idxA, idxB);
        }
        ++progress_finished;
    }
}

--------------------------------------------------------------------------------
/sort-clusters.cpp:
--------------------------------------------------------------------------------
/*

    fast-ssdeep-clus : Parallel ssdeep clustering kit

    sort-clusters.cpp
    Cluster sorting program

    Copyright (C) 2017 Tsukasa OI


    Permission to use, copy, modify, and/or distribute this software for
    any purpose with or without fee is hereby granted, provided that the
    above copyright notice and this permission notice appear in all copies.

    THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
    WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
    MERCHANTABILITY AND FITNESS.
    IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
    ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
    WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
    ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
    OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

*/
/*
    Note:
    This program skips certain error checks.
    So, this program is not suited to accept arbitrary file.
*/

#include <cstdio>
#include <cstdlib>
#include <cstring>

#include <limits>
#include <list>

#include <ffuzzy.hpp>

using namespace std;
using namespace ffuzzy;


// using parsed digest object with comparison is slow!
typedef struct
{
    char *str;
    unsigned long block_size;
} digest_to_sort_t;

// Order by block size first, then by the digest string itself
inline bool compare_ssdeep_hashes(
    const digest_to_sort_t& d1,
    const digest_to_sort_t& d2
)
{
    if (d1.block_size < d2.block_size)
        return true;
    if (d1.block_size > d2.block_size)
        return false;
    return strcmp(d1.str, d2.str) < 0;
}

// Order clusters by their first (smallest) element
inline bool compare_ssdeep_cluster_list(
    list<digest_to_sort_t>* c1,
    list<digest_to_sort_t>* c2
)
{
    return compare_ssdeep_hashes(
        *(c1->cbegin()),
        *(c2->cbegin())
    );
}


static_assert(digest_long_unorm_t::max_natural_chars < numeric_limits<size_t>::max(),
    "digest_long_unorm_t::max_natural_chars + 1 must be in range of size_t.");


int main(int argc, char** argv)
{
    list<list<digest_to_sort_t>*> all_clusters;
    {
        char buf[digest_long_unorm_t::max_natural_chars + 1];
        size_t buflen;
        list<digest_to_sort_t>* cluster = new list<digest_to_sort_t>();
        while (fgets(buf, digest_long_unorm_t::max_natural_chars + 1, stdin))
        {
            buflen = strlen(buf);
            if (buflen)
            {
                char *bufend = buf + buflen - 1;
                if (*bufend != '\n')
                {
                    fprintf(stderr, "error: buffer overflow.\n");
                    return 1;
                }
                *bufend = '\0';
                if
                    (buflen == 1)
                {
                    if (cluster->size())
                    {
                        cluster->sort(compare_ssdeep_hashes);
                        all_clusters.push_back(cluster);
                        cluster = new list<digest_to_sort_t>();
                    }
                }
                else
                {
                    digest_to_sort_t d;
                    d.str = new char[buflen];
                    d.block_size = strtoul(buf, nullptr, 10);
                    strcpy(d.str, buf);
                    cluster->push_back(d);
                }
            }
        }
        delete cluster;
    }
    all_clusters.sort(compare_ssdeep_cluster_list);
    for (auto p : all_clusters)
    {
        for (auto q : *p)
        {
            puts(q.str);
#if 0
            delete[] q.str;
#endif
        }
        puts("");
#if 0
        delete p;
#endif
    }
    return 0;
}

--------------------------------------------------------------------------------
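
Note on the merge step: when two existing clusters meet, `cluster()` in fast-ssdeep-clus.cpp relabels one of them by scanning every entry of `filesigs` while holding the spinlock. A disjoint-set (union-find) structure is a common alternative for this kind of single-linkage merging. The following is only a minimal single-threaded sketch of that alternative, not the method used in this repository; the class `disjoint_sets` and its members `find`, `unite` and `size_of` are hypothetical names introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper (not part of fast-ssdeep-clus):
// disjoint-set forest with path halving and union by size.
class disjoint_sets
{
    std::vector<std::size_t> parent_;
    std::vector<std::size_t> size_;

public:
    // Every element starts as its own singleton set.
    explicit disjoint_sets(std::size_t n) : parent_(n), size_(n, 1)
    {
        std::iota(parent_.begin(), parent_.end(), std::size_t(0));
    }

    // Return the representative of the set containing x.
    std::size_t find(std::size_t x)
    {
        assert(x < parent_.size());
        while (parent_[x] != x)
        {
            parent_[x] = parent_[parent_[x]]; // path halving
            x = parent_[x];
        }
        return x;
    }

    // Merge the sets containing a and b.
    // Returns false if they were already in the same set.
    bool unite(std::size_t a, std::size_t b)
    {
        a = find(a);
        b = find(b);
        if (a == b)
            return false;
        if (size_[a] < size_[b])
            std::swap(a, b);
        parent_[b] = a; // attach the smaller tree under the larger one
        size_[a] += size_[b];
        return true;
    }

    // Number of elements in the set containing x.
    std::size_t size_of(std::size_t x)
    {
        return size_[find(x)];
    }
};
```

In such a design, `unite(idxA, idxB)` would take the place of the four-way branch in `cluster()`, and an element whose `size_of` is 1 would correspond to an entry with cluster number 0 (not clustered, not printed). Making this safe for the multi-threaded comparison loop would still need a lock or a concurrent variant, which this sketch does not attempt.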