├── .gitignore ├── Cargo.toml ├── README.mkdn ├── Cargo.lock ├── LICENSE └── src └── main.rs /.gitignore: -------------------------------------------------------------------------------- 1 | /target 2 | -------------------------------------------------------------------------------- /Cargo.toml: -------------------------------------------------------------------------------- 1 | [package] 2 | name = "drupes" 3 | version = "0.1.0" 4 | edition = "2021" 5 | 6 | [dependencies] 7 | anyhow = "1.0.89" 8 | blake3 = "1.5.4" 9 | clap = { version = "4.5.20", features = ["derive", "wrap_help"] } 10 | jwalk = "0.8.1" 11 | rayon = "1.10.0" 12 | size = "0.4.1" 13 | -------------------------------------------------------------------------------- /README.mkdn: -------------------------------------------------------------------------------- 1 | # `drupes`: removes dupes 2 | 3 | This is a (relatively) simple command line tool for finding, and optionally 4 | removing, duplicate files. 5 | 6 | To install, you will need a Rust toolchain. Check out this repo and run (from 7 | the repo directory): 8 | 9 | ``` 10 | cargo install --path . 
11 | ``` 12 | 13 | Simple example (running the program against the Small Device C Compiler source 14 | code): 15 | 16 | ``` 17 | # get a summary: 18 | $ drupes ~/src/sdcc280 -m 19 | 1859 duplicate files (in 995 sets), occupying 30.2 MiB 20 | checked 16150 files in 4220 size classes 21 | 22 | # list the groups of duplicate files: 23 | $ drupes ~/src/sdcc280 24 | /home/cbiffle/src/sdcc280/sdcc/device/lib/huge/crtstart.rel 25 | /home/cbiffle/src/sdcc280/sdcc/device/lib/large/crtstart.rel 26 | /home/cbiffle/src/sdcc280/sdcc/device/lib/large-stack-auto/crtstart.rel 27 | /home/cbiffle/src/sdcc280/sdcc/device/lib/mcs51/crtstart.rel 28 | /home/cbiffle/src/sdcc280/sdcc/device/lib/medium/crtstart.rel 29 | /home/cbiffle/src/sdcc280/sdcc/device/lib/small/crtstart.rel 30 | /home/cbiffle/src/sdcc280/sdcc/device/lib/small-stack-auto/crtstart.rel 31 | 32 | /home/cbiffle/src/sdcc280/sdcc/device/lib/pdk13/atoi.sym 33 | /home/cbiffle/src/sdcc280/sdcc/device/lib/pdk14/atoi.sym 34 | # ...and so on 35 | ``` 36 | 37 | ## Command line usage 38 | 39 | ``` 40 | Finds duplicate files and optionally deletes them 41 | 42 | Usage: drupes [OPTIONS] [ROOTS]... 43 | 44 | Arguments: 45 | [ROOTS]... 
List of directories to search, recursively, for duplicate files; if 46 | omitted, the current directory is searched 47 | 48 | Options: 49 | -e, --empty Also consider empty files, which will report all empty files 50 | except one as duplicate (which is rarely what you want) 51 | -f, --omit-first Don't print the first filename in a set of duplicates, so that 52 | all the printed filenames are files to consider removing 53 | -m, --summarize Instead of listing duplicates, print a summary of what was 54 | found 55 | -p, --paranoid Engages "paranoid mode" and performs byte-for-byte comparisons 56 | of files, in case you've found the first real-world BLAKE3 57 | hash collision (please publish it if so) 58 | --delete Try to delete all duplicates but one, skipping any files that 59 | cannot be deleted for whatever reason 60 | -h, --help Print help 61 | ``` 62 | 63 | ## What's a duplicate file? 64 | 65 | A file is a duplicate if some other file has exactly the same contents, but 66 | possibly a different name. For example, if you make a copy of a file with no 67 | other changes, that will count as a duplicate. 68 | 69 | `drupes` only notices duplicates within the directory or directories you 70 | specify, so a duplicate file somewhere else on your computer (or on that microSD 71 | card you lost in the couch) won't get reported. 72 | 73 | More specifically, `drupes` considers two files to be duplicates if, and only 74 | if, 75 | 76 | 1. They have exactly the same length, in bytes. 77 | 2. Their contents hash to the same value using BLAKE3. 78 | 3. If the `--paranoid` flag is given, their contents also match byte-for-byte. 79 | 80 | Because BLAKE3 is currently a well-respected cryptographic hash algorithm that's 81 | considered fairly collision-resistant, two files with the same size and BLAKE3 82 | hash are _very likely_ to have the same exact contents. The `--paranoid` mode 83 | should not generally be necessary, and is somewhat slower. 
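
As a rough illustration of the three checks above (this is a sketch, not the code in `src/main.rs`), the logic could look something like the following. Several things here are assumptions made for the example: the "files" are in-memory byte slices rather than paths on disk, the standard library's `DefaultHasher` stands in for BLAKE3 so the snippet builds without any dependencies, and the names `content_hash` and `duplicate_sets` are hypothetical. It also ignores the special handling of empty files controlled by `--empty`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

/// Stand-in content hash. ASSUMPTION: the real tool uses BLAKE3; the
/// standard library's (non-cryptographic) hasher is used here only so
/// the sketch has no external dependencies.
fn content_hash(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(bytes);
    hasher.finish()
}

/// Hypothetical helper: returns sets of indices into `files` whose
/// contents are considered duplicates of one another.
fn duplicate_sets(files: &[&[u8]], paranoid: bool) -> Vec<Vec<usize>> {
    // Check 1: bucket by length in bytes.
    let mut by_len: HashMap<usize, Vec<usize>> = HashMap::new();
    for (i, f) in files.iter().enumerate() {
        by_len.entry(f.len()).or_default().push(i);
    }

    let mut sets = Vec::new();
    for same_len in by_len.into_values() {
        // A file with a unique length cannot have a duplicate, so it
        // is never hashed at all.
        if same_len.len() < 2 {
            continue;
        }

        // Check 2: within one length bucket, bucket by content hash.
        let mut by_hash: HashMap<u64, Vec<usize>> = HashMap::new();
        for i in same_len {
            by_hash.entry(content_hash(files[i])).or_default().push(i);
        }

        for same_hash in by_hash.into_values() {
            if !paranoid {
                if same_hash.len() > 1 {
                    sets.push(same_hash);
                }
                continue;
            }

            // Check 3 (paranoid only): split each hash bucket by exact
            // byte-for-byte equality.
            let mut exact: Vec<Vec<usize>> = Vec::new();
            for i in same_hash {
                let pos = exact.iter().position(|g| files[g[0]] == files[i]);
                match pos {
                    Some(p) => exact[p].push(i),
                    None => exact.push(vec![i]),
                }
            }
            sets.extend(exact.into_iter().filter(|g| g.len() > 1));
        }
    }

    // HashMap iteration order is arbitrary; sort for stable output.
    sets.sort();
    sets
}
```

Given three inputs with the contents `abc`, `abd`, and `abc`, this returns a single set holding indices 0 and 2, with or without `paranoid`.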
84 | 85 | **Unix users:** Currently, `drupes` considers two files to be duplicates even if 86 | they are hardlinked to the same inode. This is deliberate; I developed the tool 87 | to winnow down a photo library, not to conserve disk space. If this bugs you, 88 | I'd be open to adding a switch to control it! 89 | 90 | 91 | ## Performance 92 | 93 | I haven't put a lot of work into optimizing performance, but by slapping 94 | together several off-the-shelf crates, `drupes` winds up being pretty fast. 95 | 96 | In a very conservative test, where I drop the entire system's filesystem cache 97 | before measuring (to ensure that reads are coming from disk and not memory), 98 | `drupes` is 1.7-7.9x faster than the other tools I use: 99 | 100 | | Tool | 14G of git repos | 15G of photos | 101 | | ---- | ------------------ | --------------- | 102 | | fdupes | 19.322 s | 3.144 s | 103 | | rmlint | 9.513 s | 1.341 s | 104 | | **drupes** | 2.522 s | 0.398 s | 105 | 106 | In practice, I don't drop the system page cache while I'm working, so a "cache 107 | warm" test is more representative of my day-to-day experience using `drupes`. In 108 | this case the gap widens to 2.5-22x (larger files make the gap larger): 109 | 110 | | Tool | 14G of git repos | 15G of photos | 111 | | ---- | ------------------ | --------------- | 112 | | fdupes | 3.134 s | 2.131 s | 113 | | rmlint | 2.206 s | 0.865 s | 114 | | **drupes** | 0.716 s | 0.098 s | 115 | 116 | ### Okay but why it fast 117 | 118 | Rust is not inherently faster than C, _but it does make it easier to use 119 | advanced performance techniques than C does._ 120 | 121 | `drupes` is faster than some other tools for, essentially, five reasons. 122 | 123 | 1. `drupes` will use as many CPU cores as you have available, because doing this 124 | is easy in Rust. 125 | 2. `drupes` uses a modern, very fast hash algorithm (BLAKE3) because it was just 126 | as easy to reach for as something slower.
(BLAKE3 is much faster than 127 | anything in the SHA family _or_ old MD5.) 128 | 3. `drupes` doesn't bother hashing a file _at all_ if it's the only file of a 129 | particular size (which is quite common). 130 | 4. `drupes` first hashes the start of a file; if the result is globally unique, 131 | it doesn't bother reading the rest of it. 132 | 5. `drupes` trusts BLAKE3 to be collision-resistant, so it doesn't need to do 133 | byte-for-byte comparisons of files it's already hashed. (Though you can 134 | request one using `--paranoid` if you're feeling, well, paranoid.) 135 | 136 | ### Performance caveats 137 | 138 | `drupes` is really intended for use on random-access media. The less your 139 | storage matches that description, the worse your experience will be. My workflow 140 | when processing old mystery disks is to copy or image the disks onto faster 141 | media, deduplicate, and process. 142 | 143 | In particular, `drupes` does not perform well on CDs, DVDs, floppies, or tape. 144 | Such media are currently best handled by copying them onto a faster device, 145 | ideally an SSD, but a reasonably modern hard disk will also do. 146 | 147 | This aspect could potentially be improved, but so far the added complexity 148 | hasn't seemed worth it, in particular because there's no good portable way of 149 | assessing physical file layout in a filesystem. (Suggestions would be welcome.) 150 | -------------------------------------------------------------------------------- /Cargo.lock: -------------------------------------------------------------------------------- 1 | # This file is automatically @generated by Cargo. 2 | # It is not intended for manual editing. 
3 | version = 3 4 | 5 | [[package]] 6 | name = "anstream" 7 | version = "0.6.15" 8 | source = "registry+https://github.com/rust-lang/crates.io-index" 9 | checksum = "64e15c1ab1f89faffbf04a634d5e1962e9074f2741eef6d97f3c4e322426d526" 10 | dependencies = [ 11 | "anstyle", 12 | "anstyle-parse", 13 | "anstyle-query", 14 | "anstyle-wincon", 15 | "colorchoice", 16 | "is_terminal_polyfill", 17 | "utf8parse", 18 | ] 19 | 20 | [[package]] 21 | name = "anstyle" 22 | version = "1.0.8" 23 | source = "registry+https://github.com/rust-lang/crates.io-index" 24 | checksum = "1bec1de6f59aedf83baf9ff929c98f2ad654b97c9510f4e70cf6f661d49fd5b1" 25 | 26 | [[package]] 27 | name = "anstyle-parse" 28 | version = "0.2.5" 29 | source = "registry+https://github.com/rust-lang/crates.io-index" 30 | checksum = "eb47de1e80c2b463c735db5b217a0ddc39d612e7ac9e2e96a5aed1f57616c1cb" 31 | dependencies = [ 32 | "utf8parse", 33 | ] 34 | 35 | [[package]] 36 | name = "anstyle-query" 37 | version = "1.1.1" 38 | source = "registry+https://github.com/rust-lang/crates.io-index" 39 | checksum = "6d36fc52c7f6c869915e99412912f22093507da8d9e942ceaf66fe4b7c14422a" 40 | dependencies = [ 41 | "windows-sys 0.52.0", 42 | ] 43 | 44 | [[package]] 45 | name = "anstyle-wincon" 46 | version = "3.0.4" 47 | source = "registry+https://github.com/rust-lang/crates.io-index" 48 | checksum = "5bf74e1b6e971609db8ca7a9ce79fd5768ab6ae46441c572e46cf596f59e57f8" 49 | dependencies = [ 50 | "anstyle", 51 | "windows-sys 0.52.0", 52 | ] 53 | 54 | [[package]] 55 | name = "anyhow" 56 | version = "1.0.90" 57 | source = "registry+https://github.com/rust-lang/crates.io-index" 58 | checksum = "37bf3594c4c988a53154954629820791dde498571819ae4ca50ca811e060cc95" 59 | 60 | [[package]] 61 | name = "arrayref" 62 | version = "0.3.9" 63 | source = "registry+https://github.com/rust-lang/crates.io-index" 64 | checksum = "76a2e8124351fda1ef8aaaa3bbd7ebbcb486bbcd4225aca0aa0d84bb2db8fecb" 65 | 66 | [[package]] 67 | name = "arrayvec" 68 | version = "0.7.6" 69 | 
source = "registry+https://github.com/rust-lang/crates.io-index" 70 | checksum = "7c02d123df017efcdfbd739ef81735b36c5ba83ec3c59c80a9d7ecc718f92e50" 71 | 72 | [[package]] 73 | name = "bitflags" 74 | version = "2.6.0" 75 | source = "registry+https://github.com/rust-lang/crates.io-index" 76 | checksum = "b048fb63fd8b5923fc5aa7b340d8e156aec7ec02f0c78fa8a6ddc2613f6f71de" 77 | 78 | [[package]] 79 | name = "blake3" 80 | version = "1.5.4" 81 | source = "registry+https://github.com/rust-lang/crates.io-index" 82 | checksum = "d82033247fd8e890df8f740e407ad4d038debb9eb1f40533fffb32e7d17dc6f7" 83 | dependencies = [ 84 | "arrayref", 85 | "arrayvec", 86 | "cc", 87 | "cfg-if", 88 | "constant_time_eq", 89 | ] 90 | 91 | [[package]] 92 | name = "cc" 93 | version = "1.1.31" 94 | source = "registry+https://github.com/rust-lang/crates.io-index" 95 | checksum = "c2e7962b54006dcfcc61cb72735f4d89bb97061dd6a7ed882ec6b8ee53714c6f" 96 | dependencies = [ 97 | "shlex", 98 | ] 99 | 100 | [[package]] 101 | name = "cfg-if" 102 | version = "1.0.0" 103 | source = "registry+https://github.com/rust-lang/crates.io-index" 104 | checksum = "baf1de4339761588bc0619e3cbc0120ee582ebb74b53b4efbf79117bd2da40fd" 105 | 106 | [[package]] 107 | name = "clap" 108 | version = "4.5.20" 109 | source = "registry+https://github.com/rust-lang/crates.io-index" 110 | checksum = "b97f376d85a664d5837dbae44bf546e6477a679ff6610010f17276f686d867e8" 111 | dependencies = [ 112 | "clap_builder", 113 | "clap_derive", 114 | ] 115 | 116 | [[package]] 117 | name = "clap_builder" 118 | version = "4.5.20" 119 | source = "registry+https://github.com/rust-lang/crates.io-index" 120 | checksum = "19bc80abd44e4bed93ca373a0704ccbd1b710dc5749406201bb018272808dc54" 121 | dependencies = [ 122 | "anstream", 123 | "anstyle", 124 | "clap_lex", 125 | "strsim", 126 | "terminal_size", 127 | ] 128 | 129 | [[package]] 130 | name = "clap_derive" 131 | version = "4.5.18" 132 | source = "registry+https://github.com/rust-lang/crates.io-index" 133 | checksum 
= "4ac6a0c7b1a9e9a5186361f67dfa1b88213572f427fb9ab038efb2bd8c582dab" 134 | dependencies = [ 135 | "heck", 136 | "proc-macro2", 137 | "quote", 138 | "syn", 139 | ] 140 | 141 | [[package]] 142 | name = "clap_lex" 143 | version = "0.7.2" 144 | source = "registry+https://github.com/rust-lang/crates.io-index" 145 | checksum = "1462739cb27611015575c0c11df5df7601141071f07518d56fcc1be504cbec97" 146 | 147 | [[package]] 148 | name = "colorchoice" 149 | version = "1.0.2" 150 | source = "registry+https://github.com/rust-lang/crates.io-index" 151 | checksum = "d3fd119d74b830634cea2a0f58bbd0d54540518a14397557951e79340abc28c0" 152 | 153 | [[package]] 154 | name = "constant_time_eq" 155 | version = "0.3.1" 156 | source = "registry+https://github.com/rust-lang/crates.io-index" 157 | checksum = "7c74b8349d32d297c9134b8c88677813a227df8f779daa29bfc29c183fe3dca6" 158 | 159 | [[package]] 160 | name = "crossbeam" 161 | version = "0.8.4" 162 | source = "registry+https://github.com/rust-lang/crates.io-index" 163 | checksum = "1137cd7e7fc0fb5d3c5a8678be38ec56e819125d8d7907411fe24ccb943faca8" 164 | dependencies = [ 165 | "crossbeam-channel", 166 | "crossbeam-deque", 167 | "crossbeam-epoch", 168 | "crossbeam-queue", 169 | "crossbeam-utils", 170 | ] 171 | 172 | [[package]] 173 | name = "crossbeam-channel" 174 | version = "0.5.13" 175 | source = "registry+https://github.com/rust-lang/crates.io-index" 176 | checksum = "33480d6946193aa8033910124896ca395333cae7e2d1113d1fef6c3272217df2" 177 | dependencies = [ 178 | "crossbeam-utils", 179 | ] 180 | 181 | [[package]] 182 | name = "crossbeam-deque" 183 | version = "0.8.5" 184 | source = "registry+https://github.com/rust-lang/crates.io-index" 185 | checksum = "613f8cc01fe9cf1a3eb3d7f488fd2fa8388403e97039e2f73692932e291a770d" 186 | dependencies = [ 187 | "crossbeam-epoch", 188 | "crossbeam-utils", 189 | ] 190 | 191 | [[package]] 192 | name = "crossbeam-epoch" 193 | version = "0.9.18" 194 | source = "registry+https://github.com/rust-lang/crates.io-index" 
195 | checksum = "5b82ac4a3c2ca9c3460964f020e1402edd5753411d7737aa39c3714ad1b5420e" 196 | dependencies = [ 197 | "crossbeam-utils", 198 | ] 199 | 200 | [[package]] 201 | name = "crossbeam-queue" 202 | version = "0.3.11" 203 | source = "registry+https://github.com/rust-lang/crates.io-index" 204 | checksum = "df0346b5d5e76ac2fe4e327c5fd1118d6be7c51dfb18f9b7922923f287471e35" 205 | dependencies = [ 206 | "crossbeam-utils", 207 | ] 208 | 209 | [[package]] 210 | name = "crossbeam-utils" 211 | version = "0.8.20" 212 | source = "registry+https://github.com/rust-lang/crates.io-index" 213 | checksum = "22ec99545bb0ed0ea7bb9b8e1e9122ea386ff8a48c0922e43f36d45ab09e0e80" 214 | 215 | [[package]] 216 | name = "drupes" 217 | version = "0.1.0" 218 | dependencies = [ 219 | "anyhow", 220 | "blake3", 221 | "clap", 222 | "jwalk", 223 | "rayon", 224 | "size", 225 | ] 226 | 227 | [[package]] 228 | name = "either" 229 | version = "1.13.0" 230 | source = "registry+https://github.com/rust-lang/crates.io-index" 231 | checksum = "60b1af1c220855b6ceac025d3f6ecdd2b7c4894bfe9cd9bda4fbb4bc7c0d4cf0" 232 | 233 | [[package]] 234 | name = "errno" 235 | version = "0.3.9" 236 | source = "registry+https://github.com/rust-lang/crates.io-index" 237 | checksum = "534c5cf6194dfab3db3242765c03bbe257cf92f22b38f6bc0c58d59108a820ba" 238 | dependencies = [ 239 | "libc", 240 | "windows-sys 0.52.0", 241 | ] 242 | 243 | [[package]] 244 | name = "heck" 245 | version = "0.5.0" 246 | source = "registry+https://github.com/rust-lang/crates.io-index" 247 | checksum = "2304e00983f87ffb38b55b444b5e3b60a884b5d30c0fca7d82fe33449bbe55ea" 248 | 249 | [[package]] 250 | name = "is_terminal_polyfill" 251 | version = "1.70.1" 252 | source = "registry+https://github.com/rust-lang/crates.io-index" 253 | checksum = "7943c866cc5cd64cbc25b2e01621d07fa8eb2a1a23160ee81ce38704e97b8ecf" 254 | 255 | [[package]] 256 | name = "jwalk" 257 | version = "0.8.1" 258 | source = "registry+https://github.com/rust-lang/crates.io-index" 259 | checksum = 
"2735847566356cd2179a2a38264839308f7079fa96e6bd5a42d740460e003c56" 260 | dependencies = [ 261 | "crossbeam", 262 | "rayon", 263 | ] 264 | 265 | [[package]] 266 | name = "libc" 267 | version = "0.2.161" 268 | source = "registry+https://github.com/rust-lang/crates.io-index" 269 | checksum = "8e9489c2807c139ffd9c1794f4af0ebe86a828db53ecdc7fea2111d0fed085d1" 270 | 271 | [[package]] 272 | name = "linux-raw-sys" 273 | version = "0.4.14" 274 | source = "registry+https://github.com/rust-lang/crates.io-index" 275 | checksum = "78b3ae25bc7c8c38cec158d1f2757ee79e9b3740fbc7ccf0e59e4b08d793fa89" 276 | 277 | [[package]] 278 | name = "proc-macro2" 279 | version = "1.0.88" 280 | source = "registry+https://github.com/rust-lang/crates.io-index" 281 | checksum = "7c3a7fc5db1e57d5a779a352c8cdb57b29aa4c40cc69c3a68a7fedc815fbf2f9" 282 | dependencies = [ 283 | "unicode-ident", 284 | ] 285 | 286 | [[package]] 287 | name = "quote" 288 | version = "1.0.37" 289 | source = "registry+https://github.com/rust-lang/crates.io-index" 290 | checksum = "b5b9d34b8991d19d98081b46eacdd8eb58c6f2b201139f7c5f643cc155a633af" 291 | dependencies = [ 292 | "proc-macro2", 293 | ] 294 | 295 | [[package]] 296 | name = "rayon" 297 | version = "1.10.0" 298 | source = "registry+https://github.com/rust-lang/crates.io-index" 299 | checksum = "b418a60154510ca1a002a752ca9714984e21e4241e804d32555251faf8b78ffa" 300 | dependencies = [ 301 | "either", 302 | "rayon-core", 303 | ] 304 | 305 | [[package]] 306 | name = "rayon-core" 307 | version = "1.12.1" 308 | source = "registry+https://github.com/rust-lang/crates.io-index" 309 | checksum = "1465873a3dfdaa8ae7cb14b4383657caab0b3e8a0aa9ae8e04b044854c8dfce2" 310 | dependencies = [ 311 | "crossbeam-deque", 312 | "crossbeam-utils", 313 | ] 314 | 315 | [[package]] 316 | name = "rustix" 317 | version = "0.38.37" 318 | source = "registry+https://github.com/rust-lang/crates.io-index" 319 | checksum = "8acb788b847c24f28525660c4d7758620a7210875711f79e7f663cc152726811" 320 | 
dependencies = [ 321 | "bitflags", 322 | "errno", 323 | "libc", 324 | "linux-raw-sys", 325 | "windows-sys 0.52.0", 326 | ] 327 | 328 | [[package]] 329 | name = "shlex" 330 | version = "1.3.0" 331 | source = "registry+https://github.com/rust-lang/crates.io-index" 332 | checksum = "0fda2ff0d084019ba4d7c6f371c95d8fd75ce3524c3cb8fb653a3023f6323e64" 333 | 334 | [[package]] 335 | name = "size" 336 | version = "0.4.1" 337 | source = "registry+https://github.com/rust-lang/crates.io-index" 338 | checksum = "9fed904c7fb2856d868b92464fc8fa597fce366edea1a9cbfaa8cb5fe080bd6d" 339 | 340 | [[package]] 341 | name = "strsim" 342 | version = "0.11.1" 343 | source = "registry+https://github.com/rust-lang/crates.io-index" 344 | checksum = "7da8b5736845d9f2fcb837ea5d9e2628564b3b043a70948a3f0b778838c5fb4f" 345 | 346 | [[package]] 347 | name = "syn" 348 | version = "2.0.82" 349 | source = "registry+https://github.com/rust-lang/crates.io-index" 350 | checksum = "83540f837a8afc019423a8edb95b52a8effe46957ee402287f4292fae35be021" 351 | dependencies = [ 352 | "proc-macro2", 353 | "quote", 354 | "unicode-ident", 355 | ] 356 | 357 | [[package]] 358 | name = "terminal_size" 359 | version = "0.4.0" 360 | source = "registry+https://github.com/rust-lang/crates.io-index" 361 | checksum = "4f599bd7ca042cfdf8f4512b277c02ba102247820f9d9d4a9f521f496751a6ef" 362 | dependencies = [ 363 | "rustix", 364 | "windows-sys 0.59.0", 365 | ] 366 | 367 | [[package]] 368 | name = "unicode-ident" 369 | version = "1.0.13" 370 | source = "registry+https://github.com/rust-lang/crates.io-index" 371 | checksum = "e91b56cd4cadaeb79bbf1a5645f6b4f8dc5bde8834ad5894a8db35fda9efa1fe" 372 | 373 | [[package]] 374 | name = "utf8parse" 375 | version = "0.2.2" 376 | source = "registry+https://github.com/rust-lang/crates.io-index" 377 | checksum = "06abde3611657adf66d383f00b093d7faecc7fa57071cce2578660c9f1010821" 378 | 379 | [[package]] 380 | name = "windows-sys" 381 | version = "0.52.0" 382 | source = 
"registry+https://github.com/rust-lang/crates.io-index" 383 | checksum = "282be5f36a8ce781fad8c8ae18fa3f9beff57ec1b52cb3de0789201425d9a33d" 384 | dependencies = [ 385 | "windows-targets", 386 | ] 387 | 388 | [[package]] 389 | name = "windows-sys" 390 | version = "0.59.0" 391 | source = "registry+https://github.com/rust-lang/crates.io-index" 392 | checksum = "1e38bc4d79ed67fd075bcc251a1c39b32a1776bbe92e5bef1f0bf1f8c531853b" 393 | dependencies = [ 394 | "windows-targets", 395 | ] 396 | 397 | [[package]] 398 | name = "windows-targets" 399 | version = "0.52.6" 400 | source = "registry+https://github.com/rust-lang/crates.io-index" 401 | checksum = "9b724f72796e036ab90c1021d4780d4d3d648aca59e491e6b98e725b84e99973" 402 | dependencies = [ 403 | "windows_aarch64_gnullvm", 404 | "windows_aarch64_msvc", 405 | "windows_i686_gnu", 406 | "windows_i686_gnullvm", 407 | "windows_i686_msvc", 408 | "windows_x86_64_gnu", 409 | "windows_x86_64_gnullvm", 410 | "windows_x86_64_msvc", 411 | ] 412 | 413 | [[package]] 414 | name = "windows_aarch64_gnullvm" 415 | version = "0.52.6" 416 | source = "registry+https://github.com/rust-lang/crates.io-index" 417 | checksum = "32a4622180e7a0ec044bb555404c800bc9fd9ec262ec147edd5989ccd0c02cd3" 418 | 419 | [[package]] 420 | name = "windows_aarch64_msvc" 421 | version = "0.52.6" 422 | source = "registry+https://github.com/rust-lang/crates.io-index" 423 | checksum = "09ec2a7bb152e2252b53fa7803150007879548bc709c039df7627cabbd05d469" 424 | 425 | [[package]] 426 | name = "windows_i686_gnu" 427 | version = "0.52.6" 428 | source = "registry+https://github.com/rust-lang/crates.io-index" 429 | checksum = "8e9b5ad5ab802e97eb8e295ac6720e509ee4c243f69d781394014ebfe8bbfa0b" 430 | 431 | [[package]] 432 | name = "windows_i686_gnullvm" 433 | version = "0.52.6" 434 | source = "registry+https://github.com/rust-lang/crates.io-index" 435 | checksum = "0eee52d38c090b3caa76c563b86c3a4bd71ef1a819287c19d586d7334ae8ed66" 436 | 437 | [[package]] 438 | name = "windows_i686_msvc" 
439 | version = "0.52.6" 440 | source = "registry+https://github.com/rust-lang/crates.io-index" 441 | checksum = "240948bc05c5e7c6dabba28bf89d89ffce3e303022809e73deaefe4f6ec56c66" 442 | 443 | [[package]] 444 | name = "windows_x86_64_gnu" 445 | version = "0.52.6" 446 | source = "registry+https://github.com/rust-lang/crates.io-index" 447 | checksum = "147a5c80aabfbf0c7d901cb5895d1de30ef2907eb21fbbab29ca94c5b08b1a78" 448 | 449 | [[package]] 450 | name = "windows_x86_64_gnullvm" 451 | version = "0.52.6" 452 | source = "registry+https://github.com/rust-lang/crates.io-index" 453 | checksum = "24d5b23dc417412679681396f2b49f3de8c1473deb516bd34410872eff51ed0d" 454 | 455 | [[package]] 456 | name = "windows_x86_64_msvc" 457 | version = "0.52.6" 458 | source = "registry+https://github.com/rust-lang/crates.io-index" 459 | checksum = "589f6da84c646204747d1270a2a5661ea66ed1cced2631d546fdfb155959f9ec" 460 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Mozilla Public License Version 2.0 2 | ================================== 3 | 4 | 1. Definitions 5 | -------------- 6 | 7 | 1.1. "Contributor" 8 | means each individual or legal entity that creates, contributes to 9 | the creation of, or owns Covered Software. 10 | 11 | 1.2. "Contributor Version" 12 | means the combination of the Contributions of others (if any) used 13 | by a Contributor and that particular Contributor's Contribution. 14 | 15 | 1.3. "Contribution" 16 | means Covered Software of a particular Contributor. 17 | 18 | 1.4. "Covered Software" 19 | means Source Code Form to which the initial Contributor has attached 20 | the notice in Exhibit A, the Executable Form of such Source Code 21 | Form, and Modifications of such Source Code Form, in each case 22 | including portions thereof. 23 | 24 | 1.5. 
"Incompatible With Secondary Licenses" 25 | means 26 | 27 | (a) that the initial Contributor has attached the notice described 28 | in Exhibit B to the Covered Software; or 29 | 30 | (b) that the Covered Software was made available under the terms of 31 | version 1.1 or earlier of the License, but not also under the 32 | terms of a Secondary License. 33 | 34 | 1.6. "Executable Form" 35 | means any form of the work other than Source Code Form. 36 | 37 | 1.7. "Larger Work" 38 | means a work that combines Covered Software with other material, in 39 | a separate file or files, that is not Covered Software. 40 | 41 | 1.8. "License" 42 | means this document. 43 | 44 | 1.9. "Licensable" 45 | means having the right to grant, to the maximum extent possible, 46 | whether at the time of the initial grant or subsequently, any and 47 | all of the rights conveyed by this License. 48 | 49 | 1.10. "Modifications" 50 | means any of the following: 51 | 52 | (a) any file in Source Code Form that results from an addition to, 53 | deletion from, or modification of the contents of Covered 54 | Software; or 55 | 56 | (b) any new file in Source Code Form that contains any Covered 57 | Software. 58 | 59 | 1.11. "Patent Claims" of a Contributor 60 | means any patent claim(s), including without limitation, method, 61 | process, and apparatus claims, in any patent Licensable by such 62 | Contributor that would be infringed, but for the grant of the 63 | License, by the making, using, selling, offering for sale, having 64 | made, import, or transfer of either its Contributions or its 65 | Contributor Version. 66 | 67 | 1.12. "Secondary License" 68 | means either the GNU General Public License, Version 2.0, the GNU 69 | Lesser General Public License, Version 2.1, the GNU Affero General 70 | Public License, Version 3.0, or any later versions of those 71 | licenses. 72 | 73 | 1.13. "Source Code Form" 74 | means the form of the work preferred for making modifications. 75 | 76 | 1.14. 
"You" (or "Your") 77 | means an individual or a legal entity exercising rights under this 78 | License. For legal entities, "You" includes any entity that 79 | controls, is controlled by, or is under common control with You. For 80 | purposes of this definition, "control" means (a) the power, direct 81 | or indirect, to cause the direction or management of such entity, 82 | whether by contract or otherwise, or (b) ownership of more than 83 | fifty percent (50%) of the outstanding shares or beneficial 84 | ownership of such entity. 85 | 86 | 2. License Grants and Conditions 87 | -------------------------------- 88 | 89 | 2.1. Grants 90 | 91 | Each Contributor hereby grants You a world-wide, royalty-free, 92 | non-exclusive license: 93 | 94 | (a) under intellectual property rights (other than patent or trademark) 95 | Licensable by such Contributor to use, reproduce, make available, 96 | modify, display, perform, distribute, and otherwise exploit its 97 | Contributions, either on an unmodified basis, with Modifications, or 98 | as part of a Larger Work; and 99 | 100 | (b) under Patent Claims of such Contributor to make, use, sell, offer 101 | for sale, have made, import, and otherwise transfer either its 102 | Contributions or its Contributor Version. 103 | 104 | 2.2. Effective Date 105 | 106 | The licenses granted in Section 2.1 with respect to any Contribution 107 | become effective for each Contribution on the date the Contributor first 108 | distributes such Contribution. 109 | 110 | 2.3. Limitations on Grant Scope 111 | 112 | The licenses granted in this Section 2 are the only rights granted under 113 | this License. No additional rights or licenses will be implied from the 114 | distribution or licensing of Covered Software under this License. 
115 | Notwithstanding Section 2.1(b) above, no patent license is granted by a 116 | Contributor: 117 | 118 | (a) for any code that a Contributor has removed from Covered Software; 119 | or 120 | 121 | (b) for infringements caused by: (i) Your and any other third party's 122 | modifications of Covered Software, or (ii) the combination of its 123 | Contributions with other software (except as part of its Contributor 124 | Version); or 125 | 126 | (c) under Patent Claims infringed by Covered Software in the absence of 127 | its Contributions. 128 | 129 | This License does not grant any rights in the trademarks, service marks, 130 | or logos of any Contributor (except as may be necessary to comply with 131 | the notice requirements in Section 3.4). 132 | 133 | 2.4. Subsequent Licenses 134 | 135 | No Contributor makes additional grants as a result of Your choice to 136 | distribute the Covered Software under a subsequent version of this 137 | License (see Section 10.2) or under the terms of a Secondary License (if 138 | permitted under the terms of Section 3.3). 139 | 140 | 2.5. Representation 141 | 142 | Each Contributor represents that the Contributor believes its 143 | Contributions are its original creation(s) or it has sufficient rights 144 | to grant the rights to its Contributions conveyed by this License. 145 | 146 | 2.6. Fair Use 147 | 148 | This License is not intended to limit any rights You have under 149 | applicable copyright doctrines of fair use, fair dealing, or other 150 | equivalents. 151 | 152 | 2.7. Conditions 153 | 154 | Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted 155 | in Section 2.1. 156 | 157 | 3. Responsibilities 158 | ------------------- 159 | 160 | 3.1. Distribution of Source Form 161 | 162 | All distribution of Covered Software in Source Code Form, including any 163 | Modifications that You create or to which You contribute, must be under 164 | the terms of this License. 
You must inform recipients that the Source 165 | Code Form of the Covered Software is governed by the terms of this 166 | License, and how they can obtain a copy of this License. You may not 167 | attempt to alter or restrict the recipients' rights in the Source Code 168 | Form. 169 | 170 | 3.2. Distribution of Executable Form 171 | 172 | If You distribute Covered Software in Executable Form then: 173 | 174 | (a) such Covered Software must also be made available in Source Code 175 | Form, as described in Section 3.1, and You must inform recipients of 176 | the Executable Form how they can obtain a copy of such Source Code 177 | Form by reasonable means in a timely manner, at a charge no more 178 | than the cost of distribution to the recipient; and 179 | 180 | (b) You may distribute such Executable Form under the terms of this 181 | License, or sublicense it under different terms, provided that the 182 | license for the Executable Form does not attempt to limit or alter 183 | the recipients' rights in the Source Code Form under this License. 184 | 185 | 3.3. Distribution of a Larger Work 186 | 187 | You may create and distribute a Larger Work under terms of Your choice, 188 | provided that You also comply with the requirements of this License for 189 | the Covered Software. If the Larger Work is a combination of Covered 190 | Software with a work governed by one or more Secondary Licenses, and the 191 | Covered Software is not Incompatible With Secondary Licenses, this 192 | License permits You to additionally distribute such Covered Software 193 | under the terms of such Secondary License(s), so that the recipient of 194 | the Larger Work may, at their option, further distribute the Covered 195 | Software under the terms of either this License or such Secondary 196 | License(s). 197 | 198 | 3.4. 
Notices 199 | 200 | You may not remove or alter the substance of any license notices 201 | (including copyright notices, patent notices, disclaimers of warranty, 202 | or limitations of liability) contained within the Source Code Form of 203 | the Covered Software, except that You may alter any license notices to 204 | the extent required to remedy known factual inaccuracies. 205 | 206 | 3.5. Application of Additional Terms 207 | 208 | You may choose to offer, and to charge a fee for, warranty, support, 209 | indemnity or liability obligations to one or more recipients of Covered 210 | Software. However, You may do so only on Your own behalf, and not on 211 | behalf of any Contributor. You must make it absolutely clear that any 212 | such warranty, support, indemnity, or liability obligation is offered by 213 | You alone, and You hereby agree to indemnify every Contributor for any 214 | liability incurred by such Contributor as a result of warranty, support, 215 | indemnity or liability terms You offer. You may include additional 216 | disclaimers of warranty and limitations of liability specific to any 217 | jurisdiction. 218 | 219 | 4. Inability to Comply Due to Statute or Regulation 220 | --------------------------------------------------- 221 | 222 | If it is impossible for You to comply with any of the terms of this 223 | License with respect to some or all of the Covered Software due to 224 | statute, judicial order, or regulation then You must: (a) comply with 225 | the terms of this License to the maximum extent possible; and (b) 226 | describe the limitations and the code they affect. Such description must 227 | be placed in a text file included with all distributions of the Covered 228 | Software under this License. Except to the extent prohibited by statute 229 | or regulation, such description must be sufficiently detailed for a 230 | recipient of ordinary skill to be able to understand it. 231 | 232 | 5. 
Termination 233 | -------------- 234 | 235 | 5.1. The rights granted under this License will terminate automatically 236 | if You fail to comply with any of its terms. However, if You become 237 | compliant, then the rights granted under this License from a particular 238 | Contributor are reinstated (a) provisionally, unless and until such 239 | Contributor explicitly and finally terminates Your grants, and (b) on an 240 | ongoing basis, if such Contributor fails to notify You of the 241 | non-compliance by some reasonable means prior to 60 days after You have 242 | come back into compliance. Moreover, Your grants from a particular 243 | Contributor are reinstated on an ongoing basis if such Contributor 244 | notifies You of the non-compliance by some reasonable means, this is the 245 | first time You have received notice of non-compliance with this License 246 | from such Contributor, and You become compliant prior to 30 days after 247 | Your receipt of the notice. 248 | 249 | 5.2. If You initiate litigation against any entity by asserting a patent 250 | infringement claim (excluding declaratory judgment actions, 251 | counter-claims, and cross-claims) alleging that a Contributor Version 252 | directly or indirectly infringes any patent, then the rights granted to 253 | You by any and all Contributors for the Covered Software under Section 254 | 2.1 of this License shall terminate. 255 | 256 | 5.3. In the event of termination under Sections 5.1 or 5.2 above, all 257 | end user license agreements (excluding distributors and resellers) which 258 | have been validly granted by You or Your distributors under this License 259 | prior to termination shall survive termination. 260 | 261 | ************************************************************************ 262 | * * 263 | * 6. 
Disclaimer of Warranty * 264 | * ------------------------- * 265 | * * 266 | * Covered Software is provided under this License on an "as is" * 267 | * basis, without warranty of any kind, either expressed, implied, or * 268 | * statutory, including, without limitation, warranties that the * 269 | * Covered Software is free of defects, merchantable, fit for a * 270 | * particular purpose or non-infringing. The entire risk as to the * 271 | * quality and performance of the Covered Software is with You. * 272 | * Should any Covered Software prove defective in any respect, You * 273 | * (not any Contributor) assume the cost of any necessary servicing, * 274 | * repair, or correction. This disclaimer of warranty constitutes an * 275 | * essential part of this License. No use of any Covered Software is * 276 | * authorized under this License except under this disclaimer. * 277 | * * 278 | ************************************************************************ 279 | 280 | ************************************************************************ 281 | * * 282 | * 7. Limitation of Liability * 283 | * -------------------------- * 284 | * * 285 | * Under no circumstances and under no legal theory, whether tort * 286 | * (including negligence), contract, or otherwise, shall any * 287 | * Contributor, or anyone who distributes Covered Software as * 288 | * permitted above, be liable to You for any direct, indirect, * 289 | * special, incidental, or consequential damages of any character * 290 | * including, without limitation, damages for lost profits, loss of * 291 | * goodwill, work stoppage, computer failure or malfunction, or any * 292 | * and all other commercial damages or losses, even if such party * 293 | * shall have been informed of the possibility of such damages. 
This * 294 | * limitation of liability shall not apply to liability for death or * 295 | * personal injury resulting from such party's negligence to the * 296 | * extent applicable law prohibits such limitation. Some * 297 | * jurisdictions do not allow the exclusion or limitation of * 298 | * incidental or consequential damages, so this exclusion and * 299 | * limitation may not apply to You. * 300 | * * 301 | ************************************************************************ 302 | 303 | 8. Litigation 304 | ------------- 305 | 306 | Any litigation relating to this License may be brought only in the 307 | courts of a jurisdiction where the defendant maintains its principal 308 | place of business and such litigation shall be governed by laws of that 309 | jurisdiction, without reference to its conflict-of-law provisions. 310 | Nothing in this Section shall prevent a party's ability to bring 311 | cross-claims or counter-claims. 312 | 313 | 9. Miscellaneous 314 | ---------------- 315 | 316 | This License represents the complete agreement concerning the subject 317 | matter hereof. If any provision of this License is held to be 318 | unenforceable, such provision shall be reformed only to the extent 319 | necessary to make it enforceable. Any law or regulation which provides 320 | that the language of a contract shall be construed against the drafter 321 | shall not be used to construe this License against a Contributor. 322 | 323 | 10. Versions of the License 324 | --------------------------- 325 | 326 | 10.1. New Versions 327 | 328 | Mozilla Foundation is the license steward. Except as provided in Section 329 | 10.3, no one other than the license steward has the right to modify or 330 | publish new versions of this License. Each version will be given a 331 | distinguishing version number. 332 | 333 | 10.2. 
Effect of New Versions 334 | 335 | You may distribute the Covered Software under the terms of the version 336 | of the License under which You originally received the Covered Software, 337 | or under the terms of any subsequent version published by the license 338 | steward. 339 | 340 | 10.3. Modified Versions 341 | 342 | If you create software not governed by this License, and you want to 343 | create a new license for such software, you may create and use a 344 | modified version of this License if you rename the license and remove 345 | any references to the name of the license steward (except to note that 346 | such modified license differs from this License). 347 | 348 | 10.4. Distributing Source Code Form that is Incompatible With Secondary 349 | Licenses 350 | 351 | If You choose to distribute Source Code Form that is Incompatible With 352 | Secondary Licenses under the terms of this version of the License, the 353 | notice described in Exhibit B of this License must be attached. 354 | 355 | Exhibit A - Source Code Form License Notice 356 | ------------------------------------------- 357 | 358 | This Source Code Form is subject to the terms of the Mozilla Public 359 | License, v. 2.0. If a copy of the MPL was not distributed with this 360 | file, You can obtain one at http://mozilla.org/MPL/2.0/. 361 | 362 | If it is not possible or desirable to put the notice in a particular 363 | file, then You may include the notice in a location (such as a LICENSE 364 | file in a relevant directory) where a recipient would be likely to look 365 | for such a notice. 366 | 367 | You may add additional accurate notices of copyright ownership. 368 | 369 | Exhibit B - "Incompatible With Secondary Licenses" Notice 370 | --------------------------------------------------------- 371 | 372 | This Source Code Form is "Incompatible With Secondary Licenses", as 373 | defined by the Mozilla Public License, v. 2.0. 
374 | -------------------------------------------------------------------------------- /src/main.rs: -------------------------------------------------------------------------------- 1 | // This Source Code Form is subject to the terms of the Mozilla Public 2 | // License, v. 2.0. If a copy of the MPL was not distributed with this 3 | // file, You can obtain one at https://mozilla.org/MPL/2.0/. 4 | 5 | use std::{collections::{BTreeMap, HashMap}, fs::File, io::{BufReader, ErrorKind, Read, Seek}, path::{Path, PathBuf}, time::Instant}; 6 | 7 | use anyhow::{bail, Context as _}; 8 | use clap::Parser; 9 | use rayon::prelude::*; 10 | use size::Size; 11 | use jwalk::WalkDir; 12 | 13 | const PREHASH_SIZE: usize = 4 * 1024; 14 | 15 | /// Finds duplicate files and optionally deletes them. 16 | /// 17 | /// This program recursively analyzes one or more paths and tries to find files 18 | /// that appear in multiple places, possibly with different names, but have the 19 | /// exact same content. This can happen, for example, if you restore a 20 | /// collection of backups from different dates, which is the case that motivated 21 | /// the author. 22 | #[derive(Parser)] 23 | struct Drupes { 24 | /// Also consider empty files, which will report all empty files except one 25 | /// as duplicate; by default, empty files are ignored, because this is 26 | /// rarely what you actually want. 27 | #[clap(short, long)] 28 | empty: bool, 29 | 30 | /// Don't print the first filename in a set of duplicates, so that all the 31 | /// printed filenames are files to consider removing. 32 | #[clap(short('f'), long)] 33 | omit_first: bool, 34 | 35 | /// Instead of listing duplicates, print a summary of what was found. 
36 | #[clap(short('m'), long)] 37 | summarize: bool, 38 | 39 | /// Engages "paranoid mode" and performs byte-for-byte comparisons of files, 40 | /// in case you've found the first real-world BLAKE3 hash collision (please 41 | /// publish it if so) 42 | #[clap(short, long)] 43 | paranoid: bool, 44 | 45 | /// Try to delete all duplicates but one, skipping any files that cannot be 46 | /// deleted for whatever reason. 47 | #[clap(long)] 48 | delete: bool, 49 | 50 | /// Enable additional output about what the program is doing. 51 | #[clap(short, long)] 52 | verbose: bool, 53 | 54 | /// List of directories to search, recursively, for duplicate files; if 55 | /// omitted, the current directory is searched. 56 | roots: Vec<PathBuf>, 57 | } 58 | 59 | fn main() -> anyhow::Result<()> { 60 | let start = Instant::now(); 61 | 62 | let mut args = Drupes::parse(); 63 | 64 | if args.roots.is_empty() { 65 | // Search the current directory by default. 66 | args.roots.push(".".into()); 67 | } 68 | 69 | // PASS ONE 70 | // 71 | // Traverse the requested parts of the filesystem, collating files by size 72 | // (i.e. producing a map with file sizes as keys, and lists of files as 73 | // values). 74 | // 75 | // Any value in the map with more than one path represents a "file size 76 | // group," which is a potential duplicate group. On the other hand, any 77 | // value in the map containing only _one_ path need not be considered 78 | // further. 79 | // 80 | // We do this because, generally speaking, getting the size of a file is 81 | // much cheaper than reading its contents, and in practice file sizes are 82 | // _relatively_ unique. 
83 | let mut paths: BTreeMap<u64, Vec<PathBuf>> = BTreeMap::new(); 84 | for root in &args.roots { 85 | if args.verbose { 86 | eprintln!("{:?} starting walk of {}", 87 | start.elapsed(), root.display()); 88 | } 89 | 90 | for entry in WalkDir::new(root) { 91 | let entry = entry 92 | .with_context(|| format!("problem reading dirent in {}", root.display()))?; 93 | let meta = entry.metadata() 94 | .with_context(|| format!("problem getting metadata for {}", 95 | entry.path().display()))?; 96 | if meta.is_file() && (meta.len() > 0 || args.empty) { 97 | paths.entry(meta.len()) 98 | .or_default() 99 | .push(entry.path().to_owned()); 100 | } 101 | } 102 | } 103 | 104 | if args.verbose { 105 | eprintln!("{:?} pass one complete, found {} size-groups", 106 | start.elapsed(), paths.len()); 107 | } 108 | 109 | // Drop all file size groups that contain no duplicates (have only one 110 | // member). 111 | // 112 | // This saves about 10% of runtime. 113 | paths.retain(|_size, paths| paths.len() > 1); 114 | 115 | if args.verbose { 116 | eprintln!("...of which {} had more than one member", paths.len()); 117 | } 118 | 119 | // PASS TWO 120 | // 121 | // We've reduced the data set to files whose sizes are not unique. This pass 122 | // takes those files and hashes the first `PREHASH_SIZE` bytes of each. If 123 | // two files have different hashes for the first `PREHASH_SIZE` bytes, they 124 | // cannot possibly be duplicates, so we can use this to avoid reading the 125 | // full contents of files. 126 | // 127 | // This is a significant performance improvement for directories of large 128 | // files like photos or videos (~50%). 129 | // 130 | // This is constructed as a Rayon pipeline because (1) I find it reasonably 131 | // clear this way once I got used to it and (2) it's by far the 132 | // easiest-to-reach "go faster button." 133 | let hashed_files: HashMap<blake3::Hash, Vec<&Path>> = paths.par_iter() 134 | // Flatten the map into a list of paths to hash, discarding the size 135 | // information. 
136 | .flat_map(|(_size, paths)| paths) 137 | // Hash each path, producing a (hash, path) pair. Note that this can 138 | // fail to access the filesystem. 139 | // 140 | // We use `map_with` here to allocate exactly one I/O buffer per backing 141 | // Rayon thread, instead of one per closure, because I'm neurotic. 142 | .map_with(vec![0u8; PREHASH_SIZE], |buf, path| { 143 | let mut f = File::open(path) 144 | .with_context(|| format!("unable to open: {}", path.display()))?; 145 | 146 | // Read up to `PREHASH_SIZE` bytes, or fewer if the file is shorter 147 | // than that. (It's odd that there's no operation for this in the 148 | // standard library.) 149 | let mut total = 0; 150 | while total < buf.len() { 151 | match f.read(&mut buf[total..]) { 152 | Ok(0) => break, 153 | Ok(n) => total += n, 154 | Err(e) if e.kind() == ErrorKind::Interrupted => continue, 155 | Err(e) => return Err(e).context( 156 | format!("unable to read path: {}", path.display()) 157 | ), 158 | } 159 | } 160 | // Hash the first chunk of the file (only the bytes actually read, so stale buffer contents cannot affect short files). 161 | Ok((blake3::hash(&buf[..total]), path)) 162 | }) 163 | // Squawk about any reads that failed, and remove them from further 164 | // consideration. 165 | .filter_map(|result| { 166 | match result { 167 | Ok(data) => Some(data), 168 | Err(e) => { 169 | eprintln!("{e:?}"); 170 | None 171 | } 172 | } 173 | }) 174 | // Take the stream of (hash, path) pairs and collate them by hash, 175 | // producing "hash groups." 176 | // 177 | // Rayon's fold is a little surprising: this produces, not a single map, 178 | // but a _stream_ of maps, because (roughly speaking) each thread 179 | // calculates its own. 180 | // 181 | // Many hash-groups will only contain one path, and will be filtered out 182 | // below. Any group containing multiple paths needs to be hashed more 183 | // fully in the next pass. 
184 | .fold(HashMap::<_, Vec<&Path>>::new, |mut map, (hash, path)| { 185 | map.entry(hash).or_default().push(path); 186 | map 187 | }) 188 | // Collapse the stream of hashmaps into one, merging hash groups as 189 | // required. 190 | .reduce(HashMap::new, |mut a, b| { 191 | for (k, v) in b { 192 | a.entry(k).or_default().extend(v); 193 | } 194 | a 195 | }); 196 | 197 | let unique_prehash_groups = hashed_files.len(); 198 | 199 | if args.verbose { 200 | eprintln!("{:?} pass two complete, found {unique_prehash_groups} \ 201 | unique first blocks", 202 | start.elapsed()); 203 | let dupesets = hashed_files.values() 204 | .filter(|paths| paths.len() > 1) 205 | .count(); 206 | eprintln!("...of which {dupesets} are present in more than one file"); 207 | let dupes = hashed_files.values() 208 | .map(|paths| paths.len().saturating_sub(1)) 209 | .sum::<usize>(); 210 | eprintln!("...for a total of {dupes} possibly redundant files"); 211 | } 212 | 213 | // PASS THREE 214 | // 215 | // For any files whose first `PREHASH_SIZE` bytes match at least one other 216 | // file, hash the entire contents to scan for differences later on. 217 | let mut hashed_files = hashed_files.into_par_iter() 218 | // Ignore groups with only one member. 219 | .filter(|(_, paths)| paths.len() > 1) 220 | // Flatten the `prehash => vec of paths` map to a stream of `prehash, 221 | // path` pairs. Since the prehash has no (straightforward) relation to 222 | // the hash of the overall file, we don't need to maintain the group 223 | // structure. 224 | // 225 | // We do, however, forward the prehash value on, so that we can use it 226 | // for keying below. 227 | .flat_map(|(hash, paths)| paths.into_par_iter().map(move |p| (hash, p))) 228 | // Hash the tail of each file to produce `(path, hash)` pairs. Note that 229 | // this can fail to access the filesystem (again). 230 | // 231 | // This takes the prehash as input, and uses it as the key for a keyed 232 | // hash of the rest of the file. 
This is important for correctness: if 233 | // we just hashed the tail end of every file, we could detect two files 234 | // as "identical" even if their first `PREHASH_SIZE` bytes differed! By 235 | // incorporating the prehash as key we chain the two hashes and prevent 236 | // this. 237 | // 238 | // For files smaller than `PREHASH_SIZE`, we immediately finalize the 239 | // keyed hash without reading anything. 240 | .map(|(prehash, path)| { 241 | let mut f = File::open(path) 242 | .with_context(|| format!("unable to open: {}", path.display()))?; 243 | let mut hasher = blake3::Hasher::new_keyed(prehash.as_bytes()); 244 | 245 | // Small files have already been completely hashed. Skip them. 246 | if f.metadata()?.len() > PREHASH_SIZE as u64 { 247 | f.seek(std::io::SeekFrom::Start(PREHASH_SIZE as u64))?; 248 | hasher.update_reader(f)?; 249 | } 250 | Ok::<_, anyhow::Error>((hasher.finalize(), path)) 251 | }) 252 | // Squawk about any reads that failed, and remove them from further 253 | // consideration. 254 | .filter_map(|result| { 255 | match result { 256 | Ok(data) => Some(data), 257 | Err(e) => { 258 | eprintln!("{e}"); 259 | None 260 | } 261 | } 262 | }) 263 | // Collect groups of (path, hash) pairs and collate them by hash. This 264 | // is identical to the end of Pass Two. 265 | .fold(HashMap::<_, Vec<&Path>>::new, |mut map, (hash, path)| { 266 | map.entry(hash).or_default().push(path); 267 | map 268 | }) 269 | // Collapse the stream of hashmaps into one, merging hash groups as 270 | // required. This is also identical to the end of Pass Two. 271 | .reduce(HashMap::new, |mut a, b| { 272 | for (k, v) in b { 273 | a.entry(k).or_default().extend(v); 274 | } 275 | a 276 | }); 277 | 278 | if args.verbose { 279 | eprintln!("{:?} pass three complete, generating results", 280 | start.elapsed()); 281 | } 282 | 283 | if args.paranoid { 284 | // Given our map of collated hash-groups from the previous step, let's 285 | // check our work. 
286 | // 287 | // This takes each hash-group containing at least two paths and reads 288 | // the contents of each file, comparing them to one another. The files 289 | // are not kept in memory, so this works fine on very large files 290 | // (keeping files in memory is the operating system's job). 291 | // 292 | // Note that if this ever finds anything, it is **almost certainly** a 293 | // bug in this program. If it isn't a bug in this program, it's probably 294 | // a file being modified out from under us. BLAKE3 is 295 | // collision-resistant, and finding two files with the same length, same 296 | // BLAKE3 hash, and different contents would be a newsworthy event. It's 297 | // certainly possible, but rather unlikely. 298 | eprintln!("paranoid mode: verifying file contents"); 299 | hashed_files.par_iter() 300 | .filter(|(_, files)| files.len() > 1) 301 | .try_for_each(|(_, files)| { 302 | // Arbitrarily choose the first file in each group as a 303 | // "representative." 304 | let first = &files[0]; 305 | let first_f = File::open(first)?; 306 | let first_meta = first_f.metadata()?; 307 | let mut first_f = BufReader::new(first_f); 308 | 309 | // Compare it to every other file in the group, one at a time. 310 | for other in &files[1..] { 311 | // ...starting from the beginning of the first file, please. 312 | first_f.rewind()?; 313 | 314 | let other_f = File::open(other)?; 315 | let other_meta = other_f.metadata()?; 316 | let mut other_f = BufReader::new(other_f); 317 | 318 | // This provides some _very basic_ protection against files 319 | // being modified while this program is running, but in 320 | // general, this program is not written with that situation 321 | // in mind. 322 | if first_meta.len() != other_meta.len() { 323 | bail!("files no longer have same length:\n{}\n{}", 324 | first.display(), 325 | other.display()); 326 | } 327 | 328 | // Read one byte at a time from each file, comparing each 329 | // byte. 
Single byte reads are the easiest thing to 330 | // implement, and are reasonably fast because BufReader 331 | // converts them into larger reads under the hood. No need 332 | // to reimplement the standard library! 333 | let mut buf1 = [0u8]; 334 | let mut buf2 = [0u8]; 335 | for _ in 0..first_meta.len() { 336 | first_f.read_exact(&mut buf1)?; 337 | other_f.read_exact(&mut buf2)?; 338 | if buf1 != buf2 { 339 | bail!("files differ (blake3 collision found?):\n{}\n{}", 340 | first.display(), 341 | other.display()); 342 | } 343 | } 344 | } 345 | Ok(()) 346 | })?; 347 | eprintln!("files really are duplicates"); 348 | } 349 | 350 | if args.summarize { 351 | // Work out some statistics, instead of printing filenames. 352 | 353 | // How many unique size classes did we discover in the first pass? 354 | let unique_size_classes = paths.len(); 355 | // How many files did we find in our recursive scan? 356 | let total_files_checked = paths.values().map(|v| v.len()).sum::<usize>(); 357 | 358 | // How many hash-groups containing duplicates did we discover? 359 | let set_count = hashed_files.values() 360 | .filter(|files| files.len() > 1) 361 | .count(); 362 | // And how many duplicates, beyond the first in each group, did we find? 363 | let dupe_count = hashed_files.values() 364 | .filter_map(|files| files.len().checked_sub(1)) 365 | .sum::<usize>(); 366 | // How large are the duplicates on disk? 
367 | let dupe_size = hashed_files.values() 368 | .filter(|files| files.len() > 1) 369 | .try_fold(0, |sum, files| { 370 | std::fs::metadata(files[0]) 371 | .map(|meta| sum + meta.len() * (files.len() as u64 - 1)) 372 | })?; 373 | // Convenient unit formatting: 374 | let dupe_size = Size::from_bytes(dupe_size); 375 | 376 | println!("{dupe_count} duplicate files (in {set_count} sets), \ 377 | occupying {dupe_size}"); 378 | println!("checked {total_files_checked} files in \ 379 | {unique_size_classes} size classes"); 380 | println!("prehashing identified {unique_prehash_groups} groups"); 381 | } else { 382 | // Print filenames of each duplicate-group. 383 | for files in hashed_files.values_mut() { 384 | if files.len() > 1 { 385 | // Our files have arrived in a nondeterministic order due to our 386 | // use of concurrency. Let's fix that. 387 | files.sort(); 388 | 389 | let mut files = files.iter(); 390 | // Implement the omit-first flag by skipping: 391 | if args.omit_first { 392 | files.next(); 393 | } 394 | 395 | for f in files { 396 | println!("{}", f.display()); 397 | } 398 | if !args.omit_first { 399 | println!(); 400 | } 401 | } 402 | } 403 | } 404 | 405 | if args.delete { 406 | // The scary delete mode! 407 | for files in hashed_files.values() { 408 | if files.len() > 1 { 409 | for f in &files[1..] { 410 | println!("deleting: {}", f.display()); 411 | if let Err(e) = std::fs::remove_file(f) { 412 | eprintln!("error deleting {}: {e}", f.display()); 413 | } 414 | } 415 | } 416 | } 417 | 418 | } 419 | 420 | Ok(()) 421 | } 422 | --------------------------------------------------------------------------------
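The first two passes in `src/main.rs` rest on two small building blocks: collating files by size and discarding singleton groups, and reading at most a fixed number of bytes from a file. The following is a minimal, std-only sketch of those two ideas in isolation. The helper names `size_groups` and `read_up_to` are hypothetical and do not appear in the program; the sketch works on in-memory data instead of the filesystem, and it leaves out Rayon and BLAKE3 entirely.

```rust
use std::collections::BTreeMap;
use std::io::{self, Cursor, ErrorKind, Read};

/// Hypothetical helper: read up to `buf.len()` bytes from `reader`,
/// stopping early at end of input. Returns the number of bytes read.
fn read_up_to<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut total = 0;
    while total < buf.len() {
        match reader.read(&mut buf[total..]) {
            // A zero-length read means end of input.
            Ok(0) => break,
            Ok(n) => total += n,
            // Interrupted reads are retried rather than reported.
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(total)
}

/// Hypothetical helper: collate `(name, size)` pairs by size and keep
/// only the sizes shared by more than one name.
fn size_groups<'a>(files: &[(&'a str, u64)]) -> BTreeMap<u64, Vec<&'a str>> {
    let mut groups: BTreeMap<u64, Vec<&'a str>> = BTreeMap::new();
    for &(name, size) in files {
        groups.entry(size).or_default().push(name);
    }
    // A size that occurs only once cannot belong to a duplicate.
    groups.retain(|_size, names| names.len() > 1);
    groups
}

fn main() {
    // Pretend directory listing: two files share a size, one does not.
    let listing = [("a.rel", 120), ("b.sym", 64), ("c.rel", 120)];
    for (size, names) in &size_groups(&listing) {
        println!("{size} bytes: {names:?}");
    }

    // Read a 3-byte input into an 8-byte buffer that holds stale data.
    let mut buf = [0xAAu8; 8];
    let mut input = Cursor::new(b"abc".to_vec());
    let n = read_up_to(&mut input, &mut buf).expect("in-memory read");
    // Only `buf[..n]` holds data from this input; the rest is stale.
    println!("read {n} bytes: {:?}", &buf[..n]);
}
```

Only the first `n` bytes of the buffer are meaningful after a short read, which is why a reused buffer should be sliced to `&buf[..n]` before hashing. A similar bounded read can also be written with the standard library as `reader.by_ref().take(limit).read_to_end(&mut vec)`, at the cost of a growable `Vec` instead of a fixed buffer.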