├── .editorconfig ├── .gitignore ├── Ånnóying 𝚏Ⅰlęnąme by Łukasz ├── LICENSE ├── README.rst ├── pyproject.toml ├── src └── bitrot.py └── tests └── test_bitrot.py /.editorconfig: -------------------------------------------------------------------------------- 1 | root = true 2 | 3 | [*] 4 | trim_trailing_whitespace = true 5 | insert_final_newline = true 6 | 7 | [*.{py,pyx,pxd,pxi,yml,h}] 8 | indent_size = 4 9 | indent_style = space 10 | 11 | [ext/*.{c,cpp,h}] 12 | indent_size = 4 13 | indent_style = tab 14 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .bitrot.db 2 | .bitrot.sha512 3 | -------------------------------------------------------------------------------- /Ånnóying 𝚏Ⅰlęnąme by Łukasz: -------------------------------------------------------------------------------- 1 | This is a form of testing strange encodings. 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2013 Łukasz Langa 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | ====== 2 | bitrot 3 | ====== 4 | 5 | Detects bit rotten files on the hard drive to save your precious photo 6 | and music collection from slow decay. 7 | 8 | Usage 9 | ----- 10 | 11 | Go to the desired directory and simply invoke:: 12 | 13 | $ bitrot 14 | 15 | This will start digging through your directory structure recursively 16 | indexing all files found. The index is stored in a ``.bitrot.db`` file 17 | which is a SQLite 3 database. 18 | 19 | Next time you run ``bitrot`` it will add new files and update the index 20 | for files with a changed modification date. Most importantly however, it 21 | will report all errors, e.g. files that changed on the hard drive but 22 | still have the same modification date. 23 | 24 | All paths stored in ``.bitrot.db`` are relative so it's safe to rescan 25 | a folder after moving it to another drive. Just remember to move it in 26 | a way that doesn't touch modification dates. Otherwise the checksum 27 | database is useless. 28 | 29 | Performance 30 | ----------- 31 | 32 | Obviously depends on how fast the underlying drive is. Historically 33 | the script was single-threaded because back in 2013 checksum 34 | calculations on a single core still outran typical drives, including 35 | the mobile SSDs of the day. 
In 2020 this is no longer the case so the 36 | script now uses a process pool to calculate SHA1 hashes and perform 37 | `stat()` calls. 38 | 39 | No rigorous performance tests have been done. Scanning a ~1000 file 40 | directory totalling ~5 GB takes 2.2s on a 2018 MacBook Pro 15" with 41 | an AP0512M SSD. Back in 2013, that same feat on a 2015 MacBook Air with 42 | an SM0256G SSD took over 20 seconds. 43 | 44 | On that same 2018 MacBook Pro 15", scanning a 60+ GB music library takes 45 | 24 seconds. Back in 2013, with a typical 5400 RPM laptop hard drive 46 | it took around 15 minutes. How times have changed! 47 | 48 | Tests 49 | ----- 50 | 51 | There's a simple but comprehensive test scenario using 52 | `pytest `_ and 53 | `pytest-order `_. 54 | 55 | Install:: 56 | 57 | $ python3 -m venv .venv 58 | $ . .venv/bin/activate 59 | (.venv)$ pip install -e .[test] 60 | 61 | Run:: 62 | 63 | (.venv)$ pytest -x 64 | ==================== test session starts ==================== 65 | platform darwin -- Python 3.10.12, pytest-7.4.0, pluggy-1.2.0 66 | rootdir: /Users/ambv/Documents/Python/bitrot 67 | plugins: order-1.1.0 68 | collected 12 items 69 | 70 | tests/test_bitrot.py ............ 
[100%] 71 | 72 | ==================== 12 passed in 15.05s ==================== 73 | 74 | Change Log 75 | ---------- 76 | 77 | 1.0.1 78 | ~~~~~ 79 | 80 | * officially remove Python 2 support that was broken since 1.0.0 81 | anyway; now the package works with Python 3.8+ because of a few 82 | features 83 | 84 | 1.0.0 85 | ~~~~~ 86 | 87 | * significantly sped up execution on solid state drives by using 88 | a process pool executor to calculate SHA1 hashes and perform `stat()` 89 | calls; use `-w1` if your runs on slow magnetic drives were 90 | negatively affected by this change 91 | 92 | * sped up execution by pre-loading all SQLite-stored hashes to memory 93 | and doing comparisons using Python sets 94 | 95 | * all UTF-8 filenames are now normalized to NFKD in the database to 96 | enable cross-operating system checks 97 | 98 | * the SQLite database is now vacuumed to minimize its size 99 | 100 | * bugfix: additional Python 3 fixes when Unicode names were encountered 101 | 102 | 0.9.2 103 | ~~~~~ 104 | 105 | * bugfix: one place in the code incorrectly hardcoded UTF-8 as the 106 | filesystem encoding 107 | 108 | 0.9.1 109 | ~~~~~ 110 | 111 | * bugfix: print the path that failed to decode with FSENCODING 112 | 113 | * bugfix: when using -q, don't hide warnings about files that can't be 114 | statted or read 115 | 116 | * bugfix: -s is no longer broken on Python 3 117 | 118 | 0.9.0 119 | ~~~~~ 120 | 121 | * bugfix: bitrot.db checksum checking messages now obey --quiet 122 | 123 | * Python 3 compatibility 124 | 125 | 0.8.0 126 | ~~~~~ 127 | 128 | * bitrot now keeps track of its own database's bitrot by storing 129 | a checksum of .bitrot.db in .bitrot.sha512 130 | 131 | * bugfix: now properly uses the filesystem encoding to decode file names 132 | for use with the .bitrot.db database. Report and original patch by 133 | pallinger. 134 | 135 | 0.7.1 136 | ~~~~~ 137 | 138 | * bugfix: SHA1 computation now works correctly on Windows; previously 139 | opened files in text-mode. 
This fix will change hashes of files 140 | containing some specific bytes like 0x1A. 141 | 142 | 0.7.0 143 | ~~~~~ 144 | 145 | * when a file changes or is renamed, the timestamp of the last check is 146 | updated, too 147 | 148 | * bugfix: files that disappeared during the run are now properly ignored 149 | 150 | * bugfix: files that are locked or with otherwise denied access are 151 | skipped. If they were read before, they will be considered "missing" 152 | in the report. 153 | 154 | * bugfix: if there are multiple files with the same content in the 155 | scanned directory tree, renames are now handled properly for them 156 | 157 | * refactored some horrible code to be a little less horrible 158 | 159 | 0.6.0 160 | ~~~~~ 161 | 162 | * more control over performance with ``--commit-interval`` and 163 | ``--chunk-size`` command-line arguments 164 | 165 | * bugfix: symbolic links are now properly skipped (or can be followed if 166 | ``--follow-links`` is passed) 167 | 168 | * bugfix: files that cannot be opened are now gracefully skipped 169 | 170 | * bugfix: fixed a rare division by zero when run in an empty directory 171 | 172 | 0.5.1 173 | ~~~~~ 174 | 175 | * bugfix: warn about test mode only in test mode 176 | 177 | 0.5.0 178 | ~~~~~ 179 | 180 | * ``--test`` command-line argument for testing the state without 181 | updating the database on disk (works for testing databases you don't 182 | have write access to) 183 | 184 | * size of the data read is reported upon finish 185 | 186 | * minor performance updates 187 | 188 | 0.4.0 189 | ~~~~~ 190 | 191 | * renames are now reported as such 192 | 193 | * all non-regular files (e.g. 
symbolic links, pipes, sockets) are now 194 | skipped 195 | 196 | * progress presented in percentage 197 | 198 | 0.3.0 199 | ~~~~~ 200 | 201 | * ``--sum`` command-line argument for easy comparison of multiple 202 | databases 203 | 204 | 0.2.1 205 | ~~~~~ 206 | 207 | * fixed regression from 0.2.0 where new files caused a ``KeyError`` 208 | exception 209 | 210 | 0.2.0 211 | ~~~~~ 212 | 213 | * ``--verbose`` and ``--quiet`` command-line arguments 214 | 215 | * if a file is no longer there, its entry is removed from the database 216 | 217 | 0.1.0 218 | ~~~~~ 219 | 220 | * First published version. 221 | 222 | Authors 223 | ------- 224 | 225 | Glued together by `Łukasz Langa `_. Multiple 226 | improvements by 227 | `Ben Shepherd `_, 228 | `Jean-Louis Fuchs `_, 229 | `Marcus Linderoth `_, 230 | `p1r473 `_, 231 | `Peter Hofmann `_, 232 | `Phil Lundrigan `_, 233 | `Reid Williams `_, 234 | `Stan Senotrusov `_, 235 | `Yang Zhang `_, and 236 | `Zhuoyun Wei `_. 237 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [build-system] 2 | requires = ["setuptools", "setuptools-scm[toml]"] 3 | build-backend = "setuptools.build_meta" 4 | 5 | [project] 6 | name = "bitrot" 7 | authors = [ 8 | {name = "Łukasz Langa", email = "lukasz@langa.pl"}, 9 | ] 10 | description = "Detects bit rotten files on the hard drive to save your precious photo and music collection from slow decay." 
11 | readme = "README.rst" 12 | requires-python = ">=3.8" 13 | keywords = ["file", "checksum", "database"] 14 | license = {text = "MIT"} 15 | classifiers = [ 16 | "Development Status :: 5 - Production/Stable", 17 | "Natural Language :: English", 18 | "Programming Language :: Python :: 3", 19 | "Topic :: System :: Filesystems", 20 | "Topic :: System :: Monitoring", 21 | "Topic :: Software Development :: Libraries :: Python Modules", 22 | 23 | ] 24 | dependencies = [] 25 | dynamic = ["version"] 26 | 27 | [project.optional-dependencies] 28 | test = ["pytest", "pytest-order"] 29 | 30 | [project.scripts] 31 | bitrot = "bitrot:run_from_command_line" 32 | 33 | [tool.setuptools_scm] 34 | tag_regex = "^(?P<version>v\\d+(?:\\.\\d+){0,2}[^\\+]*)(?:\\+.*)?$" 35 | -------------------------------------------------------------------------------- /src/bitrot.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python3 2 | 3 | # Copyright (C) 2013 by Łukasz Langa 4 | 5 | # Permission is hereby granted, free of charge, to any person obtaining a copy 6 | # of this software and associated documentation files (the "Software"), to deal 7 | # in the Software without restriction, including without limitation the rights 8 | # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | # copies of the Software, and to permit persons to whom the Software is 10 | # furnished to do so, subject to the following conditions: 11 | 12 | # The above copyright notice and this permission notice shall be included in 13 | # all copies or substantial portions of the Software. 14 | 15 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 21 | # THE SOFTWARE. 22 | 23 | from __future__ import annotations 24 | 25 | import argparse 26 | import atexit 27 | import datetime 28 | import errno 29 | import hashlib 30 | import os 31 | import shutil 32 | import sqlite3 33 | import stat 34 | import sys 35 | import tempfile 36 | import time 37 | import unicodedata 38 | 39 | from concurrent.futures import ProcessPoolExecutor, as_completed 40 | from multiprocessing import freeze_support 41 | from importlib.metadata import version, PackageNotFoundError 42 | 43 | 44 | DEFAULT_CHUNK_SIZE = 16384 # block size in HFS+; 4X the block size in ext4 45 | DOT_THRESHOLD = 200 46 | IGNORED_FILE_SYSTEM_ERRORS = {errno.ENOENT, errno.EACCES} 47 | FSENCODING = sys.getfilesystemencoding() 48 | try: 49 | VERSION = version("bitrot") 50 | except PackageNotFoundError: 51 | VERSION = "1.0.1" 52 | 53 | 54 | def normalize_path(path): 55 | path_uni = path.decode(FSENCODING) 56 | if FSENCODING in ('utf-8', 'UTF-8'): 57 | return unicodedata.normalize('NFKD', path_uni) 58 | 59 | return path_uni 60 | 61 | 62 | def sha1(path, chunk_size): 63 | digest = hashlib.sha1() 64 | with open(path, 'rb') as f: 65 | d = f.read(chunk_size) 66 | while d: 67 | digest.update(d) 68 | d = f.read(chunk_size) 69 | return digest.hexdigest() 70 | 71 | 72 | def ts(): 73 | return datetime.datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S%z') 74 | 75 | 76 | def get_sqlite3_cursor(path, copy=False): 77 | path = path.decode(FSENCODING) 78 | if copy: 79 | if not os.path.exists(path): 80 | raise ValueError("error: bitrot database at {} does not exist." 
81 | "".format(path)) 82 | db_copy = tempfile.NamedTemporaryFile(prefix='bitrot_', suffix='.db', 83 | delete=False) 84 | with open(path, 'rb') as db_orig: 85 | try: 86 | shutil.copyfileobj(db_orig, db_copy) 87 | finally: 88 | db_copy.close() 89 | path = db_copy.name 90 | atexit.register(os.unlink, path) 91 | conn = sqlite3.connect(path) 92 | atexit.register(conn.close) 93 | cur = conn.cursor() 94 | tables = set(t for t, in cur.execute('SELECT name FROM sqlite_master')) 95 | if 'bitrot' not in tables: 96 | cur.execute('CREATE TABLE bitrot (path TEXT PRIMARY KEY, ' 97 | 'mtime INTEGER, hash TEXT, timestamp TEXT)') 98 | if 'bitrot_hash_idx' not in tables: 99 | cur.execute('CREATE INDEX bitrot_hash_idx ON bitrot (hash)') 100 | atexit.register(conn.commit) 101 | return conn 102 | 103 | 104 | def list_existing_paths(directory, expected=(), ignored=(), follow_links=False): 105 | """list_existing_paths(b'/dir') -> ([path1, path2, ...], total_size) 106 | 107 | Returns a tuple with a set of existing files in `directory` and its subdirectories 108 | and their `total_size`. If directory was a bytes object, so will be the returned 109 | paths. 110 | 111 | Doesn't add entries listed in `ignored`. Doesn't add symlinks if 112 | `follow_links` is False (the default). All entries present in `expected` 113 | must be files (can't be directories or symlinks). 
114 | """ 115 | paths = set() 116 | total_size = 0 117 | for path, _, files in os.walk(directory): 118 | for f in files: 119 | p = os.path.join(path, f) 120 | try: 121 | p_uni = p.decode(FSENCODING) 122 | except UnicodeDecodeError: 123 | binary_stderr = getattr(sys.stderr, 'buffer', sys.stderr) 124 | binary_stderr.write(b"warning: cannot decode file name: ") 125 | binary_stderr.write(p) 126 | binary_stderr.write(b"\n") 127 | continue 128 | 129 | try: 130 | if follow_links or p_uni in expected: 131 | st = os.stat(p) 132 | else: 133 | st = os.lstat(p) 134 | except OSError as ex: 135 | if ex.errno not in IGNORED_FILE_SYSTEM_ERRORS: 136 | raise 137 | else: 138 | if not stat.S_ISREG(st.st_mode) or p in ignored: 139 | continue 140 | paths.add(p) 141 | total_size += st.st_size 142 | return paths, total_size 143 | 144 | 145 | def compute_one(path, chunk_size): 146 | """Return a tuple with (unicode path, size, mtime, sha1). Takes a binary path.""" 147 | p_uni = normalize_path(path) 148 | try: 149 | st = os.stat(path) 150 | except OSError as ex: 151 | if ex.errno in IGNORED_FILE_SYSTEM_ERRORS: 152 | # The file disappeared between listing existing paths and 153 | # this run or is (temporarily?) locked with different 154 | # permissions. We'll just skip it for now. 155 | print( 156 | '\rwarning: `{}` is currently unavailable for ' 157 | 'reading: {}'.format( 158 | p_uni, ex, 159 | ), 160 | file=sys.stderr, 161 | ) 162 | raise BitrotException 163 | 164 | raise # Not expected? 
https://github.com/ambv/bitrot/issues/ 165 | 166 | try: 167 | new_sha1 = sha1(path, chunk_size) 168 | except (IOError, OSError) as e: 169 | print( 170 | '\rwarning: cannot compute hash of {} [{}]'.format( 171 | p_uni, errno.errorcode[e.args[0]], 172 | ), 173 | file=sys.stderr, 174 | ) 175 | raise BitrotException 176 | 177 | return p_uni, st.st_size, int(st.st_mtime), new_sha1 178 | 179 | 180 | class BitrotException(Exception): 181 | pass 182 | 183 | 184 | class Bitrot(object): 185 | def __init__( 186 | self, verbosity=1, test=False, follow_links=False, commit_interval=300, 187 | chunk_size=DEFAULT_CHUNK_SIZE, workers=os.cpu_count(), 188 | ): 189 | self.verbosity = verbosity 190 | self.test = test 191 | self.follow_links = follow_links 192 | self.commit_interval = commit_interval 193 | self.chunk_size = chunk_size 194 | self._last_reported_size = '' 195 | self._last_commit_ts = 0 196 | self.pool = ProcessPoolExecutor(max_workers=workers) 197 | 198 | def maybe_commit(self, conn): 199 | if time.time() < self._last_commit_ts + self.commit_interval: 200 | # no time for commit yet! 201 | return 202 | 203 | conn.commit() 204 | self._last_commit_ts = time.time() 205 | 206 | def run(self): 207 | check_sha512_integrity(verbosity=self.verbosity) 208 | 209 | bitrot_db = get_path() 210 | bitrot_sha512 = get_path(ext=b'sha512') 211 | try: 212 | conn = get_sqlite3_cursor(bitrot_db, copy=self.test) 213 | except ValueError: 214 | raise BitrotException( 215 | 2, 216 | 'No database exists so cannot test. 
Run the tool once first.', 217 | ) 218 | 219 | cur = conn.cursor() 220 | new_paths = [] 221 | updated_paths = [] 222 | renamed_paths = [] 223 | errors = [] 224 | current_size = 0 225 | missing_paths = self.select_all_paths(cur) 226 | hashes = self.select_all_hashes(cur) 227 | paths, total_size = list_existing_paths( 228 | b'.', expected=missing_paths, ignored={bitrot_db, bitrot_sha512}, 229 | follow_links=self.follow_links, 230 | ) 231 | paths_uni = set(normalize_path(p) for p in paths) 232 | futures = [self.pool.submit(compute_one, p, self.chunk_size) for p in paths] 233 | 234 | for future in as_completed(futures): 235 | try: 236 | p_uni, new_size, new_mtime, new_sha1 = future.result() 237 | except BitrotException: 238 | continue 239 | 240 | current_size += new_size 241 | if self.verbosity: 242 | self.report_progress(current_size, total_size) 243 | 244 | if p_uni not in missing_paths: 245 | # We are not expecting this path, it wasn't in the database yet. 246 | # It's either new or a rename. Let's handle that. 247 | stored_path = self.handle_unknown_path( 248 | cur, p_uni, new_mtime, new_sha1, paths_uni, hashes 249 | ) 250 | self.maybe_commit(conn) 251 | if p_uni == stored_path: 252 | new_paths.append(p_uni) 253 | missing_paths.discard(p_uni) 254 | else: 255 | renamed_paths.append((stored_path, p_uni)) 256 | missing_paths.discard(stored_path) 257 | continue 258 | 259 | # At this point we know we're seeing an expected file. 260 | missing_paths.discard(p_uni) 261 | cur.execute('SELECT mtime, hash, timestamp FROM bitrot WHERE path=?', 262 | (p_uni,)) 263 | row = cur.fetchone() 264 | if not row: 265 | print( 266 | '\rwarning: path disappeared from the database while running:', 267 | p_uni, 268 | file=sys.stderr, 269 | ) 270 | continue 271 | 272 | stored_mtime, stored_sha1, stored_ts = row 273 | if int(stored_mtime) != new_mtime: 274 | updated_paths.append(p_uni) 275 | cur.execute('UPDATE bitrot SET mtime=?, hash=?, timestamp=? 
' 276 | 'WHERE path=?', 277 | (new_mtime, new_sha1, ts(), p_uni)) 278 | self.maybe_commit(conn) 279 | continue 280 | 281 | if stored_sha1 != new_sha1: 282 | errors.append(p_uni) 283 | print( 284 | '\rerror: SHA1 mismatch for {}: expected {}, got {}.' 285 | ' Last good hash checked on {}.'.format( 286 | p_uni, stored_sha1, new_sha1, stored_ts 287 | ), 288 | file=sys.stderr, 289 | ) 290 | 291 | for path in missing_paths: 292 | cur.execute('DELETE FROM bitrot WHERE path=?', (path,)) 293 | 294 | conn.commit() 295 | 296 | if not self.test: 297 | cur.execute('vacuum') 298 | 299 | if self.verbosity: 300 | cur.execute('SELECT COUNT(path) FROM bitrot') 301 | all_count = cur.fetchone()[0] 302 | self.report_done( 303 | total_size, 304 | all_count, 305 | len(errors), 306 | new_paths, 307 | updated_paths, 308 | renamed_paths, 309 | missing_paths, 310 | ) 311 | 312 | update_sha512_integrity(verbosity=self.verbosity) 313 | 314 | if errors: 315 | raise BitrotException( 316 | 1, 'There were {} errors found.'.format(len(errors)), errors, 317 | ) 318 | 319 | def select_all_paths(self, cur): 320 | """Return a set of all distinct paths in the bitrot database. 321 | 322 | The paths are Unicode and are normalized if FSENCODING was UTF-8. 323 | """ 324 | result = set() 325 | cur.execute('SELECT path FROM bitrot') 326 | row = cur.fetchone() 327 | while row: 328 | result.add(row[0]) 329 | row = cur.fetchone() 330 | return result 331 | 332 | def select_all_hashes(self, cur): 333 | """Return a dict where keys are hashes and values are sets of paths. 334 | 335 | The paths are Unicode and are normalized if FSENCODING was UTF-8. 
336 | """ 337 | result = {} 338 | cur.execute('SELECT hash, path FROM bitrot') 339 | row = cur.fetchone() 340 | while row: 341 | rhash, rpath = row 342 | result.setdefault(rhash, set()).add(rpath) 343 | row = cur.fetchone() 344 | return result 345 | 346 | def report_progress(self, current_size, total_size): 347 | size_fmt = '\r{:>6.1%}'.format(current_size/(total_size or 1)) 348 | if size_fmt == self._last_reported_size: 349 | return 350 | 351 | sys.stdout.write(size_fmt) 352 | sys.stdout.flush() 353 | self._last_reported_size = size_fmt 354 | 355 | def report_done( 356 | self, total_size, all_count, error_count, new_paths, updated_paths, 357 | renamed_paths, missing_paths): 358 | """Print a report on what happened. All paths should be Unicode here.""" 359 | print('\rFinished. {:.2f} MiB of data read. {} errors found.' 360 | ''.format(total_size/1024/1024, error_count)) 361 | if self.verbosity == 1: 362 | print( 363 | '{} entries in the database, {} new, {} updated, ' 364 | '{} renamed, {} missing.'.format( 365 | all_count, len(new_paths), len(updated_paths), 366 | len(renamed_paths), len(missing_paths), 367 | ), 368 | ) 369 | elif self.verbosity > 1: 370 | print('{} entries in the database.'.format(all_count), end=' ') 371 | if new_paths: 372 | print('{} entries new:'.format(len(new_paths))) 373 | new_paths.sort() 374 | for path in new_paths: 375 | print(' ', path) 376 | if updated_paths: 377 | print('{} entries updated:'.format(len(updated_paths))) 378 | updated_paths.sort() 379 | for path in updated_paths: 380 | print(' ', path) 381 | if renamed_paths: 382 | print('{} entries renamed:'.format(len(renamed_paths))) 383 | renamed_paths.sort() 384 | for path in renamed_paths: 385 | print( 386 | ' from', 387 | path[0], 388 | 'to', 389 | path[1], 390 | ) 391 | if missing_paths: 392 | print('{} entries missing:'.format(len(missing_paths))) 393 | missing_paths = sorted(missing_paths) 394 | for path in missing_paths: 395 | print(' ', path) 396 | if not any((new_paths, 
updated_paths, missing_paths)): 397 | print() 398 | if self.test and self.verbosity: 399 | print('warning: database file not updated on disk (test mode).') 400 | 401 | def handle_unknown_path(self, cur, new_path, new_mtime, new_sha1, paths_uni, hashes): 402 | """Either add a new entry to the database or update the existing entry 403 | on rename. 404 | 405 | `cur` is the database cursor. `new_path` is the new Unicode path. 406 | `paths_uni` are Unicode paths seen on disk during this run of Bitrot. 407 | `hashes` is a dictionary selected from the database, keys are hashes, values 408 | are sets of Unicode paths that are stored in the DB under the given hash. 409 | 410 | Returns `new_path` if the entry was indeed new or the `old_path` (e.g. 411 | outdated path stored in the database for this hash) if there was a rename. 412 | """ 413 | 414 | for old_path in hashes.get(new_sha1, ()): 415 | if old_path not in paths_uni: 416 | # File of the same hash used to exist but no longer does. 417 | # Let's treat `new_path` as a renamed version of that `old_path`. 418 | cur.execute( 419 | 'UPDATE bitrot SET mtime=?, path=?, timestamp=? WHERE path=?', 420 | (new_mtime, new_path, ts(), old_path), 421 | ) 422 | return old_path 423 | 424 | else: 425 | # Either we haven't found `new_sha1` at all in the database, or all 426 | # currently stored paths for this hash still point to existing files. 427 | # Let's insert a new entry for what appears to be a new file. 428 | cur.execute( 429 | 'INSERT INTO bitrot VALUES (?, ?, ?, ?)', 430 | (new_path, new_mtime, new_sha1, ts()), 431 | ) 432 | return new_path 433 | 434 | def get_path(directory=b'.', ext=b'db'): 435 | """Compose the path to the selected bitrot file.""" 436 | return os.path.join(directory, b'.bitrot.' + ext) 437 | 438 | 439 | def stable_sum(bitrot_db=None): 440 | """Calculates a stable SHA512 of all entries in the database. 
441 | 442 | Useful for comparing if two directories hold the same data, as it ignores 443 | timing information.""" 444 | if bitrot_db is None: 445 | bitrot_db = get_path() 446 | digest = hashlib.sha512() 447 | conn = get_sqlite3_cursor(bitrot_db) 448 | cur = conn.cursor() 449 | cur.execute('SELECT hash FROM bitrot ORDER BY path') 450 | row = cur.fetchone() 451 | while row: 452 | digest.update(row[0].encode('ascii')) 453 | row = cur.fetchone() 454 | return digest.hexdigest() 455 | 456 | 457 | def check_sha512_integrity(verbosity=1): 458 | sha512_path = get_path(ext=b'sha512') 459 | if not os.path.exists(sha512_path): 460 | return 461 | 462 | if verbosity: 463 | print('Checking bitrot.db integrity... ', end='') 464 | sys.stdout.flush() 465 | with open(sha512_path, 'rb') as f: 466 | old_sha512 = f.read().strip() 467 | bitrot_db = get_path() 468 | digest = hashlib.sha512() 469 | with open(bitrot_db, 'rb') as f: 470 | digest.update(f.read()) 471 | new_sha512 = digest.hexdigest().encode('ascii') 472 | if new_sha512 != old_sha512: 473 | if verbosity: 474 | if len(old_sha512) == 128: 475 | print( 476 | "error: SHA512 of the file is different, bitrot.db might " 477 | "be corrupt.", 478 | ) 479 | else: 480 | print( 481 | "error: SHA512 of the file is different but bitrot.sha512 " 482 | "has a suspicious length. 
It might be corrupt.", 483 | ) 484 | print( 485 | "If you'd like to continue anyway, delete the .bitrot.sha512 " 486 | "file and try again.", 487 | file=sys.stderr, 488 | ) 489 | raise BitrotException( 490 | 3, 'bitrot.db integrity check failed, cannot continue.', 491 | ) 492 | 493 | if verbosity: 494 | print('ok.') 495 | 496 | def update_sha512_integrity(verbosity=1): 497 | old_sha512 = 0 498 | sha512_path = get_path(ext=b'sha512') 499 | if os.path.exists(sha512_path): 500 | with open(sha512_path, 'rb') as f: 501 | old_sha512 = f.read().strip() 502 | bitrot_db = get_path() 503 | digest = hashlib.sha512() 504 | with open(bitrot_db, 'rb') as f: 505 | digest.update(f.read()) 506 | new_sha512 = digest.hexdigest().encode('ascii') 507 | if new_sha512 != old_sha512: 508 | if verbosity: 509 | print('Updating bitrot.sha512... ', end='') 510 | sys.stdout.flush() 511 | with open(sha512_path, 'wb') as f: 512 | f.write(new_sha512) 513 | if verbosity: 514 | print('done.') 515 | 516 | def run_from_command_line(): 517 | global FSENCODING 518 | 519 | freeze_support() 520 | 521 | parser = argparse.ArgumentParser(prog='bitrot') 522 | parser.add_argument( 523 | '-l', '--follow-links', action='store_true', 524 | help='follow symbolic links and store target files\' hashes. Once ' 525 | 'a path is present in the database, it will be checked against ' 526 | 'changes in content even if it becomes a symbolic link. In ' 527 | 'other words, if you run `bitrot -l`, on subsequent runs ' 528 | 'symbolic links registered during the first run will be ' 529 | 'properly followed and checked even if you run without `-l`.') 530 | parser.add_argument( 531 | '-q', '--quiet', action='store_true', 532 | help='don\'t print anything besides checksum errors') 533 | parser.add_argument( 534 | '-s', '--sum', action='store_true', 535 | help='using only the data already gathered, return a SHA-512 sum ' 536 | 'of hashes of all the entries in the database. 
No timestamps ' 537 | 'are used in calculation.') 538 | parser.add_argument( 539 | '-v', '--verbose', action='store_true', 540 | help='list new, updated and missing entries') 541 | parser.add_argument( 542 | '-t', '--test', action='store_true', 543 | help='just test against an existing database, don\'t update anything') 544 | parser.add_argument( 545 | '--version', action='version', 546 | version=f"%(prog)s {VERSION}") 547 | parser.add_argument( 548 | '--commit-interval', type=float, default=300, 549 | help='min time in seconds between commits ' 550 | '(0 commits on every operation)') 551 | parser.add_argument( 552 | '-w', '--workers', type=int, default=os.cpu_count(), 553 | help='run this many workers (use -w1 for slow magnetic disks)') 554 | parser.add_argument( 555 | '--chunk-size', type=int, default=DEFAULT_CHUNK_SIZE, 556 | help='read files this many bytes at a time') 557 | parser.add_argument( 558 | '--fsencoding', default='', 559 | help='override the codec to decode filenames, otherwise taken from ' 560 | 'the LANG environment variables') 561 | args = parser.parse_args() 562 | if args.sum: 563 | try: 564 | print(stable_sum()) 565 | except RuntimeError as e: 566 | print(str(e).encode('utf8'), file=sys.stderr) 567 | else: 568 | verbosity = 1 569 | if args.quiet: 570 | verbosity = 0 571 | elif args.verbose: 572 | verbosity = 2 573 | bt = Bitrot( 574 | verbosity=verbosity, 575 | test=args.test, 576 | follow_links=args.follow_links, 577 | commit_interval=args.commit_interval, 578 | chunk_size=args.chunk_size, 579 | workers=args.workers, 580 | ) 581 | if args.fsencoding: 582 | FSENCODING = args.fsencoding 583 | try: 584 | bt.run() 585 | except BitrotException as bre: 586 | print('error:', bre.args[1], file=sys.stderr) 587 | sys.exit(bre.args[0]) 588 | 589 | 590 | if __name__ == '__main__': 591 | run_from_command_line() 592 | -------------------------------------------------------------------------------- /tests/test_bitrot.py: 
--------------------------------------------------------------------------------
"""
NOTE: these tests are ordered and require pytest-order to run correctly.
"""

from __future__ import annotations

import getpass
import os
from pathlib import Path
import shlex
import shutil
import subprocess
import sys
from textwrap import dedent

import pytest


TMP = Path("/tmp/")


ReturnCode = int
StdOut = list[str]
StdErr = list[str]


def bitrot(*args: str) -> tuple[ReturnCode, StdOut, StdErr]:
    cmd = [sys.executable, "-m", "bitrot"]
    cmd.extend(args)
    res = subprocess.run(shlex.join(cmd), shell=True, capture_output=True)
    stdout = (res.stdout or b"").decode("utf8")
    stderr = (res.stderr or b"").decode("utf8")
    return res.returncode, lines(stdout), lines(stderr)


def bash(script, empty_dir: bool = False) -> bool:
    username = getpass.getuser()
    test_dir = TMP / f"bitrot-dir-{username}"
    if empty_dir and test_dir.is_dir():
        os.chdir(TMP)
        shutil.rmtree(test_dir)
    test_dir.mkdir(exist_ok=True)
    os.chdir(test_dir)

    preamble = """
        set -euxo pipefail
        LC_ALL=en_US.UTF-8
        LANG=en_US.UTF-8
    """

    if script:
        # We need to wait a second for modification timestamps to differ so that
        # the ordering of the output stays the same every run of the tests.
        preamble += """
        sleep 1
        """

    script_path = TMP / "bitrot-test.bash"
    script_path.write_text(dedent(preamble + script))
    script_path.chmod(0o755)

    out = subprocess.run(["bash", str(script_path)], capture_output=True)
    if out.returncode:
        print(f"Non-zero return code {out.returncode} when running {script_path}")
        if out.stdout:
            print(out.stdout)
        if out.stderr:
            print(out.stderr)
        return False
    return True


def lines(s: str) -> list[str]:
    r"""Only return non-empty lines that weren't killed by \r."""
    return [
        line.rstrip()
        for line in s.splitlines(keepends=True)
        if line and line.rstrip() and line[-1] != "\r"
    ]


@pytest.mark.order(1)
def test_command_exists() -> None:
    rc, out, err = bitrot("--help")
    assert rc == 0
    assert not err
    assert out[0].startswith("usage:")

    assert bash("", empty_dir=True)


@pytest.mark.order(2)
def test_new_files_in_a_tree_dir() -> None:
    assert bash(
        """
        mkdir -p nonemptydirs/dir2/
        touch nonemptydirs/dir2/new-file-{a,b}.txt
        echo $RANDOM >> nonemptydirs/dir2/new-file-b.txt
        """
    )
    rc, out, err = bitrot("-v")
    assert rc == 0
    assert not err
    # assert out[0] == "Finished. 0.00 MiB of data read. 0 errors found."
    assert out[1] == "2 entries in the database. 2 entries new:"
    assert out[2] == " ./nonemptydirs/dir2/new-file-a.txt"
    assert out[3] == " ./nonemptydirs/dir2/new-file-b.txt"
    assert out[4] == "Updating bitrot.sha512... done."


@pytest.mark.order(3)
def test_modified_files_in_a_tree_dir() -> None:
    assert bash(
        """
        echo $RANDOM >> nonemptydirs/dir2/new-file-a.txt
        """
    )
    rc, out, err = bitrot("-v")
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    assert out[2] == "2 entries in the database. 1 entries updated:"
    assert out[3] == " ./nonemptydirs/dir2/new-file-a.txt"
    assert out[4] == "Updating bitrot.sha512... done."


@pytest.mark.order(4)
def test_renamed_files_in_a_tree_dir() -> None:
    assert bash(
        """
        mv nonemptydirs/dir2/new-file-a.txt nonemptydirs/dir2/new-file-a.txt2
        """
    )
    rc, out, err = bitrot("-v")
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    assert out[2] == "2 entries in the database. 1 entries renamed:"
    o3 = " from ./nonemptydirs/dir2/new-file-a.txt to ./nonemptydirs/dir2/new-file-a.txt2"
    assert out[3] == o3
    assert out[4] == "Updating bitrot.sha512... done."


@pytest.mark.order(5)
def test_deleted_files_in_a_tree_dir() -> None:
    assert bash(
        """
        rm nonemptydirs/dir2/new-file-a.txt2
        """
    )
    rc, out, err = bitrot("-v")
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    assert out[2] == "1 entries in the database. 1 entries missing:"
    assert out[3] == " ./nonemptydirs/dir2/new-file-a.txt2"
    assert out[4] == "Updating bitrot.sha512... done."


@pytest.mark.order(5)
def test_new_files_and_modified_files_in_a_tree_dir() -> None:
    assert bash(
        """
        for fil in {a,b,c,d,e,f,g}; do
            echo $fil >> more-files-$fil.txt
        done
        echo $RANDOM >> nonemptydirs/dir2/new-file-b.txt
        """
    )
    rc, out, err = bitrot("-v")
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    assert out[2] == "8 entries in the database. 7 entries new:"
    assert out[3] == " ./more-files-a.txt"
    assert out[4] == " ./more-files-b.txt"
    assert out[5] == " ./more-files-c.txt"
    assert out[6] == " ./more-files-d.txt"
    assert out[7] == " ./more-files-e.txt"
    assert out[8] == " ./more-files-f.txt"
    assert out[9] == " ./more-files-g.txt"
    assert out[10] == "1 entries updated:"
    assert out[11] == " ./nonemptydirs/dir2/new-file-b.txt"
    assert out[12] == "Updating bitrot.sha512... done."


@pytest.mark.order(6)
def test_new_files_modified_deleted_and_moved_in_a_tree_dir() -> None:
    assert bash(
        """
        for fil in {a,b,c,d,e,f,g}; do
            echo $fil $RANDOM >> nonemptydirs/pl-more-files-$fil.txt
        done
        echo $RANDOM >> nonemptydirs/dir2/new-file-b.txt
        mv more-files-a.txt more-files-a.txt2
        rm more-files-g.txt
        """
    )
    rc, out, err = bitrot("-v")
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    assert out[2] == "14 entries in the database. 7 entries new:"
    assert out[3] == " ./nonemptydirs/pl-more-files-a.txt"
    assert out[4] == " ./nonemptydirs/pl-more-files-b.txt"
    assert out[5] == " ./nonemptydirs/pl-more-files-c.txt"
    assert out[6] == " ./nonemptydirs/pl-more-files-d.txt"
    assert out[7] == " ./nonemptydirs/pl-more-files-e.txt"
    assert out[8] == " ./nonemptydirs/pl-more-files-f.txt"
    assert out[9] == " ./nonemptydirs/pl-more-files-g.txt"
    assert out[10] == "1 entries updated:"
    assert out[11] == " ./nonemptydirs/dir2/new-file-b.txt"
    assert out[12] == "1 entries renamed:"
    assert out[13] == " from ./more-files-a.txt to ./more-files-a.txt2"
    assert out[14] == "1 entries missing:"
    assert out[15] == " ./more-files-g.txt"
    assert out[16] == "Updating bitrot.sha512... done."


@pytest.mark.order(7)
def test_new_files_modified_deleted_and_moved_in_a_tree_dir_2() -> None:
    assert bash(
        """
        for fil in {a,b,c,d,e,f,g}; do
            echo $RANDOM >> nonemptydirs/pl2-more-files-$fil.txt
        done
        echo $RANDOM >> nonemptydirs/pl-more-files-a.txt
        mv nonemptydirs/pl-more-files-b.txt nonemptydirs/pl-more-files-b.txt2
        cp nonemptydirs/pl-more-files-g.txt nonemptydirs/pl2-more-files-g.txt2
        cp nonemptydirs/pl-more-files-d.txt nonemptydirs/pl2-more-files-d.txt2
        rm more-files-f.txt nonemptydirs/pl-more-files-c.txt
        """
    )
    rc, out, err = bitrot("-v")
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    assert out[2] == "21 entries in the database. 9 entries new:"
    assert out[3] == " ./nonemptydirs/pl2-more-files-a.txt"
    assert out[4] == " ./nonemptydirs/pl2-more-files-b.txt"
    assert out[5] == " ./nonemptydirs/pl2-more-files-c.txt"
    assert out[6] == " ./nonemptydirs/pl2-more-files-d.txt"
    assert out[7] == " ./nonemptydirs/pl2-more-files-d.txt2"
    assert out[8] == " ./nonemptydirs/pl2-more-files-e.txt"
    assert out[9] == " ./nonemptydirs/pl2-more-files-f.txt"
    assert out[10] == " ./nonemptydirs/pl2-more-files-g.txt"
    assert out[11] == " ./nonemptydirs/pl2-more-files-g.txt2"
    assert out[12] == "1 entries updated:"
    assert out[13] == " ./nonemptydirs/pl-more-files-a.txt"
    assert out[14] == "1 entries renamed:"
    o15 = " from ./nonemptydirs/pl-more-files-b.txt to ./nonemptydirs/pl-more-files-b.txt2"
    assert out[15] == o15
    assert out[16] == "2 entries missing:"
    assert out[17] == " ./more-files-f.txt"
    assert out[18] == " ./nonemptydirs/pl-more-files-c.txt"
    assert out[19] == "Updating bitrot.sha512... done."


@pytest.mark.order(8)
def test_3278_files() -> None:
    assert bash(
        """
        mkdir -p alotfiles/here; cd alotfiles/here
        # create a 320KB file
        dd if=/dev/urandom of=masterfile bs=1 count=327680
        # split it in 3277 files (instantly) + masterfile = 3278
        split -b 100 -a 10 masterfile
        """
    )
    rc, out, err = bitrot()
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    o2 = "3299 entries in the database, 3278 new, 0 updated, 0 renamed, 0 missing."
    assert out[2] == o2


@pytest.mark.order(9)
def test_3278_files_2() -> None:
    assert bash(
        """
        mv alotfiles/here alotfiles/here-moved
        """
    )
    rc, out, err = bitrot()
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    o2 = "3299 entries in the database, 0 new, 0 updated, 3278 renamed, 0 missing."
    assert out[2] == o2


@pytest.mark.order(10)
def test_rotten_file() -> None:
    assert bash(
        """
        touch non-rotten-file
        dd if=/dev/zero of=rotten-file bs=1k count=1000 &>/dev/null
        # let's make sure they share the same timestamp
        touch -r non-rotten-file rotten-file
        """
    )
    rc, out, err = bitrot("-v")
    assert rc == 0
    assert not err
    assert out[0] == "Checking bitrot.db integrity... ok."
    # assert out[1] == "Finished. 0.00 MiB of data read. 0 errors found."
    assert out[2] == "3301 entries in the database. 2 entries new:"
    assert out[3] == " ./non-rotten-file"
    assert out[4] == " ./rotten-file"


@pytest.mark.order(11)
def test_rotten_file_2() -> None:
    assert bash(
        """
        # modify the rotten file...
        dd if=/dev/urandom of=rotten-file bs=1k count=10 seek=1k conv=notrunc &>/dev/null
        # ...but revert the modification date
        touch -r non-rotten-file rotten-file
        """
    )
    rc, out, err = bitrot("-q")
    assert rc == 1
    assert not out
    e = (
        "error: SHA1 mismatch for ./rotten-file: expected"
        " 8fee1653e234fee8513245d3cb3e3c06d071493e, got"
    )
    assert err[0].startswith(e)
    assert err[1] == "error: There were 1 errors found."


@pytest.mark.order("last")
def test_cleanup() -> None:
    username = getpass.getuser()
    test_dir = TMP / f"bitrot-dir-{username}"
    if test_dir.is_dir():
        os.chdir(TMP)
        shutil.rmtree(test_dir)

--------------------------------------------------------------------------------
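
The `test_rotten_file` and `test_rotten_file_2` tests exercise the central idea of the tool: a file whose contents changed while its modification time stayed the same is reported as an error, whereas a changed timestamp marks an ordinary update. The sketch below illustrates that comparison in isolation. It is only an approximation under stated assumptions: `sha1_of`, `classify`, and `demo` are hypothetical names, the recorded state is a plain tuple rather than the real `.bitrot.db` contents, and nothing here is taken from `bitrot.py` itself.

```python
import hashlib
import os
import tempfile


def sha1_of(path, chunk_size=16384):
    """Hash a file in fixed-size chunks so large files need little memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify(path, recorded_mtime_ns, recorded_sha1):
    """Hypothetical helper: compare a file against its recorded state."""
    if sha1_of(path) == recorded_sha1:
        return "ok"
    if os.stat(path).st_mtime_ns != recorded_mtime_ns:
        return "updated"  # contents and timestamp both changed
    return "rotten"  # contents changed, timestamp did not


def demo():
    """Reproduce the rotten-file scenario on a throwaway file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "rotten-file")
        with open(path, "wb") as f:
            f.write(b"\0" * 1024)
        st = os.stat(path)
        recorded = (st.st_mtime_ns, sha1_of(path))

        before = classify(path, *recorded)

        # Change the contents, then restore the old timestamps,
        # similar in spirit to `touch -r` in the test script.
        with open(path, "r+b") as f:
            f.write(b"\x01")
        os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))

        after = classify(path, *recorded)
        return before, after


if __name__ == "__main__":
    print(demo())
```

Reading in chunks mirrors the purpose of the `--chunk-size` option; the default of 16384 bytes used here is an arbitrary choice for the sketch, not the value of `DEFAULT_CHUNK_SIZE`.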