├── .gitignore
├── LICENSE.md
├── b&a.png
├── clonefile-dedup.py
├── clonefile-index.py
├── clonefile-verify.py
└── readme.md

/.gitignore:
--------------------------------------------------------------------------------
.DS_Store
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
Copyright 2018 Scott Martindale

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--------------------------------------------------------------------------------
/b&a.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ranvel/clonefile-dedup/f98bea4b30ac4d7e373cf1a119632bb78a7a49ec/b&a.png
--------------------------------------------------------------------------------
/clonefile-dedup.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
import os, sqlite3, subprocess, xattr, pickle

cwd = os.path.abspath(os.path.curdir)
conn = sqlite3.connect('clonefile-index.sqlite')

with conn:

    cur = conn.cursor()

    # Get files with duplicates
    cur.execute("SELECT chksumfull, COUNT(*) c FROM files WHERE chksumfull != '' GROUP BY chksumfull HAVING c > 1 ORDER BY size DESC")
    results = cur.fetchall()
    for result in results:
        dupscur = conn.cursor()
        dupscur.execute("SELECT file FROM files WHERE chksumfull = ?", (result[0],))
        # Get all duplicate files
        dupesResults = dupscur.fetchall()
        fileIndex = 0
        # Treat the first existing file as the original, even though it doesn't matter which one you use.
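        # Each later duplicate in the group is replaced with an APFS clone of that
        # original: cloned to a temporary ".cfdnew" path with `cp -c`, then ownership,
        # permissions, xattrs and timestamps are restored before the clone is moved
        # over the duplicate.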
        for dupesResult in dupesResults:
            if not os.path.isfile(dupesResult[0]):
                continue
            if os.path.islink(dupesResult[0]):
                continue
            if fileIndex == 0:
                print(f"Original file: {dupesResult[0]} size {os.path.getsize(dupesResult[0])}")
                originalFile = dupesResult[0]
            else:
                fname = dupesResult[0]
                fnameNew = fname + ".cfdnew"
                print(f" replacing: {fname}")

                oldStat = os.stat(fname)
                oldAttr = dict(xattr.xattr(fname))

                dirName = os.path.abspath(os.path.dirname(fname))
                oldDirStat = os.stat(dirName)

                # The -c flag tells cp to use the clonefile syscall (APFS copy-on-write
                # clone); -p preserves mode, ownership and timestamps.
                copyCommand = subprocess.run(['cp', '-cvp', originalFile, fnameNew], stdout=subprocess.PIPE)
                #print(copyCommand)

                newStat = os.stat(fnameNew)
                newAttr = dict(xattr.xattr(fnameNew))

                if newStat.st_uid != oldStat.st_uid or newStat.st_gid != oldStat.st_gid:
                    os.chown(fnameNew, oldStat.st_uid, oldStat.st_gid)

                if newStat.st_mode != oldStat.st_mode:
                    os.chmod(fnameNew, oldStat.st_mode)

                if pickle.dumps(oldAttr) != pickle.dumps(newAttr):
                    for k, v in oldAttr.items():
                        xattr.setxattr(fnameNew, k, v)

                if newStat.st_mtime != oldStat.st_mtime or newStat.st_atime != oldStat.st_atime:
                    os.utime(fnameNew, (oldStat.st_atime, oldStat.st_mtime))

                moveCommand = subprocess.run(['mv', '-f', fnameNew, fname], stdout=subprocess.PIPE)
                #print(moveCommand)

                fnameNew = fname
                newStat = os.stat(fnameNew)
                newAttr = dict(xattr.xattr(fnameNew))
                newDirStat = os.stat(dirName)

                # additional fixup, just in case:
                if newStat.st_uid != oldStat.st_uid or newStat.st_gid != oldStat.st_gid:
                    os.chown(fnameNew, oldStat.st_uid, oldStat.st_gid)

                if newStat.st_mode != oldStat.st_mode:
                    os.chmod(fnameNew, oldStat.st_mode)

                if pickle.dumps(oldAttr) != pickle.dumps(newAttr):
                    for k, v in oldAttr.items():
                        xattr.setxattr(fnameNew, k, v)

                if newStat.st_mtime != oldStat.st_mtime or newStat.st_atime != oldStat.st_atime:
                    os.utime(fnameNew, (oldStat.st_atime, oldStat.st_mtime))

                if newDirStat.st_mtime != oldDirStat.st_mtime or newDirStat.st_atime != oldDirStat.st_atime:
                    os.utime(dirName, (oldDirStat.st_atime, oldDirStat.st_mtime))

            fileIndex += 1

conn.close()
--------------------------------------------------------------------------------
/clonefile-index.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
import os, sqlite3, hashlib, json
from tqdm import tqdm
from os import listdir
from os.path import isfile, islink, join
from multiprocessing import Pool
from pathlib import Path

BLOCKSIZE = 65536

print("""
This will index your files. How many processor threads would you like to use?
This command will show you the maximum number you should use: sysctl hw.logicalcpu
""")
threads = input("Number of Threads to use: ")

conn = sqlite3.connect('clonefile-index.sqlite')
c = conn.cursor()
c.execute('''CREATE TABLE files (file, chksum64k, chksumfull, size, stat)''')
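# Schema notes: chksum64k is the SHA-256 of the first 64 KiB (BLOCKSIZE) of a file
# and serves as a fast screening pass; chksumfull is the SHA-256 of the whole file
# and is only filled in when a fast checksum collides (or immediately, for files
# that fit in a single block); stat stores a JSON dump of os.stat() and is not read
# back by these scripts.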

def processFile(filelink):
    try:
        os.path.getsize(filelink)
    except:
        pass
    else:
        # Don't worry about tiny files:
        if (os.path.getsize(filelink) > 1024):
            try:
                shahash = getSHA256(filelink).split()[0]
                file_info = (
                    filelink, shahash, os.path.getsize(filelink),
                    json.dumps(os.stat(filelink)),
                )
                return file_info
            except Exception as e:
                print(f"Couldn't index {filelink}: {e}")

def processFileFull(filelink):
    shahash = getSHA256(filelink, full=True).split()[0]
    file_info = (shahash, filelink)
    return file_info

def getSHA256(currentFile, full=False):
    # Read 64k at a time, hash the buffer & repeat till finished.
    # By default only checksum the first block.
    hasher = hashlib.sha256()
    with open(currentFile, 'rb') as file:
        buf = file.read(BLOCKSIZE)
        while len(buf) > 0:
            hasher.update(buf)
            if not full:
                break
            buf = file.read(BLOCKSIZE)
    return hasher.hexdigest()

def add2sqlite(fileinfo):
    for f in fileinfo:
        if f is not None:
            if f[2] > BLOCKSIZE:
                c.execute(
                    "INSERT INTO files (file, chksum64k, chksumfull, size, stat) VALUES (?,?,?,?,?)",
                    (f[0], f[1], '', f[2], f[3]))
            else:
                # The file fits in one block, so the fast checksum already is the full checksum.
                c.execute(
                    "INSERT INTO files (file, chksum64k, chksumfull, size, stat) VALUES (?,?,?,?,?)",
                    (f[0], f[1], f[1], f[2], f[3]))

# Index all files from within the root
# start script
if __name__ == '__main__':
    allfiles = []
    sqlite_data = []
    print(f"Indexing files in {os.getcwd()}")
    print("Reading file list")
    for dirpath, dirnames, filenames in os.walk("."):
        for filename in filenames:
            filelink = os.path.abspath(os.path.join(dirpath, filename))
            if isfile(filelink) and not islink(filelink):
                allfiles.append(filelink)
    num_of_files = len(allfiles)
    # "threads" at a time, multiprocess delegation
    print(f'Checksumming {num_of_files} files (fast)')
    with Pool(int(threads)) as pool:
        r = list(tqdm(pool.imap_unordered(processFile, allfiles), total=num_of_files))
    # process (r)esults by adding them to sqlite
    add2sqlite(r)

    conn.commit()
    print('Indexing database')
    c.execute('''CREATE INDEX index64k ON files(chksum64k ASC)''')
    c.execute('''CREATE INDEX indexfile ON files(file ASC)''')
    c.execute('''CREATE INDEX indexsize ON files(size ASC)''')

    c.execute('''SELECT chksum64k, COUNT(*) c FROM files GROUP BY chksum64k HAVING c > 1''')
    results = c.fetchall()

    allfiles = []
    sqlite_data = []

    print(f'Found {len(results)} non-unique checksums, fetching files')
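    # Second pass: a full hash is only computed for files whose first-64 KiB
    # checksum collides with another file's and whose chksumfull is still empty
    # (files no larger than BLOCKSIZE already had chksumfull set on insert).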
    for result in tqdm(results):
        c.execute("SELECT file FROM files WHERE chksum64k = ? AND chksumfull == ''", (result[0],))
        for f in c.fetchall():
            allfiles.append(f[0])

    num_of_files = len(allfiles)
    print(f"Calculating full checksum for {num_of_files} files")
    with Pool(int(threads)) as pool:
        r = list(tqdm(pool.imap_unordered(processFileFull, allfiles), total=num_of_files))

    print("Updating database")
    for x in tqdm(r):
        c.execute("UPDATE files SET chksumfull = ? WHERE file = ?", x)

    conn.commit()
    print('Indexing database')
    c.execute('''CREATE INDEX indexfull ON files(chksumfull ASC)''')

    conn.commit()
    conn.close()
--------------------------------------------------------------------------------
/clonefile-verify.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
import sqlite3, subprocess

conn = sqlite3.connect('clonefile-index.sqlite')

with conn:

    cur = conn.cursor()

    cur.execute("SELECT chksumfull, COUNT(*) c FROM files WHERE chksumfull != '' GROUP BY chksumfull HAVING c > 1")
    results = cur.fetchall()
    for result in results:
        dupscur = conn.cursor()
        dupscur.execute("SELECT file FROM files WHERE chksumfull = ?", (result[0],))
        # print(result)
        dupesResults = dupscur.fetchall()
        fileIndex = 0
        for dupesResult in dupesResults:
            print("Verifying file: " + dupesResult[0])
            chksumRaw = subprocess.run(['shasum', '-a', '256', dupesResult[0]], stdout=subprocess.PIPE)
            chksum = chksumRaw.stdout.split()[0].decode("utf-8")
            print("Original checksum: \t \t " + result[0])
            print("New file: \t \t \t " + chksum)
            # I should probably add some logic here to ignore Spotlight search files.
            if chksum == result[0]:
                print("\033[1;32mVerified!!\033[1;m")
            else:
                input("Failed to verify: " + dupesResult[0])

conn.close()
--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
## 'clonefile' deduplication

![](b&a.png)

This is a rough project I put together to make sure I wasn't wasting space on duplicate files across my drives. YMMV (your mileage *will* vary): it is largely untested, so I caution potential users to review the code and run tests before running it on a full drive, as it could possibly result in data loss.

Normal deduplication is done at the block level and requires a special filesystem such as ZFS, as well as enormous amounts of memory to store the checksums. This script instead catalogs all of your files with the SHA-256 algorithm and then deduplicates any files with matching signatures.
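
For a sense of the core operation, here is a minimal sketch (not one of the scripts in this repo; `a.bin` and `b.bin` are hypothetical file names): hash two candidate files, and if they match, replace one with an APFS clone of the other.

```python
# Minimal sketch, assuming an APFS volume. `cp -c` asks cp to use the clonefile
# syscall, so the "copy" shares the original's data blocks until either file is
# later modified.
import hashlib, subprocess

def sha256(path, blocksize=65536):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(blocksize), b''):
            h.update(chunk)
    return h.hexdigest()

if sha256('a.bin') == sha256('b.bin'):
    # Clone a.bin to a temporary path, then move it over b.bin; the content
    # stays identical, but both paths now reference the same blocks on disk.
    subprocess.run(['cp', '-c', 'a.bin', 'b.bin.cfdnew'], check=True)
    subprocess.run(['mv', '-f', 'b.bin.cfdnew', 'b.bin'], check=True)
```

`clonefile-dedup.py` does essentially this for every duplicate it finds, and additionally restores ownership, permissions, extended attributes and timestamps on the replaced file.
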
You'll need:
- python 3.6
- python sqlite3, tqdm and xattr modules
- a Mac with APFS (to utilize the clonefile syscall)

These scripts could easily be combined into one program, but it's not much work to run them separately, so I will leave it as is.

Instructions:

1. Run `clonefile-index.py`, which will create a `clonefile-index.sqlite` database of all the files and their checksums under the directory you run it from.
2. Run `clonefile-dedup.py`, which will replace every other instance of a duplicated file with a clone of the first instance using the 'clonefile' syscall. This isn't a link but an APFS reference to the same data on the drive, shared by every file with that checksum until one of them is modified.
3. (optional) Run `clonefile-verify.py` to verify that the files bear the same checksum after the process as they did before. If you use Spotlight on this drive, it will definitely report failures on the Spotlight metadata files.

Let me know how it works out for you!
Twitter: @ranvel
--------------------------------------------------------------------------------