├── .gitignore
├── LICENSE-MIT.txt
├── README.md
├── bhlmake.py
└── bhlreco.py
/.gitignore:
--------------------------------------------------------------------------------
1 | misc_tools/
2 | *.pyc
3 | *.zip
4 | *.dat
5 | *.bhl
6 | *.db3
7 | note.txt
--------------------------------------------------------------------------------
/LICENSE-MIT.txt:
--------------------------------------------------------------------------------
1 | Copyright (c) 2017 Marco Pontello
2 |
3 | Permission is hereby granted, free of charge, to any person obtaining a copy
4 | of this software and associated documentation files (the "Software"), to deal
5 | in the Software without restriction, including without limitation the rights
6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7 | copies of the Software, and to permit persons to whom the Software is
8 | furnished to do so, subject to the following conditions:
9 |
10 | The above copyright notice and this permission notice shall be included in all
11 | copies or substantial portions of the Software.
12 |
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
19 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # BlockHashLoc
2 |
3 | The purpose of BlockHashLoc is to enable the recovery of files after total loss of File System structures, or without even knowing what FS was used in the first place.
4 |
5 | It recovers a given file by keeping a (small) parallel BHL file with a list of cryptographic hashes of all the blocks (of selectable size) that compose it. It's then possible to read blocks from a disk image/volume, calculate their hashes, compare them with the saved ones, and rebuild the original file.
6 |
7 | With adequately sized blocks (512 bytes, 4KB, etc., depending on the media and File System), this lets one recover a file regardless of the FS used, its integrity, or the fragmentation level.
8 |
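In code, the hashing side of this scheme is tiny. Here's a minimal sketch in Python (illustrative only; the actual BHLMake tool below also writes a header, metadata, and a compressed last-block remainder):

```python
import hashlib

def block_hashes(path, blocksize=512):
    """Return the SHA-256 digest of each fixed-size block of a file.

    The final block may be shorter than blocksize; BHLMake treats it
    specially, but here it is simply hashed like the others.
    """
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            digests.append(hashlib.sha256(block).digest())
    return digests
```

Comparing these stored digests against hashes of raw sectors read back from a damaged volume is what makes FS-independent recovery possible.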
9 | This project is related to [SeqBox](https://github.com/MarcoPon/SeqBox). The main differences are:
10 |
11 | - SeqBox creates a stand-alone file container with the above-listed recovery characteristics.
12 |
13 | - BHL achieves the same effect with a (small) parallel file that can be stored separately (on other media, or in the cloud), or alongside the original as a SeqBox file (so that it can be recovered too, as the first step). It can thus be used to add a degree of recoverability to existing files.
14 |
15 | **N.B.**
16 |
17 | The tools are still in beta and certainly not speed-optimized, but they are already functional, and the BHL file format is considered final.
18 |
19 | ## Demo tour
20 |
21 | BlockHashLoc is composed of two separate tools:
22 | - BHLMake: create BHL files with block-hashes and metadata
23 | - BHLReco: recover files by searching for the block hashes contained in a set of BHL files
24 |
25 | There are in some cases many parameters, but the defaults are sensible, so usage is generally pretty simple.
26 |
27 | Here's a practical example. Let's see how 2 photos can be recovered from a fragmented floppy disk that has lost its FAT (and any other system section). The 2 JPEGs weigh about 450KB and 680KB:
28 |
29 |  
30 |
31 | We start by creating the BHL files, and then proceed to test them to make sure they are all right:
32 |
33 | ```
34 | c:\t>bhlmake *.jpg
35 | creating file 'Manu01.jpg.bhl'...
36 | BHL file size: 29582 - blocks: 913 - ratio: 6.3%
37 | creating file 'Manu02.jpg.bhl'...
38 | BHL file size: 43936 - blocks: 1363 - ratio: 6.3%
39 |
40 | c:\t>bhlreco -t -bhl *.bhl
41 | reading BHL file 'Manu01.jpg.bhl'...
42 | reading BHL file 'Manu02.jpg.bhl'...
43 | BHL file(s) OK!
44 |
45 | ```
46 |
47 | Now we put both the JPEGs in a floppy disk image that has gone through various cycles of file updates and deletions. At this point the BHL files could be kept somewhere else (another disk, some online storage, etc.), or put in the same disk image after being encoded in one or more [SeqBox](https://github.com/MarcoPon/SeqBox) recoverable container(s) - because, obviously, there's no point in making BHL files if they can be lost too.
48 | As a result the data is laid out like this:
49 |
50 | 
51 |
52 | The photos are in green, and the two SBX files in blue.
53 | Then with a hex editor we zap the first system sectors and the FAT (in red), making the disk image unreadable!
54 | Time for recovery!
55 |
56 | We start with the free (GPL v2+) [PhotoRec](http://www.cgsecurity.org/wiki/PhotoRec), which is the go-to tool for this kind of job. Parameters are set to "Paranoid : YES (Brute force enabled)" & "Keep corrupted files : Yes", to search the entire data area.
57 | As the files are fragmented, we know we can't expect miracles. The starting sectors of the photos will surely be found, but as soon as the first contiguous fragment ends, it's anyone's guess.
58 |
59 | 
60 |
61 | As expected, something has been recovered. But the 2 file sizes are off (32KB and 340KB). The very first parts of the photos are OK, but they degrade quickly as other random blocks of data were mixed in. We have all seen JPEGs ending up like this:
62 |
63 |  
64 |
65 | Other popular recovery tools lead to the same results. It's no one's fault: it's just not possible to know how the various fragments are concatenated without an index or some kind of list (there are approaches based on file-type validators that can, in at least some cases, differentiate between spurious and *valid* blocks, but that's beside the point).
66 |
67 | But with the BHL files at hand, it's a different story. Each of the blocks referenced in the BHL files can't be fragmented, and they can all be located anywhere on the disk just by calculating the hash of every block until all matching ones are found.
68 |
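The scan itself can be sketched in a few lines (a simplified, single-block-size illustration of what BHLReco does; the function and parameter names here are made up for the example):

```python
import hashlib

def scan_image(imgpath, wanted, blocksize=512, step=512):
    """Scan a disk image for blocks whose SHA-256 digest is in `wanted`.

    Returns a dict mapping each matched digest to the offset of its
    first occurrence in the image.
    """
    found = {}
    with open(imgpath, "rb") as f:
        pos = 0
        while True:
            f.seek(pos)
            block = f.read(blocksize)
            if len(block) < blocksize:
                break
            digest = hashlib.sha256(block).digest()
            if digest in wanted and digest not in found:
                found[digest] = pos
            pos += step
    return found
```

The real tool is more elaborate: it steps at the greatest common divisor of all the block sizes involved, handles multiple images and multiple matches, and tracks positions in a SQLite database.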
69 | So, the first thing we need is to obtain the BHL files, either by getting them from some alternate storage, or recovering the [SeqBox](https://github.com/MarcoPon/SeqBox) containers from the same disk image and extracting them.
70 |
71 | Then we can run BHLReco and begin the scanning process:
72 |
73 | ```
74 | c:\t>bhlreco disk.IMA -bhl *.bhl
75 | creating ':memory:' database...
76 | reading BHL file 'Manu01.jpg.bhl'...
77 | updating db...
78 | reading BHL file 'Manu02.jpg.bhl'...
79 | updating db...
80 | scan step: 512
81 | scanning file 'disk.IMA'...
82 | 90.4% - tot: 2274 - found: 2274 - 40.65MB/s
83 | scan completed.
84 | creating file 'Manu01.jpg'...
85 | hash match!
86 | creating file 'Manu02.jpg'...
87 | hash match!
88 |
89 | files restored: 2 - with errors: 0 - files missing: 0
90 | ```
91 |
92 | All files have been recovered, with no errors!
93 | Time for a quick visual check:
94 |
95 |  
96 |
97 | N.B. Here's a [7-Zip archive](http://mark0.net/download/bhldemo-diskimage.7z) with the disk image and the 2 BHL files used in the demo (1.2MB).
98 |
99 |
100 |
101 | ## Tech spec
102 |
103 | Byte order: Big Endian
104 |
105 | Hash: SHA-256
106 |
107 | ### BHL file structure
108 |
109 | | section | desc | note |
110 | | ---------- | ------------------------------------ | --------- |
111 | | Header | Signature & version | |
112 | | Metadata | Misc info | |
113 | | Hash | Blocks hash list & final hash | |
114 | | Last block | zlib compressed last block remainder | if needed |
115 |
116 |
117 | ### Header
118 |
119 | | pos | to pos | size | desc |
120 | |---- | --- | ---- | --------------------------------- |
121 | | 0 | 12 | 13 | Signature = 'BlockHashLoc' + 0x1a |
122 | | 13 | 13 | 1 | Version byte |
123 | | 14 | 17 | 4 | Block size |
124 | | 18 | 25 | 8 | File size |
125 |
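A sketch of parsing this fixed 26-byte header (big-endian fields, per the table above; the function name is illustrative):

```python
import struct

BHL_MAGIC = b"BlockHashLoc\x1a"

def parse_bhl_header(data):
    """Parse the 26-byte BHL header: magic, version, block size, file size."""
    if data[:13] != BHL_MAGIC:
        raise ValueError("not a BHL file")
    version = data[13]
    (blocksize,) = struct.unpack(">I", data[14:18])  # 4-byte block size
    (filesize,) = struct.unpack(">Q", data[18:26])   # 8-byte file size
    return version, blocksize, filesize
```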
126 | ### Metadata
127 |
128 | | pos | to pos | size | desc |
129 | |---- | ------ | ---- | --------------------- |
130 | | 26 | 29 | 4 | Metadata section size |
131 | | 30 | var | var | Encoded metadata list |
132 |
133 | ### Hash
134 |
135 | | pos | to pos | size | desc |
136 | |---- | ------ | ---- | --------------------------------- |
137 | | var | var | 32 | 1st block hash |
138 | | ... | ... | 32 | ... |
139 | | var | var | 32 | Last block hash |
140 | | var | var | 32 | Hash of all previous block hashes |
141 |
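The final entry is a SHA-256 computed over the concatenation of all the preceding block digests, so the hash list itself can be verified before use. A sketch:

```python
import hashlib

def hash_of_hashes(digests):
    """SHA-256 of all block digests concatenated in order."""
    h = hashlib.sha256()
    for d in digests:
        h.update(d)
    return h.digest()
```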
142 |
143 | ### Versions
144 |
145 | Currently the only version is 1.
146 |
147 | ### Metadata encoding
148 |
149 | | Bytes | Field |
150 | | ----- | ----- |
151 | | 3 | ID |
152 | | 1 | Len |
153 | | n | Data |
154 |
155 | #### IDs
156 |
157 | | ID | Desc |
158 | | --- | --- |
159 | | FNM | filename (utf-8) |
160 | | FDT | date & time (8 bytes, seconds since epoch) |
161 |
162 | (other IDs may be added...)
163 |
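Encoding and decoding follow this simple ID/Len/Data scheme; a sketch (helper names are illustrative):

```python
def encode_meta(metaid, data):
    """Encode one metadata entry: 3-byte ID, 1-byte length, then data."""
    assert len(metaid) == 3 and len(data) < 256
    return metaid + bytes([len(data)]) + data

def decode_meta(blob):
    """Decode a sequence of ID/Len/Data entries into a dict keyed by ID."""
    entries = {}
    p = 0
    while p + 4 <= len(blob):
        metaid = blob[p:p + 3]
        length = blob[p + 3]
        entries[metaid] = blob[p + 4:p + 4 + length]
        p += 4 + length
    return entries
```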
164 |
165 | ## Links
166 |
167 | - [BlockHashLoc home page](http://mark0.net/)
168 | - [BlockHashLoc GitHub repository](https://github.com/MarcoPon/BlockHashLoc)
169 |
170 | ## Credits
171 |
172 | The idea of collecting & scanning for block hashes was something I had considered while developing [SeqBox](https://github.com/MarcoPon/SeqBox), before settling on a stand-alone file container instead of the original file plus a parallel one.
173 |
174 | Then the concept resurfaced during a nice discussion on Slashdot with user JoeyRoxx, and after some consideration I decided to put some work into it too, seeing how the two approaches could both be useful (in different situations) and even complement each other nicely.
175 |
176 | ## Contacts
177 |
178 | If you need more info, want to get in touch, or donate: [Marco Pontello](http://mark0.net/contacts-e.html)
179 |
180 | **Bitcoin**: 1Mark1tF6QGj112F5d3fQALGf41YfzXEK3
181 |
182 | 
--------------------------------------------------------------------------------
/bhlmake.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | #--------------------------------------------------------------------------
4 | # BHLMake - BlockHashLoc Maker
5 | #
6 | # Created: 04/05/2017
7 | #
8 | # Copyright (C) 2017 Marco Pontello - http://mark0.net/
9 | #
10 | # Licence:
11 | # This program is free software: you can redistribute it and/or modify
12 | # it under the terms of the GNU Affero General Public License as
13 | # published by the Free Software Foundation, either version 3 of the
14 | # License, or (at your option) any later version.
15 | #
16 | # This program is distributed in the hope that it will be useful,
17 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
18 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
19 | # GNU Affero General Public License for more details.
20 | #
21 | # You should have received a copy of the GNU Affero General Public License
22 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
23 | #
24 | #--------------------------------------------------------------------------
25 |
26 | import os
27 | import sys
28 | import hashlib
29 | import argparse
30 | from time import time
31 | import zlib
32 | import fnmatch
33 |
34 | PROGRAM_VER = "0.7.1b"
35 | BHL_VER = 1
36 |
37 | def get_cmdline():
38 | """Evaluate command line parameters, usage & help."""
39 | parser = argparse.ArgumentParser(
40 | description="create BHL file(s) with block hashes and metadata",
41 | formatter_class=argparse.ArgumentDefaultsHelpFormatter,
42 | prefix_chars='-', fromfile_prefix_chars='@')
43 | parser.add_argument("-v", "--version", action='version',
44 | version='BlockHashLoc ' +
45 | 'Maker v%s - (C) 2017 by M.Pontello' % PROGRAM_VER)
46 | parser.add_argument("filename", action="store", nargs="+",
47 | help="file to process")
48 | parser.add_argument("-d", action="store", dest="destpath",
49 | help="destination path", default="", metavar="path")
50 | parser.add_argument("-b", "--blocksize", type=int, default=512,
51 | help="blocks size", metavar="n")
52 | parser.add_argument("-c", "--continue", action="store_true", default=False,
53 | help="continue on block errors", dest="cont")
54 | parser.add_argument("-r", "--recurse", action="store_true", default=False,
55 | help="recurse subdirs")
56 | res = parser.parse_args()
57 | return res
58 |
59 |
60 | def errexit(errlev=1, mess=""):
61 | """Display an error and exit."""
62 | if mess != "":
63 | sys.stderr.write("%s: error: %s\n" %
64 | (os.path.split(sys.argv[0])[1], mess))
65 | sys.exit(errlev)
66 |
67 |
68 | def buildBHL(filename, bhlfilename, blocksize):
69 | filesize = os.path.getsize(filename)
70 | fin = open(filename, "rb", buffering=1024*1024)
71 | print("creating file '%s'..." % bhlfilename)
72 | open(bhlfilename, 'w').close()
73 | fout = open(bhlfilename, "wb", buffering=1024*1024)
74 |
75 | #write header
76 | fout.write(b"BlockHashLoc\x1a")
77 | fout.write(bytes([BHL_VER]))
78 | fout.write(blocksize.to_bytes(4, byteorder='big', signed=False))
79 | fout.write(filesize.to_bytes(8, byteorder='big', signed=False))
80 |
81 | #write metadata
82 | metadata = b""
83 | bb = os.path.split(filename)[1].encode()
84 | bb = b"FNM" + bytes([len(bb)]) + bb
85 | metadata += bb
86 | bb = int(os.path.getmtime(filename)).to_bytes(8, byteorder='big')
87 | bb = b"FDT" + bytes([len(bb)]) + bb
88 | metadata += bb
89 |
90 | metadata = len(metadata).to_bytes(4, byteorder='big') + metadata
91 | fout.write(metadata)
92 |
93 | #read blocks and calc hashes
94 | globalhash = hashlib.sha256()
95 | blocksnum = 0
96 | ticks = 0
97 | updatetime = time()
98 | bufferz = b""
99 | while True:
100 | buffer = fin.read(blocksize)
101 | if len(buffer) < blocksize:
102 | if len(buffer) == 0:
103 | break
104 | else:
105 | #compressed blob with last block remainder
106 | bufferz = zlib.compress(buffer, 9)
107 | blockhash = hashlib.sha256()
108 | blockhash.update(buffer)
109 | digest = blockhash.digest()
110 | globalhash.update(digest)
111 | fout.write(digest)
112 | blocksnum += 1
113 |
114 | #some progress update
115 | if time() > updatetime:
116 | print("%.1f%%" % (fin.tell()*100.0/filesize), " ",
117 | end="\r", flush=True)
118 | updatetime = time() + .1
119 |
120 | #write hash of hashes and block remainder (if present)
121 | fout.write(globalhash.digest())
122 | if len(bufferz):
123 | fout.write(bufferz)
124 |
125 | fin.close()
126 | fout.close()
127 |
128 | #show stats about the file just created
129 | bhlfilesize = os.path.getsize(bhlfilename)
130 | overhead = bhlfilesize * 100 / filesize
131 | print(" BHL file size: %i - blocks: %i - ratio: %.1f%%" %
132 | (bhlfilesize, blocksnum, overhead))
133 |
134 |
135 | def main():
136 |
137 | cmdline = get_cmdline()
138 | blocksize = cmdline.blocksize
139 |
140 | #build list of files to process
141 | filenames = []
142 | for filespec in cmdline.filename:
143 | filepath, filename = os.path.split(filespec)
144 | if not filepath:
145 | filepath = "."
146 | if not filename:
147 | filename = "*"
148 | for wroot, wdirs, wfiles in os.walk(filepath):
149 | if not cmdline.recurse:
150 | wdirs[:] = []
151 | for fn in fnmatch.filter(wfiles, filename):
152 | filenames.append(os.path.join(wroot, fn))
153 | filenames = sorted(set(filenames), key=os.path.getsize)
154 |
155 | bhlok = 0
156 | bhlerr = 0
157 |
158 | for filename in filenames:
159 | if not os.path.exists(filename):
160 | errexit(1, "file '%s' not found" % (filename))
161 |
162 | destpath = cmdline.destpath
163 | if not destpath:
164 | bhlfilename = os.path.split(filename)[1] + ".bhl"
165 | else:
166 | if not os.path.isdir(destpath):
167 | destpath = os.path.split(filename)[0]
168 | bhlfilename = os.path.join(destpath,
169 | os.path.split(filename)[1] + ".bhl")
170 |
171 | try:
172 | buildBHL(filename, bhlfilename, blocksize)
173 | bhlok += 1
174 | except Exception:
175 | if cmdline.cont:
176 | bhlerr += 1
177 | print(" warning: can't create BHL file!")
178 | else:
179 | errexit(1, "can't create BHL file '%s'" % (bhlfilename))
180 |
181 | if len(cmdline.filename) > 1 and bhlerr > 0:
182 | print("\nBHL files created: %i - errors: %i" % (bhlok, bhlerr))
183 |
184 |
185 | if __name__ == '__main__':
186 | main()
187 |
--------------------------------------------------------------------------------
/bhlreco.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | #--------------------------------------------------------------------------
4 | # BHLReco - BlockHashLoc Recover
5 | #
6 | # Created: 06/05/2017
7 | #
8 | # Copyright (C) 2017 Marco Pontello - http://mark0.net/
9 | #
10 | # Licence:
11 | # This program is free software: you can redistribute it and/or modify
12 | # it under the terms of the GNU Affero General Public License as
13 | # published by the Free Software Foundation, either version 3 of the
14 | # License, or (at your option) any later version.
15 | #
16 | # This program is distributed in the hope that it will be useful,
17 | # but WITHOUT ANY WARRANTY; without even the implied warranty of
18 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
19 | # GNU Affero General Public License for more details.
20 | #
21 | # You should have received a copy of the GNU Affero General Public License
22 | # along with this program. If not, see <https://www.gnu.org/licenses/>.
23 | #
24 | #--------------------------------------------------------------------------
25 |
26 | import os
27 | import sys
28 | import hashlib
29 | import argparse
30 | import time
31 | import zlib
32 | import sqlite3
33 | import glob
34 |
35 | PROGRAM_VER = "0.7.17b"
36 | BHL_VER = 1
37 | BHL_MAGIC = b"BlockHashLoc\x1a"
38 |
39 | def get_cmdline():
40 | """Evaluate command line parameters, usage & help."""
41 | parser = argparse.ArgumentParser(
42 | description="recover files using the hashes in a set of BHL files",
43 | formatter_class=argparse.ArgumentDefaultsHelpFormatter,
44 | prefix_chars='-+', fromfile_prefix_chars='@')
45 | parser.add_argument("-v", "--version", action='version',
46 | version='BlockHashLoc ' +
47 | 'Recover v%s - (C) 2017 by M.Pontello' % PROGRAM_VER)
48 | parser.add_argument("imgfilename", action="store", nargs="*",
49 | help="image(s)/volume(s) to scan")
50 | parser.add_argument("-db", "--database", action="store", dest="dbfilename",
51 | metavar="filename",
52 | help="temporary db with recovery info",
53 | default=":memory:")
54 | parser.add_argument("-bhl", action="store", nargs="+", dest="bhlfilename",
55 | help="BHL file(s)", metavar="filename", required=True)
56 | parser.add_argument("-d", action="store", dest="destpath",
57 | help="destination path", default="", metavar="path")
58 | parser.add_argument("-o", "--offset", type=int, default=0,
59 | help=("offset from the start"), metavar="n")
60 | parser.add_argument("-st", "--step", type=int, default=0,
61 | help=("scan step"), metavar="n")
62 | parser.add_argument("-t","--test", action="store_true", default=False,
63 | help="only test BHL file(s)")
64 | res = parser.parse_args()
65 | return res
66 |
67 |
68 | def errexit(errlev=1, mess=""):
69 | """Display an error and exit."""
70 | if mess != "":
71 | sys.stderr.write("%s: error: %s\n" %
72 | (os.path.split(sys.argv[0])[1], mess))
73 | sys.exit(errlev)
74 |
75 |
76 | def mcd(nums):
77 | """Greatest common divisor (MCD) of the block sizes: a scan step that works for all of them."""
78 | res = min(nums)
79 | while res > 0:
80 | ok = 0
81 | for n in nums:
82 | if n % res != 0:
83 | break
84 | else:
85 | ok += 1
86 | if ok == len(nums):
87 | break
88 | res -= 1
89 | return res if res > 0 else 1
90 |
91 |
92 | def metadataDecode(data):
93 | """Decode metadata"""
94 | metadata = {}
95 | p = 0
96 | while p < (len(data)-3):
97 | metaid = data[p:p+3]
98 | p+=3
99 | metalen = data[p]
100 | metabb = data[p+1:p+1+metalen]
101 | p = p + 1 + metalen
102 | if metaid == b'FNM':
103 | metadata["filename"] = metabb.decode('utf-8')
104 | elif metaid == b'FDT':
105 | metadata["filedatetime"] = int.from_bytes(metabb, byteorder='big')
106 | return metadata
107 |
108 |
109 | class RecDB():
110 | """Helper class to access Sqlite3 DB with recovery info"""
111 |
112 | def __init__(self, dbfilename):
113 | self.connection = sqlite3.connect(dbfilename)
114 | self.cursor = self.connection.cursor()
115 |
116 | def Commit(self):
117 | self.connection.commit()
118 |
119 | def CreateTables(self):
120 | c = self.cursor
121 | c.execute("CREATE TABLE bhl_files (id INTEGER, blocksize INTEGER, size INTEGER, name TEXT, datetime INTEGER, lastblock BLOB, hash BLOB)")
122 | c.execute("CREATE TABLE bhl_hashlist (hash BLOB, fileid INTEGER, sourceid INTEGER, num INTEGER, pos INTEGER)")
123 | c.execute("CREATE INDEX hash ON bhl_hashlist (hash)")
124 | self.connection.commit()
125 |
126 | def SetFileData(self, fid=0, fblocksize=0, fsize=0, fname="", fdatetime=0, flastblock=b"", fhash=b""):
127 | c = self.cursor
128 | c.execute("INSERT INTO bhl_files (id, blocksize, size, name, datetime, lastblock, hash) VALUES (?, ?, ?, ?, ?, ?, ?)",
129 | (fid, fblocksize, fsize, fname, fdatetime, flastblock, fhash))
130 | self.connection.commit()
131 |
132 | def AddHash(self, fhash=0, fid=0, fnum=0):
133 | c = self.cursor
134 | c.execute("INSERT INTO bhl_hashlist (hash, fileid, num) VALUES (?, ?, ?)",
135 | (fhash, fid, fnum))
136 |
137 | def SetHashPos(self, fhash=0, sid=0, pos=0):
138 | c = self.cursor
139 | c.execute("UPDATE bhl_hashlist SET pos = ?, sourceid = ? WHERE hash = ? AND pos IS NULL",
140 | (pos, sid, fhash))
141 | return c.rowcount
142 |
143 | def GetFileInfo(self, fid):
144 | c = self.cursor
145 | data = {}
146 | c.execute("SELECT * FROM bhl_files WHERE id = ?", (fid,))
147 | res = c.fetchone()
148 | if res:
149 | data["blocksize"] = res[1]
150 | data["filesize"] = res[2]
151 | data["filename"] = res[3]
152 | data["filedatetime"] = res[4]
153 | data["lastblock"] = res[5]
154 | data["hash"] = res[6]
155 | return data
156 |
157 | def GetWriteList(self, fid):
158 | c = self.cursor
160 | c.execute("SELECT num, sourceid, pos FROM bhl_hashlist WHERE fileid = ? AND pos IS NOT NULL ORDER BY num", (fid,))
161 | return c.fetchall()
162 |
163 |
164 | def uniquifyFileName(filename):
165 | count = 0
166 | uniq = ""
167 | name,ext = os.path.splitext(filename)
168 | while os.path.exists(filename):
169 | count += 1
170 | uniq = "(%i)" % count
171 | filename = name + uniq + ext
172 | return filename
173 |
174 |
175 | def getFileSize(filename):
176 | """Calc file size - works on devices too"""
177 | ftemp = os.open(filename, os.O_RDONLY)
178 | try:
179 | return os.lseek(ftemp, 0, os.SEEK_END)
180 | finally:
181 | os.close(ftemp)
182 |
183 |
184 | def main():
185 |
186 | cmdline = get_cmdline()
187 |
188 | globalblocksnum = 0
189 | bhlfileid = 0
190 | sizelist = []
191 |
192 | if not len(cmdline.imgfilename) and not cmdline.test:
193 | errexit(1, "no image file/volume specified!")
194 |
195 | #build list of BHL files to process
196 | bhlfilenames = []
197 | for filename in cmdline.bhlfilename:
198 | if os.path.isdir(filename):
199 | filename = os.path.join(filename, "*")
200 | bhlfilenames += glob.glob(filename)
201 | bhlfilenames = [filename for filename in bhlfilenames
202 | if not os.path.isdir(filename)]
203 | bhlfilenames = sorted(set(bhlfilenames))
204 |
205 | if len(bhlfilenames) == 0:
206 | errexit(1, "no BHL file(s) found!")
207 |
208 | #prepare database
209 | if not cmdline.test:
210 | dbfilename = cmdline.dbfilename
211 | print("creating '%s' database..." % (dbfilename))
212 | if dbfilename.upper() != ":MEMORY:":
213 | open(dbfilename, 'w').close()
214 | db = RecDB(dbfilename)
215 | db.CreateTables()
216 |
217 | #process all BHL files
218 | for bhlfilename in bhlfilenames:
219 | if not os.path.exists(bhlfilename):
220 | errexit(1, "BHL file '%s' not found" % (bhlfilename))
221 | bhlfilesize = os.path.getsize(bhlfilename)
222 |
223 | #read hashes in memory
224 | blocklist = {}
225 | print("reading BHL file '%s'..." % bhlfilename)
226 | fin = open(bhlfilename, "rb", buffering=1024*1024)
227 | if BHL_MAGIC != fin.read(13):
228 | errexit(1, "not a valid BHL file")
229 | #check ver
230 | bhlver = ord(fin.read(1))
231 | blocksize = int.from_bytes(fin.read(4), byteorder='big')
232 | if not blocksize in sizelist:
233 | sizelist.append(blocksize)
234 | filesize = int.from_bytes(fin.read(8), byteorder='big')
235 | lastblocksize = filesize % blocksize
236 | totblocksnum = (filesize + blocksize-1) // blocksize
237 |
238 | #parse metadata section
239 | metasize = int.from_bytes(fin.read(4), byteorder='big')
240 | metadata = metadataDecode(fin.read(metasize))
241 |
242 | #read all block hashes
243 | globalhash = hashlib.sha256()
244 | updatetime = time.time()
245 | for block in range(totblocksnum):
246 | digest = fin.read(32)
247 | globalhash.update(digest)
248 | if digest in blocklist:
249 | blocklist[digest].append(block)
250 | else:
251 | blocklist[digest] = [block]
252 | #some progress update
253 | if time.time() > updatetime:
254 | print("%.1f%%" % (fin.tell()*100.0/bhlfilesize), " ",
255 | end="\r", flush=True)
256 | updatetime = time.time() + .1
257 |
258 | lastblockdigest = digest
259 |
260 | #verify the hashes read
261 | digest = fin.read(32)
262 | if globalhash.digest() != digest:
263 | errexit(1, "hash list corrupt!")
264 |
265 | #read and check last blocks
266 | if lastblocksize:
267 | totblocksnum -= 1
268 | buffer = fin.read(bhlfilesize-fin.tell()+1)
269 | lastblockbuffer = zlib.decompress(buffer)
270 | blockhash = hashlib.sha256()
271 | blockhash.update(lastblockbuffer)
272 | if blockhash.digest() != lastblockdigest:
273 | errexit(1, "last block corrupt!")
274 | #remove lastblock from the list
275 | del blocklist[lastblockdigest]
276 | else:
277 | lastblockbuffer = b""
278 | print("100% ", end="\r", flush=True)
279 |
280 | globalblocksnum += totblocksnum
281 |
282 | #put data in the DB
283 | #hashes
284 | if not cmdline.test:
285 | print("updating db...")
286 | updatetime = time.time()
287 | i = 0
288 | for digest in blocklist:
289 | for pos in blocklist[digest]:
290 | db.AddHash(fhash=digest, fid=bhlfileid, fnum=pos)
291 | i+= 1
292 | #some progress update
293 | if time.time() > updatetime:
294 | print("%.1f%%" % (i*100.0/len(blocklist)), " ",
295 | end="\r", flush=True)
296 | db.Commit()
297 | updatetime = time.time() + .1
298 |
299 | #file info
300 | db.SetFileData(fid=bhlfileid, fblocksize=blocksize, fsize=filesize,
301 | fname=metadata["filename"],
302 | fdatetime=metadata["filedatetime"],
303 | flastblock=lastblockbuffer,
304 | fhash=globalhash.digest())
305 | bhlfileid +=1
306 |
307 | if cmdline.test:
308 | print("BHL file(s) OK!")
309 | errexit(0)
310 |
311 | #select an adequate scan step
312 | maxblocksize = max(sizelist)
313 | scanstep = cmdline.step
314 | if scanstep == 0:
315 | scanstep = mcd(sizelist)
316 | print("scan step:", scanstep)
317 | offset = cmdline.offset
318 |
319 | #build list of image files to process
320 | imgfilenames = []
321 | for filename in cmdline.imgfilename:
322 | if os.path.isdir(filename):
323 | filename = os.path.join(filename, "*")
324 | imgfilenames += glob.glob(filename)
325 | imgfilenames = [filename for filename in imgfilenames
326 | if not os.path.isdir(filename)]
327 | imgfilenames = sorted(set(imgfilenames))
328 |
329 | #start scanning process...
330 | blocksfound = 0
331 | for imgfileid in range(len(imgfilenames)):
332 | imgfilename = imgfilenames[imgfileid]
333 | if not os.path.exists(imgfilename):
334 | errexit(1, "image file/volume '%s' not found" % (imgfilename))
335 | imgfilesize = getFileSize(imgfilename)
336 |
337 | print("scanning file '%s'..." % imgfilename)
338 | fin = open(imgfilename, "rb", buffering=1024*1024)
339 |
340 | updatetime = time.time() - 1
341 | starttime = time.time()
342 | writelist = {}
343 | docommit = False
344 |
345 | for pos in range(offset, imgfilesize, scanstep):
346 | fin.seek(pos, 0)
347 | buffer = fin.read(maxblocksize)
348 | if len(buffer) > 0:
349 | #need to check for all sizes
350 | for size in sizelist:
351 | blockhash = hashlib.sha256()
352 | blockhash.update(buffer[:size])
353 | digest = blockhash.digest()
354 | num = db.SetHashPos(fhash=digest, sid=imgfileid, pos=pos)
355 | if num:
356 | docommit = True
357 | blocksfound += num
358 |
359 | #status update
360 | if ((time.time() > updatetime) or (globalblocksnum == blocksfound) or
361 | (imgfilesize-pos-len(buffer) == 0) ):
362 | etime = (time.time()-starttime)
363 | if etime == 0:
364 | etime = .001
365 | print(" %.1f%% - tot: %i - found: %i - %.2fMB/s" %
366 | ((pos+len(buffer)-1)*100/imgfilesize,
367 | globalblocksnum, blocksfound, pos/(1024*1024)/etime),
368 | end = "\r", flush=True)
369 | updatetime = time.time() + .2
370 | if docommit:
371 | db.Commit()
372 | docommit = False
373 | #break early if all the work is done
374 | if blocksfound == globalblocksnum:
375 | break
376 | fin.close()
377 | print()
378 |
379 | print("scan completed.")
380 |
381 | filesrestored = 0
382 | filesrestorederr = 0
383 | filesmissing = 0
384 |
385 | #open all the sources
386 | finlist = {}
387 | for imgfileid in range(len(imgfilenames)):
388 | finlist[imgfileid] = open(imgfilenames[imgfileid], "rb")
389 |
390 | #start rebuilding files...
391 | for fid in range(len(bhlfilenames)):
392 | fileinfo = db.GetFileInfo(fid)
393 | filename = fileinfo["filename"]
394 | filename = os.path.join(cmdline.destpath, filename)
395 | filesize = fileinfo["filesize"]
396 |
397 | #get list of blocks num & positions
398 | blocksize = fileinfo["blocksize"]
399 | lastblock = fileinfo["lastblock"]
400 | writelist = db.GetWriteList(fid)
401 | totblocksnum = filesize // blocksize
402 |
403 | if len(writelist) > 0 or totblocksnum == 0:
404 | print("creating file '%s'..." % filename)
405 | open(filename, 'w').close()
406 | fout = open(filename, "wb")
407 |
408 | if len(writelist) < totblocksnum:
409 | print("file incomplete! blocks missing: %i" %
410 | (totblocksnum - len(writelist)))
411 |
412 | filehash = hashlib.sha256()
413 | for data in writelist:
414 | blocknum = data[0]
415 | imgid = data[1]
416 | pos = data[2]
417 | finlist[imgid].seek(pos)
418 | buffer = finlist[imgid].read(blocksize)
419 | fout.seek(blocknum*blocksize)
420 | fout.write(buffer)
421 | blockhash = hashlib.sha256()
422 | blockhash.update(buffer)
423 | filehash.update(blockhash.digest())
424 | if lastblock:
425 | fout.write(lastblock)
426 | blockhash = hashlib.sha256()
427 | blockhash.update(lastblock)
428 | filehash.update(blockhash.digest())
429 | fout.close()
430 | if "filedatetime" in fileinfo:
431 | os.utime(filename,
432 | (int(time.time()), fileinfo["filedatetime"]))
433 | filesrestored += 1
434 |
435 | if filehash.digest() == fileinfo["hash"]:
436 | print("hash match!")
437 | else:
438 | print("hash mismatch! decoded file corrupted/incomplete!")
439 | filesrestorederr += 1
440 |
441 | else:
442 | print("nothing found for file '%s'" % filename)
443 | filesmissing += 1
444 |
445 | print("\nfiles restored: %i - with errors: %i - files missing: %i" %
446 | (filesrestored, filesrestorederr, filesmissing))
447 |
448 |
449 | if __name__ == '__main__':
450 | main()
451 |
--------------------------------------------------------------------------------