├── README.md
└── molasses.py

/README.md:
--------------------------------------------------------------------------------

Molasses
========

Module Oriented Large Archive Specialized Slow Exhaustive Searcher
------------------------------------------------------------------

Demo video:
https://www.youtube.com/watch?v=8MkqQ8N0UKA

Molasses's current greatest achievement is uncovering the source of the music in
this ancient Tom's Hardware video (https://www.youtube.com/watch?v=N1hg1zf7rrY)
as http://modarchive.org/index.php?request=view_by_moduleid&query=40309.

Usage:

    python molasses.py /path/to/target.[wav,mp3] /path/to/mod_archive /path/to/temp_directory [duration limit in seconds]

The duration limit is optional; use >= 30 if you are unsure whether your target
wav/mp3 starts at the beginning of the mod. Processing time is linear in the
duration limit.

Dependencies:
+ dejavu (https://github.com/worldveil/dejavu, which has many dependencies of
  its own)
+ soundconverter
+ standard python modules (os, sys, subprocess, shutil, calendar, time,
  multiprocessing, collections) and pydub

Note that dejavu requires mysql and mysqldb, but you do not need to create or
initialize a database to use molasses.

Molasses is a **Module Oriented Large Archive Specialized Slow Exhaustive
Searcher**. Lame initialism aside, it is essentially Shazam or Midomi for music
modules instead of sampled audio formats such as 'mp3' and 'wav.' That is,
Molasses takes in a sampled audio file in 'wav' format and finds the closest
matching module. Molasses is slower\* than these online services, but it has the
advantage of being almost completely exhaustive for tracker music.

The motivation behind Molasses has a backstory that is too long to put near the
top of this page, but is available for the extremely bored reader here.
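The Shazam-style matching mentioned above can be illustrated in miniature. The sketch below is a toy, not the dejavu implementation Molasses actually uses (`best_alignment` is a hypothetical name): two recordings match when many of their fingerprint hashes agree at one consistent time offset.

```python
from collections import defaultdict

def best_alignment(target, candidate):
    """Return the largest number of shared hashes at a single time offset."""
    diff_counter = defaultdict(int)
    best = 0
    for h, t_offsets in target.items():
        for t in t_offsets:
            for c in candidate.get(h, []):
                # Matching audio produces many identical offset differences
                diff_counter[t - c] += 1
                best = max(best, diff_counter[t - c])
    return best

# Three hashes line up at a constant offset of +5; 'z' is noise
target = {'a': [10], 'b': [20], 'c': [30]}
candidate = {'a': [5], 'b': [15], 'c': [25], 'z': [99]}
print(best_alignment(target, candidate))  # 3
```

A non-matching recording still shares the occasional hash by chance, but those collisions scatter across many different offsets instead of piling up at one, which is why the peak count is a usable match score.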
Brief Technical Overview
------------------------
Under the hood, Molasses uses the fingerprint-based matching algorithm
implemented in the https://github.com/worldveil/dejavu project by Will Drevo,
though it does not rely on a mysql backend. Soundconverter (gstreamer-based) is
used to convert modules to wav format so that they can be fingerprinted.

The reasoning behind the lack of a mysql database, despite the intent of the
dejavu project, is simple. Even when only considering the 2007 snapshot of
http://www.modarchive.org, there are around 120,000 modules to search through.
The required database would be immense by personal computing standards--dejavu's
documentation states that ~377MB is needed to store fingerprints for 45 tracks.
Even if we make the preposterous assumption that the average module is 1/2 the
duration of a mainstream pop music track, that comes out to ~500GB for the
database.

Therefore, each time a search is performed, Molasses essentially performs the
"indexing" stage of dejavu by processing each module in the archive. Processing
is very straightforward: each module is converted to wav format, fingerprinted,
and matched/aligned against the input sample.

Performance
-----------
Despite being implemented in Python 2.7 and mostly unoptimized, Molasses is
capable of fingerprinting around 1 module (limited to 30s duration) every 3
seconds on an i5-2520M (~3GHz 2C/4T Sandy Bridge). This time includes the entire
conversion process, FFTs, and so on, meaning that modarchive in its entirety can
be searched in a matter of days.

Molasses can take advantage of multi-core CPUs (via Python's multiprocessing
module), and if you're confident that the sample you're using to search starts
near the beginning of the targeted module, you can adjust the duration for which
each module is fingerprinted.
Processing time is roughly linear in fingerprinting
duration, so limiting the duration to 15s should cut processing time roughly in
half.

--------------------------------------------------------------------------------
/molasses.py:
--------------------------------------------------------------------------------
import dejavu
import os
import sys
import subprocess
import shutil
import zipfile
import calendar
import time
import multiprocessing
from multiprocessing import Pool
import pydub
from collections import defaultdict
import json

TEMP_SUBDIR = 'temp'
TIMEOUT = 10
POLL_INTERVAL = 0.2

def get_hashes(filepath, limit=None):
    #print("Fingerprinting " + filepath + " ...")
    try:
        #Suppress fingerprinting output... too many files...
        devnull = open(os.devnull, 'w')
        sys.stdout = devnull
        (_, song_hashes_list, _) = dejavu._fingerprint_worker(filepath,
                                                              limit=limit)
        sys.stdout = sys.__stdout__
    except pydub.exceptions.CouldntDecodeError:
        sys.stdout = sys.__stdout__
        print("Corrupt or nonexistent wav output...")
        return None
    #Converting things to dicts should greatly speed up lookup time when
    #comparing fingerprints...
    song_hashes_dict = defaultdict(list)
    for k, v in song_hashes_list:
        song_hashes_dict[k].append(v)
    return song_hashes_dict

def to_wav(filepath):
    devnull = open(os.devnull, 'w')
    #print("Converting " + filepath + " to wav...")
    filename, ext = os.path.splitext(filepath)
    #Prevent collisions after soundconverter clobbers the extension--this can
    #happen under multiprocessing, so we rename the file to include the
    #extension before the '.', e.g. foo.xm becomes foo_xm.xm -> foo_xm.wav
    new_filepath = filename + '_' + ext[1:] + ext
    os.rename(filepath, new_filepath)
    sc_process = subprocess.Popen(["soundconverter", "-b", "-m", "audio/x-wav",
                                   "-s", ".wav", new_filepath], stdout=devnull)
    sleep_time = 0
    wavpath = os.path.splitext(new_filepath)[0] + '.wav'
    while sleep_time < TIMEOUT:
        if sc_process.poll() is not None:
            return wavpath
        sleep_time = sleep_time + POLL_INTERVAL
        time.sleep(POLL_INTERVAL)
    print("soundconverter hanging, abandoning conversion for "
          "{0}...".format(filepath))
    #Cleanup in case soundconverter took a HUGE dump (sometimes > 50GiB!!!)
    if os.path.exists(wavpath):
        os.remove(wavpath)
    sc_process.kill()
    return None

#Slimmed-down version of the same routine from the dejavu library--look ma, no
#mysql db
def count_align_matches(target_hashes, current_hashes):
    diff_counter = {}
    largest_count = 0
    for t_key in target_hashes:
        if t_key in current_hashes:
            for t_offset in target_hashes[t_key]:
                for c_offset in current_hashes[t_key]:
                    diff = t_offset - c_offset
                    if diff not in diff_counter:
                        diff_counter[diff] = 0
                    diff_counter[diff] = diff_counter[diff] + 1
                    if diff_counter[diff] > largest_count:
                        largest_count = diff_counter[diff]
    return largest_count

def process_file_one(arg):
    root = arg[0]
    filename = arg[1]
    target_hashes = arg[2]
    if filename[-4:].lower() != ".zip" and filename[-4:].lower() != ".wav":
        print("Checking " + root + '/' + filename + "...")
        wavpath = to_wav(root + '/' + filename)
        if wavpath is None:
            return (filename, None)
        current_hashes = get_hashes(wavpath, limit=arg[3])
        if current_hashes is None:
            return (filename, None)
        count = count_align_matches(target_hashes, current_hashes)
        #Remove the wav to prevent collisions where two mods have different
        #extensions but the same name--leading to colliding wavs
        if os.path.exists(wavpath):
            os.remove(wavpath)
        return (filename, count)
    #Leftover wav, will be ignored
    else:
        return ('', 0)

class ModSearch:
    def __init__(self, target_path, search_path=None, temp_path=None,
                 duration_limit=None):
        #Keep track of how hideously long this takes
        self.START_TIME = calendar.timegm(time.gmtime())
        self.SEARCHED = 0
        self.search_path = '/home/ketsol/mods/'
        self.temp_path = './temp' + '/' + TEMP_SUBDIR
        print("Fingerprinting Target...")
        self.target_hashes = get_hashes(target_path)
        self.largest_counts = {}
        self.checked_fileset = set()
        self.mp = True
        self.p_count = multiprocessing.cpu_count()
        self.duration_limit = 30

        if search_path is not None:
            self.search_path = search_path
            assert(temp_path is not None)
            if temp_path[-1] == '/':
                self.temp_path = temp_path + TEMP_SUBDIR
            else:
                self.temp_path = temp_path + '/' + TEMP_SUBDIR
        if duration_limit is not None:
            self.duration_limit = int(duration_limit)

        #Resume support: load the logs from any previous run. This must happen
        #after search_path is overridden above, otherwise the logs would be
        #read from the default path.
        if os.path.exists(self.search_path + 'files_log'):
            with open(self.search_path + 'files_log', 'r') as files_log:
                self.checked_fileset = set(files_log.read().splitlines())
        if os.path.exists(self.search_path + 'match_log'):
            with open(self.search_path + 'match_log', 'r') as match_log:
                self.largest_counts = json.load(match_log)
            print('Loaded match log ' + str(self.largest_counts))

    #Unzip all of the nested zips in a directory, without regard for flattening
    #their directory structures... we just want the mods!
    def unzip_all(self):
        zipfiles = set()
        for root, dirs, files in os.walk(self.temp_path):
            for filename in files:
                #print(filename)
                if filename[-4:].lower() == '.zip':
                    zipfiles.add(root + '/' + filename)
        while zipfiles:
            for filepath in zipfiles:
                #print(filepath)
                try:
                    with zipfile.ZipFile(filepath, 'r') as cur_zip:
                        cur_zip.extractall(self.temp_path)
                except zipfile.BadZipfile:
                    print("Bad zip encountered at {0}...".format(filepath))
                    os.remove(filepath)
                    continue
                os.remove(filepath)
            zipfiles = set()
            for root, dirs, files in os.walk(self.temp_path):
                for filename in files:
                    if filename[-4:].lower() == '.zip':
                        zipfiles.add(root + '/' + filename)

    def process_dir(self):
        self.process_mods()

    #Copy all of the zips from the search directory to the temporary directory
    #and process them
    def process_all_zips(self):
        if self.mp:
            self.p = Pool(self.p_count)
        print(self.search_path)
        for root, dirs, files in os.walk(self.search_path):
            for filename in files:
                if filename[-4:].lower() == '.zip':
                    if not os.path.exists(self.temp_path):
                        os.makedirs(self.temp_path)
                    shutil.copy(root + '/' + filename,
                                self.temp_path + '/' + filename)
        self.unzip_all()
        self.process_dir()
        shutil.rmtree(self.temp_path)
        print(sorted(self.largest_counts.items(), key=lambda match: match[1]))
        if self.mp:
            self.p.close()
            self.p.join()

    def process_file(self, root, filename):
        if filename in self.checked_fileset:
            return None
        if filename[-4:].lower() != ".zip" and filename[-4:].lower() != ".wav":
            #print("Checking " + root + '/' + filename + "...")
            wavpath = to_wav(root + '/' + filename)
            if wavpath is None:
                return None
            current_hashes = get_hashes(wavpath, limit=self.duration_limit)
            if current_hashes is None:
                return None
            count = count_align_matches(self.target_hashes, current_hashes)
            #Remove the wav to prevent collisions where two mods have different
            #extensions but the same name--leading to colliding wavs
            if os.path.exists(wavpath):
                os.remove(wavpath)
            return count

    #Check if the match count is above the threshold, then log checked files
    #and matches
    def process_result(self, count, filename):
        if count is not None and count > 10:
            self.largest_counts[filename] = count
        if len(self.largest_counts.items()) > 0:
            print("Current Top Matches Are " +
                  str(sorted(self.largest_counts.items(),
                             key=lambda match: match[1])) + " ...")
        with open(self.search_path + 'files_log', 'a') as files_log:
            files_log.write(filename + '\n')
        with open(self.search_path + 'match_log', 'w') as match_log:
            json.dump(self.largest_counts, match_log)

    #After all of the mods in the temp directory have been extracted, process
    #them
    def process_mods(self):
        counts = {}
        count = 0
        for root, dirs, files in os.walk(self.temp_path):
            if not self.mp:
                for filename in files:
                    self.SEARCHED = self.SEARCHED + 1
                    count = self.process_file(root, filename)
                    cur_time = calendar.timegm(time.gmtime())
                    print("Searched {0:d} mods in {1:d} seconds...".format(
                        self.SEARCHED, cur_time - self.START_TIME))
                    self.process_result(count, filename)
            else:
                #Pack object state into a list... pretty nasty, feels like
                #OpenCL stuff
                map_input = [[root, filename, self.target_hashes,
                              self.duration_limit]
                             for filename in files
                             if filename not in self.checked_fileset]
                results = self.p.map(process_file_one, map_input)
                self.SEARCHED = self.SEARCHED + len(files)
                cur_time = calendar.timegm(time.gmtime())
                print("Searched {0:d} mods in {1:d} seconds...".format(
                    self.SEARCHED, cur_time - self.START_TIME))
                print(results)
                for result in results:
                    count = result[1]
                    filename = result[0]
                    self.process_result(count, filename)

def main():
    if len(sys.argv) > 1:
        assert(len(sys.argv) > 3)
        if len(sys.argv) > 4:
            m_modsearcher = ModSearch(sys.argv[1], sys.argv[2], sys.argv[3],
                                      sys.argv[4])
        else:
            m_modsearcher = ModSearch(sys.argv[1], sys.argv[2], sys.argv[3])
    else:
        m_modsearcher = ModSearch("./test/2.wav")
    m_modsearcher.process_all_zips()


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
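
The repeat-until-no-zips-remain loop in unzip_all above can be exercised in isolation. Here is a minimal standalone sketch of the same approach (`extract_all_nested` is a hypothetical name; like molasses, it flattens everything into one directory and deliberately ignores the zips' internal structure):

```python
import os
import zipfile

def extract_all_nested(directory):
    #Repeat until a walk over the tree finds no zip files; zips nested
    #inside other zips only surface on later passes.
    while True:
        zips = [os.path.join(root, f)
                for root, _, files in os.walk(directory)
                for f in files if f.lower().endswith('.zip')]
        if not zips:
            return
        for path in zips:
            try:
                with zipfile.ZipFile(path, 'r') as cur_zip:
                    cur_zip.extractall(directory)
            except zipfile.BadZipfile:
                print("Bad zip encountered at {0}...".format(path))
            #Remove the zip either way so the loop terminates
            os.remove(path)
```

A zip containing another zip that contains a mod leaves the mod loose in the directory after two passes, with both zip files deleted.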