├── README.md
└── molasses.py

/README.md:
--------------------------------------------------------------------------------

Molasses
========

Module Oriented Large Archive Specialized Slow Exhaustive Searcher
------------------------------------------------------------------

Demo video:
https://www.youtube.com/watch?v=8MkqQ8N0UKA

Molasses's current greatest achievement is uncovering the source of the music in
this ancient Tom's Hardware video (https://www.youtube.com/watch?v=N1hg1zf7rrY)
as http://modarchive.org/index.php?request=view_by_moduleid&query=40309.

Usage:

    python molasses.py /path/to/target.[wav,mp3] /path/to/mod_archive /path/to/temp_directory [duration limit in seconds]

The duration limit is optional; use >= 30 if you are unsure whether your target
wav/mp3 starts at the beginning of the mod. Processing time is linear in the
duration limit.

Dependencies:
+ dejavu (https://github.com/worldveil/dejavu, which has many dependencies of
  its own)
+ soundconverter
+ standard python modules (os, sys, subprocess, shutil, calendar, time,
  multiprocessing, collections) and pydub

Note that dejavu requires mysql and mysqldb, but you do not need to create or
initialize a database to use molasses.

Molasses is a **Module Oriented Large Archive Specialized Slow Exhaustive
Searcher**. Lame initialism aside, it is essentially Shazam or Midomi for music
modules instead of sampled audio formats such as 'mp3' and 'wav.' That is,
Molasses takes in a sampled audio file in 'wav' format and finds the closest
matching module. Molasses is slower\* than these online services, but it has the
advantage of being almost completely exhaustive for tracker music.

The motivation behind Molasses has a backstory that is too long to put near the
top of this page, but is available for the extremely bored reader here.
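The Shazam-style matching mentioned above can be illustrated in miniature. The sketch below is a toy, not the dejavu implementation Molasses actually uses (`best_alignment` is a hypothetical name): two recordings match when many of their fingerprint hashes agree at one consistent time offset.

```python
from collections import defaultdict

def best_alignment(target, candidate):
    """Return the largest number of shared hashes at a single time offset."""
    diff_counter = defaultdict(int)
    best = 0
    for h, t_offsets in target.items():
        for t in t_offsets:
            for c in candidate.get(h, []):
                # Matching audio produces many identical offset differences
                diff_counter[t - c] += 1
                best = max(best, diff_counter[t - c])
    return best

# Three hashes line up at a constant offset of +5; 'z' is noise
target = {'a': [10], 'b': [20], 'c': [30]}
candidate = {'a': [5], 'b': [15], 'c': [25], 'z': [99]}
print(best_alignment(target, candidate))  # 3
```

A non-matching recording still shares the occasional hash by chance, but those collisions scatter across many different offsets instead of piling up at one, which is why the peak count is a usable match score.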
Brief Technical Overview
------------------------
Under the hood, Molasses uses the fingerprint-based matching algorithm
implemented in the https://github.com/worldveil/dejavu project by Will Drevo,
though it does not rely on a mysql backend. Soundconverter (gstreamer-based) is
used to convert modules to wav format so that they can be fingerprinted.

The reasoning behind the lack of a mysql database, despite the intent of the
dejavu project, is simple. Even when only considering the 2007 snapshot of
http://www.modarchive.org, there are around 120,000 modules to search through.
The required database would be immense by personal computing standards--dejavu's
documentation states that ~377MB is needed to store fingerprints for 45 tracks.
Even if we make the preposterous assumption that the average module is 1/2 the
duration of a mainstream pop music track, that comes out to ~500GB for the
database.

Therefore, each time a search is performed, Molasses essentially performs the
"indexing" stage of dejavu by processing each module in the archive. Processing
is very straightforward: each module is converted to wav format, fingerprinted,
and matched/aligned against the input sample.

Performance
-----------
Despite being implemented in Python 2.7 and mostly unoptimized, Molasses is
capable of fingerprinting around 1 module (limited to 30s duration) every 3
seconds on an i5-2520M (~3GHz 2C/4T Sandy Bridge). This time includes the entire
conversion process, FFTs, and so on, meaning that modarchive in its entirety can
be searched in a matter of days.

Molasses can take advantage of multi-core CPUs (via Python's multiprocessing
module), and if you're confident that the sample you're using to search starts
near the beginning of the targeted module, you can adjust the duration for which
each module is fingerprinted.
Processing time is roughly linear in fingerprinting
duration, so limiting the duration to 15s should cut processing time roughly in
half.

--------------------------------------------------------------------------------
/molasses.py:
--------------------------------------------------------------------------------
import dejavu
import os
import sys
import subprocess
import shutil
import zipfile
import calendar
import time
import multiprocessing
from multiprocessing import Pool
import pydub
from collections import defaultdict
import json

TEMP_SUBDIR = 'temp'
TIMEOUT = 10
POLL_INTERVAL = 0.2

def get_hashes(filepath, limit=None):
    #print("Fingerprinting " + filepath + " ...")
    try:
        #Suppress fingerprinting output... too many files...
        devnull = open(os.devnull, 'w')
        sys.stdout = devnull
        (_, song_hashes_list, _) = dejavu._fingerprint_worker(filepath,
                                                              limit=limit)
        sys.stdout = sys.__stdout__
    except pydub.exceptions.CouldntDecodeError:
        sys.stdout = sys.__stdout__
        print("Corrupt or nonexistent wav output...")
        return None
    #Converting things to dicts should greatly speed up lookup time when
    #comparing fingerprints...
    song_hashes_dict = defaultdict(list)
    for k, v in song_hashes_list:
        song_hashes_dict[k].append(v)
    return song_hashes_dict

def to_wav(filepath):
    devnull = open(os.devnull, 'w')
    #print("Converting " + filepath + " to wav...")
    filename, ext = os.path.splitext(filepath)
    #Prevent collisions after soundconverter clobbers the extension--this can
    #happen under multiprocessing, so we rename the file to include the
    #extension before the '.', e.g. foo.xm becomes foo_xm.xm -> foo_xm.wav
    new_filepath = filename + '_' + ext[1:] + ext
    os.rename(filepath, new_filepath)
    sc_process = subprocess.Popen(["soundconverter", "-b", "-m", "audio/x-wav",
                                   "-s", ".wav", new_filepath], stdout=devnull)
    sleep_time = 0
    wavpath = os.path.splitext(new_filepath)[0] + '.wav'
    while sleep_time < TIMEOUT:
        if sc_process.poll() is not None:
            return wavpath
        sleep_time = sleep_time + POLL_INTERVAL
        time.sleep(POLL_INTERVAL)
    print("soundconverter hanging, abandoning conversion for "
          "{0}...".format(filepath))
    #Cleanup in case soundconverter took a HUGE dump (sometimes > 50GiB!!!)
    if os.path.exists(wavpath):
        os.remove(wavpath)
    sc_process.kill()
    return None

#Slimmed-down version of the same routine from the dejavu library--look ma, no
#mysql db
def count_align_matches(target_hashes, current_hashes):
    diff_counter = {}
    largest_count = 0
    for t_key in target_hashes:
        if t_key in current_hashes:
            for t_offset in target_hashes[t_key]:
                for c_offset in current_hashes[t_key]:
                    diff = t_offset - c_offset
                    if diff not in diff_counter:
                        diff_counter[diff] = 0
                    diff_counter[diff] = diff_counter[diff] + 1
                    if diff_counter[diff] > largest_count:
                        largest_count = diff_counter[diff]
    return largest_count

def process_file_one(arg):
    root = arg[0]
    filename = arg[1]
    target_hashes = arg[2]
    if filename[-4:].lower() != ".zip" and filename[-4:].lower() != ".wav":
        print("Checking " + root + '/' + filename + "...")
        wavpath = to_wav(root + '/' + filename)
        if wavpath is None:
            return (filename, None)
        current_hashes = get_hashes(wavpath, limit=arg[3])
        if current_hashes is None:
            return (filename, None)
        count = count_align_matches(target_hashes, current_hashes)
        #Remove the wav to prevent collisions where two mods have different
        #extensions but the same name--leading to colliding wavs
        if os.path.exists(wavpath):
            os.remove(wavpath)
        return (filename, count)
    #Leftover wav, will be ignored
    else:
        return ('', 0)

class ModSearch:
    def __init__(self, target_path, search_path=None, temp_path=None,
                 duration_limit=None):
        #Keep track of how hideously long this takes
        self.START_TIME = calendar.timegm(time.gmtime())
        self.SEARCHED = 0
        self.search_path = '/home/ketsol/mods/'
        self.temp_path = './temp' + '/' + TEMP_SUBDIR
        print("Fingerprinting Target...")
        self.target_hashes = get_hashes(target_path)
        self.largest_counts = {}
        self.checked_fileset = set()
        self.mp = True
        self.p_count = multiprocessing.cpu_count()
        self.duration_limit = 30

        if search_path is not None:
            self.search_path = search_path
            assert(temp_path is not None)
            if temp_path[-1] == '/':
                self.temp_path = temp_path + TEMP_SUBDIR
            else:
                self.temp_path = temp_path + '/' + TEMP_SUBDIR
        if duration_limit is not None:
            self.duration_limit = int(duration_limit)

        #Resume support: load the logs from any previous run. This must happen
        #after search_path is overridden above, otherwise the logs would be
        #read from the default path.
        if os.path.exists(self.search_path + 'files_log'):
            with open(self.search_path + 'files_log', 'r') as files_log:
                self.checked_fileset = set(files_log.read().splitlines())
        if os.path.exists(self.search_path + 'match_log'):
            with open(self.search_path + 'match_log', 'r') as match_log:
                self.largest_counts = json.load(match_log)
            print('Loaded match log ' + str(self.largest_counts))

    #Unzip all of the nested zips in a directory, without regard for flattening
    #their directory structures... we just want the mods!
    def unzip_all(self):
        zipfiles = set()
        for root, dirs, files in os.walk(self.temp_path):
            for filename in files:
                #print(filename)
                if filename[-4:].lower() == '.zip':
                    zipfiles.add(root + '/' + filename)
        while zipfiles:
            for filepath in zipfiles:
                #print(filepath)
                try:
                    with zipfile.ZipFile(filepath, 'r') as cur_zip:
                        cur_zip.extractall(self.temp_path)
                except zipfile.BadZipfile:
                    print("Bad zip encountered at {0}...".format(filepath))
                    os.remove(filepath)
                    continue
                os.remove(filepath)
            zipfiles = set()
            for root, dirs, files in os.walk(self.temp_path):
                for filename in files:
                    if filename[-4:].lower() == '.zip':
                        zipfiles.add(root + '/' + filename)

    def process_dir(self):
        self.process_mods()

    #Copy all of the zips from the search directory to the temporary directory
    #and process them
    def process_all_zips(self):
        if self.mp:
            self.p = Pool(self.p_count)
        print(self.search_path)
        for root, dirs, files in os.walk(self.search_path):
            for filename in files:
                if filename[-4:].lower() == '.zip':
                    if not os.path.exists(self.temp_path):
                        os.makedirs(self.temp_path)
                    shutil.copy(root + '/' + filename,
                                self.temp_path + '/' + filename)
        self.unzip_all()
        self.process_dir()
        shutil.rmtree(self.temp_path)
        print(sorted(self.largest_counts.items(), key=lambda match: match[1]))
        if self.mp:
            self.p.close()
            self.p.join()

    def process_file(self, root, filename):
        if filename in self.checked_fileset:
            return None
        if filename[-4:].lower() != ".zip" and filename[-4:].lower() != ".wav":
            #print("Checking " + root + '/' + filename + "...")
            wavpath = to_wav(root + '/' + filename)
            if wavpath is None:
                return None
            current_hashes = get_hashes(wavpath, limit=self.duration_limit)
            if current_hashes is None:
                return None
            count = count_align_matches(self.target_hashes, current_hashes)
            #Remove the wav to prevent collisions where two mods have different
            #extensions but the same name--leading to colliding wavs
            if os.path.exists(wavpath):
                os.remove(wavpath)
            return count

    #Check if the match count is above the threshold, then log checked files
    #and matches
    def process_result(self, count, filename):
        if count is not None and count > 10:
            self.largest_counts[filename] = count
        if len(self.largest_counts.items()) > 0:
            print("Current Top Matches Are " +
                  str(sorted(self.largest_counts.items(),
                             key=lambda match: match[1])) + " ...")
        with open(self.search_path + 'files_log', 'a') as files_log:
            files_log.write(filename + '\n')
        with open(self.search_path + 'match_log', 'w') as match_log:
            json.dump(self.largest_counts, match_log)

    #After all of the mods in the temp directory have been extracted, process
    #them
    def process_mods(self):
        counts = {}
        count = 0
        for root, dirs, files in os.walk(self.temp_path):
            if not self.mp:
                for filename in files:
                    self.SEARCHED = self.SEARCHED + 1
                    count = self.process_file(root, filename)
                    cur_time = calendar.timegm(time.gmtime())
                    print("Searched {0:d} mods in {1:d} seconds...".format(
                        self.SEARCHED, cur_time - self.START_TIME))
                    self.process_result(count, filename)
            else:
                #Pack object state into a list... pretty nasty, feels like
                #OpenCL stuff
                map_input = [[root, filename, self.target_hashes,
                              self.duration_limit]
                             for filename in files
                             if filename not in self.checked_fileset]
                results = self.p.map(process_file_one, map_input)
                self.SEARCHED = self.SEARCHED + len(files)
                cur_time = calendar.timegm(time.gmtime())
                print("Searched {0:d} mods in {1:d} seconds...".format(
                    self.SEARCHED, cur_time - self.START_TIME))
                print(results)
                for result in results:
                    count = result[1]
                    filename = result[0]
                    self.process_result(count, filename)

def main():
    if len(sys.argv) > 1:
        assert(len(sys.argv) > 3)
        if len(sys.argv) > 4:
            m_modsearcher = ModSearch(sys.argv[1], sys.argv[2], sys.argv[3],
                                      sys.argv[4])
        else:
            m_modsearcher = ModSearch(sys.argv[1], sys.argv[2], sys.argv[3])
    else:
        m_modsearcher = ModSearch("./test/2.wav")
    m_modsearcher.process_all_zips()


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
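
The repeat-until-no-zips-remain loop in unzip_all above can be exercised in isolation. Here is a minimal standalone sketch of the same approach (`extract_all_nested` is a hypothetical name; like molasses, it flattens everything into one directory and deliberately ignores the zips' internal structure):

```python
import os
import zipfile

def extract_all_nested(directory):
    #Repeat until a walk over the tree finds no zip files; zips nested
    #inside other zips only surface on later passes.
    while True:
        zips = [os.path.join(root, f)
                for root, _, files in os.walk(directory)
                for f in files if f.lower().endswith('.zip')]
        if not zips:
            return
        for path in zips:
            try:
                with zipfile.ZipFile(path, 'r') as cur_zip:
                    cur_zip.extractall(directory)
            except zipfile.BadZipfile:
                print("Bad zip encountered at {0}...".format(path))
            #Remove the zip either way so the loop terminates
            os.remove(path)
```

A zip containing another zip that contains a mod leaves the mod loose in the directory after two passes, with both zip files deleted.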