├── README.md
└── mimesort.py


/README.md:
--------------------------------------------------------------------------------
mimesort
========

Mimesort is used to sort files and folders into labeled folders based on MIME
types. When deciding what MIME classification to use for a folder, mimesort
checks the MIME type of every file within the folder and classifies the folder
based on the most commonly occurring type. By default, if the MIME types that
occur in a folder are not all the same, the folder is classified as "mixed."
This can be changed by adjusting the diversity threshold. The diversity of a
folder is determined by the [Shannon
index](http://en.wikipedia.org/wiki/Diversity_index#Shannon_index) of the MIME
types of its contents.

I have a nightly entry in my crontab that I use to sort my downloads directory:

    mimesort.py -d .6 ~/Downloads

A Shannon index threshold of 0.6 does a good job of classifying folders to my
liking, but I recommend doing some dry-runs (`-n`) to see what works best for
you.


Usage
-----

The script's usage is pretty simple:

    mimesort.py [OPTIONS] [DIRECTORY... [DESTINATION]]

When no directory is supplied, mimesort works on the current working directory
but prompts the user before proceeding. When multiple directories are supplied
as arguments, the last folder is used as the destination for the sorted
contents of the preceding folders.
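
To get a feel for the numbers passed to `-d`, the following stand-alone sketch
computes a Shannon index (using the natural logarithm) for a list of labels.
It assumes Python 3. The helper name `shannon_index` and the sample labels are
made up for this illustration and are not part of mimesort's interface.

```python
import math
from collections import Counter


def shannon_index(labels):
    """Return the Shannon index (natural log) of a list of labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # H = -sum(p * ln(p)), where p is the relative frequency of each label.
    index = -sum((n / total) * math.log(n / total) for n in counts.values())
    # Avoid returning a negative zero when every label is the same.
    return index or 0.0


print(round(shannon_index(['pdf', 'pdf', 'pdf', 'pdf']), 3))    # 0.0
print(round(shannon_index(['pdf', 'pdf', 'pdf', 'image']), 3))  # 0.562
print(round(shannon_index(['pdf', 'image']), 3))                # 0.693
```

With a threshold of 0.6, a folder holding three PDFs and one image (index of
about 0.562) would be labeled `pdf`, while a folder split evenly between two
types (about 0.693) would be labeled `mixed`.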


Example
-------

    ~/Downloads$ mimesort -d 0.6 | tail -n15
    ./shellinabox_2.10-1_amd64.deb           debian-package
    ./Bat Euthanasia.pdf                     pdf
    ./AmazonMP3-1309433311.amz               plain
    ./cruisecontrol-bin-2.8.4          1.566 mixed
    ./WebShell-0.9.6                   1.476 mixed
    ./kien-ctrlp.vim-e50970f.tar.gz          tar
    ./531 Manual.pdf                         pdf
    ./xflux.tgz                              tar
    ./mHXiz.png                              image
    ./Full ...otherhood 1-64 (Eng Dub) 0.000 video
    ./Sunflower-0.1a-26.tgz                  tar
    ./ipmansubs.zip                          zip
    ./Pokemans.mp3                           audio
    ./Windo...it-English-Developer.iso       iso9660-image
    ./amazonmp3                        0.000 debian-package

    ~/Downloads$ ls
    audio           iso9660-image   pdf                     shellscript
    csv             java-archive    plain                   tar
    debian-package  java-jnlp-file  postscript              unknown
    document        message         redhat-package-manager  video
    executable      mixed           ruby                    xml
    image           msdos-program   sh                      zip


Options
-------

### -h ###

Displays a brief overview of the command line arguments accepted by mimesort.

### -d _NUMBER_ ###

Defines the threshold for determining whether or not a folder is classified as
"mixed" instead of the most commonly occurring MIME type. The default threshold
is 0.0.

### -i ###

Tells the script not to try to sort folders that, based on the folder names,
appear to already be sorted.

### -n ###

Displays a list of files, folders, their Shannon index values and
classifications but does not actually move anything around. This is essentially
a dry-run.

### -m ###

Do not use magic file libraries even if they are available. This is useful for
categorizing large quantities of files since the magic file libraries read data
from the file to determine its type, but the classifications will generally be
less accurate.

### -y ###

When mimesort is called without any arguments, it will attempt to sort the
current working directory but prompts the user before doing so. This argument
eliminates the prompt and will sort the current working directory immediately.

--------------------------------------------------------------------------------
/mimesort.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function, division

import errno
import getopt
import math
import mimetypes
import operator
import os
import shutil
import sys

__author__ = "Eric Pruitt"
__license__ = "Public Domain, FreeBSD, or PSF"

__all__ = ['IGNORED_FILES', 'MIXED_TYPES_LABEL', 'UNKNOWN_TYPES_LABEL',
    'categorize', 'classify', 'diversity', 'guess_mime_type', 'organize']

MAX_DIVERSITY = 0.0
MIXED_TYPES_LABEL = 'mixed'
UNKNOWN_TYPES_LABEL = 'unknown'

# If python-magic is installed, it will be used as a fall-back when
# mimetypes.guess_type cannot identify a file. I have tested this with two
# versions of python-magic: the current github master as of 2011.09.26 and the
# package found in the Debian Squeeze repositories. For the latter, we rely on
# monkey-patching.
# The github master can be found at the following address:
# https://github.com/ahupp/python-magic/raw/master/magic.py
try:
    import magic
    if not hasattr(magic, 'from_file'):
        cookie = magic.open(magic.MAGIC_MIME & ~magic.MAGIC_MIME_ENCODING)
        cookie.load()
        magic.from_file = lambda x, mime: cookie.file(x)
    # `__code__` exists on Python 2.6+ and 3; `func_code` is Python 2 only.
    elif 'mime' not in magic.from_file.__code__.co_varnames:
        raise AttributeError

    def guess_mime_type(path):
        """
        Guesses the MIME type of a file based on its extension using
        mimetypes.guess_type and python-magic as a fall-back when the mimetypes
        method is unable to identify the file. Returns a tuple containing the
        MIME type and encoding (usually `None`).
        """
        mimetype, _ = mimetypes.guess_type(path)
        if not mimetype or mimetype.endswith('/octet-stream'):
            magicmime = magic.from_file(path, mime=True)
            if magicmime and '/' not in magicmime:
                return None, None
            return magicmime, None
        return mimetype, _

except Exception as err:
    if isinstance(err, AttributeError):
        print(
            "Available version of python-magic not supported.",
            file=sys.stderr
        )
    guess_mime_type = mimetypes.guess_type


def diversity(elements):
    """
    Computes the Shannon index of diversity for `elements`. The calculated
    index and a list containing the most frequently occurring element or
    elements are returned.
    """
    bucket = dict()
    mode = list()
    maxfreq = 0

    # Tally up the elements in the list.
    for element in elements:
        bucket[element] = bucket.setdefault(element, 0) + 1

        if bucket[element] >= maxfreq:
            if bucket[element] == maxfreq:
                mode.append(element)
            else:
                maxfreq = bucket[element]
                mode = [element]

    total = float(len(elements))
    freqs = [count / total for _, count in bucket.items()]
    diversity = -sum(map(operator.mul, freqs, map(math.log, freqs)))

    # Do not return a negative zero
    return diversity or 0.0, mode


def classify(path, guesser=None):
    """
    Returns a more user-friendly version of a file's MIME type. For most MIME
    types, this function will simply return the primary type, but more
    ambiguous types like 'application' or 'text' are broken down into sub-types
    where possible and certain prefixes stripped. When `guesser` is supplied,
    `classify` will use it to identify the MIME type of a given path. The
    function should require only a single parameter -- the path of the file --
    and return the MIME type and a second value that will be ignored.
    """
    if not guesser:
        guesser = guess_mime_type
    path = os.path.normcase(path)
    mimetype, _ = guesser(path)
    if not mimetype:
        return None

    try:
        mediatype, mediasubtype = mimetype.split('/')
    except ValueError:
        mediatype, mediasubtype = ("application", "octet-stream")

    if mediatype in ('application', 'text'):
        mediatype = mediasubtype
        if mediatype == 'octet-stream':
            return UNKNOWN_TYPES_LABEL

    if mediatype.startswith('vnd.'):
        mediatype = mediatype.split('.')[-1]

    if mediatype.startswith('x-'):
        mediatype = mediatype[2:]

    return mediatype


def categorize(path, maxdiversity=MAX_DIVERSITY):
    """
    When `path` is a directory, two components are used to determine its
    classification: a Shannon index of diversity generated by running
    `classify` on every file found recursively under the directory and the
    `maxdiversity` parameter. When the directory contains files whose
    classifications differ, a Shannon index of diversity greater than
    `maxdiversity` will cause the folder to be classified as `mixed`.
    Otherwise, the most prevalent classification is used.

    If `path` is a file, `categorize` is nothing more than an alias for
    `classify` that returns a file's classification and `None`.
    """
    if os.path.isdir(path):
        categories = list()
        for root, directories, files in os.walk(path):
            # I originally had `categories.extend(map(classify, files))` here,
            # but I need to make sure the classify function has the full path
            # since I have implemented support for python-magic.
            for filename in files:
                categories.append(classify(os.path.join(root, filename)))

        dirdiversity, mode = diversity(categories)
        path = os.path.basename(path)
        if len(mode) == 1 and dirdiversity <= maxdiversity:
            return mode[0], dirdiversity
        return False, dirdiversity
    else:
        return classify(path), None


def organize(folder, dest=False, detectdirs=True, maxdiversity=MAX_DIVERSITY):
    """
    Classifies the files in `folder` and moves like items into appropriately
    named folders. The `dest` parameter is the destination folder for sorted
    files. When `dest` is False, the sorted contents of `folder` remain in
    `folder`. When `dest` is `None`, organize will simply print a list showing
    the Shannon index of diversity and classification for each item in
    `folder`. The `maxdiversity` is passed to `categorize` unchanged.

    When `detectdirs` is True, the function ignores folders that appear to be
    sorted based on the folders' names.
    """
    files = [os.path.join(folder, base) for base in os.listdir(folder)]
    if detectdirs:
        files = [F for F in files if os.path.basename(F) not in IGNORED_FILES]

    # Wrap categorize so I can provide maxdiversity while using map
    _categorize = lambda x: categorize(x, maxdiversity=maxdiversity)

    for path, categorydata in zip(files, map(_categorize, files)):
        category, pathdiversity = categorydata
        if category is None:
            category = UNKNOWN_TYPES_LABEL
        elif category is False:
            category = MIXED_TYPES_LABEL

        displayname = path if len(path) < 34 else path[:7] + '...' \
            + path[-24:]
        if pathdiversity is not None:
            left = '%-34s %.3f' % (displayname, pathdiversity)
        else:
            left = '%-34s      ' % (displayname)

        if len(left) > 40:
            # Truncate spaces starting from the right side of the string
            left = left[::-1].replace(' ', '', len(left) - 50)[::-1]
        print(left + ' ' + category)

        if dest is None:
            continue

        destination = os.path.join(dest or folder, category)
        try:
            os.makedirs(destination)
        except OSError as err:
            if err.errno != errno.EEXIST:
                print('%s: %s' % (destination, err), file=sys.stderr)
                sys.exit(1)

        if os.path.isdir(destination):
            if os.path.samefile(path, destination):
                print('%s: Source is destination.' % path, file=sys.stderr)
            else:
                try:
                    shutil.move(path, destination)
                except Exception as err:
                    # `err.message` is not available on Python 3.
                    print('%s: %s' % (path, err), file=sys.stderr)
        else:
            print('%s: Destination is not a folder.' % path, file=sys.stderr)


def main(args=sys.argv[1:]):
    """
    Entry point when run as a stand-alone script.
    """
    # 'y' must be listed here or gnu_getopt rejects the documented -y flag.
    arguments, trailing = getopt.gnu_getopt(args, 'd:nihmy')
    argdict = dict(arguments)
    # getopt returns option values as strings; compare as a number instead.
    maxdiversity = float(argdict.get('-d', MAX_DIVERSITY))
    if '-h' in argdict:
        print(os.path.basename(__file__), '[OPTIONS] [DIR... '
              '[DEST]]')
        print('\t-h          Display this message and quit')
        print('\t-d NUMBER   Threshold for Shannon diversity index')
        print('\t-i          Ignore folders that appear to be sorted')
        print('\t-n          Display categorizations and exit')
        print('\t-m          Do not use python-magic even if it is available')
        print('\t-y          Do not prompt before sorting the working directory')

    else:
        if '-m' in argdict:
            global guess_mime_type
            guess_mime_type = mimetypes.guess_type

        detectdirs = '-i' not in argdict
        dryrun = '-n' in argdict
        promptuser = '-y' not in argdict

        # Work on a copy so that removing the destination does not disturb the
        # loop. With several directories, the last one is the destination.
        folders = list(trailing) or ['.']
        shareddest = folders.pop() if len(folders) > 1 else None
        for folder in folders:
            dest = shareddest or folder

            if dryrun:
                print('Destination folder: %s' % dest)
                dest = None
            elif not trailing and promptuser:
                try:
                    ask = raw_input  # Python 2: input() would eval the reply
                except NameError:
                    ask = input  # Python 3
                response = ask('Sort %s? [N/y] ' % os.getcwd())
                if not response.strip().lower().startswith('y'):
                    sys.exit(0)

            organize(folder, dest, detectdirs, maxdiversity)


# Generate set containing all possible folder names
IGNORED_FILES = set((MIXED_TYPES_LABEL, UNKNOWN_TYPES_LABEL))
strict, loose = mimetypes.MimeTypes().types_map
for extension in list(strict.keys()) + list(loose.keys()):
    IGNORED_FILES.add(classify('x' + extension, guesser=mimetypes.guess_type))

if __name__ == '__main__':
    main()

--------------------------------------------------------------------------------
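
The label clean-up described in the `classify` docstring can be tried without
touching the file system. The sketch below applies the same steps to a MIME
type string directly. The function name `label_for_mime` and its
`unknown_label` parameter are hypothetical and exist only for this example.

```python
def label_for_mime(mimetype, unknown_label='unknown'):
    """Turn a MIME type string into a short folder label."""
    if not mimetype:
        return None

    try:
        mediatype, subtype = mimetype.split('/')
    except ValueError:
        # Anything that is not "type/subtype" is treated as unidentified.
        mediatype, subtype = 'application', 'octet-stream'

    # 'application' and 'text' are too broad, so fall back to the sub-type.
    if mediatype in ('application', 'text'):
        mediatype = subtype
        if mediatype == 'octet-stream':
            return unknown_label

    # Strip vendor and experimental prefixes.
    if mediatype.startswith('vnd.'):
        mediatype = mediatype.split('.')[-1]
    if mediatype.startswith('x-'):
        mediatype = mediatype[2:]

    return mediatype


for sample in ('image/png', 'text/plain', 'application/x-tar',
               'application/x-debian-package', 'application/octet-stream'):
    print(sample, '->', label_for_mime(sample))
```

Under these assumptions, `image/png` maps to `image`, `application/x-tar` to
`tar`, and `application/octet-stream` to `unknown`, which matches the folder
names shown in the README example.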