├── .gitignore
├── LICENSE
├── README.md
├── organize.py
├── poetry.lock
└── pyproject.toml

/.gitignore:
--------------------------------------------------------------------------------
1 | .env
2 | db/
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2024 Laurie Voss
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # File Organizer
2 | 
3 | A command-line utility that organizes files into folders based on descriptions of their content, while ensuring that no folder is unhelpfully vague or overly specific. It's very obviously inspired by [llama-fs](https://github.com/iyaja/llama-fs); the main differences are:
4 | * It never renames any files; it just puts them into folders, since file names are generally pretty helpful on their own.
5 | * The organization is hopefully much smarter, helping you locate groups of files regardless of their names.
6 | 
7 | ## Usage
8 | 
9 | At the moment the tool has to be run manually in separate stages (see Further Work below).
10 | 
11 | ```bash
12 | python organize.py --describe path_to_folder
13 | ```
14 | 
15 | Reads all the files in `path_to_folder` and gets an LLM to describe them in detail. The descriptions are cached in a folder called `db` as JSON files named after each file's `inode` number, so the cache stays valid even if the files move around the filesystem.
16 | 
17 | ```bash
18 | python organize.py --categorize path_to_folder
19 | ```
20 | 
21 | Reads all the files in `path_to_folder` that have a cached description from the previous step. It then comes up with a proposed category for each one and writes these categories into a file called `db/categorized_paths.json`. **It ignores folders**, on the assumption that if you put something in a folder yourself, that folder is already a pretty good category. It will, however, re-organize files that are in folders it previously created.
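22 | 
23 | For illustration, `db/categorized_paths.json` is a single JSON object mapping each category name to the list of file paths assigned to it; the categories and paths below are made-up examples:
24 | 
25 | ```json
26 | {
27 |     "Invoices": [
28 |         "Documents/acme-march-invoice.pdf",
29 |         "Documents/acme-april-invoice.pdf"
30 |     ],
31 |     "Screenshots": [
32 |         "Desktop/Screenshot 2024-05-01.png"
33 |     ]
34 | }
35 | ```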
36 | 
37 | ```bash
38 | python organize.py --recategorize path_to_folder
39 | ```
40 | 
41 | Reads all the files in `db/categorized_paths.json` and makes sure none of the categories are either too vague or too specific:
42 | * If a category has fewer than 3 files, it will attempt to merge those files into larger existing categories.
43 | * If a single category accounts for 20% or more of the files, it will attempt to split it into more specific categories.
44 | * It will make up to 5 passes over the data to try to get this right (since the first pass might create too many small categories, or another big category).
45 | 
46 | ```bash
47 | python organize.py --move path_to_folder
48 | ```
49 | 
50 | Takes the files listed in `db/categorized_paths.json` and moves them into folders named after their categories, with the suffix "(Auto)". It records which folders it generated in a file called `db/auto_generated_folders.json`; if you run it multiple times and a previously auto-generated folder ends up empty, that folder will be deleted.
51 | 
52 | ## Further work
53 | 
54 | Getting organization to work was the fun part; there's a bunch of grunt work left to do:
55 | 
56 | * Obviously the whole thing runs in stages right now; the expectation is that some higher-level process will take care of deciding when to describe files, when to categorize, when to re-categorize, and when to move. I would expect this to be a cute little Electron desktop app or something. In the meantime, a full manual run is sketched below.
57 | * `categorized_paths.json` does not account for you ever running this on more than one folder! The whole thing needs to be refactored so that `categorized_paths` and `auto_generated_folders` take into account that not every file lives in one folder.
58 | * Not all the files are usefully described. At the moment it can't read WEBP files, so it tends to sort them into a category called "webp files", which isn't very helpful. It also has trouble with some audio and video files.
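59 | 
60 | A full manual run currently looks something like this (the folder path is just an example; it assumes your OpenAI key is available as `OPENAI_API_KEY`, e.g. via the `.env` file that `python-dotenv` loads at startup):
61 | 
62 | ```bash
63 | python organize.py --describe ~/Downloads
64 | python organize.py --categorize ~/Downloads
65 | python organize.py --recategorize ~/Downloads
66 | python organize.py --move ~/Downloads
67 | ```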
68 | 
--------------------------------------------------------------------------------
/organize.py:
--------------------------------------------------------------------------------
1 | from dotenv import load_dotenv
2 | load_dotenv()
3 | import os
4 | import json
5 | import argparse
6 | import tiktoken
7 | from llama_index.core import SimpleDirectoryReader, Document
8 | from llama_index.llms.openai import OpenAI
9 | from llama_index.multi_modal_llms.openai import OpenAIMultiModal
10 | from llama_index.core.schema import ImageDocument
11 | 
12 | MIN_FILES_PER_CATEGORY = 3
13 | 
14 | def get_files_in_folder(folder, recursive=True):
15 |     """Gets a list of all the files in the specified folder, recursively"""
16 |     file_paths = []
17 |     # no auto-generated folders exist until the first --move run
18 |     auto_generated_folders = []
19 |     if os.path.exists("db/auto_generated_folders.json"):
20 |         with open("db/auto_generated_folders.json", 'r', encoding='utf-8') as file:
21 |             auto_generated_folders = json.load(file)
22 | 
23 |     try:
24 |         # List all files and directories in the given folder
25 |         with os.scandir(folder) as entries:
26 |             for entry in entries:
27 |                 path = f"{folder}/{entry.name}"
28 |                 if entry.is_dir():
29 |                     if path in auto_generated_folders:
30 |                         if recursive:
31 |                             file_paths.extend(get_files_in_folder(path))
32 |                         else:
33 |                             print(f"Not recursing into {path}")
34 |                     else:
35 |                         print(f"Skipping manually-created folder: {path}")
36 |                 else:
37 |                     file_paths.append(path)
38 |     except FileNotFoundError:
39 |         print(f"The folder '{folder}' does not exist.")
40 |     except PermissionError:
41 |         print(f"Permission denied to access '{folder}'.")
42 |     return file_paths
43 | 
44 | def slice_until_fits(string, max_tokens):
45 |     """Trims a string until its token count fits within max_tokens"""
46 |     enc = tiktoken.encoding_for_model("gpt-4o")
47 |     while True:
48 |         encoded = enc.encode(string)
49 |         print(f"Number of tokens: {len(encoded)}")
50 |         if len(encoded) > 100000:  # like, WAY too long
51 |             string = string[-100000:]  # get the last 100k chars
52 |         elif len(encoded) > max_tokens:
53 |             print("Message too long, slicing it down")
54 |             string = string[:-10000]  # remove the last 10k chars to shorten it
55 |         else:
56 |             return string
57 | 
58 | def describe_file(file_path):
59 |     """Reads a file and gets a description of it from an LLM"""
60 |     print(f"Describing file: {file_path}")
61 | 
62 |     stat_info = os.stat(file_path)
63 |     inode = stat_info.st_ino
64 | 
65 |     # have we previously described this file?
66 |     metadata_path = f"db/{str(inode)}.json"
67 |     if os.path.exists(metadata_path):
68 |         print(f"Already have a description for file: {file_path}")
69 |         with open(metadata_path, 'r', encoding='utf-8') as file:
70 |             data = json.load(file)
71 |             return data['description']
72 | 
73 |     # we haven't, so read it in and describe it
74 |     reader = SimpleDirectoryReader(input_files=[file_path])
75 |     documents = reader.load_data()
76 |     if len(documents) == 0:
77 |         print(f"Failed to load document for file: {file_path}")
78 |         return None
79 |     document = documents[0]
80 | 
81 |     llm = OpenAI(model="gpt-4o")
82 |     mm_llm = OpenAIMultiModal(model="gpt-4o")
83 | 
84 |     if isinstance(document, ImageDocument):
85 |         response = mm_llm.complete(
86 |             prompt="""Describe the contents of this file, and suggest some possible categories
87 |             that it might fit into. Some categories might include 'screenshot', 'diagram', 'illustration'""",
88 |             image_documents=documents
89 |         )
90 |         print("Image document response:")
91 |         print(response)
92 |     elif isinstance(document, Document):
93 |         fit_text = slice_until_fits(document.text, 10000)  # fit this into 10k tokens or so
94 | 
95 |         response = llm.complete(
96 |             prompt=f"""Describe the contents of this file, and suggest some possible categories
97 |             that it might fit into. Some categories might include 'blog post', 'text', 'code', 'data'.
98 |             The text of the document follows:
99 |             {fit_text}"""
100 |         )
101 |         print("Text document response:")
102 |         print(response)
103 |     else:
104 |         # we don't know how to describe this kind of document
105 |         print(f"Unsupported document type for file: {file_path}")
106 |         return None
107 | 
108 |     # save the description to a file
109 |     with open(metadata_path, 'w', encoding='utf-8') as file:
110 |         json.dump({'description': str(response)}, file)
111 |     return str(response)
112 | 
113 | def describe_files(folder):
114 |     file_paths = get_files_in_folder(folder)
115 |     for file_path in file_paths:
116 |         describe_file(file_path)
117 |         # TODO: the stat stuff should probably be in here instead
118 | 
119 | def categorize_file(description, existing_categories):
120 |     llm = OpenAI(model="gpt-4o")
121 | 
122 |     prompt = f"""You are sorting files into categories. Below is a list of categories you have
123 | already used (there might be none):
124 | 
125 | {json.dumps(list(existing_categories.keys()), indent=4, sort_keys=True)}
126 | 
127 | Now, the following is a description of a new file we want to add to the set. It includes some suggested
128 | categories for the file based on its contents. Return a suggested category for the file. You should have
129 | a bias towards putting files into categories that already exist, but if there are no good categories you can
130 | return a new one. The file appears between --- and --- below:
131 | 
132 | ---
133 | {description}
134 | ---
135 | 
136 | Return JUST the category name and nothing else.
137 | """
138 |     response = llm.complete(prompt)
139 |     return str(response)
140 | 
141 | def recategorize_file_narrower(description, existing_categories):
142 |     llm = OpenAI(model="gpt-4o")
143 | 
144 |     prompt = f"""You are sorting files into categories. Below is a list of categories you have already used:
145 | 
146 | {json.dumps(list(existing_categories.keys()), indent=4, sort_keys=True)}
147 | 
148 | Now, the following is a description of a new file we want to add to the set. It includes some suggested
149 | categories for the file based on its contents. Previously, you categorized this file into too broad a category,
150 | so when you try to categorize it this time be more specific than the existing categories are. The file appears between --- and --- below:
151 | 
152 | ---
153 | {description}
154 | ---
155 | 
156 | Return JUST the category name and nothing else.
157 | """
158 |     response = llm.complete(prompt)
159 |     return str(response)
160 | 
161 | def recategorize_file_broader(description, existing_categories):
162 |     llm = OpenAI(model="gpt-4o")
163 | 
164 |     prompt = f"""You are sorting files into categories. Below is a list of categories you have already used:
165 | 
166 | {json.dumps(list(existing_categories.keys()), indent=4, sort_keys=True)}
167 | 
168 | Now, the following is a description of a new file we want to add to the set. It includes some suggested
169 | categories for the file based on its contents. Previously, you categorized this file into too small a category,
170 | so when you try to categorize it this time be a bit more general, favoring one of the existing categories. The file appears between --- and --- below:
171 | 
172 | ---
173 | {description}
174 | ---
175 | 
176 | Return JUST the category name and nothing else.
177 | """
178 |     response = llm.complete(prompt)
179 |     return str(response)
180 | 
181 | def categorize_file_list(file_paths, categorized, recategorize=None):
182 |     for file_path in file_paths:
183 |         print(f"Categorizing: {file_path}")
184 |         stat_info = os.stat(file_path)
185 |         inode = stat_info.st_ino
186 |         metadata_path = f"db/{str(inode)}.json"
187 |         with open(metadata_path, 'r', encoding='utf-8') as file:
188 |             data = json.load(file)
189 |         if recategorize == "broader":
190 |             category = recategorize_file_broader(data['description'], categorized)
191 |         elif recategorize == "narrower":
192 |             category = recategorize_file_narrower(data['description'], categorized)
193 |         else:
194 |             category = categorize_file(data['description'], categorized)
195 |         print(f"Suggested category: {category}")
196 |         # add the file to the category tree
197 |         if category in categorized:
198 |             categorized[category].append(file_path)
199 |         else:
200 |             categorized[category] = [file_path]
201 |     print(json.dumps(categorized, indent=4, sort_keys=True))
202 |     return categorized
203 | 
204 | def categorize_files(folder):
205 |     categorized = {}
206 |     # find all the files we have descriptions of and do a first pass
207 |     file_paths = get_files_in_folder(folder)
208 |     described_file_paths = []
209 |     for file_path in file_paths:
210 |         stat_info = os.stat(file_path)
211 |         inode = stat_info.st_ino
212 | 
213 |         # have we previously described this file? If not, we can't categorize it
214 |         metadata_path = f"db/{str(inode)}.json"
215 |         if os.path.exists(metadata_path):
216 |             described_file_paths.append(file_path)
217 |     categorized = categorize_file_list(described_file_paths, categorized)
218 |     # FIXME: this assumes there's only one folder ever
219 |     with open("db/categorized_paths.json", 'w', encoding='utf-8') as file:
220 |         json.dump(categorized, file)
221 | 
222 | def recategorize_files_once(categorized):
223 |     total_files = sum(len(v) for v in categorized.values())
224 |     original_categories = categorized.copy()
225 |     for category in original_categories:
226 |         if len(original_categories[category]) < MIN_FILES_PER_CATEGORY:
227 |             print(f"Category {category} has fewer than {MIN_FILES_PER_CATEGORY} files. Recategorizing.")
228 |             files_to_recategorize = categorized[category]
229 |             del categorized[category]
230 |             categorized = categorize_file_list(files_to_recategorize, categorized, recategorize="broader")
231 |         elif len(original_categories[category]) > (total_files / 5):
232 |             print(f"Category {category} has more than 20% of the files. Recategorizing.")
233 |             files_to_recategorize = categorized[category]
234 |             del categorized[category]
235 |             categorized = categorize_file_list(files_to_recategorize, categorized, recategorize="narrower")
236 |     return categorized
237 | 
238 | def needs_recategorization(categorized):
239 |     total_files = sum(len(v) for v in categorized.values())
240 |     for category in categorized:
241 |         if len(categorized[category]) < MIN_FILES_PER_CATEGORY:
242 |             return True
243 |         if len(categorized[category]) > (total_files / 5):
244 |             return True
245 |     return False
246 | 
247 | def recategorize_files():
248 |     with open("db/categorized_paths.json", 'r', encoding='utf-8') as file:
249 |         categorized = json.load(file)
250 |     passes = 0
251 |     while needs_recategorization(categorized) and passes < 5:
252 |         categorized = recategorize_files_once(categorized)
253 |         passes += 1
254 |     print("---- All done: ----")
255 |     print(json.dumps(categorized, indent=4, sort_keys=True))
256 |     with open("db/categorized_paths.json", 'w', encoding='utf-8') as file:
257 |         json.dump(categorized, file)
Recategorizing.") 229 | files_to_recategorize = categorized[category] 230 | del categorized[category] 231 | categorized = categorize_file_list(files_to_recategorize, categorized, recategorize="narrower") 232 | return categorized 233 | 234 | def needs_recategorization(categorized): 235 | total_files = sum(len(v) for v in categorized.values()) 236 | for category in categorized: 237 | if len(categorized[category]) < MIN_FILES_PER_CATEGORY: 238 | return True 239 | if len(categorized[category]) > (total_files / 5): 240 | return True 241 | return False 242 | 243 | def recategorize_files(): 244 | with open("db/categorized_paths.json", 'r', encoding='utf-8') as file: 245 | categorized = json.load(file) 246 | passes = 0 247 | while needs_recategorization(categorized) and passes < 5: 248 | categorized = recategorize_files_once(categorized) 249 | passes += 1 250 | print("---- All done: ----") 251 | print(json.dumps(categorized, indent=4, sort_keys=True)) 252 | with open("db/categorized_paths.json", 'w', encoding='utf-8') as file: 253 | json.dump(categorized, file) 254 | 255 | def move_files(base_path): 256 | with open("db/categorized_paths.json", 'r', encoding='utf-8') as file: 257 | categorized = json.load(file) 258 | auto_generated_folders = [] 259 | for category in categorized: 260 | category_path = f"{base_path}/{category} (Auto)" 261 | if not os.path.exists(category_path): 262 | os.makedirs(category_path) 263 | for file_path in categorized[category]: 264 | file_name = file_path.split("/")[-1] 265 | new_path = f"{category_path}/{file_name}" 266 | os.rename(file_path, new_path) 267 | print(f"Moved {file_path} to {new_path}") 268 | auto_generated_folders.append(category_path) 269 | # clean up any empty folders 270 | with open("db/auto_generated_folders.json", 'r', encoding='utf-8') as file: 271 | previous_auto_generated_folders = json.load(file) 272 | # get all the folders that are in previous_auto_generated_folders but not in auto_generated_folders 273 | folders_to_remove = [folder for folder in previous_auto_generated_folders if folder not in auto_generated_folders] 274 | for folder in folders_to_remove: 275 | os.rmdir(folder) 276 | print(f"Removed empty folder: {folder}") 277 | # write the auto-generated folders list 278 | with open("db/auto_generated_folders.json", 'w', encoding='utf-8') as file: 279 | json.dump(auto_generated_folders, file) 280 | 281 | if __name__ == "__main__": 282 | parser = argparse.ArgumentParser(description="Describe or categorize the files in a folder") 283 | parser.add_argument("path", help="The path to the folder.") 284 | parser.add_argument("--describe", action="store_true", help="Describe the contents of the directory.") 285 | parser.add_argument("--categorize", action="store_true", help="Categorize the contents of the directory.") 286 | parser.add_argument("--recategorize", action="store_true", help="Recategorize the contents of the directory.") 287 | parser.add_argument("--move", action="store_true", help="Move files into categorized folders.") 288 | 289 | args = parser.parse_args() 290 | if args.describe: 291 | describe_files(args.path) 292 | elif args.categorize: 293 | categorize_files(args.path) 294 | elif args.recategorize: 295 | recategorize_files() 296 | elif args.move: 297 | move_files(args.path) 298 | -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "organizer" 3 | version = "0.1.0" 4 | description = 
"" 5 | authors = ["Laurie Voss "] 6 | readme = "README.md" 7 | 8 | [tool.poetry.dependencies] 9 | python = "^3.11" 10 | llama-index-core = "^0.10.43.post1" 11 | llama-index-llms-groq = "^0.1.4" 12 | llama-index-llms-openai = "^0.1.22" 13 | python-dotenv = "^1.0.1" 14 | llama-index-agent-lats = "^0.1.0" 15 | llama-index-multi-modal-llms-openai = "^0.1.6" 16 | llama-index-readers-file = "^0.1.23" 17 | nbconvert = "^7.16.4" 18 | docx2txt = "^0.8" 19 | 20 | 21 | [build-system] 22 | requires = ["poetry-core"] 23 | build-backend = "poetry.core.masonry.api" 24 | --------------------------------------------------------------------------------