├── .gitignore
├── LICENSE
├── README.md
├── organize.py
├── poetry.lock
└── pyproject.toml


/.gitignore:
--------------------------------------------------------------------------------
1 | .env
2 | db/
3 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2024 Laurie Voss
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # File Organizer
 2 | 
 3 | A command-line utility that organizes files into folders based on descriptions of their content, while ensuring that no folder is unhelpfully vague or overly specific. Very obviously inspired by [llama-fs](https://github.com/iyaja/llama-fs), the main differences being:
 4 | * It never renames any files, it just puts them into folders, file names are generally pretty helpful.
 5 | * The organization is hopefully much smarter, helping you locate groups of files regardless of their names
 6 | 
 7 | ## Usage
 8 | 
 9 | At the moment the tool has to be run manually in separate stages (see Further Work below).
10 | 
11 | ```bash
12 | python organize.py --describe path_to_folder
13 | ```
14 | 
15 | Reads all the files in `path_to_folder` and gets an LLM to describe them in detail. These descriptions are cached in a folder called `db` as JSON files based on the `inode` ID of the file, so even if they move around the filesystem, the cache will still be valid.
16 | 
17 | ```bash
18 | python organize.py --categorize path_to_folder
19 | ```
20 | 
21 | Reads all the files in `path_to_folder` that have a cached description from the previous step. It then comes up with a proposed category for each one and writes these categories into a file called `db/categorized_paths.json`. **It ignores folders**, assuming if you put something in a folder that's already a pretty good category. It will however re-organize files that are in folders it previously created.
22 | 
23 | ```bash
24 | python organize.py --recategorize path_to_folder
25 | ```
26 | 
27 | Reads all the files in `db/categorized_paths.json` and makes sure none of the categories are either too vague or too specific:
28 | * If a category has 3 or fewer files, it will attempt to merge them into larger existing categories
29 | * If a single category accounts for 20% or more of the files, it will attempt to split them into more specific categories
30 | * It will make up to 5 passes over the data to try and get this right (since the first pass might create too many small categories, or another big category).
31 | 
32 | ```bash
33 | python organize.py --move path_to_folder
34 | ```
35 | 
36 | Takes the files described in `db/categorized_paths.json` and moves them into folders named after their categories with the suffix "(Auto)". It will record which folders it generated in a file called `db/auto_generated_folders.json`; if you run it multiple times and it ends up emptying out an auto-generated folder that folder will be deleted.
37 | 
38 | ## Further work
39 | 
40 | Getting organization to work was the fun part, there's a bunch of grunt work to do:
41 | 
42 | * Obviously the whole thing runs in stages right now; the expectation is that some higher-level process will be taking care of deciding when to describe files, when to categorize, when to re-categorize, and when to move. I would expect this to be a cute little Electron desktop app or something.
43 | * `categorized_paths.json` does not account for you ever running this on more than one folder! The whole thing needs to be refactored so that `categorized_paths` and `auto_generated_folders` take into account that not every file lives in one folder.
44 | * Not all the files are usefully described. At the moment it can't read WEBP files, so it tends to sort them into a category called "webp files" which isn't very helpful. It also has trouble with some audio and video files.
45 | 


--------------------------------------------------------------------------------
/organize.py:
--------------------------------------------------------------------------------
  1 | from dotenv import load_dotenv
  2 | load_dotenv()
  3 | import os
  4 | import sys
  5 | import json
  6 | import argparse
  7 | import tiktoken
  8 | import pprint
  9 | from llama_index.core import Settings, SimpleDirectoryReader, Document
 10 | from llama_index.llms.openai import OpenAI
 11 | from llama_index.multi_modal_llms.openai import OpenAIMultiModal
 12 | from llama_index.core.agent import AgentRunner
 13 | from llama_index.agent.lats import LATSAgentWorker
 14 | from llama_index.core.schema import ImageDocument
 15 | 
 16 | MIN_FILES_PER_CATEGORY = 3
 17 | 
 18 | def get_files_in_folder(folder, recursive=True):
 19 |     """Gets a list of all the files in specified folder, recursively"""
 20 |     file_paths = []
 21 |     with open("db/auto_generated_folders.json", 'r', encoding='utf-8') as file:
 22 |         auto_generated_folders = json.load(file)
 23 |     
 24 |     try:
 25 |         # List all files and directories in the given folder
 26 |         with os.scandir(folder) as entries:
 27 |             for entry in entries:
 28 |                 path = f"{folder}/{entry.name}"
 29 |                 if entry.is_dir():
 30 |                     if (path in auto_generated_folders):
 31 |                         if(recursive):
 32 |                             file_paths.extend(get_files_in_folder(path))
 33 |                         else:
 34 |                             print(f"Not recursing into {path}")
 35 |                     else:
 36 |                         print(f"Skipping manually-created folder: {path}")
 37 |                 else:
 38 |                     file_paths.append(path)
 39 |     except FileNotFoundError:
 40 |         print(f"The folder '{folder}' does not exist.")
 41 |     except PermissionError:
 42 |         print(f"Permission denied to access '{folder}'.")
 43 |     return file_paths
 44 | 
 45 | def sliceUntilFits(string, max_tokens):
 46 |     enc = tiktoken.encoding_for_model("gpt-4o")
 47 |     while True:
 48 |         encoded = enc.encode(string)    
 49 |         print(f"Number of tokens: {len(encoded)}")
 50 |         if len(encoded) > 100000: # like, WAY too long
 51 |             string = string[-100000:] # get the last 100k chars
 52 |         elif len(encoded) > max_tokens:
 53 |             print("Message too long, slicing it down")
 54 |             string = string[:-10000] # remove the last 10k chars to shorten it
 55 |         else:
 56 |             return string
 57 | 
 58 | def describe_file(file_path):
 59 |     """Reads a file and gets a description of it from an LLM"""
 60 |     print(f"Describing file: {file_path}")
 61 | 
 62 |     stat_info = os.stat(file_path)
 63 |     inode = stat_info.st_ino
 64 | 
 65 |     # have we previously described this file?
 66 |     metadata_path = f"db/{str(inode)}.json"
 67 |     if(os.path.exists(metadata_path)):
 68 |         print(f"Already have a description for file: {file_path}")
 69 |         with open(metadata_path, 'r', encoding='utf-8') as file:
 70 |             data = json.load(file)
 71 |             return data['description']
 72 | 
 73 |     # we haven't, so read it in and describe it
 74 |     reader = SimpleDirectoryReader(input_files=[file_path])
 75 |     documents = reader.load_data()
 76 |     if (len(documents) == 0):
 77 |         print(f"Failed to load document for file: {file_path}")
 78 |         return None
 79 |     document = documents[0]
 80 | 
 81 |     llm = OpenAI(model="gpt-4o")
 82 |     mm_llm = OpenAIMultiModal(model="gpt-4o")
 83 | 
 84 |     if isinstance(document, ImageDocument):
 85 |         response = mm_llm.complete(
 86 |             prompt="""Describe the contents of this file, and suggest some possible categories 
 87 |             that it might fit into. Some categories might include 'screenshot', 'diagram', 'illustration'""",
 88 |             image_documents=documents
 89 |         )
 90 |         print("Image document response:")
 91 |         print(response)
 92 |     elif isinstance(document, Document):
 93 |         fit_text = sliceUntilFits(document.text, 10000) # fit this into 10k tokens or so
 94 |         
 95 |         response = llm.complete(
 96 |             prompt=f"""Describe the contents of this file, and suggest some possible categories 
 97 |             that it might fit into. Some categories might include 'blog post', 'text', 'code', 'data'.
 98 |             The text of the document follows:
 99 |             {fit_text}"""
100 |         )
101 |         print("Text document response:")
102 |         print(response)
103 | 
104 |     # save the description to a file
105 |     with open(metadata_path, 'w', encoding='utf-8') as file:
106 |         json.dump({'description': str(response)}, file)
107 |     return str(response)
108 | 
109 | def describe_files(folder):
110 |     file_paths = get_files_in_folder(folder)
111 |     for file_path in file_paths:
112 |         description = describe_file(file_path)
113 |         # TODO: the stat stuff should probably be in here instead
114 | 
115 | def categorize_file(description,existing_categories):
116 |     llm = OpenAI(model="gpt-4o")
117 | 
118 |     prompt = f"""You are sorting files into categories. Below is a list of categories you have 
119 |         already used (there might be none):
120 |         
121 |         {json.dumps(list(existing_categories.keys()), indent=4, sort_keys=True)}
122 | 
123 |         Now, the following is a description of a new file we want to add to the set. It includes some suggested
124 |         categories for the file based on its contents. Return a suggested category for the file. You should have 
125 |         a bias towards putting files into categories that already exist, but if there are no good categories you can
126 |         return a new one. The file appears between --- and --- below:
127 | 
128 |         ---
129 |         {description}
130 |         ---
131 | 
132 |         Return JUST the category name and nothing else.
133 |     """
134 |     response = llm.complete(prompt)
135 |     return str(response)
136 | 
137 | def recategorize_file_narrower(description,existing_categories):
138 |     llm = OpenAI(model="gpt-4o")
139 | 
140 |     prompt = f"""You are sorting files into categories. Below is a list of categories you have already used:
141 |         
142 |         {json.dumps(list(existing_categories.keys()), indent=4, sort_keys=True)}
143 | 
144 |         Now, the following is a description of a new file we want to add to the set. It includes some suggested
145 |         categories for the file based on its contents. Previously, you categorized this file into too broad a category, 
146 |         so when you try to categorize it this time be more specific than the existing categories are. The file appears between --- and --- below:
147 | 
148 |         ---
149 |         {description}
150 |         ---
151 | 
152 |         Return JUST the category name and nothing else.
153 |     """
154 |     response = llm.complete(prompt)
155 |     return str(response)
156 | 
157 | def recategorize_file_broader(description,existing_categories):
158 |     llm = OpenAI(model="gpt-4o")
159 | 
160 |     prompt = f"""You are sorting files into categories. Below is a list of categories you have already used:
161 |         
162 |         {json.dumps(list(existing_categories.keys()), indent=4, sort_keys=True)}
163 | 
164 |         Now, the following is a description of a new file we want to add to the set. It includes some suggested
165 |         categories for the file based on its contents. Previously, you categorized this file into too small a category, 
166 |         so when you try to categorize it this time be a bit more general, favoring one of the existing categories. The file appears between --- and --- below:
167 | 
168 |         ---
169 |         {description}
170 |         ---
171 | 
172 |         Return JUST the category name and nothing else.
173 |     """
174 |     response = llm.complete(prompt)
175 |     return str(response)
176 | 
177 | def categorize_file_list(file_paths,categorized, recategorize=None):
178 |     for file_path in file_paths:
179 |         print(f"Categorizing: {file_path}")
180 |         stat_info = os.stat(file_path)
181 |         inode = stat_info.st_ino
182 |         metadata_path = f"db/{str(inode)}.json"
183 |         with open(metadata_path, 'r', encoding='utf-8') as file:
184 |             data = json.load(file)
185 |         if recategorize == "broader":
186 |             category = recategorize_file_broader(data['description'], categorized)
187 |         if recategorize == "narrower":
188 |             category = recategorize_file_narrower(data['description'], categorized)
189 |         else:
190 |             category = categorize_file(data['description'], categorized)
191 |         print(f"Suggested category: {category}")
192 |         # add the inode to the category tree
193 |         if (category in categorized):
194 |             categorized[category].append(file_path)
195 |         else:
196 |             categorized[category] = [file_path]
197 |     print(json.dumps(categorized, indent=4, sort_keys=True))
198 |     return categorized
199 | 
200 | def categorize_files(folder):
201 |     categorized = {}
202 |     # find all the files we have descriptions of and do a first pass
203 |     file_paths = get_files_in_folder(folder)
204 |     described_file_paths = []
205 |     for file_path in file_paths:
206 |         stat_info = os.stat(file_path)
207 |         inode = stat_info.st_ino
208 | 
209 |         # have we previously described this file? If not, we can't categorize it
210 |         metadata_path = f"db/{str(inode)}.json"
211 |         if(os.path.exists(metadata_path)):
212 |             described_file_paths.append(file_path)
213 |     categorized = categorize_file_list(described_file_paths, categorized)
214 |     # FIXME: this assumes there's only one folder ever
215 |     with open("db/categorized_paths.json", 'w', encoding='utf-8') as file:
216 |         json.dump(categorized, file)
217 | 
218 | def recategorize_files_once(categorized):
219 |     total_files = sum(len(v) for v in categorized.values())
220 |     original_categories = categorized.copy()
221 |     for category in original_categories:
222 |         if len(original_categories[category]) < MIN_FILES_PER_CATEGORY:
223 |             print(f"Category {category} has fewer than {MIN_FILES_PER_CATEGORY} files. Recategorizing.")
224 |             files_to_recategorize = categorized[category]
225 |             del categorized[category]
226 |             categorized = categorize_file_list(files_to_recategorize, categorized, recategorize="broader")
227 |         elif len(original_categories[category]) > (total_files / 5):
228 |             print(f"Category {category} has more than 20% of the files. Recategorizing.")
229 |             files_to_recategorize = categorized[category]
230 |             del categorized[category]
231 |             categorized = categorize_file_list(files_to_recategorize, categorized, recategorize="narrower")
232 |     return categorized
233 | 
234 | def needs_recategorization(categorized):
235 |     total_files = sum(len(v) for v in categorized.values())
236 |     for category in categorized:
237 |         if len(categorized[category]) < MIN_FILES_PER_CATEGORY:
238 |             return True
239 |         if len(categorized[category]) > (total_files / 5):
240 |             return True
241 |     return False
242 | 
243 | def recategorize_files():
244 |     with open("db/categorized_paths.json", 'r', encoding='utf-8') as file:
245 |         categorized = json.load(file)
246 |     passes = 0
247 |     while needs_recategorization(categorized) and passes < 5:
248 |         categorized = recategorize_files_once(categorized)
249 |         passes += 1
250 |     print("---- All done: ----")
251 |     print(json.dumps(categorized, indent=4, sort_keys=True))
252 |     with open("db/categorized_paths.json", 'w', encoding='utf-8') as file:
253 |         json.dump(categorized, file)
254 | 
255 | def move_files(base_path):
256 |     with open("db/categorized_paths.json", 'r', encoding='utf-8') as file:
257 |         categorized = json.load(file)
258 |     auto_generated_folders = []
259 |     for category in categorized:
260 |         category_path = f"{base_path}/{category} (Auto)"
261 |         if not os.path.exists(category_path):
262 |             os.makedirs(category_path)
263 |         for file_path in categorized[category]:
264 |             file_name = file_path.split("/")[-1]
265 |             new_path = f"{category_path}/{file_name}"
266 |             os.rename(file_path, new_path)
267 |             print(f"Moved {file_path} to {new_path}")
268 |             auto_generated_folders.append(category_path)
269 |     # clean up any empty folders
270 |     with open("db/auto_generated_folders.json", 'r', encoding='utf-8') as file:
271 |         previous_auto_generated_folders = json.load(file)
272 |     # get all the folders that are in previous_auto_generated_folders but not in auto_generated_folders
273 |     folders_to_remove = [folder for folder in previous_auto_generated_folders if folder not in auto_generated_folders]
274 |     for folder in folders_to_remove:
275 |         os.rmdir(folder)
276 |         print(f"Removed empty folder: {folder}")
277 |     # write the auto-generated folders list
278 |     with open("db/auto_generated_folders.json", 'w', encoding='utf-8') as file:
279 |         json.dump(auto_generated_folders, file)
280 | 
281 | if __name__ == "__main__":
282 |     parser = argparse.ArgumentParser(description="Describe or categorize the files in a folder")
283 |     parser.add_argument("path", help="The path to the folder.")
284 |     parser.add_argument("--describe", action="store_true", help="Describe the contents of the directory.")
285 |     parser.add_argument("--categorize", action="store_true", help="Categorize the contents of the directory.")
286 |     parser.add_argument("--recategorize", action="store_true", help="Recategorize the contents of the directory.")
287 |     parser.add_argument("--move", action="store_true", help="Move files into categorized folders.")
288 |     
289 |     args = parser.parse_args()
290 |     if args.describe:
291 |         describe_files(args.path)
292 |     elif args.categorize:
293 |         categorize_files(args.path)
294 |     elif args.recategorize:
295 |         recategorize_files()
296 |     elif args.move:
297 |         move_files(args.path)
298 | 


--------------------------------------------------------------------------------
/pyproject.toml:
--------------------------------------------------------------------------------
 1 | [tool.poetry]
 2 | name = "organizer"
 3 | version = "0.1.0"
 4 | description = ""
 5 | authors = ["Laurie Voss <github@seldo.com>"]
 6 | readme = "README.md"
 7 | 
 8 | [tool.poetry.dependencies]
 9 | python = "^3.11"
10 | llama-index-core = "^0.10.43.post1"
11 | llama-index-llms-groq = "^0.1.4"
12 | llama-index-llms-openai = "^0.1.22"
13 | python-dotenv = "^1.0.1"
14 | llama-index-agent-lats = "^0.1.0"
15 | llama-index-multi-modal-llms-openai = "^0.1.6"
16 | llama-index-readers-file = "^0.1.23"
17 | nbconvert = "^7.16.4"
18 | docx2txt = "^0.8"
19 | 
20 | 
21 | [build-system]
22 | requires = ["poetry-core"]
23 | build-backend = "poetry.core.masonry.api"
24 | 


--------------------------------------------------------------------------------