├── src
    ├── repo2txt
    │   ├── __init__.py
    │   ├── config.json
    │   └── repo2txt.py
    └── .DS_Store
├── .DS_Store
├── LICENSE
└── README.md


/src/repo2txt/__init__.py:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/donoceidon/repo2txt/HEAD/.DS_Store
--------------------------------------------------------------------------------
/src/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/donoceidon/repo2txt/HEAD/src/.DS_Store
--------------------------------------------------------------------------------
/src/repo2txt/config.json:
--------------------------------------------------------------------------------
{
    "image_extensions": [".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".svg"],
    "video_extensions": [".mp4", ".avi", ".mov", ".mkv", ".flv", ".wmv"],
    "audio_extensions": [".mp3", ".wav", ".aac", ".flac", ".ogg"],
    "document_extensions": [".pdf", ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx", ".jar", ".zip", ".tar", ".gzip"],
    "executable_extensions": [".exe", ".dll", ".bin", ".sh", ".bat"],
    "settings_extensions": [".ini", ".cfg", ".conf", ".json", ".yaml", ".yml"],
    "additional_ignore_types": [".lock"],
    "default_output_file": "output.txt"
}
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) [2023] [Jack Krosinsnki]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# repo2txt

## Overview
`repo2txt` is a Python tool I assembled to help streamline the process of preparing codebase training data for GPT-style models (LLMs). It's especially helpful for passing a codebase to a GPT. This script automates the task of compiling assets from a project or repository into a single, comprehensive text file or Word document. The resulting file includes a hierarchical tree of the directory structure and the contents of each file.

## Features
- **Directory/File Tree**: Generates a detailed overview of the repository's directory and file structure.
- **File Contents**: Includes the content of each file, offering a comprehensive view into the code or text within the repository.
- **Output Formats**: Supports output in both `.txt` and `.docx` formats.
- **Customizable Ignoring Mechanism**: Provides options to ignore specific file types, individual files, and directories, allowing for a tailored documentation process.
- **Command-Line Flexibility**: Various command-line arguments are available to customize the script's output according to the user's needs.

## Suggested Installation
For ease of use, `repo2txt` can be installed via pip:

```bash
pip install repo2txt
```

Alternatively, you can run the `repo2txt.py` script directly. Be sure to install `python-docx` if you use this method.

## Usage

Run the script from the command line by specifying the path to the repository and the desired output file name. For example:

```bash
python repo2txt.py -r [path_to_repo] -o [output_file_name]
```

Replace `[path_to_repo]` with the path to your repository and `[output_file_name]` with your desired output file name (including the `.txt` or `.docx` extension).

By default, if no path is specified, the script operates in the current directory. Similarly, if no output file name is provided, it defaults to `output.txt`.

### Optional Command-Line Arguments:

- `-r`, `--repo_path`: Specify the path to the repository. Defaults to the current directory if not specified.
- `-o`, `--output_file`: Name for the output file. Defaults to "output.txt".
- `--ignore-files`: List of file names to ignore (e.g., `--ignore-files file1.txt file2.txt`). Specify 'none' to ignore no files.
- `--ignore-types`: List of file extensions to ignore (e.g., `--ignore-types .log .tmp`). Defaults to a predefined list in `config.json`. Specify 'none' to ignore no types.
- `--exclude-dir`: List of directory names to exclude (e.g., `--exclude-dir dir1 dir2`). Specify 'none' to exclude no directories.
- `--ignore-settings`: Flag to ignore common settings files.
- `--include-dir`: Include only a specific directory and its contents (e.g., `--include-dir src`).

### Examples

1. **Documenting a Repository to a Text File**:
   ```bash
   python repo2txt.py -r /path/to/repository -o output.txt
   ```

2. **Documenting with Exclusions**:
   ```bash
   python repo2txt.py -r /path/to/repository -o output.docx --ignore-types .log .tmp --exclude-dir tests
   ```

## Contributing
Contributions to enhance `repo2txt` are always welcome. Feel free to fork the repository, make your improvements, and submit a pull request.

--------------------------------------------------------------------------------
/src/repo2txt/repo2txt.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

import os
import argparse
import json
from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_PARAGRAPH_ALIGNMENT

def load_config(file_path):
    """
    Load configuration from a JSON file.

    Args:
        file_path (str): The path to the JSON configuration file.

    Returns:
        dict: A dictionary containing the configuration settings.
    """
    with open(file_path, 'r') as file:
        return json.load(file)

# Usage
# Resolve config.json relative to this script rather than the current working
# directory, so the tool can be run from any location.
config = load_config(os.path.join(os.path.dirname(os.path.abspath(__file__)), 'config.json'))
IMAGE_EXTENSIONS = config["image_extensions"]
VIDEO_EXTENSIONS = config["video_extensions"]
AUDIO_EXTENSIONS = config["audio_extensions"]
DOCUMENT_EXTENSIONS = config["document_extensions"]
EXECUTABLE_EXTENSIONS = config["executable_extensions"]
SETTINGS_EXTENSIONS = config["settings_extensions"]
ADDITIONAL_IGNORE_TYPES = config["additional_ignore_types"]
DEFAULT_OUTPUT_FILE = config["default_output_file"]

def parse_args():
    """
    Parse command-line arguments for the script.

    Returns:
        argparse.Namespace: An object containing the parsed command-line arguments.
    """

    parser = argparse.ArgumentParser(
        description='Document the structure of a GitHub repository.',
        epilog=('Example usage:\n'
                'Direct script invocation:\n'
                '  python repo2txt.py -r /path/to/repo -o output.txt # Save as text\n'
                '  python repo2txt.py -r /path/to/repo -o report.docx # Save as DOCX\n'
                'When installed with pip as a command-line tool:\n'
                '  repo2txt -r /path/to/repo -o output.txt # Save as text\n'
                '  repo2txt -r /path/to/repo -o report.docx # Save as DOCX\n\n'
                'Note: Specify the output file format by choosing the appropriate file extension (.txt or .docx).'),
        formatter_class=argparse.RawDescriptionHelpFormatter
    )

    parser.add_argument('-r', '--repo_path', default=os.getcwd(),
                        help='Path to the directory to process (i.e., a cloned repo). If no path is specified, defaults to the current directory.')
    parser.add_argument('-o', '--output_file', default=DEFAULT_OUTPUT_FILE,
                        help='Name for the output file (.txt or .docx). Defaults to "output.txt".')
    parser.add_argument('--ignore-files', nargs='*', default=[],
                        help='List of file names to ignore. Omit this argument or specify "none" to ignore no file names.')
    parser.add_argument('--ignore-types', nargs='*', default=IMAGE_EXTENSIONS + VIDEO_EXTENSIONS + AUDIO_EXTENSIONS + DOCUMENT_EXTENSIONS + EXECUTABLE_EXTENSIONS,
                        help='List of file extensions to ignore. Defaults to the list in config.json. Specify "none" to ignore no types.')
    parser.add_argument('--exclude-dir', nargs='*', default=[],
                        help='List of directory names to exclude or "none" for no directories.')
    parser.add_argument('--ignore-settings', action='store_true',
                        help='Flag to ignore common settings files.')
    parser.add_argument('--include-dir', nargs='?', default=None,
                        help='Specific directory to include. Only contents of this directory will be documented.')

    return parser.parse_args()


def should_ignore(item, args, output_file_path):
    """
    Determine if a given item should be ignored based on the script's arguments.

    Args:
        item (str): The path of the item (file or directory) to check.
        args (argparse.Namespace): Parsed command-line arguments.
        output_file_path (str): The path of the output file being written to.

    Returns:
        bool: True if the item should be ignored, False otherwise.
    """

    item_name = os.path.basename(item)
    file_ext = os.path.splitext(item_name)[1].lower()

    # Never document the output file itself
    if os.path.abspath(item) == os.path.abspath(output_file_path):
        return True

    # Ignore all hidden files and directories
    if item_name.startswith('.'):
        return True

    if os.path.isdir(item) and args.exclude_dir and item_name in args.exclude_dir:
        return True

    if args.include_dir and not os.path.abspath(item).startswith(os.path.abspath(args.include_dir)):
        return True

    if os.path.isfile(item) and (item_name in args.ignore_files or file_ext in args.ignore_types):
        return True

    if args.ignore_settings and file_ext in SETTINGS_EXTENSIONS:
        return True

    return False

def write_tree(dir_path, output_file, args, prefix="", is_last=True, is_root=True):
    """
    Recursively write the directory tree to the output file, including the root directory name.

    Args:
        dir_path (str): The path of the directory to document.
        output_file (file object): The file object to write to.
        args (argparse.Namespace): Parsed command-line arguments.
        prefix (str): Prefix string for line indentation and structure. Defaults to "".
        is_last (bool): Flag to indicate if the item is the last in its level. Defaults to True.
        is_root (bool): Flag to indicate if the current directory is the root. Defaults to True.
    """

    if is_root:
        output_file.write(f"{os.path.basename(dir_path)}/\n")
        is_root = False

    # Filter ignored entries first so the "last item" marker (└──) is computed
    # over the entries that are actually written, then sort for consistent order.
    items = sorted(
        item for item in os.listdir(dir_path)
        if not should_ignore(os.path.join(dir_path, item), args, args.output_file)
    )
    num_items = len(items)

    for index, item in enumerate(items):
        item_path = os.path.join(dir_path, item)

        is_last_item = (index == num_items - 1)
        new_prefix = "└── " if is_last_item else "├── "
        child_prefix = "    " if is_last_item else "│   "

        output_file.write(f"{prefix}{new_prefix}{os.path.basename(item)}\n")

        if os.path.isdir(item_path):
            next_prefix = prefix + child_prefix
            write_tree(item_path, output_file, args, next_prefix, is_last_item, is_root=False)


def write_file_content(file_path, output_file, depth):
    """
    Write the contents of a given file to the output file with proper indentation.

    Args:
        file_path (str): Path of the file to read.
        output_file (file object): The file object to write the contents to.
        depth (int): Current depth in the directory tree for indentation.
    """
    indentation = ' ' * depth
    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
            for line in file:
                # Indent each line of the file content
                output_file.write(f"{indentation}{line}")
    except Exception as e:
        output_file.write(f"{indentation}Error reading file: {e}\n")


def write_tree_docx(dir_path, doc, args, output_file_path, prefix="", is_last=True, is_root=True):
    """
    Recursively create a document structure of the directory tree in a DOCX file, including the root directory name.

    Args:
        dir_path (str): The path of the directory to document.
        doc (Document): The DOCX document object to write to.
        args (argparse.Namespace): Parsed command-line arguments.
        output_file_path (str): The path of the output DOCX file being written to.
        prefix (str): Prefix string for line indentation and structure. Defaults to "".
        is_last (bool): Flag to indicate if the item is the last in its level. Defaults to True.
        is_root (bool): Flag to indicate if the current directory is the root. Defaults to True.
    """

    if is_root:
        root_paragraph = doc.add_paragraph()
        root_paragraph.add_run(f"{os.path.basename(dir_path)}/")
        is_root = False

    # Filter ignored entries first so the "last item" marker (└──) is computed
    # over the entries that are actually written, then sort for consistent order.
    items = sorted(
        item for item in os.listdir(dir_path)
        if not should_ignore(os.path.join(dir_path, item), args, output_file_path)
    )
    num_items = len(items)

    for index, item in enumerate(items):
        item_path = os.path.join(dir_path, item)

        is_last_item = (index == num_items - 1)
        new_prefix = "└── " if is_last_item else "├── "
        child_prefix = "    " if is_last_item else "│   "

        # Add the directory or file entry
        tree_paragraph = doc.add_paragraph()
        tree_paragraph.add_run(f"{prefix}{new_prefix}{os.path.basename(item)}")

        if os.path.isdir(item_path):
            next_prefix = prefix + child_prefix
            write_tree_docx(item_path, doc, args, output_file_path, next_prefix, is_last_item, is_root=False)


def write_file_content_docx(file_path, doc):
    """
    Write the contents of a given file to a DOCX document.

    Args:
        file_path (str): Path of the file to read.
        doc (Document): The DOCX document object to write the contents to.

    This function reads the contents of 'file_path' and writes them to 'doc'.
    If an error occurs during reading, it adds an error message to 'doc'.
    """

    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
            contents = file.read()
            paragraph = doc.add_paragraph(contents)
            paragraph.alignment = WD_PARAGRAPH_ALIGNMENT.LEFT
    except Exception as e:
        error_paragraph = doc.add_paragraph(f"Error reading file: {e}")
        error_paragraph.alignment = WD_PARAGRAPH_ALIGNMENT.LEFT



def write_file_contents_in_order(dir_path, output_file, args, depth=0):
    """
    Recursively document the contents of files in the order they appear in the directory tree.

    Args:
        dir_path (str): The path of the directory to start documenting from.
        output_file (file object): The file object to write the contents to.
        args (argparse.Namespace): Parsed command-line arguments.
        depth (int): Current depth in the directory tree. Defaults to 0.
    """
    items = sorted(item for item in os.listdir(dir_path) if not should_ignore(os.path.join(dir_path, item), args, args.output_file))

    for item in items:
        item_path = os.path.join(dir_path, item)
        relative_path = os.path.relpath(item_path, start=args.repo_path)

        if os.path.isdir(item_path):
            write_file_contents_in_order(item_path, output_file, args, depth + 1)
        elif os.path.isfile(item_path):
            output_file.write(' ' * depth + f"[File Begins] {relative_path}\n")
            write_file_content(item_path, output_file, depth)
            output_file.write('\n' + ' ' * depth + f"[File Ends] {relative_path}\n\n")


def write_file_contents_in_order_docx(dir_path, doc, args, depth=0):
    """
    Recursively document the contents of files in a DOCX document in the order they appear in the directory tree.

    Args:
        dir_path (str): The path of the directory to start documenting from.
        doc (Document): The DOCX document object to write the contents to.
        args (argparse.Namespace): Parsed command-line arguments.
        depth (int): Current depth in the directory tree. Defaults to 0.
    """
    items = sorted(item for item in os.listdir(dir_path) if not should_ignore(os.path.join(dir_path, item), args, args.output_file))

    for item in items:
        item_path = os.path.join(dir_path, item)
        relative_path = os.path.relpath(item_path, start=args.repo_path)

        if os.path.isdir(item_path):
            write_file_contents_in_order_docx(item_path, doc, args, depth + 1)
        elif os.path.isfile(item_path):
            doc.add_heading(f"[File Begins] {relative_path}", level=3)
            write_file_content_docx(item_path, doc)
            doc.add_heading(f"[File Ends] {relative_path}", level=3)

def main():
    """
    Main function to execute the script logic.
    """
    args = parse_args()

    default_argparse_ignore_types = IMAGE_EXTENSIONS + VIDEO_EXTENSIONS + AUDIO_EXTENSIONS + DOCUMENT_EXTENSIONS + EXECUTABLE_EXTENSIONS

    if args.ignore_types == ['none']:
        # User explicitly wants to ignore nothing by type
        args.ignore_types = []
    elif args.ignore_types == default_argparse_ignore_types:
        # User did not provide --ignore-types, so use the argparse defaults
        # AND add the additional ones from config.json
        args.ignore_types = default_argparse_ignore_types + ADDITIONAL_IGNORE_TYPES

    # Convert 'none' keyword to empty list
    args.ignore_files = [] if args.ignore_files == ['none'] else args.ignore_files
    # args.ignore_types = [] if args.ignore_types == ['none'] else IMAGE_EXTENSIONS + VIDEO_EXTENSIONS + AUDIO_EXTENSIONS + DOCUMENT_EXTENSIONS + EXECUTABLE_EXTENSIONS + ADDITIONAL_IGNORE_TYPES
    args.exclude_dir = [] if args.exclude_dir == ['none'] else args.exclude_dir

    # Check if the provided directory path is valid
    if not os.path.isdir(args.repo_path):
        print(f"Error: The specified path does not exist or is not a directory: {args.repo_path}")
        return  # Exit the script

    if args.output_file.endswith('.docx'):
        doc = Document()
        doc.styles['Normal'].font.name = 'Arial'
        doc.styles['Normal'].font.size = Pt(11)

        doc.add_heading("Repository Documentation", level=1)
        # Adjacent string literals are concatenated, so each sentence ends with a
        # trailing space to keep the sentences from running together.
        doc.add_paragraph(
            "This document provides a comprehensive overview of the repository's structure and contents. "
            "The first section, titled 'Directory/File Tree', displays the repository's hierarchy in a tree format. "
            "In this section, directories and files are listed using tree branches to indicate their structure and relationships. "
            "Following the tree representation, the 'File Content' section details the contents of each file in the repository. "
            "Each file's content is introduced with a '[File Begins]' marker followed by the file's relative path, "
            "and the content is displayed verbatim. The end of each file's content is marked with a '[File Ends]' marker. "
            "This format ensures a clear and orderly presentation of both the structure and the detailed contents of the repository.\n\n"
        )
        doc.add_heading("Directory/File Tree Begins -->", level=2)
        write_tree_docx(args.repo_path, doc, args, args.output_file, "", is_last=True, is_root=True)
        doc.add_heading("<-- Directory/File Tree Ends", level=2)
        doc.add_heading("File Content Begins -->", level=2)
        write_file_contents_in_order_docx(args.repo_path, doc, args)
        doc.add_heading("<-- File Content Ends", level=2)
        doc.save(args.output_file)
    else:
        with open(args.output_file, 'w', encoding='utf-8') as output_file:
            output_file.write("Repository Documentation\n")
            output_file.write(
                "This document provides a comprehensive overview of the repository's structure and contents.\n"
                "The first section, titled 'Directory/File Tree', displays the repository's hierarchy in a tree format.\n"
                "In this section, directories and files are listed using tree branches to indicate their structure and relationships.\n"
                "Following the tree representation, the 'File Content' section details the contents of each file in the repository.\n"
                "Each file's content is introduced with a '[File Begins]' marker followed by the file's relative path,\n"
                "and the content is displayed verbatim. The end of each file's content is marked with a '[File Ends]' marker.\n"
                "This format ensures a clear and orderly presentation of both the structure and the detailed contents of the repository.\n\n"
            )

            output_file.write("Directory/File Tree Begins -->\n\n")
            write_tree(args.repo_path, output_file, args, "", is_last=True, is_root=True)
            output_file.write("\n<-- Directory/File Tree Ends")
            output_file.write("\n\nFile Content Begins -->\n")
            write_file_contents_in_order(args.repo_path, output_file, args)
            output_file.write("\n<-- File Content Ends\n\n")

if __name__ == "__main__":
    main()

--------------------------------------------------------------------------------
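
The README's description of `--ignore-types` (a default list from `config.json`, or `none` to ignore nothing) can be illustrated with a small standalone sketch. This is not part of `repo2txt`: the helper names `resolve_ignore_types` and `is_ignored_by_type` are hypothetical, the extension lists are shortened copies of the ones in `config.json`, and an omitted flag is modelled as `None` rather than by comparing against the argparse default as `main()` does.

```python
import os

# Assumption: abbreviated copies of the lists in config.json, for illustration only.
DEFAULT_IGNORE_TYPES = [".png", ".jpg", ".pdf", ".zip", ".exe"]
ADDITIONAL_IGNORE_TYPES = [".lock"]


def resolve_ignore_types(cli_value=None):
    """Hypothetical helper: compute the effective list of ignored extensions."""
    if cli_value is None:
        # Flag omitted: use the defaults plus the additional types.
        return DEFAULT_IGNORE_TYPES + ADDITIONAL_IGNORE_TYPES
    if cli_value == ["none"]:
        # The 'none' keyword disables type-based ignoring entirely.
        return []
    # Otherwise the user-supplied extensions replace the defaults.
    return list(cli_value)


def is_ignored_by_type(file_name, ignore_types):
    """Hypothetical helper: case-insensitive extension check, as in should_ignore()."""
    extension = os.path.splitext(file_name)[1].lower()
    return extension in ignore_types


if __name__ == "__main__":
    for name in ("logo.PNG", "main.py", "poetry.lock"):
        print(name, is_ignored_by_type(name, resolve_ignore_types()))
```

Passing an explicit list such as `[".log", ".tmp"]` replaces the defaults rather than extending them, which matches how `nargs='*'` with a `default` behaves in `parse_args()`.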