├── chimp.webp
├── remove_words_in_markdowns.py
├── README.md
├── EXAMPLE_GEN.md
└── convert.py

--------------------------------------------------------------------------------
/chimp.webp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cloneofsimo/auto_llm_codebase_analysis/HEAD/chimp.webp
--------------------------------------------------------------------------------
/remove_words_in_markdowns.py:
--------------------------------------------------------------------------------
import os
import sys


def process_markdown_files(folder_path):
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.endswith('.md'):
                file_path = os.path.join(root, file)
                with open(file_path, 'r') as f:
                    content = f.read()

                # Strip leftover chat-template tokens from the generated markdown
                content = content.replace('<|im_end|>', '')

                # Write modified content back to file
                with open(file_path, 'w') as f:
                    f.write(content)


if __name__ == "__main__":
    # The folder to clean is passed as the first command-line argument
    process_markdown_files(sys.argv[1])
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Automate Massive Codebase Analysis with Language Model

<div align="center">

  <img src="chimp.webp" alt="Chimp">

</div>

Ever had to go through an entire codebase and analyze everything one by one? Ever wanted to "read over all of it really fast" and get a "high-level picture" of each folder? This tiny codebase does exactly that, and will hopefully make your codebase-analysis time shorter.

This will recursively generate...

* High-level summary of the codebase
* Highlights of the codebase
* Pythonic Pseudocode
* Import Relationships

# Installation & Use

Install sglang and run the server:

```bash
pip install sglang
python -m sglang.launch_server --model-path "Qwen/Qwen1.5-72B-Chat" --tp 4 --port 8080
```

Then run the analysis via:

```bash
python convert.py CODEBASE_DIR OUTPUT_DIR --port 8080
```
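
For each analyzed file, `convert.py` mirrors the file's relative path under `OUTPUT_DIR` and writes two artifacts: a `.json` with the raw analysis fields and a rendered `.md` report. For example, a hypothetical input file `CODEBASE_DIR/comm/torch.py` would produce:

```
OUTPUT_DIR/
└── comm/
    ├── torch.json
    └── torch.md
```
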
# Example Output

The [following](https://github.com/cloneofsimo/reverse_eng_deepspeed_study/tree/main/decomposed) is example output from an analysis of the DeepSpeed codebase.

--------------------------------------------------------------------------------
/EXAMPLE_GEN.md:
--------------------------------------------------------------------------------
### Summary

* `is_torch_ver_eq_2_0 and is_torch_ver_ge_2_1`: These functions are used to check the version of PyTorch being used. Rating (out of 5): 1

* `torch_ver_ge_1_13`: Checks if the version of PyTorch is greater than or equal to 1.13. Rating (out of 5): 2

* `has_coalescing_manager`: This method checks if the coalescing manager is available. Rating (out of 5): 1

* `has_all_reduce_coalesced`: This method checks if the all_reduce_coalesced function is available. Rating (out of 5): 1

* `get_coalescing_manager`: This method is used to get the coalescing manager. Rating (out of 5): 2

* `all_gather_comm_off, reduce_scatter_comm_off, broadcast_comm_off, all_reduce_comm_off and reduce_comm_off`: These methods are used to turn off certain communication functions. Rating (out of 5): 1

* `backward_comm_off`: This method turns off communication for both the all_gather and reduce_scatter methods. Rating (out of 5): 2

* `Noop`: This is a class used to represent a no-operation. Rating (out of 5): 1

* `TorchBackend`: This class is a light-weight wrapper around the torch.distributed API. Rating (out of 5): 3

* Methods with names starting with `get_`, `has_`, `init_process_group`, `all_reduce`, `reduce`, `reduce_scatter`, `broadcast`, `all_gather` and `is_` -- all of these are key methods in the file. Rating (out of 5): 4

This file is mostly a wrapper for the torch.distributed API. It exposes a subset of these functions and allows for easy control over what kind of communication is being done. It also has a "Noop" class that represents a no-operation, which is useful for making certain parts of the code optional based on control flags.

The file also defines several utility functions and a main class, `TorchBackend`, that wraps around the torch.distributed API, making it easier to use and more flexible.

### Highlights

1. __Object-Oriented Programming__: The code is written as a Python script with a clear object-oriented structure in which `TorchBackend` subclasses the `Backend` class. This provides a clear hierarchy and makes the code more organized and maintainable.
2. __Torch Distributed API Wrapper__: The code is a wrapper for several functions provided by the PyTorch Distributed API. It takes care of different versions of PyTorch and attempts to offer a consistent interface. The developer created classes like `Noop` to handle communication-related operations when certain flags are off.
3. __Global Variables__: There are global variables that control the status of communication, such as `DS_COMM_ALL_GATHER_OFF`, `DS_COMM_REDUCE_SCATTER_OFF`, etc. These variables can be used to turn communication-related functions off, effectively freezing the communication; a minimal sketch of this pattern follows this list.
4. __Decorators and Globals__: Decorators and global variables are used to restrict the functionality of some aspects of the DeepSpeed library.
5. __Methods and Usage__: The script includes several methods for collecting, reducing, broadcasting, and gathering tensors. These methods are implemented and can be called directly. They interact with the PyTorch distributed library to take advantage of distributed data parallelism.
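
To make the third highlight concrete, here is a minimal illustrative sketch of the flag-plus-`Noop` pattern described above (`guarded_all_gather` is a hypothetical name, not a function from the DeepSpeed file):

```python
import torch

DS_COMM_ALL_GATHER_OFF = False  # real flag name from the file; toggled by all_gather_comm_off


class Noop:
    """Stand-in for an async work handle when communication is disabled."""

    def wait(self, callback=None):
        return None


def guarded_all_gather(tensor_list, tensor, group=None, async_op=False):
    # If the flag is set, skip the collective entirely and return a Noop
    # so that callers can still invoke .wait() on the result.
    if DS_COMM_ALL_GATHER_OFF:
        return Noop()
    return torch.distributed.all_gather(tensor_list, tensor, group=group, async_op=async_op)
```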

### Pythonic Pseudocode

```python
# Import necessary libraries
import torch
from deepspeed import utils
from .utils import *
from .backend import *
from ..runtime import compiler
import os

# Global flags to turn communication operations off
DS_COMM_ALL_GATHER_OFF = False
DS_COMM_REDUCE_SCATTER_OFF = False
DS_COMM_BROADCAST_OFF = False
DS_COMM_ALL_REDUCE_OFF = False
DS_COMM_REDUCE_OFF = False

# PyTorch version checks (== 2.0, >= 2.1, >= 1.13)
def is_torch_ver_eq_2_0()
def is_torch_ver_ge_2_1()
def torch_ver_ge_1_13()

# Check whether coalescing-manager and all_reduce_coalesced support is available
def has_coalescing_manager()
def has_all_reduce_coalesced()

# Functions to toggle communication operations on/off
def all_gather_comm_off(flag=False)
def reduce_scatter_comm_off(flag=False)
def broadcast_comm_off(flag=False)
def all_reduce_comm_off(flag=False)
def reduce_comm_off(flag=False)
def backward_comm_off(flag=False)

# This class provides a light-weight wrapper for the torch.distributed API
class TorchBackend(Backend):
    def __init__(backend, timeout, init_method, rank=-1, world_size=-1, name='torch')
    def get_all_gather_function()
    def get_reduce_scatter_function()
    def has_all_gather_into_tensor()
    def has_reduce_scatter_tensor()
    def init_process_group(backend, timeout, init_method, rank, world_size)
    def all_reduce(tensor, op=ReduceOp.SUM, group=None, async_op=False)
    def inference_all_reduce(tensor, op=ReduceOp.SUM, group=None, async_op=False)
    def all_reduce_coalesced(tensors, op=ReduceOp.SUM, group=None, async_op=False)
    def reduce(tensor, dst, op=ReduceOp.SUM, group=None, async_op=False)
    def reduce_scatter(output, input_list, op=ReduceOp.SUM, group=None, async_op=False)
    def broadcast(tensor, src, group=None, async_op=False)
    def all_gather(tensor_list, tensor, group=None, async_op=False)
    def all_gather_into_tensor(output_tensor, input_tensor, group=None, async_op=False)
    def all_gather_base(output_tensor, input_tensor, group=None, async_op=False)
    def all_gather_coalesced(output_tensors, input_tensors, group=None, async_op=False)
    def reduce_scatter_tensor(output_tensor, input_tensor, op=ReduceOp.SUM, group=None, async_op=False)
    def all_to_all_single(output, input, output_split_sizes=None, input_split_sizes=None, group=None, async_op=False)
    def all_to_all(output_tensor_list, input_tensor_list, group=None, async_op=False)
    def send(tensor, dst, group=None, tag=0)
    def recv(tensor, src=None, group=None, tag=0)
    def isend(tensor, dst, group=None, tag=0)
    def irecv(tensor, src=None, group=None, tag=0)
    def gather(tensor, gather_list=None, dst=0, group=None, async_op=False)
    def scatter(tensor, scatter_list=None, src=0, group=None, async_op=False)
    def barrier(group=torch.distributed.GroupMember.WORLD, async_op=False, device_ids=None)
    def monitored_barrier(group=torch.distributed.GroupMember.WORLD, timeout=None, wait_all_ranks=False)
    def get_rank(group=None)
    def get_world_size(group=None)
    def is_initialized()
    def get_backend(group=None)
    def new_group(ranks)
    def get_global_rank(group, group_rank)
    def get_world_group()
    def destroy_process_group(group=None)
    def _reduce_op(op)
```

### Import Relationships

Imports found:
from deepspeed import utils
from .utils import *
from .backend import *
from .comm import *
from ..runtime import compiler
import os
--------------------------------------------------------------------------------
/convert.py:
--------------------------------------------------------------------------------
import json
import os
from pathlib import Path

import click
from sglang import (
    RuntimeEndpoint,
    assistant,
    function,
    gen,
    set_default_backend,
    system,
    user,
)


def multi_highlight(file_contents):
    @function
    def _call_sglang_highlight(s, prompt):
        s += system(
            "You are a helpful assistant specialized in highlighting key features of code."
        )
        s += user(prompt)
        s += assistant("There are 5 key features I can highlight.\n")

        for i in range(5):
            s += f"\n{i + 1}."
            s += gen(f"gen_{i}", stop=["\n"], max_tokens=512)

    args = [
        {
            "prompt": f"Highlight the key features of this code:\n\n{file_content}. What would you say is the key thing to look for in this code?"
        }
        for file_content in file_contents
    ]

    rets = _call_sglang_highlight.run_batch(
        args, temperature=0, max_new_tokens=1024, num_threads=64, progress_bar=True
    )

    return [
        state.text().split("There are 5 key features I can highlight.")[-1]
        for state in rets
    ]
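

# Example usage of the function above (illustrative; `sources` is hypothetical):
#     highlights = multi_highlight(sources)
# returns one string per input file, containing the five generated
# "1. ... 5. ..." bullet lines that follow the fixed assistant prefix.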


def multi_overall_summarize(file_contents):
    @function
    def _call_sglang_overall_summarize(s, prompt):
        s += system(
            "You are a helpful assistant specialized in providing overall summaries of codebases."
        )
        s += user(prompt)
        s += "1. List all the major, important methods and functions, and rate their importance. It has to be in the form of \n* `function_or_class_or_class_method`: EXPLANATION. Importance : **[IMPORTANCE]**\nFor example,\n* `default_inference_config`: Provides a default configuration for DeepSpeed inference. Importance : **[High]**\n\n* `_LRScheduler` `DeepSpeedOptimizerCallable`, `DeepSpeedSchedulerCallable`: Class and function types for handling optimizers and learning rate schedulers with DeepSpeed. Importance : **[Medium]**\n\n* `cli_main`: Wraps `main` for a command-line interface. Importance : **[Low]**\n\n"
        s += "2. Describe what this file is all about.\n"
        s += assistant("Sure. There are multiple methods and classes.\n")
        for i in range(5):
            s += "\n"
            # Constrain each bullet to start as "* `name`:" via the regex, then
            # let the model finish the rest of the line freely.
            s += gen(f"answer_{i}", regex=r"\* `([^`]+)`:")
            s += gen(f"answer_{i}_2", stop=["\n"])

        s += gen("final", max_tokens=1024)

    args = [
        {"prompt": f"Provide an overall summary of this codebase:\n\n{file_content}"}
        for file_content in file_contents
    ]
    rets = _call_sglang_overall_summarize.run_batch(
        args, temperature=0, max_new_tokens=1024, num_threads=64, progress_bar=True
    )
    return [
        state.text().split("Sure. There are multiple methods and classes.\n")[-1]
        for state in rets
    ]


def multi_high_level_pseudocode(file_contents):
    @function
    def _call_sglang_high_level_pseudocode(s, prompt):
        s += system(
            "You are a helpful assistant specialized in generating high-level pythonic pseudocode."
        )
        s += user(prompt)
        s += "Rewrite the above high-level logic in pythonic pseudocode with comments. Be very abstract and informative."
        s += assistant(
            "Sure. Here is the pythonic pseudocode that overviews the file you described.\n"
        )
        s += "```python\n"
        s += gen("long_answer", max_tokens=2048, stop=["```"])

    args = [
        {"prompt": f"Generate high-level pseudocode for this code:\n\n{file_content}"}
        for file_content in file_contents
    ]
    rets = _call_sglang_high_level_pseudocode.run_batch(
        args, temperature=0, max_new_tokens=2048, num_threads=64, progress_bar=True
    )
    return [state.text().split("```python")[-1] for state in rets]


def single_analyze_import_relationships(file_content):
    import_lines = [
        line
        for line in file_content.split("\n")
        if line.startswith("import") or line.startswith("from")
    ]
    if import_lines:
        return "Imports found:\n" + "\n".join(import_lines)
    else:
        return "No imports found."
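

# Example for the function above: given a file_content of
#     "import os\nfrom pathlib import Path\nx = 1"
# it returns "Imports found:\nimport os\nfrom pathlib import Path",
# since only the lines that start with "import" or "from" are kept.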


def multi_analyze_import_relationships(file_contents):
    return [
        single_analyze_import_relationships(file_content)
        for file_content in file_contents
    ]


def analyze_files_and_generate_json_batch(
    file_paths, output_directory, global_codebase_path
):
    contents = []
    relative_paths = []

    # Prepare content and relative paths for batch processing
    for file_path in file_paths:
        content = file_path.read_text(encoding="utf-8")
        relative_path = str(file_path.relative_to(global_codebase_path))
        content_with_header = f"# python file {relative_path}\n\n{content}"
        contents.append(content_with_header)
        relative_paths.append(relative_path)

    # Batch process analyses
    highlights = multi_highlight(contents)
    overall_summaries = multi_overall_summarize(contents)
    pseudocodes = multi_high_level_pseudocode(contents)
    import_relationships = multi_analyze_import_relationships(contents)

    # Process each file's analysis and save to output
    for i, file_path in enumerate(file_paths):
        output_file_path = output_directory / Path(relative_paths[i]).with_suffix(
            ".json"
        )
        output_md_path = output_directory / Path(relative_paths[i]).with_suffix(".md")
        output_file_path.parent.mkdir(parents=True, exist_ok=True)

        analyses = {
            "highlights": highlights[i],
            "overall_summary": overall_summaries[i],
            "pseudocode": pseudocodes[i],
            "import_relationships": import_relationships[i],
        }

        with output_file_path.open("w", encoding="utf-8") as f:
            json.dump(analyses, f, indent=4)

        with output_md_path.open("w", encoding="utf-8") as f:
            f.write("\n\n### Summary\n\n")
            f.write(analyses["overall_summary"].strip())
            f.write("\n\n### Highlights\n\n")
            f.write(analyses["highlights"].strip())
            f.write(
                f"\n\n### Pythonic Pseudocode\n\n```python\n{analyses['pseudocode'].strip()}\n```"
            )
            f.write("\n\n\n### Import Relationships\n\n")
            f.write(analyses["import_relationships"].strip())


def analyze_directory_and_generate_overall_json_batch(
    this_directory_path, output_directory, global_codebase_path
):
    file_paths = []

    for root, _, files in os.walk(this_directory_path):
        for file in files:
            if not file.endswith(".json"):
                file_path = Path(root, file)
                file_paths.append(file_path)

    if file_paths:
        analyze_files_and_generate_json_batch(
            file_paths, output_directory, global_codebase_path
        )


@click.command()
@click.argument("codebase_dir", type=click.Path(exists=True, file_okay=False))
@click.argument("output_directory", type=click.Path(file_okay=False))
@click.option("--port", type=int, required=True)
def main(codebase_dir, output_directory, port):
    codebase_dir = Path(codebase_dir)
    output_directory = Path(output_directory)
    set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))

    analyze_directory_and_generate_overall_json_batch(
        codebase_dir, output_directory, codebase_dir
    )
    click.echo(f"Analysis complete. Output is saved in {output_directory}")


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------