├── chimp.webp
├── remove_words_in_markdowns.py
├── README.md
├── EXAMPLE_GEN.md
└── convert.py

--------------------------------------------------------------------------------
/chimp.webp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cloneofsimo/auto_llm_codebase_analysis/HEAD/chimp.webp
--------------------------------------------------------------------------------
/remove_words_in_markdowns.py:
--------------------------------------------------------------------------------
import os
import sys


def process_markdown_files(folder_path):
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.endswith('.md'):
                file_path = os.path.join(root, file)
                with open(file_path, 'r') as f:
                    content = f.read()

                # Strip leftover chat-template tokens from the generated markdown
                content = content.replace('<|im_end|>', '')

                # Write modified content back to file
                with open(file_path, 'w') as f:
                    f.write(content)


if __name__ == "__main__":
    # The folder to clean is passed as the first command-line argument
    process_markdown_files(sys.argv[1])
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Automate Massive Codebase Analysis with Language Model

<div align="center">

  <img src="chimp.webp" alt="Chimp">

</div>

Ever had to go through an entire codebase and analyze everything one by one? Ever wanted to "read over all of it really fast" and get a "high-level picture" of each folder? This tiny codebase does exactly that, and will hopefully make your codebase-analysis time shorter.

This will recursively generate...

* High-level summary of the codebase
* Highlights of the codebase
* Pythonic Pseudocode
* Import Relationships

# Installation & Use

Install sglang and run the server:

```bash
pip install sglang
python -m sglang.launch_server --model-path "Qwen/Qwen1.5-72B-Chat" --tp 4 --port 8080
```

Then run the analysis via:

```bash
python convert.py CODEBASE_DIR OUTPUT_DIR --port 8080
```
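
For each analyzed file, `convert.py` mirrors the file's relative path under `OUTPUT_DIR` and writes two artifacts: a `.json` with the raw analysis fields and a rendered `.md` report. For example, a hypothetical input file `CODEBASE_DIR/comm/torch.py` would produce:

```
OUTPUT_DIR/
└── comm/
    ├── torch.json
    └── torch.md
```
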
# Example Output

The [following](https://github.com/cloneofsimo/reverse_eng_deepspeed_study/tree/main/decomposed) is example output from an analysis of the DeepSpeed codebase.

--------------------------------------------------------------------------------
/EXAMPLE_GEN.md:
--------------------------------------------------------------------------------
### Summary

* `is_torch_ver_eq_2_0 and is_torch_ver_ge_2_1`: These functions are used to check the version of PyTorch being used. Rating (out of 5): 1

* `torch_ver_ge_1_13`: Checks if the version of PyTorch is greater than or equal to 1.13. Rating (out of 5): 2

* `has_coalescing_manager`: This method checks if the coalescing manager is available. Rating (out of 5): 1

* `has_all_reduce_coalesced`: This method checks if the all_reduce_coalesced function is available. Rating (out of 5): 1

* `get_coalescing_manager`: This method is used to get the coalescing manager. Rating (out of 5): 2

* `all_gather_comm_off, reduce_scatter_comm_off, broadcast_comm_off, all_reduce_comm_off and reduce_comm_off`: These methods are used to turn off certain communication functions. Rating (out of 5): 1

* `backward_comm_off`: This method turns off communication for both the all_gather and reduce_scatter methods. Rating (out of 5): 2

* `Noop`: This is a class used to represent a no-operation. Rating (out of 5): 1

* `TorchBackend`: This class is a light-weight wrapper around the torch.distributed API. Rating (out of 5): 3

* Methods with names starting with `get_`, `has_`, `init_process_group`, `all_reduce`, `reduce`, `reduce_scatter`, `broadcast`, `all_gather` and `is_` -- all of these are key methods in the file. Rating (out of 5): 4

This file is mostly a wrapper for the torch.distributed API. It exposes a subset of these functions and allows for easy control over what kind of communication is being done. It also has a "Noop" class that represents a no-operation, which is useful for making certain parts of the code optional based on control flags.

The file also defines several utility functions and a main class, `TorchBackend`, that wraps around the torch.distributed API, making it easier to use and more flexible.

### Highlights

1. __Object-Oriented Programming__: The code is written as a Python script with a clear object-oriented structure in which `TorchBackend` subclasses the `Backend` class. This provides a clear hierarchy and makes the code more organized and maintainable.
2. __Torch Distributed API Wrapper__: The code is a wrapper for several functions provided by the PyTorch Distributed API. It takes care of different versions of PyTorch and attempts to offer a consistent interface. The developer created classes like `Noop` to handle communication-related operations when certain flags are off.
3. __Global Variables__: There are global variables that control the status of communication, such as `DS_COMM_ALL_GATHER_OFF`, `DS_COMM_REDUCE_SCATTER_OFF`, etc. These variables can be used to turn communication-related functions off, effectively freezing the communication; a minimal sketch of this pattern follows this list.
4. __Decorators and Globals__: Decorators and global variables are used to restrict the functionality of some aspects of the DeepSpeed library.
5. __Methods and Usage__: The script includes several methods for collecting, reducing, broadcasting, and gathering tensors. These methods are implemented and can be called directly. They interact with the PyTorch distributed library to take advantage of distributed data parallelism.
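
To make the third highlight concrete, here is a minimal illustrative sketch of the flag-plus-`Noop` pattern described above (`guarded_all_gather` is a hypothetical name, not a function from the DeepSpeed file):

```python
import torch

DS_COMM_ALL_GATHER_OFF = False  # real flag name from the file; toggled by all_gather_comm_off


class Noop:
    """Stand-in for an async work handle when communication is disabled."""

    def wait(self, callback=None):
        return None


def guarded_all_gather(tensor_list, tensor, group=None, async_op=False):
    # If the flag is set, skip the collective entirely and return a Noop
    # so that callers can still invoke .wait() on the result.
    if DS_COMM_ALL_GATHER_OFF:
        return Noop()
    return torch.distributed.all_gather(tensor_list, tensor, group=group, async_op=async_op)
```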

### Pythonic Pseudocode

```python
# Import necessary libraries
import torch
from deepspeed import utils
from .utils import *
from .backend import *
from ..runtime import compiler
import os

# Global flags to turn communication operations off
DS_COMM_ALL_GATHER_OFF = False
DS_COMM_REDUCE_SCATTER_OFF = False
DS_COMM_BROADCAST_OFF = False
DS_COMM_ALL_REDUCE_OFF = False
DS_COMM_REDUCE_OFF = False

# PyTorch version checks (== 2.0, >= 2.1, >= 1.13)
def is_torch_ver_eq_2_0()
def is_torch_ver_ge_2_1()
def torch_ver_ge_1_13()

# Check whether coalescing-manager and all_reduce_coalesced support is available
def has_coalescing_manager()
def has_all_reduce_coalesced()

# Functions to toggle communication operations on/off
def all_gather_comm_off(flag=False)
def reduce_scatter_comm_off(flag=False)
def broadcast_comm_off(flag=False)
def all_reduce_comm_off(flag=False)
def reduce_comm_off(flag=False)
def backward_comm_off(flag=False)

# This class provides a light-weight wrapper for the torch.distributed API
class TorchBackend(Backend):
    def __init__(backend, timeout, init_method, rank=-1, world_size=-1, name='torch')
    def get_all_gather_function()
    def get_reduce_scatter_function()
    def has_all_gather_into_tensor()
    def has_reduce_scatter_tensor()
    def init_process_group(backend, timeout, init_method, rank, world_size)
    def all_reduce(tensor, op=ReduceOp.SUM, group=None, async_op=False)
    def inference_all_reduce(tensor, op=ReduceOp.SUM, group=None, async_op=False)
    def all_reduce_coalesced(tensors, op=ReduceOp.SUM, group=None, async_op=False)
    def reduce(tensor, dst, op=ReduceOp.SUM, group=None, async_op=False)
    def reduce_scatter(output, input_list, op=ReduceOp.SUM, group=None, async_op=False)
    def broadcast(tensor, src, group=None, async_op=False)
    def all_gather(tensor_list, tensor, group=None, async_op=False)
    def all_gather_into_tensor(output_tensor, input_tensor, group=None, async_op=False)
    def all_gather_base(output_tensor, input_tensor, group=None, async_op=False)
    def all_gather_coalesced(output_tensors, input_tensors, group=None, async_op=False)
    def reduce_scatter_tensor(output_tensor, input_tensor, op=ReduceOp.SUM, group=None, async_op=False)
    def all_to_all_single(output, input, output_split_sizes=None, input_split_sizes=None, group=None, async_op=False)
    def all_to_all(output_tensor_list, input_tensor_list, group=None, async_op=False)
    def send(tensor, dst, group=None, tag=0)
    def recv(tensor, src=None, group=None, tag=0)
    def isend(tensor, dst, group=None, tag=0)
    def irecv(tensor, src=None, group=None, tag=0)
    def gather(tensor, gather_list=None, dst=0, group=None, async_op=False)
    def scatter(tensor, scatter_list=None, src=0, group=None, async_op=False)
    def barrier(group=torch.distributed.GroupMember.WORLD, async_op=False, device_ids=None)
    def monitored_barrier(group=torch.distributed.GroupMember.WORLD, timeout=None, wait_all_ranks=False)
    def get_rank(group=None)
    def get_world_size(group=None)
    def is_initialized()
    def get_backend(group=None)
    def new_group(ranks)
    def get_global_rank(group, group_rank)
    def get_world_group()
    def destroy_process_group(group=None)
    def _reduce_op(op)
```

### Import Relationships

Imports found:
from deepspeed import utils
from .utils import *
from .backend import *
from .comm import *
from ..runtime import compiler
import os
--------------------------------------------------------------------------------
/convert.py:
--------------------------------------------------------------------------------
import json
import os
from pathlib import Path

import click
from sglang import (
    RuntimeEndpoint,
    assistant,
    function,
    gen,
    set_default_backend,
    system,
    user,
)


def multi_highlight(file_contents):
    @function
    def _call_sglang_highlight(s, prompt):
        s += system(
            "You are a helpful assistant specialized in highlighting key features of code."
        )
        s += user(prompt)
        s += assistant("There are 5 key features I can highlight.\n")

        for i in range(5):
            s += f"\n{i + 1}."
            s += gen(f"gen_{i}", stop=["\n"], max_tokens=512)

    args = [
        {
            "prompt": f"Highlight the key features of this code:\n\n{file_content}. What would you say is the key thing to look for in this code?"
        }
        for file_content in file_contents
    ]

    rets = _call_sglang_highlight.run_batch(
        args, temperature=0, max_new_tokens=1024, num_threads=64, progress_bar=True
    )

    return [
        state.text().split("There are 5 key features I can highlight.")[-1]
        for state in rets
    ]
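

# Example usage of the function above (illustrative; `sources` is hypothetical):
#     highlights = multi_highlight(sources)
# returns one string per input file, containing the five generated
# "1. ... 5. ..." bullet lines that follow the fixed assistant prefix.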


def multi_overall_summarize(file_contents):
    @function
    def _call_sglang_overall_summarize(s, prompt):
        s += system(
            "You are a helpful assistant specialized in providing overall summaries of codebases."
        )
        s += user(prompt)
        s += "1. List all the major, important methods and functions, and rate their importance. It has to be in the form of \n* `function_or_class_or_class_method`: EXPLANATION. Importance : **[IMPORTANCE]**\nFor example,\n* `default_inference_config`: Provides a default configuration for DeepSpeed inference. Importance : **[High]**\n\n* `_LRScheduler` `DeepSpeedOptimizerCallable`, `DeepSpeedSchedulerCallable`: Class and function types for handling optimizers and learning rate schedulers with DeepSpeed. Importance : **[Medium]**\n\n* `cli_main`: Wraps `main` for a command-line interface. Importance : **[Low]**\n\n"
        s += "2. Describe what this file is all about.\n"
        s += assistant("Sure. There are multiple methods and classes.\n")
        for i in range(5):
            s += "\n"
            # Constrain each bullet to start as "* `name`:" via the regex, then
            # let the model finish the rest of the line freely.
            s += gen(f"answer_{i}", regex=r"\* `([^`]+)`:")
            s += gen(f"answer_{i}_2", stop=["\n"])

        s += gen("final", max_tokens=1024)

    args = [
        {"prompt": f"Provide an overall summary of this codebase:\n\n{file_content}"}
        for file_content in file_contents
    ]
    rets = _call_sglang_overall_summarize.run_batch(
        args, temperature=0, max_new_tokens=1024, num_threads=64, progress_bar=True
    )
    return [
        state.text().split("Sure. There are multiple methods and classes.\n")[-1]
        for state in rets
    ]


def multi_high_level_pseudocode(file_contents):
    @function
    def _call_sglang_high_level_pseudocode(s, prompt):
        s += system(
            "You are a helpful assistant specialized in generating high-level pythonic pseudocode."
        )
        s += user(prompt)
        s += "Rewrite the above high-level logic in pythonic pseudocode with comments. Be very abstract and informative."
        s += assistant(
            "Sure. Here is the pythonic pseudocode that overviews the file you described.\n"
        )
        s += "```python\n"
        s += gen("long_answer", max_tokens=2048, stop=["```"])

    args = [
        {"prompt": f"Generate high-level pseudocode for this code:\n\n{file_content}"}
        for file_content in file_contents
    ]
    rets = _call_sglang_high_level_pseudocode.run_batch(
        args, temperature=0, max_new_tokens=2048, num_threads=64, progress_bar=True
    )
    return [state.text().split("```python")[-1] for state in rets]


def single_analyze_import_relationships(file_content):
    import_lines = [
        line
        for line in file_content.split("\n")
        if line.startswith("import") or line.startswith("from")
    ]
    if import_lines:
        return "Imports found:\n" + "\n".join(import_lines)
    else:
        return "No imports found."
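

# Example for the function above: given a file_content of
#     "import os\nfrom pathlib import Path\nx = 1"
# it returns "Imports found:\nimport os\nfrom pathlib import Path",
# since only the lines that start with "import" or "from" are kept.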


def multi_analyze_import_relationships(file_contents):
    return [
        single_analyze_import_relationships(file_content)
        for file_content in file_contents
    ]


def analyze_files_and_generate_json_batch(
    file_paths, output_directory, global_codebase_path
):
    contents = []
    relative_paths = []

    # Prepare content and relative paths for batch processing
    for file_path in file_paths:
        content = file_path.read_text(encoding="utf-8")
        relative_path = str(file_path.relative_to(global_codebase_path))
        content_with_header = f"# python file {relative_path}\n\n{content}"
        contents.append(content_with_header)
        relative_paths.append(relative_path)

    # Batch process analyses
    highlights = multi_highlight(contents)
    overall_summaries = multi_overall_summarize(contents)
    pseudocodes = multi_high_level_pseudocode(contents)
    import_relationships = multi_analyze_import_relationships(contents)

    # Process each file's analysis and save to output
    for i, file_path in enumerate(file_paths):
        output_file_path = output_directory / Path(relative_paths[i]).with_suffix(
            ".json"
        )
        output_md_path = output_directory / Path(relative_paths[i]).with_suffix(".md")
        output_file_path.parent.mkdir(parents=True, exist_ok=True)

        analyses = {
            "highlights": highlights[i],
            "overall_summary": overall_summaries[i],
            "pseudocode": pseudocodes[i],
            "import_relationships": import_relationships[i],
        }

        with output_file_path.open("w", encoding="utf-8") as f:
            json.dump(analyses, f, indent=4)

        with output_md_path.open("w", encoding="utf-8") as f:
            f.write("\n\n### Summary\n\n")
            f.write(analyses["overall_summary"].strip())
            f.write("\n\n### Highlights\n\n")
            f.write(analyses["highlights"].strip())
            f.write(
                f"\n\n### Pythonic Pseudocode\n\n```python\n{analyses['pseudocode'].strip()}\n```"
            )
            f.write("\n\n\n### Import Relationships\n\n")
            f.write(analyses["import_relationships"].strip())


def analyze_directory_and_generate_overall_json_batch(
    this_directory_path, output_directory, global_codebase_path
):
    file_paths = []

    for root, _, files in os.walk(this_directory_path):
        for file in files:
            if not file.endswith(".json"):
                file_path = Path(root, file)
                file_paths.append(file_path)

    if file_paths:
        analyze_files_and_generate_json_batch(
            file_paths, output_directory, global_codebase_path
        )


@click.command()
@click.argument("codebase_dir", type=click.Path(exists=True, file_okay=False))
@click.argument("output_directory", type=click.Path(file_okay=False))
@click.option("--port", type=int, required=True)
def main(codebase_dir, output_directory, port):
    codebase_dir = Path(codebase_dir)
    output_directory = Path(output_directory)
    set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))

    analyze_directory_and_generate_overall_json_batch(
        codebase_dir, output_directory, codebase_dir
    )
    click.echo(f"Analysis complete. Output is saved in {output_directory}")


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------