├── chimp.webp
├── remove_words_in_markdowns.py
├── README.md
├── EXAMPLE_GEN.md
└── convert.py
/chimp.webp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/cloneofsimo/auto_llm_codebase_analysis/HEAD/chimp.webp
--------------------------------------------------------------------------------
/remove_words_in_markdowns.py:
--------------------------------------------------------------------------------
1 | import os
2 |
3 |
4 | def process_markdown_files(folder_path):
5 |     """Strip leftover '<|im_end|>' end-of-turn tokens from every .md file under folder_path."""
6 |     for root, dirs, files in os.walk(folder_path):
7 |         for file in files:
8 |             if file.endswith('.md'):
9 |                 file_path = os.path.join(root, file)
10 |                 with open(file_path, 'r') as f:
11 |                     content = f.read()
12 |
13 |                 content = content.replace('<|im_end|>', '')
14 |
15 |                 # Write modified content back to file
16 |                 with open(file_path, 'w') as f:
17 |                     f.write(content)
18 |
19 |
20 | folder_path = ""  # set this to the directory containing the generated .md files
21 | process_markdown_files(folder_path)
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Automate Massive Codebase Analysis with Language Models
2 |
3 |
4 |
5 |
6 |
7 | Ever had to go through an entire codebase and analyze everything one by one? Ever wanted to "read over all of it really fast" and get a "high-level picture" per folder? This tiny codebase does exactly that, and will hopefully make your codebase-analysis time shorter.
8 |
9 | For every file, this will recursively generate (see `EXAMPLE_GEN.md` for a sample):
10 |
11 | * High-level summary of the codebase
12 | * Highlights of the codebase
13 | * Pythonic Pseudocode
14 | * Import Relationships
15 |
16 | # Installation & Use
17 |
18 | Install sglang and launch the server:
19 |
20 | ```bash
21 | pip install sglang
22 | python -m sglang.launch_server --model-path "Qwen/Qwen1.5-72B-Chat" --tp 4 --port 8080
23 | ```
24 |
25 |
26 | Run it via
27 |
28 | ```bash
29 | python convert.py CODEBASE_DIR OUTPUT_DIR --port 8080
30 | ```
31 |
32 |
33 | # Example Output:
34 |
35 | The [following](https://github.com/cloneofsimo/reverse_eng_deepspeed_study/tree/main/decomposed) is example output from an analysis of the DeepSpeed codebase.
36 |
37 |
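38 | # Output Structure
39 |
40 | For each analyzed file, `convert.py` writes a machine-readable `.json` and a human-readable `.md` (with the Summary, Highlights, Pythonic Pseudocode, and Import Relationships sections), mirroring the input tree. A run on a hypothetical codebase might produce:
41 |
42 | ```
43 | OUTPUT_DIR/
44 | ├── train.json
45 | ├── train.md
46 | └── utils/
47 |     ├── data.json
48 |     └── data.md
49 | ```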
--------------------------------------------------------------------------------
/EXAMPLE_GEN.md:
--------------------------------------------------------------------------------
1 | ### Summary
2 |
3 | * `is_torch_ver_eq_2_0` and `is_torch_ver_ge_2_1`: These functions check the version of PyTorch being used. Rating (out of 5): 1
4 |
5 | * `torch_ver_ge_1_13`: Checks if the version of PyTorch is greater than or equal to 1.13. Rating (out of 5): 2
6 |
7 | * `has_coalescing_manager`: This method checks if the coalescing manager is available. Rating (out of 5): 1
8 |
9 | * `has_all_reduce_coalesced`: This method checks if the all_reduce_coalesced function is available. Rating (out of 5): 1
10 |
11 | * `get_coalescing_manager`: This method is used to get the coalescing manager. Rating (out of 5): 2
12 |
13 | * `all_gather_comm_off`, `reduce_scatter_comm_off`, `broadcast_comm_off`, `all_reduce_comm_off` and `reduce_comm_off`: These methods are used to turn off certain communication functions. Rating (out of 5): 1
14 |
15 | * `backward_comm_off`: This method turns off communication for both the all_gather and reduce_scatter methods. Rating (out of 5): 2
16 |
17 | * `Noop`: This is a class used to represent a no-operation. Rating (out of 5): 1
18 |
19 | * `TorchBackend`: This class is a lightweight wrapper around the torch.distributed API. Rating (out of 5): 3
20 |
21 | * Methods with names starting with `get_`, `has_`, `init_process_group`, `all_reduce`, `reduce`, `reduce_scatter`, `broadcast`, `all_gather` and `is_` -- these are the key methods in the file. Rating (out of 5): 4
22 |
23 | This file is mostly a wrapper for the torch.distributed API. It exposes a subset of its functions and allows for easy control over what kind of communication is performed. It also has a "Noop" class that represents a no-operation, which is useful for making certain parts of the code optional based on the control flags.
24 |
25 | The file also defines several utility functions and a main class, `TorchBackend`, that wraps around the torch.distributed API, making it easier to use and more flexible.
26 |
27 | ### Highlights
28 |
29 | 1. __Object-Oriented Programming__: The code is written in a Python script with a clear object-oriented structure where TorchBackend subclasses the Backend class. This provides a clear hierarchy and makes the code more organized and maintainable.
30 | 2. __Torch Distributed API Wrapper__: The code is a wrapper for several functions provided by the PyTorch Distributed API. It takes care of different versions of PyTorch and attempts to offer a consistent interface. The developer created classes like 'Noop' to handle communication-related operations when certain flags are off.
31 | 3. __Global Variables__: There are global variables that control the status of communication, such as `DS_COMM_ALL_GATHER_OFF`, `DS_COMM_REDUCE_SCATTER_OFF`, etc. These variables can be used to turn communication related functions off, effectively freezing the communication.
32 | 4. __Decorators and Globals__: Decorators and global variables are used to restrict the functionality of some aspects of the DeepSpeed library.
33 | 5. __Methods and Usage__: This script includes several methods for collecting, reducing, broadcasting, and gathering tensors. These methods are implemented and can be called directly. This interacts with the PyTorch distributed library to take advantage of distributed data parallelism.
34 |
35 | ### Pythonic Pseudocode
36 |
37 | ```python
38 | # Import necessary libraries
39 | import torch
40 | from deepspeed import utils
41 | from .utils import *
42 | from .backend import *
43 | from ..runtime import compiler
44 | import os
45 |
46 | # Global flags to turn communication operations off
47 | DS_COMM_ALL_GATHER_OFF = False
48 | DS_COMM_REDUCE_SCATTER_OFF = False
49 | DS_COMM_BROADCAST_OFF = False
50 | DS_COMM_ALL_REDUCE_OFF = False
51 | DS_COMM_REDUCE_OFF = False
52 |
53 | # Checks for torch versions (== 2.0, >= 2.1, >= 1.13)
54 | def is_torch_ver_eq_2_0(): ...
55 | def is_torch_ver_ge_2_1(): ...
56 | def torch_ver_ge_1_13(): ...
57 |
58 | # Check whether the coalescing manager and all_reduce_coalesced are available
59 | def has_coalescing_manager(): ...
60 | def has_all_reduce_coalesced(): ...
61 |
62 | # Functions to toggle communication operations on/off
63 | def all_gather_comm_off(flag=False): ...
64 | def reduce_scatter_comm_off(flag=False): ...
65 | def broadcast_comm_off(flag=False): ...
66 | def all_reduce_comm_off(flag=False): ...
67 | def reduce_comm_off(flag=False): ...
68 | def backward_comm_off(flag=False): ...
69 |
70 | # This class provides a lightweight wrapper for the torch.distributed API
71 | class TorchBackend(Backend):
72 |     # Constructor
73 |     def __init__(self, backend, timeout, init_method, rank=-1, world_size=-1, name='torch'): ...
74 |     def get_all_gather_function(self): ...
75 |     def get_reduce_scatter_function(self): ...
76 |     def has_all_gather_into_tensor(self): ...
77 |     def has_reduce_scatter_tensor(self): ...
78 |     def init_process_group(self, backend, timeout, init_method, rank, world_size): ...
79 |     def all_reduce(self, tensor, op=ReduceOp.SUM, group=None, async_op=False): ...
80 |     def inference_all_reduce(self, tensor, op=ReduceOp.SUM, group=None, async_op=False): ...
81 |     def all_reduce_coalesced(self, tensors, op=ReduceOp.SUM, group=None, async_op=False): ...
82 |     def reduce(self, tensor, dst, op=ReduceOp.SUM, group=None, async_op=False): ...
83 |     def reduce_scatter(self, output, input_list, op=ReduceOp.SUM, group=None, async_op=False): ...
84 |     def broadcast(self, tensor, src, group=None, async_op=False): ...
85 |     def all_gather(self, tensor_list, tensor, group=None, async_op=False): ...
86 |     def all_gather_into_tensor(self, output_tensor, input_tensor, group=None, async_op=False): ...
87 |     def all_gather_base(self, output_tensor, input_tensor, group=None, async_op=False): ...
88 |     def all_gather_coalesced(self, output_tensors, input_tensors, group=None, async_op=False): ...
89 |     def reduce_scatter_tensor(self, output_tensor, input_tensor, op=ReduceOp.SUM, group=None, async_op=False): ...
90 |     def all_to_all_single(self, output, input, output_split_sizes=None, input_split_sizes=None, group=None, async_op=False): ...
91 |     def all_to_all(self, output_tensor_list, input_tensor_list, group=None, async_op=False): ...
92 |     def send(self, tensor, dst, group=None, tag=0): ...
93 |     def recv(self, tensor, src=None, group=None, tag=0): ...
94 |     def isend(self, tensor, dst, group=None, tag=0): ...
95 |     def irecv(self, tensor, src=None, group=None, tag=0): ...
96 |     def gather(self, tensor, gather_list=None, dst=0, group=None, async_op=False): ...
97 |     def scatter(self, tensor, scatter_list=None, src=0, group=None, async_op=False): ...
98 |     def barrier(self, group=torch.distributed.GroupMember.WORLD, async_op=False, device_ids=None): ...
99 |     def monitored_barrier(self, group=torch.distributed.GroupMember.WORLD, timeout=None, wait_all_ranks=False): ...
100 |     def get_rank(self, group=None): ...
101 |     def get_world_size(self, group=None): ...
102 |     def is_initialized(self): ...
103 |     def get_backend(self, group=None): ...
104 |     def new_group(self, ranks): ...
105 |     def get_global_rank(self, group, group_rank): ...
106 |     def get_world_group(self): ...
107 |     def destroy_process_group(self, group=None): ...
108 |     def _reduce_op(self, op): ...
109 |
110 | ```
111 |
112 |
113 | ### Import Relationships
114 |
115 | Imports found:
116 | from deepspeed import utils
117 | from .utils import *
118 | from .backend import *
119 | from .comm import *
120 | from ..runtime import compiler
121 | import os
122 |
--------------------------------------------------------------------------------
/convert.py:
--------------------------------------------------------------------------------
1 | import json
2 | import os
3 | from pathlib import Path
4 |
5 | import click
6 | from sglang import (
7 |     RuntimeEndpoint,
8 |     assistant,
9 |     function,
10 |     gen,
11 |     set_default_backend,
12 |     system,
13 |     user,
14 | )
15 |
16 |
17 | def multi_highlight(file_contents):
18 |     # sglang program: ask the model for exactly five numbered highlights per file.
19 |     @function
20 |     def _call_sglang_highlight(s, prompt):
21 |         s += system(
22 |             "You are a helpful assistant specialized in highlighting key features of code."
23 |         )
24 |         s += user(prompt)
25 |         s += assistant("There are 5 key features I can highlight.\n")
26 |
27 |         for i in range(5):
28 |             s += f"\n{i + 1}."
29 |             s += gen(f"gen_{i}", stop=["\n"], max_tokens=512)
30 |
31 |     args = [
32 |         {
33 |             "prompt": f"Highlight the key features of this code:\n\n{file_content}. What would you say is the key thing to look for in this code?"
34 |         }
35 |         for file_content in file_contents
36 |     ]
37 |
38 |     rets = _call_sglang_highlight.run_batch(
39 |         args, temperature=0, max_new_tokens=1024, num_threads=64, progress_bar=True
40 |     )
41 |
42 |     # Keep only the generated text that follows the prompt scaffolding.
43 |     return [
44 |         state.text().split("There are 5 key features I can highlight.")[-1]
45 |         for state in rets
46 |     ]
47 |
48 |
49 | def multi_overall_summarize(file_contents):
50 |     # sglang program: a regex-constrained loop forces "* `name`: ..." bullets,
51 |     # followed by a free-form description of the file.
52 |     @function
53 |     def _call_sglang_overall_summarize(s, prompt):
54 |         s += system(
55 |             "You are a helpful assistant specialized in providing overall summaries of codebases."
56 |         )
57 |         s += user(prompt)
58 |         s += "1. List all the major, important methods and functions, and rate their importance. It has to be in the form of \n* `function_or_class_or_class_method`: EXPLANATION. Importance : **[IMPORTANCE]**\nFor example,\n* `default_inference_config`: Provides a default configuration for DeepSpeed inference. Importance : **[High]**\n\n* `_LRScheduler`, `DeepSpeedOptimizerCallable`, `DeepSpeedSchedulerCallable`: Class and function types for handling optimizers and learning rate schedulers with DeepSpeed. Importance : **[Medium]**\n\n* `cli_main`: Wraps `main` for a command-line interface. Importance : **[Low]**\n\n"
59 |         s += "2. Describe what this file is all about.\n"
60 |         s += assistant("Sure. There are multiple methods and classes.\n")
61 |         for i in range(5):
62 |             s += "\n"
63 |             s += gen(f"answer_{i}", regex=r"\* `([^`]+)`:")
64 |             s += gen(f"answer_{i}_2", stop=["\n"])
65 |
66 |         s += gen("final", max_tokens=1024)
67 |
68 |     args = [
69 |         {"prompt": f"Provide an overall summary of this codebase:\n\n{file_content}"}
70 |         for file_content in file_contents
71 |     ]
72 |     rets = _call_sglang_overall_summarize.run_batch(
73 |         args, temperature=0, max_new_tokens=1024, num_threads=64, progress_bar=True
74 |     )
75 |     return [
76 |         state.text().split("Sure. There are multiple methods and classes.\n")[-1]
77 |         for state in rets
78 |     ]
79 |
80 |
81 | def multi_high_level_pseudocode(file_contents):
82 |     # sglang program: open a ```python fence and let the model fill it in
83 |     # until the closing fence.
84 |     @function
85 |     def _call_sglang_high_level_pseudocode(s, prompt):
86 |         s += system(
87 |             "You are a helpful assistant specialized in generating high-level pythonic pseudocode."
88 |         )
89 |         s += user(prompt)
90 |         s += "Rewrite the above high-level logic in pythonic pseudocode with comments. Be very abstract and informative."
91 |         s += assistant(
92 |             "Sure. Here is the pythonic pseudocode that overviews the file you described.\n"
93 |         )
94 |         s += "```python\n"
95 |         s += gen("long_answer", max_tokens=2048, stop=["```"])
96 |
97 |     args = [
98 |         {"prompt": f"Generate high-level pseudocode for this code:\n\n{file_content}"}
99 |         for file_content in file_contents
100 |     ]
101 |     rets = _call_sglang_high_level_pseudocode.run_batch(
102 |         args, temperature=0, max_new_tokens=2048, num_threads=64, progress_bar=True
103 |     )
104 |     return [state.text().split("```python")[-1] for state in rets]
105 |
106 |
107 | def single_analyze_import_relationships(file_content):
108 |     # Plain string scan; no LLM call is needed to list the imports.
109 |     import_lines = [
110 |         line
111 |         for line in file_content.split("\n")
112 |         if line.startswith("import") or line.startswith("from")
113 |     ]
114 |     if import_lines:
115 |         return "Imports found:\n" + "\n".join(import_lines)
116 |     else:
117 |         return "No imports found."
118 |
119 |
120 | def multi_analyze_import_relationships(file_contents):
121 |     return [
122 |         single_analyze_import_relationships(file_content)
123 |         for file_content in file_contents
124 |     ]
125 |
126 |
127 | def analyze_files_and_generate_json_batch(
128 |     file_paths, output_directory, global_codebase_path
129 | ):
130 |     contents = []
131 |     relative_paths = []
132 |
133 |     # Prepare content and relative paths for batch processing
134 |     for file_path in file_paths:
135 |         content = file_path.read_text(encoding="utf-8", errors="ignore")  # tolerate non-UTF-8 files
136 |         relative_path = str(file_path.relative_to(global_codebase_path))
137 |         content_with_header = f"# python file {relative_path}\n\n{content}"
138 |         contents.append(content_with_header)
139 |         relative_paths.append(relative_path)
140 |
141 |     # Batch process analyses
142 |     highlights = multi_highlight(contents)
143 |     overall_summaries = multi_overall_summarize(contents)
144 |     pseudocodes = multi_high_level_pseudocode(contents)
145 |     import_relationships = multi_analyze_import_relationships(contents)
146 |
147 |     # Save each file's analysis as both .json and .md, mirroring the input tree
148 |     for i, file_path in enumerate(file_paths):
149 |         output_file_path = output_directory / Path(relative_paths[i]).with_suffix(
150 |             ".json"
151 |         )
152 |         output_md_path = output_directory / Path(relative_paths[i]).with_suffix(".md")
153 |         output_file_path.parent.mkdir(parents=True, exist_ok=True)
154 |
155 |         analyses = {
156 |             "highlights": highlights[i],
157 |             "overall_summary": overall_summaries[i],
158 |             "pseudocode": pseudocodes[i],
159 |             "import_relationships": import_relationships[i],
160 |         }
161 |
162 |         with output_file_path.open("w", encoding="utf-8") as f:
163 |             json.dump(analyses, f, indent=4)
164 |
165 |         with output_md_path.open("w", encoding="utf-8") as f:
166 |             f.write("\n\n### Summary\n\n")
167 |             f.write(analyses["overall_summary"].strip())
168 |             f.write("\n\n### Highlights\n\n")
169 |             f.write(analyses["highlights"].strip())
170 |             f.write(
171 |                 f"\n\n### Pythonic Pseudocode\n\n```python\n{analyses['pseudocode'].strip()}\n```"
172 |             )
173 |             f.write("\n\n\n### Import Relationships\n\n")
174 |             f.write(analyses["import_relationships"].strip())
175 |
176 |
177 | def analyze_directory_and_generate_overall_json_batch(
178 |     this_directory_path, output_directory, global_codebase_path
179 | ):
180 |     # Collect every non-.json file in the tree and analyze the whole batch at once
181 |     file_paths = []
182 |
183 |     for root, _, files in os.walk(this_directory_path):
184 |         for file in files:
185 |             if not file.endswith(".json"):
186 |                 file_path = Path(root, file)
187 |                 file_paths.append(file_path)
188 |
189 |     if file_paths:
190 |         analyze_files_and_generate_json_batch(
191 |             file_paths, output_directory, global_codebase_path
192 |         )
193 |
194 |
195 | @click.command()
196 | @click.argument("codebase_dir", type=click.Path(exists=True, file_okay=False))
197 | @click.argument("output_directory", type=click.Path(file_okay=False))
198 | @click.option("--port", type=int, default=8080, help="Port of the running sglang server.")
199 | def main(codebase_dir, output_directory, port):
200 |     codebase_dir = Path(codebase_dir)
201 |     output_directory = Path(output_directory)
202 |     set_default_backend(RuntimeEndpoint(f"http://localhost:{port}"))
203 |
204 |     analyze_directory_and_generate_overall_json_batch(
205 |         codebase_dir, output_directory, codebase_dir
206 |     )
207 |     click.echo(f"Analysis complete. Output is saved in {output_directory}")
208 |
209 |
210 | if __name__ == "__main__":
211 |     main()
212 |
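213 | # Example invocation (assumes an sglang server is already running on the given port):
214 | #   python convert.py ./my_codebase ./analysis_out --port 8080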
--------------------------------------------------------------------------------