├── .gitignore
├── figs
│   └── eye.png
├── frontend
│   ├── public
│   │   ├── eye.ico
│   │   └── favicon.ico
│   ├── src
│   │   ├── main.js
│   │   ├── components
│   │   │   ├── LeftPannel.vue
│   │   │   ├── Header.vue
│   │   │   ├── graphs
│   │   │   │   └── graph_config.js
│   │   │   ├── left_controls
│   │   │   │   └── Config.vue
│   │   │   └── Graph.vue
│   │   ├── utils.js
│   │   └── App.vue
│   ├── jsconfig.json
│   ├── index.html
│   ├── README.md
│   ├── vite.config.js
│   ├── package.json
│   └── .gitignore
├── roofline_model.py
├── utils.py
├── backend_settings.py
├── LICENSE
├── model_params
│   └── DiT.py
├── backend_app.py
├── analyze_cli.py
├── analyze_gen_cli.py
├── configs
│   ├── chatglm3.py
│   ├── DiT.py
│   ├── gpt-j-6B.py
│   ├── opt.py
│   └── Llama.py
├── hardwares
│   └── hardware_params.py
├── README.md
├── get_model_graph.py
├── model_analyzer.py
└── examples
    ├── plot_hardware.ipynb
    └── plot_memory.ipynb

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | output
2 | tmp
3 | *.pyc
4 | .vscode

--------------------------------------------------------------------------------
/figs/eye.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hahnyuan/LLM-Viewer/HEAD/figs/eye.png

--------------------------------------------------------------------------------
/frontend/public/eye.ico:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hahnyuan/LLM-Viewer/HEAD/frontend/public/eye.ico

--------------------------------------------------------------------------------
/frontend/public/favicon.ico:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hahnyuan/LLM-Viewer/HEAD/frontend/public/favicon.ico

--------------------------------------------------------------------------------
/frontend/src/main.js:
--------------------------------------------------------------------------------
1 | import { createApp } from 'vue'
2 | import App from './App.vue'
3 |
4 | const app = createApp(App)
5 | app.mount('#app')

--------------------------------------------------------------------------------
/frontend/jsconfig.json:
--------------------------------------------------------------------------------
1 | {
2 |   "compilerOptions": {
3 |     "paths": {
4 |       "@/*": ["./src/*"]
5 |     }
6 |   },
7 |   "exclude": ["node_modules", "dist"]
8 | }
9 |

--------------------------------------------------------------------------------
/frontend/index.html:
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
6 | LLM-Viewer is a tool for visualizing Large Language Models (LLMs) and analyzing their performance on different hardware platforms. It enables network-wise analysis, considering factors such as peak memory consumption and total inference time cost. With LLM-Viewer, you can gain valuable insights into LLM inference and performance optimization.
7 | You can use LLM-Viewer in a web browser or as a command line interface (CLI) tool. The web version provides a user-friendly interface for easy configuration and visualization; you can access it at [LLM-Viewer Web](http://llm-viewer.com).
8 |
9 | We invite you to read our paper [LLM Inference Unveiled: Survey and Roofline Model Insights](https://arxiv.org/pdf/2402.16363.pdf).
10 | In this paper, we provide a comprehensive analysis of the latest advancements in efficient LLM inference using LLM-Viewer.
11 |
12 | This is an ongoing project and will be updated. TODO list:
13 | - Show the shapes of tensors.
14 | - Pre-processing and post-processing for non-transformer layers.
15 | - Show the whole network.
16 | - Expand hardware platform compatibility and allow manual configuration of hardware parameters.
17 | - Increase support for more LLMs and enable manual configuration of model graphs.
18 |
19 | ## Workflow
20 |
21 | 
22 |
23 | As shown in the figure above, the workflow consists of the following steps:
24 |
25 | 1. Input the LLM and gather essential information about each layer, including the computation count, input and output tensor shapes, and data dependencies.
26 | 2. Provide the hardware specification and generate a roofline model that takes into account the computation capacity and memory bandwidth of the hardware.
27 | 3. Configure the inference settings, such as the batch size, prompt token length, and generation token length.
28 | 4. Configure the optimization settings, such as the quantization bitwidth, utilization of FlashAttention, decoding methods, and other system optimization techniques.
29 | 5. Use the LLM-Viewer Analyzer to analyze the performance of each layer based on the roofline model and layer information. It also tracks the memory usage of each layer and calculates the peak memory consumption based on data dependencies. The overall network performance of the LLM can be obtained by aggregating the results of all layers (a minimal sketch of the per-layer roofline estimate is shown after this list).
30 | 6. Generate a report that provides information such as the maximum performance and performance bottlenecks of each layer and the network, as well as the memory footprint. The report can be used to analyze curves, such as batch size-performance and sequence length-performance curves, to understand how different settings impact performance.
31 | 7. Access the LLM-Viewer web viewer for convenient visualization of the network architecture and analysis results. This tool facilitates easy configuration adjustment and provides access to various data for each layer.
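The per-layer estimate in step 5 boils down to the standard roofline bound: a layer can run no faster than the larger of its compute time and its memory-traffic time. Below is a minimal sketch of that idea with placeholder hardware numbers (not values from `hardwares/hardware_params.py`); the repository's actual implementation lives in `roofline_model.py` and `model_analyzer.py`.

```python
def roofline_time(ops, memory_access_bytes, max_ops_per_s, bandwidth_bytes_per_s):
    """Lower bound on a layer's latency under the roofline model."""
    compute_time = ops / max_ops_per_s                         # if purely compute-bound
    memory_time = memory_access_bytes / bandwidth_bytes_per_s  # if purely memory-bound
    return max(compute_time, memory_time)


# Placeholder hardware numbers, for illustration only.
max_ops = 155e12   # 155 TOPS of FP16 compute
bandwidth = 768e9  # 768 GB/s of memory bandwidth

# A hypothetical decode-stage matmul: one token against a 4096x4096 FP16 weight
# matrix does ~33.5 MOPs but moves ~33.5 MB of weights, so it is memory-bound.
print(roofline_time(2 * 4096 * 4096, 2 * 4096 * 4096, max_ops, bandwidth))
```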
32 |
33 | ## Web Usage
34 |
35 | To use LLM-Viewer in a web browser, go to the website [LLM-Viewer Web](http://llm-viewer.com).
36 | Click a node to see the detailed analysis of that layer.
37 |
38 | ## CLI Usage
39 |
40 | Clone the LLM-Viewer repository from GitHub:
41 | ```git clone https://github.com/hahnyuan/LLM-Viewer.git ```
42 |
43 | Install the requirements:
44 | ```pip install transformers flask flask_cors easydict```
45 |
46 | To analyze an LLM using the LLM-Viewer command line interface (CLI), run one of the following commands:
47 |
48 | ```bash
49 | python3 analyze_cli.py facebook/opt-125m nvidia_A6000
50 | python3 analyze_cli.py meta-llama/Llama-2-7b-hf nvidia_A6000 --batchsize 1 --seqlen 2048
51 | python3 analyze_cli.py meta-llama/Llama-2-13b-hf nvidia_A6000 --batchsize 16 --seqlen 2048
52 | python3 analyze_cli.py meta-llama/Llama-2-13b-hf nvidia_A6000 --batchsize 1 --seqlen 8192
53 |
54 | # DiT models
55 | python3 analyze_cli.py DiT-XL/2 nvidia_A6000 --batchsize 1 --seqlen 256 --source DiT
56 | ```
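If you prefer to call the analyzer from Python instead of the CLI, the call pattern used internally by `get_model_graph.py` looks roughly like the sketch below. The result keys follow that file; the exact `ModelAnalyzer` constructor signature (e.g. an additional `source=` keyword) and the supported model, hardware, and config names should be checked against `backend_settings.py`, `hardwares/hardware_params.py`, and `configs/`.

```python
from model_analyzer import ModelAnalyzer

# Sketch only: model, hardware, and config names are examples from the CLI section.
analyzer = ModelAnalyzer(
    "meta-llama/Llama-2-7b-hf",  # model_id
    "nvidia_A6000",              # hardware
    "configs/Llama.py",          # config_path
)
result = analyzer.analyze(
    seqlen=2048,
    batchsize=1,
    w_bit=16,
    a_bit=16,
    kv_bit=16,
    use_flashattention=False,
    tp_size=1,
)
# Aggregated results are keyed the same way get_model_graph.py reads them.
print(result["total_results"]["prefill"])
print(result["total_results"]["decode"])
```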
57 |
58 | NOTE: The time estimated by the roofline model represents the theoretical best-case performance that the hardware can achieve.
59 | The purpose of creating this tool is to help readers gain a clearer understanding of the key factors that influence LLM inference.
60 | Only the relative relationships between different results should be used for reference.
61 |
62 | ## Citation
63 |
64 | If you are using LLM-Viewer in your research, please cite our paper:
65 |
66 | ```
67 | @misc{yuan2024llm,
68 | title={LLM Inference Unveiled: Survey and Roofline Model Insights},
69 | author={Zhihang Yuan and Yuzhang Shang and Yang Zhou and Zhen Dong and Chenhao Xue and Bingzhe Wu and Zhikai Li and Qingyi Gu and Yong Jae Lee and Yan Yan and Beidi Chen and Guangyu Sun and Kurt Keutzer},
70 | year={2024},
71 | eprint={2402.16363},
72 | archivePrefix={arXiv},
73 | primaryClass={cs.CL}
74 | }
75 | ```
--------------------------------------------------------------------------------
/get_model_graph.py:
--------------------------------------------------------------------------------
1 | from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
2 | import importlib
3 | import os
4 | from hardwares.hardware_params import hardware_params
5 | from model_analyzer import ModelAnalyzer
6 | from utils import str_number
7 | import numpy as np
8 | import re
9 | from backend_settings import avaliable_model_ids_sources
10 |
11 | config_cache = {}
12 |
13 |
14 | def get_analyer(model_id, hardware, config_path) -> ModelAnalyzer:
15 |     config = f"{model_id}_{hardware}_{config_path}"
16 |     if config not in config_cache:
17 |         config_cache[config] = ModelAnalyzer(
18 |             model_id,
19 |             hardware,
20 |             config_path,
21 |             source=avaliable_model_ids_sources[model_id]["source"],
22 |         )
23 |     return config_cache[config]
24 |
25 |
26 | # def get_model_config(model_id,config_path):
27 | # if model_id not in config_cache:
28 | # model_config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
29 | # config = importlib.import_module(config_path.replace("/", ".").replace(".py", ""))
30 | # config_cache[model_id] = model_config,config
31 | # return config_cache[model_id]
32 |
33 |
34 | def get_quant_bit(dtype):
35 |     if dtype == "FP16":
36 |         return 16
37 |     elif dtype == "INT8":
38 |         return 8
39 |     elif dtype == "INT4":
40 |         return 4
41 |     elif "bit" in dtype:
42 |         bitwidth = int(re.findall(r"\d+", dtype)[0])
43 |         return bitwidth
44 |     else:
45 |         raise ValueError(f"Unsupported dtype:{dtype}")
46 |
47 |
48 | def get_model_graph(model_id, hardware, config_path, inference_config):
49 |
50 |     # Roofline model
51 |     w_bit = get_quant_bit(inference_config["w_quant"])
52 |     a_bit = get_quant_bit(inference_config["a_quant"])
53 |     kv_bit = get_quant_bit(inference_config["kv_quant"])
54 |     seq_length = int(inference_config["seq_length"])
55 |     batch_size = int(inference_config["batch_size"])
56 |     use_flashattention = bool(inference_config["use_flashattention"])
57 |     gen_length = int(inference_config["gen_length"])
58 |     tp_size = int(inference_config["tp_size"])
59 |
60 |     analyzer = get_analyer(model_id, hardware, config_path)
61 |     result = analyzer.analyze(
62 |         seqlen=seq_length,
63 |         batchsize=batch_size,
64 |         w_bit=w_bit,
65 |         a_bit=a_bit,
66 |         kv_bit=kv_bit,
67 |         use_flashattention=use_flashattention,
68 |         tp_size=tp_size
69 |     )
70 |     bandwidth, max_OPS, onchip_buffer = analyzer.get_hardware_info()
71 |     GQA = analyzer.get_model_info()["GQA"]
72 |     hardware_info = {
73 |         "bandwidth": bandwidth,
74 |         "max_OPS": max_OPS,
75 |         "onchip_buffer": onchip_buffer,
76 |     }
77 |
78 |     nodes = [
79 |         {
80 |             "label": "input",
81 |             "id": "input",
82 |         }
83 |     ]
84 |     edges = []
85 |
86 |     def write_to_node(name, OPs, memory_access, info, input_names=[]):
87 |         node = {
88 |             "label": name,
89 |             "id": name,
90 |             "description": f"OPs:{str_number(OPs)}, Access:{str_number(memory_access)}",
91 |             "info": info,
92 |         }
93 |         if GQA and name in ["qk_matmul", "sv_matmul"]:
94 |             node["label"] += "(GQA)"
95 |         nodes.append(node)
96 |         for input_name in input_names:
97 |             edge = {"source": input_name, "target": name}
98 |             edges.append(edge)
99 |
100 |     if use_flashattention:
101 |         layer_graph = analyzer.config.flashattention_transformer_layer_graph
102 |     else:
103 |         layer_graph = analyzer.config.transformer_layer_graph
104 |     stage = inference_config["stage"]
105 |     total_results = result["total_results"]
106 |     if stage != "chat":
107 |         result = result[stage]
108 |     else:
109 |         result = result["prefill"]
110 |
111 |     for name, input_names in layer_graph.items():
112 |         if name in ["input", "output"]:
113 |             OPs = 0
114 |             memory_access = 0
115 |             info = {}
116 |         else:
117 |             OPs = result[name]["OPs"]
118 |             memory_access = result[name]["memory_access"]
119 |             info = result[name]
120 |         write_to_node(name, OPs, memory_access, info, input_names)
121 |     if stage == "chat":
122 |         # approximate the chat stage: accumulate decode costs for tokens seq_length+1 .. seq_length+gen_length
123 |         total_results["chat"] = total_results["prefill"]
124 |         n_divide = min(10, gen_length)  # sample at most 10 decode lengths instead of every token
125 |         for lengthi in np.linspace(seq_length + 1, seq_length + gen_length, n_divide):
126 |             gen_result = analyzer.analyze(
127 |                 seqlen=lengthi,
128 |                 batchsize=batch_size,
129 |                 w_bit=w_bit,
130 |                 a_bit=a_bit,
131 |                 kv_bit=kv_bit,
132 |                 use_flashattention=use_flashattention, tp_size=tp_size,
133 |             )
134 |             for k, v in gen_result["total_results"]["decode"].items():
135 |                 total_results["chat"][k] += v * gen_length / n_divide
136 |             for name, input_names in layer_graph.items():
137 |                 if name in gen_result["decode"]:
138 |                     result[name]["OPs"] += (
139 |                         gen_result["decode"][name]["OPs"] * gen_length / n_divide
140 |                     )
141 |                     result[name]["memory_access"] += (
142 |                         gen_result["decode"][name]["memory_access"]
143 |                         * gen_length
144 |                         / n_divide
145 |                     )
146 |         for name, input_names in layer_graph.items():
147 |             if name in ["input", "output"]:
148 |                 OPs = 0
149 |                 memory_access = 0
150 |                 info = {}
151 |             else:
152 |                 OPs = result[name]["OPs"]
153 |                 memory_access = result[name]["memory_access"]
154 |                 info = {}
155 |             write_to_node(name, OPs, memory_access, info, input_names)
156 |     return nodes, edges, total_results, hardware_info
157 |
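
# Usage sketch (illustrative values only): how a caller such as backend_app.py
# might invoke get_model_graph(). The model, hardware, and config names must
# exist in backend_settings.py, hardwares/hardware_params.py, and configs/.
if __name__ == "__main__":
    example_inference_config = {
        "w_quant": "FP16",
        "a_quant": "FP16",
        "kv_quant": "FP16",
        "seq_length": 1024,
        "batch_size": 1,
        "gen_length": 128,
        "tp_size": 1,
        "use_flashattention": False,
        "stage": "decode",
    }
    nodes, edges, total_results, hardware_info = get_model_graph(
        "meta-llama/Llama-2-7b-hf", "nvidia_A6000", "configs/Llama.py", example_inference_config
    )
    print(hardware_info)
    print(total_results["decode"])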
--------------------------------------------------------------------------------
/frontend/src/components/Header.vue:
--------------------------------------------------------------------------------
LLM-Viewer is an open-source tool to visualize LLM models and analyze their deployment on hardware devices.

At the center of the page, you can see the graph of the LLM model. Click a node to see its details.

↑ At the top of the page, you can set the LLM model, hardware devices, and server.
If you deploy LLM-Viewer on localhost, you can select the localhost server.

← On the left of the page, you can see the configuration panel. You can set the inference config and optimization config.

↙ The network-wise analysis result is shown in the left panel.

We invite you to read our paper LLM Inference Unveiled: Survey and Roofline Model Insights.
In this paper, we provide a comprehensive analysis of the latest advancements in efficient LLM inference using LLM-Viewer.
Citation bibtex:

@article{yuan2024llm,

NOTE: The time estimated by the roofline model represents the theoretical performance that the hardware can achieve.
The purpose of creating this tool is to help readers gain a clearer understanding of the key factors that influence LLM inference.
Only the relative relationships should be used for reference.